Cluster Validation


Page 1: Cluster Validation

University at Buffalo The State University of New York

Cluster Validation

Cluster validation: assess the quality and reliability of clustering results.

Why validation?
- To avoid finding clusters formed by chance
- To compare clustering algorithms
- To choose clustering parameters, e.g., the number of clusters in the K-means algorithm

Page 2: Cluster Validation

Clusters found in Random Data

[Figure: four scatter plots of random points in the unit square (axes x and y): the raw Random Points and the clusterings found by K-means, DBSCAN, and Complete Link.]

Page 3: Cluster Validation

Aspects of Cluster Validation

- Comparing the clustering results to ground truth (externally known results): external index.
- Evaluating the quality of clusters without reference to external information, using only the data: internal index.
- Determining the reliability of clusters, i.e., to what confidence level the clusters are not formed by chance: statistical framework.

Page 4: Cluster Validation

Comparing to Ground Truth

Notation:
- N: the number of objects in the data set.
- P = {P1, …, Pm}: the set of "ground truth" clusters.
- C = {C1, …, Cn}: the set of clusters reported by a clustering algorithm.

The "incidence matrix" is N × N (both rows and columns correspond to objects):
- Pij = 1 if Oi and Oj belong to the same "ground truth" cluster in P; Pij = 0 otherwise.
- Cij = 1 if Oi and Oj belong to the same cluster in C; Cij = 0 otherwise.
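As a concrete illustration, both incidence matrices can be built directly from label vectors; the helper name and the four-object labels below are made-up examples, not from the slides.

```python
import numpy as np

def incidence_matrix(labels):
    """N x N binary matrix whose (i, j) entry is 1 iff objects
    Oi and Oj carry the same cluster label."""
    a = np.asarray(labels)
    return (a[:, None] == a[None, :]).astype(int)

P = incidence_matrix([0, 0, 1, 1])   # "ground truth" labels (made-up)
C = incidence_matrix([0, 1, 1, 1])   # labels reported by some algorithm
```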

Page 5: Cluster Validation

External Index

A pair of data objects (Oi, Oj) falls into one of the following categories:
- SS: Cij = 1 and Pij = 1 (agree)
- DD: Cij = 0 and Pij = 0 (agree)
- SD: Cij = 1 and Pij = 0 (disagree)
- DS: Cij = 0 and Pij = 1 (disagree)

Rand index:

$$\text{Rand} = \frac{\text{Agree}}{\text{Agree} + \text{Disagree}} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}$$

The Rand index may be dominated by DD.

Jaccard coefficient:

$$\text{Jaccard} = \frac{|SS|}{|SS| + |SD| + |DS|}$$
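A minimal sketch of the two indices, counting the pair categories directly from label vectors (the function name and example labels are illustrative choices):

```python
from itertools import combinations

def rand_and_jaccard(truth, pred):
    """Rand index and Jaccard coefficient over all n(n-1)/2 object pairs."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_p = truth[i] == truth[j]    # same "ground truth" cluster?
        same_c = pred[i] == pred[j]      # same reported cluster?
        if same_c and same_p:
            ss += 1                      # SS: agree
        elif same_c:
            sd += 1                      # SD: disagree
        elif same_p:
            ds += 1                      # DS: disagree
        else:
            dd += 1                      # DD: agree
    rand = (ss + dd) / (ss + sd + ds + dd)
    jaccard = ss / (ss + sd + ds) if (ss + sd + ds) else 0.0
    return rand, jaccard

rand, jac = rand_and_jaccard([0, 0, 1, 1], [0, 0, 0, 1])
```

For these labels the six pairs split into one SS, two SD, one DS, and two DD, so rand = 3/6 = 0.5 and jac = 1/4 = 0.25, illustrating how the DD pairs inflate the Rand index relative to Jaccard.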

Page 6: Cluster Validation

Internal Index

"Ground truth" may be unavailable, so use only the data to measure cluster quality, i.e., the "homogeneity" and "separation" of clusters:
- SSE: sum of squared errors.
- The correlation between the clustering results and the distance matrix.

Page 7: Cluster Validation

Sum of Squared Error

Homogeneity is measured by the within-cluster sum of squares:

$$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

This is exactly the objective function of K-means.

Separation is measured by the between-cluster sum of squares:

$$BSS = \sum_i |C_i| \, (m - m_i)^2$$

where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the centroid of the whole data set.

BSS + WSS = constant. A larger number of clusters tends to result in a smaller WSS.

Page 8: Cluster Validation

Sum of Squared Error

Example: four points 1, 2, 4, 5 on a line, with overall centroid m = 3. For K = 2 the clusters are {1, 2} and {4, 5}, with centroids m1 = 1.5 and m2 = 4.5.

K = 1: WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10; BSS = 4 × (3−3)² = 0; Total = 10.

K = 2: WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1; BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9; Total = 10.

K = 4: WSS = (1−1)² + (2−2)² + (4−4)² + (5−5)² = 0; BSS = 1 × (3−1)² + 1 × (3−2)² + 1 × (3−4)² + 1 × (3−5)² = 10; Total = 10.
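The worked example can be checked numerically; the helper below follows the WSS/BSS definitions above for one-dimensional data (the function name and loop are our sketch, not the slides' code):

```python
import numpy as np

def wss_bss(points, labels):
    """Within- and between-cluster sums of squares."""
    x = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    m = x.mean()                        # centroid of the whole data set
    wss = bss = 0.0
    for k in np.unique(labels):
        ck = x[labels == k]
        mk = ck.mean()                  # centroid of cluster k
        wss += np.sum((ck - mk) ** 2)   # homogeneity term
        bss += len(ck) * (m - mk) ** 2  # separation term
    return wss, bss

pts = [1, 2, 4, 5]
totals = [sum(wss_bss(pts, lab))
          for lab in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3])]  # K = 1, 2, 4
```

All three totals come out to 10, confirming that BSS + WSS stays constant while WSS alone shrinks as K grows.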

Page 9: Cluster Validation

Sum of Squared Error

SSE can also be used to estimate the number of clusters.

[Figure: the SSE of K-means solutions plotted against the number of clusters K (K from 2 to 30), alongside the underlying data set; a knee in the curve indicates a good choice of K.]

Page 10: Cluster Validation

Internal Measures: SSE

SSE curve for a more complicated data set.

[Figure: a data set with sub-clusters labeled 1–7, and the SSE of clusters found using K-means.]

Page 11: Cluster Validation

Correlation with Distance Matrix

- Distance matrix: Dij is the distance between objects Oi and Oj.
- Incidence matrix: Cij = 1 if Oi and Oj belong to the same cluster, Cij = 0 otherwise.

Compute the correlation between the two matrices; by symmetry, only n(n−1)/2 entries need to be calculated. A high correlation magnitude (strongly negative, since within-cluster pairs have small distances) indicates good clustering.

Page 12: Cluster Validation

Correlation with Distance Matrix

Given the distance matrix D = {d11, d12, …, dnn} and the incidence matrix C = {c11, c12, …, cnn}, the correlation r between D and C is given by

$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2}\;\sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$

i.e., the Pearson correlation of the matrix entries.
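A sketch of the statistic for a labeled point set, using only the n(n−1)/2 distinct pairs; Euclidean distance and the example data are our choices:

```python
import numpy as np

def cluster_correlation(X, labels):
    """Pearson correlation between the pairwise-distance matrix and
    the cluster-incidence matrix, over distinct pairs (i < j)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(X), k=1)                 # distinct pairs only
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    C = (labels[:, None] == labels[None, :]).astype(float)
    return float(np.corrcoef(D[iu], C[iu])[0, 1])

# Two tight, well-separated clusters give a strongly negative correlation.
r = cluster_correlation([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], [0, 0, 1, 1])
```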

Page 13: Cluster Validation

Measuring Cluster Validity via Correlation

Correlation of the incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots in the unit square (axes x and y), with Corr = −0.9235 for the first data set and Corr = −0.5810 for the second.]

Page 14: Cluster Validation

Clusters found in Random Data

[Figure: four scatter plots of random points in the unit square (axes x and y): the raw Random Points and the clusterings found by K-means, DBSCAN, and Complete Link.]

Page 15: Cluster Validation

Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a clustered data set in the unit square and its 100 × 100 point-to-point similarity matrix ordered by cluster label (similarity scale 0 to 1).]
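The reordering step can be sketched as follows; the distance-to-similarity mapping 1 − d/d_max is one simple choice for illustration, not prescribed by the slides:

```python
import numpy as np

def sorted_similarity(X, labels):
    """Similarity matrix with rows/columns grouped by cluster label;
    good clusterings show bright diagonal blocks under plt.imshow."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(labels)                    # group points by cluster
    Xs = X[order]
    D = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=-1)
    return 1.0 - D / D.max()                      # map distances to [0, 1]

S = sorted_similarity([[0, 0], [5, 5], [0.1, 0], [5.1, 5]], [0, 1, 0, 1])
```

After sorting, the first two rows/columns belong to one cluster and the last two to the other, so within-cluster entries such as S[0, 1] are much larger than cross-cluster entries such as S[0, 2].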

Page 16: Cluster Validation

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: the K-means clustering of random points in the unit square and the corresponding 100 × 100 sorted similarity matrix (similarity scale 0 to 1).]

Page 17: Cluster Validation

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: the Complete Link clustering of random points in the unit square and the corresponding 100 × 100 sorted similarity matrix (similarity scale 0 to 1).]

Page 18: Cluster Validation

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: the DBSCAN clustering of random points in the unit square and the corresponding 100 × 100 sorted similarity matrix (similarity scale 0 to 1).]

Page 19: Cluster Validation

Reliability of Clusters

We need a framework to interpret any measure: for example, if our evaluation measure has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.

Page 20: Cluster Validation

Statistical Framework for SSE

Example: compare an SSE of 0.005 against three clusters in random data.

[Figure: left, a histogram of the SSE of three-cluster clusterings over 500 sets of 100 random points with x and y values distributed over the range 0.2–0.8 (observed SSEs roughly between 0.016 and 0.034); right, the clustered data set with SSE = 0.005.]

Since 0.005 lies far below every SSE observed on random data, the clustering is very unlikely to have arisen by chance.
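The null distribution can be approximated by Monte Carlo. The bare-bones K-means below is an illustrative sketch with fewer replicates than the slides' 500 and our own parameter choices, not the slides' exact experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k, iters=30):
    """Run a tiny K-means and return the final SSE."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return float((d.min(axis=1) ** 2).sum())

# SSE of k = 3 clusterings of 100 random points in [0.2, 0.8] x [0.2, 0.8]
null_sse = [kmeans_sse(rng.uniform(0.2, 0.8, size=(100, 2)), k=3)
            for _ in range(50)]
p_value = np.mean([s <= 0.005 for s in null_sse])  # fraction at least as extreme
```

Here every random-data SSE is far above 0.005, so the empirical p-value is 0 and the observed clustering is judged significant.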

Page 21: Cluster Validation

Statistical Framework for Correlation

Correlation of the incidence and distance matrices for the K-means clusterings of the following two data sets, compared against a correlation histogram of random data:

[Figure: two scatter plots in the unit square with Corr = −0.9235 and Corr = −0.5810, together with the correlation histogram of random data.]

Page 22: Cluster Validation

Hyper-geometric Distribution

Given that M of the N genes in the data set are associated with term T, if we randomly draw n genes from the data set, what is the probability that m of the selected n genes will be associated with T?

$$\Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}$$

Page 23: Cluster Validation

P-Value

Based on the hyper-geometric distribution, the probability of having m or more of the drawn genes associated with T can be calculated by summing the probabilities of m, m + 1, …, min(M, n) such genes. So the p-value of over-representation is:

$$p = \sum_{i=m}^{\min(M,\,n)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
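The tail sum can be computed exactly with integer binomials; `hypergeom_pvalue` is our name for this sketch, and the numbers in the example call are made up (scipy users would reach for `scipy.stats.hypergeom.sf`):

```python
from math import comb

def hypergeom_pvalue(m, N, M, n):
    """P(at least m of the n drawn genes carry term T), where M of
    the N genes in the data set carry T (hyper-geometric tail sum)."""
    total = comb(N, n)
    return sum(comb(M, i) * comb(N - M, n - i)
               for i in range(m, min(M, n) + 1)) / total

p = hypergeom_pvalue(5, N=50, M=10, n=10)   # made-up numbers
```

Summing from i = 0 recovers the whole distribution (total probability 1), which is a quick sanity check on the formula.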