COMP24111 Machine Learning Cluster Validation Ke Chen Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011]


Page 1: Machine Learning - Clustering Analysis Basics (syllabus.cs.manchester.ac.uk/ugt/.../slides/Cluster...)

COMP24111 Machine Learning

Cluster Validation

Ke Chen

Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011]

Page 2

Outline
• Motivation and Background
• Internal index
  – Motivation and general ideas
  – Variance-based internal indexes
  – Application: finding the "proper" cluster number
• External index
  – Motivation and general ideas
  – Rand Index
• Application: Weighted clustering ensemble
• Summary

Page 3

Motivation and Background
• Motivation
  Supervised classification:
  – Class labels known as ground truth
  – Accuracy based on the given labels (e.g., classifying Oranges vs. Apples: Accuracy = 8/10 = 80%)
  Clustering analysis:
  – No class labels
  – Evaluation still demanded
• Validation needs to
  – Compare clustering algorithms
  – Determine the number of clusters
  – Avoid finding patterns in noise
  – Find the "best" clusters from the data

Page 4

Motivation and Background
• Illustrative Example: which one is the "best"?

[Figure: four scatter plots on the unit square (x and y axes from 0 to 1): the data set (random points), two K-means (K=3) partitions, and an agglomerative (complete-link) partition.]

Page 5

Motivation and Background
• Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion.
  – How to be "quantitative": employ the measures (validity indexes).
  – How to be "objective": validate the measures!
[Diagram: INPUT DataSet(X) → Clustering Algorithm(s), run with different settings/configurations → Partitions P → Validity Index → m*]

Page 6

Motivation and Background
• Internal Criteria (Indexes)
  – Validate without external information
  – Evaluate partitions with different numbers of clusters
  – Determine the number of clusters
• External Criteria (Indexes)
  – Validate against the "ground truth"
  – Compare two partitions (how similar are they?)

Page 7

Internal Index
• Ground truth is unavailable, but unsupervised validation must be done with "common sense" or "a priori knowledge".
• There is a variety of internal indexes:
  – Variance-based methods
  – Rate-distortion methods
  – Davies-Bouldin index (DBI)
  – Bayesian Information Criterion (BIC)
  – Silhouette Coefficient
  – Minimum description length (MDL) principle
  – Stochastic complexity (SC)
  – Modified Huber's Γ (MHΓ) index

Page 8

Internal Index
• Variance-based methods
  – Minimise the within-cluster variance (SSW): intra-cluster variance is minimised
  – Maximise the between-cluster variance (SSB): inter-cluster variance is maximised
[Figure: clusters with their centroids and the global centroid c]

Page 9

Internal Index
• Variance-based methods (cont.)
  Assume an algorithm leads to a partition of K clusters, where cluster i has n_i data points and c_i is its centroid; d(·,·) is the distance used in this algorithm.
  – Within-cluster variance (SSW):

      SSW(K) = Σ_{i=1}^{K} Σ_{j=1}^{n_i} d²(x_ij, c_i)

  – Between-cluster variance (SSB):

      SSB(K) = Σ_{i=1}^{K} n_i · d²(c_i, c)

    where c is the global mean (centroid) of the whole data set and x_ij is the j-th data point in cluster i.
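In code, SSW and SSB follow directly from these definitions. A minimal NumPy sketch (the toy data, labels and function names are illustrative, not from the slides; d² is taken to be the squared Euclidean distance):

```python
import numpy as np

def ssw(X, labels, centroids):
    # Within-cluster variance: sum of squared distances of each
    # point to its own cluster centroid.
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def ssb(X, labels, centroids):
    # Between-cluster variance: cluster size times squared distance
    # of each centroid to the global centroid.
    c_global = X.mean(axis=0)
    return sum(np.sum(labels == i) * np.sum((c - c_global) ** 2)
               for i, c in enumerate(centroids))

# Toy partition: two well-separated clusters on a line.
X = np.array([[0.0], [0.2], [10.0], [10.2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])

print(ssw(X, labels, centroids))   # small: points sit near their centroids
print(ssb(X, labels, centroids))   # large: centroids far from the global mean
```

For a good partition, SSW is small and SSB is large, which is exactly what the F-ratio on the next slide exploits.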

Page 10

Internal Index
• Variance-based F-ratio index
  – Measures the within-cluster variance relative to the between-cluster variance (related to the classical F-test).
  – F-ratio index (W-B index) for a partition of K clusters:

      F(K) = K · SSW(K) / SSB(K)
           = K · [ Σ_{i=1}^{K} Σ_{j=1}^{n_i} d²(x_ij, c_i) ] / [ Σ_{i=1}^{K} n_i · d²(c_i, c) ]

    where x_ij is the j-th data point in cluster i and n_i is the number of data points in cluster i; a smaller F(K) indicates a better partition.
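The F-ratio can then be used to compare candidate partitions of the same data, as the following slides do when selecting the cluster number. A small self-contained sketch (the toy data and both partitions are made up for illustration; squared Euclidean distance assumed):

```python
import numpy as np

def f_ratio(X, labels):
    # W-B (F-ratio) index: K * SSW(K) / SSB(K); smaller is better.
    K = len(np.unique(labels))
    c_global = X.mean(axis=0)
    ssw = ssb = 0.0
    for i in range(K):
        pts = X[labels == i]
        c_i = pts.mean(axis=0)
        ssw += np.sum((pts - c_i) ** 2)       # within-cluster variance
        ssb += len(pts) * np.sum((c_i - c_global) ** 2)  # between-cluster
    return K * ssw / ssb

# Two tight groups on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
good = np.array([0, 0, 0, 1, 1, 1])   # matches the true structure
bad  = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

print(f_ratio(X, good) < f_ratio(X, bad))  # True: the good split scores lower
```

In practice one would run a clustering algorithm for each candidate K and pick the K at which F(K) attains its minimum, as in the plots that follow.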

Page 11

[Figure: F-ratio (×10^5) plotted against the number of clusters (25 down to 5) for two algorithms, IS and PNN; each curve's minimum marks the selected cluster number.]

Internal Index
• Application: finding the "proper" cluster number on data set S1 (Algorithm 1 vs. Algorithm 2)

Page 12

[Figure: F-ratio plotted against the number of clusters (25 down to 5) on data set S4 for algorithms IS and PNN; IS attains its minimum at 15 clusters and PNN at 16.]

Internal Index
• Application: finding the "proper" cluster number on data set S4 (Algorithm 1 vs. Algorithm 2)

Page 13

External Index
• "Ground truth" is available, but a clustering algorithm doesn't use such information during unsupervised learning.
• There is a variety of external indexes:
  – Rand Index
  – Adjusted Rand Index
  – Pair-counting indexes
  – Information-theoretic indexes
  – Set-matching indexes
  – DVI index
  – Normalised mutual information (NMI) index

Page 14

External Index
• Main issues
  – If the "ground truth" is known, the validity of a clustering can be verified by comparing the class and cluster labels.
  – However, this is much more complicated than in supervised classification (where labels are used in training):
    • The cluster IDs in a partition resulting from clustering have been assigned arbitrarily due to unsupervised learning – permutation.
    • The number of clusters may differ from the number of classes (the "ground truth") – inconsistency.
  – The most important problem for external indexes is how to find all possible correspondences between the "ground truth" and a partition (or two candidate partitions in the case of comparison).

Page 15

External Index
• Rand Index
  – The first external index, proposed by Rand (1971) to address the "correspondence" problem.
  – Basic idea: consider all pairs of points in the data set, counting both agreement and disagreement against the "ground truth".
  – The index is defined as RI(X, Y) = (a + d)/(a + b + c + d), with a, b, c and d counted as follows:

                                        Y: pairs in same class   Y: pairs in different classes
      X: pairs in the same cluster               a                            b
      X: pairs in different clusters             c                            d

Page 16

External Index

• Rand Index (cont.)
  – Example: for a 5-point data set, we have
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q
  – To calculate a, b, c and d, we list all possible pairs (excluding each data point paired with itself):
      [1, 2], [1, 3], [1, 4], [1, 5]
      [2, 3], [2, 4], [2, 5]
      [3, 4], [3, 5]
      [4, 5]

Page 17

External Index
• Rand Index (cont.)
  – Example: for a 5-point data set, we have
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q
    Initialisation: a, b, c, d ← 0
    For the data point pair [1, 2]:
      X: [1, 2] assigned to (i, ii) –> in different clusters
      Y: [1, 2] labelled (q, p) –> in different classes
    Thus, d ← d + 1 = 1

Page 18

External Index
• Rand Index (cont.)
  – Example (cont.): current status: a = 0, b = 0, c = 0, d = 1
    For the data point pair [1, 3]:
      X: [1, 3] assigned to (i, ii) –> in different clusters
      Y: [1, 3] labelled (q, q) –> in the same class
    Thus, c ← c + 1 = 1
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q

Page 19

External Index
• Rand Index (cont.)
  – Example (cont.): current status: a = 0, b = 0, c = 1, d = 1
    For the data point pair [1, 4]:
      X: [1, 4] assigned to (i, i) –> in the same cluster
      Y: [1, 4] labelled (q, p) –> in different classes
    Thus, b ← b + 1 = 1
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q

Page 20

External Index
• Rand Index (cont.)
  – Example (cont.): current status: a = 0, b = 1, c = 1, d = 1
    For the data point pair [1, 5]:
      X: [1, 5] assigned to (i, i) –> in the same cluster
      Y: [1, 5] labelled (q, q) –> in the same class
    Thus, a ← a + 1 = 1
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q

Page 21

External Index
• Rand Index (cont.)
  – Example (cont.): current status: a = 1, b = 1, c = 1, d = 1
    In-class exercise: continue over the remaining pairs to obtain the final values of a, b, c and d.
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q
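The hand counting above can be completed mechanically by looping over all 10 pairs. A short brute-force sketch (the function name is illustrative; `itertools.combinations` stands in for the hand-listed pairs):

```python
from itertools import combinations

def rand_counts(X, Y):
    # Count pair agreements/disagreements between partition X and labels Y.
    a = b = c = d = 0
    for p, q in combinations(range(len(X)), 2):
        same_cluster = X[p] == X[q]
        same_class = Y[p] == Y[q]
        if same_cluster and same_class:
            a += 1
        elif same_cluster and not same_class:
            b += 1
        elif not same_cluster and same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

X = ['i', 'ii', 'ii', 'i', 'i']   # cluster IDs for points 1..5
Y = ['q', 'p', 'q', 'p', 'q']     # class labels for points 1..5
a, b, c, d = rand_counts(X, Y)
print(a, b, c, d)                  # 1 3 3 3
print((a + d) / (a + b + c + d))   # RI = 0.4
```

This also gives away the exercise answer: a = 1, b = 3, c = 3, d = 3, so RI(X, Y) = 4/10 = 0.4.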

Page 22

External Index
• Rand Index: Contingency Table
  – In general, a, b, c and d are calculated from a contingency table.
    Assume there are k clusters in X and l classes in Y; n_ij is the number of points in both cluster i and class j.

Page 23

External Index
• Rand Index: Contingency Table (cont.)
  – Example: for the 5-point data set
      point: 1  2  3  4  5
      X:     i  ii ii i  i
      Y:     q  p  q  p  q
    the contingency table is

              p   q  | total
      i       1   2  |   3
      ii      1   1  |   2
      total   2   3  |   5

Page 24

External Index
• Rand Index: measure the number of pairs via the contingency table
  – a (same cluster in X / same class in Y):

      a = (1/2) · Σ_{i=1}^{k} Σ_{j=1}^{l} n_ij (n_ij − 1)

  – b (same cluster in X / different classes in Y):

      b = (1/2) · ( Σ_{i=1}^{k} n_i.² − Σ_{i=1}^{k} Σ_{j=1}^{l} n_ij² )

  – c (different clusters in X / same class in Y):

      c = (1/2) · ( Σ_{j=1}^{l} n_.j² − Σ_{i=1}^{k} Σ_{j=1}^{l} n_ij² )

  – d (different clusters in X / different classes in Y):

      d = (1/2) · ( N² + Σ_{i=1}^{k} Σ_{j=1}^{l} n_ij² − Σ_{i=1}^{k} n_i.² − Σ_{j=1}^{l} n_.j² )

    where n_i. and n_.j are the row (cluster) and column (class) totals of the contingency table and N is the total number of data points.

  Ex. 4
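These closed-form counts can be checked against the contingency table of the 5-point example. A sketch in pure NumPy (the function name is illustrative):

```python
import numpy as np

def rand_counts_from_table(n):
    # a, b, c, d from a k-by-l contingency table n_ij
    # (rows: clusters in X, columns: classes in Y).
    N = n.sum()
    sum_sq = np.sum(n ** 2)
    rows = n.sum(axis=1)   # n_i. : cluster sizes
    cols = n.sum(axis=0)   # n_.j : class sizes
    a = 0.5 * np.sum(n * (n - 1))
    b = 0.5 * (np.sum(rows ** 2) - sum_sq)
    c = 0.5 * (np.sum(cols ** 2) - sum_sq)
    d = 0.5 * (N ** 2 + sum_sq - np.sum(rows ** 2) - np.sum(cols ** 2))
    return a, b, c, d

# Contingency table of the 5-point example
# (rows: clusters i, ii; columns: classes p, q).
n = np.array([[1, 2],
              [1, 1]])
a, b, c, d = rand_counts_from_table(n)
print(a, b, c, d)                  # 1.0 3.0 3.0 3.0
print((a + d) / (a + b + c + d))   # RI = 0.4
```

The result matches the pair-by-pair count, as it must: the table formulas are just a rearrangement of the brute-force enumeration.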

Page 25

Weighted Clustering Ensemble
• Motivation
  – Evidence-accumulation-based clustering ensemble (Fred & Jain, 2005) is a simple clustering ensemble algorithm that uses the evidence accumulated from multiple yet diversified partitions, generated with different algorithms, initial conditions, distance metrics, etc.
  – However, this clustering ensemble algorithm is subject to limitations:
    • It doesn't distinguish between "non-trivial" and "trivial" partitions; i.e., all partitions used in a clustering ensemble are treated as equally important.
    • It is sensitive to the cluster distance used in the hierarchical clustering for reaching a consensus.
  – Clustering validity indexes provide an effective way to measure the "non-trivialness" or "importance" of a partition:
    • Internal index: directly measures the importance of a partition quantitatively and objectively.
    • External index: indirectly measures the importance of a partition by comparing it with the other partitions used in the clustering ensemble, for another round of "evidence accumulation".

Page 26

Chen & Yang: Temporal Data Clustering with Different Representations

Weighted Clustering Ensemble

Yun Yang and Ke Chen, “Temporal data clustering via weighted clustering ensemble with different representations,” IEEE Transactions on Knowledge and Data Engineering 23(2), pp. 307-320, 2011.

Use validity index values as weights

Page 27

Weighted Clustering Ensemble
Example: convert clustering results into a binary "distance" matrix (entry 0 for two points in the same cluster, 1 otherwise)
  Points A, B, C, D are grouped into Cluster 1 (C1) and Cluster 2 (C2).
  With rows/columns ordered A, D, C, B (as on the slide), the "distance" matrix is

      D¹ = [ 0 0 1 1
             0 0 1 1
             1 1 0 0
             1 1 0 0 ]

Page 28

Weighted Clustering Ensemble
Example: convert clustering results into a binary "distance" matrix
  Points A, B, C, D are grouped into Cluster 1 (C1), Cluster 2 (C2) and Cluster 3 (C3).
  With the same row/column order (A, D, C, B), the "distance" matrix is

      D² = [ 0 1 1 1
             1 0 1 1
             1 1 0 0
             1 1 0 0 ]

Page 29

Weighted Clustering Ensemble
Evidence accumulation: form the collective weighted "distance" matrix from the partition matrices D¹ and D²:

      D_E = (1/3) · [ (w¹_M + w¹_D + w¹_N) D¹ + (w²_M + w²_D + w²_N) D² ]

where w^m_M, w^m_D and w^m_N are the weights assigned to partition m by three different validity indexes.
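The weighted accumulation can be written out directly. A sketch using the two binary "distance" matrices from the slides and made-up validity-index weights (the weight values are purely illustrative, not from the paper):

```python
import numpy as np

# Binary "distance" matrices for partitions 1 and 2
# (rows/columns ordered A, D, C, B).
D1 = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [1, 1, 0, 0],
               [1, 1, 0, 0]])
D2 = np.array([[0, 1, 1, 1],
               [1, 0, 1, 1],
               [1, 1, 0, 0],
               [1, 1, 0, 0]])

# Hypothetical validity-index weights (w_M, w_D, w_N) for each partition;
# a partition judged more important gets larger weights.
w1 = (0.4, 0.3, 0.3)   # partition 1: weights sum to 1.0
w2 = (0.2, 0.2, 0.2)   # partition 2: weights sum to 0.6

# D_E = (1/3) [ (w1_M + w1_D + w1_N) D1 + (w2_M + w2_D + w2_N) D2 ]
DE = (sum(w1) * D1 + sum(w2) * D2) / 3.0
print(DE)
```

A final hierarchical clustering on D_E then yields the consensus partition; small entries of D_E correspond to point pairs that the (highly weighted) partitions consistently placed together.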

Page 30

Weighted Clustering Ensemble
Clustering analysis with our weighted clustering ensemble on the CAVIAR database
  – annotated video sequences of pedestrians: a set of 222 high-quality moving trajectories
  – clustering analysis of trajectories is useful for many applications
Experimental setting
  – Representations: PCF, DCF, PLS and PDWT
  – Initial clustering analysis: K-means algorithm (4 < K < 20), 6 initial settings
  – Ensemble: 320 partitions in total (80 partitions per representation)

Page 31

Weighted Clustering Ensemble
Clustering results on the CAVIAR database

Page 32

Weighted Clustering Ensemble
• Application: UCR time series benchmarks

Page 33

Weighted Clustering Ensemble
• Application: Rand index (%) values of clustering ensembles (Yang & Chen, 2011)

Page 34

Summary
• Cluster validation is a process that evaluates clustering results with a pre-defined criterion.
• There are two different types of cluster validation methods:
  – Internal indexes
    • no "ground truth" available
    • defined based on "common sense" or "a priori knowledge"
    • Application: finding the "proper" number of clusters, …
  – External indexes
    • "ground truth" known, or a reference given ("relative index")
    • Application: performance evaluation of clustering with reference information
• Apart from direct evaluation, both kinds of indexes may be applied in a weighted clustering ensemble, leading to better and more robust results by down-weighting trivial partitions.

K. Wang et al., "CVAP: Validation for cluster analysis," Data Science Journal, vol. 8, May 2009. [Code available online: http://www.mathworks.com/matlabcentral/fileexchange/authors/24811]