Methods for Clustering: K-means, Soft K-means, DBSCAN

APPLIED MACHINE LEARNING - MSc Course (EPFL)
Source: lasa.epfl.ch/teaching/lectures/ML_Msc/Slides/Clustering.pdf

Page 1

APPLIED MACHINE LEARNING

Methods for Clustering

K-means, Soft K-means

DBSCAN

Page 2

Objectives

Learn basic techniques for data clustering

• K-means and soft K-means, GMM (next lecture)

• DBSCAN

Understand the issues and major challenges in clustering

• Choice of metric

• Choice of number of clusters

Page 3

What is clustering?

Clustering is a type of multivariate statistical analysis also known as cluster analysis, unsupervised classification analysis, or numerical taxonomy.

Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Cluster: a collection of data objects that are "similar" to one another and can thus be treated collectively as one group.

Page 4

Classification versus Clustering

Supervised Classification = Classification
We know the class labels and the number of classes.

Unsupervised Classification = Clustering
We do not know the class labels and may not know the number of classes.

Page 5

Classification versus Clustering

Unsupervised Classification = Clustering

A hard problem when no pair of objects has exactly the same features: we need to determine how similar two or more objects are to one another.

Page 6

Which clusters can you create?

Which two subgroups of pictures are similar and why?

Page 7

Which clusters can you create?

Which two subgroups of pictures are similar and why?

Page 8

What is Good Clustering?

A good clustering method produces high-quality clusters when:

• The intra-class (that is, intra-cluster) similarity is high.
• The inter-class similarity is low.

The quality measure of a cluster depends on the similarity measure used!

Page 9

Exercise:

Intra-class similarity is the highest when:
a) you choose to classify images with and without glasses
b) you choose to classify images of person 1 against person 2

(Image groups: person 1 with glasses, person 1 without glasses, person 2 without glasses, person 2 with glasses.)

Page 10

Exercise:

Projection onto the first two principal components after PCA.
(Legend: person 1 with glasses, person 1 without glasses, person 2 without glasses, person 2 with glasses.)

Intra-class similarity is the highest when:
a) you choose to classify images with and without glasses
b) you choose to classify images of person 1 against person 2

Page 11

Exercise:

The eigenvector e1 is composed of a mix of the main characteristics of the two faces and is hence explanatory of both. However, since the two faces have little in common, the two groups have different coordinates on e1, but have quasi-identical coordinates for the glasses within each subgroup. Projecting onto e1 hence offers a means to compute a metric of similarity across the two persons.

Projection onto e1 against e2.
(Legend: person 1 with glasses, person 1 without glasses, person 2 without glasses, person 2 with glasses.)

Page 12

Exercise:

When projecting onto e1 and e3, we can separate the images of person 1 with and without glasses, as the eigenvector e3 embeds features distinctive primarily of person 1.

Projection onto e1 against e3.
(Legend: person 1 with glasses, person 1 without glasses, person 2 without glasses, person 2 with glasses.)

Page 13

Exercise:

Design a method to find the groups when you no longer have the class labels.

Projection onto the first two principal components after PCA.

Page 14

Sensitivity to Prior Knowledge

Priors:
• Data cluster within a circle
• There are 2 clusters

(Figure: datapoints in x1, x2, x3; legend: outliers (noise) vs. relevant data.)

Page 15

Sensitivity to Prior Knowledge

Priors:
• Data follow a complex distribution
• There are 3 clusters

(Figure: datapoints in x1, x2, x3.)

Page 16

Clusters' Types

Globular clusters: K-means produces globular clusters.
Non-globular clusters: DBSCAN produces non-globular clusters.

Page 17

What is Good Clustering?

Requirements for good clustering:
• Discovery of clusters with arbitrary shape
• Ability to deal with noise and outliers
• Insensitivity to the ordering of input records
• Scalability
• Ability to handle high dimensionality
• Interpretability and reusability

Page 18

How to cluster?

What choice of model (circle, ellipse) for the cluster?
How many models?

(Figure: unlabeled datapoints in x1, x2.)

Page 19

What choice of model (circle, ellipse) for the cluster? A circle.
How many models? A fixed number: K = 2.
Where to place them for optimal clustering?

K-means Clustering

K-means clustering generates a number K of disjoint clusters so as to minimize:

$J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{i \in c_k} \| x^i - \mu_k \|^2$

$x^i$: i-th data point
$\mu_k$: geometric centroid
$c_k$: cluster label or number

Page 20

K-means Clustering

Initialization: initialize the positions of the centers of the clusters at random.

In mldemos, centroids are initialized on one datapoint each, with no overlap across centroids.

Page 21

K-means Clustering

Assignment Step:
• Calculate the distance from each data point to each centroid.
• Assign the responsibility of each data point to its "closest" centroid. If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.

$k^i = \arg\min_k d(x^i, \mu_k)$

Responsibility of cluster k for point $x^i$:
$r_k^i = 1$ if $k = k^i$, $0$ otherwise

$x^i$: i-th data point
$\mu_k$: geometric centroid

Page 22

K-means Clustering

Update Step (M-Step):
Recompute the position of each centroid based on the assignment of the points:

$\mu_k = \dfrac{\sum_i r_k^i\, x^i}{\sum_i r_k^i}$

with $k^i = \arg\min_k d(x^i, \mu_k)$ and the responsibility of cluster k for point $x^i$:
$r_k^i = 1$ if $k = k^i$, $0$ otherwise

Page 23

K-means Clustering

Assignment Step:
• Calculate the distance from each data point to each centroid.
• Assign the responsibility of each data point to its "closest" centroid. If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.

$k^i = \arg\min_k d(x^i, \mu_k)$,  $r_k^i = 1$ if $k = k^i$, $0$ otherwise

$\mu_k = \dfrac{\sum_i r_k^i\, x^i}{\sum_i r_k^i}$

Page 24

K-means Clustering

Update Step (M-Step):
Recompute the position of each centroid based on the assignment of the points.

Stopping Criterion: go back to step 2 and repeat the process until the clusters are stable.

Page 25

K-means Clustering

K-means creates a hard partitioning of the dataset.

(Figure: partition boundaries in x1, x2, with the intersection points marked.)

Page 26

Effect of the distance metric on K-means

(Figures: partitions obtained with the L1-norm, L2-norm, L3-norm, and L8-norm.)

Page 27

K-means Clustering: Algorithm

1. Initialization: Pick K arbitrary centroids and set their geometric means to random values (in mldemos, centroids are initialized on one datapoint each, with no overlap across centroids).

2. Calculate the distance from each data point to each centroid.

3. Assignment Step (E-step): Assign the responsibility of each data point to its "closest" centroid: $k^i = \arg\min_k d(x^i, \mu_k)$, with $r_k^i = 1$ if $k = k^i$ and $0$ otherwise. If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.

4. Update Step (M-step): Adjust the centroids to be the means of all data points assigned to them: $\mu_k = \dfrac{\sum_i r_k^i\, x^i}{\sum_i r_k^i}$.

5. Go back to step 2 and repeat the process until the clusters are stable.
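The slides give no code, so here is a minimal numpy sketch of the five steps above; the function name kmeans, the Euclidean choice for d(x, μ), the convergence test, and the handling of empty clusters are our assumptions, not part of the lecture.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: centroids initialized on K distinct datapoints (mldemos-style).
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Steps 2-3 (E-step): point-to-centroid distances; argmin breaks
        # ties in favor of the centroid with the smallest index.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4 (M-step): each centroid becomes the mean of its points
        # (an empty cluster keeps its previous position - an assumption).
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  if np.any(labels == k) else centroids[k]
                                  for k in range(K)])
        # Step 5: stop when the centroids (hence the clusters) are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```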

Page 28

K-means Clustering

The algorithm of K-means is a simple version of

Expectation-Maximization applied to a model

composed of isotropic Gauss functions

(see next lecture)

Page 29

K-means Clustering: Properties

• There are always K clusters.
• The clusters do not overlap (soft K-means relaxes this assumption, see next slides).
• Each member of a cluster is closer to its own cluster than to any other cluster.

The algorithm is guaranteed to converge in a finite number of iterations, but it converges to a local optimum! It is hence very sensitive to the initialization of the centroids.

Page 30

Soft K-means Clustering

Assignment Step (E-step):
• Calculate the distance from each data point to each centroid.
• Assign the responsibility of each data point to its "closest" centroid.

Each data point $x^i$ is given a soft 'degree of assignment' to each of the means $\mu_k$:

$r_k^i = \dfrac{e^{-\beta\, d(x^i, \mu_k)}}{\sum_{k'} e^{-\beta\, d(x^i, \mu_{k'})}}$

$r_k^i$: responsibility of cluster k for point $x^i$; $r_k^i \in [0,1]$, normalized over clusters: $\sum_k r_k^i = 1$.

Page 31

Soft K-means Clustering

Update Step (M-Step):
Recompute the position of each centroid based on the assignment of the points. The model parameters, i.e. the means, are adjusted to match the weighted sample means of the data points that they are responsible for:

$\mu_k = \dfrac{\sum_i r_k^i\, x^i}{\sum_i r_k^i}$,  with  $r_k^i = \dfrac{e^{-\beta\, d(x^i, \mu_k)}}{\sum_{k'} e^{-\beta\, d(x^i, \mu_{k'})}} \in [0,1]$,  $\sum_k r_k^i = 1$

The update algorithm of soft K-means is identical to that of hard K-means, apart from the fact that the responsibilities to a particular cluster are now real numbers varying between 0 and 1.
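A minimal numpy sketch of the full soft K-means loop, under the same assumptions as the hard K-means sketch above (Euclidean d, names ours); the stiffness β is a user-chosen hyperparameter.

```python
import numpy as np

def soft_kmeans(X, K, beta=5.0, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # E-step: r_k^i = exp(-beta d(x^i,mu_k)) / sum_k' exp(-beta d(x^i,mu_k'))
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Subtracting the row-wise minimum is for numerical stability only;
        # it cancels in the normalization below.
        r = np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
        r /= r.sum(axis=1, keepdims=True)    # each row sums to 1 over clusters
        # M-step: means move to the weighted sample means of their points.
        new_centroids = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, r                      # r holds the soft assignments
```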

Page 32

Soft K-means Clustering

$\beta$ is the stiffness; $1/\beta$ measures the disparity across clusters:
small $\beta$ ~ large $1/\beta$; large $\beta$ ~ small $1/\beta$.

$r_k^i = \dfrac{e^{-\beta\, d(x^i, \mu_k)}}{\sum_{k'} e^{-\beta\, d(x^i, \mu_{k'})}}$

$r_k^i$: responsibility of cluster k for point $x^i$; $r_k^i \in [0,1]$, normalized over clusters: $\sum_k r_k^i = 1$.

Page 33

Soft K-means Clustering

Soft K-means algorithm with a small (left), medium (center) and large (right) stiffness $\beta$ (panel values: 10, 5, 1).

Page 34

Soft K-means Clustering

Iterations of the soft K-means algorithm from the random initialization (left) to convergence (right). Computed with $\beta = 10$.

Page 35

(soft) K-means Clustering: Properties

Advantages:
• Computationally faster than other clustering techniques.
• Produces tighter clusters, especially if the clusters are globular.
• Guaranteed to converge.

Drawbacks:
• Does not work well with non-globular clusters.
• Sensitive to the choice of initial partitions: different initial partitions can result in different final clusters.
• Assumes a fixed number K of clusters. It is therefore good practice to run the algorithm several times using different K values, to determine the optimal number of clusters.


Page 37

K-means Clustering: Weaknesses

• Unbalanced clusters: K-means takes into account only the distance between the means and the data points; it has no representation of the variance of the data within each cluster.

• Elongated clusters: K-means imposes a fixed shape (sphere) for each cluster.

Page 38

K-means Clustering: Weaknesses

Very sensitive to the choice of the number of clusters K and to the initialization (mldemos example).

Page 39

K-means: Limitations

K-means would not be able to reject outliers.

(Figure: datapoints in x1, x2, x3; legend: outliers (noise) vs. relevant data.)

Page 40

K-means: Limitations

K-means would not be able to reject outliers: K-means assigns all datapoints to a cluster, so outliers get assigned to the closest cluster.

DBSCAN can determine outliers and can generate non-globular clusters.

Page 41

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

1. Pick a datapoint at random.
2. Compute the number of datapoints within a distance ε of it.
3. If this number is < m_data, mark the datapoint as an outlier (noise).
4. Go back to 1.

Page 42

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

1. Pick a datapoint at random.
2. Compute the number of datapoints within a distance ε of it.
3. For each datapoint found, assign it to the same cluster (Cluster 1).
4. Go back to 1.

Page 43

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

1. Pick a datapoint at random.
2. Compute the number of datapoints within a distance ε of it.
3. For each datapoint found, assign it to the same cluster.
4. Merge two clusters if the distance between them is < ε (e.g., Cluster 1 and Cluster 2 merge into Cluster 1).

Page 44

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Hyperparameters:
• ε: size of the neighborhood
• m_data: minimum number of datapoints

(Figure: final Cluster 1 and Cluster 2 in x1, x2, x3, with outliers (noise) left unassigned.)
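In practice one rarely codes this from scratch; a hedged sketch using scikit-learn's DBSCAN (assuming scikit-learn is available), where eps plays the role of ε and min_samples the role of m_data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data, stand-in only
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_        # one cluster index per point; -1 marks outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {np.sum(labels == -1)} outliers")
```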

Page 46

Comparison: K-means / DBSCAN

                      K-means                  DBSCAN
Hyperparameters       K: number of clusters    ε: neighborhood size; m_data: min. number of datapoints
Computational cost    O(K·M)                   O(M·log(M)), M: number of datapoints
Type of cluster       Globular                 Non-globular (arbitrary shapes, non-linear boundaries)
Robustness to noise   Not robust               Robust to outliers within ε

K-means is computationally cheap. However, it is not robust to noise and produces only globular clusters. DBSCAN is computationally intensive, but it can automatically detect noise and produces clusters of arbitrary shape.

Both K-means and DBSCAN depend on choosing the hyperparameters well. To determine the hyperparameters, use evaluation methods for clustering (next).

Page 47

Evaluation of Clustering Methods

Clustering methods rely on hyperparameters:
• number of clusters, elements in the cluster, distance metric.
We need to determine the goodness of these choices.

Clustering is unsupervised classification: we do not know the real number of clusters or the data labels. It is difficult to evaluate these choices without ground truth.

Page 48

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

Internal measures rely on measures of similarity: (low) intra-cluster distance versus (high) inter-cluster distance. Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.

External measures rely on ground truth (class labels): given a (sub)set of known class labels, compute the similarity of the clusters to the class labels. In real-world data, it is often hard or infeasible to gather ground truth.

Page 49

Internal Measure: RSS

The Residual Sum of Squares (RSS) is an internal measure (available in mldemos). It computes the distance (in norm 2) of each datapoint from its centroid, summed over all clusters:

$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2$
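A one-function numpy sketch of this measure (the name rss is ours); it plugs directly into the kmeans() sketch given earlier:

```python
import numpy as np

def rss(X, centroids, labels):
    # Sum over clusters of squared L2 distances of each point to its centroid.
    return sum(np.sum((X[labels == k] - mu) ** 2)
               for k, mu in enumerate(centroids))
```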

Page 50

RSS for K-Means

The goal of K-means is to find cluster centers $\mu_k$ which minimize the distortion:

$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2$  (measure of distortion)

By increasing $K$ we decrease $\mathrm{RSS}$; what is the optimal $K$ such that $\mathrm{RSS} \to 0$? $\mathrm{RSS} = 0$ when $K = M$: one has as many clusters as datapoints!

(Figure: M = 100 datapoints, N = 2 dimensions; with K = M clusters, RSS = 0.)

However, RSS can still be used to determine an 'optimal' $K$ by monitoring the slope of the decrease of the measure as $K$ increases.

Page 51

K-means Clustering: Examples

Procedure: run K-means, increasing the number of clusters monotonically; for each number of clusters, run K-means with several initializations and take the best run; use the RSS measure to quantify the improvement in clustering and determine a plateau.

The optimal $k$ is at the 'elbow' of the RSS curve.
(Figure: M = 100 datapoints, N = 2 dimensions, k = 4 clusters.)
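A sketch of this procedure, reusing the kmeans() and rss() sketches above and assuming a dataset X (an M×N numpy array); the range of K and the number of restarts are arbitrary illustrative choices:

```python
# For each K: several random restarts, keep the lowest-RSS run.
best_rss = []
for K in range(1, 11):                                   # increase K monotonically
    runs = [kmeans(X, K, seed=s) for s in range(10)]     # several initializations
    best_rss.append(min(rss(X, c, l) for c, l in runs))  # best run for this K
# Plot best_rss against K and pick K at the 'elbow', where the curve flattens.
```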

Page 52

K-means with RSS: Examples

Cluster Analysis of Hedge Funds (fonds spéculatifs) [N. Das, 9th Int. Conf. on Computing in Economics and Finance, 2011]

There is no legal definition of hedge funds; they consist of a wide category of investment funds with high risk and high returns, with a variety of strategies guiding the investment.

Research question: classify the type of hedge funds based on the information provided to the client.

Data dimensions (features): e.g., asset class, size of the hedge fund, incentive fee, risk level, and liquidity of the hedge funds.

Page 53

K-means with RSS: Examples

Cluster Analysis of Hedge Funds (fonds spéculatifs) [N. Das, 9th Int. Conf. on Computing in Economics and Finance, 2011]

Procedure: run K-means, increasing the number of clusters monotonically; run K-means with several initializations and take the best run; use the RSS measure to quantify the improvement in clustering and determine a plateau.

(Figure: RSS against the number of clusters K, with a cutoff at the plateau. Optimal results are found with 7 clusters.)

Page 54

K-means Clustering: Examples

The 'elbow' or 'plateau' method for choosing the optimal $k$ from the RSS curve can be unreliable for certain datasets. Which one is the 'optimal' $k$: $k = 11$ or $k = 2$? We don't know! We need an additional penalty or criterion!

(Figure: M = 100 datapoints, N = 3 dimensions.)

Page 55

Other Metrics to Evaluate Clustering Methods

AIC and BIC determine how well the model fits the dataset in a probabilistic sense (a maximum-likelihood measure). The measure is balanced by how many parameters are needed to get a good fit:

- Akaike Information Criterion: $\mathrm{AIC} = -2\ln(L) + 2B$
- Bayesian Information Criterion: $\mathrm{BIC} = -2\ln(L) + \ln(M)\,B$

$L$: maximum likelihood of the model
$B$: number of free parameters
$M$: number of datapoints

The second term is a penalty for an increase in computational cost due to the number of parameters and the number of datapoints. As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC does. A low BIC implies either fewer explanatory variables, a better fit, or both.

Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.

Page 56

AIC for K-Means

For the particular case of K-means, we do not have a maximum-likelihood estimate of the model, so $\mathrm{AIC} = -2\ln(L) + 2B$ cannot be evaluated directly. However, we can formulate a metric based on the RSS that penalizes for model complexity (the number K of clusters), conceptually following AIC:

$\mathrm{AIC}_{\mathrm{RSS}} = \mathrm{RSS} + B$

with $\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2$ and the number of free parameters $B = K \cdot N$ ($K$: number of clusters, $N$: number of dimensions) acting as the weighting factor.

Page 57

BIC for K-Means

For the particular case of K-means, we do not have a maximum-likelihood estimate of the model, so $\mathrm{BIC} = -2\ln(L) + \ln(M)\,B$ cannot be evaluated directly. However, we can formulate a metric based on the RSS that penalizes for model complexity (the number K of clusters and the number M of datapoints), conceptually following BIC:

$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\,B$

with $\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - \mu_k \|^2$ and $B = K \cdot N$ free parameters ($K$: number of clusters, $N$: number of dimensions). The weighting factor $\ln(M)$ penalizes with respect to the number of datapoints (i.e., computational complexity).
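A small sketch of both RSS-based criteria exactly as defined above (function name ours); it plugs into the best_rss values collected by the elbow procedure sketched earlier:

```python
import numpy as np

def aic_bic_rss(rss_value, K, N, M):
    # B = K*N free parameters; returns (AIC_RSS, BIC_RSS) as in the slides.
    B = K * N
    return rss_value + B, rss_value + np.log(M) * B

# e.g. pick the K minimizing BIC_RSS over the runs collected earlier:
# bics = [aic_bic_rss(r, K, X.shape[1], len(X))[1]
#         for K, r in enumerate(best_rss, start=1)]
# K_opt = 1 + int(np.argmin(bics))
```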

Page 58

K-means Clustering: Examples

Procedure: run K-means, increasing the number of clusters monotonically; run K-means with several initializations and take the best run; use the AIC/BIC curves to find the optimal $k$, which is at $\min(\mathrm{AIC})$ or $\min(\mathrm{BIC})$.

Both $\min(\mathrm{BIC})$ and $\min(\mathrm{AIC})$ → $k = 2$.
(Figure: M = 100 datapoints, N = 3 dimensions, k = 2 clusters.)

Page 59

BIC for K-Means

$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\,(K \cdot N)$

(Figure: M = 100 datapoints, N = 2 dimensions; the BIC curve selects K = 14 clusters.)

Page 60

BIC for K-Means

$\mathrm{BIC}_{\mathrm{RSS}} = \mathrm{RSS} + \ln(M)\,(K \cdot N)$

(Figure: M = 100 datapoints, N = 2 dimensions; the BIC curve selects K = 4 clusters.)

Page 61

AIC / BIC for DBSCAN

Compute the centroid of each cluster and apply the AIC/BIC of K-means.

        DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS     43               26                0.5
BIC     42               34                78
AIC     69               51                24
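A sketch of this recipe for a DBSCAN labeling (function name ours). The slides do not say how outliers should enter the score; ignoring the noise points (label -1) is our assumption.

```python
import numpy as np

def dbscan_bic_rss(X, labels):
    # Compute each cluster's centroid, then apply the RSS-based BIC of K-means.
    ks = sorted(set(labels) - {-1})                  # -1 = noise, ignored here
    cents = [X[labels == k].mean(axis=0) for k in ks]
    rss_val = sum(np.sum((X[labels == k] - c) ** 2) for k, c in zip(ks, cents))
    B = len(ks) * X.shape[1]                         # B = K*N free parameters
    return rss_val + np.log(len(X)) * B
```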

Page 62

AIC / BIC for DBSCAN

Compute the centroid of each cluster and apply the AIC/BIC of K-means.

        K-means   DBSCAN large ε   DBSCAN medium ε   DBSCAN small ε
RSS     51        95               59                0.6
BIC     65        118              88                331
AIC     55        102              67                93

Page 63

Evaluation of Clustering Methods

Two types of measures: internal versus external measures.

External measures assume that a subset of datapoints has class labels → semi-supervised learning. They measure how well these labeled datapoints are clustered. This requires an idea of the number of existing classes and some labeled datapoints. It is interesting mainly in cases where labeling is highly time-consuming, e.g. when the data is very large (as in speech recognition).

Page 64

Semi-Supervised Learning

Clustering F1-measure (careful: similar to, but not the same as, the F-measure we will see for classification!)

Trade-off between clustering all datapoints of the same class correctly in the same cluster and making sure that each cluster contains points of only one class:

$F(C, K) = \sum_{c_i \in C} \dfrac{|c_i|}{M} \max_k F(c_i, k)$

$F(c_i, k) = \dfrac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$

$R(c_i, k) = \dfrac{n_{ik}}{|c_i|}$,   $P(c_i, k) = \dfrac{n_{ik}}{|k|}$

$M$: number of labeled datapoints
$C$: the set of classes, with classes $c_i$
$K$: number of clusters
$n_{ik}$: number of members of class $c_i$ in cluster $k$
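A direct numpy transcription of this measure over the labeled subset (function name ours; computing the precision denominator |k| over the labeled members of each cluster is our assumption):

```python
import numpy as np

def clustering_f1(y_class, y_cluster):
    # y_class, y_cluster: label and cluster arrays over the M labeled points.
    M, total = len(y_class), 0.0
    for c in np.unique(y_class):
        in_c = (y_class == c)
        best = 0.0
        for k in np.unique(y_cluster):
            in_k = (y_cluster == k)
            n_ik = np.sum(in_c & in_k)
            if n_ik == 0:
                continue
            R = n_ik / in_c.sum()          # recall:    n_ik / |c_i|
            P = n_ik / in_k.sum()          # precision: n_ik / |k|
            best = max(best, 2 * R * P / (R + P))
        total += (in_c.sum() / M) * best   # weight by class size, max over k
    return total
```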

Page 65

Recall: proportion of datapoints correctly classified/clusterized: $R(c_i, k) = n_{ik} / |c_i|$.
Precision: proportion of datapoints of the same class in the cluster: $P(c_i, k) = n_{ik} / |k|$.

(Figure: two classes, Class 1 and Class 2, with labeled and unlabeled datapoints in two clusters.)

Example:
$R(c_1, k=1) = \frac{2}{2} = 1$,  $R(c_2, k=2) = \frac{4}{4} = 1$
$P(c_1, k=1) = \frac{2}{6}$,  $P(c_2, k=2) = \frac{4}{6}$

Page 66

The weight $|c_i|/M$ penalizes by the fraction of labeled points in each class, and $\max_k$ picks for each class the cluster with the maximal F1 measure:

$F(C, K) = \frac{2}{6} F(c_1, k=1) + \frac{4}{6} F(c_2, k=2) = 0.7$

(Figure: same two classes, Class 1 and Class 2, with labeled and unlabeled datapoints.)

Page 67

Summary of F1-Measure

$F(C, K) = \sum_{c_i \in C} \dfrac{|c_i|}{M} \max_k F(c_i, k)$,   $F(c_i, k) = \dfrac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$

• Recall $R(c_i, k) = n_{ik}/|c_i|$: proportion of datapoints correctly classified/clusterized.
• Precision $P(c_i, k) = n_{ik}/|k|$: proportion of datapoints of the same class in the cluster.
• The $\max_k$ picks for each class the cluster with the maximal F1 measure.
• The weight $|c_i|/M$ penalizes by the fraction of labeled points in each class.

($M$: number of labeled datapoints; $C$: the set of classes; $K$: number of clusters; $n_{ik}$: number of members of class $c_i$ in cluster $k$.)

The clustering F1-measure (careful: similar to, but not the same as, the F-measure we will see for classification!) trades off clustering all datapoints of the same class correctly in the same cluster against making sure that each cluster contains points of only one class.

Page 68

Summary of Lecture

We introduced two clustering techniques, K-means and DBSCAN, and discussed their pros and cons in terms of computational time and power of representation (globular/non-globular clusters).

We introduced metrics to evaluate clustering and help choose the hyperparameters:
• Internal measures (RSS, AIC, BIC)
• External measures: F1-measure (also called F-measure for clustering)

Next week, practical on clustering: you will compare the performance of K-means and DBSCAN on your datasets, and use the internal and external measures to assess this performance and choose the hyperparameters.

Page 69

Robotic Application of Clustering Method

A variety of hand postures is observed when grasping objects. How do we generate the correct hand posture on robots?

El-Khoury, S., Li, M., and Billard, A. (2013) On the Generation of a Variety of Grasps. Robotics and Autonomous Systems Journal.

Page 70

Robotic Application of Clustering Method

4-DOF industrial hand (Barrett Technology) and 9-DOF humanoid hand (iCub robot).

Problem: choose the points of contact and generate a feasible posture for the fingers to touch the object at the correct points and with the desired force.

Difficulty: high degrees of freedom (large number of possible points of contact, large number of DOFs to control).

Page 71

Formulate the problem as constraint-based optimization: minimize the generated torques at the fingertips under the constraints of:
• force closure
• kinematic feasibility
• collision avoidance

The nonconvex optimization yields several local / feasible solutions: from 1890 trials it converges to 791 feasible solutions in one case (taking ~2.65 s for each solution) and to 612 feasible solutions in the other (taking ~12.14 s for each solution). This took too long for a realistic application.

Page 72

Apply K-means to all solutions and group them into clusters.

(Figures: solutions grouped into 11 clusters and into 20 clusters.)

Page 73

A. Shukla and A. Billard, NIPS 2012
