Machine Learning
Lecture 7: K-Means
Feng Li
https://funglee.github.io
School of Computer Science and Technology
Shandong University
Fall 2018
Clustering
• Usually an unsupervised learning problem
• Given: N unlabeled examples {x1, · · · , xN} and the number of desired partitions K
• Goal: Group the examples into K “homogeneous” partitions
• Loosely speaking, it is classification without ground truth labels
• A good clustering is one that achieves:
  • High within-cluster similarity
  • Low inter-cluster similarity
Similarity can be Subjective
• Clustering only looks at similarities, no labels are given
• Without labels, similarity can be hard to define
• Thus using the right distance/similarity is very important in clustering
• Also important to define/ask: “Clustering based on what?”
Clustering: Some Examples
• Document/Image/Webpage Clustering
• Image Segmentation (clustering pixels)
• Clustering web-search results
• Clustering (people) nodes in (social) networks/graphs
• ... and many more...
Types of Clustering
• Flat or Partitional clustering: Partitions are independent of each other
• Hierarchical clustering: Partitions can be visualized using a tree structure (a dendrogram)
K-Means Clustering Problem
• Given a set of observations X = {x1, x2, · · · , xN} (xi ∈ RD), partition the N observations into K sets (K ≤ N) {Ck}k=1,··· ,K such that the sets minimize the within-cluster sum of squares:

    argmin_{Ck} ∑_{i=1}^{K} ∑_{x∈Ci} ‖x − µi‖²

  where µi is the mean of the points in set Ci
• How hard is this problem?
• The problem is NP-hard, but there are good heuristic algorithms that seem to work well in practice, e.g., the K-means algorithm and mixtures of Gaussians
K-Means Algorithm (Lloyd, 1957)
• In each iteration:
  • (Re-)Assign each example xi to its closest cluster center (based on the smallest Euclidean distance)

      Ck = {xi : ‖xi − µk‖² ≤ ‖xi − µk′‖² for all k′ ≠ k}

    (Ck is the set of examples assigned to cluster k with center µk)
  • Update the cluster means

      µk = mean(Ck) = (1/|Ck|) ∑_{x∈Ck} x
• Stop when the cluster means or the “loss” do not change by much
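To make the two alternating steps concrete, here is a minimal NumPy sketch of the algorithm above (not from the slides; the function name kmeans and parameters such as max_iters, tol, and seed are illustrative assumptions):

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm. X: (N, D) data matrix. Returns centers (K, D) and labels (N,)."""
    rng = np.random.default_rng(seed)
    # Initialize the K centers with K distinct randomly chosen examples
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev_loss = np.inf
    for _ in range(max_iters):
        # Assignment step: each point goes to the closest center (squared Euclidean distance)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        z = dists.argmin(axis=1)                                     # shape (N,)
        # Update step: each center becomes the mean of the points assigned to it
        for k in range(K):
            if np.any(z == k):               # keep the old center if a cluster becomes empty
                mu[k] = X[z == k].mean(axis=0)
        # Stop when the loss no longer changes by much
        loss = dists[np.arange(len(X)), z].sum()
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return mu, z

# Example usage on two synthetic blobs:
# X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
# centers, labels = kmeans(X, K=2)
```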
K-means illustration (assume K = 2): a sequence of figures (omitted here) shows the initialization of the two centers, followed by four iterations that each alternate between assigning points to the closest center and recomputing the centers.
What Loss Function is K-means Optimizing?
• Let µ1, µ2, · · · , µK be the K cluster centroids (means)
• Let zi,k be an indicator
    zi,k = 1 if xi ∈ Ck, and 0 otherwise

• Assume zi = [zi,1, zi,2, · · · , zi,K]ᵀ, with ∑_{k=1}^{K} zi,k = 1, represents a length-K one-hot encoding of xi
What Loss Function is K-means Optimizing?
• Define the distortion or “loss” for the cluster assignment of xi
    ℓ({µk}k=1,··· ,K , xi, zi) = ∑_{k=1}^{K} zi,k ‖xi − µk‖²
• Total distortion over all points defines the K-means “loss function”
    L(µ, X, Z) = ∑_{i=1}^{N} ∑_{k=1}^{K} zi,k ‖xi − µk‖² = ‖X − Zµ‖²

  where X is N × D, Z is N × K and µ is K × D
What Loss Function is K-means Optimizing?
• Total distortion over all points defines the K-means “loss function”

    L(µ, X, Z) = ∑_{i=1}^{N} ∑_{k=1}^{K} zi,k ‖xi − µk‖² = ‖X − Zµ‖²

  where X is N × D, Z is N × K and µ is K × D
• Let’s have a closer look
              [ x1ᵀ ]   [ z1ᵀ ]       [ x1ᵀ − z1ᵀµ ]
    X − Zµ =  [ x2ᵀ ] − [ z2ᵀ ] µ  =  [ x2ᵀ − z2ᵀµ ]
              [  ⋮  ]   [  ⋮  ]       [     ⋮      ]
              [ xNᵀ ]   [ zNᵀ ]       [ xNᵀ − zNᵀµ ]

  where ziᵀµ is the centroid of the cluster xi belongs to
What Loss Function is K-means Optimizing?
• Define the distortion or “loss” for the cluster assignment of xi
    ℓ({µk}, xi, zi) = ∑_{k=1}^{K} zi,k ‖xi − µk‖²
• Total distortion over all points defines the K-means “loss function”
    L(µ, X, Z) = ∑_{i=1}^{N} ∑_{k=1}^{K} zi,k ‖xi − µk‖² = ‖X − Zµ‖²

  where X is N × D, Z is N × K and µ is K × D
• The K-means problem is to minimize the above objective function w.r.t. µ and Z
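As a quick sanity check on the matrix form, the following illustrative NumPy snippet (not from the slides; all names are my own) computes the distortion both as the double sum and as ‖X − Zµ‖² and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 2, 3
X = rng.standard_normal((N, D))      # N x D data matrix
mu = rng.standard_normal((K, D))     # K x D matrix of cluster means
z = rng.integers(0, K, size=N)       # cluster index of each point
Z = np.eye(K)[z]                     # N x K one-hot assignment matrix

# Double-sum form: sum_i sum_k z_{i,k} ||x_i - mu_k||^2
loss_sum = sum(np.sum((X[i] - mu[z[i]]) ** 2) for i in range(N))
# Matrix form: squared (Frobenius) norm of X - Z mu
loss_mat = np.sum((X - Z @ mu) ** 2)
assert np.isclose(loss_sum, loss_mat)
```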
K-means Objective
• Consider the K-means objective function
L(µ,X,Z) = ‖X− Zµ‖2
• It is a non-convex optimization problem
  • Many local minima possible
  • Also NP-hard to minimize in general (note that Z is discrete)
• The K-means algorithm we saw is a heuristic to optimize this function
• The K-means algorithm alternates between the following two steps
  • Fix µ, minimize w.r.t. Z (assign points to closest centers)
  • Fix Z, minimize w.r.t. µ (recompute the cluster means)
• Note: The algorithm usually converges to a local minimum (though not always; it may just converge “somewhere”). Multiple runs with different initializations can be tried to find a good solution.
Convergence of K-means Algorithm
• Each step (updating Z or µ) can never increase the objective
• When we update Z from Z^(t−1) to Z^(t)

    L(µ^(t−1), X, Z^(t)) ≤ L(µ^(t−1), X, Z^(t−1))

  because the new Z^(t) = argmin_Z L(µ^(t−1), X, Z)
• When we update µ from µ^(t−1) to µ^(t)

    L(µ^(t), X, Z^(t)) ≤ L(µ^(t−1), X, Z^(t))

  because the new µ^(t) = argmin_µ L(µ, X, Z^(t))
Convergence of K-means Algorithm (Contd.)
• Thus the K-means algorithm monotonically decreases the objective; since the objective is bounded below (it is non-negative) and there are only finitely many possible assignments Z, the algorithm converges
Choosing K
• One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look at the “elbow point”
• In the plot shown on this slide (omitted here), K = 6 is the elbow point
• Can also use an information criterion such as AIC (Akaike Information Criterion)

    AIC = 2 L(µ, X, Z) + K log D

  and choose the K that has the smallest AIC (discourages large K)
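A small sketch of both heuristics, assuming scikit-learn is available (the helper name choose_k and its parameters are illustrative; km.inertia_ is the K-means objective at the solution found):

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k(X, k_values):
    D = X.shape[1]
    losses, aics = [], []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        L = km.inertia_                      # K-means objective at the found solution
        losses.append(L)
        aics.append(2 * L + k * np.log(D))   # AIC as defined on this slide
    return losses, aics

# Plot `losses` versus k_values and look for the elbow,
# or pick the k with the smallest AIC.
```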
K-means: Some Limitations
• Makes hard assignments of points to clusters
  • A point either completely belongs to a cluster or does not belong at all
  • No notion of a soft assignment (i.e., a probability of being assigned to each cluster: say K = 3 and for some point xi, p1 = 0.7, p2 = 0.2, p3 = 0.1)
• Works well only if the clusters are roughly of equal sizes
• Probabilistic clustering methods such as Gaussian mixture models can handle both these issues (model each cluster using a Gaussian distribution)
• K-means also works well only when the clusters are round-shaped, and does badly if the clusters have non-convex shapes
• Kernel K-means or spectral clustering can handle non-convex cluster shapes
Kernel K-Means
• Basic idea: Replace the Euclidean distance/similarity computations in K-means by their kernelized versions:

    d(xi, µk) = ‖φ(xi) − φ(µk)‖

    ‖φ(xi) − φ(µk)‖² = ‖φ(xi)‖² + ‖φ(µk)‖² − 2 φ(xi)ᵀφ(µk)
                     = k(xi, xi) + k(µk, µk) − 2 k(xi, µk)

  where k(·, ·) denotes the kernel function and φ is its (implicit) feature map
Kernel K-Means
• Note: φ does not have to be computed/stored for the data {xi} or the cluster means {µk} because the computations only depend on kernel evaluations
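To illustrate this, here is a hedged sketch (my own names, not from the slides) of the kernel-space distance: since the cluster mean lives in feature space, k(xi, µk) and k(µk, µk) expand into averages of kernel evaluations over the members of Ck, so only the kernel matrix is needed:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Example kernel matrix; any valid kernel function could be used instead
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def dist_to_cluster(Kmat, i, members):
    """Squared feature-space distance from point i to the mean of cluster `members`."""
    members = np.asarray(members)
    term1 = Kmat[i, i]                              # k(x_i, x_i)
    term2 = Kmat[i, members].mean()                 # k(x_i, mu_k), expanded over C_k
    term3 = Kmat[np.ix_(members, members)].mean()   # k(mu_k, mu_k), expanded over C_k
    return term1 - 2 * term2 + term3
```

An assignment step in kernel K-means would then, for each point i, pick the cluster k that minimizes dist_to_cluster(Kmat, i, members_of_k).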
Hierarchical Clustering
• Agglomerative (bottom-up) Clustering
  1. Start with each example in its own singleton cluster
  2. At each time-step, greedily merge the 2 most similar clusters
  3. Stop when there is a single cluster of all examples, else go to 2
• Divisive (top-down) Clustering
  1. Start with all examples in the same cluster
  2. At each time-step, remove the “outsiders” from the least cohesive cluster
  3. Stop when each example is in its own singleton cluster, else go to 2
• Agglomerative is more popular and simpler than divisive (but less accurate)
Hierarchical Clustering
• Agglomerative (bottom-up) Clustering
  1. Start with each example in its own singleton cluster
  2. At each time-step, greedily merge the 2 most similar clusters
  3. Stop when there is a single cluster of all examples, else go to 2
Hierarchical Clustering (Contd.)
• Divisive (top-down) Clustering
  1. Start with all examples in the same cluster
  2. At each time-step, remove the “outsiders” from the least cohesive cluster
  3. Stop when each example is in its own singleton cluster, else go to 2
Hierarchical Clustering (Contd.)
• Metric: A measure of distance between pairs of observations
  • Euclidean distance: ‖a − b‖₂ = √(∑_i (a_i − b_i)²)
  • Squared Euclidean distance: ‖a − b‖₂² = ∑_i (a_i − b_i)²
  • Manhattan distance: ‖a − b‖₁ = ∑_i |a_i − b_i|
  • Maximum distance: ‖a − b‖∞ = max_i |a_i − b_i|
  • Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)) (S is the covariance matrix)
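For concreteness, an illustrative NumPy version of each metric above (function names are my own choices):

```python
import numpy as np

def euclidean(a, b):          return np.sqrt(np.sum((a - b) ** 2))
def squared_euclidean(a, b):  return np.sum((a - b) ** 2)
def manhattan(a, b):          return np.sum(np.abs(a - b))
def maximum(a, b):            return np.max(np.abs(a - b))
def mahalanobis(a, b, S):
    # S is the covariance matrix of the data
    d = a - b
    return np.sqrt(d @ np.linalg.solve(S, d))
```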
Hierarchical Clustering (Contd.)
• How to compute the dissimilarity between two clusters R and S?
  • Min-link or single-link: results in chaining (clusters can get very large)

      d(R, S) = min_{xR∈R, xS∈S} d(xR, xS)

  • Max-link or complete-link: results in small, round-shaped clusters

      d(R, S) = max_{xR∈R, xS∈S} d(xR, xS)

  • Average-link: a compromise between single and complete linkage

      d(R, S) = (1/(|R||S|)) ∑_{xR∈R, xS∈S} d(xR, xS)
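A short usage sketch of the three linkage criteria via SciPy, assuming SciPy is installed (the toy data and the choice t=3 are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(20, 2)                 # toy data
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)          # the full merge tree (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)
```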
Example: Agglomerative Clustering

A sequence of figures (omitted here) walks through agglomerative clustering on a toy dataset, merging the two most similar clusters at each step until a single cluster remains.
Flat vs Hierarchical Clustering
• Flat clustering produces a single partitioning
• Hierarchical Clustering can give different partitionings depending on the level-of-resolution we are looking at
• Flat clustering needs the number of clusters to be specified
• Hierarchical clustering does not need the number of clusters to be specified
• Flat clustering is usually more efficient run-time wise
• Hierarchical clustering can be slow (has to make several merge/split decisions)
• No clear consensus on which of the two produces better clustering
Thanks!
Q & A