25 machine learning unsupervised learning k-means k-centers
TRANSCRIPT
Machine Learning for Data Mining: Unsupervised Learning - Clustering
Andres Mendez-Vazquez
July 22, 2015
Outline

1 Supervised Learning vs. Unsupervised Learning
   Supervised vs Unsupervised
   Clustering
   Pattern Recognition
   Why Clustering?

2 K-Means Clustering
   K-Means Clustering
   Convergence Criterion
   The Distance Function
   Example
   Properties of K-Means

3 The K-Center Criterion Clustering
   Introduction
   Comparison with K-means
   The Greedy K-Center Algorithm
   Pseudo-Code
   K-Center Algorithm Properties
   K-Center Algorithm Proof of Correctness
Supervised Learning vs. Unsupervised Learning

Supervised learning: Discover patterns in the data that relate data attributes to a target (class) attribute.

Unsupervised learning: The data have no target attribute.
Clustering

Clustering is a technique for finding similarity groups in data, called clusters.

It is called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.

Due to historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
Pattern Recognition

Definition: the search for structure in data.

Elements of Numerical Pattern Recognition:
1 Process Description: Feature Nomination, Test Data, Design Data
2 Feature Analysis: Preprocessing, Extraction, Selection, ...
3 Cluster Analysis: Labeling, Validity, ...
4 Classifier Design: Classification, Estimation, Prediction, Control, ...
An illustration

The data set has three natural groups of data points, i.e., 3 natural clusters.
Examples

Example 1: Group people of similar sizes together to make "small", "medium" and "large" T-shirts.

Example 2: In marketing, segment customers according to their similarities.
How do we create these classes?

For this, we use the following concept: Clustering!!!

Basically, we want to "reveal" the organization of patterns into "sensible" clusters (groups).

Actually, clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day.
What is clustering for?

Data Reduction: We can use clustering to reduce the number of samples in each group, reducing the amount of information to process.
Example:
   In data transmission, a representative for each cluster is defined.
   Then, instead of transmitting the data samples, we transmit a code number corresponding to the representative.
   Thus, data compression is achieved.

Hypothesis Generation: In this case we apply cluster analysis to a data set in order to infer some hypotheses concerning the nature of the data.
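As a quick sketch of this data-transmission idea, assuming Euclidean distance and cluster representatives already computed (the quantize helper below is illustrative, not from the lecture):

import numpy as np

def quantize(X, representatives):
    """Replace each sample by the index (code number) of its nearest
    cluster representative, so only small integers need transmitting."""
    dists = np.linalg.norm(X[:, None, :] - representatives[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Transmit quantize(X, reps) plus the representatives themselves,
# instead of the full data matrix X: lossy compression via clustering.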
What is clustering for?

Hypothesis Testing: In this context, cluster analysis is used to verify the validity of a specific hypothesis.
Example: "Big Companies Invest Abroad"

Prediction based on groups:
   First: We apply cluster analysis to the available data set.
   Second: The resulting clusters are characterized based on the characteristics of the patterns by which they are formed.
   Third: Given an unknown pattern, we can determine the cluster to which it is most likely to belong.
Aspects of clustering

A clustering algorithm (there are many!):
   Partitional clustering.
   Hierarchical clustering.
   etc.

Based on a function: A distance (similarity, or dissimilarity) function.

Clustering quality:
   Inter-cluster distance → maximized.
   Intra-cluster distance → minimized.
K-means clustering

K-means is a partitional clustering algorithm.

Definition: Let the set of data points (or instances) D be {x_1, ..., x_n}, where x_i = (x_{i1}, ..., x_{ir})^T.
   The K-means algorithm partitions the given data into K clusters.
   Each cluster has a cluster center, called the centroid.
   K is specified by the user.
K-means algorithm

The K-means algorithm works as follows. Given K as the desired number of clusters:
1 Randomly choose K data points (seeds) to be the initial centroids (cluster centers) {v_1, ..., v_K}.
2 Assign each data point to the closest centroid:

$$c_i = \arg\min_{j} \operatorname{dist}(x_i, v_j)$$

3 Re-compute the centroids using the current cluster memberships:

$$v_j = \frac{\sum_{i=1}^{n} I(c_i = j)\, x_i}{\sum_{i=1}^{n} I(c_i = j)}$$

4 If a convergence criterion is not met, go to 2.
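A minimal runnable sketch of these four steps, assuming Euclidean distance and NumPy (the function name, the random seeding, and the stopping rule are illustrative choices, not fixed by the slides):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: X is an (n, r) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its members
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids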
What is the code trying to do?

It is trying to find a partition S. K-means tries to find a partition S that minimizes:

$$\min_{S} \sum_{k=1}^{K} \sum_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (1)$$

where \mu_k is the centroid of cluster C_k,

$$\mu_k = \frac{1}{N_k} \sum_{i:x_i \in C_k} x_i \quad (2)$$

and N_k is the number of samples in cluster C_k.
What stopping/convergence criterion should we use?

First: No (or minimal) re-assignment of data points to different clusters.

Second: No (or minimal) change of the centroids.

Third: Minimal decrease in the sum of squared errors (SSE),

$$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \operatorname{dist}(x, v_k)^2$$

where C_k is cluster k and v_k is the centroid of cluster C_k.
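The third criterion as code, reusing the labels and centroids from the earlier kmeans sketch (a hedged example; the helper name is mine):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of every point
    to the centroid of its own cluster, summed over all points."""
    diffs = X - centroids[labels]   # per-point offset to its centroid
    return float(np.sum(diffs ** 2))

# Stop iterating when sse(...) decreases by less than a small tolerance.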
The distance function

Actually, we have the following distance functions:

Euclidean:

$$\operatorname{dist}(x, y) = \|x - y\| = \sqrt{(x - y)^T (x - y)}$$

Manhattan:

$$\operatorname{dist}(x, y) = \|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i|$$

Mahalanobis:

$$\operatorname{dist}(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}$$
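The three metrics written out for 1-D NumPy vectors (a sketch; for Mahalanobis, A is typically the inverse covariance matrix, though the slides leave A unspecified):

import numpy as np

def euclidean(x, y):
    return np.sqrt((x - y) @ (x - y))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, A):
    # A must be positive semi-definite for the square root to be real.
    d = x - y
    return np.sqrt(d @ A @ d)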
An example

Dropping two possible centroids.

Calculate the memberships.

We re-calculate the centroids.

We re-calculate the memberships.

We re-calculate the centroids and keep going.
Strengths of K-means

Strengths:
   Simple: easy to understand and to implement.
   Efficient: time complexity is O(tKN), where N is the number of data points, K is the number of clusters, and t is the number of iterations. Since both K and t are usually small, K-means is considered a linear-time algorithm.

Popularity: K-means is the most popular clustering algorithm.

Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
Weaknesses of K-means

Important: The algorithm is only applicable when the mean is defined.
   For categorical data, use K-modes: the centroid is represented by the most frequent values.

In addition: The user needs to specify K.

Outliers: The algorithm is sensitive to outliers.
   Outliers are data points that are very far away from other data points.
   Outliers may be errors in the data recording or special data points with very different values.
Weaknesses of K-means: Problems with outliers

(Figure: a single outlier pulls a centroid away, contrasting the undesirable clusters produced with the ideal clusters.)
Weaknesses of K-means: How to deal with outliers

One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
   To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.

Another method is to perform random sampling.
   Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
   We then assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
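A minimal sketch of the first method; the z-score cutoff used to decide "much further away" is my own assumption, since the slides do not fix a threshold:

import numpy as np

def drop_outliers(X, labels, centroids, z=3.0):
    """Keep points whose distance to their own centroid is not far above
    typical: here, at most z standard deviations above the mean."""
    d = np.linalg.norm(X - centroids[labels], axis=1)
    keep = d <= d.mean() + z * d.std()
    return X[keep], labels[keep]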
Weaknesses of K-means (cont.)

The algorithm is sensitive to initial seeds (figure: different initial centroids lead to different final clusters).
Weaknesses of K-means: Different Densities

The data have three clusters; nevertheless K-means (with 3 clusters) fails to recover them (figure: original points vs. the K-means result).
Weaknesses of K-means: Non-globular Shapes

Here we notice that K-means can only detect globular shapes.
We change the criterion

Instead of using the K-means criterion, we use the K-center criterion.

New criterion: The K-center criterion partitions the points into K clusters so as to minimize the maximum distance of any point to its cluster center.
Explanation

Setup: Suppose we have a data set A that contains N objects.

We want to partition A into K sets labeled C_1, C_2, ..., C_K.

Now define a cluster size for any cluster C_k: the smallest value D for which all points in cluster C_k are
   within distance D of each other, or
   within distance D/2 of some point called the cluster center.
Another way

Another way to state the cluster size: D is the maximum pairwise distance between an arbitrary pair of points in the cluster, or twice the maximum distance between a data point and a chosen centroid.

Thus, denoting the cluster size of C_k by D_k, the cluster size of a partition S (the way the points are grouped) is

$$D = \max_{k=1,\ldots,K} D_k \quad (3)$$

In other words, the cluster size of a partition composed of multiple clusters is the maximum size of these clusters.
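Equation (3) under the pairwise-distance definition, as a short sketch (Euclidean distance and the function names are my assumptions):

import numpy as np

def cluster_size(points):
    """D_k: the maximum pairwise distance within one cluster."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return d.max()

def partition_size(X, labels):
    """D: the maximum cluster size over the whole partition, equation (3)."""
    return max(cluster_size(X[labels == k]) for k in np.unique(labels))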
Comparison with K-means

We use the following distance for comparison: in order to compare these methods, we use the Euclidean distance.

In K-means we assume the distance between vectors is the squared Euclidean distance. K-means tries to find a partition S that minimizes:

$$\min_{S} \sum_{k=1}^{K} \sum_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (4)$$

where \mu_k is the centroid of cluster C_k:

$$\mu_k = \frac{1}{N_k} \sum_{i:x_i \in C_k} x_i \quad (5)$$
Now, what about K-centers?

K-center, on the other hand, minimizes the worst-case distance to the centroid:

$$\min_{S} \max_{k=1,\ldots,K} \max_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (6)$$

where \mu_k is called the "centroid", but may not be the mean vector.

Properties:
   The objective function shows that, for each cluster, only the worst scenario matters: the data point farthest from the centroid.
   Moreover, among the clusters, only the worst cluster matters: the one whose farthest data point yields the maximum distance to its centroid, compared with the farthest data points of the other clusters.
In other words

First, this minimax type of problem is much harder to solve than the objective function of K-means.

Another formulation of K-center minimizes the worst-case pairwise distance instead of using centroids:

$$\min_{S} \max_{k=1,\ldots,K} \max_{i,j:x_i,x_j \in C_k} L(x_i, x_j) \quad (7)$$

where L(x_i, x_j) denotes any distance between a pair of objects in the same cluster.
Greedy Algorithm for K-center

Main idea: The idea behind the greedy algorithm is to choose a subset H of the original data set S consisting of K points that are farthest apart from each other.

Intuition: Since the points in H are far apart, the worst-case scenario has been taken care of, and hopefully the cluster size of the partition is small.

Thus, each point h_k ∈ H represents one cluster, i.e., one subset of points C_k.
Then

Something notable: We can think of each h_k as a centroid.

However, technically it is not a centroid, because it tends to lie on the boundary of a cluster; conceptually, though, we can think of it as one.

Important: The way we partition the points given the centroids is the same as in K-means, that is, the nearest-neighbor rule.
Specifically

For every point x_i, in order to see which cluster C_k it is partitioned into, we compute its distance to each cluster centroid and find out which centroid is the closest:

$$L(x_i, h_k) = \min_{k'=1,\ldots,K} L(x_i, h_{k'}) \quad (8)$$

Whichever centroid has the minimum distance is selected as the cluster for x_i.
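Equation (8) as a one-line assignment rule (a sketch assuming Euclidean distance plays the role of L):

import numpy as np

def assign(x, H):
    """Index of the closest centroid h_k in the (K, r) array H for point x."""
    return int(np.argmin(np.linalg.norm(H - x, axis=1)))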
Important

For K-center clustering we only need the pairwise distance L(x_i, x_j) for any x_i, x_j ∈ S.

Here x_i can be a non-vector representation of the objects.

As long as we can calculate L(x_i, x_j), the method applies, which makes K-center more general than K-means.
Properties

Something notable: The greedy algorithm achieves an approximation factor of 2 when the distance measure L satisfies the triangle inequality.

Thus, if the optimum is

$$D^* = \min_{S} \max_{k=1,\ldots,K} \max_{i,j:x_i,x_j \in C_k} L(x_i, x_j) \quad (9)$$

then we have the following guarantee for the greedy algorithm:

$$D \leq 2D^* \quad (10)$$
Nevertheless

K-center does not provide a locally optimal solution.

We can only get a solution guaranteed to be within a certain performance range of the theoretical optimal solution.
Setup

First: The set H denotes the set of cluster centroids, or cluster representative objects, {h_1, ..., h_K} ⊂ S.

Second: Let cluster(x_i) be the identity of the cluster that x_i ∈ S belongs to.

Third: The distance dist(x_i) is the distance between x_i and its closest cluster representative object (centroid):

$$\operatorname{dist}(x_i) = \min_{h_j \in H} L(x_i, h_j) \quad (11)$$
The main idea

Something notable: We always assign x_i to the closest centroid, therefore dist(x_i) is the minimum distance between x_i and any centroid.

The algorithm is iterative: we generate one cluster centroid first and then add the others one by one until we get K clusters. The set of centroids H starts with only a single centroid, h_1.
Algorithm

Step 1: Randomly select an object x_j from S; let h_1 = x_j and H = {h_1}.
   It does not matter how x_j is selected.

Step 2: For i = 1 to n:
   1 dist(x_i) = L(x_i, h_1).
   2 cluster(x_i) = 1.

In other words, dist(x_i) is the computed distance between x_i and h_1. Because so far we only have one cluster, we assign cluster label 1 to every x_i.
Algorithm
Step 3
For i = 2 to K:
  1. D = max_{x_j ∈ S − H} dist(x_j)
  2. Choose h_i ∈ S − H such that dist(h_i) == D
  3. H = H ∪ {h_i}
  4. for j = 1 to N
  5.   if L(x_j, h_i) ≤ dist(x_j)
  6.     dist(x_j) = L(x_j, h_i)
  7.     cluster(x_j) = i
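A runnable sketch of Steps 1-3 together, assuming Euclidean distance and centroids referenced by their index in the data set S; the function name and signature are illustrative, not from the slides:

import numpy as np

def greedy_k_center(S, K, seed=None):
    # Steps 1-3 of the greedy K-center algorithm as pseudocoded above.
    rng = np.random.default_rng(seed)
    n = len(S)
    # Step 1: randomly select the first centroid h_1.
    H = [int(rng.integers(n))]
    # Step 2: dist(x_i) = L(x_i, h_1); every point starts in cluster 1 (index 0).
    dist = np.array([np.linalg.norm(S[j] - S[H[0]]) for j in range(n)])
    cluster = np.zeros(n, dtype=int)
    # Step 3: repeatedly promote the worst point to a centroid.
    for i in range(1, K):
        h_i = int(np.argmax(dist))   # lines 1-2: D = max dist(x_j); pick h_i with dist(h_i) == D
        H.append(h_i)                # line 3: H = H ∪ {h_i}
        for j in range(n):           # lines 4-7: relax dist and cluster
            d = np.linalg.norm(S[j] - S[h_i])
            if d <= dist[j]:
                dist[j] = d
                cluster[j] = i
    return np.array(H), cluster, dist

For example, greedy_k_center(np.random.rand(200, 2), K=3, seed=0) returns the centroid indices, the cluster labels, and the final dist values. Note that points already chosen as centroids end up with dist 0, so argmax never re-selects them, and each of the K − 1 iterations scans the N points once.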
Thus
As the algorithm progresses
We gradually add more cluster centroids, beginning with 2 and continuing until we reach K. At each iteration, among all of the points not yet included in the set, we find a worst point:
  I Worst in the sense that this point has the maximum distance to its corresponding centroid.

This worst point
It is added to the set H. To stress it again, points already included in H are no longer under consideration.
Implementation
We can use the following for implementation
The disjoint-set data structure with the following operations:
  I MakeSet
  I Find
  I Union
  I Remove (iterating through the disjoint trees)

Of course, it is necessary to have extra fields for the necessary distances
Thus, each node-element of the sets must have a field dist(x_j) such that

dist(x_j) = L(x_j, Find(x_j))   (12)
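A minimal sketch of such a node-element, assuming points are NumPy vectors and Euclidean L; it covers MakeSet, Find, and a single-element Union, while Remove (walking the disjoint trees) is left out for brevity. The class and function names are illustrative:

import numpy as np

class Element:
    # One node-element x_j of the disjoint-set forest, with the extra
    # dist field required by equation (12).
    def __init__(self, point):
        self.point = point
        self.parent = self      # MakeSet: each element starts as its own root
        self.dist = 0.0

def find(e):
    # Find with path compression; the root plays the role of the centroid.
    if e.parent is not e:
        e.parent = find(e.parent)
    return e.parent

def attach(e, root):
    # Union of a single element into the tree rooted at a centroid,
    # maintaining the invariant dist(x_j) = L(x_j, Find(x_j)) of (12).
    e.parent = root
    e.dist = np.linalg.norm(e.point - root.point)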
Although
Other things need to be taken into consideration
I will leave them for you to think about.
Example
Running this algorithm yields better results on one of the data sets that is problematic for K-means.

[Figure: Original Points and the K-Centroid result (3 Clusters)]
The Running Time
We have that
The running time of the algorithm is O(KN), where K is the number of clusters generated and N is the size of the data set.

Why?
Because K-center only requires the pairwise distances between each point and the centroids: each iteration of Step 3 scans the N points once, and there are at most K iterations.
Main Bound of the K-Center Algorithm

Lemma
Let the distance measure L satisfy the triangle inequality. If S̃ is the partition obtained by the greedy algorithm and S* is the optimal partition, with cluster sizes D̃ and D* respectively, then

D̃ ≤ 2D*   (13)
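Before the proof, the bound is easy to check empirically, assuming the greedy_k_center sketch from earlier; on a tiny instance we can brute-force the optimal centroid subset (the discrete K-center optimum, with centroids drawn from S) and compare:

import itertools
import numpy as np

def cluster_size(S, H):
    # Size of the partition induced by centroids H: the maximum distance
    # from any point to its nearest centroid.
    return max(min(np.linalg.norm(x - S[h]) for h in H) for x in S)

rng = np.random.default_rng(0)
S = rng.random((25, 2))
K = 3
H_greedy, _, _ = greedy_k_center(S, K, seed=0)
D_greedy = cluster_size(S, list(H_greedy))
# Brute-force the optimal centroid subset (feasible only for tiny instances).
D_opt = min(cluster_size(S, H) for H in itertools.combinations(range(len(S)), K))
assert D_greedy <= 2 * D_opt   # the lemma's guarantee, equation (13)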
Proof

If we look only at the first j centroids
They generate a partition of S with size D_j, where:
  The cluster size of a partition is the size of the biggest cluster in it.
  The size of a cluster is the maximum distance between a point and the cluster centroid.

Thus

D_1 ≥ D_2 ≥ D_3 ≥ ...   (14)

Why?
Because D = max_{x_j ∈ S − H} dist(x_j), and lines 5-7 in Step 3 only ever decrease each dist(x_j) as new centroids are added.
Now
Now, we have that

∀i < j, L(h_i, h_j) ≥ D_{j−1}   (15)
Proof of the Previous Statement
First, did you notice the following?

D_{j−1} = L(h_{j−1}, h_j)   (16)

In addition, it is possible to see that

L(h_{j−2}, h_j) ≥ L(h_{j−1}, h_j)   (17)

How?
Assume it is not true. Then L(h_{j−2}, h_j) < L(h_{j−1}, h_j).
Then
We have that
h_{j−2} ∈ C̃_{j−2} and h_{j−2} ∈ C̃_j, which is a contradiction: h_{j−2} was generated by the algorithm as a centroid, so it cannot belong to any other cluster.

Thus, iteratively

L(h_1, h_j) ≥ L(h_2, h_j) ≥ L(h_3, h_j) ≥ ... ≥ L(h_{j−1}, h_j) = D_{j−1}   (18)
Not only that
∀j, ∃i < j, L(h_i, h_j) = D_{j−1}   (19)

Therefore
D_{j−1} is not only a lower bound for the distance between h_i and h_j; by (16) it is attained exactly, at i = j − 1.
Proof (continued)

Now
Let us consider the optimal partition S* with K clusters and minimum size D*.
  I Suppose the greedy algorithm generates the centroids H̃ = {h_1, h_2, ..., h_K}.
  I For the proof, we add one more centroid, h_{K+1}.
  I This can be supposed without loss of generality.

By the pigeonhole principle, at least two of the centroids among {h_1, h_2, ..., h_K, h_{K+1}} fall into the same cluster C*_k of the partition S*.

Thus
Let the two centroids be h_i and h_j with 1 ≤ i < j ≤ K + 1. Using the triangle inequality through the representative h*_k of that cluster:

L(h_i, h_j) ≤ L(h_i, h*_k) + L(h*_k, h_j) ≤ D* + D* = 2D*   (20)
Then
In addition
Since L(h_i, h_j) ≥ D_{j−1} ≥ D_K, it follows that D_K ≤ 2D*.

Now, with S̃ the partition generated by the greedy algorithm, define

∆ = max_{x_j ∈ S̃ − H̃} min_{h_k ∈ H̃} L(x_j, h_k)   (21)

Basically
∆ is the distance from the farthest non-centroid point to its nearest centroid in the partition generated by the greedy algorithm.
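As a sketch of equation (21), again assuming the conventions of the earlier code (points as NumPy vectors, centroids stored as indices into S); the function name is my own:

import numpy as np

def delta(S, H):
    # Equation (21): the largest distance from a non-centroid point
    # to its nearest centroid in H~.
    H_set = set(int(h) for h in H)
    return max(
        min(np.linalg.norm(S[j] - S[h]) for h in H)
        for j in range(len(S)) if j not in H_set
    )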
Now, we are ready for the proof
Let h_{K+1} be an element of S̃ − H̃
Such that

min_{h_k ∈ H̃} L(h_{K+1}, h_k) = ∆   (22)

By definition

L(h_{K+1}, h_k) ≥ ∆, ∀k = 1, ..., K   (23)

Thus, we have the following sets
Let H_k = {h_1, ..., h_k} for k = 1, 2, ..., K.
Now
Consider the distance between h_i and h_j for i < j ≤ K
According to the greedy algorithm,

min_{h_k ∈ H_{j−1}} L(h_j, h_k) ≥ min_{h_k ∈ H_{j−1}} L(x_l, h_k)

for any x_l ∈ S̃ − H_j.

Since
h_{K+1} ∈ S̃ − H̃ and S̃ − H̃ ⊂ S̃ − H_j.
We have
The following chain of inequalities

L(h_j, h_i) ≥ min_{h_k ∈ H_{j−1}} L(h_j, h_k)
           ≥ min_{h_k ∈ H_{j−1}} L(h_{K+1}, h_k)
           ≥ min_{h_k ∈ H̃} L(h_{K+1}, h_k)
           = ∆

We have shown that, for any i < j ≤ K + 1,

L(h_j, h_i) ≥ ∆   (24)
Now
Consider the optimal partition
S* = {C*_1, C*_2, ..., C*_K}; at least 2 of the K + 1 elements h_1, h_2, ..., h_{K+1} will be covered by one of its clusters.

Assume that
h_i and h_j belong to the same cluster in S*. Then L(h_i, h_j) ≤ D*.

In addition
Since we also have L(h_i, h_j) ≥ ∆, it follows that ∆ ≤ D*.
In addition
Consider elements x_m and x_n in any cluster represented by h_k

L(x_m, h_k) ≤ ∆ and L(x_n, h_k) ≤ ∆   (25)

By the triangle inequality

L(x_m, x_n) ≤ L(x_m, h_k) + L(x_n, h_k) ≤ 2∆   (26)

Finally, because this holds for any pair x_m and x_n, and D̃ is realized by some pair, D̃ = L(x'_m, x'_n), we conclude

D̃ = max_k D_k ≤ 2∆ ≤ 2D*   (27)