25 machine learning unsupervised learning k-means k-centers
TRANSCRIPT
Machine Learning for Data Mining: Unsupervised Learning - Clustering
Andres Mendez-Vazquez
July 22, 2015
Outline

1 Supervised Learning vs. Unsupervised Learning
   Supervised vs Unsupervised
   Clustering
   Pattern Recognition
   Why Clustering?

2 K-Means Clustering
   K-Means Clustering
   Convergence Criterion
   The Distance Function
   Example
   Properties of K-Means

3 The K-Center Criterion Clustering
   Introduction
   Comparison with K-means
   The Greedy K-Center Algorithm
   Pseudo-Code
   K-Center Algorithm Properties
   K-Center Algorithm Proof of Correctness
Supervised Learning vs. Unsupervised Learning

Supervised learning: Discover patterns in the data that relate data attributes to a target (class) attribute.

Unsupervised learning: The data have no target attribute.
Clustering

Clustering is a technique for finding similarity groups in data, called clusters.

It is called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.

Due to historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
Pattern Recognition

Definition: the search for structure in data.

Elements of Numerical Pattern Recognition:
1 Process Description: Feature Nomination, Test Data, Design Data
2 Feature Analysis: Preprocessing, Extraction, Selection, ...
3 Cluster Analysis: Labeling, Validity, ...
4 Classifier Design: Classification, Estimation, Prediction, Control, ...
An illustration

The data set has three natural groups of data points, i.e., 3 natural clusters.
Examples

Example 1: Group people of similar sizes together to make "small", "medium" and "large" T-shirts.

Example 2: In marketing, segment customers according to their similarities.
How do we create these classes?

For this, we use the following concept: Clustering!!!

Basically, we want to "reveal" the organization of patterns into "sensible" clusters (groups).

Actually, clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day.
What is clustering for?

Data Reduction: We can use clustering to reduce the number of samples in each group, reducing the amount of information to process.
Example:
   In data transmission, a representative for each cluster is defined.
   Then, instead of transmitting the data samples, we transmit a code number corresponding to the representative.
   Thus, data compression is achieved.

Hypothesis Generation: In this case we apply cluster analysis to a data set in order to infer some hypotheses concerning the nature of the data.
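As a quick sketch of this data-transmission idea, assuming Euclidean distance and cluster representatives already computed (the quantize helper below is illustrative, not from the lecture):

import numpy as np

def quantize(X, representatives):
    """Replace each sample by the index (code number) of its nearest
    cluster representative, so only small integers need transmitting."""
    dists = np.linalg.norm(X[:, None, :] - representatives[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Transmit quantize(X, reps) plus the representatives themselves,
# instead of the full data matrix X: lossy compression via clustering.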
What is clustering for?

Hypothesis Testing: In this context, cluster analysis is used to verify the validity of a specific hypothesis.
Example: "Big Companies Invest Abroad"

Prediction based on groups:
   First: We apply cluster analysis to the available data set.
   Second: The resulting clusters are characterized based on the characteristics of the patterns by which they are formed.
   Third: Given an unknown pattern, we can determine the cluster to which it is most likely to belong.
Aspects of clustering

A clustering algorithm (there are many!):
   Partitional clustering.
   Hierarchical clustering.
   etc.

Based on a function: A distance (similarity, or dissimilarity) function.

Clustering quality:
   Inter-cluster distance → maximized.
   Intra-cluster distance → minimized.
K-means clustering

K-means is a partitional clustering algorithm.

Definition: Let the set of data points (or instances) D be {x_1, ..., x_n}, where x_i = (x_{i1}, ..., x_{ir})^T.
   The K-means algorithm partitions the given data into K clusters.
   Each cluster has a cluster center, called the centroid.
   K is specified by the user.
K-means algorithm

The K-means algorithm works as follows. Given K as the desired number of clusters:
1 Randomly choose K data points (seeds) to be the initial centroids (cluster centers) {v_1, ..., v_K}.
2 Assign each data point to the closest centroid:

$$c_i = \arg\min_{j} \operatorname{dist}(x_i, v_j)$$

3 Re-compute the centroids using the current cluster memberships:

$$v_j = \frac{\sum_{i=1}^{n} I(c_i = j)\, x_i}{\sum_{i=1}^{n} I(c_i = j)}$$

4 If a convergence criterion is not met, go to 2.
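A minimal runnable sketch of these four steps, assuming Euclidean distance and NumPy (the function name, the random seeding, and the stopping rule are illustrative choices, not fixed by the slides):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: X is an (n, r) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its members
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids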
What is the code trying to do?

It is trying to find a partition S. K-means tries to find a partition S that minimizes:

$$\min_{S} \sum_{k=1}^{K} \sum_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (1)$$

where \mu_k is the centroid of cluster C_k,

$$\mu_k = \frac{1}{N_k} \sum_{i:x_i \in C_k} x_i \quad (2)$$

and N_k is the number of samples in cluster C_k.
What stopping/convergence criterion should we use?

First: No (or minimal) re-assignment of data points to different clusters.

Second: No (or minimal) change of the centroids.

Third: Minimal decrease in the sum of squared errors (SSE),

$$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \operatorname{dist}(x, v_k)^2$$

where C_k is cluster k and v_k is the centroid of cluster C_k.
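The third criterion as code, reusing the labels and centroids from the earlier kmeans sketch (a hedged example; the helper name is mine):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of every point
    to the centroid of its own cluster, summed over all points."""
    diffs = X - centroids[labels]   # per-point offset to its centroid
    return float(np.sum(diffs ** 2))

# Stop iterating when sse(...) decreases by less than a small tolerance.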
The distance function

Actually, we have the following distance functions:

Euclidean:

$$\operatorname{dist}(x, y) = \|x - y\| = \sqrt{(x - y)^T (x - y)}$$

Manhattan:

$$\operatorname{dist}(x, y) = \|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i|$$

Mahalanobis:

$$\operatorname{dist}(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}$$
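The three metrics written out for 1-D NumPy vectors (a sketch; for Mahalanobis, A is typically the inverse covariance matrix, though the slides leave A unspecified):

import numpy as np

def euclidean(x, y):
    return np.sqrt((x - y) @ (x - y))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, A):
    # A must be positive semi-definite for the square root to be real.
    d = x - y
    return np.sqrt(d @ A @ d)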
An example

Dropping two possible centroids.

Calculate the memberships.

We re-calculate the centroids.

We re-calculate the memberships.

We re-calculate the centroids and keep going.
Strengths of K-means

Strengths:
   Simple: easy to understand and to implement.
   Efficient: time complexity is O(tKN), where N is the number of data points, K is the number of clusters, and t is the number of iterations. Since both K and t are usually small, K-means is considered a linear-time algorithm.

Popularity: K-means is the most popular clustering algorithm.

Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
Weaknesses of K-means

Important: The algorithm is only applicable when the mean is defined.
   For categorical data, use K-modes: the centroid is represented by the most frequent values.

In addition: The user needs to specify K.

Outliers: The algorithm is sensitive to outliers.
   Outliers are data points that are very far away from other data points.
   Outliers may be errors in the data recording or special data points with very different values.
Weaknesses of K-means: Problems with outliers

(Figure: a single outlier pulls a centroid away, contrasting the undesirable clusters produced with the ideal clusters.)
Weaknesses of K-means: How to deal with outliers

One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
   To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.

Another method is to perform random sampling.
   Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
   We then assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
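A minimal sketch of the first method; the z-score cutoff used to decide "much further away" is my own assumption, since the slides do not fix a threshold:

import numpy as np

def drop_outliers(X, labels, centroids, z=3.0):
    """Keep points whose distance to their own centroid is not far above
    typical: here, at most z standard deviations above the mean."""
    d = np.linalg.norm(X - centroids[labels], axis=1)
    keep = d <= d.mean() + z * d.std()
    return X[keep], labels[keep]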
Weaknesses of K-means (cont.)

The algorithm is sensitive to initial seeds (figure: different initial centroids lead to different final clusters).
Weaknesses of K-means: Different Densities

The data have three clusters; nevertheless K-means (with 3 clusters) fails to recover them (figure: original points vs. the K-means result).
Weaknesses of K-means: Non-globular Shapes

Here we notice that K-means can only detect globular shapes.
We change the criterion

Instead of using the K-means criterion, we use the K-center criterion.

New criterion: The K-center criterion partitions the points into K clusters so as to minimize the maximum distance of any point to its cluster center.
Explanation

Setup: Suppose we have a data set A that contains N objects.

We want to partition A into K sets labeled C_1, C_2, ..., C_K.

Now define a cluster size for any cluster C_k: the smallest value D for which all points in cluster C_k are
   within distance D of each other, or
   within distance D/2 of some point called the cluster center.
Another way

Another way to state the cluster size: D is the maximum pairwise distance between an arbitrary pair of points in the cluster, or twice the maximum distance between a data point and a chosen centroid.

Thus, denoting the cluster size of C_k by D_k, the cluster size of a partition S (the way the points are grouped) is

$$D = \max_{k=1,\ldots,K} D_k \quad (3)$$

In other words, the cluster size of a partition composed of multiple clusters is the maximum size of these clusters.
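Equation (3) under the pairwise-distance definition, as a short sketch (Euclidean distance and the function names are my assumptions):

import numpy as np

def cluster_size(points):
    """D_k: the maximum pairwise distance within one cluster."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return d.max()

def partition_size(X, labels):
    """D: the maximum cluster size over the whole partition, equation (3)."""
    return max(cluster_size(X[labels == k]) for k in np.unique(labels))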
Comparison with K-means

We use the following distance for comparison: in order to compare these methods, we use the Euclidean distance.

In K-means we assume the distance between vectors is the squared Euclidean distance. K-means tries to find a partition S that minimizes:

$$\min_{S} \sum_{k=1}^{K} \sum_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (4)$$

where \mu_k is the centroid of cluster C_k:

$$\mu_k = \frac{1}{N_k} \sum_{i:x_i \in C_k} x_i \quad (5)$$
Now, what about K-centers?

K-center, on the other hand, minimizes the worst-case distance to the centroid:

$$\min_{S} \max_{k=1,\ldots,K} \max_{i:x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k) \quad (6)$$

where \mu_k is called the "centroid", but may not be the mean vector.

Properties:
   The objective function shows that, for each cluster, only the worst scenario matters: the data point farthest from the centroid.
   Moreover, among the clusters, only the worst cluster matters: the one whose farthest data point yields the maximum distance to its centroid, compared with the farthest data points of the other clusters.
In other words

First, this minimax type of problem is much harder to solve than the objective function of K-means.

Another formulation of K-center minimizes the worst-case pairwise distance instead of using centroids:

$$\min_{S} \max_{k=1,\ldots,K} \max_{i,j:x_i,x_j \in C_k} L(x_i, x_j) \quad (7)$$

where L(x_i, x_j) denotes any distance between a pair of objects in the same cluster.
Greedy Algorithm for K-center

Main idea: The idea behind the greedy algorithm is to choose a subset H of the original data set S consisting of K points that are farthest apart from each other.

Intuition: Since the points in H are far apart, the worst-case scenario has been taken care of, and hopefully the cluster size of the partition is small.

Thus, each point h_k ∈ H represents one cluster, i.e., one subset of points C_k.
Then

Something notable: We can think of each h_k as a centroid.

However, technically it is not a centroid, because it tends to lie on the boundary of a cluster; conceptually, though, we can think of it as one.

Important: The way we partition the points given the centroids is the same as in K-means, that is, the nearest-neighbor rule.
Specifically

For every point x_i, in order to see which cluster C_k it is partitioned into, we compute its distance to each cluster centroid and find out which centroid is the closest:

$$L(x_i, h_k) = \min_{k'=1,\ldots,K} L(x_i, h_{k'}) \quad (8)$$

Whichever centroid has the minimum distance is selected as the cluster for x_i.
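Equation (8) as a one-line assignment rule (a sketch assuming Euclidean distance plays the role of L):

import numpy as np

def assign(x, H):
    """Index of the closest centroid h_k in the (K, r) array H for point x."""
    return int(np.argmin(np.linalg.norm(H - x, axis=1)))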
Important

For K-center clustering we only need the pairwise distance L(x_i, x_j) for any x_i, x_j ∈ S.

Here x_i can be a non-vector representation of the objects.

As long as we can calculate L(x_i, x_j), the method applies, which makes K-center more general than K-means.
Properties

Something notable: The greedy algorithm achieves an approximation factor of 2 when the distance measure L satisfies the triangle inequality.

Thus, if the optimum is

$$D^* = \min_{S} \max_{k=1,\ldots,K} \max_{i,j:x_i,x_j \in C_k} L(x_i, x_j) \quad (9)$$

then we have the following guarantee for the greedy algorithm:

$$D \leq 2D^* \quad (10)$$
Nevertheless

K-center does not provide a locally optimal solution.

We can only get a solution guaranteed to be within a certain performance range of the theoretical optimal solution.
Setup

First: The set H denotes the set of cluster centroids, or cluster representative objects, {h_1, ..., h_K} ⊂ S.

Second: Let cluster(x_i) be the identity of the cluster that x_i ∈ S belongs to.

Third: The distance dist(x_i) is the distance between x_i and its closest cluster representative object (centroid):

$$\operatorname{dist}(x_i) = \min_{h_j \in H} L(x_i, h_j) \quad (11)$$
The main idea

Something notable: We always assign x_i to the closest centroid, therefore dist(x_i) is the minimum distance between x_i and any centroid.

The algorithm is iterative: we generate one cluster centroid first and then add the others one by one until we get K clusters. The set of centroids H starts with only a single centroid, h_1.
Algorithm

Step 1: Randomly select an object x_j from S; let h_1 = x_j and H = {h_1}.
   It does not matter how x_j is selected.

Step 2: For i = 1 to n:
   1 dist(x_i) = L(x_i, h_1).
   2 cluster(x_i) = 1.

In other words, dist(x_i) is the computed distance between x_i and h_1. Because so far we only have one cluster, we assign cluster label 1 to every x_i.
Algorithm
Step 3
For i = 2 to K:
  1. D = max_{x_j ∈ S − H} dist(x_j)
  2. Choose h_i ∈ S − H such that dist(h_i) == D
  3. H = H ∪ {h_i}
  4. for j = 1 to N
  5.   if L(x_j, h_i) ≤ dist(x_j)
  6.     dist(x_j) = L(x_j, h_i)
  7.     cluster(x_j) = i
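A runnable sketch of Steps 1-3 together, assuming Euclidean distance and centroids referenced by their index in the data set S; the function name and signature are illustrative, not from the slides:

import numpy as np

def greedy_k_center(S, K, seed=None):
    # Steps 1-3 of the greedy K-center algorithm as pseudocoded above.
    rng = np.random.default_rng(seed)
    n = len(S)
    # Step 1: randomly select the first centroid h_1.
    H = [int(rng.integers(n))]
    # Step 2: dist(x_i) = L(x_i, h_1); every point starts in cluster 1 (index 0).
    dist = np.array([np.linalg.norm(S[j] - S[H[0]]) for j in range(n)])
    cluster = np.zeros(n, dtype=int)
    # Step 3: repeatedly promote the worst point to a centroid.
    for i in range(1, K):
        h_i = int(np.argmax(dist))   # lines 1-2: D = max dist(x_j); pick h_i with dist(h_i) == D
        H.append(h_i)                # line 3: H = H ∪ {h_i}
        for j in range(n):           # lines 4-7: relax dist and cluster
            d = np.linalg.norm(S[j] - S[h_i])
            if d <= dist[j]:
                dist[j] = d
                cluster[j] = i
    return np.array(H), cluster, dist

For example, greedy_k_center(np.random.rand(200, 2), K=3, seed=0) returns the centroid indices, the cluster labels, and the final dist values. Note that points already chosen as centroids end up with dist 0, so argmax never re-selects them, and each of the K − 1 iterations scans the N points once.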
Thus
As the algorithm progresses
We gradually add more cluster centroids, beginning with 2 and continuing until we reach K. At each iteration, among all of the points not yet included in the set, we find a worst point:
  I Worst in the sense that this point has the maximum distance to its corresponding centroid.

This worst point
It is added to the set H. To stress it again, points already included in H are no longer under consideration.
Implementation
We can use the following for implementation
The disjoint-set data structure with the following operations:
  I MakeSet
  I Find
  I Union
  I Remove (iterating through the disjoint trees)

Of course, it is necessary to have extra fields for the necessary distances
Thus, each node-element of the sets must have a field dist(x_j) such that

dist(x_j) = L(x_j, Find(x_j))   (12)
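A minimal sketch of such a node-element, assuming points are NumPy vectors and Euclidean L; it covers MakeSet, Find, and a single-element Union, while Remove (walking the disjoint trees) is left out for brevity. The class and function names are illustrative:

import numpy as np

class Element:
    # One node-element x_j of the disjoint-set forest, with the extra
    # dist field required by equation (12).
    def __init__(self, point):
        self.point = point
        self.parent = self      # MakeSet: each element starts as its own root
        self.dist = 0.0

def find(e):
    # Find with path compression; the root plays the role of the centroid.
    if e.parent is not e:
        e.parent = find(e.parent)
    return e.parent

def attach(e, root):
    # Union of a single element into the tree rooted at a centroid,
    # maintaining the invariant dist(x_j) = L(x_j, Find(x_j)) of (12).
    e.parent = root
    e.dist = np.linalg.norm(e.point - root.point)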
Although
Other things need to be taken into consideration
I will leave them for you to think about.
Example
Running this algorithm yields better results on one of the data sets that is problematic for K-means.

[Figure: Original Points and the K-Centroid result (3 Clusters)]
The Running Time
We have that
The running time of the algorithm is O(KN), where K is the number of clusters generated and N is the size of the data set.

Why?
Because K-center only requires the pairwise distances between each point and the centroids: each iteration of Step 3 scans the N points once, and there are at most K iterations.
Main Bound of the K-Center Algorithm

Lemma
Let the distance measure L satisfy the triangle inequality. If S̃ is the partition obtained by the greedy algorithm and S* is the optimal partition, with cluster sizes D̃ and D* respectively, then

D̃ ≤ 2D*   (13)
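Before the proof, the bound is easy to check empirically, assuming the greedy_k_center sketch from earlier; on a tiny instance we can brute-force the optimal centroid subset (the discrete K-center optimum, with centroids drawn from S) and compare:

import itertools
import numpy as np

def cluster_size(S, H):
    # Size of the partition induced by centroids H: the maximum distance
    # from any point to its nearest centroid.
    return max(min(np.linalg.norm(x - S[h]) for h in H) for x in S)

rng = np.random.default_rng(0)
S = rng.random((25, 2))
K = 3
H_greedy, _, _ = greedy_k_center(S, K, seed=0)
D_greedy = cluster_size(S, list(H_greedy))
# Brute-force the optimal centroid subset (feasible only for tiny instances).
D_opt = min(cluster_size(S, H) for H in itertools.combinations(range(len(S)), K))
assert D_greedy <= 2 * D_opt   # the lemma's guarantee, equation (13)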
Proof

If we look only at the first j centroids
They generate a partition of S with size D_j, where:
  The cluster size of a partition is the size of the biggest cluster in it.
  The size of a cluster is the maximum distance between a point and the cluster centroid.

Thus

D_1 ≥ D_2 ≥ D_3 ≥ ...   (14)

Why?
Because D = max_{x_j ∈ S − H} dist(x_j), and lines 5-7 in Step 3 only ever decrease each dist(x_j) as new centroids are added.
Now
Now, we have that

∀i < j, L(h_i, h_j) ≥ D_{j−1}   (15)
Proof of the Previous Statement
First, did you notice the following?

D_{j−1} = L(h_{j−1}, h_j)   (16)

In addition, it is possible to see that

L(h_{j−2}, h_j) ≥ L(h_{j−1}, h_j)   (17)

How?
Assume it is not true. Then L(h_{j−2}, h_j) < L(h_{j−1}, h_j).
Then
We have that
h_{j−2} ∈ C̃_{j−2} and h_{j−2} ∈ C̃_j, which is a contradiction: h_{j−2} was generated by the algorithm as a centroid, so it cannot belong to any other cluster.

Thus, iteratively

L(h_1, h_j) ≥ L(h_2, h_j) ≥ L(h_3, h_j) ≥ ... ≥ L(h_{j−1}, h_j) = D_{j−1}   (18)
Not only that
∀j, ∃i < j, L(h_i, h_j) = D_{j−1}   (19)

Therefore
D_{j−1} is not only a lower bound for the distance between h_i and h_j; by (16) it is attained exactly, at i = j − 1.
Proof (continued)

Now
Let us consider the optimal partition S* with K clusters and minimum size D*.
  I Suppose the greedy algorithm generates the centroids H̃ = {h_1, h_2, ..., h_K}.
  I For the proof, we add one more centroid, h_{K+1}.
  I This can be supposed without loss of generality.

By the pigeonhole principle, at least two of the centroids among {h_1, h_2, ..., h_K, h_{K+1}} fall into the same cluster C*_k of the partition S*.

Thus
Let the two centroids be h_i and h_j with 1 ≤ i < j ≤ K + 1. Using the triangle inequality through the representative h*_k of that cluster:

L(h_i, h_j) ≤ L(h_i, h*_k) + L(h*_k, h_j) ≤ D* + D* = 2D*   (20)
Then
In addition
Since L(h_i, h_j) ≥ D_{j−1} ≥ D_K, it follows that D_K ≤ 2D*.

Now, with S̃ the partition generated by the greedy algorithm, define

∆ = max_{x_j ∈ S̃ − H̃} min_{h_k ∈ H̃} L(x_j, h_k)   (21)

Basically
∆ is the distance from the farthest non-centroid point to its nearest centroid in the partition generated by the greedy algorithm.
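As a sketch of equation (21), again assuming the conventions of the earlier code (points as NumPy vectors, centroids stored as indices into S); the function name is my own:

import numpy as np

def delta(S, H):
    # Equation (21): the largest distance from a non-centroid point
    # to its nearest centroid in H~.
    H_set = set(int(h) for h in H)
    return max(
        min(np.linalg.norm(S[j] - S[h]) for h in H)
        for j in range(len(S)) if j not in H_set
    )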
Now, we are ready for the proof
Let h_{K+1} be an element of S̃ − H̃
Such that

min_{h_k ∈ H̃} L(h_{K+1}, h_k) = ∆   (22)

By definition

L(h_{K+1}, h_k) ≥ ∆, ∀k = 1, ..., K   (23)

Thus, we have the following sets
Let H_k = {h_1, ..., h_k} for k = 1, 2, ..., K.
Now
Consider the distance between h_i and h_j for i < j ≤ K
According to the greedy algorithm,

min_{h_k ∈ H_{j−1}} L(h_j, h_k) ≥ min_{h_k ∈ H_{j−1}} L(x_l, h_k)

for any x_l ∈ S̃ − H_j.

Since
h_{K+1} ∈ S̃ − H̃ and S̃ − H̃ ⊂ S̃ − H_j.
We have
The following chain of inequalities

L(h_j, h_i) ≥ min_{h_k ∈ H_{j−1}} L(h_j, h_k)
           ≥ min_{h_k ∈ H_{j−1}} L(h_{K+1}, h_k)
           ≥ min_{h_k ∈ H̃} L(h_{K+1}, h_k)
           = ∆

We have shown that, for any i < j ≤ K + 1,

L(h_j, h_i) ≥ ∆   (24)
Now
Consider the optimal partition
S* = {C*_1, C*_2, ..., C*_K}; at least 2 of the K + 1 elements h_1, h_2, ..., h_{K+1} will be covered by one of its clusters.

Assume that
h_i and h_j belong to the same cluster in S*. Then L(h_i, h_j) ≤ D*.

In addition
Since we also have L(h_i, h_j) ≥ ∆, it follows that ∆ ≤ D*.
In addition
Consider elements x_m and x_n in any cluster represented by h_k

L(x_m, h_k) ≤ ∆ and L(x_n, h_k) ≤ ∆   (25)

By the triangle inequality

L(x_m, x_n) ≤ L(x_m, h_k) + L(x_n, h_k) ≤ 2∆   (26)

Finally, because this holds for any pair x_m and x_n, and D̃ is realized by some pair, D̃ = L(x'_m, x'_n), we conclude

D̃ = max_k D_k ≤ 2∆ ≤ 2D*   (27)