AMCS/CS 340: Data Mining
Clustering
Xiangliang Zhang
King Abdullah University of Science and Technology
Grouping fruits
Grouping apple with apple, orange with orange, and banana with banana
Give pictures to a computer
Change pictures to data
Change pictures to data
[Figure: each fruit picture becomes a data point x1, x2, x3, …, xn]
Use clustering methods
Input: x1, x2, x3, x4, x5, x6, x7, x8, x9, …
→ Clustering Method →
Output: clustering indicator 1 2 1 2 3 2 3 1 3 …
Correct?
The same input and clustering method as above; is the output indicator correct?
Cluster Analysis
1. What is Cluster Analysis?
2. Partitioning Methods
3. Hierarchical Methods
4. Density-Based Methods
5. Grid-Based Methods
6. Model-Based Methods
7. Clustering High-Dimensional Data
8. How to decide the number of clusters?
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups:
• Inter-cluster distances are maximized
• Intra-cluster distances are minimized
Unsupervised learning: no predefined classes
Applications of Cluster Analysis
Understanding
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
• E.g., group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization
• Reduce the size of large data sets
• Image segmentation/compression
• Preserve privacy (e.g., in medical data)
Clustering
A clustering is a set of clusters.
The notion of a cluster can be ambiguous: the same set of data points can be grouped into, for example, two, four, or six clusters.
[Figure: one set of data points shown with two-, four-, and six-cluster groupings]
Quality: What is Good Clustering?
• A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
• The quality of a clustering result depends on both the implementation of a method and the similarity measure used by this method
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
What kinds of similarity measures?
What kinds of methods?
Similarity measure
The similarity measure depends on the characteristics
of the input data
• Attribute type: binary, categorical, continuous
• Sparseness
• Dimensionality
• Type of proximity
[Figure: examples of proximity types — center-based vs. density-based clusters]
Data Structures
Data matrix: n instances, p attributes (features)

$$\begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{pmatrix}$$

Distance matrix (dissimilarity matrix):

$$\begin{pmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots \; 0
\end{pmatrix}$$

• Minkowski distance:
$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$
  If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
  If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
• Cosine measure
• Correlation coefficient
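To make these measures concrete, here is a minimal Python/NumPy sketch; the function names and toy vectors are illustrative, not from the lecture:

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance: d(i,j) = (sum_f |x_if - x_jf|^q)^(1/q)."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

def cosine(xi, xj):
    """Cosine measure: the cosine of the angle between xi and xj."""
    return np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1.0, 0.0, 2.0])
xj = np.array([2.0, 1.0, 0.0])
print(minkowski(xi, xj, q=1))  # Manhattan distance: 4.0
print(minkowski(xi, xj, q=2))  # Euclidean distance: ~2.449
print(cosine(xi, xj))          # cosine measure: 0.4
# The correlation coefficient can be obtained with np.corrcoef(xi, xj)[0, 1].
```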
Types of Clustering methods
Partitioning Clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Typical methods: k-means, k-medoids, CLARANS
[Figure: A Partitioning Clustering]
Types of Clustering methods
Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
• Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
[Figure: nested clusters of points p1–p4 and the corresponding dendrogram]
Types of Clustering methods
Density-based Clustering
• Based on connectivity and density functions
• A cluster is a dense region of points, separated from other regions of high density by low-density regions
• Typical methods: DBSCAN, OPTICS, DenClue
[Figure: 6 density-based clusters]
Types of Clustering methods
Grid-based Clustering
• Based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
Types of Clustering methods
Model-based clustering
• A model is hypothesized for each of the clusters, and the method finds the best fit of the data to the given model
• Typical methods: EM, SOM, COBWEB
Cluster Analysis
1. What is Cluster Analysis?
2. Partitioning Methods
3. Hierarchical Methods
4. Density-Based Methods
5. Grid-Based Methods
6. Model-Based Methods
7. Clustering High-Dimensional Data
8. How to decide the number of clusters?
Partitioning Algorithms: Basic Concept
Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

$$E = \sum_{m=1}^{k} \sum_{x \in C_m} (x - \mu_m)^2, \qquad \mu_m = \frac{1}{|C_m|} \sum_{x \in C_m} x$$

where $\mu_m$ is the averaged center of cluster $C_m$.

[Figure: a partitioning clustering with k = 3; clusters C1, C2, C3 with centers $\mu_i$]

• NP-hard when k is a part of the input (even for 2-dimensional data)*
• Given a k, finding a partition of k clusters that optimizes the sum of squared distances takes $O(n^{dk})$ time#
• Heuristic method: k-means (also called Lloyd's method [Llo82])

* Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. (2009). "The Planar k-Means Problem is NP-Hard". Lecture Notes in Computer Science 5431: 274–285.
# Inaba; Katoh; Imai (1994). "Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering". Proceedings of the 10th ACM Symposium on Computational Geometry.
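As a worked illustration of this objective, a short NumPy sketch (variable names are mine) that evaluates E for a given assignment of points to clusters:

```python
import numpy as np

def ssd(X, labels, k):
    """E = sum over clusters m of sum over x in C_m of ||x - mu_m||^2,
    where mu_m = (1/|C_m|) * sum of the points in C_m (the averaged center)."""
    total = 0.0
    for m in range(k):
        Cm = X[labels == m]            # points assigned to cluster m
        if len(Cm) == 0:
            continue
        mu = Cm.mean(axis=0)           # averaged center of the cluster
        total += np.sum((Cm - mu) ** 2)
    return total

X = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
print(ssd(X, np.array([0, 0, 1, 1]), k=2))  # -> 1.0
```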
Partitioning Algorithms: Basic Concept
Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

$$E = \sum_{m=1}^{k} \sum_{x \in C_m} (x - o_m)^2, \qquad o_m \in \{x \mid x \in C_m\}$$

where the center $o_m$ is an actual object of cluster $C_m$ (a medoid).

[Figure: a partitioning clustering with k = 3; clusters C1, C2, C3 with medoids O]

• Global optimum: exhaustively enumerate all partitions
• Heuristic methods: k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87)
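Because the medoid must be an actual object of the cluster, it can be computed from pairwise distances alone. A minimal sketch (the function name and toy data are mine):

```python
import numpy as np

def medoid(D, members):
    """Most centrally located object of a cluster: the member that
    minimizes its total distance to the other members.
    D is a precomputed n x n distance matrix, members a list of indices."""
    costs = [D[i, members].sum() for i in members]
    return members[int(np.argmin(costs))]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.4]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # Euclidean distances
print(medoid(D, [0, 1, 2, 3, 4]))  # -> 4, the central point is the medoid
```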
Partitioning Algorithms
• k-means Algorithm
  Issue of initial centroids, clustering evaluation
  Limitations of k-means
• k-medoids
k-means clustering
• Number of clusters, K, must be specified
• Each cluster is associated with an averaged point (centroid)
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple
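The transcript does not reproduce the algorithm itself, so the following is a minimal NumPy sketch of the basic (Lloyd's) iteration; initialization and empty-cluster handling are deliberately simplified:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k initial objects
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array(
            [X[labels == m].mean(axis=0) if np.any(labels == m) else centroids[m]
             for m in range(k)])
        if np.allclose(new_centroids, centroids):  # no change: converged
            break
        centroids = new_centroids
    return labels, centroids
```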
k-means clustering
• Example (K = 2):

[Figure: five scatter plots showing the k-means iterations]

1. Arbitrarily choose K objects as initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign objects and update the means again; repeat until assignments no longer change
K-means Clustering – Details
• Initial centroids are often chosen randomly. Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations. Often the stopping condition is relaxed to 'until relatively few points change clusters'.
• Complexity is O(n * K * t * d), where n = number of points, K = number of clusters, t = number of iterations, d = number of attributes.
Partitioning Algorithms
• k-means Algorithm
  Issue of initial centroids, clustering evaluation
  Limitations of k-means
• k-medoids
Importance of Choosing Initial Centroid
Clusters produced vary from one run to another.
[Figure: one data set ("Original Points") and two runs of k-means ("Run 1", "Run 2") that converge, after five iterations, to different clusterings]
Evaluating K-means Clustering
The most common measure is the Sum of Squared Error (SSE):

$$SSE(C, k) = \sum_{m=1}^{k} \sum_{x \in C_m} (x - \mu_m)^2$$

• For each point, the error is the distance to the nearest centroid (the error of representing each point by its nearest centroid)
• Given two clustering results, we can choose the one with the smaller error
• The SSE of the optimal clustering result decreases as K, the number of clusters, increases
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
Note: if $SSE(C_1, k_1) < SSE(C_2, k_2)$ and $k_1 \le k_2$, then $C_1$ is better than $C_2$.
Importance of Choosing Initial Centroid
Clusters produced vary from one run to another.
[Figure: the same data and the two runs shown above]
SSE(Run 1) < SSE(Run 2): Run 1 gives the better clustering.
Solutions to Initial Centroids Problem
• Multiple runs: select the one with the smallest SSE (sketched below)
• Sample and use hierarchical clustering to determine initial centroids
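A sketch of the first remedy, reusing the kmeans() and ssd() sketches above (the helper name is mine):

```python
def kmeans_multiple_runs(X, k, runs=10):
    """Run k-means from several random initializations and keep
    the clustering with the smallest SSE."""
    best_sse, best = float("inf"), None
    for seed in range(runs):
        labels, centroids = kmeans(X, k, seed=seed)
        sse = ssd(X, labels, k)
        if sse < best_sse:
            best_sse, best = sse, (labels, centroids)
    return best_sse, best
```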
Comments on the K-Means Method
Strength: relatively efficient: O(nktd), where n is # objects, k is # clusters, t is # iterations, and d is # dimensions. Normally, k, t, d << n.
Comparing: PAM: O(k(n−k)^2), CLARA: O(ks^2 + k(n−k))
Comment: often terminates at a local optimum.
Weakness
• Applicable only when the mean is defined; then what about categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with differing sizes, differing density, or non-convex shapes
Partitioning Algorithms
• k-means Algorithm
  Issue of initial centroids, clustering evaluation
  Limitations of k-means
• k-medoids
Limitations of K-means: Differing Sizes
[Figure: Original Points vs. K-means (3 Clusters)]
Limitations of K-means: Differing Density
[Figure: Original Points vs. K-means (3 Clusters)]
Limitations of K-means: Non-convex Shapes
[Figure: Original Points vs. K-means (2 Clusters)]
Overcoming K-means Limitations
One solution is to use many clusters: k-means then finds parts of the natural clusters, which need to be put together afterwards.
[Figure: Original Points vs. K-means Clusters for the differing-sizes, differing-density, and non-convex examples above]
K-medoids, instead of K-means
• The k-means algorithm is sensitive to outliers! An object with an extremely large value may substantially distort the distribution of the data.
• In some cases means cannot be computed, e.g., when only similarities among objects are available.
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: the same data partitioned around means vs. around medoids]
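A tiny numeric illustration of this sensitivity (toy data, mine): one extreme value drags the mean far from the bulk of the data, while the medoid stays at a central object:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 100.])   # one outlier
print(x.mean())                         # 22.0: distorted by the outlier
D = np.abs(x[:, None] - x[None, :])     # pairwise distances
print(x[D.sum(axis=0).argmin()])        # 3.0: the medoid remains central
```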
Partitioning Algorithms
• k-means Algorithm
  Issue of initial centroids, clustering evaluation
  Limitations of k-means
• k-medoids
  PAM
  CLARA
  CLARANS
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• k-medoids uses the same strategy as k-means
• PAM (Partitioning Around Medoids, 1987)
  starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  PAM works effectively for small data sets, but does not scale well to large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): randomized sampling
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM iterations on a 2-D point set, K = 2]

1. Arbitrarily choose k objects as initial medoids (m_i, i = 1..k); here Total Cost({m_i}) = 20
2. Assign each remaining object to the nearest medoid
3. Select a non-medoid object O_j and compute the total cost of each possible new set of medoids (O_j, {m_i}, i ≠ t)
4. Swap m_t and O_j if cost({O_j, {m_i, i ≠ t}}) is the smallest and cost({O_j, {m_i, i ≠ t}}) < cost({m_i}); here the Total Cost drops to 18
5. Loop until no change
PAM (Partitioning Around Medoids) (1987)
PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1987)
Use real objects to represent the clusters:
1. Select k representative objects (medoids) arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih = total_cost(replace i by h) − total_cost(no replace)
3. If min(TC_ih) < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2–3 until there is no change
Each iteration costs O(k(n−k)^2), where n is # of data points and k is # of clusters; a minimal code sketch follows below.
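A minimal sketch of this loop (function and variable names are mine; each sweep over all (i, h) pairs is the O(k(n−k)^2) iteration noted above):

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM: greedily swap a medoid i with a non-medoid h while the
    total distance of the clustering improves. D is an n x n distance matrix."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))       # step 1
    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()   # each object to its nearest medoid
    improved = True
    while improved:                                            # step 4
        improved = False
        for i in list(medoids):                                # step 2: pairs (i, h)
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(candidate) < total_cost(medoids):  # TC_ih < 0, step 3
                    medoids, improved = candidate, True
    return medoids
```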
CLARA (Clustering Large Applications) (1990)
CLARA (Clustering LARge Applications, Kaufmann and Rousseeuw)
Sampling-based method:
• Draw multiple samples of the data set
• Apply PAM on each sample
• Return the best clustering as the output (minimizing cost/SSE)
Strength: deals with larger data sets than PAM
Weakness:
• Efficiency depends on the sample size
• A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
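A sketch of the sampling loop, reusing the pam() sketch above (the number of samples and the sample size here are illustrative values, not prescribed by the lecture):

```python
import numpy as np

def clara(D, k, n_samples=5, sample_size=40, seed=0):
    """CLARA: run PAM on several random samples and keep the medoid set
    with the lowest cost on the FULL data set."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_meds, best_cost = None, float("inf")
    for _ in range(n_samples):
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sub = D[np.ix_(idx, idx)]                  # distance matrix of the sample
        meds = [int(idx[m]) for m in pam(sub, k)]  # map back to full-data indices
        cost = D[:, meds].min(axis=1).sum()        # evaluate on all n objects
        if cost < best_cost:
            best_meds, best_cost = meds, cost
    return best_meds
```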
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on RANdomized Search, Ng and Han '94)
The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
• Each node corresponds to k medoids, i.e., to one clustering; the graph has $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ nodes
• Two nodes are connected as neighbors if their sets differ by only one item, so each node has k(n−k) neighbors
• PAM: checks every neighbor
• CLARA: examines fewer neighbors, searching in subgraphs built from samples
• CLARANS: searches the whole graph but draws samples of neighbors dynamically
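A quick check of these counts for a small instance (the numbers are illustrative):

```python
from math import comb

n, k = 100, 5
print(comb(n, k))   # 75287520 nodes: possible sets of k medoids
print(k * (n - k))  # 475 neighbors per node: swap one medoid for one non-medoid
```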
CLARANS (“Randomized” CLARA) (1994)
• CLARANS: searches the whole graph but draws samples of neighbors dynamically
• It is more efficient and scalable than both PAM and CLARA
• Focusing techniques and spatial access structures may further improve its performance (Ester et al. '95)
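A hedged sketch of this dynamic neighbor sampling (the numlocal and maxneighbor parameters follow the spirit of the original paper; the implementation details are mine):

```python
import numpy as np

def clarans(D, k, numlocal=2, maxneighbor=100, seed=0):
    """CLARANS: from a random node (set of k medoids), draw random neighbors;
    move whenever one improves the cost; restart numlocal times."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    def cost(meds):
        return D[:, meds].min(axis=1).sum()
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = list(rng.choice(n, size=k, replace=False))  # a random node
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:
            i, h = int(rng.integers(k)), int(rng.integers(n))
            if h in current:
                continue                        # h must be a non-medoid
            neighbor = current.copy()
            neighbor[i] = h                     # a neighbor differs in one medoid
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:    # move to the better node
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost
```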
What you should know
• What is clustering?
• What is a partitioning clustering method?
• How does k-means work?
• The limitations of k-means
• How does k-medoids work?
• How to solve the scalability problem of k-medoids?