Clustering, K-means variants, clustering techniques and applications
Jagdeep Matharu
Brock University
March 18th 2013
Jagdeep Matharu (Brock University) Clustering - k-means March 18th 2013 1 / 54
Clustering
1 Grouping together data objects that are similar in some way, according to user-defined criteria.
2 Cluster: a collection of data objects that are similar to each other.
3 A form of unsupervised learning.
4 Data exploration: looking for new patterns or structures in the data.
5 Can be posed as an optimization problem.
Clustering Task
1 Pattern representation
2 Pattern proximity measure (most important)
    How (dis)similar two objects are.
3 Grouping
Clustering Techniques
1 Hierarchical algorithms: create a hierarchical decomposition of the data set.
    Agglomerative: bottom-up approach.
    Divisive: top-down approach.
2 Partition algorithms: create a partition and then evaluate it by some criterion.
    e.g. k-means, k-medoids
Figure 1 : Examples of segmentation based on colour or intensity.
Hierarchical Clustering Algorithms
1 Sequential clustering algorithm.
2 Algorithm:
    Assign every data point to a separate cluster.
    Keep merging the most similar pair of data points/clusters until we have one cluster.
    Compute distances between the new and old clusters.
3 Use a distance matrix as the clustering criterion.
4 Construct nested partitions layer by layer into a tree-like structure.
5 The resulting tree can be cut to get the desired number of clusters.
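The merging loop above can be sketched in a few lines of Python. This is a minimal illustration, not the method from the slides: it assumes single linkage (cluster distance = closest pair of points) and Euclidean distance, and the names agglomerative and single_link are made up for the example. It stops at num_clusters instead of cutting a finished tree.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(a, b):
    """Distance between two clusters: the closest pair of points (an assumption)."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters=1):
    # Start with every data point in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # merge history: (cluster_a, cluster_b, distance)
    while len(clusters) > num_clusters:
        # Find the most similar (closest) pair of clusters.
        d, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the pair with the merged cluster; distances between the new
        # cluster and the old ones are recomputed on the next pass.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, num_clusters=3)
print(clusters)
```

The merge history plays the role of the tree: each entry records which two clusters were joined and at what distance. Recomputing all pairwise distances on every pass keeps the sketch short but is slower than maintaining a distance matrix.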
Cont’d
1 The result is a binary tree, or dendrogram.
2 The height of the bars shows how close two objects are.
Example
Strengths and Weaknesses
1 Pros:
    No need to assume the number of clusters in advance.
    Easy to implement.
2 Cons:
    Time and space complexity of at least O(n^2), for computing and storing the proximity matrix.
    No objective function is directly minimized.
    Merging decisions are final and cannot be undone.
Partition Clustering algorithms
1 Overview:
    Construct a partition of a data set D of n objects into a set of k clusters.
    The value of k is specified by the user.
    Different values of k result in different cluster outputs.
    Find the partition into k clusters that optimizes the chosen partition criterion (error function).
    E.g. error sum of squares (SSE).
2 Combinatorial search over all possible partitions can be computationally expensive.
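To give a feel for why exhaustive search is expensive, the number of ways to split n objects into k non-empty clusters is the Stirling number of the second kind. The sketch below computes it with the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1); the function name num_partitions is chosen only for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_partitions(n, k):
    """Number of ways to partition n objects into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # The n-th object either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * num_partitions(n - 1, k) + num_partitions(n - 1, k - 1)

print(num_partitions(4, 2))    # 7
print(num_partitions(10, 3))   # 9330
print(num_partitions(100, 5))  # a very large number
```

Even ten objects in three clusters already give thousands of candidate partitions, which is why partition algorithms search heuristically rather than exhaustively.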
Partition Clustering algorithms
1 k-medoids
    Uses a medoid (an actual data point) to represent the cluster.
2 k-means
    Uses a centroid to represent the cluster.
3 Variations
    Bisecting k-means
    ISODATA
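The difference between the two representatives can be seen on a small cluster with one outlier. This is an illustrative sketch that assumes Euclidean distance and defines the medoid as the member with the smallest total distance to all other members; the helper names are hypothetical.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Component-wise mean; usually not one of the data points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all the others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

cluster = [(1, 1), (2, 2), (3, 3), (4, 4), (20, 20)]
print(centroid(cluster))  # (6.0, 6.0), pulled towards the outlier
print(medoid(cluster))    # (3, 3), an actual member of the cluster
```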
k-means algorithms
1 Choose k initial centroids (center points).
2 Each cluster is associated with a centroid.
3 Each data object is assigned to the closest centroid.
4 The centroid of each cluster is then updated based on the data objects assigned to the cluster.
5 Repeat the assignment and update steps until convergence.
Figure 2 : Algorithm
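The five steps can be written out as a short Python sketch. It assumes Euclidean distance, random initial centroids drawn from the data, and distinct data points; the names kmeans and mean, and the max_iter and seed parameters, are choices made for this example rather than part of the slides.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # 1: choose k initial centroids
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]    # 2: one cluster per centroid
        for p in points:                     # 3: assign to the closest centroid
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 4: update each centroid; an empty cluster keeps its old centroid
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:             # 5: stop when nothing changes
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Convergence is detected here by the centroids no longer moving; other stopping rules (for example, no point changing cluster) are equally common.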
K-means Example
K-means
1 What should the value of k be?
2 How to choose the initial centroids?
3 How to assign points to the closest centroid?
4 How to evaluate the clusters?
5 Other issues.
Choosing value of k
1 k represents the number of clusters required in the partition.
2 It must be specified beforehand.
3 There is no rule of thumb for choosing k: trial and error.
4 Different values of k may lead to different results.
Choosing initial centroids
1 Key step of k-means method.
2 Different initial centroids can produce different results.
Example - Optimal Initial Centroid.
Example - Sub-optimal Initial Centroid.
Choosing initial centroids
1 Choose the initial centroids randomly.
    Can lead to poor clustering.
2 Perform multiple runs, each with randomly chosen initial centroids.
    Select the set of clusters with the best result (lowest SSE).
3 Take a sample of points and cluster them using a hierarchical clustering technique. Extract k clusters from the hierarchy and use the centroids of those clusters as the initial centroids.
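Option 2 (multiple random restarts) might look like the following sketch. It bundles a compact k-means loop so that it runs on its own; the names lloyd, assign, sse and best_of_restarts, the Euclidean distance, and the choice of SSE as the selection criterion are assumptions made for the example.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def assign(points, centroids):
    """Put every point in the cluster of its closest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        clusters[i].append(p)
    return clusters

def lloyd(points, centroids, max_iter=100):
    """Basic k-means from the given initial centroids."""
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else m
                   for pts, m in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def sse(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(m, x) ** 2 for m, pts in zip(centroids, clusters) for x in pts)

def best_of_restarts(points, k, runs=10, seed=0):
    rng = random.Random(seed)
    # Each run starts from a fresh random choice of k data points.
    results = (lloyd(points, rng.sample(points, k)) for _ in range(runs))
    # Keep the run whose clustering has the lowest SSE.
    return min(results, key=lambda r: sse(*r))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centroids, clusters = best_of_restarts(points, k=3, runs=20, seed=1)
print(sse(centroids, clusters))
```

More runs raise the chance of hitting a good start but do not guarantee the global optimum.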
Assigning points to centroids
1 The goal is to find the closest centroid for each data point.
2 Assign each data point to the closest centroid.
3 A proximity measure is required to calculate distances.
    Euclidean distance, Manhattan distance.
4 Each point is assigned to the centroid at the smallest distance.
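The choice of proximity measure can change which centroid is "closest". The sketch below implements both distances named above and assigns one point under each; the function names and the sample coordinates are made up for illustration.

```python
def euclidean(p, q):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest_centroid(point, centroids, distance=euclidean):
    """Index of the centroid at the smallest distance from the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

centroids = [(0, 0), (8, 3)]
point = (5, 0)
print(closest_centroid(point, centroids, euclidean))  # 1: distances 5.0 vs about 4.24
print(closest_centroid(point, centroids, manhattan))  # 0: distances 5 vs 6
```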
Cluster Evaluation
1 The most common measure is the sum of squared errors (SSE).
2 The goal is to reduce the error.
3 The error of a data point is its distance to the nearest cluster centroid.
4 Mathematically: SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
5 where dist is the distance from a data point to the cluster representative, x is a data point in cluster C_i, and m_i is the representative point (e.g. the centroid) of cluster C_i.
6 Given two clusterings, we choose the one with the smaller error.
7 One easy way to reduce SSE is to increase k.
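A small worked example of the SSE formula, assuming squared Euclidean distance and centroids as the representative points m_i. The function name sse and the sample data are illustrative only.

```python
def sse(clusters, centroids):
    """Sum over clusters C_i and points x in C_i of dist^2(m_i, x)."""
    total = 0.0
    for points, m in zip(clusters, centroids):
        for x in points:
            total += sum((a - b) ** 2 for a, b in zip(m, x))
    return total

# Clustering A: two tight clusters, every point is 1 unit from its centroid.
clusters_a = [[(1, 1), (1, 3)], [(6, 5), (8, 5)]]
centroids_a = [(1, 2), (7, 5)]

# Clustering B: the same points grouped badly.
clusters_b = [[(1, 1)], [(1, 3), (6, 5), (8, 5)]]
centroids_b = [(1, 1), (5, 13 / 3)]

print(sse(clusters_a, centroids_a))  # 4.0
print(sse(clusters_b, centroids_b))  # larger, so clustering A is preferred
```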
k-means
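1 Pros
    Easy to implement.
    Guaranteed to converge (to a local optimum).
        Most of the convergence happens in the first few iterations.
    Complexity is linear in the number of data points, O(n), for a fixed k and number of iterations.
2 Cons
    Need to specify k in advance.
    Sensitive to outliers.
    May yield empty clusters.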
Bisecting k-means
1 Variation of the basic k-means method.
2 Can produce a partitional or a hierarchical clustering.
3 To obtain K clusters, split the set of all points into two clusters.
4 Choose one of the clusters to split again.
    Can choose the largest cluster.
    Can choose the one with the highest SSE.
    Can choose based on both.
5 Continue until K clusters have been produced.
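A rough sketch of the bisecting procedure, under several assumptions: the cluster with the highest SSE is the one split, each split is a single run of basic k-means with k = 2, distance is Euclidean, the data points are distinct, and K does not exceed the number of points. The function names are hypothetical.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def cluster_sse(points):
    m = centroid(points)
    return sum(dist(m, x) ** 2 for x in points)

def two_means(points, rng, max_iter=50):
    """Basic k-means with k = 2; returns the two halves."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            # Index 1 (True) if strictly closer to the second center, else 0.
            halves[dist(p, centers[1]) < dist(p, centers[0])].append(p)
        updated = [centroid(h) if h else c for h, c in zip(halves, centers)]
        if updated == centers:
            break
        centers = updated
    return halves

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]            # start with one cluster of all points
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # cluster with the highest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst, rng))  # split it into two
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(points, k=3)
print(clusters)
```

Recording the sequence of splits instead of only the final clusters would give the hierarchical view mentioned in item 2.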
ISODATA
1 Iterative Self-Organizing Data Analysis Technique A
2 Does not need the number of clusters to be known in advance.
3 Cluster centers are randomly placed and points are assigned to the closest centroid.
4 The standard deviation within each cluster and the distance between cluster centers are calculated.
    Clusters are split if the standard deviation is greater than a user-defined threshold.
    Clusters are merged if the distance between them is less than a user-defined threshold.
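The split and merge tests can be illustrated in isolation. This is a simplified sketch of one decision step only, not a full ISODATA implementation: it assumes the within-cluster spread is measured as the largest per-dimension population standard deviation, and the names split_merge_plan, max_std and min_center_dist are invented for the example.

```python
from math import dist  # Euclidean distance (Python 3.8+)
from statistics import pstdev

def centroid(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def spread(points):
    """Largest per-dimension standard deviation within a cluster (an assumption)."""
    return max(pstdev(axis) for axis in zip(*points))

def split_merge_plan(clusters, max_std, min_center_dist):
    centers = [centroid(c) for c in clusters]
    # Split candidates: clusters whose spread exceeds the threshold.
    to_split = [i for i, c in enumerate(clusters) if spread(c) > max_std]
    # Merge candidates: pairs of clusters whose centers are too close.
    to_merge = [(i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
                if dist(centers[i], centers[j]) < min_center_dist]
    return to_split, to_merge

clusters = [[(0, 0), (0, 1), (1, 0)],
            [(2, 0), (2, 1)],
            [(10, 0), (20, 0), (30, 0)]]
print(split_merge_plan(clusters, max_std=3.0, min_center_dist=3.0))
# ([2], [(0, 1)]): split the stretched cluster, merge the two nearby ones
```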
Practical Example of k-means
1 Image segmentation using k-means clustering.
Figure 3 : Examples of segmentation based on colour or intensity.