Clustering Basic Concepts and Algorithms (Bamshad Mobasher, DePaul University)


Post on 14-Dec-2015






  • Slide 1

Clustering Basic Concepts and Algorithms
Bamshad Mobasher, DePaul University

  • Slide 2

What Is Clustering in Data Mining?
- Cluster: a collection of data objects that are similar to one another, and thus can be treated collectively as one group, but that as a collection are sufficiently different from other groups
- Clustering: unsupervised classification, with no predefined classes
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
- It helps users understand the natural grouping or structure in a data set

  • Slide 3

Applications of Cluster Analysis
- Data reduction
  - Summarization: preprocessing for regression, PCA, classification, and association analysis
  - Compression: image processing (vector quantization)
- Hypothesis generation and testing
- Prediction based on groups
  - Cluster and find characteristics/patterns for each group
- Finding K-nearest neighbors
  - Localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those far away from any cluster

  • Slide 4

Basic Steps to Develop a Clustering Task
- Feature selection / preprocessing
  - Select information concerning the task of interest
  - Minimal information redundancy
  - May need normalization/standardization
- Distance/similarity measure
  - Similarity of two feature vectors
- Clustering criterion
  - Expressed via a cost function or some rules
- Clustering algorithms
  - Choice of algorithms
- Validation of the results
- Interpretation of the results with applications

  • Slide 5

Distance or Similarity Measures
Common distance measures, for two feature vectors $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$:
- Manhattan distance: $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
- Euclidean distance: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Cosine similarity: $sim(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$

  • Slide 6

More Similarity Measures
- Simple matching
- Cosine coefficient
- Dice's coefficient
- Jaccard's coefficient
In the vector-space model, many similarity measures can be used in clustering.

  • Slide 7

Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters:
- high intra-class similarity: cohesive within clusters
- low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on:
- the similarity measure used,
- its implementation, and
- its ability to discover some or all of the hidden patterns

  • Slide 8

Major Clustering Approaches
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach: based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DENCLUE
- Model-based approach: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB

  • Slide 9

Partitioning Approaches
The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster.
- Cluster representatives can be actual items in the cluster or other virtual representatives such as the centroid
- This methodology reduces the number of similarity computations in clustering
- Clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
Reallocation-based partitioning methods
- Start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
- Most common algorithm: k-means

  • Slide 10

The K-Means Clustering Method
Given the number of desired clusters k, the k-means algorithm follows four steps:
1. Randomly assign objects to create k nonempty initial partitions (clusters)
2. Compute the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest centroid (reallocation step)
4. Go back to Step 2; stop when the assignment does not change

  • Slide 11

K-Means Example: Document Clustering
Initial (arbitrary) assignment: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}
[Table: cluster centroids]
Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

  • Slide 12

Example (Continued)
For each document, reallocate the document to the cluster to which it has the highest similarity (shown in red in the above table). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.
C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.

  • Slide 13

Example (Continued)
Now compute new cluster centroids using the original document-term matrix. This leads to a new cluster-document similarity matrix, similar to the one on the previous slide. Again, the items are reallocated to the clusters with the highest similarity.
Previous assignment: C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
New assignment: C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}
Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.

  • Slide 14

K-Means Algorithm
Strength of the k-means:
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t
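
The four k-means steps listed on Slide 10 can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the implementation behind the slides: the function names (`euclidean`, `centroid`, `k_means`), the `max_iter` and `seed` parameters, and the rule for a cluster that becomes empty are all hypothetical choices, and the sketch uses Euclidean distance where the document example on Slide 11 uses the dot product as a similarity.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Mean point (component-wise average) of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    """Reallocation-based k-means; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 1: random initial partition into k nonempty clusters.
    # Shuffling the indices and dealing them out round-robin
    # guarantees that every cluster starts with at least one object.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each current cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: a cluster that lost all its members keeps
            # its previous centroid (the slides do not cover this case).
            new_centroids.append(centroid(members) if members else centroids[c])
        centroids = new_centroids

        # Step 3: reassign each object to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]

        # Step 4: stop when the assignment does not change.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centroids
```

Calling `k_means(points, 2)` on a list of numeric vectors returns one cluster label per vector plus the final centroids. Each pass over the data costs on the order of k times n distance computations, which matches the O(tkn) running time quoted on Slide 14 for t iterations.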
