Clustering: Basic Concepts and Algorithms
Bamshad Mobasher, DePaul University

Page 1: Clustering: Basic Concepts and Algorithms

Bamshad Mobasher, DePaul University

Page 2:

What is Clustering in Data Mining?

- Cluster: a collection of data objects that
  - are "similar" to one another and thus can be treated collectively as one group
  - as a collection, are sufficiently different from other groups
- Clustering is unsupervised classification: there are no predefined classes
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
- It helps users understand the natural grouping or structure in a data set

Page 3:

Applications of Cluster Analysis

- Data reduction
  - Summarization: preprocessing for regression, PCA, classification, and association analysis
  - Compression: image processing, e.g., vector quantization
- Hypothesis generation and testing
- Prediction based on groups
  - Cluster, then find characteristics/patterns for each group
- Finding K-nearest neighbors
  - Localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Page 4:

Basic Steps to Develop a Clustering Task

- Feature selection / preprocessing
  - Select information relevant to the task of interest
  - Aim for minimal information redundancy
  - May need to normalize/standardize the data
- Distance/similarity measure
  - Quantifies the similarity of two feature vectors
- Clustering criterion
  - Expressed via a cost function or some rules
- Clustering algorithm
  - Choice of algorithm
- Validation of the results
- Interpretation of the results in the context of the application

Page 5:

Distance or Similarity Measures

Common distance measures:

- Manhattan distance: $dist(X, Y) = \sum_i |x_i - y_i|$
- Euclidean distance: $dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Cosine similarity: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$, with $dist(X, Y) = 1 - sim(X, Y)$
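As an illustrative sketch (not part of the slides), the three measures can be written directly from the formulas above; the sample vectors are arbitrary:

```python
import math

def manhattan(x, y):
    # Sum of absolute coordinate differences: sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    # Dot product normalized by the product of the vector norms
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

# Two sample term-weight vectors (arbitrary values)
x, y = [0, 3, 3, 0, 2], [4, 1, 0, 1, 2]
print(manhattan(x, y))                 # 10
print(round(euclidean(x, y), 3))       # 5.477
print(round(cosine_sim(x, y), 3))      # 0.318
print(round(1 - cosine_sim(x, y), 3))  # cosine distance: 0.682
```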

Page 6:

More Similarity Measures

In the vector-space model, many similarity measures can be used in clustering:

- Simple matching: $sim(X, Y) = \sum_i x_i y_i$
- Cosine coefficient: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$
- Dice's coefficient: $sim(X, Y) = \frac{2 \sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$
- Jaccard's coefficient: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$
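A sketch of these coefficients in Python (illustrative; the binary test vectors are hypothetical, chosen so the coefficients reduce to their familiar set-based forms):

```python
def dot(x, y):
    # Simple matching: sum_i x_i * y_i
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    # 2 * dot, normalized by the total squared weight of both vectors
    return 2 * dot(x, y) / (sum(a * a for a in x) + sum(b * b for b in y))

def jaccard(x, y):
    # dot, normalized by the total squared weight minus the overlap
    d = dot(x, y)
    return d / (sum(a * a for a in x) + sum(b * b for b in y) - d)

# For binary vectors: dot = |X ∩ Y|, Jaccard = |X ∩ Y| / |X ∪ Y|
x, y = [1, 1, 0, 1], [1, 0, 0, 1]
print(dot(x, y))      # 2  (terms shared by both)
print(dice(x, y))     # 0.8
print(jaccard(x, y))  # 2/3 (2 shared terms out of 3 in the union)
```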

Page 7:

Quality: What Is Good Clustering?

- A good clustering method will produce high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Page 8:

Major Clustering Approaches

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Model-based approach:
  - A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to each model
  - Typical methods: EM, SOM, COBWEB

Page 9:

Partitioning Approaches

- The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
  - Cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
  - This methodology reduces the number of similarity computations required in clustering
  - Clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
- Reallocation-based partitioning methods
  - Start with an initial assignment of items to clusters, then move items from cluster to cluster to obtain an improved partitioning
  - Most common algorithm: k-means

Page 10:

The K-Means Clustering Method

Given the number of desired clusters k, the k-means algorithm follows four steps:

1. Randomly assign objects to create k nonempty initial partitions (clusters)
2. Compute the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest centroid (reallocation step)
4. Go back to Step 2; stop when the assignment does not change
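The four steps can be sketched as a plain-Python k-means (a minimal illustration of the procedure described above, using squared Euclidean distance; not production code, and the empty-cluster handling is one possible choice):

```python
import random

def kmeans(points, k, max_iter=50):
    n, dims = len(points), len(points[0])
    # Step 1: random assignment into k nonempty initial partitions
    assign = [i % k for i in range(n)]
    random.shuffle(assign)
    centroids = [list(points[c]) for c in range(k)]  # overwritten below
    for _ in range(max_iter):
        # Step 2: centroid = mean point of each cluster
        # (a cluster emptied by reallocation keeps its previous centroid)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(p[d] for p in members) / len(members)
                                for d in range(dims)]
        # Step 3: reallocate each object to the nearest centroid
        new_assign = [
            min(range(k), key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                            for d in range(dims)))
            for p in points
        ]
        # Step 4: stop when the assignment no longer changes
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

random.seed(0)  # for reproducibility
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, cents = kmeans(pts, k=2)
print(labels)  # the two spatial blobs end up in different clusters
```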

Page 11:

K-Means Example: Document Clustering

Initial (arbitrary) assignment: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

Document-term matrix and cluster centroids:

      T1    T2    T3    T4    T5
D1    0     3     3     0     2
D2    4     1     0     1     2
D3    0     4     0     0     2
D4    0     3     0     3     3
D5    0     1     3     0     1
D6    2     2     0     0     4
D7    1     0     3     2     0
D8    3     1     0     0     2
C1    4/2   4/2   3/2   1/2   4/2
C2    0/2   7/2   0/2   3/2   5/2
C3    2/2   3/2   3/2   0/2   5/2

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

      D1    D2    D3    D4    D5    D6    D7    D8
C1    29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
C2    31/2  20/2  38/2  45/2  12/2  34/2  6/2   17/2
C3    28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2

Page 12:

Example (Continued)

      D1    D2    D3    D4    D5    D6    D7    D8
C1    29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
C2    31/2  20/2  38/2  45/2  12/2  34/2  6/2   17/2
C3    28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2

For each document, reallocate the document to the cluster to which it has the highest similarity. After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation…
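This first reallocation can be reproduced in a few lines of plain Python (an illustrative sketch; note that D5 scores 17/2 against both C1 and C3, so as an assumption the tie is broken in favor of the document's current cluster to match the result above):

```python
docs = {  # document-term matrix from the example
    "D1": [0, 3, 3, 0, 2], "D2": [4, 1, 0, 1, 2],
    "D3": [0, 4, 0, 0, 2], "D4": [0, 3, 0, 3, 3],
    "D5": [0, 1, 3, 0, 1], "D6": [2, 2, 0, 0, 4],
    "D7": [1, 0, 3, 2, 0], "D8": [3, 1, 0, 0, 2],
}
clusters = {"C1": ["D1", "D2"], "C2": ["D3", "D4"], "C3": ["D5", "D6"]}

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Cluster centroids: per-term mean over the member documents
cents = {c: [sum(col) / len(ms) for col in zip(*(docs[m] for m in ms))]
         for c, ms in clusters.items()}

# Reallocate each document to the cluster with the highest dot-product
# similarity; ties go to the document's current cluster (if any)
current = {d: c for c, ms in clusters.items() for d in ms}
new = {c: [] for c in clusters}
for d, v in docs.items():
    best = max(cents, key=lambda c: (dot(cents[c], v), c == current.get(d)))
    new[best].append(d)

print(cents["C1"])  # [2.0, 2.0, 1.5, 0.5, 2.0]  (= 4/2, 4/2, 3/2, 1/2, 4/2)
print(new)  # {'C1': ['D2', 'D7', 'D8'], 'C2': ['D1', 'D3', 'D4', 'D6'], 'C3': ['D5']}
```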

Page 13:

Example (Continued)

Current clusters: C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

Now compute the new cluster centroids using the original document-term matrix:

      T1    T2    T3    T4    T5
D1    0     3     3     0     2
D2    4     1     0     1     2
D3    0     4     0     0     2
D4    0     3     0     3     3
D5    0     1     3     0     1
D6    2     2     0     0     4
D7    1     0     3     2     0
D8    3     1     0     0     2
C1    8/3   2/3   3/3   3/3   4/3
C2    2/4   12/4  3/4   3/4   11/4
C3    0/1   1/1   3/1   0/1   1/1

This leads to a new cluster-document similarity matrix, similar to the one on the previous slide. Again, the items are reallocated to the clusters with the highest similarity:

      D1     D2     D3     D4     D5     D6     D7     D8
C1    7.67   15.01  5.34   9.00   5.00   12.00  7.67   11.34
C2    16.75  11.25  17.50  19.50  8.00   6.68   4.25   10.00
C3    14.00  3.00   6.00   6.00   11.00  9.34   9.00   3.00

New assignment: C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}

Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.

Page 14:

K-Means Algorithm

- Strengths of k-means:
  - Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
  - Often terminates at a local optimum
- Weaknesses of k-means:
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
- Variations of k-means usually differ in:
  - Selection of the initial k means
  - Distance or similarity measures used
  - Strategies to calculate cluster means

Page 15:

A Disk Version of k-means

- k-means can be implemented with data on disk
  - In each iteration, it scans the database once
  - The centroids are computed incrementally
- It can be used to cluster large datasets that do not fit in main memory
- We need to control the number of iterations
  - In practice, a limit is set (e.g., < 50)
- There are better algorithms that scale up for large data sets, e.g., BIRCH
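Incremental centroid computation can be sketched as a running mean, so a single scan updates each centroid record by record without revisiting earlier rows (an illustrative sketch; the update rule is the standard running-mean formula, not code from the slides):

```python
def update_centroid(centroid, count, point):
    # Running-mean update: mu_new = mu + (x - mu) / (n + 1),
    # applied per dimension; avoids storing all points seen so far
    count += 1
    return [m + (x - m) / count for m, x in zip(centroid, point)], count

# Stream three 2-D records through the update; the result equals
# the batch mean (2+4+0)/3 = 2 and (0+2+4)/3 = 2
mu, n = [0.0, 0.0], 0
for p in ([2, 0], [4, 2], [0, 4]):
    mu, n = update_centroid(mu, n, p)
print(mu)  # [2.0, 2.0]
```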

Page 16:

BIRCH

- Designed for very large data sets
  - Time and memory are limited
  - Incremental and dynamic clustering of incoming objects
  - Only one scan of the data is necessary
  - Does not need the whole data set in advance
- Two key phases:
  - Scans the database to build an in-memory tree
  - Applies a clustering algorithm to cluster the leaf nodes

Page 17:

Hierarchical Clustering Algorithms

- Two main types of hierarchical clustering
  - Agglomerative:
    - Start with the points as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  - Divisive:
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
  - Merge or split one cluster at a time

Page 18:

Hierarchical Clustering Algorithms

- Use the distance/similarity matrix as the clustering criterion
  - Does not require the number of clusters as input, but needs a termination condition

[Figure: five points a–e merged step by step (a,b → ab; c,d → cd; cd,e → cde; ab,cde → abcde); reading Step 0 to Step 4 left to right is agglomerative, right to left is divisive]

Page 19:

Hierarchical Agglomerative Clustering

- Basic procedure:
  1. Place each of the N items into a cluster of its own.
  2. Compute all pairwise item-item similarity coefficients (a total of N(N−1)/2 coefficients).
  3. Form a new cluster by combining the most similar pair of current clusters i and j
     - (methods for determining which clusters to merge: single-link, complete-link, group average, etc.);
     - update the similarity matrix by deleting the rows and columns corresponding to i and j;
     - calculate the entries in the row corresponding to the new cluster i+j.
  4. Repeat step 3 while the number of clusters left is greater than 1.
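The procedure above can be sketched in plain Python; group-average linkage is used in the merge step as one example choice, and for brevity the pairwise similarities are recomputed rather than cached in a matrix as step 3 describes (an illustrative sketch, not an efficient implementation):

```python
def hac(items, sim, k=1):
    # Step 1: each item starts in a cluster of its own
    clusters = [[i] for i in range(len(items))]

    def avg_sim(a, b):
        # Group-average similarity between two clusters
        return (sum(sim(items[i], items[j]) for i in a for j in b)
                / (len(a) * len(b)))

    # Steps 3-4: repeatedly merge the most similar pair of clusters
    while len(clusters) > k:
        a, b = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# 1-D toy items; similarity = negative distance (hypothetical data)
items = [0.0, 0.1, 0.2, 5.0, 5.1]
print(hac(items, sim=lambda x, y: -abs(x - y), k=2))  # [[0, 1, 2], [3, 4]]
```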

Page 20:

Hierarchical Agglomerative Clustering: Example

[Figure: nested clusters (left) and the corresponding dendrogram (right) for six points, with dendrogram leaves ordered 3, 6, 4, 1, 2, 5 and merge heights between 0.05 and 0.4]

Page 21:

Distance Between Two Clusters

- The basic procedure varies based on the method used to determine inter-cluster distances or similarities
- Different methods result in different variants of the algorithm:
  - Single link
  - Complete link
  - Average link
  - Ward's method
  - etc.

Page 22:

Single Link Method

- The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster
- It can find arbitrarily shaped clusters, but
  - it may cause the undesirable "chain effect" due to noisy points

[Figure: chain-effect example in which two natural clusters are split into two]

Page 23:

Distance Between Two Clusters

- Single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj
  - The distance is defined by the two most similar objects:

$D_{sl}(C_i, C_j) = \min_{x, y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}$

Item-item similarity matrix:

      I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 24:

Complete Link Method

- The distance between two clusters is the distance between the two furthest data points in the two clusters
- It is sensitive to outliers because they are far away

Page 25:

Distance Between Two Clusters

- Complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj
  - The distance is defined by the two least similar objects:

$D_{cl}(C_i, C_j) = \max_{x, y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}$

Item-item similarity matrix:

      I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 26:

Average Link and Centroid Methods

- Average link: a compromise between
  - the sensitivity of complete-link clustering to outliers, and
  - the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
  - In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters
- Centroid method: in this method, the distance between two clusters is the distance between their centroids

Page 27:

Distance Between Two Clusters

- Group-average distance between clusters Ci and Cj is the average distance between objects in Ci and objects in Cj
  - The distance is defined by the average over all object pairs:

$D_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$

Item-item similarity matrix:

      I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00
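Converting the similarities in the matrix above to distances via d = 1 − sim, the three linkage definitions can be compared directly; the cluster pair below ({I1, I2} versus {I4, I5}) is a hypothetical choice for illustration:

```python
S = {  # pairwise similarities from the item-item matrix above
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65,
    ("I1", "I5"): 0.20, ("I2", "I3"): 0.70, ("I2", "I4"): 0.60,
    ("I2", "I5"): 0.50, ("I3", "I4"): 0.40, ("I3", "I5"): 0.30,
    ("I4", "I5"): 0.80,
}

def d(x, y):
    # Distance derived from similarity: d = 1 - sim (symmetric lookup)
    return 1 - S[(x, y) if (x, y) in S else (y, x)]

def single_link(a, b):
    # Minimum distance over all cross-cluster pairs
    return min(d(x, y) for x in a for y in b)

def complete_link(a, b):
    # Maximum distance over all cross-cluster pairs
    return max(d(x, y) for x in a for y in b)

def average_link(a, b):
    # Mean distance over all cross-cluster pairs
    return sum(d(x, y) for x in a for y in b) / (len(a) * len(b))

A, B = ["I1", "I2"], ["I4", "I5"]
print(round(single_link(A, B), 4))    # 0.35   (from the I1-I4 pair)
print(round(complete_link(A, B), 4))  # 0.8    (from the I1-I5 pair)
print(round(average_link(A, B), 4))   # 0.5125
```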

Page 28: Clustering: Basic Concepts and Algorithms

Bamshad Mobasher, DePaul University