Clustering Part 1
Abdul Kawsar Tushar, Nadeem Ahmed
CSE, UAP

Uploaded to the Education category on 11-Apr-2017

TRANSCRIPT

Page 1: Clustering part 1

Clustering Part 1
Abdul Kawsar Tushar
Nadeem Ahmed
CSE, UAP

Page 2: Clustering part 1

What is Clustering

• Visualization of data
• Hypothesis generation

Page 3: Clustering part 1

Overview of Clustering

• Feature Selection
• Feature Extraction: transformations of the input features to produce new salient features
• Inter-pattern Similarity
• Grouping

Page 4: Clustering part 1

Formal Definition

• Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

Page 5: Clustering part 1

Notion of a Cluster can be Ambiguous

How many clusters? The same points can plausibly be grouped in several ways.

[Panels: Two Clusters, Four Clusters, Six Clusters]

Page 6: Clustering part 1

Hierarchical Clustering: Example

Page 7: Clustering part 1

Hierarchical Clustering: Example Using Single Linkage

Page 8: Clustering part 1

Hierarchical Clustering: Forming Clusters

• Forming clusters from dendrograms

Page 9: Clustering part 1

Hierarchical Clustering

• Advantages
  • Dendrograms are great for visualization
  • Provides hierarchical relations between clusters
  • Shown to be able to capture concentric clusters
• Disadvantages
  • Not easy to define levels for clusters
  • Experiments have shown that other clustering techniques outperform hierarchical clustering
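The single-linkage procedure from the earlier slides can be sketched in plain Python. This is a minimal illustration, not the slides' own code; the five points are hypothetical, chosen so the merge order is easy to verify. The sequence of merge distances is exactly what a dendrogram records.

```python
import math

# Hypothetical points chosen so the merge order is easy to verify.
points = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (4.2, 0.0), (9.0, 0.0)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Start with every point in its own cluster, then repeatedly merge the
# two clusters whose closest members are nearest (single link).
clusters = [[p] for p in points]
merges = []                                   # merge distances, in order
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merges.append(d)
    merged = clusters[i] + clusters[j]
    del clusters[j], clusters[i]              # j > i, so delete j first
    clusters.append(merged)
```

Cutting the resulting dendrogram at a chosen height yields the flat clusters; as the slide notes, picking that level is the hard part.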

Page 10: Clustering part 1

How to Define Inter-Cluster Similarity

Similarity between two clusters can be measured in several ways:

• Single Link: the distance between the closest pair of points, one from each cluster
• Complete Link: the distance between the farthest pair of points
• Average Link: the mean distance over all pairs of points across the two clusters
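A sketch of the three linkage criteria, using two small hypothetical clusters A and B:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two hypothetical clusters, used only to illustrate the three criteria.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

# All pairwise distances between the clusters.
pairs = [dist(p, q) for p in A for q in B]

single_link   = min(pairs)               # closest pair of points
complete_link = max(pairs)               # farthest pair of points
average_link  = sum(pairs) / len(pairs)  # mean over all pairs
```

Single link tends to produce elongated, chain-like clusters, while complete link favors compact ones; average link is a compromise.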


Page 14: Clustering part 1

Common Similarity Measures

• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common choices include:

1. The Euclidean distance (also called 2-norm distance):
   d(p, q) = sqrt( (p1 - q1)^2 + ... + (pn - qn)^2 )

2. The Manhattan distance (also called taxicab norm or 1-norm):
   d(p, q) = |p1 - q1| + ... + |pn - qn|
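Both measures are straightforward to compute; a small sketch with two hypothetical 2-D points:

```python
import math

# Two hypothetical 2-D points used only for illustration.
p, q = (1.0, 1.0), (4.0, 5.0)

# Euclidean (2-norm): square root of the summed squared differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Manhattan (1-norm / taxicab): sum of the absolute differences.
manhattan = sum(abs(a - b) for a, b in zip(p, q))
```

For the differences (3, 4) this gives a Euclidean distance of 5.0 and a Manhattan distance of 7.0.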

Page 15: Clustering part 1

A simple example showing the implementation of the k-means algorithm (using K = 2)

Page 16: Clustering part 1

Step 1: Initialization

Randomly choose two initial centroids (K = 2) for the two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Page 17: Clustering part 1

Step 2:

• Assigning each object to its nearest centroid, we obtain two clusters: {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are recomputed as the means of the clusters (shown in the slide).

Page 18: Clustering part 1

Step 3:

• Using these centroids, we compute the Euclidean distance of each object, as shown in the table.
• The new clusters are {1, 2} and {3, 4, 5, 6, 7}.
• The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Page 19: Clustering part 1

Step 4:

• The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
• Since there is no change in the clusters, the algorithm halts; the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.
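The seven data points themselves are not printed in the transcript. As an assumption, the sketch below uses the classic seven-point data set that reproduces exactly the centroids the slides report, m1 = (1.25, 1.5) and m2 = (3.9, 5.1):

```python
import math

# Assumed data: not printed in the transcript, but chosen because it
# reproduces the centroids the slides report.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [(sum(x for x, _ in c) / len(c),
                          sum(y for _, y in c) / len(c)) for c in clusters]
        if new_centroids == centroids:   # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Same initialization as Step 1: m1 = (1.0, 1.0), m2 = (5.0, 7.0).
centroids, clusters = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
```

Run with the slide's initialization, this passes through the intermediate clusters {1, 2, 3} and {4, 5, 6, 7} and halts at {1, 2} and {3, 4, 5, 6, 7} with centroids (1.25, 1.5) and (3.9, 5.1), matching Steps 2 through 4.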

Page 20: Clustering part 1

PLOT

[Scatter plot of the resulting clusters]

Page 21: Clustering part 1

The same example with K = 3.

[Plots showing Step 1 and Step 2]

Page 22: Clustering part 1

PLOT

[Scatter plot of the resulting clusters]

Page 23: Clustering part 1

Two different K-means Clustering

[Three scatter plots over x ∈ [-2, 2], y ∈ [0, 3]: Original Points, Optimal Clustering, and Sub-optimal Clustering]

Page 24: Clustering part 1

Importance of Choosing Initial Centroids

[Six scatter plots over x ∈ [-2, 2], y ∈ [0, 3] showing the cluster assignments at Iterations 1 through 6]

Page 25: Clustering part 1

Importance of Choosing Initial Centroids

[Five scatter plots over x ∈ [-2, 2], y ∈ [0, 3] showing the cluster assignments at Iterations 1 through 5]

Page 26: Clustering part 1

Clustering Non-clustered Data

Page 27: Clustering part 1

Getting Stuck In A Local Minimum
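The effect can be reproduced deterministically. The sketch below uses four hypothetical points forming two natural left/right groups and two hand-picked initializations, comparing the sum of squared errors (SSE) at convergence:

```python
import math

# Four hypothetical points: two natural groups, left and right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

def kmeans(points, centroids, iters=50):
    # Minimal k-means; assumes no cluster ever becomes empty.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) for c in clusters]
    # SSE: total squared distance of points to their assigned centroids.
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, c in enumerate(clusters) for p in c)
    return centroids, sse

# A good initialization separates the two natural groups...
_, sse_good = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
# ...while a poor one splits top from bottom and stays stuck there:
# every point is already nearest its centroid, so nothing changes.
_, sse_bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
```

Both runs converge, but to different local minima; this is why k-means is usually run several times from different random initializations and the lowest-SSE result kept.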

Page 28: Clustering part 1

Can k-means Handle Non-spherical Clusters?

Page 29: Clustering part 1

…Maybe not.

Page 30: Clustering part 1

Let’s Try Single Linkage Hierarchical Clustering

Page 31: Clustering part 1

K-means with Polar Coordinates
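The slides do not include code for this, but the idea can be sketched as follows: for concentric clusters, converting each point to polar coordinates makes the radius alone separate the rings, so even 1-D k-means on r with K = 2 succeeds. The ring data below is hypothetical:

```python
import math

# Hypothetical concentric-ring data: plain k-means in (x, y) cannot
# separate the rings, but in polar coordinates the radius alone can.
def ring(radius, n):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

points = ring(1.0, 12) + ring(5.0, 12)

# Cartesian-to-polar transform; only the radius r is needed here.
radii = [math.hypot(x, y) for x, y in points]

# 1-D k-means on r with K = 2 and a simple deterministic start.
centroids = [min(radii), max(radii)]
for _ in range(20):
    clusters = [[], []]
    for r in radii:
        idx = 0 if abs(r - centroids[0]) <= abs(r - centroids[1]) else 1
        clusters[idx].append(r)
    centroids = [sum(c) / len(c) for c in clusters]
```

The centroids settle at the two ring radii, recovering exactly the concentric clusters that k-means in Cartesian coordinates could not.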

Page 32: Clustering part 1