
9/03 Data Mining – Clustering G Dong (WSU)

4. Clustering Methods

Concepts

Partitional (k-Means, k-Medoids)

Hierarchical (Agglomerative & Divisive, COBWEB)

Density-based (DBSCAN, CLIQUE)

Large size data (STING, BIRCH, CURE)


The Clustering Problem

• The clustering problem is about grouping a set of data tuples into a number of clusters. Data in the same cluster are highly similar to each other and data in different clusters are highly different from each other.

• About clusters

– Inter-cluster distance maximization

– Intra-cluster distance minimization

• Clustering vs. classification

– Which one is more difficult? Why?

– There are various possible ways of clustering; which way is the best?


Different ways of representing clusters

• Division with boundaries

• Venn diagram or spheres

• Probabilistic

• Dendrograms

• Trees

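• Rules

[Figure: probabilistic cluster assignment table; rows are instances I1, I2, …, In, columns are clusters 1, 2, 3, with example membership probabilities 0.5, 0.2, 0.3]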


Major Categories of Algorithms

• Partitioning: Divide into k partitions (k fixed); regroup to get better clustering.

• Hierarchical: Divide into different numbers of partitions in layers, either merging (bottom-up) or dividing (top-down).

• Density-based: Continue to grow a cluster as long as the density of the cluster exceeds a threshold.

• Grid-based: First divide space into grids, then perform clustering on the grids.


k-Means

• Algorithm

1. Given k

2. Randomly pick k instances as the initial centers

3. Assign each of the remaining instances to the closest of the k clusters

4. Recalculate the mean of each cluster

5. Repeat 3 & 4 until the means don’t change

• How good the clusters are

– Initial and final clusters

– Within-cluster variation: the sum of diff(x, mean)^2 over all instances x

– Why don’t we consider inter-cluster distance?


Example

• For simplicity, 1-dimensional objects and k=2.

• Objects: 1, 2, 5, 6, 7

• K-means:

– Randomly select 5 and 6 as initial centroids;

– => Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5

– => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6

– => no change.

– Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
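The example above can be replayed in a few lines. This is a minimal sketch, assuming 1-dimensional data and absolute difference as the distance; the function names `kmeans_1d` and `within_cluster_variation` are illustrative and do not come from the slides. Unlike step 3 of the algorithm, it assigns every instance (including the initial centers) in each pass, which gives the same clusters because a center is always closest to itself.

```python
def kmeans_1d(points, centers, max_iter=100):
    """k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: recompute each mean (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # means did not change: converged
            break
        centers = new_centers
    return clusters, centers


def within_cluster_variation(clusters, centers):
    """Sum of squared differences between each point and its cluster mean."""
    return sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)


clusters, centers = kmeans_1d([1, 2, 5, 6, 7], centers=[5, 6])
print(clusters, centers, within_cluster_variation(clusters, centers))
```

With the initial centroids 5 and 6, the run passes through {1,2,5} and {6,7} before settling on {1,2} and {5,6,7}, matching the steps listed on the slide.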


Discussions

• Limitations:

– Means cannot be defined for categorical attributes;

– Choice of k;

– Sensitive to outliers;

– Crisp clustering

• Variants of k-means exist:

– Using modes to deal with categorical attributes

• How about distance measures?

• Is it similar to or different from k-NN?

– With and without learning


k-Medoids

• The k-Means algorithm is sensitive to outliers

– Is this true? How to prove it?

• Medoid – the most centrally located point in a cluster, used as a representative point of the cluster.

• In contrast, a centroid is not necessarily an instance of the cluster.

• An example

[Figure: example data set with the initial medoids marked]


Partition Around Medoids

• PAM:

1. Given k

2. Randomly pick k instances as initial medoids

3. Assign each instance to the nearest medoid

4. Calculate the objective function

• the sum of dissimilarities of all instances to their nearest medoids

5. Randomly select an instance y

6. Swap some medoid x by y if the swap reduces the objective function

7. Repeat (3-6) until no change
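The PAM steps above can be sketched as follows. This is an illustrative variant, not the slide's exact procedure: to make the result reproducible it replaces the random choice of y in step 5 with a scan over every (medoid, non-medoid) pair, and it assumes 1-dimensional data with absolute difference as the dissimilarity. The names `total_cost` and `pam` are hypothetical.

```python
def total_cost(points, medoids, dist):
    """Objective: sum of dissimilarities of all instances to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, initial_medoids, dist=lambda a, b: abs(a - b)):
    medoids = list(initial_medoids)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:                      # step 7: repeat until no swap helps
        improved = False
        for i in range(len(medoids)):    # medoid x to swap out
            for y in points:             # candidate y to swap in
                if y in medoids:
                    continue
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:          # step 6: keep the swap only if it reduces the objective
                    medoids, best, improved = candidate, cost, True
    return sorted(medoids), best


medoids, cost = pam([1, 2, 5, 6, 7], initial_medoids=[5, 6])
print(medoids, cost)
```

Because every candidate swap re-evaluates the objective over all instances, each pass costs far more than one k-means update, which is the cost trade-off the comparison slide asks about.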


k-Means and k-Medoids

• The key difference lies in how they update means or medoids

• Both require distance calculation and reassignment of instances

• Time complexity

– Which one is more costly?

• Dealing with outliers

[Figure: a cluster with an outlier 100 units away]
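A small numeric check illustrates the outlier point. The data values here are made up for illustration (a tight 1-dimensional cluster plus one point 100 units beyond its largest member), and `mean` and `medoid` are hypothetical helper names.

```python
def mean(cluster):
    return sum(cluster) / len(cluster)


def medoid(cluster):
    """The member with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(abs(c - x) for x in cluster))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [105]   # assumed outlier, 100 units away from the largest value

print(mean(cluster), medoid(cluster))            # representative points without the outlier
print(mean(with_outlier), medoid(with_outlier))  # the mean is dragged away, the medoid is not
```

The mean moves from 3 to 20, outside the bulk of the cluster, while the medoid remains an actual central instance.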


EM (Expectation Maximization)

• Moving away from crisp clusters as in k-Means by allowing an instance to belong to several clusters

• Finite mixtures – a statistical clustering model

– A mixture is a set of k probability distributions, representing k clusters

– The simplest finite mixture: one feature with a Gaussian distribution per cluster

– When k=2, we need to estimate 5 parameters: two means μA and μB, two standard deviations σA and σB, and pA, where pB = 1 - pA

• EM

– Estimate the parameters using the instances

– Maximize the overall likelihood that the data came from this mixture
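For the two-cluster, one-feature case described above, EM can be sketched as below. This is a minimal illustration under stated assumptions: the starting values, the fixed iteration count, the small floor on the standard deviations, and the sample data are all choices made for this example, and `em_two_gaussians` is a hypothetical name.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_two_gaussians(data, mu_a, mu_b, sigma_a=1.0, sigma_b=1.0, p_a=0.5, iterations=50):
    for _ in range(iterations):
        # E-step: probability that each instance belongs to cluster A (soft assignment).
        w = []
        for x in data:
            a = p_a * gaussian_pdf(x, mu_a, sigma_a)
            b = (1 - p_a) * gaussian_pdf(x, mu_b, sigma_b)
            w.append(a / (a + b))
        # M-step: re-estimate the five parameters from the weighted instances.
        wa = sum(w)
        wb = len(data) - wa
        mu_a = sum(wi * x for wi, x in zip(w, data)) / wa
        mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / wb
        sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / wa)
        sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / wb)
        # Floor the deviations so a cluster cannot collapse onto a single point.
        sigma_a, sigma_b = max(sigma_a, 1e-3), max(sigma_b, 1e-3)
        p_a = wa / len(data)
    return mu_a, sigma_a, mu_b, sigma_b, p_a


data = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5]
mu_a, sigma_a, mu_b, sigma_b, p_a = em_two_gaussians(data, mu_a=2.0, mu_b=8.0)
print(mu_a, sigma_a, mu_b, sigma_b, p_a)
```

The weights `w` are what make the clustering non-crisp: each instance contributes to both clusters in proportion to its membership probability.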


Agglomerative

• Each object is viewed as a cluster (bottom up).

• Repeat until the number of clusters is small enough

– Choose a closest pair of clusters

– Merge the two into one

• Defining “closest”: Centroid (mean of cluster) distance, (average) sum of pairwise distance, …

– Refer to the Evaluation part

• A dendrogram is a tree that shows the clustering process.


Dendrogram

• Cluster 1, 2, 4, 5, 6, 7 into two clusters (centroid distance)

[Figure: dendrogram over the objects 1, 2, 4, 5, 6, 7]
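The merge order behind such a dendrogram can be traced with a short sketch. It assumes 1-dimensional objects, centroid distance, and a tie-breaking rule (the first closest pair found wins) that the slides do not specify; `agglomerate_centroid` is a hypothetical name.

```python
def agglomerate_centroid(values, target_clusters):
    """Merge the two clusters with the closest centroids until target_clusters remain."""
    clusters = [[v] for v in values]     # every object starts as its own cluster
    merges = []                          # (left, right, distance): the dendrogram, bottom-up
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:   # ties keep the first pair found
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerate_centroid([1, 2, 4, 5, 6, 7], target_clusters=2)
for left, right, d in merges:
    print(left, "+", right, "at centroid distance", d)
print(clusters)
```

Under these assumptions the two remaining clusters are {1, 2} and {4, 5, 6, 7}; a different tie-breaking rule changes the order of the early merges but, for this data, not the final pair.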


An example to show different Links

• Single link

– Merge the nearest clusters measured by the shortest edge between the two

– (((A B) (C D)) E)

• Complete link

– Merge the nearest clusters measured by the longest edge between the two

– (((A B) E) (C D))

• Average link

– Merge the nearest clusters measured by the average edge length between the two

– (((A B) (C D)) E)

Pairwise distances:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: the five points A, B, C, D, E]
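The three results can be checked against the distance matrix with a generic agglomerative loop in which only the link function changes. This is a sketch under one assumption the slide leaves open: ties (A-B and C-D are both at distance 1) are broken by taking the first pair found. The names `D`, `dist`, `leaves` and `agglomerate` are illustrative.

```python
# Upper triangle of the distance matrix from the slide.
D = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 5,
    ("D", "E"): 3,
}


def dist(x, y):
    return D[(x, y)] if (x, y) in D else D[(y, x)]


def leaves(tree):
    """Flatten a nested-tuple cluster into its member labels."""
    return [tree] if isinstance(tree, str) else [x for sub in tree for x in leaves(sub)]


def agglomerate(labels, link):
    """link reduces the list of edge lengths between two clusters to one number."""
    clusters = list(labels)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                edges = [dist(x, y) for x in leaves(clusters[i]) for y in leaves(clusters[j])]
                d = link(edges)
                if best is None or d < best[0]:   # ties keep the first pair found
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = (clusters[i], clusters[j])  # merged cluster takes the left slot
        del clusters[j]
    return clusters[0]


single = agglomerate("ABCDE", min)                            # shortest edge
complete = agglomerate("ABCDE", max)                          # longest edge
average = agglomerate("ABCDE", lambda e: sum(e) / len(e))     # average edge length
print(single, complete, average, sep="\n")
```

The decisive step is the third merge: between (A B) and (C D) the shortest edge is 2, the average is 2.5, and the longest is 4, while (A B) to E is 3 under all three measures, so only complete link attaches E first.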


Divisive

• All instances belong to one cluster (top-down)

• To find an optimal division at each layer (especially the top one) is computationally prohibitive.

• One heuristic method is based on the Minimum Spanning Tree (MST) algorithm

– Connecting all instances with MST (O(N^2))

– Repeatedly cut out the longest edges at each iteration until some stopping criterion is met or until one instance remains in each cluster.
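The MST heuristic can be sketched as follows. The stopping criterion used here (stop once k clusters remain) is an assumption for the example, as are the 1-dimensional data and the names `mst_edges` and `divisive_mst`. Cutting the k-1 longest edges at once gives the same result as cutting the longest remaining edge repeatedly.

```python
def mst_edges(points, dist):
    """Prim's algorithm on the complete graph: O(N^2) distance evaluations."""
    # best[v] = (distance from v to the tree so far, tree vertex it would attach to)
    best = {v: (dist(points[0], points[v]), 0) for v in range(1, len(points))}
    edges = []
    while best:
        v = min(best, key=lambda u: best[u][0])   # closest vertex not yet in the tree
        d, u = best.pop(v)
        edges.append((d, u, v))
        for w in best:                            # v may now offer a shorter link to w
            dw = dist(points[v], points[w])
            if dw < best[w][0]:
                best[w] = (dw, v)
    return edges


def divisive_mst(points, k, dist=lambda a, b: abs(a - b)):
    """Drop the k-1 longest MST edges and return the connected components."""
    edges = sorted(mst_edges(points, dist))       # shortest edges first
    kept = edges[:len(points) - k]
    parent = list(range(len(points)))             # union-find over the kept edges

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values())


result = divisive_mst([1, 2, 4, 5, 6, 7], k=2)
print(result)
```

On this data the longest MST edge is the one between 2 and 4, so the first cut separates {1, 2} from {4, 5, 6, 7}.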


COBWEB

• Building a conceptual hierarchy incrementally

• Each cluster has a probabilistic description

• Category Utility:

$\sum_k \sum_i \sum_j P(f_i = v_{ij}) \, P(f_i = v_{ij} \mid c_k) \, P(c_k \mid f_i = v_{ij})$

– All categories $c_k$, all features $f_i$, all feature values $v_{ij}$

• It attempts to maximize both the probability that two objects in the same category have values in common and the probability that objects in different categories will have different property values
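A short rewriting makes the category utility sum easier to read. By Bayes' rule, $P(f_i = v_{ij}) \, P(c_k \mid f_i = v_{ij}) = P(c_k) \, P(f_i = v_{ij} \mid c_k)$, so the triple sum is a weighted sum of squared within-category value probabilities (same symbols as above; this step is added here for illustration and is not on the slide):

```latex
\sum_k \sum_i \sum_j P(f_i = v_{ij}) \, P(f_i = v_{ij} \mid c_k) \, P(c_k \mid f_i = v_{ij})
  = \sum_k P(c_k) \sum_i \sum_j P(f_i = v_{ij} \mid c_k)^2
```

The right-hand side is large when, inside each category, a few feature values are highly probable, which is the sense in which objects in the same category share values.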


A tree of clusters produced by COBWEB:


• Processing one instance at a time by choosing the best among

– Placing the instance in the best existing category

– Adding a new category containing only the instance

– Merging of two existing categories into a new one and adding the instance to that category

– Splitting of an existing category into two and placing the instance in the best new resulting category

[Figure: merge and split operations on a subtree with nodes Grandparent, Parent, Child 1, Child 2]


Cobweb Demo

http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaae317b2bebb


Density-based

• DBSCAN – Density-Based Spatial Clustering of Applications with Noise

• It grows regions with sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise.

– Many existing clustering algorithms find only spherical clusters

• DBSCAN defines a cluster as a maximal set of density-connected points.


• Defining density and connection

– ε-neighborhood of an object x (core object) (M, P, Q)

– MinPts: minimum number of objects within an ε-neighborhood (say, 3)

– directly density-reachable (Q from M, M from P)

– density-reachable (Q from P, P not from Q) [asymmetric]

– density-connected (O, R, S) [symmetric] <for border points>

• What is the relationship between DR and DC?

[Figure: points Q, M, P, S, R, O with their ε-neighborhoods]


• Clustering with DBSCAN

– Search for clusters by checking the ε-neighborhood of each instance x

– If the ε-neighborhood of x contains more than MinPts objects, create a new cluster with x as a core object

– Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters

– Terminate when no new point can be added to any cluster

• DBSCAN is sensitive to the density thresholds (ε and MinPts), but it is many times faster than CLARANS

• Time complexity: O(N log N) if a spatial index is used, O(N^2) otherwise
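The procedure above can be sketched without a spatial index, which gives the O(N^2) variant. Two conventions here are assumptions rather than something the slides fix: an object counts as a core object when its ε-neighborhood holds at least `min_pts` objects including itself, and noise is labeled -1. The function name `dbscan` and the sample data are illustrative.

```python
def dbscan(points, eps, min_pts, dist=lambda a, b: abs(a - b)):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Brute-force eps-neighborhood; a spatial index would replace this scan.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:        # not a core object; may become a border point later
            labels[i] = -1
            continue
        cluster += 1                    # i is a core object: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:         # earlier marked as noise: actually a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:      # j is also a core object, so keep expanding
                queue.extend(nj)
    return labels


labels = dbscan([1, 2, 3, 10, 11, 12, 50], eps=1.5, min_pts=3)
print(labels)
```

In this run the points 1 and 3 are border points of the first cluster (reachable from the core object 2 but not core themselves), and 50 stays noise.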


Dealing with Large Data

• Key ideas

– Reducing the number of instances to be maintained, and yet maintaining the distribution

– Identifying relevant subspaces where clusters possibly exist

– Using summarized information to avoid repeated data access

• Sampling

– CLARA (Clustering LARge Applications) working on samples instead of the whole data

– CLARANS (Clustering Large Applications based on RANdomized Search)


• Grid: STING (STatistical INformation Grid)

– Statistical parameters of higher-level cells can easily be computed from those of lower-level cells

• Attribute-independent: count

• Attribute-dependent: mean, standard deviation, min, max

• Type of distribution: normal, uniform, exponential, or unknown

– Irrelevant cells can be removed
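To see why the higher-level parameters are cheap to obtain, the sketch below derives a parent cell's count, mean, standard deviation, min and max from its children alone, without touching the raw data. It assumes population standard deviations and non-empty child cells; the dictionary layout and the name `merge_cells` are made up for this example, and the distribution type is left out.

```python
import math


def merge_cells(children):
    """Combine the (count, mean, std, min, max) of child cells into the parent's statistics."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # For each child, E[x^2] = std^2 + mean^2 (population standard deviation assumed).
    second_moment = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {
        "count": n,
        "mean": mean,
        "std": math.sqrt(max(second_moment - mean ** 2, 0.0)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }


# Two child cells summarizing the values [1, 3] and [5, 7].
children = [
    {"count": 2, "mean": 2.0, "std": 1.0, "min": 1, "max": 3},
    {"count": 2, "mean": 6.0, "std": 1.0, "min": 5, "max": 7},
]
parent = merge_cells(children)
print(parent)
```

The parent matches what a direct pass over [1, 3, 5, 7] would give (mean 4, variance 5), which is the property that lets queries work top-down on summaries.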


Representatives

• BIRCH using Clustering Feature (CF) and CF tree

– A clustering feature is a triplet (N, LS, SS) summarizing a sub-cluster of instances

• N – the number of instances, LS – linear sum, SS – square sum

– Two thresholds: branching factor (the max number of children per non-leaf node) and diameter threshold

– Two phases

1. Build an initial in-memory CF tree

2. Apply a clustering algorithm to cluster the leaf nodes in the CF tree

• CURE (Clustering Using REpresentatives) is another example
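The usefulness of the (N, LS, SS) triplet comes from two properties: it is additive when sub-clusters merge, and the centroid and diameter can be computed from it directly. The sketch below shows both for 1-dimensional instances; the helper names are hypothetical, and the diameter is taken as the root of the average squared pairwise distance, which is the usual BIRCH definition.

```python
import math


def cf_of(points):
    """Clustering feature (N, LS, SS) of a list of 1-D instances."""
    return (len(points), sum(points), sum(x * x for x in points))


def cf_merge(a, b):
    """CFs are additive: merging two sub-clusters adds the triplets component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n


def cf_diameter(cf):
    """Root of the average squared pairwise distance, computed from the CF alone."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max((2 * n * ss - 2 * ls * ls) / (n * (n - 1)), 0.0))


merged = cf_merge(cf_of([1, 2]), cf_of([5, 6, 7]))
print(merged, cf_centroid(merged), cf_diameter(merged))
```

When an instance arrives at a leaf, a test of this kind (would the merged CF stay under the diameter threshold?) decides whether it is absorbed or starts a new entry, without revisiting the instances already summarized.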


CF Tree

[Figure: CF tree with a root, non-leaf nodes, and leaf nodes; each entry pairs a CF with a child pointer (CF1/child1, CF2/child2, …), and leaf nodes are chained by prev/next pointers; B = 7, L = 6]

B: Branching factor

L: Threshold: max diameter of subclusters at leaf nodes


• Taking advantage of the property of density

– If it’s dense in higher dimensional subspaces, it should be dense in some lower dimensional subspaces

– CLIQUE (CLustering In QUEst)

• With high dimensional data, there are many void subspaces

• Using the property identified, we can start with dense lower dimensional data

• CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
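The bottom-up idea can be illustrated for the first two levels only. This is a simplified sketch, not the full CLIQUE algorithm: it uses an equal-width grid over non-negative coordinates, finds dense 1-dimensional units, and then counts only those 2-dimensional units whose projections are both dense. The names `dense_units_1d` and `dense_units_2d`, the grid width and the threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def dense_units_1d(points, width, threshold):
    """For each dimension, the grid intervals holding at least threshold points."""
    n_dims = len(points[0])
    dense = {}
    for d in range(n_dims):
        counts = Counter(int(p[d] // width) for p in points)
        dense[d] = {cell for cell, n in counts.items() if n >= threshold}
    return dense


def dense_units_2d(points, width, threshold, dense_1d):
    """Dense 2-D units, counting only candidates whose 1-D projections are dense."""
    result = {}
    for d1, d2 in combinations(sorted(dense_1d), 2):
        if not dense_1d[d1] or not dense_1d[d2]:
            continue                     # a void dimension prunes the whole subspace
        counts = Counter()
        for p in points:
            cell = (int(p[d1] // width), int(p[d2] // width))
            if cell[0] in dense_1d[d1] and cell[1] in dense_1d[d2]:
                counts[cell] += 1
        found = {cell for cell, n in counts.items() if n >= threshold}
        if found:
            result[(d1, d2)] = found
    return result


# Four points clustered in dimensions 0 and 1 but spread out in dimension 2.
points = [(1.1, 1.2, 0.5), (1.3, 1.4, 3.5), (1.5, 1.6, 6.5), (1.7, 1.8, 9.5), (5.5, 8.5, 2.5)]
dense_1d = dense_units_1d(points, width=1.0, threshold=3)
dense_2d = dense_units_2d(points, width=1.0, threshold=3, dense_1d=dense_1d)
print(dense_1d, dense_2d)
```

Because dimension 2 has no dense interval, the subspaces (0, 2) and (1, 2) are never examined, which is how the density property cuts down the search in high-dimensional data.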


Drawbacks of Distance-Based Method

• Drawbacks of square-error based clustering methods

– Consider only one point as representative of a cluster

– Good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated


Chameleon

• A Hierarchical Clustering Algorithm Using Dynamic Modeling

– Observations on the weakness of pure distance based methods

• Basic steps:

– Build K nearest neighbor graph

– Partition the graph

– Merge the “strongly connected partitions,” in terms of strength of connections between partitions


Summary

• There are many clustering algorithms

• Good clustering algorithms maximize inter-cluster dissimilarity and intra-cluster similarity

• Without prior knowledge, it is difficult to choose the best clustering algorithm.

• Clustering is an important tool for outlier analysis.

