CS570: Introduction to Data Mining
Scalable Clustering Methods: BIRCH and Others
Reading: Chapter 10.3 Han, Chapter 9.5 Tan
Cengiz Gunay, Ph.D.
Slides courtesy of Li Xiong, Ph.D.,
©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann, and
©2006 Tan, Steinbach & Kumar. Introd. Data Mining., Pearson. Addison Wesley.
October 7, 2013
Previously: Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram, a tree-like diagram
Clustering obtained by cutting at desired level
Do not have to assume any particular number of clusters
May correspond to meaningful taxonomies
[Figure: dendrogram over six points, with merge heights (0.05, 0.1, 0.15, 0.2) on the vertical axis and the corresponding point groupings shown in the plane]
Previously: Major Weaknesses of Hierarchical Clustering
Do not scale well (N: number of points)
Space complexity: O(N²)
Time complexity: O(N³) in general; O(N² log N) for some cases/approaches
Cannot undo what was done previously
Quality varies in terms of distance measures
MIN (single link): susceptible to noise/outliers
MAX/GROUP AVERAGE: may not work well with non-globular clusters
How to improve?
Scalable Hierarchical Clustering Methods
Combines hierarchical and partitioning approaches
Recent methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
CURE (1998): uses representative points for inter-cluster distance
ROCK (1999): clustering categorical data by neighbor and link analysis
CHAMELEON (1999): hierarchical clustering using dynamic modeling on graphs
BIRCH – A Tree-based Approach
BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD’96)
SIGMOD 10-year "test of time" award
Main ideas:
Incremental (does not need the whole dataset)
Summarizes using an in-memory clustering feature
Combines hierarchical clustering for microclustering and partitioning for macroclustering
Features:
Scales linearly: single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data records.
Cluster Statistics
Given a cluster of instances
Centroid: x0 = (1/n) Σi xi
Radius: average distance from member points to centroid, R = sqrt( (1/n) Σi (xi - x0)² )
Diameter: average pair-wise distance within a cluster, D = sqrt( Σi Σj (xi - xj)² / (n(n-1)) )
Intra-Cluster Distance
Given two clusters C1 and C2 with centroids x01 and x02:
Centroid Euclidean distance: D0 = || x01 - x02 ||
Centroid Manhattan distance: D1 = Σk | x01(k) - x02(k) |
Average inter-cluster distance: D2 = sqrt( Σi∈C1 Σj∈C2 (xi - xj)² / (n1 n2) )
Clustering Feature (CF) in BIRCH
[Figure: five 2-D points plotted on a 10×10 grid]
CF = ⟨N, LS, SS⟩: number of points N, linear sum LS = Σi xi, square sum SS = Σi xi²
Points: (3,4), (2,6), (4,5), (4,7), (3,8)
CF = (5, (16,30), (54,190))
What is it good for?
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to calculate statistics about the cluster and intra-cluster distances
Additivity theorem allows us to merge sub-clusters incrementally & consistently
Cluster Statistics from CF
CF = ⟨n, LS, SS⟩, where n = number of points, LS = Σi xi (linear sum), SS = Σi xi² (square sum)
Radius (average distance to centroid):
R = sqrt( (1/n) Σi (xi - x0)² ) = sqrt( SS/n - (LS/n)² )
Diameter (average pairwise distance):
D = sqrt( Σi Σj (xi - xj)² / (n(n-1)) ) = sqrt( (2n·SS - 2·LS²) / (n(n-1)) )
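A minimal Python sketch of these statistics (not from the slides; the helper names are illustrative): it builds CF = (n, LS, SS) for the five-point example from the earlier CF slide, derives radius and diameter from the CF alone, and merges two CFs using the additivity theorem.

def clustering_feature(points):
    """CF = (n, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    dims = range(len(points[0]))
    LS = [sum(p[k] for p in points) for k in dims]
    SS = [sum(p[k] ** 2 for p in points) for k in dims]
    return n, LS, SS

def merge_cf(cf1, cf2):
    """Additivity theorem: the CF of merged sub-clusters is the component-wise sum."""
    n1, LS1, SS1 = cf1
    n2, LS2, SS2 = cf2
    return n1 + n2, [a + b for a, b in zip(LS1, LS2)], [a + b for a, b in zip(SS1, SS2)]

def radius(cf):
    """R = sqrt(SS/n - (LS/n)^2), summed over dimensions (average distance to centroid)."""
    n, LS, SS = cf
    return sum(ss / n - (ls / n) ** 2 for ls, ss in zip(LS, SS)) ** 0.5

def diameter(cf):
    """D = sqrt((2n*SS - 2*LS^2) / (n(n-1))), summed over dimensions (average pairwise distance)."""
    n, LS, SS = cf
    return sum((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)) for ls, ss in zip(LS, SS)) ** 0.5

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)                        # (5, [16, 30], [54, 190]) -- matches CF = (5, (16,30), (54,190))
print(radius(cf), diameter(cf))
print(merge_cf(cf, clustering_feature([(9, 9)])))  # CF of the union, without revisiting the points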
Hierarchical CF-Tree
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: maximum number of children.
Threshold: max diameter of sub-clusters stored at the leaf nodes
Why a tree?
The CF Tree Structure
[Figure: CF-tree structure. The root holds CF entries CF1…CF6, each with a child pointer; non-leaf nodes hold CF entries CF1…CF5 with child pointers; leaf nodes hold CF entries (CF1, CF2, …) and are chained together by prev/next pointers.]
CF-Tree Insertion
Traverse down from root (top-down), find the appropriate leaf
Follow the "closest"-CF path, w.r.t. intra-cluster distance measures
Modify the leaf
If the closest CF leaf entry cannot absorb the point, make a new CF entry.
If there is no room for the new entry, split the leaf node (splits may propagate to parent nodes)
Traverse back & up
Updating CFs on the path or splitting nodes
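A much simplified sketch of the leaf-level step (illustrative names; no actual tree nodes or splits): a point is absorbed by the closest CF entry only if the entry's diameter would stay within the threshold T, otherwise a new entry is created, which is the behavior the threshold parameter controls.

class CFEntry:
    """One leaf CF entry: CF = (n, LS, SS) for the points it has absorbed."""
    def __init__(self, point):
        self.n, self.LS, self.SS = 1, list(point), [x * x for x in point]

    def centroid(self):
        return [ls / self.n for ls in self.LS]

    def diameter_if_added(self, point):
        # diameter of the entry if `point` were absorbed, computed from the CF alone
        n = self.n + 1
        LS = [a + b for a, b in zip(self.LS, point)]
        SS = [a + b * b for a, b in zip(self.SS, point)]
        return sum((2 * n * ss - 2 * ls * ls) / (n * (n - 1)) for ls, ss in zip(LS, SS)) ** 0.5

    def absorb(self, point):
        self.n += 1
        self.LS = [a + b for a, b in zip(self.LS, point)]
        self.SS = [a + b * b for a, b in zip(self.SS, point)]

def insert_point(entries, point, T):
    """Absorb into the closest entry if its diameter stays <= T, else open a new entry."""
    if entries:
        closest = min(entries, key=lambda e: sum((c - x) ** 2 for c, x in zip(e.centroid(), point)))
        if closest.diameter_if_added(point) <= T:
            closest.absorb(point)
            return
    entries.append(CFEntry(point))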
BIRCH Overview
The Algorithm: BIRCH
Phase 1: Scan database to build an initial in-memory CF-tree; subsequent phases become fast, accurate, less order sensitive
Phase 2: Condense data (optional): rebuild the CF-tree with a larger T (why?)
Phase 3: Global clustering: use an existing clustering algorithm on the CF entries; helps fix the problem where natural clusters span nodes
Phase 4: Cluster refining (optional): do additional passes over the dataset and reassign data points to the closest centroids from Phase 3
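If you just want to use BIRCH rather than implement it, scikit-learn ships an implementation whose parameters map onto these phases: threshold is T, branching_factor bounds the CF-tree fan-out, and n_clusters controls the Phase 3 global clustering of the leaf entries. A small usage sketch (the data set and parameter values are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)   # T, fan-out, Phase-3 clusters
labels = model.fit_predict(X)            # a single scan builds the CF-tree, then global clustering
print(len(model.subcluster_centers_))    # number of leaf micro-clusters (CF entries)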
BIRCH Summary
Main ideas:
Incremental (does not need the whole dataset)
Use in-memory clustering features to summarize
Use hierarchical clustering for microclustering and other clustering methods (e.g., partitioning) for macroclustering
Features:
Scales linearly: single scan and improves the quality with a few additional scans
Weaknesses:
Handles only numeric data
Sensitive to the order of the data records
Unnatural clusters because of the leaf node limit
Spherical clusters because of the diameter threshold (how to solve?)
CURE
CURE: Clustering Using REpresentatives
CURE: An Efficient Clustering Algorithm for Large Databases (1998) S Guha, R Rastogi, K Shim
Addresses potential problems with min, max, centroid distance based hierarchical clustering
Main ideas:
Use representative points for inter-cluster distance
Random sampling and partitioning
Features:
More robust to outliers
Better for non-spherical shapes and non-uniform sizes
See Tan Ch 9.5.3, pp. 635-639
CURE: Cluster Points
A number of points (e.g., 10+) represents a cluster
Representative points:
Start with the point farthest from the centroid
Add the point farthest from all points selected so far, and so on until k points are chosen
"Shrink" them toward the center of the cluster by a fraction (the shrink factor α)
Why shrink? Parallels to other methods?
Cluster similarity is the similarity of the closest pair of representative points from different clusters
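A minimal NumPy sketch of the representative-point selection just described (the function name and default values are illustrative, not from the CURE implementation): the point farthest from the centroid seeds the set, each further pick is the point farthest from the current set, and all picks are then shrunk toward the centroid by a fraction alpha.

import numpy as np

def representative_points(cluster, k=10, alpha=0.3):
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    # the point farthest from the centroid seeds the representative set
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(k, len(cluster)):
        # distance of every point to its nearest already-chosen representative
        dists = np.min(np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :], axis=2), axis=1)
        reps.append(cluster[np.argmax(dists)])
    # shrink toward the centroid: outlying representatives are pulled back into the cluster body
    return [p + alpha * (centroid - p) for p in reps]

Shrinking damps the influence of stray points, which is one reason CURE is more robust to outliers than single-link (MIN) clustering.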
Experimental Results: CURE
[Figures from Guha, Rastogi & Shim: clusters found by CURE compared with centroid-based and single-link hierarchical clustering]
CURE Cannot Handle Differing Densities
[Figure: original points vs. clusters found by CURE on data with differing densities]
So far only numerical data?
Clustering Categorical Data: The ROCK Algorithm
ROCK: RObust Clustering using linKs
S. Guha, R. Rastogi & K. Shim, Int. Conf. Data Eng. (ICDE) ’99
Major ideas
Use links to measure similarity/proximity
Sampling-based clustering
Features:
More meaningful clusters
Emphasizes interconnectivity but ignores proximity
Not in textbook, see paper above.
Similarity Measure in ROCK
Market basket data clustering
Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
Example: Two groups (clusters) of transactions
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2. <a, b, f, g> {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
Jaccard coefficient may lead to wrong clustering result:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
Sim(T1, T3) = |{a, b}| / |{a, b, c, f}| = 2/4 = 0.5
T1 and T2 belong to the same cluster C1 yet get a low similarity, while T1 and T3 belong to different clusters yet get a higher one
Link Measure in ROCK
Neighbor: Ti and Tj are neighbors if Sim(Ti, Tj) ≥ θ, a user-specified threshold
Links: # of common neighbors
Reminder:
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
Let C1: <a, b, c, d, e>, C2: <a, b, f, g>
Example:
link(T1, T2) = 4, since they have 4 common neighbors
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
link(T1, T3) = 3, since they have 3 common neighbors
{a, b, d}, {a, b, e}, {a, b, g}
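A small sketch that reproduces these link counts. The neighbor threshold θ is not stated on the slide; θ = 0.5 is an assumed value that, with the pair itself excluded from the common-neighbor count, yields link(T1, T2) = 4 and link(T1, T3) = 3 on this data.

from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

C1 = [set(c) for c in combinations("abcde", 3)]   # the 10 transactions of cluster C1
C2 = [set(c) for c in combinations("abfg", 3)]    # the 4 transactions of cluster C2
transactions = C1 + C2

def link(ti, tj, theta=0.5):
    """Number of common neighbors of ti and tj (the pair itself excluded)."""
    return sum(1 for t in transactions
               if t != set(ti) and t != set(tj)
               and jaccard(ti, t) >= theta and jaccard(tj, t) >= theta)

print(jaccard("abc", "cde"), jaccard("abc", "abf"))  # 0.2, 0.5
print(link("abc", "cde"), link("abc", "abf"))        # 4, 3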
Rock Algorithm
1. Obtain a sample of points from the data set
2. Compute the link value for each pair of points, using the Jaccard coefficient (and threshold θ) to determine neighbors
3. Perform an agglomerative (bottom-up) hierarchical clustering on the data using the “number of shared neighbors” as similarity measure
4. Assign the remaining points to the clusters that have been found
Cluster Merging: Limitations of Current Schemes
Existing schemes are static in nature:
MIN or CURE:
merge two clusters based on their closeness (or minimum distance)
GROUP-AVERAGE or ROCK:
merge two clusters based on their average connectivity
Limitations of Current Merging Schemes
[Figure: four example data sets (a), (b), (c), (d)]
Closeness schemes will merge (a) and (b)
Average connectivity schemes will merge (c) and (d)
Solution?
Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
Main properties are the relative closeness and relative inter-connectivity of the clusters
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
The merging scheme preserves self-similarity
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
Basic ideas:
A graph-based clustering approach
A two-phase algorithm:
Partitioning: cluster objects into a large number of small sub-clusters
Agglomerative hierarchical clustering: repeatedly combine sub-clusters
Measures the similarity based on a dynamic model
interconnectivity and closeness (proximity)
Features:
Handles clusters of arbitrary shapes, sizes, and density
Scales well
See Sections Han 10.3.4 and Tan 9.4.4
Graph-Based Clustering
Uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Fully connected proximity graph
MIN (single-link) and MAX (complete-link)
Sparsification reduces the data the graph algorithms must process (see the sketch after this list):
kNN: keep only edges to each point's k nearest neighbors
Threshold-based: keep only edges whose proximity exceeds a threshold
Clusters are connected components in the graph
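A short sketch of both sparsification schemes using scikit-learn's graph builders (the data and parameter values are arbitrary): keep only each point's k nearest neighbors, or keep only edges shorter than a radius, then read clusters off as connected components of the resulting graph.

import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph
from scipy.sparse.csgraph import connected_components

X = np.random.RandomState(0).rand(200, 2)

knn_graph = kneighbors_graph(X, n_neighbors=5, mode="distance")      # k-NN sparsification
eps_graph = radius_neighbors_graph(X, radius=0.1, mode="distance")   # threshold-based sparsification

# "Clusters are connected components in the graph"
n_clusters, labels = connected_components(eps_graph, directed=False)
print(n_clusters)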
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Chameleon: Steps
Preprocessing Step: Represent the Data by a Graph
Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors
Concept of neighborhood is captured dynamically (even if region is sparse)
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices
Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a “real” cluster
Chameleon: Steps (cont)
Phase 2: Use Hierarchical Agglomerative (bottom-up) Clustering to merge sub-clusters
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters
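For reference, the two ratios as defined in the CHAMELEON paper (the formulas are not on the slide; the notation is reconstructed here): EC_{C_i,C_j} is the sum of the weights of the edges cut when separating C_i from C_j, EC_{C_i} the edge cut that bisects C_i internally, and the bars denote the average weight of the corresponding edges.

% Relative interconnectivity and relative closeness (CHAMELEON)
RI(C_i, C_j) = \frac{\left| EC_{\{C_i, C_j\}} \right|}{\tfrac{1}{2}\left( \left| EC_{C_i} \right| + \left| EC_{C_j} \right| \right)}
\qquad
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\, \bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\, \bar{S}_{EC_{C_j}}}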
CHAMELEON (Clustering Complex Objects)
Chapter 7. Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Density-based methods
Other methods
Cluster evaluation
Outlier analysis
Summary