CS570: Introduction to Data Mining
Scalable Clustering Methods: BIRCH and Others
Reading: Chapter 10.3 Han, Chapter 9.5 Tan
Cengiz Gunay, Ph.D.
Slides courtesy of Li Xiong, Ph.D.,
©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann, and
©2006 Tan, Steinbach & Kumar. Introd. Data Mining., Pearson. Addison Wesley.
October 7, 2013
Previously: Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram, a tree-like diagram
Clustering obtained by cutting at desired level
Do not have to assume any particular number of clusters
May correspond to meaningful taxonomies
[Figure: dendrogram over six points, with merge heights (0.05, 0.1, 0.15, 0.2) on the vertical axis and the corresponding point groupings shown in the plane]
Previously: Major Weaknesses of Hierarchical Clustering
Do not scale well (N: number of points)
Space complexity: O(N²)
Time complexity: O(N³) in general; O(N² log N) for some cases/approaches
Cannot undo what was done previously
Quality varies in terms of distance measures
MIN (single link): susceptible to noise/outliers
MAX/GROUP AVERAGE: may not work well with non-globular clusters
How to improve?
Scalable Hierarchical Clustering Methods
Combines hierarchical and partitioning approaches
Recent methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
CURE (1998): uses representative points for inter-cluster distance
ROCK (1999): clustering categorical data by neighbor and link analysis
CHAMELEON (1999): hierarchical clustering using dynamic modeling on graphs
BIRCH – A Tree-based Approach
BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD’96)
SIGMOD 10-year "test of time" award
Main ideas:
Incremental (does not need the whole dataset)
Summarizes using an in-memory clustering feature
Combines hierarchical clustering for microclustering and partitioning for macroclustering
Features:
Scales linearly: single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data records.
Cluster Statistics
Given a cluster of instances
Centroid: x0 = (1/n) Σi xi
Radius: average distance from member points to centroid, R = sqrt( (1/n) Σi (xi - x0)² )
Diameter: average pair-wise distance within a cluster, D = sqrt( Σi Σj (xi - xj)² / (n(n-1)) )
Intra-Cluster Distance
Given two clusters C1 and C2 with centroids x01 and x02:
Centroid Euclidean distance: D0 = || x01 - x02 ||
Centroid Manhattan distance: D1 = Σk | x01(k) - x02(k) |
Average inter-cluster distance: D2 = sqrt( Σi∈C1 Σj∈C2 (xi - xj)² / (n1 n2) )
Clustering Feature (CF) in BIRCH
[Figure: five 2-D points plotted on a 10×10 grid]
CF = ⟨N, LS, SS⟩: number of points N, linear sum LS = Σi xi, square sum SS = Σi xi²
Points: (3,4), (2,6), (4,5), (4,7), (3,8)
CF = (5, (16,30), (54,190))
What is it good for?
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to calculate statistics about the cluster and intra-cluster distances
Additivity theorem allows us to merge sub-clusters incrementally & consistently
Cluster Statistics from CF
CF = ⟨n, LS, SS⟩, where n = number of points, LS = Σi xi (linear sum), SS = Σi xi² (square sum)
Radius (average distance to centroid):
R = sqrt( (1/n) Σi (xi - x0)² ) = sqrt( SS/n - (LS/n)² )
Diameter (average pairwise distance):
D = sqrt( Σi Σj (xi - xj)² / (n(n-1)) ) = sqrt( (2n·SS - 2·LS²) / (n(n-1)) )
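A minimal Python sketch of these statistics (not from the slides; the helper names are illustrative): it builds CF = (n, LS, SS) for the five-point example from the earlier CF slide, derives radius and diameter from the CF alone, and merges two CFs using the additivity theorem.

def clustering_feature(points):
    """CF = (n, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    dims = range(len(points[0]))
    LS = [sum(p[k] for p in points) for k in dims]
    SS = [sum(p[k] ** 2 for p in points) for k in dims]
    return n, LS, SS

def merge_cf(cf1, cf2):
    """Additivity theorem: the CF of merged sub-clusters is the component-wise sum."""
    n1, LS1, SS1 = cf1
    n2, LS2, SS2 = cf2
    return n1 + n2, [a + b for a, b in zip(LS1, LS2)], [a + b for a, b in zip(SS1, SS2)]

def radius(cf):
    """R = sqrt(SS/n - (LS/n)^2), summed over dimensions (average distance to centroid)."""
    n, LS, SS = cf
    return sum(ss / n - (ls / n) ** 2 for ls, ss in zip(LS, SS)) ** 0.5

def diameter(cf):
    """D = sqrt((2n*SS - 2*LS^2) / (n(n-1))), summed over dimensions (average pairwise distance)."""
    n, LS, SS = cf
    return sum((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)) for ls, ss in zip(LS, SS)) ** 0.5

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)                        # (5, [16, 30], [54, 190]) -- matches CF = (5, (16,30), (54,190))
print(radius(cf), diameter(cf))
print(merge_cf(cf, clustering_feature([(9, 9)])))  # CF of the union, without revisiting the points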
Hierarchical CF-Tree
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: maximum number of children.
Threshold: max diameter of sub-clusters stored at the leaf nodes
Why a tree?
The CF Tree Structure
[Figure: CF-tree structure. The root holds CF entries CF1…CF6, each with a child pointer; non-leaf nodes hold CF entries CF1…CF5 with child pointers; leaf nodes hold CF entries (CF1, CF2, …) and are chained together by prev/next pointers.]
CF-Tree Insertion
Traverse down from root (top-down), find the appropriate leaf
Follow the "closest"-CF path, w.r.t. intra-cluster distance measures
Modify the leaf
If the closest CF leaf entry cannot absorb the point, make a new CF entry.
If there is no room for the new entry, split the leaf node (splits may propagate to parent nodes)
Traverse back & up
Updating CFs on the path or splitting nodes
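A much simplified sketch of the leaf-level step (illustrative names; no actual tree nodes or splits): a point is absorbed by the closest CF entry only if the entry's diameter would stay within the threshold T, otherwise a new entry is created, which is the behavior the threshold parameter controls.

class CFEntry:
    """One leaf CF entry: CF = (n, LS, SS) for the points it has absorbed."""
    def __init__(self, point):
        self.n, self.LS, self.SS = 1, list(point), [x * x for x in point]

    def centroid(self):
        return [ls / self.n for ls in self.LS]

    def diameter_if_added(self, point):
        # diameter of the entry if `point` were absorbed, computed from the CF alone
        n = self.n + 1
        LS = [a + b for a, b in zip(self.LS, point)]
        SS = [a + b * b for a, b in zip(self.SS, point)]
        return sum((2 * n * ss - 2 * ls * ls) / (n * (n - 1)) for ls, ss in zip(LS, SS)) ** 0.5

    def absorb(self, point):
        self.n += 1
        self.LS = [a + b for a, b in zip(self.LS, point)]
        self.SS = [a + b * b for a, b in zip(self.SS, point)]

def insert_point(entries, point, T):
    """Absorb into the closest entry if its diameter stays <= T, else open a new entry."""
    if entries:
        closest = min(entries, key=lambda e: sum((c - x) ** 2 for c, x in zip(e.centroid(), point)))
        if closest.diameter_if_added(point) <= T:
            closest.absorb(point)
            return
    entries.append(CFEntry(point))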
BIRCH Overview
The Algorithm: BIRCH
Phase 1: Scan database to build an initial in-memory CF-tree; subsequent phases become fast, accurate, less order sensitive
Phase 2: Condense data (optional): rebuild the CF-tree with a larger T (why?)
Phase 3: Global clustering: use an existing clustering algorithm on the CF entries; helps fix the problem where natural clusters span nodes
Phase 4: Cluster refining (optional): do additional passes over the dataset and reassign data points to the closest centroids from Phase 3
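If you just want to use BIRCH rather than implement it, scikit-learn ships an implementation whose parameters map onto these phases: threshold is T, branching_factor bounds the CF-tree fan-out, and n_clusters controls the Phase 3 global clustering of the leaf entries. A small usage sketch (the data set and parameter values are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)   # T, fan-out, Phase-3 clusters
labels = model.fit_predict(X)            # a single scan builds the CF-tree, then global clustering
print(len(model.subcluster_centers_))    # number of leaf micro-clusters (CF entries)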
BIRCH Summary
Main ideas:
Incremental (does not need the whole dataset)
Use in-memory clustering features to summarize
Use hierarchical clustering for microclustering and other clustering methods (e.g., partitioning) for macroclustering
Features:
Scales linearly: single scan and improves the quality with a few additional scans
Weaknesses:
Handles only numeric data
Sensitive to the order of the data records
Unnatural clusters because of the leaf node limit
Spherical clusters because of the diameter threshold (how to solve?)
CURE
CURE: Clustering Using REpresentatives
CURE: An Efficient Clustering Algorithm for Large Databases (1998) S Guha, R Rastogi, K Shim
Addresses potential problems with min, max, centroid distance based hierarchical clustering
Main ideas:
Use representative points for inter-cluster distance
Random sampling and partitioning
Features:
More robust to outliers
Better for non-spherical shapes and non-uniform sizes
See Tan Ch 9.5.3, pp. 635-639
CURE: Cluster Points
A number of points (e.g., 10+) represents a cluster
Representative points:
Start with the point farthest from the centroid
Add the point farthest from all points selected so far, and so on until k points are chosen
"Shrink" them toward the center of the cluster by a fraction (the shrink factor α)
Why shrink? Parallels to other methods?
Cluster similarity is the similarity of the closest pair of representative points from different clusters
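A minimal NumPy sketch of the representative-point selection just described (the function name and default values are illustrative, not from the CURE implementation): the point farthest from the centroid seeds the set, each further pick is the point farthest from the current set, and all picks are then shrunk toward the centroid by a fraction alpha.

import numpy as np

def representative_points(cluster, k=10, alpha=0.3):
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    # the point farthest from the centroid seeds the representative set
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(k, len(cluster)):
        # distance of every point to its nearest already-chosen representative
        dists = np.min(np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :], axis=2), axis=1)
        reps.append(cluster[np.argmax(dists)])
    # shrink toward the centroid: outlying representatives are pulled back into the cluster body
    return [p + alpha * (centroid - p) for p in reps]

Shrinking damps the influence of stray points, which is one reason CURE is more robust to outliers than single-link (MIN) clustering.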
Experimental Results: CURE
[Figures from Guha, Rastogi & Shim: clusters found by CURE compared with centroid-based and single-link hierarchical clustering]
CURE Cannot Handle Differing Densities
[Figure: original points vs. clusters found by CURE on data with differing densities]
So far only numerical data?
Clustering Categorical Data: The ROCK Algorithm
ROCK: RObust Clustering using linKs
S. Guha, R. Rastogi & K. Shim, Int. Conf. Data Eng. (ICDE) ’99
Major ideas
Use links to measure similarity/proximity
Sampling-based clustering
Features:
More meaningful clusters
Emphasizes interconnectivity but ignores proximity
Not in textbook, see paper above.
Similarity Measure in ROCK
Market basket data clustering
Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
Example: Two groups (clusters) of transactions
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2. <a, b, f, g> {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
Jaccard coefficient may lead to wrong clustering result:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
Sim(T1, T3) = |{a, b}| / |{a, b, c, f}| = 2/4 = 0.5
T1 and T2 belong to the same cluster C1 yet get a low similarity, while T1 and T3 belong to different clusters yet get a higher one
Link Measure in ROCK
Neighbor: Ti and Tj are neighbors if Sim(Ti, Tj) ≥ θ, a user-specified threshold
Links: # of common neighbors
Reminder:
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
Let C1: <a, b, c, d, e>, C2: <a, b, f, g>
Example:
link(T1, T2) = 4, since they have 4 common neighbors
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
link(T1, T3) = 3, since they have 3 common neighbors
{a, b, d}, {a, b, e}, {a, b, g}
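A small sketch that reproduces these link counts. The neighbor threshold θ is not stated on the slide; θ = 0.5 is an assumed value that, with the pair itself excluded from the common-neighbor count, yields link(T1, T2) = 4 and link(T1, T3) = 3 on this data.

from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

C1 = [set(c) for c in combinations("abcde", 3)]   # the 10 transactions of cluster C1
C2 = [set(c) for c in combinations("abfg", 3)]    # the 4 transactions of cluster C2
transactions = C1 + C2

def link(ti, tj, theta=0.5):
    """Number of common neighbors of ti and tj (the pair itself excluded)."""
    return sum(1 for t in transactions
               if t != set(ti) and t != set(tj)
               and jaccard(ti, t) >= theta and jaccard(tj, t) >= theta)

print(jaccard("abc", "cde"), jaccard("abc", "abf"))  # 0.2, 0.5
print(link("abc", "cde"), link("abc", "abf"))        # 4, 3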
Rock Algorithm
1. Obtain a sample of points from the data set
2. Compute the link value for each pair of points, using the Jaccard coefficient (and threshold θ) to determine neighbors
3. Perform an agglomerative (bottom-up) hierarchical clustering on the data using the “number of shared neighbors” as similarity measure
4. Assign the remaining points to the clusters that have been found
Cluster Merging: Limitations of Current Schemes
Existing schemes are static in nature:
MIN or CURE:
merge two clusters based on their closeness (or minimum distance)
GROUP-AVERAGE or ROCK:
merge two clusters based on their average connectivity
Limitations of Current Merging Schemes
[Figure: four example data sets (a), (b), (c), (d)]
Closeness schemes will merge (a) and (b)
Average connectivity schemes will merge (c) and (d)
Solution?
Chameleon: Clustering Using Dynamic Modeling
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
Main properties are the relative closeness and relative inter-connectivity of the clusters
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
The merging scheme preserves self-similarity
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
Basic ideas:
A graph-based clustering approach
A two-phase algorithm:
Partitioning: cluster objects into a large number of small sub-clusters
Agglomerative hierarchical clustering: repeatedly combine sub-clusters
Measures the similarity based on a dynamic model
interconnectivity and closeness (proximity)
Features:
Handles clusters of arbitrary shapes, sizes, and density
Scales well
See Sections Han 10.3.4 and Tan 9.4.4
Graph-Based Clustering
Uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Fully connected proximity graph
MIN (single-link) and MAX (complete-link)
Sparsification reduces the data the graph algorithms must process (see the sketch after this list):
kNN: keep only edges to each point's k nearest neighbors
Threshold-based: keep only edges whose proximity exceeds a threshold
Clusters are connected components in the graph
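A short sketch of both sparsification schemes using scikit-learn's graph builders (the data and parameter values are arbitrary): keep only each point's k nearest neighbors, or keep only edges shorter than a radius, then read clusters off as connected components of the resulting graph.

import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph
from scipy.sparse.csgraph import connected_components

X = np.random.RandomState(0).rand(200, 2)

knn_graph = kneighbors_graph(X, n_neighbors=5, mode="distance")      # k-NN sparsification
eps_graph = radius_neighbors_graph(X, radius=0.1, mode="distance")   # threshold-based sparsification

# "Clusters are connected components in the graph"
n_clusters, labels = connected_components(eps_graph, directed=False)
print(n_clusters)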
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Chameleon: Steps
Preprocessing Step: Represent the Data by a Graph
Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors
Concept of neighborhood is captured dynamically (even if region is sparse)
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices
Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a “real” cluster
Chameleon: Steps (cont)
Phase 2: Use Hierarchical Agglomerative (bottom-up) Clustering to merge sub-clusters
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters
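For reference, the two ratios as defined in the CHAMELEON paper (the formulas are not on the slide; the notation is reconstructed here): EC_{C_i,C_j} is the sum of the weights of the edges cut when separating C_i from C_j, EC_{C_i} the edge cut that bisects C_i internally, and the bars denote the average weight of the corresponding edges.

% Relative interconnectivity and relative closeness (CHAMELEON)
RI(C_i, C_j) = \frac{\left| EC_{\{C_i, C_j\}} \right|}{\tfrac{1}{2}\left( \left| EC_{C_i} \right| + \left| EC_{C_j} \right| \right)}
\qquad
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\, \bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\, \bar{S}_{EC_{C_j}}}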
CHAMELEON (Clustering Complex Objects)
Chapter 7. Cluster Analysis
Overview
Partitioning methods
Hierarchical methods
Density-based methods
Other methods
Cluster evaluation
Outlier analysis
Summary