Iterative Optimization and Simplification of Hierarchical Clusterings
Doug Fisher, Department of Computer Science, Vanderbilt University
Journal of Artificial Intelligence Research, 4 (1996) 147-179
Presented by: Biyu Liang ('06), Paul Haake ('07)
2
Outline
• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion
3
Introduction
Overview of method:
• Construct an initial clustering inexpensively
• Iteratively optimize the clustering using some control strategy
• Simplify the clustering
➢ Goals:
• Find high-quality clusterings without overfitting
• Good CPU efficiency
4
Introduction (continued)
Properties of any clustering algorithm:
• objective function: evaluates the quality of a particular clustering on a set of data.
• control strategy: specifies how the algorithm searches the space of all possible clusterings, given some objective function.
In this paper, the author compares different control strategies using the same objective function.
5
Outline
• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Experiments
• Simplification of Hierarchical Clustering
• Conclusion
6
Hierarchical Sorting
Greedy algorithm to quickly build an initial rough clustering.
All three control strategies (discussed later) begin with the clustering generated by hierarchical sorting. By shuffling records around, they improve the clustering.
7
Hierarchical Sorting
Category utility of a cluster $C_k$:
$$CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]$$
• Clusters whose data records have similar attribute values have a higher CU score.
• Objective function = the “partition utility” (PU), the average CU value over all clusters.
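The CU and PU definitions can be sketched in a few lines of Python. This is a minimal illustration, assuming records are tuples of nominal attribute values and probabilities are estimated by simple frequency counts; the function names and the toy data are made up for this example.

```python
from collections import Counter

def squared_value_probs(records, i):
    """Sum over the values V_ij of P(A_i = V_ij)^2, estimated within `records`."""
    counts = Counter(rec[i] for rec in records)
    return sum((n / len(records)) ** 2 for n in counts.values())

def category_utility(cluster, data):
    """CU(C_k) = P(C_k) * sum_i sum_j [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2]."""
    p_cluster = len(cluster) / len(data)
    gain = sum(squared_value_probs(cluster, i) - squared_value_probs(data, i)
               for i in range(len(data[0])))
    return p_cluster * gain

def partition_utility(clusters):
    """PU = average CU over all clusters of the partition."""
    data = [rec for cluster in clusters for rec in cluster]
    return sum(category_utility(c, data) for c in clusters) / len(clusters)

# Toy data (hypothetical): two kinds of records with two attributes each.
a, b = ("a", "x"), ("b", "y")
pu_pure = partition_utility([[a, a], [b, b]])    # similar records grouped together
pu_mixed = partition_utility([[a, b], [a, b]])   # dissimilar records grouped together
```

On this toy data the pure partition scores 0.5 and the mixed partition scores 0, which matches the slide's claim that clusters with similar attribute values score higher.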
8
Hierarchical Sorting
• Start with an empty clustering and add each data record one at a time
• For each record being added, there are two choices:
  • Place the record in some existing cluster in the hierarchy
  • Place the record in a new cluster
• Select the option that yields the highest quality score (PU)
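The greedy choice above can be sketched as follows. This is a simplified, single-level version: it builds one flat partition rather than a full hierarchy (a hierarchical version would repeat the same choice inside the chosen cluster). The names `sort_records` and `partition_utility` and the toy records are assumptions made for this example.

```python
from collections import Counter

def partition_utility(clusters):
    """Average category utility over the clusters (nominal attributes)."""
    data = [r for c in clusters for r in c]
    def sq(recs, i):
        return sum((n / len(recs)) ** 2
                   for n in Counter(r[i] for r in recs).values())
    cu = [len(c) / len(data) * sum(sq(c, i) - sq(data, i)
                                   for i in range(len(data[0])))
          for c in clusters]
    return sum(cu) / len(clusters)

def sort_records(records):
    """Add records one at a time, keeping the placement with the highest PU."""
    clusters = []
    for rec in records:
        # Option 1: place the record in each existing cluster in turn.
        options = [[c + [rec] if j == k else c for j, c in enumerate(clusters)]
                   for k in range(len(clusters))]
        # Option 2: place the record in a new cluster of its own.
        options.append(clusters + [[rec]])
        clusters = max(options, key=partition_utility)
    return clusters

a, b = ("a", "x"), ("b", "y")
sorted_clusters = sort_records([a, a, b, b])
```

Because each record is placed once and never revisited, the result depends on the order in which records arrive, which is why the later optimization strategies are needed.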
9
10
11
Outline
• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion
12
Iterative Optimization Methods
Important note:
• The primary goal of clustering in this paper is to obtain a single-level partitioning of optimal quality.
• Hierarchical clustering is used only as an intermediate means toward that end.
• To evaluate the quality of a solution, the author therefore applies the objective function only to the first-level partition.
13
Iterative Optimization Methods
• Reorder-resort (CLUSTER/2): very similar to k-means
• Iterative redistribution of single observations: reassign each record to a better cluster
• Iterative hierarchical redistribution: reassign each record or subtree of records to a better cluster
14
Reorder-resort (k-means)
• k random seeds are selected, and k clusters are grown around these attractors.
• The centroids of the clusters are picked as new seeds.
• The process iterates until there is no further improvement in the quality of the generated clustering.
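The seed, grow, and reseed loop can be sketched for nominal data as below. This is only an illustration under stated assumptions, not the CLUSTER/2 procedure itself: records are attracted to the nearest seed by Hamming distance, and the "centroid" is approximated by the most central record of each cluster. All function names here are hypothetical.

```python
from collections import Counter

def partition_utility(clusters):
    """Average category utility over the clusters (nominal attributes)."""
    data = [r for c in clusters for r in c]
    def sq(recs, i):
        return sum((n / len(recs)) ** 2
                   for n in Counter(r[i] for r in recs).values())
    cu = [len(c) / len(data) * sum(sq(c, i) - sq(data, i)
                                   for i in range(len(data[0])))
          for c in clusters]
    return sum(cu) / len(clusters)

def hamming(r, s):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(r, s))

def assign(records, seeds):
    """Grow one cluster around each seed; empty clusters are dropped."""
    clusters = [[] for _ in seeds]
    for rec in records:
        k = min(range(len(seeds)), key=lambda s: hamming(rec, seeds[s]))
        clusters[k].append(rec)
    return [c for c in clusters if c]

def central_record(cluster):
    """Stand-in for a centroid: the record closest to all others (assumption)."""
    return min(cluster, key=lambda r: sum(hamming(r, o) for o in cluster))

def seed_and_recluster(records, seeds):
    """Iterate until PU stops improving; return the best partition found."""
    best, best_pu = None, float("-inf")
    while True:
        clusters = assign(records, seeds)
        pu = partition_utility(clusters)
        if pu <= best_pu:
            return best
        best, best_pu = clusters, pu
        seeds = [central_record(c) for c in clusters]

a, b = ("a", "x"), ("b", "y")
reclustered = seed_and_recluster([a, b, a, b, a, b], seeds=[a, b])
```

In practice the initial seeds would be drawn at random (for example with `random.sample(records, k)`); they are passed in explicitly here so the example is deterministic.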
15
Reorder-resort (k-means), cont'd
• Ordering the data so that consecutive observations are dissimilar leads to good clusterings.
• Extract a “dissimilarity” ordering from the hierarchical sorting: consecutive records will tend to be dissimilar.
16
Iterative Redistribution of Single Observations
Repeat until the clustering doesn't change:
• For every record, remove it from the clustering and re-sort it, beginning at the root
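The loop above can be sketched on a flat partition as follows. This is a simplified illustration: re-sorting "beginning at the root" is reduced to choosing the best top-level cluster (or a new one) by PU, and the `max_passes` cap is an added safety assumption. The function names are hypothetical.

```python
from collections import Counter

def partition_utility(clusters):
    """Average category utility over the clusters (nominal attributes)."""
    data = [r for c in clusters for r in c]
    def sq(recs, i):
        return sum((n / len(recs)) ** 2
                   for n in Counter(r[i] for r in recs).values())
    cu = [len(c) / len(data) * sum(sq(c, i) - sq(data, i)
                                   for i in range(len(data[0])))
          for c in clusters]
    return sum(cu) / len(clusters)

def canonical(clusters):
    """Order-independent form of a partition, used to detect convergence."""
    return sorted(sorted(c) for c in clusters)

def redistribute(clusters, max_passes=100):
    clusters = [list(c) for c in clusters]
    for _ in range(max_passes):
        before = canonical(clusters)
        for rec in [r for c in clusters for r in c]:
            # Remove the record from its current cluster.
            src = next(c for c in clusters if rec in c)
            src.remove(rec)
            clusters = [c for c in clusters if c]  # drop a cluster left empty
            # Re-sort it: try every existing cluster, then a new cluster.
            options = [[c + [rec] if j == k else c
                        for j, c in enumerate(clusters)]
                       for k in range(len(clusters))]
            options.append(clusters + [[rec]])
            clusters = max(options, key=partition_utility)
        if canonical(clusters) == before:
            break  # a full pass changed nothing
    return clusters

a, b = ("a", "x"), ("b", "y")
start = [[a, b], [a, b]]
improved = redistribute(start)
```

Since putting a record back where it came from is always one of the options, each move leaves PU unchanged or raises it.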
17
Iterative Hierarchical Redistribution
Problem: The previous control strategy (single-observation redistribution) re-sorts only one record at a time.
Solution: Re-sort entire subtrees of records at a time.
18
Iterative Hierarchical Redistribution
Hierarchical-Redistribute-Recurse(SiblingSet)
  Repeat until two consecutive clusterings have the same set of siblings:
    For each sibling in SiblingSet:
      Remove the sibling from the hierarchy and resort
    SiblingSet ← remaining siblings
  For each sibling S in SiblingSet:
    call Hierarchical-Redistribute-Recurse(S.children)

Repeat until clustering converges:
  Clustering ← Hierarchical-Redistribute-Recurse(Clustering.root.children)
19
20
Main findings from the experiments
• Hierarchical redistribution achieves the highest mean PU scores in most cases
• Reordering and re-clustering comes closest to hierarchical redistribution’s performance in all cases
• Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods
21
Outline
• Introduction
• Generating Initial Hierarchical Clustering
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion
22
Simplifying Hierarchical Clustering
Higher levels of the hierarchy are meaningful, but lower levels are subject to overfitting.
Solution: post-process the hierarchy with validation and pruning.
23
Validation
Strategy: Find internal nodes that are most predictive on unseen data (a testing set).
What does “predictive” mean in this case? When a data record is classified into a cluster, we want to know how accurately that cluster, in turn, can predict the data record's attribute values.
In a high-quality clustering, we expect that an unseen data record, classified into some cluster, will have attribute values similar to the attribute values of other data records in the cluster.
24
Validation
For each variable Ai:
  For each data record:
    • Classify the data record through the cluster hierarchy, beginning at the root, and ignoring the value of Ai.
    • At each node, compare the record's Ai value to the node's expected Ai value; keep a counter of correct predictions for each variable at each node.
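The counting procedure above can be sketched as follows. This is a rough illustration under several assumptions: a node's "expected" value is taken to be the most frequent value among its training records, and a record is routed to the child that agrees with it most on the remaining attributes, which is a stand-in for whatever classification criterion the paper uses. The `Node` class and function names are hypothetical.

```python
from collections import Counter

class Node:
    def __init__(self, records, children=()):
        self.records = records          # training records under this node
        self.children = list(children)
        self.correct = Counter()        # attribute index -> correct predictions

def expected_value(node, i):
    """Assumption: a node predicts its most frequent value for attribute i."""
    return Counter(r[i] for r in node.records).most_common(1)[0][0]

def match_score(node, rec, skip):
    """Average agreement with the node's records, ignoring attribute `skip`."""
    hits = sum(r[j] == rec[j] for r in node.records
               for j in range(len(rec)) if j != skip)
    return hits / len(node.records)

def validate(root, test_records):
    """Update per-node, per-attribute counts of correct predictions."""
    for i in range(len(test_records[0])):
        for rec in test_records:
            node = root
            while node is not None:
                if expected_value(node, i) == rec[i]:
                    node.correct[i] += 1
                # Descend to the best-matching child; stop at a leaf.
                node = max(node.children,
                           key=lambda c: match_score(c, rec, i),
                           default=None)

a, b = ("a", "x"), ("b", "y")
left, right = Node([a, a, a]), Node([b, b])
root = Node([a, a, a, b, b], [left, right])
validate(root, [a, b])
```

On this toy hierarchy the two leaves together predict both test records correctly for each attribute, while the root predicts only one of them, so the counts favor the lower level here.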
25
Validation
After processing all variables, for each variable, identify a “frontier” in the hierarchy such that the number of correct predictions of that variable is maximized.
If a node lies below the frontier of every variable, then it is pruned.
26
27
Validation
The author's experiments show that this validation method substantially reduces clustering size without diminishing predictive accuracy.
28
Concluding Remarks
There are three phases in searching the space of hierarchical clusterings:
• Inexpensive generation of an initial clustering
• Iterative optimization for clusterings
• Post-processing simplification of generated clusterings
Experiments found that the new method, hierarchical redistribution optimization, beats the other iterative optimization methods in most cases.
29
Final Exam Question #1
The main idea in this paper is to construct clusterings which satisfy two conditions.
Name the conditions:
• Consistently constructs high-quality clusterings
• Computationally inexpensive
Name the two steps to satisfy the conditions:
• Generate a tentative clustering inexpensively, using hierarchical sorting
• Iteratively optimize that initial clustering
30
Final Exam Question #2
Describe the three iterative methods for clustering optimization:
Seed Selection, Reordering, and Reclustering (p. 14-15)
Iterative Redistribution of Single Observations (p. 16)
Iterative Hierarchical Redistribution (p. 17-19)
31
Final Exam Question #3
• The cluster is better when the relative CU score is a) big, b) small, c) equal to 0
• Which sorting method is better? a) random sorting, b) similarity sorting
Thanks! Questions?