1
Constraint-Driven Clustering
Rong Ge¹, Martin Ester¹, Wen Jin¹, Ian Davidson²
Presenter: Rong Ge
¹ Simon Fraser University   ² University of California - Davis
2
Introduction
Clustering methods aim at grouping data objects into clusters based on some criteria
They can be either data-driven or need-driven [Banerjee’06]
Data-driven methods: discover the true structure of the underlying data by grouping similar data objects together
Need-driven methods: group data objects based not only on similarity but also on application needs; discover more actionable clusters
3
Capturing Application Needs
Two methodologies:
Design sophisticated objective functions based on business needs
E.g., in catalog segmentation, clustering results are evaluated by their utility in decision making [Kleinberg et al.’99]
Capture application needs by constraints
E.g., discovering balanced customer groups in market segmentation [Ghosh et al.’02]
Yet, existing models often require users to provide the number of clusters, which is often unknown or not suited to the application needs
4
Constraint-Driven Clustering
Utilizes constraints to control cluster formation; discovers an arbitrary number of clusters
Goals: discover compact clusters; satisfy all constraints
Two constraint types (cluster-level constraints), formalized below:
Minimum significance constraint: specifies the minimum number of objects in a cluster
Minimum variance constraint: specifies the minimum variance of a cluster
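A hedged formalization of the two constraint types for a cluster P_i with mean μ_i; the slide does not spell out the variance formula, so the usual mean squared deviation is assumed:

$$|P_i| \ge Sig \quad \text{(minimum significance)}, \qquad \operatorname{Var}(P_i) = \frac{1}{|P_i|} \sum_{x \in P_i} \lVert x - \mu_i \rVert^2 \ge Var \quad \text{(minimum variance)}.$$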
5
Motivation - Energy Aware Sensor Networks
Constraint-Driven Clustering:
Minimum significance constraint: balances the work load of master nodes
Minimum variance constraint: allows sensor clusters to be balanced in terms of energy consumption
Goal: minimize energy consumption
Solution:
Group sensors into clusters
A master node is selected from the sensors in a cluster, or deployed
Other sensors communicate with the outside through the master nodes
[Figure: sensor network with a command node, sensors, master nodes, and communication channels]
6
Motivation - Privacy Preservation
Goal: publish personal records without a privacy breach
Solution:
Group records into clusters
Release the summary of each cluster to the public
Constraint-Driven Clustering:
Minimum significance constraint: similar to k-anonymity in preserving individual privacy
Minimum variance constraint: the variance translates into the width of the confidence interval of the adversary's estimate; prevents similar, even identical, records from being released
7
Related Work
Clustering with cluster-level constraints:
Constrained k-means algorithm [Bradley et al.’00]
The existential constraint [Tung et al.’01]: specifies the minimum # of objects in a subset of the input data; a general form of the minimum significance constraint
Different from our model: k is specified
k-Anonymity [Samarati et al.’98][Sweeney et al.’02]: each record is indistinguishable from k-1 other records; on categorical data
PPMicroCluster [Jin et al.’06]: minimum significance and minimum radius constraints; the constraint is posed on the radius of a cluster; did not analyze the complexity of the clustering model
8
Constraint-Driven Clustering (CDC)
Given a set of points P and a set of constraints C
Partition P into disjoint clusters {P1, ..., Pm} s.t.:
Each cluster satisfies all constraints
The sum of squared distances of data points to their corresponding cluster representatives is minimized
Constraints: for each cluster Pi, 1 ≤ i ≤ m, the minimum significance and/or minimum variance constraints must hold (formalized below)
Our model searches for clusters that are balanced in terms of cardinality and/or variance
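A sketch of the objective under these definitions, where r_i denotes the representative (medoid or mean vector) of cluster P_i and the number of clusters m is not fixed in advance; the exact formula on the original slide was lost in extraction:

$$\min_{\{P_1, \dots, P_m\}} \; \sum_{i=1}^{m} \sum_{x \in P_i} \lVert x - r_i \rVert^2 \quad \text{s.t. every } P_i \text{ satisfies all constraints in } C.$$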
9
Theoretical Results
Note that the CDC problem has feasible solutions as long as the whole data set satisfies given constraints
Sig-CDC: Sig > 1, Var = 0; cluster representative: medoid
-Sig-CDC: Sig > 1, Var = 0; cluster representative: mean vector
Var-CDC: Sig = 1, Var > 0; cluster representative: medoid
-Var-CDC: Sig = 1, Var > 0; cluster representative: mean vector
Complexity: all four variants are NP-hard (by a reduction from PLANAR X3C)
10
Heuristic Algorithm
Intuition:
The generated clusters must be balanced
Membership assignment of each point depends on its close neighbors
Data structure: CD-Tree
Helps to retrieve close neighbors easily
Obtain a solution to the CDC problem by post-processing leaf nodes
Two parameters:
Significance parameter S (S = Sig)
Variance parameter V (V = Var)
11
CD-Tree
Leaf nodes:
Each entry contains an individual data point
Upper-bound capacity and variance
Max capacity: 2S - 1 (in an optimal solution, no cluster consists of more than 2S - 1 data objects)
Max variance: 2V (to keep leaf nodes compact so that the SSE is minimized)
Non-leaf nodes:
Each entry contains pointers to child nodes and summaries of the points in the child nodes
Each entry corresponds to the subtree rooted at a child node
Max capacity: Z (a constant, can be set arbitrarily)
Both node types are sketched in code below
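A minimal Python sketch of this node layout; the field names and the particular summary statistics stored in non-leaf entries are assumptions, since the slide only specifies the bounds (max capacity 2S - 1 and max variance 2V for leaf nodes, max capacity Z for non-leaf nodes):

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class LeafNode:
    S: int                      # significance parameter (S = Sig)
    V: float                    # variance parameter (V = Var)
    points: List[np.ndarray] = field(default_factory=list)

    @property
    def max_capacity(self) -> int:
        # In an optimal solution, no cluster has more than 2S - 1 objects.
        return 2 * self.S - 1

    def variance(self) -> float:
        # Mean squared distance of the leaf's points to their mean (assumed definition).
        if len(self.points) < 2:
            return 0.0
        pts = np.vstack(self.points)
        return float(((pts - pts.mean(axis=0)) ** 2).sum(axis=1).mean())

    def can_absorb(self, p: np.ndarray) -> bool:
        # A point may enter only if both leaf-level bounds still hold afterwards.
        candidate = LeafNode(self.S, self.V, self.points + [p])
        return (len(candidate.points) <= self.max_capacity
                and candidate.variance() <= 2 * self.V)

@dataclass
class NonLeafEntry:
    child: object               # pointer to a child node (leaf or non-leaf)
    n: int                      # summary: number of points in the subtree
    linear_sum: np.ndarray      # summary: per-dimension sum of those points

@dataclass
class NonLeafNode:
    Z: int                      # max capacity Z (a constant, can be set arbitrarily)
    entries: List[NonLeafEntry] = field(default_factory=list)
```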
12
CD-Tree vs. CF-Tree and R*-Tree
CF-Tree: does not store individual data points; no max capacity specified for leaf nodes
R*-Tree: no max variance specified for leaf nodes
Neither the CF-Tree nor the R*-Tree is designed for generating clusters that satisfy constraints
CD-Tree: one CD-Tree is built for a set of constraints; when a constraint value is changed slightly, we can obtain a solution by post-processing the leaf nodes
13
Algorithm
Two steps (sketched in code after the figure below):
Build the CD-Tree (insertion and split)
Post-process leaf nodes to solve the CDC problem
[Figure: insertion and split example on a CD-Tree with S = 5, showing the root, non-leaf nodes nll and nlr, and leaf nodes l1, l2, l3]
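A hedged, self-contained sketch of the two steps, representing each CD-Tree leaf simply as a list of points (the tree's internal levels are omitted). The descent rule (insert into the leaf whose mean is closest), the simplified one-point split, and the post-processing strategy (merging undersized leaves) are assumptions made for illustration; the slides only name the steps:

```python
from typing import List
import numpy as np

Leaf = List[np.ndarray]

def leaf_variance(leaf: Leaf) -> float:
    # Mean squared distance of the leaf's points to their mean (assumed definition).
    if len(leaf) < 2:
        return 0.0
    pts = np.vstack(leaf)
    return float(((pts - pts.mean(axis=0)) ** 2).sum(axis=1).mean())

def nearest_leaf(leaves: List[Leaf], p: np.ndarray) -> Leaf:
    # Assumed descent rule: pick the leaf whose mean is closest to the new point.
    nonempty = [l for l in leaves if l]
    if not nonempty:
        return leaves[0]
    return min(nonempty, key=lambda l: float(np.linalg.norm(np.vstack(l).mean(axis=0) - p)))

def build_cd_tree_leaves(points, S: int, V: float) -> List[Leaf]:
    """Step 1: insert points one by one, splitting any leaf that exceeds the
    capacity bound (2S - 1) or the variance bound (2V)."""
    leaves: List[Leaf] = [[]]
    for p in points:
        leaf = nearest_leaf(leaves, p)
        leaf.append(p)
        if len(leaf) > 2 * S - 1 or leaf_variance(leaf) > 2 * V:
            # Simplified split: move the point farthest from the mean to a new
            # leaf (the full objective-driven split loop is on the backup slide).
            pts = np.vstack(leaf)
            far = int(np.argmax(((pts - pts.mean(axis=0)) ** 2).sum(axis=1)))
            leaves.append([leaf.pop(far)])
    return leaves

def post_process(leaves: List[Leaf], Sig: int) -> List[Leaf]:
    """Step 2 (assumed strategy): merge leaves with fewer than Sig points into
    the nearest sufficiently large leaf, so every output cluster satisfies the
    minimum significance constraint (variance handling is omitted here)."""
    clusters = [l for l in leaves if len(l) >= Sig]
    if not clusters:                      # degenerate case: fall back to one cluster
        return [[p for l in leaves for p in l]]
    for l in (l for l in leaves if 0 < len(l) < Sig):
        nearest_leaf(clusters, np.vstack(l).mean(axis=0)).extend(l)
    return clusters
```

Calling build_cd_tree_leaves followed by post_process corresponds to the two steps listed on this slide.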
14
Experimental Results
Comparison partner: the PPMicroCluster algorithm
Similar problem definition
Can be adapted to handle the minimum variance constraint
Static algorithm
Data sets:
Synthetic data set (DS1): 5000 2-d data points to simulate sensors deployed uniformly
Two real UCI data sets (Abalone and Letter)
15
Results on Synthetic data set
Results for the DS1 dataset (Only Significance Constraints are Specified)
16
Results on Letter data set
Results for the Letter dataset (Both Significance and Variance Constraints are Specified)
17
Conclusion & Future Work
A new Constraint-Driven Clustering (CDC) model: need-driven, focused on two cluster-level constraints
Proved NP-hardness of the CDC problem
Proposed a new data structure (CD-Tree)
Developed a heuristic algorithm based on the CD-Tree
Future work:
Allow constraints to be ranges instead of exact values
Design other types of constraints to capture different application needs
Generalize the heuristic algorithm to handle other constraints, such as the minimum separation constraint [Davidson et al.’05]
18
References
[Ghosh’02] J. Ghosh and A. Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems. Springer, 2002.
[Kleinberg’99] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. J. Data Mining and Knowledge Discovery, 1999.
[Bradley’00] P. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, MSR-TR-2000-65, Microsoft Research, 2000.
[Wagstaff’00] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, 2000.
[Davidson’05] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005.
[Samarati’98] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, 1998.
19
References
[Sweeney’02] L. Sweeney. k-anonymity: A model for protecting privacy. In IJUFKS, 2002.
[Jin’06] W. Jin, R. Ge, and W. Qian. On robust and effective k-anonymity in large databases. In PAKDD, 2006.
[Aggarwal’04] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.
[Tung’01] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001.
[Banerjee’06] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3), 2006.
20
Thanks!
Poster: this evening (Tuesday), board #1
21
Split
Create a new leaf node
Move the point farthest from the mean of the old leaf node into the new node
Calculate the new objective value
Does the objective value drop?
Yes: repeat from the move step
No: link the new node into the tree appropriately (a code sketch follows below)
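A hedged, self-contained sketch of this split loop. It assumes the "objective value" is the combined SSE of the old and new leaves, and that the "Yes" branch loops back to moving the next farthest point; both readings are inferred from the flowchart rather than stated on the slide:

```python
from typing import List, Tuple
import numpy as np

def sse(points: List[np.ndarray]) -> float:
    # Sum of squared distances of the points to their mean.
    if len(points) < 2:
        return 0.0
    pts = np.vstack(points)
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

def split_leaf(old: List[np.ndarray]) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """Create a new leaf and keep moving the point farthest from the old leaf's
    mean into it, as long as the objective value keeps dropping."""
    old = list(old)
    new: List[np.ndarray] = []
    objective = sse(old) + sse(new)
    while len(old) > 1:
        pts = np.vstack(old)
        far = int(np.argmax(((pts - pts.mean(axis=0)) ** 2).sum(axis=1)))
        cand_old = old[:far] + old[far + 1:]
        cand_new = new + [old[far]]
        cand_objective = sse(cand_old) + sse(cand_new)
        if cand_objective >= objective:   # the objective did not drop: stop moving
            break
        old, new, objective = cand_old, cand_new, cand_objective
    return old, new                       # the caller links the new node into the tree
```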
22
Runtime
O(n² + n · Sig²)
The runtime of inserting one point is O(n), since the height of a CD-Tree can be O(n)
The time for one split is O(Sig²)
Total time for building the tree is O(n² + n · Sig²), combined below
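A short worked combination of the bounds above, assuming each of the n insertions can trigger at most one split, which is consistent with the stated total:

$$T(n) \;=\; \underbrace{n \cdot O(n)}_{n \text{ insertions}} \;+\; \underbrace{n \cdot O(Sig^2)}_{\text{at most } n \text{ splits}} \;=\; O(n^2 + n \cdot Sig^2).$$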
23
Outline
Introduction: two classes of clustering methods; motivation for constraint-driven clustering
Related Work
Constraint-Driven Clustering model
Theoretical Results
Heuristic Algorithm
Experimental Results
Conclusion & Future Work
24
Related Work
Actionable clustering [Kleinberg’99]: the objective function measures the utility of a clustering in decision making
Cluster-level constraints: constrained k-means algorithm [Bradley’00]; different from our model: k is specified
Instance-level constraints: must-link and cannot-link constraints [Wagstaff’00]; feasibility issues with instance-level constraints [Davidson’05]
Modeling a cluster-level constraint with instance-level constraints requires a large number of instance-level constraints; specifying too many constraints is problematic
25
Related Work (Contd.)
k-Anonymity [Samarati’98][Sweeney’02]: each record is indistinguishable from k-1 other records; on categorical data
The condensation approach is an extension of k-anonymity to numerical data [Aggarwal’04]
PPMicroCluster [Jin’06]: minimum significance constraint and minimum radius constraint
Differences from our model: we use a minimum variance constraint; [Jin’06] did not analyze the complexity of the clustering model and proposed only a static algorithm