
Page 1: Constraint-Driven Clustering

1

Constraint-Driven Clustering

Rong Ge¹, Martin Ester¹, Wen Jin¹, Ian Davidson²

Presenter: Rong Ge
¹ Simon Fraser University   ² University of California - Davis

Page 2: Constraint-Driven Clustering

2

Introduction

Clustering methods aim at grouping data objects into clusters based on some criteria; they can be either data-driven or need-driven [Banerjee'06].

Data-driven methods discover the true structure of the underlying data by grouping similar data objects together.

Need-driven methods group data objects based on not only similarity but also application needs, and discover more actionable clusters.

Page 3: Constraint-Driven Clustering

3

Capturing Application Needs

Two methodologies:

Design sophisticated objective functions based on business needs. E.g., in catalog segmentation, clustering results are evaluated by their utility in decision making [Kleinberg et al.'99].

Capture application needs by constraints. E.g., discovering balanced customer groups in market segmentation [Ghosh et al.'02].

Yet, existing models often require users to provide the number of clusters, which is often unknown or not suited to the application needs.

Page 4: Constraint-Driven Clustering

4

Constraint-Driven Clustering

Constraint-Driven Clustering utilizes constraints to control cluster formation and discovers an arbitrary number of clusters.

Goals: discover compact clusters and satisfy all constraints.

Two constraint types (cluster-level constraints):

Minimum significance constraint: specifies the minimum number of objects in a cluster.

Minimum variance constraint: specifies the minimum variance of a cluster.

(Both checks are sketched below.)
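
A minimal Python sketch of how these two cluster-level constraints could be checked on numerical data; the function names and the variance definition (mean squared distance of a cluster's points to their mean) are illustrative assumptions, not the paper's code.

import numpy as np

def satisfies_significance(cluster, sig):
    # Minimum significance constraint: the cluster must contain at least `sig` objects.
    return len(cluster) >= sig

def satisfies_variance(cluster, var):
    # Minimum variance constraint: the cluster variance must be at least `var`.
    # Variance is taken here as the mean squared distance of the points to the
    # cluster mean (an assumed definition).
    points = np.asarray(cluster, dtype=float)
    mean = points.mean(axis=0)
    return float(np.mean(np.sum((points - mean) ** 2, axis=1))) >= var

def satisfies_all(cluster, sig, var):
    # A cluster is feasible only if it satisfies all given constraints.
    return satisfies_significance(cluster, sig) and satisfies_variance(cluster, var)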

Page 5: Constraint-Driven Clustering

5

Motivation - Energy Aware Sensor Networks

Constraint-Driven Clustering:
Minimum significance constraint: balances the workload of the master nodes.
Minimum variance constraint: allows sensor clusters to be balanced in terms of energy consumption.

Goal: minimize energy consumption.

Solution:
Group sensors into clusters.
A master node is selected from the sensors in a cluster, or deployed.
Other sensors communicate with the outside through the master nodes.

(Figure: a sensor network with sensors, master nodes, a command node, and communication channels.)

Page 6: Constraint-Driven Clustering

6

Motivation - Privacy Preservation

Goal: publish personal records without a privacy breach.

Solution:
Group records into clusters.
Release the summary of each cluster to the public.

Constraint-Driven Clustering:
Minimum significance constraint: similar to k-anonymity in preserving individual privacy.
Minimum variance constraint: variance translates into the width of the confidence interval of the adversary's estimate, preventing similar, or even identical, records from being released.
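
A hedged illustration of the confidence-interval claim, assuming the adversary estimates a record by the released cluster mean and uses a normal approximation (an assumption, not stated on the slide):

\hat{x} \approx \bar{x}_{P_i} \pm 1.96\,\sqrt{\mathrm{Var}(P_i)} \quad \text{(95\% interval)}

so enforcing a larger minimum variance Var widens this interval and weakens the adversary's estimate.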

Page 7: Constraint-Driven Clustering

7

Related Work

Clustering with cluster-level constraints:

Constrained k-means algorithm [Bradley et al.'00].

The existential constraint [Tung et al.'01]:
Specifies the minimum number of objects in a subset of the input data.
Is a general form of the minimum significance constraint.
Different from our model: K is specified.

K-Anonymity [Samarati et al.'98][Sweeney et al.'02]:
Each record is indistinguishable from k-1 other records.
Works on categorical data.

PPMicroCluster [Jin et al.'06]:
Minimum significance and minimum radius constraints; the constraint is posed on the radius of a cluster.
Did not analyze the complexity of the clustering model.

Page 8: Constraint-Driven Clustering

8

Constraint-Driven Clustering (CDC)

Given a set of points P and a set of constraints C, partition P into disjoint clusters {P_1, ..., P_m} such that:
Each cluster satisfies all constraints.
The sum of squared distances of data points to their corresponding cluster representatives is minimized.

Constraints: for each cluster P_i, 1 ≤ i ≤ m, the minimum significance and minimum variance constraints must hold (see the formulation below).

Our model searches for clusters which are balanced in terms of cardinality and/or variance.
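
A hedged LaTeX formulation of the model described above, assuming r_i denotes the representative of cluster P_i (medoid or mean vector, depending on the problem variant) and Var(P_i) its variance; the notation is illustrative rather than copied from the paper.

\min_{P_1, \dots, P_m} \; \sum_{i=1}^{m} \sum_{x \in P_i} \lVert x - r_i \rVert^2
\quad \text{subject to} \quad |P_i| \ge \mathit{Sig}, \quad \mathrm{Var}(P_i) \ge \mathit{Var}, \quad 1 \le i \le m.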

Page 9: Constraint-Driven Clustering

9

Theoretical Results

Note that the CDC problem has feasible solutions as long as the whole data set satisfies the given constraints.

Variant      Constraints          Cluster representative
Sig-CDC      Sig > 1, Var = 0     Medoid
-Sig-CDC     Sig > 1, Var = 0     Mean vector
Var-CDC      Sig = 1, Var > 0     Medoid
-Var-CDC     Sig = 1, Var > 0     Mean vector

Complexity: all four variants are NP-hard (by a reduction from PLANAR X3C).

Page 10: Constraint-Driven Clustering

10

Heuristic Algorithm

Intuition: the generated clusters must be balanced, and the membership assignment of each point depends on its close neighbors.

Data structure: CD-Tree
Helps to retrieve close neighbors easily.
A solution to the CDC problem is obtained by post-processing the leaf nodes.

Two parameters: the significance parameter S (S = Sig) and the variance parameter V (V = Var).

Page 11: Constraint-Driven Clustering

11

CD-Tree

Leaf nodes:
Each entry contains an individual data point.
Upper-bound capacity and variance:
Max capacity: 2S - 1 (in an optimal solution, no cluster consists of more than 2S - 1 data objects).
Max variance: 2V (to keep leaf nodes compact so that the SSE is minimized).

Non-leaf nodes:
Each entry contains pointers to child nodes and summaries of the points in the child nodes, and corresponds to the subtree rooted at the child node.
Max capacity: Z (a constant, can be set arbitrarily).

(A sketch of these node structures follows.)
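
A minimal Python sketch of how these CD-Tree node structures could be represented; the class and field names, and the variance definition, are illustrative assumptions rather than the authors' implementation.

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class LeafNode:
    # Entries are individual data points; capacity is bounded by 2S - 1
    # and variance by 2V (S, V: significance and variance parameters).
    points: List[np.ndarray] = field(default_factory=list)

    def mean(self) -> np.ndarray:
        return np.mean(self.points, axis=0)

    def variance(self) -> float:
        # Mean squared distance of the points to their mean (assumed definition).
        pts = np.asarray(self.points, dtype=float)
        return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

@dataclass
class NonLeafEntry:
    # Pointer to a child node plus a summary of the points in its subtree.
    child: object
    count: int
    linear_sum: np.ndarray  # used to compute subtree means without revisiting points

@dataclass
class NonLeafNode:
    # Holds at most Z entries (Z: a constant, can be set arbitrarily).
    entries: List[NonLeafEntry] = field(default_factory=list)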

Page 12: Constraint-Driven Clustering

12

CD-Tree vs. CF-Tree and R*-Tree

CF-Tree: does not save individual data points; no max capacity is specified for leaf nodes.

R*-Tree: no max variance is specified for leaf nodes.

Neither the CF-Tree nor the R*-Tree is designed for generating clusters that satisfy constraints.

CD-Tree: one CD-Tree is built for a set of constraints; when a constraint value is changed slightly, we can obtain a solution by post-processing the leaf nodes.

Page 13: Constraint-Driven Clustering

13

Algorithm

Two steps: build the CD-Tree (insertion and split), then post-process the leaf nodes to solve the CDC problem (a simplified sketch follows the figure).

(Figure: a CD-Tree insertion and split example with S = 5, showing the root, non-leaf nodes nll and nlr, and leaf nodes l1, l2, l3.)
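
A runnable but heavily simplified Python sketch of the two steps, reusing the constraint definitions assumed earlier; it keeps only a flat list of leaves (ignoring the non-leaf levels of the CD-Tree) and uses naive split and merge rules, so it illustrates the idea rather than reproducing the paper's algorithm.

import numpy as np

def variance(points):
    # Mean squared distance of the points to their mean (assumed definition).
    pts = np.asarray(points, dtype=float)
    return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

def naive_split(leaf):
    # Move the half of the points furthest from the old mean into a new leaf.
    mean = np.mean(leaf, axis=0)
    ordered = sorted(leaf, key=lambda p: float(np.sum((p - mean) ** 2)))
    half = len(ordered) // 2
    return [ordered] if half == 0 else [ordered[:half], ordered[half:]]

def build_leaves(points, S, V):
    # Step 1 (simplified): insert each point into the closest leaf; split a
    # leaf once it exceeds the max capacity 2S - 1 or the max variance 2V.
    leaves = []
    for p in points:
        if not leaves:
            leaves.append([p])
            continue
        i = int(np.argmin([np.sum((p - np.mean(leaf, axis=0)) ** 2) for leaf in leaves]))
        leaves[i].append(p)
        if len(leaves[i]) > 2 * S - 1 or variance(leaves[i]) > 2 * V:
            leaves.extend(naive_split(leaves.pop(i)))
    return leaves

def solve_cdc(points, S, V):
    # Step 2 (simplified): merge leaves violating the minimum significance or
    # minimum variance constraint into their closest remaining leaf.
    pending, clusters = build_leaves(points, S, V), []
    while pending:
        leaf = pending.pop()
        if (len(leaf) >= S and variance(leaf) >= V) or not pending:
            clusters.append(leaf)  # feasible, or the last remaining leaf
        else:
            m = np.mean(leaf, axis=0)
            j = int(np.argmin([np.sum((np.mean(o, axis=0) - m) ** 2) for o in pending]))
            pending[j] = pending[j] + leaf
    return clusters

For example, solve_cdc(list_of_2d_points, S=5, V=0.0) would group points into clusters of at least 5 points each.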

Page 14: Constraint-Driven Clustering

14

Experimental Results

Comparison partner: the PPMicroCluster algorithm
Similar problem definition.
Can be adapted to handle the minimum variance constraint.
A static algorithm.

Data sets:
Synthetic data set (DS1): 5000 2-d data points simulating uniformly deployed sensors.
Two real UCI data sets (Abalone and Letter).

Page 15: Constraint-Driven Clustering

15

Results on Synthetic data set

Results for the DS1 dataset (Only Significance Constraints are Specified)

Page 16: Constraint-Driven Clustering

16

Results on Letter data set

Results for the Letter dataset (Both Significance and Variance Constraints are Specified)

Page 17: Constraint-Driven Clustering

17

Conclusion & Future Work

A new Constraint-Driven Clustering (CDC) model: need-driven, focused on two cluster-level constraints.
Proved the NP-hardness of the CDC problem.
Proposed a new data structure (CD-Tree).
Developed a heuristic algorithm based on the CD-Tree.

Future work:
Allow constraints to be ranges instead of exact values.
Design other types of constraints to capture different application needs.
Generalize the heuristic algorithm to handle other constraints, such as the minimum separation constraint [Davidson et al.'05].

Page 18: Constraint-Driven Clustering

18

References

[Ghosh'02] J. Ghosh and A. Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems. Springer, 2002.

[Kleinberg’99] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. J. Data Mining and Knowledge Discovery, 1999.

[Bradley’00] P. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, MSR-TR-2000-65, Microsoft Research, 2000.

[Wagstaff’00] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, 2000.

[Davidson’05] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005.

[Samarati’98] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, 1998.

Page 19: Constraint-Driven Clustering

19

References (continued)

[Sweeney'02] L. Sweeney. k-anonymity: A model for protecting privacy. In IJUFKS, 2002.

[Jin'06] W. Jin, R. Ge, and W. Qian. On robust and effective k-anonymity in large databases. In PAKDD, 2006.

[Aggarwal'04] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004.

[Tung'01] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001.

[Banerjee'06] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3), 2006.

Page 20: Constraint-Driven Clustering

20

Thanks!

Poster: this evening (Tuesday), board #1

Page 21: Constraint-Driven Clustering

21

Split

Create a new leaf node.
Move the point furthest from the mean of the old leaf node to it.
Calculate the new objective value.
If the objective value drops, repeat from the move step; otherwise, link the new node appropriately.

(A sketch of this loop follows.)
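
A minimal Python sketch of this split loop, assuming the objective value is the total SSE of the two resulting leaves and reading the "yes" branch of the flowchart as "move the next furthest point"; both are assumptions, not statements from the slides.

import numpy as np

def sse(points):
    # Sum of squared distances of the points to their mean.
    if not points:
        return 0.0
    pts = np.asarray(points, dtype=float)
    return float(np.sum((pts - pts.mean(axis=0)) ** 2))

def split_leaf(old_leaf):
    # Create a new leaf and keep moving the point furthest from the old leaf's
    # mean into it while the objective value (total SSE of both leaves) drops.
    old_leaf, new_leaf = list(old_leaf), []
    objective = sse(old_leaf)
    while len(old_leaf) > 1:
        mean = np.mean(old_leaf, axis=0)
        idx = int(np.argmax([np.sum((p - mean) ** 2) for p in old_leaf]))
        cand_old = old_leaf[:idx] + old_leaf[idx + 1:]
        cand_new = new_leaf + [old_leaf[idx]]
        cand_obj = sse(cand_old) + sse(cand_new)
        if cand_obj < objective:      # objective drops: keep the move and continue
            old_leaf, new_leaf, objective = cand_old, cand_new, cand_obj
        else:                         # no improvement: stop and link the new node
            break
    return old_leaf, new_leaf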

Page 22: Constraint-Driven Clustering

22

Runtime

O(n² + n·Sig²)

The runtime of inserting one point is O(n), since the height of a CD-Tree can be O(n).
The time for the splits caused by one insertion is O(Sig²).
The total time for building a tree is therefore O(n² + n·Sig²) (see the breakdown below).
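
A short breakdown of this bound, under the assumption (consistent with the figures above) that each of the n insertions costs O(n) for descending the tree plus O(Sig²) for splits:

n \cdot \big( O(n) + O(\mathit{Sig}^2) \big) = O\!\big( n^2 + n \cdot \mathit{Sig}^2 \big).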

Page 23: Constraint-Driven Clustering

23

Outline

Introduction: two classes of clustering methods; motivation for constraint-driven clustering
Related Work
Constraint-Driven Clustering model
Theoretical Results
Heuristic Algorithm
Experimental Results
Conclusion
Future Work

Page 24: Constraint-Driven Clustering

24

Related Work

Actionable clustering [Kleinberg'99]: the objective function measures the utility of a clustering in decision making.

Cluster-level constraints:
Constrained k-means algorithm [Bradley'00]; different from our model in that K is specified.

Instance-level constraints:
Must-link and cannot-link constraints [Wagstaff'00].
Feasibility issues with instance-level constraints [Davidson'05].
Modeling a cluster-level constraint with instance-level constraints requires a large number of instance-level constraints, and specifying too many constraints is problematic.

Page 25: Constraint-Driven Clustering

25

Related Work (Contd.)

K-Anonymity [Samarati'98][Sweeney'02]: each record is indistinguishable from k-1 other records; defined on categorical data. The condensation approach is an extension of k-anonymity to numerical data [Aggarwal'04].

PPMicroCluster [Jin'06]: minimum significance constraint and minimum radius constraint.
Differences from our model: we use a minimum variance constraint; [Jin'06] does not analyze the complexity of the clustering model and proposes a static algorithm.