Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond Wu

Clustering by Pattern Similarity in Large Data Sets

Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu

IBM T. J. Watson Research Center

Presented by Edmond Wu

DB-Seminar Slide 2

Talk Outline

Introduction

Related Work

pCluster Model

Performance Analysis

Conclusion

DB-Seminar Slide 3

Motivation

Why is the discovery of clusters based on pattern similarity interesting and important?

DNA micro-array analysis

E-commerce: Recommendation systems & target marketing

DB-Seminar Slide 4

Background Knowledge

Clustering: the process of grouping a set of objects into classes of similar objects.

Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.

Pattern similarity: objects exhibit a coherent pattern on a subset of dimensions. (The objects are not required to have close values on any attribute.)

DB-Seminar Slide 5

Example of Similar pattern on a subset of dimensions

DB-Seminar Slide 6

Challenges

Identifying subspace clusters in high-dimensional data sets is difficult.

Traditional distance functions cannot capture the pattern similarity among the objects.

DB-Seminar Slide 7

How to detect shifting patterns?

Given N attributes a1, …, aN

Define a derived attribute Aij = ai − aj for every pair of attributes ai, aj.

Thus, the problem is equivalent to mining subspace clusters on the objects with the derived set of attributes.

Drawback: the converted dataset will have N(N−1)/2 dimensions, which is intractable even for a small N.
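The derived-attribute idea can be illustrated with a minimal sketch. The function names and the two sample objects below are hypothetical, chosen only to show that a constant shift disappears in the derived space and how quickly the number of derived dimensions grows.

```python
from itertools import combinations


def derive_difference_attributes(values):
    """Return the derived attributes A_ij = a_i - a_j for every pair i < j."""
    return {
        (i, j): values[i] - values[j]
        for i, j in combinations(range(len(values)), 2)
    }


def derived_dimension_count(n):
    """Number of derived attributes for n original attributes: n(n-1)/2."""
    return n * (n - 1) // 2


# Two hypothetical objects that differ by a constant shift of 4.
o1 = [1, 2, 3]
o2 = [5, 6, 7]

# A shifting pattern becomes an exact match in the derived space.
print(derive_difference_attributes(o1) == derive_difference_attributes(o2))

# The derived space grows quadratically with the number of attributes.
for n in (10, 100, 1000):
    print(n, derived_dimension_count(n))
```

Any subspace clustering algorithm run on the derived attributes would therefore have to search a space whose dimensionality is quadratic in N.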

DB-Seminar Slide 8

Related Work

Bicluster Model (Cheng et al):

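AIJ: submatrix of a DNA array with row set I and column set J, scored by the mean squared residue H(I,J):

$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

where $a_{iJ}$ is the mean of row i, $a_{Ij}$ is the mean of column j, and $a_{IJ}$ is the mean of the whole submatrix.

δ-bicluster: AIJ is called a δ-bicluster if H(I,J) ≤ δ.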

DB-Seminar Slide 9

Bicluster Model (Example)

(1) Shifting pattern: H(I,J) = 0

     a1  a2  a3
O1    1   2   3
O2    5   6   7

(2) Scaling pattern: H(I,J) = 2/3

     a1  a2  a3
O1    1   2   4
O2    2   4   8

(3) Not similar pattern: H(I,J) = 8

     a1  a2  a3
O1    2   4  12
O2    4   6   2

(4) Submatrix of (2): H(I,J) = 2.25 > 2/3

     a1  a3
O1    1   4
O2    2   8

If we set δ = 2, (3) and (4) are not δ-biclusters.
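As a rough illustration of how a residue score is computed, the sketch below assumes the usual Cheng and Church definition: each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean, and the score is the mean of the squared residues. The function name is hypothetical, and the check uses only the shifting-pattern matrix (1) and the dissimilar matrix (3).

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a submatrix given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [
        sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue of one entry relative to its row, its column, and the whole matrix.
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


shifting = [[1, 2, 3], [5, 6, 7]]       # matrix (1)
not_similar = [[2, 4, 12], [4, 6, 2]]   # matrix (3)

print(mean_squared_residue(shifting))
print(mean_squared_residue(not_similar))
```

A pure shifting pattern has zero residue because every entry is fully explained by its row and column means.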

DB-Seminar Slide 10

Drawbacks of Bicluster Model

A submatrix of a δ-bicluster is not necessarily a δ-bicluster.

Not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer).

Cannot exclude outliers in a bicluster.

Difficult to design an efficient algorithm.

DB-Seminar Slide 11

Bicluster Model (Example)

The bicluster shown in Figure (a) contains an obvious outlier but it still has a fairly small mean squared residue (4.238).

If we get rid of such outliers by reducing the δ threshold, we will also exclude many biclusters that do exhibit similar patterns.

DB-Seminar Slide 12

The pCluster Model

pScore of a 2 × 2 matrix:

O: subset of objects in the database

T: subset of attributes; (O, T): submatrix of the dataset

δ: user-specified clustering threshold

dxa: value of object x on attribute a

Given x, y ∈ O and a, b ∈ T:

$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
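The pScore of a 2 × 2 matrix is a one-line computation. The sketch below uses a hypothetical function name and made-up values; it only illustrates that a pure shift between two objects gives a score of zero.

```python
def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 x 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


# Hypothetical values: object y is object x shifted by 4, so the score is 0.
print(pscore(1, 2, 5, 6))

# Hypothetical values: the change from a to b differs by 3 between x and y.
print(pscore(1, 2, 5, 9))
```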

DB-Seminar Slide 13

The pCluster Model (Cont.)

pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.

Pair (O, T ) forms a δ-pCluster if for any 2 × 2 submatrix X in (O, T ), we have pScore(X) ≤ δ for some δ ≥ 0.
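The definition can be checked directly by testing every 2 × 2 submatrix. The sketch below is a brute-force check, not the mining algorithm from the talk; the function name, the nested-dictionary layout of `data`, and the sample values are all assumptions made for illustration.

```python
from itertools import combinations


def is_delta_pcluster(data, objects, attributes, delta):
    """Check the delta-pCluster definition by testing every 2 x 2 submatrix.

    `data` is assumed to map object id -> {attribute -> value}.
    """
    for x, y in combinations(objects, 2):
        for a, b in combinations(attributes, 2):
            score = abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))
            if score > delta:
                return False
    return True


# Hypothetical data: o2 is o1 shifted by 4; o3 follows a different pattern.
data = {
    "o1": {"a": 1, "b": 2, "c": 3},
    "o2": {"a": 5, "b": 6, "c": 7},
    "o3": {"a": 2, "b": 4, "c": 9},
}

print(is_delta_pcluster(data, ["o1", "o2"], ["a", "b", "c"], delta=0))
print(is_delta_pcluster(data, ["o1", "o2", "o3"], ["a", "b", "c"], delta=0))
```

The cost of this check grows with the square of both the number of objects and the number of attributes, which is one reason a dedicated mining algorithm is needed.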

DB-Seminar Slide 14

The pCluster Model (Example)

In Figure (a): objects 2 and 3 and attributes {b, c} form a 2 × 2 submatrix X with d2b = 12, d2c = 15, d3b = 40, d3c = 43, so pScore(X) = |(12 − 15) − (40 − 43)| = 0.

Objects 1, 2, 3 and attributes {b, c, h, j, e} form a pCluster (δ = 0).

DB-Seminar Slide 15

The pCluster Model (Cont.)

Compact property of pCluster:

Let (O, T) be a δ-pCluster. Any of its submatrices (O′, T′) is also a δ-pCluster (this follows from the definition of pCluster);

The volume of a pCluster: |O|×|T|;

Definition of pCluster is symmetric:

|(dxa- dxb) - (dya- dyb)|

= |(dxa- dya) - (dxb- dyb)|
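The symmetry follows from rearranging terms inside the absolute value. A short derivation, written out here as a check rather than taken from the slides:

```latex
\begin{aligned}
|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|
  &= |d_{xa} - d_{xb} - d_{ya} + d_{yb}| \\
  &= |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|
\end{aligned}
```

Comparing two objects across a pair of attributes therefore gives the same score as comparing two attributes across a pair of objects.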

DB-Seminar Slide 16

Problem Statement

Task: To find all pairs (O,T) such that (O,T) is a δ-pCluster according to its definition, and |O|≥ nr, |T|≥ nc.

Parameters:

D: dataset

δ: a cluster threshold

nc: a minimal number of columns

nr: a minimal number of rows

DB-Seminar Slide 17

The Algorithm

Definition of MDS: Assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T′ ⊃ T such that (O, T′) is also a δ-pCluster.

Objects can form pClusters on multiple MDSs. The algorithm is depth-first, meaning that it only generates pClusters that cluster on MDSs.

DB-Seminar Slide 18

Pair-wise Clustering

Pairwise Clustering Principle:

Given objects x and y, and a dimension set T, x and y form a δ-pCluster on T iff the difference between the largest and smallest value in S(x, y, T) = {dxa − dya | a ∈ T} is at most δ.

In other words, ({x, y}, T) is a δ-pCluster if the following is true:

$$\max_{a, b \in T} f(a, b) \le \delta, \qquad f(a, b) = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$
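The principle replaces a check over all attribute pairs with a single max-minus-min test on the per-attribute differences. The sketch below compares the two on made-up values; the function names and the dictionary layout of the objects are assumptions.

```python
from itertools import combinations


def difference_set(x, y, dims):
    """S(x, y, T): the per-dimension differences d_xa - d_ya for a in T."""
    return [x[a] - y[a] for a in dims]


def pair_forms_pcluster(x, y, dims, delta):
    """Pairwise principle: the range of S(x, y, T) must not exceed delta."""
    s = difference_set(x, y, dims)
    return max(s) - min(s) <= delta


def pair_forms_pcluster_bruteforce(x, y, dims, delta):
    """Same test via the pScore of every pair of dimensions."""
    return all(
        abs((x[a] - y[a]) - (x[b] - y[b])) <= delta
        for a, b in combinations(dims, 2)
    )


# Hypothetical objects; the differences are -2, -1, 0, so the range is 2.
x = {"a": 1, "b": 4, "c": 9}
y = {"a": 3, "b": 5, "c": 9}
dims = ["a", "b", "c"]

for delta in (1, 2):
    print(delta, pair_forms_pcluster(x, y, dims, delta),
          pair_forms_pcluster_bruteforce(x, y, dims, delta))
```

The range test needs one pass over T, whereas the brute-force version examines every pair of dimensions.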

DB-Seminar Slide 19

Pair-wise Clustering (Example)

Sorted sequence of S(x, y, T): s1 ≤ … ≤ sk ≤ … ≤ sn

Objects x and y form a δ-pCluster on T if

$$|s_k - s_i| \le \delta, \qquad i, k = 1, \ldots, n$$

Three MDSs were found: {e,g,c}, {a,d,b,h}, {h,f}
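One way to find the MDSs of an object pair is to sort the differences and slide a window over the sorted sequence. The sketch below is an assumption about how such a scan could look, not the authors' code. The values of x and y are invented so that, with δ = 2 and nc = 2, the scan happens to produce the three dimension sets named above; the slide's actual figure data is not reproduced here.

```python
def pairwise_mds(x, y, dims, delta, nc):
    """Maximal dimension sets on which x and y form a delta-pCluster.

    Sorts dimensions by d_xa - d_ya, then slides a window whose first and
    last differences are at most delta apart. Windows with fewer than nc
    dimensions are dropped.
    """
    ordered = sorted(dims, key=lambda a: x[a] - y[a])
    s = [x[a] - y[a] for a in ordered]
    n = len(s)
    result = []
    end, prev_end = 0, -1
    for start in range(n):
        end = max(end, start)
        # Extend the window to the right as far as the threshold allows.
        while end + 1 < n and s[end + 1] - s[start] <= delta:
            end += 1
        # The window is maximal only if it reaches past the previous one.
        if end > prev_end and end - start + 1 >= nc:
            result.append(set(ordered[start:end + 1]))
        prev_end = end
    return result


# Hypothetical objects (not the values from the slide's figure).
x = {"a": 13, "b": 24, "c": 4, "d": 12, "e": 9, "f": 10, "g": 5, "h": 20}
y = {"a": 10, "b": 20, "c": 5, "d": 8, "e": 12, "f": 3, "g": 7, "h": 15}

mds = pairwise_mds(x, y, list("abcdefgh"), delta=2, nc=2)
print(mds)
```

Because the right end of the window never moves backwards, the scan after sorting is linear in the number of dimensions.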

DB-Seminar Slide 20

MDS Pruning

MDS Pruning Principle:

Let Txy be an MDS for objects x, y, and a ∈ Txy. For any O and T, a necessary condition of ({x, y} ∪ O, {a} ∪ T) being a δ-pCluster is ∀b ∈ T, ∃Oab ⊇ {x, y}.

The pruning criterion can be stated as follows:

For any dimension a in an MDS Txy, count the number of Oab that contain {x, y}. If the number of such Oab is less than nc − 1, remove a from Txy. Furthermore, if the removal of a makes |Txy| < nc, we remove Txy as well.
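A single pruning pass over one object-pair MDS might look like the sketch below. The data layout is an assumption: `column_pair_mds` maps a column pair (a, b) to the list of object sets Oab found for that pair. For each dimension a, the sketch counts the partner columns b that have at least one Oab containing both objects, which is one reading of the count described above. All names and values are hypothetical.

```python
def prune_object_pair_mds(x, y, t_xy, column_pair_mds, nc):
    """One pruning pass over the MDS t_xy of objects x and y.

    Returns the pruned dimension set, or None when fewer than nc
    dimensions survive and the whole MDS is dropped.
    """
    pair = {x, y}
    kept = set()
    for a in t_xy:
        # Count partner columns b with some O_ab that contains both x and y.
        support = sum(
            1
            for (c1, c2), object_sets in column_pair_mds.items()
            if a in (c1, c2) and any(pair <= objs for objs in object_sets)
        )
        if support >= nc - 1:
            kept.add(a)
    return kept if len(kept) >= nc else None


# Hypothetical column-pair MDSs: objects x and y never co-occur on (a, c).
column_pair_mds = {
    ("a", "b"): [{"x", "y", "z"}],
    ("a", "c"): [{"x", "z"}],
    ("b", "c"): [{"x", "y"}],
}

print(prune_object_pair_mds("x", "y", {"a", "b", "c"}, column_pair_mds, nc=3))
print(prune_object_pair_mds("x", "y", {"a", "b", "c"}, column_pair_mds, nc=2))
```

In the full procedure this pass would be repeated, alternating with the symmetric pruning of column-pair MDSs, until nothing more can be removed.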

DB-Seminar Slide 21

MDS Pruning (Example)

DB-Seminar Slide 22

The Main Algorithm

First step: Scan the dataset to find column-pair MDSs and object-pair MDSs.

Second step: Prune object-pair MDSs and column-pair MDSs in turn until no further pruning can be made.

Third step: Insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)

DB-Seminar Slide 23

Construct a prefix tree

Sort the columns into a fixed order, e.g., a, b, c, …

Insert each 2-object pCluster (O, T) into the prefix tree.

Perform a post-order traversal of the prefix tree.

Prune nodes for which |O| < nr. (Add the objects in O to nodes whose column set T′ ⊂ T and |T′| = |T| − 1.)
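The insertion step can be sketched with a small trie keyed by the sorted column names. This covers only the insertion of 2-object pClusters; the post-order traversal, pruning, and object propagation are not shown. The class and function names, and the sample clusters, are hypothetical.

```python
class PrefixTreeNode:
    """A node holds the objects clustered on the column path leading to it."""

    def __init__(self):
        self.children = {}   # column name -> child node (the edge is the column)
        self.objects = set()


def insert_pcluster(root, objects, columns):
    """Insert a pCluster (O, T), following the columns of T in sorted order."""
    node = root
    for column in sorted(columns):
        node = node.children.setdefault(column, PrefixTreeNode())
    node.objects |= set(objects)
    return node


root = PrefixTreeNode()
# Hypothetical 2-object pClusters sharing the column set {a, b}.
insert_pcluster(root, {"x", "y"}, {"b", "a"})
insert_pcluster(root, {"y", "z"}, {"a", "b"})
insert_pcluster(root, {"x", "z"}, {"a", "c"})

print(sorted(root.children["a"].children["b"].objects))
```

Sorting the columns first means that clusters on the same column set always land on the same node, regardless of the order in which their columns were discovered.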

DB-Seminar Slide 24

Construct a prefix tree (Example)

DB-Seminar Slide 25

Algorithm Complexity

The main algorithm for mining pClusters has time complexity

$$O(M^2 N \log N + N^2 M \log M)$$

where M is the number of columns and N is the number of objects.

The worst case:

$$O(kM^2N^2)$$

However, the complexity can be greatly reduced because of the MDS pruning process.

DB-Seminar Slide 26

Experiments

Datasets

Synthetic datasets (parameters: different nr, nc, and number of embedded perfect pClusters with δ = 0)

Gene expression data (yeast microarray)

MovieLens dataset (E-commerce)

DB-Seminar Slide 27

Performance Analysis

Response time vs. data size

DB-Seminar Slide 28

Performance Analysis (Cont.)

Sensitivity to mining parameters: δ, nc, and nr

DB-Seminar Slide 29

Performance Analysis (Cont.)

Compare the pCluster with an alternative approach based on the subspace clustering algorithm CLIQUE.

DB-Seminar Slide 30

Performance Analysis (Cont.)

The pruning process is essential in the pCluster algorithm.

Without pruning, the pCluster algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.

DB-Seminar Slide 31

Conclusion

pCluster model: captures the closeness of objects and the pattern similarity among objects in subsets of dimensions.

Advantages:

- Discovers all the qualified pClusters.

- The depth-first clustering algorithm avoids generating clusters which are part of other clusters.

- More efficient than current algorithms.

- Resilient to outliers.

DB-Seminar Slide 32

References

Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent Systems for Molecular Biology, 2000.

S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, 2000. In http://arep.med.harvard.edu/biclustering/yeast.matrix.

R. C. Agarwal, C. C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.

J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.

Thanks!!