Clustering by Pattern Similarity in Large Data Sets

Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu

IBM T. J. Watson Research Center

Presented by Edmond Wu

DB-Seminar Slide 2

Talk Outline

Introduction

Related Work

pCluster Model

Performance Analysis

Conclusion

DB-Seminar Slide 3

Motivation

Why is the discovery of clusters based on pattern similarity interesting and important?

DNA micro-array analysis

E-commerce: Recommendation systems & target marketing

DB-Seminar Slide 4

Background Knowledge

Clustering: the process of grouping a set of objects into classes of similar objects.

Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.

Pattern similarity: objects exhibit a coherent pattern on a subset of dimensions. (They are not required to have close values on any attribute.)

DB-Seminar Slide 5

Example of Similar pattern on a subset of dimensions

DB-Seminar Slide 6

Challenges

Identifying subspace clusters in high-dimensional data sets is difficult.

Traditional distance functions cannot capture the pattern similarity among the objects.

DB-Seminar Slide 7

How to detect shifting pattern?

Given N attributes a_1, …, a_N

Define a derived attribute A_ij = a_i − a_j for every pair of attributes (a_i, a_j).

Thus, the problem is equivalent to mining subspace clusters on the objects with the derived set of attributes.

Drawback: the converted dataset will have N(N−1)/2 dimensions, which is intractable even for a small N.
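The conversion to derived attributes can be sketched in a few lines of Python. This is only an illustration: the function name `derive_difference_attributes`, the attribute labels, and the sample values are assumptions, not part of the original slides. It shows that two objects related by a pure shift become identical on the derived attributes, and that the number of derived dimensions grows as N(N−1)/2.

```python
from itertools import combinations


def derive_difference_attributes(row):
    """Map {attribute: value} to the derived attributes A_ij = a_i - a_j."""
    return {
        f"{i}-{j}": row[i] - row[j]
        for i, j in combinations(sorted(row), 2)
    }


# Hypothetical object with N = 4 attributes.
row = {"a1": 1, "a2": 2, "a3": 3, "a4": 7}
derived = derive_difference_attributes(row)
print(len(derived))  # 6, i.e. N(N-1)/2 for N = 4

# A second object that is the first one shifted by +10 on every attribute.
shifted = {name: value + 10 for name, value in row.items()}
print(derive_difference_attributes(shifted) == derived)  # True
```

Because the shifted object coincides with the original on every derived attribute, an ordinary subspace clustering method would group them, at the price of a quadratic blow-up in dimensionality.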

DB-Seminar Slide 8

Related Work

Bicluster Model (Cheng et al.):

A_IJ: a submatrix of a DNA array, with mean squared residue score H(I,J).

δ-bicluster: A_IJ is called a δ-bicluster if H(I,J) ≤ δ
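As a rough sketch, the mean squared residue can be computed as follows, assuming the usual definition from Cheng and Church: the residue of an entry is the entry minus its row mean, minus its column mean, plus the mean of the whole submatrix, and H(I,J) is the average of the squared residues. The function name and the list-of-lists representation are assumptions made for illustration.

```python
def mean_squared_residue(matrix):
    """Mean squared residue H(I, J) of a submatrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: entry minus row mean, minus column mean, plus overall mean.
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# A pure shifting pattern has residue 0.
print(mean_squared_residue([[1, 2, 3], [5, 6, 7]]))   # 0.0
# Two rows without a shared pattern score much higher.
print(mean_squared_residue([[2, 4, 12], [4, 6, 2]]))  # 8.0
```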

DB-Seminar Slide 9

Bicluster Model (Example)

(1) Shifting pattern, H(I,J) = 0

      a1  a2  a3
O1     1   2   3
O2     5   6   7

(2) Scaling pattern, H(I,J) = 2/3

      a1  a2  a3
O1     1   2   4
O2     2   4   8

(3) Not similar pattern, H(I,J) = 8

      a1  a2  a3
O1     2   4  12
O2     4   6   2

(4) Submatrix of (2), H(I,J) = 2.25 > 2/3

      a1  a3
O1     1   4
O2     2   8

If we set δ = 2, (3) and (4) are not δ-biclusters.

DB-Seminar Slide 10

Drawbacks of Bicluster Model

A submatrix of a δ-bicluster is not necessarily a δ-bicluster.

Not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer).

Cannot exclude outliers in a bicluster.

Designing an efficient algorithm is difficult.

DB-Seminar Slide 11

Bicluster Model (Example)

The bicluster shown in Figure (a) contains an obvious outlier but it still has a fairly small mean squared residue (4.238).

If we get rid of such outliers by reducing the δ threshold, we also exclude many biclusters that do exhibit similar patterns.

DB-Seminar Slide 12

The pCluster Model

O: a subset of objects in the database

T: a subset of attributes; (O, T): a submatrix of the dataset

δ: user-specified clustering threshold

d_xa: value of object x on attribute a

pScore of a 2 × 2 matrix: given x, y ∈ O and a, b ∈ T,

$$\mathrm{pScore}\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$

DB-Seminar Slide 13

The pCluster Model (Cont.)

pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.

Pair (O, T ) forms a δ-pCluster if for any 2 × 2 submatrix X in (O, T ), we have pScore(X) ≤ δ for some δ ≥ 0.
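The definition translates directly into a brute-force check. The sketch below is illustrative only: the function names, the nested-dictionary layout of `data`, and object 9 with its values are assumptions, while the values of objects 2 and 3 on attributes b and c are the ones used in the example slide. It tests every 2 × 2 submatrix, so it costs O(|O|² |T|²) pScore evaluations and is meant to clarify the definition, not to mine clusters efficiently.

```python
from itertools import combinations


def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 x 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(data, objects, attributes, delta):
    """Return True if every 2 x 2 submatrix of (objects, attributes) has pScore <= delta.

    data is a nested mapping: data[obj][attr] -> value (assumed layout).
    """
    for x, y in combinations(objects, 2):
        for a, b in combinations(attributes, 2):
            if pscore(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True


data = {
    2: {"b": 12, "c": 15},
    3: {"b": 40, "c": 43},
    9: {"b": 5, "c": 20},   # hypothetical object that breaks the pattern
}
print(is_delta_pcluster(data, [2, 3], ["b", "c"], delta=0))      # True
print(is_delta_pcluster(data, [2, 3, 9], ["b", "c"], delta=0))   # False
print(is_delta_pcluster(data, [2, 3, 9], ["b", "c"], delta=12))  # True
```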

DB-Seminar Slide 14

The pCluster Model (Example)

In Figure (a): objects 2 and 3 and attributes {b, c} form a 2 × 2 submatrix X with d_2b = 12, d_2c = 15, d_3b = 40, d_3c = 43, so pScore(X) = |(12 − 15) − (40 − 43)| = 0.

Objects {1, 2, 3} and attributes {b, c, h, j, e} form a pCluster (δ = 0).

DB-Seminar Slide 15

The pCluster Model (Cont.)

Compact property of pCluster:

Let (O, T) be a δ-pCluster. Any submatrix (O', T') of it is also a δ-pCluster (this follows directly from the definition of pCluster);

The volume of a pCluster: |O|×|T|;

Definition of pCluster is symmetric:

$$|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})| = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$

DB-Seminar Slide 16

Problem Statement

Task: find all pairs (O, T) such that (O, T) is a δ-pCluster according to its definition, and |O| ≥ nr, |T| ≥ nc.

Parameters:

D: dataset

δ: cluster threshold

nc: minimal number of columns

nr: minimal number of rows

DB-Seminar Slide 17

The Algorithm

Definition of MDS: Assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T' ⊃ T such that (O, T') is also a δ-pCluster.

Objects can form pClusters on multiple MDSs. The algorithm is depth-first: it generates only pClusters that cluster on MDSs.

DB-Seminar Slide 18

Pair-wise Clustering

Pairwise Clustering Principle:

Given objects x and y and a dimension set T, x and y form a δ-pCluster on T iff the difference between the largest and smallest value in S(x, y, T) = {d_xa − d_ya | a ∈ T} is at most δ.

In other words, ({x, y}, T) is a pCluster if the following is true:

$$\max_{a, b \in T} f(a, b) \le \delta, \qquad f(a, b) = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$

DB-Seminar Slide 19

Pair-wise Clustering (Example)

Sorted sequence of S(x, y, T): s_1, …, s_k, …, s_n

Objects x and y form a δ-pCluster on T if

$$s_k - s_i \le \delta \quad \text{for all } 1 \le i < k \le n$$

Three MDSs were found: {e,g,c}, {a,d,b,h}, {h,f}
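One way to find the MDSs of an object pair is to sort the per-attribute differences and slide a window over the sorted sequence, keeping each maximal window whose largest and smallest difference are within δ. The sketch below follows that idea; the function name `pairwise_mds`, the dictionary representation of objects, and the sample values are assumptions for illustration and are not the data behind the slide's figure.

```python
def pairwise_mds(x, y, delta, nc):
    """Return the maximal attribute sets (size >= nc) on which x and y form a delta-pCluster.

    x and y map attribute names to values (assumed layout).
    """
    # Sorted sequence of differences d_xa - d_ya, remembering each attribute.
    diffs = sorted((x[a] - y[a], a) for a in x)
    n = len(diffs)
    result = []
    start = 0
    for end in range(n):
        # Shrink from the left until the window spread is within delta.
        while diffs[end][0] - diffs[start][0] > delta:
            start += 1
        # The window is maximal if it cannot be extended to the right.
        extends_right = (
            end + 1 < n and diffs[end + 1][0] - diffs[start][0] <= delta
        )
        if not extends_right and end - start + 1 >= nc:
            result.append({a for _, a in diffs[start:end + 1]})
    return result


# Hypothetical objects: differences are a=1, b=1, c=1, d=6.
x = {"a": 1, "b": 2, "c": 3, "d": 10}
y = {"a": 0, "b": 1, "c": 2, "d": 4}
print(pairwise_mds(x, y, delta=0, nc=2))  # [{'a', 'b', 'c'}] (set order may vary)
```

Sorting dominates the cost, so each object pair takes O(|T| log |T|) time in this sketch.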

DB-Seminar Slide 20

MDS Pruning

MDS Pruning Principle:

Let T_xy be an MDS for objects x, y, and a ∈ T_xy. For any O and T, a necessary condition of ({x, y} ∪ O, {a} ∪ T) being a δ-pCluster is: ∀ b ∈ T, ∃ O_ab ⊇ {x, y}, where O_ab is a column-pair MDS for columns a and b.

The pruning criterion can be stated as follows:

For any dimension a in an MDS T_xy, count the number of O_ab that contain {x, y}. If the number of such O_ab is less than nc − 1, remove a from T_xy. Furthermore, if the removal of a makes |T_xy| < nc, remove T_xy as well.
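A single pruning pass over one object-pair MDS might look like the sketch below. Several details are assumptions rather than the authors' implementation: the function name, the representation of column-pair MDSs as a dictionary from a column pair to a list of object sets, the choice to count each partner column b at most once, and the sample data. In the full algorithm this pass would be repeated, alternating with the symmetric pruning of column-pair MDSs.

```python
def prune_object_pair_mds(pair, t_xy, column_pair_mds, nc):
    """Apply the pruning criterion once to the MDS t_xy of the object pair (x, y).

    column_pair_mds maps a column pair (a, b) to a list of object sets O_ab
    (assumed layout, each unordered column pair appearing once).
    Returns the pruned column set, or None if fewer than nc columns survive.
    """
    x, y = pair
    kept = set()
    for a in t_xy:
        # Count partner columns b that have some O_ab containing both x and y.
        partners = 0
        for (c1, c2), object_sets in column_pair_mds.items():
            if a in (c1, c2) and any({x, y} <= objs for objs in object_sets):
                partners += 1
        if partners >= nc - 1:
            kept.add(a)
    return kept if len(kept) >= nc else None


# Hypothetical column-pair MDSs for three columns.
column_pair_mds = {
    ("a", "b"): [{1, 2, 3}],
    ("a", "c"): [{1, 2}],
    ("b", "c"): [{2, 3}],
}
print(prune_object_pair_mds((1, 2), {"a", "b", "c"}, column_pair_mds, nc=3))  # None
print(prune_object_pair_mds((1, 2), {"a", "b", "c"}, column_pair_mds, nc=2))
```

With nc = 3, columns b and c each have only one supporting O_ab for the pair (1, 2), so both are removed and the whole MDS is dropped; with nc = 2 all three columns survive.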

DB-Seminar Slide 21

MDS Pruning (Example)

DB-Seminar Slide 22

The Main Algorithm

First step: Scan the dataset to find column-pair MDSs and object-pair MDSs.

Second step: Prune object-pair MDSs and column-pair MDSs alternately until no further pruning is possible.

Third step: Insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)

DB-Seminar Slide 23

Construct a prefix tree

Fix an order on the columns, e.g., a, b, c, …

Insert each 2-object pCluster (O, T) into the prefix tree.

Perform a post-order traversal of the prefix tree.

Prune nodes with |O| < nr. (Add the objects in O to the nodes whose column set T' satisfies T' ⊂ T and |T'| = |T| − 1.)
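The insertion step can be pictured with a small trie keyed by columns in sorted order. The sketch below covers only insertion, not the post-order traversal or pruning; the `Node` class, the `insert` function, and the sample clusters are assumptions made for illustration.

```python
class Node:
    """Prefix-tree node: edges are columns, the node holds a set of objects."""

    def __init__(self):
        self.children = {}    # column -> child Node
        self.objects = set()  # objects that cluster on the column path to this node


def insert(root, objects, columns):
    """Insert a 2-object pCluster (objects, columns), following columns in sorted order."""
    node = root
    for col in sorted(columns):
        node = node.children.setdefault(col, Node())
    node.objects |= set(objects)
    return node


root = Node()
insert(root, {1, 2}, {"c", "a", "b"})  # hypothetical object-pair MDSs
insert(root, {1, 3}, {"a", "b", "c"})
insert(root, {2, 4}, {"a", "b"})

ab = root.children["a"].children["b"]
print(ab.objects)                # {2, 4}
print(ab.children["c"].objects)  # {1, 2, 3}
```

Because columns are always followed in the same order, clusters that share a column prefix share a path, which is what lets objects from different pairs accumulate on a common node.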

DB-Seminar Slide 24

Construct a prefix tree (Example)

DB-Seminar Slide 25

Algorithm Complexity

The main algorithm for mining pClusters has time complexity

$$O(M^2 N \log N + N^2 M \log M)$$

where M is the number of columns and N is the number of objects.

The worst case:

$$O(k M^2 N^2)$$

However, the complexity can be greatly reduced by the MDS pruning process.

DB-Seminar Slide 26

Experiments

Datasets

Synthetic datasets (parameters: different nr, nc, and number of embedded perfect pClusters with δ = 0)

Gene expression data (yeast microarray)

MovieLens dataset (E-commerce)

DB-Seminar Slide 27

Performance Analysis

Response time vs. data size

DB-Seminar Slide 28

Performance Analysis (Cont.)

Sensitivity to mining parameters: δ, nc, and nr

DB-Seminar Slide 29


Comparison of the pCluster algorithm with an alternative approach based on the subspace clustering algorithm CLIQUE.

DB-Seminar Slide 30


The pruning process is essential in the pCluster algorithm.

Without pruning, the pCluster algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.

DB-Seminar Slide 31

Conclusion

pCluster model: captures the closeness of objects and the pattern similarity among objects in subsets of dimensions.

Advantages:

- Discovers all the qualified pClusters.
- The depth-first clustering algorithm avoids generating clusters that are part of other clusters.
- More efficient than current algorithms.
- Resilient to outliers.

DB-Seminar Slide 32

References

Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent Systems for Molecular Biology, 2000.

S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set. In http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.

R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.

J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.

Thanks!!
