Clustering by Pattern Similarity in Large Data Sets
Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu
IBM T. J. Watson Research Center
Presented by Edmond Wu
DB-Seminar Slide 2
Talk Outline
Introduction
Related Work
pCluster Model
Performance Analysis
Conclusion
DB-Seminar Slide 3
Motivation
Why is discovering clusters based on pattern similarity interesting and important?
DNA micro-array analysis
E-commerce: Recommendation systems & target marketing
DB-Seminar Slide 4
Background Knowledge
Clustering: the process of grouping a set of objects into classes of similar objects.
Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.
Pattern similarity: a coherent pattern on a subset of dimensions. (Objects are not required to have close values on any attribute.)
DB-Seminar Slide 5
Example of Similar pattern on a subset of dimensions
DB-Seminar Slide 6
Challenges
Identifying subspace clusters in high-dimensional data sets is difficult.
Traditional distance functions cannot capture the pattern similarity among the objects.
DB-Seminar Slide 7
How to detect shifting pattern?
Given N attributes a1, …, aN:
Define a derived attribute Aij = ai - aj for every pair of attributes (ai, aj). The problem then reduces to mining subspace clusters on the objects with the derived attribute set.
Drawback: the converted dataset has
N(N-1)/2 dimensions,
intractable even for a small N.
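A minimal sketch of this transformation, illustrating the dimensional blow-up (function and attribute names are ours, not the paper's):

```python
from itertools import combinations

def derive_attributes(obj):
    """Map an object's attributes {a_i: value} to the derived
    attributes {(a_i, a_j): a_i - a_j} for every attribute pair."""
    return {(ai, aj): obj[ai] - obj[aj]
            for ai, aj in combinations(sorted(obj), 2)}

obj = {"a1": 3.0, "a2": 5.0, "a3": 4.0, "a4": 6.0}
derived = derive_attributes(obj)
# 4 original attributes yield 4*3/2 = 6 derived dimensions
print(len(derived))
```

Even this tiny example doubles the dimensionality; for N in the hundreds (typical for micro-array data), the derived space is intractable.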
DB-Seminar Slide 8
Related Work
Bicluster Model (Cheng et al):
AIJ: a submatrix of a DNA array, with the mean squared residue score H(I,J) = (1/(|I||J|)) Σ_{i∈I, j∈J} (dij - diJ - dIj + dIJ)², where diJ, dIj, and dIJ are the row, column, and submatrix means.
δ-bicluster: AIJ is called a δ-bicluster if H(I,J) ≤ δ.
DB-Seminar Slide 9
Bicluster Model (Example)
(1) Shifting pattern, H(I,J) = 0:

   a1 a2 a3
O1  1  2  3
O2  5  6  7

(2) Scaling pattern, H(I,J) = 2/3:

   a1 a2 a3
O1  1  2  4
O2  2  4  8

(3) No similar pattern, H(I,J) = 8:

   a1 a2 a3
O1  2  4 12
O2  4  6  2

(4) Submatrix of (2), H(I,J) = 2.25 > 2/3:

   a1 a3
O1  1  4
O2  2  8

If we set δ = 2, then (3) and (4) are not δ-biclusters.
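As a concrete check, a small sketch of the Cheng & Church mean squared residue (function name is ours); it reproduces the H values of the shifting pattern (1) and the dissimilar pattern (3):

```python
def msr(matrix):
    """Mean squared residue H(I,J) of Cheng & Church:
    H = (1/(|I||J|)) * sum over i,j of (d_ij - d_iJ - d_Ij + d_IJ)^2."""
    rows, cols = len(matrix), len(matrix[0])
    row_mean = [sum(r) / cols for r in matrix]                                # d_iJ
    col_mean = [sum(matrix[i][j] for i in range(rows)) / rows
                for j in range(cols)]                                         # d_Ij
    total = sum(map(sum, matrix)) / (rows * cols)                             # d_IJ
    return sum((matrix[i][j] - row_mean[i] - col_mean[j] + total) ** 2
               for i in range(rows) for j in range(cols)) / (rows * cols)

print(msr([[1, 2, 3], [5, 6, 7]]))   # shifting pattern (1): 0.0
print(msr([[2, 4, 12], [4, 6, 2]]))  # dissimilar pattern (3): 8.0
```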
DB-Seminar Slide 10
Drawbacks of Bicluster Model
A submatrix of a δ-bicluster is not necessarily a δ-bicluster.
Not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer).
Cannot exclude outliers from a bicluster.
Difficult to design an efficient algorithm.
DB-Seminar Slide 11
Bicluster Model (Example)
The bicluster shown in Figure (a) contains an obvious outlier, but it still has a fairly small mean squared residue (4.238).
If we try to get rid of such outliers by reducing the δ threshold, we also exclude many biclusters that do exhibit similar patterns.
DB-Seminar Slide 12
The pCluster Model
pScore of a 2× 2 matrix:
O : subset of objects in the database
T : subset of attributes; (O,T): submatrix of dataset
δ: user specified clustering threshold
dxa: value of object x on attribute a
Given x, y ∈ O, and a, b ∈ T, the pScore of the 2×2 matrix is:
pScore([[dxa, dxb], [dya, dyb]]) = |(dxa - dxb) - (dya - dyb)|
DB-Seminar Slide 13
The pCluster Model (Cont.)
pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.
Pair (O, T ) forms a δ-pCluster if for any 2 × 2 submatrix X in (O, T ), we have pScore(X) ≤ δ for some δ ≥ 0.
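The definition translates directly into a brute-force check over all 2×2 submatrices (a sketch; function names are ours, not the paper's):

```python
from itertools import combinations

def pscore(x, y):
    """pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]],
    where x = (d_xa, d_xb) and y = (d_ya, d_yb)."""
    return abs((x[0] - x[1]) - (y[0] - y[1]))

def is_pcluster(submatrix, delta):
    """(O, T) is a delta-pCluster iff every 2x2 submatrix has pScore <= delta."""
    n_rows, n_cols = len(submatrix), len(submatrix[0])
    for r1, r2 in combinations(range(n_rows), 2):
        for c1, c2 in combinations(range(n_cols), 2):
            x = (submatrix[r1][c1], submatrix[r1][c2])
            y = (submatrix[r2][c1], submatrix[r2][c2])
            if pscore(x, y) > delta:
                return False
    return True

# Example from the slides: objects 2, 3 on attributes {b, c}
print(pscore((12, 15), (40, 43)))  # |(12-15) - (40-43)| = 0
```

The mining algorithm does not enumerate submatrices this way; the sketch only makes the definition concrete.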
DB-Seminar Slide 14
The pCluster Model (Example)
In Figure (a): objects 2 and 3 with attributes {b, c} form a 2×2 submatrix X: d2b = 12, d2c = 15, d3b = 40, d3c = 43, so pScore(X) = |(12 - 15) - (40 - 43)| = 0.
Objects 1, 2, 3 and attributes {b, c, h, j, e} form a pCluster (δ = 0).
DB-Seminar Slide 15
The pCluster Model (Cont.)
Compact property of pCluster:
Let (O,T) be a δ-pCluster. Any of its submatrices (O',T') is also a δ-pCluster (this follows directly from the definition of pCluster);
The volume of a pCluster: |O|×|T|;
Definition of pCluster is symmetric:
|(dxa- dxb) - (dya- dyb)|
= |(dxa- dya) - (dxb- dyb)|
DB-Seminar Slide 16
Problem Statement
Task: To find all pairs (O,T) such that (O,T) is a δ-pCluster according to its definition, and |O|≥ nr, |T|≥ nc.
Parameters:
- D: dataset
- δ: cluster threshold
- nc: minimal number of columns
- nr: minimal number of rows
DB-Seminar Slide 17
The Algorithm
Definition of MDS: assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T' ⊃ T such that (O, T') is also a δ-pCluster.
Objects can form pClusters on multiple MDSs. The algorithm is depth-first, generating only pClusters that cluster on MDSs.
DB-Seminar Slide 18
Pair-wise Clustering
Pairwise Clustering Principle:
Given objects x and y and a dimension set T, let S(x, y, T) = {dxa - dya : a ∈ T}. Then x and y form a δ-pCluster on T iff the difference between the largest and smallest values in S(x, y, T) is at most δ.
In other words, ({x, y}, T) is a δ-pCluster iff:
max_{a, b ∈ T} f(a, b) ≤ δ, where f(a, b) = |(dxa - dya) - (dxb - dyb)|
DB-Seminar Slide 19
Pair-wise Clustering (Example)
Let s1 ≤ … ≤ sn be the sorted sequence of S(x, y, T). Objects x and y form a δ-pCluster on the dimensions corresponding to a contiguous subsequence si, …, sk iff sk - si ≤ δ.
In the example, three MDSs are found: {e, g, c}, {a, d, b, h}, {h, f}.
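The sliding-window search over the sorted differences can be sketched as follows (a simplified version of the paper's MDS generation; function and variable names are ours):

```python
def pairwise_mds(x, y, delta, nc):
    """Find maximal dimension sets on which objects x and y
    (dicts attribute -> value) form a delta-pCluster, by sliding
    a window over the sorted differences d_xa - d_ya."""
    diffs = sorted((x[a] - y[a], a) for a in x)
    result, start = [], 0
    for end in range(len(diffs)):
        # shrink the window until its spread is within delta
        while diffs[end][0] - diffs[start][0] > delta:
            start += 1
        # a window is maximal if it cannot be extended to the right
        at_right_edge = end == len(diffs) - 1
        maximal = at_right_edge or diffs[end + 1][0] - diffs[start][0] > delta
        if maximal and end - start + 1 >= nc:
            result.append({a for _, a in diffs[start:end + 1]})
    return result
```

On a toy pair whose differences are {a: 1, b: 1.5, c: 5, d: 2, e: 4.8} with δ = 1 and nc = 2, this yields the two overlapping MDSs {a, b, d} and {c, e}; as in the slide's example, MDSs may share dimensions.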
DB-Seminar Slide 20
MDS Pruning
MDS Pruning Principle:
Let Txy be an MDS for objects x and y, and a ∈ Txy. For any O and T, a necessary condition for ({x, y} ∪ O, {a} ∪ T) to be a δ-pCluster is: for every b ∈ T, there is an MDS Oab (objects clustering on column pair {a, b}) with Oab ⊇ {x, y}.
The pruning criterion can be stated as follows:
For any dimension a in an MDS Txy, count the number of Oab that contain {x, y}. If the number of such Oab is less than nc - 1, remove a from Txy. Furthermore, if the removal of a makes |Txy| < nc, remove Txy as well.
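One way to sketch the counting step (the `colpair_mds` dictionary layout and function names are our illustration, not the paper's data structures):

```python
def prune_object_pair_mds(txy, pair, colpair_mds, nc):
    """Prune dimensions from an object-pair MDS txy for objects pair = {x, y}.
    colpair_mds maps a column pair frozenset({a, b}) to the list of
    object sets Oab forming MDSs on those two columns."""
    x_y = set(pair)
    kept = set()
    for a in txy:
        # count column pairs {a, b} with some Oab containing {x, y}
        support = sum(
            1 for b in txy if b != a
            and any(x_y <= oab for oab in colpair_mds.get(frozenset({a, b}), []))
        )
        if support >= nc - 1:
            kept.add(a)
    # drop the whole MDS if too few dimensions survive
    return kept if len(kept) >= nc else None
```

In the full algorithm this pruning alternates between object-pair and column-pair MDSs until a fixed point is reached.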
DB-Seminar Slide 21
MDS Pruning (Example)
DB-Seminar Slide 22
The Main Algorithm
First step: scan the dataset to find column-pair MDSs and object-pair MDSs.
Second step: prune object-pair MDSs and column-pair MDSs in turn until no more pruning can be made.
Third step: insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)
DB-Seminar Slide 23
Construct a prefix tree
Sort the columns into a fixed order, e.g., a, b, c, …
Insert each 2-object pCluster (O, T) into the prefix tree.
Perform a post-order traversal of the prefix tree, pruning nodes with |O| < nr. (Before pruning a node, add the objects in O to the nodes whose column set T' ⊂ T with |T'| = |T| - 1.)
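The construction above can be sketched as a toy prefix tree (the object-redistribution step to the (|T|-1)-subsets is omitted for brevity; all names are illustrative):

```python
class Node:
    def __init__(self):
        self.children = {}    # next column -> child node
        self.objects = set()  # objects whose MDS is exactly this column path

def insert(root, columns, objects, order):
    """Insert a 2-object pCluster (objects, columns) along the sorted column path."""
    node = root
    for c in sorted(columns, key=order.index):
        node = node.children.setdefault(c, Node())
    node.objects |= set(objects)

def collect(node, path, nr, out):
    """Post-order traversal collecting candidate clusters with |O| >= nr."""
    for c, child in node.children.items():
        collect(child, path + [c], nr, out)
    if len(node.objects) >= nr:
        out.append((frozenset(node.objects), tuple(path)))

order = ["a", "b", "c", "d"]
root = Node()
insert(root, {"b", "a", "c"}, {"o1", "o2"}, order)
insert(root, {"a", "b", "c"}, {"o2", "o3"}, order)
out = []
collect(root, [], 2, out)
# one candidate: objects {o1, o2, o3} on columns (a, b, c)
print(out)
```

Because column sets are inserted in one fixed order, object pairs sharing a column set always meet at the same node, which is what lets their object sets be merged.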
DB-Seminar Slide 24
Construct a prefix tree (Example)
DB-Seminar Slide 25
Algorithm Complexity
The main algorithm for mining pClusters has time complexity
O(M²N log N + N²M log M)
where M is the number of columns and N is the number of objects.
The worst case is O(kM²N²), where k is the number of MDSs a pair can have.
However, the complexity is greatly reduced in practice by the MDS pruning process.
DB-Seminar Slide 26
Experiments
Datasets:
- Synthetic datasets (parameters: nr, nc, number of embedded perfect pClusters with δ = 0)
- Gene expression data (yeast microarray)
- MovieLens dataset (e-commerce)
DB-Seminar Slide 27
Performance Analysis
Response time VS. data size
DB-Seminar Slide 28
Performance Analysis (Cont.)
Sensitivity to the mining parameters: δ, nc, and nr
DB-Seminar Slide 29
Performance Analysis (Cont.)
Compare the pCluster with an alternative approach based on the subspace clustering algorithm CLIQUE.
DB-Seminar Slide 30
Performance Analysis (Cont.)
The pruning process is essential in the pCluster algorithm.
Without pruning, the pCluster algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.
DB-Seminar Slide 31
Conclusion
pCluster Model: captures both the closeness of objects and the pattern similarity among objects in subsets of dimensions.
Advantages:
- Discovers all qualified pClusters.
- The depth-first clustering algorithm avoids generating clusters that are parts of other clusters.
- More efficient than existing algorithms.
- Resilient to outliers.
DB-Seminar Slide 32
References
Y. Cheng and G. Church. Biclustering of expression data. In Proc. of the 8th International Conference on Intelligent Systems for Molecular Biology, 2000.
S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, 2000. http://arep.med.harvard.edu/biclustering/yeast.matrix
R. C. Agarwal, C. C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.
J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.
Thanks!!