EECS 800 Research Seminar: Mining Biological Data
DESCRIPTION
EECS 800 Research Seminar, Mining Biological Data. Instructor: Luke Huan. Fall, 2006. Model-Based Clustering. What is model-based clustering? Attempt to optimize the fit between the given data and some mathematical model.
The UNIVERSITY of Kansas
EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall, 2006
Mining Biological Data, KU EECS 800, Luke Huan, Fall’06, slide 2
10/04/2006, Model-based Clustering
Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption: data are generated by a mixture of underlying probability distributions
Typical methods
Statistical approach
EM (Expectation maximization), AutoClass
Machine learning approach
COBWEB, CLASSIT
Neural network approach
SOM (Self-Organizing Feature Map)
EM (Expectation Maximization)
EM: a popular iterative refinement algorithm
An extension to k-means
Assign each object to a cluster according to a weight (prob. distribution)
New means are computed based on weighted measures
General idea
Starts with an initial estimate of the parameter vector
Iteratively rescores the patterns against the mixture density produced by the parameter vector
The rescored patterns are used to update the parameter estimates
Patterns belong to the same cluster if their scores place them in the same component
The algorithm converges fast but may not reach the global optimum
AutoClass (Cheeseman and Stutz, 1996)
1D Gaussian Mixture Model
Given a set of data distributed in a 1D space, how to perform clustering in the data set?
General idea: factorize the p.d.f. into a mixture of simple models.
Discrete values: Bernoulli distribution
Continuous values: Gaussian distribution
The EM (Expectation Maximization) Algorithm
Initially, randomly assign k cluster centers
Iteratively refine the clusters based on two steps
Expectation step: assign each data point $x_i$ to cluster $C_k$ with probability $P(x_i, C_k)$
Maximization step:
Estimation of model parameters
$\mu_k = \sum_i x_i \, P(x_i, C_k) \, / \, \sum_i P(x_i, C_k)$
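The two steps above can be sketched for a 1D Gaussian mixture as follows. This is a minimal illustration, not the course's implementation: the function name `em_1d`, the fixed iteration count, the variance floor, and the use of per-cluster mixing weights are assumptions added here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_1d(data, initial_means, n_iter=50, var_floor=1e-6):
    """EM for a 1D Gaussian mixture (illustrative sketch).

    initial_means plays the role of the initial cluster centers.
    Returns (means, variances, weights), one entry per cluster.
    """
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k          # assumed starting variance
    weights = [1.0 / k] * k        # assumed uniform mixing weights

    for _ in range(n_iter):
        # Expectation step: membership probability of each point in each cluster.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            resp.append([p / total for p in joint])

        # Maximization step: weighted re-estimation of the parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # empty cluster: keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)
            weights[j] = nj / len(data)

    return means, variances, weights
```

The mean update is the weighted average from the maximization step; the variance and weight updates are the usual companions for a Gaussian mixture.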
Another Way of K-means?
Pros: AutoClass can adapt to different (convex) shapes of clusters; k-means assumes spheres
Solid statistical foundation
Cons: computationally expensive
Model Based Subspace Clustering
Microarray
Bi-clustering
δ-clustering
p-clustering
OP-clustering
MicroArray Dataset
Gene Expression Matrix
$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{nm} \end{pmatrix}$
Rows: genes
Columns: conditions (time points, cancer tissues)
Data Mining: Clustering
K-means clustering minimizes
$\sum_{t=1}^{k} \sum_{x_i \in c_t} dist(x_i, c_t)^2$
where
$dist(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}$
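As a small illustration of the objective above, the following sketch evaluates the k-means cost for a given assignment. The function names and the toy data in the usage note are assumptions for illustration only.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between point x and center c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def kmeans_objective(points, labels, centers):
    """Sum over all points of the squared distance to their assigned center.

    labels[i] is the index t of the cluster c_t that points[i] belongs to.
    """
    return sum(squared_distance(x, centers[t]) for x, t in zip(points, labels))
```

For example, with points (0, 0), (2, 0), (10, 10), labels 0, 0, 1 and centers (1, 0), (10, 10), the cost is 1 + 1 + 0 = 2.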
Clustering by Pattern Similarity (p-Clustering)
The micro-array “raw” data shows 3 genes and their values in a multi-dimensional space
Parallel Coordinates Plots
Difficult to find their patterns
“non-traditional” clustering
Clusters Are Clear After Projection
Motivation
DNA microarray analysis
Gene    CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3   4392   284  4108   280   228
VPS8     401   281   120   275   298
EFB1     318   280    37   277   215
SSA1     401   292   109   580   238
FUN14   2857   285  2576   271   226
SP07     228   290    48   285   224
MDM10    538   272   266   277   236
CYS3     322   288    41   278   219
DEP1     312   272    40   273   232
NTG1     329   296    33   274   228
Motivation
[Figure: strength (0 to 450) versus condition (CH1I, CH1D, CH2B)]
10/04/2006Model-based Clustering
MotivationMotivation
Strong coherence exhibits by the selected objects on the selected attributes.
They are not necessarily close to each other but rather bear a constant shift.
Object/attribute bias
bi-cluster
Challenges
The set of objects and the set of attributes are usually unknown.
Different objects/attributes may possess different biases and such biases
may be local to the set of selected objects/attributes
are usually unknown in advance
May have many unspecified entries
Previous Work
Subspace clustering
Identifying a set of objects and a set of attributes such that the objects are physically close to each other on the subspace formed by the attributes.
Collaborative filtering: Pearson R
Only considers global offset of each object/attribute.
$R = \frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}$
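A short sketch of the Pearson R formula above, using two rows of the DNA microarray table from the Motivation slide. The function name is an assumption. On the three conditions CH1I, CH1D, CH2B the two genes differ by a constant shift, so R is 1; over all five conditions the offsets are averaged globally and the local coherence is diluted.

```python
import math


def pearson_r(o1, o2):
    """Pearson correlation between two objects over the same attributes."""
    n = len(o1)
    mean1 = sum(o1) / n
    mean2 = sum(o2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(o1, o2))
    var1 = sum((a - mean1) ** 2 for a in o1)
    var2 = sum((b - mean2) ** 2 for b in o2)
    return cov / math.sqrt(var1 * var2)


# VPS8 and EFB1 on the selected conditions CH1I, CH1D, CH2B (constant shift of 83).
r_selected = pearson_r([401, 120, 298], [318, 37, 215])

# The same two genes over all five conditions CH1I, CH1B, CH1D, CH2I, CH2B.
r_all = pearson_r([401, 281, 120, 275, 298], [318, 280, 37, 277, 215])
```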
bi-cluster Terms
Consists of a (sub)set of objects and a (sub)set of attributes
Corresponds to a submatrix
Occupancy threshold
Each object/attribute has to be filled by a certain percentage.
Volume: number of specified entries in the submatrix
Base: average value of each object/attribute (in the bi-cluster)
Biclustering of Expression Data, Cheng & Church, ISMB’00
bi-cluster
Gene         CH1I  CH1B  CH1D  CH2I  CH2B  Obj base
CTFC3
VPS8          401         120         298       273
EFB1          318          37         215       190
SSA1
FUN14
SP07
MDM10
CYS3          322          41         219       194
DEP1
NTG1
Attr base     347          66         244       219
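The bases in the table above can be recomputed with a few lines. This sketch assumes a fully specified submatrix (no missing entries); the function name `bases` is hypothetical.

```python
def bases(submatrix):
    """Return (object bases, attribute bases, cluster base) of a bi-cluster.

    submatrix is a list of rows, one row per object, fully specified.
    """
    n_rows = len(submatrix)
    n_cols = len(submatrix[0])
    object_base = [sum(row) / n_cols for row in submatrix]
    attribute_base = [sum(row[j] for row in submatrix) / n_rows for j in range(n_cols)]
    cluster_base = sum(sum(row) for row in submatrix) / (n_rows * n_cols)
    return object_base, attribute_base, cluster_base


# Rows VPS8, EFB1, CYS3 on columns CH1I, CH1D, CH2B from the table above.
obj_base, attr_base, all_base = bases([
    [401, 120, 298],
    [318, 37, 215],
    [322, 41, 219],
])
```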
Gene 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
YBL069W 139 69 0 69 139 139 139 139 69 0 0 69 110 0 69 0 0
YBL097W 0 69 69 69 110 110 110 110 69 0 69 69 110 110 69 0 69
YBR064W 139 110 0 69 69 110 139 139 139 0 69 69 139 69 69 0 0
YBR065C 139 110 0 69 110 110 110 139 110 0 69 110 139 69 69 0 69
YBR114W 208 179 110 69 110 110 110 161 161 0 69 69 110 0 69 0 69
YCL013W 0 0 0 69 69 139 161 179 139 0 69 0 110 0 69 0 69
YDR149C 0 0 0 0 110 110 110 69 110 0 0 0 69 0 69 0 69
YDR461W 179 161 69 69 69 110 69 110 110 0 0 69 0 0 69 0 69
YDR526C 69 110 69 110 110 161 110 69 139 69 69 110 110 139 110 69 110
YHR061C 69 0 69 69 110 139 110 0 0 0 69 69 110 69 69 0 0
YIL092W 139 161 110 110 139 179 139 110 139 69 69 69 110 110 110 69 69
YIR043C 179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
YJL010C 179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
YJL023C 161 161 69 110 139 161 139 110 161 69 110 139 69 69 110 69 69
YJL033W 208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
YJL076W 161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
YJR162C 139 161 139 161 139 179 161 139 69 69 139 69 69 179 179 110 69
YKL068W 304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
YKL134C 69 69 0 69 110 110 69 0 69 0 69 69 139 69 69 0 0
YLR219W 283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
YLR380W 337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
YLR381W 161 161 220 195 161 195 161 110 110 110 195 179 179 69 139 110 110
YLR382C 208 195 220 161 139 161 161 110 139 110 195 195 195 69 161 139 139
YLR383W 248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
YLR384C 264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
YLR386W 230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
YLR388W 439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
YLR392C 256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
YLR395C 374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
YLR400W 139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
YLR401C 230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
YLR406C 494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
YLR408C 326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
YLR411W 179 139 110 69 69 110 69 69 69 69 69 69 69 69 110 69 69
YLR413W 326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
YLR450W 161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
YLR451W 220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
YLR452C 220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
YLR453C 179 195 110 161 139 179 161 179 161 69 110 139 139 139 161 139 161
YLR454W 283 318 195 271 195 304 289 283 289 304 330 264 256 271 309 277 256
17 conditions, 40 genes
Motivation
[Figure: expression level (0 to 600) versus condition (0 to 16)]
Motivation
[Figure: expression level (0 to 600) versus conditions 3, 5, 9, 14, 15 for genes YBL069W, YBL097W, YBR064W, YBR065C, YBR114W, YCL013W, YDR149C, YDR461W, YDR526C, YHR061C, YIL092W, YIR043C, YJL010C, YJL023C, YJL033W, YJL076W, YJR162C, YKL068W, YKL134C, YLR219W]
Co-regulated genes
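bi-cluster
Perfect δ-cluster
$d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}$
$d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}$
$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
Imperfect δ-cluster
Residue:
$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\ 0, & d_{ij} \text{ is unspecified} \end{cases}$
where $d_{iJ}$ is the base of object i, $d_{Ij}$ is the base of attribute j, and $d_{IJ}$ is the base of the whole cluster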
bi-cluster
The smaller the average residue, the stronger the coherence.
Objective: identify δ-clusters with residue smaller than a given threshold
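A minimal sketch of the residue computation, assuming the residue of an entry is d_ij minus the object base minus the attribute base plus the cluster base, and that the cluster residue is the mean of the absolute entry residues over the specified entries. Unspecified entries are represented by `None`; the function names are hypothetical, and every row and column is assumed to hold at least one specified entry.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def average_residue(sub):
    """Mean absolute residue of a bi-cluster given as a list of rows.

    Entries equal to None are unspecified and contribute residue 0
    (they are skipped in the bases and in the average).
    """
    rows = range(len(sub))
    cols = range(len(sub[0]))
    specified = [(i, j) for i in rows for j in cols if sub[i][j] is not None]

    object_base = {i: mean(sub[i][j] for j in cols if sub[i][j] is not None) for i in rows}
    attribute_base = {j: mean(sub[i][j] for i in rows if sub[i][j] is not None) for j in cols}
    cluster_base = mean(sub[i][j] for i, j in specified)

    return mean(
        abs(sub[i][j] - object_base[i] - attribute_base[j] + cluster_base)
        for i, j in specified
    )


# VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B: a perfect cluster, residue 0.
perfect = average_residue([[401, 120, 298], [318, 37, 215], [322, 41, 219]])

# Perturbing one entry (219 changed to 228, a made-up value) breaks the coherence.
perturbed = average_residue([[401, 120, 298], [318, 37, 215], [322, 41, 228]])
```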
Cheng-Church Algorithm
Find one bi-cluster.
Replace the data in the first bi-cluster with random data
Find the second bi-cluster, and go on.
The quality of the bi-cluster degrades (smaller volume, higher residue) due to the insertion of random data.
The FLOC algorithm
Generating initial clusters
Determine the best action for each row and each column
Perform the best action of each row and column sequentially
Improved? Y: repeat the previous two steps; N: stop
Yang et al. delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02
The FLOC algorithm
Action: the change of membership of a row (or column) with respect to a cluster
[Figure: example data matrix with rows 1 to 3 and columns 1 to 4 (N=3, M=4)]
M+N actions are performed at each iteration
10/04/2006Model-based Clustering
The FLOC algorithmThe FLOC algorithm
Gain of an action: the residue reduction incurred by performing the action Order of action:
Fixed orderRandom orderWeighted random order
Complexity: O((M+N)MNkp)
The FLOC algorithm
Additional features
Maximum allowed overlap among clusters
Minimum coverage of clusters
Minimum volume of each cluster
Can be enforced by “temporarily blocking” certain actions during the mining process if such an action would violate some constraint.
Performance
Microarray data: 2884 genes, 17 conditions
100 bi-clusters with smallest residue were returned.
Average residue = 10.34
The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54
The average volume is 25% bigger
The response time is an order of magnitude faster
Concluding Remarks
The bi-cluster model is proposed to capture coherent objects in incomplete data sets.
base
residue
Many additional features can be accommodated (nearly for free).
p-Clustering: Clustering by Pattern Similarity
Given objects x, y in O and features a, b in T, the pScore is defined on the 2 by 2 matrix
$pScore\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix} = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$
A pair (O, T) is a δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0
For scaling patterns, taking the logarithm of
$d_{xa} / d_{ya} = d_{xb} / d_{yb}$
leads to the pScore form
H. Wang, et al., Clustering by pattern similarity in large data sets, SIGMOD’02.
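A small sketch of the pScore on a 2 by 2 matrix, with made-up values chosen to show a shifting pattern and a scaling pattern. The function name `p_score` is an assumption.

```python
import math


def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 by 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


# Shifting pattern: y = x + 10 on both features, so the pScore is 0.
shift = p_score(1.0, 4.0, 11.0, 14.0)

# Scaling pattern: y = 10 * x. The raw pScore is large,
# but it vanishes (up to rounding) after taking logarithms.
raw = p_score(1.0, 4.0, 10.0, 40.0)
logged = p_score(*(math.log(v) for v in (1.0, 4.0, 10.0, 40.0)))
```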
Coherent Cluster
Want to accommodate noise but not outliers
Coherent Cluster
Coherent cluster
Subspace clustering
Pair-wise disparity
For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}
$D\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix} = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$
where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b
[Figure: objects x, y, z plotted over attributes a and b]
Coherent Cluster
A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
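The definition can be checked directly by brute force over every 2×2 submatrix. This sketch is only a definition check, not the mining algorithm; the function names are hypothetical, and the disparity D is taken as the absolute difference of the two mutual biases.

```python
from itertools import combinations


def disparity(d_xa, d_xb, d_ya, d_yb):
    """Pair-wise disparity D of a 2x2 submatrix."""
    return abs((d_xa - d_ya) - (d_xb - d_yb))


def is_delta_coherent(matrix, delta):
    """True if every 2x2 submatrix of matrix has D <= delta."""
    n_cols = len(matrix[0])
    for x, y in combinations(matrix, 2):              # every pair of objects
        for a, b in combinations(range(n_cols), 2):   # every pair of attributes
            if disparity(x[a], x[b], y[a], y[b]) > delta:
                return False
    return True
```

The check costs on the order of (number of object pairs) × (number of attribute pairs) evaluations.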
Coherent Cluster
Challenges:
Finding subspace clusters based on distance alone is already a difficult task due to the curse of dimensionality.
The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
The actual values of the objects in a coherent cluster may be far apart from each other.
Each object or attribute in a coherent cluster may bear some relative bias (that is unknown in advance) and such bias may be local to the coherent cluster.
Coherent Cluster
Compute the maximum coherent attribute sets for each pair of objects
Construct the lexicographical tree
Post-order traverse the tree to find maximum coherent clusters
Two-way Pruning
Coherent Cluster
Observation: Given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$, taken over all attributes ai, do not differ from each other by more than δ.
[Figure: objects o1 and o2 plotted over attributes a1 to a5 (y-axis 1 to 7)]
Mutual biases of (o1, o2):
a1   a2   a3   a4   a5
3    2    3.5  2    2.5
The mutual biases span the range [2, 3.5].
If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
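The observation reduces the pairwise test to a range check on the mutual biases. In the sketch below the raw values of o1 and o2 are made up so that their mutual biases are 3, 2, 3.5, 2, 2.5 as on the slide; the function names are hypothetical.

```python
def mutual_biases(o1, o2):
    """Mutual bias of each attribute for the object pair (o1, o2)."""
    return [v1 - v2 for v1, v2 in zip(o1, o2)]


def is_coherent_attribute_set(o1, o2, delta):
    """True if the attributes form a coherent attribute set for (o1, o2).

    Equivalent to checking every 2x2 submatrix: the largest disparity
    is the spread (max minus min) of the mutual biases.
    """
    biases = mutual_biases(o1, o2)
    return max(biases) - min(biases) <= delta


# Hypothetical raw values over a1..a5 that produce the slide's mutual biases.
o1 = [5, 4, 6.5, 3, 4.5]
o2 = [2, 2, 3, 1, 2]
```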
Coherent Cluster
Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj) where 1 ≤ i, j ≤ l.
[Figure: objects o1 to o6 over attributes a1 to a7]
Coherent Cluster
Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
[Figure: objects r1 and r2 plotted over attributes a1 to a5 (y-axis 1 to 7)]
Mutual biases of (r1, r2), with δ = 1:
a1   a2   a3   a4   a5
3    2    3.5  2    2.5
Sorted by mutual bias: a2 (2), a4 (2), a5 (2.5), a1 (3), a3 (3.5)
The maximum coherent attribute sets define the search space for maximum coherent clusters.
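One way to obtain the maximum coherent attribute sets of an object pair is to sort the attributes by mutual bias and slide a window of width δ over them. This is a sketch under that assumption; the function name and the `min_size` parameter are hypothetical.

```python
def maximum_coherent_attribute_sets(bias, delta, min_size=2):
    """Maximal groups of attributes whose mutual biases differ by at most delta.

    bias maps attribute name -> mutual bias of one object pair.
    """
    items = sorted(bias.items(), key=lambda kv: kv[1])
    result = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge reached by any earlier window
    for start in range(len(items)):
        end = max(end, start)
        # Extend the window while the bias spread stays within delta.
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        # The window is maximal only if it reaches past all earlier windows.
        if end > prev_end and end - start >= min_size:
            result.append([name for name, _ in items[start:end]])
        prev_end = max(prev_end, end)
    return result


# Mutual biases from the slide, with delta = 1.
sets = maximum_coherent_attribute_sets(
    {"a1": 3, "a2": 2, "a3": 3.5, "a4": 2, "a5": 2.5}, delta=1
)
```

With these values the windows are {a2, a4, a5, a1} (biases 2 to 3) and {a5, a1, a3} (biases 2.5 to 3.5).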
Two Way Pruning
      a0   a1   a2
o0     1    4    2
o1     2    5    5
o2     3    6    5
o3     4  200    7
o4   300    7    6
delta = 1, nc = 3, nr = 3
MCAS (maximum coherent attribute sets of object pairs):
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)
MCOS (maximum coherent object sets of attribute pairs):
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
Coherent Cluster
Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
Implementation: use a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree.
[Figure: data matrix with objects as rows and attributes as columns]
Assume δ = 1
      a0   a1   a2   a3
o0     1    4    2    5
o1     2    5    5    8
o2     3    6    5    7
o3     4   20    7    2
o4    30    7    6    6
Coherent attribute sets of each object pair:
(o0,o1) : {a0,a1}, {a2,a3}
(o0,o2) : {a0,a1,a2,a3}
(o0,o4) : {a1,a2}
(o1,o2) : {a0,a1,a2}, {a2,a3}
(o1,o3) : {a0,a2}
(o1,o4) : {a1,a2}
(o2,o3) : {a0,a2}
(o2,o4) : {a1,a2}
Object pairs grouped by attribute set:
{a0,a1} : (o0,o1)
{a0,a2} : (o1,o3),(o2,o3)
{a1,a2} : (o0,o4),(o1,o4),(o2,o4)
{a2,a3} : (o0,o1),(o1,o2)
{a0,a1,a2} : (o1,o2)
{a0,a1,a2,a3} : (o0,o2)
[Figure: lexicographical tree over attributes a0 to a3, with each object pair attached to the node of its attribute set]
Coherent Cluster
High expressive power
The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
Efficient and highly scalable
Wide applications
Gene expression analysis
Collaborative filtering
[Figure: average response time (sec, 0 to 12000) versus number of conditions (10 to 500), for subspace cluster and coherent cluster]
Remark
Compared to bi-cluster
Can well separate noise and outliers
No random data insertion and replacement
Produces the optimal solution
Definition of OP-Cluster
Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships holds for any pair of conditions $j_1, j_2 \in J$:
(1) $\forall i \in I,\ D_{ij_1} < D_{ij_2}$
(2) $\forall i \in I,\ D_{ij_1} > D_{ij_2}$
(3) $\forall i \in I,\ D_{ij_1} \approx D_{ij_2}$, that is, when $|D_{ij_1} - D_{ij_2}|$ is small relative to $\min(|D_{ij_1}|, |D_{ij_2}|)$
[Figure: expression levels over conditions A1 to A4]
Problem Statement
Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimal size thresholds nc and nr.
Conversion to Sequence Mining Problem
(1) $D_{ij_1} < D_{ij_2} \Rightarrow j_1 \prec j_2$
(2) $D_{ij_1} > D_{ij_2} \Rightarrow j_2 \prec j_1$
(3) $D_{ij_1} \approx D_{ij_2} \Rightarrow CanonicalOrder(j_1, j_2)$
[Figure: expression levels over conditions A1 to A4]
Sequence: A1 A4 A3 A2
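A sketch of the conversion for one gene: conditions are ordered by increasing expression level, and ties fall back to a canonical (here alphabetical) order. Only exact ties are handled; grouping of merely similar values is left out. The function name and the sample values are assumptions for illustration.

```python
def to_sequence(expression):
    """Order the conditions of one gene by increasing expression level.

    expression maps condition label -> expression level.
    Equal levels are placed in canonical (alphabetical) label order.
    """
    ordered = sorted(expression.items(), key=lambda kv: (kv[1], kv[0]))
    return [label for label, _ in ordered]


# Hypothetical expression levels for one gene over four conditions.
sequence = to_sequence({"A1": 1.0, "A2": 4.0, "A3": 3.0, "A4": 2.0})
```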
Mining OP-Clusters: A naïve approach
A naïve approach
Enumerate all possible subsequences in a prefix tree.
For each subsequence, collect all genes that contain the subsequence.
Challenge: the total number of distinct subsequences is
$\sum_{i=1}^{N} \binom{m}{i} \, i!$
[Figure: a complete prefix tree with 4 items {a, b, c, d}]
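The naïve approach can be sketched by enumerating every ordered subsequence of the conditions and collecting the genes that contain it. The enumeration is factorial in the number of conditions, which is exactly the challenge noted above, so this is for illustration on tiny inputs only. The function names are hypothetical, and nc and nr are read as the minimum number of conditions and genes.

```python
from itertools import permutations


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def naive_op_clusters(sequences, nc, nr):
    """Enumerate all condition orderings and keep the well-supported ones.

    sequences maps gene id -> string of condition labels in order.
    Returns a dict mapping each pattern with at least nc conditions
    to the set of at least nr genes that contain it.
    """
    items = sorted(set().union(*sequences.values()))
    clusters = {}
    for length in range(nc, len(items) + 1):
        for pattern in permutations(items, length):
            genes = {g for g, seq in sequences.items() if is_subsequence(pattern, seq)}
            if len(genes) >= nr:
                clusters["".join(pattern)] = genes
    return clusters


# Three short gene sequences over conditions a, b, c, d (illustrative input).
found = naive_op_clusters({"g1": "adbc", "g2": "abdc", "g3": "badc"}, nc=3, nr=3)
```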
Mining OP-Clusters: Prefix Tree
Goal:
Build a compact prefix tree that includes only the subsequences that occur in the original database.
Strategies:
1. Depth-First Traversal
2. Suffix concatenation: Visit subsequences that only exist in the input sequences.
3. Apriori Property: Visit subsequences that are sufficiently supported in order to derive longer subsequences.
Example input sequences:
g1 adbc
g2 abdc
g3 badc
[Figure: prefix tree built from g1, g2, g3; each node is labeled item:gene ids, e.g. a:1,2 and d:1,2,3]
References
J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.