EECS 800 Research Seminar: Mining Biological Data

The University of Kansas. Instructor: Luke Huan. Fall 2006.


Mining Biological Data, KU EECS 800, Luke Huan, Fall’06. 10/04/2006, Model-based Clustering.

Model-Based Clustering

What is model-based clustering?

Attempt to optimize the fit between the given data and some mathematical model

Based on the assumption that the data are generated by a mixture of underlying probability distributions

Typical methods

Statistical approach: EM (Expectation Maximization), AutoClass

Machine learning approach: COBWEB, CLASSIT

Neural network approach: SOM (Self-Organizing Feature Map)

EM (Expectation Maximization)

EM: a popular iterative refinement algorithm

An extension to k-means

Assign each object to a cluster according to a weight (probability distribution)

New means are computed based on weighted measures

General idea

Start with an initial estimate of the parameter vector

Iteratively rescore the patterns against the mixture density produced by the parameter vector

Use the rescored patterns to update the parameter estimates

Patterns belong to the same cluster if their scores place them in the same component

The algorithm converges quickly but may stop at a local optimum rather than the global one

AutoClass (Cheeseman and Stutz, 1996)

1D Gaussian Mixture Model

Given a set of data distributed in a 1D space, how do we cluster the data set?

General idea: factorize the p.d.f. into a mixture of simple models.

Discrete values: Bernoulli distribution

Continuous values: Gaussian distribution

The EM (Expectation Maximization) Algorithm

Initially, randomly assign k cluster centers

Iteratively refine the clusters based on two steps

Expectation step: assign each data point $x_i$ to cluster $C_k$ with the following probability

$w(x_i, C_k) = P(C_k \mid x_i) = \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}$

Maximization step: estimation of model parameters

$\mu_k = \frac{\sum_i x_i \, w(x_i, C_k)}{\sum_i w(x_i, C_k)}$
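The two steps can be sketched for a 1D Gaussian mixture as follows. This is a minimal illustration, not the course's implementation: the function names, the unit starting variance, the uniform starting priors, the variance floor, and the optional `init_means` argument are all assumptions made for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_1d(data, k, init_means=None, iterations=50, seed=0):
    """Fit a k-component 1D Gaussian mixture with EM (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initial centers unless the caller supplies them.
    means = list(init_means) if init_means is not None else rng.sample(data, k)
    variances = [1.0] * k      # assumed starting variance
    priors = [1.0 / k] * k     # assumed uniform starting priors
    for _ in range(iterations):
        # Expectation step: weights[i][c] is the probability that x_i belongs to cluster c.
        weights = []
        for x in data:
            joint = [priors[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            if total == 0.0:   # point is numerically far from every component
                weights.append([1.0 / k] * k)
            else:
                weights.append([j / total for j in joint])
        # Maximization step: re-estimate each component from the weighted points.
        for c in range(k):
            mass = sum(w[c] for w in weights)
            if mass < 1e-12:
                continue       # empty component: keep its previous parameters
            means[c] = sum(w[c] * x for w, x in zip(weights, data)) / mass
            var = sum(w[c] * (x - means[c]) ** 2 for w, x in zip(weights, data)) / mass
            variances[c] = max(var, 1e-6)   # floor avoids a zero variance
            priors[c] = mass / len(data)
    return means, variances, priors


# Two well-separated groups of made-up 1D points.
data = [-0.2, 0.1, 0.0, 0.3, 9.8, 10.1, 10.0, 10.2]
means, variances, priors = em_1d(data, k=2, init_means=[1.0, 8.0])
```

With a poor random start the same loop can settle in a local optimum, which is why the example fixes the starting means.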

Another Way of K-means?

Pros:

AutoClass can adapt to different (convex) cluster shapes, whereas k-means assumes spherical clusters

Solid statistical foundation

Cons:

Computationally expensive

Model Based Subspace Clustering

Microarray

Bi-clustering

δ-clustering

p-clustering

OP-clustering

MicroArray Dataset

Gene Expression Matrix

$\begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{nm} \end{pmatrix}$

Rows: genes. Columns: conditions (time points, cancer tissues).

Data Mining: Clustering

K-means clustering minimizes

$\sum_{t=1}^{k} \sum_{i \in c_t} dist(x_i, c_t)^2$

where

$dist(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}$
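As a small illustration of the quantity k-means minimizes, the sketch below sums the squared Euclidean distance from each point to the center of its assigned cluster. The function names and the toy points are assumptions for the example.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between point x and center c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def kmeans_objective(points, assignment, centers):
    """Sum of squared distances from every point to its assigned center."""
    return sum(squared_distance(p, centers[t]) for p, t in zip(points, assignment))


# Made-up 2D points: the first two belong to cluster 0, the last to cluster 1.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
centers = [(1.0, 0.0), (10.0, 10.0)]
objective = kmeans_objective(points, [0, 0, 1], centers)
```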

Clustering by Pattern Similarity (p-Clustering)

The micro-array “raw” data shows 3 genes and their values in a multi-dimensional space

Parallel coordinates plots

Difficult to find their patterns

“Non-traditional” clustering

Clusters Are Clear After Projection

Motivation

DNA microarray analysis

Gene   CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3  4392   284  4108   280   228
VPS8    401   281   120   275   298
EFB1    318   280    37   277   215
SSA1    401   292   109   580   238
FUN14  2857   285  2576   271   226
SP07    228   290    48   285   224
MDM10   538   272   266   277   236
CYS3    322   288    41   278   219
DEP1    312   272    40   273   232
NTG1    329   296    33   274   228

Motivation

[Figure: strength plotted against condition (CH1I, CH1D, CH2B)]

Motivation

The selected objects exhibit strong coherence on the selected attributes.

They are not necessarily close to each other but rather bear a constant shift.

Object/attribute bias

bi-cluster
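The constant shift can be checked directly on the microarray table from the deck. The sketch below takes the rows VPS8, EFB1, and CYS3 on the columns CH1I, CH1D, and CH2B and computes the element-wise difference between two genes; the helper names are assumptions for the example.

```python
# Values copied from the DNA microarray table, columns CH1I, CH1D, CH2B.
rows = {
    "VPS8": [401, 120, 298],
    "EFB1": [318, 37, 215],
    "CYS3": [322, 41, 219],
}


def shifts(a, b):
    """Element-wise difference between two expression profiles."""
    return [x - y for x, y in zip(a, b)]


def has_constant_shift(a, b):
    """True when the two profiles differ by the same offset on every attribute."""
    return len(set(shifts(a, b))) == 1


vps8_vs_efb1 = shifts(rows["VPS8"], rows["EFB1"])
efb1_vs_cys3 = shifts(rows["EFB1"], rows["CYS3"])
```

The three genes are far apart in absolute value, yet each pair differs by a single offset on these three conditions.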

Challenges

The set of objects and the set of attributes are usually unknown.

Different objects/attributes may possess different biases and such biases

may be local to the set of selected objects/attributes

are usually unknown in advance

May have many unspecified entries

Previous Work

Subspace clustering

Identifying a set of objects and a set of attributes such that the objects are physically close to each other on the subspace formed by the attributes.

Collaborative filtering: Pearson R

Only considers the global offset of each object/attribute.

$\frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}$
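To see what a global measure misses, the sketch below computes Pearson R for VPS8 and EFB1 from the microarray table, once over all five conditions and once over the three conditions CH1I, CH1D, and CH2B only. The function name and the choice of these two genes are assumptions made for the illustration.

```python
import math


def pearson_r(o1, o2):
    """Pearson correlation of two equally long value lists."""
    m1 = sum(o1) / len(o1)
    m2 = sum(o2) / len(o2)
    numerator = sum((a - m1) * (b - m2) for a, b in zip(o1, o2))
    denominator = math.sqrt(
        sum((a - m1) ** 2 for a in o1) * sum((b - m2) ** 2 for b in o2)
    )
    return numerator / denominator


# Rows from the microarray table, columns CH1I, CH1B, CH1D, CH2I, CH2B.
vps8 = [401, 281, 120, 275, 298]
efb1 = [318, 280, 37, 277, 215]
selected = [0, 2, 4]  # indices of CH1I, CH1D, CH2B

r_all = pearson_r(vps8, efb1)
r_selected = pearson_r([vps8[i] for i in selected], [efb1[i] for i in selected])
```

On the selected conditions the correlation is perfect, while the value over all conditions is lower because the offset between the two genes is local to the selected attributes.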

bi-cluster Terms

Consists of a (sub)set of objects and a (sub)set of attributes

Corresponds to a submatrix

Occupancy threshold

Each object/attribute has to be filled by a certain percentage.

Volume: number of specified entries in the submatrix

Base: average value of each object/attribute (in the bi-cluster)

Biclustering of Expression Data, Cheng & Church, ISMB’00

bi-cluster

Gene        CH1I  CH1B  CH1D  CH2I  CH2B  Obj base
CTFC3
VPS8         401         120         298       273
EFB1         318          37         215       190
SSA1
FUN14
SP07
MDM10
CYS3         322          41         219       194
DEP1
NTG1
Attr base    347          66         244       219
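The bases in the table can be recomputed as plain averages over the bi-cluster. The sketch below does this for the three genes and three conditions that carry values; the variable names are assumptions for the example.

```python
from statistics import mean

# Entries of the bi-cluster: genes VPS8, EFB1, CYS3 on conditions CH1I, CH1D, CH2B.
bicluster = {
    "VPS8": [401, 120, 298],
    "EFB1": [318, 37, 215],
    "CYS3": [322, 41, 219],
}

# Object base: average of each row inside the bi-cluster.
object_base = {gene: mean(values) for gene, values in bicluster.items()}

# Attribute base: average of each column inside the bi-cluster.
attribute_base = [mean(column) for column in zip(*bicluster.values())]

# Base of the whole bi-cluster: average over all of its entries.
cluster_base = mean(v for values in bicluster.values() for v in values)
```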

Gene       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
YBL069W  139  69   0  69 139 139 139 139  69   0   0  69 110   0  69   0   0
YBL097W    0  69  69  69 110 110 110 110  69   0  69  69 110 110  69   0  69
YBR064W  139 110   0  69  69 110 139 139 139   0  69  69 139  69  69   0   0
YBR065C  139 110   0  69 110 110 110 139 110   0  69 110 139  69  69   0  69
YBR114W  208 179 110  69 110 110 110 161 161   0  69  69 110   0  69   0  69
YCL013W    0   0   0  69  69 139 161 179 139   0  69   0 110   0  69   0  69
YDR149C    0   0   0   0 110 110 110  69 110   0   0   0  69   0  69   0  69
YDR461W  179 161  69  69  69 110  69 110 110   0   0  69   0   0  69   0  69
YDR526C   69 110  69 110 110 161 110  69 139  69  69 110 110 139 110  69 110
YHR061C   69   0  69  69 110 139 110   0   0   0  69  69 110  69  69   0   0
YIL092W  139 161 110 110 139 179 139 110 139  69  69  69 110 110 110  69  69
YIR043C  179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
YJL010C  179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
YJL023C  161 161  69 110 139 161 139 110 161  69 110 139  69  69 110  69  69
YJL033W  208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
YJL076W  161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
YJR162C  139 161 139 161 139 179 161 139  69  69 139  69  69 179 179 110  69
YKL068W  304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
YKL134C   69  69   0  69 110 110  69   0  69   0  69  69 139  69  69   0   0
YLR219W  283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
YLR380W  337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
YLR381W  161 161 220 195 161 195 161 110 110 110 195 179 179  69 139 110 110
YLR382C  208 195 220 161 139 161 161 110 139 110 195 195 195  69 161 139 139
YLR383W  248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
YLR384C  264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
YLR386W  230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
YLR388W  439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
YLR392C  256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
YLR395C  374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
YLR400W  139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
YLR401C  230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
YLR406C  494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
YLR408C  326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
YLR411W  179 139 110  69  69 110  69  69  69  69  69  69  69  69 110  69  69
YLR413W  326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
YLR450W  161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
YLR451W  220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
YLR452C  220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
YLR453C  179 195 110 161 139 179 161 179 161  69 110 139 139 139 161 139 161
YLR454W  283 318 195 271 195 304 289 283 289 304 330 264 256 271 309 277 256

17 conditions, 40 genes

Motivation

[Figure: expression level plotted against condition (0 to 16)]


Motivation

[Figure: expression level plotted against conditions 3, 5, 9, 14, and 15 for genes YBL069W, YBL097W, YBR064W, YBR065C, YBR114W, YCL013W, YDR149C, YDR461W, YDR526C, YHR061C, YIL092W, YIR043C, YJL010C, YJL023C, YJL033W, YJL076W, YJR162C, YKL068W, YKL134C, YLR219W]

Co-regulated genes

bi-cluster

Perfect δ-cluster

$d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}$

$d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}$

$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$

Imperfect δ-cluster

Residue:

$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\ 0, & d_{ij} \text{ is unspecified} \end{cases}$

Here $d_{iJ}$ is the base of object $i$, $d_{Ij}$ is the base of attribute $j$, and $d_{IJ}$ is the base of the bi-cluster.

bi-cluster

The smaller the average residue, the stronger the coherence.

Objective: identify δ-clusters with residue smaller than a given threshold
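A short sketch of the residue computation, assuming every entry of the submatrix is specified and taking the average of the absolute residues as the cluster residue (Cheng and Church average squared residues instead). The residue of an entry is its value minus the object base, minus the attribute base, plus the cluster base. Function names are assumptions for the example.

```python
def residue_matrix(matrix):
    """Residue of every entry: value - object base - attribute base + cluster base."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_base = [sum(row) / n_cols for row in matrix]
    col_base = [sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    base = sum(row_base) / n_rows
    return [
        [matrix[i][j] - row_base[i] - col_base[j] + base for j in range(n_cols)]
        for i in range(n_rows)
    ]


def average_residue(matrix):
    """Mean absolute residue over all entries (assumes no unspecified entries)."""
    residues = residue_matrix(matrix)
    return sum(abs(r) for row in residues for r in row) / (len(matrix) * len(matrix[0]))


# VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B: every residue is zero.
perfect = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
# The same submatrix with one entry changed no longer has zero residue.
perturbed = [[410, 120, 298], [318, 37, 215], [322, 41, 219]]
```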

Cheng-Church Algorithm

Find one bi-cluster.

Replace the data in the first bi-cluster with random data

Find the second bi-cluster, and go on.

The quality of the bi-cluster degrades (smaller volume, higher residue) due to the insertion of random data.

The FLOC algorithm

Generate initial clusters

Determine the best action for each row and each column

Perform the best action of each row and column sequentially

Improved? If yes, return to determining the best actions; if no, stop

Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02

The FLOC algorithm

Action: the change of membership of a row (or column) with respect to a cluster

[Figure: example matrix with N = 3 rows and M = 4 columns]

M+N actions are performed at each iteration

The FLOC algorithm

Gain of an action: the residue reduction incurred by performing the action

Order of action:

Fixed order

Random order

Weighted random order

Complexity: O((M+N)MNkp)
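The gain of a single row action can be sketched as the residue before the membership change minus the residue after it. This is a simplified illustration, not the FLOC implementation: it assumes a fully specified matrix, uses the mean absolute residue, and handles only row actions; all names are hypothetical.

```python
def mean_abs_residue(data, rows, cols):
    """Mean absolute residue of the submatrix selected by rows and cols."""
    rows, cols = sorted(rows), sorted(cols)
    if not rows or not cols:
        return 0.0
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(
        abs(data[i][j] - row_base[i] - col_base[j] + base) for i in rows for j in cols
    )
    return total / (len(rows) * len(cols))


def row_action_gain(data, rows, cols, row):
    """Residue reduction obtained by toggling the membership of one row."""
    toggled = set(rows) ^ {row}  # add the row if absent, remove it if present
    return mean_abs_residue(data, rows, cols) - mean_abs_residue(data, toggled, cols)


# VPS8, EFB1, CYS3, SSA1 on CH1I, CH1D, CH2B from the microarray table.
data = [[401, 120, 298], [318, 37, 215], [322, 41, 219], [401, 109, 238]]
cols = {0, 1, 2}
gain_of_adding_ssa1 = row_action_gain(data, {0, 1, 2}, cols, 3)
gain_of_removing_ssa1 = row_action_gain(data, {0, 1, 2, 3}, cols, 3)
```

Adding SSA1 to the coherent three-gene cluster has a negative gain, and removing it from the four-gene cluster has the matching positive gain.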

The FLOC algorithm

Additional features

Maximum allowed overlap among clusters

Minimum coverage of clusters

Minimum volume of each cluster

These can be enforced by “temporarily blocking” an action during the mining process if the action would violate a constraint.

Performance

Microarray data: 2884 genes, 17 conditions

The 100 bi-clusters with the smallest residue were returned.

Average residue = 10.34

The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54

The average volume is 25% bigger

The response time is an order of magnitude faster

Concluding Remarks

The bi-cluster model is proposed to capture coherent objects in an incomplete data set.

base

residue

Many additional features can be accommodated (nearly for free).

p-Clustering: Clustering by Pattern Similarity

Given objects x, y in O and features a, b in T, the pScore of the 2 by 2 matrix is

$pScore\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix} = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$

A pair (O, T) is a δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0

For scaling patterns, one can observe that taking the logarithm of

$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$

leads to the pScore form

H. Wang, et al., Clustering by pattern similarity in large data sets, SIGMOD’02.
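A brute-force sketch of the δ-pCluster test, which evaluates the pScore of every 2 by 2 submatrix. It is only meant to illustrate the definition, since real miners avoid this exhaustive check; the function names and the scaling example values are assumptions.

```python
import math
from itertools import combinations


def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 by 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(matrix, delta):
    """True when every 2 by 2 submatrix has a pScore of at most delta."""
    n_cols = len(matrix[0])
    for x, y in combinations(matrix, 2):
        for a, b in combinations(range(n_cols), 2):
            if p_score(x[a], x[b], y[a], y[b]) > delta:
                return False
    return True


# Shifting pattern: VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B.
shifted = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
# Scaling pattern (made-up values): the second row is three times the first.
scaled = [[1.0, 2.0, 4.0], [3.0, 6.0, 12.0]]
logged = [[math.log(v) for v in row] for row in scaled]
```

The scaled rows fail the test on raw values but pass after the logarithm, which turns the scaling factor into a shift.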

Coherent Cluster

Want to accommodate noise but not outliers

Coherent Cluster

Coherent cluster

Subspace clustering

Pair-wise disparity

For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}

$D\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix} = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$

where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b

Coherent Cluster

A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.

An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.

A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.

Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.

Coherent Cluster

Challenges:

Finding subspace clusters based on distance is already a difficult task due to the curse of dimensionality.

The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.

The actual values of the objects in a coherent cluster may be far apart from each other.

Each object or attribute in a coherent cluster may bear some relative bias (which is unknown in advance), and such bias may be local to the coherent cluster.

Coherent Cluster

Compute the maximum coherent attribute sets for each pair of objects

Construct the lexicographical tree

Post-order traverse the tree to find maximum coherent clusters

Two-way Pruning

Coherent Cluster

Observation: Given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $d_{o_1 a_i} - d_{o_2 a_i}$ of the attributes $a_i$ do not differ from each other by more than δ.

Attribute                a1   a2   a3   a4   a5
Mutual bias of (o1,o2)   3    2    3.5  2    2.5

The mutual biases lie in the range [2, 3.5].

If δ = 1.5, then {a1,a2,a3,a4,a5} is a coherent attribute set (CAS) of (o1,o2).

Coherent Cluster

Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj) where 1 ≤ i, j ≤ l.

[Figure: objects o1 to o6 plotted over attributes a1 to a7]

Coherent Cluster

Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.

Mutual biases of (r1, r2), with δ = 1:

Attribute     a1   a2   a3   a4   a5
Mutual bias   3    2    3.5  2    2.5

The maximum coherent attribute sets define the search space for maximum coherent clusters.
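One way to find the maximum coherent attribute sets of a single object pair is to sort the attributes by mutual bias and slide a window whose value spread stays within δ. The sketch below is an illustration under that assumption rather than the authors' code; the function name is hypothetical, and the mutual biases are the five values shown for attributes a1 to a5.

```python
def maximal_coherent_attribute_sets(biases, delta):
    """Maximal sets of attributes whose mutual biases differ by at most delta."""
    items = sorted(biases.items(), key=lambda kv: kv[1])  # attributes ordered by bias
    result = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge of the last window that was reported
    for start in range(len(items)):
        end = max(end, start + 1)
        # Extend the window while the spread stays within delta.
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        # A window is maximal only if it reaches further right than earlier ones.
        if end > prev_end:
            result.append(sorted(name for name, _ in items[start:end]))
            prev_end = end
    return result


mutual_bias = {"a1": 3, "a2": 2, "a3": 3.5, "a4": 2, "a5": 2.5}
sets_for_delta_1 = maximal_coherent_attribute_sets(mutual_bias, 1)
sets_for_delta_1_5 = maximal_coherent_attribute_sets(mutual_bias, 1.5)
```

With δ = 1.5 the single window covers all five attributes, matching the coherent attribute set stated for that threshold; with δ = 1 the attributes split into two overlapping sets. An isolated attribute is reported as a set of size one, which a miner would discard using a minimum size threshold.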

Two Way Pruning

      a0   a1   a2
o0     1    4    2
o1     2    5    5
o2     3    6    5
o3     4  200    7
o4   300    7    6

delta = 1, nc = 3, nr = 3

MCAS:

(o0,o2) → (a0,a1,a2)

(o1,o2) → (a0,a1,a2)

MCOS:

(a0,a1) → (o0,o1,o2)

(a0,a2) → (o1,o2,o3)

(a1,a2) → (o1,o2,o4)

(a1,a2) → (o0,o2,o4)

Coherent Cluster

Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).

Implementation: use a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree.

      a0   a1   a2   a3
o0     1    4    2    5
o1     2    5    5    8
o2     3    6    5    7
o3     4   20    7    2
o4    30    7    6    6

Assume δ = 1.

Maximum coherent attribute sets of each object pair:

(o0,o1) : {a0,a1}, {a2,a3}
(o0,o2) : {a0,a1,a2,a3}
(o0,o4) : {a1,a2}
(o1,o2) : {a0,a1,a2}, {a2,a3}
(o1,o3) : {a0,a2}
(o1,o4) : {a1,a2}
(o2,o3) : {a0,a2}
(o2,o4) : {a1,a2}

Object pairs grouped by attribute set:

{a0,a1} : (o0,o1)
{a0,a2} : (o1,o3), (o2,o3)
{a1,a2} : (o0,o4), (o1,o4), (o2,o4)
{a2,a3} : (o0,o1), (o1,o2)
{a0,a1,a2} : (o1,o2)
{a0,a1,a2,a3} : (o0,o2)

[Figure: lexicographical tree over attributes a0 to a3, with the object pairs attached to the nodes of their attribute sets]
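The pairwise observation gives a direct, if slow, test for whether a chosen set of objects and attributes forms a δ-coherent cluster: every pair of objects must have mutual biases that spread by at most δ over the chosen attributes. The sketch below applies it to the 5 by 4 example table; the function name is hypothetical.

```python
from itertools import combinations

# Example table: rows o0..o4, columns a0..a3.
data = {
    "o0": {"a0": 1, "a1": 4, "a2": 2, "a3": 5},
    "o1": {"a0": 2, "a1": 5, "a2": 5, "a3": 8},
    "o2": {"a0": 3, "a1": 6, "a2": 5, "a3": 7},
    "o3": {"a0": 4, "a1": 20, "a2": 7, "a3": 2},
    "o4": {"a0": 30, "a1": 7, "a2": 6, "a3": 6},
}


def is_coherent_cluster(table, objects, attributes, delta):
    """True when the attributes are a coherent attribute set for every object pair."""
    for x, y in combinations(objects, 2):
        biases = [table[x][a] - table[y][a] for a in attributes]
        if max(biases) - min(biases) > delta:
            return False
    return True


on_a1_a2 = is_coherent_cluster(data, ["o0", "o2", "o4"], ["a1", "a2"], 1)
with_o1 = is_coherent_cluster(data, ["o0", "o1", "o2", "o4"], ["a1", "a2"], 1)
all_attrs = is_coherent_cluster(data, ["o0", "o2"], ["a0", "a1", "a2", "a3"], 1)
```

Objects o0, o2, and o4 are coherent on {a1,a2}, but adding o1 breaks coherence because the pair (o0,o1) does not list {a1,a2} among its attribute sets.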

Coherent Cluster

High expressive power

The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.

Efficient and highly scalable

Wide applications

Gene expression analysis

Collaborative filtering

[Figure: average response time (sec) versus number of conditions (10 to 500) for subspace cluster and coherent cluster]

Remark

Compared with bi-cluster:

Can separate noise and outliers well

No random data insertion and replacement

Produces the optimal solution

Definition of OP-Cluster

Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships holds for any pair of conditions $j_1, j_2 \in J$:

(1) $\forall i \in I,\ D_{ij_1} < D_{ij_2}$

(2) $\forall i \in I,\ D_{ij_1} > D_{ij_2}$

(3) $\forall i \in I,\ D_{ij_1} \approx D_{ij_2}$

Relationship (3) holds when $|D_{ij_1} - D_{ij_2}|$ is small relative to $\min(|D_{ij_1}|, |D_{ij_2}|)$.

[Figure: expression levels over conditions A1 to A4]

Problem Statement

Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimum size thresholds nc and nr.

Conversion to Sequence Mining Problem

(1) $D_{ij_1} < D_{ij_2} \Rightarrow j_1 \prec j_2$

(2) $D_{ij_1} > D_{ij_2} \Rightarrow j_1 \succ j_2$

(3) $D_{ij_1} \approx D_{ij_2} \Rightarrow CanonicalOrder(j_1, j_2)$

[Figure: expression levels over conditions A1 to A4]

Sequence: A1 A4 A3 A2
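The conversion can be sketched as sorting the condition labels of one gene by expression level. The sketch is a simplification: it falls back to label order only for exactly equal values, whereas the definition groups values that are merely similar. The function name and the expression levels are hypothetical.

```python
def to_sequence(levels):
    """Order condition labels by increasing expression level; ties follow label order."""
    ordered = sorted(levels.items(), key=lambda kv: (kv[1], kv[0]))
    return [label for label, _ in ordered]


# Hypothetical expression levels of one gene under four conditions.
gene = {"A1": 1.0, "A2": 9.0, "A3": 6.0, "A4": 3.0}
sequence = to_sequence(gene)

# Equal values fall back to the canonical (label) order.
tied = to_sequence({"A2": 5.0, "A1": 5.0, "A3": 1.0})
```

Once every gene is rewritten this way, genes that share a subsequence of conditions follow the same ordering on those conditions.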

Mining OP-Clusters: A naïve approach

A naïve approach

Enumerate all possible subsequences in a prefix tree.

For each subsequence, collect all genes that contain the subsequence.

Challenge: the total number of distinct subsequences is

$\sum_{i=1}^{N} \binom{m}{i}\, i!$

[Figure: a complete prefix tree with 4 items {a, b, c, d}]

Mining OP-Clusters: Prefix Tree

Goal:

Build a compact prefix tree that includes only the subsequences occurring in the original database.

Strategies:

1. Depth-first traversal

2. Suffix concatenation: visit only subsequences that exist in the input sequences.

3. Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.

Input sequences:

g1  adbc
g2  abdc
g3  badc

[Figure: prefix tree built from the three sequences, with each node labeled by an item and the genes that support it]
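For comparison with the compact prefix tree, the naïve approach can be sketched on the three sequences g1, g2, and g3: enumerate every ordered selection of items of a given length and collect the genes whose sequence contains it. The function names and the support threshold in the example are assumptions.

```python
from itertools import permutations

genes = {"g1": "adbc", "g2": "abdc", "g3": "badc"}


def is_subsequence(pattern, sequence):
    """True when the items of pattern appear in sequence in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def supporting_genes(pattern, gene_sequences):
    """Genes whose sequence contains the pattern as a subsequence."""
    return sorted(g for g, seq in gene_sequences.items() if is_subsequence(pattern, seq))


def frequent_subsequences(gene_sequences, length, min_support):
    """Naive enumeration of all candidate subsequences of the given length."""
    items = sorted(set("".join(gene_sequences.values())))
    found = {}
    for candidate in permutations(items, length):
        pattern = "".join(candidate)
        support = supporting_genes(pattern, gene_sequences)
        if len(support) >= min_support:
            found[pattern] = support
    return found


shared_by_all = frequent_subsequences(genes, length=3, min_support=3)
```

Only the order a, d, c is preserved by all three genes at length 3. The enumeration visits every candidate, including those that occur in no gene, which is the cost the suffix concatenation and Apriori strategies are meant to avoid.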

References

J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.

H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.

S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.

J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.