Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium

Guy Harari

FABIA: factor analysis for bicluster acquisition

Sepp Hochreiter et al., University of Linz, Austria

FABIA - Motivation

• Plaid models: for bicluster i: $\theta_{kij} = \mu_i + \alpha_{ki} + \beta_{ij}$
• They use a least-squares fit for model selection.
• Thus they assume Gaussian effects.
• However, microarray datasets are not Gaussian (heavy tails).

FABIA – model

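• Biclusters have multiplicative coherent values: a bicluster is the outer product $\lambda z^T$.
• $\lambda$ – prototype
• $z$ – factors
• Example:

$$\lambda z^T = \begin{pmatrix} 2 & 1 & 0 & 4 \\ 0 & 0 & 0 & 0 \\ 3 & 1.5 & 0 & 6 \\ 4 & 2 & 0 & 8 \end{pmatrix}, \qquad \lambda = (1, 0, 1.5, 2)^T, \quad z = (2, 1, 0, 4)^T$$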
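The example on this slide can be checked by hand or with a few lines of code. The sketch below only assumes the two vectors read off the slide, a prototype (1, 0, 1.5, 2) over four genes and a factor vector (2, 1, 0, 4) over four samples; the variable names are illustrative.

```python
# Multiplicative bicluster as an outer product (values taken from the slide).
lam = [1, 0, 1.5, 2]   # prototype vector: one entry per gene
z = [2, 1, 0, 4]       # factor vector: one entry per sample

# Entry (k, j) of the bicluster is lam[k] * z[j].
bicluster = [[lam_k * z_j for z_j in z] for lam_k in lam]

for row in bicluster:
    print(row)
```

A zero in the prototype removes a gene (row) from the bicluster and a zero in the factor vector removes a sample (column), which is why sparse vectors give biclusters.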

FABIA – model

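• For p biclusters and additive Gaussian noise:

$$X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon = \Lambda Z + \Upsilon$$

• The j-th sample (column in X) is:

$$x_j = \sum_{i=1}^{p} \lambda_i z_{ij} + \epsilon_j = \Lambda \tilde{z}_j + \epsilon_j$$

• where $\tilde{z}_j$ is the j-th column of Z.
• Λ and Z are sparse.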
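As a small numerical check of the model on this slide, the sketch below builds a toy data matrix both as a sum of p outer products and as a matrix product, and extracts one sample column. The sizes, the random seed and the noise scale are arbitrary assumptions made for illustration; NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, p = 6, 5, 2                       # genes, samples, biclusters (toy sizes)

Lam = rng.normal(size=(n, p))           # columns are the prototypes lambda_i
Z = rng.normal(size=(p, l))             # rows are the factor vectors z_i^T
noise = rng.normal(scale=0.1, size=(n, l))

# Sum of outer products plus noise ...
X_sum = sum(np.outer(Lam[:, i], Z[i, :]) for i in range(p)) + noise
# ... equals the matrix form Lambda Z plus noise.
X_mat = Lam @ Z + noise

# The j-th sample is Lambda times the j-th column of Z, plus its noise column.
j = 3
x_j = Lam @ Z[:, j] + noise[:, j]

print(np.allclose(X_sum, X_mat), np.allclose(x_j, X_mat[:, j]))
```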

Generative Model for Factor Analysis

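• Data was produced by:
– Picking values independently from some Gaussian hidden factors.
– Linearly combining the factors using a factor loading matrix.
– Adding Gaussian noise for each input.

[Figure: hidden factors $f_i \sim N(0,1)$ connected through loadings $w_{ij}$ to inputs $x_j$, with noise labeled $N(\mu_j, \sigma_j^2)$.]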

Generative Model for Factor Analysis

• Assume factors and noise are independent.
• Assume also $\mathrm{Cov}(F) = I$.
• Select the number of factors by e.g. the Kaiser criterion: $\#\mathrm{EV}_{\mathrm{Cov}(X)} > 1$, i.e. the number of eigenvalues of Cov(X) greater than 1.
• Extract factors using e.g. maximum likelihood.
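The generative steps and the Kaiser criterion can be sketched together. The example below is an illustration under stated assumptions, not the method used in the talk: the loading matrix, noise level, sample count and seed are made up, and each variable is standardized first so that Cov(X) coincides with the correlation matrix, which is the form in which the Kaiser criterion is usually applied.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5000                                      # number of observations (assumed)

# Hypothetical loading matrix: 8 inputs, 2 hidden factors.
W = np.zeros((8, 2))
W[:4, 0] = 1.0                                # inputs 0-3 load on factor 1
W[4:, 1] = 1.0                                # inputs 4-7 load on factor 2

F = rng.standard_normal((2, m))               # hidden factors f_i ~ N(0, 1)
X = W @ F + 0.3 * rng.standard_normal((8, m)) # linear combination + Gaussian noise

# Standardize each input, then count eigenvalues of the covariance above 1.
Xs = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
eigvals = np.linalg.eigvalsh(np.cov(Xs))
n_factors = int((eigvals > 1).sum())

print(n_factors)
```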

FABIA – model

• Fix the value for j.
• Factors are the $z_{ij}$'s, $1 \le i \le p$.
• $z_{ij} \sim N(0,1)$
• Biclusters shouldn't be correlated: $\mathrm{Cov}(\tilde{z}_j) = I$.
• The $\lambda$'s are the loading matrix's entries.
• $\Psi$ is diagonal – independent Gaussian noise, $\epsilon_j \sim N(0, \Psi)$.

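Sparseness

• We want sparse solutions for $\Lambda$ and $Z$.
• So use a Laplace distribution for $\tilde{z}_j$:

$$p(\tilde{z}_j) = \left(\frac{1}{\sqrt{2}}\right)^{p} \prod_{i=1}^{p} e^{-\sqrt{2}\,|z_{ij}|}$$

• For $\lambda_i$ use one of:
1. FABIA:

$$p(\lambda_i) = \left(\frac{1}{\sqrt{2}}\right)^{n} \prod_{k=1}^{n} e^{-\sqrt{2}\,|\lambda_{ki}|}$$

2. FABIAS:

$$p(\lambda_i) = \begin{cases} c & \text{for } \mathrm{sp}(\lambda_i) \le \mathrm{spL} \\ 0 & \text{for } \mathrm{sp}(\lambda_i) > \mathrm{spL} \end{cases} \qquad \mathrm{sp}(\lambda_i) = \frac{\sqrt{n} - \sum_{k=1}^{n} |\lambda_{ki}| \Big/ \sqrt{\sum_{k=1}^{n} \lambda_{ki}^2}}{\sqrt{n} - 1}$$

where spL is a parameter.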
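The sparseness measure used by FABIAS compares the L1 and L2 norms of a loading vector. A minimal sketch, assuming a plain Python list as input and a hypothetical function name; it is undefined for an all-zero vector or a vector of length 1, which the sketch does not guard against.

```python
import math

def sparseness(lam):
    """Sparseness of a vector: 1 for a single nonzero entry, 0 for equal magnitudes."""
    n = len(lam)
    l1 = sum(abs(v) for v in lam)               # sum of absolute values
    l2 = math.sqrt(sum(v * v for v in lam))     # Euclidean norm
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

print(sparseness([0, 0, 3, 0]))   # one nonzero entry
print(sparseness([1, 1, 1, 1]))   # all entries equal
```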

Model Selection

• Center the data to zero median.
• Normalization – divide values by the row's std.
• Use EM where the parameters are $\Lambda$ and $\Psi$.
• Rank biclusters according to mutual information:

$$I(X; z_i^T \mid Z \setminus z_i^T) = \sum_{j=1}^{l} I(x_j; z_{ij} \mid \tilde{z}_j \setminus z_{ij})$$

• Determine members of each bicluster using two thresholds for the values $\lambda_{ki}$ and $z_{ij}$.
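The two preprocessing steps can be sketched as follows. This assumes that centering is done per row (gene), which the slide does not state explicitly, and that no row is constant (a constant row would give a zero standard deviation). The function name is illustrative.

```python
import numpy as np

def preprocess(X):
    """Center each row to zero median, then divide each row by its standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - np.median(X, axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)

Y = preprocess([[1, 2, 6], [10, 20, 60]])
print(np.median(Y, axis=1), Y.std(axis=1))
```

Dividing by a positive constant keeps the median at zero, so both properties hold after the second step.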

Experiments – Simulated Datasets

• n=1000 genes, l=100 samples
• p=10 multiplicative biclusters
• Generate $\lambda_i$:
– Choose $N_i^{\lambda}$, the number of genes in bicluster i, uniformly at random from {10,…,210}.
– Choose $N_i^{\lambda}$ genes from {1,…,1000}.
– Set $\lambda_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set $\lambda_i$ components in bicluster i to $N(\pm 3, 1)$.
• Generate $z_i$:
– Choose $N_i^{z}$, the number of samples in bicluster i, uniformly at random from {5,…,25}.
– Choose $N_i^{z}$ samples from {1,…,100}.
– Set $z_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set $z_i$ components in bicluster i to $N(2, 1)$.
• Add random noise to all entries according to $N(0, 3^2)$.
• Compute the dataset with

$$X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon$$
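The simulation recipe above can be written down as a short generator. This is a sketch under assumptions: genes and samples are drawn without replacement, the sign in N(±3, 1) is drawn independently for each gene, and the function name, seed and return format are illustrative choices that do not come from the talk.

```python
import numpy as np

def simulate(n=1000, l=100, p=10, seed=0):
    """Generate a dataset with p multiplicative biclusters plus N(0, 3^2) noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 3.0, size=(n, l))             # noise on all entries
    truth = []
    for _ in range(p):
        n_genes = rng.integers(10, 211)               # uniform on {10, ..., 210}
        n_samples = rng.integers(5, 26)               # uniform on {5, ..., 25}
        genes = rng.choice(n, size=n_genes, replace=False)
        samples = rng.choice(l, size=n_samples, replace=False)

        lam = rng.normal(0.0, 0.2, size=n)            # components outside the bicluster
        signs = rng.choice([-1.0, 1.0], size=n_genes) # assumed: sign drawn per gene
        lam[genes] = rng.normal(3.0 * signs, 1.0)     # components inside: N(+-3, 1)

        z = rng.normal(0.0, 0.2, size=l)
        z[samples] = rng.normal(2.0, 1.0, size=n_samples)

        X += np.outer(lam, z)
        truth.append((set(genes.tolist()), set(samples.tolist())))
    return X, truth

X, truth = simulate()
print(X.shape, len(truth))
```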

Evaluation – consensus score

• For two sets of biclusters:
– Compute similarity between each pair of biclusters, one from each set.
– Find the maximum assignment using the Munkres (Hungarian) algorithm.
– Penalize different numbers of biclusters: divide the sum of similarities of the assigned biclusters by the number of biclusters in the larger set.
• Use the Jaccard index for computing similarity.
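A small sketch of the consensus score follows. To stay dependency-free it finds the maximum assignment by exhaustive search over permutations, which stands in for the Munkres algorithm and is only practical for a handful of biclusters; the optimum it returns is the same. Representing a bicluster as a set of (gene, sample) cells is an assumption of this sketch.

```python
from itertools import permutations

def cells(genes, samples):
    """Represent a bicluster as the set of its (gene, sample) cells."""
    return {(g, s) for g in genes for s in samples}

def jaccard(b1, b2):
    """Jaccard index of two biclusters given as sets of cells."""
    union = len(b1 | b2)
    return len(b1 & b2) / union if union else 0.0

def consensus_score(set_a, set_b):
    """Best one-to-one matching similarity, divided by the size of the larger set."""
    small, large = sorted((set_a, set_b), key=len)
    if not large:
        return 0.0
    sim = [[jaccard(a, b) for b in large] for a in small]
    # Exhaustive search over assignments (stand-in for the Munkres algorithm).
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(large)

A = [cells({1, 2}, {1, 2}), cells({5}, {7})]
B = [cells({1, 2}, {1, 2})]
print(consensus_score(A, B))
```

In the usage example one bicluster is recovered exactly and one is missed, so the score is halved by the division by the larger set size.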

Simulated Datasets - Results

• Average score and STD for each method:

Simulated Datasets - Results

• Avg. and STD of information content and similarity:

Simulated additive datasets

• Generate biclusters in the same way.
• Use an additive model for each bicluster:

$$\theta_{ikj} = \mu_i + \alpha_{ik} + \beta_{ij}$$

• Choose $\alpha_{ik}$ from $N(0.5, 0.2^2)$ and $\beta_{ij}$ from $N(1, 0.5^2)$.
• Choose $\mu_i$ from one of three models:
– Low signal – $N(0, 2^2)$
– Moderate signal – $N(\pm 2, 0.5^2)$
– High signal – $N(\pm 4, 0.5^2)$
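For comparison with the multiplicative case, here is a sketch of one additive bicluster: a bicluster effect, plus a gene effect drawn from N(0.5, 0.2^2), plus a sample effect drawn from N(1, 0.5^2). Drawing the sign of the bicluster effect at random for the moderate and high signal levels is an assumption, as are the function and dictionary names.

```python
import numpy as np

# (mean magnitude, standard deviation, whether the sign is drawn at random)
SIGNAL = {"low": (0.0, 2.0, False), "moderate": (2.0, 0.5, True), "high": (4.0, 0.5, True)}

def additive_bicluster(n_genes, n_samples, signal, rng):
    """Values of one additive bicluster: bicluster effect + gene effect + sample effect."""
    mean, sd, random_sign = SIGNAL[signal]
    sign = rng.choice([-1.0, 1.0]) if random_sign else 1.0
    mu = rng.normal(sign * mean, sd)                 # bicluster effect
    alpha = rng.normal(0.5, 0.2, size=n_genes)       # gene effects
    beta = rng.normal(1.0, 0.5, size=n_samples)      # sample effects
    return mu + alpha[:, None] + beta[None, :]

rng = np.random.default_rng(0)
theta = additive_bicluster(4, 3, "moderate", rng)
print(theta.shape)
```

In an additive bicluster the difference between any two rows is constant across columns, whereas in a multiplicative one the ratio is.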

Additive Datasets - results

• Low signal:


• Moderate signal:


• High signal:

Gene Expression Datasets

• Breast cancer (Van’t Veer et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007).
• Multiple tissue types dataset (Su et al., 2002)
• Diffuse large-B-cell lymphoma dataset (DLBCL) (Rosenwald et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007).

Gene Expression Datasets - results

Biological Interpretation

• Breast cancer:
– Bicluster 1 is related to cell cycle (GO and KEGG, $p \sim 10^{-9}$) and to the proteins CDC2 (division control) and KIF (mitosis).
– Bicluster 2 is related to immune response (GO, $p \sim 10^{-26}$) and cytokine-cytokine receptor interaction (KEGG, $p \sim 10^{-10}$), and to cytokine-related proteins such as CCR5, CCL4 and CSF2RB.
• Multiple tissue – no biological interpretation.

Biological Interpretation

• DLBCL:
– Bicluster 1 is related to the ribosome (GO, $p \sim 10^{-6}$; KEGG, $p \sim 10^{-8}$) and to B-cell receptor signaling (KEGG, $p \sim 10^{-9}$).
– Bicluster 2 is related to the immune system (GO, $p \sim 10^{-6}$; KEGG, $p \sim 10^{-9}$).

Drug Design

• Goal: find compounds with similar effects on gene expression.

• Use Affymetrix GeneChip HT HG-U133+ PM array plates with 12*8 samples per plate.

• Selected compounds are active on a cancer cell line.

• Each compound was tested in a group of three replicates.

Drug Design

• 3 biclusters were found to have 2-5 replicate sets.
• One of them extracted genes related to mitosis (GO, $p \sim 10^{-13}$).
• The compounds of this bicluster are now under investigation by Johnson & Johnson Pharmaceutical R&D.

Biclustering Gene Expression Time Series

Sara C Madeira, Technical University of Lisbon

Introduction

• Input: columns correspond to samples taken in consecutive instants of time.

• Output: biclusters with contiguous columns.
• Motivation: biological processes start and end in a contiguous time interval, leading to increased/decreased activity of some genes.

• Goal: find all maximal contiguous column coherent (CCC) biclusters sorted by a statistical score.

Discretization

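• Let $A'_{n \times m}$ be the input expression matrix.
• Define

$$A''_{ij} = \begin{cases} \dfrac{A'_{i(j+1)} - A'_{ij}}{|A'_{ij}|}, & \text{if } A'_{ij} \neq 0, \\ 1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} > 0, \\ -1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} < 0, \\ 0, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} = 0. \end{cases}$$

• Standardize A' to mean=0 and STD=1 by gene.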

Discretization

• Define

$$A_{ij} = \begin{cases} D, & \text{if } A''_{ij} \le -t, \\ U, & \text{if } A''_{ij} \ge t, \\ N, & \text{otherwise.} \end{cases}$$

• Where D symbolizes Down-regulation, U Up-regulation and N No-change.
• The threshold t=1 corresponds to one standard deviation of a gene.
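The two discretization steps (relative variation between consecutive time points, then thresholding into D, U and N) can be sketched for a single gene as below. The sketch assumes the row has already been standardized by gene, that the threshold comparisons are inclusive, and that the function names are free choices.

```python
def variations(row):
    """Relative change between consecutive time points, with special cases at zero."""
    out = []
    for cur, nxt in zip(row, row[1:]):
        if cur != 0:
            out.append((nxt - cur) / abs(cur))
        else:
            out.append(1 if nxt > 0 else -1 if nxt < 0 else 0)
    return out

def discretize(row, t=1.0):
    """Map each variation to D (down), U (up) or N (no change) using threshold t."""
    return "".join(
        "D" if v <= -t else "U" if v >= t else "N" for v in variations(row)
    )

print(discretize([1.0, 2.0, 2.0, 0.5, 0.0, 3.0]))
```

A row with m time points yields a string of m-1 symbols, one per transition.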

CCC-Bicluster

• Definition: A CCC-Bicluster $A_{IJ}$ is a subset of rows $I = \{i_1, \ldots, i_k\}$ and a contiguous subset of columns $J = \{r, r+1, \ldots, s-1, s\}$ such that $A_{ij} = A_{kj}$ for all rows $i, k \in I$ and columns $j \in J$.
• Note that each CCC-Bicluster defines a string S which is common to every row in I.

Suffix Trees

1. Each internal node, other than the root, has at least two children.
2. Each edge is labeled with a nonempty substring of S (here “BANANA”).
3. No two edges out of a node have edge labels starting with the same symbol.
4. The concatenated edge labels on the path from the root to a leaf spell a suffix of S.

Example

Internal node = row-maximal, right-maximal CCC-Bicluster

Main Result

• Every (inclusion) maximal CCC-Bicluster with at least two rows corresponds to an internal node in the suffix tree such that:
– It does not have incoming suffix links, or,
– It has incoming suffix links only from nodes having fewer leaves in their subtrees.
• Each such internal node defines a maximal CCC-Bicluster with at least two rows.
• This implies an O(nm) time algorithm for finding all maximal CCC-Biclusters.
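To make the notion of a maximal CCC-Bicluster concrete, the sketch below enumerates them by brute force on a tiny discretized matrix. It is not the suffix-tree algorithm from the talk and is far slower than O(nm); it only illustrates the definition. For every column interval it groups rows by their substring (which makes each group row-maximal) and keeps a group if it cannot be extended one column to the left or right. The function name and output format are illustrative.

```python
from collections import defaultdict

def maximal_ccc_biclusters(rows, min_rows=2):
    """Brute-force enumeration of maximal CCC-Biclusters in a list of equal-length strings."""
    m = len(rows[0])
    found = []
    for r in range(m):
        for s in range(r, m):
            # Rows sharing the same pattern on columns r..s form a row-maximal group.
            groups = defaultdict(list)
            for i, row in enumerate(rows):
                groups[row[r:s + 1]].append(i)
            for pattern, members in groups.items():
                if len(members) < min_rows:
                    continue
                # Extendable if all member rows agree on the neighbouring column.
                left = r > 0 and len({rows[i][r - 1] for i in members}) == 1
                right = s + 1 < m and len({rows[i][s + 1] for i in members}) == 1
                if not (left or right):
                    found.append((members, r, s, pattern))
    return found

for bicluster in maximal_ccc_biclusters(["UDUN", "UDUD", "NDUD"]):
    print(bicluster)
```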


Experiments – Simulated Datasets

• Generate a random 1000 x 50 dataset.
• Apply the algorithm to it.
• Plant 10 CCC-Biclusters in the same dataset.
• Apply the algorithm to the dataset again.
• Define a similarity measure, the Jaccard index (over genes and conditions), and a statistical test.
• Filter out similar biclusters and those that did not pass the statistical test.

The Statistical Test

• Null hypothesis – expression values of a subset of genes evolve independently.

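• Expression patterns are modeled by a first-order Markov chain, e.g. for the pattern $p_B = U_2 D_3 U_4$ (U at time point 2, D at time point 3, U at time point 4):

$$\Pr(p_B) = \Pr(U_2 D_3 U_4) = \Pr(U_2)\,\Pr(D_3 \mid U_2)\,\Pr(U_4 \mid D_3)$$

where

$$\Pr(U_2) = \frac{\#U_2}{n}, \qquad \Pr(D_3 \mid U_2) = \frac{\Pr(U_2 D_3)}{\Pr(U_2)} = \frac{\#(U_2 D_3)}{\#U_2}, \qquad \Pr(U_4 \mid D_3) = \frac{\Pr(D_3 U_4)}{\Pr(D_3)} = \frac{\#(D_3 U_4)}{\#D_3}$$

and # counts the genes showing the given symbol(s) at the given time point(s).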
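The first-order Markov estimate of a pattern's probability can be computed directly from symbol counts in the discretized matrix. A minimal sketch, assuming rows are strings over D, U and N, columns are 0-based, and that a pattern whose conditioning symbol never occurs gets probability 0; the function name is hypothetical.

```python
def pattern_probability(rows, pattern, start):
    """Estimate Pr(pattern) at columns start, start+1, ... with a first-order Markov chain."""
    n = len(rows)
    # Probability of the first symbol at the starting column.
    prob = sum(row[start] == pattern[0] for row in rows) / n
    for offset in range(1, len(pattern)):
        col = start + offset
        prev_sym, sym = pattern[offset - 1], pattern[offset]
        prev_count = sum(row[col - 1] == prev_sym for row in rows)
        pair_count = sum(
            row[col - 1] == prev_sym and row[col] == sym for row in rows
        )
        if prev_count == 0:
            return 0.0
        # Transition probability estimated as pair count / count of the previous symbol.
        prob *= pair_count / prev_count
    return prob

rows = ["UDU", "UDN", "UNU", "DDU"]
print(pattern_probability(rows, "UDU", 0))
```

In the usage example the three factors are 3/4, 2/3 and 2/3, so the estimate is 1/3.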

The Statistical Test

• n – the number of genes in the dataset.
• I – the subset of genes in a CCC-Bicluster.
• The significance of a CCC-Bicluster B with an expression pattern $p_B$ is:

$$\mathrm{pval}(B) = \sum_{j=|I|-1}^{n-1} \binom{n-1}{j} \Pr(p_B)^{j} \left(1 - \Pr(p_B)\right)^{n-1-j}$$
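The significance score is a binomial tail probability: the chance that the pattern occurs in at least |I|-1 of the other n-1 genes if genes evolve independently. The sketch below assumes that reading of the formula, with n-1 trials throughout; the function and parameter names are illustrative.

```python
from math import comb

def ccc_pvalue(pattern_prob, n_genes, n_rows_in_bicluster):
    """Binomial tail: probability of at least |I|-1 occurrences among the other n-1 genes."""
    trials = n_genes - 1
    return sum(
        comb(trials, j) * pattern_prob ** j * (1 - pattern_prob) ** (trials - j)
        for j in range(n_rows_in_bicluster - 1, trials + 1)
    )

print(ccc_pvalue(0.5, 3, 2))
```

With 3 genes, a bicluster of 2 rows and a pattern probability of 0.5, the tail covers one or two occurrences among the two other genes, giving 0.75.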

Simulated Datasets - results

• 165 CCC-Biclusters passed the test at the 1 percent level, after Bonferroni correction.

Experiments – Real Datasets

• Use yeast heat shock response dataset from Gasch et al.

• 25 CCC-Biclusters were found to be highly significant at the 1% level after Bonferroni correction.
• 9 of them were removed after the similarity check.
• Test the results for GO enrichment (hypergeometric test).

Real Datasets - results

Up-regulated CCC-Biclusters

Down-regulated CCC-Biclusters

Improvements

• Allow errors: replacement of D/U with N and vice versa.

• Discover biclusters with opposite patterns (anti-correlated).

• Allow scaled and time-lagged (shifted) patterns.
• TriClustering – genes x time points x exemplars (different patients/stress conditions).

Other talks

• “biclust” R package – Ludwig Maximilian University of Munich (Institute of Statistics) and Hasselt University.

• ISA and related tools (R packages) – Gabor Csardi, University of Lausanne, Switzerland.

• Clustering of dose-response microarray data – Hasselt University, Johnson & Johnson PR&D.

• Model- and graph-based clustering of genomic data – Freiburg Institute for Advanced Studies, Germany.

Questions?
