Workshop report: Biclustering Methods for Microarray Data,
Hasselt University, Belgium
Guy Harari
FABIA: factor analysis for bicluster acquisition
Sepp Hochreiter et al., University of Linz, Austria
FABIA - Motivation
• Plaid models: for bicluster i: $\theta_{kij} = \mu_i + \alpha_{ki} + \beta_{ij}$
• They use a least-squares fit for model selection
• Thus they assume Gaussian effects
• However, microarray datasets are not Gaussian (heavy tails)
FABIA – model
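• Biclusters have multiplicative coherent values
• λ – prototype
• z – factors
• In the example above:
$$\lambda = (2, 1, 0, 4)^T, \quad z = (1, 0, 1.5, 2)^T, \quad \lambda z^T = \begin{pmatrix} 2 & 0 & 3 & 4 \\ 1 & 0 & 1.5 & 2 \\ 0 & 0 & 0 & 0 \\ 4 & 0 & 6 & 8 \end{pmatrix}$$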
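The multiplicative bicluster can be reproduced as an outer product. The sketch below is illustrative only: the helper name `outer` is hypothetical, and the two vectors are the prototype and factor values as read from the slide.

```python
def outer(lam, z):
    """Outer product lam * z^T as a list of rows (one row per gene)."""
    return [[lam_k * z_j for z_j in z] for lam_k in lam]

# Prototype (genes) and factors (samples), assumed to be the slide's example values.
lam = [2, 1, 0, 4]
z = [1, 0, 1.5, 2]

bicluster = outer(lam, z)

# Genes and samples with a non-zero entry form the bicluster.
member_genes = [k for k, v in enumerate(lam) if v != 0]
member_samples = [j for j, v in enumerate(z) if v != 0]

for row in bicluster:
    print(row)
```

Every row of the result is a multiple of z and every column a multiple of λ, which is what "multiplicative coherent values" refers to.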
FABIA – model
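• For p biclusters and additive Gaussian noise: $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon = \Lambda Z + \Upsilon$
• The j-th sample (column in X) is: $x_j = \sum_{i=1}^{p} \lambda_i z_{ij} + \epsilon_j = \Lambda \tilde{z}_j + \epsilon_j$
• where $\tilde{z}_j$ is the j-th column of Z.
• Λ and Z are sparse.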
Generative Model for Factor Analysis
• Data was produced by:
– Picking values independently from some Gaussian hidden factors.
– Linearly combining the factors using a factor loading matrix.
– Adding Gaussian noise for each input.
[Figure labels: hidden factors $f_i \sim N(0,1)$, loadings $w_{ij}$, inputs $x_j$ with noise $N(\mu_j, \sigma_j^2)$]
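The three generative steps can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `sample_input`, the layout of `W` (one row per factor) and the example numbers are all assumptions.

```python
import random

def sample_input(W, mu, sigma, rng):
    """Draw one observation from a factor analysis model.

    W[i][j] is the loading of hidden factor i on input j (assumed layout),
    mu[j] and sigma[j] are the mean and noise std of input j.
    """
    # Step 1: pick the hidden factors independently from N(0, 1).
    factors = [rng.gauss(0.0, 1.0) for _ in W]
    x = []
    for j in range(len(mu)):
        # Step 2: combine the factors linearly through the loading matrix.
        mean_j = mu[j] + sum(W[i][j] * factors[i] for i in range(len(W)))
        # Step 3: add Gaussian noise specific to input j.
        x.append(rng.gauss(mean_j, sigma[j]))
    return factors, x

rng = random.Random(0)
W = [[1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0]]          # 2 hidden factors, 3 inputs
factors, x = sample_input(W, mu=[0.0, 0.0, 0.0], sigma=[0.1, 0.1, 0.1], rng=rng)
```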
Generative Model for Factor Analysis
• Assume factors and noise are independent.
• Assume also $\mathrm{Cov}(F) = I$.
• Select #factors by e.g. Kaiser criterion – the number of eigenvalues of $\mathrm{Cov}(X)$ that are greater than 1.
• Extract factors using e.g. maximum likelihood.
FABIA – model
• Fix the value for j.
• Factors are the $z_{ij}$'s, $1 \le i \le p$.
• $z_{ij} \sim N(0,1)$
• Biclusters shouldn't be correlated: $\mathrm{Cov}(\tilde{z}_j) = I$.
• $\lambda_{ki}$ are the loading matrix's entries.
• $\Psi$ is diagonal – independent Gaussian noise: $\epsilon_j \sim N(0, \Psi)$.
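Sparseness
• We want sparse solutions for λ and z.
• So use Laplace distribution for the factors $z_i$ of a sample (index j dropped):
$$p(z) = \left(\frac{1}{\sqrt{2}}\right)^{p} \prod_{i=1}^{p} e^{-\sqrt{2}\,|z_i|}$$
• For $\lambda_i$ use one of:
1. FABIA:
$$p(\lambda_i) = \left(\frac{1}{\sqrt{2}}\right)^{n} \prod_{k=1}^{n} e^{-\sqrt{2}\,|\lambda_{ki}|}$$
2. FABIAS:
$$p(\lambda_i) = \begin{cases} c & \text{for } sp(\lambda_i) \le spL \\ 0 & \text{for } sp(\lambda_i) > spL \end{cases}$$
$$sp(\lambda_i) = \frac{\sqrt{n} - \sum_{k=1}^{n} |\lambda_{ki}| \Big/ \sqrt{\sum_{k=1}^{n} \lambda_{ki}^2}}{\sqrt{n} - 1}$$
where spL is a parameter.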
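The sparseness measure sp used by FABIAS compares the L1 and L2 norms of a prototype vector. A minimal sketch (the function name `sparseness` is illustrative, not taken from any package):

```python
import math

def sparseness(lam):
    """sp(lam) = (sqrt(n) - ||lam||_1 / ||lam||_2) / (sqrt(n) - 1), for n > 1."""
    n = len(lam)
    if n < 2:
        raise ValueError("sparseness needs at least two components")
    l1 = sum(abs(v) for v in lam)
    l2 = math.sqrt(sum(v * v for v in lam))
    if l2 == 0.0:
        raise ValueError("sparseness is undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)

one_hot = sparseness([0.0, 0.0, 5.0, 0.0])   # a single non-zero entry
flat = sparseness([1.0, 1.0, 1.0, 1.0])      # all entries equal
```

The measure ranges from 0 (all components equal in magnitude) to 1 (a single non-zero component).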
Model Selection
• Center the data to zero median.
• Normalization – divide values by row's std.
• Use EM where the parameters are $\Lambda$ and $\Psi$.
• Rank biclusters according to mutual information:
$$I(X; z_i^T \mid Z \setminus z_i^T) = \sum_{j=1}^{l} I(x_j; z_{ij} \mid \tilde{z}_j \setminus z_{ij})$$
• Determine members of each bicluster using two thresholds for values $\lambda_{ki}$ and $z_{ij}$.
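The two preprocessing steps can be sketched as follows. The slide does not say whether centering is done per row or which standard deviation estimator is used, so this sketch assumes per-row (per-gene) median centering and the sample standard deviation; `preprocess` is a hypothetical helper name.

```python
import statistics

def preprocess(X):
    """Center each row to zero median and divide it by its standard deviation.

    X is a list of rows (genes), each a list of expression values (samples).
    Assumption: both steps are applied row by row.
    """
    result = []
    for row in X:
        median = statistics.median(row)
        sd = statistics.stdev(row)          # sample standard deviation
        if sd == 0.0:
            sd = 1.0                        # constant row: leave the scale alone
        result.append([(v - median) / sd for v in row])
    return result

X = [[1.0, 2.0, 3.0, 4.0, 10.0],
     [5.0, 5.0, 5.0, 5.0, 5.0]]
X_norm = preprocess(X)
```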
Experiments – Simulated Datasets
• n=1000 genes, l=100 samples
• p=10 multiplicative biclusters
• Generate $\lambda_i$:
– Choose $N_i^{\lambda}$ - the number of genes in bicluster i - uniformly at random from {10,…,210}.
– Choose $N_i^{\lambda}$ genes from {1,…,1000}.
– Set $\lambda_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set $\lambda_i$ components in bicluster i to $N(\pm 3, 1)$.
Experiments – Simulated Datasets
• Generate $z_i$:
– Choose $N_i^{z}$ - the number of samples in bicluster i - uniformly at random from {5,…,25}.
– Choose $N_i^{z}$ samples from {1,…,100}.
– Set $z_i$ components not in bicluster i to $N(0, 0.2^2)$.
– Set $z_i$ components in bicluster i to $N(2, 1)$.
• Add random noise to all entries according to $N(0, 3^2)$.
• Compute the dataset with $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon$
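The simulation recipe from these two slides can be sketched in plain Python. This is an illustration under assumptions: the function name `make_dataset` is hypothetical, and the sign in N(±3, 1) is drawn independently for each entry, which the slides do not specify.

```python
import random

def make_dataset(n=1000, l=100, p=10, seed=0):
    """Simulate X = sum_i lambda_i z_i^T + noise with p multiplicative biclusters."""
    rng = random.Random(seed)
    # Start from the additive noise term, N(0, 3^2) for every entry.
    X = [[rng.gauss(0.0, 3.0) for _ in range(l)] for _ in range(n)]
    truth = []
    for _ in range(p):
        genes = set(rng.sample(range(n), rng.randint(10, 210)))
        samples = set(rng.sample(range(l), rng.randint(5, 25)))
        # Prototype: N(+-3, 1) inside the bicluster, N(0, 0.2^2) outside.
        lam = [rng.gauss(rng.choice((-3.0, 3.0)), 1.0) if k in genes
               else rng.gauss(0.0, 0.2) for k in range(n)]
        # Factors: N(2, 1) inside the bicluster, N(0, 0.2^2) outside.
        z = [rng.gauss(2.0, 1.0) if j in samples
             else rng.gauss(0.0, 0.2) for j in range(l)]
        for k in range(n):
            row, lam_k = X[k], lam[k]
            for j in range(l):
                row[j] += lam_k * z[j]
        truth.append((genes, samples))
    return X, truth

X, truth = make_dataset()
```

`truth` keeps the planted gene and sample sets, which is what a score such as the consensus score needs as the reference set.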
Evaluation – consensus score
• For two sets of biclusters:
– Compute similarity between each pair of biclusters, one from each set.
– Find maximum assignment using the Munkres (Hungarian) algorithm.
– Penalize different numbers of biclusters - divide the sum of similarities of the assigned biclusters by the number of biclusters of the largest set.
• Use Jaccard index for computing similarity.
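The consensus score can be sketched as below. Two simplifications are assumptions of this sketch: a bicluster is represented as the set of its (gene, sample) cells, and the maximum assignment is found by brute force over permutations instead of the Munkres algorithm, which is only practical for a handful of biclusters.

```python
from itertools import permutations

def cells(genes, samples):
    """Represent a bicluster as the set of its (gene, sample) matrix cells."""
    return {(g, s) for g in genes for s in samples}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def consensus_score(set1, set2):
    """Best one-to-one matching similarity, divided by the size of the larger set."""
    small, large = sorted((set1, set2), key=len)
    if not large:
        return 0.0
    sim = [[jaccard(a, b) for b in large] for a in small]
    # Brute-force maximum assignment (stand-in for the Hungarian algorithm).
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(large)

b1 = cells({0, 1}, {0, 1})
b2 = cells({5, 6}, {3})
same = consensus_score([b1, b2], [b2, b1])
missing_one = consensus_score([b1], [b1, b2])
```

Dividing by the size of the larger set is what penalizes a method that returns too many or too few biclusters.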
Simulated Datasets - Results
• Average score and STD for each method:
Simulated Datasets - Results
• Avg. and STD of information content and similarity:
Simulated additive datasets
• Generate biclusters in the same way.
• Use additive model for each bicluster: $\theta_{ikj} = \mu_i + \alpha_{ik} + \beta_{ij}$
• Choose $\alpha_{ik}$ from $N(0.5, 0.2^2)$ and $\beta_{ij}$ from $N(1, 0.5^2)$.
• Choose $\mu_i$ from one of three models:
– Low signal – $N(0, 2^2)$
– Moderate signal – $N(\pm 2, 0.5^2)$
– High signal – $N(\pm 4, 0.5^2)$
Additive Datasets - results
• Low signal:
Additive Datasets - results
• Moderate signal:
Additive Datasets - results
• High signal:
Gene Expression Datasets
• Breast cancer (Van’t Veer et al., 2002) – 3 classes (clusters) were found in Hoshida et al., 2007.
• Multiple tissue types dataset (Su et al., 2002)
• Diffuse large-B-cell lymphoma dataset (DLBCL) (Rosenwald et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007).
Gene Expression Datasets - results
Biological Interpretation
• Breast cancer:
– Bicluster 1 is related to cell cycle (GO and KEGG, p-value $10^{-9}$) and to the proteins CDC2 (division control) and KIF (mitosis).
– Bicluster 2 is related to immune response (GO, p-value $10^{-26}$) and cytokine-cytokine receptor interaction (KEGG, p-value $10^{-10}$), and to cytokine-related proteins such as CCR5, CCL4 and CSF2RB.
• Multiple tissue – no biological interpretation.
Biological Interpretation
• DLBCL:
– Bicluster 1 is related to the ribosome (GO p-value $10^{-6}$, KEGG p-value $10^{-8}$) and to B-cell receptor signaling (KEGG p-value $10^{-9}$).
– Bicluster 2 is related to the immune system (GO p-value $10^{-6}$, KEGG p-value $10^{-9}$).
Drug Design
• Goal: find compounds with similar effects on gene expression.
• Use Affymetrix GeneChip HT HG-U133+ PM array plates with 12 × 8 samples per plate.
• Selected compounds are active on a cancer cell line.
• Each compound was tested in a group of three replicates.
Drug Design
• 3 biclusters were found to have 2-5 replicate sets.
• One of them extracted genes related to mitosis (GO, p-value $10^{-13}$).
• The compounds of this bicluster are now under investigation by Johnson & Johnson Pharmaceutical R&D.
Biclustering Gene Expression Time Series
Sara C Madeira, Technical University of Lisbon
Introduction
• Input: columns correspond to samples taken in consecutive instants of time.
• Output: biclusters with contiguous columns.
• Motivation: biological processes start and end in a contiguous time interval, leading to increased/decreased activity of some genes.
• Goal: find all maximal contiguous column coherent (CCC) biclusters sorted by a statistical score.
Discretization
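• Let $A'_{n \times m}$ be the input expression matrix.
• Define
$$A''_{ij} = \begin{cases} \dfrac{A'_{i(j+1)} - A'_{ij}}{|A'_{ij}|}, & \text{if } A'_{ij} \neq 0, \\ 1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} > 0, \\ -1, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} < 0, \\ 0, & \text{if } A'_{ij} = 0 \text{ and } A'_{i(j+1)} = 0. \end{cases}$$
• Standardize A' to mean=0 and STD=1 by gene.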
Discretization
• Define
$$A_{ij} = \begin{cases} D, & \text{if } A''_{ij} \le -t, \\ U, & \text{if } A''_{ij} \ge t, \\ N, & \text{otherwise.} \end{cases}$$
• Where D symbolizes Down-regulation, U Up-regulation and N No-change.
• The threshold is t=1, i.e. one standard deviation of a gene.
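The discretization of one gene's time series can be sketched as follows. Assumptions of this sketch: the gene is standardized (mean 0, population standard deviation 1) before the variations between consecutive time points are computed, and the thresholds are inclusive; `discretize_row` is a hypothetical helper name.

```python
import statistics

def discretize_row(values, t=1.0):
    """Turn m expression values of one gene into m-1 symbols from {D, U, N}."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        return "N" * (len(values) - 1)      # a constant gene never changes
    std = [(v - mean) / sd for v in values]
    symbols = []
    for cur, nxt in zip(std, std[1:]):
        if cur != 0:
            change = (nxt - cur) / abs(cur)  # relative variation between time points
        elif nxt > 0:
            change = 1.0
        elif nxt < 0:
            change = -1.0
        else:
            change = 0.0
        if change <= -t:
            symbols.append("D")
        elif change >= t:
            symbols.append("U")
        else:
            symbols.append("N")
    return "".join(symbols)

rising = discretize_row([1, 2, 3])
falling = discretize_row([3, 2, 1])
```

Each gene becomes a string over {D, U, N}; a CCC-Bicluster is then a set of genes sharing the same substring at the same time points.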
CCC-Bicluster
• Definition: A CCC-Bicluster $A_{IJ}$ is a subset of rows $I = \{i_1, \ldots, i_k\}$ and a contiguous subset of columns $J = \{r, r+1, \ldots, s-1, s\}$ such that $A_{ij} = A_{kj}$ for all rows $i, k \in I$ and columns $j \in J$.
• Note that each CCC-Bicluster defines a string S which is common to every row in I.
Suffix Trees
1. Each internal node, other than the root, has at least two children.
2. Each edge is labeled with a nonempty substring of S (here “BANANA”).
3. No two edges out of a node have edge labels starting with the same symbol.
4. The label from the root to a leaf is a suffix of S.
Example
Internal node = row-maximal, right-maximal CCC-Bicluster
Main Result
• Every (inclusion) maximal CCC-Bicluster with at least two rows corresponds to an internal node in the suffix tree such that:
– It does not have incoming suffix links, or,
– It has incoming suffix links only from nodes having fewer leaves in their subtrees.
• Each such internal node defines a maximal CCC-Bicluster with at least two rows.
• This implies an O(nm) time algorithm for finding all maximal CCC-Biclusters.
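For small inputs, the set of maximal CCC-Biclusters can be cross-checked without a suffix tree. The sketch below is a brute-force enumeration, not the linear-time suffix-tree algorithm from the talk; the function name and the input format (one string over {D, U, N} per gene) are assumptions.

```python
from collections import defaultdict

def maximal_ccc_biclusters(rows):
    """Enumerate maximal CCC-Biclusters with at least two rows by brute force.

    Returns tuples (row_indices, first_col, last_col, pattern).
    """
    m = len(rows[0]) if rows else 0
    found = []
    for r in range(m):
        for s in range(r, m):
            # Grouping by the substring on columns r..s gives row-maximal sets.
            groups = defaultdict(list)
            for i, row in enumerate(rows):
                groups[row[r:s + 1]].append(i)
            for pattern, idx in groups.items():
                if len(idx) < 2:
                    continue
                # Can the same rows be extended by one column to the left/right?
                left = r > 0 and len({rows[i][r - 1] for i in idx}) == 1
                right = s < m - 1 and len({rows[i][s + 1] for i in idx}) == 1
                if not left and not right:
                    found.append((tuple(idx), r, s, pattern))
    return found

genes = ["UDU", "UDU", "NDU", "DDD"]
result = maximal_ccc_biclusters(genes)
```

Running time grows roughly with n·m³, so this is only useful as a reference for validating a faster implementation on toy data.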
Experiments – Simulated Datasets
• Generate a random 1000 x 50 dataset.
• Apply the algorithm on it.
• Plant 10 CCC-Biclusters on the same dataset.
• Apply the algorithm again on the dataset.
• Define a similarity measure to be Jaccard index (genes and conditions) and a statistical test.
• Filter out similar biclusters and those that didn’t pass the statistical test.
The Statistical Test
• Null hypothesis – expression values of a subset of genes evolve independently.
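• Expression patterns are modeled by a first-order Markov Chain, e.g. for the pattern $p_B = U_2 D_3 U_4$ (U at time point 2, D at time point 3, U at time point 4):
$$\Pr(p_B) = \Pr(U_2 D_3 U_4) = \Pr(U_2)\,\Pr(D_3 \mid U_2)\,\Pr(U_4 \mid D_3)$$
where, with $\#(\cdot)$ the number of genes showing the given symbols,
$$\Pr(U_2) = \frac{\#U_2}{n}, \quad \Pr(D_3 \mid U_2) = \frac{\Pr(U_2 D_3)}{\Pr(U_2)} = \frac{\#(U_2 D_3)}{\#U_2}, \quad \Pr(U_4 \mid D_3) = \frac{\Pr(D_3 U_4)}{\Pr(D_3)} = \frac{\#(D_3 U_4)}{\#D_3}.$$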
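The pattern probability under the first-order Markov chain can be estimated from symbol counts in the discretized matrix. A minimal sketch, assuming each gene is a string over {D, U, N} and columns are indexed from 0; the function name is hypothetical.

```python
def pattern_probability(rows, start, pattern):
    """Pr(p_B) for `pattern` occupying columns start, start+1, ... of `rows`."""
    n = len(rows)
    # Probability of the first symbol at its time point.
    prob = sum(1 for r in rows if r[start] == pattern[0]) / n
    for offset in range(1, len(pattern)):
        col = start + offset
        prev_sym, sym = pattern[offset - 1], pattern[offset]
        prev_count = sum(1 for r in rows if r[col - 1] == prev_sym)
        pair_count = sum(1 for r in rows
                         if r[col - 1] == prev_sym and r[col] == sym)
        if prev_count == 0:
            return 0.0
        # Transition probability, e.g. Pr(D3 | U2) = #(U2 D3) / #U2.
        prob *= pair_count / prev_count
    return prob

genes = ["UDU", "UDU", "NDU", "DDD"]
p_udu = pattern_probability(genes, 0, "UDU")   # (2/4) * (2/2) * (3/4)
```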
The Statistical Test
• n – the number of genes in the dataset.
• I – the subset of genes in a CCC-Bicluster.
• The significance of a CCC-Bicluster B with an expression pattern $p_B$ is:
$$pval(B) = \sum_{j=|I|-1}^{n-1} \binom{n-1}{j} \Pr(p_B)^{j} \left(1 - \Pr(p_B)\right)^{n-1-j}$$
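The significance is a binomial tail: the probability that at least |I|-1 of the other n-1 genes show the pattern by chance. A minimal sketch with illustrative names, taking the pattern probability as an input:

```python
from math import comb

def ccc_pvalue(p_b, n, num_rows):
    """Binomial tail probability of seeing the pattern in >= num_rows - 1
    of the remaining n - 1 genes, each with independent probability p_b."""
    return sum(
        comb(n - 1, j) * p_b ** j * (1.0 - p_b) ** (n - 1 - j)
        for j in range(num_rows - 1, n)
    )

# Toy numbers: pattern probability 0.375, 4 genes in total, 2 in the bicluster.
pval = ccc_pvalue(0.375, n=4, num_rows=2)
```

With a Bonferroni correction, as used in the experiments, this value would be compared against the significance level divided by the number of biclusters tested.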
Simulated Datasets - results
• 165 CCC-Biclusters passed the test at the 1 percent level, after Bonferroni correction.
Experiments – Real Datasets
• Use yeast heat shock response dataset from Gasch et al.
• 25 CCC-Biclusters were found to be highly significant at the 1% level after Bonferroni correction.
• 9 of them were removed after similarity check.
• Test results for GO enrichment (hypergeometric test).
Real Datasets - results
Up-regulated CCC-Biclusters
Down-regulated CCC-Biclusters
Improvements
• Allow errors: replacement of D/U with N and vice versa.
• Discover biclusters with opposite patterns (anti-correlated).
• Allow scaled and time-lagged (shifted) patterns.
• TriClustering – genes x time points x exemplars (different patients/stress conditions).
Other talks
• “biclust” R package – Ludwig Maximilian University of Munich (Institute of Statistics) and Hasselt University.
• ISA and related tools (R packages) – Gabor Csardi, University of Lausanne, Switzerland.
• Clustering of dose-response microarray data – Hasselt University, Johnson & Johnson PR&D.
• Model- and graph-based clustering of genomic data – Freiburg Institute for Advanced Studies, Germany.
Questions?