Bayesian Multi-topic Microarray Analysis with Hyperparameter Reestimation
Overview
Problem
Latent Process Decomposition (LPD)
Hyperparameter reestimation (MVB+)
Experiment
Results
Conclusions
Problem
Explain differences among cells of different nature (e.g., cancer vs. normal cells) by analyzing differences in gene expression obtained from DNA microarray experiments.
Gene expression
http://bix.ucsd.edu/bioalgorithms/slides.php
DNA microarray experiment
We can find out which genes are used (expressed) by different types of cells.
Latent Process Decomposition
latent Dirichlet allocation (LDA) [Blei et al. 01] vs. latent process decomposition (LPD) [Rogers et al. 05]

text mining          microarray analysis
document             sample
word                 gene
word frequency       gene expression level
latent topic         latent process
LPD as a multi-topic model
row = gene, column = sample, color = process
LPD as a generative model
For each sample d, draw a multinomial θd from a Dirichlet prior Dir(α).
θd: mixing proportions of the processes for sample d.
For each gene g in each sample d:
○ Draw a process k from Mult(θd).
○ Draw an expression level from the Gaussian N(μgk, λgk).
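The generative process above can be sketched in plain Python. The priors on μgk and λgk and all numeric values here are illustrative assumptions, not the paper's settings, and λgk is read as a precision (consistent with the Gamma prior used later):

```python
import random
import math

def generate_lpd(D, G, K, alpha=1.0, seed=0):
    """Sketch of the LPD generative process (illustrative priors, not the
    paper's settings). D samples, G genes, K latent processes."""
    rnd = random.Random(seed)
    # Per-gene, per-process Gaussian parameters: mean mu_gk, precision lambda_gk.
    mu = [[rnd.gauss(0.0, 1.0) for _ in range(K)] for _ in range(G)]
    lam = [[rnd.gammavariate(2.0, 1.0) for _ in range(K)] for _ in range(G)]
    x = [[0.0] * D for _ in range(G)]
    z = [[0] * D for _ in range(G)]
    for d in range(D):
        # theta_d ~ Dir(alpha): normalized Gamma draws give a Dirichlet sample.
        w = [rnd.gammavariate(alpha, 1.0) for _ in range(K)]
        theta_d = [v / sum(w) for v in w]
        for g in range(G):
            k = rnd.choices(range(K), weights=theta_d)[0]  # process for (g, d)
            z[g][d] = k
            x[g][d] = rnd.gauss(mu[g][k], 1.0 / math.sqrt(lam[g][k]))
    return x, z
```

Note that each gene in each sample draws its own process, which is what makes LPD a multi-topic model rather than a plain mixture over samples.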
Inference by VB [Rogers et al. 05]
Variational Bayesian inference
VB is used when exact EM is intractable.
Instead of the log likelihood, a variational lower bound is maximized.
Variational lower bound

$$\log p(x \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) \;\ge\; \sum_{z} \iiint q(z, \theta, \mu, \lambda)\, \log \frac{p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)}{q(z, \theta, \mu, \lambda)} \, d\theta\, d\mu\, d\lambda$$

where the joint distribution factorizes as

$$p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) = \prod_{d} \mathrm{Dir}(\theta_d \mid \alpha_0) \prod_{g,k} \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda_{gk}^{a_0 - 1} e^{-b_0 \lambda_{gk}} \sqrt{\frac{\lambda_0}{2\pi}} \exp\!\left(-\frac{\lambda_0 (\mu_{gk} - \mu_0)^2}{2}\right) \prod_{d,g} \theta_{d, z_{dg}} \sqrt{\frac{\lambda_{g z_{dg}}}{2\pi}} \exp\!\left(-\frac{\lambda_{g z_{dg}} (x_{gd} - \mu_{g z_{dg}})^2}{2}\right)$$
Inference by MVB [Ying et al. 08]
Marginalized variational Bayesian inference
Marginalizes the multinomial parameters θ
Requires a weaker approximation than VB (a tighter bound)
cf. collapsed variational Bayesian inference for LDA [Teh et al. 06]
Marginalization in MVB

The multinomial parameters θ are integrated out analytically:

$$p(x, z, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) = \int p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)\, d\theta = \prod_{d} \frac{\Gamma(K\alpha_0)}{\Gamma(G + K\alpha_0)} \prod_{k} \frac{\Gamma(n_{dk} + \alpha_0)}{\Gamma(\alpha_0)} \times \prod_{g,k} \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda_{gk}^{a_0 - 1} e^{-b_0 \lambda_{gk}} \sqrt{\frac{\lambda_0}{2\pi}} \exp\!\left(-\frac{\lambda_0 (\mu_{gk} - \mu_0)^2}{2}\right) \times \prod_{d,g} \sqrt{\frac{\lambda_{g z_{dg}}}{2\pi}} \exp\!\left(-\frac{\lambda_{g z_{dg}} (x_{gd} - \mu_{g z_{dg}})^2}{2}\right)$$

where $n_{dk} = \#\{g : z_{dg} = k\}$, so that $\sum_k n_{dk} = G$. The bound becomes

$$\log p(x \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) \;\ge\; \sum_{z} \iint q(z, \mu, \lambda)\, \log \frac{p(x, z, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)}{q(z, \mu, \lambda)} \, d\mu\, d\lambda$$
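The per-sample Dirichlet integral above has the closed form of a Dirichlet-multinomial over the process counts. A minimal sketch (the function name is ours):

```python
from math import lgamma

def log_dirichlet_multinomial(n_dk, alpha0):
    """log of the integral over theta_d of prod_k theta_dk^{n_dk} * Dir(theta_d | alpha0)
    for one sample d, under a symmetric Dirichlet prior.

    n_dk: list of process counts for sample d; alpha0: Dirichlet hyperparameter.
    """
    K = len(n_dk)
    n_d = sum(n_dk)  # equals G when every gene has an assignment
    out = lgamma(K * alpha0) - lgamma(n_d + K * alpha0)
    for n in n_dk:
        out += lgamma(n + alpha0) - lgamma(alpha0)
    return out
```

For example, with K = 2, α0 = 1, and counts (1, 0), the integral equals 1/2, the uniform-prior probability of the single assignment.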
Update formulas in MVB

With variational posteriors $q(\lambda_{gk}) = \mathrm{Gamma}(a_{gk}, b_{gk})$, $q(\mu_{gk}) = N(m_{gk}, v_{gk})$ and responsibilities $\gamma_{dgk} = q(z_{dg} = k)$:

$$a_{gk} = a_0 + \tfrac{1}{2} \sum_d \gamma_{dgk}$$
$$b_{gk} = b_0 + \tfrac{1}{2} \sum_d \gamma_{dgk} \left[ (x_{gd} - m_{gk})^2 + v_{gk} \right]$$
$$m_{gk} = \frac{\lambda_0 \mu_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk} x_{gd}}{\lambda_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk}}, \qquad v_{gk} = \frac{1}{\lambda_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk}}$$
$$\gamma_{dgk} \propto (n_{dk}^{\neg dg} + \alpha_0) \exp\!\left( \tfrac{1}{2}\big(\psi(a_{gk}) - \log b_{gk} - \log 2\pi\big) - \frac{a_{gk}}{2 b_{gk}} \left[ (x_{gd} - m_{gk})^2 + v_{gk} \right] \right)$$

where $n_{dk}^{\neg dg}$ is the process count for sample d excluding the current gene.
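A simplified one-pass version of these Gaussian-Gamma updates can be sketched as follows; prior shrinkage on the means and the variance term vgk are dropped for brevity, so this is an assumption-laden sketch rather than the paper's exact formulas:

```python
def gaussian_gamma_updates(x, r, a0=1.0, b0=1.0):
    """Simplified one-pass Gaussian-Gamma variational updates (prior shrinkage
    on m_gk and the v_gk term are omitted -- a sketch, not the exact MVB math).

    x[g][d]: expression level of gene g in sample d.
    r[g][d][k]: responsibility q(z_dg = k).
    Returns dicts m, a, b keyed by (g, k).
    """
    G, D, K = len(x), len(x[0]), len(r[0][0])
    m, a, b = {}, {}, {}
    for g in range(G):
        for k in range(K):
            N = sum(r[g][d][k] for d in range(D))                  # soft count
            Sx = sum(r[g][d][k] * x[g][d] for d in range(D))
            m[g, k] = Sx / N if N > 0 else 0.0                     # posterior mean
            a[g, k] = a0 + 0.5 * N                                 # Gamma shape
            b[g, k] = b0 + 0.5 * sum(r[g][d][k] * (x[g][d] - m[g, k]) ** 2
                                     for d in range(D))            # Gamma rate
    return m, a, b
```

In a full implementation these updates alternate with the responsibility update until the lower bound converges.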
Our proposal: MVB+
MVB with hyperparameter reestimation
Empirical Bayes method
○ Estimate the hyperparameters by maximizing the variational lower bound
Hand-tuned hyperparameter values often result in poor-quality inference.
Update formulas in MVB+

$$\mu_0 = \frac{1}{GK} \sum_{g,k} m_{gk}$$
$$b_0 = \frac{GK\, a_0}{\sum_{g,k} a_{gk} / b_{gk}}$$
$$\psi(a_0) = \log b_0 + \frac{1}{GK} \sum_{g,k} \big( \psi(a_{gk}) - \log b_{gk} \big)$$

Inversion of the digamma function is required to solve the last equation for $a_0$.
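Solving ψ(a0) = c has no closed form; one standard approach (not necessarily the authors') is Newton's method, which needs the digamma and trigamma functions. A stdlib-only sketch:

```python
from math import exp, log

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def trigamma(x):
    """psi'(x), needed for the Newton step."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1.0 / 6 - f * (1.0 / 30 - f / 42))

def inv_digamma(y, iters=32):
    """Solve psi(a) = y for a > 0 by Newton's method (Minka-style initialization)."""
    a = exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(iters):
        a -= (digamma(a) - y) / trigamma(a)
    return a
```

Since ψ is strictly increasing on (0, ∞), Newton's method with this initialization converges in a handful of iterations.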
Hyperparameter reestimation
An emerging trend in Bayesian modeling? [Asuncion et al. UAI'09]
○ Reestimate the hyperparameters of LDA
○ Overturns the common wisdom!
before: "VB < CVB < CGS"
after: "VB = CVB = CGS" (in perplexity)
[Masada et al. CIKM'09 (poster, to appear)]
Experiments
Datasets available on the Web
LK: Leukemia
○ http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=63
D1: "Five types of breast cancer"
D2: "Three types of bladder cancer"
D3: "Healthy tissues"
○ http://www.ihes.fr/~zinovyev/princmanif2006/
Data specifications

Dataset name (abbreviation)          # of samples   # of genes
Leukemia (LK)                        72             12582
Five types of breast cancer (D1)     286            17816
Three types of bladder cancer (D2)   40             3036
Healthy tissues (D3)                 103            10383
Results
1. Can we achieve inference of better quality?
2. Can we achieve better sample clustering?
3. Are there any qualitative differences
between MVB and MVB+?
[Figures: variational lower bound vs. # of iterations, for LK, D1, D2, and D3]
[Figures: variational lower bound (after convergence) vs. # of processes, for LK, D1, D2, and D3]
Sample clustering evaluation

dataset   method   precision       recall          F-score
LK        MVB+     0.934±0.007     0.931±0.010     0.932±0.009
LK        MVB      0.930±0.000     0.924±0.000     0.927±0.000
D2        MVB+     0.837±0.038     0.822±0.032     0.829±0.033
D2        MVB      0.779±0.084     0.751±0.069     0.763±0.071
(averaged over 100 trials)
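The slides do not spell out how precision and recall are defined for a clustering; one common choice, shown here purely as an assumption, counts pairs of samples placed in the same cluster:

```python
from itertools import combinations

def pairwise_prf(pred, true):
    """Pairwise precision/recall/F-score for a clustering (an assumed metric;
    the evaluation above may use a different definition).

    pred, true: cluster labels per sample.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_p = pred[i] == pred[j]   # predicted in the same cluster
        same_t = true[i] == true[j]   # truly in the same class
        tp += same_p and same_t
        fp += same_p and not same_t
        fn += same_t and not same_p
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```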
Qualitative difference (LK)
[Figure: MVB+ vs. MVB; row = gene, column = sample]
MVB+ can preserve the diversity of genes.
Conclusions
Formulas for hyperparameter reestimation
Improvement in inference quality
Larger variational lower bounds
Better sample clustering
Gene diversity preservation
Future work
Evaluate on more datasets to demonstrate effectiveness
Devise collapsed Gibbs sampling for LPD
Accelerate computation
○ OpenMP, Nvidia CUDA
Provide a method for gene clustering