Bayesian Multi-topic Microarray Analysis with Hyperparameter Reestimation
Overview
Problem
Latent Process Decomposition (LPD)
Hyperparameter reestimation (MVB+)
Experiment
Results
Conclusions
Problem
Explain differences among cells of different nature (e.g., cancer vs. normal cells) by analyzing differences in gene expression obtained from DNA microarray experiments.
Gene expression
http://bix.ucsd.edu/bioalgorithms/slides.php
DNA microarray experiment
We can find out which genes are used (expressed) by different types of cells.
Latent Process Decomposition
latent Dirichlet allocation (LDA) [Blei et al. 01] vs. latent process decomposition (LPD) [Rogers et al. 05]

text mining          microarray analysis
document             sample
word                 gene
word frequency       gene expression level
latent topic         latent process
LPD as a multi-topic model
row = gene, column = sample, color = process
LPD as a generative model
For each sample d, draw a multinomial θd from a Dirichlet prior Dir(α).
θd: mixing proportions of the processes for sample d.
For each gene g in each sample d:
○ Draw a process k from Mult(θd).
○ Draw an expression level from the Gaussian N(μgk, λgk).
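The generative process above can be sketched in plain Python. The priors on μgk and λgk and all numeric values here are illustrative assumptions, not the paper's settings, and λgk is read as a precision (consistent with the Gamma prior used later):

```python
import random
import math

def generate_lpd(D, G, K, alpha=1.0, seed=0):
    """Sketch of the LPD generative process (illustrative priors, not the
    paper's settings). D samples, G genes, K latent processes."""
    rnd = random.Random(seed)
    # Per-gene, per-process Gaussian parameters: mean mu_gk, precision lambda_gk.
    mu = [[rnd.gauss(0.0, 1.0) for _ in range(K)] for _ in range(G)]
    lam = [[rnd.gammavariate(2.0, 1.0) for _ in range(K)] for _ in range(G)]
    x = [[0.0] * D for _ in range(G)]
    z = [[0] * D for _ in range(G)]
    for d in range(D):
        # theta_d ~ Dir(alpha): normalized Gamma draws give a Dirichlet sample.
        w = [rnd.gammavariate(alpha, 1.0) for _ in range(K)]
        theta_d = [v / sum(w) for v in w]
        for g in range(G):
            k = rnd.choices(range(K), weights=theta_d)[0]  # process for (g, d)
            z[g][d] = k
            x[g][d] = rnd.gauss(mu[g][k], 1.0 / math.sqrt(lam[g][k]))
    return x, z
```

Note that each gene in each sample draws its own process, which is what makes LPD a multi-topic model rather than a plain mixture over samples.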
Inference by VB [Rogers et al. 05]
Variational Bayesian inference
VB is used when exact EM is intractable.
Instead of the log likelihood, a variational lower bound is maximized.
Variational lower bound

$$\log p(x \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) \;\ge\; \sum_{z} \iiint q(z, \theta, \mu, \lambda)\, \log \frac{p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)}{q(z, \theta, \mu, \lambda)} \, d\theta\, d\mu\, d\lambda$$

where the joint distribution factorizes as

$$p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) = \prod_{d} \mathrm{Dir}(\theta_d \mid \alpha_0) \prod_{g,k} \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda_{gk}^{a_0 - 1} e^{-b_0 \lambda_{gk}} \sqrt{\frac{\lambda_0}{2\pi}} \exp\!\left(-\frac{\lambda_0 (\mu_{gk} - \mu_0)^2}{2}\right) \prod_{d,g} \theta_{d, z_{dg}} \sqrt{\frac{\lambda_{g z_{dg}}}{2\pi}} \exp\!\left(-\frac{\lambda_{g z_{dg}} (x_{gd} - \mu_{g z_{dg}})^2}{2}\right)$$
Inference by MVB [Ying et al. 08]
Marginalized variational Bayesian inference
Marginalizes the multinomial parameters θ
Requires a weaker approximation than VB (a tighter bound)
cf. collapsed variational Bayesian inference for LDA [Teh et al. 06]
Marginalization in MVB

The multinomial parameters θ are integrated out analytically:

$$p(x, z, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) = \int p(x, z, \theta, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)\, d\theta = \prod_{d} \frac{\Gamma(K\alpha_0)}{\Gamma(G + K\alpha_0)} \prod_{k} \frac{\Gamma(n_{dk} + \alpha_0)}{\Gamma(\alpha_0)} \times \prod_{g,k} \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda_{gk}^{a_0 - 1} e^{-b_0 \lambda_{gk}} \sqrt{\frac{\lambda_0}{2\pi}} \exp\!\left(-\frac{\lambda_0 (\mu_{gk} - \mu_0)^2}{2}\right) \times \prod_{d,g} \sqrt{\frac{\lambda_{g z_{dg}}}{2\pi}} \exp\!\left(-\frac{\lambda_{g z_{dg}} (x_{gd} - \mu_{g z_{dg}})^2}{2}\right)$$

where $n_{dk} = \#\{g : z_{dg} = k\}$, so that $\sum_k n_{dk} = G$. The bound becomes

$$\log p(x \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0) \;\ge\; \sum_{z} \iint q(z, \mu, \lambda)\, \log \frac{p(x, z, \mu, \lambda \mid \alpha_0, \mu_0, \lambda_0, a_0, b_0)}{q(z, \mu, \lambda)} \, d\mu\, d\lambda$$
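The per-sample Dirichlet integral above has the closed form of a Dirichlet-multinomial over the process counts. A minimal sketch (the function name is ours):

```python
from math import lgamma

def log_dirichlet_multinomial(n_dk, alpha0):
    """log of the integral over theta_d of prod_k theta_dk^{n_dk} * Dir(theta_d | alpha0)
    for one sample d, under a symmetric Dirichlet prior.

    n_dk: list of process counts for sample d; alpha0: Dirichlet hyperparameter.
    """
    K = len(n_dk)
    n_d = sum(n_dk)  # equals G when every gene has an assignment
    out = lgamma(K * alpha0) - lgamma(n_d + K * alpha0)
    for n in n_dk:
        out += lgamma(n + alpha0) - lgamma(alpha0)
    return out
```

For example, with K = 2, α0 = 1, and counts (1, 0), the integral equals 1/2, the uniform-prior probability of the single assignment.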
Update formulas in MVB

With variational posteriors $q(\lambda_{gk}) = \mathrm{Gamma}(a_{gk}, b_{gk})$, $q(\mu_{gk}) = N(m_{gk}, v_{gk})$ and responsibilities $\gamma_{dgk} = q(z_{dg} = k)$:

$$a_{gk} = a_0 + \tfrac{1}{2} \sum_d \gamma_{dgk}$$
$$b_{gk} = b_0 + \tfrac{1}{2} \sum_d \gamma_{dgk} \left[ (x_{gd} - m_{gk})^2 + v_{gk} \right]$$
$$m_{gk} = \frac{\lambda_0 \mu_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk} x_{gd}}{\lambda_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk}}, \qquad v_{gk} = \frac{1}{\lambda_0 + (a_{gk}/b_{gk}) \sum_d \gamma_{dgk}}$$
$$\gamma_{dgk} \propto (n_{dk}^{\neg dg} + \alpha_0) \exp\!\left( \tfrac{1}{2}\big(\psi(a_{gk}) - \log b_{gk} - \log 2\pi\big) - \frac{a_{gk}}{2 b_{gk}} \left[ (x_{gd} - m_{gk})^2 + v_{gk} \right] \right)$$

where $n_{dk}^{\neg dg}$ is the process count for sample d excluding the current gene.
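A simplified one-pass version of these Gaussian-Gamma updates can be sketched as follows; prior shrinkage on the means and the variance term vgk are dropped for brevity, so this is an assumption-laden sketch rather than the paper's exact formulas:

```python
def gaussian_gamma_updates(x, r, a0=1.0, b0=1.0):
    """Simplified one-pass Gaussian-Gamma variational updates (prior shrinkage
    on m_gk and the v_gk term are omitted -- a sketch, not the exact MVB math).

    x[g][d]: expression level of gene g in sample d.
    r[g][d][k]: responsibility q(z_dg = k).
    Returns dicts m, a, b keyed by (g, k).
    """
    G, D, K = len(x), len(x[0]), len(r[0][0])
    m, a, b = {}, {}, {}
    for g in range(G):
        for k in range(K):
            N = sum(r[g][d][k] for d in range(D))                  # soft count
            Sx = sum(r[g][d][k] * x[g][d] for d in range(D))
            m[g, k] = Sx / N if N > 0 else 0.0                     # posterior mean
            a[g, k] = a0 + 0.5 * N                                 # Gamma shape
            b[g, k] = b0 + 0.5 * sum(r[g][d][k] * (x[g][d] - m[g, k]) ** 2
                                     for d in range(D))            # Gamma rate
    return m, a, b
```

In a full implementation these updates alternate with the responsibility update until the lower bound converges.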
Our proposal: MVB+
MVB with hyperparameter reestimation
Empirical Bayes method
○ Estimate the hyperparameters by maximizing the variational lower bound
Hand-tuned hyperparameter values often result in poor-quality inference.
Update formulas in MVB+

$$\mu_0 = \frac{1}{GK} \sum_{g,k} m_{gk}$$
$$b_0 = \frac{GK\, a_0}{\sum_{g,k} a_{gk} / b_{gk}}$$
$$\psi(a_0) = \log b_0 + \frac{1}{GK} \sum_{g,k} \big( \psi(a_{gk}) - \log b_{gk} \big)$$

Inversion of the digamma function is required to solve the last equation for $a_0$.
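Solving ψ(a0) = c has no closed form; one standard approach (not necessarily the authors') is Newton's method, which needs the digamma and trigamma functions. A stdlib-only sketch:

```python
from math import exp, log

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def trigamma(x):
    """psi'(x), needed for the Newton step."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1.0 / 6 - f * (1.0 / 30 - f / 42))

def inv_digamma(y, iters=32):
    """Solve psi(a) = y for a > 0 by Newton's method (Minka-style initialization)."""
    a = exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(iters):
        a -= (digamma(a) - y) / trigamma(a)
    return a
```

Since ψ is strictly increasing on (0, ∞), Newton's method with this initialization converges in a handful of iterations.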
Hyperparameter reestimation
An emerging trend in Bayesian modeling? [Asuncion et al. UAI'09]
○ Reestimate the hyperparameters of LDA
○ Overturns the common wisdom!
before: "VB < CVB < CGS"
after: "VB = CVB = CGS" (in perplexity)
[Masada et al. CIKM'09 (poster, to appear)]
Experiments
Datasets available on the Web
LK: Leukemia
○ http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=63
D1: "Five types of breast cancer"
D2: "Three types of bladder cancer"
D3: "Healthy tissues"
○ http://www.ihes.fr/~zinovyev/princmanif2006/
Data specifications

Dataset name (abbreviation)          # of samples   # of genes
Leukemia (LK)                        72             12582
Five types of breast cancer (D1)     286            17816
Three types of bladder cancer (D2)   40             3036
Healthy tissues (D3)                 103            10383
Results
1. Can we achieve inference of better quality?
2. Can we achieve better sample clustering?
3. Are there any qualitative differences
between MVB and MVB+?
[Figures: variational lower bound vs. # of iterations, for LK, D1, D2, and D3]
[Figures: variational lower bound (after convergence) vs. # of processes, for LK, D1, D2, and D3]
Sample clustering evaluation

dataset   method   precision       recall          F-score
LK        MVB+     0.934±0.007     0.931±0.010     0.932±0.009
LK        MVB      0.930±0.000     0.924±0.000     0.927±0.000
D2        MVB+     0.837±0.038     0.822±0.032     0.829±0.033
D2        MVB      0.779±0.084     0.751±0.069     0.763±0.071
(averaged over 100 trials)
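The slides do not spell out how precision and recall are defined for a clustering; one common choice, shown here purely as an assumption, counts pairs of samples placed in the same cluster:

```python
from itertools import combinations

def pairwise_prf(pred, true):
    """Pairwise precision/recall/F-score for a clustering (an assumed metric;
    the evaluation above may use a different definition).

    pred, true: cluster labels per sample.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_p = pred[i] == pred[j]   # predicted in the same cluster
        same_t = true[i] == true[j]   # truly in the same class
        tp += same_p and same_t
        fp += same_p and not same_t
        fn += same_t and not same_p
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```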
Qualitative difference (LK)
[Figure: MVB+ vs. MVB; row = gene, column = sample]
MVB+ can preserve the diversity of genes.
Conclusions
Formulas for hyperparameter reestimation
Improvement in inference quality
Larger variational lower bounds
Better sample clustering
Gene diversity preservation
Future work
Evaluate on more datasets to demonstrate effectiveness
Devise collapsed Gibbs sampling for LPD
Accelerate computation
○ OpenMP, Nvidia CUDA
Provide a method for gene clustering