1
Alex Lewin
Centre for Biostatistics
Imperial College, London
Joint work with Natalia Bochkina, Sylvia Richardson
BBSRC Exploiting Genomics grant
Mixture models for classifying differentially expressed genes
2
Modelling differential expression
• Many different methods/models for differential expression
  – t-test
  – t-test with stabilised variances (EB)
  – Bayesian hierarchical models
  – mixture models
• Choice whether to model alternative hypothesis or not
• Our model:
  – Model the alternative hypothesis
  – Fully Bayesian
3
Mixture model features
• Gene means and fold differences: linear model on the log scale
• Gene variances: borrow information across genes by assuming exchangeable variances
• Mixture prior on fold difference parameters
• Point mass prior for ‘null hypothesis’
4
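Fully Bayesian mixture model for differential expression
• 1st level
$y_{g1r} \mid \alpha_g, d_g, \sigma_{g1} \sim N(\alpha_g - \tfrac{1}{2} d_g, \sigma_{g1}^2)$,
$y_{g2r} \mid \alpha_g, d_g, \sigma_{g2} \sim N(\alpha_g + \tfrac{1}{2} d_g, \sigma_{g2}^2)$
• 2nd level
$\sigma_{gs}^2 \mid a_s, b_s \sim IG(a_s, b_s)$
$d_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \lambda_1) + \pi_2 G_+(1.5, \lambda_2)$
(H0: point mass $\delta_0$; explicit modelling of the alternative: $G_-$, $G_+$)
• 3rd level
Gamma hyperprior for $\lambda_1, \lambda_2, a_s, b_s$
Dirichlet distribution for $(\pi_0, \pi_1, \pi_2)$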
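A minimal sketch of drawing fold differences from the three-component mixture prior above: a point mass at zero, a Gamma reflected onto the negative axis, and a Gamma on the positive axis. It assumes the second Gamma argument is a rate; the weights, the rate values and the function name `sample_fold_differences` are illustrative choices, not taken from the slides.

```python
import numpy as np

def sample_fold_differences(n_genes, weights, rate_neg, rate_pos, shape=1.5, seed=0):
    """Draw d_g from w0*delta_0 + w1*G_-(shape, rate_neg) + w2*G_+(shape, rate_pos)."""
    rng = np.random.default_rng(seed)
    # Latent component label: 0 = point mass at zero, 1 = negative Gamma, 2 = positive Gamma
    z = rng.choice(3, size=n_genes, p=weights)
    d = np.zeros(n_genes)
    n_neg = int(np.sum(z == 1))
    n_pos = int(np.sum(z == 2))
    # NumPy's gamma takes (shape, scale); scale = 1 / rate (rate parametrisation is an assumption)
    d[z == 1] = -rng.gamma(shape, 1.0 / rate_neg, size=n_neg)
    d[z == 2] = rng.gamma(shape, 1.0 / rate_pos, size=n_pos)
    return z, d

# Illustrative values only
z, d = sample_fold_differences(2500, weights=[0.8, 0.1, 0.1], rate_neg=2.0, rate_pos=2.0)
```

Genes allocated to the point mass get a fold difference of exactly zero, which is what makes the null component identifiable from the two Gamma components.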
5
Decision Rules
• In the full Bayesian framework, introduce a latent allocation variable $z_g \in \{0, 1\}$: $z_g = 0$ if gene g is in the null component, $z_g = 1$ if it is in the alternative
• For each gene, calculate the posterior probability of belonging to the unmodified (null) component: $p_g = \Pr(z_g = 0 \mid \text{data})$
• Classify using a cut-off on $p_g$ (Bayes rule corresponds to 0.5)
• For any given cut-off on $p_g$, can estimate FDR, FNR.
For gene list S, est. (FDR | data) = $\sum_{g \in S} p_g / |S|$
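The decision rule can be sketched as follows, assuming the posterior probabilities p_g are already available (for example as MCMC averages of the allocation indicator). The FDR estimate is the formula from the slide; the FNR estimate shown here, the average of 1 - p_g over genes not in the list, is an assumed analogue that the slide does not spell out. The function name and the example probabilities are hypothetical.

```python
import numpy as np

def classify_and_estimate(p_null, cutoff=0.5):
    """Declare genes DE when p_g < cutoff and estimate FDR / FNR for that list."""
    p_null = np.asarray(p_null, dtype=float)
    selected = p_null < cutoff              # gene list S (declared DE)
    n_sel = int(selected.sum())
    n_rest = p_null.size - n_sel
    # Slide formula: est. FDR = sum_{g in S} p_g / |S|
    est_fdr = p_null[selected].sum() / n_sel if n_sel else 0.0
    # Assumed analogue for FNR: sum_{g not in S} (1 - p_g) / |not S|
    est_fnr = (1.0 - p_null[~selected]).sum() / n_rest if n_rest else 0.0
    return selected, est_fdr, est_fnr

# Made-up posterior probabilities for five genes
p = [0.01, 0.20, 0.40, 0.90, 0.99]
selected, est_fdr, est_fnr = classify_and_estimate(p)
```

Lowering the cut-off shortens the gene list and reduces the estimated FDR at the cost of a higher estimated FNR, so the cut-off can be tuned to a target error rate.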
6
Simulation Study
Explore performance of fully Bayesian mixture in different situations:
• Non-standard distribution of DE genes
• Small number of DE genes
• Small number of replicate arrays
• Asymmetric distributions of over- and under-expressed genes
50 simulated data sets for each of several different set-ups.
7
Simulation Study
2500 genes, 8 replicates in each experimental condition
$d_g \sim \pi_0 \delta_0 + \pi_1 ( \nu \, \mathrm{Unif}() + (1 - \nu) N() ) + \pi_2 ( \nu \, \mathrm{Unif}() + (1 - \nu) N() )$
$\sigma_{gs} \sim \mathrm{logNorm}(-1.8, 0.5)$ (logNorm based on data)
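A rough sketch of how one simulated data set of this size could be generated once the fold differences are fixed. Several choices are assumptions rather than slide content: the gene-level means are drawn from an arbitrary Normal(7, 1), the 0.5 in logNorm(-1.8, 0.5) is treated as the standard deviation of the log, and the function name `simulate_expression` is hypothetical.

```python
import numpy as np

def simulate_expression(d, n_reps=8, seed=0):
    """Simulate log-expression y[g, s, r] for two conditions s given fold differences d."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    n_genes = d.size
    # Assumed overall gene means (not specified on the slide)
    alpha = rng.normal(7.0, 1.0, size=n_genes)
    # One standard deviation per gene and condition; 0.5 taken as sd of log(sigma) (assumption)
    sigma = rng.lognormal(mean=-1.8, sigma=0.5, size=(n_genes, 2))
    # Condition 1 mean is alpha - d/2, condition 2 mean is alpha + d/2
    half = np.array([-0.5, 0.5])
    mean = alpha[:, None] + half[None, :] * d[:, None]
    y = rng.normal(mean[:, :, None], sigma[:, :, None], size=(n_genes, 2, n_reps))
    return y, sigma

# Illustrative fold differences: 250 DE genes with d_g = 1, the rest null
d_sim = np.zeros(2500)
d_sim[:250] = 1.0
y, sigma = simulate_expression(d_sim)
```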
8
Non-standard distributions of DE genes
[Figure: Gamma distributions superimposed]
π0 = 0.8
Uniform mixing weight = 0.3, 0.5, 0.8
Av. est. π0 = 0.805 ± 0.010
Av. est. π0 = 0.797 ± 0.010
Av. est. π0 = 0.781 ± 0.010
9
Small number of DE genes / Small number of replicate arrays
True π0   Replicates   Av. FDR   Av. FNR   Av. est. π0
0.95      8            7.0 %     2.0 %     0.947 ± 0.007
0.95      3            17.9 %    3.6 %     0.956 ± 0.009
0.99      8            9.2 %     0.6 %     0.990 ± 0.003
0.99      3            17.6 %    0.9 %     0.995 ± 0.007
10
Asymmetric distributions of over/under-expressed genes
True π0 = 0.9, True π1 = 0.09, True π2 = 0.01
Av. est. π0 = 0.897 ± 0.007, Av. est. π1 = 0.093 ± 0.003, Av. est. π2 = 0.011 ± 0.006
$d_g \sim \pi_0 \delta_0 + \pi_1 ( 0.6 \, \mathrm{Unif}(0.01, 1.7) + 0.4 \, N(1.7, 0.8) ) + \pi_2 ( 0.6 \, \mathrm{Unif}(-0.7, -0.01) + 0.4 \, N(-0.7, 0.8) )$
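As an illustration, fold differences for this asymmetric set-up could be drawn as below. The slide does not say whether the 0.8 in N(1.7, 0.8) is a variance or a standard deviation; the sketch assumes a variance. The function name `sample_asymmetric_d` is hypothetical.

```python
import numpy as np

def sample_asymmetric_d(n_genes, pi=(0.9, 0.09, 0.01), seed=0):
    """Draw d_g from the asymmetric Uniform/Normal mixture with true weights pi."""
    rng = np.random.default_rng(seed)
    # Component label: 0 = null, 1 = first (positive) mixture, 2 = second (negative) mixture
    z = rng.choice(3, size=n_genes, p=pi)
    # Within each non-null component: Uniform with weight 0.6, Normal with weight 0.4
    use_unif = rng.random(n_genes) < 0.6
    sd = np.sqrt(0.8)  # assumes 0.8 is a variance
    pos = np.where(use_unif,
                   rng.uniform(0.01, 1.7, size=n_genes),
                   rng.normal(1.7, sd, size=n_genes))
    neg = np.where(use_unif,
                   rng.uniform(-0.7, -0.01, size=n_genes),
                   rng.normal(-0.7, sd, size=n_genes))
    d = np.zeros(n_genes)
    d = np.where(z == 1, pos, d)
    d = np.where(z == 2, neg, d)
    return z, d

z_asym, d_asym = sample_asymmetric_d(100_000)
```

Because the Normal parts are centred at the edge of the Uniform ranges, some non-null genes receive fold differences close to zero or of the opposite sign, which is part of what makes this set-up hard to classify.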
11
Additional Checks
1) FDR / FNR can be estimated well
[Figure: True FDR vs. Est. FDR; True FNR vs. Est. FNR]
2) Model works when there are no DE genes
50 simulations of same set-up: Av. est. π0 = 0.999. No genes are declared to be DE.
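In a simulation the true status of every gene is known, so the true FDR and FNR of a gene list can be computed directly and set against the estimates. A small sketch under the assumption that FNR is defined as the fraction of non-selected genes that are truly DE; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def true_error_rates(selected, is_de):
    """True FDR and FNR of a gene list when the simulation truth is known."""
    selected = np.asarray(selected, dtype=bool)
    is_de = np.asarray(is_de, dtype=bool)
    n_sel = int(selected.sum())
    n_rest = int((~selected).sum())
    # FDR: fraction of selected genes that are truly null
    fdr = float((selected & ~is_de).sum()) / n_sel if n_sel else 0.0
    # FNR (assumed definition): fraction of non-selected genes that are truly DE
    fnr = float((~selected & is_de).sum()) / n_rest if n_rest else 0.0
    return fdr, fnr

# Toy example: two genes selected, one of them truly DE; one DE gene missed
true_fdr, true_fnr = true_error_rates(
    selected=[True, True, False, False, False],
    is_de=[True, False, True, False, False],
)
```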
12
Comparison with conjugate mixture prior
Replace
$d_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \lambda_1) + \pi_2 G_+(1.5, \lambda_2)$
with
$d_g \sim \pi_0 \delta_0 + \pi_1 N(0, c \sigma_g^2)$
NB: We estimate both c and $\pi_0$ in a fully Bayesian way.
True π0   Est. π0 with Gamma prior   Est. π0 with conjugate prior
0.8       0.781 ± 0.010              0.796 ± 0.010
0.95      0.947 ± 0.007              0.955 ± 0.006
0.99      0.990 ± 0.003              0.991 ± 0.003
1         0.999 ± 0.001              0.999 ± 0.001
13
Application to Mouse data
Mouse wildtype (WT) and knock-out (KO) data (Affymetrix)
~22,700 genes, 8 replicates each for WT and KO
Gamma prior: Est. π0 = 0.996 ± 0.001. Declares 59 genes DE.
14
Summary
• Good performance of fully Bayesian mixture model
  – can estimate proportion of DE genes in a variety of situations
  – accurate estimation of FDR / FNR
• Different mixture priors give similar classification results
• Gives reasonable results for real data