Differential Expressions Bayesian Techniques Lecture Topic 8


Page 1: Differential Expressions Bayesian Techniques Lecture Topic 8

Differential Expressions
Bayesian Techniques

Lecture Topic 8

Page 2: Differential Expressions Bayesian Techniques Lecture Topic 8

Why Bayes?

A friend of mine who is Bayesian said the following when asked this question:

• Some problems very hard to solve by classical techniques
  – e.g. the Behrens-Fisher problem
• Every new problem requires a new solution
• Bayes provides a coherent path

Page 3: Differential Expressions Bayesian Techniques Lecture Topic 8

The Frequentist Paradigm

• Probability refers to a limiting relative frequency. Probabilities are OBJECTIVE properties of the real world.

• Parameters are fixed, unknown constants; NO probability statement is possible about a parameter.

• Statistical procedures should be designed to have well-defined LONG-RUN frequency properties. For example, a 95% confidence interval should trap the true value of the parameter with a limiting frequency of 95%.

Page 4: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayesian Philosophy

• Probability describes a DEGREE OF BELIEF, not a relative frequency. As such, you can make probability statements about anything, not just data.

• We CAN make probability statements about parameters, even if they are fixed constants.

• We make inferences about a parameter by producing its probability distribution. Inferences such as point or interval estimates may be extracted from the probability distribution of the parameter.

Page 5: Differential Expressions Bayesian Techniques Lecture Topic 8

The Contrasts

• According to Larry Wasserman: “Bayesian inference is a controversial approach as it embraces a subjective notion of probability”.

• In general, Bayesian methods have NO guarantees of long-run performance.

Page 6: Differential Expressions Bayesian Techniques Lecture Topic 8

Advantages of Bayesian Methods

• Provide the ability to formally incorporate prior information
• Inference is conditional on the actual data (not on what might have been)
• More easily interpretable by non-specialists (e.g. compared with confidence intervals)
• All analyses follow directly from the posterior distribution
• The stopping rule does not affect inference
• Any question can be directly answered, e.g. bioequivalence:

  – H0: θ0 ≠ θ1
  – H1: θ0 = θ1

  ■ Reverses the role of the null and the alternative
  ■ Hard to handle with traditional testing methods; easy in Bayes

Page 7: Differential Expressions Bayesian Techniques Lecture Topic 8

Disadvantages

• Initial Bayesians were subjectivist
• Results not “objective,” could be manipulated to yield any desired result
• How to set the prior in general?
• Computationally difficult
• Need to evaluate complex integrals even for simple problems
• Need inexpensive high-speed computing

Page 8: Differential Expressions Bayesian Techniques Lecture Topic 8

How Bayesian Method Works

• Choose a probability density f(θ) – called the PRIOR distribution – that expresses our beliefs about a parameter BEFORE we see any data.

• We choose a statistical model f(x | θ) that reflects our beliefs about x given θ. Here we write it as f(x | θ), NOT f(x; θ) as in the frequentist world.

• After OBSERVING the data X1, …, Xn, we update our beliefs about the parameter and calculate the posterior distribution f(θ | x).

• The update essentially uses Bayes’ theorem to calculate the posterior distribution.

Page 9: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayes Theorem: Discrete Version
A Simple Probability Result

• Let B1, B2, …, Bn be disjoint sets with P(Bk) > 0 for all k,

• and P(B1 ∪ B2 ∪ … ∪ Bn) = 1 (mutually exclusive and exhaustive).

• For any event A:
  P(Bj | A) = P(Bj) P(A | Bj) / Σk P(Bk) P(A | Bk)

Page 10: Differential Expressions Bayesian Techniques Lecture Topic 8

EXAMPLE:
• Disease incidence in population: P(D) = 0.001

• Diagnostic test:
  – false positive rate 0.05, P(+ | not D) = 0.05
  – false negative rate 0.01, P(− | D) = 0.01

• If a person drawn at random tests positive, what is the probability that he has the disease D?

$$P(D \mid +) = \frac{P(D)\,P(+ \mid D)}{P(D)\,P(+ \mid D) + P(D^{C})\,P(+ \mid D^{C})} = \frac{(.001)(.99)}{(.001)(.99) + (.999)(.05)} \approx .0194$$
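A minimal sketch of this calculation in R (the rates are those given above):

p_D <- 0.001                    # P(D): disease incidence
p_pos_given_D    <- 0.99        # sensitivity = 1 - false negative rate
p_pos_given_notD <- 0.05        # false positive rate

# Bayes' theorem: P(D | +) = P(D)P(+|D) / [P(D)P(+|D) + P(not D)P(+|not D)]
p_D_given_pos <- (p_D * p_pos_given_D) /
  (p_D * p_pos_given_D + (1 - p_D) * p_pos_given_notD)
p_D_given_pos                   # approximately 0.0194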

Page 11: Differential Expressions Bayesian Techniques Lecture Topic 8

Comment

• Hence, the probability that you HAVE the disease given that you have TESTED positive is still pretty LOW, even with very small FALSE POSITIVE and FALSE NEGATIVE rates.

• This rule is very useful in numerous other situations.

Page 12: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayes Theorem: The Continuous Version

• Let f(θ) be our prior distribution (density) for our parameter θ.

• Suppose we have the data X1, …, Xn, with density f(X1, …, Xn | θ), also written as Ln(X, θ).

$$f(\theta \mid x_1,\ldots,x_n) = \frac{f(x_1,\ldots,x_n \mid \theta)\, f(\theta)}{\int f(x_1,\ldots,x_n \mid \theta)\, f(\theta)\, d\theta} = \frac{L_n(x, \theta)\, f(\theta)}{\int L_n(x, \theta)\, f(\theta)\, d\theta}$$

Page 13: Differential Expressions Bayesian Techniques Lecture Topic 8

Some Simplifications

• The denominator is sometimes very hard to deal with, since the integration over the parameters is not trivial.

• We call it the normalizing constant, and in most cases we do not evaluate it explicitly. Instead we use the fact that:

$$f(\theta \mid x_1,\ldots,x_n) \propto L_n(x_1,\ldots,x_n \mid \theta)\, f(\theta)$$
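As a concrete illustration of working with the unnormalized posterior, here is a minimal R sketch that evaluates likelihood × prior on a grid and normalizes numerically; the normal data, known σ = 1, and Cauchy prior are illustrative assumptions, not part of the lecture:

set.seed(1)
x     <- rnorm(20, mean = 2, sd = 1)          # made-up data, sigma = 1 assumed known
theta <- seq(-5, 10, length.out = 2000)       # grid of parameter values

prior <- dcauchy(theta, location = 0, scale = 1)  # a non-conjugate prior, for illustration
lik   <- sapply(theta, function(t) prod(dnorm(x, mean = t, sd = 1)))

post_unnorm <- lik * prior                          # L_n(x, theta) * f(theta)
spacing     <- diff(theta)[1]
post <- post_unnorm / sum(post_unnorm * spacing)    # numerical normalizing constant

post_mean <- sum(theta * post * spacing)            # posterior summary from the grid
post_mean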

Page 14: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayes’ Idea

• Think of a model for data y1, . . . , yn

f(y1, . . . , yn|θ) e.g. Normal, Binomial, etc.

• θ random with prior density g(.)

• Bayes Rule says that:

p(θ | y1, . . . , yn) ∝ g(θ) f(y1, . . . , yn | θ)

• Hence, the posterior is proportional to the prior multiplied by the probability of the data given the parameter.

Page 15: Differential Expressions Bayesian Techniques Lecture Topic 8

Hypothesis Testing: Classical vs. Bayesian

Classical: Set up null and alternative hypotheses, perform a test, calculate a p-value, reject or fail to reject the null.

Bayesian: Inference is based on the posterior distribution, p(θ | y1, . . . , yn).

• Consider evidence in favor of certain parameter values

• Data as well as prior beliefs influence inference

Page 16: Differential Expressions Bayesian Techniques Lecture Topic 8

Major Challenge 1: Setting Priors

Approaches
• Subjective - based on beliefs of an individual, expert, etc. Issues:
  – how to do in practice?
  – people are inconsistent
  – elicitation can help

• Non-informative - based on “prior ignorance” about the parameter. Issues:
  – often hard to define
  – may lead to improper posteriors
  – sensitive to parameterization

Page 17: Differential Expressions Bayesian Techniques Lecture Topic 8

Setting Priors: Conjugate Priors

• Conjugate priors are priors such that, combined with the model, the posterior will have a KNOWN distribution.
• Issues:
  – a choice of convenience
  – avoids computational problems
  – exists only for limited families
• Example: y ~ Bin(n, θ), θ ~ Beta(α, β); then p(θ | y) ~ Beta(α + y, β + n − y). A small sketch of this update follows below.
• The Normal conjugate is Normal for the location parameter.
• The Poisson conjugate is the Gamma.
• The Inverse Gamma is often used as a prior for the Normal σ².
• Generally, all members of the exponential families have conjugate priors.
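A minimal R sketch of the Beta–Binomial update above; the counts y = 7 out of n = 20 and the Beta(2, 2) prior are made-up numbers for illustration:

alpha <- 2; beta <- 2          # prior Beta(alpha, beta), an illustrative choice
n <- 20; y <- 7                # observed: y successes in n trials (made-up data)

# Posterior is Beta(alpha + y, beta + n - y) -- no integration needed
post_alpha <- alpha + y
post_beta  <- beta + n - y

post_mean <- post_alpha / (post_alpha + post_beta)
cred_int  <- qbeta(c(0.025, 0.975), post_alpha, post_beta)   # 95% credible interval
post_mean; cred_int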

Page 18: Differential Expressions Bayesian Techniques Lecture Topic 8

Setting Priors: Non-informative

• Assuming we have no REAL information about the parameter, we can model it with a “non-informative” prior.

• For example, if θ is discrete with n possible values, we can take
  – P(θi) = 1/n for i = 1, …, n

• If we know an interval (a, b) in which θ lies, we can define
  – the prior as P(θ) = 1/(b − a) for a < θ < b.

• We can also define
  – P(θ) = c, c > 0 (an improper prior, since it is not a pdf).

Page 19: Differential Expressions Bayesian Techniques Lecture Topic 8

Setting Priors: Jeffreys’ Prior

• Uniform non-informative priors are criticized because they are not invariant under transformation of the parameter.

• Jeffreys’ prior, which IS invariant under transformation, is often used:

• P(θ) ∝ [I(θ)]^{1/2}, where I is the Fisher information:

$$I(\theta) = -\,E_{X \mid \theta}\!\left[\frac{\partial^{2}}{\partial \theta^{2}} \log f(X \mid \theta)\right]$$
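As a standard worked example (not from the slides): for a single Bernoulli observation X with success probability θ,

$$I(\theta) = -\,E_{X \mid \theta}\!\left[\frac{\partial^{2}}{\partial \theta^{2}}\,\log\!\left(\theta^{X}(1-\theta)^{1-X}\right)\right] = \frac{1}{\theta(1-\theta)}, \qquad P(\theta) \propto [I(\theta)]^{1/2} = \theta^{-1/2}(1-\theta)^{-1/2},$$

which is the Beta(1/2, 1/2) density: a proper prior that is invariant under reparameterization.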

Page 20: Differential Expressions Bayesian Techniques Lecture Topic 8

Major Challenge II: Computation

• Need to evaluate complicated high-dimensional integrals
• Lots of technology developed in the last 20-25 years

Approaches
• Earliest solutions: approximations and numerical integration
• Noniterative Monte Carlo: direct sampling, indirect sampling (importance, rejection)
• Markov Chain Monte Carlo (MCMC): Gibbs sampling, Metropolis-Hastings algorithm, hybrid methods . . .

• MCMC is the most popular and can be implemented in high-dimensional situations. A minimal Metropolis sketch follows below.
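A minimal random-walk Metropolis sketch in R, targeting an unnormalized posterior; the normal likelihood, Cauchy prior, and tuning constants are illustrative assumptions, not part of the lecture:

set.seed(1)
x <- rnorm(30, mean = 1.5, sd = 1)                 # made-up data, sigma = 1 known

log_post <- function(theta)                        # log of likelihood * prior
  sum(dnorm(x, theta, 1, log = TRUE)) + dcauchy(theta, 0, 1, log = TRUE)

n_iter <- 5000
draws  <- numeric(n_iter)
theta  <- 0                                        # starting value
for (i in seq_len(n_iter)) {
  prop <- theta + rnorm(1, 0, 0.5)                 # symmetric random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(theta))
    theta <- prop                                  # accept; otherwise keep current value
  draws[i] <- theta
}
keep <- draws[-(1:1000)]                           # discard burn-in
mean(keep); quantile(keep, c(0.025, 0.975))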

Page 21: Differential Expressions Bayesian Techniques Lecture Topic 8

Simple Example

Page 22: Differential Expressions Bayesian Techniques Lecture Topic 8

Simple Example contd…

• Posterior mean is a weighted average of the prior mean and the data mean
  ■ Sample average is shrunk toward the prior mean
  ■ Weight depends on relative variability of prior and data

• Posterior precision is the sum of the prior precision and the data precision

• Samples from the posterior are easy to get given data, σ², μ, τ² (a sketch follows below)
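The example here is the usual Normal–Normal model: data y_i ~ N(θ, σ²) with σ² known and prior θ ~ N(μ, τ²). A minimal R sketch of the precision-weighted posterior, with illustrative numbers:

set.seed(1)
sigma <- 2; mu <- 0; tau <- 1                  # assumed known quantities
y <- rnorm(10, mean = 3, sd = sigma)           # made-up data
n <- length(y)

prior_prec <- 1 / tau^2
data_prec  <- n / sigma^2
post_prec  <- prior_prec + data_prec           # posterior precision = sum of precisions
post_mean  <- (prior_prec * mu + data_prec * mean(y)) / post_prec   # weighted average
post_sd    <- sqrt(1 / post_prec)

post_mean; post_sd
theta_draws <- rnorm(1000, post_mean, post_sd) # samples from the posterior are easy to get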

Page 23: Differential Expressions Bayesian Techniques Lecture Topic 8

Lessons from Example

General principle: posterior is compromise between prior and data

• When μ and τ² are not known:

■ Empirical Bayes: estimate μ and τ²

■ Hierarchical Bayes: put prior on μ and τ² as well

Page 24: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayesian Hypothesis Testing

• The idea is due to Jeffreys (1961).

• Idea: Based on the data that each hypothesis is supposed to predict, one applies Bayes’ Theorem and computes the posterior probability that the first hypothesis is correct.

• UNLIKE classical methods, the hypotheses DO NOT have to be nested within each other.

Page 25: Differential Expressions Bayesian Techniques Lecture Topic 8

Mechanics of Bayesian Hypothesis Testing

• Let us consider two hypotheses H0 and H1 (Bayesians prefer the word “models” to “hypotheses”, but we will keep “hypotheses” to be consistent with the classical ideas).

• Let H0 and H1 be two hypotheses concerning the data Y, and let θ0 and θ1 be the associated parameters.

• We define πi(θi) as the corresponding priors.
• Let fi(y | θi) be the corresponding marginal distributions.
• We can use Bayes’ Theorem to calculate the posteriors P(θi | y).
• Bayes’ hypothesis testing consists of computing the following and using pre-specified cut-offs for decisions (a small numerical sketch follows below):
  – B = [P(θ0 | y)/P(θ1 | y)] / [P(θ0)/P(θ1)] (Bayes’ factor)
  – P(θ0 | Y = y), P(θ0 | Y ≥ y) (Bayesian p-values)
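A minimal numerical sketch of a Bayes factor in R for a simple pair of hypotheses, H0: θ = 0 versus H1: θ ~ N(0, 1), with normal data and known σ = 1; the data and the prior on θ under H1 are illustrative assumptions:

set.seed(1)
y <- rnorm(15, mean = 0.4, sd = 1)               # made-up data, sigma = 1 known

# Marginal likelihood under H0 (theta fixed at 0)
m0 <- prod(dnorm(y, 0, 1))

# Marginal likelihood under H1: integrate the likelihood against the prior N(0, 1)
m1 <- integrate(function(theta)
        sapply(theta, function(t) prod(dnorm(y, t, 1))) * dnorm(theta, 0, 1),
        lower = -Inf, upper = Inf)$value

B01 <- m0 / m1                                   # Bayes factor in favour of H0
B01
m0 / (m0 + m1)                                   # P(H0 | y) with equal prior weights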

Page 26: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayesian Hypothesis Tests in Microarrays

• Let

Hg1: gene is differentially expressed

Hg0: gene is not differentially expressed

• Traditional Bayesians would write this as

$$v_g = \begin{cases} 1 & \text{if the gene is differentially expressed} \\ 0 & \text{otherwise} \end{cases}$$

Page 27: Differential Expressions Bayesian Techniques Lecture Topic 8

Method 1

• Differential Expression Score: use a t-statistic or Wilcoxon rank-sum statistic, z_g

• Then calculate P(H0 | z_g = z) or P(H0 | z_g ≥ z), i.e.

• P(v_g = 0 | z_g = z) or P(v_g = 0 | z_g ≥ z)

• McClure and Wit (2004) show that the second term is identical to using the FDR method for controlling error.

Page 28: Differential Expressions Bayesian Techniques Lecture Topic 8

Fully Bayesian Analysis

• In general we are interested in the term given below, where p0 is the fraction of inactive genes on the array, F0 is the distribution of the test statistic under the null hypothesis (v = 0), and F is the marginal distribution of the test statistic:

$$\hat{P}(v_g = 0 \mid z_g \geq z) = \frac{\hat{p}_0\left(1 - F_0(z)\right)}{1 - \hat{F}(z)}$$
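A minimal R sketch of this estimate; it assumes a N(0, 1) null distribution for F0, plugs in the empirical CDF for F̂, and uses a made-up value p̂0 = 0.9 and simulated statistics purely for illustration:

set.seed(1)
# Made-up test statistics: 90% null N(0,1) genes, 10% differentially expressed
z <- c(rnorm(900, 0, 1), rnorm(100, 0, 3))

p0_hat <- 0.9                      # assumed/estimated fraction of inactive genes
F_hat  <- ecdf(z)                  # empirical distribution of all statistics
F0     <- function(z0) pnorm(z0)   # null distribution of the statistic (assumed N(0,1))

# Estimated posterior probability that a gene with z_g >= z0 is NOT differentially expressed
post_null <- function(z0) p0_hat * (1 - F0(z0)) / (1 - F_hat(z0))

post_null(2); post_null(3)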

Page 29: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayesian t test

• The t statistic is given by

$$t_g = \frac{\bar{x}_{g1} - \bar{x}_{g2}}{se}$$

• Assume: z_g | {v_g = 0} ~ N(0, σ0²)

• z_g | {v_g = 1} ~ N(0, σ1²)

• Hence, z_g ~ (1 − p1) N(0, σ0²) + p1 N(0, σ1²)

Page 30: Differential Expressions Bayesian Techniques Lecture Topic 8

Bayesian t test: Priors

• p1 ~ Uniform(0, 1)
• v_g ~ Bernoulli(p1)
• 1/σ0² ~ Gamma(a, b), 1/σ1² ~ Gamma(γ, δ)
• b ~ Gamma(λ1, τ1), δ ~ Gamma(λ2, τ2)
• θ = (v, p1, σ0², a, b, σ1², γ, δ, λ1, τ1, λ2, τ2)

These are all conjugate priors to make the calculations easier.

One uses the Gibbs sampler to simulate from P(θ | z) to estimate p1, σ0², σ1² and to calculate the required probability.

Page 31: Differential Expressions Bayesian Techniques Lecture Topic 8

Gibbs Sampler

• It is used to calculate the posterior mean.
• It does not calculate P(θ | y) explicitly; it simulates draws from this distribution. Using sample summaries we get a good idea of the joint posterior as well as the marginal distribution of interest, P(v | y).

• It repeatedly samples from the full conditional distributions P(θi | θ−i, y) until the chain converges to its stationary distribution; the initial iterations before convergence are called the “burn-in” and are discarded.

• After burn-in, each draw of θ is a draw from the posterior distribution.
• Bayes’ Theorem states that the conditional distribution P(θi | θ−i, y) is proportional to the likelihood times the prior, P(y | θ)P(θ), viewed as a function of θi.

• If the full conditional distributions are of known form (generally achieved using conjugate priors), this procedure can be applied easily. A minimal sketch follows below.
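A minimal Gibbs sampler sketch in R for a toy Normal model with unknown mean and precision, using conjugate full conditionals; the model, priors, and data are illustrative assumptions rather than the microarray model of the previous slides:

set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)               # made-up data
n <- length(y)

# Priors: mu ~ N(mu0, tau0^2), precision lambda = 1/sigma^2 ~ Gamma(a, b)
mu0 <- 0; tau0 <- 10; a <- 0.1; b <- 0.1

n_iter <- 5000
mu <- 0; lambda <- 1
out <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "lambda")))
for (i in seq_len(n_iter)) {
  # Full conditional of mu given lambda and data: Normal
  prec <- 1 / tau0^2 + n * lambda
  m    <- (mu0 / tau0^2 + lambda * sum(y)) / prec
  mu   <- rnorm(1, m, sqrt(1 / prec))
  # Full conditional of lambda given mu and data: Gamma
  lambda <- rgamma(1, shape = a + n / 2, rate = b + sum((y - mu)^2) / 2)
  out[i, ] <- c(mu, lambda)
}
post <- out[-(1:1000), ]                         # discard burn-in
colMeans(post)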

Page 32: Differential Expressions Bayesian Techniques Lecture Topic 8

Empirical Bayes Idea

• The prior distributions depend upon unknown parameters, which in turn may need a second or higher stage prior in some hierarchical setting.
• But at some point we HAVE to specify all remaining parameters of the hyper-prior.
• In other words, we HAVE to use our knowledge to specify our prior.
• The Empirical Bayes method instead uses the sample data to estimate the parameters of the final-stage prior.
• The idea: if we are interested in θ | y, let θ ~ P(η1), η1 ~ P(η2), …, η_{L−1} ~ P(η_L).
• In the Empirical Bayes approach we use the data to estimate the parameter η_L as the value that maximizes the marginal likelihood P(Y | η_L).
• We plug the estimate of η_L into the priors, and the posterior distribution is now P(θ | y, η̂_L). A minimal sketch follows below.
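A minimal empirical Bayes sketch in R: gene-level means with a Normal prior whose hyper-parameters are estimated from the marginal distribution of the data (simple moment estimates standing in for the marginal maximum likelihood); all numbers are illustrative assumptions:

set.seed(1)
G <- 1000; sigma <- 1                       # G genes, known sampling sd of each gene mean
mu_true <- rnorm(G, mean = 0, sd = 2)       # true gene effects, prior sd tau = 2
ybar    <- rnorm(G, mean = mu_true, sd = sigma)   # observed gene-level means

# Marginally ybar_g ~ N(mu, tau^2 + sigma^2): estimate hyper-parameters from the data
mu_hat   <- mean(ybar)
tau2_hat <- max(var(ybar) - sigma^2, 0)

# Plug the estimates back in: the EB posterior mean shrinks each ybar_g toward mu_hat
shrink  <- tau2_hat / (tau2_hat + sigma^2)
eb_mean <- mu_hat + shrink * (ybar - mu_hat)
head(eb_mean)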

Page 33: Differential Expressions Bayesian Techniques Lecture Topic 8

Empirical Bayes’ Idea in Differential Expression

• Average log fold change.
  – Problem: non-DE genes with large variances have too much chance of being selected.
• t-statistics.
  – Problem: apparently DE genes with very small sample variances are suspect.
• Moderated t-statistics: a happy compromise between the two above, an empirical Bayes estimate, using the data to estimate a new standard error s̃_g.

Generally, s̃_g = s_g + c for some constant c (the exact form used here is given on the next slide).

Page 34: Differential Expressions Bayesian Techniques Lecture Topic 8

The moderated t statistic

• Smoothed standard deviations: shrink the gene-wise variances towards a common prior value s0²:

• Eliminates large t-statistics due merely to very small s values, and reduces the impact of very large s values.

$$\tilde{s}_g^{2} = \frac{d_0 s_0^{2} + d_g s_g^{2}}{d_0 + d_g}$$
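A minimal R sketch of the shrinkage formula above applied to made-up gene-wise variances; the prior values d0 and s0² would in practice be the estimated hyper-parameters, here they are simply assumed:

set.seed(1)
G  <- 1000
dg <- 4                                  # residual df per gene (e.g. two groups of 3)
s2 <- rchisq(G, df = dg) / dg            # made-up gene-wise sample variances (true variance 1)

d0 <- 4; s0_sq <- 1                      # assumed prior df and prior variance
s2_tilde <- (d0 * s0_sq + dg * s2) / (d0 + dg)   # shrink each variance toward s0^2

# Moderated t: same numerator as the ordinary t, but with the shrunken standard error
diff_means  <- rnorm(G, 0, sqrt(2 / 3))  # made-up group-mean differences, n = 3 per group
t_ordinary  <- diff_means / sqrt(2 * s2 / 3)
t_moderated <- diff_means / sqrt(2 * s2_tilde / 3)

# The very large |t| values caused by tiny variances are damped
max(abs(t_ordinary)); max(abs(t_moderated))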

Page 35: Differential Expressions Bayesian Techniques Lecture Topic 8

EB Idea

• Posterior odds (for DE).

• The posterior probability of differential expression for any gene is a monotonic function of t̃_g² for constant d.

Page 36: Differential Expressions Bayesian Techniques Lecture Topic 8

Estimating hyper-parameters

Closed-form estimators with good properties are available:

for s0 and d0, in terms of the first two moments of log s_g²;

for c0, in terms of quantiles of the |t̃_g|.

Nowadays the EB estimate is used most often for differential expression, and the genes are ranked by the EB estimates.

Instead of doing strict error control, the top g genes are examined using the EB estimates for ranking purposes. Sometimes |t̃_g| > 4 is used as an empirical cut-off.

Limma in R uses empirical Bayes estimates for deciding which genes are differentially expressed; a minimal usage sketch is given below.
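A minimal limma sketch (R/Bioconductor); the expression matrix and the two-group design are made-up placeholders, not data from the lecture:

library(limma)

# Made-up log-expression matrix: 600 genes x 6 samples (placeholder for real data)
exprs_mat <- matrix(rnorm(600 * 6), nrow = 600)
group  <- factor(c("control", "control", "control", "treated", "treated", "treated"))
design <- model.matrix(~ group)

fit <- lmFit(exprs_mat, design)        # gene-wise linear models
fit <- eBayes(fit)                     # empirical Bayes moderated t-statistics
topTable(fit, coef = 2, number = 10)   # top genes ranked by evidence of differential expression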