differential expression detection in microarray...

83
Differential Expression Detection in Microarray Data Jianhua Hu Department of Bioinformatics and Computational Biology 2.11.08

Upload: others

Post on 08-Nov-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Differential Expression Detection in Microarray Data

Jianhua

HuDepartment of Bioinformatics and Computational Biology

2.11.08

Page 2: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Outline

Introduction

Penalized t-statistics by Tusher et al. (2001) with implementation in the software ``SAM”.

Moderated t-statistics by Smyth et al. (2004) with implmentation inR package ``limmod”.

LikelihoodLikelihood--based identifying of differentially expressed genes by based identifying of differentially expressed genes by HuHuand Wright (2007).and Wright (2007).

Page 3: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Introduction

DNA microarrays play an important role in many areas of biomedical research.

Two popular types: Spotted cDNA microarrays, multiprobeoligonucleotide arrays (Affymetrix Genechip).

Multiprobe oligonucleotide array: probe redundancy, one “color”.

Page 4: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Introduction

An example Affymetrix

genechip

array:

Probe set: 10-20 probe pairsProbe pair: A (PM, MM) pairPM: 25-mers complementary to region of geneMM: Middle base is different to that of PM

...

Coding portion of gene X polyA

Perfect Match (PM)

Mismatch (MM)

25-mers

a probe set

Page 5: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Current methods

Simple rule without accounting for expression variation:Chen et al. (1997) -

“fold changes”

rule.

Ordinary two-sample t-statistic:Dudoit et al. (2002), Thomas et al. (2001)

Modified two-sample t-statistic:Tusher et al. (2001), Efron et al. (2001), Smyth (2004).

Page 6: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

“Borrowing” information from across the genes:

Uncertainty in the variance is an acute problem when the sample size is small.

Eaves et al. (2002) - weighted average of the sample variance and a local variance estimate for groups of genes.

Bayesian approaches:Newton et al. (2001), Baldi and Long (2001), Ibrahim et al. (2002).

Page 7: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

The median number of arrays was 8 both in 1999 and 2003.

Illustrates importance of dealing with a wide range of sample sizes.

Page 8: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

1

Significance analysis of Microarrays (SAM)

Tusher, Tibshirani, Chu (2001)

Page 9: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

2

The Problem:

Identifying differentially expressed genes •

Determine which changes are significant

Enormous number of genes

Page 10: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

3

Reminder: t-Test

t-Test for a single gene:•

We want to know if the expression level changed from condition A to condition B.

Null assumption: no change•

Sample the expression level of the genes in two conditions, A and B.

Calculate •

H0 : The groups are not different,

BA xx ,0)( =− BA xxE

Page 11: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

4

t-Test Cont’d

Under H0 , and under the assumption that the data is normally distributed,

Use the distribution table to determine the significance of your results.

txx

xx

BA

BA ~)(ˆ0)(

−−−

σt-Statistic

Page 12: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

5

Multiple Hypothesis Testing

Naïve solution: do t-test for each gene.•

Multiplicity Problem: The probability of error increases.

We’ve seen ways to deal with it, that try to control the FWER or the FDR.

Today: SAM (estimates FDR)

Page 13: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

6

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 14: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

7

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 15: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

8

The Experiment

Two human lymphoblastoid

cell lines:

Eight hybridizations were performed.

1 2

I1 I2

U1 U2

I1A I1B I2A I2B

U1A U1B U2A U2B

Page 16: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

9

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 17: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

10

SAM’s

statistic-

Relative Difference

Define a statistic, based on the ratio of change in gene expression to standard deviation in the data for this gene.

0)()()()(

sisixixid UI

+−

= Difference between the means of the two conditions

Estimate of the standard deviation of the numerator

Fudge Factor

⎭⎬⎫

⎩⎨⎧

−+−⎟⎟⎠

⎞⎜⎜⎝

⎛−+

+= ∑∑

mIm

mIm

nn ixixixixnn

is 22

21

11

)]()([)]()([2

)( 21

Page 18: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

11

At low expression levels, variance in d(i) can be high, due to small values of s(i).

To compare d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression and of s(i).

Choose s0

to make the coefficient of variation of d(i) approximately constant as a function of s(i).

Why s0 ?

Page 19: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

12

We gave each gene a score.

At what threshold should we call a gene significant?

How many false positives can we expect?

Now what?

Page 20: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

13

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 21: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

14

More data required

Experiments are expensive.•

Instead, generate permutations of the data (mix the labels)

Can we use all possible permutations?

Page 22: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

15

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 23: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

16

For each permutation p, calculate dp(i).

Rank genes by magnitude:

Define:

Estimating d(i)’s

Order Statistics

...)3()2()1( ≥≥≥ ppp ddd

∑=p

idE

pid 36)()(

0

21

)()()()(

sisixixid GG

p +−

=

Page 24: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

17

Example

Page 25: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

18

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 26: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

19

Plot d(i) vs. dE (i) :

For most of the genes,

)()( idid E≅

Now Rank the original d(i)’s:

...)3()2()1( ≥≥≥ ddd

Identifying Significant Genes

Page 27: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

20

Define a threshold, Δ.•

Find the smallest positive d(i) such that

Δ≥− )()( idid Ecall it t1.

In a similar manner, find the largest negative d(i). Call it t2.

For each gene i, if, call it potentially significant.

21 )()( tidtid ≤∨≥

Identifying Significant Genes

Page 28: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

21

Page 29: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

22

SAM-

procedure overviewSample genes

expression

scale

Define and calculate a statistic, d(i)

Generate permutatedsamples

Estimate attributes of d(i)’s

distribution

Identify potentiallySignificant genes

Estimate FDR

ChooseΔ

Page 30: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

23

Estimate FDR•

t1 and t2 will be used as cutoffs.

Calculate the average number of genes that exceed these values in the permutations.

Estimate the number of falsely significant genes, under H0:

Divide by the number of genes called significant

∑ =≤∨≥

36

1 21361 })()(|{#

p pp tidtidi

Page 31: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

24

FDR cont’d

})()(|{#

})()(|{#

21

36

1 21361

tidtidi

tidtidiFDR p pp

≤∨≥

≤∨≥≈

∑ =

Note: Cutoffs are asymmetric

Page 32: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

25

Example

5833.0347

=≈FDR

Page 33: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

26

How to choose Δ?

Omitting s0 caused higher FDR.

Page 34: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

27

Test SAM’s

validity•

10 out of 34 genes found have been reported in the literature as part of the response to IR

19 appear to be involved in the cell cycle

4 play role in DNA repair•

Perform Northern Blot-

strong

correlation found•

Artificial data sets-

some genes

induced, background noise

Page 35: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

28

Moderated t-statistic

Smyth G. K. (2004)

Page 36: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Linear model estimates

Obtain a linear model for each gene g

E (yg ) = Xβg , var(yg ) = W−1g σ2

g .

Estimate model by robust regression, least squares, or generalized leastsquares to obtaincoefficients, βgj

estimators of σ2g , s2

g

standard errors, se(βgj)2 = cgjs2

g.

() Moderated t February 10, 2008 2 / 10

Page 37: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Parallel inference for genes

1 10,000-40,000 linear models.

2 High dimensionality:Need to adjust for multiple testing, e.g., control family-wise error rate(FWE) or false discover rate (FDR).

3 The key: borrow information across genes.

() Moderated t February 10, 2008 3 / 10

Page 38: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Hierarchical model

βgj ∼ N(βgj , cgjσ2g )

P(βgj 6= 0) = p

βgj | βgj 6= 0 ∼ N(0, c0jσ2g )

s2g ∼

σ2g

dgχ2

dg

σ2g ∼ s2

0 (χ2d0/d0)−1

() Moderated t February 10, 2008 4 / 10

Page 39: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Posterior statistics

1 Posterior variance estimators

s2g = E (σ2

g | s2g ) =

s2gdg + s2

0d0

dg + d0

2 Moderated t-statistic

tgj =βgj

sg√

cgj

3 The goal: eliminates large t-statistics merely from very small s.

() Moderated t February 10, 2008 5 / 10

Page 40: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Marginal distributions

The marginal distributions of the sample variances are moderatedt-statistics are mutually independent

s2g ∼ s2

0Fd ,d0

tg ∼ {td0+d with prob 1− p√

1 + c0/ctd0+d with prob p

Degrees of freedom add!

() Moderated t February 10, 2008 6 / 10

Page 41: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Estimating prior parameters

Marginal moments of log(s2) lead to estimators of s0 and d0:Estimate d0 by solving

ψ′(d0/2) = mean{ns2

e − ψ′(dg/2)}

whereeg = log(s2

g )− ψ(dg/2) + log(dg/2),

ands2e = (eg − e)2/(n − 1).

Finallys20 = exp{e + ψ(d0/2)− log(d0/2)}.

Where ψ() and ψ′() are the digamma and trigamma functions respectively.

() Moderated t February 10, 2008 7 / 10

Page 42: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Estimating prior parameters

s1, s2, · · · , sg → s1, s2, · · · , sg → s0t1, t2, · · · , t − g → t1, t2, · · · , tg → tg ,pooled

The data decides whether tg should be closer to tg ,pooled or to tg .

() Moderated t February 10, 2008 8 / 10

Page 43: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Quantile estimation of c0

let r be rank of | tg | in descending order, and let F () be the distributionfunction of the t-distribution. c0 can be estimated by equating empiricalto theoretical quantiles:

2[pF (−√

cg

cg + c0| tg |; d0 + dg ) + (1− p)F (− | tg |; d0 + dg )] =

r − 0.5

n

Get overall estimator of c0 by averaging the individual estimators from thetop p/2 proportion of the | tg |.

() Moderated t February 10, 2008 9 / 10

Page 44: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Posterior odds

Posterior probability of differential expression for any gene is

p(β 6= 0 | β, s2)

p(β = 0 | β, s2)=

p

1− p

(c

c + c0

)1/2

{ t2 + d + d0

t2 cc+c0

+ d + d0}

1+d+d02

It is a monotonic function of t2 for constant d .

() Moderated t February 10, 2008 10 / 10

Page 45: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Limma

• Limma is an R package to find differentially expressed genes

• it uses linear models– fitted to normalized intensities (one-color)– or log-ratios (two-color)

• assumption: normal distribution• output: p-values (adjusted for multiple

testing)

Page 46: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Documentation

• limma User’s Guide, Gordon Smyth, Natalie Thorne, James Wettenhall

• help documents for each function• Smyth, GK (2004). SAGMB 3 (1) article 3• de Menezes RX, Boer JM, van

Houwelingen JC (2004). Applied Bioinformatics 3: 229-235

• background on linear models: tech note by Renee de Menezes

Page 47: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

limma

• linear models– can be used to compare two or more groups– can be used for multifactorial designs

• e.g. genotype and treatment

• uses empirical Bayes analysis to improve power in small sample sizes– borrowing information across genes

Page 48: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Pre-analysis steps

• read data into limma/affy• basic quality control features• background correction• within-array normalization• between-array normalization• if duplicate spotting: sort data so that

duplicates are together

Page 49: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Linear model

• make design matrix• fit a linear model to estimate all the fold

changes• [make contrasts matrix]• apply Bayesian smoothing to the standard

errors (very important!) • output: moderated t-statistics

Page 50: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Two color - start

• working directory containing– *.gpr files– targets.txt file– *.gal file (optional)

Page 51: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Reading in data

• basically the same as Anja Schiel has shown for Quality Control packages– read in a targets file including

• file names for *.gpr files• cy3 and cy5 samples

– read in *.gpr files using read.maimages()– option to use GenePix flag information– print layout (from *.gpr or *.gal file)– option to define spot types (controls)

Page 52: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Other BioC packages

• Limma package can work with microarray objects derived by these packages:

• marray: marrayRaw and marrayNorm

• affy: single channel (exprSet)

Page 53: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Exploring data

• automate the production of plots for all arrays in an experiment– imageplot3by2

• array image of R, Rb, G, Gb, M (R/G) (un)norm– plotMA3by2

• MA plots before/after normalization– plotDensities

• histogram of all intensities before/after normalization

Page 54: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Background correction

• default = subtract– disadvantage: negative values -> NAs

• “normexp”, offset = 50– adjusts fg to bg to yield strictly positive intensities– use of an offset damps the variation of the log-

ratios for very low intensities towards 0, i.e. stabilizes the variability of the M-values as a function of intensity

– this is important for the empirical Bayes methods

Page 55: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Normalization 1

• normalizeWithinArray– normalizes M-values of each array separately– default = print-tip loess– not appropriate for e.g. Agilent arrays, which do

not have print groups: method = “loess”– assumes bulk of probes not changed– symmetrical change is not required– spot quality weights (in RG) are used by default;

weight = 0 will not influence normalization of other spots, but will be kept and normalized

Page 56: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Normalization 2

• normalizeBetweenArray– intensities of single-channel microarrays– log-ratios of two-color microarrays as a

second step after within array normalization of the M-values

– because: loess normalization doesn’t affect the A-values

– quantile normalization results in equal distributions across channels and arrays

Page 57: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Normalization 3

• normalizeBetweenArrays directly on two- color data– quantile normalization directly to individual red

and green intensities– vsn normalization should always be used

directly on raw intensities• background subtraction is allowed,• but no correction (e.g. normexp) or loess!!!

Page 58: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Linear models

• design matrix– indicates which RNA samples have been

applied to each array– rows: arrays; columns: coefficients

• contrast matrix– specifies which comparisons you would like to

make between the RNA samples– for very simple experiments, you may not

need a contrast matrix

Page 59: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Look at the result

• topTable(fit, adjust=“fdr”)– gives the top10 of differentially expressed

genes (for each contrast)• plotMA(fit)• decideTests

– makes a matrix with 0 (not selected) and -1/1 (selected for a specific p-value)

– visualize by Venn diagram

Page 60: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Limma objects

• RGList (Red-Green, raw data)• generated by read.maimages

• MAList (M- and A-values, normalized data)• generated by MA.RG or normalizeWithinArrays

• MArrayLM (result of fitting linear model)• generated by lmFit

• TestResults (results of testing a set of contrasts equal to 0 for each probe)

• generated by decideTests

Page 61: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Example 1: paired design

• direct two-color design including dye-swap• dataset "arthritis", Maaike van den Hoven• platform: Sigmamouse, 23232 single spots• 12 arrays, 2 groups:

– untreated (6 biological replicates)– arthritis (6 biological replicates)

• question: find differentially expressed genes after induction of arthritis

Page 62: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

targets.txt

Page 63: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

plotDensities(RGb, MA, MA.q, MAq)

Page 64: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

plotMA(RGb, MA, MAq)

Page 65: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

topTable(MA.q, adjust=“fdr”)Block Row Column ID Name M A t P.Value B

1838 4 18 12 NM_026004 NA 1.996 9.31 9.58 0.00577 6.88

9277 20 4 15 NM_018762 NA 0.392 10.88 8.63 0.00577 5.91

5551 12 11 7 NM_017372 NA 1.741 9.37 8.55 0.00577 5.74

14031 29 22 17 AmbionSpike5 NA 1.053 14.24 8.43 0.00577 5.66

18056 38 7 16 NM_020611 NA 0.187 8.21 8.52 0.00577 5.52

15529 33 2 19 U52197 NA 0.340 8.83 8.28 0.00598 5.47

13274 28 10 8 X83919 NA 0.407 8.56 8.11 0.00598 5.18

22079 46 14 13 NM_026542 NA 2.017 9.97 8.08 0.00598 5.18

1155 3 9 11 X14097 NA 0.251 8.07 8.46 0.00577 5.16

13559 29 1 7 AmbionSpike5 NA 1.034 13.93 7.93 0.00598 5.03

AmbionSpike5 was spiked in at 2-fold change arthritis/untreated: log-ratio 1

Page 66: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

The likelihood based approach

Hu and Wright (2007)

Page 67: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Notation A simple family of t-like statistics for gene i:

t0

is the “ordinary”

Welch statistic, f = n1

+ n2

– 2.

, governs the power of the statistics.

t0 can be viewed as an estimate of δ.2

22

1

21

21

nnii

iii

σσμμδ

+

−=

Its performance can be examined by the positive FDR in Storey (2001),

,21

asxxt

i

iiai +

−= with ,// 2

221

21 nsnss iii +=

)|( 0 cTHPrFDR ≥=

Page 68: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

FDR property of t0

δs, independent random variable, with

Theorem 1 limc→∞FDR = π0 / ( π0 + π1Q ), where

Φ, the standard normal CDF; Y and Z, random variables with truncated normal densities,

{=iδ0 w.p. π0δ’

w.p. π1

,)(

)()(2)|Pr()|Pr(

lim'

0

1f

f

c ZEYE

HcTHcTQ δΦ

=≥≥

=∞→

0,2/)2/exp(2)(

0)),(2/()2/)(exp()(2

'2'

≥−=

≥Φ−−=

zzzp

yyyp

π

δπδ

FDR reaches a limit as ,∞→c

Page 69: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Visualizing the limit Q

)|Pr()|Pr(

lim0

1

HcTHcTQ

c ≥≥

=∞→

)()()(2 '

f

f

ZEYEQ δΦ

=

Page 70: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

FDR curves for t0

with π0

= 0.9, δ’ = 2

Page 71: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Vertical movement represents improvement in FDR purely due to improvement in variance estimates

FDR curves for t0 with π0

= 0.9, δ’ = 2

Page 72: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Motivation

The variance estimation plays an important role.

A roughly linear relationship between log(s2) and log( ) is observed.

x

Page 73: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Figure 4: log(s2) vs. log( ) x

Heteroscedasticity

is consistent

with the model

Page 74: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Statistical modelUnder each experimental condition, the expression index estimate xijfrom N(μi

, σi2).

Linear model is proposed,

ηi

, a gene-specific random effect term, from N(0, ξ2). Maximum likelihood estimation:We assume the hierarchical model xij

| μi

, σi2

~

N(μi

, σi2),

log σi2| μi

~ N(α0

+ α1

log(μi

), ξ2),

where σi2 is substituted by exp( ).

iii ημαασ ++= )log()log( 102

∏∏∫= =

−−−=m

i

n

jiiiiij

ii

dxL1 1

22

2

22

2

210 )2/exp(

21)2/)(exp(

21),,,(

η

ηξηπξ

σμπσ

ξααμ

ii ημαα ++ )log(10

Page 75: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Parameter estimation

Table 1: MLEs for the four data sets

The “best predictor” approach (McCulloch and Searle, 2001) is used to predict ηi:

Then σ2 can be predicted based on the fixed parameter estimates and predicted ηi .

Lemon et al. -5.853 1.697 0.049Huang et al. -5.08 2.27 0.80Virtaneva

et al. -2.803 2.028 0.271Bro et al. -0.010 1.518 0.00

0α 2ξ

∫==i

ii iiiiii dxpxEBPη

ηξααμηηηη ),,,;|()|()( 210

Page 76: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Figure 5:The shrinkage effect in estimating μ, σ2

log(

estim

ates

of σ

2 )

log(estimate of μ)

),( 2sx

)ˆ,ˆ( 2σμ

Page 77: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

2

22

1

21

21

ˆˆˆˆˆ

nnii

iii

σσμμδ

+

−=

Eventually, MLE of δi can be obtained,

Page 78: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Small sample efficiency results

The MSE of nearly achieves the Cramér-Rao lower bound

, Fisher information

Define

B

represents a conservative lower bound for the efficiency of

B nearly reaches 1, with no overdispersion (small sample size).

The situation of overdispersion is more complex.

iδˆ

),(/))(1()ˆ( .2'

iii bCRLB διδδ +=

)ˆ()( iii iEb δδδ δ−= ).(. iδι

1)ˆ()ˆ(

)ˆ()(1

. ≤≤=−

i

i

i

i

MSECRLB

MSEB

δδ

δδι

.ˆiδ

Page 79: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Criterion of comparing the statisticsGeneral criterion: how well the statistics preserve the rank order of the δi

.

Continuous δi:

Relationship between FDR and AUC for discrete δ:Theorem 2 AUC = 1/2 (1 –

FDR + true accept/accept)

Receiver-operator characteristic curve (ROC),comparing the conditional distribution of the δs

for T

< c

vs. T c.

The area under the ROC curve (AUC), Pr (δ R > δ A), the most commonly used summary.

Discrete δi: FDR is used, fixing the number of rejected genes.

Page 80: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

Simulation studyCommon set-up:

“realized” FDR (Genovese and Wasserman, 2002) a case of δ’ = 1 and π0 = 0.7

Sampling from a double exponential distribution with location 0 and variance 1/2.

n1 = n2 = 3, m = 10000.In each case, 2000 simulations are performed.Choices of a: -

percentiles 25, 50, 75, 90 and 100 of s, and 2max(s)

-

a

= 0, -

criterion proposed in Tusher et al. (2001).

Discrete δ:

Continuous δ:

Page 81: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

0 50 100 150 200 250 300

0.05

0.10

0.15

0.20

0.25

0.30

number of genes rejected

FD

R

δ

Newtonfold−change

t0

Efron Smyth

Tusher

0 50 100 150 200 250 300

0.94

0.96

0.98

1.00

number of genes rejected

AU

C

δ

Newton

fold−change

t0

Efron

Smyth

Tusher

0.00 0.05 0.10 0.15 0.20

0.04

0.06

0.08

0.10

0.12

proportion of genes rejected

FD

R

EfronSmyth

t0

LemonTusher

Newton

δ (no overdispersion)

δ (overdispersion)

Page 82: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

A case studyReal data: A human fibroblast cell expression data set from Lemon et al. (2002) (6 vs. 6, under stimulated to 50:50 conditions)

Small sample comparisons: n1 = n2 = 2

Permutation procedures to estimate the FDR (Storey and Tibshirani, 2001)

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛26

26

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛210

212

“observed” distribution - comparisons of two conditions.

an empirical “null” distribution - total comparisons of 2 vs. 2 arrays.

Page 83: Differential Expression Detection in Microarray Dataodin.mdacc.tmc.edu/~kim/TeachBioinf/Week5/Lecture5-Feb11...11 • At low expression levels, variance in d(i) can be high, due to

0 50 100 150 200 250 300

0.05

0.10

0.15

0.20

0.25

0.30

number of genes rejected

FD

R

δ

Newtonfold−change

t0

Efron Smyth

Tusher

0 50 100 150 200 250 300

0.94

0.96

0.98

1.00

number of genes rejected

AU

C

δ

Newton

fold−change

t0

Efron

Smyth

Tusher

0.00 0.05 0.10 0.15 0.20

0.04

0.06

0.08

0.10

0.12

proportion of genes rejected

FD

R

EfronSmyth

t0

LemonTusher

Newton

δ (no overdispersion)

δ (overdispersion)