
Utah State University – Fall 2017

Statistical Bioinformatics (Biomedical Big Data)

Notes 6

Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression


References

Anders & Huber (2010), “Differential Expression Analysis for Sequence Count Data”, Genome Biology 11:R106.

DESeq2 Bioconductor package vignette, obtained in R using vignette("DESeq2")

Kvam, Liu, and Si (2012), “A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data”, Am. J. of Botany 99(2):248-256.

Love, Huber, and Anders (2014), “Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2”, Genome Biology 15(12):550.


Example – 3 treated vs. 4 untreated; read counts (RNA-Seq) for 14,470 genes

Published 2010 (Brooks et al., Genome Research)

Drosophila melanogaster

3 samples “treated” by knock-down of “pasilla” gene (thought to be involved in regulation of splicing)

                T1    T2    T3    U1    U2    U3    U4
FBgn0000003      0     1     1     0     0     0     0
FBgn0000008    118   139    77    89   142    84    76
FBgn0000014      0    10     0     1     1     0     0
FBgn0000015      0     0     0     0     0     1     2
FBgn0000017   4852  4853  3710  4640  7754  4026  3425
FBgn0000018    572   497   322   552   663   272   321


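# load data
library(pasilla); data(pasillaGenes)
library(DESeq)
# extract the count matrix (genes in rows, samples in columns)
eset <- counts(pasillaGenes)
# relabel samples: T = treated, U = untreated
colnames(eset) <- c('T1','T2','T3','U1','U2','U3','U4')
head(eset)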


Consider per-gene tests

t-test

Nonparametric: Wilcoxon Rank Sum

The t-test fails for any gene whose counts are constant within each group, for example:

T1 T2 T3 U1 U2 U3 U4
 1  1  1  2  2  2  2

Error in t.test.default(x = c(2L, 2L, 2L, 2L), y = c(1L, 1L, 1L)) :
  data are essentially constant


# try a per-gene t-test
trt <- c(1,1,1,0,0,0,0)   # 1 = treated, 0 = untreated
pvals <- rep(NA,nrow(eset))
for(i in 1:nrow(eset))
{
  x <- eset[i,]
  a1 <- t.test(x~trt)
  pvals[i] <- a1$p.value
}
# the loop stops with the error "data are essentially constant";
# inspect the gene at which it stopped
i # 1687
eset[i,]
#T1 T2 T3 U1 U2 U3 U4
# 1  1  1  2  2  2  2

# try a per-gene Wilcoxon rank sum test (allowing for ties)
library(coin)
pvals <- rep(NA,nrow(eset))
for(i in 1:nrow(eset)) # This takes a few minutes
{
  x <- eset[i,]
  a1 <- wilcox_test(x~as.factor(trt))
  pvals[i] <- pvalue(a1)
}
hist(pvals, main='Pvalues from Wilcoxon Rank Sum Test',
     cex.main=2, cex.lab=1.5)
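The t-test loop halts at the first gene whose counts are constant within each group. One possible way to let such a loop run to completion is sketched below, assuming it is acceptable to record NA for those genes. The helper name `safe_t_p` and the two-row matrix `toy` are made up for illustration; they are not part of the original notes.

```r
# Hypothetical helper: return the t-test p-value, or NA if t.test() fails
safe_t_p <- function(x, g) {
  tryCatch(t.test(x ~ g)$p.value, error = function(e) NA_real_)
}

trt <- c(1, 1, 1, 0, 0, 0, 0)   # 1 = treated, 0 = untreated

# Toy matrix: the first row mimics the gene that triggered the error,
# the second row is gene FBgn0000008 from the example table
toy <- rbind(constant = c(1, 1, 1, 2, 2, 2, 2),
             varying  = c(118, 139, 77, 89, 142, 84, 76))

pvals <- apply(toy, 1, safe_t_p, g = trt)
pvals
```

This only hides the failure; it does not make the t-test any more appropriate for low, discrete counts, which is the motivation for the count-based models that follow.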


Consider data as counts (Poisson regression)

On a per-gene basis:

Let $N_i$ = # of total fragments counted in sample i

Let $p_i$ = P{ fragment matches to gene in sample i }

Observed # of total reads for gene in sample i: $R_i \sim \mathrm{Poisson}(N_i p_i)$

$E[R_i] = Var[R_i] = N_i p_i$

Let $T_i$ = indicator of trt. status (0/1) for sample i

Assume $\log(p_i) = \beta_0 + \beta_1 T_i$

Test for DE using $H_0: \beta_1 = 0$


Poisson Regression

$E[R_i] = N_i p_i = N_i \exp(\beta_0 + \beta_1 T_i)$

$\log(E[R_i]) = \log N_i + \beta_0 + \beta_1 T_i$

$\log N_i$: not interesting, but important; call this the “offset”; often considered the “exposure” for sample i (a quasi-normalization to scale overall genomic material)

$\beta_0, \beta_1$: estimate β’s using iterative MLE procedure

Do this for one gene in R (here, gene 2):

trt <- c(1,1,1,0,0,0,0)
R <- eset[2,]
# offset: log of the total count in each sample
lExposure <- log(colSums(eset))
a1 <- glm(R ~ trt, family=poisson, offset=lExposure)
summary(a1)


Call:
glm(formula = R ~ trt, family = poisson, offset = lExposure)

Deviance Residuals:
     T1       T2       T3       U1       U2       U3       U4
 0.3690   0.4516  -0.9047  -0.7217   0.5862   2.3048  -2.5286

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.85250    0.06804 -174.19   <2e-16 ***
trt           0.05875    0.10304    0.57    0.569
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 14.053 on 6 degrees of freedom
Residual deviance: 13.729 on 5 degrees of freedom
AIC: 58.17

Number of Fisher Scoring iterations: 4
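To connect this output to the test of $H_0: \beta_1 = 0$, the short sketch below recomputes the Wald statistic and p-value for the trt coefficient from the printed estimate and standard error. The numbers are copied from the output above; the variable names are illustrative only.

```r
# Values copied from the "trt" row of the coefficient table (gene 2)
est <- 0.05875
se  <- 0.10304

z    <- est / se               # Wald statistic: estimate / standard error
p    <- 2 * pnorm(-abs(z))     # two-sided p-value from the standard normal
fold <- exp(est)               # estimated ratio of p_i, treated vs. untreated

round(c(z = z, p = p, fold = fold), 3)
```

Because the model is on the log scale, exp(β1) is the estimated treated/untreated ratio of the per-fragment probability for this gene; a value close to 1 together with a large p-value gives no evidence of differential expression for gene 2.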


Do this for all genes …

jackpot?


Possible (frequent) problem – overdispersion

Recall [implicit] assumption for Poisson dist’n: $E[R_i] = Var[R_i] = N_i p_i$

It can sometimes happen that $Var[R_i] > E[R_i]$

common check: add a scale (or dispersion) parameter $\sigma$:

$Var[R_i] = \sigma^2 E[R_i]$

Estimate $\sigma^2$ as $\chi^2/df$

Deviance $\chi^2$, a goodness of fit statistic:

$\chi^2_D = 2 \sum_i R_i \log\left( R_i / \hat{R}_i \right)$
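As an illustration of how the scale estimate behaves, the sketch below fits the same Poisson model to simulated Poisson counts and to simulated overdispersed (negative binomial) counts. Everything here is made up for illustration: the sample size, the means, and the helper `scale_est`. The sample size is deliberately much larger than the 7 samples in the example so that the contrast is stable.

```r
set.seed(1)
n   <- 200                                   # made-up sample size
trt <- rep(c(0, 1), each = n / 2)
mu  <- 100 * exp(0.3 * trt)                  # assumed true means

# Hypothetical helper: sqrt(deviance / residual df) from a Poisson fit
scale_est <- function(y, trt) {
  fit <- glm(y ~ trt, family = poisson)
  sqrt(fit$deviance / fit$df.residual)
}

y_pois <- rpois(n, mu)                       # variance equals mean
y_nb   <- rnbinom(n, mu = mu, size = 2)      # variance = mu + mu^2 / 2

c(poisson = scale_est(y_pois, trt), negbin = scale_est(y_nb, trt))
```

A scale estimate near 1 is what the Poisson assumption predicts; values well above 1 indicate that the Poisson standard errors (and hence the p-values) are too optimistic.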


# Poisson regression for all genes, checking for overdispersion
Poisson.p <- scale <- rep(NA,nrow(eset))
lExposure <- log(colSums(eset))
trt <- c(1,1,1,0,0,0,0)
## this next part takes about 1.5 minutes
print(date())
for(i in 1:nrow(eset))
{
  count <- eset[i,]
  a1 <- glm(count ~ trt, family=poisson, offset=lExposure)
  Poisson.p[i] <- summary(a1)$coefficients[2,4]  # p-value for the trt coefficient
  scale[i] <- sqrt(a1$deviance/a1$df.resid)      # scale estimate, sqrt(chi^2/df)
}
print(date())
par(mfrow=c(2,2))
hist(Poisson.p, main='Poisson', xlab='raw P-value')
boxplot(scale, main='Poisson', xlab='scale estimate')
abline(h=1,lty=2)
# proportion of genes with scale estimate above 1
mean(scale > 1)
# 0.640152


Can use alternative distribution:

edgeR package does this:

For each gene: $R_i \sim$ NegativeBinomial (number of indep. Bernoulli trials to achieve a fixed number of successes)

Let $\mu_i = E[R_i]$, and $v_i = Var[R_i]$

But low sample sizes prevent reliable estimation of $\mu_i$ and $v_i$

Assume $v_i = \mu_i + \alpha \mu_i^2$

estimate $\alpha$ by pooling information across genes

then only one parameter must be estimated for each gene

But – DESeq2 package improves on this
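The mean-variance relation $v_i = \mu_i + \alpha \mu_i^2$ can be checked numerically with base R, whose `rnbinom()` uses a `size` parameter equal to $1/\alpha$ when called with `mu`. The values of `mu` and `alpha` below are made up for illustration.

```r
set.seed(1)
mu    <- 50      # assumed mean
alpha <- 0.2     # assumed dispersion

# In R's (mu, size) parameterization, Var = mu + mu^2 / size, so size = 1 / alpha
x <- rnbinom(1e5, mu = mu, size = 1 / alpha)

c(mean = mean(x), var = var(x), implied_var = mu + alpha * mu^2)
```

With $\alpha = 0$ the relation reduces to the Poisson case (variance equal to the mean), so $\alpha$ measures how far a gene departs from the Poisson assumption.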


Negative Binomial (NB) using DESeq2 …

Define trt. condition of sample i: $\rho(i)$

Define # of fragment reads in sample i for gene k: $R_{ki} \sim NB(\mu_{ki}, \sigma^2_{ki})$

Assumptions in estimating $\mu_{ki}$ and $\sigma^2_{ki}$:

$\mu_{ki} = q_{k,\rho(i)} \, s_i$
  $s_i$: library size, prop. to coverage [exposure] in sample i
  $q_{k,\rho(i)}$: per-gene abundance, prop. to true conc. of fragments

$\sigma^2_{ki} = \mu_{ki} + s_i^2 \, v_{k,\rho(i)}$
  $\mu_{ki}$: “shot noise”; this “dominates” for low-expressed genes
  $s_i^2 \, v_{k,\rho(i)}$: raw variance (biological variability)

$v_{k,\rho(i)} = v_\rho(q_{k,\rho(i)})$
  $v_\rho$: smooth function; pool information across genes to estimate variance


Estimate parameters (for NB distn.)

$\hat{s}_i = \operatorname{median}_k \dfrac{R_{ki}}{\left( \prod_{j=1}^{m} R_{kj} \right)^{1/m}}$

m = # samples; n = # genes

denom. is geometric mean across samples; like a pseudo-reference sample

$\hat{s}_i$ is essentially equivalent to $\sum_k R_{ki}$, with robustness against very large $R_{ki}$ for some k

For median calculation, skip genes where geometric mean (denom) is zero.
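The median-of-ratios estimate above can be sketched in a few lines of base R. This is a minimal illustration of the formula, not the DESeq2 implementation; the helper name `size_factors` and the toy matrix are assumptions made for the example.

```r
# Hypothetical helper implementing the median-of-ratios formula
size_factors <- function(counts) {
  geo  <- exp(rowMeans(log(counts)))   # per-gene geometric mean (0 if any count is 0)
  keep <- geo > 0                      # skip genes whose geometric mean is zero
  apply(counts[keep, , drop = FALSE], 2,
        function(r) median(r / geo[keep]))
}

# Toy counts: sample B is sequenced twice as deeply as sample A;
# the last gene has a zero count and is skipped
toy <- cbind(A = c(10, 100, 5, 0),
             B = c(20, 200, 10, 7))

s <- size_factors(toy)
s
```

In this toy case the two size factors differ by exactly a factor of 2, matching the difference in sequencing depth, and the zero-count gene has no influence on the result.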


Estimate parameters (for NB distn.)

$\hat{q}_{k,\rho} = \dfrac{1}{m_\rho} \sum_{i:\rho(i)=\rho} \dfrac{R_{ki}}{\hat{s}_i}$

$m_\rho$ = # samples in trt. condition $\rho$

this is the mean of the standardized counts from the samples in treatment condition $\rho$


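Estimate parameters (for NB distn.)

$w_{k,\rho} = \dfrac{1}{m_\rho - 1} \sum_{i:\rho(i)=\rho} \left( \dfrac{R_{ki}}{\hat{s}_i} - \hat{q}_{k,\rho} \right)^2$
  (this is the variance of the standardized counts from the samples in trt. condition $\rho$)

$z_{k,\rho} = \dfrac{\hat{q}_{k,\rho}}{m_\rho} \sum_{i:\rho(i)=\rho} \dfrac{1}{\hat{s}_i}$
  (an un-biasing constant)

Estimate function $w_\rho$ by plotting $w_{k,\rho}$ vs. $\hat{q}_{k,\rho}$, and use parametric dispersion-mean relation:

$w_\rho(\hat{q}_{k,\rho}) = \alpha_0 + \alpha_1 / \hat{q}_{k,\rho}$
  ($\alpha_0$ is “asymptotic dispersion”; $\alpha_1$ is “extra Poisson”)

$\hat{v}_\rho(\hat{q}_{k,\rho}) = \max\left\{ w_{k,\rho},\; w_\rho(\hat{q}_{k,\rho}) \right\} - z_{k,\rho}$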
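The per-gene quantities on this slide (mean of standardized counts, variance of standardized counts, and the un-biasing constant) can be computed directly for one treatment condition. The sketch below uses made-up counts and size factors, and it stops before the curve-fitting and maximum steps, which are not reproduced here.

```r
# Toy counts for one treatment condition (rows = genes, columns = samples)
cnt <- rbind(g1 = c(10, 20, 30),
             g2 = c(4, 4, 4))
s <- c(1, 1, 1)                     # assumed size factors for these samples

std <- sweep(cnt, 2, s, "/")        # standardized counts R_ki / s_i
q   <- rowMeans(std)                # mean of standardized counts
w   <- apply(std, 1, var)           # sample variance of standardized counts
z   <- q * mean(1 / s)              # (q / m) * sum(1 / s_i)

cbind(q, w, z, raw = w - z)         # w - z: variance beyond shot noise
```

For the second toy gene the difference w - z is negative, which shows why a per-gene estimate from a handful of samples is unreliable on its own and why information is pooled across genes.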


Estimating Dispersion in DESeq2

1. Estimate dispersion value $w_{k}$ for each gene

2. Fit for each condition (or pooled conditions [default]) a curve through estimates (in the $w_{k}$ vs. $\hat{q}_{k}$ plot)

3. Assign to each gene a dispersion value, using the maximum of the estimated [empirical] value $w_{k}$ or the fitted value $w(\hat{q}_{k})$; this conservative approach avoids under-estimating dispersion (which would increase false positives)


Getting started with DESeq2 package

Data in this format (the count table in the earlier example, slide 3)

  Integer counts in matrix form, with columns for samples and rows for genes

  Row names correspond to genes (or genomic regions, at least)

See package vignette for suggestions on how to get to this format (including from sequence alignments and annotation)

Can use read.csv or read.table functions to read in text files

Each column is a biological replicate

  If you have technical replicates, sum them together to get a single column


# format data
library(DESeq2)
countsTable <- eset      # counts table needs
                         # gene IDs in row names
rownames(countsTable) <- rownames(eset)
dim(countsTable)         # 14470 genes, 7 samples
conds <- c("T","T","T","U","U","U","U")
# 3 treated, 4 untreated; put in data.frame:
cframe <- data.frame(conds)

# Fit DESeq model (after formatting object):
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe,
                              design = ~ conds)
# DESeq() estimates size factors and dispersions, then fits the NB model
ddsCtrst <- DESeq(dds)

# check quality of dispersion estimation
par(mfrow=c(1,1))
plotDispEsts(ddsCtrst, cex.lab=1.5)


Checking Quality of Dispersion Estimation

Plot $w_{k}$ vs. $\hat{q}_{k}$ (both axes log-scale here)

Add fitted line for $w(\hat{q}_{k})$

Check that the fitted line roughly follows the general trend


Test for DE between conditions

Based on contrasts (coming more formally in Notes 7, slides 14-20)


Peak near zero: DE genes

Peak nearer one: low-count genes (?)

Default adjustment: BH FDR (?)

log2 fold change (MLE): conds T vs U
Wald test p-value: conds T vs U
DataFrame with 6 rows and 4 columns
                baseMean log2FoldChange     pvalue      padj
               <numeric>      <numeric>  <numeric> <numeric>
FBgn0000003    0.1594687     0.95577724 0.80202750        NA
FBgn0000008   52.2256776     0.02806414 0.92576489 0.9892560
FBgn0000014    0.3897080     0.74861167 0.81899159        NA
FBgn0000015    0.9053584    -0.81010553 0.67840751        NA
FBgn0000017 2358.2434078    -0.27580756 0.03285053 0.2400995
FBgn0000018  221.2415562    -0.11987673 0.50758039 0.8708435


# test for DE (Wald test, z=est/se{est})
res <- results(ddsCtrst, contrast=c("conds","T","U"))

# see results
# (partial columns here just for convenience)
head(res)[,c(1,2,5,6)]
hist(res$pvalue,xlab='raw P-value', cex.lab=1.5, cex.main=2,
     main='DESeq2, Wald test')

# check to explain missing p-values
t <- is.na(res$pvalue)
sum(t) # 2638, or about 18.2% here
boxplot(res$baseMean[t], cex=2, pch=16)
# -- almost always, only happens
#    for undetected genes

# define sig DE genes
# ("fdr" in p.adjust is the Benjamini-Hochberg adjustment)
padj <- p.adjust(res$pvalue, "fdr")
t <- padj < .05 & !is.na(padj)
gn.sig <- rownames(res)[t]
length(gn.sig) # 561
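For readers unfamiliar with the adjustment used to define the significant genes, the sketch below writes out the Benjamini-Hochberg calculation step by step and compares it with `p.adjust()`. The helper `bh_adjust` and the five p-values are made up for illustration, and the helper assumes there are no missing p-values.

```r
# Hypothetical helper: Benjamini-Hochberg adjustment written out by hand
# (assumes p contains no NA values)
bh_adjust <- function(p) {
  n   <- length(p)
  o   <- order(p, decreasing = TRUE)     # largest p-value first
  adj <- cummin(p[o] * n / (n:1))        # p_(i) * n / i, forced to be monotone
  pmin(1, adj)[order(o)]                 # cap at 1 and restore the input order
}

p <- c(0.001, 0.02, 0.03, 0.2, 0.8)      # made-up raw p-values

cbind(by_hand = bh_adjust(p),
      fdr     = p.adjust(p, "fdr"),
      BH      = p.adjust(p, "BH"))
```

The "fdr" and "BH" method names in `p.adjust()` give identical results, so the two spellings used in these notes refer to the same adjustment.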


# check p-value peak nearer 1
counts <- rowMeans(eset)
t <- res$pvalue > 0.8 & !is.na(res$pvalue)
par(mfrow=c(2,2))
hist(log(counts[t]), xlab='[logged] mean count',
     main='Genes with largest p-values')
hist(log(counts[!t]), xlab='[logged] mean count',
     main='Genes with NOT largest p-values')
# -- tends to be genes with smaller overall counts


Same example, but with extra covariate

3 samples “treated” by knock-down of “pasilla” gene, 4 samples “untreated”

Of 3 “treated” samples, 1 was “single-read” and 2 were “paired-end” types

Of 4 “untreated” samples, 2 were “single-read” and 2 were “paired-end” types

               TS1   TP1   TP2   US1   US2   UP1   UP2
FBgn0000003      0     1     1     0     0     0     0
FBgn0000008    118   139    77    89   142    84    76
FBgn0000014      0    10     0     1     1     0     0
FBgn0000015      0     0     0     0     0     1     2
FBgn0000017   4852  4853  3710  4640  7754  4026  3425
FBgn0000018    572   497   322   552   663   272   321


# load data; recall eset object from previous slides
# relabel samples: T/U = treated/untreated, S/P = single-read/paired-end
colnames(eset) <- c('TS1','TP1','TP2','US1','US2','UP1','UP2')
head(eset)

# format data and fit model
countsTable <- eset
rownames(countsTable) <- rownames(eset)
trt <- c("T","T","T","U","U","U","U")
type <- c("S","P","P","S","S","P","P")
cframe <- data.frame(trt, type)
# design includes both treatment and library type
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe,
                              design = ~ trt + type)
ddsCtrst <- DESeq(dds)
# test the treatment effect, adjusting for type
res <- results(ddsCtrst, contrast=c("trt","T","U"))
pvals <- res$pvalue

# Visualize sig. results
par(mfrow=c(1,1))
hist(pvals, xlab='Raw p-value', cex.lab=1.5, cex.main=2,
     main='Test trt effect while accounting for type')


# Get sig. genes
adj.pvals <- p.adjust(pvals, "BH")
t <- adj.pvals < .05 & !is.na(adj.pvals)
sum(t) # 708
sig.gn <- rownames(eset)[t]

# Visualize sig. genes
library(RColorBrewer)
small.eset <- eset[t,]
hmcol <- colorRampPalette(brewer.pal(9,"Reds"))(256)
# column side colors: dark red = treated, light red = untreated
csc <- rep(hmcol[250],ncol(small.eset))
csc[trt=="U"] <- hmcol[10]
heatmap(small.eset,scale="row",col=hmcol,
        ColSideColors=csc, cexCol=2.5,
        main=paste(sum(t),'Sig. Genes'))


Summary

Test count (RNA-Seq) data using Negative Binomial distribution (DESeq2 approach, using contrasts), pooling information across genes

What next?

  Adjust for multiple testing

  Filtering (to increase statistical power): zero-count genes?

  Visualization: Heatmaps / clustering / PCA biplot / others

  Characterize significant genes (annotations)