Utah State University – Fall 2017
Statistical Bioinformatics (Biomedical Big Data)
Notes 6
Testing High-Dimensional Count
(RNA-Seq) Data for Differential Expression
References
Anders & Huber (2010), “Differential Expression Analysis for
Sequence Count Data”, Genome Biology 11:R106
DESeq2 Bioconductor package vignette, obtained in R using
vignette("DESeq2")
Kvam, Liu, and Si (2012), “A comparison of statistical
methods for detecting differentially expressed genes from
RNA-seq data”, Am. J. of Botany 99(2):248-256.
Love, Huber, and Anders (2014), “Moderated estimation of
fold change and dispersion for RNA-seq data with DESeq2”,
Genome Biology 15(12):550.
3
Example – 3 treated vs. 4 untreated;
read counts (RNA-Seq) for 14,470 genes
Published 2010 (Brooks et al., Genome Research)
Drosophila melanogaster
3 samples “treated” by knock-down of “pasilla” gene (thought to be involved in regulation of splicing)
              T1   T2   T3   U1   U2   U3   U4
FBgn0000003    0    1    1    0    0    0    0
FBgn0000008  118  139   77   89  142   84   76
FBgn0000014    0   10    0    1    1    0    0
FBgn0000015    0    0    0    0    0    1    2
FBgn0000017 4852 4853 3710 4640 7754 4026 3425
FBgn0000018  572  497  322  552  663  272  321
# load data
library(pasilla); data(pasillaGenes)
library(DESeq)
eset <- counts(pasillaGenes)
colnames(eset) <- c('T1','T2','T3','U1','U2','U3','U4')
head(eset)
Consider per-gene tests:
t-test
Nonparametric: Wilcoxon Rank Sum
The per-gene t-test fails for a gene whose counts are constant within each group:
Error in t.test.default(x = c(2L, 2L, 2L, 2L), y = c(1L, 1L, 1L)) :
  data are essentially constant
T1 T2 T3 U1 U2 U3 U4
 1  1  1  2  2  2  2
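# try a per-gene t-test
trt <- c(1,1,1,0,0,0,0)
pvals <- rep(NA, nrow(eset))
for(i in 1:nrow(eset))
{
  x <- eset[i,]
  a1 <- t.test(x ~ trt)
  pvals[i] <- a1$p.value
}
# the loop stops with the "data are essentially constant" error;
# i holds the index of the gene that caused it
i # 1687
eset[i,]
#T1 T2 T3 U1 U2 U3 U4
# 1  1  1  2  2  2  2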
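The t-test loop stops at the first gene whose counts are constant within each group. A minimal sketch of one way to keep going, assuming it is acceptable to record NA for such genes. The helper name `safe_t_pvalue` and the two-gene toy matrix are hypothetical (row g1 copies FBgn0000008 from the example table, row g2 mimics the problem gene); this is not part of the course code.

```r
# Hypothetical helper: per-gene t-test p-value, NA when t.test() fails
safe_t_pvalue <- function(x, trt) {
  tryCatch(t.test(x[trt == 1], x[trt == 0])$p.value,
           error = function(e) NA_real_)
}

# Toy count matrix: g1 has variation, g2 is constant within each group
trt <- c(1, 1, 1, 0, 0, 0, 0)
toy <- rbind(g1 = c(118, 139, 77, 89, 142, 84, 76),
             g2 = c(1, 1, 1, 2, 2, 2, 2))

# One p-value per row; the failing gene yields NA instead of an error
toy_pvals <- apply(toy, 1, safe_t_pvalue, trt = trt)
toy_pvals
```

The same wrapper could replace the body of the loop over `eset`, leaving NA in `pvals` wherever the test is undefined.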
# try a per-gene Wilcoxon rank sum test (allowing for ties)
library(coin)
pvals <- rep(NA,nrow(eset))
for(i in 1:nrow(eset)) # This takes a few minutes
{
x <- eset[i,]
a1 <- wilcox_test(x~as.factor(trt))
pvals[i] <- pvalue(a1)
}
hist(pvals, main='Pvalues from Wilcoxon Rank Sum Test',
cex.main=2, cex.lab=1.5)
Consider data as counts (Poisson regression)
On a per-gene basis:
Let $N_i$ = # of total fragments counted in sample $i$
Let $p_i$ = P{ fragment matches to gene in sample $i$ }
Observed # of total reads for gene in sample $i$:
$R_i \sim \text{Poisson}(N_i p_i)$
$E[R_i] = \text{Var}[R_i] = N_i p_i$
Let $T_i$ = indicator of trt. status (0/1) for sample $i$
Assume $\log(p_i) = \beta_0 + \beta_1 T_i$
Test for DE using $H_0: \beta_1 = 0$
Poisson Regression
$E[R_i] = N_i p_i = N_i \exp(\beta_0 + \beta_1 T_i)$
$\log(E[R_i]) = \log N_i + \beta_0 + \beta_1 T_i$
$\beta_0, \beta_1$: estimate the β’s using an iterative MLE procedure
$\log N_i$: not interesting, but important; call this the “offset”;
often considered the “exposure” for sample $i$
(a quasi-normalization to scale overall genomic material)
Do this for one gene in R (here, gene 2):
trt <- c(1,1,1,0,0,0,0)
R <- eset[2,]
lExposure <- log(colSums(eset))
a1 <- glm(R ~ trt, family=poisson, offset=lExposure)
summary(a1)
Call:
glm(formula = R ~ trt, family = poisson, offset = lExposure)
Deviance Residuals:
T1 T2 T3 U1 U2 U3 U4
0.3690 0.4516 -0.9047 -0.7217 0.5862 2.3048 -2.5286
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.85250 0.06804 -174.19 <2e-16 ***
trt 0.05875 0.10304 0.57 0.569
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 14.053 on 6 degrees of freedom
Residual deviance: 13.729 on 5 degrees of freedom
AIC: 58.17
Number of Fisher Scoring iterations: 4
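As a check on how to read the coefficient table, the Wald statistic and p-value for trt can be recomputed from the printed estimate and standard error. This sketch only redoes the arithmetic with the rounded values shown in the output; the variable names are illustrative.

```r
# Values copied from the summary(a1) output
est <- 0.05875   # estimate of beta_1 (trt)
se  <- 0.10304   # its standard error

z <- est / se                # Wald z statistic
p <- 2 * pnorm(-abs(z))      # two-sided p-value from the standard normal
rate_ratio <- exp(est)       # estimated treated/untreated ratio of p_i

round(c(z = z, p = p, rate_ratio = rate_ratio), 3)
```

The recomputed z and p agree with the `z value` and `Pr(>|z|)` columns up to rounding, and the rate ratio close to 1 is in line with the non-significant test for this gene.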
Do this for all genes …
jackpot?
Possible (frequent) problem: overdispersion
Recall [implicit] assumption for Poisson dist’n:
$E[R_i] = \text{Var}[R_i] = N_i p_i$
It can sometimes happen that $\text{Var}[R_i] > E[R_i]$
common check: add a scale (or dispersion) parameter $\sigma$
$\text{Var}[R_i] = \sigma^2 E[R_i]$
Estimate $\sigma^2$ as $\chi^2/df$
Deviance $\chi^2$, a goodness of fit statistic:
$$\chi^2_D = 2 \sum_i R_i \log\left( \frac{R_i}{\hat{R}_i} \right)$$
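To make the deviance statistic concrete, the sketch below computes 2 Σ R_i log(R_i / R̂_i) by hand for one simulated gene and compares it with the value reported by `glm`. The library sizes and the rate 1e-5 are made-up assumptions; the scale estimate follows the sqrt(deviance / df) convention of the course code.

```r
set.seed(1)
trt <- c(1, 1, 1, 0, 0, 0, 0)
N   <- c(9e6, 1.1e7, 8e6, 1e7, 1.2e7, 9e6, 8e6)  # made-up library sizes
R   <- rpois(7, lambda = N * 1e-5)                # simulated counts, one gene

fit   <- glm(R ~ trt, family = poisson, offset = log(N))
R_hat <- fitted(fit)                              # fitted means (offset included)

# 2 * sum(R * log(R / R_hat)); a zero count contributes 0 to the sum
dev_by_hand <- 2 * sum(ifelse(R > 0, R * log(R / R_hat), 0))
sigma_hat   <- sqrt(dev_by_hand / df.residual(fit))   # scale estimate

c(by_hand = dev_by_hand, glm = deviance(fit), sigma_hat = sigma_hat)
```

The two deviance values agree because, with an intercept in the model, the fitted means sum to the observed counts, so the remaining term of the full Poisson deviance vanishes.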
# Poisson regression for all genes, checking for overdispersion
Poisson.p <- scale <- rep(NA, nrow(eset))
lExposure <- log(colSums(eset))
trt <- c(1,1,1,0,0,0,0)
## this next part takes about 1.5 minutes
print(date())
for(i in 1:nrow(eset))
{
  count <- eset[i,]
  a1 <- glm(count ~ trt, family=poisson, offset=lExposure)
  Poisson.p[i] <- summary(a1)$coeff[2,4]      # p-value for trt
  scale[i] <- sqrt(a1$deviance/a1$df.resid)   # scale estimate
}
print(date())
par(mfrow=c(2,2))
hist(Poisson.p, main='Poisson', xlab='raw P-value')
boxplot(scale, main='Poisson', xlab='scale estimate')
abline(h=1, lty=2)
mean(scale > 1)
# 0.640152
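To see what the scale check looks like when the Poisson assumption does and does not hold, here is a small simulation sketch. Everything in it is an assumption for illustration: 200 genes, a common mean of 100, equal library sizes (so no offset), no treatment effect, and negative binomial counts with `size = 2` as the overdispersed case. The helper `scale_est` is hypothetical.

```r
set.seed(2)
trt    <- c(1, 1, 1, 0, 0, 0, 0)
n_gene <- 200
mu     <- 100   # made-up common mean, no treatment effect

# Hypothetical helper: sqrt(deviance / df) from a per-gene Poisson fit
scale_est <- function(count) {
  fit <- glm(count ~ trt, family = poisson)
  sqrt(deviance(fit) / df.residual(fit))
}

pois_counts <- matrix(rpois(n_gene * 7, mu), nrow = n_gene)
nb_counts   <- matrix(rnbinom(n_gene * 7, mu = mu, size = 2), nrow = n_gene)

scale_pois <- apply(pois_counts, 1, scale_est)
scale_nb   <- apply(nb_counts, 1, scale_est)

# Share of genes with scale estimate above 1 under each model
c(pois = mean(scale_pois > 1), nb = mean(scale_nb > 1))
```

Under true Poisson data a fair share of genes still lands above 1 by chance, so it is the size of the excess, not the mere presence of values above 1, that points to overdispersion.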
Can use alternative distribution:
edgeR package does this:
For each gene: $R_i \sim$ NegativeBinomial
(number of indep. Bernoulli trials to achieve a fixed
number of successes)
Let $\mu_i = E[R_i]$, and $v_i = \text{Var}[R_i]$
But low sample sizes prevent reliable estimation of
$\mu_i$ and $v_i$
Assume $v_i = \mu_i + \alpha \mu_i^2$
estimate $\alpha$ by pooling information across genes
then only one parameter must be estimated for each gene
But the DESeq2 package improves on this
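The mean-variance assumption v = μ + α μ² can be checked numerically with base R's `rnbinom`, whose `size` argument plays the role of 1/α in this parameterization. The values of μ and α below are made up for illustration.

```r
set.seed(3)
mu    <- 50    # made-up mean
alpha <- 0.2   # made-up dispersion

# rnbinom(mu =, size =) has variance mu + mu^2 / size, so size = 1 / alpha
x <- rnbinom(1e5, mu = mu, size = 1 / alpha)

c(sample_mean = mean(x),
  sample_var  = var(x),
  model_var   = mu + alpha * mu^2)
```

With α = 0 the variance would reduce to the Poisson case v = μ; the extra α μ² term is what lets the model absorb overdispersion.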
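Negative Binomial (NB) using DESeq2 …
Define trt. condition of sample $i$: $\rho(i)$
Define # of fragment reads in sample $i$ for gene $k$: $R_{ki}$
$$R_{ki} \sim NB(\mu_{ki}, \sigma^2_{ki})$$
Assumptions in estimating $\mu_{ki}$ and $\sigma^2_{ki}$:
$$\mu_{ki} = q_{k,\rho(i)} \, s_i$$
$$\sigma^2_{ki} = \mu_{ki} + s_i^2 \, v_{k,\rho(i)}$$
$$v_{k,\rho(i)} = v_\rho(q_{k,\rho(i)})$$
$s_i$: library size, prop. to coverage [exposure] in sample $i$
$q_{k,\rho(i)}$: per-gene abundance, prop. to true conc. of fragments
$\mu_{ki}$ (first term of $\sigma^2_{ki}$): “shot noise”; this “dominates” for low-expressed genes
$s_i^2 \, v_{k,\rho(i)}$: raw variance (biological variability)
$v_\rho$: smooth function; pool information across genes to estimate variance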
Estimate parameters (for NB distn.)
m = # samples; n = # genes
$$\hat{s}_i = \mathrm{median}_k \; \frac{R_{ki}}{\left( \prod_{j=1}^{m} R_{kj} \right)^{1/m}}$$
denom. is geometric mean across samples,
like a pseudo-reference sample
$\hat{s}_i$ is essentially equivalent to $\sum_k R_{ki}$,
with robustness against very large $R_{ki}$ for some $k$
For median calculation, skip genes where geometric mean (denom) is zero.
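The size-factor estimate described on this slide (median over genes of each count divided by the gene's geometric mean across samples) can be sketched in a few lines of base R. The toy matrix is made up: columns B and C are exactly 2x and 4x column A, and one extra gene with a zero count is included to show the skipping rule. This is an illustration of the formula, not the DESeq2 source.

```r
# Toy counts (made-up): B and C are 2x and 4x column A,
# plus one gene with a zero count that gets skipped
base   <- c(10, 20, 30, 40)
counts <- cbind(A = base, B = 2 * base, C = 4 * base)
counts <- rbind(counts, c(0, 5, 3))

# Geometric mean across samples for each gene (the pseudo-reference sample);
# it is 0 whenever any count in the row is 0
geo_mean <- exp(rowMeans(log(counts)))
keep     <- geo_mean > 0

# Size factor for sample i: median over kept genes of R_ki / geometric mean
size_factors <- apply(counts[keep, , drop = FALSE], 2,
                      function(col) median(col / geo_mean[keep]))
size_factors
```

Because the columns differ only by a constant multiple, the size factors come out in the ratio 1:2:4, which is the scaling a library-size normalization should recover.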
Estimate parameters (for NB distn.)
$m_\rho$ = # samples in trt. condition $\rho$
$$\hat{q}_{k,\rho} = \frac{1}{m_\rho} \sum_{i:\,\rho(i)=\rho} \frac{R_{ki}}{\hat{s}_i}$$
this is the mean of the standardized counts from the
samples in treatment condition $\rho$
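Estimate parameters (for NB distn.)
$$\hat{w}_{k,\rho} = \frac{1}{m_\rho - 1} \sum_{i:\,\rho(i)=\rho} \left( \frac{R_{ki}}{\hat{s}_i} - \hat{q}_{k,\rho} \right)^2$$
(this is the variance of the standardized counts from the samples in trt. condition ρ)
$$z_{k,\rho} = \frac{\hat{q}_{k,\rho}}{m_\rho} \sum_{i:\,\rho(i)=\rho} \frac{1}{\hat{s}_i}$$
(an un-biasing constant)
Estimate function $w_\rho$ by plotting $\hat{w}_{k,\rho}$ vs. $\hat{q}_{k,\rho}$, and use
parametric dispersion-mean relation:
$$w_\rho(\hat{q}_{k,\rho}) = \alpha_0 + \alpha_1 / \hat{q}_{k,\rho}$$
($\alpha_0$ is “asymptotic dispersion”; $\alpha_1$ is “extra Poisson”)
$$\hat{v}_\rho(\hat{q}_{k,\rho}) = \max\{\hat{w}_{k,\rho},\; w_\rho(\hat{q}_{k,\rho})\} - z_{k,\rho}$$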
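The per-gene quantities on this slide (mean and variance of the standardized counts within one treatment condition, and the un-biasing constant) can be computed directly. The sketch below assumes a made-up 2-gene, 3-sample count matrix for a single condition and made-up size factors; the names `std`, `q_hat`, `w_hat`, and `z` are illustrative.

```r
# Toy counts for one treatment condition (made-up): 2 genes x 3 samples
counts <- rbind(g1 = c(10, 20, 10),
                g2 = c(30, 90, 60))
s_hat  <- c(1, 2, 1)      # assumed size factors for the 3 samples
m_rho  <- ncol(counts)    # number of samples in this condition

std   <- sweep(counts, 2, s_hat, "/")      # standardized counts R_ki / s_i
q_hat <- rowMeans(std)                     # mean of standardized counts
w_hat <- apply(std, 1, var)                # variance, divisor m_rho - 1
z     <- q_hat / m_rho * sum(1 / s_hat)    # un-biasing constant

cbind(q_hat, w_hat, z)
```

For gene g1 the standardized counts are all equal, so its empirical variance is zero and falls below z; this is the kind of gene for which the fitted curve, rather than the per-gene estimate, ends up determining the variance.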
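Estimating Dispersion in DESeq2
1. Estimate dispersion value $\hat{w}_k$ for each gene
2. Fit for each condition (or pooled conditions
[default]) a curve through estimates (in the
$\hat{w}_k$ vs. $\hat{q}_k$ plot)
3. Assign to each gene a dispersion value, using the
maximum of the estimated [empirical] value $\hat{w}_k$
or the fitted value $w(\hat{q}_k)$; this conservative approach avoids under-estimating
dispersion (which would increase false positives)
Note: this maximum rule is the procedure of the original DESeq package. DESeq2 (Love, Huber, and Anders 2014) instead shrinks each gene-wise estimate toward the fitted curve and keeps the gene-wise value only for dispersion outliers.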
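The three steps on this slide can be sketched on simulated values. This is only an illustration of the "fit a curve, then take the maximum" idea as the slide states it: the curve is fitted with ordinary least squares via `lm`, which is a simple stand-in and not the fitting routine DESeq2 uses, and the true coefficients 0.05 and 2 and the gamma noise are made-up assumptions.

```r
set.seed(4)
n_gene <- 500
q_hat  <- exp(runif(n_gene, log(5), log(5000)))         # simulated mean levels
w_true <- 0.05 + 2 / q_hat                              # made-up alpha_0, alpha_1
w_hat  <- w_true * rgamma(n_gene, shape = 5, rate = 5)  # noisy gene-wise values

# Step 2: least-squares stand-in for the curve w = alpha_0 + alpha_1 / q
fit      <- lm(w_hat ~ I(1 / q_hat))
w_fitted <- unname(fitted(fit))

# Step 3: keep the larger of the empirical and fitted values
w_final <- pmax(w_hat, w_fitted)

coef(fit)               # estimates of alpha_0 and alpha_1
mean(w_final > w_hat)   # share of genes moved up to the fitted curve
```

Genes whose empirical value sits below the curve are raised to it, while genes above the curve keep their own estimate, which is the conservative behaviour the slide describes.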
Getting started with DESeq2 package
Data in this format (the count table in the earlier example, slide 3)
Integer counts in matrix form, with columns for samples and
rows for genes
Row names correspond to genes (or genomic regions, at least)
See package vignette for suggestions on how to get to this
format (including from sequence alignments and annotation)
Can use read.csv or read.table functions to read in text files
Each column is a biological replicate
If you have technical replicates, sum them together to get a single column
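Summing technical replicates into one column per biological sample can be done in base R with `rowsum`. The count matrix, column names, and grouping vector below are made up for illustration.

```r
# Toy counts (made-up): two biological samples, two technical runs each
counts <- cbind(A_run1 = c(5, 0, 12), A_run2 = c(7, 1, 10),
                B_run1 = c(3, 2, 20), B_run2 = c(4, 0, 25))
rownames(counts) <- c("gene1", "gene2", "gene3")
bio_sample <- c("A", "A", "B", "B")   # biological sample of each column

# rowsum() adds rows by group, so transpose, sum, and transpose back
collapsed <- t(rowsum(t(counts), group = bio_sample))
collapsed
```

The result keeps genes in rows and has one integer-count column per biological sample, which is the layout the package expects.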
# format data
library(DESeq2)
countsTable <- eset # counts table needs
# gene IDs in row names
rownames(countsTable) <- rownames(eset)
dim(countsTable) # 14470 genes, 7 samples
conds <- c("T","T","T","U","U","U","U")
# 3 treated, 4 untreated; put in data.frame:
cframe <- data.frame(conds)
# Fit DESeq model (after formatting object):
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe,
design = ~ conds)
ddsCtrst <- DESeq(dds)
# check quality of dispersion estimation
par(mfrow=c(1,1))
plotDispEsts(ddsCtrst, cex.lab=1.5)
Checking Quality of Dispersion Estimation
Plot $\hat{w}_k$ vs. $\hat{q}_k$ (both axes log-scale here)
Add fitted line for $w(\hat{q}_k)$
Check that fitted line is roughly appropriate general trend
Test for DE between conditions
Based on contrasts (coming more formally in Notes 7, slides 14-20)
Peak near zero: DE genes
Peak nearer one: low-count genes (?)
Default adjustment: BH FDR (?)
log2 fold change (MLE): conds T vs U
Wald test p-value: conds T vs U
DataFrame with 6 rows and 4 columns
baseMean log2FoldChange pvalue padj
<numeric> <numeric> <numeric> <numeric>
FBgn0000003 0.1594687 0.95577724 0.80202750 NA
FBgn0000008 52.2256776 0.02806414 0.92576489 0.9892560
FBgn0000014 0.3897080 0.74861167 0.81899159 NA
FBgn0000015 0.9053584 -0.81010553 0.67840751 NA
FBgn0000017 2358.2434078 -0.27580756 0.03285053 0.2400995
FBgn0000018 221.2415562 -0.11987673 0.50758039 0.8708435
# test for DE (Wald test, z=est/se{est})
res <- results(ddsCtrst, contrast=c("conds","T","U"))
# see results
# (partial columns here just for convenience)
head(res)[,c(1,2,5,6)]
hist(res$pvalue,xlab='raw P-value', cex.lab=1.5, cex.main=2,
main='DESeq2, Wald test')
# check to explain missing p-values
t <- is.na(res$pvalue)
sum(t) # 2638, or about 18.2% here
boxplot(res$baseMean[t], cex=2, pch=16)
# -- almost always, only happens
# for undetected genes
# define sig DE genes
padj <- p.adjust(res$pvalue, "fdr")
t <- padj < .05 & !is.na(padj)
gn.sig <- rownames(res)[t]
length(gn.sig) # 561
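The FDR adjustment applied by `p.adjust` (where "fdr" is an alias for "BH") can be reproduced by hand, which helps show what the adjusted values mean. The raw p-values below are made up for illustration.

```r
# Made-up raw p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
n <- length(p)

o  <- order(p, decreasing = TRUE)   # largest p-value first
ro <- order(o)                      # permutation that restores input order

# p_(j) * n / j, made monotone with a running minimum from the largest down
bh <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

cbind(raw = p, bh = bh, p.adjust = p.adjust(p, "BH"))
```

Each adjusted value is the smallest FDR level at which that gene would be declared significant, so thresholding the adjusted values at .05 matches the step that defines the significant genes.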
# check p-value peak nearer 1
counts <- rowMeans(eset)
t <- res$pvalue > 0.8 & !is.na(res$pvalue)
par(mfrow=c(2,2))
hist(log(counts[t]), xlab='[logged] mean count',
main='Genes with largest p-values')
hist(log(counts[!t]), xlab='[logged] mean count',
main='Genes with NOT largest p-values')
# -- tends to be genes with smaller overall counts
Same example, but with extra covariate
3 samples “treated” by knock-down of “pasilla” gene, 4 samples “untreated”
Of 3 “treated” samples, 1 was “single-read” and 2 were “paired-end” types
Of 4 “untreated” samples, 2 were “single-read” and 2 were “paired-end” types
TS1 TP1 TP2 US1 US2 UP1 UP2
FBgn0000003 0 1 1 0 0 0 0
FBgn0000008 118 139 77 89 142 84 76
FBgn0000014 0 10 0 1 1 0 0
FBgn0000015 0 0 0 0 0 1 2
FBgn0000017 4852 4853 3710 4640 7754 4026 3425
FBgn0000018 572 497 322 552 663 272 321
# load data; recall eset object from previous slides
colnames(eset) <- c('TS1','TP1','TP2','US1','US2','UP1','UP2')
head(eset)
# format data and fit model
countsTable <- eset
rownames(countsTable) <- rownames(eset)
trt <- c("T","T","T","U","U","U","U")
type <- c("S","P","P","S","S","P","P")
cframe <- data.frame(trt, type)
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe,
design = ~ trt + type)
ddsCtrst <- DESeq(dds)
res <- results(ddsCtrst, contrast=c("trt","T","U"))
pvals <- res$pvalue
# Visualize sig. results
par(mfrow=c(1,1))
hist(pvals, xlab='Raw p-value', cex.lab=1.5, cex.main=2,
main='Test trt effect while accounting for type')
# Get sig. genes
adj.pvals <- p.adjust(pvals, "BH")
t <- adj.pvals < .05 & !is.na(adj.pvals)
sum(t) # 708
sig.gn <- rownames(eset)[t]
# Visualize sig. genes
library(RColorBrewer)
small.eset <- eset[t,]
hmcol <- colorRampPalette(brewer.pal(9,"Reds"))(256)
csc <- rep(hmcol[250],ncol(small.eset))
csc[trt=="U"] <- hmcol[10]
heatmap(small.eset,scale="row",col=hmcol,
ColSideColors=csc, cexCol=2.5,
main=paste(sum(t),'Sig. Genes'))
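To see what the design `~ trt + type` encodes, base R's `model.matrix` can display the corresponding columns. This is only a base R view of the formula, under the assumption that factor levels are ordered alphabetically; DESeq2 builds its own model matrix internally.

```r
trt    <- c("T", "T", "T", "U", "U", "U", "U")
type   <- c("S", "P", "P", "S", "S", "P", "P")
cframe <- data.frame(trt, type)

# Base R view of the design ~ trt + type: one row per sample
X <- model.matrix(~ trt + type, data = cframe)
X

# Samples per treatment-by-type cell
table(cframe$trt, cframe$type)
```

With alphabetical levels the treatment column is `trtU` (U relative to T), so its coefficient has the opposite sign to the T vs U contrast requested through `contrast=c("trt","T","U")`; the `typeS` column absorbs the single-read vs paired-end difference.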
Summary
Test count (RNA-Seq) data using Negative Binomial distribution (DESeq2 approach, using contrasts), pooling information across genes
What next?
Adjust for multiple testing
Filtering (to increase statistical power): zero-count genes?
Visualization: Heatmaps / clustering / PCA biplot / others
Characterize significant genes (annotations)