Robustness in count-based differential analysis of RNA-seq data
Mark D. Robinson
Institute of Molecular Life Sciences,
University of Zurich
Pasteur – 26 November 2013
Outline
• Fundamentals of edgeR: gene-level analyses
• Robustness/outliers: observation weights. CS: “Remove genes with outliers or use non-parametric methods.” Respectfully disagree
• Simulations
• Food for thought: differential splicing
Data analysis pipelines for RNA-seq differential expression
Nature Protocols September 2013 (preprint at http://arxiv.org/pdf/1302.3685v3.pdf)
NATURE PROTOCOLS | VOL.7 NO.3 | 2012 | 563
TopHat and Cufflinks are both operated through the UNIX shell. No graphical user interface is included. However, there are now commercial products and open-source interfaces to these and other RNA-seq analysis tools. For example, the Galaxy Project [18] uses a web interface to cloud computing resources to bring command-line–driven tools such as TopHat and Cufflinks to users without UNIX skills through the web and the computing cloud.
Alternative analysis packages
TopHat and Cufflinks provide a complete RNA-seq workflow, but there are other RNA-seq analysis packages that may be used instead of or in combination with the tools in this protocol. Many alternative read-alignment programs [19–21] now exist, and there are several alternative tools for transcriptome reconstruction [22,23], quantification [10,24,25] and differential expression [26–28] analysis. Because many of these tools operate on similarly formatted data files, they could be used instead of or in addition to the tools used here. For example, with straightforward postprocessing scripts, one could provide GSNAP [19] read alignments to Cufflinks, or use a Scripture [22] transcriptome reconstruction instead of a Cufflinks one before differential expression analysis. However, such customization is beyond the scope of this protocol, and we discourage novice RNA-seq users from making changes to the protocol outlined here.
This protocol is appropriate for RNA-seq experiments on organisms with sequenced reference genomes. Users working without a sequenced genome but who are interested in gene discovery should consider performing de novo transcriptome assembly using one of several tools such as Trinity [29], Trans-Abyss [30] or Oases (http://www.ebi.ac.uk/~zerbino/oases/). Users performing expression analysis with a de novo transcriptome assembly may wish to consider RSEM [10] or IsoEM [25]. For a survey of these tools (including TopHat and Cufflinks) readers may wish to see the study by Garber et al. [12], which describes their comparative advantages and disadvantages and the theoretical considerations that inform their design.
Overview of the protocol
Although RNA-seq experiments can serve many purposes, we describe a workflow that aims to compare the transcriptome profiles of two or more biological conditions, such as a wild-type versus mutant or control versus knockdown experiments. For simplicity, we assume that the experiment compares only two biological conditions, although the software is designed to support many more, including time-course experiments.
This protocol begins with raw RNA-seq reads and concludes with publication-ready visualization of the analysis. Figure 2 highlights the main steps of the protocol. First, reads for each condition are mapped to the reference genome with TopHat. Many RNA-seq users are also interested in gene or splice variant discovery, and the failure to look for new transcripts can bias expression estimates and reduce accuracy [8]. Thus, we include transcript assembly with Cufflinks as a step in the workflow (see Box 1 for a workflow that skips gene and transcript discovery). After running TopHat, the resulting alignment files are provided to Cufflinks to generate a transcriptome assembly for each condition. These assemblies are then merged together using the Cuffmerge utility, which is included with the Cufflinks package. This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition. The reads and the merged assembly are fed to Cuffdiff, which calculates expression levels and tests the statistical significance of observed changes. Cuffdiff also performs an additional layer of differential analysis. By grouping transcripts into biologically meaningful groups (such as transcripts that share the same transcription start site (TSS)), Cuffdiff identifies genes that are differentially regulated at the transcriptional or post-transcriptional level. These results are reported as a set of text files and can be displayed in the plotting environment of your choice.
We have recently developed a powerful plotting tool called CummeRbund (http://compbio.mit.edu/cummeRbund/), which provides functions for creating commonly used expression plots such as volcano, scatter and box plots. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website (http://www.bioconductor.org/).
This protocol does not require extensive bioinformatics expertise (e.g., the ability to write complex scripts), but it does assume familiarity with the UNIX command-line interface. Users should
• Bowtie: extremely fast, general purpose short read aligner
• TopHat: aligns RNA-Seq reads to the genome using Bowtie; discovers splice sites
• Cufflinks package
  • Cufflinks: assembles transcripts
  • Cuffcompare: compares transcript assemblies to annotation
  • Cuffmerge: merges two or more transcript assemblies
  • Cuffdiff: finds differentially expressed genes and transcripts; detects differential splicing and promoter use
• CummeRbund: plots abundance and differential expression results from Cuffdiff
Figure 1 | Software components used in this protocol. Bowtie [33] forms the algorithmic core of TopHat, which aligns millions of RNA-seq reads to the genome per CPU hour. TopHat’s read alignments are assembled by Cufflinks and its associated utility program to produce a transcriptome annotation of the genome. Cuffdiff quantifies this transcriptome across multiple conditions using the TopHat read alignments. CummeRbund helps users rapidly explore and visualize the gene expression data produced by Cuffdiff, including differentially expressed genes and transcripts.
NATURE BIOTECHNOLOGY VOLUME 29 NUMBER 7 JULY 2011 645
For transcriptome assembly, each path in the graph represents a possible transcript. A scoring scheme applied to the graph structure can rely on the original read sequences and mate-pair information to discard nonsensical solutions (transcripts) and compute all plausible ones.
Applying the scheme of de Bruijn graphs to de novo assembly of RNA-Seq data represents three critical challenges: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. In particular, sequencing errors would introduce a large number of false nodes, resulting in a massive graph with millions of possible (albeit mostly implausible) paths.
Here, we present Trinity, a method for the efficient and robust de novo reconstruction of transcriptomes, consisting of three software modules: Inchworm, Chrysalis and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. We evaluated Trinity on data from two well-annotated species, one microorganism (fission yeast) and one mammal (mouse), as well as an insect (the whitefly Bemisia tabaci), whose genome has not yet been sequenced. In each case, Trinity recovers most of the reference (annotated) expressed transcripts as full-length sequences, and resolves alternative isoforms and duplicated genes, performing better than other available transcriptome de novo assembly tools, and similarly to methods relying on genome alignments.
RESULTS
Trinity: a method for de novo transcriptome assembly
In contrast to de novo assembly of a genome, where few large connected sequence graphs can represent connectivities among reads across entire chromosomes, in assembling transcriptome data we expect to encounter numerous individual disconnected graphs, each representing the transcriptional complexity at nonoverlapping loci. Accordingly, Trinity partitions the sequence data into these many individual graphs, and then processes each graph independently to extract full-length isoforms and tease apart transcripts derived from paralogous genes.
In the first step in Trinity, Inchworm assembles reads into the unique sequences of transcripts. Inchworm (Fig. 1a) uses a greedy k-mer–based approach for fast and efficient transcript assembly, recovering only a single (best) representative for a set of alternative variants that share k-mers (owing to alternative splicing, gene duplication or allelic variation). Next, Chrysalis (Fig. 1b) clusters related contigs that correspond to portions of alternatively spliced transcripts or otherwise unique portions of paralogous genes. Chrysalis then constructs a de Bruijn graph for each cluster of related contigs, each graph reflecting the complexity of overlaps between variants. Finally, Butterfly (Fig. 1c) analyzes the paths taken by reads and read pairings in the context of the corresponding de Bruijn graph and reports all plausible transcript sequences, resolving alternatively spliced isoforms and transcripts derived from paralogous genes. Below, we describe each of Trinity’s modules.
Inchworm assembles contigs greedily and efficiently
Inchworm efficiently reconstructs linear transcript contigs in six steps (Fig. 1a). Inchworm (i) constructs a k-mer dictionary from all sequence reads (in practice, k = 25); (ii) removes likely error-containing k-mers from the k-mer dictionary; (iii) selects the most frequent k-mer in the dictionary to seed a contig assembly, excluding both low-complexity
[Figure 1 panel labels. (a) Read set; extend in k-mer space and break ties; linear sequences. (b) Overlap linear sequences by overlaps of k – 1 to build graph components; de Bruijn graph (k = 5). (c) Compact graph; compact graph with reads; transcripts (steps: compacting, finding paths, extracting sequences).]
Figure 1 Overview of Trinity. (a) Inchworm assembles the read data set (short black lines, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (color lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs (colored lines) if they share at least one k – 1-mer and if reads span the junction between contigs, and then it builds individual de Bruijn graphs from each pool. (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript represented in the graph (bottom, colored sequences).
cufflinks, cuffdiff Trinity edgeR, DESeq
Technical replication versus biological replication
Sample 1
Sample 2
Independent DNA populations from same experimental condition
Mean-Variance relationship in real data
mean=variance (Poisson assumption)
data from Parikh et al. Genome Bio 2010 data from Marioni et al. Gen Res 2008
[Figure panels: Technical replicates; Biological replicates. Axes: Variance versus Mean.]
Davis McCarthy
Model assumptions
Poisson describes technical variation:
Yij ~ Pois( Mj * λij )
mean(Yij)= variance(Yij) = Mj * λij
Negative binomial models biological variability using the dispersion parameter ϕ:
Yij ~ NB( μij=Mj * λij , ϕi )
Same mean, variance is quadratic in the mean:
variance( Yij ) = μij ( 1 + μij ϕi )
Mj = library size; λij = relative abundance of feature i
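The two variance functions above can be compared numerically. The following is a minimal Python sketch, not taken from the talk; the function names and the example values (dispersion 0.1, means from 1 to 1000) are illustrative assumptions.

```python
def poisson_variance(mu):
    """Poisson model: the variance equals the mean."""
    return mu


def nb_variance(mu, phi):
    """Negative binomial model: variance = mu * (1 + mu * phi)."""
    return mu * (1.0 + mu * phi)


if __name__ == "__main__":
    phi = 0.1  # hypothetical dispersion, chosen only for illustration
    for mu in (1.0, 10.0, 100.0, 1000.0):
        # The gap between the two variances grows quadratically with the mean.
        print(mu, poisson_variance(mu), nb_variance(mu, phi))
```

With a dispersion of zero the negative binomial variance reduces to the Poisson variance, which is why technical replicates can sit on the mean = variance line while biological replicates rise above it.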
Tag ID              A1    A2    A3    A4    B1    B2    B3
ENSG00000124208    478   619   628   744   483   716   240
ENSG00000182463     27    20    27    26    48    55    24
ENSG00000125835    132   200   200   228   560   408   103
ENSG00000125834     42    60    72    86   131    99    30
ENSG00000197818     21    29    35    31    52    44    20
ENSG00000125831      0     0     2     0     0     0     0
ENSG00000215443      4     4     4     0     9     7     4
ENSG00000222008     30    23    29    19     0     0     0
ENSG00000101444     46    63    58    71    54    53    17
ENSG00000101333   2256  2793  3456  3362  2702  2976  1320
…
Critical parameter to estimate: dispersion
edgeR dispersion estimation: moderate towards trend
Data: Tuch et al., 2008
Mouse hematopoietic stem cells (Samir Taoudi)
Mouse lymphomas (Stan Lee)
Advantage: share information, but genes are allowed to have their own variance.
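The idea of moderating a gene-wise dispersion towards a trend can be illustrated with a deliberately simplified sketch. This is not edgeR's estimator: here the moderation is assumed to be a weighted average on the log scale, with a hypothetical `prior_df` controlling how strongly each gene is pulled towards the trend. All names and values are assumptions for illustration only.

```python
import math


def moderate_dispersion(genewise, trend, residual_df, prior_df):
    """Shrink a gene-wise dispersion towards the trended value.

    Simplifying assumption: a weighted average of log-dispersions,
    where prior_df acts like extra observations assigned to the trend.
    """
    w = prior_df / (prior_df + residual_df)
    log_moderated = w * math.log(trend) + (1.0 - w) * math.log(genewise)
    return math.exp(log_moderated)


if __name__ == "__main__":
    # prior_df = 0 keeps the gene's own estimate; a large prior_df approaches the trend.
    for prior_df in (0.0, 4.0, 400.0):
        print(prior_df, moderate_dispersion(0.4, 0.1, residual_df=4.0, prior_df=prior_df))
```

The sketch shows the trade-off named on the slide: information is shared through the trend, yet each gene keeps some influence from its own data.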
Davis McCarthy
Flexibility for various experimental designs: Generalized linear modeling
Response is negative binomial with the dispersion held fixed (so that the distribution belongs to the exponential family).
Link function: relates the mean of the response to a linear combination of parameters.
For example:
Applicability to a wide range of designs
X – design matrix; ln() – link function; β – parameters
McCarthy et al. 2012, NAR
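A minimal sketch of how a log link connects the design matrix X and the parameters β to the fitted means. It assumes, following μij = Mj λij from the model slide, that the log library size enters as an offset, so that ln(μ) = Xβ + ln(M). The function name, design and coefficient values are hypothetical.

```python
import math


def fitted_means(design, beta, lib_sizes):
    """Fitted means for one feature under a log link.

    Assumed form: mu_j = M_j * exp(x_j . beta), i.e. ln(mu_j) = x_j . beta + ln(M_j).
    """
    means = []
    for row, lib_size in zip(design, lib_sizes):
        eta = sum(x * b for x, b in zip(row, beta))  # linear predictor
        means.append(lib_size * math.exp(eta))
    return means


if __name__ == "__main__":
    # Two-group design: intercept plus a group indicator (illustrative values).
    design = [[1, 0], [1, 0], [1, 1], [1, 1]]
    beta = [math.log(1e-5), math.log(2.0)]  # baseline abundance, 2-fold change
    lib_sizes = [1e6, 2e6, 1e6, 2e6]
    print(fitted_means(design, beta, lib_sizes))
```

Because only the design matrix changes between experiments, the same fitting machinery covers paired, blocked and multi-factor designs.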
Challenge: edgeR can be sensitive to outliers
When the assumed model does not hold, our statistic is able to select significant features much more efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a reliable estimate of the FDR. On several real data sets, our method is able to find features that are expressed consistently higher in one class, and these are more likely to be biologically meaningful.
Moreover, the use of current parametric methods is limited in the outcome types that they can handle. Except for PoissonSeq [20], to our knowledge, existing methods can only be used for data with two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple-class outcomes, but not survival outcomes. Because of the complexity of parametric methods, it is often difficult to extend them to other types of outcomes. In contrast, our nonparametric method can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we developed (Section 2.2) eliminates the difference between sequencing depths of experiments, making it easy to generalize our method to other possible types of outcomes.
The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic for data with a two-class outcome and the associated resampling strategy, as well as a permutation plug-in method to estimate the false discovery rate (FDR). In Section 3, we study the performance of our nonparametric method on simulated data sets, and compare it with three available methods, edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as well as edgeR, PoissonSeq and DESeq on three real RNA-Seq data sets, and compare the list of features that are called as differentially expressed by different methods. In Section 5, we extend our nonparametric statistic to other types of outcomes, and show their performance on simulated data sets. Section 6 contains the discussion.
2 A nonparametric method for two-class data
2.1 Wilcoxon statistic
For Feature j, suppose that we have counts N1j, . . . , Nnj from either Class 1 or Class 2. Suppose Class k contains nk samples, k = 1, 2 and n1 + n2 = n. Let Ck = {i : Sample i is from Class k}, k = 1, 2. If the
[Figure 2 panels: miR-206 (No. 7 by edgeR); miR-133b (No. 10 by edgeR); miR-375 (No. 11 by edgeR). x-axis: Sample (0 to 60); y-axis: Scaled counts.]
Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial distributions. Each panel shows the counts from one miRNA in the Witten data [28]. These miRNAs are the 7th, 10th and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. The first 29 bars, coloured red, are samples from the one class, and the other 29 bars, coloured blue, are from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts.
4 Statistical Methods in Medical Research 0(0)
level. We put the FDR threshold at 0.05, and calculated the true false discovery rate as the fraction of the genes called significant at this level that were indeed false discoveries. Since NOISeq does not return a statistic that is recommended to use as an adjusted p-value or FDR estimate, it was excluded from this evaluation. For baySeq, EBSeq and ShrinkSeq, we imposed the desired threshold on the Bayesian FDR [28].
As above, when only 10% of the genes were DE, the direction of their regulation had little effect on the false discovery rate (simulation studies B_0^1250 and B_625^625, compare Figures 4A and 4B). The main difference between the two settings was seen for ShrinkSeq, whose FDR control was worse when all genes were regulated in the same direction. The high false discovery rate seen for ShrinkSeq can possibly be reduced by setting a non-zero value for the fold change threshold defining the null model. Also the variability of the baySeq performance was considerably reduced when there were both up- and downregulated genes among the DE ones. For the largest sample size (10 samples per group), ShrinkSeq, NBPSeq, EBSeq, edgeR and TSPM often found too many false positives. The remaining methods were essentially able to control the false discovery rate at the desired level under these conditions. A possible explanation for the high false discovery rates of NBPSeq is that the
[Figure 3, panels A to D: Type I error rate at p_nom < 0.05 (y-axis, 0.00 to 0.20) for DESeq, edgeR, vst, voom, TSPM and NBPSeq, each labelled with suffix .2, .5 or .10.]
Figure 3 Type I error rates. Type I error rates, for the six methods providing nominal p-values, in simulation studies B_0^0 (panel A), P_0^0 (panel B), S_0^0 (panel C) and R_0^0 (panel D). Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had overall a small effect. Including outliers with abnormally high counts (panels C and D) had a detrimental effect on the ability to control the type I error for edgeR and NBPSeq, while DESeq became slightly more conservative.
Soneson and Delorenzi BMC Bioinformatics 2013, 14:91, Page 8 of 18, http://www.biomedcentral.com/1471-2105/14/91
Li and Tibshirani, 2011
[Figure annotations: no outliers; presence of outliers. Axis: False positive rate.]
Soneson and Delorenzi, 2013
Why is robustness needed?
      NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
4004      0.0      1.9    178.1      0.0      0.5      0.0      0.0      0.0      0.0      0.0
2538      2.0      0.6    235.5      6.8     60.2      1.0      0.0      0.0      2.5      1.3
4962      3.5      0.6    429.5      1.0     35.9      0.0      0.4      0.0      0.0      4.7
7921      1.0      5.1     78.9      2.9      0.0      0.0      0.8      0.0      0.0      0.4
6115      0.0      1.3      0.0      1.9      0.0      0.5     46.1      0.0    100.1      1.3
5156     13.8      1.3     30.7      0.0      7.1      0.0      0.0      1.0      0.0      1.3
2527     23.7    111.0    228.8     77.0    129.5     10.0     45.3     27.4     26.3     19.1
1115      2.0     15.2   1074.8     19.5     13.2     10.0     29.6      0.0      1.3      5.5
3175      3.0      6.3    181.0      7.8      7.6      0.0      5.5      3.0      3.1      2.5
7951      1.0     12.1     35.9      0.0      1.0      1.0      0.0      1.0      0.0      0.0
7631      0.0      1.9      0.4      1.0      0.0      0.5     29.6      0.0     24.4      5.5
3437     24.6     31.1    167.0      4.9     21.2      4.5      8.3     10.1      8.1      0.4

           logFC    logCPM        LR        PValue           FDR
4004  -10.413038  4.186203  30.07924  4.147469e-08  0.0002239513
2538   -5.942865  4.963086  29.60406  5.299369e-08  0.0002239513
4962   -6.387829  5.576979  26.06085  3.308237e-07  0.0009320406
7921   -5.808379  3.183079  22.51927  2.080466e-06  0.0043960241
6115    5.746084  3.921353  21.37010  3.786299e-06  0.0064003595
5156   -4.573655  2.512035  20.13483  7.217026e-06  0.0101663841
2527   -2.154480  6.128702  18.44343  1.750229e-05  0.0211327628
1115   -4.575934  6.873996  18.14127  2.051076e-05  0.0211672325
3175   -3.843458  4.473754  17.71318  2.568407e-05  0.0211672325
7951   -4.786326  2.416892  17.66324  2.636730e-05  0.0211672325
7631    4.311717  2.683367  17.57990  2.754846e-05  0.0211672325
3437   -3.014484  4.821100  17.05690  3.627624e-05  0.0255505626
Transcriptome genetics using second generation sequencing in a Caucasian population
Stephen B. Montgomery (1,2), Micha Sammeth (3), Maria Gutierrez-Arcelus (1), Radoslaw P. Lach (2), Catherine Ingle (2), James Nisbett (2), Roderic Guigo (3) & Emmanouil T. Dermitzakis (1,2)
Gene expression is an important phenotype that informs about genetic and environmental effects on cellular state. Many studies have previously identified genetic variants for gene expression phenotypes using custom and commercially available microarrays [1–5]. Second generation sequencing technologies are now providing unprecedented access to the fine structure of the transcriptome [6–14]. We have sequenced the mRNA fraction of the transcriptome in 60 extended HapMap individuals of European descent and have combined these data with genetic variants from the HapMap3 project [15]. We have quantified exon abundance based on read depth and have also developed methods to quantify whole transcript abundance. We have found that approximately 10 million reads of sequencing can provide access to the same dynamic range as arrays with better quantification of alternative and highly abundant transcripts. Correlation with SNPs (small nucleotide polymorphisms) leads to a larger discovery of eQTLs (expression quantitative trait loci) than with arrays. We also detect a substantial number of variants that influence the structure of mature transcripts indicating variants responsible for alternative splicing. Finally, measures of allele-specific expression allowed the identification of rare eQTLs and allelic differences in transcript structure. This analysis shows that high throughput sequencing technologies reveal new properties of genetic effects on the transcriptome and allow the exploration of genetic effects in cellular processes.
Genetic variation in gene expression is an important determinant of human phenotypic variation; a number of studies have elucidated genome-wide patterns of heritability and population differentiation and are beginning to unravel the role of gene expression in the aetiology of disease [1–5]. Interrogation of the transcriptome in these studies has been greatly facilitated by the use of microarrays, which quantify transcript abundance by hybridization. However, microarrays possess several limitations and recent advances in transcriptome sequencing in second generation sequencing platforms have now provided single-nucleotide resolution of gene expression providing access to rare transcripts, more accurate quantification of abundant transcripts (above the signal saturation point of arrays), novel gene structure, alternative splicing and allele-specific expression [6–14]. Although RNA-Seq studies have addressed issues of transcript complexity, they have not yet addressed how genetic studies can benefit from this increased resolution to reveal novel effects of sequence variants on the transcriptome.
To understand the quantitative differences in gene expression within a human population as determined from second generation sequencing, we sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines (LCLs) from 60 CEU (HapMap individuals of European descent) individuals (from CEPH, Centre d’Etude du Polymorphisme Humain) using 37-base pairs (bp) paired-end Illumina sequencing. Each individual’s transcriptome was sequenced in one lane of an Illumina GAII analyzer and yielded 16.9 ± 5.9 (mean ± s.d.) million reads that were then mapped to the NCBI36 assembly of the human genome (Supplementary Fig. 1) using MAQ [16]. We subsequently filtered reads that had low mapping quality, mapped sex chromosomes or mitochondrial DNA and were not correctly paired, which yielded 9.4 ± 3.3 million reads. On average, 86% of the filtered reads mapped to known exons in Ensembl version 54 (ref. 17) and 15% of read pairs spanned more than one exon. Evaluation of sequence and mapping quality measures was performed to ensure that the data quality is acceptable for analysis (Supplementary Fig. 2, also see methods).
We quantified reads for known exons, transcripts and whole genes. Read counts for each individual were scaled to a theoretical yield of 10 million reads and corrected for peak insert size across corresponding libraries. Each quantification was filtered to exclude those with missing data for > 10% of the individuals. For exons, this resulted in data for 90,064 exons for 10,777 genes. Of these, 95% had on average more than 10 reads, 38% more than 50 reads and 20% had a mean quantification of ≥ 100 reads (Supplementary Fig. 3). For transcript quantification, new methods needed to be developed to map reads into specific isoforms [18,19]. We developed a methodology, called the FluxCapacitor, to quantify abundances of annotated alternatively spliced transcripts (see Methods). Using this method, we obtain relative quantities for 15,967 transcripts from 11,674 genes. For each individual, we compared whole-gene read counts to array intensities generated with Illumina HG-6 version 2 microarrays. Correlation coefficients between RNA-Seq and array quantities and among RNA-Seq samples were high and consistent with previous studies [20] (Supplementary Figs 4 and 5). Finally, we explored whether the correlation structure of abundance among exons could facilitate the development of a framework that will allow the imputation of abundance values for exons that are not screened, given a set of reference RNA-Seq samples. This is the same principle as using the correlation structure (Linkage Disequilibrium) of genetic variants to impute variants from a reference to any population sample of interest [21]. For each of the 10,777 genes, we assessed the pairwise correlation of all exons and on average, any two pairs of exons within a gene were moderately correlated (mean Pearson’s correlation R² = 0.378 ± 0.261) (Supplementary Fig. 6). This correlation increased with increase in total number of reads present in each exon. It is worth noting that the average correlation coefficient between SNPs within the same recombination hotspot interval in HapMap3 is R² = 0.326 ± 0.174, indicating that the correlation structure within genes is stronger and probably more accessible by imputation methodologies than SNPs; however, this needs to be assessed in a tissue-specific context.
Association of gene expression measured by RNA-Seq with genetic variation was evaluated in cis with the use of 1.2 million HapMap3
(1) Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211 Switzerland. (2) Wellcome Trust Sanger Institute, Cambridge CB10 1HH, UK. (3) Center for Genomic Regulation, University Pompeu Fabra, Barcelona, Catalonia, 08003 Spain.
Vol 464 | 1 April 2010 |doi:10.1038/nature08903
Nature, 2010
Random split of dataset: n1=5; n2=5 Very little true differential expression
Results driven by outliers
CPMs (counts per million)
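For reference, counts per million can be computed as in the sketch below. It assumes the plain definition (count divided by library size, times one million) using raw library sizes; the function name and inputs are hypothetical, and software packages may instead use normalized library sizes.

```python
def counts_per_million(counts, lib_sizes):
    """Scale each sample's count for one feature to counts per million (CPM).

    counts: raw counts of one feature across samples.
    lib_sizes: total counts per sample (raw library sizes, an assumption here).
    """
    return [1e6 * count / lib_size for count, lib_size in zip(counts, lib_sizes)]


if __name__ == "__main__":
    # Illustrative counts and library sizes, not taken from the data shown above.
    print(counts_per_million([5, 0, 20], [1_000_000, 2_000_000, 4_000_000]))
```

Working on the CPM scale makes a single extreme sample stand out regardless of sequencing depth.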
“Disadvantage” of moderation: outliers
Moderation can do more harm than good for genes with outliers, but it is crucial (on average) in small samples
edgeR’s versus DESeq’s approach to dispersion estimation
Current policies (robustness)
• edgeR – one option: moderate dispersion less towards trend
  • Allows dispersions to be driven more by the data
• DESeq – take the maximum of the fitted (trended) and the feature-specific dispersion
  • Very robust, but many genes pay a penalty; less powerful
• DESeq2 – calculate Cook’s distance and filter genes with outliers
  • Can inadvertently filter interesting genes
• Our goal: achieve a middle ground that protects against outliers while maintaining high power
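The "maximum" policy described for DESeq can be sketched in a few lines. This is only an illustration of the rule as stated on the slide, not the package's code; the function name and dispersion values are hypothetical.

```python
def max_rule_dispersion(genewise, trended):
    """Per feature, use the larger of the feature-specific and trended dispersion."""
    return [max(g, t) for g, t in zip(genewise, trended)]


if __name__ == "__main__":
    genewise = [0.05, 0.30, 0.10]  # illustrative feature-specific estimates
    trended = [0.10, 0.10, 0.10]   # illustrative fitted trend values
    # The first feature is raised to the trend (the "penalty");
    # the second keeps its own, larger estimate (the protection).
    print(max_rule_dispersion(genewise, trended))
```

Features whose own dispersion falls below the trend are tested with a larger variance than their data suggest, which is the loss of power the slide refers to.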
A “new” direction: observation weights
Pearson residuals Huber weighting function Likelihood Weighted likelihood
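The four labels on this slide point at formulas that did not survive extraction. A hedged reconstruction follows, assuming a negative binomial model with fitted mean $\hat{\mu}_{gi}$ and dispersion $\hat{\phi}_g$ for gene $g$ and sample $i$; the notation is ours, not the slide's, and the slide's residuals may include further adjustments.

```latex
% Pearson residual under the NB variance function \mu + \phi \mu^2
r_{gi} = \frac{y_{gi} - \hat{\mu}_{gi}}
              {\sqrt{\hat{\mu}_{gi} + \hat{\phi}_g \, \hat{\mu}_{gi}^{2}}}

% Huber weighting function with tuning constant k > 0
w(r) =
\begin{cases}
  1,       & |r| \le k \\
  k / |r|, & |r| > k
\end{cases}

% Likelihood and weighted likelihood for gene g
\ell_g(\beta_g, \phi_g)     = \sum_{i} \log f(y_{gi};\, \mu_{gi}, \phi_g)
\ell_g^{w}(\beta_g, \phi_g) = \sum_{i} w_{gi} \, \log f(y_{gi};\, \mu_{gi}, \phi_g)
```

With all $w_{gi} = 1$ the weighted likelihood reduces to the ordinary one, so only observations with large residuals change the fit.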
An iterative reweighting strategy

          NA19222 NA12287 NA19172 NA11881 NA18871 NA12872 NA18916 NA18856 NA19193 NA19140
              0.0     1.9   178.1     0.0     0.5     0.0     0.0     0.0     0.0     0.0

Iteration NA19222 NA12287 NA19172 NA11881 NA18871 NA12872 NA18916 NA18856 NA19193 NA19140
[1,]       -0.849  -0.804   3.327  -0.845  -0.837   0.000   0.000   0.000   0.000   0.000
[2,]       -0.876  -0.774   8.397  -0.866  -0.849  -0.082  -0.092  -0.058  -0.073  -0.089
[3,]       -0.949  -0.692  22.642  -0.922  -0.882  -0.086  -0.097  -0.061  -0.077  -0.094
[4,]       -1.037  -0.454  53.633  -0.967  -0.883  -0.078  -0.087  -0.055  -0.069  -0.084
[5,]       -1.102  -0.013 104.956  -0.966  -0.810  -0.081  -0.091  -0.057  -0.072  -0.088
[6,]       -1.102   0.441 154.648  -0.919  -0.680  -0.080  -0.090  -0.056  -0.072  -0.087

          NA19222 NA12287 NA19172 NA11881 NA18871 NA12872 NA18916 NA18856 NA19193 NA19140
[1,]            1       1   0.404       1       1       1       1       1       1       1
[2,]            1       1   0.160       1       1       1       1       1       1       1
[3,]            1       1   0.059       1       1       1       1       1       1       1
[4,]            1       1   0.025       1       1       1       1       1       1       1
[5,]            1       1   0.013       1       1       1       1       1       1       1
[6,]            1       1   0.009       1       1       1       1       1       1       1
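The numbers on this slide can be checked with a short Python sketch of the Huber weighting step. The tuning constant `k=1.345` is an assumption inferred from the tabulated values (for example 1.345 / 3.327 ≈ 0.404); it is not stated on the slide, and the function name is ours.

```python
def huber_weight(residual, k=1.345):
    """Huber weight: 1 inside [-k, k], k/|r| outside.

    k=1.345 is an assumption inferred from the tabulated weights,
    not a value stated on the slide.
    """
    r = abs(residual)
    return 1.0 if r <= k else k / r


# Residuals of sample NA19172 over the six iterations shown on the slide.
outlier_residuals = [3.327, 8.397, 22.642, 53.633, 104.956, 154.648]
weights = [round(huber_weight(r), 3) for r in outlier_residuals]

# Iteration-1 residuals of the first five samples listed on the slide.
first_iteration = [-0.849, -0.804, 3.327, -0.845, -0.837]
first_weights = [round(huber_weight(r), 3) for r in first_iteration]
```

As the outlier is downweighted, the refitted mean moves away from it, its residual grows, and its weight shrinks further, which is the trajectory visible across the six iterations.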
Observation weights
(Pearson) Residuals
Trajectories after applying observation weights
DE in genes with outliers
When the assumed model does not hold, our statistic is able to select significant features much more efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a reliable estimate of the FDR. On several real data sets, our method is able to find features that are expressed consistently higher in one class, and these are more likely to be biologically meaningful.
Moreover, the use of current parametric methods is limited in the outcome types that they can handle. Except for PoissonSeq [20], to our knowledge, existing methods can only be used for data with two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple-class outcomes, but not survival outcomes. Because of the complexity of parametric methods, it is often difficult to extend them to other types of outcomes. In contrast, our nonparametric method can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we developed (Section 2.2) eliminates the difference between sequencing depths of experiments, making it easy to generalize our method to other possible types of outcomes.
The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic for data with a two-class outcome and the associated resampling strategy, as well as a permutation plug-in method to estimate the false discovery rate (FDR). In Section 3, we study the performance of our nonparametric method on simulated data sets, and compare it with three available methods, edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as well as edgeR, PoissonSeq and DESeq on three real RNA-Seq data sets, and compare the list of features that are called as differentially expressed by different methods. In Section 5, we extend our nonparametric statistic to other types of outcomes, and show their performance on simulated data sets. Section 6 contains the discussion.
2 A nonparametric method for two-class data
2.1 Wilcoxon statistic
For Feature j, suppose that we have counts N1j, ..., Nnj from either Class 1 or Class 2. Suppose Class k contains nk samples, k = 1, 2 and n1 + n2 = n. Let Ck = {i : Sample i is from Class k}, k = 1, 2. If the
[Figure 2 panels: miR-206 (No. 7 by edgeR); miR-133b (No. 10 by edgeR); miR-375 (No. 11 by edgeR). x-axis: Sample; y-axis: Scaled counts.]
Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial distributions. Each panel shows the counts from one miRNA in the Witten data [28]. These miRNAs are the 7th, 10th and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. The first 29 bars, coloured red, are samples from the one class, and the other 29 bars, coloured blue, are from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts.
Li and Tibshirani, 2011 (Statistical Methods in Medical Research)
Simulation model
Supplementary Figure 2
Sample from joint distribution of dispersion-mean estimates from real data (e.g. Pickrell dataset). For some percentage of features, add a single outlier: multiply count by random factor 2-10. For some percentage of features, add differential expression.
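The outlier step of this simulation model can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `add_outliers`, the uniform draw of the factor between 2 and 10, and the rounding back to integers are ours; sampling means and dispersions from real data and adding differential expression are not shown.

```python
import random


def add_outliers(counts, fraction, rng, low=2.0, high=10.0):
    """Return a copy of `counts` with one inflated count in some features.

    counts   -- list of rows (features), each a list of integer counts per sample
    fraction -- proportion of features that receive a single outlier
    rng      -- random.Random instance, so the simulation is reproducible
    low/high -- range of the multiplicative factor (2 to 10 on the slide);
                drawing it uniformly is an assumption of this sketch
    """
    out = [list(row) for row in counts]
    n_outliers = round(fraction * len(out))
    chosen = rng.sample(range(len(out)), n_outliers)
    for g in chosen:
        i = rng.randrange(len(out[g]))         # one random sample per feature
        factor = rng.uniform(low, high)
        out[g][i] = round(out[g][i] * factor)  # keep counts integer-valued
    return out, sorted(chosen)


rng = random.Random(1)
toy = [[10, 12, 9, 11] for _ in range(20)]     # 20 features, 4 samples
simulated, outlier_features = add_outliers(toy, fraction=0.1, rng=rng)
```

Returning the indices of the modified features keeps the ground truth available, which is what later allows power to be split by "outlier" versus "no outlier" situations.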
Complicated summary
• False discovery plots
• ROC curves
• Power (by mean) plots
• Power (split by situation by mean) plots
• Power versus achieved FDR
ROC curves
X – marks the (estimated) 5% FDR point.
Power curves
At the method’s 5% FDR. Split into 5 groups based on expression strength.
Outlier in up
Outlier in down
No outliers (but outliers present in dataset)
Another interest: do methods achieve their FDRs?
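In a simulation the truth is known, so power and achieved FDR at a nominal cutoff can be computed directly. The Python sketch below assumes Benjamini-Hochberg adjustment of the p-values; the function names and the toy p-values are hypothetical, and the slides do not state which adjustment each method used.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def power_and_fdr(pvalues, is_de, nominal_fdr=0.05):
    """Power (TPR) and achieved FDR among calls at a nominal FDR cutoff."""
    called = [q <= nominal_fdr for q in bh_adjust(pvalues)]
    true_pos = sum(c and t for c, t in zip(called, is_de))
    false_pos = sum(c and not t for c, t in zip(called, is_de))
    n_called = true_pos + false_pos
    power = true_pos / sum(is_de) if any(is_de) else float("nan")
    achieved = false_pos / n_called if n_called else 0.0
    return power, achieved


# Toy simulation truth: the first four features are truly DE.
pvals = [0.0001, 0.0004, 0.002, 0.30, 0.004, 0.20, 0.45, 0.60, 0.75, 0.90]
truth = [True, True, True, True, False, False, False, False, False, False]
power, achieved_fdr = power_and_fdr(pvals, truth)
```

In this toy case four features are called, three of them truly DE, so the achieved FDR (0.25) is far above the nominal 5%; with so few features this is only meant to show the bookkeeping, not a realistic result.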
My take from simulations
• Robust edgeR suffers a tiny bit in power with no outliers, but has good capacity to dampen their effect if present
• DESeq’s policy on outliers has a global effect, resulting in (sometimes drastic) drop in power
• DESeq2 is very powerful in the absence of outliers, but policy to filter outliers results in loss of power
• edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or 7%)
• Shiny app / web-accessible script; wrapper function to try new methods (coming soon)
KEGG ribosome pathway
[Figure: counts per million for genes in the KEGG ribosome pathway (tick labels: 82 Ensembl mouse gene IDs, ENSMUSG...; numeric axis ticks 0 to 6). Legends: inverse weight (1, 2, 3, 4); sample (Control 1, Control 2, Control 3).]
Cerebellar granular neurons (mouse), treated with BrdU (bromodeoxyuridine) = baseline control condition.
Can we learn something from outliers? Look for over-represented functional categories in downweighted genes
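One common way to test a category for over-representation is a hypergeometric (one-sided Fisher) tail probability, sketched below in Python. The choice of test is an assumption, since the slide does not say which one was used, and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n genes without replacement from N genes,
    of which K belong to the category of interest."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers: 10,000 genes tested, 80 in a category,
# 200 genes downweighted, 12 of them in the category.
p_enrich = hypergeom_upper_tail(k=12, K=80, n=200, N=10_000)
expected_overlap = 200 * 80 / 10_000   # genes expected in the overlap by chance
```

Here 12 observed genes against 1.6 expected gives a very small tail probability; in practice the p-values would still need multiple-testing correction across all categories examined.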
Food for thought: simulating differential splicing
[Figure: ROC curves in four panels (all genes; low expressed; mid expressed; high expressed), each at FDR=0.05. x-axis: 1 − spec (FPR); y-axis: sens (TPR). Methods: DEXSeq, edgeR, Cuffdiff, MISO_dexseq, MISO_lim, MISO_lim_weight.]
edgeR = spliceVariants() (currently, a gene-level test as opposed to DEXSeq’s exon-level test)
X – FDR=5%
Concluding remarks
Statistical themes:
• Moderating dispersion is helpful but sensitive to outliers
• Observation weights dampen the effect of outliers, and sometimes we can learn their source; probably limited to >=3 replicates
Simulations:
• Benchmarking is hard work. Why not do this collectively?
• I suggest we: i) collect simulation models and/or reference datasets; ii) make benchmarks that persist in time as new methods come. Other fields do this (e.g. machine learning).
Katarina Matthes
Andrea Komljenovic
Helen Lindsay Xiaobei Zhou
Olga Nikolayeva
Robinson Statistical Bioinformatics Group, UZH
Ian Morilla
Charity Law
Gosia Nowicka
edgeR users Antonio Schmandke (UZH) Andrea Riebler (NTNO)
WEHI edgeR devel: Andy Chen, Aaron Lun, Gordon Smyth, Davis McCarthy (Oxford)