Robustness in count-based differential analysis of RNA-seq data

Mark D. Robinson

Institute of Molecular Life Sciences,

University of Zurich

Pasteur – 26 November 2013

Outline

• Fundamentals of edgeR: gene-level analyses

• Robustness/outliers: observation weights. CS: “Remove genes with outliers or use non-parametric methods”. Respectfully disagree.

• Simulations

• Food for thought: differential splicing

Data analysis pipelines for RNA-seq differential expression

Nature Protocols September 2013 (preprint at http://arxiv.org/pdf/1302.3685v3.pdf)


PROTOCOL

NATURE PROTOCOLS | VOL.7 NO.3 | 2012 | 563

TopHat and Cufflinks are both operated through the UNIX shell. No graphical user interface is included. However, there are now commercial products and open-source interfaces to these and other RNA-seq analysis tools. For example, the Galaxy Project18 uses a web interface to cloud computing resources to bring command-line–driven tools such as TopHat and Cufflinks to users without UNIX skills through the web and the computing cloud.

Alternative analysis packages

TopHat and Cufflinks provide a complete RNA-seq workflow, but there are other RNA-seq analysis packages that may be used instead of or in combination with the tools in this protocol. Many alternative read-alignment programs19–21 now exist, and there are several alternative tools for transcriptome reconstruction22,23, quantification10,24,25 and differential expression26–28 analysis. Because many of these tools operate on similarly formatted data files, they could be used instead of or in addition to the tools used here. For example, with straightforward postprocessing scripts, one could provide GSNAP19 read alignments to Cufflinks, or use a Scripture22 transcriptome reconstruction instead of a Cufflinks one before differential expression analysis. However, such customization is beyond the scope of this protocol, and we discourage novice RNA-seq users from making changes to the protocol outlined here.

This protocol is appropriate for RNA-seq experiments on organisms with sequenced reference genomes. Users working without a sequenced genome but who are interested in gene discovery should consider performing de novo transcriptome assembly using one of several tools such as Trinity29, Trans-Abyss30 or Oases (http://www.ebi.ac.uk/~zerbino/oases/). Users performing expression analysis with a de novo transcriptome assembly may wish to consider RSEM10 or IsoEM25. For a survey of these tools (including TopHat and Cufflinks) readers may wish to see the study by Garber et al.12, which describes their comparative advantages and disadvantages and the theoretical considerations that inform their design.

Overview of the protocol

Although RNA-seq experiments can serve many purposes, we describe a workflow that aims to compare the transcriptome profiles of two or more biological conditions, such as a wild-type versus mutant or control versus knockdown experiments. For simplicity, we assume that the experiment compares only two biological conditions, although the software is designed to support many more, including time-course experiments.

This protocol begins with raw RNA-seq reads and concludes with publication-ready visualization of the analysis. Figure 2 highlights the main steps of the protocol. First, reads for each condition are mapped to the reference genome with TopHat. Many RNA-seq users are also interested in gene or splice variant discovery, and the failure to look for new transcripts can bias expression estimates and reduce accuracy8. Thus, we include transcript assembly with Cufflinks as a step in the workflow (see Box 1 for a workflow that skips gene and transcript discovery). After running TopHat, the resulting alignment files are provided to Cufflinks to generate a transcriptome assembly for each condition. These assemblies are then merged together using the Cuffmerge utility, which is included with the Cufflinks package. This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition. The reads and the merged assembly are fed to Cuffdiff, which calculates expression levels and tests the statistical significance of observed changes. Cuffdiff also performs an additional layer of differential analysis. By grouping transcripts into biologically meaningful groups (such as transcripts that share the same transcription start site (TSS)), Cuffdiff identifies genes that are differentially regulated at the transcriptional or post-transcriptional level. These results are reported as a set of text files and can be displayed in the plotting environment of your choice.

We have recently developed a powerful plotting tool called CummeRbund (http://compbio.mit.edu/cummeRbund/), which provides functions for creating commonly used expression plots such as volcano, scatter and box plots. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website (http://www.bioconductor.org/).

This protocol does not require extensive bioinformatics expertise (e.g., the ability to write complex scripts), but it does assume familiarity with the UNIX command-line interface. Users should

Cufflinks package

• Cufflinks: assembles transcripts

• Cuffcompare: compares transcript assemblies to annotation

• Cuffmerge: merges two or more transcript assemblies

• Cuffdiff: finds differentially expressed genes and transcripts; detects differential splicing and promoter use

TopHat: aligns RNA-Seq reads to the genome using Bowtie; discovers splice sites

CummeRbund: plots abundance and differential expression results from Cuffdiff

Bowtie: extremely fast, general purpose short read aligner

Figure 1 | Software components used in this protocol. Bowtie33 forms the algorithmic core of TopHat, which aligns millions of RNA-seq reads to the genome per CPU hour. TopHat’s read alignments are assembled by Cufflinks and its associated utility program to produce a transcriptome annotation of the genome. Cuffdiff quantifies this transcriptome across multiple conditions using the TopHat read alignments. CummeRbund helps users rapidly explore and visualize the gene expression data produced by Cuffdiff, including differentially expressed genes and transcripts.

NATURE BIOTECHNOLOGY VOLUME 29 NUMBER 7 JULY 2011 645

For transcriptome assembly, each path in the graph represents a possible transcript. A scoring scheme applied to the graph structure can rely on the original read sequences and mate-pair information to discard nonsensical solutions (transcripts) and compute all plausible ones.

Applying the scheme of de Bruijn graphs to de novo assembly of RNA-Seq data represents three critical challenges: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. In particular, sequencing errors would introduce a large number of false nodes, resulting in a massive graph with millions of possible (albeit mostly implausible) paths.

Here, we present Trinity, a method for the efficient and robust de novo reconstruction of transcriptomes, consisting of three software modules: Inchworm, Chrysalis and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. We evaluated Trinity on data from two well-annotated species, one microorganism (fission yeast) and one mammal (mouse), as well as an insect (the whitefly Bemisia tabaci), whose genome has not yet been sequenced. In each case, Trinity recovers most of the reference (annotated) expressed transcripts as full-length sequences, and resolves alternative isoforms and duplicated genes, performing better than other available transcriptome de novo assembly tools, and similarly to methods relying on genome alignments.

RESULTS

Trinity: a method for de novo transcriptome assembly

In contrast to de novo assembly of a genome, where few large connected sequence graphs can represent connectivities among reads across entire chromosomes, in assembling transcriptome data we expect to encounter numerous individual disconnected graphs, each representing the transcriptional complexity at nonoverlapping loci. Accordingly, Trinity partitions the sequence data into these many individual graphs, and then processes each graph independently to extract full-length isoforms and tease apart transcripts derived from paralogous genes.

In the first step in Trinity, Inchworm assembles reads into the unique sequences of transcripts. Inchworm (Fig. 1a) uses a greedy k-mer–based approach for fast and efficient transcript assembly, recovering only a single (best) representative for a set of alternative variants that share k-mers (owing to alternative splicing, gene duplication or allelic variation). Next, Chrysalis (Fig. 1b) clusters related contigs that correspond to portions of alternatively spliced transcripts or otherwise unique portions of paralogous genes. Chrysalis then constructs a de Bruijn graph for each cluster of related contigs, each graph reflecting the complexity of overlaps between variants. Finally, Butterfly (Fig. 1c) analyzes the paths taken by reads and read pairings in the context of the corresponding de Bruijn graph and reports all plausible transcript sequences, resolving alternatively spliced isoforms and transcripts derived from paralogous genes. Below, we describe each of Trinity’s modules.

Inchworm assembles contigs greedily and efficiently

Inchworm efficiently reconstructs linear transcript contigs in six steps (Fig. 1a). Inchworm (i) constructs a k-mer dictionary from all sequence reads (in practice, k = 25); (ii) removes likely error-containing k-mers from the k-mer dictionary; (iii) selects the most frequent k-mer in the dictionary to seed a contig assembly, excluding both low-complexity

[Figure 1 panels: (a) read set, extended in k-mer space with ties broken, giving linear sequences; (b) linear sequences overlapped by k – 1 to build graph components, shown as a de Bruijn graph (k = 5); (c) compacting the graph, finding paths with reads, and extracting sequences to give transcripts.]

Figure 1 Overview of Trinity. (a) Inchworm assembles the read data set (short black lines, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (color lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs (colored lines) if they share at least one k – 1-mer and if reads span the junction between contigs, and then it builds individual de Bruijn graphs from each pool. (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript represented in the graph (bottom, colored sequences).


cufflinks, cuffdiff Trinity edgeR, DESeq

Technical replication versus biological replication

Sample 1

Sample 2

Independent DNA populations from same experimental condition

Mean-Variance relationship in real data

mean=variance (Poisson assumption)

data from Parikh et al. Genome Bio 2010

data from Marioni et al. Gen Res 2008

[Plot panels: technical replicates; biological replicates. Axes: variance against mean.]

Davis McCarthy

Model assumptions

• Poisson describes technical variation:

  $Y_{ij} \sim \mathrm{Pois}(M_j \lambda_{ij})$

  $\mathrm{mean}(Y_{ij}) = \mathrm{variance}(Y_{ij}) = M_j \lambda_{ij}$

• Negative binomial models biological variability using the dispersion parameter $\phi_i$:

  $Y_{ij} \sim \mathrm{NB}(\mu_{ij} = M_j \lambda_{ij},\ \phi_i)$

• Same mean; the variance is quadratic in the mean:

  $\mathrm{variance}(Y_{ij}) = \mu_{ij} (1 + \mu_{ij} \phi_i)$

$M_j$ = library size; $\lambda_{ij}$ = relative abundance of feature i
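The mean-variance relationship above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `nb_variance` and `simulate_nb`, and the values `mu = 100` and `phi = 0.1`, are illustrative choices and are not taken from the slides.

```python
import numpy as np

def nb_variance(mu, phi):
    """Negative binomial variance for mean mu and dispersion phi: mu * (1 + mu * phi)."""
    return mu * (1.0 + mu * phi)

def simulate_nb(mu, phi, size, rng):
    """Draw NB counts with mean mu and dispersion phi using NumPy's (n, p) parameterisation."""
    n = 1.0 / phi          # NumPy's "number of successes"
    p = n / (n + mu)       # success probability that gives mean mu
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
mu, phi = 100.0, 0.1       # illustrative values only
counts = simulate_nb(mu, phi, size=200_000, rng=rng)

# Empirical mean and variance next to the theoretical NB and Poisson variances.
print(counts.mean(), counts.var(), nb_variance(mu, phi), nb_variance(mu, 0.0))
```

Setting `phi = 0` recovers the Poisson case (variance equal to the mean), which is why technical replicates sit near the mean = variance line while biological replicates lie above it.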

Tag ID             A1    A2    A3    A4    B1    B2    B3
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000125835   132   200   200   228   560   408   103
ENSG00000125834    42    60    72    86   131    99    30
ENSG00000197818    21    29    35    31    52    44    20
ENSG00000125831     0     0     2     0     0     0     0
ENSG00000215443     4     4     4     0     9     7     4
ENSG00000222008    30    23    29    19     0     0     0
ENSG00000101444    46    63    58    71    54    53    17
ENSG00000101333  2256  2793  3456  3362  2702  2976  1320
…                   …     …

Critical parameter to estimate: dispersion

edgeR dispersion estimation: moderate towards trend

Data: Tuch et al., 2008

Mouse hematopoietic stem cells (Samir Taoudi)

Mouse lymphomas (Stan Lee)

Advantage: share information, but genes are allowed to have their own variance.

Davis McCarthy

Flexibility for various experimental designs: Generalized linear modeling

• Response is negative binomial with the dispersion held fixed (which places it in the exponential family).

• Link function: relates the mean of the response to a linear combination of parameters

• For example:

• Applicability to a wide range of designs

X – design matrix; ln() – link function; β – parameters

McCarthy et al. 2012, NAR
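To make the generalized linear model concrete, here is a minimal sketch of fitting ln(μ) = Xβ + offset by iteratively reweighted least squares with the dispersion held fixed. It assumes NumPy; the function name `fit_nb_glm`, the value `phi = 0.1`, the two-group design, and the zero offset (in practice a log library size would usually be supplied) are all assumptions made for illustration. This is not edgeR's implementation.

```python
import numpy as np

def fit_nb_glm(y, X, phi, offset=None, n_iter=25):
    """Fit ln(mu) = X @ beta + offset for NB counts with a fixed dispersion phi.

    Uses iteratively reweighted least squares (IRLS) and returns beta.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    offset = np.zeros_like(y) if offset is None else np.asarray(offset, dtype=float)
    mu = np.full_like(y, y.mean())          # crude starting value (assumes y.mean() > 0)
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = mu / (1.0 + phi * mu)           # working weights for the log link
        z = (eta - offset) + (y - mu) / mu  # working response
        XtW = X.T * w                       # equivalent to X.T @ diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta + offset
        mu = np.exp(eta)
    return beta

# Counts for ENSG00000124208 from the table above: four A samples, then three B samples.
y = [478, 619, 628, 744, 483, 716, 240]
X = np.column_stack([np.ones(7), [0, 0, 0, 0, 1, 1, 1]])  # intercept + group B indicator
beta = fit_nb_glm(y, X, phi=0.1)
print(np.exp(beta[0]), np.exp(beta[0] + beta[1]))  # fitted means for groups A and B
```

With a simple two-group design and no offset, the fitted group means coincide with the sample means of each group; the design matrix becomes more useful for paired, blocked, or multi-factor experiments.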

Challenge: edgeR can be sensitive to outliers

When the assumed model does not hold, our statistic is able to select significant features much more efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a reliable estimate of the FDR. On several real data sets, our method is able to find features that are expressed consistently higher in one class, and these are more likely to be biologically meaningful.

Moreover, the use of current parametric methods is limited in the outcome types that they can handle. Except for PoissonSeq,20 to our knowledge, existing methods can only be used for data with two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple-class outcomes, but not survival outcomes. Because of the complexity of parametric methods, it is often difficult to extend them to other types of outcomes. In contrast, our nonparametric method can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we developed (Section 2.2) eliminates the difference between sequencing depths of experiments, making it easy to generalize our method to other possible types of outcomes.

The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic for data with a two-class outcome and the associated resampling strategy, as well as a permutation plug-in method to estimate the false discovery rate (FDR). In Section 3, we study the performance of our nonparametric method on simulated data sets, and compare it with three available methods, edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as well as edgeR, PoissonSeq and DESeq on three real RNA-Seq data sets, and compare the list of features that are called as differentially expressed by different methods. In Section 5, we extend our nonparametric statistic to other types of outcomes, and show their performance on simulated data sets. Section 6 contains the discussion.

2 A nonparametric method for two-class data

2.1 Wilcoxon statistic

For Feature j, suppose that we have counts $N_{1j}, \ldots, N_{nj}$ from either Class 1 or Class 2. Suppose Class k contains $n_k$ samples, k = 1, 2 and $n_1 + n_2 = n$. Let $C_k$ = {i : Sample i is from Class k}, k = 1, 2. If the

[Figure 2 panels: scaled counts (y-axis) by sample (x-axis, 0 to 60) for miR-206 (No. 7 by edgeR), miR-133b (No. 10 by edgeR) and a third miRNA whose panel title did not survive extraction.]

Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial distributions. Each panel shows the counts from one miRNA in the Witten data.28 These miRNAs are the 7th, 10th and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. The first 29 bars, coloured red, are samples from the one class, and the other 29 bars, coloured blue, are from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts.

4 Statistical Methods in Medical Research 0(0)

level. We put the FDR threshold at 0.05, and calculated the true false discovery rate as the fraction of the genes called significant at this level that were indeed false discoveries. Since NOISeq does not return a statistic that is recommended to use as an adjusted p-value or FDR estimate, it was excluded from this evaluation. For baySeq, EBSeq and ShrinkSeq, we imposed the desired threshold on the Bayesian FDR [28].

As above, when only 10% of the genes were DE, the direction of their regulation had little effect on the false discovery rate (simulation studies $B_{0}^{1250}$ and $B_{625}^{625}$, compare Figures 4A and 4B). The main difference between the two settings was seen for ShrinkSeq, whose FDR control was worse when all genes were regulated in the same direction. The high false discovery rate seen for ShrinkSeq can possibly be reduced by setting a non-zero value for the fold change threshold defining the null model. Also the variability of the baySeq performance was considerably reduced when there were both up- and downregulated genes among the DE ones. For the largest sample size (10 samples per group), ShrinkSeq, NBPSeq, EBSeq, edgeR and TSPM often found too many false positives. The remaining methods were essentially able to control the false discovery rate at the desired level under these conditions. A possible explanation for the high false discovery rates of NBPSeq is that the

[Figure 3, panels A–D: bar plots of the type I error rate (y-axis, 0.00 to 0.20) at nominal p < 0.05 for DESeq, edgeR, vst, voom, TSPM and NBPSeq with 2, 5 and 10 samples per group.]

Figure 3 Type I error rates. Type I error rates, for the six methods providing nominal p-values, in simulation studies B00 (panel A), P00 (panel B), S00 (panel C) and R00 (panel D). Letting some counts follow a Poisson distribution (panel B) reduced the type I error rates for TSPM slightly but had overall a small effect. Including outliers with abnormally high counts (panels C and D) had a detrimental effect on the ability to control the type I error for edgeR and NBPSeq, while DESeq became slightly more conservative.

Soneson and Delorenzi, BMC Bioinformatics 2013, 14:91, page 8 of 18, http://www.biomedcentral.com/1471-2105/14/91
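The two error summaries used in this excerpt can be written down in a few lines. This is a minimal sketch assuming NumPy and a simulation in which the true status of each gene is known; the names `observed_fdr` and `type_i_error_rate` and the toy inputs are hypothetical and do not come from the cited paper.

```python
import numpy as np

def observed_fdr(called, is_de):
    """Fraction of genes called significant that are not truly DE (0.0 if nothing is called)."""
    called = np.asarray(called, dtype=bool)
    is_de = np.asarray(is_de, dtype=bool)
    n_called = called.sum()
    if n_called == 0:
        return 0.0
    return float((called & ~is_de).sum() / n_called)

def type_i_error_rate(p_values, is_de, alpha=0.05):
    """Fraction of truly non-DE genes with nominal p-value below alpha (assumes at least one)."""
    p_values = np.asarray(p_values, dtype=float)
    is_null = ~np.asarray(is_de, dtype=bool)
    return float((p_values[is_null] < alpha).mean())

# Toy inputs: four genes, two of which are truly DE.
print(observed_fdr([True, True, True, False], [True, False, True, False]))
print(type_i_error_rate([0.01, 0.20, 0.03, 0.50], [False, False, True, False]))
```

The type I error rate looks only at null genes and nominal p-values, whereas the observed FDR looks only at the genes actually called, so a method can do well on one and poorly on the other.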


Li and Tibshirani, 2011

no outliers presence of outliers

Soneson and Delorenzi, 2013

False positive rate

Why is robustness needed?

      NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
4004      0.0      1.9    178.1      0.0      0.5      0.0      0.0      0.0      0.0      0.0
2538      2.0      0.6    235.5      6.8     60.2      1.0      0.0      0.0      2.5      1.3
4962      3.5      0.6    429.5      1.0     35.9      0.0      0.4      0.0      0.0      4.7
7921      1.0      5.1     78.9      2.9      0.0      0.0      0.8      0.0      0.0      0.4
6115      0.0      1.3      0.0      1.9      0.0      0.5     46.1      0.0    100.1      1.3
5156     13.8      1.3     30.7      0.0      7.1      0.0      0.0      1.0      0.0      1.3
2527     23.7    111.0    228.8     77.0    129.5     10.0     45.3     27.4     26.3     19.1
1115      2.0     15.2   1074.8     19.5     13.2     10.0     29.6      0.0      1.3      5.5
3175      3.0      6.3    181.0      7.8      7.6      0.0      5.5      3.0      3.1      2.5
7951      1.0     12.1     35.9      0.0      1.0      1.0      0.0      1.0      0.0      0.0
7631      0.0      1.9      0.4      1.0      0.0      0.5     29.6      0.0     24.4      5.5
3437     24.6     31.1    167.0      4.9     21.2      4.5      8.3     10.1      8.1      0.4

           logFC    logCPM        LR        PValue           FDR
4004  -10.413038  4.186203  30.07924  4.147469e-08  0.0002239513
2538   -5.942865  4.963086  29.60406  5.299369e-08  0.0002239513
4962   -6.387829  5.576979  26.06085  3.308237e-07  0.0009320406
7921   -5.808379  3.183079  22.51927  2.080466e-06  0.0043960241
6115    5.746084  3.921353  21.37010  3.786299e-06  0.0064003595
5156   -4.573655  2.512035  20.13483  7.217026e-06  0.0101663841
2527   -2.154480  6.128702  18.44343  1.750229e-05  0.0211327628
1115   -4.575934  6.873996  18.14127  2.051076e-05  0.0211672325
3175   -3.843458  4.473754  17.71318  2.568407e-05  0.0211672325
7951   -4.786326  2.416892  17.66324  2.636730e-05  0.0211672325
7631    4.311717  2.683367  17.57990  2.754846e-05  0.0211672325
3437   -3.014484  4.821100  17.05690  3.627624e-05  0.0255505626


Transcriptome genetics using second generation sequencing in a Caucasian population

Stephen B. Montgomery1,2, Micha Sammeth3, Maria Gutierrez-Arcelus1, Radoslaw P. Lach2, Catherine Ingle2, James Nisbett2, Roderic Guigo3 & Emmanouil T. Dermitzakis1,2

Gene expression is an important phenotype that informs about genetic and environmental effects on cellular state. Many studies have previously identified genetic variants for gene expression phenotypes using custom and commercially available microarrays1–5. Second generation sequencing technologies are now providing unprecedented access to the fine structure of the transcriptome6–14. We have sequenced the mRNA fraction of the transcriptome in 60 extended HapMap individuals of European descent and have combined these data with genetic variants from the HapMap3 project15. We have quantified exon abundance based on read depth and have also developed methods to quantify whole transcript abundance. We have found that approximately 10 million reads of sequencing can provide access to the same dynamic range as arrays with better quantification of alternative and highly abundant transcripts. Correlation with SNPs (small nucleotide polymorphisms) leads to a larger discovery of eQTLs (expression quantitative trait loci) than with arrays. We also detect a substantial number of variants that influence the structure of mature transcripts indicating variants responsible for alternative splicing. Finally, measures of allele-specific expression allowed the identification of rare eQTLs and allelic differences in transcript structure. This analysis shows that high throughput sequencing technologies reveal new properties of genetic effects on the transcriptome and allow the exploration of genetic effects in cellular processes.

Genetic variation in gene expression is an important determinant of human phenotypic variation; a number of studies have elucidated genome-wide patterns of heritability and population differentiation and are beginning to unravel the role of gene expression in the aetiology of disease1–5. Interrogation of the transcriptome in these studies has been greatly facilitated by the use of microarrays, which quantify transcript abundance by hybridization. However, microarrays possess several limitations and recent advances in transcriptome sequencing in second generation sequencing platforms have now provided single-nucleotide resolution of gene expression providing access to rare transcripts, more accurate quantification of abundant transcripts (above the signal saturation point of arrays), novel gene structure, alternative splicing and allele-specific expression6–14. Although RNA-Seq studies have addressed issues of transcript complexity, they have not yet addressed how genetic studies can benefit from this increased resolution to reveal novel effects of sequence variants on the transcriptome.

To understand the quantitative differences in gene expression within a human population as determined from second generation sequencing, we sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines (LCLs) from 60 CEU (HapMap individuals of European descent) individuals (from CEPH, Centre d’Etude du Polymorphisme Humain) using 37-base pairs (bp) paired-end Illumina sequencing. Each individual’s transcriptome was sequenced in one lane of an Illumina GAII analyzer and yielded 16.9 ± 5.9 (mean ± s.d.) million reads that were then mapped to the NCBI36 assembly of the human genome (Supplementary Fig. 1) using MAQ16. We subsequently filtered reads that had low mapping quality, mapped sex chromosomes or mitochondrial DNA and were not correctly paired, which yielded 9.4 ± 3.3 million reads. On average, 86% of the filtered reads mapped to known exons in Ensembl version 54 (ref. 17) and 15% of read pairs spanned more than one exon. Evaluation of sequence and mapping quality measures was performed to ensure that the data quality is acceptable for analysis (Supplementary Fig. 2, also see methods).

We quantified reads for known exons, transcripts and whole genes. Read counts for each individual were scaled to a theoretical yield of 10 million reads and corrected for peak insert size across corresponding libraries. Each quantification was filtered to exclude those with missing data for > 10% of the individuals. For exons, this resulted in data for 90,064 exons for 10,777 genes. Of these, 95% had on average more than 10 reads, 38% more than 50 reads and 20% had a mean quantification of ≥ 100 reads (Supplementary Fig. 3). For transcript quantification, new methods needed to be developed to map reads into specific isoforms18,19. We developed a methodology, called the FluxCapacitor, to quantify abundances of annotated alternatively spliced transcripts (see Methods). Using this method, we obtain relative quantities for 15,967 transcripts from 11,674 genes. For each individual, we compared whole-gene read counts to array intensities generated with Illumina HG-6 version 2 microarrays. Correlation coefficients between RNA-Seq and array quantities and among RNA-Seq samples were high and consistent with previous studies20 (Supplementary Figs 4 and 5). Finally, we explored whether the correlation structure of abundance among exons could facilitate the development of a framework that will allow the imputation of abundance values for exons that are not screened, given a set of reference RNA-Seq samples. This is the same principle as using the correlation structure (Linkage Disequilibrium) of genetic variants to impute variants from a reference to any population sample of interest21. For each of the 10,777 genes, we assessed the pairwise correlation of all exons and on average, any two pairs of exons within a gene were moderately correlated (mean Pearson’s correlation R² = 0.378 ± 0.261) (Supplementary Fig. 6). This correlation increased with increase in total number of reads present in each exon. It is worth noting that the average correlation coefficient between SNPs within the same recombination hotspot interval in HapMap3 is R² = 0.326 ± 0.174, indicating that the correlation structure within genes is stronger and probably more accessible by imputation methodologies than SNPs; however, this needs to be assessed in a tissue-specific context.

Association of gene expression measured by RNA-Seq with genetic variation was evaluated in cis with the use of 1.2 million HapMap3

1Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211 Switzerland. 2Wellcome Trust Sanger Institute, Cambridge CB10 1HH, UK. 3Center for Genomic Regulation, University Pompeu Fabra, Barcelona, Catalonia, 08003 Spain.

Vol 464 | 1 April 2010 |doi:10.1038/nature08903


Nature, 2010

Random split of dataset: n1=5; n2=5 Very little true differential expression

Results driven by outliers

CPMs (counts per million)

“Disadvantage” of moderation: outliers

Moderation can do more harm than good for outliers, but is crucial (on the average) in small samples

edgeR’s versus DESeq’s approach to dispersion estimation

Current policies (robustness)

•  edgeR (one option): moderate dispersion less towards trend
   •  Allows dispersions to be driven more by the data

•  DESeq: take the maximum of the fit (trended) or the feature-specific dispersion
   •  Very robust, but many genes pay a penalty, less powerful.

•  DESeq2: calculate Cook’s distance and filter genes with outliers
   •  Can inadvertently filter interesting genes

•  Our goal: achieve a middle ground between protection against outliers and maintaining high power
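The first two policies can be contrasted with a toy sketch. The shrinkage rule below is a deliberately simplified stand-in (a weighted average on the log scale, with a hypothetical `prior_weight`); edgeR itself moderates through a weighted likelihood with prior degrees of freedom, so treat this only as an illustration of "moderating less" versus "taking the maximum".

```python
import numpy as np

def moderate_towards_trend(genewise, trend, prior_weight):
    """Toy shrinkage: weighted average of log dispersions.

    prior_weight = 0 returns the gene-wise values unchanged; larger values
    pull each gene further towards the trend. Not edgeR's actual estimator.
    """
    genewise = np.asarray(genewise, dtype=float)
    trend = np.asarray(trend, dtype=float)
    log_mod = (np.log(genewise) + prior_weight * np.log(trend)) / (1.0 + prior_weight)
    return np.exp(log_mod)

def max_rule(genewise, trend):
    """DESeq-style rule from the slide: the larger of trended and gene-wise."""
    return np.maximum(genewise, trend)

genewise = np.array([0.01, 0.1, 2.0])   # third gene inflated, e.g. by an outlier
trend = np.full(3, 0.1)
print(moderate_towards_trend(genewise, trend, prior_weight=1.0))
print(max_rule(genewise, trend))
```

Under the max rule the low-dispersion gene is raised to the trend (the "penalty" many genes pay), while the inflated gene keeps its large value. Under shrinkage both are pulled part of the way towards the trend.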

A “new” direction: observation weights

Pearson residuals Huber weighting function Likelihood Weighted likelihood
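The four labels above name formulas that did not survive extraction. Assuming the usual negative binomial setup (count $y_{gi}$, mean $\mu_{gi}$ and dispersion $\phi_g$ for gene $g$ in sample $i$, density $f$, tuning constant $k$; all of these symbols are assumptions rather than the slide's notation), they would typically read:

```latex
% Pearson residual and Huber weighting function
r_{gi} = \frac{y_{gi} - \mu_{gi}}{\sqrt{\mu_{gi} + \phi_g \mu_{gi}^2}}
\qquad
w(r) = \begin{cases} 1, & |r| \le k \\ k / |r|, & |r| > k \end{cases}

% Likelihood and weighted likelihood
\ell(\beta_g, \phi_g) = \sum_i \log f(y_{gi}; \mu_{gi}, \phi_g)
\qquad
\ell_w(\beta_g, \phi_g) = \sum_i w_{gi} \, \log f(y_{gi}; \mu_{gi}, \phi_g)
```

Observations with large residuals receive weights below 1 and so contribute less to the weighted likelihood.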

An iterative reweighting strategy

      NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
          0.0      1.9    178.1      0.0      0.5      0.0      0.0      0.0      0.0      0.0

Residuals by iteration:

      NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
[1,]   -0.849   -0.804    3.327   -0.845   -0.837    0.000    0.000    0.000    0.000    0.000
[2,]   -0.876   -0.774    8.397   -0.866   -0.849   -0.082   -0.092   -0.058   -0.073   -0.089
[3,]   -0.949   -0.692   22.642   -0.922   -0.882   -0.086   -0.097   -0.061   -0.077   -0.094
[4,]   -1.037   -0.454   53.633   -0.967   -0.883   -0.078   -0.087   -0.055   -0.069   -0.084
[5,]   -1.102   -0.013  104.956   -0.966   -0.810   -0.081   -0.091   -0.057   -0.072   -0.088
[6,]   -1.102    0.441  154.648   -0.919   -0.680   -0.080   -0.090   -0.056   -0.072   -0.087

Weights by iteration:

      NA19222  NA12287  NA19172  NA11881  NA18871  NA12872  NA18916  NA18856  NA19193  NA19140
[1,]        1        1    0.404        1        1        1        1        1        1        1
[2,]        1        1    0.160        1        1        1        1        1        1        1
[3,]        1        1    0.059        1        1        1        1        1        1        1
[4,]        1        1    0.025        1        1        1        1        1        1        1
[5,]        1        1    0.013        1        1        1        1        1        1        1
[6,]        1        1    0.009        1        1        1        1        1        1        1
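The iteration tabulated above can be imitated with a minimal sketch. Assumptions that are not from the slides: a single group with a common mean, a fixed dispersion of 0.5 (edgeR's robust procedure also re-estimates dispersions between iterations, which this toy does not), and a Huber constant k = 1.345. That constant is at least consistent with the table, where weight times residual for NA19172 stays near 1.34 (for example 0.404 × 3.327). Function names are illustrative.

```python
import numpy as np

def pearson_residuals(y, mu, dispersion):
    """Negative binomial Pearson residuals: (y - mu) / sqrt(mu + phi * mu^2)."""
    return (y - mu) / np.sqrt(mu + dispersion * mu**2)

def huber_weights(residuals, k=1.345):
    """Huber weights: 1 inside [-k, k], k/|r| outside."""
    abs_r = np.maximum(np.abs(residuals), 1e-12)   # guard against division by zero
    return np.minimum(1.0, k / abs_r)

def iterate_weights(y, dispersion=0.5, k=1.345, n_iter=20):
    """Alternate a weighted mean estimate with a weight update."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y)
    for _ in range(n_iter):
        mu = max(np.sum(w * y) / np.sum(w), 1e-8)  # weighted mean, kept positive
        r = pearson_residuals(y, mu, dispersion)
        w = huber_weights(r, k)
    return mu, r, w

# First five samples from the table above (one group).
mu, r, w = iterate_weights([0.0, 1.9, 178.1, 0.0, 0.5])
print(np.round(w, 3))
```

With these inputs the third sample should end with a weight far below 1 while the other four stay at 1, mirroring the qualitative pattern in the table.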

Observation weights

(Pearson) Residuals

Trajectories after applying observation weights

DE in genes with outliers

When the assumed model does not hold, our statistic is able to select significant features much more efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a reliable estimate of the FDR. On several real data sets, our method is able to find features that are expressed consistently higher in one class, and these are more likely to be biologically meaningful.

Moreover, the use of current parametric methods is limited in the outcome types that they can handle. Except for PoissonSeq,20 to our knowledge, existing methods can only be used for data with two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple-class outcomes, but not survival outcomes. Because of the complexity of parametric methods, it is often difficult to extend them to other types of outcomes. In contrast, our nonparametric method can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we developed (Section 2.2) eliminates the difference between sequencing depths of experiments, making it easy to generalize our method to other possible types of outcomes.

The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic for data with a two-class outcome and the associated resampling strategy, as well as a permutation plug-in method to estimate the false discovery rate (FDR). In Section 3, we study the performance of our nonparametric method on simulated data sets, and compare it with three available methods, edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as well as edgeR, PoissonSeq and DESeq on three real RNA-Seq data sets, and compare the list of features that are called as differentially expressed by different methods. In Section 5, we extend our nonparametric statistic to other types of outcomes, and show their performance on simulated data sets. Section 6 contains the discussion.

2 A nonparametric method for two-class data

2.1 Wilcoxon statistic

For Feature j, suppose that we have counts $N_{1j}, \ldots, N_{nj}$ from either Class 1 or Class 2. Suppose Class k contains $n_k$ samples, $k = 1, 2$, and $n_1 + n_2 = n$. Let $C_k = \{i : \text{Sample } i \text{ is from Class } k\}$, $k = 1, 2$. If the

[Figure 2 panels: three bar plots of scaled counts (y axis) against sample index (x axis, 0 to 60); the only panel title that survives extraction is "miR−133b, No. 10 by edgeR".]

Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial distributions. Each panel shows the counts from one miRNA in the Witten data.28 These miRNAs are the 7th, 10th and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the samples. The first 29 bars, coloured red, are samples from the one class, and the other 29 bars, coloured blue, are from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count has much larger values than all the other counts.

Li and Tibshirani, 2011 (Statistical Methods in Medical Research)

Simulation model

Supplementary Figure 2


Sample from joint distribution of dispersion-mean estimates from real data (e.g. Pickrell dataset). For some percentage of features, add a single outlier: multiply count by random factor 2-10. For some percentage of features, add differential expression.
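The three steps above can be sketched as follows. This is a minimal illustration, not the simulation code behind the talk: the function name, the two-group layout, the fixed fold change and the default fractions are all assumptions, and the "real data" estimates are passed in as plain arrays.

```python
import numpy as np

def simulate_counts(est_mean, est_disp, n_genes, n_per_group=5,
                    frac_de=0.1, fold_change=3.0, frac_outlier=0.1, seed=0):
    """Simulate a genes-by-samples negative binomial count matrix."""
    rng = np.random.default_rng(seed)
    est_mean = np.asarray(est_mean, dtype=float)
    est_disp = np.asarray(est_disp, dtype=float)

    # Resample (mean, dispersion) pairs together so their joint relationship is kept.
    idx = rng.integers(0, est_mean.size, size=n_genes)
    mu, phi = est_mean[idx], est_disp[idx]

    # Differential expression: scale the group-2 mean for a random subset of genes.
    is_de = np.zeros(n_genes, dtype=bool)
    is_de[rng.choice(n_genes, size=int(round(frac_de * n_genes)), replace=False)] = True
    mu_mat = np.repeat(mu[:, None], 2 * n_per_group, axis=1)
    mu_mat[is_de, n_per_group:] *= fold_change

    # Negative binomial draw with mean mu and variance mu + phi * mu^2.
    size = 1.0 / phi[:, None]
    counts = rng.negative_binomial(size, size / (size + mu_mat))

    # Single outlier per selected gene: one count multiplied by a factor in [2, 10].
    has_outlier = np.zeros(n_genes, dtype=bool)
    has_outlier[rng.choice(n_genes, size=int(round(frac_outlier * n_genes)), replace=False)] = True
    rows = np.flatnonzero(has_outlier)
    cols = rng.integers(0, 2 * n_per_group, size=rows.size)
    factors = rng.uniform(2.0, 10.0, size=rows.size)
    counts[rows, cols] = np.round(counts[rows, cols] * factors).astype(counts.dtype)
    return counts, is_de, has_outlier

counts, is_de, has_outlier = simulate_counts(
    est_mean=[10, 100, 1000], est_disp=[0.5, 0.2, 0.1], n_genes=200)
```

Returning the `is_de` and `has_outlier` masks makes it possible to split power summaries by situation (outlier present or not) afterwards.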

Complicated summary

•  False discovery plots

•  ROC curves

•  Power (by mean) plots

•  Power (split by situation by mean) plots

•  Power versus achieved FDR

ROC curves

X – marks the (estimated) 5% FDR point.

Power curves

At the method’s 5% FDR. Split into 5 groups based on expression strength.

Outlier in up

Outlier in down

No outliers (but outliers present in dataset)

Another interest: do methods achieve their FDRs?
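In a simulation the truth is known, so the achieved FDR and the power at a nominal cutoff can be computed directly. A minimal sketch, assuming calls are made from Benjamini-Hochberg adjusted p-values (the slides do not say which adjustment each method applies); the function names are illustrative.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

def power_and_achieved_fdr(pvalues, is_de, alpha=0.05):
    """Power (true positive rate) and achieved FDR at a nominal FDR of alpha."""
    is_de = np.asarray(is_de, dtype=bool)
    called = bh_adjust(pvalues) <= alpha
    true_pos = np.sum(called & is_de)
    false_pos = np.sum(called & ~is_de)
    power = true_pos / max(is_de.sum(), 1)
    achieved_fdr = false_pos / max(called.sum(), 1)
    return power, achieved_fdr

power, fdr = power_and_achieved_fdr([0.01, 0.04, 0.03, 0.5],
                                    [True, False, True, False])
```

A method is "a bit liberal" in this sense when `achieved_fdr`, averaged over simulations, exceeds `alpha`.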

My take from simulations

•  Robust edgeR suffers a tiny bit in power with no outliers, but has good capacity to dampen their effect if present

•  DESeq’s policy on outliers has a global effect, resulting in (sometimes drastic) drop in power

•  DESeq2 is very powerful in the absence of outliers, but policy to filter outliers results in loss of power

•  edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or 7%)

Shiny app / web-accessible script; wrapper function to try new methods (coming soon)

KEGG ribosome pathway

[Figure: counts per million (y axis) for the genes of the pathway, labelled on the x axis by Ensembl mouse gene IDs (ENSMUSG identifiers); legend: inverse weight (1 to 4) and sample (Control 1, Control 2, Control 3).]

Cerebellar granular neurons (mouse), treated with BrdU (bromodeoxyuridine) = baseline control condition.

Can we learn something from outliers? Look for over-represented functional categories in downweighted genes
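One common way to test for over-representation is a one-sided hypergeometric test: given how many downweighted genes fall in a category, how surprising is that overlap by chance? The slides do not say which test was used, so the sketch below is an assumption, and the function and argument names are illustrative.

```python
from math import comb

def hypergeom_upper_tail(n_universe, n_in_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, n_in_category of which are annotated."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_category, n_selected)
    hits = sum(
        comb(n_in_category, i) * comb(n_universe - n_in_category, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return hits / total

# Hypothetical numbers: 10 genes in total, 5 in the category,
# 5 downweighted genes, all 5 of which are in the category.
p_value = hypergeom_upper_tail(10, 5, 5, 5)
```

Here the selected set would be the genes that received weights below 1, and the universe all genes tested.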


Food for thought: simulating differential splicing

[Figure: four ROC panels titled "all genes FDR=0.05", "low expressed FDR=0.05", "mid expressed FDR=0.05" and "high expressed FDR=0.05"; x axis: 1 − spec (FPR), y axis: sens (TPR); methods compared: DEXSeq, edgeR, Cuffdiff, MISO_dexseq, MISO_lim, MISO_lim_weight.]

edgeR = spliceVariants() (currently, a gene-level test as opposed to DEXSeq’s exon-level test)

X – FDR=5%

Concluding remarks

•  Statistical themes:
   •  Moderating dispersion is helpful but sensitive to outliers
   •  Observation weights dampen the effect of outliers, and sometimes we can learn their source; probably limited to >=3 replicates

•  Simulations:
   •  Benchmarking is hard work. Why not do this collectively?
   •  I suggest we: i) collect simulation models and/or reference datasets; ii) make benchmarks that persist in time as new methods come. Other fields do this (e.g. machine learning).

Katarina Matthes

Andrea Komljenovic

Helen Lindsay Xiaobei Zhou

Olga Nikolayeva

Robinson Statistical Bioinformatics Group, UZH

Ian Morilla

Charity Law

Gosia Nowicka

edgeR users Antonio Schmandke (UZH) Andrea Riebler (NTNO)

WEHI edgeR devel: Andy Chen, Aaron Lun, Gordon Smyth, Davis McCarthy (Oxford)
