INCOB 2016 Genomics Workshop

Transcriptomics: Workflow, tools, experimental design

Jonathan Göke, Senior Research Scientist, Genome Institute of Singapore

21 September 2016



RNA-Seq Workflow: quality control

Raw read data (fastq)

Quality control: FastQC
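
As a rough illustration of the kind of check a quality-control step performs on fastq data, the sketch below parses four-line FASTQ records and flags reads with a low mean base quality. It is not how FastQC works internally; the Phred+33 encoding, the threshold of 20, and all function names are assumptions made for this example.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(char) - offset for char in quality) / len(quality)


def low_quality_reads(lines: List[str], min_mean_quality: float = 20.0) -> List[str]:
    """Return the ids of reads whose mean quality is below the threshold."""
    return [
        read_id
        for read_id, _, qual in parse_fastq(lines)
        if mean_phred(qual) < min_mean_quality
    ]


# Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example_fastq = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!!!",
]
print(low_quality_reads(example_fastq))  # ['read2']
```

A real QC report covers much more (per-base quality, adapter content, duplication), so this is only meant to make the fastq format and the notion of a quality score concrete.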

RNA-Seq Workflow: read alignment

Raw read data (fastq)

TopHat2/STAR: aligned read data (bam)
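
To make the aligned read data more concrete, here is a minimal sketch that summarises alignment records in SAM text form (the text equivalent of bam, e.g. as printed by `samtools view -h`). It relies on two facts from the SAM format: FLAG bit 0x4 marks an unmapped read, and the CIGAR operation `N` marks a skipped region, which for RNA-Seq usually corresponds to a spliced alignment. The function name and the toy records are assumptions for illustration.

```python
from typing import Dict, Iterable

UNMAPPED_FLAG = 0x4  # SAM FLAG bit: the read is unmapped


def alignment_summary(sam_lines: Iterable[str]) -> Dict[str, int]:
    """Count total, mapped, and spliced alignment records in SAM text."""
    summary = {"total": 0, "mapped": 0, "spliced": 0}
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignment
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        cigar = fields[5]       # column 6: CIGAR string
        summary["total"] += 1
        if flag & UNMAPPED_FLAG:
            continue
        summary["mapped"] += 1
        if "N" in cigar:        # skipped reference region, e.g. an intron
            summary["spliced"] += 1
    return summary


# Toy records: one unspliced, one spliced, one unmapped.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t255\t20M500N30M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(alignment_summary(example_sam))
```

Note that this counts records, not reads: secondary or supplementary alignments of the same read would be counted separately.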

RNA-Seq Workflow: visualisation of aligned reads

TopHat2/STAR: aligned read data (bam)

Genome browser: genome.ucsc.edu

RNA-Seq Workflow: expression quantification

• Gene-level read counts

– HTSeq, R+Bioconductor, Cufflinks

– Normalisation required before comparing expression across samples (e.g. regularized log transformation)

• Isoform quantification

– Cufflinks, Kallisto, Salmon
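
Why gene-level counts need normalisation can be sketched with a small example. The code below does not implement the regularized log transformation named above; it swaps in a simpler, related idea: median-of-ratios size factors (the approach popularised by DESeq) followed by a log2 transform with a pseudocount. The function names, the pseudocount of 1, and the toy counts are assumptions for illustration, and at least one gene must have non-zero counts in every sample.

```python
import math
from statistics import median
from typing import List


def size_factors(counts: List[List[int]]) -> List[float]:
    """Median-of-ratios size factors; counts[g][s] is gene g in sample s."""
    n_samples = len(counts[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) == 0:
            continue  # geometric mean would be zero, so skip this gene
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for sample, count in enumerate(gene):
            log_ratios[sample].append(math.log(count) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


def log2_normalized(counts: List[List[int]], pseudocount: float = 1.0) -> List[List[float]]:
    """Divide each sample by its size factor, then take log2(x + pseudocount)."""
    factors = size_factors(counts)
    return [
        [math.log2(count / factor + pseudocount) for count, factor in zip(gene, factors)]
        for gene in counts
    ]


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],
]
print(size_factors(toy_counts))
print(log2_normalized(toy_counts)[0])
```

In the toy data the second sample has twice the counts of the first for every fully observed gene, so its size factor is twice as large and the normalised values of those genes agree between the two samples.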

RNA-Seq Workflow: differential expression

• Gene read count based:

– DESeq2, edgeR, …

– Input: raw read counts (not normalized!)

• Isoform-based:

– Cuffdiff2, sleuth (Kallisto)
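
Because count-based tools expect raw read counts, a simple sanity check before running them is to confirm that the matrix contains only non-negative whole numbers. The helper below is a hypothetical illustration and not part of DESeq2 or edgeR; it catches obviously normalised input (fractional values), but it cannot detect normalised values that were rounded afterwards.

```python
from typing import Sequence, Union

Number = Union[int, float]


def looks_like_raw_counts(matrix: Sequence[Sequence[Number]]) -> bool:
    """True if every value is a non-negative whole number."""
    return all(
        value >= 0 and float(value).is_integer()
        for row in matrix
        for value in row
    )


raw_counts = [[10, 0, 25], [3, 7, 1]]
normalized_values = [[3.46, 0.0, 4.7], [2.0, 3.0, 1.0]]

print(looks_like_raw_counts(raw_counts))         # True
print(looks_like_raw_counts(normalized_values))  # False
```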

RNA-Seq Workflow: alternative splicing

• Alternative splicing:

– DEXSeq (alternative exon usage)

– rMATS (alternative splicing events)

– SplAdder (alternative splicing events)

• Differential isoform expression

– Cuffdiff, sleuth

• Always look at the data!


Experimental Design

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Ronald Fisher (1890 - 1962)


Parameters for experimental design

Parameter        | Differential gene expression                             | Alternative splicing                                   | Single cell RNA-Seq
Read length      | Normal                                                   | Long                                                   | Normal
Paired end       | Single or paired                                         | Paired                                                 | Single or paired
Sequencing depth | Normal (>=12 samples per lane, re-sequence if necessary) | High (6-12 samples per lane, re-sequence if necessary) | Low (up to 96 samples per lane, or much higher)
Replicate number | Minimum 3                                                | Minimum 3                                              | Multiple independent samples recommended
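
The samples-per-lane figures in the table translate into an approximate number of reads per sample once the yield of a lane is known. The sketch below does that arithmetic; the lane yield of 300 million reads is a purely hypothetical value chosen for illustration, since actual yields depend on the sequencing platform and run.

```python
def reads_per_sample(reads_per_lane: float, samples_per_lane: int) -> float:
    """Average reads per sample when a lane is shared equally."""
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return reads_per_lane / samples_per_lane


# Hypothetical lane yield (assumption, not taken from the slides).
ASSUMED_READS_PER_LANE = 300e6

designs = [
    ("Differential gene expression", 12),
    ("Alternative splicing", 6),
    ("Single cell RNA-Seq", 96),
]
for application, samples in designs:
    depth = reads_per_sample(ASSUMED_READS_PER_LANE, samples)
    print(f"{application}: {samples} samples per lane, about {depth / 1e6:.1f} M reads each")
```

Under this assumed yield, 12 samples per lane gives 25 M reads per sample, 6 samples gives 50 M, and 96 samples gives roughly 3.1 M.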

Experimental design and batch effects: confounded design

Design: Condition 1 and Condition 2 with six replicates each (Rep 1 to Rep 6). All replicates of Condition 1 are processed in Batch 1 and all replicates of Condition 2 in Batch 2.

[Figure: samples labelled by batch (Batch 1, Batch 2) and by condition (Condition 1, Condition 2)]

Result: many differentially expressed genes between condition 1 and 2, and many differentially expressed genes between batch 1 and 2. Because condition and batch coincide, the two effects cannot be separated.
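
A confounded layout like this one can be detected from the sample sheet before any sequencing is done. The sketch below is a minimal check under the assumption that each sample is described by a (condition, batch) pair; the function names and labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (condition, batch)


def conditions_per_batch(samples: List[Sample]) -> Dict[str, Set[str]]:
    """Map each batch to the set of conditions it contains."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for condition, batch in samples:
        table[batch].add(condition)
    return dict(table)


def is_fully_confounded(samples: List[Sample]) -> bool:
    """True if there are several conditions but every batch holds only one."""
    table = conditions_per_batch(samples)
    all_conditions = set().union(*table.values()) if table else set()
    return len(all_conditions) > 1 and all(len(c) == 1 for c in table.values())


# Six replicates per condition, each condition in its own batch.
confounded_design = [("Condition 1", "Batch 1")] * 6 + [("Condition 2", "Batch 2")] * 6

# Three replicates of each condition in each batch.
balanced_design = (
    [("Condition 1", "Batch 1")] * 3 + [("Condition 1", "Batch 2")] * 3
    + [("Condition 2", "Batch 1")] * 3 + [("Condition 2", "Batch 2")] * 3
)

print(is_fully_confounded(confounded_design))  # True
print(is_fully_confounded(balanced_design))    # False
```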

Experimental design and batch effects: balanced design

Design: Condition 1 and Condition 2 with six replicates each (Rep 1 to Rep 6). Each batch contains replicates of both conditions: Batch 1 holds Rep 1 to Rep 3 of each condition, Batch 2 holds Rep 4 to Rep 6 of each condition.

[Figure: samples labelled by batch (Batch 1, Batch 2) and by condition (Condition 1, Condition 2)]

Result: very few differentially expressed genes between condition 1 and 2, many differentially expressed genes between batch 1 and 2.

Control for batch effects! (e.g. ComBat)
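
The balanced layout shown on this slide can be produced programmatically. The sketch below splits the replicates of every condition into consecutive blocks, one block per batch, so that each batch receives replicates of all conditions. The function name is hypothetical, and in practice one might also randomise the order of replicates before assignment, which is not shown here.

```python
from typing import Dict, List, Tuple


def assign_batches(
    replicates: Dict[str, List[str]], n_batches: int
) -> Dict[int, List[Tuple[str, str]]]:
    """Split each condition's replicates into n_batches consecutive blocks."""
    batches: Dict[int, List[Tuple[str, str]]] = {
        b: [] for b in range(1, n_batches + 1)
    }
    for condition, reps in replicates.items():
        for index, rep in enumerate(reps):
            batch = index * n_batches // len(reps) + 1
            batches[batch].append((condition, rep))
    return batches


reps = [f"Rep {i}" for i in range(1, 7)]
design = assign_batches({"Condition 1": reps, "Condition 2": reps}, n_batches=2)
for batch, members in design.items():
    print(f"Batch {batch}: {members}")
```

With six replicates and two batches this reproduces the slide: Rep 1 to Rep 3 of both conditions land in Batch 1 and Rep 4 to Rep 6 in Batch 2.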

Experimental design and batch effects: balanced design after batch effect removal

Design: same balanced design as before (Batch 1 holds Rep 1 to Rep 3 of each condition, Batch 2 holds Rep 4 to Rep 6 of each condition).

[Figure: samples labelled by batch (Batch 1, Batch 2) and by condition (Condition 1, Condition 2)]

Result: many differentially expressed genes between condition 1 and 2, very few differentially expressed genes between batch 1 and 2.

Remove batch effects
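
What removing a batch effect means can be sketched for a single gene. The code below is not ComBat; it uses a much simpler stand-in, per-batch mean centering of log-scale expression values, which is only reasonable when the design is balanced so that every batch contains the same mix of conditions. The function name and the toy values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def center_batches(values: List[float], batches: List[str]) -> List[float]:
    """Subtract each batch mean from its samples and add back the overall mean."""
    overall = mean(values)
    by_batch: Dict[str, List[float]] = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_mean[batch] + overall for value, batch in zip(values, batches)]


# One gene, log-scale expression. Condition 2 adds 2, Batch 2 adds 5.
conditions = ["Condition 1", "Condition 2", "Condition 1", "Condition 2"]
batches = ["Batch 1", "Batch 1", "Batch 2", "Batch 2"]
expression = [1.0, 3.0, 6.0, 8.0]

corrected = center_batches(expression, batches)
print(corrected)  # [3.5, 5.5, 3.5, 5.5]
```

After centering, the difference of 2 between the conditions is preserved while the shift of 5 between the batches is gone. In a confounded design the same operation would remove the condition effect together with the batch effect.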


Experimental design and batch effects

• Batch effects:

– Technology (read length, single/paired end, sequencing machine, lanes, …)

– Experiment (day, person, sequencing center, …)

• If batch effects are impossible to avoid, add controls for normalization/batch effect correction

Summary

• RNA-Seq workflow:

– Read alignment

– Expression quantification (with or without alignment)

– Differential expression

– Alternative splicing

• Experimental design and batch effects

– Replicates are indispensable

– Avoid confounding

– Control batch effects

References and further reading

Quality Control:

• FastQC: S. Andrews. "FastQC: A quality control tool for high throughput sequence data." http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Read alignment:

• TopHat2: Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36, doi:10.1186/gb-2013-14-4-r36 (2013).

• STAR: Dobin et al. "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21.

• Visualisation: WJ Kent et al. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006.

Read counting/isoform quantification:

• Bioconductor (GenomicAlignments): M. Lawrence, et al. "Software for computing and annotating genomic ranges." PLoS Comput Biol 9.8 (2013): e1003118.

• HTSeq: S. Anders et al. (2014). HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics, btu638.

• Cufflinks/Cuffdiff: C. Trapnell et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3), 562-578.

• Kallisto: N.L. Bray et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527.

• Salmon: R. Patro et al. (2016). "Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference." http://biorxiv.org/content/early/2016/08/30/021592

Differential expression

• DESeq2: M. Love et al. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), 1.

• edgeR: M.D. Robinson et al. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

• Sleuth: H.J. Pimentel et al. (2016). Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv, 058164.

Alternative splicing:

• DEXSeq: S. Anders et al. (2012). Detecting differential usage of exons from RNA-seq data. Genome research, 22(10), 2008-2017.

• rMATS: S. Shen et al. (2014). rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proceedings of the National Academy of Sciences, 111(51), E5593-E5601.

Batch effects:

• http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/

• J.T. Leek et al. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10), 733-739.

• Gilad, Y., & Mizrahi-Man, O. (2015). A reanalysis of mouse ENCODE comparative gene expression data. F1000Research, 4.

Further reading and online training material:

• A. Conesa et al. (2016). A survey of best practices for RNA-seq data analysis. Genome biology, 17(1), 1.

• M. Teng et al. (2016). A benchmark for RNA-seq quantification pipelines. Genome biology, 17(1), 1.

• Lawrence, M., & Morgan, M. (2014). Scalable genomics with R and Bioconductor. Statistical Science, 29(2), 214-226.

• W. Huber et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature methods, 12(2), 115-121.

• Online lectures:

• https://www.edx.org/xseries/data-analysis-life-sciences

• https://www.edx.org/xseries/genomics-data-analysis
