RNA sequencing – a basic introduction

ALLBIO – Scilife – UPPNEX – BILS course, 12-16 May 2014

Maja Molin, PhD, Dept. of Medical Biochemistry and Microbiology, Uppsala University


Page 1: RNA sequencing – a basic introduction

RNA sequencing – a basic introduction

ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014

Maja Molin, PhD
Dept. of Medical Biochemistry and Microbiology, Uppsala University

Page 2: RNA sequencing – a basic introduction

Overview

Lecture
• Historical perspective – “past” and present techniques
• An RNA-seq experiment consists of many steps
  • Design experiment
  • Purify RNA
  • Prepare libraries
  • Sequence
  • Analysis

Exercise
• RNA-seq analysis using the de novo assembler Trinity

Page 3: RNA sequencing – a basic introduction

Historical perspective – “past” and present techniques

“Past”
• Sequencing -> Sanger sequencing of cDNA libraries
  • Limitations in the number of sequences
  • Redundancy due to highly expressed genes
  • Read length about 800 bp -> poor coverage of full-length transcripts
  • Prone to indel errors
• Global quantification -> Expression microarrays
  • Sequences have to be known
  • Incomplete annotations
  • No discovery of novel transcripts
  • Hybridization-based method, problems with SNPs and indels
  • Noise
  • Signal intensity is used to calculate the expression level of the gene

Page 4: RNA sequencing – a basic introduction

Historical perspective – “past” and present techniques

Present
• Sequencing -> Next-Gen Sequencing technologies
  • Several different platforms: Illumina, SOLiD, Ion Torrent, 454, PacBio
  • Short reads
  • Full-length transcripts
  • High dynamic range
  • Strand-specific sequencing
  • Sequencing errors are mostly substitutions
• Applications
  • Global differential expression analysis
  • Characterization of alternative splicing, polyadenylation, transcription
  • Discovery of novel transcripts
  • SNP finding
  • RNA editing
  • Allelic gene expression

Page 5: RNA sequencing – a basic introduction

An RNA-seq experiment consists of many steps

1. Design experiment
2. Purify RNA
3. Prepare libraries
4. Sequence
5. Analysis

1. Design experiment
• Is the primary aim qualitative or quantitative?
  • Qualitative/annotation: identify expressed transcripts, exon/intron boundaries, TSS, poly-A sites. Sequence reads must cover the transcripts evenly, including both ends. Coverage depends on library prep and sequencing depth.
  • Quantitative/DGE: measure differences in gene expression, alternative splicing, TSS and poly-A sites between ≥2 groups. Must accurately measure the counts of transcripts and the variances associated with the counts. Replicates are essential!
  • Source: http://rnaseq.uoregon.edu/
• Other objectives? SNP finding, allelic gene expression, RNA editing?
• Which sequencing technology: Illumina, SOLiD, Ion Torrent, 454, PacBio?

Page 6: RNA sequencing – a basic introduction


2. Purify RNA
• A cell contains many types of RNA, e.g.
  • rRNA (>80%)
  • tRNA
  • mRNA (1-5% of total RNA)
  • miRNA
  • ncRNA
  • snoRNA
• Always use high-quality, high-purity RNA for sequencing
  • OD 260/280 ratio > 1.8, 260/230 ratio close to 2.0
  • RIN > 8.0
  • Measure concentration using Qubit
  • If RNA extraction is based on phenol (e.g. TRIzol) or other organic methods -> RNA clean-up is recommended, e.g. using columns to remove traces of phenol
  • DNase I treatment of RNA is recommended
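The purity and integrity thresholds on this slide can be turned into a simple sample check. The sketch below is only an illustration: the function name, the example measurements and the tolerance used for "close to 2.0" (±0.2) are assumptions, not values from the lecture.

```python
def rna_qc_problems(a260_280, a260_230, rin, a260_230_tolerance=0.2):
    """Return a list of failed checks (empty list if the sample passes).

    Thresholds follow the slide: 260/280 > 1.8, 260/230 close to 2.0, RIN > 8.0.
    The tolerance for "close to 2.0" is an assumption made for this sketch.
    """
    problems = []
    if a260_280 <= 1.8:
        problems.append("260/280 ratio should be > 1.8")
    if abs(a260_230 - 2.0) > a260_230_tolerance:
        problems.append("260/230 ratio should be close to 2.0")
    if rin <= 8.0:
        problems.append("RIN should be > 8.0")
    return problems


# Hypothetical measurements: (260/280, 260/230, RIN)
samples = {"control_1": (2.05, 2.1, 9.2), "treated_1": (1.7, 1.4, 6.5)}
for name, (r280, r230, rin) in samples.items():
    print(name, rna_qc_problems(r280, r230, rin) or "OK")
```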

Page 7: RNA sequencing – a basic introduction


3. Prepare libraries
• Library preparation by the platform or by you?
• Library prep. needs to match the sequencing technology.
• PolyA selection or rRNA depletion for mRNA sequencing?
  • PolyA selection isolates mRNA very efficiently but cannot be used for non-polyadenylated RNA.
  • rRNA depletion preserves non-polyA RNAs but is less effective at removing all rRNA.
• Single-end or paired-end? (PE allows more accurate mapping and is useful for isoform detection)
• Strand-specific library (or non-stranded?)
• Barcoding and pooling

Page 8: RNA sequencing – a basic introduction

Strand-specific (or non-stranded) library

Non-stranded library
• Does not contain any information about which strand was originally transcribed

Strand-specific library
• Preserves the information about which strand was transcribed
• Anti-sense transcripts can be identified
• The exact boundaries of adjacent genes transcribed from opposite strands can be identified
• Gives the correct expression pattern of coding or non-coding overlapping transcripts
• Often the default method today

Levin JZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010 Sep;7(9):709-15.

Page 9: RNA sequencing – a basic introduction

Strand-specific (or non-stranded) library

Levin JZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010 Sep;7(9):709-15.

Page 10: RNA sequencing – a basic introduction

Barcoding and pooling

[Figure: library construction. Total RNA -> mRNA -> fragmented mRNA/cDNA -> finished library (adapter, cDNA insert, adapter with index)]

Barcoding and pooling:
• Short sequences of 6-8 nt (index) are introduced as part of the adapters
• The index provides a unique identifier for each sample
• The index allows pooling of samples to avoid lane effects and to use the sequencing capacity more efficiently
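To illustrate how the index lets pooled samples be separated again after sequencing, here is a minimal demultiplexing sketch. It assumes the index has already been read out for each read and uses exact matching only; the index sequences, sample names and reads are hypothetical, and real demultiplexing software typically also tolerates mismatches.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group reads by sample using an exact match on the index sequence.

    reads: iterable of (index_sequence, read_sequence) tuples.
    index_to_sample: dict mapping index sequence -> sample name.
    Reads whose index is not in the dict are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical 6-nt indexes and toy reads
index_to_sample = {"ATCACG": "control_1", "CGATGT": "treated_1"}
reads = [
    ("ATCACG", "GGATCTTTCA"),
    ("CGATGT", "ACGAGTAATC"),
    ("ATCACG", "GTCTTCACCA"),
    ("NNNNNN", "CGGGTTAGAA"),  # index could not be read
]
pooled = demultiplex(reads, index_to_sample)
for sample, sequences in pooled.items():
    print(sample, len(sequences))
```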

Page 11: RNA sequencing – a basic introduction


4. Sequence
• Pooling strategy
• Sequence depth

[Figure: pooling strategy. Control: 3 biological replicates (1, 2, 3). Treated: 3 biological replicates (1, 2, 3). Pool and sequence in one lane on Illumina HiSeq.]

Page 12: RNA sequencing – a basic introduction


4. Sequence
• Pooling strategy
• Sequence depth
  • 30M reads are sufficient to detect nearly all annotated chicken genes (15,742).
  • 30M reads generate representative assemblies, with a good balance between coverage and noise.
  • With >60M reads, sequencing errors accumulate in highly expressed genes and few new genes are discovered.
  • Increasing the number of replicates is more important than increasing sequencing depth for DE analysis.

References:
Wang et al. BMC Bioinformatics 2011, 12(Suppl 10):S5
Francis et al. BMC Genomics 2013, 14:167
Rapaport et al. Genome Biology 2013, 14:R95
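Pooling strategy and sequence depth are linked by simple arithmetic: the reads from one lane are shared between all pooled samples. The sketch below shows that calculation. The lane yield of 180 million reads is a purely hypothetical number chosen for the example (actual yield depends on instrument and run), and the function names are illustrative.

```python
def reads_per_sample(lane_yield, n_samples):
    """Expected reads per sample if a lane is shared equally between samples."""
    return lane_yield // n_samples


def max_samples_per_lane(lane_yield, target_depth):
    """Largest number of samples that still reach the target depth each."""
    return lane_yield // target_depth


# Hypothetical lane yield; 6 samples = 3 control + 3 treated replicates
lane_yield = 180_000_000
print(reads_per_sample(lane_yield, 6))
print(max_samples_per_lane(lane_yield, 30_000_000))
```

Equal sharing is itself an assumption: in practice the libraries in a pool are rarely represented perfectly evenly.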

Page 13: RNA sequencing – a basic introduction


5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing reads

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

Page 14: RNA sequencing – a basic introduction

Quality check of sequence reads
• Illumina sequencing runs store data in large text files in FASTQ format (extension .fq or .fastq).
• FASTQ files contain both the sequence and the quality of each base call for every read in the run.
• Information about each read is listed on four consecutive lines:
  1. Sequence ID beginning with @
  2. Base calls (sequence)
  3. A plus sign
  4. Sequence quality codes

Example (lines 1-4 of one read):

    @61G9EAAXX100520:5:100:10000:12335/1
    CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT
    +
    GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED
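The four-line record structure described above can be read with a few lines of code. This is a minimal sketch, not a replacement for an established FASTQ parser: it assumes exactly four lines per record (no wrapped sequences) and assumes quality codes use the Phred+33 encoding, which is not stated on the slide. The toy record is made up for the example.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Toy record, not taken from a real run
toy = io.StringIO("@read1/1\nACGT\n+\nGGFE\n")
records = list(read_fastq(toy))
for read_id, sequence, quality in records:
    print(read_id, sequence, phred_scores(quality))
```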

Page 15: RNA sequencing – a basic introduction

Quality check of sequence reads

Paired-end sequences [figure: reads from both ends of the cDNA insert]

One FastQ file with all the left (/1) reads:

    @61G9EAAXX100520:5:100:10000:12335/1
    CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT
    +
    GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED
    @61G9EAAXX100520:5:100:10000:14468/1
    ACGAGTAATCTTGGTGGGGATACCAAGAGCTTGGAAGAAAGAGGTCTTACCGGGTTCCATACCAGTGT
    +
    GGGGGGGGGDGGGGBGGGGGGGGFDFGGGGGGGFEFGEFFGDEFDDEGGEEEEECDDFDEDDACDCDE

One FastQ file with all the right (/2) reads:

    @61G9EAAXX100520:5:100:10000:12335/2
    GGATCTTTCACATTTGAAATGTCTCTTCCTCACCGTAATCCCTCATTGTCTTCCCTTCCAACTACTGG
    +
    GGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGFFFGEFFGGGGGGGGDEEGEFGFG
    @61G9EAAXX100520:5:100:10000:14468/2
    GTCTTCACCAACGCTGATTTGAAGGAAGTCCGTGAGACCATTATTGCTAATGTTATTGCTGCTCCTGC
    +
    GGFGGGGGDGGGGGGGGGGGFEGGFGGGEGGGFGGGGGFGGGGGGGGGGGGGDGBGFFFFFEEFEFFB

Page 16: RNA sequencing – a basic introduction

Quality check of sequence reads using the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Quality scores across all reads in a file, summarized by position. A good run will have quality scores >28. If the quality is lower at some positions, consider trimming.
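The idea of summarizing quality by position can be sketched as follows. This is not how FastQC is implemented (FastQC reports box plots per position); the sketch only computes the mean score at each position, assumes Phred+33 encoding, and uses made-up quality strings.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Mean quality score at each read position across all reads."""
    totals, counts = [], []
    for quality in quality_strings:
        for pos, ch in enumerate(quality):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def positions_below(means, threshold=28):
    """0-based positions whose mean quality falls below the threshold."""
    return [pos for pos, mean in enumerate(means) if mean < threshold]


# Toy quality strings: 'I' = 40, '?' = 30, '5' = 20 with offset 33
means = mean_quality_by_position(["II5", "I?5"])
print(means, positions_below(means))
```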

Page 17: RNA sequencing – a basic introduction

Quality check of sequence reads using the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Shows if a subset of your sequences has overall low quality scores. If the most frequently observed mean quality is <27, a warning is raised. You can consider filtering your reads by average quality to keep only the best reads.
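Filtering reads by average quality, as suggested on this slide, could look like the sketch below. Reusing 27 as the cutoff is an assumption made for illustration (the slide gives it as FastQC's warning level, not as a recommended filter), as is the Phred+33 encoding. Records are (id, sequence, quality) tuples with toy values.

```python
def mean_read_quality(quality, offset=33):
    """Mean quality score of one read (offset 33 is an assumption)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_by_mean_quality(records, min_mean=27, offset=33):
    """Keep only non-empty reads whose mean quality is at least min_mean."""
    return [
        record
        for record in records
        if record[2] and mean_read_quality(record[2], offset) >= min_mean
    ]


# Toy records: 'I' = 40 and '5' = 20 with offset 33
records = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "5555"), ("r3", "ACGT", "II55")]
kept = filter_by_mean_quality(records)
print([read_id for read_id, _, _ in kept])
```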


Page 19: RNA sequencing – a basic introduction

Preprocessing of sequencing reads using Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic)

Consider running FastQC again to check your trimming.
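To give a feel for what quality trimming does, here is a simplified sketch that removes low-quality bases from the 3' end of a read and discards reads that become too short. It is not Trimmomatic's algorithm or interface; the thresholds (minimum quality 20, minimum length 36) and the Phred+33 encoding are assumptions chosen for the example.

```python
def trim_3prime(sequence, quality, min_quality=20, min_length=36, offset=33):
    """Trim trailing low-quality bases; return None if the read gets too short.

    Returns a (sequence, quality) tuple for reads that survive trimming.
    """
    end = len(sequence)
    # Walk back from the 3' end while the last base is below the threshold
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality[:end]


# Toy read: 'I' = 40 and '#' = 2 with offset 33
print(trim_3prime("ACGTACGT", "IIIII###", min_length=4))
print(trim_3prime("ACGTACGT", "IIIII###", min_length=6))
```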


Page 21: RNA sequencing – a basic introduction

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Novel organism – little or no previous sequencing?
• Non-model organism – some sequences available (ESTs, Unigene set)
• Genome-sequenced organism – draft genome with maybe tens of chromosomes, some annotations etc.
• Model organism – genome fully sequenced and annotated, with multiple genomes available, well-annotated transcriptomes, genetic maps, available mutants etc.

Page 22: RNA sequencing – a basic introduction

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

Haas BJ and Zody MC. Nat Biotechnol. 2010 May;28(5):421-3.

Tools shown: TopHat, Cufflinks

Page 23: RNA sequencing – a basic introduction

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Trinity
• Trans-ABySS
• Velvet-Oases
• SOAPdenovo-Trans

Page 24: RNA sequencing – a basic introduction

De novo assembly using Trinity

Trinity combines three independent software modules:
• Inchworm
• Chrysalis
• Butterfly

Inchworm
• k-mer = short oligonucleotide of length k
• All sequence reads are cut into overlapping k-mers (25-mers). Each k-mer overlaps with its neighbor in all but one base.

Martin and Wang, Nat. Rev. Genet. Oct 2011, vol 12:671-682
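Cutting reads into overlapping k-mers and counting them can be sketched in a few lines. The default k of 25 follows the slide; the function names and the toy reads (with k = 7 to keep the output short) are illustrative only and say nothing about how Trinity stores its k-mer catalog.

```python
from collections import Counter


def kmers(sequence, k=25):
    """All overlapping k-mers of a sequence; neighbors share k-1 bases."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k=25):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy reads with k = 7
print(kmers("GATTACAG", 7))
catalog = count_kmers(["AGATTACA", "GATTACAG"], k=7)
print(catalog.most_common(1))
```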

Page 25: RNA sequencing – a basic introduction

De novo assembly using Trinity

Inchworm algorithm

1. Identify the seed k-mer as the most abundant k-mer.
2. Extend the k-mer at the 3´ end and at the 5´ end based on coverage.
3. For each extension, 4 possible k-mers exist, each ending with one of the four nucleotides. The most abundant cumulative ending wins!
4. The assembled contig is reported, the assembled k-mers are removed from the catalog, and the whole process starts again.
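The four steps of the Inchworm algorithm listed on this slide can be sketched as a greedy loop over a k-mer count table. This is a simplified illustration, not Trinity's implementation: at each step it picks the single most abundant neighboring k-mer rather than looking further ahead, ties are broken arbitrarily, and it assumes k ≥ 2. The k-mer counts are toy values.

```python
from collections import Counter


def greedy_contigs(kmer_counts):
    """Assemble contigs by greedy extension from the most abundant k-mer."""
    counts = Counter(kmer_counts)  # working copy; used k-mers are deleted
    contigs = []
    while counts:
        # Step 1: the most abundant remaining k-mer is the seed
        seed, _ = counts.most_common(1)[0]
        del counts[seed]
        contig = seed
        k = len(seed)
        # Steps 2-3: extend the 3' end with the most abundant candidate
        while True:
            suffix = contig[-(k - 1):]
            candidates = [(counts[suffix + b], b) for b in "ACGT" if suffix + b in counts]
            if not candidates:
                break
            _, base = max(candidates)
            del counts[suffix + base]
            contig += base
        # Steps 2-3 again, now for the 5' end
        while True:
            prefix = contig[:k - 1]
            candidates = [(counts[b + prefix], b) for b in "ACGT" if b + prefix in counts]
            if not candidates:
                break
            _, base = max(candidates)
            del counts[base + prefix]
            contig = base + contig
        # Step 4: report the contig and start again with what is left
        contigs.append(contig)
    return contigs


# Toy 7-mer catalog
toy_counts = {"GATTACA": 9, "AGATTAC": 5, "ATTACAG": 4, "TTACAGA": 6, "ATTACAC": 1}
print(greedy_contigs(toy_counts))
```

In this toy catalog the low-count k-mer ATTACAC loses the competition for the 3' extension and is left over as its own short contig.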

Page 26: RNA sequencing – a basic introduction

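Inchworm algorithm

[Figure: Inchworm extension example. The seed k-mer GATTACA (count 9) is extended one base at a time; at each step the four candidate extensions (G, A, T, C) are compared by their k-mer counts.]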

Page 27: RNA sequencing – a basic introduction

Inchworm algorithm

[Figure: Inchworm extension example, continued. The seed k-mer GATTACA (count 9) is extended in both directions until all candidate extensions have a count of 0.]

• Report the contig …….AGATTACAGA…...
• Remove the assembled k-mers from the catalog of all k-mers and then repeat this step
• The Trinity default is a minimum k-mer count of 1 (all k-mers are used), but with large datasets this parameter can be changed to a minimum k-mer count of 2

Page 28: RNA sequencing – a basic introduction

De novo assembly using Trinity

Trinity combines three independent software modules:
• Inchworm – linear contigs
• Chrysalis – reclusters/regroups related contigs from Inchworm
• Butterfly – reconstructs transcripts and alternatively spliced isoforms

Trinity output – a FASTA file with all the transcripts

[Figure: example transcript header with the gene identifier marked]
• c2 is the read cluster from Inchworm
• g0 is the “gene”
• i1 is the isoform
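The c/g/i parts of a transcript name can be pulled apart with a small parser. The sketch assumes the three parts are joined with underscores, as in `c2_g0_i1`; the exact header layout is not visible in this transcript and differs between Trinity versions, so treat the pattern as an assumption. Here the gene identifier is taken to be the cluster plus gene part.

```python
import re

# Assumed identifier layout: c<cluster>_g<gene>_i<isoform>
TRANSCRIPT_ID = re.compile(r"^c(\d+)_g(\d+)_i(\d+)$")


def parse_transcript_id(identifier):
    """Split an identifier such as 'c2_g0_i1' into its numbered parts."""
    match = TRANSCRIPT_ID.match(identifier)
    if match is None:
        raise ValueError(f"Unexpected transcript identifier: {identifier!r}")
    cluster, gene, isoform = (int(part) for part in match.groups())
    return {
        "cluster": cluster,
        "gene": gene,
        "isoform": isoform,
        "gene_id": f"c{cluster}_g{gene}",  # shared by all isoforms of the "gene"
    }


parsed = parse_transcript_id("c2_g0_i1")
print(parsed)
```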


Page 30: RNA sequencing – a basic introduction

1. Design experiment
• Is the primary aim qualitative/annotation or quantitative/differential gene expression?
  • Qualitative/annotation

Page 31: RNA sequencing – a basic introduction

1. Design experiment
• Is the primary aim qualitative/annotation or quantitative/differential gene expression?
  • Quantitative/differential gene expression
    • The level of gene expression corresponds to read counts
    • Align reads to the transcriptome assembly or reference genome
    • Calculate expression values (abundance estimation) based on the mapped reads
    • Output is normalized expression values
    • Normalization is based on both the length of the transcript and the total depth of the sequencing
      • RPKM (Reads Per Kilobase per Million reads mapped)
      • FPKM (Fragments Per Kilobase per Million reads mapped)
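The normalization described here divides the read count by the transcript length in kilobases and by the number of mapped reads in millions. A minimal sketch of that calculation follows; the function name and the example numbers are hypothetical. FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# Hypothetical library of 20 million mapped reads.
# The same read count on a transcript half as long gives twice the RPKM.
long_transcript = rpkm(1_000, 2_000, 20_000_000)
short_transcript = rpkm(1_000, 1_000, 20_000_000)
print(long_transcript, short_transcript)
```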

Page 32: RNA sequencing – a basic introduction

Normalized read count/expression values

[Figure: bar charts comparing read count and expression value (RPKM or FPKM) for (1) low expression vs. (2) high expression, and for (3) a short transcript vs. (4) a long transcript.]

Page 33: RNA sequencing – a basic introduction

Summary

• An RNA-seq experiment consists of many steps
  • Design experiment
  • Purify RNA
  • Prepare libraries
  • Sequence
  • Analysis
• Several different options to choose between at every step
• De novo assembler Trinity

Page 34: RNA sequencing – a basic introduction

ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014

Maja Molin, PhD
Dept. of Medical Biochemistry and Microbiology, Uppsala University

Thank you!

Questions?