RNA-seq: A High-Resolution View of the Transcriptome

Post on 15-Feb-2017


Sean Davis, M.D., Ph.D.
Genetics Branch, Center for Cancer Research
National Cancer Institute
National Institutes of Health

RNA-seq: A High-Resolution View of the Transcriptome

Normal Karyotype

Tumor Karyotype

The Central Dogma

phenotype

Gene Copy Number

Sequence Variation

Chromatin Structure and Function

Gene Expression

Transcriptional Regulation

DNA Methylation

Patient and Population Characteristics

+

=

Your Nature Paper

High Throughput Sequencing (AKA NGS)

Illumina SBS Technology: Reversible Terminator Chemistry Foundation

[Figure: Illumina sequencing-by-synthesis workflow. Panels: DNA (0.1-1.0 ug); Sample preparation; Single molecule array; Cluster growth; Sequencing; Image acquisition; Base calling (T G C T A C G A T …)]

© Illumina, Inc.
http://www.illumina.com/technology/sequencing_technology.ilmn
http://seqanswers.com/forums/showthread.php?t=21

Single end vs paired end sequencing

Illumina Paired-end sequencing

Paired-end: useful for RRBS, essential for RNA-seq, not useful for ChIP-seq

What comes out of the machine: short reads in fastq format

@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
@D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
CTGACCTGGGTCCTGTGGTGCTCAGCCTTTTGAAGATGCCAGAAAAATACGTCG
+D3B4KKQ1_0166:8:1101:2358:2174#CGATGT/1
\^_cccccg^Y`ega`fg`ebegfhd^egghhghfffhghdhbfffhhhfgfcf
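Each fastq record spans four lines: an "@" header, the read sequence, a "+" separator line, and a quality string with one character per base. A minimal sketch of a reader for that layout follows, in Python with the standard library only. The names `FastqRecord` and `read_fastq` are illustrative and not part of any particular package, and the sketch assumes unwrapped (single-line) sequences as in the records shown here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str   # base calls
    quality: str    # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two of the records shown on the slide, used as example input.
example = "\n".join([
    "@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1",
    "CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG",
    "+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1",
    "[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa",
    "@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1",
    "TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC",
    "+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1",
    "_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh",
]) + "\n"

records = list(read_fastq(io.StringIO(example)))
for rec in records:
    print(rec.name, len(rec.sequence))
```

For compressed files such as s_8_1_sequence.txt.gz, the same function can be given a handle from `gzip.open(path, "rt")`.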

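QS to int in R: as.integer(charToRaw('e')) - 33 (Sanger, Phred+33 encoding). The reads shown here contain quality characters up to 'i' and runs of 'B', which is typical of the older Illumina Phred+64 encoding; for such files subtract 64 instead of 33.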
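The same conversion can be sketched in Python, together with the usual relation between a Phred score Q and the base-call error probability, p = 10^(-Q/10). The function names and the offset-guessing rule are illustrative assumptions: the guess is only a heuristic and returns None when the characters seen are compatible with both encodings.

```python
def phred_scores(quality, offset=33):
    """Convert a fastq quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def guess_offset(quality):
    """Heuristic guess of the ASCII offset (assumption, not a standard API).

    Characters above 'J' (74) do not occur in typical Phred+33 data,
    and characters below ';' (59) cannot occur in Phred+64 data.
    """
    codes = [ord(ch) for ch in quality]
    if max(codes) > 74:
        return 64
    if min(codes) < 59:
        return 33
    return None  # ambiguous


qual = "aa_eeeeegggggihhiii"          # start of a quality string from the slide
offset = guess_offset(qual)
scores = phred_scores(qual, offset)
print(offset, scores[:5], error_probability(scores[0]))
```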

Paired-end sequencing

s_8_1_sequence.txt.gz

@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
CTCCTGGAAAACGCTTTGGTAGATTTGGCCAGGAGCTTTCTTTTATGTAAATTG
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1
[^^cedeefee`cghhhfcRX`_gfghf^bZbecg^eeb[caef`ef^a_`eXa
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
TCCANCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTGTC
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1
ab_eBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
TACAAGTGCAGCATCAAGGAGCGAATGCTCTACTCCAGCTGCAAGAGCCGCCTC
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/1
_[_ceeec[^eeghdffffhh^efh_egfhfgeec_fbafhhhhd`caegfheh
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
GAAGGAGAGAAGGGGAGGAGGGCGGGGGGCACCTACTACATCGCCCTCCACATC
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/1
\^_accceg`gga`f[fgcb`Ucgfaa_LVV^[bbbbbRWW`W^Y[_[^bbbbb
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
GTGGCCGATTCCTGAGCTGTGTTTGAGGAGAGGGCGGAGTGCCATCTGGGTAGC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/1
aa_eeeeegggggihhiiifgeghfeghbgcghifiidg^dbgggeeeee`dcd
…

s_8_2_sequence.txt.gz

@D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
GGCATATTTAACAGCATTGAACAGAATTCTGTGTCCTGTAAAAAAATTAGCTTA
+D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2
a__aaa`ce`cgcffdf_acda^ea]befffbeged`g[a`e_caaac]cb`gb
@D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
TTGAGGCTGTTGTCATACTTCTCATGGTTCACACCCATGACGAACATGGGGGCG
+D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2
a__eeeeeggegefhhhiiihhhhhiieghhhghhiiffhiififhhiihegic
@D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
CGGGGTGCACCTCGTCGTAGAGGAACTCTGCCGTCAGCTCTGCCCCATCGCCAA
+D3B4KKQ1_0166:8:1101:2249:2171#CGATGT/2
^__ee__cge`cghghhfgddgfgi]ehhfffff^ec[beegidffhhfhadba
@D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
CTTAGTCTCAGTTTTCCTCCAGCAGCCTGAGGAAACTCAAAGGCACAGTTCCCA
+D3B4KKQ1_0166:8:1101:2043:2187#CGATGT/2
_abeaaacg^g^eghhhhgafghhdfghfedeghfiiicfbgdHYagfeecggf
@D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
TAGGCTCAAAGTCTAACGCCAATCCCGAACCTGGGCATCTGTACACACACACAC
+D3B4KKQ1_0166:8:1101:2188:2232#CGATGT/2
abbeceeegggcghiihiihhhhiifhiiiiihiiiiiiihegh`eggfebfhg
…
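In the two files, mates appear in the same order and share a read name that differs only in the trailing /1 or /2. A small sketch of a sanity check for that property is given below. The helper names are hypothetical, and the check assumes the "/1" and "/2" suffix convention seen in these headers (newer Illumina headers mark the mate differently).

```python
def split_mate(name):
    """Split 'read#index/1' into ('read#index', '1'); mate is '' if absent."""
    base, sep, mate = name.rpartition("/")
    if sep and mate in ("1", "2"):
        return base, mate
    return name, ""


def pairs_in_sync(names_1, names_2):
    """True if both lists hold the same reads, in the same order, as /1 and /2."""
    if len(names_1) != len(names_2):
        return False
    for n1, n2 in zip(names_1, names_2):
        base_1, mate_1 = split_mate(n1)
        base_2, mate_2 = split_mate(n2)
        if base_1 != base_2 or (mate_1, mate_2) != ("1", "2"):
            return False
    return True


# Read names taken from the first records of each file on the slide.
file_1 = ["D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/1",
          "D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/1"]
file_2 = ["D3B4KKQ1_0166:8:1101:1960:2190#CGATGT/2",
          "D3B4KKQ1_0166:8:1101:2154:2137#CGATGT/2"]
print(pairs_in_sync(file_1, file_2))
```

Running such a check before alignment can catch files that were filtered or sorted independently, which breaks the mate ordering that paired-end aligners rely on.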

RNA-seq protocol schematic

Our First Experiment

Overview of BAC in the Genome

Sequencing a BAC

Sequence Coverage

Repeats

Repeats

Repeats are not created equal

Approaches to RNA-seq

Nature Biotech (2010) 28, 421-423

Alignment

RNA-seq Alignment

Run Time

Alignment Yield

Splice Read Placement Accuracy

Impact on Transcript Assembly

Transcript Quantification

Models for RNA-seq

• Count-based models
• Multi-reads (isoform resolution)
• Paired-end reads (include length resolution step)
• Positional bias along transcript length
• Sequence bias

Read Counting

Mortazavi, 2008, NMeth
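Mortazavi et al. (2008) normalized read counts as RPKM: reads per kilobase of transcript per million mapped reads. A minimal sketch of that calculation follows. The function name and the example numbers are illustrative only and are not taken from the slides.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 1e9 * C / (N * L), with C reads on the transcript,
    N total mapped reads, and L the transcript length in bp.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return 1e9 * read_count / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by length makes transcripts of different sizes comparable within a sample, and dividing by library size makes samples of different depth comparable.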

L. Pachter (2011) arXiv:1104.3889v

Sequence Bias--priming

Hansen (2010), NAR

Sample-specific Sequence Bias

Models for RNA-seq

Result of Quantification

Clustering and Visualization

Hierarchical Clustering

[Figure, repeated over four slides: Gene 1 through Gene 8]

Distance Metrics

Euclidean distance

Manhattan distance

Minkowski distance (generalized distance)
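The formulas for these three metrics did not survive in the transcript. For reference, the standard definitions for two vectors x and y of length n are sketched below, with p as the Minkowski order (the symbols are chosen here, not taken from the slides).

```latex
\begin{align*}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert \\
d_{\text{Minkowski}}(x, y) &= \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
\end{align*}
```

Setting p = 2 in the Minkowski distance gives the Euclidean distance and p = 1 gives the Manhattan distance, which is why it is called the generalized distance.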

Distance Metrics
• Correlation
– maximum value of 1 if X and Y are perfectly correlated
– minimum value of -1 if X and Y are exactly opposite
– d(X,Y) = 1 - r_XY
• Many, many others
• Choice of distance metric can be driven by underlying data (e.g., binary data, categorical data, outliers, etc.)
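The metrics named on these slides can be sketched as plain Python functions. The function names are illustrative, and the correlation used is the Pearson correlation, which is an assumption since the slide does not say which correlation is meant.

```python
from math import sqrt


def minkowski(x, y, p):
    """Generalized distance of order p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient r_XY."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """d(X,Y) = 1 - r_XY: 0 for perfectly correlated, 2 for exactly opposite."""
    return 1.0 - pearson(x, y)


x = [1.0, 2.0, 3.0]
print(euclidean(x, [2.0, 4.0, 6.0]), correlation_distance(x, [2.0, 4.0, 6.0]))
```

In the example, x and 2x are far apart in Euclidean terms but have correlation distance close to 0, which illustrates why the choice of metric changes which profiles cluster together.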

Example of Distance Metric Choice

Example
• dat = matrix(rnorm(10000),ncol=20)
• dat[1:100,1:10] = dat[1:100,1:10]+1
• hclust
• dist
• as.dist(1-cor)
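The R outline above simulates a 500 x 20 matrix of standard normal values, shifts the first 100 rows of the first 10 columns by 1, and clusters with either Euclidean or correlation distance. A rough standard-library Python sketch of the same idea follows. Several choices here are assumptions: the sketch clusters the 20 columns (the slide does not say whether rows or columns are clustered), uses complete linkage (the default method of R's hclust), and all function names are hypothetical.

```python
import random
from math import sqrt


def simulate(n_rows=500, n_cols=20, shift_rows=100, shift_cols=10, seed=1):
    """Standard normal matrix with a +1 shift in the top-left block."""
    rng = random.Random(seed)
    dat = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    for i in range(shift_rows):
        for j in range(shift_cols):
            dat[i][j] += 1.0
    return dat


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)


def complete_linkage(items, dist):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: largest pairwise distance between clusters
                h = max(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


dat = simulate()
samples = [list(col) for col in zip(*dat)]   # one vector per column
merges_euclid = complete_linkage(samples, euclidean)
merges_cor = complete_linkage(samples, correlation_distance)
print(merges_euclid[-1][2], merges_cor[-1][2])
```

Comparing the members of the last merge under each metric shows whether the two groups of ten columns are recovered; with a shift this small relative to the noise, the answer can differ between metrics and between random seeds.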

Differential Expression

MA Plot
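An MA plot shows, for each gene, the log ratio between two conditions (M) against the average log expression (A). A minimal sketch of the two quantities is given below. The pseudocount that avoids log2(0) and the choice of which condition is the numerator are assumptions made for illustration, not details from the slides.

```python
from math import log2


def ma_values(count_a, count_b, pseudocount=1.0):
    """Return (M, A) for one gene.

    M = log2(b) - log2(a)          (log fold change of b relative to a)
    A = (log2(a) + log2(b)) / 2    (mean log expression)
    """
    log_a = log2(count_a + pseudocount)
    log_b = log2(count_b + pseudocount)
    return log_b - log_a, (log_a + log_b) / 2.0


# Hypothetical counts for one gene in two conditions.
m, a = ma_values(3, 15)
print(m, a)
```

Genes without differential expression scatter around M = 0, so a systematic drift of the cloud away from zero along A usually points to a normalization problem.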

DE False Positive Rates

DE Evaluation

DE Software Runtime

RNA-seq workflow as proposed by Anders et al. in Nature Protocols

MA Plot

Fusion Gene Detection

Fusion gene schematic

Fusion Detection

False Positive Fusion Detection

Experimental Design

• What are my goals?
– Differential expression?
– Transcriptome assembly?
– Identify rare, novel transcripts?
• System characteristics?
– Large, expanded genome?
– Intron/exon structures complex?
– No reference genome or transcriptome?

Experimental Design

• Technical replicates
– Probably not needed due to low technical variation
• Biological replicates
– Not explicitly needed for transcript assembly
– Essential for differential expression analysis
– Number of replicates often driven by sample availability for human studies
– More is almost always better

Links of Interest

• http://bioconductor.org
• http://biostars.org
• http://www.rna-seqblog.com/
• https://genome.ucsc.edu/ENCODE/
• http://www.ncbi.nlm.nih.gov/gds/

Visualizing Splicing
