Biomedical Informatics Shared Resource Workshop

RNA-seq analysis

2015 03 12

Paolo Guarnieri, M.D.


Topics

• Experimental design and library selection

• Sequence handling and processing

• Quality assessment

– Library level

– Read sequence level

– Sample level

• Identification of differentially expressed genes

• Additional analyses and tools


Comparing paradigms

Microarray

Experimental design

• Platform

• Chips

• Samples

RNA preparation

• Library preparation for hybridization

Data analysis

• Image analysis

• Probe intensities

• Normalization

RNA-seq

Experimental design

• Lanes

• Reads

• Samples

RNA preparation

• Library preparation for sequencing

Data analysis

• Alignment

• Read counts

• Normalization


Experimental design and library selection

• Sample preparation:

Choose the proper kit for your experiment

Poly-A mRNA isolation vs. rRNA depletion (Ribo-Zero)

• Library preparation:

Single end vs. paired end

Refer to the kit manufacturer's documentation for details


Paired end vs. single end

adapted from: Zhernakova et al., PLoS Genet. 2013 Jun; 9(6)

Library preparation steps:

- adapters ligated

- amplified

- size selected


THE SEQUENCE


Sequencing process

Images: Werner Van Belle

Flow cell: composed of multiple lanes

Lanes: contain multiple imaging positions


Sequencing by synthesis process

Images: Werner Van Belle

Positions are imaged 4 times

Each imaging position contains multiple sequence clusters

Incorporation of a new nucleotide

Repeat n times, where n = read length


Base calling

Sequencing complete

Per-cycle base calling

Sequence file: <.fastq>

Images: Werner Van Belle


The sequence

The raw data is the <.fastq> file

• A collection of multiple reads

• Each read in the file has these main features:

– spans 4 lines

– the first line starts with @

<.fastq> files are often compressed into <.fastq.gz>

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36  <- Read ID + description
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC                     <- Sequence
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36  <- Same as line 1, but optional
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC                     <- Quality scores
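The four-line record structure can be read with a few lines of code. The sketch below is only an illustration: the names `FastqRecord`, `parse_fastq` and `phred_scores` are made up for this example, and the quality decoding assumes the common Phred+33 encoding, which the slide does not state.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # line 1 without the leading '@'
    sequence: str   # line 2
    quality: str    # line 4, one character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred+33 (an assumption)."""
    return [ord(char) - offset for char in quality]


# The record shown on the slide.
example = (
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC\n"
)
records = list(parse_fastq(io.StringIO(example)))
scores = phred_scores(records[0].quality)
print(records[0].read_id)
print(len(records[0].sequence), min(scores), max(scores))
```

For a compressed file, the same function should work on a handle opened with `gzip.open(path, "rt")` from the standard library.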


Sequence handling

When dealing with sequencing files, be aware of:

• Data transfer:

– files start at around 1.5 Gb per sample (more for paired end)

– compress as .gz to reduce transfer time

• Grouping files:

– match the read files (R1, R2) to their samples

• Storage:

– raw files are required for publication and must be kept safe
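Matching R1/R2 files to samples is easy to automate once file names follow a convention. The sketch below assumes a hypothetical `<sample>_R1.fq.gz` / `<sample>_R2.fq.gz` naming scheme; the pattern and the function name `group_by_sample` are illustrative and would need adapting to the names a facility actually delivers.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical naming convention: <sample>_R1.fq(.gz) or <sample>_R2.fastq(.gz)
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>R[12])\.(fq|fastq)(\.gz)?$")


def group_by_sample(filenames: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Return {sample: {"R1": file, "R2": file}} for the given file names."""
    samples: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        samples[match["sample"]][match["mate"]] = name
    return dict(samples)


files = ["S1_R1.fq.gz", "S1_R2.fq.gz", "S2_R1.fq.gz", "S2_R2.fq.gz"]
grouped = group_by_sample(files)
for sample, mates in grouped.items():
    print(sample, mates.get("R1"), mates.get("R2"))
```

A sample with only an R1 entry is then either single-end data or a missing file, which is worth checking before alignment.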


PRE-PROCESSING


Pre-processing

• Remove adaptors

– fastx_clipper

• De-multiplex (if required)

– fastx_barcode_splitter.pl, fastq-multx (handles both mates)

These steps are usually performed by the sequencing facility


Multiplexing removes variability

– Lane specific: multiplex in different lanes or flow cells (barcoding)

[Figure: a good and a bad multiplexing layout]


De-multiplexing

• Grouped barcodes file (.fil) looks like this:

<id1> <sequence1> <group1>

<id1> <sequence1> <group1>

<id2> <sequence2> <group2>...

https://code.google.com/p/ea-utils/wiki/FastqMultx

>fastq-multx -B barcode-definition.fil \
    PE_read1.fq -o r1.%.fq \
    PE_read2.fq -o r2.%.fq
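To make the idea of de-multiplexing concrete, here is a minimal sketch that assigns reads to samples by an exact barcode match at the start of each read. It is not how fastq-multx works internally (that tool also handles mismatches and paired files); the barcodes, sample names and the function `demultiplex` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read ID, sequence)


def demultiplex(reads: Iterable[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Bin reads by the barcode found at the 5' end; the barcode is clipped off."""
    bins: Dict[str, List[Read]] = {sample: [] for sample in barcodes.values()}
    bins["unmatched"] = []
    for read_id, sequence in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                bins[sample].append((read_id, sequence[len(barcode):]))
                break
        else:
            # no barcode matched exactly
            bins["unmatched"].append((read_id, sequence))
    return bins


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sampleA", "TGCA": "sampleB"}
reads = [("r1", "ACGTGGGTGA"), ("r2", "TGCACCGGCA"), ("r3", "GGGGATTCGA")]
bins = demultiplex(reads, barcodes)
for sample, members in bins.items():
    print(sample, members)
```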


QUALITY ASSESSMENT


Library level QC

Before sequencing: measure RNA integrity with the Bioanalyzer (usually performed by the facility)


Sequence level QC

• FastQC: quality score for each nt position

• Additional quality-based trimming

[Figure: example FastQC plots labeled Very good, Normal, Bad]
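Quality-based trimming can be sketched in a few lines. The example below removes low-quality bases from the 3' end of a read only; the threshold of 20, the Phred+33 offset and the function name `trim_3prime` are assumptions for illustration, and dedicated trimming tools use more refined rules (for example sliding windows).

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: the last three bases have very low quality ('#' = 2, '!' = 0).
trimmed_seq, trimmed_qual = trim_3prime("GGGTGATGGC", "IIIIIII#!!")
print(trimmed_seq, trimmed_qual)
```

Reads that become very short after trimming are usually discarded, since they tend to align ambiguously.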


Sample/Dataset level

• Evaluate the distribution of reads as a function of expression level


Coverage

After sequencing: evaluate your actual library size compared to the expected one

Total reads

Mapped reads

Sum of counts per sample


Alignment/Mapping – what

Reference genome / transcriptome

...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...

Reads to be mapped

TGGGCCGGCA

CCGGCAATTC

ATTCGATATC

GATATCGCGC

GCATATATTT

CATGCTTAGC

ATATTTCGGC

GCATATATTT

TCGCGCATAT


Alignment/Mapping – what

Reference genome / transcriptome

...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...
    TGGGCCGGCA            GCATATATTT     CATGCTTAGC
        CCGGCAATTC            ATATTTCGGC
              ATTCGATATC  GCATATATTT
                      TCGCGCATAT
                  GATATCGCGC

Reads mapped


Alignment/Mapping – Coverage
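The toy example from the two preceding slides can be reproduced with a naive exact-match search, which also gives the per-base coverage. This is only a sketch of the concept: real aligners such as those listed later use indexed search and tolerate mismatches and splicing. The function names `map_reads` and `coverage` are made up for this example.

```python
from typing import List, Optional, Tuple

# Reference fragment and reads as shown on the slides.
REFERENCE = "GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC"
READS = [
    "TGGGCCGGCA", "CCGGCAATTC", "ATTCGATATC", "GATATCGCGC", "GCATATATTT",
    "CATGCTTAGC", "ATATTTCGGC", "GCATATATTT", "TCGCGCATAT",
]


def map_reads(reference: str, reads: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Return (read, 0-based position of the first exact match or None)."""
    alignments = []
    for read in reads:
        position = reference.find(read)
        alignments.append((read, position if position >= 0 else None))
    return alignments


def coverage(reference: str, alignments: List[Tuple[str, Optional[int]]]) -> List[int]:
    """Count how many mapped reads cover each reference base."""
    depth = [0] * len(reference)
    for read, position in alignments:
        if position is None:
            continue  # unmapped read
        for offset in range(len(read)):
            depth[position + offset] += 1
    return depth


alignments = map_reads(REFERENCE, READS)
depth = coverage(REFERENCE, alignments)
print(alignments)
print(depth)
```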


Alignment/Mapping – how

• Alignment programs:

– Bowtie

– BWA

– GSNAP

– OLEGO

– STAR

– TopHat

– […]

• Parameters are important:

– Number of mismatches

– Unique alignments


STAR aligner

>STAR --genomeDir /star_indices/mm10 \
    --runThreadN 8 \
    --outFilterMultimapNmax 1 \
    --outSAMtype BAM SortedByCoordinate \
    --sjdbGTFfile mm10.igenome_ucsc.gtf \
    --readFilesCommand zcat \
    --outSAMstrandField intronMotif \
    --readFilesIn S1_R1.fq.gz S1_R2.fq.gz \
    --outFileNamePrefix somePrefix.


Aligned sequences

Aligned sequences are stored in:

– SAM file: Sequence Alignment/Map file

– BAM file: BGZF-compressed binary version of the SAM

– BAI file: index of the BAM

Main features:

– store all the alignment information

– are compact

– can be processed line by line

– can be indexed for fast position access


SAM file

TAB-delimited text format consisting of a header section (optional) and an alignment section

SAM/BAM and related specifications:

http://samtools.github.io/hts-specs/

Header section

@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45

Alignment section

QNAME  FLAG  RNAME  POS  MAPQ  CIGAR       RNEXT  PNEXT  TLEN  SEQ                QUAL
r001   99    ref    7    30    8M2I4M1D3M  =      37     39    TTAGATAAAGGATACTG  *
r002   0     ref    9    30    3S6M1P1I4M  *      0      0     AAAAGATAAGGATA     *
r003   0     ref    9    30    5S6M        *      0      0     GCCTAAGCTAA        *
r004   0     ref    16   30    6M14N5M     *      0      0     ATAGCTTCAGC        *
r003   2064  ref    29   17    6H5M        *      0      0     TAGGC              *
r001   147   ref    37   30    9M          =      7      -39   CAGCGGCAT          *
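The eleven mandatory fields (QNAME through QUAL) are easy to read programmatically. The sketch below splits one alignment line, decodes the FLAG bits and computes how many reference bases a CIGAR string spans. The bit meanings follow the SAM specification; the short flag names and the helpers `parse_sam_line`, `decode_flag` and `reference_span` are invented for this example, and optional tags are ignored.

```python
import re
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# FLAG bits as defined in the SAM specification (names shortened here).
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory TAB-separated fields of an alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar: str) -> int:
    """Reference bases covered: operations M, D, N, = and X consume the reference."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


record = parse_sam_line("r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*")
print(record.qname, decode_flag(record.flag), reference_span(record.cigar))
```

For r004, the `14N` in `6M14N5M` is a skipped region on the reference, which is how a spliced read spanning an intron is typically represented.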


SAM/BAM handling

• Several tools to work with alignment files: samtools, Picard, sambamba, etc.

• Functions:

– view

– merge

– sort (some apps require sorted BAMs)

– subset (e.g. filter chr 5)

– pipe (i.e. stream line by line) into other programs

– etc.


Library complexity

After alignment we can use the Picard tool EstimateLibraryComplexity

Adapted from Levin et al. 2010 Nature


Assigning aligned reads to genes

Gene name        Count of reads
0610005C13Rik    0
0610007N19Rik    28
0610007P14Rik    1157
0610008F07Rik    0
0610009B14Rik    4
0610009B22Rik    544
0610009D07Rik    708
0610009L18Rik    4
0610009O20Rik    169
0610010B08Rik    0
0610010F05Rik    418
0610010K14Rik    248
0610011F06Rik    147


htseq-count

>samtools view S1.sorted.bam | \
    htseq-count --stranded=no \
    - mm10.igenome_ucsc.gtf \
    > S1.sorted.bam.counts
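Conceptually, assigning reads to genes means checking which annotated feature each aligned read overlaps. The sketch below is a heavily simplified illustration and not htseq-count's implementation: each gene is a single hypothetical interval on one chromosome, strand and exon structure are ignored, and a read overlapping more than one gene is counted as ambiguous.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def count_reads(genes: Dict[str, Interval], reads: Iterable[Interval]) -> Counter:
    """Count reads per gene; reads hitting zero or several genes go to special bins."""
    counts = Counter({gene: 0 for gene in genes})
    for read_start, read_end in reads:
        hits = [gene for gene, (start, end) in genes.items()
                if read_start <= end and read_end >= start]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation and read positions, for illustration only.
genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 145), (150, 185), (190, 225), (250, 285), (400, 435)]
counts = count_reads(genes, reads)
for name, value in counts.items():
    print(name, value)
```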


Normalization

• RPKM: reads per kilobase of exon per million reads mapped (Mortazavi et al. 2008)

• FPKM: fragments per kilobase of exon per million fragments mapped (Trapnell et al. 2010)

Single end (SE): FPKM = RPKM. Paired end (PE): FPKM ≠ RPKM

• TPM: transcripts per million, the proportion of transcripts in your pool of RNA (Bo Li et al. 2009)

• TMM: trimmed mean of M-values (Robinson et al. 2010)
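RPKM and TPM can be written out directly from their definitions. In the sketch below the counts are three values from the count table earlier in the deck, while the gene lengths are hypothetical; the total number of mapped reads is approximated by the sum of the counts supplied, which a real analysis would replace with the library-wide total. TMM is not shown, since it needs a reference sample and trimming steps.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase of gene length per million counted reads."""
    total_millions = sum(counts.values()) / 1e6
    return {gene: counts[gene] / ((lengths_bp[gene] / 1e3) * total_millions)
            for gene in counts}


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Length-normalised rates rescaled so that they sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


counts = {"0610007P14Rik": 1157, "0610009B22Rik": 544, "0610009D07Rik": 708}
# Hypothetical gene lengths in base pairs, for illustration only.
lengths_bp = {"0610007P14Rik": 2500, "0610009B22Rik": 1200, "0610009D07Rik": 3100}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
for gene in counts:
    print(gene, round(rpkm_values[gene], 1), round(tpm_values[gene], 1))
```

Because TPM values sum to the same total in every sample, they are often easier to compare across samples than RPKM values.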


Cufflinks

>cuffnorm -o yourOutputDir mm10.gtf \
    Sample1.bam \
    Sample2.bam \
    Sample3.bam


Identification of differentially expressed genes

The specific nature of count data requires appropriate statistical tests

• The underlying count data are not normally distributed (over-dispersion), so a t-test is not appropriate

• Negative binomial based methods:

– edgeR (R/Bioconductor)

– DESeq2 (R/Bioconductor)

– cuffdiff (stand alone)
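Over-dispersion means that the variance of replicate counts exceeds their mean, whereas a Poisson model would predict the two to be equal. Under a negative binomial model the variance is mean + dispersion × mean², so a crude method-of-moments estimate of the dispersion follows directly. The sketch below uses hypothetical replicate counts and is only meant to show the idea; edgeR and DESeq2 use more elaborate estimators that share information across genes.

```python
from statistics import mean, variance
from typing import Sequence


def moment_dispersion(counts: Sequence[int]) -> float:
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical counts of one gene in three replicates.
replicates = [10, 20, 30]
dispersion = moment_dispersion(replicates)
print(mean(replicates), variance(replicates), dispersion)
```

A dispersion of 0 corresponds to Poisson-like counts; larger values mean more variability between replicates than sampling noise alone would explain.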


DESeq2

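library("DESeq2")

# countData: matrix of raw integer counts (genes x samples)
# colData: data frame with one row per sample and a 'condition' column
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)
dds <- DESeq(dds)    # estimate size factors and dispersions, fit the model
res <- results(dds)  # extract the results table
# Assumption: the output below looks like the top rows sorted by adjusted p-value
head(res[order(res$padj), ])

## log2 fold change (MAP): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 6 rows and 6 columns
##              baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##             <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0039155       453          -3.71     0.160     -23.2 4.01e-119 3.09e-115
## FBgn0029167      2165          -2.08     0.104     -20.1  6.68e-90  2.57e-86
## FBgn0035085       367          -2.23     0.137     -16.3  1.89e-59  4.85e-56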


ADDITIONAL ANALYSIS


Splicing analysis

Image: wikipedia


Fusion proteins

Image: wikipedia


Splicing analysis

• Splicing analysis usually requires specific alignments and is divided into:

– annotation dependent

– annotation free (discovery)

• Examples:

– Olego/Quantas (CUMC: Zhang lab)

– MISO

– Scripture

– DEXSeq

– SGSeq


GUI

• Galaxy: https://usegalaxy.org/

• SAMMate: http://sammate.sourceforge.net/

• OneChannelGUI (R/Bioconductor)

• Others…


EOF


Sequence handling

Images from Werner Van Belle

Data refer to a 36 bp read length