gene expression analyses - welcome to sandberg...

Post on 20-Mar-2020


Rickard Sandberg

Gene Expression Analyses

Assistant Professor

Ludwig Institute for Cancer Research

Department of Cell and Molecular Biology, Karolinska Institutet

Outline

- microarrays

- RNA-Seq

- Common gene expression analyses steps

- clustering of samples

- differential expression tests

- enrichment tests

Transcriptome analyses

- rRNAs (dominating, ~95%)

- mRNAs (~5%)

- long non-coding RNAs (e.g. lincRNAs) (~0.05%)

- snoRNAs, snRNAs

- microRNAs, piRNAs

Different protocols identify different parts of the transcriptome

- PolyA selection

- Ribominus (removal of ribosomal RNAs); not-so-random hexamers or DSN

- small RNA protocol

DNA microarrays

- oligonucleotide arrays (Affymetrix, Agilent, Illumina etc.)

- cDNA microarrays (competitive hybridization)

Important Considerations

§ Microarrays were designed based on EST clusters
§ Probes mapping at multiple locations
§ Multiple probe sets mapping to the same gene

§ Many projects curated microarray probes to only allow for uniquely mapping ones, e.g. customCDF

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

Basis of Microarrays

Steps in microarray analyses

§ Start with RAW data (for Affy arrays = CEL files)
§ Normalize
  → remove systematic strength biases
  → often quantile normalization
§ Background adjust/transform
  → tries to estimate signal from background
  → log2 transform (ratios problem, stabilize variance)
§ Gene (or probe set) summarization
  → median polish (fancy average of probes targeting the same gene/transcript/probe set)
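The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular microarray package: the intensity values are hypothetical, background estimation is skipped, and the function names `quantile_normalize` and `median_polish` are chosen here for illustration. In practice the normalization would use all probes on each array, not just one probe set.

```python
import numpy as np

def quantile_normalize(x):
    """Force every array (column) to share one intensity distribution."""
    order = np.argsort(x, axis=0)                 # rank order within each array
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of the sorted columns
    out = np.empty(x.shape, dtype=float)
    for j in range(x.shape[1]):
        out[order[:, j], j] = reference           # smallest value gets reference[0], etc.
    return out

def median_polish(x, n_iter=10):
    """Tukey median polish on a probes x arrays matrix of log2 values.

    Returns one summarized expression value per array.
    """
    resid = np.array(x, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # probe effects
    col = np.zeros(resid.shape[1])   # array effects
    for _ in range(n_iter):
        r = np.median(resid, axis=1)
        resid -= r[:, None]
        row += r
        shift = np.median(col)
        col -= shift
        overall += shift
        c = np.median(resid, axis=0)
        resid -= c[None, :]
        col += c
        shift = np.median(row)
        row -= shift
        overall += shift
    return overall + col

# Hypothetical raw intensities: 4 probes of one probe set (rows) x 3 arrays (columns).
raw = np.array([[100.0, 220.0, 150.0],
                [400.0, 800.0, 610.0],
                [50.0, 120.0, 70.0],
                [900.0, 2000.0, 1300.0]])

normalized = quantile_normalize(raw)   # remove systematic strength biases
log_values = np.log2(normalized)       # log2 transform
summary = median_polish(log_values)    # one value per array for this probe set
print(summary)
```

Because the three hypothetical arrays rank the probes identically, quantile normalization makes the columns equal and the summaries coincide; real data would retain array-specific differences.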

Gene Expression - Microarray data

§ Repositories of raw and processed data:
  → Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
  → ArrayExpress: http://www.ebi.ac.uk/microarray-as/ae/

§ Databases with Gene Expression Atlases
  → Human, Mouse and Rat Tissue Atlas: Symatlas / BioGPS, http://biogps.gnf.org/
  → Cancer gene expression atlas: Oncomine, www.oncomine.org

In what tissues is my gene expressed? Using BioGPS (formerly Symatlas)

http://biogps.gnf.org/

Finding experiments where my gene is differentially expressed

ArrayExpress GEO

§ Do not use updated CDFs (probe to transcript mappings)
§ Constantly evolving (hard to reproduce years later)
§ Offer no quality control
§ Limited capabilities for more comprehensive analyses

What are the methods measuring?

• Expressed Sequence Tags
• Traditional 3’UTR focused microarrays
• Exon and Tiling Arrays
• Deep Sequencing using Illumina/Solexa, SOLiD, (454)

Isolate polyA+ RNA

mRNA-seq protocol

Wang et al. 2009 Nat Rev Gen

§ polyA+ RNAs
§ rRNA- RNAs
§ short RNAs (e.g. miRNAs)
§ Ribosome footprint sequencing
§ GRO-Seq (Global Run On sequencing)
§ CLIP-Seq (RNA-protein interactions)

§ non-RNA applications: ChIP-Seq, DNase hypersensitive sites, ...

Strand-specific RNA-Seq protocols

Mapping of splice junctions

1. compile sets of junctions (Exon n | GTAAGT-----------AG | Exon n+1)

2. map reads towards genome + junction compilation (genome chromosome FASTA files + known and putative splice junctions FASTA file)
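The two steps (compile junction sequences, then map against genome plus junctions) can be sketched as follows. This is a toy illustration under stated assumptions: exact string matching stands in for a real aligner, the genome and exon coordinates are hypothetical, and `compile_junctions` and `map_read` are names invented here.

```python
def compile_junctions(genome, exons, overhang):
    """Build one sequence per exon pair: end of an upstream exon joined
    to the start of a downstream exon.

    genome   -- chromosome sequence as a string
    exons    -- list of (start, end) tuples, 0-based, end-exclusive, sorted
    overhang -- number of bases taken from each side of the junction
    """
    junctions = {}
    for i, (s1, e1) in enumerate(exons):
        for s2, e2 in exons[i + 1:]:
            donor = genome[max(s1, e1 - overhang):e1]
            acceptor = genome[s2:min(e2, s2 + overhang)]
            junctions[(e1, s2)] = donor + acceptor
    return junctions

def map_read(read, genome, junctions):
    """Exact-match mapping: try the genome first, then the junction set."""
    pos = genome.find(read)
    if pos != -1:
        return ("genome", pos)
    for key, seq in junctions.items():
        if read in seq:
            return ("junction", key)
    return ("unmapped", None)

# Hypothetical two-exon gene with a GT...AG intron.
exon1 = "ACGTACGTAC"
intron = "GTAAGT" + "T" * 8 + "AG"
exon2 = "TTGCATGCAA"
genome = exon1 + intron + exon2
exons = [(0, 10), (26, 36)]

junctions = compile_junctions(genome, exons, overhang=8)
exonic = map_read("ACGTACGT", genome, junctions)      # lies within exon 1
spliced = map_read("CGTACTTGCA", genome, junctions)   # spans exon 1 / exon 2
print(exonic, spliced)
```

Limiting the overhang to less than the read length ensures that a read can only match a junction sequence if it actually crosses the junction.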

TopHat (first method): identifying the transcriptome

1. Identify candidate exons (A, B, C) via genomic mapping

2. Generate possible pairings of exons

3. Align “unmappable” reads to possible junctions

Longer reads

GATGTTCTCAGTGTCC GATGTAATCAGTGTCC AACCCTCTCAGTGTCC

>HWI-EAS229_75_30DY0AAXX:7:1:0:949

Very long (100 kb+) intron

By segmenting the long reads and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads.

Running time is independent of intron size.
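A toy sketch of the segmentation idea: split a read into fixed-length segments, map each one independently, and compare where the outer segments land. Everything here is an assumption for illustration (hypothetical sequences, exact matching via `str.find` instead of an indexed aligner, and the helper names `map_segments` and `implied_gap`); it is not TopHat's implementation.

```python
def map_segments(read, genome, seg_len):
    """Split a read into fixed-length segments and map each by exact match.

    Returns (offset_in_read, genome_position) pairs; position is -1 when
    the segment does not map, e.g. because it spans a junction.
    """
    hits = []
    for off in range(0, len(read) - seg_len + 1, seg_len):
        seg = read[off:off + seg_len]
        hits.append((off, genome.find(seg)))
    return hits

def implied_gap(hits):
    """Extra genomic distance between the first and last mapped segments,
    relative to their distance within the read (0 means contiguous)."""
    mapped = [(off, pos) for off, pos in hits if pos != -1]
    if len(mapped) < 2:
        return None
    (off_a, pos_a), (off_b, pos_b) = mapped[0], mapped[-1]
    return (pos_b - pos_a) - (off_b - off_a)

# Hypothetical gene: two exons separated by a 38-nt intron.
exon1 = "ACGGTCAATGCC"
intron = "GTAAGT" + "T" * 30 + "AG"
exon2 = "CTTGAGGCATCA"
genome = exon1 + intron + exon2

read = exon1[-6:] + exon2[:6]          # spliced read crossing the junction
hits = map_segments(read, genome, seg_len=4)
gap = implied_gap(hits)
print(hits, gap)
```

The middle segment fails to map because it straddles the junction, while the flanking segments anchor the read on both sides; the implied gap equals the intron length.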

Mapping to transcriptome

[Figure: gene model on the DNA (genome; strands W and C) with exons, introns, 5’UTR and 3’UTR. Transcription gives the pre-mRNA; RNA processing (splicing, polyadenylation) gives the mRNA with a polyA tail (AAAAA).]

Microexons and junction coverage

[Figure: gene model on the DNA (genome; strands W and C) with exons, introns, 5’UTR and 3’UTR.]

2 or more splice junctions within the same read

[Figure: in-house mapping versus TopHat mapping.]

Different read lengths will have different problems!

Finding novel non-annotated genes or transcript variants

Example of STAR aligned single-cell RNA-Seq data

Mapping speed: 308 M reads / hour
% uniquely mapping: 60
% multimapping: 25
% unmapped: 15

281 719 splice junctions
- 279 356 with GT/AG
- 2 123 with GC/AG
- 215 with AT/AC

[Figure: log10(reads) tracks for testes, liver, skeletal muscle and heart, with transcripts AK074759, BC011574 and AK092689.]

RNA-Seq generates quantitative expression estimates

<10M reads

[Figure: Brain reads / UHR reads (RNA-Seq) versus Brain expression / UHR expression (TaqMan), both axes on a log scale from 10^-4 to 10^4; R = 0.953, slope = 0.933.]

Mortazavi et al. Nat Methods 2008; Ramskold et al. PLoS Comp Biol 2009

[Figure: read density in RPKM (axis 0 to 15) with values 12.3, 0.13 and 0.10 for exon, intron and intergenic regions; 150x.]

Wang*, Sandberg* et al. Nature 2008

How gene expression levels are estimated

gene A (2 kb transcript) gene B (600 bp transcript)

ACGCG... TCGAG... AGGTA... CCGTG... CTGCG...

Sequencing

Fragmentation

The number of fragments is proportional to the abundance and length of the transcript.

Normalize for different transcripts lengths and different sequence depths in different samples.

RPKM (Reads per kilobase and million mappable reads): Given 10 million mappable reads:

RPKM, Gene A: $500 \text{ reads} \times \frac{1000}{2000} \times \frac{10^{6}}{10^{7}} = \frac{500}{2 \times 10} = 25$ RPKM
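The same arithmetic as a small function. The function name `rpkm` is chosen here for illustration; the gene A numbers are those from the slide, while the read count used for gene B is a hypothetical value, since the slide does not give one.

```python
def rpkm(reads, transcript_length_nt, mappable_reads):
    """Reads per kilobase of transcript and million mappable reads."""
    return reads * 1e9 / (transcript_length_nt * mappable_reads)

# Gene A from the slide: 500 reads, 2 kb transcript, 10 million mappable reads.
gene_a = rpkm(500, 2000, 10_000_000)

# Gene B (600 bp transcript) with a hypothetical 500 reads: the same count
# on a shorter transcript gives a higher RPKM.
gene_b = rpkm(500, 600, 10_000_000)

print(gene_a, gene_b)
```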

RPKM roughly corresponds to transcripts per cell (Mortazavi et al. 2008), assuming a standard cell with ~300,000 transcripts

Fragments PKM (FPKM)

Gene quantification and mRNA copy numbers in cells

$$\frac{C}{N} = \frac{X L}{T} \qquad X = \frac{R\,T}{10^{9}}$$

C, number of reads mapping to transcript
N, total number of sequenced reads
X, copies per cell of transcript
T, total length of transcriptome
L, transcript length
R, RPKM (reads per kilobase and million mappable reads)

T can be estimated from
1. starting amount of mRNA
2. spiked-in controls
3. estimated transcriptome length: if 300,000 transcripts of around 1500 nt each -> $4.5 \times 10^{8}$

- 1 RPKM ~ 0.5 transcripts per cell

$$X = \frac{C\,T}{N\,L} = \frac{R\,T}{10^{6} \cdot 10^{3}}$$
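A short numerical check of the relation between RPKM and copies per cell, using the slide's estimate of the transcriptome length (300,000 transcripts of about 1500 nt). The function names are illustrative, and the read counts in the second check are hypothetical.

```python
def copies_from_rpkm(rpkm, transcriptome_length_nt):
    """X = R * T / 10^9."""
    return rpkm * transcriptome_length_nt / 1e9

def copies_from_reads(reads, total_reads, transcript_length_nt, transcriptome_length_nt):
    """X = C * T / (N * L)."""
    return reads * transcriptome_length_nt / (total_reads * transcript_length_nt)

T = 300_000 * 1500                     # estimated transcriptome length, 4.5e8 nt
per_rpkm = copies_from_rpkm(1.0, T)    # copies per cell at 1 RPKM

# Hypothetical transcript: 500 reads out of 10 million, 2000 nt long (25 RPKM).
from_reads = copies_from_reads(500, 10_000_000, 2000, T)
from_rpkm = copies_from_rpkm(25.0, T)

print(per_rpkm, from_reads, from_rpkm)
```

Both routes give the same copy number, which is the point of the two forms of the equation.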

Depth needed for accurate expression level estimation

[Figure, two panels, x axis: million mapped reads (1 to 45), y axis: percentage of genes (0 to 100).
One panel: percentage of genes within ±20% of final expression, by expression bin: 1-9 RPKM (n=4338), 10-29 RPKM (n=3048), 30-99 RPKM (n=2817), 100-999 RPKM (n=1469), 1000-6705 RPKM (n=56).
Other panel: percentage of genes within a given fold-change of final expression: 2-fold, 1.5-fold, 1.2-fold, 1.1-fold, 1.05-fold.]

Mortazavi et al. 2008 Ramskold/Kavak et al. 2011 (bookchapter)

RNA sequencing of blastocyst-derived cell lines

Read counts for selected genes

Gene    ES     TS     XEN    EpiSC
Nanog   6525   20     1      263
Cdx2    124    6256   1      1
Sox17   11     5      9814   99
Sox3    151    1234   6      796
Shh     0      0      0      1
Ihh     4      12     107    17
Dhh     10     212    575    80

Significance of expression level

background RPKM ~ 0.05 RPKM
detection level of 0.3 RPKM
an average 1 500 nt transcript
20 M uniquely mapping reads

background model: 0.05 x 1.5 x 20 = 1.5 reads

expressed at 0.3 RPKM: 0.3 x 1.5 x 20 = 9 reads
binomial test for 9 reads out of 20 M mapping to the transcript, given a background probability of 1.5 / (20 x 10^6), gives a p-value of 2.8e-5

expressed at 1 RPKM: 1 x 1.5 x 20 = 30 reads

0.05 RPKM 1 RPKM
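The binomial calculation for the 0.3 RPKM detection level can be reproduced with the standard library alone. This is a sketch: the helper names are invented here, and the per-read background probability is taken as 1.5 expected background reads out of 20 million.

```python
import math

def expected_reads(rpkm, transcript_kb, million_reads):
    """Expected read count = RPKM x transcript length (kb) x million mapped reads."""
    return rpkm * transcript_kb * million_reads

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); assumes k lies above the mean n * p,
    so the terms shrink and the sum can stop early."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        term = math.exp(log_pmf)
        total += term
        if term < total * 1e-16:   # remaining terms no longer change the sum
            break
    return total

n_reads = 20_000_000
background = expected_reads(0.05, 1.5, 20)        # about 1.5 reads
observed = round(expected_reads(0.3, 1.5, 20))    # 9 reads
p_background = 1.5 / n_reads                      # per-read background probability
p_value = binomial_upper_tail(observed, n_reads, p_background)
print(observed, p_value)
```

The result is close to the 2.8e-5 quoted on the slide.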

Mixed species/strains experiments

§ Mixed species experiments allow mapping of host and pathogen interactions

§ Parasite-host interactions

§ Tumor-stroma interactions

Allele-sensitive RNA-seq using mouse crosses

Fusion events, e.g. translocations in cancer

Ozsolak and Milos, Nature Rev Genet 2011


Early Quality Control

[Figure: fraction of reads (0.0 to 1.0) mapping to the 20% at 5', the middle, and the 20% at 3' of genes, for SMARTer, Optimized, and variants #1 to #4.]

Supplementary Figure 6. Read coverage across genes in single-cell RNA-Seq data. Fraction of reads mapping to the 20% 5’ most, the 20% 3' most, and the 60% in the middle region for all individual single-cell transcriptome data from HEK293T cells. Variant protocols are as the optimized except for differences in volume of TSO used (variant #1 uses 2 µl instead of 1 µl), template switching oligo (variant #2 uses rGrG+N, variant #4 uses rGrGrG) or preamplification enzyme (variant #3 uses Advantage 2).

[Figure, panels A to C, for SMARTer, Optimized, and variants #1 to #4.
A: fraction of mapped reads (0.00 to 0.12) by number of mismatches (1 to 9).
B: read mapping (STAR to hg19), reads (%) that are uniquely mapping, multimapping, or no match.
C: fraction of mapped reads (0.0 to 1.0) in exonic, intronic and intergenic genomic regions.]

Supplementary Figure 2. Mapping statistics for single-cell libraries generated using SMARTer, optimized Smart-Seq and variants of the optimized protocol. (A) The fraction of uniquely aligned reads with 1 to 9 mismatches for each single-cell RNA-Seq library. (B) Percentage of reads that could be aligned uniquely, aligned to multiple genomic coordinates (multimapping) or did not align for all single-cell RNA-Seq libraries. (C) The fraction of uniquely aligned reads that mapped to exonic, intronic or intergenic regions (annotations based on RefSeq gene models). Variant protocols are as the optimized except for differences in volume of TSO used (variant #1 uses 2 µl instead of 1 µl), template switching oligo (variant #2 uses rGrG+N, variant #4 uses rGrGrG) or preamplification enzyme (variant #3 uses Advantage 2).

Biological QC: look at replicates and check that samples group by origin/type

Hierarchical clustering
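A minimal sketch of checking that replicates group by sample type, using average-linkage clustering with distance = 1 - Pearson correlation. The slides do not specify the linkage or distance, so both are assumptions here, and the expression values are hypothetical (only the sample names echo cell lines shown in the slides).

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of samples; merges the closest pair of
    clusters (mean pairwise distance) until n_clusters remain."""
    names = list(profiles)
    dist = {(a, b): 1 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[(a, b)] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical log2 expression of 5 genes in replicate samples.
profiles = {
    "PC3_1": [8.1, 2.0, 5.5, 9.0, 1.2],
    "PC3_2": [7.9, 2.3, 5.2, 9.2, 1.0],
    "T24_1": [2.1, 8.8, 5.0, 1.5, 9.1],
    "T24_2": [2.4, 9.0, 5.3, 1.1, 8.7],
    "LNCaP_1": [5.0, 5.2, 9.5, 4.8, 5.1],
    "LNCaP_2": [5.3, 4.9, 9.8, 5.0, 4.7],
}

clusters = average_linkage(profiles, n_clusters=3)
print(clusters)
```

If a replicate ends up in a cluster with a different sample type, that is a cue to inspect it before any downstream analysis.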

PCA / SVD

[Figure: samples plotted on SVD component 1 versus SVD component 2 (axes from −100 to 150); groups: PC3 (n=4), T24 (n=4), LNCaP (n=4).]

NCI60 cell line expression clustering

[Figure: hierarchical clustering of the NCI60 cell lines with a color scale from -1.00 to 1.00, labelled by tissue of origin: leukaemia, colon, melanoma, CNS, renal, ovarian, breast, prostate, non-small-lung.

Leaf labels in order: U251, SNB-19, SF-295, SNB-75, HS-578T, SF-539, SF-268, BT-549, HOP-62, NCI-H226, A498, RXF-393, 786-0, CAKI-1, UO-31, ACHN, TK-10, MDA-MB-231, HOP-92, SN12C, ADR-RES, OVCAR-8, LOXIMVI, PC-3, OVCAR-3, OVCAR-4, IGROV1, SK-OV-3, OVCAR-5, DU-145, EKVX, A549, NCI-H460, RPMI-8226, K562, K562, K-562, HL-60, MOLT-4, CCRF-CEM, SR, HCT-116, SW-620, HCT-15, KM12, HCC-2998, COLO205, HT-29, MCF7, MCF7, MCF7, T-47D, NCI-H322, NCI-H23, NCI-H522, SK-MEL-5, MDA-MB435, MDA-N, M-14, SK-MEL-28, UACC-257, MALME-3M, UACC-62, SK-MEL-2A.]

- ordering pretty arbitrary

- Careful about high order clustering

Singular Value Decomposition (SVD)

[Figure: a genes x arrays expression matrix (14 arrays: e_0m, e_30m, ..., e_390m, at 30-minute intervals) decomposed into three matrices: genes x eigenarrays, eigenarrays x eigengenes (14 x 14), and eigengenes x arrays.]
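The decomposition can be tried on a small matrix with NumPy. This sketch uses hypothetical expression data built from two underlying time profiles over the same 14 time points (0 to 390 minutes); the naming of eigenarrays and eigengenes follows the figure labels, and centering each gene before the SVD is an assumption of this example.

```python
import numpy as np

minutes = np.arange(14) * 30                      # 0, 30, ..., 390
phase = 2 * np.pi * minutes / 420.0
patterns = np.vstack([np.sin(phase), np.cos(phase)])   # two time profiles
loadings = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 1.0],
                     [-1.0, 1.0], [-2.0, -0.5], [0.5, -1.5]])
expression = loadings @ patterns                  # genes x arrays (hypothetical)

# Center each gene across arrays, then decompose.
centered = expression - expression.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Columns of u are eigenarrays, rows of vt are eigengenes.
explained = s ** 2 / np.sum(s ** 2)               # fraction of variation per component
projection = (s[:, None] * vt)[:2]                # arrays projected on components 1 and 2

print(np.round(explained, 3))
```

Because the toy data were built from two profiles, the first two components capture essentially all of the variation; with real data the `explained` vector shows how many components are worth plotting.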

QC: Similarities between replicates

SVD Analysis of Mouse T-cell Stimulation

[Figure: sample projection on eigengene 1 (52%) versus eigengene 2 (31%) for 0 hr, 6 hr and 48 hr samples, with the eigengene profiles across 0 hr, 6 hr and 48 hr.]

Captures 83% of variation

QC: Outliers

Embryoid bodies

Sonic Hedgehog induced

?

Differential Expression

Either based on reads or RPKM values

Most tools developed for microarrays are based on probe set expression values, whereas RNA-Seq tools aim to use read counts.

Reads
• have more statistical power
• have unresolved biases
• need fewer replicates?

Expression levels, RPKMs
• better understood statistics, but have less power

Statistical models of differential expression


Transcript length effects in differential expression tests

Oshlack and Wakefield Biology Direct 2009

p-values should not be the basis for sorting

non-coding RNAs in prostate cancer: Expression and differential expression

Enrichment analyses

Goals of enrichment analyses

Factors to consider

Gene Sets, e.g. pathways and gene ontology

§ Gene Ontology
§ KEGG
§ BioCarta
§ PANTHER

§ Chromosomal location

§ Genes found differentially expressed in another experiment

Two strategies

List-based enrichment analyses

                  Gene In List   Gene NOT In List
In Category       a              b
NOT In Category   c              d

[Figure: all genes and those in category, compared with the gene set and those in category.]

Assessing significance
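The slides do not state which test is used on the 2 x 2 table. A common choice, assumed here, is a one-sided Fisher's exact (hypergeometric) test for over-representation. The function name `enrichment_p` and the counts below are hypothetical.

```python
from math import comb

def enrichment_p(a, b, c, d):
    """One-sided hypergeometric p-value for over-representation.

    a -- in category and in list        b -- in category, not in list
    c -- not in category, in list       d -- not in category, not in list
    Returns P(at least a list genes fall in the category by chance).
    """
    total = a + b + c + d
    in_category = a + b
    in_list = a + c
    upper = min(in_category, in_list)
    favourable = sum(comb(in_category, k) * comb(total - in_category, in_list - k)
                     for k in range(a, upper + 1))
    return favourable / comb(total, in_list)

# Hypothetical counts: 10 genes in total, 4 in the category, 4 in the list,
# 3 of the list genes in the category.
p_small = enrichment_p(3, 1, 1, 5)

# Hypothetical genome-scale example: 40 of 200 list genes fall in a
# category of 500 genes, against a background of 20,000 genes.
p_large = enrichment_p(40, 460, 160, 19340)

print(p_small, p_large)
```

The total `a + b + c + d` is the background, so changing which genes count as background changes `d` and therefore the p-value.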

DAVID

Query many types of gene sets in one go

Current Background: HOMO SAPIENS
Check Defaults

• Main Accessions (0 selected)
• Other Accessions (0 selected)
• Gene Ontology (3 selected)
• Protein Domains (3 selected)
• Pathways (3 selected)
• General Annotations (0 selected)
• Functional Categories (3 selected)
• Protein Interactions (0 selected)
• Literature (0 selected)
• Disease (1 selected)
• Tissue Expression

Gene set enrichment analyses (GSEA)

Molecular Signature db

Gene Ontology analyses

§ Note: Background matters
  choosing the wrong background set of genes may affect/confound your results

§ Depends upon preselected categories

§ List-dependent methods, e.g. DAVID, http://david.abcc.ncifcrf.gov/

§ List-independent methods, e.g. GSEA, http://www.broad.mit.edu/gsea/
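To illustrate what "list-independent" means, here is a simplified running-sum enrichment score over a full ranked gene list. It is an unweighted, Kolmogorov-Smirnov-like statistic written for this example, not the GSEA software's weighted score, and it omits the permutation step used to assess significance. Gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at gene-set members and down
    otherwise; return the running-sum value of largest magnitude.

    Assumes the gene set contains some, but not all, of the ranked genes.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Genes ranked from most up-regulated to most down-regulated (hypothetical).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]

es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set at the bottom
es_mixed = enrichment_score(ranked, {"g1", "g5", "g10"})   # set spread out

print(es_top, es_bottom, es_mixed)
```

No cutoff is applied to define a list of differentially expressed genes; every gene contributes through its rank.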

Questions?
