base-resolution rna-seq - jeff leek

@simplystats

base-resolution rna-seq

Jeff Leek Johns Hopkins Bloomberg School of Public Health

You are free to:

•  Copy, share, adapt and remix
•  Photograph, film and broadcast
•  Live tweet, blog, post video of

Provided you attribute this work to its author and respect the rights and licenses associated with its components

Adapted from:

today

1.  types of statistical methods

2.  derfinder

3.  Unexpected expression

data generation

Genome

Transcripts

Reads

“simplest” thing – annotate-identify

Genome

exon model

Genome

Count by Exon

Bullard et al. BMC Bioinformatics 2010

union model

Genome

Union of all exons


union-intersection model

Genome

Union/Intersection
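
The gene-model choice changes the count a gene receives from the same reads. A minimal sketch on a toy gene, assuming two hypothetical transcripts and reads reduced to a single mapped position. All names and coordinates are made up, and reading "union/intersection" as "bases that are exonic in every transcript" is an assumption, not necessarily the definition in the cited paper.

```python
# Toy illustration: identical reads, three gene models, three different counts.
# Transcripts, coordinates, and read positions are hypothetical.

def bases(exons):
    """Set of base positions covered by a list of (start, end) exons, end inclusive."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered

# Two hypothetical transcripts of one gene, each a list of exons.
transcripts = {
    "tx1": [(100, 199), (300, 399)],
    "tx2": [(100, 199), (350, 449)],
}

# Each read is reduced to one mapped position (a simplification).
reads = [120, 150, 310, 360, 380, 420, 500]

def count_by_exon(transcripts, reads):
    """Exon model: one count per distinct annotated exon."""
    exons = sorted({exon for tx in transcripts.values() for exon in tx})
    return {exon: sum(exon[0] <= r <= exon[1] for r in reads) for exon in exons}

def count_union(transcripts, reads):
    """Union model: reads on any base that is exonic in at least one transcript."""
    union = set().union(*(bases(tx) for tx in transcripts.values()))
    return sum(r in union for r in reads)

def count_intersection(transcripts, reads):
    """Assumed union-intersection reading: reads on bases exonic in every transcript."""
    shared = set.intersection(*(bases(tx) for tx in transcripts.values()))
    return sum(r in shared for r in reads)

exon_counts = count_by_exon(transcripts, reads)
union_count = count_union(transcripts, reads)
shared_count = count_intersection(transcripts, reads)
```

Here the union model counts 6 reads and the intersection reading counts 4, while the exon model counts the reads at 360 and 380 twice because two annotated exons overlap.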


sources of variation in annotate-identify

1.  annotation
2.  gene models
3.  fragment-level biases
4.  technical variation
5.  biological variability

annotation variation

Frazee et al. Biostatistics under review

gc-variation

Hansen et al. 2011 Biostatistics

biological variation

Choy et al. (2008) vs. Pickrell et al. (2010)

Stranger et al. (2007) vs. Montgomery et al. (2010)

Hansen et al. 2010 Nat. Biotech

some data

http://bowtie-bio.sourceforge.net/recount/

assemble-identify

Genome

Reads

assemble-identify (align)

Genome

assemble-identify (assemble)

Genome

Fragments

Transcripts

Trapnell et al. 2010 Nat. Biotech

assemble-identify (abundance)

Genome

Transcripts

Trapnell et al. 2010 Nat. Biotech

inherent ambiguity (boundaries)

Genome

Fragments

Transcripts

inherent ambiguity (assembly)

Genome

Alternative Assemblies

assembly variation

Frazee et al. in prep

result of assembly variation


result of assembly variation (bio reps)


[Figure: density histograms of Cufflinks p-values (x-axis: p-value, 0.0 to 1.0; y-axis: density). Panels: Cufflinks v1, Cufflinks v2]

Frazee et al. 2012 in prep Frazee et al. in prep

methods

annotate-identify

1.  align
2.  gene-model
3.  abundances
4.  analyze

Pros:
•  analogous to microarray
•  processed data easy to handle

Cons:
•  incorrect/variable annotation
•  gene model choices have a big impact

assemble-identify

1.  align
2.  assemble
3.  abundances
4.  analyze

Pros:
•  alternative transcription
•  (potentially) less annotation dependent

Cons:
•  ambiguity/variation in assembly

differentially expressed region finder

1.  Calculate base pair-resolution coverage
2.  Perform test at each base
3.  Identify regions of differential expression (segment)
4.  Annotate regions (optional)

Pros:
•  processed data still easier to handle
•  less dependent on annotation
•  no assembly variability

Cons:
•  still no transcript-level abundances (but…)

Frazee et al. 2012b in prep
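
Step 1, base pair-resolution coverage, can be sketched with a difference array. This is an illustration only: it assumes each alignment is one ungapped interval (so spliced reads are not handled), and the function name and coordinates are hypothetical rather than taken from derfinder.

```python
from itertools import accumulate

def base_coverage(reads, region_length):
    """Per-base coverage over positions 0..region_length-1.

    reads: iterable of (start, end) alignments, 0-based, end exclusive.
    Assumes ungapped alignments; spliced reads would need one interval per block.
    """
    diff = [0] * (region_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1  # coverage rises where the read begins
            diff[end] -= 1    # and falls just past where it ends
    # Running sum of the differences gives the coverage at each base.
    return list(accumulate(diff[:-1]))

# Three hypothetical reads on a 10-base region.
reads = [(0, 4), (2, 6), (3, 8)]
coverage = base_coverage(reads, 10)
```

Stacking one such vector per sample gives the samples-by-bases matrix that the per-base tests operate on.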

derfinder notes

•  Ignores annotation
•  Coverage data at base resolution, designed for “differential” expression analysis
    –  Lose paired end information
    –  Lose junction information
    –  Lose potential mapping quality information
    –  …
•  Annotate the resulting differentially expressed regions (DERs)

Solution ir

[Figure: values 2 2 3 6 11 12 14 15 15 16 15 17 16 14 9 8 6 5 5 4 3 1 1 plotted along a position axis with ticks at 5, 10, 15, 20]

result

n samples →

3 billion nt

Frazee et al. Biostatistics in review

base-pair model (case/control)

g() = transform (Box-Cox, log(+32), etc.)
Yi,j = coverage on sample i at base j
lj = genomic location j
α() = baseline coverage
β() = change in coverage between groups
γk() = adjustments for confounders
Wik = value of kth confounder on ith sample
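
The model equation itself did not survive on this slide. One form consistent with the definitions listed here is sketched below. It assumes a group indicator X_i (1 for cases, 0 for controls) and an error term ε_{i,j}; neither symbol appears in the surviving text, so treat both as assumptions.

```latex
g(Y_{i,j}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k} \gamma_k(l_j)\, W_{ik} + \varepsilon_{i,j}
```

Under this reading, testing for differential expression at base j amounts to testing whether β(l_j) = 0.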


batch-variation

Blue: 3 sds below the mean Orange: 3 sds above the mean

Human chromosome 16

Horizontal lines delimit process dates

Leek et al. 2010 Nat. Rev. Genet.

finding the statistics for d.e. bases

t ~ π0f0 + π1f1 + π2f2 + π3f3

empirical bayes

estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3

Assumed known – the distribution of zeros

Alternatively – Gottardo and Raftery 2008 JCGS


t ~ π0f0 + π1f1 + π2f2 + π3f3

Estimated null distribution from e.g. Efron 2002


Estimated from 2-groups model, assumed symmetric

t ~ π0f0 + π1f1 + π2f2 + π3f3
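
Once the four densities and their weights are in hand, each base can be assigned posterior probabilities by Bayes' rule. A minimal sketch under loud assumptions: the slides do not say which index is which component, so the ordering, the Gaussian shapes, and every number below are hypothetical stand-ins (a narrow density near zero for the "zeros" component, a standard normal for the null, and a mirrored pair for the two DE components).

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_probs(t, weights, densities):
    """Posterior probability of each mixture component given statistic t."""
    joint = [w * f(t) for w, f in zip(weights, densities)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture; weights play the role of pi0..pi3 and sum to 1.
weights = [0.5, 0.4, 0.05, 0.05]
densities = [
    lambda t: normal_pdf(t, 0.0, 0.1),   # assumed f0: concentrated near zero
    lambda t: normal_pdf(t, 0.0, 1.0),   # assumed f1: null distribution
    lambda t: normal_pdf(t, 4.0, 1.0),   # assumed f2: DE in one direction
    lambda t: normal_pdf(t, -4.0, 1.0),  # assumed f3: mirror image of f2 (symmetry)
]

post_high = posterior_probs(4.0, weights, densities)
post_zero = posterior_probs(0.0, weights, densities)
```

With these made-up settings a statistic of 4.0 is assigned mostly to the assumed f2 component and a statistic of 0.0 mostly to the assumed f0 component.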

hmm

DE DE DE not DE not DE

t1 t2 t3 t4 t5

hidden states

emissions are statistics
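
The picture of hidden DE / not DE states emitting per-base statistics can be decoded with the Viterbi algorithm. A minimal sketch, assuming only two states, Gaussian emissions, and invented start and transition probabilities; none of these values come from the slides.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def viterbi(stats, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a sequence of per-base statistics."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: log_start[s] + log_emit[s](stats[0]) for s in states}
    back = []
    for t in stats[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            source = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = source
            best[s] = prev[source] + log_trans[source][s] + log_emit[s](t)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

states = ["DE", "not DE"]
log_start = {s: math.log(0.5) for s in states}
# Hypothetical sticky transitions: neighbouring bases tend to share a state.
log_trans = {
    "DE": {"DE": math.log(0.9), "not DE": math.log(0.1)},
    "not DE": {"DE": math.log(0.1), "not DE": math.log(0.9)},
}
# Hypothetical emissions: statistics near 4 when DE, near 0 otherwise.
log_emit = {
    "DE": lambda t: log_normal_pdf(t, 4.0, 1.0),
    "not DE": lambda t: log_normal_pdf(t, 0.0, 1.0),
}

stats = [4.2, 3.8, 4.5, 0.3, -0.2]  # t1..t5
path = viterbi(stats, states, log_start, log_trans, log_emit)
```

For these five made-up statistics the decoded path is DE, DE, DE, not DE, not DE, matching the layout on the slide.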


statistic

Observed


monte-carlo p-value

Observed Null

Frazee et al. Biostatistics in review

Langmead et al. in prep

Jaffe et al. Biostatistics 2011
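
Comparing an observed statistic with a set of null statistics gives a Monte Carlo p-value. A minimal sketch, assuming the null statistics come from permuting sample labels and recomputing the same statistic; the add-one correction is one common convention and is an assumption here, not something stated on the slide.

```python
def empirical_p_value(observed, null_stats):
    """Share of null statistics at least as extreme as the observed one.

    Adding 1 to numerator and denominator keeps the estimate above zero,
    which matters when the number of permutations is small.
    """
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)

# Hypothetical null statistics from nine permutations.
null_stats = [0.5, 1.2, 0.8, 2.5, 0.3, 1.9, 0.7, 1.1, 0.4]
p_moderate = empirical_p_value(2.0, null_stats)
p_extreme = empirical_p_value(5.0, null_stats)
```

With N null statistics the smallest attainable p-value is 1/(N + 1), so the number of permutations bounds how significant any result can appear.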

ma-plots


statistical significance

[Figure: p-value histograms (x-axis: p value, 0.0 to 1.0; y-axis: frequency). Panels: DER Finder - sex, DER Finder - males, Cufflinks - sex, Cufflinks - males, EdgeR - sex, EdgeR - males, DESeq - sex, DESeq - males]

Frazee et al. Biostatistics in review

percent “correct hits” by ranking

caveat

Genome


annotation incorrect

[Figure: chrY: 15016699 - 15017219. Tracks: log2(count+32) for female and male samples, t statistic, exons, states; x-axis: genomic position]


annotation missing

[Figure: chrY: 2715932-2716691. Tracks: log2(count+32) for female and male samples, t statistic, exons, states; x-axis: genomic position]


missed by cufflinks


computational goals

•  Aligned reads (say from TopHat) to DERs in < 24 hours, all within R statistical software
    –  Table of DERs and matrix of mean coverage per sample per region for post-hoc analysis
    –  Annotated using data from UCSC and Ensembl: counts of features and annotation lists
    –  Visualized DERs, including annotation to identify novel transcriptional activity
    –  Easy methods for counting exons from coverage objects (~2-4 hours from aligned reads for all samples)

derfinder - fast

1.  Test for differential expression at each base, record statistic (linear modeling)
2.  Identify contiguous/adjacent bases that are differentially expressed above some cutoff (thresholding/“bumphunter”)
3.  Summarize each DER (area)
4.  Perform significance testing on region-level (permutations, empirical p-values)
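
Steps 2 and 3 can be sketched as a single pass over the per-base statistics. This is an illustration of thresholding into contiguous runs, not derfinder's or bumphunter's actual code; the function name, the strict "greater than" rule, and the example values are assumptions.

```python
def find_regions(stats, cutoff):
    """Group contiguous bases whose statistic exceeds cutoff into regions.

    Returns a list of (start, end, area) with 0-based start, exclusive end,
    and area defined as the sum of the statistics inside the region.
    """
    regions = []
    start = None
    for pos, value in enumerate(stats):
        if value > cutoff:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            regions.append((start, pos, sum(stats[start:pos])))
            start = None
    if start is not None:  # region running to the end of the sequence
        regions.append((start, len(stats), sum(stats[start:])))
    return regions

# Hypothetical per-base statistics.
stats = [0.1, 3.0, 4.0, 5.0, 0.2, 0.3, 6.0, 7.0]
regions = find_regions(stats, cutoff=2.0)
```

Running the same function on statistics computed after permuting sample labels would give null areas against which each observed area can be compared in step 4.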

time and memory needed: derSnyder

•  Load & filter data: 10 cores with mclapply, 1 hr 15 min, 177 GB
•  Make models: 20 min, 52 GB
•  Analysis: 10 permutations, 4 cores each chr, total 59 min
    –  chr1: 41 min, 46 GB
•  Merging: 30 min, 22 GB
•  Report: 27 min, 17 GB
•  Total wallclock time: 3 hr 46 min

20 samples

Counts: derSnyder

•  Load & filter data: 10 cores with mclapply, 1 hr 15 min, 177 GB
•  Create count table: 26 min, 24 GB
•  Total wallclock time: 1 hr 41 min

20 samples

lieber brain samples

•  DLPFC paired-end RNA-seq data
•  36 samples across 6 age ranges, n=6/group: Fetal (age < 0); Infant (0-1); Child (1-10); Teen (10-20); Adult (20-50); 50+
•  4 M and 2 F per group; mostly AA, but some Caucasians
•  RINs are evenly distributed across age

lieber brain samples

test for base-level de

thresholding on statistic

F-statistic corresponding to p-value < 1e-8 (F5,30)
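
The degrees of freedom 5 and 30 are what a comparison of six group means across 36 samples produces. A minimal sketch of such an F-statistic at a single base, assuming a plain one-way layout with no adjustment covariates; the coverage values are invented. Converting F to a p-value needs the F distribution, which is left out here because the Python standard library does not provide it.

```python
def one_way_f(groups):
    """F-statistic comparing one mean per group against a single overall mean.

    Returns (F, df_between, df_within).
    """
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical transformed coverage at one base: six groups of six samples,
# with the first group higher than the other five.
offsets = (-1, 0, 1, -1, 0, 1)
groups = [[mean + d for d in offsets] for mean in (20, 10, 10, 10, 10, 10)]
f_stat, df1, df2 = one_way_f(groups)
```

A base is kept for region finding when its F-statistic exceeds the cutoff implied by the chosen p-value threshold.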

derfinder results

•  alt model: age group + median coverage
•  null model: median coverage
•  threshold: p-value < 1e-8
•  5,565 DERs with FWER ~ 0 (conservative)
    –  Median length: 148bp [IQR: 112-235]

annotating

•  Devised “light-weight” R annotation files for UCSC hg19 knownGene and Ensembl GRCh37.p11
•  “Genomic State” objects: each base pair in the genome gets assigned to exactly one “state”, annotations merged across overlapping features
•  Two different configurations:
    –  “Full” (introns, exons, un-annotated/intragenic)
    –  “Coding” (introns, coding exons, UTRs, promoters, un-annotated/intragenic)
•  Very fast, 1000s of regions in seconds
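
Because every base belongs to exactly one state, annotating a region reduces to a sorted lookup. A minimal sketch of that idea using binary search; the table layout, state labels, and coordinates are hypothetical and do not reflect the actual structure of the R "Genomic State" objects.

```python
from bisect import bisect_right

# Hypothetical state table for one chromosome: (segment start, state).
# Each segment runs until the next start, so every base has exactly one state.
state_table = [
    (0, "un-annotated"),
    (1000, "exon"),
    (1200, "intron"),
    (1800, "exon"),
    (2000, "un-annotated"),
]
starts = [start for start, _ in state_table]

def state_at(position):
    """State of one base: the segment with the largest start <= position."""
    return state_table[bisect_right(starts, position) - 1][1]

def annotate_region(start, end):
    """Number of bases of each state overlapped by the region [start, end)."""
    counts = {}
    first = bisect_right(starts, start) - 1
    for i in range(first, len(state_table)):
        seg_start, label = state_table[i]
        seg_end = state_table[i + 1][0] if i + 1 < len(state_table) else float("inf")
        if seg_start >= end:
            break
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            counts[label] = counts.get(label, 0) + overlap
    return counts

region_counts = annotate_region(1100, 1900)
```

Each lookup costs a binary search plus the handful of segments the region spans, which is consistent with annotating thousands of regions in seconds.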

derfinder results

•  2,655 regions (47.7%) show expression of 1+ annotated intron (UCSC: 2,505; 45%)

•  577 regions (10.4%) show expression of an “intragenic” region (UCSC: 800, 14%)

Ensembl UCSC

derfinder results

•  261 regions (4.7%) crossed a known lincRNA
    –  51 overlapping 535 “intragenic” regions (9.6%; e.g. no exons)
•  Only one region crossed known miRNA, but same region had annotated exon on other strand

derfinder results

•  Verifying the 5,565 DERs:
    –  95% of regions had mappability of 100bp reads greater than 99%
    –  Only 16 regions were in tracks excluded by Duke site of Encode (all “BSR/Beta” for satellite repeats) and 0 by Data Analysis Center of Encode
    –  Only 90 regions (1.5%) mapped to known pseudogenes

derfinder results

•  Fetal samples had the highest expression in the majority of the regions (84%; 18 [1.7-Inf] fold increase); second highest was 50+ group (7%; 1.4 [1-4.3] fold increase)

derfinder results

derfinder subgroup

•  Identified DERs within each 6-sample age group based on mean expression
    –  Represents set of expressed sequences for each group at a given coverage threshold
    –  Varied mean coverage cutoff
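
Varying the mean coverage cutoff traces out a curve of how much sequence counts as expressed. A minimal sketch for one group, assuming the group's mean coverage is already available as one value per base; the function name, the strict "greater than" rule, and the numbers are illustrative only.

```python
def percent_expressed(mean_coverage, cutoff):
    """Percent of bases whose mean coverage exceeds the cutoff."""
    expressed = sum(1 for c in mean_coverage if c > cutoff)
    return 100.0 * expressed / len(mean_coverage)

# Hypothetical per-base mean coverage for one age group.
mean_coverage = [0, 0, 3, 8, 12, 15, 9, 2, 0, 0]
curve = {cutoff: percent_expressed(mean_coverage, cutoff) for cutoff in (1, 5, 10)}
```

Computing one such curve per age group allows the groups to be compared at every cutoff.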

% of genome expressed

[Figure: y-axis: Percent of Genome Expressed]

scaled % of genome expressed

Fetal is highest at EVERY cutoff

Teen is lowest thru 114 reads

Infant is lowest after 114 reads

higher cutoffs create longer DERs

% of genome expressed (L ≥ 12)

[Figure: y-axis: Percent of Genome Expressed]

Scaled % of genome expressed (L ≥ 12)

Fetal is still highest at EVERY cutoff

Higher cutoffs still create longer DERs

try that stuff, yo!

https://github.com/lcolladotor/derfinder
https://github.com/lcolladotor/derfinderReport
https://github.com/lcolladotor/derfinderExample

acknowledgements

Leek Group: Alyssa Frazee, Prasad Patil, Leo Collado Torres, Abhi Nellore
University of Maryland: Héctor Corrada Bravo
Harvard: Rafael Irizarry
Lieber Institute: Andrew Jaffe, Danny Weinberger, Thomas Hyde
Hopkins: Kasper Hansen, Roger Peng, Ben Langmead, Sarven Sabunciyan, Luigi Marchionni, Donald Geman
Funding: Amazon Web Services, Digital Science, NIH CCNE, Hopkins inHealth
