base-resolution rna-seq - jeff leek
TRANSCRIPT
@simplystats
base-resolution rna-seq
Jeff Leek Johns Hopkins Bloomberg School of Public Health
@simplystats
You are free to:
Copy, share, adapt and remix
Photograph, film and broadcast
Live tweet, blog, post video of
Provided: you attribute this work to its author and respect the rights and licenses associated with its components
Adapted from:
@simplystats
today
1. types of statistical methods
2. derfinder
3. Unexpected expression
@simplystats
data generation
Genome
@simplystats
data generation
Genome
Transcripts
@simplystats
data generation
Genome
Transcripts
Reads
@simplystats
“simplest” thing – annotate-identify
Genome
@simplystats
exon model
Genome
Count by Exon
Bullard et al. BMC Bioinformatics 2010
@simplystats
union model
Genome
Union of all exons
Bullard et al. BMC Bioinformatics 2010
@simplystats
union-intersection model
Genome
Union/Intersection
Bullard et al. BMC Bioinformatics 2010
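A minimal sketch of how the three gene models above could turn reads into counts. It assumes the union model keeps bases that are exonic in at least one isoform and the union-intersection model keeps bases that are exonic in every isoform; the isoform coordinates, read positions, and helper names are made up for illustration and do not come from Bullard et al.

```python
def covered_bases(intervals):
    """Set of base positions covered by half-open [start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def count_reads(read_starts, region):
    """Number of reads whose start position falls inside the region."""
    return sum(1 for pos in read_starts if pos in region)


# Hypothetical gene with two annotated isoforms (exon coordinates are made up).
isoforms = {
    "isoform_a": [(100, 200), (300, 400)],
    "isoform_b": [(100, 200), (350, 450)],
}
read_starts = [120, 150, 320, 360, 420, 500]  # made-up read start positions

per_isoform = [covered_bases(exons) for exons in isoforms.values()]

# Exon model: one count per distinct annotated exon.
exons = sorted({exon for exon_list in isoforms.values() for exon in exon_list})
exon_counts = {exon: count_reads(read_starts, covered_bases([exon])) for exon in exons}

# Union model: bases that are exonic in at least one isoform.
union_count = count_reads(read_starts, set.union(*per_isoform))

# Union-intersection model (as assumed here): bases exonic in every isoform.
ui_count = count_reads(read_starts, set.intersection(*per_isoform))

print(exon_counts, union_count, ui_count)
```

The same reads give different gene-level counts under each model, which is the gene-model source of variation listed on the next slide.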
@simplystats
sources of variation in annotate-identify
1. annotation
2. gene models
3. fragment-level biases
4. technical variation
5. biological variability
@simplystats
annotation variation
Frazee et al. Biostatistics under review
@simplystats
gc-variation
Hansen et al. 2011 Biostatistics
@simplystats
biological variation
Choy et al. (2008) vs.
Pickrell et al. (2010)
Stranger et al. (2007) vs.
Montgomery et al. (2010)
Hansen et al. 2010 Nat. Biotech
@simplystats
some data
http://bowtie-bio.sourceforge.net/recount/
@simplystats
assemble-identify
Genome
Reads
@simplystats
assemble-identify (align)
Genome
@simplystats
assemble-identify (assemble)
Genome
Fragments
Transcripts
Trapnell et al. 2010 Nat. Biotech
@simplystats
assemble-identify (abundance)
Genome
Transcripts
Trapnell et al. 2010 Nat. Biotech
@simplystats
inherent ambiguity (boundaries)
Genome
Fragments
Transcripts
@simplystats
inherent ambiguity (assembly)
Genome
Alternative Assemblies
@simplystats
assembly variation
Frazee et al. in prep
@simplystats
result of assembly variation
Frazee et al. in prep
@simplystats
result of assembly variation (bio reps)
Frazee et al. in prep
@simplystats
result of assembly variation
Frazee et al. in prep
@simplystats
result of assembly variation
[Figure: histograms of Cufflinks p-values (x-axis: p-value, 0.0 to 1.0; y-axis: density), one panel for Cufflinks v1 and one for Cufflinks v2]
Frazee et al. 2012 in prep
@simplystats
methods
annotate-identify
1. align
2. gene-model
3. abundances
4. analyze
Pros:
• analogous to microarray
• processed data easy to handle
Cons:
• incorrect/variable annotation
• gene model choices have a big impact

assemble-identify
1. align
2. assemble
3. abundances
4. analyze
Pros:
• alternative transcription
• (potentially) less annotation dependent
Cons:
• ambiguity/variation in assembly
@simplystats
differentially expressed region finder
1. Calculate base pair-resolution coverage
2. Perform test at each base
3. Identify regions of differential expression (segment)
4. Annotate regions (optional)
Pros:
• processed data still easier to handle
• less dependent on annotation
• no assembly variability
Cons:
• still no transcript-level abundances (but…)
Frazee et al. 2012b in prep
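As a rough illustration of step 2 (a test at each base), here is a sketch that computes a two-group Welch t-statistic per base on log2(count + 32) coverage. The choice of a Welch t-statistic, the function name, and the toy coverage values are assumptions for illustration; the talk's own model is described on the base-pair model slide.

```python
import math
from statistics import mean, variance


def base_t_statistics(group_a, group_b, offset=32):
    """Welch t-statistic at each base for two groups of coverage vectors.

    group_a and group_b are lists of samples; each sample is a list of
    per-base coverage values of the same length.
    """
    n_bases = len(group_a[0])
    stats = []
    for j in range(n_bases):
        a = [math.log2(sample[j] + offset) for sample in group_a]
        b = [math.log2(sample[j] + offset) for sample in group_b]
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        # With no variation in either group the statistic is undefined;
        # report 0.0 for that base in this sketch.
        stats.append((mean(b) - mean(a)) / se if se > 0 else 0.0)
    return stats


# Toy data: three samples per group, three bases. Only the middle base differs.
group_a = [[0, 0, 10], [0, 2, 12], [0, 1, 11]]
group_b = [[0, 100, 10], [0, 120, 12], [0, 110, 11]]
stats = base_t_statistics(group_a, group_b)
print(stats)
```

Only the middle base gets a large statistic, so it is the only one a later segmentation step would pick up.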
@simplystats
derfinder notes
• Ignores annotation
• Coverage data at base resolution, designed for “differential” expression analysis
  – Lose paired end information
  – Lose junction information
  – Lose potential mapping quality information
  – …
• Annotate the resulting differentially expressed regions (DERs)
@simplystats
Solution ir
[Figure: per-base read coverage; x-axis ticks at 5, 10, 15, 20; coverage by base: 2 2 3 6 11 12 14 15 15 16 15 17 16 14 9 8 6 5 5 4 3 1 1]
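Base-resolution coverage of this kind can be computed from aligned read intervals. The sketch below assumes each read is a half-open (start, end) interval on a region of known length; the function name and the toy reads are hypothetical.

```python
def coverage(reads, length):
    """Per-base coverage for half-open [start, end) alignments on bases 0..length-1."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        start = min(max(start, 0), length)
        end = min(max(end, 0), length)
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the difference array gives the coverage at each base.
    cov, running = [], 0
    for j in range(length):
        running += diff[j]
        cov.append(running)
    return cov


reads = [(0, 4), (2, 6), (3, 8)]  # made-up alignments
cov = coverage(reads, 8)
print(cov)  # [1, 1, 2, 3, 2, 2, 1, 1]
```

The difference-array approach touches each read once rather than each covered base, which matters when the reference is billions of bases long.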
@simplystats
result n samples →
3 billion nt
Frazee et al. Biostatistics in review
@simplystats
base-pair model (case/control)
g() = transform (Box-Cox, log(+32), etc.)
Yi,j = coverage on sample i at base j
lj = genomic location j
α() = baseline coverage
β() = change in coverage between groups
γk() = adjustments for confounders
Wik = value of kth confounder on ith sample
Frazee et al. Biostatistics in review
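The equation itself did not survive in the transcript. One form consistent with the symbol definitions on this slide is sketched below. The group indicator X_i (0 for control, 1 for case), the number of confounders K, and the residual term ε are assumptions added here, since the slide text does not define them.

```latex
g(Y_{i,j}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{i,j}
```

Under this reading, a base j is differentially expressed when β(l_j) is nonzero, so the per-base test is a test of β(l_j) = 0.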
@simplystats
batch-variation
Blue: 3 sds below the mean
Orange: 3 sds above the mean
Human chromosome 16
Horizontal lines delimit process dates
Leek et al. 2010 Nat. Rev. Genet.
@simplystats
finding the statistics for d.e. bases
t ~ π0f0 + π1f1 + π2f2 + π3f3
@simplystats
empirical bayes
@simplystats
estimating parameters
t ~ π0f0 + π1f1 + π2f2 + π3f3
Assumed known – the distribution of zeros
Alternatively – Gottardo and Raftery 2008 JCGS
@simplystats
estimating parameters
t ~ π0f0 + π1f1 + π2f2 + π3f3
Estimated null distribution from e.g. Efron 2002
@simplystats
estimating parameters
Estimated from 2-groups model, assumed symmetric
t ~ π0f0 + π1f1 + π2f2 + π3f3
@simplystats
hmm
DE DE DE not DE not DE
t1 t2 t3 t4 t5
hidden states
emissions are statistics
Frazee et al. Biostatistics in review
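A minimal sketch of the idea on this slide: hidden states per base, per-base statistics as emissions, and Viterbi decoding to recover the most likely state path. It is simplified to two states with Gaussian emissions; the emission means, standard deviations, transition probabilities, and example statistics are all made-up values, not parameters from the talk.

```python
import math

STATES = ["not DE", "DE"]
# Assumed Gaussian emission parameters (mean, sd) for the statistic in each state.
EMISSION = {"not DE": (0.0, 1.0), "DE": (4.0, 1.0)}
LOG_START = {s: math.log(0.5) for s in STATES}
# Assumed transition probabilities: neighbouring bases tend to share a state.
LOG_TRANS = {
    "not DE": {"not DE": math.log(0.9), "DE": math.log(0.1)},
    "DE": {"not DE": math.log(0.1), "DE": math.log(0.9)},
}


def log_emission(state, t):
    """Log density of statistic t under the state's Gaussian emission."""
    mean, sd = EMISSION[state]
    return -0.5 * math.log(2 * math.pi * sd * sd) - (t - mean) ** 2 / (2 * sd * sd)


def viterbi(stats):
    """Most likely hidden-state path for a sequence of per-base statistics."""
    best = {s: LOG_START[s] + log_emission(s, stats[0]) for s in STATES}
    back = []
    for t in stats[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this base.
            source = max(STATES, key=lambda r: prev[r] + LOG_TRANS[r][s])
            best[s] = prev[source] + LOG_TRANS[source][s] + log_emission(s, t)
            pointers[s] = source
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi([5.1, 4.2, 3.8, 0.3, -0.5])
print(path)  # ['DE', 'DE', 'DE', 'not DE', 'not DE']
```

Runs of consecutive bases decoded as "DE" are the candidate differentially expressed regions.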
@simplystats
statistic
Observed
Frazee et al. Biostatistics in review
@simplystats
monte-carlo p-value
Observed Null
Frazee et al. Biostatistics in review
Langmead et al. in prep
Jaffe et al. Biostatistics 2011
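A sketch of a Monte Carlo p-value: each observed region statistic is compared with a pool of null statistics, for example from permuted sample labels. The add-one convention used here, (count + 1) / (N + 1), is a common choice and an assumption, not necessarily what the cited papers use; the numbers are made up.

```python
import bisect


def empirical_p_values(observed, null):
    """Empirical p-value for each observed statistic against null statistics."""
    null_sorted = sorted(null)
    n = len(null_sorted)
    p_values = []
    for stat in observed:
        # Number of null statistics greater than or equal to the observed one.
        n_ge = n - bisect.bisect_left(null_sorted, stat)
        # Add-one convention so that no p-value is exactly zero.
        p_values.append((n_ge + 1) / (n + 1))
    return p_values


observed = [12.0, 3.0]                 # made-up observed region statistics
null = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # made-up null region statistics
p_values = empirical_p_values(observed, null)
print(p_values)  # [0.1, 0.8]
```

The smallest attainable p-value is 1 / (N + 1), so the number of null statistics sets the resolution of the test.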
@simplystats
ma-plots
Frazee et al. Biostatistics in review
@simplystats
statistical significance
[Figure: histograms of p-values (x-axis: p value, 0.0 to 1.0; y-axis: frequency) in eight panels: DER Finder - sex, DER Finder - males, Cufflinks - sex, Cufflinks - males, EdgeR - sex, EdgeR - males, DESeq - sex, DESeq - males]
Frazee et al. Biostatistics in review
@simplystats
percent “correct hits” by ranking
@simplystats
caveat
Genome
Bullard et al. BMC Bioinformatics 2010
@simplystats
caveat
Genome
Bullard et al. BMC Bioinformatics 2010
@simplystats
annotation incorrect
[Figure: chrY: 15016699 - 15017219. Panels: log2(count+32) for female and male samples; t statistic; exons; states. x-axis: genomic position]
Frazee et al. Biostatistics in review
@simplystats
annotation missing
[Figure: chrY: 2715932-2716691. Panels: log2(count+32) for female and male samples; t statistic; exons; states. x-axis: genomic position]
Frazee et al. Biostatistics in review
@simplystats
missed by cufflinks
Frazee et al. Biostatistics in review
@simplystats
computational goals
• Aligned reads (say from TopHat) to DERs in < 24 hours, all within R statistical software
  – Table of DERs and matrix of mean coverage per sample per region for post-hoc analysis
  – Annotated using data from UCSC and Ensembl: counts of features and annotation lists
  – Visualized DERs, including annotation to identify novel transcriptional activity
  – Easy methods for counting exons from coverage objects (~2-4 hours from aligned reads for all samples)
@simplystats
derfinder - fast
1. Test for differential expression at each base, record statistic (linear modeling)
2. Identify contiguous/adjacent bases that are differentially expressed above some cutoff (thresholding/“bumphunter”)
3. Summarize each DER (area)
4. Perform significance testing on region-level (permutations, empirical p-values)
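Steps 2 and 3 can be sketched as a single pass over the per-base statistics: keep runs of adjacent bases above the cutoff and summarize each run. Defining "area" as the sum of the statistics within the run is an assumption made for this sketch, and the function name and numbers are hypothetical.

```python
def find_regions(stats, cutoff):
    """Contiguous runs of bases with statistic above cutoff.

    Returns (start, end, area) tuples with half-open [start, end) coordinates,
    where area is the sum of the statistics inside the run.
    """
    regions = []
    start = None
    for j, value in enumerate(stats):
        if value > cutoff and start is None:
            start = j  # a new run begins at this base
        elif value <= cutoff and start is not None:
            regions.append((start, j, sum(stats[start:j])))
            start = None
    if start is not None:  # a run that extends to the last base
        regions.append((start, len(stats), sum(stats[start:])))
    return regions


stats = [0.5, 6.0, 7.0, 5.5, 1.0, 0.2, 8.0, 9.0, 0.1]  # made-up statistics
regions = find_regions(stats, cutoff=5.0)
print(regions)  # [(1, 4, 18.5), (6, 8, 17.0)]
```

Running the same function on statistics from permuted sample labels would give the null areas needed for step 4.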
@simplystats
time and memory needed: derSnyder
• Load & filter data: 10 cores with mclapply, 1 hr 15 min, 177 GB
• Make models: 20 min, 52 GB
• Analysis: 10 permutations, 4 cores each chr, total 59 min
  – chr1: 41 min, 46 GB
• Merging: 30 min, 22 GB
• Report: 27 min, 17 GB
• Total wallclock time: 3 hr 46 min
20 samples
@simplystats
Counts: derSnyder
• Load & filter data: 10 cores with mclapply, 1 hr 15 min, 177 GB
• Create count table: 26 min, 24 GB
• Total wallclock time: 1 hr 41 min
20 samples
@simplystats
lieber brain samples
• DLPFC paired-end RNA-seq data
• 36 samples across 6 age ranges, n=6/group: Fetal (age < 0); Infant (0-1); Child (1-10); Teen (10-20); Adult (20-50); 50+
• 4 M and 2 F per group; mostly AA, but some Caucasians
• RINs are evenly distributed across age
@simplystats
lieber brain samples
@simplystats
test for base-level de
@simplystats
thresholding on statistic
F-statistic corresponding to p-value < 10^-8 (F5,30)
@simplystats
derfinder results
• alt model: age group + median coverage
• null model: median coverage
• threshold: p-value < 1e-8
• 5,565 DERs with FWER ~ 0 (conservative)
  – Median length: 148bp [IQR: 112-235]
@simplystats
annotating
• Devised “light-weight” R annotation files for UCSC hg19 knownGene and Ensembl GRCh37.p11
• “Genomic State” objects: each base pair in the genome gets assigned to exactly one “state”, annotations merged across overlapping features
• Two different configurations:
  – “Full” (introns, exons, un-annotated/intragenic)
  – “Coding” (introns, coding exons, UTRs, promoters, un-annotated/intragenic)
• Very fast, 1000s of regions in seconds
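A toy sketch of the "genomic state" idea: every base gets exactly one state, and overlapping features are resolved by a fixed precedence. The precedence used here (exon over intron over un-annotated), the state labels, and the feature coordinates are assumptions for illustration, not the rules used to build the actual annotation files.

```python
from itertools import groupby

PRIORITY = ["exon", "intron"]  # assumed precedence: earlier entries win


def genomic_states(length, features):
    """Assign one state per base from (kind, start, end) half-open features."""
    states = ["unannotated"] * length
    # Paint low-priority features first so higher-priority ones overwrite them.
    for kind in reversed(PRIORITY):
        for f_kind, start, end in features:
            if f_kind == kind:
                for j in range(max(start, 0), min(end, length)):
                    states[j] = kind
    return states


def segments(states):
    """Collapse per-base states into (state, start, end) runs."""
    out, pos = [], 0
    for state, run in groupby(states):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out


features = [("intron", 2, 8), ("exon", 0, 3), ("exon", 6, 9)]  # made-up features
state_runs = segments(genomic_states(12, features))
print(state_runs)
# [('exon', 0, 3), ('intron', 3, 6), ('exon', 6, 9), ('unannotated', 9, 12)]
```

Because the runs are non-overlapping and sorted, annotating a region reduces to an interval lookup, which fits the "1000s of regions in seconds" claim.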
@simplystats
derfinder results
• 2,655 regions (47.7%) show expression of 1+ annotated intron (UCSC: 2,505; 45%)
• 577 regions (10.4%) show expression of an “intragenic” region (UCSC: 800, 14%)
Ensembl UCSC
@simplystats
derfinder results
• 261 regions (4.7%) crossed a known lincRNA
  – 51 overlapping 535 “intragenic” regions (9.6%; e.g. no exons)
• Only one region crossed known miRNA, but same region had annotated exon on other strand
@simplystats
derfinder results
• Verifying the 5,565 DERs:
  – 95% of regions had mappability of 100bp reads greater than 99%
  – Only 16 regions were in tracks excluded by Duke site of ENCODE (all “BSR/Beta” for satellite repeats) and 0 by Data Analysis Center of ENCODE
  – Only 90 regions (1.5%) mapped to known pseudogenes
@simplystats
derfinder results
• Fetal samples had the highest expression in the majority of the regions (84%; 18 [1.7-Inf] fold increase); second highest was 50+ group (7%; 1.4 [1-4.3] fold increase)
@simplystats
derfinder results
@simplystats
derfinder subgroup
• Identified DERs within each 6-sample age group based on mean expression
  – Represents set of expressed sequences for each group at a given coverage threshold
  – Varied mean coverage cutoff
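A sketch of the subgroup analysis: average the coverage within a group, then report the percentage of bases whose mean coverage exceeds each cutoff. Treating "expressed" as mean coverage strictly above the cutoff is an assumption, and the function names and two-sample toy data are hypothetical.

```python
def group_mean_coverage(samples):
    """Mean coverage at each base across the samples of one group."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def percent_expressed(mean_coverage, cutoff):
    """Percentage of bases with mean coverage strictly above the cutoff."""
    expressed = sum(1 for c in mean_coverage if c > cutoff)
    return 100.0 * expressed / len(mean_coverage)


# Made-up coverage for two samples of one group over ten bases.
samples = [
    [0, 0, 2, 8, 12, 6, 1, 0, 0, 0],
    [0, 0, 4, 8, 12, 8, 1, 0, 0, 0],
]
mean_cov = group_mean_coverage(samples)
curve = {cutoff: percent_expressed(mean_cov, cutoff) for cutoff in (0, 5, 10)}
print(curve)  # {0: 50.0, 5: 30.0, 10: 10.0}
```

Computing this curve for each age group gives the per-group comparison across cutoffs shown on the following slides.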
@simplystats
% of genome expressed
[Figure; axis label: Percent of Genome Expressed]
@simplystats
scaled % of genome expressed
Fetal is highest at EVERY cutoff
Teen is lowest thru 114 reads
Infant is lowest after 114 reads
@simplystats
higher cutoffs create longer DERs
@simplystats
% of genome expressed (L ≥ 12)
[Figure; axis label: Percent of Genome Expressed]
@simplystats
Scaled % of genome expressed (L ≥ 12)
Fetal is still highest at EVERY cutoff
@simplystats
Higher cutoffs still create longer DERs
@simplystats
try that stuff, yo!
https://github.com/lcolladotor/derfinder
https://github.com/lcolladotor/derfinderReport
https://github.com/lcolladotor/derfinderExample
@simplystats
acknowledgements
Leek Group: Alyssa Frazee, Prasad Patil, Leo Collado Torres, Abhi Nellore
University of Maryland: Héctor Corrada Bravo
Harvard: Rafael Irizarry
Lieber Institute: Andrew Jaffe, Danny Weinberger, Thomas Hyde
Hopkins: Kasper Hansen, Roger Peng, Ben Langmead, Sarven Sabunciyan, Luigi Marchionni, Donald Geman
Funding: Amazon Web Services, Digital Science, NIH CCNE, Hopkins inHealth