algorithms to action: r / bioconductor for analysis and comprehension of high-throughput sequence...
DESCRIPTION
Algorithms form a bridge between mathematical expression and computational evaluation. Statistical algorithms and insights are particularly relevant in high-throughput analysis: the large volume of data masks small sample size; technical artifacts abound; and nuanced questions require innovative rather than cookie-cutter approaches. The R / Bioconductor project represents an effective way to bring useful computation on high-throughput data to a broad scientific research community. These points are explored through Bioconductor packages and work flows, and by articulation of Bioconductor principles associated with reproducible research conducted in an open environment.TRANSCRIPT
Algorithms to Action: R / Bioconductor forAnalysis and Comprehension of High-Throughput
Sequence Data
Martin T. Morgan [email protected]
Fred Hutchinson Cancer Research CenterSeattle, WA, USA
6 December 2013
Transforming Algorithms to Action
Algorithms
I Intellectual engagement during creative exploration ofchallenging puzzles at the frontiers of knowledge
Action
I Satisfaction from practical application and use by others
Overview
1. Introduction: R and Bioconductor
2. What makes for successful computational biology software?
3. Exemplars: algorithms into actions
4. Challenges & opportunities
Introduction: R
I http://r-project.org
I Open-source, statisticalprogramming language;widely used in academia,finance, pharma, . . .
I Core language and basepackages
I Interactive sessions,scripts
I > 5000 contributedpackages
## Two 'vectors'
x <- rnorm(1000)
y <- x + rnorm(1000, sd=.5)
## Integrated container
df <- data.frame(X=x, Y=y)
## Visualize
plot(Y ~ X, df)
## Regression; 'object'
fit <- lm(Y ~ X, df)
## Methods on the object
abline(fit) # regression line
anova(fit) # ANOVA table
Introduction: Bioconductor
Analysis and comprehension of high-throughput genomic data
I http://bioconductor.org
I > 11 years old, 749 packages
Themes
I Rigorous statistics
I Reproducible work flows
I Integrative analysis
Introduction: Bioconductor
I 1341 PubMed full-textcitations in trailing 12months
I 28,000 web visits / month;75,000 unique IP downloads/ year
I Annual conferences; courses;active mailing list; . . .
European Bioconductor Conference, December 9-10,Cambridge, UK
Introduction: What is Bioconductor good for?
I Microarrays: expression, copy number, SNPs, methylation, . . .I Sequencing: RNA-seq, ChIP-seq, called variants, . . .
I Especially after assembly / alignment
I Annotation: genes, pathways, gene models (exons, transcripts,etc.), . . .
I Flow cytometry, proteomics, image analysis, high-throughputscreens, . . .
Introduction: The ShortRead package
## Use the 'ShortRead' package
library(ShortRead)
## Create an object to represent a sample from a file
sampler <- FastqSampler("ERR127302_1.fastq.gz")
## Apply a method to yield a random sample
fq <- yield(sampler)
## Access sequences of sampled reads using `sread()`
## Summarize nucleotide use by cycle
## 'abc' is a nucleotide x cycle matrix of counts
abc <- alphabetByCycle(sread(fq))
## Subset of interesting nucleotides
abc <- abc[c("A", "C", "G", "T", "N"),]
Introduction: The ShortRead package
## Create a plot from a
## matrix
matplot(t(abc), type="l",
lty=1, lwd=3,
xlab="Cycle",
ylab="Count",
cex.lab=2)
## Add a legend
legend("topright",
legend=rownames(abc),
lty=1, lwd=3, col=1:5,
cex=1.8)
0 10 20 30 40 50 60 70
0e+
001e
+05
2e+
053e
+05
4e+
055e
+05
Cycle
Cou
nt
ACGTN
Introduction: Some key points
I R is a high-level programming language, so lots can beaccomplished with just a little code
I Objects such as a data.frame help to manage complicateddata
I Reducing possibility for clerical and other mistakesI Facilitating inter-operability between different parts of an
analysis
I Packages provide a great way to benefit from the expertise ofothers (and to contribute your own expertise back to thecommunity!)
I Scripts make work flows reproducible
I Visualizing data is an important part of exploratory analysis
Successful computational biology software
1. Extensive: software, annotation, integration
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable
Successful computational biology software
1. Extensive:I Software packages for diverse analysis tasksI Annotation (gene and genome) resources to place statistical
analysis in biological contextI Integration across work flows, statistical analysis tasks, . . .
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable
Successful computational biology software
1. Extensive: software, annotation, integration
2. Statistical:I Volume of data requires statistical treatmentI Technology introduces statistical artifacts that need to be
accommodatedI Designed experiments need statistical treatment, often involve
small sample sizes (even though the data is big!), and requireappropriate analysis
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable
Successful computational biology software
1. Extensive: software, annotation, integration
2. Statistical: volume, technology, experimental design
3. Reproducible:I Long-term analysis projects (days, months, years) need to be
performed in a way that is reproducibleI Multi-participant science requires facilities that bridge, e.g.,
statistical analysis and communication with biologistcollaborators.
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable
Successful computational biology software
1. Extensive: software, annotation, integration
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge:I Science research questions changes rapidly, so usable software
needs to keep pace without necessarily being finishedI Software needs to keep pace with rapidly changing technology
5. Accessible: affordable, transparent, usable
Successful computational biology software
1. Extensive: software, annotation, integration
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible:I Research software needs to be affordable, both in terms of
money and timeI Science requires transparent algorithms, so that merits can be
evaluated, problems identified and correctedI Software has to be usable in the sense that it works, and is
documented
Successful computational biology software
1. Extensive: software, annotation, integrationI 750 inter-operable Bioconductor packages
2. Statistical: volume, technology, experimental designI R a ‘natural’ for statistical analysis
3. Reproducible: long-term, multi-participant scienceI Objects, scripts, vignettes, packages, . . . encourage
reproducible research
4. Leading edge: novel, technology-drivenI Packages and user community closely track leading edge
science
5. Accessible: affordable, transparent, usableI Bioconductor is free and open, with extensive documentation
and an active and supportive user community
Exemplars: Algorithms to action
1. Batch effects
2. Methylation
3. RNA-seq Differential Representation
4. Visualization
Exemplar: Batch Effects
Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161
I Scientific finding: pervasivebatch effects
I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects
I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility
Bioconductor support: sva
HapMap samples from one facility,
ordered by date of processing. From
Exemplar: Batch Effects
Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161
I Scientific finding: pervasivebatch effects
I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects
I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility
Bioconductor support: sva
1. Remove signal due tovariable(s) of interest
2. Identify subset of genesdriving orthogonal signaturesof EH
3. Build a surrogate variablebased on full EH signatureof that subset
4. Include significant surrogatevariables as covariates
EH: expression heterogeneity
Exemplar: Methylation
Hansen et al., 2011, Nature Genetics 43, 768-775
I Scientific finding: stochastic methylation variation ofcancer-specific de-methylated regions (DMR), distinguishingcancer from normal tissue, in several cancers.
I Statistical challenges: smoothing, non-specific filtering, tstatistics, find DMRs
Bioconductor support: whole-genome (bsseq) or reducedrepresentation (MethylSeekR) bisulfite sequencing; Illumina 450karrays (minfi)
Exemplar: Differential Representation
Haglund et al., 2012 J Clin Endocrin Metab
I Scientific finding: identifygenes whose expression isregulated by estrogenreceptors in parathyroidadenoma cells
I Statistical challenges:between-samplenormalization; appropriatestatistical model; efficientestimation; . . .
Bioconductor support: DESeq2 , edgeR, many statistical ‘lessonslearned’ from microarrays; extensive integration with down-streamtools
Exemplar: Visualization
Gviz
I Track-like visualizations
I Data panels
I Fully integrated withBioconductor sequencerepresentations
ggbioepivizr
Exemplar: Visualization
Gviz
I Track-like visualizations
I Data panels
I Fully integrated withBioconductor sequencerepresentations
ggbioepivizr
Exemplar: Visualization
Gvizggbio
I Comprehensive visualizations
I autoplot file and data types
I Fully integrated withBioconductor sequencerepresentations
epivizr
Exemplar: Visualization
Gvizggbioepivizr
I Genome browser with socketcommunication to R
I Fully integrated withBioconductor sequencerepresentations
Challenges & Opportunities
I Big data – transparent management within R, facile use ofestablished resources
I Developer and user training
Resources
I http://r-project.org, An Introduction to R manual;Dalgaard, Introductory Statistics with R; R for Dummies
I http://bioconductor.org/
I StackOverflow, Bioconductor mailing list
Acknowledgements
I Bioconductor team: Marc Carlson, Valerie Obenchain, HervePages, Paul Shannon, Dan Tenenbaum
I Technical advisory council: Vincent Carey, Wolfgang Huber,Robert Gentleman, Rafael Irizzary, Sean Davis, Kasper Hansen
I Scientific advisory board: Simon Tavare, Vivian Bonazzi,Vincent Carey, Wolfgang Huber, Robert Gentleman, RafaelIrizzary, Paul Flicek, Simon Urbanek.
I NIH / NHGRI U41HG0004059
I The Bioconductor community!
I . . . and the organizers of BioInfoSummer 2013!