algorithms to action: r / bioconductor for analysis and comprehension of high-throughput sequence...

Algorithms to Action: R / Bioconductor forAnalysis and Comprehension of High-Throughput

Sequence Data

Martin T. Morgan [email protected]

Fred Hutchinson Cancer Research CenterSeattle, WA, USA

6 December 2013

[email protected]

Transforming Algorithms to Action

Algorithms

I Intellectual engagement during creative exploration ofchallenging puzzles at the frontiers of knowledge

Action

I Satisfaction from practical application and use by others

Overview

1. Introduction: R and Bioconductor

2. What makes for successful computational biology software?

3. Exemplars: algorithms into actions

4. Challenges & opportunities

Introduction: R

I http://r-project.org

I Open-source, statisticalprogramming language;widely used in academia,finance, pharma, . . .

I Core language and basepackages

I Interactive sessions,scripts

I > 5000 contributedpackages

## Two 'vectors'

x <- rnorm(1000)

y <- x + rnorm(1000, sd=.5)

## Integrated container

df <- data.frame(X=x, Y=y)

## Visualize

plot(Y ~ X, df)

## Regression; 'object'

fit <- lm(Y ~ X, df)

## Methods on the object

abline(fit) # regression line

anova(fit) # ANOVA table

http://r-project.org

Introduction: Bioconductor

Analysis and comprehension of high-throughput genomic data

I http://bioconductor.org

I > 11 years old, 749 packages

Themes

I Rigorous statistics

I Reproducible work flows

I Integrative analysis

http://bioconductor.org

Introduction: Bioconductor

I 1341 PubMed full-textcitations in trailing 12months

I 28,000 web visits / month;75,000 unique IP downloads/ year

I Annual conferences; courses;active mailing list; . . .

European Bioconductor Conference, December 9-10,Cambridge, UK

Introduction: What is Bioconductor good for?

I Microarrays: expression, copy number, SNPs, methylation, . . .I Sequencing: RNA-seq, ChIP-seq, called variants, . . .

I Especially after assembly / alignment

I Annotation: genes, pathways, gene models (exons, transcripts,etc.), . . .

I Flow cytometry, proteomics, image analysis, high-throughputscreens, . . .

Introduction: The ShortRead package

## Use the 'ShortRead' package

library(ShortRead)

## Create an object to represent a sample from a file

sampler <- FastqSampler("ERR127302_1.fastq.gz")

## Apply a method to yield a random sample

fq <- yield(sampler)

## Access sequences of sampled reads using `sread()`

## Summarize nucleotide use by cycle

## 'abc' is a nucleotide x cycle matrix of counts

abc <- alphabetByCycle(sread(fq))

## Subset of interesting nucleotides

abc <- abc[c("A", "C", "G", "T", "N"),]

http://bioconductor.org/packages/release/bioc/html/ShortRead.html

Introduction: The ShortRead package

## Create a plot from a

## matrix

matplot(t(abc), type="l",

lty=1, lwd=3,

xlab="Cycle",

ylab="Count",

cex.lab=2)

## Add a legend

legend("topright",

legend=rownames(abc),

lty=1, lwd=3, col=1:5,

cex=1.8)

0 10 20 30 40 50 60 70

0e+

001e

+05

2e+

053e

+05

4e+

055e

+05

Cycle

Cou

nt

ACGTN

http://bioconductor.org/packages/release/bioc/html/ShortRead.html

Introduction: Some key points

I R is a high-level programming language, so lots can beaccomplished with just a little code

I Objects such as a data.frame help to manage complicateddata

I Reducing possibility for clerical and other mistakesI Facilitating inter-operability between different parts of an

analysis

I Packages provide a great way to benefit from the expertise ofothers (and to contribute your own expertise back to thecommunity!)

I Scripts make work flows reproducible

I Visualizing data is an important part of exploratory analysis

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical: volume, technology, experimental design

3. Reproducible: long-term, multi-participant science

4. Leading edge: novel, technology-driven

5. Accessible: affordable, transparent, usable


1. Extensive:I Software packages for diverse analysis tasksI Annotation (gene and genome) resources to place statistical

analysis in biological contextI Integration across work flows, statistical analysis tasks, . . .







2. Statistical:I Volume of data requires statistical treatmentI Technology introduces statistical artifacts that need to be

accommodatedI Designed experiments need statistical treatment, often involve

small sample sizes (even though the data is big!), and requireappropriate analysis







3. Reproducible:I Long-term analysis projects (days, months, years) need to be

performed in a way that is reproducibleI Multi-participant science requires facilities that bridge, e.g.,

statistical analysis and communication with biologistcollaborators.







4. Leading edge:I Science research questions changes rapidly, so usable software

needs to keep pace without necessarily being finishedI Software needs to keep pace with rapidly changing technology







5. Accessible:I Research software needs to be affordable, both in terms of

money and timeI Science requires transparent algorithms, so that merits can be

evaluated, problems identified and correctedI Software has to be usable in the sense that it works, and is

documented


1. Extensive: software, annotation, integrationI 750 inter-operable Bioconductor packages

2. Statistical: volume, technology, experimental designI R a ‘natural’ for statistical analysis

3. Reproducible: long-term, multi-participant scienceI Objects, scripts, vignettes, packages, . . . encourage

reproducible research

4. Leading edge: novel, technology-drivenI Packages and user community closely track leading edge

science

5. Accessible: affordable, transparent, usableI Bioconductor is free and open, with extensive documentation

and an active and supportive user community

Exemplars: Algorithms to action

1. Batch effects

2. Methylation

3. RNA-seq Differential Representation

4. Visualization

Exemplar: Batch Effects

Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161

I Scientific finding: pervasivebatch effects

I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects

I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility

Bioconductor support: sva

HapMap samples from one facility,

ordered by date of processing. From

http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html

http://dx.doi.org/10.1371/journal.pgen.0030161

http://bioconductor.org/packages/release/bioc/html/sva.html

Exemplar: Batch Effects

Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161

I Scientific finding: pervasivebatch effects

I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects

I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility

Bioconductor support: sva

1. Remove signal due tovariable(s) of interest

2. Identify subset of genesdriving orthogonal signaturesof EH

3. Build a surrogate variablebased on full EH signatureof that subset

4. Include significant surrogatevariables as covariates

EH: expression heterogeneity

http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html

http://dx.doi.org/10.1371/journal.pgen.0030161

http://bioconductor.org/packages/release/bioc/html/sva.html

Exemplar: Methylation

Hansen et al., 2011, Nature Genetics 43, 768-775

I Scientific finding: stochastic methylation variation ofcancer-specific de-methylated regions (DMR), distinguishingcancer from normal tissue, in several cancers.

I Statistical challenges: smoothing, non-specific filtering, tstatistics, find DMRs

Bioconductor support: whole-genome (bsseq) or reducedrepresentation (MethylSeekR) bisulfite sequencing; Illumina 450karrays (minfi)

http://www.nature.com/ng/journal/v43/n8/full/ng.865.html

http://bioconductor.org/packages/release/bioc/html/bsseq.html

http://bioconductor.org/packages/release/bioc/html/MethylSeekR.html

http://bioconductor.org/packages/release/bioc/html/minfi.html

Exemplar: Differential Representation

Haglund et al., 2012 J Clin Endocrin Metab

I Scientific finding: identifygenes whose expression isregulated by estrogenreceptors in parathyroidadenoma cells

I Statistical challenges:between-samplenormalization; appropriatestatistical model; efficientestimation; . . .

Bioconductor support: DESeq2 , edgeR, many statistical ‘lessonslearned’ from microarrays; extensive integration with down-streamtools

http://www.ncbi.nlm.nih.gov/pubmed/23024189

http://bioconductor.org/packages/release/bioc/html/DESeq2.html

http://bioconductor.org/packages/release/bioc/html/edgeR.html

Exemplar: Visualization

Gviz

I Track-like visualizations

I Data panels

I Fully integrated withBioconductor sequencerepresentations

ggbioepivizr

http://bioconductor.org/packages/release/bioc/html/Gviz.html

http://bioconductor.org/packages/release/bioc/html/ggbio.html

http://bioconductor.org/packages/release/bioc/html/epivizr.html


Gvizggbio

I Comprehensive visualizations

I autoplot file and data types


epivizr





Gvizggbioepivizr

I Genome browser with socketcommunication to R





Challenges & Opportunities

I Big data – transparent management within R, facile use ofestablished resources

I Developer and user training

Resources

I http://r-project.org, An Introduction to R manual;Dalgaard, Introductory Statistics with R; R for Dummies

I http://bioconductor.org/

I StackOverflow, Bioconductor mailing list

http://r-project.org

http://rfordummies.com/

http://bioconductor.org/

http://stackoverflow.com/questions/tagged/r

http://bioconductor.org/help/mailing-list/mailform/

Acknowledgements

I Bioconductor team: Marc Carlson, Valerie Obenchain, HervePages, Paul Shannon, Dan Tenenbaum

I Technical advisory council: Vincent Carey, Wolfgang Huber,Robert Gentleman, Rafael Irizzary, Sean Davis, Kasper Hansen

I Scientific advisory board: Simon Tavare, Vivian Bonazzi,Vincent Carey, Wolfgang Huber, Robert Gentleman, RafaelIrizzary, Paul Flicek, Simon Urbanek.

I NIH / NHGRI U41HG0004059

I The Bioconductor community!

I . . . and the organizers of BioInfoSummer 2013!

algorithms to action: r / bioconductor for analysis and comprehension of high-throughput sequence...

Health & Medicine

r bioconductor

contributed packages

core language

comprehension of high

practical application

vectors x

challenges opportunities