algorithms to action: r / bioconductor for analysis and comprehension of high-throughput sequence...

28
Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data Martin T. Morgan [email protected] Fred Hutchinson Cancer Research Center Seattle, WA, USA 6 December 2013

Upload: australian-bioinformatics-network

Post on 27-Jan-2015

105 views

Category:

Health & Medicine


0 download

DESCRIPTION

Algorithms form a bridge between mathematical expression and computational evaluation. Statistical algorithms and insights are particularly relevant in high-throughput analysis: the large volume of data masks small sample size; technical artifacts abound; and nuanced questions require innovative rather than cookie-cutter approaches. The R / Bioconductor project represents an effective way to bring useful computation on high-throughput data to a broad scientific research community. These points are explored through Bioconductor packages and work flows, and by articulation of Bioconductor principles associated with reproducible research conducted in an open environment.

TRANSCRIPT

Page 1: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Algorithms to Action: R / Bioconductor forAnalysis and Comprehension of High-Throughput

Sequence Data

Martin T. Morgan [email protected]

Fred Hutchinson Cancer Research CenterSeattle, WA, USA

6 December 2013

Page 2: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Transforming Algorithms to Action

Algorithms

I Intellectual engagement during creative exploration ofchallenging puzzles at the frontiers of knowledge

Action

I Satisfaction from practical application and use by others

Page 3: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Overview

1. Introduction: R and Bioconductor

2. What makes for successful computational biology software?

3. Exemplars: algorithms into actions

4. Challenges & opportunities

Page 4: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: R

I http://r-project.org

I Open-source, statisticalprogramming language;widely used in academia,finance, pharma, . . .

I Core language and basepackages

I Interactive sessions,scripts

I > 5000 contributedpackages

## Two 'vectors'

x <- rnorm(1000)

y <- x + rnorm(1000, sd=.5)

## Integrated container

df <- data.frame(X=x, Y=y)

## Visualize

plot(Y ~ X, df)

## Regression; 'object'

fit <- lm(Y ~ X, df)

## Methods on the object

abline(fit) # regression line

anova(fit) # ANOVA table

Page 5: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: Bioconductor

Analysis and comprehension of high-throughput genomic data

I http://bioconductor.org

I > 11 years old, 749 packages

Themes

I Rigorous statistics

I Reproducible work flows

I Integrative analysis

Page 6: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: Bioconductor

I 1341 PubMed full-textcitations in trailing 12months

I 28,000 web visits / month;75,000 unique IP downloads/ year

I Annual conferences; courses;active mailing list; . . .

European Bioconductor Conference, December 9-10,Cambridge, UK

Page 7: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: What is Bioconductor good for?

I Microarrays: expression, copy number, SNPs, methylation, . . .I Sequencing: RNA-seq, ChIP-seq, called variants, . . .

I Especially after assembly / alignment

I Annotation: genes, pathways, gene models (exons, transcripts,etc.), . . .

I Flow cytometry, proteomics, image analysis, high-throughputscreens, . . .

Page 8: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: The ShortRead package

## Use the 'ShortRead' package

library(ShortRead)

## Create an object to represent a sample from a file

sampler <- FastqSampler("ERR127302_1.fastq.gz")

## Apply a method to yield a random sample

fq <- yield(sampler)

## Access sequences of sampled reads using `sread()`

## Summarize nucleotide use by cycle

## 'abc' is a nucleotide x cycle matrix of counts

abc <- alphabetByCycle(sread(fq))

## Subset of interesting nucleotides

abc <- abc[c("A", "C", "G", "T", "N"),]

Page 9: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: The ShortRead package

## Create a plot from a

## matrix

matplot(t(abc), type="l",

lty=1, lwd=3,

xlab="Cycle",

ylab="Count",

cex.lab=2)

## Add a legend

legend("topright",

legend=rownames(abc),

lty=1, lwd=3, col=1:5,

cex=1.8)

0 10 20 30 40 50 60 70

0e+

001e

+05

2e+

053e

+05

4e+

055e

+05

Cycle

Cou

nt

ACGTN

Page 10: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Introduction: Some key points

I R is a high-level programming language, so lots can beaccomplished with just a little code

I Objects such as a data.frame help to manage complicateddata

I Reducing possibility for clerical and other mistakesI Facilitating inter-operability between different parts of an

analysis

I Packages provide a great way to benefit from the expertise ofothers (and to contribute your own expertise back to thecommunity!)

I Scripts make work flows reproducible

I Visualizing data is an important part of exploratory analysis

Page 11: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical: volume, technology, experimental design

3. Reproducible: long-term, multi-participant science

4. Leading edge: novel, technology-driven

5. Accessible: affordable, transparent, usable

Page 12: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive:I Software packages for diverse analysis tasksI Annotation (gene and genome) resources to place statistical

analysis in biological contextI Integration across work flows, statistical analysis tasks, . . .

2. Statistical: volume, technology, experimental design

3. Reproducible: long-term, multi-participant science

4. Leading edge: novel, technology-driven

5. Accessible: affordable, transparent, usable

Page 13: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical:I Volume of data requires statistical treatmentI Technology introduces statistical artifacts that need to be

accommodatedI Designed experiments need statistical treatment, often involve

small sample sizes (even though the data is big!), and requireappropriate analysis

3. Reproducible: long-term, multi-participant science

4. Leading edge: novel, technology-driven

5. Accessible: affordable, transparent, usable

Page 14: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical: volume, technology, experimental design

3. Reproducible:I Long-term analysis projects (days, months, years) need to be

performed in a way that is reproducibleI Multi-participant science requires facilities that bridge, e.g.,

statistical analysis and communication with biologistcollaborators.

4. Leading edge: novel, technology-driven

5. Accessible: affordable, transparent, usable

Page 15: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical: volume, technology, experimental design

3. Reproducible: long-term, multi-participant science

4. Leading edge:I Science research questions changes rapidly, so usable software

needs to keep pace without necessarily being finishedI Software needs to keep pace with rapidly changing technology

5. Accessible: affordable, transparent, usable

Page 16: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integration

2. Statistical: volume, technology, experimental design

3. Reproducible: long-term, multi-participant science

4. Leading edge: novel, technology-driven

5. Accessible:I Research software needs to be affordable, both in terms of

money and timeI Science requires transparent algorithms, so that merits can be

evaluated, problems identified and correctedI Software has to be usable in the sense that it works, and is

documented

Page 17: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Successful computational biology software

1. Extensive: software, annotation, integrationI 750 inter-operable Bioconductor packages

2. Statistical: volume, technology, experimental designI R a ‘natural’ for statistical analysis

3. Reproducible: long-term, multi-participant scienceI Objects, scripts, vignettes, packages, . . . encourage

reproducible research

4. Leading edge: novel, technology-drivenI Packages and user community closely track leading edge

science

5. Accessible: affordable, transparent, usableI Bioconductor is free and open, with extensive documentation

and an active and supportive user community

Page 18: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplars: Algorithms to action

1. Batch effects

2. Methylation

3. RNA-seq Differential Representation

4. Visualization

Page 19: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Batch Effects

Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161

I Scientific finding: pervasivebatch effects

I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects

I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility

Bioconductor support: sva

HapMap samples from one facility,

ordered by date of processing. From

Page 20: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Batch Effects

Leek et al., 2010, Nature Reviews Genetics 11, 733-739, Leek &Story PLoS Genet 3(9): e161

I Scientific finding: pervasivebatch effects

I Statistical insights:surrogate variable analysis:identify and build surrogatevariables; remove knownbatch effects

I Benefits: reducedependence, stabilize errorrate estimates, and improvereproducibility

Bioconductor support: sva

1. Remove signal due tovariable(s) of interest

2. Identify subset of genesdriving orthogonal signaturesof EH

3. Build a surrogate variablebased on full EH signatureof that subset

4. Include significant surrogatevariables as covariates

EH: expression heterogeneity

Page 21: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Methylation

Hansen et al., 2011, Nature Genetics 43, 768-775

I Scientific finding: stochastic methylation variation ofcancer-specific de-methylated regions (DMR), distinguishingcancer from normal tissue, in several cancers.

I Statistical challenges: smoothing, non-specific filtering, tstatistics, find DMRs

Bioconductor support: whole-genome (bsseq) or reducedrepresentation (MethylSeekR) bisulfite sequencing; Illumina 450karrays (minfi)

Page 22: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Differential Representation

Haglund et al., 2012 J Clin Endocrin Metab

I Scientific finding: identifygenes whose expression isregulated by estrogenreceptors in parathyroidadenoma cells

I Statistical challenges:between-samplenormalization; appropriatestatistical model; efficientestimation; . . .

Bioconductor support: DESeq2 , edgeR, many statistical ‘lessonslearned’ from microarrays; extensive integration with down-streamtools

Page 23: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Visualization

Gviz

I Track-like visualizations

I Data panels

I Fully integrated withBioconductor sequencerepresentations

ggbioepivizr

Page 24: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Visualization

Gviz

I Track-like visualizations

I Data panels

I Fully integrated withBioconductor sequencerepresentations

ggbioepivizr

Page 25: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Visualization

Gvizggbio

I Comprehensive visualizations

I autoplot file and data types

I Fully integrated withBioconductor sequencerepresentations

epivizr

Page 26: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Exemplar: Visualization

Gvizggbioepivizr

I Genome browser with socketcommunication to R

I Fully integrated withBioconductor sequencerepresentations

Page 27: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Challenges & Opportunities

I Big data – transparent management within R, facile use ofestablished resources

I Developer and user training

Resources

I http://r-project.org, An Introduction to R manual;Dalgaard, Introductory Statistics with R; R for Dummies

I http://bioconductor.org/

I StackOverflow, Bioconductor mailing list

Page 28: Algorithms to Action: R / Bioconductor for Analysis and Comprehension of High-Throughput Sequence Data  - Martin T. Morgan

Acknowledgements

I Bioconductor team: Marc Carlson, Valerie Obenchain, HervePages, Paul Shannon, Dan Tenenbaum

I Technical advisory council: Vincent Carey, Wolfgang Huber,Robert Gentleman, Rafael Irizzary, Sean Davis, Kasper Hansen

I Scientific advisory board: Simon Tavare, Vivian Bonazzi,Vincent Carey, Wolfgang Huber, Robert Gentleman, RafaelIrizzary, Paul Flicek, Simon Urbanek.

I NIH / NHGRI U41HG0004059

I The Bioconductor community!

I . . . and the organizers of BioInfoSummer 2013!