Developing an open source community for cloud bioinformatics

Brad Chapman

http://bcbio.wordpress.com/

8 June 2010

Overview

1. Building open source bioinformatics communities is hard.

2. Developer resources are a productive target.

3. Framework: collaborative software images and data snapshots.

Motivation

Open source

OpenBio, Biopython

Graduate school: developed a distributed algorithm. Never reused.

Work

Startup: automated biological pipelines.

Research hospital: democratization of analysis.

Filters in biological computing

Working in same biological area

Interest in developing open source code

Technical abilities

Your software is good enough

Successful bioinformatics

Sean Eddy, HMMER

...the best software in the field is often an unplanned labor of love from a single investigator.

http://selab.janelia.org/people/eddys/blog/?p=313


Recognizing contributions

Successful community projects

OpenBio: BioPerl, Biopython, BioJava

Bioconductor

Common theme: aimed at developers.

Biologists benefit indirectly.

Lowering activation energy

Establishing common platform

= The solution to all our problems

Remove install and distribution barriers

Building block for scaling

Existing cloud bioinformatics work

JCVI Cloud BioLinux

bioperl-max

MachetEC2

Debian Med

Overlapping set of useful functionality.

http://www.jcvi.org/cms/research/projects/jcvi-cloud-biolinux/overview/

http://fortinbras.us/bioperl-max/

http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/

http://wiki.debian.org/Cloud

Integrated community solution

Inclusive but configurable

Easy to contribute

Automated

Bootstrap bare machine to fully ready distributed AMI.

http://github.com/chapmanb/bcbb/tree/master/ec2/biolinux/

Inclusive but configurable

# Top level YAML configuration file specifying
# groups of programs to be installed.
packages:
  - python
  - r
  - erlang
  - databases
  - viz
  - bio_search
  - bio_alignment
  - bio_nextgen
  - bio_sequencing
  - bio_visualization
  - phylogeny

libraries:
  - r-libs
  - python-libs
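The top-level file can be read into the two lists that the install step consumes. The sketch below is only an illustration: `parse_flat_yaml_lists` is a hypothetical helper that handles just the flat "key, then dashed items" layout shown above, and it stands in for a real YAML library so that the example runs with the standard library alone. The sample configuration is an abbreviated copy of the one on the slide.

```python
def parse_flat_yaml_lists(text):
    """Parse 'key:' headers followed by '- item' lines into a dict of lists.

    Hypothetical helper: supports only the flat layout used in the
    configuration above, not YAML in general.
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            if current is None:
                raise ValueError("list item before any section: %r" % line)
            sections[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1].strip()
            sections[current] = []
        else:
            raise ValueError("unsupported line: %r" % line)
    return sections


# Abbreviated copy of the top-level configuration.
MAIN_CONFIG = """\
# Groups of programs to be installed.
packages:
  - python
  - r
  - bio_alignment

libraries:
  - r-libs
  - python-libs
"""

config = parse_flat_yaml_lists(MAIN_CONFIG)
pkg_install = config["packages"]
lib_install = config["libraries"]
print(pkg_install)
print(lib_install)
```

Keeping the file this flat is what makes it configurable: dropping a group from the install is a one-line deletion.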

Easy to contribute

# Configuration file defining R specific libraries that
# are installed via CRAN and Bioconductor.
cranrepo: http://software.rc.fas.harvard.edu/mirrors/R/
cran:
  - ggplot2
  - rjson
  - sqldf
  - NMF
  - ape

biocrepo: http://bioconductor.org/biocLite.R
bioc:
  - ShortRead
  - BSgenome
  - edgeR
  - GOstats
  - biomaRt
  - Rsamtools
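One way such a file could drive an installation is by turning it into a short R script. This is a sketch under assumptions, not the project's actual code: `r_vector` and `build_r_install_script` are hypothetical names, the dictionary is an abbreviated, already-parsed copy of the configuration, and the generated script assumes that sourcing the `biocrepo` URL defines a `biocLite` function.

```python
def r_vector(names):
    """Format a list of names as an R character vector, e.g. c("a", "b")."""
    return "c(%s)" % ", ".join('"%s"' % name for name in names)


def build_r_install_script(config):
    """Return R code that installs the CRAN and Bioconductor packages."""
    lines = []
    if config.get("cran"):
        # install.packages takes the package vector and the mirror URL.
        lines.append('install.packages(%s, repos="%s")'
                     % (r_vector(config["cran"]), config["cranrepo"]))
    if config.get("bioc"):
        # Assumption: the sourced script provides biocLite().
        lines.append('source("%s")' % config["biocrepo"])
        lines.append("biocLite(%s)" % r_vector(config["bioc"]))
    return "\n".join(lines)


# Abbreviated, already-parsed version of the R library configuration.
r_config = {
    "cranrepo": "http://software.rc.fas.harvard.edu/mirrors/R/",
    "cran": ["ggplot2", "rjson"],
    "biocrepo": "http://bioconductor.org/biocLite.R",
    "bioc": ["ShortRead", "edgeR"],
}

r_script = build_r_install_script(r_config)
print(r_script)
```

With this shape, a contributor adds a package by appending one line to the configuration; no installer code changes.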

Automated

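from fabric.api import sudo

def install_biolinux():
    """Bootstrap an EC2 Ubuntu machine with the configured software."""
    ec2_ubuntu_environment()
    pkg_install, lib_install = _read_main_config()
    _apt_packages(pkg_install)
    _do_library_installs(lib_install)

def _ruby_library_installer(config):
    """Install each Ruby gem listed in the configuration."""
    for gem in config['gems']:
        sudo("gem install %s" % gem)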

Fabric: http://docs.fabfile.org/
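The slide does not show how `_do_library_installs` maps a library group to its installer, so the following is a guess at one possible shape. The names `make_installers` and `do_library_installs`, the `pypi` key, and the example package names are all hypothetical. The command runner is passed in as a parameter: with Fabric it would be `sudo`, while here a list's `append` method records the commands so the sketch runs without a remote host.

```python
def make_installers(run):
    """Map library group names to installers; `run` executes one command."""
    def ruby_libs(config):
        for gem in config["gems"]:
            run("gem install %s" % gem)

    def python_libs(config):
        # Hypothetical 'pypi' key listing Python packages to install.
        for package in config["pypi"]:
            run("pip install %s" % package)

    return {"ruby-libs": ruby_libs, "python-libs": python_libs}


def do_library_installs(to_install, configs, run):
    """Run the installer for each requested library group, in order."""
    installers = make_installers(run)
    for name in to_install:
        if name not in installers:
            raise KeyError("no installer registered for %r" % name)
        installers[name](configs[name])


# Dry run: collect the commands instead of executing them remotely.
commands = []
configs = {
    "ruby-libs": {"gems": ["bio"]},
    "python-libs": {"pypi": ["numpy", "biopython"]},
}
do_library_installs(["ruby-libs", "python-libs"], configs, commands.append)
print(commands)
```

Injecting the runner keeps each installer a few lines long, and lets a new language group be added by registering one more function.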


Ready to use biological data

% ls /referenceGenomes/
Athaliana  Celegans  Dmelanogaster  Ecoli  Hsapiens  Mmusculus  Msmegmatis
Mtuberculosis_H37Rv  Paeruginosa_UCBPP-PA14  phiX174  Rnorvegicus
Scerevisiae  Xtropicalis

% ls Hsapiens/hg18
arachne  bowtie  bwa  eland  maq  seq  snps  ucsc

http://github.com/chapmanb/bcbb/blob/master/galaxy/galaxy_fabfile.py
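The listing suggests a layout of organism, then genome build, then one directory per tool. Assuming that layout holds, a pipeline could locate a prepared index with a small helper like the one below. `index_path` and `available_indexes` are hypothetical names, and the demonstration builds a throwaway directory tree rather than touching a real data volume.

```python
import os
import tempfile


def index_path(base_dir, organism, build, tool):
    """Return the directory expected to hold a tool's index for a build."""
    return os.path.join(base_dir, organism, build, tool)


def available_indexes(base_dir, organism, build):
    """List the tool directories present for an organism and build."""
    build_dir = os.path.join(base_dir, organism, build)
    if not os.path.isdir(build_dir):
        return []
    return sorted(name for name in os.listdir(build_dir)
                  if os.path.isdir(os.path.join(build_dir, name)))


# Demonstration on a temporary tree mimicking the assumed layout.
demo_base = tempfile.mkdtemp()
for tool in ("bowtie", "bwa"):
    os.makedirs(index_path(demo_base, "Hsapiens", "hg18", tool))

found = available_indexes(demo_base, "Hsapiens", "hg18")
missing = available_indexes(demo_base, "Mmusculus", "mm9")
print(found)
print(missing)
```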


Organization: Codefest 2010

www.open-bio.org/wiki/Codefest_2010

