australian bioinformatics conference (abic) 2014 talk - doing bioinformatics better by mitchell jon...

Doing bioinformatics better Mitchell Stanton-Cook Beatson Microbial Genomics Group @mscook #ABiC14

Post on 14-Jun-2015

DESCRIPTION

How to write better bioinformatics software. Discussion on: 1) versioning software, 2) pinning specific versions of required software, 3) working in a fixed environment, 4) using a revision control system, and 5) ways to ship large pipelines with lots of dependencies.

TRANSCRIPT

Page 1: Australian Bioinformatics Conference (ABiC) 2014 Talk - Doing bioinformatics better by Mitchell Jon Stanton-Cook of The University of Queensland

Doing bioinformatics better

Mitchell Stanton-Cook

Beatson Microbial Genomics Group

@mscook #ABiC14

About Me

• HR = Systems Administrator/Software Engineer

+15 +10

2003-2006, 2011-

Bio|Dev|Op

The Beatson Group

• Microbial Genomics – no wet lab!
• We analyse 10s, 100s and 1000s of isolate genomes
• Bacterial evolution
• Bacterial pathogenesis
• Genomic epidemiology
• Software development for Next-Gen Sequencing data

Mitchell Sullivan

Nabil Alikhan

Marisa Emerson

Were bioinformaticians early DevOps?

• The term DevOps first appeared about 5 years ago

Dev + Ops

Dev = Builds stuff

Ops = Gets & keeps stuff running

Outline

• Will focus on 5 ‘ings:
  – Versioning,
  – Pinning,
  – Fixing,
  – Revisioning &
  – Virtualising

• Encourage & Empower

• Assuming:
  – The majority here are writing, or have written, their own software/algorithms/analysis scripts/pipelines

• Nothing about making data reusable/reproducible

The observed version distribution in bioinformatics software

0.99

• Version 0.99 is when you write the paper?
• Version 0.99 is in case it does not work as expected?

Follow a formal versioning convention

• Use semantic versioning (http://semver.org)
  – Version format: X.Y.Z = MAJOR.MINOR.PATCH
  – Start at 0.1.0
    • +1 MAJOR = incompatible changes,
    • +1 MINOR = new functionality in a backwards-compatible manner,
    • +1 PATCH = backwards-compatible bug fixes

• Tools
  – https://github.com/peritus/bumpversion
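The MAJOR/MINOR/PATCH rules can be sketched in a few lines of Python. This is only an illustration of the convention, not how bumpversion works internally: the names `Version`, `parse` and `bump` are hypothetical, and pre-release or build suffixes allowed by semver.org are deliberately not handled.

```python
import re
from typing import NamedTuple

# Only the core MAJOR.MINOR.PATCH form is accepted in this sketch.
SEMVER_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse(text: str) -> Version:
    """Parse 'X.Y.Z' into a Version, rejecting anything else (e.g. '0.99')."""
    match = SEMVER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version: Version, part: str) -> Version:
    """Increment one part and reset the parts to its right to zero."""
    if part == "major":  # incompatible changes
        return Version(version.major + 1, 0, 0)
    if part == "minor":  # new backwards-compatible functionality
        return Version(version.major, version.minor + 1, 0)
    if part == "patch":  # backwards-compatible bug fixes
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown part: {part!r}")
```

For example, `bump(parse("0.1.0"), "patch")` gives 0.1.1, while `parse("0.99")` is rejected because it has only two components.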

Semantic versioning allows others to make quick and informed decisions

Implications of change are clearly identified

Pinning of dependencies provides predictability

• You want the end user running the same versions of 3rd party software/libraries that you built with!

$ cat Readme.txt
<SNIP>

Installation
------------

My awesome bioinformatics package requires that this software is installed:
 * numpy
 * scipy
 * matplotlib
 * biopython
 * ghalton
...

Pinning of dependencies provides predictability

• Examples of a Python requirements.txt file:

$ cat requirements.txt
numpy==1.8.1
scipy==0.14.0
matplotlib==0.99.1
biopython==1.64
ghalton==0.6

$ cat requirements.txt
numpy
scipy
matplotlib
biopython
ghalton

pip install -r requirements.txt
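A quick way to tell the two kinds of requirements file apart is to check that every entry uses `==`. The sketch below assumes the simple `name==version` lines shown on the slides; the function name `unpinned_requirements` is hypothetical, and pip options, URLs and environment markers are not handled.

```python
import re
from typing import List

# Matches simple 'name==version' entries only (an assumption of this sketch).
PINNED_RE = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9._]+$")


def unpinned_requirements(text: str) -> List[str]:
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if not PINNED_RE.match(line):
            unpinned.append(line)
    return unpinned
```

Run against the pinned file, the function returns an empty list; run against the unpinned file, it returns all five package names.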

Pinning of dependencies provides predictability

$ cat requirements.txt
numpy==1.8.1
scipy==0.14.0
matplotlib==0.99.1
biopython==1.64
ghalton==0.6

$ cat requirements.txt
numpy==1.9.0
scipy==0.14.0
matplotlib==1.4.0
Biopython==1.64
Ghalton==0.6
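To see exactly what differs between two such listings, the pins can be compared programmatically. This is a rough sketch under the assumption that every line has the form `name==version`; `parse_pins` and `changed_pins` are hypothetical helper names. Package names are lower-cased because pip treats them case-insensitively.

```python
from typing import Dict, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Map lower-cased package name to its pinned version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old: str, new: str) -> Dict[str, Tuple[str, str]]:
    """Return {package: (old_version, new_version)} for pins that differ."""
    old_pins, new_pins = parse_pins(old), parse_pins(new)
    shared = sorted(old_pins.keys() & new_pins.keys())
    return {
        name: (old_pins[name], new_pins[name])
        for name in shared
        if old_pins[name] != new_pins[name]
    }
```

Applied to the two listings from the slide, only numpy (1.8.1 vs 1.9.0) and matplotlib (0.99.1 vs 1.4.0) are reported as different.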

Pinning of dependencies provides predictability

• You want your software, and other people's, to be predictable and deterministic

“Fixing your environment”

Death (slow or sudden) by:

$ sudo pip install mypackage

“Fixing your environment” to manage 3rd party libraries

• System-wide Python install in: /usr/bin/python
• Also Python 2.x and 3.x

virtualenv

Project1: ~/.venvs/Project1/bin/python
Project2: ~/.venvs/Project2/bin/python
Project3: ~/.venvs/Project3/bin/python
ProjectN: ~/.venvs/ProjectN/bin/python

Each environment has its own pinned libraries, for example:

numpy==1.8.1
scipy==0.14.0
matplotlib==1.3.1
biopython==1.64
ghalton==0.6

biopython==1.54

khmer==1.0
biopython==1.64
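The one-environment-per-project layout can be sketched with Python's standard-library `venv` module. Note the substitution: the slide uses the third-party virtualenv tool, and `venv` is used here only as a stand-in that needs no extra install. The helper name `create_project_env` is hypothetical, and pip is left out of the new environment to keep the example fast.

```python
import sys
import venv
from pathlib import Path


def create_project_env(project: str, base_dir: Path) -> Path:
    """Create an isolated environment at base_dir/project.

    Returns the path of the environment's own Python interpreter.
    """
    env_dir = base_dir / project
    # with_pip=False keeps the sketch quick; a real setup would want pip.
    venv.create(env_dir, with_pip=False)
    on_windows = sys.platform == "win32"
    bin_dir = "Scripts" if on_windows else "bin"
    exe_name = "python.exe" if on_windows else "python"
    return env_dir / bin_dir / exe_name
```

Calling `create_project_env("Project1", Path.home() / ".venvs")` would mirror the `~/.venvs/Project1/bin/python` layout on the slide; each project's pinned requirements are then installed into its own environment rather than system-wide.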

Use a revisioning system

• Revision control* is a system that records changes to a file or set of files over time so that you can recall specific versions later.

*Revision control is also known as version control. I’ll stick with revision control to avoid confusion with semantic versioning of your software/libraries/analysis scripts/pipelines

Why use revision control

Have you ever…

Nope Nope Nope

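Choose a revision control tool and learn it (and the tools that enhance it)

• Initially:
  – git init, git add, git commit, git push, git pull, git tag

Working directory → Staging area → Repository

• git add: working directory to staging area
• git commit: staging area to repository
• git push / git pull: exchange commits with a remote repository
• git tag v0.3.5: label a specific revision as a release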
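The init, add, commit and tag steps can be scripted. The sketch below drives the git command line from Python and assumes git is installed and on the PATH; `git` and `record_release` are hypothetical helper names, and the author name and email passed with `-c` are placeholders so the example does not depend on any global git configuration.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its stdout."""
    cmd = [
        "git",
        "-c", "user.name=Example Author",        # placeholder identity
        "-c", "user.email=author@example.org",   # placeholder identity
        *args,
    ]
    result = subprocess.run(
        cmd, cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def record_release(repo: Path, filename: str, content: str, tag: str) -> str:
    """Write a file, then init/add/commit/tag; return the list of tags."""
    repo.mkdir(parents=True, exist_ok=True)
    git(repo, "init")
    (repo / filename).write_text(content)
    git(repo, "add", filename)                   # working directory -> staging area
    git(repo, "commit", "-m", f"Add {filename}")  # staging area -> repository
    git(repo, "tag", tag)                        # label this revision
    return git(repo, "tag", "--list")
```

Using a semantic version as the tag name (for example v0.3.5) ties a released version number to one exact revision of the code.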

Use GitHub (or BitBucket)

https://education.github.com

https://education.github.com/pack

Research. Shared.

https://guides.github.com/activities/citable-code/

http://www.zenodo.org

Page 24: Australian Bioinformatics Conference (ABiC) 2014 Talk - Doing bioinformatics better by Mitchell Jon Stanton-Cook of The University of Queensland

numpy==1.8.1scipy==0.14.0matplotlib==1.3.1biopython==1.64ghalton==0.6

~/.venvs/Project2/bin/python

Project2

v0.3.9

Operating system

Software orLibrary or

Analysis ScriptPipeline

numpy==1.8.1scipy==0.14.0matplotlib==1.3.1biopython==1.64ghalton==0.6

~/.venvs/Project2/bin/python

Project2

v0.3.8

Type and version of OS?Version of gcc?Version of libpng?Version of Python/Perl/Ruby

Page 25: Australian Bioinformatics Conference (ABiC) 2014 Talk - Doing bioinformatics better by Mitchell Jon Stanton-Cook of The University of Queensland

Virtualising

Software, Library, Analysis Script or Pipeline:

• Project2 v0.3.8
• Project2 v0.3.9

Each version runs in ~/.venvs/Project2/bin/python with:

numpy==1.8.1
scipy==0.14.0
matplotlib==1.3.1
biopython==1.64
ghalton==0.6

Operating system:

• Type and version of OS?
• Version of gcc?
• Version of libpng?
• Version of Python/Perl/Ruby?
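Some of these questions can at least be answered and recorded alongside an analysis. The following is a minimal sketch using Python's standard `platform` module; the function name `describe_environment` and the choice of fields are assumptions, and it does not capture system libraries such as libpng, which is part of the argument for virtualising the whole stack.

```python
import platform
import sys
from typing import Dict


def describe_environment() -> Dict[str, str]:
    """Collect basic facts about the OS and the running Python interpreter."""
    return {
        "os": platform.system(),             # e.g. Linux, Darwin, Windows
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        # Compiler used to build this Python, which is not necessarily
        # the compiler currently installed on the system.
        "python_compiler": platform.python_compiler(),
        "python_executable": sys.executable,
    }
```

Writing this dictionary out next to the results (for example as JSON) gives a lightweight record of where an analysis was run.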

Virtualisation technologies

• Use virtual machines (VMs) to pin specific OS/software versions
• Distribute the VMs

Figure labels: Traditional, Virtualised; git server, monitoring server, managing server

Vagrant is useful for making environments

• Vagrant (http://www.vagrantup.com)
  – Easy to configure, reproducible and portable work environments
  – Benefits:
    • Vagrant (+ Ansible) will automatically set up everything required for your software/libraries/analysis/pipelines

+ Vagrantfile + virtualisation software + base image (+ Ansible)

Virtualisation – forget about VMs and move to containers?

• Containers
  – Lightweight
    • Share resources
  – Versionable/diffable
  – Easily distributable

Conclusions

• A practicing bioinformatician has roles not too dissimilar to those of a DevOp
  – Multi-disciplinary
    • Often the ones who implement, deploy and maintain software/libraries/analysis scripts/pipelines
  – Need to understand and use tools from the Dev community
    • SemVer, dependency pinning, “fixed environment” development, git, GitHub
  – Need to understand and use tools from the Ops community as good ways to distribute tools/pipelines in a controlled manner
    • Virtualisation (Vagrant and Docker)
    • IT automation/orchestration (Ansible)

Acknowledgements

Dr Nouri Ben Zakour

Dr Scott Beatson

http://beatsonlab.com
