australian bioinformatics conference (abic) 2014 talk - doing bioinformatics better by mitchell jon...
DESCRIPTION
How to write better bioinformatics software. Discussion on: 1) versioning software 2) pinning specific versions of required software 3) working in a fixed environment 4) using a revision control system 5) ways to ship large pipelines with lots of dependencies
TRANSCRIPT
Doing bioinformatics better
Mitchell Stanton-Cook
Beatson Microbial Genomics Group
@mscook #ABiC14
About Me
• HR = Systems Administrator/Software Engineer
+15 +10
2003-2006, 2011-
Bio|Dev|Op
The Beatson Group
• Microbial Genomics – no wet lab!
• We analyse 10s, 100s and 1000s of isolate genomes
• Bacterial evolution
• Bacterial pathogenesis
• Genomic epidemiology
• Software development for Next-Gen Sequencing data
Mitchell Sullivan
Nabil Alikhan
Marisa Emerson
• Term DevOps first appeared about 5 years ago
Were bioinformaticians early DevOps?
Dev+Ops
Dev = Builds stuff
Ops = Gets & keeps stuff running
• Will focus on 5 ‘ings:
  – Versioning,
  – Pinning,
  – Fixing,
  – Revisioning &
  – Virtualising
• Encourage & Empower
• Assuming:
  – The majority here are writing, or have written, their own software/algorithms/analysis scripts/pipelines
• Nothing about making data reusable/reproducible
Outline
• Version 0.99 is when you write the paper?
• Version 0.99 is in case it does not work as expected?
0.99
The observed version distribution in bioinformatics software
• Use semantic versioning (http://semver.org)
  – start at 0.1.0
    • +1 MAJOR = incompatible changes,
    • +1 MINOR = new functionality in a backwards-compatible manner,
    • +1 PATCH = backwards-compatible bug fixes
• Tools
  – https://github.com/peritus/bumpversion
Follow a formal versioning convention
X.Y.Z = major.minor.patch
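The bump rules above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `bump` is hypothetical, it is not the API of bumpversion, and it ignores the pre-release and build metadata parts of the SemVer specification.

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release/build suffixes are out of scope here.
SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump(version, part):
    """Return `version` with `part` ('major', 'minor' or 'patch') incremented.

    Hypothetical helper for illustration; lower-order parts reset to 0.
    """
    match = SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(n) for n in match.groups())
    if part == "major":   # incompatible changes
        return f"{major + 1}.0.0"
    if part == "minor":   # new, backwards-compatible functionality
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backwards-compatible bug fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    print(bump("0.1.0", "patch"))
    print(bump("0.1.1", "minor"))
```

Rejecting anything that is not three dot-separated integers means a two-part version such as 0.99 fails loudly instead of being silently accepted.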
Semantic versioning allows others to make quick and informed decisions
Implications of change are clearly identified
• Want the end user running the same versions of 3rd party software/libraries that you built with!
Pinning of dependencies provides predictability
$ cat Readme.txt
<SNIP>
Installation
------------
My awesome bioinformatics package requires that this software is installed:
 * numpy
 * scipy
 * matplotlib
 * biopython
 * ghalton
...
• Examples of Python requirements.txt files:
$ cat requirements.txt
numpy==1.8.1
scipy==0.14.0
matplotlib==0.99.1
biopython==1.64
ghalton==0.6
$ cat requirements.txt
numpy
scipy
matplotlib
biopython
ghalton
Pinning of dependencies provides predictability
pip install -r requirements.txt
$ cat requirements.txt
numpy==1.8.1
scipy==0.14.0
matplotlib==0.99.1
biopython==1.64
ghalton==0.6
$ cat requirements.txt
numpy==1.9.0
scipy==0.14.0
matplotlib==1.4.0
Biopython==1.64
Ghalton==0.6
Pinning of dependencies provides predictability
• You want your own and others’ software to be predictable and deterministic
Pinning of dependencies provides predictability
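To make the idea concrete, here is a small Python sketch that compares the pins in a requirements.txt against what is actually installed. The function names are hypothetical, it only understands plain `name==version` lines, and it does not normalise hyphens versus underscores in package names. `installed_versions` relies on the standard library's `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            raise ValueError(f"unpinned requirement: {line!r}")
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_versions():
    """Map lower-cased distribution name -> installed version."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            versions[name.lower()] = dist.version
    return versions


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    pins = parse_pins("numpy==1.8.1\nscipy==0.14.0\n")
    print(find_mismatches(pins, installed_versions()))
```

An unpinned line raises an error on purpose: a bare `numpy` is exactly the case where two users can end up running different code.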
Death (slow or sudden) by:
“Fixing your environment”
$ sudo pip install mypackage
• System-wide Python install in: /usr/bin/python
• Also Python 2.x and 3.x
“Fixing your environment” to manage 3rd party libraries
>virtualenv
Project1: ~/.venvs/Project1/bin/python
Project2: ~/.venvs/Project2/bin/python
Project3: ~/.venvs/Project3/bin/python
ProjectN: ~/.venvs/ProjectN/bin/python
Each environment carries its own pinned libraries, for example:
numpy==1.8.1, scipy==0.14.0, matplotlib==1.3.1, biopython==1.64, ghalton==0.6
biopython==1.54
khmer==1.0, biopython==1.64
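The one-environment-per-project layout can be sketched with Python's standard-library `venv` module. Note the swap: the slide names the third-party virtualenv tool, while this sketch uses `venv`, which plays a similar role on Python 3. The function name and the temporary directory standing in for ~/.venvs are assumptions made for illustration.

```python
import os
import tempfile
import venv


def create_project_env(root, project):
    """Create <root>/<project> as an isolated environment.

    Returns the path to that environment's own Python interpreter.
    """
    env_dir = os.path.join(root, project)
    # with_pip=False keeps the sketch offline-friendly;
    # pass with_pip=True to bootstrap pip into the environment as well.
    venv.EnvBuilder(with_pip=False).create(env_dir)
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


if __name__ == "__main__":
    # A temporary directory stands in for ~/.venvs in this demo.
    with tempfile.TemporaryDirectory() as root:
        for project in ("Project1", "Project2"):
            print(create_project_env(root, project))
```

Each project then installs its pinned requirements with its own interpreter, so Project1 and Project2 can hold different biopython versions without touching the system-wide Python.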
Use a revisioning system
• Revision control* is a system that records changes to a file or set of files over time so that you can recall specific versions later.
*Revision control is also known as version control. I’ll stick with revision control to avoid confusion with semantic versioning of your software/libraries/analysis scripts/pipelines
Why use revision control
Have you ever…
Nope Nope Nope
• Initially:
  – git init, git add, git commit, git push, git pull, git tag
Choose a revision control tool and learn it (and the tools that enhance it)
Working directory, Staging area, Repository
git add
git commit
git push
git pull
git tag v0.3.5
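As a rough sketch of how those commands fit together for a tagged release, the Python below builds the command sequence and, by default, only prints it. The function names, the remote name `origin` and the `v` tag prefix are assumptions for illustration, not part of the talk.

```python
import re
import shlex
import subprocess


def release_commands(version, message, paths=(".",)):
    """Build the git commands for committing and tagging a release.

    `version` must be MAJOR.MINOR.PATCH; the tag is 'v' + version.
    The remote name 'origin' is an assumption.
    """
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    tag = f"v{version}"
    return [
        ["git", "add", *paths],             # working directory -> staging area
        ["git", "commit", "-m", message],   # staging area -> repository
        ["git", "tag", tag],                # name this exact revision
        ["git", "push", "origin", "HEAD"],  # publish the commit
        ["git", "push", "origin", tag],     # publish the tag
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(release_commands("0.3.5", "Release 0.3.5"))
```

Keeping the dry run as the default lets you read the exact commands before anything is committed or pushed.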
Use GitHub (or BitBucket)
https://education.github.com
https://education.github.com/pack
Research. Shared.
https://guides.github.com/activities/citable-code/
http://www.zenodo.org
Software, library, analysis script or pipeline: Project2 (v0.3.8, v0.3.9)
~/.venvs/Project2/bin/python
numpy==1.8.1
scipy==0.14.0
matplotlib==1.3.1
biopython==1.64
ghalton==0.6
Operating system
Type and version of OS? Version of gcc? Version of libpng? Version of Python/Perl/Ruby?
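Some of these questions can be answered from inside Python itself. The sketch below records the operating system and interpreter details using only the standard library. The function name and the dictionary keys are illustrative choices. Note that `platform.python_compiler()` reports the compiler used to build the interpreter, which is not necessarily the gcc currently on the machine, and library versions such as libpng are not covered.

```python
import json
import platform
import sys


def describe_environment():
    """Collect basic facts about the OS and the running Python interpreter."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        # Compiler that built this interpreter, not the system compiler.
        "python_compiler": platform.python_compiler(),
        "python_executable": sys.executable,
    }


if __name__ == "__main__":
    print(json.dumps(describe_environment(), indent=2, sort_keys=True))
```

Saving this output next to your results gives a reader at least a starting point when a pipeline behaves differently on another machine.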
Virtualising
Virtualisation technologies
git server, monitoring server, managing server
• Use virtual machines (VMs) to pin specific OS/software versions
• Distribute the VMs
Traditional Virtualised
• Vagrant (http://www.vagrantup.com)
  – Easy to configure, reproducible and portable work environments
  – Benefits:
    • Vagrant (+ Ansible) will automatically set up everything required for your software/libraries/analysis/pipelines
Vagrant is useful for making environments
+ Vagrantfile + virtualisation software + base image (+ Ansible)
• Containers
  – Lightweight
    • Share resources
  – Versionable/diffable
  – Easily distributable
Virtualisation – forget about VMs and move to containers?
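One way pinned dependencies and containers might fit together is to generate a small Dockerfile that installs from requirements.txt. The Python sketch below does only that: the function name, the `/app` working directory, the script name `pipeline.py` and the example base image tag are all assumptions for illustration, not something shown in the talk.

```python
import json


def render_dockerfile(base_image, script="pipeline.py"):
    """Render a minimal Dockerfile that installs pinned requirements.

    Assumes requirements.txt and `script` sit next to the Dockerfile.
    """
    lines = [
        f"FROM {base_image}",
        "WORKDIR /app",
        # Copy the pins first so the install layer is reused
        # when only the code changes.
        "COPY requirements.txt ./",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",
        "ENTRYPOINT " + json.dumps(["python", script]),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The base image tag is an example; pin whichever image you actually test.
    print(render_dockerfile("python:3.11-slim"), end="")
```

Because the result is plain text, it can live in the same repository as the code and be diffed and tagged like any other file.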
Conclusions
• A practicing bioinformatician has roles not too dissimilar to those of a DevOp
  – Multi-disciplinary
• Often the ones who implement, deploy and maintain software/libraries/analysis scripts/pipelines
  – Need to understand and use tools from the Dev community
    • SemVer, dependency pinning, “fixed environment” development, git, GitHub
  – Need to understand and use tools from the Ops community as good ways to distribute tools/pipelines in a controlled manner
    • Virtualisation (Vagrant and Docker)
    • IT automation/orchestration (Ansible)
Acknowledgements
Dr Nouri Ben Zakour
Dr Scott Beatson
http://beatsonlab.com