Keeping Big Projects Under Control
Updated for 2017-02-15
[web] portal.biohpc.swmed.edu
[email] [email protected]
Overview
When working with large volumes of data, software, and code, and coordinating efforts in a team, how can we keep everything under control?
- Keeping Data Under Control: organizing files and analyses; relevant to anyone using BioHPC for analysis work.
- Keeping Software Under Control: working with modules and software environments so that you use a consistent set of software across your project and team.
- Keeping Code Under Control: an overview of some techniques that help us keep our large projects, such as the BioHPC portal, under control.
* More advanced training sessions addressing these areas in more depth will be held later in 2017.
1 – Keeping Data Under Control
Arrange data carefully to:
- Avoid losing data
- Protect sensitive data
- Minimize duplicated data
- Make research reproducible
- …
How:
- Plan ahead
- Keep a good structure
- Set proper permissions
- Back up data
- …
1 – Keeping Data Under Control : Arrange Data Carefully
Folder and File Structure
Benefits:
- Easy to find files
- Greatly facilitates sharing a project with others
- Prevents contamination of raw data (with proper permissions)
How:
- Define the structure in advance
- Limit the files you keep to only those that are strictly necessary
- Maintain a logical folder structure: keep groups of files (raw data, final results) in separate but clearly labeled folders
[Diagram: example folder hierarchy]
1st level: by project
2nd level: by data type, or by process/step
3rd level: by user, or by job/date
* Tip: use symbolic links to minimize data duplication
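The layout and the symbolic-link tip can be sketched in Python. This is only an illustration: the folder names in `DATA_TYPES`, the project name and the user names are assumptions, not the BioHPC layout.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical folder names; the levels follow the slide
# (1st: project, 2nd: data type, 3rd: user).
DATA_TYPES = ("raw_data", "analysis", "final_results")


def create_project_tree(root, project, users):
    """Create <root>/<project>/<data type>/<user> and return the project path."""
    project_dir = Path(root) / project
    for data_type in DATA_TYPES:
        for user in users:
            (project_dir / data_type / user).mkdir(parents=True, exist_ok=True)
    return project_dir


def link_instead_of_copy(source, link):
    """Point `link` at `source` with a symbolic link instead of duplicating the data."""
    link = Path(link)
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"refusing to overwrite {link}")
    os.symlink(Path(source).resolve(), link)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        project = create_project_tree(root, "project_a", ["alice", "bob"])
        raw = project / "raw_data" / "alice" / "reads.txt"
        raw.write_text("ACGT\n")
        link = link_instead_of_copy(raw, project / "analysis" / "bob" / "reads.txt")
        print(link, "->", os.readlink(link))
```

The link gives a second user a path to the raw file inside their own analysis folder, while only one copy of the data exists on disk.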
1 – Keeping Data Under Control : Data Integrity
Data integrity: accuracy & consistency
[Diagram: User, Web site, Database, Storage]
(Related training session, Database Design: Jun. 21st)
- Users have read access to storage, to copy/download data
- Create/delete/edit operations are only allowed through the web site
- The web server updates both the database and storage at the same time to guarantee data integrity
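One way to keep a database and a storage folder in step is to treat the file write and the database insert as a single unit and undo both if either fails. The sketch below is a best-effort illustration using SQLite; the `files` table, the `.part` suffix and the function names are assumptions, not the portal's actual code.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_catalog(db_path):
    """Open (and if needed create) a hypothetical catalog of stored files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, size INTEGER)"
    )
    conn.commit()
    return conn


def store_file(conn, storage_dir, name, payload):
    """Write `payload` to storage and record it in the database together.

    On any failure the database change is rolled back and the newly
    written file is removed, so the two stay consistent.
    """
    final_path = Path(storage_dir) / name
    temp_path = final_path.with_name(final_path.name + ".part")
    moved = False
    try:
        temp_path.write_bytes(payload)
        # A duplicate name fails here, before the stored file is touched.
        conn.execute(
            "INSERT INTO files (name, size) VALUES (?, ?)", (name, len(payload))
        )
        os.replace(temp_path, final_path)  # atomic within one filesystem
        moved = True
        conn.commit()
    except Exception:
        conn.rollback()
        (final_path if moved else temp_path).unlink(missing_ok=True)  # Python 3.8+
        raise
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as storage:
        catalog = open_catalog(os.path.join(storage, "catalog.db"))
        print(store_file(catalog, storage, "sample.txt", b"hello"))
        catalog.close()
```

Routing every create/delete/edit through one function like this is what makes the "only allowed from web" rule enforceable: direct writes to storage would bypass the database record.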
1 – Keeping Data Under Control : Data Security
Data Security
- Securely remove personally identifiable information and other restricted data as early as possible
- Follow the principle of least privilege
- Set fine-grained permissions with setfacl and getfacl
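A small Python wrapper can make ACL changes repeatable. This is a sketch, assuming the standard Linux `setfacl -m u:<user>:<perms> <path>` and `getfacl <path>` command forms; the function names, user names and paths are hypothetical.

```python
import shutil
import subprocess


def setfacl_command(path, user, perms="r-x", recursive=False):
    """Build a `setfacl -m u:<user>:<perms> <path>` command as an argument list."""
    if not perms or not set(perms) <= set("rwx-"):
        raise ValueError(f"invalid permission string: {perms!r}")
    cmd = ["setfacl"]
    if recursive:
        cmd.append("-R")
    cmd += ["-m", f"u:{user}:{perms}", str(path)]
    return cmd


def grant_access(path, user, perms="r-x", recursive=False):
    """Grant `user` the given permissions on `path` and return the resulting ACL text."""
    if shutil.which("setfacl") is None or shutil.which("getfacl") is None:
        raise RuntimeError("setfacl/getfacl are not available on this system")
    subprocess.run(setfacl_command(path, user, perms, recursive), check=True)
    result = subprocess.run(
        ["getfacl", str(path)], check=True, capture_output=True, text=True
    )
    return result.stdout
```

Defaulting to `r-x` rather than `rwx` follows the principle of least privilege: a collaborator who only needs to read results is not given write access.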
1 – Keeping Data Under Control : Sharing
Lab/Dept level directories on BioHPC
Intra-department level directories (shared)
Group ownership is inherited by new files and folders created inside the intra-department folder.
Tip: if you mv data into the folder, you need to apply the chgrp command to correct the group ownership, because moved files keep their original group.
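After moving data into a shared folder it helps to check which entries still carry the wrong group. The sketch below assumes the shared folder's own group is the one every entry should have (on Linux this kind of inheritance is typically provided by the setgid bit on the directory, which is an assumption about the setup here); the function names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def wrong_group_entries(folder, expected_gid=None):
    """Return entries under `folder` whose group differs from the expected one.

    By default the expected group is the group of `folder` itself.
    """
    folder = Path(folder)
    if expected_gid is None:
        expected_gid = folder.stat().st_gid
    return sorted(p for p in folder.rglob("*") if p.lstat().st_gid != expected_gid)


def fix_group(folder):
    """Do the equivalent of `chgrp` on every mismatched entry; return what was changed."""
    gid = Path(folder).stat().st_gid
    changed = wrong_group_entries(folder, gid)
    for path in changed:
        # -1 leaves the owner untouched; only the group is changed.
        os.chown(path, -1, gid, follow_symlinks=False)
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        (Path(shared) / "moved_in.txt").write_text("data")
        print("entries needing chgrp:", wrong_group_entries(shared))
```

Running the check before the fix shows exactly what would change, which is useful when a folder holds data from several people.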
1 – Keeping Data Under Control : Backup
Source: http://www.backup4all.com/kb/incremental-backup-118.html
Mirror/full backup is the starting point for all other backups and contains all the data in the folders and files that are selected to be backed up.
- /home2 (Mondays & Wednesdays)
- /work (Fridays)
Incremental backup provides a faster method of backing up data than repeatedly running full/mirror backups.
- /project (upon request)
What data should be backed up? How often?
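The difference between a full and an incremental backup can be sketched in a few lines of Python. This is a simplified illustration that selects files by modification time; it is not how the BioHPC backups are implemented, and the function names are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path


def full_backup(source, dest):
    """Copy every file under `source` into `dest`; return the relative paths copied."""
    return _copy_files(source, dest, since=None)


def incremental_backup(source, dest, since):
    """Copy only files modified after `since` (seconds since the epoch)."""
    return _copy_files(source, dest, since=since)


def _copy_files(source, dest, since):
    source, dest = Path(source), Path(dest)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        if since is not None and path.stat().st_mtime <= since:
            continue  # unchanged since the last backup
        target = dest / path.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(path.relative_to(source))
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "src")
        src.mkdir()
        (src / "old.txt").write_text("old")
        (src / "new.txt").write_text("new")
        os.utime(src / "old.txt", (1000, 1000))
        print("full:", full_backup(src, Path(tmp, "full")))
        print("incremental:", incremental_backup(src, Path(tmp, "inc"), since=2000))
```

The incremental run touches only what changed, which is why it is faster, but restoring from it still requires the full backup it builds on.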
2 – Keeping Software Package Under Control
Difficulties:
- Versions
- Dependencies
- How to reproduce a research project?
- Install everything from scratch? (Difficult and time-consuming)
Solutions:
- Environment modules (partially)
- Software environments with Conda
2 – Keeping Software Package Under Control : Environment Modules
Environment Modules
- Provide for the dynamic modification of a user's environment via modulefiles
- Modules can be loaded and unloaded dynamically and atomically, in a clean fashion
- Let you keep different versions of the same software
Set up a private module folder under your home directory:
module load use.own
module avail
Module files are defined in ~/privatemodules/<software module>/
2 – Keeping Software Package Under Control : Environment Modules
Issue: incompatibility between software/modules (e.g. V1.7 vs. V1.8)
- Still need to install everything from scratch
- Still need to manually solve the dependency issues
2 – Keeping Software Package Under Control : Software Environments with Conda
- Package manager
- Cross-platform
- Open source, BSD license
- Created for Python programs, but can package and manage any software
- Does not require administrator privileges
Traditional package managers: apt-get, yum, homebrew, pip, etc.
2 – Keeping Software Package Under Control : Software Environments with Conda
- conda install: install a package
- conda remove: remove a package
- conda update: update a package
- conda list: list installed packages
- conda create: create a new conda environment
- conda search: search for a package
- source activate: activate a conda environment
- source deactivate: deactivate the current conda environment
* Packages are hard-linked into the environment to save disk space
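What "hard-linked" means can be shown with a short Python sketch. It only illustrates the filesystem mechanism (two names for the same data on disk); it does not reproduce conda's own linking logic, and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def hard_link(source, target):
    """Create `target` as a second name for the data already stored at `source`."""
    os.link(source, target)
    return Path(target)


def link_count(path):
    """Number of directory entries that point at this file's data."""
    return os.stat(path).st_nlink


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cached = Path(tmp, "package_cache_file")   # stands in for the package cache
        cached.write_bytes(b"library code")
        in_env = hard_link(cached, Path(tmp, "environment_file"))
        print("same data on disk:", os.path.samefile(cached, in_env))
        print("link count:", link_count(cached))
```

Because both names refer to the same data, adding a package to a second environment costs almost no extra disk space, provided both locations are on the same filesystem.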
2 – Keeping Software Package Under Control : Software Environments with Conda
Try: update matplotlib to the newest version (1.5.0 -> 2.0.0)
Problem
Solution: create an isolated conda environment to have your own set of installed and managed packages
2 – Keeping Software Package Under Control : Software Environments with Conda
Conda environment is installed in your home directory
Create a conda environment named test1 with latest anaconda package
2 – Keeping Software Package Under Control : Software Environments with Conda
Use the environment you just created
Install packages in the conda environment
2 – Keeping Software Package Under Control : Software Environments with Conda
User-defined packages
- anaconda search <software name>: list all user-defined packages
- anaconda show <user defined package name>: show detailed info for a certain package
2 – Keeping Software Package Under Control : Software Environments with Conda
conda-forge: a GitHub community-led conda channel (https://conda-forge.github.io/feedstocks)
Bioconda: a channel for the conda package manager specializing in bioinformatics software
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
Channel order: if you add multiple channels in one environment, the most recently added one has the highest priority.
(Related training session, Python: Nov. 15th)
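The ordering rule can be modelled in a few lines of Python. This is a simplified model, assuming that `conda config --add channels` puts the new channel at the front of the list; real conda also weighs versions and build numbers, and its behaviour depends on the channel priority setting. The function names are hypothetical.

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels <name>`: the new channel goes to the front."""
    return [name] + [c for c in channels if c != name]


def pick_channel(channels, available_in):
    """Return the highest-priority channel that offers the package, or None."""
    for channel in channels:
        if channel in available_in:
            return channel
    return None


if __name__ == "__main__":
    channels = []
    for name in ["conda-forge", "defaults", "r", "bioconda"]:
        channels = add_channel(channels, name)
    print("priority order:", channels)
    print("chosen:", pick_channel(channels, {"conda-forge", "defaults"}))
```

With the four commands above run in that order, bioconda ends up first and conda-forge last, so the order in which you add channels matters.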
2 – Keeping Software Package Under Control : Software Environments with Conda
Using R with conda
- Install all of the most popular packages with all of their dependencies:
  conda install -c r r-essentials
- Update all of the packages and their dependencies with one command:
  conda update -c r r-essentials
- Update a single package in R-Essentials:
  conda update r-<package name>
(Related training session, Parallel R: Oct. 18th)
3 – Keeping Code Under Control
Keeping large code projects under control requires consistency and modularity:
- A common development environment that’s easy to set up for new developers
- A version control strategy that can track and integrate everyone’s changes
- Modules of code that can be extended/bug-fixed without affecting other areas
Lots of tools are available to help! Some we use:
A Great Reference
https://software-carpentry.org/lessons/
- General lessons on Linux, Git, etc.
- Principles of good coding in multiple languages: Python, R, MATLAB
- Focus on reproducible and open science
Case Study – BioHPC Portal
A large Python/Django project:
- 1,237 code/config/doc files
- 500 commits
- Complex interactions: SLURM scheduler, usage accounting, quotas, user accounts, interactive sessions
But it is structured so that new team members can have a portal task as their first coding job.
Modularity – a complex project, of many small applications
[Diagram: portal components] Zinnia (news), Django-CMS (content), accounts modules, otrs, sbatch, terminal, utils, quota plugin, slurmplugin, usage plugin, Celery scheduler, Templates
Code Re-use
sbatch:
- Provides web job submission
- Routines to submit jobs to the cluster
terminal:
- Provides webGUI, desktop, etc.
- Can re-use cluster functions
Keeping Track of Changes
Use a structured Git workflow: keep the stable version safe, allow easy merging
- A stable branch with the live version
- Feature branches for each person adding a feature
- A develop branch to merge and test features
(Related training sessions: Git basics, April 17th; Advanced, July 19th)
Complexity – External Interactions
[Diagram: portal components (Zinnia, Django-CMS, accounts modules, otrs, sbatch, terminal, utils, quota plugin, slurmplugin, usage plugin, Celery scheduler, Templates) and the external systems they interact with: Nucleus cluster, accounting system, user directory, storage, modules, tickets, Web VNC]
Consistency and Ease of Entry
BioHPC team can start developing with just:
git clone [email protected]:biohpc/biohpc_portal.git
cd biohpc_portal
vagrant up
This creates a development virtual machine with an emulated cluster, user directory, etc.
Same everywhere: BioHPC workstation, laptop, at home...
Demo 1 - Vagrant
Vagrantfile and Provisioning Scripts
Defensive Coding – Assume Nothing! (or as little as possible)
Don’t trust inputs to be valid: check type, size, range, etc.
Your code can be more reliable if you don’t trust the outside world.
A Norwegian woman mistyped her account number on an internet banking system. Instead of typing her 11-digit account number, she accidentally typed an extra digit, for a total of 12 numbers. The system discarded the extra digit, and transferred $100,000 to the (incorrect) account. A simple dialog box informing her that she had typed too many digits would have helped avoid this expensive error.
Olsen, Kai. “The $100,000 Keying error” IEEE Computer, August 2008
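A check of the kind that would have caught this keying error can be sketched in Python. The 11-digit length comes from the anecdote; the function and exception names are hypothetical, and a real banking system would also verify check digits.

```python
ACCOUNT_DIGITS = 11  # length quoted in the anecdote


class InvalidAccountNumber(ValueError):
    """Raised when user input cannot be a valid account number."""


def parse_account_number(text):
    """Validate type, content and length; reject instead of silently truncating."""
    if not isinstance(text, str):
        raise InvalidAccountNumber("account number must be entered as text")
    cleaned = text.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise InvalidAccountNumber("account number may contain digits only")
    if len(cleaned) != ACCOUNT_DIGITS:
        raise InvalidAccountNumber(
            f"expected {ACCOUNT_DIGITS} digits, got {len(cleaned)}"
        )
    return cleaned


if __name__ == "__main__":
    try:
        parse_account_number("123456789012")  # one digit too many
    except InvalidAccountNumber as err:
        print("rejected:", err)
```

The important design choice is to refuse the input with a clear message rather than repair it by discarding the extra digit.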
Assertions – a quick way to capture invalid input
You don’t always need to code complex checks and user feedback.
If it’s just important that you catch invalid data, use assertions:
[Code examples in Python and MATLAB, from Software Carpentry lessons]
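As a minimal Python sketch in the spirit of those lessons (not the slide's own example; the function name and the checks are illustrative assumptions):

```python
def mean_read_length(lengths):
    """Return the mean of a list of read lengths, refusing obviously bad input."""
    # Preconditions: fail immediately with a readable message on invalid data.
    assert len(lengths) > 0, "need at least one read length"
    assert all(n > 0 for n in lengths), "read lengths must be positive"
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    print(mean_read_length([100, 150, 200]))
    try:
        mean_read_length([])
    except AssertionError as err:
        print("caught invalid input:", err)
```

Note that Python skips `assert` statements when run with the `-O` option, so assertions suit catching programming and data errors during development and analysis, not validating input in a production service.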
Defensive Coding – Assume Nothing! (or as little as possible)
Don’t trust files to exist/not exist: always write sanity checks!
Check early, check often!
Note:
- Use a function to construct paths (safe across different Linux, Mac, Windows)
- Check that a path is valid when we construct it; don’t defer until use
- Use the language’s own exceptions, and catch them elsewhere
- Debug-level logging: speculative, but turns out to be useful
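These notes can be sketched in Python as follows. This is an illustration rather than the code shown in the presentation; the function names are hypothetical.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def build_input_path(base_dir, *parts):
    """Join path components portably and check the result right away."""
    path = os.path.join(base_dir, *parts)
    logger.debug("constructed input path %s", path)
    if not os.path.isfile(path):
        # Built-in exception; callers decide how to report it.
        raise FileNotFoundError(f"input file does not exist: {path}")
    return path


def build_output_path(base_dir, *parts):
    """Join path components and refuse to overwrite an existing file."""
    path = os.path.join(base_dir, *parts)
    logger.debug("constructed output path %s", path)
    if not os.path.isdir(os.path.dirname(path)):
        raise FileNotFoundError(f"output folder does not exist: {path}")
    if os.path.exists(path):
        raise FileExistsError(f"output file already exists: {path}")
    return path


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    with tempfile.TemporaryDirectory() as workdir:
        try:
            build_input_path(workdir, "missing.fastq")
        except FileNotFoundError as err:
            print("caught early:", err)
```

Checking at construction time means a missing input is reported before a long job starts, not hours into it.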
Defensive Coding – Assume Nothing! (or as little as possible)
Don’t trust external calls to work: try and catch
Calling external programs is a classic point of failure in bioinformatics code.
Note:
- Use a call that collects output from the command
- Catch errors from the called command, and the more general OSError
- Raise the error to a higher-level try-catch with a message that makes sense to the end user
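A minimal Python sketch of this pattern, using the standard `subprocess` module; the `ToolError` class and `run_tool` function are hypothetical names, not the code from the presentation.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised with a message intended for the end user."""


def run_tool(args):
    """Run an external command and return its output as text."""
    try:
        # check_output collects the output and raises on a non-zero exit code.
        return subprocess.check_output(args, stderr=subprocess.STDOUT, text=True)
    except subprocess.CalledProcessError as err:
        raise ToolError(
            f"'{args[0]}' failed with exit code {err.returncode}:\n{err.output}"
        ) from err
    except OSError as err:
        # Covers a missing executable, permission problems, and similar.
        raise ToolError(f"could not start '{args[0]}': {err}") from err


if __name__ == "__main__":
    print(run_tool([sys.executable, "-c", "print('tool ran')"]))
    try:
        run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
    except ToolError as err:
        print("reported to user:", err)
```

Higher-level code then needs only one `except ToolError` to show the user a sensible message, whichever external program failed.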
Testing & CI
Automated testing means you can easily check whether changes break things
We provide GitLab CI so you can do this with your code
Case Study – BioHPC param_runner tool
Demo 2 – Overview of git.biohpc.swmed.edu CI
(Related training session, GitLab CI: July 19th)
Test Code & Test Runner
In this case we are using pytest to write and run tests
Demo 3 – pytest run
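For readers without the demo, a pytest test file looks roughly like the sketch below. The `expand_parameters` function is a hypothetical stand-in, not the real param_runner code; pytest collects any function whose name starts with `test_` and reports a failure when a bare `assert` is false.

```python
import itertools


def expand_parameters(grid):
    """Return one dict per combination of the values in `grid`."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]


def test_expands_every_combination():
    combos = expand_parameters({"threads": [1, 2], "kmer": [21, 31]})
    assert len(combos) == 4
    assert {"kmer": 21, "threads": 2} in combos


def test_empty_value_list_gives_no_combinations():
    assert expand_parameters({"threads": []}) == []
```

Saved as, say, `test_params.py` (a hypothetical file name), this runs with `pytest test_params.py`, and the same command can be used as the test step of a CI job.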
2017 Training Program – Coding & Managing Projects
3/15/17 Monitoring and troubleshooting on BioHPC
4/12/17 Git on BioHPC
5/17/17 Parallel Programming in Matlab on BioHPC (MDCS and parallel tool box)
6/21/17 Database System Design
7/19/17 Managing Software Projects in a Team
9/20/17 Data Security and Management
10/18/17 Parallel R/R on BioHPC
11/15/17 Python on BioHPC