Building cloud-enabled genomics workflows with Luigi and Docker
TRANSCRIPT
![Page 1: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/1.jpg)
Building cloud-enabled cancer genomics workflows with Luigi and Docker
Jake Feala, PhD
Principal Scientist, Bioinformatics @ Caperna
Founder @ Outlier Bio
![Page 3: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/3.jpg)
Outline
Pipelines
Requirements for the ideal workflow platform
Challenges
Practical, free alternatives
Convergence of cloud bioinformatics platforms
Luigi
Docker
S3*
Some enabling infrastructure
Application: Confirming cancer mutations in RNA-seq
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS** tools
*Not technically free but nearly
**FOSS = Free and Open Source Software
Copyright © 2016 Outlier Bio
![Page 4: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/4.jpg)
data >> algorithms
![Page 5: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/5.jpg)
When in doubt, just sequence it!
Plummeting sequencing costs + Powerful insights = Mounds of data
![Page 6: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/6.jpg)
managing big genomics data is hard
![Page 9: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/9.jpg)
Pipelines
What you show your boss:
FastQ → Align → BAM
What you actually do:
• Align each lane
• Merge SAM
• Build index
• Update IGV registry
• Index BAM
• SAM → BAM
• Sort BAM
And sometimes….
And BTW these files are massive
![Page 10: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/10.jpg)
Capability = automated pipeline*
• Pipelines are at the core of a bioinformatics group
• Best way to show a capability is a complete pipeline
– Any suitable input → quality-assured output
– Sure, you may be able to do a certain analysis, but a working, tested pipeline proves it
• Lots of tools are available, but putting them together correctly into a pipeline is hard
*“workflow” is synonymous with “pipeline” for the purposes of this talk
Diagram: Lots of possible input sources and types → Core capability → Lots of possible downstream analyses
![Page 11: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/11.jpg)
good data science requires good data engineering
![Page 12: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/12.jpg)
Linux commands and custom file formats are the “narrow waist” of bioinformatics
This constraint underlies the design of bioinformatics pipelines
![Page 13: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/13.jpg)
This works great for tool developers…
![Page 14: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/14.jpg)
…but quickly gets complicated for users
Common homegrown solution:
- Bash scripts
- VMs (AMIs)
- StarCluster or similar
- Sun Grid Engine or similar
- Parallelize by sample
![Page 16: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/16.jpg)
Limitations with commonly used technology
• Compute
– SGE is flat, just a job scheduler, bash only, no dependencies
• Environment
– AMIs and VMs are slow and heavyweight, no code, not good for ongoing dev
• Workflow
– Makefiles have ugly syntax, static, inflexible
– GUIs and domain-specific systems like Galaxy/Taverna are not easily programmable or general-purpose
![Page 17: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/17.jpg)
Properties of the ideal pipeline system
• General purpose: familiar language, can apply to any task
• Modular: any language, well-tested components with tight APIs
• Scalable: parallelize for free, independent of components
• Integrated: LIMS, metadata, viz, versioning, reporting
• Versioned: reproduce from snapshots in time
• Idempotent: resume from failure, guarantee outputs
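One way to read the "Idempotent" bullet, as a minimal Python sketch: a step is skipped when its output already exists, and the output is written to a temporary file and renamed into place so that a crash never leaves a half-written result. The helper name `run_step` and the file layout are illustrative assumptions, not something shown in the talk.

```python
import os
import tempfile


def run_step(output_path, produce):
    """Run `produce` only if `output_path` is missing; write atomically.

    `produce` is any callable that takes a writable text file object.
    Returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output already there: resuming is a no-op

    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            produce(handle)
        os.replace(tmp_path, output_path)  # atomic on the same filesystem
    except BaseException:
        # Clean up the partial temp file so no half-finished output remains.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Calling `run_step` twice with the same path does the work once; the second call returns `False` and leaves the existing output untouched.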
![Page 19: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/19.jpg)
Emerging best practice is to containerize and decompose into atomic steps
![Page 20: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/20.jpg)
But atomicity and scaling bring added complexity
![Page 21: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/21.jpg)
Everyone is doing this same basic architecture
![Page 22: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/22.jpg)
That is, everyone except for ADAM/Spark. They are shifting the paradigm
• Pipeline = sequence of transformations on a data model, not a string of shell commands
• Separate method from implementation
• Distribute records (i.e. alignments) for horizontal scaling
• Needs more community tools before adoption
Fig 1, Nothaft et al. (2015) Rethinking Data-Intensive Science Using Scalable Analytics Systems
https://bdgenomics.org
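The "sequence of transformations on a data model" idea can be illustrated without Spark. This is a toy sketch in plain Python, where the `Alignment` record, the two filters, and the `pipeline` helper are all hypothetical stand-ins rather than ADAM's API: each stage is a function over a stream of typed records, so the method is described independently of whatever engine maps it over the data.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Alignment:
    """Hypothetical, heavily simplified alignment record."""
    read_name: str
    contig: str
    start: int
    mapq: int


Stage = Callable[[Iterable[Alignment]], Iterator[Alignment]]


def min_mapq(threshold: int) -> Stage:
    """Keep only records with mapping quality >= threshold."""
    def stage(records):
        return (r for r in records if r.mapq >= threshold)
    return stage


def on_contig(name: str) -> Stage:
    """Keep only records aligned to one contig."""
    def stage(records):
        return (r for r in records if r.contig == name)
    return stage


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into a single transformation."""
    def run(records):
        return reduce(lambda acc, stage: stage(acc), stages, iter(records))
    return run


reads = [
    Alignment("r1", "chr1", 100, 60),
    Alignment("r2", "chr2", 200, 60),
    Alignment("r3", "chr1", 300, 5),
]
kept = list(pipeline(min_mapq(30), on_contig("chr1"))(reads))
```

Because every stage works record by record, the input could in principle be split across machines and each part processed independently, which is the horizontal-scaling point on the slide.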
![Page 24: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/24.jpg)
A free alternative architecture
Components
• Storage: S3
• Environment: Docker
• Workflows: Luigi
• Compute: EC2 auto-scaling groups
• Infrastructure: to put the pieces together
Benefits
• Free (almost)
• Huge communities, battle tested
• Separate concerns, use best tool for each job
![Page 25: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/25.jpg)
S3 and object storage
Steps:
1. Learn how object store is different from a filesystem
2. Write a bit of extra code to manage transfers to/from S3
3. Never think about long-term storage again*
*within reason, if IT director is in the room.
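Step 1 mostly comes down to this: an object store has buckets and flat keys, not directories, and "folders" are just key prefixes. A minimal sketch using only the standard library; the bucket and key names are made up, and actual transfers would go through an S3 client library, which is not shown here.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URL: %r" % url)
    return parsed.netloc, parsed.path.lstrip("/")


def keys_under(prefix, keys):
    """Emulate 'listing a folder': filter a flat key list by prefix."""
    return sorted(k for k in keys if k.startswith(prefix))


# Hypothetical bucket contents: a flat namespace, no real directories.
all_keys = [
    "raw/sampleA/reads.bam",
    "raw/sampleB/reads.bam",
    "results/sampleA/variants.vcf",
]

bucket, key = split_s3_url("s3://my-genomics-bucket/raw/sampleA/reads.bam")
raw_keys = keys_under("raw/", all_keys)
```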
![Page 26: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/26.jpg)
Docker
From https://www.docker.com/what-docker
• Lightweight environment management
• Specify environment as code (Dockerfile)
• Portable (laptop → cluster with same behavior)
• Wide adoption in tech
• Gaining ground in bioinformatics
![Page 27: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/27.jpg)
Luigi
• Like a Makefile but in pure Python*
• Executes locally => easy to debug
• OOP declarative framework with Tasks and Targets
– Tasks are Python objects, they can do anything
– Targets are Python objects, they can be anything
• Idempotent (resume where you left off) and atomic (no half-finished tasks)
• Batteries included (graph viz, CLI integration, S3, MySQL, Hadoop, Spark, Redshift, SGE, …)
*i.e., not another DSL
What’s a Makefile?
• Build DAG of tasks
• Each task specifies dependencies and output
• Start with what you want to build, work up
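The Makefile model described here fits in a few lines of plain Python. This is a toy illustration of the idea, not Luigi's implementation: the `TASKS` table, the task names, and the `build` function are invented for the example, and "outputs" are tracked in a set rather than on disk.

```python
# Each task lists the tasks it depends on (a small DAG).
TASKS = {
    "align": [],
    "sort": ["align"],
    "index": ["sort"],
    "call_variants": ["sort", "index"],
}


def build(target, done, log):
    """Build `target` after its dependencies, skipping finished work.

    `done` is the set of tasks whose output already exists;
    `log` collects the order in which tasks actually ran.
    """
    if target in done:
        return  # output exists: nothing to do
    for dep in TASKS[target]:
        build(dep, done, log)  # work up from the requested target
    log.append(target)  # stand-in for actually running the task
    done.add(target)


first_run = []
build("call_variants", set(), first_run)

# Resume after a partial run: 'align' and 'sort' already have outputs.
resumed = []
build("call_variants", {"align", "sort"}, resumed)
```

On the first run every task executes in dependency order; on the resumed run only `index` and `call_variants` execute, which is the "resume where you left off" behaviour.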
![Page 29: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/29.jpg)
Luigi is flexible enough to do pretty much anything within a workflow
Diagram nodes: S3, EC2/Docker, Python code, Database, Hadoop/Spark, Local client, Local file, Manual operation!
Legend and API: storage = luigi.Target, compute = luigi.Task, data flows = luigi.Task.requires
![Page 31: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/31.jpg)
Anatomy of a Luigi Task
• Build your filepaths in Task.output
• Declare your parameters (can be passed from CLI)
• Do the thing
• Read the input
• Write the output
![Page 32: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/32.jpg)
Anatomy of a Luigi Task
Now output to S3 instead
![Page 33: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/33.jpg)
Anatomy of a Luigi Task
Run a Docker container instead
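A task's `run` method can shell out to `docker run` rather than doing the work in Python. A hedged sketch of just the command construction, using the `outlierbio/samtools` image name from this talk; the mount point `/data`, the helper name, the host path, and the samtools arguments are assumptions, and the example further assumes the image's entrypoint is `samtools`.

```python
import subprocess


def docker_run_command(image, args, host_dir, container_dir="/data"):
    """Return the argv for running `args` inside a throwaway container.

    `host_dir` is bind-mounted at `container_dir` so the tool can
    read inputs and write outputs that outlive the container.
    """
    return [
        "docker", "run",
        "--rm",  # remove the container when it exits
        "-v", "%s:%s" % (host_dir, container_dir),  # bind mount
        image,
    ] + list(args)


cmd = docker_run_command(
    "outlierbio/samtools",
    ["sort", "/data/sample.bam", "-o", "/data/sample.sorted.bam"],
    host_dir="/scratch/sample1",
)

# To execute for real (requires Docker and the image to be available):
# subprocess.run(cmd, check=True)
```

Building the argument list as a separate step keeps it easy to log and inspect before anything is executed.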
![Page 35: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/35.jpg)
Some assembly required
Diagram components: Local client, Luigi, Container registry, ECS, SQS, SGE, CloudWatch, EC2 auto-scaling group (EC2 instances, each running Docker and the ECS agent)
But lots of ways to do it!
![Page 37: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/37.jpg)
An interesting* cancer genomics application
• Problem:
– Calling somatic cancer mutations is difficult
  • Low allele frequencies
  • Complex variants
  • Purity, ploidy issues
– RNA-seq can add confidence, but
  • TCGA does not provide RNA-seq variants
  • Manual review is slow
  • Automating and scaling requires a complex, custom workflow
*to me
![Page 38: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/38.jpg)
Solution: RNA-seq mutation validation pipeline
Data sources: CGHub, GDAC portal
Download RNA-seq → RNA-seq (BAM)
Download somatic mutations → DNA variants (VCF)
Extract RNA-seq regions → Local region (BAM) → MuTect → RNA variants (VCF)
Merge RNA/DNA variants
Merge patient variants
Merge cohort variants
Loops: for each variant, for each patient
Note: this workflow should soon be feasible with the NCI cloud pilot platforms
![Page 39: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/39.jpg)
Dockerizing the apps
See https://github.com/outlierbio/bio-it for full project
Base image: ./Dockerfile
$ docker build -t outlierbio/bioit .
GTDownload: ./bioit/apps/gtdownload/Dockerfile
$ cd bioit/apps/gtdownload
$ docker build -t outlierbio/gtdownload .
Samtools: ./bioit/apps/samtools/Dockerfile
$ cd ../samtools
$ docker build -t outlierbio/samtools .
![Page 40: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/40.jpg)
Luigi-izing the workflow
See https://github.com/outlierbio/bio-it for full project
![Page 43: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/43.jpg)
Run it!
See https://github.com/outlierbio/bio-it for full project
$ luigi --module bioit.validate_rnaseq RunSample --sample-id=TCGA-AB-2929
![Page 44: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/44.jpg)
Thanks! Questions?
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Further reading
• My blog: https://medium.com/outlier-bio-blog
• GitHub repo for this talk: https://github.com/outlierbio/bioit
• Luigi: https://github.com/spotify/luigi
• Docker: https://docker.com
• Amazon AWS: http://lmgtfy.com/?q=aws
![Page 45: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/45.jpg)
Extra slides
![Page 46: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/46.jpg)
More on reproducibility: 3 major components and some possible solutions
• Code
– VCS
– Notebooks
• Data
– Metadata store
– S3
– Synapse
– Versioned data APIs
• Environment
– Package managers (pip install -r)
– VM
– Docker
![Page 47: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/47.jpg)
The slide about how this is kind of like functional programming
Immutable input → function → Immutable output
• No side effects (pure)
• Stateless
• Same input always gives the same output
• Compose complex, scalable functionality from small components
Immutable S3 object → containerized data transformation → Immutable S3 object
See http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html for more on this idea
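One way to make the analogy concrete: if a step is a pure function of its input object and its own version, the output location can be derived from those two things, so a re-run either finds the same result or writes to a new place and never silently overwrites a different one. A sketch under that assumption; the key layout and the `output_key` helper are invented here, not taken from the talk or the linked post.

```python
import hashlib


def output_key(step_name, step_version, input_key):
    """Derive a deterministic output key from what defines the result."""
    digest = hashlib.sha256(
        "\n".join([step_name, step_version, input_key]).encode("utf-8")
    ).hexdigest()[:12]
    return "derived/%s/%s/%s" % (step_name, step_version, digest)


def transform(records):
    """A pure step: returns a new list and never mutates its input."""
    return sorted(set(records))


inputs = ("chr1:100", "chr1:100", "chr1:50")
result = transform(inputs)

key_a = output_key("dedup", "v1", "raw/sampleA/sites.txt")
key_b = output_key("dedup", "v1", "raw/sampleA/sites.txt")
key_c = output_key("dedup", "v2", "raw/sampleA/sites.txt")
```

The same step, version, and input always map to the same key (`key_a` equals `key_b`), while changing the step version yields a different key (`key_c`).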
![Page 48: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/48.jpg)
Docker (user perspective)
• Specify environment as code (Dockerfile)
• A little like git
– commits (each with uuid hash)
– pull
– push
• Isolated like a VM
• Run like an app
– Takes arguments
– CLI allows interactive use
– Hop inside the container to debug
![Page 49: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/49.jpg)
EC2 auto-scaling groups
From http://docs.aws.amazon.com/AutoScaling/
![Page 51: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/51.jpg)
Dispatching pattern allows fan-in from data sources, fan-out to pipelines
Data sources (fan-in): CGHub download, S3 download, Local file, NCBI SRA download, HTTP/FTP download
Extraction steps: CGHub extract FQ, SRA extract FQ
Dispatcher: Load sample
Pipelines (fan-out): exome, RNA-seq, xenograft, WGS, shRNA
Luigi makes this easy
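In Luigi this could be expressed by having `requires` return a different upstream task depending on a parameter. The same idea in plain Python, as a sketch: a registry maps a source type to a loader, and every pipeline asks the single `load_sample` entry point for its input. The loader functions, the `SOURCES` registry, and the returned strings are placeholders.

```python
def from_s3(sample_id):
    return "s3-download:%s" % sample_id  # placeholder for a real download


def from_sra(sample_id):
    return "sra-download+extract:%s" % sample_id


def from_local(sample_id):
    return "local-file:%s" % sample_id


# Fan-in: many sources, one registry.
SOURCES = {"s3": from_s3, "sra": from_sra, "local": from_local}


def load_sample(sample_id, source):
    """Single entry point that hides where the data came from."""
    try:
        loader = SOURCES[source]
    except KeyError:
        raise ValueError("unknown source: %r" % source)
    return loader(sample_id)


# Fan-out: every pipeline starts from load_sample, whatever the source.
def rnaseq_pipeline(sample_id, source):
    return ["rnaseq", load_sample(sample_id, source)]


def exome_pipeline(sample_id, source):
    return ["exome", load_sample(sample_id, source)]
```

Adding a new data source then means registering one loader, with no change to any downstream pipeline.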
![Page 52: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/52.jpg)
Practical implementation notes
• Base Dockerfile with AWS credentials and helpers
• Simple metadata DB to map sample IDs → raw data locations
– All other filepaths managed by pipeline code
– Level-up: store pipeline versions, parameters, batches
• Small test dataset for continuous integration tests (use Jenkins)
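The "simple metadata DB" can be very small. A sketch using SQLite from the standard library; the table name, columns, sample ID, and S3 path are made-up examples, and a shared deployment would likely use a server database rather than an in-memory one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB, just for the example
conn.execute(
    "CREATE TABLE samples ("
    "  sample_id TEXT PRIMARY KEY,"
    "  raw_data_url TEXT NOT NULL)"
)


def register_sample(conn, sample_id, raw_data_url):
    """Record where a sample's raw data lives."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (sample_id, raw_data_url) VALUES (?, ?)",
            (sample_id, raw_data_url),
        )


def raw_data_location(conn, sample_id):
    """Look up the raw data location, or None if the sample is unknown."""
    row = conn.execute(
        "SELECT raw_data_url FROM samples WHERE sample_id = ?",
        (sample_id,),
    ).fetchone()
    return row[0] if row else None


register_sample(conn, "SAMPLE-001", "s3://my-genomics-bucket/raw/SAMPLE-001.bam")
location = raw_data_location(conn, "SAMPLE-001")
```

Only the raw-data location is stored; every derived path can then be computed by pipeline code from the sample ID, as the slide suggests.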
![Page 53: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/53.jpg)
Challenges
• Filesystem: Still need a BIG, FAST local filesystem for NGS analysis
• Luigi has warts
– Parameter passing
– Delete files to re-run
– Lots of code edits to rewire the DAG
![Page 54: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/54.jpg)
5-year timeline (wishlist?)
• More distributed algorithms with ADAM/Spark
• Integration with Jupyter notebooks
• Open source template for spinning up a secure cloud infrastructure
![Page 55: Building cloud-enabled genomics workflows with Luigi and Docker](https://reader035.vdocuments.mx/reader035/viewer/2022062904/587332521a28ab596c8b6dcb/html5/thumbnails/55.jpg)
Live demo???