AWS re:Invent 2016: Large-scale, Cloud-based Analysis of Cancer Genomes: Lessons Learned from the PCAWG Project
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Brian O’Connor
Technical Director - Analysis Core
UCSC Genomics Institute
Nov 28th, 2016
Large-scale, Cloud-based Analysis of Cancer Genomes: Lessons Learned from the PCAWG Project
Overview
Past Present Future
PCAWG: A Cloud-Based, Distributed Collaboration
● International Cancer Genome Consortium (ICGC)
● ~5,800 whole genomes
  – ~2,800 cancer donors
  – ~1,300 with RNA-Seq data
  – Goal: consistently analyze the data
● 8 sites storing and sharing data via GNOS
  – 300 TB -> 900 TB
● 14 cloud (and HPC) environments
  – 3 commercial, 7 OpenStack, 4 HPC
  – ~630 VMs, ~15K cores, ~60 TB of RAM
PCAWG Cloud Analysis “Core” Workflows
PCAWG Lessons Learned
1. Commercial cloud policies
2. Portable tools
3. Failure-tolerant, distributed execution infrastructure
4. Commercial cloud costs
Lesson 1: Commercial Cloud Policies
• PCAWG analysis showed the power of clouds
• Key policy changes enabled commercial cloud usage
• NIH updated dbGaP cloud policy - March 2015
• ICGC DACO updated ICGC cloud policy - May 2015
• Partnerships with commercial cloud entities
• Amazon Public Datasets Program
• Seven Bridges
• DNAnexus
PCAWG Cloud Analysis Architecture
[Diagram: sequencing projects feed GNOS repositories and a metadata index; cloud orchestrators issue work orders to academic compute centers and to compute in the AWS cloud, including spot instances.]
PCAWG Analysis Architecture & AWS
[Diagram: the same architecture with the AWS components called out — Amazon S3, spot instances, and commercial partners DNAnexus and Seven Bridges alongside the academic compute centers, GNOS, metadata index, and cloud orchestrators.]
This represents a major shift: ICGC data is now redistributed within Amazon's cloud.
Lesson 2: Portable Tools
Containerized workflows for portability between sites
Core Workflows
Alignment: BWA-Mem
Variant Calling: Broad, DKFZ/EMBL, and Sanger
https://github.com/ICGC-TCGA-PanCancer
Lesson 3: Fault-Tolerant Cloud Execution
Architecture 1.0
● cloud-based clusters
● Gluster distributed filesystem
● scheduling per cloud
Architecture 2.0
● single-node workers
● no distributed filesystem
● Ansible for setup
Architecture 3.0
● a complete rethink
Lesson 4: Cloud Costs

Workflow     Hardware (cores / RAM)    Runtime                     Cost on AWS
BWA          8 cores / 16 GB RAM       5 days (±5) per specimen    $11.16
Sanger       8 cores / 32 GB RAM       4 days (±3) per donor       $17.22
DKFZ/EMBL    16 cores / 64 GB RAM      2 days (±0.6) per donor     $12.80
Broad        32 cores / 256 GB RAM     2.6 days per donor          $20.48

Workflow     Storage required per donor
BWA          240 GB
Sanger       4 GB
DKFZ/EMBL    5 GB
Total        249 GB

Source: "Data analysis: Create a cloud commons", Nature, 2015
$62/donor
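The ~$62/donor figure follows directly from the per-workflow costs in the table above. A minimal sketch that reproduces the totals (costs are the quoted AWS figures; BWA's $11.16 is counted once, as in the quoted sum):

```python
# Per-donor AWS compute cost for the four PCAWG core workflows,
# using the figures from the cost table above.
workflow_costs = {
    "BWA": 11.16,        # alignment: 8 cores / 16 GB, ~5 days per specimen
    "Sanger": 17.22,     # variant calling: 8 cores / 32 GB, ~4 days per donor
    "DKFZ/EMBL": 12.80,  # variant calling: 16 cores / 64 GB, ~2 days per donor
    "Broad": 20.48,      # variant calling: 32 cores / 256 GB, ~2.6 days per donor
}

total_compute = sum(workflow_costs.values())
print("compute cost per donor: $%.2f" % total_compute)  # $61.66, i.e. ~$62/donor

# Storage footprint per donor, from the storage table above.
storage_gb = {"BWA": 240, "Sanger": 4, "DKFZ/EMBL": 5}
print("storage per donor: %d GB" % sum(storage_gb.values()))  # 249 GB
```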
ICGC PCAWG Legacy
● Publications soon
● AWS Public Datasets Program
● ~1,400 PCAWG donors
  – BAM (~70% of ICGC donors)
  – VCF from all three pipelines
  – more ICGC data uploaded regularly
https://dcc.icgc.org/icgc-in-the-cloud
The Present
Goal: formalize lessons from PCAWG into reusable tools
Ecosystem of tools:
● Dockstore - tool/workflow sharing
● Toil - workflow execution
● Redwood - file storage
Redwood - Scalable Storage
Key features: based on the ICGC Storage Service; supports FUSE, BAM slicing, and highly parallel access; typically a WORM (write once, read many) usage pattern.
[Diagram: a client on an Amazon EC2 instance in the AWS cloud talks to the authentication & storage services, backed by Amazon S3.]
Redwood - Storage System Performance
The Redwood storage system (and the underlying S3) provided a stable and secure mechanism to store and use genomic data.
An example run of ~100 simultaneous downloads sustained ~45-100 MB/s.
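Highly parallel access of the kind quoted above typically comes from splitting each object into byte ranges fetched concurrently. A minimal sketch of that pattern, with the actual HTTP Range request stubbed out (the part size and transport are illustrative, not Redwood's actual client internals):

```python
import concurrent.futures

def byte_ranges(total_size, part_size):
    """Split an object of total_size bytes into inclusive (start, end) ranges."""
    return [(s, min(s + part_size - 1, total_size - 1))
            for s in range(0, total_size, part_size)]

def fetch_range(rng):
    # Placeholder for an HTTP Range request against the storage service,
    # e.g. GET with header "Range: bytes=start-end". Here we only return
    # the number of bytes the request would retrieve.
    start, end = rng
    return end - start + 1

def parallel_download(total_size, part_size=8 * 1024 * 1024, workers=10):
    """Fetch all parts of an object concurrently; returns bytes retrieved."""
    ranges = byte_ranges(total_size, part_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        fetched = list(pool.map(fetch_range, ranges))
    return sum(fetched)
```

The same range arithmetic underlies BAM slicing: only the ranges covering the requested genomic region need to be fetched.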
Dockstore.org - Sharing Tools & Workflows
Dockstore:
● Share tools and workflows
● Package tools with Docker, describe with CWL/WDL
● PCAWG goal: provide our tools via Dockstore
http://dockstore.org and https://github.com/ga4gh
Dockstore Architecture
● Built on Docker Hub/Quay.io and GitHub/Bitbucket
● Adds metadata to address shortcomings for bioinformatics workflows
● CWL/WDL is the natural choice for the descriptor
Dockstore 1.0 Release
Highlighted new features:
● Support for the 1.0.0 GA4GH Tool Registry API
● Support for displaying, sharing, and natively launching CWL 1.0 & WDL tools
● Preliminary support for CWL/WDL workflows
● Full list of updates since 0.4-beta.4: https://github.com/ga4gh/dockstore/releases
New content:
● ICGC PanCancer Analysis of Whole Genomes (PCAWG) tools: BWA-Mem, Sanger, Delly, DKFZ
Dockstore Tour
Main Page · Search · Tool Management
Running Dockstore Tools
Execution with the Dockstore command line interface (CLI).
The goal was something simple, but we want the same process accessible via other execution systems!
1. Provision input files
2. Pull Docker images
3. Execute the tool with inputs using CWL
4. Provision output files somewhere
Other execution systems: Seven Bridges, Curoverse, Galaxy, Consonance, etc.
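The four stages above can be sketched as a launch plan. The commands, file names, and image name below are illustrative, not the exact Dockstore CLI syntax; only the docker/cwltool invocations reflect real tools:

```python
def plan_launch(tool_cwl, job_json, docker_image):
    """Build the sequence of steps a Dockstore-style launcher performs.
    Each entry is (stage name, command); commands are illustrative."""
    return [
        ("provision inputs", ["download", job_json]),      # fetch remote input files
        ("pull image", ["docker", "pull", docker_image]),  # get the tool's container
        ("execute", ["cwltool", tool_cwl, job_json]),      # run via the CWL reference runner
        ("provision outputs", ["upload", "output/"]),      # push results to their destination
    ]

# Hypothetical PCAWG-flavored example.
for name, cmd in plan_launch("bwa-mem.cwl", "job.json", "quay.io/example/bwa-mem"):
    print(name, "->", " ".join(cmd))
```

Because each stage is an explicit step, another execution system (Seven Bridges, Galaxy, Consonance, ...) can substitute its own implementation of any stage.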
Simple Dockstore Command Line
Coming Soon to Dockstore
● Workflow DAG view
● Testing PCAWG test data
● “Launch With…”: Consonance, commercial partner(s)
● Signed Docker images
● Cross-site indexing
See roadmap: https://goo.gl/4D9a8F
Toil - Efficient Compute on AWS
● A system for large-scale, efficient work on AWS
● Toil recently completed a 30K-core, 20K-sample recompute
● Per-job granularity allows for better efficiency and robustness
Toil - Dynamic DAGs
The job graph in Toil can be either statically or dynamically declared.
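Dynamic declaration means a running job can add children that were not known when the workflow started. A standard-library sketch of that idea (this is the concept only, not Toil's actual API):

```python
from collections import deque

def run(root_jobs):
    """Tiny dynamic-DAG runner: each job is a callable that returns the
    list of child jobs it discovered at runtime (the dynamic part)."""
    queue = deque(root_jobs)
    completed = []
    while queue:
        job = queue.popleft()
        children = job() or []
        completed.append(job.__name__)
        queue.extend(children)  # children declared while the graph runs
    return completed

def align():
    # e.g. per-sample alignment; decides at runtime what comes next
    return [call_variants]

def call_variants():
    # spawned only after alignment finishes
    return []

print(run([align]))  # → ['align', 'call_variants']
```

In Toil the equivalent is a job adding child and follow-on jobs from inside its own run method, so graphs can grow with the data (e.g. one child per discovered sample).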
Toil - Spark & ADAM Integration
[Diagram: a Spark cluster on Amazon EC2 instances — one master coordinating several slaves.]
Toil - Accessible to New Developers
User scripts are written in pure Python:

from toil.job import Job

def helloWorld(message, memory="2G", cores=2, disk="3G"):
    return "Hello, world!, here's a message: %s" % message

j = Job.wrapFn(helloWorld, "You did it!")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    print(Job.Runner.startToil(j, options))  # prints "Hello, world!, ..."
Toil - Portable
● Toil can be installed on any system with Python 2.7
● Built-in support for various batch systems, thanks in part to open-source community support:
  ○ Mesos
  ○ SGE (GridEngine)
  ○ UCSC's Parasol
  ○ single-machine mode
  ○ LSF
  ○ SLURM
● All batch systems can be used interchangeably with any of the job stores
Toil - Scalable
● Cloud-based job stores are designed to handle many concurrent workers
● Mesos has been shown to scale to 50K simulated nodes in Amazon Elastic Compute Cloud (EC2)
● Workers try to reduce interactions with the master by scheduling jobs locally
Toil - Robust to Failures
● Jobs are checkpointed upon completion, allowing resumption after job failure
● Toil's job store can resume from any combination of leader/worker failure
● Toil currently supports job stores for:
  ○ shared file systems
  ○ AWS (Amazon S3 + Amazon SimpleDB)
  ○ experimental support for Azure / Google Cloud
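Checkpoint-on-completion is what makes a failed run resumable: finished jobs are recorded in the job store, so a restart skips them. A file-backed sketch of the model (the JSON store and job-list shape are illustrative, not Toil's actual job-store format):

```python
import json
import os

def run_with_checkpoints(jobs, store_path):
    """Run (name, fn) jobs in order, recording each completion in a JSON
    job store so a crashed run can resume without redoing finished work.
    A sketch of the checkpoint-on-completion idea, not Toil's real API."""
    done = set()
    if os.path.exists(store_path):
        done = set(json.load(open(store_path)))  # recover prior progress
    executed = []
    for name, fn in jobs:
        if name in done:
            continue  # already checkpointed by a previous (failed) run
        fn()
        done.add(name)
        executed.append(name)
        with open(store_path, "w") as f:
            json.dump(sorted(done), f)  # checkpoint upon completion
    return executed
```

Rerunning the same job list against the same store executes only the jobs that had not yet completed, which is exactly the resume-after-failure behavior described above.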
Toil in Action: 20,000 RNA-seq Sample Recompute
[Chart: the Toil RNA-seq recompute — scalable and robust to failure.]
The Future
PCAWG showed the power of the cloud for large scientific analyses.
Current work on Redwood, Dockstore, and Toil formalizes the lessons learned and methodologies.
Our future work focuses on establishing standards from our previous work and applying them to future, larger-scale efforts.
Tool Registry API
● Formalizing the standard with the GA4GH through the Containers and
Workflows Task Team, implemented in Dockstore
● Basic read API with extended support for write and search
[Diagram: a tool couples a Docker image with its CWL/WDL descriptor; the API standard to share supports GET list, GET search, and POST register, accompanied by CWL/WDL conventions.]
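The read API with search and register maps onto a handful of HTTP operations. A request-builder sketch of those endpoints; the base URL, query parameter, and path shapes are illustrative rather than the exact GA4GH/Dockstore paths:

```python
def trs_requests(base="https://example.org/api/ga4gh/v1"):
    """Build (method, url) pairs for the registry operations on the slide.
    The base URL and paths are illustrative, not an exact endpoint map."""
    return {
        "list":     ("GET",  base + "/tools"),             # GET list
        "search":   ("GET",  base + "/tools?name=bwa"),    # GET search
        "register": ("POST", base + "/tools"),             # POST register
        # each tool pairs a Docker image with a CWL/WDL descriptor:
        "descriptor": ("GET", base + "/tools/{id}/versions/{v}/CWL/descriptor"),
    }

for op, (method, url) in sorted(trs_requests().items()):
    print(method, url)
```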
Emerging GA4GH API Standards
Further work of the Containers and Workflows Task Team: Workflow/Task Execution APIs
● POST new task
● GET task status
● GET task stderr/stdout
[Diagram: an API standard to execute — a WDL/CWL workflow or JSON task description plus a Docker image, run by a cloud-specific implementation that reports status, stderr/stdout, and output file(s).]
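From a client's point of view, the three operations above form a submit-then-poll loop. A minimal sketch with the transport stubbed out (the paths, status values, and in-memory API are illustrative, not a defined GA4GH wire format):

```python
import time

def run_task(api, task, poll_interval=0.0):
    """Submit a task and poll until it finishes, mirroring the
    POST new task / GET task status / GET stderr-stdout flow.
    `api` is any object with post/get; a real client would speak HTTP."""
    task_id = api.post("/tasks", task)            # POST new task
    while api.get("/tasks/%s/status" % task_id) not in ("COMPLETE", "ERROR"):
        time.sleep(poll_interval)                 # GET task status until done
    return api.get("/tasks/%s/stdout" % task_id)  # GET task stderr/stdout

class FakeAPI:
    """In-memory stand-in for a cloud-specific implementation."""
    def __init__(self):
        self.polls = 0
    def post(self, path, task):
        return "t1"
    def get(self, path):
        if path.endswith("/status"):
            self.polls += 1
            return "RUNNING" if self.polls < 3 else "COMPLETE"
        return "hello from the tool"

print(run_task(FakeAPI(), {"image": "docker://example/tool", "cmd": ["echo", "hi"]}))
```

Swapping `FakeAPI` for an HTTP client against any conforming implementation is the portability the standard is after: the same loop runs a task on any cloud.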
GA4GH Containers & Workflow Vision
Toil
Dockstore.org
Redwood
- GA4GH Containers & Workflows Task Team
- Broad Institute
- Cincinnati Children’s Hospital
- Curoverse
- European Bioinformatics Institute
- Intel
- Institute for Systems Biology
- Google, Microsoft, and Amazon
- Ontario Institute for Cancer Research
- Oregon Health and Science University
- Seven Bridges Genomics
- University of California Santa Cruz
Acknowledgements
● Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG
● Vincent Ferretti - Storage
● Denis Yuen - Dockstore
● Kyle Ellrott - Task API
● Peter Amstutz - Workflow API and Co-leader
● Jeff Gentry - Co-leader
● Hannes Schmidt, Frank Nothaft & the Toil Team
Software Availability
● Dockstore (tool/workflow sharing): https://dockstore.org/
● Toil (workflow execution): https://toil.readthedocs.io
● Redwood (file storage): https://github.com/icgc-dcc/dcc-storage
All three projects are open source and welcome your contributions
The AWS Perspective
Enabling science:
● Scalable compute resources only when needed
● Time to result was greatly reduced
● Cost of analysis was greatly reduced
● Data can be securely shared in place
● Global community access
Open data as a platform
[Diagram: data at rest (object storage) flows through data creation and data enrichment to sensemaking — via basic APIs, complex APIs, data catalogs, consumer applications, algorithmic policy, data-driven journalism, focused data dashboards, predictive modeling, and visualizations — lowering the cost of knowledge (efficiency).]
Open data as a platform — the same pipeline annotated with genomics file formats (BAM, gVCF, Wig, GFF).
Amazon S3 for science
[Diagram: Amazon S3 as a data lake, feeding a data science sandbox and visualization/reporting.]
Public datasets on AWS
To enable more innovation, AWS hosts a selection of datasets that anyone can access for free. Data in our public datasets is available for rapid access alongside our flexible, low-cost computing resources.
Earth Science: Landsat, NEXRAD, NASA NEX
Life Science: TCGA & ICGC, 1000 Genomes, Genome in a Bottle, Human Microbiome Project, 3000 Rice Genome
Internet Science: Common Crawl Corpus, Google Books Ngrams, Multimedia Commons
https://aws.amazon.com/public-datasets/
Serverless Science with AWS Lambda
AWS Lambda: a serverless, event-driven compute service.
Continuous Scaling: AWS Lambda automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.
No Servers to Manage: AWS Lambda automatically runs your code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.
Subsecond Metering: With AWS Lambda, you are charged for every 100 ms your code executes and for the number of times your code is triggered. You don't pay anything when your code isn't running.
Key Scenarios
● Data processing: stateless processing of discrete or streaming updates to your data store or message bus
● App backend development: execute server-side backend logic in a cross-platform fashion
● Control systems: customize responses and response workflows to state and data changes within AWS
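The data-processing scenario can be sketched as a Lambda handler reacting to new objects landing in S3. The event field names follow the real S3 notification shape; the per-object processing itself is a placeholder:

```python
def handler(event, context=None):
    """AWS Lambda entry point: process each S3 record in the triggering
    event independently (stateless, one invocation per trigger).
    The genomics 'processing' below is a placeholder."""
    results = []
    for record in event.get("Records", []):
        # These field paths match the S3 event notification structure.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder work: e.g. slice a BAM header, index a VCF,
        # or update a data catalog entry for the new object.
        results.append("processed s3://%s/%s" % (bucket, key))
    return results
```

Because each invocation is stateless and metered in 100 ms increments, a burst of uploaded genome files simply fans out into parallel invocations with no idle servers in between.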
Evented genome sequence processing: Nanocall*
* Matei David (Jared T. Simpson lab), doi:10.1093/bioinformatics/btw569
Data analysis using R, API Gateway, and Lambda
Station X's GenePool platform enables real-time biomarker analysis and management of clinical genomic data at scale.
They use API Gateway to execute Lambda functions that bundle a statistical program in R, calculating the significance of the association between a gene's expression level and patient survival for every gene in the genome (~20K).
This serverless architecture enabled them to scale dynamically without paying for idle compute, while leveraging robust error-handling capabilities.
It exemplifies how researchers can leverage PHI data de-identification to use more resources on the AWS platform.
“The patient data has been de-identified…API Gateway and Lambda only receive the event, time-to-event, and expression values [which] ensures that we are able to use Lambda and API Gateway...while still complying with the AWS BAA and HIPAA.”
GT-Scan2 – Scaling CRISPR-Cas9 searches
Thank you!
Remember to complete
your evaluations!