AWS re:Invent 2016: Large-scale, Cloud-based Analysis of Cancer Genomes: Lessons Learned from the PCAWG Project
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Brian O’Connor
Technical Director - Analysis Core
UCSC Genomics Institute
Nov 28th, 2016
Large-scale, Cloud-based Analysis of Cancer Genomes: Lessons Learned from the PCAWG Project
Overview
Past Present Future
PCAWG: A Cloud-Based, Distributed Collaboration
● International Cancer Genome Consortium (ICGC)
● ~5,800 whole genomes
  – ~2,800 cancer donors
  – ~1,300 with RNA-Seq data
  – Goal: consistently analyze the data
● 8 sites storing and sharing data via GNOS
  – 300 TB -> 900 TB
● 14 cloud (and HPC) environments
  – 3 commercial, 7 OpenStack, 4 HPC
  – ~630 VMs, ~15K cores, ~60 TB of RAM
PCAWG Cloud Analysis “Core” Workflows
PCAWG Lessons Learned
1. Commercial cloud policies
2. Portable tools
3. Failure-tolerant, distributed execution infrastructure
4. Commercial cloud costs
Lesson 1: Commercial Cloud Policies
• PCAWG analysis showed the power of clouds
• Key policy changes enabled commercial cloud usage
• NIH updated dbGaP cloud policy - March 2015
• ICGC DACO updated ICGC cloud policy - May 2015
• Partnerships with commercial cloud entities
• Amazon Public Datasets Program
• Seven Bridges
• DNAnexus
PCAWG Cloud Analysis Architecture
[Diagram: sequencing projects feed GNOS repositories and a metadata index; cloud orchestrators issue work orders to academic compute centers and to compute in the AWS cloud, including spot instances.]
PCAWG Analysis Architecture & AWS
[Diagram: the same architecture with the AWS components called out — Amazon S3, spot instances, and commercial partners DNAnexus and Seven Bridges alongside the academic compute centers, GNOS, metadata index, and cloud orchestrators.]
This represents a major shift: ICGC data is now redistributed within Amazon's cloud.
Lesson 2: Portable Tools
Containerized workflows for portability between sites
Core Workflows
Alignment: BWA-Mem
Variant Calling: Broad, DKFZ/EMBL, and Sanger
https://github.com/ICGC-TCGA-PanCancer
Lesson 3: Fault-Tolerant Cloud Execution
Architecture 1.0
● cloud-based clusters
● Gluster distributed filesystem
● scheduling per cloud
Architecture 2.0
● single-node workers
● no distributed filesystem
● Ansible for setup
Architecture 3.0
● a complete rethink
Lesson 4: Cloud Costs

Workflow     Hardware (cores / RAM)    Runtime                     Cost on AWS
BWA          8 cores / 16 GB RAM       5 days (±5) per specimen    $11.16
Sanger       8 cores / 32 GB RAM       4 days (±3) per donor       $17.22
DKFZ/EMBL    16 cores / 64 GB RAM      2 days (±0.6) per donor     $12.80
Broad        32 cores / 256 GB RAM     2.6 days per donor          $20.48

Workflow     Storage required per donor
BWA          240 GB
Sanger       4 GB
DKFZ/EMBL    5 GB
Total        249 GB

Source: "Data analysis: Create a cloud commons", Nature, 2015
$62/donor
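The ~$62/donor figure follows directly from the per-workflow costs in the table above. A minimal sketch that reproduces the totals (costs are the quoted AWS figures; BWA's $11.16 is counted once, as in the quoted sum):

```python
# Per-donor AWS compute cost for the four PCAWG core workflows,
# using the figures from the cost table above.
workflow_costs = {
    "BWA": 11.16,        # alignment: 8 cores / 16 GB, ~5 days per specimen
    "Sanger": 17.22,     # variant calling: 8 cores / 32 GB, ~4 days per donor
    "DKFZ/EMBL": 12.80,  # variant calling: 16 cores / 64 GB, ~2 days per donor
    "Broad": 20.48,      # variant calling: 32 cores / 256 GB, ~2.6 days per donor
}

total_compute = sum(workflow_costs.values())
print("compute cost per donor: $%.2f" % total_compute)  # $61.66, i.e. ~$62/donor

# Storage footprint per donor, from the storage table above.
storage_gb = {"BWA": 240, "Sanger": 4, "DKFZ/EMBL": 5}
print("storage per donor: %d GB" % sum(storage_gb.values()))  # 249 GB
```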
ICGC PCAWG Legacy
● Publications soon
● AWS Public Datasets Program
● ~1,400 PCAWG donors
  – BAM (~70% of ICGC donors)
  – VCF from all three pipelines
  – more ICGC data uploaded regularly
https://dcc.icgc.org/icgc-in-the-cloud
The Present
Goal: formalize lessons from PCAWG into reusable tools
Ecosystem of tools:
● Dockstore - tool/workflow sharing
● Toil - workflow execution
● Redwood - file storage
Redwood - Scalable Storage
Key features: based on the ICGC Storage Service; supports FUSE, BAM slicing, and highly parallel access; typically a WORM (write once, read many) usage pattern.
[Diagram: a client on an Amazon EC2 instance in the AWS cloud talks to the authentication & storage services, backed by Amazon S3.]
Redwood - Storage System Performance
The Redwood storage system (and the underlying S3) provided a stable and secure mechanism to store and use genomic data.
An example run of ~100 simultaneous downloads sustained ~45-100 MB/s.
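Highly parallel access of the kind quoted above typically comes from splitting each object into byte ranges fetched concurrently. A minimal sketch of that pattern, with the actual HTTP Range request stubbed out (the part size and transport are illustrative, not Redwood's actual client internals):

```python
import concurrent.futures

def byte_ranges(total_size, part_size):
    """Split an object of total_size bytes into inclusive (start, end) ranges."""
    return [(s, min(s + part_size - 1, total_size - 1))
            for s in range(0, total_size, part_size)]

def fetch_range(rng):
    # Placeholder for an HTTP Range request against the storage service,
    # e.g. GET with header "Range: bytes=start-end". Here we only return
    # the number of bytes the request would retrieve.
    start, end = rng
    return end - start + 1

def parallel_download(total_size, part_size=8 * 1024 * 1024, workers=10):
    """Fetch all parts of an object concurrently; returns bytes retrieved."""
    ranges = byte_ranges(total_size, part_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        fetched = list(pool.map(fetch_range, ranges))
    return sum(fetched)
```

The same range arithmetic underlies BAM slicing: only the ranges covering the requested genomic region need to be fetched.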
Dockstore.org - Sharing Tools & Workflows
Dockstore:
● Share tools and workflows
● Package tools with Docker, describe with CWL/WDL
● PCAWG goal: provide our tools via Dockstore
http://dockstore.org and https://github.com/ga4gh
Dockstore Architecture
● Built on Docker Hub/Quay.io and GitHub/Bitbucket
● Adds metadata to address shortcomings for bioinformatics workflows
● CWL/WDL is the natural choice for the descriptor
Dockstore 1.0 Release
Highlighted new features:
● Support for the 1.0.0 GA4GH Tool Registry API
● Support for displaying, sharing, and natively launching CWL 1.0 & WDL tools
● Preliminary support for CWL/WDL workflows
● Full list of updates since 0.4-beta.4: https://github.com/ga4gh/dockstore/releases
New content:
● ICGC PanCancer Analysis of Whole Genomes (PCAWG) tools: BWA-Mem, Sanger, Delly, DKFZ
Dockstore Tour
Main Page · Search · Tool Management
Running Dockstore Tools
Execution with the Dockstore command line interface (CLI).
The goal was something simple, but we want the same process accessible via other execution systems!
1. Provision input files
2. Pull Docker images
3. Execute the tool with inputs using CWL
4. Provision output files somewhere
Other execution systems: Seven Bridges, Curoverse, Galaxy, Consonance, etc.
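The four stages above can be sketched as a launch plan. The commands, file names, and image name below are illustrative, not the exact Dockstore CLI syntax; only the docker/cwltool invocations reflect real tools:

```python
def plan_launch(tool_cwl, job_json, docker_image):
    """Build the sequence of steps a Dockstore-style launcher performs.
    Each entry is (stage name, command); commands are illustrative."""
    return [
        ("provision inputs", ["download", job_json]),      # fetch remote input files
        ("pull image", ["docker", "pull", docker_image]),  # get the tool's container
        ("execute", ["cwltool", tool_cwl, job_json]),      # run via the CWL reference runner
        ("provision outputs", ["upload", "output/"]),      # push results to their destination
    ]

# Hypothetical PCAWG-flavored example.
for name, cmd in plan_launch("bwa-mem.cwl", "job.json", "quay.io/example/bwa-mem"):
    print(name, "->", " ".join(cmd))
```

Because each stage is an explicit step, another execution system (Seven Bridges, Galaxy, Consonance, ...) can substitute its own implementation of any stage.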
Simple Dockstore Command Line
Coming Soon to Dockstore
● Workflow DAG view
● Testing PCAWG test data
● “Launch With…”: Consonance, commercial partner(s)
● Signed Docker images
● Cross-site indexing
See roadmap: https://goo.gl/4D9a8F
Toil - Efficient Compute on AWS
● A system for large-scale, efficient work on AWS
● Toil recently completed a 30K-core, 20K-sample recompute
● Per-job granularity allows for better efficiency and robustness
Toil - Dynamic DAGs
The job graph in Toil can be either statically or dynamically declared.
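Dynamic declaration means a running job can add children that were not known when the workflow started. A standard-library sketch of that idea (this is the concept only, not Toil's actual API):

```python
from collections import deque

def run(root_jobs):
    """Tiny dynamic-DAG runner: each job is a callable that returns the
    list of child jobs it discovered at runtime (the dynamic part)."""
    queue = deque(root_jobs)
    completed = []
    while queue:
        job = queue.popleft()
        children = job() or []
        completed.append(job.__name__)
        queue.extend(children)  # children declared while the graph runs
    return completed

def align():
    # e.g. per-sample alignment; decides at runtime what comes next
    return [call_variants]

def call_variants():
    # spawned only after alignment finishes
    return []

print(run([align]))  # → ['align', 'call_variants']
```

In Toil the equivalent is a job adding child and follow-on jobs from inside its own run method, so graphs can grow with the data (e.g. one child per discovered sample).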
Toil - Spark & ADAM Integration
[Diagram: a Spark cluster on Amazon EC2 instances — one master coordinating several slaves.]
Toil - Accessible to New Developers
User scripts are written in pure Python:

from toil.job import Job

def helloWorld(message, memory="2G", cores=2, disk="3G"):
    return "Hello, world!, here's a message: %s" % message

j = Job.wrapFn(helloWorld, "You did it!")

if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    print(Job.Runner.startToil(j, options))  # prints "Hello, world!, ..."
Toil - Portable
● Toil can be installed on any system with Python 2.7
● Built-in support for various batch systems, thanks in part to open-source community support:
  ○ Mesos
  ○ SGE (GridEngine)
  ○ UCSC's Parasol
  ○ single-machine mode
  ○ LSF
  ○ SLURM
● All batch systems can be used interchangeably with any of the job stores
Toil - Scalable
● Cloud-based job stores are designed to handle many concurrent workers
● Mesos has been shown to scale to 50K simulated nodes in Amazon Elastic Compute Cloud (EC2)
● Workers try to reduce interactions with the master by scheduling jobs locally
Toil - Robust to Failures
● Jobs are checkpointed upon completion, allowing resumption after job failure
● Toil's job store can resume from any combination of leader/worker failure
● Toil currently supports job stores for:
  ○ shared file systems
  ○ AWS (Amazon S3 + Amazon SimpleDB)
  ○ experimental support for Azure / Google Cloud
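Checkpoint-on-completion is what makes a failed run resumable: finished jobs are recorded in the job store, so a restart skips them. A file-backed sketch of the model (the JSON store and job-list shape are illustrative, not Toil's actual job-store format):

```python
import json
import os

def run_with_checkpoints(jobs, store_path):
    """Run (name, fn) jobs in order, recording each completion in a JSON
    job store so a crashed run can resume without redoing finished work.
    A sketch of the checkpoint-on-completion idea, not Toil's real API."""
    done = set()
    if os.path.exists(store_path):
        done = set(json.load(open(store_path)))  # recover prior progress
    executed = []
    for name, fn in jobs:
        if name in done:
            continue  # already checkpointed by a previous (failed) run
        fn()
        done.add(name)
        executed.append(name)
        with open(store_path, "w") as f:
            json.dump(sorted(done), f)  # checkpoint upon completion
    return executed
```

Rerunning the same job list against the same store executes only the jobs that had not yet completed, which is exactly the resume-after-failure behavior described above.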
Toil in Action: 20,000 RNA-seq Sample Recompute
[Chart: the Toil RNA-seq recompute — scalable and robust to failure.]
The Future
PCAWG showed the power of the cloud for large scientific analyses.
Current work on Redwood, Dockstore, and Toil formalizes the lessons learned and methodologies.
Our future work focuses on establishing standards from our previous work and applying them to future, larger-scale efforts.
Tool Registry API
● Formalizing the standard with the GA4GH through the Containers and
Workflows Task Team, implemented in Dockstore
● Basic read API with extended support for write and search
[Diagram: a tool couples a Docker image with its CWL/WDL descriptor; the API standard to share supports GET list, GET search, and POST register, accompanied by CWL/WDL conventions.]
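The read API with search and register maps onto a handful of HTTP operations. A request-builder sketch of those endpoints; the base URL, query parameter, and path shapes are illustrative rather than the exact GA4GH/Dockstore paths:

```python
def trs_requests(base="https://example.org/api/ga4gh/v1"):
    """Build (method, url) pairs for the registry operations on the slide.
    The base URL and paths are illustrative, not an exact endpoint map."""
    return {
        "list":     ("GET",  base + "/tools"),             # GET list
        "search":   ("GET",  base + "/tools?name=bwa"),    # GET search
        "register": ("POST", base + "/tools"),             # POST register
        # each tool pairs a Docker image with a CWL/WDL descriptor:
        "descriptor": ("GET", base + "/tools/{id}/versions/{v}/CWL/descriptor"),
    }

for op, (method, url) in sorted(trs_requests().items()):
    print(method, url)
```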
Emerging GA4GH API Standards
Further work of the Containers and Workflows Task Team: Workflow/Task Execution APIs
● POST new task
● GET task status
● GET task stderr/stdout
[Diagram: an API standard to execute — a WDL/CWL workflow or JSON task description plus a Docker image, run by a cloud-specific implementation that reports status, stderr/stdout, and output file(s).]
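From a client's point of view, the three operations above form a submit-then-poll loop. A minimal sketch with the transport stubbed out (the paths, status values, and in-memory API are illustrative, not a defined GA4GH wire format):

```python
import time

def run_task(api, task, poll_interval=0.0):
    """Submit a task and poll until it finishes, mirroring the
    POST new task / GET task status / GET stderr-stdout flow.
    `api` is any object with post/get; a real client would speak HTTP."""
    task_id = api.post("/tasks", task)            # POST new task
    while api.get("/tasks/%s/status" % task_id) not in ("COMPLETE", "ERROR"):
        time.sleep(poll_interval)                 # GET task status until done
    return api.get("/tasks/%s/stdout" % task_id)  # GET task stderr/stdout

class FakeAPI:
    """In-memory stand-in for a cloud-specific implementation."""
    def __init__(self):
        self.polls = 0
    def post(self, path, task):
        return "t1"
    def get(self, path):
        if path.endswith("/status"):
            self.polls += 1
            return "RUNNING" if self.polls < 3 else "COMPLETE"
        return "hello from the tool"

print(run_task(FakeAPI(), {"image": "docker://example/tool", "cmd": ["echo", "hi"]}))
```

Swapping `FakeAPI` for an HTTP client against any conforming implementation is the portability the standard is after: the same loop runs a task on any cloud.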
GA4GH Containers & Workflow Vision
Toil
Dockstore.org
Redwood
- GA4GH Containers & Workflows Task Team
- Broad Institute
- Cincinnati Children’s Hospital
- Curoverse
- European Bioinformatics Institute
- Intel
- Institute for Systems Biology
- Google, Microsoft, and Amazon
- Ontario Institute for Cancer Research
- Oregon Health and Science University
- Seven Bridges Genomics
- University of California Santa Cruz
Acknowledgements
● Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG
● Vincent Ferretti - Storage
● Denis Yuen - Dockstore
● Kyle Ellrott - Task API
● Peter Amstutz - Workflow API and Co-leader
● Jeff Gentry - Co-leader
● Hannes Schmidt, Frank Nothaft & the Toil Team
Software Availability
● Dockstore (tool/workflow sharing): https://dockstore.org/
● Toil (workflow execution): https://toil.readthedocs.io
● Redwood (file storage): https://github.com/icgc-dcc/dcc-storage
All three projects are open source and welcome your contributions
The AWS Perspective
Enabling science:
● Scalable compute resources only when needed
● Time to result was greatly reduced
● Cost of analysis was greatly reduced
● Data can be securely shared in place
● Global community access
Open data as a platform
[Diagram: data at rest (object storage) flows through data creation and data enrichment to sensemaking — via basic APIs, complex APIs, data catalogs, consumer applications, algorithmic policy, data-driven journalism, focused data dashboards, predictive modeling, and visualizations — lowering the cost of knowledge (efficiency).]
Open data as a platform — the same pipeline annotated with genomics file formats (BAM, gVCF, Wig, GFF).
Amazon S3 for science
[Diagram: Amazon S3 as a data lake, feeding a data science sandbox and visualization/reporting.]
Public datasets on AWS
To enable more innovation, AWS hosts a selection of datasets that anyone can access for free. Data in our public datasets is available for rapid access alongside our flexible, low-cost computing resources.
Earth Science: Landsat, NEXRAD, NASA NEX
Life Science: TCGA & ICGC, 1000 Genomes, Genome in a Bottle, Human Microbiome Project, 3000 Rice Genome
Internet Science: Common Crawl Corpus, Google Books Ngrams, Multimedia Commons
https://aws.amazon.com/public-datasets/
Serverless Science with AWS Lambda
AWS Lambda: a serverless, event-driven compute service.
Continuous Scaling: AWS Lambda automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.
No Servers to Manage: AWS Lambda automatically runs your code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.
Subsecond Metering: With AWS Lambda, you are charged for every 100 ms your code executes and for the number of times your code is triggered. You don't pay anything when your code isn't running.
Key Scenarios
● Data processing: stateless processing of discrete or streaming updates to your data store or message bus
● App backend development: execute server-side backend logic in a cross-platform fashion
● Control systems: customize responses and response workflows to state and data changes within AWS
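The data-processing scenario can be sketched as a Lambda handler reacting to new objects landing in S3. The event field names follow the real S3 notification shape; the per-object processing itself is a placeholder:

```python
def handler(event, context=None):
    """AWS Lambda entry point: process each S3 record in the triggering
    event independently (stateless, one invocation per trigger).
    The genomics 'processing' below is a placeholder."""
    results = []
    for record in event.get("Records", []):
        # These field paths match the S3 event notification structure.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder work: e.g. slice a BAM header, index a VCF,
        # or update a data catalog entry for the new object.
        results.append("processed s3://%s/%s" % (bucket, key))
    return results
```

Because each invocation is stateless and metered in 100 ms increments, a burst of uploaded genome files simply fans out into parallel invocations with no idle servers in between.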
Evented genome sequence processing: Nanocall*
* Matei David (Jared T. Simpson lab), doi:10.1093/bioinformatics/btw569
Data analysis using R, API Gateway, and Lambda
Station X's GenePool platform enables real-time biomarker analysis and management of clinical genomic data at scale.
They use API Gateway to execute Lambda functions that bundle a statistical program in R, calculating the significance of the association between a gene's expression level and patient survival for every gene in the genome (~20K).
This serverless architecture enabled them to scale dynamically without paying for idle compute, while leveraging robust error-handling capabilities.
It exemplifies how researchers can leverage PHI data de-identification to use more resources on the AWS platform.
“The patient data has been de-identified…API Gateway and Lambda only receive the event, time-to-event, and expression values [which] ensures that we are able to use Lambda and API Gateway...while still complying with the AWS BAA and HIPAA.”
GT-Scan2 – Scaling CRISPR-Cas9 searches
Thank you!
Remember to complete
your evaluations!