Upload: amazon-web-services

Post on 06-Jan-2017

Page 1: AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Lessons Learned from the PCAWG Project (LFS304)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Brian O’Connor

Technical Director - Analysis Core

UCSC Genomics Institute

Nov 28th, 2016

Large-scale, Cloud-based Analysis of Cancer Genomes

Lessons Learned from the PCAWG Project

Page 2

Overview

Past Present Future

Page 3

PCAWG: A Cloud-Based, Distributed Collaboration

● International Cancer Genome Consortium (ICGC)
● ~5,800 Whole Genomes
  – ~2,800 Cancer Donors
  – ~1,300 with RNASeq data
  – Goal is to consistently analyze data
● 8 sites storing and sharing data via GNOS
  – 300TB -> 900TB
● 14 Cloud (and HPC) environments
  – 3 Commercial, 7 OpenStack, 4 HPC
  – ~630 VMs, ~15K cores, ~60TB of RAM

Page 4

PCAWG Cloud Analysis “Core” Workflows

Page 5

PCAWG Lessons Learned

1. Commercial cloud policies

2. Portable tools

3. Failure-tolerant, distributed execution infrastructure

4. Commercial cloud costs

Page 6

Lesson 1: Commercial Cloud Policies

• PCAWG analysis showed the power of clouds

• Key policy changes enabled commercial cloud usage

• NIH updated dbGaP cloud policy - March 2015

• ICGC DACO updated ICGC cloud policy - May 2015

• Partnerships with commercial cloud entities

• Amazon Public Datasets Program

• Seven Bridges

• DNAnexus

Page 7

PCAWG Cloud Analysis Architecture

[Architecture diagram. Labels: Sequencing Projects; GNOS; Metadata Index; Cloud Orchestrator (x2); Work Orders; Academic Compute Centers; Compute; AWS Cloud; Spot Instances]

Page 8

PCAWG Analysis Architecture & AWS

[Architecture diagram. Labels: Sequencing Projects; GNOS; Metadata Index; Cloud Orchestrator (x2); Work Orders; Academic Compute Centers; Compute; AWS Cloud; Spot Instances; Amazon S3; DNAnexus; Seven Bridges]

Represents a major shift: ICGC data now redistributed within Amazon’s Cloud

Page 9

Lesson 2: Portable Tools

Containerized workflows for portability between sites

Core Workflows

Alignment: BWA-Mem

Variant Calling: Broad, DKFZ/EMBL, and Sanger

https://github.com/ICGC-TCGA-PanCancer

Page 10

Lesson 3: Fault-Tolerant Cloud Execution

Architecture 1.0
● cloud-based clusters
● gluster distributed filesystem
● scheduling per cloud

Architecture 2.0
● single-node workers
● no distributed filesystem
● ansible for setup

Architecture 3.0
● a complete rethink

Page 11

Lesson 4: Cloud Costs

Workflow      Hardware (cores / machine)   Runtimes                     Cost on AWS
BWA           8 cores (16 GB RAM)          5 days (± 5) per specimen    $11.16
Sanger        8 cores (32 GB RAM)          4 days (± 3) per donor       $17.22
DKFZ / EMBL   16 cores (64 GB RAM)         2 days (± 0.6) per donor     $12.80
Broad         32 cores (256 GB RAM)        2.6 days per donor           $20.48

Workflow      Storage required per donor
BWA           240 GB
Sanger        4 GB
DKFZ / EMBL   5 GB
Total         249 GB

Data analysis: Create a cloud commons, Nature 2015

$62/donor
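The $62/donor figure looks like the rounded sum of the four per-workflow AWS costs listed above, and the 249 GB total like the sum of the three storage rows. The slide does not say so explicitly, so treat that reading as an assumption (note also that the BWA cost is quoted per specimen, not per donor). A minimal Python check of the arithmetic:

```python
# Per-workflow AWS costs in USD, copied from the slide.
costs_usd = {"BWA": 11.16, "Sanger": 17.22, "DKFZ / EMBL": 12.80, "Broad": 20.48}

# Per-donor storage in GB, copied from the slide.
storage_gb = {"BWA": 240, "Sanger": 4, "DKFZ / EMBL": 5}

# Round to cents to avoid floating-point noise in the sum.
total_cost_usd = round(sum(costs_usd.values()), 2)
total_storage_gb = sum(storage_gb.values())

print("cost per donor: $%.2f (about $%d)" % (total_cost_usd, round(total_cost_usd)))
print("storage per donor: %d GB" % total_storage_gb)
```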

Page 12

ICGC PCAWG Legacy

Publications soon

AWS Public Datasets Program

~1,400 PCAWG Donors
- BAM (~70% of ICGC donors)
- VCF from all three pipelines
- more ICGC data uploaded regularly

https://dcc.icgc.org/icgc-in-the-cloud

Page 13

The Present

Goal: to formalize lessons from PCAWG into reusable tools

Dockstore: Tool/Workflow Sharing
Toil: Workflow Execution
Redwood: File Storage

Page 14

Ecosystem of Tools

Page 15

Redwood - Scalable Storage

Key Features: based on ICGC Storage Service, supports FUSE, BAM Slicing, and Highly Parallel access, typically WORM usage pattern

[Diagram. Labels: client; Authentication & Storage Services; Amazon EC2 instance; Amazon S3; AWS cloud]

Page 16

Redwood - Storage System Performance

The Redwood Storage System (and underlying S3) provided a stable and secure mechanism to store and use genomic data

Example run of ~100 simultaneous downloads saw ~45-100MB/s

Page 17

Dockstore.org - Sharing Tools & Workflows

Dockstore:
● Share tools and workflows
● Package tools with Docker, Describe with CWL/WDL
● PCAWG goal, provide our tools via Dockstore

http://dockstore.org and https://github.com/ga4gh

Page 18

Dockstore Architecture

Built on DockerHub/Quay.io and GitHub/BitBucket

Adds metadata to address shortcomings for bioinformatics workflows

CWL/WDL is the natural choice for Descriptor

Page 19

Dockstore 1.0 Release

Highlighted New Features

Support for 1.0.0 GA4GH Tool Registry API

Support for displaying, sharing, and natively launching CWL 1.0 & WDL tools

Preliminary support for CWL/WDL workflows

Full list of updates since 0.4-beta.4 in https://github.com/ga4gh/dockstore/releases

New Content

ICGC PanCancer Analysis of Whole Genomes (PCAWG) tools
• BWA-mem, Sanger, Delly, DKFZ

Page 20

Dockstore Tour

[Screenshots: Search; Main Page; Tool Management]

Page 21
Page 22

Running Dockstore Tools

Execution with the Dockstore Command Line Interface (CLI)

Simple Dockstore Command Line
1. provision input files
2. pull Docker images
3. execute tool with inputs using CWL
4. provision output files somewhere

The goal was something simple, but we want the same process accessible via other execution systems!

Seven Bridges, Curoverse, Galaxy, Consonance, etc.
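The four stages named on this slide (provision inputs, pull the image, execute, provision outputs) can be sketched as one small function. This is an illustrative outline only, not the Dockstore CLI: `launch`, `run_tool`, and `pull_image` are hypothetical names, provisioning is limited to local file copies, and the container step is passed in as a callable so that a real runner (for example one that shells out to `docker pull` and a CWL runner) could be substituted.

```python
import shutil
import tempfile
from pathlib import Path


def launch(inputs, image, run_tool, output_dir, pull_image=lambda image: None):
    """Hypothetical four-stage launcher mirroring the slide; not the Dockstore CLI."""
    workdir = Path(tempfile.mkdtemp(prefix="launch-"))

    # 1. Provision input files (local copies only in this sketch).
    staged = {}
    for name, src in inputs.items():
        dest = workdir / Path(src).name
        shutil.copy(src, dest)
        staged[name] = dest

    # 2. Pull the Docker image (a no-op unless a puller is supplied).
    pull_image(image)

    # 3. Execute the tool with the staged inputs; it returns the files it produced.
    produced = run_tool(image, staged, workdir)

    # 4. Provision output files to their destination.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(p, out / Path(p).name)) for p in produced]
```

Keeping the execution step behind a plain callable reflects the slide's stated goal: the same four-stage process could be driven by another execution system without changing the provisioning code.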

Page 23
Page 24
Page 25
Page 26
Page 27

Coming Soon to Dockstore

Workflow DAG view

Testing PCAWG

Test Data

“Launch With…”
• Consonance
• Commercial partner(s)

Signed Dockers

Cross site indexing

See Roadmap: https://goo.gl/4D9a8F

Page 28

Toil - Efficient Compute on AWS

● A system for large-scale, efficient work on AWS
● Toil recently completed a 30K core, 20K sample recompute
● Per job granularity allows for better efficiency and robustness

Page 29

Toil - Dynamic DAGs

The job graph in Toil can be either statically or dynamically declared.
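To make the static versus dynamic distinction concrete, here is a small plain-Python illustration. It deliberately does not use the Toil API: `run` and `scatter_over` are hypothetical helpers, and a "job" is just a `(name, function)` pair whose function may return further jobs when it runs.

```python
from collections import deque


def run(root):
    """Run jobs breadth-first; a job may return new child jobs when it executes."""
    order, queue = [], deque([root])
    while queue:
        name, fn = queue.popleft()
        order.append(name)
        queue.extend(fn() or [])
    return order


# Static declaration: the whole graph is written down before anything runs.
static_root = ("align", lambda: [("call_variants", lambda: None)])


# Dynamic declaration: the children are only known once the parent job runs,
# because they depend on what `discover` finds at that moment.
def scatter_over(discover):
    def job():
        return [("process_" + item, lambda: None) for item in discover()]
    return job


dynamic_root = ("scatter", scatter_over(lambda: ["chr1", "chr2"]))

print(run(static_root))
print(run(dynamic_root))
```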

Page 30

Toil - Spark & ADAM Integration

[Diagram: Amazon EC2 Instances, one master and four slaves]

Page 31

Toil - Accessible to New Developers

User scripts are written in pure Python

from toil.job import Job

# Toil reads the memory/cores/disk keywords as the job's resource requirements.
def helloWorld(message, memory="2G", cores=2, disk="3G"):
    return "Hello, world!, here's a message: %s" % message

# Wrap the plain function as a Toil job.
j = Job.wrapFn(helloWorld, "You did it!")

if __name__ == "__main__":
    # Standard Toil command-line options (job store location, batch system, ...).
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    print(Job.Runner.startToil(j, options))  # Prints Hello, world!, ...

Page 32

Toil - Portable

● Toil can be installed on any system with Python 2.7
● Built-in support for various batch systems - a few thanks in part to open-source community support!
  ○ Mesos
  ○ SGE (GridEngine)
  ○ UCSC’s Parasol
  ○ Single Machine Mode
  ○ LSF
  ○ SLURM
● All batch systems can be interchangeably used with any of the job stores

Page 33

Toil - Scalable

● Cloud-based job stores are designed to handle many concurrent workers
● Mesos has been shown to scale to 50k simulated nodes in Amazon Elastic Compute Cloud (EC2)
● Workers try to reduce interactions with the master by scheduling jobs locally

Page 34

Toil - Robust to Failures

● Jobs are checkpointed upon completion, allowing for resumability after job failure
● Toil’s jobstore can resume from any combination of leader/worker failure
● Toil currently supports job stores for:
  ○ Shared file systems
  ○ AWS (Amazon S3 + Amazon SimpleDB)
  ○ Experimental support for Azure / Google Cloud
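The checkpoint-on-completion idea can be illustrated with a toy job store. This is a sketch of the general technique under simple assumptions (a single JSON file standing in for the job store, jobs run sequentially), not Toil's implementation; `run_with_checkpoints` is a hypothetical name.

```python
import json
import os
import tempfile


def run_with_checkpoints(jobs, store_path):
    """Run (name, fn) jobs in order, recording each completed job in a JSON file."""
    done = {}
    if os.path.exists(store_path):
        with open(store_path) as fh:
            done = json.load(fh)

    executed = []
    for name, fn in jobs:
        if name in done:
            continue  # finished in an earlier attempt, so skip it on resume
        done[name] = fn()
        executed.append(name)
        # Checkpoint on completion: write a temp file, then rename it into place
        # so a crash mid-write never leaves a half-written store behind.
        tmp_path = store_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp_path, store_path)
    return executed, done
```

If a job raises, everything completed before it is already recorded, so rerunning the same workflow against the same store only executes the jobs that have not yet finished.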

Page 35

Toil in Action 20,000 RNA-seq Sample Recompute

Page 36

Scalable and robust to failure

Toil RNA-seq Recompute

Page 37

The Future

PCAWG showed the power of cloud for large scientific analysis

Current work with Redwood, Dockstore, and Toil formalized lessons learned and methodologies

Our future work focuses on establishing standards from our previous work and applying these to future larger-scale efforts

Page 38

Emerging GA4GH API Standards

Tool Registry API

● Formalizing the standard with the GA4GH through the Containers and Workflows Task Team, implemented in Dockstore
● Basic read API with extended support for write and search

[Diagram. Labels: Tool(s); Docker; descriptor; CWL/WDL Conventions; API Standard to Share: GET list, GET search, POST register]
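The three registry operations named on this slide (GET list, GET search, POST register) can be sketched with an in-memory stand-in. The class, method names, and record fields below are hypothetical and chosen for illustration; they are not the GA4GH Tool Registry schema or Dockstore's implementation.

```python
from dataclasses import dataclass, asdict


@dataclass
class Tool:
    # Field names are illustrative assumptions, not the GA4GH schema.
    id: str
    docker_image: str
    descriptor_type: str  # "CWL" or "WDL"


class ToolRegistry:
    """In-memory stand-in for a tool registry service."""

    def __init__(self):
        self._tools = {}

    def register(self, tool):
        """POST register: add or replace a tool record."""
        self._tools[tool.id] = tool
        return asdict(tool)

    def list_tools(self):
        """GET list: every registered tool."""
        return [asdict(t) for t in self._tools.values()]

    def search(self, text):
        """GET search: case-insensitive substring match on the tool id."""
        needle = text.lower()
        return [asdict(t) for t in self._tools.values() if needle in t.id.lower()]
```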

Page 39

Emerging GA4GH API Standards

Further work of the Containers and Workflows Task Team

Workflow/Task Execution APIs

[Diagram. Labels: Tools (Docker, JSON) or WDL/CWL Workflow; API Standard to Execute: POST new task, GET task status, GET task stderr/stdout; Cloud-specific Implementation; outputs: stderr, stdout, file(s), status]
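The execution operations named on this slide (POST new task, GET task status, GET task stderr/stdout) can be sketched as a tiny in-memory service. This is a hypothetical outline, not the GA4GH API: the class, method names, and status strings are assumptions, and the task is run as a local command with `subprocess` where the slide shows a Docker image plus a JSON description.

```python
import subprocess
import sys
import uuid


class TaskService:
    """Toy task-execution service; runs each task synchronously on submission."""

    def __init__(self):
        self._tasks = {}

    def post_task(self, command):
        """POST new task: run the command and return a task id."""
        task_id = str(uuid.uuid4())
        proc = subprocess.run(command, capture_output=True, text=True)
        self._tasks[task_id] = {
            # Status names are placeholders chosen for this sketch.
            "status": "COMPLETE" if proc.returncode == 0 else "ERROR",
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
        return task_id

    def get_status(self, task_id):
        """GET task status."""
        return self._tasks[task_id]["status"]

    def get_logs(self, task_id):
        """GET task stderr/stdout, returned as (stdout, stderr)."""
        task = self._tasks[task_id]
        return task["stdout"], task["stderr"]
```

For example, `TaskService().post_task([sys.executable, "-c", "print('hello')"])` returns an id whose status and logs can then be queried.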

Page 40

GA4GH Containers & Workflow Vision

Toil

Dockstore.org

Redwood

Page 41

Acknowledgements

- GA4GH Containers & Workflows Task Team
- Broad Institute
- Cincinnati Children’s Hospital
- Curoverse
- European Bioinformatics Institute
- Intel
- Institute for Systems Biology
- Google, Microsoft, and Amazon
- Ontario Institute for Cancer Research
- Oregon Health and Science University
- Seven Bridges Genomics
- University of California Santa Cruz

● Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG
● Vincent Ferretti - Storage
● Denis Yuen - Dockstore
● Kyle Ellrott - Task API
● Peter Amstutz - Workflow API and Co-leader
● Jeff Gentry - Co-leader
● Hannes Schmidt, Frank Nothaft & the Toil Team

Page 42

Software Availability

Dockstore: Tool/Workflow Sharing, https://dockstore.org/
Toil: Workflow Execution, https://toil.readthedocs.io
Redwood: File Storage, https://github.com/icgc-dcc/dcc-storage

All three projects are open source and welcome your contributions

Page 43

The AWS Perspective

Page 44

Enabling science

Scalable compute resource only when needed

Time to result was greatly reduced

Cost of analysis was greatly reduced

Data is able to be securely shared in place

Global community access

Page 45

Open data as a platform

[Diagram. Stages: Data Creation; Data Enrichment; Sensemaking. Labels: Data at Rest (Object storage); Basic APIs; Complex APIs; Data Catalogs; Focused data dashboards; Visualizations; Predictive modeling; Consumer applications; Algorithmic policy; Data-driven journalism. Outcome: Lower cost of knowledge (Efficiency)]

Page 46

Open data as a platform

[Diagram. Stages: Data Creation; Data Enrichment; Sensemaking. Labels: Data at Rest (Object storage); Basic APIs; Complex APIs; Data Catalogs; Focused data dashboards; Visualizations; Predictive modeling; Consumer applications; Algorithmic policy; Data-driven journalism. Outcome: Lower cost of knowledge (Efficiency). Genomics overlay: BAM; gVCF; Wig, GFF; the remaining positions are marked "?"]

Page 47

Amazon S3 for science

Amazon S3

Data Lake

Data Science Sandbox

Visualization /

Reporting

Page 48

Public datasets on AWS

To enable more innovation, AWS hosts a selection of datasets that anyone can access for free. Data in our public datasets is available for rapid access to our flexible and low-cost computing resources.

Earth Science
• Landsat
• NEXRAD
• NASA NEX

Life Science
• TCGA & ICGC
• 1000 Genomes
• Genome in a Bottle
• Human Microbiome Project
• 3000 Rice Genome

Internet Science
• Common Crawl Corpus
• Google Books Ngrams
• Multimedia Commons

https://aws.amazon.com/public-datasets/

Page 49

Serverless Science with AWS Lambda

Page 50

AWS Lambda

Serverless, event-driven compute service

Continuous Scaling: AWS Lambda automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.

No Servers to Manage: AWS Lambda automatically runs your code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.

Subsecond Metering: With AWS Lambda, you are charged for every 100ms your code executes and the number of times your code is triggered. You don't pay anything when your code isn't running.
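The metering rule described on this slide (a charge per 100 ms of execution plus a charge per trigger) can be written as a small calculation. This is a sketch under stated assumptions: partial increments are assumed to round up, both prices are caller-supplied placeholders and not real AWS rates, and the dependence of real pricing on configured memory is ignored. The function names are hypothetical.

```python
import math


def billed_increments(duration_ms, increment_ms=100):
    """Billing increments for one invocation, rounding a partial increment up."""
    return math.ceil(duration_ms / increment_ms)


def lambda_cost(durations_ms, price_per_increment, price_per_request, increment_ms=100):
    """Total charge = billed increments * unit price + invocations * request price."""
    increments = sum(billed_increments(d, increment_ms) for d in durations_ms)
    return increments * price_per_increment + len(durations_ms) * price_per_request
```

With no invocations the total is zero, which matches the slide's point that nothing is charged while code is not running.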

Page 51

Key Scenarios

Data processing: Stateless processing of discrete or streaming updates to your data-store or message bus

Control systems: Customize responses and response workflows to state and data changes within AWS

App backend development: Execute server side backend logic in a cross platform fashion

Page 52

Evented genome sequence processing

Nanocall*

* Matei David (Jared T. Simpson lab), doi:10.1093/bioinformatics/btw569

Page 53

Data analysis using R, API Gateway, and Lambda

Station X’s GenePool platform enables real-time biomarker analysis and management of clinical genomic data at scale.

Uses API Gateway to execute Lambda functions that bundle a statistical function in R, calculating the significance of the association of a gene’s expression level with patient survival for every gene in the genome (~20K)

This serverless architecture enabled them to scale dynamically without paying for idle compute, while leveraging robust error handling capabilities

Exemplifies how researchers can leverage PHI data de-identification to use more resources on the AWS platform

The patient data has been de-identified…API Gateway and Lambda only receive the event, time-to-event, and expression values [which] ensures that we are able to use Lambda and API Gateway...while still complying with the AWS BAA and HIPAA.

Page 54

GT-Scan2 – Scaling CRISPR-Cas9 searches

Page 55

Thank you!

Page 56

Remember to complete your evaluations!