2016 aws life sciences days | boston, ma – may 17, 2016

Post on 18-Jan-2017


AWS Life Sciences Days, Boston, MA

Mark Johnston, Director of Global Business Development,

Healthcare and Life Sciences

Agenda

1. 01:00 PM – 01:30 PM: Introduction and Opening Remarks

2. 01:30 PM – 02:30 PM: Best practices when building a validated system on AWS for the Life Sciences

02:30 PM – 02:45 PM: Break

3. 02:45 PM – 03:30 PM: Repeatable Science at Scale: Using Common Workflow Language and Docker for science on AWS

4. 03:30 PM – 04:15 PM: Cognizant: Agile Cloud Native Development with AWS

5. 04:15 PM – 05:00 PM: Scalable Genomics Analysis in the Cloud with ADAM

6. 05:00 PM – 06:30 PM: Closing Remarks, Q&A and Networking

COMPETING ON KNOWLEDGE AND INSIGHTS: AN ACADEMIC MEDICAL CENTER’S PERSPECTIVE

PARTNERS HEALTHCARE, MAY 17, 2016

Trung Do, Executive Director, Business Development – Innovation, Partners HealthCare

Operating Revenue: $9.0 Billion

Research Revenue: $1.5+ Billion

Inpatient Discharges: 166,700

Licensed Beds: 4,000

Lives Under Management: 750,000

Physicians: 6,500

Employees (FTEs): 68,000

Clinical Trials: 1,200

Clinical & Research Fellows and Residents: 4,300

Faculty Appointed at Harvard Medical School

Brigham and Women’s Hospital: Founded 1832; Ranked #6 in US

Massachusetts General Hospital: Founded 1811; Ranked #1 in US

PARTNERS HEALTHCARE

Copyright 2016 – Partners HealthCare Incorporated – All Rights Reserved


THE INNOVATION CHALLENGE

INNOVATING AT SCALE

INNOVATION AT PARTNERS HEALTHCARE

• Innovation is a core Partners activity that supports and sustains our ability to constantly improve the care we deliver to our patients to enrich their lives and well-being

• The value of an integrated health system, like Partners HealthCare, is its ability to innovate at scale


CHALLENGES TO INNOVATING AT SCALE

• Many small advances occur continuously throughout leading academic medical centers

• Difficult to ensure these innovations are broadly adopted by clinicians and accessible to all patients

• There are many impediments

• Time

• Organizational structures

• Physical distance

Communication is challenging

ERA OF DATA SCIENCE

Standard Reports: “What happened?”

Ad hoc reports: “How many, how often, where?”

Query/drill down: “What exactly is the problem?”

Alerts: “What actions are needed?”

Statistical Analysis: “Why is this happening?”

Forecasting / extrapolation: “What if these trends continue?”

Predictive Modeling: “What will happen next?”

Optimization: “What is the best that can happen?”

Chart bands: Descriptive Analytics, Predictive Analytics

Chart axes: Competitive Advantage (vertical), Sophistication of Intelligence (horizontal)

Source: Competing on Analytics: The New Science of Winning (Davenport / Harris)


NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS

Standard Reports: EHR ecosystem

Ad hoc reports: Quality Data Warehouse

Query/drill down: Quality Data Warehouse

Alerts: CDS built on EHR

Statistical Analysis: Observational Studies, Clinical Trials

Chart labels: Clinical, Research

Chart axes: Competitive Advantage (vertical), Sophistication of Intelligence (horizontal)

Massive bifurcation causes significant inefficiencies

Opportunity! Establish continuous learning environment based on predictive and population based analytics

NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS

Standard Reports: EHR ecosystem

Ad hoc reports: Quality Data Warehouse

Query/drill down: Quality Data Warehouse

Alerts: CDS built on EHR

Statistical Analysis: Observational Studies, Clinical Trials

Forecasting / extrapolation: Trends – pharmacovigilance surveillance

Predictive Modeling: Machine Learning - CDS

Optimization: Patient Specific Predictive CDS

Continuous Learning Healthcare System

Chart axes: Competitive Advantage (vertical), Sophistication of Intelligence (horizontal)

USING BIG DATA TO IMPROVE HEALTHCARE

Diagram labels: Public Data Sets; Database Schemas; ML Cluster Analysis; Image Data

To Help Predict Outcomes and Support Medical Decisions

To Learn About what Differences in People are Important for Predicting Disease

To Understand the Disease Traits Caused by a Gene Variant

To Help Interpret Features in Medical Images and Tissues

BIG DATA SOURCES…

• EHR / Hospital Data

– Research Patient Data Registry (RPDR)

• 7 million patients

• 2 billion diagnoses, medications, laboratories and clinical findings

• 13 billion images

• Partners Biobank: 40,000 consented patient samples linked to EMR (target 100,000 by 2018)


BIG DATA COMMONS: Creating an Enterprise Wide Query And Analysis System For Big Data

Big Data Commons: Integrates disparate islands of patient data (clinical and research data) onto a common Big Data Platform

Enhanced Query Tool

Partners Biobank Portal – Genomics Data, Samples

Public Health Data

Imaging Data

Notes / Text Repository (e.g. physician notes)

UNPREDICTABLE QUALITY USING RAW ICD9/10 CODES

Phenotype | Count with ICD-9/ICD-10 Code | Count (90% positive predictive value) | Count with Genotype Data

Asthma | 7618 | 3322 | 805

Bipolar Disorder | 1754 | 219 | 84

Breast Cancer | 2101 | 1711 | 378

Congestive Heart Failure | 10160 | 4597 | 1859

Coronary Artery Disease | 1435 | 803 | 236

Crohn’s Disease | 5177 | 700 | 350

Depression | 11154 | 4273 | 1074

Epilepsy | 2351 | 1211 | 381

Gout | 2464 | 1828 | 566

Hypertension | 20788 | 16995 | 4553

Multiple Sclerosis | 602 | 320 | 58

Obesity | 10245 | 12179 | 3191

Rheumatoid Arthritis | 3475 | 878 | 261

Schizophrenia | 509 | 83 | 14

Type 1 Diabetes | 2196 | 232 | 61

Type 2 Diabetes | 7123 | 4385 | 1268

Ulcerative Colitis | 1359 | 624 | 157

May 4, 2016, n ~ 40,000
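The "Count (90% positive predictive value)" column suggests that patients are counted only above a probability cutoff at which at least 90% of flagged patients are true cases. The slides do not show how that cutoff is chosen, so the following is only a sketch of one plausible approach. The function name and the toy numbers are hypothetical, and it assumes a set of chart-reviewed patients with model probabilities is available.

```python
def threshold_for_ppv(probs, labels, target_ppv=0.90):
    """Return the lowest cutoff whose positive predictive value meets the target.

    probs  -- model probabilities for chart-reviewed (gold standard) patients
    labels -- 1 if chart review confirmed the phenotype, else 0
    Returns None when no cutoff reaches the target.
    """
    best = None
    # Try each observed probability as a cutoff, from strictest to loosest.
    for cutoff in sorted(set(probs), reverse=True):
        flagged = [y for p, y in zip(probs, labels) if p >= cutoff]
        ppv = sum(flagged) / len(flagged)  # true positives / all flagged
        if ppv >= target_ppv:
            best = cutoff  # remember the loosest cutoff seen so far that qualifies
    return best


# Hypothetical reviewed patients: probability from the model, label from chart review.
probs = [0.95, 0.90, 0.80, 0.70, 0.40, 0.20]
labels = [1, 1, 1, 0, 1, 0]
cutoff = threshold_for_ppv(probs, labels)
print(cutoff)
```

Counting every patient in the full population whose probability is at or above the returned cutoff would then give a figure comparable to the middle column of the table.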

CREATING QUALITY DATA WITH SUPERVISED MACHINE LEARNING

1. Create a gold standard training set.

2. Create a comprehensive list of features (concepts/variables) that describe the phenotype of interest.

3. Develop the classification algorithm. Using the data analysis file and the training set from step 1, assess the frequency of each variable. Remove variables with low prevalence. Apply adaptive LASSO penalized logistic regression to identify highly predictive variables for the algorithm.

4. Apply the algorithm to all subjects in the superset and assign each subject a probability of having the phenotype.

95% Specificity
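Steps 3 and 4 can be illustrated with a small, dependency-free sketch. It is not the presenters' implementation: the slides name adaptive LASSO penalized logistic regression but not a solver, so this example assumes proximal gradient descent, an unpenalized pilot fit for the adaptive weights, and a toy data set. All function names, the prevalence threshold, and the penalty strength are hypothetical.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp never overflows.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def drop_rare_features(rows, min_prevalence):
    """Return the column indices whose share of non-zero values is high enough."""
    n = len(rows)
    return [j for j in range(len(rows[0]))
            if sum(1 for row in rows if row[j]) / n >= min_prevalence]


def fit_logistic(rows, labels, lam=0.0, penalty_weights=None, steps=2000, lr=0.5):
    """Logistic regression with a weighted L1 penalty, via proximal gradient descent."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    penalty_weights = penalty_weights or [1.0] * d
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        bias -= lr * grad_b  # the intercept is not penalized
        for j in range(d):
            moved = weights[j] - lr * grad_w[j]
            shrink = lr * lam * penalty_weights[j]
            # Soft-thresholding: shrink toward zero and snap small values to zero.
            weights[j] = math.copysign(max(abs(moved) - shrink, 0.0), moved)
    return weights, bias


def fit_adaptive_lasso(rows, labels, lam=0.05):
    """Pilot fit without penalty, then refit with penalties scaled by 1/|pilot|."""
    pilot, _ = fit_logistic(rows, labels)
    penalty_weights = [1.0 / (abs(c) + 1e-6) for c in pilot]
    return fit_logistic(rows, labels, lam=lam, penalty_weights=penalty_weights)


def phenotype_probability(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy gold standard set. Columns are [signal, noise, rare]; labels stand in for chart review.
rows = ([[1, 0, 0]] * 4 + [[1, 1, 0]] * 4 + [[1, 0, 0], [1, 1, 0]]
        + [[0, 0, 0]] * 4 + [[0, 1, 0]] * 4 + [[0, 0, 0], [0, 1, 1]])
labels = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2

kept = drop_rare_features(rows, min_prevalence=0.10)  # step 3: drop low-prevalence variables
reduced = [[row[j] for j in kept] for row in rows]
weights, bias = fit_adaptive_lasso(reduced, labels)   # step 3: keep predictive variables
scores = [phenotype_probability(weights, bias, x) for x in reduced]  # step 4
print(kept, weights, scores[0], scores[-1])
```

In this toy data the "noise" column carries no information about the label, so the adaptive penalty drives its coefficient to zero while the "signal" column keeps a positive weight. Real feature sets would be much wider and would typically use a tested statistics library rather than a hand-written solver.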


PREDICTABLE QUALITY - COMPUTED PHENOTYPES


DEFINITIVE DISEASE STATES ASSOCIATED WITH GENOMIC DATA


CLINICAL APPLICATIONS: LEVERAGING MACHINE LEARNING IN HEALTHCARE

CLARIFAI.COM

CLINICAL DATA SCIENCE: SYMPTOMATIC DATA DISCOVERY

ASYMPTOMATIC SCREENING: BREAST CANCER, COLON CANCER, LUNG CANCER

SYMPTOMATIC DATA: CLINICAL DATA, GENETIC DATA, CONSUMER DATA

CLINICAL DIAGNOSTIC SERVICES

DIAGNOSTIC CLINICAL DECISION SUPPORT

Machine Learning: Can we find disease before symptoms appear clinically?

MGH CLINICAL DATA SCIENCE CENTER

CLINICAL APPLICATIONS

SPECIALTY: ONCOLOGY, NEUROSCIENCE, CARDIAC

DIAGNOSTICS: RADIOLOGY, PATHOLOGY, CARDIOLOGY

THERAPEUTICS: PHARMACOLOGIC, PROCEDURAL, SURGERY

POPULATION HEALTH: DISEASE MANAGEMENT, WELLNESS PLANNING, PREVENTION

CLINICAL DATA

IMAGING: MRI, CT, PET, US, XRAY, NUC

LABORATORY: BIOCHEMISTRY, MICROBIOLOGY, IMMUNOLOGY, HEMATOLOGY

GENETICS: GENOME, ARRAYS, PROBES

EHR: CLINICAL, OTHER

PERSONAL: HOME MONITOR, WEARABLES, DIRECT TO CONSUMER

Other diagram labels: DATA, ALL

MGH CLINICAL DATA SCIENCE

INITIAL FOCUS: DIAGNOSTIC RADIOLOGY APPLICATIONS

CLINICAL APPLICATIONS

DIAGNOSTICS: RADIOLOGY

MGHCADS

CLINICAL DATA

IMAGING: MRI, CT, PET, US, XRAY, NUC

LABORATORY: BIOCHEMISTRY, MICROBIOLOGY, IMMUNOLOGY, HEMATOLOGY

GENETICS: GENOME, ARRAYS, PROBES

EHR: CLINICAL, OTHER

PERSONAL: HOME MONITOR, WEARABLES, DIRECT TO CONSUMER

Other diagram labels: DATA, ALL

MGH CLINICAL DATA SCIENCE CENTER: APPLYING NEURAL NETWORKS

EHR

Interpretation

Decision Support Systems

MGH CLINICAL DATA SCIENCE CENTER: MACHINE LEARNING RESOURCES

EHR

Interpretation

Decision Support Systems

• Formulate the correct clinical questions

• Acquire (retrieve) the necessary volume of clinical data

• Accurately label the clinical data such that it answers the clinical questions

• Apply the appropriate Data Science techniques (training and validation)

• Iterate until the desired (or best) accuracy results

• Clinically validate the resulting ‘machine learning appliance’

• Implement into clinical practice

• License/Commercialize


Best practices when building a validated system

on AWS for the Life Sciences

Chris McCurdy

Healthcare and Life Sciences Specialist AWS

Matt Szenher

Principal Architect at Medidata

Agenda

• DevOps to DevSecOps Primer

• Observed industry cloud techniques with AWS

• Tools, processes and frameworks to assist

• Medidata Audit Ingestion

DevOps Level Set

Development

Quality Assurance

Operations

DevOps

DevOps Toolchain

Plan: Define and plan; business value, application requirements and metrics

Create: Building, coding and configuration

Verify: Ensuring quality; acceptance, regression testing

Preprod: Approval/certification, triggered releases, release staging and holding

Release: Release coordination, promotion, scheduling, rollback and recovery

Configure: Infrastructure and application

Monitor: Process, application and infrastructure

DevOps Principles

• Collaborate with all stakeholders

• Codify everything

• Test everything

• Automate everything

• Measure and monitor everything

• Deliver business value with continual feedback

Manual Hacking

Drivers for DevSecOps

Embedding Security into DevOps was not successful because…

• Compliance checklists didn’t take us far before we stopped scaling…

• We couldn’t keep up with deployments without automation…

• Standard Security Operations did not work…

• And we needed far more data than we expected to help the business make decisions…

DevSecOps: Security as Code

Establishing these principles…

• Customer focused mindset

• Scale, scale, scale

• Objective criteria

• Proactive hunting

• Continuous detection and response

DevOps Toolchain

Plan: Define and plan; business value, application requirements, security, compliance and metrics

Create: Build, code and configuration

Verify: Ensuring quality; acceptance, regression, security and compliance testing

Preprod: Approval/certification, triggered releases, release staging and holding

Release: Release coordination, promotion, scheduling, rollback and recovery

Configure: Infrastructure and application

Monitor: Process, application, infrastructure, security and compliance

Here’s some infrastructure as code (AWS CloudFormation template)

"myVPC": {
  "Type": "AWS::EC2::VPC",
  "Properties": {
    "CidrBlock": {"Ref": "myVPCCIDRRange"},
    "EnableDnsSupport": false,
    "EnableDnsHostnames": false,
    "InstanceTenancy": "default"
  }
},
"myInstance": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": {
      "Fn::FindInMap": ["AWSRegionToAMI", {"Ref": "AWS::Region"}, "64"]
    },
    "SecurityGroupIds": [{"Fn::GetAtt": ["myVPC", "DefaultSecurityGroup"]}],
    "SubnetId": {"Ref": "mySubnet"}
  }
}
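Because the template is plain data, it can be tested like any other code before it is deployed. The sketch below is an illustration rather than an AWS tool: it walks a template and reports every "Ref" that points at a name not declared under Parameters or Resources. The helper names are hypothetical, and the Parameters entry for myVPCCIDRRange is an assumption, since the slide shows only a fragment of the Resources section.

```python
def find_refs(node):
    """Yield every logical name used in a {"Ref": name} anywhere in the template."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def unresolved_refs(template):
    """Return Ref targets that are neither declared nor AWS pseudo parameters."""
    declared = set(template.get("Parameters", {})) | set(template.get("Resources", {}))
    return sorted(name for name in set(find_refs(template))
                  if name not in declared and not name.startswith("AWS::"))


template = {
    # Assumed for this sketch; the slide does not show the Parameters section.
    "Parameters": {"myVPCCIDRRange": {"Type": "String"}},
    "Resources": {
        "myVPC": {
            "Type": "AWS::EC2::VPC",
            "Properties": {
                "CidrBlock": {"Ref": "myVPCCIDRRange"},
                "EnableDnsSupport": False,
                "EnableDnsHostnames": False,
                "InstanceTenancy": "default",
            },
        },
        "myInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Fn::FindInMap": ["AWSRegionToAMI", {"Ref": "AWS::Region"}, "64"]},
                "SecurityGroupIds": [{"Fn::GetAtt": ["myVPC", "DefaultSecurityGroup"]}],
                "SubnetId": {"Ref": "mySubnet"},
            },
        },
    },
}

missing = unresolved_refs(template)
print(missing)
```

Run against the fragment from the slide, the check flags mySubnet, which the fragment references but does not define. A check like this could run in the Verify stage of the toolchain, ahead of the service's own template validation.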

Here’s some security as code (AWS IAM)

{
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
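To see what the two Deny statements do, the policy can be exercised against sample upload headers. The following is a deliberately simplified model written for illustration, not the AWS policy engine: it handles only the two condition operators used above, ignores Principal, Action, and Resource matching, and treats an absent key as satisfying StringNotEquals. The function names are hypothetical.

```python
POLICY = {
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ]
}


def condition_matches(condition, headers):
    """Return True when every check in the condition block holds for the request."""
    for operator, checks in condition.items():
        for key, expected in checks.items():
            header = key.split(":", 1)[1]  # "s3:x-amz-..." -> "x-amz-..."
            value = headers.get(header)
            if operator == "StringNotEquals":
                holds = value is None or value != expected
            elif operator == "Null":
                holds = (value is None) == (expected == "true")
            else:
                raise ValueError("operator not modeled in this sketch: " + operator)
            if not holds:
                return False
    return True


def denied_by(policy, headers):
    """Return the Sids of the Deny statements that match the request headers."""
    return [s["Sid"] for s in policy["Statement"]
            if s["Effect"] == "Deny" and condition_matches(s["Condition"], headers)]


print(denied_by(POLICY, {}))
print(denied_by(POLICY, {"x-amz-server-side-encryption": "aws:kms"}))
print(denied_by(POLICY, {"x-amz-server-side-encryption": "AES256"}))
```

Under this model an upload without the header or with a different algorithm matches at least one Deny statement, and only an upload that requests AES256 server-side encryption passes both.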

Cloud Era

Observed industry cloud techniques with AWS

AWS as components

AWS makes commercial cloud infrastructure software products and office productivity applications that are user-configurable, general purpose in nature, and delivered to commercial IT standards like ISO, NIST, SOC and others. This is similar to other general purpose IT products and services such as database engines, operating systems, programming languages, internet service providers, etc. Many organizations categorize AWS products as commercial-off-the-shelf (COTS) infrastructure software products, which is consistent with the US federal government’s use of AWS Products as a COTS item through a federal procurement program called FedRAMP.

“Using AWS in GxP Systems” AWS Whitepaper

http://icon-park.com/icon/light-orange-lego-brick-vector-data-for-free/

A Word on Security

Customers: security in the cloud

Customer content

Platform, Applications, Identity & Access Management

Operating System, Network & Firewall

Client-side encryption implementation, Server-side encryption, Network Traffic Protection

AWS: security of the cloud

AWS Foundation Services: Compute, Storage, Database, Networking

AWS Global Infrastructure: Regions, Availability Zones, Edge Locations

Consult internally before implementing

The following slides are practices we have seen used in industry. As security and industry compliance are determined by the customer, before implementing please:

• Consult with your internal best practices

• Consult with your Cloud Center of Excellence

• Consult with your Information Security group

• Consult with your Compliance organization

• Do your due diligence

General Strategies

AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline

Consult with compliance and security organizations before implementing

• Decouple protected/sensitive data from the processing or orchestration

• Track where your protected/sensitive data flows

• Do not check the protected data into your source or artifact repository!

• Use indirection when orchestrating your protected/sensitive data flow

• Separate protected/sensitive and general workflow logical boundaries

Separate Virtual Private Cloud (VPC) Strategy

Diagram: a Protected/Sensitive (P/S) Data VPC and a General VPC. Services shown: Amazon EC2, Amazon EMR, Amazon S3, AWS Directory Service, AWS Device Farm

Indirection Strategy

Diagram labels: Data Processing System; Inbound Data Store (S3); HTTPS Send; SQS; SNS; Claims; P/S Data
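One way to read the indirection diagram is as a claim-check pattern: the protected/sensitive payload goes straight to the inbound data store, and only a small claim that points at it travels through the messaging layer. That reading is an assumption, as are all names below. The sketch uses an in-memory dictionary and queue as stand-ins for S3 and SQS so that it runs without any AWS account.

```python
import queue
import uuid


def send_protected_data(payload, store, claims):
    """Store the payload, then publish a claim that holds only its location."""
    key = "inbound/" + uuid.uuid4().hex
    store[key] = payload                      # stand-in for an HTTPS upload to S3
    claim = {"bucket": "inbound-data-store", "key": key}
    claims.put(claim)                         # stand-in for an SQS/SNS message
    return claim


def process_next_claim(store, claims):
    """Fetch the payload named by the next claim and return its size in bytes."""
    claim = claims.get_nowait()
    payload = store[claim["key"]]             # only the processor reads the data
    return len(payload)


store = {}
claims = queue.Queue()
sample = b"patient-id=123;result=positive"    # hypothetical protected record
claim = send_protected_data(sample, store, claims)
processed_bytes = process_next_claim(store, claims)
print(claim["bucket"], processed_bytes)
```

The orchestration layer sees bucket and key names but never the record itself, which is the point of keeping protected data out of queues, logs, and source repositories.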


Example: Analytics Workflow

Diagram components: Medical Data; Inbound Data Store (S3); Inbound Archive (Glacier); New Object Message; SQS; AWS Lambda; Amazon SES; Insight System (EMR); Data Lake (S3); Columnar Query Store (Redshift); P/S Insights; General Insights

The diagram numbers its steps 1 through 9.

Compliance Example Workflow (using DevSecOps)

Diagram components: Security / Compliance Admin; CloudFormation template; AWS CodeCommit; AWS Service Catalog; Developers; CloudFormation stack; AWS CloudTrail (logs all API calls); Amazon S3; AWS Config (track changes); AWS CloudWatch alarm

Labeled steps: 1 Define; 2 Publish; 3 Git push; 4 Browse and Launch; 5 Provisions; 8 Monitors; 10 Initiates; 11 Monitors; 12 Notifies

Audit Ingestion

© 2016 Medidata Solutions, Inc. – Proprietary and Confidential

MAudit: Audit Ingestion and DevSecOps

Matthew Szenher | mszenher@mdsol.com

Principal Architect - Medidata Core Web Services

About Medidata

SaaS Platform for clinical development, analytics and benchmarking in life sciences

Started in 1999

Over 9,000 trials in more than 130 countries

Serve CROs and contracting partners

We’re hiring: https://www.mdsol.com/en/careers


What are Audits?

A record of actions that create, modify or delete clinically relevant data.

Crucial for asserting confidentiality, integrity and authenticity of this data.

I’ll talk about how auditing is difficult, and how AWS makes DevSecOps for auditing solutions a lot easier...

Audits are a MUST

MUST be captured transactionally with patient data points (as well as other clinically relevant data)

MUST be persisted

MUST be immutable

MUST be consistent

MUST be secure

SHOULD be cheap


Audits are VOLUMINOUS

Medidata persists eight billion clinical records from more than two million patients across more than 9,000 studies

More than one half million patient data points are added daily

Regulatorily required to capture audits transactionally with these data points (as well as other clinically relevant data)

… ~600 audits per second

And growing!
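The two rates on this slide use different units, so a quick conversion helps compare them. The arithmetic below uses only the figures quoted above, rounded to 500,000 data points per day and 600 audits per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60

data_points_per_day = 500_000   # "more than one half million", rounded down
audits_per_second = 600         # "~600 audits per second"

data_points_per_second = data_points_per_day / SECONDS_PER_DAY
audits_per_day = audits_per_second * SECONDS_PER_DAY

print(round(data_points_per_second, 1))  # average new data points per second
print(audits_per_day)                    # audits per day at a sustained 600 per second
```

New patient data points average under six per second, while 600 audits per second would amount to tens of millions of audits per day. The slides do not say whether 600 is an average or a peak, or how many audits come from other clinically relevant data, so the gap should be read with that in mind.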


And Growing

GADGET trial with GlaxoSmithKline

Patients wore Vital Connect Health Patch (http://www.vitalconnect.com/)

ECG, skin temp., etc.

1 week

~350 GB of audit data

~300 million data points (and their audits)

More data than many years-long trials collect over their lifetimes
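The GADGET figures can be turned into per-record numbers with a short calculation. It assumes decimal gigabytes (1 GB = 10^9 bytes) and that the one-week period covers the whole collection window; neither detail is stated on the slide.

```python
audit_bytes = 350 * 10**9          # ~350 GB of audit data
data_points = 300 * 10**6          # ~300 million data points
seconds_in_week = 7 * 24 * 60 * 60

bytes_per_data_point = audit_bytes / data_points
data_points_per_second = data_points / seconds_in_week

print(round(bytes_per_data_point))    # audit bytes stored per data point
print(round(data_points_per_second))  # average ingest rate over the week
```

Under those assumptions, each data point carries a little over a kilobyte of audit data, and this single sensor-based trial averaged roughly 500 data points per second.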


Solution: MAudit

Scalable

Centralized

Durable

Highly Available

Secure

Audit ingestion and validation service….

Built on AWS infrastructure

Diagram: Audit Producers; MAudit Servers (EC2)

MAudit and DevSecOps

S3: Programmatically defined persistence, with security and infinite scaling

Autoscaling Groups: Codified app server scaling

Kinesis: Codified, scalable streaming of data

IAM: Programmatically defined access controls

CloudFormation: Specifying all of the above in code

Thank You


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Angel Pizarro, AWS Scientific Computing

Brad Chapman, Harvard Medical School

May 17, 2016

Containers for Science!?!

Reproducible science at scale on AWS using Common Workflow Language, Docker, and Amazon Elastic Container Service

Agenda

Review Common Workflow Language and bcbio

Intro to Docker and Amazon Elastic Container Service

CWL+Docker workflow on top of ECS

Common Workflow Language

We need faster, better science

https://twitter.com/KMS_Meltzy/status/661206070308794368

Large Scale Infrastructure Development

Shared problems: Academia, Industry, Startups

● Workflow implementations

● Validation

● Scaling

● Support

Blue Collar Bioinformatics (bcbio)

https://github.com/chapmanb/bcbio-nextgen

Uses

Aligners: bwa, novoalign, bowtie2, HiSat2

Variation: FreeBayes, GATK, VarDict, MuTecT, Scalpel, SnpEff, VEP, GEMINI, Lumpy, Manta, CNVkit, WHAM

RNA-seq: Tophat, STAR, Sailfish, Kallisto

Quality control: MultiQC, fastqc, Qualimap

Manipulation: bedtools, bcftools, biobambam, sambamba, samblaster, samtools, vcflib, vt

Provides

http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html

Validation: low frequency cancer variants

http://bcb.io/2016/04/04/vardict-filtering/

Why community developed workflows?

Problem: complexity of parallelization reduces tool reuse

● Ties biology to job running framework

● Results in reimplemented pipelines

● Each requires AWS integration and scaling

Why community developed workflows?

Solution: Common framework for describing tools and

workflows

● Shared language for common concepts

● Increased re-use of tool definitions

● Mix and match workflow components

http://www.commonwl.org/

Common Workflow Language implementations

https://arvados.org/

Toil from BD2K Center for Translational Genomics: http://toil.readthedocs.io

https://github.com/galaxyproject/planemo

https://sbgenomics.com

bcbio + Docker

http://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html

● Single container with many biological tools

● https://hub.docker.com/r/bcbio/bcbio/

● Same workflow management as non-Docker

● Run with any platform supporting CWL

bcbio + Common Workflow Language

http://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html

● Generates CWL from sample description

● Data management integrates with platform

○ Arvados: Keep containers

○ Toil: AWS buckets

arvados_testcwl-workflow/
├── main-arvados_testcwl.cwl
├── main-arvados_testcwl-samples.json
├── steps
│   ├── batch_for_variantcall.cwl
│   ├── combine_sample_regions.cwl
│   ├── compare_to_rm.cwl
│   ├── concat_batch_variantcalls.cwl
│   ├── coverage_report.cwl
│   ├── get_parallel_regions.cwl
│   ├── merge_split_alignments.cwl
│   ├── multiqc_summary.cwl
│   ├── pipeline_summary.cwl
│   ├── postprocess_alignment.cwl
│   ├── postprocess_variants.cwl
│   ├── prep_align_inputs.cwl
│   ├── prep_samples.cwl
│   ├── process_alignment.cwl
│   └── variantcall_batch_region.cwl
├── wf-alignment.cwl
└── wf-variantcall.cwl

"files": [{
  "class": "File",
  "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam",
  "secondaryFiles": [{
    "class": "File",
    "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam.bai"
  }]
}]

"reference__fasta__base": [{
  "class": "File",
  "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa",
  "secondaryFiles": [{
    "class": "File",
    "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa.fai"
  }]
}]

Manage provenance for all workflow inputs

class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [bcbio_nextgen.py, runfn, process_alignment, cwl]
hints:
  - class: ResourceRequirement
    coresMin: 16
    ramMin: 65536
    tmpdirMin: 100000
inputs:
  - id: '#files'
    type:
      items: File
      type: array
  - id: '#reference__fasta__base'
    type: File
outputs:
  - id: '#align_bam'
    type: File

Required job infrastructure. Runner handles resource allocation.

Explicitly define inputs/outputs. Runner handles file management.

Community infrastructure development

Decouple biology and infrastructure

Enables interoperability

Improves integration with AWS

Better infrastructure: provenance and reproducibility

Validation: better, faster science

Amazon Elastic Container

Service (ECS)

AMIs VS. Containers

AMIs: App A, App B and App C each sit on their own Bins/Libs and their own Guest OS, above the Hypervisor and the Server (Host OS)

Containers: App A, App B, App B and App C share Bins/Libs on a single Guest OS + Container Manager, above the Hypervisor and the Server (Host OS)

Containers are natural for discrete applications

Docker provides tools to manage and deploy your applications

Lightweight container virtualization platform

Simple to model

Any app, any language

Container image can be tied to app version

Test & deploy same artifact

Stateless servers decrease change risk

Applications evolve from monolithic stacks ...

* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png

... to discrete applications (microservices)

* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png

Scheduling one resource is straightforward

Diagram: one Server with a Guest OS, running App1 and App2, each on its own Bins/Libs

Scheduling a cluster is hard

(Figure: a large grid of servers, each with its own Guest OS.)

Amazon Elastic Container Service (ECS)

Easily manage clusters for any scale

No services for you to run

Complete state

Control and monitoring

Cluster Management

Amazon ECS

Diagram, built up over three slides: container instances spread across AZ 1 and AZ 2, each running Docker and the ECS Agent. On the Amazon ECS side: Cluster Management Engine, Key/Value Store, Agent Communication Service. In the last build, each container instance runs tasks, and each task holds containers.

ECS Tasks

Unit of work

Grouping of related Containers

Run on Container Instances

Tasks are defined via Task Definitions

class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [samtools, index]
hints:
  - class: ResourceRequirement
    coresMin: 2
    ramMin: 1024
    tmpdirMin: 100000
inputs:
  - id: '#files'
    type:
      items: File
      type: array
  - id: '#reference__fasta__base'
    type: File
outputs:
  - id: '#align_bam'
    type: File

{
  "family": "samtoolsIndexPath1",
  "containerDefinitions": [{
    "name": "samtools-index",
    "image": "delagoya/samtools-index",
    "cpu": 2,
    "memory": 1024,
    "essential": true,
    "entryPoint": ["sh", "-c"],
    "command": ["samtools", "index"],
    "workingDirectory": "/mnt/scratch",
    "mountPoints": [{
      "containerPath": "/mnt/scratch",
      "sourceVolume": "scratch",
      "readOnly": null
    }]
  }],
  "volumes": [{
    "host": {"sourcePath": "/mnt/scratch"},
    "name": "scratch"
  }]
}

* Not a complete example
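The slide places a CWL tool description next to an ECS task definition, which suggests that one can be derived from the other. The sketch below shows that mapping for the resource hints only; it is an illustration, not the code used in the demo. The function name is hypothetical, and the scaling of cores is an assumption based on ECS expressing `cpu` in CPU units, where 1,024 units correspond to one vCPU, so the result differs from the literal "cpu": 2 on the slide.

```python
def cwl_to_container_definition(tool, name, image):
    """Build a partial ECS container definition from a parsed CWL CommandLineTool."""
    requirement = next(
        hint for hint in tool.get("hints", [])
        if hint.get("class") == "ResourceRequirement"
    )
    return {
        "name": name,
        "image": image,
        "cpu": requirement["coresMin"] * 1024,  # cores -> ECS CPU units
        "memory": requirement["ramMin"],        # CWL ramMin and ECS memory are both MiB
        "essential": True,
        "command": list(tool["baseCommand"]),
    }


# The CWL document from the slide, as it would look after YAML parsing.
samtools_index = {
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "index"],
    "hints": [{
        "class": "ResourceRequirement",
        "coresMin": 2,
        "ramMin": 1024,
        "tmpdirMin": 100000,
    }],
}

definition = cwl_to_container_definition(
    samtools_index, name="samtools-index", image="delagoya/samtools-index"
)
print(definition)
```

Volumes, mount points, and input file staging are left out here; a workflow runner would need to add them before registering the task definition.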

Amazon ECS

Diagram, continued over two slides: the same cluster, now with an API in front of Amazon ECS that is called by a User / Scheduler. In the final build the scheduler shown is Toil.

Architecture

ECS Scheduler Driver

EC2 Autoscale Group

cwltoil

ECS Task Definitions

https://github.com/awslabs/ecs-mesos-scheduler-driver

Setup ECS Cluster with AutoScaling

Create LaunchConfiguration

Pick instance type depending on resource requirements, e.g. memory or CPU

Use latest Amazon Linux ECS-optimized AMI, other distros available

Create AutoScaling group and set to cluster initial size

AutoScaling your Amazon ECS Cluster

Create CloudWatch alarm on a metric, e.g. MemoryReservation

Configure scaling policies to increase and decrease the size of your cluster
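As a rough sketch of the alarm step, the following Python builds the parameters for a CloudWatch alarm on the ECS `MemoryReservation` metric. The cluster name, policy ARN, threshold, and evaluation window are placeholder assumptions, and the helper `memory_reservation_alarm` is hypothetical; nothing here calls AWS.

```python
def memory_reservation_alarm(cluster_name, scaling_policy_arn, threshold_percent=75.0):
    """Return keyword arguments describing a scale-out alarm for an ECS cluster.

    Hypothetical helper: threshold and timing values are illustrative only.
    """
    return {
        "AlarmName": f"{cluster_name}-memory-reservation-high",
        # ECS publishes cluster reservation metrics in the AWS/ECS namespace,
        # keyed by the ClusterName dimension.
        "Namespace": "AWS/ECS",
        "MetricName": "MemoryReservation",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,               # seconds per datapoint
        "EvaluationPeriods": 3,     # consecutive datapoints that must breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # The action is the scaling policy that grows the AutoScaling group.
        "AlarmActions": [scaling_policy_arn],
    }


# Placeholder names; substitute a real cluster and policy ARN.
alarm = memory_reservation_alarm("genomics-cluster", "arn:aws:autoscaling:example-policy")
print(alarm["MetricName"], alarm["Threshold"])
```

With boto3 installed and credentials configured, a dict like this could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. A second alarm with a lower threshold and `LessThanThreshold` would drive the scale-in policy.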

Demo

© 2016 COGNIZANT

Agile Cloud Native Development with AWS

11th April 2016

Maran Marudhamuthu

Chief Architect – Cloud Services

CHANGE

WHAT IS CHANGING

‣ A secular change in the architecture of distributed systems is under way.

‣ How distributed systems are built is changing.

‣ Evolution towards cloud native architecture.

‣ The thick line between infrastructure and application is getting thinner:

‣ Serverless

‣ Forget infra, build applications: IaaS and PaaS are gaining maturity

DEVOPS

‣ CULTURE: No more separate development and operations teams

‣ PROCESS: Continuous integration and delivery

‣ AUTOMATE: Infrastructure as code; automate from developer desktop to production

‣ TOOLS: New tools to automate: Chef/Puppet, AWS CloudFormation, Jenkins

CHANGE EXPLAINED

MICROSERVICES AND API

‣ CLOUD NATIVE: 12 factor

For example,

‣ Isolate failure

‣ Facilitate blue-green deployment

‣ Scale independently

‣ CI/CD: Continuous integration and delivery made easier

‣ STANDARDS: Defined protocol for service definition and consumption

‣ FRAMEWORK & TOOLS: Chef/Puppet, AWS CloudFormation, Jenkins, Spring Boot, Netflix OSS

CHANGE EXPLAINED

CONTAINERS

BUILD SHIP RUN ANY APP ANYWHERE

▸ BUILD: Package your app in a container (image)

▸ SHIP: Move that container from one machine to another

▸ RUN: Execute that container (i.e. app)

▸ ANY APP: Anything that runs on Linux or Windows

▸ ANYWHERE: Bare metal, VM, cloud instance

CHANGE EXPLAINED

DEVOPS

AWS Deployment and Management Tools

• AWS CodeDeploy

• AWS CodePipeline

• AWS CodeCommit

HOW AWS SERVICES HELP?

Automation

• AWS OpsWorks

• AWS CloudFormation

• AWS SDK, API, and CLI

HOW AWS SERVICES HELP? – DEVOPS

MICROSERVICES & API – CLOUD NATIVE

AWS AppDev PaaS Hosting

• AWS Elastic Beanstalk

• AWS CodeDeploy

• AWS CodeCommit

• Amazon EMR

• AWS Data Pipeline

HOW AWS SERVICES HELP?

Build Fast – Forget Infra – Serverless

• AWS Lambda

• Amazon Kinesis

• AWS Data Pipeline

• AWS API, SDK, CLI

• AWS IoT Framework

HOW AWS SERVICES HELP? – MICROSERVICES – CLOUD NATIVE

HOW AWS SERVICES HELP?

‣ Amazon EC2, AMI

‣ Docker run

‣ AWS Elastic Beanstalk

  ‣ Deploy and scale Docker applications

‣ Amazon EC2 Container Service (ECS)

  ‣ Launch and manage Docker containers

‣ Amazon EC2 Container Registry (ECR)

  ‣ A fully managed Docker container registry

  ‣ Integrated with Amazon EC2 Container Service (ECS)

  ‣ Simplifies your development-to-production workflow

‣ DevOps

CONTAINERS

CONCLUSION

‣ Think beyond Infra/Forget Infra

‣ Use rather than build

‣ IaaS and PaaS/Application Services

?

Thank You

Scalable Genomic Analysis on AWS with ADAM

Ujjwal Ratan

Healthcare and Life Sciences Solutions Architect

Amazon Web Services

Timothy Danford

Field Engineer, Tamr, Inc. & Software Engineer, University of California, Berkeley

This Talk Will Cover

An Overview of Amazon Elastic MapReduce (EMR) and Spark

Genomics and ADAM deep dive

Video Demonstration

Q&A

Amazon EMR Overview

Hadoop 1.x & 2.x / HDFS clusters

Easy to use; fully managed

Support for EC2 Spot Instances

S3, DynamoDB, Redshift & Kinesis integration

Amazon Elastic MapReduce (EMR)

Many storage layers to choose from

Amazon DynamoDB: EMR-DynamoDB connector

Amazon RDS: JDBC Data Source w/ Spark SQL

Amazon Kinesis: Streaming data connectors

Elasticsearch: Elasticsearch connector

Amazon Redshift: Amazon Redshift COPY from HDFS

Amazon S3: EMR File System (EMRFS)

Amazon EMR

Create a fully configured cluster in minutes

AWS Management Console

AWS Command Line Interface (CLI)

Or use an AWS SDK directly with the Amazon EMR API

Amazon EMR – Managed Spark

Easy to install and configure Spark

Secured

Spark submit or use Zeppelin UI

Quickly add and remove capacity

Hourly, reserved, or EC2 Spot pricing

Use S3 to decouple compute and storage

Single Spark Cluster on Amazon EMR

[Diagram: an Amazon EMR cluster whose master node runs the Spark master and YARN, with four Core/Task nodes each running a Spark worker. The cluster is connected to Amazon S3 through EMRFS.]

S3 Standard is designed for eleven 9s of durability (99.999999999%) and is suited to your source data

S3 Reduced Redundancy is designed for 99.99% durability and can be used to reduce costs on reproducible datasets
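To make the durability figures concrete, here is a small arithmetic sketch. It assumes durability is quoted as an annual per-object figure, and it uses 99.99% for Reduced Redundancy, the value AWS published for that storage class. The function name and the 10,000-object example are illustrative assumptions.

```python
from fractions import Fraction


def expected_annual_loss(object_count, durability_nines):
    """Expected number of objects lost per year at a given number of nines.

    Eleven nines means an annual loss rate of 10**-11 per object.
    Fractions keep the arithmetic exact for such small rates.
    """
    annual_loss_rate = Fraction(1, 10 ** durability_nines)
    return object_count * annual_loss_rate


standard = expected_annual_loss(10_000, 11)  # S3 Standard: 99.999999999%
reduced = expected_annual_loss(10_000, 4)    # Reduced Redundancy: 99.99%

print(f"Standard: one object lost every {1 / standard} years on average")
print(f"Reduced Redundancy: {reduced} object(s) lost per year on average")
```

The gap of seven orders of magnitude is why the cheaper class only makes sense for data that can be regenerated, such as intermediate pipeline outputs.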

Next-Generation Genomics Using Spark and ADAM

Timothy Danford

Tamr Inc.

AMPLab

What’s In The Box?

A Variant-Calling Pipeline

Stages are written separately

Hand-off between steps is through files

Everyone has their own “flavor” of pipeline

Parallelization in the Cloud

Lingua Franca: File Formats

.bam files define a custom .bai index format

User-defined attributes

Typically in coordinate-sorted order

Where is “The Platform”?

(This is taken from the Picard library.)

Why are we managing file handles and spilling reads to disk inside our bioinformatics methods?

Things Fall Apart When Our Computation Changes

Bioinformaticians ❤️ Probabilistic Models

Many Bioinformatics Methods Are Just Large Sums

The Challenge: Existing Code!

Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)

Can You Spot the File Format Assumption?

A single piece of a filtering stage for the MuTect somatic variant caller.

Can you spot the place where they assume they are working with BAM files?

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab

• Apache 2 License

• Contributors from both research and commercial organizations

• Core spatial primitives, variant calling

• Avro and Parquet for data models and file formats

“The Platform” Defines Core Genomics Primitives

Stop Defining File Formats By Hand

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defined data model.

• Improvement over existing formats¹

  • 20-22% for BAM

  • ~95% for VCF

¹ Compression % quoted from 1K Genomes samples
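One intuition for these savings is that a columnar layout puts similar values next to each other, where simple encodings such as run-length encoding work well. The toy sketch below illustrates that idea only; it is not Parquet's actual encoder, and the record fields (`contig`, `start`, `mapq`) are made-up stand-ins for an Avro-defined read schema.

```python
from itertools import groupby


def to_columns(records):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    fields = records[0].keys()
    return {field: [record[field] for record in records] for field in fields}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical coordinate-sorted reads; real records carry many more fields.
reads = [
    {"contig": "chr1", "start": 100, "mapq": 60},
    {"contig": "chr1", "start": 140, "mapq": 60},
    {"contig": "chr1", "start": 185, "mapq": 60},
    {"contig": "chr2", "start": 20, "mapq": 30},
]

columns = to_columns(reads)
encoded_contig = run_length_encode(columns["contig"])
print(encoded_contig)
```

Stored row by row, the contig name repeats in every record; stored as a column of sorted data, it collapses to a handful of runs.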

Demo

Installation of ADAM on EMR

Analyze Parquet Files using Spark

Use Scala to query the genome data in the ADAM Parquet files

……

val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")

gnomeDF.printSchema()

gnomeDF.registerTempTable("gnome")

val gnome_data = sqlContext.sql("select count(*) from gnome")

gnome_data.show()

…….

For more details, see our blog post

Title: Will Spark Power the Data behind Precision Medicine

Authors: Christopher Crosbie, Ujjwal Ratan

URL: http://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine

Thank You

Any Questions?
