globus genomics: how science-as-a-service is accelerating discovery (bdt310) | aws re:invent 2013

52
Science as a Service Ian Foster, The University of Chicago and Argonne National Laboratory November 14, 2013

Upload: amazon-web-services

Post on 10-May-2015

479 views

Category:

Technology


2 download

DESCRIPTION

"In this talk, hear about two high-performant research services developed and operated by the Computation Institute at the University of Chicago running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, etc. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-art research data management capabilities. Globus Genomics uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale as well as how to use genomic analysis tools and pipelines for next generation sequencers using Globus on AWS. "

TRANSCRIPT

Page 1: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Science as a Service

Ian Foster, The University of Chicago and Argonne National Laboratory

November 14, 2013

Page 2: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

A time of disruptive change

Page 3: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

A time of disruptive change

Page 4: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Most labs have limited resources Heidorn: NSF grants in 2007

< $350,000 80% of awards 50% of grant $$

$1,000,000

$100,000

$10,000

$1,000

2000 4000 6000 8000

Page 5: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Automation is required to apply more sophisticated methods to far more data

Page 6: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Automation is required to apply more sophisticated methods to far more data

Outsourcing is needed to achieve economies of scale in the use of automated methods

Page 7: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Building a discovery cloud • Identify time-consuming activities amenable to

automation and outsourcing • Implement as high-quality, low-touch SaaS • Leverage IaaS for reliability,

economies of scale • Extract common elements as

research automation platform Bonus question: Sustainability

Software as a service

Platform as a service

Infrastructure as a service

Page 8: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

We aspire (initially) to create a great user experience for

research data management

What would a “dropbox for science” look like?

Page 9: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 10: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Registry Staging Store

Ingest Store

Analysis Store

Community Store

Archive Mirror

Ingest Store

Analysis Store

Community Store

Archive Mirror

Registry

Quota exceeded

!

Expired credentials

!

Network failed. Retry.

!

Permission denied

!

It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it’s often very challenging

Page 11: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 12: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

• Move • Sync • Share Capabilities delivered using

Software-as-Service (SaaS) model

Page 13: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Data Source

Data Destination

User initiates transfer request

1

Globus Online moves/syncs files

2

Globus Online notifies user

3

Page 14: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Data Source

User A selects file(s) to share; selects user/group, sets share permissions

1

Globus Online tracks shared files; no need to move files to cloud storage!

2

User B logs in to Globus Online and accesses

shared file

3

Page 15: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Extreme ease of use • InCommon, Oauth, OpenID, X.509, … • Credential management • Group definition and management • Transfer management and optimization • Reliability via transfer retries • Web interface, REST API, command line • One-click “Globus Connect” install • 5-minute Globus Connect Multi User install

Page 16: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Early adoption is encouraging

Page 17: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Early adoption is encouraging

>12,000 registered users; >150 daily >27 PB moved; >1B files

10x (or better) performance vs. scp 99.9% availability

Entirely hosted on Amazon

Page 18: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Amazon web services used • Amazon EC2 for hosting Globus services • Elastic Load Balancing to use multiple

Availability Zones for reliability and uptime • Amazon S3 to store historical state • Amazon RDS PostgreSQL for active state

Page 19: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

K. Heitmann (Argonne) moves 22 TB of cosmology data LANL ANL at 5 Gb/s

Page 20: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA NERSC

Page 21: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience

Page 22: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

2

Credit: Kerstin Kleese-van Dam

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL

Page 23: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

• Move • Sync • Share Capabilities delivered using

Software-as-Service (SaaS) model

Page 24: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

BIG DATA

Page 25: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Online already does a lot

Globus Toolkit

Sharing Service Transfer Service

Globus Nexus (Identity, Group, Profile)

Glo

bus

Onl

ine

API

s

Glo

bus

Con

nect

Page 26: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

The identity challenge in science • Research communities often need to

– Assign identities to their users – Manage user profiles – Organize users into groups for authorization

• Obstacles to high-quality implementations – Complexity of associated security protocols – Creation of identity silos – Multiple credentials for users – Reliability, availability, scalability, security

Page 27: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Nexus provides four key capabilities • Identity provisioning

– Create, manage Globus identities

• Identity hub – Link with other identities; use

to authenticate to services

• Group hub – User-managed groups; groups can

be used for authorization

• Profile management – User-managed attributes;

can use in group admission

I

I I I

I

I a b

I U

V G

Key points: 1) Outsource

identity, group, profile management

2) REST API for flexible integration

3) Intuitive, customizable Web interfaces

Page 28: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Branded sites

Open Science Grid University of Chicago XSEDE

DOE kBase Indiana University University of Exeter

Globus Online NERSC NIH BIRN

Page 29: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

A platform for integration

Page 30: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

A platform for integration

Page 31: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

A platform for integration

Page 32: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) +

Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for

all biologists

globus genomics

Page 33: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Toolkit

Sharing Service Transfer Service

Globus Nexus (Identity, Group, Profile)

Glo

bus

Onl

ine

API

s

Glo

bus

Con

nect

We are adding capabilities

Page 34: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Toolkit

Sharing Service Transfer Service

Dataset Services

Globus Nexus (Identity, Group, Profile)

Glo

bus

Onl

ine

API

s

Glo

bus

Con

nect

We are adding capabilities

Page 35: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

We are adding capabilities • Ingest and publication

– Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts

• Cataloging – Virtual views of data based on user-defined and/or automatically

extracted metadata

• Computation – Associate computational procedures, orchestrate application,

catalog results, record provenance

Page 36: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Next Gen Sequencing Analysis for Everyone – No IT Required

Ravi K Madduri, The University of Chicago and Argonne National Laboratory

November 14, 2013

Page 37: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

One slide to get your attention

Page 38: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Outline • Globus Vision • Challenges in Sequencing Analysis

– Big Data Management – Analysis at Scale – Reproducibility

• Proposed Approach Using Globus Genomics • Example Collaborations • Q&A

Page 39: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Vision Goal: Accelerate discovery and innovation worldwide by providing research IT as a service Leverage software-as-a-service to:

– provide millions of researchers with unprecedented access to powerful tools for managing Big Data

– reduce research IT costs dramatically via economies of scale

“Civilization advances by extending the number of important operations which we can perform without thinking of them” —Alfred North Whitehead , 1911

Page 40: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Challenges in Sequencing Analysis

Sequencing Centers

Sequencing Centers

Data Movement and Access Challenges

Manual Data Analysis

Public Data

Storage

Local Cluster/ Cloud Seq

Center

Research Lab

How do we analyze this Sequence Data

Picard

GATK

Fastq Ref Genome

Alignment

Variant Calling

• Manually move the data to the Compute node

(Re)Run Script

Install

Modify

• Install all the tools required for the Analysis • BWA, Picard, GATK, Filtering Scripts, etc. • Shell scripts to sequentially execute the tools

• Manually modify the scripts for any change • Error Prone, difficult to keep track, messy.. • Difficult to maintain and transfer the knowledge

• Data is distributed in different locations • Research labs need access to the data for analysis • Be able to share data with other researchers/collaborators

• Inefficient ways of data movement • Data needs to be available on the local and distributed compute

Resources • Local clusters, cloud, grid and transfer the knowledge

Page 41: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Genomics

Sequencing Centers Sequencing Centers

Public Data

Storage

Local Cluster/ Cloud Seq

Center

Research Lab

Globus Provides a • High-performance • Fault-tolerant • Secure file transfer Service between all data-endpoints

Data Management Data Analysis

Galaxy Data Libraries

• Globus Integrated within Galaxy

• Web-based UI • Drag-Drop workflow

creations • Easily modify

Workflows with new tools

Globus Genomics on Amazon EC2

• Analytical tools are automatically run on the scalable compute resources when possible

Galaxy Based Workflow Management System

Globus Genomics

Page 42: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Genomics Architecture

Figure 2: Globus Genomics Architecture

Page 43: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Genomics Usage

Page 44: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013
Page 45: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Globus Genomics • Computational profiles for

various analysis tools • Resources can be provisioned

on-demand with Amazon Web Services cloud based infrastructure

• Glusterfs as a shared file system between head nodes and compute nodes

• Provisioned I/O on Amazon EBS

Page 46: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Coming soon! • Integration with Globus Catalog

– Better data discovery and metadata management

• Integration with Globus Sharing – Easy and secure method to share large datasets with collaborators

• Integration with Amazon Glacier for data archiving • Support for high throughput computational

modalities through Apache Mesos – MapReduce and MPI clusters

• Dynamic Storage Strategies using Amazon S3 or LVM-based shared file system

Page 47: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013
Page 48: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Provide more capability for more people at lower cost by building a “Discovery Cloud”

Delivering “Science as a service”

Our vision for a 21st century discovery infrastructure

Page 49: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Thank you to our sponsors

Page 50: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

For more information • More information on Globus Genomics and to

sign up: www.globus.org/genomics • More information on Globus:

www.globusonline.org • Follow us on Twitter: @ianfoster, @madduri,

@globusgenomics, @globusonline

Page 51: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Thank you!

Page 52: Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT 310