
Optimizing Large Genome Assembly in the Cloud

Apurva Kumar, Soumyarupa De, and Kenneth Yocum

University of California, San Diego

Genomic Analytics Data

• High-throughput sequencing of a human genome (African male NA18507) produces 3.5 billion 36-base-pair (bp) reads = 756 MB (cost: 1,000 USD).

Current problem: assemble a genome at lower cost and in minimal time.

Why this problem?

• Only 60 complete human genomes exist publicly (including Steve Jobs, who had his sequenced for 100K USD).
• Commercialization is still at an early stage and growing rapidly.

Question: how do we scale to this 3.5 B read input and compute the final genome sequence?

Optimizing using Stateful Bulk Processing

We make use of stateful Continuous Bulk Processing (CBP): an efficient stateful graph-processing model (it even supports Google's "Pregel" model) that we run on top of Azure.

[Figure: Contrail implemented as CBP stages. A stage's translate function T(key, ΔF_1, ΔF_state) consumes per-key deltas (Δin) from its input flows (F_in edges, F_in state) and emits deltas (Δout) on its output flows (F_out edges, F_out state); the state flow loops back into the stage. Caption: "Translate function for a stage with state as loopback flow".]
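To make the figure concrete, here is a minimal Python sketch of a CBP-style stage. The names (Stage, translate) and record layout are illustrative assumptions, not CBP's actual API: per key, translate receives the new input delta plus the prior state record from the loopback flow, and returns an output delta and an updated state record for the next epoch.

def translate(key, delta_in, state):
    # T(key, dF_in, dF_state) -> (dF_out, dF_state): emit only new records
    # and loop the accumulated state back for the next processing epoch.
    state = state if state is not None else set()
    new_records = set(delta_in) - state
    state |= new_records
    return list(new_records), state

class Stage:
    # Runs translate once per dirty key each epoch; state is the loopback flow.
    def __init__(self):
        self.state = {}                          # key -> state record

    def run_epoch(self, delta_f_in):             # delta_f_in: key -> new records
        delta_f_out = {}
        for key, records in delta_f_in.items():
            out, self.state[key] = translate(key, records, self.state.get(key))
            delta_f_out[key] = out
        return delta_f_out

stage = Stage()
print(stage.run_epoch({"AC": ["CG"]}))           # {'AC': ['CG']}
print(stage.run_epoch({"AC": ["CG", "GT"]}))     # duplicate suppressed: {'AC': ['GT']}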

Our Approach

Aim: assemble a human genome (3.5 B reads of 36 bp).

Pipeline: Genomic Analysis → Graph Processing → CBP on Azure (+ Newt for debugging).

Genome assembly using graphs

• Build a De Bruijn graph from the sample of short input reads.
• An Eulerian walk across the graph gives the genomic sequence; a toy sketch follows below.
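A toy Python sketch of this idea, under simplifying assumptions (error-free reads, no reverse complements, a known start node); a real assembler such as Contrail must handle all of these at scale:

from collections import defaultdict

def de_bruijn(reads, k):
    # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_walk(graph, start):
    # Hierholzer's algorithm; a real assembler finds the unbalanced start node.
    graph = {node: list(nbrs) for node, nbrs in graph.items()}
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    # The walk spells the sequence: first node, then one base per edge.
    return walk[0] + "".join(node[-1] for node in walk[1:])

reads = ["ACGTT", "TTGCA"]                        # overlapping fragments of ACGTTGCA
print(eulerian_walk(de_bruijn(reads, 3), "AC"))   # -> ACGTTGCA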

Errors in the Contrail Pipeline: Debugging via Newt

• Contrail is a multi-staged pipeline with several MapReduce stages, so errors surface in three forms: fail-stop crashes, corrupted outputs, and suspicious actors.
• Fail-stop: the Newt "fail" API is triggered on a crash and reports the crash culprits to Newt.
• Corrupted outputs: find the inputs that lead to corruption or fail-stops, then prune the selected inputs and replay the entire pipeline.
• Crash avoidance: remove crash culprits immediately and continue.
  - No replay overhead; transparent fault handling.
  - Open question: how do we handle removed inputs that other dataflow paths use?
• Suspicious actors: identify suspicious actor behavior online, based on the actor's history (see the sketch after this list).
  - Processing rate (too slow, too fast); selectivity (n-to-1 instead of 1-to-1).
  - Open question: how much history to keep?
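A hypothetical sketch of that last idea. The window size, thresholds, and record format are assumptions for illustration, not part of Newt; the point is that a bounded window is one answer to the "how much history?" question:

from collections import deque

class ActorMonitor:
    """Flags an actor whose latest behavior deviates from its own history."""
    def __init__(self, window=100, rate_band=(0.25, 4.0), min_selectivity=0.5):
        self.history = deque(maxlen=window)       # bounded history
        self.rate_band = rate_band                # allowed multiple of the median rate
        self.min_selectivity = min_selectivity    # outputs per input before flagging

    def observe(self, records_in, records_out, seconds):
        self.history.append((records_in, records_out, seconds))

    def suspicious(self):
        flags = []
        if len(self.history) < 10:
            return flags                          # too little history to judge
        rates = sorted(r_in / secs for r_in, _, secs in self.history)
        median = rates[len(rates) // 2]
        r_in, r_out, secs = self.history[-1]
        rate = r_in / secs
        if not (self.rate_band[0] * median <= rate <= self.rate_band[1] * median):
            flags.append("rate")                  # too slow or too fast
        if r_in > 0 and r_out / r_in < self.min_selectivity:
            flags.append("selectivity")           # n-to-1 where ~1-to-1 is expected
        return flags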

Contrail with Newt

[Figure: backward tracing through the Contrail pipeline: short read files → Build Graph → Graph refinement → genome sequence. A legend marks each stage as porting completed or porting pending.]

Current Assembly Techniques

Assembler    Technology   CPU/RAM
Velvet       Serial       2 TB RAM
ABySS        MPI          168 cores x 96 hours
SOAPdenovo   Pthreads     40 cores x 40 hours, >140 GB RAM (total)

Contrail (Schatz et al. 2009) uses Hadoop: it builds a big graph (>3 B nodes, >10 B edges), iterates, and scales, but it is inefficient, slow (a 1.5 MB input takes 2 hours to assemble), and complex (40+ MR stages)!

Bulk Processing with State

[Figure: an incremental dataflow σ (Extract links → Count in-links: site/URL frequency → Merge w/ seen → Score and threshold) is run once over input D to produce output A, then again over only the changes ΔD to produce ΔA, with per-stage state carried between runs.]

1.) Process dataflow σ with input D.
2.) Create state.
3.) Process the changes, ΔD, together with the prior state: the stages update their state rather than reprocessing all of D. This saves CPU, disk, network, and energy! (A sketch follows below.)
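A minimal sketch of step 3 for the in-link counting stage, under the assumption that its state is simply the per-URL count table; the second run reads only the delta:

from collections import Counter

def count_inlinks(delta, state):
    # state: URL -> running in-link count; delta: iterable of (src, dst) links.
    changed = Counter()
    for _, dst in delta:
        state[dst] += 1
        changed[dst] = state[dst]
    return changed                        # only updated counts flow downstream

state = Counter()
D = [("a", "x"), ("b", "x"), ("c", "y")]
print(count_inlinks(D, state))            # full run: counts for x and y
delta_D = [("d", "x")]
print(count_inlinks(delta_D, state))      # incremental run touches only x -> 3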

Newt Lineage

• Newt is a provenance manager.
• It captures fine-grained provenance in MapReduce jobs.
• It traces data provenance through the multi-staged pipeline; the backward-tracing sketch below illustrates the idea.
• It replays actors with selected inputs.
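A minimal sketch of backward tracing over captured lineage. The record layout (per stage, each output ID maps to the input IDs that produced it) is an assumption for illustration, not Newt's actual schema:

def backward_trace(lineage, stage, output_id):
    # Follow output -> input links stage by stage back to the source data.
    frontier = {output_id}
    for records in reversed(lineage[:stage + 1]):
        frontier = {i for out, ins in records.items() if out in frontier for i in ins}
    return frontier                                  # candidate culprit inputs

# lineage[s] maps each output of stage s to the inputs that produced it.
lineage = [
    {"n1": ["read7"], "n2": ["read7", "read9"]},     # stage 0: build graph
    {"contig3": ["n1", "n2"]},                       # stage 1: graph refinement
]
print(backward_trace(lineage, 1, "contig3"))         # -> {'read7', 'read9'}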

Porting status

• The Build Graph stage is stateful (it saves src,dst information); 10 of Contrail's MR stages have been mapped.
• The Graph refinement stage is stateful and iterative (it refines the graph); 30 of Contrail's MR stages remain to be mapped. A sketch below shows how the stateful Build Graph stage might look as a translate function.
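Purely as an illustration (hypothetical, not Contrail's actual code): the stateful Build Graph stage expressed as a CBP-style translate function, keyed by the (k-1)-mer node, with the saved src,dst edges as its state.

def build_graph_translate(key, delta_kmers, state):
    # key: a (k-1)-mer node; delta_kmers: new k-mers starting at this node;
    # state: the src,dst edges already recorded for this node.
    state = state if state is not None else set()
    new_edges = {(key, kmer[1:]) for kmer in delta_kmers} - state
    state |= new_edges
    return sorted(new_edges), state       # emit only new edges downstream

out, st = build_graph_translate("AC", ["ACG"], None)
print(out)                                # [('AC', 'CG')]
out, st = build_graph_translate("AC", ["ACG", "ACT"], st)
print(out)                                # only the new edge: [('AC', 'CT')]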
