
Optimizing Large Genome Assembly in the Cloud

Apurva Kumar, Soumyarupa De, and Kenneth Yocum

University of California, San Diego

Genomic Analytics Data

• High-throughput sequencing of a human genome (African male NA18507) produces 3.5 billion 36-base-pair (bp) reads = 756 MB (cost: 1,000 USD).

Current problem: assemble a genome at lower cost and in minimal time.

Why this problem?

• Only 60 complete human genomes exist publicly (including Steve Jobs, who had his sequenced for 100K USD).
• Commercialization is still at an early stage and growing rapidly.

Question: how do we scale to this 3.5 B read input and compute the final genome sequence?

Optimizing using Stateful Bulk Processing

We make use of stateful Continuous Bulk Processing (CBP): an efficient stateful graph-processing model (it even supports Google's "Pregel" model) that we run on top of Azure.

[Figure: Contrail implemented as CBP stages. A stage's translate function T(key, ΔF_1, ΔF_state) consumes per-key deltas (Δin) from its input flows (F_in edges, F_in state) and emits deltas (Δout) on its output flows (F_out edges, F_out state); the state flow loops back into the stage. Caption: "Translate function for a stage with state as loopback flow".]
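To make the figure concrete, here is a minimal Python sketch of a CBP-style stage. The names (Stage, translate) and record layout are illustrative assumptions, not CBP's actual API: per key, translate receives the new input delta plus the prior state record from the loopback flow, and returns an output delta and an updated state record for the next epoch.

def translate(key, delta_in, state):
    # T(key, dF_in, dF_state) -> (dF_out, dF_state): emit only new records
    # and loop the accumulated state back for the next processing epoch.
    state = state if state is not None else set()
    new_records = set(delta_in) - state
    state |= new_records
    return list(new_records), state

class Stage:
    # Runs translate once per dirty key each epoch; state is the loopback flow.
    def __init__(self):
        self.state = {}                          # key -> state record

    def run_epoch(self, delta_f_in):             # delta_f_in: key -> new records
        delta_f_out = {}
        for key, records in delta_f_in.items():
            out, self.state[key] = translate(key, records, self.state.get(key))
            delta_f_out[key] = out
        return delta_f_out

stage = Stage()
print(stage.run_epoch({"AC": ["CG"]}))           # {'AC': ['CG']}
print(stage.run_epoch({"AC": ["CG", "GT"]}))     # duplicate suppressed: {'AC': ['GT']}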

Our Approach

Aim: assemble a human genome (3.5 B reads of 36 bp).

Pipeline: Genomic Analysis → Graph Processing → CBP on Azure (+ Newt for debugging).

Genome assembly using graphs

• Build a De Bruijn graph from the sample of short input reads.
• An Eulerian walk across the graph gives the genomic sequence; a toy sketch follows below.
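A toy Python sketch of this idea, under simplifying assumptions (error-free reads, no reverse complements, a known start node); a real assembler such as Contrail must handle all of these at scale:

from collections import defaultdict

def de_bruijn(reads, k):
    # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_walk(graph, start):
    # Hierholzer's algorithm; a real assembler finds the unbalanced start node.
    graph = {node: list(nbrs) for node, nbrs in graph.items()}
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    # The walk spells the sequence: first node, then one base per edge.
    return walk[0] + "".join(node[-1] for node in walk[1:])

reads = ["ACGTT", "TTGCA"]                        # overlapping fragments of ACGTTGCA
print(eulerian_walk(de_bruijn(reads, 3), "AC"))   # -> ACGTTGCA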

Errors in the Contrail Pipeline: Debugging via Newt

• Contrail is a multi-staged pipeline with several MapReduce stages, so errors surface in three forms: fail-stop crashes, corrupted outputs, and suspicious actors.
• Fail-stop: the Newt "fail" API is triggered on a crash and reports the crash culprits to Newt.
• Corrupted outputs: find the inputs that lead to corruption or fail-stops, then prune the selected inputs and replay the entire pipeline.
• Crash avoidance: remove crash culprits immediately and continue.
  - No replay overhead; transparent fault handling.
  - Open question: how do we handle removed inputs that other dataflow paths use?
• Suspicious actors: identify suspicious actor behavior online, based on the actor's history (see the sketch after this list).
  - Processing rate (too slow, too fast); selectivity (n-to-1 instead of 1-to-1).
  - Open question: how much history to keep?
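A hypothetical sketch of that last idea. The window size, thresholds, and record format are assumptions for illustration, not part of Newt; the point is that a bounded window is one answer to the "how much history?" question:

from collections import deque

class ActorMonitor:
    """Flags an actor whose latest behavior deviates from its own history."""
    def __init__(self, window=100, rate_band=(0.25, 4.0), min_selectivity=0.5):
        self.history = deque(maxlen=window)       # bounded history
        self.rate_band = rate_band                # allowed multiple of the median rate
        self.min_selectivity = min_selectivity    # outputs per input before flagging

    def observe(self, records_in, records_out, seconds):
        self.history.append((records_in, records_out, seconds))

    def suspicious(self):
        flags = []
        if len(self.history) < 10:
            return flags                          # too little history to judge
        rates = sorted(r_in / secs for r_in, _, secs in self.history)
        median = rates[len(rates) // 2]
        r_in, r_out, secs = self.history[-1]
        rate = r_in / secs
        if not (self.rate_band[0] * median <= rate <= self.rate_band[1] * median):
            flags.append("rate")                  # too slow or too fast
        if r_in > 0 and r_out / r_in < self.min_selectivity:
            flags.append("selectivity")           # n-to-1 where ~1-to-1 is expected
        return flags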

Contrail with Newt

[Figure: backward tracing through the Contrail pipeline: short read files → Build Graph → Graph refinement → genome sequence. A legend marks each stage as porting completed or porting pending.]

Current Assembly Techniques

Assembler    Technology   CPU/RAM
Velvet       Serial       2 TB RAM
ABySS        MPI          168 cores x 96 hours
SOAPdenovo   Pthreads     40 cores x 40 hours, >140 GB RAM (total)

Contrail (Schatz et al. 2009) uses Hadoop: it builds a big graph (>3 B nodes, >10 B edges), iterates, and scales, but it is inefficient, slow (a 1.5 MB input takes 2 hours to assemble), and complex (40+ MR stages)!

Bulk Processing with State

[Figure: an incremental dataflow σ (Extract links → Count in-links: site/URL frequency → Merge w/ seen → Score and threshold) is run once over input D to produce output A, then again over only the changes ΔD to produce ΔA, with per-stage state carried between runs.]

1.) Process dataflow σ with input D.
2.) Create state.
3.) Process the changes, ΔD, together with the prior state: the stages update their state rather than reprocessing all of D. This saves CPU, disk, network, and energy! (A sketch follows below.)
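A minimal sketch of step 3 for the in-link counting stage, under the assumption that its state is simply the per-URL count table; the second run reads only the delta:

from collections import Counter

def count_inlinks(delta, state):
    # state: URL -> running in-link count; delta: iterable of (src, dst) links.
    changed = Counter()
    for _, dst in delta:
        state[dst] += 1
        changed[dst] = state[dst]
    return changed                        # only updated counts flow downstream

state = Counter()
D = [("a", "x"), ("b", "x"), ("c", "y")]
print(count_inlinks(D, state))            # full run: counts for x and y
delta_D = [("d", "x")]
print(count_inlinks(delta_D, state))      # incremental run touches only x -> 3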

Newt Lineage

• Newt is a provenance manager.
• It captures fine-grained provenance in MapReduce jobs.
• It traces data provenance through the multi-staged pipeline; the backward-tracing sketch below illustrates the idea.
• It replays actors with selected inputs.
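A minimal sketch of backward tracing over captured lineage. The record layout (per stage, each output ID maps to the input IDs that produced it) is an assumption for illustration, not Newt's actual schema:

def backward_trace(lineage, stage, output_id):
    # Follow output -> input links stage by stage back to the source data.
    frontier = {output_id}
    for records in reversed(lineage[:stage + 1]):
        frontier = {i for out, ins in records.items() if out in frontier for i in ins}
    return frontier                                  # candidate culprit inputs

# lineage[s] maps each output of stage s to the inputs that produced it.
lineage = [
    {"n1": ["read7"], "n2": ["read7", "read9"]},     # stage 0: build graph
    {"contig3": ["n1", "n2"]},                       # stage 1: graph refinement
]
print(backward_trace(lineage, 1, "contig3"))         # -> {'read7', 'read9'}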

Porting status

• The Build Graph stage is stateful (it saves src,dst information); 10 of Contrail's MR stages have been mapped.
• The Graph refinement stage is stateful and iterative (it refines the graph); 30 of Contrail's MR stages remain to be mapped. A sketch below shows how the stateful Build Graph stage might look as a translate function.
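Purely as an illustration (hypothetical, not Contrail's actual code): the stateful Build Graph stage expressed as a CBP-style translate function, keyed by the (k-1)-mer node, with the saved src,dst edges as its state.

def build_graph_translate(key, delta_kmers, state):
    # key: a (k-1)-mer node; delta_kmers: new k-mers starting at this node;
    # state: the src,dst edges already recorded for this node.
    state = state if state is not None else set()
    new_edges = {(key, kmer[1:]) for kmer in delta_kmers} - state
    state |= new_edges
    return sorted(new_edges), state       # emit only new edges downstream

out, st = build_graph_translate("AC", ["ACG"], None)
print(out)                                # [('AC', 'CG')]
out, st = build_graph_translate("AC", ["ACG", "ACT"], st)
print(out)                                # only the new edge: [('AC', 'CT')]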
