Optimizing Large Genome Assembly in the Cloud
Apurva Kumar, Soumyarupa De and Kenneth Yocum
University of California, San Diego
Genomic Analytics Data
• High-throughput sequencing of a human (African male NA18507) produces 3.5 billion 36-base-pair (bp) reads = 756 MB (cost: 1000 USD).
Current problem: assemble the genome at lower cost and in minimal time.
Why this problem?
• Only 60 complete human genomes are publicly available (including Steve Jobs, who had his sequenced for 100K USD).
• Commercialization is still at an early stage and growing rapidly.
Question: How do we scale processing of this 3.5 B-read input to compute the final genome sequence?
Optimizing using Stateful Bulk Processing
Newt Lineage
Current Assembly Techniques
Make use of stateful Continuous Bulk Processing (CBP): an efficient stateful graph-processing model (it even supports Google’s “Pregel”), running on top of Azure.
[Figure: Contrail implemented as CBP stages. Input flows (F_in edges), output flows (F_out edges), and state flows (F_in state, F_out state) carry per-key Δin/Δout deltas; the translate function T(·, ΔF_state, ΔF_in) runs each stage with state as a loopback flow.]
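The translate function in the figure can be sketched roughly as follows. This is a hypothetical per-key Python sketch, not CBP's actual API (the names `translate`, `state_delta`, and `input_delta` are assumptions): state is merged on the loopback flow while only genuinely new records are emitted downstream.

```python
# Hypothetical per-key translate function T for a CBP stage with a
# loopback state flow (names are assumptions, not CBP's actual API).

def translate(key, state_delta, input_delta):
    """Merge new input records into state; emit only genuinely new output."""
    state = set(state_delta) if state_delta is not None else set()
    new_records = set(input_delta) - state
    state |= new_records
    # First element is routed back on the loopback (state) flow,
    # second is the output-flow delta for downstream stages.
    return state, sorted(new_records)

# One grouped invocation: prior state {"a->b"}, two incoming records.
state, out = translate("node42", {"a->b"}, ["a->b", "b->c"])
# state == {"a->b", "b->c"}; out == ["b->c"]
```

Because the prior state arrives on the loopback flow, the record "a->b" is recognized as already seen and only "b->c" reaches the output flow.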
Our Approach
Aim: assemble a human genome (3.5 B reads of 36 bp).
Layered stack: Genomic Analysis on top of Graph Processing on top of CBP on Azure (+ Newt for debugging).
Genome Assembly Using Graphs
Bulk Processing with State
Errors in Contrail Pipeline
Debugging via Newt
Contrail with Newt
• Multi-staged pipeline with several MapReduce stages
• Types of errors: fail-stop, corrupted outputs, suspicious actors
• De Bruijn graph for a sample of short input reads
• An Eulerian walk across the graph yields the genomic sequence
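As a toy illustration of the De Bruijn graph / Eulerian-walk idea (not Contrail's actual implementation; it assumes error-free reads and deduplicated k-mers), one can sketch:

```python
# Toy De Bruijn assembly sketch (assumed names; not Contrail's code).
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_walk(graph, start):
    """Hierholzer's algorithm: consume every edge exactly once."""
    g = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if g.get(u):
            stack.append(g[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

reads = ["AACG", "ACGT", "CGTT"]          # reads covering genome "AACGTT"
walk = eulerian_walk(de_bruijn(reads, 3), "AA")
sequence = walk[0] + "".join(node[-1] for node in walk[1:])
# sequence == "AACGTT"
```

At Contrail's scale (>3 B nodes) the graph no longer fits in one machine's memory, which is why the poster distributes exactly this structure over MapReduce/CBP stages.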
• Newt “fail” API triggered on crash: reports crash culprits to Newt
• Find inputs that lead to corruption or fail-stops
  - Prune the selected inputs and replay the entire pipeline
• Crash avoidance: remove crash culprits immediately and continue
  - No replay overhead; transparent fault handling
  - Open question: how to handle removed inputs used on other dataflow paths?
• Online identification of suspicious actor behavior based on the actor’s history
  - Processing rate (too slow, too fast), selectivity (n-to-1 instead of 1-to-1)
  - Open question: how much history?
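A minimal sketch of how online suspicious-actor detection from rate and selectivity history might look; `ActorMonitor` and its thresholds are hypothetical assumptions illustrating the idea, not Newt's actual mechanism.

```python
# Hypothetical online suspicious-actor monitor (ActorMonitor and its
# thresholds are assumptions sketching the idea, not Newt's mechanism).
from collections import deque

class ActorMonitor:
    def __init__(self, history=100, tolerance=3.0):
        self.rates = deque(maxlen=history)   # bounded history window
        self.tolerance = tolerance

    def observe(self, records_in, records_out, seconds):
        """Flag rates far from the historical mean, or n-to-1 selectivity."""
        rate = records_in / seconds           # assumes seconds > 0
        selectivity = records_out / records_in
        suspicious = False
        if len(self.rates) >= 10:             # need some history first
            mean = sum(self.rates) / len(self.rates)
            if rate > self.tolerance * mean or rate < mean / self.tolerance:
                suspicious = True             # too fast or too slow
        if selectivity < 0.5:                 # n-to-1 where 1-to-1 expected
            suspicious = True
        self.rates.append(rate)
        return suspicious

monitor = ActorMonitor()
baseline = [monitor.observe(100, 100, 1.0) for _ in range(10)]  # all normal
spike = monitor.observe(1000, 1000, 1.0)    # rate anomaly: too fast
collapse = monitor.observe(100, 10, 1.0)    # selectivity anomaly: n-to-1
```

The bounded `deque` makes the "how much history?" question concrete: the window length directly trades detection sensitivity against adaptation to legitimate workload shifts.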
Backward Tracing
[Figure: assembly pipeline. Short-read files feed Build Graph, then Graph Refinement, then the final Genome Sequence. Legend: Build Graph porting completed; Graph Refinement porting pending.]
| Assembler | Technology | CPU/RAM |
|---|---|---|
| Velvet | Serial | 2 TB RAM |
| ABySS | MPI | 168 cores × 96 hours |
| SOAPdenovo | Pthreads | 40 cores × 40 hours, >140 GB RAM (total) |
[Figure: incremental web-crawl dataflow example. Input data D flows through stages Extract Links, Count In-links (site/URL frequency), Merge w/ Seen, and Score-and-Threshold to produce output A; each stage keeps state, so a later run consumes only the change ΔD and emits ΔA.]
1.) Process dataflow σ with input D.
2.) Create state.
3.) Process changes ΔD against the prior state: the stage updates, it does not reprocess.
Saves CPU, disk, network, and energy!
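The three steps above can be illustrated with a toy incremental version of the figure's count-in-links stage; `count_inlinks` is a hypothetical sketch, assuming per-key state survives between runs.

```python
# Toy incremental version of the count-in-links stage (count_inlinks is
# a hypothetical sketch; per-key state is assumed to survive between runs).

def count_inlinks(state, delta_links):
    """state: {url: in-link count}; delta_links: new (src, dst) edges."""
    delta_out = {}
    for _src, dst in delta_links:
        state[dst] = state.get(dst, 0) + 1
        delta_out[dst] = state[dst]       # emit only the changed keys
    return delta_out

state = {}
count_inlinks(state, [("a", "x"), ("b", "x"), ("a", "y")])  # initial run on D
delta = count_inlinks(state, [("c", "x")])                  # incremental ΔD
# delta == {"x": 3}: only "x" was touched; "y" was not reprocessed
```

The incremental run does work proportional to |ΔD|, not |D|, which is the source of the CPU, disk, network, and energy savings the poster claims.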
• Newt: provenance manager
  - Captures fine-grained provenance in MapReduce jobs
  - Traces data provenance through the multi-staged pipeline
  - Replays actors with selected inputs
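Backward tracing over such fine-grained provenance can be sketched as follows; the `backward_trace` helper and the record names are hypothetical, assuming each stage logs which input record ids produced each output record id.

```python
# Hypothetical backward-tracing sketch: each stage is assumed to log
# which input record ids produced each output record id.

def backward_trace(provenance, stages, bad_output):
    """provenance[stage][out_id] -> set of input ids for that stage."""
    frontier = {bad_output}
    for stage in reversed(stages):        # walk the pipeline backwards
        frontier = set().union(
            *(provenance[stage].get(rec, {rec}) for rec in frontier))
    return frontier                       # culprit inputs to prune, then replay

prov = {
    "build_graph":  {"edge7": {"read1", "read9"}},
    "refine_graph": {"contig3": {"edge7"}},
}
culprits = backward_trace(prov, ["build_graph", "refine_graph"], "contig3")
# culprits == {"read1", "read9"}
```

Once the culprit inputs are known, the pipeline can either be replayed without them (prune and replay) or, in crash-avoidance mode, keep running with the culprits dropped on the fly.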
• Build Graph stage is stateful (saves src, dst information): 10 MR stages (Contrail) mapped.
• Graph Refinement stage is stateful and iterative (refines the graph): 30 MR stages (Contrail) still to be mapped.
Contrail (Schatz et al. 2009) uses Hadoop: it builds a big graph (>3 B nodes and >10 B edges) and iterates. It scales, but it is inefficient, slow (a 1.5 MB input takes 2 hours to assemble), and complex (40+ MR stages)!