RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
Cluster 2015, Chicago, Illinois
Motivation
• Sequencing costs are decreasing
• Available data is increasing!
• Parallel processing is inevitable!
*Adapted from www.genome.gov/sequencingcosts
*Adapted from www.nlm.nih.gov/about/2015CJ.html
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
Alignment File-1
Sequences 1 2 3 4 5 6 7 8
Read-1 A G C G
Read-2 G C G G
Read-3 G C G T A
Read-4 C G T T C C
Reference A G C G T A C C
Alignment File-2
Sequences 1 2 3 4 5 6 7 8
Read-1 A G A G
Read-2 A G A G T
Read-3 G A G T
Read-4 G T T C C
*Adapted from Wikipedia
• A single SNP may cause a Mendelian disease!
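A toy sketch of what SNP calling does with the reads listed on this slide. The start offsets of the reads are assumptions (the column alignment of the original figure is not preserved here), and `call_snps`, `min_depth` and `min_fraction` are hypothetical names and thresholds chosen for illustration, not the method of any tool named in this talk.

```python
from collections import Counter

REFERENCE = "AGCGTACC"

# (assumed 0-based start offset, read sequence); offsets are illustrative guesses
ALIGNMENT_FILE_1 = [(0, "AGCG"), (1, "GCGG"), (1, "GCGTA"), (2, "CGTTCC")]
ALIGNMENT_FILE_2 = [(0, "AGAG"), (0, "AGAGT"), (1, "GAGT"), (3, "GTTCC")]

def call_snps(reference, reads, min_depth=2, min_fraction=0.8):
    """Report positions where most reads agree on a base that differs from the reference."""
    # Build a pileup: one base counter per reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for i, base in enumerate(seq):
            pileup[start + i][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads cover this position to decide
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos + 1, reference[pos], base))  # 1-based position
    return snps

print(call_snps(REFERENCE, ALIGNMENT_FILE_1))
print(call_snps(REFERENCE, ALIGNMENT_FILE_2))
```

Under these assumed offsets, an isolated mismatch in a single read is treated as a likely sequencing error, while a mismatch shared by all covering reads is reported as a SNP.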
Existing Solutions for Implementation
• Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
• Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
• Middleware systems
– Hadoop
  • Not designed for the specific needs of genetic data
  • Limited programmability
– Genome Analysis Tool Kit (GATK)
  • Designed for genetic data processing
  • Provides special data traversal patterns
  • Limited parallelization for some of its tools
Our Goal
• We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Works with different popular genetic data formats
– Allows the use of existing programs
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention
Figure: Coverage Variance
Background: PAGE (IPDPS'14)
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
Parallel Genomic Applications
• RE-PAGE: A Map-Reduce-like middleware for easy parallelization of data-intensive genomic applications (like PAGE)
• Main goals (unlike PAGE)
– Decrease I/O contention by employing a distributed file system
– Balance the workload in data-intensive tasks
– Avoid data transfers
Execution Model
RE-PAGE
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
• Applicability
– The algorithm should be safe to parallelize by processing different regions of the genome independently
– SNP calling, statistical tools and others
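A minimal sketch of the "mappers and reducers are executable programs" idea, assuming the executables read their input on stdin and write results to stdout. How RE-PAGE actually passes data to executables is not stated on the slide; `run_executable`, `MAPPER`, `REDUCER` and `map_reduce` are hypothetical names, and the stand-in programs are tiny Python one-liners rather than real genomic tools.

```python
import subprocess
import sys

def run_executable(command, input_text):
    """Run an external program as a map or reduce task, feeding it input on stdin."""
    result = subprocess.run(
        command, input=input_text, capture_output=True, text=True, check=True
    )
    return result.stdout

# Stand-in mapper: counts the lines (reads) in the region it is given.
MAPPER = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]
# Stand-in reducer: sums the per-region counts produced by the mappers.
REDUCER = [sys.executable, "-c", "import sys; print(sum(int(x) for x in sys.stdin))"]

def map_reduce(regions):
    """Apply the mapper to each region independently, then reduce the partial outputs."""
    partials = [run_executable(MAPPER, region) for region in regions]
    return run_executable(REDUCER, "".join(partials)).strip()

print(map_reduce(["read1\nread2\n", "read3\n"]))
```

Because the tasks are ordinary processes, any existing tool in any language could stand in for `MAPPER` or `REDUCER` as long as it follows the same input and output convention.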
RE-PAGE Parallelization
• PAGE can parallelize all applications that have the following property
– M: Map task
– R, R1 and R2 are three regions such that R = concatenation of R1 and R2
– M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
Figure: region R split into adjacent regions R1 and R2
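The property M(R) = M(R1) ⊕ M(R2) can be checked on a toy example. In this sketch the map tasks and reduction functions are illustrative assumptions (`base_counts`, `count_gt` and `satisfies_property` are hypothetical names): counting bases per region satisfies the property, while counting a two-base motif does not, because a match can straddle the boundary between R1 and R2.

```python
import operator
from collections import Counter

def base_counts(region):
    """Map task: count each base in a region."""
    return Counter(region)

def count_gt(region):
    """Map task: count occurrences of the motif 'GT' in a region."""
    return region.count("GT")

def satisfies_property(map_task, reduce_fn, r1, r2):
    """Check M(R) == M(R1) (+) M(R2) for R = R1 concatenated with R2."""
    return map_task(r1 + r2) == reduce_fn(map_task(r1), map_task(r2))

R1, R2 = "AGCG", "TACC"

# Base counting is independent per position, so splitting the region is safe.
print(satisfies_property(base_counts, operator.add, R1, R2))
# 'GT' occurs only across the R1/R2 boundary, so the split loses it.
print(satisfies_property(count_gt, operator.add, R1, R2))
```

An analysis of the second kind would need extra handling at region boundaries before it could be parallelized this way.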
Domain-Specific Data Chunks
• Heuristic: Data from the same genomic location/region are likely to be related and, for many types of genomic data analysis, will most likely be processed together
• Construct data chunks according to genomic region
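One way to picture region-based chunking, as a sketch only: records from any number of input files are grouped by the genomic region they fall in. The fixed-size windows, the `(chromosome, position, payload)` record layout, and the names `build_chunks` and `region_size` are assumptions for illustration; the slide does not say how region boundaries are chosen.

```python
from collections import defaultdict

def build_chunks(records, region_size):
    """Group (chromosome, position, payload) records into chunks keyed by genomic region."""
    chunks = defaultdict(list)
    for chrom, pos, payload in records:
        region_index = pos // region_size  # fixed-size window containing this position
        chunks[(chrom, region_index)].append((chrom, pos, payload))
    return dict(chunks)

# Records from two hypothetical samples; nearby positions end up in the same chunk.
records = [
    ("chr1", 10, "sample1"),
    ("chr1", 1500, "sample1"),
    ("chr1", 20, "sample2"),
    ("chr2", 10, "sample1"),
]
for region, members in sorted(build_chunks(records, region_size=1000).items()):
    print(region, len(members))
```

With this layout, a task that analyzes one region finds all of its input in a single chunk, whichever file the records came from.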
Proposed Replication Method
• Replication is needed to increase data locality
• Replicating all chunks onto all nodes is not feasible
• Depending on the target analysis, some genomic regions can be more important than others
• General idea: Replicate important regions more than others
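A hedged sketch of "replicate important regions more than others". The slide does not say how importance is measured or how it maps to a replica count, so the linear scaling below, the importance scores, and the names `replication_factors`, `min_replicas` and `max_replicas` are all assumptions made for illustration.

```python
def replication_factors(importance, min_replicas, max_replicas):
    """Map non-negative importance scores to replica counts in [min_replicas, max_replicas]."""
    top = max(importance.values(), default=0.0)
    factors = {}
    for region, score in importance.items():
        share = score / top if top > 0 else 0.0  # relative importance in [0, 1]
        extra = int(share * (max_replicas - min_replicas) + 0.5)  # round half up
        factors[region] = min_replicas + extra
    return factors

# Hypothetical scores: regions the target analysis touches often score higher.
importance = {"chr1:exons": 1.0, "chr1:introns": 0.5, "chr1:intergenic": 0.0}
print(replication_factors(importance, min_replicas=1, max_replicas=5))
```

Every region keeps at least `min_replicas` copies, and only the most important regions approach `max_replicas`, which keeps total storage well below replicating everything everywhere.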
Proposed Scheduling Schemes
• Problem definition
– Chunks can vary in size and in number of replicas
– Tasks are data intensive: data transfer costs outweigh data processing costs
• General approach
– Avoid remote processing
– Take advantage of the variety in replication factors and data sizes
• Master & worker approach
• We propose 3 scheduling schemes
– Largest Chunk First (LCF)
– Help the Busiest Node (HBN)
– Effective Memory Management (EMM)
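The slide names the schemes without giving their details, so the following is only a plausible reading of Largest Chunk First under the stated general approach (master and worker, no remote processing): whenever a worker becomes free, the master hands it the largest unprocessed chunk that has a replica on that worker. The function name, the data layout, and the simulation of processing time as size divided by `speed` are assumptions.

```python
import heapq

def largest_chunk_first(chunk_sizes, replicas, num_workers, speed=1.0):
    """Simulate local-only scheduling.

    chunk_sizes: {chunk_id: size in MB}
    replicas:    {chunk_id: set of worker ids holding a copy}
    Returns (schedule, makespan); schedule entries are (worker, chunk, start, end).
    """
    remaining = set(chunk_sizes)
    free_workers = [(0.0, w) for w in range(num_workers)]  # (time free, worker id)
    heapq.heapify(free_workers)
    schedule, makespan = [], 0.0

    while remaining and free_workers:
        free_at, worker = heapq.heappop(free_workers)
        local = [c for c in remaining if worker in replicas[c]]
        if not local:
            continue  # nothing local is left for this worker, so it retires
        # Largest local chunk first; the chunk id breaks ties deterministically.
        chunk = max(local, key=lambda c: (chunk_sizes[c], c))
        remaining.remove(chunk)
        done_at = free_at + chunk_sizes[chunk] / speed
        schedule.append((worker, chunk, free_at, done_at))
        makespan = max(makespan, done_at)
        heapq.heappush(free_workers, (done_at, worker))
    return schedule, makespan

chunk_sizes = {"a": 90, "b": 60, "c": 30, "d": 20}
replicas = {"a": {0, 1}, "b": {1}, "c": {0}, "d": {0, 1}}
schedule, makespan = largest_chunk_first(chunk_sizes, replicas, num_workers=2)
for entry in schedule:
    print(entry)
print("makespan:", makespan)
```

Starting with large chunks leaves the small ones to fill gaps near the end, which is the usual motivation for largest-first heuristics when chunk sizes vary widely.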
Experiments (1)
• Computation power: 32 nodes (256 cores)
• Average data chunk size: 32 MB
• Replication factor: 3
• Number of chunks: 2000
• Plot panels: Varying STD of Data Blocks; Varying Computation Speed
• Average size of chunks in real genomic data: 68 MB
• STD of chunk sizes in real genomic data: 63 MB
• Processing speed: 1 MB/sec
• STD of chunk sizes: 24 MB
Experiments (2)
Comparison with a Centralized Approach
• Computation power: 32 nodes (256 cores)
• Replication factor: 3
• Application: Coverage Analyzer
Experiments (3)
Parallel Scalability
• Application: Coverage Analyzer
– Data size: 15 SAM files (47 GB)
– Replication factor: 3
• Application: Unified Genotyper
– Data size: 40 BAM files (51 GB)
– Replication factor: 3 (only RE-PAGE)
Plot annotations: 4.2x, 2.2x, 7.1x, 9.9x
Summary
• RE-PAGE for developing parallel data-intensive genomic applications
– Programming
  • Employs executables of genomic applications
  • Can parallelize a wide range of applications
– Performance
  • Keeps data in a distributed file system
  • Minimizes data transfer
  • Employs an intelligent replication method
• RE-PAGE outperforms Hadoop and GATK and has good parallel scalability results
• Observation
– Prohibiting remote tasks increases performance if chunks have varying sizes and tasks are data intensive
Thank you!