Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc
Exploring Parallelism in Short Sequence Mapping Using Burrows-Wheeler Transform
Doruk Bozdag1, Ayat Hatem1,2, Umit V. Catalyurek1,2
Department of Biomedical Informatics
Department of Electrical and Computer Engineering
The Ohio State University
IPDPS - HiCOMB 2010
Ninth IEEE International Workshop on High Performance Computational Biology
Outline
• Motivation
• Burrows-Wheeler Transform
• Parallelization strategies
• Experimental results
• Conclusion and future work
HiCOMB, Atlanta, GA, Apr 19th, 2010. U. Catalyurek, "Parallel Short Seq. Mapping w/ BWT"
Motivation
SEQUENCING
• High throughput sequencing instruments (SOLiD, Solexa, 454) can sequence more than 1 billion bases a day
• Hundreds of millions of 35-50 base reads
MAPPING
• Map reads to a reference genome efficiently
  - Human genome: 3Gb
• Sequential mapping takes about a day
  - Need fast, parallel algorithms that can handle mismatches
ANALYSIS
Our Ultimate Goal
• Develop generic parallelization framework
• Identify limitations due to the application scenarios and tools
• Find the “right” tool for the given problem
• Find the best way to parallelize for a given tool and scenario
• Quality vs. runtime tradeoffs
• This work is a second step towards that goal [Bozdag IPDPS’09]
Short Sequence Mapping Tools
• Many tools have been developed:
  • MapReads, MAQ, RMAP, SHRiMP, ZOOM, mrFAST, SOCS, PASS
• State of the art tools:
  • BWA, Bowtie, and SOAPv2
  • All of them are based on the Burrows-Wheeler Transform
• Two step mapping approach:
  • Build the index for the reference genome
  • Map reads to the index
Burrows-Wheeler Transform
• The Burrows-Wheeler transform (BWT) of a text T is a reversible permutation of the characters in that text
• Originally designed for data compression
• Used by data indexing techniques due to its efficiency
• A BWT-based index can be searched with a small memory footprint
• An exact string matching algorithm for searching a BWT-based index was introduced in [FOCS 2000]
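The two ideas on this slide, the transform itself and exact matching over a BWT-based index, can be illustrated with a toy Python sketch. It is not the code used by BWA, Bowtie, or SOAP: it builds the suffix array by naive sorting and stores full occurrence tables, which is only practical for tiny texts. The function names `bwt_index` and `backward_search` and the `$` sentinel are assumptions made for this illustration.

```python
def bwt_index(text):
    """Build a toy BWT index: BWT string, suffix array, C table, Occ table."""
    text += "$"  # sentinel, assumed to sort before every other character
    # Naive suffix array: fine for a demo, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return bwt, sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix array rows
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no occurrence
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    bwt, sa, c_table, occ = bwt_index("GATTACAGATTACA")
    print(bwt)
    print(backward_search("TTA", sa, c_table, occ))
```

Each pattern character narrows the suffix array interval with two table lookups, so the search cost depends on the read length rather than on the genome length. That property is what makes the approach attractive for mapping millions of short reads.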
Inexact Matching
• BWA, SOAP, and Bowtie use an exact matching algorithm based on the BWT index
• Each one provides a different method to handle inexact matches
[Figure: inexact matching of the example read GACGTTACA in each tool; single-base variants shown include GACGTTACC, GACGTTACG, GACGTTACT, GACGTTAAA, GACGTTAGA, GACGTTATA]
• SOAP: Split read into 3 fragments for hits that allow two mismatches
• BWA: Enumerate all possible strings
• Bowtie: Match not found at T, backtrack, change, find exact match
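The fragment-splitting idea can be made concrete with the pigeonhole principle: if a read aligns with at most k mismatches and is cut into k + 1 fragments, at least one fragment must match exactly. The Python sketch below illustrates that principle only and is not SOAP's implementation. It uses `str.find` as a stand-in for the exact lookup in a BWT index, and the function names, the example genome, and the assumption that the read is longer than k bases are all hypothetical.

```python
def split_fragments(read, k):
    """Split read into k + 1 nearly equal fragments, keeping their offsets."""
    n, parts = len(read), k + 1
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_with_mismatches(read, genome, k):
    """Positions where read aligns to genome with at most k substitutions."""
    hits = set()
    for offset, frag in split_fragments(read, k):
        # Exact seed lookup; a real mapper would query the BWT index here.
        start = genome.find(frag)
        while start != -1:
            pos = start - offset  # implied start of the whole read
            if 0 <= pos <= len(genome) - len(read):
                # Verify the full read around the exact seed.
                if hamming(read, genome[pos:pos + len(read)]) <= k:
                    hits.add(pos)
            start = genome.find(frag, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    read = "GACGTTACA"
    # Hypothetical genome: one exact copy and one copy with two substitutions.
    genome = "AAGACGTTACATTTTGACCTTAGAAA"
    for k in (0, 1, 2):
        print(k, map_with_mismatches(read, genome, k))
```

With k = 2 the read is cut into three fragments, matching the "3 fragments for two mismatches" rule on the slide. The cost of the approach is that shorter seeds produce more candidate positions to verify.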
Short Sequence Mapping
• Quality of mapping depends on different factors
• Improved quality → increased computational cost
• Tools provide different options to compromise quality in order to limit the computational cost
• Solution: parallel processing strategies
Parallelization Strategies
• Partition reads into NR parts
  • Mapping a very large number of reads to a small genome
• Partition genome into NG parts
  • Mapping a small number of reads to a large genome
[Figure: partitioning of Reads and Genome]
Parallelization Strategies
• Partition reads and genome
  • Deciding number of read parts (NR) and genome parts (NG) depends on the number of reads and size of the genome
• Two main application scenarios
  • Index the genome each time for matching:
    • Partition reads and genome
  • Index the genome once:
    • Partition reads only
[Figure: partitioning of both Genome and Reads]
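A minimal Python sketch of the combined strategy: enumerate the NR x NG grids that use every node, then give each node one (read part, genome part) pair. The function names and the node numbering are assumptions for illustration, and the sketch ignores reads that straddle a genome part boundary, which a real implementation would have to handle.

```python
def grid_configs(nodes):
    """All (NR, NG) pairs with NR * NG == nodes."""
    return [(nr, nodes // nr) for nr in range(1, nodes + 1) if nodes % nr == 0]


def chunk_bounds(total, parts):
    """Half-open [start, end) bounds of `parts` nearly equal chunks."""
    return [(i * total // parts, (i + 1) * total // parts) for i in range(parts)]


def make_tasks(num_reads, genome_len, nr, ng):
    """Map each node id to the (read range, genome range) it processes."""
    read_chunks = chunk_bounds(num_reads, nr)
    genome_chunks = chunk_bounds(genome_len, ng)
    return {
        r * ng + g: (read_chunks[r], genome_chunks[g])
        for r in range(nr)
        for g in range(ng)
    }


if __name__ == "__main__":
    print(grid_configs(32))
    # Toy sizes: 100 reads and a 1000-base genome on a 2 x 2 grid.
    for node, (reads, genome) in make_tasks(100, 1000, 2, 2).items():
        print(node, reads, genome)
```

Setting NG = 1 gives the "partition reads only" case, where every node searches the whole genome index, and NR = 1 gives the "partition genome only" case, where every node sees all reads.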
Experimental Setup
• Compared Bowtie v0.10.1, BWA v0.5.0, SOAP v2.20
• Experiments on 32-node dual 2.4 GHz Opteron cluster with 8GB of memory per node
• Used three reference genomes: human (3.1 Gbp), zebrafish (1.5 Gbp) and lancelet (0.9 Gbp)
• Reads:
  • Real data from a single run of SOLiD system of length 50bp
  • Synthetic data generated by wgsim of length 70bp
    • The wgsim tool is part of the SAMtools package
Experiments on Real Data
• Number of nodes: 32
• NR x NG: 1x32, 2x16, 4x8, 8x4, 16x2, 32x1
• G: Human. R: 130M
• Bowtie best configuration: 4x8. BWA and SOAP: 8x4
[Figure: results panels for Bowtie, BWA, and SOAP]
Experiments on Synthetic Data
• Number of nodes: 16
• NR x NG: 1x16, 2x8, 4x4, 8x2, 16x1
• G: Zebrafish, R: 4M, 16M, 64M
• Matching time >> indexing time, larger NR is better
[Figure: results panels for Bowtie, SOAP, and BWA]
Experiments on Synthetic Data
• Number of nodes: 16
• NR x NG: 1x16, 2x8, 4x4, 8x2, 16x1
• G: Lancelet, Zebrafish, Human. R: 16M
• Larger NG speeds up the indexing phase
[Figure: results panels for Bowtie, SOAP, and BWA]
Experiments on Synthetic Data
• Best configuration provided a 2.2x to 10.7x speedup on 16 nodes
[Figure: Bowtie sequential and parallel running time]
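To put the reported speedup range in context, parallel efficiency is the speedup divided by the number of nodes. The short Python sketch below applies that standard definition to the two endpoints quoted on the slide; the function name is an assumption, and the slides themselves do not report efficiency.

```python
def parallel_efficiency(speedup, nodes):
    """Standard definition: achieved speedup as a fraction of the node count."""
    return speedup / nodes


if __name__ == "__main__":
    for speedup in (2.2, 10.7):  # endpoints of the range reported on 16 nodes
        eff = parallel_efficiency(speedup, 16)
        print(f"speedup {speedup}x on 16 nodes: efficiency {eff:.1%}")
```

On 16 nodes this corresponds to roughly 14% efficiency at the low end and roughly 67% at the high end, which shows how strongly the result depends on the chosen NR x NG configuration and input scenario.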
Experiments on Synthetic Data
• Some unexpected results
[Figure: Bowtie sequential and parallel running time]
Experiments on Synthetic Data
[Figure: 16M reads (R1, R2) mapped against the whole genome (80% mapped), against 2 genome portions (40% mapped in each), and against 16 genome portions (5% mapped in each)]
Experiments on Synthetic Data
Read set    Mapped reads    Bowtie matching time
R1          90.04%          121
R2          4.2%            482
• The inexact matching phase increases the running time: the read set with few mapped reads (R2) takes about four times longer to match than the one with most reads mapped (R1)
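The effect can be illustrated with a deliberately simple model in Python. It assumes, as in the slide's illustration, that mapped reads are spread evenly over the genome parts, so a node holding one of NG parts maps only 1/NG of the reads that would map to the whole genome. Every other read on that node fails to map and therefore pays for the full inexact search. The function name and the uniformity assumption are not from the talk.

```python
def unmapped_fraction(mapped_whole, ng):
    """Fraction of reads that do not map to one of ng equal genome parts.

    mapped_whole is the fraction of reads that map to the whole genome;
    mapped reads are assumed to be spread evenly over the parts.
    """
    return 1.0 - mapped_whole / ng


if __name__ == "__main__":
    for ng in (1, 2, 16):  # whole genome, 2 portions, 16 portions
        frac = unmapped_fraction(0.80, ng)
        print(f"NG = {ng:2d}: {frac:.0%} of reads unmapped on each node")
```

Under this model the share of reads that exhaust the inexact search on each node grows from 20% on the whole genome to 95% with 16 genome portions, which is consistent with the R1 versus R2 timing above.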
Inexact vs Exact Matching in Bowtie
• R: 16M
• G: Human
• NG: 1, 2, 4, 8, 16, 32
• As NG increases, the smaller index offsets part of the inexact matching overhead, but not enough to cancel it
Varying Number of Mismatches
• G: Human, R: 21M
• Number of mismatches: 0, 1, 2
• SOAP is not designed for finding “at most” 1 mismatch: requires 2 executions
Conclusions
• Experimented with different data distribution strategies
• Tested on the state of the art mapping tools: Bowtie, BWA, and SOAP
• For the Index+Matching scenario, observed 2.2 to 10.7 speedup on 16 nodes
• Best data distribution strategy depends on:
  • Input scenario: genome size, number of reads, and number of nodes
  • Relative efficiency of the indexing and matching steps
  • Algorithm used for inexact matching
• When the index is built only once, it is better to use read partitioning only
• Although our intention was not to compare the tools:
  • Bowtie indexing is relatively slow
  • On the human genome, Bowtie is faster for exact matching, but SOAP becomes the fastest as the number of mismatches increases
Future Work: Our Goal
• Develop generic parallelization framework
• Identify limitations due to the application scenarios and tools
• Find the “right” tool for the given problem
• Further analysis of tools and methods is needed
• Quality vs. runtime tradeoffs
Thanks
• For more information
  • [email protected]
  • http://bmi.osu.edu/~umit or http://bmi.osu.edu/hpc
• Research at the HPC Lab is funded by