exploringparallelisminshortsequence* mapping*using*burrows...

23
Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc Exploring Parallelism in Short Sequence Mapping Using BurrowsWheeler Transform Doruk Bozdag 1 , Ayat Hatem 1,2 , Umit V. Catalyurek 1,2 Department of Biomedical Informa@cs Department of Electrical and Computer Engineering The Ohio State University IPDPS HiCOMB 2010 Ninth IEEE InternaGonal Workshop on High Performance ComputaGonal Biology

Upload: others

Post on 25-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc

Exploring  Parallelism  in  Short  Sequence  Mapping  Using  Burrows-­‐Wheeler  Transform  

Doruk  Bozdag1,  Ayat  Hatem1,2,  Umit  V.  Catalyurek1,2  Department  of  Biomedical  Informa@cs  

Department  of  Electrical  and  Computer  Engineering  The  Ohio  State  University  

IPDPS  -­‐  HiCOMB  2010  Ninth  IEEE  InternaGonal  Workshop  on    

High  Performance  ComputaGonal  Biology  

Page 2: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 2

Outline  

•  Mo@va@on  

•  Burrows-­‐Wheeler  Transform  

•  Paralleliza@on  strategies  

•  Experimental  results  

•  Conclusion  and  future  work  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     2  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 3: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 3

Motivation

MAPPING  •  Map  reads  to  a  reference  

genome  efficiently  -  Human  genome:  3Gb  

•  Sequen@al  mapping  takes  about  a  day  -  Need  fast,  parallel  

algorithms  that  can  handle  mismatches  

ANALYSIS  

SEQUENCING  •  High  throughput  sequencing  

instruments  (SOLiD,  Solexa,  454)  can  sequence  more  than  1  billion  bases  a  day  

•  Hundreds  of  millions  of  35-­‐50  base  reads    

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     3  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 4: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 4

Our  UlGmate  Goal  

•  Develop  generic  paralleliza@on  framework  •  Iden@fy  limita@ons  due  to  the  applica@on  scenarios  and  tools    

•  Find  the  “right”  tool  for  the  given  problem  •  Find  the  best  way  to  parallelize  for  a  given  tool  and  scenario  •  Quality  vs  Run@me  tradeoffs  

•  This  work  is  a  second  step  towards  that  goal  [Bozdag  IPDPS’09]  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     4  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 5: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 5

Short    Sequence  Mapping  Tools  

•  Many  tools  have  been  developed:  •   MapReads,  MAQ,  RMAP,  SHRiMP,  ZOOM,  mrFAST,  SOCS,  PASS  

•  State  of  the  art  tools:  •  BWA,  Bow@e,  and  SOAPv2  

•  All  of  them  are  based  on  the  Burrows-­‐Wheeler  Transform  

•  Two  step  mapping  approach:  

•  Build  the  index  for  the  reference  genome  

• Map  reads  to  the  index  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     5  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 6: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 6

Burrows-­‐Wheeler  Transform  

•  The  Burrows-­‐Wheeler  transform  (BWT)  of  a  text  T  is  a  reversible  permuta@on  of  the  characters  in  that  text  

•  Designed  originally  for  data  compression  

•  Used  by  data  indexing  techniques  due  to  its  efficiency  

•  BWT-­‐based  index  can  be  searched  in  a  small  memory  footprint  

•  Exact  string  matching  algorithm  has  been  developed  by  [FOCS  2000]  to  search  through  BWT-­‐based  index  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     6  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 7: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 7

Inexact  Matching  

•  BWA,  SOAP,  and  Bow@e  use  an  exact  matching  algorithm  based  on  the  BWT-­‐index  

•  Each  one  provides  a  different  method  to  handle  inexact  matches  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     7  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

AC  

GT   T   A   C  

AG

SOAP: Split read into 3 fragments for hits that allow two mismatches

GACGTTACA  

GACGTTACA  GACGTTACC  GACGTTACG  GACGTTACT  

GACGTTAAA  GACGTTACA  GACGTTAGA  GACGTTATA  

AC  

GT   T  

AC  

AG

AC  

G AT  

AC  

AG

BWA: Enumerate all possible strings

Bowtie: Match not found at T, backtrack, change, find exact match

Page 8: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 8

Short  Sequence  Mapping  

•  Quality  of  mapping  depends  on  different  factors  

•  Improved  quality                                  Increased  computa@onal  cost  

•  Tools  provide  different  op@ons  to  compromise  quality  to  limit  the  computa@onal  cost    

•  Solu@on:  parallel  processing  strategies  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     8  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 9: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 9

ParallelizaGon  Strategies  

•  Par@@on  reads  into  NR  parts  •  Mapping  very  large  number  of  reads  

to  a  small  genome    

Reads

Genome

•  Par@@on  genome  into  NG  parts  

•  Mapping  a  small  number  of  reads  to  a  large  genome    

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     9  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 10: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 10

ParallelizaGon  Strategies  

•  Par@@on  reads  and  genome  

•  Deciding  number  of  read  parts  (NR)  and  genome  parts  (NG)  depends  on  the  number  of  reads  and  size  of  the  genome  

•  Two  main  applica@on  scenarios  •  Index  the  genome  each  Gme  for  

matching:    

•  ParGGon  reads  and  genome    •  Index  the  genome  once:  

•  Par@@on  reads  only  

Genome

Reads

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     10  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 11: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 11

Experimental  Setup  

•  Compared  Bow@e  v0.10.1,  BWA  v0.5.0,  SOAP  v2.20  

•  Experiments  on  32-­‐node  dual  2.4  GHz  Opteron  cluster  with  8GB  of  memory  per  node  

•  Used  three  reference  genomes:  human  (3.1  Gbp),  zebrafish  (1.5  Gbp)  and  lancelet  (0.9  Gbp)  

•  Reads:  •  Real  data  from  a  single  run  of  SOLiD  system  of  length  50bp  

•  Synthe@c  data  generated  by  wgsim  of  length  70bp  •  Wgsim  tool  is  a  part  of  SAMtools  package  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     11  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 12: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 12

Experiments  on  Real  Data  

•  Number  of  nodes:  32  

•  NR  x  NG:  1x32,  2x16,  4x8,  8x4,  16x2,  32x1  •  G:  Human.  R:  130M  

•  Bow@e  best  configura@on:  4x8.  BWA  and  SOAP:  8x4  

Bowtie BWA SOAP

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     12  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 13: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 13

Experiments  on  SyntheGc  Data  

•  Number  of  nodes:  16  

•  NR  x  NG:  1x16,  2x8,  4x4,  8x2,  16x1  •  G:  Zebrafish,  R:  4M,  16M,  64M  

•  Matching  @me  >>  indexing  @me,  larger  NR  is  bemer    

Bowtie SOAP

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     13  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

BWA

Page 14: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 14

Experiments  on  SyntheGc  Data  

•  Number  of  nodes:  16  

•  NR  x  NG:  1x16,  2x8,  4x4,  8x2,  16x1  •  G:  Lancelet,  Zebrafish,  Human.  R:  16M  

•  Larger  NG  speeds  up  the  indexing  phase    

Bowtie SOAP

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     14  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

BWA

Page 15: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 15

Experiments  on  SyntheGc  Data  

•  Best  configura@on:  provided  2.2  to  10.7  speed  up  on  16  nodes  

Bowtie sequential and parallel running time

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     15  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 16: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 16

Experiments  on  SyntheGc  Data  

Bowtie sequential and parallel running time

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     16  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

•  Some  unexpected  results  

Page 17: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 17

Experiments  on  SyntheGc  Data  

R1 R2

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     17  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

2 genome portions

16 genome portions

16M  Reads  

80%  mapped  

40%  mapped   40%  mapped  

5%   5%   5%   5%  

Whole genome

……………..

……

.

Page 18: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 18

Experiments  on  SyntheGc  Data  

R1 R2

Mapped reads Bowtie Matching Time

R1 90.04% 121 R2 4.2% 482

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     18  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

2 genome portions

16 genome portions

16M  Reads  

80%  mapped  

40%  mapped   40%  mapped  

5%   5%   5%   5%  

Whole genome

……………..

……

.

Inexact matching phase leads to increasing the running time

Page 19: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc

Inexact  vs  Exact  Matching  in  BowGe  

•  R:  16M    •  G:  Human    

•  NG:  1,  2,  4,  8,  16,  32  

•  Inexact  matching  overhead  is  offset  by  decreased  index  size  as  NG  increases  but  not  enough!  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     19  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 20: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 20

Varying  Number  of  Mismatches  

•  G:  Human,  R:  21M  

•  Number  of  mismatches:  0,  1,  2  •  SOAP  is  not  designed  for  finding  “at  most”  1  mismatch:  requires  2  

execu@ons  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     20  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 21: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 21

 Conclusions  

•  Experimented  different  data  distribu@on  strategies  

•  Tested  on  the  state  of  the  art  mapping  tools:  Bow@e,  BWA,  and  SOAP  •  For  Index+Matching  Scenario    observed  2.2  to  10.7  speedup  on  16  nodes  

•  Best  data  distribu@on  strategy  depends  on:  •  Input  scenario:  genome  size,  number  of  reads,  and  number  of  nodes  

•  Rela@ve  efficiency  of  the  indexing  and  matching  steps  

•  algorithm  used  for  inexact  matching  

•  In  case  of  building  the  index  once,  it  is  bemer  to  use  read  par@@oning  only  •  Although  our  inten@on  was  not  to  compare  the  tools,  yet    

•  Bow@e  indexing  is  rela@vely  slow  

•  On  human  genome,  Bow@e  is  faster  for  exact  matching  but  when  increasing  the  number  of  mismatches  SOAP  becomes  the  fastest  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     21  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 22: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 22

Future  Work:  Our  Goal  

•  Develop  generic  paralleliza@on  framework  •  Iden@fy  limita@ons  due  to  the  applica@on  scenarios  and  tools    

•  Find  the  “right”  tool  for  the  given  problem  •  further  analysis  of  tools  and  methods  are  needed  

•  Quality  vs  Run@me  tradeoffs  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     22  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"  

Page 23: ExploringParallelisminShortSequence* Mapping*Using*Burrows ...tda.gatech.edu/public/slides/Bozdag10-HiCOMB.pdf · U.&Catalyurek,&"Parallel&ShortSeq.&Mapping&w/&BWT"& HiCOMB,&Atlanta,&GA,&Apr&19th,&2010&&

Dep. of Biomedical Informatics HPC Lab bmi.osu.edu/hpc 23

Thanks    

•  For  more  informa@on  •  [email protected]  

•  hmp://bmi.osu.edu/~umit  or  hmp://bmi.osu.edu/hpc  

•  Research  at  the  HPC  Lab  is  funded  by  

HiCOMB,  Atlanta,  GA,  Apr  19th,  2010     23  U.  Catalyurek,  "Parallel  Short  Seq.  Mapping  w/  BWT"