Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)
Gao, Ge
Center for Bioinformatics, Peking University
Next-generation deep sequencing platforms produce millions of short reads in one run
                454 Genome Sequencer FLX   Illumina/Solexa Genome Analyzer   SOLiD™ 3 Analyzer
Amplification   emPCR                      BridgePCR                         emPCR
Read length     400bp                      36bp-50bp                         50-60bp
Read number     >1M                        30M                               400M
Time            10h                        2-3 days                          3.5 days
Bases           400-600M                   1.3G                              20G
Sample          16                         8                                 16
• Comparative genomics, Genotyping
• Profiling: RNA-Seq, ChIP-Seq, Methy-Seq
[Figure: short reads tiled along the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Goal: identify variations
Goal: measure significant peaks
Millions of Sequence reads
And those reads need to be mapped back to the reference genome effectively for further analysis
(http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html)
So why do we need yet another mapping tool?
Effectively handle (large) sequence variants during mapping
BOAT workflow:
Genome -> Initialization (based on hash table & bitmap index)
Input reads -> Seeding by hybrid indexing scheme -> Extension (based on prefix tree) -> Refining alignment -> Generate alignment & calculate E-value -> Hits list
Basic idea: hybrid index by integrating hash and tree
[Figure: a query with two seeds (seed1, seed2); the seeds are looked up in a hash table of 10-mers (AAAAAAAAAA … TTTTTTTTTT), whose entries lead to prefix trees]
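To make the hybrid idea concrete, here is a minimal sketch, assuming a hash table keyed by fixed-length seeds whose values are prefix trees of the bases that follow each seed. The names build_hybrid_index and exact_lookup, the seed length of 4, the tree depth of 6, and the "$" leaf marker are all hypothetical choices for illustration and are not taken from BOAT itself (the slide draws 10-mer seeds).

```python
def build_hybrid_index(genome, seed_len=4, ext_len=6):
    """Map each seed k-mer to a prefix tree of the bases that follow it.

    Each tree is a nested dict. The key "$" stores the genome positions
    at which that seed-plus-extension path starts (an assumed layout).
    """
    index = {}
    for pos in range(len(genome) - seed_len + 1):
        seed = genome[pos:pos + seed_len]
        node = index.setdefault(seed, {})          # hash step
        for base in genome[pos + seed_len:pos + seed_len + ext_len]:
            node = node.setdefault(base, {})       # tree step
        node.setdefault("$", []).append(pos)
    return index


def exact_lookup(index, read, seed_len=4, ext_len=6):
    """Return sorted genome positions matching the read exactly.

    Only the first seed_len + ext_len bases of the read are compared.
    """
    node = index.get(read[:seed_len])
    if node is None:
        return []
    for base in read[seed_len:seed_len + ext_len]:
        node = node.get(base)
        if node is None:
            return []
    # Collect every position stored at or below the node we reached.
    hits, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "$":
                hits.extend(child)
            else:
                stack.append(child)
    return sorted(hits)


index = build_hybrid_index("ACGTACGTTTACGTAC")
print(exact_lookup(index, "ACGTAC"))
```

The hash lookup narrows the search to one tree in constant time, and the tree walk then compares the remaining bases one at a time, which is the part that can later be relaxed to tolerate mismatches.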
Prefix tree enables effective detection of longest common substring with mismatches
(http://en.wikipedia.org/wiki/Trie)
[Figure: animated prefix tree example built from short sequences such as ATGCCA, ATGGCA, GTAAGA, GTACCA and CTGC]
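The following is a simplified sketch of how a prefix tree can tolerate mismatches. It does not compute a longest common substring; instead it swaps in a plainer technique, a depth-first search that finds stored words within a bounded number of substitutions of the query. The function names and the "$" end-of-word marker are assumptions made for this example, and the words are taken from the figure.

```python
def build_trie(words):
    """Build a prefix tree as nested dicts; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search_with_mismatches(root, query, max_mismatches):
    """Depth-first search for stored words within a substitution budget.

    Returns a sorted list of (word, mismatches) for every stored word of
    the same length as the query with at most max_mismatches differences.
    """
    results = []
    stack = [(root, 0, 0)]  # (node, depth in query, mismatches so far)
    while stack:
        node, depth, mism = stack.pop()
        if depth == len(query):
            if "$" in node:
                results.append((node["$"], mism))
            continue
        for ch, child in node.items():
            if ch == "$":
                continue
            cost = mism + (ch != query[depth])
            if cost <= max_mismatches:  # prune branches over budget
                stack.append((child, depth + 1, cost))
    return sorted(results)


trie = build_trie(["ATGCCA", "ATGGCA", "GTAAGA", "GTACCA"])
print(search_with_mismatches(trie, "ATGCCA", 1))
```

Because words that share a prefix share a path in the tree, a branch that exceeds the mismatch budget is abandoned once for all the words beneath it.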
Trigger a new alignment: “double-window hit”
Either of the two indexed seeds could initialize a new alignment
TTTTTTTTTTT ACGTA AAAAAAAAAA ACGAT
Seed1 Seed2
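One way to read the rule that either seed may start an alignment is sketched below. The slides do not define the window layout, so this example assumes two windows of length k at caller-supplied offsets in the read; index_kmers and candidate_starts are hypothetical helper names, not BOAT functions.

```python
def index_kmers(genome, k):
    """Hash every k-mer of the genome to the list of its start positions."""
    positions = {}
    for pos in range(len(genome) - k + 1):
        positions.setdefault(genome[pos:pos + k], []).append(pos)
    return positions


def candidate_starts(read, kmer_index, k, window_offsets):
    """Return genome start positions implied by any seed window that hits.

    A hit of the window at read offset `off` on genome position `pos`
    implies that the read would start at `pos - off`.
    """
    starts = set()
    for off in window_offsets:
        seed = read[off:off + k]
        for pos in kmer_index.get(seed, []):
            if pos - off >= 0:
                starts.add(pos - off)
    return sorted(starts)


genome = "GGGACGTACAGTAAAGATACGATCCC"
kmers = index_kmers(genome, 5)
# The first window contains a mismatch, so only the second window hits.
print(candidate_starts("ACCTACAGTAAAGATACGAT", kmers, 5, (0, 15)))
```

Under this reading, a read is lost at the seeding stage only when both windows contain a variant, which is why two seeds give better tolerance than one.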
Extension of alignment by depth-first traversing the index tree
[Figure: each hash table entry (AAAAAAAAAA … TTTTTTTTTT) holds a prefix tree root and a bitmap array; for the query TTTTTTTTTT ACGTA AAAAAAAAAA ACGAT (seed1, seed2), the bitmap bit for ACGTA is consulted and the prefix tree is traversed to give the alignment below]
ACGTACAGTAAACATACGAT
|||||||||||| |||||||
ACGTACAGTAAAGATACGAT
Refining alignment by bounded dynamic programming
F(0,0) = 0
F(i,j) = \max \{ F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \}
For each cell between (i, i-k) and (i, i+k)
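The bounded recurrence can be sketched as follows, assuming a global alignment, a simple match/mismatch score for s, and a linear gap penalty d. The scoring values and the function name banded_alignment_score are illustrative assumptions; the slides do not state the scores BOAT uses.

```python
def banded_alignment_score(x, y, k, match=1, mismatch=-1, gap=1):
    """Global alignment score using only cells (i, j) with |i - j| <= k.

    Cells outside the band stay at minus infinity, so only about
    (2k + 1) * len(x) cells are computed. The full matrix is allocated
    here for clarity, not for memory efficiency.
    """
    neg = float("-inf")
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return neg  # the final cell (n, m) lies outside the band
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + s)  # substitution
            if i > 0:
                best = max(best, F[i - 1][j] - gap)    # gap in y
            if j > 0:
                best = max(best, F[i][j - 1] - gap)    # gap in x
            F[i][j] = best
    return F[n][m]


print(banded_alignment_score("ACGTACAGTAAACATACGAT",
                             "ACGTACAGTAAAGATACGAT", k=2))
```

Restricting the computation to the band is safe when the seed and extension stages have already fixed the approximate diagonal, so an alignment with more than k net insertions or deletions is not expected.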
BOAT showed a significantly better recall rate in evaluation
5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core (E7310 @ 1.6 GHz) CPUs and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches.
Effectively handling multiple mismatches contributes significantly to the improved recall rate, especially with large sequence variance
[Figure: number of mapped reads (0 to 5,000,000) for BOAT, RMAP, SOAP, MAQ and SeqMap, broken down by perfect match, 1 mismatch, 2 mismatches, 3 mismatches, 4 mismatches and 5 mismatches]
And the performance of SNP calling is also improved
[Figure: recall (20% to 100%) versus specificity (70% to 100%) of SNP calling for BOAT and MAQ, with data points labeled 2 to 20]
BOAT also provides several flexible and friendly features
Tool     Max allowed mismatches    Gapped alignment   Local alignment   BLAST-style E-value   Pair-end reads   Multiple threads   SNP calling
BOAT     No hardcoded limitation   YES                YES               YES                   YES              YES                YES
RMAP     No hardcoded limitation   NO                 NO                NO                    NO               NO                 NO
MAQ      3                         NO                 NO                NO                    YES              NO                 YES
SOAP     5                         NO                 YES*              NO                    YES              NO                 NO
SeqMap   5                         YES                NO                NO                    NO               NO                 NO
BOAT is available as Open Source Software
(http://boat.cbi.pku.edu.cn)
Acknowledgement
Zhao, Shu-Qi
Wang, Jun
Zhang, Li
Li, Jiong-Tang
Gu, Xiao-Cheng
Wei, Li-Ping