Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)

Gao, Ge, Center for Bioinformatics, Peking University


Post on 13-Jan-2016


TRANSCRIPT

Page 1:

Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)

Gao, Ge

Center for Bioinformatics, Peking University

Page 2:

Next-generation deep sequencing platforms produce millions of short reads in one run

                454 Genome Sequencer FLX   Illumina/Solexa Genome Analyzer   SOLiD™ 3 Analyzer
Amplification   emPCR                      Bridge PCR                        emPCR
Read length     400 bp                     36-50 bp                          50-60 bp
Read number     >1M                        30M                               400M
Time            10 h                       2-3 days                          3.5 days
Bases           400-600M                   1.3G                              20G
Sample          16                         8                                 16

Page 3:

• Comparative genomics, Genotyping

• Profiling: RNA-Seq, ChIP-Seq, Methyl-Seq

[Figure: short reads aligned in overlapping stacks along the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, shown once for each of the two goals listed here]

Goal: identify variations

Goal: measure significant peaks

Page 4:

Millions of sequence reads

And those reads need to be mapped back to the reference genome effectively for further analysis

Page 5:

(http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html)

So why do we need yet another mapping tool?

Page 6:

Effectively handle (large) sequence variants during mapping

Page 7:

[Figure: BOAT workflow]

Genome → Initialization (based on hash table & bitmap index)

Input reads → Seeding (by hybrid indexing schema) → Extension (based on prefix tree) → Refining alignment (generate alignment & calculate E-value) → Hits list

Page 8:

Basic idea: hybrid index by integrating hash and tree

[Figure: a query read with two 10-base seeds (seed1 = TTTTTTTTTT, seed2 = AAAAAAAAAA) is looked up in a hash table listing every 10-mer (AAAAAAAAAA, AAAAAAAAAC, AAAAAAAAAG, AAAAAAAAAT, ..., TTTTTTTTTG, TTTTTTTTTT); the table entries lead to a prefix tree]
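The hybrid idea on this slide can be illustrated with a minimal sketch. It assumes a Python dict as the hash table, keyed by 10-base seeds, and a nested-dict prefix tree per seed that stores the bases following each seed occurrence in the genome. The names `build_hybrid_index`, `SEED_LEN` and `EXT_LEN`, and the tree depth of 5, are hypothetical choices for illustration; the slide does not describe BOAT's actual data layout.

```python
SEED_LEN = 10   # seed length taken from the 10-base entries on the slide
EXT_LEN = 5     # hypothetical depth of each prefix tree


def build_hybrid_index(genome, seed_len=SEED_LEN, ext_len=EXT_LEN):
    """Map each seed to a prefix tree of the bases that follow it in the genome.

    Tree nodes are nested dicts keyed by base; the special key "pos" holds the
    genome positions of the seed occurrences whose extension ends at that node.
    """
    index = {}
    for start in range(len(genome) - seed_len - ext_len + 1):
        seed = genome[start:start + seed_len]
        node = index.setdefault(seed, {})          # hash lookup by seed
        for base in genome[start + seed_len:start + seed_len + ext_len]:
            node = node.setdefault(base, {})       # walk or grow the tree
        node.setdefault("pos", []).append(start)
    return index


genome = "GG" + "T" * 10 + "ACGTA" + "A" * 10 + "ACGAT" + "CC"
index = build_hybrid_index(genome)
print(index["TTTTTTTTTT"]["A"]["C"]["G"]["T"]["A"]["pos"])  # prints [2]
```

The hash lookup gives constant-time access to a seed, while the tree below it shares common prefixes among all sequences that follow that seed.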

Page 9:

Prefix tree enables effective detection of longest common substring with mismatches

(http://en.wikipedia.org/wiki/Trie)

[Figure: animated prefix tree (trie) example. Node labels include ATGC, CAGTA, CGC, AGA, GGA, CTGC and CTGCATGC; the strings being matched include ATGCCA, ATGGCA, GTAAGA and GTACCA]
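As a rough illustration of why a prefix tree helps with mismatches, the sketch below walks a trie depth-first, charges one mismatch whenever the branch taken differs from the query base, and prunes a branch as soon as the mismatch budget is exceeded. It is a simpler fixed-length variant of the longest-common-substring search named on the slide, and the function names are hypothetical rather than taken from BOAT. The example strings are the ones visible in the figure.

```python
def build_trie(words):
    """Build a nested-dict prefix tree; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for base in word:
            node = node.setdefault(base, {})
        node["$"] = word
    return root


def search(node, query, max_mismatches, depth=0, used=0):
    """Depth-first search for stored words within max_mismatches of query.

    Returns (word, mismatches) pairs for words of the same length as query.
    """
    if depth == len(query):
        return [(node["$"], used)] if "$" in node else []
    hits = []
    for base, child in node.items():
        if base == "$":
            continue
        cost = used + (base != query[depth])
        if cost <= max_mismatches:          # prune branches over budget
            hits.extend(search(child, query, max_mismatches, depth + 1, cost))
    return hits


trie = build_trie(["ATGCCA", "ATGGCA", "GTAAGA", "GTACCA"])
print(sorted(search(trie, "ATGCCA", 1)))  # prints [('ATGCCA', 0), ('ATGGCA', 1)]
```

Because stored strings share prefixes, the comparison work for a common prefix such as ATG is done once for every string below it.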

Page 10:

Trigger a new alignment: “double-window hit”

Either of the two indexed seeds could initialize a new alignment

TTTTTTTTTTT ACGTA AAAAAAAAAA ACGAT

Seed1 Seed2
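A minimal sketch of the double-window idea, assuming the read layout drawn on the slide: two 10-base seed windows, each followed by 5 further bases, so the windows start at read offsets 0 and 15. A genome position becomes a candidate if either window has an exact hit there, which means a mismatch inside one window does not prevent the read from being placed. The function names, offsets and example sequences are hypothetical.

```python
from collections import defaultdict

SEED_LEN = 10
SEED_OFFSETS = (0, 15)   # assumed window starts: seed1 at 0, seed2 after 10 + 5 bases


def index_seeds(genome, seed_len=SEED_LEN):
    """Hash every seed_len-mer of the genome to its start positions."""
    table = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        table[genome[pos:pos + seed_len]].append(pos)
    return table


def candidate_starts(read, table, seed_len=SEED_LEN, offsets=SEED_OFFSETS):
    """Return candidate read start positions triggered by either seed window."""
    starts = set()
    for offset in offsets:
        seed = read[offset:offset + seed_len]
        for pos in table.get(seed, []):
            if pos - offset >= 0:
                starts.add(pos - offset)
    return sorted(starts)


genome = "TTAG" + "GATTACAGGC" + "ACGTA" + "CCTGAAGTCA" + "ACGAT" + "GGTT"
table = index_seeds(genome)
read = "GATTTCAGGC" + "ACGTA" + "CCTGAAGTCA" + "ACGAT"   # one mismatch inside seed1
print(candidate_starts(read, table))  # prints [4]
```

Here seed1 no longer matches the genome exactly, but seed2 still does, so position 4 is still proposed for extension.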

Page 11:

Extension of alignment by depth-first traversal of the index tree

[Figure: the query TTTTTTTTTT ACGTA AAAAAAAAAA ACGAT (seed1 = TTTTTTTTTT, seed2 = AAAAAAAAAA) is looked up in the table of 10-mers (AAAAAAAAAA, AAAAAAAAAC, AAAAAAAAAG, AAAAAAAAAT, ..., TTTTTTTTTG, TTTTTTTTTT). Each entry holds a prefix tree root and a bitmap array; the bitmap bit for ACGTA is highlighted. Prefix tree node labels include ACGTAC, AGTA, CGTAC, CACAT, ACG, AAGAT, TCG, TCGAT, GCGAA, ACGAT, GAGAAG, CGATAC, ACGATA and GACTAG]

Example alignment shown on the slide:

ACGTACAGTAAACATACGAT
|||||||||||| |||||||
ACGTACAGTAAAGATACGAT
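The slide does not spell out how the bitmap array is consulted. One plausible reading, offered here purely as an assumption, is that each seed entry keeps one bit per possible 5-mer, set when that 5-mer follows the seed somewhere in the genome, so the 5 bases after a seed (ACGTA in the figure) can be checked in constant time before the tree is walked. The sketch below uses a Python integer as the bit array; all names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Encode a DNA k-mer as an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def build_bitmaps(genome, seed_len=10, ext_len=5):
    """For each seed, set one bit per ext_len-mer that follows it in the genome."""
    bitmaps = {}
    for start in range(len(genome) - seed_len - ext_len + 1):
        seed = genome[start:start + seed_len]
        ext = genome[start + seed_len:start + seed_len + ext_len]
        bitmaps[seed] = bitmaps.get(seed, 0) | (1 << kmer_to_int(ext))
    return bitmaps


def follows(bitmaps, seed, ext):
    """True if ext occurs right after seed somewhere in the indexed genome."""
    return bool((bitmaps.get(seed, 0) >> kmer_to_int(ext)) & 1)


genome = "GG" + "T" * 10 + "ACGTA" + "A" * 10 + "ACGAT" + "CC"
bitmaps = build_bitmaps(genome)
print(follows(bitmaps, "TTTTTTTTTT", "ACGTA"))  # prints True
print(follows(bitmaps, "TTTTTTTTTT", "ACGAT"))  # prints False
```

Under this reading the bitmap only answers exact questions about the bases after a seed; tolerating mismatches there would still require the tree traversal.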

Page 12:

Refining alignment by bounded dynamic programming

$$F(0,0) = 0$$

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) \\ F(i-1,\,j) - d \\ F(i,\,j-1) - d \end{cases}$$

For each cell between (i, i-k) and (i, i+k)
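A minimal sketch of bounded (banded) dynamic programming, assuming a global alignment with a linear gap penalty d and a substitution score s, where only cells within k of the main diagonal are filled. The scoring values (match 1, mismatch -1, d = 2) and the function name are assumptions for illustration, not BOAT's parameters. For clarity the full matrix is allocated; an implementation concerned with memory would store only the band.

```python
def banded_global_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score of x and y, filling only cells with |i - j| <= k.

    Uses the recurrence F(i,j) = max(F(i-1,j-1) + s(x_i,y_j),
    F(i-1,j) - d, F(i,j-1) - d); returns None if the band cannot
    reach the final cell (length difference greater than k).
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None
    neg_inf = float("-inf")
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        # Only columns between i - k and i + k are visited for row i.
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, F[i - 1][j] - d)   # gap in y
            if j > 0:
                best = max(best, F[i][j - 1] - d)   # gap in x
            F[i][j] = best
    return F[n][m]


print(banded_global_score("ACGTACAGTAAACATACGAT", "ACGTACAGTAAAGATACGAT", k=2))  # prints 18
```

Restricting the fill to the band reduces the work from the order of n × m cells to roughly n × (2k + 1), which is what makes refinement affordable for millions of short reads.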

Page 13:

BOAT showed a significantly better recall rate in evaluation

5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core CPUs (E7310 @ 1.6 GHz) and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches.
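The slide does not say how the reads were simulated or how recall was scored. The sketch below shows one common setup, stated as an assumption: reads are sampled from the reference, up to a fixed number of bases are substituted, and recall is the fraction of reads that a mapper places back at their true origin. The function names and the read length of 36 are hypothetical.

```python
import random


def simulate_read(reference, read_len, max_mismatches, rng):
    """Sample a read from reference and substitute up to max_mismatches bases.

    Returns (read, true_start, number_of_substitutions).
    """
    start = rng.randrange(len(reference) - read_len + 1)
    bases = list(reference[start:start + read_len])
    n_sub = rng.randint(0, max_mismatches)
    for pos in rng.sample(range(read_len), n_sub):
        # Replace the base with one of the three other bases.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases), start, n_sub


def recall(true_starts, reported_starts):
    """Fraction of reads whose reported position equals the true origin.

    reported_starts holds None for reads the mapper left unmapped.
    """
    correct = sum(t == r for t, r in zip(true_starts, reported_starts))
    return correct / len(true_starts)


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
read, start, n_sub = simulate_read(reference, 36, 5, rng)
print(recall([1, 2, 3, 4], [1, 2, None, 5]))  # prints 0.5
```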

Page 14:

Effectively handling multiple mismatches contributes significantly to the improved recall rate, especially with large sequence variance

[Figure: bar chart of the number of mapped reads (axis from 0 to 5,000,000) for BOAT, RMAP, SOAP, MAQ and SeqMap, broken down into perfect match, 1 mismatch, 2 mismatches, 3 mismatches, 4 mismatches and 5 mismatches]

Page 15:

And the performance of SNP calling is also improved

[Figure: recall (axis from 20% to 100%) plotted against specificity (axis from 70% to 100%) for BOAT and MAQ; data points are labeled 2 to 20]

Page 16:

BOAT also provides several flexible and friendly features

         Max allowed mismatches    Gapped alignment   Local alignment   BLAST-style E-value   Pair-end reads   Multiple threads   SNP calling
BOAT     No hardcoded limitation   YES                YES               YES                   YES              YES                YES
RMAP     No hardcoded limitation   NO                 NO                NO                    NO               NO                 NO
MAQ      3                         NO                 NO                NO                    YES              NO                 YES
SOAP     5                         NO                 YES*              NO                    YES              NO                 NO
SeqMap   5                         YES                NO                NO                    NO               NO                 NO

Page 17:

BOAT is available as open source software

(http://boat.cbi.pku.edu.cn)

Page 18:

Acknowledgement

Zhao, Shu-Qi; Wang, Jun; Zhang, Li; Li, Jiong-Tang; Gu, Xiao-Cheng; Wei, Li-Ping

[email protected]