BLAT – The BLAST-Like Alignment Tool
Kent, W.J.
Genome Res. 2002 12: 656-664
Presenter: 巨彥霖 田知本
BLAT overview
• Use an index to find regions in the genome homologous to the query.
• Do a detailed alignment between the query and the homologous regions.
• Use dynamic programming to stitch the aligned regions together into a detailed alignment of the whole query.
Index
• Database : non-overlapping
• Query : overlapping
[Figure: K-mers laid out along the database (non-overlapping) and along the query (overlapping)]
Example
• Database: cacaattatcacgaccgc
  3-mers (non-overlapping): cac aat tat cac gac cgc
  Index (3-mer: start positions):
    aat: 3
    cac: 0, 9
    cgc: 15
    gac: 12
    tat: 6
• Query: aattctcac
  3-mers (overlapping):
  3-mers:   aat att ttc tct ctc tca cac
  position: 0   1   2   3   4   5   6
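The example above can be reproduced with a short Python sketch. The function names `build_index` and `find_hits` are made up for illustration and are not taken from the BLAT source; the sketch only mirrors the slide: non-overlapping K-mers for the database, overlapping K-mers for the query.

```python
from collections import defaultdict

def build_index(database, k):
    """Map each non-overlapping k-mer of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):  # step by k: non-overlapping
        index[database[pos:pos + k]].append(pos)
    return dict(index)

def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query.

    Returns a list of (query_pos, database_pos) pairs.
    """
    hits = []
    for qpos in range(len(query) - k + 1):          # step by 1: overlapping
        for dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dpos))
    return hits

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
print(index)
print(hits)
```

Only two of the seven query 3-mers (aat and cac) occur in the index, so the hit list stays short even though the query is scanned at every position.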
Search Criteria
• Single Perfect Matches
• Single Near Perfect Matches
• Multiple Perfect Matches
Notation
• K : K-mer size
• M : The match ratio between homologous areas
• H : Homologous region size
• G : Database (genome) size
• Q : Query sequence size
• A : The alphabet size
Single Perfect Matches (1)
Probability that a specific K-mer in the homologous region is a perfect match:
$p_1 = M^K$
Single Perfect Matches (2)
The probability of at least one perfect K-mer match in a homologous region of size H, which contains H/K non-overlapping K-mers (sensitivity):
$P = 1 - (1 - M^K)^{H/K}$
Single Perfect Matches (3)
• The number of K-mers in the database = G / K
• The number of K-mers in the query = Q – K + 1
• The number of K-mers expected to match by chance (specificity):
$F = (Q - K + 1) \cdot (G / K) \cdot (1/A)^K$
Single Perfect Nucleotide K-mer Matches as Search Criterion
Case (perfect match)
• Comparing mouse and human coding sequences at the nucleotide level :
H = 100
M = 86%
Sensitivity = 0.99
max K = 7
chance matches = 13078962
(query = 500 , database = 3 billion)
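The case above can be checked numerically. This is a minimal sketch of the single-perfect-match formulas (sensitivity 1 - (1 - M^K)^(H/K), chance matches (Q - K + 1)(G/K)(1/A)^K); the function names are illustrative, and rounding H/K down to a whole number of K-mers is an assumption of the sketch.

```python
def sensitivity_perfect(M, K, H):
    """Probability of at least one perfect K-mer match in a homologous region."""
    p1 = M ** K        # one specific K-mer matches perfectly
    T = H // K         # assumption: whole non-overlapping K-mers only
    return 1 - (1 - p1) ** T

def chance_matches_perfect(Q, G, K, A):
    """Expected number of K-mer matches that occur by chance."""
    return (Q - K + 1) * (G / K) * (1 / A) ** K

# Values from the case: H = 100, M = 86%, K = 7, query 500, database 3 billion, A = 4
P = sensitivity_perfect(0.86, 7, 100)
F = chance_matches_perfect(500, 3_000_000_000, 7, 4)
print(f"sensitivity = {P:.4f}, chance matches = {F:.0f}")
```

The sensitivity comes out just above 0.99 and the chance-match count is on the order of 10^7. The quoted figure 13078962 appears to correspond to using Q = 500 directly instead of Q – K + 1.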
Single Near Perfect Matches (1)
Almost perfect: one letter may mismatch.
Probability that a specific K-mer in the homologous region is a near-perfect match:
$p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K$
Single Near Perfect Matches (2)
• Sensitivity:
$P = 1 - (1 - p_1)^{H/K}$
• Specificity:
$F = (Q - K + 1) \cdot (G / K) \cdot \left( K \cdot (1/A)^{K-1} \cdot (1 - 1/A) + (1/A)^K \right)$
Case (near perfect match)
• Comparing mouse and human coding sequences at the nucleotide level :
H = 100
M = 86%
Sensitivity = 0.99
max K = 12
chance matches = 275671
(query = 500 , database = 3 billion)
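The near-perfect case can be checked the same way. This is a minimal sketch, assuming one mismatch is allowed anywhere in the K-mer and that H/K is rounded down; the function names are illustrative.

```python
def near_perfect_p1(M, K):
    """Probability that one K-mer matches with at most one mismatch."""
    return K * M ** (K - 1) * (1 - M) + M ** K

def sensitivity_near_perfect(M, K, H):
    """Probability of at least one near-perfect K-mer match in the region."""
    T = H // K         # assumption: whole non-overlapping K-mers only
    return 1 - (1 - near_perfect_p1(M, K)) ** T

def chance_matches_near_perfect(Q, G, K, A):
    """Expected number of near-perfect K-mer matches that occur by chance."""
    per_kmer = K * (1 / A) ** (K - 1) * (1 - 1 / A) + (1 / A) ** K
    return (Q - K + 1) * (G / K) * per_kmer

# Values from the case: H = 100, M = 86%, K = 12, query 500, database 3 billion, A = 4
P = sensitivity_near_perfect(0.86, 12, 100)
F = chance_matches_near_perfect(500, 3_000_000_000, 12, 4)
print(f"sensitivity = {P:.4f}, chance matches = {F:.0f}")
```

Allowing one mismatch lets K grow from 7 to 12 at the same sensitivity, which cuts the chance matches from the order of 10^7 to the order of 10^5. As in the perfect-match case, the quoted figure 275671 appears to correspond to using Q = 500 directly instead of Q – K + 1.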
Single Near Perfect Nucleotide K-mer Matches as Search Criterion
Multiple Perfect Matches
• A hit is triggered when:
  – there are N perfect matches
  – each is no further than W letters from the next in the database coordinate
  – all have the same diagonal coordinate
Example
[Figure: hits a, b, c, and d plotted by target coordinate and query coordinate, with a window of width W]
The hits a, b, c, and d are all K letters long. Hits b and d have the same diagonal coordinate and lie within W letters of each other. Therefore, they satisfy the two-perfect-K-mer search criterion.
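The two-hit criterion can be sketched in Python. The sketch assumes the diagonal coordinate is the database position minus the query position, handles only N = 2, and uses invented coordinates for a, b, c, and d; `two_hit_pairs` is a hypothetical name.

```python
from collections import defaultdict

def two_hit_pairs(hits, W):
    """Find pairs of hits on the same diagonal, at most W database letters apart.

    hits is a list of (query_pos, database_pos) tuples.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append((q, d))   # assumed definition of the diagonal
    pairs = []
    for diagonal_hits in by_diagonal.values():
        diagonal_hits.sort(key=lambda h: h[1])          # order along the database
        for first, second in zip(diagonal_hits, diagonal_hits[1:]):
            if second[1] - first[1] <= W:
                pairs.append((first, second))
    return pairs

# Invented coordinates: only b and d share a diagonal (110)
a, b, c, d = (10, 100), (20, 130), (35, 160), (40, 150)
pairs = two_hit_pairs([a, b, c, d], W=30)
print(pairs)
```

Grouping by diagonal first means that hits which are close in the database but shifted relative to the query (such as c here) never trigger the criterion.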
Multiple Perfect Nucleotide K-mer Matches as Search Criterion
Default
• Nucleotide
  – two perfect 11-mers
• Protein
  – single perfect 5-mer for the standalone version
  – three perfect 4-mers for the client/server version
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.
BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences:
Seq. A   = AGATCGAT
Position = 12345678
Hash table (3-tuple: start positions in Seq. A):
  AAA
  AAC
  ..
  AGA  1
  ..
  ATC  3
  ..
  CGA  5
  ..
  GAT  2 6
  ..
  TCG  4
  ..
  TTT
For protein sequences:
Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
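The DNA half of Step 1 can be sketched as follows. A Python dict stands in for the hash table and positions are 1-based to match the slide; `build_tuple_table` is an illustrative name, and the protein case (neighborhood words scored against a threshold T) is not covered.

```python
from collections import defaultdict

def build_tuple_table(seq, w=3):
    """Map every overlapping w-tuple of seq to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i + 1)   # 1-based, as on the slide
    return dict(table)

table = build_tuple_table("AGATCGAT")
for word in sorted(table):
    print(word, *table[word])
```

Tuples that never occur in Seq. A (AAA, AAC, ..., TTT) simply have no entry in the dict, which corresponds to the empty rows of the table.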
BLAST Step 2: Scan sequence B for hits.
BLAST Step 3: Extend hits.
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
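The termination rule can be illustrated with a rightward ungapped extension. This is a sketch only: the +1/-1 scores and the `x_drop` threshold are placeholder values, not BLAST defaults, and extension to the left would work symmetrically.

```python
def extend_right(a, b, i, j, k, x_drop, match=1, mismatch=-1):
    """Extend an ungapped hit a[i:i+k] == b[j:j+k] to the right.

    Stops once the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, best_length).
    """
    score = best = k * match
    length = best_len = k
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break   # the score has faded away
    return best, best_len

result = extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, 3, x_drop=2)
print(result)
```

Here the hit ACG grows to ACGTACG; the three mismatches that follow push the score 3 below the best, which exceeds `x_drop`, so the extension stops and the best segment is reported.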
Algorithm
1. Search Stage
   – Use an index to find regions in the genome homologous to the query
2. Alignment Stage
   – Do a detailed alignment between the query and the homologous regions
3. Stitching and Filling In
   – Use dynamic programming to stitch the aligned regions together into a detailed alignment of the whole query
Search Stage
• Build an index which contains positions of each K-mer in database.
• Step through each overlapping K-mer in query and look it up in index
• Get list of ‘hits’ - positions in query and in database that match for K bases
• Cluster hits to find homologous regions
Search Stage
• Clump hits
• Clump ‘clumps’
• Eliminate small clumps
[Figure: clumped hits forming a homologous region]
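The clumping steps can be sketched as a single greedy pass. This is a simplification made for illustration: it folds "clump hits" and "clump clumps" into one step, and the thresholds `max_diag_gap`, `max_db_gap`, and `min_hits` are hypothetical parameters, not BLAT settings.

```python
def clump_hits(hits, max_diag_gap, max_db_gap, min_hits):
    """Greedily group (query_pos, db_pos) hits into clumps.

    A hit joins a clump when it is close to the clump's last hit both
    along the database and in diagonal (db_pos - query_pos).
    Clumps with fewer than min_hits hits are eliminated.
    """
    clumps = []
    for q, d in sorted(hits, key=lambda h: h[1]):   # sweep along the database
        for clump in clumps:
            last_q, last_d = clump[-1]
            near_in_db = d - last_d <= max_db_gap
            near_in_diag = abs((d - q) - (last_d - last_q)) <= max_diag_gap
            if near_in_db and near_in_diag:
                clump.append((q, d))
                break
        else:
            clumps.append([(q, d)])                 # start a new clump
    return [c for c in clumps if len(c) >= min_hits]

hits = [(0, 100), (11, 111), (22, 124), (5, 900), (40, 5000), (51, 5011)]
regions = clump_hits(hits, max_diag_gap=2, max_db_gap=50, min_hits=2)
print(regions)
```

The isolated hit at database position 900 forms a clump of size one and is eliminated; the two surviving clumps would be passed on as homologous regions.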
Alignment Stage (nucleotide)
• Start from scratch with regions defined with K-mers
• Index on smaller K-mers, but extend each K-mer until it becomes specific
• Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments
• Recurse on gaps with a smaller K until the gaps are closed or no hits remain
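The extension step can be sketched as follows. It covers only the "extend in both directions without mismatches or gaps" bullet; merging and the recursion on gaps are left out, and `extend_exact` is an illustrative name.

```python
def extend_exact(query, target, qpos, tpos, k):
    """Grow an exact k-mer hit in both directions while the letters agree.

    Returns (query_start, target_start, length) of the extended match.
    """
    qs, ts = qpos, tpos
    while qs > 0 and ts > 0 and query[qs - 1] == target[ts - 1]:
        qs -= 1                      # extend to the left
        ts -= 1
    qe, te = qpos + k, tpos + k
    while qe < len(query) and te < len(target) and query[qe] == target[te]:
        qe += 1                      # extend to the right
        te += 1
    return qs, ts, qe - qs

# The 3-mer hit "gta" at position 4 of both sequences grows to "acgtacg"
block = extend_exact("ttacgtacgg", "ccacgtacga", 4, 4, 3)
print(block)
```

Because no mismatches are tolerated, the extension stops at the first differing letter on each side; the unaligned stretches left between such blocks are what the recursion with a smaller K then works on.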
[Figure: recursive nucleotide alignment]
Alignment Stage (protein)
• Extend hits into maximal scoring ungapped alignments (HSPs) with a +2/-1 scoring scheme
• Create a graph of all possible HSP merges
• Use dynamic programming to traverse the graph
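The graph traversal can be illustrated with a simple chaining sketch. It is a simplified stand-in, not BLAT's actual procedure: an edge exists only when one HSP ends before the next begins in both query and target, overlapping HSPs are never merged, and no gap penalty is charged. The tuple layout and the name `best_chain` are assumptions.

```python
def best_chain(hsps):
    """Pick the highest-scoring chain of compatible HSPs.

    Each HSP is (q_start, q_end, t_start, t_end, score) with half-open ends.
    Returns (total_score, chain).
    """
    if not hsps:
        return 0, []
    hsps = sorted(hsps, key=lambda h: (h[0], h[2]))
    best = [h[4] for h in hsps]      # best chain score ending at each HSP
    prev = [None] * len(hsps)        # back-pointers for the traceback
    for j, hj in enumerate(hsps):
        for i in range(j):
            hi = hsps[i]
            compatible = hi[1] <= hj[0] and hi[3] <= hj[2]
            if compatible and best[i] + hj[4] > best[j]:
                best[j] = best[i] + hj[4]
                prev[j] = i
    end = max(range(len(hsps)), key=best.__getitem__)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(hsps[end])
        end = prev[end]
    return total, chain[::-1]

hsps = [(0, 10, 100, 110, 20), (12, 20, 115, 123, 16),
        (5, 15, 300, 310, 18), (25, 30, 128, 133, 10)]
total, chain = best_chain(hsps)
print(total, chain)
```

The HSP at target 300 to 310 overlaps its neighbours in the query and lies far away in the target, so it cannot join the chain; the other three HSPs are linked for a total score of 46.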
[Figure: HSPs between the query and the homologous region]
Stitching and Filling In
• The alignment of a gene is often scattered across multiple homologous regions found in the search stage
[Figure: the query aligned to several homologous regions of the database]
Evaluation
• Comparison with Other Tools:
  – mRNA/Genome Alignments
  – Remapped 713 mRNAs corresponding to annotated chromosome 22
  – BLAT took 26 sec while Sim4 took 17,468 sec (almost 5h)

                 Est_genome   Sim4     BLAT
Relative speed   1            333      223,000
Base accuracy    N/A          99.66%   99.99%
Gene accuracy    77.7%        93.4%    99.5%
Evaluation
• Comparison with Other Tools:
  – Translated Mouse/Human Alignments
  – 13 million mouse genomic reads vs. human chromosome 22

                   WU-TBLASTX   BLAT
Relative Speed     1x           73x
% RefSeq Covered   84.5%        86.7%
% Genome Covered   2.67%        2.89%
BLAT vs. BLAST
• Index: BLAST indexes the query; BLAT indexes the database
• Hits: Perfect vs. Near Perfect
• Alignment: BLAST reports separate alignments; BLAT stitches them together
Reference
• http://amber.cs.umd.edu/class/838-s04/nada.ppt
• http://bioportal.weizmann.ac.il/course/ATIB/ATIB03_lecture3.print.pdf