patr an huntre and blast

8/17/2019 Patr an Huntre and Blast


www.bioalgorithms.info An Introduction to Bioinformatics Algorithms

Combinatorial Pattern Matching



Outline

• Hash Tables
• Repeat Finding
• Exact Pattern Matching
• Keyword Trees
• Suffix Trees
• Heuristic Similarity Search Algorithms
• Approximate String Matching
• Filtration
• Comparing a Sequence Against a Database
• Algorithm behind BLAST
• Statistics behind BLAST
• PatternHunter and BLAT




Genomic Repeats

• Example of repeats:

• ATGGTCTAGGTCCTAGTGGTC

• Motivation to find them:

• Genomic rearrangements are often associated with repeats

• Trace evolutionary secrets

• Many tumors are characterized by an explosion of repeats




Genomic Repeats

• The problem is often more difficult:

• ATGGTCTAGGACCTAGTGTTC



l-mer Repeats

• Long repeats are difficult to find
• Short repeats are easy to find (e.g., hashing)
• Simple approach to finding long repeats:
• Find exact repeats of short l-mers (l is usually 10 to 13)
• Use l-mer repeats to potentially extend into longer, maximal repeats


l-mer Repeats (cont'd)

• There are typically many locations where an l-mer is repeated:

GCTTACAGATTCAGTCTTACAGATGGT

• The 4-mer TTAC starts at locations 3 and 17


Extending l-mer Repeats

• Extend these 4-mer matches in both directions for as long as the two copies agree:

• Maximal repeat: CTTACAGAT (starting at locations 2 and 16)


Maximal Repeats

• To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome

• Hashing lets us find repeats quickly in this manner




Hashing: Definitions

• Hash table: array used in hashing

• Records: data stored in a hash table

• Keys: values that identify sets of records

• Hash function: uses a key to generate an index at which to insert into the hash table

• Collision: when more than one record is mapped to the same index in the hash table




Hashing: Example

• Where do the animals eat?

• Records: each animal

• Keys: where each animal eats




Hashing DNA sequences

• Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)

• After assigning a unique integer to each l-mer, it is easy to get all start locations of each l-mer in a genome
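
A minimal sketch of the two-bit encoding described above, in Python. The function name `encode_lmer` and the table `CODE` are hypothetical names chosen for illustration; the bit assignment follows the slide (A = 00, T = 01, C = 10, G = 11).

```python
# Two-bit codes taken from the slide; the names CODE and encode_lmer are illustrative.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def encode_lmer(lmer):
    """Pack an l-mer into a single integer, two bits per base."""
    value = 0
    for base in lmer:
        # Shift the bases seen so far left by two bits and append the new base.
        value = (value << 2) | CODE[base]
    return value
```

Because every l-mer of a fixed length l receives a distinct integer between 0 and 4^l - 1, that integer can serve directly as an index into an array of start-location lists.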




Hashing: Collisions

• Dealing with collisions:

• “Chain” all start locations of l-mers (linked list)





Hashing: Summary

• When finding genomic repeats from l-mers:

• Generate a hash table index for each l-mer sequence

• In each index, store all genome start locations of the l-mer which generated that index

• Extend l-mer repeats to maximal repeats
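
The summary above can be sketched in Python as follows. This is an illustration under assumptions, not the slides' implementation: Python's `dict` plays the role of the hash table (it resolves collisions internally), positions are 0-based, and the names `index_lmers` and `extend_repeat` are hypothetical.

```python
from collections import defaultdict

def index_lmers(genome, l):
    """Map every l-mer to the list of its 0-based start positions."""
    table = defaultdict(list)
    for i in range(len(genome) - l + 1):
        table[genome[i:i + l]].append(i)
    return table

def extend_repeat(genome, i, j, l):
    """Extend an exact l-mer repeat at starts i < j while both copies agree.

    Returns (start of first copy, start of second copy, repeat length).
    """
    start_i, start_j, length = i, j, l
    # Extend to the left one base at a time.
    while start_i > 0 and genome[start_i - 1] == genome[start_j - 1]:
        start_i -= 1
        start_j -= 1
        length += 1
    # Extend to the right one base at a time.
    while (start_j + length < len(genome)
           and genome[start_i + length] == genome[start_j + length]):
        length += 1
    return start_i, start_j, length
```

A typical use is to take every pair of start positions stored under one l-mer and pass the pair to `extend_repeat`.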





Pattern Matching

• What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?

• This leads us to a different problem, the Pattern Matching Problem





Pattern Matching Problem

• Goal: Find all occurrences of a pattern in a text

• Input: Pattern p = p_1…p_n and text t = t_1…t_m

• Output: All positions 1 ≤ i ≤ (m – n + 1) such that the n-letter substring of t starting at i matches p

• Motivation: Searching a database for a known pattern


Exact Pattern Matching: A Brute-Force Algorithm

PatternMatching(p, t)
1  n ← length of pattern p
2  m ← length of text t
3  for i ← 1 to (m – n + 1)
4      if t_i…t_(i+n-1) = p
5          output i
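
For readers who want to run the brute-force procedure, here is a minimal Python sketch. The function name `pattern_matching` is an illustrative choice; positions are reported 1-based to agree with the pseudocode.

```python
def pattern_matching(p, t):
    """Return all 1-based positions where pattern p occurs in text t."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        # Compare the n-letter window of t starting at 0-based index i with p.
        if t[i:i + n] == p:
            positions.append(i + 1)
    return positions
```

If the pattern is longer than the text, the loop body never runs and the result is an empty list.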





Exact Pattern Matching: An Example

• PatternMatching algorithm for:

• Pattern GCAT

• Text CGCATC

i = 1:  CGCATC
        GCAT       (mismatch)

i = 2:  CGCATC
         GCAT      (match, output 2)

i = 3:  CGCATC
          GCAT     (mismatch)





Exact Pattern Matching: Running Time

• PatternMatching runtime: O(nm) in the worst case

• On typical text the expected runtime is closer to O(m)

• Rarely will there be close to n comparisons in line 4, because most windows mismatch within the first few letters

• Better solution: suffix trees

• Can solve problem in O(m) time

• Conceptually related to keyword trees



Keyword Trees: Example

• Keyword tree, built by adding one keyword at a time:

• Apple
• Apropos
• Banana
• Bandana
• Orange



Keyword Trees: Properties

• Stores a set of keywords in a rooted labeled tree

• Each edge is labeled with a letter from an alphabet

• Any two edges coming out of the same vertex have distinct labels

• Every keyword stored can be spelled on a path from the root to some leaf
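
One way to picture these properties is a nested dictionary, sketched below in Python. The representation is an assumption made for illustration: each vertex is a `dict` keyed by edge letter (so two edges leaving a vertex cannot share a label), and the key `"$"` is a hypothetical end-of-keyword marker.

```python
def build_keyword_tree(keywords):
    """Build a keyword tree as nested dicts, one dict per vertex."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # Follow the edge labeled `letter`, creating it if it does not exist.
            node = node.setdefault(letter, {})
        node["$"] = word  # assumed marker: the path from the root spells `word`
    return root
```

Building the tree touches each letter of each keyword once, so construction time is proportional to the total length of all keywords.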



Keyword Trees: Threading (cont'd)

• Thread “appeal”: the letters a, p, p follow a path from the root, but no edge labeled e leaves that vertex, so threading fails

• Thread “apple”: the letters a, p, p, l, e lead from the root to a leaf, so threading is complete


Multiple Pattern Matching Problem

• Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text

• Input: k patterns p1,…,pk, and text t = t_1…t_m

• Output: Positions 1 ≤ i ≤ m where the substring of t starting at i matches pj for some 1 ≤ j ≤ k



Multiple Pattern Matching: Straightforward Approach

• Can solve as k “Pattern Matching Problems”

• Runtime: O(kmn), using the PatternMatching algorithm k times

• m - length of the text

• n - average length of the pattern



Multiple Pattern Matching: Keyword Tree Approach

• Or, we could use keyword trees:

• Build keyword tree in O(N) time; N is total length of all patterns

• With naive threading: O(N + nm)

• Aho-Corasick algorithm: O(N + m)




Keyword Trees: Threading

• To match patterns in a text using a keyword tree:

• Build keyword tree of patterns

• “Thread” the text through the keyword tree

• Threading is “complete” when we reach a leaf in the keyword tree

• When threading is “complete,” we’ve found a pattern in the text
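
The naive threading procedure can be sketched in Python as below. This is the O(N + nm) approach, not Aho-Corasick. The nested-dict tree, the `"$"` end marker, and the function names are assumptions made for illustration, and the sketch assumes the text never contains the character `$`.

```python
def build_keyword_tree(patterns):
    """Keyword tree as nested dicts; "$" is an assumed end-of-pattern marker."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node["$"] = pattern
    return root

def multiple_pattern_matching(patterns, text):
    """Return (1-based position, pattern) for each occurrence, by naive threading."""
    root = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        node = root
        # Thread the text from `start` until no edge matches the next letter.
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break
            if "$" in node:
                hits.append((start + 1, node["$"]))
    return hits
```

Each start position restarts at the root, which is the source of the nm term; Aho-Corasick avoids these restarts.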



Suffix Trees = Collapsed Keyword Trees

• Similar to keyword trees, except edges that form paths are collapsed

• Each edge is labeled with a substring of a text

• All internal vertices have at least two outgoing edges

• Leaves labeled by the index of the pattern.



Suffix Tree of a Text

• The suffix tree of a text is constructed from all of its suffixes

• Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G

• Figure: the keyword tree of these suffixes, and the suffix tree obtained from it

• How much time does it take?

• Building the keyword tree of all suffixes takes time linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

Suffix Trees: Advantages

• Suffix trees build faster than keyword trees

• Keyword tree of all suffixes: quadratic

• Suffix tree: linear (Weiner suffix tree algorithm)



Use of Suffix Trees

• Suffix trees hold all suffixes of a text

• e.g., ATCGC: ATCGC, TCGC, CGC, GC, C

• Builds in O(m) time for text of length m

• To find any pattern of length n in a text:

• Build suffix tree for text

• Thread the pattern through the suffix tree

• Can find pattern in text in O(n) time!

• O(n + m) time for “Pattern Matching Problem”

• Build suffix tree and look up pattern
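
To make the lookup step concrete, here is a small Python sketch. It is deliberately simplified: it builds an uncompressed suffix trie in quadratic time rather than a true suffix tree in O(m) time, so only the O(n) pattern lookup reflects the bound quoted above. The node layout and function names are assumptions for illustration.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie (quadratic time and space, for illustration only).

    Every node records the 1-based start positions of the suffixes passing through it.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node["children"].setdefault(letter, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def find_pattern(root, pattern):
    """Thread the pattern from the root; return 1-based occurrence positions."""
    node = root
    for letter in pattern:
        node = node["children"].get(letter)
        if node is None:
            return []
    return node["starts"]
```

Once the structure is built, a pattern of length n is found by following at most n edges, independent of the text length.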



Suffix Trees: Example




Multiple Pattern Matching: Summary

• Keyword and suffix trees are used to find patterns in a text

• Keyword trees:

• Build keyword tree of patterns, and thread text through it

• Suffix trees:

• Build suffix tree of text, and thread patterns through it



Heuristic Similarity Searches

• Genomes are huge: quadratic alignment algorithms such as Smith-Waterman are too slow

• Alignment of two sequences usually has short identical or highly similar fragments

• Many heuristic methods (e.g., FASTA) are based on the same idea of filtration

• Find short exact matches, and use them as seeds for potential match extension

• “Filter” out positions with no extendable matches




Dot Matrices

• Dot matrices show similarities between two sequences

• FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

Dot Matrices (cont'd)

• Identify diagonals above a threshold length

• Diagonals in the dot matrix indicate exact substring matching



Diagonals in Dot Matrices

• Extend diagonals and try to link them together, allowing for minimal mismatches/indels

• Linking diagonals reveals approximate matches over longer substrings
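
The idea of an implicit dot matrix can be sketched in Python: rather than filling a full matrix, record only the cells where a short exact match starts and group them by diagonal. This is an illustration of the concept, not FASTA's actual implementation; the function name and the word length parameter `w` are assumptions.

```python
from collections import defaultdict

def diagonal_hits(s, t, w):
    """Group exact w-letter matches between s and t by diagonal (i - j).

    Positions are 0-based; i indexes s and j indexes t.
    """
    # Index every w-letter word of t by its start position.
    index = defaultdict(list)
    for j in range(len(t) - w + 1):
        index[t[j:j + w]].append(j)
    # Look up every w-letter word of s; each hit is one dot in the matrix.
    diagonals = defaultdict(list)
    for i in range(len(s) - w + 1):
        for j in index.get(s[i:i + w], []):
            diagonals[i - j].append((i, j))
    return dict(diagonals)
```

A diagonal holding many consecutive hits corresponds to a long run of dots, which is a candidate for extension and linking.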




Approximate Pattern Matching Problem

• Goal: Find all approximate occurrences of a pattern in a text

• Input: A pattern p = p_1…p_n, text t = t_1…t_m, and k, the maximum number of mismatches

• Output: All positions 1 ≤ i ≤ (m – n + 1) such that t_i…t_(i+n-1) and p_1…p_n have at most k mismatches (i.e., the Hamming distance between t_i…t_(i+n-1) and p is ≤ k)



Approximate Pattern Matching: A Brute-Force Algorithm

ApproximatePatternMatching(p, t, k)
    n ← length of pattern p
    m ← length of text t
    for i ← 1 to m – n + 1
        dist ← 0
        for j ← 1 to n
            if t_(i+j-1) ≠ p_j
                dist ← dist + 1
        if dist ≤ k
            output i
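
A runnable Python counterpart of the brute-force procedure is sketched below. The function name is illustrative; positions are 1-based, and a window is reported when it has at most k mismatches.

```python
def approximate_pattern_matching(p, t, k):
    """Return 1-based positions where p matches t with at most k mismatches."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        # Hamming distance between p and the n-letter window of t at index i.
        dist = sum(1 for a, b in zip(p, t[i:i + n]) if a != b)
        if dist <= k:
            positions.append(i + 1)
    return positions
```

With k = 0 this reduces to exact pattern matching.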



Approximate Pattern Matching: Running Time

• That algorithm runs in O(nm).

• Landau-Vishkin algorithm: O(km)

• We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:

• We want to match substrings in a query to substrings in a text with at most k mismatches

• Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for



Query Matching Problem

• Goal: Find all substrings of the query that approximately match the text

• Input: Query q = q_1…q_w, text t = t_1…t_m, n (length of matching substrings), k (maximum number of mismatches)

• Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches




Approximate Pattern Matching vs Query Matching




Query Matching: Main Idea

• Approximately matching strings share some perfectly matching substrings.

• Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).




Filtration in Query Matching

• We want all n-matches between a query and a text with up to k mismatches

• “Filter” out positions we know do not match between text and query

• Potential match detection: find all matches of l-tuples in query and text for some small l

• Potential match verification: Verify each potential match by extending it to the left and right, until (k + 1) mismatches are found




Filtration: Match Detection

• If x_1…x_n and y_1…y_n match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = ⌊n/(k + 1)⌋

• Break the string of length n into k + 1 parts, each of length ⌊n/(k + 1)⌋ (any leftover letters go to the last part)

• k mismatches can affect at most k of these k + 1 parts

• At least one of these k + 1 parts is perfectly matched



Filtration: Match Detection (cont'd)

• Suppose k = 3. We would then have l = ⌊n/(k + 1)⌋ = ⌊n/4⌋:

Part:       1      2         k          k + 1
Positions:  1…l    l+1…2l    2l+1…3l    3l+1…n

• There are at most k mismatches in n, so at the very least there must be one out of the k + 1 l-tuples without a mismatch




Filtration: Match Verification

• For each l-match we find, try to extend the match further to see if it is substantial

• Figure: a perfect match of length l between query and text is extended until we find an approximate match of length n with k mismatches
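
The detection and verification steps can be put together in a short Python sketch. It is one possible reading of the filtration idea, under stated assumptions: positions are 0-based, the names `hamming` and `query_matching` are hypothetical, and verification is done by directly counting mismatches over each candidate n-letter window rather than by incremental extension.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def query_matching(q, t, n, k):
    """Return sorted 0-based pairs (i, j) where q[i:i+n] and t[j:j+n] differ in at most k positions."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("n must be at least k + 1")
    # Detection: index every l-tuple of the text.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    matches = set()
    for a in range(len(q) - l + 1):
        for b in index.get(q[a:a + l], []):
            # The shared l-tuple may be any of the k + 1 blocks of the n-match,
            # i.e. it may sit at offset s * l inside the window.
            for s in range(k + 1):
                i, j = a - s * l, b - s * l
                if i < 0 or j < 0 or i + n > len(q) or j + n > len(t):
                    continue
                # Verification: count mismatches over the full window.
                if hamming(q[i:i + n], t[j:j + n]) <= k:
                    matches.add((i, j))
    return sorted(matches)
```

Only window pairs that share a perfectly matching block are ever verified, which is where the savings over comparing all pairs of windows come from.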



Filtration: Example

                 k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
l-tuple length   n       n/2     n/3     n/4     n/5     n/6

• As k grows, shorter perfect matches are required

• Performance decreases

Local alignment is too slow…

• Quadratic local alignment is too slow while looking for similarities between long strings (e.g. the entire GenBank database)






• Guaranteed to find the optimal local alignment

• Sets the standard for sensitivity



• Basic Local Alignment Search Tool

• Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990

• Search sequence databases for local alignments to a query




BLAST

• Great improvement in speed, with a modest decrease in sensitivity

• Minimizes search space instead of exploring entire search space between two sequences

• Finds short exact matches (“seeds”), only explores locally around these “hits”




What Similarity Reveals

• BLASTing a new gene

• Evolutionary relationship

• Similarity between protein function

• BLASTing a genome

• Potential genes




BLAST algorithm

• Keyword search of all words of length w from the query of length n in database of length m with score above threshold

• w = 11 for DNA queries, w = 3 for proteins

• Local alignment extension for each found keyword

• Extend result until longest match above threshold is achieved

• Running time O(nm)



BLAST algorithm (cont'd)

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword: GVK

Neighborhood words (score):
GVK 18
GAK 16
GIK 16
GGK 14
GLK 13   ← neighborhood score threshold (T = 13)
GNK 12
GRK 11
GEK 11
GDK 11

Extension to a High-scoring Pair (HSP):

Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263




Original BLAST

• Dictionary

• All words of length w

• Alignment

• Ungapped extensions until score falls below some statistical threshold

• Output

• All local alignments with score > threshold
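
The ungapped extension step can be illustrated with a toy Python sketch. Everything numeric here is an assumption: the +1/-1 match/mismatch scores stand in for a real scoring matrix, and the fixed `drop` value stands in for BLAST's statistically derived stopping rule. The function names are hypothetical.

```python
def score(a, b, match, mismatch):
    """Toy substitution score (assumed values, not a real scoring matrix)."""
    return match if a == b else mismatch

def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, drop=2):
    """Extend an exact w-letter seed at query[qi], subject[si] without gaps.

    Extension in each direction stops once the running score falls `drop`
    below the best score seen. Returns (query start, subject start, length, score).
    """
    best = sum(score(query[qi + d], subject[si + d], match, mismatch) for d in range(w))
    # Extend to the right of the seed.
    run, right, d = best, 0, w
    while qi + d < len(query) and si + d < len(subject):
        run += score(query[qi + d], subject[si + d], match, mismatch)
        if run > best:
            best, right = run, d - w + 1
        elif best - run >= drop:
            break
        d += 1
    # Extend to the left of the seed, starting from the best score so far.
    run, left, d = best, 0, 1
    while qi - d >= 0 and si - d >= 0:
        run += score(query[qi - d], subject[si - d], match, mismatch)
        if run > best:
            best, left = run, d
        elif best - run >= drop:
            break
        d += 1
    return qi - left, si - left, w + left + right, best
```

The returned segment is the best-scoring ungapped stretch found around the seed; a caller would report it only if its score exceeds the chosen threshold.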


Gapped BLAST: Example

• Original BLAST exact keyword search, THEN:

• Extend with gaps around ends of exact match until score < threshold

• Output result

GTAAGGTCCAGT
GTTAGGTC-AGT

Figure: dot matrix with A C G A A G T A A G G T C C A G T along one axis and C T G A T C C T G G A T T G C G A along the other

From lectures by Serafim Batzoglou (Stanford)




Incarnations of BLAST

• blastn: Nucleotide-nucleotide

• blastp: Protein-protein

• blastx: Translated query vs. protein database

• tblastn: Protein query vs. translated database

• tblastx: Translated query vs. translated database (6 frames each)



Incarnations of BLAST (cont'd)

• PSI-BLAST

• Find members of a protein family or build a custom position-specific score matrix

• Megablast:

• Search longer sequences with fewer differences

• WU-BLAST: (Wash U BLAST)

• Optimized, added features




Assessing sequence similarity

• Need to know how strong an alignment can be expected from chance alone

• “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model

• Sequence models may take into account:

• G+C content
• Poly-A tails
• “Junk” DNA
• Codon bias
• Etc.

BLAST: Segment Score

• BLAST uses scoring matrices (δ) to improve on efficiency of match detection

• Some proteins may have very different amino acid sequences, but are still similar

• For any two l-mers x_1…x_l and y_1…y_l:

• Segment pair: pair of l-mers, one from each sequence

• Segment score: $\sum_{i=1}^{l} \delta(x_i, y_i)$
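
As a small illustration of the segment score, the Python sketch below sums a scoring function over aligned letters. The function `toy_delta` is a hypothetical stand-in (+1 for identical letters, -1 otherwise); an actual search would take its scores from a substitution matrix.

```python
def toy_delta(a, b):
    """Hypothetical scoring function, not a real substitution matrix."""
    return 1 if a == b else -1

def segment_score(x, y, delta=toy_delta):
    """Sum delta over the aligned letters of two equal-length l-mers."""
    if len(x) != len(y):
        raise ValueError("segment pair members must have equal length")
    return sum(delta(a, b) for a, b in zip(x, y))
```

Passing a different `delta` changes the scoring scheme without touching the summation.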



BLAST: Locally Maximal Segment Pairs

• A segment pair is maximal if it has the best score over all segment pairs

• A segment pair is locally maximal if its score can’t be improved by extending or shortening

• Statistically significant locally maximal segment pairs are of biological interest

• BLAST finds all locally maximal segment pairs with scores above some threshold

• A sufficiently high threshold will filter out some statistically insignificant matches



BLAST: Statistics

• Threshold: Altschul-Dembo-Karlin statistics

• Identifies smallest segment score that is unlikely to happen by chance

• The number of matches with score above θ has mean $E(\theta) = K m n e^{-\lambda\theta}$; K is a constant, m and n are the lengths of the two compared sequences

• Parameter λ is the positive root of $\sum_{x,y \in A} p_x p_y e^{\lambda\delta(x,y)} = 1$, where p_x and p_y are the frequencies of amino acids x and y, and A is the twenty-letter amino acid alphabet
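
To see how λ and the expected count behave, the Python sketch below solves the equation for λ by bisection and evaluates the mean. It is a toy illustration: the function names, the bisection bounds, and the example scoring scheme are assumptions, and it presumes the expected score of a random letter pair is negative so that a positive root exists.

```python
import math

def solve_lambda(freqs, delta, lo=1e-6, hi=10.0, iterations=100):
    """Bisection for the positive root of sum_{x,y} p_x p_y exp(lambda * delta(x, y)) = 1.

    Assumes the expected score is negative and the root lies between lo and hi.
    """
    def f(lam):
        total = sum(px * py * math.exp(lam * delta(x, y))
                    for x, px in freqs.items() for y, py in freqs.items())
        return total - 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid  # still left of the positive root
        else:
            hi = mid
    return (lo + hi) / 2

def expected_matches(K, m, n, lam, theta):
    """Mean number of matches with score above theta: K * m * n * exp(-lam * theta)."""
    return K * m * n * math.exp(-lam * theta)
```

For a four-letter alphabet with uniform frequencies and scores +1 for a match and -1 for a mismatch, the equation reduces to 0.25·e^λ + 0.75·e^(-λ) = 1, whose positive root is λ = ln 3.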



Sample BLAST output

                                                                      Score    E
Sequences producing significant alignments:                          (bits)  Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...   171    3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...   170    7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...   170    7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                  168    3e-43

ALIGNMENTS

>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148

Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
           MV  T  E++A+  LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
Sbjct: 1   MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
           V AHG+ V+G     + ++DN+K T+A LS +H +KLHVDP+NFRLL + +    A  FG
Sbjct: 61  VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
           +  F   VQ A+QK +A V +AL  +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

• Blast of human beta globin protein against zebra fish



Sample BLAST output (cont'd)

                                                                      Score    E
Sequences producing significant alignments:                          (bits)  Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...   289    1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end        289    1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...   280    1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin     260    1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...   151    7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...   149    3e-33

ALIGNMENTS

>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706

Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus

Query: 267   ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
             || ||| | ||    | || |  |||||| ||||| |||||||||||    ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327   ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
             ||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

• Blast of human beta globin DNA against human DNA



Timeline

• 1970: Needleman-Wunsch global alignment algorithm

• 1981: Smith-Waterman local alignment algorithm

• 1985: FASTA

• 1990: BLAST (basic local alignment search tool)

• 2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!

• Pattern Hunter

• BLAT


PatternHunter: faster and even more sensitive

• BLAST: matches short consecutive sequences (consecutive seed)

• Length = k

• Example (k = 11):

11111111111

Each 1 represents a “match”

• PatternHunter: matches short non-consecutive sequences (spaced seed)

• Increases sensitivity by locating homologies that would otherwise be missed

• Example (a spaced seed of length 18 w/ 11 “matches”):

111010010100110111

Each 0 represents a “don’t care”, so there can be a match or a mismatch
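
A spaced seed can be applied by comparing two windows only at the positions marked 1. The Python sketch below illustrates this by indexing "masked" windows. It is not PatternHunter's implementation; the function names are hypothetical, and the default seed is the one shown on the slide.

```python
from collections import defaultdict

SPACED_SEED = "111010010100110111"  # seed from the slide: length 18, 11 match positions

def masked(window, seed):
    """Keep only the letters of `window` at positions where the seed has a 1."""
    return "".join(c for c, s in zip(window, seed) if s == "1")

def spaced_seed_hits(query, subject, seed=SPACED_SEED):
    """Return 0-based (i, j) pairs whose windows agree at every 1 position of the seed."""
    span = len(seed)
    # Index the subject by the masked form of each window.
    index = defaultdict(list)
    for j in range(len(subject) - span + 1):
        index[masked(subject[j:j + span], seed)].append(j)
    return [(i, j)
            for i in range(len(query) - span + 1)
            for j in index.get(masked(query[i:i + span], seed), [])]
```

Passing a seed made only of 1s turns the same function into a consecutive-seed search, which makes it easy to compare the two kinds of seed on the same pair of sequences.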



Spaced seeds

Example of a hit using a spaced seed:

How does this result in better sensitivity?



Why is PH better?

• BLAST: the consecutive seed results in > 1 hit per homologous region and creates clusters of redundant hits

• PatternHunter: the spaced seed results in very few redundant hits



Advantage of Gapped Seeds

11 positions

11 positions

10 positions



Use of Multiple Seeds

Basic Searching Algorithm

1. Select a group of spaced seed models

2. For each hit of each model, conduct extension to find a homology.


Another method: BLAT

• BLAT (BLAST-Like Alignment Tool)

• Same idea as BLAST - locate short sequence hits and extend


BLAT vs. BLAST: Differences

• BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database

• Index is stored in RAM, which is memory intensive but results in faster searches


BLAT: Fast cDNA Alignments

Steps:

1. Break cDNA into 500 base chunks.

2. Use an index to find regions in genome similar to each chunk of cDNA.

3. Do a detailed alignment between genomic regions and cDNA chunk.

4. Use dynamic programming to stitch together detailed alignments of chunks into detailed alignment of whole.


BLAT: Indexing

• An index is built that contains the positions of each k-mer in the genome

• Each k-mer in the query sequence is looked up in the index

• A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases


Indexing: An Example

Here is an example with k = 3:

Genome: cacaattatcacgaccgc
3-mers (non-overlapping): cac aat tat cac gac cgc

Index:
aat 3
gac 12
cac 0,9   (multiple instances map to single index)
tat 6
cgc 15

cDNA (query sequence): aattctcac
3-mers (overlapping): aat att ttc tct ctc tca cac
                      0   1   2   3   4   5   6

Hits (position of 3-mer in query, genome):
aat 0,3
cac 6,0
cac 6,9

clump: cacAATtatCACgaccgc
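
The indexing example can be reproduced with a few lines of Python. This sketch only mirrors the slide's toy example (non-overlapping genome k-mers, overlapping query k-mers, 0-based positions); the function names are hypothetical and it is not BLAT's actual code.

```python
from collections import defaultdict

def build_index(genome, k):
    """Index the non-overlapping k-mers of the genome by 0-based start position."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query; return (k-mer, query pos, genome pos)."""
    hits = []
    for qpos in range(len(query) - k + 1):
        kmer = query[qpos:qpos + k]
        for gpos in index.get(kmer, []):
            hits.append((kmer, qpos, gpos))
    return hits
```

Indexing only non-overlapping genome k-mers keeps the index roughly k times smaller, while scanning overlapping query k-mers still lets a match be found in any reading offset.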



However…

• BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments


PatternHunter and BLAT vs. BLAST

• PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity

• BLAT is several times faster than BLAST, but best results are limited to closely related sequences


Resources

• tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt

• http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf

• http://www.genomeblat.com/genomeblat/blatRapShow.pps
