patr an huntre and blast

Upload: eddy-valdeiglesias-quispe

Post on 06-Jul-2018


    www.bioalgorithms.info An Introduction to Bioinformatics Algorithms

    Combinatorial Pattern Matching

    Outline

    • Hash Tables
    • Repeat Finding
    • Exact Pattern Matching
    • Keyword Trees
    • Suffix Trees
    • Heuristic Similarity Search Algorithms
    • Approximate String Matching
    • Filtration
    • Comparing a Sequence Against a Database
    • Algorithm behind BLAST
    • Statistics behind BLAST
    • PatternHunter and BLAT

    Genomic Repeats

    • Example of repeats:
      • ATGGTCTAGGTCCTAGTGGTC
    • Motivation to find them:
      • Genomic rearrangements are often associated with repeats
      • Trace evolutionary secrets
      • Many tumors are characterized by an explosion of repeats


    Genomic Repeats

    • The problem is often more difficult:

    • ATGGTCTAGGACCTAGTGTTC

    l-mer Repeats

    • Long repeats are difficult to find
    • Short repeats are easy to find (e.g., hashing)
    • Simple approach to finding long repeats:
      • Find exact repeats of short l-mers (l is usually 10 to 13)
      • Use l-mer repeats to potentially extend into longer, maximal repeats

    l-mer Repeats (cont’d)

    • There are typically many locations where an l-mer is repeated:

    GCTTACAGATTCAGTCTTACAGATGGT

    • The 4-mer TTAC starts at locations 3 and 17

    Extending l-mer Repeats

    GCTTACAGATTCAGTCTTACAGATGGT

    • Extend these 4-mer matches:

    GCTTACAGATTCAGTCTTACAGATGGT

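    • Maximal repeat: CTTACAGAT (extending to the right gives TTACAGAT; the C immediately to the left of both copies also matches)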

    Maximal Repeats

    • To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
    • Hashing lets us find repeats quickly in this manner


    Hashing: Definitions

    • Hash table: array used in hashing

    • Records: data stored in a hash table

    • Keys: identify sets of records

    • Hash function: uses a key to generate the index at which a record is inserted in the hash table

    • Collision: when more than one record is mapped to the same index in the hash table

    Hashing: Example

    • Where do the animals eat?
    • Records: each animal
    • Keys: where each animal eats

    Hashing DNA sequences

    • Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
    • After assigning a unique integer per l-mer, it is easy to get all start locations of each l-mer in a genome
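The two-bit translation described on this slide can be sketched as follows. This is a minimal illustration, not code from the lecture: the function names `encode_lmer` and `lmer_codes` are hypothetical, and the code table simply copies the A, T, C, G = 00, 01, 10, 11 assignment given above.

```python
# Two-bit codes exactly as listed on the slide.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def encode_lmer(lmer: str) -> int:
    """Pack an l-mer into a single integer, two bits per nucleotide."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]
    return value


def lmer_codes(genome: str, l: int) -> list[int]:
    """Integer code of the l-mer starting at each position of the genome."""
    return [encode_lmer(genome[i:i + l]) for i in range(len(genome) - l + 1)]
```

Because every l-mer over a four-letter alphabet gets a distinct integer below 4^l, the code can serve directly as an index into a table of start locations.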

    Hashing: Collisions

    • Dealing with collisions:
      • “Chain” all start locations of l-mers (linked list)

    Hashing: Summary

    • When finding genomic repeats from l-mers:
      • Generate a hash table index for each l-mer sequence
      • In each index, store all genome start locations of the l-mer which generated that index
      • Extend l-mer repeats to maximal repeats
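The summary above can be sketched in a few lines, assuming a Python dictionary stands in for the hash table and positions are reported 1-based as on the slides. The names `index_lmers` and `extend_repeat` are hypothetical, and the extension ignores the case where the two copies overlap.

```python
from collections import defaultdict


def index_lmers(genome: str, l: int) -> dict[str, list[int]]:
    """Map each l-mer to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(genome) - l + 1):
        table[genome[i:i + l]].append(i + 1)
    return table


def extend_repeat(genome: str, i: int, j: int, l: int) -> str:
    """Extend an l-mer match at 1-based positions i < j while both copies agree."""
    a, b = i - 1, j - 1            # 0-based starts of the two copies
    end_a, end_b = a + l, b + l    # 0-based exclusive ends
    # Grow to the left while the preceding letters agree.
    while a > 0 and genome[a - 1] == genome[b - 1]:
        a, b = a - 1, b - 1
    # Grow to the right while the following letters agree.
    while end_b < len(genome) and genome[end_a] == genome[end_b]:
        end_a, end_b = end_a + 1, end_b + 1
    return genome[a:end_a]
```

On the example string GCTTACAGATTCAGTCTTACAGATGGT, the 4-mer TTAC is indexed at positions 3 and 17, and extending in both directions yields CTTACAGAT, since the C to the left of both copies also agrees.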

    Pattern Matching

    • What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
    • This leads us to a different problem, the Pattern Matching Problem

    Pattern Matching Problem

    • Goal: Find all occurrences of a pattern in a text
    • Input: Pattern p = p_1…p_n and text t = t_1…t_m
    • Output: All positions 1 ≤ i ≤ (m – n + 1) such that the n-letter substring of t starting at i matches p
    • Motivation: Searching a database for a known pattern

    Exact Pattern Matching: A Brute-Force Algorithm

    PatternMatching(p, t)
    1  n ← length of pattern p
    2  m ← length of text t
    3  for i ← 1 to (m – n + 1)
    4    if t_i…t_{i+n–1} = p
    5      output i
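A direct Python rendering of the brute-force pseudocode might look like this. It is a sketch only: the function name `pattern_matching` is hypothetical, and positions are returned 1-based to agree with the slide's indexing.

```python
def pattern_matching(p: str, t: str) -> list[int]:
    """Return all 1-based positions where pattern p occurs in text t."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        # Compare the n-letter window of t starting at i with the pattern.
        if t[i:i + n] == p:
            positions.append(i + 1)
    return positions
```

For the pattern GCAT and the text CGCATC this reports the single position 2.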

    Exact Pattern Matching: An Example

    • PatternMatching algorithm for:
      • Pattern GCAT
      • Text CGCATC

    i = 1:  GCAT      no match
            CGCATC

    i = 2:   GCAT     match, output 2
            CGCATC

    i = 3:    GCAT    no match
            CGCATC

    Exact Pattern Matching: Running Time

    • PatternMatching runtime: O(nm)
    • Probabilistically (on random text), it is more like O(m)
      • Rarely will there be close to n comparisons in line 4
    • Better solution: suffix trees
      • Can solve the problem in O(m) time
      • Conceptually related to keyword trees

    Keyword Trees: Example

    • Keyword tree:
      • Apple
      • Apropos
      • Banana
      • Bandana
      • Orange

    Keyword Trees: Properties

    • Stores a set of keywords in a rooted labeled tree
    • Each edge is labeled with a letter from an alphabet
    • Any two edges coming out of the same vertex have distinct labels
    • Every keyword stored can be spelled on a path from the root to some leaf
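As an illustration of these properties, a keyword tree can be sketched with nested dictionaries, one dictionary per vertex and one key per outgoing edge label. The name `build_keyword_tree` and the `END` marker (used here to record where a keyword finishes) are assumptions of this sketch, not part of the slides.

```python
END = None  # hypothetical marker key: stores the keyword that ends at a vertex


def build_keyword_tree(keywords: list[str]) -> dict:
    """Build a keyword tree: each vertex is a dict keyed by edge letters."""
    root: dict = {}
    for word in keywords:
        node = root
        for letter in word:
            # Dict keys are unique, so edges leaving a vertex have distinct labels.
            node = node.setdefault(letter, {})
        node[END] = word
    return root
```

Building the tree takes time proportional to the total length of the keywords, since each letter is visited once.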

    Keyword Trees: Threading (cont’d)

    • Thread “appeal”
      • appeal
    • Thread “apple”
      • apple

    Multiple Pattern Matching Problem

    • Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text

    • Input: k patterns p_1,…,p_k, and text t = t_1…t_m

    • Output: Positions 1 ≤ i ≤ m where the substring of t starting at i matches p_j for 1 ≤ j ≤ k

    Multiple Pattern Matching: Straightforward Approach

    • Can solve as k “Pattern Matching Problems”
    • Runtime: O(kmn), using the PatternMatching algorithm k times
      • m - length of the text
      • n - average length of the pattern

    Multiple Pattern Matching: Keyword Tree Approach

    • Or, we could use keyword trees:
      • Build keyword tree in O(N) time; N is the total length of all patterns
      • With naive threading: O(N + nm)
      • Aho-Corasick algorithm: O(N + m)

    Keyword Trees: Threading

    • To match patterns in a text using a keyword tree:
      • Build keyword tree of patterns
      • “Thread” the text through the keyword tree

    Keyword Trees: Threading (cont’d)

    • Threading is “complete” when we reach a leaf in the keyword tree
    • When threading is “complete,” we’ve found a pattern in the text
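Naive threading can be sketched as follows, assuming a nested-dictionary keyword tree with a hypothetical `END` marker at the vertex where each pattern finishes. This is the O(N + nm) approach, restarting from the root at every text position; it is not the Aho-Corasick algorithm. The names `build_keyword_tree` and `thread` are illustrative.

```python
END = None  # hypothetical marker key: stores the pattern that ends at a vertex


def build_keyword_tree(patterns: list[str]) -> dict:
    """Keyword tree as nested dicts, one key per outgoing edge letter."""
    root: dict = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node[END] = pattern
    return root


def thread(text: str, tree: dict) -> list[tuple[int, str]]:
    """Thread the text through the tree from every start position.

    Returns (1-based position, pattern) for each occurrence found.
    """
    hits = []
    for i in range(len(text)):
        node = tree
        for j in range(i, len(text)):
            if text[j] not in node:
                break  # threading from position i fails here
            node = node[text[j]]
            if END in node:
                hits.append((i + 1, node[END]))  # threading is complete
    return hits
```

Checking for the marker at every vertex, rather than only at leaves, also reports a pattern that happens to be a prefix of another pattern.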

    Suffix Trees = Collapsed Keyword Trees

    • Similar to keyword trees, except edges that form paths are collapsed
      • Each edge is labeled with a substring of a text
      • All internal vertices have at least two outgoing edges
      • Leaves are labeled by the index of the pattern

    Suffix Tree of a Text

    • The suffix tree of a text is constructed from all of its suffixes
    • Suffixes of ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
    • [Figure: keyword tree of the suffixes, collapsed into the suffix tree]
    • How much time does it take? Building the keyword tree takes time linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

    Suffix Trees: Advantages

    • The suffix tree of a text is constructed from all of its suffixes
    • Suffix trees build faster than keyword trees
      • Keyword tree of all suffixes: quadratic
      • Suffix tree: linear (Weiner suffix tree algorithm)

    Use of Suffix Trees

    • Suffix trees hold all suffixes of a text
      • e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
      • Builds in O(m) time for text of length m
    • To find any pattern of length n in a text:
      • Build suffix tree for text
      • Thread the pattern through the suffix tree
      • Can find pattern in text in O(n) time!
    • O(n + m) time for “Pattern Matching Problem”
      • Build suffix tree and look up pattern
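The lookup step can be illustrated with an uncollapsed suffix trie, i.e., the quadratic keyword tree of all suffixes rather than a true suffix tree. This sketch does not implement Weiner's linear-time construction; it only shows that, once the structure exists, threading a pattern of length n takes O(n) steps. The names `build_suffix_trie`, `find`, and the `STARTS` marker are hypothetical.

```python
STARTS = None  # hypothetical marker key: start positions of suffixes through a vertex


def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of the text into a nested-dict trie (quadratic time)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
            # Remember which suffix (1-based start) passes through this vertex.
            node.setdefault(STARTS, []).append(start + 1)
    return root


def find(trie: dict, pattern: str) -> list[int]:
    """Thread the pattern from the root; return 1-based occurrence positions."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return []
        node = node[letter]
    return node.get(STARTS, [])
```

Every occurrence of a pattern is a prefix of some suffix, which is why a single walk from the root finds all of them.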

    Suffix Trees: Example

    [Figure omitted in transcript]

    Multiple Pattern Matching: Summary

    • Keyword and suffix trees are used to find patterns in a text
    • Keyword trees:
      • Build keyword tree of patterns, and thread text through it
    • Suffix trees:
      • Build suffix tree of text, and thread patterns through it

    Heuristic Similarity Searches

    • Genomes are huge: Smith-Waterman quadratic alignment algorithms are too slow
    • Alignment of two sequences usually has short identical or highly similar fragments
    • Many heuristic methods (e.g., FASTA) are based on the same idea of filtration
      • Find short exact matches, and use them as seeds for potential match extension
      • “Filter” out positions with no extendable matches

    Dot Matrices

    • Dot matrices show similarities between two sequences
    • FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

    Dot Matrices (cont’d)

    • Identify diagonals above a threshold length
    • Diagonals in the dot matrix indicate exact substring matching

    Diagonals in Dot Matrices

    • Extend diagonals and try to link them together, allowing for minimal mismatches/indels
    • Linking diagonals reveals approximate matches over longer substrings

    Approximate Pattern Matching Problem

    • Goal: Find all approximate occurrences of a pattern in a text
    • Input: A pattern p = p_1…p_n, text t = t_1…t_m, and k, the maximum number of mismatches
    • Output: All positions 1 ≤ i ≤ (m – n + 1) such that t_i…t_{i+n–1} and p_1…p_n have at most k mismatches (i.e., Hamming distance between t_i…t_{i+n–1} and p ≤ k)

    Approximate Pattern Matching: A Brute-Force Algorithm

    1   ApproximatePatternMatching(p, t, k)
    2     n ← length of pattern p
    3     m ← length of text t
    4     for i ← 1 to m – n + 1
    5       dist ← 0
    6       for j ← 1 to n
    7         if t_{i+j–1} ≠ p_j
    8           dist ← dist + 1
    9       if dist ≤ k
    10        output i
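The brute-force procedure translates almost line for line into Python. In this sketch the function name `approximate_pattern_matching` is hypothetical, positions are 1-based as on the slide, and a window is reported when its Hamming distance to the pattern is at most k.

```python
def approximate_pattern_matching(p: str, t: str, k: int) -> list[int]:
    """Return 1-based positions where p matches t with at most k mismatches."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        # Hamming distance between the pattern and the window starting at i.
        dist = sum(1 for a, b in zip(p, t[i:i + n]) if a != b)
        if dist <= k:
            positions.append(i + 1)
    return positions
```

Each of the m – n + 1 windows costs n comparisons, which gives the O(nm) running time.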

    Approximate Pattern Matching: Running Time

    • That algorithm runs in O(nm)
    • Landau-Vishkin algorithm: O(km)
    • We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
      • We want to match substrings in a query to substrings in a text with at most k mismatches
      • Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for

    Query Matching Problem

    • Goal: Find all substrings of the query that approximately match the text
    • Input: Query q = q_1…q_w, text t = t_1…t_m, n (length of matching substrings), k (maximum number of mismatches)
    • Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches

    Approximate Pattern Matching vs Query Matching

    [Figure omitted in transcript]

    Query Matching: Main Idea

    • Approximately matching strings share some perfectly matching substrings.
    • Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).

    Filtration in Query Matching

    • We want all n-matches between a query and a text with up to k mismatches
    • “Filter” out positions we know do not match between text and query
    • Potential match detection: find all matches of l-tuples in query and text for some small l
    • Potential match verification: Verify each potential match by extending it to the left and right, until (k + 1) mismatches are found

    Filtration: Match Detection

    • If x_1…x_n and y_1…y_n match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = ⌊n/(k + 1)⌋
    • Break string of length n into k+1 parts, each of length at least ⌊n/(k + 1)⌋
    • k mismatches can affect at most k of these k+1 parts
    • At least one of these k+1 parts is perfectly matched
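The pigeonhole argument can be checked on a small example. This sketch assumes two equal-length strings with n ≥ k + 1; the function name `exactly_matched_parts` is hypothetical, and the last part absorbs any leftover letters when n is not divisible by k + 1.

```python
def exactly_matched_parts(x: str, y: str, k: int) -> list[int]:
    """Split both strings into k + 1 aligned parts; return the 1-based
    indices of the parts that match exactly."""
    n = len(x)
    l = n // (k + 1)  # part length, floor(n / (k + 1))
    matched = []
    for part in range(k + 1):
        start = part * l
        end = n if part == k else start + l  # last part takes the remainder
        if x[start:end] == y[start:end]:
            matched.append(part + 1)
    return matched
```

With at most k mismatches spread over k + 1 parts, the returned list can never be empty.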

    Filtration: Match Detection (cont’d)

    • Suppose k = 3. We would then have l = n/(k+1) = n/4:

      part:        1       2         k          k + 1
      positions:   1…l     l+1…2l    2l+1…3l    3l+1…n

    • There are at most k mismatches in n, so at the very least there must be one out of the k+1 l-tuples without a mismatch

    Filtration: Match Verification

    • For each l-match we find, try to extend the match further to see if it is substantial
    • Extend the perfect match of length l between query and text until we find an approximate match of length n with k mismatches
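Detection and verification can be combined into a small filtration sketch. The assumptions here are mine: the text's l-tuples are indexed in a dictionary, and verification is done by re-checking the Hamming distance of every n-letter window that contains a shared l-tuple, rather than by extending left and right one letter at a time as the slide describes. The names `hamming` and `query_matching` are hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matching(q: str, t: str, n: int, k: int) -> list[tuple[int, int]]:
    """All 1-based (i, j) where q[i..i+n-1] and t[j..j+n-1] differ in <= k places."""
    l = n // (k + 1)
    # Potential match detection: index every l-tuple of the text.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    found = set()
    for a in range(len(q) - l + 1):
        for b in index.get(q[a:a + l], []):
            # Potential match verification: the shared l-tuple may sit at
            # any offset inside the n-letter window.
            for off in range(n - l + 1):
                i, j = a - off, b - off
                if i < 0 or j < 0 or i + n > len(q) or j + n > len(t):
                    continue
                if hamming(q[i:i + n], t[j:j + n]) <= k:
                    found.add((i + 1, j + 1))
    return sorted(found)
```

Only window pairs that share at least one exact l-tuple are ever compared in full, which is where the savings over the brute-force comparison of all pairs come from.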

    Filtration: Example

                      k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
    l-tuple length    n       n/2     n/3     n/4     n/5     n/6

    • Shorter perfect matches required
    • Performance decreases

    Local alignment is too slow…

    • Quadratic local alignment is too slow while looking for similarities between long strings (e.g., the entire GenBank database)
      • Guaranteed to find the optimal local alignment
      • Sets the standard for sensitivity
    • Basic Local Alignment Search Tool
      • Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990
      • Search sequence databases for local alignments to a query

    BLAST

    • Great improvement in speed, with a modest decrease in sensitivity
    • Minimizes search space instead of exploring entire search space between two sequences
    • Finds short exact matches (“seeds”), only explores locally around these “hits”


    What Similarity Reveals

    • BLASTing a new gene

    • Evolutionary relationship

    • Similarity between protein function

    • BLASTing a genome

    • Potential genes

    BLAST algorithm

    • Keyword search of all words of length w from the query of length n in database of length m with score above threshold
      • w = 11 for DNA queries, w = 3 for proteins
    • Local alignment extension for each found keyword
      • Extend result until longest match above threshold is achieved
    • Running time O(nm)

    BLAST algorithm (cont’d)

    Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

    • Keyword: GVK
    • Neighborhood words: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
    • Neighborhood score threshold (T = 13)
    • Extension to a High-scoring Pair (HSP):

    Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
               +++DN +G +   IR L    G+K I+ L+ E+ RG++K
    Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

    Original BLAST

    • Dictionary
      • All words of length w
    • Alignment
      • Ungapped extensions until score falls below some statistical threshold
    • Output
      • All local alignments with score > threshold
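Ungapped extension of a word hit can be sketched as below. The stopping rule is an assumption of this example: extension on each side stops once the running score has dropped `drop` below the best score seen so far, and the scores (+1 match, -1 mismatch) are toy values. The slides only say that extension continues until the score falls below some statistical threshold. The names `extend_one_side` and `extend_ungapped` are hypothetical.

```python
def extend_one_side(q, s, a, b, step, match, mismatch, drop):
    """Walk from (a, b) in direction step; return (best gain, letters kept)."""
    score = best = 0
    kept = length = 0
    while 0 <= a < len(q) and 0 <= b < len(s):
        score += match if q[a] == s[b] else mismatch
        length += 1
        if score > best:
            best, kept = score, length   # new best: keep everything so far
        elif best - score >= drop:
            break                        # score fell too far below the best
        a, b = a + step, b + step
    return best, kept


def extend_ungapped(q, s, qi, si, w, match=1, mismatch=-1, drop=3):
    """Extend an exact w-letter hit at 0-based q[qi:], s[si:] without gaps."""
    right, r_len = extend_one_side(q, s, qi + w, si + w, 1, match, mismatch, drop)
    left, l_len = extend_one_side(q, s, qi - 1, si - 1, -1, match, mismatch, drop)
    score = w * match + left + right
    return score, q[qi - l_len:qi + w + r_len], s[si - l_len:si + w + r_len]
```

The returned pair of segments is trimmed back to the point where the best score was reached, so trailing mismatches are not included.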

    Gapped BLAST: Example

    • Original BLAST exact keyword search, THEN:
      • Extend with gaps around ends of exact match until score < threshold
      • Output result

    GTAAGGTCCAGT
    GTTAGGTC-AGT

    [Figure: dot matrix of the two sequences, omitted in transcript]

    From lectures by Serafim Batzoglou (Stanford)


    Incarnations of BLAST

    • blastn: Nucleotide-nucleotide

    • blastp: Protein-protein

    • blastx: Translated query vs. protein database

    • tblastn: Protein query vs. translated database

    • tblastx: Translated query vs. translated database (6 frames each)

    Incarnations of BLAST (cont’d)

    • PSI-BLAST
      • Find members of a protein family or build a custom position-specific score matrix
    • Megablast:
      • Search longer sequences with fewer differences
    • WU-BLAST: (Wash U BLAST)
      • Optimized, added features

    Assessing sequence similarity

    • Need to know how strong an alignment can be expected from chance alone
    • “Chance” relates to comparison of sequences that are generated randomly based upon a certain sequence model
    • Sequence models may take into account:
      • G+C content
      • Poly-A tails
      • “Junk” DNA
      • Codon bias
      • Etc.

    BLAST: Segment Score

    • BLAST uses scoring matrices (δ) to improve on efficiency of match detection
      • Some proteins may have very different amino acid sequences, but are still similar
    • For any two l-mers x_1…x_l and y_1…y_l:
      • Segment pair: pair of l-mers, one from each sequence
      • Segment score: $\sum_{i=1}^{l} \delta(x_i, y_i)$
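The segment score is simply a sum of per-position scores. In this sketch the scoring function `toy_delta` (+1 for identical letters, -1 otherwise) is a hypothetical stand-in for a real scoring matrix, and `segment_score` is an illustrative name.

```python
def toy_delta(a: str, b: str) -> int:
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def segment_score(x: str, y: str, delta=toy_delta) -> int:
    """Sum delta(x_i, y_i) over the aligned positions of two l-mers."""
    if len(x) != len(y):
        raise ValueError("segment pair must consist of two l-mers of equal length")
    return sum(delta(a, b) for a, b in zip(x, y))
```

Replacing `toy_delta` with a lookup into a substitution matrix lets dissimilar but biochemically related amino acids contribute positive scores.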

    BLAST: Locally Maximal Segment Pairs

    • A segment pair is maximal if it has the best score over all segment pairs
    • A segment pair is locally maximal if its score can’t be improved by extending or shortening
    • Statistically significant locally maximal segment pairs are of biological interest
    • BLAST finds all locally maximal segment pairs with scores above some threshold
      • A sufficiently high threshold will filter out some statistically insignificant matches

    BLAST: Statistics

    • Threshold: Altschul-Dembo-Karlin statistics
      • Identifies smallest segment score that is unlikely to happen by chance
    • # matches above θ has mean $E(\theta) = K m n e^{-\lambda\theta}$; K is a constant, m and n are the lengths of the two compared sequences
      • Parameter λ is the positive root of $\sum_{x,y \in A} p_x p_y e^{\lambda\delta(x,y)} = 1$, where p_x and p_y are frequencies of amino acids x and y, and A is the twenty-letter amino acid alphabet
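To make the two formulas concrete, the positive root λ can be found numerically. This sketch uses bisection and a toy model that is purely an assumption for illustration: a four-letter DNA alphabet with uniform frequencies and scores of +1 for a match and -1 for a mismatch, for which the root works out to ln 3. The names `karlin_lambda` and `expected_hits`, the search interval, and the value of K in the test are hypothetical.

```python
import math


def karlin_lambda(freqs, delta, lo=1e-9, hi=10.0, iters=200):
    """Positive root of sum_{x,y} p_x p_y exp(lambda * delta(x, y)) = 1, by bisection.

    Assumes the expected score is negative and some score is positive, so the
    left-hand side dips below 1 just above 0 and exceeds 1 for large lambda.
    """
    def f(lam):
        return sum(
            px * py * math.exp(lam * delta(x, y))
            for x, px in freqs.items()
            for y, py in freqs.items()
        ) - 1.0

    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_hits(K, m, n, lam, theta):
    """Mean number of matches scoring above theta: K * m * n * exp(-lambda * theta)."""
    return K * m * n * math.exp(-lam * theta)
```

Because the expected count decays exponentially in θ, raising the threshold by a fixed amount divides the number of chance matches by a constant factor.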

    Sample BLAST output

    Sequences producing significant alignments:    Score (bits)    E Value

    gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44

    gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44

    gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44

    gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43

    ALIGNMENTS

    >gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]

     Length = 148

     Score = 171 bits (434), Expect = 3e-44

     Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

    Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60

      MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK

    Sbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

    Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120

      V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG

    Sbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

    Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147

      + F VQ A+QK +A V +AL +YH

    Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

    • Blast of human beta globin protein against zebra fish

    Sample BLAST output (cont’d)

    Sequences producing significant alignments:    Score (bits)    E Value

    gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75

    gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end 289 1e-75

    gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72

    gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66

    gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34

    gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33

    ALIGNMENTS

    >gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11

      Length = 81706

     Score = 149 bits (75), Expect = 3e-33

     Identities = 183/219 (83%)

     Strand = Plus / Plus

     

    Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326

      || ||| | || | || | |||||| ||||| ||||||||||| ||||||||

    Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

     

    Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365

      ||||||||| |||||||||| ||||| ||||||||||||

    Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

    • BLAST of human beta-globin DNA against human DNA


    Timeline

    • 1970: Needleman-Wunsch global alignment algorithm

    • 1981: Smith-Waterman local alignment algorithm

    • 1985: FASTA

    • 1990: BLAST (basic local alignment search tool)

    • 2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!

      • PatternHunter

      • BLAT


    PatternHunter: faster and even more sensitive

    • BLAST: matches short consecutive sequences (consecutive seed)

    • Length = k

    • Example (k = 11):

    11111111111

    Each 1 represents a “match”

    • PatternHunter: matches short non-consecutive sequences (spaced seed)

    • Increases sensitivity by locating homologies that would otherwise be missed

    • Example (a spaced seed of length 18 w/ 11 “matches”):

    111010010100110111

    Each 0 represents a “don’t care”, so there can be a match or a mismatch
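To make the seed notation concrete, here is a minimal Python sketch of how a seed decides whether two aligned sequences produce a hit. The function name `seed_hits` and the two 18-base test strings are illustrative assumptions, not part of PatternHunter; the two seeds are the ones shown on the slide. The sketch assumes the sequences are already aligned without gaps.

```python
def seed_hits(s, t, seed):
    """Return offsets i where s and t agree at every '1' position of the seed.

    Assumes s and t have equal length and are aligned without gaps.
    """
    care = [j for j, c in enumerate(seed) if c == "1"]  # positions that must match
    span = len(seed)
    return [
        i
        for i in range(len(s) - span + 1)
        if all(s[i + j] == t[i + j] for j in care)
    ]


CONSECUTIVE = "1" * 11             # BLAST-style seed, k = 11
SPACED = "111010010100110111"      # spaced seed from the slide (11 ones, length 18)

# Hypothetical aligned region: mismatches at offsets 3 and 10 only.
a = "ACGTACGTACGTACGTAC"
b = "ACGAACGTACCTACGTAC"

print(seed_hits(a, b, CONSECUTIVE))  # → []  (no run of 11 consecutive matches)
print(seed_hits(a, b, SPACED))       # → [0] (both mismatches fall on "don't care" positions)
```

Both seeds require 11 matching positions, yet only the spaced seed reports this region, because the mismatches happen to land under its 0s.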


    Spaced seeds

    Example of a hit using a spaced seed:

    How does this result in better sensitivity?


    Why is PH better?

    • BLAST: redundant hits

    This results in > 1 hit and creates clusters of redundant hits

    • PatternHunter

    This results in very few redundant hits


    Advantage of Gapped Seeds

    [Figure labels: 11 positions, 11 positions, 10 positions]
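One way to read the position counts above (an interpretation, since the figure itself did not survive extraction) is as the number of required positions that a seed shares with a copy of itself shifted by one base. The sketch below counts those shared positions; `shared_positions` is a hypothetical helper name.

```python
def shared_positions(seed, shift):
    """Count positions required both by the seed and by the seed shifted by `shift`."""
    return sum(
        1
        for i in range(len(seed) - shift)
        if seed[i] == "1" and seed[i + shift] == "1"
    )


CONSECUTIVE = "1" * 11
SPACED = "111010010100110111"

print(shared_positions(CONSECUTIVE, 1))  # → 10
print(shared_positions(SPACED, 1))       # → 5
```

After a consecutive-seed hit, the neighbouring hit needs only 1 additional matching base (11 - 10), so hits arrive in redundant clusters. For the spaced seed a neighbouring hit needs 6 additional matches (11 - 5), so hits are closer to independent and fewer of them are redundant.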


    Use of Multiple Seeds

    Basic Searching Algorithm

    1. Select a group of spaced seed models

    2. For each hit of each model, conduct extension to find a homology.
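The two steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not PatternHunter's implementation: the function names, the +1/-1 scoring, the `x_drop` cutoff, and the tiny seeds in the usage example are all hypothetical, and the extension is ungapped.

```python
from collections import defaultdict


def seed_key(text, start, seed):
    """Characters of text under the '1' positions of the seed, starting at `start`."""
    return "".join(text[start + j] for j, c in enumerate(seed) if c == "1")


def find_hits(query, target, seed):
    """Yield (query_pos, target_pos) pairs whose seed keys are equal."""
    index = defaultdict(list)
    for t in range(len(target) - len(seed) + 1):
        index[seed_key(target, t, seed)].append(t)
    for q in range(len(query) - len(seed) + 1):
        for t in index.get(seed_key(query, q, seed), []):
            yield q, t


def extend(query, target, q, t, span, x_drop=3):
    """Ungapped extension (+1 match, -1 mismatch) with a simple X-drop cutoff.

    Returns (query_start, target_start, length) of the extended segment.
    """
    def walk(step, qi, ti):
        best = score = length = best_len = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += 1 if query[qi] == target[ti] else -1
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            ti += step
        return best_len

    left = walk(-1, q - 1, t - 1)
    right = walk(+1, q + span, t + span)
    return q - left, t - left, left + span + right


def search(query, target, seeds):
    """Step 1: take a group of seed models. Step 2: extend every hit of every model."""
    found = set()
    for seed in seeds:
        for q, t in find_hits(query, target, seed):
            found.add(extend(query, target, q, t, len(seed)))
    return sorted(found)


print(search("ACGGCA", "TTACGGCATT", ["111", "1101"]))  # → [(0, 2, 6)]
```

Collecting results in a set merges hits from different seed models that extend to the same segment, which is one simple way to handle overlap between models.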


    Another method: BLAT

    • BLAT (BLAST-Like Alignment Tool)

    • Same idea as BLAST - locate short sequence hits and extend


    BLAT vs. BLAST: Differences

    • BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database

    • Index is stored in RAM, which is memory intensive but results in faster searches


    BLAT: Fast cDNA Alignments

    Steps:

    1. Break cDNA into 500 base chunks.

    2. Use an index to find regions in genome similar to each chunk of cDNA.

    3. Do a detailed alignment between genomic regions and cDNA chunk.

    4. Use dynamic programming to stitch together detailed alignments of chunks into detailed alignment of whole.


    BLAT: Indexing

    • An index is built that contains the positions of each k-mer in the genome

    • Each k-mer in the query sequence is looked up in the index

    • A list of ‘hits’ is generated - positions in cDNA and in genome that match for k bases


    Indexing: An Example

    Here is an example with k = 3:

    Genome: cacaattatcacgaccgc

    3-mers (non-overlapping): cac aat tat cac gac cgc

    Index (3-mer: positions in genome):

    aat 3

    cac 0,9 (multiple instances map to a single index entry)

    cgc 15

    gac 12

    tat 6

    cDNA (query sequence): aattctcac

    3-mers (overlapping): aat att ttc tct ctc tca cac

    Position in query:    0   1   2   3   4   5   6

    Hits (position of 3-mer in query, genome):

    aat 0,3

    cac 6,0

    cac 6,9

    Clump: cacAATtatCACgaccgc
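The example above can be reproduced with a short Python sketch. The function names `build_kmer_index` and `lookup_hits` are illustrative assumptions, not BLAT's actual code; the sketch only covers the indexing and lookup step, not clumping or alignment.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):  # step k: non-overlapping
        index[genome[pos:pos + k]].append(pos)
    return index


def lookup_hits(query, index, k):
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos)."""
    hits = []
    for q in range(len(query) - k + 1):           # step 1: overlapping
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


index = build_kmer_index("cacaattatcacgaccgc", 3)
print(index["cac"])                           # → [0, 9]
print(lookup_hits("aattctcac", index, 3))     # → [(0, 3), (6, 0), (6, 9)]
```

Indexing the genome with non-overlapping k-mers keeps the index roughly k times smaller than indexing every position, while scanning the query with overlapping k-mers ensures a query k-mer can still line up with an indexed one.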


    However…

    • BLAT was designed to find sequences of 95% and greater similarity of length >40; may miss more divergent or shorter sequence alignments


    PatternHunter and BLAT vs. BLAST

    • PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity

    • BLAT is several times faster than BLAST, but best results are limited to closely related sequences


    Resources

    • tandem.bu.edu/classes/2004/papers/pathunter_grp_prsnt.ppt

    • http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf

    • http://www.genomeblat.com/genomeblat/blatRapShow.pps