blast and searching sequence databases dr alexei drummond department of computer science...

BLAST and searching sequence databases

Dr Alexei Drummond

Department of Computer Science

[email protected]

BIOSCI 359, Semester 2, 2006

2

Seq

uen

ce d

atab

ase

sear

chin

g

Sequence Homology

• Homologous protein or DNA sequences share common ancestry– A statement of homology is

therefore an evolutionary hypothesis

• Homology need not imply similar function

• Homology is a binary property, a pair of sequences are either homologous or not homologous.– No such thing as degree of

homology

• Homology is often inferred by sequence similarity

a b

x

a b

x y

a, b not homologous

a, b homologous

t

3

Seq

uen

ce d

atab

ase

sear

chin

g

Orthology and paralogy

"Where the homology is a result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact). " Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 1970 Jun;19(2):99-113.

4

Seq

uen

ce d

atab

ase

sear

chin

g

Orthology and paralogy

5

Seq

uen

ce d

atab

ase

sear

chin

g

Orthology, paralogy and multigene families

Reproduced from NCBI education website

6

Seq

uen

ce d

atab

ase

sear

chin

g

What are good scores for searching databases?

• We want these scores to distinguish the related sequences from the unrelated sequences

• So we select alignment parameters for database searching that give us the best distinguishing scores

• These may not be the parameters that will give us the most accurate alignment

7

Seq

uen

ce d

atab

ase

sear

chin

g

Database searching is an experiment

• “Database searching is the application of knowledge gained from previous experiments to the problem of discovering the biochemistry and physiology of a newly discovered gene or its protein.”

• It demands the same careful thought and execution as your bench or laboratory investigations!

• Garbage in, garbage out

8

Seq

uen

ce d

atab

ase

sear

chin

g

Sources of previous knowledge

• Similarity scores - PAM, Blosum similarity matrices– Rescue us from having to assume that all amino acid

changes are equally likely and equally harmful– Different similarity matrices are appropriate for

different degrees of evolutionary divergence– More later

9

Seq

uen

ce d

atab

ase

sear

chin

g

Second source of previous knowledge

• Computer Algorithm – Dynamic programming is most sensitive and the

least selective• Local, global, repeats, overlap

– BLAST and FASTA are much faster and more selective which can be an advantage

– No program is always best at finding distantly related sequences for all gene or protein families, but dynamic programming is guaranteed to give optimal alignment for given scores

10

Seq

uen

ce d

atab

ase

sear

chin

g

Third source of previous knowledge

• Database itself– Large store of previously acquired knowledge– Making the best use of this knowledge can save

you many months of expensive laboratory experimentation

– The size of this potential gain is the determining factor in deciding how much effort to devote to any particular database search

11

Seq

uen

ce d

atab

ase

sear

chin

g

Database search assumptions

• The sequences sought have an evolutionary ancestral sequence in common with the “query sequence”

• All substitutions are not equally likely and should be weighted to account for this

• Insertions and deletions are less likely than substitutions and should be weighted to account for this.

12

Seq

uen

ce d

atab

ase

sear

chin

g

FASTA

• 2 step algorithm– (1) word search, using a specific word size, finds

regions with a high number of identical word matches

– (2) Smith-Waterman alignment centered on these regions and bounded by a window size which limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence

FA

ST

A

13

Seq

uen

ce d

atab

ase

sear

chin

g

FASTA Heuristics

• Heuristic approximation to Smith-Waterman– Runs faster– Loses some sensitivity

• Restrictions on the model of sequence evolution– First Heuristic - Word size parameter -usually 2 for

proteins and 6 for nucleic acids - FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides

FA

ST

A

14

Seq

uen

ce d

atab

ase

sear

chin

g

FASTA Algorithm

• divide query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids

• each sequence in the database is also broken up in the same way

• Two word lists are compared to find all identical words in both sequences

• An initial score is computed based on how many identities are based within a small region of the dot plot.

FA

ST

A

15

Seq

uen

ce d

atab

ase

sear

chin

g

FASTA Algorithm cont.

• If the initial score is high enough, a second score is computed by evaluating which initial identities can be joined into a consistent alignment using only gaps of less then the window size.

• If the secondary score is high enough, then a Smith-Waterman alignment is performed within the same region of the dot plot defined by the concentrated identities and using the same window-size.

• This third score is reported as the optimal score.

FA

ST

A

16

Seq

uen

ce d

atab

ase

sear

chin

g

Creating a Word List

g c t g g a a g g c a tg c t g g a

c t g g a at g g a a g

g g a a g gg a a g g c

a a g g c aa g g c a t

FA

ST

A

17

Seq

uen

ce d

atab

ase

sear

chin

g

First Pathological Example

• Two proteins that share 50% identity - but the proper alignment consists of alternating match and mismatches.

• With a word size of two, there would be no matches along the main diagonal of the dot plot and the proper alignment would not be found.

FA

ST

A

18

Seq

uen

ce d

atab

ase

sear

chin

g

Second Pathological Example

• Two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence.

• If the window size is 15, then the Smith-Waterman alignment phase of FASTA will align the protein to either the sequence prior to or following the insertion, thus missing the fact that the proteins were basically identical (with only one long insertion).

FA

ST

A

19

Seq

uen

ce d

atab

ase

sear

chin

g

Second Heuristic

• Window size is the second heuristic used by FASTA.• Its effect is more variable then word size.• If the best alignment, as defined by a full Smith-

Waterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This will lead users to conclude the sequences are not homologous when in fact they are and the homology could have been inferred from a full Smith-Waterman alignment.

• In practice these pathological cases are very unlikely. However similar cases do occur and loss of sensitivity caused by the use of these heuristics will be seen.

FA

ST

A

20

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST

• Approximates a simplification of Smith-Waterman known as the maximal segment pairs (MSP) algorithm.

• MSP alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation.

• MSP alignment’s statistics are well understood and so we can compute a significance probability.

BL

AS

T

21

Seq

uen

ce d

atab

ase

sear

chin

g

Significance Probabilities

• Thus the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complimentary pattern of insertions and deletions that do not significantly disrupt the alignment

• Recent advances in MSP statistics allow the use of several independent segment alignments to be used in evaluating significance probability.

BL

AS

T

22

Seq

uen

ce d

atab

ase

sear

chin

g

Brief Comparison of BLAST and FASTA

• BLAST is less sensitive than Smith-Waterman but therefore more selective.

• For proteins BLAST is more sensitive than FASTA even though BLAST uses a word size of 3 for proteins while FASTA uses a word size of 2.

• BLAST uses a word size of 11 for nucleic acids. The recent modifications which make it more sensitive for proteins do not seem to work for nucleic acids.

• So therefore FASTA should be used instead of BLAST when searching for nucleic acids.

BL

AS

T

23

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST Algorithm

• It creates a word list (same as FASTA).• Then it expands this list in order to recover sensitivity

lost by only using exact matches.• Any word that scores at least a minimum threshold

(T) when aligned with any of the initial list of words is added to the list.

• BLAST than examines the database for words that exactly match any word in the expanded word list.

• Equivalent to looking for gapless alignments of score at least T in the database.

BL

AS

T

24

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST example

• The example shows an expanded list of 47 words from the original 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table.

q l n f s a g wq l

l nn f

f ss a

a gg w

InitialWord

Expanded List

ql ql,qm,hl,zlln ln,lbnf nf,af,ny,df,qf,ef,gf,hf,kf,sf,tf,bf,zffs fs,fa,fn,fd,fg,fp,ft,fb,yssa no words score more than 8 including saag aggw gw,aw,rw,nw,dw,qw,ew,hw,iw,kw,mw,

pw,sw,tw,vw,bw,zw,xw

BL

AS

T

25

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST Paradox

• Notice that there is no word that scores 8 or more when aligned with the initial word “sa”, even the word “sa” itself.

• This situation does occur in actual BLAST searches.

• The user has the option to force the initial word into the final list.

BL

AS

T

26

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST default

• The default is to not include such low scoring words because they contain so little information that they are unlikely to be critical in finding a maximal segment pair alignment.

• BLAST has a word length of 3 for protein searches with a threshold score of T=13 using the Blosum62 similarity scoring matrix.

BL

AS

T

27

Seq

uen

ce d

atab

ase

sear

chin

g

Blast final step

• The occurrence of a word hit is followed by an attempt to find a locally optimal ungapped alignment.

• This is accomplished by accumulating the score as the alignment is extended in both directions.

• When a run of mostly negative scores is encountered, the cumulative score will drop substantially. When this happens it is unlikely that the score will rebound.

• This observation provides the basis for an additional heuristic whereby the extension of a hit is terminated when the reduction in score exceeds a dropoff threshold.

• The local alignment with the highest score is returned.

BL

AS

T

28

Seq

uen

ce d

atab

ase

sear

chin

g

Improvements to BLAST

• The growth of the sequence database has raised the minimum score and hence the length of alignment that must be found by BLAST for a match to be significant.

• Speed and sensitivity can be improved by requiring the algorithm find two matches above some (lower) threshold rather that one match. Both matches must be on the same diagonal.

BL

AS

T

29

Seq

uen

ce d

atab

ase

sear

chin

g

New BLAST settings

• The increase in speed results from fewer sequences which are completely evaluated.

• BLAST now looks for 2 words of length 3, that each score at least 11 using Blosum62. The matches must be within 40 amino acids on the same diagonal.

• As the database grows new techniques will need to be constantly devised.

BL

AS

T

30

Seq

uen

ce d

atab

ase

sear

chin

gB

LA

ST

Gapped BLAST

• Builds the alignment out from a central high scoring pair of aligned amino acids analogous to the way BLAST extends the initial maximal segment pair alignment.

• The initial pair of amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids.

• Smith-Waterman alignment is extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.

31

Seq

uen

ce d

atab

ase

sear

chin

g

Guarantees on Gapped Blast

• Will find the best scoring Smith-Waterman alignment if:– The calculation is extended until a score of 0 is

reached. Stopping earlier accepts a small risk of not finding the complete alignment in return for a very large savings in computer resources.

– The initial pair of amino acids selected as the midpoint must actually be part of the alignment the would be reported from a full Smith-Waterman alignment.

BL

AS

T

32

Seq

uen

ce d

atab

ase

sear

chin

g

BLAST Warning

• Before publishing an alignment: prudent to do a complete Smith-Waterman analysis.

• Further it is suggested to make use of the Waterman-Eggert extensions to Smith-Waterman (MaxSegs algorithm) in order to look at the best several independent local alignments and to examine each sequence for repeated motifs.

BL

AS

T

33

Seq

uen

ce d

atab

ase

sear

chin

g

Finding Distant Homologies

• Many functionally and evolutionary important protein similarities are recognizable only through comparison of three-dimensional structures.

• When not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities

• These patterns are called motifs, profiles, position-specific score matrices, and Hidden Markov Models.

BL

AS

T

34

Seq

uen

ce d

atab

ase

sear

chin

g

PSI-BLAST

• Position-Specific Iterative BLAST• Designed to detect weak relationships by using a profile

that is constructed automatically from the multiple alignment of the highest scoring hits in the initial BLAST search.

• The profile is created by calculating position-specific scores for ever amino acid at every position in the alignment.

BL

AS

T

35

Seq

uen

ce d

atab

ase

sear

chin

g

How it works

• If a residue is highly conserved at a particular position, it will receive a high score, and others will be assigned high negative scores.

• At weakly conserved positions all residues receive scores near 0.

• Position specific scores can also be assigned to potential insertions and deletions

BL

AS

T

36

Seq

uen

ce d

atab

ase

sear

chin

g

Iteration

• The power of profile methods can be further enhanced through iteration of the search procedure.

• After a profile is run against a database, new similar sequences can be detected. In each iteration:– A new multiple alignment, which includes these new

sequences can be constructed. – A new profile abstracted. – A new database search performed.

• The procedure can be iterated as often as desired or until convergence (when no new statistically significant sequences are detected).

BL

AS

T

37

Seq

uen

ce d

atab

ase

sear

chin

g

Design Goals of PSI-BLAST

• Speed, simplicity, automatic operation• Unlike most profile-based search methods, PSI-

BLAST runs one program starting with a single protein sequence as input, and the intermediate steps of multiple alignment and profile construction are invisible to the user.

BL

AS

T

38

Seq

uen

ce d

atab

ase

sear

chin

g

PSI-BLAST Details

• It uses the gapped BLAST program for the database searches. A PSI-BLAST query is identical to a Gapped BLAST query with the addition of an expectation value cut-off for inclusion of a match in an iteration.

• The E-value cut-off can be over-ridden by the user on a case-by-case basis if a sequence hit of interest is worse then the threshold. (default is 0.001)

• The multiple alignment and profile will have lengths identical to that of the query

BL

AS

T

39

Seq

uen

ce d

atab

ase

sear

chin

g

Notes on using PSI-BLAST

• The WWW version requires the user to decide after each iteration whether to continue. It has the advantage that the user can hand-pick sequences used for each profile construction, regardless of E-value, by checking boxes next to the sequences descriptions.

• A stand-alone version of PSI-BLAST, obtainable from NCBI, allows the user to run the program for a chosen number of iterations or until convergence.

• This version also allows the user to save the profile produced and use it subsequently to search another database.

BL

AS

T

40

Seq

uen

ce d

atab

ase

sear

chin

g

Warnings on using PSI-BLAST

• PSI-BLAST is a powerful tool and it requires caution.

• The sources of error are the same as for standard BLAST, but are easily amplified by iteration!B

LA

ST

41

Seq

uen

ce d

atab

ase

sear

chin

g

Sources of Errors

• The major source of deceptive alignments is the presence within proteins of regions with highly biased amino acid composition - low complexity.

• If such a region is included during production of a profile, otherwise unrelated sequences containing similarly biased regions will creep into subsequent iterations, rendering the search nearly worthless.

BL

AS

T

42

Seq

uen

ce d

atab

ase

sear

chin

g

How to stop bias

• PSI-BLAST filters out biased regions of query sequences by default, using the SEG program.

• SEG parameters are set to avoid masking potentially important regions, so some bias may still persist. So PSI-BLAST can still generate compositionally rooted artifacts.

• These cases can usually be identified by inspection - especially when sequences that have a known bias, such as myosins or collagens, are retrieved.

• SEG can also be set to eliminate nearly all biased regions, or filtering procedures, such as COILS, can be used before submitting the appropriately masked sequence to PSI-BLAST.

BL

AS

T

43

Seq

uen

ce d

atab

ase

sear

chin

g

PHI-BLAST

• Pattern Hit Initiated BLAST searches for particular patterns in protein queries

• It takes a protein query and a pattern contained in that sequence as input.

• It searches the database for protein sequences that– contain the input pattern and also – have significant similarity to the query sequence in the region

of the pattern occurrences.

BL

AS

T

44

Seq

uen

ce d

atab

ase

sear

chin

g

PHI-BLAST and PSI-BLAST

• The statistical significance of PHI-BLAST is reported using E-values like other forms of BLAST, but the method for computing the E-values is different.

• PHI-BLAST is integrated with PSI-BLAST so the results of a PHI-BLAST can be used to initiate one or more iterations of PSI-BLAST searching.

• PHI-BLAST is under development and may change substantially over time.

BL

AS

T

45

Seq

uen

ce d

atab

ase

sear

chin

g

Which Similarity Matrix to Use?

• Database searches or sequence alignments perform much better if the similarity matrix is based on replacement patterns that correspond to the degree of divergence of the sequences being aligned or discovered.

• In database searching, a PAM or Blosum matrix corresponding to an inappropriate degree of divergence can cause you to fail to discover homologous sequences that are present in the database.

• Therefore a thorough database search will involve using at least 2 and most likely 3 different matrices.

• Using different matrices usually has a higher payoff than using different programs and search algorithms.

Sim

ilari

ty M

atri

ces

46

Seq

uen

ce d

atab

ase

sear

chin

g

Comparable Blosum and PAM Matrices

BlosumMatrix

Entropy PAMMatrix

Entropy Sequence IdentityPAM Matrix

Blosum 90 1.18 PAM 100 1.18 43Blosum 80 0.99 PAM 120 0.98 38Blosum 60 0.66 PAM 160 0.70 30Blosum 52 0.52 PAM 200 0.51 25Blosum 45 0.38 PAM 250 0.36 20

Sim

ilari

ty M

atri

ces

47

Seq

uen

ce d

atab

ase

sear

chin

g

What the Comparability Table Means

• The comparability is based on matrix entropy. Entropy is defined by information theory as the average amount of information per position in a sequence alignment that is available to determine whether or not a sequence is homologous.

• This amount of information is available only if the matrix used in the database search is matched for the appropriate degree of sequence divergence.

• As will be shown later this can be used to get a rough indication of whether or not a specific database search result is significant.

Sim

ilari

ty M

atri

ces

48

Seq

uen

ce d

atab

ase

sear

chin

g

Scores for nucleic acids

• If possible use an amino acid sequence for a database search because:– There is redundancy in the genetic code with up to 6 codons

translated as the same amino acid so there is more information in amino acid sequences once the sequences have diverged beyond about 50 PAMs (~60% identical)

– Compositional bias found in many organisms and organelles– Some nucleic acid sequences are derived from messengers

while others are genomic DNA with exons and the introns may be too short to give a significant alignment with a messenger derived sequence.

Sim

ilari

ty M

atri

ces

49

Seq

uen

ce d

atab

ase

sear

chin

g

Search with Nucleic Acids

• There are circumstances when there is no choice but to search with nucleic acid sequences. (for 98.9% of the human genome!)

• BLAST uses a very long word size, 11, for nucleic acids and the modifications to the heuristic to improve sensitivity for protein sequences do not work as well for nucleic acids because they have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides.

• Thus FASTA is more sensitive than BLAST for nucleic acid sequences and should be used instead of BLAST if you want to use one of the faster searching programs.

Sim

ilari

ty M

atri

ces

50

Seq

uen

ce d

atab

ase

sear

chin

g

Nucleic Acid Matrices

• There are matrix of replacements for nucleic acids just as has been recommended for proteins.

• Commonly used are PAM 47 scores assuming equal rates of transitions and transversions. This assumption leaves us with only two scores, 5 for identities or matches and -4 for nonidentities or mismatches.

• It is possible to create nucleic acid scores that do not assume equal rates of transitions and transversions.– For example assumes a three to one transition to transversion

ratio might be more appropriate than the defaults

Sim

ilari

ty M

atri

ces

51

Seq

uen

ce d

atab

ase

sear

chin

g

Scoring Insertions and Deletions

• In most alignment and search programs, the gap penalty consists of two terms, the cost to open the gap and the cost to extend the gap.

• This form of gap penalty has been shown to give better results in searching and alignments then a form of penalty that includes only one of the terms.

• The selection of appropriate scores for insertions and deletions, the gap penalties, is as important as selecting the similarity scores for the success of a database search.

Inse

rtio

ns

and

del

etio

ns

52

Seq

uen

ce d

atab

ase

sear

chin

g

Purpose of Gap Penalties

• The intuitive choice of gap penalties is to use one designed to create the most accurate possible alignment between your query sequence and any homologues in the database.

• So we may need to accommodate several small gaps in highly variable regions of the sequence which implies small penalties for opening a gap.

• But more knowledge, experience, and thought about the experiment being performed in a database search shows that the most accurate alignment is not the correct goal.

• A little thought shows that the appropriate goal is to distinguish any homologous sequences from the vast majority of nonhomolgous sequences in the database.

Inse

rtio

ns

and

del

etio

ns

53

Seq

uen

ce d

atab

ase

sear

chin

g

Expected Maximum Scores

• The score for homologous sequences will be dominated by relatively long sections of the alignment uninterrupted by gaps. Thus changing the gap penalty will only have a minor effect on the score for an alignment.

• Gap penalties, however, are critical in determining the expected maximum score for random sequences being compared with your query sequence.

Inse

rtio

ns

and

del

etio

ns

54

Seq

uen

ce d

atab

ase

sear

chin

g

High Non-homologous Scores

• The number of nonhomologous sequences that achieve a particular score decreases exponentially as the score increases as long as the gap penalties are sufficiently large relative to the similarity scores used to compare the sequences.

• If the gap penalties are too small relative to the similarity scores the number of sequences achieving a particular score decrease as the square of the number of sequences.

• This leads to an increase in the number of relatively high scoring non-homologous sequences.

Inse

rtio

ns

and

del

etio

ns

55

Seq

uen

ce d

atab

ase

sear

chin

g

Decay: Exponential versus Square

0

100

200

300

400

500

600

700

10 30 50 70 90

ExponentialSquare

Inse

rtio

ns

and

del

etio

ns

56

Seq

uen

ce d

atab

ase

sear

chin

g

Obscure Homologous Sequences

• Unfortunately there is no analogous effect on the scores of the homologous sequences.

• Thus choosing gap penalties that are too small leads to matches with homologous sequences being obscured by spurious matches with nonhomologous sequences.

Inse

rtio

ns

and

del

etio

ns

57

Seq

uen

ce d

atab

ase

sear

chin

g

Length of Gaps

• While we want to force alignments to have relatively few gaps we need to allow them to be fairly long.

• This allows a few gaps to accommodate a larger number of insertion and deletion events in the evolutionary history of the sequences.

• So we should match a large penalty for opening a new gap with a small penalty for lengthening or extending a gap.

• In this context “large” will typically be two to three times the largest negative value in the similarity matrix and at least as large as the largest positive value.

• For an integer similarity matrix the gap extension penalty will typically be 1-4 and should not be 0.

Inse

rtio

ns

and

del

etio

ns

58

Seq

uen

ce d

atab

ase

sear

chin

g

Statistical Significance

• A Significance probability is an answer to the question:– How often will an event at least as extreme as this one just

observed happen if these events are the result of a well defined, specific, random process?

• For database searches the random process is evolution from independent, unrelated ancestral sequences rather than a common ancestral sequence.

• With respect to any newly determined query sequence most of the sequences in the database are such randomly evolved sequences.

Inse

rtio

ns

and

del

etio

ns

59

Seq

uen

ce d

atab

ase

sear

chin

g

Random Sequences

• We expect the scores for random sequences aligned with our query sequence to decay exponentially.

• For a group of sequences that achieve at least some minimum score we expect only some constant fraction of them to achieve a given higher score.

• If the gap penalty is high enough this exponential decay is observed in practice.

Inse

rtio

ns

and

del

etio

ns

60

Seq

uen

ce d

atab

ase

sear

chin

g

Predicting Good Scores

• This exponential decay process gives a straight line when we plot the log of the number of sequences that achieve a particular score against the score. Extrapolating this line to its intercept with the score axis gives us a score higher than we might reasonably expect to see from a group of random sequences the size of the database.

• Thus any appreciably higher score is a statistically rare event deserving careful evaluation.

Inse

rtio

ns

and

del

etio

ns

61

Seq

uen

ce d

atab

ase

sear

chin

g

Problems if Gap to low

• Empirically this depends on the gap penalty being high enough.

• If the gap penalty is too low, the scores will decay as the square of the number of sequences achieving at least a given score.

• This slower decay obscures most homologous matches (as discussed before) and also invalidates the fitting of the scores to a logarithmic function and yields a much too high extrapolated score.

Inse

rtio

ns

and

del

etio

ns

62

Seq

uen

ce d

atab

ase

sear

chin

g

Difference between Logs

0

1

2

3

4

5

6

7

8

9

10

10 20 30 40 50 60 70 80 90 100

Log(Exponential)

Log(Square)

Inse

rtio

ns

and

del

etio

ns

63

Seq

uen

ce d

atab

ase

sear

chin

g

Information Theory

• Every alignment has a certain amount of information per position about whether or not the aligned sequences are homologous. This is called entropy.

• Every similarity matrix has an associated amount of entropy.

• This amount of entropy is a maximum and is achieved only in alignments between sequences that have diverged to the degree appropriate to the matrix.

Sta

tist

ical

tes

tin

g

64

Seq

uen

ce d

atab

ase

sear

chin

g

Entropy

• It is currently impossible to assign an entropy value to alignment gaps corresponding to insertions or deletions.

• Thus entropy can be computed only for regions of the alignment uninterrupted by gaps.

• The amount of information (or entropy) associated with an uninterrupted region of the alignment is given by the length of the alignment in sequence residues multiplied by the entropy value for the similarity matrix.

• It is measured in units called bits, which is the amount of information need to answer a yes no question.

Sta

tist

ical

tes

tin

g

65

Seq

uen

ce d

atab

ase

sear

chin

g

Entropy of Random Sequences

• In order to make use of the information of an alignment we need to know how much information we expect to find in alignments between random sequences.

• The answer obviously depends on:– Length of the query sequence– Size of the database

• Without going into the details of the derivation, this is the amount of information required to repeatedly divide the database into the remaining half containing the best scoring segment.

Sta

tist

ical

tes

tin

g

66

Seq

uen

ce d

atab

ase

sear

chin

g

Test for Statistical Significance

• S ≈ log2(mn)– The minimum amount of information (in bits) for a significant

match is given by the logarithm (base 2) of the product of the length of the query sequence and the number of residues in the database.

• If there is more than this amount of information in the alignment it is statistically significant and should be investigated further.

• This is the basis of the statistical significance tests in BLAST - now used by the other algorithms as well.

• E = mn 2-S – is the E-value formula where m is the length of the alignment,

n is the size of the data base and S is the bit score of entropy.

Sta

tist

ical

tes

tin

g

67

Seq

uen

ce d

atab

ase

sear

chin

g

E Value

• The E-value is the chance number of alignments with a certain score, s. – E=Kmn e-s

– where s is the similarity score, m and n are as above, and K and can be thought of as natural scales for the search space size and scoring system respectively.

• This makes intuitive sense since doubling either the size of the database or the size of the alignment should double the number of random strings attaining that score. Also for us to attain a score twice as good, 2x, it must attain the score x twice in a row, so one expects E to decrease exponentially with score.

Sta

tist

ical

tes

tin

g

68

Seq

uen

ce d

atab

ase

sear

chin

g

Converting raw scores to bit scores

• S is bit score, s is alignment score , K (scale and translation parameters)

– Can be estimated empirically for a given database and score matrix (even including gaps) or calculated exactly assuming a random ungapped model and background amino acid frequencies

€

S =λs− lnK

ln2

Sta

tist

ical

tes

tin

g

69

Seq

uen

ce d

atab

ase

sear

chin

g

Usefulness of Statistical Measures

• The usefulness of statistical measures is dependent on:– Whether the assumptions of the underlying statistical model

are correct– The kinds of errors that one is willing to accept when using the

measure to draw a conclusion

• The threshold for statistical significance will vary depending on which we are more concerned with:– Labeling a sequence as related when it is not - misidentifying a

non-homologue (False Positive or Type I error)– Labeling a sequence as non-homologous when a high scoring

homologue has been found - missing a likely homologue (False Negative or Type II error)

Sta

tist

ical

tes

tin

g

70

Seq

uen

ce d

atab

ase

sear

chin

g

Problems with Statistical Models

• However, with biological sequences (as opposed to “fair” coins), the assumptions underlying the statistical model may not be met.

• In general inaccurate statistical estimates are cause either by :– Incorrect gap penalties– Low complexity regions - runs of simple amino

acid composition

Sta

tist

ical

tes

tin

g

71

Seq

uen

ce d

atab

ase

sear

chin

g

Additional Statistical Checks

• Prss3 in the FASTA collection evaluates the significance of a sequence similarity score by comparing two sequences and calculating optimal scores and then repeatedly shuffling the second sequence and getting optimal similarity scores using the Smith-Waterman algorithm. You must use 500-2000 shuffles and remember to normalize the statistical significance to the size of the database originally searched. - permutation test, randomization testing

• Randseq also in FASTA produces a random sequence with the same length and amino acid composition as a query sequence. In general in a database search, the highest scoring match to a random query sequence should have an E value of 1.

Sta

tist

ical

tes

tin

g

72

Seq

uen

ce d

atab

ase

sear

chin

g

References

• http://www.psc.edu/biomed/TUTORIALS/SEQUENCE/DBSEARCH/tutorial.html

• http://www.ncbi.nlm.nih.gov/BLAST/• Flexible Sequence Similarity Searching with the

FASTA3 program package, William Pearson• Bioinformatics A Practical Guide the Analysis of Genes

and Proteins Baxevanis & Ouellette

blast and searching sequence databases dr alexei drummond department of computer science...

Documents