Spring 2007 Bioinformatics, Ch. 2 - Sequence Alignment
Post on 21-Dec-2015
C  12
G  -3   5
P  -3  -1   6
S   0   1   1   1
A  -2   1   1   1   2
T  -2   0   0   1   1   3
D  -5   1  -1   0   0   0   4
E  -5   0  -1   0   0   0   3   4
N  -4   0  -1   1   0   0   2   1   2
Q  -5  -1   0  -1   0  -1   2   2   1   4
H  -3  -2   0  -1  -1  -1   1   1   2   3   6
K  -5  -2  -1   0  -1   0   0   0   1   1   0   5
R  -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V  -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M  -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I  -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L  -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F  -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y   0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W  -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
    C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W
PAM 250 Matrix
BLOSUM62 Matrix
SSU Secondary Structure
Cytochrome C
~82,000,000 DNA sequences as of April 2008
When is a database hit significant?
• Problem:
– Even unrelated sequences can be aligned (yielding a low score)
– How do we know if a database hit is meaningful?
– When is an alignment score sufficiently high?
• Solution:
– Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).
– Compare actual scores to the distribution of random scores.
– Is the real score much higher than you’d expect by chance?
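The solution above can be sketched in Python. This is only a minimal illustration, not the method of any particular search tool: the function names, the simple match/mismatch/gap scores, and the use of shuffled copies of the subject sequence as stand-ins for unrelated sequences are all assumptions made for the example.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (score only, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled(seq, rng):
    """Return a shuffled copy of seq (same composition, order destroyed)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def random_score_distribution(query, subject, n_shuffles=200, seed=0):
    """Scores of the query against shuffled versions of the subject."""
    rng = random.Random(seed)
    return [local_alignment_score(query, shuffled(subject, rng))
            for _ in range(n_shuffles)]


def empirical_p_value(real_score, random_scores):
    """Fraction of random scores at least as high as the real score.

    The +1 in numerator and denominator avoids reporting exactly zero.
    """
    hits = sum(s >= real_score for s in random_scores)
    return (hits + 1) / (len(random_scores) + 1)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    query = "ACGTACGTTAGCTAGGCTA"
    subject = "TTACGTACGTTAGCAAGG"
    real = local_alignment_score(query, subject)
    null = random_score_distribution(query, subject)
    print(real, empirical_p_value(real, null))
```

A small empirical p-value means the real score sits far out in the tail of the random scores, which is the "much higher than you'd expect by chance" criterion.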
Random alignment scores follow extreme value distributions
The exact shape and location of the distribution depend on the database and the query sequence.
Searching a database of unrelated sequences results in scores that follow an extreme value distribution.
[Figure: number of sequences (y-axis) versus alignment score (x-axis)]
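For reference, the extreme value distribution usually meant in this setting is the Gumbel (type I) form. The symbols below are introduced here and do not appear on the slides: S is a random alignment score, \mu is the location parameter and \beta > 0 is the scale parameter.

```latex
\[
P(S \le x) = \exp\!\left(-e^{-(x-\mu)/\beta}\right),
\qquad
P(S > x) = 1 - \exp\!\left(-e^{-(x-\mu)/\beta}\right)
\]
```

The right-hand tail P(S > x) is the quantity that matters for significance: it falls off roughly exponentially as x grows, more slowly than the tail of a normal distribution with the same mean and variance.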
Significance of a hit: one possible solution
(1) Align query sequence to all sequences in database, note scores
(2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution
(3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the “E-value”)
[Figure: number of sequences (y-axis) versus alignment score (x-axis)]
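Steps (1) to (3) can be sketched as follows. This is a simplified illustration under stated assumptions: it fits only a Gumbel extreme value distribution, using the method of moments, and leaves out the normal component of the mixture described in step (2). All function names are hypothetical, and the demo scores are synthetic rather than real search results.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(score, mu, beta):
    """P(random score > score) under a Gumbel(mu, beta) distribution."""
    z = (score - mu) / beta
    if z < -30:  # tail equals 1.0 to double precision; avoids overflow in exp
        return 1.0
    return -math.expm1(-math.exp(-z))


def e_value(score, mu, beta, db_size):
    """Expected number of random hits scoring above `score` in db_size comparisons."""
    return db_size * gumbel_tail(score, mu, beta)


def sample_gumbel(mu, beta, rng):
    """Draw one synthetic Gumbel score by inverse-CDF sampling (demo helper)."""
    u = rng.uniform(1e-12, 1 - 1e-12)
    return mu - beta * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "database search" scores, invented for illustration only.
    scores = [sample_gumbel(50.0, 5.0, rng) for _ in range(10_000)]
    mu, beta = fit_gumbel_moments(scores)
    for s in (70, 90, 110):
        print(s, e_value(s, mu, beta, db_size=len(scores)))
```

In a real search the score list also contains true hits, which is why the slide fits a mixture; fitting a single Gumbel to everything, as done here, would be biased by those high-scoring true matches.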
Significance of a hit: example
Search against a database of 10,000 sequences.
An extreme-value distribution (blue) is fitted to the distribution of all scores.
It is found that 99.9% of the blue distribution has a score below 112.
This means that when searching a database of 10,000 sequences you’d expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better by chance alone.
10 is the E-value of a hit with score 112. You want E-values well below 1!
[Figure: number of sequences (y-axis) versus alignment score (x-axis)]
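The arithmetic of this example can be written out in a few lines of Python. The function names are hypothetical. The conversion from an E-value to a probability assumes that the number of random hits is Poisson distributed, which is a common assumption but is not stated on the slides.

```python
import math


def expected_random_hits(fraction_below_cutoff, db_size):
    """E-value: expected number of random hits at or above the score cutoff."""
    return (1.0 - fraction_below_cutoff) * db_size


def p_value_from_e(e):
    """P(at least one random hit), assuming a Poisson number of random hits."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # Numbers from the example: 99.9% of random scores fall below 112,
    # and the database holds 10,000 sequences.
    e = expected_random_hits(0.999, 10_000)
    print(e, p_value_from_e(e))
```

With an E-value near 10, the chance of seeing at least one such hit by chance is close to 1, which is why a score of 112 is not convincing here and why E-values well below 1 are wanted.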
Example of BLAST E-values
agtgaagtacgtgcgttaatgcgatgagtacggtaaaaagaccggcgtgctttatgggtgagcgggagtttgtgccagcgaagcgtccttggacttagagagtgtcgggttcgggacgtccggctacagaatagtaaa
• Semi-random sequence
BLAST these sequences
agcggaccggtacttaagcgcggaccggcgtgtccttggacttagagagtggggacgtccggcttcggagcgggagtgttcgttgtgccagcgactaaaaagagaattaaatatgggtga
• Non-random sequence