Spring 2007 Bioinformatics Ch. 2 - Sequence Alignment

Post on 21-Dec-2015



C  12
G  -3   5
P  -3  -1   6
S   0   1   1   1
A  -2   1   1   1   2
T  -2   0   0   1   1   3
D  -5   1  -1   0   0   0   4
E  -5   0  -1   0   0   0   3   4
N  -4   0  -1   1   0   0   2   1   2
Q  -5  -1   0  -1   0  -1   2   2   1   4
H  -3  -2   0  -1  -1  -1   1   1   2   3   6
K  -5  -2  -1   0  -1   0   0   0   1   1   0   5
R  -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V  -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M  -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I  -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L  -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F  -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y   0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W  -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
    C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

PAM 250 Matrix
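
A lower-triangular matrix like this one is symmetric: the score for a pair (a, b) is the same as for (b, a). A minimal sketch of looking up scores, using only the first six rows (C, G, P, S, A, T) copied from the table above. The names ORDER, ROWS, pam_score and ungapped_score are hypothetical, chosen for illustration.

```python
# First six rows of the lower-triangular PAM 250 table shown above.
ORDER = "CGPSAT"
ROWS = [
    [12],
    [-3, 5],
    [-3, -1, 6],
    [0, 1, 1, 1],
    [-2, 1, 1, 1, 2],
    [-2, 0, 0, 1, 1, 3],
]
INDEX = {aa: i for i, aa in enumerate(ORDER)}


def pam_score(a, b):
    """Look up the substitution score for residues a and b."""
    i, j = INDEX[a], INDEX[b]
    if i < j:
        # Only the lower triangle is stored, so swap to keep row >= column.
        i, j = j, i
    return ROWS[i][j]


def ungapped_score(s, t):
    """Sum the pairwise scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(pam_score(a, b) for a, b in zip(s, t))
```

For example, ungapped_score("CAT", "GAT") adds the C/G, A/A and T/T entries of the table.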


BLOSUM 62 Matrix


SSU Secondary Structure


Cytochrome C


~82,000,000 DNA sequences as of April 2008


When is a database hit significant?

• Problem:

– Even unrelated sequences can be aligned (yielding a low score)

– How do we know if a database hit is meaningful?

– When is an alignment score sufficiently high?

• Solution:

– Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).

– Compare actual scores to the distribution of random scores.

– Is the real score much higher than you’d expect by chance?
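
The idea above can be sketched in a few lines. This is an illustration under stated assumptions, not the method any particular tool uses: "unrelated sequences" are approximated by shuffling the subject sequence, alignment is a plain Smith-Waterman local alignment with hypothetical match/mismatch/gap values of 1, -1 and -2, and all function names are made up for this example.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_scores(query, subject, n=200, seed=0):
    """Scores of the query against n shuffled copies of the subject."""
    rng = random.Random(seed)
    letters = list(subject)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(local_alignment_score(query, "".join(letters)))
    return scores


def empirical_p_value(real_score, random_scores):
    """Fraction of random scores at least as high as the real score.

    The +1 in numerator and denominator avoids reporting exactly zero.
    """
    hits = sum(s >= real_score for s in random_scores)
    return (hits + 1) / (len(random_scores) + 1)
```

A real score that beats nearly all shuffled scores gives a small empirical p-value. A score in the middle of the random distribution gives no evidence that the hit is meaningful.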


Random alignment scores follow extreme value distributions

The exact shape and location of the distribution depend on the exact nature of the database and the query sequence.

Searching a database of unrelated sequences results in scores following an extreme value distribution.

[Figure: No. of Sequences vs. Alignment Score]
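
One common extreme value distribution is the Gumbel distribution. The slides do not say which form is meant, so treat this as an assumption. The sketch below uses a location parameter mu and a scale parameter beta (both hypothetical values) to show the characteristic long right tail: high scores are far more likely than equally distant low scores.

```python
import math


def gumbel_cdf(x, mu, beta):
    """P(score <= x) for a Gumbel distribution with location mu, scale beta."""
    z = (x - mu) / beta
    return math.exp(-math.exp(-z))


def gumbel_pdf(x, mu, beta):
    """Density of the same distribution; it peaks at x == mu."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta


# Hypothetical parameters, for illustration only.
mu, beta = 40.0, 5.0

# Probability mass more than 3 scale units above vs. below the peak.
upper_tail = 1.0 - gumbel_cdf(mu + 3 * beta, mu, beta)
lower_tail = gumbel_cdf(mu - 3 * beta, mu, beta)
```

Here upper_tail is several orders of magnitude larger than lower_tail, which is why a normal distribution would underestimate how often high scores occur by chance.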


Significance of a hit: one possible solution

(1) Align query sequence to all sequences in database, note scores

(2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution

(3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the “E-value”)

[Figure: No. of Sequences vs. Alignment Score]
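
A simplified sketch of step (3). It does not implement the two-component mixture fit from step (2). Instead it assumes the scores passed in are already the random (unrelated) hits and fits a single Gumbel distribution by the method of moments. The function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) from sample moments.

    Uses mean = mu + gamma * beta and stdev = beta * pi / sqrt(6).
    """
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def e_value(score, mu, beta, database_size):
    """Expected number of random hits scoring at least `score`."""
    # 1 - exp(-t), written with expm1 to stay accurate for tiny tails.
    p_tail = -math.expm1(-math.exp(-(score - mu) / beta))
    return database_size * p_tail
```

The E-value falls quickly as the score moves to the right of mu, so a hit far out in the tail is expected less than once per search.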


Significance of a hit: example

Search against a database of 10,000 sequences.

An extreme-value distribution (blue) is fitted to the distribution of all scores.

It is found that 99.9% of the blue distribution has a score below 112.

This means that when searching a database of 10,000 sequences you’d expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons.

10 is the E-value of a hit with score 112. You want E-values well below 1!

[Figure: No. of Sequences vs. Alignment Score]
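
The arithmetic in this example, written out as a short check. The function name is hypothetical; the numbers are the ones from the slide.

```python
def expected_random_hits(fraction_below, database_size):
    """Expected number of random hits at or above a score threshold.

    fraction_below is the share of the fitted random-score distribution
    that lies below the threshold (0.999 for a score of 112 here).
    """
    return (1.0 - fraction_below) * database_size


# 99.9% of random scores fall below 112; the database has 10,000 sequences.
e_value = expected_random_hits(0.999, 10_000)  # approximately 10
```

With the same threshold, a database ten times larger would give an E-value ten times higher, so the same score becomes less convincing as the database grows.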


Example of BLAST E-values

agtgaagtacgtgcgttaatgcgatgagtacggtaaaaagaccggcgtgctttatgggtgagcgggagtttgtgccagcgaagcgtccttggacttagagagtgtcgggttcgggacgtccggctacagaatagtaaa

• Semi-random sequence

BLAST these sequences

agcggaccggtacttaagcgcggaccggcgtgtccttggacttagagagtggggacgtccggcttcggagcgggagtgttcgttgtgccagcgactaaaaagagaattaaatatgggtga

• Non-random sequence
