sequence alignment i dot plots dynamic programming

111
Sequence Alignment I Dot plots Dynamic Programming

Post on 19-Dec-2015

264 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Sequence Alignment I Dot plots Dynamic Programming

Sequence Alignment I

Dot plots

Dynamic Programming

Page 2: Sequence Alignment I Dot plots Dynamic Programming

Why align sequences?

conserved sequencesconserved function Assess ancestry among homologs

(sequences with common ancestory) to help in gene finding and annotation

Find consensus motifs among related sequences (e.g., regulatory and structural regions)

Estimate the rate of evolution

Page 3: Sequence Alignment I Dot plots Dynamic Programming

Definitions

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 4: Sequence Alignment I Dot plots Dynamic Programming

Definitions

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 5: Sequence Alignment I Dot plots Dynamic Programming

Problem

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 6: Sequence Alignment I Dot plots Dynamic Programming

Are the sequences related?

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 7: Sequence Alignment I Dot plots Dynamic Programming

Types of Alignment

Local Global Gap-free Gapped

Page 8: Sequence Alignment I Dot plots Dynamic Programming

Methods of Alignment

Dot matrix Dynamic programming K-tuple

Page 9: Sequence Alignment I Dot plots Dynamic Programming

Dot plots

To evaluate/visualize similarity between two sequences

Create a matrix when sequence 1 is a row vector and sequence 2 is a column vector.

Page 10: Sequence Alignment I Dot plots Dynamic Programming

Dot plot (identity matrix)

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 11: Sequence Alignment I Dot plots Dynamic Programming

Self-alignment

Page 12: Sequence Alignment I Dot plots Dynamic Programming

Self-alignment with sliding window

Page 13: Sequence Alignment I Dot plots Dynamic Programming

Self-alignment with sliding window

Page 14: Sequence Alignment I Dot plots Dynamic Programming

Dotmatrix

A T C A A

A 1 0 0 1 1

T 0 1 0 0 0

C 0 0 1 0 0

G 0 0 0 0 0

A 1 0 0 1 1

Seq1 = ‘ATCAA’Seq2 = ‘ATCGA’

Page 15: Sequence Alignment I Dot plots Dynamic Programming

Function dotplot1.m

function [dotmatrix]=dotplot1(seq1,seq2) OUTPUT: [dotmatrix] is an n x m matrix with

1s and 0s, where 0 means a mismatch and 1 means a match between two nucleotides

INPUT: seq1 and seq2 are strings made of a,t,c, or g. (seq1 is a row vector 1 x m; and seq2 is a column vector (n x1).

Page 16: Sequence Alignment I Dot plots Dynamic Programming

Run dotplot1.m

Place dotplot1.m into your directory. Open a Matlab session. Open the file named dotplot1.m under the file

option. Define seq1 and seq2 as the following; and run the

program:>> seq1 = ‘attataagg’>> seq2 = ‘attataggg’>> [dotmatrix] = dotplot1(seq1,seq2)

Next, copy and paste each line of the dotplot1.m on the command window without using ; to track what the program is doing.

Page 17: Sequence Alignment I Dot plots Dynamic Programming

Dot plot

Seq1='attataagg‘Seq2='attataggg'

A match is red (=1); a mismatch is blue (=0)

Page 18: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windows

Sliding windows consider more than one position at a time.

Similarity cutoffs (threshold) also allow to get rid of positions with low similarity across the window.

Sliding window can reduce the noise in the dot plot.

Page 19: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windows

A T C A A

A 1 0 0 1 1

T 0 1 0 0 0

C 0 0 1 0 0

G 0 0 0 0 0

A 1 0 0 1 1

A T C A A

A 3 0 0 0 0

T 0 3 0 0 0

C 0 0 3 0 0

G 0 0 0 0 0

A 0 0 0 0 0

Use a sliding window of size w, for example w =3, such that only positions with w number of consecutive matches along the diagonal count.

Page 20: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windows

3 3 0

Page 21: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windows

20

3 0 0 3 0 0 3 0 0

0

Page 22: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windows

A T C A A

A 3 0 0 0 0

T 0 2 0 0 0

C 0 0 2 0 0

G 0 0 0 0 0

A 0 0 0 0 0

Page 23: Sequence Alignment I Dot plots Dynamic Programming

Function dotplot2.m

function [dotmatrix,dot]=dotplot1(seq1,seq2,w,t) Introduced two variables:

W = size of the sliding window T = threshold, number of matches along the

diagonal to assume a match.

Page 24: Sequence Alignment I Dot plots Dynamic Programming

Function dotplot2.mWorksheet (due end of lecture)

Examine the code in dotplot2.m to see what commands have been used to slide the window and count the number of consecutive matches. Briefly write down your assessment.

Type in different sequences for seq1 and seq2 (no longer than 30) and try two different w and t values to run dotplot2.m

Page 25: Sequence Alignment I Dot plots Dynamic Programming

Dot plot with sliding windowsSeq1='attataagg‘Seq2='attataggg'

A match is red; a mismatch is blue

Sliding window size = 3Threshold is 3

Sliding window size = 3Threshold is 2

Sliding window size = 1Threshold is 1

Seq1='attataagg’

Seq2='attataggg'

Seq1='attataagg’

Seq2='attataggg'

Page 26: Sequence Alignment I Dot plots Dynamic Programming

Alignment

Is a pairwise match between the characters of each sequence.

A true alignment reflects evolutionarily common ancestry (homology).

Page 27: Sequence Alignment I Dot plots Dynamic Programming

Changes that occur in sequences

A mutation that replaces one character with another is a substitution.

An insertion that adds one or more positions and a deletion that deletes one or more positions are known as indels (gaps)

Page 28: Sequence Alignment I Dot plots Dynamic Programming

Alignment example:

AATCTATA AAGATA

AATCTATA AAGATA

AATCTATA AAGATA

Page 29: Sequence Alignment I Dot plots Dynamic Programming

Gap-free alignment: match and mismatch

An alignment receives for each aligned pair of identical residues (the match score) and the penalty for aligned pair of nonidentical residues (mismatch score).

match score; if seq1=seq2mismatch score; if seq1seq2

n

where n is the length of the longer sequence.

Page 30: Sequence Alignment I Dot plots Dynamic Programming

Gap-free alignment example:

AATCTATA AAGATA

AATCTATA AAGATA

AATCTATA AAGATA

Alignment scores would be 4, 1, 3, respectively; if the match score is 1 and mismatch score is 0.

Page 31: Sequence Alignment I Dot plots Dynamic Programming

Gaps

Indels complicate alignments by increasing the number of possible alignments between two or more sequences.

Page 32: Sequence Alignment I Dot plots Dynamic Programming

Alignment: match, mismatch, and gap penalty

An alignment receives a score for each aligned pair of identical residues (the match score) and the penalty for aligned pair of nonidentical residues (mismatch score), and a penalty for insertion of gaps.

match score; if no gaps and seq1=seq2mismatch score; if no gaps and seq1seq2

n

where n is the length of the longer sequence.

gap penalty; if seq1 = ‘-’ or seq2 = ‘-’

Page 33: Sequence Alignment I Dot plots Dynamic Programming

Gap-free alignment example:

AATCTATA AAG-AT-A

AATCTATA AA-G-ATA

AATCTATA AA--GATA

Alignment scores would be 1, 3, 3, respectively; if the match score is 1; mismatch score is 0; and gap penalty is -1.

Page 34: Sequence Alignment I Dot plots Dynamic Programming

Origination and Length Penalties

Simple gap penalties lead to many optimal alignments (those having the same score).

Mutations are rare, invoking fewest number of unlikely events is evolutionarily sound. 3-nt indel would be more common than

multiple single indels.

Page 35: Sequence Alignment I Dot plots Dynamic Programming

Origination and Length Penalties

Origination penalty: starting a new series of gaps in one of the sequences being aligned.

Length penalty: number of sequential missing characters.

Page 36: Sequence Alignment I Dot plots Dynamic Programming

Gap-free alignment example: AATCTATA AAG-AT-A

AATCTATA AA-G-ATA

AATCTATA AA--GATA

Alignment scores would be -3, -1, +1, respectively; if the match score is 1; mismatch score is 0; and origination penalty is -2 and length penalty is -1.

Page 37: Sequence Alignment I Dot plots Dynamic Programming

Scoring matrices

Mismatch penalty can be used to provide further discrimination: Two protein sequences, one of which has an

alanine in a given position: A substitution to valine (another small hydrophobic aa) would have less impact than a change to lysine (a large, charged residue).

One can weigh these substitutions differently based on the likelihood of occurrence over time or based on other characteristics.

Page 38: Sequence Alignment I Dot plots Dynamic Programming

Scoring matrices

A scoring matrix is used to score each nongap position in the alignment. Nucleotide sequences Amino acid sequences

Page 39: Sequence Alignment I Dot plots Dynamic Programming

Identity matrix

Page 40: Sequence Alignment I Dot plots Dynamic Programming

Scoring matrices for DNA sequences

A T C G

A 5 -4 -4 -4

T -4 5 -4 -4

C -4 -4 5 -4

G -4 5 -4 5

A T C G

A 1 -5 -5 -1

T -5 1 -1 -5

C -5 -1 1 -5

G -1 -5 -5 1

BLAST matrix Transition/Transversion matrix

Page 41: Sequence Alignment I Dot plots Dynamic Programming

Scoring Matrices

Need to know the frequency of one amino acid substituting for another versus that event happening by chance alone based on frequency of occurrence of each amino acid:

odds ratio = P(ab)/q(a)*q(b)

Page 42: Sequence Alignment I Dot plots Dynamic Programming

Scoring matrices for amino acids

Blosum PAM (point accepted mutation)

Page 43: Sequence Alignment I Dot plots Dynamic Programming

Alignment and score for aa

BLOSUM 62 Scoring Matrix

Page 44: Sequence Alignment I Dot plots Dynamic Programming

BLOSUM

Ungapped alignments of related proteins are grouped using clustering techniques, substitution rates between clusters are calculated.

A BLOSUM-62 matrix is appropriate for comparing sequences of approximately 62% sequence similarity.

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 45: Sequence Alignment I Dot plots Dynamic Programming

PAM matrices

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 46: Sequence Alignment I Dot plots Dynamic Programming

PAM matrices

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 47: Sequence Alignment I Dot plots Dynamic Programming

PAM-1

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 48: Sequence Alignment I Dot plots Dynamic Programming

PAM1(multiplied by 10000)Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

A R N D C Q E G H I L K M F P S T W Y VAla A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

Page 49: Sequence Alignment I Dot plots Dynamic Programming

PAM1*PAM1=PAM2

9736 4 18 20 6 16 34 42 4 12 8 4 12 4 44 69 63 0 4 362 9827 2 0 2 20 0 0 20 6 2 38 8 2 8 12 2 16 0 28 2 9647 71 0 8 12 12 41 6 2 26 0 2 4 39 18 2 8 2

12 0 83 9720 0 12 105 12 8 2 0 6 0 0 2 10 6 0 0 22 2 0 0 9946 0 0 0 2 2 0 0 0 0 2 10 2 0 6 46 18 8 10 0 9754 53 2 46 2 6 12 8 0 12 4 4 0 0 2

20 0 14 111 0 69 9732 8 4 6 2 8 2 0 6 8 4 0 2 442 2 24 22 2 6 14 9871 2 0 2 4 2 2 6 42 6 0 0 102 16 36 6 2 40 2 0 9825 0 2 2 0 4 6 2 2 2 8 24 4 6 2 4 2 4 0 0 9746 18 4 24 14 0 2 14 0 2 656 2 6 0 0 12 2 2 8 44 9894 4 89 26 6 2 6 8 4 304 73 49 12 0 24 14 4 4 8 2 9853 40 0 6 16 22 0 2 22 2 0 0 0 4 0 0 0 10 16 8 9750 2 0 2 4 0 0 82 2 2 0 0 0 0 2 4 16 12 0 8 9892 0 4 2 6 56 0

26 10 4 2 2 16 6 4 10 2 4 4 2 2 9853 24 8 0 0 455 22 67 14 22 8 12 32 4 4 2 14 8 6 34 9683 75 10 4 444 4 26 8 2 6 4 4 2 22 4 16 12 2 10 63 9744 0 4 180 4 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 9952 2 02 0 6 0 6 0 2 0 8 2 2 0 0 42 0 2 2 4 9890 2

26 4 2 2 6 4 4 6 6 113 22 2 34 2 6 4 20 0 4 9803

Page 50: Sequence Alignment I Dot plots Dynamic Programming

PAM250 = PAM1250

1332 609 911 935 541 797 942 1179 633 794 590 680 673 417 1162 1131 1163 230 378 921278 1649 399 299 177 542 316 205 581 274 219 902 384 175 388 370 334 660 144 233431 403 648 656 179 483 567 452 573 273 204 520 269 199 382 485 458 163 251 276520 344 760 1142 149 682 1033 554 541 282 195 479 254 141 411 518 469 102 170 299198 159 140 98 5179 99 96 119 154 183 83 100 100 120 202 315 216 56 336 229341 499 453 552 109 963 671 281 741 242 261 459 299 141 413 336 319 127 143 242549 384 693 1090 150 879 1204 528 570 325 251 494 304 156 471 509 469 102 179 341

1199 500 966 1020 413 689 930 2670 545 513 379 601 459 293 819 1151 907 179 251 670241 471 480 386 152 644 377 190 1490 176 182 336 189 239 316 272 249 187 317 178316 229 243 210 232 228 222 199 202 1040 627 252 599 454 219 262 376 108 272 861538 421 451 341 214 569 388 349 534 1507 3366 490 2005 1271 479 442 594 554 650 1315624 1753 1015 832 236 954 808 541 785 539 410 2396 898 255 648 787 820 357 273 472112 133 102 79 46 120 86 72 91 250 353 178 647 144 89 104 138 55 76 229200 165 206 126 160 148 130 167 281 510 596 133 426 3244 129 213 222 438 2006 298679 492 469 433 272 560 474 474 497 341 315 419 333 218 1988 642 564 142 168 388906 655 822 754 682 628 707 895 568 513 368 690 483 342 892 1009 962 387 339 561785 493 655 575 356 495 544 598 426 604 403 604 513 298 659 810 1074 180 289 630

27 162 31 19 20 28 19 21 33 24 20 43 24 135 27 66 32 5518 139 17149 103 196 120 337 123 127 101 304 243 250 91 174 1511 97 162 163 307 3075 175654 370 427 381 410 413 406 450 379 1502 972 394 967 498 479 506 687 154 352 1729

Page 51: Sequence Alignment I Dot plots Dynamic Programming

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix

Observed %Difference

Evolutionary DistanceIn PAMs

1510204050607080

1511235680112159246

Page 52: Sequence Alignment I Dot plots Dynamic Programming

Final Scoring Matrix is the Log-Odds Scoring Matrix

S (a,b) = 10 log10(Mab/Pb)

Original amino acid

Replacement amino acid

Mutational probability matrix number

Frequency of amino acid b

Page 53: Sequence Alignment I Dot plots Dynamic Programming

PAM unit

PAM-1 is 1 substitution per 100 residues.

Page 54: Sequence Alignment I Dot plots Dynamic Programming

Point accepted mutation (PAM) matrix

Calculated by observing the substitutions that occur in alignments between similar sequences with very high (>85%) identity.

Page 55: Sequence Alignment I Dot plots Dynamic Programming

Construct a multiple sequence alignment

ACGCTAFKI GCGCTAFKI ACGCTAFKL GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL

Page 56: Sequence Alignment I Dot plots Dynamic Programming

Construct a multiple sequence alignment

AG IL

AG AL CS GA

A phylogenetic tree is created indicating the order in which various substitutions taken place.

Page 57: Sequence Alignment I Dot plots Dynamic Programming

Generating PAM matrix

For each amino acid, the frequency with which it is substituted by every other amino acid is calculated. Substitutions are considered symmetric: AG counts as GA

For example for FGA count all AG and GA substitutions.

Page 58: Sequence Alignment I Dot plots Dynamic Programming

Construct a multiple sequence alignment

AG IL

AG AL CS GA

FGA = AG AG GA = 3

Page 59: Sequence Alignment I Dot plots Dynamic Programming

Relative mutability, mi

Number of times the amino acid is substituted by any other amino acid in the phylogenetic tree.

This number is then divided by the total number of mutations that could have affected the residue.

This denominator is the total number of subs across entire tree times two, multiplied by freq. of the amino acid, times a scaling factor

Page 60: Sequence Alignment I Dot plots Dynamic Programming

Construct a multiple sequence alignment

AG IL

AG AL CS GA

FGA = AG AG GA = 3

Page 61: Sequence Alignment I Dot plots Dynamic Programming

Relative mutability of A: mA

Total of 4 mutations involving A. Total number mutations in the entire tree is

6 should be multiplied by two = 6 x 2 = 12 Relative frequency of A residues (10 As out

of 7x9=63 residues) is 10/63 = 0.159. Thus mA = 4/12 x 0.159 x 100 = 0.0209

Page 62: Sequence Alignment I Dot plots Dynamic Programming

Mutation probability of A to G: MGA

mA = 4/12 x 0.159 x 100 = 0.0209

MGA = mA multiplied by #(AG and GA subs) and then this is divided by #(all subs involving A) MGA = (0.0209 x 3)/4 = 0.0156

Page 63: Sequence Alignment I Dot plots Dynamic Programming

Scoring matrix: RGA

log(Mij/fi), where Mij mutation probability of ij and fi equals to relative frequency of j type residue. For example f(G) = 10/63 = 0.1587

RGA = log(MGA/f(G)) = log(0.0156/0.1587)

= -1.01

Page 64: Sequence Alignment I Dot plots Dynamic Programming

PAM matrix calculator http://www.bioinformatics.nl/tools/pam.

html

Page 65: Sequence Alignment I Dot plots Dynamic Programming

Choice of matrix

PAM vs. BLOSUM: polar residues are less variable in BLOSUM Since PAM is based on pairs of entire sequences with less than 15%

divergence, most of the variations counted are in surface loop regions, regions under low evolutionary constraints. Asn is one of the most mutable residues.

BLOSUM matrices are based on conserved regions of more distantly related proteins and thus ignore residues in surface loop regions. The mutability of Asn is close to the average mutability.

Surface loop regions preferentially contain polar residues

Thus, BLOSUM matrices strongly overestimate the conservation of polar residues relative to PAM matrices. Matrix Conservation polar/apolar

PAM 0.49 BLOSUM 0.93

Page 66: Sequence Alignment I Dot plots Dynamic Programming

Choice of matrix

General: the choice of matrix should be governed by the use to which the matrix is put. Searches for distantly similar sequences, in

which only the BLOCKS are likely to be conserved should rely on BLOSUM matrices.

Complete alignments of globular protein sequences should use PAM matrices.

Proteins with extensive transmembrane helices should be aligned using the transmembrane matrix.

Page 67: Sequence Alignment I Dot plots Dynamic Programming

Choice of matrix

http://mcb.berkeley.edu/labs/king/blast/docs/matrix_info.html

Page 68: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

Exhaustive search: search all possible alignments: NOT FEASIBLE

Dynamic programming: a method of breaking a problem apart into reasonably sized subproblems and using these partial results to compute the final answer. Needleman and Wunsch

Page 69: Sequence Alignment I Dot plots Dynamic Programming

How to start the alignment?

How to break down the problem: Seq1 = ‘CACGA’ Seq2 = ‘CGA’

Three possible ways to start the problem: C of seq1 and C of seq2 match

C ACGAC GA

A gap is inserted into 1st position of seq1- CACGAC GA

A gap is inserted into 1st position of seq2C ACGA- CGA

Page 70: Sequence Alignment I Dot plots Dynamic Programming

How to end the alignment?

If we knew the score for the best alignment, we can track back the steps to reach to the best score. Use a table to store each step of the

alignment to refer back later on. Depends on storing partial sequence

alignments so you don’t do it again and again.

Page 71: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

Page 72: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1

C -2

A -3

G -4

T -5

A -6

G -7

Initialize a matrix where seq1 is on the rows and seq2 is on the colums

Fill in the first row and first column with multiples of gap penalty, in this case it equals to -1.

Page 73: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1

C -2

A -3

G -4

T -5

A -6

G -7

Start with the first residue of each sequence:

Should A match A?The alignment score between

A of seq1 and A of seq2 can come from:

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

SELECT THE MAXIMUM AMONG THREE

Page 74: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0

C -2

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does A match C

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 75: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1

C -2

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does A match T

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 76: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2

C -2

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does A match C

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 77: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does A match G

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 78: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0

A -3

G -4

T -5

A -6

G -7

Go to second row and continue towards right

Does C match A

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 79: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does C match C

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 80: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does C match T

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 81: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1 0

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does C match C

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 82: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1 0 -1

A -3

G -4

T -5

A -6

G -7

Continue towards right

Does C match G

a) diagonal: match (+1) or mismatch (0)

b) From top means a gap (-1) inserted in the first sequence (from left to right)

c) From left means that a gap (-1) is inserted in the second sequence (from top to bottom)

Page 83: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1 0 -1

A -3 -1 1 2 1 0

G -4 -2 0 1 2 2

T -5 -3 -1 1 1 2

A -6 -4 -2 0 1 1

G -7 -5 -3 -1 0 2

Fill in all cells in the partial scores table

While you fill in keep tract of from which direction you have carried away the scores from.

Page 84: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1 0 -1

A -3 -1 1 2 1 0

G -4 -2 0 1 2 2

T -5 -3 -1 1 1 2

A -6 -4 -2 0 1 1

G -7 -5 -3 -1 0 2

Back trace your steps from the optimal alignment score.

Sometimes more than one optimal alignment is possible.

Page 85: Sequence Alignment I Dot plots Dynamic Programming

Partial scores table

A C T C G

0 -1 -2 -3 -4 -5

A -1 1 0 -1 -2 -3

C -2 0 2 1 0 -1

A -3 -1 1 2 1 0

G -4 -2 0 1 2 2

T -5 -3 -1 1 1 2

A -6 -4 -2 0 1 1

G -7 -5 -3 -1 0 2

From the end point:GG

TCGTAG

--TCGAGTAG

AC—-TCGACAGTAG

Page 86: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 87: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 88: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 89: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 90: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

Mismatch between D and V is -3; gap is -8

Page 91: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 92: Sequence Alignment I Dot plots Dynamic Programming

Dynamic Programming

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 93: Sequence Alignment I Dot plots Dynamic Programming

Trace-back

Page 94: Sequence Alignment I Dot plots Dynamic Programming

Global alignments

Needleman and Wunch algorithm is global: compares two sequences in their entirety a gap penalty is assessed regardless of

whether gaps are located internally within a sequence, or at the end of one of both sequences.

Page 95: Sequence Alignment I Dot plots Dynamic Programming

Semi-global alignment

Terminal gaps are undermined because they are generally due to incomplete data and have no biological significance.

How is it different than the global algorithm?

Page 96: Sequence Alignment I Dot plots Dynamic Programming

Semi-global vs. global

In global a vertical move equals to a gap in the horizontal axis while a move to left equals to a gap in the vertical axis; they are both penalized.

If one allows initial gaps without penalty in the sequences, should set the first row and first column of the partial scores table to 0.

Page 97: Sequence Alignment I Dot plots Dynamic Programming

Semi-global alignment

A C A C T G A T C G

0 0 0 0 0 0 0 0 0 0 0 0

A 0 1 0 1 0 0 0 1 0 0 0

C 0 0 2 1 2 1 0 0 1 1 0

A 0 1 1 3 2 2 1 1 0 1 1

C 0 0 2 2 4 3 2 1 1 1 1

T 0 0 1 2 3 5 4 3 2 1 1

G 0 0 0 1 2 3 6 6 6 6 6Horizontal moves on the bottom row and vertical moves on the right most column are penalty free.

Page 98: Sequence Alignment I Dot plots Dynamic Programming

Semi-global alignment

ACACTGATCGACACTG----

Page 99: Sequence Alignment I Dot plots Dynamic Programming

Local alignment

Smith-Waterman Algorithm If you have a long sequence, and want

to find any subsequences that are similar to any part of yeast genome.

Local alignment finds the best matching subsequences within two search sequences.

Page 100: Sequence Alignment I Dot plots Dynamic Programming

Modify global alignment algorithm

Smith-Waterman Algorithm Place a zero in any position in the table if

all of the other methods result in scores lower than zero.

Page 101: Sequence Alignment I Dot plots Dynamic Programming

Local Alignment

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 102: Sequence Alignment I Dot plots Dynamic Programming

local alignmentA A C C T A T A G C T

0 0 0 0 0 0 0 0 0 0 0 0 0

G 0 0 0 0 0 0 0 1 0 1 0 0

C 0 0 0 1 1 0 0 0 0 0 2 1

G 0 0 0 0 0 0 0 0 0 1 0 1

A 0 1 1 0 0 0 1 0 1 0 0 0

T 0 0 0 0 0 1 0 2 1 0 0 1

A 0 1 1 0 0 0 2 0 3 2 1 0

T 0 0 0 0 0 1 1 3 2 2 1 2

A 0 1 1 0 0 0 2 2 4 3 2 1

Page 103: Sequence Alignment I Dot plots Dynamic Programming

Multiple Sequence Alignment

Page 104: Sequence Alignment I Dot plots Dynamic Programming

GAPS

http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/

Page 105: Sequence Alignment I Dot plots Dynamic Programming

Use align1 to align: enter x and y in the command line (>>)

>> x='ggagaggat'

x =

ggagaggat

>> y='ccagacct'

y =

ccagacct

>> align1

Page 106: Sequence Alignment I Dot plots Dynamic Programming

Use align1 to align: enter x and y in the command line (>>)

>>align1

xalign =

ggagaggat

yalign =

ccagacc-t

ggagaggatccagacc-t

Page 107: Sequence Alignment I Dot plots Dynamic Programming

Use alignloc1 to align: type in alignloc1 at the >>

>>alignloc1

xalign =

aga

yalign =

aga

agaaga

Page 108: Sequence Alignment I Dot plots Dynamic Programming

What is different?No gap penalty at the initiation

%Initial Conditions for matrices F and I:for i = 2:M+1 F(i,1) = (i-1)*0 I(i,1) = 1 %I:verticalendfor j = 2:N+1 F(1,j) = (j-1)*0 I(1,j) = 3 %I:horizontalend

Page 109: Sequence Alignment I Dot plots Dynamic Programming

What is different?Maximization

[F(i,j) I(i,j)] = max([F(i-1,j)+g F(i-1,j-1)+w F(i,j-1)+g 0])

Page 110: Sequence Alignment I Dot plots Dynamic Programming

What is different?Find the max value of the F matrix

[Yj,fj]=max(max(F)) % find column id

[Yi,fi]=max(max(F')) % find row id

i=fi(1) % set current i

j=fj(1) % set current j

Page 111: Sequence Alignment I Dot plots Dynamic Programming

What is different?When maximum value is 0 then stop

if I(i,j) == 1 i = i-1 xrev(k) = x(i) yrev(k) = '-' elseif I(i,j) == 2

i = i-1; j = j-1 xrev(k) = x(i) yrev(k) = y(j) elseif I(i,j) == 3 j = j-1; xrev(k) = '-' yrev(k) = y(j) else break end