
Post on 21-Dec-2015


Pairwise sequence alignment algorithms

Elya Flax

&

Inbar Matarasso

Seminar in Structural Bioinformatics - Pairwise sequence alignment algorithms

Outline

The importance of (sub)sequence comparison in molecular biology

The edit distance between two strings

Dynamic Programming

String similarity

Computing alignments in linear space

Local alignment

Gaps

Motivation

The area of approximate matching and sequence comparison is central in computational molecular biology because of active mutational processes that (sub)sequence comparison methods seek to model and reveal.

Much of computational biology concerns sequence alignments

The importance of (Sub)sequence comparison in Molecular Biology

The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.

“Redundancy” and “similarity” are central phenomena in biology. But similarity has its limits: humans differ in some respects. These differences make conserved similarity even more significant, which in turn makes comparison and analogy very powerful tools in biology.

The importance of (Sub)sequence comparison in Molecular Biology

“... Similar sequences yield similar structures, but quite distinct sequences can produce remarkably similar structures.”

F. E. Cohen. Folding the sheets: using computational methods to predict structures of proteins. In E. Lander and M. S. Waterman, editors, Calculating the Secrets of Life, pages 236-71. National Academy Press, 1995.

Terminology

Approximate – some errors, of various types detailed later, are acceptable in valid matches.

Alignment – lining up characters of strings, allowing mismatches as well as matches and allowing characters of one string to be placed opposite spaces made in opposing strings.

qac_dbd

qawx_b_

Terminology

Subsequence versus Substring : A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be.

For example, the string xyz is a subsequence, but not a substring, in axayaz.
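The distinction can be checked mechanically. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if the characters of pattern appear contiguously in text."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern appear in text in order, gaps allowed."""
    remaining = iter(text)
    # 'ch in remaining' advances the iterator until ch is found or it runs out,
    # so later characters can only be matched to the right of earlier ones.
    return all(ch in remaining for ch in pattern)


print(is_subsequence("xyz", "axayaz"), is_substring("xyz", "axayaz"))
```

For the slide's example this reports that xyz is a subsequence of axayaz but not a substring of it.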

Dynamic Programming

Dynamic programming is typically applied to optimization problems. The development of a dynamic programming algorithm can be broken into a sequence of four steps:

i. Characterize the structure of an optimal solution.

ii. Recursively define the value of an optimal solution.

iii. Compute the value of an optimal solution in a bottom-up fashion.

iv. Construct an optimal solution from computed information.

Edit Distance

Instance: two sequences x[1..m] and y[1..n], and a set of operation costs.

Problem: find the cost of the least expensive transformation sequence that converts x to y.

The edit distance between two strings

The permitted edit operations are: Insertion, Deletion, Replacement.

Definition: A string over the alphabet I, D, R, M that describes a transformation of one string to another is called an edit transcript, or transcript for short, of the two strings.

R I M D M D M M I

v _ i n t n e r _

w r i _ t _ e r s

(I = insertion, D = deletion, R = replacement, M = match)
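To make the transcript notation concrete, here is a small sketch that replays a transcript over I, D, R, M. The helper name `apply_transcript` is hypothetical; it assumes the transcript is read left to right, with the first string as source and the second as target.

```python
def apply_transcript(transcript: str, source: str, target: str) -> str:
    """Replay an edit transcript and return the string it produces."""
    out = []
    i = j = 0  # next unread positions in source and target
    for op in transcript:
        if op == "M":                 # match: both strings share the character
            assert source[i] == target[j]
            out.append(source[i])
            i += 1
            j += 1
        elif op == "R":               # replacement: take the target character
            out.append(target[j])
            i += 1
            j += 1
        elif op == "I":               # insertion: a target character is added
            out.append(target[j])
            j += 1
        elif op == "D":               # deletion: a source character is dropped
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return "".join(out)


print(apply_transcript("RIMDMDMMI", "vintner", "writers"))
```

Replaying RIMDMDMMI on vintner yields writers, using five operations other than M.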

The edit distance between two strings

Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second.

For emphasis, note that matches are not counted.

String alignment

Definition: A (global) alignment of two strings S1 and S2, is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

String alignment

Example: an alignment of the strings qacdbd and qawxb:

qac_dbd

qawx_b_

Alignment Versus edit transcript

From the mathematical standpoint – equivalent ways to describe a relationship between two strings.

From a modeling standpoint – an edit transcript emphasizes the putative mutational events that transform one string to another, whereas an alignment only displays a relationship between the two strings.

Dynamic programming calculation of edit distance

Definition: For two strings S1 and S2, D(i,j) is defined to be the edit distance of S1 [1..i] and S2 [1..j].

D(i,j) denotes the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2.

D(n,m) is the edit distance of S1 and S2, where n and m are the lengths of S1 and S2, respectively.

The recurrence relation

The base conditions are:

D(i,0) = i; D(0,j) = j

The recurrence relation for D(i,j) when both i and j are strictly positive is:

D(i,j)=min[D(i-1,j)+1, D(i,j-1)+1,D(i-1,j-1)+t(i,j)]

where t(i,j) is defined to have value 1 if S1(i)≠S2(j), and 0 otherwise.
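The base conditions and the recurrence translate directly into a table-filling routine. This is a minimal sketch; the function name `edit_distance` is illustrative, and Python's 0-based indexing means S1(i) is read as `s1[i - 1]`.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Fill the (n+1) x (m+1) table D bottom-up and return D(n, m)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # base condition D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j                      # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or substitution
    return D[n][m]


print(edit_distance("vintner", "writers"))
```

The two nested loops visit each cell once, which is where the O(nm) bound comes from.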

Correctness of the general recurrence

Lemma 1: The value of D(i,j) must be D(i-1,j)+1, D(i,j-1)+1, or D(i-1,j-1)+t(i,j). There are no other possibilities.

Lemma 2: D(i,j)≤min[D(i-1,j)+1, D(i,j-1)+1,D(i-1,j-1)+t(i,j)]

Theorem: When both i and j are strictly positive, D(i,j)= min[D(i-1,j)+1, D(i,j-1)+1,D(i-1,j-1)+t(i,j)].

Tabular computation of edit distance

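Goal: efficiently compute the value D(n,m).

Top-down computation: evaluating the recurrence recursively makes redundant recursive calls, even though there are only (n+1) × (m+1) combinations of i and j.

Bottom-up computation: fill in a table of D(i,j) values, starting from the base conditions.

Time analysis: O(nm) cells in the table.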

Tabular computation of edit distance

Theorem: The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n,m) can be computed in O(nm) time.

The traceback

When the value of cell (i,j) is computed, set a pointer according to the following rules:

If D(i,j) = D(i,j-1)+1, set a pointer from (i,j) to (i,j-1).

If D(i,j) = D(i-1,j)+1, set a pointer from (i,j) to (i-1,j).

If D(i,j) = D(i-1,j-1)+t(i,j), set a pointer from (i,j) to (i-1,j-1).

To obtain an optimal edit transcript, follow any path of pointers from cell (n,m) to cell (0,0).

The traceback

Horizontal edge: insertion.

Vertical edge: deletion.

Diagonal edge: substitution if S1(i)≠S2(j), and match otherwise.
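A possible sketch of the traceback follows. Instead of storing pointers, it re-derives them from the table values while walking back from (n,m) to (0,0); that is a simplification assumed here, not something the slides prescribe. The function name `optimal_transcript` and the tie-breaking order (diagonal, then horizontal, then vertical) are likewise arbitrary choices.

```python
def optimal_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript (over I, D, R, M) from s1 to s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1 if (i > 0 and j > 0 and s1[i - 1] != s2[j - 1]) else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            ops.append("R" if t else "M")    # diagonal edge
            i, j = i - 1, j - 1
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append("I")                  # horizontal edge
            j -= 1
        else:
            ops.append("D")                  # vertical edge
            i -= 1
    return "".join(reversed(ops))


print(optimal_transcript("vintner", "writers"))
```

Each step of the walk decreases i, j, or both, so the walk takes at most n+m steps once the table exists.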

The traceback

Theorem: Once the dynamic programming table with pointers has been computed, an optimal edit transcript can be found in O(n+m) time.

The traceback

Theorem: Any path from (n,m) to (0,0) following pointers established during the computation of D(i,j) specifies an edit transcript with the minimum number of edit operations. Conversely, any optimal edit transcript is specified by such a path. Moreover, since a path describes only one transcript, the correspondence between paths and optimal transcripts is one-to-one.

Edit graphs

Definition: Given two strings S1 and S2 of length n and m, respectively, a weighted edit graph has (n+1)×(m+1) nodes, each labeled with a distinct pair (i,j) (0≤i≤n, 0≤j≤m). The specific edges and their edge weights depend on the specific string problem.

For the edit distance problem:

• The weight on the edges (i,j) → (i,j+1) and (i,j) → (i+1,j) is one.

• The weight on the edges (i,j) → (i+1,j+1) is t(i+1,j+1).

[Figure: edit graph for the strings CAN (rows 0 to 3) and ANN (columns 0 to 3).]

Weighted edit distance

Definition: With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string S1 into S2 with the minimum total operation weight.

For example: if each mismatch has a weight of 2, each space has a weight of 4, and each match a weight of 1, then the following alignment has a total weight of 17.

writ_ers

vintner_

Under these particular weights this alignment is not the minimum: the alignment with no spaces has six mismatches and one match, for a total weight of 13.
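The example weights can be checked with a short sketch. Both function names are illustrative; the second reuses the edit distance recurrence with the unit costs replaced by the slide's operation weights (match 1, mismatch 2, space 4). Under those weights the alignment shown evaluates to 17, while the minimum over all alignments comes out as 13, reached by the alignment with no spaces.

```python
def alignment_weight(a1: str, a2: str, match=1, mismatch=2, space=4) -> int:
    """Total operation weight of an alignment written with '_' for spaces."""
    total = 0
    for x, y in zip(a1, a2):
        if x == "_" or y == "_":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def weighted_edit_distance(s1: str, s2: str, match=1, mismatch=2, space=4) -> int:
    """Minimum total operation weight over all alignments of s1 and s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + space
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j] + space,
                          D[i][j - 1] + space,
                          D[i - 1][j - 1] + pair)
    return D[n][m]


print(alignment_weight("writ_ers", "vintner_"))
print(weighted_edit_distance("writers", "vintner"))
```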

Alphabet-weight edit distance

The weight of a substitution depends on exactly which character in the alphabet is being removed and which is being added.

String similarity

A way of formalizing the relatedness of two strings by measuring their similarity rather than their distance.

Definition: Let Σ be the alphabet used for strings S1 and S2, and let Σ’ be Σ with the added character “_”. Then, for any two characters x, y in Σ’, s(x, y) denotes the value (or score) obtained by aligning character x against character y.

String similarity

Definition: For a given alignment A of S1 and S2, let S1’ and S2’ denote the strings after the chosen insertion of spaces, and let l denote the (equal) length of the two strings in A. The value of alignment A is defined as

$\sum_{i=1}^{l} s(S_1'(i), S_2'(i))$

String similarity

For example, let Σ={a, b, c, d} and let the pairwise scores be defined in the following matrix:

s    a    b    c    d    _
a    1   -1   -2    0   -1
b         3   -2   -1    0
c              0   -4   -2
d                   3   -1
_                        0

Then the alignment

c a c _ d b d

c a b b d b _

has a total value of 0 + 1 - 2 + 0 + 3 + 3 - 1 = 4.
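The example value can be reproduced with a short sketch. The dictionary `SCORES` holds the upper triangle of the example matrix; treating the matrix as symmetric, so that s(x,y) = s(y,x), is an assumption made here because only one triangle is given. The names `SCORES`, `s`, and `alignment_value` are illustrative.

```python
# Upper triangle of the example scoring matrix over {a, b, c, d, _}.
SCORES = {
    ("a", "a"): 1, ("a", "b"): -1, ("a", "c"): -2, ("a", "d"): 0, ("a", "_"): -1,
    ("b", "b"): 3, ("b", "c"): -2, ("b", "d"): -1, ("b", "_"): 0,
    ("c", "c"): 0, ("c", "d"): -4, ("c", "_"): -2,
    ("d", "d"): 3, ("d", "_"): -1,
    ("_", "_"): 0,
}


def s(x: str, y: str) -> int:
    """Pairwise score, looked up symmetrically (assumed, see lead-in)."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_value(a1: str, a2: str) -> int:
    """Sum of s over the columns of an alignment written with '_' for spaces."""
    return sum(s(x, y) for x, y in zip(a1, a2))


print(alignment_value("cac_dbd", "cabbdb_"))
```

The column scores are 0, 1, -2, 0, 3, 3, -1, which sum to 4.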

String similarity

Definition: Given a pairwise scoring matrix over the alphabet Σ’, the similarity of two strings S1 and S2 is defined as the value of the alignment A of S1 and S2 that maximizes total alignment value.

Computing similarity

Definition: V(i,j) is defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j]

The base conditions are

$V(0,j) = \sum_{1 \le k \le j} s(\_, S_2(k))$

$V(i,0) = \sum_{1 \le k \le i} s(S_1(k), \_)$

Computing similarity

For i and j both strictly positive, the general recurrence is

V( i , j ) = max[ V(i-1,j-1) + s (S1(i), S2(j)),

V(i-1,j) + s (S1(i), _ ),

V(i,j-1) + s ( _ , S2(j)) ]

If S1 and S2 are of length n and m, respectively, then the value of their optimal alignment, V(n,m), can be found (using a dynamic programming table) in O(nm) time.
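A minimal sketch of the similarity computation, written against an arbitrary scoring function. The names `similarity` and `simple_score`, and the example scores (match +2, mismatch -2, space -1), are assumptions chosen for illustration.

```python
def similarity(s1: str, s2: str, score) -> int:
    """Value V(n, m) of an optimal global alignment under score(x, y)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s1[i - 1], "_")   # S1 prefix against spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("_", s2[j - 1])   # S2 prefix against spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + score(s1[i - 1], "_"),
                          V[i][j - 1] + score("_", s2[j - 1]))
    return V[n][m]


def simple_score(x: str, y: str) -> int:
    """Illustrative scores: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


print(similarity("ab", "b", simple_score))
```

Passing the scoring function as a parameter keeps the table-filling code independent of the alphabet and of the particular matrix in use.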

Alignment graphs for similarity

As was the case for edit distance, the computation of similarity can be viewed as a path problem on a directed acyclic graph called an alignment graph.

The longest start to destination paths in the alignment graph are in one-to-one correspondence with the optimal (maximum value) alignments.

End-space free variant

Spaces at the end or the beginning of the alignment contribute a weight of zero.

Example: the shotgun sequence assembly problem.

Implementation: use the recurrence for global alignment, but change the base conditions to V(i,0) = V(0,j) = 0.
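A sketch of the end-space free variant follows. The zero base conditions make leading spaces free; to make trailing spaces free as well, this sketch takes the maximum over the last row and last column of the table, which is the usual completion of the method but is not stated on the slide. The function name and the scores (match +2, mismatch -2, space -1) are assumptions.

```python
def end_space_free_similarity(s1: str, s2: str,
                              match=2, mismatch=-2, space=-1) -> int:
    """Best alignment value when spaces at either end of the alignment are free."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    best_in_last_row = max(V[n])
    best_in_last_col = max(row[m] for row in V)
    return max(best_in_last_row, best_in_last_col)


print(end_space_free_similarity("xxab", "abyy"))
```

For xxab and abyy the suffix ab of the first string overlaps the prefix ab of the second, which is the kind of overlap sought in sequence assembly.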

Approximate occurrences of P in T

Definition: Given a parameter δ, a substring T’ of T is said to be an approximate occurrence of P if and only if the optimal alignment of P to T’ has value at least δ.

Theorem: There is an approximate occurrence of P in T ending at position j of T if and only if V(n,j) ≥ δ. Moreover, T[k..j] is an approximate occurrence of P in T if and only if V(n,j) ≥ δ and there is a path of backpointers from cell (n,j) to cell (0,k).
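The following sketch reports the end positions j at which the optimal alignment of P to some substring of T ending at j has value at least δ, as in the definition. It assumes base conditions that the slide does not spell out: V(0,j) = 0 for all j, so an occurrence may start anywhere in T, and V(i,0) accumulating space scores. The function name and the scores (match +2, mismatch -2, space -1) are also assumptions.

```python
def approximate_end_positions(p: str, t: str, delta: int,
                              match=2, mismatch=-2, space=-1) -> list:
    """1-based positions j of T with V(n, j) >= delta, where n = len(p)."""
    m = len(t)
    prev = [0] * (m + 1)                 # row 0: free start anywhere in T
    for i in range(1, len(p) + 1):
        cur = [prev[0] + space] + [0] * m
        for j in range(1, m + 1):
            pair = match if p[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair,
                         prev[j] + space,
                         cur[j - 1] + space)
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] >= delta]


print(approximate_end_positions("abc", "xxabcxxabdxx", 6))
```

Only two rows are kept because the positions alone are requested; recovering the start position k of each occurrence would need the backpointers.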

How to find the optimal alignment in linear space?

Definition: For any string α, let α^r denote the reverse of string α.

Definition: Given strings S1 and S2, define V^r(i,j) as the similarity of the string consisting of the first i characters of S1^r and the string consisting of the first j characters of S2^r. Equivalently, V^r(i,j) is the similarity of the last i characters of S1 and the last j characters of S2.

How to find the optimal alignment in linear space?

Lemma 1: $V(n,m) = \max_{0 \le k \le m} [\,V(n/2,k) + V^r(n/2,m-k)\,]$.

Definition: Let k* be a position k that maximizes [V(n/2,k) + V^r(n/2,m-k)].

Definition: Let L be an optimal traceback path that passes through cell (n/2,k*), and let L_{n/2} be the subpath of L that starts with the last node of L in row n/2-1 and ends with the first node of L in row n/2+1.

How to find the optimal alignment in linear space?

Lemma 2: A position k* in row n/2 can be found in O(nm) time and O(m) space. Moreover, a subpath L_{n/2} can be found and stored in those time and space bounds.

How to find the optimal alignment in linear space?

[Figure: dynamic programming table showing regions A and B, columns k1, k*, k2, and m, and rows n/2-1, n/2, n/2+1, and n.]

How to find the optimal alignment in linear space?

Execute dynamic programming to compute the optimal alignment of S1 and S2, but stop after iteration n/2, that is, once row n/2 has been filled in.

When filling in row n/2, save the normal traceback pointers for the cells in that row. This takes O(m) space.

Do the same for S1^r and S2^r.

Using the first set of saved pointers, follow any traceback path from cell (n/2,k*) to a cell k1 in row n/2-1. (Do the same for k2 and row n/2+1.)

O(nm) time and O(m) space is used to find k*, k1, k2, and L_{n/2}.
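The divide-and-conquer idea can be sketched as below. This is a simplified variant: it splits only on k* (Lemma 1) and recurses on the two halves, without tracking k1, k2, or the subpath L_{n/2} described on the slides. All names and the scores (match +2, mismatch -2, space -1) are assumptions for illustration.

```python
def score(x: str, y: str, match=2, mismatch=-2, space=-1) -> int:
    if x == "_" or y == "_":
        return space
    return match if x == y else mismatch


def last_row(s1: str, s2: str) -> list:
    """Row len(s1) of the similarity table, keeping only two rows in memory."""
    prev = [0] * (len(s2) + 1)
    for j in range(1, len(s2) + 1):
        prev[j] = prev[j - 1] + score("_", s2[j - 1])
    for x in s1:
        cur = [prev[0] + score(x, "_")]
        for j in range(1, len(s2) + 1):
            cur.append(max(prev[j - 1] + score(x, s2[j - 1]),
                           prev[j] + score(x, "_"),
                           cur[j - 1] + score("_", s2[j - 1])))
        prev = cur
    return prev


def linear_space_alignment(s1: str, s2: str) -> tuple:
    """Return one optimal global alignment (a1, a2), with '_' for spaces."""
    m = len(s2)
    if len(s1) == 0:
        return "_" * m, s2
    if m == 0:
        return s1, "_" * len(s1)
    if len(s1) == 1:
        # Base case: put the single character opposite a space or its best partner.
        all_spaces = sum(score("_", c) for c in s2)
        best_val = score(s1, "_") + all_spaces
        best = (s1 + "_" * m, "_" + s2)
        for k, c in enumerate(s2):
            val = score(s1, c) + all_spaces - score("_", c)
            if val > best_val:
                best_val, best = val, ("_" * k + s1 + "_" * (m - k - 1), s2)
        return best
    mid = len(s1) // 2
    forward = last_row(s1[:mid], s2)                 # V(n/2, k)
    backward = last_row(s1[mid:][::-1], s2[::-1])    # V^r(n - n/2, m - k)
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    left = linear_space_alignment(s1[:mid], s2[:k_star])
    right = linear_space_alignment(s1[mid:], s2[k_star:])
    return left[0] + right[0], left[1] + right[1]


a1, a2 = linear_space_alignment("writers", "vintner")
print(a1)
print(a2)
```

The string slices copy data for readability; an implementation that cares about constants would pass index ranges instead.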

Local alignment

Local alignment problem: given two strings S1 and S2, find substrings α and β of S1 and S2, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from S1 and S2.

Local alignment

S1 = pqraxabcstvq

S2 = xyaxbacsll

match = +2, mismatch = -2, space = -1

Optimal local alignment:

a x a b _ c s

a x _ b a c s

The optimal local alignment of S1 and S2 has value 8 and is defined by substrings axabcs and axbacs.

Why local alignment?

Global alignment of protein sequences is often meaningful when the two strings are members of the same protein family.

Local alignment is critical when comparing long stretches of anonymous DNA or proteins from very different families.

Computing local alignment

Definition: given a pair of indices i ≤ n and j ≤ m, the local suffix alignment problem is to find a (possibly empty) suffix α of S1[1..i] and a (possibly empty) suffix β of S2[1..j] such that V(α, β) is the maximum over all pairs of suffixes of S1[1..i] and S2[1..j].

Computing local alignment

Theorem: Let V(i,j) be the value of the optimal local suffix alignment for the index pair i, j, and let v* be the value of the optimal local alignment for two strings of lengths n and m. Then v* = max[V(i,j) : i ≤ n, j ≤ m].

Computing local alignment

Theorem: If i’, j’ is an index pair maximizing V(i,j) over all i, j pairs, then a pair of substrings solving the local suffix alignment problem for i’, j’ also solves the local alignment problem.

How to solve the local suffix alignment problem

First, V(i,0)=V(0,j)=0 for all i, j, since we can always choose an empty suffix.

Theorem: For i > 0 and j > 0, the proper recurrence for V(i,j) is

V( i , j ) = max[ 0,V(i-1,j-1) + s (S1(i), S2(j)),

V(i-1,j) + s (S1(i), _ ),

V(i,j-1) + s ( _ , S2(j)) ]
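The recurrence with the extra 0 term can be sketched as follows, using the match, mismatch, and space scores from the local alignment example (+2, -2, -1). The function name is illustrative, and only the value v* is returned, not the substrings.

```python
def local_alignment_value(s1: str, s2: str,
                          match=2, mismatch=-2, space=-1) -> int:
    """Value v* of an optimal local alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # V(i,0) = V(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(0,                       # empty suffixes
                          V[i - 1][j - 1] + pair,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
            best = max(best, V[i][j])              # v* is the largest cell
    return best


print(local_alignment_value("pqraxabcstvq", "xyaxbacsll"))
```

For the example strings this gives 8, matching the alignment of axabcs with axbacs.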

Time analysis

Theorem: For two strings S1 and S2 of lengths n and m, the local alignment problem can be solved in O(nm) time, the same time as for global alignment.

Theorem: All optimal local alignments of two strings are represented in the dynamic programming table for V(i,j) and can be found by tracing any pointers back from any cell with value v*.

Gaps

Definition: A gap is any maximal, consecutive run of spaces in a single string of a given alignment.

An alignment with seven spaces distributed into four gaps

c t t t a a c _ _ a _ a c

c _ _ _ c a c c c a t _ c
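Counting spaces and gaps is a matter of finding maximal runs of the space character in each row. A minimal sketch, with a hypothetical function name and the rows written without the separating blanks:

```python
import re


def count_spaces_and_gaps(a1: str, a2: str) -> tuple:
    """Return (number of spaces, number of gaps) in an alignment using '_'."""
    spaces = a1.count("_") + a2.count("_")
    # Each maximal run of '_' within a single row is one gap.
    gaps = len(re.findall(r"_+", a1)) + len(re.findall(r"_+", a2))
    return spaces, gaps


print(count_spaces_and_gaps("ctttaac__a_ac", "c___cacccat_c"))
```

For the alignment above this gives seven spaces in four gaps.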

Why gaps?

A gap in string S1 opposite substring α in string S2 corresponds to either a deletion of α from S1 or to an insertion of α into S2. The concept of a gap in an alignment is therefore important in many biological applications because the insertion or deletion of an entire substring (particularly in DNA) often occurs as a single mutational event.

Choices for gap weights

We will examine in detail four general types of gap weights: constant, affine, convex, and arbitrary.

The objective in the constant gap weight model is to find an alignment A to maximize

$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\#\,\text{gaps})$

The objective in the affine gap weight model is to find an alignment A to maximize

$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\#\,\text{gaps}) - W_s \cdot (\#\,\text{spaces})$

where Ws is the weight given to spaces.

Choices for gap weights

Convex gap weight: each additional space in a gap contributes less to the gap weight than the preceding space, so the gap weight is a convex function of its length. Example: Wg + log_e q, where q is the length of the gap.

Arbitrary gap weight: the weight of a gap is an arbitrary function w(q) of its length q. The constant, affine, and convex weight models are of course subcases of the arbitrary weight model.
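The gap weight models can each be written as a function w(q) of the gap length q, which is how the arbitrary model subsumes the others. The sketch below is illustrative only: the function names are hypothetical, and the values Wg = 3 and Ws = 1 in the usage line are arbitrary.

```python
import math
import re


def gap_lengths(a1: str, a2: str) -> list:
    """Lengths of all maximal runs of '_' in either row of an alignment."""
    return [len(run) for row in (a1, a2) for run in re.findall(r"_+", row)]


def constant_weight(q: int, wg: float) -> float:
    return wg                      # every gap costs Wg, whatever its length


def affine_weight(q: int, wg: float, ws: float) -> float:
    return wg + ws * q             # Wg per gap plus Ws per space


def convex_weight(q: int, wg: float) -> float:
    return wg + math.log(q)        # the example Wg + log_e q


def total_gap_weight(a1: str, a2: str, w) -> float:
    """Sum w(q) over every gap of length q in the alignment."""
    return sum(w(q) for q in gap_lengths(a1, a2))


a1, a2 = "ctttaac__a_ac", "c___cacccat_c"
print(total_gap_weight(a1, a2, lambda q: affine_weight(q, wg=3, ws=1)))
```

With four gaps and seven spaces, the affine total is 4·Wg + 7·Ws, in agreement with the affine objective.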

Time bounds for gap choices

Solving the above problems using dynamic programming:

Arbitrary gap: O(nm² + n²m)

Convex gap: O(nm log m)

Affine gap: O(nm)

Constant gap: O(nm)

Arbitrary gap weights

[Figure: the three types of alignment of S1[1..i] and S2[1..j], labeled 1, 2, 3 and E, F, G.]

Arbitrary gap weights

Definition: Define E(i,j) as the maximum value of any alignment of type 1; define F(i,j) as the maximum value of any alignment of type 2; define G(i,j) as the maximum value of any alignment of type 3; and finally define V(i,j) as the maximum value of the three terms E(i,j), F(i,j), G(i,j).

Arbitrary gap weights

Recurrences for the case of arbitrary gap weights:

$V(i,j) = \max[\,E(i,j),\ F(i,j),\ G(i,j)\,]$

$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$

$E(i,j) = \max_{0 \le k \le j-1} [\,V(i,k) - w(j-k)\,]$

$F(i,j) = \max_{0 \le l \le i-1} [\,V(l,j) - w(i-l)\,]$

Arbitrary gap weights

Base case if all spaces are included in the objective function:

V(i,0) = -w(i), V(0,j) = -w(j)

E(i,0) = -w(i), F(0,j) = -w(j)

G(0,0) = 0

Base case if end spaces, and hence end gaps, are free:

V(i,0) = 0, V(0,j) = 0
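The recurrences for arbitrary gap weights can be sketched as below, using the base case in which all spaces count. Reading from the recurrences, E(i,j) covers alignments ending with a gap opposite S2[k+1..j], F(i,j) those ending with a gap opposite S1[l+1..i], and G(i,j) those ending with S1(i) opposite S2(j). The function name, the match and mismatch scores, and the weight function in the usage line are assumptions; V(0,0) is taken to be 0.

```python
def similarity_arbitrary_gaps(s1: str, s2: str, w,
                              match=2, mismatch=-2) -> float:
    """V(n, m) under an arbitrary gap weight function w(q), for q >= 1."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                  # S1[1..i] opposite one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)                  # S2[1..j] opposite one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            G = V[i - 1][j - 1] + pair
            E = max(V[i][k] - w(j - k) for k in range(j))   # look left in row i
            F = max(V[l][j] - w(i - l) for l in range(i))   # look up in column j
            V[i][j] = max(E, F, G)
    return V[n][m]


print(similarity_arbitrary_gaps("axxxb", "ab", lambda q: 3 + q))
```

Each cell scans up to j cells to its left and i cells above it, which is the source of the O(nm² + n²m) bound.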

Time analysis

Theorem: Assuming that |S1| = n and |S2| = m, the recurrences can be evaluated in O(nm² + n²m) time.

Before gaps were included in the model, V(i,j) depended only on the three cells adjacent to (i,j). Now we need to look at up to j cells to the left and i cells above to determine V(i,j).

Summary

The first fact of biological sequence analysis

Dynamic Programming: edit distance, the recurrence relation, tabular computation

Optimal alignment in linear space

Global alignment vs. local alignment

Gaps

Food for thought…

Repeated substrings: find inexact repeats in a single string.

If we do local alignment of a string against itself, the best substring will be the entire string.

Even using all the values in the table, the best path may be strongly influenced by the main diagonal.

Bibliography

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd edition. Cambridge, MA: MIT Press, 2001. The MIT electrical engineering and computer science series.