1 Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms. Elya Flax & Inbar Matarasso Multiple sequence alignment algorithms

Post on 22-Dec-2015




Outline

The importance of multiple string alignments in molecular biology.

CLUSTAL W.

Family representation.

How to score multiple alignments.

The center star method for SP alignment.

Consensus strings.

Approximating the optimal consensus multiple alignment.

Iterative pairwise alignment.

Progressive alignment and contemporary improvements.

Repeated-motif methods.


Motivation

Why multiple string comparison?

Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.


Definition

Definition: A global multiple alignment of k>2 strings S={S1,S2,…,Sk} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.


Biological basis for multiple string comparison

The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).


Three “big-picture” biological uses for multiple string comparison

The representation of protein families and superfamilies.

The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function.

The deduction of evolutionary history from DNA or protein sequences.


CLUSTAL W

Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

http://www.ebi.ac.uk/clustalw/

[Figure: CLUSTAL W input sequences and results.]


Family and superfamily representation

Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family.

There are three common kinds of family representations that come from multiple string comparison:

I. Profile representation

II. Consensus sequence representation

III. Signature representation


Family representation and alignment with profiles

Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.


Family representation and alignment with profiles

a b c _ a
a b a b a
a c c b _
c b _ b c

     C1   C2   C3   C4   C5
a   .75    0  .25    0  .50
b     0  .75    0  .75    0
c   .25  .25  .50    0  .25
_     0    0  .25  .25  .25

Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the y,j profile entry.
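The profile above can be recomputed from the four aligned rows. The following is a minimal sketch, not the presenters' code: the function names `profile` and `log_odds` are hypothetical, the space `_` is counted like any other character when computing p(y), and the natural logarithm is an assumption (the slide does not fix a base).

```python
import math
from collections import Counter

def profile(rows):
    """Frequency of each character in each column of an alignment."""
    k = len(rows)
    return [{ch: n / k for ch, n in Counter(col).items()} for col in zip(*rows)]

def log_odds(rows):
    """log(p(y,j) / p(y)) for every character y that occurs in column j."""
    total = Counter("".join(rows))          # assumption: spaces count as characters
    size = sum(total.values())
    return [{ch: math.log(p / (total[ch] / size)) for ch, p in col.items()}
            for col in profile(rows)]

rows = ["abc_a", "ababa", "accb_", "cb_bc"]  # the alignment from the slide
prof = profile(rows)
odds = log_odds(rows)
```

Characters that never occur in a column are simply absent from that column's dictionary, which corresponds to the zero entries of the table.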


Aligning a string to a profile

Given a profile P and a new string S, we want to answer the question: “How well does S, or some substring of S, fit the profile P?”

Since space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment.

a a b b c
1 2 3 4 5

An alignment of string aabbc to the column positions of the previous alignment.


How to optimally align a string to a profile

Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem.

Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote Σ_y [s(x,y) × p(y,j)], the score for aligning x with column j.


How to optimally align a string to a profile

Definition: Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of the profile.

The recurrence:

V(i,0) = Σ_{k≤i} s(S(k),_)
V(0,j) = Σ_{k≤j} S(_,k)

For i and j both strictly positive, the general recurrence is:

V(i,j) = max [ V(i-1,j-1) + S(S(i),j), V(i-1,j) + s(S(i),_), V(i,j-1) + S(_,j) ].

Time analysis: O(σnm), where n is the length of S, m is the number of columns in the profile, and σ is the size of the alphabet.
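The recurrence for V(i,j) can be sketched in a few lines. This is an illustration under stated assumptions: the scoring function `s` (match 2, mismatch or character against space -1, space against space 0) is hypothetical and not taken from the slides, and `PROFILE` is the five-column example profile written as one dictionary per column.

```python
def column_score(x, col, s):
    """S(x,j): sum over y of s(x,y) * p(y,j) for one profile column."""
    return sum(s(x, y) * p for y, p in col.items())

def align_to_profile(S, prof, s, gap="_"):
    """Value V(n,m) of the best global alignment of string S to the profile."""
    n, m = len(S), len(prof)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # V(i,0): S[1..i] against spaces
        V[i][0] = V[i - 1][0] + s(S[i - 1], gap)
    for j in range(1, m + 1):                      # V(0,j): spaces against columns
        V[0][j] = V[0][j - 1] + column_score(gap, prof[j - 1], s)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(S[i - 1], prof[j - 1], s),
                V[i - 1][j] + s(S[i - 1], gap),
                V[i][j - 1] + column_score(gap, prof[j - 1], s),
            )
    return V[n][m]

def s(x, y):
    """Hypothetical scoring scheme, chosen only for this example."""
    if x == y:
        return 0 if x == "_" else 2
    return -1

PROFILE = [
    {"a": .75, "c": .25},
    {"b": .75, "c": .25},
    {"a": .25, "c": .50, "_": .25},
    {"b": .75, "_": .25},
    {"a": .50, "c": .25, "_": .25},
]
score = align_to_profile("aabbc", PROFILE, s)
```

Each call to `column_score` touches at most σ characters, which is where the σ factor in the running time comes from.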


Profile to profile alignment

Another way that profiles are used is to compare one protein set to another. In that case, the profile for one set is compared to the profile of the other.


Introduction to computing multiple string alignments

Definition: Given a set of k > 2 strings S={S1, S2, ...,Sk}, a local multiple alignment of S is obtained by selecting one substring Si’ from each string Si ∈ S and then globally aligning those substrings.


How to score multiple alignments

To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment.

We will discuss three types of objective functions:

I. sum-of-pairs functions

II. consensus functions

III. tree functions


How to score multiple alignments

Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.


How to score multiple alignments

Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

A A G A A _ A
A T _ A A T G
C T G _ G _ G

A T G A A _ G

Induced pairwise alignment scores: 4, 5, 5

SP score: 14


Multiple alignment with the sum-of-pairs (SP) objective function

Definition: The sum of pairs (SP) score of multiple alignment M is the sum of the scores of pairwise global alignments induced by M.

The SP alignment problem: Compute a global multiple alignment M with minimum sum-of-pairs score.
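The SP score of the example alignment on the scoring slide can be checked directly. The sketch below assumes a distance-style scheme that the slides do not state explicitly: 0 for a match or for two opposing spaces, 1 for a mismatch or a character opposite a space. Under that assumption the first three rows of the example give the induced scores 4, 5, 5 and the total 14.

```python
from itertools import combinations

def pair_cost(x, y):
    """Assumed scheme: identical symbols (including two spaces) cost 0, else 1."""
    return 0 if x == y else 1

def induced_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows of M."""
    return sum(pair_cost(x, y) for x, y in zip(row_a, row_b))

def sp_score(rows):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(induced_score(a, b) for a, b in combinations(rows, 2))

rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
induced = [induced_score(a, b) for a, b in combinations(rows, 2)]
total = sp_score(rows)
```

Counting two opposing spaces as 0 has the same effect as removing them from the induced alignment before scoring.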


An exact solution to the SP alignment problem

Via dynamic programming – for k strings of length n, it takes Θ(n^k) time.

We will develop the dynamic programming recurrence only for the case of three strings.

We will develop an accelerant to the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.


An exact solution to the SP alignment problem

Definition: Let S1, S2 and S3 denote three strings of length n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace respectively.


Recurrences for a nonboundary cell (i,j,k)

for i := 1 to n1 do
  for j := 1 to n2 do
    for k := 1 to n3 do
    begin
      if S1(i) = S2(j) then cij := smatch
      else cij := smis;
      if S1(i) = S3(k) then cik := smatch
      else cik := smis;
      if S2(j) = S3(k) then cjk := smatch
      else cjk := smis;

      d1 := D(i-1,j-1,k-1) + cij + cik + cjk;
      d2 := D(i-1,j-1,k) + cij + 2*sspace;
      d3 := D(i-1,j,k-1) + cik + 2*sspace;
      d4 := D(i,j-1,k-1) + cjk + 2*sspace;
      d5 := D(i-1,j,k) + 2*sspace;
      d6 := D(i,j-1,k) + 2*sspace;
      d7 := D(i,j,k-1) + 2*sspace;

      D(i,j,k) := min[d1,d2,d3,d4,d5,d6,d7];
    end;


D values for boundary cells

Let D1,2(i,j) denote the familiar pairwise distance between substrings S1[1..i] and S2[1..j], and let D1,3(i,k) and D2,3(j,k) denote the analogous pairwise distance. Then,

I. D(i,j,0)=D1,2(i,j)+(i+j)*sspace

II. D(i,0,k)=D1,3(i,k)+(i+k)*sspace

III. D(0,j,k)=D2,3(j,k)+(j+k)*sspace

IV. D(0,0,0)=0
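Putting the nonboundary recurrence and the boundary values together gives a complete, if cubic, procedure for three strings. The following is a sketch rather than the presenters' program: the default costs (smatch 0, smis 1, sspace 1) and the function names are assumptions made for the example, and only the optimal score is returned, not the alignment itself.

```python
def pairwise_table(a, b, smatch, smis, sspace):
    """Standard pairwise distance table; entry [i][j] covers a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * sspace
    for j in range(1, len(b) + 1):
        D[0][j] = j * sspace
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = smatch if a[i - 1] == b[j - 1] else smis
            D[i][j] = min(D[i - 1][j - 1] + c, D[i - 1][j] + sspace, D[i][j - 1] + sspace)
    return D

def sp_align3(s1, s2, s3, smatch=0, smis=1, sspace=1):
    """Optimal SP score D(n1,n2,n3) for three strings."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    D12 = pairwise_table(s1, s2, smatch, smis, sspace)
    D13 = pairwise_table(s1, s3, smatch, smis, sspace)
    D23 = pairwise_table(s2, s3, smatch, smis, sspace)
    D = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if k == 0:                       # boundary cases I, II, III (and IV)
                    D[i][j][k] = D12[i][j] + (i + j) * sspace
                elif j == 0:
                    D[i][j][k] = D13[i][k] + (i + k) * sspace
                elif i == 0:
                    D[i][j][k] = D23[j][k] + (j + k) * sspace
                else:                            # nonboundary cell: seven predecessors
                    cij = smatch if s1[i - 1] == s2[j - 1] else smis
                    cik = smatch if s1[i - 1] == s3[k - 1] else smis
                    cjk = smatch if s2[j - 1] == s3[k - 1] else smis
                    D[i][j][k] = min(
                        D[i - 1][j - 1][k - 1] + cij + cik + cjk,
                        D[i - 1][j - 1][k] + cij + 2 * sspace,
                        D[i - 1][j][k - 1] + cik + 2 * sspace,
                        D[i][j - 1][k - 1] + cjk + 2 * sspace,
                        D[i - 1][j][k] + 2 * sspace,
                        D[i][j - 1][k] + 2 * sspace,
                        D[i][j][k - 1] + 2 * sspace,
                    )
    return D[n1][n2][n3]
```

For example, `sp_align3("AC", "A", "A")` aligns the three A characters and places C opposite two spaces, for a cost of 2*sspace.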


A speed up for the exact solution

The program for multiple alignment shown above uses the recurrences in the backward direction.

In forward dynamic programming, when D(i,j,k) is set, it is sent forward to the seven cells that can be influenced by it.


A speed up for the exact solution

Definition: Let d1,2(i,j) be the edit distance between suffixes S1[i..n] and S2[j..n] of strings S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously.

All these d values can be computed in O(n^2) time by reversing the strings and computing three pairwise distances.


A speed up for the exact solution

Suppose that some multiple alignment of S1, S2, and S3 is known and that the alignment has SP score z.

Key idea of the heuristic speed up: Recall that D(i,j,k) is the optimal SP score for aligning S1[1..i], S2[1..j], and S3[1..k]. If D(i,j,k)+d1,2(i,j)+d1,3(i,k)+d2,3(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path and so D(i,j,k) need not be sent forward to any cell.
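The pruning test can be sketched as follows, assuming unit-cost edit distance and Python's 0-based slicing, so that the prefix is `a[:i]` and the remaining suffix is `a[i:]`. The names `suffix_distances` and `can_prune` are hypothetical; the sketch shows only the bound, not the full forward dynamic program.

```python
def edit_table(a, b):
    """Unit-cost edit distance table; entry [i][j] covers a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i
    for j in range(1, len(b) + 1):
        D[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D

def suffix_distances(a, b):
    """d[i][j] = edit distance between suffixes a[i:] and b[j:], via reversed strings."""
    R = edit_table(a[::-1], b[::-1])
    return [[R[len(a) - i][len(b) - j] for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]

def can_prune(D_ijk, i, j, k, d12, d13, d23, z):
    """True if cell (i,j,k) cannot lie on a path that scores z or better."""
    return D_ijk + d12[i][j] + d13[i][k] + d23[j][k] > z

s1, s2, s3 = "AXZ", "AXXZ", "AYZ"
d12, d13, d23 = suffix_distances(s1, s2), suffix_distances(s1, s3), suffix_distances(s2, s3)
```

The sum of the three suffix distances is a lower bound on what any completion of the alignment from (i,j,k) must still pay, which is why a cell exceeding z can be dropped safely.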


A bounded-error approximation method for SP alignment

The method is provably fast (runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment.

Recall that for two strings, D(Si,Sj) is the (optimal) weighted edit distance between Si and Sj.


An initial key idea: alignments consistent with a tree

Definition: Let S be a set of strings, and let T be a tree where each node is labeled with a distinct string from S. Then, a multiple alignment M of S is called consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si,Sj) that label adjacent nodes in T.


A bounded-error approximation method for SP alignment

a) [Figure: a tree whose five nodes are labeled with the strings 1: AXZ, 2: AXZ, 3: AXXZ, 4: AYZ, 5: AYXYZ.]

b) A multiple alignment of the five strings:

3 A X X _ Z
1 A X _ _ Z
2 A _ X _ Z
4 A Y _ _ Z
5 A Y X Y Z


An initial key idea: alignments consistent with a tree

Theorem: For any set of strings S and for any tree T whose nodes are labeled by distinct strings of S, we can efficiently find a multiple alignment M(T) of S that is consistent with T


The center star method for SP alignment

We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y.

Definition: A scoring scheme satisfies the triangle inequality if for any three characters x,y and z, s(x,z)≤ s(x,y) + s(y,z).


The center star method for SP alignment

Definition: Given a set of k strings S, define a center string Sc ∈ S as a string in S that minimizes Σ_{Sj ∈ S} D(Sc,Sj), and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled Sc and with each of the k-1 remaining nodes labeled by a distinct string in S − {Sc}.


The center star method for SP alignment

[Figure: a generic center star for six strings, where the center string Sc is S3.]


The center star method for SP alignment

Definition: Define the multiple alignment Mc of the set of strings S to be the multiple alignment consistent with the center star.

Definition: Define d(Si,Sj) as the score of the pairwise alignment of strings Si and Sj induced by Mc. Denote the score of an alignment M as d(M).

d(Si,Sj) ≥ D(Si,Sj), d(Mc) = Σ_{i<j} d(Si,Sj), d(Si,Sc) = D(Si,Sc)
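One way to build Mc is to pick the center string and then merge the optimal pairwise alignments of every other string with it, adding a column of spaces to the rows already placed whenever a new alignment inserts a space into the center. The sketch below assumes unit-cost edit distance, input strings that do not contain `_`, and hypothetical function names; it is an illustration, not the presenters' implementation.

```python
def align_pair(a, b, gap="_"):
    """Optimal unit-cost global alignment: returns (distance, row for a, row for b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:                         # traceback
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append(gap); i -= 1
        else:
            ra.append(gap); rb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ra)), "".join(reversed(rb))

def center_star(strings, gap="_"):
    """Returns (index of the center string, rows of Mc in input order)."""
    k = len(strings)
    dist = [[align_pair(a, b)[0] for b in strings] for a in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))
    rows, order = [strings[c]], [c]               # rows[0] is the gapped center
    for idx, s in enumerate(strings):
        if idx == c:
            continue
        _, c2, s2 = align_pair(strings[c], s, gap)
        merged, new, cur, i, j = [[] for _ in rows], [], rows[0], 0, 0
        while i < len(cur) or j < len(c2):
            ci = cur[i] if i < len(cur) else None
            cj = c2[j] if j < len(c2) else None
            if ci == gap and cj != gap:           # existing space column in the center
                for r, out in zip(rows, merged):
                    out.append(r[i])
                new.append(gap); i += 1
            elif cj == gap and ci != gap:         # new space in the center: pad old rows
                for out in merged:
                    out.append(gap)
                new.append(s2[j]); j += 1
            else:                                 # same center symbol in both
                for r, out in zip(rows, merged):
                    out.append(r[i])
                new.append(s2[j]); i += 1; j += 1
        rows = ["".join(out) for out in merged] + ["".join(new)]
        order.append(idx)
    result = [None] * k
    for pos, idx in enumerate(order):
        result[idx] = rows[pos]
    return c, result

strings = ["AXZ", "AXXZ", "AYZ", "AYXYZ"]
c, Mc = center_star(strings)
```

By construction, the alignment that Mc induces between the center and any other string is the optimal pairwise one, so d(Si,Sc) = D(Si,Sc) holds for every row.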


The center star method for SP alignment

Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings Si and Sj in S, d(Si,Sj) ≤ d(Si,Sc) + d(Sc,Sj) = D(Si,Sc) + D(Sc,Sj)


The center star method for SP alignment

Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(Si,Sj) be the score of the pairwise alignment of strings Si and Sj induced by M*. Then d(M*) = Σ_{i<j} d*(Si,Sj).


The center star method for SP alignment

Theorem: d(Mc)/d(M*) ≤ 2(k-1)/k <2.

Corollary:

kM ≤ Σ_{i<j} D(Si,Sj) ≤ d(M*) ≤ d(Mc) ≤ [2(k-1)/k] Σ_{i<j} D(Si,Sj).


Steiner consensus strings

Definition: Given a set of strings S, and given another string S’, the consensus error of a string S’ relative to S is

E(S’) = Σ_{Si ∈ S} D(S’, Si). Note that S’ need not be from S.


Steiner consensus strings

Definition: Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.


Steiner consensus strings

Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string S̄ ∈ S such that

E(S̄) / E(S*) ≤ 2 – 2/k < 2


Steiner consensus strings

Recall that Sc is a string that minimizes Σ_{Si ∈ S} D(Sc, Si) over all strings in S.

Theorem: Assuming that the scoring scheme satisfies the triangle inequality,

E(Sc) / E(S*) ≤ 2 – 2/k < 2


Consensus strings from multiple alignment

Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.


Consensus strings from multiple alignment

Definition: The consensus string SM derived from alignment M is the concatenation of the consensus characters for each column of M.


Consensus strings from multiple alignment

Definition: Let M be a multiple alignment of a set of strings S, and let SM be its consensus string containing q characters. Then the alignment error of SM equals Σ_{i=1..q} d(i), and the alignment error of M is defined as the alignment error of SM.
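The consensus string and its alignment error follow column by column from the definitions. This sketch assumes a 0/1 character distance in which the space `_` is treated as an ordinary character, and it restricts the candidate consensus characters to those occurring in the alignment; both are assumptions for the example, and the function name is hypothetical.

```python
def consensus(rows, dist, alphabet=None):
    """Returns (consensus string SM, alignment error = sum of d(i) over columns)."""
    if alphabet is None:
        alphabet = sorted(set("".join(rows)))    # candidate consensus characters
    chars, error = [], 0
    for col in zip(*rows):
        # consensus character: minimizes the summed distance to the column
        best = min(alphabet, key=lambda c: sum(dist(c, x) for x in col))
        chars.append(best)
        error += sum(dist(best, x) for x in col)  # this is d(i)
    return "".join(chars), error

rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
SM, err = consensus(rows, dist=lambda x, y: int(x != y))
```

With the 0/1 distance the consensus character of a column is simply its most frequent symbol; for these three rows every column has a two-out-of-three majority, so each column contributes an error of 1.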


Consensus strings from multiple alignment

Definition: The optimal consensus multiple alignment is a multiple alignment M for input set S whose consensus string SM has smallest alignment error over all possible multiple alignments of S


Consensus strings from multiple alignment

Definition: Given set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of S ∪ {S*} consistent with T is said to be consistent with S*.


Consensus strings from multiple alignment

Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then, removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.


Approximating the optimal consensus multiple alignment

Theorem: Assuming the triangle inequality, the multiple alignment Mc created by the center star method has an SP score that is never more than 2 – 2/k times the SP score of the optimal SP alignment, and it has a (consensus) alignment error that is never more than 2 – 2/k times the alignment error of the optimal consensus multiple alignment.


Multiple alignment to a (phylogenetic) tree

Definition: Given an input tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S.


Multiple alignment to a (phylogenetic) tree

Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.


Multiple alignment to a (phylogenetic) tree

The phylogenetic alignment problem for T: find an assignment of strings to internal nodes of T (one string to each node) that minimizes the distance of the alignment.

The consensus alignment problem is a special case of the phylogenetic alignment problem (i.e., when tree T is a star).


A heuristic for phylogenetic alignment

Definition: A phylogenetic alignment is called a lifted alignment if for every internal node V, the string assigned to V is also assigned to one of V’s children.

We will show that the best lifted alignment in T has a total distance less than twice that of the optimal phylogenetic alignment.


A heuristic for phylogenetic alignment

[Figure: a lifted alignment on a tree with leaf strings S1, ..., S8; the internal nodes are labeled S2, S5 and S6.]


The transformation creating T^L

We will construct the lifted alignment T^L out of T*, which is the optimal phylogenetic alignment.

Definition: We say a node has been lifted after it has been labeled by a string in the leaf set S.

Let Sv* be the string labeling internal node V in T*, and let S1, S2, …, Sk be the strings labeling V’s children, all of which have already been lifted. We lift Sj to V if D(Sv*,Sj) ≤ D(Sv*,Si) for every i from 1 to k.


The transformation creating T^L

[Figure: node V, labeled Sv* in T*, with children labeled S1, S2, S3 and S4; after the lift, V is labeled S3.]

The lifting operation at node V. The numbers on the edges are the distances from Sv* to the lifted strings labeling its children. Note that after the lift, one edge will have zero distance.


The error analysis

Theorem: The lifted alignment T^L has total distance at most twice that of the optimal phylogenetic alignment T* of T.


Computing the minimum distance lifted alignment

The best lifted alignment is computed by dynamic programming.

Definition: Let Tv be the subtree of T rooted at node V. Let d(V,S) denote the distance of the best lifted alignment of Tv under the requirement that string S is assigned to node V (assuming of course that S is a string at a leaf of Tv).


Computing the minimum distance lifted alignment

We start with the assumption that all the leaves have already been processed.

Notation: S’ is a string written at a leaf; V’ is a child of V.

If V is a node all of whose children are leaves, then

d(V,S) = Σ_{S’} D(S, S’).

For a general internal node V, the dynamic programming recurrence is

d(V,S) = Σ_{V’} min_{S’} [ D(S, S’) + d(V’,S’) ],

where the sum is over the children V’ of V and the minimum is over the strings S’ at the leaves of the subtree rooted at V’.
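The recurrence for d(V,S) can be evaluated bottom-up on a small tree. In this sketch a leaf is a Python string and an internal node is a list of subtrees; that representation, the unit-cost edit distance, and the function names are all assumptions made for illustration. For the child whose subtree contains S, the sketch fixes S’ = S so that the result satisfies the definition of a lifted alignment; this is a choice made here, assuming distinct leaf strings, and is not spelled out on the slide.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j - 1] + (a[i - 1] != b[j - 1]),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted_table(tree, dist=edit_distance):
    """Maps each leaf string S of the subtree to d(V,S) for the subtree root V."""
    if isinstance(tree, str):
        return {tree: 0}                          # a leaf is already labeled
    child_tables = [lifted_table(child, dist) for child in tree]
    table = {}
    for own in child_tables:
        for S in own:                             # S is lifted from this child
            total = own[S]                        # zero-distance edge to that child
            for other in child_tables:
                if other is not own:
                    total += min(dist(S, Sp) + d for Sp, d in other.items())
            table[S] = total
    return table

tree = [["AXZ", "AXXZ"], ["AYZ", "AYXYZ"]]        # hypothetical example tree
table = lifted_table(tree)
best = min(table.values())
```

The distance of the best lifted alignment is the smallest entry of the table computed at the root.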


Computing the minimum distance lifted alignment

Theorem: The optimal lifted alignment can be computed in polynomial time as a function of size of the tree and the lengths of the input strings.


Iterative pairwise alignment

The goal is to iteratively merge two multiple alignments of two subsets of strings into a single multiple alignment of the union of those subsets.

As an example, we will explain the average linkage method, also known as UPGMA (“Unweighted Pair-Group Method using arithmetic Averages”). At each merge step, the new multiple alignment can be created by aligning some representation of the two smaller alignments (for example, by aligning profiles or consensus sequences).


Multiple alignments serve to characterize protein families and to identify important molecular structures, but…

Doolittle: “…what we’re really interested in is a historical alignment. The historical alignment ought to reflect, as accurately as possible, the series of divergences that led to the contemporary sequences…”


Iterative alignment methods determine a sequence of merges of disjoint subsets of strings. Hence the history of those merges can be described by a binary tree T. Each leaf of T represents a single string from the input set, and each node of T specifies a merge of the strings found at the leaves of its subtree. Each node also represents a multiple alignment created by the merge at that node.


Progressive alignment

A pair of strings with minimum edit distance (or greatest similarity) is likely to come from the pair of taxa that diverged most recently.

Any spaces (gaps) that appear in the optimal pairwise alignment of those two strings are preserved throughout the entire sequence of successive merges.
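A minimal sketch of one merge step shows how gaps, once introduced, are carried forward. It assumes each alignment is represented by a majority-vote consensus string and that the two consensus strings are aligned with simple match, mismatch and gap scores; these choices and all function names are illustrative assumptions, and real progressive aligners typically align profiles instead.

```python
from collections import Counter


def consensus(rows):
    """Majority residue per column; a column holding only gaps stays a gap."""
    out = []
    for col in zip(*rows):
        residues = [c for c in col if c != "-"]
        out.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(out)


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b, returned as (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    cols, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            cols.append((i - 1, None)); i -= 1   # gap column in b
        else:
            cols.append((None, j - 1)); j -= 1   # gap column in a
    return cols[::-1]


def merge_alignments(rows_a, rows_b):
    """Merge two alignments; existing columns are kept and only gap columns are added."""
    cols = global_align(consensus(rows_a), consensus(rows_b))

    def expand(rows, side):
        return ["".join("-" if pair[side] is None else row[pair[side]]
                        for pair in cols) for row in rows]

    return expand(rows_a, 0) + expand(rows_b, 1)


print(merge_alignments(["AC-GT", "ACAGT"], ["ACGT"]))
# ['AC-GT', 'ACAGT', 'AC-GT']
```

The gap already present in the first row survives the merge unchanged, and the newly added sequence receives a gap in the same column.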


The progressive alignment method is explicitly aimed at building an evolutionary tree from molecular data while simultaneously constructing an evolutionarily informative multiple alignment.


Improvements to progressive alignment

Sequence weighting: the weights are normalized so that the largest one is set to 1. Closely related sequences receive lowered weights; highly divergent sequences receive high weights.

Initial gap penalties: a gap opening penalty (GOP) is charged once for every gap, and a gap extension penalty (GEP) is charged for every space in the gap.


Weight matrices: two main series of weight matrices are offered to the user, the Dayhoff PAM series and the BLOSUM series.

Divergent sequences: the most divergent sequences are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences have been merged.
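Two of the improvements above reduce to simple arithmetic, sketched here. The function names and the penalty values used in the example call are illustrative assumptions, not program defaults.

```python
import re


def normalize_weights(weights):
    """Scale sequence weights so that the largest weight becomes 1."""
    largest = max(weights)
    return [w / largest for w in weights]


def gap_cost(row, gop, gep):
    """Total gap penalty of one aligned row.

    Every maximal run of '-' pays the opening penalty once (GOP)
    plus the extension penalty for each space in the run (GEP).
    """
    return sum(gop + gep * len(run) for run in re.findall(r"-+", row))


print(normalize_weights([2.0, 4.0, 1.0]))   # [0.5, 1.0, 0.25]
print(gap_cost("AC--G-T", gop=10, gep=1))   # (10 + 2) + (10 + 1) = 23
```

Charging the opening penalty once per gap makes one long gap cheaper than several short ones with the same total number of spaces.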


Progressive alignment

Sequence    #   1    2    3    4    5    6
Hbb_Human   1   -
Hbb_Horse   2   .17  -
Hba_Human   3   .59  .60  -
Hba_Horse   4   .59  .59  .13  -
Myg_Phyca   5   .77  .77  .75  .75  -
Glb5_Petma  6   .81  .82  .73  .74  .80  -
Lgb2_Luplu  7   .87  .86  .86  .88  .93  .90

Pairwise alignment: calculate distance matrix
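To see how such a matrix determines a merge order, here is a rough sketch using average linkage (UPGMA), the method named earlier. Note the assumption: the slides build the guide tree by neighbor-joining, so UPGMA is only a simpler stand-in here and need not produce the same tree. The distances are copied from the matrix; the function names are hypothetical.

```python
from itertools import combinations

names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse",
         "Myg_Phyca", "Glb5_Petma", "Lgb2_Luplu"]

# Lower triangle of the distance matrix, one row per sequence.
lower = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]

dist = {frozenset((names[i], names[j])): d
        for i, row in enumerate(lower) for j, d in enumerate(row)}


def average_linkage(a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def upgma_merge_order(labels):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(*pair))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for a, b in upgma_merge_order(names):
    print(a, "+", b)
```

On this matrix the first merge joins Hba_Human and Hba_Horse (distance .13), the second joins Hbb_Human and Hbb_Horse (.17), and Lgb2_Luplu, the most divergent sequence, is added last.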


[Figure: unrooted neighbor-joining tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]


[Figure: rooted NJ tree (guide tree) and sequence weights, with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]

Progressive alignment: align following the guide tree


Repeated-motif methods

Repeated-motif methods are the second major approach used in multiple alignment.

Definition: A motif is a substring or a small subsequence that is common to many of the strings in the set.

The “width” of a motif is its length, and its “multiplicity” is the number of strings in which it appears.


Repeated-motif method, general algorithm:

1. Find a “good” motif (wide and with high multiplicity).

2. Shift the strings containing it so that the occurrences of the motif are aligned with each other.

3. The problem then divides into two subproblems, one for the substrings on each side of the motif.

4. Continue this recursion until no motif of sufficient width and multiplicity is found.

5. Solve the remaining subproblems by iterative alignment methods.

6. Align the strings that did not contain the first good motif separately.

7. Finally, merge the two alignments.
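Steps 1 and 2 can be sketched as follows. This is a simplified illustration under stated assumptions: the motif is an exact substring of a fixed width chosen by the caller, only its first occurrence in each string is used, and the function names are hypothetical.

```python
from collections import defaultdict


def best_motif(strings, width):
    """Return (motif, multiplicity) for the exact substring of the given
    width that occurs in the most strings, or None if no string is long enough."""
    seen = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s) - width + 1):
            seen[s[i:i + width]].add(idx)
    if not seen:
        return None
    # Ties in multiplicity are broken alphabetically to keep the result stable.
    motif = max(seen, key=lambda m: (len(seen[m]), m))
    return motif, len(seen[motif])


def shift_to_motif(strings, motif):
    """Pad the strings containing the motif on the left with '-' so that
    the first occurrence of the motif starts in the same column in each."""
    containing = [s for s in strings if motif in s]
    offset = max(s.index(motif) for s in containing)
    return ["-" * (offset - s.index(motif)) + s for s in containing]


strings = ["TTGATTACA", "GATTACAGG", "CCGATTACA", "AAAA"]
motif, multiplicity = best_motif(strings, width=7)
print(motif, multiplicity)              # GATTACA 3
print(shift_to_motif(strings, motif))   # ['TTGATTACA', '--GATTACAGG', 'CCGATTACA']
```

The string "AAAA" does not contain the motif, so it is left out here and would be aligned separately, as in step 6.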


Summary

The importance of multiple string alignments in molecular biology.

CLUSTAL W.

Family representation.

How to score multiple alignments.

The center star method for SP alignment.

Consensus strings.

Approximating the optimal consensus multiple alignment.

Iterative pairwise alignment.

Progressive alignment and contemporary improvements.

Repeated-motif methods.


Bibliography

Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.

Nucleic Acids Research, 1994, Vol. 22, No. 22. Oxford University Press.