
Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms.

Elya Flax

&

Inbar Matarasso

Multiple sequence alignment algorithms

2

Outline

The importance of multiple string alignments in molecular biology.

CLUSTAL W

Family representation

How to score multiple alignments

The center star method for SP alignment

Consensus strings

Approximating the optimal consensus multiple alignment

Iterative pairwise alignment

Progressive alignment and contemporary improvements

Repeated-motif methods

3

Motivation

Why multiple string comparison?

Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.


4

Definition

Definition: A global multiple alignment of k>2 strings S={S1,S2,…,Sk} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.


5

Biological basis for multiple string comparison

The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).


6

Three “big-picture” biological uses for multiple string comparison

The representation of protein families and superfamilies.

The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function.

The deduction of evolutionary history from DNA or protein sequences.


7

CLUSTAL W

Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

http://www.ebi.ac.uk/clustalw/

Sequences

results


8

Family and superfamily representation

Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family.

There are three common kinds of family representations that come from multiple string comparison:

I. Profile representation

II. Consensus sequence representation

III. Signature representation


9

Family representation and alignment with profiles

Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.


10

Family representation and alignment with profiles

a b c _ a

a b a b a

a c c b _

c b _ b c


      C1    C2    C3    C4    C5
a    .75    0    .25    0    .50
b     0    .75    0    .75    0
c    .25   .25   .50    0    .25
_     0     0    .25   .25   .25

Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the y,j profile entry.
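As a small illustration of the profile and its log-odds form, the following Python sketch builds both from the four-row alignment above. The function names `profile` and `log_odds` are made up for this example, and counting the space character `_` in the background frequency p(y) is an assumption, not something the slides specify.

```python
import math
from collections import Counter


def profile(alignment):
    """Return one dict per column mapping character -> frequency in that column."""
    k = len(alignment)
    return [{ch: n / k for ch, n in Counter(col).items()}
            for col in zip(*alignment)]


def log_odds(alignment):
    """Convert each profile entry p(y,j) to log(p(y,j) / p(y)).

    Assumption: p(y) is taken over every cell of the alignment,
    spaces included. Entries with p(y,j) = 0 are simply left out.
    """
    counts = Counter("".join(alignment))
    total = sum(counts.values())
    background = {ch: n / total for ch, n in counts.items()}
    return [{ch: math.log(p / background[ch]) for ch, p in col.items()}
            for col in profile(alignment)]


rows = ["abc_a", "ababa", "accb_", "cb_bc"]
for j, col in enumerate(profile(rows), start=1):
    print(f"C{j}", dict(sorted(col.items())))
```

Characters that never occur in a column have no entry here; a real scoring matrix would usually add pseudocounts so that the logarithm stays defined.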

11

Aligning a string to a profile

Given a profile P and a new string S, we want to answer the question: “How well does S, or some substring of S, fit the profile P?”

Since space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment.


a a b b c

1 2 3 4 5

An alignment of string aabbc to the column positions of the previous alignment.

12

How to optimally align a string to a profile

Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem.

Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let $S(x,j) = \sum_{y} s(x,y) \times p(y,j)$ denote the score for aligning x with column j.


13

How to optimally align a string to a profile

Definition: Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of the profile P.

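The recurrence: the boundary values are

$V(i,0) = \sum_{k \le i} s(S[k], \_)$ and $V(0,j) = \sum_{k \le j} S(\_, k)$.

For i and j both strictly positive, the general recurrence is:

$V(i,j) = \max[\, V(i-1,j-1) + S(S[i],j),\; V(i-1,j) + s(S[i],\_),\; V(i,j-1) + S(\_,j) \,]$,

where S[i] denotes the i-th character of the string S.

Time analysis: $O(\sigma n m)$, where n is the length of S, m is the number of columns of the profile, and $\sigma$ is the size of the alphabet.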
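The recurrence for V(i,j) can be sketched in Python as follows. This is only a minimal illustration: the function `align_to_profile`, the representation of a profile as a list of dicts, and the scoring function `simple_score` (match +2, mismatch or character against space -1, space against space 0) are assumptions chosen for the example, not values from the slides.

```python
def simple_score(x, y, space="_"):
    """Hypothetical alphabet-weight function s(x, y) (a similarity score)."""
    if x == space and y == space:
        return 0
    if x == space or y == space:
        return -1
    return 2 if x == y else -1


def align_to_profile(S, prof, s=simple_score, space="_"):
    """Value V(n, m) of the best global alignment of string S to a profile.

    prof is a list with one dict per column: character -> frequency p(y, j).
    """
    def col_score(x, j):
        # S(x, j) = sum over y of s(x, y) * p(y, j)
        return sum(s(x, y) * p for y, p in prof[j].items())

    n, m = len(S), len(prof)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # S[1..i] against no columns
        V[i][0] = V[i - 1][0] + s(S[i - 1], space)
    for j in range(1, m + 1):          # empty prefix against j columns
        V[0][j] = V[0][j - 1] + col_score(space, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + col_score(S[i - 1], j - 1),  # char in column j
                V[i - 1][j] + s(S[i - 1], space),              # char opposite a space
                V[i][j - 1] + col_score(space, j - 1),         # space in column j
            )
    return V[n][m]


toy_profile = [{"a": 1.0}, {"b": 1.0}]
print(align_to_profile("ab", toy_profile))
```

Each cell evaluates up to three column scores of at most σ terms each, which is where the σ factor in the running time comes from.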

14

Profile to profile alignment

Another way that profiles are used is to compare one protein set to another. In that case, the profile for one set is compared to the profile of the other.


15

Introduction to computing multiple string alignments

Definition: Given a set of k > 2 strings S={S1, S2, ...,Sk}, a local multiple alignment of S is obtained by selecting one substring Si’ from each string Si ∈ S and then globally aligning those substrings.


16

How to score multiple alignments

To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment.

We will discuss three types of objective functions:

I. sum-of-pairs functions

II. consensus functions

III. tree functions


17

Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.



18


Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.


A A G A A _ A

A T _ A A T G

C T G _ G _ G

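A T G A A _ G   (column-wise consensus of the three rows above)

Induced pairwise scores of the three aligned rows, with a match scoring 0 and a mismatch or a space scoring 1: 4 (rows 1 and 2), 5 (rows 1 and 3), 5 (rows 2 and 3).

SP score: 4 + 5 + 5 = 14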

19

Multiple alignment with the sum-of-pairs (SP) objective function

Definition: The sum of pairs (SP) score of multiple alignment M is the sum of the scores of pairwise global alignments induced by M.

The SP alignment problem: Compute a global multiple alignment M with minimum sum-of-pairs score.
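To make the SP score concrete, here is a short Python sketch that scores a given multiple alignment. The helper names `pair_score` and `sp_score` are made up for this example, and the unit costs (match 0, mismatch 1, space 1, two opposing spaces 0) are an assumption that happens to reproduce the pairwise scores 4, 5, 5 of the three-row example.

```python
from itertools import combinations


def pair_score(row1, row2, mismatch=1, space=1, gap="_"):
    """Score of the pairwise alignment induced by two rows of an alignment."""
    total = 0
    for x, y in zip(row1, row2):
        if x == y:
            continue                 # a match, or two opposing spaces, costs 0
        total += space if gap in (x, y) else mismatch
    return total


def sp_score(rows, **costs):
    """Sum-of-pairs score: sum of all induced pairwise alignment scores."""
    return sum(pair_score(a, b, **costs) for a, b in combinations(rows, 2))


rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
print([pair_score(a, b) for a, b in combinations(rows, 2)], sp_score(rows))
```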


20

An exact solution to the SP alignment problem

Via dynamic programming – for k strings of length n, it takes $\Theta(n^k)$ time.

We will develop the dynamic programming recurrence only for the case of three strings.

We will develop an accelerant to the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.


21

An exact solution to the SP alignment problem

Definition: Let S1, S2 and S3 denote three strings of length n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace respectively.


22

Recurrences for a nonboundary cell (i,j,k)

for i := 1 to n1 do
  for j := 1 to n2 do
    for k := 1 to n3 do
    begin
      { pairwise costs of the three current characters }
      if S1(i) = S2(j) then cij := smatch else cij := smis;
      if S1(i) = S3(k) then cik := smatch else cik := smis;
      if S2(j) = S3(k) then cjk := smatch else cjk := smis;

      { seven ways to form the last column }
      d1 := D(i-1,j-1,k-1) + cij + cik + cjk;
      d2 := D(i-1,j-1,k) + cij + 2*sspace;
      d3 := D(i-1,j,k-1) + cik + 2*sspace;
      d4 := D(i,j-1,k-1) + cjk + 2*sspace;
      d5 := D(i-1,j,k) + 2*sspace;
      d6 := D(i,j-1,k) + 2*sspace;
      d7 := D(i,j,k-1) + 2*sspace;

      D(i,j,k) := min[d1,d2,d3,d4,d5,d6,d7];
    end;

23

D values for boundary cells

Let D1,2(i,j) denote the familiar pairwise distance between substrings S1[1..i] and S2[1..j], and let D1,3(i,k) and D2,3(j,k) denote the analogous pairwise distance. Then,

I. D(i,j,0)=D1,2(i,j)+(i+j)*sspace

II. D(i,0,k)=D1,3(i,k)+(i+k)*sspace

III. D(0,j,k)=D2,3(j,k)+(j+k)*sspace

IV. D(0,0,0)=0
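The nonboundary recurrence and the boundary values can be combined into one runnable Python sketch. Instead of coding the boundary cases separately, this version simply skips any of the seven moves that would leave the table, which gives the same boundary values. The function name `sp_align3` and the default costs (smatch = 0, smis = 1, sspace = 1) are assumptions for illustration.

```python
from itertools import product


def sp_align3(s1, s2, s3, smatch=0, smis=1, sspace=1):
    """Optimal SP score D(n1, n2, n3) for three strings, by dynamic programming."""
    def pair(a, b):
        # None stands for a space in the current column.
        if a is None and b is None:
            return 0
        if a is None or b is None:
            return sspace
        return smatch if a == b else smis

    n1, n2, n3 = len(s1), len(s2), len(s3)
    inf = float("inf")
    D = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # The seven column types: which of the three strings contribute a character.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]

    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        best = inf
        for di, dj, dk in moves:
            if di > i or dj > j or dk > k:
                continue             # this move would leave the table
            a = s1[i - 1] if di else None
            b = s2[j - 1] if dj else None
            c = s3[k - 1] if dk else None
            cost = pair(a, b) + pair(a, c) + pair(b, c)
            best = min(best, D[i - di][j - dj][k - dk] + cost)
        D[i][j][k] = best
    return D[n1][n2][n3]


print(sp_align3("AAGAAA", "ATAATG", "CTGGG"))
```

The three strings in the usage line are the rows of the earlier SP example with their spaces removed, so the optimum found here can be no larger than that alignment's SP score of 14.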


24

A speed up for the exact solution

The program for multiple alignment that was shown uses recurrences in backward direction.

In forward dynamic programming, when D(i,j,k) is set, D(i,j,k) is sent forward to the seven cells that can be influenced by it.


25


Definition: Let d1,2(i,j) be the edit distance between suffixes S1[i..n] and S2[j..n] of string S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously.

All these d values can be computed in $O(n^2)$ time by reversing the strings and computing three pairwise distances.


26


Suppose that some multiple alignment of S1, S2, and S3 is known and that the alignment has SP score z.

Key idea of the heuristic speed up: Recall that D(i,j,k) is the optimal SP score for aligning S1[1..i], S2[1..j], and S3[1..k]. If D(i,j,k)+d1,2(i,j)+d1,3(i,k)+d2,3(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path and so D(i,j,k) need not be sent forward to any cell.
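A possible Python sketch of forward dynamic programming with this pruning rule follows. It is an illustration under stated assumptions: the names `suffix_dist` and `sp_align3_pruned` are invented, costs are match 0, mismatch 1, space 1, and with 0-based slices the bound for cell (i,j,k) uses the distances between the not yet aligned suffixes s1[i:], s2[j:] and s3[k:]. The value z must be the SP score of some known alignment.

```python
from itertools import product


def suffix_dist(a, b, smis=1, sspace=1):
    """d[i][j] = edit distance between the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = float("inf")
            if i < n and j < m:
                best = d[i + 1][j + 1] + (0 if a[i] == b[j] else smis)
            if i < n:
                best = min(best, d[i + 1][j] + sspace)
            if j < m:
                best = min(best, d[i][j + 1] + sspace)
            d[i][j] = best
    return d


def sp_align3_pruned(s1, s2, s3, z=float("inf"), smis=1, sspace=1):
    """Forward DP for three strings; returns (optimal SP score, cells expanded)."""
    def pair(a, b):
        if a is None and b is None:
            return 0
        if a is None or b is None:
            return sspace
        return 0 if a == b else smis

    strs = (s1, s2, s3)
    n = tuple(len(s) for s in strs)
    d12 = suffix_dist(s1, s2, smis, sspace)
    d13 = suffix_dist(s1, s3, smis, sspace)
    d23 = suffix_dist(s2, s3, smis, sspace)
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    D = {(0, 0, 0): 0}
    expanded = 0
    # Lexicographic order guarantees every predecessor is final before a cell is used.
    for cell in product(*(range(x + 1) for x in n)):
        if cell not in D:
            continue                      # never reached: all predecessors were pruned
        i, j, k = cell
        if D[cell] + d12[i][j] + d13[i][k] + d23[j][k] > z:
            continue                      # cannot lie on an optimal path
        expanded += 1
        for move in moves:
            nxt = tuple(c + m for c, m in zip(cell, move))
            if any(x > lim for x, lim in zip(nxt, n)):
                continue
            ch = [s[c] if m else None for s, c, m in zip(strs, cell, move)]
            cost = pair(ch[0], ch[1]) + pair(ch[0], ch[2]) + pair(ch[1], ch[2])
            if D[cell] + cost < D.get(nxt, float("inf")):
                D[nxt] = D[cell] + cost   # send the value forward
    return D[n], expanded


print(sp_align3_pruned("AAGAAA", "ATAATG", "CTGGG", z=14))
```

With the default z of infinity nothing is pruned, so comparing the second return value for the two settings shows how many cells the bound saves on a given input.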


27

A bounded-error approximation method for SP alignment

The method is provably fast (it runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment.

Recall that for two strings, D(Si,Sj) is the (optimal) weighted edit distance between Si and Sj.


28

An initial key idea: alignments consistent with a tree

Definition: Let S be a set of strings, and let T be a tree where each node is labeled with a distinct string from S. Then, a multiple alignment M of S is called consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si,Sj) that label adjacent nodes in T.


29

A bounded-error approximation method for SP alignment


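[Figure: a) a tree whose five nodes are labeled with the strings 1: AXZ, 2: AXZ, 3: AXXZ, 4: AYZ and 5: AYXYZ; b) a multiple alignment of those strings that is consistent with the tree:]

3 A X X _ Z
1 A X _ _ Z
2 A _ X _ Z
4 A Y _ _ Z
5 A Y X Y Z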

30

An initial key idea: alignments consistent with a tree

Theorem: For any set of strings S and for any tree T whose nodes are labeled by distinct strings of S, we can efficiently find a multiple alignment M(T) of S that is consistent with T


31

The center star method for SP alignment

We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y.

Definition: A scoring scheme satisfies the triangle inequality if for any three characters x,y and z, s(x,z)≤ s(x,y) + s(y,z).


32


Definition: Given a set of k strings S, define a center string Sc ∈ S as a string in S that minimizes $\sum_{S_j \in S} D(S_c, S_j)$, and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled Sc and with each of the k-1 remaining nodes labeled by a distinct string in S − {Sc}.


33



[Figure: a generic center star for six strings, where the center string Sc is S3.]

34


Definition: Define the multiple alignment Mc of the set of strings S to be the multiple alignment consistent with the center star.

Definition: Define d(Si,Sj) as the score of the pairwise alignment of strings Si and Sj induced by Mc. Denote the score of an alignment M as d(M).

$d(S_i,S_j) \ge D(S_i,S_j)$, $d(M_c) = \sum_{i<j} d(S_i,S_j)$, $d(S_i,S_c) = D(S_i,S_c)$
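One way to build Mc is sketched below in Python: pick the center string, align every other string to it optimally, and merge those pairwise alignments so that a space inserted into the center row is never removed. The names `align_pair` and `center_star`, the unit costs, and the use of `_` as the space character (which therefore must not occur in the input strings) are assumptions made for this example.

```python
def align_pair(a, b, mis=1, gap=1, space="_"):
    """Optimal global alignment of a and b; returns (distance, a_row, b_row)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]) * mis,
                          D[i - 1][j] + gap, D[i][j - 1] + gap)
    i, j, ra, rb = n, m, [], []
    while i or j:                        # trace one optimal path back
        if i and j and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]) * mis:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append(space); i -= 1
        else:
            ra.append(space); rb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def center_star(strings, space="_"):
    """Return (index of the center string, rows of Mc in input order)."""
    k = len(strings)
    dist = [[align_pair(s, t)[0] for t in strings] for s in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))
    aligned = {c: list(strings[c])}
    for idx in range(k):
        if idx == c:
            continue
        _, ca, sa = align_pair(strings[c], strings[idx])
        old, cur = aligned, aligned[c]
        aligned = {i: [] for i in old}
        aligned[idx] = []
        p = q = 0
        while p < len(cur) or q < len(ca):
            cur_gap = p < len(cur) and cur[p] == space
            new_gap = q < len(ca) and ca[q] == space
            if cur_gap and not new_gap:      # old space column: new row gets a space
                for i in old:
                    aligned[i].append(old[i][p])
                aligned[idx].append(space); p += 1
            elif new_gap and not cur_gap:    # new space in center: pad all old rows
                for i in old:
                    aligned[i].append(space)
                aligned[idx].append(sa[q]); q += 1
            else:                            # same center character, or both spaces
                for i in old:
                    aligned[i].append(old[i][p])
                aligned[idx].append(sa[q]); p += 1; q += 1
    return c, ["".join(aligned[i]) for i in range(k)]


center, rows = center_star(["AXZ", "AXZ", "AXXZ", "AYZ", "AYXYZ"])
print(center, *rows, sep="\n")
```

Because each merge only adds spaces to rows that are already in place, the alignment induced between the center and every other string keeps its optimal score D(Si,Sc), which is the property d(Si,Sc)=D(Si,Sc) stated above.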


35


Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings Si and Sj in S, d(Si,Sj) ≤ d(Si,Sc) + d(Sc,Sj) = D(Si,Sc) + D(Sc,Sj)


36


Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(Si,Sj) be the score of the pairwise alignment of strings Si and Sj induced by M*. Then $d(M^*) = \sum_{i<j} d^*(S_i,S_j)$.


37


Theorem: d(Mc)/d(M*) ≤ 2(k-1)/k <2.

Corollary:

$(k/2)\,M \le \sum_{i<j} D(S_i,S_j) \le d(M^*) \le d(M_c) \le [2(k-1)/k] \sum_{i<j} D(S_i,S_j)$.


38

Steiner consensus strings

Definition: Given a set of strings S, and given another string S’, the consensus error of a string S’ relative to S is

$E(S') = \sum_{S_i \in S} D(S', S_i)$. Note that S’ need not be from S.


39


Definition: Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.


40


Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string $\bar{S} \in S$ such that

$E(\bar{S}) / E(S^*) \le 2 - 2/k < 2$

41


Recall that Sc is a string that minimizes $\sum_{S_i \in S} D(S_c, S_i)$ over all strings in S.

Theorem: Assuming that the scoring scheme satisfies the triangle inequality,

E(Sc) / E(S*) ≤ 2 – 2/k < 2
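The consensus error and the choice of Sc are easy to compute directly, as the following Python sketch shows. The names `edit_distance`, `consensus_error` and `center_string` are illustrative, and plain unit-cost edit distance is assumed for D.

```python
def edit_distance(a, b):
    """Unit-cost edit distance D(a, b), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # match or substitution
                           prev[j] + 1,             # x opposite a space
                           cur[j - 1] + 1))         # y opposite a space
        prev = cur
    return prev[-1]


def consensus_error(candidate, strings):
    """E(candidate): summed distance from the candidate to every string in the set."""
    return sum(edit_distance(candidate, s) for s in strings)


def center_string(strings):
    """The member of the set with the smallest consensus error (Sc)."""
    return min(strings, key=lambda c: consensus_error(c, strings))


S = ["AXZ", "AXZ", "AXXZ", "AYZ", "AYXYZ"]
print(center_string(S), consensus_error(center_string(S), S))
print(consensus_error("AZ", S))   # the candidate need not be a member of S
```

Finding the optimal Steiner string S* itself would require searching over all strings, which is why the theorem's guarantee for the easily computed Sc is useful.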


42

Consensus strings from multiple alignment

Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.


43


Definition: The consensus string SM derived from alignment M is the concatenation of the consensus characters for each column of M.


44


Definition: Let M be a multiple alignment of a set of strings S, and let SM be its consensus string containing q characters. Then the alignment error of SM equals $\sum_{i=1}^{q} d(i)$, and the alignment error of M is defined as the alignment error of SM.
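The consensus string and the alignment error can be read off an alignment column by column. The Python sketch below assumes a unit distance between characters (0 if equal, 1 otherwise, with the space treated as an ordinary character) and restricts the candidate consensus characters to those that occur in the alignment; the function name `consensus` is made up for this example.

```python
def consensus(rows, dist=lambda x, y: 0 if x == y else 1, alphabet=None):
    """Return (consensus string SM, alignment error) for a multiple alignment."""
    if alphabet is None:
        alphabet = sorted(set("".join(rows)))    # assumed candidate characters
    chars, error = [], 0
    for column in zip(*rows):
        # Consensus character: minimizes the summed distance to the column.
        best = min(alphabet, key=lambda c: sum(dist(c, x) for x in column))
        chars.append(best)
        error += sum(dist(best, x) for x in column)   # this is d(i)
    return "".join(chars), error


print(consensus(["AAGAA_A", "AT_AATG", "CTG_G_G"]))
```

Ties between equally good characters are broken by alphabet order here, which is an arbitrary choice.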

45


Definition: The optimal consensus multiple alignment is a multiple alignment M for input set S whose consensus string SM has smallest alignment error over all possible multiple alignments of S


46


Definition: Given set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of S ∪ {S*} consistent with T is said to be consistent with S*.


47


Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then, removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.


48

Approximating the optimal consensus multiple alignment

Theorem: Assuming the triangle inequality, the multiple alignment Mc created by the center star method has an SP score that is never more than 2 – 2/k times the SP score of the optimal SP alignment, and it has a (consensus) alignment error that is never more than 2 – 2/k times the alignment error of the optimal consensus multiple alignment.


49

Multiple alignment to a (phylogenetic) tree

Definition: Given an input tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S.


50


Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.


51


The phylogenetic alignment problem for T: Find an assignment of strings to internal nodes of T (one string to each node) that minimizes the distance of the alignment.

The consensus alignment problem is a special case of the phylogenetic alignment problem (i.e., when tree T is a star).


52

A heuristic for phylogenetic alignment

Definition: A phylogenetic alignment is called a lifted alignment if for every internal node V, the string assigned to V is also assigned to one of V’s children.

We will show that the best lifted alignment in T has a total distance less than twice that of the optimal phylogenetic alignment.


53

A heuristic for phylogenetic alignment


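[Figure: example of a lifted alignment on a tree with leaf strings S1, ..., S8; each internal node carries the string of one of its children.]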

54

The transformation creating $T^L$

We will construct the lifted alignment $T^L$ out of T*, which is the optimal phylogenetic alignment.

Definition: We say a node has been lifted after it has been labeled by a string in the leaf set S.

Let Sv* be the string labeling internal node V in T*, and let S1, S2, …, Sk be the strings labeling V's children. We lift Sj to V if D(Sv*,Sj) ≤ D(Sv*,Si) for every i from 1 to k.

55

The transformation creating $T^L$

[Figure: the lifting operation at node V, whose children are labeled S1, S2, S3 and S4. V is labeled Sv* before the lift and S3 after it.]

The numbers on the edges are the distances from Sv* to the lifted strings labeling its children. Note that after the lift, one edge will have zero distance.

56

The error analysis

Theorem: The lifted alignment $T^L$ has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T.

57

Computing the minimum distance lifted alignment

The best lifted alignment is computed by dynamic programming.

Definition: Let Tv be the subtree of T rooted at node V. Let d(V,S) denote the distance of the best lifted alignment of Tv under the requirement that string S is assigned to node V (assuming of course that S is a string at a leaf of Tv).


58


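We start with the assumption that all the leaves have already been processed.

Notation: S’ is a string written at a leaf, and V’ is a child of V.

If V is a node all of whose children are leaves, then

$d(V,S) = \sum_{S'} D(S,S')$, where the sum is over the strings S’ at the children of V.

For a general internal node V, the dynamic programming recurrence is

$d(V,S) = \sum_{V'} \min_{S'} [\, D(S,S') + d(V',S') \,]$, where the sum is over the children V’ of V and the minimum is over the strings S’ at the leaves of the subtree rooted at V’.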
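The recurrence for d(V,S) can be evaluated bottom-up over the tree. The following Python sketch follows the recurrence as written above and returns only the minimum distance; the tree encoding (a `children` dict for internal nodes and a `leaf_string` dict for leaves), the node names, and the unit-cost edit distance used for D are all assumptions made for the example.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, standing in for D(S, S')."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def best_lifted_distance(children, leaf_string, root):
    """Minimum over S of d(root, S), following the recurrence for d(V, S)."""
    def table(v):
        # Returns {S: d(v, S)} for every string S at a leaf of the subtree of v.
        if v in leaf_string:
            return {leaf_string[v]: 0}
        child_tables = [table(c) for c in children[v]]
        candidates = set().union(*child_tables)
        return {
            S: sum(min(edit_distance(S, S2) + d for S2, d in t.items())
                   for t in child_tables)
            for S in candidates
        }

    return min(table(root).values())


# Hypothetical tree: root r has an internal child u and a leaf c.
children = {"r": ["u", "c"], "u": ["a", "b"]}
leaf_string = {"a": "AXZ", "b": "AXXZ", "c": "AYZ"}
print(best_lifted_distance(children, leaf_string, "r"))
```

For a leaf the table holds the single entry d = 0, so a node whose children are all leaves reduces to the first formula above. Recording which S’ attains each minimum would recover the lifted labeling itself.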

59


Theorem: The optimal lifted alignment can be computed in polynomial time as a function of size of the tree and the lengths of the input strings.


60

Iterative pairwise alignment

The target is to iteratively merge two multiple alignments of two subsets of strings into a single multiple alignment of the union of those subsets.

As an example we will explain the average linkage method, which is also known as UPGMA, for “Unweighted Pair-Group Method using arithmetic Averages”. At each merge step, the new multiple alignment could be created by aligning some representation of the two smaller alignments (for example, by aligning profiles or consensus sequences).
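To show how average linkage fixes the order of merges, here is a minimal Python sketch of UPGMA clustering on a distance matrix. It only determines which subsets are merged and in what order; the alignment step at each merge is left out. The function name `upgma_merges` is invented, and the distances are the globin values from the pairwise distance matrix that appears later in these slides.

```python
from itertools import combinations


def upgma_merges(names, lower):
    """Return the merges as (cluster A, cluster B, average distance) tuples.

    lower[i][j] (j < i) is the distance between items i and j.
    """
    def d(i, j):
        return lower[max(i, j)][min(i, j)]

    def average(pair):
        a, b = pair
        # Unweighted arithmetic average over all cross-cluster pairs.
        return sum(d(i, j) for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average)
        merges.append((tuple(names[i] for i in a),
                       tuple(names[i] for i in b),
                       average((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse",
         "Myg_Phyca", "Glb5_Petma", "Lgb2_Luplu"]
lower = [[],
         [.17],
         [.59, .60],
         [.59, .59, .13],
         [.77, .77, .75, .75],
         [.81, .82, .73, .74, .80],
         [.87, .86, .86, .88, .93, .90]]
for a, b, dist in upgma_merges(names, lower):
    print(f"{dist:.4f}", a, "+", b)
```

The sequence of merges printed here is exactly the binary tree of merges described for iterative alignment methods.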


61


Multiple alignments serve to characterize protein families and to identify important molecular structures, but…

Doolittle: “ ….what we’re really interested in is a historical alignment. The historical alignment ought to reflect, as accurately as possible, the series of divergences that led to the contemporary sequences…..”


62


Iterative alignment methods determine a sequence of merges of disjoint subsets of strings. Hence the history of those merges can be described by a binary tree T. Each leaf of T represents a single string from the input set, and each node of T specifies a merge of the strings found at the leaves of its subtree. Each node also represents a multiple alignment created by the merge at that node.


63

Progressive alignment

A pair of strings with minimum edit distance (or greatest similarity) is likely obtained from the pair of taxa that has most recently diverged.

Any spaces (gaps) that appear in the optimal pairwise alignment of those two strings are preserved throughout the entire sequence of successive merges.


64


The progressive alignment method is explicitly aimed at building an evolutionary tree from molecular data while simultaneously constructing an evolutionarily informative multiple alignment.


65

Improvements to progressive alignment

Sequence weighting – the weights are normalized such that the biggest one is set to 1. Closely related sequences receive lowered weights. Highly divergent sequences receive high weights.

Initial gap penalties – a gap opening penalty (GOP) is given for every gap, and gap extension penalty (GEP) gives the cost of every space in the gap.
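The gap cost described here, one GOP per gap plus one GEP per space in it, can be written out as a tiny Python helper. The function name `gap_penalty`, the use of `_` for a space, and the numeric penalties in the usage line are hypothetical; they are not CLUSTAL W defaults.

```python
import re


def gap_penalty(row, gop, gep, space="_"):
    """Total penalty of all gaps in one aligned row.

    Each maximal run of spaces is one gap and costs gop + gep * (run length).
    """
    gaps = re.findall(re.escape(space) + "+", row)
    return sum(gop + gep * len(g) for g in gaps)


# Two gaps, of lengths 2 and 1: (10 + 2*0.5) + (10 + 1*0.5)
print(gap_penalty("AC__G_T", gop=10, gep=0.5))
```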


66

Improvements to progressive alignment

Weight matrices – Two main series of weight matrices are offered to the user: Dayhoff PAM, BLOSUM.

Divergent sequences – The most divergent sequences are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first.


67



                  1     2     3     4     5     6
Hbb_Human    1    -
Hbb_Horse    2   .17    -
Hba_Human    3   .59   .60    -
Hba_Horse    4   .59   .59   .13    -
Myg_Phyca    5   .77   .77   .75   .75    -
Glb5_Petma   6   .81   .82   .73   .74   .80    -
Lgb2_Luplu   7   .87   .86   .86   .88   .93   .90

Pairwise alignment: calculate distance matrix

68



[Figure: unrooted neighbor-joining tree of the seven sequences Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]

69



[Figure: rooted NJ tree (guide tree) and sequence weights for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]

Progressive alignment: align following the guide tree

70

Repeated-motif methods

The second major approach used in multiple alignment methods.

Definition: a motif is a substring or a small subsequence that is common to many of the strings in the set.

“width” refers to the length of the motif, and “multiplicity” refers to the number of strings that it appears in.


71


Repeated-motif method general algorithm:

1. Find a “good” motif (wide and with high multiplicity).

2. The strings containing it are shifted so that the occurrences of the motif are aligned with each other.

3. The problem divides into two subproblems, one for the substrings on each side of the motif.


72


4. Continue this recursion until no sufficiently wide or high motif is found.

5. The remaining sub problems can be solved by iterative alignment methods.

6. Strings that did not contain the first good motif are aligned separately.

7. Finally, the two alignments are merged.
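Step 1 of the algorithm above can be illustrated with a deliberately simplified Python sketch: it treats a motif as an exact substring of a fixed width and picks the one with the highest multiplicity. Real motif finders allow inexact matches and subsequences; the function name `best_motif` and the example strings are made up.

```python
from collections import defaultdict


def best_motif(strings, width):
    """Return (motif, indices of the strings containing it) for a fixed width.

    Assumes at least one string has length >= width.
    """
    where = defaultdict(set)
    for idx, s in enumerate(strings):
        for p in range(len(s) - width + 1):
            where[s[p:p + width]].add(idx)
    # Highest multiplicity first; ties broken by the motif text for determinism.
    motif = max(where, key=lambda m: (len(where[m]), m))
    return motif, sorted(where[motif])


strings = ["xxACGTyy", "ACGTzz", "qqACGA", "tttt"]
motif, members = best_motif(strings, 4)
print(motif, members)
# Offsets by which the member strings would be shifted to line the motif up.
print([strings[i].index(motif) for i in members])
```

The strings outside `members` correspond to step 6: they do not contain the motif and would be aligned separately.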

73

Summary


The importance of multiple string alignments in molecular biology.

CLUSTAL W

Family representation

How to score multiple alignments

The center star method for SP alignment

Consensus strings

Approximating the optimal consensus multiple alignment

Iterative pairwise alignment

Progressive alignment and contemporary improvements

Repeated-motif methods

74

Bibliography


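Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.

Thompson, J. D., Higgins, D. G., Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994, Vol. 22, No. 22. Oxford University Press.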
