
On Near-Optimal Alignments of Biological Sequences

DALIT NAOR² and DOUGLAS L. BRUTLAG¹

ABSTRACT

A near-optimal alignment between a pair of sequences is an alignment whose score lies within the neighborhood of the optimal score. We present an efficient method for representing all alignments whose score is within any given delta from the optimal score. The representation is a compact graph that makes it easy to impose additional biological constraints and select one desirable alignment from the large set of alignments. We study the combinatorial nature of near-optimal alignments, and define a set of "canonical" near-optimal alignments. We then show how to enumerate near-optimal alignments efficiently in order of their score, and count their number. When applied to comparisons of two distantly related proteins, near-optimal alignments reveal that the most conserved regions among the near-optimal alignments are the highly structured regions in the proteins. We also show that by counting the number of near-optimal alignments as a function of the distance from the optimal score, we can select a good set of parameters that best constrains the biologically relevant alignments.

Key words: pairwise sequence alignment; suboptimal alignments; near-optimal alignments; dynamic programming; edit distance

INTRODUCTION

IT IS WIDELY ACCEPTED that the optimal alignment between a pair of proteins or nucleic acid sequences that minimizes the edit distance may not necessarily reflect the correct biological (evolutionary or structural) alignment. Alignments of proteins based on their structures or of DNA sequences based on evolutionary changes are often different from alignments that minimize edit distance. However, in many cases (e.g., when the sequences are very similar), the edit distance alignment is a good approximation to the biological one.

Since, for most sequences, the true alignment is unknown, a method that either assesses the significance of the optimal alignment, or that provides a few "close" alternatives to the optimal one, is of great importance. A near-optimal alignment is an alignment whose score lies within the neighborhood of the optimal score. Enumeration of near-optimal alignments (Waterman, 1983; Waterman and Byers, 1985) is not very practical because there are many such alignments. Other approaches (Vingron and Argos, 1989; Vingron and Argos, 1990; Zuker, 1991) that use only partial information about near-optimal alignments are more successful in practice.

¹Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305-5307.
²Present address: Department of Computer Science, School of Mathematics, Tel-Aviv University, Tel Aviv, 69978 Israel.


We present a method for representing all alignments whose score is within any given delta from the optimal score. It represents a large number of alignments by a compact graph, which makes it easy to impose additional biological constraints and select one desirable alignment from this large set. We study the combinatorial nature of near-optimal alignments, and define a set of "canonical" near-optimal alignments. We then show how to enumerate near-optimal alignments efficiently in order of their score, and count their numbers.

These graphic representations can show, in a dramatic way, all the possible alignments without the need to enumerate them. When applied to comparisons of two distantly related proteins, they reveal that the most highly conserved regions among all the near-optimal alignments are the highly structured regions (α-helices and β-strands), and that the loops are the least conserved. This phenomenon is best exemplified by the graphic representation of near-optimal alignments between the heavy and the light chain of immunoglobulins. We have demonstrated that the regions of maximum ambiguity among the near-optimal alignments between these two sequences are their hypervariable regions, or loops.

Although the number of near-optimal alignments grows exponentially, even when closely related sequences are compared, for each pair of sequences, a specific log-odds amino acid replacement matrix (PAM matrix) shows the fewest near-optimal alignments. This specific PAM matrix correlates with the matrix that gives the optimal score in the sequence alignment between two sequences. This observation may provide a correlation between parametric approaches to sequence alignment that optimize a score, and our graphic method, which minimizes the number of near-optimal alignments.

Since alignments are essentially paths in a directed acyclic graph with (possibly negative) weights on its edges, our solution gives an extremely simple method to enumerate all K shortest (or longest) paths from s to t in such graphs in increasing order, as well as all (s, t) paths that are within $\delta$ of the optimum, for any $\delta$. Our solution is as good, in terms of running time, as the previously known solutions to the K-best shortest paths in such graphs, but improves the space requirement by a factor of m when K > n. It also improves the time and space needed to count the number of near-optimal paths.

MOTIVATION

Protein and nucleic acid sequences are regularly compared against one another, because it is assumed that sequences with similar biological function have similar physical characteristics, and therefore are similar, or homologous, at the amino or nucleic acid sequence level (Doolittle, 1986, 1990). Since a good similarity measure between sequences that predicts the structural or evolutionary similarity is not yet known, biologists resort to the classical string comparison methods (Sellers, 1974; Sankoff and Kruskal, 1983). The most commonly used measure of similarity between strings is the well-known edit distance measure, which is the minimum-weight set of edit operations (substitutions or insertions/deletions of gaps) that are needed to transform one sequence to another. The minimum edit distance alignment can also be expressed as the best scored alignment; throughout the paper we use the maximization terminology. Biological constraints are imposed via the weights that are assigned to the different edit operations (i.e., the PAM matrix for proteins). It is still controversial how to correctly derive these weights (see Gonnet et al., 1992; Henikoff and Henikoff, 1992; Jones et al., 1992). With any set of weights, the edit distance measure optimizes a single simple objective function, and there is no evidence that this function is being optimized in nature.

Despite these limitations, this technique has become the method of choice in molecular biology for aligning sets of sequences. The main reason is that structural or evolutionary alignments are very hard to find (Lesk and Chothia, 1980; Chothia and Lesk, 1986, 1987); they involve the determination of the three-dimensional structures of molecules [via crystallography, nuclear magnetic resonance (NMR) techniques or homology modeling], which are extremely labor-intensive tasks. String comparison techniques, on the other hand, are much faster and in many cases have proven to be good predictors of the correct alignment. However, lacking structural data, it is difficult to assess the significance of an alignment that was obtained by a purely computational method.

In summary, the edit distance measure is a well-defined, rigorous combinatorial measure that can be efficiently optimized, and it can be justified evolutionarily (by a set of parameters, i.e., the PAM matrix, whose values are determined empirically). However, the edit distance does not always yield the alignment that reflects the evolutionary steps when optimized. Therefore, it is useful to explore a larger set of solutions intelligently.


One approach is the parametric approach, which looks for different optimal solutions obtained for different sets of parameters (Fitch and Smith, 1983; Gusfield et al., 1992; Waterman et al., 1992). The second is to consider a larger, but still manageable, set of alignments that are in the vicinity of the optimum. These are "Near-Optimal (or suboptimal) Alignments" (Waterman, 1983; Waterman and Byers, 1985; Vingron and Argos, 1990; Zuker, 1991).

SUMMARY OF RESULTS

In this paper we study the combinatorial nature of near-optimal alignments under a simple scoring function. For every $\Delta$, we define the minimum set of edges $E_\Delta$ in the edit distance graph that includes every near-optimal alignment in the $\Delta$-neighborhood of the optimal score, and then show that this set of edges is exactly the union of certain types of alignments that we call "canonical" alignments. We argue that these alignments can be viewed as a canonical set for all near-optimal alignments, since any near-optimal alignment is a "combination" of a few canonical ones. We denote by $\Delta_0 < \Delta_1 < \Delta_2 < \cdots$ the maximal sequence of values such that

$$E_{\Delta_0} \subset E_{\Delta_1} \subset E_{\Delta_2} \subset \cdots$$

and show that the information revealed by this set of graphs is much richer than merely a single optimal alignment, or a list of a few near-optimal ones. It allows a graphical view of:

• near-optimal alignments with almost optimal scores that are substantially different from the optimal alignment.

• the regions that are common to many near-optimal alignments, which may indicate that these regions are significant (Vingron and Argos, 1990; Vingron, 1991; Zuker, 1991).

• the best alignment that satisfies additional biological constraints.

We developed a program, called K-BEST (Naor and Brutlag, 1994), that computes and displays these graphs, and counts the number of $\delta$-near-optimal alignments for each $\delta$.

We also define a transformation on the weights of the edit distance graph which allows us to efficiently count and also output (ordered or unordered) near-optimal alignments. Let n and m be the lengths of the sequences, m ≥ n. Given the transformed weights, the next best alignment can be output in O(m) time. If alignments are to be enumerated in increasing order, then the space requirement is O(K + nm) (the best previously known method required O(Km + m) space; Myers, 1987); otherwise, O(mn) space suffices (Chao (1994) showed that in fact a linear amount of space suffices). We also show how to count efficiently the number of near-optimal alignments rather than enumerate them. This method is extremely simple. It requires only one additional edit distance computation between the reversed strings. Computing the edit distance between the sequences, and between the reversed sequences, was used in the algorithms of Katoh et al. (1982), Waterman (1983), Waterman and Byers (1985), Vingron and Argos (1990), Vingron (1991), and Zuker (1991). This information is sufficient to represent canonical as well as noncanonical near-optimal alignments and can be obtained in O(nm) time and space.

EXAMPLES

To motivate this problem, we show three examples. The first demonstrates that the biologically correct alignment is not necessarily the optimal one. The second example shows how a compact representation of a set of alignments between a single pair of sequences may reveal a wealth of information that is traditionally obtained from a multiple alignment of a large set of sequences. In the last example, we show that the number of near-optimal alignments as a function of $\delta$ can serve as a good discriminator between random and biological sequences, and can also reveal which set of parameters (e.g., PAM values) best selects the biologically relevant alignments. In these examples an alignment is represented by a path in a grid graph that starts at the upper-left corner and ends at the bottom-right corner. In this path, a diagonal edge corresponds to a pair of aligned characters, and a horizontal or a vertical line corresponds to a gap in one of the two sequences.
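The correspondence between grid paths and alignments can be made concrete with a short sketch. The Python function below turns a path, given as a list of grid nodes, into the two gapped rows of an alignment. The name `path_to_alignment`, the `-` symbol for a space, and the convention that the first coordinate indexes the first sequence are assumptions chosen for this illustration; they are not taken from the paper.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def path_to_alignment(A: str, B: str, path: List[Node]) -> Tuple[str, str]:
    """Convert a grid path from (0, 0) to (len(A), len(B)) into two gapped rows.

    Assumed convention (hypothetical, for illustration only): the first
    coordinate counts consumed characters of A, the second those of B.
    """
    top: List[str] = []
    bottom: List[str] = []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):      # diagonal edge: a pair of aligned characters
            top.append(A[i0])
            bottom.append(B[j0])
        elif step == (1, 0):    # advance in A only: gap in B
            top.append(A[i0])
            bottom.append("-")
        elif step == (0, 1):    # advance in B only: gap in A
            top.append("-")
            bottom.append(B[j0])
        else:
            raise ValueError(f"illegal step {step} in path")
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    rows = path_to_alignment("AC", "AC", [(0, 0), (1, 0), (1, 1), (2, 2)])
    print(rows[0])
    print(rows[1])
```

Because no step leaves both coordinates unchanged, the two rows never place a space above another space, which matches the definition of an alignment used later in the text.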

To demonstrate how misleading the optimality criterion can be, consider the example in Fig. 1. Here, two 25-amino-acid-long sequences are compared against each other. The two sequences are substrings of longer leucine zippers, in which the amino acid leucine (L) appears periodically every seven positions. In this example, periodic leucines appear at the 4th, 11th, 18th, and 25th positions of the two sequences, so we expect a good alignment to pick up this pattern. Hence, we define a biologically "correct" alignment as one that coincides with the periodic leucines at the 4th, 11th, 18th, and 25th positions.

When the two sequences were compared (with PAM 80 and a gap score of −1), the best score was 43. All alignments with this optimal score are represented as paths in the top part of Fig. 1. Note that no optimal alignment aligns the periodic leucines, but instead aligns other leucines in the sequence. The bottom part of Fig. 1 shows that when the score drops by 2, another large set of paths that are closer to the diagonal is introduced, among them several alignments that align the periodic leucines.

[Figure 1: two panels of alignment paths; sequence label recoverable from both panels: MYRLRDRLRLLPVEVRRLDIFNLIL]

FIG. 1. Aligning two leucine zippers. The top figure depicts the set of optimal alignments. All optimal alignments are paths that proceed monotonically from the upper left corner to the lower right in $E_0$. The bottom figure depicts the set of near-optimal alignments within $\Delta = 2$ of the optimum. Note that the "correct" alignments, the ones that align the periodic leucines, are shown in bold in $E_2$.


In the second example, the variable regions of the heavy and the light chain of a human immunoglobulin Fab (the first 110 amino acids of PDB sequence 2FB4_H (accession #P01772 in Swiss-Prot), and 2FB4_L) have been aligned (Figs. 2 and 3). These two sequences contain only about 23% sequence identity, yet they both fold into a similar structure, the variable structural domain, which is composed of nine beta-strands that form two sheets. The graph that represents optimal alignments reveals that all optimal alignments agree on a few well-aligned regions (diagonals). This phenomenon is reinforced by the consideration of near-optimal alignments (for $\delta$ = 0, 1, 2), as they all share the same well-aligned regions, and introduce more alternatives at the remaining parts of the alignment. The fact that these two sequences are conserved in three regions and are variable between them is the well-known fact of hypervariable regions in immunoglobulins (Wu and Kabat, 1970). Hence, the hypervariable regions which were discovered in the 1970s by aligning multiple immunoglobulin sequences can be obtained from this single pair of sequences and their near-optimal alignments. Since the structure of the Human Fab has been determined, we located the turns that correspond to the hypervariable regions (CDR1, between strands 2 and 3, CDR2 between strands 3b and 3c, and CDR3 between strands 6 and 7), and correlated them with those observed in the near-optimal graphs. In the heavy chain, the turns occur at positions 23-33, 51-58, and 100-104D, and the corresponding regions revealed by the near-optimal graphs are 24-29, 47-60, 77-79, and 97-106. In the light chain, the turns occur at positions 25-32, 52-60, and 91-95, and the corresponding regions revealed by the near-optimal graphs are 24-37A, 44-53, and 87-105.

[Figure 2: alignment-path plot; both axes run from 0 to 120 in steps of 10.]

FIG. 2. Aligning the variable region of a heavy and a light chain of human immunoglobulins. The graph $E_0$ depicts the set of optimal alignments between the variable regions of a heavy and a light chain of a human immunoglobulin. An optimal alignment is represented by a path in $E_0$.

[Figure 3: alignment-path plot; horizontal axis runs from 0 to 120 in steps of 10.]

FIG. 3. Aligning the variable region of a heavy and a light chain of human immunoglobulins. The graph $E_2$ depicts the set of near-optimal alignments within $\Delta = 2$ of the optimum between the variable regions of a heavy and a light chain of a human immunoglobulin. It shows in a dramatic way the hypervariable regions.

In the third example, we aligned the human α-hemoglobin chain (accession #P01922 in Swiss-Prot) to the human β-hemoglobin chain (accession #P02023 in Swiss-Prot) and counted the number of near-optimal alignments that are within 0, 1, 2, ... of the optimal score. We aligned the sequences using different PAM values (PAM 40, 50, 100, 150, 200, and 250) and compared the results with those obtained by aligning the two randomized chains with each other. The results (on a log scale) are shown in Fig. 4. The number of near-optimal alignments grows exponentially in all cases. However, there is a clear distinction between the behavior of the random sequences and the biological sequences, both in the starting points of the curves (i.e., number of optimal alignments), and in their rate of growth. The PAM 100 matrix permits the fewest near-optimal alignments, suggesting that it imposes the maximum biological constraint on this pair of sequences. Therefore, the PAM 100 matrix best discriminates the biologically relevant alignments from the rest.
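To make the quantity plotted in this example tangible, here is a deliberately naive Python sketch that counts, for each distance d, the alignments whose score is within d of the optimum. It is not the counting method of the paper (which relies on a transformation of the edge weights); it simply tabulates, for every grid node, how many paths reach it with each possible score, which is only practical for short sequences. The function names and the +3/−2 weights with gap −2 are assumptions made for this illustration.

```python
from collections import Counter
from typing import Callable, List


def simple_weight(a: str, b: str) -> int:
    """Hypothetical substitution weights: +3 for a match, -2 for a mismatch."""
    return 3 if a == b else -2


def count_near_optimal(A: str, B: str, max_delta: int, gap: int = -2,
                       w: Callable[[str, str], int] = simple_weight) -> List[int]:
    """Return counts[d] = number of alignments scoring within d of the optimum.

    Naive illustration: each node keeps a Counter {score: number of paths}.
    """
    m, n = len(A), len(B)
    prev: List[Counter] = []
    for i in range(m + 1):
        row: List[Counter] = []
        for j in range(n + 1):
            c: Counter = Counter()
            if i == 0 and j == 0:
                c[0] = 1                      # the empty path at the source
            if i > 0:                         # vertical edge (gap)
                for s, k in prev[j].items():
                    c[s + gap] += k
            if j > 0:                         # horizontal edge (gap)
                for s, k in row[j - 1].items():
                    c[s + gap] += k
            if i > 0 and j > 0:               # diagonal edge (aligned pair)
                wab = w(A[i - 1], B[j - 1])
                for s, k in prev[j - 1].items():
                    c[s + wab] += k
            row.append(c)
        prev = row
    final = prev[n]
    best = max(final)                         # optimal score D(m, n)
    return [sum(k for s, k in final.items() if best - s <= d)
            for d in range(max_delta + 1)]


if __name__ == "__main__":
    print(count_near_optimal("AC", "AC", 14))
```

The returned list is cumulative, so it is non-decreasing in d, and its first entry is the number of optimal alignments, i.e., the starting point of a curve such as those in Fig. 4.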

PROBLEM DEFINITION

Let A and B be two sequences (over a finite alphabet) of lengths m and n respectively, m ≥ n. Let $A_i$ ($B_i$) denote the first i characters in A (B). An alignment between A and B introduces spaces into the sequences such that the lengths of the two resulting sequences are identical, and places these spaced sequences one upon the other so that no space is ever above another space. The length of an alignment is between m and n + m.

[Figure 4: log-scale plot against "Distance from Optimal Score"; curves labeled PAM 40, PAM 50, PAM 100, PAM 150, PAM 200, and PAM 250, each also shown for the randomized sequences (RAND).]

FIG. 4. Near-optimal alignments of human α-hemoglobin with human β-hemoglobin as a function of the PAM matrix. The five lower curves show the number of near-optimal alignments of the α- and β-hemoglobin chains as a function of the PAM matrix. The upper five curves show the same numbers for randomized hemoglobin sequences. PAM matrices were calculated using the NCBI PAM program (available by anonymous ftp from ncbi.nlm.nih.gov). The optimal scores for each alignment of the normal sequences are: PAM 40 = 291, PAM 50 = 305, PAM 100 = 324, PAM 150 = 289, PAM 200 = 412, PAM 250 = 349, and for the randomized sequences the optimized scores are: PAM 40 = −65, PAM 50 = −55, PAM 100 = 3, PAM 150 = 8, PAM 200 = 81, PAM 250 = 72. The gap penalty in each case was 4.

A column that contains two identical characters is called a match and is assigned a positive weight $w_{aa}$. A column that contains two different characters a and b is called a mismatch, and is assigned a weight $w_{ab}$. A column that contains a space is called a gap, and is assigned a small, generally negative, gap weight (gap). We define the score of an alignment as the sum over all weights assigned to its columns. The edit distance problem is to find the alignment that maximizes its score.

If D(i, j) is the score of the best alignment between $A_i$ and $B_j$, and if the ith and jth characters of A and B are a and b, respectively, then the following dynamic programming formulation will correctly compute D(i, j) (Sankoff and Kruskal, 1983).

$$D(i, j) = \max\{\, D(i-1, j) + \mathrm{gap},\;\; D(i-1, j-1) + w_{ab},\;\; D(i, j-1) + \mathrm{gap} \,\}$$

All values D(i, j) can be computed in O(nm) time, as well as D(m, n), which is the desired score of the optimal alignment. Also, all optimal alignments can be found by backtracking through the matrix of the D(i, j) values.
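The recurrence above can be sketched in a few lines of Python. The function name `alignment_table`, the boundary initialization D(i, 0) = i·gap and D(0, j) = j·gap, and the +3/−2 weights with gap −2 are assumptions chosen for this illustration; the paper's experiments use PAM matrices instead.

```python
from typing import Callable, List


def simple_weight(a: str, b: str) -> int:
    """Hypothetical substitution weights: +3 for a match, -2 for a mismatch."""
    return 3 if a == b else -2


def alignment_table(A: str, B: str, gap: int = -2,
                    w: Callable[[str, str], int] = simple_weight) -> List[List[int]]:
    """Fill D, where D[i][j] is the best score of aligning A[:i] with B[:j]."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary cells: a prefix aligned against the empty string is all gaps.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    # Interior cells: maximum over the three incoming edges.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] + gap,
                          D[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                          D[i][j - 1] + gap)
    return D


if __name__ == "__main__":
    D = alignment_table("ACGT", "AGT")
    print(D[4][3])  # optimal score D(m, n)
```

The two nested loops visit each of the (m + 1)(n + 1) cells once and do constant work per cell, which is where the O(nm) bound comes from.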

Note that there are many possible definitions for the alignment score; we only use scoring functions with the property that $D(m, n) - D(i, j)$ is the score of the optimal alignment between the last m − i characters from A and the last n − j characters from B.

This problem can be viewed as a longest path problem in a directed acyclic graph G with nm nodes and 3nm edges. A node in G corresponds to some cell (i, j) in an (m × n) table, and each node (i, j) has three edges coming into it from adjacent cells (i − 1, j), (i − 1, j − 1), (i, j − 1). The weight of a horizontal or a vertical edge is gap, whereas the weight of a diagonal edge is $w_{ab}$, where a and b are the ith and jth characters of A and B, respectively. This graph is called "the edit distance graph of A and B." If node s corresponds to the cell (0, 0), and node t corresponds to the cell (m, n), then there is a 1 : 1 correspondence between alignments and paths from s to t in G. Hence, near-optimal alignments are essentially near-optimal paths in this simple graph.

By the graph analog of the problem, it is well known that all optimal alignments between A and B (that is, optimal paths from s to t) have a compact representation: it is the backtrace graph, which can be constructed in O(nm) time. Every (s, t) path in the backtrace graph is an optimal alignment and vice versa. (This property does not directly carry over for other definitions of scores, e.g., affine gap weights; Gotoh, 1982; Altschul and Erickson, 1986.) However, no such elegant representation exists for near-optimal paths, and therefore for near-optimal alignments. Throughout the paper we use the graph notation, where a node (i, j) is simply denoted by u; s is the node (0, 0) and t is (m, n). An edge, directed from u to v, is denoted by (u, v) or more simply by the letter e.

PREVIOUS WORK

The problem of enumerating all K-best shortest paths (or longest paths in a DAG) in increasing order was addressed very early (in the 1960s and 1970s) in the context of general graphs, and it is a well-studied problem in combinatorial optimization. In general, the problem may or may not allow paths with loops. We do not attempt to give a complete overview of all known solutions to this problem, but rather to summarize some relevant ideas that have been suggested. Three main approaches have been suggested. The first is an iterative procedure that generalizes the Bellman-Ford algorithm for optimal paths, suggested by Bellman and Kalaba (1960) and Dreyfus (1969) and improved by Fox (1973) and Lawler (1976). The second employs a general scheme for enumerating the K best objects (Pollack, 1961; Yen, 1971; Lawler, 1972, 1976). The third exploits the structural relations between near-optimal paths. It is based on the observation of Hoffman and Pavley (1959) that the kth-best path P is a "deviation" of some jth-best path Q, j < k, i.e., there is an edge e = (u, v) on P such that P's segment up to u is the same as Q, $e \notin Q$, and the portion of P from v to t is an optimal path from v to t. The algorithms of Hoffman and Pavley (1959), Clarke et al. (1963), Yen (1971), Katoh et al. (1982), Perko (1986), and Myers (1987) all show how to store the k − 1 best paths efficiently so that the kth best, which must be a deviation of one of them, can be found easily. None of these approaches requires the knowledge of K in advance; there is yet another approach, studied by Shier (1979), which relies on a known K.

The edit distance graph is a very regular graph. It is a directed (acyclic) grid of degree 3, and its edge weights may be negative. For this type of graph, the algorithm of Bellman and Kalaba (1960) and Dreyfus (1969) as implemented by Fox (1973) and Lawler (1976) enumerates the K-best near-optimal paths in O(mn) time per path, where K need not be specified in advance. In fact, it finds the K-best near-optimal paths from any node in the graph to t. The space required is O(Kmn). The algorithm of Yen (1971) and Lawler (1972) can be adapted to generate the next best path in O(m) time, and O(Km + nm) space. The algorithm of Myers (1987) requires O(m) time to generate the next path, and O(Km + m) space.

A more relaxed goal is suggested in Waterman (1983) and Waterman and Byers (1985), which is to enumerate all shortest paths (not necessarily in order) that are within $\delta$ of the optimum, where $\delta$ is known in advance. For this problem, each path can be enumerated in O(m) time; moreover, only O(nm) space is needed. The requirement to know $\delta$ in advance is a limitation, because it is not possible to know a priori at which level of significance the interesting alignment lies. These enumeration methods (of K-best or $\delta$-near-optimal) turn out not to be practical in the biological application, since the number of near-optimal paths grows very fast, and explicit listing of them provides too much information. Another variant, considered in Waterman and Eggert (1987) and used in Schoniger and Waterman (1992), is to list all K-best non-overlapping local alignments. This approach was found to be very useful because it eliminates many uninteresting paths.

Realizing that explicit enumeration is not the desired representation, both Zuker (1991) and Vingron and Argos (1990) suggested building (m × n) 0-1 (i.e., dot) matrices S, T that store partial information about all $\delta$-near-optimal alignments. S and T are easily computable in O(nm) time, but in both methods $\delta$ needs to be specified in advance. T(i, j) = 1 if there is some $\delta$-near-optimal alignment in which the ith character of A and the jth character of B are aligned, and S(i, j) = 1 if all $\delta$-near-optimal alignments align the ith character of A with the jth character of B, thus defining what they refer to as "reliably aligned regions." These matrices are then used to assess significance of optimal alignments (Vingron and Argos, 1990; Zuker, 1991), to detect alternatives to the optimal one and to construct multiple alignments (Vingron, 1991). This representation, although powerful, loses connectivity information because, clearly, not every legal path through the dots of the dot matrices is a $\delta$-near-optimal alignment.
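A minimal sketch of the T matrix, under stated assumptions, may help here. The ith character of A is aligned with the jth character of B in some $\delta$-near-optimal alignment exactly when the best path through the corresponding diagonal edge scores within $\delta$ of the optimum, so T can be filled from a forward table and a table computed on the reversed strings. The function names and the +3/−2 weights with gap −2 are hypothetical, and this is an illustration of the definition rather than the implementation of Zuker or of Vingron and Argos; the S matrix is not covered.

```python
from typing import Callable, List


def simple_weight(a: str, b: str) -> int:
    """Hypothetical substitution weights: +3 for a match, -2 for a mismatch."""
    return 3 if a == b else -2


def table(A: str, B: str, gap: int,
          w: Callable[[str, str], int]) -> List[List[int]]:
    """D[i][j] = best score of aligning A[:i] with B[:j]."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] + gap,
                          D[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                          D[i][j - 1] + gap)
    return D


def aligned_pair_matrix(A: str, B: str, delta: int, gap: int = -2,
                        w: Callable[[str, str], int] = simple_weight) -> List[List[int]]:
    """T[i-1][j-1] = 1 iff some delta-near-optimal alignment aligns A[i-1] with B[j-1]."""
    m, n = len(A), len(B)
    F = table(A, B, gap, w)                   # prefix scores
    Rrev = table(A[::-1], B[::-1], gap, w)    # suffix scores, via reversed strings
    best = F[m][n]
    T = [[0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Best alignment forced through the diagonal edge (i-1, j-1) -> (i, j).
            through = F[i - 1][j - 1] + w(A[i - 1], B[j - 1]) + Rrev[m - i][n - j]
            if best - through <= delta:
                T[i - 1][j - 1] = 1
    return T


if __name__ == "__main__":
    for row in aligned_pair_matrix("AC", "AC", 0):
        print(row)
```

The sketch also shows why connectivity is lost: each cell is decided independently, so nothing records which marked cells can appear together on one near-optimal path.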

Saqi and Sternberg (1991) employ a different approach to generate a set of possibly good alignments. They first find an optimal alignment for some set of parameters. Then, they increase by $\Delta$ the penalty for aligning position i with position j, for every pair (i, j) of positions that has been previously aligned, and the optimal alignment for the adjusted set of parameters is found. This procedure is repeated, and at each iteration it outputs a new alignment that slightly differs from the previous one. Note that these alignments are not necessarily $\delta$-near-optimal alignments according to the definition used in this paper.

Therefore, it is clear that it is not the actual enumeration of near-optimal alignments, but rather a representation of their common or uncommon features, that is needed. This is the motivation for the representation suggested in this paper, which contains more information than the dot matrices (Vingron and Argos, 1990; Vingron, 1991; Zuker, 1991), and yet is more compact. We believe that this additional information, which is manageable, can be valuable in many cases. Also, the transformation of the edge weights suggested in this paper provides a new and more efficient way to explore the search space of near-optimal alignments, because it suggests a simple implementation to the method that looks for the next-best deviation (Hoffman and Pavley, 1959; Yen, 1971), and because it makes the task of counting near-optimal alignments possible.

THE COMBINATORIAL STRUCTURE OF NEAR-OPTIMAL PATHS

Notation Let G be the edit distance graph between two sequences A and B, and let E be its set of edges. The weight of an edge e = (u, v) is denoted by w(e). Let d(x, y) be the length of the optimal (maximal) path from node x to node y in G. If P is a path that goes through nodes x and y, then $d_P(x, y)$ is the length of the portion of P from x to y. For a path P from s to t (an (s, t)-path), define $\delta(P) = d(s, t) - d_P(s, t)$. If $\delta(P) = \delta$ ($\delta \geq 0$) then P is called a $\delta$-path (or a $\delta$-near-optimal path). Throughout the paper, only simple paths are considered (since G is a DAG, it contains only simple paths).

Given a pair of sequences, two edit distance computations (between the sequences and between their reverse) produce all values d(s, u) and d(u, t) for every node u, as well as the optimal path from s to u and from u to t for every node u.
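The two computations can be sketched as follows. One table is filled for the sequences as given and one for the reversed sequences; re-indexing the second table gives d(u, t) for every node u. The function names and the +3/−2 weights with gap −2 are assumptions made for this illustration, and the sketch relies on the score of an alignment being unchanged when both rows are reversed, which holds for the simple column-additive scoring used here.

```python
from typing import Callable, List, Tuple


def simple_weight(a: str, b: str) -> int:
    """Hypothetical substitution weights: +3 for a match, -2 for a mismatch."""
    return 3 if a == b else -2


def table(A: str, B: str, gap: int,
          w: Callable[[str, str], int]) -> List[List[int]]:
    """D[i][j] = best score of aligning A[:i] with B[:j]."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] + gap,
                          D[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                          D[i][j - 1] + gap)
    return D


def forward_backward(A: str, B: str, gap: int = -2,
                     w: Callable[[str, str], int] = simple_weight
                     ) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (F, R) with F[i][j] = d(s, (i, j)) and R[i][j] = d((i, j), t)."""
    m, n = len(A), len(B)
    F = table(A, B, gap, w)
    Rrev = table(A[::-1], B[::-1], gap, w)
    # Aligning the suffixes A[i:] and B[j:] is the same problem as aligning
    # the first m - i and n - j characters of the reversed strings.
    R = [[Rrev[m - i][n - j] for j in range(n + 1)] for i in range(m + 1)]
    return F, R


if __name__ == "__main__":
    F, R = forward_backward("ACGT", "AGT")
    print(F[4][3], R[0][0])  # both equal d(s, t)
```

For every node u, F[u] + R[u] is the score of the best (s, t) path through u, so it never exceeds d(s, t) and equals it exactly on the nodes of optimal alignments.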

Definition For an edge $e = (u, v) \in E$, define

$$\delta(e) = d(s, t) - \bigl(d(s, u) + w(u, v) + d(v, t)\bigr).$$

$\delta(e)$ is therefore the difference between the length of the optimal (s, t) path and the length of the best path from s to t that uses e.

Definition Define $E_\Delta$, a subset of the edge set E in G, as:

$$E_\Delta = \{\, e \mid \delta(e) \leq \Delta \,\}$$

Figure 5 depicts all edges in $E_3$. Each edge is marked with a pair of numbers; the left number is the $\delta$ value of this edge.
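The two definitions translate directly into code. The following Python sketch computes $\delta(e)$ for every edge of the edit distance graph from a forward table and a reversed-string table, and then selects $E_\Delta$. The names `edge_deltas` and `near_optimal_edges`, the representation of an edge as a pair of grid nodes, and the +3/−2 weights with gap −2 are assumptions for this illustration only.

```python
from typing import Callable, Dict, List, Set, Tuple

Node = Tuple[int, int]
Edge = Tuple[Node, Node]


def simple_weight(a: str, b: str) -> int:
    """Hypothetical substitution weights: +3 for a match, -2 for a mismatch."""
    return 3 if a == b else -2


def table(A: str, B: str, gap: int,
          w: Callable[[str, str], int]) -> List[List[int]]:
    """D[i][j] = best score of aligning A[:i] with B[:j]."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] + gap,
                          D[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                          D[i][j - 1] + gap)
    return D


def edge_deltas(A: str, B: str, gap: int = -2,
                w: Callable[[str, str], int] = simple_weight) -> Dict[Edge, int]:
    """delta(e) = d(s,t) - (d(s,u) + w(u,v) + d(v,t)) for every edge e = (u, v)."""
    m, n = len(A), len(B)
    F = table(A, B, gap, w)                      # F[i][j] = d(s, (i, j))
    Rrev = table(A[::-1], B[::-1], gap, w)
    R = [[Rrev[m - i][n - j] for j in range(n + 1)] for i in range(m + 1)]
    best = F[m][n]                               # d(s, t)
    deltas: Dict[Edge, int] = {}
    for i in range(m + 1):
        for j in range(n + 1):
            if i < m:                            # vertical edge
                deltas[((i, j), (i + 1, j))] = best - (F[i][j] + gap + R[i + 1][j])
            if j < n:                            # horizontal edge
                deltas[((i, j), (i, j + 1))] = best - (F[i][j] + gap + R[i][j + 1])
            if i < m and j < n:                  # diagonal edge
                deltas[((i, j), (i + 1, j + 1))] = best - (
                    F[i][j] + w(A[i], B[j]) + R[i + 1][j + 1])
    return deltas


def near_optimal_edges(deltas: Dict[Edge, int], Delta: int) -> Set[Edge]:
    """E_Delta: all edges whose delta value is at most Delta."""
    return {e for e, d in deltas.items() if d <= Delta}


if __name__ == "__main__":
    print(sorted(near_optimal_edges(edge_deltas("AC", "AC"), 0)))
```

Since every edge is visited once after the two O(nm) table computations, all $\delta$ values, and hence any $E_\Delta$, are obtained in O(nm) time and space.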

Claim 2.1 $E_\Delta$ is the smallest set of edges such that every $\delta'$-near-optimal path, for some $\delta' \leq \Delta$, is a path in $E_\Delta$.

Proof Note first that every $\delta'$-near-optimal path, $\delta' \leq \Delta$, is a path in $E_\Delta$. Let P be a $\delta'$-near-optimal path for some $\delta' \leq \Delta$. Assume that P is not a path in $E_\Delta$, so there must be some edge $e = (u, v) \notin E_\Delta$ on P. But since P uses e, $\delta(e) = \delta'' \leq \delta' \leq \Delta$. Hence $e \in E_\Delta$, a contradiction. $E_\Delta$ is the smallest such set since every $e \in E_\Delta$ is on some $\delta'$-near-optimal path, $\delta' \leq \Delta$.

Definition A $\delta$-path P from s to t is called canonical (for e) if there exists an edge $e = (u, v) \in P$ such that $\delta(P) = \delta(e)$. That is, a canonical path consists of the best path from s to u, followed by e and further followed by the best path from v to t. See, for example, the path which is marked with bold edges in Fig. 5: It is canonical for edge (e, h). Note that there is a canonical path for every edge e; however, a path P can be canonical for more than one edge. For example, the above path is canonical for edge (e, h) as well as for edge (h, k).

FIG. 5. The edit distance graph and a canonical path. Figure 5 shows a graph that represents optimal alignments and those which are within 2 and 3 of the optimum, between the sequences "ATTGCGACCA" (vertical) and "AACTCAGAC" (horizontal). The two sequences were compared with a transition/transversion scoring matrix which assigns 3 for an identity, 0 for a transition, and -2 for a transversion, and a gap score of -2. The optimal score is 8. Next to every edge in the graph we indicated the $\delta$ and $\varepsilon$ values (in the form $\delta/\varepsilon$) of the corresponding edge. The relevant paths are the ones that connect S and T in the graph. The path $P$ which is marked with bold edges is canonical for the edge (e,h), since $\delta((e,h)) = \delta(P) = 3$. It consists of three parts: (S,b,e), the best path from S to e, followed by (e,h), and finally followed by (h,k,n,o,q,r,v,w,T), the best path from h to T. Note that the three conditions of Lemma 2.1 hold for this path. $P$ is also canonical for the edge (h,k), for the same reasons.

We next argue that the canonical paths are the essential ones among all near-optimal paths, since all noncanonical near-optimal paths can be derived from them.

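Lemma 2.1. Let $P = e_1, e_2, \ldots, e_k$ be a canonical $\Delta$-near-optimal path. Then
(1) $\forall e \in P$, $\delta(e) \le \Delta$;
(2) there is a contiguous segment $e_l, \ldots, e_r$ ($l \le r$) along $P$ such that $\delta(e_i) = \Delta$ for $l \le i \le r$ and $\delta(e_j) < \Delta$ for $j < l$ or $j > r$. Hence, $P$ is a canonical path for $e_l, \ldots, e_r$;
(3) $\delta(e_{i-1}) \le \delta(e_i)$ for all $i \le l$, and $\delta(e_i) \ge \delta(e_{i+1})$ for all $i \ge r$.

A canonical path (the $\delta$ values of its edges, in order): $<\Delta,\ <\Delta,\ \Delta,\ \Delta,\ \Delta,\ \Delta,\ <\Delta,\ <\Delta$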

For example, in the canonical path of Fig. 5, the edges before and after the canonical edges $(e, h)$ and $(h, k)$ have $\delta$ values less than those of the canonical edges themselves.

Proof. For any $e \in P$, $\delta(e) = d(s, t) - d_{P'}(s, t)$, where $P'$ is the best $(s, t)$-path that uses $e$. Since $P$ uses $e$, $\Delta = \delta(P) = d(s, t) - d_P(s, t) \ge d(s, t) - d_{P'}(s, t) = \delta(e)$, so $\delta(e) \le \Delta$ and (1) is proven.

To show (2), let $e_l$ be the first edge on $P$ for which $\delta(e_l) = \Delta$ (since $P$ is a canonical path, there must be such an edge) and let $e_r$ be the first edge on $P$ such that $\delta(e_r) = \Delta$ but $\delta(e_{r+1}) < \Delta$, or $e_r = e_k$ if no such $r$ exists. By definition, $\delta(e_i) < \Delta$ for $i < l$. If $r = k$, then we are done. Otherwise, it is left to be shown that $\delta(e_i) < \Delta$ for $i > r$. Suppose $e_l = (u, v)$. Since $P$ is canonical for $e_l$, $d_P(v, t) = d(v, t)$, and this holds for all nodes that occur after $v$ on $P$. Let $e_{r+1} = (x, y)$. $P$ is not canonical for $e_{r+1}$, but since $y$ occurs after $v$ on $P$ we know that $d_P(y, t) = d(y, t)$, hence $d_P(s, x) < d(s, x)$, so there is a better path to reach $x$ from $s$ than via $P$. Take now any edge $e_i = (x', y')$, $i > r$. $P$ is not canonical for $e_i$ since there is always a better way to reach $x'$ from $s$ than along $P$: reach $x$ optimally, then follow the portion from $x$ to $x'$ on $P$. Hence $d_P(s, x') < d(s, x')$, so $\delta(e_i) < \delta(P) = \Delta$ for all $i > r$.

Recall that $P$ is optimal (canonical) for $e_l = (u, v)$. Consider some $i \le l$, and let $e_{i-1} = (p, q)$ and $e_i = (q, r)$. Because $e_{i-1}$ and $e_i$ are on the best path from $s$ to $u$, $e_{i-1}$ is on the best path from $s$ to $q$. Hence, $e_{i-1}$ is on the canonical path for $e_i$ and by (2) $\delta(e_{i-1}) \le \delta(e_i)$. A similar argument shows that for every $i \ge r$, $e_{i+1}$ is on the canonical path for $e_i$, hence $\delta(e_i) \ge \delta(e_{i+1})$. (3) is therefore proven.

Lemma 2.2. If $P = e_1, e_2, \ldots, e_k$ is a noncanonical $\Delta$-near-optimal path, then $\delta(e) < \Delta$ $\forall e \in P$.

Proof. For any $e \in P$, the length of the best path that uses $e$ is $d(s, t) - \delta(e)$. Since $P$ uses $e$ but is not canonical (therefore not the best) for $e$, $d_P(s, t) = d(s, t) - \Delta < d(s, t) - \delta(e)$, so $\delta(e) < \Delta$.

Denote by $\Delta_0 < \Delta_1 < \Delta_2 < \ldots$ the maximal sequence of values such that

\[ E_{\Delta_0} \subset E_{\Delta_1} \subset E_{\Delta_2} \subset \ldots \]

Lemma 2.3. $E_{\Delta_i}$, which is the smallest set of edges that contains every $\delta'$-near-optimal path ($\delta' \le \Delta_i$), is also the union of all $\Delta_j$ canonical paths for $j \le i$.

Proof. Recall that a canonical path must be $\Delta_j$-near-optimal for some $j$. If $P$ is a canonical $\Delta_j$-near-optimal path, then it will first appear as a path in $E_{\Delta_j}$ since by Lemma 2.1 $\delta(e) \le \Delta_j$ for all $e \in P$, and $\delta(e') = \Delta_j$ for some $e' \in P$. Also, every edge in $E_{\Delta_i}$ belongs to some canonical $\Delta_j$-near-optimal path, where $\Delta_j \le \Delta_i$ ($j \le i$).

Lemma 2.3 implies that the union of all canonical near-optimal paths within $\Delta$ of the optimum contains all near-optimal paths in this neighborhood. Hence, any noncanonical $\delta$-near-optimal path is a combination of a few segments from canonical paths that are $\delta'$-near-optimal, $\delta' < \delta$. In general, let $P_1$ and $P_2$ be $\delta_1$ and $\delta_2$ near-optimal paths, respectively, which intersect at some node $u$. Then, any path $P_3$ obtained by concatenating the $(s, u)$ segment from one path and the $(u, t)$ segment from the other is $\delta_3$ near-optimal, $\delta_3 \le \delta_1 + \delta_2$. In the special case where $P_1$ and $P_2$ are both optimal, $P_3$ is also optimal.
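The bound on the concatenated path follows directly from the definitions. The short derivation below is a sketch added for clarity; it takes the concatenated path to follow the first path from $s$ to $u$ and the second from $u$ to $t$ (the symmetric case is identical), and it uses only $d_P(x, y) \le d(x, y)$ and $d(s, u) + d(u, t) \le d(s, t)$.

```latex
\begin{align*}
d_{P_1}(s,u) &= d(s,t) - \delta_1 - d_{P_1}(u,t) \;\ge\; d(s,t) - \delta_1 - d(u,t),\\
d_{P_2}(u,t) &= d(s,t) - \delta_2 - d_{P_2}(s,u) \;\ge\; d(s,t) - \delta_2 - d(s,u),\\
\delta_3 &= d(s,t) - d_{P_1}(s,u) - d_{P_2}(u,t)\\
         &\le \delta_1 + \delta_2 - \bigl(d(s,t) - d(s,u) - d(u,t)\bigr)
          \;\le\; \delta_1 + \delta_2 .
\end{align*}
```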

In terms of alignments, to cope with this combinatorial explosion, we suggest the subset of alignments that correspond to the canonical paths as the representatives for the entire set of near-optimal alignments, which can be very large. This set fairly represents the entire set of alignments because any noncanonical alignment can be obtained by recombining a few canonical ones.

Remark. Define $G_\Delta = (V, E_\Delta)$. The above discussion also implies that any task that involves alignments that are at most $\Delta$ near-optimal (such as counting and enumeration) can be done on $G_\Delta$ instead of $G$, taking advantage of its sparsity. No theoretical bounds are known on the ratio $|E_\Delta| / 3nm$. However, it is typically the case that the interesting $\Delta$ is the one for which $|E_\Delta| = O(n + m) = O(|E|^{0.5})$.

TRANSFORMATION OF WEIGHTS

We advocate representation, rather than enumeration, of alignments. However, in some cases it may be desirable to enumerate alignments. Also, recall that $E_\Delta$ may also contain noncanonical paths that are worse than $\Delta$. In this section we show that a simple transformation on the weights of $E$ provides a powerful tool to manipulate near-optimal paths and to accomplish the tasks mentioned above. This transformation turns out to be related to the method of Edmonds and Karp (1972) (see also Lawler, 1976) that transforms general weights of a graph to all nonnegative weights, such that the order of paths is preserved. Our transformation specializes the Edmonds-Karp transformation to $(s, t)$-paths; it produces a set of nonnegative weights (recall that in the original set of weights, some may be negative) which preserves the order of the $(s, t)$-paths, and makes the enumeration extremely easy.

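Definition. For every $e = (u, v) \in E$ define

\[ \varepsilon(e) = \delta(e) - \min_{e' = (x, u)} \{ \delta(e') \} \]

For an edge leaving $s$, which has no incoming edges, $\varepsilon(e) = \delta(e)$. Each edge in Fig. 5 is marked with a pair of numbers; the left number is the $\delta$ value of that edge, and the right number is its $\varepsilon$ value, calculated as defined above.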
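As an illustration of this definition, the sketch below computes $\delta$ and $\varepsilon$ on a small hypothetical DAG (five edges, not the graph of Fig. 5). The function name `transform_weights` and the representation of the graph as a dictionary of weighted edges plus a topological order are assumptions made for the example; the sketch also assumes every node lies on some $(s, t)$-path, as in an edit distance graph.

```python
from collections import defaultdict

def transform_weights(edges, order, s, t):
    """edges: {(u, v): w}; order: nodes in topological order. Returns (delta, eps)."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for (u, v), w in edges.items():
        out_edges[u].append((v, w))
        in_edges[v].append((u, w))

    neg = float("-inf")
    fwd = {v: neg for v in order}          # fwd[v] = d(s, v)
    fwd[s] = 0
    for v in order:
        for u, w in in_edges[v]:
            fwd[v] = max(fwd[v], fwd[u] + w)

    bwd = {v: neg for v in order}          # bwd[v] = d(v, t)
    bwd[t] = 0
    for u in reversed(order):
        for v, w in out_edges[u]:
            bwd[u] = max(bwd[u], w + bwd[v])

    best = fwd[t]
    delta = {(u, v): best - (fwd[u] + w + bwd[v]) for (u, v), w in edges.items()}

    eps = {}
    for (u, v) in edges:
        entering = [delta[(x, u)] for x, _ in in_edges[u]]
        # Edges leaving s have no predecessor, so eps equals delta there.
        eps[(u, v)] = delta[(u, v)] - (min(entering) if entering else 0)
    return delta, eps

toy = {("s", "a"): 3, ("s", "b"): 1, ("a", "t"): 2, ("a", "b"): 1, ("b", "t"): 3}
delta, eps = transform_weights(toy, ["s", "a", "b", "t"], "s", "t")
print(delta)
print(eps)
```

On this toy graph the optimal path is s, a, b, t with length 7, so the three edges on it receive $\delta = \varepsilon = 0$.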

We now prove [Lemma 3.1(3)] that $\varepsilon(e)$ can be interpreted as the "additional penalty for using $e$ on the path from $u$ to $t$ rather than following the optimal path from $u$ to $t$ directly." Theorem 3.1 shows how these transformed weights can be used.

Lemma 3.1. For any $e = (u, v)$:
(1) $\varepsilon(e) = \delta(e) - \delta(\bar{e})$, where $\bar{e}$ is the edge preceding $e$ on the canonical path for $e$;
(2) $\varepsilon(e) \ge 0$;
(3) $w(e) + d(v, t) = d(u, t) - \varepsilon(e)$.

Proof. Let $e_i = (x_i, u)$ be the set of edges entering $u$ and assume without loss of generality that $\delta(e_1) \le \delta(e_2) \le \ldots$ Note that $\delta(e_i) = d(s, t) - (d(s, x_i) + w(e_i) + d(u, t))$; hence $d(s, x_1) + w(e_1)$ is the optimal way to reach $u$ from $s$, since $d(s, x_1) + w(e_1) \ge d(s, x_2) + w(e_2) \ge \ldots$ This implies that the canonical path for $e$ enters $u$ via edge $e_1 = (x_1, u)$, so $\varepsilon(e) = \delta(e) - \delta(e_1)$ and (1) follows.

From Lemma 2.1 we know that if $P$ is a $\Delta$-near-optimal path that is canonical for $e$, then $\delta(e) = \Delta$ and $\delta(e') \le \Delta$ for any edge $e'$ preceding $e$ on $P$. Hence (2) follows.

Let $e_1 = (x_1, u)$ from above. We know

(i) $\delta(e_1) = d(s, t) - (d(s, x_1) + w(e_1) + d(u, t))$
(ii) $\delta(e) = d(s, t) - (d(s, x_1) + w(e_1) + w(e) + d(v, t))$

hence $\varepsilon(e) = \delta(e) - \delta(e_1) = d(u, t) - w(e) - d(v, t)$, or $w(e) + d(v, t) = d(u, t) - \varepsilon(e)$.

Theorem 3.1. For any path $P$ from $s$ to $t$, $\delta(P) = \sum_{e \in P} \varepsilon(e)$.

Proof. Let $P = e_1, e_2, \ldots, e_k$, where $e_i = (u_i, v_i)$ ($u_1 = s$, $v_k = t$ and $v_i = u_{i+1}$). Define the path $P_i$ as the path obtained by concatenating the edges $e_1, \ldots, e_i$ with the optimal path from $v_i$ to $t$. We claim, by induction on $i$, $i = 1, \ldots, k$, that $\delta(P_i) = \sum_{j \le i} \varepsilon(e_j)$. Since $P_k = P$, the theorem follows.

Note that $P_1$ is canonical for $e_1$; therefore $\delta(P_1) = \delta(e_1)$. Furthermore, $\varepsilon(e_1) = \delta(e_1)$ as there are no edges into $s$. Hence, for $i = 1$, $\delta(P_1) = \varepsilon(e_1)$ as claimed. Assume correctness for $j \le i$. $P_i$ and $P_{i+1}$ share the first $i$ edges. Since $d_{P_i}(s, t) = d_{P_i}(s, v_i) + d(v_i, t)$ we have

\[ d_{P_{i+1}}(s, t) = d_{P_i}(s, v_i) + w(e_{i+1}) + d(v_{i+1}, t) \qquad (1) \]
\[ \phantom{d_{P_{i+1}}(s, t)} = d_{P_i}(s, t) - d(v_i, t) + w(e_{i+1}) + d(v_{i+1}, t) \qquad (2) \]

Recall that $v_i = u_{i+1}$. From Lemma 3.1 we have

\[ w(e_{i+1}) + d(v_{i+1}, t) = d(u_{i+1}, t) - \varepsilon(e_{i+1}) = d(v_i, t) - \varepsilon(e_{i+1}) \qquad (3) \]

Substituting (3) in (2) we get $d_{P_{i+1}}(s, t) = d_{P_i}(s, t) - \varepsilon(e_{i+1})$.

By induction, $d_{P_i}(s, t) = d(s, t) - \sum_{j \le i} \varepsilon(e_j)$; hence $d_{P_{i+1}}(s, t) = d(s, t) - \sum_{j \le i+1} \varepsilon(e_j)$, so $\delta(P_{i+1}) = \sum_{j \le i+1} \varepsilon(e_j)$.

For example, in the path $P$ which is marked with bold edges in Fig. 5, $\delta(P)$ is 3, and 3 is the sum of all the $\varepsilon$ values on the edges along $P$.
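Theorem 3.1 can also be checked by brute force on a small graph. The sketch below is an illustration only: it uses a hypothetical five-edge DAG (not the graph of Fig. 5) whose transformed weights `EPS` were worked out by hand from the definitions, enumerates every $(s, t)$-path, and compares $\delta(P)$ with the sum of the $\varepsilon$ values along $P$.

```python
# Hypothetical toy DAG: original weights W and hand-computed transformed weights EPS.
W = {("s", "a"): 3, ("s", "b"): 1, ("a", "t"): 2, ("a", "b"): 1, ("b", "t"): 3}
EPS = {("s", "a"): 0, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 0, ("b", "t"): 0}

def all_paths(edges, u, t):
    """Yield every path from u to t as a list of edges (the graph is a DAG)."""
    if u == t:
        yield []
        return
    for (x, y) in edges:
        if x == u:
            for rest in all_paths(edges, y, t):
                yield [(x, y)] + rest

def check_theorem(w, eps, s, t):
    """Return (path, delta(P), sum of eps along P) for every (s, t)-path."""
    paths = list(all_paths(w, s, t))
    best = max(sum(w[e] for e in p) for p in paths)
    return [(p, best - sum(w[e] for e in p), sum(eps[e] for e in p)) for p in paths]

for path, delta_p, eps_sum in check_theorem(W, EPS, "s", "t"):
    print(path, delta_p, eps_sum)
```

Each printed line shows a path followed by two equal numbers, which is exactly the statement of the theorem for that path.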

The proof of Theorem 3.1 implies that after $\varepsilon(e)$ has been computed for every edge $e$, the "goodness" of a path can be computed "on the fly" as follows. Take a path from $s$ to $u$ for some node $u$: if $\delta$ is the sum of the $\varepsilon$'s along that path, then there is always a $\delta$-near-optimal path from $s$ to $t$ that begins with this segment from $s$ to $u$; namely, the one that proceeds with the optimal path from $u$ to $t$. If an edge $e$ is followed after $u$, then any path to $t$ that uses the segment up to $u$, followed by $e$, will have a $\delta$ value of at least $\delta + \varepsilon(e)$. An example of this can be seen in Fig. 6. Starting at S and proceeding to i, no decrease has occurred. However, if we proceed to j or l instead of k, then the length of the path will go down by at least 3.

ENUMERATING NEAR-OPTIMAL PATHS

A natural enumeration procedure for all near-optimal paths is now readily available. First, compute $\delta(e)$ and $\varepsilon(e)$ for every edge in $G$. This preprocessing takes $O(nm)$ time and space. Then, using the new set of weights $\varepsilon(e)$, enumerate the $(s, t)$-paths in order. This can be done as follows: Build a "search tree" that explores all the paths in this graph, starting at $s$. An internal node in the tree represents a partial path that starts at $s$ and ends at some node $u$; the leaves represent complete $(s, t)$-paths. Each internal node is expanded via all edges emanating from the last node $u$ on the partial path.

FIG. 6. Deviation edges in the edit distance graph. This figure shows the same near-optimal alignments as Fig. 5. The bold edges in this figure are the deviations that decrease the score of an (S,T) path: {(e,h)} decreases the score by 1, {(b,e)} decreases the score by 2, and {(c,d),(i,j),(i,l),(n,p)} decrease the score by 3. The other edges do not decrease the score.

With each internal node in the tree we associate a cost $\delta$, which is the sum of $\varepsilon(e)$ of all edges $e$ on the path that is represented by this node. This path can always be extended to an $(s, t)$-path $P$ with $\delta(P) = \delta$. When an internal node is expanded via an edge $e$, the new node receives the cost of its parent plus $\varepsilon(e)$.

It is now straightforward to observe (from Theorem 3.1) that if the tree is expanded in a "best first search" manner (i.e., the next node to expand is the internal node with the minimum $\delta$), then paths are enumerated in increasing order of $\delta$. Hence, to enumerate the $K$ best paths, simply stop after the $K$th leaf in the search tree has been reached. A priority queue holding the best $K$ costs of the nodes in the search tree must be maintained. The size of the tree is at most $O(Km)$ nodes, as every leaf is preceded by at most $m + n$ nodes. After the $O(nm)$ time preprocessing stage, a new path $P$ is enumerated in this manner in $O(|P|) = O(n + m)$ time. The priority queue maintenance takes $O(\log Km)$. Because the entire search tree needs to be kept throughout, the space requirement is $O(K(n + m))$ (plus an additional $O(nm)$ space to store the $\varepsilon(e)$'s).

The same method can be employed to list all $\delta'$-near-optimal paths for $\delta' \le \Delta$, where $\Delta$ is a specified threshold. First build $E_\Delta$; then enumerate paths in $E_\Delta$ by building the search tree, while pruning any extension that leads to a node whose cost exceeds $\Delta$. This is basically the method of Waterman and Byers (1985). Because any extended node eventually leads to a legal leaf (whose cost is within the threshold), and because at any node there is a constant number of extensions to check, the search tree can be explored in any order. Only the current path needs to be maintained, so the space required is $O(m + n)$, and the time is $O(|P|)$ per path.
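The best-first procedure and its thresholded variant can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `enumerate_paths`, its `k` and `threshold` parameters, and the toy $\varepsilon$ values (a hypothetical five-edge DAG, not Fig. 5) are assumptions. Partial paths are kept in a heap keyed by the running sum of $\varepsilon$; because all $\varepsilon$ are nonnegative, complete paths leave the heap in nondecreasing order of $\delta$.

```python
import heapq
from itertools import count

def enumerate_paths(eps, s, t, k=None, threshold=None):
    """Yield (delta, path) for (s, t)-paths in nondecreasing order of delta."""
    out = {}
    for (u, v), e in eps.items():
        out.setdefault(u, []).append((v, e))

    tie = count()                      # tie-breaker so the heap never compares paths
    heap = [(0, next(tie), (s,))]      # (cost so far, tie, partial path)
    produced = 0
    while heap and (k is None or produced < k):
        cost, _, path = heapq.heappop(heap)
        if path[-1] == t:              # a leaf: a complete (s, t)-path
            produced += 1
            yield cost, path
            continue
        for v, e in out.get(path[-1], []):
            # Prune extensions whose cost already exceeds the threshold.
            if threshold is None or cost + e <= threshold:
                heapq.heappush(heap, (cost + e, next(tie), path + (v,)))

EPS = {("s", "a"): 0, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 0, ("b", "t"): 0}
for delta_p, path in enumerate_paths(EPS, "s", "t"):
    print(delta_p, path)
```

Storing whole partial paths in the heap keeps the sketch short; the search tree described in the text would instead store a parent pointer per node.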

A MORE EFFICIENT ENUMERATION ALGORITHM

Using the idea of "deviations" (Hoffman and Pavley, 1959; Yen, 1971), together with the transformed weights, we now outline an algorithm that outputs near-optimal paths in increasing order. It requires only $O(K)$ space and $O(|P|)$ time per path. For simplicity, assume that there are no ties between path lengths (e.g., impose lexicographic order in addition to the length).

Recall that $Q$ is a "deviation" of $P$ if there is an edge $e = (u, v)$ on $Q$ such that $Q$'s segment up to $u$ is the same as $P$'s, $e \notin P$, and the portion of $Q$ from $v$ to $t$ is an optimal path from $v$ to $t$. (An example of a deviation can be seen in Fig. 6, where $P$ = {S,a,c,f,i,k,n,o,q,r,u,w,T} and $Q$ = {S,a,c,f,i,l,o,q,r,u,w,T}. Edge (i,l) is the deviation edge that decreases the score by 3.) The algorithm builds a search tree (which is different from the search tree described above). A node in the tree represents a solution (a path), and its cost is its $\delta$. A node $b$ is a child of a node $a$ in the tree if the path represented by $b$ is a deviation of the path represented by $a$.

Specifically, let $a$ be a node in the tree that represents a path $P = x_1, \ldots, x_k$, which deviates from its parent $b$ by the edge $(x_i, x_{i+1})$. Let the cost of $a$ (i.e., $\delta(P)$) be $\delta$. Note that $\varepsilon((x_j, x_{j+1})) = 0$ for all $j > i$. The children of $a$ are found as follows: for every vertex $x_j$ on $P$, $j > i$, let $e_1, e_2$ be the two edges that are incident at $x_j$ but not on $P$ (i.e., $(x_j, x_{j+1}) \ne e_1, e_2$). Create two new nodes, $b_1$ and $b_2$, with costs $\delta + \varepsilon(e_1)$ and $\delta + \varepsilon(e_2)$, respectively, and make them the children of $a$. $b_1$ and $b_2$ share the prefix $x_1, \ldots, x_j$ with $a$, and deviate from it on the edges $e_1$ and $e_2$, respectively.

The algorithm, which outputs the first, second, third, ... paths in order, proceeds as follows. Initially, the tree consists of a root, where the root of the tree is the optimal path from $s$ to $t$, and its cost is 0. To find the next best path, find the leaf $a$ in the tree of minimal cost, and output the path represented by $a$. Expand $a$ by attaching all deviations of it as its children.

Since expanding a node requires $O(n + m)$ time (at most two edges need to be checked at each vertex on the path), the time to output the next path is $O(n + m)$. The size of the search tree is $O(Km)$ because every new solution creates at most $2(n + m)$ new nodes. However, this space requirement can be improved by using the technique of Lawler (1972): when a leaf $a$ is first expanded, only its best child $b^*$ needs to be explicitly stored in the tree. When $b^*$ is eventually used at some iteration (i.e., it is the next best path), $b^*$ is expanded as before, but also the next best child of $a$ is found again, by re-expanding $a$. The re-expansion is repeated for the third, fourth, ... child of $a$. This assures that after $K$ solutions have been output, there are only $2K$ nodes in the tree, because it contains two nodes for every solution: its own node, and its best unused deviation. Moreover, at each iteration, two nodes (instead of one node) are expanded; hence the time requirement is only doubled whereas the space requirement is reduced by a factor of $m$.
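A simplified version of the deviation algorithm can be sketched as below. The sketch makes several assumptions that should be read as such: the name `k_best_by_deviations` is hypothetical, the $\varepsilon$ values belong to a toy five-edge DAG rather than Fig. 5, ties are resolved by a fixed "default" optimal continuation (the first outgoing edge with $\varepsilon = 0$), and all children of an expanded node are stored, so it does not include the space-saving technique of Lawler (1972).

```python
import heapq
from itertools import count

def k_best_by_deviations(eps, s, t, k):
    """Return up to k (delta, path) pairs in nondecreasing order of delta."""
    out = {}
    for (u, v), e in eps.items():
        out.setdefault(u, []).append((v, e))

    def complete(prefix):
        # Extend a prefix to t along edges with eps == 0 (an optimal continuation).
        path = list(prefix)
        while path[-1] != t:
            path.append(next(v for v, e in out[path[-1]] if e == 0))
        return tuple(path)

    tie = count()
    # Heap entries: (cost, tie, path, index of first vertex allowed to deviate).
    heap = [(0, next(tie), complete((s,)), 0)]
    results = []
    while heap and len(results) < k:
        cost, _, path, start = heapq.heappop(heap)
        results.append((cost, path))
        # Children: deviate at every vertex after the parent's own deviation edge.
        for j in range(start, len(path) - 1):
            for v, e in out[path[j]]:
                if v != path[j + 1]:
                    child = complete(path[: j + 1] + (v,))
                    heapq.heappush(heap, (cost + e, next(tie), child, j + 1))
    return results

EPS = {("s", "a"): 0, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 0, ("b", "t"): 0}
for delta_p, path in k_best_by_deviations(EPS, "s", "t", 3):
    print(delta_p, path)
```

The test `e == 0` presumes integer (or otherwise exact) weights; with floating-point scores a tolerance would be needed.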

COUNTING NEAR-OPTIMAL ALIGNMENTS

Simulations show that the number of near-optimal alignments grows rapidly as the threshold increases. This behavior depends on the scoring system, i.e., on the weights of the edges in the graph, as well as on the actual sequences, and has been demonstrated in the section Enumerating Near-Optimal Paths and also in (Waterman, personal communication). The significance of an alignment can be assessed not only by the distance of its score from the optimal score, but also by its rank, i.e., the number of alignments with better scores than its own score. Computing the rank of a given alignment is closely related to the problem of counting the number of near-optimal alignments. The number of near-optimal alignments can clearly be computed by enumerating them using the methods of section Enumerating Near-Optimal Paths. In this section we are interested in counting methods that are more efficient than the corresponding enumeration solution. We show that by using the transformed set of weights $\varepsilon(e)$ instead of the original weights, better bounds can be achieved. As before, we use the graph notation, where alignments correspond to paths from $s$ to $t$ in the edit distance graph.

The number of optimal paths from $s$ to $t$ is found as follows, and requires $O(|E_0|)$ time and space. Let $N(v)$ be the number of optimal paths from $s$ to some specified node $v$. Then,

\[ N(v) = \sum_{u \,:\, (u, v) \in E_0} N(u) \]

and $N(s) = 1$. $N(t)$ is the desired count. A natural generalization of this method to count all near-optimal paths within $\Delta$ of the optimum is to maintain, at each node $v$, a list of all possible lengths of paths from $s$ to $v$, as well as their count. A node creates its own list by inheriting and merging the lists and the counts of its predecessors, as described in Waterman. The evaluation requires $O(C|E_\Delta|) = O(Cnm)$ time and space, where $C$ is the maximum list size ($C$ is typically very large for arbitrary weights; Waterman (personal communication) shows that if only the number of matches, mismatches, and gaps are counted then $C$ is $O(nm)$).
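The recurrence for $N(v)$ can be sketched in a few lines. The example below is illustrative only: `count_optimal_paths` is a hypothetical name, and the edge list stands for a small made-up set $E_0$ (edges with $\delta = 0$) in which two optimal paths share their last edge.

```python
def count_optimal_paths(e0_edges, order, s, t):
    """Count (s, t)-paths that use only edges of E_0; order is topological."""
    preds = {}
    for u, v in e0_edges:
        preds.setdefault(v, []).append(u)

    n_paths = {s: 1}                       # N(s) = 1
    for v in order:
        if v != s:
            # N(v) is the sum of N(u) over the E_0 edges (u, v) entering v.
            n_paths[v] = sum(n_paths[u] for u in preds.get(v, []))
    return n_paths[t]

E0 = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
print(count_optimal_paths(E0, ["s", "a", "b", "c", "t"], "s", "t"))
```

Each edge of $E_0$ is visited once, which matches the $O(|E_0|)$ bound stated in the text.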

Let $\Delta$ be a specified threshold. If the number of paths which are $\delta$ near-optimal for $\delta \le \Delta$ is sought, then the $O(Cnm)$ bound can be dramatically reduced by using the transformation suggested in the section Transformation of Weights. Recall that if the weights on the edges of the graph are transformed as suggested in the section Transformation of Weights, then the paths that are $\delta$ near-optimal, $\delta \le \Delta$, are exactly the paths whose transformed length does not exceed $\Delta$. Hence, if we compute $\varepsilon(e)$ for each edge $e \in E_\Delta$, then we can count the number of these paths in the transformed graph as follows when the weights are all integers: Let $N(v, k)$, $k = 0, 1, \ldots, \Delta$, be the number of paths of length $k$ from $s$ to $v$ in the transformed graph. Then,

\[ N(v, k) = \sum_{u \,:\, (u, v) \in E_\Delta} N(u, k - \varepsilon((u, v))) \]

This recursive formula can be evaluated in $O(\Delta |E_\Delta|) = O(\Delta nm)$ time and space. For nonintegral weights, a similar reduction in the space requirement can be achieved.
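The recurrence for $N(v, k)$ translates into a small table-filling routine. The sketch below assumes integer $\varepsilon$ values, as the text does; the function name `count_near_optimal` and the toy $\varepsilon$ values (a hypothetical five-edge DAG, not Fig. 5) are illustrative assumptions. It returns, for the sink $t$, the number of paths at each distance $k = 0, \ldots, \Delta$ from the optimum.

```python
def count_near_optimal(eps, order, s, t, threshold):
    """Return [N(t, 0), ..., N(t, threshold)] for integer transformed weights."""
    preds = {}
    for (u, v), e in eps.items():
        preds.setdefault(v, []).append((u, e))

    counts = {v: [0] * (threshold + 1) for v in order}
    counts[s][0] = 1                              # the empty path at s
    for v in order:                               # topological order
        for u, e in preds.get(v, []):
            for k in range(e, threshold + 1):
                counts[v][k] += counts[u][k - e]  # N(v,k) += N(u, k - eps(u,v))
    return counts[t]

EPS = {("s", "a"): 0, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 0, ("b", "t"): 0}
per_distance = count_near_optimal(EPS, ["s", "a", "b", "t"], "s", "t", 3)
print(per_distance, sum(per_distance))
```

Summing the returned list gives the number of paths within the threshold; the partial sums give the rank information discussed at the start of this section.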

COUNTING CANONICAL NEAR-OPTIMAL PATHS

The set of canonical near-optimal paths is a smaller, restricted and more structured set of near-optimal paths. To count the number of canonical $\delta$ near-optimal paths we need the following lemma:

Lemma 4.1. $P = e_1, e_2, \ldots, e_k$ is a canonical path if and only if there is some $l$, $1 \le l \le k$, such that
(1) for any $i \le l$, $\sum_{j \le i} \varepsilon(e_j) = \delta(e_i)$, and
(2) $\varepsilon(e_i) = 0$ for all $i > l$.

For example, in the canonical path of Fig. 5, the sum of the $\varepsilon$ values of edges (S,b), (b,e) and (e,h) is 3, which equals $\delta((e,h))$. The $\varepsilon$ values of the rest of the edges of the canonical path are 0.

Proof. Suppose first that (1) and (2) hold. Then $\sum_{j \le k} \varepsilon(e_j) = \delta(e_l)$ and from Theorem 3.1 $\delta(P) = \delta(e_l)$; hence $P$ is canonical for $e_l$.

We prove the only if part by induction on $\Delta$. Assume that for any canonical $\delta$-near-optimal path, $\delta < \Delta$, (1) and (2) hold. Let $P$ be a canonical $\Delta$-near-optimal path, and let $e_l = (v, w)$ be the first edge on $P$ such that $\delta(P) = \delta(e_l) = \Delta$. We will show that (1) and (2) hold for this $l$. Suppose that $e_{l-1} = (u, v)$ and let $P'$ be a canonical path for $e_{l-1}$. Consider the segment on $P$ from $s$ to $v$; it is the best path from $s$ to $v$ and it also goes through $u$, hence its portion up to $u$ is the best path from $s$ to $u$. Therefore, $P'$ coincides with $P$ on the segment up to $u$. Lemma 2.1 implies that $\delta(P') = \delta(e_{l-1}) < \delta(e_l) = \Delta$. Hence by induction, for any $i \le l - 1$, $\sum_{j \le i} \varepsilon(e_j) = \delta(e_i)$. $e_{l-1}$ precedes $e_l$ on the canonical path for $e_l$, so from Lemma 3.1 $\varepsilon(e_l) = \delta(e_l) - \delta(e_{l-1})$. Hence, (1) holds since

\[ \sum_{j \le l} \varepsilon(e_j) = \sum_{j \le l-1} \varepsilon(e_j) + \varepsilon(e_l) = \delta(e_{l-1}) + \delta(e_l) - \delta(e_{l-1}) = \delta(e_l) \]

Now (2) follows directly, since $\delta(P) = \delta(e_l)$ by definition, and $\delta(P) = \sum_{i \le k} \varepsilon(e_i) = \sum_{i \le l} \varepsilon(e_i) + \sum_{l < i \le k} \varepsilon(e_i)$, which implies $\varepsilon(e_i) = 0$ for all $i > l$.

The above lemma implies that every canonical path $P = e_1, e_2, \ldots, e_k$ can be decomposed into two parts, $P_1 = e_1, e_2, \ldots, e_l$ and $P_2 = e_{l+1}, \ldots, e_k$, such that in $P_1$ the property $\sum_{j \le i} \varepsilon(e_j) = \delta(e_i)$ holds for any $i$ and, in $P_2$, $\varepsilon(e_i) = 0$ for any $i$. If we choose $P_1$ as the maximal segment with this property then this decomposition is unique. We therefore associate with each canonical path $P$ a pair of edges $(e, f)$, where $e$ is the last edge on the $P_1$ (maximal) segment, and $f$ is the first edge on the $P_2$ segment. Note that $\delta(P) = \delta(e)$, and that from the maximality of $P_1$, $\delta(e) + \varepsilon(f) > \delta(f)$ (since, in general, $\delta(e) + \varepsilon(f) \ge \delta(f)$).

Many canonical paths may be associated with the same pair of edges $(e, f)$. Denote by $N(e, f)$ the number of canonical ($\delta(e)$-near-optimal) paths that are associated with the pair $(e, f)$. Because each canonical path is associated with a unique pair, all we need is to count how many canonical paths are associated with each pair of adjacent edges $(e, f)$.

Define $N_{can}(e)$ as the number of paths that start at $s$ and end with the edge $e$, which have the property that $\sum_{j \le i} \varepsilon(e_j) = \delta(e_i)$ for any edge $e_i$ along the path. Also, define $N_{opt}(f)$ as the number of paths that start with the edge $f$ and end at $t$ such that $\varepsilon(e) = 0$ for any edge $e$ along the path. Since every such prefix ending with $e$ can be combined with every such suffix starting with $f$,

\[ N(e, f) = \begin{cases} N_{can}(e) \cdot N_{opt}(f) & \text{if } \delta(e) + \varepsilon(f) > \delta(f) \\ 0 & \text{otherwise} \end{cases} \]

$N_{can}(e)$ and $N_{opt}(f)$ are computed as follows. Let $e_1, e_2, e_3$ be the three edges that are immediate predecessors of $e$, and $f_1, f_2, f_3$ be the three edges that are immediate successors of $f$. Set $I_j = 1$, $j = 1, 2, 3$, if $\delta(e_j) + \varepsilon(e) = \delta(e)$; otherwise $I_j = 0$. Then, initially $N_{can}(e) = 1$ for $e = (s, u)$, and

\[ N_{can}(e) = \sum_{j = 1, 2, 3} I_j \, N_{can}(e_j) \]

To compute $N_{opt}(f)$, initially set $N_{opt}(f) = 1$ if $f = (u, t)$ and $\varepsilon(f) = 0$, and

\[ N_{opt}(f) = \begin{cases} \sum_{j = 1, 2, 3} N_{opt}(f_j) & \text{if } \varepsilon(f) = 0 \\ 0 & \text{otherwise} \end{cases} \]

There are $9nm$ pairs of edges to consider (at most 9 per node). The recursive evaluation of $N_{can}(e)$ and of $N_{opt}(f)$ requires $O(nm)$ time and space. Hence the total time and space is $O(nm)$. This can be reduced to $O(|E_\Delta|)$ if we are only concerned with canonical paths that are $\delta$-near-optimal for some $\delta \le \Delta$.
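The counting scheme can be sketched as follows, under assumptions that go beyond the text and are flagged here: $N(e, f)$ is taken to be the product of the prefix and suffix counts, and canonical paths whose maximal $P_1$ segment is the whole path (so that no edge $f$ exists) are counted separately through the edges entering $t$. The function name `count_canonical_paths` and the toy $\delta$ and $\varepsilon$ values (a hypothetical five-edge DAG, not Fig. 5) are illustrative.

```python
def count_canonical_paths(delta, eps, order, s, t):
    """Count canonical (s, t)-paths from per-edge delta and eps values."""
    preds, succs = {}, {}
    for (u, v) in delta:
        preds.setdefault(v, []).append((u, v))   # edges entering v
        succs.setdefault(u, []).append((u, v))   # edges leaving u
    pos = {v: i for i, v in enumerate(order)}
    by_tail = sorted(delta, key=lambda e: pos[e[0]])

    # N_can(e): prefixes ending with e whose running eps sum equals delta throughout.
    n_can = {}
    for e in by_tail:
        if e[0] == s:
            n_can[e] = 1
        else:
            n_can[e] = sum(n_can[p] for p in preds.get(e[0], [])
                           if delta[p] + eps[e] == delta[e])

    # N_opt(f): suffixes starting with f that use only edges with eps == 0.
    n_opt = {}
    for f in reversed(by_tail):
        if eps[f] != 0:
            n_opt[f] = 0
        elif f[1] == t:
            n_opt[f] = 1
        else:
            n_opt[f] = sum(n_opt[g] for g in succs.get(f[1], []))

    # Assumption: paths whose P_1 segment reaches t have no pair (e, f).
    total = sum(n_can[e] for e in preds.get(t, []))
    for e in delta:
        for f in succs.get(e[1], []):
            if delta[e] + eps[f] > delta[f]:
                total += n_can[e] * n_opt[f]
    return total

DELTA = {("s", "a"): 0, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 0, ("b", "t"): 0}
EPS = dict(DELTA)  # on this toy graph eps happens to equal delta on every edge
print(count_canonical_paths(DELTA, EPS, ["s", "a", "b", "t"], "s", "t"))
```

On the toy graph all three $(s, t)$-paths are canonical: two are counted through the edges entering $t$ and one through the pair formed by the edges (s, b) and (b, t).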

SUMMARY

This paper introduces a graphical representation of a large set of alignments whose score is within the neighborhood of the optimum. We define and study the combinatorics of a canonical set of near-optimal alignments.

We then introduce a transformation on the weights of the edit distance graph that considerably simplifies the tasks of enumerating alignments and counting near-optimal alignments that are within $\delta$ of the optimum. After a first stage that requires $O(nm)$ time and space, the method generates the next best alignment in $O(m)$ time and $O(K)$ space, where $K$ is the number of alignments output so far. We can then also count the number of near-optimal alignments that are $0, 1, 2, \ldots, \Delta$-near-optimal in $O(\Delta |E_\Delta|)$; although $|E_\Delta| = O(nm)$, it is typically linear in $n + m$.

We provide several biological examples that demonstrate the applicability of the methods to biological sequences. We show how an alignment that is correct biologically can be missed if only optimal alignments are computed. We then show, by comparing the heavy and the light chain of a human immunoglobulin Fab fragment, that by considering a large set of good alignments one can detect features that would otherwise require an alignment of many sequences. We also show that the number of near-optimal alignments as a function of $\delta$, where $\delta$ is the distance from the optimal score, can serve as a good discriminator between random and biological sequences, and can also reveal which set of parameters (e.g., PAM value, gap penalty, etc.) best constrains the biologically relevant alignments.

ACKNOWLEDGMENTS

We would like to thank Dan Gusfield, Gene Lawler, and Martin Vingron for valuable discussions, and Tod Klingler for his constant help, and in particular for interpreting the graphs of Figs. 2 and 3.

This work is supported in part by grants LM05305 and LM05716 from the National Library of Medicine. D.N. is supported by a Postdoctoral Fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley, under National Science Foundation Grant DMS-9720208.

REFERENCES

Altschul, S.F., and Erickson, B.W. 1986. Optimal sequence alignment using affine gap costs. Bull. Math. Biol. 48, 603-616.
Bellman, R., and Kalaba, R. 1960. On Kth best policies. J. SIAM 8, 582-588.
Chao, K.M. 1994. Computing all suboptimal alignments in linear space, 31-42. In M. Crochemore and D. Gusfield, eds. Combinatorial Pattern Matching, Lecture Notes in Computer Science, Springer-Verlag, New York.
Chothia, C., and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826.
Chothia, C., and Lesk, A.M. 1987. The evolution of protein structures. Cold Spring Harb. Symp. Quant. Biol. 52, 399-405.
Clarke, S., Krikorian, A., and Rausen, J. 1963. Computing the N-best loopless paths in a network. J. SIAM 11, 1096-1102.
Doolittle, R.F. 1986. Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley, CA.
Doolittle, R.F., ed. 1990. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences. Methods Enzymol. 183.
Dreyfus, S.E. 1969. An appraisal of some shortest path algorithms. Operations Res. 17, 395-412.
Edmonds, J., and Karp, R.M. 1972. Theoretical improvements in algorithmic efficiency for network flow problems. JACM 19, 248-264.
Fitch, W., and Smith, T. 1983. Optimal sequence alignments. Proc. Natl. Acad. Sci. USA 80, 1382-1386.
Fox, B. 1973. Calculating the Kth shortest paths. Canad. J. Operations Informat. Proc. 11, 66-70.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256, 1443-1445.
Gotoh, O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.
Gusfield, D., Balasubramanian, K., and Naor, D. 1992. Parametric optimization of sequence alignment. Proceedings of the Third Annual ACM-SIAM Joint Symposium Discrete Algorithms, Orlando, Florida, January 1992.
Henikoff, S., and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919.
Hoffman, W., and Pavley, R. 1959. A method for the solution of the Nth best path problem. J. ACM 6, 506-514.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275-282.
Katoh, N., Ibaraki, T., and Mine, H. 1982. An efficient algorithm for K-shortest simple paths. Networks 12, 411-427.
Lawler, E.L. 1972. A procedure for computing the K-best solutions to discrete optimization problems and its applications to the shortest paths problem. Manag. Sci. 18, 401-405.
Lawler, E.L. 1976. Combinatorial Optimization, Networks and Matroids, 374, Holt Rinehart and Winston, New York, NY.
Lesk, A.M., and Chothia, C. 1980. How different amino acid sequences determine similar protein structures: The structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225-270.
Myers, E. 1987. Enumerating Paths and Sequence Alignments in Order of Score, Technical Report 87-19, University of Arizona.
Naor, D., and Brutlag, D.L. 1994. K-BEST: A tool for analysis of near-optimal sequence alignments (in preparation).
Perko, A. 1986. Implementation of algorithms for K shortest loopless paths. Networks 16, 149-160.
Pollack, M. 1961. The kth best route through a network. Operations Res. 9, 578-580.
Sankoff, D., and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
Saqi, M.A.S., and Sternberg, M. 1991. A simple method to generate non-trivial alternate alignments of protein sequences. J. Mol. Biol. 219, 727-732.
Schoniger, M., and Waterman, M.S. 1992. A local algorithm for DNA sequence alignment with inversions. Bull. Math. Biol. 54, 521-536.
Sellers, P.H. 1974. An algorithm for the distance between two finite sequences. J. Comb. Theory (A) 16, 253-258.
Shier, D.R. 1979. On algorithms for finding the K-shortest paths in a network. Networks 9, 195-214.
Vingron, M. 1991. Multiple Sequence Alignment and Applications in Molecular Biology, Technical Report 91-92, Universität Heidelberg.
Vingron, M., and Argos, P. 1989. A fast and sensitive multiple sequence alignment algorithm. CABIOS 5, 115-121.
Vingron, M., and Argos, P. 1990. Determination of reliable regions in protein sequence alignments. Prot. Eng. 3, 565-569.
Waterman, M.S. 1983. Sequence alignments in the neighborhood of the optimum with general application to dynamic programming. Proc. Natl. Acad. Sci. USA 80, 3123-3124.
Waterman, M.S., and Byers, T.H. 1985. A dynamic programming algorithm to find all solutions in a neighborhood of the optimum. Math. Biosci. 77, 179-188.
Waterman, M.S., and Eggert, M. 1987. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J. Mol. Biol. 197, 723-728.
Waterman, M.S., Eggert, M., and Lander, E. 1992. Parametric sequence comparisons. Proc. Natl. Acad. Sci. USA 89, 6090-6093.
Wu, T.T., and Kabat, E. 1970. An analysis of the sequences of the variable regions of Bence-Jones proteins and myeloma light chains and their implications for antibody complementarity. J. Exp. Med. 132, 211.
Yen, J.Y. 1971. Finding the K shortest loopless paths in a network. Manag. Sci. 17, 712-716.
Zuker, M. 1991. Suboptimal sequence alignment in molecular biology: Alignment with error analysis. J. Mol. Biol. 221, 403-420.

Address reprint requests to:
Dr. Douglas L. Brutlag
Department of Biochemistry
Stanford University Medical Center
Stanford, CA 94305-5307
e-mail: [email protected]

Received for publication August 3, 1994; accepted as revised November 8, 1994.