markovian structures in biological sequence alignments

Markovian Structures in Biological Sequence Alignments
Author(s): Jun S. Liu, Andrew F. Neuwald and Charles E. Lawrence
Source: Journal of the American Statistical Association, Vol. 94, No. 445 (Mar., 1999), pp. 1-15
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2669673



Markovian Structures in Biological Sequence Alignments

Jun S. Liu, Andrew F. Neuwald, and Charles E. Lawrence

The alignment of multiple homologous biopolymer sequences is crucial in research on protein modeling and engineering, molecular evolution, and prediction in terms of both gene function and gene product structure. In this article we provide a coherent view of the two recent models used for multiple sequence alignment, the hidden Markov model (HMM) and the block-based motif model, to develop a set of new algorithms that have both the sensitivity of the block-based model and the flexibility of the HMM. In particular, we decompose the standard HMM into two components: the insertion component, which is captured by the so-called "propagation model," and the deletion component, which is described by a deletion vector. Such a decomposition serves as a basis for rational compromise between biological specificity and model flexibility. Furthermore, we introduce a Bayesian model selection criterion that, in combination with the propagation model, genetic algorithm, and other computational aspects, forms the core of PROBE, a multiple alignment and database search methodology. The application of our method to a GTPase family of protein sequences yields an alignment that is confirmed by comparison with known tertiary structures.

KEY WORDS: DNA sequence; Evolution; Gibbs sampler; GTPase; Hidden Markov model; MAP criterion; Model selection; Protein sequence; Sequence comparisons.

1. INTRODUCTION

All of the hereditary information of an individual organism is contained in its genome, which comprises sequences of the four DNA bases (nucleotides), A, T, C, and G. Proteins, chains of 20 different amino acid residues, are the action molecules of life and are "spelled" (coded) by segments of the genome, called genes. The universal genetic code is used to translate triplets of DNA bases, called codons, to the 20-letter alphabet of proteins (Campbell 1995). For example, codons CCA and CCG are both translated into the amino acid proline (abbreviated as Pro or P). The biotechnology revolution and many genome sequencing projects have resulted in large and rapidly growing databases of DNA sequences. A rapidly growing database of protein sequences has been derived from the DNA sequences using the universal genetic code. Both are available over the Internet (e.g., http://www.ncbi.nlm.nih.gov).
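The codon-to-residue translation described above can be sketched in a few lines of Python. This is only an illustration: the names `CODON_TABLE` and `translate` are hypothetical, and the table is the standard genetic code written in the conventional TCAG ordering, with `*` marking stop codons.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code; '*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

With this table, `translate("CCA")` and `translate("CCG")` both give `"P"`, matching the proline example in the text.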

Because DNA and proteins are unbranched heteropolymers, they can be characterized by sequences of letters representing the monomers that form them. Accordingly, the data in these databases are sequences of letters using p-letter (p = 4 for DNA, p = 20 for proteins) alphabets without punctuation or space characters. Table 1 shows typical protein sequences of two GTPases, whose structure and sequence comparison is provided in Section 6. Can we tell if, and if so how, they are related?

Computational molecular biology, which emerged about 20 years ago, focuses mainly on the analysis of such data. Recently this field has been the subject of great interest in the biotechnology and pharmaceutical industries (Marshall 1996; Taubes 1996).

Jun S. Liu is Assistant Professor, Department of Statistics, Stanford University, Stanford, CA 94305. Andrew F. Neuwald is Assistant Investigator, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724. Charles E. Lawrence is Chief of Biometrics Lab, Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany, NY 12201. This work was supported in part by Department of Energy grant DE-FG02-96ER62266, National Institutes of Health grant R01 HG01257-01, National Science Foundation grants DMS-9404344 and DMS-9501570, and the Stanford Terman Fellowship. The authors are grateful to Lee Ann McCue and Ye Ding for proofreading the manuscript and to the editor, associate editor, and two referees for their many valuable suggestions.

1.1 Bioinformatics and Sequence Alignment

The most important contribution of computational biology has been in the development of methods for extracting information from the biopolymer sequence databases via sequence comparison, characterization, and classification, tasks that interest many statisticians. Sequence alignment methodology is central to all of these methods.

It is commonly believed that today's biopolymer sequences evolved from ancestral sequences through mutation and selection. Evolutionary theory holds that stochastic mutational events may alter the genome of an individual and that these changes may be passed to progeny. The likelihood that a given mutation is maintained through generations is determined by its contributions to the fitness of the progeny via a stochastic process called natural selection. At the molecular level, the effect of a mutation on the structure and/or function of a gene product determines the mutation's contribution to the organism's fitness. Therefore, sequence comparison methods help reveal information about biopolymer structure and function, as well as the biological process of molecular evolution.

To aid in the understanding of sequence alignment, let us consider an intentionally oversimplified example. Imagine writing the sentence

Many of our friends love statistics jokes

and asking three children to copy it. You might then obtain three "noisy copies":

Mamy of yous fryers need longer spokes
Mony of your own stripeded lovers are nicest
Monkeys of ours friendleys have stinking smokes.

© 1999 American Statistical Association. Journal of the American Statistical Association, March 1999, Vol. 94, No. 445, Applications and Case Studies.

By showing these "noisy" copies to your friends and asking them to guess what you originally wrote, you may make an entertaining game. By comparing the noisy sentences, your guests may be able to identify "essential" parts of the original sentence that have been conserved even though the children's transcriptions contain errors. There are not only typographical errors and misspellings but also inserted or deleted letters and entire words. The following table shows an alignment of the noisy copies:

Mamy of(y)ous fryers (need)long (s)pokes
Momy of(y)iur (owns)triped(ed) love(rs are) nices(t)
Monk(eys)of ovr(s) friend(leys) have (stinking s)mokes

Here the letters in parentheses are noisy insertions or deletions from one or a few sentences. From the alignment, one may guess several words, including "many," "of," "our," "love," and even perhaps "jokes" and "friends." On the other hand, some words in the original sentence (e.g., "statistics") have been deleted in the copied sentences. Some sentences have words that are not present in any of the other sentences. Nevertheless, game players may be able to infer the main theme of the original sentence. More generally, an alignment problem involves three interrelated tasks: (a) identification of the models (e.g., parameters for letter frequencies) at aligned (conserved) positions, (b) word alignment, and (c) determination of the extent to which common features are conserved in the sentences.

This simple example illustrates a rough approximation to biological reality and the possibility of obtaining important information by comparing related biopolymer sequences. However, the biopolymer alignment problem is much more complicated than the foregoing game. As shown in Table 1, biopolymer sequences lack known rules of grammar, have only a small vocabulary of known "words," contain no blank or punctuation characters, and are unpredictable in many ways. More seriously, the biological sequences available for analysis do not all evolve down independent pathways from a single progenitor, as in the foregoing game. Rather, they evolve through generations of progeny. This process can be represented by an evolutionary tree that is rarely observable.

Some methods that incorporate the evolutionary process to align pairs of sequences have been described (Allison, Wallace, and Yee 1992; Bishop and Thompson 1986; Thorne, Kishino, and Felsenstein 1991, 1992). More recently, Zhu, Liu, and Lawrence (1998) proposed a Bayesian alignment procedure that produces the posterior distribution of the evolutionary distances between pairs of sequences. For multiple sequences, however, the inferences on alignments and on phylogenetic trees are interdependent, and each has been shown to be NP-hard. Because of this inherent computational complexity and other reasons, efforts to simultaneously address both problems (i.e., tree alignment methods) have been limited, and attention has been focused on solving the problems separately. In multiple alignment, the focus of this article, two heuristic approaches, weighting and purging, have been used to address sequence correlations induced by the evolutionary process. In various weighting methods, similar sequences are down-weighted to account for their evolutionary closeness. Alternatively, the purging method that we use here removes closely related sequences to achieve an approximate independence for those remaining sequences.

Table 1.

H-Ras P21 Protein (Protein Data Bank Accession: 121P)

MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQ
YMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIP
YIETSAKTRQGVEDAFYTLVREIRQH

Elongation Factor Tu (EF-TU, Swiss Prot Accession: P02990)

MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEY
DTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDM
VDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALEGDAEWEAKILELAGFLDSYIPEPERAIDKP
FLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLR
GIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEM
VMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

Databases of biopolymers that have been experimentally shown to have related structures [including SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/), MMDB (http://www.ncbi.nlm.nih.gov/Structure/), and DALI (http://www.embl-heidelberg.de/dali/dali.html)] and functions [such as Swiss-Prot (http://expasy.hcuge.ch/sprot/sprot-top.html)] provide a basis for examining the utility of methods aimed at predicting these characteristics. In contrast, insufficient data are available for a similar examination of the methods for inferring molecular evolution. Although theory for simultaneously addressing evolution and multiple sequence alignment is important and much needed, we find that our methods, based on the assumption of independence after purging, often work well even when the data show substantial departure from such an assumption (Liu, Neuwald, and Lawrence 1995; Neuwald, Liu, Lipman, and Lawrence 1997; Qu and Lawrence 1998). Independently, Henikoff, Henikoff, Alford, and Pietrokovski (1995) have shown that these methods work well in conjunction with heuristic weighting procedures.

1.2 Traditional Approach and New Statistical Models

In the traditional routine for comparing two sequences, a heuristic criterion for the goodness of the alignment is selected and fixed, an efficient algorithm is designed to optimize such a criterion, and finally large deviation theory is applied to assess the statistical significance of such alignments. Popular methods for comparing a pair of sequences have been given by Needleman and Wunsch (1970) and Smith and Waterman (1981), and the methods for searching the database to find a sequence related to the query sequence were developed by Altschul, Gish, Miller, Myers, and Lipman (1990) and Pearson and Lipman (1988). The first statistical approach to sequence alignment was given by Bishop and Thompson (1986) (see Karlin and Brendel 1992 and Waterman 1995 for more references).

These pairwise comparison methods have helped in many recent biological discoveries. For example, application of a pairwise sequence alignment method played a key role in the identification and characterization of a recently discovered human cancer gene (Bronner et al. 1994). However, for aligning multiple biopolymer sequences, the pairwise comparison methods have limitations in efficiency and accuracy that are particularly pronounced when sequences have many typographical errors, misspellings, insertions, and deletions; that is, when the sequences are subtly related. The rapid growth of the sequence databases has also begun to reduce the utility of these pairwise methods. Specifically, after adjusting for the large number of multiple comparisons, the comparison scores obtained by chance from random sequences are creeping into the range of the comparison scores for truly related sequences (Claverie 1996; Henikoff and Henikoff 1991).

Two statistical models for multiple alignment have recently been developed: the block-motif model, which describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu, and Lawrence 1995), and the hidden Markov model (HMM), which treats the observed sequences as though they were generated by a hypothetical ancestral model via mutation (Baldi, Chauvin, McClure, and Hunkapiller 1994; Eddy 1995; Krogh, Brown, Mian, Sjolander, and Haussler 1994). By using a model similar to the HMM to describe how two sequences relate to each other, Allison and Wallace (1994) presented useful algorithms for conducting multiple alignment considering information on evolution (i.e., assuming that the evolutionary tree is known). Important common features of these methods are that they use explicit statistical models and they treat the multiple alignment problem as a statistical inference problem. These statistical algorithms, which are reviewed and analyzed in Section 2, have addressed the alignment tasks (a) and (b) mentioned in Section 1.1. However, despite its critical importance, the model selection problem inherent in multiple sequence alignment, task (c), has received scant attention (Lawrence et al. 1993). To redress this, an approximate Bayesian model selection procedure is described in Section 4. This procedure, in combination with an improved alignment algorithm (described in Section 3), is a key feature of PROBE, a multiple alignment and database search methodology (Neuwald et al. 1997).

1.3 Modeling the Sequence Alignment

Current biopolymer sequences are believed to have arisen from a common ancestral DNA sequence through evolution. This evolutionary process consists of two types of events: point mutations and recombinations. Like typographical errors, point mutations change the identity of the bases in the sequence, whereas recombinations yield insertions and deletions that misalign the sequences (Lawrence and Reilly 1996). Over time, these mutational events produce numerous families of related protein or DNA sequences that may be responsible for several different but related functions. In addition, proteins that perform the same function in different species may evolve substantially. The analyses of these sequence data hinge on aligning the sequences to discover relationships. Because of biological constraints of life, natural selection eliminates mutations in those portions of biopolymers that play key roles in structure or function. These constraints provide alignment information even when the sequences are evolutionarily distant (Liu et al. 1995).

Four types of recombination events are possible:

* Segments of the gene may be deleted.
* Extra segments may be inserted.
* Segments may be duplicated.
* Segments may be transposed (i.e., some segments of DNA cut from their original locations and inserted into sites at new locations).

Within genes and adjoining regions, insertions and deletions are the most common events, and transpositions and duplications are less common. Because duplications can be dealt with by using methods of Liu et al. (1995) and Neuwald et al. (1995), and because transpositions are very rare, we assume throughout that the latter two recombination events can be safely ignored. Under this assumption, a powerful recursive relationship comes to bear. This recursive relationship forms the basis of popular dynamic programming algorithms for the alignment of a pair of sequences and is also key to the HMM. Clever algorithms with favorable time and space complexity for solving the combinatoric optimization problem associated with pairwise sequence alignments have been developed (Needleman and Wunsch 1970; Smith and Waterman 1981).
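The recursive relationship behind pairwise dynamic programming can be sketched as follows. This is a minimal global-alignment score in the spirit of Needleman and Wunsch (1970), not the authors' code; the function name and the scoring values (`match`, `mismatch`, and a linear `gap` penalty) are illustrative assumptions.

```python
def global_alignment_score(x: str, y: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of x and y under a linear gap penalty."""
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap                 # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap                 # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = score[i - 1][j] + gap        # residue of x aligned to a gap
            left = score[i][j - 1] + gap      # residue of y aligned to a gap
            score[i][j] = max(diag, up, left)
    return score[n][m]
```

Each cell depends only on its three neighbors (above, left, and diagonal), so the table is filled in time and space proportional to the product of the two sequence lengths.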

These alignment algorithms provide for flexibility in alignment by permitting insertions or deletions between all residues of a sequence. However, a large number of parameters and an associated loss of sensitivity are the price paid for maintaining this flexibility. The inclusion of gap penalties (i.e., penalizing random insertions) into the alignment objective function ameliorated this problem somewhat. Except in cases where there is substantial prior knowledge about the number of gaps and the size of the conserved blocks, the HMM still lacks sensitivity when the number of the sequences under analysis is small and/or the sequences are only subtly related. In fact, when there are fewer than 30 sequences to be aligned, traditional pairwise comparison-based methods tend to outperform the existing HMM algorithms (Sonnhammer, Eddy, and Durbin 1996). A further difficulty with the HMM approach is that there are no well-founded criteria for determining the alignment model (e.g., how many model positions to include and what penalty terms to use) for a particular problem and no clear account of the uncertainty associated with the particular alignment resulting from the algorithm.

Block motif-based Gibbs sampling strategies have the ability to align subtly related sequences even when the number of sequences available for analysis is limited (Lawrence et al. 1993; Neuwald et al. 1995). They achieve this added sensitivity through two basic characteristics of functionally related proteins:

* Point mutations and recombinations tend to be limited in functionally or structurally conserved regions of distantly related proteins (Liu et al. 1995). To capitalize on this observation, these strategies limit the alignment to ungapped blocks (called block motifs) of the sequences, and in so doing greatly reduce the number of free alignment variables.

* Structural and functional constraints are particularly strong on a limited number of key residue positions, which accordingly are more conserved.

A separate sampling step, fragmentation, enables the algorithm to focus on those more conserved positions (Liu et al. 1995). This step further reduces the number of free parameters and removes the need to specify widths of the block motifs. However, these block motif-based algorithms lose sensitivity and are slow to converge when the alignment contains more than three or four motifs. Furthermore, because there are no well-founded criteria for selecting the number of motifs and the number of conserved positions, these values must be specified by the user.

In this article we present ideas to combine HMM with the block motif-based approaches and to address the shortcomings of both methods. In Section 2 we briefly review the single block-motif model and the associated Gibbs sampling algorithm and provide an analysis of the standard HMM for sequence alignment. In Section 3 we describe a model designed to capture the spirit of both approaches, which also contains a generalization of the model for flexible incorporation of deletions. In Section 4 we propose a Bayesian procedure for selecting the number of alignment variables and the number of residue frequency terms to be included in the model. Finally, in Section 5 we describe the application of these methods to enzymes that hydrolyze guanine triphosphate (GTPases).

2. TWO MODELS FOR MULTIPLE ALIGNMENT

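For vectors $v = (v_1, \ldots, v_p)^T$ and $\theta = (\theta_1, \ldots, \theta_p)^T$, we use the following notations throughout the rest of the article:

\[
|v| = |v_1| + \cdots + |v_p|, \qquad
v + \theta = (v_1 + \theta_1, \ldots, v_p + \theta_p)^T, \qquad
v/\theta = (v_1/\theta_1, \ldots, v_p/\theta_p)^T,
\]

\[
\theta^v = \theta_1^{v_1} \cdots \theta_p^{v_p}, \qquad
\Gamma(v) = \Gamma(v_1) \cdots \Gamma(v_p),
\]

and

\[
v! = v_1! \cdots v_p! \quad \text{(if the $v_i$ are integers).}
\]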

The notation $R$ or $R_k$ denotes a single observed biopolymer sequence, where $R = (r_1, \ldots, r_n)$ and $R_k = (r_{k,1}, \ldots, r_{k,n_k})$, with the $r$'s as residues. $\mathbf{R}$ denotes a collection of multiple sequences, $R_1, \ldots, R_K$, each written as a row vector. So we can write $\mathbf{R} = (R_1, \ldots, R_K)^T$.

The counting function $h(\cdot)$, whose argument is a set of residues (or base pairs), counts how many residues of each type the set contains. For example, if $R$ is a protein sequence, then $h(R)$ returns a vector of length 20 with the counts of each type of amino acid in $R$. The symbols $\theta_0$, $\theta_j$, and $\Theta$ represent the model parameters for the underlying probability laws (e.g., multinomial or product multinomial distributions) that generate every residue in the sequence. An ungapped segment in a biopolymer sequence that is believed to be conserved across functionally or structurally related sequences is termed a motif element, or simply an element. The word "motif" is used to describe the residue frequency pattern for these motif elements among multiple sequences. Mathematically, a motif is determined by its underlying product multinomial model (Liu et al. 1995).
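As a concrete reading of the counting function, the following sketch returns the count vector for a set of residues. The alphabet ordering (alphabetical by one-letter code) and the decision to ignore symbols outside the alphabet are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, ordered alphabetically by one-letter code (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def h(residues, alphabet=AMINO_ACIDS):
    """Return the vector of counts of each alphabet letter among the residues.

    Symbols that are not in the alphabet are silently ignored.
    """
    counts = Counter(residues)
    return [counts[a] for a in alphabet]
```

For instance, `h("GATTACA", alphabet="ACGT")` gives `[3, 1, 1, 2]`, and applying `h` to a protein sequence gives a vector of length 20.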

2.1 The Block-Motif Model and Its Alignment

A characteristic view of this approach is that certain segments of the biopolymers (i.e., subsequences critical to the biopolymers' structure or function) tend to be conserved against mutations. Because this conservation protects against both point mutations and recombinations, sequence conservation in distantly related biopolymers presents itself in the form of sets of ungapped subsequences (blocks). To capture this basic biological concept, we use a simple stochastic model, the block-motif model, as the probabilistic mechanism to generate the set of homologous biopolymer sequences. This model's graphical representation is as follows:

[Figure: sequence k drawn as a line of length n_k, with a single shaded motif block of width w.]

With this model, each sequence $R_k$ ($k = 1, \ldots, K$) contains only one occurrence (called an element) of a single motif, as illustrated by the shaded block, with a starting position $a_k$. For the set of $K$ sequences, $\mathbf{R}$, we let $A = \{a_1, \ldots, a_K\}$, in which $1 \le a_k \le n_k - w + 1$. We call $A$ the alignment variable for $\mathbf{R}$. Furthermore, we use $A_{[-k]}$ to denote all of $A$ but $a_k$, use $A + i = \{a_1 + i, \ldots, a_K + i\}$ to denote the set of $i$-shifted positions of $A$, and use $\{A\} = \{a_k + j - 1 : k = 1, \ldots, K,\ j = 1, \ldots, w\}$ to represent the set of residue indices occupied by the motif elements with alignment variable $A$. For any set $C$ of indices, $\mathbf{R}_C$ represents the collection of the residues indexed by elements of $C$. For example, given any alignment variable $A$, we have $\mathbf{R}_{\{A\}} = \{r_{k, a_k + j - 1} : j = 1, \ldots, w;\ k = 1, \ldots, K\}$.

Residues outside the conserved motif element are treated as iid observations from a common multinomial distribution, called the background model, with $p$ categories ($p = 20$ for proteins and $p = 4$ for DNA or RNA), which can be represented by the probability vector $\theta_0 = (\theta_{1,0}, \ldots, \theta_{p,0})^T$, where $\theta_{1,0} + \cdots + \theta_{p,0} = 1$ and $\theta_{i,0} > 0$ for all $i$.




Residues within the motif element are modeled by a product multinomial distribution $PM(\Theta)$ (Liu et al. 1995), where $\Theta = (\theta_1, \ldots, \theta_w)$ and $\theta_j = (\theta_{1j}, \ldots, \theta_{pj})^T$, with $|\theta_j| = \sum_{i=1}^{p} \theta_{ij} = 1$. In other words, the residue in the $j$th position of a motif element is independently generated from the multinomial distribution with parameter $\theta_j$. Therefore, a total of $w + 1$ parameter vectors, each of dimension $p - 1$, are required to fully describe the data.
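The generative reading of the block-motif model can be sketched as a small simulator: pick a start position, draw motif positions from the column-specific multinomials, and draw everything else from the background. The function names, the DNA alphabet, and the uniform choice of the start position are assumptions for illustration only.

```python
import random


def sample_sequence(n, theta0, theta, alphabet="ACGT", rng=random):
    """Draw one sequence of length n containing a single motif element.

    theta0 is the background probability vector; theta is a list of w
    column probability vectors (the product multinomial model).
    Returns the sequence and the 1-based start position a of the element.
    """
    w = len(theta)
    a = rng.randint(1, n - w + 1)             # uniform over the allowed starts
    seq = []
    for pos in range(1, n + 1):
        if a <= pos < a + w:
            probs = theta[pos - a]            # motif column j = pos - a + 1
        else:
            probs = theta0                    # background model
        seq.append(rng.choices(alphabet, weights=probs, k=1)[0])
    return "".join(seq), a


def n_free_parameters(w, p):
    """(w + 1) probability vectors, each with p - 1 free components."""
    return (w + 1) * (p - 1)
```

For a protein motif of width 10, `n_free_parameters(10, 20)` gives 209 free residue-frequency parameters.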

As discussed by Liu et al. (1995), although these block- motif models are insufficient to characterize a biopolymer (e.g., bases in DNA sequences are known to have serial correlations and G-C rich regions), they do describe the effect of sequence conservation among homologous sequences. The challenging alignment problem corresponds to simultaneously finding the locations of the motif elements and characterizing the residue frequencies in the motifs. As an introduction to our general methodology, we review the single block-motif model and its alignment as treated by Liu et al. (1995).

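For any given $A$, we can write the complete-data likelihood function as

\[
\pi(\mathbf{R} \mid \theta_0, \Theta, A) \propto \theta_0^{h(\mathbf{R}_{\{A\}^c})} \prod_{j=1}^{w} \theta_j^{h(\mathbf{R}_{A+j-1})}
= \theta_0^{h(\mathbf{R})} \prod_{j=1}^{w} \left( \frac{\theta_j}{\theta_0} \right)^{h(\mathbf{R}_{A+j-1})}.
\]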
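The two forms of the complete-data likelihood agree because the background counts equal the total counts minus the motif-column counts. The sketch below checks this numerically on the log scale for a toy DNA example; the function names and the toy data are hypothetical, and start positions are 1-based as in the text.

```python
import math

ALPHABET = "ACGT"


def h(residues):
    """Count vector over the DNA alphabet."""
    return [sum(r == a for r in residues) for a in ALPHABET]


def log_power(theta, v):
    """log(theta^v) = sum_i v_i * log(theta_i), the vector-power notation."""
    return sum(vi * math.log(ti) for ti, vi in zip(theta, v) if vi > 0)


def loglik_direct(seqs, starts, theta0, theta):
    """Background residues scored by theta0, motif column j scored by theta[j]."""
    w = len(theta)
    background = [r for s, a in zip(seqs, starts)
                  for pos, r in enumerate(s, start=1) if not (a <= pos < a + w)]
    total = log_power(theta0, h(background))
    for j in range(w):
        column = [s[a - 1 + j] for s, a in zip(seqs, starts)]
        total += log_power(theta[j], h(column))
    return total


def loglik_ratio(seqs, starts, theta0, theta):
    """All residues scored by theta0, then corrected by the ratios theta_j / theta0."""
    total = log_power(theta0, h("".join(seqs)))
    for j in range(len(theta)):
        column = [s[a - 1 + j] for s, a in zip(seqs, starts)]
        ratio = [tj / t0 for tj, t0 in zip(theta[j], theta0)]
        total += log_power(ratio, h(column))
    return total
```

The ratio form is the convenient one for alignment: changing a single start position changes only the product over motif columns, while the background term for all residues stays fixed.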

Now let the prior for $\theta_0$ be a Dirichlet distribution $D(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_p)$; let the prior for $\Theta$ be a product Dirichlet $D(B)$, with $B = (\beta_j)$, $j = 1, \ldots, w$, and $\beta_j = (\beta_{ij},\ i = 1, \ldots, p)$; and let $A$ be uniform a priori. Then, as was partly derived by Liu (1994) and implemented by Lawrence et al. (1993), we have an explicit form for the conditional posterior predictive distribution $\pi(a_k \mid A_{[-k]}, \mathbf{R})$ by integrating out the parameters $\theta_0$ and $\Theta$, which can be well approximated by

\[
\pi(a_k = i \mid A_{[-k]}, \mathbf{R}) \propto \prod_{j=1}^{w} \left( \frac{\hat{\theta}_j}{\hat{\theta}_0} \right)^{h(r_{k,i+j-1})}, \tag{1}
\]

where the $\hat{\theta}$ are the posterior means of the $\theta$, given the observed sequence data $\mathbf{R}$ and the current alignment $A_{[-k]}$. This approximate conditional distribution is used in a Gibbs sampling algorithm to do the local alignment and can be applied recursively to align multiple motifs (Lawrence et al. 1993).
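One update of a Gibbs sampler based on approximation (1) might look like the following sketch. It is not the authors' implementation: the symmetric pseudocounts `alpha` and `beta`, the DNA alphabet, the function names, and the choice to estimate the background from the non-motif residues of the other sequences are all assumptions made to keep the example short.

```python
import random

ALPHABET = "ACGT"


def posterior_mean(counts, pseudo):
    """Posterior mean of a multinomial parameter under a symmetric Dirichlet prior."""
    total = sum(counts) + pseudo * len(counts)
    return [(c + pseudo) / total for c in counts]


def predictive_weights(seqs, starts, k, w, alpha=1.0, beta=0.5):
    """Unnormalized weights for a_k = 1, ..., n_k - w + 1, following (1)."""
    idx = {a: i for i, a in enumerate(ALPHABET)}
    motif = [[0] * len(ALPHABET) for _ in range(w)]
    background = [0] * len(ALPHABET)
    for m, (s, a) in enumerate(zip(seqs, starts)):
        if m == k:
            continue                          # condition on A[-k]: leave sequence k out
        for pos, r in enumerate(s, start=1):
            if a <= pos < a + w:
                motif[pos - a][idx[r]] += 1
            else:
                background[idx[r]] += 1
    theta0 = posterior_mean(background, alpha)
    theta = [posterior_mean(col, beta) for col in motif]
    target = seqs[k]
    weights = []
    for i in range(1, len(target) - w + 2):
        wt = 1.0
        for j in range(w):
            r = idx[target[i - 1 + j]]
            wt *= theta[j][r] / theta0[r]     # ratio of motif to background probability
        weights.append(wt)
    return weights


def sample_start(seqs, starts, k, w, rng=random):
    """Draw a new 1-based start position for sequence k from the weights."""
    weights = predictive_weights(seqs, starts, k, w)
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]
```

A full sampler would cycle `sample_start` over k = 1, ..., K many times; the sketch covers only the single conditional draw that the text describes.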

2.2 Hidden Markov Model for Sequence Alignment

The HMM, initially introduced in the late 1960s, is a powerful statistical modeling tool that has been widely applied in such areas as signal processing, speech recognition, and time series analysis (Rabiner 1989). The method was first applied to model biological sequences by Churchill (1989) and recently has become very popular in multiple sequence alignment (Baldi et al. 1994; Krogh et al. 1994; Lazareva and Churchill 1997).

The basic form of an HMM can be written as

\[
y_t \sim f_t(\cdot \mid h_t), \tag{2}
\]

\[
h_t \sim g_t(\cdot \mid h_{t-1}), \tag{3}
\]

where $f_t$ and $g_t$ are probability distributions (known up to some estimable parameters), the $y_t$ are observations, and the $h_t$ form a (possibly time-inhomogeneous) Markov chain and are unobservable (i.e., hidden). The dynamic linear model (West and Harrison 1989), or the so-called "state-space" model in time series analysis, is a special case of this model.

In the evolution of protein sequences, segment transpositions are rare. Thus, although the sequences become misaligned via insertions and deletions, conserved residues remain in order. By using this characteristic, the HMM not only captures an important feature of protein evolution, but also results in an effective algorithm.

The HMM structure for multiple sequence alignment treats the sequences to be aligned as iid observations from a probabilistic mechanism (i.e., the HMM) that perturbs a hypothetical common "ancestral" model sequence (called "model"), denoted as $M = (M_1, \ldots, M_L)$. Here each $M_l$ is regarded as an abstract residue and is represented by a probability vector $\theta_l$ of length $p$ (4 for DNA sequences and 20 for proteins). When generating biological sequences, the types of perturbations allowed, which are not observable (and thus hidden), are point mutations, insertions, and deletions. Figure 1 illustrates such a model with $L = 3$: A residue in an observed sequence is generated either by some $M_l$ or by an insertion, which is modeled by a probability vector $\theta_0$ of length $p$. (More details on how to generate an observed sequence from $M$ can be found in Baldi et al. 1994 and Krogh et al. 1994.)

One can also think of the process of generating a sequence, say $R = (r_1, \ldots, r_n)$, from $M$ as choosing a path through an $(n + 1) \times (L + 1)$ table starting from the upper left corner and ending in the lower right corner (Fig. 2).

The columns for this table are denoted by $M_0, \ldots, M_L$, which correspond to a void starting position and $L$ model positions. The rows, $r_0, \ldots, r_n$, correspond to a void starting residue and the $n$ observed sequence residues. The moves that are allowed in this table are of three types: horizontal to the right, vertical down, and diagonal down to the right.
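To get a feeling for how many paths such a table admits, the following sketch counts all paths from the upper left to the lower right corner using only the three allowed moves. The function name is hypothetical, and the count covers every table path, before any extra constraints are imposed.

```python
def count_paths(n: int, L: int) -> int:
    """Number of paths through the (n + 1) x (L + 1) table that use only
    right, down, and diagonal (down-right) moves."""
    # table[i][j] = number of paths from the upper-left corner to cell (i, j)
    table = [[1] * (L + 1) for _ in range(n + 1)]   # one path along each edge
    for i in range(1, n + 1):
        for j in range(1, L + 1):
            table[i][j] = (table[i - 1][j]          # arrive by a vertical move
                           + table[i][j - 1]        # arrive by a horizontal move
                           + table[i - 1][j - 1])   # arrive by a diagonal move
    return table[n][L]
```

Even for a table as small as six residues by three model positions there are 377 paths, which is why recursive algorithms are used rather than enumeration.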

Figure 1. The HMM Architecture for Generating Biopolymer Sequences With L = 3 Model Positions. From each state, it can go to the next model position, an insertion, or a deletion.

Figure 2. The Table-Path Illustration of the HMM. The "ancestral" model sequence is assumed to have four positions, and the observed sequence R is seven residues long. The path in solid arrows presents one particular way of generating the observed sequence R from the model M. [Figure: table with columns M0, M1, M2, M3, END and one row per residue, ending in an END row.]

At any position $(i, j)$ of this table, the next step allowed is to (a) position $(i, j + 1)$, which implies that a deletion of model position $M_{j+1}$ has occurred; (b) position $(i + 1, j)$, which implies that an insertion has occurred; and finally, (c) position $(i + 1, j + 1)$, which means that only a point mutation is allowed. Thus the path depicted by the solid arrows in Figure 2 corresponds to

\[
M_0 \to I_0 \to I_0 \to I_0 \to D_1 \to M_2 \to M_3 \to I_3 \to \text{END}.
\]

Extra constraints are usually needed to make such paths unique.
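The correspondence between table moves and HMM states can be made concrete with a small helper. The one-letter move codes (`"I"` for a vertical move, `"D"` for a horizontal move, `"M"` for a diagonal move) and the function name are assumptions introduced for this sketch.

```python
def path_to_states(moves):
    """Convert a sequence of table moves into HMM state labels.

    "I": vertical move (i + 1, j), an insertion emitted at position j
    "D": horizontal move (i, j + 1), a deletion of model position j + 1
    "M": diagonal move (i + 1, j + 1), a match at model position j + 1
    Returns the list of state labels and the final cell (i, j).
    """
    i = j = 0
    states = ["M0"]                  # void starting position
    for mv in moves:
        if mv == "I":
            i += 1
            states.append(f"I{j}")
        elif mv == "D":
            j += 1
            states.append(f"D{j}")
        elif mv == "M":
            i += 1
            j += 1
            states.append(f"M{j}")
        else:
            raise ValueError(f"unknown move: {mv!r}")
    states.append("END")
    return states, (i, j)
```

Calling `path_to_states("IIIDMMI")` reproduces the state sequence listed above and ends in cell (6, 3), that is, six emitted residues and three model positions.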

Although the architecture for generating observations described in Figure 1 is easily understood, the meaning of the hidden states for this HMM is more subtle. One may naturally think of the model positions $M_l$ as the hidden states. However, the true hidden states are the allowable paths just described that traverse the $(n + 1) \times (L + 1)$ table and generate from $M$ the observed sequence $R$. More precisely, we let

\[
h_1, h_2, \ldots, h_n
\]

be the hidden states that generate the observed sequence $R$. We now formally define the $h_t$ so that the HMM for sequence alignment conforms to the standard form of (2) and (3).

Consider residue $r_1$. Because it must be produced by an insertion or a match state, the path to produce $r_1$ must be one of the following two types: an insertion after $k$ deletions (i.e., $D_1 \cdots D_k I_k$, where $k$ can be $0, 1, \ldots$) or a model position after $k$ deletions (i.e., $D_1 \cdots D_k M_{k+1}$, where $k = 0, 1, 2, \ldots$). Therefore, the hidden state $h_1$ is determined by where the path first reaches the row of $r_1$ and whether $r_1$ is generated from an insertion or a match state. Thus $h_1$ can be abbreviated as $(J_1, \delta_1)$, where $J_1 = 0$ or $1$, indicating whether $r_1$ is generated from a model position or an insertion, and $\delta_1$ records the number of deletions that have occurred. If $\delta_1 = L$, then the only choice for $J_1$ is 1, an insertion. Clearly, $h_1$ takes $2L + 1$ possible values for $\delta_1$ ranging from 0 to $L$.

In general, $h_t$ can be written as $(J_t, \delta_t)$, where $J_t$ indicates whether it is an insertion or a match and $\delta_t$ records the total number of deletions that have occurred. With $h_t = (J_t, \delta_t)$ for residue $r_t$, the next hidden state $h_{t+1} = (J_{t+1}, \delta_{t+1})$ for residue $r_{t+1}$ can be one of two types: $J_{t+1} = 0$ and $\delta_{t+1} = \delta_t, \delta_t + 1, \ldots, L$, or $J_{t+1} = 1$ and $\delta_{t+1} = \delta_t + 1, \ldots, L$. For example, the path indicated by solid arrows in Figure 2 represents the following hidden state coding for the sequence:

Sequence             r_1     r_2     r_3     r_4     r_5     r_6

(J_t, \delta_t)      (0, 0)  (0, 0)  (0, 0)  (1, 1)  (1, 1)  (0, 1)

The transition probabilities between the ht's can be explicitly written down using the parameters encoded in the HMM architecture of Figure 1 (Krogh et al. 1994).
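As an illustration of the coding, the following minimal Python sketch converts a state path into one $(J_t, \delta_t)$ pair per emitted residue. It rests on assumptions that are not taken from the article: the function name `encode_path` is hypothetical, $M_0$ is treated as a silent begin state, $J_t = 1$ is taken to mark a match and $J_t = 0$ an insertion, and $\delta_t$ is the running count of deletions. If the opposite labelling of $J_t$ is intended, the two constants should be swapped.

```python
def encode_path(path):
    """Return one (J_t, delta_t) pair per emitted residue.

    Assumptions (illustrative only): J_t = 1 for a match state, J_t = 0 for an
    insertion, delta_t = number of deletion states visited so far, and M0 is a
    silent begin state that emits nothing.
    """
    coding = []
    deletions = 0
    for state in path:
        kind, position = state[0], int(state[1:])
        if kind == "D":
            deletions += 1                    # deletions emit no residue
        elif kind == "I":
            coding.append((0, deletions))     # insertion emits a residue
        elif kind == "M" and position > 0:
            coding.append((1, deletions))     # match emits a residue
    return coding


# Path shown by the solid arrows in Figure 2 (END omitted).
path = ["M0", "I0", "I0", "I0", "D1", "M2", "M3", "I3"]
coding = encode_path(path)
```

Under these assumptions the path emits six residues, three from insertions before the deletion and three after it.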

3. TOWARD A UNIFICATION: PROPAGATION MODEL

Many alignment problems involve multiple motifs. Although the single block-motif method of Section 2.1 can be applied iteratively in this case, its failure to capture collinear ordering of the motifs makes the method computationally inefficient when more than a few motifs are present (Lawrence et al. 1993; Neuwald et al. 1995). In contrast, the HMM explicitly capitalizes on collinearity to develop efficient recursive algorithms. These models require large numbers of free parameters, however. Specifically, 2LK degrees of freedom are associated with the trinomial (i.e., insertion, deletion, and match) alignment parameters, and n(p - 1) degrees of freedom are associated with residue frequency multinomial distributions, where L is the number of model positions, n is the average sequence length, K is the number of sequences, and p is the size of the alphabet. This large parameter space can lead to a lack of sensitivity.

To redress these complementary limitations, we describe a Markovian propagation model that takes the form of a block-motif HMM, but with substantially fewer free parameters. Briefly, in this approach the conserved region among multiple sequences is modeled as a fixed number of ungapped and collinear blocks (multiple motifs) with flexible gaps between them. Residues not assigned to any motif are modeled by a common background multinomial model.

As has been stated by Krogh et al. (1994), the block-motif model can be regarded as a special HMM that allows no insertions or deletions within a motif. The propagation model can also be viewed as a special HMM with a flexible gap (insertion) distribution. However, the application of a general HMM leaves four major issues to be addressed: determining the sizes of the blocks (Sec. 3.3), determining the number of blocks (Sec. 4), determining the number of conserved columns (Sec. 4), and ensuring efficient computation (Secs. 3.2 and 5.3). Furthermore, the block-motif viewpoint gives us a new look at the modeling of biological sequences and establishes strong connections with mixture modeling and statistical classification methods (Liu et al. 1995).

3.1 The Propagation Model

We begin in a manner similar to the HMM by assuming that there are L conserved model positions for each sequence to be aligned, with the only difference being that each model position represents a block of residues with width $w_l$. Intuitively, we can imagine that L motif elements propagate along a sequence. Insertions are reflected by gaps between adjacent motif elements. No deletions are allowed at this point; this issue is addressed in Section 3.4.

Let $A = (A_{\cdot 1}, \ldots, A_{\cdot L}) = (a_{k,l})_{K \times L}$ be a matrix with $a_{k,l}$ indicating the starting position of the $l$th motif element in sequence $k$, so that the motif elements of sequence $k$ start at positions $a_{k,1}, a_{k,2}, \ldots, a_{k,L}$.

These alignment variables are unobservable. Let vector $A_{\cdot l} = (a_{1,l}, \ldots, a_{K,l})^T$ indicate the starting positions of the $l$th motif in all the sequences, and let $\Theta^{(l)} = (\theta_1^{(l)}, \ldots, \theta_{w_l}^{(l)})$ denote the parameter vector for the product-multinomial model of the $l$th motif, where $w_l$ is the width of the $l$th element. We write $W = w_1 + \cdots + w_L$. The likelihood can be written as

$$
\pi(R \mid A, \theta_0, \Theta) \propto \theta_0^{h(R_{\{A\}^c})} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \left\{\theta_j^{(l)}\right\}^{h(R_{A_{\cdot l}+j-1})}
= \theta_0^{h(R)} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \left(\frac{\theta_j^{(l)}}{\theta_0}\right)^{h(R_{A_{\cdot l}+j-1})}, \qquad (4)
$$

where $A_{\cdot l} + j - 1 = (a_{1,l} + j - 1, \ldots, a_{K,l} + j - 1)^T$. Using the same reasoning as in Section 2.1 for the single-motif case, we can integrate over the $\theta$'s to simplify computation. If $w_l = 1$ for all $l$, then the $A_{\cdot l}$ correspond to those positions such that $J_t = 1$ in Section 2.2, and the model is equivalent to an HMM with no deletions.

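We assume a priori that $\theta_0 \sim D(\alpha)$ and $\Theta^{(l)} \sim D(B^{(l)})$ and are independent of $A$ for $l = 1, \ldots, L$, where $B^{(l)} = (\beta_1^{(l)}, \ldots, \beta_{w_l}^{(l)})$. Then the likelihood function of $A$, with the $\theta$'s integrated out, is

$$\pi(R \mid A) \propto \Gamma\{h(R_{\{A\}^c}) + \alpha\} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \Gamma\{h(R_{A_{\cdot l}+j-1}) + \beta_j^{(l)}\}.$$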
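The integrated likelihood is a product of Gamma-function terms over the background residues and over each aligned column. The sketch below evaluates its logarithm for a toy DNA alignment. It is illustrative only: the function names and the four-letter alphabet are hypothetical, start positions are 0-based, and the Gamma function of a vector is assumed to mean $\Gamma(v) = \prod_i \Gamma(v_i) / \Gamma(\sum_i v_i)$, which is the usual convention for Dirichlet-multinomial integrals but is an assumption here.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACGT"  # toy alphabet, chosen only for illustration


def log_gamma_vec(v):
    """log of prod_i Gamma(v_i) / Gamma(sum_i v_i) (assumed vector convention)."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def counts(residues):
    """Count vector h(.) over the alphabet."""
    c = Counter(residues)
    return [c.get(r, 0) for r in ALPHABET]


def log_integrated_likelihood(seqs, starts, widths, alpha, betas):
    """log pi(R | A) up to an additive constant.

    starts[k][l] is the 0-based start of motif l in sequence k;
    betas[l][j] is the pseudocount vector for column j of motif l.
    """
    covered = [set() for _ in seqs]
    total = 0.0
    for l, w in enumerate(widths):
        for j in range(w):
            column = [seq[starts[k][l] + j] for k, seq in enumerate(seqs)]
            for k in range(len(seqs)):
                covered[k].add(starts[k][l] + j)
            total += log_gamma_vec(
                [h + b for h, b in zip(counts(column), betas[l][j])])
    background = [seq[i] for k, seq in enumerate(seqs)
                  for i in range(len(seq)) if i not in covered[k]]
    total += log_gamma_vec([h + a for h, a in zip(counts(background), alpha)])
    return total


seqs = ["ACGTACGT", "TTACGTAA", "GACGTCCC"]
widths = [4]
alpha = [1.0] * 4
betas = [[[0.5] * 4 for _ in range(4)]]
aligned = log_integrated_likelihood(seqs, [[0], [2], [1]], widths, alpha, betas)
shifted = log_integrated_likelihood(seqs, [[1], [0], [3]], widths, alpha, betas)
```

In this toy case the placement that lines up the shared segment ACGT scores higher than the shifted placement, which is the behavior the sampler relies on.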

Let $A_{k\cdot} = (a_{k,1}, \ldots, a_{k,L})$ denote the alignment vector for the $k$th sequence. Then a Markovian structure for the a priori distribution of $A_{k\cdot}$, conditional on the sequence length $n_k$, can be introduced as

$$\pi_k(A_{k\cdot}) \propto \prod_{l=1}^{L-1} q_l(a_{k,l}, a_{k,l+1}), \qquad (5)$$

where $q_l(x, y) \geq 0$ can be viewed as a penalty function. Jointly, we set the prior of $A$ as $\pi(A) = \prod_{k=1}^{K} \pi_k(A_{k\cdot})$.

Based on the constraint that the motifs not overlap, $q_l(x, y)$ must be 0 whenever $y - x < w_l$. But it can take many forms, such as $1/(y - x)$, $\exp(-c|x - y|)$, $\exp\{-c(x - y)^2\}$, and so on, when $y - x \geq w_l$. The exponential form is most commonly used in the alignment literature for its nice mathematical properties that give rise to a fast algorithm. However, an exponential gap penalty may not be suitable for aligning subtle motifs. If we elect not to penalize gaps, then we can set $q_l(x, y) = 1$ for $y - x \geq w_l$, so that

$$\pi_k(A_{k\cdot}) \propto \begin{cases} 1 & \text{if } a_{k,l+1} - a_{k,l} \geq w_l \text{ for all } l, \\ 0 & \text{otherwise,} \end{cases}$$

where $a_{k,L+1} \equiv n_k + 1$. This prior induces only a collinearity and nonoverlapping constraint. Because gaps between subtle motifs vary greatly, we feel that this "no penalty" prior is most suitable for our tasks. When the number of motifs is to be determined from the data, this penalty issue becomes more subtle, and we defer it to Section 5.

In the propagation model, we treat the number of sequences $K$, sequence length $n_k$, number of motifs $L$, and the motif width $w_l$ as fixed constants instead of random variables. Because formula (5) is given only up to proportionality, to get the actual distribution we need to compute the normalizing constant by summing over all possible values of $1 \leq a_{k,1} < \cdots < a_{k,L} \leq n_k$ in (5). Although this step is not necessary at present, a similar summation is required for analyzing the posterior distribution of $A_{k\cdot}$. We provide a recursive algorithm for this computation in the next section.
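For the no-penalty prior, the normalizing constant is simply the number of collinear, nonoverlapping placements of the $L$ blocks in a sequence of length $n_k$. A minimal sketch of that count by dynamic programming follows; the function name `count_configurations` is hypothetical and positions are 1-based as in the text. The closed form $\binom{n - W + L}{L}$ used for checking is a standard stars-and-bars count and is not taken from the article.

```python
from math import comb


def count_configurations(n, widths):
    """Number of start vectors with a_{l+1} - a_l >= w_l and a_{L+1} = n + 1."""
    # ways[j] = number of placements of the blocks handled so far
    # in which the most recent block starts exactly at position j.
    ways = [0] + [1] * n
    for w in widths[:-1]:
        prefix = [0] * (n + 1)
        for j in range(1, n + 1):
            prefix[j] = prefix[j - 1] + ways[j]
        # The next block may start at j only if the previous one started
        # at j - w or earlier.
        ways = [0] + [prefix[j - w] if j - w >= 1 else 0
                      for j in range(1, n + 1)]
    last = widths[-1]
    # The last block must end inside the sequence: start <= n - last + 1.
    return sum(ways[1:n - last + 2])


total = count_configurations(10, [2, 3])
```

Under the no-penalty prior each admissible placement then has prior probability `1 / count_configurations(n, widths)`.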

3.2 Forward-Backward Recursion for Predictive Updating

Consider a particular sequence $R_k$. Defining $q_L(x, y) \equiv 1$ and using an argument similar to that of Liu et al. (1995), we can write the crucial conditional predictive distribution for $A_{k\cdot}$ as

$$\pi(A_{k\cdot} = (i_1, \ldots, i_L) \mid R, A_{[-k]}) \propto \prod_{l=1}^{L} q_l(i_l, i_{l+1}) \prod_{j=1}^{w_l} \left(\frac{\hat\theta_j^{(l)}}{\hat\theta_0}\right)^{h(r_{k,i_l+j-1})}, \qquad (6)$$

where $A_{[-k]} = A \setminus A_{k\cdot}$ and the $\hat\theta_j$ are defined as in (1). The conditional distribution (6) forms the basis of our predictive updating version of the Gibbs sampler (Liu 1994). The algorithm proceeds in two steps: randomly (or systematically) choosing a sequence $k$, and updating its motif element positions $A_{k\cdot}$ by a draw from distribution (6). To draw $A_{k\cdot}$ from (6), we need to propagate information forward along the sequence and then sample backward.

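Let $Q^{(0)} = (1)_{n_k \times n_k}$ and $Q^{(l)} = (q_l(i, j))_{n_k \times n_k}$, where the $q_l(i, j)$ are the same as those in (5), and let $u^{(l)} = (u_1^{(l)}, \ldots, u_{n_k}^{(l)})$, where

$$u_i^{(l)} = \prod_{j=1}^{w_l} \left(\frac{\hat\theta_j^{(l)}}{\hat\theta_0}\right)^{h(r_{k,i+j-1})} \quad \text{for } i = 1, \ldots, n_k - w_l + 1,$$

and $u_i^{(l)} = 0$ for $i > n_k - w_l + 1$. Thus $u^{(l)}$ is proportional to the marginal distribution of $a_{k,l}$ without considering other motifs, with given motif parameters at $\hat\theta_0$ and $\hat\Theta^{(l)}$. Formula (6) can be rewritten as

$$p(A_{k\cdot} = (i_1, \ldots, i_L)) \equiv \pi(A_{k\cdot} = (i_1, \ldots, i_L) \mid R, A_{[-k]}) \propto \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}.$$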

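To sample from $p(A_{k\cdot})$, we first need to compute the normalizing constant

$$G_0 = \sum_{i_1 < i_2 < \cdots < i_L} \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}.$$

Let $v^{(0)}(j) \equiv 1$, and for $j = 1, \ldots, n_k$ (the length of the sequence) let

$$v^{(m)}(j) = \sum_{i_1 < \cdots < i_m = j} \prod_{l=1}^{m} q_{l-1}(i_{l-1}, i_l)\, u_{i_l}^{(l)}, \qquad j = 1, \ldots, n_k; \quad m = 1, \ldots, L,$$

where $q_0 \equiv 1$ in accordance with $Q^{(0)}$. Let $v^{(m)} = (v^{(m)}(1), \ldots, v^{(m)}(n_k))$. Then the following recursive relationship holds:

$$v^{(m+1)}(j) = \sum_{k<j} v^{(m)}(k)\, q_m(k, j)\, u_j^{(m+1)},$$

which can be written in a matrix form as

$$v^{(m+1)} = \{v^{(m)} Q^{(m)}\} * u^{(m+1)} \quad \text{for } m = 0, \ldots, L-1, \qquad (7)$$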

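where the operation $*$ is defined, for all $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$, as

$$u * v = v * u = (u_1 v_1, \ldots, u_n v_n);$$

and for a matrix $S = (s_{ij})_{n \times n}$,

$$u * S = S * u = (s_{ij} u_j)_{n \times n}$$

and

$$u^T * S = S * u^T = (s_{ij} u_i)_{n \times n}.$$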

Finally, we have $G_0 = \sum_{j=1}^{n_k} v^{(L)}(j)$. Hence the marginal distribution of $a_{k,L}$ is $v^{(L)}/G_0$. Furthermore, we can easily derive the conditional distribution of $a_{k,l-1}$ with given $a_{k,l}$.

Thus the random sampling of $A_{k\cdot}$ can proceed recursively as follows: first, draw $a_{k,L}$ from the distribution $v^{(L)}/G_0$; then for given $x_l$, where $x_l$ is a row vector with 0s as entries except for a 1 at the starting position $a_{k,l}$ of the $l$th motif, draw $a_{k,l-1}$ from the probability vector proportional to $p^{(l-1)} = \{v^{(l-1)}\}^T * Q^{(l-1)} x_l^T$, and denote it by $x_{l-1}$.

The computation required by this forward-backward procedure is O(nm) for a general spacing penalty function $q_l(a, b)$, where n is the length of the sequence. But when the spacing function is memoryless (i.e., $q_l(a, b)$ does not depend on $l$ and is exponential in $|b - a|$ or constant), the amount of computation is reduced to O(nk) (Neuwald et al. 1997).
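The forward recursion and backward sampling described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the PROBE implementation: all function names are hypothetical, positions are 0-based, the weights `u` are random toy values rather than predictive ratios computed from data, and the recursion starts from $v^{(1)} = u^{(1)}$, which differs from the matrix form only by a constant factor. The direct double loop does work proportional to $n^2$ per motif and makes no attempt at the faster memoryless variant.

```python
import random
from itertools import product


def no_penalty_q(widths):
    """q_l(x, y) = 1 if motif l+1 starts at least w_l after motif l, else 0."""
    def q(l, x, y):
        return 1.0 if y - x >= widths[l] else 0.0
    return q


def forward(u, q):
    """v[m][j] = total weight of placing motifs 0..m with motif m at j."""
    L, n = len(u), len(u[0])
    v = [list(u[0])]
    for m in range(1, L):
        v.append([
            u[m][j] * sum(v[m - 1][k] * q(m - 1, k, j) for k in range(n))
            for j in range(n)
        ])
    return v


def sample_backward(v, q, rng):
    """Draw the last start from v[L-1], then each earlier start given the next."""
    L, n = len(v), len(v[0])
    starts = [0] * L
    starts[L - 1] = rng.choices(range(n), weights=v[L - 1])[0]
    for l in range(L - 1, 0, -1):
        weights = [v[l - 1][i] * q(l - 1, i, starts[l]) for i in range(n)]
        starts[l - 1] = rng.choices(range(n), weights=weights)[0]
    return starts


def brute_force_total(u, q):
    """Normalizing constant by full enumeration (only feasible for toy sizes)."""
    L, n = len(u), len(u[0])
    total = 0.0
    for pos in product(range(n), repeat=L):
        weight = 1.0
        for l in range(L):
            weight *= u[l][pos[l]]
            if l + 1 < L:
                weight *= q(l, pos[l], pos[l + 1])
        total += weight
    return total


rng = random.Random(0)
n, widths = 12, [2, 3, 2]
# Toy weights: positive where the block fits inside the sequence, 0 otherwise.
u = [[rng.uniform(0.5, 2.0) if i + w <= n else 0.0 for i in range(n)]
     for w in widths]
q = no_penalty_q(widths)
v = forward(u, q)
G0 = sum(v[-1])
draw = sample_backward(v, q, rng)
```

The enumeration in `brute_force_total` is included only as a check that the recursion yields the same normalizing constant; every draw from `sample_backward` respects the collinearity and nonoverlap constraints because forbidden placements receive zero weight.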

3.3 Fragmentation and Weighting

In the previous section, the conserved part in each protein sequence was described as a series of L ungapped blocks, each with known length $w_l$. But the exact value of $w_l$ is rarely known; at best, biologists may have some a priori knowledge on its range. In addition, not all of the positions within a block motif are equally important for protein structure and function. For example, some of the residue positions in a motif may be critical to an enzyme's catalytic function, and thus residue types at these positions are highly conserved. On the other hand, there is little conservation of residue types in motif positions that serve only as geometric place holders. Liu et al. (1995) and Neuwald et al. (1995) exploited this characteristic by introducing the fragmentation model in addition to the block-motif characterization. This model allows the aligned columns of a motif to hop stochastically within a neighborhood of their current alignment positions with probability proportional to the degree of conservation in each column as compared to background. We now extend that approach to our propagation model to permit column hopping between motifs, thus removing the requirement of having to specify the number of conserved columns $w_l$ allocated to each block motif.

The key to the fragmentation model is the concept of potential width, $W_1, \ldots, W_L$, with $W_l \geq w_l$, and fragmentation indicator $\Lambda_l = (\delta_{l,1}, \ldots, \delta_{l,W_l})$, with $\delta_{l,j} = 1$ or 0 indicating whether the position is regarded as part of the motif or the background. Thus $W_l$ is the potential span of motif $l$, and $\Lambda_l$ is a vector indicating which of the $W_l$ potential positions should be included in the model for motif $l$. We further let $\Lambda = (\Lambda_1, \ldots, \Lambda_L)$. A graphical representation is shown in Figure 3.

Liu et al. (1995) required that $|\Lambda_l| = w_l$. We require only that the total number of columns be constant; that is,

$$\sum_{l=1}^{L} |\Lambda_l| = W = \sum_{l=1}^{L} w_l.$$

Figure 3. Graphical Representation of the Fragmented Propagation Model. White spaces inside each motif element indicate that those positions of the motif elements are excluded from the motif model. The excluded positions for the $l$th motif are indexed by zeros in $\Lambda_l$. The $\Lambda_l$ are the same for all sequences.




To accommodate this new feature, we rewrite the parameter vectors as one big matrix,

$$\Theta = (\Theta^{(1)}, \ldots, \Theta^{(L)}) = (\theta_1, \ldots, \theta_W),$$

where, in old notation, $\Theta^{(l)} = (\theta_1^{(l)}, \ldots, \theta_{w_l}^{(l)})$. Hence the likelihood of the model with these added structures can be written as

$$\pi(R \mid A, \Lambda_1, \ldots, \Lambda_L, \theta_0, \Theta) \propto \theta_0^{h(R_{\{A_\Lambda\}^c})} \prod_{w=1}^{W} \theta_w^{h(R_{A_\Lambda(w)})},$$

where $A_\Lambda(w)$ denotes the $w$th overall model position indicated by $\Lambda$. Similarly to Liu et al. (1995), we consider the prior distribution for $\Lambda$ as inversely proportional to the total number of possible realizations of $\Lambda$ for given total spans, $J(\Lambda_l) = \max\{w: \delta_{l,w} = 1\} - \min\{w: \delta_{l,w} = 1\}$, for $l = 1, \ldots, L$; that is,

$$\pi(\Lambda) \propto \prod_{l=1}^{L} \binom{J(\Lambda_l) - 1}{|\Lambda_l| - 2}^{-1}.$$

Note that there are $\binom{J(\Lambda_l) - 1}{|\Lambda_l| - 2}$ ways of assigning 0s and 1s for positions within the span of length $J(\Lambda_l)$. Techniques for treating this new feature via Gibbs sampling are essentially the same as those of Liu et al. (1995) and are omitted.
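To make the fragmentation prior concrete, the sketch below computes its unnormalized logarithm for a set of 0/1 indicator vectors. It is a hedged illustration: the function names are hypothetical, each indicator is assumed to contain at least two 1s, and the number of realizations for a span $J$ with $|\Lambda_l|$ included columns is taken to be $\binom{J - 1}{|\Lambda_l| - 2}$, that is, the two end columns are fixed and the remaining included columns are placed among the interior positions.

```python
from math import comb, log


def span(indicator):
    """J(Lambda_l): distance between the first and last included column."""
    ones = [w for w, d in enumerate(indicator) if d == 1]
    return max(ones) - min(ones)


def log_fragmentation_prior(indicators):
    """Unnormalized log prior: minus the log of the number of realizations."""
    total = 0.0
    for lam in indicators:
        total -= log(comb(span(lam) - 1, sum(lam) - 2))
    return total


compact = log_fragmentation_prior([[1, 1, 1, 0]])    # span 2, one realization
fragmented = log_fragmentation_prior([[1, 0, 1, 1]])  # span 3, two realizations
```

Under this assumed count, a contiguous block receives a larger prior weight than a fragmented one with the same number of included columns, so fragmentation has to be earned by a better fit to the data.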

An important remaining problem is the choice of W, the total number of conserved columns, and L, the total number of conserved blocks. In Section 4 we provide a Bayesian maximum a posteriori (MAP) criterion for choosing these parameters.

3.4 Block-Motif Model With Deletions

Although the propagation model provides a way to combine the spirit of the block-based model and the gap-based HMM, it cannot handle deletion events easily. In this section we show that the deletion issue can be addressed by using a flexible indicator vector.

Suppose that there are L conserved collinear motifs, each with a fixed width $w_l$, in every sequence. For a particular sequence R, the alignment variable $A = (a_1, \ldots, a_L)$ represents the starting positions of these L conserved segments. The previously described propagation model assumes that all of the blocks must appear in every sequence to be aligned, permitting no deletions of any block in any of the sequences.

To account for deletions, we introduce a binary vector $D = (d_1, \ldots, d_L)$ for the sequence R, where $d_l = 0$ indicates that the $l$th block has been deleted and $d_l = 1$ indicates otherwise. Therefore, each sequence R is associated with an alignment vector A and a deletion vector D, neither of which is observed. When $d_l = 1$, $a_l$ indicates the location of the $l$th motif element. When $d_l = 0$, however, the value of $a_l$ is not meaningful. A and D can be treated as missing data and approached by an EM algorithm. Alternatively, we can give prior distributions to A, D, and $\Theta$ and use a Bayesian approach with computation completed by Gibbs sampling.

By giving different prior distributions to (A, D), we can obtain different desirable effects. For example, the simplest prior distribution for (A, D) is

$$\pi(A, D) = \pi_1(A)\, \pi_2(D),$$

where $\pi_1(A)$ is the same as $\pi(A)$ in Section 3.1 and $\pi_2(D) = \prod_{l=1}^{L} p_l^{d_l} (1 - p_l)^{1 - d_l}$; that is, the $d_l$ are mutually independent and $P(d_l = 1) = p_l$. A Markovian model for D is sometimes more desirable and can be characterized by

$$\pi_2(D) = f_1(d_1) \prod_{l=2}^{L} f_l(d_{l-1}, d_l),$$

where $f_l(d_{l-1}, d_l)$ is the transition function from $d_{l-1}$ to $d_l$. We can also model (A, D) jointly with a Markovian structure:

$$P(a_l, d_l \mid S_{l-1}) = \pi(a_l, d_l \mid a_{l-1}, d_{l-1}),$$

where $S_{l-1} = \sigma\{(a_j, d_j), 1 \leq j \leq l - 1\}$, the $\sigma$-field generated by all previous a's and d's.

If the width $w_l$ of each block is set to 1, then this deletion-propagation model is very similar to an HMM. In particular, $a_l$ then corresponds to the sequence position of $M_l$, the $l$th model position in an HMM (see Sec. 2.2). A deletion of $a_l$, indicated by $d_l = 0$, corresponds to a deletion of $M_l$. In future work, we will show that the deletion model generalizes the HMM of Krogh et al. (1994) and provide computational strategies for implementing the model.
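The two priors for the deletion vector D in this section, independent and Markovian, can be written down directly. The following sketch is illustrative only; the function names and the dictionary layout for the initial and transition functions are hypothetical choices, and the numerical values are invented for the example.

```python
from math import log


def log_prior_independent(d, p):
    """log pi_2(D) when the d_l are independent with P(d_l = 1) = p_l."""
    return sum(log(p_l) if d_l == 1 else log(1.0 - p_l)
               for d_l, p_l in zip(d, p))


def log_prior_markov(d, initial, transitions):
    """log pi_2(D) = log f_1(d_1) + sum_{l >= 2} log f_l(d_{l-1}, d_l).

    initial[d_1] plays the role of f_1; transitions[l - 2][(d_{l-1}, d_l)]
    plays the role of f_l for l = 2, ..., L.
    """
    total = log(initial[d[0]])
    for l in range(1, len(d)):
        total += log(transitions[l - 1][(d[l - 1], d[l])])
    return total


d = [1, 0, 1]
independent = log_prior_independent(d, [0.9, 0.9, 0.9])

# A transition function that ignores the previous state reduces the
# Markovian prior to the independent one.
memoryless = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.9, (1, 1): 0.9}
markov = log_prior_markov(d, {0: 0.1, 1: 0.9}, [memoryless, memoryless])

# A "sticky" transition function makes isolated deletions less likely.
sticky = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.05, (1, 1): 0.95}
markov_sticky = log_prior_markov(d, {0: 0.1, 1: 0.9}, [sticky, sticky])
```

The Markovian form is useful when deletions of neighboring blocks are expected to occur together, since the transition function can favor runs of identical values in D.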

4. MODEL SELECTION: AN APPROXIMATE BAYESIAN APPROACH

Two unresolved model selection issues remain: the number of motif elements (i.e., the number of gaps) and the total number of conserved positions in all motifs. The fragmentation model of Section 3.3 can be applied to allocate these positions into all motif elements.

The difficulty of model selection has long been appreciated by statisticians. Among the many solutions that have been proposed, the most popular are the Akaike information criterion (AIC), Bayes information criterion (BIC), and Mallows's Cp. Although these have proven effective for a class of problems, they have serious limitations in others, such as sequence alignment problems (Lawrence et al. 1993). Model selection methods based on the Bayes factors (or model likelihoods) have proven useful in many Bayesian analyses, and the recent development of Markov chain Monte Carlo (MCMC) methods enables such methods to be carried out for very complicated and realistic models (see Kass and Raftery 1995 for a recent review). Other interesting Bayesian approaches to model critique have been pursued by Box (1980), Gelman, Meng, and Stern (1996), and Rubin (1984). Of the many different methods, it seems that the one based on the Bayes factor (i.e., the posterior density of the observed data) provides a good starting point for our problem.

As in previous sections, we let $A$ denote the alignment vector (which consists of L motif blocks), $R$ denote the observed sequence data, and $\Theta$ denote the model parameter for a particular model under consideration. Following Box (1980), we assess model adequacy by the model likelihood $p(R)$, which can be computed as

$$p(R) = \int\!\!\int p(R \mid \Theta, A)\, p(\Theta, A)\, d\Theta\, dA = \sum_{A} p(R \mid A)\, p(A),$$

where $p(A)$ is the prior distribution for the alignment variable. Here we assume that $\Theta$ can be at least approximately integrated out. In many practical situations, as in our alignment problem, computation of $p(R)$ unfortunately is infeasible, and some Monte Carlo or numerical approximations are necessary.

To simplify the computation involved, we introduce the MAP criterion for model selection, which chooses a model to maximize

$$\log \text{MAP} = \log\{p(R \mid \hat{A})\} + \log\{p(\hat{A})\}, \qquad (8)$$

where $\hat{A}$ is the posterior mode of $A$ under that model. If the likelihood function $p(R \mid A)$ for the alignment variable is very much concentrated at its maximum, then we have the approximation $p(R) \approx p(R \mid \hat{A})\, p(\hat{A})$. Bounds on this approximation can be obtained as follows. Because $p(R) = p(R, A)/p(A \mid R)$, we have

$$\log p(R) = \log p(R \mid A) + \log p(A) - \log p(A \mid R).$$

Upper and lower bounds for $\log p(R)$ based on log MAP are

$$\log \text{MAP} \leq \log p(R) \leq \log \text{MAP} - E_{A \mid R}\{\log p(A \mid R)\}. \qquad (9)$$

Furthermore, by the information inequality (Cover and Thomas 1991) that for any nondegenerate distribution $q(A)$,

$$E_{A \mid R}\{\log p(A \mid R)\} \geq E_{A \mid R}\{\log q(A)\},$$

the second inequality of (9) can be replaced by $\log p(R) \leq \log \text{MAP} - E_{A \mid R}\{\log q(A)\}$ and can be estimated using Monte Carlo samples. Thus the logMAP is closely related to the Bayes factor. Our experience shows that the logMAP criterion works quite well for multiple alignment problems.
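The bounds in (9) can be checked numerically on a toy problem in which the alignment space is small enough to enumerate. The sketch below is illustrative only: the function name `map_and_bounds` is hypothetical and the three joint weights are invented numbers standing in for $p(R \mid A)\, p(A)$.

```python
from math import log


def map_and_bounds(joint):
    """Return (log MAP, log p(R), upper bound from (9)).

    joint[A] = p(R | A) * p(A) for every alignment A in a small,
    fully enumerable space (a toy stand-in for the real problem).
    """
    p_R = sum(joint.values())
    log_map = log(max(joint.values()))
    posterior = [w / p_R for w in joint.values()]
    # E_{A|R} log p(A | R), the negative posterior entropy.
    expected_log_post = sum(p * log(p) for p in posterior if p > 0)
    return log_map, log(p_R), log_map - expected_log_post


toy_joint = {"A1": 0.020, "A2": 0.004, "A3": 0.001}
lower, exact, upper = map_and_bounds(toy_joint)
```

When the posterior is concentrated on its mode, the entropy term is small and the two bounds pinch together around the exact log model likelihood, which is the situation in which log MAP is a good surrogate.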

In using the MAP criterion for model selection, one must provide a prior probability for each model. Although the propagation model permits flexible gap penalties, we have found the following no-gap penalty model to be highly effective. Specifically, we assume that the prior probability of observing L blocks in a sequence is taken to be equally likely in a range of possible numbers, say from $l_0 + 1$ to $L_0$. Hence $P(L = l) = 1/(L_0 - l_0)$ for any $l_0 < l \leq L_0$. This implicitly introduces a constraint on the possible number of gaps. We further assume that all alignments with $L = l$ motifs are equally likely. Therefore, the prior probability of observing a particular configuration of $L = l$ motif elements in a sequence is inversely proportional to the total number of such configurations. The total number of such configurations can be computed using a recursive formula similar to (7). Because the number of all possible alignments of an L-motif model grows super-exponentially with L, the assumption that all models are equally likely is substantially different from the assumption that all alignments are equally likely.

Monitoring logMAP in MCMC sampling is done efficiently by recursive updating. More precisely, we calculate $\log\{p(R \mid A^{(0)})\}$ for a starting alignment $A^{(0)}$ for an initial model, and then, for any further iteration, say $A^{(1)}$, we compute the increment

$$\log\{p(R \mid A^{(1)})\} - \log\{p(R \mid A^{(0)})\},$$

which is easily done because our sampling algorithm is composed solely of small local moves.

A heuristic support of the MAP criterion stems from special characteristics of the alignment problem. The posterior alignment distribution contains numerous "chance" local modes that emerge as artifacts of the alignment model rather than of biology (Lawrence and Reilly 1996). Accordingly, the inference of biological interest often focuses on a small subset of the alignment ensemble. This subset can be distinguished from chance modes only if they are concentrated around the global mode that an alignment algorithm can detect.

As demonstrated by many of our novel biological findings (Neuwald et al. 1997), it appears that the MAP criterion works quite well. A study by Neuwald et al. (1999) showed that the MAP criterion is conservative compared with the p value and Bayes factor approaches in the sense of preferring simpler models. The method performs satisfactorily for a simulation example. Qu and Lawrence (1999) showed that some modification of this criterion is required for effective prediction of structural alignments in the Molecular Modelling Database (MMDB) (URL: http://www.ncbi.nih.gov/structure).

5. EXAMPLE AND DISCUSSION

Cells are very resourceful in their use of materials. For example, the basic building blocks of nucleic acids, ribonucleotide and deoxyribonucleotide triphosphates, are used in a number of cellular processes in addition to their role in RNA and DNA synthesis. One of the most important of these, adenosine triphosphate (ATP), is the universal "currency" for chemical energy in all organisms. ATP provides the power for most of the cells' endergonic (energy absorbing) processes. A limited number of important endergonic processes are powered by guanosine triphosphate (GTP), however. Reactions involving GTP are the focus of this application. Energy is released when GTP is broken down to guanosine diphosphate (GDP) through hydrolysis of its terminal phosphate bond as follows:

$$\text{GTP} + \text{H}_2\text{O} \to \text{GDP} + \text{P}_i + \text{H}^+,$$

where $\text{P}_i$ is the phosphate ion. This energy releasing (exergonic) reaction is coupled to the reaction that requires energy (endergonic reaction) and is catalyzed by a GTPase enzyme. Several cellular processes utilize these coupled reactions. In this section we provide a detailed sequence analysis of the GTPases using the methodology described in previous sections.




5.1 The Dataset

Neuwald et al. (1997) examined the utility of PROBE, which is designed to identify protein families contained in the protein databases, find the conserved motifs, and align the family members. One of the families identified was a set of 1,338 GTPases. When the PURGE algorithm (details in Neuwald et al. 1995; available via anonymous ftp at ncbi.nlm.nih.gov), which computes the similarity score for every pair of sequences using a BLOSUM62 scoring matrix and removes close homologs (those with a BLOSUM62 score > 150), was applied to this set of sequences, a dataset of 46 sequences was obtained. For validation purposes, we added to this dataset two distantly related GTPases whose structures had been determined by X-ray crystallography. The sequences of these two proteins are given in Table 1. Because these two sequences are not significantly related, as measured by BLAST, but share common substructures (MMDB), they serve as good internal positive controls. Four sequences out of 46 in the previous dataset were related to the two added sequences. After these four related sequences were removed, the final dataset contained 44 sequences, with no pair having a BLOSUM62 score > 150.

5.2 Prior Specification

Throughout our applications of propagation and PROBE, the priors were set in a manner that is uninformative with respect to the alignment of any specific protein or family of proteins. Specifically, the priors on the $\theta$'s in (4) were set in accordance with Lawrence et al. (1993); the vectors $\alpha$ and $\beta^{(l)}$ in the product Dirichlet were assigned equal values as $\alpha_k \propto n_k$, and $\alpha_1 + \cdots + \alpha_p = 0.1N$, where $n_k$ is the total number of residues of type $k$ in the entire dataset and $N = n_1 + \cdots + n_p$. The prior distribution for the alignment variable $A$, as given in Section 4, has been used throughout our work. It should be noted that in our prior specification the gap penalty function $q(x, y)$ was taken to be constant, which means that no explicit penalty was given to the length of gaps. Instead, the prior distribution of the number of gaps was uniform over a specified range, and, conditioned on this number, all arrangements of gaps were equally likely.
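The Dirichlet pseudocounts described above follow directly from the residue composition of the dataset. A minimal sketch, in which the function name `dirichlet_pseudocounts` and the toy sequences are hypothetical:

```python
from collections import Counter


def dirichlet_pseudocounts(sequences, scale=0.1):
    """Pseudocounts alpha_k proportional to n_k with sum equal to scale * N."""
    residue_counts = Counter("".join(sequences))
    return {residue: scale * n_k for residue, n_k in residue_counts.items()}


toy_sequences = ["MTEYKLVVVG", "GAGGVGKSAL"]   # invented fragments for illustration
alpha = dirichlet_pseudocounts(toy_sequences)
N = sum(len(s) for s in toy_sequences)
```

Because the pseudocounts sum to only one tenth of the total residue count, the data dominate the prior once even a handful of sequences are aligned.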

5.3 The Implementation of Propagation

The propagation algorithm and the MAP model selection criteria have been incorporated into software (PROBE) for the identification of protein families. A variation in implementing the propagation is to use a genetic algorithm to improve the mode-finding ability (i.e., find the MAP) of the Gibbs sampler. The algorithm consists of the following main steps (see Neuwald et al. 1997 for more details):

1. Create an initial population of M multiple alignments by repeating the following three steps M times:

a. Randomly draw the number of blocks (L) and the total number of columns (W) from a given distribution.

b. Align the purged sequences by using the Gibbs sampling algorithm derived from the propagation model (Sec. 3).

c. Save the copy of the alignment when it stabilizes.

2. Iteratively apply the following genetic algorithm-type steps:

a. Randomly choose two alignments from the population and determine the possible recombinants derived from the two alignments. (A recombinant alignment is composed of the nonoverlapping collinear blocks resulting from the two original alignments.)

b. Select the best one based on the MAP criteria (sampling proportional to fitness can be done as well) and add it to the population.

c. Remove the "least fit" alignment from the population.

d. Occasionally introduce new variants into the population by repeating steps 2(a)-(c).

Seq#  NCBI ID      DB-Access.   Start  Element 1                  Gap 1     Element 2     Gap 2
 1)   gi|493746    pdb-121P     (4)    YKLVVVGAGGVGKSALTIQLIQNHF  (29-52)   LDILDTAGQEEY  (65-68)
 2)   gi|229900    pdb-1ETU     (13)   VNVGTIGHVDHGKTTLTAAITTVLA  (38-76)   YAHVDCPGHADY  (89-92)
 3)   gi|1077890   pir-S57091   (4)    STIICIGMAGSGKTTFMQRLNSHLR  (29-101)  NCIIDTPGQIEC  (114-125)
 4)   gi|141353    sp-P17103    (62)   ATVALVGFPSVGKSSLINAMTNADS  (87-109)  IQLLDVPGLIEG  (122-132)
 5)   gi|129021    sp-P20964    (159)  ADVGLVGFPSVGKSTLLSVVSSAKP  (184-207) FVMADLPGLIEG  (220-230)
 6)   gi|434759    trem-Q15029  (130)  RNVTLCGHLHHGKTCFVDCLIEQTH  (155-199) FNIMDTPGHVNF  (212-215)
 7)   gi|1204225   sp-Q10251    (485)  PICCILGHVDTGKTKLLDNLRRSNV  (510-552) LLIIDTPGHESF  (563-566)
 8)   gi|68956     pir-RGECGT   (9)    GFIAIVGRPNVGKSTLLNKLLGQKI  (34-57)   AIYVDTPGLHME  (70-82)
 9)   gi|1174907   sp-P42871    (13)   TRIGIGGPVGSGKTAIIEVITPILI  (38-72)   LGVETGACPHTA  (85-120)
10)   gi|462264    sp-P25519    (198)  PTVSLVGYTNAGKSTLFNRITEARV  (223-238) IDVADVGETVLA  (251-270)
Fragmentation: .... *

Seq#  Element 3                 Gap 3      Element 4    Gap 4      Element 5                Last
 1)   DQYMRTGEGFLCVFAINNTKSFED  (93-109)   PMVLVGNKCDL  (121-140)  YIETSAKTRQGVEDAFYTLVREI  (163)
 2)   ITGAAQMDGAILVVAATDGPMPQT  (117-129)  YIIVFLNKCDM  (141-263)  KLLDEGRAGENVGVLLRGIKREE  (286)
 3)   SFASSFPTVIAYIVDTPRNSSPTT  (150-166)  PMIVVFNKTDV  (178-235)  VVGVSSFTGDGFDEFMQCVDKKV  (256)
 4)   LSVIRGADLVIFVLSAFEIEQYDR  (157-236)  PSLVTVNKVDL  (248-269)  AIFISAAEEKGLDVLKERMWRAL  (292)
 5)   LRHIERTRVIVHVIDMSGLEGRDP  (255-275)  PQIIVANKMDM  (287-305)  VFPISAVTREGLRELLFEVANQL  (328)
 6)   TAGLRISDGVVLFIDAAEGVMLNT  (240-251)  AVTVCINKIDR  (263-358)  KAPTSSSQRSFVEFILEPLYKIL  (381)
 7)   SRGTSLCNIAILVIDIMHGLEPQT  (591-602)  PFVVALNKVDR  (614-672)  LVPTSAQSGEGVPDLVALLISLT  (695)
 8)   SSSIGDVELVIFVVEGTRWTPDDE  (107-117)  PVILAVNKVDN  (129-151)  IVPISAETGLNVDTIAAIVRKHL  (174)
 9)   TFSPALADFYIYVIDVAEGEKIPR  (145-152)  ADILVINKIDL  (164-186)  YILTNCKTGQGIEELVDMIMRDF  (209)
10)   LQETRQATLLLHVIDAADVRVQEN  (295-310)  PTLLVMNKIDM  (322-338)  RVWLSAQTGAGIPQLFQALTERL  (361)
Frag. *.***.************- ***** ******.* * *-

Figure 4. Aligned Motif Elements. The alignment of 10 of the 44 GTPase sequences mentioned in the text. Columns are as follows: NCBI sequence ID; protein database and corresponding sequence accession number; starting residue number of the first element; five aligned motif elements in the 10 sequences with the residue numbers of the intervening subsequences (gap) in parentheses; and number of the last residue of the last element. Starred columns are those selected by fragmentation (Liu et al. 1995). NCBI sequence ID numbers of the 44 sequences in the alignment are as follows: gi|493746, gi|229900, gi|1302162, gi|585780, gi|549796, gi|1154901, gi|601848, gi|559421, gi|631679, gi|1072199, gi|1171566, gi|729139, gi|731641, gi|1072255, gi|1340115, gi|861254, gi|466271, gi|585177, gi|1086887, gi|479657, gi|544493, gi|730928, gi|1085447, gi|131887, gi|1050856, gi|94524, gi|731284, gi|1079402, gi|600886, gi|124210, gi|1175159, gi|466991, gi|1051305, gi|558296, gi|544478, gi|1078133, gi|1077890, gi|141353, gi|129021, gi|434759, gi|1204225, gi|68956, gi|1174907, and gi|462264.

As shown in Figure 4, at convergence the propagation algorithm aligned the 44 GTPase sequences, and the MAP model selection criteria identified five motifs with a total of 78 conserved positions.

As shown in Figure 5, there was considerable variation in the degree of conservation at different positions in the alignment. Nearly all of the most highly conserved positions play key roles in binding the substrate (GTP) or the product (GDP) or have important structural roles. As shown in Figure 7, nearly all of the most highly conserved positions interact directly with either GTP or GDP. For example, the conserved lysine (K), a positively charged amino acid, at position 13 of motif 1 interacts with a negatively charged phosphate of GTP/GDP (see Fig. 7). In addition, there are a number of conserved glycines (G), which allow the protein backbone to bend sharply.

5.4 Structure Prediction

Enzymes are proteins that catalyze chemical reactions. Their efficiencies in accelerating chemical reactions are usually several orders of magnitude beyond the best man-made catalysts. An enzyme achieves this efficiency by folding into a precise three-dimensional structure that binds the compound (the substrate) to be chemically converted. Two proteins with very different primary amino acid sequences can efficiently catalyze the same reaction. In such cases the structures of the two enzymes will typically be similar in the regions that bind the substrate. This suggests that predicting a protein's structure from its sequence is extremely difficult-not surprisingly, one of the grand challenges in biology.

Structural prediction based on sequence alignment has proven to be the most successful method for addressing this grand challenge. However, good predictions have been limited to proteins whose sequences are closely related. As sequences become more distant, improper alignments play a major role in the breakdown of these predictions. The method illustrated in this article is especially suitable for aligning distantly related sequences and helps improve structural predictions based on multiple alignment. However, approximations inherent in all multiple sequence models, including ours, demand that such predictions be validated by experimentally derived controls. Accordingly, we have incorporated a pair of distantly related sequences (1ETU and 121P) with known structures that have been shown to be similar by the VAST procedure and reported with a structural superposition in MMDB. These protein sequences and their X-ray structures provide useful data to examine the validity of the GTPase alignment represented in Figure 4.

[Figure 5 panels: one sequence logo per motif.]

Figure 5. Sequence Logos. Posterior distribution of $\Theta$ presented as a sequence logo (Schneider and Stephens 1990) for motifs 1-5. The height $H_j^{(l)}$ of the $j$th position of motif $l$ is computed as $H_j^{(l)} = \sum_r \hat\theta_{r,j}^{(l)} \log_2 (\hat\theta_{r,j}^{(l)} / \hat\theta_{r,0})$, where $\hat\theta_j^{(l)} = (\hat\theta_{r,j}^{(l)}$; $r$ ranges over 20 amino acids). Accordingly, positions in the alignment that are highly conserved are tall. The height of the letter $r$ is $\hat\theta_{r,j} \times H_j$. Confidence limits are delineated at one standard deviation in $H_j$.

121P        (1)      MTEYKLVVVGAGGVGKSALTIQLIQNHF  (29-50)    CLLDILDTAGQEEY
1ETU PROBE  (12)     ...VNVGTIGHVDHGKTTLTAAITTVLA  (37-75)    ..YAHVDCPGHADY
1ETU VAST   (9)      KPHVNVGTIGHVDHGKTTLTAAITTV..  (35-73)    RHYAHVDCPGHA..

121P        (#)      SAMRDQYMRTGEGFLCVFAINNTKSFED  (93-108)   VPMVLVGNKCDL  (120)
1ETU PROBE  (88-91)  ....ITGAAQMDGAILVVAATDGPMPQT  (116-128)  .YIIVFLNKCDM  (139)
1ETU VAST   (86-87)  VKNMITGAAQMDGAILVVAATDG.....  (111-127)  PYIIVFLNKCDM  (139)

Figure 6. Comparison Between the Sequence Alignment (Motif Predictions) of 1ETU and 121P, Produced by PROBE, and Their Structural Alignment Based on the Crystallography Data, Produced by VAST. The numbers in the parentheses are the ending position of the previous motif and the starting position of the following motif. The sign (#) means that there is no gap between two consecutive motifs. The dots represent those positions that are missed by either PROBE or VAST. Although the PROBE and VAST alignments of 1ETU with 121P were produced independently, they are presented in adjacent rows to facilitate comparison. Qu and Lawrence (1999) have shown that reliable structural predictions based on PROBE require cross-validation of PROBE alignments. The purpose of this cross-validation is to ensure that the protein of interest does not bias the alignment in its own favor. This validation is accomplished by removing the sequence of the protein of interest, here 1ETU, and those that are even marginally similar to it from the multiple alignment. With these removed, a test of the hypothesis that the elements of the protein of interest, individually and collectively, are drawn from the motif model based on the remaining sequences is performed using SCAN (Neuwald et al. 1997). Proteins or elements found to be insignificant (p > .05) are not included in the prediction. Here element 5 of 1ETU fails this test, and thus it is not included in the foregoing prediction. Qu and Lawrence (1999) also showed that PROBE typically aligns only those residues that are in the vicinity of a ligand. These residues are usually about half of all those that can be structurally superimposed by VAST. Here VAST aligns 137 residues, whereas PROBE aligns only 78. As shown, 63 of these 78 residues agree with VAST.

Conditioned on the alignment of the 44 sequences in Figure 4, we decided to predict the structure of 1ETU, the target protein, based on the structure of 121P, the parent protein. Qu and Lawrence (1999) have reported methods of using the results of PROBE for such predictions, and Qu, Martin, and Lawrence (1998) have predicted the structure of glutamate decarboxylase using these methods. As described further in the legend of Figure 6, the backbone structure of the target can be predicted from that of the parent if criteria on the strength of the motif model and significance of the motif elements of the target are met. The first four motif elements of 1ETU passed this screening. Figure 6 shows a comparison of the structural alignment produced by VAST with the sequence alignment of PROBE for these four motifs. As shown in Figure 7, the four motifs of 1ETU structurally superpose well with those of 121P, and are in substantial agreement with predictions for these regions produced by VAST. Furthermore, the four motifs form major components of the GTP binding pocket.
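The element-level part of the screening described in the legend of Figure 6 can be illustrated with a minimal sketch. Only the threshold (elements with p > .05 are excluded) comes from the text; the function name `screen_elements` and the p-values below are hypothetical, and the actual significance test is performed by SCAN, which is not reproduced here.

```python
from typing import Dict, List


def screen_elements(p_values: Dict[int, float], alpha: float = 0.05) -> List[int]:
    """Return the motif elements whose p-value does not exceed alpha.

    Elements with p > alpha are treated as insignificant and dropped,
    mirroring the rule stated in the Figure 6 legend.
    """
    return sorted(k for k, p in p_values.items() if p <= alpha)


# Hypothetical p-values for five motif elements of a target sequence.
# They are illustrative only and are not taken from the article.
example_p = {1: 0.001, 2: 0.004, 3: 0.020, 4: 0.010, 5: 0.300}

kept = screen_elements(example_p)
print(kept)
```

With these made-up values, the fifth element is excluded and the first four are retained, which is the same pattern reported for 1ETU.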

For comparison, we also checked an alignment of these 44 sequences produced by CLUSTAL W (Thompson, Higgins, and Gibson 1994) against the structural information and found no agreement between the sequence and structural alignments.
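Agreement between a sequence alignment and a structural alignment, as in the PROBE and VAST comparison of Figure 6, can be quantified by counting the residue pairs that both alignments share. The sketch below assumes each alignment is represented as a set of (target position, parent position) pairs; that representation, the function name `alignment_agreement`, and the toy pairs are assumptions for illustration. The final lines simply redo the arithmetic on the counts reported in the legend (63 of 78).

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (position in target, position in parent)


def alignment_agreement(seq_pairs: Set[Pair], struct_pairs: Set[Pair]) -> Tuple[int, float]:
    """Count sequence-aligned pairs that also occur in the structural alignment.

    Returns the number of shared pairs and that number as a fraction of the
    sequence-aligned pairs (0.0 if the sequence alignment is empty).
    """
    shared = len(seq_pairs & struct_pairs)
    fraction = shared / len(seq_pairs) if seq_pairs else 0.0
    return shared, fraction


# Toy alignments (hypothetical positions, not the 1ETU/121P data).
seq_pairs = {(10, 5), (11, 6), (12, 7), (13, 8)}
struct_pairs = {(10, 5), (11, 6), (12, 7), (13, 9), (14, 10)}

shared, fraction = alignment_agreement(seq_pairs, struct_pairs)
print(shared, fraction)  # 3 0.75

# Counts reported in the Figure 6 legend: 63 of the 78 PROBE-aligned
# residues agree with VAST.
reported_fraction = 63 / 78
print(round(reported_fraction, 3))  # 0.808
```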

Figure 7. [...] Motif [...] Superimposed. The structure of the backbone of 1ETU is shown in [...]. The structural motifs of Figure 6 are [...] ribbons, and the [...] motif elements of 121P are shown as dark blue ribbons. The superposition of these motifs with those of 1ETU was obtained by [...]. The GDP bound to 1ETU is presented as a ball-and-stick figure in yellow. The beginnings and endings of each of the four 1ETU motifs are labeled [...]. Critical conserved residues ([...] above 2.5 bits, from Figure 5) are presented as ball-and-stick figures. They are colored using CPK color conventions (carbon as green, nitrogen as blue, and oxygen as red), except for Aspartate 80 (ASP80), which is presented in magenta [...] the fact that it is the only one of these conserved residues [...]. Conserved glycine residues, which do not show well both because they have only a hydrogen atom as side chain and because they play special roles for bendings of the backbone, are not shown. Note: a full-color version of this figure is available [...].

5.5 Discussion

In this article we have demonstrated a new efficient method for identifying subtle patterns conserved among distantly related biological sequences. The method, together with a transitive search strategy and a genetic algorithm to speed up the MAP optimization, forms the core of a new multiple alignment and database searching tool, PROBE. Although we have indicated that the MAP criterion seems to be a good selection tool for biological sequences, additional research on approximate model selection methods for multiple sequence alignment and database search is needed. As we noted in Section 1, our statistical model has several limitations, and further analyses, either theoretical or empirical, are needed. A major restriction of the model (in fact, of most HMM-type models) is that the sequences to be aligned are treated as having evolved along independent pathways.




Another limitation is that every residue is treated independently given the alignment information. Although much of our experience has shown that the method developed in this article and that of Lawrence et al. (1993) can tolerate datasets with substantial deviation from the independence assumptions, a systematic analysis on robustness of these related methods is desirable. It will also be a great advance if efficient algorithms can be developed to simultaneously infer phylogeny and multiple alignment, with both uncertainties taken into account.
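To make the independence assumption concrete, the sketch below scores a set of aligned motif segments under a product-multinomial model, in which each column has its own residue frequencies and the log-likelihood is a plain sum over sequences and positions. This is an illustration of the assumption only, not PROBE's actual model or code: the function names, the toy segments, and the flat pseudocount standing in for a prior are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def column_frequencies(segments: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Estimate per-column residue frequencies from equal-length aligned segments.

    A flat pseudocount (an assumed stand-in for a prior) keeps every
    frequency strictly positive.
    """
    width = len(segments[0])
    total = len(segments) + pseudocount * len(ALPHABET)
    freqs = []
    for j in range(width):
        counts = Counter(seg[j] for seg in segments)
        freqs.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return freqs


def log_likelihood(segments: List[str], freqs: List[Dict[str, float]]) -> float:
    """Sum log-frequencies over all sequences and positions.

    The double sum is exactly the independence assumption: no term couples
    two residues, whether in the same sequence or in the same column.
    """
    return sum(math.log(freqs[j][seg[j]]) for seg in segments for j in range(len(seg)))


# Hypothetical aligned segments (illustrative, not taken from Figure 4).
segments = ["GNKCD", "GNKSD", "GNKCD"]
freqs = column_frequencies(segments)
ll = log_likelihood(segments, freqs)
print(round(ll, 3))
```

Because the score is additive, the contribution of any single sequence can be computed separately and summed, which is what makes such models tractable and also what ignores shared evolutionary history among the sequences.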

By revealing conserved patterns, one can infer structural or functional characteristics, such as the structural motifs predicted in Section 5.4, of all members of the family of aligned sequences based on experimental evidence that may be available for only a few members. Qu and Lawrence (1999) showed how to refine such structural inferences. Several such characterizations have led to useful discoveries, as reported by Neuwald et al. (1997). In addition to characterizing specific protein families, a major goal for this and related methods is the characterization of the protein "universe" (Green et al. 1993).

[Received January 1997. Revised July 1998.]

REFERENCES

Allison, L., and Wallace, C. S. (1994), "The Posterior Probability Distribution of Alignments and Its Application to Parameter Estimation of Evolutionary Trees and to Optimization of Multiple Alignments," Journal of Molecular Evolution, 39, 418-430.

Allison, L., Wallace, C. S., and Yee, C. N. (1992), "Minimum Message Length Encoding Evolutionary Trees and Multiple Alignment," Proceedings of 25th Hawaii International Conference on System Science, 1, 663-674.

Altschul, S. F., Gish, W., Miller, M., Myers, E. W., and Lipman, D. J. (1990), "Basic Local Alignment Search Tool," Journal of Molecular Biology, 215, 403-410.

Baldi, P., Chauvin, Y., McClure, M., and Hunkapiller, T. (1994), "Hidden Markov Models of Biological Primary Sequence Information," Proceedings of the National Academy of Sciences, 91, 1059-1063.

Bishop, M. J., and Thompson, E. A. (1986), "Maximum Likelihood Alignment of DNA Sequences," Journal of Molecular Biology, 190, 159-165.

Box, G. E. P. (1980), "Sampling and Bayes Inference in Scientific Modeling and Robustness," Journal of the Royal Statistical Society, Ser. A, 143, 383-430.

Bronner, C. E., Baker, S. M., Morrison, P. T., Warren, G., Smith, L. G., Lescoe, M. K., Kane, M., Earabino, C., Lipford, J., Lindblom, A., Tannergard, P., Bollag, R. J., Godwin, A. R., Ward, D. C., Nordenskjöld, M., Fishel, R., Kolodner, R., and Liskay, R. M. (1994), "Mutation in the DNA Mismatch Repair Gene Homologue hMLH1 Is Associated With Hereditary Non-Polyposis Colon Cancer," Nature, 368, 258-261.

Campbell, M. K. (1995), Biochemistry (2nd ed.), New York: Saunders College Publishing.

Churchill, G. A. (1989), "Stochastic Models for Heterogeneous DNA Sequences," Bulletin of Mathematical Biology, 51, 79-94.

Claverie, J. M. (1996), "Effective Large-Scale Sequences Similarity Searches," Methods in Enzymology, 266, 212-227.

Cover, T. M., and Thomas, J. A. (1991), Elements of Information Theory, New York: Wiley.

Eddy, S. R. (1995), "Multiple Alignment Using Hidden Markov Models," Intelligent Systems for Molecular Biology, 3, 114-120.

Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-796.

Green, P., Lipman, D., Hillier, L., Waterston, R., States, D., and Claverie, J. M. (1993), "Ancient Conserved Regions in New Gene-Sequences and the Protein Databases," Science, 259, 1711-1715.

Henikoff, S., and Henikoff, J. G. (1991), "Automated Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, 19, 6565-6572.

Henikoff, S., Henikoff, J. G., Alford, W. J., and Pietrokovski, S. (1995), "Automated Construction and Graphical Presentation of Protein Blocks From Unaligned Sequences," Gene, 163 (2), GC17-GC26.

Karlin, S., and Brendel, V. (1992), "Chance and Statistical Significance in Protein and DNA Sequence Analysis," Science, 257, 39-49.

Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 377-395.

Krogh, A., Brown, M., Mian, S., Sjolander, K., and Haussler, D. (1994), "Protein Modeling Using Hidden Markov Models," Journal of Molecular Biology, 235, 1501-1531.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993), "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment," Science, 262, 208-214.

Lawrence, C. E., and Reilly, A. A. (1990), "An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences," Proteins: Structure, Function, and Genetics, 7, 41-51.

Lawrence, C. E., and Reilly, A. A. (1992), "Likelihood Inferences for Permuted Data With Application to Gene Regulation," Journal of the American Statistical Association, 91, 76-85.

Lazareva, B., and Churchill, G. A. (1997), "Bayesian Restoration of a Hidden Markov Chain With Applications to DNA Sequencing," unpublished manuscript submitted to Journal of the American Statistical Association.

Liu, J. S. (1994), "The Collapsed Gibbs Sampler in Bayesian Computations With Applications to a Gene Regulation Problem," Journal of the American Statistical Association, 89, 958-966.

Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995), "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies," Journal of the American Statistical Association, 90, 1156-1170.

Marshall, E. (1996), "Hot Property: Biologists Who Compute," Science, 272, 1730-1732.

Needleman, S. B., and Wunsch, C. D. (1970), "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, 48, 443-453.

Neuwald, A. F., Liu, J. S., and Lawrence, C. E. (1995), "Gibbs Motif Sampling: Detection of Bacterial Outer Membrane Protein Repeats," Protein Science, 4, 1618-1632.

Neuwald, A. F., Liu, J. S., Lipman, D. J., and Lawrence, C. E. (1997), "Extracting Protein Alignment Models From the Sequence Database," Nucleic Acids Research, 25(9), 1665-1677.

Pearson, W. R., and Lipman, D. J. (1988), "Improved Tools for Biological Sequence Comparison," Proceedings of the National Academy of Sciences, 85, 2444-2448.

Qu, K., and Lawrence, C. E. (1999), "Extended Homology Prediction for Motif Structure by Multiple Sequence Alignment," unpublished manuscript submitted to Modeling and Scientific Computing.

Qu, K., Martin, D. L., and Lawrence, C. E. (1998), "Motifs and Structural Fold of the Cofactor Binding Site of Human Glutamate Decarboxylase," Protein Science, 7, 1092-1105.

Rabiner, L. R. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77, 257-286.

Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.

Schneider, T. D., and Stephens, R. M. (1990), "Sequence Logos: A New Way To Display Consensus Sequences," Nucleic Acids Research, 20, 6097-6100.

Smith, T. F., and Waterman, M. S. (1981), "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 147, 195-197.

Sonnhammer, E. L. L., Eddy, S. R., and Durbin, R. (1997), "Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments," Proteins, 28, 405-420.

Taubes, G. (1996), "Software Matchmakers Help Make Sense of Sequences," Science, 273, 588-590.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994), "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties, and Weight Matrix Choice," Nucleic Acids Research, 22, 4673-4680.

Thorne, J. L., Kishino, H., and Felsenstein, J. (1991), "An Evolutionary Model for Maximum Likelihood Alignment of DNA Sequences," Journal of Molecular Evolution, 33, 114-124.

Thorne, J. L., Kishino, H., and Felsenstein, J. (1992), "Inching Toward Reality: An Improved Likelihood Model of Sequence Evolution," Journal of Molecular Evolution, 34, 3-16.

Waterman, M. S. (1995), Introduction to Computational Biology, New York: Chapman and Hall.

West, M., and Harrison, J. (1989), Bayesian Forecasting and Dynamic Models, New York: Wiley.

Zhu, J., Liu, J. S., and Lawrence, C. E. (1998), "Bayesian Adaptive Sequence Alignment Algorithms," Bioinformatics, 14, 25-39.


