
Protein Multiple Alignment Incorporating Primary and

Secondary Structure Information

Nak-Kyeong Kim1 and Jun Xie2,∗

1National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

8600 Rockville Pike, Building 38A

Bethesda, MD 20894-6075

E-mail: [email protected]

2Department of Statistics

Purdue University

150 N. University Street

West Lafayette, IN 47907-2067

E-mail: [email protected]

April 26, 2006

Running Heads: Sequence alignment with secondary structures

KEY WORDS: Gibbs sampling; Likelihood function; Protein sequence motifs; Secondary

structures; Segment overlap.

Abstract

Identifying common local segments, also called motifs, in multiple protein sequences

plays an important role in establishing homology between proteins. Homology is easy

to establish when sequences are similar (sharing an identity > 25%). However, for dis-

tant proteins, it is much more difficult to align motifs that are not similar in sequences

but still share common structures or functions. This paper is a first attempt to align

multiple protein sequences using both primary and secondary structure information.

A new sequence model is proposed so that the model assigns high probabilities not

only to motifs that contain conserved amino acids but also to motifs that present common secondary structures. The proposed method is tested on the structural alignment database BAliBASE. We show that information brought by the predicted secondary

structures greatly improves motif identification. A website of this program is available

at http://www.stat.purdue.edu/~junxie/2ndmodel/sov.html.

1 Introduction

Genome sequencing projects produce enormous amounts of sequence data. The interpretation of these

data, however, is an ongoing challenge and highly depends on efficient computational ap-

proaches. Statistical methods and probability models have been successfully used to analyze

biological sequences. In this paper, we are interested in aligning common motifs in multiple

proteins. The observed data are protein amino acid sequences, which are also called the pri-

mary structure of the proteins. Protein motifs here are referred to as local segments (10-50

amino acids) that are critical for protein structures and functions. Multiple sequence align-

ments help to characterize protein structures and functions by common sequence patterns.

Numerous multiple sequence alignment programs have been proposed. Thompson et al. (1999b)

provided a comprehensive comparison of ten programs, some of which were highly ranked as

evaluated by BAliBASE (Thompson et al. 1999a) benchmark alignment database. To list a


few, ClustalW (Thompson et al. 1994) is a well-used progressive alignment method. A mul-

tiple alignment is built up gradually by aligning the closest sequences first and successively

adding in the more distant ones. Dialign (Morgenstern et al. 1996) is a local alignment approach, which constructs multiple alignments based on segment-to-segment comparisons

rather than residue-to-residue comparisons. The PRRP program (Gotoh 1996) optimizes a

progressive alignment by iteratively dividing the sequences into two groups and realigning

the groups. These three programs will be compared to our proposed alignment method in

Section 4.

In addition to the alignment programs, motifs are often modeled by the position specific

score matrix (PSSM), which corresponds to a product of multinomial distributions of amino

acids. Based on the PSSM model, Lawrence and Reilly (1990) treated the starting positions

of motifs as missing data and proposed an EM algorithm (Dempster et al. 1977) for motif

detection. An EM algorithm is known for slow convergence, and the program often converges

to a local maximum. Lawrence et al. (1993) and Liu et al. (1995) developed a Bayesian model

and a Gibbs sampling algorithm to find the motifs under the same missing-data formulation.

The method has a better chance to escape a local maximum because of its stochastic nature.

Xie et al. (2004) extended the Bayesian model by allowing insertions and deletions within

the motifs. Eddy (1998) developed a hidden Markov model to describe motifs, also allowing

gaps inside motif patterns. Considering insertions and deletions often results in intensive computation, and the program may suffer from lack of convergence. Despite their strengths, all the above methods use only information from protein primary structures. They are limited in finding weak motif patterns that have a low level of similarity between sequences.

Besides sequence, protein structure provides significant information for protein function.

It is assumed that 3-dimensional (3D) structures evolve more slowly than sequences and the

function of a protein is highly influenced by its 3D structure (Silberberg 2000). However,

due to the slow and expensive experimental processes to determine protein 3D structures,


only a limited number of proteins have known 3D coordinates. Predicting 3D structure from

the sequence is one of the biggest challenges in computational biology.

Secondary structure is a simplified characteristic of a protein’s 3D structure. All success-

ful methods in the field of 3D fold recognition make use of secondary structure predictions,

showing that secondary structure is a valuable way to establish structural relationship be-

tween proteins. Three state descriptions of protein secondary structure are commonly used:

helix (which includes all helical types), strand (which includes the beta sheet), and coil

(which includes everything else, e.g. bend and turn). Many secondary structure prediction

algorithms have been proposed, for instance, score-based methods (Chou and Fasman 1974;

Garnier et al. 1978), nearest neighbor methods (Salamov and Solovyev 1995), and neural

networks (Rost and Sander 1993, Jones 1999). Several competing methods reached around

70 - 78% accuracy (fraction of correctly predicted three states), with the PSI-PRED (Jones

1999, Bryson et al. 2005) server, a neural network based algorithm, as one of the most accu-

rate tools. We will use PSI-PRED in this paper. Figure 1 shows PSI-PRED prediction for a

short protein UBIQ_HUMAN (Swiss-Prot P02248), which belongs to 1ubi, one of our example data sets in Section 4. The contiguous segments of secondary structures are given, where H,

E, and C represent helix, strand, and coil, respectively.
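As a minimal sketch of what "contiguous segments" means for a three-state string, the snippet below splits a string of H, E, and C symbols into runs. The function name `segments` and the example string are illustrative assumptions; the string is made up and is not the PSI-PRED output for UBIQ_HUMAN.

```python
from itertools import groupby


def segments(ss):
    """Split a secondary-structure string into (state, start, end) runs.

    Positions are 0-based and `end` is exclusive.
    """
    runs = []
    pos = 0
    for state, group in groupby(ss):
        length = len(list(group))
        runs.append((state, pos, pos + length))
        pos += length
    return runs


# Made-up three-state string, for illustration only.
example = "CEEEEECCHHHHC"
for state, start, end in segments(example):
    print(state, start, end)
```

Each run is one secondary structure segment; the SOV measure in Section 2.2 compares such segments rather than individual residues.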

[Figure 1 about here.]

A family of structurally similar proteins may have divergent amino acid compositions

because 3D structures are not affected too much by substitutions of certain amino acids.

The 3D structures, however, should be conserved to perform a certain function. If the 3D

structures are conserved, it is likely that secondary structures are conserved. Geourjon et

al. (2001) introduced the idea of using the predicted secondary structure in identifying

related proteins with weak sequence similarity. They collected distantly-related sequences

with 10-30% sequence identity and calculated the secondary structure similarity of each pair

of sequences using the SOV (Segment Overlap) measure (Zemla et al. 1999). Sequence


homology was established only when the SOV was greater than a threshold. However, this

approach is limited to pairwise protein sequence comparisons. Errami et al. (2003) used

the predicted secondary structures in multiple protein sequences. They validated existing

multiple alignments by discarding unrelated sequences. Relationship was measured by SOV

calculated for all pairs of sequences in a given multiple alignment. This approach gives only general guidelines for verifying existing multiple alignments; it does not construct multiple alignments.

In this paper, we propose a new statistical method that models protein motifs using

both primary and secondary structure information. Segment overlap (SOV) is generalized

to measure the similarity of secondary structures for a group of multiple sequences. A mul-

tiple alignment method is proposed to maximize both amino acid and secondary structure

conservation. Section 2 defines the data structure and presents SOV measurements. Section

3 shows the probabilistic models of motifs using the predicted secondary structures. A Gibbs

sampling algorithm is derived for model inference. Convergence is studied by multiple sim-

ulations and a proposed alignment score. Section 4 evaluates the models using the database

of structural multiple alignment BAliBASE (Thompson, et al. 1999a). Section 5 concludes

with a discussion.

2 Data structure and SOV

2.1 Data structure

A given set of protein sequences can be represented as

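$$
\text{Data } R:\quad
\begin{array}{ll}
\text{sequence } R_1: & r_{1,1}\; r_{1,2}\; \ldots\; r_{1,L_1} \\
\text{sequence } R_2: & r_{2,1}\; r_{2,2}\; \ldots\; r_{2,L_2} \\
\qquad\vdots & \qquad\vdots \\
\text{sequence } R_K: & r_{K,1}\; r_{K,2}\; \ldots\; r_{K,L_K}
\end{array}
$$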


where the residue rk,l takes values from an alphabet with 20 different letters, and Lk rep-

resents the length of the kth sequence. We seek segments of length J from each sequence,

which resemble each other as much as possible. The segments are called motifs. The motif

width J can be determined by either the user or a heuristic algorithm (Xie and Kim 2005).

Let $A = \{a_k,\ k = 1, \ldots, K\}$ denote the starting positions of the motif for the K sequences. The alignment can be represented by a matrix, $R_{\{A\}}$:

$$
\begin{pmatrix}
r_{1,a_1} & \cdots & r_{1,a_1+J-1} \\
\vdots & \ddots & \vdots \\
r_{K,a_K} & \cdots & r_{K,a_K+J-1}
\end{pmatrix}
\qquad (1)
$$

When the motif has conserved amino acids, the matrix (1) is well represented by a PSSM and

the existing motif-finding algorithms would work well. When the motif sequences are not

conserved, the motif 3D structure may still be preserved. Therefore, adding the predicted

secondary structures would enhance the motif signal.
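The matrix (1) is simply the set of width-J windows cut from each sequence at its start position. A minimal sketch, assuming 1-based start positions as in the text; the function name `motif_block` and the toy sequences are hypothetical and not taken from the paper's data.

```python
def motif_block(seqs, starts, width):
    """Return the K x J block of matrix (1) as a list of K strings.

    `starts` holds the 1-based start position a_k for each sequence.
    """
    block = []
    for seq, a in zip(seqs, starts):
        if a < 1 or a + width - 1 > len(seq):
            raise ValueError("motif window does not fit inside the sequence")
        block.append(seq[a - 1:a - 1 + width])
    return block


# Toy sequences and start positions (illustrative values only).
seqs = ["MKTAYIAKQR", "GSHMKTAYLA", "AYIAKQRQIS"]
starts = [2, 5, 1]
for row in motif_block(seqs, starts, width=5):
    print(row)
```

Column j of the returned block is the multiset of residues that a PSSM models with one multinomial distribution.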

2.2 Secondary structure similarity measurement SOV

The three states for the secondary structure are helix (H), strand (E), and coil (C). Secondary

structure similarity can be measured by the Q3 measure, defined as a fraction of residues

correctly matched in the three conformational states. However, the Q3 measurement some-

times gives inappropriate values. For example, predicting the entire myoglobin chain as one

big helix gives a Q3 value of about 80%, which outperforms most of the existing prediction

methods. A better measurement is the Segment Overlap (SOV) of Zemla et

al. (1999). SOV considers natural variations in the boundaries of segments among homolo-

gous protein structures. It is a measure based on secondary structure segments rather than

individual residues.
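The weakness of Q3 can be seen in a small sketch. The function name `q3` and the observed string below are assumptions for illustration: the string is a made-up, mostly helical chain and is not the myoglobin structure.

```python
def q3(observed, predicted):
    """Fraction of residues whose three-state labels agree."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have the same length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Made-up chain with 16 helix and 4 coil residues.
observed = "HHHHHHHHCC" * 2
# Degenerate prediction: one long helix over the whole chain.
predicted = "H" * len(observed)
print(q3(observed, predicted))
```

Under these assumptions the degenerate all-helix prediction scores 0.8 even though it misses every segment boundary, which is the behaviour that a segment-based measure is meant to penalise.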



Let s1 and s2 denote any two segments of secondary structure in conformational state i

(i.e. H, E, or C). Let (s1, s2) denote a pair of overlapping segments. For example, (β1, β2)

in Figure 2 is a pair of overlapping segments with strand (E). Let S(i) denote the set of all

overlapping pairs of segments (s1, s2) in state i, and let S′(i) denote the set of segments s1

for which there is no overlapping segment s2 in state i, i.e.:

$$
S(i) = \{(s_1, s_2) : s_1 \cap s_2 \neq \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\},
$$

$$
S'(i) = \{s_1 : \forall s_2,\ s_1 \cap s_2 = \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\}.
$$

Define SOVo for state i as:

$$
SOV^{o}(i) = \frac{1}{N(i)} \sum_{(s_1,s_2)\in S(i)} \frac{minov(s_1,s_2) + \delta(s_1,s_2)}{maxov(s_1,s_2)} \times len(s_1),
$$

where

$$
N(i) = \sum_{(s_1,s_2)\in S(i)} len(s_1) + \sum_{s_1\in S'(i)} len(s_1),
$$

$$
\delta(s_1,s_2) = \min\{(maxov(s_1,s_2) - minov(s_1,s_2));\ minov(s_1,s_2);\ int(len(s_1)/2);\ int(len(s_2)/2)\}.
$$

In the formula, len is the segment length, minov is the length of the actual secondary

structure overlap of s1 and s2, maxov is the maximal length of the overlapping structures s1

and s2 (See Figure 2). SOVo of all secondary states is defined as:

$$
SOV^{o} = \frac{1}{N} \sum_{i\in\{H,E,C\}} \sum_{(s_1,s_2)\in S(i)} \frac{minov(s_1,s_2) + \delta(s_1,s_2)}{maxov(s_1,s_2)} \times len(s_1),
$$

where $N = \sum_{i\in\{H,E,C\}} N(i)$.

To illustrate the calculation of SOVo(E), let us consider the two secondary structures in

Figure 2. There are two overlapping pairs for extended sheet (E): (β1, β2) and (β1, β3). For the first pair, minov(β1, β2) = 2, maxov(β1, β2) = 8, and δ(β1, β2) = min{(8 − 2); 2; 3; 2} = 2. The second pair can be calculated similarly. Then the value of SOVo(E) is calculated as:

$$
SOV^{o}(E) = \frac{1}{6 + 6} \times \left( \frac{2 + 2}{8} + \frac{2 + 1}{7} \right) \times 6 = 0.464
$$

Summing over all 3 states, the overall SOVo of the given structures is evaluated to be

0.629. The SOVo measure ranges from 0 to 1, where 1 is a perfect match and 0 is a complete mismatch. The value 0.629 can be roughly interpreted as meaning that 63% of the secondary structures are matched.

SOVo is originally defined for similarity of an observed secondary structure and its pre-

dicted secondary structure. The asymmetric nature of S(i), N(i) and len(s1) makes SOVo

asymmetric between the two sequences s1 and s2. When this measure is used for the two

predicted structures, a symmetric measure can be defined by:

$$
SOV = \frac{SOV^{o}(s_1, s_2) + SOV^{o}(s_2, s_1)}{2}.
$$

This definition will be used for our SOV calculations.
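The definitions of this section can be sketched in a few lines of code. This is a minimal illustration written from the formulas as stated here, not the authors' program; all function names are assumptions. The two toy strings are not the structures of Figure 2; they were chosen so that the strand segments have the same lengths and overlaps as in the worked example (lengths 6, 4, and 3, each overlap equal to 2).

```python
from itertools import groupby


def state_segments(ss, state):
    """Return (start, end) runs of `state` in ss; end is exclusive."""
    runs, pos = [], 0
    for s, group in groupby(ss):
        n = len(list(group))
        if s == state:
            runs.append((pos, pos + n))
        pos += n
    return runs


def sov_parts(ss1, ss2, state):
    """Return the unnormalised sum and the normaliser N(i) for one state."""
    total, norm = 0.0, 0
    segs2 = state_segments(ss2, state)
    for b1, e1 in state_segments(ss1, state):
        len1 = e1 - b1
        overlapped = False
        for b2, e2 in segs2:
            minov = min(e1, e2) - max(b1, b2)   # length of the actual overlap
            if minov <= 0:
                continue
            overlapped = True
            maxov = max(e1, e2) - min(b1, b2)   # total extent of both segments
            delta = min(maxov - minov, minov, len1 // 2, (e2 - b2) // 2)
            total += (minov + delta) / maxov * len1
            norm += len1                        # s1 counted once per pair in S(i)
        if not overlapped:
            norm += len1                        # s1 belongs to S'(i)
    return total, norm


def sov_o(ss1, ss2, states="HEC"):
    """Asymmetric SOVo over all states, with ss1 in the role of s1."""
    parts = [sov_parts(ss1, ss2, s) for s in states]
    norm = sum(p[1] for p in parts)
    return sum(p[0] for p in parts) / norm if norm else 0.0


def sov(ss1, ss2):
    """Symmetric SOV: the average of the two asymmetric values."""
    return 0.5 * (sov_o(ss1, ss2) + sov_o(ss2, ss1))


first = "CCCCEEEEEECCC"    # one strand of length 6
second = "CCEEEECCEEECC"   # strands of length 4 and 3
total, norm = sov_parts(first, second, "E")
print(round(total / norm, 3))   # strand-state value, 0.464 as in the worked example
print(sov(first, second))
```

Note that a segment s1 overlapping two segments contributes its length twice to N(i), which is why the normaliser in the worked example is 6 + 6.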

3 Methods

3.1 Model assumptions

The proposed model consists of two parts, a position-specific score matrix (PSSM) for the

amino acid sequences and a SOV measurement for the secondary structures of the motifs.

Let X = {X1, ..., XK} denote secondary structure strings for the set of K proteins, where secondary structure Xi of protein i is either known or predicted by PSI-PRED. PSI-PRED

employs two feed-forward neural networks which predict secondary structure of a protein

based on its similarity output obtained from PSI-BLAST (Position Specific Iterated BLAST,

Altschul et al. 1997). For the given protein, PSI-PRED uses all of its homologous proteins from

the NCBI (National Center for Biotechnology Information) protein database. We assume


the predicted secondary structures X are an extra given data set in addition to the protein

set of interest R.

Like many other secondary structure prediction methods, PSI-PRED utilizes sequence

information in multiple alignments obtained by PSI-BLAST. The multiple alignment helps

to infer secondary structure. On the other hand, our goal here is to improve multiple

alignment by the predicted secondary structures. Our development could be considered as

the second step of an iterative scheme that optimizes both the quality of the secondary

structure prediction and that of the multiple alignment.

The motif width J in our approach is chosen based on the method by Xie and Kim

(2005). Starting from a short alignment width (e.g. 10), the method expands the motif to

both sides according to the Kullback-Leibler information divergence. We focus our model on

detection and correct alignment of short similar regions in very long sequences of low overall similarity. The motif width in our problems is typically 10-20. Therefore, we do not allow any gap within a motif. The motifs identified by the proposed multiple alignment method are

ungapped blocks, which correspond to core regions in a group of proteins. On the other hand,

the regions outside of motifs are not aligned. There are insertions and deletions between the

aligned core motifs.

For simplicity, we focus on the model that assumes one motif occurring in each sequence.

Once one motif alignment is obtained, there are methods available to extend to multiple

motif alignments. For instance, one can continue searching for the next best motif by means of masking (Xie et al. 2004).

For the amino acid frequencies at each position j in the motif, we denote the frequency parameters $\theta_j = (\theta_{1,j}, \ldots, \theta_{20,j})^T$, $j = 1, \ldots, J$. Background sequences are assumed to follow another common multinomial distribution with parameter $\theta_0 = (\theta_{1,0}, \ldots, \theta_{20,0})^T$. Let $\Theta = (\theta_0, \theta_1, \ldots, \theta_J)$. We denote a counting function h such that $h(R) = (m_1, \ldots, m_{20})^T$, where $m_i$ is the number of letters of the ith type observed in R. Furthermore, let $R_{A(j)}$ denote the jth column in (1), and let $R_{\{A\}^c}$ denote the amino acids outside of the motif. Let $SOV(a_l, a_m)$ denote the SOV measure between two segments of width J starting at position $a_l$ in the lth sequence and position $a_m$ in the mth sequence.

3.2 Probability model

Given the notation above, the complete likelihood function with motif locations A given is defined as

$$
\pi(R, A \mid \Theta, \lambda, X) \propto \theta_0^{h(R_{\{A\}^c})} \prod_{j=1}^{J} \theta_j^{h(R_{A(j)})} \exp\Big\{\lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m)\Big\}
$$

[…] alignment A. The conjugate prior distribution for Θ is defined. Specifically, the prior for Θ is a product Dirichlet distribution, denoted by g(Θ). The parameter in the prior distribution for $\theta_j$ is $\beta_j = (\beta_{1,j}, \ldots, \beta_{20,j})$, $j = 0, \ldots, J$, which is defined at the end of this section. For notational simplicity, considering vectors $a = (a_1, \ldots, a_{20})^T$ and $b = (b_1, \ldots, b_{20})^T$, we write $a + b = (a_1 + b_1, \ldots, a_{20} + b_{20})^T$, $a^b = a_1^{b_1} \cdots a_{20}^{b_{20}}$, $|a| = |a_1| + \cdots + |a_{20}|$, and $\Gamma(a) = \Gamma(a_1) \cdots \Gamma(a_{20})$.

The posterior distribution for A is derived as follows:

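$$
\begin{aligned}
\pi(A \mid R, X, \lambda) &\propto \pi(A, R \mid \lambda, X) = \int \pi(A, R \mid \Theta, \lambda, X)\, g(\Theta)\, d\Theta \\
&\propto \Gamma\big(h(R_{\{A\}^c}) + \beta_0\big) \prod_{j=1}^{J} \Gamma\big(h(R_{A(j)}) + \beta_j\big) \times \exp\Big\{\lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m)\Big\}
\end{aligned}
$$

[…] as constants: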

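$$
\begin{aligned}
\pi(a_k \mid A_{[-k]}, R, X, \lambda) &\propto \frac{\pi(A \mid R, X, \lambda)}{\pi(A_{[-k]} \mid R, X, \lambda)} \\
&\propto \frac{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0 - h(R_{\{a_k\}})\big)}{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0\big)} \times \prod_{j=1}^{J} \frac{\Gamma\big(h(R_{A_{[-k]}(j)}) + \beta_j + h(r_{k,a_k+j-1})\big)}{\Gamma\big(h(R_{A_{[-k]}(j)}) + \beta_j\big)} \\
&\quad \times \exp\Big\{\lambda \frac{J}{K} \sum_{l:\, l \neq k} SOV(a_l, a_k)\Big\}
\end{aligned}
$$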

By using Stirling’s formula, the (predictive) posterior distribution for ak can be simplified

as:

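$$
\pi(a_k \mid A_{[-k]}, R, X, \lambda) \propto \prod_{j=1}^{J} \left( \frac{\hat\theta_{j[k]}}{\hat\theta_{0[k]}} \right)^{h(r_{k,a_k+j-1})} \times \exp\Big\{\lambda \frac{J}{K} \sum_{l:\, l \neq k} SOV(a_l, a_k)\Big\}, \qquad (3)
$$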

where θ̂j[k] and θ̂0[k] are the posterior means of θj and θ0, whose calculations are specified

below. Given the current alignment defined by A[−k], the probability of updating ak depends

on both the amino acid pattern, i.e., the odds ratio of the motif probability versus the

background probability, and the similarity of the secondary structures, i.e., SOV (al, ak),

$l = 1, \ldots, K$ and $l \neq k$.

The posterior means of $\theta_{j[k]} = (\theta_{1,j[k]}, \ldots, \theta_{20,j[k]})^T$, $j = 1, \ldots, J$, are evaluated based on the current alignment and a pseudo-count correction. Let $f_i$ be the observed relative frequency of amino acid i in the current alignment except sequence k. Let $p_i$ be the relative frequency of amino acid i in the background, N be the number of sequences except sequence k, $N = K - 1$, and B be the weight of the pseudo-count correction. A simple pseudo-count correction approach estimates the posterior mean by $\hat\theta_{i,j[k]} = (N \cdot f_i + B \cdot p_i)/(N + B)$, where $B \cdot p_i$ corresponds to the Dirichlet prior parameter $\beta_{i,j}$ in our Bayesian model. Alternatively, a better approach is the Blosum pseudo-count correction method (Altschul et al. 1997). It replaces $p_i$ in the formula by a frequency that is calculated from a Blosum (Henikoff and Henikoff 1992) amino acid substitution matrix. Formally, the pseudo-count $B \cdot p_i$ is multiplied by $\sum_{j=1}^{20} f_j e^{\mu S_{ij}}$, where $S_{ij}$ is the substitution score of amino acid pair (i, j) defined by a Blosum matrix (e.g. BLOSUM62), and µ is the scale parameter for the matrix. This frequency estimate uses the prior knowledge of amino acid relationships embodied in the substitution matrix $S_{ij}$. Those residues favored by the substitution matrix to align with the residues actually observed receive high pseudo-count frequencies.
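The simple pseudo-count estimate can be illustrated with a short sketch. It covers only the simple correction $(N f_i + B p_i)/(N + B)$ and not the Blosum variant; the function name `posterior_mean`, the uniform background, and the weight B = 1 are assumptions chosen for illustration.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def posterior_mean(column, background, weight):
    """Estimate (N*f_i + B*p_i) / (N + B) for every amino acid i.

    `column` holds the residues of one motif column with sequence k
    already removed, so N = len(column) and N*f_i is the raw count.
    """
    n = len(column)
    counts = Counter(column)
    return {a: (counts[a] + weight * background[a]) / (n + weight)
            for a in AMINO}


# Illustrative inputs: uniform background and pseudo-count weight B = 1.
background = {a: 1 / len(AMINO) for a in AMINO}
theta = posterior_mean("LLLIV", background, weight=1.0)
print(round(theta["L"], 4), round(theta["A"], 4))
```

Because every amino acid receives a positive pseudo-count, residues never observed in a column still get a non-zero probability, which keeps the odds ratio in Formula (3) finite.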

3.3 Gibbs sampling algorithm with multiple simulations

A Gibbs sampling procedure is used to generate samples according to Formula (3). The

sampling approach provides a good means to characterize the posterior distribution of motif

locations A. For instance, the mode of the posterior distribution gives an optimal motif

alignment. The Gibbs sampling starts with a random initial value of A, which is chosen

uniformly from all possible locations. Then ak, k = 1, . . . , K, is updated one sequence at a time.

The algorithm has two basic steps:

1. Exclude sequence k and calculate the current parameters θj[k] and θ0[k] using the

Blosum pseudo-count correction method described above. The predicted secondary

structures of the motif segments, except sequence k, are ready to use.

2. The likelihood ratio between the motif model and the background model is calculated

as in Formula (3). The new motif location ak is generated according to the weight (the

likelihood ratio).
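The two steps above can be sketched as a single update of one sequence. This is an illustration under stated simplifications, not the authors' implementation: start positions are 0-based, the simple pseudo-count correction replaces the Blosum correction, a fixed background distribution stands in for the posterior mean of θ0, and `match_fraction` is a stand-in for the SOV measure. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def match_fraction(s1, s2):
    """Stand-in for SOV: fraction of positions with the same state."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


def gibbs_update(seqs, ss, starts, k, width, lam, bg,
                 weight=1.0, sim=match_fraction, rng=random):
    """Resample the 0-based motif start of sequence k given the others."""
    K = len(seqs)
    others = [l for l in range(K) if l != k]
    # Step 1: column-wise posterior means from the alignment without k.
    theta = []
    for j in range(width):
        counts = Counter(seqs[l][starts[l] + j] for l in others)
        theta.append({a: (counts[a] + weight * bg[a]) / (len(others) + weight)
                      for a in AMINO})
    # Step 2: log weight of every candidate start, as in Formula (3).
    log_w = []
    for a in range(len(seqs[k]) - width + 1):
        lw = sum(math.log(theta[j][seqs[k][a + j]] / bg[seqs[k][a + j]])
                 for j in range(width))
        seg_k = ss[k][a:a + width]
        lw += lam * width / K * sum(
            sim(ss[l][starts[l]:starts[l] + width], seg_k) for l in others)
        log_w.append(lw)
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]   # rescaled for stability
    new_start = rng.choices(range(len(weights)), weights=weights)[0]
    return new_start, weights


# Toy data: the segment WCWHW with helix states is planted in each sequence.
bg = {a: 1 / len(AMINO) for a in AMINO}
seqs = ["GGGGWCWHWGGGG", "GGWCWHWGGGGGG", "GGGGGGGWCWHWG"]
ss = ["CCCCHHHHHCCCC", "CCHHHHHCCCCCC", "CCCCCCCHHHHHC"]
starts = [4, 2, 0]   # the third start is deliberately wrong
new_start, weights = gibbs_update(seqs, ss, starts, k=2, width=5,
                                  lam=1.0, bg=bg, rng=random.Random(0))
print(new_start, weights.index(max(weights)))
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow when the motif is wide; the sampled position is drawn in proportion to the resulting weights.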

The algorithm iterates the previous two steps for all sequences k = 1, . . . , K, in thou-

sands of iterations. The most probable sample A, obtained in the Gibbs sampling iterations,

corresponds to a mode (typically a local maximum) of the posterior distribution of A. Equiv-

alently, we consider maximizing an alignment score defined as:

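$$
\text{Score} = \sum_{j=1}^{J} \sum_{i=1}^{20} c_{i,j} \log\frac{\hat\theta_{i,j}}{\hat\theta_{i,0}} + \lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m),
$$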

where the ci,j’s are amino acid counts from the complete alignment. The first term in the

score formula is similar to the score defined by the standard Gibbs sampling approach with

only amino acid frequency (Jensen et al. 2004). The second term is a new contribution by

secondary structures.
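A minimal sketch of the two-term score follows. It assumes the simple pseudo-count estimate for the motif frequencies, a fixed background distribution in place of the estimated background frequencies, and a stand-in similarity function instead of SOV; the names `alignment_score` and `match_fraction` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def match_fraction(s1, s2):
    """Stand-in for SOV: fraction of positions with the same state."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


def alignment_score(block, ss_block, bg, lam, sim=match_fraction, weight=1.0):
    """Amino acid log-odds term plus lam * (J / K) * pairwise similarity."""
    K, J = len(block), len(block[0])
    score = 0.0
    for j in range(J):
        counts = Counter(row[j] for row in block)   # c_{i,j} for column j
        for a, c in counts.items():
            theta = (c + weight * bg[a]) / (K + weight)
            score += c * math.log(theta / bg[a])
    # Sum over all unordered pairs l < m of motif segments.
    pair_sum = sum(sim(s, t) for s, t in combinations(ss_block, 2))
    return score + lam * J / K * pair_sum


bg = {a: 1 / len(AMINO) for a in AMINO}
conserved = ["WCWHW", "WCWHW", "WCWHW"]
diverged = ["WCWHW", "GAKLM", "TSRQE"]
helices = ["HHHHH", "HHHHH", "HHHHH"]
print(alignment_score(conserved, helices, bg, lam=1.0))
print(alignment_score(diverged, helices, bg, lam=1.0))
```

In the second call the amino acid term is weak, so the structural term carries most of the score; this is the situation the model is designed for.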

Our simulations indicate that, starting from a given random initial location A, the Gibbs sampling algorithm always converges within a thousand iterations. However, the converged results may vary from simulation to simulation with different initial values A. The sampling

result of an individual Markov chain only corresponds to one of many local maxima. We

evaluate the sampling procedure using multiple simulations.

As an ad hoc guideline, we always run Gibbs sampling with several choices of the param-

eter λ, for instance, λ = 0.5, 1, 1.5, 2. In addition, 50-100 Markov chain simulations from

different random initial locations A are used for each λ value. Gelman and Rubin (1992)

noticed the importance of running multiple Gibbs sampling chains for obtaining reliable sta-

tistical inferences. Besides obtaining an over-dispersed distribution of the motif alignment A,

running multiple Markov chains solves the difficult problem of setting the unknown parame-

ter λ. Instead of setting a λ value for the given protein data, we consider the best alignment

as the one that has a high probability under several λ values. Therefore, the alignments that

repeat most frequently in these multiple simulations and also have high alignment scores are

reported as the candidate alignments.
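The selection rule described above, most frequent alignments first and high scores as a tie-breaker, can be sketched as a simple tally over chains. The function name `rank_alignments` and the run results below are made up for illustration; in practice each entry would come from one Markov chain at one λ value.

```python
from collections import Counter


def rank_alignments(runs):
    """Rank alignments by how often chains returned them, then by best score.

    `runs` is an iterable of (starts, score) pairs, one per chain.
    """
    freq = Counter()
    best = {}
    for starts, score in runs:
        key = tuple(starts)
        freq[key] += 1
        best[key] = max(score, best.get(key, float("-inf")))
    return sorted(freq, key=lambda a: (freq[a], best[a]), reverse=True)


# Made-up results from five chains (start positions, alignment score).
runs = [
    ((4, 2, 7), 31.0),
    ((4, 2, 7), 31.0),
    ((0, 2, 7), 12.5),
    ((4, 2, 7), 31.0),
    ((1, 1, 1), 3.2),
]
for alignment in rank_alignments(runs):
    print(alignment)
```

Pooling the runs from all λ values before tallying reflects the idea that a good alignment should recur under several settings of λ rather than under one tuned value.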

4 Application

To evaluate the proposed alignment method using secondary structure predictions, we com-

pare it with the standard Gibbs sampling (Lawrence et al. 1993; Liu et al. 1995), as

well as the highly ranked multiple alignment programs, including ClustalW (Thompson et

al. 1994), Dialign (Morgenstern et al. 1996), and PRRP (Gotoh 1996). The programs

are tested on reference alignments from the BAliBASE (Thompson et al. 1999a) bench-


mark alignment database (http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE), which con-

tains manually-refined multiple sequence alignments. The aligned regions are defined as core

blocks, whose alignments are validated to ensure functional or structural conservation. Most

data sets in BAliBASE include a few proteins (< 10). For our purposes, we select ten large data sets, each of which has more than 10 sequences. These data sets are also chosen to

represent the most difficult alignment problems. Specifically, four data sets (1idy, 1r69, 1ubi,

1wit) are selected from BAliBASE Reference 3 containing divergent protein families with av-

erage sequence identity less than 22%. Two data sets (Kinase2 and 1vln) are selected from

BAliBASE Reference 4 containing sequences with large N/C terminal extensions, and four

data sets (1thm1, s51, kinase2, kinase3) are selected from BAliBASE Reference 5 containing

internal insertions.

The names and features of the four data sets from Reference 3 are listed in Table 1.

Notice that instead of using the short sequences provided in BAliBASE, we collect the whole

protein sequences from the SWISS-PROT database (Bairoch and Apweiler 1997). The input

sequences for our alignments are therefore much longer than those in BAliBASE, which should make the structural core blocks harder to align correctly. Motif widths are determined by

the extension procedure (Xie and Kim 2005), with 22, 19, 19, and 16 for 1idy, 1r69, 1ubi,

and 1wit respectively.

[Table 1 about here.]

To illustrate the impact of using secondary structures, we plot the likelihood function

of motif location ak for the third sequence (RPC2_BPP22) in the set of 1r69. Except for

this sequence, we assume that all the other motif locations are known. Figure 3 (a) shows

the log-likelihood based on only the SOV part, and Figure 3 (b) shows the log-likelihood

of the full model (PSSM + SOV; black) and the log-likelihood based on only the amino

acids part (PSSM; grey). The likelihood function based on only the SOV part gives high

probabilities for a few motif locations, whereas the likelihood based on PSSM alone shows


high probability peaks at many locations. Inference for the true motif location (position

17 for this data) is not an easy job, because the likelihood function based on either PSSM

or SOV alone has no dominant mode. In contrast, combining PSSM with SOV, we obtain

a better-shaped likelihood function. The true motif location at position 17 is clearly the

global mode and the relative difference from the second mode is strong. For this type of

data, the predicted secondary structure enhances the motif pattern, so the true motif is easier to identify under the new model. As demonstrated in Table 2, the proposed alignment method with secondary structure information finds the true motif of 1r69 much more frequently (3.85 times as often) than the standard Gibbs sampling method.


Table 2 shows comparisons of our proposed model with the standard Gibbs sampling

method. For each data set, the alignments obtained by both methods are compared to the

structural alignments in BAliBASE. A good alignment is defined when a large number of

sequences out of the total number in each data set are correctly aligned. The criteria of

determining good alignments are listed in the second column in Table 2. Multiple Markov

chain simulations are used for the proposed method (PSSM+SOV) and the standard Gibbs

sampling, where the proposed method runs 200 Markov chains, 50 runs at each of four

λ = 0.5, 1, 1.5, 2, and the standard Gibbs sampling runs 100 Markov chains. The numbers

in the table represent the number of runs that correctly found the structural core blocks

in BAliBASE. Our model (PSSM + SOV) shows better success rates in finding the true

motifs. For example, the success rate for 1idy increases from 0% to 12.5%. The rate for 1r69

increases from 10% to 38.5%.


Further comparisons of the proposed method (PSSM+SOV) with ClustalW, Dialign, and

PRRP are displayed in Table 3. The reported alignments from PSSM+SOV are the most


frequent alignments in 200 Markov chain simulations as described previously. Alignments are

measured by the number of correctly aligned sequences out of the total number of sequences

in each data set. For the data set 1ubi, PSSM+SOV performs much better than the other

3 programs. For data sets 1idy and 1r69, PSSM+SOV performs as well as Dialign but

better than the other 2 programs. For the rest of the data sets, all programs work well. In

summary, PSSM+SOV is the best choice among these programs on the tested data sets. Plots of the percentages of

correctly aligned sequences for each of the programs in each of the data sets are shown in

Figure 4. The line of PSSM+SOV (dark blue) has high alignment values in all data sets.



The comparisons indicate that the proposed method using secondary structure predic-

tions works at least as well as the best alignment programs using amino acid sequence

information alone, and even better in some situations. Studying the structural alignments

of these data sets in BAliBASE, we found that most of the alignments had conserved amino

acids at several positions, except the alignments of 1idy and 1ubi. Our proposed method out-

performs other alignment programs in these two data sets, because the secondary structures

greatly enhance the motif signals in addition to amino acid conservation. As an example,

the structural alignment of 1idy from BAliBASE is shown in Figure 5. The underlined

segments share common core structures and therefore are referred to as the true motif seg-

ments. Table 4 shows the alignment for 1idy by our approach using both PSSM and SOV.

This alignment corresponds to the first and second core structural regions in Figure 5. The

aligned amino acid segments show that there is no strongly conserved amino acid pattern,

except column 17. In contrast, the predicted secondary structures are conserved. The

secondary structure of the motif can be considered as a helix-turn-helix (helix-coil-helix)

structure.




5 Discussion

The currently existing methods of identifying protein motifs consider only amino acid fea-

tures of the motifs. The proposed model is the first attempt to utilize the predicted secondary

structures for a probabilistic model of motifs. It is not surprising that information brought

by the predicted secondary structures improves multiple alignments. The similarity measure for secondary structures, SOV, is defined over whole motif segments. The

dependence feature of adjacent amino acids is partially modeled in our approach, whereas

all existing models assume that the positions in a motif are independent.

Probability models and Bayesian methods showed great advantages in dealing with high

dimensional complicated sequence features. Our scoring function is in terms of probability,

which is defined exponentially proportional to a similarity measurement of secondary structures. Instead of directly maximizing a score function, the Gibbs sampling method is employed to

simulate samples of the posterior probability, whose modes correspond to alignments of high

scores. Failure to converge to the global maximum is a major concern in multiple sequence alignment. We address this problem by simulating multiple Markov chains from different random initial values and under different values of the parameter λ. The most probable alignment

from multiple simulations is likely to be the true alignment.

The proposed model can be improved by including reliability indices of secondary struc-

ture predictions. PSI-PRED (Jones 1999) assigns a score of confidence level at 0-9 for each

predicted secondary state (H, E or C). The score 9 indicates the most reliable prediction,

whereas score 0 indicates the least reliable prediction. It is known that the reliability in-

dices correlate very well with prediction accuracy. A weighted SOV measurement may be


developed such that the similarity between two segments of secondary structure in a confor-

mational state (i.e. H, E or C) will be weighted by the sum of the confidence indices of the

segments. The weighted SOV can then be substituted into Formulas (2) and (3) for a better

model of secondary structures.

References

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J, Zhang, Z, Miller, W., and Lipman,

D. J. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs,” Nucleic Acids Research, 25, 3389-3402.

Bairoch, A., and Apweiler, R. (1997), “The SWISS-PROT protein sequence database: its

relevance to human molecular medical research,” Journal of Molecular Medicine, 75, 312-316.

Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S., and Jones, D. T. (2005), “Protein structure prediction servers at University College London,” Nucleic Acids Research, 33 (Web Server issue), W36-38.

Chou, P. Y., and Fasman, G. D. (1974), “Prediction of protein conformation,” Biochemistry, 13, 211-215.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from

incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Ser. B, 39,

1-38.

Eddy, S. R. (1998), “Profile hidden Markov models,” Bioinformatics, 14, 755-763.

Errami, M., Geourjon, C., and Deléage, G. (2003), “Detection of unrelated proteins in

sequences multiple alignments by using predicted secondary structures,” Bioinformatics, 19,

506-512.

Garnier, J., Osguthorpe, D. J., and Robson, B. (1978), “Analysis of the accuracy and

implications of simple methods for predicting the secondary structure of globular proteins,”

Journal of Molecular Biology, 120, 97-120.

Gelman, A. and Rubin, D. B. (1992), “Inference from iterative simulation using multiple

sequences,” Statistical Science, 7, 457-472.

Geourjon, C., Combet, C., Blanchet, C., and Deléage, G. (2001), “Identification of Re-

lated Proteins with Weak Sequence Identity Using Secondary Structure Information,” Pro-

tein Science, 10, 788-797.

Gotoh, O. (1996), “Significant improvement in accuracy of multiple protein sequence

alignments by iterative refinement as assessed by reference to structural alignments,” Journal of

Molecular Biology, 264, 823-838.

Henikoff, S., and Henikoff, J. G. (1992), “Amino Acid Substitution Matrices from Protein

Blocks,” Proceedings of the National Academy of Sciences, 89, 10915-10919.

Jensen, S. T., Liu, X. S., Shou, Q., and Liu, J. S. (2004), “Computational Discovery of

Gene Regulatory Binding Motifs: A Bayesian Perspective,” Statistical Science, 19, 188-204.

Jones, D. T. (1999), “Protein secondary structure prediction based on position-specific

scoring matrices,” Journal of Molecular Biology, 292, 195-202.

Kabsch, W., and Sander, C. (1983), “Dictionary of protein secondary structure: pattern

recognition of hydrogen-bonded and geometrical features,” Biopolymers, 22, 2577-2637.

Lawrence, C. E., and Reilly, A. A. (1990), “An Expectation-Maximization (EM) Algo-

rithm for the Identification and Characterization of Common Sites in Biopolymer Sequences,”

Proteins, 7, 41-51.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,

J. C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple

Alignment,” Science, 262, 208-214.

Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995), “Bayesian Models for Multi-

ple Local Sequence Alignment and Gibbs Sampling Strategies,” Journal of the American

Statistical Association, 90, 1156-1170.

Morgenstern, B., Dress, A., and Werner, T. (1996), “Multiple DNA and protein sequence

alignment based on segment-to-segment comparison,” Proceedings of the National Academy of Sciences, 93, 12098-12103.

Rost, B., and Sander, C. (1993), “Prediction of Protein Secondary Structure at Better

than 70% Accuracy,” Journal of Molecular Biology, 232, 584-599.

Salamov, A. A., and Solovyev, V. V. (1995), “Prediction of protein secondary structure

by combining nearest-neighbour algorithms and multiple sequence alignments,” Journal of

Molecular Biology, 247, 11-15.

Silberberg, M. S. (2000), Chemistry: The molecular nature of matter and change (2nd ed.), Boston, MA: McGraw-Hill.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994), “CLUSTAL W: improving the

sensitivity of progressive multiple sequence alignment through sequence weighting, position-

specific gap penalties and weight matrix choice,” Nucleic Acids Research, 22, 4673-4680.

Thompson, J. D., Plewniak, F., and Poch, O. (1999a), “BAliBASE: a benchmark align-

ment database for the evaluation of multiple alignment programs,” Bioinformatics, 15, 87-88.

Thompson, J. D., Plewniak, F., and Poch, O. (1999b), “A comprehensive comparison of

multiple sequence alignment programs,” Nucleic Acids Research, 27, 2682-2690.

Xie, J., Li, K.-C., and Bina, M. (2004), “A Bayesian Insertion/Deletion Algorithm for

Distant Protein Motif Searching via Entropy Filtering,” Journal of the American Statistical

Association, 99, 409-420.

Xie, J., and Kim, N.-K. (2005), “Bayesian Models and Markov Chain Monte Carlo Methods for Protein Motifs with the Secondary Characteristics,” Journal of Computational Biology, 12, 952-970.

Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. (1999), “A Modified Definition of

Sov, a Segment-Based Measure for Protein Secondary Structure Prediction Assessment,”

Proteins, 34, 220-223.

Dataset   Family name   No. of sequences   Length∗    Motif width   Ave. identity (%)
1idy      DNA binding   25                 101-636    22            19
1r69      Repressor     23                 71-882     19            18
1ubi      Ubiquitin     22                 70-1132    19            20
1wit      Twitchin      19                 93-250     16            22

Table 1: Four data sets from BAliBASE Reference 3. ∗The sequence lengths are longer than those in BAliBASE because the whole sequences were collected from the SWISS-PROT database.

                                                     Rate of the correct alignments
Dataset   Correct alignments                         PSSM+SOV          Standard Gibbs (PSSM alone)
          (correctly aligned / total # of seq)
1idy      16/25 and more                             25/200 (12.5%)    0/100 (0%)
1r69      21/23 and more                             77/200 (38.5%)    10/100 (10%)
1ubi      20/22 and more                             18/200 (9%)       1/100 (1%)
1wit      15/19 and more                             37/200 (18.5%)    4/100 (4%)

Table 2: Comparison of the multiple Markov chain simulation results for the proposed method (PSSM+SOV) and the standard Gibbs sampling method. An alignment is counted as correct when the number of correctly aligned sequences is equal to or larger than the cutoff in the second column. The rate of correct alignments is obtained from multiple Markov chain simulations: 200 Markov chains for the proposed method (PSSM+SOV) and 100 Markov chains for the standard Gibbs sampling (PSSM alone). The number of Markov chains that find the correct alignments is reported in the third and fourth columns.

Dataset          PSSM+SOV   ClustalW   Dialign   PRRP
1idy             16/25      10/25      16/25     8/25
1r69             23/23      9/23       23/23     5/23
1ubi             20/22      5/22       4/22      1/22
1wit             16/19      15/19      16/19     18/19
1thm1            11/11      10/11      11/11     11/11
kinase2          16/17      15/17      15/17     15/17
kinase2 insert   11/12      12/12      12/12     11/12
kinase3 insert   19/19      18/19      18/19     18/19
s51              15/15      15/15      15/15     15/15
1vln             13/14      14/14      14/14     14/14

Table 3: Comparison of the rate of the correctly aligned sequences for the proposed method (PSSM+SOV) with three highly ranked programs, ClustalW, Dialign, and PRRP. The numbers are the correctly aligned sequences out of the total number of sequences in each data set. Our proposed method performs better than or as well as the other programs for most of the data sets.

Sequence name         Aligned AA Segment      Secondary Structures
sp|P06876|MYB_MOUSE   RIIYQAHKRLGNRWAEIAKLLP  HHHHHHHHHHCCHHHHHHHHHC
sp|P27898|MYBP_MAIZE  DIIIKLHATLGNRWSLIASHLP  HHHHHHHHHCCCCHHHHHHHHC
sp|P20025|MYB3_MAIZE  DLIVKLHSLLGNKWSLIAARLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P27900|GL1_ARATH   DLIIRLHKLLGNRWSLIAKRVP  HHHHHHHHHHCCHHHHHHHHCC
sp|P20027|MYB3_HORVU  DHIVALHQILGNRWSQIASHLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P80073|MYB2_PHYPA  NLILDLHATLGNRWSRIAAQLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P02259|H5_CHICK    AAIRAEKSRGGSSRQSIQKYIK  HHHHHHHHCCCCCHHHHHHHHH
sp|P15870|H1D_STRPU   SALESLKEKKGSSRQAILKYVK  HHHHHHHHCCCCCHHHHHHHHH
sp|P15869|H1B_STRPU   AAITALKERGGSSAQAIRKYIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P35060|H1_TIGCA    AAIKALKERNGSSLPAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|Q05831|H1L_MYTTR   AAITAMKNRKGSSVQAIRKYIL  HHHHHHHHCCCCCHHHHHHHHH
sp|P02257|H1_ECHCR    AAIAAQKERRGSSVAKIQSYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|P10771|H11_CAEEL   EAIKQLKDRKGASKQAILKFIS  HHHHHHHHCCCCCHHHHHHHHH
sp|P06894|H1A_PLADU   TAILGLKERKGSSMVAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|P26568|H11_ARATH   DAIVTLKERTGSSQYAIQKFIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P54671|H1_DICDI    TAIAHYKDRTGSSQPAIIKYIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P15282|ARGR_ECOLI  AFKALLKEEKFSSQGEIVAALQ  HHHHHHHHHCHHHHHHHHHHHH
sp|P95721|ARGR_STRCL  RIVDILNRQPVRSQSQLAKLLA  HHHHHHHHHCCCCHHHHHHHHH
sp|P17893|ARGR_BACSU  KIREIITSNEIETQDELVDMLK  HHHHHHHHHCHHHHHHHHHHHH
sp|O31408|ARGR_BACST  KIREIIMSNDIETQDELVDRLR  HHHHHHHHHCHHHHHHHHHHHH
sp|Q54870|ARGR_STRPN  LIKKMITEEKLSTQKEIQDRLE  HHHHHHHHHCHHHHHHHHHHHH
sp|P94992|ARGR_MYCTU  RIVAILSSAQVRSQNELAALLA  HHHHHHHHHCCCCHHHHHHHHH
sp|P03032|TRPR_ECOLI  VRIVEELLRGEMSQRELKNELG  HHHHHHHHHCCCCHHHHHHHCC
sp|P44889|TRPR_HAEIN  LQIVSQLIDKNMPQREIQQNLN  HHHHHHHHHCCCCHHHHHHHHC
sp|P34257|TC3A_CAEEL  VSLHEMSRKISRSRHCIREYLK  CCHHHHHHHHCCCCHHHHHHHH

Table 4: Alignments of the data set 1idy by the proposed method. This alignment corresponds to the first and second core structural regions shown in Figure 5. While there is a clear conservation in the secondary structures for this motif, the aligned amino acid segment shows no strongly conserved column, except column 17. The secondary structure of the motif can be considered a helix-turn-helix structure.

Conf: 968887179808999855874388999988874088842028812883616897600047
Pred: CEEEEEECCCEEEEEEECCCCHHHHHHHHHHHHHCCCCCCCEEECCCEEECCCCEEHHCC
AA:   MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
              10        20        30        40        50        60

Conf: 8999889999763189
Pred: CCCCCEEEEEEEECCC
AA:   IQKESTLHLVLRLRGG
              70

Figure 1: An example of secondary structure prediction by PSI-PRED. The protein is UBIQ_HUMAN (SWISS-PROT ID P02248), which is a sequence in the data set 1ubi. The lines show, in order, the confidence level of the secondary structure prediction, the string of the predicted secondary structure, and the original amino acid sequence.

                   β1
Structure 1  CCCCEEEEEECCCC
                β2    β3
Structure 2  CCEEEECCEEECCC
                 --  --
               ++++++++
                 +++++++

Figure 2: Illustration of minov and maxov in the SOVo(E) calculation. (- -) indicates the minov of (β1, β2) and (β1, β3). The first line of (++) indicates the maxov of (β1, β2) and the second line of (++) indicates the maxov of (β1, β3).
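The two quantities in Figure 2 can be checked with a short sketch. It assumes that the two structure strings of the figure are read position by position with no gaps, and that minov is the length of the region shared by two segments while maxov is the total extent covered by the pair; the function names are illustrative.

```python
def segments(ss, state):
    """Return (start, end) pairs, end exclusive, of maximal runs of `state` in `ss`."""
    runs, start = [], None
    for i, ch in enumerate(ss + "#"):      # sentinel closes a run at the end
        if ch == state and start is None:
            start = i
        elif ch != state and start is not None:
            runs.append((start, i))
            start = None
    return runs


def minov(a, b):
    """Length of the region where segments a and b overlap (0 if none)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def maxov(a, b):
    """Total extent covered by segments a and b together."""
    return max(a[1], b[1]) - min(a[0], b[0])


structure1 = "CCCCEEEEEECCCC"   # one strand, beta1
structure2 = "CCEEEECCEEECCC"   # two strands, beta2 and beta3

(b1,) = segments(structure1, "E")
b2, b3 = segments(structure2, "E")

for name, seg in (("beta2", b2), ("beta3", b3)):
    print(f"(beta1, {name}): minov={minov(b1, seg)}, maxov={maxov(b1, seg)}")
```

Under these assumptions the counts agree with the figure: each pair shares two positions, and the two maxov spans cover eight and seven positions.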

[Figure 3 plots: panels (a) and (b); x-axis: position (0 to 200); y-axis: relative values of log-likelihood.]

Figure 3: Log-likelihood plot for a sequence, RPC2_BPP22, in the data set 1r69 from BAliBASE. (a) The log-likelihood calculated by the SOV part; (b) the log-likelihood calculated by the proposed model (PSSM + SOV; black) and the log-likelihood calculated by the amino acids part (PSSM; grey).

[Figure 4 plots: panel (a) covers the data sets 1idy, 1r69, 1ubi and 1wit, with lines for PSSM+SOV, Gibbs sampling (PSSM only), ClustalW, Dialign and PRRP; panel (b) covers the data sets 1idy, 1r69, 1ubi, 1wit, 1thm1, kinase2, kinase2_insert, kinase3_insert, s51 and 1vln, with lines for PSSM+SOV, ClustalW, Dialign and PRRP; y-axis: percent of correctly aligned sequences.]

Figure 4: Comparison of the proposed method with the standard Gibbs sampling method, ClustalW, Dialign, and PRRP. The plots represent the percentages of correctly aligned sequences out of the total number of sequences for each program in each data set. The proposed method (PSSM+SOV), shown by the dark blue line, performs best among the programs on most data sets.

1idy        1 mevkktswteeedrILYQAhkrlgnRWAEIAKLLp.........grtdna
mybp_maize  1 advkrgniskeeedIIIKLhatlgnRWSLIASHLp.........grtdne
myb3_maize  1 .dlkrgnftadeddLIVKLhsllgnKWSLIAARLp.........grtdne
gl1_arath   1 .nvnkgnfteqeedLIIRLhkllgnRWSLIAKRVp.........grtdnq
myb3_horvu  1 .dlkrgcfsqqeedHIVALhqilgnRWSQIASHLp.........grtdne
myb2_phypa  1 .dlkrgifseaeenLILDLhatlgnRWSRIAAQLp.........grtdne
1hstA       1 ...shptysemiaaAIRAEksrggsSRQSIQKYIkshykvgh...nadlq
h1d_strpu   1 ...shpkysdmiasALESLkekkgsSRQAILKYVkanftvgd...nanvh
h1b_strpu   1 ...ahpsssemvlaAITALkerggsSAQAIRKYIeknytvdi..kkqaif
h1_tigca    1 ...thpptsvmvmaAIKALkerngsSLPAIKKYIaanykvdv..vknahf
h1l_myttr   1 ....kpstlsmivaAITAMknrkgsSVQAIRKYIlannkgin.tshlgsa
h1_echcr    1 ...ahppvidmitaAIAAQkerrgsSVAKIQSYIaakyrcdi..nalnph
h11_caeel   1 ...ahppyintikeAIKQLkdrkgaSKQAILKFIsqnyklgdnviqinah
h1a_pladu   1 ...ahppvatmvvtAILGLkerkgsSMVAIKKYIaanyrvdv..arlapf
h11_arath   1 ...shptyeemikdAIVTLkertgsSQYAIQKFIeekrkelp..ptfrkl
h1_dicdi    1 ...nhptyqvmistAIAHYkdrtgsSQPAIIKYIeanynvap..dtfktq
1aoy        1 .mrssakqeelvkaFKALLkeekfsSQGEIVAALqeq.gfd...ninqsk
ARGR_STRCL  1 ........marhrrIVDILnrqpvrSQSQLAKLLadn.gls....vtqat
G3273713    1 enlnpvtrtarqalILQILdkqkvtSQVQLSELLlde.gid....itqat
AHRC_BACSU  1 .....mnkgqrhikIREIItsneieTQDELVDMLkqd.gyk....vtqat
ARGR_BACST  1 .....mnkgqrhikIREIImsndieTQDELVDRLrea.gfn....vtqat
ARGR_STRPN  1 .....mrkrdrhqlIKKMIteeklsTQKEIQDRLeah.nvc....vtqtt
ARGR_MYCTU  1 gpevaanragrqarIVAILssaqvrSQNELAALLaae.gie....vtqat
1jhgA       1 .tpderealgtrvrIIEELlrge.mSQRELKNELg..........agiat
TRPR_HAEIN  1 .taderdavglrlqIVSQLidkn.mPQREIQQNLn..........tsaat
G3328572    1 .sfserkdvasryhIIRALlege.lTQREIAEKYg..........vsiaq
1tc3C       1 ....rgsalsdterAQLDVmkllnvSLHEMSRKIs..........rsrhc

1idy       42 IKNHWNSTmrrkv.
mybp_maize 42 IKNYWNSHlsrq..
myb3_maize 41 IKNYWNTHvrrk..
gl1_arath  41 VKNYWNTHlskk..
myb3_horvu 41 IKNFWNSCikkk..
myb2_phypa 41 IKNYWNTRlkkr..
1hstA      45 IKLSIRRLlaagv.
h1d_strpu  45 IKQALKRGvtsgq.
h1b_strpu  46 IKRALITGvekgt.
h1_tigca   46 IKKALKSLvekkk.
h1l_myttr  46 MKLAFAKGlksgv.
h1_echcr   46 IRRALKNQvksga.
h11_caeel  48 HRQALKRGvtska.
h1a_pladu  46 IRKFIRKAvkqtkg
h11_arath  46 LLLNLKRLvasgk.
h1_dicdi   46 LKLALKRLvakgt.
1aoy       46 VSRMLTKFgavrt.
ARGR_STRCL 38 LSRDLDELgavki.
G3273713   46 LSRDLDELgarkv.
AHRC_BACSU 41 VSRDIKELhlvkv.
ARGR_BACST 41 VSRDIKEMqlvkv.
ARGR_STRPN 41 LSRDLREIgltkv.
ARGR_MYCTU 46 LSRDLEELgavkl.
1jhgA      39 ITRGSNSLkaapv.
TRPR_HAEIN 39 ITRGSNMIktmdp.
G3328572   39 ITRGSNALkgldp.
1tc3C      37 IRVYLKDPvsygt.

Figure 5: The structural alignment of the data set 1idy reported in BAliBASE. The underlined segments are core structural regions.
