
  • Protein Multiple Alignment Incorporating Primary and

    Secondary Structure Information

    Nak-Kyeong Kim1 and Jun Xie2,∗

    1National Center for Biotechnology Information

    National Library of Medicine

    National Institutes of Health

    8600 Rockville Pike, Building 38A

    Bethesda, MD 20894-6075

    E-mail: [email protected]

    2Department of Statistics

    Purdue University

    150 N. University Street

    West Lafayette, IN 47907-2067

    E-mail: [email protected]

    April 26, 2006

    Running Heads: Sequence alignment with secondary structures

    KEY WORDS: Gibbs sampling; Likelihood function; Protein sequence motifs; Secondary

    structures; Segment overlap.

  • Abstract

    Identifying common local segments, also called motifs, in multiple protein sequences plays an important role in establishing homology between proteins. Homology is easy to establish when sequences are similar (sharing an identity > 25%). However, for distant proteins, it is much more difficult to align motifs that are not similar in sequence but still share common structures or functions. This paper is a first attempt to align multiple protein sequences using both primary and secondary structure information. A new sequence model is proposed that assigns high probabilities not only to motifs that contain conserved amino acids but also to motifs that present common secondary structures. The proposed method is tested on the structural alignment database BAliBASE. We show that information brought by the predicted secondary structures greatly improves motif identification. A website for this program is available at http://www.stat.purdue.edu/~junxie/2ndmodel/sov.html.

    1 Introduction

    Genome sequencing projects produce enormous sequence data. The interpretation of these

    data, however, is an ongoing challenge and highly depends on efficient computational ap-

    proaches. Statistical methods and probability models have been successfully used to analyze

    biological sequences. In this paper, we are interested in aligning common motifs in multiple

    proteins. The observed data are protein amino acid sequences, which are also called the pri-

    mary structure of the proteins. Protein motifs here are referred to as local segments (10-50

    amino acids) that are critical for protein structures and functions. Multiple sequence align-

    ments help to characterize protein structures and functions by common sequence patterns.

    Numerous multiple sequence alignment programs have been proposed. Thompson et al. (1999b) provided a comprehensive comparison of ten programs, some of which were highly ranked as evaluated by the BAliBASE (Thompson et al. 1999a) benchmark alignment database. To list a few, ClustalW (Thompson et al. 1994) is a widely used progressive alignment method: a multiple alignment is built up gradually by aligning the closest sequences first and successively adding in the more distant ones. Dialign (Morgenstern et al. 1996) is a local alignment approach, which constructs multiple alignments based on segment-to-segment comparisons rather than residue-to-residue comparisons. The PRRP program (Gotoh 1996) optimizes a progressive alignment by iteratively dividing the sequences into two groups and realigning the groups. These three programs will be compared to our proposed alignment method in Section 4.

    In addition to the alignment programs, motifs are often modeled by the position specific

    score matrix (PSSM), which corresponds to a product of multinomial distributions of amino

    acids. Based on the PSSM model, Lawrence and Reilly (1990) treated the starting positions

    of motifs as missing data and proposed an EM algorithm (Dempster et al. 1977) for motif

    detection. An EM algorithm is known for slow convergence, and the program often converges

    to a local maximum. Lawrence et al. (1993) and Liu et al. (1995) developed a Bayesian model

    and a Gibbs sampling algorithm to find the motifs under the same missing-data formulation.

    The method has a better chance to escape a local maximum because of its stochastic nature.

    Xie et al. (2004) extended the Bayesian model by allowing insertions and deletions within

    the motifs. Eddy (1998) developed a hidden Markov model to describe motifs, also allowing

    gaps inside motif patterns. Considering insertions and deletions often results in intensive computation, and the program may fail to converge. Despite their strengths, all the above methods use only the primary structure of proteins. They are limited in finding weak motif patterns that have a low level of similarity between sequences.

    Besides sequence, protein structure provides significant information for protein function.

    It is assumed that 3-dimensional (3D) structures evolve more slowly than sequences and the

    function of a protein is highly influenced by its 3D structure (Silberberg 2000). However, because the experimental determination of protein 3D structures is slow and expensive, only a limited number of proteins have known 3D coordinates. Predicting 3D structure from the sequence is one of the biggest challenges in computational biology.

    Secondary structure is a simplified characteristic of a protein’s 3D structure. All success-

    ful methods in the field of 3D fold recognition make use of secondary structure predictions,

    showing that secondary structure is a valuable way to establish structural relationship be-

    tween proteins. Three state descriptions of protein secondary structure are commonly used:

    helix (which includes all helical types), strand (which includes the beta sheet), and coil

    (which includes everything else, e.g. bend and turn). Many secondary structure prediction

    algorithms have been proposed, for instance, score-based methods (Chou and Fasman 1974;

    Garnier et al. 1978), nearest neighbor methods (Salamov and Solovyev 1995), and neural

    networks (Rost and Sander 1993; Jones 1999). Several competing methods have reached around 70-78% accuracy (the fraction of residues whose three-state assignment is correctly predicted), with the PSI-PRED server (Jones 1999; Bryson et al. 2005), a neural-network-based algorithm, as one of the most accurate tools. We use PSI-PRED in this paper. Figure 1 shows the PSI-PRED prediction for a short protein, UBIQ_HUMAN (Swiss-Prot P02248), which belongs to 1ubi, one of our example data sets in Section 4. The contiguous segments of secondary structure are given, where H, E, and C represent helix, strand, and coil, respectively.

    [Figure 1 about here.]

    A family of structurally similar proteins may have divergent amino acid compositions

    because 3D structures are not affected too much by substitutions of certain amino acids.

    The 3D structures, however, should be conserved to perform a certain function. If the 3D

    structures are conserved, it is likely that secondary structures are conserved. Geourjon et al. (2001) introduced the idea of using predicted secondary structure to identify related proteins with weak sequence similarity. They collected distantly related sequences with 10-30% sequence identity and calculated the secondary structure similarity of each pair of sequences using the SOV (Segment Overlap) measure (Zemla et al. 1999). Sequence homology was established only when the SOV was greater than a threshold. However, this approach is limited to pairwise protein sequence comparisons. Errami et al. (2003) used predicted secondary structures for multiple protein sequences. They validated existing multiple alignments by discarding unrelated sequences, where relatedness was measured by the SOV calculated for all pairs of sequences in a given multiple alignment. This approach gives only general guidelines for verifying existing multiple alignments; it does not construct multiple alignments.

    In this paper, we propose a new statistical method that models protein motifs using

    both primary and secondary structure information. Segment overlap (SOV) is generalized

    to measure the similarity of secondary structures for a group of multiple sequences. A mul-

    tiple alignment method is proposed to maximize both amino acid and secondary structure

    conservation. Section 2 defines the data structure and presents SOV measurements. Section

    3 shows the probabilistic models of motifs using the predicted secondary structures. A Gibbs

    sampling algorithm is derived for model inference. Convergence is studied by multiple sim-

    ulations and a proposed alignment score. Section 4 evaluates the models using the database

    of structural multiple alignment BAliBASE (Thompson, et al. 1999a). Section 5 concludes

    with a discussion.

    2 Data structure and SOV

    2.1 Data structure

    A given set of protein sequences can be represented as

    $$
    \text{Data } R:\quad
    \begin{array}{ll}
    \text{sequence } R_1: & r_{1,1}\; r_{1,2}\; \ldots\; r_{1,L_1} \\
    \text{sequence } R_2: & r_{2,1}\; r_{2,2}\; \ldots\; r_{2,L_2} \\
    \quad\vdots & \quad\vdots \\
    \text{sequence } R_K: & r_{K,1}\; r_{K,2}\; \ldots\; r_{K,L_K}
    \end{array}
    $$

    where the residue $r_{k,l}$ takes values from an alphabet with 20 different letters, and $L_k$ represents the length of the $k$th sequence. We seek segments of length $J$ from each sequence, which resemble each other as much as possible. The segments are called motifs. The motif width $J$ can be determined by either the user or a heuristic algorithm (Xie and Kim 2005).

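    Let $A = \{a_k,\ k = 1, \ldots, K\}$ denote the starting positions of the motif for the $K$ sequences. The alignment could be represented by a matrix, $R_{\{A\}}$:

    $$
    R_{\{A\}} =
    \begin{pmatrix}
    r_{1,a_1} & \cdots & r_{1,a_1+J-1} \\
    \vdots & \ddots & \vdots \\
    r_{K,a_K} & \cdots & r_{K,a_K+J-1}
    \end{pmatrix}
    \tag{1}
    $$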

    When the motif has conserved amino acids, the matrix (1) is well represented by a PSSM and

    the existing motif-finding algorithms would work well. When the motif sequences are not

    conserved, the motif 3D structure may still be preserved. Therefore, adding the predicted

    secondary structures would enhance the motif signal.
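    As a small illustration of the notation in this subsection, the following Python sketch extracts the ungapped block R{A} from a set of sequences and tabulates the residue counts in each column. The function names `motif_block` and `column_counts`, the 0-based start positions, and the toy sequences are assumptions made for this example; they are not taken from the authors' program.

```python
from collections import Counter

def motif_block(seqs, starts, J):
    """Return the K x J block of residues: row k is seqs[k][a_k : a_k + J].

    `starts` holds 0-based start positions (the text indexes from 1).
    """
    for s, a in zip(seqs, starts):
        if a < 0 or a + J > len(s):
            raise ValueError("motif does not fit inside the sequence")
    return [s[a:a + J] for s, a in zip(seqs, starts)]

def column_counts(block):
    """Count residues in each column of the block (the counting function h, per column)."""
    return [Counter(col) for col in zip(*block)]

# Toy data (hypothetical, not from BAliBASE).
seqs = ["MKTAYIAKQR", "GGTAYLAKAA", "TAYIVKWWWW"]
block = motif_block(seqs, starts=[2, 2, 0], J=4)
counts = column_counts(block)
```

    Each row of `block` is one motif segment, and `counts[j]` plays the role of h applied to the jth column of matrix (1).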

    2.2 Secondary structure similarity measurement SOV

    The three states of secondary structure are helix (H), strand (E), and coil (C). Secondary structure similarity can be measured by the Q3 measure, defined as the fraction of residues correctly matched in the three conformational states. However, Q3 sometimes gives misleading values. For example, predicting the entire myoglobin chain as one big helix gives a Q3 value of about 80%, which outperforms most of the existing prediction methods. A better measurement is the Segment Overlap (SOV) of Zemla et al. (1999). SOV allows for natural variation in the boundaries of segments among homologous protein structures. It is a measure based on secondary structure segments rather than individual residues.
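    The weakness of Q3 can be seen in a few lines of Python. The sketch below is only illustrative: the function name `q3` and the toy string (a made-up 20-residue structure that is 80% helix, not real myoglobin data) are assumptions.

```python
def q3(observed, predicted):
    """Fraction of positions whose three-state labels (H, E, C) agree."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)

# Toy structure: two helices of eight residues separated and flanked by coil.
observed = "C" + "H" * 8 + "CC" + "H" * 8 + "C"
# A prediction that ignores all segment boundaries and calls everything helix.
all_helix = "H" * len(observed)
score = q3(observed, all_helix)
```

    Here `score` is 0.8 even though the prediction merges two helices into one and misses every coil residue, which is the kind of behavior a segment-based measure is meant to penalize.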

    [Figure 2 about here.]

    Let $s_1$ and $s_2$ denote any two segments of secondary structure in conformational state $i$ (i.e. H, E, or C). Let $(s_1, s_2)$ denote a pair of overlapping segments. For example, $(\beta_1, \beta_2)$ in Figure 2 is a pair of overlapping segments with strand (E). Let $S(i)$ denote the set of all overlapping pairs of segments $(s_1, s_2)$ in state $i$, and let $S'(i)$ denote the set of segments $s_1$ for which there is no overlapping segment $s_2$ in state $i$, i.e.:

    $$
    S(i) = \{(s_1, s_2) : s_1 \cap s_2 \neq \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\},
    $$

    $$
    S'(i) = \{s_1 : \forall s_2,\ s_1 \cap s_2 = \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\}.
    $$

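    Define $\mathrm{SOV}_o$ for state $i$ as:

    $$
    \mathrm{SOV}_o(i) = \frac{1}{N(i)} \sum_{(s_1,s_2) \in S(i)} \frac{\mathrm{minov}(s_1, s_2) + \delta(s_1, s_2)}{\mathrm{maxov}(s_1, s_2)} \times \mathrm{len}(s_1),
    $$

    where

    $$
    N(i) = \sum_{(s_1,s_2) \in S(i)} \mathrm{len}(s_1) + \sum_{s_1 \in S'(i)} \mathrm{len}(s_1),
    $$

    $$
    \delta(s_1, s_2) = \min\{\mathrm{maxov}(s_1, s_2) - \mathrm{minov}(s_1, s_2);\ \mathrm{minov}(s_1, s_2);\ \mathrm{int}(\mathrm{len}(s_1)/2);\ \mathrm{int}(\mathrm{len}(s_2)/2)\}.
    $$

    In the formula, len is the segment length, minov is the length of the actual secondary structure overlap of $s_1$ and $s_2$, and maxov is the total extent covered by $s_1$ and $s_2$ together (see Figure 2). $\mathrm{SOV}_o$ over all secondary states is defined as:

    $$
    \mathrm{SOV}_o = \frac{1}{N} \sum_{i \in \{H,E,C\}} \sum_{(s_1,s_2) \in S(i)} \frac{\mathrm{minov}(s_1, s_2) + \delta(s_1, s_2)}{\mathrm{maxov}(s_1, s_2)} \times \mathrm{len}(s_1),
    $$

    where $N = \sum_{i \in \{H,E,C\}} N(i)$.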

    To illustrate the calculation of $\mathrm{SOV}_o(E)$, let us consider the two secondary structures in Figure 2. There are two overlapping pairs for strand (E): $(\beta_1, \beta_2)$ and $(\beta_1, \beta_3)$. For the first pair, $\mathrm{minov}(\beta_1, \beta_2) = 2$, $\mathrm{maxov}(\beta_1, \beta_2) = 8$, and $\delta(\beta_1, \beta_2) = \min\{(8-2);\ 2;\ 3;\ 2\} = 2$. The second pair can be calculated similarly. Then the value of $\mathrm{SOV}_o(E)$ is calculated as:

    $$
    \mathrm{SOV}_o(E) = \frac{1}{6+6} \times \left( \frac{2+2}{8} + \frac{2+1}{7} \right) \times 6 = 0.464
    $$

    Combining all 3 states, the overall $\mathrm{SOV}_o$ of the given structures is evaluated to be 0.629. The $\mathrm{SOV}_o$ measure ranges from 0 to 1, where 1 is a perfect match and 0 is a complete mismatch. The value 0.629 can be roughly interpreted as saying that 63% of the secondary structures are matched.

    $\mathrm{SOV}_o$ was originally defined to measure the similarity between an observed secondary structure and its predicted secondary structure. Because $S(i)$, $N(i)$, and $\mathrm{len}(s_1)$ all depend on which structure supplies $s_1$, $\mathrm{SOV}_o$ is asymmetric between the two structures $s_1$ and $s_2$. When this measure is used for two predicted structures, a symmetric measure can be defined by:

    $$
    \mathrm{SOV} = \frac{\mathrm{SOV}_o(s_1, s_2) + \mathrm{SOV}_o(s_2, s_1)}{2}.
    $$

    This definition will be used for our SOV calculations.
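    The definitions in this subsection can be sketched in Python as follows. This is a minimal illustration written from the formulas in the text, not the authors' code; the function names are hypothetical. The two toy strings in the usage lines are constructed only to reproduce the segment lengths quoted in the worked example (a six-residue strand overlapped by strands of length four and three); they are not the actual Figure 2 data.

```python
from itertools import groupby

def segments(ss, state):
    """Return (start, end) pairs, end exclusive, of maximal runs of `state` in `ss`."""
    segs, pos = [], 0
    for ch, run in groupby(ss):
        n = len(list(run))
        if ch == state:
            segs.append((pos, pos + n))
        pos += n
    return segs

def sov_o_terms(ref, other, state):
    """Return (numerator, N(i)) of SOV_o for one state, with s1 taken from `ref`."""
    num, norm = 0.0, 0
    other_segs = segments(other, state)
    for b1, e1 in segments(ref, state):
        len1 = e1 - b1
        overlapping = [(b2, e2) for b2, e2 in other_segs if b2 < e1 and b1 < e2]
        if not overlapping:
            norm += len1              # s1 belongs to S'(i)
            continue
        for b2, e2 in overlapping:    # each pair (s1, s2) in S(i)
            len2 = e2 - b2
            minov = min(e1, e2) - max(b1, b2)
            maxov = max(e1, e2) - min(b1, b2)
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            num += (minov + delta) / maxov * len1
            norm += len1
    return num, norm

def sov_o(ref, other, states="HEC"):
    """Asymmetric SOV_o over all states; returns 0.0 if there is nothing to compare."""
    num, norm = 0.0, 0
    for state in states:
        n, d = sov_o_terms(ref, other, state)
        num, norm = num + n, norm + d
    return num / norm if norm else 0.0

def sov(x, y):
    """Symmetric SOV: the average of the two asymmetric values."""
    return 0.5 * (sov_o(x, y) + sov_o(y, x))

# Toy strings mimicking the worked example for the strand state.
num_e, norm_e = sov_o_terms("CCEEEEEEC", "EEEECCEEE", "E")
sov_o_e = num_e / norm_e
```

    With these toy strings, `sov_o_e` evaluates to about 0.464, matching the hand calculation above, and `sov` gives 1.0 for two identical strings.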

    3 Methods

    3.1 Model assumptions

    The proposed model consists of two parts, a position-specific score matrix (PSSM) for the

    amino acid sequences and a SOV measurement for the secondary structures of the motifs.

    Let $X = \{X_1, \ldots, X_K\}$ denote the secondary structure strings for the set of $K$ proteins, where the secondary structure $X_i$ of protein $i$ is either known or predicted by PSI-PRED. PSI-PRED employs two feed-forward neural networks, which predict the secondary structure of a protein from the output of PSI-BLAST (Position Specific Iterated BLAST; Altschul et al. 1997). For a given protein, PSI-PRED uses all of its homologous proteins in the NCBI (National Center for Biotechnology Information) protein database. We assume that the predicted secondary structures $X$ are an extra given data set in addition to the protein set of interest $R$.

    Like many other secondary structure prediction methods, PSI-PRED utilizes sequence

    information in multiple alignments obtained by PSI-BLAST. The multiple alignment helps

    to infer secondary structure. On the other hand, our goal here is to improve multiple

    alignment by the predicted secondary structures. Our development could be considered as

    the second step of an iterative scheme that optimizes both the quality of the secondary

    structure prediction and that of the multiple alignment.

    The motif width J in our approach is chosen based on the method by Xie and Kim

    (2005). Starting from a short alignment width (e.g. 10), the method expands the motif to

    both sides according to the Kullback-Leibler information divergence. We focus our model on the detection and correct alignment of short similar regions in very long sequences of low overall similarity. The motif width in our problems is typically 10-20. Therefore, we do not allow gaps within a motif. The motifs identified by the proposed multiple alignment method are ungapped blocks, which correspond to core regions in a group of proteins. The regions outside the motifs, on the other hand, are not aligned; insertions and deletions may occur between the aligned core motifs.

    For simplicity, we focus on the model that assumes one motif occurring in each sequence.

    Once one motif alignment is obtained, there are methods available to extend to multiple

    motif alignments. For instance, we can continue searching for the next best motif by means of masking (Xie et al. 2004).

    For the amino acid frequencies at each position $j$ in the motif, we denote the frequency parameters $\theta_j = (\theta_{1,j}, \ldots, \theta_{20,j})^T$, $j = 1, \ldots, J$. Background sequences are assumed to follow another common multinomial distribution with parameter $\theta_0 = (\theta_{1,0}, \ldots, \theta_{20,0})^T$. Let $\Theta = (\theta_0, \theta_1, \ldots, \theta_J)$. We denote a counting function $h$ such that $h(R) = (m_1, \ldots, m_{20})^T$, where $m_i$ is the number of letters of the $i$th type observed in $R$. Furthermore, let $R_A(j)$ denote the $j$th column in (1), and let $R_{\{A\}^c}$ denote the amino acids outside of the motif. Let $\mathrm{SOV}(a_l, a_m)$ denote the SOV measure between the two segments of width $J$ starting at position $a_l$ in the $l$th sequence and position $a_m$ in the $m$th sequence.

    3.2 Probability model

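    Given the notation above, the complete likelihood function, with the motif locations $A$ given, is defined as

    $$
    \pi(R, A \mid \Theta, \lambda, X) \propto \theta_0^{h(R_{\{A\}^c})} \prod_{j=1}^{J} \theta_j^{h(R_A(j))} \exp\Big\{ \lambda \frac{J}{K} \sum_{l<m} \mathrm{SOV}(a_l, a_m) \Big\}
    $$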

  • alignment A. The conjugate prior distribution for Θ is defined. Specifically, the prior for Θ

    is a product Dirichlet distribution, denoted by g(Θ). The parameter in the prior distribution

    for $\theta_j$ is $\beta_j = (\beta_{1,j}, \ldots, \beta_{20,j})$, $j = 0, \ldots, J$, which is defined at the end of this section. For notational simplicity, considering vectors $a = (a_1, \ldots, a_{20})^T$ and $b = (b_1, \ldots, b_{20})^T$, we write $a + b = (a_1 + b_1, \ldots, a_{20} + b_{20})^T$, $a^b = (a_1^{b_1} \cdots a_{20}^{b_{20}})^T$, $|a| = |a_1| + \cdots + |a_{20}|$, and $\Gamma(a) = \Gamma(a_1) \cdots \Gamma(a_{20})$.

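    The posterior distribution for $A$ is derived as follows:

    $$
    \begin{aligned}
    \pi(A \mid R, X, \lambda) &\propto \pi(A, R \mid \lambda, X) = \int \pi(A, R \mid \Theta, \lambda, X)\, g(\Theta)\, d\Theta \\
    &\propto \Gamma\big(h(R_{\{A\}^c}) + \beta_0\big) \prod_{j=1}^{J} \Gamma\big(h(R_A(j)) + \beta_j\big) \times \exp\Big\{ \lambda \frac{J}{K} \sum_{l<m} \mathrm{SOV}(a_l, a_m) \Big\}
    \end{aligned}
    $$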

  • as constants:

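    $$
    \begin{aligned}
    \pi(a_k \mid A_{[-k]}, R, X, \lambda) &\propto \frac{\pi(A \mid R, X, \lambda)}{\pi(A_{[-k]} \mid R, X, \lambda)} \\
    &\propto \frac{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0 - h(R_{\{a_k\}})\big)}{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0\big)}
    \times \prod_{j=1}^{J} \frac{\Gamma\big(h(R_{A_{[-k]}}(j)) + \beta_j + h(r_{k,a_k+j-1})\big)}{\Gamma\big(h(R_{A_{[-k]}}(j)) + \beta_j\big)} \\
    &\quad \times \exp\Big\{ \lambda \frac{J}{K} \sum_{l:\, l \neq k} \mathrm{SOV}(a_l, a_k) \Big\}
    \end{aligned}
    $$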

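    By using Stirling’s formula, the (predictive) posterior distribution for $a_k$ can be simplified as:

    $$
    \pi(a_k \mid A_{[-k]}, R, X, \lambda) \propto \prod_{j=1}^{J} \left( \frac{\hat{\theta}_{j[k]}}{\hat{\theta}_{0[k]}} \right)^{h(r_{k,a_k+j-1})} \times \exp\Big\{ \lambda \frac{J}{K} \sum_{l:\, l \neq k} \mathrm{SOV}(a_l, a_k) \Big\}, \tag{3}
    $$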

    where $\hat{\theta}_{j[k]}$ and $\hat{\theta}_{0[k]}$ are the posterior means of $\theta_j$ and $\theta_0$, whose calculations are specified below. Given the current alignment defined by $A_{[-k]}$, the probability of updating $a_k$ depends on both the amino acid pattern, i.e., the odds ratio of the motif probability versus the background probability, and the similarity of the secondary structures, i.e., $\mathrm{SOV}(a_l, a_k)$, $l = 1, \ldots, K$ and $l \neq k$.

    The posterior means of $\theta_{j[k]} = (\theta_{1,j[k]}, \ldots, \theta_{20,j[k]})^T$, $j = 1, \ldots, J$, are evaluated based on the current alignment and a pseudo-count correction. Let $f_i$ be the observed relative frequency of amino acid $i$ at the given motif position in the current alignment, excluding sequence $k$. Let $p_i$ be the relative frequency of amino acid $i$ in the background, let $N = K - 1$ be the number of sequences excluding sequence $k$, and let $B$ be the weight of the pseudo-count correction. A simple pseudo-count correction approach estimates the posterior mean by $\hat{\theta}_{i,j[k]} = (N \cdot f_i + B \cdot p_i)/(N + B)$, where $B \cdot p_i$ corresponds to the Dirichlet prior parameter $\beta_{i,j}$ in our Bayesian model. Alternatively, a better approach is the Blosum pseudo-count correction method (Altschul et al. 1997). It replaces $p_i$ in the formula by a frequency that is calculated from a Blosum (Henikoff and Henikoff 1992) amino acid substitution matrix. Formally, the pseudo-count $B \cdot p_i$ is multiplied by $\sum_{j=1}^{20} f_j e^{\mu S_{ij}}$, where $S_{ij}$ is the substitution score of amino acid pair $(i, j)$ defined by a Blosum matrix (e.g. BLOSUM62), and $\mu$ is the scale parameter for the matrix. This frequency estimate uses the prior knowledge of amino acid relationships embodied in the substitution matrix $S_{ij}$. Those residues favored by the substitution matrix to align with the residues actually observed receive high pseudo-count frequencies.
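    The simple pseudo-count estimate described above can be written in a few lines of Python. This sketch covers only the simple correction, not the Blosum variant; the function name `posterior_mean`, the uniform background, and the value B = 2 are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def posterior_mean(counts, background, B=1.0):
    """Estimate theta_hat[i] = (N * f_i + B * p_i) / (N + B) for one motif column.

    counts:     dict residue -> count in this column over the N = K - 1 other
                sequences, so counts[i] equals N * f_i
    background: dict residue -> background frequency p_i (should sum to 1)
    B:          pseudo-count weight (hypothetical default)
    """
    N = sum(counts.values())
    return {a: (counts.get(a, 0) + B * p) / (N + B) for a, p in background.items()}

# Toy example: a column observed in N = 4 sequences, with a uniform background.
uniform = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
theta_hat = posterior_mean({"A": 3, "G": 1}, uniform, B=2.0)
```

    Residues never observed in the column still receive a small positive probability, which keeps the odds ratio in Formula (3) finite for every candidate location.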

    3.3 Gibbs sampling algorithm with multiple simulations

    A Gibbs sampling procedure is used to generate samples according to Formula (3). The

    sampling approach provides a good means to characterize the posterior distribution of motif

    locations A. For instance, the mode of the posterior distribution gives an optimal motif

    alignment. The Gibbs sampler starts with a random initial value of A, which is chosen uniformly from all possible locations. Then ak, k = 1, . . . , K, is updated one sequence at a time.

    The algorithm has two basic steps:

    1. Exclude sequence k and calculate the current parameters θj[k] and θ0[k] using the

    Blosum pseudo-count correction method described above. The predicted secondary

    structures of the motif segments, except sequence k, are ready to use.

    2. The likelihood ratio between the motif model and the background model is calculated as in Formula (3) for every candidate location in sequence k. The new motif location ak is then sampled with probability proportional to this weight (the likelihood ratio).
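    The two steps above can be sketched as one sweep of a Gibbs sampler in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the simple pseudo-count correction rather than the Blosum one, estimates the background from the overall composition of the other sequences, and takes the SOV function as a caller-supplied argument `sov`. All names and default values are hypothetical.

```python
import math
import random
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def gibbs_sweep(seqs, ss, starts, J, lam, sov, B=1.0, rng=random):
    """Update every start position a_k once; returns the updated list `starts`.

    seqs: amino acid strings; ss: secondary structure strings of equal lengths;
    starts: current 0-based motif starts; sov: function (str, str) -> float.
    """
    K = len(seqs)
    for k in range(K):
        others = [i for i in range(K) if i != k]
        # Background frequencies (simplification: composition of the other sequences).
        bg = Counter("".join(seqs[i] for i in others))
        total = sum(bg.values())
        p = {a: (bg[a] + 1) / (total + len(AA)) for a in AA}
        # Step 1: motif frequencies from the current alignment without sequence k.
        N = K - 1
        theta = []
        for j in range(J):
            col = Counter(seqs[i][starts[i] + j] for i in others)
            theta.append({a: (col[a] + B * p[a]) / (N + B) for a in AA})
        # Step 2: log-weight of every candidate start, following Formula (3).
        logw = []
        for a in range(len(seqs[k]) - J + 1):
            lw = sum(math.log(theta[j][seqs[k][a + j]] / p[seqs[k][a + j]])
                     for j in range(J))
            lw += lam * J / K * sum(
                sov(ss[i][starts[i]:starts[i] + J], ss[k][a:a + J]) for i in others)
            logw.append(lw)
        # Sample the new start with probability proportional to the weight.
        m = max(logw)
        weights = [math.exp(lw - m) for lw in logw]
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

    Subtracting the maximum log-weight before exponentiating avoids numerical overflow without changing the sampling probabilities. Passing a seeded `random.Random` instance as `rng` makes a chain reproducible.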

    The algorithm iterates the previous two steps for all sequences k = 1, . . . , K, in thou-

    sands of iterations. The most probable sample A, obtained in the Gibbs sampling iterations,

    corresponds to a mode (typically a local maximum) of the posterior distribution of A. Equiv-

    alently, we consider maximizing an alignment score defined as:

    $$
    \mathrm{Score} = \sum_{j=1}^{J} \sum_{i=1}^{20} c_{i,j} \log \frac{\hat{\theta}_{i,j}}{\hat{\theta}_{i,0}} + \lambda \frac{J}{K} \sum_{l<m} \mathrm{SOV}(a_l, a_m),
    $$

    where the $c_{i,j}$’s are amino acid counts from the complete alignment. The first term in the score formula is similar to the score defined by the standard Gibbs sampling approach with only amino acid frequency (Jensen et al. 2004). The second term is a new contribution by secondary structures.
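    For illustration, the alignment score can be computed with a short Python function. This sketch assumes that the SOV sum runs over all unordered pairs of sequences, and it takes the estimated frequencies and the SOV function as inputs. The name `alignment_score` and its arguments are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def alignment_score(block, ss_block, theta, theta0, lam, sov):
    """Score = sum_j sum_i c_ij * log(theta_ij / theta_i0) + lam * (J / K) * pairwise SOV.

    block:    list of K motif segments (amino acids), each of width J
    ss_block: list of K secondary structure segments of width J
    theta:    list of J dicts residue -> estimated motif frequency
    theta0:   dict residue -> estimated background frequency
    sov:      function (str, str) -> float, assumed symmetric
    """
    K, J = len(block), len(block[0])
    pssm_term = 0.0
    for j in range(J):
        column = Counter(row[j] for row in block)   # counts c_ij for this column
        for res, c in column.items():
            pssm_term += c * math.log(theta[j][res] / theta0[res])
    sov_term = sum(sov(x, y) for x, y in combinations(ss_block, 2))
    return pssm_term + lam * J / K * sov_term
```

    When the motif frequencies equal the background frequencies, the first term vanishes and the score reduces to the secondary structure contribution alone.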

    Our simulations indicate that, starting from a given random initial location A, the Gibbs sampling algorithm always converges within a thousand iterations. However, the converged

    results may vary from simulation to simulation with different initial values A. The sampling

    result of an individual Markov chain only corresponds to one of many local maxima. We

    evaluate the sampling procedure using multiple simulations.

    As an ad hoc guideline, we always run Gibbs sampling with several choices of the param-

    eter λ, for instance, λ = 0.5, 1, 1.5, 2. In addition, 50-100 Markov chain simulations from

    different random initial locations A are used for each λ value. Gelman and Rubin (1992)

    noticed the importance of running multiple Gibbs sampling chains for obtaining reliable sta-

    tistical inferences. Besides obtaining an over-dispersed distribution of the motif alignment A,

    running multiple Markov chains solves the difficult problem of setting the unknown parame-

    ter λ. Instead of setting a λ value for the given protein data, we consider the best alignment

    as the one that has a high probability under several λ values. Therefore, the alignments that

    repeat most frequently in these multiple simulations and also have high alignment scores are

    reported as the candidate alignments.
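    A minimal sketch of how results from multiple chains might be pooled is given below. The text states only that alignments that repeat most frequently and also have high scores are reported; the specific ranking rule used here (frequency first, then best score) and the function name `candidate_alignments` are assumptions made for this example.

```python
from collections import Counter

def candidate_alignments(runs, top=3):
    """Rank alignments found by independent chains, pooled over all lambda values.

    runs: iterable of (starts, score) pairs, one per chain, where `starts` is
          the sequence of motif start positions the chain converged to.
    Returns a list of (starts, frequency, best_score), most frequent first.
    """
    freq = Counter()
    best = {}
    for starts, score in runs:
        key = tuple(starts)
        freq[key] += 1
        best[key] = max(score, best.get(key, float("-inf")))
    ranked = sorted(freq, key=lambda a: (freq[a], best[a]), reverse=True)
    return [(a, freq[a], best[a]) for a in ranked[:top]]

# Toy results from three chains (hypothetical numbers).
runs = [((1, 2), 5.0), ((1, 2), 6.0), ((3, 4), 9.0)]
ranking = candidate_alignments(runs)
```

    In this toy example the alignment (1, 2) is ranked first because two of the three chains converged to it, even though the alignment (3, 4) has the higher score.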

    4 Application

    To evaluate the proposed alignment method using secondary structure predictions, we com-

    pare it with the standard Gibbs sampling (Lawrence et al. 1993; Liu et al. 1995), as

    well as the highly ranked multiple alignment programs, including ClustalW (Thompson et al. 1994), Dialign (Morgenstern et al. 1996), and PRRP (Gotoh 1996). The programs are tested on reference alignments from the BAliBASE (Thompson et al. 1999a) benchmark alignment database (http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE), which contains manually refined multiple sequence alignments. The aligned regions are defined as core blocks, whose alignments are validated to ensure functional or structural conservation. Most data sets in BAliBASE include only a few proteins (< 10). For our purposes, we select ten large data sets, each of which has more than 10 sequences. These data sets are also chosen to represent the most difficult alignment problems. Specifically, four data sets (1idy, 1r69, 1ubi, 1wit) are selected from BAliBASE Reference 3, which contains divergent protein families with average sequence identity less than 22%. Two data sets (Kinase2 and 1vln) are selected from BAliBASE Reference 4, which contains sequences with large N/C terminal extensions, and four data sets (1thm1, s51, kinase2, kinase3) are selected from BAliBASE Reference 5, which contains internal insertions.

    The names and features of the four data sets from Reference 3 are listed in Table 1.

    Notice that instead of using the short sequences provided in BAliBASE, we collect the whole protein sequences from the SWISS-PROT database (Bairoch and Apweiler 1997). The input sequences for our alignments are therefore much longer than those in BAliBASE, which should make it harder to align the structural core blocks correctly. Motif widths are determined by

    the extension procedure (Xie and Kim 2005), with 22, 19, 19, and 16 for 1idy, 1r69, 1ubi,

    and 1wit respectively.

    [Table 1 about here.]

    To illustrate the impact of using secondary structures, we plot the likelihood function of the motif location ak for the third sequence (RPC2_BPP22) in the 1r69 data set. Except for this sequence, we assume that all the other motif locations are known. Figure 3 (a) shows the log-likelihood based on only the SOV part, and Figure 3 (b) shows the log-likelihood of the full model (PSSM + SOV; black) and the log-likelihood based on only the amino acid part (PSSM; grey). The likelihood function based on only the SOV part gives high probabilities for a few motif locations, whereas the likelihood based on PSSM alone shows high probability peaks at many locations. Inferring the true motif location (position 17 for this sequence) is not easy, because the likelihood function based on either PSSM or SOV alone has no dominant mode. In contrast, combining PSSM with SOV yields a better-shaped likelihood function: the true motif location at position 17 is clearly the global mode, and it is well separated from the second mode. For this type of data, the predicted secondary structure enhances the motif pattern, so the true motif is easier to identify under the new model. As demonstrated in Table 2, the proposed alignment method with secondary structure information finds the true motif of 1r69 much more frequently (3.85 times as often) than the standard Gibbs sampling method.

    [Figure 3 about here.]

    Table 2 shows comparisons of our proposed model with the standard Gibbs sampling

    method. For each data set, the alignments obtained by both methods are compared to the

    structural alignments in BAliBASE. An alignment is considered good when a large number of the sequences in the data set are correctly aligned. The criteria for determining good alignments are listed in the second column of Table 2. Multiple Markov chain simulations are used for both the proposed method (PSSM+SOV) and the standard Gibbs sampling: the proposed method runs 200 Markov chains, 50 at each of the four values λ = 0.5, 1, 1.5, 2, and the standard Gibbs sampling runs 100 Markov chains. The entries in the table are the numbers of runs that correctly found the structural core blocks

    in BAliBASE. Our model (PSSM + SOV) shows better success rates in finding the true

    motifs. For example, the success rate for 1idy increases from 0% to 12.5%. The rate for 1r69

    increases from 10% to 38.5%.

    [Table 2 about here.]

    Further comparisons of the proposed method (PSSM+SOV) with ClustalW, Dialign, and

    PRRP are displayed in Table 3. The reported alignments from PSSM+SOV are the most frequent alignments in 200 Markov chain simulations as described previously. Alignments are measured by the number of correctly aligned sequences out of the total number of sequences in each data set. For the data set 1ubi, PSSM+SOV performs much better than the other three programs. For data sets 1idy and 1r69, PSSM+SOV performs as well as Dialign and better than the other two programs. For the rest of the data sets, all programs work well. In summary, PSSM+SOV is the best choice among these programs. The percentages of correctly aligned sequences for each program on each data set are plotted in Figure 4. The line for PSSM+SOV (dark blue) has high alignment values in all data sets.

    [Table 3 about here.]

    [Figure 4 about here.]

    The comparisons indicate that the proposed method using secondary structure predic-

    tions works at least as well as the best alignment programs using amino acid sequence

    information alone, and even better in some situations. Studying the structural alignments

    of these data sets in BAliBASE, we found that most of the alignments had conserved amino

    acids at several positions, except the alignments of 1idy and 1ubi. Our proposed method out-

    performs other alignment programs in these two data sets, because the secondary structures

    greatly enhance the motif signals in addition to amino acid conservation. As an example,

    the structural alignment of 1idy from BAliBASE is shown in Figure 5. The underlined

    segments share common core structures and therefore are referred to as the true motif seg-

    ments. Table 4 shows the alignment for 1idy by our approach using both PSSM and SOV.

    This alignment corresponds to the first and second core structural regions in Figure 5. The

    aligned amino acid segments show that there is no strongly conserved amino acid pattern,

    except column 17. In contrast, the predicted secondary structures are conserved. The

    secondary structure of the motif can be considered as a helix-turn-helix (helix-coil-helix)

    structure.

    [Table 4 about here.]

    [Figure 5 about here.]

    5 Discussion

    Existing methods for identifying protein motifs consider only amino acid features of the motifs. The proposed model is the first attempt to utilize predicted secondary structures in a probabilistic model of motifs. It is not surprising that information brought by the predicted secondary structures improves multiple alignments. The similarity measurements of secondary structures, the SOV values, are defined for whole motif segments. The dependence between adjacent amino acids is therefore partially modeled in our approach, whereas all existing models assume that the positions in a motif are independent.

    Probability models and Bayesian methods have shown great advantages in dealing with high-dimensional, complicated sequence features. Our scoring function is expressed as a probability that is exponentially proportional to a similarity measure of secondary structures. Instead of directly maximizing a score function, we employ the Gibbs sampling method to draw samples from the posterior distribution, whose modes correspond to high-scoring alignments. Failure to converge to the global maximum is a major concern in multiple sequence alignment. We address this problem by simulating multiple Markov chains from different random initial values and under different values of the parameter λ. The most probable alignment across the multiple simulations is likely to be the true alignment.
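    The two ideas in this paragraph, a probability exponentially proportional to a similarity score and the selection of the most probable alignment across several chains, can be illustrated with a minimal sketch. It is not the model of Formulas (2) and (3): the function names, the toy similarity values, and the toy chain results are all hypothetical, and the similarity score is simply assumed to lie between 0 and 1.

```python
import math
import random

def position_probabilities(similarities, lam):
    """Probabilities proportional to exp(lam * s) for each similarity s."""
    top = max(similarities)
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    weights = [math.exp(lam * (s - top)) for s in similarities]
    total = sum(weights)
    return [w / total for w in weights]

def sample_position(similarities, lam, rng):
    """Draw one candidate index according to the exponential weighting."""
    probs = position_probabilities(similarities, lam)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def most_probable(chain_results):
    """Return the (log_probability, alignment) pair with the highest log-probability."""
    return max(chain_results, key=lambda result: result[0])

# Hypothetical similarity of four candidate segments to the current motif.
toy_similarities = [0.10, 0.35, 0.90, 0.40]
probs_small_lam = position_probabilities(toy_similarities, lam=1.0)
probs_large_lam = position_probabilities(toy_similarities, lam=10.0)

# One draw with a fixed seed, as a single Gibbs-style update would make.
drawn = sample_position(toy_similarities, lam=10.0, rng=random.Random(0))

# Hypothetical end states of three chains: (log-probability, motif start positions).
toy_chains = [(-120.5, [3, 7, 2]), (-98.2, [4, 7, 2]), (-110.0, [3, 6, 2])]
best_log_prob, best_alignment = most_probable(toy_chains)
```

    A larger λ concentrates the probability on the most similar segment, which makes a chain greedier; running chains under several λ values and keeping the highest-probability result is one way to balance exploration against sharpness.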

    The proposed model can be improved by including the reliability indices of secondary structure predictions. PSI-PRED (Jones 1999) assigns a confidence score from 0 to 9 to each predicted secondary state (H, E or C). A score of 9 indicates the most reliable prediction, whereas a score of 0 indicates the least reliable prediction. It is known that the reliability indices correlate very well with prediction accuracy. A weighted SOV measurement may be developed such that the similarity between two segments of secondary structure in a conformational state (i.e., H, E or C) is weighted by the sum of the confidence indices of the segments. The weighted SOV can then be substituted into Formulas (2) and (3) for a better model of secondary structures.
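    One possible reading of the weighting proposed above is sketched below. It is an assumption, not a worked-out definition: each pair of overlapping segments contributes its overlap ratio (overlap length divided by total extent), weighted by the sum of the confidence digits over both segments, and the other terms of the full SOV definition are omitted. The function names and the confidence strings are hypothetical.

```python
def pair_weight(conf_1, seg_1, conf_2, seg_2):
    """Sum of the confidence digits over both segments (inclusive bounds)."""
    (a1, b1), (a2, b2) = seg_1, seg_2
    digits = conf_1[a1:b1 + 1] + conf_2[a2:b2 + 1]
    return sum(int(d) for d in digits)

def weighted_overlap(conf_1, segs_1, conf_2, segs_2):
    """Confidence-weighted mean of the overlap ratio over overlapping segment pairs."""
    numerator = 0.0
    denominator = 0.0
    for seg_1 in segs_1:
        for seg_2 in segs_2:
            overlap = min(seg_1[1], seg_2[1]) - max(seg_1[0], seg_2[0]) + 1
            if overlap <= 0:
                continue  # the two segments share no position
            extent = max(seg_1[1], seg_2[1]) - min(seg_1[0], seg_2[0]) + 1
            weight = pair_weight(conf_1, seg_1, conf_2, seg_2)
            numerator += weight * overlap / extent
            denominator += weight
    return numerator / denominator if denominator else 0.0

# Strand segments of two 14-residue structures (0-based, inclusive), with
# made-up confidence strings; the second structure's first strand is unreliable.
segs_a = [(4, 9)]
segs_b = [(2, 5), (8, 10)]
conf_a = "99998888889999"
conf_b = "99111199888999"
score = weighted_overlap(conf_a, segs_a, conf_b, segs_b)
```

    With these toy inputs the low-confidence strand receives a smaller weight, so the pair involving the more reliable strand dominates the score.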

    References

    Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman,

    D. J. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search

    programs,” Nucleic Acids Research, 25, 3389-3402.

    Bairoch, A., and Apweiler, R. (1997), “The SWISS-PROT protein sequence database: its

    relevance to human molecular medical research,” Journal of Molecular Medicine, 75, 312-316.

    Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S., and Jones, D. T.

    (2005), “Protein structure prediction servers at University College London,” Nucleic Acids

    Research, 33 (Web Server issue), W36-38.

    Chou, P. Y., and Fasman, G. D. (1974), “Prediction of protein conformation,” Biochemistry, 13, 211-215.

    Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from

    incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Ser. B, 39,

    1-38.

    Eddy, S. R. (1998), “Profile hidden Markov models,” Bioinformatics, 14, 755-763.

    Errami, M., Geourjon, C., and Deléage, G. (2003), “Detection of unrelated proteins in

    sequences multiple alignments by using predicted secondary structures,” Bioinformatics, 19,

    506-512.

    Garnier, J., Osguthorpe, D. J., and Robson, B. (1978), “Analysis of the accuracy and

    implications of simple methods for predicting the secondary structure of globular proteins,”

    Journal of Molecular Biology, 120, 97-120.

  • Gelman, A., and Rubin, D. B. (1992), “Inference from iterative simulation using multiple

    sequences,” Statistical Science, 7, 457-472.

    Geourjon, C., Combet, C., Blanchet, C., and Deléage, G. (2001), “Identification of Re-

    lated Proteins with Weak Sequence Identity Using Secondary Structure Information,” Pro-

    tein Science, 10, 788-797.

    Gotoh, O. (1996), “Significant improvement in accuracy of multiple protein sequence

    alignments by iterative refinement as assessed by reference to structural alignments,” Journal of

    Molecular Biology, 264, 823-838.

    Henikoff, S., and Henikoff, J. G. (1992), “Amino Acid Substitution Matrices from Protein

    Blocks,” Proceedings of the National Academy of Sciences, 89, 10915-10919.

    Jensen, S. T., Liu, X. S., Zhou, Q., and Liu, J. S. (2004), “Computational Discovery of

    Gene Regulatory Binding Motifs: A Bayesian Perspective,” Statistical Science, 19, 188-204.

    Jones, D. T. (1999), “Protein secondary structure prediction based on position-specific

    scoring matrices,” Journal of Molecular Biology, 292, 195-202.

    Kabsch, W., and Sander, C. (1983), “Dictionary of protein secondary structure: pattern

    recognition of hydrogen-bonded and geometrical features,” Biopolymers, 22, 2577-2637.

    Lawrence, C. E., and Reilly, A. A. (1990), “An Expectation-Maximization (EM) Algo-

    rithm for the Identification and Characterization of Common Sites in Biopolymer Sequences,”

    Proteins, 7, 41-51.

    Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,

    J. C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple

    Alignment,” Science, 262, 208-214.

    Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995), “Bayesian Models for Multi-

    ple Local Sequence Alignment and Gibbs Sampling Strategies,” Journal of the American

    Statistical Association, 90, 1156-1170.

    Morgenstern, B., Dress, A., and Werner, T. (1996), “Multiple DNA and protein sequence alignment based on segment-to-segment comparison,” Proceedings of the National Academy of Sciences, 93, 12098-12103.

    Rost, B., and Sander, C. (1993), “Prediction of Protein Secondary Structure at Better

    than 70% Accuracy,” Journal of Molecular Biology, 232, 584-599.

    Salamov, A. A., and Solovyev, V. V. (1995), “Prediction of protein secondary structure

    by combining nearest-neighbour algorithms and multiple sequence alignments,” Journal of

    Molecular Biology, 247, 11-15.

    Silberberg, M. S. (2000),

    Chemistry: The molecular nature of matter and change (2nd ed.), Boston, MA: McGraw-

    Hill.

    Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994), “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, 22, 4673-4680.

    Thompson, J. D., Plewniak, F., and Poch, O. (1999a), “BAliBASE: a benchmark align-

    ment database for the evaluation of multiple alignment programs,” Bioinformatics, 15, 87-88.

    Thompson, J. D., Plewniak, F., and Poch, O. (1999b), “A comprehensive comparison of

    multiple sequence alignment programs,” Nucleic Acids Research, 27, 2682-2690.

    Xie, J., Li, K.-C., and Bina, M. (2004), “A Bayesian Insertion/Deletion Algorithm for

    Distant Protein Motif Searching via Entropy Filtering,” Journal of the American Statistical

    Association, 99, 409-420.

    Xie, J., and Kim, N.-K. (2005), “Bayesian Models and Markov Chain Monte Carlo Methods for Protein Motifs with the Secondary Characteristics,” Journal of Computational Biology, 12, 952-970.

    Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. (1999), “A Modified Definition of

    Sov, a Segment-Based Measure for Protein Secondary Structure Prediction Assessment,”

    Proteins, 34, 220-223.

  • Dataset  Family name  No. of sequences  Length∗   Motif width  Ave. identity (%)
    1idy     DNA binding  25                101-636   22           19
    1r69     Repressor    23                71-882    19           18
    1ubi     Ubiquitin    22                70-1132   19           20
    1wit     Twitchin     19                93-250    16           22

    Table 1: Four data sets from BAliBASE Reference 3. ∗The sequence lengths are longer than those in BAliBASE because the whole sequences were collected from the SWISS-PROT database.

  • Dataset  Correct alignments                    Rate of the correct alignments
             (correctly aligned / total # of seq)  PSSM+SOV        Standard Gibbs (PSSM alone)
    1idy     16/25 and more                        25/200 (12.5%)  0/100 (0%)
    1r69     21/23 and more                        77/200 (38.5%)  10/100 (10%)
    1ubi     20/22 and more                        18/200 (9%)     1/100 (1%)
    1wit     15/19 and more                        37/200 (18.5%)  4/100 (4%)

    Table 2: Comparison of the multiple Markov chain simulation results for the proposed method (PSSM+SOV) and the standard Gibbs sampling method. An alignment is counted as correct when the number of correctly aligned sequences is equal to or larger than the cutoff in the second column. The rates of correct alignments are obtained from multiple Markov chain simulations: 200 Markov chains for the proposed method (PSSM+SOV) and 100 Markov chains for the standard Gibbs sampling (PSSM alone). The numbers of Markov chains that find a correct alignment are reported in the third and fourth columns.

  • Dataset         PSSM+SOV  ClustalW  Dialign  PRRP
    1idy            16/25     10/25     16/25    8/25
    1r69            23/23     9/23      23/23    5/23
    1ubi            20/22     5/22      4/22     1/22
    1wit            16/19     15/19     16/19    18/19
    1thm1           11/11     10/11     11/11    11/11
    kinase2         16/17     15/17     15/17    15/17
    kinase2_insert  11/12     12/12     12/12    11/12
    kinase3_insert  19/19     18/19     18/19    18/19
    s51             15/15     15/15     15/15    15/15
    1vln            13/14     14/14     14/14    14/14

    Table 3: Comparison of the rate of correctly aligned sequences for the proposed method (PSSM+SOV) with three highly ranked programs, ClustalW, Dialign, and PRRP. The numbers are the correctly aligned sequences out of the total number of sequences in each data set. Our proposed method performs better than or about as well as the other programs on all the data sets.
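    The fractions plotted in Figure 4 can be recomputed from the counts in Table 3. The short sketch below does this and labels, for each data set, how PSSM+SOV compares with the best of the other three programs; the variable and function names are illustrative only, and the data set keys follow the axis labels of Figure 4.

```python
# (correctly aligned, total) per program, transcribed from Table 3.
PROGRAMS = ("PSSM+SOV", "ClustalW", "Dialign", "PRRP")
COUNTS = {
    "1idy": (25, (16, 10, 16, 8)),
    "1r69": (23, (23, 9, 23, 5)),
    "1ubi": (22, (20, 5, 4, 1)),
    "1wit": (19, (16, 15, 16, 18)),
    "1thm1": (11, (11, 10, 11, 11)),
    "kinase2": (17, (16, 15, 15, 15)),
    "kinase2_insert": (12, (11, 12, 12, 11)),
    "kinase3_insert": (19, (19, 18, 18, 18)),
    "s51": (15, (15, 15, 15, 15)),
    "1vln": (14, (13, 14, 14, 14)),
}

def fractions(counts):
    """Fraction of correctly aligned sequences per data set and program."""
    return {
        name: {prog: c / total for prog, c in zip(PROGRAMS, correct)}
        for name, (total, correct) in counts.items()
    }

def compare_with_best_other(counts):
    """Label PSSM+SOV as better than, equal to, or lower than the best other program."""
    labels = {}
    for name, (_, correct) in counts.items():
        own, best_other = correct[0], max(correct[1:])
        if own > best_other:
            labels[name] = "better"
        elif own == best_other:
            labels[name] = "equal"
        else:
            labels[name] = "lower"
    return labels

table3_fractions = fractions(COUNTS)
table3_labels = compare_with_best_other(COUNTS)
for name, label in table3_labels.items():
    print(f"{name:15s} {table3_fractions[name]['PSSM+SOV']:.2f} {label}")
```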

  • Sequence name         Aligned AA Segment      Secondary Structures
    sp|P06876|MYB_MOUSE   RIIYQAHKRLGNRWAEIAKLLP  HHHHHHHHHHCCHHHHHHHHHC
    sp|P27898|MYBP_MAIZE  DIIIKLHATLGNRWSLIASHLP  HHHHHHHHHCCCCHHHHHHHHC
    sp|P20025|MYB3_MAIZE  DLIVKLHSLLGNKWSLIAARLP  HHHHHHHHHCCCHHHHHHHHHC
    sp|P27900|GL1_ARATH   DLIIRLHKLLGNRWSLIAKRVP  HHHHHHHHHHCCHHHHHHHHCC
    sp|P20027|MYB3_HORVU  DHIVALHQILGNRWSQIASHLP  HHHHHHHHHCCCHHHHHHHHHC
    sp|P80073|MYB2_PHYPA  NLILDLHATLGNRWSRIAAQLP  HHHHHHHHHCCCHHHHHHHHHC
    sp|P02259|H5_CHICK    AAIRAEKSRGGSSRQSIQKYIK  HHHHHHHHCCCCCHHHHHHHHH
    sp|P15870|H1D_STRPU   SALESLKEKKGSSRQAILKYVK  HHHHHHHHCCCCCHHHHHHHHH
    sp|P15869|H1B_STRPU   AAITALKERGGSSAQAIRKYIE  HHHHHHHHCCCCCHHHHHHHHH
    sp|P35060|H1_TIGCA    AAIKALKERNGSSLPAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
    sp|Q05831|H1L_MYTTR   AAITAMKNRKGSSVQAIRKYIL  HHHHHHHHCCCCCHHHHHHHHH
    sp|P02257|H1_ECHCR    AAIAAQKERRGSSVAKIQSYIA  HHHHHHHHCCCCCHHHHHHHHH
    sp|P10771|H11_CAEEL   EAIKQLKDRKGASKQAILKFIS  HHHHHHHHCCCCCHHHHHHHHH
    sp|P06894|H1A_PLADU   TAILGLKERKGSSMVAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
    sp|P26568|H11_ARATH   DAIVTLKERTGSSQYAIQKFIE  HHHHHHHHCCCCCHHHHHHHHH
    sp|P54671|H1_DICDI    TAIAHYKDRTGSSQPAIIKYIE  HHHHHHHHCCCCCHHHHHHHHH
    sp|P15282|ARGR_ECOLI  AFKALLKEEKFSSQGEIVAALQ  HHHHHHHHHCHHHHHHHHHHHH
    sp|P95721|ARGR_STRCL  RIVDILNRQPVRSQSQLAKLLA  HHHHHHHHHCCCCHHHHHHHHH
    sp|P17893|ARGR_BACSU  KIREIITSNEIETQDELVDMLK  HHHHHHHHHCHHHHHHHHHHHH
    sp|O31408|ARGR_BACST  KIREIIMSNDIETQDELVDRLR  HHHHHHHHHCHHHHHHHHHHHH
    sp|Q54870|ARGR_STRPN  LIKKMITEEKLSTQKEIQDRLE  HHHHHHHHHCHHHHHHHHHHHH
    sp|P94992|ARGR_MYCTU  RIVAILSSAQVRSQNELAALLA  HHHHHHHHHCCCCHHHHHHHHH
    sp|P03032|TRPR_ECOLI  VRIVEELLRGEMSQRELKNELG  HHHHHHHHHCCCCHHHHHHHCC
    sp|P44889|TRPR_HAEIN  LQIVSQLIDKNMPQREIQQNLN  HHHHHHHHHCCCCHHHHHHHHC
    sp|P34257|TC3A_CAEEL  VSLHEMSRKISRSRHCIREYLK  CCHHHHHHHHCCCCHHHHHHHH

    Table 4: Alignments of the data set 1idy by the proposed method. This alignment corresponds to the first and second core structural regions shown in Figure 5. While there is a clear conservation in the secondary structures for this motif, the aligned amino acid segments show no strongly conserved column except column 17. The secondary structure of the motif can be considered a helix-turn-helix structure.

  • Conf: 968887179808999855874388999988874088842028812883616897600047
    Pred: CEEEEEECCCEEEEEEECCCCHHHHHHHHHHHHHCCCCCCCEEECCCEEECCCCEEHHCC
      AA: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
                  10        20        30        40        50        60

    Conf: 8999889999763189
    Pred: CCCCCEEEEEEEECCC
      AA: IQKESTLHLVLRLRGG
                  70

    Figure 1: An example of secondary structure prediction by PSI-PRED. The protein is UBIQ_HUMAN (Swiss-Prot ID P02248), which is a sequence in the data set 1ubi. The lines show, in order, the confidence level of the secondary structure prediction, the string of predicted secondary structure, and the original amino acid sequence.
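    As a reading aid for the three lines of Figure 1, the sketch below takes the strings transcribed from the figure and summarizes, for each predicted state, how many residues it covers and their mean confidence. The function name and the choice of summary are illustrative and not part of PSI-PRED.

```python
from collections import defaultdict

# The two blocks of Figure 1, joined into one string per line type.
CONF = ("968887179808999855874388999988874088842028812883616897600047"
        "8999889999763189")
PRED = ("CEEEEEECCCEEEEEEECCCCHHHHHHHHHHHHHCCCCCCCEEECCCEEECCCCEEHHCC"
        "CCCCCEEEEEEEECCC")
AA = ("MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN"
      "IQKESTLHLVLRLRGG")

def confidence_by_state(pred, conf):
    """Map each predicted state to (residue count, mean confidence digit)."""
    if len(pred) != len(conf):
        raise ValueError("prediction and confidence strings differ in length")
    totals = defaultdict(lambda: [0, 0])
    for state, digit in zip(pred, conf):
        totals[state][0] += 1
        totals[state][1] += int(digit)
    return {state: (n, total / n) for state, (n, total) in totals.items()}

summary = confidence_by_state(PRED, CONF)
for state in sorted(summary):
    count, mean_conf = summary[state]
    print(f"{state}: {count} residues, mean confidence {mean_conf:.2f}")
```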

  •                  β1
    Structure 1  CCCCEEEEEECCCC
                   β2    β3
    Structure 2  CCEEEECCEEECCC
                     --  --
                   ++++++++
                     +++++++

    Figure 2: Illustration of minov and maxov in the SOVo(E) calculation. (- -) indicates the minov of (β1, β2) and (β1, β3). The first line of (++) indicates the maxov of (β1, β2) and the second line of (++) indicates the maxov of (β1, β3).
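    The quantities illustrated in Figure 2 can be checked with a few lines of code. The sketch below takes minov to be the length of the region where two segments overlap and maxov to be the total extent covered by either segment, as the figure depicts; the function names are illustrative.

```python
import re

def state_segments(structure, state):
    """Inclusive (start, end) positions of each run of `state` in the string."""
    pattern = re.escape(state) + "+"
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, structure)]

def minov(seg_a, seg_b):
    """Length of the region shared by both segments (0 if they are disjoint)."""
    return max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]) + 1)

def maxov(seg_a, seg_b):
    """Length of the total extent covered by either segment."""
    return max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0]) + 1

structure_1 = "CCCCEEEEEECCCC"
structure_2 = "CCEEEECCEEECCC"
(beta1,) = state_segments(structure_1, "E")
beta2, beta3 = state_segments(structure_2, "E")

for name, other in (("beta2", beta2), ("beta3", beta3)):
    print(name, "minov =", minov(beta1, other), "maxov =", maxov(beta1, other))
# prints: beta2 minov = 2 maxov = 8
#         beta3 minov = 2 maxov = 7
```

    The printed lengths match the two-character (- -) marks and the eight- and seven-character (++) lines in the figure.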

  • [Figure 3 plot. Panels (a) and (b); x-axis: position (0 to 200); y-axis: relative values of log-likelihood (panel (a): 2 to 14; panel (b): -20 to 30).]

    Figure 3: Log-likelihood plot for a sequence, RPC2_BPP22, in the data set 1r69 from BAliBASE. (a) The log-likelihood calculated by the SOV part; (b) the log-likelihood calculated by the proposed model (PSSM + SOV; black) and the log-likelihood calculated by the amino acids part (PSSM; grey).

  • [Figure 4 plot. Panels (a) and (b); y-axis: percent of correctly aligned sequences (0 to 1.2). Panel (a): data sets 1idy, 1r69, 1ubi, 1wit; series PSSM+SOV, Gibbs sampling (PSSM only), ClustalW, Dialign, PRRP. Panel (b): data sets 1idy, 1r69, 1ubi, 1wit, 1thm1, kinase2, kinase2_insert, kinase3_insert, s51, 1vln; series PSSM+SOV, ClustalW, Dialign, PRRP.]

    Figure 4: Comparison of the proposed method with the standard Gibbs sampling method, ClustalW, Dialign, and PRRP. The plots show the percentage of correctly aligned sequences out of the total number of sequences for each program on each data set. The proposed method (PSSM+SOV), shown by the dark blue line, performs the best overall among the programs.

  • 1idy         1  mevkktswteeedrILYQAhkrlgnRWAEIAKLLp.........grtdna
    mybp_maize   1  advkrgniskeeedIIIKLhatlgnRWSLIASHLp.........grtdne
    myb3_maize   1  .dlkrgnftadeddLIVKLhsllgnKWSLIAARLp.........grtdne
    gl1_arath    1  .nvnkgnfteqeedLIIRLhkllgnRWSLIAKRVp.........grtdnq
    myb3_horvu   1  .dlkrgcfsqqeedHIVALhqilgnRWSQIASHLp.........grtdne
    myb2_phypa   1  .dlkrgifseaeenLILDLhatlgnRWSRIAAQLp.........grtdne
    1hstA        1  ...shptysemiaaAIRAEksrggsSRQSIQKYIkshykvgh...nadlq
    h1d_strpu    1  ...shpkysdmiasALESLkekkgsSRQAILKYVkanftvgd...nanvh
    h1b_strpu    1  ...ahpsssemvlaAITALkerggsSAQAIRKYIeknytvdi..kkqaif
    h1_tigca     1  ...thpptsvmvmaAIKALkerngsSLPAIKKYIaanykvdv..vknahf
    h1l_myttr    1  ....kpstlsmivaAITAMknrkgsSVQAIRKYIlannkgin.tshlgsa
    h1_echcr     1  ...ahppvidmitaAIAAQkerrgsSVAKIQSYIaakyrcdi..nalnph
    h11_caeel    1  ...ahppyintikeAIKQLkdrkgaSKQAILKFIsqnyklgdnviqinah
    h1a_pladu    1  ...ahppvatmvvtAILGLkerkgsSMVAIKKYIaanyrvdv..arlapf
    h11_arath    1  ...shptyeemikdAIVTLkertgsSQYAIQKFIeekrkelp..ptfrkl
    h1_dicdi     1  ...nhptyqvmistAIAHYkdrtgsSQPAIIKYIeanynvap..dtfktq
    1aoy         1  .mrssakqeelvkaFKALLkeekfsSQGEIVAALqeq.gfd...ninqsk
    ARGR_STRCL   1  ........marhrrIVDILnrqpvrSQSQLAKLLadn.gls....vtqat
    G3273713     1  enlnpvtrtarqalILQILdkqkvtSQVQLSELLlde.gid....itqat
    AHRC_BACSU   1  .....mnkgqrhikIREIItsneieTQDELVDMLkqd.gyk....vtqat
    ARGR_BACST   1  .....mnkgqrhikIREIImsndieTQDELVDRLrea.gfn....vtqat
    ARGR_STRPN   1  .....mrkrdrhqlIKKMIteeklsTQKEIQDRLeah.nvc....vtqtt
    ARGR_MYCTU   1  gpevaanragrqarIVAILssaqvrSQNELAALLaae.gie....vtqat
    1jhgA        1  .tpderealgtrvrIIEELlrge.mSQRELKNELg..........agiat
    TRPR_HAEIN   1  .taderdavglrlqIVSQLidkn.mPQREIQQNLn..........tsaat
    G3328572     1  .sfserkdvasryhIIRALlege.lTQREIAEKYg..........vsiaq
    1tc3C        1  ....rgsalsdterAQLDVmkllnvSLHEMSRKIs..........rsrhc

    1idy        42  IKNHWNSTmrrkv.
    mybp_maize  42  IKNYWNSHlsrq..
    myb3_maize  41  IKNYWNTHvrrk..
    gl1_arath   41  VKNYWNTHlskk..
    myb3_horvu  41  IKNFWNSCikkk..
    myb2_phypa  41  IKNYWNTRlkkr..
    1hstA       45  IKLSIRRLlaagv.
    h1d_strpu   45  IKQALKRGvtsgq.
    h1b_strpu   46  IKRALITGvekgt.
    h1_tigca    46  IKKALKSLvekkk.
    h1l_myttr   46  MKLAFAKGlksgv.
    h1_echcr    46  IRRALKNQvksga.
    h11_caeel   48  HRQALKRGvtska.
    h1a_pladu   46  IRKFIRKAvkqtkg
    h11_arath   46  LLLNLKRLvasgk.
    h1_dicdi    46  LKLALKRLvakgt.
    1aoy        46  VSRMLTKFgavrt.
    ARGR_STRCL  38  LSRDLDELgavki.
    G3273713    46  LSRDLDELgarkv.
    AHRC_BACSU  41  VSRDIKELhlvkv.
    ARGR_BACST  41  VSRDIKEMqlvkv.
    ARGR_STRPN  41  LSRDLREIgltkv.
    ARGR_MYCTU  46  LSRDLEELgavkl.
    1jhgA       39  ITRGSNSLkaapv.
    TRPR_HAEIN  39  ITRGSNMIktmdp.
    G3328572    39  ITRGSNALkgldp.
    1tc3C       37  IRVYLKDPvsygt.

    Figure 5: The structural alignment of the data set 1idy reported in BAliBASE. The underlined segments are core structural regions.
