predicting protein structure using only sequence information

Predicting Protein Structure UsingOnly Sequence InformationKevin Karplus,* Christian Barrett, Melissa Cline, Mark Diekhans, Leslie Grate, and Richard HugheyJohn Baskin School of Engineering, University of California, Santa Cruz, Santa Cruz, California

ABSTRACT This paper presents results of blindpredictions submitted to the CASP3 protein struc-ture prediction experiment. We made predictionsusing the SAM-T98 method, an iterative hiddenMarkov model–based method for constructing pro-tein family profiles. The method is purely sequence-based, using no structural information, and yet wasable to predict structures as well as all but five of thestructure-based methods in CASP3. Proteins Suppl1999;3:121–125. r 1999 Wiley-Liss, Inc.

Key words: protein structure prediction; remote ho-molog search; hidden Markov model;SAM-T98

INTRODUCTION

One method of protein sequence analysis is the identifi-cation of homologous proteins—proteins that share acommon evolutionary history and have similar overallstructure and function. 5 Here we report on the use ofSAM-T98, 11,16 a newly developed hidden Markov model(HMM) 9,13 method for recognition of homologs with lowsequence similarity, and how it fared in the fold-recogni-tion section of the CASP3 experiment.

HMMs combine the best aspects of weight matrices andlocal sequence alignment methods and can be used toassign probabilities to proteins in database search. 6 OurHMM fold-recognition method differs from protein thread-ing methods10,14,15,19 in that pairwise interactions are notmodeled or used. Instead, we employ Bayesian meth-ods2,3,17 to incorporate prior information in the form ofDirichlet mixture densities20 over position-specific aminoacid distributions. The components of the mixture reflectdifferent patterns of sequence conservation and can becombined with data from aligned homologs to form data-dependent estimates of amino-acid probabilities.

In the CASP3 experiments, we used the recently devel-oped SAM-T98 remote homology detection method tocompare the CASP3 targets to a database of proteinswhose structures are known (see Methods). We discusshow successful this method was in finding similar struc-tures for the targets in Results and Discussion and discussthe lessons learned in the Conclusion.

A prediction server using the SAM-T98 method dis-cussed here is available on the World Wide Web (http://www.cse.ucsc.edu/research/compbio/HMM-apps), as isdocumentation and licensing information for the SequenceAlignment and Modeling (SAM) hidden Markov modelsoftware suite.9

METHODS

Since the SAM-T98 method is fully described else-where,11 it will be described only briefly here. The methodis purely sequence-based and does not employ any struc-tural information. The method iterates through the follow-ing steps several times (four for template library, six orseven for the target models), using the initial sequence forinput in the first iteration.

1. Build an HMM from a sequence or multiple alignment,using sequence weighting and Dirichlet mixtures. Thetotal sequence weight is chosen to get an appropriatelevel of generality in the resulting model.

2. Score a nonredundant sequence database with theHMM and retain as a training set sequences that scorebetter than some threshold value. Scoring is based onlog odds, in which the likelihood of a sequence havingbeen generated from the HMM is compared with thelikelihood of the sequence having been generated fromsome null model. For the null model we used thereverse of the HMM—the score generated using thisnull model is the same score one would get from scoringthe reversed sequence with the original HMM. Thisnovel null model cancels out artificially strong scoresdue to length and composition biases and more subtlesources such as conserved rare residues and long helices.

3. Re-estimate the HMM with these sequences, usingsequence weighting and Dirichlet mixture priors.20

4. Realign the training set using the retrained HMM. Thismultiple alignment is used as an input to the first stepin the next iteration.

During each iteration, the score threshold in step 2 ismade less stringent in order to capture less similar se-quences that are still, we hope, homologs. The final multiplealignment, called the SAM-T98 alignment, is used to con-struct the HMM used for database search and alignment.

For CASP3, we first built a SAM-T98 HMM for everysequence in a representative set of structure templatesfrom the protein data bank (PDB)4 and for every targetsequence. To find possible templates for a target sequence,we scored all of PDB with the target HMM, scored the

Grant sponsor: National Science Foundation; Grant numbers: DBI-9408579 and DBI-9808007; Grant sponsor: Department of Energy;Grant number: DE-FG0395-ER62112.

*Correspondence to: Kevin Karplus, Computer Engineering, Univer-sity of California, Santa Cruz, Santa Cruz, CA 95064. E-mail: [email protected]

Received 2 February 1999; Accepted 24 May 1999

PROTEINS: Structure, Function, and Genetics Suppl 3:121–125 (1999)

r 1999 WILEY-LISS, INC.

target sequence with every template HMM, and summedthe two scores. The structures corresponding to the bestsummed scores were then investigated manually. For mosttargets, we submitted only one structure as our prediction—usually the best-scoring one. If we had a high-scoring PDBsequence that was not in our template library, we some-times augmented the template library with an HMM builtfrom this PDB sequence, in order to be able comparesummed scores. We ended up with about 2,100 HMMs inour template library.

RESULTS AND DISCUSSION

Since we predicted on all of the targets for CASP3, wehave divided them into three categories to simplify theirevaluation. These categories are based on the difficulty of

finding the correct structure. Targets that had very similarsequences of known structure have been placed in the easytargets category, whereas those that had only more dis-tantly related known structures are members of the moder-ately difficult targets category. Targets that had little or nosimilarity to known structures are in the very difficulttargets category. Table I shows the results for the first twocategories. Except for T0085, the multiheme cytochrome, asubmitted cost less than 29 was a successful prediction,although scores as strong as 227.37 would have beenincorrect had we not filtered out those predictions by hand.

Easy Targets

For the fifteen ‘‘easy’’ targets—T0047, T0048, T0049,T0055, T0057, T0058, T0060, T0062, T0064, T0068, T0069,

TABLE I. Targets Ranked by Score of Top Hit as Found by the SAM-T98 Method

Easy Targets

Target Predicted PDBCost Was Correct

CASP3Predicteda Top Hit Predicted Top

T0058 1akz 2416.33 1 1 CMT0047 1mup 2261.04 1 1 CMT0068 1rmg 2245.31 1 1 CMT0069 1rtm1 2244.64 ?b ?T0060 1gifA 2226.38 1 1 CMT0076 1almC 2155.88 2226.34 1 1 CMT0049 3pte 2217.70 1 1 CMT0048 1dcpA 2205.74 2209.62 1 1 CMT0062 2pia 1 2cnd 2149.19 ?b ?T0055 1esl 2134.70 1 1 CMT0064 1adr 1 1ois 2102.26 1 1 CMT0082 1bol 282.19 1 1 CMT0057 1gd10 247.30 1 1 CMT0070 2omf 244.92 1 1 CMT0074 2scpA 223.87 225.75 1 1 FR/CM

Moderately Difficult Targets

Target Predicted PDBCost Was Correct

CASP3Predicteda Top Hit Predicted Top

T0085 3cyr 262.68 — — FRT0053 1fvkA 25.72 227.37 — — FRT0083 1lmb3 220.18 220.51 1 1 FRT0044 1eps 212.70 1 1 FRT0059 2dri 24.89 210.47 — — FR/ABT0079 1neq 1 1san 29.70 1 1 FRT0071 1hviB 28.88 — — FR/ABT0075 1oya 24.88 28.12 — — FR/ABT0080 1t7pB 1 1mugA 24.12 27.14 — — FR/ABT0043 NF NF 27.14 — 1 FRT0054 NF NF 26.97 — — FRT0077 1tif 25.90 26.63 — — FR/ABT0067 1rhi2 25.59 26.04 — — FR/ABT0081 3chy 0.54 25.82 6 — FRT0061 1amj 23.31 25.76 — — FR/ABT0046 2mcm 25.75 26.49 1 1 FRT0063 1pex 23.30 25.65 — — FRaThis column gives the sum of costs for the target model and template model (for T0043 and T0054 wepredicted ‘‘new fold’’). If we did not submit our best-scoring template, then its cost is also reported incolumn four.bStructures have not been released yet for T0062 and T0069, but we are fairly confident of our predictions.AB, ab initio; CM, comparative modeling; FR, fold recognition; 1, prediction correct; 2, prediction incorrect.

122 K. KARPLUS ET AL.

T0070, T0074, T0076, T0082—our method gave unambigu-ous results: the correct structural template always had thebest (most negative) cost (results not available yet forT0062 and T0069). This cost was always less than 228, forwhich we expected fewer than 1% false positives.11

Our submitted alignments for these targets were gener-ally the automatically produced alignments, sometimessubject to minor hand editing. Figure 1 shows our pre-dicted alignment of T0074 to the template structure 2scpA.It shows that our alignment was quite accurate, apartfrom the first region which is shifted by two residues. Thefigure is also an example of one of the few cases in whichhand-editing improved on the automatic alignment. Inthis case, the automatic alignment had shifted the 21residues PWAVKPEDKAKYDAIFDSLS of T0074 7 resi-dues toward the N-terminal region of 2scpA, while thehand alignment shifted them four residues toward theC-terminus.

Overall, our alignments for the easy targets were usu-ally among the best alignments submitted to CASP3, eventhough we used no structure information in generating them.

Our three-dimensional (3D) prediction for target T0076was quite poor. We aligned T0076 to 1almC (a theoreticalmodel) because our top hit, 2mysC, had a probable mistrac-ing. We thought that 1almC corrected this mistracing, butsince it did not, our 3D prediction was poor even thoughthe sequence alignment was accurate. We would have donebetter to use the second-highest-scoring template (1wdcC),which has an accurate 3D structure.

Moderately Difficult Targets

Sixteen ‘‘moderately difficult’’ targets are in this cat-egory: T0043, T0044, T0046, T0053, T0054, T0059, T0063,T0067, T0071, T0075, T0077, T0079, T0080, T0081, T0083,T0085. We correctly predicted similar structures for fivetargets: T0044, T0046, T0079, T0081, T0083, all but T0081of which used the top hit. Most of the other structures inthis category had costs too close to zero to yield muchconfidence in our predictions.

Because of the low similarity between the targets andtemplates, even the ‘‘correct’’ predictions had alignmentsthat were accurate only for portions of the target sequence.We used local alignment to find the folds but globalalignment to provide the submitted alignment. The globalalignments generally aligned more residue pairs than didthe correct structural alignments, but if we had submittedthe local alignments, we would have missed many of theresidue pairs that were correctly predicted. Because root-mean-square (RMS) deviation is very sensitive to overpre-diction, our RMS scores for the entire alignment look poor,even though we often have a well-predicted core align-ment. Determining which parts of an alignment are worthpredicting and which should be removed remains a diffi-cult problem for us.

Our T0085 prediction (2cthA) was an incorrect multi-heme cytochrome, despite the high score. Matching threeheme-binding sites provided a strong similarity signal,even though the overall fold was different. The correctmultiheme cytochrome was in our top six hits (out of about2,000 templates).

For T0053, we were misled by our postscoring sequenceanalysis. We considered the correct template 1ak1 forT0053 (our tenth highest-scoring template), because itscored well in one of our template libraries and was also achelatase. We correctly rejected our top hit (1djxB), be-cause it did not cluster well the known metal-bindingresidues in T0053. We chose 1fvkA (our eighth highestscoring template), because it clustered the residues well.Unfortunately, we did not analyze the clustering on 1ak1.The target HMM (which gave the erroneous high score to1djxB) was poor because there were only three shortmatches to the target found in the nonredundant proteindatabase by the SAM-T98 method (other than the targetitself), so the HMM had to generalize from very little data.In such cases, it may be advantageous to put more weighton the template HMM scores, but we did not attempt this.

For T0071, we were again misled by our postscoringsequence analysis. We looked at, but rejected, some correct

Fig. 1. This figure presents the structural alignment of T0074 to 2scpAfound by the Yale structural aligner7 as aligned uppercase letters.Superimposed on the alignment is our predicted alignment, as linesconnecting the residues we aligned. Slanted lines between uppercase

letters indicate a shift of the predicted alignment relative to the correctstructural alignment. This hand alignment is slightly better than theautomatic one we started from, which had the same correctly alignedresidues, but the N-terminus of T0074 was more misaligned.

123SEQUENCE-BASED PROTEIN STRUCTURE PREDICTION

templates (1euu and 1dlhA) for the first domain because oflow scores and unconvincing alignments. We wanted tofind an SH3 domain, because the C-terminus of EPS15binds to T0071and is known to bind to an SH3 do-main.1,18 This hint from the literature was used to decidebetween a small number of folds, all of which had fairlyweak scores with our method.

We were surprised that 2mcm turned out to be a correctprediction for T0046, because the similarity to immuno-globulins was weak and most immunoglobulins are quitesimilar to each other. Our alignment turned out to beterrible, as can be seen in Figure 2.

The known active-site residues for T0081 clustered wellwhen the target sequence was aligned to 3chy. This wasour rationale for choosing 3chy as our prediction. It turnsout that 3chy has a similar alpha-beta-alpha structure,but threaded in a different order than T0081. If we do acircular permutation of the chain, we can get a muchbetter superposition of the structures; unfortunately, ourmethod did not predict this circular permutation but anincorrect alignment. Even constructing an HMM for thechimeric sequence 3chy followed by 3chy does not allowour methods to find the permuted alignment. We wouldhave done better to predict 1rvv1, which scored better than3chy and had the correct threading order. We had rejected1rvv1 because our alignments for it did not cluster theaspartic acid residues, which we had expected.

During the early part of CASP3, we predicted ‘‘new fold’’for targets that produced only weak scores to templatestructures. For this reason we predicted that T0043 would

be a new fold, even though our top hit turned out to be thecorrect fold.

There were a number of targets for which the correctstructural template was in our list of top 10 to 20 hits, butwe were not able to pick it out. These targets, with therank (out of approximately 2,000) of the correct hit inparentheses, are T0043(1), T0053(16), T0054(9), T0059(16),T0063(10), T0067(16), T0071(6), T0085(6). We hope thatsmall improvements to the method and the increase in thenumber of homologs in the databases will allow themethod to discriminate better in future.

The low similarity between targets and structures inthis category reduced alignment quality considerably com-pared with the alignments for the easier targets discussedearlier. We almost always hand-edited our automaticallygenerated alignments for these targets. In general, though,hand alignment did not provide much improvement, andwe would have done about as well with considerably lesseffort had we submitted our automatic alignments. Forexample, we show one of the better hand alignments inFigure 3, but the automatic alignment it was based on didnot include the incorrect alignment at the C-terminus.

Very Difficult Targets

Almost all of the remaining ‘‘very difficult’’ targets can becharacterized as targets that represented new folds or thatbore similarity only to fragments of solved protein struc-tures. For all of the targets in this category, our methodindicated with high likelihood that there was no similarstructure.

Fig. 2. This figure shows the correct structural superposition of T0046 and 2mcm as found bythe Yale structural aligner.7 Lines connect the residues we predicted to be aligned. This alignment isone of our worst alignments for a correct fold. The problem here was in our hand-editing, as theautomatic alignment we started from was much better.

Fig. 3. This figure presents the correct structural superposition ofT0083 and 1lmb3 as found by the VAST structural aligner.8 (The Yalestructural aligner provides a somewhat longer alignment, which agrees onthe common part.) The lines connect the residues we predicted to be

aligned. This alignment is among the best alignments we submitted for themoderately difficult targets, but automatic alignment was slightly betterthan this submission.

124 K. KARPLUS ET AL.

Secondary Structure Prediction Using SAM-T98

We also used the SAM-T98 multiple alignments for thetarget sequences as inputs to a neural net for secondarystructure prediction. This turned out to be one of the twobest secondary structure predictors at CASP3, althoughwe did not use the secondary structure predictions in ourfold predictions. When we had a correct fold prediction,deducing the secondary structure from the predicted align-ment provided more accurate secondary structure predic-tion than did the neural net, but the neural net was moreaccurate when we had an incorrect fold prediction. We planto combine this secondary structure predictor with twoothers we are building and put them all on the World WideWeb in the next 6 months.

CONCLUSION

We have discussed the HMM-based SAM-T98 methodfor remote homology detection and how it was applied toprotein structure prediction in the CASP3 experiment.

Many of the fold-recognition methods do considerablybetter on multidomain proteins when given the domainboundaries, but when we tested our method after CASP3on the true domains, we gained no benefit from having thatextra information.12 We suspect that our use of localalignment and sum-of-all-paths scoring makes our methodrather insensitive to the inclusion of extra domains, sothere is little gain from excising them.

Perhaps the biggest lesson learned is that we do notknow enough about proteins to adjust SAM-T98 align-ments manually. We would have been better off trustingthe programs even when they seemed wrong. Proteinexperts with more knowledge of the proteins would mostlikely be able to adjust the alignments better than wecould. We do still get value from human interaction, butthe value is mainly in including functional information orinformation about known binding sites rather than inadjusting alignments.

The costs provided by SAM-T98 are a strong, but notperfect, indicator of the correctness of the predictions.Using a calibration of our method from known struc-tures,11 many of the targets had such weak similaritiesthat we had little confidence in those predictions.

We believe SAM-T98 has taken sequence-only methodsabout as far as they will go. For many of the ‘‘moderatelydifficult’’ targets we did not select the correct structureeven though it was in the top 20 hits, so even a smallamount of additional information should improve themethod significantly. We are currently investigating a fewways to include structure information: building our tem-plate library HMMs from structural multiple alignments(rather than single sequences), using information from thestructure of the template to trim alignments, and usingsequence-structure compatibility measures to evaluatealignments.

ACKNOWLEDGMENTS

This work was supported in part by a Graduate Assis-tance in Areas of National Need (GAANN) graduate fellow-ship. We are grateful to David Haussler for starting thehidden Markov model and Dirichlet mixture work atUCSC, as these approaches were instrumental to oursuccess.

REFERENCES

1. Benmerah A, Begue B, Dautry-Varsat A, Cerf-Bensussan N. Theear of alpha-adaptin interacts with the COOH-terminal domain ofthe Eps 15 protein. J Biol Chem 1996;271:12111–12116.

2. Berger J. Statistical Decision Theory and Bayesian Analysis. NewYork: Springer-Verlag; 1985.

3. Bernardo J, Smith A. Bayesian Theory. New York: John Wiley &Sons; 1994.

4. Bernstein F, Koetzle T F, Williams GJ, et al. The protein databank: a computer-based archival file for macromolecular struc-tures. J Mol Biol 1977;112:535–542.

5. Doolittle R F. Of URFs and ORFs: A Primer on How to AnalyzeDerived Amino Acid Sequences. Mill Valley, CA: University Sci-ence Books; 1986.

6. Eddy S. Hidden Markov models. Curr Opin Struct Biol 1996;6:361–365.

7. Gerstein M, Levitt M. Comprehensive assessment of automaticstructural alignment against a manual standard, the SCOPclassification of proteins. Protein Sci 1998;7:445–456.

8. Gilbrat J, Madej T, Bryant S. Surprising similarities in structurecomparison. Curr Opin Struct Biol 1996;6:377–385.

9. Hughey R, Krogh A. Hidden Markov models for sequence analysis:extension and analysis of the basic method. Comput Appl Biosci1996;12:95–107. Information on obtaining SAM is available athttp://www.cse.ucsc.edu/research/compbio/sam.html.

10. Jones D, Thornton J. Protein fold recognition. J Comput AidedMol Des 1993;7:439–456.

11. Karplus K, Barrett C, Hughey R. Hidden Markov models fordetecting remote protein homologies. Bioinformatics, 1998;14:846–856.

12. Kelley LA, MacCallum RM, Sternberg M, et al. CAFASP-1:Critical assessment of fully automated structure prediction meth-ods. Proteins, in press.

13. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. HiddenMarkov models in computational biology: applications to proteinmodeling. J Mol Biol 1994;235:1501–1531.

14. Lathrop RH, Smith TF. Global optimum protein threading withgapped alignment and empirical pair potentials. J Mol Bio1996;255:641–665.

15. Lemer C, Rooman M, Wodak S. Protein structure prediction bythreading methods: evaluation of current techniques. Proteins1995;23:337–355.

16. Park J, Karplus K, Barrett C, et al. Sequence comparisons usingmultiple sequences detect twice as many remote homologues aspairwise methods. J Mol Biol 1998;284:1201–1210. Paper avail-able at http://www.mrc-lmb.cam.ac.uk/genomes/jong/assess_paper/assess_paperNov.html.

17. Santner TJ, Duffy DE. The Statistical Analysis of Discrete Data.New York: Springer Verlag; 1989.

18. Schumacher C, Knudsen B, Ohuchi T, Di Fiore P, Glassman R,Hanafusa H. The SH3 domain of Crk binds specifically to aconserved proline-rich motif in Eps15 and Eps15R. J BiolChem1995;270:15341–15347.

19. Sippl M. Knowledge-based potentials for proteins. Curr OpinStruct Biol 1995;5:229–235.

20. Sjolander K, Karplus K, Brown MP, et al. Dirichlet mixtures: amethod for improving detection of weak but significant proteinsequence homology. Comput Appl Biosci 1996;12:327–345.

125SEQUENCE-BASED PROTEIN STRUCTURE PREDICTION

predicting protein structure using only sequence information

Documents