Distinguishing Enzyme Structures from Non-enzymes Without Alignments

Download Distinguishing Enzyme Structures from Non-enzymes Without Alignments

Post on 03-Nov-2016




0 download


  • Distinguishing Enzyme Structures from Non-enzymesWithout Alignments

    Paul D. Dobson and Andrew J. Doig*

    Department of BiomolecularSciences, UMIST, P.O. Box 88Manchester M60 1QD, UK

    The ability to predict protein function from structure is becoming increas-ingly important as the number of structures resolved is growing morerapidly than our capacity to study function. Current methods for predict-ing protein function are mostly reliant on identifying a similar protein ofknown function. For proteins that are highly dissimilar or are only similarto proteins also lacking functional annotations, these methods fail. Here,we show that protein function can be predicted as enzymatic or not with-out resorting to alignments. We describe 1178 high-resolution proteins in astructurally non-redundant subset of the Protein Data Bank using simplefeatures such as secondary-structure content, amino acid propensities,surface properties and ligands. The subset is split into two functionalgroupings, enzymes and non-enzymes. We use the support vectormachine-learning algorithm to develop models that are capable of assign-ing the protein class. Validation of the method shows that the function canbe predicted to an accuracy of 77% using 52 features to describe eachprotein. An adaptive search of possible subsets of features produces asimplified model based on 36 features that predicts at an accuracy of80%. We compare the method to sequence-based methods that also avoidcalculating alignments and predict a recently released set of unrelatedproteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, numberof disulphide bonds and size of the largest cleft. This method is applicableto any structure as it does not require the identification of sequence orstructural similarity to a protein of known function.

    q 2003 Elsevier Science Ltd. All rights reserved

    Keywords: protein function prediction; structure; enzyme; machinelearning; structural genomics*Corresponding author


    We aim to demonstrate that protein function canbe predicted as enzymatic or not without resortingto alignments. Protein function prediction methodsare important as international structural genomicsinitiatives are expected to generate thousands ofprotein structures in the next decade. The capacityof laboratories studying protein function is not suf-ficient to keep pace with the number of structuresbeing released, with the consequence that manynew structures lack functional annotations.Predictions can help guide the activities of theselaboratories towards functionally more important

    proteins. The main approaches to in silico proteinfunction prediction from structure are neatly sum-marised by Sung-Ho Kim.1 They are assignmentof a function from a similar fold, sequence, struc-ture or active site of previously known function,assignment from a structure with a common ligandand ab initio function prediction (implying amethod that does not work by comparison withanother protein of known function).

    The most common methods rely on identifyingsimilarity to a protein of known function andtransferring that function. Sequence alignmentsare identified using approaches such as BLAST2 orFASTA.3 The power of PSI-BLAST4 has permittedthe detection of sequence similarities that inferhomology down to below 20%. Even when thelikes of PSI-BLAST fail, the sequence can still yielduseful information in the form of sequence motifs,which can be identified using PRINTS,5 BLOCKS,6

    0022-2836/03/$ - see front matter q 2003 Elsevier Science Ltd. All rights reserved

    E-mail address of the corresponding author:andrew.doig@umist.ac.uk

    Abbreviations used: EC, Enzyme Commission; CE,Combinatorial Extension.

    doi:10.1016/S0022-2836(03)00628-4 J. Mol. Biol. (2003) 330, 771783

  • PROSITE7 and other similar tools. Using predictedsecondary structures to assign fold class canexpand the information content of a sequence stillfurther, since fold classes are often associated witha particular set of functions.8

    The next logical step after using predicted struc-ture is to use real structure. As structure is morehighly conserved than sequence, it is often possibleto detect similarities that are beyond the reach ofeven the most sophisticated sequence alignmentalgorithms. Structural similarity is detected usingtools such as Combinatorial Extension9 andVAST,10 which map structures onto each other.Incomplete structural alignments can still suggestfold class. A problem encountered when identify-ing similar folds is that there may not be onespecific function associated with a fold, makingchoosing the correct one non-trivial. The TIMbarrel fold is known to be involved in at least 18different enzymatic processes1 and while this doesgive a narrowing of the number of possiblefunctions to assign, the precise function remainsunknown.

    Transferring function from a protein that sharesa ligand is a method that can give variable resultsif not tempered with some biochemical knowledge.For example, possession of NADH suggests anoxidoreductase enzyme. Functionally unimportantligands may be shared by many structures, but tosay that these proteins share a common functionwould be far from accurate. Ligand data can beused in conjunction with data concerning theimmediate protein environment that binds theligand. Binding-site correspondence is a strongindicator of functional similarity,11 as is the casewith the correspondence of the near-identicalcatalytic triads in the active sites of trypsins andsubtilisins,12 two evolutionarily distant but func-tionally similar types of protein. The utility of thisapproach is demonstrated by the ProCat database.13

    For sequences and structures that are highlysimilar, the reliability of the predicted function isgood, though in a recent study it has been shownto be less than previously thought.14 For pair-wisesequence alignments above 50%, less than 30%share exact EC numbers. This suggests the level ofsequence/structure conservation that impliesfunction conservation is much lower than believedformerly and demonstrates the pressing need forprotein function prediction methods that are notdependent upon tools that detect alignments.

    Non-alignment-based function predictions havebeen made using many different techniques. Textdata mining of scientific literature15 uses the infor-mation in scientific abstracts to assign subcellularlocalisations, which can be used as an indicator offunction. Amino acid compositions have beenused to predict localisation.16,17 The Rosetta Stone18

    method allows function predictions to be made forproteins that do not align to a protein of knownfunction by examining gene fusions. If the proteinaligns to part of a fused protein and the part ofthe fused protein it does not align to matches a

    protein of known function, that function can betransferred to the original protein. Phylogeneticprofiling19 functionally relates proteins with similarprofiles. The gene neighbour method uses theobservation that if the genes that encode two pro-teins are close on a chromosome, the proteins tendto be functionally related.20,21 Neural networks havebeen used to combine predicted post-translationalmodifications into sophisticated systems capable ofpredicting subcellular location and function.22

    While similarity-based methods do provide themost precise and dependable means of functionprediction, in many cases it is apparent that theyare heavily reliant on being able to identify highlysimilar proteins of known function. With one ofthe principal objectives of the structural genomicsinitiatives being the elucidation of structures fromthe more sparsely populated regions of fold space,the problem of not finding a similar protein ofknown function is more likely to occur. A methodsuggested by Stawiski et al.23 that lies between asimilarity-based approach and an ab initio method,is based on the observation that proteins of similarfunction often use basic structural features in asimilar manner. For example, they note that pro-teases often have smaller than average surfaceareas and higher Ca densities. Similarly,O-glycosidases24 deviate from the norm in termsof features such as the surface roughness (or fractaldimension). Features identified as being indicativeof a certain function permit the construction ofmachine-learning-based classification schemes thatallow function predictions for novel proteins with-out resorting to conventional similarity-basedmethods. The broad structural similarities thatcharacterise a functional class of proteins extendbeyond the reach of structural alignments, yet ithas been shown that they can be used for proteinfunction prediction. Here, we demonstrate amethod of identifying protein function asenzymatic or not without resorting to alignmentsto proteins of known function. To do this, wedescribe each protein in a non-redundant subsetof the Protein Data Bank25 in terms of simplefeatures such as residue preference, residue surfacefractions, secondary structure fractions, disulphidebonds, size of the largest surface pocket andpresence of ligands. As we are demonstrating amethod for use when alignment methods do notyield results, we restrict ourselves to features thatdo not rely on alignments. As such, our method isfor use when alignment methods fail. Histogramsillustrate that for some features the distributionsof enzymes and non-enzymes are different. Inorder to utilise these differences we combine thedata into a predictive model using the supportvector machine technique. Adaptive programmingis used to find a more optimal subset of features,giving a greater predictive accuracy whilst simul-taneously simplifying the model. We validatethese models by leave-out analyses and predictinga set of unrelated proteins submitted to the ProteinData Bank since the training set was compiled.

    772 Predicting Protein Function from Structure

  • Using the same approach, we investigate the utilityof models built only using amino acid propensities.Being easily calculable from sequence, this pro-vides a method for predicting the function of pro-teins that cannot be aligned to a protein of knownfunction, even if we do not have a structure. Wealso make a comparison to the ProtFun enzyme/non-enzyme methods described by Brunak et al.22


    The support vector machine works by deducing

    the globally optimal position of a hyperplane sep-arating the distribution of two classes of pointsscattered in a multi-dimensional space. The num-ber of features used to describe the position ofpoints determines the dimensionality of thathyperspace. The 52 features used to describe eachprotein are shown in Table 1. All features are easilycalculable from any protein structure. No feature isbased on mapping sequence or structure onto aknown protein, so the model can be said to betruly independent of alignment-based techniques.

    Heterogen data is presented to the machine inbinary form (1 for present, 0 for absent; Table 2).

    Table 1. All features and a more optimal subset

    Features used to describe the difference between structures of enzymes and non-enzymes. Bold and italicised features are thoseselected by the adaptive search of possible subsets as part of the most optimal subset identified.

    Predicting Protein Function from Structure 773

  • Some metal ions are present only once in the data-set, in which case the feature was not included inthe set of features on which the machine wastrained.

    Results from the leave-out analyses of predictionaccuracy are shown in Table 3. These data arebased on a model generated using a radial basiskernel function. Linear and polynomial kernelsperform less well (not shown).

    When using all 52 features, the results indicate apredictive accuracy for the set of features around7677%. Random performance based purely onthe size of classes could give, at best, an accuracyof 58.7% by predicting enzyme every time. Thesubset selection strategy selects 36 features fromthe total 52. The increase in accuracy is approxi-mately 34% to 80%. Bold and italicised featuresin Table 1 show the subset that gives the results inTable 3. On a set of 52 unrelated (by CombinatorialExtension) proteins submitted to the Protein DataBank since the training set was compiled, this 36feature model predicts with an accuracy of 79%.Three unrelated proteins with no functional assign-ment in the Protein Data Bank were predicted asfollows (confidence in parentheses): 1JAL non-enzyme (1.081), 1NY1 non-enzyme (20.166),1ON0 enzyme (0.548).

    Table 4 shows that certain features are selectedmore than others. Possibly as a result of its lowabundance at protein surfaces, the surface fractionof tryptophan is never used. Ligand data are usedonly rarely (calcium is never used). This isprobably related to the sparseness of heterogens inthe dataset. Other features, particularly secondary-structure contents (Figure 2) and the volume ofthe largest pocket (Figure 3) are selected frequently.

    The confidence of a support vector machineprediction is related to the distance from thehyperplane. Larger distances equate to higherconfidence. Figure 1 illustrates how the 36 featuremodels in a leave-one-out analysis were predicted.On enzymes, the method is 89.7% accurate. Fornon-enzymes, the accuracy is 68.6%.

    Models trained only on amino acid propensitiesgive an indication of how accurate function predic-tion can be from sequence when a similar proteinof known function cannot be detected (Table 5).

    Subsets contain only the features that combine tomore optimally describe the difference betweenclasses using the support vector machine. Tovisualise this, histograms can be used to depictthe underlying probability distribution of classesfor a particular feature. Figures 25 show typicalhistograms.

    Correlations between class distributions for cer-tain features, such as largest pocket volume, helixand sheet contents, are low. Most other featureshave distributions that tend to differ only slightlyfor the two classes. However, the support vectormachine combines individual features into amulti-dimensional joint probability distribution, soit is the manner in which feature distributionsinteract that is important. It is for this reason thatwrapper methods were preferred to filters for sub-set selection (see Methods). A pre-filter methodbased on a correlation coefficient would haveresulted in a less optimal performance.

    For a-helix fraction (Figure 2) it is evident thatthere are large differences between enzymes andnon-enzymes, and the coiled-coil state that isobserved in some DNA-binding proteins isdepicted at the upper end of the scale by a slightincrease in the non-enzyme density. Most otherhistograms approximately fit to standard statisticaldistributions such as normal, Poisson, etc. as typi-fied by Figures 4 and 5. Figure 3 shows the volumeof the largest surface pocket. Above 1000 A3,enzymes are more prevalent, so a protein with alarge surface pocket is more likely to be enzymatic.Conversely, and this may be more relevant,proteins with small surface pockets tend not toshow enzyme activity.

    The ProtFun22 prediction servers also predictwhether a protein is an enzyme. These serverswork entirely on features calculated from proteinsequence. Like the method presented here, theyare not reliant on alignments. We submittedsequences from all the proteins in the dataset toProtFun analysis. The results indicate that on thisset the ProtFun approach is 75.21% accurate.

    Table 2. Ligand groups in the dataset

    Hetero group Enzyme Non-enzyme

    Metals and metal-containing compoundsCalcium 46 23Cobalt 1 0Copper 2 1Heme 21 15Iron (not heme) 4 1Manganese 0 0Nickel 1 0Zinc 0 1

    CofactorsATP 33 6FAD 40 0NAD 26 0

    Numbers of each hetero group type in the dataset and eachclass. Metals present only once in the dataset were not used inthe training set.

    Table 3. Average percentage accuracies and averagestandard errors of models built on all features and theoptimal subset of 36 features in Table 1


    Number of features 1% 5% 10% 20%

    All (52) 77.16 76.87 76.86 76.17Standard error 1.24 1.23 1.29Subset (36) 80.14 80.26 80.17 80.43Standard error 1.28 1.24 1.31

    Comparison of the prediction accuracy of models generatedfrom all features and the more optimal subset of features gener-ated by an adaptive search of possible subsets. The accuracyincrease is approximately 34%.

    774 Predicting Protein Function from Structure

  • Table 4. Twenty subsets chosen by adaptive searching

    The composition of 20 models selected by the adaptive programming regime. Grey shaded squares indicate that a feature is presentin a subset. The number of times a feature was selected over 20 runs is shown on the left. Percentage accuracies for each model areshown along the bottom.

    Predicting Protein Function from Structure 775

  • Discussion

    It is apparent that there is a need for methods topredict protein function when conventionalapproaches do not yield results. We demonstratethe utility of representing proteins not in terms ofthe precise locations of residues, but by usingsimple features such as residue preference, second-ary structure, surface features and ligands. Whenthese data are combined using the support vectormachine approach, a model is built that can predictthe class of a novel protein as enzymatic or not toan accuracy of approximately 77%. This accuracyis based on a model built upon 52 features. A sub-set of these features selected by a basic adaptiveprogram gives a much simpler model, capable ofpredicting with an increased accuracy of 80%.Figure 1 suggests that the model works by definingwhat constitutes an enzyme rather than a non-enzyme, as it predicts enzymes with much greateraccuracy. This makes sense, as enzymes are truly aclass of proteins, with some similar propertiessuch as having active sites, whereas non-enzymesare merely all those proteins that are not enzymesthat have been accumulated into somethingmore artificial and vague. One might expect thatwith greater diversity in the non-enzyme class, amodel that isolates enzymes from the rest wouldbe easier to derive. It is also apparent fromFigure 1 that high-confidence predictions are rarelyincorrect.

    The method chosen for modelling the data andmaking predictions is the support vectormachine,26,27 a machine-learning tool for two-classclassification. The goal of classification tools is tofind a rule that best maps each member of trainingset S (described by the properties X) to the correctclassification Y. Each point in the input space is

    described by the vector xi; yi :xi [ X , R


    yi [ Y { 1;21}The support vector machine is based on separatingthe N-dimensional distribution of points in a train-ing set. A hyper-plane is fit to the input space in amanner that globally optimally separates the twouser-defined classes. The orientation of a testsample relative to the hyper-plane gives thepredicted class.

    Recently, the popularity of the support vectormachine has increased dramatically. It has beenshown to frequently out-perform previous classifi-cation favourites such as neural networks. Thereare two major features of the support vectormachine that make it capable of such high per-formance, the kernel function and structural riskminimisation.

    It is evident that for most real problems the opti-mal separating hyper-plane will not fit linearlybetween the two classes without significant error,though a non-linear separation may well be feas-ible. The kernel function permits the mapping ofdata in the input space into a higher-dimensionalfeature space where non-linear mappings becomelinearly possible. The choice of kernel functiondetermines the nature of the mapping to thefeature space.

    Perhaps the most critical component of the sup-port vector machine is the manner in which risk isminimised. The concept of risk can be defined asthe number of errors the model makes. It isapparent that a good model should make as fewerrors as possible. Traditionally, risk is minimisedby fitting a model to the training set so that it can

    Figure 1. Prediction confidence distributions. Distance from the hyperplane is related to the confidence of aprediction. The distributions of predictions for an 80.9% (36 feature) model are shown.

    776 Predicting Protein Function from Structure

  • predict the training set as accurately as possible.This is known as empirical risk minimisation.However, while the model may be able to predictthe training set almost perfectly, it may not per-form so accurately on data unlike what it has beentrained on. The model is said to have lost theability to generalise. If the training set can beguaranteed to represent everything the modelcould be asked to predict, then the problem wouldnot be so great, but this situation is rare. The resultof empirical risk minimisation is often a modelwith poor predictive abilities as a consequence ofdescribing the training set too well. In this situ-ation, we say that the model is over-fit to the data.Risk is minimised differently by the support vectormachine. Structural risk minimisation uses ameasure of the number of different models imple-

    mentable in the training set to control the point atwhich a model can be said to be over-fit to thetraining data. In this way, generalisation abilitycan be controlled. A rule can be found thatbalances the generalisation ability with the empiri-cal risk minimisation method to ensure that theprediction performance is optimal on the trainingset without being over-fit to it.

    The adaptive search of possible subsets was pre-ferred to filter methods that quantify the differencebetween classes for each feature, as it explores thejoint probability distributions of possible subsetsin the context of the learning algorithm. Thismeans that the subset selection strategy investi-gates features as they interact, rather than in iso-lation. The final subset selected by the adaptiveprogramming search (what remains upon com-pletion of the adaptive process) is not always thesame. Local performance maxima within the setcan be responsible for this. Without an exhaustivesearch it is difficult to show that the adaptive pro-gramming approach finds the most optimal subsetfrom the 252 possibilities. However, it consistentlyfinds simple subsets capable of predicting withgreater accuracy without sacrificing generalisationperformance (by our measure of generalisation, inwhich the subset is prevented from over-fitting byrestricting the tolerance of variation in leave-outresults so that any subset of features incapable ofpredicting well on all fractions of the dataset isnot permitted).

    Table 4 illustrates that certain features areselected consistently, such as the secondary-struc-ture fractions (Figure 2), largest pocket volume(Figure 3), fractions of phenylalanine (Figure 4),fractions of surface asparagine and isoleucine, andATP, NAD and FAD. Some of these more fre-quently selected features have class distributionsthat differ sufficiently to permit slightly improved(compared to random) function predictions with-out other features. For example, the volume of thelargest pocket is a feature that one might expect todiffer greatly between classes and, indeed, itshows little correlation between enzymes andnon-enzymes. Enzymes tend to have large surfacepockets containing their active sites. Therefore, aprotein without a large surface pocket is probablynot an enzyme. This seems to be true, as themajority of enzymes have a largest pocket volumeabove 1000 A3. Above this volume there are stillnon-enzymes as a result of those proteins thatrequire large surface pockets for binding but haveno enzymatic role (for example, a DNA-bindingprotein such as a transcription factor).

    The presence of any of the coenzymes ATP, FADand NAD in a structure suggests strongly that theprotein is an enzyme. ATP is found in some non-enzyme structures, which is to be expected, as it isinvolved in almost all processes that involveenergy transfer. Despite this, in the majority ofcases it is still indicative of an enzyme.

    Another feature that exhibits poor correlation ishelix content. A cursory glance at the helix content

    Figure 2. Histograms to illustrate differences insecondary structure contents.

    Predicting Protein Function from Structure 777

  • distributions suggest that alone it should permit abetter than random prediction, as proteins withless than 30% helix tend to be non-enzymes,between 3060% enzymes tend to dominate, andabove 60% non-enzymes are more prevalent.Similar information can be gleaned from the sheetfractions, with a significant proportion of non-enzymes having no b-sheet. Between 10% and30%, enzymes are more common than non-enzymes. Greater than 30% sheet content suggestsnon-enzymes. That secondary structure contentshave such discriminating power is intriguing andopens up avenues of research employingsecondary-structure predictions or experimentaldata.

    For most features, there is little apparentdifference between class probability distributionsthat suggests they will have any discriminatingpower. Yet it is clear that these features do havesome utility, as they are chosen consistently byadaptive searching. Biological interpretations forsome of the more frequently selected features canbe proposed, though these are difficult to verify.Many residue types are known to be associatedwith certain processes linked to function and/orsubcellular location, such as the role of asparaginein N-linked glycosylation (in Asn-X-Ser/Thr,X Pro). This may explain why surface aspara-gine is a useful feature, as certain classes ofproteins may require glycosylation (such as certainextracellular proteins that require enhancedsolubility).

    A set of residues that accounts for only a smallfraction of the total surface includes cysteine, iso-leucine, leucine, methionine and valine. Surfacefractions of these hydrophobic residues areselected frequently. One possible explanation forthis is that a residue type with a preference for

    being buried, and so normally present at the sur-face at only low levels, is significant when it isfound unburied. That is, something that introducesinstability, like having a solvent-accessible aliphaticresidue, would not happen without purpose. Thissuggests involvement in the proteins function orcellular location. Patches of hydrophobic residuesat the surface probably occur in proteins that inter-act with other proteins, making complex formationenergetically more feasible.

    The high frequency of selection of the histidinefraction feature is interesting. It is not a highlyabundant residue at the surface, yet it has the inter-esting property of having a pKa of 67, whichmakes it amenable to mediating proton transfer.This is a common process in many enzymemechanisms and offers an explanation for thehigher frequency at which it occurs in the enzymeclass.

    Several features are selected very rarely, if at all.Removal of such features allows for a muchsimpler model, as the level of abstraction by thekernel function from the input space to the featurespace is much lower in the absence of noisy oruseless features. In combination, features cancooperate to form a joint probability distribution

    Figure 3. Differences in the volume of the largest surface pocket (A3). Pockets of volume above 1000 A3 suggest theprotein is an enzyme. (The non-enzyme value in the leftmost bin, excluded here for clarity, is 0.32).

    Table 5. Average percentage accuracies and standarderrors of models built on amino acid frequencies


    Features 1% 5% 10% 20%

    20 75.38 75.00 74.83 74.98SE 1.33 1.37 1.35

    The performance of models built on all amino acid frequen-cies and the associated standard errors.

    778 Predicting Protein Function from Structure

  • that lends itself well to partitioning, so too afeature can act to disrupt this and force a lesswell-fit model.

    Sparse features, such as the metal ion data, seemto contribute little. It is surprising that some typesof metal are present so infrequently in this dataset.Calcium is not present infrequently, nor is it dis-tributed equally between classes. However, it doesnot get used in any of the models generated.

    To summarise the feature selections, it is clearthat some features are useful for predicting proteinfunction in isolation, as the distributions differgreatly between classes. In combination, this dis-criminatory power is heightened. Features thatseem to correlate well across the two classes offunction can bring something to the predictivemethod that is not immediately apparent fromtheir distributions. It is worth noting that the radialbasis function kernel performs better than linearand polynomial kernels. This is indicative of thehigh complexity of the dataset distribution.Evidently, different sub-groupings of each class

    exist in pockets throughout the total distribution,forming discontinuous clusters of data. Of thethree kernels used, the radial basis function ismost capable of competently mapping complexdata such as this into higher dimensions, wherethe problem becomes linearly tractable.

    The construction of the dataset is intended toavoid the problem of over-representation in theProtein Data Bank, yet it still suffers the problemof under-representation. For example, it isnotoriously difficult to obtain structures formembrane proteins and, as a consequence, thereare very few deposited in the Protein Data Bank.Another source of bias stems from the dataset con-taining only X-ray crystal structures (as a conse-quence of the programmes used to calculatecertain features being intolerant of the ambiguitythat is often seen with NMR structures). The struc-tures that are not amenable to crystallisation aretherefore potentially under-represented. The conse-quence of these biases is that models generatedmay be less capable of predicting functions for

    Figure 4. Example histogram.The differences in fractions ofphenylalanine between enzymesand non-enzymes.

    Figure 5. Example histogram.The differences in surface fractionsof aspartate between classes.

    Predicting Protein Function from Structure 779

  • under-represented proteins. As under-representedclasses have structures solved, the method shouldeasily be able to accommodate this extrainformation.

    There is an uneven distribution of enzymeclasses in the PDB (with hydrolases being by farthe largest class and ligases much the smallest).This is reflected in the dataset, so one may suspectthat the model may focus on predicting certainclasses more than others. However, errors in aself-consistent test (using a model to predict theset on which it was trained) range only from411% over the six top-level enzyme classifi-cations, suggesting that the model is not overlydirected towards any particular class. For non-enzymes, there is no equivalent classificationscheme applicable here, but from functional anno-tations within the PDB files no bias is immediatelyobvious.

    Without structures it is still possible to makefunction predictions. Purely working from aminoacid frequencies, models can be built in the sameway that are capable of predicting to approxi-mately 75% accuracy. It seems prudent to validatethis result in future work by using a larger, morecomprehensive dataset based on a non-redundantsequence dataset (here the results are based on thestructural dataset and so are subject to the biaseslisted above). With a dataset that covers a morediverse selection of proteins the problem changes,so perhaps the accuracies shown herein should betreated with a degree of caution. From this datasetit is clear that predictions to some above-randomextent are certainly possible using only amino acidfrequencies.

    A comparison is made with the ProtFun methoddescribed by Brunak et al.22 by running our datasetthrough their algorithms. This is not really a com-parison on an even footing, as there are more dataavailable to us in a structure, but more data avail-able to ProtFun in that they have a larger datasetthat may contain sequences similar to sequencesin our dataset. However, the comparison highlightsa greater accuracy for the method herein, and goessome way to vindicating the inclusion of structuredata in protein function predictions.

    We have demonstrated the utility of representingproteins in terms of simple features as a method ofdescribing the similarities between proteins with acommon function. Using this information, wehave been able to construct models that can predictprotein function as enzymatic or not to an accuracyof 80%. We are currently in the process of settingup a server that will take as input a PDB file,extract data for the features used in our best per-forming model and make predictions.


    Dataset construction

    The dataset consists of X-ray crystal structures with aresolution of less than or equal to 2.5 A and an R-factorof 0.25 or better. A structurally non-redundant represen-tation of the Protein Data Bank provides a firmergrounding for validating results as prediction accuraciesare artificially high with a redundant dataset (it is easierto make a correct prediction for an object if the model isbuilt upon data that is essentially the same). Removingsimilarity also avoids the problem of biases in the PDBthat are the result of certain research areas producingmore structures than others.

    The Combinatorial Extension9 methods implementedat UCSC provide structural alignment Z-scores for allchains in the PDB against all others, or approximationsthrough a representative structure. This information issufficient to build the set of structures required. Theauthors of CE propose a biological interpretation of theZ-scores. Scores of 4.5 and above suggest the sameprotein family, scores of 4.04.5 are indicative of thesame protein superfamily and/or fold, while scores of3.54.0 imply a possible biologically interesting simi-larity. In this cull of the PDB, no chain in any proteinstructurally aligns to any other chain in the dataset witha Z-score of 3.5 or above outside of its parent structure.Aligning to another chain in a parent structure is per-mitted, as redundancy is only an issue in leave-outtesting, so similarity between chains that are always leftout together does not bias results.

    The dataset is split into 691 enzymes and 487 non-enzymes on the basis of EC number, annotations in thePDB and Medline abstracts. Ten proteins were excludedfrom the total cull due to incomplete functionannotation. The dataset is provided as a supplement.

    Features for model building

    Easily computed features that differ between classesare used to describe each protein. Residue preference iscalculated as the number of residues of each typedivided by the total number of residues in the protein.Secondary structure content is calculated similarly work-ing from STRIDE28 assignments. The helix content isderived from the total amount of the structure assignedas a, 310 or p helix. Sheet and turn fractions are calcu-lated similarly. Disulphide bond numbers are derivedfrom the SSBOND section of the PDB file. The size ofthe largest pocket is the molecular surface29 volume inA3, as calculated by the cavity detection algorithmCASTp.30 Surface properties are calculated fromNACCESS, with the fraction of the total surfaceattributable to each residue type being features.

    Heterogen data are taken from the PDB file and pre-sented to the machine in binary form (1 for present, 0for absent). When ligands are used in crystallisation,often an analogue or a derivative of a structure is usedfor technical reasons. This has the potential to makedata very sparse. For example, it is desirable to clusteradenosine triphosphate with the structures for adenosinemono- and diphosphate. There are methods available


    Hubbard, S. J. & Thornton, J. M. (1993). NACCESS,Department of Biochemistry and Molecular Biology,University College London.

    780 Predicting Protein Function from Structure

  • that cluster similar hetero groups together, yet they oftengroup molecules that are clearly different. After investi-gating the utility of certain clustering schemes, it wasdecided that only a very simple and severe classificationscheme would classify rigorously enough to avoid error.The feature ATP contains all structures that contain oneof the following HET codes: AMP, A, ADP, ATP. Theseare codes for adenosine mono-, di- and triphosphate.Similarly, the feature FAD consists of proteins that con-tain either FAD or FDA. We consider the oxidised andreduced forms of FAD as equivalent. The NAD featureuses the HET codes NAD, NAH, NAP and NDP. TheNAD feature treats NAD and NADP in both oxidisedand reduced states as equivalent. In order to findequivalent HET codes the stereo smiles utility in theMacromolecular Structure Database was used.31 Whilethis gives some scope for detecting when analogueshave been used, it does not cover the whole range ofalternative hetero groups that could be used asanalogues. A looser clustering of hetero groups maycover these, but would not provide the level ofspecificity necessary to prevent noise being introduced.In terms of restricting noise, the more austere schemedetailed above performs best. Completeness is sacrificedto preserve accuracy.

    All non-binary features are normalised over the rangeto take a value between 0 and 1, as recommended withsupport vector machines.

    The support vector machine

    The implementation of the support vector machineused here is SVMLight.32 The kernel functions usedwere linear, polynomial and radial basis function. Otherthan varying the kernel function, the algorithm was runon default settings.

    Performance analysis: leave-out-analysis

    In order to validate the method, some measure ofaccuracy is required. A simple but reliable method ofassessing accuracy is the leave-out method of crossvalidation.33 From a set S a subset of size x is excluded,with the remaining Sx members forming the trainingset. The model built from this training set is used to pre-dict the classes of members of the excluded subset x.Each of the S/x subsets is left out and predicted in turn.The sum of the accuracy on each subset divided by thenumber of subsets gives an estimate of total accuracy.At its lower extremity, x can be a portion of size contain-ing only one member (a leave-one-out analysis), thoughfor larger datasets this is often computationallyexpensive and a larger value of x is preferred. The resultsfrom different values of x are not always consistent.There is debate concerning the value x should take.Here, results are presented for x being one protein, 5%,10% and 20% of the set. The percentage leave-outanalyses are each repeated ten times with different par-titions to ensure that any result is not purely the conse-quence of a fortuitous splitting of the dataset. Meanperformance and standard error are averaged over theten partitions.

    Performance analysis: newly released PDB files

    Since the dataset was compiled the PDB has releasedmany protein structures. The majority of these proteinsare related to existing structures in some way. By usingthe same Z-score criterion as before, 56 new proteinsthat are not related to the dataset were identified. Thesesplit into 29 enzymes and 23 non-enzymes, with 1L4Xbeing discarded due to it being a de novo designed pep-tide. Three further structures annotated as having noknown function were 1JAL, 1NY1 and 1ON0.

    Performance analysis: comparison to ProtFun22

    The ProtFun servers predict protein function fromsequence without resorting to alignments. A sequencesubmitted in FASTA format is processed to generate fea-tures for a neural network-based prediction. Using theprobabilities generated, we compared our method.When a structure contained multiple chains, the averageprobability was used.

    Feature subset selection by adaptive programming

    In the presence of two models of differing complexityand approximately equivalent performance there is noreason to presume that the more complex model is anymore valid than the simpler model (by the principle ofOccams Razor).34 Choosing a more relevant subset offeatures leads to a simpler model that can often beof greater accuracy. This can be due to noisy featuresthat introduce unhelpful disturbances to the joint prob-ability distribution and make fitting the hyperplanemore complex.

    Methods for choosing a subset of features exist in twobasic forms.35 Filter methods rely on pre-processing thedata to find features that differ between the two classes.They employ techniques that quantify the extent towhich the two class distributions for a feature correlate(with poorly correlating features being useful for dis-tinguishing between classes). Filter methods take littleaccount of how the distributions of features interact inmultidimensional space. They are more suited to selec-tion of subsets from a very large number of features,when the computational expense of wrapper methodscan prove prohibitive. Wrapper methods use the learn-ing algorithm itself to search for a subset of features.Here, the learning algorithm is used as the selectioncriterion in a simple adaptive programming search ofpossible feature subsets.36 There are 252 possible subsetsto choose from 52 features, making an exhaustive searchimpossible. Adaptive programming is well suited toquerying large search spaces. A basic adaptive pro-gramme randomly generates solutions to a problem andthen uses some measure of fitness to select the parentsof the next generation of possible solutions. Theseparents then reproduce with mutations. Here, a simpleadaptive programme starts from a random subset offeatures. Offspring from this parent are generated byintroducing random mutations at a frequency of 10%per reproduction (that is, five changes are made, as thisis approximately 10% of 52. It is permitted for a featureto change and change back). On the population sizereaching 15, the leave-20%-out analysis forms the basisof the scoring system that is used to select the modelwith greatest accuracy. Here, there is a risk that the sub-set begins to describe the problem only for the datasetand in so doing loses the ability to generalise. If the http://svmlight.joachims.org

    Predicting Protein Function from Structure 781

  • accuracy on each left-out set is similar, then the subset offeatures is capable of describing the difference betweenclasses irrespective of the subset of data it is trained on.If the average prediction accuracy is high, but thestandard deviation of accuracies over all left-out sets isalso high, then it can be said that the model does notgeneralise so well. Low performance on some left-outsets is indicative of a subset of features that cannotdescribe the difference between classes for all data itcould be asked to predict. The selection criteria for theadaptive programming regime are therefore based onthe leave-20%-out estimated average performance, andon the standard deviation of these results. The maximumtolerated standard deviation is that generated by thewhole set of features. A subset capable of predictingwith greater accuracy than the rest of its generation, butwith a standard deviation that is lower than or equival-ent to the ancestor of all generations, can be said tohave specialised no more than the common ancestor asa consequence of subset selection and so is consideredmore fit. For example, if the common ancestor predictsat 75% accuracy with a standard deviation of 2.5%, asubset that performs with .75% accuracy and #2.5% ismore fit.

    After selection, a precise copy of the parent is alwaysmaintained in the next generation to avoid backwardssteps. This subset gives rise to the next generation. Aftera maximum of 150 generations, the algorithm ceases.Repetition of the algorithm suggests the existence ofdifferent subsets of approximately equivalent perform-ance, which is indicative of the adaptive programmefollowing different trajectories to multiple local maxima.The local maximum with best performance is the bestmodel.


    This work was funded by a BBSRC Engineeringand Biological Systems committee studentship. Wethank Ben Stapley for helpful discussions andKristoffer Rapacki of the Center for BiologicalSequence Analysis, Technical University ofDenmark for assistance with the ProtFun results.


    1. Thornton, J. M. (2001). Structural genomics takes off.Trends Biochem. Sci. 26, 8889.

    2. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &Lipman, D. (1990). Basic local alignment search tool.J. Mol. Biol. 215, 403410.

    3. Pearson, W. R. & Lipman, D. J. (1988). Improvedtools for biological sequence comparison. Proc. NatlAcad. Sci. USA, 85, 24442448.

    4. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang,J., Zhang, Z., Miller, W. & Lipman, D. J. (1997).Gapped BLAST and PSI-BLAST: a new generation ofprotein database search programs. Nucl. Acids Res.25, 33893402.

    5. Attwood, T. K. (2002). The PRINTS database: aresource for identification of protein families. Brief.Bioinform. 3, 252263.

    6. Henikoff, S., Henikoff, J. G. & Pietrokovski, S. (1999).Blocksredundant database of protein alignment

    blocks derived from multiple compilations.Bioinformatics, 15, 471479.

    7. Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist,C. J., Hofmann, K. & Bairoch, A. (2002). ThePROSITE database, its status in 2002. Nucl. AcidsRes. 30, 235238.

    8. Rice, D. & Eisenberg, D. (1997). A 3D1Dsubstitution matrix for protein fold recognition thatincludes predicted secondary structure of thesequence. J. Mol. Biol. 267, 10261038.

    9. Shindyalov, I. N. & Bourne, P. E. (1998). Proteinstructure alignment by incremental combinatorialextension (CE) of the optimal path. Protein Eng. 11,739747.

    10. Madej, T., Gibrat, J.-F. & Bryant, S. H. (1995). Thread-ing a database of protein cores. Proteins: Struct. Funct.Genet. 23, 356369.

    11. Russell, R. B., Sasieni, P. D. & Sternberg, M. J. E.(1998). Supersites within superfolds. Binding sitesimilarity in the absence of homology. J. Mol. Biol.282, 903918.

    12. Wallace, A. C., Laskowski, R. & Thornton, J. M.(1996). Derivation of 3D coordinate templates forsearching structural databases: application to theSer-His-Asp catalytic triads of the serine proteinasesand lipases. Protein Sci. 5, 10011013.

    13. Wallace, A. C., Borkakoti, N. & Thornton, J. M.(1997). TESS: a geometric hashing algorithm forderiving 3D coordinate templates for searchingstructural databases: application to enzyme activesites. Protein Sci. 6, 23082323.

    14. Rost, B. (2002). Enzyme function less conserved thananticipated. J. Mol. Biol. 318, 595608.

    15. Stapley, B. J., Kelley, L. A. & Sternberg, M. J. E.(2001). Predicting the subcellular location of proteinsfrom text using support vector machines. In Proceed-ings of the Pacific Symposium on Biocomputing 2002,Kauai, Hawaii (Altman, R. B., Dunker, A. K., Hunter,L., Lauderdale, K. & Klein, T. E., eds), WorldScientific, Singapore.

    16. Cai, Y., Liu, X. & Chou, K. (2002). Artificial neuralnetwork model for predicting protein subcellularlocation. Comput. Chem. 26, 179182.

    17. Chou, K. & Elrod, D. (1999). Protein subcellularlocation prediction. Protein Eng. 12, 107118.

    18. Marcotte, E. M., Pellegrini, M., Ng, H., Rice, D. W.,Yeates, T. O. & Eisenberg, D. (1999). Detectingprotein function and proteinprotein interactionsfrom genome sequences. Science, 285, 751753.

    19. Pellegrini, M., Marcotte, E. M., Thompson, M. J.,Eisenberg, D. & Yeates, T. O. (1999). Assigningprotein functions by comparative genome analysis:protein phylogenetic profiles. Proc. Natl Acad. Sci.USA, 96, 42854288.

    20. Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1999).Conservation of gene order: a fingerprint of proteinsthat physically interact. Trends Biochem. Sci. 23,324328.

    21. Overbeek, R., Fonstein, M., DSouza, M., Pusch, G. D.& Maltsev, N. (1999). The use of gene clusters to inferfunctional coupling. Proc. Natl Acad. Sci. USA, 96,28962901.

    22. Jensen, L. J., Gupta, R., Blom, N., Devos, D.,Tamames, J., Kesmir, C. et al. (2002). Prediction ofhuman protein function from post-translationalmodifications and localization features. J. Mol. Biol.319, 12571265.

    23. Stawiski, E. W., Baucom, A. E., Lohr, S. C. &Gregoret, L. M. (2000). Predicting protein function

    782 Predicting Protein Function from Structure

  • from structure: unique structural features of pro-teases. Proc. Natl Acad. Sci. USA, 97, 39543958.

    24. Stawiski, E. W., Mandel-Guttfreund, Y., Lowenthal,A. C. & Gregoret, L. M. (2001). Progress in predictingprotein function from structure: unique features ofO-glycosidases. In Proceedings of the Pacific Symposiumon Biocomputing 2002, Kauai, Hawaii (Altman, R. B.,Dunker, A. K., Hunter, L., Lauderdale, K. & Klein,T. E., eds), World Scientific, Singapore.

    25. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G.,Bhat, T. N., Weissig, H. et al. (2000). The ProteinData Bank. Nucl. Acids Res. 28, 235242.

    26. Burges, C. (1998). A tutorial on support vectormachines for pattern recognition. Data MiningKnowledge Discov. 2, 121167.

    27. Vapnik, V. (1999). The Nature of Statistical LearningTheory, Springer, New York.

    28. Frishman, D. & Argos, P. (1995). Knowledge-basedprotein secondary structure assignment. Proteins:Struct. Funct. Genet. 23, 566579.

    29. Connolly, M. L. (1983). Analytical molecular surfacecalculation. J. Appl. Crystallog. 16, 548558.

    30. Liang, J., Edelsbrunner, H. & Woodward, C. (1998).Anatomy of protein pockets and cavities: measure-ment of binding site geometry and implications forligand design. Protein Sci. 7, 18841897.

    31. Boutselakis, H., Copeland, J., Dimitropoulos, D.,Fillon, J., Golovin, A., Henrick, K. et al. (2003).E-MSD: the European Bioinformatics Institutemacromolecular structure database. Nucl. Acids. Res.31, 458462.

    32. Joachims, T. (1998). Text categorization with supportvector machines: learning with many relevantfeatures. In Proceedings of ECML-98, 10th European

    Conference on Machine Learning, Chemnitz, Germany(Nedellec, C. & Rouveirol, C., eds), Springer, Berlin.

    33. Bishop, C. M. (1995). Cross validation. In NeuralNetworks for Pattern Recognition, sect. 9.8.1. pp. 372375, Oxford University Press, Oxford, UK.

    34. John, G. H., Kohavi, R. & Pfleger, K. (1994). Irrelevantfeatures and the subset selection problem. In MachineLearning: Proceedings of the Eleventh InternationalConference (Cohen, W. W. & Hirsh, H., eds), RutgersUniversity, New Brunswick, NJ.

    35. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M.,Poggio, T. & Vapnik, V. (2000). Feature selection forsupport vector machines. NIPS, 13, 668674.

    36. Holland, J. (1992). Adaptation in Natural and ArtificialSystems, MIT Press, Cambridge.

    Edited by J. Thornton

    (Received 30 January 2003; received in revised form28 April 2003; accepted 9 May 2003)

    Supplementary Material comprising lists of PDBcodes for the proteins used here is available onScience Direct

    Predicting Protein Function from Structure 783

    Distinguishing Enzyme Structures from Non-enzymes Without AlignmentsIntroductionResultsDiscussionMethodsDataset constructionFeatures for model buildingThe support vector machinePerformance analysis: leave-out-analysisPerformance analysis: newly released PDB files

    Performance analysis: comparison to ProtFun&xref rid=Outline placeholderFeature subset selection by adaptive programming



View more >