Distinguishing Enzyme Structures from Non-enzymes Without Alignments

Download Distinguishing Enzyme Structures from Non-enzymes Without Alignments

Post on 03-Nov-2016

213 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>Distinguishing Enzyme Structures from Non-enzymesWithout Alignments</p><p>Paul D. Dobson and Andrew J. Doig*</p><p>Department of BiomolecularSciences, UMIST, P.O. Box 88Manchester M60 1QD, UK</p><p>The ability to predict protein function from structure is becoming increas-ingly important as the number of structures resolved is growing morerapidly than our capacity to study function. Current methods for predict-ing protein function are mostly reliant on identifying a similar protein ofknown function. For proteins that are highly dissimilar or are only similarto proteins also lacking functional annotations, these methods fail. Here,we show that protein function can be predicted as enzymatic or not with-out resorting to alignments. We describe 1178 high-resolution proteins in astructurally non-redundant subset of the Protein Data Bank using simplefeatures such as secondary-structure content, amino acid propensities,surface properties and ligands. The subset is split into two functionalgroupings, enzymes and non-enzymes. We use the support vectormachine-learning algorithm to develop models that are capable of assign-ing the protein class. Validation of the method shows that the function canbe predicted to an accuracy of 77% using 52 features to describe eachprotein. An adaptive search of possible subsets of features produces asimplified model based on 36 features that predicts at an accuracy of80%. We compare the method to sequence-based methods that also avoidcalculating alignments and predict a recently released set of unrelatedproteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, numberof disulphide bonds and size of the largest cleft. This method is applicableto any structure as it does not require the identification of sequence orstructural similarity to a protein of known function.</p><p>q 2003 Elsevier Science Ltd. All rights reserved</p><p>Keywords: protein function prediction; structure; enzyme; machinelearning; structural genomics*Corresponding author</p><p>Introduction</p><p>We aim to demonstrate that protein function canbe predicted as enzymatic or not without resortingto alignments. Protein function prediction methodsare important as international structural genomicsinitiatives are expected to generate thousands ofprotein structures in the next decade. The capacityof laboratories studying protein function is not suf-ficient to keep pace with the number of structuresbeing released, with the consequence that manynew structures lack functional annotations.Predictions can help guide the activities of theselaboratories towards functionally more important</p><p>proteins. The main approaches to in silico proteinfunction prediction from structure are neatly sum-marised by Sung-Ho Kim.1 They are assignmentof a function from a similar fold, sequence, struc-ture or active site of previously known function,assignment from a structure with a common ligandand ab initio function prediction (implying amethod that does not work by comparison withanother protein of known function).</p><p>The most common methods rely on identifyingsimilarity to a protein of known function andtransferring that function. Sequence alignmentsare identified using approaches such as BLAST2 orFASTA.3 The power of PSI-BLAST4 has permittedthe detection of sequence similarities that inferhomology down to below 20%. Even when thelikes of PSI-BLAST fail, the sequence can still yielduseful information in the form of sequence motifs,which can be identified using PRINTS,5 BLOCKS,6</p><p>0022-2836/03/$ - see front matter q 2003 Elsevier Science Ltd. All rights reserved</p><p>E-mail address of the corresponding author:andrew.doig@umist.ac.uk</p><p>Abbreviations used: EC, Enzyme Commission; CE,Combinatorial Extension.</p><p>doi:10.1016/S0022-2836(03)00628-4 J. Mol. Biol. (2003) 330, 771783</p></li><li><p>PROSITE7 and other similar tools. Using predictedsecondary structures to assign fold class canexpand the information content of a sequence stillfurther, since fold classes are often associated witha particular set of functions.8</p><p>The next logical step after using predicted struc-ture is to use real structure. As structure is morehighly conserved than sequence, it is often possibleto detect similarities that are beyond the reach ofeven the most sophisticated sequence alignmentalgorithms. Structural similarity is detected usingtools such as Combinatorial Extension9 andVAST,10 which map structures onto each other.Incomplete structural alignments can still suggestfold class. A problem encountered when identify-ing similar folds is that there may not be onespecific function associated with a fold, makingchoosing the correct one non-trivial. The TIMbarrel fold is known to be involved in at least 18different enzymatic processes1 and while this doesgive a narrowing of the number of possiblefunctions to assign, the precise function remainsunknown.</p><p>Transferring function from a protein that sharesa ligand is a method that can give variable resultsif not tempered with some biochemical knowledge.For example, possession of NADH suggests anoxidoreductase enzyme. Functionally unimportantligands may be shared by many structures, but tosay that these proteins share a common functionwould be far from accurate. Ligand data can beused in conjunction with data concerning theimmediate protein environment that binds theligand. Binding-site correspondence is a strongindicator of functional similarity,11 as is the casewith the correspondence of the near-identicalcatalytic triads in the active sites of trypsins andsubtilisins,12 two evolutionarily distant but func-tionally similar types of protein. The utility of thisapproach is demonstrated by the ProCat database.13</p><p>For sequences and structures that are highlysimilar, the reliability of the predicted function isgood, though in a recent study it has been shownto be less than previously thought.14 For pair-wisesequence alignments above 50%, less than 30%share exact EC numbers. This suggests the level ofsequence/structure conservation that impliesfunction conservation is much lower than believedformerly and demonstrates the pressing need forprotein function prediction methods that are notdependent upon tools that detect alignments.</p><p>Non-alignment-based function predictions havebeen made using many different techniques. Textdata mining of scientific literature15 uses the infor-mation in scientific abstracts to assign subcellularlocalisations, which can be used as an indicator offunction. Amino acid compositions have beenused to predict localisation.16,17 The Rosetta Stone18</p><p>method allows function predictions to be made forproteins that do not align to a protein of knownfunction by examining gene fusions. If the proteinaligns to part of a fused protein and the part ofthe fused protein it does not align to matches a</p><p>protein of known function, that function can betransferred to the original protein. Phylogeneticprofiling19 functionally relates proteins with similarprofiles. The gene neighbour method uses theobservation that if the genes that encode two pro-teins are close on a chromosome, the proteins tendto be functionally related.20,21 Neural networks havebeen used to combine predicted post-translationalmodifications into sophisticated systems capable ofpredicting subcellular location and function.22</p><p>While similarity-based methods do provide themost precise and dependable means of functionprediction, in many cases it is apparent that theyare heavily reliant on being able to identify highlysimilar proteins of known function. With one ofthe principal objectives of the structural genomicsinitiatives being the elucidation of structures fromthe more sparsely populated regions of fold space,the problem of not finding a similar protein ofknown function is more likely to occur. A methodsuggested by Stawiski et al.23 that lies between asimilarity-based approach and an ab initio method,is based on the observation that proteins of similarfunction often use basic structural features in asimilar manner. For example, they note that pro-teases often have smaller than average surfaceareas and higher Ca densities. Similarly,O-glycosidases24 deviate from the norm in termsof features such as the surface roughness (or fractaldimension). Features identified as being indicativeof a certain function permit the construction ofmachine-learning-based classification schemes thatallow function predictions for novel proteins with-out resorting to conventional similarity-basedmethods. The broad structural similarities thatcharacterise a functional class of proteins extendbeyond the reach of structural alignments, yet ithas been shown that they can be used for proteinfunction prediction. Here, we demonstrate amethod of identifying protein function asenzymatic or not without resorting to alignmentsto proteins of known function. To do this, wedescribe each protein in a non-redundant subsetof the Protein Data Bank25 in terms of simplefeatures such as residue preference, residue surfacefractions, secondary structure fractions, disulphidebonds, size of the largest surface pocket andpresence of ligands. As we are demonstrating amethod for use when alignment methods do notyield results, we restrict ourselves to features thatdo not rely on alignments. As such, our method isfor use when alignment methods fail. Histogramsillustrate that for some features the distributionsof enzymes and non-enzymes are different. Inorder to utilise these differences we combine thedata into a predictive model using the supportvector machine technique. Adaptive programmingis used to find a more optimal subset of features,giving a greater predictive accuracy whilst simul-taneously simplifying the model. We validatethese models by leave-out analyses and predictinga set of unrelated proteins submitted to the ProteinData Bank since the training set was compiled.</p><p>772 Predicting Protein Function from Structure</p></li><li><p>Using the same approach, we investigate the utilityof models built only using amino acid propensities.Being easily calculable from sequence, this pro-vides a method for predicting the function of pro-teins that cannot be aligned to a protein of knownfunction, even if we do not have a structure. Wealso make a comparison to the ProtFun enzyme/non-enzyme methods described by Brunak et al.22</p><p>Results</p><p>The support vector machine works by deducing</p><p>the globally optimal position of a hyperplane sep-arating the distribution of two classes of pointsscattered in a multi-dimensional space. The num-ber of features used to describe the position ofpoints determines the dimensionality of thathyperspace. The 52 features used to describe eachprotein are shown in Table 1. All features are easilycalculable from any protein structure. No feature isbased on mapping sequence or structure onto aknown protein, so the model can be said to betruly independent of alignment-based techniques.</p><p>Heterogen data is presented to the machine inbinary form (1 for present, 0 for absent; Table 2).</p><p>Table 1. All features and a more optimal subset</p><p>Features used to describe the difference between structures of enzymes and non-enzymes. Bold and italicised features are thoseselected by the adaptive search of possible subsets as part of the most optimal subset identified.</p><p>Predicting Protein Function from Structure 773</p></li><li><p>Some metal ions are present only once in the data-set, in which case the feature was not included inthe set of features on which the machine wastrained.</p><p>Results from the leave-out analyses of predictionaccuracy are shown in Table 3. These data arebased on a model generated using a radial basiskernel function. Linear and polynomial kernelsperform less well (not shown).</p><p>When using all 52 features, the results indicate apredictive accuracy for the set of features around7677%. Random performance based purely onthe size of classes could give, at best, an accuracyof 58.7% by predicting enzyme every time. Thesubset selection strategy selects 36 features fromthe total 52. The increase in accuracy is approxi-mately 34% to 80%. Bold and italicised featuresin Table 1 show the subset that gives the results inTable 3. On a set of 52 unrelated (by CombinatorialExtension) proteins submitted to the Protein DataBank since the training set was compiled, this 36feature model predicts with an accuracy of 79%.Three unrelated proteins with no functional assign-ment in the Protein Data Bank were predicted asfollows (confidence in parentheses): 1JAL non-enzyme (1.081), 1NY1 non-enzyme (20.166),1ON0 enzyme (0.548).</p><p>Table 4 shows that certain features are selectedmore than others. Possibly as a result of its lowabundance at protein surfaces, the surface fractionof tryptophan is never used. Ligand data are usedonly rarely (calcium is never used). This isprobably related to the sparseness of heterogens inthe dataset. Other features, particularly secondary-structure contents (Figure 2) and the volume ofthe largest pocket (Figure 3) are selected frequently.</p><p>The confidence of a support vector machineprediction is related to the distance from thehyperplane. Larger distances equate to higherconfidence. Figure 1 illustrates how the 36 featuremodels in a leave-one-out analysis were predicted.On enzymes, the method is 89.7% accurate. Fornon-enzymes, the accuracy is 68.6%.</p><p>Models trained only on amino acid propensitiesgive an indication of how accurate function predic-tion can be from sequence when a similar proteinof known function cannot be detected (Table 5).</p><p>Subsets contain only the features that combine tomore optimally describe the difference betweenclasses using the support vector machine. Tovisualise this, histograms can be used to depictthe underlying probability distribution of classesfor a particular feature. Figures 25 show typicalhistograms.</p><p>Correlations between class distributions for cer-tain features, such as largest pocket volume, helixand sheet contents, are low. Most other featureshave distributions that tend to differ only slightlyfor the two classes. However, the support vectormachine combines individual features into amulti-dimensional joint probability distribution, soit is the manner in which feature distributionsinteract that is important. It is for this reason thatwrapper methods were preferred to filters for sub-set selection (see Methods). A pre-filter methodbased on a correlation coefficient would haveresulted in a less optimal performance.</p><p>For a-helix fraction (Figure 2) it is evident thatthere are large differences between enzymes andnon-enzymes, and the coiled-coil state that isobserved in some DNA-binding proteins isdepicted at the upper end of the scale by a slightincrease in the non-enzyme density. Most otherhistograms approximately fit to standard statisticaldistributions such as normal, Poisson, etc. as typi-fied by Figures 4 and 5. Figure 3 shows the volumeof the largest surface pocket. Above 1000 A3,enzymes are more prevalent, so a protein with alarge surface pocket is more likely to be enzymatic.Conversely, and this may be more relevant,pr...</p></li></ul>