
Discrimination of Outer Membrane Proteins Using Machine Learning Algorithms

M. Michael Gromiha* and Makiko Suwa

Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan

ABSTRACT Discriminating outer membrane proteins (OMPs) from other folding types of globular and membrane proteins is an important task both for identifying OMPs from genomic sequences and for the successful prediction of their secondary and tertiary structures. In this work, we have analyzed the performance of different methods, based on Bayes rules, logistic functions, neural networks, support vector machines, decision trees, etc., for discriminating OMPs. We found that most of the machine learning techniques discriminate OMPs with similar accuracy. The neural network-based method could discriminate the OMPs from other proteins [globular/transmembrane helical (TMH)] at the fivefold cross-validation accuracy of 91.0% in a dataset of 1,088 proteins. The accuracy of discriminating OMPs from globular proteins is 88.8% and that from TMH proteins is 93.7%. Further, the neural network method is tested with globular proteins belonging to 30 different folding types and it could successfully exclude 95% of the considered proteins. The proteins with SAM domain-like, knottin, rubredoxin, and thioredoxin folds are eliminated with 100% accuracy. These accuracy levels are comparable to or better than other methods in the literature. We suggest that this method could be effectively used to discriminate OMPs and for detecting OMPs in genomic sequences. Proteins 2006;63:1031–1037. © 2006 Wiley-Liss, Inc.

Key words: transmembrane β-barrel; outer membrane protein (OMP); discrimination; protein folds; neural network

INTRODUCTION

Dissecting membrane proteins in genomic sequences is one of the most important problems in computational biology. The successful discrimination of membrane proteins from globular ones would help to identify them in genomes. It has been reported that the transmembrane helical (TMH) proteins could be discriminated with an accuracy of more than 90% [1]. However, the success rate of discriminating outer membrane proteins (OMPs) is rather moderate. This might be attributable to the presence of many charged and polar residues in the membrane.

Recently, several methods have been proposed for identifying β-barrel membrane proteins and transmembrane β-barrels in proteomes [2–6]. Wimley [2] analyzed the architecture of 15 OMPs and proposed a method based on hydrophobicity for identifying β-barrel membrane proteins in genomic sequences. Martelli et al. [3] used 12 OMPs and developed a Hidden Markov Model (HMM) method for picking up the β-barrel membrane proteins. Liu et al. [4] analyzed the amino acid composition in the membrane spanning regions of 12 β-barrel membrane proteins and applied the information for discrimination. Bagos et al. [7] developed an algorithm based on HMM for discriminating OMPs. Natt et al. [8] used a set of 16 OMPs and proposed a machine learning technique for discrimination. We have proposed different methods for discriminating OMPs that are based on amino acid composition, residue pair preference, motifs, etc. [9–11]. The accuracy of these methods lies in the range of 80–90% using amino acid sequence information alone.

In this work, we have analyzed the performance of different algorithms, such as Bayes rules, neural networks, Support Vector Machines (SVM), and decision trees. We found that the fivefold cross-validation accuracy is similar for most of the machine learning algorithms, in the range of 88–91%, and the accuracy of discriminating OMPs using neural networks is marginally better than that of the other methods. The neural network method could discriminate the OMPs at an accuracy of 91% in a dataset of 1,088 proteins. Further, the influence of structural classes, folding types, and amino acid residues will be discussed.

MATERIALS AND METHODS

Datasets

In our earlier work, we have used a set of 377 OMPs, 674 globular proteins belonging to four different structural classes, and 268 TMH proteins for discrimination [9]. We have removed the redundant sequences using CD-HIT algorithm [12] as implemented by Holm and Sander [13] and blastclust program [14]. The final dataset contains 208 OMPs, 155 all-α, 156 all-β, 184 α + β, 179 α/β, and 206 TMH proteins. These datasets have the proteins with less than 40% sequence identity. Further, we have tested our method with 1,612 proteins belonging to 30 different folding types of globular proteins and a dataset containing 114 OMPs, 187 TMH proteins, and 195 globular proteins obtained with less than 20% sequence identity.

*Correspondence to: M. Michael Gromiha, Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan. E-mail: [email protected]

Received 21 August 2005; Revised 26 October 2005; Accepted 18 November 2005

Published online 21 February 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20929

PROTEINS: Structure, Function, and Bioinformatics 63:1031–1037 (2006)

© 2006 WILEY-LISS, INC.

Computation of Amino Acid Composition

The amino acid composition for each protein has been computed using the number of amino acids of each type and the total number of residues. It is defined as:

$$\mathrm{Comp}(i) = \frac{\sum n_i}{N} \qquad (1)$$

where i stands for the 20 amino acid residues, n_i is the number of residues of each type, and N is the total number of residues. The summation is through all the residues in the particular protein.
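Equation (1) can be illustrated with a minimal Python sketch. The function name `composition`, the constant `AMINO_ACIDS`, and the example sequence are hypothetical and chosen only for illustration; the sketch also assumes that N counts every character of the input sequence, which the text does not state explicitly.

```python
from collections import Counter

# The 20 standard amino acid residues in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return Comp(i) = n_i / N for each of the 20 standard residues.

    Assumption: N is the full sequence length, including any
    non-standard characters that may be present.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)          # n_i for every residue type present
    total = len(seq)               # N
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}

if __name__ == "__main__":
    # Hypothetical sequence fragment, used only to show the call.
    comp = composition("MKKTAIAIAVALAGFATVAQA")
    for aa in AMINO_ACIDS:
        print(aa, round(comp[aa], 3))
```

The resulting 20-element vector is the kind of attribute set that the classifiers described later take as input.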

Fivefold Cross-Validation Method

We have performed a fivefold cross-validation test for assessing the validity of the present work. In this method, the dataset is divided into five groups; four of them are used for training and the remainder is used for testing the method. The same procedure is repeated five times and the average is computed for obtaining the accuracy of the method.
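The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the five groups were formed, so the random, unstratified split below is an assumption, and `five_fold_indices`, `cross_validate`, and the `train_and_score` callback are hypothetical names.

```python
import random

def five_fold_indices(n_items, seed=0):
    """Shuffle the item indices and deal them into five groups."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)     # assumption: random, unstratified split
    return [idx[k::5] for k in range(5)]

def cross_validate(items, labels, train_and_score, seed=0):
    """Train on four groups, test on the fifth, repeat five times, average."""
    folds = five_fold_indices(len(items), seed)
    scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(train_and_score(
            [items[i] for i in train_idx], [labels[i] for i in train_idx],
            [items[i] for i in test_idx], [labels[i] for i in test_idx],
        ))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    folds = five_fold_indices(1088)      # dataset size used in this work
    print([len(f) for f in folds])
```

Here `train_and_score` stands for any classifier wrapper that fits on the training part and returns a score on the test part.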

Calculation of Sensitivity, Specificity, and Accuracy

We have used different measures to assess the accuracy of discriminating OMPs, non-OMPs, and the combination of the two. The term sensitivity shows the correct prediction of OMPs, specificity that of the non-OMPs, and accuracy indicates the overall assessment. These terms are defined as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, FP, TN, and FN refer to the number of true positives (OMPs identified as OMPs), false positives (non-OMPs identified as OMPs), true negatives (non-OMPs identified as non-OMPs), and false negatives (OMPs identified as non-OMPs), respectively.
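As a brief illustration, the three measures can be computed from the four counts as follows. The function name `discrimination_metrics` and the counts in the example are hypothetical and are not results from this work.

```python
def discrimination_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) as fractions."""
    sensitivity = tp / (tp + fn)                 # OMPs correctly identified
    specificity = tn / (tn + fp)                 # non-OMPs correctly excluded
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall assessment
    return sensitivity, specificity, accuracy

if __name__ == "__main__":
    # Hypothetical counts: 100 OMPs (80 found) and 200 non-OMPs (180 excluded).
    sens, spec, acc = discrimination_metrics(tp=80, fp=20, tn=180, fn=20)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Because non-OMPs outnumber OMPs in the dataset, the overall accuracy is weighted more heavily by the specificity than by the sensitivity.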

Machine Learning Techniques

We have analyzed several machine learning techniques implemented in the WEKA program [15] for discriminating OMPs. This program includes several methods based on Bayes functions, neural networks, logistic functions, support vector machines, regression analysis, nearest neighbor methods, meta learning, decision trees and rules. The Bayesian network uses various search algorithms and quality measures to find a minimum set of direct dependencies that together explain the observed correlations in the data. The best Bayesian network is the one that models the observed data using a measure of scoring metric, a trade-off between complexity and accuracy [16]. Naive Bayes implements the simple probabilistic classifier (mapping from a feature space to a discrete set of labels) and it uses the normal distribution to model numeric attributes [17]. Logistic function is a classifier, which uses a multinomial logistic regression model with a ridge estimator [18]. Neural network is a network of nonlinear processing units that have adjustable connection strengths and hidden layers, and the discrimination is mainly based on feed-forward networks using the back propagation learning rule. The goal of the method is to find a good input-output mapping that can then be used to predict the test set [19,20]. RBF network implements a normalized Gaussian radial basis function network for classification. It uses the k-means clustering algorithm to provide the basis functions and learns either logistic regression or linear regression. The symmetric multivariate Gaussians are fit to the data for each cluster. The k-nearest neighbor classifier is a simple instance-based learning algorithm, which uses the distance metric criterion for discrimination using its k-nearest neighbors. It is a local approximation, focusing on the neighborhood of the query instance, and it takes the weighted average of the nearest neighbors, which smooths out isolated training examples [6,21]. Bagging meta learning is a meta algorithm to improve the classification accuracy. It reduces variance and helps to avoid over-fitting [22]. Classification via regression classifies the data using regression methods. In this method, each class is binarized and one regression model is built for each class [23]. Decision tree J4.8 is a classifier for generating a pruned or unpruned decision tree and it is a mapping of observations to classification. Decision trees are built with an inner node representing the variable, an arc to the child representing a possible value of the variable, and a leaf for the predicted value of the target variable using the values of the variables represented by the path from the root [24]. NB tree is a classifier for generating a decision tree with naive Bayes classifiers at the leaves [25]. The decision trees are built partially in each iteration for the partial decision tree classifier and it makes the best leaf into a rule for classification [26]. We have analyzed different classifiers and datasets to discriminate OMPs from all other folding types of globular and membrane proteins.
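To make one of these techniques concrete, the following is a minimal Python sketch of a k-nearest neighbor classifier acting on composition vectors. It is not the WEKA implementation used in this work: the Euclidean distance, the unweighted majority vote, the value k = 3, and the two-component toy vectors are all assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical two-residue compositions; real inputs would have 20 values.
    train = [(0.10, 0.02), (0.12, 0.03), (0.11, 0.01),
             (0.03, 0.09), (0.02, 0.10), (0.04, 0.08)]
    labels = ["OMP", "OMP", "OMP", "non-OMP", "non-OMP", "non-OMP"]
    print(knn_predict(train, labels, (0.09, 0.02)))
```

An odd k avoids tied votes in a two-class problem such as OMP versus non-OMP.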

RESULTS AND DISCUSSION

Statistical Analysis of Amino Acid Compositions in Different Folding Types of Globular and Membrane Proteins

The statistical analysis on the amino acid compositions of globular proteins and OMPs showed that the residues Glu, His, Ile, Cys, Gln, Asn, and Ser differ in composition between the two groups. Whereas the composition of Glu, His, Ile, and Cys is higher in globular proteins than in OMPs, an opposite trend is observed for Ser, Asn, and Gln. The formation of disulfide bonds between Cys residues requires an oxidative environment and such disulfide bridges are not usually found in intracellular proteins [27]. The analysis on the three-dimensional structures of 15 β-barrel OMPs shows the presence of just eight (0.1%) Cys residues and none of them is in the membrane part [28]. Hence, the occurrence of Cys is significantly higher in globular proteins than in OMPs. Glu is a strong helix former [29] and this tendency influences the higher occurrence of it in globular proteins than in OMPs. The comparative analysis on the occurrence of Ile in the β-strand segments of globular proteins and OMPs revealed that the preference of Ile in OMPs is less than that in globular proteins [28], which may increase the occurrence of it in globular proteins.

However, the composition of the residues Ser, Asn, and Gln is significantly higher in OMPs than in globular proteins. The structural analysis of several OMPs shows that these residues have an important role in the stability and function of OMPs [9]. In OmpA, the interior of β-strands contains an extended hydrogen bonding network of charged and polar residues and, especially, the side-chains of the residues Ser22, Gln228, and Asn258 in OmpT, located above the membrane, form hydrogen bonds to main chain atoms in the β-barrel. Interestingly, none of the residues that have high composition in globular proteins (Glu, His, Ile, and Cys) is involved in such a pattern [30,31]. The binding of cyanocobalamin (CN-Cbl) with BtuB is important for its function. The binding region of this protein, the hatch domain, is dominated by the residues Ser, Gln, and Asn, which form van der Waals and hydrogen bonding interactions to stabilize the hatch apices. Especially, the residues Asn185 and Asn276 are important for the stability of the upper surface of the CN-Cbl binding pocket [32,33]. Further, the structure and function of a peptide–protein complex of Omp32 is mainly achieved by the interaction of eight residues in the peptide, which are dominated by Asn, Gln, and Ser [34]. We infer from these observations that the high occurrence of Ser, Asn, and Gln in OMPs is required in the formation of β-barrel structures in the membrane, stability of binding pockets, and the function of OMPs.

The amino acid composition of 20 residues in OMPs and α-helical membrane proteins shows that the residues Ala, Ile, Leu, Val, Phe, Trp, and Met have higher composition in α-helical membrane proteins than in OMPs. The higher occurrence of hydrophobic residues in TMH proteins than in OMPs indicates that the membrane spanning regions of TMH proteins are accommodated mainly with the long stretches of hydrophobic amino acid residues [35].

Discrimination of OMPs From Other Folding Types of Globular and Membrane Proteins

We have analyzed the performance of different methods for discriminating OMPs from other folding types of globular (all-α, all-β, α + β, and α/β) and membrane (TMH) proteins. In this discrimination, we have used the amino acid composition as the main attributes. It has been shown that amino acid composition could discriminate the OMPs with reliable accuracy [6,9]. The results obtained for a set of machine learning techniques are presented in Table I. We observed that most of the machine learning methods discriminated the OMPs with accuracy in the range of 88–91% in a set of 1,088 proteins. The method based on k-nearest neighbor discriminated the OMPs with a fivefold cross-validation accuracy of 85%, which is similar to that reported by Garrow et al. [6] using different datasets. We noticed that the performance of the neural network-based method is marginally better than other methods and the discrimination accuracy is 91%, which might be attributable to the selection of proper adjustable parameters, such as connection weights and hidden layers. It has been reported that the method based on neural networks performed well in discriminating DNA binding proteins, predicting protein secondary structures, etc. [36,37]. With the neural network, the accuracy of excluding globular and TMH proteins is 94% whereas that of identifying OMPs is 79%. This result indicates that several OMPs are wrongly identified as globular/TMH proteins. However, the k-nearest neighbor method has a similar power of correctly identifying OMPs and excluding other proteins. The algorithm based on naive Bayes shows an accuracy of 88%, in which the OMPs are correctly identified up to an accuracy of 83% and other proteins are excluded at an accuracy of 89%. This analysis showed that there is no significant difference in performance between different machine learning methods. Further, the use of different adjustable parameters in these methods would make it possible for any method to perform better than the others.

TABLE I. Discrimination of OMPs and Non-OMPs Using Different Machine Learning Approaches

Five-fold cross-validation (%)

| Method | Sensitivity | Specificity | Accuracy |
|---|---|---|---|
| Bayesnet | 76.0 (81.6) | 93.5 (89.5) | 90.1 (87.7) |
| Naive Bayes | 83.2 (80.7) | 89.2 (89.3) | 88.1 (87.3) |
| Logistic function | 66.8 (66.7) | 94.3 (92.9) | 89.1 (86.9) |
| Neural network | 79.3 (74.6) | 93.8 (92.7) | 91.0 (88.5) |
| RBF network | 79.3 (71.9) | 93.0 (92.7) | 90.3 (88.3) |
| k-Nearest neighbor | 85.1 (81.6) | 85.0 (88.0) | 85.0 (86.5) |
| Bagging meta learning | 63.9 (65.8) | 94.7 (95.0) | 88.8 (88.3) |
| Classification via Regression | 68.8 (69.3) | 93.4 (91.9) | 88.7 (86.7) |
| Decision tree J4.8 | 67.3 (67.5) | 94.5 (90.1) | 89.3 (84.9) |
| NBTree | 69.2 (68.4) | 94.1 (90.8) | 89.3 (85.7) |
| Partial decision tree | 67.8 (72.8) | 94.5 (90.3) | 89.4 (86.3) |

The sensitivity, specificity, and accuracy obtained with the dataset of proteins with less than 20% sequence identity are given in parentheses.
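The overall accuracies in Table I follow from the sensitivity, the specificity, and the class sizes (208 OMPs; 674 globular plus 206 TMH proteins as non-OMPs). The short Python check below recomputes the accuracy for three rows of the table; the helper name `implied_accuracy` is hypothetical, and small differences from the tabulated values are expected because the published percentages are rounded.

```python
N_OMP = 208
N_NON_OMP = 674 + 206   # globular + TMH proteins = 880 non-OMPs

def implied_accuracy(sensitivity_pct, specificity_pct):
    """Overall accuracy (%) implied by sensitivity, specificity, and class sizes."""
    tp = sensitivity_pct / 100 * N_OMP        # OMPs correctly identified
    tn = specificity_pct / 100 * N_NON_OMP    # non-OMPs correctly excluded
    return 100 * (tp + tn) / (N_OMP + N_NON_OMP)

# (method, sensitivity, specificity, reported accuracy), copied from Table I.
ROWS = [
    ("Neural network", 79.3, 93.8, 91.0),
    ("k-Nearest neighbor", 85.1, 85.0, 85.0),
    ("Naive Bayes", 83.2, 89.2, 88.1),
]

if __name__ == "__main__":
    for method, sens, spec, reported in ROWS:
        print(f"{method}: implied {implied_accuracy(sens, spec):.2f}%, "
              f"reported {reported:.1f}%")
```

Since non-OMPs make up about four fifths of the dataset, a method with high specificity can reach roughly 90% accuracy even when a fifth or more of the OMPs are missed.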


Discrimination of OMPs From Globular/TMH Proteins

We have analyzed the predictive power of several methods for discriminating OMPs from a pool of 208 OMPs and 674 globular proteins. The results are presented in Table II. The accuracy is similar to the one that we obtained for discriminating OMPs from non-OMPs. However, the accuracy of discriminating OMPs from TMH proteins is significantly higher than that from non-OMPs (see Table II). This result indicates that the amino acid composition of TMH proteins differs considerably from that of OMPs, which leads to high accuracy in discrimination. This might be attributed to the fact that the TMH proteins have stretches of hydrophobic residues in their membrane spanning segments [35,38] and the OMPs have hydrophobic as well as polar residues inside the membrane [39,40].

Discrimination Results Using the Dataset of Proteins Obtained With Less Than 20% Sequence Identity

We have examined the influence of datasets for discriminating OMPs using a subset of sequences obtained with less than 20% sequence identity. The results obtained with different machine learning techniques using this dataset are included in Table I. We observed that most of the methods showed accuracy in the range of 86–89% for discriminating OMPs and non-OMPs. These accuracy levels are nearly 2% less than those obtained with the dataset of proteins with less than 40% sequence identity. Further, as we observed with the dataset of less than 40% sequence identity, the discrimination accuracy between OMPs and TMH proteins is better than that between OMPs and globular proteins using the dataset of proteins with less than 20% sequence identity (Table II).

Analysis of Different Structural Classes

It has been reported that the discrimination of all-α, α + β, and α/β class of proteins from OMPs is better than that between all-β and OMPs [6,9,10]. We have analyzed the influence of protein structural class in discriminating OMPs, and the results obtained with fivefold cross-validation for five typical methods are presented in Table III. In this calculation, proteins of specific class and OMPs have been used for discrimination. We observed that the logistic function and decision tree J4.8 show a similar trend with less predictive accuracy of all-β proteins than other classes. Bayes net method shows the highest accuracy of discriminating all-α proteins whereas k-nearest neighbor algorithm discriminated the α/β class proteins with highest accuracy. In neural networks, all the structural classes of globular proteins have a similar level of accuracy.

TABLE II. Discrimination of OMPs and Globular (or TMH) Proteins Using Different Machine Learning Approaches

Five-fold cross-validation (%)

| Method | Compared with OMPs | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| Bayesnet | Globular | 76.9 | 92.9 | 89.1 (81.9) |
| | TMH | 97.6 | 92.2 | 94.9 (92.0) |
| Naive Bayes | Globular | 84.1 | 88.3 | 87.3 (81.2) |
| | TMH | 96.6 | 87.4 | 92.0 (90.0) |
| Logistic function | Globular | 70.2 | 92.0 | 86.8 (84.1) |
| | TMH | 93.3 | 92.2 | 92.8 (92.7) |
| Neural network | Globular | 78.8 | 91.8 | 88.8 (85.1) |
| | TMH | 93.8 | 93.7 | 93.7 (92.4) |
| RBF network | Globular | 80.8 | 93.2 | 90.2 (84.8) |
| | TMH | 94.2 | 91.7 | 93.0 (92.0) |
| k-Nearest neighbor | Globular | 85.0 | 85.1 | 85.0 (78.6) |
| | TMH | 95.2 | 91.3 | 93.2 (89.4) |
| Bagging meta learning | Globular | 69.2 | 94.8 | 88.8 (85.4) |
| | TMH | 92.3 | 88.3 | 90.3 (90.7) |
| Classification via Regression | Globular | 73.6 | 93.9 | 89.1 (82.2) |
| | TMH | 94.2 | 92.2 | 93.2 (88.0) |
| Decision tree J4.8 | Globular | 70.0 | 90.5 | 85.8 (82.5) |
| | TMH | 88.9 | 85.4 | 87.2 (86.4) |
| NBTree | Globular | 68.8 | 92.1 | 86.2 (79.3) |
| | TMH | 91.3 | 89.3 | 90.3 (88.7) |
| Partial decision tree | Globular | 67.3 | 92.1 | 86.3 (79.9) |
| | TMH | 93.3 | 91.3 | 92.3 (88.0) |

The second row for each method (set in italics in the original) shows the discrimination result between OMPs and TMH proteins. The accuracy obtained with the dataset of proteins with less than 20% sequence identity is given in parentheses.

TABLE III. Discrimination of OMPs and Globular Proteins Belonging to Different Structural Classes

Five-fold cross-validation accuracy (%)

| Method | All-α | All-β | α + β | α/β |
|---|---|---|---|---|
| Bayesnet | 90.1 (86.8) | 87.4 (80.0) | 87.5 (84.4) | 86.0 (83.2) |
| Logistic function | 87.9 (86.2) | 84.6 (82.9) | 87.5 (84.4) | 86.6 (82.6) |
| Neural network | 86.0 (85.5) | 86.5 (81.2) | 86.2 (86.2) | 86.8 (87.7) |
| k-Nearest neighbor | 83.2 (81.8) | 79.7 (84.1) | 79.3 (83.8) | 87.6 (87.7) |
| Decision tree J4.8 | 83.5 (76.1) | 79.7 (79.4) | 84.4 (81.4) | 84.5 (86.5) |

The accuracy obtained with the dataset of proteins with less than 20% sequence identity is given in parentheses.

Prediction Results for Different Folding Types of Globular Proteins

We have tested the method based on neural network in a set of 1,612 globular proteins belonging to 30 different folding types. These proteins have been selected from the SCOP database [41] with the criteria that there should be at least 25 proteins in each fold and the sequence identity is not more than 25%. The results are presented in Table IV. We observed that the proteins belonging to the folding types, SAM domain-like, S-adenosyl-L-methionine-dependent methyltransferases, knottins, rubredoxins, and thioredoxin folds have been successfully excluded with 100% accuracy. The proteins with concanavalin A-like lectins/glucanases fold belonging to all-β class of proteins have the lowest accuracy of 77%. Further, proteins from most of the folds showed an accuracy of more than 90% and the average accuracy for all the considered proteins is 95%. The same proteins were tested with statistical method and we observed an average accuracy of 80% as reported earlier [9]. When we tested the neural network method in a set of 62 nonredundant DNA binding proteins [42], we obtained an accuracy of 96% (sensitivity, 97.1% and specificity, 90.3%) for correctly discriminating OMPs and DNA binding proteins.

TABLE IV. Accuracy of Correctly Excluding Globular Proteins of Different Folds

| Fold | Discrimination accuracy (%), Neural | Discrimination accuracy (%), Statistical | Correlation | Deviation |
|---|---|---|---|---|
| Cytochrome C (a.3) | 96.0 | 88.0 | 0.82 | 1.23 |
| DNA/RNA binding 3-helical bundle (a.4) | 95.1 | 84.5 | 0.75 | 1.40 |
| Four helical up and down bundle (a.24) | 92.3 | 76.9 | 0.81 | 1.27 |
| EF hand-like fold (a.39) | 96.0 | 96.0 | 0.73 | 1.50 |
| SAM domain-like (a.60) | 100.0 | 88.5 | 0.70 | 1.64 |
| α-α superhelix (a.118) | 95.7 | 87.2 | 0.76 | 1.20 |
| Immunoglobulin-like β-sandwich (b.1) | 89.6 | 64.7 | 0.87 | 0.99 |
| Common fold of diphtheria toxin/transcription factors/cytochrome f (b.2) | 85.7 | 50.0 | 0.87 | 0.96 |
| Cupredoxin-like (b.6) | 93.3 | 73.3 | 0.84 | 1.05 |
| Galactose-binding domain-like (b.18) | 96.0 | 56.0 | 0.90 | 0.89 |
| Concanavalin A-like lectins/glucanases (b.29) | 76.9 | 38.5 | 0.94 | 0.67 |
| SH3-like barrel (b.34) | 97.6 | 85.7 | 0.80 | 1.27 |
| OB-fold (b.40) | 96.2 | 79.5 | 0.80 | 1.19 |
| Double-stranded β-helix (b.82) | 97.1 | 88.2 | 0.86 | 1.04 |
| Nucleoplasmin-like (b.121) | 90.5 | 38.1 | 0.88 | 0.99 |
| TIM β/α-barrel (c.1) | 94.5 | 81.4 | 0.92 | 0.77 |
| NAD(P)-binding Rossmann-fold domains (c.2) | 96.1 | 83.1 | 0.90 | 0.96 |
| FAD/NAD(P)-binding domain (c.3) | 93.5 | 90.3 | 0.89 | 0.92 |
| Flavodoxin-like (c.23) | 98.2 | 92.7 | 0.88 | 1.01 |
| Adenine nucleotide α hydrolase-like (c.26) | 97.1 | 94.1 | 0.80 | 1.22 |
| P-loop containing nucleoside triphosphate hydrolases (c.37) | 98.9 | 90.5 | 0.86 | 1.05 |
| Thioredoxin fold (c.47) | 100.0 | 96.9 | 0.77 | 1.39 |
| Ribonuclease H-like motif (c.55) | 95.9 | 89.8 | 0.87 | 0.94 |
| S-adenosyl-L-methionine-dependent methyltransferases (c.66) | 100.0 | 97.1 | 0.82 | 1.23 |
| α/β-Hydrolases (c.69) | 89.2 | 75.7 | 0.93 | 0.76 |
| β-Grasp, ubiquitin-like (d.15) | 97.6 | 83.3 | 0.77 | 1.31 |
| Cystatin-like (d.17) | 96.0 | 88.0 | 0.85 | 1.04 |
| Ferredoxin-like (d.58) | 93.2 | 89.8 | 0.83 | 1.10 |
| Knottins (g.3) | 100.0 | 67.5 | 0.04 | 2.18 |
| Rubredoxin-like (g.41) | 100.0 | 71.4 | 0.52 | 1.70 |

The highest accuracy is indicated in bold in the original table. SCOP classification is given in parentheses. Correlation: correlation between the amino acid compositions of proteins belonging to specific fold and OMPs. Deviation: mean absolute deviation between the amino acid compositions of proteins belonging to specific fold and OMPs.

We have examined the relationship between the amino acid compositions of globular proteins belonging to different folding types and OMPs using two parameters, correlation and deviation. The results are presented in Table IV. We noticed that the families that are predicted with less accuracy have a strong correlation between their amino acid compositions and the composition of OMPs. As an example, the proteins with concanavalin A-like lectins/glucanases fold have the lowest accuracy of 77% and show a correlation of 0.94 with OMPs. The high correlation between the amino acid compositions of proteins from concanavalin A-like lectins/glucanases fold and OMPs makes it difficult for the algorithms to discriminate the proteins belonging to this fold. Further, these proteins have the smallest deviation between their compositions and those of OMPs (0.67; Table IV). However, the proteins belonging to the families that are predicted with 100% accuracy show comparatively weaker correlation and higher deviation between their amino acid compositions and those of OMPs. These proteins can be discriminated more easily than proteins belonging to other folds. As an example, knottins, rubredoxin-like, and SAM domain-like proteins are the three families with the lowest correlations (0.04, 0.52, and 0.70, respectively) and largest deviations (2.18, 1.70, and 1.64, respectively). The proteins belonging to these folds are discriminated with 100% accuracy. This analysis showed that the families whose compositions are close to those of OMPs are difficult to discriminate, whereas those with dissimilar compositions are easier to discriminate from OMPs.
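The two parameters can be sketched in Python as follows. The text does not specify the type of correlation, so the Pearson coefficient used here is an assumption; the function names and the short example vectors are hypothetical, and real inputs would be the 20-component average compositions of a fold and of the OMPs.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two composition vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mean_absolute_deviation(x, y):
    """Mean absolute difference between corresponding composition values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

if __name__ == "__main__":
    # Hypothetical percentage compositions for four residues only.
    fold = [8.1, 1.2, 5.6, 6.3]
    omp = [8.9, 0.3, 5.9, 5.0]
    print(round(pearson_correlation(fold, omp), 2),
          round(mean_absolute_deviation(fold, omp), 2))
```

A fold whose composition gives a high correlation and a small deviation with respect to OMPs, such as the concanavalin A-like lectins/glucanases fold in Table IV, is expected to be harder to exclude.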

Influence of Specific Residues for Discrimination

In our earlier work, we have shown that the seven residues, Glu, His, Ile, Cys, Gln, Asn, and Ser, show significant difference between the compositions of globular and membrane proteins [9]. We have analyzed the influence of these seven amino acid residues for discrimination. We observed that the composition of these seven amino acid residues could discriminate the OMPs with an overall accuracy of 88.9%, which is marginally lower than that obtained with all 20 amino acid residues. However, the accuracy of identifying OMPs is 15% less and excluding non-OMPs is 1% higher than that with 20 residues.

We have further examined the influence of the amino acids Ser, Asn, Gln, Thr, Gly, Tyr, Ala, Arg, and Leu that have higher composition in OMPs than globular proteins. We obtained an accuracy of 89% for discriminating OMPs (70.7% for identifying OMPs and 93.3% for excluding non-OMPs). However, use of the other 13 residues correctly identified 64.4% of the OMPs and excluded 94.9% of non-OMPs.

We have also examined the influence of each amino acid residue by using the amino acid composition of 19 amino acid residues and excluding a specific residue. We observed that the exclusion of several residues marginally decreased the discrimination accuracy (1–2%).
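The residue subsets used in these experiments can be expressed as simple feature selections on a composition vector. The Python sketch below only builds the reduced attribute sets; the names `select_residues`, `leave_one_out_sets`, `SEVEN_RESIDUES`, and `OMP_ENRICHED` are hypothetical, and any classifier would still have to be trained on the selected values.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One-letter codes for the subsets named in the text.
SEVEN_RESIDUES = "EHICQNS"     # Glu, His, Ile, Cys, Gln, Asn, Ser
OMP_ENRICHED = "SNQTGYARL"     # Ser, Asn, Gln, Thr, Gly, Tyr, Ala, Arg, Leu

def select_residues(comp, residues):
    """Pick the composition values of the given residues, in order."""
    return [comp[aa] for aa in residues]

def leave_one_out_sets():
    """For each residue, list the 19 residues that remain when it is excluded."""
    return {aa: [r for r in AMINO_ACIDS if r != aa] for aa in AMINO_ACIDS}

if __name__ == "__main__":
    # Hypothetical uniform composition, used only to show the calls.
    comp = {aa: 0.05 for aa in AMINO_ACIDS}
    print(len(select_residues(comp, SEVEN_RESIDUES)))
    print(len(select_residues(comp, leave_one_out_sets()["C"])))
```

Running the same cross-validation on each reduced vector and comparing the accuracy with that of the full 20-residue vector gives the kind of comparison described above.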

Comparison With Other Methods

Liu et al. [4] proposed a method based on the amino acid composition of residues in transmembrane β-strand segments to discriminate β-barrel membrane proteins. They used just 12 proteins for developing the parameters and tested with 241 OMPs, and the accuracy was reported to be 84%. Martelli et al. [3] devised a method based on HMM using 12 OMPs and tested the method in 145 OMPs, which yielded an accuracy of 84%. Bagos et al. [7] used an HMM for discriminating β-barrel OMPs and reported an accuracy of 88% for a set of 133 OMPs. We have used a set of 208 OMPs and 880 non-OMPs and discriminated them with an accuracy of 91%. Further, the present method correctly excluded 95% of the globular proteins belonging to 30 different folding types. These accuracy levels are similar to or better than other methods in the literature.

Possible Improvements

In this work, we have analyzed the performance of different machine learning algorithms for discriminating OMPs and non-OMPs using amino acid composition and the available sequences of OMPs. The accuracy of discrimination may be improved in the following ways: (i) increasing the number of OMP sequences as they become available, (ii) including additional features, such as amino acid pair preference, which reflects the residue distribution along the sequence, and (iii) incorporating alignment profiles, which have improved the accuracy of predicting secondary structures in globular and membrane proteins.43
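As one simple reading of suggestion (ii), a residue pair feature could be computed as the composition of adjacent residue pairs. The sketch below is an assumption about how such a feature might look, not the pair preference measure defined in our earlier work; the function name `pair_composition` and the toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def pair_composition(sequence):
    """Percentage of each adjacent residue pair among all pairs in a sequence."""
    sequence = sequence.upper()
    n_pairs = len(sequence) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_pairs):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    return [100.0 * counts[p] / n_pairs for p in PAIRS]


# Hypothetical toy sequence, for illustration only
vector = pair_composition("MKSEHICQNSAA")
```

The resulting 400-dimensional vector could be used alone or appended to the 20-dimensional composition vector; with only 208 OMPs available, the larger feature space would need care to avoid overfitting.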

CONCLUSIONS

We have systematically analyzed the application of several machine learning algorithms for discriminating OMPs from other folding types of globular and membrane proteins. We observed that most of the methods discriminate the OMPs and non-OMPs with a similar level of accuracy, in the range of 88–91%. The neural network-based method could correctly distinguish the OMPs from other proteins with an accuracy of 91% in a fivefold cross-validation test. Further, it successfully excluded 95% of the globular proteins belonging to 30 different folding types. All the proteins belonging to the SAM domain-like, knottins, rubredoxin, and thioredoxin folds are excluded with 100% accuracy. These accuracy levels are comparable to or better than other methods in the literature. We suggest that this method could be effectively used to discriminate OMPs and to detect OMPs in genomic sequences.

ACKNOWLEDGMENTS

The authors thank the referees for constructive comments and Dr. Yutaka Akiyama for encouragement.

REFERENCES

1. Hirokawa T, Boon-Chieng S, Mitaku S. SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics 1998;14:378–379.

2. Wimley WC. Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures. Protein Sci 2002;11:301–312.

3. Martelli PL, Fariselli P, Krogh A, Casadio R. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 2002;18:S46–S53.

4. Liu Q, Zhu Y, Wang B, Li Y. Identification of beta-barrel membrane proteins based on amino acid composition properties and predicted secondary structure. Comput Biol Chem 2003;27:355–361.

5. Bigelow HR, Petrey DS, Liu J, Przybylski D, Rost B. Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res 2004;32:2566–2577.

6. Garrow AG, Agnew A, Westhead DR. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res 2005;33:W188–192.

7. Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ. A Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins. BMC Bioinformatics 2004;5:29.

8. Natt NK, Kaur H, Raghava GP. Prediction of transmembrane regions of beta-barrel proteins using ANN- and SVM-based methods. Proteins 2004;56:11–18.

9. Gromiha MM, Suwa M. A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2005;21:961–968.

10. Gromiha MM, Ahmad S, Suwa M. Application of residue distribution along the sequence for discriminating outer membrane proteins. Comput Biol Chem 2005;29:135–142.

11. Gromiha MM. Motifs in outer membrane protein sequences: applications for discrimination. Biophys Chem 2005;117:65–71.

12. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001;17:282–283.

13. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14:423–429.

14. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.

15. Witten IH, Frank E. Data mining: practical machine learning tools and techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.

16. Robles V, Larranaga P, Pena JM, et al. Bayesian network multi-classifiers for protein secondary structure prediction. Artif Intell Med 2004;31:117–136.

17. Sun H. A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing. J Med Chem 2005;48:4031–4039.

18. le Cessie S, van Houwelingen JC. Ridge estimators in logistic regression. Appl Stat 1992;41:191–201.

19. Qian N, Sejnowski TJ. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 1988;202:865–884.

20. Gromiha MM, Ahmad S, Suwa M. Neural network-based prediction of transmembrane beta-strand segments in outer membrane proteins. J Comput Chem 2004;25:762–767.

21. Aha D, Kibler D. Instance-based learning algorithms. Machine Learning 1991;6:37–66.

22. Breiman L. Bagging predictors. Machine Learning 1996;24:123–140.

23. Frank E, Wang Y, Inglis S, Holmes G, Witten IH. Using model trees for classification. Machine Learning 1998;32:63–76.

24. Quinlan R. C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.

25. Kohavi R. Scaling up the accuracy of naïve-Bayes classifiers: a decision tree hybrid. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining; 1996.

26. Frank E, Witten IH. Generating accurate rule sets without global optimization. In: Shavlik J, editor. Machine learning: proceedings of the 15th international conference. San Mateo, CA: Morgan Kaufmann Publishers; 1998.

27. Branden C, Tooze J. Introduction to protein structure. New York: Garland Publishing; 1999.

28. Gromiha MM, Suwa M. Variation of amino acid properties in all-beta globular and outer membrane protein structures. Int J Biol Macromol 2003;32:93–98.

29. Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol 1978;47:45–148.

30. Pautsch A, Schulz GE. High-resolution structure of the OmpA membrane domain. J Mol Biol 2000;298:273–282.

31. Vandeputte-Rutten L, Kramer RA, Kroon J, Dekker N, Egmond MR, Gros P. Crystal structure of the outer membrane protease OmpT from Escherichia coli suggests a novel catalytic site. EMBO J 2001;20:5033–5039.

32. Chimento DP, Mohanty AK, Kadner RJ, Wiener MC. Substrate-induced transmembrane signaling in the cobalamin transporter BtuB. Nat Struct Biol 2003;10:394–401.

33. Chimento DP, Kadner RJ, Wiener MC. The Escherichia coli outer membrane cobalamin transporter BtuB: structural analysis of calcium and substrate binding, and identification of orthologous transporters by sequence/structure conservation. J Mol Biol 2003;332:999–1014.

34. Zeth K, Diederichs K, Welte W, Engelhardt H. Crystal structure of Omp32, the anion-selective porin from Comamonas acidovorans, in complex with a periplasmic peptide at 2.1 Å resolution. Structure 2000;8:981–992.

35. Gromiha MM. A simple method for predicting transmembrane alpha helices with better accuracy. Protein Eng 1999;12:557–561.

36. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993;232:584–599.

37. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004;20:477–486.

38. White SH, Wimley WC. Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 1999;28:319–365.

39. Gromiha MM, Majumdar R, Ponnuswamy PK. Identification of membrane spanning beta strands in bacterial porins. Protein Eng 1997;10:497–500.

40. Schulz GE. The structure of bacterial outer membrane proteins. Biochim Biophys Acta 2002;1565:308–317.

41. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–540.

42. Gromiha MM, Siebers JG, Selvaraj S, Kono H, Sarai A. Intermolecular and intramolecular readout mechanisms in protein-DNA recognition. J Mol Biol 2004;337:285–294.

43. Przybylski D, Rost B. Alignments grow, secondary structure prediction improves. Proteins 2002;46:197–205.

PROTEINS: Structure, Function, and Bioinformatics DOI 10.1002/prot
