Feature selection and classification of protein-protein complexes
based on their binding affinities using machine learning approaches
Short title: Classification of protein-protein complexes
K Yugandhar and M. Michael Gromiha*
Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036,
Tamilnadu, India.
Key words: binding affinity, discrimination, feature selection, machine learning techniques, protein-
protein interactions.
*corresponding author
Tel: +91-2257-4138
Fax: +91-2257-4102
E-mail: [email protected]
Research Article
Proteins: Structure, Function and Bioinformatics
DOI 10.1002/prot.24564
This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process which may lead to differences between this version and the Version of Record. Please cite this article as an ‘Accepted Article’, doi: 10.1002/prot.24564
© 2014 Wiley Periodicals, Inc.
Received: Jan 13, 2014; Revised: Mar 14, 2014; Accepted: Mar 14, 2014
Abstract:
Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding
affinity of protein-protein complexes is one of the challenging problems in computational and
molecular biology. In this work, we related sequence features of protein-protein complexes with
their binding affinities using machine learning approaches. We set up a database of 185 protein-
protein complexes for which the interacting pairs are heterodimers and their experimental
binding affinities are available. We developed a set of 610 features from
the sequences of protein complexes and used the Ranker search method, which combines an
attribute evaluator with the Ranker method, to select specific features. We have analyzed
several machine learning algorithms to discriminate protein-protein complexes into high and low
affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy
of 76.1% with the combination of nine features using support vector machines. Further, we
observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our
method would serve as an effective tool for identifying the interacting partners in protein-protein
interaction networks and human-pathogen interactions based on the strength of interactions.
John Wiley & Sons, Inc.
PROTEINS: Structure, Function, and Bioinformatics
Introduction:
Many biological functions involve the formation of protein-protein complexes1,2 and it is
an important prerequisite for two proteins to interact with each other in cell signaling pathways,
regulation of metabolic pathways, immunologic recognition, DNA replication, progression
through the cell cycle, and protein synthesis.3 Protein-protein interactions are also essential to
make any significant biological change as a complex. Understanding the recognition mechanism
and binding specificity of protein-protein complexes are challenging problems in molecular and
computational biology.
Protein-protein complexes can be classified into various types such as dimeric-multimeric,
homodimer-heterodimer, obligate-non obligate and transient-permanent based on
different criteria such as the number and type of subunits involved, interaction time and
biological significance of the complexes.1,4 Several studies have been carried out in recent times
on different aspects of protein-protein interactions, which include the role of specific
interactions,5-7 understanding the recognition mechanism,6 identifying the binding sites from
protein structures8-13 and predicting the interaction sites from amino acid sequence.14-16 Further,
prediction methods have been developed for identifying the interacting partners using protein
structure17,18 and sequence information.19-21 These methods are mainly based on structural
similarity,18 physico-chemical properties,21 evolutionary information,18 etc.
Binding affinity is a parameter of protein-protein complexes that can be related
to almost every functional aspect of the proteins. Experimentally, interacting
protein-protein pairs can be identified with the yeast two-hybrid system, Förster/fluorescence
resonance energy transfer (FRET), surface plasmon resonance and isothermal titration
calorimetry.22 The data on interacting pairs of proteins have been deposited in databases such
as DIP,23 BioGRID24 and
STRING.25 Further, tools such as PIPE2 provide a platform for integration and annotation of
interaction data from various databases.26 Complex experimental setups and time-consuming
protocols underline the need for computational methods that could give reliable information
about interacting partners or binding affinity. In this direction, several computational methods
have been developed to predict interacting protein partners, which are quite successful despite
some remaining challenges.27,28 For binding affinity, a few structure based methods
have been proposed using empirical scoring functions,29-32 knowledge based methods33-36 and
quantitative structure activity relationship methods.37 These methods are mainly based on
structural information, and it is necessary to develop sequence based methods for annotating
protein-protein interaction networks and identifying interacting partners at large scale.
In this work, we have systematically analyzed the sequences of interacting proteins and
derived a set of 610 features. Using feature selection procedures, we have selected a set of nine
features (attributes) and developed a model for discriminating protein-protein complexes based
on their affinities. The selected features include predicted binding site residues and propensities
for α-helices and β-sheets, which are reported to be important for the binding affinity of
protein-protein complexes.57-60 We then systematically analyzed the contribution of the selected
features to discriminating protein-protein complexes based on their binding affinities. Our
method using support vector machines could discriminate 155 protein-protein complexes of high
and low affinities with a 10-fold cross-validation accuracy of 76.1%. Further, our method was
tested with a set of 30 complexes, which showed an accuracy of 83.3%. We suggest that our
method could be effectively used for identifying interacting partners with low and high affinities
in protein-protein interaction networks and host-pathogen interactions.
Materials and Methods
Dataset:
We have compiled a dataset of 185 protein-protein complexes for the present study with
the following conditions: (i) the experimental binding affinity (Kd value) is known,33,38-40 (ii)
both the binding partners of a complex have more than 50 amino acids each and (iii) the
complexes are heterodimers. The dataset includes protein-protein complexes with diverse
functions (antigen-antibody, enzyme-inhibitor, G-protein containing, receptor containing
etc.), various ranges of molecular weights and disordered regions.
These 185 complexes have been classified into two groups based on their binding
affinities. The complexes with Kd less than 10⁻⁸ M were considered as the high affinity class and
complexes with Kd greater than or equal to 10⁻⁸ M were considered as the low affinity
class. The Kd range for the high affinity class is the one generally considered for permanent
protein-protein complexes,41 which emphasizes the biological importance of our model. With
this criterion, we have obtained a balanced dataset in which 98 and 87 protein-protein
complexes have been assigned to the high and low affinity classes, respectively. The Protein
Data Bank (PDB) codes42 for these two sets of complexes are given in Table I and the
description of all the 185 complexes is given in supplementary Table S1.
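The class assignment described above reduces to a single threshold comparison on Kd. The following Python sketch illustrates it; the function name `affinity_class` and the example Kd values are illustrative assumptions and are not taken from the dataset.

```python
# Threshold from the text: Kd < 1e-8 M is "high" affinity, otherwise "low".
KD_THRESHOLD_M = 1e-8


def affinity_class(kd_molar: float) -> str:
    """Return 'high' or 'low' for a dissociation constant given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return "high" if kd_molar < KD_THRESHOLD_M else "low"


# Hypothetical Kd values in mol/L (not from the 185-complex dataset).
example_kd = {"complex_A": 2.5e-10, "complex_B": 1e-8, "complex_C": 4.0e-6}
labels = {name: affinity_class(kd) for name, kd in example_kd.items()}
print(labels)
```

Note that a complex with Kd exactly equal to 10⁻⁸ M falls in the low affinity class, in line with the "greater than or equal to" condition.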
Features
We have utilized a set of 610 sequence based features in this study. The features include
a diverse set of 544 features that account for various physico-chemical, conformational,
energetic and biochemical properties of amino acids obtained from AAindex database43 as well
as 49 properties from the literature.44 In addition, we have used 17 features from the
information on predicted binding site residues, predicted aromatic and charged residues at the
interface15 and predicted solvent accessibility.45 All those features have been computed for all
the considered 185 protein-protein complexes from their amino acid sequences. Further, we
have reduced the number of features as discussed below.
Machine learning methods:
WEKA data mining software46 was used for the machine learning tasks of feature
selection and classification. We have analyzed various machine learning techniques
implemented in the WEKA platform for discriminating protein-protein complexes based on their
binding affinity. WEKA includes several methods based on neural networks, regression
analysis, Bayes function, logistic functions, nearest neighbor methods, meta learning,
decision trees and rules. Based on the performance of all the techniques on different feature
sets using the Experimenter module in WEKA, we selected the SMO (Sequential Minimal
Optimization) algorithm,47 which is an SVM based method, for the classification of complexes
in our dataset. The SVM is a learning machine for two-group classification problems that
transforms the attribute space into a multidimensional feature space using a kernel function to
separate dataset instances by an optimal hyperplane.48
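The role of the kernel function can be illustrated with a small worked example. The sketch below is plain Python and is not WEKA's implementation; the function names and the choice of a homogeneous degree-2 polynomial kernel are assumptions made only for illustration. It shows that evaluating the kernel on two attribute vectors gives the same value as an ordinary dot product in an explicitly expanded feature space.

```python
from itertools import product


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_feature_map(x):
    """Explicit feature space for the kernel above: all ordered products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


# Two hypothetical attribute vectors.
x = [1.0, 2.0, -0.5]
y = [0.5, -1.0, 3.0]

implicit = poly2_kernel(x, y)                                 # computed in attribute space
explicit = dot(poly2_feature_map(x), poly2_feature_map(y))    # computed in feature space
print(implicit, explicit)
```

The kernel lets the classifier work in the larger feature space (nine dimensions here) while only ever computing dot products of the original three attributes.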
We have used feature selection methods available in WEKA46 and a program available at
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm.49 WEKA provides
various attribute and subset evaluator methods such as CFS,50 Chi-squared, Classifier,
SVM,51 Infogain, ReliefF,52 Gain ratio, Consistency53 and so on as well as various search
methods including Best-first, Genetic search54 and Ranker. A brief description of each of the
above mentioned methods available in WEKA is given below.
CFS: Considers the individual predictive ability of each feature along with the degree of
redundancy between them.
Chi-squared: Computes the value of the chi-squared statistic with respect to the class.
Classifier: Evaluates attribute subsets on training data or a separate hold out testing set. It
uses a classifier to estimate the ‘merit’ of a set of attributes.
SVM: Evaluates based on SVM-RFE i.e. “Recursive feature elimination”.
Infogain: Measures the information gain with respect to the class.
ReliefF: Evaluates the worth of an attribute by repeatedly sampling an instance and
considering the value of the given attribute for the nearest instance of the same and
different class.
Gain ratio: Measures the gain ratio with respect to the class.
Consistency: Evaluates the worth of a subset of attributes by the level of consistency in the
class values when the training instances are projected onto the subset of
attributes.
Best-first: Searches by greedy hill-climbing augmented with a back-tracking facility.
Genetic: Performs a search using the simple genetic algorithm.
Ranker: Ranks the attributes by their individual evaluations.
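To illustrate how a single-attribute evaluator and the Ranker search fit together, the following sketch scores each feature by its information gain with respect to the class and then orders the features by that score. It is a simplified stand-in written in plain Python rather than WEKA's code; the function names, the toy data and the assumption that feature values are already discretized are all illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [c for v, c in zip(feature_values, labels) if v == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Ranker-style ordering: score each feature on its own, best first."""
    scores = {name: information_gain(vals, labels) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two discretized features for six complexes.
classes = ["high", "high", "high", "low", "low", "low"]
toy_features = {
    "f_noise": ["a", "b", "a", "b", "a", "b"],
    "f_informative": ["a", "a", "a", "b", "b", "b"],  # separates the classes perfectly
}
ranking = rank_features(toy_features, classes)
print(ranking)
```

Because Ranker evaluates attributes individually, it does not account for redundancy between features; subset evaluators such as CFS address that separately.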
Feature set reduction and selection
The feature set was reduced in order to remove redundancy by employing the
"Dimensionality reduction by correlation" criterion. The refined feature set contains 216 amino
acid properties, chosen such that the absolute value of the correlation coefficient (r) between
any two properties is less than 0.85. We have added 17 more features from the information on
predicted interface residues and solvent accessibility,15,45 which resulted in a total of 233
features. Further, the best features, which contribute to the discrimination, were selected by
employing the Ranker search method (combination of SVM attribute evaluator and Ranker
methods) in WEKA software.
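The correlation-based reduction can be sketched as follows. This is a minimal illustration that assumes a greedy procedure in which features are visited in order and a feature is kept only if its absolute correlation with every feature already kept is below the cutoff; the text does not state which member of a correlated pair was retained, so that rule, the function names and the toy values are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant features as uncorrelated
    return cov / sqrt(var_x * var_y)


def filter_by_correlation(features, cutoff=0.85):
    """Greedily keep features whose |r| with every kept feature is below cutoff."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical property values over five complexes.
toy = {
    "prop_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "prop_2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly a multiple of prop_1
    "prop_3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = filter_by_correlation(toy)
print(selected)
```

In this toy case prop_2 is dropped because it is almost perfectly correlated with prop_1, while prop_3 is retained.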
Assessment of discrimination performance and validation procedures
We have used an n-fold cross-validation procedure for evaluating the performance of the
method. In this procedure, the data are divided into n parts; n-1 parts are used to develop a
model and the remaining part is used to test it, and this is repeated so that each part is used
once for testing.
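The partitioning behind n-fold cross-validation can be sketched in a few lines. The generator below is an illustrative assumption rather than the routine used in this work: it shuffles the sample indices with a fixed seed and does not stratify the folds by class, which a given toolkit may do.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# 155 complexes and 10 folds, matching the cross-validation setting reported in this work.
splits = list(k_fold_indices(155, 10))
print([len(test) for _, test in splits])
```

Each complex appears in exactly one test fold, so every prediction used in the reported accuracy comes from a model that did not see that complex during training.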
The prediction performance has been assessed with the following measures:
Accuracy = (TP+TN)/(TP+TN+FP+FN) (1)
Sensitivity (or) Recall = TP/(TP+FN) (2)
Specificity = TN/(TN+FP) (3)
Precision = TP/(TP+FP) (4)
F-measure = 2 × ((Precision × Recall)/(Precision + Recall)) (5)
In these equations, TP, TN, FP and FN represent true positives, true negatives, false
positives and false negatives, respectively. In addition, AUC (area under the ROC curve), which
summarizes the relationship between the true positive rate and the false positive rate, has
been estimated.
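Measures (1) to (5) can be computed directly from the four confusion-matrix counts. The sketch below is a plain Python illustration; the function name and the counts in the example are hypothetical and are chosen only to exercise the formulas.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute measures (1)-(5) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * (precision * sensitivity) / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts (not results from this work).
metrics = classification_metrics(tp=40, tn=35, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```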
Results and discussion
Selected features for discrimination
We have tried various combinations of evaluator and selection methods available in
WEKA software to delineate the best features for discriminating protein-protein complexes
with high and low affinities. We observed that the Ranker search method showed the best
performance using the SVM attribute evaluator, which analyzes all the considered features and
arranges them in the order of priority for discrimination. Using all the features gave
an average accuracy of about 70%, but this raises the problem of over-fitting because the number
of features is very high compared with the number of data used for the present study. Hence,
we removed the features one by one and evaluated the performance in terms of accuracy and
ROC. We noticed a marginal increase of prediction accuracy with the elimination of different
features. Finally, we have identified a set of 9 features, which showed the maximum accuracy
with 10-fold and 3-fold cross validation tests. The selected properties are weights for α-helix at
the window position of -6,55 weights for β-sheet at the window positions of -6, -3 and 5,55
principal property value z2 showing side chain bulkiness,56 number of predicted binding site
residues in receptors,18 number of predicted binding site aromatic and positively charged
residues in receptors and ligands18 and percentage of binding site aromatic and positively
charged residues in ligands.18
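The binding-site features listed above are simple counts and percentages once the predicted binding-site positions are known. The following sketch shows one way to compute them. It assumes that aromatic residues are Phe, Trp and Tyr and that positively charged residues are Lys and Arg (His is not counted); these residue sets, the function name, the 0-based position convention and the example sequence are assumptions, since they are not specified in this section.

```python
AROMATIC = set("FWY")   # assumption: Phe, Trp, Tyr
POSITIVE = set("KR")    # assumption: Lys, Arg (His not counted)


def binding_site_features(sequence, site_positions):
    """Count aromatic/positively charged residues among predicted binding-site positions.

    sequence: one-letter amino acid string
    site_positions: 0-based indices predicted to lie in the binding site
    """
    site_residues = [sequence[i] for i in site_positions]
    n_site = len(site_residues)
    n_selected = sum(1 for r in site_residues if r in AROMATIC or r in POSITIVE)
    percent = 100.0 * n_selected / n_site if n_site else 0.0
    return {"n_site": n_site, "n_arom_pos": n_selected, "pct_arom_pos": percent}


# Hypothetical ligand sequence and predicted binding-site positions.
ligand = "MKTAYIAKQRWLS"
predicted_sites = [1, 4, 8, 9, 10, 12]
features = binding_site_features(ligand, predicted_sites)
print(features)
```

In practice the positions would come from a sequence-based binding-site predictor; here they are supplied by hand.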
Among the final list of 9 features, more than 44% (4 features) are selected from predicted
binding site residues. Interestingly, the information on the aromatic and positively charged
residues at the binding sites is identified as one of the most important features for
discriminating the protein-protein complexes based on their affinities. This observation agrees
well with the previous results reported in the literature57,58 and emphasizes the importance of
binding site residues, and especially aromatic and positively charged residues in the binding
sites, in governing the binding affinity between two interacting proteins. In addition, secondary
structure based properties play an important role for discrimination along with the physical
property, side chain bulkiness. It has been shown that induction of α-helical structure in
the calmodulin-binding sequence of a target protein is an important step in the activation of
target enzymes, which in turn could be a determining factor for binding affinity.59,60 The
selected feature set for our model includes a measure of helix propensity. These
results emphasize the importance of α-helices in the formation of protein-protein complexes and
in governing the binding affinity.
Analysis of selected features for discrimination
We have classified the protein-protein complexes into two groups based on their affinities
and analyzed the distribution of all the nine selected features. A few specific examples are
discussed below:
Figure 1(A) shows the distribution of protein-protein complexes with high and low
affinities based on the number of predicted aromatic and positively charged residues at the
interface of ligands. We noticed that high affinity complexes have more positively charged
and aromatic residues at the interface than low affinity complexes. With the cutoff of 9
residues, the percentage of high and low affinity complexes is 23% and 6%, respectively.
Interestingly, this parameter is selected as one of the features for discriminating high and low
affinity complexes, and it is
also reported to be an important factor for understanding the binding specificity of protein-
protein complexes.57,58
Figure 1(B) shows the weights for β-sheet at the window position of -6, which are used in
predicting protein secondary structures.55 We noticed that a larger number of high affinity
complexes have weights of less than zero. We observed a similar trend for the weights for
α-helix. These results reveal the importance of secondary structures for the specificity of
protein-protein complexes in agreement with experimental reports.60 Other selected properties
also showed marked differences between low and high affinity complexes (data not shown).
Discrimination of protein-protein complexes based on their affinities
We have utilized different algorithms available in WEKA to discriminate the
protein-protein complexes based on their affinities, and the SMO method (which uses support
vector machines) showed the best performance based on sensitivity, specificity, accuracy and ROC.
Further, we have varied the adjustable parameters in SVM, and the model with a polynomial
kernel and a C value of 1.0 yielded the highest accuracy. The discrimination performance using
different datasets is presented in Table II. Our method could discriminate low and high affinity
protein-protein complexes with an average accuracy of 76.1% using 10-fold cross-validation on
a set of 155 complexes. The sensitivity and specificity are 75.6% and 76.7%, respectively. We
have applied the same model to a test set of 30 complexes and the discrimination accuracy is
83.3%. Further, we have tested the problem of over-fitting by evaluating the model with
self-consistency and the results are very similar to those of the cross validation experiments.
This observation suggests that our model is not over-fitted. It is noteworthy that the
model was developed with a limited set of 185 complexes and it can be refined as
more data on protein-protein binding affinity become available.
Influence of sequence redundancy for discriminating high and low affinity complexes
Protein-protein binding affinity is influenced by several experimental factors such as
protein concentration, pH and temperature. In addition, mutation of a single amino acid residue
could drastically change the affinity of the complexes.61,62 Hence, we have not applied a
redundancy criterion and used all the 185 complexes in the present work. The performance of our
model on a blind data set indicates that it is robust and not over-fitted.
For further evaluation, we have developed non-redundant datasets using the cutoff of
less than 25% sequence identity in (i) receptor, (ii) ligand and (iii) receptor or ligand.63 This
yielded sets of 92, 125 and 144 protein-protein complexes based on the non-redundancy in
receptor, ligand and either of them, respectively. Our method could discriminate the high and
low affinity complexes in these three datasets with accuracies of 64%, 77% and 75%,
respectively.
Analyzing performance of the model on a particular family of proteins apart from the test set
Apart from the test set of 30 complexes, we have examined the prediction power of our model on
an additional blind set of seven complexes, which belong to a common group called the “Tumor
necrosis factor (TNF) superfamily”.64 Among the seven complexes, three have high
affinity and four have low affinity. Our method correctly classified all three high affinity
complexes and three of the four low affinity complexes, which corresponds to an accuracy of
85.7% (6 of 7).
Performance of the method on disordered proteins
Dosztanyi et al.65 reported that hub proteins contain a large proportion of disordered
regions and tend to have long sequences. We have analyzed the influence of disordered proteins
(or regions) on the affinities of hetero-dimeric complexes. Among the 185 complexes used in the
present study, structures of free proteins are available for 138 complexes and 90 of them have at
least one protein in the disordered state. The analysis of these 90 complexes based on their
binding affinities showed that 60% of them (54 complexes) are of low affinity. Our model could
correctly classify 77% of all the disordered complexes using 10-fold cross-validation, which is
similar to the performance on the whole dataset of 185 complexes.
It has been reported that ordered and intrinsically unstructured complexes mainly differ in their
interface properties.66 Interestingly, our method selected four features (from the list of 233 features)
that are related to interface properties (derived from predicted binding site residues). Hence, these
properties might play a key role in differentiating ordered from disordered complexes as well as high
from low affinity complexes. Further, we have divided the set of 138 complexes into two groups
(disordered with 90 complexes and ordered with 48 complexes) and performed feature selection
separately with the aim of achieving the highest accuracy. We found that most of the features selected
for the two sets are similar, except for a few features such as “charge of the protein”, which is selected
only for ordered proteins. This supports the previous observation suggesting the importance of
electrostatic and cation-π interactions for the recognition of protein-protein complexes.7 In addition,
we noticed that 7 of the 11 features, including interface properties, selected for the disordered set are
also present in the final list derived for 155 complexes. This reiterates the importance of the reduced
set of features developed in this work.
Analysis of large-scale protein-protein interacting pairs based on high and low affinity
We have employed our method for analyzing protein-protein interaction data available in
major databases. We have collected a set of 4712 protein-protein interactions in yeast that are
deposited in the DIP database.23 Our analysis showed that 43% and 57% of the interacting pairs are
predicted to be of high and low affinity, respectively. The predicted high affinity complexes could
help in selecting targets for structure based drug design. Further analysis of protein-protein
interactions in various organisms, host-pathogen interactions and validations is in progress.
Rational comparison of different attribute selection methods
We have employed different combinations of attribute evaluators and search methods
available in WEKA for the feature selection process and the results are presented in Table III. From
this table, we observed that the combination of the SVM attribute evaluator and the Ranker search
method has the best performance using a minimum number of nine features. Other combinations
of attribute evaluators and search methods showed either a lower AUC or used more
features than the SVM attribute evaluator and Ranker search method. In addition, we
have used the random forest method available at
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm for selecting the
features,49 which showed an AUC of 0.74 (and an accuracy of 73.5%) using 24 features. Hence,
we have selected the combination of the SVM attribute evaluator and the Ranker search method
for selecting the features to discriminate high and low affinity protein-protein complexes.
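The AUC values used for this comparison can be computed from classifier scores without drawing the ROC curve, using the equivalent rank-based definition: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below is an illustration in plain Python; the function name, the choice of "high" as the positive class and the example scores are assumptions.

```python
def roc_auc(scores, labels, positive="high"):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, c in zip(scores, labels) if c == positive]
    neg = [s for s, c in zip(scores, labels) if c != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (higher means more likely high affinity).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = ["high", "high", "high", "low", "low", "low"]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.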
Comparison of different classifiers
We have compared the performance of different classifiers for discriminating high and
low affinity protein-protein complexes and the results for 7 typical methods are presented in
Table IV. Most of the methods discriminated the low and high affinity protein-protein
complexes with an accuracy in the range of 62% to 72%. The present method based on support
vector machines could discriminate them with an accuracy of 76.1%, which is a balance between
the sensitivity (75.6%) and specificity (76.7%).
Influence of specific features for discrimination
We have evaluated the importance of all the selected features for discrimination by
removing a specific feature from the list and analyzing the accuracy using 10-fold
cross-validation on a dataset of 155 complexes. The results are presented in Table V. We noticed
that the accuracy decreases by 1 to 6% on removing a single feature. Specifically, removing the
percentage of aromatic and positively charged residues in predicted binding sites of ligands
decreased the accuracy from 76.1% to 71.6%, showing its importance in discrimination.
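The leave-one-feature-out analysis described above follows a simple loop: evaluate the full feature set, then re-evaluate with each feature removed in turn and record the drop. The sketch below illustrates the loop only. The evaluator is a stand-in with made-up numbers (each hypothetical feature adds a fixed amount of accuracy); a real evaluator would retrain the classifier and run the cross-validation for every reduced feature set.

```python
def leave_one_out_importance(feature_names, evaluate):
    """Return the baseline score and the drop caused by removing each feature in turn.

    evaluate: callable taking a list of feature names and returning an accuracy
    (for example, a 10-fold cross-validation accuracy in percent).
    """
    baseline = evaluate(list(feature_names))
    drops = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        drops[name] = baseline - evaluate(reduced)
    return baseline, drops


# Stand-in evaluator with hypothetical contributions (not results from this work).
TOY_CONTRIBUTION = {"f1": 4.5, "f2": 1.0, "f3": 2.0}


def toy_evaluate(names):
    return 60.0 + sum(TOY_CONTRIBUTION[n] for n in names)


baseline, drops = leave_one_out_importance(["f1", "f2", "f3"], toy_evaluate)
most_important = max(drops, key=drops.get)
print(baseline, drops, most_important)
```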
Comparison with other methods
The present work is the first sequence based method for classifying protein-protein
complexes based on their binding affinity. This method is different from the structure based
methods proposed in the literature, which are mainly for predicting the absolute binding affinity
of protein-protein complexes. These methods have several limitations: (i) they are applicable
only to a training set of complexes,35 (ii) they use a large number of descriptors,37 (iii) they
show high correlation only for rigid complexes35 and (iv) they require structural information.28
On the other hand, the present method has several advantages: (i) the features are derived from amino
acid sequences, (ii) it uses a limited number of features, (iii) it classifies protein-protein
complexes into low and high affinity groups and (iv) it shows a good performance. Although
direct comparison of our method with other existing methods is not appropriate, the analysis
shows that the present method has several advantages over other methods reported in the literature.
Conclusion
The analysis of a large number of amino acid features that influence the binding
affinity of protein-protein complexes showed that the conformational properties, α-helical and
β-strand tendencies, bulkiness and the number of predicted aromatic and charged residues at the
protein-protein interface are important for discriminating protein-protein complexes of high and
low affinities. Interestingly, the dominance of aromatic and charged residues at the interface is
important for recognition due to the formation of electrostatic, aromatic-aromatic and cation-π
interactions, which are reported to play vital roles in the formation of protein-protein
complexes. In addition, the features related to protein secondary structures, which are
identified by our feature selection methods, are also shown to play an important role in
recognition. The combination of these features could successfully discriminate the high and low
affinity protein-protein complexes with an accuracy in the range of 76-85% using different sets
of data and validation procedures. Hence, the present method could be used to identify the
interacting partners in protein-protein interaction networks and human-pathogen interactions
based on their affinities. Further, the work on predicting the binding affinity from amino acid
sequence is in progress.
Acknowledgements
We thank the Associate Editor and reviewers for constructive comments. KY thanks the
University Grants Commission (UGC), Government of India for providing research fellowship.
We thank the Bioinformatics facility and Indian Institute of Technology Madras for
computational facilities. The work was partially supported by the Department of Science and
Technology, Government of India to MMG (SR/SO/BB-0036/2011).
Supportive/Supplementary Material
Table S1: Description for all the complexes used in the study.
References:
1. Jones S, Thornton JM. Principles of protein-protein interactions. Proc Natl Acad Sci USA
1996;93:13-20.
2. Nooren IM, Thornton JM. Diversity of protein-protein interactions. EMBO J 2003;22:3486-
3492.
3. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD. Molecular Biology of the Cell.
2nd edn. New York: Garland; 1989.
4. Keskin O, Gursoy A, Ma B, Nussinov R. Principles of protein-protein interactions: what are
the preferred ways for proteins to interact? Chem Rev 2008;108:1225-1244.
5. Bahadur RP, Chakrabarti P, Rodier F, Janin J. A dissection of specific and non-specific
protein-protein interfaces. J Mol Biol 2004;336:943-955.
6. Gromiha MM. Protein Bioinformatics: From Sequence to Function. Elsevier; 2010.
7. Gromiha MM, Yokota K, Fukui K. Energy based approach for understanding the recognition
mechanism in protein-protein complexes. Mol Biosyst 2009;5:1779-1786.
8. Jones S, Thornton JM. Prediction of protein-protein interaction sites using patch analysis. J
Mol Biol 1997;272:133-143.
9. Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify
the location of protein–protein binding sites. J Mol Biol 2004;338:181-199.
10. Fernandez-Recio J, Totrov M, Abagyan R. Identification of protein-protein interaction sites
from docking energy landscapes. J Mol Biol 2004;335:843-865.
11. Fernandez-Recio J, Totrov M, Skorodumov C, Abagyan R. Optimal docking area: a new
method for predicting protein–protein interaction sites. Proteins 2005;58:134-143.
12. La D, Kihara D. A novel method for protein-protein interaction site prediction using
phylogenetic substitution models. Proteins 2012;80:126-141.
13. La D, Kong M, Hoffman W, Choi YI, Kihara D. Predicting permanent and transient protein-
protein interfaces. Proteins 2013;81(5):805-818.
14. Ofran Y, Rost B. Predict protein-protein interaction sites from local sequence
information. FEBS Lett 2003;544:236-239.
15. Ofran Y, Rost B. ISIS: interaction sites identified from sequence. Bioinformatics
2007;23:e13-6.
16. Ahmad S, Mizuguchi K. Partner-Aware Prediction of Interacting Residues in Protein-Protein
Complexes from Sequence Data. PLoS ONE 2011;6(12):e29104.
17. Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II.
Computational methods to predict protein and domain interaction partners. PLoS Comput Biol
2007;3:595-601.
18. Tuncbag N, Gursoy A, Keskin O. Prediction of protein-protein interactions: unifying
evolution and structure at protein interfaces. Phys Biol 2011;8:035006.
19. Martin S, Roe D, Faulon JL. Predicting protein–protein interactions using signature
products. Bioinformatics 2005;21(2):218-226.
20. Pan XY, Zhang YN, Shen HB. Large-Scale Prediction of Human Protein-Protein Interactions
from Amino Acid Sequence Based on Latent Topic Features. J Proteome Res
2010;9(10):4992-5001.
21. Zhang YN, Pan XY, Huang Y, Shen HB. Adaptive compressive learning for prediction of
protein-protein interactions from primary sequence. J Theor Biol 2011;283(1):44-52.
22. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and
analysis. Microbiol Rev 1995;59:94-123.
23. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of
interacting proteins: 2004 update. Nucleic Acids Res 2004;32(suppl 1):D449-D451.
24. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general
repository for interaction datasets. Nucleic Acids Res 2006;34(suppl 1):D535-D539.
25. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, von Mering C.
The STRING database in 2011: functional interaction networks of proteins, globally
integrated and scored. Nucleic Acids Res 2011;39(suppl 1):D561-D568.
26. Ramos H, Shannon P, Brusniak MY, Kusebauch U, Moritz RL, Aebersold R. The Protein
Information and Property Explorer 2: Gaggle-like exploration of biological proteomic data
within one webpage. Proteomics 2011;11(1):154-158.
27. Wass MN, David A, Sternberg MJ. Challenges for the prediction of macromolecular
interactions. Curr Opin Struct Biol 2011;21:382-390.
28. Kastritis PL, Bonvin AMJJ. On the binding affinity of macromolecular interactions: daring to
ask why proteins interact. J R Soc Interface 2013;10:20120835.
29. Horton N, Lewis M. Calculation of the free energy of association for protein complexes.
Protein Sci 1992;1:169-181.
30. Ma XH, Wang CX, Li CH, Chen WZ. A fast empirical approach to binding free energy
calculations based on protein interface information. Protein Eng 2002;15:677-681.
31. Audie J, Scarlata S. A novel empirical free energy function that explains and predicts
protein–protein binding affinities. Biophys Chem 2007;129:198-211.
32. Jiang L, Gao Y, Mao F, Liu Z, Lai L. Potential of mean force for protein-protein interaction
studies. Proteins 2002;46:190-196.
33. Zhang C, Liu S, Zhu Q, Zhou Y. A knowledge-based energy function for protein-ligand,
protein-protein, and protein-DNA complexes. J Med Chem 2005;48:2325-2335.
34. Su Y, Zhou A, Xia X, Li W, Sun Z. Quantitative prediction of protein-protein binding
affinity with a potential of mean force considering volume correction. Protein Sci
2009;18:2550-2558.
35. Moal IH, Agius R, Bates PA. Protein-protein binding affinity prediction on a diverse set of
structures. Bioinformatics 2011;27:3002-3009.
36. Vreven T, Hwang H, Pierce BG, Weng Z. Prediction of protein-protein binding free
energies. Protein Sci 2012;21:396-404.
37. Tian F, Lv Y, Yang L. Structure-based prediction of protein-protein binding affinity with
consideration of allosteric effect. Amino Acids 2012;43:531-543.
38. Kastritis PL, Bonvin AM. Are scoring functions in protein-protein docking ready to predict
interactomes? Clues from a novel binding affinity benchmark. J Proteome Res
2010;9:2216-2225.
39. Kastritis PL, Moal IH, Hwang H, Weng Z, Bates PA, Bonvin AM, Janin J. A structure-based
benchmark for protein-protein binding affinity. Protein Sci 2011;20:482-491.
40. Nooren IM, Thornton JM. Structural characterisation and functional significance of transient
protein–protein interactions. J Mol Biol 2003;325:991-1018.
41. Perkins JR, Diboun I, Dessailly BH, Lees JG, Orengo C. Transient protein-protein
interactions: structural, functional, and network properties. Structure 2010;18:1233-43.
42. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell
DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C,
Berman HM, Bourne PE. The RCSB Protein Data Bank: new resources for research and
education. Nucleic Acids Res 2013;41:D475-D482.
43. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M.
AAindex: amino acid index database, progress report 2008. Nucleic Acids Res
2008;36:D202-D205.
44. Gromiha MM. A statistical model for predicting protein folding rates from amino acid
sequence with structural class information. J Chem Inf Model 2005;45:494-501.
45. Garg A, Kaur H, Raghava GP. Real value prediction of solvent accessibility in proteins using
multiple sequence alignment and secondary structure information. Proteins 2005;61:318-24.
46. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data
Mining Software: An Update. SIGKDD Explorations 2009;11(1):10-18.
47. Platt JC. Fast Training of Support Vector Machines using Sequential Minimal Optimization.
Microsoft Research 2000;12:41-65.
48. Hearst MA. Support Vector Machines. IEEE Intelligent Systems 1998;13(4):18-28.
49. Breiman L. Random forests. 2001; Available at
http://oz.berkeley.edu/users/breiman/randomforest2001.pdf.
50. Hall MA. Correlation-Based Feature Selection for Machine Learning, PhD thesis, Dept. of
Computer Science, Univ of Waikato, Hamilton, New Zealand, 1998.
51. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using
support vector machines. Machine Learning 2002;46:389-422.
52. Sikonja M, Kononenko I. An Adaptation of Relief for Attribute Estimation in
Regression, Proceedings of 14th International Conference on Machine Learning (ICML '97),
Nashville, TN, USA, July 8-12 1997;pp.296-304.
53. Liu H, Setiono R. A probabilistic approach to feature selection - A filter solution, 13th
International Conference on Machine Learning, Bari, Italy, July 3-6, 1996;319-327.
54. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-
Wesley, 1989.
55. Qian N, Sejnowski T. Predicting the secondary structure of globular proteins using Neural
Network models. J Mol Biol 1988;202:865-884.
56. Wold S, Eriksson L, Hellberg S, Jonsson J, Sjöström M, Skagerberg B, Wikström C.
Principal property values for six non-natural amino acids and their application to a structure-
activity relationship for oxytocin peptide analogues. Can J Chem 1987;65:1814-1820.
57. Gromiha MM, Selvaraj S, Jayaram B, Fukui K. Identification and analysis of binding site
residues in protein complexes: energy based approach. Lect Notes Comput Sci
2010;6215:626-633.
58. Gromiha MM, Saranya N, Selvaraj S, Jayaram B, Fukui K. Sequence and structural features
of binding site residues in protein-protein complexes: comparison with protein-nucleic acid
complexes. Proteome Sci 2011;9:S13.
59. Yuan T, Walsh MP, Sutherland C, Fabian H, Vogel HJ. Calcium-dependent and -
independent interactions of the calmodulin-binding domain of cyclic nucleotide
phosphodiesterase with calmodulin. Biochemistry 1999;38:1446-1455.
60. Brokx RD, Lopez MM, Vogel HJ, Makhatadze GI. Energetics of target peptide binding by
calmodulin reveals different modes of binding. J Biol Chem 2001;276:14083-14091.
61. Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free
energy of binding in protein interactions. Bioinformatics 2001;17:284-285.
62. Kumar MDS, Gromiha MM. PINT: Protein-protein Interactions Thermodynamic Database.
Nucleic Acids Res 2006;34:D195-198.
63. Wang G, Dunbrack Jr. RL. PISCES: a protein sequence culling server. Bioinformatics
2003;19(12):1589-91.
64. Day ES, Cote SM, Whitty A. Binding efficiency of protein-protein complexes. Biochemistry
2012;51:9124-9136.
65. Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P. Disorder and sequence repeats in hub
proteins and their implications for network evolution. J Proteome Res
2006;5(11):2985-2995.
66. Mészáros B, Tompa P, Simon I, Dosztányi Z. Molecular principles of the interactions of
disordered proteins. J Mol Biol 2007;372(2):549-561.
Figure Legends
Figure 1: Analysis of selected features for their discriminative ability
Figure 1(A): Number of aromatic and positively charged residues in predicted binding sites of ligands
Figure 1(B): Weights for β-sheet at the window position of -6
Table I
List of PDB codes for the protein-protein complexes with high and low affinities
Class PDB IDs with chains of the two interacting proteins
High affinity (98 complexes):
1ACB_E:I, 1AHW_AB:C, 1ATN_A:D, 1AVX_A:B, 1AY7_A:B, 1BJ1_HL:VW, 1BRS_A:D, 1BVN_P:T,
1DFJ_E:I, 1DQJ_AB:C, 1EAW_A:B, 1EER_A:BC, 1EMV_B:A, 1EZU_C:AB, 1F34_A:B, 1FLE_E:I,
1FSK_BC:A, 1GPW_A:B, 1GXD_A:C, 1HCF_AB:X, 1I2M_A:B, 1IBR_A:B, 1IQD_AB:C, 1JIW_P:I,
1JPS_HL:T, 1JTG_A:B, 1K5D_AB:C, 1KXP_A:D, 1KXQ_A:H, 1M10_A:B, 1MAH_A:F, 1NB5_AP:I,
1NCA_HL:N, 1NSN_HL:S, 1OC0_A:B, 1OPH_A:B, 1P2C_AB:C, 1PXV_A:C, 1R0R_E:I, 1RV6_VW:X,
1T6B_X:Y, 1UUG_A:B, 1VFB_AB:C, 1WDW_BD:A, 1WEJ_HL:F, 1YVB_A:I, 1ZLI_A:B, 2ABZ_B:E, 2B42_A:B, 2GOX_A:B, 2HRK_A:B, 2I25_N:L, 2I9B_E:A, 2J0T_A:D, 2JEL_HL:P, 2NYZ_AB:D,
2O3B_A:B, 2OUL_A:B, 2OZA_B:A, 2PTC_E:I, 2SIC_E:I, 2SNI_E:I, 2UUY_A:B, 2VDB_A:B,
2VIR_AB:C, 3BP8_AB:C, 3SGB_E:I, 1AVW_A:B, 1BQL_LH:Y, 1BTH_HL:P, 1CSE_E:I, 1FDL_HL:Y,
1FSS_A:B, 1HWG_A:C, 1IGC_HL:A, 1JHL_HL:A, 1PPF_E:I, 1STF_E:I, 1TBQ_JK:S, 1TEC_E:I,
1TPA_E:I, 1YQV_HL:Y, 2KAI_AB:I, 3HFM_HL:Y, 3HHR_A:C, 4HTC_HL:I, 4SGB_E:I, 4TPI_Z:I,
1BGX_HL:T, 1BKD_R:S, 1CGI_E:I, 1N8O_ABC:E, 1RRP_A:B, 1Y64_A:B, 2FD6_HL:U, 2SEC_E:I, 2TPI_ZI:S, 7CEI_A:B
Low affinity (87 complexes):
1A2K_AB:C, 1AK4_A:D, 1AKJ_AB:DE, 1AVZ_B:C, 1B6C_A:B, 1BUH_A:B, 1BVK_DE:F,
1CBW_ABC:D, 1E4K_AB:C, 1E6E_A:B, 1E6J_HL:P, 1E96_A:B, 1EFN_A:B, 1EWY_A:C, 1F6M_A:C,
1FC2_C:D, 1FFW_A:B, 1FQJ_A:B, 1GCQ_B:C, 1GLA_G:F, 1GRN_A:B, 1H1V_A:G, 1H9D_A:B,
1HE8_A:B, 1I4D_AB:D, 1IB1_AB:E, 1IJK_BC:A, 1JMO_A:HL, 1JWH_CD:A, 1KAC_A:B,
1KKL_ABC:H, 1KLU_AB:D, 1KTZ_A:B, 1LFD_B:A, 1MLC_AB:E, 1MQ8_A:B, 1NVU_Q:S, 1NVU_R:S, 1NW9_B:A, 1PVH_A:B, 1QA9_A:B, 1R6Q_A:C, 1RLB_ABCD:E, 1S1Q_A:B, 1US7_A:B, 1WQ1_G:R,
1XD3_A:B, 1XQS_A:C, 1Z0K_A:B, 1ZHI_A:B, 1ZM4_A:B, 2A9K_A:B, 2AJF_A:E, 2AQ3_A:B,
2B4J_AB:C, 2BTF_A:P, 2C0L_A:B, 2FJU_B:A, 2HLE_A:B, 2HQS_A:H, 2MTA_HL:A, 2OOB_A:B,
2OOR_AB:C, 2PCB_A:B, 2PCC_A:B, 2TGP_I:Z, 2VIS_AB:C, 2WPT_B:A, 3BZD_A:B, 3CPH_G:A,
1A0O_A:B, 1DKG_AB:D, 1GUA_A:B, 1MDA_LH:A, 1MEL_M:B, 1NMB_HL:N, 1YCS_A:B, 1AZS_AB:C,
1DE4_AB:CF, 1EFU_A:B, 1FAK_HL:T, 1GHQ_A:B, 1GP2_A:BG, 1HE1_C:A, 1R8S_A:E, 1SBB_A:B,
1TMQ_A:B, 2OT3_B:A
Table II
Performance of the present model generated using SMO with selected nine features

Dataset        Validation                Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
185 complexes  As full training set      77.3          74.5             80.5             0.77       0.78       0.78
185 complexes  10-fold cross-validation  77.3          75.5             79.3             0.77       0.78       0.77
155 complexes  As full training set      76.8          76.8             76.7             0.77       0.77       0.77
155 complexes  10-fold cross-validation  76.1          75.6             76.7             0.76       0.76       0.76
155 complexes  3-fold cross-validation   76.8          74.4             79.5             0.77       0.77       0.77
30 complexes   Test set                  83.3          81.3             85.7             0.83       0.84       0.84
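
The percentages in Table II can be cross-checked against the class sizes in Table I. The sketch below is a minimal illustration, assuming that high affinity is the positive class (so sensitivity is computed over the 98 high-affinity complexes and specificity over the 87 low-affinity complexes); the function names are hypothetical and not part of the original work. It recovers integer confusion-matrix counts from the rounded percentages and recomputes the accuracy for the 10-fold cross-validation row on 185 complexes.

```python
def confusion_counts(n_pos, n_neg, sensitivity_pct, specificity_pct):
    """Recover integer (TP, FN, TN, FP) counts from rounded percentages."""
    tp = round(n_pos * sensitivity_pct / 100)
    tn = round(n_neg * specificity_pct / 100)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy_pct(tp, fn, tn, fp):
    """Accuracy as the percentage of correctly classified complexes."""
    return 100 * (tp + tn) / (tp + fn + tn + fp)


# Class sizes from Table I: 98 high-affinity and 87 low-affinity complexes.
# Assumption: high affinity is treated as the positive class.
tp, fn, tn, fp = confusion_counts(98, 87, 75.5, 79.3)
print(tp, fn, tn, fp)                           # 74 24 69 18
print(round(accuracy_pct(tp, fn, tn, fp), 1))   # 77.3
```

Under that assumption the recomputed accuracy matches the reported 77.3%, which suggests the sensitivity, specificity and accuracy columns are mutually consistent for this row.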
Table III
Comparison of different attribute selection methods using SMO as the classifier using 10-fold cross-validation on 185 complexes

Selection method (attribute evaluator + search method)  No. of selected features  Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Random Forest*                                          24                        73.5          70.5             69.1             0.74       0.74       0.74
Infogain attribute evaluator + Ranker                   22                        74.1          74.5             73.6             0.74       0.74       0.74
Chi-squared attribute evaluator + Ranker                21                        75.7          73.5             78.2             0.76       0.76       0.76
ReliefF attribute evaluator + Ranker                    18                        73.0          69.4             77.0             0.73       0.73       0.73
Gain ratio attribute evaluator + Ranker                 17                        71.4          71.4             71.3             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           13                        71.4          70.4             72.4             0.71       0.72       0.71
CFS subset evaluator + Bestfirst                        11                        70.8          72.4             69.0             0.71       0.71       0.71
Classifier subset evaluator + Genetic                   11                        72.4          72.5             72.4             0.73       0.73       0.72
SVM attribute evaluator + Ranker                        9                         77.3          75.5             79.3             0.77       0.78       0.77
CFS subset evaluator + Bestfirst                        9                         68.1          67.4             69.0             0.68       0.68       0.68
Chi-squared attribute evaluator + Ranker                9                         68.7          61.2             77.0             0.69       0.70       0.69
Classifier subset evaluator + Genetic                   9                         71.4          73.5             69.0             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           9                         68.1          70.4             65.5             0.68       0.68       0.68
Gain ratio attribute evaluator + Ranker                 9                         70.3          63.3             78.7             0.70       0.71       0.71
Infogain attribute evaluator + Ranker                   9                         70.3          63.3             78.7             0.70       0.71       0.71
ReliefF attribute evaluator + Ranker                    9                         70.3          71.4             69.0             0.70       0.70       0.70

*Ranked features using the program available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm (reference 49).
Table IV
Performance of different classifiers on training set of 155 complexes using the selected nine features with 10-fold cross-validation

Method                         Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Bayesian Logistic Regression   61.9          82.9             38.4             0.60       0.63       0.61
Naive Bayes                    71.6          64.6             79.5             0.72       0.73       0.75
Multilayer Perceptron          67.7          68.3             67.1             0.68       0.68       0.69
SMO (Support vector machines)  76.1          75.6             76.7             0.76       0.76       0.76
IBk (k-nearest neighbors)      62.6          61.0             64.4             0.63       0.63       0.63
J48 decision tree              68.4          69.5             67.1             0.68       0.68       0.66
Random Forest                  65.8          75.6             54.8             0.65       0.66       0.68
Table V
Importance of individual attributes in the selected feature set

S.No.  Attribute removed                                                                             Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
1      Weights for α-helix at the window position of -6                                              73.6          73.1             73.9             0.74
2      Weights for β-sheet at the window position of -6                                              73.6          73.1             73.9             0.74
3      Weights for β-sheet at the window position of -3                                              74.8          73.1             76.7             0.75
4      Weights for β-sheet at the window position of 5                                               74.2          72.0             76.7             0.74
5      Principal property value z2 (Side chain bulk)                                                 74.9          74.4             75.3             0.75
6      Number of predicted binding site residues in receptors                                        73.5          72.0             75.3             0.74
7      Number of aromatic and positively charged residues in predicted binding sites of receptors    74.8          74.4             75.3             0.75
8      Number of aromatic and positively charged residues in predicted binding sites of ligands      74.8          75.6             74.0             0.75
9      Percentage of aromatic and positively charged residues in predicted binding sites of ligands  71.6          74.4             68.5             0.71
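
One way to read Table V is to order the attributes by the accuracy that remains after each one is removed: the lower the remaining accuracy, the more that attribute appears to contribute. The sketch below is only an illustration of that reading, using the accuracy column of Table V; the dictionary and function names are hypothetical, and ties are broken by attribute number purely for determinism.

```python
# Accuracy (%) after removing each attribute, keyed by S.No. in Table V.
accuracy_after_removal = {
    1: 73.6, 2: 73.6, 3: 74.8, 4: 74.2, 5: 74.9,
    6: 73.5, 7: 74.8, 8: 74.8, 9: 71.6,
}


def rank_by_importance(acc):
    """Sort attribute numbers by ascending accuracy after removal.

    A lower remaining accuracy is read as a larger contribution;
    ties are broken by attribute number (an arbitrary choice).
    """
    return sorted(acc, key=lambda k: (acc[k], k))


ranking = rank_by_importance(accuracy_after_removal)
print(ranking)  # [9, 6, 1, 2, 4, 3, 7, 8, 5]
```

On this reading, removing attribute 9 (percentage of aromatic and positively charged residues in predicted binding sites of ligands) costs the most accuracy, while the remaining attributes differ by about one percentage point or less.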
Figure 1: Analysis of selected features for their discriminative ability: (a) Number of aromatic and positively charged residues in predicted binding sites of ligands; (b) Weights for β-sheet at the window position of -6