Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches

Short title: Classification of protein-protein complexes

K Yugandhar and M. Michael Gromiha*

Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036, Tamilnadu, India.

Key words: binding affinity, discrimination, feature selection, machine learning techniques, protein-protein interactions.

*corresponding author

Tel: +91-2257-4138

Fax: +91-2257-4102

E-mail: [email protected]

Research Article. Proteins: Structure, Function and Bioinformatics. DOI 10.1002/prot.24564

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process which may lead to differences between this version and the Version of Record. Please cite this article as an ‘Accepted Article’, doi: 10.1002/prot.24564. © 2014 Wiley Periodicals, Inc. Received: Jan 13, 2014; Revised: Mar 14, 2014; Accepted: Mar 14, 2014


Abstract:

Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding

affinity of protein-protein complexes is one of the challenging problems in computational and

molecular biology. In this work, we related sequence features of protein-protein complexes with

their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and the experimental binding affinities are available. We derived a set of 610 features from the sequences of the complexes and selected specific features with the Ranker search method, which combines an attribute evaluator with the Ranker method. We analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with a combination of nine features using support vector machines. Further, we observed an accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our

method would serve as an effective tool for identifying the interacting partners in protein-protein

interaction networks and human-pathogen interactions based on the strength of interactions.


John Wiley & Sons, Inc.

PROTEINS: Structure, Function, and Bioinformatics


Introduction:

Many biological functions involve the formation of protein-protein complexes,1,2 and the interaction of two proteins is an important prerequisite in cell signaling pathways, regulation of metabolic pathways, immunologic recognition, DNA replication, progression through the cell cycle, and protein synthesis.3 Proteins also often have to act as a complex to bring about a significant biological change. Understanding the recognition mechanism and binding specificity of protein-protein complexes remains a challenging problem in molecular and computational biology.

Protein-protein complexes can be classified into various types such as dimeric-multimeric, homodimer-heterodimer, obligate-non obligate and transient-permanent based on different criteria such as the number and type of subunits involved, interaction time and biological significance of the complexes.1,4 Several studies have been carried out in recent times on different aspects of protein-protein interactions, which include the role of specific interactions,5-7 understanding the recognition mechanism,6 identifying the binding sites from protein structures8-13 and predicting the interaction sites from amino acid sequence.14-16 Further, prediction methods have been developed for identifying the interacting partners using protein structure17,18 and sequence information.19-21 These methods are mainly based on structural similarity,18 physico-chemical properties,21 evolutionary information18 etc.

The binding affinity of protein-protein complexes is one parameter that can be related to almost every functional aspect of the proteins. Experimentally, interacting protein-protein pairs can be identified with the yeast two-hybrid system, Förster/fluorescence resonance energy transfer (FRET), surface plasmon resonance and isothermal calorimetry.22 The data on interacting pairs of proteins have been deposited in databases such as DIP,23 BioGRID24 and STRING.25 Further, tools such as the Protein Information and Property Explorer 2 (PIPE2)26 provide a platform for integration and annotation of interaction data from various databases. Complex experimental setups and time-consuming protocols stress the necessity of computational methods that could give reliable information about interacting partners or binding affinity. Along these lines, several computational methods have been developed to predict interacting protein partners, which are quite successful despite some remaining challenges.27,28 For binding affinity prediction, a few structure based methods have been proposed using empirical scoring functions,29-32 knowledge based methods33-36 and quantitative structure activity relationship methods.37 These methods are mainly based on structural information, and it is necessary to develop sequence based methods for annotating protein-protein interaction networks and identifying interacting partners at large scale.

In this work, we have systematically analyzed the sequences of interacting proteins and derived a set of 610 features. Using feature selection procedures, we have selected a set of nine features (attributes) and developed a model for discriminating protein-protein complexes based on their affinities. The selected features include predicted binding site residues and propensities for α-helices and β-sheets, which are reported to be important for the binding affinity of protein-protein complexes.57-60 We then analyzed the contribution of the selected features to discriminating protein-protein complexes based on their binding affinities. Our method using support vector machines could discriminate 155 protein-protein complexes of high and low affinities with a 10-fold cross-validation accuracy of 76.1%. Further, our method was tested with a set of 30 complexes, which showed an accuracy of 83.3%. We suggest that our method could be effectively used for identifying interacting partners with low and high affinities in protein-protein interaction networks and host-pathogen interactions.


Materials and Methods

Dataset:

We have compiled a dataset of 185 protein-protein complexes for the present study with the following conditions: (i) the experimental binding affinity (Kd value) is known,33,38-40 (ii) both the binding partners of a complex have more than 50 amino acids each and (iii) the complexes are heterodimers. The dataset includes protein-protein complexes with diverse functions (antigen-antibody, enzyme-inhibitor, G-protein containing, receptor containing etc.), various ranges of molecular weights and disordered regions.

These 185 complexes have been classified into two groups based on their binding affinities. The complexes with Kd less than 10⁻⁸ M were considered as the high affinity class and complexes with Kd greater than or equal to 10⁻⁸ M were considered as the low affinity class. The Kd range for the high affinity class is the one generally considered for permanent protein-protein complexes,41 which emphasizes the biological importance of our model. With this criterion, we have obtained a balanced dataset in which 98 and 87 protein-protein complexes have been assigned to the high and low affinity classes, respectively. The Protein Data Bank (PDB) codes42 for these two sets of complexes are given in Table I and the description of all the 185 complexes is given in supplementary Table S1.
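The class assignment described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the function name `affinity_class` and the example Kd values are hypothetical, and only the threshold of 1e-8 M is taken from the text.

```python
# Threshold from the text: Kd < 1e-8 M is "high" affinity, otherwise "low".
KD_THRESHOLD_M = 1e-8

def affinity_class(kd_molar: float) -> str:
    """Assign a complex to the high or low affinity class from its Kd (in mol/L)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return "high" if kd_molar < KD_THRESHOLD_M else "low"

# Hypothetical Kd values, for illustration only.
examples = {"complex_A": 2.5e-10, "complex_B": 1e-8, "complex_C": 4.0e-6}
labels = {name: affinity_class(kd) for name, kd in examples.items()}
print(labels)
```

Note that a complex lying exactly on the threshold falls into the low affinity class, in line with the "greater than or equal to" wording of the criterion.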

Features

We have utilized a set of 610 sequence based features in this study. The features include a diverse set of 544 features that account for various physico-chemical, conformational, energetic and biochemical properties of amino acids obtained from the AAindex database43 as well as 49 properties from the literature.44 In addition, we have used 17 features derived from predicted binding site residues, predicted aromatic and charged residues at the interface15 and predicted solvent accessibility.45 All these features have been computed for the 185 protein-protein complexes from their amino acid sequences. Further, we have reduced the number of features as discussed below.
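The text does not state how a per-residue property is turned into a per-complex feature. One common choice, assumed here purely for illustration, is to average the property over each sequence. The sketch below uses the Kyte-Doolittle hydropathy scale as an example property; whether this scale is among the 544 AAindex properties used in the study is not known, and the function names and sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_property(sequence, scale):
    """Average a per-residue property over a sequence, skipping unknown residues."""
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return sum(values) / len(values)

def complex_features(receptor_seq, ligand_seq, scale):
    """Return one feature per chain (keeping the partners separate is an assumption)."""
    return {
        "receptor_mean": mean_property(receptor_seq, scale),
        "ligand_mean": mean_property(ligand_seq, scale),
    }

# Hypothetical short sequences, for illustration only.
features = complex_features("AGLK", "DEVI", KYTE_DOOLITTLE)
print(features)
```

Repeating this for every property scale would give one column per property, which is the kind of feature table that the reduction and selection steps operate on.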

Machine learning methods:

The WEKA data mining software46 was used for the machine learning tasks of feature selection and classification. We have analyzed various machine learning techniques implemented in the WEKA platform for discriminating protein-protein complexes based on their binding affinity. WEKA includes several methods based on neural networks, regression analysis, Bayes functions, logistic functions, nearest neighbor methods, meta learning, decision trees and rules. Based on the performance of all the techniques on different feature sets using the Experimenter module in WEKA, we selected the SMO (Sequential Minimal Optimization) algorithm,47 which is an SVM based method, for the classification of complexes in our dataset. The SVM is a learning machine for two-group classification problems that transforms the attribute space into a multidimensional feature space using a kernel function in order to separate dataset instances by an optimal hyperplane.48
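In standard textbook notation (not taken from this article), the decision rule of a kernel SVM trained on N instances x_i with class labels y_i in {-1, +1} can be written as below. Here α_i are the learned multipliers, b is the bias, C is the complexity parameter and K is the kernel function. The polynomial form shown is one common parameterization, and the degree d is left symbolic because the text does not specify it.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad 0 \le \alpha_i \le C,
\qquad K(\mathbf{x}_i, \mathbf{x}) = \left( \mathbf{x}_i \cdot \mathbf{x} \right)^{d}
```

Only the instances with α_i > 0 (the support vectors) contribute to the sum, which is why the separating hyperplane is determined by a subset of the training complexes.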

We have used feature selection methods available in WEKA46 and a program available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm.49 WEKA provides various attribute and subset evaluator methods such as CFS,50 Chi-squared, Classifier, SVM,51 Infogain, ReliefF,52 Gain ratio, Consistency53 and so on, as well as various search methods including Best-first, Genetic search54 and Ranker. A brief description of each of the above mentioned methods available in WEKA is given below.

CFS: Considers the individual predictive ability of each feature along with the degree of redundancy between them.

Chi-squared: Computes the value of the chi-squared statistic with respect to the class.

Classifier: Evaluates attribute subsets on training data or a separate hold out testing set. It uses a classifier to estimate the ‘merit’ of a set of attributes.

SVM: Evaluates based on SVM-RFE, i.e. “Recursive feature elimination”.

Infogain: Measures the information gain with respect to the class.

ReliefF: Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class.

Gain ratio: Measures the gain ratio with respect to the class.

Consistency: Evaluates the worth of a subset of attributes by the level of consistency in the class values when the training instances are projected onto the subset of attributes.

Best-first: Searches by greedy hill-climbing augmented with a back-tracking facility.

Genetic: Performs a search using the simple genetic algorithm.

Ranker: Ranks the attributes by their individual evaluations.
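As an illustration of how a single-attribute evaluator and the Ranker search fit together, the sketch below scores each feature by its information gain with respect to the class and then sorts the features by that score. It is a plain Python approximation, not WEKA code: it assumes the features are already discretized (WEKA handles numeric attributes itself), and the feature names and values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the class minus its entropy conditioned on a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

def rank_features(table, labels):
    """Ranker-style ordering: score each feature on its own, then sort by score."""
    scores = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical discretized features for six complexes.
labels = ["high", "high", "high", "low", "low", "low"]
table = {
    "many_aromatic": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "long_sequence": [1, 0, 1, 0, 1, 0],  # only weakly related to the class
}
ranking = rank_features(table, labels)
print(ranking)
```

Swapping `information_gain` for another per-feature score changes the evaluator while leaving the ranking step untouched, which mirrors the evaluator/search split described above.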


Feature set reduction and selection

The feature set was reduced to remove redundancy by employing a "dimensionality reduction by correlation" criterion. The refined feature set contains 216 amino acid properties, with an absolute correlation coefficient (r) of less than 0.85 between any two of the considered properties. We have added the 17 features derived from predicted interface residues and solvent accessibility,15,45 which resulted in a total of 233 features. Further, the features that contribute most to the discrimination were selected by employing the Ranker search method (a combination of the SVM attribute evaluator and the Ranker method) in the WEKA software.
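A correlation-based reduction of this kind can be sketched as a greedy filter that keeps a property only if its absolute Pearson correlation with every previously kept property is below 0.85. The greedy order, the function names and the example columns are assumptions made for illustration; the text gives only the cutoff.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, cutoff=0.85):
    """Greedily keep a property only if |r| < cutoff against every property kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical property columns; the text does not say over which items r was computed.
columns = {
    "prop_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "prop_2": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly proportional to prop_1
    "prop_3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
print(kept)
```

Because the filter is greedy, the surviving set depends on the order in which properties are visited; a different order could keep `prop_2` instead of `prop_1`.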

Assessment of discrimination performance and validation procedures

We have used an n-fold cross validation procedure for evaluating the performance of the method. In this procedure, the data are divided into n groups; n-1 groups are used to develop a model and the remaining group is used to test it, and the process is repeated n times so that each group serves once as the test set.
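The splitting step of n-fold cross-validation can be sketched as follows. This is a minimal, unstratified version written for illustration; the function name and the fixed random seed are assumptions, and WEKA's own implementation may differ (for example, by stratifying the folds by class).

```python
import random

def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 155 complexes and 10 folds, the setting reported in the text.
splits = list(k_fold_indices(155, 10))
print([len(test) for _, test in splits])
```

Each complex appears in exactly one test fold, so the accuracies from the n test folds together cover the whole training set once.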

The prediction performance has been assessed with the following measures:

Accuracy = (TP+TN)/(TP+TN+FP+FN) (1)

Sensitivity (or) Recall = TP/(TP+FN) (2)

Specificity = TN/(TN+FP) (3)

Precision = TP/(TP+FP) (4)

F-measure = 2 × (Precision × Recall)/(Precision + Recall) (5)

In these equations, TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively. In addition, the AUC (area under the ROC curve) has been estimated, which summarizes the relationship between the true positive rate and the false positive rate.
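The measures in Eqs. (1)-(5) and the AUC can be computed as in the sketch below. The counts and scores are hypothetical and chosen only to make the arithmetic easy to follow; the AUC is computed with the pairwise-ranking definition, which is one standard way of obtaining the area under the ROC curve and not necessarily the routine used in WEKA. The functions assume that every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (1)-(5) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }

def auc_from_scores(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical counts and classifier scores, for illustration only.
metrics = classification_metrics(tp=40, tn=35, fp=15, fn=10)
auc = auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(metrics, auc)
```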


Results and discussion

Selected features for discrimination

We have tried various combinations of evaluator and selection methods available in the WEKA software to delineate the best features for discriminating protein-protein complexes with high and low affinities. We observed that the Ranker search method showed the best performance with the SVM attribute evaluator, which analyzes all the considered features and arranges them in the order of priority for discrimination. Using all the features gave an average accuracy of about 70%, and it also raises the problem of over-fitting because the number of features is very high compared with the number of data points used in the present study. Hence, we removed the features one by one and evaluated the performance in terms of accuracy and ROC. We noticed a marginal increase in prediction accuracy with the elimination of different features. Finally, we have identified a set of 9 features, which showed the maximum accuracy with 10-fold and 3-fold cross validation tests. The selected properties are the weights for α-helix at the window position of -6,55 the weights for β-sheet at the window positions of -6, -3 and 5,55 the principal property value z2 showing side chain bulkiness,56 the number of predicted binding site residues in receptors,18 the number of predicted binding site aromatic and positively charged residues in receptors and in ligands18 and the percentage of binding site aromatic and positively charged residues in ligands.18
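The idea of removing features one by one while tracking performance can be sketched as a generic backward elimination loop. This is not the SVM attribute evaluator used in the study: the removal rule (drop the feature whose loss hurts the score least), the function names and the toy scoring function are all assumptions made for illustration. In practice `evaluate` would run a cross-validated classifier on the given feature subset.

```python
def backward_elimination(features, evaluate, min_features=1):
    """Drop one feature at a time, always removing the one whose loss hurts least.

    `evaluate` maps a tuple of feature names to a score such as cross-validated
    accuracy. Returns the best-scoring subset seen and its score.
    """
    current = tuple(features)
    best_subset, best_score = current, evaluate(current)
    while len(current) > min_features:
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        current = max(candidates, key=evaluate)
        score = evaluate(current)
        if score > best_score:
            best_subset, best_score = current, score
    return best_subset, best_score

# Toy scoring function standing in for a cross-validated classifier:
# two informative features add to the score, every feature costs a little.
USEFUL = {"f1": 0.30, "f3": 0.20}

def toy_score(subset):
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

subset, score = backward_elimination(["f1", "f2", "f3", "f4"], toy_score)
print(subset, round(score, 2))
```

With the toy score, the uninformative features are discarded first and the score peaks once only the two informative features remain, which is the pattern of a marginal increase followed by a drop that motivates stopping at a small subset.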

Among the final list of 9 features, four (about 44%) are derived from predicted binding site residues. Interestingly, the information on the aromatic and positively charged residues at the binding sites is identified as one of the most important features for discriminating the protein-protein complexes based on their affinities. This observation agrees well with the previous results reported in the literature57,58 and emphasizes the importance of binding site residues, especially aromatic and positively charged residues, in governing the binding affinity between two interacting proteins. In addition, secondary structure based properties play an important role in discrimination along with the physical property, side chain bulkiness. It has been shown that induction of α-helical structure in the calmodulin-binding sequence of a target protein is an important step in the activation of target enzymes, which in turn could be a determining factor for binding affinity.59,60 The selected feature set for our model includes a measure of helix propensity. These results emphasize the importance of α-helices in the formation of protein-protein complexes and in governing the binding affinity.

Analysis of selected features for discrimination

We have classified the protein-protein complexes into two groups based on their affinities and analyzed the distribution of all the nine selected features. A few specific examples are discussed below:

Figure 1(A) shows the distribution of protein-protein complexes with high and low affinities based on the number of predicted aromatic and positively charged residues at the interface of ligands. We noticed that high affinity complexes have a larger number of positively charged and aromatic residues at the interface than low affinity complexes. With the cutoff of 9 residues, the percentage of high and low affinity complexes is 23% and 6%, respectively. Interestingly, this parameter is selected as one of the features for discriminating high and low affinity complexes, and it is also reported to be an important factor for understanding the binding specificity of protein-protein complexes.57,58

Figure 1(B) shows the weights for β-sheet at the window position of -6, which are used in predicting protein secondary structures.55 We noticed that a larger number of high affinity complexes have weights of less than zero. We observed a similar trend for the α-helix weights. These results reveal the importance of secondary structures for the specificity of protein-protein complexes, in agreement with experimental reports.60 Other selected properties also showed marked differences between low and high affinity complexes (data not shown).

Discrimination of protein-protein complexes based on their affinities

We have utilized different algorithms available in WEKA to discriminate the protein-protein complexes based on their affinities, and the SMO method (which uses support vector machines) showed the best performance based on sensitivity, specificity, accuracy and ROC. Further, we have varied the adjustable parameters of the SVM, and the model with a polynomial kernel and a C value of 1.0 yielded the highest accuracy. The discrimination performance using different datasets is presented in Table II. Our method could discriminate low and high affinity protein-protein complexes with an average accuracy of 76.1% using 10-fold cross-validation on a set of 155 complexes. The sensitivity and specificity are 75.6% and 76.7%, respectively. We have applied the same model to a test set of 30 complexes and the discrimination accuracy is 83.3%. Further, we have tested for over-fitting by evaluating the model with self-consistency, and the results are very similar to those of the cross validation experiments. This observation indicates that our model is not over-fitted. It is noteworthy that the model was developed with a limited set of 185 complexes and it can be refined as more data on protein-protein binding affinity become available.

Influence of sequence redundancy for discriminating high and low affinity complexes

Protein-protein binding affinity is influenced by several experimental factors such as protein concentration, pH, temperature etc. In addition, mutation of a single amino acid residue could drastically change the affinity of the complexes.61,62 Hence, we have not applied a redundancy criterion and used all the 185 complexes in the present work. The performance of our model on a blind data set indicates that it is robust and not over-fitted.

For further evaluation, we have developed non-redundant datasets using the cutoff of less than 25% sequence identity in (i) receptor, (ii) ligand and (iii) receptor or ligand.63 This yielded sets of 92, 125 and 144 protein-protein complexes based on the non-redundancy in receptor, ligand and either of them, respectively. Our method could discriminate the high and low affinity complexes in these three datasets with accuracies of 64%, 77% and 75%, respectively.
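A non-redundant subset of this kind can be built with a greedy filter over pairwise sequence identities, as sketched below. The procedure actually used in the study is not described here, so everything in the sketch is an assumption: the greedy order, the function names, the sequences, and in particular the toy identity measure, which simply compares positions without alignment. A real analysis would take identities from a sequence alignment or clustering program.

```python
def fraction_identical(seq_a, seq_b):
    """Toy identity: identical positions over the shorter length, with no alignment."""
    length = min(len(seq_a), len(seq_b))
    if length == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / length

def non_redundant(complexes, chain, cutoff=0.25, identity=fraction_identical):
    """Greedily keep complexes whose `chain` is below the cutoff against all kept ones."""
    kept = []
    for name, chains in complexes.items():
        if all(identity(chains[chain], complexes[k][chain]) < cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical sequences, for illustration only.
complexes = {
    "c1": {"receptor": "ACDEFGHIKL", "ligand": "MNPQRSTVWY"},
    "c2": {"receptor": "ACDEFGHIKV", "ligand": "YWVTSRQPNM"},
    "c3": {"receptor": "LKIHGFEDCA", "ligand": "MNPQRSTVWA"},
}
by_receptor = non_redundant(complexes, "receptor")
by_ligand = non_redundant(complexes, "ligand")
print(by_receptor, by_ligand)
```

Filtering on the receptor and on the ligand gives different subsets from the same complexes, which is consistent with the three dataset sizes reported above being different.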

Analyzing performance of the model on a particular family of proteins apart from the test set

Apart from the test set of 30 complexes, we have examined the prediction power of our model on an additional blind set of seven complexes, which belong to a common group called the “Tumor necrosis factor (TNF) superfamily”.64 Among the seven complexes, three have high affinity and four have low affinity. Our method correctly classified all three high affinity complexes and three of the four low affinity complexes, which corresponds to an accuracy of 85.7%.
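The reported accuracy follows directly from the counts given above, as the short calculation below shows. Treating the high affinity class as the positive class when naming sensitivity and specificity is an assumption made here for illustration.

```python
# Counts stated in the text for the seven TNF superfamily complexes.
high_total, high_correct = 3, 3
low_total, low_correct = 4, 3

accuracy = (high_correct + low_correct) / (high_total + low_total)
# Taking "high affinity" as the positive class is an assumption made here.
sensitivity = high_correct / high_total
specificity = low_correct / low_total

print(f"accuracy={accuracy:.1%} sensitivity={sensitivity:.1%} specificity={specificity:.1%}")
```

With only seven complexes, a single misclassification shifts the accuracy by about 14 percentage points, so this figure is best read alongside the results on the larger sets.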


Performance of the method on disordered proteins

Dosztanyi et al.65 reported that hub proteins contain a large proportion of disordered regions and tend to have long sequences. We have analyzed the influence of disordered proteins (or regions) on the affinities of hetero-dimeric complexes. Among the 185 complexes used in the present study, structures of the free proteins are available for 138 complexes, and 90 of them have at least one protein in the disordered state. The analysis of these 90 complexes based on their binding affinities showed that 60% of them (54 complexes) are of low affinity. Our model could correctly classify 77% of the disordered complexes using 10-fold cross-validation, which is similar to the performance on the whole dataset of 185 complexes.

It has been reported that ordered and intrinsically unstructured complexes differ mainly in their interface properties.66 Interestingly, our method selected four features (from the list of 233 features) that are related to interface properties (derived from predicted binding site residues). Hence, these properties might play a key role in differentiating ordered from disordered complexes as well as high from low affinity complexes. Further, we have divided the set of 138 complexes into two groups (disordered with 90 complexes and ordered with 48 complexes) and performed feature selection separately on each with the aim of achieving the highest accuracy. We found that most of the features selected for the two sets are similar, except for a few features such as “charge of the protein”, which is selected only for ordered proteins. This supports the previous observation suggesting the importance of electrostatic and cation-π interactions for the recognition of protein-protein complexes.7 In addition, we noticed that 7 of the 11 features selected for the disordered set, including the interface properties, are also present in the final list derived for 155 complexes. This reiterates the importance of the reduced set of features developed in this work.


Analysis of large-scale protein-protein interacting pairs based on high and low affinity

We have employed our method for analyzing protein-protein interaction data available in major databases. We have collected a set of 4712 protein-protein interactions in yeast that are deposited in the DIP database.23 Our analysis predicted 43% and 57% of the interacting pairs to be of high and low affinity, respectively. The predicted high affinity complexes will be helpful for selecting targets in structure based drug design. Further analysis of protein-protein interactions in various organisms and host-pathogen interactions, along with validation, is in progress.

Rational comparison of different attribute selection methods

We have employed different combinations of attribute evaluators and search methods available in WEKA for the feature selection process and the results are presented in Table III. From this table, we observed that the combination of the SVM attribute evaluator and the Ranker search method has the best performance using a minimum number of nine features. Other combinations of attribute evaluators and search methods either showed a lower AUC or utilized a larger number of features than the SVM attribute evaluator with the Ranker search method. In addition, we have used the random forest method available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm for selecting the features,49 which showed an AUC of 0.74 (with an accuracy of 73.5%) using 24 features. Hence, we have selected the combination of the SVM attribute evaluator and the Ranker search method for selecting the features to discriminate high and low affinity protein-protein complexes.


Comparison of different classifiers

We have compared the performance of different classifiers for discriminating high and low affinity protein-protein complexes and the results for 7 typical methods are presented in Table IV. Most of the methods discriminated the low and high affinity protein-protein complexes with an accuracy in the range of 62% to 72%. The present method based on support vector machines could discriminate them with an accuracy of 76.1%, with balanced sensitivity (75.6%) and specificity (76.7%).

Influence of specific features for discrimination

We have evaluated the importance of each selected feature for discrimination by removing it from the list and analyzing the accuracy using 10-fold cross-validation on a dataset of 155 complexes. The results are presented in Table V. We noticed that the accuracy decreases by 1 to 6% on removing a single feature. Specifically, removing the percentage of aromatic and positively charged residues in predicted binding sites of ligands decreased the accuracy from 76.1% to 71.6%, showing its importance in discrimination.
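This kind of leave-one-feature-out analysis can be sketched as below. The function name, the feature names and the toy accuracy function are hypothetical; in a real run `evaluate` would return the 10-fold cross-validated accuracy of the classifier trained on the given feature subset.

```python
def leave_one_feature_out(features, evaluate):
    """Return the baseline score and the score drop caused by removing each feature."""
    baseline = evaluate(tuple(features))
    drops = {}
    for removed in features:
        reduced = tuple(f for f in features if f != removed)
        drops[removed] = baseline - evaluate(reduced)
    return baseline, drops

# Toy stand-in for cross-validated accuracy (in percent), for illustration only.
CONTRIBUTION = {"feat_a": 5.0, "feat_b": 3.0, "feat_c": 1.0}

def toy_accuracy(subset):
    return 60.0 + sum(CONTRIBUTION[f] for f in subset)

baseline, drops = leave_one_feature_out(list(CONTRIBUTION), toy_accuracy)
most_important = max(drops, key=drops.get)
print(baseline, drops, most_important)
```

The feature with the largest drop is the one the model relies on most when all other selected features are present; features that are correlated with each other can each show a small drop even if they are jointly important.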

Comparison with other methods

The present work is the first sequence based method for classifying protein-protein complexes based on their binding affinity. This method differs from the structure based methods proposed in the literature, which are mainly for predicting the absolute binding affinity of protein-protein complexes. These methods have several limitations: (i) they are applicable only to a training set of complexes,35 (ii) they utilize a large number of descriptors,37 (iii) they show high correlation only for rigid complexes35 and (iv) they require structural information.28 On the other hand, the present method has several advantages: (i) the features are derived from amino acid sequences, (ii) it utilizes a limited number of features, (iii) it classifies protein-protein complexes into low and high affinity classes and (iv) it shows a good performance. Although direct comparison of our method with other existing methods is not appropriate, the analysis shows that the present method has several advantages over other methods reported in the literature.

Conclusion

The analysis of a large number of amino acid features that influence the binding affinity of protein-protein complexes showed that conformational properties, α-helical and β-strand tendencies, bulkiness and the number of predicted aromatic and charged residues at the protein-protein interface are important for discriminating protein-protein complexes of high and low affinities. Interestingly, the dominance of aromatic and charged residues at the interface is important for recognition due to the formation of electrostatic, aromatic-aromatic and cation-π interactions, which are reported to play vital roles in the formation of protein-protein complexes. In addition, the features related to protein secondary structures, which are identified by our feature selection methods, are also shown to play an important role in recognition. The combination of these features could successfully discriminate the high and low affinity protein-protein complexes with an accuracy in the range of 76-85% using different sets of data and validation procedures. Hence, the present method could be used to identify the interacting partners in protein-protein interaction networks and human-pathogen interactions based on their affinities. Further, the work on predicting the binding affinity from amino acid sequence is in progress.


Acknowledgements

We thank the Associate Editor and reviewers for constructive comments. KY thanks the

University Grants Commission (UGC), Government of India for providing research fellowship.

We thank the Bioinformatics facility and Indian Institute of Technology Madras for

computational facilities. The work was partially supported by the Department of Science and

Technology, Government of India to MMG (SR/SO/BB-0036/2011).

Supportive/Supplementary Material

Table S1: Description for all the complexes used in the study.

References:

1. Jones S, Thornton JM. Principles of protein-protein interactions. Proc Natl Acad Sci USA

1996;93:13-20.

2. Nooren IM, Thornton JM. Diversity of protein-protein interactions. EMBO J 2003;22:3486-

3492.

3. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD. Molecular Biology of the Cell.

New York: Garland; 1989, 2nd edn.

4. Keskin O, Gursoy A, Ma B, Nussinov R. Principles of protein-protein interactions: what are

the preferred ways for proteins to interact? Chem Rev 2008;108:1225-1244.

5. Bahadur RP, Chakrabarti P, Rodier F, Janin J. A dissection of specific and non-specific

protein-protein interfaces. J Mol Biol 2004;336:943-955.

6. Gromiha MM. Protein Bioinformatics: From Sequence to Function. Elsevier; 2010.

7. Gromiha MM, Yokota K, Fukui K. Energy based approach for understanding the recognition

mechanism in protein-protein complexes. Mol Biosyst 2009;5:1779-1786.

8. Jones S, Thornton JM. Prediction of protein-protein interaction sites using patch analysis. J

Mol Biol 1997;272:133-43.

9. Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify

the location of protein–protein binding sites. J Mol Biol 2004;338:181-199.

10. Fernandez-Recio J, Totrov M, Abagyan R. Identification of protein-protein interaction sites

from docking energy landscapes. J Mol Biol 2004;335:843-865.

11. Fernandez-Recio J, Totrov M, Skorodumov C, Abagyan R. Optimal docking area: a new

method for predicting protein–protein interaction sites. Proteins 2005;58:134-143.

12. La D, Kihara D. A novel method for protein-protein interaction site prediction using

phylogenetic substitution models. Proteins 2012;80:126-141.

13. La D, Kong M, Hoffman W, Choi YI, Kihara D. Predicting permanent and transient protein-

protein interfaces. Proteins 2013;81(5):805-818.


14. Ofran Y, Rost B. Predict protein-protein interaction sites from local sequence

information. FEBS Lett 2003;544:236-239.

15. Ofran Y, Rost B. ISIS: interaction sites identified from sequence. Bioinformatics

2007;23:e13-6.

16. Ahmad S, Mizuguchi K. Partner-Aware Prediction of Interacting Residues in Protein-Protein

Complexes from Sequence Data. PLoS ONE 2011;6(12):e29104.

17. Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II.

Computational methods to predict protein and domain interaction partners. PLoS Comput Biol

2007;3:595-601.

18. Tuncbag N, Gursoy A, Keskin O. Prediction of protein-protein interactions: unifying

evolution and structure at protein interfaces. Phys Biol 2011;8:035006.

19. Martin S, Roe D, Faulon JL. Predicting protein–protein interactions using signature

products. Bioinformatics 2005;21(2):218-226.

20. Pan XY, Zhang YN, Shen HB. Large-Scale Prediction of Human Protein-Protein Interactions

from Amino Acid Sequence Based on Latent Topic Features. J Proteome Res

2010;9(10):4992-5001.

21. Zhang YN, Pan XY, Huang Y, Shen HB. Adaptive compressive learning for prediction of

protein-protein interactions from primary sequence. J Theor Biol 2011;283(1):44-52.

22. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and

analysis. Microbiol Rev 1995;59:94-123.

23. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of

interacting proteins: 2004 update. Nucleic Acids Res 2004;32(suppl 1):D449-D451.

24. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general

repository for interaction datasets. Nucleic Acids Res 2006;34(suppl 1):D535-D539.

25. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, von Mering C.

The STRING database in 2011: functional interaction networks of proteins, globally

integrated and scored. Nucleic Acids Res 2011;39(suppl 1):D561-D568.

26. Ramos H, Shannon P, Brusniak MY, Kusebauch U, Moritz RL, Aebersold R. The Protein

Information and Property Explorer 2: Gaggle-like exploration of biological proteomic data

within one webpage. Proteomics 2011;11(1):154-158.

27. Wass MN, David A, Sternberg MJ. Challenges for the prediction of macromolecular

interactions. Curr Opin Struct Biol 2011;21:382-390.

28. Kastritis PL, Bonvin AMJJ. On the binding affinity of macromolecular interactions: daring to

ask why proteins interact. J R Soc Interface 2013;10:20120835.

29. Horton N, Lewis M. Calculation of the free energy of association for protein complexes.

Protein Sci 1992;1:169-181.

30. Ma XH, Wang CX, Li CH, Chen WZ. A fast empirical approach to binding free energy

calculations based on protein interface information. Protein Eng 2002;15:677-681.

31. Audie J, Scarlata S. A novel empirical free energy function that explains and predicts

protein–protein binding affinities. Biophys Chem 2007;129:198-211.

32. Jiang L, Gao Y, Mao F, Liu Z, Lai L. Potential of mean force for protein-protein interaction

studies. Proteins 2002;46:190-196.

33. Zhang C, Liu S, Zhu Q, Zhou Y. A knowledge-based energy function for protein-ligand,

protein-protein, and protein-DNA complexes. J Med Chem 2005;48:2325-2335.

34. Su Y, Zhou A, Xia X, Li W, Sun Z. Quantitative prediction of protein-protein binding

affinity with a potential of mean force considering volume correction. Protein Sci

2009;18:2550-2558.

35. Moal IH, Agius R, Bates PA. Protein-protein binding affinity prediction on a diverse set of

structures. Bioinformatics 2011;27:3002-3009.

36. Vreven T, Hwang H, Pierce BG, Weng Z. Prediction of protein-protein binding free

energies. Protein Sci 2012;21:396-404.

37. Tian F, Lv Y, Yang L. Structure-based prediction of protein-protein binding affinity with

consideration of allosteric effect. Amino Acids 2012;43:531-543.

38. Kastritis PL, Bonvin AM. Are scoring functions in protein-protein docking ready to predict

interactomes? Clues from a novel binding affinity benchmark. J Proteome Res

2010;9:2216-2225.

39. Kastritis PL, Moal IH, Hwang H, Weng Z, Bates PA, Bonvin AM, Janin J. A structure-based

benchmark for protein-protein binding affinity. Protein Sci 2011;20:482-491.

40. Nooren IM, Thornton JM. Structural characterisation and functional significance of transient

protein–protein interactions. J Mol Biol 2003;325:991-1018.

41. Perkins JR, Diboun I, Dessailly BH, Lees JG, Orengo C. Transient protein-protein

interactions: structural, functional, and network properties. Structure 2010;18:1233-43.

42. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell

DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C,

Berman HM, Bourne PE. The RCSB Protein Data Bank: new resources for research and

education. Nucleic Acids Res 2013;41:D475-D482.

43. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M.

AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2008;36:D202-D205.

44. Gromiha MM. A statistical model for predicting protein folding rates from amino acid

sequence with structural class information. J Chem Inf Model 2005;45:494-501.

45. Garg A, Kaur H, Raghava GP. Real value prediction of solvent accessibility in proteins using

multiple sequence alignment and secondary structure information. Proteins 2005;61:318-24.

46. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data

Mining Software: An Update. SIGKDD Explorations 2009;11(1):10-18.

47. Platt JC. Fast Training of Support Vector Machines using Sequential Minimal Optimization.

Microsoft Research 2000;12:41-65.

48. Hearst MA. Support Vector Machines. IEEE Intelligent Systems 1998;18-28.

49. Breiman L. Random forests. 2001; Available at

http://oz.berkeley.edu/users/breiman/randomforest2001.pdf.

50. Hall MA. Correlation-Based Feature Selection for Machine Learning, PhD thesis, Dept. of

Computer Science, Univ of Waikato, Hamilton, New Zealand, 1998.

51. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using

support vector machines. Machine Learning 2002;46:389-422.

52. Sikonja M, Kononenko I. An Adaptation of Relief for Attribute Estimation in

Regression. Proceedings of the 14th International Conference on Machine Learning (ICML '97),

Nashville, TN, USA, July 8-12 1997;pp.296-304.

53. Liu H, Setiono R. A probabilistic approach to feature selection - A filter solution, 13th

International Conference on Machine Learning, Bari, Italy, July 3-6, 1996;319-327.

54. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-

Wesley, 1989.

55. Qian N, Sejnowski T. Predicting the secondary structure of globular proteins using Neural

Network models. J Mol Biol 1988;202:865-884.

56. Wold S, Eriksson L, Hellberg S, Jonsson J, Sjöström M, Skagerberg B, Wikström C.

Principal property values for six non-natural amino acids and their application to a structure-

activity relationship for oxytocin peptide analogues. Can J Chem 1987;65:1814-1820.

57. Gromiha MM, Selvaraj S, Jayaram B, Fukui K. Identification and analysis of binding site

residues in protein complexes: energy based approach. Lect Notes Comput Sci

2010;6215:626-633.

58. Gromiha MM, Saranya N, Selvaraj S, Jayaram B, Fukui K. Sequence and structural features

of binding site residues in protein-protein complexes: comparison with protein-nucleic acid

complexes. Proteome Sci 2011;9:S13.

59. Yuan T, Walsh MP, Sutherland C, Fabian H, Vogel HJ. Calcium-dependent and

-independent interactions of the calmodulin-binding domain of cyclic nucleotide

phosphodiesterase with calmodulin. Biochemistry 1999;38:1446-1455.

60. Brokx RD, Lopez MM, Vogel HJ, Makhatadze GI. Energetics of target peptide binding by

calmodulin reveals different modes of binding. J Biol Chem 2001;276:14083-14091.

61. Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free

energy of binding in protein interactions. Bioinformatics 2001;17:284-285.

62. Kumar MDS, Gromiha MM. PINT: Protein-protein Interactions Thermodynamic Database.

Nucleic Acids Res 2006;34:D195-D198.

63. Wang G, Dunbrack Jr. RL. PISCES: a protein sequence culling server. Bioinformatics

2003;19(12):1589-91.

64. Day ES, Cote SM, Whitty A. Binding efficiency of protein-protein complexes. Biochemistry

2012;51:9124-9136.

65. Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P. Disorder and sequence repeats in hub

proteins and their implications for network evolution. J Proteome Res 2006;5(11):2985-2995.

66. Mészáros B, Tompa P, Simon I, Dosztányi Z. Molecular principles of the interactions of

disordered proteins. J Mol Biol 2007;372(2):549-561.

Figure Legends

Figure 1: Analysis of selected features for their discriminative ability

Figure 1(A): Number of aromatic and positively charged residues in predicted binding sites of ligands

Figure 1(B): Weights for β-sheet at the window position of -6

Table I

List of PDB codes for the protein-protein complexes with high and low affinities

Class PDB IDs with chains of the two interacting proteins

High affinity (98 complexes)

1ACB_E:I, 1AHW_AB:C, 1ATN_A:D, 1AVX_A:B, 1AY7_A:B, 1BJ1_HL:VW, 1BRS_A:D, 1BVN_P:T,

1DFJ_E:I, 1DQJ_AB:C, 1EAW_A:B, 1EER_A:BC, 1EMV_B:A, 1EZU_C:AB, 1F34_A:B, 1FLE_E:I,

1FSK_BC:A, 1GPW_A:B, 1GXD_A:C, 1HCF_AB:X, 1I2M_A:B, 1IBR_A:B, 1IQD_AB:C, 1JIW_P:I,

1JPS_HL:T, 1JTG_A:B, 1K5D_AB:C, 1KXP_A:D, 1KXQ_A:H, 1M10_A:B, 1MAH_A:F, 1NB5_AP:I,

1NCA_HL:N, 1NSN_HL:S, 1OC0_A:B, 1OPH_A:B, 1P2C_AB:C, 1PXV_A:C, 1R0R_E:I, 1RV6_VW:X,

1T6B_X:Y, 1UUG_A:B, 1VFB_AB:C, 1WDW_BD:A, 1WEJ_HL:F, 1YVB_A:I, 1ZLI_A:B, 2ABZ_B:E, 2B42_A:B, 2GOX_A:B, 2HRK_A:B, 2I25_N:L, 2I9B_E:A, 2J0T_A:D, 2JEL_HL:P, 2NYZ_AB:D,

2O3B_A:B, 2OUL_A:B, 2OZA_B:A, 2PTC_E:I, 2SIC_E:I, 2SNI_E:I, 2UUY_A:B, 2VDB_A:B,

2VIR_AB:C, 3BP8_AB:C, 3SGB_E:I, 1AVW_A:B, 1BQL_LH:Y, 1BTH_HL:P, 1CSE_E:I, 1FDL_HL:Y,

1FSS_A:B, 1HWG_A:C, 1IGC_HL:A, 1JHL_HL:A, 1PPF_E:I, 1STF_E:I, 1TBQ_JK:S, 1TEC_E:I,

1TPA_E:I, 1YQV_HL:Y, 2KAI_AB:I, 3HFM_HL:Y, 3HHR_A:C, 4HTC_HL:I, 4SGB_E:I, 4TPI_Z:I,

1BGX_HL:T, 1BKD_R:S, 1CGI_E:I, 1N8O_ABC:E, 1RRP_A:B, 1Y64_A:B, 2FD6_HL:U, 2SEC_E:I, 2TPI_ZI:S, 7CEI_A:B

Low affinity (87 complexes)

1A2K_AB:C, 1AK4_A:D, 1AKJ_AB:DE, 1AVZ_B:C, 1B6C_A:B, 1BUH_A:B, 1BVK_DE:F,

1CBW_ABC:D, 1E4K_AB:C, 1E6E_A:B, 1E6J_HL:P, 1E96_A:B, 1EFN_A:B, 1EWY_A:C, 1F6M_A:C,

1FC2_C:D, 1FFW_A:B, 1FQJ_A:B, 1GCQ_B:C, 1GLA_G:F, 1GRN_A:B, 1H1V_A:G, 1H9D_A:B,

1HE8_A:B, 1I4D_AB:D, 1IB1_AB:E, 1IJK_BC:A, 1JMO_A:HL, 1JWH_CD:A, 1KAC_A:B,

1KKL_ABC:H, 1KLU_AB:D, 1KTZ_A:B, 1LFD_B:A, 1MLC_AB:E, 1MQ8_A:B, 1NVU_Q:S, 1NVU_R:S, 1NW9_B:A, 1PVH_A:B, 1QA9_A:B, 1R6Q_A:C, 1RLB_ABCD:E, 1S1Q_A:B, 1US7_A:B, 1WQ1_G:R,

1XD3_A:B, 1XQS_A:C, 1Z0K_A:B, 1ZHI_A:B, 1ZM4_A:B, 2A9K_A:B, 2AJF_A:E, 2AQ3_A:B,

2B4J_AB:C, 2BTF_A:P, 2C0L_A:B, 2FJU_B:A, 2HLE_A:B, 2HQS_A:H, 2MTA_HL:A, 2OOB_A:B,

2OOR_AB:C, 2PCB_A:B, 2PCC_A:B, 2TGP_I:Z, 2VIS_AB:C, 2WPT_B:A, 3BZD_A:B, 3CPH_G:A,

1A0O_A:B, 1DKG_AB:D, 1GUA_A:B, 1MDA_LH:A, 1MEL_M:B, 1NMB_HL:N, 1YCS_A:B, 1AZS_AB:C,

1DE4_AB:CF, 1EFU_A:B, 1FAK_HL:T, 1GHQ_A:B, 1GP2_A:BG, 1HE1_C:A, 1R8S_A:E, 1SBB_A:B,

1TMQ_A:B, 2OT3_B:A
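
The entries in Table I can be split into a PDB code and two chain groups with a few lines of Python. This is a minimal sketch, assuming each entry follows the layout `PDBID_chains:chains` suggested by the table header; the function names `parse_entry` and `parse_list` are hypothetical, and the table does not state which side is the receptor and which is the ligand, so the two groups are simply called first and second.

```python
import re

# Assumed entry layout, inferred from the table header: PDBID_chains:chains
ENTRY_RE = re.compile(r"^([0-9][A-Za-z0-9]{3})_([A-Za-z]+):([A-Za-z]+)$")


def parse_entry(entry):
    """Split one Table I entry into (pdb_id, first_chains, second_chains)."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {entry!r}")
    pdb_id, first, second = match.groups()
    # Each chain identifier is a single letter, so a group such as "AB"
    # is expanded into the list ["A", "B"].
    return pdb_id, list(first), list(second)


def parse_list(text):
    """Parse a comma-separated run of entries, ignoring empty items."""
    return [parse_entry(item) for item in text.split(",") if item.strip()]


sample = "1ACB_E:I, 1AHW_AB:C, 1EER_A:BC"
parsed = parse_list(sample)
print(parsed)
```

Entries that do not match the assumed layout raise a `ValueError` rather than being silently skipped, which makes transcription errors in the list easier to spot.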

Table II

Performance of the present model generated using SMO with the nine selected features

Dataset        Validation                Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
185 complexes  As full training set      77.3          74.5             80.5             0.77       0.78       0.78
185 complexes  10-fold cross-validation  77.3          75.5             79.3             0.77       0.78       0.77
155 complexes  As full training set      76.8          76.8             76.7             0.77       0.77       0.77
30 complexes   Test set                  83.3          81.3             85.7             0.83       0.84       0.84
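
The accuracy, sensitivity and specificity columns follow from the usual confusion-matrix definitions. The sketch below illustrates them, assuming that high-affinity complexes are treated as the positive class. The counts are hypothetical: they are back-calculated from the class sizes in Table I (98 high, 87 low) so that they agree with the 10-fold cross-validation row, and they are not reported in the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),   # recall on the positive class
        "specificity": 100.0 * tn / (tn + fp),   # recall on the negative class
    }


# Hypothetical counts (an assumption, not taken from the paper):
# 74 of 98 high-affinity and 69 of 87 low-affinity complexes classified correctly.
metrics = binary_metrics(tp=74, fn=24, tn=69, fp=18)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

With these assumed counts the three values round to 77.3, 75.5 and 79.3, matching the second row of Table II.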

Table III

Comparison of different attribute selection methods using SMO as the classifier using 10-fold cross-validation on 185 complexes

Selection method (attribute evaluator + search method)  Number of selected features  Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Random Forest*                                          24                           73.5          70.5             69.1             0.74       0.74       0.74
Infogain attribute evaluator + Ranker                   22                           74.1          74.5             73.6             0.74       0.74       0.74
Chi-squared attribute evaluator + Ranker                21                           75.7          73.5             78.2             0.76       0.76       0.76
ReliefF attribute evaluator + Ranker                    18                           73.0          69.4             77.0             0.73       0.73       0.73
Gain ratio attribute evaluator + Ranker                 17                           71.4          71.4             71.3             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           13                           71.4          70.4             72.4             0.71       0.72       0.71
CFS subset evaluator + Bestfirst                        11                           70.8          72.4             69.0             0.71       0.71       0.71
Classifier subset evaluator + Genetic                   11                           72.4          72.5             72.4             0.73       0.73       0.72
SVM attribute evaluator + Ranker                        9                            77.3          75.5             79.3             0.77       0.78       0.77
CFS subset evaluator + Bestfirst                        9                            68.1          67.4             69.0             0.68       0.68       0.68
Chi-squared attribute evaluator + Ranker                9                            68.7          61.2             77.0             0.69       0.70       0.69
Classifier subset evaluator + Genetic                   9                            71.4          73.5             69.0             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           9                            68.1          70.4             65.5             0.68       0.68       0.68
Gain ratio attribute evaluator + Ranker                 9                            70.3          63.3             78.7             0.70       0.71       0.71
Infogain attribute evaluator + Ranker                   9                            70.3          63.3             78.7             0.70       0.71       0.71
ReliefF attribute evaluator + Ranker                    9                            70.3          71.4             69.0             0.70       0.70       0.70

*Ranked features using program available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm (ref. 49).

Table IV

Performance of different classifiers on training set of 155 complexes using the selected nine features with 10-fold cross-validation

Method                         Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Bayesian Logistic Regression   61.9          82.9             38.4             0.60       0.63       0.61
Naive Bayes                    71.6          64.6             79.5             0.72       0.73       0.75
Multilayer Perceptron          67.7          68.3             67.1             0.68       0.68       0.69
SMO (Support vector machines)  76.1          75.6             76.7             0.76       0.76       0.76
IBK (K-nearest neighbors)      62.6          61.0             64.4             0.63       0.63       0.63
J48 decision tree              68.4          69.5             67.1             0.68       0.68       0.66
Random Forest                  65.8          75.6             54.8             0.65       0.66       0.68

Table V

Importance of individual attributes in the selected feature set

S.No.  Attribute removed                                                                             Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
1      Weights for α-helix at the window position of -6                                              73.6          73.1             73.9             0.74
2      Weights for β-sheet at the window position of -6                                              73.6          73.1             73.9             0.74
3      Weights for β-sheet at the window position of -3                                              74.8          73.1             76.7             0.75
4      Weights for β-sheet at the window position of 5                                               74.2          72               76.7             0.74
5      Principal property value z2 (Side chain bulk)                                                 74.9          74.4             75.3             0.75
6      Number of predicted binding site residues in receptors                                        73.5          72               75.3             0.74
7      Number of aromatic and positively charged residues in predicted binding sites of receptors    74.8          74.4             75.3             0.75
8      Number of aromatic and positively charged residues in predicted binding sites of ligands      74.8          75.6             74               0.75
9      Percentage of aromatic and positively charged residues in predicted binding sites of ligands  71.6          74.4             68.5             0.71
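
One way to read Table V is as the loss in accuracy caused by removing each attribute. The sketch below computes that loss, assuming the baseline is the SMO row of Table IV (76.1% with all nine features on the 155-complex training set, 10-fold cross-validation); the table itself does not state the baseline, so this pairing is an assumption. The names `accuracy_without` and `accuracy_drops` are hypothetical.

```python
# Accuracy (%) after removing each attribute, copied from Table V (keys are S.No.).
accuracy_without = {1: 73.6, 2: 73.6, 3: 74.8, 4: 74.2, 5: 74.9,
                    6: 73.5, 7: 74.8, 8: 74.8, 9: 71.6}

# Assumption: baseline taken from the SMO row of Table IV.
BASELINE_ACCURACY = 76.1


def accuracy_drops(without, baseline):
    """Return (attribute number, drop in accuracy) pairs, largest drop first."""
    drops = [(attr, round(baseline - acc, 1)) for attr, acc in without.items()]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


ranked = accuracy_drops(accuracy_without, BASELINE_ACCURACY)
for attr, drop in ranked:
    print(f"attribute {attr}: -{drop:.1f} percentage points")
```

Under this assumption, removing attribute 9 (percentage of aromatic and positively charged residues in predicted binding sites of ligands) costs the most accuracy, and removing attribute 5 costs the least.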

Figure 1: Analysis of selected features for their discriminative ability: (A) number of aromatic and positively charged residues in predicted binding sites of ligands; (B) weights for β-sheet at the window position of -6.
