
Bioinformatics

Computational Identification of Potential Molecular Interactions in Arabidopsis1[C][W]

Mingzhi Lin2, Bin Hu2, Lijuan Chen, Peng Sun, Yi Fan, Ping Wu, and Xin Chen*

Department of Bioinformatics, Zhejiang University, Hangzhou, People’s Republic of China, 310058 (M.L., B.H., L.C., Y.F., X.C.); and State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, Hangzhou, People’s Republic of China, 310058 (P.S., P.W.)

Knowledge of the protein interaction network is useful to assist molecular mechanism studies. Several major repositories have been established to collect and organize reported protein interactions. Many interactions have been reported in several model organisms, yet a very limited number of plant interactions can thus far be found in these major databases. Computational identification of potential plant interactions, therefore, is desired to facilitate relevant research. In this work, we constructed a support vector machine model to predict potential Arabidopsis (Arabidopsis thaliana) protein interactions based on a variety of indirect evidence. In a 100-iteration bootstrap evaluation, the confidence of our predicted interactions was estimated to be 48.67%, and these interactions were expected to cover 29.02% of the entire interactome. The sensitivity of our model was validated with an independent evaluation data set consisting of newly reported interactions that did not overlap with the examples used in model training and testing. Results showed that our model successfully recognized 28.91% of the new interactions, similar to its expected sensitivity (29.02%). Applying this model to all possible Arabidopsis protein pairs resulted in 224,206 potential interactions, which is the largest and most accurate set of predicted Arabidopsis interactions at present. In order to facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource, with detailed annotations and more specific per interaction confidence measurements. This database and related documents are freely accessible at http://www.cls.zju.edu.cn/pair/.

The complex cellular functions of an organism rely on physical interactions between proteins. Deciphering the protein-protein interaction network to understand higher level phenotypes and their regulations is always a major focus of both experimental biologists and computational biologists. A number of high-throughput (HTP) assays have been developed to identify in vitro protein interactions from several model organisms (Uetz et al., 2000; Giot et al., 2003; Li et al., 2004). A number of initiatives, such as IntAct (Kerrien et al., 2006), Molecular INTeraction database (Chatr-aryamontri et al., 2007), the Database of Interacting Proteins (Salwinski et al., 2004), Biomolecular Interaction Network Database (BIND; Alfarano et al., 2005), and BioGRID (Stark et al., 2006), have been established to systematically collect and organize the interaction data reported by both proteome-scale HTP experiments and traditional low-throughput studies focusing on individual proteins or pathways.

Arabidopsis (Arabidopsis thaliana) has long been studied as a model organism to investigate the physiology, biochemistry, growth, development, and metabolism of a flowering plant at the molecular level. The molecular mechanism studies of various phenotypes and their regulations in Arabidopsis may be facilitated by a comprehensive reference protein interaction network, based on which working hypotheses could be invented with more guidance and confidence. However, due to technological limitations, most experimentally reported protein interactions in available databases were from other organisms. A very limited number of plant interactions could be found in these databases. Therefore, an accurate prediction of the Arabidopsis interactome would be valuable to assist relevant research.

1 This work was supported by the National Natural Science Foundation of China (grant no. 30600039), by the Chinese Ministry of Science and Technology (maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, for long-term running cost of the PAIR database), and by the National Basic Research Program of China (grant no. 2005CB20901 to P.W.).

2 These authors contributed equally to the article.

* Corresponding author; e-mail [email protected].

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xin Chen ([email protected]).

[C] Some figures in this article are displayed in color online but in black and white in the print edition.

[W] The online version of this article contains Web-only data.

www.plantphysiol.org/cgi/doi/10.1104/pp.109.141317

Plant Physiology®, September 2009, Vol. 151, pp. 34–46, www.plantphysiol.org © 2009 American Society of Plant Biologists

Studies on the computational identification of potential interactions started along with the advent of HTP interaction-detection technologies, which often produced a large number of false positives (Deane et al., 2002). Indirect evidence of protein interaction (e.g. protein colocalization and relevance in function) was hence introduced to boost the confidence of HTP results (Jansen et al., 2003). Further investigations demonstrated that direct inference of protein interactions from such indirect evidence alone was possible (Scott and Barton, 2007). The accuracy and effectiveness of using indirect evidence to predict interactions have also been thoroughly assessed (Qi et al., 2006; Suthram et al., 2006). These works offered precious insights into how protein interactions may be predicted accurately on a proteomic scale. In other organisms such as Homo sapiens, the prediction of an entire interactome has already been proven applicable and useful (Rhodes et al., 2005).

On the other side, several efforts have been made to collect and organize a comprehensive map of Arabidopsis molecular interactions. For instance, around 20,000 interactions were inferred by homology to known interactions in other organisms (Geisler-Lee et al., 2007). Another work predicted 23,396 interactions based on multiple indirect data and curated 4,666 interactions from the literature and enzyme complexes (Cui et al., 2008). The Arabidopsis reactome database was established describing the functions of 2,195 proteins with 8,269 reactions in 318 superpathways (Tsesmetzis et al., 2008). And a general interaction database, IntAct (Kerrien et al., 2006), had allocated a special unit actively curating all plant protein interactions from literature and submitted data sets, which now contains 2,649 Arabidopsis interactions. However, in yeast, approximately 18,000 protein-protein interactions had been estimated for approximately 6,000 genes (Yu et al., 2008). Assuming the same rate of interaction, approximately 200,000 protein interactions would be expected for approximately 20,000 Arabidopsis genes. Therefore, the current collection of Arabidopsis interactions is still significantly limited. Moreover, most previous prediction works did not provide rigorous confidence measurements for their predicted interactions, which further limited their scope of applications.

Recent advances in statistical learning presented a powerful algorithm, support vector machine (SVM), which may be used to predict interactions based on multiple indirect data. Although the basis of SVM had been laid in the 1960s, the idea of SVM was only officially proposed in the 1990s by Vapnik (1998, 2000). Then, research on its theoretical and application aspects thrived. It has been applied in a wide range of problems, including text categorization (de Vel et al., 2001; Kim et al., 2001), image classification and object detection (Ben-Yacoub et al., 1999; Karlsen et al., 2000), flood stage forecasting (Liong and Sivapragasam, 2002), microarray gene expression data analysis (Brown et al., 2000), drug design (Zhao et al., 2006a, 2006b), protein solvent accessibility prediction (Yuan et al., 2002), and protein fold prediction (Ding and Dubchak, 2001; Hua and Sun, 2001). Many studies have demonstrated that SVM was consistently superior to other supervised learning methods (Brown et al., 2000; Burbidge et al., 2001; Cai et al., 2003).

In this work, with careful preparation of example data and selection of indirect evidence, we constructed an SVM model to predict potential Arabidopsis interactions. False positives were tightly controlled. With the high-confidence model, we identified altogether 224,206 potential interactions, which were expected to be 48.67% accurate and to cover 29.02% of the entire Arabidopsis interactome. More specific confidence measurements were also assigned on a per interaction basis. To facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource (PAIR; http://www.cls.zju.edu.cn/pair/), featuring detailed annotations and a friendly user interface.
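The yeast-based extrapolation quoted in the introduction (about 18,000 interactions for about 6,000 genes, scaled to about 20,000 Arabidopsis genes) can be reproduced with a few lines of arithmetic. The sketch below assumes that "the same rate of interaction" means the same fraction of unordered gene pairs that interact; the function name is hypothetical and is not part of the original work.

```python
from math import comb

def expected_interactions(ref_interactions, ref_genes, target_genes):
    """Scale an interaction count by the number of unordered gene pairs.

    Assumption: the 'rate of interaction' is interactions per gene pair.
    """
    rate = ref_interactions / comb(ref_genes, 2)  # interactions per gene pair
    return rate * comb(target_genes, 2)

# Approximate figures quoted in the text (yeast: Yu et al., 2008).
estimate = expected_interactions(18_000, 6_000, 20_000)
print(f"Expected Arabidopsis interactions: {estimate:,.0f}")
```

Under this pair-based reading the estimate lands near 200,000, which is consistent with the figure given in the text; scaling linearly with gene number instead would give a much smaller value.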

RESULTS AND DISCUSSION

Example Data Set

Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) were required to train an SVM prediction model. As detailed in “Materials and Methods,” we assembled a gold-standard example set consisting of 4,139 positive and 4,139 negative examples (Table I). In order to verify whether the proteins in our example interactions were a good sample of the entire Arabidopsis proteome, we used four categorizing systems to compute and compare the protein distributions in our example interactions and those in the entire Arabidopsis proteome. The Arabidopsis Information Resource (TAIR) database (Rhee et al., 2003) provided three classification systems based on Gene Ontology (GO) annotation (i.e. the molecular function-based system, the cellular component-based system, and the biological process-based system, which are collectively known as the GO Slim classification systems). In most cases, a GO Slim category in each classification system corresponds to an existing term in GO. Gene products that are annotated to the term itself, or to any of the children of this term, are included in the corresponding GO Slim category. In some cases, a GO Slim category may also encompass more than one GO term. In the GO Slim classification systems, categories were carefully chosen and were intended to be nonoverlapping. In our analysis, proteins classified into the “unknown categories” (e.g. “molecular function unknown,” “biological process unknown,” and “cellular component unknown”) were ignored, while the proteins classified into the “other categories” (e.g. “other molecular functions,” “other biological processes,” and “other cellular components”) were counted.

Table I. Composition of gold-standard interaction examples

Data Source    No. of Interactions    No. of Proteins
IntAct         2,649                  1,138
BIND           808                    416
BioGRID        736                    317
TAIR           1,177                  770
Total          4,139                  1,679

As shown in Figure 1A, proteins in our interaction examples and proteins in the entire Arabidopsis proteome showed similar histograms according to the GO Slim Biological Process categories. The Pearson’s correlation coefficient (r) between these two distributions was 0.95. Using Fisher’s z transformation, this coefficient corresponds to a statistical significance of P = 3.17 × 10^-7. Similarly, according to the GO Slim Molecular Function categories, we observed a correlation coefficient of 0.52 (P = 2.8 × 10^-2; Fig. 1B), and according to the GO Slim Cellular Component categories, we observed a correlation coefficient of 0.68 (P = 2.9 × 10^-3; Fig. 1C). These significant correlations indicated that proteins in our example data set were indeed a good sample of the entire proteome. Besides the annotation-based categories, sequence-based protein families also served as a good classification system to compare protein distributions. According to the Pfam database (Coggill et al., 2008), all Arabidopsis proteins were classified into 2,854 families. As shown in Figure 1D, the frequency distributions of example proteins and of all Arabidopsis proteins were roughly consistent, showing a correlation coefficient of 0.75 (P < 1 × 10^-10). All of these data suggested that the proteins in our interaction examples represented a good sample of the Arabidopsis proteome.
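The comparison of category distributions described above can be sketched as follows: compute Pearson's r between the per-category fractions of the two protein sets, then convert r to a P value through Fisher's z transformation. This is a minimal illustration only. The toy fractions, the function names, and the choice of a two-sided test are assumptions; the text does not state the number of categories used or whether the test was one- or two-sided.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_pvalue(r, n):
    """Two-sided P value for H0: rho = 0, via Fisher's z transformation.

    atanh(r) is approximately normal with standard error 1/sqrt(n - 3);
    requires |r| < 1 and n > 3.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-category fractions (not taken from the paper).
example_fractions = [0.30, 0.25, 0.20, 0.15, 0.10]
proteome_fractions = [0.28, 0.22, 0.24, 0.14, 0.12]

r = pearson_r(example_fractions, proteome_fractions)
p = fisher_z_pvalue(r, len(example_fractions))
print(f"r = {r:.3f}, P = {p:.3g}")
```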

Strength of the Indirect Evidence

As detailed in “Materials and Methods,” potential interactions were predicted based on 14 features derived from four types of indirect evidence (Table II). Therefore, it is of interest to evaluate the extent to which these features carried information about protein interaction. One way to assess the discrimination power of a feature is to compare its value distribution in interaction examples against its value distribution in noninteraction examples. Informative features are expected to show apparent differences.

Figure 1. Protein distribution in our interaction examples and in the entire Arabidopsis proteome. Blue bars show the fraction of interacting proteins in each category, and red bars show the fraction of all Arabidopsis proteins in each category. A, Protein distribution based on the GO Slim Biological Process categories. The Pearson’s correlation coefficient between two distributions is 0.95 (P = 3.17 × 10^-7). B, Protein distribution based on the GO Slim Molecular Function categories. The correlation coefficient is 0.52 (P = 2.8 × 10^-2). C, Protein distribution based on the GO Slim Cellular Component categories. The correlation coefficient is 0.68 (P = 2.9 × 10^-3). ER, Endoplasmic reticulum. D, Protein distribution based on the Pfam protein family classifications. The correlation coefficient is 0.75 (P < 1 × 10^-10). In this diagram, only the largest 150 families are shown for clarity. [See online article for color version of this figure.]

Table II. Composition of features

Evidence Type         No. of Features
Coexpression          1
Domain interaction    9
Colocalization        1
Shared annotation     3
Total                 14


In this way, we divided the value range of each feature into 20 equal bins. The fraction of positive examples and the fraction of negative examples that fell into each bin were calculated to make the corresponding distribution histograms. We plotted these distribution histograms and the associated likelihood ratio (LR) curve in Figure 2. The LR for each bin is defined as the fraction of interaction examples above this bin divided by the fraction of noninteraction examples above this bin. Above a bin means that the feature values are expected to better indicate an interaction than all values in this bin. For features describing coexpression, colocalization, and domain interaction, this means that the feature values are greater than the biggest number in this bin. But for features describing shared annotation, above a bin means the contrary. For convenience in comparison, we wanted all LR curves to increase with the horizontal axes. Therefore, for features describing shared annotation, the horizontal axes were plotted as 1 – the feature value. Figure 2 shows several typical feature strength diagrams. A complete set of these diagrams is provided in Supplemental Text S1. It was clear that in these diagrams, no sharp differences existed between the value distributions in interaction examples and those in noninteraction examples. Pearson’s correlation between the interaction and noninteraction distributions ranged from 0.324 to 1. The Kolmogorov-Smirnov test also indicated that these interaction and noninteraction distributions were not significantly different, displaying distances from 0.1 to 0.4, corresponding to a P value range of 1.0 to 0.313 (no significance to reject the hypothesis that they were the same; Supplemental Text S1). However, it was also obvious that there were always noticeable deviations between the interaction and noninteraction distributions for each feature. In the best scenarios (at the right-most point), these features always led to significant LRs (i.e. >10-fold). Hence, they should be considered weak but relevant. A delicate scheme to exploit their combined discriminative power was necessary for accurate prediction.
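The per-bin LR defined above (fraction of interaction examples above a bin divided by the fraction of noninteraction examples above it) can be sketched as follows. This is an illustrative reading of the definition, not the authors' code: the function name and toy values are hypothetical, "above" is taken to mean strictly greater than the bin's upper edge, and bins with no noninteraction examples above them are reported as `None` rather than as an infinite ratio.

```python
def tail_likelihood_ratios(pos_values, neg_values, n_bins=20):
    """LR per bin: fraction of positives above the bin / fraction of negatives above it.

    For shared-annotation features, where smaller values indicate interaction,
    pass (1 - value) so that the curve is oriented the same way.
    """
    lo = min(min(pos_values), min(neg_values))
    hi = max(max(pos_values), max(neg_values))
    width = (hi - lo) / n_bins
    ratios = []
    for i in range(n_bins):
        # Upper edge of bin i; use hi exactly for the last bin to avoid rounding drift.
        upper = hi if i == n_bins - 1 else lo + (i + 1) * width
        pos_above = sum(v > upper for v in pos_values) / len(pos_values)
        neg_above = sum(v > upper for v in neg_values) / len(neg_values)
        # The ratio is undefined when no negative example lies above the bin.
        ratios.append(pos_above / neg_above if neg_above > 0 else None)
    return ratios

# Toy feature values (hypothetical), split into two bins for readability.
toy_pos = [0.9, 0.8, 0.7, 0.2]
toy_neg = [0.1, 0.2, 0.3, 0.8]
toy_lrs = tail_likelihood_ratios(toy_pos, toy_neg, n_bins=2)
print(toy_lrs)
```

In the toy data, three of four positives but only one of four negatives lie above the first bin, so the first ratio is 3; nothing lies above the last bin, so its ratio is undefined.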

Figure 2. Comparison of the feature value distribution in interaction examples and the feature value distribution in noninteraction examples. For each feature, its value range is divided into 20 equal bins. Blue curves and red curves show the fraction of interaction examples and the fraction of noninteraction examples that fall into each bin. Green curves represent the LR. A complete set of these analysis diagrams can be found in Supplemental Text S1. The horizontal axis represents the feature value (in A–C) or the 1 – feature value (in D). The left vertical axis represents the fraction of protein pairs. The right vertical axis represents the LR. A, Protein colocalization feature. B, Domain interaction feature 9 (Random Decision Forest Framework). C, Gene coexpression feature. D, Shared annotation feature 3 (cellular component). [See online article for color version of this figure.]

The above set of feature LRs also offered an interesting possibility to predict potential interactions based on the total LR of a protein pair being an interaction rather than a noninteraction, which was computed as the product of the 14 single feature LRs (Rhodes et al., 2005; Cui et al., 2008). While we regarded this approach as a useful supplement to assign per interaction prediction confidence, using it as an interactome prediction method may not be optimal in our case, for two reasons. First, the features we used were not independent. Strong correlations existed among features describing the same type of evidence (e.g. the Pearson’s correlation between the nine domain interaction features ranged from 0.0049 to 0.918). And features describing different indirect evidence may also correlate, such as the colocalization feature and the shared GO cellular component annotation feature (r = 0.0638, P < 1 × 10^-10). Therefore, the product of LRs would be dominated by the information coded in multiple correlated features. Useful information carried by other independent features may be obscured, which may lead to significant bias in prediction. Second, as shown in Figure 2, the LR curves were not monotonically increasing, which meant that using a single cutoff value to compute LR was too simple. Finer structures of the feature value distributions could be exploited for better prediction accuracy. Consequently, prediction systems that are more complicated and tolerate correlated features were desired. Many previous machine learning studies had suggested that the SVM algorithm excelled in this scenario.
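The total LR discussed above, computed as the product of single-feature LRs, can be sketched as below. The example also illustrates the first objection in the text: when two features carry the same information, their shared evidence is counted twice. The function name and the LR values are hypothetical.

```python
import math

def total_likelihood_ratio(feature_lrs):
    """Product of per-feature LRs, accumulated in log space for numerical stability.

    Treating the product as a combined LR implicitly assumes that the
    features are conditionally independent.
    """
    if any(lr <= 0 for lr in feature_lrs):
        raise ValueError("likelihood ratios must be positive")
    return math.exp(sum(math.log(lr) for lr in feature_lrs))

# Hypothetical LRs for three independent features.
independent = total_likelihood_ratio([2.0, 3.0, 0.5])

# If the first feature is duplicated (a perfectly correlated copy),
# its evidence is counted twice and the total doubles.
with_duplicate = total_likelihood_ratio([2.0, 2.0, 3.0, 0.5])

print(independent, with_duplicate)
```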

Accuracy of Our Prediction Models

We constructed an SVM model to predict Arabidopsis interactions. Its parameters, C and γ, were optimized by a grid search. Model performances during parameter optimization were estimated by 10-fold cross-validation with the F1 measure. Accuracy of the resulting model was evaluated by a more precise 100-iteration bootstrap experiment. In each iteration, one-tenth of the example data were randomly left out from training, and these data were used to test the model and compute the confusion matrix. After 100 iterations, the mean and SD of the confusion matrix (Table III), together with other derivative accuracy measurements, were computed. The optimal model produced 84.52% ± 0.20% sensitivity, 90.58% ± 0.25% specificity, and 87.44% ± 0.14% F1 measure. Its overall accuracy was estimated to be 87.55% ± 0.14%.
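The derivative accuracy measurements mentioned above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions and the mean counts reported for the 1:1 model in Table III; the function name is hypothetical, and applying the formulas to mean counts (rather than averaging per-iteration values) is an assumption.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard measures derived from a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # recall on interacting pairs
        "specificity": tn / (tn + fp),   # recall on noninteracting pairs
        "precision": tp / (tp + fp),     # fraction of predictions that are correct
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Mean counts for the model trained with equal numbers of positive and
# negative examples (Table III, 1:1 model).
m = confusion_metrics(tp=3498.32, fp=389.88, fn=640.68, tn=3749.12)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these inputs the sensitivity, specificity, and overall accuracy agree with the percentages quoted in the text to two decimal places.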

In addition, a receiver operating characteristic (ROC) curve analysis was performed to verify whether the SVM algorithm was effective in interaction prediction. In the ROC curve, sensitivity was plotted against 1 – specificity for SVM models trained with all range of parameters. Therefore, the ROC curve analysis is a global and unbiased evaluation of the effectiveness of an algorithm, since it does not depend on any specific parameter used to build a particular model (Hanley and McNeil, 1982). The bigger the area under the curve, the more effective a prediction algorithm would be. As a result, we had an area under the curve of 91.86% for the SVM approach, which was slightly better than those for several other widely used prediction methods (Supplemental Fig. S1). These results demonstrated that the SVM approach was suitable to exploit the information carried in the weak evidence and to make accurate predictions.

Although the above prediction model worked well on our example set consisting of equal numbers of positive and negative examples, its application in interactome prediction was still limited. Assuming that the ratio of interacting protein pairs out of random pairs in Arabidopsis was similar to that in yeast (i.e. approximately 1:775; Yu et al., 2008), its 90.58% specificity would translate to a large number of false positives. The precision of its predicted interactions would drop to approximately 1.15%, too low to be useful for expert inspection. Based on our past experiences (Cai et al., 2003; Xue et al., 2004; Xu et al., 2007), the SVM algorithm usually displayed an increased accuracy for the type of examples that was either larger in number or greater in diversity. Thus, one way to increase specificity without significantly compromising sensitivity was to expand the negative examples in the training data. With this strategy, we added 409,761 randomly generated negative examples to the training set, bringing the positive-to-negative ratio to 1:100. With the same protocol, this enlarged example set produced an optimal SVM model showing 29.02% ± 0.13% sensitivity, 99.95% ± 0.0013% specificity, 85.15% ± 0.33% precision, and 43.29% ± 0.17% F1 measure. Applying this model to all possible Arabidopsis protein pairs, 224,206 potential interactions were identified.
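The drop in precision described above follows from Bayes' rule: when true interactions are rare, even a small false-positive rate produces many more false positives than true positives. A minimal sketch, assuming that "1:775" means roughly one interacting pair in every 775 random pairs (the function name is hypothetical):

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision when positives make up `prevalence` of all pairs."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# 1:1 model (84.52% sensitivity, 90.58% specificity) at the yeast-like ratio.
p_1to1 = precision_at_prevalence(0.8452, 0.9058, 1 / 775)
print(f"Precision of the 1:1 model: {100 * p_1to1:.2f}%")
```

The result is close to the approximately 1.15% quoted in the text; the exact second decimal depends on whether the ratio is read as 1 in 775 or 1 to 775.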

These 224,206 predicted interactions also implied an estimation of the Arabidopsis interactome size. The equation “true positives + false positives = all predicted positives” can be rewritten as:

\[ N_{int} \times Sen + (N_{all} - N_{int}) \times (1 - Spe) = N_{pred} \]

where N_int represents the interactome size implied by our model, N_all is the number of all possible Arabidopsis protein pairs (2.29 × 10^8), N_pred is the number of predicted interactions (224,206), and Sen and Spe are the sensitivity and specificity of our model. Solving this equation gave an estimated Arabidopsis interactome size of 3.78 × 10^5, corresponding to an interaction rate of 1:606, similar to that estimated in yeast (approximately 1:775; Yu et al., 2008). With this estimation, this model was expected to have a precision of 48.67% in interactome prediction. And the first model trained with an equal number of positive and negative examples was expected to display a precision of 1.46% in interactome prediction.

Table III. The confusion matrix of our prediction models

The 1:1 model (high-coverage model) refers to the optimal model trained with an equal number of positive and negative examples. The 1:100 model (high-confidence model) refers to the optimal model trained with the expanded example data set in which the positive-to-negative ratio is 1:100. All measurements are provided in the format of mean ± SD computed by 100-iteration bootstrap evaluation.

Predicted    Model    Actual Positive          Actual Negative
Positive     1:1      TP: 3,498.32 ± 8.28      FP: 389.88 ± 10.44
Positive     1:100    TP: 1,201.28 ± 5.57      FP: 209.44 ± 5.43
Negative     1:1      FN: 640.68 ± 8.28        TN: 3,749.12 ± 10.44
Negative     1:100    FN: 2,937.72 ± 5.57      TN: 413,690 ± 5.43

Comparing the above two models, the one trained with an equal number of positive and negative examples had a high sensitivity (84.52%). Or, it would be less likely to miss a real interaction. For this reason, we call it the high-coverage model, or the 1:1 model. However, this good coverage came at a cost. The confidence of its predicted interactions (precision) was estimated to be as low as 1.46%. In contrast, the model trained with 100 times more negative examples showed a high precision (48.67%). Or, it would be less likely to wrongly predict an interaction. Thus, we call this model the high-confidence model, or the 1:100 model. Nonetheless, its predicted interactions were expected to cover only 29.02% of the entire interactome. While the interactions predicted by the high-coverage model could be more useful to reconfirm interactions identified by other low-confidence approaches (e.g. some HTP experiments that tend to report many false positives), the interactions predicted by the high-confidence model would be more interesting for direct expert inspection.

In addition, it was noted that our high-coverage predictions included the entire set of high-confidence predictions, suggesting the self-consistency of our methods.
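The interactome-size equation in this section, N_int × Sen + (N_all − N_int) × (1 − Spe) = N_pred, is linear in N_int and can be solved in closed form. The sketch below plugs in the rounded values quoted in the text for the 1:100 model; because those inputs are rounded, the outputs only approximate the published figures. The function name is hypothetical.

```python
def interactome_size(n_pred, n_all, sen, spe):
    """Solve N_int*Sen + (N_all - N_int)*(1 - Spe) = N_pred for N_int."""
    fpr = 1.0 - spe  # false-positive rate
    return (n_pred - n_all * fpr) / (sen - fpr)

n_all = 2.29e8      # all possible Arabidopsis protein pairs
n_pred = 224_206    # interactions predicted by the 1:100 model
sen, spe = 0.2902, 0.9995

n_int = interactome_size(n_pred, n_all, sen, spe)
pairs_per_interaction = n_all / n_int
precision = n_int * sen / n_pred  # true positives / all predicted positives

print(f"N_int = {n_int:,.0f}")
print(f"interaction rate = 1:{pairs_per_interaction:.0f}")
print(f"precision = {100 * precision:.1f}%")
```

With these rounded inputs the solution falls near 3.8 × 10^5 interactions, roughly one interacting pair per 600 pairs, and a precision just under 50%, in line with the values reported above.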

Rediscovery of Newly Reported Interactions

The sensitivity of our prediction models in interactome prediction was further validated with one independent evaluation data set containing newly reported interactions that were not included in the major interaction databases at the time when our positive examples were assembled. This data set was extracted from the latest update of BioGRID (Stark et al., 2006), containing 166 novel interactions. As shown in Supplemental Table S1, 122 (73.49%) novel interactions could be successfully recognized by the high-coverage model. This rate of sensitivity was slightly lower than the expected one (84.52%). Therefore, we considered this rate (73.49%) as the likely sensitivity of our high-coverage model when generalized to interactome prediction. On the other hand, the high-confidence model was able to identify 48 (28.91%) novel interactions. This sensitivity was comparable to its expected one (29.02%). In contrast, by chance alone, our high-coverage model and high-confidence model would be expected to have approximately 70.27 ± 8.37 and approximately 4.96 ± 1.94 overlaps, respectively, with this independent evaluation data set. Consequently, the significance of recognizing 122 interactions by our high-coverage model was 3.17 × 10⁻¹⁰, and the significance of recognizing 48 interactions by our high-confidence model was less than 1 × 10⁻¹⁰.
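The comparison against chance overlap can be illustrated with a short calculation. The text does not state how the chance expectation and its significance were derived, so the sketch below assumes a hypergeometric null model (a random prediction set of fixed size drawn from a fixed universe of protein pairs); the function names and the toy numbers are hypothetical and are not the values behind the figures reported above.

```python
from math import comb, sqrt


def chance_overlap(universe, predicted, evaluated):
    """Mean and SD of the overlap expected by chance (hypergeometric null).

    universe  -- number of candidate protein pairs (assumed known)
    predicted -- number of pairs predicted as interactions
    evaluated -- number of pairs in the independent evaluation set
    """
    p = predicted / universe
    mean = evaluated * p
    variance = evaluated * p * (1 - p) * (universe - evaluated) / (universe - 1)
    return mean, sqrt(variance)


def tail_probability(universe, predicted, evaluated, observed):
    """Probability of seeing at least `observed` overlaps by chance alone."""
    total = comb(universe, evaluated)
    hits = sum(
        comb(predicted, k) * comb(universe - predicted, evaluated - k)
        for k in range(observed, min(predicted, evaluated) + 1)
    )
    return hits / total


# Toy numbers only: 1,000 candidate pairs, 100 predicted, 50 evaluated.
mean, sd = chance_overlap(1000, 100, 50)
p_value = tail_probability(1000, 100, 50, 20)
```

Under this assumed null, an observed overlap far above the chance mean yields a very small tail probability, which is the sense in which the overlaps above are called significant.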

Comparison with Other Predicted Interactomes

Before this work, there had been several efforts to predict Arabidopsis interactions. As mentioned in the introduction, without using complex statistical models, Geisler-Lee et al. (2007) predicted interactions by homology to reported interactions in other organisms (herein referred to as the Interologs data set). Also, Cui et al. (2008) identified 23,396 potential interactions based on multiple indirect evidence with a total likelihood ratio cutoff and published them as the AtPID database (the AtPID data set). In addition to these predicted interactomes, Obayashi et al. (2009) computed a coexpression network from 58 microarray experiments (1,388 slides), which was available as the ATTED-II database (the ATTED-II data set). These data sets offered an interesting opportunity to compare with our results.

Table IV shows the overlaps between our predictions and the above data sets, together with the significance values. All data sets exhibited significant overlaps that were unlikely to be produced by chance, indicating their consistency in general. The AtPID set showed the largest proportion of overlap with our predictions, most likely as a result of using a similar set of indirect evidence (e.g. gene coexpression and GO biological process similarity). In contrast, the ATTED-II set had relatively lower overlaps with all data sets, including the gold-standard interaction examples. This was not surprising, as

Table IV. Comparison with other predicted interactomes

The data sets being compared were our high-confidence predictions (SVM), the homology-based predictions described by Geisler-Lee et al. (2007; Interologs), the interactions predicted by multiple indirect evidence as in the AtPID database (Cui et al., 2008), the coexpression network as in the ATTED-II database (Obayashi et al., 2009), and the gold-standard interaction examples compiled in this work (GS). Note that the ATTED-II set is not exactly a predicted interactome. It was included for the purpose of comparing predicted interactomes with a coexpression network. The Interologs set and the AtPID set contained all of their predicted interactions. Their gold-standard interactions not recognized by their prediction methods were not counted. The size of each data set is given after the data set name in the left column. For each overlap, its significance (probability of existence by chance alone) is provided.

                     SVM                    Interologs             AtPID                  ATTED-II               GS
SVM (224,206)        –                      1,266 (P < 1 × 10⁻¹⁰)  5,789 (P < 1 × 10⁻¹⁰)  847 (P < 1 × 10⁻¹⁰)    917 (P < 1 × 10⁻¹⁰)
Interologs (69,225)  1,266 (P < 1 × 10⁻¹⁰)  –                      3,132 (P < 1 × 10⁻¹⁰)  253 (P < 1 × 10⁻¹⁰)    260 (P < 1 × 10⁻¹⁰)
AtPID (24,418)       5,789 (P < 1 × 10⁻¹⁰)  3,132 (P < 1 × 10⁻¹⁰)  –                      434 (P < 1 × 10⁻¹⁰)    132 (P < 1 × 10⁻¹⁰)
ATTED-II (48,276)    847 (P < 1 × 10⁻¹⁰)    253 (P < 1 × 10⁻¹⁰)    434 (P < 1 × 10⁻¹⁰)    –                      52 (P < 1 × 10⁻¹⁰)
GS (4,139)           917 (P < 1 × 10⁻¹⁰)    260 (P < 1 × 10⁻¹⁰)    132 (P < 1 × 10⁻¹⁰)    52 (P < 1 × 10⁻¹⁰)     –

Predicted Arabidopsis Interactome

Plant Physiol. Vol. 151, 2009 39


coexpression does not equal interaction. Although gene coexpression and protein interaction are well correlated in prokaryotes, the correlation is not as strong in eukaryotes (Bhardwaj and Lu, 2005), probably because the layer of regulatory mechanisms is different between coexpression and protein interaction (Obayashi et al., 2009). This was the reason why, in this work, coexpression was employed as only one feature among several; other types of indirect evidence were also considered to make more reliable predictions.

However, although coexpression is weakly correlated with interaction and not enough to indicate an interaction by itself, we indeed observed an increased level of expression correlation in our predicted interactions. The ATTED-II database used two measurements to gauge the correlations between expression profiles: the Pearson’s correlation (PC) and mutual rank (MR). A gene pair with higher PC and lower MR was more likely to interact. Results showed that our predicted interactions had an average PC of 0.1341, higher than that of all protein pairs (0.0075; P < 1 × 10⁻¹⁰), and the mean MR of our predicted interactions was 7,454.6, lower than that of all protein pairs (11,286.6; P < 1 × 10⁻¹⁰). This increase of expression correlation was similarly observed in our gold-standard interactions, which displayed an average PC of 0.1224 (P < 1 × 10⁻¹⁰) and a mean MR of 7,994.5 (P < 1 × 10⁻¹⁰). Interestingly, the Wilcoxon rank sum test indicated that the average PC of our predicted interactions (0.1341) was slightly higher than that of the gold-standard interactions (0.1224; P = 0.002), while the mean MR of our predicted interactions (7,454.6) was largely comparable to that of the gold-standard interactions (7,994.5; P = 0.077, no significance).

It was also noted that our predicted interactions had a significant overlap with the potential interactions identified by homology to known interactions in other species. Existence of homologous interactions was one piece of indirect evidence that we did not use in our prediction scheme. Therefore, if predictions by the homology-based approach were independent of ours, the interactions predicted by both approaches should be more reliable. Indeed, for this set of interactions, we had an average SVM confidence score of 1.586 (detailed in “Data Access” below), which was higher than the average of all our predicted interactions (0.761; P < 1 × 10⁻¹⁰ by Wilcoxon rank sum test). On the other side, Geisler-Lee et al. (2007) computed a confidence value to indicate the confidence of each homology-predicted interaction. For the interactions predicted by both approaches, we observed an average confidence value of 18.507, higher than that of all the homology-predicted interactions (4.208; P < 1 × 10⁻¹⁰). These results suggested the opportunity to derive a set of predicted interactions of extra high confidence. For this purpose, we assembled an updated set of known interactions consisting of 198,537 low-throughput interactions from Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and H. sapiens. These interactions were then mapped onto 275,403 potential Arabidopsis interactions based on the latest homologous gene group assignments as in the Inparanoid database (O’Brien et al., 2005). Among them, 7,927 interactions were also predicted by our high-confidence model. In this set, the average number of homologous interactions for one predicted interaction was approximately 1.79, larger than that in all the homology-based predictions (1.39; P < 2.2 × 10⁻¹⁶). In other words, our SVM model was able to pick out those homology-based predictions that had more support from reported interactions in other species.
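The mapping of known interactions onto Arabidopsis through homologous gene groups can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the identifiers are placeholders, the `orthologs` dictionary stands in for group assignments such as those in Inparanoid, and skipping self-pairs is an assumption.

```python
from itertools import product


def map_interologs(known_interactions, orthologs):
    """Map interactions from other species onto Arabidopsis protein pairs.

    known_interactions -- iterable of (protein_a, protein_b) in a source species
    orthologs          -- dict: source protein -> set of Arabidopsis proteins
    Returns a dict: Arabidopsis pair -> number of supporting homologous
    interactions.
    """
    support = {}
    for a, b in known_interactions:
        for x, y in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if x == y:
                continue  # assumption: self-pairs are not reported
            pair = tuple(sorted((x, y)))
            support[pair] = support.get(pair, 0) + 1
    return support


# Hypothetical toy data (placeholder identifiers, not real proteins).
known = [("y1", "y2"), ("h1", "h2")]
orthologs = {"y1": {"AT_A"}, "y2": {"AT_B"}, "h1": {"AT_B"}, "h2": {"AT_A", "AT_C"}}
support = map_interologs(known, orthologs)

# Homology-based predictions that are also predicted by the SVM model.
svm_predicted = {("AT_A", "AT_B")}
shared = {pair: n for pair, n in support.items() if pair in svm_predicted}
```

The per-pair count in `support` corresponds to the "number of homologous interactions" that is averaged in the comparison above.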

Analysis of the Meiotic Recombination-Related Interaction Network

Usage of our predicted interactome was illustrated by an analysis of the meiotic recombination-related interactions. Meiosis is essential for sexual reproduction, and recombination is a critical step of normal meiosis. Understanding its underlying molecular mechanisms and regulation has attracted constant attention. The current model of meiotic recombination, known as the double-strand break repair mechanism, was mostly established through analysis of the budding yeast, S. cerevisiae. According to this model (Supplemental Fig. S2), recombination is initiated by the introduction of a double-strand break into the recipient chromatid by a double-strand endonuclease (step 1). This cut is enlarged to a gap, and 3′ single-stranded termini are produced by the action of exonucleases (step 2). One of the free 3′ ends then invades a homologous region of the donor duplex, displacing a small D loop (step 3). The D loop is then enlarged by repair synthesis until the other 3′ end can anneal to complementary single-stranded sequences. Repair synthesis from the second 3′ end completes the process of gap repair, and branch migration results in the formation of two Holliday junctions (step 4). Resolution of two junctions by cutting either inner or outer strands leads to two possible noncrossover and two possible crossover configurations (step 5; Szostak et al., 1983). The Arabidopsis proteins known to function in recombination were reviewed by Wijeratne and Ma (2007).

In our predicted high-confidence interaction network, 11 proteins mentioned in this review were identified as seed proteins (Table V) and extracted with their first neighbors to create a subnetwork for in-depth analysis. This subnetwork contained 122 proteins and 884 interactions (Fig. 3). Among them, eight interactions had been experimentally confirmed and 40 interactions were supported by homologous interactions in other species (Fig. 3; Supplemental Tables S2 and S3). For example, BRCA2A and BRCA2B, orthologs of the human breast cancer susceptibility protein 2, had been reported to physically interact with DMC1 and RAD51 in Arabidopsis (Siaud et al., 2004; Dray et al., 2006), which were also predicted by our SVM model. Another example of predicted interactions supported by known homologue interactions could be found among the Structural Maintenance of Chromosomes family proteins (i.e. SMC1, SMC2, and SMC3) and the Sister Chromatid Cohesion family proteins (i.e. SYN2, SYN3, and SYN4). These proteins are part of the evolutionarily conserved cohesin complex that mediates chromosome cohesion and enables accurate chromosome segregation in both mitosis and meiosis (Revenkova and Jessberger, 2005). A clustering analysis was also performed with the Markov Cluster algorithm (Enright et al., 2002) to reveal the internal structure of our predicted subnetwork. Seven clusters were identified, with six of them connected (Fig. 3).

Interestingly, the seed proteins that function in different recombination steps were distributed in nonoverlapping clusters (Fig. 3; Supplemental Table S4), suggesting that our subnetwork structure was functionally relevant. A closer look at the GO biological process annotations of these proteins indicated that proteins in the same cluster were often involved in similar or related biological processes. For example, for the first step of recombination that creates a DNA double-strand break, there were two seed proteins, meiotic recombination proteins SPO11-1 (AtSPO11-1) and SPO11-2 (AtSPO11-2; Grelon et al., 2001; Stacey et al., 2006). Both of them were partitioned into cluster 1, together with three other proteins. All of them were involved in the process of “DNA topological change.” Similarly, the four seed proteins for the third step of recombination (i.e. DNA repair protein RAD51 homolog 1 [RAD51] and homolog 3 [RAD51C], DNA repair protein XRCC3 homolog, and meiotic recombination protein DMC1 homolog) were all partitioned into cluster 4, together with seven other proteins. These proteins were all annotated with similar or related biological processes, such as “base-excision repair,” “DNA repair,” and “meiosis.” The seed proteins for step 4 made two clusters, cluster 5 and cluster 6. Both of them were seemingly homogenous. Cluster 5 had the seed protein Rock-N-Rollers (RCK), a DNA helicase required for interference-sensitive crossover events (Chen et al., 2005). It also contained 21 other helicases, including the RecQ helicase family (i.e. RecQL2, RecQL3, RecQL4A, RecQL1, and RecQL4B), required for genome stability maintenance (Hartung et al., 2000). RecQL2 and RecQL4A are able to suppress homologous recombination (Hartung et al., 2007; Kobbe et al., 2008), while RecQL4B is specifically required to promote crossovers (Hartung et al., 2007). These opposing effects suggested that a delicate regulation of homologous crossover might be achieved by this cluster of proteins. The other cluster, cluster 6, was around MSH4, a meiosis-specific member of the MutS homolog family (Higgins et al., 2004). This cluster consisted of proteins involved in “mismatch repair” (e.g. MutS homolog proteins MSH2, MSH3, MSH4, and MSH7) and “regulation of transcription” (e.g. Arabidopsis trithorax-like proteins ATX1, ATX2, ATX3, ATX4, and ATX5). It was known that MSH2, a DNA mismatch repair gene MutS homolog, was able to repress recombination of mismatched heteroduplexes (Emmanuel et al., 2006). One of the trithorax-like proteins, ATX1, was reported to be an epigenetic regulator with histone H3 Lys-4 methyltransferase activity (Alvarez-Venegas et al., 2003), which is critical in yeast for the formation of programmed DNA double-strand breaks that initiate homologous recombination during meiosis (Borde et al., 2009). These reports suggested the functional relevance of cluster 6 proteins.
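The extraction of seed proteins together with their first neighbors, used to build the subnetwork analyzed above, can be sketched as follows. The node names are placeholders, and keeping the edges between two non-seed neighbors is an assumption; the text does not say whether such edges were retained.

```python
def first_neighbor_subnetwork(edges, seeds):
    """Return the nodes and edges of the seed-plus-first-neighbor subnetwork.

    edges -- iterable of (protein_a, protein_b) predicted interactions
    seeds -- iterable of seed proteins
    """
    edges = [tuple(e) for e in edges]
    seeds = set(seeds)
    nodes = set(seeds)
    for a, b in edges:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)
    # Assumption: every predicted edge between two retained nodes is kept,
    # including edges that connect two non-seed neighbors.
    sub_edges = {tuple(sorted(e)) for e in edges if e[0] in nodes and e[1] in nodes}
    return nodes, sub_edges


# Hypothetical toy network with a single seed.
toy_edges = [("SEED1", "N1"), ("SEED1", "N2"), ("N1", "N2"), ("N2", "FAR1")]
nodes, sub_edges = first_neighbor_subnetwork(toy_edges, ["SEED1"])
```

A clustering step such as the Markov Cluster algorithm would then be run on `sub_edges`; it is not reproduced here.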

There were also cases of proteins in one cluster that displayed more heterogeneous annotations. For example, although one of the seed proteins for the second step of recombination, DNA repair protein RAD50, formed a relatively homogenous cluster (cluster 3; with proteins from the SMC family and the Sister Chromatid Cohesion family), the other seed protein, double-strand break repair protein MRE11, was part of cluster 2, with 32 other proteins annotated with a diversity of biological processes ranging from development to stress responses. However, it was noticed that, despite the fact that they belonged to a number of biological processes, many of the cluster 2 proteins had the capability to bind small molecules, such as calcium, FK506, and cyclosporine A. It was reported that

Table V. Arabidopsis proteins that function in meiotic recombination

These proteins, as reviewed by Wijeratne and Ma (2007), were used as seeds to create a subnetwork of meiotic recombination-related interactions in our predicted interactome.

Gene Symbol  Arabidopsis Genome Initiative No.  Function in Recombination                       Cluster No.
AtSPO11-1    AT3G13170                          Double-strand break induction (step 1)          1
AtSPO11-2    AT1G63990                          Double-strand break induction (step 1)          1
MRE11        AT5G54260                          Processing of broken ends (step 2)              2
RAD50        AT2G31970                          Processing of broken ends (step 2)              3
DMC1         AT3G22880                          Strand invasion (step 3)                        4
RAD51        AT5G20850                          Strand invasion (step 3)                        4
RAD51C       AT2G45280                          Strand invasion (step 3)                        4
XRCC3        AT5G57450                          Strand invasion (step 3)                        4
RCK          AT3G27730                          Repair synthesis and branch migration (step 4)  5
MSH4         AT4G17380                          Repair synthesis and branch migration (step 4)  6
MLH1         AT4G09140                          Holliday junction resolution (step 5)           7


calmodulin antagonists and cAMP inhibited the enhancement of double-strand break repair by ionizing radiation in human cells (Wang et al., 2000). Our data implied the possibility that MRE11 might mediate this suppression. It would be interesting to experimentally demonstrate whether MRE11 actually participates in the regulation of double-strand break repair by calmodulin antagonists and cAMP as well as in the regulation by other small molecules suggested by cluster 2 proteins.

Data Access

In order to facilitate the use of our predicted interactions, we present PAIR, which is freely accessible at http://www.cls.zju.edu.cn/pair/. The current version of PAIR supports searches by protein, by interaction, and by homologous interaction in S. cerevisiae, D. melanogaster, C. elegans, and H. sapiens. To search for a protein, users can specify its name (or alias), locus, or National Center for Biotechnology Information accession number. Interactions can be searched by specifying their component proteins or by specifying the component proteins in their homologous interactions. All queries support the use of wildcard characters such as “*” (representing a string of any length) and “?” (representing a single character). If multiple hits are found, a list is presented allowing the choice of a specific protein or interaction to show in detail.
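The wildcard semantics described for queries ("*" for a string of any length, "?" for a single character) match shell-style patterns, and can be illustrated with Python's standard `fnmatch` module. This is only an illustration of the matching rule; how PAIR implements its search is not stated, and `fnmatch` additionally gives special meaning to square brackets, which the database may not.

```python
from fnmatch import fnmatchcase


def search(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["RAD50", "RAD51", "RAD51C", "DMC1"]
print(search(names, "RAD5?"))  # ['RAD50', 'RAD51']
print(search(names, "RAD*"))   # ['RAD50', 'RAD51', 'RAD51C']
```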

Figure 3. The meiotic recombination-related proteins in our predicted interactome. Eleven seed proteins reviewed by Wijeratne and Ma (2007) and their first neighbors were extracted from our predicted interactions. Interactions between them can be grouped into seven clusters, with seed proteins involved in different recombination steps distributed in nonoverlapping clusters. Triangles represent the seed proteins and circles represent their neighbors. Proteins (nodes) are colored according to their GO molecular function annotations. Edges representing experimentally confirmed interactions are colored red. Those representing interactions homologous to known interactions in other organisms are colored blue. Other predicted interactions are colored gray. [See online article for color version of this figure.]


The protein detail page provides annotations of a protein taken from other databases, including its accession numbers, description, aliases, domain composition, GO annotations, and cellular localizations. Predicted interactions involving this protein are also listed at the bottom. The interaction detail page shows concise descriptions of the two component proteins and the evidence supporting this interaction, along with its homologous interactions in other species. Two types of per interaction confidence measurements are presented for each interaction. One is the LRs for features and the total LR for each interaction (see “Strength of the Indirect Evidence” above). These LRs are provided along with their percentile rankings in interaction examples to indicate their significance. The other type of per interaction confidence measurement is an SVM-based confidence score. It had been proposed that the absolute value of decision function output was correlated with the prediction accuracy (Hua and Sun, 2001). Based on this idea, we computed this value for each predicted interaction as its confidence score. Similarly, these confidence scores are provided along with their percentile rankings in interaction examples to indicate their significance. Whenever a list of interactions is presented, filters based on the total LR and the SVM-based confidence score are available for selection of more reliable predictions. However, because these per interaction confidence measurements were not rigorously assessed and were not able to directly translate into a probability of true prediction, they should be used with caution.

Users interested in pathway analysis are able to register with PAIR. After login, the database allows users to add specific interactions into a personalized storage place called “My Collection.” Once pathway mining is done, the interactions stored in My Collection can be exported in the HUPO-PSI format (Kerrien et al., 2007), which is readily usable by a variety of mainstream pathway analysis software packages, such as Cytoscape (Shannon et al., 2003).

Future Prospects

As discussed above, the accuracy and coverage of our predicted interactions may seem to reach a meaningful level to assist plant biologists. However, there are still margins for improvement. For example, our training examples were far from adequate. Our interaction examples covered approximately 1% of the total interactome. Rapid advances in plant molecular biology will certainly provide more interaction examples that are able to better characterize their shared characteristics and produce more accurate prediction models. Also, when the feature strength diagrams were compared among features describing the same type of indirect evidence, those based on experimental data or annotation tended to be more discriminative than those based on predicted data (e.g. the domain interaction features derived from Protein Data Bank complex entries versus the domain interaction features derived from predicted data). Yet, unfortunately, for a significant fraction of protein pairs, many experimental data-based features were missing. In other words, these features saw no differences in a great number of protein pairs, which considerably limited their strength and hence required prediction-based features to complement. Therefore, more comprehensive indirect data may also help to increase prediction accuracy. In addition, novel approaches for feature extraction (Ramani and Marcotte, 2003), kernel computation (Xu et al., 2008), and SVM optimization (Shen et al., 2007) may offer new opportunities to build a better prediction model.

For these reasons, regular updates of PAIR are necessary to take advantage of the increasing amount of interaction information and indirect evidence as well as the quickly advancing computational techniques in feature extraction and model construction. The PAIR database was established as part of the informatics infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China. Two updates per year are scheduled. Its long-term running cost is covered by a maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry from the Chinese Ministry of Science and Technology.

CONCLUSION

The PAIR database was constructed with careful consideration of the lessons learned in previous interaction prediction studies. It offers a high-confidence interaction set with an expected accuracy of approximately 48.7% and a high-coverage interaction set that was estimated to include approximately 73.5% of the entire interactome. These data currently represent the most accurate and comprehensive set of predicted Arabidopsis interactions. Detailed confidence measurements for each predicted interaction are also available. PAIR can be accessed freely online. As part of the infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China, PAIR will be updated and upgraded twice a year. It is our hope that this resource is able to complement the existing informatics framework for Arabidopsis research and offer further help in the study of important molecular mechanisms in this model plant.

MATERIALS AND METHODS

SVM

SVM is a statistical learning algorithm that learns from a set of training examples and builds a mathematical model to classify new examples. In typical settings, the training set contains two types (classes) of examples, positive examples (e.g. interactions) and negative examples (e.g. noninteractions). Each example is represented by a vector of real numbers (also known as features) describing its characteristics (e.g. indirect evidence of interaction). The SVM algorithm learns the relationship between the class labels and feature values in the training examples and encodes this relationship into a mathematical model. This model can then be used to predict the class label of any given example based on its feature values. A brief account of the SVM algorithm is provided in Supplemental Text S1.

Preparation of the Example Data Set

Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) are required to train an SVM prediction model. To assemble an accurate positive example set, we retrieved and integrated experimentally determined Arabidopsis (Arabidopsis thaliana) physical interactions from IntAct (Kerrien et al., 2006), BIND (Alfarano et al., 2005), BioGRID (Stark et al., 2006), and TAIR (Rhee et al., 2003). Because HTP experiments tend to suffer from high false-positive rates (Deane et al., 2002; von Mering et al., 2002), we excluded interactions only supported by HTP experiments. As shown in Table I, this resulted in 4,139 interactions involving 1,679 proteins.

Unlike interacting proteins, it is rare to find reported noninteracting proteins. Several studies have attempted to choose negative examples as pairs of proteins with different cellular localizations. However, this approach did not seem to improve model accuracy. Furthermore, it was argued that this would lead to biased estimations of accuracy, because the constraints placed on the cellular distribution of the negative examples could make the prediction task easier (Ben-Hur and Noble, 2006). Therefore, we followed the approach described by Zhang et al. (2004) to select negative examples as random protein pairs that did not overlap with the positive examples. The resulting negative examples may contain several potential interactions. Yet, given the ratio of interacting protein pairs versus noninteracting protein pairs in yeast (approximately 1:775; Yu et al., 2008), this level of contamination was likely acceptable. In this way, we generated the same number of negative examples. The gold-standard data set consisted of a total of 8,278 protein pairs, half positive, half negative.
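The random selection of negative examples can be sketched as follows. This is a minimal illustration under stated assumptions: the protein pool from which pairs are drawn, the random seed, and the function name are all hypothetical, and the details of the procedure of Zhang et al. (2004) are not reproduced.

```python
import random


def sample_negatives(proteins, positives, n, seed=0):
    """Draw n random protein pairs that do not overlap the positive examples.

    proteins  -- list of protein identifiers to draw from (assumed pool)
    positives -- iterable of known interacting pairs
    """
    positive_set = {frozenset(pair) for pair in positives}
    possible = len(proteins) * (len(proteins) - 1) // 2 - len(positive_set)
    if n > possible:
        raise ValueError("not enough non-positive pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(proteins, 2))  # two distinct proteins
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Hypothetical toy data: one negative per positive (the 1:1 setting).
proteins = ["P1", "P2", "P3", "P4", "P5"]
positives = [("P1", "P2"), ("P3", "P4")]
negatives = sample_negatives(proteins, positives, n=len(positives))
```

Drawing 100 times as many negatives instead of an equal number would correspond to the 1:100 training setting discussed in the results.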

Features

Four types of informative indirect evidence were chosen for interaction prediction based on previous studies on the strength of various indirect evidence (Qi et al., 2006; Suthram et al., 2006). Fourteen features were computed by different mathematical characterizations of this evidence. Therefore, a 14-dimension feature vector was constructed to represent each protein pair.

The first type of indirect evidence was gene coexpression. Interacting proteins are often coexpressed. A large set of Arabidopsis gene expression profiles was available at the TAIR database (ftp://ftp.arabidopsis.org/home/tair/Microarrays/), which contained 1,436 experiments normalized by a robust multiarray average method (Rhee et al., 2003). Using this data set, we calculated the Pearson’s correlation coefficient for the expression profiles of each protein pair as one feature.
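The coexpression feature can be illustrated with a plain implementation of the Pearson’s correlation coefficient. The expression values are made up, and returning 0 for a flat profile is an assumption rather than a documented choice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a constant profile carries no evidence
    return cov / (norm_x * norm_y)


# Hypothetical expression profiles across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]
coexpression_feature = pearson(gene_a, gene_b)
```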

The second type of indirect evidence was domain interaction. Protein interactions involve physical interactions between their domains. It has been proposed that novel protein interactions can be inferred by known domain interactions. Here, we annotated the domain composition of each Arabidopsis protein according to the Pfam database (Coggill et al., 2008), which assigned one or more of the 2,854 distinctive domains to one or more of the 20,183 Arabidopsis proteins. Known interacting domains were retrieved from the DOMINE database (Raghavachari et al., 2008), which contained two data sets derived from the Protein Data Bank molecular complex entries and seven data sets predicted by computational approaches. According to each data set, we counted the number of interacting domains in a pair of proteins as one feature. This resulted in nine different features.
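One way to read the domain interaction features is as a count, per data set, of domain pairs (one domain from each protein) that are listed as interacting. The sketch below follows that reading, which is an interpretation of the text; the domain identifiers and data set names are placeholders.

```python
def count_interacting_domains(domains_a, domains_b, interacting_pairs):
    """Count domain pairs, one domain from each protein, known to interact."""
    known = {frozenset(pair) for pair in interacting_pairs}
    return sum(
        1
        for da in domains_a
        for db in domains_b
        if frozenset((da, db)) in known
    )


def domain_features(domains_a, domains_b, datasets):
    """One feature per domain-interaction data set, in a fixed order."""
    return [
        count_interacting_domains(domains_a, domains_b, datasets[name])
        for name in sorted(datasets)
    ]


# Hypothetical domain annotations and two toy data sets.
protein_a = {"D1", "D2"}
protein_b = {"D3"}
datasets = {
    "predicted": [("D1", "D3"), ("D2", "D3")],
    "structure_based": [("D1", "D3")],
}
features = domain_features(protein_a, protein_b, datasets)
```

With nine data sets in `datasets`, the same function would return the nine domain interaction features.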

The third type of indirect evidence was shared annotation. Interacting proteins were expected to share similar molecular functions, to be involved in the same biological processes, and to localize in the same cellular components. GO (Raghavachari et al., 2008) provided a standardized way to annotate proteins from these aspects and therefore offered an excellent chance to compare their annotations. In the GO term tree, strongly related terms were expected to have a shared parent term close to the leaf terms and narrow in concept. The fraction of proteins annotated to this shared parent term and all of its child terms reflects the unexpectedness that two proteins share this particular parent annotation term. Consequently, this fraction could be a suitable score to measure the similarity between two annotation terms. The smaller this score is, the more similar the two terms are. Extending the concept of term similarity, we defined the protein annotation similarity score as the smallest term similarity score between their annotation terms. In TAIR, there were 112,347 annotations assigned to Arabidopsis gene products. By focusing specifically on the terms in one of the three categories (i.e. molecular function, biological process, and cellular component), we computed three protein annotation similarity scores as three features.
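The annotation similarity score can be sketched on a toy term hierarchy. The term names, the `parents` mapping, and the choice to return 1.0 when two proteins share no ancestor term are assumptions made for illustration; the sketch takes the smallest fraction over all shared ancestor terms, which is the narrowest shared parent in the sense described above.

```python
def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def term_fractions(annotations, parents):
    """Fraction of proteins annotated to each term or to any of its children."""
    covered = {}
    for protein, terms in annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                covered.setdefault(anc, set()).add(protein)
    total = len(annotations)
    return {term: len(proteins) / total for term, proteins in covered.items()}


def protein_similarity(p1, p2, annotations, parents, fractions):
    """Smallest shared-ancestor fraction over all term pairs (smaller = more similar)."""
    best = 1.0  # assumption: no shared term behaves like sharing the root
    for t1 in annotations[p1]:
        for t2 in annotations[p2]:
            for shared in ancestors(t1, parents) & ancestors(t2, parents):
                best = min(best, fractions[shared])
    return best


# Hypothetical hierarchy: root -> A -> {A1, A2}, root -> B.
parents = {"A": ["root"], "A1": ["A"], "A2": ["A"], "B": ["root"]}
annotations = {"P1": {"A1"}, "P2": {"A2"}, "P3": {"B"}, "P4": {"B"}}
fractions = term_fractions(annotations, parents)
close = protein_similarity("P1", "P2", annotations, parents, fractions)
distant = protein_similarity("P1", "P3", annotations, parents, fractions)
```

Restricting `annotations` and `parents` to one GO category at a time would give the three separate similarity features.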

The last type of indirect evidence was colocalization. Interacting proteins are usually colocalized. We obtained protein subcellular localization information from the Arabidopsis Subcellular Database (Heazlewood et al., 2007). The colocalization feature was then calculated as described by Qiu and Noble (2008). Let f(l) be the fraction of all proteins presented in location l, and let I(i,l) be an indicator function indicating that protein i is observed to localize to l. The colocalization feature value is then calculated as follows: $F(A,B) = \max_{l} \{ -\log(f(l))\, I(A,l)\, I(B,l) \}$. In other words, for protein pair (A,B), this feature was computed as the negative logarithm of the fraction of all proteins presented in the most specific location where both proteins A and B were observed to localize.
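The colocalization feature can be computed directly from its definition. In the sketch below the localization data are made up and the natural logarithm is assumed, since the base is not stated; a pair with no shared location receives 0 because every indicator product vanishes.

```python
from math import log


def colocalization(a, b, locations):
    """Negative log of the protein fraction in the most specific shared location.

    locations -- dict: protein -> set of observed subcellular locations
    """
    total = len(locations)
    counts = {}
    for observed in locations.values():
        for loc in observed:
            counts[loc] = counts.get(loc, 0) + 1
    shared = locations[a] & locations[b]
    if not shared:
        return 0.0  # all indicator products are zero
    return max(-log(counts[loc] / total) for loc in shared)


# Hypothetical localization data for four proteins.
locations = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytosol"},
    "P3": {"cytosol"},
    "P4": {"cytosol"},
}
shared_nucleus = colocalization("P1", "P2", locations)   # -log(2/4)
shared_cytosol = colocalization("P2", "P3", locations)   # -log(3/4)
no_shared = colocalization("P1", "P3", locations)        # 0.0
```

The rarer the shared location, the larger the feature value, which is why sharing the nucleus scores higher than sharing the cytosol in this toy data.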

As shown in Table II, 14 features were computed for each pair of proteins: one feature based on coexpression, one feature based on colocalization, nine features based on domain interaction, and three features based on shared annotation.

Construction of a Prediction Model

An SVM model was constructed to predict Arabidopsis protein interactions. The SVM algorithm is well documented (Winters-Hilt et al., 2006). A brief introduction is also provided in Supplemental Text S1. For implementation, we used the public software package LIBSVM version 2.88 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). After comprehensive evaluation, we chose to use the radial basis function kernel for its best performance (data not shown). In this setup, two parameters, the regularization parameter C and the kernel width parameter γ, were optimized by a grid search in an empirical range (0.001 to 1,000 for C and 3 × 10⁻⁵ to 4 × 10⁴ for γ), maximizing the F1 measure as defined below. Model performance with each parameter set was estimated by 10-fold cross-validation.
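The grid search over C and γ can be sketched as a loop over candidate values that keeps the pair with the best cross-validated F1. The grid spacing is not given in the text, so powers of ten are assumed here, and `toy_score` is a hypothetical stand-in for the step that would train LIBSVM and return the 10-fold cross-validated F1.

```python
from itertools import product
from math import log10


def log_grid(low_exp, high_exp, base=10.0):
    """Candidate values base**low_exp ... base**high_exp (assumed spacing)."""
    return [base ** e for e in range(low_exp, high_exp + 1)]


def grid_search(c_values, gamma_values, score):
    """Return (best_score, C, gamma) maximizing score(C, gamma)."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        value = score(c, gamma)
        if best is None or value > best[0]:
            best = (value, c, gamma)
    return best


def toy_score(c, gamma):
    """Hypothetical placeholder for a cross-validated F1 estimate."""
    return 1.0 / (1.0 + abs(log10(c) - 1.0) + abs(log10(gamma) + 2.0))


c_values = log_grid(-3, 3)       # 0.001 to 1,000
gamma_values = log_grid(-4, 4)   # illustrative subset of the stated range
best_f1, best_c, best_gamma = grid_search(c_values, gamma_values, toy_score)
```

In practice, `score` would wrap model training and evaluation; only the search logic is shown here.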

Based on whether a real positive or negative example in the testing set was predicted as positive or negative, the prediction result for each protein pair falls into one of four categories: true positive (TP), when a real positive example was predicted correctly; true negative (TN), when a real negative example was predicted correctly; false positive (FP), when a real negative example was predicted as positive; and false negative (FN), when a real positive example was predicted as negative. The counts of these four categories together form the confusion matrix (Table III). Many accuracy measurements are derived from the confusion matrix, such as the sensitivity [or recall; TP/(TP + FN)], specificity [TN/(TN + FP)], and precision [TP/(TP + FP)]. In predicting interactions, the ability to recognize potential interactions accurately and comprehensively is critical. Therefore, precision and recall are the most relevant measurements. Precision conveys how confident we are when the model predicts an interaction. Recall measures what proportion of the real interactions can be recognized by the model. Giving equal weights to precision (p) and recall (r), Rijsbergen (1979) introduced the “F1 measure” as their harmonic mean [2rp/(r + p)]. This measurement favors a balanced accuracy on both interacting and noninteracting

protein pairs. For example, if a model predicted all examples as noninteractions, the overall accuracy would be the fraction of noninteractions in the test data set. However, the F1 measure would be zero, reflecting the failure to recognize any real interaction.
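The measurements defined above can be computed directly from the four confusion matrix counts. The sketch below is illustrative; the class name `ConfusionMatrix` and the convention of returning 0 when a denominator is zero are assumptions, not part of the original method.

```python
from dataclasses import dataclass


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # real interactions predicted as interactions
    fp: int  # noninteractions predicted as interactions
    tn: int  # noninteractions predicted as noninteractions
    fn: int  # real interactions predicted as noninteractions

    @property
    def recall(self):  # also called sensitivity: TP / (TP + FN)
        return _ratio(self.tp, self.tp + self.fn)

    @property
    def specificity(self):  # TN / (TN + FP)
        return _ratio(self.tn, self.tn + self.fp)

    @property
    def precision(self):  # TP / (TP + FP)
        return _ratio(self.tp, self.tp + self.fp)

    @property
    def accuracy(self):  # fraction of all examples predicted correctly
        return _ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn)

    @property
    def f1(self):  # harmonic mean of precision and recall: 2rp / (r + p)
        p, r = self.precision, self.recall
        return _ratio(2 * p * r, p + r)
```

For instance, a model that never predicts an interaction has TP = 0, so under this convention its F1 measure is 0 no matter how high its overall accuracy is.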

Comparison of Our Predictions with Other Predicted Interactomes

Our predicted interactions were compared with three published data sets. The Interologs data set was obtained from the supplemental data of Geisler-Lee et al. (2007). The AtPID data set was retrieved from the AtPID database (version 3; Cui et al., 2008). The ATTED-II data set was downloaded from the ATTED-II database (version 4.1; Obayashi et al., 2009). The ATTED-II database did not explicitly suggest a criterion by which potential interactions should be identified from the expression correlations it provided. However, this database did provide a set of 48,276 highly coexpressed gene pairs, consisting of each gene paired with the three other genes that displayed the highest expression similarities. Coexpression clearly does not equal interaction. Yet for the purpose of comparing predicted interactomes with a coexpression network, we took these highly coexpressed gene pairs as the potential interactions identified by the coexpression method. It should also be noted that in our comparison, the Interologs set and the AtPID set included all of their predicted interactions. Their gold-standard examples not recognized by their respective prediction methods were not counted.

The overlaps between data sets were counted, and the probabilities of these overlaps appearing by chance were calculated as follows. For any two data sets, we first estimated, over 10,000 iterations, the distribution of overlaps between two random sets of protein pairs having the same sizes and protein distributions as the two data sets being compared. Assuming a normal distribution, the probability of the observed overlap between these two data sets was then computed against the distribution of random overlaps. This probability was taken as the significance value for these two data sets being correlated.

Supplemental Data

The following materials are available in the online version of this article.

Supplemental Figure S1. ROC curve analysis of four prediction methods.

Supplemental Figure S2. Double-strand break repair model for meiotic recombination.

Supplemental Table S1. Rediscovery of newly reported interactions by our prediction models.

Supplemental Table S2. Eight experimentally confirmed interactions in our predicted meiotic recombination subnetwork.

Supplemental Table S3. Forty predicted interactions with known homologous interactions in our predicted meiotic recombination subnetwork.

Supplemental Table S4. The predicted meiotic recombination subnetwork.

Supplemental Text S1. A brief introduction to the SVM algorithm and the complete set of feature analysis diagrams.

ACKNOWLEDGMENTS

We thank Lu Zhang for helpful discussions on feature extraction and Li

Chen for help in Web site construction.

Received May 11, 2009; accepted July 6, 2009; published July 10, 2009.

LITERATURE CITED

Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K,

Betel D, Bobechko B, Boutilier K, Burgess E, et al (2005) The Biomo-

lecular Interaction Network Database and related tools 2005 update.

Nucleic Acids Res 33: D418–D424

Alvarez-Venegas R, Pien S, Sadder M, Witmer X, Grossniklaus U,

Avramova Z (2003) ATX-1, an Arabidopsis homolog of trithorax, acti-

vates flower homeotic genes. Curr Biol 13: 627–637

Ben-Hur A, Noble WS (2006) Choosing negative examples for the predic-

tion of protein-protein interactions. BMC Bioinformatics (Suppl 1) 7: S2

Ben-Yacoub S, Abdeljaoued Y, Mayoraz E (1999) Fusion of face and

speech data for person identity verification. IEEE Trans Neural Netw 10:

1065–1074

Bhardwaj N, Lu H (2005) Correlation between gene expression profiles and

protein-protein interactions within and across genomes. Bioinformatics

21: 2730–2738

Borde V, Robine N, Lin W, Bonfils S, Geli V, Nicolas A (2009) Histone H3

lysine 4 trimethylation marks meiotic recombination initiation sites.

EMBO J 28: 99–111

Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS,

Ares M, Haussler D (2000) Knowledge-based analysis of microarray

gene expression data by using support vector machines. Proc Natl Acad

Sci USA 97: 262–267

Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine

learning: support vector machines for pharmaceutical data analysis.

Comput Chem 26: 5–14

Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ (2003) SVM-Prot: Web-based

support vector machine software for functional classification of a

protein from its primary sequence. Nucleic Acids Res 31: 3692–3697

Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV,

Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction

database. Nucleic Acids Res 35: D572–D574

Chen C, Zhang W, Timofejeva L, Gerardin Y, Ma H (2005) The Arabidop-

sis ROCK-N-ROLLERS gene encodes a homolog of the yeast ATP-

dependent DNA helicase MER3 and is required for normal meiotic

crossover formation. Plant J 43: 321–334

Coggill P, Finn RD, Bateman A (2008) Identifying protein domains with

the Pfam database. Curr Protoc Bioinformatics 23: 2.5.1–2.5.17

Cui J, Li P, Li G, Xu F, Zhao C, Li Y, Yang Z, Wang G, Yu Q, Shi T (2008)

AtPID: Arabidopsis thaliana Protein Interactome Database. An integra-

tive platform for plant systems biology. Nucleic Acids Res 36: D999–

D1008

Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interac-

tions: two methods for assessment of the reliability of high throughput

observations. Mol Cell Proteomics 1: 349–356

de Vel O, Anderson A, Corney M, Mohay G (2001) Mining e-mail content

for author identification forensics. Sigmod Record 30: 55–64

Ding CH, Dubchak I (2001) Multi-class protein fold recognition using

support vector machines and neural networks. Bioinformatics 17:

349–358

Dray E, Siaud N, Dubois E, Doutriaux MP (2006) Interaction between

Arabidopsis Brca2 and its partners Rad51, Dmc1, and Dss1. Plant

Physiol 140: 1059–1069

Emmanuel E, Yehuda E, Melamed-Bessudo C, Avivi-Ragolsky N, Levy

AA (2006) The role of AtMSH2 in homologous recombination in

Arabidopsis thaliana. EMBO Rep 7: 100–105

Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm

for large-scale detection of protein families. Nucleic Acids Res 30:

1575–1584

Geisler-Lee J, O’Toole N, Ammar R, Provart NJ, Millar AH, Geisler

M (2007) A predicted interactome for Arabidopsis. Plant Physiol 145:

317–329

Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi

CE, Godwin B, Vitols E, et al (2003) A protein interaction map of

Drosophila melanogaster. Science 302: 1727–1736

Grelon M, Vezon D, Gendrot G, Pelletier G (2001) AtSPO11-1 is necessary

for efficient meiotic recombination in plants. EMBO J 20: 589–600

Hanley JA, McNeil BJ (1982) The meaning and use of the area under a

receiver operating characteristic (ROC) curve. Radiology 143: 29–36

Hartung F, Plchova H, Puchta H (2000) Molecular characterisation of RecQ

homologues in Arabidopsis thaliana. Nucleic Acids Res 28: 4275–4282

Hartung F, Suer S, Puchta H (2007) Two closely related RecQ helicases have

antagonistic roles in homologous recombination and DNA repair in

Arabidopsis thaliana. Proc Natl Acad Sci USA 104: 18836–18841

Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH (2007)

SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res 35:

D213–D218

Higgins JD, Armstrong SJ, Franklin FC, Jones GH (2004) The Arabidopsis

MutS homolog AtMSH4 functions at an early step in recombination:

evidence for two classes of recombination in Arabidopsis. Genes Dev 18:

2557–2570

Hua S, Sun Z (2001) A novel method of protein secondary structure

prediction with high segment overlap measure: support vector machine

approach. J Mol Biol 308: 397–407

Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A,

Snyder M, Greenblatt JF, Gerstein M (2003) A Bayesian networks

approach for predicting protein-protein interactions from genomic data.

Science 302: 449–453

Karlsen RE, Gorsich DJ, Gerhart GR (2000) Target classification via

support vector machines. Opt Eng 39: 704–711

Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C,

Dimmer E, Feuermann M, Friedrichsen A, Huntley R, et al (2006)

IntAct: open source resource for molecular interaction data. Nucleic

Acids Res 35: D561–D565

Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod

N, Bader GD, Xenarios I, Wojcik J, Sherman D, et al (2007) Broadening

the horizon: level 2.5 of the HUPO-PSI format for molecular interac-

tions. BMC Biol 5: 44

Kim KI, Jung K, Park SH, Kim HJ (2001) Support vector machine-based

text detection in digital video. Pattern Recognit 34: 527–529


Kobbe D, Blanck S, Demand K, Focke M, Puchta H (2008) AtRECQ2, a

RecQ helicase homologue from Arabidopsis thaliana, is able to disrupt

various recombinogenic DNA structures in vitro. Plant J 55: 397–405

Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO,

Han JD, Chesneau A, Hao T, et al (2004) A map of the interactome

network of the metazoan C. elegans. Science 303: 540–543

Liong SY, Sivapragasam C (2002) Flood stage forecasting with support

vector machines. J Am Water Resour Assoc 38: 173–186

Obayashi T, Hayashi S, Saeki M, Ohta H, Kinoshita K (2009) ATTED-II

provides coexpressed gene networks for Arabidopsis. Nucleic Acids

Res 37: D987–D991

O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehen-

sive database of eukaryotic orthologs. Nucleic Acids Res 33: D476–D480

Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different

biological data and computational classification methods for use in

protein interaction prediction. Proteins 63: 490–500

Qiu J, Noble WS (2008) Predicting co-complexed protein pairs from

heterogeneous data. PLOS Comput Biol 4: e1000054

Raghavachari B, Tasneem A, Przytycka TM, Jothi R (2008) DOMINE:

a database of protein domain interactions. Nucleic Acids Res 36:

D656–D661

Ramani AK, Marcotte EM (2003) Exploiting the co-evolution of interacting

proteins to discover interaction specificity. J Mol Biol 327: 273–284

Revenkova E, Jessberger R (2005) Keeping sister chromatids together:

cohesins in meiosis. Reproduction 130: 783–790

Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez

M, Huala E, Lander G, Montoya M, Miller N, et al (2003) The

Arabidopsis Information Resource (TAIR): a model organism database

providing a centralized, curated gateway to Arabidopsis biology, re-

search materials and community. Nucleic Acids Res 31: 224–228

Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram

S, Ghosh D, Pandey A, Chinnaiyan AM (2005) Probabilistic model of the

human protein-protein interaction network. Nat Biotechnol 23: 951–959

Rijsbergen CJ (1979) Information Retrieval. Butterworths, London

Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The

Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32:

D449–D451

Scott MS, Barton GJ (2007) Probabilistic prediction and ranking of human

protein-protein interactions. BMC Bioinformatics 8: 239

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N,

Schwikowski B, Ideker T (2003) Cytoscape: a software environment for

integrated models of biomolecular interaction networks. Genome Res

13: 2498–2504

Shen Q, Shi WM, Kong W, Ye BX (2007) A combination of modified

particle swarm optimization algorithm and support vector machine for

gene selection and tumor classification. Talanta 71: 1679–1683

Siaud N, Dray E, Gy I, Gerard E, Takvorian N, Doutriaux MP (2004) Brca2

is involved in meiosis in Arabidopsis thaliana as suggested by its

interaction with Dmc1. EMBO J 23: 1392–1401

Stacey NJ, Kuromori T, Azumi Y, Roberts G, Breuer C, Wada T, Maxwell

A, Roberts K, Sugimoto-Shirasu K (2006) Arabidopsis SPO11-2 func-

tions with SPO11-1 in meiotic recombination. Plant J 48: 206–216

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a

general repository for interaction datasets. Nucleic Acids Res 34:

D535–D539

Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T (2006) A direct comparison of

protein interaction confidence assignment schemes. BMC Bioinformat-

ics 7: 360

Szostak JW, Orr-Weaver TL, Rothstein RJ, Stahl FW (1983) The double-

strand-break repair model for recombination. Cell 33: 25–35

Tsesmetzis N, Couchman M, Higgins J, Smith A, Doonan JH, Seifert GJ,

Schmidt EE, Vastrik I, Birney E, Wu G, et al (2008) Arabidopsis reactome:

a foundation knowledgebase for plant systems biology. Plant Cell 20:

1426–1436

Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D,

Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, et al (2000) A

comprehensive analysis of protein-protein interactions in Saccharomyces

cerevisiae. Nature 403: 623–627

Vapnik VN (1998) Statistical Learning Theory. Wiley, New York

Vapnik VN (2000) The Nature of Statistical Learning Theory, Ed 2.

Springer, New York

von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002)

Comparative assessment of large-scale data sets of protein-protein

interactions. Nature 417: 399–403

Wang Y, Mallya SM, Sikpi MO (2000) Calmodulin antagonists and cAMP

inhibit ionizing-radiation-enhancement of double-strand-break repair

in human cells. Mutat Res 460: 29–39

Wijeratne AJ, Ma H (2007) Genetic analyses of meiotic recombination

in Arabidopsis. J Integr Plant Biol 49: 1199–1207

Winters-Hilt S, Yelundur A, McChesney C, Landry M (2006) Support

vector machine implementations for classification and clustering. BMC

Bioinformatics (Suppl 2) 7: S4

Xu H, Lin M, Wang W, Li Z, Huang J, Chen Y, Chen X (2007) Learning the

drug target-likeness of a protein. Proteomics 7: 4255–4263

Xu JH, Li F, Sun QF (2008) Identification of microRNA precursors with

support vector machine and string kernel. Genomics Proteomics Bio-

informatics 6: 121–128

Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ (2004) Effect of molecular

descriptor feature selection in support vector machine classification of

pharmacokinetic and toxicological properties of chemical agents.

J Chem Inf Comput Sci 44: 1630–1638

Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J,

Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al (2008) High-

quality binary protein interaction map of the yeast interactome network.

Science 322: 104–110

Yuan Z, Burrage K, Mattick JS (2002) Prediction of protein solvent

accessibility using support vector machines. Proteins 48: 566–570

Zhang LV, Wong SL, King OD, Roth FP (2004) Predicting co-complexed

protein pairs using genomic and proteomic data integration. BMC

Bioinformatics 5: 38

Zhao C, Zhang H, Zhang X, Zhang R, Luan F, Liu M, Hu Z, Fan B (2006a)

Prediction of milk/plasma drug concentration (M/P) ratio using sup-

port vector machine (SVM) method. Pharm Res 23: 41–48

Zhao CY, Zhang HX, Zhang XY, Liu MC, Hu ZD, Fan BT (2006b)

Application of support vector machine (SVM) for prediction toxic

activity of different data sets. Toxicology 217: 105–119
