computational identification of potential molecular interactions in arabidopsis1[c]
TRANSCRIPT
![Page 1: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/1.jpg)
Bioinformatics
Computational Identification of Potential MolecularInteractions in Arabidopsis1[C][W]
Mingzhi Lin2, Bin Hu2, Lijuan Chen, Peng Sun, Yi Fan, Ping Wu, and Xin Chen*
Department of Bioinformatics, Zhejiang University, Hangzhou, People’s Republic of China, 310058 (M.L.,B.H., L.C., Y.F., X.C.); and State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University,Hangzhou, People’s Republic of China, 310058 (P.S., P.W.)
Knowledge of the protein interaction network is useful to assist molecular mechanism studies. Several major repositories havebeen established to collect and organize reported protein interactions. Many interactions have been reported in several modelorganisms, yet a very limited number of plant interactions can thus far be found in these major databases. Computationalidentification of potential plant interactions, therefore, is desired to facilitate relevant research. In this work, we constructed asupport vector machine model to predict potential Arabidopsis (Arabidopsis thaliana) protein interactions based on a variety ofindirect evidence. In a 100-iteration bootstrap evaluation, the confidence of our predicted interactions was estimated to be48.67%, and these interactions were expected to cover 29.02% of the entire interactome. The sensitivity of our model wasvalidated with an independent evaluation data set consisting of newly reported interactions that did not overlap with theexamples used in model training and testing. Results showed that our model successfully recognized 28.91% of the newinteractions, similar to its expected sensitivity (29.02%). Applying this model to all possible Arabidopsis protein pairs resultedin 224,206 potential interactions, which is the largest and most accurate set of predicted Arabidopsis interactions at present. Inorder to facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource, with detailed annotationsand more specific per interaction confidence measurements. This database and related documents are freely accessible athttp://www.cls.zju.edu.cn/pair/.
The complex cellular functions of an organism relyon physical interactions between proteins. Decipher-ing the protein-protein interaction network to under-stand higher level phenotypes and their regulations isalways a major focus of both experimental biologistsand computational biologists. A number of high-throughput (HTP) assays have been developed toidentify in vitro protein interactions from severalmodel organisms (Uetz et al., 2000; Giot et al., 2003;Li et al., 2004). A number of initiatives, such as IntAct(Kerrien et al., 2006), Molecular INTeraction database(Chatr-aryamontri et al., 2007), the Database of Inter-acting Proteins (Salwinski et al., 2004), BiomolecularInteraction Network Database (BIND; Alfarano et al.,
2005), and BioGRID (Stark et al., 2006), have beenestablished to systematically collect and organize theinteraction data reported by both proteome-scale HTPexperiments and traditional low-throughput studiesfocusing on individual proteins or pathways.
Arabidopsis (Arabidopsis thaliana) has long beenstudied as a model organism to investigate the phys-iology, biochemistry, growth, development, and me-tabolism of a flowering plant at the molecular level.The molecular mechanism studies of various pheno-types and their regulations in Arabidopsis may befacilitated by a comprehensive reference protein inter-action network, based on which working hypothesescould be invented with more guidance and confi-dence. However, due to technological limitations,most experimentally reported protein interactions inavailable databases were from other organisms. Averylimited number of plant interactions could be found inthese databases. Therefore, an accurate prediction ofthe Arabidopsis interactome would be valuable toassist relevant research.
Studies on the computational identification of po-tential interactions started along with the advent ofHTP interaction-detection technologies, which oftenproduced a large number of false positives (Deaneet al., 2002). Indirect evidence of protein interaction(e.g. protein colocalization and relevance in function)were hence introduced to boost the confidence of HTPresults (Jansen et al., 2003). Further investigationsdemonstrated that direct inference of protein interac-tions from such indirect evidence alone was possible
1 This work was supported by the National Natural ScienceFoundation of China (grant no. 30600039), by the ChineseMinistry ofScience and Technology (maintenance grant to the State Key Labo-ratory of Plant Physiology and Biochemistry, Zhejiang University,for long-term running cost of the PAIR database), and by theNational Basic Research Program of China (grant no. 2005CB20901to P.W.).
2 These authors contributed equally to the article.* Corresponding author; e-mail [email protected] author responsible for distribution of materials integral to the
findings presented in this article in accordance with the policydescribed in the Instructions for Authors (www.plantphysiol.org) is:Xin Chen ([email protected]).
[C] Some figures in this article are displayed in color online but inblack and white in the print edition.
[W] The online version of this article contains Web-only data.www.plantphysiol.org/cgi/doi/10.1104/pp.109.141317
34 Plant Physiology�, September 2009, Vol. 151, pp. 34–46, www.plantphysiol.org � 2009 American Society of Plant Biologists
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 2: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/2.jpg)
(Scott and Barton, 2007). The accuracy and effective-ness of using indirect evidence to predict interactionshave also been thoroughly assessed (Qi et al., 2006;Suthram et al., 2006). These works offered preciousinsights into how protein interactions may be predictedaccurately on a proteomic scale. In other organismssuch as Homo sapiens, the prediction of an entireinteractome has already been proven applicable anduseful (Rhodes et al., 2005).On the other side, several efforts have been made to
collect and organize a comprehensive map of Arabi-dopsis molecular interactions. For instances, around20,000 interactions were inferred by homology toknown interactions in other organisms (Geisler-Leeet al., 2007). Another work predicted 23,396 interac-tions based onmultiple indirect data and curated 4,666interactions from the literature and enzyme complexes(Cui et al., 2008). The Arabidopsis reactome databasewas established describing the functions of 2,195proteins with 8,269 reactions in 318 superpathways(Tsesmetzis et al., 2008). And a general interactiondatabase, IntAct (Kerrien et al., 2006), had allocated aspecial unit actively curating all plant protein interac-tions from literature and submitted data sets, whichnow contains 2,649 Arabidopsis interactions. How-ever, in yeast, approximately 18,000 protein-proteininteractions had been estimated for approximately6,000 genes (Yu et al., 2008). Assuming the same rate ofinteraction, approximately 200,000 protein interactionswouldbe expected for approximately 20,000Arabidopsisgenes. Therefore, the current collection of Arabidop-sis interactions is still significantly limited. Moreover,most previous prediction works did not provide rigor-ous confidence measurements for their predicted inter-actions, which further limited their scope of applications.Recent advances in statistical learning presented a
powerful algorithm, support vector machine (SVM),which may be used to predict interactions based onmultiple indirect data. Although the basis of SVM hadbeen laid in the 1960s, the idea of SVM was onlyofficially proposed in the 1990s by Vapnik (1998, 2000).Then, research on its theoretical and application as-pects thrived. It has been applied in a wide range ofproblems, including text categorization (de Vel et al.,2001; Kim et al., 2001), image classification and objectdetection (Ben-Yacoub et al., 1999; Karlsen et al., 2000),flood stage forecasting (Liong and Sivapragasam,2002), microarray gene expression data analysis(Brown et al., 2000), drug design (Zhao et al., 2006a,2006b), protein solvent accessibility prediction (Yuanet al., 2002), and protein fold prediction (Ding andDubchak, 2001; Hua and Sun, 2001). Many studieshave demonstrated that SVM was consistently supe-rior to other supervised learning methods (Brownet al., 2000; Burbidge et al., 2001; Cai et al., 2003).In this work, with careful preparation of example
data and selection of indirect evidence, we constructedan SVM model to predict potential Arabidopsis inter-actions. False positives were tightly controlled. Withthe high-confidence model, we identified altogether
224,206 potential interactions, which were expected tobe 48.67% accurate and to cover 29.02% of the entireArabidopsis interactome. More specific confidencemeasurements were also assigned on a per interactionbasis. To facilitate the use of our results, we present thePredicted Arabidopsis Interactome Resource (PAIR;http://www.cls.zju.edu.cn/pair/), featuring detailedannotations and a friendly user interface.
RESULTS AND DISCUSSION
Example Data Set
Known pairs of interacting proteins (positive exam-ples) and noninteracting proteins (negative examples)were required to train an SVM prediction model. Asdetailed in “Materials and Methods,” we assembled agold-standard example set consisting of 4,139 positiveand 4,139 negative examples (Table I). In order toverify whether the proteins in our example interac-tions were a good sample of the entire Arabidopsisproteome, we used four categorizing systems to com-pute and compare the protein distributions in ourexample interactions and those in the entire Arabi-dopsis proteome. The Arabidopsis Information Re-source (TAIR) database (Rhee et al., 2003) providedthree classification systems based on Gene Ontology(GO) annotation (i.e. the molecular function-basedsystem, the cellular component-based system, andthe biological process-based system, which are collec-tively known as the GO Slim classification systems). Inmost cases, a GO Slim category in each classificationsystem corresponds to an existing term in GO. Geneproducts that are annotated to the term itself, or to anyof the children of this term, are included in thecorresponding GO Slim category. In some cases, aGO Slim category may also encompass more than oneGO term. In the GO Slim classification systems, cate-gories were carefully chosen and were intended to benonoverlapping. In our analysis, proteins classifiedinto the “unknown categories” (e.g. “molecular func-tion unknown,” “biological process unknown,” and“cellular component unknown”) were ignored, whilethe proteins classified into the “other categories” (e.g.“other molecular functions,” “other biological pro-cesses,”and“other cellularcomponents”)were counted.
As shown in Figure 1A, proteins in our interactionexamples and proteins in the entire Arabidopsis pro-teome showed similar histograms according to the GOSlim Biological Process categories. The Pearson’s cor-
Table I. Composition of gold-standard interaction examples
Data Source No. of Interactions No. of Proteins
IntAct 2,649 1,138BIND 808 416BioGRID 736 317TAIR 1,177 770Total 4,139 1,679
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 35
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 3: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/3.jpg)
relation coefficient (r) between these two distributionswas 0.95. Using Fisher’s z transformation, this coefficientcorresponds to a statistical significance of P = 3.17 31027. Similarly, according to the GO Slim MolecularFunction categories, we observed a correlation coeffi-cient of 0.52 (P = 2.8 3 1022; Fig. 1B), and according tothe GO Slim Cellular Component categories, we ob-served a correlation coefficient of 0.68 (P = 2.9 3 1023;Fig. 1C). These significant correlations indicated thatproteins in our example data set were indeed a goodsample of the entire proteome. Besides the annotation-based categories, sequence-based protein families alsoserved as a good classification system to compareprotein distributions. According to the Pfam database(Coggill et al., 2008), all Arabidopsis proteins wereclassified into 2,854 families. As shown in Figure 1D,the frequency distributions of example proteins and ofall Arabidopsis proteins were roughly consistent,showing a correlation coefficient of 0.75 (P , 1 310210). All of these data suggested that the proteins inour interaction examples represented a good sample ofthe Arabidopsis proteome.
Strength of the Indirect Evidence
As detailed in “Materials and Methods,” potentialinteractions were predicted based on 14 features de-rived from four types of indirect evidence (Table II).Therefore, it is of interest to evaluate the extent towhich these features carried information about proteininteraction. One way to assess the discriminationpower of a feature is to compare its value distributionin interaction examples against its value distributionin noninteraction examples. Informative features areexpected to show apparent differences.
Figure 1. Protein distribution in our interaction examples and in the entire Arabidopsis proteome. Blue bars show the fraction ofinteracting proteins in each category, and red bars show the fraction of all Arabidopsis proteins in each category. A, Proteindistribution based on the GO Slim Biological Process categories. The Pearson’s correlation coefficient between two distributionsis 0.95 (P = 3.17 3 1027). B, Protein distribution based on the GO Slim Molecular Function categories. The correlationcoefficient is 0.52 (P = 2.83 1022). C, Protein distribution based on the GO Slim Cellular Component categories. The correlationcoefficient is 0.68 (P = 2.9 3 1023). ER, Endoplasmic reticulum. D, Protein distribution based on the Pfam protein familyclassifications. The correlation coefficient is 0.75 (P , 1 3 10 210). In this diagram, only the largest 150 families are shown forclarity. [See online article for color version of this figure.]
Table II. Composition of features
Evidence Type No. of Features
Coexpression 1Domain interaction 9Colocalization 1Shared annotation 3Total 14
Lin et al.
36 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 4: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/4.jpg)
In this way, we divided the value range of eachfeature into 20 equal bins. The fraction of positiveexamples and the fraction of negative examples thatfell into each bin were calculated to make the corre-sponding distribution histograms. We plotted thesedistribution histograms and the associated likelihoodratio (LR) curve in Figure 2. The LR for each bin isdefined as the fraction of interaction examples abovethis bin divided by the fraction of noninteractionexamples above this bin. Above a bin means that thefeature values are expected to better indicate an inter-action than all values in this bin. For features describ-ing coexpression, colocalization, and domain interaction,this means that the feature values are greater than thebiggest number in this bin. But for features describingshared annotation, above a bin means the contrary. Forconvenience in comparison, we wanted all LR curvesto increase with the horizontal axes. Therefore, forfeatures describing shared annotation, the horizontalaxes were plotted as 1 – the feature value. Figure 2shows several typical feature strength diagrams. Acomplete set of these diagrams is provided in Sup-plemental Text S1. It was clear that in these diagrams,no sharp differences existed between the value dis-tributions in interaction examples and those in non-interaction examples. Pearson’s correlation betweenthe interaction and noninteraction distributions rangedfrom 0.324 to 1. The Kolmogorov-Smirnov test alsoindicated that these interaction and noninteractiondistributions were not significantly different, dis-
playing distances from 0.1 to 0.4, corresponding to aP value range of 1.0 to 0.313 (no significance to rejectthe hypothesis that they were the same; Supplemen-tal Text S1). However, it was also obvious that therewere always noticeable deviations between the in-teraction and noninteraction distributions for eachfeature. In the best scenarios (at the right-mostpoint), these features always led to significant LRs(i.e. .10-fold). Hence, they shall be considered weakbut relevant. A delicate scheme to exploit their com-bined discriminative power was necessary for accu-rate prediction.
The above set of feature LRs also offered an inter-esting possibility to predict potential interactionsbased on the total LR of a protein pair being aninteraction rather than a noninteraction, which wascomputed as the product of the 14 single feature LRs(Rhodes et al., 2005; Cui et al., 2008). While weregarded this approach as a useful supplement toassign per interaction prediction confidence, using itas an interactome prediction method may not beoptimal in our case, for two reasons. First, the featureswe used were not independent. Strong correlationsexisted among features describing the same type ofevidence (e.g. the Pearson’s correlation between thenine domain interaction features ranged from 0.0049 to0.918). And features describing different indirect evi-dence may also correlate, such as the colocalizationfeature and the shared GO cellular component anno-tation feature (r = 0.0638, P, 13 10210). Therefore, the
Figure 2. Comparison of the feature value distribution in interaction examples and the feature value distribution innoninteraction examples. For each feature, its value range is divided into 20 equal bins. Blue curves and red curves show thefraction of interaction examples and the fraction of noninteraction examples that fall into each bin. Green curves represent theLR. A complete set of these analysis diagrams can be found in Supplemental Text S1. The horizontal axis represents the featurevalue (in A–C) or the 1 – feature value (in D). The left vertical axis represents the fraction of protein pairs. The right vertical axisrepresents the LR. A, Protein colocalization feature. B, Domain interaction feature 9 (Random Decision Forest Framework). C,Gene coexpression feature. D, Shared annotation feature 3 (cellular component). [See online article for color version of thisfigure.]
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 37
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 5: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/5.jpg)
product of LRs would be dominated by the informa-tion coded in multiple correlated features. Usefulinformation carried by other independent featuresmay be obscured, which may lead to significant biasin prediction. Second, as shown in Figure 2, the LRcurves were not monotonically increasing, whichmeant that using a single cutoff value to compute LRwas too simple. Finer structures of the feature valuedistributions could be exploited for better predictionaccuracy. Consequently, prediction systems that aremore complicated and tolerate correlated featureswere desired. Many previous machine learning stud-ies had suggested that the SVM algorithm excelled inthis scenario.
Accuracy of Our Prediction Models
We constructed an SVMmodel to predict Arabidop-sis interactions. Its parameters, C and g, were opti-mized by a grid search. Model performances duringparameter optimization were estimated by 10-foldcross-validation with the F1 measure. Accuracy ofthe resulting model was evaluated by a more precise100-iteration bootstrap experiment. In each iteration,one-tenth of the example data were randomly left outfrom training, and these data were used to test themodel and compute the confusion matrix. After 100iterations, the mean and SD of the confusion matrix(Table III), together with other derivative accuracymeasurements, were computed. The optimal modelproduced 84.52% 6 0.20% sensitivity, 90.58% 6 0.25%specificity, and 87.44%6 0.14% F1 measure. Its overallaccuracy was estimated to be 87.55% 6 0.14%.
In addition, a receiver operating characteristic(ROC) curve analysis was performed to verify whetherthe SVM algorithm was effective in interaction pre-diction. In the ROC curve, sensitivity was plottedagainst 12 specificity for SVMmodels trained with allrange of parameters. Therefore, the ROC curve anal-ysis is a global and unbiased evaluation of the effec-tiveness of an algorithm, since it does not depend onany specific parameter used to build a particularmodel (Hanley and McNeil, 1982). The bigger thearea under the curve, the more effective a predictionalgorithm would be. As a result, we had an area underthe curve of 91.86% for the SVM approach, which wasslightly better than those for several other widely usedprediction methods (Supplemental Fig. S1). These
results demonstrated that the SVM approach wassuitable to exploit the information carried in theweak evidence and to make accurate predictions.
Although the above prediction model worked wellon our example set consisting of equal numbers ofpositive and negative examples, its application ininteractome prediction was still limited. Assumingthat the ratio of interacting protein pairs out of randompairs in Arabidopsis was similar to that in yeast (i.e.approximately 1:775; Yu et al., 2008), its 90.58% spec-ificity would translate to a large number of falsepositives. And the precision of its predicted interac-tions would drop to approximately 1.15%, worthlessfor expert inspection. Based on our past experiences(Cai et al., 2003; Xue et al., 2004; Xu et al., 2007), theSVM algorithm usually displayed an increased accu-racy for the type of examples with either larger numberor greater diversity. Thus, one way to increase specific-ity without significantly compromising sensitivitywas to expand the negative examples in the trainingdata. With this strategy, we added 409,761 randomlygenerated negative examples to the training set, rais-ing the positive-to-negative ratio to 1:100. With thesame protocol, this enlarged example set produced anoptimal SVM model showing 29.02% 6 0.13% sensitiv-ity, 99.95% 6 0.0013% specificity, 85.15% 6 0.33%precision, and 43.29% 6 0.17% F1 measure. Applyingthis model to all possible Arabidopsis protein pairs,224,206 potential interactions were identified.
These 224,206 predicted interactions also implied anestimation of the Arabidopsis interactome size. Theequation “true positives + false positives = all predictedpositives” can be rewritten as:
Nint 3 Senþ ðNall 2NintÞ3 ð12 SpeÞ ¼ Npred
where Nint represents the interactome size implied byour model, Nall is the number of all possible Arabi-dopsis protein pairs (2.293 108),Npred is the number ofpredicted interactions (224,206), and Sen and Spe arethe sensitivity and specificity of our model. Solvingthis equation gave an estimated Arabidopsis interac-tome size of 3.783 105, corresponding to an interactionrate of 1:606, similar to that estimated in yeast (ap-proximately 1:775; Yu et al., 2008). With this estima-tion, this model was expected to have a precision of48.67% in interactome prediction. And the first modeltrained with an equal number of positive and negative
Table III. The confusion matrix of our prediction models
The 1:1 model (high-coverage model) refers to the optimal model trained with an equal number of positive and negative examples. The 1:100model (high-confidence model) refers to the optimal model trained with the expanded example data set in which the positive-to-negative ratio is1:100. All measurements are provided in the format of mean 6 SD computed by 100-iteration bootstrap evaluation.
Predicted Actual
Positive NegativePositive TP: 3,498.32 6 8.28 (1:1 model), 1,201.28 6 5.57
(1:100 model)FP: 389.88 6 10.44 (1:1 model), 209.44 6 5.43(1:100 model)
Negative FN: 640.68 6 8.28 (1:1 model), 2,937.72 6 5.57(1:100 model)
TN: 3,749.12 6 10.44 (1:1 model), 413,690 6 5.43(1:100 model)
Lin et al.
38 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 6: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/6.jpg)
examples was expected to display a precision of 1.46%in interactome prediction.Comparing the above two models, the one trained
with an equal number of positive and negative exam-ples had a high sensitivity (84.52%). Or, it would beless likely to miss a real interaction. For this reason, wecall it the high-coverage model, or the 1:1 model.However, this good coverage came at a cost. Theconfidence of its predicted interactions (precision)was estimated to be as low as 1.46%. In contrast, themodel trained with 100 times more negative examplesshowed a high precision (48.67%). Or, it would be lesslikely to wrongly predict an interaction. Thus, we callthis model the high-confidence model, or the 1:100model. Nonetheless, its predicted interactions wereexpected to cover only 29.02% of the entire interac-tome. While the interactions predicted by the high-coverage model could be more useful to reconfirminteractions identified by other low-confidence ap-proaches (e.g. some HTP experiments that tend toreport many false positives), the interactions predictedby the high-confidence model would be more inter-esting for direct expert inspection.In addition, it was noted that our high-coverage
predictions included the entire set of high-confidencepredictions, suggesting the self-consistency of ourmethods.
Rediscovery of Newly Reported Interactions
The sensitivity of our prediction models in interac-tome prediction was further validated with oneindependent evaluation data set containing newlyreported interactions that were not included in themajor interaction databases at the time when ourpositive examples were assembled. This data set wasextracted from the latest update of BioGRID (Starket al., 2006), containing 166 novel interactions. Asshown in Supplemental Table S1, 122 (73.49%) novelinteractions could be successfully recognized by thehigh-coverage model. This rate of sensitivity wasslightly lower than the expected one (84.52%). There-fore, we considered this rate (73.49%) as the likelysensitivity of our high-coverage model when general-
ized to interactome prediction. On the other hand,the high-confidence model was able to identify 48(28.91%) novel interactions. This sensitivity was com-parable to its expected one (29.02%). In contrast, bychance alone, our high-coverage model and high-confidence model would be expected to have approx-imately 70.27 6 8.37 and approximately 4.96 6 1.94overlaps, respectively, with this independent evalua-tion data set. Consequently, the significance of recog-nizing 122 interactions by our high-coverage modelwas 3.17 3 10210. And the significance of recognizing48 interactions by our high-confidence model was lessthan 1 3 10210.
Comparison with Other Predicted Interactomes
Before this work, there had been several efforts topredict Arabidopsis interactions. As mentioned in theintroduction, without using complex statisticalmodels, Geisler-Lee et al. (2007) predicted interactionsby homology to reported interactions in other organ-isms (herein referred to as the Interologs data set).Also, Cui et al. (2008) identified 23,396 potential inter-actions based on multiple indirect evidence with atotal likelihood ratio cutoff and published as the AtPIDdatabase (the AtPID data set). In addition to thesepredicted interactomes, Obayashi et al. (2009) com-puted a coexpression network from 58 microarrayexperiments, 1,388 slides, which was available as theATTED-II database (the ATTED-II data set). These datasets offered an interesting opportunity to comparewith our results.
Table IV shows the overlaps between our predic-tions and the above data sets, together with the sig-nificance values. It was shown that all data setsexhibited significant overlaps that were not likelyproduced by chance, indicating their consistency ingeneral. The AtPID set showed the largest proportionof overlap with our predictions, most likely as a resultof using a similar set of indirect evidence (e.g. genecoexpression and GO biological process similarity). Incontrast, the ATTED-II set had relatively lower over-laps with all data sets, including the gold-standardinteraction examples. This was not surprising, as
Table IV. Comparison with other predicted interactomes
The data sets being compared were our high-confidence predictions (SVM), the homology-based predictions described by Geisler-Lee et al. (2007;Interologs), the interactions predicted by multiple indirect evidence as in the AtPID database (Cui et al., 2008), the coexpression network as in theATTED-II database (Obayashi et al., 2009), and the gold-standard interaction examples compiled in this work (GS). Note that the ATTED-II set is notexactly a predicted interactome. It was included for the purpose of comparing predicted interactomes with a coexpression network. The Interologsset and the AtPID set contained all of their predicted interactions. Their gold-standard interactions not recognized by their prediction methods werenot counted. The size of each data set is given after the data set name in the left column. For each overlap, its significance (probability of existence bychance alone) is provided.
SVM Interologs AtPID ATTED-II GS
SVM (224,206) – 1,266 (P , 1 3 10–10) 5,789 (P , 1 3 10–10) 847 (P , 1 3 10–10) 917 (P , 1 3 10–10)Interologs (69,225) 1,266 (P , 1 3 10–10) – 3,132 (P , 1 3 10–10) 253 (P , 1 3 10–10) 260 (P , 1 3 10–10)AtPID (24,418) 5,789 (P , 1 3 10–10) 3,132 (P , 1 3 10–10) – 434 (P , 1 3 10–10) 132 (P , 1 3 10–10)ATTED-II (48,276) 847 (P , 1 3 10–10) 253 (P , 1 3 10–10) 434 (P , 1 3 10–10) – 52 (P , 1 3 10–10)GS (4,139) 917 (P , 1 3 10–10) 260 (P , 1 3 10–10) 132 (P , 1 3 10–10) 52 (P , 1 3 10–10) –
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 39
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 7: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/7.jpg)
coexpression does not equal interaction. Althoughgene coexpression and protein interaction are wellcorrelated in prokaryotes, the correlation is not asstrong in eukaryotes (Bhardwaj and Lu, 2005), prob-ably because the layer of regulatory mechanisms isdifferent between coexpression and protein interaction(Obayashi et al., 2009). This was the reason why, in thiswork, coexpression was employed as one feature.Besides it, several other indirect data were also con-sidered to make more reliable predictions.
However, although coexpression is weakly corre-lated with interaction and not enough to indicate aninteraction by itself, we indeed observed an increasedlevel of expression correlation in our predicted inter-actions. The ATTED-II database used two measure-ments to gauge the correlations between expressionprofiles: the Pearson’s correlation (PC) and mutualrank (MR). A gene pair with higher PC and lower MRwas more likely to interact. Results showed that ourpredicted interactions had an average PC of 0.1341,higher than that of all protein pairs (0.0075; P , 1 310210), and the mean MR of our predicted interactionswas 7,454.6, lower than that of all protein pairs(11,286.6; P , 1 3 10210). This increase of expressioncorrelation was similarly observed in our gold-standard interactions, which displayed an average PCof 0.1224 (P , 1 3 10210) and a mean MR of 7,994.5(P , 1 3 10210). Interestingly, the Wilcoxon rank sumtest indicated that the average PC of our predictedinteractions (0.1341) was slightly higher than that of thegold-standard interactions (0.1224; P = 0.002), while themean MR of our predicted interactions (7,454.6) waslargely comparable to that of the gold-standard inter-actions (7,994.5; P = 0.077, no significance).
It was also noted that our predicted interactions hada significant overlap with the potential interactionsidentified by homology to known interactions in otherspecies. Existence of homologous interactions was oneof piece of indirect evidence that we did not use in ourprediction scheme. Therefore, if predictions by thehomology-based approach were independent of ours,the interactions predicted by both approaches shouldbe more reliable. Indeed, for this set of interactions, wehad an average SVM confidence score of 1.586 (de-tailed in “Data Access” below), which was higher thanthe average of all our predicted interactions (0.761; P,13 10210 byWilcoxon rank sum test). And on the otherside, Geisler-Lee et al. (2007) computed a confidencevalue to indicate the confidence of each homology-predicted interaction. For the interactions predicted byboth approaches, we observed an average confidencevalue of 18.507, higher than that of all the homology-predicted interactions (4.208; P , 1 3 10210). Theseresults suggested the opportunity to derive a set ofpredicted interactions of extra high confidence. Forthis purpose, we assembled an updated set of knowninteractions consisting of 198,537 low-throughput in-teractions from Saccharomyces cerevisiae, Drosophilamelanogaster, Caenorhabditis elegans, and H. sapiens.These interactions were then mapped onto 275,403
potential Arabidopsis interactions based on the latesthomologous gene group assignments as in the Inpar-anoid database (O’Brien et al., 2005). Among them,7,927 interactions were also predicted by our high-confidence model. In this set, the average number ofhomologous interactions for one predicted interactionwas approximately 1.79, larger than that in all thehomology-based predictions (1.39; P , 2.2e-16). Inother words, our SVM model was able to pick outthose homology-based predictions that had more sup-port from reported interactions in other species.
Analysis of the Meiotic Recombination-RelatedInteraction Network
Usage of our predicted interactome was illustratedby an analysis of the meiotic recombination-relatedinteractions. Meiosis is essential for sexual reproduc-tion, in which the step of recombination is critical fornormal meiosis. Understanding its underlying molec-ular mechanisms and regulation has attracted constantattention. The current model of meiotic recombination,known as the double-strand break repair mechanism,was mostly established through analysis of the bud-ding yeast, S. cerevisiae. According to this model (Sup-plemental Fig. S2), recombination is initiated by theintroduction of a double-strand break into the recip-ient chromatid by a double-strand endonuclease (step1). This cut is enlarged to a gap, and 3# single-strandedtermini are produced by the action of exonucleases(step 2). One of the free 3# ends then invades ahomologous region of the donor duplex, displacing asmall D loop (step 3). The D loop is then enlarged byrepair synthesis until the other 3# end can anneal tocomplementary single-stranded sequences. Repairsynthesis from the second 3# end completes theprocess of gap repair, and branch migration resultsin the formation of two Holliday junctions (step 4).Resolution of two junctions by cutting either inner orouter strands leads to two possible noncrossoverand two possible crossover configurations (step 5;Szostak et al., 1983). The Arabidopsis proteins knownto function in recombination was nicely reviewed byWijeratne and Ma (2007).
In our predicted high-confidence interaction net-work, 11 proteins mentioned in this review wereidentified as seed proteins (Table V) and extractedwith their first neighbors to create a subnetwork for in-depth analysis. This subnetwork contained 122 pro-teins and 884 interactions (Fig. 3). Among them, eightinteractions had been experimentally confirmed and40 interactions were supported by homologous inter-actions in other species (Fig. 3; Supplemental Tables S2and S3). For example, BRCA2A and BRCA2B, ortho-logs of the human breast cancer susceptibility protein2, had been reported to physically interact with DMC1and RAD51 in Arabidopsis (Siaud et al., 2004; Drayet al., 2006), which were also predicted by our SVMmodel. Another example of predicted interactionssupported by known homologue interactions could
Lin et al.
40 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 8: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/8.jpg)
be found among the Structural Maintenance of Chro-mosomes family proteins (i.e. SMC1, SMC2, andSMC3) and the Sister Chromatid Cohesion familyproteins (i.e. SYN2, SYN3, and SYN4). These proteinsare part of the evolutionarily conserved cohesin com-plex that mediates chromosome cohesion and enablesaccurate chromosome segregation in both mitosis andmeiosis (Revenkova and Jessberger, 2005). A clusteringanalysis was also performed with the Markov Clusteralgorithm (Enright et al., 2002) to reveal the internalstructure of our predicted subnetwork. Seven clusterswere identified, with six of them connected (Fig. 3).Interestingly, the seed proteins that function in
different recombination steps were distributed in non-overlapping clusters (Fig. 3; Supplemental Table S4),suggesting that our subnetwork structure was func-tionally relevant. A closer look at the GO biologicalprocess annotations of these proteins indicated thatproteins in the same cluster were often involved insimilar or related biological processes. For example,for the first step of recombination that creates a DNAdouble-strand break, there were two seed proteins,meiotic recombination proteins SPO11-1 (AtSPO11-1)and SPO11-2 (AtSPO11-2; Grelon et al., 2001; Staceyet al., 2006). Both of them were partitioned into cluster1, together with three other proteins. All of them wereinvolved in the process of “DNA topological change.”Similarly, the four seed proteins for the third step ofrecombination (i.e. DNA repair protein RAD51 homo-log 1 [RAD51] and homolog 3 [RAD51C], DNA repairprotein XRCC3 homolog, and meiotic recombinationprotein DMC1 homolog) were all partitioned intocluster 4, together with seven other proteins. Theseproteins were all annotated with similar or relatedbiological processes, such as “base-excision repair,”“DNA repair,” and “meiosis.” The seed proteins forstep 4 made two clusters, cluster 5 and cluster 6. Bothof them were seemingly homogenous. Cluster 5 hadthe seed protein Rock-N-Rollers (RCK), a DNA heli-case required for interference-sensitive crossoverevents (Chen et al., 2005). It also contained 21 otherhelicases, including the RecQ helicase family (i.e.
RecQL2, RecQL3, RecQL4A, RecQL1, and RecQL4B),required for genome stability maintenance (Hartunget al., 2000). RecQL2 and RecQL4A are able to suppresshomologous recombination (Hartung et al., 2007;Kobbe et al., 2008), while RecQL4B is specificallyrequired to promote crossovers (Hartung et al.,2007). These opposing effects suggested that a deli-cate regulation of homologous crossover might beachieved by this cluster of proteins. The other cluster,cluster 6, was around MSH4, a meiosis-specific mem-ber of the MutS homolog family (Higgins et al., 2004).This cluster consisted of proteins involved in “mis-match repair” (e.g. MutS homolog proteins MSH2,MSH3, MSH4, and MSH7) and “regulation of tran-scription” (e.g. Arabidopsis trithorax-like proteinsATX1, ATX2, ATX3, ATX4, and ATX5). It was knownthat MSH2, a DNA mismatch repair gene MutS ho-molog, was able to repress recombination of mis-matched heteroduplexes (Emmanuel et al., 2006).And one of the trithorax-like proteins, ATX1, wasreported to be an epigenetic regulator with histone H3Lys-4 methyltransferase activity (Alvarez-Venegaset al., 2003), which is critical in yeast for the formationof programmed DNA double-strand breaks that initi-ate homologous recombination during meiosis (Bordeet al., 2009). These reports suggested the functionalrelevance of cluster 6 proteins.
There were also cases of proteins in one cluster thatdisplayed more heterogeneous annotations. For exam-ple, although one of the seed proteins for the secondstep of recombination, DNA repair protein RAD50,formed a relatively homogenous cluster (cluster 3;with proteins from the SMC family and the SisterChromatid Cohesion family), the other seed protein,double-strand break repair proteinMRE11, was part ofcluster 2, with 32 other proteins annotated with adiversity of biological processes ranging from devel-opment to stress responses. However, it was noticedthat, despite the fact that they belonged to a number ofbiological processes, many of the cluster 2 proteins hadthe capability to bind small molecules, such as cal-cium, FK506, and cyclosporine A. It was reported that
Table V. Arabidopsis proteins that function in meiotic recombination
These proteins, as reviewed by Wijeratne and Ma (2007), were used as seeds to create a subnetwork ofmeiotic recombination-related interactions in our predicted interactome.
Gene SymbolArabidopsis Genome
Initiative No.Function in Recombination
Cluster
No.
AtSPO11-1 AT3G13170 Double-strand break induction (step 1) 1AtSPO11-2 AT1G63990 Double-strand break induction (step 1) 1MRE11 AT5G54260 Processing of broken ends (step 2) 2RAD50 AT2G31970 Processing of broken ends (step 2) 3DMC1 AT3G22880 Strand invasion (step 3) 4RAD51 AT5G20850 Strand invasion (step 3) 4RAD51C AT2G45280 Strand invasion (step 3) 4XRCC3 AT5G57450 Strand invasion (step 3) 4RCK AT3G27730 Repair synthesis and branch migration (step 4) 5MSH4 AT4G17380 Repair synthesis and branch migration (step 4) 6MLH1 AT4G09140 Holliday junction resolution (step 5) 7
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 41
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 9: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/9.jpg)
calmodulin antagonists and cAMP inhibited the en-hancement of double-strand break repair by ionizingradiation in human cells (Wang et al., 2000). Our dataimplied the possibility that MRE11 might mediate thissuppression. It would be interesting to experimentallydemonstrate whether MRE11 actually participatesin the regulation of double-strand break repair bycalmodulin antagonists and cAMP as well as in theregulation by other small molecules suggested bycluster 2 proteins.
Data Access
In order to facilitate the use of our predictedinteractions, we present PAIR, which is freely acces-
sible at http://www.cls.zju.edu.cn/pair/. The cur-rent version of PAIR supports searches by protein, byinteraction, and by homologous interaction in S.cerevisiae, D. melanogaster, C. elegans, and H. sapiens.To search a protein, users can specify its name (oralias), locus, or National Center for BiotechnologyInformation accession number. Interactions can besearched by specifying their component proteins orby specifying the component proteins in their homo-logues interactions. All queries support the use ofwild characters such as “*” (representing a string ofany length) and “?” (representing a single character).If multiple hits are found, a list is presented allowingthe choice of a specific protein or interaction to showin detail.
Figure 3. The meiotic recombination-related proteins in our predicted interactome. Eleven seed proteins reviewed byWijeratneand Ma (2007) and their first neighbors were extracted from our predicted interactions. Interactions between them can begrouped into seven clusters, with seed proteins involved in different recombination steps distributed in nonoverlapping clusters.Triangles represent the seed proteins and circles represent their neighbors. Proteins (nodes) are colored according to their GOmolecular function annotations. Edges representing experimentally confirmed interactions are colored red. Those representinginteractions homologous to known interactions in other organisms are colored blue. Other predicted interactions are coloredgray. [See online article for color version of this figure.]
Lin et al.
42 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 10: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/10.jpg)
The protein detail page provides annotations of aprotein taken from other databases, including its ac-cession numbers, description, aliases, domain compo-sition, GO annotations, and cellular localizations.Predicted interactions involving this protein are alsolisted at the bottom. The interaction detail page showsconcise descriptions of the two component proteins,the evidence supporting this interaction, along with itshomologous interactions in other species. Two types ofper interaction confidence measurements are presentedfor each interaction. One is the LRs for features and thetotal LR for each interaction (see “Strength of theIndirect Evidence” above). These LRs are providedalong with their percentile rankings in interactionexamples to indicate their significance. The othertype of per interaction confidence measurement is aSVM-based confidence score. It had been proposedthat the absolute value of decision function output wascorrelated with the prediction accuracy (Hua and Sun,2001). Based on this idea, we computed this value foreach predicted interaction as its confidence score.Similarly, these confidence scores are provided alongwith their percentile rankings in interaction examplesto indicate their significance. Whenever a list of inter-actions is presented, filters based on the total LR andthe SVM-based confidence score are available forselection of more reliable predictions. However, be-cause these per interaction confidence measurementswere not rigorously assessed and were not able todirectly translate into a probability of true prediction,they shall be used with caution.Users interested in pathway analysis are able to
register with PAIR. After login, the database allowsusers to add specific interactions into a personalizedstorage place called “My Collection.” Once pathwaymining is done, the interactions stored in My Collec-tion can be exported in the HUPO-PSI format (Kerrienet al., 2007), which is readily usable by a variety ofmainstream pathway analysis softwares, such as Cyto-scape (Shannon et al., 2003).
Future Prospects
As discussed above, the accuracy and coverage ofour predicted interactions may seem to reach a mean-ingful level to assist plant biologists. However, thereare still margins for improvement. For example, ourtraining examples were far from adequate. Our inter-action examples covered approximately 1% of the totalinteractome. Rapid advances in plant molecular biol-ogy will certainly provide more interaction examplesthat are able to better characterize their shared char-acteristics and produce more accurate predictionmodels. Also, when the feature strength diagramswere compared among features describing the sametype of indirect evidence, those based on experimentaldata or annotation tended to be more discriminativethan those based on predicted data (e.g. the domaininteraction features derived from Protein Data Bankcomplex entries versus the domain interaction features
derived from predicted data). Yet, unfortunately, for asignificant fraction of protein pairs, many experimen-tal data-based features were missing. In other words,these features saw no differences in a great numberof protein pairs, which considerably limited theirstrength and hence required prediction-based featuresto complement. Therefore, more comprehensive indi-rect data may also help to increase prediction accuracy.In addition, novel approaches for feature extraction(Ramani and Marcotte, 2003), kernel computation (Xuet al., 2008), and SVM optimization (Shen et al., 2007)may offer new opportunities to build a better predic-tion model.
For these reasons, regular updates of PAIR arenecessary to take advantage of the increasing amountof interaction information and indirect evidence aswell as the quickly advancing computational tech-niques in feature extraction and model construction.The PAIR database was established as part of theinformatics infrastructure supporting the State KeyLaboratory of Plant Physiology and Biochemistry,China. Two updates per year are scheduled. Its long-term running cost is covered by amaintenance grant tothe State Key Laboratory of Plant Physiology andBiochemistry from the ChineseMinistry of Science andTechnology.
CONCLUSION
The PAIR database was constructed with carefulconsideration of the lessons learned in previous inter-action prediction studies. It offers a high-confidenceinteraction set with an expected accuracy of approx-imately 48.7% and a high-coverage interaction set thatwas estimated to include approximately 73.5% of theentire interactome. These data currently represent themost accurate and comprehensive set of predictedArabidopsis interactions. Detailed confidence mea-surements for each predicted interaction are alsoavailable. PAIR can be accessed freely online. As partof the infrastructure supporting the State Key Labora-tory of Plant Physiology and Biochemistry, China,PAIR will be updated and upgraded twice a year. Itis our hope that this resource is able to complement theexisting informatics framework for Arabidopsis re-search and offer further help in the study of importantmolecular mechanisms in this model plant.
MATERIALS AND METHODS
SVM
SVM is a statistical learning algorithm that learns from a set of training
examples and builds a mathematical model to classify new examples. In
typical settings, the training set contains two types (classes) of examples,
positive examples (e.g. interactions) and negative examples (e.g. noninter-
actions). Each example is represented by a vector of real numbers (also known
as features) describing its characteristics (e.g. indirect evidence of interaction).
The SVM algorithm learns the relationship between the class labels and
feature values in the training examples and encodes this relationship into a
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 43
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 11: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/11.jpg)
mathematical model. This model can then be used to predict the class label of
any given example based on its feature values. A brief account of the SVM
algorithm is provided in Supplemental Text S1.
Preparation of the Example Data Set
Known pairs of interacting proteins (positive examples) and noninteract-
ing proteins (negative examples) are required to train an SVM prediction
model. To assemble an accurate positive example set, we retrieved and
integrated experimentally determined Arabidopsis (Arabidopsis thaliana)
physical interactions from IntAct (Kerrien et al., 2006), BIND (Alfarano
et al., 2005), BioGRID (Stark et al., 2006), and TAIR (Rhee et al., 2003). Because
HTP experiments tend to suffer from high false-positive rates (Deane et al.,
2002; von Mering et al., 2002), we excluded interactions only supported by
HTP experiments. As shown in Table I, this resulted in 4,139 interactions
involving 1,679 proteins.
Unlike interacting proteins, it is rare to find reported noninteracting
proteins. Several studies have attempted to choose negative examples as pairs
of proteins with different cellular localizations. However, this approach did
not seem to improve model accuracy. Furthermore, it was argued that this
would lead to biased estimations of accuracy, because the constraints placed
on the cellular distribution of the negative examples could make the predic-
tion task easier (Ben-Hur and Noble, 2006). Therefore, we followed the
approach described by Zhang et al. (2004) to select negative examples as
random protein pairs that did not overlap with the positive examples. The
resulting negative examples may contain several potential interactions. Yet,
given the ratio of interacting protein pairs versus noninteracting protein pairs
in yeast (approximately 1:775; Yu et al., 2008), this level of contamination was
likely acceptable. In this way, we generated the same number of negative
examples. The gold-standard data set consisted of a total of 8,278 protein
pairs, half positive, half negative.
Features
Four types of informative indirect evidence were chosen for interaction
prediction based on previous studies on the strength of various indirect
evidence (Qi et al., 2006; Suthram et al., 2006). Fourteen features were
computed by different mathematical characterizations of this evidence.
Therefore, a 14-dimension feature vector was constructed to represent each
protein pair.
The first type of indirect evidence was gene coexpression. Interacting
proteins are often coexpressed. A large set of Arabidopsis gene expression
profiles were available at the TAIR database (ftp://ftp.arabidopsis.org/
home/tair/Microarrays/), which contained 1,436 experiments normalized
by a robust multiarray average method (Rhee et al., 2003). Using this data set,
we calculated the Pearson’s correlation coefficient for the expression profiles
of each protein pair as one feature.
The second type of indirect evidence was domain interaction. Protein
interactions involve physical interactions between their domains. It has been
proposed that novel protein interactions can be inferred by known domain
interactions. Here, we annotated the domain composition of each Arabidopsis
protein according to the Pfam database (Coggill et al., 2008), which assigned
one or more of the 2,854 distinctive domains to one or more of the 20,183
Arabidopsis proteins. Known interacting domains were retrieved from the
DOMINE database (Raghavachari et al., 2008), which contained two data sets
derived from the Protein Data Bankmolecular complex entries and seven data
sets predicted by computational approaches. According to each data set, we
counted the number of interacting domains in a pair of proteins as one feature.
This resulted in nine different features.
The third type of indirect evidence was shared annotation. Interacting
proteins were expected to share similar molecular functions, to be involved in
the same biological processes, and to localize in the same cellular components.
GO (Raghavachari et al., 2008) provided a standardized way to annotate
proteins from these aspects and therefore offered an excellent chance to
compare their annotations. In the GO term tree, strongly related terms were
expected to have a shared parent term close to the leaf terms and narrow in
concept. The fraction of proteins annotated to this shared parent term and all
of its child terms reflects the unexpectedness that two proteins share this
particular parent annotation term. Consequently, this fraction could be a
suitable score to measure the similarity between two annotation terms. The
smaller this score is, the more similar the two terms are. Extending the concept
of term similarity, we defined the protein annotation similarity score as the
smallest term similarity score between their annotation terms. In TAIR, there
were 112,347 annotations assigned to Arabidopsis gene products. By focusing
specifically on the terms in one of the three categories (i.e. molecular function,
biological process, and cellular component), we computed three protein
annotation similarity scores as three features.
The last type of indirect evidence was colocalization. Interacting proteins
are usually colocalized. We obtained protein subcellular localization infor-
mation from the Arabidopsis Subcellular Database (Heazlewood et al., 2007).
The colocalization feature was then calculated as described by Qiu and Noble
(2008). Let f(l) be the fraction of all proteins presented in location l, and let I(i,l)
be an indicator function indicating that protein i is observed to localize to l.
The colocalization feature value is then calculated as follows: F(A,B) = maxl{2 log(f(l))I(A,l)I(B,l)}. In other words, for protein pair (A,B), this feature was
computed as the negative logarithm of the fraction of all proteins presented in
the most specific location where both proteins A and B were observed to
localize.
As shown in Table II, 14 features were computed for each pair of proteins:
one feature based on coexpression, one feature based on colocalization, nine
features based on domain interaction, and three features based on shared
annotation.
Construction of a Prediction Model
An SVM model was constructed to predict Arabidopsis protein interac-
tions. The SVM algorithm is well documented (Winters-Hilt et al., 2006). A
brief introduction is also provided in Supplemental Text S1. For implementa-
tion, we used the public software package LIBSVM version 2.88 (http://www.
csie.ntu.edu.tw/~cjlin/libsvm/). After comprehensive evaluation, we chose
to use the radial basis function kernel for its best performance (data not
shown). In this setup, two parameters, the regularization parameter C and the
kernel width parameter g, were optimized by a grid search in an empirical
range (0.001 to 1,000 for C and 3 3 1025 to 4 3 104 for g), maximizing the F1
measure as defined below. Model performance with each parameter set was
estimated by 10-fold cross-validation.
Based on whether a real positive or negative example in the testing set was
predicted as positive or negative, the prediction result for each protein pair
can be divided into four categories: true positive (TP), when a real positive
example was predicted correctly; true negative (TN), when a real negative
example was predicted correctly; false positive (FP), when a real negative
example was predicted as positive; and false negative (FN), when a real
positive example was predicted as negative. These four indicators are collec-
tively called the confusion matrix (Table III). Many accuracy measurements
are derived from the confusion matrix, such as the sensitivity [or recall; TP/
(TP + FN)], specificity [TN/(TN + FP)], and precision [TP/(TP + FP)]. In the
application of predicting interactions, the ability to accurately and compre-
hensively recognize potential interactions is critical. Therefore, the measure-
ments of precision and recall are most relevant. Precision conveys the idea of
how confident we arewhen themodel predicts an interaction. Recall measures
how much proportion of the real interactions can be recognized by the model.
Giving equal weights to precision (p) and recall (r), Rijsbergen (1979) intro-
duced the “F1 measure” as the geometric mean [2rp/(r + p)]. This measure-
ment favors a balanced accuracy on both interacting and noninteracting
protein pairs. For example, if a model predicted all examples as interactions,
the overall accuracy would be the fraction of interactions in the test data set.
However, the F1 measure would be only zero, reflecting the worst accuracy on
all types of examples.
Comparison of Our Predictions with OtherPredicted Interactomes
Our predicted interactions were compared with three published data sets.
The Interologs data set was obtained as the supplemental data of Geisler-Lee
et al. (2007). The AtPID data set was retrieved from the AtPID database
(version 3; Cui et al., 2008). And the ATTED-II data set was downloaded from
the ATTED-II database (version 4.1; Obayashi et al., 2009). The ATTED-II
database did not explicitly suggest a criterion by which potential interactions
shall be identified based on the expression correlations they provided.
However, this database did provide a set of 48,276 highly coexpressed gene
pairs, consisting of each gene paired with three other genes that displayed the
top three expression similarities. It is clear that coexpression does not equal to
interaction. Yet for the purpose of comparing predicted interactomes with a
coexpression network, we took these highly coexpressed gene pairs as the
Lin et al.
44 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 12: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/12.jpg)
potential interactions identified by the coexpression method. It shall also be
noted that in our comparison, the Interologs set and the AtPID set included all
of their predicted interactions. Their gold-standard examples not recognized
by their respective prediction methods were not counted.
The overlaps between data sets were counted, and the probabilities of these
overlaps appearing by chance were calculated as follows. For any two data
sets, with 10,000 time iterations, we first estimated the distribution of overlaps
between two random set of protein pairs having the same sizes and protein
distributions as the two data sets being compared. Assuming a normal
distribution, the probability of the observed overlap between these two data
sets was then computed against the distribution of random overlaps. This
probability was taken as the significance value for these two data sets being
correlated.
Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. ROC curve analysis of four prediction methods.
Supplemental Figure S2. Double-strand break repair model for meiotic
recombination.
Supplemental Table S1. Rediscovery of newly reported interactions by
our prediction models.
Supplemental Table S2. Eight experimentally confirmed interactions in
our predicted meiotic recombination subnetwork.
Supplemental Table S3. Forty predicted interactions with known homol-
ogous interactions in our predicted meiotic recombination subnetwork.
Supplemental Table S4. The predicted meiotic recombination subnet-
work.
Supplemental Text S1. A brief introduction to the SVM algorithm and the
complete set of feature analysis diagrams.
ACKNOWLEDGMENTS
We thank Lu Zhang for helpful discussions on feature extraction and Li
Chen for help in Web site construction.
Received May 11, 2009; accepted July 6, 2009; published July 10, 2009.
LITERATURE CITED
Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K,
Betel D, Bobechko B, Boutilier K, Burgess E, et al (2005) The Biomo-
lecular Interaction Network Database and related tools 2005 update.
Nucleic Acids Res 33: D418–D424
Alvarez-Venegas R, Pien S, Sadder M, Witmer X, Grossniklaus U,
Avramova Z (2003) ATX-1, an Arabidopsis homolog of trithorax, acti-
vates flower homeotic genes. Curr Biol 13: 627–637
Ben-Hur A, Noble WS (2006) Choosing negative examples for the predic-
tion of protein-protein interactions. BMC Bioinformatics (Suppl 1) 7: S2
Ben-Yacoub S, Abdeljaoued Y, Mayoraz E (1999) Fusion of face and
speech data for person identity verification. IEEE Trans Neural Netw 10:
1065–1074
Bhardwaj N, Lu H (2005) Correlation between gene expression profiles and
protein-protein interactions within and across genomes. Bioinformatics
21: 2730–2738
Borde V, Robine N, Lin W, Bonfils S, Geli V, Nicolas A (2009) Histone H3
lysine 4 trimethylation marks meiotic recombination initiation sites.
EMBO J 28: 99–111
Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS,
Ares M, Haussler D (2000) Knowledge-based analysis of microarray
gene expression data by using support vector machines. Proc Natl Acad
Sci USA 97: 262–267
Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine
learning: support vector machines for pharmaceutical data analysis.
Comput Chem 26: 5–14
Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ (2003) SVM-Prot: Web-based
support vector machine software for functional classification of a
protein from its primary sequence. Nucleic Acids Res 31: 3692–3697
Chatr-aryamontri ACA, Palazzi LM, Nardelli G, Schneider MV,
Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction
database. Nucleic Acids Res 35: D572–D574
Chen C, Zhang W, Timofejeva L, Gerardin Y, Ma H (2005) The Arabidop-
sis ROCK-N-ROLLERS gene encodes a homolog of the yeast ATP-
dependent DNA helicase MER3 and is required for normal meiotic
crossover formation. Plant J 43: 321–334
Coggill P, Finn RD, Bateman A (2008) Identifying protein domains with
the Pfam database. Curr Protoc Bioinformatics 23: 2.5.1–2.5.17
Cui J, Li P, Li G, Xu F, Zhao C, Li Y, Yang Z, Wang G, Yu Q, Shi T (2008)
AtPID: Arabidopsis thaliana Protein Interactome Database. An integra-
tive platform for plant systems biology. Nucleic Acids Res 36: D999–
D1008
Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interac-
tions: two methods for assessment of the reliability of high throughput
observations. Mol Cell Proteomics 1: 349–356
de Vel O, Anderson A, Corney M, Mohay G (2001) Mining e-mail content
for author identification forensics. Sigmod Record 30: 55–64
Ding CH, Dubchak I (2001) Multi-class protein fold recognition using
support vector machines and neural networks. Bioinformatics 17:
349–358
Dray E, Siaud N, Dubois E, Doutriaux MP (2006) Interaction between
Arabidopsis Brca2 and its partners Rad51, Dmc1, and Dss1. Plant
Physiol 140: 1059–1069
Emmanuel E, Yehuda E, Melamed-Bessudo C, Avivi-Ragolsky N, Levy
AA (2006) The role of AtMSH2 in homologous recombination in
Arabidopsis thaliana. EMBO Rep 7: 100–105
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm
for large-scale detection of protein families. Nucleic Acids Res 30:
1575–1584
Geisler-Lee J, O’Toole N, Ammar R, Provart NJ, Millar AH, Geisler
M (2007) A predicted interactome for Arabidopsis. Plant Physiol 145:
317–329
Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi
CE, Godwin B, Vitols E, et al (2003) A protein interaction map of
Drosophila melanogaster. Science 302: 1727–1736
Grelon M, Vezon D, Gendrot G, Pelletier G (2001) AtSPO11-1 is necessary
for efficient meiotic recombination in plants. EMBO J 20: 589–600
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a
receiver operating characteristic (ROC) curve. Radiology 143: 29–36
Hartung F, Plchova H, Puchta H (2000) Molecular characterisation of RecQ
homologues in Arabidopsis thaliana. Nucleic Acids Res 28: 4275–4282
Hartung F, Suer S, Puchta H (2007) Two closely related RecQ helicases have
antagonistic roles in homologous recombination and DNA repair in
Arabidopsis thaliana. Proc Natl Acad Sci USA 104: 18836–18841
Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH (2007)
SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res 35:
D213–D218
Higgins JD, Armstrong SJ, Franklin FC, Jones GH (2004) The Arabidopsis
MutS homolog AtMSH4 functions at an early step in recombination:
evidence for two classes of recombination in Arabidopsis. Genes Dev 18:
2557–2570
Hua S, Sun Z (2001) A novel method of protein secondary structure
prediction with high segment overlap measure: support vector machine
approach. J Mol Biol 308: 397–407
Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A,
Snyder M, Greenblatt JF, Gerstein M (2003) A Bayesian networks
approach for predicting protein-protein interactions from genomic data.
Science 302: 449–453
Karlsen RE, Gorsich DJ, Gerhart GR (2000) Target classification via
support vector machines. Opt Eng 39: 704–711
Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C,
Dimmer E, Feuermann M, Friedrichsen A, Huntley R, et al (2006)
IntAct: open source resource for molecular interaction data. Nucleic
Acids Res 35: D561–D565
Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod
N, Bader GD, Xenarios I, Wojcik J, Sherman D, et al (2007) Broadening
the horizon: level 2.5 of the HUPO-PSI format for molecular interac-
tions. BMC Biol 5: 44
Kim KI, Jung K, Park SH, Kim HJ (2001) Support vector machine-based
text detection in digital video. Pattern Recognit 34: 527–529
Predicted Arabidopsis Interactome
Plant Physiol. Vol. 151, 2009 45
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021
![Page 13: Computational Identification of Potential Molecular Interactions in Arabidopsis1[C]](https://reader031.vdocuments.mx/reader031/viewer/2022020706/61fc9c949d50e757a521a3c0/html5/thumbnails/13.jpg)
Kobbe D, Blanck S, Demand K, Focke M, Puchta H (2008) AtRECQ2, a
RecQ helicase homologue from Arabidopsis thaliana, is able to disrupt
various recombinogenic DNA structures in vitro. Plant J 55: 397–405
Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO,
Han JD, Chesneau A, Hao T, et al (2004) A map of the interactome
network of the metazoan C. elegans. Science 303: 540–543
Liong SY, Sivapragasam C (2002) Flood stage forecasting with support
vector machines. J Am Water Resour Assoc 38: 173–186
Obayashi T, Hayashi S, Saeki M, Ohta H, Kinoshita K (2009) ATTED-II
provides coexpressed gene networks for Arabidopsis. Nucleic Acids
Res 37: D987–D991
O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehen-
sive database of eukaryotic orthologs. Nucleic Acids Res 33: D476–D480
Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different
biological data and computational classification methods for use in
protein interaction prediction. Proteins 63: 490–500
Qiu J, Noble WS (2008) Predicting co-complexed protein pairs from
heterogeneous data. PLOS Comput Biol 4: e1000054
Raghavachari B, Tasneem A, Przytycka TM, Jothi R (2008) DOMINE:
a database of protein domain interactions. Nucleic Acids Res 36:
D656–D661
Ramani AK, Marcotte EM (2003) Exploiting the co-evolution of interacting
proteins to discover interaction specificity. J Mol Biol 327: 273–284
Revenkova E, Jessberger R (2005) Keeping sister chromatids together:
cohesins in meiosis. Reproduction 130: 783–790
Rhee SYBW, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez
M, Huala E, Lander G, Montoya M, Miller N, et al (2003) The
Arabidopsis Information Resource (TAIR): a model organism database
providing a centralized, curated gateway to Arabidopsis biology, re-
search materials and community. Nucleic Acids Res 31: 224–228
Rhodes DRTS, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram
S, Ghosh D, Pandey A, Chinnaiyan AM (2005) Probabilistic model of the
human protein-protein interaction network. Nat Biotechnol 23: 951–959
Rijsbergen CJ (1979) Information Retrieval. Butterworths, London
Salwinski LMC, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The
Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32:
D449–D451
Scott MS, Barton GJ (2007) Probabilistic prediction and ranking of human
protein-protein interactions. BMC Bioinformatics 8: 239
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N,
Schwikowski B, Ideker T (2003) Cytoscape: a software environment for
integrated models of biomolecular interaction networks. Genome Res
13: 2498–2504
Shen Q, Shi WM, Kong W, Ye BX (2007) A combination of modified
particle swarm optimization algorithm and support vector machine for
gene selection and tumor classification. Talanta 71: 1679–1683
Siaud N, Dray E, Gy I, Gerard E, Takvorian N, Doutriaux MP (2004) Brca2
is involved in meiosis in Arabidopsis thaliana as suggested by its
interaction with Dmc1. EMBO J 23: 1392–1401
Stacey NJ, Kuromori T, Azumi Y, Roberts G, Breuer C, Wada T, Maxwell
A, Roberts K, Sugimoto-Shirasu K (2006) Arabidopsis SPO11-2 func-
tions with SPO11-1 in meiotic recombination. Plant J 48: 206–216
Stark CBB, Requly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a
general repository for interaction datasets. Nucleic Acids Res 34:
D535–D539
Suthram SST, Ruppin E, Sharan R, Ideker T (2006) A direct comparison of
protein interaction confidence assignment schemes. BMC Bioinformat-
ics 7: 360
Szostak JW, Orr-Weaver TL, Rothstein RJ, Stahl FW (1983) The double-
strand-break repair model for recombination. Cell 33: 25–35
Tsesmetzis N, Couchman M, Higgins J, Smith A, Doonan JH, Seifert GJ,
Schmidt EE, Vastrik I, Birney E, WuG, et al (2008) Arabidopsis reactome:
a foundation knowledgebase for plant systems biology. Plant Cell 20:
1426–1436
Uetz PGL, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D,
Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, et al (2000) A
comprehensive analysis of protein-protein interactions in Saccharomy-
ces cerevisiae. Nature 503: 623–627
Vapnik VN (1998) Statistical Learning Theory. Wiley, New York
Vapnik VN (2000) The Nature of Statistical Learning Theory, Ed 2.
Springer, New York
von Mering CKR, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002)
Comparative assessment of large-scale data sets of protein-protein
interactions. Nature 417: 399–403
Wang Y, Mallya SM, Sikpi MO (2000) Calmodulin antagonists and cAMP
inhibit ionizing-radiation-enhancement of double-strand-break repair
in human cells. Mutat Res 460: 29–39
Wijeratne, Asela J, Ma, H (2007) Genetic analyses of meiotic recombination
in Arabidopsis. Journal of Integrative Plant Biology 49: 1199–1207
Winters-Hilt S, Yelundur A, McChesney C, Landry M (2006) Support
vector machine implementations for classification and clustering. BMC
Bioinformatics (Suppl 2) 7: S4
Xu H, Lin M, Wang W, Li Z, Huang J, Chen Y, Chen X (2007) Learning the
drug target-likeness of a protein. Proteomics 7: 4255–4263
Xu JH, Li F, Sun QF (2008) Identification of microRNA precursors with
support vector machine and string kernel. Genomics Proteomics Bio-
informatics 6: 121–128
Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ (2004) Effect of molecular
descriptor feature selection in support vector machine classification of
pharmacokinetic and toxicological properties of chemical agents.
J Chem Inf Comput Sci 44: 1630–1638
Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J,
Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al (2008) High-
quality binary protein interaction map of the yeast interactome network.
Science 322: 104–110
Yuan Z, Burrage K, Mattick JS (2002) Prediction of protein solvent
accessibility using support vector machines. Proteins 48: 566–570
Zhang LV, Wong SL, King OD, Roth FP (2004) Predicting co-complexed
protein pairs using genomic and proteomic data integration. BMC
Bioinformatics 5: 38
Zhao C, Zhang H, Zhang X, Zhang R, Luan F, Liu M, Hu Z, Fan B (2006a)
Prediction of milk/plasma drug concentration (M/P) ratio using sup-
port vector machine (SVM) method. Pharm Res 23: 41–48
Zhao CY, Zhang HX, Zhang XY, Liu MC, Hu ZD, Fan BT (2006b)
Application of support vector machine (SVM) for prediction toxic
activity of different data sets. Toxicology 217: 105–119
Lin et al.
46 Plant Physiol. Vol. 151, 2009
Dow
nloaded from https://academ
ic.oup.com/plphys/article/151/1/34/6109796 by guest on 18 N
ovember 2021