
Page 1: Predicting pupylation sites in prokaryotic proteins using pseudo-amino acid composition and extreme learning machine

Predicting pupylation sites in prokaryotic proteins using pseudo-amino acid composition and extreme learning machine

Yong-Xian Fan a,b, Hong-Bin Shen a,*

a Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
b School of Computer Science and Engineering, Guilin University of Electronic Technology, Guilin 541004, China

Article info

Article history: Received 27 August 2012; Received in revised form 21 November 2012; Accepted 25 November 2012; Available online 24 October 2013

Keywords: Pupylated protein; Pupylation sites; Pseudo-amino acid composition; Extreme learning machine; Bioinformatics; PupS

Abstract

Pupylation is one of the most important post-translational modifications of prokaryotic proteins, playing a key role in regulating a wide range of biological processes. Prokaryotic ubiquitin-like protein can attach to specific lysine residues of substrate proteins by forming isopeptide bonds for the selective degradation of proteins in Mycobacterium tuberculosis. In order to comprehensively understand these pupylation-related biological processes, identification of pupylation sites in the substrate protein sequence is the first step. The traditional wet-lab experimental approaches are both laborious and time-consuming. To discover pupylation sites in a timely and effective manner when facing the avalanche of new protein sequences emerging in the post-genomic era, a novel computational predictor called PupS (pupylation site predictor) is proposed. PupS is constructed on the pseudo-amino acid composition and trained with an extreme learning machine. The jackknife cross-validation results on the training dataset show that the area under the ROC curve (AUC) of PupS is 0.6483, and an AUC of 0.6779 is obtained on the independent set. Our results also demonstrate that ELM is complementary to other algorithms and that constructing an ensemble classifier will generate better results. The PupS software package is available at http://www.csbio.sjtu.edu.cn/bioinf/PupS/.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Prokaryotic ubiquitin-like protein (Pup) is the first identified post-translational small protein modifier in prokaryotes. Pup can attach to specific lysine (K) residues of substrate proteins by forming isopeptide bonds for the selective degradation of proteins in Mycobacterium tuberculosis (Mtb), which is similar to ubiquitin (Ub) mediated proteolysis in eukaryotes [1]. Pupylation in prokaryotes plays critical roles in numerous regulatory functions such as protein degradation and signal transduction [2]. Although pupylation and eukaryotic ubiquitylation are similar in functional roles, their enzymology is different. In contrast with the three-step reaction of eukaryotic ubiquitylation with E1, E2, and E3 ligases [3], prokaryotic pupylation requires only two steps, so that only two enzymes are involved. First, the C-terminal glutamine of Pup is deamidated to glutamic acid by deamidase of Pup (Dop) [4]. Second, the deamidated Pup is attached to specific lysine (K) residues of pupylated substrate proteins by proteasome accessory factor A (PafA) [5].

The identification of pupylated substrate proteins along with pupylation sites can provide valuable insights into the substrate specificity and functions of pupylation. With the application of large-scale proteomics technologies, such as tandem mass spectrometry, the number of identified pupylated proteins continues to grow [6–9]. Proteome-wide analyses have already revealed at least several hundred potential pupylated proteins in the model organism M. smegmatis, in which the selective degradation of pupylation-mediated proteins was proposed to be highly dynamic and dependent on the culture conditions [8,9]. Although much progress has been achieved in this regard, the number of experimentally verified pupylated proteins and pupylation sites is still relatively small, as the traditional experimental approaches are laborious and time-consuming. An alternative bioinformatics method that can quickly predict potential true pupylation sites throughout entire protein sequences is highly desired, as it would provide timely and helpful information for further experimental verification.

To the best of our knowledge, only one predictor, GPS-PUP [10], is currently available specifically for predicting pupylation sites. GPS-PUP is mainly based on the no-interval alignment scoring method and a training set including 127 experimentally identified pupylation sites in 109 prokaryotic pupylated proteins. In this article, to speed up progress in this field, we developed a new


0925-2312/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.neucom.2012.11.058

* Corresponding author. Tel.: +86 21 34205320; fax: +86 21 34204022. E-mail address: [email protected] (H.-B. Shen).

Neurocomputing 128 (2014) 267–272


sequence-based computational method called PupS for predicting pupylation sites. We used an advanced machine learning method, the extreme learning machine (ELM) [11], to establish the prediction model. ELM has been found to be powerful in dealing with complex biological data. For example, ELM was applied to protein sequence and microarray gene expression cancer diagnosis classifications [18,19]. Recently, ELM was also incorporated with N-to-1 neural networks to detect transmembrane β-barrel (TMBB) proteins [20]. Pseudo-amino acid composition (PseAAC) was used to encode the amino acid sequences into vectors, which not only represent amino acid composition information but also capture the correlations between amino acid residues [12]; PseAAC has been widely used for protein attribute prediction [13–16] (detailed review in [17]).

Several features distinguish the current study from related works. First, a larger new non-redundant dataset is constructed in this study, which is important for training a solid statistical machine learning algorithm. Second, the pseudo-amino acid composition (PseAAC) feature applied in this paper is able to capture information about sequence-order correlations. Third, the applied prediction model, ELM, is much faster than other algorithms such as SVM, which enables the current model to deal with large-scale datasets.

2. Materials and methods

2.1. Datasets

A freely accessible database named PupDB [21] has been established which integrates information on both pupylated proteins and pupylation sites. In this study, for the purpose of training a solid predictor, we constructed a new non-redundant dataset according to the following steps:

(1) All pupylated proteins and pupylation sites with experimental evidence were extracted from the latest version of the PupDB database [21]. This initial dataset contains 182 pupylated proteins including 215 pupylation sites; 2 of these pupylation sites that are not on lysine (K) residues were not considered.

(2) The remaining 180 pupylated proteins including 213 pupylation sites were further input to the CD-HIT [22] method to remove redundancy with a pairwise sequence identity cut-off of 30%. We then obtained the final non-redundant dataset consisting of 145 pupylated protein sequences including 174 pupylation sites.

(3) Ten pupylated proteins were randomly selected from the original non-redundant dataset to construct an independent test set, and the remaining 135 pupylated proteins serve as the training set.

(4) We then used a peptide centered on a lysine (K) residue to encode the target to be predicted as positive (pupylation site) or negative (non-pupylation site). By testing several different sizes and according to the two sample logo (see Fig. 1), a window size of 25 residues was adopted in this study. Each lysine (K) residue can then be represented by a peptide segment consisting of 12 residues upstream and 12 residues downstream of the lysine (K). Since the numbers of pupylation sites and putative non-pupylation sites were imbalanced (less than 1:12), a relative balancing procedure was applied, and three times as many negative samples as positive ones were selected for the training set. In the independent test set, however, we retained all the positive and negative samples in order to simulate the real situation.

(5) Finally, the 135 training pupylated proteins, including 158 positive peptide segments (pupylation sites) and 1928 negative peptides (putative non-pupylation sites), constitute the training set. The 10 testing pupylated proteins, including 16 positive samples (pupylation sites) and 279 negative samples (putative non-pupylation sites), constitute the testing set.
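The peptide-extraction step (4) above can be sketched as follows. This is a minimal illustration rather than the authors' code; the helper name `extract_windows` and the use of 'X' to pad windows that run off the sequence ends are our assumptions, not specified in the paper.

```python
def extract_windows(sequence, half_window=12, pad='X'):
    """Return (1-based position, peptide) for each lysine (K) in the sequence.

    Each peptide has half_window residues on either side of the lysine
    (25 residues total for half_window=12); positions falling off either
    end of the sequence are padded with `pad`.
    """
    padded = pad * half_window + sequence + pad * half_window
    windows = []
    for i, residue in enumerate(sequence):
        if residue == 'K':
            # index i in the sequence maps to i + half_window in `padded`
            center = i + half_window
            windows.append((i + 1,
                            padded[center - half_window:center + half_window + 1]))
    return windows

# Toy sequence with lysines at 1-based positions 3 and 10
peps = extract_windows("MAKVLDSTRKEAG")
```

Every returned peptide is 25 residues long with the lysine at the central position (index 12), matching the window convention used in step (4).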

2.2. Methods

2.2.1. Sequence analysis of the position-specific attributes

To determine whether pupylation and putative non-pupylation sites have distinct sequence properties, we calculated statistically significant differences in the distribution of amino acid residues in the vicinity of the 174 (158+16) pupylation peptides and 2207 (1928+279) putative non-pupylation segments. The two sample logo of the position-specific residue composition in the vicinity of the pupylation sites and putative non-pupylation sites in a window of length 25 was created as shown in Fig. 1 [23]. In the two sample logo, polar amino acids (G, S, T, Y, C) are shown in green, amide (Q, N) in purple, basic (K, R, H) in blue, acidic (D, E) in red, and hydrophobic (A, V, L, I, P, W, F, M) amino acids in black.

The two sample logo shows compositional differences between pupylation sites and putative non-pupylation sites. The most distinct features of pupylation sites are the enrichment of the acidic amino acid E at positions −4 and −7, the acidic amino acid D at positions −5 and 4, hydrophobic amino acids (A, V, and L) at positions −12, −9, −6, −3 and −2, and the positively charged amino acid R at positions −12, −11, 7 and 12. On the contrary, depletion of lysine (K) at position −2, polar amino acids (G and S) at positions −7, 4 and 9, the acidic amino acid D at position 12, and the hydrophobic amino acid L at position −12 is observed around pupylation sites. These statistics show that there are strong correlations between the residues around the pupylation sites, requiring proper encoding methods.
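The per-position enrichment and depletion behind Fig. 1 can be approximated with a simple frequency-difference computation. This is a hedged sketch: the actual two sample logo applies a per-position significance test (P<0.05, t-test), which is omitted here, and the function names are illustrative.

```python
from collections import Counter

def position_frequencies(peptides):
    """Per-position residue frequencies for a list of equal-length peptides."""
    n = len(peptides)
    length = len(peptides[0])
    return [{aa: c / n for aa, c in Counter(p[i] for p in peptides).items()}
            for i in range(length)]

def enrichment(pos_peps, neg_peps):
    """freq(positive) - freq(negative) at each position; > 0 means enriched
    around the modification site, < 0 means depleted."""
    fp, fn = position_frequencies(pos_peps), position_frequencies(neg_peps)
    return [{aa: fp[i].get(aa, 0.0) - fn[i].get(aa, 0.0)
             for aa in set(fp[i]) | set(fn[i])}
            for i in range(len(fp))]

# Toy 3-residue peptides instead of the paper's 25-residue windows
diff = enrichment(["AEK", "AEK", "ADK"], ["GSK", "AEK", "GSK"])
```

A residue/position pair with a large positive difference in the real data (e.g. E at position −4) would appear in the upper half of the two sample logo.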

2.2.2. Encoding protein sequences with pseudo-amino acid composition

According to the definition of amino acid composition (AAC), the AAC of a protein sequence can be represented by 20 discrete numbers, each denoting the occurrence frequency of one of the 20 native amino acids in the protein. However, if the 20-dimensional AAC is used to represent a protein sequence, all its sequence-order information is lost. For instance, AAC cannot capture the strong residue correlations around the pupylation sites shown in Fig. 1. In view of this, instead of the conventional AAC, we adopt PseAAC to represent a protein sample as a (20+λ)-dimensional vector. The first 20 elements in PseAAC reflect the traditional global amino acid composition of the sequence and the latter λ elements represent the local correlations among residues [12]. In our application, each sample is represented by a peptide of 25 amino acid residues, consisting of 12 residues upstream and 12 residues downstream of the lysine (K).

Fig. 1. The two sample logo of the position-specific residue composition in the vicinity of the 174 pupylation sites and 2207 non-pupylation sites in a window of 25 residues, illustrating compositional differences between pupylation sites and putative non-pupylation sites. Only amino acid residues significantly enriched and depleted (P<0.05; t-test) are shown.

Given the ith peptide sample Pi with 25 amino acid residues:

$$P_i = R_i^{1} R_i^{2} R_i^{3} R_i^{4} R_i^{5} R_i^{6} R_i^{7} \cdots R_i^{25}, \quad (1)$$

where $R_i^{1}$ represents the 1st residue of the ith peptide, $R_i^{2}$ the 2nd residue, and so forth. Its PseAAC can be generally formulated as

$$x_i = [x_i^{1}, x_i^{2}, \ldots, x_i^{20}, x_i^{20+1}, x_i^{20+2}, \ldots, x_i^{20+\lambda}]^{T}, \quad (2)$$

where the first 20 components are the same as those in the conventional AAC, λ should be less than the peptide sample's length of 25, and $x_i^{20+1}, x_i^{20+2}, \ldots, x_i^{20+\lambda}$ are the factors related to λ different ranks of sequence-order correlations that can be easily computed. In our study, preliminary tests found that the optimal value for λ is 11. Thus, given a peptide sample, a 31-D (20+11) PseAAC vector can be derived from Eq. (2) or with the online server at http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/ [24].
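A simplified PseAAC encoder in the spirit of Eq. (2) might look as follows. This sketch is our own: it uses a single property scale (Kyte-Doolittle hydrophobicity, unnormalized) and mean squared differences for the correlation factors θk, whereas Chou's full definition combines several normalized physicochemical properties; the weight w=0.05 is likewise an assumed default, not a value from the paper.

```python
# Kyte-Doolittle hydrophobicity as the single correlation property
HYDRO = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
         'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
         'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
         'Y': -1.3, 'V': 4.2}
AA = sorted(HYDRO)

def pseaac(seq, lam=11, w=0.05):
    """(20 + lam)-dimensional pseudo-amino acid composition vector."""
    seq = [r for r in seq if r in HYDRO]      # skip any padding symbols
    counts = [seq.count(a) for a in AA]
    # theta_k: mean squared property difference between residues k apart
    thetas = [sum((HYDRO[seq[i]] - HYDRO[seq[i + k]]) ** 2
                  for i in range(len(seq) - k)) / (len(seq) - k)
              for k in range(1, lam + 1)]
    denom = sum(counts) + w * sum(thetas)
    # first 20 entries: weighted AAC; last lam entries: correlation factors
    return [c / denom for c in counts] + [w * t / denom for t in thetas]

vec = pseaac("AAKVLDSTRKEAGMAKVLDSTRKEA", lam=11)   # 25-residue toy peptide
```

With λ=11 the vector is 31-dimensional, and by construction its components sum to 1, mirroring the normalization of Chou's PseAAC.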

2.2.3. Constructing PupS with extreme learning machine

In principle, after encoding the features, any statistical machine learning algorithm can be applied to predict pupylation sites. In this paper, we mainly focus on the investigation of a recently proposed model, the extreme learning machine (ELM), for its fast speed and the good performance reported in the literature [11]. ELM is a special single-hidden-layer feedforward network (SLFN) and can provide good generalization performance at extremely fast training speed [11,25–28]. If an SLFN with L hidden nodes and activation function h(x) can approximate N arbitrary distinct samples $(x_j, y_j) \in \mathbb{R}^d \times \mathbb{R}^m$ without error, then

$$\sum_{i=1}^{L} \beta_i h_i(x_j) = h(x_j)\beta = y_j, \quad j = 1, 2, \ldots, N, \quad (3)$$

where $\beta_i$ is the output weight vector connecting the ith hidden node and the output nodes, $\beta = [\beta_1, \ldots, \beta_L]^T$ is the output weight matrix, $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output vector of the hidden layer with respect to the input x, and $y_j$ is the score output by the network, indicating the probability that the query residue is a true pupylation site. In our study, the sigmoid activation function G(w, x, b) was employed for h.

The above N equations can be written compactly as

$$H\beta = Y, \quad (4)$$

where

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \cdots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}_{N \times L} = \begin{bmatrix} G(w_1, x_1, b_1) & \cdots & G(w_L, x_1, b_L) \\ \vdots & \cdots & \vdots \\ G(w_1, x_N, b_1) & \cdots & G(w_L, x_N, b_L) \end{bmatrix}_{N \times L}, \quad (5)$$

and

$$Y = \begin{bmatrix} y_1^{T} \\ \vdots \\ y_N^{T} \end{bmatrix}_{N \times m}, \quad (6)$$

where $w_i$ and $b_i$ are the parameters of the ith hidden node, and H is called the hidden layer output matrix of the neural network. Huang et al. [11,29] have proved that ELM differs from common SLFNs, in which the hidden node parameters must be learned iteratively: in ELM the hidden node parameters may be randomly chosen and fixed, with almost any nonzero activation function, and the output weights can then be analytically determined to approximate any continuous target function on any compact input set. Eq. (4) thus becomes a linear system, and the output weight matrix β is estimated as

$$\beta = H^{\dagger} Y, \quad (7)$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the hidden layer output matrix H. Once the random hidden node parameters are generated and the output weight vectors are obtained according to Eq. (7), the predictions for a set of unknown protein peptides can be assigned using Eq. (3).
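The training procedure of Eqs. (3)-(7), random hidden-node parameters, the hidden-layer output matrix H, and the pseudo-inverse solution for β, can be sketched with NumPy as follows. This is a minimal single-output sketch, not the authors' implementation; the toy data, random seed, and function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=70):
    """Train an ELM: random hidden layer, analytic output weights (Eq. (7))."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_hidden))       # random input weights w_i
    b = rng.normal(size=n_hidden)            # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid G(w, x, b), Eq. (5)
    beta = np.linalg.pinv(H) @ y             # Moore-Penrose solution, Eq. (7)
    return W, b, beta

def elm_score(X, model):
    """Network output h(x) @ beta of Eq. (3) for new samples."""
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy data standing in for 31-D PseAAC vectors with 0/1 labels
X = rng.normal(size=(40, 31))
y = (X[:, 0] > 0).astype(float)
model = elm_train(X, y, n_hidden=20)
scores = elm_score(X, model)
```

The only trained quantities are the output weights β, obtained in closed form; this is what makes ELM training so much faster than iterative backpropagation or an SVM grid search.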

Given a training set of pupylated protein sequences with their pupylation sites, the learning procedure is shown in Fig. 2. A flowchart intuitively showing how PupS predicts pupylation sites from the primary sequence by integrating PseAAC and ELM is given in Fig. 3.

2.2.4. Cross-validation and performance assessment

In order to provide a stringent performance evaluation, jackknife cross-validation was carried out on the training set at the peptide level: one peptide segment is held out for testing while the remaining peptide segments are used for training, and the procedure terminates when every peptide has been tested individually. This jackknife test was used to select the optimal number of hidden neurons. On that basis, the independent test was carried out to compare the prediction performance of ELM, SVM, and GPS-PUP [10]. The predictive ability was assessed with several measures, namely Sensitivity (SN), Specificity (SP), Positive predictive value (PPV), F1 score, the Matthews correlation coefficient (MCC), and the Overall Accuracy (ACC). They are respectively defined as follows:

$$SN = \frac{TP}{TP+FN}, \quad (8)$$

$$SP = \frac{TN}{TN+FP}, \quad (9)$$

$$PPV = \frac{TP}{TP+FP}, \quad (10)$$

Six-step learning procedure for predicting pupylation sites:

1) Extract the positive peptides (pupylation sites) and negative peptides (putative non-pupylation sites) with a window of length 25;
2) For each peptide, generate its pseudo-amino acid composition vector according to Eq. (2);
3) Set the number of hidden neurons in ELM;
4) Randomly assign the hidden node parameters;
5) Calculate the hidden layer output matrix H according to Eq. (5);
6) Obtain the estimates of the output weight vectors according to Eq. (7).

Fig. 2. Six-step learning procedure for predicting pupylation sites based on PseAAC and ELM.

Fig. 3. The flowchart showing how PupS works.



$$F1 = \frac{2 \times SN \times PPV}{SN + PPV}, \quad (11)$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TN+FN)(TP+FN)(TN+FP)}}, \quad (12)$$

$$ACC = \frac{TP+TN}{TP+TN+FP+FN}, \quad (13)$$

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
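Eqs. (8)-(13) translate directly into a small helper; this sketch is ours, with an illustrative function name and toy confusion-matrix counts.

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute SN, SP, PPV, F1, MCC, and ACC from a confusion matrix."""
    sn = tp / (tp + fn)                       # Eq. (8), sensitivity/recall
    sp = tn / (tn + fp)                       # Eq. (9), specificity
    ppv = tp / (tp + fp)                      # Eq. (10), precision
    f1 = 2 * sn * ppv / (sn + ppv)            # Eq. (11)
    mcc = (tp * tn - fp * fn) / math.sqrt(    # Eq. (12)
        (tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    acc = (tp + tn) / (tp + tn + fp + fn)     # Eq. (13)
    return {'SN': sn, 'SP': sp, 'PPV': ppv, 'F1': f1, 'MCC': mcc, 'ACC': acc}

m = metrics(tp=10, fp=5, tn=80, fn=5)
```

Note how on this imbalanced toy example ACC (0.90) is flattering while PPV and MCC remain modest, which is why the paper reports the full panel of measures rather than accuracy alone.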

However, the above-mentioned measures rely on the selected threshold. The area under the ROC curve (AUC), which is threshold-independent, can be easily calculated according to the following formula [30]:

$$AUC = \frac{S_0 - n_0(n_0+1)/2}{n_0 \cdot n_1}, \quad (14)$$

where $n_0$ and $n_1$ denote the number of positive and negative samples respectively, and $S_0$ is the sum of the ranks of all positive samples when all samples are ranked in increasing order of their estimated probability of belonging to the positive class. AUC values give a good insight into the performance comparison of different prediction methods.
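The rank-sum formula of Eq. (14) can be sketched as below; the function name is illustrative, and for brevity this simple version does not average ranks over tied scores.

```python
def auc_from_ranks(pos_scores, neg_scores):
    """Eq. (14): AUC = (S0 - n0(n0+1)/2) / (n0 * n1),
    with S0 the rank-sum of positives in the ascending score order."""
    n0, n1 = len(pos_scores), len(neg_scores)
    ranked = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    s0 = sum(rank for rank, (_, is_pos) in enumerate(ranked, start=1) if is_pos)
    return (s0 - n0 * (n0 + 1) / 2) / (n0 * n1)

# Perfectly separated scores give AUC = 1.0
a = auc_from_ranks([0.9, 0.8], [0.1, 0.2])
```

This is the Mann-Whitney view of the AUC: the fraction of positive/negative pairs in which the positive sample is ranked higher.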

Although the AUC is threshold-independent, an appropriate threshold must be selected for the final decision. For a classifier that outputs a continuous numeric value representing the confidence or probability of a sample belonging to the predicted class, adjusting the classification threshold leads to different confusion matrices, which determine different ROC points [31]. From each confusion matrix, the Specificity, Sensitivity, Accuracy, and Matthews correlation coefficient can be calculated. For imbalanced classification problems, evaluation results are usually reported at the ROC point where the sensitivity value is equal to the specificity value [32]. This specific ROC point, which reports evaluation results in a balanced manner, is the intersection of the ROC curve and the straight line passing through the points (0, 1) and (1, 0). The optimal threshold is identified accordingly.
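Selecting the balanced operating point described above amounts to scanning candidate thresholds for the one where sensitivity and specificity are closest; this is a sketch with an illustrative function name, scanning only the observed scores rather than a continuous range.

```python
def balanced_threshold(pos_scores, neg_scores):
    """Pick the score threshold where sensitivity and specificity are closest,
    i.e. the ROC point nearest the line through (0, 1) and (1, 0)."""
    best_t, best_gap = None, float('inf')
    for t in sorted(set(pos_scores + neg_scores)):
        sn = sum(s >= t for s in pos_scores) / len(pos_scores)   # sensitivity
        sp = sum(s < t for s in neg_scores) / len(neg_scores)    # specificity
        if abs(sn - sp) < best_gap:
            best_t, best_gap = t, abs(sn - sp)
    return best_t

t = balanced_threshold([0.9, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

On the training set this procedure yields the SN = SP = 62.03% operating point reported in Table 1.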

3. Results and discussions

In order to assess the performance of combining the PseAAC encoding with the ELM algorithm for pupylation site prediction and to compare it with other methods, all experiments were carried out in MATLAB 7.12 on a Dell PC with a 2.92 GHz Intel Core (TM) Duo CPU and 4 GB of memory.

3.1. Performance of PupS based on ELM

In our study, the feature vector generated by the PseAAC encoding was quite compact, so we selected the optimal number of hidden neurons by testing values from 10 to 90 in steps of 5. The results of the jackknife validation are shown in Fig. 4. The AUC value reached its maximum of 0.6483 when the number of hidden neurons was set to 70. With this optimal number of hidden neurons, the ROC curve for predicting pupylation sites was plotted as shown in Fig. 5. The optimal decision threshold was selected according to the intersection of the ROC curve and the straight line passing through the points (0, 1) and (1, 0). The other measures are listed in Table 1. According to the balanced threshold selected in Fig. 5, the sensitivity (SN) is 62.03%, equal to the specificity (SP), and the overall accuracy (ACC), positive predictive value (PPV), F1 score, and Matthews correlation coefficient (MCC) are 62.03%, 35.26%, 44.96%, and 0.2099, respectively.

3.2. Comparison with other methods on the independent set

To demonstrate the performance of the current model, we compared it with the GPS-PUP model. At the same time, we also carried out comparisons between the ELM, SVM, Naive Bayes, and KNN algorithms. The independent set was submitted to the GPS-PUP web server [10] and the outputs were used to calculate the corresponding sensitivity and specificity according to the high, medium, and low thresholds set in GPS-PUP [10]. It should be pointed out that we cannot guarantee that these 10 pupylated proteins (the independent test set) are not included in the training dataset of GPS-PUP [10]. The SVM model was trained on the same training set as PupS, and the optimal SVM cost parameter C and kernel parameter γ were obtained by grid search: we tried the 9×9=81 different combinations of C ∈ {2^6, 2^5, …, 2^−1, 2^−2} and γ ∈ {2^3, 2^2, …, 2^−4, 2^−5} to select the optimal combination of (C, γ) for the final independent test. The Naive Bayes classifier model was built on the same training set using the default Gaussian distribution. For the KNN algorithm, the only parameter K was set to 5 after testing the odd values 1, 3, …, 19, 21.
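The 9×9 SVM parameter grid described above can be enumerated as follows. This only sketches the search space; the cross-validated evaluation loop that actually scores each (C, γ) pair, and the SVM itself, are omitted.

```python
# Candidate values, from 2^6 down to 2^-2 for C and 2^3 down to 2^-5 for gamma
Cs = [2.0 ** e for e in range(6, -3, -1)]
gammas = [2.0 ** e for e in range(3, -6, -1)]

# All 81 (C, gamma) combinations to be evaluated by cross-validation
grid = [(C, g) for C in Cs for g in gammas]
```

Each of these 81 combinations requires training and validating an SVM, which is why the grid search dominates the SVM training time reported below.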

Fig. 4. The relationship between the jackknife validation performance and the number of hidden neurons.

Fig. 5. ROC curve of PupS for predicting pupylation sites using the jackknife test at peptide level.

Table 1. Prediction performance for predicting pupylation sites using the jackknife test at peptide level on the training dataset.

AUC     SP      SN      ACC     PPV     F1      MCC
0.6483  62.03%  62.03%  62.03%  35.26%  44.96%  0.2099

The performances of the different methods are shown in Fig. 6 and Table 2. As observed from Fig. 6, the ELM algorithm achieves an AUC value of 0.6779 on the independent set. According to the

outputs of the GPS-PUP web server [10] at different thresholds, the sensitivity and specificity are 12.50% and 91.04% at the high threshold, which is denoted by the solid circle point "●" in Fig. 6 and lies under the ROC curve of the ELM algorithm. This demonstrates that the current PupS predictor is better than GPS-PUP [10] at the high threshold. The cross point "+" denotes a sensitivity of 25.00% and specificity of 88.53% at the medium threshold, and the star point "*" represents a sensitivity of 62.50% and specificity of 82.04% at the low threshold.

Comparing ELM with SVM, we find that the training time of the SVM algorithm (0.6034 s) is roughly four times that of the ELM algorithm (0.1404 s). The training time of the SVM algorithm mainly depends on the size of the search grid for the optimal parameters. Considering this, many methods have been proposed to speed up the optimization process [33,34]. For example, the recently developed GFO calculator reduces the traditional 2D grid search in SVM parameter space to a 1D search problem [33,34]. These methods are expected to improve the efficiency of SVM, but they are not the focus of this study. In addition to the high efficiency of ELM compared to SVM with a traditional 2D grid search for optimal parameters, the prediction performance of ELM is also promising. The AUC of the SVM algorithm is 0.6664, lower than that of ELM. The sensitivity (SN), specificity (SP), overall accuracy (ACC), positive predictive value (PPV), F1 score, and Matthews correlation coefficient (MCC) of the ELM algorithm reach 62.50%, 62.72%, 62.71%, 8.77%, 15.38%, and 0.1173, respectively, while for SVM these criteria are 62.50%, 61.65%, 61.70%, 8.55%, 15.04%, and 0.1118, as illustrated in Table 2. Hence, it can be seen that almost all evaluation

measures of the ELM algorithm are slightly better than those of the SVM algorithm, except sensitivity (SN), which is equal. Considering the high efficiency as well as the good performance, ELM is expected to be particularly suitable for dealing with large-scale biological datasets.

We also trained and tested the Naive Bayes and KNN algorithms on the current benchmark dataset; the results are shown in Table 2 and Fig. 6. The sensitivity and specificity of the KNN algorithm are 18.75% and 90.68%, denoted by the hollow circle point "○" in Fig. 6. These results demonstrate that the simple KNN algorithm is not as powerful as the other algorithms for predicting pupylation sites. The training time of the Naive Bayes classifier (0.0040 s) is the shortest and its AUC is 0.6967. Although the training efficiency and AUC value of the Naive Bayes classifier outperform those of ELM and SVM, its sensitivity (SN), specificity (SP), overall accuracy (ACC), positive predictive value (PPV), F1 score, and Matthews correlation coefficient (MCC) are 56.25%, 56.99%, 56.95%, 6.98%, 12.41%, and 0.0605, respectively, which are lower than those of ELM and SVM.

The different performances of the ELM, SVM, and Naive Bayes classifiers motivate us to investigate whether the prediction accuracy can be further enhanced by fusing the diversity of the three algorithms. To demonstrate this, the outputs of the ELM, SVM, and Naive Bayes classifiers were integrated using the simple average rule, which can be formulated as

$$E(x) = \frac{1}{3} \sum_{j=1}^{3} p_j(x), \quad (15)$$

where $p_j(x)$ denotes the predicted probability of the jth algorithm. The AUC value of the ensemble classifier E(x) of Eq. (15) is 0.6985, as shown in Fig. 6, which is higher than that of any single algorithm. Its sensitivity (SN), specificity (SP), overall accuracy (ACC), positive predictive value (PPV), F1 score, and Matthews correlation coefficient (MCC) reach 68.75%, 67.03%, 67.12%, 10.68%, 18.49%, and 0.1700, respectively, which are also higher than those of the ELM, SVM, and Naive Bayes classifiers. These results demonstrate that the ELM algorithm is complementary to other classifiers, and that integrating ELM with others is a promising way to construct more reliable predictors and achieve better results.
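The simple average rule of Eq. (15) can be sketched as below; the stand-in classifiers and their fixed probability outputs are purely illustrative.

```python
def ensemble_score(x, predictors):
    """Eq. (15): average the member classifiers' predicted probabilities."""
    return sum(p(x) for p in predictors) / len(predictors)

# Three stand-in classifiers returning hypothetical probabilities for a sample
e = ensemble_score(None, [lambda x: 0.9, lambda x: 0.6, lambda x: 0.3])
```

Averaging works precisely because the members make different errors: a sample mis-scored by one classifier can be pulled to the correct side of the threshold by the other two, which matches the complementarity observed between ELM, SVM, and Naive Bayes.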

3.3. Discussions

During the experiments, we also found that the ELM model is affected by random initializations, similar to other neural network algorithms. We thus repeated the experiments 100 times with different initializations to assess the stability of ELM. Table 3 shows the means and standard deviations of the 7 different evaluation metrics. Comparing Table 3 with Table 1, we can find that although ELM shows some instability over the 100 repetitions, the standard deviations are not significant,

[Fig. 6 here: ROC curves (sensitivity vs. 1-specificity). Legend: ELM (AUC=0.6779); SVM (AUC=0.6664); NB (AUC=0.6967); ELM+SVM+NB (AUC=0.6985); single operating points: KNN (0.0932, 0.1875); GPS-PUP (0.0896, 0.1250), (0.1147, 0.2500), (0.1796, 0.6250).]

Fig. 6. Performance comparison between ELM, SVM, Naive Bayes classifier, KNN, and GPS-PUP on the independent test set.

Table 2
Comparisons between ELM, SVM, and Naive Bayes classifier on the independent test set.

Method        SP       SN       ACC      PPV     F1       AUC     MCC     Training time (s)
ELM           62.72%   62.50%   62.71%   8.77%   15.38%   0.6779  0.1173  0.1404
SVM           61.65%   62.50%   61.70%   8.55%   15.04%   0.6664  0.1118  0.6034
Naive Bayes   56.99%   56.25%   56.95%   6.98%   12.41%   0.6967  0.0605  0.0040

Table 3
Means and standard deviations of the 7 evaluation measurements for predicting pupylation sites using the jackknife test on the training dataset over 100 repetitions.

AUC             SP (%)       SN (%)       ACC (%)      PPV (%)      F1 (%)       MCC
0.6511±0.0142   62.23±1.02   62.26±0.98   62.24±1.01   35.41±0.63   45.11±0.78   0.2108±0.0072

Y.-X. Fan, H.-B. Shen / Neurocomputing 128 (2014) 267–272 271


indicating that ELM is overall a stable learning algorithm, which is also supported by the literature [11].
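The stability check summarized in Table 3 amounts to repeating the evaluation with different random seeds and reporting each metric as mean ± standard deviation. A sketch of that procedure, using a deterministic stand-in for "train ELM with this seed and return its jackknife AUC" (the stand-in function and its values are ours, not the paper's):

```python
import statistics

def repeat_evaluation(evaluate, n_runs=100):
    """Repeat a seeded evaluation and report the mean and std of its score."""
    scores = [evaluate(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in evaluation with small, deterministic seed-to-seed variation.
def toy_auc(seed):
    return 0.65 + 0.01 * ((seed % 5) - 2) / 2

mean_auc, std_auc = repeat_evaluation(toy_auc, n_runs=100)
```

A small standard deviation relative to the mean, as in Table 3, is what justifies reporting a single representative model.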

It is also interesting to find from Fig. 4 that the number of hidden nodes in the ELM plays an important role in its performance. Although this paper tested a large range from 10 to 90 and selected 70 to construct our final PupS model, this choice may still not be optimal. Besides, the choice of this parameter can also be application dependent. Hence, one important future direction is to relate the number of hidden nodes to the data structure and to investigate a proper data-driven approach that can automatically yield this parameter.
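One simple data-driven baseline for this future direction is a validation sweep: train at each candidate hidden-node count and keep the best-scoring one. A sketch with a hypothetical validation-score function (the toy curve peaking at 70 merely echoes the value the paper selected; it is not derived from the paper's data):

```python
def select_hidden_nodes(candidates, score):
    """Pick the hidden-node count with the best validation score."""
    return max(candidates, key=score)

# Hypothetical validation curve; a real score would come from cross-validation.
def toy_score(n_hidden):
    return -abs(n_hidden - 70)

best = select_hidden_nodes(range(10, 100, 10), toy_score)  # candidates 10..90
```

A grid sweep is crude compared to the incremental ELM variants cited in [25-27], but it makes the parameter choice reproducible rather than ad hoc.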

4. Conclusions

In order to identify the pupylation sites in query pupylated proteins, a novel predictor, PupS, was presented based on pseudo-amino acid composition and the extreme learning machine (ELM). This study demonstrates that ELM is highly efficient and that its prediction performance is slightly better than that of SVM in pupylation site prediction. Furthermore, we also show that an ensemble classifier that integrates ELM with other classifiers generates better results. Our proposed method, which couples PseAAC encoding with ELM, can be easily applied to other problems in bioinformatics.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 61222306, 91130033, 61175024, 21365008), the Shanghai Science and Technology Commission (No. 11JC1404800), a Foundation for the Author of National Excellent Doctoral Dissertation of PR China (No. 201048), the Program for New Century Excellent Talents in University (NCET-11-0330), and the Shanghai Jiao Tong University Innovation Fund for Postgraduates.

References

[1] M.J. Pearce, J. Mintseris, J. Ferreyra, S.P. Gygi, K.H. Darwin, Ubiquitin-like protein involved in the proteasome pathway of Mycobacterium tuberculosis, Science 322 (2008) 1104–1107.

[2] J. Herrmann, L.O. Lerman, A. Lerman, Ubiquitin and ubiquitin-like proteins in protein regulation, Circ. Res. 100 (2007) 1276–1291.

[3] R.J. Mayer, A.J. Ciechanover, M. Rechsteiner, Protein Degradation: Cell Biology of the Ubiquitin-Proteasome System, John Wiley & Sons, 2008.

[4] F. Striebel, F. Imkamp, M. Sutter, M. Steiner, A. Mamedov, E. Weber-Ban, Bacterial ubiquitin-like modifier Pup is deamidated and conjugated to substrates by distinct but homologous enzymes, Nat. Struct. Mol. Biol. 16 (2009) 647–651.

[5] E. Guth, M. Thommen, E. Weber-Ban, Mycobacterial ubiquitin-like protein ligase PafA follows a two-step reaction pathway with a phosphorylated Pup intermediate, J. Biol. Chem. 286 (2011) 4412–4419.

[6] F.A. Cerda-Maira, F. McAllister, N.J. Bode, K.E. Burns, S.P. Gygi, K.H. Darwin, Reconstitution of the Mycobacterium tuberculosis pupylation pathway in Escherichia coli, EMBO Rep. 12 (2011) 863–870.

[7] R.A. Festa, F. McAllister, M.J. Pearce, J. Mintseris, K.E. Burns, S.P. Gygi, K.H. Darwin, Prokaryotic ubiquitin-like protein (Pup) proteome of Mycobacterium tuberculosis, PLoS One 5 (2010) e8589.

[8] C. Poulsen, Y. Akhter, A.H. Jeon, G. Schmitt-Ulms, H.E. Meyer, A. Stefanski, K. Stuhler, M. Wilmanns, Y.H. Song, Proteome-wide identification of mycobacterial pupylation targets, Mol. Syst. Biol. 6 (2010) 386.

[9] J. Watrous, K. Burns, W.T. Liu, A. Patel, V. Hook, V. Bafna, C.E. Barry, S. Bark, P.C. Dorrestein, Expansion of the mycobacterial PUPylome, Mol. BioSyst. 6 (2010) 376–385.

[10] Z. Liu, Q. Ma, J. Cao, X. Gao, J. Ren, Y. Xue, GPS-PUP: computational prediction of pupylation sites in prokaryotic proteins, Mol. BioSyst. 7 (2011) 2737–2740.

[11] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.

[12] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins 43 (2001) 246–255.

[13] H.B. Shen, K.C. Chou, Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types, Biochem. Biophys. Res. Commun. 334 (2005) 288–292.

[14] H.B. Shen, K.C. Chou, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition, Biochem. Biophys. Res. Commun. 337 (2005) 752–756.

[15] H.B. Shen, J. Yang, K.C. Chou, Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition, J. Theor. Biol. 240 (2006) 9–13.

[16] L. Nanni, S. Brahnam, A. Lumini, Wavelet images and Chou's pseudo-amino acid composition for protein classification, Amino Acids 43 (2011) 657–665.

[17] K.C. Chou, Some remarks on protein attribute prediction and pseudo-amino acid composition, J. Theor. Biol. 273 (2011) 236–247.

[18] D. Wang, G.B. Huang, Protein sequence classification using extreme learning machine, in: Proceedings of the IEEE International Joint Conference on Neural Networks, IJCNN 2005, Montreal, Canada, 2005, pp. 1406–1411.

[19] R. Zhang, G.B. Huang, N. Sundararajan, P. Saratchandran, Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis, IEEE/ACM Trans. Comput. Biol. Bioinf. 4 (2007) 485–495.

[20] C. Savojardo, P. Fariselli, R. Casadio, Improving the detection of transmembrane beta-barrel chains with N-to-1 extreme learning machines, Bioinformatics 27 (2011) 3123–3128.

[21] C.W. Tung, PupDB: a database of pupylated proteins, BMC Bioinf. 13 (2012) 40.

[22] Y. Huang, B. Niu, Y. Gao, L. Fu, W. Li, CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics 26 (2010) 680–682.

[23] V. Vacic, L.M. Iakoucheva, P. Radivojac, Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments, Bioinformatics 22 (2006) 1536–1537.

[24] H.B. Shen, K.C. Chou, PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373 (2008) 386–388.

[25] G.B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (2007) 3056–3062.

[26] G.B. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (2008) 3460–3468.

[27] G.B. Huang, L. Chen, C.K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Networks 17 (2006) 879–892.

[28] G.B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst., Man, Cybern. Part B: Cybern. 42 (2012) 513–529.

[29] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2004, pp. 985–990.

[30] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2001) 171–186.

[31] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (2009) 1263–1284.

[32] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explorations Newsl. 6 (2004) 40–49.

[33] J.B. Yin, T. Li, H.B. Shen, Gaussian kernel optimization: complex problem and a simple solution, Neurocomputing 74 (2011) 3816–3822.

[34] J.B. Lei, J.B. Yin, H.B. Shen, GFO: a data driven approach for optimizing Gaussian function based similarity metric in computational biology, Neurocomputing 99 (2013) 307–315.

Yong-Xian Fan is a Ph.D. candidate with the Department of Automation, Shanghai Jiao Tong University, China. His research interests include machine learning, data mining, computational biology, and bioinformatics.

Hong-Bin Shen received his Ph.D. degree from Shanghai Jiao Tong University, China, in 2007. He was a postdoctoral research fellow at Harvard Medical School from 2007 to 2008 and a visiting professor at the University of Michigan in 2012. Currently, he is a professor at the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University. His research interests include data mining, pattern recognition, and bioinformatics. Dr. Shen has published more than 60 papers and constructed 20 bioinformatics servers in these areas, and he serves on the editorial boards of several international journals.
