Identification of human drug targets using machine-learning algorithms


Priyanka Kumari a,1, Abhigyan Nath a,1, Radha Chaube b,*

a Bioinformatics Section, Mahila Mahavidyalaya, Banaras Hindu University, Varanasi 221005, India
b Zoology/Bioinformatic Section, Mahila Mahavidyalaya, Banaras Hindu University, Varanasi 221005, India

Article info

Article history: Received 6 May 2014; accepted 6 November 2014

Keywords: Drug targets; Ensemble learning; SMOTE; Dipeptide composition; Property group composition; ReliefF

Abstract

Identification of potential drug targets is a crucial task in the drug-discovery pipeline. Successful identification of candidate drug targets in entire genomes is very useful, and computational prediction methods can speed up this process. In the current work we have developed a sequence-based prediction method for discriminating human drug target proteins from human non-drug target proteins. The training features are sequence-based: amino acid composition, amino acid property group composition, and dipeptide composition. The classification of human drug target proteins is a classic example of class imbalance. We have addressed this issue by using SMOTE (Synthetic Minority Over-sampling Technique) as a preprocessing step to balance the training data at a 1:1 ratio between drug targets (minority samples) and non-drug targets (majority samples). Using the Rotation Forest ensemble classifier together with the ReliefF feature-selection technique to select an optimal subset of salient features, the best model achieved 87.1% sensitivity, 83.6% specificity, and 85.3% accuracy, with a Matthews correlation coefficient (MCC) of 0.71, on a tenfold stratified cross-validation test. The subset of identified optimal features may help in assessing the compositional patterns in human drug targets. For further validation, using a rigorous leave-one-out cross-validation test, the model achieved 88.1% sensitivity, 83.0% specificity, 85.5% accuracy, and an MCC of 0.712. The proposed method was also tested on a second dataset, for which the pipeline gave promising results. We suggest that the present approach can be applied as a complementary tool to existing methods for novel drug target prediction.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The identification of drug targets is one of the foremost requirements in the drug-discovery process. During this process, we come across certain proteins which are "druggable" according to their structure (a target is said to be druggable when it can interact with drug molecules), but whose binding does not lead to any therapeutic effect. Our ability to identify such human drug target proteins, and to ascertain their specific character, nature and efficacy, has so far been limited. Researchers have been working on drug research and development for many years, but only about 324 drug targets have so far been identified as suitable for clinical use [1]. Therefore, it is imperative to determine more potential targets for drug design and discovery [2]. Ideal drug targets should have some desirable properties, such as druggability, and must show active involvement in a significant biological pathway. Knowledge about the structure-function relationship of a target is equally desirable [3].

According to some recent studies, most drug targets fall into the families of enzymes, transporters, GPCRs (G protein-coupled receptors), ion channels, nuclear receptors, etc. GPCRs and enzymes represent the most important target classes of proteins for drug discovery [4]. More than 50% of drug targets belong to only four key protein families, namely GPCRs, nuclear receptors, ligand-gated ion channels and voltage-gated ion channels [2].

Sequence homology and domain search methods based on existing drug target families have been developed to identify new drug targets [5]. Other studies have used 3D structures to reveal binding sites on the protein surface that might bind drug-like compounds [6]. However, these methods have limited scope because only a small number of proteins have a known 3D structure.

Computers in Biology and Medicine 56 (2015) 175–181
Journal homepage: www.elsevier.com/locate/cbm
http://dx.doi.org/10.1016/j.compbiomed.2014.11.008
0010-4825/© 2014 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +91 9336847252. E-mail addresses: [email protected], [email protected] (R. Chaube).
1 These authors contributed equally and hence are regarded as joint first authors.


Previously, Li et al. [7] used SVM (support vector machines) for the exclusive classification of human drug target proteins from human non-drug target proteins, with an overall accuracy of 84%. SVM was also used by Han et al. [8] for the discrimination of drug targets and non-drug targets, with 83.6% overall accuracy, using sequence-based features such as amino acid composition and other physicochemical properties. The work of Han et al. [8] focused on successfully commercialized and research targets in the therapeutic target database [9]. Using physicochemical features, Bakheet et al. [3] analyzed the differences between drug targets and non-drug targets, and subsequently generated druggability rules.

A dataset is said to be imbalanced when there is a large difference between the numbers of examples belonging to different classes. Discriminating human drug target proteins from human non-drug target proteins is a class imbalance problem in which the class of interest (drug targets) is the minority class and non-drug targets form the majority class. This makes it difficult for machine-learning algorithms to learn the minority class: because learning is biased towards the majority class, the accuracy on the positive class (i.e. sensitivity) declines sharply. In the present work, we have addressed the effect of imbalanced datasets on predictive accuracy by using SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset, which is then used to generate a predictive model for discriminating human drug targets from non-drug targets.

2. Materials and methods

2.1. Dataset

We used two datasets. The first is the dataset of Bakheet et al. [3], which consists of 148 human drug target proteins (positive samples) and 3573 non-drug target proteins (negative samples), with less than 20% pairwise sequence identity. The second dataset was created using 186 human drug target proteins from Li et al. [7] as positive samples, together with the negative samples from the first dataset.

2.2. Representation of protein sequences

We calculated three sets of features to represent the drug target and non-drug target protein sequences.

Amino acid composition (AAC): The percentage composition of the 20 amino acid residues was calculated using the following formula:

$$\text{\% amino acid composition of the } n\text{th protein sequence} = \frac{\text{total number of amino acids of type } i}{\text{total number of amino acids in the } n\text{th protein sequence}} \times 100 \qquad (1)$$

This constitutes the first 20 components of the feature vector.

Amino acid property group composition (PGC): Physicochemical properties of amino acids play an important role in determining the function and structure of a protein sequence. Amino acid property groups divide the amino acid residues into overlapping sets of similar physicochemical properties. PGC is included in the feature vector to capture the compositional differences of similar groupings of amino acids.

$$\text{\% amino acid property group composition of the } n\text{th protein sequence} = \frac{\text{total number of amino acids in the } i\text{th property group}}{\text{total number of amino acids in the } n\text{th protein sequence}} \times 100 \qquad (2)$$

This constitutes the next 11 components of the feature vector. The amino acid property groups [10] used in the current study are presented in Table 1.

Dipeptide composition (DPC): There are 400 possible dipeptides in a protein sequence; these are included in the feature vector to capture the local sequence order of amino acids.

$$\text{\% dipeptide composition of the } n\text{th protein sequence} = \frac{\text{total number of dipeptides of the } i\text{th type}}{400} \times 100 \qquad (3)$$

This constitutes the last 400 components of the feature vector. Every drug target protein and non-drug target protein is thus represented by a feature vector of maximum length 431 (20 + 11 + 400).
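The three feature sets described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the experiments were run in WEKA): the function names `aac`, `pgc`, `dpc` and `feature_vector`, and the residue ordering, are assumptions made here, and the property groups are transcribed from Table 1. The dipeptide term divides by 400 exactly as Eq. (3) is written; normalizing by the number of dipeptides in the sequence is a common alternative that the text does not use.

```python
from itertools import product

# Assumed residue order (one-letter codes); the paper does not fix an ordering.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Property groups transcribed from Table 1, as one-letter codes.
PROPERTY_GROUPS = {
    "tiny": "ACGST",
    "small": "ACDGNPSTV",
    "aliphatic": "ILV",
    "nonpolar": "ACFGILMPVWY",
    "aromatic": "FHWY",
    "polar": "DEHKNQRST",
    "charged": "DEHRK",
    "basic": "HKR",
    "acidic": "DE",
    "hydrophobic": "ACFILMVWY",
    "hydrophilic": "DEKNQ",
}

# All 400 ordered residue pairs: AA, AR, AN, ...
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Eq. (1): percentage of each of the 20 residue types."""
    return [100.0 * seq.count(a) / len(seq) for a in AMINO_ACIDS]


def pgc(seq):
    """Eq. (2): percentage of residues falling in each property group."""
    return [
        100.0 * sum(seq.count(a) for a in members) / len(seq)
        for members in PROPERTY_GROUPS.values()
    ]


def dpc(seq):
    """Eq. (3) as written: dipeptide count divided by 400, times 100."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [100.0 * counts[d] / 400 for d in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate AAC (20), PGC (11) and DPC (400) into 431 features."""
    seq = seq.upper()
    return aac(seq) + pgc(seq) + dpc(seq)
```

For a toy sequence such as `feature_vector("ACDA")`, the result has 431 entries, the first of which (alanine) is 50.0.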

2.3. SMOTE

When the dataset is imbalanced in terms of the number of positive and negative class instances, learning algorithms, most of which are optimized for common evaluation metrics such as accuracy, tend to be biased towards the majority class, which is undesirable. SMOTE (Synthetic Minority Over-sampling Technique) [11] is a sampling technique that oversamples the minority class: for a minority class instance, it randomly selects one of its k nearest minority class neighbors and generates a synthetic minority class instance by interpolating between the two. As the present dataset is highly imbalanced, we used SMOTE to balance it. We took all 148 drug target proteins from the first dataset and 500 randomly selected non-drug target proteins, and increased the number of drug target proteins until it equaled the number of non-drug target proteins in the training set. This balanced dataset, with an equal number of positive and negative class instances, was then used for training and model generation. The same procedure was followed for balancing the second dataset.
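The interpolation step can be illustrated with a simplified SMOTE sketch. It is not the WEKA filter used in the study: the function name `smote`, the default `k=5`, the Euclidean distance and the random choice of the base sample are all assumptions made for illustration (the original algorithm iterates over every minority sample in turn).

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (needs >= 2 rows).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample.
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic
```

With the numbers quoted above, balancing 148 drug targets against 500 non-drug targets would correspond to a call such as `smote(drug_target_vectors, 352)`, where `drug_target_vectors` is a hypothetical list of 431-dimensional feature vectors.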

Table 1. The property groups of amino acids used in this study.

| S. no. | Name of amino acid property group | Amino acids in the specific group |
|---|---|---|
| 1 | Tiny amino acids group | Ala, Cys, Gly, Ser, Thr |
| 2 | Small amino acids group | Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val |
| 3 | Aliphatic amino acids group | Ile, Leu, Val |
| 4 | Non-polar amino acid group | Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp, Tyr |
| 5 | Aromatic amino acid group | Phe, His, Trp, Tyr |
| 6 | Polar amino acid group | Asp, Glu, His, Lys, Asn, Gln, Arg, Ser, Thr |
| 7 | Charged amino acid group | Asp, Glu, His, Arg, Lys |
| 8 | Basic amino acid group | His, Lys, Arg |
| 9 | Acidic amino acid group | Asp, Glu |
| 10 | Hydrophobic amino acid group | Ala, Cys, Phe, Ile, Leu, Met, Val, Trp, Tyr |
| 11 | Hydrophilic amino acid group | Asp, Glu, Lys, Asn, Gln |


2.4. Rotation forest classifier

Many empirical studies have shown that an ensemble of classifiers outperforms a single classifier [12]. A classifier ensemble combines the predictions of multiple base classifiers, each trained on a different region of the input space. For an ensemble to learn successfully, there should be the best possible tradeoff between diversity and accuracy among the base classifiers. Rotation Forest (ROF) is an ensemble classifier developed by Rodriguez and Kuncheva [13]. It uses the concepts of bagging, feature randomization, and principal component analysis to increase diversity in the ensemble. Experimentally, it has been shown to perform better than bagging [14], boosting [15] and Random Forest [16]. The base classifiers in the Rotation Forest ensemble are decision trees. For each tree a bootstrapped sample is drawn. The full feature set is then split randomly into k subsets. These subsets are transformed linearly and placed in a rotation matrix, which is used to train the corresponding decision tree. The average rule is used to combine the outcomes of all the base classifiers. Rotation Forest has found applications in various domains of bioinformatics, as in [17,18].

2.5. ReliefF feature-selection technique

When a large number of attributes is used to represent the protein sequences, the predictive model may suffer from the curse of dimensionality. The presence of redundant and uninformative features increases the training time and reduces the generalization ability of the generated model. Given a set of features F and a target class C, the aim of feature selection is to find a minimal subset of F that achieves maximum classification performance. The ReliefF algorithm [19] assesses the quality of the attributes by assigning higher or lower weights to them according to their discriminating ability. The algorithm begins by randomly selecting an instance I from the training sample. It then finds the k nearest neighbors of the same class, called nearest hits H, and the k nearest neighbors from each different class, called nearest misses M. It then assigns weights to the features based on their ability to discriminate between neighboring samples of different classes. Higher weights are assigned if the instances I and M have different values on the given attribute, i.e. if the difference is large

Table 2. Evaluation metrics of different algorithms using amino acid composition as features.

| Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| Naïve Bayes | 84.1 | 51.8 | 68.0 | 0.379 | 0.765 | 0.671 |
| SMO | 79.9 | 52.2 | 66.1 | 0.334 | 0.661 | 0.661 |
| RF | 85.5 | 69.2 | 77.4 | 0.554 | 0.852 | 0.772 |
| ROF | 82.9 | 72.0 | 77.5 | 0.552 | 0.856 | 0.774 |
| KNN | 98.8 | 56.6 | 77.8 | 0.612 | 0.779 | 0.767 |

Table 3. Evaluation metrics of different algorithms using amino acid composition and property group composition as features.

| Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| Naïve Bayes | 86.1 | 49.2 | 67.7 | 0.380 | 0.753 | 0.666 |
| SMO | 82.3 | 54.8 | 68.6 | 0.386 | 0.686 | 0.680 |
| RF | 87.7 | 67.2 | 77.5 | 0.561 | 0.858 | 0.772 |
| ROF | 88.1 | 73.4 | 80.8 | 0.622 | 0.885 | 0.807 |
| KNN | 97.6 | 62.0 | 79.9 | 0.683 | 0.798 | 0.792 |

Table 4. Evaluation metrics of different algorithms using amino acid composition, property group composition and dipeptide composition as features.

| Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| Naïve Bayes | 88.9 | 25.8 | 57.4 | 0.189 | 0.665 | 0.527 |
| SMO | 86.1 | 71.4 | 78.8 | 0.581 | 0.787 | 0.786 |
| RF | 85.9 | 79.4 | 82.7 | 0.654 | 0.909 | 0.826 |
| ROF | 83.9 | 82.2 | 83.1 | 0.661 | 0.918 | 0.830 |
| KNN | 100 | 25.4 | 62.8 | 0.382 | 0.629 | 0.568 |

Table 5. Evaluation metrics of the ROF algorithm using features (AAC + PGC + DPC) selected by the ReliefF feature-selection technique.

| Number of features | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| 10 | 81.1 | 64.2 | 72.7 | 0.46 | 0.784 | 0.725 |
| 20 | 81.1 | 73.4 | 77.7 | 0.55 | 0.842 | 0.776 |
| 30 | 81.7 | 75.4 | 78.6 | 0.57 | 0.871 | 0.785 |
| 40 | 85.3 | 79.0 | 82.2 | 0.64 | 0.892 | 0.821 |
| 50 | 86.1 | 82.0 | 84.0 | 0.68 | 0.917 | 0.840 |
| 60 | 84.9 | 80.0 | 82.5 | 0.65 | 0.909 | 0.824 |
| 70 | 86.5 | 82.6 | 84.5 | 0.69 | 0.908 | 0.845 |
| 80 | 87.7 | 79.4 | 83.5 | 0.67 | 0.915 | 0.835 |
| 90 | 85.1 | 80.2 | 82.7 | 0.65 | 0.916 | 0.826 |
| 100 | 85.3 | 82.0 | 83.6 | 0.67 | 0.916 | 0.836 |
| 150 | 85.1 | 81.0 | 83.1 | 0.66 | 0.919 | 0.830 |
| 200 | 85.7 | 81.2 | 83.4 | 0.67 | 0.912 | 0.834 |
| 250 | 86.9 | 81.2 | 84.0 | 0.68 | 0.932 | 0.840 |
| 300 | 86.7 | 83.6 | 85.1 | 0.70 | 0.926 | 0.851 |
| 350 | 85.5 | 84.2 | 84.8 | 0.69 | 0.932 | 0.848 |
| 400 | 87.1 | 83.6 | 85.3 | 0.71 | 0.930 | 0.853 |
| ALL | 83.9 | 82.2 | 83.1 | 0.661 | 0.918 | 0.830 |

Fig. 1. ROC plots of different classifiers using all the 431 sequence features on the first dataset.


between them, and lower weights are assigned if instances I and H have different values on the given attribute. ReliefF has been used successfully for feature reduction, for example in [20,21]. We incorporated the feature-selection process into the classifier itself. Feature selection was first performed on the training folds (in a tenfold stratified cross-validation), and the selected features were then used to evaluate the testing fold. This avoids overly optimistic evaluation parameters, because the feature-selection algorithm never sees the test data during the feature-selection process.
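The weighting rule described above can be sketched as follows. This is an illustrative ReliefF-style implementation, not the WEKA attribute evaluator used in the study; the function name `relieff`, its defaults, the range-scaled Manhattan distance and the prior-weighted handling of miss classes are assumptions made here.

```python
import random


def relieff(X, y, n_iterations=50, k=3, seed=0):
    """Return one weight per feature; larger means more discriminative."""
    rng = random.Random(seed)
    n_features = len(X[0])

    # Feature ranges, used to scale every difference into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    classes = sorted(set(y))
    priors = {c: y.count(c) / len(y) for c in classes}
    weights = [0.0] * n_features

    for _ in range(n_iterations):
        i = rng.randrange(len(X))
        inst, label = X[i], y[i]
        for c in classes:
            candidates = [X[m] for m in range(len(X)) if m != i and y[m] == c]
            nearest = sorted(candidates, key=lambda r: distance(inst, r))[:k]
            if not nearest:
                continue
            # Hits lower the weight; misses raise it, scaled by class prior.
            if c == label:
                factor = -1.0
            else:
                factor = priors[c] / (1.0 - priors[label])
            for j in range(n_features):
                mean_diff = sum(diff(j, inst, r) for r in nearest) / len(nearest)
                weights[j] += factor * mean_diff / n_iterations
    return weights
```

Ranking the features by these weights and keeping the top N mirrors the procedure behind Table 5. To follow the protocol described above, `relieff` would be called on the training folds only, never on the held-out fold.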

All the experiments were performed in WEKA [22], which is an open-source machine-learning platform.

3. Evaluation metrics

When the amount of training data is limited, cross-validation is typically used. We tested the generated predictive models using a tenfold stratified cross-validation. During this process the dataset is first divided equally into 10 subsets. Each of the subsets is then, in turn, used as the testing set, and the remaining 9 subsets are used as the training set. In stratified cross-validation, each fold or subset contains equal proportions of instances belonging to the different class labels.
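A stratified split of this kind can be sketched by dealing the indices of each class round-robin into the folds. The helper names `stratified_folds` and `train_test_splits` are hypothetical; WEKA performs the equivalent step internally.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For a balanced set of 500 positive and 500 negative samples, each of the ten folds then holds 50 samples of each class.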

For further validation of the performance of the generated models we used leave-one-out cross-validation (LOOCV), which is deemed to be a more rigorous and objective statistical test [23]. It is now widely used and accepted by researchers for estimating the accuracy of various predictors [24–31]. In this method, each sequence is, in turn, singled out as a test instance, and the model is trained using the remaining instances. This process is repeated n times (n is the total number of sequences in the dataset), so that a prediction is obtained for every sequence of the dataset. These results are averaged, and that average represents the final error estimate.
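The leave-one-out loop itself is short. In the sketch below, `leave_one_out_accuracy` and the stand-in classifier `one_nearest_neighbour` are hypothetical names chosen for illustration; the study used Rotation Forest and other WEKA classifiers, not 1-NN.

```python
import math


def leave_one_out_accuracy(X, y, fit_predict):
    """Hold out each sample once, train on the rest, return accuracy in %."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if fit_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(X)


def one_nearest_neighbour(train_X, train_y, query):
    """Stand-in classifier: label of the closest training sample."""
    best = min(range(len(train_X)), key=lambda j: math.dist(train_X[j], query))
    return train_y[best]
```

Any function with the same `(train_X, train_y, query)` signature can be passed as `fit_predict`, so the loop is independent of the classifier being evaluated.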

We used the following evaluation metrics to assess and compare the different machine-learning models. These parameters are calculated from the values of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP).

Sensitivity: This parameter gives the percentage of correctly predicted drug target proteins.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \qquad (4)$$

Specificity: This parameter gives the percentage of correctly predicted non-drug target proteins.

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100 \qquad (5)$$

Accuracy: Percentage of correctly predicted drug target and non-drug target proteins.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100 \qquad (6)$$

Area under ROC (AUC): It is the area under the curve (AUC) of a receiver-operating characteristic (ROC) curve. The area under the ROC is an important evaluation metric for assessing the

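Table 6. Ranking of features (except the dipeptides) in the first dataset using the ReliefF feature-selection technique.

| Rank | Feature (first dataset) |
|---|---|
| 1 | F |
| 2 | V |
| 3 | N |
| 4 | CHARGED |
| 5 | T |
| 6 | A |
| 7 | HYDROPHILIC |
| 8 | ACIDIC |
| 9 | I |
| 10 | Q |
| 11 | S |
| 12 | SMALL |
| 13 | E |
| 14 | AROMATIC |
| 15 | TINY |
| 16 | L |
| 17 | R |
| 18 | NONPOLAR |
| 19 | W |
| 20 | POLAR |
| 21 | ALIPHATIC |
| 22 | K |
| 23 | HYDROPHOBIC |
| 24 | D |
| 25 | G |
| 26 | H |
| 27 | P |
| 28 | M |
| 29 | BASIC |
| 30 | Y |
| 31 | C |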

[Figure 2: 20 × 20 dipeptide grid; rows (first residue) and columns (second residue) are ordered A R N D C Q E G H I L K M F P S T W Y V. Cells annotated in the extracted figure, by row: R: RW; C: CM; Q: QA, AR, QC, QQ, QE; G: GE, GP; H: HN, HQ, HE; L: LQ; M: MC, MW, MV; P: PQ, PG, PP, PY; S: ST; T: TC, TT; W: WA, WD, WC, WH, WK, WP, WS, WW.]

Fig. 2. Selected dipeptides from dataset 1. Dipeptides which were selected are shown in dark shades, and the dipeptides that were discarded are annotated in the respective cells.


performance of the model. The value of the AUC ranges from 0 to 1. In the best case it is 1; in the worst case it is zero; for random ranking it is 0.5.

Matthews Correlation Coefficient (MCC): Its value ranges from −1 to +1, where a value of +1 means perfect prediction, a value of zero means random prediction, and a value of −1 means total disagreement between prediction and observation.

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \qquad (7)$$

F-measure: This is a combined measure of precision and recall, and is calculated as:

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (8)$$

The best value is 1, and in the worst case it is 0.
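Eqs. (4) to (8) translate directly into code. The sketch below uses a hypothetical function name, `evaluation_metrics`, and takes the four confusion-matrix counts as input; precision and recall follow their usual definitions, TP/(TP + FP) and TP/(TP + FN), which the text does not spell out. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def evaluation_metrics(tp, fn, tn, fp):
    """Compute Eqs. (4)-(8) from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)                    # Eq. (4)
    specificity = 100.0 * tn / (tn + fp)                    # Eq. (5)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)      # Eq. (6)

    denominator = math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)
    )
    # Convention assumed here: MCC is 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0  # Eq. (7)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) else 0.0
    )                                                       # Eq. (8)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_measure": f_measure,
    }
```

For example, with made-up counts `evaluation_metrics(tp=90, fn=10, tn=80, fp=20)`, sensitivity is 90%, specificity 80% and accuracy 85%.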

4. Results and discussion

We tested five different machine-learning algorithms with three different sets of feature vectors: (i) amino acid composition; (ii) amino acid composition and amino acid property group composition; and (iii) amino acid composition, amino acid property group composition and dipeptide composition. The ensemble methods and SMO improved distinctly after the inclusion of all three types of sequence-based features. We achieved 100% accuracy in a self-consistency test on the first dataset (as compared to 96% in the previous study [3]), and 99.7% on the second dataset. In the self-consistency test, the class labels of drug targets and non-drug targets are predicted using rules learned from the same set on which the model was trained. It gives an optimistic error estimate, and is not sufficient to evaluate the true discriminating power of the learning algorithm. The performance of the different machine-learning algorithms, using different feature sets on a tenfold stratified cross-validation, is presented in Tables 2–4. It can be seen from these tables that the ensemble-based methods, random forest and rotation forest, performed consistently better than the non-ensemble methods. The overall performance of the rotation forest algorithm was better than that of all other algorithms.

The feature set consisting of AAC, PGC and DPC resulted in maximum discrimination between protein drug targets and non-targets, with sensitivity of 83.9%, specificity of 82.2%, and accuracy of 83.1% (see Fig. 1 for ROC plot).

The length of the full feature vector, comprising AAC, PGC and DPC, is 431. Such long feature vectors may contain redundant and uninformative features. Redundant and irrelevant features do not provide any important information for classification/prediction. Instead, they can increase training time and cause overfitting, which are undesirable for the successful development of prediction/classification models. Using the ReliefF feature-selection technique, we systematically increased the number of selected features and observed the effect on the model evaluation metrics (Table 5). The feature set with 400 features gave the best performance, with 85.3% accuracy, an MCC of 0.71 and an F-measure of 0.853. The tradeoff between the evaluation parameters was best for this subset of features.
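As a quick consistency check on the 400-feature row of Table 5, the reported accuracy and MCC can be recovered from the reported sensitivity and specificity. The check below assumes the evaluation set is balanced 1:1, as the SMOTE step in Section 2.3 suggests; this is an assumption about how the reported values were computed, and the symbols $s$ and $t$ (sensitivity and specificity as fractions) are shorthand introduced here. With $N$ samples per class, $TP = sN$, $FN = (1-s)N$, $TN = tN$ and $FP = (1-t)N$, so Eqs. (6) and (7) reduce to:

$$\text{Accuracy} = \frac{s + t}{2} \times 100 = \frac{0.871 + 0.836}{2} \times 100 = 85.35 \approx 85.3\%$$

$$\text{MCC} = \frac{st - (1-t)(1-s)}{\sqrt{(s + 1 - t)(t + 1 - s)}} = \frac{0.871 + 0.836 - 1}{\sqrt{1.035 \times 0.965}} = \frac{0.707}{0.9994} \approx 0.71$$

Both values agree with Table 5 up to rounding, which is what one would expect if the metrics were computed on class-balanced data.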

We tabulated the subset of optimal features (AAC and PGC only) selected by the ReliefF feature-selection technique, in accordance with their importance in discriminating drug and non-drug target proteins (Table 6). The dipeptides which were selected and discarded from the optimal subset of features are shown in Fig. 2. The white boxes with annotated dipeptides in Fig. 2 represent the discarded dipeptides. The discarded dipeptides are mostly those beginning with Q, P or W.

For the second dataset we used the same pipeline as that applied to the first dataset. Based on overall accuracy, AUC and F-measure, rotation forest outperformed all other machine-learning algorithms, and the inclusion of all three feature sets, viz. AAC, PGC and DPC, improved its prediction accuracy (Table 7) (see Fig. 3 for ROC plot).

The results of leave-one-out cross-validation are tabulated in Table 8 for both the first and second datasets. The rotation forest algorithm achieved an accuracy of 85.5% and 87.2% on the first and second dataset, respectively, which is better than the other tested machine-learning algorithms.

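Table 7. Evaluation parameters on the second dataset using different feature sets.

| Feature set | Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|---|
| AAC | Naïve Bayes | 88.8 | 61.6 | 75.2 | 0.524 | 0.858 | 0.784 |
| AAC | SMO | 75.7 | 67.8 | 71.8 | 0.436 | 0.717 | 0.717 |
| AAC | RF | 88.2 | 78.4 | 83.3 | 0.670 | 0.915 | 0.833 |
| AAC | ROF | 86.7 | 81.6 | 84.1 | 0.683 | 0.927 | 0.841 |
| AAC | KNN | 98.0 | 62.4 | 80.2 | 0.674 | 0.805 | 0.796 |
| AAC + PGC | Naïve Bayes | 87.3 | 65.6 | 76.4 | 0.541 | 0 | 0.762 |
| AAC + PGC | SMO | 73.1 | 68.2 | 70.7 | 0.414 | 0.707 | 0.706 |
| AAC + PGC | RF | 89.2 | 79.8 | 84.5 | 0.694 | 0.918 | 0.845 |
| AAC + PGC | ROF | 87.1 | 82.2 | 84.6 | 0.693 | 0.930 | 0.846 |
| AAC + PGC | KNN | 98.0 | 66.4 | 82.2 | 0.679 | 0.918 | 0.845 |
| AAC + PGC + DPC | Naïve Bayes | 95.6 | 26.8 | 61.3 | 0.309 | 0.782 | 0.561 |
| AAC + PGC + DPC | SMO | 89.0 | 81.6 | 85.3 | 0.708 | 0.853 | 0.853 |
| AAC + PGC + DPC | RF | 87.8 | 79.8 | 83.8 | 0.679 | 0.930 | 0.838 |
| AAC + PGC + DPC | ROF | 90.8 | 83.4 | 87.1 | 0.754 | 0.940 | 0.871 |
| AAC + PGC + DPC | KNN | 99.0 | 55.0 | 77.0 | 0.602 | 0.764 | 0.759 |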

Fig. 3. ROC plots of different classifiers using all the 431 sequence features on the second dataset.


5. Conclusion

Computational prediction and classification of putative human drug targets can accelerate drug-design pipelines. In the present work, we present a systematic assessment of different machine-learning algorithms for human drug target prediction, using different sequence-based subsets of features. Discriminating human drug target proteins from human non-drug target proteins is an example of class imbalance. To address this issue, SMOTE was used to balance the training dataset for the unbiased training of the different learning algorithms. Ensemble-based methods performed better than the other classifiers, most notably the rotation forest algorithm. The inclusion of dipeptide composition, along with amino acid composition and amino acid property group composition, improved the classifier performance. Furthermore, we used the ReliefF feature-selection technique to remove redundant and uninformative features, in order to achieve better evaluation metrics. The best model after feature selection, with 400 features, achieved an overall sensitivity of 87.1%, specificity of 83.6%, and accuracy of 85.3% in a tenfold stratified cross-validation test. Feature selection increased the robustness of the models, and helped in identifying an optimal subset of features, which may help in determining the compositional patterns in human drug target proteins. We used a second dataset to test our approach, on which the rotation forest algorithm achieved promising results. In addition, we used leave-one-out cross-validation to test the performance of the different models on both the first and second datasets. With the incorporation of structural features and relevant physicochemical properties, more accurate prediction tools for the identification of novel human targets can be developed. The present work provides a useful alignment-free method for discriminating human drug target proteins from non-drug target proteins.

Conflict of interest

None declared.

References

[1] J. Drews, Drug discovery: a historical perspective, Science 287 (2000) 1960–1964.

[2] J.P. Overington, B. Al-Lazikani, A.L. Hopkins, How many drug targets are there? Nat. Rev., Drug Discovery 5 (2006) 993–996.

[3] T.M. Bakheet, A.J. Doig, Properties and identification of human protein drug targets, Bioinformatics 25 (2009) 451–457.

[4] C.J. Zheng, L.Y. Han, C.W. Yap, Z.L. Ji, Z.W. Cao, Y.Z. Chen, Therapeutic targets: progress of their exploration and investigation of their characteristics, Pharmacol. Rev. 58 (2006) 259–279.

[5] A.L. Hopkins, C.R. Groom, The druggable genome, Nat. Rev., Drug Discovery 1 (2002) 727–730.

[6] P.J. Hajduk, J.R. Huth, C. Tse, Predicting protein druggability, Drug Discovery Today 10 (2005) 1675–1682.

[7] Q. Li, L. Lai, Prediction of potential drug targets based on simple sequence properties, BMC Bioinf. 8 (2007) 353.

[8] L.Y. Han, C.J. Zheng, B. Xie, J. Jia, X.H. Ma, F. Zhu, H.H. Lin, X. Chen, Y.Z. Chen, Support vector machines approach for predicting druggable proteins: recent progress in its exploration and investigation of its usefulness, Drug Discovery Today 12 (2007) 304–313.

[9] F. Zhu, B. Han, P. Kumar, X. Liu, X. Ma, X. Wei, L. Huang, Y. Guo, L. Han, C. Zheng, Y. Chen, Update of TTD: therapeutic target database, Nucleic Acids Res. 38 (2010) D787–D791.

[10] A. Nath, R. Chaube, K. Subbiah, An insight into the molecular basis for convergent evolution in fish antifreeze proteins, Comput. Biol. Med. 43 (2013) 817–821.

[11] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (2002) 321–357.

[12] R. Polikar, Ensemble based systems in decision making, Circuits Syst. Mag., IEEE 6 (2006) 21–45.

[13] J.J. Rodriguez, L.I. Kuncheva, C.J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 1619–1630.

[14] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.

[15] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.

[16] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.

[17] K.-H. Liu, D.-S. Huang, Cancer classification using rotation forest, Comput. Biol. Med. 38 (2008) 601–610.

[18] A. Dehzangi, S. Phon-Amnuaisuk, M. Manafi, S. Safa, Using rotation forest for protein fold prediction problem: an empirical study, in: C. Pizzuti, M. Ritchie, M. Giacobini (Eds.), Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Springer, Berlin Heidelberg, 2010, pp. 217–227.

[19] K. Kira, L.A. Rendell, A Practical Approach to Feature Selection, in: Proceedings of the ninth international workshop on Machine learning, Morgan Kaufmann Publishers Inc, Aberdeen, Scotland, United Kingdom (1992) 249–256.

[20] K.K. Kandaswamy, K.-C. Chou, T. Martinetz, S. Möller, P.N. Suganthan, S. Sridharan, G. Pugalenthi, AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties, J. Theor. Biol. 270 (2011) 56–62.

[21] K. Kandaswamy, G. Pugalenthi, M. Hazrati, K.-U. Kalies, T. Martinetz, BLProt: prediction of bioluminescent proteins based on support vector machine and relieff feature selection, BMC Bioinf. 12 (2011) 345.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explor. Newsl. 11 (2009) 10–18.

[23] K.-C. Chou, C.-T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. 30 (1995) 275–349.

[24] H.-L. Xie, L. Fu, X.-D. Nie, Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC, Protein Eng. Des. Sel. 26 (2013) 735–742.

[25] G.P. Zhou, N. Assa-Munt, Some insights into protein structural class prediction, Prot.: Struct., Funct. Bioinf. 44 (2001) 57–59.

Table 8
Leave-one-out cross-validation evaluation parameters on the first and second datasets.

Features: AAC + PGC + DPC

LOOCV on the first dataset

| Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| Naive Bayes | 96.8 | 24.8 | 60.9 | 0.312 | 0.797 | 0.550 |
| SMO | 89.2 | 82.8 | 86.0 | 0.722 | 0.860 | 0.860 |
| RF | 90.0 | 79.8 | 84.9 | 0.702 | 0.934 | 0.849 |
| ROF | 90.2 | 84.2 | 87.2 | 0.746 | 0.943 | 0.872 |
| KNN | 99.4 | 54.8 | 77.1 | 0.606 | 0.769 | 0.759 |

LOOCV on the second dataset

| Learning algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | F-measure |
|---|---|---|---|---|---|---|
| Naive Bayes | 96.8 | 24.8 | 60.9 | 0.312 | 0.797 | 0.550 |
| SMO | 89.2 | 82.8 | 86.0 | 0.722 | 0.860 | 0.860 |
| RF | 90.0 | 79.8 | 84.9 | 0.702 | 0.934 | 0.849 |
| ROF | 90.2 | 84.2 | 87.2 | 0.746 | 0.943 | 0.872 |
| KNN | 99.4 | 54.8 | 77.1 | 0.606 | 0.769 | 0.759 |
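The sensitivity, specificity, accuracy, and MCC columns of Table 8 can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions, which are assumed (not confirmed by this section) to match the authors' usage; the function name and the example counts in the usage note are hypothetical and are not taken from the paper.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts.

    tp/fn: drug targets predicted correctly/incorrectly.
    tn/fp: non-drug targets predicted correctly/incorrectly.
    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `confusion_metrics(tp=90, fn=10, tn=80, fp=20)` describes a hypothetical balanced test set of 100 positives and 100 negatives; on such balanced data accuracy is simply the mean of sensitivity and specificity.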

P. Kumari et al. / Computers in Biology and Medicine 56 (2015) 175–181


[26] G.-P. Zhou, K. Doctor, Subcellular location prediction of apoptosis proteins, Prot.: Struct., Funct. Bioinf. 50 (2003) 44–48.

[27] K.-C. Chou, Y.-D. Cai, Predicting protein structural class by functional domain composition, Biochem. Biophys. Res. 321 (2004) 1007–1009.

[28] C. Chen, X. Zhou, Y. Tian, X. Zou, P. Cai, Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network, Anal. Biochem. 357 (2006) 116–121.

[29] Y. Gao, S. Shao, X. Xiao, Y. Ding, Y. Huang, Z. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein subcellular location: approached with Lyapunov index, Bessel function, and Chebyshev filter, Amino Acids 28 (2005) 373–376.

[30] W.-R. Qiu, X. Xiao, K.-C. Chou, iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components, Int. J. Mol. Sci. 15 (2014) 1746–1766.

[31] Y. Xu, J. Ding, L.-Y. Wu, K.-C. Chou, iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS ONE 8 (2013) e55844.
