


Chemometrics and Intelligent Laboratory Systems 128 (2013) 89–100


A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds

Sabina Smusz a,b, Rafał Kurczab a, Andrzej J. Bojarski a,⁎

a Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland
b Group of Crystal Chemistry of Drugs, Faculty of Chemistry, Jagiellonian University, R. Ingardena 3, 30-060 Kraków, Poland

⁎ Corresponding author. Tel.: +48 12 662 33 65; fax: +48 12 637 45 00. E-mail address: [email protected] (A.J. Bojarski).

0169-7439/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.chemolab.2013.08.003

Article info

Article history: Received 17 September 2012; received in revised form 30 July 2013; accepted 8 August 2013; available online 17 August 2013.

Keywords: Machine learning; Virtual screening; Classification; Drug design

Abstract

A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds was carried out. Eleven learning algorithms (including 4 meta-classifiers): J48, RandomForest, NaïveBayes, PART, Hyperpipes, SMO, Ibk, MultiBoostAB, Decorate, FilteredClassifier and Bagging, implemented in the WEKA package, were evaluated in the classification of 5 protein target ligands (cyclooxygenase-2, HIV-1 protease and metalloproteinase inhibitors, M1 and 5-HT1A agonists), using 8 different fingerprints for molecular representation (EStateFP, FP, ExtFP, GraphFP, KlekFP, MACCSFP, PubChemFP, and SubFP). The influence of the number of actives in the training data, as well as the computational expense expressed by the time required for building a predictive model, was also taken into account. Tests were performed for sets containing a similar number of actives and inactives, and also for datasets recreating virtual screening conditions. In order to facilitate the interpretation of results, the values of the evaluating parameters (recall, precision, and MCC) were presented in the form of heat maps. The classification of cyclooxygenase-2 inhibitors was almost perfect regardless of the conditions, yet the results for the rest of the targets varied between different experiments. The performance of machine learning methods was improved by increasing the number of actives in the training data; however, moving to virtual screening conditions was generally connected with a significant fall in precision. Some methods, e.g. SMO, Bagging, Decorate and MultiBoostAB, were more stable with regard to changes in classification conditions, whereas for others, such as NaïveBayes, J48 or Hyperpipes, the performance varied strongly between different datasets, fingerprints and targets. The application of meta-learning led to an increase in the values of the evaluating parameters. KlekFP was the fingerprint which yielded the best results, although its use was connected with great computational expense. On the other hand, EStateFP and SubFP gave worse results, especially in virtual screening-like conditions.


1. Introduction

Data mining techniques (including machine learning methods) are widely used in the process of drug design. They are extremely useful for virtual screening tasks [1,2], where potentially active compounds are selected from large libraries of chemical structures. Before classification, a training process is performed, in which a classifier learns how to distinguish actives from inactives. This is achieved by supplying it with a set of molecules that are known to be active towards a particular target, and a set of those that are identified as (or are assumed to be) inactive. Then, in most cases, a predictive model is built, which is next used to assign appropriate class labels to unclassified instances [3,4].
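A minimal sketch of this train-then-label workflow, using the WEKA Java API that the experiments in this work rely on (the ARFF file names are hypothetical placeholders, not part of the original protocol):

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAndLabel {
    public static void main(String[] args) throws Exception {
        // Training set of known actives/inactives; the last attribute is the class label.
        Instances train = DataSource.read("train.arff");        // hypothetical file name
        train.setClassIndex(train.numAttributes() - 1);

        // Build a predictive model from the labelled molecules.
        Classifier model = new J48();
        model.buildClassifier(train);

        // Assign class labels to unclassified instances.
        Instances unknown = DataSource.read("unlabelled.arff"); // hypothetical file name
        unknown.setClassIndex(unknown.numAttributes() - 1);
        for (int i = 0; i < unknown.numInstances(); i++) {
            double label = model.classifyInstance(unknown.instance(i));
            System.out.println(unknown.classAttribute().value((int) label));
        }
    }
}
```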

Numerous examples of using machine learning methods in the process of virtual screening have been reported [1–15]. Several aspects of classification conditions and their influence on the machine learning methods performance have been considered: different methods tested for one target [16,17], or one or two methods tested for the classification of compounds showing activity towards several different targets [18–21]. For molecular representation, different molecular descriptors and fingerprints have been used. For example, Plewczynski et al. tested the performance of 7 machine learning methods for 5 different targets using regular atom pair descriptors [22,23], Cannon et al. used MOLPRINT 2D circular fingerprints to test the performance of the NaïveBayes classifier, Inductive Logic Programming and Support Vector Inductive Logic Programming for 11 targets [24], and Argaval et al. used the FP2 and MOLPRINT 2D fingerprints to show the use of ranking methods in virtual screening tasks [21].

However, to the best of our knowledge, no paper shows the simultaneous influence of many factors on machine learning methods performance. We aimed to carry out extended tests and to analyze the classification effectiveness depending on various parameters at the same time. In initial tests, 11 machine learning methods were tested for the classification of ligands active towards 5 different protein targets, for training sets with 2 distinct numbers of actives, using 8 fingerprints for the molecular representation, taking into account the time required for building a predictive model. Then, virtual screening conditions were recreated by adding a great number of inactive structures to the dataset. Because of the high computational expense, the number of experiments was reduced and the tests were performed for selected methods only. A detailed analysis of all the obtained results led to some general conclusions, which are presented in the form of useful tips that may also be helpful for those who start to use artificial intelligence tools in their work. They are intended to help choose the most effective machine learning method in particular conditions (the number of known actives, the type of fingerprint and the amount of time that can be assigned to the job). All the software presented in this work is freely available for academic use.


Table 1
Composition of datasets for the evaluation of different machine learning methods (number of actives/number of inactives).

Protein target     Ligands     MDDR activity index   Train set 1   Train set 2   Test set
COX-2              Inhibitors  78454                 125/316       242/316       884/950
M1                 Agonists    09249                 107/315       281/315       874/950
HIV PR             Inhibitors  71523                 105/350       203/350       932/1100
Metalloproteinase  Inhibitors  78432                 69/280        144/280       644/800
5-HT1A             Agonists    06235                 100/340       198/340       903/1050

Table 2
The composition of sets used for testing different machine learning methods in virtual screening conditions (number of actives/number of inactives).

Protein target     Train set   Test set
COX-2              242/2300    884/99,000
M1                 281/2300    874/99,000
HIV PR             203/2300    932/99,000
Metalloproteinase  144/2300    644/99,000
5-HT1A             198/2300    903/99,000



2. Materials and methods

2.1. Experiment preparation

Five different protein targets (namely: cyclooxygenase-2, muscarinic receptor M1, HIV-1 protease, metalloproteinase and 5-HT1A receptor), well-recognized in machine learning experiments [6,23,25,26], were chosen to enable comparisons between our results and those obtained previously. For the above targets, all compounds of a certain activity (cyclooxygenase-2, HIV-1 protease and metalloproteinase inhibitors, and M1 and 5-HT1A agonists) present in the MDDR (the MDL Drug Data Report) database [27] were extracted. Then, ligands of a given target were clustered hierarchically in Canvas version 1.4 [28], with the MOLPRINT 2D fingerprint for molecular representation and the Tanimoto metric as a similarity measure. Two training sets (train set 1 and train set 2), containing a different number of actives, were constructed (the number of actives in train set 2 > the number of actives in train set 1). Several compounds were taken from each cluster to form part of train set 2; then, that set was reduced by removing all the representatives of several clusters to form train set 1. The molecules that were not included in the training data were comprised in the test set. The compounds assumed to be inactive were randomly selected from the ZINC database [29], and their number was chosen in a way ensuring similar proportions between active and inactive molecules for different targets (Table 1).

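As an illustration of the decoy-selection step, the sketch below randomly draws putative inactives from a pool of ZINC structures in the proportions used for metalloproteinase in Table 1; the file name and the fixed seed are our own assumptions, not part of the original protocol:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SampleDecoys {
    public static void main(String[] args) throws Exception {
        // zinc_random.smi: one SMILES per line, assumed inactive (hypothetical file name).
        List<String> pool = new ArrayList<>(Files.readAllLines(Paths.get("zinc_random.smi")));
        Collections.shuffle(pool, new Random(42)); // fixed seed for reproducibility (assumed)

        int nTrain = 280, nTest = 800; // metalloproteinase proportions from Table 1
        List<String> trainInactives = pool.subList(0, nTrain);
        List<String> testInactives  = pool.subList(nTrain, nTrain + nTest);
        System.out.println(trainInactives.size() + " / " + testInactives.size());
    }
}
```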

In order to check the effectiveness of the selected machine learning methods in conditions similar to those present during the virtual screening of large databases, a great number of inactives (randomly selected from ZINC) were added to the data collection, and new sets of a different composition were constructed (Table 2).

For each structure from the datasets, 8 different fingerprints (Table 3) were calculated using the PaDEL-Descriptor [30]; the output file was then transformed into a WEKA (version 3.6 [31]) readable one. WEKA (Waikato Environment for Knowledge Analysis) is a collection of machine learning algorithms which can be applied to data mining tasks. It has already been used in the field of drug discovery, e.g. by Hammann et al. [16] and Bruce et al. [32]. The machine learning algorithms available are assembled into several subsets; 11 methods were chosen for evaluation, maintaining algorithm diversity (Table 4). In most cases, the default parameters of the machine learning methods were used. All the calculations were performed on an Intel Core 2 Duo 3.00 GHz computer system with 4 GB RAM running a 64-bit Linux operating system.
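A minimal sketch of one such run with the WEKA Java API: load the fingerprint data in ARFF format, build a classifier while timing the model construction, and evaluate it on the test set (the file names, the choice of SMO, and the assumption that the "active" class has index 0 are illustrative):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateFingerprintModel {
    public static void main(String[] args) throws Exception {
        // Fingerprint bit strings exported from PaDEL-Descriptor and converted to ARFF
        // (hypothetical file names; the last attribute is the activity class).
        Instances train = DataSource.read("cox2_klekfp_train.arff");
        Instances test  = DataSource.read("cox2_klekfp_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        SMO smo = new SMO(); // default parameters, as in most runs

        // Time required for building the predictive model.
        long start = System.currentTimeMillis();
        smo.buildClassifier(train);
        double seconds = (System.currentTimeMillis() - start) / 1000.0;

        // Per-class recall and precision for the "active" class (assumed index 0).
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);
        System.out.printf("build time: %.2f s%n", seconds);
        System.out.printf("recall:     %.3f%n", eval.recall(0));
        System.out.printf("precision:  %.3f%n", eval.precision(0));
    }
}
```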

Based on the results of the initial tests and on the computational expense connected with the use of particular machine learning methods, the number of classifiers was limited to 9 for the virtual screening experiments (J48, RandomForest, NaïveBayes, Hyperpipes, SMO, MultiBoostAB, Decorate, FilteredClassifier and Bagging), and all meta-learners were tested only in combination with the ensemble classifier RandomForest.
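In WEKA, wrapping a base learner in a meta-classifier is a one-line configuration; below is a sketch of the combination used in the virtual screening runs (Bagging, Decorate and MultiBoostAB around RandomForest), again with a hypothetical ARFF file name:

```java
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Decorate;
import weka.classifiers.meta.MultiBoostAB;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaOverRandomForest {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // hypothetical file name
        train.setClassIndex(train.numAttributes() - 1);

        // Bagging over RandomForest, 10 bootstrap iterations.
        Bagging bagging = new Bagging();
        bagging.setClassifier(new RandomForest());
        bagging.setNumIterations(10);
        bagging.buildClassifier(train);

        // Decorate and MultiBoostAB wrap the base classifier the same way.
        Decorate decorate = new Decorate();
        decorate.setClassifier(new RandomForest());
        decorate.buildClassifier(train);

        MultiBoostAB boost = new MultiBoostAB();
        boost.setClassifier(new RandomForest());
        boost.buildClassifier(train);
    }
}
```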

Table 3
The characteristics of the fingerprints used for the evaluation of machine learning methods and the abbreviations used throughout this work (abbreviation and length in bits given in parentheses).

EState fingerprint (EStateFP, 79): Uses the electrotopological state (EState) formalism, introduced by Kier and Hall. The EState index is computed for each atom in the molecule and represents its electronic state, considering the influence of other atoms in a given structure [33].

Fingerprint (FP, 1024): Runs a graph search algorithm (breadth-first search). Each atom in the structure is a starting point of a string up to six atoms long. Then, for each string, a hash code is produced, and on its basis a bit string representing the whole molecule is formed [34].

Extended fingerprint (ExtFP, 1024): An extension of FP, containing additional bits which give information about ring features [30].

Graph only fingerprint (GraphFP, 1024): A modification of FP, in which bond order is not considered during fingerprint generation [30].

Klekota Roth fingerprint (KlekFP, 4860): Uses the fingerprints introduced by Klekota and Roth. It is based on the occurrence of particular chemical substructures associated with biological activity [35].

MACCS fingerprint (MACCSFP, 166): A fingerprint based on the MACCS keys. Each bit represents the presence (or absence) of particular atoms, bonds, groups, properties, etc. in the chemical structure [36].

PubChem fingerprint (PubChemFP, 881): A substructure fingerprint based on structural keys; each bit represents the presence (or absence) of particular features in the molecule [30,37].

Substructure fingerprint (SubFP, 308): A fingerprint based on the presence of SMARTS patterns, developed by Christian Laggner [30,38].


Table 4
A brief description of the machine learning methods chosen for evaluation, with the optional name abbreviations used throughout this work (classification scheme given in parentheses).

J48 (trees): An implementation of the C4.5 algorithm. It builds a decision tree on the basis of a training set. The criteria for choosing the attributes placed in the root and particular nodes are based on information theory: the root holds the attribute which most clearly separates the training instances (contains the highest amount of information). The tree is built recursively, choosing the attribute in each node by the same criteria. When the tree is created, the algorithm goes back and removes branches which do not help in the classification process; this is called pruning [39,40]. In our tests, the minimum number of instances per leaf was set at 2, and the value of the confidence factor during pruning was 0.25. During that process, a subtree raising operation was considered.

RandomForest (RF) (trees): A combination of decision trees, where each parent node is split into no more than two children (binary partitioning). Each tree is grown on a subset of the training data (a bootstrap sample) and produces a response. The final outcome is the result of "voting" for a particular class: the instance is assigned to the class with the highest number of votes (each vote being of equal importance). The trees are grown to their maximum possible extent, since there is no pruning [41,22]. During the classification process, the number of generated trees was set at 10. The trees had unlimited depth and the seed number equaled one.

NaïveBayes (NB) (Bayes): An algorithm based on Bayes' theorem and on the assumption that all attribute values are conditionally independent (the classifier is called "naive" because the latter assumption is seldom true). Hence, each attribute is considered separately, and an individual probability of belonging to a particular class is calculated for it. The class with the highest probability is chosen as the final answer [3].

PART (rules): A classifier which generates a decision list using a separate-and-conquer approach. In each iteration it builds a partial decision tree according to the C4.5 algorithm. As a rule, a path is chosen from the root to the node that leads to the highest number of instances being covered by this rule. Such instances are removed from the training data, and the process of growing a tree is repeated until no examples are left [42]. The confidence factor for pruning was set at 0.25, and 2 was the minimum number of instances per rule.

Hyperpipes (misc): A very simple algorithm, capable of dealing with a great number of attributes. A pipe marked with a particular class label is constructed for each class in the training set. Examples drawn from the training data are analyzed instance by instance, and each pipe monitors attribute values: if a given one has not yet occurred, it is added to the pipe. During testing, each case is compared with the pattern of attribute values present in the pipes and is assigned to the one that matches it best [43].

SMO (functions): An algorithm which solves the SVM quadratic programming (QP) optimization problem by decomposing it into a series of sub-problems that can be solved analytically. QP involves minimizing (maximizing) a quadratic objective function subject to a set of linear constraints [44]. The complexity parameter was set at 1, the epsilon for the round-off error was 1.0E-12, and the option of normalizing the training data was chosen. The normalized polynomial kernel was used with an exponent value equal to 2.

IBk (lazy): A k-nearest neighbor classifier which marks an unclassified instance with the label of the majority of its k nearest neighbors. The distance between instances is measured using the Euclidean metric. For k = 1, the instance is assigned to the class of its closest neighbor in the training set [6]. For the nearest neighbor search, a brute force (exhaustive) search algorithm was used with a Euclidean distance function. One neighbor was used for the classification process.

MultiBoostAB (meta): A modified version of AdaBoost, combined here with wagging. The former builds up a single classifier as a linear combination of "weak" classifiers (trained on subsets of the original training data). Wagging supplies training cases with different weights according to the continuous Poisson distribution [45]. The classifier performed 10 iterations, the approximate number of subcommittees was set at 3, and the weight threshold for weight pruning was equal to 100.

Decorate (meta): An algorithm which forms a diverse ensemble of classifiers. The first member of the ensemble is a classifier trained on the original training data. Then, classifiers are trained on the original training set with additional artificial data generated by a special model. These examples are given the class labels that differentiate them best from the current ensemble prediction, in order to maximize the diversity of the classifiers in the ensemble [46,47]. During training, a classifier used one artificial example; the number of member classifiers in the Decorate ensemble was set at 10, and the same value was set for the maximum number of iterations.

FilteredClassifier (meta): A classifier which enables the combination of an arbitrary filter with an arbitrary classifier [48]. An attribute selection filter was used together with the best first search method to find important features.

Bagging (meta): An algorithm which creates models on bootstrap training sets (drawn randomly with replacement from the original training data, maintaining its size). The final outcome is the result of voting (or averaging) over the predictions of each member of the ensemble [39]. The number of iterations to be performed by Bagging was set at 10.
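As a sketch of how the settings listed in Table 4 map onto the WEKA 3.6 Java API (shown for three classifiers plus FilteredClassifier; the CfsSubsetEval evaluator is WEKA's default for the attribute selection filter and is our assumption, since only the best first search is stated above):

```java
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.NormalizedPolyKernel;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.SelectedTag;
import weka.filters.supervised.attribute.AttributeSelection;

public class ClassifierSettings {
    public static void main(String[] args) {
        // J48: minimum 2 instances per leaf, pruning confidence 0.25, subtree raising on.
        J48 j48 = new J48();
        j48.setMinNumObj(2);
        j48.setConfidenceFactor(0.25f);
        j48.setSubtreeRaising(true);

        // RandomForest: 10 trees, unlimited depth (0), seed = 1.
        RandomForest rf = new RandomForest();
        rf.setNumTrees(10);
        rf.setMaxDepth(0);
        rf.setSeed(1);

        // SMO: complexity 1, round-off epsilon 1.0E-12, training data normalization,
        // normalized polynomial kernel with exponent 2.
        SMO smo = new SMO();
        smo.setC(1.0);
        smo.setEpsilon(1.0E-12);
        smo.setFilterType(new SelectedTag(SMO.FILTER_NORMALIZE, SMO.TAGS_FILTER));
        NormalizedPolyKernel kernel = new NormalizedPolyKernel();
        kernel.setExponent(2.0);
        smo.setKernel(kernel);

        // FilteredClassifier: attribute selection filter with best first search.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval()); // assumed default evaluator
        filter.setSearch(new BestFirst());
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(filter);
        fc.setClassifier(new RandomForest());
    }
}
```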


2.2. Machine learning methods evaluation

The performance of different machine learning methods in binary classification was measured using three evaluating parameters: recall, R (1); precision, P (2); and the Matthews correlation coefficient, MCC (3). Recall and precision range from 0 to 1, whereas MCC can return values from −1 to 1. The time required for building a predictive model was also taken into account.

R = \frac{TP}{TP + FN} \quad (1)

P = \frac{TP}{TP + FP} \quad (2)

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (3)

where:
TP is the number of true positives (correctly classified actives),
FP is the number of false positives (inactives wrongly classified as actives),
TN is the number of true negatives (correctly classified inactives),
FN is the number of false negatives (actives wrongly classified as inactives).


Fig. 1. Heat maps of recall, precision and MCC values obtained in initial experiments.



Fig. 2. Heat maps of differences in results obtained for meta and base classifiers.


Recall gives information about the fraction of positives selected from the test set, whereas precision defines the correctness of positive instance predictions [22]. Low values of the latter parameter indicate a high rate of false positives. MCC gives a balanced measure of machine learning methods performance, combining all the information included in the confusion matrix into a single number [49]. Its maximum possible value (equal to 1) refers to a perfect prediction, −1 corresponds to a reversed classification, and 0 represents a prediction that is no better than random.
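The three parameters follow directly from the confusion matrix counts; a self-contained sketch (the counts in main are made-up illustration values):

```java
public final class ClassificationMetrics {

    // Recall (Eq. 1): fraction of actives retrieved from the test set.
    static double recall(double tp, double fn) {
        return tp / (tp + fn);
    }

    // Precision (Eq. 2): correctness of positive predictions.
    static double precision(double tp, double fp) {
        return tp / (tp + fp);
    }

    // MCC (Eq. 3): 1 = perfect, 0 = no better than random, -1 = reversed.
    static double mcc(double tp, double tn, double fp, double fn) {
        double denom = Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denom == 0.0 ? 0.0 : (tp * tn - fp * fn) / denom;
    }

    public static void main(String[] args) {
        // Hypothetical confusion matrix for one screening run.
        double tp = 850, fp = 120, tn = 880, fn = 100;
        System.out.printf("R = %.3f, P = %.3f, MCC = %.3f%n",
                recall(tp, fn), precision(tp, fp), mcc(tp, tn, fp, fn));
    }
}
```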


Fig. 3. Heat maps presenting differences in results obtained for train set 2 and train set 1. *ΔMCC = −1.00.


3. Results

3.1. Initial tests

All the results obtained in the initial tests are presented in Fig. 1 as heat maps (generated with the 'matrix2png' software [50]), qualitatively representing the levels of the parameters studied (recall, precision and MCC; columns of maps) in relation to the studied targets (rows of maps), machine learning (ML) methods (rows on the maps), fingerprints (columns on the maps) and the size of the training sets (neighboring columns on the maps). In addition, to facilitate the analysis of the influence of meta-learning algorithms and of the impact of the training set composition on ML performance, two sets of differential maps are shown in Figs. 2 and 3, respectively. All the numerical values can be found in the Supplementary data (Tables A1 and A2). Since the time needed for building a predictive model was taken into account, the results of its measurement are given in Table 5.

The machine learning methods performance strongly depended on the classification conditions (Fig. 1). It varied not only for different classifiers and targets, but also for different fingerprints and set compositions.

3.1.1. Targets

The first thing that stood out from the analysis of the results is an almost perfect classification of cyclooxygenase-2 inhibitors (Fig. 1, row 1). All the evaluating parameter values were close to 1 (with only 10 exceptions among all 912 experimental results), regardless of the training set composition and fingerprint. High-level results were also obtained for HIV protease inhibitors (>0.9 for R and P), but a significant decrease (especially in R, <0.65 for sets containing fewer actives) was observed when EStateFP and SubFP were used for representing the molecules (Fig. 1, row 3 (Fig. 1.3)). On the other hand, SubFP was the fingerprint for which the biggest changes in recall values were observed after adding actives to the training data (>0.15 on average). In the classification of M1 agonists (Fig. 1, row 4 (Fig. 1.4)), a decrease connected with the use of SubFP was not observed; however, a slight fall occurred for GraphFP (~0.7 on average) and PubChemFP (<0.8). For this target, the most considerable improvement connected with the use of meta-learning algorithms was found (~0.02 on average for MCC; Fig. 2, row 4, column C). The results obtained for the classification of 5-HT1A agonists and metalloproteinase inhibitors are characterized by the lowest average R values (below 0.8; Fig. 1, row 2, column A; Fig. 1, row 5, column A). The R values for those targets were also the most susceptible to the enlargement of the training set, especially for EStateFP, and also for SubFP in the case of metalloproteinase. Furthermore, this also led to the most noticeable changes in MCC for these proteins' ligands (>0.05 on average; Fig. 1, row 2, column C; Fig. 1, row 4, column C).

3.1.2. Evaluating parameters

3.1.2.1. Recall. R reached different values depending on the classification conditions (Fig. 1.A). For COX-2 inhibitors, it approached 1 regardless of the fingerprint and the classifier. For the rest of the targets, the results were generally the worst for EStateFP (~0.5–0.7) and usually the best for KlekFP (R ranged on average from ~0.8 to over 0.95 for different targets). However, ExtFP, FP and MACCSFP also provided a high fraction of actives selected from the test set, with the advantage of a shorter time required for building a predictive model. The R values also approached 1 when Hyperpipes was used as a classifier, for all the fingerprints except GraphFP, for which it fell to ~0.7. The more actives in the training data, the higher the R; but when meta-learning was applied, the changes varied from case to case.

3.1.2.2. Precision. The P values did not depend to such an extent on the classification conditions and usually exceeded 0.9 (Fig. 1.B). EStateFP and SubFP were fingerprints that, for some targets, caused the P values to drop to ~0.7–0.8. In contrast to the rest of the classifiers, Hyperpipes did not provide satisfying P values, which in some cases even fell below 0.5. In the majority of cases, P was decreased by adding actives to the training data; on the other hand, the use of meta-classifiers (MultiBoostAB, Decorate and Bagging) led to its improvement.


Table 5
Time required for building the predictive model for metalloproteinase inhibitors [s], given as Set1/Set2 for each fingerprint.

ML method                EStateFP    ExtFP         FP            GraphFP       KlekFP          MACCSFP     PubChemFP     SubFP
J48                      0.19/0.13   1.69/2.86     2.20/2.51     2.74/3.14     8.06/8.85       0.28/0.27   1.31/2.40     0.46/0.81
RandomForest             0.26/0.26   0.37/0.48     0.44/0.48     0.44/0.58     1.79/1.82       0.15/0.23   0.35/0.53     0.75/0.95
NaïveBayes               0.01/0.02   0.45/0.55     0.66/0.54     0.41/0.47     1.13/1.29       0.22/0.11   0.28/0.36     0.35/0.41
PART                     0.34/0.36   3.33/4.31     3.32/5.23     4.62/8.72     14.07/15.80     0.39/0.63   3.57/5.31     1.81/1.27
Hyperpipes               0.01/0.02   0.05/0.05     0.05/0.05     0.03/0.04     0.07/0.09       0.02/0.02   0.27/0.03     0.00/0.02
SMO                      0.16/0.16   0.47/0.50     0.56/0.45     0.20/0.30     0.19/0.19       0.24/0.75   0.12/0.21     0.06/0.32
MultiBoostAB(J48)        0.84/1.03   20.78/31.78   24.80/33.01   24.81/39.83   96.50/92.69     2.34/4.12   16.92/26.73   4.57/5.50
MultiBoostAB(RF)         1.98/2.81   0.35/0.49     0.40/0.48     4.21/0.60     2.07/1.89       0.16/0.22   0.38/0.53     5.78/7.70
MultiBoostAB(NB)         0.85/0.74   8.56/10.42    9.57/10.80    8.88/10.66    39.40/41.58     1.26/1.77   7.21/8.82     2.02/2.31
Decorate(J48)            3.71/1.64   46.50/60.51   47.06/65.75   47.10/69.09   130.34/171.84   5.20/7.33   34.25/40.37   7.25/10.41
Decorate(RF)             3.89/4.26   7.30/9.51     7.05/9.80     7.70/11.00    24.44/32.71     3.39/4.97   7.65/9.31     6.81/9.51
Decorate(NB)             1.06/0.84   12.69/17.53   15.39/22.37   12.62/19.89   77.82/78.61     1.64/3.57   13.99/12.51   5.52/7.23
FilteredClassifier(J48)  0.04/0.03   8.31/8.85     11.27/10.20   0.78/11.65    16.26/20.11     0.19/0.18   2.27/2.27     0.21/0.23
FilteredClassifier(RF)   0.06/0.06   8.35/8.66     11.50/10.08   1.31/12.20    15.89/20.33     0.21/0.19   2.42/2.49     0.28/0.27
FilteredClassifier(NB)   0.03/0.03   8.13/8.50     11.23/10.94   0.94/11.23    15.41/19.94     0.13/0.14   2.20/2.51     0.20/0.23
Bagging(J48)             0.75/1.51   16.88/28.24   17.94/27.21   21.64/30.61   61.76/86.93     1.80/2.59   14.57/18.55   4.46/5.73
Bagging(RF)              1.56/2.46   2.99/4.73     3.40/4.77     3.98/5.59     13.12/17.77     1.86/2.03   3.95/4.81     4.53/6.09
Bagging(NB)              0.11/0.13   4.37/5.63     4.37/5.60     3.88/5.05     10.20/15.26     0.87/0.50   3.19/3.49     0.44/0.57



3.1.2.3. MCC. The MCC values on average exceeded 0.7, being lower for EStateFP and SubFP for 5-HT1A agonists, HIV protease inhibitors and metalloproteinase inhibitors, and for GraphFP in the classification of M1 agonists (where they fell even to 0.4). They were generally raised by increasing the number of active molecules in the training data (Fig. 3.C) and by the use of meta-learning algorithms (Fig. 2.C). The range of improvement varied for different targets: as regards the change from training set 1 to set 2, the highest uplift was observed for the classification of metalloproteinase inhibitors and 5-HT1A agonists (>0.05), while in the case of meta-classifiers, the average enhancement was highest for M1 agonists (very often close to 0.2).

3.1.3. Fingerprints

In general, KlekFP was the fingerprint that yielded the best results (Fig. 1), although its use required the greatest amount of time for building a predictive model and the greatest time extension connected with the application of meta-classifiers (Table 5). Results comparable to those for KlekFP were also obtained for ExtFP, FP, MACCSFP and PubChemFP, with slight variations depending on the target and training set composition. Among the hashed fingerprints, it was GraphFP that led to the worst results (on average a ~0.1 decrease in MCC compared to ExtFP and FP). EStateFP and SubFP did not yield high values of the evaluating parameters either, especially in the classification of HIV protease and metalloproteinase inhibitors (a fall in MCC below 0.5 in some cases). For the combination of these targets with the above-mentioned fingerprints, R strongly depended on the training set composition, with differences between train set 1 and train set 2 exceeding 0.2 on average for metalloproteinase and ~0.1 for HIV inhibitors (Fig. 3.A). The P values did not depend on the fingerprint to such an extent, although in the case of EStateFP and SubFP for metalloproteinase inhibitors, and GraphFP for 5-HT1A and M1 agonists, there was a decrease of over 0.1 in the average values of that parameter compared to the best-performing KlekFP.

3.1.4. Training set composition

As regards the influence of the increased number of actives in the training set (Fig. 3), changes in the evaluating parameters also strongly depended on other classification conditions. Metalloproteinase was the target for which the R uplift was the highest of all tested targets (~0.15); however, when EStateFP or SubFP was used for molecular representation, it increased on average by more than 0.2 (0.65 for SubFP using SMO). Relatively big changes, over 0.1, were also observed in the classification of 5-HT1A agonists. When moving from set 1 to set 2, the smallest changes were observed for the classification of COX-2 inhibitors. This was due to the fact that the R, P and MCC values already approached 1 for the sets with a lower number of actives. On the other hand, a higher number of actives caused a fall in the P values. Those changes were very similar for all the targets tested and usually did not exceed 0.05. Due to a fairly significant improvement in R and only a slight decrease in P, an enhancement of MCC was also observed, being the biggest for metalloproteinase and 5-HT1A (~0.07 and 0.05 on average, respectively).

3.1.5. Machine learning methods

3.1.5.1. J48, RandomForest. J48 yielded P values exceeding 0.8 (Fig. 1.B), which were only slightly affected by changes in the fingerprints used for molecular representation and in the training set composition (the change was below 0.05 when moving from training set 1 to set 2; Fig. 3.B). On the other hand, R varied greatly for different targets and fingerprints (from ~1 for COX-2 inhibitors, through ~0.8 for HIV protease inhibitors, to ~0.75 for M1 and 5-HT1A agonists; in the case of metalloproteinase inhibitors, it reached values of ~0.75–0.8 for all the fingerprints except EStateFP and SubFP, for which it was below 0.4; Fig. 1.A). In general, it needed more time for building a predictive model than other classifiers (e.g. RandomForest, NaïveBayes, Hyperpipes and SMO; Table 5), especially in the case of longer fingerprints, such as KlekFP. Using it with meta-learners caused the greatest time extension (compared to the base classifiers RandomForest and NaïveBayes). J48 was very susceptible to boosting: its performance was always improved (even by up to 0.15 in the case of MCC) when MultiBoostAB, Decorate and Bagging were used. However, FilteredClassifier with the "best first" search method did not provide better results compared to J48 alone (Fig. 1.C and Fig. 2).

As to the other classifier using the concept of decision trees, RandomForest on average gave better results than J48: the R values were up to 0.05 higher (Fig. 1.A) and the P values higher by ~0.02–0.03 (Fig. 1.B), which in consequence led to an increase of 0.08 on average in MCC (Fig. 1.C). Combining RandomForest (itself a member of the group of ensemble classifiers) with meta-learning methods did not cause a considerable uplift, probably due to the already favorable results for RandomForest alone (Fig. 2). The possible enhancement strongly depended on the target and the fingerprint, and was on average below 0.05 in the case of MCC (Fig. 2.C). The R values for both J48 and RandomForest went up (by ~0.1) after adding actives to the training data (Fig. 3.A); although the precision slightly fell (on average below 0.05; Fig. 3.B), the overall improvement, expressed by MCC changes, was on the level of 0.3–0.4 (being the highest for metalloproteinase inhibitors and 5-HT1A agonists, ~0.6–0.7, and the lowest for COX-2 inhibitors, below 0.01; Fig. 3.C).




3.1.5.2. NaïveBayes. In contrast to the previously described classifiers, the dependence on the training data composition was different for NaïveBayes. There was generally no or only a very small enhancement of R, close to 0.02 (only for EStateFP did it reach 0.2 for 5-HT1A; Fig. 3.A), and the changes in P were not so straightforward: its values varied for different fingerprints and targets (Fig. 3.B). Also, the use of meta-learners led to various shifts in the evaluating parameters, which strongly depended on the classification conditions. NaïveBayes alone provided good results, with the R values on a level similar to that obtained for J48 (Fig. 1.A) and the P values slightly lower compared to J48 (by ~0.03; Fig. 1.B), especially for 5-HT1A and metalloproteinase ligands. NaïveBayes also worked fast and required a very small amount of time to build a predictive model, similar to RandomForest (Table 5).

Fig. 4. Heat maps of recall, precision and MCC values obtained in virtual screening-like experiments.


3.1.5.3. PART. The PART classifier, which uses an algorithm similar to J48, gave results comparable to those obtained with the use of the decision tree. The P values obtained for those classifiers did not differ much, but a rise in R and MCC (up to ~0.05) was observed in favor of the rule-based method. However, it needed on average twice as much time for building a predictive model as J48 (Table 5).

3.1.5.4. Hyperpipes. Hyperpipes was the method that performed the classification process in the shortest time of all tested algorithms (Table 5). There was no problem with recognizing active molecules regardless of the conditions, as the R values exceeded 0.95 in the majority of cases (Fig. 1.A); on the other hand, a high rate of false positives was found during classification with its use (the P values reached ~0.5, except for COX-2, for most fingerprints; Fig. 1.B). MCC values of ~0.2–0.5 were the consequence of errors in the classification of inactive molecules. Enlarging the training data did not generally improve the Hyperpipes performance (Fig. 3). A rise in R was observed (on average up to ~0.04–0.06), but the P value fell considerably (by ~0.04), so the balanced performance described by MCC was worse after the additional instances were added to the training set (by ~0.03–0.07, depending on the target).





3.1.5.5. SMO, Ibk. SMO and Ibk were the classifiers which gave the best results compared to the other methods, at a level comparable to RandomForest (Fig. 1). For the former, the obtained P values exceeded 0.95 in the majority of cases, and for the latter, only in several instances (mostly connected with the use of EStateFP and SubFP) did the P value fall to ~0.8 (Fig. 1.B). The R values also exceeded 0.8 on average; however, in the case of SMO, the use of the above-mentioned EStateFP and SubFP caused it to fall even below 0.3 (the classification of metalloproteinase inhibitors using training set 1; Fig. 1.A). The performance of both Ibk and SMO was better when there were more active ligands in the training data (only in the classification of M1 agonists did the MCC values fall, by no more than 0.03; Fig. 3). The tendencies were similar to those of the previously described classifiers: adding actives elevated the R value (up to ~0.12 for SMO, and ~0.7 for Ibk; Fig. 3.A), decreased the value of P (by no more than 0.05; Fig. 3.B) and led to an MCC enhancement of 0.05 on average for SMO (~0.15 for metalloproteinase inhibitors) and ~0.01 for Ibk (~0.035 for metalloproteinase inhibitors).

3.1.5.6. Meta-learners. Changes in the evaluating parameters connected with the use of meta-classifiers varied depending on diverse factors (Fig. 2). As regards R, it changed considerably when moving between different targets and fingerprints (Fig. 2.A). For sets containing fewer actives, meta-learners generally improved the values of that parameter for J48; however, in the case of NaïveBayes, no change was observed in most cases. Combining ensemble classifiers (MultiBoostAB, Bagging and Decorate with RandomForest) did not lead to better results either, nor did the use of the "best first" algorithm for attribute selection (the use of FilteredClassifier was connected with considerably worse results in most cases). The situation is clearer when sets containing more actives are taken into account. The R improvement for J48 is more pronounced (for all meta-learners except FilteredClassifier) and amounts to ~0.05; however, it still varied a lot for the other base classifiers. P is a parameter which was substantially enhanced by MultiBoostAB, Decorate and Bagging (Fig. 2.B). J48 was again the classifier for which the highest uplift was observed, especially when ExtFP or FP was used for molecular representation (it even approached 0.15 for metalloproteinase inhibitors). As a result of the changes in R and P, the MCC values were also higher after MultiBoostAB, Decorate and Bagging were applied (Fig. 2.C). On average, a rise of ~0.08 was reported for J48; however, for the rest of the base learners, changes in MCC varied depending on the classification conditions. For example, Bagging enhanced the MCC obtained using RandomForest for sets containing a higher number of actives (by ~0.05); however, in the case of training data with fewer actives, the occurrence of such an enhancement depended strongly on the target and the fingerprint. On the other hand, NaïveBayes was the only classifier for which a rise in MCC was observed in the majority of cases when the best first algorithm was used for selecting attributes (especially for the classification of COX-2 inhibitors and M1 agonists). The use of meta-learners was also connected with an extension of the computational time, being the greatest in the case of Decorate and the smallest for FilteredClassifier (Table 5). MultiBoostAB and Bagging led to a comparable time lengthening, slightly smaller than that resulting from the use of Decorate.

3.2. Virtual screening conditions

3.2.1. Targets

As in the initial tests, the best performance was reported for cyclooxygenase-2 inhibitors (Fig. 4.1); however, even for this target, a significant precision fall (<0.2) was observed for the majority of classifiers when ExtFP, FP, KlekFP or SubFP was used. Regardless of the conditions, R remained at a high level, with values approaching 1 for all the fingerprints and targets. Lower MCC values for NaïveBayes and Hyperpipes (<0.5), especially with the use of the four mentioned fingerprints, resulted from a considerable decrease in P. As regards 5-HT1A agonists (Fig. 4.2), significant changes in R compared to those found in the initial tests (>0.2) were observed for EStateFP only. For the rest of the fingerprints, they usually did not exceed 0.2, and the highest change rate was reported for the combination of SMO and MACCSFP (>0.3). However, in the case of the latter target, the values of P fell dramatically in the tests in virtual screening conditions (below or close to 0.3 in most cases). Only the combination of Bagging and Decorate with RandomForest and ExtFP or FP gave values exceeding 0.95. A similar tendency was observed for HIV protease inhibitors (Fig. 4.3). For EStateFP and SubFP, the R and P values were low (~0.5–0.6 and ~0.2, respectively), and the group of methods providing satisfying results was extended (compared to 5-HT1A agonists) with RandomForest, SMO and MultiBoostAB in combination with not only ExtFP and FP, but also KlekFP and PubChemFP (in the case of the latter fingerprint, P for RandomForest and MultiBoostAB was lower, equal to ~0.85). As regards the classification of M1 agonists (Fig. 4.4), in contrast to the rest of the tested targets, no decrease in R for EStateFP and SubFP was observed in the virtual screening-like experiments (the highest R decrease, exceeding 0.2, was described for PubChemFP). Interestingly, a fall in P was observed for all the fingerprints but SubFP, for which the values of that parameter approached 1 for all the machine learning methods except NaïveBayes. R and P values of nearly 1 resulted in equally high values of MCC, which were higher in virtual screening than in the initial tests. As to metalloproteinase inhibitors (Fig. 4.5), the R values obtained in virtual screening conditions for EStateFP and SubFP were the lowest of all the tested targets (<0.4 in most cases). P and MCC for those fingerprints were also very low: below or close to 0.2. For that target, the best results were obtained for the combination of SMO and KlekFP (MCC exceeding 0.9).

3.2.2. Evaluating parameters

3.2.2.1. Recall. R reached values similar to those obtained in the earlier tests, keeping the same tendencies when different targets and fingerprints were taken into consideration (Fig. 4.A). Regardless of the fingerprint and the classifier, it was equal (or close) to 1 for COX-2 inhibitors. In the experiments with agonists of the M1 receptor, the lowest values were obtained when using PubChemFP (~0.65) and GraphFP (~0.7), whereas the best results were found in the case of FP and KlekFP (on average close to 0.8–0.85) and SubFP, for which the R values approached 1 (in the initial tests, SubFP also showed very high R values for that target: close to 0.9). As regards the rest of the targets, the inability to recognize active compounds using EStateFP was apparent, with R values lower (in most cases) than 0.5. On the other hand, the use of FP, ExtFP and KlekFP usually led to the highest values (~0.9 for HIV protease inhibitors, and ~0.7–0.75 for metalloproteinase and 5-HT1A ligands); PubChemFP also provided high values for 5-HT1A agonists and HIV protease inhibitors (~0.9 on average; in the case of the former target, it approached 1 in most cases). As regards classifiers, Hyperpipes was in general the most effective in selecting actives from the test set, providing R values of almost 1 regardless of the experimental conditions.

3.2.2.2. Precision. In virtual screening conditions, P was considerably lower (Fig. 4.B) compared to the results of the initial tests (Fig. 1.B), and it strongly depended on the type of fingerprint and machine learning method. Even in the case of COX-2 inhibitors (in contrast to the generally very high classification effectiveness for that target), the P values fell as low as 0.2 in some cases when J48, NaïveBayes and Hyperpipes were used. When different classifiers were taken into account, Decorate and Bagging were the methods whose application in combination with RandomForest gave the highest values: in some cases, they even exceeded 0.95.




3.2.2.3. MCC. Changes in the MCC values resulted from the previously described parameter variations (Fig. 4.C). As expected, the lowest values were obtained for Hyperpipes (slightly over 0) among the machine learning methods, and for EStateFP (on average <0.4 for M1 agonists and <0.3 for the remaining targets except COX-2) among the fingerprints. As a consequence of very low precision, the MCC values for J48 and NaïveBayes were also low (~0.3 on average). However, some other classifiers, such as Decorate and Bagging, used in combination with ExtFP, FP or KlekFP, led to MCC values exceeding 0.8.

3.2.3. Machine learning methods

As regards the performance of machine learning methods in virtual screening conditions, it is noteworthy that for some classifiers it was at a level similar to that in the previous tests, whereas for others, the evaluating parameters varied considerably. J48, NaïveBayes and Hyperpipes belonged to the latter group. Although the R values were still at a high level (see Fig. 4.A) and reached ~0.7–0.8 for J48 and NaïveBayes (as a rule, for all the fingerprints except EStateFP and SubFP) and were close to 1 for Hyperpipes (regardless of the fingerprint), the P value fell, even below 0.1 (Fig. 4.B). Only in the classification of COX-2 inhibitors did J48 and NaïveBayes reach values exceeding 0.7 for some fingerprints. Some problems with a high rate of false positives also occurred for Hyperpipes in our initial tests (Fig. 1.B); in that case, however, the P values were much higher (~0.5). The great number of negative instances wrongly classified as actives also generated lower MCC values for those classifiers (even as low as 0 in the case of Hyperpipes; Fig. 4.C).

As regards RandomForest, it was very effective in classifying COX-2 inhibitors (the values of all evaluating parameters were close to 1 in most cases). For the other targets, R (except for EStateFP and SubFP) reached high values (~0.7–0.8; Fig. 4.A), whereas the values of P fell in most cases (Fig. 4.B). FP and ExtFP were the only fingerprints for which it remained at a high level, exceeding 0.8 with the exception of M1 agonists.

In contrast to the previously described classifiers, the inclusion of additional inactives in the dataset did not result in a significant deterioration of the SMO performance in comparison with the initial tests (Fig. 4.3). The R and P values exceeded ~0.8 (with some fluctuations depending on the type of fingerprint and target); however, as in the case of the other classifiers, the use of EStateFP and SubFP was connected with a considerably poorer performance (for some targets, they reached values as low as ~0.3). The values of MCC (except for the already mentioned fingerprints) were also high, especially for FP, ExtFP and KlekFP: over 0.8 (Fig. 4.C).

The use of meta-classifiers in virtual screening conditions involved changes in the values of the evaluating parameters. MultiBoostAB gave only subtle variations (most often not exceeding 0.03, except for COX-2 inhibitors); in the case of that meta-learner, a slight decrease in the evaluating parameter values (instead of a base-classifier improvement) was sometimes observed. FilteredClassifier did not ensure better results either. In the majority of cases, its application to virtual screening was connected with a reduction in the MCC values, even by more than 0.2 (Fig. 4.C). It was usually P which contributed most significantly to the attenuation of the results, lowering them even by more than 0.3 in comparison with RandomForest alone (Fig. 4.B). Decorate and Bagging were the methods that improved the base classifier, especially the P values (Fig. 4.B): they were raised by as much as 0.3 for Decorate in some cases in comparison with the results obtained for RandomForest. In general, FP, ExtFP and KlekFP performed best with the meta-learning algorithms.

4. Discussion

On the basis of the analysis presented above, it is not possible to choose one combination of method, fingerprint and dataset composition that would give the best performance in all the tested conditions. The strong dependence of machine learning methods performance on the classification conditions is also the source of difficulties in making comparisons with the results obtained by other groups.

As described in the Results, although there seemed to be significant differences between various experiments (machine learning methods performance varied not only between classifiers, but also between different fingerprints and set compositions), a few general tendencies could be observed.

The very best results were obtained for the classification of cyclooxygenase-2 inhibitors. The values of all evaluating parameters approached 1 regardless of the conditions in the initial tests; only in some cases (usually connected with the use of NaïveBayes and Hyperpipes) did a significant fall in the P values (also resulting in lower MCC values) occur in virtual screening. In general, P was the parameter most sensitive to the introduction of virtual screening conditions (its values decreased even by more than 0.7); however, in the initial tests its level was high (over 0.8) for all conditions (except for the use of Hyperpipes), and it was further raised by the use of meta-learning. The R values were more diversified in the initial tests, with the more pronounced changes being connected with the increasing number of actives in the training data. However, in the large database screening, R did not change significantly, and the earlier tendencies were maintained (lower values for EStateFP and SubFP, and R approaching 1 in the case of Hyperpipes). Variations in MCC were the consequence of the parameter changes described earlier, but it was usually P that contributed most to them.

Despite the fact that a slight decrease in the P values was observed when additional actives were added to the training data, an overall improvement in the classification performance (MCC) was observed as a result of a fairly significant increase in the R values. A similar effect of the increased number of actives in the dataset was also observed by Plewczynski et al. [22] and Bruce et al. [32].

The importance of the type of fingerprint when using computational techniques has also been indicated previously [26]. In our tests, KlekFP yielded the best results, although it required the most time for building predictive models. This was clearly visible in virtual screening conditions, when the dataset contained a higher number of molecules. The performance of the hashed fingerprints, especially ExtFP and FP, as well as MACCSFP and PubChemFP, was also at a high level. However, it was the type of fingerprint, rather than its length, that had the strongest influence on the efficiency of the machine learning methods: the results obtained for MACCSFP (166 bits) are similar to those for KlekFP (4860 bits). A substantial decrease in the values of the evaluating parameters occurred in the case of EStateFP and SubFP, being visible in both the initial and the virtual screening-like experiments.

As regards machine learning methods, the best in most cases, and the most stable, were SMO and Ibk, this finding being in line with the results obtained by Plewczynski et al. [22,23] and Cheng et al. [10]. The above-quoted authors reported that SVM outperformed the other methods and described a good performance of the k-NN algorithm. Han et al. [18] tried to improve the SVM performance by training it on sets containing diverse inactive compounds. The adjusted model was then applied to screening a database of almost 3 million molecules for ligands with a single mechanism (including HIV protease inhibitors) and multiple mechanisms (agents active in the central nervous system). The constructed model was able to identify 78% of the known protease inhibitors and 67% of the CNS-active agents. The latter results were worse than those obtained in our experiments, which, however, could be due to differences in the classification conditions, on which machine learning performance strongly depends.

Page 11: A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds

99S. Smusz et al. / Chemometrics and Intelligent Laboratory Systems 128 (2013) 89–100

inhibitors and 67% of the CNS-active agents. The latter results wereworse than those obtained in our experiments, which, however, couldbe due to differences in the classification conditions on which machinelearning performance strongly depends.
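
Both of these classifiers ship with WEKA [31]. The sketch below shows how they might be instantiated through the WEKA Java API; the parameter values are illustrative defaults, not the settings used in this study.

import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;

public class ClassifierSetup {
    public static void main(String[] args) throws Exception {
        // SMO: WEKA's support vector machine trained by
        // sequential minimal optimization [44]
        SMO svm = new SMO();
        svm.setC(1.0);      // complexity constant (illustrative)

        // Ibk: k-nearest neighbours; no explicit model is built, and
        // neighbours are searched at prediction time, which makes testing slow
        IBk knn = new IBk();
        knn.setKNN(3);      // number of neighbours (illustrative)
    }
}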

As to decision trees, RandomForest outperformed J48, especially in the virtual screening-like tests, where a significant decrease in the P values was observed for the latter classifier. However, J48 was more susceptible to learning at the meta-level (a rise of as much as ~0.15 in MCC occurred when J48 was combined with MultiBoostAB, Decorate and Bagging). Satisfactory results obtained with decision trees were also reported by Plewczynski et al. [22], Bruce et al. [32], Wang et al. [17] and Hammann et al. [16].

Meta-learning algorithms also provided high-level results. Except for the FilteredClassifier, for which an uplift in the results was not always observed (this should be further tested with other attribute selection algorithms), MultiBoostAB, Decorate and Bagging led to a significant improvement in the performance, the latter two classifiers giving results comparable to those obtained using SMO. Such advantages accruing from the application of meta-learning methods were also mentioned by Bruce et al. [32].
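
In WEKA, the meta-learners are simple wrappers around a base classifier, so combining J48 with, for example, Bagging takes only a few lines. A minimal sketch follows; the file name and ensemble size are illustrative placeholders, and MultiBoostAB and Decorate follow the same wrapper pattern (in recent WEKA releases they are distributed as optional packages rather than in the core).

import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaLearningDemo {
    public static void main(String[] args) throws Exception {
        // "training.arff" stands in for a fingerprint-featurized training set
        Instances train = DataSource.read("training.arff");
        train.setClassIndex(train.numAttributes() - 1); // class label in the last column

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48()); // base learner improved at the meta-level
        bagger.setNumIterations(10);     // ensemble size (illustrative)
        bagger.buildClassifier(train);
    }
}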

Bruce et al. [32] also indicated a decrease in the classification accuracy with an increase in the number of instances in the dataset, which was likewise observed in our experiments. Although the performance level of some methods was high for small sets (initial tests), a significant decrease in classification effectiveness occurred in virtual screening conditions. This remark applies in particular to Hyperpipes and NaïveBayes, for which P was considerably lower after additional instances were used. A high rate of false positives for NaïveBayes was already pointed out by Cannon et al. [24] in experiments with datasets similar in size to those employed in our tests (94,290 compounds). Despite those difficulties, a huge advantage of Hyperpipes is that it easily recognized actives regardless of the conditions (R values close to 1) and that it worked extremely fast even in the case of KlekFP (the shortest time required for building a predictive model).

Due to the great dependence of the performance of machine learning methods on factors such as the type of fingerprint, the target and the dataset composition, choosing the optimal conditions for a screening experiment is not a simple matter; hence each experiment should be considered individually. The results obtained in the initial tests may be useful when a machine learning method is chosen for the selection of actives from smaller bases (below ~2000 compounds), whereas the findings derived from the experiments performed in virtual screening conditions are likely to help when screening large databases of structures.

In general, the use of KlekFP as a fingerprint in combination with SMO, Ibk or RandomForest should yield high-level results, which can be further improved by the application of meta-learning methods (MultiBoostAB, Decorate, and Bagging). However, this molecular representation carries great computational expense, which lengthens the duration of an experiment. Adding further actives to the training data also improves the performance of machine learning methods, although a significant increase in the number of instances in the dataset causes a higher number of false positives. If an experiment is aimed at selecting as many actives as possible, without regard to the number of false positives, Hyperpipes is the best tool for the task, as its recall values are usually close to 1 regardless of the conditions. When computational expenses are taken into account, the hashed fingerprints (excluding GraphFP), MACCSFP or PubChemFP should be applied, also in combination with RandomForest or SMO, as they offer the best trade-off between the evaluating parameters and the time needed for building a predictive model. Although in the case of Ibk a predictive model is not built, the time needed for evaluating a test set is considerably longer than for the rest of the classifiers, which is connected with the necessity of comparing each tested instance with each instance of the training data.
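
Putting the pieces together, a workflow of the kind recommended above can be assembled in a few lines of WEKA code. The sketch below trains SMO on a KlekFP-featurized set and reports R, P and MCC for the active class under 10-fold cross-validation; the file name and class index are placeholders, and matthewsCorrelationCoefficient assumes WEKA 3.7 or later.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ScreeningPipeline {
    public static void main(String[] args) throws Exception {
        // Placeholder file: molecules encoded as KlekFP bit vectors
        // plus an active/inactive class label
        Instances data = DataSource.read("klekfp_dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO classifier = new SMO();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10-fold CV

        int active = 0; // assumed index of the "active" class label
        System.out.printf("R   = %.3f%n", eval.recall(active));
        System.out.printf("P   = %.3f%n", eval.precision(active));
        System.out.printf("MCC = %.3f%n", eval.matthewsCorrelationCoefficient(active));
    }
}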

5. Conclusions

The present study describes the performance of 11 machine learning methods with the use of 8 fingerprints for the classification of molecules active towards 5 protein targets in different dataset compositions, including tests of selected methods in virtual screening-like experiments. Although the effectiveness of each classifier always varies with the conditions of the classification process, some general relationships that may be helpful in choosing a proper method in a particular case can be given. Below are some practical hints based on the results obtained for the 5 tested targets; however, we cannot guarantee that they would be of use in the case of other targets.

Useful tips:

• If computational expenses are not important, it may be worthwhile to choose KlekFP as a fingerprint and SMO, Ibk or RandomForest as a classifier, preferably together with meta-learning methods;

• If a rapid classification process, together with identification of all the active compounds (but not necessarily all inactives, e.g. when it is possible to filter them off using other tools), is needed, Hyperpipes together with the hashed fingerprints, MACCSFP or PubChemFP is advised;

• If possible, more active ligands should be added to the training data, especially those with diverse properties;

• It is worthwhile to apply learning at the meta-level;

• If only few active ligands are available in the training data, it is advisable to choose Ibk, especially in combination with ExtFP or FP as a fingerprint;

• When performing experiments with a great number of instances, it is recommended to choose SMO or RandomForest (in particular together with meta-learners);

• EStateFP and SubFP are not recommended as molecular representations, especially for datasets with a great number of instances.

Acknowledgments

The study was partly supported by grant PRELUDIUM 2011/03/N/NZ2/02478, financed by the National Science Centre, and by project UDA-POIG.01.03.01-12-100/08-00, co-financed by the European Union from the European Fund of Regional Development (EFRD); http://www.prokog.pl.

Appendix A. Supplementary data

The supplementary data contain the numerical values of all evaluating parameters (Tables A1, A2). Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.chemolab.2013.08.003.

References

[1] A. Breda, L.A. Basso, D.S. Santos, W.F. de Azevedo Jr., Virtual screening of drugs: score functions, docking, and drug design, Curr. Comput. Aided Drug Des. 4 (2008) 265–272.

[2] H. Geppert, M. Vogt, J. Bajorath, Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation, J. Chem. Inf. Model. 50 (2010) 205–216.

[3] J.L. Melville, E.K. Burke, J.D. Hirst, Machine learning in virtual screening, Comb. Chem. High Throughput Screen. 12 (2009) 332–343.

[4] B. Chen, R.F. Harrison, G. Papadatos, P. Willett, D.J. Wood, X.Q. Lewell, P. Greenidge, N. Stiefl, Evaluation of machine-learning methods for ligand-based virtual screening, J. Comput. Aided Mol. Des. 21 (2007) 53–62.

[5] A. Schwaighofer, T. Schroeter, S. Mika, G. Blanchard, How wrong can we get? A review of machine learning approaches and error bars, Comb. Chem. High Throughput Screen. 12 (2009) 453–468.

[6] X.H. Ma, J. Jia, F. Zhu, Y. Xue, Z.R. Li, Y.Z. Chen, Comparative analysis of machine learning methods in ligand-based virtual screening of large compound libraries, Comb. Chem. High Throughput Screen. 12 (2009) 344–357.

[7] Y. Fukunishi, Structure-based drug screening and ligand-based drug screening with machine learning, Comb. Chem. High Throughput Screen. 12 (2009) 397–408.

[8] D. Hecht, Applications of machine learning and computational intelligence to drug discovery and development, Drug Dev. Res. 72 (2011) 53–65.

[9] L. Wang, Z. Wang, A. Yan, Q. Yuan, Classification of Aurora-A kinase inhibitors using self-organizing map (SOM) and support vector machine (SVM), Mol. Inf. 30 (2011) 35–44.

[10] F. Cheng, Y. Yu, J. Shen, L. Yang, W. Li, G. Liu, P.W. Lee, Y. Tang, Classification of cytochrome P450 inhibitors and noninhibitors using combined classifiers, J. Chem. Inf. Model. 51 (2011) 996–1011.

[11] D.C. Weis, D.P. Visco, J.-L. Faulon, Data mining PubChem using a support vector machine with the signature molecular descriptor: classification of factor XIa inhibitors, J. Mol. Graph. Model. 27 (2008) 466–475.

[12] Y.H. Wang, Y. Li, S.L. Yang, L. Yang, Classification of substrates and inhibitors of P-glycoprotein using unsupervised machine learning approach, J. Chem. Inf. Model. 45 (2005) 750–757.

[13] C.Y. Liew, X.H. Ma, C.W. Yap, Consensus model for identification of novel PI3K inhibitors in large chemical library, J. Comput. Aided Mol. Des. 24 (2010) 131–141.

[14] X.H. Liu, H.Y. Song, J.X. Zhang, B.C. Han, X.N. Wei, X.H. Ma, W.K. Cui, Y.Z. Chen, Identifying novel type ZBGs and nonhydroxamate HDAC inhibitors through a SVM based virtual screening approach, Mol. Inf. 29 (2010) 407–420.

[15] X.H. Liu, X.H. Ma, C.Y. Tan, Y.Y. Jiang, M.L. Go, B.C. Low, Y.Z. Chen, Virtual screening of Abl inhibitors from large compound libraries by support vector machines, J. Chem. Inf. Model. 49 (2009) 2101–2110.

[16] F. Hammann, H. Gutmann, U. Baumann, C. Helma, J. Drewe, Classification of cytochrome P450 activities using machine learning methods, Mol. Pharm. 33 (2009) 796–801.

[17] M. Wang, X.-G. Yang, Y. Xue, Identifying hERG potassium channel inhibitors by machine learning methods, QSAR Comb. Sci. 27 (2008) 1028–1035.

[18] L.Y. Han, X.H. Ma, H.H. Lin, J. Jia, F. Zhu, Y. Xue, Z.R. Li, Z.W. Cao, Z.L. Ji, Y.Z. Chen, A support vector machines approach for virtual screening of active compounds of single and multiple mechanisms from large libraries at an improved hit-rate and enrichment factor, J. Mol. Graph. Model. 26 (2008) 1276–1286.

[19] X.H. Ma, R. Wang, S.Y. Yang, Z.R. Li, Y. Xue, Y.C. Wei, B.C. Low, Y.Z. Chen, Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds, J. Chem. Inf. Model. 48 (2008) 1227–1237.

[20] D. Plewczynski, kNNsim: k-nearest neighbors similarity with genetic algorithm features optimization enhances the prediction of activity classes for small molecules, J. Mol. Model. 15 (2009) 591–596.

[21] S. Agarwal, D. Dugar, S. Sengupta, Ranking chemical structures for drug discovery: a new machine learning approach, J. Chem. Inf. Model. 50 (2010) 716–731.

[22] D. Plewczynski, S.H. Spieser, U. Koch, Assessing different classification methods for virtual screening, J. Chem. Inf. Model. 46 (2006) 1098–1106.

[23] D. Plewczynski, Brainstorming: weighted voting prediction of inhibitors for protein targets, J. Mol. Model. 17 (2011) 2133–2141.

[24] E.O. Cannon, A. Amini, A. Bender, M.J.E. Sternberg, S.H. Muggleton, R.C. Glen, J.B.O. Mitchell, Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds, J. Comput. Aided Mol. Des. 21 (2007) 269–280.

[25] D. Plewczynski, M. Von Grotthuss, S.A.H. Spieser, L. Rychlewski, L.S. Wyrwicz, K. Ginalski, U. Koch, Target specific compound identification using a support vector machine, Comb. Chem. High Throughput Screen. 10 (2007) 189–196.

[26] E.J. Gardiner, V.J. Gillet, M. Haranczyk, J.D. Holliday, N. Malim, P. Willett, Turbo similarity searching: effect of fingerprint and dataset on virtual-screening performance, Stat. Anal. Data Min. 2 (2009) 103–114.

[27] MDDR licensed by Accelrys, Inc., USA; accelrys.com (accessed Sep 02, 2012).

[28] J. Duan, M. Sastry, S. Dixon, J. Lowrie, W. Sherman, Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods, J. Mol. Graph. Model. 3 (2011) P1.

[29] J.J. Irwin, B.K. Shoichet, ZINC: a free database of commercially available compounds for virtual screening, J. Chem. Inf. Model. 45 (2005) 177–182.

[30] C.W. Yap, PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints, J. Comput. Chem. 32 (2010) 1466–1474.

[31] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explor. 11 (2009) 10–18.

[32] C.L. Bruce, J.L. Melville, S.D. Pickett, J.D. Hirst, Contemporary QSAR classifiers compared, J. Chem. Inf. Model. 47 (2007) 219–227.

[33] L.H. Hall, L.B. Kier, Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information, J. Chem. Inf. Model. 35 (1995) 1039–1045.

[34] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics, J. Chem. Inf. Comput. Sci. 43 (2003) 493–500.

[35] J. Klekota, F.P. Roth, Chemical substructures that enrich for biological activity, Bioinformatics 24 (2008) 2518–2525.

[36] T. Ewing, J.C. Baber, M. Feher, Novel 2D fingerprints for ligand-based virtual screening, J. Chem. Inf. Model. 46 (2006) 2423–2431.

[37] National Center for Biotechnology Information, PubChem fingerprint specification (FTP), ftp://ftp.ncbi.nlm.nih.gov/pubchem/specification (accessed Sep 02, 2012).

[38] C. Laggner, SMARTS patterns for functional group classification, http://semanticchemistry.googlecode.com/svn-history/r41/wiki/InteLigand.wiki, 2009 (accessed Sep 02, 2012).

[39] R. Kohavi, R. Quinlan, Data mining tasks and methods: classification: decision-tree discovery, in: W. Klösgen, J.M. Żytkow (Eds.), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York, 2002, pp. 267–276.

[40] T.S. Korting, C4.5 algorithm and multivariate decision trees, Image Processing Division, National Institute for Space Research (INPE), São José dos Campos, Brazil, http://www.dpi.inpe.br/~tkorting/projects/c45/material.pdf, 2006 (accessed Sep 02, 2012).

[41] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.

[42] E. Frank, I.H. Witten, Generating accurate rule sets without global optimization, Proc. 15th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., Madison, Wisconsin, 1998, pp. 144–151.

[43] Z.A. Deeb, T. Devine, Randomized decimation HyperPipes, http://www.csee.wvu.edu/timm/tmp/r7.pdf, 2010 (accessed Sep 02, 2012).

[44] J.C. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical Report MSR-TR-98-14, Microsoft Research, 1998, pp. 1–21.

[45] G.I. Webb, MultiBoosting: a technique for combining boosting and wagging, Mach. Learn. 40 (2000) 159–196.

[46] P. Melville, R.J. Mooney, Constructing diverse classifier ensembles using artificial training examples, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., Acapulco, Mexico, 2003, pp. 505–510.

[47] J. Stefanowski, M. Pachocki, Comparing performance of committee based approaches to active learning, in: M. Klopotek, A. Przepiorkowski, S. Wierzchon, K. Trojanowski (Eds.), Recent Advances in Intelligent Information Systems, EXIT, Warsaw, 2009, pp. 457–470.

[48] The University of Waikato, Department of Computer Science, Weka API documentation: FilteredClassifier, http://weka.sourceforge.net/doc.stable/ (accessed Sep 02, 2012).

[49] C. Savojardo, P. Fariselli, P.L. Martelli, P. Shukla, R. Casadio, Prediction of the bonding state of cysteine residues in proteins with machine-learning methods, in: R. Rizzo, P.J.G. Lisboa (Eds.), Computational Intelligence Methods for Bioinformatics and Biostatistics, 7th International Meeting, Springer-Verlag, Berlin Heidelberg, 2011, pp. 98–111 (LNBI 6685).

[50] P. Pavlidis, W.S. Noble, Matrix2png: a utility for visualizing matrix data, Bioinformatics 19 (2003) 295–296.