Predicting Protein-Ligand Binding Affinities Using Novel Geometrical Descriptors and Machine-Learning Methods

Download Predicting Protein-Ligand Binding Affinities Using Novel Geometrical Descriptors and Machine-Learning Methods

Post on 10-Feb-2017




8 download

Embed Size (px)


<ul><li><p>Predicting Protein-Ligand Binding Affinities Using Novel Geometrical Descriptors andMachine-Learning Methods</p><p>Wei Deng, Curt Breneman,*, and Mark J. Embrechts</p><p>Departments of Chemistry and Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute,Troy, New York 12180</p><p>Received October 30, 2003</p><p>Inspired by the concept of knowledge-based scoring functions, a new quantitative structure-activityrelationship (QSAR) approach is introduced for scoring protein-ligand interactions. This approach considersthat the strength of ligand binding is correlated with the nature of specific ligand/binding site atom pairs ina distance-dependent manner. In this technique, atom pair occurrence and distance-dependent atom pairfeatures are used to generate an interaction score. Scoring and pattern recognition results obtained usingKernel PLS (partial least squares) modeling and a genetic algorithm-based feature selection method arediscussed.</p><p>INTRODUCTION</p><p>Recent advances in molecular biology, X-ray crystal-lography, and NMR spectroscopy are providing three-dimensional structures of protein-ligand complexes withatomic resolution at an escalating rate. Combined with vastincreases in accessible computing power, these advanceshave facilitated dramatic changes in rational drug designmethodologies.</p><p>There are two common strategies for approaching structure-based drug design.1 The first is to start with a set of existingligands with known binding properties, where the originallead compounds are altered to create diverse families of newcandidate molecules. The other method involves the estima-tion of optimal functional group placements within targetbinding sites, followed by connection of these functionalgroups via scaffold structures to generate putative moleculesthat may be candidates for experimental validation. However,both techniques require a means of rapidly predicting theexpected binding energy of the putative structures.</p><p>Typically, thousands of virtual candidate molecules aregenerated for a given binding site and require scoring andrank ordering prior to being considered for synthesis.Therefore, fast protein-ligand scoring functions that canprovide useful estimates of ligand binding affinities may beused for synthetic prioritizationsa crucial component ofefficient drug design.2,3</p><p>Several methods (force field, empirical, and knowledge-based scoring functions) have been developed to calculatebinding free energies. Among these, knowledge-based scor-ing functions are the most recently developed method andhave been used successfully to study protein-ligandinteractions.4-6</p><p>Traditional knowledge-based scoring functions requiretraining sets comprising thousands of protein-ligand com-</p><p>plexes. This large training data requirement stems from thestatistical definition of a knowledge-based potential asexpressed in eq 1:7</p><p>where kB is the Boltzmann constant,T is the absolutetemperature, andgref corresponds to an artificially definedreference distribution. The termgij(r) is a frequency orprobability distribution of atom pairs of typesi and j at adistancer from each other and is given by7</p><p>Here,Ni,j(r) is the occurrence of atom pairsi,j at a distancebetweenr and r + dr.</p><p>From these definitions, it may be seen that a zerooccurrence of any kind of atom pair in the training datacannot be tolerated. Therefore, sufficient data need to becollected and thousands of structures must be examined inorder to produce statistically valid results and to ensure theinclusion of some atom types that are rarely seen.6</p><p>The method described in this work was inspired by theknowledge-based scoring function defined by Gohlke et al.7,8</p><p>and uses the 17 atom type assignments that are listed in Table1, resulting in 289 possible atom pairs. The algorithmpresented in this work counts the occurrence of each atomtype pair that incorporates atoms from both ligand andbinding site within a certain distance range and uses thisinformation to form descriptors for quantitative structure-activity/-property relationships (QSAR/QSPR) analysis. Inour initial approach, all atom pairs with distances between1 and 6 were considered in the analysis.</p><p>To achieve better predictive results, distance dependencewas also considered within the protein-ligand atom pairinteractions. To accomplish this, the descriptor algorithm was</p><p>* Corresponding author phone: (518) 276-2678; fax: (518) 276-4887;e-mail:</p><p> Department of Chemistry. Department of Decision Sciences and Engineering Systems.</p><p>Wij (r) ) - kBT ln(gij(r)gref ) (1)</p><p>gi,j(r) )Ni,j(r)/4r</p><p>2</p><p>r(Ni,j(r)/4r2)(2)</p><p>699J. Chem. Inf. Comput. Sci.2004,44, 699-703</p><p>10.1021/ci034246+ CCC: $27.50 2004 American Chemical SocietyPublished on Web 01/24/2004</p></li><li><p>modified to take distances between interacting atom pairsinto account. In this modification, the distance range wasdivided into five bins, 1 wide. Each descriptor value wasthen defined as the occurrence of a particular atom pairwithin a specified distance bin. From the results obtained inthis study, it may be seen that the distance-dependent atomtype pair descriptors reflect the characteristics of protein-ligand interactions and show promise for rapid protein-ligand scoring.</p><p>Earlier work by DeWitte and Shakhnovich also character-ized protein-ligand interactions using the occurrence ofspecific atom pairs to develop a knowledge-based scoringfunction with only one distance interval (SMoG).9 Animproved scoring function with two distance intervals, adifferent reference state, and additional parameters waspublished later (SMoG2001).10 In the spirit of traditionalknowledge-based potentials, SMoG technology defines anartificial reference state for comparison with the frequenciesof occurrence of specific atom pairs. In contrast, the methoddescribed in this paper considers only the occurrence anddistance range of each atom pair, without the need forcomparison against an artificial reference state. Although fatalto some knowledge-based methods, zero occurrences ofspecific atom pairs are allowed in this method and do notsignificantly affect the final prediction results.</p><p>Therefore, the theory behind the new method is simplebut useful. The utilization of QSAR modeling for scoring isflexible and applicable to sparse systems, given that modelscan be built using various sets of ligand-binding site pairsfor specific applications. The statistical learning theoryunderlying this method is uncomplicated, and a large quantityof training data is not strongly required.</p><p>METHODS</p><p>Protein-Ligand Complex Data Set.Two data sets werestudied, where one contained 61 and the other 105 structur-ally diverse protein-ligand complexes, respectively. Dif-ferent types of proteins were present in the data sets. Forinstance, the 105-complex data set contains aspartic proteases(17), serine proteases (17), metalloproteases (21), humancarbonic anhydrases (16), sugar-binding proteases (14),endothiapepsins (10), and other proteins (10). Each of thetwo data sets contain complete structures for each cocrys-tallized complex, as well as the unbound protein and ligand</p><p>structures. All data were obtained from the RSCB ProteinData Bank.11 The experimental pKd values for this set ofligand-protein complexes were taken from the literature andwere found to range from 1.49 to 11.53.10</p><p>Two groups of complexes were randomly selected fromthe original lists as blind external test sets, one containing 6complexes for the 61-complex data set and the othercontaining 10 complexes for the 105-complex data set.</p><p>Atom type pair occurrences were then computed for eachcomplex using the simple atom typing and distance algorithmdescribed below. Figure 1 shows the distribution of theoccurrences of all atom pairs in the large (105-member)protein-ligand complex data set.</p><p>Atom Type Descriptor Generation. All atoms of theproteins and ligands were assigned to one of the 17 Gohlkeclasses of atom types used in this study. Therefore, eachprotein-ligand interaction could be characterized as a patterncomprising occurrence counts of atom pairs within a speci-fied cutoff range (6 ) from ligand atoms.</p><p>To take distance dependence into account, the second typeof distance-dependent atom type descriptor was developed.The allowed distance range was divided into five bins of 1 each, ranging from 1 to 6 . Thus, the total number ofpossible protein-ligand atom pairs grows to 1445, and eachdescriptor value was then defined as the occurrence of aparticular atom type pair within a specified distance bin.</p><p>Feature Selection.When the number of descriptors islarge, the data set is certain to contain irrelevant andredundant features that increase the dimensionality of theproblem and make the model difficult to interpret.12 Thiscurse of dimensionality can also lead to model overtrain-ing.13 To reduce the number of irrelevant descriptors in themodel, a GAFEAT (genetic algorithm feature selection)process is employed to perform objective feature selectionbefore model building.</p><p>The genetic algorithm approach is a general optimizationmethod first developed by Holland14 and involves an iterativemutation/scoring/selection procedure on a constant-sizepopulation of individuals.15 The general process starts witha random or heuristic creation of an initial population,followed by a scoring procedure to determine their fitness.During every evolutionary generation, the individuals in the</p><p>Table 1. Atom Type Assignments8</p><p>atom type atoms included</p><p>C.3 carbon sp3C.2 carbon carbon in aromatic carbon in amidinium and guanidinium groupsN.3 (N.4) nitrogen sp3 (positively charged nitrogen) nitrogen in aromatic nitrogen in amid bondsN.pl3 nitrogen in amidinium and guanidinium groupsO.3 oxygen sp3O.2 oxygen sp2O.co2 oxygen in carboxylate groupsS.3 (S.2) sulfur sp3 (sulfur sp2)P.3 phosphorus sp3F fluorineCl chlorineBr bromineMet Ca, Zn, Ni, Fe</p><p>Figure 1. Distribution of atom pairs of all the protein-ligandcomplexes in the 105-member data set.</p><p>700 J. Chem. Inf. Comput. Sci., Vol. 44, No. 2, 2004 DENG ET AL.</p></li><li><p>current population are evaluated according to a predefinedset of quality criteria (a fitness function). To form a newpopulation in the next generation, individuals are selectedaccording to their fitness, and new individuals are introducedby crossover and mutation.16 The following equation showsthe fitness function used in the GAFEAT method. In thisequation,Fk is the fitness of descriptork, CiR is the correlationbetween the descriptori and the response,Cij is theintercorrelation between descriptori and j, R is the inter-correlation penalty factor, and is a death penalty factorthat is applied if any intercorrelation is more than 0.95. The factor is assigned a value of 1000.</p><p>Not only can feature selection objectively reduce thenumber of descriptors without reference to any particularmodel, but it can also select the most important descriptorsthat contribute to a given protein-ligand interaction. Al-though some of the selected descriptors might not be directlyattributable to a specific component of a protein-ligandinteraction (i.e. hydrogen bonding), they may contain otherchemical information relevant to that interaction. An inter-pretation of the descriptor patterns resulting from subjective(model-driven) feature selection and model building canfacilitate an understanding of the fundamental interactionsinvolved in generating a protein-ligand binding score.</p><p>Data Processing.The overall model building and testingprocedure is illustrated in Figure 2. Following featureselection, the training data are split randomly into a seriesof training subsets and validation set pairs (in this case, thevalidation set size if equal to the external test set size). QSARmodels are then built using each subset of the training setand evaluated using the corresponding validation set. Togeneralize the model, this bootstrapping mode is applied tothe model-building procedure 100 times (100 bootstraps).17</p><p>Finally, predictions are made on the external test set usingan aggregate of all QSAR models from the bootstrappingprocedure. The resulting bootstrap aggregate prediction yieldsa score for each complex.</p><p>In this study, the kernel PLS (K-PLS) method was usedto produce nonlinear regression models. The K-PLS imple-mentation used in this work is available as part of the RPI</p><p>Analyze/Stripminer modeling software available on ourproject website ( Figure 3 illustrateshow K-PLS works in both training and test modes. K-PLSworks in distance space instead of descriptor space andallows a variety of kernel functions to be used to determinedistances between descriptor patterns. In the present work,a radial basis function (RBF) kernel was utilized.</p><p>In the training mode, K-PLS internally builds anN Nkernel matrix from a training set consisting ofN compoundsandM descriptors. The resulting matrix can be considered anonlinear transformation of the input data into distance space,after which distances between cases are used as descriptorsin a subsequent linear PLS learning process. In the testingmode, a test set withT compounds andM descriptors ismultiplied by a modifiedT N kernel matrix, and the resultsare fed into a previously determined PLS model to makeeach prediction.18</p><p>The Analyze/Stripminer shell program was used in thisinvestigation to perform data processing, feature reduction,and predictive modeling. The Analyze/Stripminer programpackage was developed by one of the coauthors (M.J.E.) andsupports several different machine-learning models: kernelpartial least squares regression, genetic algorithms, supportvector machines, sensitivity analysis, neural networks, andlocal learning. This program has been utilized successfullyfor in-silico drug design, economic financially related studies,and nuclear physics applications.19</p><p>RESULTS</p><p>After objective feature selection consisting of the removalof nonchanging features, highly intercorrelated descriptors,and 4-sigma outliers,20 the remaining descriptors were usedin a subjective feature selection and modeling process. Table2 shows the prediction results (R2) of the validation sets andexternal test sets for each data set using both atom type pairsand distance-dependent atom type pair descriptors, bothbefore and after subjective feature selection. Note that theresults shown for the validation set predictions are theaverage values from 100 bootstraps.</p><p>Figure 4 and Figure 5 illustrate the K-PLS predictionresults for the validation set and test set of model 9,respectively. In Figure 4, the discontinuous vertical bars show</p><p>Figure 2. Flow chart of data processing with feature selection.</p><p>Fk ) i)1</p><p>N</p><p>CiR + R i)1,i*j</p><p>N</p><p>Cij + (3)</p><p>k ) 1, 2, 3, ..., population size</p><p>Figure 3. K-PLS prediction in both training and testing modes.Nis the number of descriptor patterns (training cases) in the trainingset, andM is the number of descriptors used in both training andtest sets.T is the number of test cases.Wi signify kernel inputinformation going into the model, andYr are the prediction results.</p><p>PREDI...</p></li></ul>