A Novel Method for Protein-Ligand Binding Affinity
Prediction and the Related Descriptors Exploration
SHUYAN LI, LILI XI, CHENGQI WANG, JIAZHONG LI, BEILEI LEI, HUANXIANG LIU, XIAOJUN YAO
Department of Chemistry, Lanzhou University, Lanzhou 730000, China
Received 18 March 2008; Revised 18 June 2008; Accepted 22 June 2008
DOI 10.1002/jcc.21078
Published online 10 September 2008 in Wiley InterScience (www.interscience.wiley.com).
Abstract: In this study, a novel method was developed to predict the binding affinity of protein-ligand complexes (PLCs) based on a comprehensive set of structurally diverse complexes. The 1300 PLCs with binding affinity data (493 complexes with Kd and 807 with Ki) from the refined set of the PDBbind database (release 2007) were used for predictive model development. In this method, each complex was described using descriptors calculated from three blocks: protein sequence, ligand structure, and binding pocket. The PLC data were then rationally split into representative training and test sets with full consideration of model validation. The molecular descriptors relevant to the binding affinity were selected on the training set using the ReliefF method combined with the least squares support vector machines (LS-SVMs) modeling method. Two final optimized LS-SVMs models were developed using the selected descriptors to predict the Kd and Ki binding affinities. The correlation coefficients (R) of the training and test sets were 0.890 and 0.833 for the Kd model, and 0.922 and 0.742 for the Ki model, respectively. The prediction method proposed in this work offers better generalization ability than other recently published methods and can be used as an alternative fast filter in the virtual screening of large chemical databases.
© 2008 Wiley Periodicals, Inc. J Comput Chem 30: 900–909, 2009
Key words: protein-ligand binding affinity; ReliefF method; least squares support vector machines (LS-SVMs);
model validation
Introduction
Recently, structure-based drug design (SBDD) methods such as
docking and de novo design have been widely used in the drug
discovery and drug design process, especially in lead discovery
and lead optimization. In SBDD methods, one of the most important issues is screening available ligands against the relevant macromolecular target. In most cases, the more strongly a ligand binds to the target protein, the more likely it is to affect the protein's biological function and, consequently, to be a suitable drug candidate. Therefore, the assessment of the binding affinity between ligand and receptor plays an important role in drug discovery and design. Because this affinity is mainly determined by the interaction between the ligand and the receptor, the study of the relationship between the features of a ligand-protein complex and its binding affinity has become very important in modern drug discovery.1,2
Usually, the binding affinity of a ligand for a given receptor can be determined experimentally, for example by NMR spectrometry,1–3 microcalorimetry,4–6 and surface plasmon resonance,7 but these measurements are often time-consuming and expensive. Therefore, many in silico methods have been developed to predict the binding affinity from the properties of the ligand and the receptor. The advantage of these in silico methods is that, once reliable prediction models are developed, they are much less dependent on experimental data; they can be applied even to virtual compounds that have not yet been synthesized. At the same time, they can filter out compounds that have "no" or "low" activity. The most widely used methods in
this field are docking and scoring methods, which can identify the binding mode of the ligands and estimate the strength of the protein-ligand interactions. There are many classes of these methods, such as force-field-based methods (like DOCK,8
GOLD,9 SIE,10 and LIE11), knowledge-based approaches (e.g.,
DrugScore,12 DFIRE,13 3DDFT,14 PMF,15,16 BLEEP,17,18
ITScore,19 and M-Score20), and the empirical scoring functions
Contract/grant sponsor: Program for New Century Excellent Talents in
University; contract/grant number: NCET-07-0399
Contract/grant sponsors: Scientific Research Foundation for the Returned
Overseas Chinese Scholars, State Education Ministry, Zhide Foundation
of Lanzhou University
Correspondence to: X. Yao; e-mail: [email protected]
combined with various statistical methods (X-Score,21 FlexX
Score,22 VALIDATE,23 SCORE1 (LUDI),24 SCORE,25,26 Chem-
Score,27 SMoG,28,29 GEMDOCK,30 and SODOCK31). In addition, some methods introduce parallel computing into this field (high-throughput MM-PBSA32,33).
Recently, as an alternative to widely used docking and scor-
ing approach, some other in silico methods based on the struc-
tures of ligands and the relevant proteins are also proposed for
the fast prediction of the binding affinity with some success (e.g.
Hi-PLS34 and novel geometrical descriptors-based method35,36).
These methods often use the molecular descriptors calculated
from the structure of the ligand and protein as the inputs, and
then, use some modeling methods to develop predictive models
for binding affinity. Compared with docking and scoring meth-
ods, these methods showed some obvious advantages such as
easy implementation, fast prediction process, and good predic-
tive ability and could be used as a fast filter in the virtual
screening of large chemical database.
The recently published in silico methods, together with their correlation coefficients (R), standard deviations (SD) or root mean square errors (RMSE), and descriptors, are listed in Table 1 to give an overview. Where several predictive models were available for a method, the results listed in the Table are those based on the largest dataset.
As can be seen from Table 1, these methods still need some
improvement of the prediction accuracy and generalization abil-
ity. First, most methods used only part of the large structure-
diverse dataset, and their testing results are not very satisfactory.
Many of the methods mentioned earlier used the refined set
[2003 release, containing 800 protein-ligand complexes (PLCs)]
of PDBbind,37,38 including the methods of PMF, ChemScore,
PLP, LUDI, GOLD, and X-Score, Hi-PLS, etc.16,34,39 Second,
some prediction models gave very good accuracy, but they were
based on only a relatively small data set. For instance, the work by Zhang et al.35 used a data set containing 264 PLCs with binding affinities (pKd) and yielded a best R²ext of 0.83. Despite this high prediction accuracy, the method was based on a relatively small dataset and was somewhat complicated, with more than 1000 models being built and filtered. Therefore, the limited accuracy and the complexity of current methods prevent their wide application in real virtual screening of large chemical databases. There is still a great need to improve the present scoring functions and to develop new general binding affinity prediction methods with good predictive ability that cover more of the ligand-protein complex space.
Inspired by the success of binding affinity prediction methods based on molecular descriptors of ligand-protein complexes, and by their easy implementation, we propose a new method for fast binding affinity prediction that introduces new molecular descriptors, a new model development procedure, and model validation. To overcome the low accuracy and high complexity of current methods, we present an accurate and concise modeling method for the binding affinity prediction of PLCs based on a general, structurally diverse data set (release 2007 of the PDBbind database refined set, including 1300 PLCs). It should also be pointed out that the models developed in our study were carefully validated, a step that is often ignored in many similar studies. The main process of our prediction method is
shown in Figure 1. As can be seen from the flow chart, our method mainly includes the following steps: the data set is collected from the PDBbind database refined set; three blocks of descriptors covering protein sequence, ligand structure, and binding pocket information are used to describe each complex; all the data are split into representative training and test sets; the descriptors relevant to the binding affinity are selected using the ReliefF method combined with the least squares support vector machines (LS-SVMs) modeling method on the training data set; the selected descriptors are used to build predictive models with the LS-SVMs method; and the models are validated internally by leave-one-out (LOO) crossvalidation and Y-scrambling on the training set, and externally by the use of the rationally selected test set.
Materials and Methods
Data Sets
All PLC information was collected from the 2007 release of the PDBbind database.37 The PDBbind database is a collection of experimentally measured binding affinities exclusively for the PLCs available in the Protein Data Bank,40 providing both binding affinities and known 3D structures. The current release contains 3214 PLCs, of which 1300 were selected to form the "refined set" in consideration of the quality of the structures and binding affinity data; this set is compiled particularly for docking/scoring and binding affinity prediction studies. The resolution of each PLC crystal structure is 2.5 Å or better.38,41
To study the comprehensive information of the PLCs, the whole refined set was used in this work without any additional filtering of PLCs, unlike other works.34 Within the 1300 PLCs in the refined set, 493 have binding affinities reported as dissociation constants (Kd) and 807 as inhibition constants (Ki). We used the negative logarithms of the Kd and Ki values in this study (pKd and pKi). The pKd and pKi values range from 0.5 to 13.96, spanning over 13 orders of magnitude, with a mean of 6.39 and an SD of 2.16. Two binding affinity prediction models were developed based on the pKd and pKi values, respectively.
Descriptors Generation
The PLCs were described by three blocks of descriptors:
sequence information of protein, ligand structural information,
and binding pocket structural information.
Block1: Descriptors Based on Protein Sequences
The PDB files of proteins were converted into the amino acid
sequence files with FASTA format by the script of Python lan-
guage.42 Then, the structural and physicochemical features of
proteins were computed from amino acid sequence using the
PROFEAT43 method. Seven types of descriptors were generated,
including (a) amino acid composition, dipeptide composition,44
(b) normalized Moreau-Broto autocorrelation,40,45 (c) Moran
autocorrelation,46 (d) Geary autocorrelation,47 (e) sequence-order-coupling number, quasi sequence-order descriptors,48 (f)
pseudo amino acid composition (λ = 30),49 and (g) the composition, transition, and distribution of various structural and physicochemical properties.50,51 In total, 1497 descriptors were
calculated in this block.
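The simplest of these block-1 descriptors can be illustrated directly. The sketch below is illustrative only (it is not the PROFEAT implementation): it computes the 20 amino acid composition fractions and the 400 dipeptide composition fractions from a sequence string.

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(seq):
    """Fraction of each amino acid in the sequence (20 descriptors)."""
    n = len(seq)
    return {a: seq.count(a) / n for a in AA}

def dipeptide_composition(seq):
    """Fraction of each ordered amino-acid pair (400 descriptors)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return {a + b: pairs.count(a + b) / n for a, b in product(AA, repeat=2)}

comp = aa_composition("MKTAYIAKQR")   # toy sequence, not from the dataset
# comp["K"] is 0.2: 2 lysines out of 10 residues
```

Descriptor families (b)-(g) build on the same sequence, adding physicochemical property scales and positional correlation.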
Block2: Descriptors of Ligand
All the ligands taken from the PDBbind database were prechecked for missing atom types, and hydrogen atoms were added using the HYPERCHEM52 program. A total of 1664
molecular descriptors were calculated using the DRAGON soft-
ware.53,54
Block3: Descriptors of Binding Pocket
The binding pocket structures of the PLCs from the PDBbind database first had hydrogen atoms added and were then minimized using the Tripos force field in Sybyl 6.9.55 Then seven types of
descriptors were calculated, including CPSA (charged partial
Table 1. Overview of Recently Published in silico Methods for the Prediction of Protein-Ligand
Binding Affinities.
Method | Dataset size | Q2 of training set | R of test set | Error | Descriptor type or method description | Ref.
VALIDATE 51 – 0.900 – Empirical scoring function 35
NGD 61 0.71 0.775 – Distance-dependent atom type descriptors 36
DFIRE 100 – 0.630 – 19 atom types and a distance-scaled finite ideal-gas reference (DFIRE) state 13
NGD 105 0.60 0.800 – Distance-dependent atom type descriptors 36
SMoG96 120 – 0.648 – Empirical scoring function 35
ENTess 264 0.66 0.911 – Chemical geometrical descriptors and Pauling atomic electronegativities 35
BLEEP 351 – 0.728 – Knowledge-based scoring function 35
Hi-PLS 612 0.27 0.458 1.95a Principal components of 4 blocks of structural descriptors 34
GOLD::GoldScore 694 – 0.285 2.16b Empirical scoring function 37
Cerius2::LigScore 717 – 0.406 2.00b Empirical scoring function 37
SMoG2001 725 – 0.660 – Empirical scoring function 35
Sybyl::F-Score 732 – 0.141 2.19b Empirical scoring function 37
GOLD::ChemScore 741 – 0.423 2.00b Empirical scoring function 37
GOLD::ChemScore_opt 762 – 0.449 1.96b Empirical scoring function 37
GOLD::GoldScore_opt 772 – 0.365 2.06b Empirical scoring function 37
Sybyl::PMF-Score 785 – 0.147 2.16b Knowledge-based scoring function 37
Cerius2::LUDI1 790 – 0.334 2.08b Empirical scoring function 37
Cerius2::PMF 795 – 0.253 2.13b Knowledge-based scoring function 37
Sybyl::ChemScore 797 – 0.499 1.91b Empirical scoring function 37
Cerius2::LUDI2 799 – 0.379 2.04b Empirical scoring function 37
X-Score::HPScore 800 – 0.514 1.89b Empirical scoring function 37
X-Score::HMScore 800 – 0.566 1.82b Empirical scoring function 37
X-Score::HSScore 800 – 0.506 1.90b Empirical scoring function 37
DrugScore::Pair 800 – 0.473 1.94b Knowledge-based scoring function 37
DrugScore::Surf 800 – 0.463 1.95b Knowledge-based scoring function 37
DrugScore::Pair/Surf 800 – 0.476 1.94b Knowledge-based scoring function 37
Sybyl::D-Score 800 – 0.322 2.09b Empirical scoring function 37
Sybyl::G-Score 800 – 0.443 1.98b Empirical scoring function 37
Cerius2::PLP1 800 – 0.458 1.96b Empirical scoring function 37
Cerius2::PLP2 800 – 0.455 1.96b Empirical scoring function 37
Cerius2::LUDI3 800 – 0.331 2.08b Empirical scoring function 37
HINT 800 – 0.330 2.08b Empirical scoring function 37
X-Score::HMScore + DrugScore::Pair 800 – 0.573 – Consensus scoring function 37
X-Score::HMScore + Sybyl::ChemScore 800 – 0.586 – Consensus scoring function 37
X-Score::HMScore + Cerius2::PLP2 800 – 0.573 – Consensus scoring function 37
DrugScore::Pair + Sybyl::ChemScore 800 – 0.520 – Consensus scoring function 37
DrugScore::Pair + Cerius2::PLP2 800 – 0.476 – Consensus scoring function 37
Sybyl::ChemScore + Cerius2::PLP2 800 – 0.521 – Consensus scoring function 37
a Root mean square error (RMSE) of the predictive result of the test set.
b Standard deviation (SD) between the experimental and predicted values of the test set.
surface area), Volsurf, ZAPSOLVATION, ZAPDESCRIPTORS,
FINGERPRINT, MOLPROP_VOLUME, and MOL_WEIGHT. In
this block, 125 descriptors of the binding sites were generated.
Dataset Division
After the calculation of the descriptors, all the original data were split into representative training and test sets using the Kennard-Stone (KS) method56 with full consideration of model validation. In this division, the composition of the training and test sets is of crucial importance. A rational split must guarantee that the training and test sets are scattered over the whole area occupied by representative points in the descriptor space, and that the training set is distributed over the area occupied by the representative points of the whole data set.54 The KS method divides the experimental region uniformly, usually using the Euclidean distance to select samples (candidate objects) into the calibration space (training set). In the absence of strong irregularities in the factor space, the procedure starts by selecting a set of points close to those selected by the D-optimal method,57 i.e., on the borderline of the data set (plus the center point, if this is chosen as the starting point). It then proceeds to fill up the calibration space. It is a uniform mapping algorithm and yields a flat distribution of the data, which is preferable for a regression model.58 After data division, there are 394 and 645 samples in the training sets of the pKd and pKi data, and 99 and 162 samples in the corresponding test sets, respectively.
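The selection logic of the KS method described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the split starts from the two most distant samples, then each step adds the candidate whose distance to its nearest already-selected point is largest, so the training set covers the descriptor space uniformly.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone
    algorithm: seed with the two mutually most distant points, then
    repeatedly add the candidate farthest from the selected set."""
    X = np.asarray(X, dtype=float)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # for each candidate, distance to its nearest selected point
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(min_d))))
    return sorted(selected)

# toy data: 10 points in 2-D, 6 chosen for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
train_idx = kennard_stone(X, 6)
test_idx = [k for k in range(10) if k not in train_idx]
```

The O(n²) distance matrix is fine at this scale; for the full 1300-complex set a chunked distance computation would be preferable.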
Descriptor Selection
Once all three blocks of 3286 descriptors were generated, they were first filtered to remove redundant variables whose pairwise correlation coefficients were higher than 0.9. This left 1694 descriptors: 1099 in block 1, 559 in block 2, and 36 in block 3.
Using the aforementioned training set, the ReliefF method59 combined with the modeling method was used to extract the crucial descriptors from these 1694 descriptors. Here, each descriptor was regarded as a feature. The ReliefF algorithm assigns a "relevance" weight to each feature, which denotes the relevance of the feature to the target concept. It evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and different classes (the data can be either discrete or continuous).60 The number of descriptors selected to build the final model was based on the best LOO crossvalidation (Q2) results: models (using the LS-SVMs method in this study) were trained with different numbers of top-ranked descriptors under default parameters, and the set that gave the most favorable Q2 was selected for further analysis.61
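For a continuous response such as pKd, the Relief idea sketched above can be written compactly. The following is a simplified illustration only, not the exact RReliefF algorithm of Robnik-Sikonja and Kononenko used via ref. 59: a feature is rewarded when its differences co-occur with differences in the response among near neighbors, and penalized when it differs while the response does not.

```python
import numpy as np

def rrelieff_scores(X, y, n_neighbors=10):
    """Simplified RReliefF-style relevance weights for a continuous
    response (an illustrative sketch of the idea)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    span = np.ptp(X, axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span          # features scaled to [0, 1]
    y_span = np.ptp(y) if np.ptp(y) > 0 else 1.0
    n_dc = 0.0                               # sum of response differences
    n_da = np.zeros(p)                       # sums of feature differences
    n_dcda = np.zeros(p)                     # joint sums
    m = 0
    for i in range(n):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf
        for k in np.argsort(d)[:n_neighbors]:
            dy = abs(y[k] - y[i]) / y_span
            dx = np.abs(Xs[k] - Xs[i])
            n_dc += dy
            n_da += dx
            n_dcda += dy * dx
            m += 1
    # relevance: P(feature differs | y differs) - P(feature differs | y same)
    return n_dcda / n_dc - (n_da - n_dcda) / (m - n_dc)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 is relevant
scores = rrelieff_scores(X, y)                        # feature 0 should rank first
```

Ranking the 1694 descriptors by such scores and retraining with increasing numbers of top-ranked features reproduces the Q2-based selection loop described above.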
LS-SVMs Modeling
LS-SVMs62 are reformulations of the standard SVMs,63,64 which result in a set of linear equations instead of the quadratic programming problem of SVMs. This algorithm was introduced in this study as the modeling tool because of its good performance and easy implementation for regression problems, as proved in our previous works.65–68 The theory of LS-SVMs and the differences between LS-SVMs and SVMs were described in detail in our previous works.65–68 Here, we briefly describe only the main idea of LS-SVMs for function estimation.
In principle, LS-SVMs fit a linear relation $y = wx + b$ between the regressor ($x$) and the dependent variable ($y$). The best relation is obtained by minimizing a cost function ($Q$) containing a penalized regression error term:

\[ Q_{\text{LS-SVM}} = \frac{1}{2} w^T w + \gamma \sum_{k=1}^{N} e_k^2 \tag{1} \]

subject to:

\[ y_k = w^T \varphi(x_k) + b + e_k, \qquad k = 1, \ldots, N \tag{2} \]

where $\varphi: \mathbb{R}^n \to \mathbb{R}^m$ is the feature map from the input space to a usually high-dimensional feature space, $\gamma$ is the relative weight of the error term, and $e_k$ are error variables that take noisy data into account and avoid poor generalization.
LS-SVMs treat this as a constrained optimization problem and use a Lagrange function to solve it. By solving the Lagrangian of eq. (1), the weight coefficients ($w$) can be written as:

\[ w = \sum_{k=1}^{N} \alpha_k x_k \quad \text{with} \quad \alpha_k = 2\gamma e_k \tag{3} \]

Substituting (3) into the original regression line $y = wx + b$ yields:

\[ y = \sum_{k=1}^{N} \alpha_k x_k^T x + b \tag{4} \]

It can be seen that the Lagrange multipliers can be defined as:

\[ \alpha_k = \left( x_k^T x + (2\gamma)^{-1} \right)^{-1} (y_k - b) \tag{5} \]
Figure 1. The flow chart of the proposed method for predicting
binding affinity.
Finding these Lagrange multipliers is very simple compared with the SVM approach, in which a more difficult relation has to be solved to obtain these values. In addition, the linear approach is easily extended to nonlinear regression by introducing a kernel function, which leads to the following nonlinear regression function:

\[ f(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b \tag{6} \]

where $K(x, x_k)$ is the kernel function, whose value equals the inner product of the two vectors $x$ and $x_k$ in the feature space, that is, $K(x, x_k) = \varphi(x)^T \varphi(x_k)$. The choice of kernel and its specific parameters, together with $\gamma$, have to be tuned by the user. The radial basis function (RBF) kernel $K(x, x_k) = \exp(-\lVert x - x_k \rVert^2 / \sigma^2)$ is commonly used, and LOO crossvalidation was used to tune the optimal values of the two parameters $\gamma$ (the relative weight of the regression error) and $\sigma$ (the kernel parameter of the RBF kernel). Here, the optimal parameters are found by an intensive grid search. The result of this grid search is an error surface spanned by the model parameters; a robust model is obtained by selecting the parameters that give the lowest error in a smooth area.
The RMSE was used as the error function; it is computed according to the following equation:

\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \tag{7} \]

where $y_i$ and $\hat{y}_i$ are the experimental and calculated response values of the $i$-th object, respectively, and $n$ is the number of samples.
All calculations implementing LS-SVMs were performed
using the Matlab/C toolbox.69
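The calculations in this paper were done in Matlab, but eqs. (1)-(6) can be illustrated numerically in a few lines: with an RBF kernel, training an LS-SVM reduces to solving one linear system in the dual variables α and the bias b. The sketch below is a self-contained illustration (the γ and σ² values in the toy fit are arbitrary; the paper tunes them by grid search and LOO crossvalidation).

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(x, x_k) = exp(-||x - x_k||^2 / sigma^2), as in eq. (6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the LS-SVM dual: a linear system replaces the SVM QP.
        [ 0   1^T             ] [b]       [0]
        [ 1   K + I/(2*gamma) ] [alpha] = [y]
    (the 1/(2*gamma) diagonal follows from alpha_k = 2*gamma*e_k)."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / (2.0 * gamma)
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                      # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma2):
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# toy regression: recover a sine curve from noisy samples
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=80)
alpha, b = lssvm_fit(X, y, gamma=10.0, sigma2=1.0)
rmse = np.sqrt(np.mean((lssvm_predict(X, alpha, b, X, sigma2=1.0) - y) ** 2))
```

Note that the first row of the system enforces the constraint Σ α_k = 0; the regularization 1/(2γ) on the kernel diagonal is what distinguishes this from plain kernel interpolation.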
Internal and External Validation of Model
Once the models were built, the prediction results on the training set alone were not enough to prove their predictive ability. Therefore, two further methods were used to evaluate the stability and robustness of the models.
The first was internal validation based on the training set only, mainly comprising the LOO crossvalidation correlation coefficient (Q2) of the model and permutation testing (Y-scrambling).70 The LOO crossvalidation
procedure consists of removing one sample from the training set
and constructing the model only on the basis of the remaining
training data and then testing on the removed sample. In this
fashion, all of the training data samples were tested, and Q2 was
calculated. Another internal validation was performed by permu-
tation testing: new models were recalculated for randomly reor-
dered responses. The resulting models obtained on the data set
with randomized response should have significantly lower Q2
values than the proposed ones because the relationship between
the structure and response is broken. This is proof of the
proposed model’s validity, because it can be reasonably con-
cluded that the originally proposed model was not obtained by
chance correlation.54,71
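The two internal checks can be sketched together. In the toy example below an ordinary linear model stands in for the LS-SVM (illustrative only, with made-up data): the true LOO Q2 is high, while every Q2 obtained after scrambling y collapses, which is the signature of a model that is not a chance correlation.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out Q2 for a plain linear model (a stand-in for the
    LS-SVM used in the paper): each sample is predicted by a model
    trained on all the others."""
    n = len(y)
    preds = np.empty(n)
    A = np.c_[X, np.ones(n)]                 # design matrix with intercept
    for i in range(n):
        mask = np.arange(n) != i
        w = np.linalg.lstsq(A[mask], y[mask], rcond=None)[0]
        preds[i] = A[i] @ w
    return 1.0 - np.sum((preds - y) ** 2) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_perm=50, seed=0):
    """Permutation test: the best Q2 over models refit on shuffled
    responses should fall well below the true Q2."""
    rng = np.random.default_rng(seed)
    return max(loo_q2(X, rng.permutation(y)) for _ in range(n_perm))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=60)
true_q2 = loo_q2(X, y)
worst_scrambled = y_scramble(X, y)
```

The paper repeats the scrambling 300 times; 50 permutations are used here only to keep the toy run short.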
The second method is external validation based on the test set.68,70 For any predictive model, the most important goal is its use for external prediction. Because the significant descriptors were selected based on the training set only, the test set, which was not involved in model development at all, was used to validate the built model externally. Compared with crossvalidation, external validation provides a more rigorous evaluation of the model's predictive capability for
untested PLCs. The external test set was evaluated by $R$, $Q^2_{\text{ext}}$, and RMSE. Here, $Q^2_{\text{ext}}$ is defined as:

\[ Q^2_{\text{ext}} = 1 - \frac{\sum_{i=1}^{n_{\text{ext}}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{\text{ext}}} (y_i - \bar{y})^2} \tag{8} \]

where the sums run over the test set objects ($n_{\text{ext}}$), $\hat{y}_i$ is the predicted value of the $i$-th object, and $\bar{y}$ is the average value of the training set responses.
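Under these definitions, Q2ext compares the test-set prediction error with the spread of the test responses around the training-set mean, not the test-set mean. A small sketch (the pK values below are made up for illustration):

```python
import numpy as np

def q2_ext(y_test, y_pred, y_train):
    """Eq. (8): 1 - PRESS / sum of squared deviations of the test
    responses from the TRAINING-set mean."""
    y_bar = np.mean(y_train)
    press = np.sum((y_pred - y_test) ** 2)
    return 1.0 - press / np.sum((y_test - y_bar) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error, as in eq. (7)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# made-up pK values: accurate predictions give Q2_ext close to 1
y_train = np.array([4.0, 5.5, 6.0, 7.2, 8.9])
y_test = np.array([5.0, 6.5, 8.0])
y_pred = np.array([5.2, 6.3, 8.1])
```

A perfect predictor gives Q2_ext = 1; a predictor no better than always returning the training mean gives Q2_ext near 0 or below.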
Results and Discussion
The Construction and Internal Validation of Models
After training LS-SVM models with different numbers of top-ranked descriptors from the ReliefF method, we selected the descriptor set that gave the highest Q2 for further analysis. Thereafter, 35 significant features were selected to build each final predictive model. The selected descriptors are listed in Tables 2 and 3.
In this investigation, two LS-SVMs models for binding affinity prediction were built separately, based on the PLCs with pKd data (Kd model for short) and pKi data (Ki model for short). In LS-SVMs model development, only the two parameters σ and γ need to be optimized. The parameter σ (in the form σ2) and the parameter γ were tuned simultaneously on a 20 × 20 grid, each ranging from 0.01 to 100. A contour plot of the optimization error can then be visualized easily (Figs. 2 and 3), and the optimal parameter settings can be selected from a smooth subarea with a low prediction error. The selected optimal values of γ and σ2 are 1.833 and 23.357 for the Kd model, marked by the magenta square in Figure 2, and 1.833 and 11.288 for the Ki model, marked by the magenta square in Figure 3.
The internal validation and prediction results of the Kd and Ki models are shown in Table 4, and the predicted versus experimental values are plotted in Figures 4 and 5. The Q2 is 0.586 for the Kd model and 0.500 for the Ki model, which shows the stability of the models on the PDBbind refined dataset. The corresponding Q2 value reported by Lindstrom et al.34 was only 0.220.
In addition, Y-randomization was applied to exclude the possibility of chance correlation, i.e., fortuitous correlation without any predictive ability. The results of the random models, obtained using a scrambled order of the experimental binding constants and repeated 300 times, were significantly lower than those of the original models. The highest Q2 values among the 300 randomized models are shown in Table 4. This also indicates the statistical reliability of our predictive models.
The External Validation of Models
Ninety-nine Kd data and 162 Ki data, not used during model development, were used as the external validation datasets. The correlation coefficients Rext for the Kd and Ki test sets are 0.833 and 0.742, respectively. The RMSE and Q2ext values are also shown in Table 4. The external validation results are better than those in the references based on the same kind of dataset.16,34,39
Besides, to further prove the generalization ability of our models, an overall 10-fold crossvalidation on the whole datasets was also performed. Because the Q2 result is not stable for a single n-fold crossvalidation, we repeated this procedure 10 times and obtained average Q2 values of 0.569 and 0.496 for the Kd and Ki datasets. These results are consistent with the LOO
Table 2. Significant Descriptors Selected to Build the Kd Model.
No. Name Block Content Type
1 AROM B2 Aromaticity index Geometrical descriptors
2 FNSA1 B3 Partial negative surface area/total molecular surface area CPSA
3 FPSA2 B3 Total charge weighted PNSA(Partial negative surface area)/total
molecular surface area
CPSA
4 SizPatc2 B3 The size of the smallest patches of potential as a fraction of the total
surface area
ZAPDESCRIPTORS
5 C-011 B2 CR3X Atom-centred fragments
6 nSO2N B2 Number of sulfonamides (thio-/dithio-) Functional group counts
7 S-110 B2 R-SO2-R Atom-centred fragments
8 Ms B2 Mean electrotopological state Constitutional descriptors
9 MAXDN B2 Maximal electrotopological negative variation Topological descriptors
10 GATS1p B2 Geary autocorrelation—lag 1/weighted by atomic Sanderson
electronegativities
2D autocorrelations
11 GATS1m B2 Geary autocorrelation—lag 1/weighted by atomic masses 2D autocorrelations
12 SizPatc1 B3 The size of the smallest patches of potential as a fraction of the total
surface area
ZAPDESCRIPTORS
13 JGI9 B2 Mean topological charge index of order9 Topological charge indices
14 Ui B2 Unsaturation index Molecular properties
15 GATS3p B2 Geary autocorrelation—lag 3/weighted by atomic polarizabilities 2D autocorrelations
16 FINGERPRINT B3 A collection of binary variables, which have the value of 1 if the
compound contains a particular fragment, say phenyl or carbonyl, and
0 otherwise.
FINGERPRINT
17 FESA3_1 B3 Fractional surface Areas: surface of potential greater than 11 kT ZAPDESCRIPTORS
18 GATS1e B2 Geary autocorrelation—lag 1/weighted by atomic Sanderson
electronegativities
2D autocorrelations
19 C-007 B2 CH2X2 Atom-centred fragments
20 R4e1 B2 R maximal autocorrelation of lag 4/weighted by atomic Sanderson
electronegativities
GETAWAY descriptors
21 X0A B2 Average connectivity index chi-0 Connectivity indices
22 MAXDP B2 Maximal electrotopological positive variation Topological descriptors
23 RARS B2 R matrix average row sum GETAWAY descriptors
24 PPSA1 B3 Partial positive surface area CPSA
25 R6e B2 R autocorrelation of lag 6/weighted by atomic Sanderson
electronegativities
GETAWAY descriptors
26 BLTD48 B2 Verhaar model of Daphnia base-line toxicity from MLOGP (mmol/l) Molecular properties
27 C-024 B2 R--CH--R Atom-centred fragments
28 GATS3m B2 Geary autocorrelation—lag 3/weighted by atomic masses 2D autocorrelations
29 ARR B2 Aromatic ratio Constitutional descriptors
30 R1p B2 R autocorrelation of lag 1/weighted by atomic polarizabilities GETAWAY descriptors
31 LogP B3 Mean of a linear equation derived by fitting VolSurf descriptor to
experimental data on water/octanol partition coefficient
Volsurf
32 nCrt B2 Number of ring tertiary C(sp3) Functional group counts
33 GVWAI-80 B2 Ghose-Viswanadhan-Wendoloski drug-like index at 80% Molecular properties
34 Mor19m B2 3D-MoRSE—signal 19 / weighted by atomic masses 3D-MoRSE descriptors
35 HATS2e B2 Leverage-weighted autocorrelation of lag 2 / weighted by atomic
Sanderson electronegativities
GETAWAY descriptors
Q2 of internal validation, which also suggests that our models
are robust and stable.
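The repeated n-fold check described above can be sketched as follows; a toy linear model again stands in for the LS-SVM (illustrative only). The split-dependence of a single k-fold Q2 is averaged away by repeating the random split.

```python
import numpy as np

def repeated_kfold_q2(X, y, fit, predict, k=10, repeats=10, seed=0):
    """Average crossvalidated Q2 over `repeats` random k-fold splits."""
    rng = np.random.default_rng(seed)
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    q2s = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        press = 0.0
        for fold in np.array_split(order, k):
            train = np.setdiff1d(order, fold)
            model = fit(X[train], y[train])
            press += np.sum((predict(model, X[fold]) - y[fold]) ** 2)
        q2s.append(1.0 - press / tss)
    return float(np.mean(q2s))

# toy stand-in model: ordinary least squares with an intercept
fit = lambda X, y: np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]
predict = lambda w, X: np.c_[X, np.ones(len(X))] @ w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
avg_q2 = repeated_kfold_q2(X, y, fit, predict)
```

Agreement between the averaged k-fold Q2 and the LOO Q2, as reported above for the Kd and Ki models, is the stability evidence this check provides.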
Exploration of the Selected Descriptors
As shown in Tables 2 and 3, 35 significant descriptors were selected separately for the Kd model and the Ki model. The order in the tables is according to their importance under the ReliefF ranking. By analyzing these 35 descriptors, some interesting information about protein-ligand interactions can be inferred. It should also be pointed out that it is difficult to give a full explanation of all the descriptors involved in the models, and that similar reported works gave little detailed analysis of the descriptors involved in model development.
First, electrostatic interaction proves to be important for binding affinity in our models: more than one-third of the descriptors are electrostatics-related features. There
are 15 such descriptors in Kd model (No. 2, 8, 9, 10, 13, 18, 20,
22, 25, 35 in block2 and No. 3, 4, 12, 17, 24 in block3) and 12
such descriptors in Ki model (No. 3, 7, 9, 11, 12, 15, 17 in
block2 and No. 6, 10, 21, 32, 34 in block3). The information
derived from our models indicated that the electrostatic interac-
tion was very important in describing the binding affinity
between the ligand and protein.
Second, the hydrophobic effect also proves to be important for protein-ligand binding. Interestingly, the two models take their hydrophobic descriptors from different blocks. Both models select descriptors relevant to logP; however, in the Kd model,
Table 3. Significant Descriptors Selected to Build the Ki Model.

No. | Name | Block | Content | Type
1 | AROM | B2 | Aromaticity index | Geometrical descriptors
2 | piPC10 | B2 | Molecular multiple path count of order 10 | Walk and path counts
3 | JGI8 | B2 | Mean topological charge index of order 8 | Topological charge indices
4 | SHP2 | B2 | Average shape profile index of order 2 | Randic molecular profiles
5 | Ui | B2 | Unsaturation index | Molecular properties
6 | SizPatc1 | B3 | Size of the largest patches of potential as a fraction of the total surface area | ZAP descriptors
7 | JGI9 | B2 | Mean topological charge index of order 9 | Topological charge indices
8 | AMR | B2 | Ghose-Crippen molar refractivity | Molecular properties
9 | JGI7 | B2 | Mean topological charge index of order 7 | Topological charge indices
10 | SizPatc2 | B3 | Size of the smallest patches of potential as a fraction of the total surface area | ZAP descriptors
11 | JGI10 | B2 | Mean topological charge index of order 10 | Topological charge indices
12 | HATS5e | B2 | Leverage-weighted autocorrelation of lag 5 / weighted by atomic Sanderson electronegativities | GETAWAY descriptors
13 | R1p | B2 | R autocorrelation of lag 1 / weighted by atomic polarizabilities | GETAWAY descriptors
14 | H-047 | B2 | H attached to C1(sp3)/C0(sp2) | Atom-centred fragments
15 | H2e | B2 | H autocorrelation of lag 2 / weighted by atomic Sanderson electronegativities | GETAWAY descriptors
16 | S-107 | B2 | R2S / RS-SR | Atom-centred fragments
17 | MAXDP | B2 | Maximal electrotopological positive variation | Topological descriptors
18 | ALOGP | B2 | Ghose-Crippen octanol-water partition coefficient (logP) | Molecular properties
19 | Mor05p | B2 | 3D-MoRSE signal 05 / weighted by atomic polarizabilities | 3D-MoRSE descriptors
20 | STN | B2 | Spanning tree number (log) | Topological descriptors
21 | NumPatc | B3 | Number of charged patches | ZAP descriptors
22 | nOHp | B2 | Number of primary alcohols | Functional group counts
23 | O-059 | B2 | Al-O-Al | Atom-centred fragments
24 | Neoplastic-80 | B2 | Ghose-Viswanadhan-Wendoloski antineoplastic-like index at 80% | Molecular properties
25 | R1u | B2 | R autocorrelation of lag 1 / unweighted | GETAWAY descriptors
26 | FDI | B2 | Folding degree index | Geometrical descriptors
27 | C-043 | B2 | X--CR..X | Atom-centred fragments
28 | N-074 | B2 | Ar-NAl2 | Atom-centred fragments
29 | H-048 | B2 | H attached to C2(sp3)/C1(sp2)/C0(sp) | Atom-centred fragments
30 | Mor03p | B2 | 3D-MoRSE signal 03 / weighted by atomic polarizabilities | 3D-MoRSE descriptors
31 | FINGERPRINT | B3 | A collection of binary variables with value 1 if the compound contains a particular fragment (e.g. phenyl or carbonyl) and 0 otherwise | FINGERPRINT
32 | FNSA3 | B3 | Atomic charge weighted PNSA / total molecular surface area | CPSA
33 | nR06 | B2 | Number of 6-membered rings | Constitutional descriptors
34 | FNSA2 | B3 | Total charge weighted PNSA / total molecular surface area | CPSA
35 | Mor13m | B2 | 3D-MoRSE signal 13 / weighted by atomic masses | 3D-MoRSE descriptors
906 Li et al. • Vol. 30, No. 6 • Journal of Computational Chemistry
Journal of Computational Chemistry DOI 10.1002/jcc
the logP descriptor (No. 31 in Table 2) was selected from block3, implying that the hydrophobic character of the binding pocket, rather than that of the ligand, dominates Kd. In the Ki model, by contrast, the ALOGP descriptor (No. 18 in Table 3) was selected from block2, indicating that the hydrophobic character of the ligand, rather than that of the binding pocket, matters more for the Ki value. The two predictive models are otherwise very similar: they share five identical descriptors (AROM, Ui, JGI9, R1p, and MAXDP) and 14 descriptor types (11 in block2 and three in block3). The remaining small difference in how they treat the hydrophobic effect may also explain why earlier binding affinity studies, e.g. the work by Lindstrom et al.,34 which used Kd and Ki as a single prediction target, could not achieve satisfactory results.
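Note that the modeled quantities (pKd and pKi, as plotted in Figures 4 and 5) are the negative base-10 logarithms of the measured constants; the conversion is one line (a trivial helper, named here only for illustration):

```python
import math

def p_affinity(k_molar):
    """Convert a measured Kd or Ki (in mol/L) to the modeled pKd/pKi scale."""
    return -math.log10(k_molar)

# A 2 nM dissociation constant corresponds to pKd = -log10(2e-9)
print(round(p_affinity(2e-9), 2))  # 8.7
```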
Third, among the geometrical descriptors, the aromaticity of the ligand is the most important for the binding between ligand and protein. Several earlier works predicted protein-ligand binding affinity using geometrical descriptors.35,36 In both the Kd and Ki models, the aromaticity index of the ligand, a geometrical descriptor, ranks first among the features influencing the binding affinity. The aromaticity index is calculated as 1 − [dearomatization term due to bond length alternation (GEO) + energetic term (EN)].72 Analyzing all the ligands in the two models, we found that most of them contain an aromatic or heteroaromatic ring. These substructures often appear in compounds whose activity correlates with H-bonding or Coulombic interactions, and the descriptor can also capture π-π stacking between ligand and protein. All of these properties are important for protein-ligand binding and are useful in the drug design and discovery process. Many available drugs already contain such substructures, for example sulphonamides (Sulfamide), antibiotics (Nitrofural, Metronidazole, Tinidazole,
Niridazole), antagonists of e.g. morphine (Amiphenazol, Daptazol), fungicides (Benomyl), and others.73 This information suggests that the aromatic character of a ligand can be a key determinant of the binding affinity between ligand and protein.

Figure 2. Contour plot of the errors for LS-SVM when optimizing the parameters σ and γ for the Kd model. The magenta square indicates the selected optimal settings.

Figure 3. Contour plot of the errors for LS-SVM when optimizing the parameters σ and γ for the Ki model. The magenta square indicates the selected optimal settings.

Table 4. Statistical Results of the Models and Their Validations.

Model | Rtrain | RMSE | Q2 | Q2 (Y-scrambling) | Rext | RMSEext | Q2ext | Q2cv (10-fold)
Kd model | 0.890 | 0.975 | 0.586 | 0.023 | 0.833 | 1.118 | 0.700 | 0.569
Ki model | 0.922 | 0.953 | 0.500 | 0.014 | 0.742 | 1.471 | 0.537 | 0.496

Q2 (Y-scrambling) is the highest value obtained after 300 rounds of Y scrambling.

Figure 4. Predicted pKd values versus experimental values for the Kd model.
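The aromaticity index discussed above combines an energetic term with a geometric term that penalizes bond-length alternation. As a rough illustration of the geometric part, the HOMA model of Krygowski and Cyranski72 scores a ring by the deviation of its bond lengths from an optimal aromatic value; the sketch below uses the standard HOMA parameters for CC bonds (Ropt = 1.388 Å, α = 257.7) and is only an illustrative sketch of the idea, not the DRAGON implementation behind the AROM descriptor.

```python
def homa(bond_lengths_angstrom, r_opt=1.388, alpha=257.7):
    """Harmonic Oscillator Model of Aromaticity for a carbocyclic ring:
    1 minus a normalized penalty on deviations from the optimal bond length."""
    n = len(bond_lengths_angstrom)
    return 1.0 - (alpha / n) * sum((r_opt - r) ** 2 for r in bond_lengths_angstrom)

# Benzene (six equal CC bonds of ~1.397 A) is almost fully aromatic,
# while a bond-alternating "cyclohexatriene" geometry is strongly penalized.
print(round(homa([1.397] * 6), 3))       # 0.979
print(round(homa([1.33, 1.47] * 3), 3))  # -0.3
```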
Finally, several atom-centred fragment descriptors, which belong to the constitutional descriptors, may correlate with the H-bond formation process. These descriptors (No. 7 in block2 of the Kd model and No. 23 and 28 in block2 of the Ki model) encode information on molecular fragments containing heteroatoms.
In summary, the selected descriptors show that electrostatic, hydrophobic, H-bond donor or acceptor, and aromaticity-related features, together with some atom-centred fragment descriptors, are the structural features most relevant to the binding affinity. The information they carry is consistent with the results of other similar studies.
Conclusion
In this study, a novel method for the prediction of the protein-
ligand affinities (Kd and Ki) based on a comprehensive set of
1300 structurally diverse PLCs was developed. Three blocks of
molecular descriptors were used to describe the PLCs. The
whole data were split into the representative training and test
sets by full consideration of the model validation. The ReliefF
algorithm combined with the modeling method was used to
select the significant descriptors from the training set samples
and LS-SVMs were utilized to build the models to predict the
binding affinities of Kd and Ki. The models were accurate, robust, and stable, as indicated by careful internal and external validation. Our method outperformed other similar published works in terms of the balance between predictive ability and model complexity. Because of this satisfactory performance, the predictive models can be used as a fast filter in the rapid virtual screening of large chemical databases.
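The validation statistics reported in Table 4 (R, RMSE, Q²) follow standard definitions; the following is a generic sketch of those quantities, plus a simplified Y-scrambling check. Note the simplification: a full Y-scrambling retrains the model on every permuted response vector, whereas this stand-in only permutes the responses against fixed predictions.

```python
import math
import random

def r_rmse_q2(y_true, y_pred):
    """Pearson correlation R, RMSE, and Q2 = 1 - SS(residual)/SS(about the mean)."""
    n = len(y_true)
    my = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - my) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return cov / math.sqrt(var_t * var_p), math.sqrt(ss_res / n), 1.0 - ss_res / var_t

def best_scrambled_q2(y_true, y_pred, rounds=300, seed=0):
    """Simplified Y-scrambling: permute the responses and keep the best Q2 seen.
    (A full Y-scrambling would retrain the model on each permuted response.)"""
    rng = random.Random(seed)
    ys = list(y_true)
    best = -float("inf")
    for _ in range(rounds):
        rng.shuffle(ys)
        best = max(best, r_rmse_q2(ys, y_pred)[2])
    return best
```

A sound model should score a Q² far above the best value obtained over all scrambled rounds, as the 0.586 versus 0.023 gap for the Kd model illustrates.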
Furthermore, several important descriptors related to the
binding affinity were analyzed. These descriptors can give some
insight into structural features related to the binding affinity
between the ligand and protein.
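For readers implementing the approach, the LS-SVM regression core reduces to solving a single linear system (Suykens et al.62). The sketch below, with an RBF kernel of width σ and regularization parameter γ, is a minimal NumPy illustration of the standard formulation, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between two sets of row vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, sigma, gamma):
    """LS-SVM regression: solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.asarray(y, dtype=float))))
    return sol[1:], sol[0]  # dual coefficients alpha, bias b

def lssvm_predict(X_train, alpha, b, sigma, X_new):
    """Predict f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

Model selection then scans a grid of (σ, γ) pairs and picks the combination minimizing the cross-validation error, which is what the contour plots in Figures 2 and 3 visualize.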
References
1. Betz, M.; Saxena, K.; Schwalbe, H. Curr Opin Chem Biol 2006, 10,
219.
2. Diercks, T.; Coles, M.; Kessler, H. Curr Opin Chem Biol 2001, 5,
285.
3. Villar, H. O.; Yan, J.; Hansen, M. R. Curr Opin Chem Biol 2004, 8,
387.
4. D’Amico, S.; Sohier, J. S.; Feller, G. J Mol Biol 2006, 358, 1296.
5. Chavelas, E. A.; Zubillaga, R. A.; Pulido, N. O.; Garcia-Hernandez,
E. Biophys Chem 2006, 120, 10.
6. Wiseman, T.; Williston, S.; Brandts, J. F.; Lin, L.-N. Anal Biochem
1989, 179, 131.
7. Lofas, S. Assay Drug Dev Techn 2004, 2, 407.
8. Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R.; Ferrin,
T. E. J Mol Biol 1982, 161, 269.
9. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. J Mol
Biol 1997, 267, 727.
10. Naïm, M.; Bhat, S.; Rankin, K. N.; Dennis, S.; Chowdhury, S. F.; Siddiqi, I.; Drabik, P.; Sulea, T.; Bayly, C. I.; Jakalian, A.; Purisima, E. O. J Chem Inf Model 2007, 47, 122.
11. Aqvist, J.; Luzhkov, V. B.; Brandsdal, B. O. Acc Chem Res 2002,
35, 358.
12. Gohlke, H.; Hendlich, M.; Klebe, G. J Mol Biol 2000, 295, 337.
13. Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y. J Med Chem 2005, 48,
2325.
14. Imai, T.; Hiraoka, R.; Seto, T.; Kovalenko, A.; Hirata, F. J Phys
Chem B 2007, 111, 11585.
15. Muegge, I.; Martin, Y. C. J Med Chem 1999, 42, 791.
16. Muegge, I. J Med Chem 2006, 49, 5895.
17. Mitchell, J. B. O.; Laskowski, R. A.; Alex, A.; Thornton, J. M.
J Comput Chem 1999, 20, 1165.
18. Nobeli, I.; Mitchell, J. B. O.; Alex, A.; Thornton, J. M. J Comput
Chem 2001, 22, 673.
19. Huang, S.-Y.; Zou, X. J Comput Chem 2006, 27, 1876.
20. Yang, C. Y.; Wang, R.; Wang, S. J Med Chem 2006, 49, 5903.
21. Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.;
Fogel, D. R.; Fogel, L. J.; Freer, S. T. Chem Biol 1995, 2, 317.
22. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. J Mol Biol 1996,
261, 470.
23. Head, R. D.; Smythe, M. L.; Oprea, T. I.; Waller, C. L.; Green, S.
M.; Marshall, G. R. J Am Chem Soc 1996, 118, 3959.
24. Bohm, H. J. J Comput Aided Mol Des 1998, 12, 309.
25. Wang, R.; Liu, L.; Lai, L.; Tang, Y. J Mol Model 1998, 4, 379.
26. Wang, R.; Lai, L.; Wang, S. J Comput-Aided Mol Des 2002, 16, 11.
27. Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee,
R. P. J Comput-Aided Mol Des 1997, 11, 425.
28. DeWitte, R. S.; Shakhnovich, E. I. J Am Chem Soc 1996, 118,
11733.
29. Ishchenko, A. V.; Shakhnovich, E. I. J Med Chem 2002, 45, 2770.
30. Yang, J.-M. J Comput Chem 2004, 25, 843.
31. Chen, H.-M.; Liu, B.-F.; Huang, H.-L.; Hwang, S.-F.; Ho, S.-Y.
J Comput Chem 2007, 28, 612.
32. Brown, S. P.; Muchmore, S. W. J Chem Inf Model 2006, 46, 999.
33. Brown, S. P.; Muchmore, S. W. J Chem Inf Model 2007, 47, 1493.
34. Lindstrom, A.; Pettersson, F.; Almqvist, F.; Berglund, A.; Kihlberg,
J.; Linusson, A. J Chem Inf Model 2006, 46, 1154.
35. Zhang, S.; Golbraikh, A.; Tropsha, A. J Med Chem 2006, 49, 2713.
36. Deng, W.; Breneman, C.; Embrechts, M. J. J Chem Inf Comput Sci
2004, 44, 699.
37. Wang, R.; Fang, X.; Lu, Y.; Wang, S. J Med Chem 2004, 47, 2977.
38. Wang, R.; Fang, X.; Lu, Y.; Yang, C. Y.; Wang, S. J Med Chem
2005, 48, 4111.
Figure 5. Predicted pKi values versus experimental values for the Ki model.
39. Wang, R.; Lu, Y.; Fang, X.; Wang, S. J Chem Inf Comput Sci 2004,
44, 2114.
40. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.;
Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucl Acids Res 2000,
28, 235.
41. Wang, R.; Fang, X.; Lu, Y.; Wang, S. J Med Chem 2004, 47,
2977.
42. Python Programming Language, ver. 2.5.1. Python Software Foundation: Hampton, NH, 2007.
43. Li, Z. R.; Lin, H. H.; Han, L. Y.; Jiang, L.; Chen, X.; Chen, Y. Z.
Nucl Acids Res 2006, 34, W32.
44. Reczko, M.; Karras, D.; Bohr, H. Nucl Acids Res 1997, 25, 235.
45. Lin, Z.; Pan, X. M. J Protein Chem 2001, 20, 217.
46. David, S. H. Biopolymers 1988, 27, 451.
47. Sokal, R. R.; Thomson, B. A. Am J Phys 2006, 129, 121.
48. Chou, K. C. Biochem Biophys Res Commun 2000, 278, 477.
49. Cai, Y. D.; Chou, K. C. J Proteome Res 2005, 4, 967.
50. Han, L. Y.; Zheng, C. J.; Xie, B.; Jia, J.; Ma, X. H.; Zhu, F.;
Lin, H. H.; Chen, X.; Chen, Y. Z. Drug Discov Today 2007, 12,
304.
51. Bock, J. R.; Gough, D. A. Bioinformatics 2001, 17, 455.
52. HyperChem, ver. 7.0. Hypercube. Inc.: Gainesville, FL, 2002.
53. DRAGON for Windows (Software for molecular Descriptor Calcula-
tion), ver. 5.4. Talete srl, 2006. Available at http://www.talete.mi.it.
54. Liu, H.; Papa, E.; Gramatica, P. Chem Res Toxicol 2006, 19, 1540.
55. Sybyl molecular modeling software, ver. 6.9. Tripos Associates,
Inc.: St. Louis, 2002.
56. Kennard, R. W.; Stone, L. A. Technometrics 1969, 11, 137.
57. Marengo, E.; Todeschini, R. Chemom Intell Lab Syst 1992, 16, 37.
58. Maesschalck, R. D.; Estienne, F.; Verdu-Andres, J.; Candolfi, A.;
Centner, V.; Despagne, F.; Jouan-Rimbaud, D.; Walczak, B.; Mas-
sart, D. L.; Jong, S. D.; Noord, O. E. D.; Puel, C.; Vandeginste, B.
M. G. Internet J Chem 1999, 2, 19.
59. Kira, K.; Rendell, L. A. Proceedings of the Ninth International
Workshop on Machine Learning, Morgan Kaufmann Publishers Inc.:
San Francisco, CA, 1992, pp 249–256.
60. Kohavi, R.; John, G. H. Artif Intell 1997, 97, 273.
61. Saeys, Y.; Inza, I.; Larranaga, P. Bioinformatics 2007, 23,
2507.
62. Suykens, J. A. K.; Gestel, T. V.; Brabanter, J. D.; Moor, B. D.; Van-
dewalle, J. Least Squares Support Vector Machines; World Scien-
tific: Singapore, 2002.
63. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines; Cambridge University Press: Cambridge, UK, 2000.
64. Liu, H.; Zhang, R.; Yao, X.; Liu, M.; Hu, Z.; Fan, B. J Comput Aid
Mol Des 2004, 18, 389.
65. Liu, H.; Yao, X.; Zhang, R.; Liu, M.; Hu, Z.; Fan, B. J Phys Chem
B 2005, 109, 20565.
66. Yao, X.; Liu, H.; Zhang, R.; Liu, M.; Hu, Z.; Panaye, A.; Doucet, J.
P.; Fan, B. Mol Pharmaceutics 2005, 2, 348.
67. Li, S.; Yao, X.; Liu, H.; Li, J.; Fan, B. Anal Chim Acta 2007, 584,
37.
68. Liu, H.; Gramatica, P. Bioorgan Med Chem 2007, 15, 5251.
69. Pelckmans, K.; Suykens, J. A. K.; Gestel, T. V.; Brabanter, J. D.; Lukas, L.; Hamers, B.; Moor, B. D.; Vandewalle, J. Internal Report 02-44; ESAT-SCD-SISTA, K.U. Leuven: Leuven, Belgium, 2002.
70. Gramatica, P. QSAR Comb Sci 2007, 26, 694.
71. Tropsha, A.; Gramatica, P.; Gombar, V. K. QSAR Comb Sci 2003,
22, 69.
72. Krygowski, T. M.; Cyranski, M. Tetrahedron 1996, 52, 1713.
73. Mrozek, A.; Karolak-Wojciechowska, J.; Amiel, P.; Barbe, J. J Mol
Struct 2000, 524, 151.