A Novel Method for Protein-Ligand Binding Affinity
Prediction and the Related Descriptors Exploration
SHUYAN LI, LILI XI, CHENGQI WANG, JIAZHONG LI, BEILEI LEI, HUANXIANG LIU, XIAOJUN YAO
Department of Chemistry, Lanzhou University, Lanzhou 730000, China
Received 18 March 2008; Revised 18 June 2008; Accepted 22 June 2008
DOI 10.1002/jcc.21078
Published online 10 September 2008 in Wiley InterScience (www.interscience.wiley.com).
Abstract: In this study, a novel method was developed to predict the binding affinity of protein-ligand complexes (PLCs) based on a comprehensive set of structurally diverse complexes. The 1300 PLCs with binding affinity data (493 complexes with Kd and 807 with Ki) from the refined set of the PDBbind database (release 2007) were used for predictive model development. In this method, each complex was described using descriptors calculated from three blocks: protein sequence, ligand structure, and binding pocket. The PLC data were then rationally split into representative training and test sets with full consideration of model validation. The molecular descriptors relevant to the binding affinity were selected on the training set using the ReliefF method combined with the least squares support vector machines (LS-SVMs) modeling method. Two final optimized LS-SVMs models were developed using the selected descriptors to predict the Kd and Ki binding affinities. The correlation coefficients (R) of the training and test sets were 0.890 and 0.833 for the Kd model, and 0.922 and 0.742 for the Ki model, respectively. The prediction method proposed in this work offers better generalization ability than other recently published methods and can be used as an alternative fast filter in the virtual screening of large chemical databases.
© 2008 Wiley Periodicals, Inc. J Comput Chem 30: 900–909, 2009
Key words: protein-ligand binding affinity; ReliefF method; least squares support vector machines (LS-SVMs);
model validation
Introduction
Recently, structure-based drug design (SBDD) methods such as
docking and de novo design have been widely used in the drug
discovery and drug design process, especially in lead discovery
and lead optimization. In SBDD methods, one of the most important issues is screening available ligands against the relevant macromolecular target. In most cases, the more strongly a ligand binds to the target protein, the more likely it is to affect the protein's biological function and, consequently, to be a suitable drug candidate. Therefore, the assessment of the binding affinity between ligand and receptor plays an important role in drug discovery and design. Because this affinity is mainly determined by the interaction between the ligand and the receptor, the study of the relationship between the features of a ligand-protein complex and its binding affinity has become very important in modern drug discovery.1,2
Usually, the binding affinity of a ligand for a given receptor can be determined experimentally, for example by NMR spectrometry,1–3 microcalorimetry,4–6 and surface plasmon resonance,7 but these measurements are often time-consuming and expensive. Therefore, many in silico methods have been developed to predict the binding affinity from the properties of the ligand and the receptor. The advantage of these in silico methods is that, once reliable prediction models are developed, they are much less dependent on experimental data; they can be applied even to virtual compounds that have not yet been synthesized. At the same time, they can filter out compounds that have "no" or "low" activity. The most widely used methods in
this field are docking and scoring methods, which can identify the binding mode of the ligands and estimate the strength of the protein-ligand interactions. There are many classes of these methods, such as force-field-based methods (like DOCK,8
GOLD,9 SIE,10 and LIE11), knowledge-based approaches (e.g.,
DrugScore,12 DFIRE,13 3DDFT,14 PMF,15,16 BLEEP,17,18
ITScore,19 and M-Score20), and the empirical scoring functions
Contract/grant sponsor: Program for New Century Excellent Talents in
University; contract/grant number: NCET-07-0399
Contract/grant sponsors: Scientific Research Foundation for the Returned
Overseas Chinese Scholars, State Education Ministry, Zhide Foundation
of Lanzhou University
Correspondence to: X. Yao; e-mail: [email protected]
combined with various statistical methods (X-Score,21 FlexX
Score,22 VALIDATE,23 SCORE1 (LUDI),24 SCORE,25,26 Chem-
Score,27 SMoG,28,29 GEMDOCK,30 and SODOCK31). In addition, some methods introduce parallel computing into this field (high-throughput MM-PBSA32,33).
Recently, as an alternative to widely used docking and scor-
ing approach, some other in silico methods based on the struc-
tures of ligands and the relevant proteins are also proposed for
the fast prediction of the binding affinity with some success (e.g.
Hi-PLS34 and novel geometrical descriptors-based method35,36).
These methods often use the molecular descriptors calculated
from the structure of the ligand and protein as the inputs, and
then, use some modeling methods to develop predictive models
for binding affinity. Compared with docking and scoring meth-
ods, these methods showed some obvious advantages such as
easy implementation, fast prediction process, and good predic-
tive ability and could be used as a fast filter in the virtual
screening of large chemical database.
The recently published in silico methods, together with their correlation coefficients (R), standard deviations (SD) or root mean square errors (RMSE), and descriptors, are listed in Table 1 to give an overview. Where several predictive models were available for a method, the results listed in the Table are those based on the largest dataset.
As can be seen from Table 1, these methods still need some
improvement of the prediction accuracy and generalization abil-
ity. First, most methods used only part of the large structure-
diverse dataset, and their testing results are not very satisfactory.
Many of the methods mentioned earlier used the refined set
[2003 release, containing 800 protein-ligand complexes (PLCs)]
of PDBbind,37,38 including the methods of PMF, ChemScore,
PLP, LUDI, GOLD, and X-Score, Hi-PLS, etc.16,34,39 Second,
some prediction models gave very good accuracy, but they were
based on only a relatively small data set. For instance, the work by Zhang et al.35 used a data set containing 264 PLCs with binding affinities (pKd) and yielded a best R²ext of 0.83. Despite this high prediction accuracy, the method was based on a relatively small dataset and was somewhat complicated, with more than 1000 models being built and filtered. Therefore, the limited accuracy and the complexity of current methods prevent their wide application in real virtual screening of large chemical databases. There is still a great need to improve the present scoring functions and to develop new general binding affinity prediction methods with good predictive ability that cover more of the ligand-protein complex space.
Inspired by the success of binding affinity prediction methods based on molecular descriptors of ligand-protein complexes, and by their easy implementation, we propose a new method for fast binding affinity prediction that introduces new molecular descriptors, a new model development procedure, and model validation. To overcome the low accuracy and high complexity of current methods, we present an accurate and concise modeling method for the binding affinity prediction of PLCs based on a general, structurally diverse data set (release 2007 of the PDBbind database refined set, including 1300 PLCs). It should also be pointed out that the models developed in our study were carefully validated, a step that is often ignored in many similar studies. The main process of our prediction method is
shown in Figure 1. As can be seen from the flow chart, our method mainly includes the following steps: the data set is collected from the PDBbind database refined set; three blocks of descriptors covering protein sequence, ligand structure, and binding pocket information are used to describe each complex; all the data are split into representative training and test sets; the descriptors relevant to the binding affinity are selected using the ReliefF method combined with the least squares support vector machines (LS-SVMs) modeling method on the training data set; the selected descriptors are used to build predictive models with the LS-SVMs method; and the models are validated internally by leave-one-out (LOO) crossvalidation and Y-scrambling on the training set, and externally by the use of the rationally selected test set.
Materials and Methods
Data Sets
All PLC information was collected from the 2007 release of the PDBbind database.37 The PDBbind database is a collection of experimentally measured binding affinities exclusively for the PLCs available in the Protein Data Bank,40 providing both binding affinities and known 3D structures. The current release contains 3214 PLCs, of which 1300 were selected to form the "refined set" in consideration of the quality of the structures and binding affinity data; this set is compiled particularly for docking/scoring and binding affinity prediction studies. The resolution of each PLC crystal structure is 2.5 Å or better.38,41
To study the comprehensive information of the PLCs, the whole refined set was used in this work without any additional filtering of PLCs, unlike other works.34 Within the 1300 PLCs in the refined set, 493 have binding affinities reported as dissociation constants (Kd) and 807 as inhibition constants (Ki). We used the negative logarithms of the Kd and Ki values in this study (pKd and pKi). The pKd and pKi values range from 0.5 to 13.96, spanning over 13 orders of magnitude, with a mean of 6.39 and an SD of 2.16. Two binding affinity prediction models were developed based on the pKd and pKi values, respectively.
Descriptors Generation
The PLCs were described by three blocks of descriptors:
sequence information of protein, ligand structural information,
and binding pocket structural information.
Block1: Descriptors Based on Protein Sequences
The PDB files of proteins were converted into the amino acid
sequence files with FASTA format by the script of Python lan-
guage.42 Then, the structural and physicochemical features of
proteins were computed from amino acid sequence using the
PROFEAT43 method. Seven types of descriptors were generated,
including (a) amino acid composition, dipeptide composition,44
(b) normalized Moreau-Broto autocorrelation,40,45 (c) Moran
autocorrelation,46 (d) Geary autocorrelation,47 (e) sequence-order-coupling number, quasi sequence-order descriptors,48 (f)
pseudo amino acid composition (λ = 30),49 and (g) the composition, transition, and distribution of various structural and physicochemical properties.50,51 In total, 1497 descriptors were
calculated in this block.
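The simplest of these block-1 descriptors can be illustrated directly. The sketch below is illustrative only (it is not the PROFEAT implementation): it computes the 20 amino acid composition fractions and the 400 dipeptide composition fractions from a sequence string.

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(seq):
    """Fraction of each amino acid in the sequence (20 descriptors)."""
    n = len(seq)
    return {a: seq.count(a) / n for a in AA}

def dipeptide_composition(seq):
    """Fraction of each ordered amino-acid pair (400 descriptors)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return {a + b: pairs.count(a + b) / n for a, b in product(AA, repeat=2)}

comp = aa_composition("MKTAYIAKQR")   # toy sequence, not from the dataset
# comp["K"] is 0.2: 2 lysines out of 10 residues
```

Descriptor families (b)-(g) build on the same sequence, adding physicochemical property scales and positional correlation.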
Block2: Descriptors of Ligand
All the ligands taken from the PDBbind database were prechecked for missing atom types, and hydrogen atoms were added using the HYPERCHEM52 program. A total of 1664
molecular descriptors were calculated using the DRAGON soft-
ware.53,54
Block3: Descriptors of Binding Pocket
The binding pocket structures of the PLCs from the PDBbind database first had hydrogen atoms added and were then minimized using the Tripos force field in Sybyl 6.9.55 Then seven types of
descriptors were calculated, including CPSA (charged partial
Table 1. Overview of Recently Published in silico Methods for the Prediction of Protein-Ligand
Binding Affinities.
Method | Dataset size | Q2 of training set | R of test set | Error | Descriptor type or method description | Ref.
VALIDATE 51 – 0.900 – Empirical scoring function 35
NGD 61 0.71 0.775 – Distance-dependent atom type descriptors 36
DFIRE 100 – 0.630 – 19 atom types and a distance-scaled finite ideal-gas reference (DFIRE) state 13
NGD 105 0.60 0.800 – Distance-dependent atom type descriptors 36
SMoG96 120 – 0.648 – Empirical scoring function 35
ENTess 264 0.66 0.911 – Chemical geometrical descriptors and Pauling atomic electronegativities 35
BLEEP 351 – 0.728 – Knowledge-based scoring function 35
Hi-PLS 612 0.27 0.458 1.95a Principal components of 4 blocks of structural descriptors 34
GOLD::GoldScore 694 – 0.285 2.16b Empirical scoring function 37
Cerius2::LigScore 717 – 0.406 2.00b Empirical scoring function 37
SMoG2001 725 – 0.660 – Empirical scoring function 35
Sybyl::F-Score 732 – 0.141 2.19b Empirical scoring function 37
GOLD::ChemScore 741 – 0.423 2.00b Empirical scoring function 37
GOLD::ChemScore_opt 762 – 0.449 1.96b Empirical scoring function 37
GOLD::GoldScore_opt 772 – 0.365 2.06b Empirical scoring function 37
Sybyl::PMF-Score 785 – 0.147 2.16b Knowledge-based scoring function 37
Cerius2::LUDI1 790 – 0.334 2.08b Empirical scoring function 37
Cerius2::PMF 795 – 0.253 2.13b Knowledge-based scoring function 37
Sybyl::ChemScore 797 – 0.499 1.91b Empirical scoring function 37
Cerius2::LUDI2 799 – 0.379 2.04b Empirical scoring function 37
X-Score::HPScore 800 – 0.514 1.89b Empirical scoring function 37
X-Score::HMScore 800 – 0.566 1.82b Empirical scoring function 37
X-Score::HSScore 800 – 0.506 1.90b Empirical scoring function 37
DrugScore::Pair 800 – 0.473 1.94b Knowledge-based scoring function 37
DrugScore::Surf 800 – 0.463 1.95b Knowledge-based scoring function 37
DrugScore::Pair/Surf 800 – 0.476 1.94b Knowledge-based scoring function 37
Sybyl::D-Score 800 – 0.322 2.09b Empirical scoring function 37
Sybyl::G-Score 800 – 0.443 1.98b Empirical scoring function 37
Cerius2::PLP1 800 – 0.458 1.96b Empirical scoring function 37
Cerius2::PLP2 800 – 0.455 1.96b Empirical scoring function 37
Cerius2::LUDI3 800 – 0.331 2.08b Empirical scoring function 37
HINT 800 – 0.330 2.08b Empirical scoring function 37
X-Score::HMScore + DrugScore::Pair 800 – 0.573 – Consensus scoring function 37
X-Score::HMScore + Sybyl::ChemScore 800 – 0.586 – Consensus scoring function 37
X-Score::HMScore + Cerius2::PLP2 800 – 0.573 – Consensus scoring function 37
DrugScore::Pair + Sybyl::ChemScore 800 – 0.520 – Consensus scoring function 37
DrugScore::Pair + Cerius2::PLP2 800 – 0.476 – Consensus scoring function 37
Sybyl::ChemScore + Cerius2::PLP2 800 – 0.521 – Consensus scoring function 37
a Root mean square error (RMSE) of the predictive result of the test set.
b Standard deviation (SD) between the experimental and predicted values of the test set.
surface area), Volsurf, ZAPSOLVATION, ZAPDESCRIPTORS,
FINGERPRINT, MOLPROP_VOLUME, and MOL_WEIGHT. In
this block, 125 descriptors of the binding sites were generated.
Dataset Division
After the calculation of the descriptors, all the original data were split into representative training and test sets using the Kennard-Stone (KS) method56 with full consideration of model validation. In this division, the composition of the training and test sets is of crucial importance. A rational split must guarantee that the training and test sets are scattered over the whole area occupied by representative points in the descriptor space, and that the training set is distributed over the area occupied by the representative points of the whole data set.54 The KS method divides the experimental region uniformly, usually using the Euclidean distance to select samples (candidate objects) into the calibration space (training set). In the absence of strong irregularities in the factor space, the procedure starts by selecting a set of points close to those selected by the D-optimal method,57 i.e., on the borderline of the data set (plus the center point, if this is chosen as the starting point). It then proceeds to fill up the calibration space. It is a uniform mapping algorithm and yields a flat distribution of the data, which is preferable for a regression model.58 After data division, there are 394 and 645 samples in the training sets of the pKd and pKi data, and 99 and 162 samples in the corresponding test sets, respectively.
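The selection logic of the KS method described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the split starts from the two most distant samples, then each step adds the candidate whose distance to its nearest already-selected point is largest, so the training set covers the descriptor space uniformly.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone
    algorithm: seed with the two mutually most distant points, then
    repeatedly add the candidate farthest from the selected set."""
    X = np.asarray(X, dtype=float)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # for each candidate, distance to its nearest selected point
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(min_d))))
    return sorted(selected)

# toy data: 10 points in 2-D, 6 chosen for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
train_idx = kennard_stone(X, 6)
test_idx = [k for k in range(10) if k not in train_idx]
```

The O(n²) distance matrix is fine at this scale; for the full 1300-complex set a chunked distance computation would be preferable.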
Descriptor Selection
Once all three blocks of 3286 descriptors were generated, they were first filtered to remove redundant variables whose pairwise correlation coefficients were higher than 0.9. This left 1694 descriptors: 1099 in block 1, 559 in block 2, and 36 in block 3.
Using the aforementioned training set, the ReliefF method59 combined with the modeling method was used to extract the crucial descriptors from these 1694 descriptors. Here, each descriptor was regarded as a feature. The ReliefF algorithm assigns a "relevance" weight to each feature, which denotes the relevance of the feature to the target concept. It evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and different classes (the data can be either discrete or continuous).60 The number of descriptors selected to build the final model was based on the best LOO crossvalidation (Q2) results: models (using the LS-SVMs method in this study) were trained with different numbers of top-ranked descriptors under default parameters, and the set that gave the most favorable Q2 was selected for further analysis.61
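For a continuous response such as pKd, the Relief idea sketched above can be written compactly. The following is a simplified illustration only, not the exact RReliefF algorithm of Robnik-Sikonja and Kononenko used via ref. 59: a feature is rewarded when its differences co-occur with differences in the response among near neighbors, and penalized when it differs while the response does not.

```python
import numpy as np

def rrelieff_scores(X, y, n_neighbors=10):
    """Simplified RReliefF-style relevance weights for a continuous
    response (an illustrative sketch of the idea)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    span = np.ptp(X, axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span          # features scaled to [0, 1]
    y_span = np.ptp(y) if np.ptp(y) > 0 else 1.0
    n_dc = 0.0                               # sum of response differences
    n_da = np.zeros(p)                       # sums of feature differences
    n_dcda = np.zeros(p)                     # joint sums
    m = 0
    for i in range(n):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf
        for k in np.argsort(d)[:n_neighbors]:
            dy = abs(y[k] - y[i]) / y_span
            dx = np.abs(Xs[k] - Xs[i])
            n_dc += dy
            n_da += dx
            n_dcda += dy * dx
            m += 1
    # relevance: P(feature differs | y differs) - P(feature differs | y same)
    return n_dcda / n_dc - (n_da - n_dcda) / (m - n_dc)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 is relevant
scores = rrelieff_scores(X, y)                        # feature 0 should rank first
```

Ranking the 1694 descriptors by such scores and retraining with increasing numbers of top-ranked features reproduces the Q2-based selection loop described above.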
LS-SVMs Modeling
LS-SVMs62 are reformulations of the standard SVMs,63,64 which result in a set of linear equations instead of the quadratic programming problem of SVMs. This algorithm was introduced in this study as the modeling tool because of its good performance and easy implementation for regression problems, as proved in our previous works.65–68 The theory of LS-SVMs and the differences between LS-SVMs and SVMs were described in detail in our previous works.65–68 Here, we briefly describe only the main idea of LS-SVMs for function estimation.
In principle, LS-SVMs fit a linear relation $y = wx + b$ between the regressor ($x$) and the dependent variable ($y$). The best relation is obtained by minimizing a cost function ($Q$) containing a penalized regression error term:

\[ Q_{\text{LS-SVM}} = \frac{1}{2} w^T w + \gamma \sum_{k=1}^{N} e_k^2 \tag{1} \]

subject to:

\[ y_k = w^T \varphi(x_k) + b + e_k, \qquad k = 1, \ldots, N \tag{2} \]

where $\varphi: \mathbb{R}^n \to \mathbb{R}^m$ is the feature map from the input space to a usually high-dimensional feature space, $\gamma$ is the relative weight of the error term, and $e_k$ are error variables that take noisy data into account and avoid poor generalization.
LS-SVMs treat this as a constrained optimization problem and use a Lagrange function to solve it. By solving the Lagrangian of eq. (1), the weight coefficients ($w$) can be written as:

\[ w = \sum_{k=1}^{N} \alpha_k x_k \quad \text{with} \quad \alpha_k = 2\gamma e_k \tag{3} \]

Substituting (3) into the original regression line $y = wx + b$ yields:

\[ y = \sum_{k=1}^{N} \alpha_k x_k^T x + b \tag{4} \]

It can be seen that the Lagrange multipliers can be defined as:

\[ \alpha_k = \left( x_k^T x + (2\gamma)^{-1} \right)^{-1} (y_k - b) \tag{5} \]
Figure 1. The flow chart of the proposed method for predicting
binding affinity.
Finding these Lagrange multipliers is very simple compared with the SVM approach, in which a more difficult relation has to be solved to obtain these values. In addition, the linear approach is easily extended to nonlinear regression by introducing a kernel function, which leads to the following nonlinear regression function:

\[ f(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b \tag{6} \]

where $K(x, x_k)$ is the kernel function, whose value equals the inner product of the two vectors $x$ and $x_k$ in the feature space, that is, $K(x, x_k) = \varphi(x)^T \varphi(x_k)$. The choice of kernel and its specific parameters, together with $\gamma$, have to be tuned by the user. The radial basis function (RBF) kernel $K(x, x_k) = \exp(-\lVert x - x_k \rVert^2 / \sigma^2)$ is commonly used, and LOO crossvalidation was used to tune the optimal values of the two parameters $\gamma$ (the relative weight of the regression error) and $\sigma$ (the kernel parameter of the RBF kernel). Here, the optimal parameters are found by an intensive grid search. The result of this grid search is an error surface spanned by the model parameters; a robust model is obtained by selecting the parameters that give the lowest error in a smooth area.
The RMSE was used as the error function; it is computed according to the following equation:

\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \tag{7} \]

where $y_i$ and $\hat{y}_i$ are the experimental and calculated response values of the $i$-th object, respectively, and $n$ is the number of samples.
All calculations implementing LS-SVMs were performed
using the Matlab/C toolbox.69
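The calculations in this paper were done in Matlab, but eqs. (1)-(6) can be illustrated numerically in a few lines: with an RBF kernel, training an LS-SVM reduces to solving one linear system in the dual variables α and the bias b. The sketch below is a self-contained illustration (the γ and σ² values in the toy fit are arbitrary; the paper tunes them by grid search and LOO crossvalidation).

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(x, x_k) = exp(-||x - x_k||^2 / sigma^2), as in eq. (6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the LS-SVM dual: a linear system replaces the SVM QP.
        [ 0   1^T             ] [b]       [0]
        [ 1   K + I/(2*gamma) ] [alpha] = [y]
    (the 1/(2*gamma) diagonal follows from alpha_k = 2*gamma*e_k)."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / (2.0 * gamma)
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                      # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma2):
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# toy regression: recover a sine curve from noisy samples
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=80)
alpha, b = lssvm_fit(X, y, gamma=10.0, sigma2=1.0)
rmse = np.sqrt(np.mean((lssvm_predict(X, alpha, b, X, sigma2=1.0) - y) ** 2))
```

Note that the first row of the system enforces the constraint Σ α_k = 0; the regularization 1/(2γ) on the kernel diagonal is what distinguishes this from plain kernel interpolation.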
Internal and External Validation of Model
Once the models were built, the prediction results on the training set alone were not enough to prove their predictive ability. Therefore, two further methods were used to evaluate the stability and robustness of the models.
The first was internal validation based on the training set only, mainly comprising the LOO crossvalidation correlation coefficient (Q2) of the model and permutation testing (Y-scrambling).70 The LOO crossvalidation
procedure consists of removing one sample from the training set
and constructing the model only on the basis of the remaining
training data and then testing on the removed sample. In this
fashion, all of the training data samples were tested, and Q2 was
calculated. Another internal validation was performed by permu-
tation testing: new models were recalculated for randomly reor-
dered responses. The resulting models obtained on the data set
with randomized response should have significantly lower Q2
values than the proposed ones because the relationship between
the structure and response is broken. This is proof of the
proposed model’s validity, because it can be reasonably con-
cluded that the originally proposed model was not obtained by
chance correlation.54,71
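The two internal checks can be sketched together. In the toy example below an ordinary linear model stands in for the LS-SVM (illustrative only, with made-up data): the true LOO Q2 is high, while every Q2 obtained after scrambling y collapses, which is the signature of a model that is not a chance correlation.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out Q2 for a plain linear model (a stand-in for the
    LS-SVM used in the paper): each sample is predicted by a model
    trained on all the others."""
    n = len(y)
    preds = np.empty(n)
    A = np.c_[X, np.ones(n)]                 # design matrix with intercept
    for i in range(n):
        mask = np.arange(n) != i
        w = np.linalg.lstsq(A[mask], y[mask], rcond=None)[0]
        preds[i] = A[i] @ w
    return 1.0 - np.sum((preds - y) ** 2) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_perm=50, seed=0):
    """Permutation test: the best Q2 over models refit on shuffled
    responses should fall well below the true Q2."""
    rng = np.random.default_rng(seed)
    return max(loo_q2(X, rng.permutation(y)) for _ in range(n_perm))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=60)
true_q2 = loo_q2(X, y)
worst_scrambled = y_scramble(X, y)
```

The paper repeats the scrambling 300 times; 50 permutations are used here only to keep the toy run short.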
The second method is external validation based on the test set.68,70 For any predictive model, the most important goal is its use for external prediction. Because the significant descriptors were selected based on the training set only, the test set, which was not involved in model development at all, was used to validate the built model externally. Compared with crossvalidation, external validation provides a more rigorous evaluation of the model's predictive capability for
untested PLCs. The external test set was evaluated by $R$, $Q^2_{\text{ext}}$, and RMSE. Here, $Q^2_{\text{ext}}$ is defined as:

\[ Q^2_{\text{ext}} = 1 - \frac{\sum_{i=1}^{n_{\text{ext}}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{\text{ext}}} (y_i - \bar{y})^2} \tag{8} \]

where the sums run over the test set objects ($n_{\text{ext}}$), $\hat{y}_i$ is the predicted value of the $i$-th object, and $\bar{y}$ is the average value of the training set responses.
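Under these definitions, Q2ext compares the test-set prediction error with the spread of the test responses around the training-set mean, not the test-set mean. A small sketch (the pK values below are made up for illustration):

```python
import numpy as np

def q2_ext(y_test, y_pred, y_train):
    """Eq. (8): 1 - PRESS / sum of squared deviations of the test
    responses from the TRAINING-set mean."""
    y_bar = np.mean(y_train)
    press = np.sum((y_pred - y_test) ** 2)
    return 1.0 - press / np.sum((y_test - y_bar) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error, as in eq. (7)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# made-up pK values: accurate predictions give Q2_ext close to 1
y_train = np.array([4.0, 5.5, 6.0, 7.2, 8.9])
y_test = np.array([5.0, 6.5, 8.0])
y_pred = np.array([5.2, 6.3, 8.1])
```

A perfect predictor gives Q2_ext = 1; a predictor no better than always returning the training mean gives Q2_ext near 0 or below.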
Results and Discussion
The Construction and Internal Validation of Models
After training LS-SVM models with different numbers of top-ranked descriptors from the ReliefF method, we selected the descriptor set that gave the highest Q2 for further analysis. Thereafter, 35 significant features were selected to build each final predictive model. The selected descriptors are listed in Tables 2 and 3.
In this investigation, two LS-SVMs models for binding affinity prediction were built separately, based on the PLCs with pKd data (Kd model for short) and pKi data (Ki model for short). In LS-SVMs model development, only the two parameters σ and γ need to be optimized. The parameter σ (in the form σ2) and the parameter γ were tuned simultaneously on a 20 × 20 grid, each ranging from 0.01 to 100. A contour plot of the optimization error can then be visualized easily (Figs. 2 and 3), and the optimal parameter settings can be selected from a smooth subarea with a low prediction error. The selected optimal values of γ and σ2 are 1.833 and 23.357 for the Kd model, marked by the magenta square in Figure 2, and 1.833 and 11.288 for the Ki model, marked by the magenta square in Figure 3.
The internal validation and prediction results of the Kd and Ki models are shown in Table 4, and the predicted versus experimental values are plotted in Figures 4 and 5. The Q2 is 0.586 for the Kd model and 0.500 for the Ki model, which shows the stability of the models on the PDBbind refined dataset. The corresponding Q2 value reported by Lindstrom et al.34 was only 0.220.
In addition, Y-randomization was applied to exclude the possibility of chance correlation, i.e., fortuitous correlation without any predictive ability. The results of the random models, obtained using a scrambled order of the experimental binding constants and repeated 300 times, were significantly lower than those of the original models. The highest Q2 values among the 300 randomized models are shown in Table 4. This also indicates the statistical reliability of our predictive models.
The External Validation of Models
Ninety-nine Kd data and 162 Ki data, not used during model development, were used as the external validation datasets. The correlation coefficients Rext for the Kd and Ki test sets are 0.833 and 0.742, respectively. The RMSE and Q2ext values are also shown in Table 4. The external validation results are better than those in the references based on the same kind of dataset.16,34,39
Besides, to further prove the generalization ability of our models, an overall 10-fold crossvalidation on the whole datasets was also performed. Because the Q2 result is not stable for a single n-fold crossvalidation, we repeated this procedure 10 times and obtained average Q2 values of 0.569 and 0.496 for the Kd and Ki datasets. These results are consistent with the LOO
Table 2. Significant Descriptors Selected to Build the Kd Model.
No. Name Block Content Type
1 AROM B2 Aromaticity index Geometrical descriptors
2 FNSA1 B3 Partial negative surface area/total molecular surface area CPSA
3 FPSA2 B3 Total charge weighted PNSA(Partial negative surface area)/total
molecular surface area
CPSA
4 SizPatc2 B3 The size of the smallest patches of potential as a fraction of the total
surface area
ZAPDESCRIPTORS
5 C-011 B2 CR3X Atom-centred fragments
6 nSO2N B2 Number of sulfonamides (thio-/dithio-) Functional group counts
7 S-110 B2 R-SO2-R Atom-centred fragments
8 Ms B2 Mean electrotopological state Constitutional descriptors
9 MAXDN B2 Maximal electrotopological negative variation Topological descriptors
10 GATS1p B2 Geary autocorrelation—lag 1/weighted by atomic Sanderson
electronegativities
2D autocorrelations
11 GATS1m B2 Geary autocorrelation—lag 1/weighted by atomic masses 2D autocorrelations
12 SizPatc1 B3 The size of the smallest patches of potential as a fraction of the total
surface area
ZAPDESCRIPTORS
13 JGI9 B2 Mean topological charge index of order9 Topological charge indices
14 Ui B2 Unsaturation index Molecular properties
15 GATS3p B2 Geary autocorrelation—lag 3/weighted by atomic polarizabilities 2D autocorrelations
16 FINGERPRINT B3 A collection of binary variables, which have the value of 1 if the
compound contains a particular fragment, say phenyl or carbonyl, and
0 otherwise.
FINGERPRINT
17 FESA3_1 B3 Fractional surface Areas: surface of potential greater than 11 kT ZAPDESCRIPTORS
18 GATS1e B2 Geary autocorrelation—lag 1/weighted by atomic Sanderson
electronegativities
2D autocorrelations
19 C-007 B2 CH2X2 Atom-centred fragments
20 R4e1 B2 R maximal autocorrelation of lag 4/weighted by atomic Sanderson
electronegativities
GETAWAY descriptors
21 X0A B2 Average connectivity index chi-0 Connectivity indices
22 MAXDP B2 Maximal electrotopological positive variation Topological descriptors
23 RARS B2 R matrix average row sum GETAWAY descriptors
24 PPSA1 B3 Partial positive surface area CPSA
25 R6e B2 R autocorrelation of lag 6/weighted by atomic Sanderson
electronegativities
GETAWAY descriptors
26 BLTD48 B2 Verhaar model of Daphnia base-line toxicity from MLOGP (mmol/l) Molecular properties
27 C-024 B2 R--CH--R Atom-centred fragments
28 GATS3m B2 Geary autocorrelation—lag 3/weighted by atomic masses 2D autocorrelations
29 ARR B2 Aromatic ratio Constitutional descriptors
30 R1p B2 R autocorrelation of lag 1/weighted by atomic polarizabilities GETAWAY descriptors
31 LogP B3 Mean of a linear equation derived by fitting VolSurf descriptor to
experimental data on water/octanol partition coefficient
Volsurf
32 nCrt B2 Number of ring tertiary C(sp3) Functional group counts
33 GVWAI-80 B2 Ghose-Viswanadhan-Wendoloski drug-like index at 80% Molecular properties
34 Mor19m B2 3D-MoRSE—signal 19 / weighted by atomic masses 3D-MoRSE descriptors
35 HATS2e B2 Leverage-weighted autocorrelation of lag 2 / weighted by atomic
Sanderson electronegativities
GETAWAY descriptors
Q2 of internal validation, which also suggests that our models
are robust and stable.
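The repeated n-fold check described above can be sketched as follows; a toy linear model again stands in for the LS-SVM (illustrative only). The split-dependence of a single k-fold Q2 is averaged away by repeating the random split.

```python
import numpy as np

def repeated_kfold_q2(X, y, fit, predict, k=10, repeats=10, seed=0):
    """Average crossvalidated Q2 over `repeats` random k-fold splits."""
    rng = np.random.default_rng(seed)
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    q2s = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        press = 0.0
        for fold in np.array_split(order, k):
            train = np.setdiff1d(order, fold)
            model = fit(X[train], y[train])
            press += np.sum((predict(model, X[fold]) - y[fold]) ** 2)
        q2s.append(1.0 - press / tss)
    return float(np.mean(q2s))

# toy stand-in model: ordinary least squares with an intercept
fit = lambda X, y: np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]
predict = lambda w, X: np.c_[X, np.ones(len(X))] @ w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
avg_q2 = repeated_kfold_q2(X, y, fit, predict)
```

Agreement between the averaged k-fold Q2 and the LOO Q2, as reported above for the Kd and Ki models, is the stability evidence this check provides.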
Exploration of the Selected Descriptors
As shown in Tables 2 and 3, 35 significant descriptors were selected separately for the Kd model and the Ki model. The order in the tables is according to their importance under the ReliefF ranking. By analyzing these 35 descriptors, some interesting information about protein-ligand interactions can be inferred. It should also be pointed out that it is difficult to give a full explanation of all the descriptors involved in the models, and that similar reported works gave little detailed analysis of the descriptors involved in model development.
First, electrostatic interaction proves to be important for binding affinity in our models: more than one-third of the descriptors are electrostatics-related features. There
are 15 such descriptors in Kd model (No. 2, 8, 9, 10, 13, 18, 20,
22, 25, 35 in block2 and No. 3, 4, 12, 17, 24 in block3) and 12
such descriptors in Ki model (No. 3, 7, 9, 11, 12, 15, 17 in
block2 and No. 6, 10, 21, 32, 34 in block3). The information
derived from our models indicated that the electrostatic interac-
tion was very important in describing the binding affinity
between the ligand and protein.
Second, the hydrophobic effect also proves to be important for protein-ligand binding. Interestingly, the two models take their hydrophobic descriptors from different blocks. Both models select descriptors relevant to logP; however, in the Kd model,
Table 3. Significant Descriptors Selected to Build the Ki Model.

No. | Name | Block | Content | Type
1 | AROM | B2 | Aromaticity index | Geometrical descriptors
2 | piPC10 | B2 | Molecular multiple path count of order 10 | Walk and path counts
3 | JGI8 | B2 | Mean topological charge index of order 8 | Topological charge indices
4 | SHP2 | B2 | Average shape profile index of order 2 | Randic molecular profiles
5 | Ui | B2 | Unsaturation index | Molecular properties
6 | SizPatc1 | B3 | Size of the largest patches of potential as a fraction of the total surface area | ZAP descriptors
7 | JGI9 | B2 | Mean topological charge index of order 9 | Topological charge indices
8 | AMR | B2 | Ghose-Crippen molar refractivity | Molecular properties
9 | JGI7 | B2 | Mean topological charge index of order 7 | Topological charge indices
10 | SizPatc2 | B3 | Size of the smallest patches of potential as a fraction of the total surface area | ZAP descriptors
11 | JGI10 | B2 | Mean topological charge index of order 10 | Topological charge indices
12 | HATS5e | B2 | Leverage-weighted autocorrelation of lag 5 / weighted by atomic Sanderson electronegativities | GETAWAY descriptors
13 | R1p | B2 | R autocorrelation of lag 1 / weighted by atomic polarizabilities | GETAWAY descriptors
14 | H-047 | B2 | H attached to C1(sp3)/C0(sp2) | Atom-centred fragments
15 | H2e | B2 | H autocorrelation of lag 2 / weighted by atomic Sanderson electronegativities | GETAWAY descriptors
16 | S-107 | B2 | R2S / RS-SR | Atom-centred fragments
17 | MAXDP | B2 | Maximal electrotopological positive variation | Topological descriptors
18 | ALOGP | B2 | Ghose-Crippen octanol-water partition coefficient (logP) | Molecular properties
19 | Mor05p | B2 | 3D-MoRSE signal 05 / weighted by atomic polarizabilities | 3D-MoRSE descriptors
20 | STN | B2 | Spanning tree number (log) | Topological descriptors
21 | NumPatc | B3 | Number of charged patches | ZAP descriptors
22 | nOHp | B2 | Number of primary alcohols | Functional group counts
23 | O-059 | B2 | Al-O-Al | Atom-centred fragments
24 | Neoplastic-80 | B2 | Ghose-Viswanadhan-Wendoloski antineoplastic-like index at 80% | Molecular properties
25 | R1u | B2 | R autocorrelation of lag 1 / unweighted | GETAWAY descriptors
26 | FDI | B2 | Folding degree index | Geometrical descriptors
27 | C-043 | B2 | X--CR..X | Atom-centred fragments
28 | N-074 | B2 | Ar-NAl2 | Atom-centred fragments
29 | H-048 | B2 | H attached to C2(sp3)/C1(sp2)/C0(sp) | Atom-centred fragments
30 | Mor03p | B2 | 3D-MoRSE signal 03 / weighted by atomic polarizabilities | 3D-MoRSE descriptors
31 | FINGERPRINT | B3 | A collection of binary variables with value 1 if the compound contains a particular fragment (e.g. phenyl or carbonyl) and 0 otherwise | FINGERPRINT
32 | FNSA3 | B3 | Atomic charge weighted PNSA / total molecular surface area | CPSA
33 | nR06 | B2 | Number of 6-membered rings | Constitutional descriptors
34 | FNSA2 | B3 | Total charge weighted PNSA / total molecular surface area | CPSA
35 | Mor13m | B2 | 3D-MoRSE signal 13 / weighted by atomic masses | 3D-MoRSE descriptors
906 Li et al. • Vol. 30, No. 6 • Journal of Computational Chemistry
Journal of Computational Chemistry DOI 10.1002/jcc
the logP descriptor (No. 31 in Table 2) was selected from block3, implying that the hydrophobic character of the binding pocket, rather than that of the ligand, dominates Kd. In the Ki model, by contrast, the ALOGP descriptor (No. 18 in Table 3) was selected from block2, indicating that the hydrophobic character of the ligand, rather than that of the binding pocket, matters more for the Ki value. The two predictive models are otherwise very similar: they share five identical descriptors (AROM, Ui, JGI9, R1p, and MAXDP) and 14 descriptor types (11 in block2 and three in block3). The remaining small difference in how they treat the hydrophobic effect may also explain why earlier binding affinity studies, e.g. the work by Lindstrom et al.,34 which used Kd and Ki as a single prediction target, could not achieve satisfactory results.
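Note that the modeled quantities (pKd and pKi, as plotted in Figures 4 and 5) are the negative base-10 logarithms of the measured constants; the conversion is one line (a trivial helper, named here only for illustration):

```python
import math

def p_affinity(k_molar):
    """Convert a measured Kd or Ki (in mol/L) to the modeled pKd/pKi scale."""
    return -math.log10(k_molar)

# A 2 nM dissociation constant corresponds to pKd = -log10(2e-9)
print(round(p_affinity(2e-9), 2))  # 8.7
```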
Third, among the geometrical descriptors, the aromaticity of the ligand is the most important for the binding between ligand and protein. Several earlier works predicted protein-ligand binding affinity using geometrical descriptors.35,36 In both the Kd and Ki models, the aromaticity index of the ligand, a geometrical descriptor, ranks first among the features influencing the binding affinity. The aromaticity index is calculated as 1 − [dearomatization term due to bond length alternation (GEO) + energetic term (EN)].72 Analyzing all the ligands in the two models, we found that most of them contain an aromatic or heteroaromatic ring. These substructures often appear in compounds whose activity correlates with H-bonding or Coulombic interactions, and the descriptor can also capture π-π stacking between ligand and protein. All of these properties are important for protein-ligand binding and are useful in the drug design and discovery process. Many available drugs already contain such substructures, for example sulphonamides (Sulfamide), antibiotics (Nitrofural, Metronidazole, Tinidazole,
Niridazole), antagonists of e.g. morphine (Amiphenazol, Daptazol), fungicides (Benomyl), and others.73 This information suggests that the aromatic character of a ligand can be a key determinant of the binding affinity between ligand and protein.

Figure 2. Contour plot of the errors for LS-SVM when optimizing the parameters σ and γ for the Kd model. The magenta square indicates the selected optimal settings.

Figure 3. Contour plot of the errors for LS-SVM when optimizing the parameters σ and γ for the Ki model. The magenta square indicates the selected optimal settings.

Table 4. Statistical Results of the Models and Their Validations.

Model | Rtrain | RMSE | Q2 | Q2 (Y-scrambling) | Rext | RMSEext | Q2ext | Q2cv (10-fold)
Kd model | 0.890 | 0.975 | 0.586 | 0.023 | 0.833 | 1.118 | 0.700 | 0.569
Ki model | 0.922 | 0.953 | 0.500 | 0.014 | 0.742 | 1.471 | 0.537 | 0.496

Q2 (Y-scrambling) is the highest value obtained after 300 rounds of Y scrambling.

Figure 4. Predicted pKd values versus experimental values for the Kd model.
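The aromaticity index discussed above combines an energetic term with a geometric term that penalizes bond-length alternation. As a rough illustration of the geometric part, the HOMA model of Krygowski and Cyranski72 scores a ring by the deviation of its bond lengths from an optimal aromatic value; the sketch below uses the standard HOMA parameters for CC bonds (Ropt = 1.388 Å, α = 257.7) and is only an illustrative sketch of the idea, not the DRAGON implementation behind the AROM descriptor.

```python
def homa(bond_lengths_angstrom, r_opt=1.388, alpha=257.7):
    """Harmonic Oscillator Model of Aromaticity for a carbocyclic ring:
    1 minus a normalized penalty on deviations from the optimal bond length."""
    n = len(bond_lengths_angstrom)
    return 1.0 - (alpha / n) * sum((r_opt - r) ** 2 for r in bond_lengths_angstrom)

# Benzene (six equal CC bonds of ~1.397 A) is almost fully aromatic,
# while a bond-alternating "cyclohexatriene" geometry is strongly penalized.
print(round(homa([1.397] * 6), 3))       # 0.979
print(round(homa([1.33, 1.47] * 3), 3))  # -0.3
```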
Finally, several atom-centred fragment descriptors, which belong to the constitutional descriptors, may correlate with the H-bond formation process. These descriptors (No. 7 in block2 of the Kd model and No. 23 and 28 in block2 of the Ki model) encode information on molecular fragments containing heteroatoms.
In summary, the selected descriptors show that electrostatic, hydrophobic, H-bond donor or acceptor, and aromaticity-related features, together with some atom-centred fragment descriptors, are the structural features most relevant to the binding affinity. The information they carry is consistent with the results of other similar studies.
Conclusion
In this study, a novel method for the prediction of the protein-
ligand affinities (Kd and Ki) based on a comprehensive set of
1300 structurally diverse PLCs was developed. Three blocks of
molecular descriptors were used to describe the PLCs. The
whole data were split into the representative training and test
sets by full consideration of the model validation. The ReliefF
algorithm combined with the modeling method was used to
select the significant descriptors from the training set samples
and LS-SVMs were utilized to build the models to predict the
binding affinities of Kd and Ki. The models were accurate, robust, and stable, as indicated by careful internal and external validation. Our method outperformed other similar published works in terms of the balance between predictive ability and model complexity. Because of this satisfactory performance, the predictive models can be used as a fast filter in the rapid virtual screening of large chemical databases.
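The validation statistics reported in Table 4 (R, RMSE, Q²) follow standard definitions; the following is a generic sketch of those quantities, plus a simplified Y-scrambling check. Note the simplification: a full Y-scrambling retrains the model on every permuted response vector, whereas this stand-in only permutes the responses against fixed predictions.

```python
import math
import random

def r_rmse_q2(y_true, y_pred):
    """Pearson correlation R, RMSE, and Q2 = 1 - SS(residual)/SS(about the mean)."""
    n = len(y_true)
    my = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - my) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return cov / math.sqrt(var_t * var_p), math.sqrt(ss_res / n), 1.0 - ss_res / var_t

def best_scrambled_q2(y_true, y_pred, rounds=300, seed=0):
    """Simplified Y-scrambling: permute the responses and keep the best Q2 seen.
    (A full Y-scrambling would retrain the model on each permuted response.)"""
    rng = random.Random(seed)
    ys = list(y_true)
    best = -float("inf")
    for _ in range(rounds):
        rng.shuffle(ys)
        best = max(best, r_rmse_q2(ys, y_pred)[2])
    return best
```

A sound model should score a Q² far above the best value obtained over all scrambled rounds, as the 0.586 versus 0.023 gap for the Kd model illustrates.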
Furthermore, several important descriptors related to the
binding affinity were analyzed. These descriptors can give some
insight into structural features related to the binding affinity
between the ligand and protein.
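For readers implementing the approach, the LS-SVM regression core reduces to solving a single linear system (Suykens et al.62). The sketch below, with an RBF kernel of width σ and regularization parameter γ, is a minimal NumPy illustration of the standard formulation, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between two sets of row vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, sigma, gamma):
    """LS-SVM regression: solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.asarray(y, dtype=float))))
    return sol[1:], sol[0]  # dual coefficients alpha, bias b

def lssvm_predict(X_train, alpha, b, sigma, X_new):
    """Predict f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

Model selection then scans a grid of (σ, γ) pairs and picks the combination minimizing the cross-validation error, which is what the contour plots in Figures 2 and 3 visualize.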
References
1. Betz, M.; Saxena, K.; Schwalbe, H. Curr Opin Chem Biol 2006, 10,
219.
2. Diercks, T.; Coles, M.; Kessler, H. Curr Opin Chem Biol 2001, 5,
285.
3. Villar, H. O.; Yan, J.; Hansen, M. R. Curr Opin Chem Biol 2004, 8,
387.
4. D’Amico, S.; Sohier, J. S.; Feller, G. J Mol Biol 2006, 358, 1296.
5. Chavelas, E. A.; Zubillaga, R. A.; Pulido, N. O.; Garcia-Hernandez,
E. Biophys Chem 2006, 120, 10.
6. Wiseman, T.; Williston, S.; Brandts, J. F.; Lin, L.-N. Anal Biochem
1989, 179, 131.
7. Lofas, S. Assay Drug Dev Techn 2004, 2, 407.
8. Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R.; Ferrin,
T. E. J Mol Biol 1982, 161, 269.
9. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. J Mol
Biol 1997, 267, 727.
10. Naïm, M.; Bhat, S.; Rankin, K. N.; Dennis, S.; Chowdhury, S. F.; Siddiqi, I.; Drabik, P.; Sulea, T.; Bayly, C. I.; Jakalian, A.; Purisima, E. O. J Chem Inf Model 2007, 47, 122.
11. Aqvist, J.; Luzhkov, V. B.; Brandsdal, B. O. Acc Chem Res 2002,
35, 358.
12. Gohlke, H.; Hendlich, M.; Klebe, G. J Mol Biol 2000, 295, 337.
13. Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y. J Med Chem 2005, 48,
2325.
14. Imai, T.; Hiraoka, R.; Seto, T.; Kovalenko, A.; Hirata, F. J Phys
Chem B 2007, 111, 11585.
15. Muegge, I.; Martin, Y. C. J Med Chem 1999, 42, 791.
16. Muegge, I. J Med Chem 2006, 49, 5895.
17. Mitchell, J. B. O.; Laskowski, R. A.; Alex, A.; Thornton, J. M.
J Comput Chem 1999, 20, 1165.
18. Nobeli, I.; Mitchell, J. B. O.; Alex, A.; Thornton, J. M. J Comput
Chem 2001, 22, 673.
19. Huang, S.-Y.; Zou, X. J Comput Chem 2006, 27, 1876.
20. Yang, C. Y.; Wang, R.; Wang, S. J Med Chem 2006, 49, 5903.
21. Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.;
Fogel, D. R.; Fogel, L. J.; Freer, S. T. Chem Biol 1995, 2, 317.
22. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. J Mol Biol 1996,
261, 470.
23. Head, R. D.; Smythe, M. L.; Oprea, T. I.; Waller, C. L.; Green, S.
M.; Marshall, G. R. J Am Chem Soc 1996, 118, 3959.
24. Bohm, H. J. J Comput Aided Mol Des 1998, 12, 309.
25. Wang, R.; Liu, L.; Lai, L.; Tang, Y. J Mol Model 1998, 4, 379.
26. Wang, R.; Lai, L.; Wang, S. J Comput-Aided Mol Des 2002, 16, 11.
27. Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee,
R. P. J Comput-Aided Mol Des 1997, 11, 425.
28. DeWitte, R. S.; Shakhnovich, E. I. J Am Chem Soc 1996, 118,
11733.
29. Ishchenko, A. V.; Shakhnovich, E. I. J Med Chem 2002, 45, 2770.
30. Yang, J.-M. J Comput Chem 2004, 25, 843.
31. Chen, H.-M.; Liu, B.-F.; Huang, H.-L.; Hwang, S.-F.; Ho, S.-Y.
J Comput Chem 2007, 28, 612.
32. Brown, S. P.; Muchmore, S. W. J Chem Inf Model 2006, 46, 999.
33. Brown, S. P.; Muchmore, S. W. J Chem Inf Model 2007, 47, 1493.
34. Lindstrom, A.; Pettersson, F.; Almqvist, F.; Berglund, A.; Kihlberg,
J.; Linusson, A. J Chem Inf Model 2006, 46, 1154.
35. Zhang, S.; Golbraikh, A.; Tropsha, A. J Med Chem 2006, 49, 2713.
36. Deng, W.; Breneman, C.; Embrechts, M. J. J Chem Inf Comput Sci
2004, 44, 699.
37. Wang, R.; Fang, X.; Lu, Y.; Wang, S. J Med Chem 2004, 47, 2977.
38. Wang, R.; Fang, X.; Lu, Y.; Yang, C. Y.; Wang, S. J Med Chem
2005, 48, 4111.
Figure 5. Predicted pKi values versus experimental values for the Ki model.
39. Wang, R.; Lu, Y.; Fang, X.; Wang, S. J Chem Inf Comput Sci 2004,
44, 2114.
40. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.;
Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucl Acids Res 2000,
28, 235.
41. Wang, R.; Fang, X.; Lu, Y.; Wang, S. J Med Chem 2004, 47,
2977.
42. Python Programming Language, ver. 2.5.1. Python Software Foundation: Hampton, NH, 2007.
43. Li, Z. R.; Lin, H. H.; Han, L. Y.; Jiang, L.; Chen, X.; Chen, Y. Z.
Nucl Acids Res 2006, 34, W32.
44. Reczko, M.; Karras, D.; Bohr, H. Nucl Acids Res 1997, 25, 235.
45. Lin, Z.; Pan, X. M. J Protein Chem 2001, 20, 217.
46. David, S. H. Biopolymers 1988, 27, 451.
47. Sokal, R. R.; Thomson, B. A. Am J Phys 2006, 129, 121.
48. Chou, K. C. Biochem Biophys Res Commun 2000, 278, 477.
49. Cai, Y. D.; Chou, K. C. J Proteome Res 2005, 4, 967.
50. Han, L. Y.; Zheng, C. J.; Xie, B.; Jia, J.; Ma, X. H.; Zhu, F.;
Lin, H. H.; Chen, X.; Chen, Y. Z. Drug Discov Today 2007, 12,
304.
51. Bock, J. R.; Gough, D. A. Bioinformatics 2001, 17, 455.
52. HyperChem, ver. 7.0. Hypercube. Inc.: Gainesville, FL, 2002.
53. DRAGON for Windows (Software for molecular Descriptor Calcula-
tion), ver. 5.4. Talete srl, 2006. Available at http://www.talete.mi.it.
54. Liu, H.; Papa, E.; Gramatica, P. Chem Res Toxicol 2006, 19, 1540.
55. Sybyl molecular modeling software, ver. 6.9. Tripos Associates,
Inc.: St. Louis, 2002.
56. Kennard, R. W.; Stone, L. A. Technometrics 1969, 11, 137.
57. Marengo, E.; Todeschini, R. Chemom Intell Lab Syst 1992, 16, 37.
58. Maesschalck, R. D.; Estienne, F.; Verdu-Andres, J.; Candolfi, A.;
Centner, V.; Despagne, F.; Jouan-Rimbaud, D.; Walczak, B.; Mas-
sart, D. L.; Jong, S. D.; Noord, O. E. D.; Puel, C.; Vandeginste, B.
M. G. Internet J Chem 1999, 2, 19.
59. Kira, K.; Rendell, L. A. Proceedings of the Ninth International
Workshop on Machine Learning, Morgan Kaufmann Publishers Inc.:
San Francisco, CA, 1992, pp 249–256.
60. Kohavi, R.; John, G. H. Artif Intell 1997, 97, 273.
61. Saeys, Y.; Inza, I.; Larranaga, P. Bioinformatics 2007, 23,
2507.
62. Suykens, J. A. K.; Gestel, T. V.; Brabanter, J. D.; Moor, B. D.; Van-
dewalle, J. Least Squares Support Vector Machines; World Scien-
tific: Singapore, 2002.
63. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines; Cambridge University Press: Cambridge, UK, 2000.
64. Liu, H.; Zhang, R.; Yao, X.; Liu, M.; Hu, Z.; Fan, B. J Comput Aid
Mol Des 2004, 18, 389.
65. Liu, H.; Yao, X.; Zhang, R.; Liu, M.; Hu, Z.; Fan, B. J Phys Chem
B 2005, 109, 20565.
66. Yao, X.; Liu, H.; Zhang, R.; Liu, M.; Hu, Z.; Panaye, A.; Doucet, J.
P.; Fan, B. Mol Pharmaceutics 2005, 2, 348.
67. Li, S.; Yao, X.; Liu, H.; Li, J.; Fan, B. Anal Chim Acta 2007, 584,
37.
68. Liu, H.; Gramatica, P. Bioorgan Med Chem 2007, 15, 5251.
69. Pelckmans, K.; Suykens, J. A. K.; Gestel, T. V.; Brabanter, J. D.; Lukas, L.; Hamers, B.; Moor, B. D.; Vandewalle, J. Internal Report 02-44; ESAT-SCD-SISTA, K.U. Leuven: Leuven, Belgium, 2002.
70. Gramatica, P. QSAR Comb Sci 2007, 26, 694.
71. Tropsha, A.; Gramatica, P.; Gombar, V. K. QSAR Comb Sci 2003,
22, 69.
72. Krygowski, T. M.; Cyranski, M. Tetrahedron 1996, 52, 1713.
73. Mrozek, A.; Karolak-Wojciechowska, J.; Amiel, P.; Barbe, J. J Mol
Struct 2000, 524, 151.