
Novel Statistical-Thermodynamic Methods to Predict Protein-Ligand Binding Positions Using Probability Distribution Functions

A. M. Ruvinsky* and A. V. Kozintsev

Force Field Laboratory, Algodign, LLC, Moscow, Russia

Abstract: We present two novel methods to predict native protein-ligand binding positions. Both methods identify the native binding position as the most probable position corresponding to a maximum of a probability distribution function (PDF) of possible binding positions in a protein active site. Possible binding positions are the origins of clusters composed, on the basis of root-mean-square deviations (RMSD), from the multiple ligand positions determined by a docking algorithm. The difference between the methods lies in the ways the PDF is derived. To validate the suggested methods, we compare the averaged RMSD of the predicted ligand docked positions relative to the experimentally determined positions for a set of 135 PDB protein-ligand complexes. We demonstrate that the suggested methods improve docking accuracy by as much as 21–24% in comparison with a method that simply identifies the binding position as the energy top-scored ligand position. Proteins 2006;62:202–208. © 2005 Wiley-Liss, Inc.

Key words: protein-ligand binding; docking; probability distribution function; cluster

INTRODUCTION

The development of fast and reliable numerical methods for predicting ligand positions in protein active sites and protein-ligand binding free energy is a field of active research [1–9]. These methods have significant practical importance for the discovery of new drug lead compounds [1,2]. Recent docking tests and detailed comparative analysis of the performance of different docking tools [10–17] demonstrate the dependence of docking accuracy on the quality of the intermolecular potentials describing protein-ligand interactions, the scoring methods for estimation of protein-ligand binding free energy, and the positional search and optimization methods.

Docking algorithms generate a number of different ligand positions corresponding to different local minima of the protein-ligand energy landscape. Commonly, the top-scored ligand position is accepted as the predicted binding position, and docking accuracy is assessed as the RMSD of the top-scored ligand position relative to the experimentally determined position. Alternative approaches were introduced by Dennis et al. [18], Verkhivker et al. [19], and Källblad et al. [20]. In these articles, the authors considered clusters of ligand positions that were similar to each other based on RMSD [18,20] and the intermolecular similarity coefficient [19]. To identify the most favorable binding positions, the authors ranked the representative positions of the clusters. Dennis et al. [18] and Verkhivker et al. [19] suggested ranking representative positions on the basis of the average free energy over all positions in the cluster. Källblad et al. [20] used two methods of ranking, based on the free energy of representative positions and on cluster occupancy. Considering several different test sets of protein-ligand complexes, Källblad et al. [20] noted that ranking representative positions by cluster occupancy had higher success rates than ranking by energy on one test set but lower on another. In other docking tests they found an improvement of docking accuracy when five distinct binding modes were considered. Also, trends between the cluster occupancy and predicted binding energies were observed in docking experiments by Rosenfeld et al. [21]. The clustering method for protein-protein docking was first applied by Camacho and Gatchell [22].

It is interesting to note that recent theoretical analyses of the protein folding landscape suggest that taking into account cluster occupancy allows one to distinguish near-native conformations from misfolded ones [23–25]. In most cases near-native conformations, in comparison with misfolded ones, have the greatest number of neighboring conformations within an RMSD-tolerance. Considering the problem of loop prediction, Xiang et al. [26] suggested ranking conformations using a standard energy term together with an RMSD-dependent term that favors conformations that have many neighbors in configurational space.

*Correspondence to: A. M. Ruvinsky, Force Field Laboratory, Algodign, LLC, B. Sadovaya, 8, 103379, Moscow, Russia. E-mail: [email protected]

Received 2 March 2005; revised 27 May 2005; accepted 14 June 2005

Published online 14 November 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20673

PROTEINS: Structure, Function, and Bioinformatics 62:202–208 (2006)

© 2005 WILEY-LISS, INC.

In this article we suggest two new methods for prediction of ligand-binding positions. Both methods are based on using a probability distribution function of possible representative positions in the protein active site. To derive and test the PDF, we first generate multiple binding positions for each of 135 protein-ligand complexes and carry out a clustering of ligand positions on the basis of RMSD. We choose the representative position of each cluster as the ligand position with the lowest energy in the cluster. Then we identify the native position as the representative position that corresponds to the maximum of the PDF. We suggest two methods to derive the PDF. One method leads to a PDF explicitly dependent on the energy and cluster occupancy and thus provides a physical explanation of the results obtained by Källblad et al. [20] and Rosenfeld et al. [21]. We will show that our methodology improves the percentage of the top-ranked representative positions within an RMSD of 2 Å of the experimental position (the success rate of docking) by 21–24% in comparison with the simpler method that identifies the binding position as the ligand position having the top energy score.

The organization of this article is as follows. We develop the two methods of deriving the probability distribution functions in Materials and Methods. In that section we also describe the test set of protein-ligand PDB complexes and give details of docking. In Results we compare docking accuracies of both methods in terms of the RMSD of the predicted docked ligand positions relative to the experimentally determined positions. Then we compare the performance of these methods with methods based on ranking over energy or cluster occupancy only. We summarize the results in Conclusions.

MATERIALS AND METHODS

Theory

Docking programs commonly generate a number of scored ligand positions corresponding to different local minima of the protein-ligand energy landscape. Often the docked position having the highest score, i.e., the best predicted binding energy, is not a good approximation of the experimentally determined position. One reason is that several top-scored ligand positions may have practically the same score but very different RMSDs, and the scoring function is not sufficiently accurate to distinguish between these positions. Moreover, the current methods of taking into account entropy contributions to binding free energy partly explain the weakness of scoring functions. Thus it is a complicated problem to choose the ligand position that has the minimal RMSD relative to the experimentally determined position.

We suggest considering this problem from the statistical-thermodynamic point of view. Our goal is to find the most probable ligand position among all representative positions in the protein active site found by the docking program and to analyze its RMSD relative to the experimentally determined position. The representative position of the cluster plays the role of the center for relative motions in the protein-ligand complex in solution. The ligand positions in the cluster describe these motions as snapshots. Following statistical physics, we will derive the PDF of representative positions and choose the position corresponding to the maximum of the PDF.

As a preliminary step, all docked ligand positions (Fig. 1) from a number of runs of a docking program are clustered in such a way that each cluster contains ligand positions with RMSD less than a definite value, the tolerance, relative to the ligand position having the best score (minimal energy) in the cluster. We used the clustering procedure built into AutoDock [27] for two values of the RMSD-tolerance, 1 and 2 Å. The docked ligand position with minimal energy in the cluster is designated as the representative position of the cluster. The result of the clustering procedure is a list of representative positions that have RMSDs between one another greater than the RMSD-tolerance. All other docked positions are assigned to the cluster of the nearest representative position. Thus, every ligand position is described by two indexes indicating the number of its cluster i and the number of its ligand position j inside the cluster i.
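The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not AutoDock's built-in routine: the function names `rmsd` and `cluster_poses`, the representation of a pose as a list of (x, y, z) atom coordinates, and the two-pass scheme (pick representatives in order of increasing energy, then assign every pose to its nearest representative) are assumptions made for the example.

```python
import math

def rmsd(a, b):
    """In-place RMSD (no superposition) between two poses given as
    equal-length lists of (x, y, z) atom coordinates."""
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))

def cluster_poses(poses, energies, tolerance):
    """Return (representatives, labels).

    representatives: pose indices, one per cluster, in order of increasing energy.
    labels[k]: position in `representatives` of the cluster that pose k joins.
    """
    order = sorted(range(len(poses)), key=lambda k: energies[k])
    representatives = []
    for k in order:
        # A pose founds a new cluster only if it lies farther than the
        # tolerance from every representative chosen so far.
        if all(rmsd(poses[k], poses[r]) > tolerance for r in representatives):
            representatives.append(k)
    labels = []
    for k in range(len(poses)):
        # Assign each pose to the cluster of the nearest representative.
        distances = [rmsd(poses[k], poses[r]) for r in representatives]
        labels.append(distances.index(min(distances)))
    return representatives, labels

# Hypothetical one-atom "ligands" placed along the x axis.
poses = [[(0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0)], [(5.0, 0.0, 0.0)], [(5.4, 0.0, 0.0)]]
energies = [-9.0, -8.0, -7.5, -8.5]
representatives, labels = cluster_poses(poses, energies, tolerance=1.0)
```

The cluster occupancy used later is then simply the number of poses carrying each label.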

Method 1: The probability distribution function of possible representative positions depends on the cluster occupancy and the minimal energy found in the cluster

In the Boltzmann gas approximation for bound protein-ligand complexes, the probability density of finding a protein-ligand complex with a flexible ligand at a point $r$ relative to the rigid protein is equal to

$$\rho(r) = N\gamma \exp\left(-\frac{U_{pl}(r)}{T}\right), \tag{1}$$

where $N$ is the total number of complexes,

$$\gamma^{-1} = \int_{\Omega} \exp\left(-\frac{U_{pl}(r)}{T}\right) dr, \tag{2}$$

and $\Omega$ is the ligand configurational space of $r$. Applying Monte-Carlo integration to Exp. (2), we obtain

$$\gamma^{-1} \approx \sum_i \frac{\Delta\Omega_i}{N_i} Z_i, \tag{3}$$

where the subscript $i$ enumerates clusters, $\Delta\Omega_i$ is the variation interval of $r$ in the cluster numbered $i$, and

$$Z_i = \sum_j \exp\left(-\frac{U_{pl}(r_j^i)}{T}\right). \tag{4}$$

Fig. 1. The scheme of clustering. The small circles show local minima of the protein-ligand energy landscape found in docking. The hatched circles correspond to representative positions of the clusters. The large dashed circles show multiconformational clusters of docked ligand positions.

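The number of complexes with ligands bound to a protein in the region $\Delta\Omega$ is

$$\Delta N(\Delta\Omega) = N\gamma \int_{\Delta\Omega} \exp\left(-\frac{U_{pl}(r)}{T}\right) dr. \tag{5}$$

Applying the Monte-Carlo approximation to the integral (5), we obtain

$$\Delta N(\Delta\Omega) \approx N\gamma \frac{\Delta\Omega}{N_{\Delta\Omega}} \sum_k \exp\left(-\frac{U_{pl}(r_k)}{T}\right), \tag{6}$$

where $N_{\Delta\Omega}$ is the number of ligand docked positions ($r_k$) lying in the region $\Delta\Omega$. Choosing $\Delta\Omega$ equal to the cluster volume $\Delta\Omega_i$ and introducing $P(i) = \Delta N(\Delta\Omega_i)/N$, we obtain the proportion of the complexes with ligands bound in cluster $i$

$$P(i) \approx \gamma \frac{\Delta\Omega_i}{N_i} \sum_j \exp\left(-\frac{U_{pl}(r_j^i)}{T}\right). \tag{7}$$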

Now let us use the well-known fact that the exponential average of an arbitrary function is essentially determined by the low-energy tail of the function (see, for example, Hendrix and Jarzynski [28]). Thus we can estimate

$$\frac{1}{N_i} \sum_j \exp\left(-\frac{U_{pl}(r_j^i)}{T}\right) \approx \exp\left(-\frac{\min_{\Delta\Omega_i}\left(U_{pl}(r_j^i)\right)}{T}\right). \tag{8}$$

Using Exp. (8) and $\Delta\Omega_i \approx N_i p$ [$p$ is the volume per point in the ligand configurational space ($r$)], we obtain the probability of finding the ligand in the cluster $i$

$$P(i) \approx \gamma p N_i \exp\left(-\frac{\min_{\Delta\Omega_i}\left(U_{pl}(r_j^i)\right)}{T}\right), \tag{9}$$

which explicitly depends on the cluster occupancy $N_i$ and the minimal energy $\min_{\Delta\Omega_i}(U_{pl}(r_j^i))$ of the representative position in the cluster $i$. The most probable representative position lies in the cluster with a maximal value of $P(i)$. To analyze docking accuracy we shall calculate the RMSD between the representative position in this cluster and the experimentally determined position.

If clusters have essentially close energies $\min U_{pl}$, then the cluster occupancy becomes the main factor identifying the native binding position. Indeed, comparing two clusters $k$ and $m$ with bottom energies $\min_{\Delta\Omega_k}(U_{pl}(r_j^k)) \approx \min_{\Delta\Omega_m}(U_{pl}(r_j^m))$ and cluster occupancies $N_k$ and $N_m$, we have

$$\frac{P(k)}{P(m)} = \frac{N_k}{N_m} \exp\left(-\frac{\min_{\Delta\Omega_k}\left(U_{pl}(r_j^k)\right) - \min_{\Delta\Omega_m}\left(U_{pl}(r_j^m)\right)}{T}\right) \approx \frac{N_k}{N_m}. \tag{10}$$

Thus the method of ranking by cluster occupancy is a special case of the more general method based on Eq. (9) and has a narrower field of application. This fact explains the docking results of Källblad et al. [20] for the Needles set of protein-ligand complexes. They found that “energy ranking is a better predictor for the Needles data set” than ranking by cluster occupancy.
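As a rough numerical illustration of Eq. (10), consider two hypothetical clusters with occupancies 30 and 10, where the more occupied cluster k lies 0.3 kcal/mol above cluster m. The occupancies, the energy gap, and the value T ≈ 0.593 kcal/mol (thermal energy near 298 K) are all assumptions chosen for the example, not values from the test set.

```latex
\frac{P(k)}{P(m)} = \frac{30}{10}\exp\left(-\frac{0.3}{0.593}\right)
\approx 3 \times 0.60 \approx 1.8,
\qquad
\frac{30}{10}\exp\left(-\frac{1.0}{0.593}\right)
\approx 3 \times 0.19 \approx 0.56 .
```

With the small gap the more occupied cluster remains the more probable one, whereas with a gap of 1.0 kcal/mol the energy term outweighs the threefold occupancy advantage. Both factors therefore need to be tracked.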

Equation (9) explains the dependence of docking accuracy on the cluster occupancy and minimal energy in the cluster observed in docking experiments [20,21]. In the general case it is necessary to account for both factors, namely the depth of the well in the protein-ligand energy landscape $U_{pl}$ and the volume of the well, which is directly connected with the binding entropy of relative motions through the logarithm of the binding volume (see References 29–33). In the Results section we shall compare the method identifying the binding position as the docked position with the minimal energy (the representative position) in the most occupied cluster with the method identifying the binding position as the position corresponding to the maximum of $P(i)$.

For correct use of Exp. (9) it is necessary to keep in mind that it is derived with the help of the Monte-Carlo scheme. Thus only clusters with high occupancy should be scored with Exp. (9). To differentiate between dense and sparse clusters, we introduce a lower bound $N_{lb}$ on the occupancy of dense clusters. Only clusters with $N_i \ge N_{lb}$ are scored with Exp. (9). If all clusters have occupancy lower than $N_{lb}$, we select the most occupied cluster as the cluster of the binding position, but if several clusters have the same occupancy lower than $N_{lb}$, we compare them using Exp. (9).
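The selection rule of Method 1 can be sketched as follows. This is an illustrative reading of Exp. (9) and the low-bound rule, not the authors' code: the function name `select_cluster_method1`, the list-based inputs, and the thermal energy of 0.593 kcal/mol (about 298 K) are assumptions. The cluster-independent prefactor of Exp. (9) is dropped and logarithms are compared, which leaves the ranking unchanged.

```python
import math

KT_KCAL_MOL = 0.593  # assumed thermal energy near 298 K, in kcal/mol

def select_cluster_method1(occupancies, min_energies, n_lb, temperature=KT_KCAL_MOL):
    """Return the index of the cluster chosen by Method 1.

    occupancies[i]  : number of docked positions in cluster i (N_i >= 1)
    min_energies[i] : energy of the representative position of cluster i
    n_lb            : low bound on the occupancy of dense clusters
    """
    def log_p(i):
        # log of N_i * exp(-E_min_i / T), i.e. Exp. (9) up to a constant
        return math.log(occupancies[i]) - min_energies[i] / temperature

    dense = [i for i, n in enumerate(occupancies) if n >= n_lb]
    if dense:
        candidates = dense
    else:
        # No dense cluster: fall back to the most occupied cluster(s);
        # ties in occupancy are resolved with Exp. (9).
        top = max(occupancies)
        candidates = [i for i, n in enumerate(occupancies) if n == top]
    return max(candidates, key=log_p)

# Hypothetical clusters: the sparse third cluster has the lowest energy
# but is not scored because its occupancy is below the low bound.
choice = select_cluster_method1([40, 25, 5], [-8.0, -8.1, -10.0], n_lb=20)
```

Comparing logarithms avoids overflow when energies are large in units of T.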

Method 2: Application of the energy histogram to derive the probability distribution function of possible representative positions

Here we suggest the second method of deriving the PDF of ligand representative positions. Let us consider the distribution function [34]:

$$P(i, E) = \frac{N(i; l\varepsilon_0, (l+1)\varepsilon_0)}{N_{tot}}, \tag{11}$$

where $E$ is the energy of the representative position of the energy interval $(l\varepsilon_0, (l+1)\varepsilon_0)$ in the cluster $i$; $\varepsilon_0$ is the energy step of the histogram (Fig. 2); and $N_{tot} = \sum_{i,l} N(i; l\varepsilon_0, (l+1)\varepsilon_0)$ is the total number of considered ligand positions, equal to 100 or 250 (see the Docking Method section). Here we designate the docked ligand position with minimal energy in the interval $(l\varepsilon_0, (l+1)\varepsilon_0)$ as the representative position of the interval. The problem of finding the most probable binding position is formulated as the problem of finding a maximum of the PDF (11). To choose the binding position we first find the intervals $(l, l+1)$ maximizing Exp. (11) in each cluster $i$ and then choose the interval maximizing Exp. (11) over all clusters. If several different intervals in a cluster produce the same maximum of Exp. (11), we select the interval having the lower docking energy of the representative position. If several different clusters have the same maximum of Exp. (11), we select the cluster having the lowest docking energy of the representative position as the cluster of the binding position.


It is easy to note that for ε0 greater than the energy range in clusters, this method reduces to the method of ranking by cluster occupancy. So the method of ranking by cluster occupancy is also a special case of the more general method based on Eq. (11).
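A minimal sketch of the histogram selection in Method 2 is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name `select_position_method2`, the input layout (one list of docking energies per cluster), and the bin index `floor(E / step)` are choices made for the example. Ties, both within a cluster and between clusters, are resolved here by the lowest energy of the interval's representative position, which is one possible reading of the tie-breaking rule. The constant denominator of Eq. (11) is omitted because it does not affect the maximum.

```python
import math
from collections import defaultdict

def select_position_method2(cluster_energies, step):
    """Pick the (cluster, energy interval) with the most docked positions.

    cluster_energies[i] : docking energies of the positions in cluster i
    step                : histogram step over the energy scale
    Returns (cluster index, energy of the chosen representative position).
    """
    best = None  # (count, minus representative energy, cluster index)
    for i, energies in enumerate(cluster_energies):
        bins = defaultdict(list)
        for e in energies:
            # Interval l covers energies in [l * step, (l + 1) * step).
            bins[math.floor(e / step)].append(e)
        for members in bins.values():
            # More positions wins; on a tie the lower energy wins.
            key = (len(members), -min(members))
            if best is None or key > best[:2]:
                best = (key[0], key[1], i)
    return best[2], -best[1]

# Hypothetical energies (kcal/mol) for two clusters.
clusters = [[-8.0, -7.9, -7.6, -6.1], [-9.0, -5.0]]
picked = select_position_method2(clusters, step=0.5)
```

With a very large `step` every cluster collapses into a single interval, so the function then returns the most occupied cluster, in line with the reduction noted above.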

Test Set

The suggested methods are tested on a set [35] of 135 protein-ligand complexes selected from the PDB [36]: 1a53, 1aha, 1ahb, 1akb, 1akc, 1ama, 1amq, 1ane, 1anf, 1art, 1az8, 1b0o, 1b30, 1b74, 1br5, 1br6, 1btn, 1c3j, 1c83, 1c84, 1cc7, 1cen, 1d5r, 1dar, 1dht, 1di8, 1diw, 1dpf, 1dvj, 1e1v, 1e8k, 1e8w, 1eap, 1efy, 1f9g, 1fao, 1fen, 1fgy, 1fh7, 1fh8, 1fhd, 1flr, 1flz, 1fut, 1g7s, 1gc5, 1gii, 1gij, 1gor, 1gym, 1h4h, 1h52, 1h70, 1hh8, 1hi3, 1hi4, 1hi5, 1hqp, 1i38, 1i7e, 1icm, 1ifs, 1ifu, 1ikg, 1in8, 1ioz, 1ivr, 1j01, 1jd3, 1jeo, 1jgi, 1jj0, 1jso, 1ju4, 1jvp, 1laf, 1lag, 1lbl, 1lf7, 1lif, 1lih, 1lmo, 1lr4, 1lst, 1lzg, 1lzy, 1mai, 1mdq, 1mor, 1mrg, 1mrj, 1nli, 1nst, 1o6g, 1oxp, 1pax, 1pot, 1ppa, 1ptv, 1qcf, 1qh7, 1qkq, 1qno, 1rbp, 1rms, 1rnt, 1rob, 1rpf, 1tsy, 1tti, 1ttj, 1tyb, 1wdn, 1wht, 1wrp, 2cmd, 2cyh, 2dri, 2enb, 2hmb, 2ifb, 2man, 2ovw, 2pax, 2pk4, 2sak, 2sli, 3cyh, 3eng, 3jdw, 3kiv, 3pax, 4cyh, 4rsk, 5cyh. All these entries have resolution better than 2.5 Å and contain only one protein chain and one ligand molecule with no more than 30 heavy atoms. These entries contain no metals or other molecules except water molecules. Water molecules were removed from the PDB entries before docking.

Docking Method

Using AutoDock 3.0 [27] with the Lamarckian genetic algorithm (LGA), we carried out 100 and 250 independent docking processes for every complex, assuming a rigid protein and flexible ligand molecules. In each process we created a population of 50 individuals, assigning a random set of translational coordinates of the ligand center of mass, a random ligand orientation, and random torsions to each of the 50 individuals. We used a mutation rate of 0.02, a crossover rate of 0.80, and a maximum number of energy evaluations of $10^5$. Final docked positions were clustered using an RMSD-tolerance of 1.0 Å and, separately, of 2.0 Å. For each process, the docked position having the best score was selected. We used the built-in scoring function [27], which includes four terms: a Lennard-Jones 12-6 dispersion/repulsion term, a directional 12/10 hydrogen bonding term, a screened Coulomb electrostatic potential, and a desolvation term.

RESULTS

The PDF of representative positions depends on the number of clusters and their occupancy. The number of distinct multiconformational clusters and their occupancy depend on the number of docking runs per complex and the RMSD-tolerance of clustering. For this reason, we did not use the complex 8cho in our test set, as it does not have clusters containing more than one conformation (multiconformational clusters) for either 100 or 250 runs and an RMSD-tolerance of 1 Å. For 100 docking runs and an RMSD-tolerance of 1 Å, AutoDock did not find clusters with even 2 conformations in the complex 1i7e and only found one cluster with two conformations in this complex for 250 runs. Therefore we carried out two docking tests (Tables I and II) with, respectively, 100 and 250 docking runs per complex with an RMSD-tolerance of 1 Å, and one docking test (Table III) with 100 docking runs per complex and an RMSD-tolerance of 2 Å.

To analyze the influence of the low bound of the cluster size on the performance of Method 1, we calculated the percentage of the top-ranked solutions within an RMSD of 2 Å of the experimental position (the success rate) as a function of the low bound Nlb of the cluster size for three docking tests (Fig. 3) on the test set of 134 protein-ligand complexes (excluding the complex 1i7e). We found that the success rate at first increases with increasing Nlb and then becomes essentially constant starting from Nlb equal to 20, 30, and 50 for the respective tests with the RMSD-tolerance of 1 Å and 100 docking runs per complex, with the RMSD-tolerance of 2 Å and 100 docking runs, and with the RMSD-tolerance of 1 Å and 250 docking runs. It is interesting to note that in both tests with the RMSD-tolerance of 1 Å, the maximal success rate starts with Nlb equal to 20% of the number of docking runs per complex.

Figure 3 shows that the worst results are obtained for the RMSD-tolerance of 2 Å. This decrease of the success rate for the RMSD-tolerance of 2 Å in comparison with the cases of the RMSD-tolerance of 1 Å is a result of a breach of the correspondence between clusters and the energy wells for the greater value of the RMSD-tolerance due to rearrangement of ligand positions. Further increase of the RMSD-tolerance will lead to even greater association of ligand positions belonging to different energy wells (i.e., the clusters obtained for the smaller value of the RMSD-tolerance) into one cluster and therefore to a further decrease of docking accuracy. Thus we can conclude that an RMSD-tolerance of 1 Å is the optimal value to describe protein-ligand energy wells in terms of clusters of ligand positions in our test set.

Fig. 2. Example of the histogram of the distribution function N(i, E) in the ith cluster. ε0 is the histogram step over the energy scale.

The results of ranking docked positions either by energy (min Upl(i)), by cluster occupancy (max(Ni/Ntot)), or on the basis of Methods 1 and 2 are given in Tables I–III. These results unambiguously demonstrate the advantage of both suggested Methods 1 and 2 and the method of ranking by cluster occupancy in comparison with the common method of choosing the energy top-ranked position to identify the binding position. Considering the percentage of the top-ranked solutions within an RMSD of 2 Å as a success rate, we found that the difference between the method using min Upl(i) and the suggested Methods 1 and 2 is 22–23% for the case of 100 docking runs per complex and RMSD-tolerance of 1 Å (Table I), 21–24% for the case of 250 docking runs and RMSD-tolerance of 1 Å (Table II), and 14–18% for the case of 100 docking runs and RMSD-tolerance of 2 Å (Table III). Hence, all three sets of parameters yield better results than the method min Upl(i), although the case of 100 docking runs and RMSD-tolerance of 2 Å is not as good as the other two, as mentioned above.
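The two summary statistics reported in Tables I–III, the success rate within a given RMSD and the averaged RMSD, can be computed along the following lines. The function names and the sample RMSD values are hypothetical and serve only to show the arithmetic; a threshold that includes the boundary value is assumed.

```python
def success_rate(rmsds, threshold=2.0):
    """Percentage of complexes whose predicted position lies within
    `threshold` angstroms of the experimentally determined position."""
    hits = sum(1 for r in rmsds if r <= threshold)
    return 100.0 * hits / len(rmsds)

def averaged_rmsd(rmsds):
    """Mean RMSD of the predicted positions over the test set."""
    return sum(rmsds) / len(rmsds)

# Made-up RMSDs (in angstroms) of the top-ranked position for five complexes.
sample = [0.4, 1.2, 1.9, 2.6, 5.0]
rate = success_rate(sample)
mean = averaged_rmsd(sample)
```

Evaluating `success_rate` at thresholds of 0.5, 1.0, ..., 3.0 Å gives one table column per ranking method.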

The success rate of Method 2 is stable to changes of the histogram step from 0.1 to 10.0 kcal/mol. Note that for large values of ε0 Method 2 transforms into the method of ranking by cluster occupancy, i.e., identifying the binding position as the representative position of the most occupied cluster. This results from the fact that in most complexes the energy width of clusters is lower than 1–2 kcal/mol. For several values of ε0 (Table I and ε0 = 1.0 kcal/mol; Table II and ε0 = 0.1, 0.3, 1.0 kcal/mol; and Table III and ε0 = 0.2, 0.3, 0.5, 1.0 kcal/mol), Method 2 outperforms by 1–2% the results of ranking over cluster occupancy or by Method 1. Possibly, a larger difference between Methods 1 and 2 could be found in testing of larger sets.

TABLE I. Percentage of the Top-ranked Representative Positions within a Defined RMSD from the Experimentally Determined Position, with the Number of Docking Runs per Complex Equal to 100 and the RMSD-Tolerance of 1 Å

| RMSD, Å | min Upl(i) | max(Ni/Ntot) | max P(i), Nlb = 20 | max P(i, E), ε0 = 0.1 kcal/mol | ε0 = 0.2 | ε0 = 0.3 | ε0 = 0.5 | ε0 = 1.0 | ε0 = 1.5 | ε0 = 2.0 | ε0 = 10.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ≤0.5 | 5 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 8 | 8 | 8 |
| ≤1.0 | 38 | 49 | 48 | 46 | 50 | 46 | 47 | 48 | 49 | 49 | 49 |
| ≤1.5 | 57 | 78 | 78 | 78 | 75 | 78 | 78 | 79 | 78 | 78 | 78 |
| ≤2.0 | 65 | 87 | 87 | 87 | 87 | 87 | 87 | 88 | 87 | 87 | 87 |
| ≤2.5 | 73 | 90 | 91 | 89 | 88 | 89 | 90 | 90 | 90 | 90 | 90 |
| ≤3.0 | 77 | 93 | 93 | 93 | 91 | 92 | 93 | 93 | 93 | 94 | 93 |
| Averaged RMSD, Å | 2.4 | 1.4 | 1.3 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 |
| Size of the test set | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 |

TABLE II. Percentage of the Top-ranked Representative Positions within a Defined RMSD from the Experimentally Determined Position, with the Number of Docking Runs per Complex Equal to 250 and the RMSD-Tolerance of 1 Å

| RMSD, Å | min Upl(i) | max(Ni/Ntot) | max P(i), Nlb = 50 | max P(i, E), ε0 = 0.1 kcal/mol | ε0 = 0.2 | ε0 = 0.3 | ε0 = 0.5 | ε0 = 1.0 | ε0 = 1.5 | ε0 = 2.0 | ε0 = 10.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ≤0.5 | 6 | 8 | 8 | 9 | 10 | 6 | 9 | 8 | 7 | 8 | 8 |
| ≤1.0 | 35 | 48 | 47 | 50 | 49 | 49 | 50 | 47 | 48 | 48 | 48 |
| ≤1.5 | 54 | 78 | 76 | 79 | 79 | 81 | 78 | 79 | 79 | 79 | 79 |
| ≤2.0 | 64 | 86 | 85 | 87 | 86 | 88 | 85 | 87 | 86 | 86 | 86 |
| ≤2.5 | 72 | 88 | 87 | 89 | 90 | 90 | 89 | 89 | 88 | 88 | 88 |
| ≤3.0 | 76 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93 | 93 |
| Averaged RMSD, Å | 2.5 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 | 1.4 |
| Size of the test set | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 |

TABLE III. Percentage of the Top-ranked Representative Solutions within a Defined RMSD from the Experimentally Determined Position, with the Number of Docking Runs per Complex Equal to 100 and the RMSD-Tolerance of 2 Å

| RMSD, Å | min Upl(i) | max(Ni/Ntot) | max P(i), Nlb = 30 | max P(i, E), ε0 = 0.1 kcal/mol | ε0 = 0.2 | ε0 = 0.3 | ε0 = 0.5 | ε0 = 1.0 | ε0 = 1.5 | ε0 = 2.0 | ε0 = 10.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ≤0.5 | 4 | 4 | 4 | 8 | 6 | 9 | 7 | 4 | 4 | 4 | 4 |
| ≤1.0 | 36 | 41 | 41 | 41 | 41 | 41 | 42 | 39 | 39 | 39 | 41 |
| ≤1.5 | 53 | 67 | 64 | 70 | 68 | 68 | 63 | 63 | 65 | 65 | 67 |
| ≤2.0 | 64 | 80 | 77 | 80 | 81 | 82 | 82 | 81 | 80 | 80 | 80 |
| ≤2.5 | 73 | 89 | 87 | 87 | 90 | 89 | 87 | 88 | 89 | 90 | 89 |
| ≤3.0 | 76 | 93 | 91 | 90 | 93 | 92 | 92 | 93 | 94 | 94 | 93 |
| Averaged RMSD, Å | 2.5 | 1.5 | 1.7 | 1.6 | 1.6 | 1.5 | 1.6 | 1.6 | 1.5 | 1.5 | 1.5 |
| Size of the test set | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 |

CONCLUSIONS

We presented two statistical-thermodynamic methods based on using the probability distribution function of possible ligand positions in the protein active site to predict experimentally observed native positions. We identified the native binding position as the position corresponding to the maximum of the PDF. The PDF depends on the energy spectrum and occupancy of the clusters obtained from multiple docking positions. The efficiency of the methods has been validated in docking tests on 135 protein-ligand complexes. We demonstrated that our methods improve the success rate of predicting native ligand positions by 21–24% over simply choosing the position having the top energy score and thus have significant value for ligand docking/scoring applications. The suggested methods are in principle applicable to any intermolecular potentials or scoring functions.

ACKNOWLEDGMENTS

We are grateful to B.M. Ruvinsky (Ivano-Frankivsk Precarpathian National University) for performing calculations for us with AutoDock, to A.N. Romanov for useful discussions, and to C. Queen for careful review of the manuscript.

REFERENCES

1. Stockwell BR. Exploring biology with small organic molecules. Nature 2004;432:846–854.

2. Shoichet BK. Virtual screening of chemical libraries. Nature 2004;432:862–865.

3. Brooijmans N, Kuntz I. Molecular recognition and docking algorithms. Annu Rev Biophys Biomol Struct 2003;32:335–373.

4. Halperin I, Ma B, Wolfson H, Nussinov R. Principles of docking: an overview of search algorithms and a guide to scoring functions. Prot Str Func Gen 2002;47:409–443.

5. Gohlke H, Klebe G. Approaches to the description and prediction of the binding affinity of small-molecule ligands to macromolecular receptors. Angew Chem Int Ed 2002;41:2644–2676.

6. Taylor RD, Jewsbury PJ, Essex JW. A review of protein-small molecule docking methods. J Comp Aid Mol Design 2002;16:151–166.

7. Ajay, Murcko MA. Computational methods to predict binding free energy in ligand-receptor complexes. J Med Chem 1995;38:4953–4967.

8. Gilson MK, Given JA, Head MS. A new class of models for computing receptor-ligand binding affinities. Chem Biol 1997;4:87–92.

9. Mackerell AD. Empirical force fields for biological macromolecules: overview and issues. J Comp Chem 2004;25:1584–1604.

10. Bissantz C, Folkers G, Rognan D. Protein-based virtual screening of chemical databases. 1. Evaluation of different docking/scoring combinations. J Med Chem 2000;43:4759–4767.

11. Stahl M, Rarey M. Detailed analysis of scoring functions for virtual screening. J Med Chem 2001;44:1035–1042.

12. Perez C, Ortiz AR. Evaluation of docking functions for protein-ligand docking. J Med Chem 2001;44:3768–3785.

13. Wang R, Lu Y, Wang S. Comparative evaluation of 11 scoring functions for molecular docking. J Med Chem 2003;46:2287–2303.

14. Kellenberger E, Rodrigo J, Muller P, Rognan D. Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Prot Str Funct Bio 2004;57:225–242.

15. Kontoyianni M, McClellan LM, Sokol GS. Evaluation of docking performance: comparative data on docking algorithms. J Med Chem 2004;47:558–565.

16. Wang W, Donini O, Reyes CM, Kollman PA. Biomolecular simulations: recent developments in force fields, simulations of enzyme catalysis, protein-ligand, protein-protein, and protein-nucleic acid noncovalent interactions. Annu Rev Biophys Biomol Struct 2001;30:211–243.

17. Kaplan IG. Theory of molecular interactions. Amsterdam: Elsevier; 1986. 126 p.

18. Dennis S, Kortvelyesi T, Vajda S. Computational mapping identifies the binding sites of organic solvents on proteins. Proc Natl Acad Sci USA 2002;99:4290–4295.

19. Verkhivker GM, Bouzida D, Gehlhaar DK, Rejto PA, Schaffer L, Arthurs S, Colson AB, Freer ST, Larson V, Luty BA, Marrone T, Rose PW. Hierarchy of simulation models in predicting structure and energetics of the Src SH2 domain binding to tyrosyl phosphopeptides. J Med Chem 2002;45:72–89.

20. Kallblad P, Mancera RL, Todorov NP. Assessment of multiple binding modes in ligand-protein docking. J Med Chem 2004;47:3334–3337.

21. Rosenfeld RJ, Goodsell DS, Musah RA, Morris GM, Goodin DB, Olson AJ. Automated docking of ligands to an artificial active site: augmenting crystallographic analysis with computer modeling. J Comp-Aid Mol Des 2003;17:525–536.

22. Camacho CJ, Gatchell DW. Successful discrimination of protein interactions. Prot Str Funct Gen 2003;52:92–97.

23. Shortle D, Simons KT, Baker D. Clustering of low-energy conformations near native structures of small proteins. Proc Natl Acad Sci USA 1998;95:11158–11162.

24. Bonneau R, Tsai J, Ruczinski I, Chivian D, Rohl C, Strauss CEM, Baker D. Rosetta in CASP4: progress in ab initio protein structure prediction. Prot Str Funct Gen Suppl 2001;5:119–126.

25. Wang K, Fain B, Levitt M, Samudrala R. Improved protein structure selection using decoy-dependent discriminatory functions. BMC Str Biol 2004;4:1–18.

26. Xiang Z, Soto CS, Honig B. Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci USA 2002;99:7432–7437.

27. Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, Belew RK, Olson AJ. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comp Chem 1998;19:1639–1662.

Fig. 3. The success rate of the representative positions within an RMSD of 2 Å from the experimentally determined position as a function of the lower bound of the cluster size Nlb. Squares correspond to an RMSD tolerance of 1 Å and 100 docking runs per complex; circles correspond to an RMSD tolerance of 1 Å and 250 docking runs per complex; triangles correspond to an RMSD tolerance of 2 Å and 100 docking runs per complex.

28. Hendrix DA, Jarzynski C. A “fast growth” method of computing free energy differences. J Chem Phys 2001;114:5974–5981.

29. Finkelstein AV, Janin J. The price of lost freedom: entropy of bimolecular complex formation. Prot Eng 1989;3:1–3.

30. Amzel LM. Loss of translational entropy in binding, folding and catalysis. Prot Str Funct Gen 1997;28:144–149.

31. Hermans J, Wang L. Inclusion of loss of translational and rotational freedom in theoretical estimates of free energies of binding. Application to a complex of benzene and mutant T4 lysozyme. J Am Chem Soc 1997;119:2707–2714.

32. Ruvinsky AM. On the calculation of protein-ligand binding constants. Proceed Europ Confer Comp Biol 2003;204–205.

33. Ruvinsky AM, Kozintsev AV. A new and fast statistical-thermodynamic method for computation of protein-ligand binding entropy substantially improves docking accuracy. J Comp Chem 2005;26:1089–1095.

34. Rumer YuB, Ryvkin MSh. Thermodynamics, statistical physics and kinetics. Moscow: Mir Publishers; 1980. 128 p, Book Chapter II.

35. Ruvinsky AM, Kozintsev AV. The key role of atom types, reference states and interaction cutoff radii in the knowledge-based method: new variational approach. Prot Struct Funct Bioinf 2005;58:845–854.

36. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28:235–242.
