A United-Residue Force Field for Off-Lattice Protein-Structure Simulations. I. Functional Forms and Parameters of Long-Range Side-Chain Interaction Potentials from Protein Crystal Data

A. LIWO,¹,² S. OŁDZIEJ,¹ M. R. PINCUS,³ R. J. WAWAK,² S. RACKOVSKY,⁴ H. A. SCHERAGA²

¹ Department of Chemistry, University of Gdańsk, ul. Sobieskiego 18, 80-952 Gdańsk, Poland
² Baker Laboratory of Chemistry, Cornell University, Ithaca, New York 14853-1301
³ Department of Pathology, Brooklyn Veterans Administration Medical Center, Brooklyn, New York 11209 and State University of New York, Health Science Center, Brooklyn, New York 11203
⁴ Department of Biomathematical Sciences, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, New York 10029

Received 7 June 1996; accepted 11 September 1996

Correspondence to: H. A. Scheraga; e-mail: has5@cornell.edu

This article includes Supplementary Material available from the authors upon request or via the Internet at ftp.wiley.com/public/journals/jcc/suppmat/18/849 or http://www.journals.wiley.com/jcc

Contract grant sponsors: Polish State Committee for Scientific Research, contract grant number: PB 190/T09/96/10; National Institute on Aging, contract grant number: AG 00322; National Institute of General Medical Sciences, contract grant number: GM 14312; National Science Foundation, contract grant number: MCB95-13167; National Cancer Institute, contract grant number: CA 42500

© 1997 by John Wiley & Sons, Inc. CCC 0192-8651 / 97 / 070849-25
JOURNAL OF COMPUTATIONAL CHEMISTRY, VOL. 18, NO. 7

ABSTRACT: A two-stage procedure for the determination of a united-residue potential designed for protein simulations is outlined. In the first stage, the long-range and local-interaction energy terms of the total energy of a polypeptide chain are determined by analyzing protein-crystal data and averaging the all-atom energy surfaces. In the second stage (described in the accompanying article), the relative weights of the energy terms are optimized so as to locate the native structures of selected test proteins as the lowest energy structures. The



goal of the work in the present study is to parameterize physically reasonable functional forms of the potentials of mean force for side-chain interactions. The potentials are of both radial and anisotropic type. Radial potentials include the Lennard-Jones and the shifted Lennard-Jones potential (with the shift parameter independent of orientation). To treat the angular dependence of side-chain interactions, three functional forms of the potential that were designed previously to describe anisotropic systems are evaluated: Berne-Pechukas (dilated Lennard-Jones); Gay-Berne (shifted Lennard-Jones with orientation-dependent shift parameters); and Gay-Berne-Vorobjev (the same as the preceding one, but with one more set of variable parameters). These functional forms were used to parameterize, within a short-distance range, the potentials of mean force for side-chain pair interactions that are related by the Boltzmann principle to the pair correlation functions determined from protein-crystal data. Parameter determination was formulated as a generalized nonlinear least-squares problem with the target function being the weighted sum of squares of the differences between calculated and "experimental" (i.e., estimated from protein-crystal data) angular, radial-angular, and radial pair correlation functions, as well as contact free energies. A set of 195 high-resolution nonhomologous structures from the Protein Data Bank was used to calculate the "experimental" values. The contact free energies were scaled by the slope of the correlation line between side-chain hydrophobicities, calculated from the contact free energies, and those determined by Fauchère and Pliška from the partition coefficients of amino acids between water and n-octanol. The methylene group served to define the reference contact free energy corresponding to that between the glycine methylene groups of backbone residues. Statistical analysis of the goodness of fit revealed that the Gay-Berne-Vorobjev anisotropic potential fits best to the experimental radial and angular correlation functions and contact free energies and therefore represents the free-energy surface of side-chain-side-chain interactions most accurately. Thus, its choice for simulations of protein structure is probably the most appropriate. However, the use of simpler functional forms is recommended, if the speed of computations is an issue.

© 1997 by John Wiley & Sons, Inc. J Comput Chem 18: 849-873, 1997

Keywords: protein structure prediction; united-residue representation of a polypeptide chain; potential of mean force; radial and angular distribution functions

Introduction

The force fields that use a representation of amino-acid residues as one or two interaction sites, hereafter referred to as united-residue potentials, have long been of interest in theoretical simulations of protein structure.¹–³⁹ The primary reason for this is that they involve much less computational effort than all-atom or united-atom representations of the polypeptide chain. This is especially important in protein-structure prediction, where extensive search of the conformational space of the polypeptide chain is required to locate its global minimum energy. After the global minimum energy has been found for the simplified chain, it can be converted to the all-atom chain, and limited exploration of the conformational space of the all-atom chain can then be carried out to locate the global minimum in the all-atom representation. Such a protocol has recently been developed and implemented with considerable success by Skolnick et al. in predicting the three-dimensional structures of model monomeric helical proteins,¹⁵,¹⁶,¹⁸ crambin (which also contains a β-sheet section),¹⁸ and the dimeric GCN4 leucine zipper.¹⁹ These investigators used an on-lattice representation of the polypeptide chains to obtain united-residue structures which were then converted to full-atom chains by using a set of statistical rules determined from the Protein Data Bank. In our recent reports, we have described a similar procedure, based, however, on an off-lattice model of polypeptide chains³⁸,³⁹ and a dipole-path method (based on an optimal hydrogen-bond network) to convert the α-carbon trace to an all-atom backbone.³⁸ This procedure succeeded in predicting the three-dimensional structure of the avian pancreatic polypeptide.

As mentioned previously, there are two ways to explore the conformational space of polypeptide chains with the use of a united-residue potential: the on-lattice and the off-lattice approach. In the first case, the polypeptide chain is superposed on a discrete lattice, and the number of possible conformations is, therefore, finite. In the simplest approach, the interaction potential is reduced to a set of residue-residue contact free energies.⁸–¹⁰ The rationale for such an approach was based on the assumption that side-chain packing is the principal driving force in protein folding; more recent studies, however, have shown that this assumption is probably not true.¹³ The recent approach developed in Skolnick's group incorporates many different interactions that can be responsible for protein folding: side-chain packing; local interactions; hydrogen bonding; surface energy; and cooperativity in side-chain packing and hydrogen bonding.¹²–¹⁹ The contact and hydrogen-bonding energies depend on the distance and orientation of the interacting sites. The resulting force field was able to locate the near-native structures of a number of test proteins as the lowest energy ones.¹⁴–¹⁶,¹⁸,¹⁹ The parameters of the potentials for on-lattice simulations were determined from a statistical analysis of the distributions of interacting sites obtained from the crystal data of known proteins. Because the aforementioned force field expresses most of the energy components as analytical functions of geometry, it can also be used for off-lattice simulations.

For the sake of completeness, we mention here simple lattice models of proteins in which contact free energies and other interaction parameters are assigned arbitrary values (usually three types of contacts are chosen: hydrophobic-hydrophobic; hydrophobic-hydrophilic; and hydrophilic-hydrophilic); however, such models were used to study general statistical-mechanical characteristics of polypeptide chains and the folding process, but have not yet been used for predicting the three-dimensional structures of real proteins.⁴⁰–⁴²

The united-residue potentials for off-lattice simulations have an even longer history than the on-lattice ones.¹–⁷,²⁴–³⁷,³⁹ They have also been used with considerable success to predict the three-dimensional structure of known proteins.²⁸–³¹,³⁵,³⁷,³⁸ In contrast to the on-lattice potentials, they are functions of continuous variables. Therefore, the off-lattice approach to protein folding enables local energy minimization to be carried out for generated structures. Thus, many powerful techniques for global-search minimization, such as Monte Carlo with Minimization (MCM),⁴³,⁴⁴ the diffusion equation method (DEM),⁴⁵ or the self-consistent mean torsional field (SCMTF) method,⁴⁶ can be applied. This was the rationale for choosing the off-lattice potential in the present work.

The present work was aimed at determining the long-range potential for side-chain interactions. We parameterized several functional forms for the interaction potential that also include angular dependence. This was motivated by the results of a preliminary analysis of the average dimensions of the side chains as calculated from the Protein Data Bank, which show that "geometrical" anisotropy (which can be defined as the ratio of the long axis of a side chain to the geometric average of the two shorter axes) is pronounced in almost all cases. Also, Koliński et al. noticed that the pair distributions of side chains exhibit some anisotropy, and included this effect in their on-lattice statistical potential.¹⁴ The short-range part of the potential is presented in the accompanying work.⁴⁷

Methods

REPRESENTATION OF POLYPEPTIDE CHAINS AND INTERACTION SCHEME

The united-residue model of polypeptide chains adopted in this work is a natural extension of the model developed in our earlier studies.³⁸,³⁹ The chain is represented by a sequence of α-carbons ($C^\alpha$) linked by virtual bonds with attached united side chains (SC) and united peptide groups (p) located in the middle between the consecutive α-carbons. Only the united peptide groups and united side chains serve as interaction sites, the α-carbons assisting in the definition of the geometry (Fig. 1). As in our previous model,³⁸,³⁹ all the virtual bond lengths (i.e., $C^\alpha$-$C^\alpha$ and $C^\alpha$-SC) are fixed; the $C^\alpha$-$C^\alpha$ distance is taken as 3.8 Å, which corresponds to trans peptide bonds. We allow, however, for variation of the side-chain positions with respect to the backbone ($\alpha_{SC}$ and $\beta_{SC}$), and for the variation of the virtual-bond angles, $\theta$, which were assumed fixed in our earlier approach.³⁸,³⁹
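The chain geometry described above can be made concrete with a short sketch. The following Python fragment, using only the standard library, computes a virtual-bond angle $\theta$ and a virtual-bond dihedral angle $\gamma$ from consecutive $C^\alpha$ positions. The function names, the tuple-based coordinates, and the sign convention of the dihedral are assumptions made for illustration; the text above does not prescribe them.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def virtual_bond_angle(c_prev, c_mid, c_next):
    """Angle theta (radians) at c_mid between the two virtual bonds meeting there."""
    v1 = _sub(c_prev, c_mid)
    v2 = _sub(c_next, c_mid)
    cos_t = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_t)))


def virtual_bond_dihedral(c1, c2, c3, c4):
    """Dihedral gamma (radians) about the c2-c3 virtual bond.

    The sign convention is an assumption of this sketch.
    """
    b1, b2, b3 = _sub(c2, c1), _sub(c3, c2), _sub(c4, c3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(_cross(n1, n2), b2) / _norm(b2)
    x = _dot(n1, n2)
    return math.atan2(y, x)
```

Four consecutive α-carbons define one dihedral $\gamma$, and three define one angle $\theta$; with the 3.8 Å virtual-bond length held fixed, these angles determine the α-carbon trace.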


The energy of the virtual-bond chain is expressed by:

$$
U = \sum_{i<j} U_{SC_i SC_j} + \sum_{i \neq j} U_{SC_i p_j} + w_{el} \sum_{i<j-1} U_{p_i p_j} + w_{tor} \sum_i U_{tor}(\gamma_i) + w_{loc} \sum_i \left[ U_b(\theta_i) + U_{rot}(\alpha_{SC_i}, \beta_{SC_i}) \right] + w_{corr} U_{corr} \tag{1}
$$

where $U_{SC_i SC_j}$, $U_{SC_i p_j}$, and $U_{p_i p_j}$ denote the energies of the interactions between side chains, between side chains and peptide groups, and between peptide groups, respectively, $U_{tor}(\gamma_i)$ denotes the energy of variation of the virtual-bond dihedral angle $\gamma_i$, $U_b(\theta_i)$ denotes the "bending" energy of the virtual-bond angle $\theta_i$, $U_{rot}(\alpha_{SC_i}, \beta_{SC_i})$ is the local energy of side chain $i$, $U_{corr}$ includes cooperative terms (e.g., the four-body interactions considered by Skolnick et al.,¹⁵ as will be shown in part III of the present work) and the $w$ values denote relative weights of the respective energy terms.
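As a minimal sketch of how eq. (1) combines its terms, the Python fragment below sums precomputed term values with the relative weights. The `Weights` container, the argument names, and the default weight values are hypothetical; the individual energy terms are assumed to have been evaluated elsewhere and passed in as plain lists of numbers.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Relative weights of eq. (1); the defaults are placeholders, not fitted values."""
    el: float = 1.0
    tor: float = 1.0
    loc: float = 1.0
    corr: float = 1.0


def total_energy(u_scsc, u_scp, u_pp, u_tor, u_b, u_rot, u_corr, w):
    """Weighted sum of eq. (1).

    u_scsc : side-chain/side-chain terms, one per pair i < j
    u_scp  : side-chain/peptide-group terms, one per pair i != j
    u_pp   : peptide-group/peptide-group terms, one per pair i < j - 1
    u_tor  : torsional terms, one per virtual-bond dihedral
    u_b    : bending terms, one per virtual-bond angle
    u_rot  : side-chain local terms, one per residue
    u_corr : single cooperative (correlation) term
    """
    return (sum(u_scsc)
            + sum(u_scp)
            + w.el * sum(u_pp)
            + w.tor * sum(u_tor)
            + w.loc * (sum(u_b) + sum(u_rot))
            + w.corr * u_corr)
```

Note that, as written in eq. (1), the side-chain/side-chain and side-chain/peptide-group sums carry no explicit weight, so only four weights appear in the sketch.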

GENERAL PROCEDURE OF PARAMETERIZATION

The following procedures are commonly used to parameterize united-residue potentials:

1. Direct averaging of the all-atom potentials over those degrees of freedom that are lost when passing from the all-atom to the united-residue representation of the polypeptide chain.²–⁵ This method directly implements the assumption that united-residue potentials are formally all-atom potentials averaged over some "less important" degrees of freedom (such as the dihedral angles, $\chi$, of the side chains). In this approach, functional forms are assumed for all energy terms, with parameters determined by fitting energies computed at chosen values of the variables describing the geometry of interacting sites to those obtained by direct averaging of the all-atom potentials. However, even in the earliest attempts, it was recognized that some extra terms have to be added to account for hydrophobic interactions between the side chains; in the early work of Levitt and Warshel¹ and Levitt,² those terms were assigned according to side-chain hydrophobicities.

FIGURE 1. United-residue representation of a polypeptide chain. The interaction sites are side-chain centroids of different sizes (SC) and peptide-bond centers (p) indicated by dashed circles, whereas the α-carbon atoms (small empty circles) are introduced only to assist in defining the geometry. The virtual $C^\alpha$-$C^\alpha$ bonds have a fixed length of 3.8 Å, corresponding to a trans peptide group; the virtual bond ($\theta$) and dihedral ($\gamma$) angles are variable. Each side chain is attached to the corresponding α-carbon with a fixed "bond length," $b_{SC_i}$, variable "bond angle," $\alpha_{SC_i}$, formed by $SC_i$ and the bisector of the angle defined by $C^\alpha_{i-1}$, $C^\alpha_i$, and $C^\alpha_{i+1}$, and with a variable "dihedral angle" $\beta_{SC_i}$ of counterclockwise rotation about the bisector, starting from the right side of the $C^\alpha_{i-1}$, $C^\alpha_i$, $C^\alpha_{i+1}$ frame.

2. Determination of the united-residue potentials so as to reproduce the single-body, pair, and possibly triplet distribution or correlation functions, as well as contact free energies determined from protein crystal data.²⁴–²⁶,³²–³⁵ In this case, the potential is expressed either in a functional form,²⁴–²⁶ or (as a set of values at given points) obtained by taking the negative logarithms of appropriately scaled correlation functions. The on-lattice potentials and the harmonic potentials of the distance-constraint approach of Goel and Yčas²¹ and Wako and Scheraga²²,²³ can be considered to belong to this class, although the latter²¹–²³ used the one- and two-body distribution functions directly.

3. A combination of the two preceding approaches in which some part of the potential is determined by direct averaging of the all-atom potential, for example, the local and hydrogen-bonding interactions, and some estimated from protein crystal data. Such a division is motivated by the fact that, if direct averaging is computationally feasible, as in the case of the local and hydrogen-bonding interactions, the resulting potential will always be more accurate than that calculated from experimental distribution functions, whose accuracy is severely limited by the sparse number of protein crystal data. Conversely, obtaining the hydrophobic potential by direct averaging is in most cases not feasible, owing to the large number of degrees of freedom over which averaging must be carried out (i.e., the dihedral angles, $\chi$, for each side chain) and possibly to the necessity of including explicit water molecules in the averaging. Such a combination was implemented in our earlier work.³⁸ The local-interaction and backbone hydrogen-bonding terms were determined by direct averaging of the all-atom ECEPP/2⁴⁸,⁴⁹ potential. This was motivated by the fact that local and hydrogen-bonding interactions are well represented in the ECEPP force field.⁵⁰ The hydrophobic potential was assumed to have a modified Lennard-Jones form, the parameters being assigned by the use of protein crystal data, namely the side-chain van der Waals radii² and the interresidue contact free energies.⁹

4. Determination of the parameters of the potential so as to locate the native structures as global minima for a set of training proteins and, simultaneously, introducing a large energy gap between the near-native and non-native structures. For on-lattice simulations, such an approach based on spin-glass theory was developed by Wolynes and coworkers.²⁰ A similar method was developed for off-lattice simulations by Crippen and coworkers.²⁶–³¹ In both cases, the resulting potentials appeared successful in predicting the native structures of proteins that were not included in the training sets.

In this work, we have implemented procedure 3 to determine the parameters of individual energy terms (the $U$ values). The side-chain interaction and local terms are parameterized based on correlation functions collected from the Protein Data Bank (PDB). For the peptide-group interaction potential, $U_{pp}$, we use the energy function developed, and then parameterized through averaging of the all-atom ECEPP/2 potential,⁴⁸,⁴⁹ in our earlier studies.³⁸,³⁹ The derivation of local-interaction terms $U_b(\theta)$ and $U_{rot}(\alpha_{SC}, \beta_{SC})$ from protein-crystal data will be described in an accompanying article. In the accompanying work,⁴⁷ we also describe the procedure for the determination of the relative weights so as to locate the native structures of a set of training proteins as the lowest-energy ones. Therefore, our approach is a combination of all the procedures to determine the aforementioned potential. Use of distribution functions or averaging of all-atom potentials to obtain individual energy terms allows us to collect data from the PDB or from all-atom potential functions with meaningful statistics. The use of flexible weights, which constitute a small subset of adjustable parameters, enables us to scale the individual terms so as to obtain a folding potential. The procedure for weight determination is described in the accompanying article.⁴⁷

MODELING SIDE-CHAIN INTERACTIONS

The general form of the side-chain interaction ($U_{SC_i SC_j}$) parameterized in this work is given by:

$$
U_{ij} = 4 \left[ \, |\varepsilon_{ij}| \, x_{ij}^{12} - \varepsilon_{ij} \, x_{ij}^{6} \right] \tag{2}
$$

where $\varepsilon_{ij}$ is the pair-specific van der Waals well depth, which depends on side-chain orientation for the potentials with angular dependence; as in our earlier work, $\varepsilon_{ij} > 0$ corresponds to hydrophobic-hydrophobic-type and $\varepsilon_{ij} < 0$ to hydrophobic-hydrophilic and hydrophilic-hydrophilic-type interactions. The quantity $x_{ij}$ is the reciprocal of the reduced distance between side chains; for angular-dependent potentials, it also depends on the orientation of the side chains.

We first consider radial-only potentials. We assumed the following two functional forms for $x_{ij}$:

$$
x_{ij} = \frac{\sigma_{ij}^{0}}{r_{ij}} \tag{3}
$$

$$
x_{ij} = \frac{r_{ij}^{0}}{r_{ij} + r_{ij}^{0} - \sigma_{ij}^{0}} \tag{4}
$$

Equation (3) corresponds to the Lennard-Jones (LJ)-type potential. The constant $\sigma_{ij}^{0}$, in this case, can be identified with the equilibrium van der Waals distance between side chains $i$ and $j$. Equation (4) corresponds to the shifted Lennard-Jones potential of the form proposed by Kihara⁵¹ (hereafter referred to as LJK). In this case, the quantities $\sigma_{ij}^{0} - r_{ij}^{0}$ and $r_{ij}^{0}$ can be identified with the dimensions of the "hard core" and the "soft core" of the interacting bodies, respectively; we express the hard-core diameter as a combination of two terms, $\sigma_{ij}^{0}$ and $r_{ij}^{0}$, to maintain consistency of the notation with that corresponding to the angular potentials.
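The radial forms can be sketched in a few lines of Python, reading eq. (2) as $U = 4(|\varepsilon| x^{12} - \varepsilon x^{6})$, eq. (3) as $x = \sigma^0 / r$, and eq. (4) as $x = r^0 / (r + r^0 - \sigma^0)$. The function names and the numerical values used in the usage note are illustrative assumptions, not parameters from the paper.

```python
def pair_energy(eps, x):
    """Eq. (2): U = 4 (|eps| x^12 - eps x^6).

    For eps > 0 the x^6 term is attractive; for eps < 0 both terms are repulsive.
    """
    return 4.0 * (abs(eps) * x ** 12 - eps * x ** 6)


def x_lj(r, sigma0):
    """Eq. (3): reciprocal reduced distance for the Lennard-Jones (LJ) form."""
    return sigma0 / r


def x_ljk(r, sigma0, r0):
    """Eq. (4): reciprocal reduced distance for the shifted (Kihara-type, LJK) form.

    Diverges as r approaches the hard-core size sigma0 - r0.
    """
    return r0 / (r + r0 - sigma0)
```

With these definitions, `pair_energy(eps, x_lj(r, sigma0))` vanishes at `r == sigma0` for `eps > 0` and reaches `-eps` where `x**6 == 0.5`; setting `r0 == sigma0` in `x_ljk` recovers `x_lj`.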

The constants $r_{ij}^{0}$ and $\sigma_{ij}^{0}$ can be assumed to be pair-specific or calculated from the constants that pertain to single residues:

$$
r_{ij}^{0} = r_{i}^{0} + r_{j}^{0}; \qquad \sigma_{ij}^{0} = \sqrt{\sigma_{i}^{0\,2} + \sigma_{j}^{0\,2}} \tag{5}
$$

To include angular dependence, we considered three forms of anisotropic potentials derived on the basis of the Gaussian-overlap model⁵²: modified Berne-Pechukas⁵²; Gay-Berne,⁵³,⁵⁴ which is used in liquid-crystal simulations⁵⁴,⁵⁵; and, finally, a potential developed by Vorobjev⁵⁶,⁵⁷ for nucleic-acid simulations. The latter can be considered as a generalized form of the Gay-Berne potential. These three forms will hereafter be referred to as BP, GB, and GBV, respectively.

All these potentials assume that the interacting sites are ellipsoids of revolution. We placed the centers of the ellipsoids at the centers of mass of the side chains, the long axes being assumed to be collinear with the $C^\alpha$-SC axes. To describe the relative orientation of the ellipsoids, it is sufficient to define three angles that describe the relative orientation of their long axes: $\theta_{ij}^{(1)}$, $\theta_{ij}^{(2)}$, and $\phi_{ij}$ (Fig. 2). Although such a model of angular dependence is apparently a very simplified one, it keeps the number of orientational parameters at a reasonable minimum, which enables us to collect data with meaningful statistics from the PDB.

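FIGURE 2. Definition of the orientation of two anisotropic side chains, $SC_i$ and $SC_j$, represented by ellipsoids of revolution. The relative position of the centers of the side chains is given by the vector $\mathbf{r}_{ij}$ (of length $r_{ij}$). The principal axes of the ellipsoids are assumed to be collinear with the $C^\alpha$-SC lines; their directions are given by the unit vectors $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\hat{\mathbf{u}}_{ij}^{(2)}$. The variables defining the relative orientations of the ellipsoids are the angles $\theta_{ij}^{(1)}$ (the planar angle between $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\mathbf{r}_{ij}$), $\theta_{ij}^{(2)}$ (the planar angle between $\hat{\mathbf{u}}_{ij}^{(2)}$ and $\mathbf{r}_{ij}$), and $\phi_{ij}$ (the angle of counterclockwise rotation of the vector $\hat{\mathbf{u}}_{ij}^{(2)}$ about the vector $\mathbf{r}_{ij}$ from the plane defined by $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\mathbf{r}_{ij}$ when looking from the center of $SC_j$ toward the center of $SC_i$).

The expressions for all three potentials are obtained by introducing the angular dependence into $\varepsilon_{ij}$ and $x_{ij}$ of the general expression given by eq. (2):

$$
\begin{aligned}
\varepsilon_{ij} &\equiv \varepsilon_{ij}\left(\omega_{ij}^{(1)}, \omega_{ij}^{(2)}, \omega_{ij}^{(12)}\right) = \varepsilon_{ij}^{0}\, \varepsilon_{ij}^{(1)}\, \varepsilon_{ij}^{(2)}\, \varepsilon_{ij}^{(3)} \\
\varepsilon_{ij}^{(1)} &= \left[ 1 - \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(12)\,2} \right]^{-1/2} \\
\varepsilon_{ij}^{(2)} &= \left[ 1 - \frac{\chi_{ij}^{\prime(1)} \omega_{ij}^{(1)\,2} + \chi_{ij}^{\prime(2)} \omega_{ij}^{(2)\,2} - 2 \chi_{ij}^{\prime(1)} \chi_{ij}^{\prime(2)} \omega_{ij}^{(1)} \omega_{ij}^{(2)} \omega_{ij}^{(12)}}{1 - \chi_{ij}^{\prime(1)} \chi_{ij}^{\prime(2)} \omega_{ij}^{(12)\,2}} \right]^{2} \\
\varepsilon_{ij}^{(3)} &= \left[ 1 - \alpha_{ij}^{(1)} \omega_{ij}^{(1)} + \alpha_{ij}^{(2)} \omega_{ij}^{(2)} - 0.5 \left( \alpha_{ij}^{(1)} + \alpha_{ij}^{(2)} \right) \omega_{ij}^{(12)} \right]^{2}
\end{aligned}
\tag{6}
$$

$$
x_{ij} \equiv x_{ij}\left(r_{ij}, \omega_{ij}^{(1)}, \omega_{ij}^{(2)}, \omega_{ij}^{(12)}\right) =
\begin{cases}
\dfrac{\sigma_{ij}}{r_{ij}} & \text{for the BP potential} \\[2ex]
\dfrac{\sigma_{ij}^{0}}{r_{ij} - \sigma_{ij} + \sigma_{ij}^{0}} & \text{for the GB potential} \\[2ex]
\dfrac{r_{ij}^{0}}{r_{ij} - \sigma_{ij} + r_{ij}^{0}} & \text{for the GBV potential}
\end{cases}
\tag{7}
$$

with:

$$
\sigma_{ij} = \sigma_{ij}^{0} \left[ 1 - \frac{\chi_{ij}^{(1)} \omega_{ij}^{(1)\,2} + \chi_{ij}^{(2)} \omega_{ij}^{(2)\,2} - 2 \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(1)} \omega_{ij}^{(2)} \omega_{ij}^{(12)}}{1 - \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(12)\,2}} \right]^{-1/2} \tag{8}
$$

$$
\begin{aligned}
\omega_{ij}^{(1)} &= \hat{\mathbf{u}}_{ij}^{(1)} \cdot \hat{\mathbf{r}}_{ij} = \cos\theta_{ij}^{(1)} \\
\omega_{ij}^{(2)} &= \hat{\mathbf{u}}_{ij}^{(2)} \cdot \hat{\mathbf{r}}_{ij} = \cos\theta_{ij}^{(2)} \\
\omega_{ij}^{(12)} &= \hat{\mathbf{u}}_{ij}^{(1)} \cdot \hat{\mathbf{u}}_{ij}^{(2)} = \cos\theta_{ij}^{(1)} \cos\theta_{ij}^{(2)} + \sin\theta_{ij}^{(1)} \sin\theta_{ij}^{(2)} \cos\phi_{ij}
\end{aligned}
$$

where $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\hat{\mathbf{u}}_{ij}^{(2)}$ are unit vectors along the principal axes of the interacting sites (in this work identified with the $C^\alpha$-SC axes), $\mathbf{r}_{ij}$ is the vector linking the centers of the sites, $\hat{\mathbf{r}}_{ij}$ is the corresponding unit vector, $r_{ij}$ is the distance between the side-chain centers (Fig. 2), the constants $\chi_{ij}^{(1)}$ and $\chi_{ij}^{(2)}$ are the anisotropies of the van der Waals radius, and the constants $\chi_{ij}^{\prime(1)}$ and $\chi_{ij}^{\prime(2)}$ are the anisotropies of the van der Waals well depth.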
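The anisotropic forms can be sketched as follows, under the reading of eqs. (6)-(8) in which $\sigma_{ij}$ and $\varepsilon^{(2)}_{ij}$ share the same bracketed Gaussian-overlap term (with $\chi$ and $\chi'$, respectively) and the energy is then evaluated with eq. (2). The dictionary keys, function names, and the string labels "BP", "GB", and "GBV" are conventions chosen for this illustration only.

```python
import math


def omegas(theta1, theta2, phi):
    """Direction cosines omega1, omega2, omega12 from the three angles (radians)."""
    w1 = math.cos(theta1)
    w2 = math.cos(theta2)
    w12 = w1 * w2 + math.sin(theta1) * math.sin(theta2) * math.cos(phi)
    return w1, w2, w12


def overlap_bracket(c1, c2, w1, w2, w12):
    """Bracketed term shared by sigma_ij in eq. (8) and eps2 in eq. (6)."""
    num = c1 * w1 ** 2 + c2 * w2 ** 2 - 2.0 * c1 * c2 * w1 * w2 * w12
    return 1.0 - num / (1.0 - c1 * c2 * w12 ** 2)


def sigma_ij(sigma0, chi1, chi2, w1, w2, w12):
    """Eq. (8): orientation-dependent van der Waals distance."""
    return sigma0 * overlap_bracket(chi1, chi2, w1, w2, w12) ** -0.5


def epsilon_ij(p, w1, w2, w12):
    """Eq. (6): orientation-dependent well depth eps0 * eps1 * eps2 * eps3."""
    e1 = (1.0 - p["chi1"] * p["chi2"] * w12 ** 2) ** -0.5
    e2 = overlap_bracket(p["chip1"], p["chip2"], w1, w2, w12) ** 2
    e3 = (1.0 - p["alpha1"] * w1 + p["alpha2"] * w2
          - 0.5 * (p["alpha1"] + p["alpha2"]) * w12) ** 2
    return p["eps0"] * e1 * e2 * e3


def reduced_x(kind, r, sigma, p):
    """Eq. (7): reciprocal reduced distance for the BP, GB, or GBV form."""
    if kind == "BP":
        return sigma / r
    if kind == "GB":
        return p["sigma0"] / (r - sigma + p["sigma0"])
    if kind == "GBV":
        return p["r0"] / (r - sigma + p["r0"])
    raise ValueError("kind must be 'BP', 'GB', or 'GBV'")


def anisotropic_energy(kind, r, theta1, theta2, phi, p):
    """Eq. (2) evaluated with orientation-dependent eps_ij and x_ij."""
    w1, w2, w12 = omegas(theta1, theta2, phi)
    sigma = sigma_ij(p["sigma0"], p["chi1"], p["chi2"], w1, w2, w12)
    eps = epsilon_ij(p, w1, w2, w12)
    x = reduced_x(kind, r, sigma, p)
    return 4.0 * (abs(eps) * x ** 12 - eps * x ** 6)
```

For two identical ellipsoids with $\chi^{(1)} = \chi^{(2)} = 0.6$, this sketch gives $\sigma_{ij} = 2\sigma_{ij}^0$ in the end-to-end arrangement and $\sigma_{ij} = \sigma_{ij}^0$ side by side, which is the qualitative behaviour expected of the Gaussian-overlap model.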

The angular dependence of $\varepsilon_{ij}^{(1)}$ and $x_{ij}$ arises from the extension of the Gaussian overlap potential to the LJ-type function.⁵³ Additional dependence of the van der Waals well depth on orientation in the form of $\varepsilon_{ij}^{(2)}$ has been introduced by GB.⁵³ For the original BP potential, $\varepsilon_{ij}^{(2)} = 1$, but we keep its orientational dependence to preserve the same form of the potential. The formulas are generalized in this work to the case of ellipsoids of revolution with different axes (the BP and GB potentials were originally derived for the interaction of identical ellipsoidal bodies). Finally, the function $\varepsilon_{ij}^{(3)}$ with the constants $\alpha_{ij}^{(1)}$ and $\alpha_{ij}^{(2)}$ has been introduced in this work to account for the lower symmetry of the angular distribution functions observed in protein crystals than that implied by the three potentials outlined previously. Squaring in the expressions for $\varepsilon_{ij}^{(2)}$ and $\varepsilon_{ij}^{(3)}$ is done to keep them non-negative.

As in the case of the radial potentials, the constants $\sigma_{ij}^{0}$, $r_{ij}^{0}$, $\chi_{ij}^{(1)}$, $\chi_{ij}^{(2)}$, $\chi_{ij}^{\prime(1)}$, $\chi_{ij}^{\prime(2)}$, $\alpha_{ij}^{(1)}$, and $\alpha_{ij}^{(2)}$ can be assumed to be pair-dependent or constrained to be calculated from single-residue constants. In this work, we tried both procedures. For the constants $r_{ij}^{0}$ and $\sigma_{ij}^{0}$, the formulas are given by eq. (5). For the case of different ellipsoids, the anisotropies of the van der Waals distances can be expressed by eq. (9):

$$
\chi_{ij}^{(1)} = \frac{\sigma_{i}^{\parallel 2} - \sigma_{i}^{\perp 2}}{\sigma_{i}^{\parallel 2} + \sigma_{j}^{\perp 2}}; \qquad \chi_{ij}^{(2)} = \frac{\sigma_{j}^{\parallel 2} - \sigma_{j}^{\perp 2}}{\sigma_{j}^{\parallel 2} + \sigma_{i}^{\perp 2}} \tag{9}
$$

Further, for the Gaussian overlap model, it follows⁵² that $\sigma_{i}^{\perp} = \sigma_{i}^{0}$ and $\sigma_{j}^{\perp} = \sigma_{j}^{0}$. The constants $\sigma^{\perp}$ and $\sigma^{\parallel}$ can be identified with the lengths of the short and long axes of the ellipsoids, respectively. Our variable parameters were $\sigma^{0}$ and the ratio $(\sigma^{\parallel}/\sigma^{\perp})^{2}$; the first one can be considered as a measure of the size, and the second of the anisotropy of a side chain.

The same type of dependence can be assumed for the constants $\chi'$, but then there would be too many parameters to be determined. Therefore, we assumed that the constants $\chi'$ and $\alpha$ depend on single-residue types, namely:

$$
\chi_{ij}^{\prime(1)} = \chi_{i}^{\prime}, \quad \chi_{ij}^{\prime(2)} = \chi_{j}^{\prime}; \qquad \alpha_{ij}^{(1)} = \alpha_{i}, \quad \alpha_{ij}^{(2)} = \alpha_{j} \tag{10}
$$

It should be noted that, in the case of isotropic interactions, the GB and BP potentials become the LJ potential, whereas the GBV potential becomes the LJK potential.
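The single-residue combination rules can be collected in one place. The sketch below builds the pair constants from per-residue constants, reading eq. (5) as $r^0_{ij} = r^0_i + r^0_j$ and $\sigma^0_{ij} = \sqrt{\sigma_i^{0\,2} + \sigma_j^{0\,2}}$, eq. (9) with $\sigma^{\perp} = \sigma^{0}$, and eq. (10) as a direct copy of $\chi'$ and $\alpha$. The `Residue` class, its field names, and the dictionary keys are hypothetical names introduced for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Residue:
    """Single-residue constants; field names are assumptions of this sketch."""
    sigma0: float         # sigma_i^0, identified with the short axis sigma_perp
    axis_ratio_sq: float  # (sigma_par / sigma_perp)^2, the anisotropy measure
    r0: float             # r_i^0
    chi_prime: float      # chi'_i, anisotropy of the well depth
    alpha: float          # alpha_i


def pair_constants(a, b):
    """Combine two residues' constants into pair constants via eqs. (5), (9), (10)."""
    perp_a2 = a.sigma0 ** 2
    perp_b2 = b.sigma0 ** 2
    par_a2 = a.axis_ratio_sq * perp_a2
    par_b2 = b.axis_ratio_sq * perp_b2
    return {
        "r0": a.r0 + b.r0,                               # eq. (5)
        "sigma0": math.sqrt(perp_a2 + perp_b2),          # eq. (5)
        "chi1": (par_a2 - perp_a2) / (par_a2 + perp_b2),  # eq. (9)
        "chi2": (par_b2 - perp_b2) / (par_b2 + perp_a2),  # eq. (9)
        "chip1": a.chi_prime,                            # eq. (10)
        "chip2": b.chi_prime,
        "alpha1": a.alpha,
        "alpha2": b.alpha,
    }
```

When both residues are spherical (`axis_ratio_sq == 1.0`), `chi1` and `chi2` vanish, so $\sigma_{ij}$ no longer depends on orientation; this is the isotropic limit in which the anisotropic forms reduce to the radial ones.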

PARAMETERIZING SIDE-CHAIN INTERACTION POTENTIALS

Similar to earlier work,²⁴–²⁶ we determine the parameters of the potentials introduced in the preceding section by fitting them to correlation functions and contact free energies calculated from protein-crystal data. In doing this, we make the following two assumptions:

1. The correlation functions obtained by using a sufficiently large number of protein crystal data (each of which corresponds to a system at a free-energy minimum) are sufficiently good approximations to the correlation functions of a hypothetical "stochastic" mixture of nonconnected side chains. This approximation is justified by the observation that, although a crystal structure is at equilibrium as the whole structure, its individual parts can be forced to assume geometries far from locally equilibrated, locally lower energy conformations having, however, higher probability of occurrence in the whole structure.⁵⁸ For example, the distributions of X-H bond lengths obtained from large data bases of crystal structures are qualitatively similar to those calculated from potential-energy surfaces of proton transfer.⁵⁸

2. Interactions between the side chains can be described with sufficient accuracy by using the potential of mean force, $W_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$, related directly to the corresponding side-chain pair correlation functions, $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$:

$$
g_{ij}\left(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij}\right) = \exp\left[ -W_{ij}\left(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij}\right) / RT \right] \tag{11}
$$

where $R$ and $T$ are the gas constant and the absolute temperature, respectively. According to point 1, the reference state in eq. (11) corresponds to a hypothetical polypeptide chain with noninteracting side chains (the unfolded state according to the classification of Godzik et al.⁵⁹).

Because we want to exclude the effects of local interactions [since the local interactions are included in the terms $U_b(\theta)$, $U_{tor}(\gamma)$, and $U_{rot}(\alpha_{SC}, \beta_{SC})$ of eq. (1)], we consider only the side chains that are separated by at least ten peptide groups. This also makes it legitimate to disregard the direction of the chain separating the residues; therefore, we assume that $W_{i_k, j_l} = W_{i_l, j_k}$, where $i_k$ denotes a residue of type $i$ occupying the $k$th position in the chain.
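The Boltzmann relation of eq. (11) can be inverted to turn a measured correlation-function value into a potential of mean force, $W = -RT \ln g$. The sketch below does this for a single value. The choice of kcal/mol units, the default temperature of 298 K, the function names, and the convention of returning infinity for an empty bin are all assumptions of this illustration.

```python
import math

# Gas constant in kcal/(mol K); the unit choice is an assumption of this sketch.
R_KCAL = 1.987204e-3


def pmf_from_g(g, temperature=298.0):
    """Invert eq. (11): W = -RT ln g. Returns +inf where g is not positive."""
    if g <= 0.0:
        return math.inf
    return -R_KCAL * temperature * math.log(g)


def g_from_pmf(w, temperature=298.0):
    """Eq. (11): g = exp(-W / RT)."""
    return math.exp(-w / (R_KCAL * temperature))
```

A value $g > 1$ (the pair occurs more often than in the reference state) maps to $W < 0$, and $g = 1$ maps to $W = 0$.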

To avoid the influence of many-body and boundary effects on the distributions at large distances, we confine our treatment to a short-range distance limit $r_{ij} \le r_{ij}^{max}$, and we express the potential of mean force by one of the analytical forms given by eqs. (2)-(8). The upper distance limits $r_{ij}^{max}$ are defined so that only the regions of the first peak of the correlation functions are considered, in which the potential of mean force is unimodal in $r_{ij}$:

$$
r_{ij}^{max} = \min\left\{ 0.5 \left( r_{i}^{0;L} + r_{j}^{0;L} \right) + 2\ \text{Å},\ 8\ \text{Å} \right\} \tag{12}
$$

where $r^{0;L}$ are the mean side-chain van der Waals radii for each of the 20 naturally occurring amino acids calculated by Levitt.²
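As a small illustration of the distance cutoff, the function below assumes that eq. (12) reads $r_{ij}^{max} = \min\{0.5\,(r_i^{0;L} + r_j^{0;L}) + 2\ \text{Å},\ 8\ \text{Å}\}$. The function and parameter names are hypothetical, and the radii used in any call must be supplied by the user; none are taken from the paper here.

```python
def r_max(r0_i, r0_j, margin=2.0, cap=8.0):
    """Upper distance limit (in angstroms) for a side-chain pair, per the assumed eq. (12).

    r0_i, r0_j : mean side-chain van der Waals radii of the two residue types
    margin     : additive term (2 angstroms in the assumed reading)
    cap        : absolute upper bound (8 angstroms in the assumed reading)
    """
    return min(0.5 * (r0_i + r0_j) + margin, cap)


def within_first_peak(r, r0_i, r0_j):
    """True if a pair distance r falls inside the short-range fitting window."""
    return r <= r_max(r0_i, r0_j)
```

Only pair distances that pass `within_first_peak` would contribute to the fitted region of the correlation functions under this reading.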

For the LJ and LJK potentials [eqs. (3) and (4)], which depend only on the distance between the side chains, $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$ of eq. (11) is replaced by the radial-only correlation function, $g_{ij}(r_{ij})$ [which is equivalent to $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$ averaged over the angles $\theta_{ij}^{(1)}$, $\theta_{ij}^{(2)}$, and $\phi_{ij}$]. Thus, the potentials of mean force and, in turn, the correlation functions depend parametrically on the constants appearing in eqs. (3)-(8), which can be optimized so that the theoretical correlation functions given by eq. (11) fit best (in the least-squares sense) to the correlation functions determined from protein crystal data.

The limited number of data that we are able to collect from protein crystals prohibits the direct use of the full radial-and-angular correlation function $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$. Therefore, our target function for parameter estimation includes correlation functions that are averaged over some of the variables, and side-chain contact free energies. The side-chain contact free energies are the logarithms of the correlation functions averaged over the coordination sphere of the interacting side chains. To determine the parameters of the potentials, we minimized the weighted sum of the squares $\Phi(\mathbf{X})$ of the differences between histograms of the radial ($H^{r}_{ij;k}$), radial-angular ($H^{r\theta}_{ij;klm}$), and angular ($H^{\theta\phi}_{ij;klm}$) correlation functions, as well as contact free energies ($F_{ij}$), calculated as functions of the parameters, and determined from protein-crystal data, respectively:

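    $$\begin{aligned} \Phi(X) = {} & \sum_{i=1}^{20} \sum_{j=1}^{20} w_{ij} \Bigg\{ w^{r} \sum_{k=1}^{n_{ij}^{r}} \left[H_{ij;k}^{r}(X) - \hat{H}_{ij;k}^{r}\right]^2 + w^{\theta\phi} \sum_{k=1}^{n_\theta} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\phi} \left[H_{ij;klm}^{\theta\phi}(X) - \hat{H}_{ij;klm}^{\theta\phi}\right]^2 \\ & + w^{r\theta} \sum_{k=1}^{n_{ij}^{r}} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\theta} \left[H_{ij;klm}^{r\theta}(X) - \hat{H}_{ij;klm}^{r\theta}\right]^2 + w^{F} \left[F_{ij}(X) - \hat{F}_{ij}\right]^2 \Bigg\} \\ = {} & w^{r}\Phi^{r}(X) + w^{\theta\phi}\Phi^{\theta\phi}(X) + w^{r\theta}\Phi^{r\theta}(X) + w^{F}\Phi^{F}(X) \end{aligned} \qquad (13)$$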

    where the indices i, j run over all 20 naturally occurring amino acids. Also:

    $$w_{ij} = \frac{\sum_{p=1}^{np} w_p\, n_{ij;p}}{\sum_{i \le j} \sum_{p=1}^{np} w_p\, n_{ij;p}} \qquad (14)$$

    is the statistical weight of the pair of side chains of types i and j ($w_p$ and $np$ being the statistical weight of the pth protein and the total number of protein structures in the sample, respectively, and $n_{ij}$ and $n_{ij;p}$ being the total number of pairs of residues of types i and j and the number of such pairs in protein p, respectively); $n_{ij}^{r}$ is the number of distance values considered for a pair of side chains of types i and j; $n_\theta$ and $n_\phi$ are the numbers of values of the angles $\theta^{(1)}$ or $\theta^{(2)}$, and $\phi$, respectively; $w^{\theta\phi}$, $w^{r}$, $w^{r\theta}$, and $w^{F}$ are weights of the histograms and of the free energy, respectively; X is a shorthand for the parameters of the target potential, and a "hat" (circumflex) over a quantity designates the value determined from the crystal data. For the radial-only potentials, LJ and LJK, the angular and radial-angular components $\Phi^{\theta\phi}(X)$ and $\Phi^{r\theta}(X)$ do not appear in the expression for $\Phi(X)$.
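The structure of the target function of eq. (13) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: every pair is given the same number of radial bins (in the paper this number depends on the pair), the histograms are stored as NumPy arrays whose leading axis enumerates the side-chain pairs, and the dictionary keys and function name are hypothetical. The default term weights follow the 1:20:20:20 ratio reported later in the text.

```python
import numpy as np

# Term weights w^r : w^(theta,phi) : w^(r,theta) : w^F = 1 : 20 : 20 : 20 (see text).
DEFAULT_TERM_WEIGHTS = {"r": 1.0, "thetaphi": 20.0, "rtheta": 20.0, "F": 20.0}


def target_function(model, data, w_pair, w_terms=DEFAULT_TERM_WEIGHTS):
    """Weighted sum of squares in the spirit of eq. (13).

    model, data: dicts with keys 'r', 'thetaphi', 'rtheta' (histograms, leading
    axis = pair index) and 'F' (one contact free energy per pair).
    w_pair: statistical weights w_ij, one per pair.
    Returns the total and the unweighted components Phi^r, Phi^(theta,phi), ...
    """
    parts = {}
    for key in ("r", "thetaphi", "rtheta"):
        sq = (model[key] - data[key]) ** 2
        per_pair = sq.reshape(sq.shape[0], -1).sum(axis=1)  # sum over all bins
        parts[key] = float(np.dot(w_pair, per_pair))
    parts["F"] = float(np.dot(w_pair, (model["F"] - data["F"]) ** 2))
    total = sum(w_terms[key] * parts[key] for key in parts)
    return total, parts
```

For the radial-only LJ and LJK potentials, the same function could be used with the angular and radial-angular weights set to zero.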

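    The histograms of the correlation functions $H_{ij;k}^{r}$, $H_{ij;klm}^{r\theta}$, and $H_{ij;klm}^{\theta\phi}$ are defined by:

    $$H_{ij;k}^{r} = \frac{g_{ij}^{r}(r_k)}{\sum_{k=1}^{n_{ij}^{r}} g_{ij}^{r}(r_k)}$$

    $$H_{ij;klm}^{\theta\phi} = \frac{g_{ij}^{\theta\phi}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)\, \Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}}{\sum_{k=1}^{n_\theta} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\phi} g_{ij}^{\theta\phi}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)\, \Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}}$$

    $$H_{ij;klm}^{r\theta} = \frac{g_{ij}^{r\theta}(r_k, \theta_l^{(1)}, \theta_m^{(2)})\, \Delta\cos\theta_l^{(1)}\, \Delta\cos\theta_m^{(2)}}{\sum_{k=1}^{n_{ij}^{r}} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\theta} g_{ij}^{r\theta}(r_k, \theta_l^{(1)}, \theta_m^{(2)})\, \Delta\cos\theta_l^{(1)}\, \Delta\cos\theta_m^{(2)}} \qquad (15)$$

    where $g_{ij}^{r}(r_k)$, $g_{ij}^{\theta\phi}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)$, and $g_{ij}^{r\theta}(r_k, \theta_l^{(1)}, \theta_m^{(2)})$ are the average values of the radial, angular, and radial-angular correlation functions within bins defined as $[r_k - \Delta r/2,\ r_k + \Delta r/2]$, $[\theta_k^{(1)} - \Delta\theta/2,\ \theta_k^{(1)} + \Delta\theta/2] \times [\theta_l^{(2)} - \Delta\theta/2,\ \theta_l^{(2)} + \Delta\theta/2] \times [\phi_m - \Delta\phi/2,\ \phi_m + \Delta\phi/2]$, and $[r_k - \Delta r/2,\ r_k + \Delta r/2] \times [\theta_l^{(1)} - \Delta\theta/2,\ \theta_l^{(1)} + \Delta\theta/2] \times [\theta_m^{(2)} - \Delta\theta/2,\ \theta_m^{(2)} + \Delta\theta/2]$, respectively, with $\Delta r$, $\Delta\theta$, and $\Delta\phi$ being the dimensions of the bins. The definitions and method of calculation of the correlation functions from protein crystal data are given in the Appendix.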

    The purpose of the introduction of the factors $\Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}$ into the angular and radial-angular terms is to avoid overweighting the regions around $\theta^{(1)}$ or $\theta^{(2)}$ = 0° or 180°, in which the number of counts is very small, which results in poor accuracy of the angular-distribution functions there.
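The effect of the $\Delta\cos\theta$ factors in the angular histogram of eq. (15) can be seen in a short sketch. It assumes equal-width bins in $\theta$ over [0, π] and takes $\Delta\cos\theta_k$ to be the width of bin k measured in $\cos\theta$; both the binning and the function name are assumptions made here for illustration.

```python
import numpy as np


def angular_histogram(g):
    """Normalized angular histogram in the spirit of eq. (15).

    g: array of shape (n_theta, n_theta, n_phi) holding bin-averaged values
    of the angular correlation function.
    """
    n_theta = g.shape[0]
    assert g.shape[1] == n_theta, "both theta axes must have the same binning"
    edges = np.linspace(0.0, np.pi, n_theta + 1)
    dcos = np.cos(edges[:-1]) - np.cos(edges[1:])  # positive bin widths in cos(theta)
    weighted = g * dcos[:, None, None] * dcos[None, :, None]
    return weighted / weighted.sum()
```

For a flat correlation function, the bins next to 0° and 180° then receive a much smaller share of the histogram than the bins near 90°, which is the intended down-weighting of the poorly sampled polar regions.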

    The free energies of contact interaction were calculated from protein crystal data using the quasichemical approximation procedure developed by Miyazawa and Jernigan.9 We chose this approach because it takes into account the fact that competitive interactions between residues of different types occur in real proteins. We chose a radius of 8 Å for the coordination sphere, which is greater than the 6.5-Å radius used by Miyazawa and Jernigan. The reason for this was that not many hydrophilic contacts are present within the 6.5-Å coordination sphere, which results in poorer statistics. Then, we scaled these free energies to be compatible with free energies of transfer of amino-acid side chains from n-octanol to water determined by Fauchère and Pliška60; this will be described in the subsection "Contact Free Energies" of the "Results" section. The corresponding "theoretical" values of the free energies are calculated from:

    $$F_{ij}(X) = -RT \ln\left[\frac{1}{V_{ij}^{c}} \int_{\Omega_{ij}^{c}} g_{ij}(\Delta, \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV\right] \qquad (16)$$

    where $\Omega_{ij}^{c} = S(0, R_c) - S(0, r_i^{c} + r_j^{c})$ is the allowed coordination sphere corresponding to pair ij [$S(a, r)$ denoting the region of space bounded by a sphere of radius r centered at point a], $V_{ij}^{c}$ is the volume of $\Omega_{ij}^{c}$, $R_c$ is the maximum radius of the coordination sphere (assumed to be 8 Å in this work), and $r_i^{c}$ and $r_j^{c}$ are the contact radii of the side chains calculated from their volumes (see Table III in ref. 9).

    Because the relative weights of the angular, radial-angular, radial, and contact-free-energy terms in eq. (13) were not known a priori, estimation of the parameters by minimization of expression (13) is a generalized least-squares problem.61 In such a case, the weights are usually estimated as the inverses of the squares of the residuals, and the resulting sum of squares is minimized successively with iteratively updated weights, until the calculated and the assumed weights are consistent. Thus, the estimates $\tilde{w}^{r}$, $\tilde{w}^{r\theta}$, $\tilde{w}^{\theta\phi}$, and $\tilde{w}^{F}$ of the weights of the terms in eq. (13) can be expressed by:

    $$\tilde{w}^{r} = 1/\sigma_{r}^{2} = \left\{ \frac{1}{Sw} \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \frac{1}{n_{ij}^{r}} \sum_{k=1}^{n_{ij}^{r}} \left[H_{ij;k}^{r}(X) - \hat{H}_{ij;k}^{r}\right]^2 \right\}^{-1}$$

    $$\tilde{w}^{\theta\phi} = 1/\sigma_{\theta\phi}^{2} = \left\{ \frac{1}{Sw\, n_\theta^{2} n_\phi} \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \sum_{k=1}^{n_\theta} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\phi} \left[H_{ij;klm}^{\theta\phi}(X) - \hat{H}_{ij;klm}^{\theta\phi}\right]^2 \right\}^{-1}$$

    $$\tilde{w}^{r\theta} = 1/\sigma_{r\theta}^{2} = \left\{ \frac{1}{Sw\, n_\theta^{2}} \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \frac{1}{n_{ij}^{r}} \sum_{k=1}^{n_{ij}^{r}} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\theta} \left[H_{ij;klm}^{r\theta}(X) - \hat{H}_{ij;klm}^{r\theta}\right]^2 \right\}^{-1}$$

    $$\tilde{w}^{F} = 1/\sigma_{F}^{2} = \left\{ \frac{1}{Sw} \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \left[F_{ij}(X) - \hat{F}_{ij}\right]^2 \right\}^{-1} \qquad (17)$$

    where $\sigma^2$ is a variance of the corresponding quantity, $Sw = \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}$, and the other symbols are defined in eq. (13).

    The standard deviations of the parameters were estimated according to the Gauss–Markov formula62:

    $$[\sigma(x_i)]^2 = \frac{\Phi(X^*)}{N - p} \left\{\left[J^T(X^*)\, W(X^*)\, J(X^*)\right]^{-1}\right\}_{ii} \qquad (18)$$

    where $N = 210\, n_\theta^{2} n_\phi + (n_\theta^{2} + 1) \sum_{i=1}^{20} \sum_{j=i}^{20} n_{ij}^{r} + 210$ is the total number of terms in eq. (13), p is the total number of parameters, $J_{ij} = \partial d_i / \partial x_j$ is the element of the matrix of first derivatives of the residuals $d_1, d_2, \ldots, d_N$ that occur in the sum of the squares; W is the corresponding matrix of weights, and X* denotes the vector of the parameters at the minimum.
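The weight estimates of eq. (17) and the parameter standard deviations of eq. (18) can be sketched with NumPy. The sketch assumes that W is diagonal and that the mean squared residual of each pair has already been computed; the function names and argument layout are hypothetical.

```python
import numpy as np


def term_weight(mean_sq_residual_per_pair, w_pair):
    """Eq. (17)-style estimate for one term of eq. (13).

    The variance is the w_ij-weighted average (normalized by Sw = sum of w_ij)
    of the per-pair mean squared residuals; the weight is its inverse.
    """
    variance = np.dot(w_pair, mean_sq_residual_per_pair) / np.sum(w_pair)
    return 1.0 / variance


def parameter_std(jacobian, weights, phi_min):
    """Standard deviations of the parameters from eq. (18).

    jacobian: N x p matrix of first derivatives of the residuals at the minimum.
    weights: length-N diagonal of the weight matrix W (diagonal W is assumed).
    phi_min: value of the sum of squares at the minimum.
    """
    n_terms, n_par = jacobian.shape
    normal = jacobian.T @ (weights[:, None] * jacobian)
    covariance = phi_min / (n_terms - n_par) * np.linalg.inv(normal)
    return np.sqrt(np.diag(covariance))
```

In an iterative scheme, `term_weight` would be re-evaluated after each minimization until the assumed and the calculated weights agree.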

    SELECTION OF PROTEIN STRUCTURES

    The protein crystal structures were taken from the Brookhaven Protein Data Bank. First, the list of available structures obtained from the PDB server (pdb.pdb.bnl.gov) as of June 25, 1994 was scanned, and those with a resolution not exceeding 2 Å and having a chain length of at least 100 amino-acid residues were selected. Then, the percentage of sequence homology was calculated for all pairs of sequences using the FASTA program63,64 available on anonymous ftp from uvaarpa.virginia.edu. Then, cluster analysis was carried out with the minimal-tree algorithm,65 taking the values of (100% − percentage homology) as distances between pairs of structures. This grouped the selected proteins into 154 families of homologous structures. From each family, those structures were taken that had the highest resolution or, if the resolution was the same, the longest chain(s). In several cases, however, both criteria were satisfied by more than one structure. In such a case, we took all the structures satisfying the criteria, diminishing their statistical weights when calculating the histograms of pair-correlation functions and contact free energies. The final list contained 195 structures, whose identities are summarized in Table I of the Supplementary Material.
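The selection procedure can be illustrated by a small Python sketch. It is not the program used in this work: families are formed here by single-linkage merging of all pairs whose distance (100% − percentage homology) does not exceed a cutoff, which yields the same groups as cutting the long edges of a minimal spanning tree. The cutoff value, the structure identifiers, and the function names are assumptions made for illustration.

```python
def families(names, homology, cutoff):
    """Group structures whose distance (100 - % homology) is <= cutoff.

    homology: dict mapping (name_a, name_b) to percentage homology.
    Single-linkage merging via union-find.
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), percent in homology.items():
        if 100.0 - percent <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def representatives(family, resolution, length):
    """Keep the best-resolved structure(s); break ties by the longest chain.

    All structures that still tie on both criteria are returned.
    """
    best = min(resolution[name] for name in family)
    candidates = [name for name in family if resolution[name] == best]
    longest = max(length[name] for name in candidates)
    return [name for name in candidates if length[name] == longest]
```

When `representatives` returns more than one structure for a family, the number returned plays the role of the number of homologous structures by which the statistical weights are diminished.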

    Results and Discussion

    DISTRIBUTION AND CORRELATION FUNCTIONS

    In all calculations, we assumed that the centers of interactions are in the geometric centers of the side chains, calculated from the coordinates of the nonhydrogen atoms, including $C^{\alpha}$, as expressed by:

    $$\mathbf{R}_i = \frac{1}{NH_i} \sum_{j=1}^{NH_i} \mathbf{r}_{ji} \qquad (19)$$

    where $\mathbf{R}_i$ represents the coordinates of the geometric center of the ith side chain, $\mathbf{r}_{ji}$ represents the coordinates of the jth nonhydrogen atom of the ith side chain, and $NH_i$ is the number of nonhydrogen atoms in side-chain i. The index i denotes an individual side chain in the data base and not the side-chain type. For glycine the position of the side-chain atom coincides with the position of $C^{\alpha}$.
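Equation (19) amounts to an unweighted mean of atomic positions. A minimal NumPy sketch follows; the function name and the coordinates in the usage note are hypothetical.

```python
import numpy as np


def side_chain_center(coords):
    """Geometric center of eq. (19).

    coords: sequence of (x, y, z) positions of the nonhydrogen side-chain
    atoms, with C-alpha included. For glycine only C-alpha is passed.
    """
    coords = np.asarray(coords, dtype=float)
    return coords.mean(axis=0)
```

For three atoms at (0, 0, 0), (2, 0, 0), and (1, 3, 0) the center is (1, 1, 0); for a single atom the function returns that atom's position, which reproduces the glycine convention.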

    When calculating the pair distributions and contact free energies, we excluded disulfide-bonded cystine pairs; however, the nonbonded cysteine pairs were included. The weights of the structures were calculated from the following formula:

    $$w_p = \frac{1}{n_{chain}\, n_{hom}\, res^{2}} \qquad (20)$$

    where $n_{chain}$ is the number of equivalent chains in a protein, $n_{hom}$ is the number of homologous structures, if more than one was taken from a family, and res is the crystallographic resolution. The weights, $w_p$, appear in equations in the Appendix.
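As a quick numerical illustration of eq. (20), the following sketch evaluates the structure weight; the function name and the input values in the usage note are hypothetical.

```python
def structure_weight(n_chain, n_hom, resolution):
    """Statistical weight of a structure, eq. (20); resolution in angstroms."""
    return 1.0 / (n_chain * n_hom * resolution ** 2)
```

A dimer solved at 2.0 Å that is the only representative of its family would receive `structure_weight(2, 1, 2.0)`, that is, 0.125, whereas a single-chain structure at 1.0 Å would receive a weight of 1.0.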

    The single-body densities of the amino-acid side chains were calculated from eq. (A-12) of the Appendix, and were subsequently used in the evaluation of the factor $T_{ij}$ used for the calculation of reference (i.e., in the absence of any side-chain interactions) radial and radial-angular probability distributions [eqs. (A-7) and (A-9) of the Appendix]. A sample collection of radial pair densities together with the reference and total pair distribution functions is shown for the Leu–Leu pair in Figure 3. The radial distribution (curve C of Fig. 3) qualitatively resembles that calculated theoretically by Gan and Eu66 in their study of model van der Waals polymer chains. As shown, the distribution calculated from single-body density and the Markovian factor (curve B of Fig. 3) approximates quite well the Leu–Leu pair distribution function (curve C of Fig. 3) for distances longer than 10 Å. The correlation function (curve A of Fig. 3) at distances longer than 10 Å is almost constant. Greater deviations occur only at very long distances; this can be explained by the fact that the single-body density is determined with poor accuracy at longer distances.

    To calculate the reference angular distribution functions, we averaged the computed angular distribution functions over all pairs of side chains, using the method of Hao et al.67 (see eq. (A-13) and the following text in the Appendix for details). However, the angular pair correlation functions, calculated from eq. (A-5), still exhibited the behavior of the "background" correlation function (solid curve of Fig. 4) for many pairs of residues. By least-squares fitting, we found that the "background" angular pair correlation function can be described by $\epsilon^{(3)}$ of eq. (6). Therefore, we included $\epsilon^{(3)}$ in the angular potentials.

    FIGURE 3. Sample pair-distribution and pair-correlation functions for the Leu–Leu pair averaged over consecutive 0.5-Å shells. (A) Radial pair correlation functions $g_{ij}^{r}$; (B) the reference pair distribution function $n^{(2,0,r)}$ [denominator in eq. (A-4)]; and (C) the total pair distribution function $n^{(2,r)}$ [numerator in eq. (A-4)]. All graphs were normalized to the maximum values of 1.0.


    FIGURE 4. The angular correlation functions $g^{\theta\phi}(\theta^{(1)}, \theta^{(2)}, \phi)$ averaged over all pairs of side-chain types and over the angle $\phi$ for display purposes. The surfaces were normalized so that 1 is the maximum value for both. Solid surface: the function obtained from the PDB averaged over all the pairs of side chains. Dashed surface: the function obtained in simulation studies. The latter were carried out by generating a total of 1000 50-residue energy-minimized chains with random sequence confined to the ellipsoid characteristic of proteins of this size, according to the method of Hao et al.,67 with the united-residue representation of polypeptide chains and the energy function developed in our earlier work38,39; that is, the side-chain interaction potential did not have explicit angular dependence. The simulation study shows that the background angular pair correlation function is not constant, even for a radial side-chain interaction potential.

    CONTACT FREE ENERGIES

    The calculated contact free energies, together with the numbers of contacts, are summarized in Table I.

    Even though the set of 195 structures (see Table I of the Supplementary Material) has almost no structure in common with that used by Miyazawa and Jernigan (MJ),9 the computed contact free energies correlate well with those determined by MJ, the correlation equation being:

    $$\epsilon_{ij}^{MJ} = 1.580(0.028)\, \epsilon_{ij}(R_c = 8\ \text{Å}) + 2.135(0.093);\quad R = 0.9689 \qquad (21)$$

    where R denotes the correlation coefficient, and the numbers in parentheses the standard deviations of the slope and intercept, respectively. When the radius of the coordination sphere was taken as 6.5 Å (as used by MJ), and the contact free energies re-evaluated, the correlation equation became:

    $$\epsilon_{ij}^{MJ} = 0.844(0.026)\, \epsilon_{ij}(R_c = 6.5\ \text{Å}) + 0.45(0.11);\quad R = 0.9159 \qquad (22)$$

    Thus, the slope from eq. (22) is closer to 1.0, as expected.

    The correlation coefficient of our contact free energies with the contact free energies derived by Tanaka and Scheraga8 with a smaller data base, with the definition of contact similar to that used in our work except that their distances were measured between the α-carbons, is 0.8670. By contrast, the correlation of our contact free energies with those determined by Koliński et al.14 or Gregoret and Cohen10 is quite weak, the correlation coefficients being 0.6019 and 0.2261, respectively. This is understandable because the reference state in the latter two potentials corresponds to side chains arranged in a hydrophobic core and a hydrophilic exterior (the $U_{phil;phob}$ state according to

    TABLE I. Contact Free Energies (RT Units; Diagonal and Upper Triangle) and Total Number of Counts Within the 8-Å Coordination Sphere (Lower Triangle, and the Last Line for Diagonal Elements) for Pairs of Amino-Acid Residues.

             Cys   Met   Phe   Ile   Leu   Val   Trp   Tyr   Ala   Gly   Thr   Ser   Gln   Asn   Glu   Asp   His   Arg   Lys   Pro
    Cys    -4.54 -4.72 -4.94 -4.90 -4.72 -4.53 -4.60 -4.08 -3.93 -3.68 -3.73 -3.56 -3.31 -3.14 -2.97 -3.12 -3.87 -3.02 -2.71 -3.50
    Met      198 -4.80 -5.03 -4.95 -4.91 -4.59 -4.87 -4.30 -3.93 -3.40 -3.62 -3.35 -3.24 -3.04 -2.90 -2.80 -3.80 -3.15 -2.58 -3.43
    Phe      476   735 -5.22 -5.25 -5.18 -4.89 -5.07 -4.43 -4.04 -3.52 -3.70 -3.46 -3.25 -3.17 -2.89 -2.95 -3.95 -3.26 -2.70 -3.54
    Ile      541   817  1920 -5.29 -5.21 -5.01 -4.98 -4.51 -4.18 -3.61 -3.87 -3.58 -3.30 -3.13 -3.11 -3.09 -3.74 -3.42 -2.95 -3.62
    Leu      739  1365  3123  3816 -5.03 -4.85 -4.87 -4.32 -4.06 -3.56 -3.61 -3.47 -3.17 -3.09 -2.85 -2.84 -3.69 -3.24 -2.74 -3.55
    Val      598   937  2365  3175  4731 -4.64 -4.63 -4.04 -3.93 -3.42 -3.58 -3.35 -3.05 -3.02 -2.84 -2.82 -3.44 -3.00 -2.69 -3.39
    Trp      131   240   549   601   924   741 -4.74 -4.20 -3.81 -3.47 -3.44 -3.30 -3.25 -3.23 -3.04 -3.12 -3.94 -3.42 -2.84 -3.60
    Tyr      283   464   962  1227  1753  1327   366 -3.69 -3.41 -3.12 -3.16 -2.97 -2.90 -2.82 -2.71 -2.84 -3.42 -3.01 -2.58 -3.23
    Ala      679  1105  2071  2889  4762  4279   639  1445 -3.28 -2.91 -2.97 -2.76 -2.67 -2.59 -2.45 -2.52 -2.95 -2.56 -2.28 -2.81
    Gly      788   750  1391  1878  3126  2944   561  1273  3750 -2.62 -2.78 -2.61 -2.39 -2.41 -2.12 -2.34 -2.77 -2.43 -2.08 -2.60
    Thr      446   612  1220  1644  2140  2225   348   877  2543  2438 -2.81 -2.74 -2.46 -2.43 -2.33 -2.43 -2.97 -2.55 -2.13 -2.68
    Ser      550   579  1068  1543  2335  2131   381   927  2379  2344  1737 -2.47 -2.34 -2.36 -2.22 -2.31 -2.76 -2.46 -2.01 -2.57
    Gln      251   329   472   644  1056   899   219   475  1319  1146   801   840 -1.94 -2.26 -1.93 -2.09 -2.48 -2.25 -1.84 -2.35
    Asn      250   290   659   766  1215  1238   295   631  1598  1566  1093  1172   598 -2.24 -2.14 -2.21 -2.60 -2.19 -1.95 -2.28
    Glu      281   392   673  1071  1356  1354   281   679  1957  1447  1241  1407   604   919 -1.60 -1.77 -2.49 -2.61 -2.20 -2.09
    Asp      308   343   718  1004  1238  1241   336   813  2096  1910  1410  1389   675  1103   867 -1.85 -2.70 -2.67 -2.22 -2.16
    His      218   287   597   556   939   708   238   469   868   893   689   618   289   443   555   721 -3.27 -2.60 -2.02 -2.69
    Arg      203   312   561   831  1252   932   269   603  1312  1343   953  1043   538   637  1308  1394   375 -2.03 -1.47 -2.29
    Lys      262   394   705  1122  1481  1595   355   782  2077  1758  1324  1385   609   995  1721  1778   483   542 -1.14 -1.95
    Pro      305   391   771  1000  1628  1371   354   743  1604  1507  1100  1143   561   763   821   803   434   600   865 -2.53
    Diag.    260   216   944  1237  3213  2185    97   360  2534  1793   842   971   223   452   421   504   261   270   443   396

    the terminology of Godzik et al.59), rather than to a completely unarranged polypeptide chain (the U state59), which is the reference state in the Tanaka–Scheraga,8 MJ,9 and our approach. In the Koliński–Skolnick14 and Gregoret–Cohen10 potentials, the nonspecific grouping of side chains into the hydrophobic core and hydrophilic exterior is accounted for by one-body centrosymmetric potentials, whereas in our approach it is encoded in the side-chain pair potentials.

    According to Miyazawa and Jernigan,9 the quantities $0.5 q_i \epsilon_i$, where $q_i$ is the coordination number of residue of type i and $\epsilon_i = \sum_{j=1}^{20} N_{ij} \epsilon_{ij} / \sum_{j=1}^{20} N_{ij}$ is the average contact free energy of residue of type i, can be regarded as hydrophobicities of the corresponding types of residues. Therefore, we correlated these quantities with side-chain hydrophobicities determined by Fauchère and Pliška, who measured the partition coefficients of amino acids between n-octanol and water,60 and obtained the following correlation equation:

    $$-RT \times (0.5 q_i \epsilon_i) = 1.60(0.15) \times [RT \ln(10)\, \pi_i] - 10.50(0.23);\quad R = 0.9278 \qquad (23)$$

    where T = 298 K and $\pi_i$ is the contribution of the side chain of type i to the logarithm of the partition coefficient between n-octanol and water, as determined by Fauchère and Pliška.60 The correlation graph is shown in Figure 5.

    A similar correlation also holds with the diagonal contact free energies, $\epsilon_{ii}$:

    $$-RT\, \epsilon_{ii} = 0.528(0.046) \times [RT \ln(10)\, \pi_i] - 1.197(0.070);\quad R = 0.9372 \qquad (24)$$

    The correlation with other hydrophobicity scales derived on the basis of thermodynamic data, for example, those of Nozaki and Tanford,68 is worse (with R = 0.8019). The correlation is also worse (R = 0.8518) when the contact free energies obtained with $R_c$ = 6.5 Å are used. The latter fact is understandable in view of the smaller number of contacts and therefore poorer statistics, especially for hydrophilic pairs.

    The slopes of eqs. (23) and (24) were used to estimate the "true" free energies of contacts implemented in the sum of squares given by eq. (13). As in our earlier work,38,39 we considered the residue contact free energies to be composed of the parts due to side-chain–side-chain and backbone–backbone interaction. To obtain the peptide-group–peptide-group interaction free energy for any amino acid, we subtract from $\epsilon_{GlyGly}$ (obtained from the PDB) the contribution of the CH2 group of Gly. We take the latter as $-0.528\, RT \ln(10)\, \pi_{CH_2}$ from eq. (24), with $\pi_{CH_2}$ = 0.41 (Ref. 60), which can be considered as an estimate of the glycine "side-chain–side-chain" contact free energy. Then, we rescale our contact free energies by introduction of the factor 1.60 of eq. (23) for nonproline residues. Further, because Pro has no backbone NH donor group, we have to reduce the corresponding estimate of the peptide-group–peptide-group interaction free energy by a factor39 $f_{Pro}$ or $f_{ProPro}$. Thus, the estimates of experimental contact free energies can be expressed by eq. (25):

    $$\hat{F}_{ij} = \frac{RT}{1.60} \times \begin{cases} \epsilon_{ij} - \left[\epsilon_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}\right], & \text{if both } i \text{ and } j \ne \text{Pro} \\ \epsilon_{ij} - f_{Pro} \left[\epsilon_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}\right], & \text{if only one of } i \text{ or } j = \text{Pro} \\ \epsilon_{ij} - f_{ProPro} \left[\epsilon_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}\right], & \text{if both } i \text{ and } j = \text{Pro} \end{cases} \qquad (25)$$
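A small Python sketch may help in reading eq. (25). It assumes that the peptide-group term is the Gly–Gly contact free energy with the CH2 contribution removed, as described in the preceding paragraph, and it returns the estimate in RT units. The function name is hypothetical, and the Pro factors default to 1.0 only as placeholders, because their values come from ref. 39 and are not given in this section.

```python
import math

SLOPE_EQ23 = 1.60   # slope of eq. (23)
SLOPE_EQ24 = 0.528  # slope of eq. (24)
PI_CH2 = 0.41       # pi value of the CH2 group (ref. 60)


def contact_free_energy_estimate(eps_ij, eps_glygly, n_pro=0,
                                 f_pro=1.0, f_propro=1.0):
    """Estimate of F_ij / RT following eq. (25).

    eps_ij, eps_glygly: contact free energies in RT units (Table I).
    n_pro: number of proline residues in the pair (0, 1, or 2).
    f_pro, f_propro: reduction factors for Pro; placeholder defaults.
    """
    peptide_group = eps_glygly + SLOPE_EQ24 * math.log(10.0) * PI_CH2
    factor = {0: 1.0, 1: f_pro, 2: f_propro}[n_pro]
    return (eps_ij - factor * peptide_group) / SLOPE_EQ23
```

With the Table I values for Leu–Leu (−5.03) and Gly–Gly (−2.62), the sketch gives about −1.82 RT for the Leu–Leu side-chain contact.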

    FIGURE 5. Correlation between side-chain hydrophobicities calculated from contact energies (abscissae) and those determined by Fauchère and Pliška,60 as given by eq. (23).

    Finally, it should be noted that the computed contact free energies are additive to a good approximation, which is reflected in the following correlation equation:

    $$\epsilon_{ij} = 1.050(0.020)\, (\epsilon_{ii} + \epsilon_{jj})/2 + 0.072(0.068);\quad R = 0.9669 \qquad (26)$$

    in which the slope and intercept do not differ significantly from 1 and 0, respectively. The quantity $(\epsilon_{ii} + \epsilon_{jj})/2$ is called the ideal pair-interaction free energy, while the quantity $\epsilon_{ij} - (\epsilon_{ii} + \epsilon_{jj})/2$ is called the excess pair-interaction free energy.59 Equation (26), together with eq. (24), can serve to estimate the interresidue contact free energies of non-natural amino acids which do not occur in the data base of the structures of known proteins, but for which the water–n-octanol transfer free energies can be measured directly or estimated from QSAR equations. It must be noted, however, that quite a number of side-chain pairs depart from eq. (26) by more than the standard deviation; this is illustrated in Figure 6. As shown, the excess free energy is less than the standard deviation from the ideal free energy only when both side chains in a pair are hydrophobic, both are hydrophilic with zero net charge, or both have charges of the same sign. Therefore, eqs. (24) and (26) should, in principle, be applicable only to such pairs of side chains. For pairs composed of one hydrophobic and one hydrophilic side chain the excess free energy is positive, which means that making such a contact is less favorable than predicted by the ideal contact free energy. This is understandable because, for example, for hydrophilic side chains bearing hydrogen-bonding groups, making a contact with a nonpolar side chain, as opposed to making a contact with another hydrogen-bonding group, results in breaking of hydrogen bonds. The excess free energies of the interaction of charged side chains of opposite signs are strongly negative, about −0.7 RT, which can be explained in terms of the formation of salt bridges. Finally, some small negative excess contact free energies occur for pairs composed of side chains with carboxyamide groups and positively charged side chains.
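The ideal and excess pair-interaction free energies discussed above can be evaluated directly from Table I. The following Python sketch does so for three pairs; the dictionary layout and function names are hypothetical, and only the few Table I entries needed for the example are copied.

```python
# Contact free energies in RT units, copied from Table I.
EPS_DIAG = {"Glu": -1.60, "Asp": -1.85, "Arg": -2.03,
            "Lys": -1.14, "Leu": -5.03, "Ser": -2.47}
EPS_PAIR = {("Glu", "Lys"): -2.20, ("Asp", "Arg"): -2.67, ("Leu", "Ser"): -3.47}


def ideal(a, b):
    """Ideal pair-interaction free energy, (eps_aa + eps_bb) / 2."""
    return 0.5 * (EPS_DIAG[a] + EPS_DIAG[b])


def excess(a, b):
    """Excess pair-interaction free energy, eps_ab minus the ideal value."""
    return EPS_PAIR[(a, b)] - ideal(a, b)
```

The oppositely charged pairs Glu–Lys and Asp–Arg come out at about −0.83 and −0.73 RT, in line with the value of about −0.7 RT quoted in the text, whereas the hydrophobic–hydrophilic pair Leu–Ser has a positive excess of about +0.28 RT.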

    It should also be noted that the Arg–Arg contact free energy is considerably more negative than the contact free energies of the other pairs of residues with equal charges: Lys–Lys, Asp–Asp, and Glu–Glu. This is consistent with the results of the work of Magalhaes et al.,69 in which it was demonstrated that water bridges constitute an important factor stabilizing the spatially close configurations of the charged guanidino groups of the arginine side chains. Because the other charged residues do not possess so many groups capable of forming hydrogen bonds with water, the exceptionally low value of the Arg–Arg contact free energy is understandable.

    FIGURE 6. A diagram showing the pairs of side chains with large excess free energies computed from the data in Table I. The one-letter code of amino-acid residues is used. The numbers in each box are the integers of the ratio of the excess contact free energy to the root-mean-square excess contact free energy, $\sqrt{\langle \epsilon_{excess}^{2} \rangle}$ = 0.23 RT [the standard deviation from the ideal contact free energy in eq. (26)].

    DETERMINATION OF PARAMETERS AND STATISTICAL EVALUATION OF THE FIT OF POTENTIALS TO EXPERIMENTAL DATA

    The sum of the squares defined by eq. (13) was minimized for all five potentials given by eqs. (2)–(8). The angular and radial-angular terms were not considered in determining the parameters of radial-only potentials, namely LJ and LJK [eqs. (3) and (4)]. We used the Marquardt algorithm,70 which is especially designed for minimizing sums of squares. This method requires only the first derivatives of the components of the sums of squares, from which both the gradient and the positive-definite approximation to the Hessian matrix are constructed.
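The idea behind the Marquardt method can be sketched in a few lines of NumPy. This is a generic textbook variant, not the implementation used in this work: the damping term is added as a multiple of the identity matrix, and the damping parameter is decreased or increased by a factor of 10 after accepted or rejected steps. The function and argument names are hypothetical.

```python
import numpy as np


def marquardt(residuals, jacobian, x0, n_iter=200, lam=1e-3):
    """Minimize sum(residuals(x)**2) with a damped Gauss-Newton iteration.

    residuals(x) returns the vector d(x); jacobian(x) returns the matrix of
    first derivatives of d with respect to x. Only first derivatives are used.
    """
    x = np.asarray(x0, dtype=float)
    d = residuals(x)
    cost = float(d @ d)
    for _ in range(n_iter):
        jac = jacobian(x)
        grad = jac.T @ d      # half the gradient of the sum of squares
        hess = jac.T @ jac    # Gauss-Newton approximation to half the Hessian
        step = np.linalg.solve(hess + lam * np.eye(x.size), -grad)
        d_new = residuals(x + step)
        cost_new = float(d_new @ d_new)
        if cost_new < cost:   # accept the step and trust the model more
            x, d, cost = x + step, d_new, cost_new
            lam *= 0.1
        else:                 # reject the step and increase the damping
            lam *= 10.0
    return x, cost
```

A usage example is a two-parameter fit of a·exp(−b·t) to noise-free data generated with a = 2 and b = 0.5; starting from (1, 1), the iteration is expected to return the generating parameters.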

    Numerical integration to calculate the average correlation functions and free energies was carried out with step sizes of $d\Delta$ = 0.25 Å, $d\vartheta = \pi/24$, and $d\varphi = \pi/6$, by taking the value of the function in the center of the respective bin as the average value of the function in the bin. These step sizes were chosen as a compromise between computational efficiency and the error caused by too coarse a grid in the integration. Use of a finer grid resulted in differences in free energies and histogram values of less than 1%. To increase computational efficiency, minimization was first carried out with a coarse grid (i.e., $d\Delta = \Delta r$ = 0.5 Å, $d\vartheta = d\varphi = \pi/6$) and then completed (starting from the computed parameters) using a finer grid ($d\Delta$ = 0.125 Å, $d\vartheta = \pi/24$, and $d\varphi = \pi/6$).
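The midpoint-rule integration described above can be illustrated for the simplest case, a radial-only correlation function inserted into eq. (16). The sketch below is an assumption-laden illustration: the angular integrations are omitted, the shell volume is evaluated with the same midpoint rule, the result is returned in RT units, and the function name and the test function in the usage note are hypothetical.

```python
import math


def contact_free_energy_radial(g, r_contact, r_c=8.0, step=0.25):
    """F/RT = -ln of the volume average of g(r) over r_contact <= r <= r_c.

    g: callable returning the radial correlation function at distance r.
    r_contact: sum of the contact radii of the two side chains (angstroms).
    The function value at the center of each bin is taken as the bin average.
    """
    n_bins = max(1, round((r_c - r_contact) / step))
    h = (r_c - r_contact) / n_bins
    integral = 0.0
    volume = 0.0
    for k in range(n_bins):
        r_mid = r_contact + (k + 0.5) * h
        shell = 4.0 * math.pi * r_mid ** 2 * h  # volume of the spherical shell
        integral += g(r_mid) * shell
        volume += shell
    return -math.log(integral / volume)
```

For a smooth test function such as g(r) = 0.5 + exp[−(r − 5)²], halving the step from 0.25 to 0.125 Å changes the result only marginally, which mirrors the grid-refinement check described in the text.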

    The starting values of $\epsilon^{\circ}$ were the free energies of contacts calculated from eq. (25). The values of $\sigma^{\circ}$ in the LJK, GB, and GBV potentials and the values of $r^{\circ}$ in the LJ potential were initially assigned half the side-chain van der Waals distances calculated by Levitt.2 The initial values of $r_{ij}^{T}$ in the GBV and LJK potentials were 1.3 Å, this being the approximate van der Waals radius of the "outer" atoms of the side chains. For the anisotropic parameters $(\sigma^{\parallel}/\sigma^{\perp})^2$ and $\chi'$, one choice was based on the ratio of the long and the geometric mean of the shorter principal axes of the moments of inertia of the side chains calculated by averaging their geometry, and another start was from isotropic potentials. Both gave the same final results.

    Based on eqs. (17), the final estimated ratios of the weights of eq. (13) were $w^{r} : w^{\theta\phi} : w^{r\theta} : w^{F}$ = 1:20:20:20 for all models.

    Equation (13) contains pair-specific and single-residue-specific terms. We initially made trial runs by assuming that all the parameters are pair-specific; that is, we avoided the relations in eqs. (5), (9), and (10). However, for most of the side-chain pairs, the results were unreasonable, with the standard deviations exceeding the parameter values. Therefore, we decided to use eqs. (5), (9), and (10) to express all the constants except $\epsilon_{ij}$ in terms of single-body constants.

    The fit of the functional forms of the potentials considered in this work to experimental data is compared in terms of the F-test71 in Table II for the radial and anisotropic potentials. In the case of the radial potentials, the LJK model (shifted Lennard–Jones) appears clearly superior to the simple LJ model, the level of significance of introducing the "shift" parameters $r^{\circ}$ being close to 100%. A similar situation occurs for the anisotropic potential where the GBV functional

TABLE II. Comparison of Fit of Various Radial and Anisotropic Potentials to Experimental Data.

Potential  $\Phi$ (a)  $\Phi^{\theta\phi}$  $\Phi^{r\theta}$  $\Phi^r$  $\Phi^F \times 1000$  $p$ (b)  $\Delta\Phi$ (c)  $F$ (d)
LJK        0.39        -                    -                 0.383     0.233                 250      -                 -
LJ         1.35        -                    -                 1.340     0.433                 230      0.960             468
GBV        9.26        0.223                0.220             0.380     1.053                 310      -                 -
LJK (e)    10.23       0.230                0.261             0.384     0.952                 250      0.975             871
GB         9.87        0.222                0.247             0.465     1.031                 290      0.615             549
BP         9.75        0.222                0.241             0.462     1.625                 290      0.493             -

(a) $\Phi$, $\Phi^{\theta\phi}$, $\Phi^{r\theta}$, $\Phi^r$, $\Phi^F$ are defined in eq. (13). For the LJ and LJK radial potentials $\Phi = w^r\Phi^r + w^F\Phi^F$; for the remaining (anisotropic) potentials $\Phi = w^{\theta\phi}\Phi^{\theta\phi} + w^{r\theta}\Phi^{r\theta} + w^r\Phi^r + w^F\Phi^F$, where $w^r = 1$ and $w^{\theta\phi} = w^{r\theta} = w^F = 20$ have been assigned based on the ratios of the respective residual variances (see text).
(b) The number of adjustable parameters.
(c) The difference between the fit corresponding to the potential giving an inferior fit to experimental data and that of the potential giving the best fit (i.e., LJK for the radial and GBV for the anisotropic potential).
(d) The F-test value to compare the goodness of fit of the inferior potential with that of the best one is:

$$F_i = \frac{(N - p_i)\,[\Phi(X_i) - \Phi(X^*)]}{(p^* - p_i)\,\Phi(X^*)}$$

where $N$ is the number of terms in the expression for $\Phi$; $N = 3451$ for the radial and 165,487 for the anisotropic potentials, $X^* = (x^*_1, x^*_2, \ldots, x^*_{p^*})^T$ is the vector of the parameters of the best model, and $X_i = (x_{i;1}, x_{i;2}, \ldots, x_{i;p_i})^T$ is the vector of the parameters of the $i$th inferior model ($p_i < p^*$; see ref. 70). With the large values of $N$ taken in this study, the best-fitting potentials are effectively different from the inferior ones at the 100% significance level. Because the model with the BP potential is not nested in the model with the GBV potential, the F-test value is not given in this case.
(e) The whole sum of squares was minimized, but with the radial LJK potential, which can be considered as the GBV potential devoid of the angular terms.
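The arithmetic behind Table II can be checked with a short calculation. The sketch below is only illustrative: it assumes that footnote d reads $F_i = (N - p_i)[\Phi(X_i) - \Phi(X^*)] / [(p^* - p_i)\Phi(X^*)]$ and that the weights are $w^r = 1$, $w^{\theta\phi} = w^{r\theta} = w^F = 20$; the function and variable names are ours and do not come from the original work. It recomputes the total $\Phi$ of the GBV model from its tabulated components and the F value for the GB model.

```python
# Illustrative check of Table II; all numbers are copied from the table.
W_R, W_THPH, W_RTH, W_F = 1.0, 20.0, 20.0, 20.0  # weights of eq. (13)


def total_phi(phi_r, phi_f_x1000, phi_thph=0.0, phi_rth=0.0):
    """Weighted sum of squares; the Phi^F column is tabulated multiplied by 1000."""
    return (W_R * phi_r + W_THPH * phi_thph + W_RTH * phi_rth
            + W_F * phi_f_x1000 / 1000.0)


def f_value(n_terms, p_inferior, p_best, phi_inferior, phi_best):
    """F statistic of footnote d for a nested (inferior) model against the best one."""
    return ((n_terms - p_inferior) * (phi_inferior - phi_best)
            / ((p_best - p_inferior) * phi_best))


# GBV row: Phi^r = 0.380, Phi^F x 1000 = 1.053, Phi^{theta phi} = 0.223, Phi^{r theta} = 0.220
phi_gbv = total_phi(0.380, 1.053, phi_thph=0.223, phi_rth=0.220)

# GB row against GBV: N = 165,487, p = 290 vs p* = 310, tabulated Delta Phi = 0.615
f_gb = f_value(165487, 290, 310, phi_gbv + 0.615, phi_gbv)

print(f"Phi(GBV) = {phi_gbv:.2f}, F(GB vs GBV) = {f_gb:.1f}")
```

Under these assumptions the recomputed values agree with the tabulated 9.26 and 549 to within rounding.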


form, which allows for free "shifting" of $r$ in eq. (7), appears superior to both the BP and GB forms (however, BP is a model not nested in GBV and, therefore, we did not carry out its statistical comparison in terms of the F-test). Again, the significance level of introducing the new class of parameters $r^0$, when passing from GB to GBV, is effectively 100%. Of the two potentials with fewer parameters, the BP form gives a better fit to the experimental data. From Table II, it also follows that the LJK potential gives a significantly poorer fit to the data that contain both radial and angular terms, that is, the angular terms are statistically significant.

The lesser adequacy of the GB form, compared with the BP and GBV ones, also follows from the plot of the residuals in the contact free energies shown in Figure 7. For the GB potential, all residuals show an apparent trend: the negative ones occur for negative and the positive ones for positive contact free energies. This trend is partially eliminated for the BP potential and remains only for strongly hydrophilic pairs in the case of the GBV potential. The last observation may indicate that none of the models is fully adequate for hydrophilic pairs. On the other hand, such pairs interact weakly and therefore this should not cause great concern.

Sample theoretical and experimental histograms of the radial and angular correlation functions (the latter being averaged for visualizing purposes over the dihedral angle $\phi$) corresponding to the Leu-Leu pair are shown in Figure 8A and 8B.

To test how the parameters of the side-chain interaction potentials depend on the choice of the data base, we evaluated the parameters of the

FIGURE 7. Plots of weighted residuals of the contact energies corresponding to the BP (crosses), GB (squares), and GBV (diamonds) potentials.


FIGURE 8. Sample calculated (dashed line and dashed surface) and experimental (solid line and solid surfaces) histograms of radial (A) and angular (B) correlation function for the Leu-Leu pair. For visualizing, the histograms of the angular correlation functions are averaged over the dihedral angle $\phi$.

GBV potential, using the set of 42 protein structures of Miyazawa and Jernigan9; the GBV potential contains the greatest number of adjustable parameters and should, therefore, be the most sensitive to data-base selection. (In the section "Contact Free Energies," we have already shown that the contact free energies determined from our data base of 195 protein structures are in very good agreement with those determined by Miyazawa and Jernigan.) For the values of $\varepsilon$ of eq. (6), which range from $-12$ to $+1.6$, the correlation coefficient was 0.8569 and the mean-square difference was 0.3 kcal/mol. For other parameters for which the range is not so extensive, correlation coefficients of approximately 0.8 were obtained. In view of the fact that the two data bases have almost no structure in common and the MJ data base is much smaller than ours, the parameters of the GBV potential determined from the two data bases are reasonably consistent.

    DISCUSSION OF COMPUTED PARAMETERS

The computed values of $\varepsilon^0_{ij}$ and the single-body parameters of eq. (6) and their standard deviations for the GBV side-chain interaction potential considered in this work are given in Tables IIIa and IIIb. The parameters for the other four simpler potential functions are included in Tables 2a, 2b to 5a, 5b of the Supplementary Material, which also contains the parameters for all five potentials in machine-readable form.

Except for $\varepsilon^0_{\mathrm{LysLys}}$ of the GBV model, and the constants $\chi'$ and $\alpha$, the parameters are well determinable and significantly greater than their standard deviations. For hydrophobic residues, the values of the well-depth anisotropy $\chi'$ are small, although the anisotropy measures of the van der Waals radii $\sigma^{\parallel}/\sigma^{\perp}$ are significantly different from 1.0. Well-depth anisotropies appear significant for neutral and hydrophilic residues.

It is interesting to compare the computed parameters with contact free energies and estimates of the van der Waals radii and anisotropies that can be determined from the geometrical characteristics of the side chains. Such comparison is presented for the GBV model in Figure 9A-D. As shown, the contact free energies of hydrophobic pairs (for which $\varepsilon_{ij} > 0$) correlate quite well with their van der Waals well depths determined by minimizing $\Phi$ of eq. (13). The values of $r^{0;L}$, used as the van der Waals distances in our earlier work, correlate with the values of $\sigma^0$, with the definite exception of aromatic residues and arginine (Fig. 9B). A similar situation occurs when the values corresponding to the LJK potential are taken into account. The correlation is even better (with the exception of Lys) when the values of $\sigma^{\parallel}$ calculated from $\sigma^0$ and the ratio $\sigma^{\parallel}/\sigma^{\perp}$ are considered (Fig. 9C). On the other hand, there is no correlation between the values of $r^0$ from our earlier work and the constants $r^0$ of eq. (7).

There is no correlation between the parameters and the ratio of the long to the short axes of the side chains determined by diagonalizing the average matrices of the moments of inertia determined from the PDB. Thus, estimating these parameters based on the average dimensions of a side chain is incorrect. On the other hand, it is interesting to note that anisotropy parameters correlate with the


TABLE IIIa. Calculated Values of $\varepsilon^0_{ij}$ (Kilocalories per Mole) for the GBV Potential (Diagonal and Lower Triangle) and Their Standard Deviations (Upper Triangle); Last Line Contains the Standard Deviations of the Diagonal Constants.

            Cys    Met    Phe    Ile    Leu    Val    Trp    Tyr    Ala    Gly    Thr    Ser    Gln    Asn    Glu    Asp    His    Arg    Lys    Pro
Cys        1.05   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.03   0.02   0.02   0.02   0.02   0.02   0.02   0.02
Met        1.26   1.45   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.03   0.03   0.03   0.02   0.02   0.02   0.03   0.02
Phe        1.19   1.34   1.27   0.01   0.01   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.01   0.01   0.02   0.01   0.01   0.01
Ile        1.30   1.47   1.41   1.58   0.01   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Leu        1.25   1.51   1.40   1.59   1.55   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Val        1.17   1.38   1.31   1.52   1.50   1.40   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Trp        0.99   1.17   1.15   1.21   1.18   1.10   0.97   0.02   0.02   0.02   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02
Tyr        0.92   1.15   1.05   1.22   1.18   1.04   0.87   0.81   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Ala        0.98   1.20   0.99   1.24   1.26   1.19   0.77   0.81   1.02   0.01   0.01   0.01   0.03   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Gly        0.98   1.03   0.84   1.06   1.13   1.01   0.71   0.72   0.82   0.56   0.01   0.01   0.02   0.02   0.01   0.02   0.02   0.02   0.00   0.01
Thr        0.80   0.91   0.76   0.98   0.90   0.87   0.56   0.58   0.64   0.55   0.43   0.01   0.02   0.02   0.02   0.01   0.02   0.02   0.00   0.01
Ser        0.77   0.86   0.68   0.90   0.93   0.83   0.51   0.52   0.59   0.47   0.45   0.28   0.02   0.02   0.01   0.02   0.02   0.02   0.00   0.01
Gln        0.81   1.02   0.72   0.95   0.98   0.85   0.59   0.63   0.75   0.33   0.35   0.26  -0.28   0.06   0.03   0.02   0.03   0.04   0.01   0.02
Asn        0.73   0.95   0.70   0.87   1.00   0.89   0.60   0.60   0.77   0.49   0.38   0.38   0.53   0.66   0.01   0.04   0.03   0.05   0.00   0.02
Glu        0.64   0.81   0.53   0.88   0.79   0.72   0.52   0.51   0.47  -0.06   0.20   0.04  -0.23  -0.02  -1.58   0.10   0.03   0.04   0.04   0.02
Asp        0.68   0.64   0.52   0.79   0.68   0.62   0.53   0.57   0.51   0.23   0.29   0.12  -0.12   0.27  -0.93  -0.66   0.03   0.03   0.05   0.02
His        0.91   1.05   0.92   0.94   0.98   0.83   0.82   0.76   0.65   0.56   0.57   0.49   0.38   0.59   0.42   0.62   0.80   0.03   0.00   0.02
Arg        0.58   0.87   0.69   0.96   0.95   0.75   0.66   0.67   0.53   0.38   0.43   0.40   0.36   0.33   1.01   1.00   0.49  -0.02   0.09   0.02
Lys        0.59   0.81   0.55   0.96   0.97   0.85   0.50   0.62   0.68  -0.01   0.00  -0.01  -0.02   0.00   1.30   1.09  -0.01  -0.48 -11.96   0.02
Pro        0.82   0.95   0.81   0.98   1.00   0.92   0.77   0.79   0.74   0.69   0.57   0.58   0.62   0.62   0.42   0.42   0.61   0.53   0.56   0.82
SD         0.03   0.03   0.02   0.01   0.01   0.01   0.03   0.02   0.02   0.02   0.01   0.02   0.07   0.10   0.25   0.09   0.04   0.01    18.   0.02


TABLE IIIb. Calculated Values of Single-Body Parameters of the GBV Potential (Standard Deviations in Parentheses). Except for $\sigma^0$ and $r^0$ All Quantities Are Dimensionless.

Residue  $\sigma^0$ (Å)     $r^0$ (Å)          $(\sigma^{\parallel}/\sigma^{\perp})^2$   $\chi'$             $\alpha$
Cys      2.3204 (0.0382)    5.7866 (0.2123)    2.6006 (0.2042)                           -0.0025 (0.0155)     0.0299 (0.0070)
Met      2.4984 (0.0237)    3.5449 (0.1299)    4.4303 (0.2018)                            0.0968 (0.0108)     0.0878 (0.0058)
Phe      2.2823 (0.0245)    6.3367 (0.1459)    3.9640 (0.1815)                            0.0491 (0.0077)     0.0801 (0.0039)
Ile      2.5919 (0.0150)    4.4859 (0.0860)    3.2406 (0.0926)                            0.0897 (0.0061)     0.0664 (0.0031)
Leu      2.8905 (0.0098)    3.3121 (0.0548)    2.3636 (0.0406)                            0.0749 (0.0047)     0.1108 (0.0027)
Val      2.7251 (0.0125)    3.7770 (0.0629)    2.0347 (0.0514)                            0.0770 (0.0062)     0.0679 (0.0032)
Trp      1.6947 (0.0644)    9.2904 (0.3832)    7.5089 (1.0060)                            0.0731 (0.0170)     0.0549 (0.0076)
Tyr      2.1346 (0.0271)    4.8607 (0.1529)    5.9976 (0.3578)                            0.1177 (0.0134)     0.0438 (0.0064)
Ala      2.4366 (0.0100)    2.1574 (0.0423)    1.8090 (0.0396)                            0.0333 (0.0077)     0.1052 (0.0041)
Gly      2.3359 (0.0169)    2.5197 (0.0532)    1.0429 (0.0498)                            0.2238 (0.0127)    -0.1277 (0.0073)
Thr      2.6047 (0.0188)    3.0723 (0.0996)    2.2451 (0.0899)                            0.0236 (0.0162)    -0.0264 (0.0075)
Ser      2.4471 (0.0203)    2.2432 (0.0770)    1.6795 (0.0749)                           -0.0029 (0.0184)    -0.0348 (0.0083)
Gln      2.6269 (0.0229)    1.1813 (0.1189)    2.6172 (0.1455)                            0.2960 (0.0305)     0.0505 (0.0165)
Asn      2.6954 (0.0165)    0.7634 (0.0826)    2.0433 (0.0850)                            0.2732 (0.0286)    -0.0064 (0.0152)
Glu      2.5933 (0.0191)    1.2819 (0.0874)    2.5707 (0.1327)                            0.4904 (0.0332)    -0.0266 (0.0175)
Asp      2.5098 (0.0192)    1.4061 (0.0804)    1.9262 (0.0925)                            0.3090 (0.0299)     0.0250 (0.0164)
His      2.3409 (0.0323)    3.3570 (0.1817)    3.6263 (0.2703)                            0.1351 (0.0245)     0.0589 (0.0124)
Arg      2.3694 (0.0214)    1.8119 (0.1201)    6.6061 (0.3758)                            0.2624 (0.0270)     0.0062 (0.0130)
Lys      2.7249 (0.0161)    0.2712 (0.0913)    8.0078 (0.2948)                            0.5790 (0.0364)     0.0115 (0.0161)
Pro      2.7230 (0.0228)    3.3320 (0.1059)    1.7905 (0.0759)                           -0.1105 (0.0160)    -0.0190 (0.0065)

dimensions of the side chains; larger side chains are more likely to exhibit more pronounced anisotropy (Fig. 9D).

Finally, it should be noted that, in the case of the LJK and GBV potentials, for many of the side chains, the constants $r^0$ exceed $\sigma^0$ (see Table IIIb and Table 3b of the Supplementary Material). Particularly large $r^0$ values occur for the aromatic residues, which exhibit broad radial distributions. This means that the potential will rarely tend to infinity as side-chain separation approaches zero. This does not seem to be the result of inadequacy of the fitting procedure, because we included the regions in which the radial-correlation function is zero. We have also carried out additional trial runs by assuming lower exponents in eq. (2) than 6 and 12, which results in broadening the potential wells. However, we still obtained $r^0 > \sigma^0$ for most of the side chains. Thus, to use the LJK and GBV potentials in simulations, in these two cases we changed the general form of the potential given by eq. (2) to:

    12 6<


FIGURE 9. (A) Correlation between the hydrophobicity-scaled contact free energies, $F_{ij}$ (abscissae), and the corresponding van der Waals well depths, $\varepsilon^0_{ij}$, corresponding to the GBV potential (ordinates). The straight line is the least-squares line calculated for the "definitely hydrophobic" pairs with $\varepsilon_{ij} \ge 0$ kcal/mol; its equation is $\varepsilon_{ij} = -0.865(0.040)F + 0.425(0.022)$; $R = -0.8417$. (B) Correlation between the mean radii of side-chain contacts derived by Levitt2 (abscissae) with the computed values of $\sigma^0$ of the GBV potential. After eliminating five apparent outliers: Phe, Tyr, Trp, His (aromatic side chains), and Arg, the equation is $\sigma^0 = 0.1552(0.041)r^{0;L} + 1.72(0.23)$; $R = 0.7254$. (C) Correlation between Levitt's mean side-chain contact distances and the values of $\sigma^{\parallel}$ calculated from the parameters of the GBV potential. After eliminating lysine, the equation is $\sigma^{\parallel} = 0.89(0.13)r^{0;L} - 0.94(0.75)$; $R = 0.8619$. (D) Correlation between Levitt's mean side-chain contact distances and the values of $\sigma^{\parallel}/\sigma^{\perp}$ corresponding to the GBV potential. After eliminating lysine, the equation is $\sigma^{\parallel}/\sigma^{\perp} = 0.463(0.072)r^{0;L} - 0.97(0.43)$; $R = 0.8417$.


second feature enables us to consider the computed free energies of side-chain interactions as absolute values that can be compared directly with experimental data and/or the results of calculations with all-atom potentials (including hydration).

The choice as to which potential to use in simulations should be based on the balance between the accuracy of the representation of the free-energy surface and computational effort. Regarding these two issues, the potentials can be ordered as follows: GBV, BP and GB, LJK, LJ. The GBV potential (that includes angular dependence) most accurately represents the free-energy surface, but involves the greatest computational effort, whereas the LJ potential (radial-only) is the simplest, but least accurate representation of the free-energy surface, and should be used when the computation time is a significant issue.

    Acknowledgments

This work was supported by Grant PB190/T09/96/10 from the Polish State Committee for Scientific Research (KBN) (to A.L. and S.O.), by Grant AG 00322 from the National Institute on Aging (to S.R.), by Grant GM-14312 from the National Institute of General Medical Sciences, by Grant MCB95-13167 from the National Science Foundation (to H.A.S.), and by Grant CA 42500 from the National Cancer Institute (to M.R.P.).

Computations were carried out with one processor of the IBM-SP2 computer at the Cornell National Supercomputer Facility, a resource of the Center for Theory and Simulation in Science and Engineering at Cornell University, which is funded by the National Science Foundation, New York State, the IBM Corporation, and members of its Corporate Research Institute, with additional funds from the National Institutes of Health.

Appendix: Definition and Calculation of Side-Chain Pair Correlation Functions from Protein-Crystal Data

Assume that we have a data base of $n_p$ protein structures. Let $n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)$ denote the number density of pairs of side chains of types $i$ and $j$ at a distance $r$ and orientation defined by the angles $\theta^{(1)}$, $\theta^{(2)}$, and $\phi$ for protein $p$, all assumed to be at the same temperature (for brevity of notation, we omit the side-chain-pair subscripts $ij$ in the symbols of the variables throughout the Appendix). Because we cannot determine the actual density at a point from experimental data, instead we will consider $n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi; T)$ defined as the average number density in the bins $b_{klmn} = b(r_k, \theta^{(1)}_l, \theta^{(2)}_m, \phi_n) = [r_k - \Delta r/2, r_k + \Delta r/2] \times [\theta^{(1)}_l - \Delta\theta/2, \theta^{(1)}_l + \Delta\theta/2] \times [\theta^{(2)}_m - \Delta\theta/2, \theta^{(2)}_m + \Delta\theta/2] \times [\phi_n - \Delta\phi/2, \phi_n + \Delta\phi/2]$, $k = 1, 2, \ldots, n_r$, $l = 1, 2, \ldots, n_\theta$, $m = 1, 2, \ldots, n_\theta$, $n = 1, 2, \ldots, n_\phi$:

$$n^{(2)}_{ij;p}(r_k, \theta^{(1)}_l, \theta^{(2)}_m, \phi_n) = \frac{\text{number of pairs of side chains of types } i \text{ and } j \text{ within } b_{klmn}}{\text{volume of } b_{klmn}} \qquad \text{(A-1)}$$

where $r_k = (k - 1/2)\Delta r$, $\theta^{(1)}_l = (l - 1/2)\Delta\theta$, $\theta^{(2)}_m = (m - 1/2)\Delta\theta$, $\phi_n = (n - 1/2)\Delta\phi$, $\Delta r = 0.5$ Å, $\Delta\theta = 30°$, $\Delta\phi = 30°$, $n_r = \mathrm{int}(r^{max}_{ij}/\Delta r)$, $n_\theta = \pi/\Delta\theta = 6$, $n_\phi = 2\pi/\Delta\phi = 12$, where int is the integer part of a number; the values of $r^{max}_{ij}$ are defined by eq. (12).

The pair number density $n^{(2)}$ can be decomposed into the pair correlation function for residues of types $i$ and $j$, $g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi)$ (assumed to depend only on the types of the side chains and not on the protein in which they reside) and the reference-state pair number density $n^{(2,0)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)$, corresponding to a hypothetical chain with noninteracting side chains. Thus, given the actual and reference pair number densities that can be evaluated from the data base of protein structures, the pair correlation function can be calculated as follows:

$$g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi) = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)}{\sum_{p=1}^{n_p} w_p\, n^{(2,0)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)} \qquad \text{(A-2)}$$

where $w_p$ is the statistical weight of the $p$th protein in the sample; the choice of weights is discussed in the Results section [eq. (20)].
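The binning of eq. (A-1) and the weighted ratio of eq. (A-2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names, the zero-based bin indices, and the dictionary representation of the binned densities are our assumptions, and the reference densities are taken as given.

```python
import math

DR = 0.5                       # radial bin width in Angstrom
DTH = DPH = math.radians(30)   # angular bin widths (30 degrees)


def bin_index(r, th1, th2, ph):
    """Zero-based (k, l, m, n) of the bin containing a pair geometry.

    Angles are in radians; with zero-based indices the bin centres are
    ((k + 1/2) DR, (l + 1/2) DTH, (m + 1/2) DTH, (n + 1/2) DPH).
    """
    return (int(r // DR), int(th1 // DTH), int(th2 // DTH), int(ph // DPH))


def pair_correlation(actual, reference, weights):
    """Eq. (A-2): weighted actual density over weighted reference density.

    `actual` and `reference` are lists (one entry per protein) of
    dictionaries mapping a bin index to a number density.
    Bins with a vanishing reference density are skipped.
    """
    g = {}
    bins = set().union(*[d.keys() for d in reference])
    for b in bins:
        num = sum(w * d.get(b, 0.0) for w, d in zip(weights, actual))
        den = sum(w * d.get(b, 0.0) for w, d in zip(weights, reference))
        if den > 0.0:
            g[b] = num / den
    return g
```

For example, `bin_index(3.2, math.radians(45), math.radians(100), math.radians(200))` places a pair in bin `(6, 1, 3, 6)`. Because both sums in eq. (A-2) run over the same bin, the bin volume of eq. (A-1) cancels whenever the actual and reference densities are histogrammed on the same grid.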


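Clearly, $g$ is the average pair correlation function corresponding to bin $b_{klmn}$:

$$g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi) = \frac{1}{\Delta V} \int_{r_-}^{r_+} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_{\phi_-}^{\phi_+} g_{ij}(\rho, \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \qquad \text{(A-3)}$$

where:

$$r_- = r - \Delta r/2, \quad r_+ = r + \Delta r/2,$$
$$\theta^{(1)}_- = \theta^{(1)} - \Delta\theta/2, \quad \theta^{(1)}_+ = \theta^{(1)} + \Delta\theta/2,$$
$$\theta^{(2)}_- = \theta^{(2)} - \Delta\theta/2, \quad \theta^{(2)}_+ = \theta^{(2)} + \Delta\theta/2,$$
$$\phi_- = \phi - \Delta\phi/2, \quad \phi_+ = \phi + \Delta\phi/2,$$
$$dV = \rho^2 \sin\vartheta^{(1)} \sin\vartheta^{(2)}\, d\rho\, d\vartheta^{(1)}\, d\vartheta^{(2)}\, d\varphi$$

and:

$$\Delta V = \tfrac{1}{3}(r_+^3 - r_-^3)(\cos\theta^{(1)}_- - \cos\theta^{(1)}_+)(\cos\theta^{(2)}_- - \cos\theta^{(2)}_+)\,\Delta\phi = \tfrac{1}{3}\,\Delta r^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}\, \Delta\phi$$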
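As a sanity check on the expression for $\Delta V$, the bin volumes can be summed over the whole grid and compared with the closed-form integral of $dV$ over $0 \le r \le r^{max}$, $0 \le \theta^{(1)}, \theta^{(2)} \le \pi$, $0 \le \phi \le 2\pi$, which is $(8\pi/3)(r^{max})^3$. The sketch below assumes the grid quoted in this Appendix ($\Delta r = 0.5$ Å, $\Delta\theta = \Delta\phi = 30°$) and an 8-Å cutoff; the helper names are illustrative only.

```python
import math


def bin_volume(r_lo, r_hi, th1_lo, th1_hi, th2_lo, th2_hi, dphi):
    """Delta V = (1/3)(r+^3 - r-^3)(cos th1- - cos th1+)(cos th2- - cos th2+) dphi."""
    return ((r_hi ** 3 - r_lo ** 3) / 3.0
            * (math.cos(th1_lo) - math.cos(th1_hi))
            * (math.cos(th2_lo) - math.cos(th2_hi))
            * dphi)


def edges(upper, n):
    """n + 1 equally spaced bin edges on [0, upper]."""
    return [upper * i / n for i in range(n + 1)]


R_MAX, N_R, N_TH, N_PH = 8.0, 16, 6, 12   # assumed cutoff and bin counts
r_e = edges(R_MAX, N_R)
th_e = edges(math.pi, N_TH)
dphi = 2.0 * math.pi / N_PH

# Every phi bin has the same volume, so sum over (k, l, m) and multiply by N_PH.
total = N_PH * sum(
    bin_volume(r_e[k], r_e[k + 1], th_e[l], th_e[l + 1], th_e[m], th_e[m + 1], dphi)
    for k in range(N_R) for l in range(N_TH) for m in range(N_TH)
)
expected = 8.0 * math.pi / 3.0 * R_MAX ** 3
```

The two numbers agree to floating-point precision, which confirms that the factor $1/3$ and the cosine differences tile the integration domain without gaps or overlaps.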

The limited number of available protein structures still makes it impossible to determine the pair correlation functions with reasonable accuracy. This is easily realized because even the choice of a coarse grid of $\Delta r = 0.5$ Å, $\Delta\theta = \Delta\phi = 30°$ with implementation of the symmetry of the hypersurface in $\phi$ (only the interval $[0°, 180°]$ needs to be considered) yields $16 \times 6 \times 6 \times 6 = 3456$ bins [according to eq. (12) we take a maximum 8-Å coordination sphere for a residue] for which the average correlation functions would have to be determined. Within this coordination sphere, we have at best about 5000 points per residue pair, which would mean an average of about 1.4 counts per bin. Therefore, in the fitting procedure we use the correlation functions averaged over some radial and angular variables, respectively:

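$$g^{r}_{ij}(r) = \frac{3}{8\pi\,\Delta r^3} \int_{r_-}^{r_+} \int_0^{\pi} \int_0^{\pi} \int_0^{2\pi} g_{ij}(\rho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \approx \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,r)}_{ij;p}(r)}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,r)}_{ij;p}(r)} \qquad \text{(A-4)}$$

$$g^{\theta\phi}_{ij}(\theta^{(1)}, \theta^{(2)}, \phi) = \frac{3}{r_{max}^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}\, \Delta\phi} \int_0^{r_{max}} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_{\phi_-}^{\phi_+} g_{ij}(\rho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \approx \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,\theta\phi)}_{ij;p}(\theta^{(1)}, \theta^{(2)}, \phi)}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,\theta\phi)}_{ij;p}(\theta^{(1)}, \theta^{(2)}, \phi)} \qquad \text{(A-5)}$$

$$g^{r\theta}_{ij}(r, \theta^{(1)}, \theta^{(2)}) = \frac{3}{2\pi\,\Delta r^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}} \int_{r_-}^{r_+} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_0^{2\pi} g_{ij}(\rho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \approx \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,r\theta)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)})}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,r\theta)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)})} \qquad \text{(A-6)}$$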

where $g^{r}$, $g^{\theta\phi}$, and $g^{r\theta}$ denote the correlation functions averaged over all angles ($\theta^{(1)}$, $\theta^{(2)}$, and $\phi$), $r$, and the rotation angle, $\phi$, respectively; we noticed that the dependence of the distribution function on $\phi$ is the weakest and, therefore, chose to average over $\phi$ to obtain a mixed radial and angular correlation function [eq. (A-6)]. Likewise, $n^{(2,r)}(r)$, $n^{(2,\theta\phi)}(\theta^{(1)}, \theta^{(2)}, \phi)$, and $n^{(2,r\theta)}(r, \theta^{(1)}, \theta^{(2)})$ denote the average number densities within $[r_-, r_+]$, $[\theta^{(1)}_-, \theta^{(1)}_+] \times [\theta^{(2)}_-, \theta^{(2)}_+] \times [\phi_-, \phi_+]$, and $[r_-, r_+] \times [\theta^{(1)}_-, \theta^{(1)}_+] \times [\theta^{(2)}_-, \theta^{(2)}_+]$, respectively.
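The approximate (right-hand) sides of eqs. (A-4) to (A-6) amount to marginalizing the binned data over the variables that are averaged out and then forming the weighted ratio. A minimal Python sketch follows, under our own assumptions: the binned data are stored as dictionaries keyed by `(k, l, m, n)`, the reference entries are expected counts on the same grid (so that the marginal bin volumes cancel in the ratio), and all names are illustrative rather than taken from the original work.

```python
def marginal(counts, keep):
    """Sum 4D bin counts {(k, l, m, n): c} over the indices not listed in `keep`."""
    out = {}
    for idx, c in counts.items():
        key = tuple(idx[axis] for axis in keep)
        out[key] = out.get(key, 0.0) + c
    return out


def averaged_g(actual, reference, weights, keep):
    """Weighted ratio of marginal counts.

    keep=(0,)      -> radial function g^r            (cf. eq. A-4)
    keep=(1, 2, 3) -> angular function g^{theta phi} (cf. eq. A-5)
    keep=(0, 1, 2) -> mixed function g^{r theta}     (cf. eq. A-6)
    """
    num, den = {}, {}
    for w, act, ref in zip(weights, actual, reference):
        for key, c in marginal(act, keep).items():
            num[key] = num.get(key, 0.0) + w * c
        for key, c in marginal(ref, keep).items():
            den[key] = den.get(key, 0.0) + w * c
    # Bins with no reference counts are left undefined.
    return {key: num.get(key, 0.0) / d for key, d in den.items() if d > 0.0}
```

Averaging first and dividing afterwards pools the sparse counts (about 1.4 per four-dimensional bin in the estimate given in this Appendix) into far fewer bins, which is the practical reason for working with the averaged functions.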

The reference pair distribution functions must still be defined. We assume that they can be decomposed into the radial and angular part, and that the radial part can be expressed as a product of the Markovian factor, $M_{ij;p}(r)$, arising from the fact that the side chains are on a polypeptide chain,9 a factor accounting for the finite dimensions and nonuniform residue density, $T_{ij;p}(r)$,72 of protein molecules, and the "background" angular distribution function, $\Omega_{ij}(\theta^{(1)}, \theta^{(2)}, \phi)$, as given by eqs. (A-7)-(A-9).

$$n^{(0,2,r)}(r) = \sum_{p=1}^{n_p} w_p\, M_{ij;p}(r)\, T_{ij;p}(r) \qquad \text{(A-7)}$$

$$n^{(0,2,\theta\phi)}(\theta^{(1)}, \theta^{(2)}, \phi) = N_{ij}(r \le R_c)\, \Omega_{ij}(\theta^{(1)}, \theta^{(2)}, \phi) \qquad \text{(A-8)}$$

$$n^{(0,2,r\theta)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}) = \frac{1}{2\pi}\, M_{ij;p}(r)\, T_{ij;p}(r) \int_0^{2\pi} \Omega_{ij}(\theta^{(1)}, \theta^{(2)}, \varphi)\, d\varphi \qquad \text{(A-9)}$$


where $N_{ij}(r \le R_c) = \sum_{p=1}^{n_p} w_p\, n_{ij;p}(r \le R_c)$ is the weighte