
JMB—MS 402 Cust. Ref. No. PEW 62/94 [SGML]

J. Mol. Biol. (1995) 247, 995–1012

Computer Modeling of Protein Folding: Conformational and Energetic Analysis of Reduced and Detailed Protein Models

Alessandro Monge, Elizabeth J. P. Lathrop, John R. Gunn, Peter S. Shenkin and Richard A. Friesner*

Department of Chemistry and Center for Biomolecular Simulation, Columbia University, New York, NY 10027, U.S.A.

Recently we developed methods to generate low-resolution protein tertiary structures using a reduced model of the protein where secondary structure is specified and a simple potential based on a statistical analysis of the Protein Data Bank is employed. Here we present the results of an extensive analysis of a large number of detailed, all-atom structures generated from these reduced model structures. Following side-chain addition, minimization and simulated annealing simulations are carried out with a molecular mechanics potential including an approximate continuum solvent treatment. By combining reduced model simulations with molecular modeling calculations we generate energetically competitive, plausible misfolded structures which provide a more significant test of the potential function than current misfolded models based on superimposing the native sequence on the folded structures of completely different proteins. The various contributions to the total energy and their interdependence are analyzed in detail for many conformations of three proteins (myoglobin, the C-terminal fragment of the L7/L12 ribosomal protein, and the N-terminal domain of phage 434 repressor). Our analysis indicates that the all-atom potential performs reasonably well in distinguishing the native structure. It also reveals inadequacies in the reduced model potential, which suggests how this potential can be improved to yield greater accuracy. Preliminary results with an improved potential are presented.

Keywords: protein folding; computer modeling; potential functions

*Corresponding author

Introduction

The thermodynamics of protein folding has been the subject of intense experimental and theoretical study for many years (Privalov, 1989; Yang et al., 1992). From a theoretical point of view, the problem is extraordinarily difficult because the free energy of folding is a small fraction of the total free energy of the protein; consequently, one needs to calculate a small energy difference with empirical potential functions whose quality for this purpose is not known. Because globally different protein conformations must be compared, the cancellation of error that is often relied upon in free energy perturbation calculations is problematic. An adequate evaluation of the total energy of the protein is therefore necessary. Furthermore, the size and complexity of the molecule and associated solvent make the usual procedures of molecular mechanics computationally expensive; this difficulty is compounded by the existence of a huge number of local minima (even in the neighborhood of the global minimum), which impedes rapid conformational sampling of phase space.

Over the past several years, a few papers have appeared which have examined the total molecular mechanics energy of a small number of different protein conformations (Levitt & Sharon, 1988; Daggett & Levitt, 1991; Mark & van Gunsteren, 1992; Novotny et al., 1984, 1988; Bryant & Lawrence, 1993). While many of these results have been interesting, the amount of data obtained is not really sufficient to fully address the problem described above. It is relatively straightforward to generate large numbers of plausible conformations that are very close (e.g. 1 to 2 Å r.m.s. deviation) to the native structure, for example starting from the X-ray structure and using molecular dynamics or simulated annealing algorithms (Daggett & Levitt, 1991; Mark & van Gunsteren, 1992). It is also straightforward to “thread” the sequence of one protein through the structure of another, as was done by Karplus and co-workers in a pioneering study of this type (Novotny et al., 1984) and more recently, for instance, by Bryant & Lawrence (1993). However, in order to design a methodology for protein folding by computer, it is crucial to examine a large number of low energy structures with conformations significantly different (at least in the 4 to 6 Å r.m.s. deviation range, which is estimated to be in the molten globule regime) from the native. Such structures can only be produced via simulations capable of rapidly traversing configuration space, which at the same time utilize a potential function that is a plausible approximation to the actual potential.

Abbreviations used: r.m.s., root-mean-square; PDB, Protein Data Bank; RSA, rotamer simulated annealing; MBO, myoglobin; CTF, C-terminal fragment of the L7/L12 ribosomal protein; R69, N-terminal domain of phage 434 repressor; CPU, central processor unit.

0022–2836/95/150995–18 $08.00/0 © 1995 Academic Press Limited

Many studies have also been carried out using reduced protein models and approximate potentials (Friedrichs et al., 1991; Hinds & Levitt, 1992; Sun, 1993; Skolnick & Kolinski, 1990; Kolinski et al., 1993; Lau & Dill, 1989; Shakhnovich et al., 1991; Covell & Jernigan, 1990; Covell, 1992). Most of this work has been concerned with sequence alignment and homology modeling, i.e. identifying the native structure from the set of structures in the Protein Data Bank (PDB; Bernstein et al., 1977), given the sequence. Again, however, the restriction to PDB structures is qualitatively inadequate if one wishes to understand how the native conformation is selected as compared with alternatives that are actually suitable to the sequence. To begin with, a realistic treatment of excluded volume constraints is required and most of the “database potentials” in the literature simply ignore such constraints, as PDB structures have them built in automatically. Furthermore, the quality of these reduced model potentials is even more of an issue than that of the molecular mechanics potentials, which at least have a performance record that can be evaluated for small molecules.

In our earlier work on computer modeling of protein folding (Monge et al., 1994; Gunn et al., 1994), we have used a reduced model of the protein where we fix secondary structure and employ a simple potential based on a statistical analysis of PDB structures. The idea of fixing secondary structure was proposed in a number of previous works (Ptitsyn & Rashin, 1975; Warshel & Levitt, 1976; Cohen et al., 1979), but while promising results were reported, the methods used have not been of general applicability. Our efforts have been toward developing a genuinely automated algorithm to fold proteins of arbitrary complexity using the secondary structure as a starting point. Such an algorithm could be applied to proteins of unknown structure when combined with NMR spectroscopy. In fact, the NMR method can provide a precise characterization of the protein secondary structure at an early stage of a structure determination and quite independently of the complete structure calculation (Wüthrich et al., 1984, 1991; Wishart et al., 1992).

In this paper, we combine an extensive set of reduced model simulations, using a potential with a primitive but reasonably effective set of excluded volume constraints, with molecular modeling calculations using the AMBER* force field (Weiner et al., 1984; McDonald & Still, 1992) for the protein and the generalized Born (GB) continuum solvent model of Still and co-workers (Still et al., 1990) to represent the aqueous environment. Large numbers of energetically competitive reduced model structures are generated, as described in previous papers (Monge et al., 1994; Gunn et al., 1994), for three proteins: myoglobin, an eight α-helix protein (PDB code 1MBO); the C-terminal fragment of the L7/L12 ribosomal protein, a small mixed α/β protein (PDB code 1CTF); and the amino-terminal domain of phage 434 repressor, a small helical protein (PDB code 1R69). A subset of structures is then selected for further study: side-chains are added via the rotamer simulated annealing (RSA) program of Shenkin and co-workers (Farid et al., 1992), and minimization and simulated annealing runs are carried out with the AMBER*/GB potential using the MacroModel/BatchMin molecular modeling program (Mohamadi et al., 1990). Recently, Vieth et al. (1994) have studied the GCN4 leucine zipper (a dimer of two helices each containing 33 residues) using a hierarchical approach similar to ours, where a lattice model is used first and then all-atom structures are generated. They report very good agreement with the crystal structure and their results appear to be promising. However, a validation of their method must come in the context of larger and more complex proteins.

Our results on three proteins allow us, for the first time, to systematically investigate the major issues described above. Can a molecular mechanics potential with a continuum solvent pick out the native structure when compared with a set of genuinely competitive alternatives, as opposed to structures of a completely different protein? What sort of energy gap is there (if any) between the native and other structures, and which terms in the potential contribute to it most substantially? How good is the correlation between reduced model and molecular mechanics potential for the total energy and for each of the component parts of the energy?

While the conclusions that emerge from this investigation are in accordance with several previous speculations, the systematic trends, which can readily be observed in all three test proteins, are quite striking. The results suggest a strategy for constructing a significantly improved reduced model potential by identifying critical flaws in the potentials that have been produced to date. Work in this direction is currently in progress and a preliminary result is presented.

This paper is organized into four sections. In section two, we review our reduced model and associated computational algorithms, and present new results for CTF and R69 (results for MBO can be found in Gunn et al., 1994). CTF is the first β-strand-containing protein that we have studied; the results are quite satisfactory with the addition of a hydrogen-bonding potential to generate strand pairing (we do not otherwise bias how the strands pair). R69 is the first protein that we have studied for which our current model potential is grossly inadequate. As is often the case, one can learn as much or more from failure as from success; here, the difficulties in the potential, which are to some extent reflected even in the AMBER*/GB model, provide a key to understanding significant problems with the underlying physics of the approximate model. Section three describes the AMBER*/GB molecular mechanics calculations, including technical details of the simulations, statistical summaries of the results and discussion of the implications of these results for protein folding. Finally, section four contains conclusions and directions for future work.

The Reduced Model

Overview

In our studies of protein folding we use a hierarchical approach in which we represent the polypeptide chain at different levels of detail. The crudest level consists of cylinders connected by spheres. The cylinders contain either α-helices or β-strands and the spheres enclose loop regions. The next level of detail incorporates explicitly the backbone atoms and represents side-chains as spheres centered at the β-carbon atomic positions. The use of ideal geometries for helices and strands and of precalculated loop lists for loop segments results in a one-to-one correspondence between the two levels; in particular, each sphere at the coarse level corresponds to a possible loop at the more detailed level. Fixing the protein secondary structure provides a simplification of the problem and can be viewed as a computational technique with no implications for protein folding kinetics. In practice, secondary structure might be specified from NMR experimental data, as suggested by Wüthrich et al. (1991). In this section we describe the two levels of representation which define our reduced model. This model was introduced in our earlier work on myoglobin (Gunn et al., 1994). Here we emphasize recent algorithmic developments and extensions of the model to include β-strands, and report new results for CTF and R69.

Model representation and algorithms

The geometric representation of the molecule is based on the assignment of each residue to one of 18 possible “states” which specify the backbone dihedral angles φ and ψ with all other internal coordinates assuming standard values. These states are chosen to span the allowed regions of the Ramachandran map, but are otherwise not weighted according to local energy and are not residue-specific. This is to allow maximum flexibility for the loops by eliminating only impossible conformations.

Each segment of repeated dihedral angle state (α-helices or β-strands) can be represented by a cylinder described by the axis and the radial vectors of the terminal residues. Each loop can be represented by a vector connecting the end-points of adjacent cylinders, with its geometry specified by the internal coordinates (angles and dihedral angles) formed with the axes and radii of the cylinders. The first level of representation (cylinders and spheres) thus consists of the segmented chain formed by the cylinder axes and radii with the connecting loops. The second level consists of the sequence of dihedral angle states which uniquely specifies the positions of all Cα and Cβ atoms.

Trial moves are carried out by replacing the loop segments with new values selected from a pre-calculated list. The remainder of the structure can be pivoted into the new conformation simply by making use of the stored values of the internal coordinates corresponding to each loop. This allows for very fast construction and evaluation of trial structures using the cylinder-sphere representation. The more detailed representation of the molecule can then be constructed at periodic intervals by using the sequence of dihedral angle states which were used to construct each loop and are stored along with the corresponding geometry in the loop list. The entire chain can thus be rebuilt from the sequence of dihedral angles when required.

The minimization procedure consists of an inner loop of trial moves in which the loops are randomly replaced by loops from the loop list. These structures are checked for self-avoidance and rejected if any of the secondary structure elements, modeled as hard cylinders and spheres, are closer than a minimum allowed distance. This effective radius is a parameter which describes an impenetrable core, but which does not enclose all atoms. Rejection at this level rules out grossly self-overlapping structures. After a number of iterations the resulting structure is used as the trial move for evaluation with the complete residue–residue potential function. In this way, the more expensive potential, which is the quantity to be minimized by simulated annealing, is only evaluated for structures which have been selected by what is effectively a short minimization of a simpler model. The structures at this level are checked for self-avoidance with a cutoff distance for each pair of Cα or Cβ atoms. The minimum distance for each possible pair of residues is determined by taking the shortest distance observed in a survey of the PDB. For each overlap in the structure a constant penalty is added to the total energy used to accept or reject the structure. If the structure is rejected, the previously accepted structure is used for the next cycle of the inner loop.
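The overlap check and penalized acceptance described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the data layout (a list of atom coordinates plus a matrix of minimum allowed distances), and the Metropolis acceptance rule (one common choice in simulated annealing) are all assumptions.

```python
import math
import random


def count_overlaps(coords, min_dist):
    """Count atom pairs that are closer than their minimum allowed distance.

    coords   -- list of (x, y, z) tuples for the C-alpha / C-beta atoms
    min_dist -- min_dist[i][j] is the shortest allowed distance for pair (i, j)
                (hypothetical table; the paper derives it from a PDB survey)
    """
    overlaps = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < min_dist[i][j]:
                overlaps += 1
    return overlaps


def penalized_energy(potential_energy, coords, min_dist, penalty):
    """Add a constant penalty per overlap to the residue-residue energy."""
    return potential_energy + penalty * count_overlaps(coords, min_dist)


def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Assumed acceptance rule: always accept downhill moves, and accept
    uphill moves with Boltzmann probability at the effective temperature."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / temperature)
```

If a trial structure is rejected, the caller simply keeps the previously accepted structure for the next inner-loop cycle, as the text describes.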

The success of the algorithm depends significantly on the choice of the overlap penalty. If it is too high at the start, most trial moves would be rejected, since the cylinder-sphere potential does not completely prevent atomic overlaps from occurring. This is because hard spheres and cylinders large enough to eliminate all possible overlaps by enclosing all Cα and Cβ atoms would grossly overestimate the excluded volume and prevent the formation of compact structures. However, since there is a hard core in the simple model which does prevent impossible folding topologies, most overlaps can be alleviated with relatively small changes in the structure. This is achieved by gradually increasing the value of the penalty during the simulation so that those structures with fewer overlaps are progressively selected. This method generates final structures with all overlaps removed without a significant increase in the energy.

The minimization is carried out simultaneously for a large number of structures and includes periodic implementation of a genetic algorithm. In this step, a loop is chosen as a splice point and a number of hybrids are created by taking parts of different structures and connecting them together at the splice point with a new loop selected from the loop list. Each hybrid undergoes a minimization cycle using the simple model to select a reasonable loop to connect the two parts. For each parent, defined as the structure contributing the larger segment, the lowest energy hybrid is selected and used as a new trial move for the complete potential, following the procedure described above for the mutation steps. Further refinement is carried out by selecting the lowest energy structures in the ensemble, replicating them, and continuing the simulation.
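The splice step can be illustrated with a small sketch. It assumes, purely for illustration, that a structure with fixed secondary structure is encoded as a list of loop-list indices (one per loop), that every loop position shares one loop list of a given size, and that segment size is measured in loops rather than residues. None of these details, nor the function names, come from the paper.

```python
import random


def make_hybrid(struct_a, struct_b, splice, loop_list_size, rng):
    """Join the part of struct_a before the splice loop to the part of
    struct_b after it, connecting them with a newly drawn loop."""
    new_loop = rng.randrange(loop_list_size)
    return struct_a[:splice] + [new_loop] + struct_b[splice + 1:]


def parent_of(struct_a, struct_b, splice):
    """Return the structure contributing the larger segment of the hybrid."""
    from_a = splice                       # loops inherited from struct_a
    from_b = len(struct_b) - splice - 1   # loops inherited from struct_b
    return struct_a if from_a >= from_b else struct_b


def best_hybrid(hybrids, energy):
    """Pick the lowest-energy hybrid; energy is a caller-supplied function."""
    return min(hybrids, key=energy)
```

In use, several hybrids would be generated per splice point, each relaxed briefly with the simple model, and the result of `best_hybrid` passed on as a trial move for the complete potential.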

The potential function used in these simulations is based on a statistical analysis of the PDB (Casari & Sippl, 1992). Only pairs of residues far apart in the sequence are considered, so that the potential does not depend on the local geometry, but rather represents the overall packing of hydrophobic and hydrophilic residues and the formation of a hydrophobic core. In addition, the potential is long-ranged so that it can be used to evaluate non-compact structures. The potential has the form

$$E = \sum_{|i-j| \ge 20}^{N} (h_i + h_j + 2h_0)\,|\mathbf{r}_i - \mathbf{r}_j| \qquad (1)$$

where the coefficients $h_i$ correspond to the relative hydrophobicities of the residues and $h_0$ is a net hydrophobicity of the molecule, which provides a driving force for compactness.
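As a concrete reading of equation (1), the sketch below sums (h_i + h_j + 2h_0)|r_i − r_j| over residue pairs separated by at least 20 positions in the sequence. It assumes each unordered pair is counted once and that one coordinate per residue is supplied; the function name and arguments are illustrative only.

```python
import math


def hydrophobic_energy(coords, h, h0, min_sep=20):
    """Long-range hydrophobic potential in the form of equation (1).

    coords  -- one (x, y, z) position per residue
    h       -- relative hydrophobicity h_i of each residue
    h0      -- net hydrophobicity of the molecule
    min_sep -- minimum sequence separation |i - j| for a pair to count
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            energy += (h[i] + h[j] + 2.0 * h0) * math.dist(coords[i], coords[j])
    return energy
```

Because the energy grows linearly with distance whenever the coefficient h_i + h_j + 2h_0 is positive, such pairs are pulled together at any separation, which is consistent with the description of the potential as long-ranged.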

For use with the cylinder-sphere representation, the potential for a pair of secondary structure segments can be expanded around the center–center distance. The first two terms of this expansion can be interpreted as a net hydrophobicity interaction and a hydrophobic dipole interaction. This approximation is sufficiently accurate to provide a useful estimate of the total energy in the inner loop of the minimization.

To improve the performance of the potential in differentiating similar compact structures, the all-residue potential also contains a contact term which consists of a residue-dependent constant energy for each pair of Cβ atoms within a cutoff distance (Maiorov & Crippen, 1992). This contact potential is added to the hydrophobic potential with a coefficient that is treated as an adjustable parameter. This allows the contact term to be smoothly “turned on” during the simulation. Since this term provides an additional driving force towards compactness, the net hydrophobicity $h_0$ is also reduced during the simulation. This “parameter annealing”, combined with the increasing overlap penalty discussed above, allows the potential to become less smooth and more “rugged” during minimization, along with the usual lowering of the effective temperature. It should be remarked that the same functional form of the potential is used for different proteins. The net hydrophobicity $h_0$ and the weighting for the contact potential depend solely on the sequence, and the parameter annealing can be regarded as a computational artifice used to achieve an efficacious minimization.
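The parameter annealing described here can be pictured as a schedule that moves several parameters at once. The sketch below assumes a simple linear interpolation between start and end values; the paper does not specify the functional form, and the class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnealParams:
    temperature: float      # effective temperature (lowered)
    overlap_penalty: float  # penalty per atomic overlap (raised)
    contact_weight: float   # coefficient of the contact term (turned on)
    h0: float               # net hydrophobicity (reduced)


def lerp(a, b, t):
    """Linear interpolation from a (t = 0) to b (t = 1)."""
    return a + (b - a) * t


def schedule(step, n_steps, start, end):
    """Parameters at a given step, assuming a linear ramp (an assumption)."""
    t = 1.0 if n_steps <= 1 else min(max(step / (n_steps - 1), 0.0), 1.0)
    return AnnealParams(
        temperature=lerp(start.temperature, end.temperature, t),
        overlap_penalty=lerp(start.overlap_penalty, end.overlap_penalty, t),
        contact_weight=lerp(start.contact_weight, end.contact_weight, t),
        h0=lerp(start.h0, end.h0, t),
    )
```

Early in a run the smooth, long-ranged hydrophobic term dominates; by the end the contact term and overlap penalty are at full strength, so the energy surface is more rugged exactly when the temperature is lowest.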

The above potential proved to be rather poor at generating structures with the correct pairing of β-strands. The strands tended to clump together in bundles much like helices, rather than maintain a parallel arrangement. In order to describe the inter-strand hydrogen bonding, an additional term was added to the potential designed to mimic an attraction between backbone O and H atoms with an appropriate geometry. It would be very computationally expensive to consider interactions among all pairs of atoms in a pair of residues, so only a very simple strand–strand potential was considered. Since the potential in this case involves short-range interactions between long extended segments, it is impossible to systematically approximate an all-atom function with an effective center–center interaction for use with the cylinders as in the case of the hydrophobic potential. Instead, an ad hoc function was constructed which provides a crude approximation of a hydrogen-bonding potential for many relative orientations of two strands, with the antiparallel configuration being favored. This has the form

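$$E \propto \left(\log R_{ij} - (\hat{\mathbf{R}}_{ij} \cdot \hat{\mathbf{L}}_i)^2 (\hat{\mathbf{R}}_{ij} \cdot \hat{\mathbf{L}}_j)^2 - 2.4\right) \times \left(1 - 2(\hat{\mathbf{L}}_i \cdot \hat{\mathbf{L}}_j)^5\right) / R_{ij} \qquad (2)$$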

where $R_{ij}$ is the center–center vector and the $L_i$ are the axial vectors of the cylinders. This potential was found to provide an improved correlation between energy and r.m.s. deviation for the strand pairing, and therefore was subsequently used for both the cylinder-sphere and all-residue representations of the molecule. It should be emphasized that the form of this function is essentially arbitrary and is intended only to generate the simplest features of strand pairing. This is clearly only a first step towards developing a realistic hydrogen-bonding potential. In order to compensate for the increased attraction of the strands towards one another, the contact potential was set to zero for the strand residues. Note that the additional strand–strand potential does not in any way specify the way in which strands must combine to form a given β-sheet, but simply requires nearby strands to assume an antiparallel configuration.
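To see how the functional form of equation (2) favors antiparallel strands, the following sketch evaluates it for two cylinders. Several points are assumptions rather than statements from the paper: the logarithm is taken as natural, distances are in Å, the vectors in the dot products are normalized to unit length, and the proportionality constant is exposed as a hypothetical `scale` argument.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def strand_pair_energy(center_i, center_j, axis_i, axis_j, scale=1.0):
    """Ad hoc strand-strand term following the form of equation (2)."""
    r_vec = tuple(b - a for a, b in zip(center_i, center_j))
    r = math.sqrt(dot(r_vec, r_vec))           # center-center distance
    r_hat = unit(r_vec)
    l_i, l_j = unit(axis_i), unit(axis_j)
    radial = math.log(r) - dot(r_hat, l_i) ** 2 * dot(r_hat, l_j) ** 2 - 2.4
    angular = 1.0 - 2.0 * dot(l_i, l_j) ** 5   # 3 if antiparallel, -1 if parallel
    return scale * radial * angular / r
```

For two strands lying side by side 5 Å apart, the radial factor is negative under these assumptions, so the antiparallel arrangement (angular factor 3) is attractive while the parallel one (angular factor −1) is repulsive.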



Figure 3. Distribution of reduced model structures plotted with r.m.s. deviation versus total energy for CTF. Structures considered in the all-atom analysis are identified by their ID number.

Figure 5. Distribution of reduced model structures plotted with r.m.s. deviation versus total energy for R69. Structures considered in the all-atom analysis are identified by their ID number.

of the native structure, with the exception of the details of the strand pairing, which requires additional refinement with a more detailed model to adequately represent. This structure is shown superimposed on the native in Figure 4. Both for myoglobin and for CTF, the results of the simulations indicate that for our reduced model representation with secondary structure fixed the number of potential minima is drastically reduced and the native-like topology is one of a small number of distinct low-energy conformations. Since the description of the protein chain is very coarse, it is not surprising that we find misfolded structures energetically competitive with the native one.

The final example discussed here is R69 (60 residues, five helices), for which the results are shown in Figure 5. Although there are a few structures generated with relatively low r.m.s. deviation from the native, they are no lower in energy than the average for the distribution. More importantly, the native structure itself exhibits a very high energy relative to misfolded ones. This implies that while further annealing of the structures shown might be expected to lower the energy of the ensemble, there is no reason to expect lower-energy structures to be any more native-like, and in fact the lowest r.m.s. deviation of the ensemble may well increase. This is the first case we have studied where the current potential function appears to make significant errors in distinguishing the native fold from reasonable compact alternatives encountered in the simulation. It thus provides an important test for the further understanding of the potential and the evaluation of alternatives.

Figure 4. Superimposition of Cα worms for the native (yellow) and calculated (blue) structures of CTF. The r.m.s. deviation between the two structures is 5.0 Å.



The Detailed Model

Overview

Structures generated with the reduced model of the previous section can be further analyzed by introducing another level in the hierarchical framework. The detailed model is a standard united-atom representation of the protein molecule which can then be simulated employing traditional molecular mechanics force fields. This model serves a twofold purpose: a detailed all-atom representation is clearly required if accurate structures at the 1 to 2 Å resolution level are to be obtained; furthermore, detailed analysis of competitive reduced model structures should help understand the strengths and limitations of the simplified potentials.

Another important issue that can be addressed in the context of detailed model simulations is the quality of molecular mechanics potentials and of continuum solvent treatments. Our reduced model is capable of generating diverse protein conformations that are competitive alternatives to the native. Analysis of the energetics of these structures and comparison with the native will allow critical evaluation of the force field.

We first present the representation of the detailed all-atom model, the procedure used to map a minimized reduced structure onto an all-atom one, and the potential employed in the simulations. We then describe in detail the calculations for the detailed model, which were carried out using the MacroModel/BatchMin modeling package (Mohamadi et al., 1990). Finally, we report the results for our three test proteins.

Model representation and algorithms

All-atom structures are generated from the reduced model main-chain fold by adding explicit side-chains. This is a well known and difficult problem (Janin et al., 1978; Lee & Subbiah, 1991; Desmet et al., 1992), whose complexity is due to the astronomical number of possible structural permutations. For our purposes it is not actually necessary to achieve an accurate prediction of side-chain conformations, but rather some reasonable initial guess that can then be manipulated via minimization and/or simulated annealing. This is all the more true because one would like to let the main-chain relax along with the side-chains.

To generate detailed atomic models we used the RSA program developed by Peter Shenkin and co-workers (Farid et al., 1992). Reduced model structures are initially dressed with planar side-chains and random χ1 torsion angles. Side-chain rotamer space is then explored with the RSA code, which uses a Monte Carlo algorithm to sample from a rotamer library. A simulated annealing scheme is used to minimize “bumps” between side-chains. The RSA code produces all-atom structures which might still have “bad” contacts. This could be due to inadequacies in the optimization procedure and/or to the restriction to rotamers in describing side-chain conformations.

In the reduced model, each residue's φ and ψ dihedral angles are not restricted to any particular region of the Ramachandran map, i.e. each residue can sample the same discrete set of main-chain dihedral angles, regardless of amino-acid type. This choice is motivated by the fact that in this way the reduced model is more flexible, and consequently capable of traversing configuration space more efficaciously. However, the all-atom structure modeling of proline residues requires the φ angle to be nearly fixed. In the present studies we have treated prolines as alanines (there are four proline residues in myoglobin, one in CTF and two in R69), judging that even at this stage of refinement added flexibility for these residues could be beneficial.

Terminal loops are not modeled in the reduced structures. In order to avoid spurious interactions that might derive from charged and/or polar ends, we capped the all-atom structures with an acetyl group at the N terminus and N-methyl amide at the C terminus. These two groups are modeled from the two Cα atoms at either end of the reduced representation.

Calculations for the detailed model are carried out using the AMBER* force field, the original AMBER potential of Kollman and co-workers (Weiner et al., 1984) with additional parameters for organic functionality (McDonald & Still, 1992). A united-atom scheme was used, whereby hydrogen atoms are explicitly considered only for polar atoms. Solvent was treated with the GB/SA continuum solvation model (Still et al., 1990). This model is based on a continuum dielectric for solvent polarization and solvent-accessible surface area treatment of the cavity and van der Waals solvation components.

Structures generated with the RSA scheme are subjected to minimization using the conjugate gradient method in MacroModel/BatchMin. Minimization is carried out including solvation and employing analytical approximation of surface areas. To speed up the calculation we also employ cutoffs for non-bonded interactions: 7 Å for van der Waals interactions and 12 Å for electrostatic interactions. With these cutoffs, ~27%, ~65% and ~70% of the non-bonded pair interactions are included in the calculation for myoglobin, CTF and R69, respectively. A convergence criterion is set by requiring that the gradient be less than or equal to 0.05 kJ mol⁻¹ Å⁻¹. In practice we have run minimization with a prefixed maximum number of iterations of 10,000, achieving convergence in most cases (typically after ~7000 iterations for myoglobin, ~3500 iterations for CTF and ~4000 iterations for R69). The energy for the final structures is evaluated with essentially infinite cutoffs (i.e. cutoffs for which no significant change in the energy is observed by increasing them) and by using accurate numerical areas for solvation. The computational cost of conjugate gradient minimization is substantial; we ran our calculations on IBM RS/6000 350 and 370 workstations with an average CPU time of 13.5 hours for myoglobin, two hours for CTF and 2.5 hours for R69. Nonetheless, these calculations are orders of magnitude less expensive than calculations involving explicit solvent, where hundreds or thousands of discrete solvent molecules are used to model solvent effects.
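The effect of a distance cutoff on the number of pair interactions can be illustrated with a short sketch that counts the fraction of atom pairs lying within a given cutoff. It is only an illustration of the idea: it treats every pair of atoms as a non-bonded pair (ignoring bonded exclusions), and the function name is hypothetical rather than part of MacroModel/BatchMin.

```python
import math
from itertools import combinations


def fraction_within_cutoff(coords, cutoff):
    """Fraction of all atom pairs whose separation is at most `cutoff`.

    coords -- list of (x, y, z) atom positions, in the same units as cutoff
    """
    pairs = list(combinations(coords, 2))
    if not pairs:
        return 0.0
    kept = sum(1 for a, b in pairs if math.dist(a, b) <= cutoff)
    return kept / len(pairs)
```

For a fixed cutoff, a larger protein has proportionally more distant pairs, which is consistent with the smaller percentage quoted for myoglobin than for the two smaller proteins.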

A subset of the minimized structures was further optimized using simulated annealing. Conformational sampling is performed by means of stochastic dynamics using the SHAKE protocol to constrain bonds to hydrogen atoms and a 1.5 fs time step. An initial equilibration is carried out at 300 K for 10 ps. Typically, the temperature is then lowered to 50 K in 40 ps with a linear cooling schedule. This is followed by a 3000 iteration conjugate gradient minimization. We experimented with different cooling rates and higher initial temperatures for CTF, and found that the annealing protocol described above produced structures with the lowest energy. The results of this analysis are described below in Simulated annealing, of section three, where we also discuss convergence for our simulated annealing procedure. In the following, when we refer to simulated annealing calculations, we always mean the combination of equilibration, annealing and minimization described above. The CPU time for simulated annealing runs is determined by the time required for each step of stochastic dynamics (3, 0.9, and 1 seconds for MBO, CTF, and R69, respectively) and the time of each conjugate gradient minimization iteration (7, 2, and 2.3 seconds for MBO, CTF, and R69, respectively).
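The per-step timings quoted above can be turned into a rough cost estimate for one simulated annealing run. The sketch below uses only numbers from the text (10 ps equilibration, 40 ps cooling, 1.5 fs time step, 3000 minimization iterations) and assumes that all 3000 iterations are used and that no other overhead contributes; the function names are illustrative.

```python
def n_dynamics_steps(equil_ps=10.0, cool_ps=40.0, dt_fs=1.5):
    """Number of stochastic dynamics steps in equilibration plus cooling."""
    return (equil_ps + cool_ps) * 1000.0 / dt_fs


def annealing_cpu_hours(sec_per_md_step, sec_per_cg_iter, cg_iters=3000):
    """Estimated CPU hours for one run: dynamics plus final minimization."""
    seconds = n_dynamics_steps() * sec_per_md_step + cg_iters * sec_per_cg_iter
    return seconds / 3600.0


def cooling_temperature(t_ps, t_start=300.0, t_end=50.0, cool_ps=40.0):
    """Temperature (K) at time t_ps into the linear cooling stage."""
    frac = min(max(t_ps / cool_ps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


# (seconds per dynamics step, seconds per minimization iteration)
timings = {"MBO": (3.0, 7.0), "CTF": (0.9, 2.0), "R69": (1.0, 2.3)}
estimates = {name: annealing_cpu_hours(*t) for name, t in timings.items()}
```

Under these assumptions the 50 ps of dynamics amount to about 33,000 steps, and the estimates come to roughly 34, 10 and 11 CPU hours for MBO, CTF and R69, respectively, with the dynamics dominating the cost in each case.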

Results

Overview

The procedures described above were used to produce a large number of energetically plausible but substantially different conformations of the three proteins considered in this study. While the complexity of the potential energy functions and the procedure itself make the analysis of the results nontrivial, it is still possible to ask and provide at least preliminary answers to a number of important questions. These are as follows: (1) What is the performance of the AMBER*/GB potential in ranking the native structure as compared to the alternatives that we have generated? Can anything be inferred about the strengths or weaknesses of the molecular mechanics force field and solvation model used? (2) What are the uncertainties at each step of the “dressing” process (addition of side-chains, conjugate gradient minimization, simulated annealing), i.e. what sort of energetic and geometrical variations are obtained in the final structure if one starts from a given reduced model structure and repeats the dressing procedure many times? (3) Are there systematic differences in any of the components of the energy for the native structure as compared to alternative structures? Can anything be inferred about the driving forces for protein folding from such systematic behavior? (4) How well does the reduced model potential correlate with the AMBER*/GB potential? Are particular terms in either potential good predictors of native-like structures? (5) Can a better reduced model potential (i.e. one capable of higher resolution and reliability) be designed after the above analysis is completed?

In addition to presenting the raw data in various schematic forms, we shall attempt to address each of these questions. The computations presented here are rather expensive, so the procedures described in Model representation and algorithms, of section three, were not applied to all the structures generated with the reduced model. Instead, we have tried to mix a broad survey of many structures with an in-depth analysis of the computational procedures for a small subset of these structures.

In analyzing in detail the components of the AMBER*/GB potential energy, we focus on the van der Waals, electrostatic Coulombic, electrostatic solvation and surface area terms. The remaining terms (stretches, bends and torsions) are critical in maintaining the connectivity of the protein and its local geometry, but exhibit very small differences from structure to structure and hence are unimportant, at least at the level of resolution examined here, in the ranking of conformations.

Detailed model data sets

Detailed model calculations were performed on sets of structures selected from the reduced model distributions (Figures 1, 3 and 5) and on the corresponding X-ray structures. We typically selected structures with low reduced energy or with low r.m.s. deviation from the native, but we have also analyzed structures in the middle of the distribution or with high energy and high r.m.s. deviation. For myoglobin we have also studied two "extended" structures that do not appear in the reduced model distribution; these were obtained by running the reduced model code for just a few steps so that each of the loop regions, initially in helical conformation, is assigned a loop from the loop list.

The results of conjugate gradient minimization and simulated annealing calculations for MBO, CTF and R69 are presented in Table 1. Each structure is identified by a code corresponding to its sequential position in the reduced model distribution (which contains 1024 not necessarily distinct structures). The r.m.s. deviations reported in the Table are based on the positions of the α-carbon atoms only and are relative to the minimized native structure. For consistency with the generated structures, the native structure was stripped of the loops at each end and capped as described in Model representation and algorithms, of section three. The prosthetic heme group of myoglobin was neglected in the calculations.
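The two geometrical quantities tabulated for each structure, the radius of gyration and the Cα r.m.s. deviation, can be sketched as below. This is a minimal illustration under stated assumptions: the radius of gyration is taken as unweighted (equal masses), and the r.m.s. deviation is computed for coordinate sets that have already been superimposed, since the superposition procedure is not described in this section. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def radius_of_gyration(coords: Sequence[Point]) -> float:
    """Unweighted radius of gyration, in the length unit of the input."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one atom")
    # Geometric centre of the point set (equal masses assumed).
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(sq / n)


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """r.m.s. deviation between two already-superimposed point sets."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    sq = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq / len(a))
```

In practice the Cα coordinates of each generated structure would be passed to `rmsd` together with those of the minimized native structure after an optimal rigid-body fit.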

Figure 6 plots the reduced model energy versus the total AMBER*/GB energy for the minimized MBO structures. A triangular distribution is observed, indicating that structures with high reduced energy also have high AMBER*/GB energy while structures with low reduced energy do not necessarily have a good AMBER*/GB energy. The r.m.s. deviation from the minimized native structure as a function of the total AMBER*/GB energy is plotted for MBO in Figure 7. A gap is present between native and generated structures (a feature that is absent in the distribution for the reduced potential). We observe that the lowest-energy structures in the high-r.m.s. and low-r.m.s. clusters have comparable energies.

Analysis of the different energetic components listed in Table 1 reveals that a correlation exists between surface area and van der Waals energies and between electrostatic Coulombic and solvation energies. This is shown in Figures 8 and 9, respectively, for the MBO structures of Table 1A. The sum of the internal electrostatic energy and of the electrostatic solvation energy has a much lower variance over structures (hundreds of kJ/mol) than the individual components, which vary by thousands of kJ/mol. The r.m.s. deviation versus the total electrostatic energy is plotted in Figure 14 for all of the structures of Table 1A.
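The compensation between the Coulombic and solvation terms can be checked directly on a few rows of Table 1A. The sketch below uses the ESC and ESS values of six minimized MBO structures copied from the table; the choice of rows and the use of the max-minus-min spread as a measure of variation are illustrative assumptions.

```python
# (ESC, ESS) in kJ/mol for six minimized MBO structures, copied from Table 1A.
rows = {
    "0 (native)": (-26991.29, -5311.33),
    "727": (-24130.57, -7782.53),
    "457": (-24651.75, -7478.53),
    "157": (-23931.79, -8078.73),
    "906": (-23850.94, -8439.36),
    "E2": (-23234.84, -9271.09),
}


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


esc = [c for c, _ in rows.values()]
ess = [s for _, s in rows.values()]
tes = [c + s for c, s in rows.values()]   # total electrostatic energy

print(f"spread ESC: {spread(esc):8.2f} kJ/mol")
print(f"spread ESS: {spread(ess):8.2f} kJ/mol")
print(f"spread TES: {spread(tes):8.2f} kJ/mol")
```

For this subset the individual terms vary over several thousand kJ/mol, whereas their sum varies over a few hundred.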

Side-chain addition

Our most detailed study of variability in the side-chain addition procedure has been carried out for the native backbone conformation of CTF. The RSA procedure of Shenkin and co-workers was run 52 times, followed by conjugate gradient minimization. The average Cα r.m.s. deviation of all runs from the native structure is 0.65 Å and the average all-atom r.m.s. is 1.65 Å; this is competitive with results from other procedures reported in the literature for side-chain addition (Lee & Subbiah, 1991).

Table 1
Minimization and simulated annealing results for MBO, CTF and R69

ID          IRG  r.m.s.     RG      SA       vdW        ESC       ESS        TES          E

A. MBO
0         15.01    0.00  14.45  157.04  −4148.53  −26991.29  −5311.33  −32302.62  −30499.65
0-SA-1             1.31  14.56  165.10  −4036.50  −27203.29  −5252.50  −32455.79  −30708.12
0-SA-2             1.59  14.43  163.16  −4079.81  −27823.29  −4747.80  −32571.09  −30814.60
0-SA-3             1.35  14.55  164.00  −4072.35  −27343.80  −5092.46  −32436.26  −30680.29
0-SA-4             1.56  14.44  164.53  −4122.31  −27141.86  −5294.94  −32436.80  −30748.98
727       15.53    5.76  14.91  196.36  −3628.96  −24130.57  −7782.53  −31913.10  −29896.87
727-SA             6.15  15.23  208.08  −3611.75  −26351.79  −6229.04  −32580.83  −30612.36
18-1      15.42    6.28  15.91  219.26  −3419.05  −24766.84  −7533.24  −32300.08  −30134.29†
18-2               5.96  15.33  202.31  −3574.18  −24756.96  −7465.82  −32222.78  −30127.44†
18-3               6.33  15.77  215.02  −3470.08  −24763.16  −7577.39  −32340.55  −30193.65†
18-4               6.32  15.48  222.59  −3372.63  −24624.43  −7728.20  −32352.63  −30083.35†
457       15.24    6.73  15.12  191.81  −3759.34  −24651.75  −7478.53  −32130.28  −30193.08
457-SA             6.81  15.30  206.03  −3723.43  −26675.46  −6011.44  −32686.90  −30777.63
994-1     16.30    6.44  15.22  203.87  −3662.32  −25353.85  −6875.97  −32229.82  −30148.49
994-2              7.08  16.06  231.28  −3449.72  −24627.83  −7562.06  −32189.89  −29890.84
937       15.03    7.87  15.04  202.97  −3664.22  −24310.34  −7722.40  −32032.74  −30044.16
428       15.40    9.69  15.22  226.89  −3310.56  −24137.10  −7963.48  −32100.58  −29701.28
157       14.41   10.70  15.39  225.76  −3444.20  −23931.79  −8078.73  −32010.52  −29844.47
157-SA-1          10.84  15.18  220.95  −3547.16  −25631.65  −6794.64  −32426.29  −30481.01
157-SA-2          11.07  15.25  216.58  −3579.31  −25017.14  −7316.02  −32333.16  −30401.64
157-SA-3          10.62  15.11  215.97  −3586.50  −25555.43  −6884.88  −32440.31  −30557.54
157-SA-4          10.56  15.27  211.97  −3535.00  −24977.85  −7401.94  −32379.79  −30424.32
763       15.30   11.84  15.55  223.89  −3593.50  −24134.60  −7991.10  −32125.70  −30074.97
763-SA            11.56  15.24  222.70  −3619.75  −25643.07  −6983.26  −32626.33  −30724.66
8         14.60   11.91  14.53  195.31  −3773.14  −24000.89  −8018.50  −32019.39  −30122.31
8-SA              11.85  14.43  189.61  −3895.85  −25315.96  −7024.53  −32340.49  −30728.38
51-1      15.01   12.72  14.85  214.39  −3541.28  −25233.29  −7118.84  −32352.13  −30149.90†
51-1-SA           12.86  15.67  227.96  −3425.33  −25709.07  −7042.80  −32751.87  −30683.07†
51-2              13.21  15.42  221.08  −3431.97  −24880.36  −7543.34  −32423.70  −30144.62†
51-3              13.00  15.46  229.84  −3412.47  −24526.49  −7807.35  −32333.84  −29999.45†
51-4              13.06  15.45  227.49  −3431.01  −25317.62  −7101.29  −32418.91  −30117.12†
906       19.26   13.13  18.71  261.89  −3331.94  −23850.94  −8439.36  −32290.30  −29981.62
906-SA            13.45  18.88  255.64  −3472.30  −25207.24  −7425.98  −32633.22  −30488.94
475-1     15.59   15.26  15.89  233.79  −3400.30  −23902.64  −8204.01  −32106.65  −29915.95
475-1-SA          16.00  16.14  231.88  −3432.98  −25540.46  −7043.79  −32584.25  −30531.50
475-2             15.54  15.99  237.56  −3391.23  −23654.48  −8407.03  −32061.51  −29899.92
E1        25.68   18.84  24.84  314.60  −2973.09  −23270.89  −9026.42  −32297.31  −29525.65
E2        25.28   21.55  25.01  342.28  −2769.52  −23234.84  −9271.09  −32505.93  −29538.96

B. CTF
0                  0.00  10.32   76.36  −1479.43  −10886.57  −3814.72  −14701.30  −13915.19
0-SA               0.97  10.21   79.91  −1479.32  −10970.68  −3834.75  −14805.43  −14063.86
126                4.92  10.64  100.76  −1257.68   −9715.81  −4629.78  −14345.59  −13447.48
126-SA             5.49  11.08  106.60  −1170.38  −10915.57  −3854.21  −14769.78  −13829.39
380                5.16  10.96  106.10  −1206.89   −9405.37  −5036.85  −14442.22  −13515.46
380-SA             7.01  12.08  117.95  −1200.25  −10598.35  −4157.32  −14755.67  −13850.70
682                5.05  12.02  120.39  −1171.62   −9758.12  −4782.08  −14540.20  −13589.47
682-SA             6.07  13.02  124.67  −1166.63  −10431.89  −4332.62  −14764.51  −13788.61
249-1              5.59  10.59  105.72  −1212.69  −10287.85  −4209.97  −14497.82  −13482.72
249-1-SA           6.11  10.90  113.96  −1163.56  −11111.85  −3707.35  −14819.20  −13812.83
249-2              5.45  10.64  107.78  −1216.60  −10166.59  −4386.91  −14553.50  −13537.28
249-3              5.78  10.63  107.46  −1228.61  −10199.16  −4295.75  −14494.91  −13502.11
249-3-SA           5.59  10.64  108.52  −1224.01  −11507.99  −3356.41  −14864.40  −13924.17
249-4              5.25  10.33   99.10  −1276.10  −10438.81  −4051.54  −14490.35  −13527.02
249-4-SA           5.56  11.18  110.36  −1182.20  −11556.20  −3286.72  −14842.92  −13901.12
161                6.23  10.91  101.11  −1223.92   −9277.00  −5081.67  −14358.67  −13454.18
333                6.99  10.32   95.47  −1336.15   −9319.39  −5146.39  −14465.78  −13575.34
333-SA             7.54  10.40  103.30  −1314.60   −9909.37  −4741.51  −14650.88  −13823.45
441                7.41  10.45  109.98  −1293.65  −10602.85  −3883.41  −14486.26  −13483.45
229                8.89  11.04  100.33  −1289.17   −9439.27  −5080.16  −14519.43  −13599.62
3                  8.94  10.73  102.85  −1219.94   −8995.48  −5471.35  −14466.83  −13531.54
579-1             10.25  10.73  106.21  −1270.95   −9723.91  −4700.35  −14424.26  −13503.96
579-1-SA          10.14  11.25  118.49  −1168.48  −10781.98  −3990.68  −14772.66  −13777.54
579-2              9.98  10.45  104.69  −1282.81   −9549.23  −4845.96  −14395.19  −13442.87
579-3              9.90  10.41  101.88  −1351.93   −9492.10  −4907.95  −14400.05  −13548.69
579-3-SA          10.82  11.02  115.49  −1204.07   −9983.14  −4658.71  −14641.85  −13720.39
579-4             10.07  10.67  104.70  −1304.93   −9452.09  −4937.30  −14389.39  −13456.40
579-4-SA          10.76  12.23  119.93  −1153.69  −10324.96  −4390.13  −14715.09  −13753.50
362               10.93  10.39   85.87  −1343.24  −10007.54  −4423.81  −14431.35  −13617.32
362-SA            10.75  10.78   91.71  −1309.93  −10580.11  −4125.03  −14705.14  −13931.85

C. R69
0          9.94    0.00   9.52   60.53  −1634.82   −9316.18  −3068.90  −12385.08  −11623.00
0-SA-1             0.79   9.57   60.35  −1628.18   −9378.89  −3208.32  −12587.21  −11878.39
0-SA-2             0.58   9.54   59.56  −1645.94   −9297.44  −3261.71  −12559.15  −11844.19
100       10.44    5.44  10.33   84.53  −1477.46   −8729.18  −3734.99  −12464.17  −11646.73
100-SA             6.26  10.51   89.21  −1454.45   −8978.61  −3714.40  −12693.01  −11880.29
204       11.17    5.86  11.15  103.19  −1253.16   −8856.17  −3755.67  −12611.84  −11424.61
204-SA             6.11  11.05  106.82  −1250.08   −9428.87  −3413.40  −12842.27  −11818.11
869       10.06    5.92  10.68   88.40  −1329.86   −8653.16  −3772.63  −12425.79  −11459.28
162       10.64    6.37  10.13   81.67  −1425.69   −8496.09  −3913.89  −12409.98  −11474.86
77        10.55    6.77  10.81   82.57  −1361.13   −8492.05  −3857.71  −12349.76  −11454.38
39         9.56    8.11   9.66   79.37  −1471.64   −8556.61  −3871.43  −12428.04  −11574.52
630       11.71    8.52  11.30   97.89  −1280.68   −8397.59  −4041.75  −12439.34  −11398.52
322        9.94    9.64  10.30   87.28  −1252.64   −7899.16  −4518.09  −12417.25  −11338.11
852       10.13   10.03  10.77   88.46  −1344.90   −8270.09  −4252.96  −12523.05  −11454.44
852-SA            11.48  12.32  110.89  −1254.54   −8247.92  −4420.51  −12668.43  −11675.60
451       10.34   10.26  10.93   85.62  −1373.45   −8271.04  −4129.12  −12400.16  −11501.12
361       10.76   10.87  10.52   88.51  −1321.66   −8232.84  −4216.52  −12449.36  −11478.18

† Uncapped structure.
ID is an identifier for the structure, IRG is the radius of gyration of the starting structure in Å, r.m.s. is the Cα r.m.s. deviation from the native in Å, RG is the final radius of gyration in Å, SA is the surface area energy, vdW is the van der Waals energy, ESC is the electrostatic Coulombic energy, ESS is the electrostatic solvation energy, TES is the total electrostatic energy and E is the total energy. All energies are in kJ/mol. Each structure's ID consists of its sequential number as obtained from the reduced model distributions (compare Figures 1, 3, and 5), 0 being used for the native, a second number (1, 2, ...) if different side-chain additions were considered, and the suffix SA if the structure was the result of a simulated annealing run (different simulated annealing runs with the same initial structure are indexed SA-1, SA-2, ...). E1 and E2 refer to the two MBO extended structures (see text for details).

With regard to packing (as measured by the van der Waals energy), our side-chain addition method does rather well; indeed, there is little variation in this component over the 52 runs (the average difference for the van der Waals energy between the generated side-chains and the native is 29.82 kJ/mol, with a standard deviation of 2.32 kJ/mol). For electrostatics (Coulombic plus solvation), the average difference of all runs from the native is 164.85 kJ/mol, with a standard deviation of 5.37 kJ/mol. This is not surprising because electrostatics and solvation are not included in the approximate side-chain potential function used in the RSA procedure. Such potentials can be added to the side-chain dressing procedure and work along these lines is currently in progress.

Although the relative difference of both van der Waals and total electrostatic energies of all runs with respect to those of the native is comparable (of the order of 1 to 2%), it is the actual value of the energy that matters in the ranking of structures. The overall energetic gap observed over the 52 runs between the native and the generated structures is almost entirely due to the total electrostatic component. Further evidence of this fact is that some of the generated structures show van der Waals energies lower than the native, while this is never observed for the electrostatic energy.
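The contrast between relative and absolute differences can be made concrete with the numbers quoted above. The sketch below assumes that the reference values are those of the minimized native CTF structure in Table 1B (row 0); the text does not state explicitly which native energies were used as the reference, so this is an assumption.

```python
# Mean differences from the native over the 52 RSA runs (kJ/mol), from the text.
MEAN_DIFF_VDW = 29.82
MEAN_DIFF_TES = 164.85

# Assumed reference: minimized native CTF, Table 1B, row 0 (kJ/mol).
NATIVE_VDW = -1479.43
NATIVE_TES = -14701.30


def relative_difference(diff: float, reference: float) -> float:
    """Magnitude of a difference as a fraction of the reference value."""
    return abs(diff / reference)


rel_vdw = relative_difference(MEAN_DIFF_VDW, NATIVE_VDW)
rel_tes = relative_difference(MEAN_DIFF_TES, NATIVE_TES)
ratio = MEAN_DIFF_TES / MEAN_DIFF_VDW   # absolute electrostatic vs vdW gap

print(f"van der Waals:       {100 * rel_vdw:.1f}% of the native value")
print(f"total electrostatic: {100 * rel_tes:.1f}% of the native value")
print(f"absolute gap ratio (TES / vdW): {ratio:.1f}")
```

Both relative differences fall in the 1 to 2% range, while the absolute electrostatic difference is more than five times the van der Waals one, which is why it dominates the energy gap.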

In addition to this extensive study employing the native backbone conformation of CTF, we have also carried out a number of experiments where the same backbone conformation generated with the reduced model was dressed to give different initial all-atom structures. Examples of this analysis are reported in Table 1 for myoglobin structures 18, 994, 51, and 475, and CTF structures 249 and 579.

Figure 6. Plot of reduced model energy versus AMBER*/GB energy for minimized MBO structures.

Figure 8. Plot of surface area energy versus van der Waals energy for all MBO structures in Table 1A. In this Figure, as well as in the following ones, squares label structures obtained via minimization and triangles label structures obtained via minimization followed by simulated annealing.

The above results suggest that electrostatic stabilization of the generated structures may be underestimated by the side-chain addition/conjugate gradient minimization procedure as compared to the native structure. Again, this is no surprise, considering the overlap-only nature of the RSA method. The ability of simulated annealing to close this gap is addressed in the next subsection.

Simulated annealing

In view of the extreme jaggedness of the potential energy surface in the all-atom model, conjugate gradient minimization is not the most satisfactory energy optimization technique. Minimized structures might be "surrounded" by structures with lower energy which even relatively small barriers can make inaccessible when a conjugate gradient method is used. To investigate this scenario we subjected some of the minimized structures to simulated annealing.

The effects of the simulated annealing procedure (described in Model representation and algorithms, of section three) are presented in Figure 10, which plots the Cα r.m.s. deviation of the starting structure from the native versus the r.m.s. deviation of the final structure from the starting one and versus the change in total energy, respectively, for MBO. Although non-native structures tended to move by ~2 Å r.m.s. in the course of simulated annealing, this procedure altered the r.m.s. deviation to the native structure by only about ±0.5 Å.

To ensure that our simulated annealing scheme is appropriate for study of medium to large molecules, we used CTF as an example to investigate the effect of the choice of parameters (cooling rate, initial temperature, random seed selection) on the final energy. All runs start from the same minimized solvated native structure and in all cases simulated annealing is preceded by a 10 ps equilibration at 300 K and is followed by 3000 conjugate gradient minimization steps.

We tried five different linear cooling rates from 300 to 50 K. The final energies are reported in Table 2, which shows that the largest energy difference observed is ~85 kJ/mol. Although most of our simulations were performed with an initial temperature of 300 K, we also studied the effect of varying the initial temperature. Experiments starting from 600 K yield final energies about 500 kJ/mol higher than energies obtained via conjugate gradient minimization, indicating that too high initial temperatures lead to other less stable local energy minima. On the other hand, runs starting from lower temperatures (200, 300 and 400 K) give comparable final energies. How random seed selection affects the final energy was investigated using both the native and one of the dressed structures (249). For both structures we performed ten runs, each one with a different seed initialization. For the native we obtained a mean energy of −14054.76 kJ/mol with a standard deviation of 11.80 kJ/mol, and for the 249 structure a mean energy of −13854.25 kJ/mol with a standard deviation of 15.30 kJ/mol. Based on these results, we expect energies obtained via simulated annealing to be accurate only up to approximately ±15 kJ/mol.

Table 2
Final energies of native CTF for five simulated annealing runs with different cooling rates

Run  Length (ps)  Rate (K/step)  Final energy (kJ/mol)
1             40        0.00938              −14063.86
2             50        0.00750              −14100.41
3             60        0.00625              −14148.22
4             80        0.00469              −14132.22
5            200        0.00188              −14101.55

In all runs the initial and final temperatures were 300 K and 50 K, respectively.

Figure 7. Plot of r.m.s. deviation versus AMBER*/GB energy for minimized MBO structures.

Figure 9. Plot of electrostatic Coulombic energy versus electrostatic solvation energy for all MBO structures in Table 1A.

Figure 10. Conformational and energetic effects of simulated annealing for MBO structures: plots of r.m.s. deviation of starting structure from native MBO versus r.m.s. deviation of final annealed structure from starting one (a) and versus change in total energy upon simulated annealing (b).

Total energies

How well does the AMBER*/GB potential do at discriminating the native structure from misfolded structures generated by our reduced model simulations? The r.m.s. deviation from the native versus the total AMBER*/GB energy for the structures presented in Table 1 is plotted in Figure 11, where different symbols are used for minimized structures and for minimized structures followed by simulated annealing.

First, it is noteworthy that the minimized native structure does not have a lower energy than alternative structures subjected to simulated annealing. The conformational adjustments generated by simulated annealing are responsible for a significant lowering of the energy. After simulated annealing, the native structure is lowest for myoglobin (by ~40 kJ/mol) and for CTF (by ~140 kJ/mol). These energy gaps are not unreasonable estimates for the stabilization of the native structure as compared to plausible compact alternatives. For R69, some misfolded structures are very close in energy to the native even after simulated annealing has been carried out. The conclusion is therefore that the AMBER*/GB potential has a reasonably good performance (certainly much better than the present reduced model potential), but that improvements are necessary to render it completely reliable for an arbitrary protein.

Figure 11. Plots of r.m.s. deviation versus AMBER*/GB energy for MBO (a), CTF (b), R69 (c).

Figure 12. Plots of r.m.s. deviation versus electrostatic Coulombic energy for MBO (a), CTF (b), R69 (c).

With regard to the remaining structures, native-like structures (e.g. 6 Å r.m.s. structures for myoglobin) do not have energies that are substantially better on average than non-native basins, in particular the basin centered around ~12.5 Å r.m.s. deviation. We defer a discussion of the implications of this observation until the various components of the energies have been analyzed in detail. On the other hand, structures with a very poor score from the reduced model potential typically have substantially higher energies with the AMBER*/GB potential; the correlation between the two potentials is plotted in Figure 6 for MBO minimized structures. A similar distribution is obtained for CTF, while for R69 (where the native structure has high reduced energy and low AMBER*/GB energy) a rather scattered distribution is observed with an overall negative correlation between reduced model and AMBER*/GB energies. Thus, for MBO and CTF the reduced model potential is capable of producing a qualitatively correct picture. It is in the quantitative details that the atomic level potential is superior, as will be discussed below. It will be clear then that these details must be included at some level in the reduced model potential in order to improve the results for R69.
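The native energy gaps quoted in this subsection can be recomputed from the annealed entries of Table 1. The sketch below takes the lowest total energy among the annealed native runs and compares it with the lowest total energy among the annealed non-native structures; defining the gap this way, and the variable names, are illustrative assumptions.

```python
# Total energies E (kJ/mol) of all "-SA" rows in Table 1, grouped by protein.
annealed = {
    "MBO": {
        "native": [-30708.12, -30814.60, -30680.29, -30748.98],
        "alternatives": [-30612.36, -30777.63, -30481.01, -30401.64,
                         -30557.54, -30424.32, -30724.66, -30728.38,
                         -30683.07, -30488.94, -30531.50],
    },
    "CTF": {
        "native": [-14063.86],
        "alternatives": [-13829.39, -13850.70, -13788.61, -13812.83,
                         -13924.17, -13901.12, -13823.45, -13777.54,
                         -13720.39, -13753.50, -13931.85],
    },
    "R69": {
        "native": [-11878.39, -11844.19],
        "alternatives": [-11880.29, -11818.11, -11675.60],
    },
}


def native_gap(entry: dict) -> float:
    """Best alternative minus best native; positive means native is lowest."""
    return min(entry["alternatives"]) - min(entry["native"])


gaps = {name: native_gap(entry) for name, entry in annealed.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:8.2f} kJ/mol")
```

With this definition the gap is a few tens of kJ/mol for MBO, somewhat over a hundred for CTF, and slightly negative for R69, where one annealed alternative lies marginally below the best annealed native run. All three values should be read against the run-to-run scatter of roughly ±15 kJ/mol estimated from the seed experiments.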

Breakdown of the total energy into component parts

Electrostatic and solvation energies. Figures 12 to 14 plot the r.m.s. deviation from the native versus the internal protein electrostatic energy, the electrostatic solvation energy, and the sum of these two quantities for all of the structures in Table 1. As before, different symbols have been used to label structures generated via simulated annealing.

The most striking result of Figure 14 is that the total variation in the total electrostatic energy for each protein is substantially smaller (by a factor of five to ten) than the individual variations in the internal electrostatic and solvation contributions. Consider, for example, myoglobin. The extended structures have, unsurprisingly, the largest solvation term (by ~1000 kJ/mol) as many more of their charges are exposed to the solvent. In compensation, however, the internal electrostatic energy is correspondingly smaller than that of any of the compact structures. Conversely, the native structure has the smallest solvation term and the largest internal electrostatic term, presumably a reflection of the small exposed surface area (see Table 1).

Figure 13. Plots of r.m.s. deviation versus electrostatic solvation energy for MBO (a), CTF (b), R69 (c).

Figure 14. Plots of r.m.s. deviation versus total electrostatic energy for MBO (a), CTF (b), R69 (c).

Most of the structures generated with the reduced model potential have total electrostatic energies comparable with or even better than the native structure. It is not difficult to generate conformations with electrostatic energies of this magnitude, as the values for the unfolded conformations indicate (one of these is in fact substantially lower than the native); what is non-trivial is to do this and at the same time produce a compact structure with a reasonable hydrophobic component, as discussed below. The reduced model potential essentially serves to exclude compact conformations with inadequate electrostatic terms. It is remarkable that such a simple function (which also could apparently be seriously criticized on physical grounds) is extremely effective in generating compact structures with plausible electrostatic interactions.

Van der Waals and surface area terms. Figure 15 plots the r.m.s. deviation versus the internal protein van der Waals term for all of the structures in Table 1. As shown in Figure 8, the surface area and van der Waals terms are strongly correlated; for the purpose of this discussion, we will assume that a linear relation holds and that one term is reliably predicted by the other.

Figure 15. Plots of r.m.s. deviation versus van der Waals energy for MBO (a), CTF (b), R69 (c).

The sum of these two terms represents the continuum solvent approximation to the free energy of an uncharged protein conformation in water. In an exact evaluation of the enthalpic component of this free energy (for example, using an accurate atomic level model and averaging over explicit solvent configurations) the total enthalpy would be a sum of five terms: water–water van der Waals interaction, water–water electrostatic interaction, water–protein van der Waals interaction, water–protein electrostatic interaction, and protein–protein van der Waals interaction (the first two are of course averaged over solvent conformations and include an entropic contribution to the free energy). We shall designate these enthalpies as $V_{WW}$, $E_{WW}$, $V_{WP}$, $E_{WP}$, and $V_{PP}$, respectively.

$V_{PP}$ is of course directly evaluated in the molecular mechanics computation. In the continuum solvent approximation, the surface area free energy term $F_S$ must then represent the sum of the remaining four terms plus the entropic contribution $S_W$, extracted from experimental solvation energies on small molecules by assuming a proportionality to surface area, i.e.

$$F_S = V_{WW} + E_{WW} + V_{WP} + E_{WP} + S_W = H + N \qquad (3)$$

where

$$H = E_{WW} + E_{WP} + S_W \qquad (4)$$

is the hydrophobic energy, and

$$N = V_{WW} + V_{WP} \qquad (5)$$

is the averaged non-polar enthalpy of the solvent. As the protein conformation becomes more compact, one would expect $N$ to become less negative, due to the lowering of the magnitude of $V_{WP}$ as less of the protein's surface area is exposed to solvent, and $H$ to become more negative, due to smaller disruption of the hydrogen-bonding network by a more compact solute.

The small magnitude of $F_S$ (as compared, for example, to $V_{PP}$) indicates that the changes in $H$ and $N$ nearly cancel as the protein conformation is changed. Thinking of the solution as a set of close packed spheres (ignoring for the moment the distinction between solute and solvent), one would expect the total van der Waals energy, $V_{PP} + V_{WW} + V_{WP}$, to be more or less invariant to solute conformation. Hence, the real physical effect on the free energy of folding is enhancement of the hydrophobic energy of the solvent. In the continuum solvent model, however, this is reflected in a larger negative value of $V_{PP}$. To put this more directly in mathematical terms, from equation (3), assuming, as argued above, $N + V_{PP} \approx C$, a constant, and $F_S \approx 0$, we obtain:

$$H \approx V_{PP} - C. \qquad (6)$$

We now return to the results in Figures 12 to 14. It is apparent from Figures 8 and 15 that the native structure has a significantly lower value of $V_{PP}$ (and hence of $H$) than any alternative that we have been able to generate for all three proteins considered, on the order of hundreds of kJ/mol. This strongly suggests that the hydrophobic effect is the driving force for protein folding. Many, but not all, conformational basins of the protein have a reasonable electrostatic energy; of these, only one can be compressed to the small surface area (and correspondingly large $V_{PP}$, and hence $H$) of the native fold.
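The step from equation (3) to equation (6) can be written out explicitly. The following lines use only the two approximations stated in the text (that the surface area term is close to zero and that the total van der Waals energy is roughly constant); nothing beyond those assumptions is introduced.

```latex
\begin{align*}
F_S = H + N \approx 0
  \quad &\Rightarrow \quad H \approx -N, \\
N + V_{PP} \approx C
  \quad &\Rightarrow \quad N \approx C - V_{PP}, \\
\text{hence} \quad H &\approx -(C - V_{PP}) = V_{PP} - C .
\end{align*}
```

Within this approximation, a conformation with a more negative $V_{PP}$ therefore has a correspondingly more negative hydrophobic energy $H$, shifted by the same constant $C$ for all conformations.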

There are two possible reasons why the reduced model simulations fail to generate structures with a $V_{PP}$ comparable to the native structure. One explanation is that the excluded volume constraints employed in the current model are too crude, leading to the generation of structures with significant overlap of the van der Waals hard core regions. When these structures are inserted into the AMBER*/GB model, the overlaps are eliminated via conjugate gradient minimization, but at the cost of producing a less compact structure with a larger surface area and less favorable value of $V_{PP}$.

A second explanation is that the reduced model potential contains essentially no information about Coulombic interactions between residue pairs. The dominant terms in the Casari-Sippl potential are those that drive the charged amino acids to the exterior of the molecule; thus, even residues with charges of opposite sign repel each other. In an actual protein, however, we can expect that nearest neighbor and next-nearest neighbor residues will adjust their geometries to experience favorable Coulombic interactions. The argument is then that this adjustment process in non-native structures occurs at the expense of close packing; in order to avoid repulsive Coulombic interactions or to experience attractive interactions (e.g. hydrogen bonding), the structure will expand slightly so that the appropriate side-chain reorientations can be performed. For the native structure, this process does not require sacrifice of compactness, but a substantial sacrifice does occur for non-native configurations. This leads to a lowering of the magnitude of $V_{PP}$. In support of this proposal, we note that the native structure in almost all cases has the lowest internal Coulombic energy (compare Figure 12), suggesting that the arrangement of charges in this structure is significantly better than in our alternatives generated with the reduced potential. The alternative structures can make up much of the difference in electrostatic energy from the solvation term; however, this produces an increased surface area and hence a less favorable hydrophobic energy.

The data shown in Table 1 provide some evidence that these mechanisms are operative. Several cases can be seen in which the radius of gyration increases significantly after conjugate gradient minimization or after simulated annealing (e.g. structure 157 for myoglobin). However, the trend is not uniform and more careful studies need to be carried out. We intend to make detailed investigations along these lines in subsequent papers. It is likely that both types of errors are significant and at present we cannot determine the contribution of each class of error to the inaccuracy in the reduced model potential.
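The non-uniform trend in the radius of gyration can be seen by comparing the IRG and RG columns of Table 1A for the minimized myoglobin structures that list both values. The sketch below simply tabulates the change; the selection of rows (one minimized structure per reduced model backbone, excluding the extended structures) is an illustrative choice.

```python
# (IRG, RG) in Å for minimized MBO structures, copied from Table 1A.
radii = {
    "727": (15.53, 14.91),
    "457": (15.24, 15.12),
    "937": (15.03, 15.04),
    "428": (15.40, 15.22),
    "157": (14.41, 15.39),
    "763": (15.30, 15.55),
    "8": (14.60, 14.53),
    "906": (19.26, 18.71),
}

# Change in radius of gyration on dressing and minimization (final - starting).
changes = {sid: round(rg - irg, 2) for sid, (irg, rg) in radii.items()}
expanded = [sid for sid, delta in changes.items() if delta > 0]
largest = max(changes, key=changes.get)

for sid, delta in changes.items():
    print(f"structure {sid:>3}: change in RG = {delta:+.2f} Å")
print("largest expansion:", largest)
```

Structure 157 shows the largest expansion in this subset, while several other structures contract slightly, in line with the statement that the trend is not uniform.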

If the hypotheses presented above are correct, improvement of the potential function requires a finer level of description of the residue–residue interaction potential, in particular a more precise evaluation of excluded volume effects and a representation of Coulombic residue–residue interactions. We have taken initial steps in this direction, beginning with the excluded volume interactions. The development of a residue based van der Waals potential, which provides a significantly better representation of the actual hard core constraints contained in the AMBER* potential, is described below.

Residue based van der Waals potential

As described above, our current potential utilizes a hard core constraint derived from a data base survey. In implementing this constraint, the Cα–Cα distance $D_{\alpha\alpha}$ and the Cβ–Cβ distance $D_{\beta\beta}$ are employed. However, in our model the cross distances $D_{\alpha\beta}$ and $D_{\beta\alpha}$ are also readily available. The four distances $D_{\alpha\alpha}$, $D_{\alpha\beta}$, $D_{\beta\alpha}$, and $D_{\beta\beta}$ can provide a much more accurate description of the van der Waals packing of the side-chains than that provided by $D_{\alpha\alpha}$ and $D_{\beta\beta}$ alone. While the computation of all four distances does increase the computational cost of evaluating the potential, it is still orders of magnitude less expensive than an all atom molecular mechanics calculation would be; furthermore, for distant pairs one can revert to a single parameter potential, as the critical determinant of accuracy is the close packing of the side-chains.

To this end, we have begun the process of constructing an effective residue based van der Waals potential using the four distances described above. Residue–residue van der Waals interactions are evaluated from a database of protein structures. A four-index table is then constructed by binning each of the four distances with 0.5 Å spacing (up to 7.5 Å) and collecting the residue–residue energy statistics. We retain only attractive interactions (an unrelaxed bad contact between two atoms of two different residues produces a large repulsive contribution which might skew an otherwise favorable interaction), and assign a penalty whenever an empty entry appears in the table.
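The construction of such a four-index table can be sketched as follows. This is a minimal illustration and not the implementation used here: the function names are hypothetical, storing the mean attractive energy per bin is an assumption (the text says only that energy statistics are collected), distant pairs are simply given zero energy in place of the single parameter potential mentioned above, and the default penalty of 10 is the value quoted for the screening results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence, Tuple

BIN_WIDTH = 0.5                       # Å, bin spacing from the text
MAX_DIST = 7.5                        # Å, upper limit from the text
N_BINS = int(MAX_DIST / BIN_WIDTH)    # 15 bins per distance

Key = Tuple[str, str, int, int, int, int]
Sample = Tuple[str, str, Sequence[float], float]


def bin_index(d: float) -> Optional[int]:
    """Bin number for one distance, or None at or beyond MAX_DIST."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    if d >= MAX_DIST:
        return None
    return int(d // BIN_WIDTH)


def make_key(res_a: str, res_b: str, dists: Sequence[float]) -> Optional[Key]:
    """Table key from a residue pair and its four binned distances."""
    bins = [bin_index(d) for d in dists]
    if len(bins) != 4 or any(b is None for b in bins):
        return None
    return (res_a, res_b, bins[0], bins[1], bins[2], bins[3])


def build_table(samples: Iterable[Sample]) -> Dict[Key, float]:
    """Mean attractive energy per occupied bin (averaging is an assumption)."""
    sums: Dict[Key, float] = defaultdict(float)
    counts: Dict[Key, int] = defaultdict(int)
    for res_a, res_b, dists, energy in samples:
        key = make_key(res_a, res_b, dists)
        if key is None or energy >= 0.0:   # keep attractive interactions only
            continue
        sums[key] += energy
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def pair_energy(table: Dict[Key, float], res_a: str, res_b: str,
                dists: Sequence[float], penalty: float = 10.0) -> float:
    """Look up a pair; empty entries get the penalty, distant pairs zero."""
    key = make_key(res_a, res_b, dists)
    if key is None:
        return 0.0      # assumption: distant pairs are handled elsewhere
    return table.get(key, penalty)
```

A screening score for a reduced model structure would then be the sum of `pair_energy` over all residue pairs. The order of the two residues and of the cross distances must be kept consistent between table construction and lookup; this sketch does not symmetrize the table.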

We show here preliminary results obtained using the residue based van der Waals potential as a screening function for reduced model structures. Figure 16 displays results for myoglobin, CTF, and R69 using the derived van der Waals potential with a penalty of 10 for entries not in the lookup table. The improvement in the screening capability is quite dramatic. For all three proteins, a substantial energy gap is opened up between the native structure and the simulated structures. The result is particularly dramatic for R69 where the native structure was in the middle of the distribution for the original potential. Firm conclusions cannot yet be drawn because one has to carry out minimization with the potential to check whether alternative structures with low energies are generated. However, these results are very encouraging and suggest that, when the excluded volume problem is solved reliably, reduced model simulations will be able to generate structures with substantially better resolution than is currently possible (perhaps 3 to 4 Å as opposed

JMB—MS 402


(a)

(b)

(c)

Figure 16. Screening of reduced model structures usingthe residue based van der Waals energy for MBO (a),CTF (b), R69 (c).

given that solvation is treated by the GB continuummodel in a plausible fashion. The R69 results indicatethat improvements in the potential will be necessaryto obtain high resolution and/or reliable results fora large set of proteins. Further tests of the potentialcan be carried out by minimizing loop geometriesand comparing with the experimentally determinedstructures; we plan to do this in the future.

Secondly, the reduced model potential is capable ofgenerating some remarkably good results (e.g. thosefor myoglobin), but will require substantial modifi-cation to perform at this level for a general protein orto achieve resolution beyond 06 A. Our resultssuggest fruitful directions in which modificationscan be pursued, and this work is currently ongoing.We are optimistic about the possibility of ultimatelyobtaining 3 to 4 A resolution with the reduced modelon a routine basis. Another alternative is to increasethe flexibility of the reduced model by adding moredetail, for example allowing each side-chain to existin several different rotamer states. This model willrequire more computational expense per structurebut should also permit higher accuracy to beobtained.

Our results are very encouraging with regard todetermining the structure of an unknown protein viaa combination of the reduced and detailed modelcalculations, given secondary structure informationplus a small number of distance constraints as canoften be obtained from NMR experiments. A fully denovo prediction of structure from sequence willrequire, in addition to the technology discussed inthis paper, potential functions to evaluate secondarystructure energetics and algorithms to allowa-helices and b-strands to form, grow, diminish, anddisappear. Preliminary computational experimentsalong these lines are currently in progress.

Acknowledgements

This work was supported in part by the NIH Divisionof Research Resources, grant P41-RR06892. We thankBarry Honig for many useful discussions, and MichaelLevitt for suggesting R69 as a test case. We also thank theIllinois National Center for Supercomputing Applicationsfor a grant of CM-5 time.

ReferencesBernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer,

E. F., Brice, M. D., Rodgers, J. R., Kennard, O.,Shimanouchi, T. & Tasumi, M. (1977). The Protein DataBank: a computer-based archival file for macromol-ecular structures. J. Mol. Biol. 112, 535–542.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energyfunction for threading protein sequence through thefolding motif. Proteins: Struct. Funct. Genet. 16, 92–112.

Casari, G. & Sippl, M. J. (1992). Structure-derivedhydrophobic potential. Hydrophobic potential de-rived from X-ray structures of globular proteins is ableto identify native folds. J. Mol. Biol. 224, 725–732.

Cohen, F. E., Richmond, T. J. & Richards, F. M. (1979).Protein folding: evaluation of some simple rules for the

to 6 A) and that general applicability to the greatmajority of proteins will be possible.

Conclusion

We have investigated in detail the energetics of alarge number of plausible conformations of threeproteins: myoglobin, CTF and R69. A number ofgeneral conclusions can be drawn from these studies.First, the AMBER*/GB potential has a reasonablygood capability for distinguishing the nativestructure. This is a gratifying result because there isno obvious reason why the potential should fail,

JMB—MS 402


assembly of helices into tertiary structures withmyoglobin as an example. J. Mol. Biol. 132, 275–288.

Covell, D. G. (1992). Folding protein α-carbon chains into compact forms by Monte Carlo methods. Proteins: Struct. Funct. Genet. 14, 409–420.

Covell, D. G. & Jernigan, R. L. (1990). Conformations of folded proteins in restricted spaces. Biochemistry, 29, 3287–3294.

Daggett, V. & Levitt, M. (1991). A molecular dynamics simulation of the C-terminal fragment of the L7/L12 ribosomal protein in solution. Chem. Phys. 158, 501–512.

Desmet, J., De Maeyer, M., Hazes, B. & Lasters, I. (1992). The dead-end elimination theorem and its use in protein side chain positioning. Nature (London), 356, 539–542.

Farid, H., Shenkin, P. S., Greene, J. & Fetrow, J. (1992). Prediction of side chain conformations in protein core and loops from rotamer libraries. Biophys. J. 61, A350.

Friedrichs, M. S., Goldstein, R. A. & Wolynes, P. G. (1991). Generalized protein tertiary structure recognition using associative memory Hamiltonians. J. Mol. Biol. 222, 1013–1034.

Gunn, J. R., Monge, A., Friesner, R. A. & Marshall, C. H. (1994). Hierarchical algorithm for computer modeling of protein tertiary structure: folding of myoglobin to 6.2 Å resolution. J. Phys. Chem. 98, 702–711.

Hinds, D. A. & Levitt, M. (1992). A lattice model for protein structure prediction at low resolution. Proc. Nat. Acad. Sci., U.S.A. 89, 2536–2540.

Janin, J., Wodak, S., Levitt, M. & Maigret, B. (1978). Conformation of amino acid side-chains in proteins. J. Mol. Biol. 125, 357–386.

Kolinski, A., Godzik, A. & Skolnick, J. (1993). A general method for the prediction of the three dimensional structure and folding pathway of globular proteins: application to designed helical proteins. J. Chem. Phys. 98, 7420–7433.

Lau, K. F. & Dill, K. A. (1989). Statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 22, 3986–3997.

Lee, C. & Subbiah, S. (1991). Prediction of protein side chain conformations by packing optimization. J. Mol. Biol. 217, 373–388.

Levitt, M. & Sharon, R. (1988). Accurate simulation of protein dynamics in solution. Proc. Nat. Acad. Sci., U.S.A. 85, 7557–7561.

Maiorov, V. N. & Crippen, G. M. (1992). Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227, 876–888.

Mark, A. E. & van Gunsteren, W. F. (1992). Simulation of the thermal denaturation of hen egg white lysozyme: trapping the molten globule state. Biochemistry, 31, 7745–7748.

McDonald, D. Q. & Still, W. C. (1992). AMBER* torsional parameters for the peptide backbone. Tetrahedron Letters, 33, 7743–7746.

Mohamadi, F., Richards, N. G. J., Guida, W. C., Liskamp, R., Lipton, M., Caufield, C., Chang, G., Hendrickson, T. & Still, W. C. (1990). MacroModel: an integrated software system for modeling organic and bioorganic molecules using molecular mechanics. J. Comput. Chem. 11, 440–467.

Monge, A., Friesner, R. A. & Honig, B. (1994). An algorithm to generate low-resolution protein tertiary structures from knowledge of secondary structure. Proc. Nat. Acad. Sci., U.S.A. 91, 5027–5029.

Novotny, J., Bruccoleri, R. E. & Karplus, M. (1984). An analysis of incorrectly folded protein models. Implications for structure prediction. J. Mol. Biol. 177, 787–818.

Novotny, J., Rashin, A. A. & Bruccoleri, R. E. (1988). Criteria that discriminate between native proteins and incorrectly folded models. Proteins: Struct. Funct. Genet. 4, 19–30.

Privalov, P. L. (1989). Thermodynamic problems of protein structure. Annu. Rev. Biophys. Biophys. Chem. 18, 47–69.

Ptitsyn, O. B. & Rashin, A. A. (1975). A model of myoglobin self-organization. Biophys. Chem. 3, 1–20.

Shakhnovich, E. I., Farztdinov, G., Gutin, A. M. & Karplus, M. (1991). Protein folding bottlenecks: a lattice Monte Carlo simulation. Phys. Rev. Letters, 67, 1665–1668.

Skolnick, J. & Kolinski, A. (1990). Simulations of the folding of a globular protein. Science, 250, 1121–1125.

Still, W. C., Tempczyk, A., Hawley, R. C. & Hendrickson, T. (1990). Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Amer. Chem. Soc. 112, 6127–6129.

Sun, S. (1993). Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci. 2, 762–785.

Vieth, M., Kolinski, A., Brooks, C. L., III & Skolnick, J. (1994). Prediction of the folding pathways and structure of GCN4 leucine zipper. J. Mol. Biol. 237, 361–367.

Warshel, A. & Levitt, M. (1976). Folding and stability of helical proteins: Carp myogen. J. Mol. Biol. 106, 421–437.

Weiner, S. J., Kollman, P. A., Case, D. A., Singh, U. C., Ghio, C., Alagona, G., Profeta, S., Jr & Weiner, P. (1984). A new force field for molecular mechanical simulation of nucleic acids and proteins. J. Amer. Chem. Soc. 106, 765–784.

Wishart, D. S., Sykes, B. D. & Richards, F. M. (1992). The chemical shift index: a fast and simple method for the assignment of protein secondary structure through NMR spectroscopy. Biochemistry, 31, 1647–1651.

Wuthrich, K., Billeter, M. & Braun, W. (1984). Polypeptide secondary structure determination by NMR observation of short proton–proton distances. J. Mol. Biol. 180, 715–740.

Wuthrich, K., Spitzfaden, C., Memmert, K., Widmer, H. & Wider, G. (1991). Protein secondary structure determination by NMR. Application with recombinant human cyclophilin. FEBS Letters, 285, 237–247.

Yang, A.-S., Sharp, K. A. & Honig, B. (1992). Analysis of the heat capacity dependence of protein folding. J. Mol. Biol. 227, 889–900.

Edited by F. E. Cohen

(Received 23 May 1994; accepted 4 January 1995)
