
Atomistic Protein Folding Simulations on the Submillisecond Time Scale Using Worldwide Distributed Computing

Vijay S. Pande1–4

Ian Baker1

Jarrod Chapman4,*

Sidney P. Elmer1

Siraj Khaliq5

Stefan M. Larson2

Young Min Rhee1

Michael R. Shirts1

Christopher D. Snow2

Eric J. Sorin1

Bojan Zagrovic2

1 Department of Chemistry, Stanford University, Stanford, CA 94305-5080

2 Biophysics Program, Stanford University, Stanford, CA 94305-5080

3 Department of Structural Biology, Stanford University, Stanford, CA 94305-5080

4 Stanford Synchrotron Radiation Laboratory,

5 Department of Computer Science,


Received 27 March 2002; accepted 22 May 2002

Abstract: Atomistic simulations of protein folding have the potential to be a great complement to experimental studies, but have been severely limited by the time scales accessible with current computer hardware and algorithms. By employing a worldwide distributed computing network of tens of thousands of PCs and algorithms designed to efficiently utilize this new many-processor, highly heterogeneous, loosely coupled distributed computing paradigm, we have been able to simulate hundreds of microseconds of atomistic molecular dynamics. This has allowed us to directly

Correspondence to: Vijay S. Pande; email: [email protected]

*Present address: Physics Department, University of California, Berkeley, CA

Contact grant sponsor: ACS PRF, NSF MRSEC CPIMA, NIH BISTI, ARO, and Stanford University

Contact grant numbers: 36028-AC4 (ACS PRF), DMR-9808677 (NSF MRSEC CPIMA), IP20 6M64782-01 (NIH BISTI), 41778-LS-RIP (ARO)

Biopolymers, Vol. 68, 91–109 (2003) © 2002 Wiley Periodicals, Inc.


simulate the folding mechanism and to accurately predict the folding rate of several fast-folding proteins and polymers, including a nonbiological helix, polypeptide α-helices, a β-hairpin, and a three-helix bundle protein from the villin headpiece. Our results demonstrate that one can reach the time scales needed to simulate fast folding using distributed computing, and that potential sets used to describe interatomic interactions are sufficiently accurate to reach the folded state with experimentally validated rates, at least for small proteins. © 2002 Wiley Periodicals, Inc. Biopolymers 68: 91–109, 2003

Keywords: atomistic protein folding; microsecond time scale; computer hardware; computer algorithms; molecular dynamics; distributed computing; villin; beta hairpin

INTRODUCTION

Understanding the sequence-structure relationship of proteins will play a pivotal role in the postgenomic era, and will have great impact on genetics, biochemistry, and pharmaceutical chemistry.1–3 A detailed picture of the folding process itself will be important in understanding diseases, such as Alzheimer's and variant Creutzfeldt-Jakob disease, believed to be related to protein misfolding.4 Finally, an understanding of protein folding mechanisms will be important in protein design and nanotechnology, in which self-assembling nanomachines may be designed using synthetic polymers with protein-like folding properties.5

Unfortunately, current computational techniques to tackle protein folding simulations are fundamentally limited by the long time scales (from a simulation point of view) needed to study the dynamics of interest. For example, while the fastest proteins fold on the order of tens of microseconds, current single computer processors can only simulate on the order of a nanosecond of real-time folding in full atomic detail per CPU day, a 10,000-fold computational gap. Great strides in traditional parallel molecular dynamics (MD), utilizing many processors to speed a single dynamics simulation, have been made and have partially overcome this divide. A tour-de-force parallelization of simulation code for supercomputers by Duan and Kollman has previously led to the simulation of 1 μs of dynamics for the villin headpiece three-helix bundle,6 demonstrating that parallelization schemes using hundreds of processors can be used to make significant progress at closing this computational gap. However, such methods have fundamental drawbacks: in particular, these methods require complex, expensive supercomputers due to the need for fast communication between processors. Moreover, due to the stochastic nature of folding, in order to study the folding of a 10-μs folder, one must simulate hundreds of microseconds, requiring computing power equal to thousands or tens of thousands of today's processors.

Developing such large-scale parallelization methods is very difficult, and current parallelization schemes cannot scale to the level of even thousands of processors (i.e., cannot use so many processors efficiently). To understand why scalability to thousands of processors is so difficult, consider an analogy to a graduate student thesis. A typical thesis takes 1500 graduate student days. If one employed 1500 graduate students to accomplish this goal, would it be possible to complete a thesis in a single day? Clearly not: the overhead of communication between students, as well as the inability to devise an algorithm to divide the labor evenly, would make 100% efficiency impossible in this case. At this level of scaling, it is likely that the work would actually take longer. These issues, in particular balancing communication time against time spent actually doing work, are mirrored in the division of labor between computer processors. Clearly, the only way to efficiently utilize such a large number of processors is to divide work in such a way that requires minimal communication.

Even with an algorithm with perfect scalability (e.g., with a 10,000-fold increase in speed using 10,000 processors), we are still left with the problem of obtaining a 10,000-processor supercomputer. For comparison, the largest unclassified supercomputer in the world (the SP at NERSC) has 2500 processors, and of course this resource must be shared between many (hundreds) different research groups. Recently, another approach has been developed to bridge this enormous computational gap: worldwide distributed computing.7 There are hundreds of millions of idle PCs potentially available for use at any given time, the majority of which are vastly underused. These computers could be used to form the most powerful supercomputer on the planet by several orders of magnitude.7 However, to tap into this resource efficiently and productively, we must employ nontraditional parallelization techniques.8,9 Indeed, we wish to accomplish a seemingly impossible goal: to push the scalability of MD simulations to previously unattainable levels (i.e., the efficient use of tens of thousands


of processors) using an extremely heterogeneous network of processors that are loosely coupled by relatively low-tech networking (primarily modems).

In this review, we first present the details of our method to simulate protein folding using distributed computing, and then summarize our folding simulation results for several small, fast folding proteins and polymers. Specifically, we demonstrate the applicability of our method by simulating the folding of protein helices10,11 in atomic detail. Next, as an additional quantitative test of our methodology, we examine the folding rate and folding time distribution of a nonbiological helix,12 for which results of traditional MD simulations13 and experiments14 are known. We finally apply these methods to larger and more complex proteins, including a β-hairpin15,16 and a three-helix bundle.6 We conclude with an assessment of the validity of our methods including a quantitative comparison of these results with experimental measurements of folding rates and equilibrium constants, and a discussion of what we have learned about folding mechanisms.

METHODS

Why Is the Dynamics of Complex Systems so Slow and How Can This Be Circumvented?

The dynamics of complex systems typically involves the crossing of free energy barriers.17 It has been demonstrated18,19 that free energy barrier crossing dynamics, such as protein folding, does not make steady, gradual progress from one state to another (such as dynamics from the unfolded to folded states during a folding transition), but rather spends most of the trajectory time dwelling in a free energy minimum, waiting for thermal fluctuations to push the system over a free energy barrier. Indeed, this process is dominated by the waiting time, and the time to cross the free energy barrier is in fact much shorter than the overall folding time, typically by several orders of magnitude.20 This opens the door to the possibility that one may simulate complex processes, such as folding, using trajectories much shorter than the folding time20 (i.e., using nanosecond simulations to reproduce kinetics with microsecond rate constants).

Methods to exploit this observation have been previously developed (the most notable being path sampling, in which one simulates the paths over the barrier, rather than the time spent waiting in the original free energy minimum), with promising results.20–22 However, several technical complications have limited the use of these methods in simulating protein folding and in making quantitative comparisons to experiment. First, some of these methods require that the path in question not dwell in metastable states and thus may get stuck in local metastable free energy minima along the pathway (which have been found in many folding and unfolding simulations19). Second, these methods require knowledge of the native state as an end goal and in a sense apply a field to the trajectories to reach this native state. Finally, in the end, the heart of the protein folding problem lies in sampling, and even with the great benefits of path sampling methodology, a tremendous degree of sampling (in this case, in path space) must still be performed.

Thus, the main goal of the method we have developed is to simulate folding dynamics starting purely from the protein sequence and an atomistic force field, without using any knowledge of the native state in the folding simulation. It is important that our method can successfully tackle the issue of lingering in metastable free energy minima. To achieve this, we use the following algorithm ("ensemble dynamics"). Consider running M independent simulations started from a given initial condition (each run starts from the same coordinates, but with different velocities). We next wait for the first simulation to cross the free energy barrier (see Figure 1). Since the average time for the first of M simulations to cross a single barrier is M times less than the average crossing time of a single simulation (assuming an exponential distribution of barrier crossing times8,9; see below for details), we can use M processors to effectively achieve an M times speedup of a dynamical simulation, thus avoiding the waiting in free energy minima. In a sense, we distribute the waiting to each processor in parallel, rather than in series, as in traditional parallel MD. Given the ability to identify individual barrier crossings, one can then speed the entire (multiple barrier) problem by turning it into a series of single barrier problems, restarting the processors from the new free energy minima after each barrier crossing (see below and Refs. 8 and 9). Also, it can be shown that one need not use identical computers for these calculations, an important fact in employing heterogeneous public clusters.8,9

To more quantitatively see how one can use these simulations to examine events that occur on considerably longer time scales, consider a protein with single exponential kinetics, where the fraction that fold in time t is given by

$$f(t) = 1 - \exp(-kt)$$

where k is the folding rate. On average, a folding event will occur on the 1/k time scale. However, we expect to see some folding events even at times that are short compared to the folding time, i.e., when kt is small. In this case, we have f(t) ≈ kt. How many folding events would we expect to see? Consider studying a protein that folds with k = 1/10,000 ns: given M = 10,000 simulations, each of length t = 30 ns, we would expect to see M f(t) ≈ M k t = 30 folding events.
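The estimate above can be checked numerically. The following is a minimal sketch (the function names are illustrative and do not come from the original work) that compares the exact expectation M f(t) with the short-time approximation M k t for the quoted values.

```python
import math

def fraction_folded(k, t):
    """Fraction folded by time t for single-exponential kinetics: f(t) = 1 - exp(-k t)."""
    return 1.0 - math.exp(-k * t)

def expected_folding_events(M, k, t):
    """Expected number of folding events among M independent runs of length t."""
    return M * fraction_folded(k, t)

# Values quoted in the text: k = 1/(10,000 ns), M = 10,000 runs, t = 30 ns each.
k = 1.0 / 10_000.0   # folding rate in 1/ns
M = 10_000
t = 30.0             # ns

exact = expected_folding_events(M, k, t)
linear = M * k * t   # short-time approximation f(t) ~ k t
print(f"exact: {exact:.2f}, linear approximation: {linear:.2f}")
```

Because kt = 0.003 is small here, the exact and linearized counts differ by well under 1%.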

The above discussion shows how one can speed dynamics by a factor of M for a single barrier system, but what about multiple barrier problems? To handle multiple barriers, we suggest a scheme similar to Voter's parallel replica method.9 This method is summarized in Figure 1. We start M simulations from a single initial condition, and then wait for the first simulation to cross a free energy barrier. Once this simulation has crossed over to the next free energy minimum, we move all other simulations to that new location in configuration space, restart the dynamics, and wait for another replica to cross another free energy barrier. Since we employ stochastic dynamics, even though all simulations are restarted from the same configuration, they quickly decorrelate from each other and explore different regions of the phase space.
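The restart scheme can be illustrated with a toy kinetic model. The sketch below is not the authors' code: it assumes each barrier crossing is an independent exponential waiting time, and the barrier rates, replica count, and function names are hypothetical. It compares the time a single trajectory needs to cross a series of barriers with the per-replica time when M replicas are restarted after each first crossing.

```python
import random

def serial_time(rates, rng):
    """Time for one trajectory to cross each barrier in turn (exponential waits)."""
    return sum(rng.expovariate(k) for k in rates)

def ensemble_time(rates, M, rng):
    """Per-replica simulated time when M replicas are restarted after each first crossing."""
    total = 0.0
    for k in rates:
        # Wait until the first of M independent replicas crosses this barrier,
        # then restart all replicas from the new minimum.
        total += min(rng.expovariate(k) for _ in range(M))
    return total

rng = random.Random(0)
rates = [0.01, 0.002, 0.005]   # hypothetical barrier-crossing rates in 1/ns
M = 50
trials = 2000

mean_serial = sum(serial_time(rates, rng) for _ in range(trials)) / trials
mean_ensemble = sum(ensemble_time(rates, M, rng) for _ in range(trials)) / trials
speedup = mean_serial / mean_ensemble
print(f"serial: {mean_serial:.1f} ns, ensemble: {mean_ensemble:.2f} ns, speedup: {speedup:.1f}")
```

Under these assumptions the measured speedup should come out close to M, up to sampling noise; the model deliberately ignores the cost of detecting transitions and redistributing coordinates.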

Quantitative Rate Prediction

Below, we show how this scheme can be used to predict folding rates. This result stems from the fact that the distribution of barrier crossing times for the first-to-cross is directly linked to the distribution of usual barrier crossing times. To demonstrate this, we again assume single exponential kinetics (deviations from single exponential kinetics have also been examined elsewhere8). For a single processor, we expect that a particular simulation crosses the barrier at time t with a probability density

$$P_1(t) = k \exp(-kt)$$

For the M simulation case, the probability density for the first simulation to cross at time t is

$$P_M(t) = k \exp(-kt)\, M \left[1 - k \int_0^t \exp(-kt')\, dt'\right]^{M-1}$$

(i.e., the probability that one simulation has crossed, times a degeneracy factor of M, times the probability that the remaining M − 1 simulations have not folded). Evaluation of the integral above yields

$$P_M(t) = M k \exp(-M k t)$$

which is exactly the same distribution as the single processor case, except with an effective rate which is M times faster. Since this method simply speeds the effective rate of crossing each barrier, one can use the number of processors M and the rate for first crossing to predict the experimental rate.
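This result can be checked by direct sampling. The sketch below is an illustration only, with an arbitrary rate and replica count: it draws the earliest crossing time among M exponential replicas, compares its mean and tail with an exponential distribution of rate M k, and then recovers the single-replica rate from M and the observed first-crossing time.

```python
import math
import random

def first_crossing_time(k, M, rng):
    """Earliest barrier-crossing time among M independent replicas with rate k."""
    return min(rng.expovariate(k) for _ in range(M))

rng = random.Random(1)
k = 1.0 / 10_000.0   # single-replica crossing rate in 1/ns (illustrative value)
M = 100
samples = [first_crossing_time(k, M, rng) for _ in range(5000)]

# For an exponential distribution with rate M*k the mean is 1/(M*k) and the
# fraction of samples exceeding that mean is exp(-1).
mean_first = sum(samples) / len(samples)
predicted_mean = 1.0 / (M * k)
tail_fraction = sum(s > predicted_mean for s in samples) / len(samples)

# Recover the single-replica rate from M and the observed first-crossing time.
k_estimate = 1.0 / (M * mean_first)
print(f"mean first crossing: {mean_first:.1f} ns (predicted {predicted_mean:.1f} ns)")
print(f"tail fraction: {tail_fraction:.3f} (predicted {math.exp(-1):.3f})")
print(f"estimated k: {k_estimate:.2e} per ns")
```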

The error inherent in the above procedure for calculating the rate and the time constant can be estimated in the following way. As shown above, for a given $N_{\mathrm{total}}$ and a given t, the rate k is simply proportional to the number of molecules that have folded by time t, $N_{\mathrm{folded}}(t)$. Since each folding process behaves probabilistically (according to an exponential distribution) and given fixed t and $N_{\mathrm{total}}$, the number of processes that will fold by time t, $N_{\mathrm{folded}}(t)$, will be a random variable. In other words, different realizations of the large experiment containing $N_{\mathrm{total}}$ individual processes will, by their very nature, yield different values of $N_{\mathrm{folded}}(t)$ for a fixed time t. From this it follows that our rate estimate will also be associated with a certain inherent uncertainty. From elementary probability theory, we know that the number of folding events by time t, $N_{\mathrm{folded}}(t)$, given a constant rate, will be distributed according to the Poisson distribution. This in turn means that the rate estimate, which is proportional to $N_{\mathrm{folded}}(t)$, will follow the same distribution up to a constant scale factor. The standard deviation of a Poisson distribution with rate $\lambda$ is equal to $\lambda^{1/2}$, meaning that our rate estimate and its standard deviation will simply be

$$k = \frac{N_{\mathrm{folded}}}{N_{\mathrm{total}}\, t} \pm \frac{N_{\mathrm{folded}}^{1/2}}{N_{\mathrm{total}}\, t}$$

FIGURE 1 Simulating dynamical events using worldwide distributed computing. Traditional methods utilize multiple processors to speed a single dynamics calculation. We suggest that multiple processors can be used to generate sets of calculations, and that the desired thermodynamic or kinetic observables can be calculated from such an ensemble. This method does not require supercomputers and can run well on massively parallel clusters.7–9 Consider a multiple barrier-crossing problem (a model of many complex phenomena). Since the bulk of simulation time is spent waiting for thermal fluctuations to bring the system over the barriers, one can speed the calculation by starting many simulations in the first free energy minimum (a), and waiting until just one of them has crossed. At this point, we couple the simulations by placing them all in the same place in the configuration space as the simulation that has crossed that barrier (b). This process is then repeated as many times as needed to cross additional free energy barriers (c). One can show (see text) that this algorithm, with M processors, is equivalent to a single processor system running M times faster.8,9 Thus, with hundreds to thousands of processors and assuming that one can identify transitions, we would be able to bridge the computational barriers currently limiting protein folding and reach well into the microsecond time-scale.


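Standard propagation of error results in a time constant and standard deviation of

$$\tau = \frac{1}{k} = \frac{N_{\mathrm{total}}\, t}{N_{\mathrm{folded}}} \pm \frac{N_{\mathrm{total}}\, t}{N_{\mathrm{folded}}^{3/2}}$$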

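For example, for the β-hairpin folding data (see below), we have $N_{\mathrm{total}} = 2700$, $N_{\mathrm{folded}} = 8$, and t = 14 ns, which results in $k = 2.1 \times 10^{5} \pm 0.74 \times 10^{5}\ \mathrm{s}^{-1}$, and $\tau = 4.7 \pm 1.7\ \mu\mathrm{s}$.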
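The two error expressions are straightforward to evaluate. The following minimal sketch (function names are illustrative, not taken from the original work) computes the rate, the time constant, and their standard deviations from the number of folded trajectories, the total number of trajectories, and the trajectory length.

```python
import math

def rate_with_error(n_folded, n_total, t):
    """Rate k = N_folded / (N_total * t) and its Poisson standard deviation."""
    k = n_folded / (n_total * t)
    sigma_k = math.sqrt(n_folded) / (n_total * t)
    return k, sigma_k

def time_constant_with_error(n_folded, n_total, t):
    """Time constant tau = 1/k and its standard deviation by error propagation."""
    tau = n_total * t / n_folded
    sigma_tau = n_total * t / n_folded ** 1.5
    return tau, sigma_tau

# beta-hairpin values quoted in the text; t is converted from ns to seconds.
n_total, n_folded = 2700, 8
t_seconds = 14e-9

k, sigma_k = rate_with_error(n_folded, n_total, t_seconds)
tau, sigma_tau = time_constant_with_error(n_folded, n_total, t_seconds)
print(f"k   = {k:.2e} +/- {sigma_k:.2e} per second")
print(f"tau = {tau * 1e6:.1f} +/- {sigma_tau * 1e6:.1f} microseconds")
```

With only eight folding events the relative uncertainty is $1/\sqrt{8} \approx 35\%$ for both quantities, so the rate comes out near $2.1 \times 10^{5}\ \mathrm{s}^{-1}$ with a standard deviation of roughly $7.5 \times 10^{4}\ \mathrm{s}^{-1}$, and the time constant near 4.7 ± 1.7 μs.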

We stress that as long as one can identify transitions, thus allowing an M times speed-up for all barrier crossings, the dynamics we simulate will faithfully follow the dynamics one would obtain from traditional MD, but simply M times faster. If there are off-pathway traps, our method will go to them; indeed, we will reach them M times faster. However, we will escape these traps M times faster as well. This method is not intended as a structure prediction algorithm, but rather a means to speed dynamics and study the mechanism of folding, which may include on-pathway intermediates or diversions to traps.

How Can One Identify Free Energy Barrier Crossings (Transitions)?

Of course, the utility of this method rests on our ability to identify transitions, i.e., to calculate whether a simulation has crossed a free energy barrier. Voter's parallel replica method was intended to accelerate the dynamics of solid-state systems that have energy barriers, in which one can identify new states by performing energy minimization to see whether one has crossed an energy barrier.9 However, in protein folding (as well as many other complex systems), the relevant barriers are free energy barriers, and thus an energy minimization technique is not applicable. In order for this method to be applied to a broad range of barrier crossing problems, one needs to use a more general way to identify free energy barrier crossings.

We suggest that, in analogy to first-order phase transitions, one could look for a large variance in energy, which can loosely be related to a momentary surge in the heat capacity (a common sign of a first-order phase transition). Such energy variance peaks have been seen to coincide with free energy barrier crossings in simple models23 and all-atom (S. Perkins and V. Pande, unpublished results) models of protein folding. This technique has the significant advantage that it does not require any knowledge of the structure of the protein at the barrier.24 Moreover, to the extent that energy variance peaks correctly identify transitions, these peaks would aid in the interpretation of the simulation results, since they would demarcate transitions to new free energy minima.

Of course, in the case of single-exponential kinetics, as is experimentally observed for almost all small proteins, there exists only one rate-determining free energy barrier, and thus recognizing the barrier is not essential to the technique. In fact, although we do not discuss it in this paper, in some cases ignoring the transitions can result in reaching the folded state faster than by recognizing all the barriers.8 Finally, simulating completely independent trajectories is another appealing possibility for systems with single exponential kinetics; we discuss this possibility in the Discussion section below.

Simulation Details

For all of the molecules presented here, each simulation is started from a completely extended state. This is done to avoid any possibility of biasing the initial state toward the native state of the molecule. Clearly the extended state does not represent the structure of the unfolded state. Indeed, we find that rapidly (i.e., within 13 ns of MD simulation) this extended state relaxes to the unfolded state of the protein. While this practice utilizes more computational time than, for example, starting from some predicted unfolded state, it has the virtue of not making any assumptions of the unfolded ensemble and removes any possibility of biasing the system to the native state.

For each run, we have used M "clone" processors, each simulating folding in atomic detail with molecular or stochastic dynamics simulations (Figure 1). Once one of these clones makes a transition (identified by a spike in the energy variance: see below), we declare that the simulation has gone through a transition, copy the resulting configuration to all of the other processors, and recommence simulations from the new configuration. After restarting all simulations from the coordinates of the barrier-crossing simulation, one must ensure decorrelation of the next ensemble of trajectories in order to achieve an increase in computational speed. This process is performed many times, over several runs.

In our simulations, the spatial coordinates of the barrier-crossing simulation were copied and unique random number seeds (for Langevin dynamics random forces) were used to immediately differentiate the simulations. In a purely deterministic simulation, one would need to differentiate each simulation by restarting them with differing velocities, which may lead to potentially nonphysical discontinuities in the path; however, if the velocity decorrelation time is much shorter than the conformational decorrelation time (certainly true for dynamics in any water-like solvent), then the effects are likely to be minimal. In either case, the path obtained would correspond to a fast traversal of the potential landscape, but the total simulation time among all processors would be equivalent to the additional time waiting in minima that a representative serial simulation would take.
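The role of the unique seeds can be illustrated with a toy model. The sketch below is not the Folding@Home implementation: it assumes a one-dimensional overdamped Langevin particle in a hypothetical double-well potential, with arbitrary reduced units, and shows that clones restarted from the same coordinate diverge when each is given its own random number stream.

```python
import math
import random

def langevin_step(x, rng, dt=0.01, kT=1.0, gamma=1.0):
    """One overdamped Langevin step in a double-well potential U(x) = (x^2 - 1)^2."""
    force = -4.0 * x * (x * x - 1.0)
    noise = rng.gauss(0.0, math.sqrt(2.0 * kT * dt / gamma))
    return x + force * dt / gamma + noise

def restart_clones(x_crossed, seeds, n_steps):
    """Restart every clone from the same coordinate, each with its own random stream."""
    finals = []
    for seed in seeds:
        rng = random.Random(seed)   # unique seed -> distinct random forces
        x = x_crossed
        for _ in range(n_steps):
            x = langevin_step(x, rng)
        finals.append(x)
    return finals

seeds = range(1000, 1008)           # hypothetical unique seeds, one per clone
finals = restart_clones(x_crossed=1.0, seeds=seeds, n_steps=200)
spread = max(finals) - min(finals)
print("final positions:", [round(x, 3) for x in finals])
print(f"spread among clones: {spread:.3f}")
```

Rerunning with the same seeds reproduces the same trajectories, which is convenient for debugging, while distinct seeds are what decorrelate the clones.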

The Folding@Home distributed computing system (http://folding.stanford.edu) was used for the two most demanding calculations (the β-hairpin and villin simulations) presented here. The Folding@Home client software (which performs the scientific calculations) is based upon the Tinker molecular dynamics code,25 with numerous modifications performed by Michael Shirts, other members of the Pande group, Jed Pitera, and Bill Swope. We simulated folding and unfolding at 300 K and at pH 7 (unless noted otherwise), using the OPLS26 parameter set and the GB/SA27 implicit solvent model. Stochastic dynamics were used to simulate the viscous drag of water (γ = 91/ps), and a 2 fs integration time step was used with the RATTLE algorithm28 to maintain bond lengths. Long-range interactions were truncated using 16 Å cutoffs and 12 Å tapers.

We identified transitions by a heat capacity spike associated with crossing a free energy barrier. It has been previously shown that this is a means to identify transitions in all-atom8 and simplified protein model29 simulations. To monitor the heat capacity during the simulation, we calculate the energy variance, and use the thermodynamic relationship $C_v = (\langle E^2 \rangle - \langle E \rangle^2)/(k_B T^2)$, where E is the energy of the system; note that since we are using an implicit solvent, our energy is often called an "internal free energy" (the total free energy except for protein conformational entropy). Each PC runs a 100 ps MD simulation (1 "generation"), calculates the energy variance within this time period, and then returns this data to the Folding@Home server. If the energy variance exceeds a preset threshold value, the server identifies this trajectory as having gone through a transition, and then resets all other processors to the newly reported coordinates. Since the heat capacity is extensive, we used a fixed value of this threshold per atom (0.8 kcal²/mol²/atom) for all molecules (which for example leads to a threshold of 300 kcal²/mol² for villin).
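The detection rule described above amounts to a variance test on the energies of one generation. The following is a minimal sketch under stated assumptions: the function names, the synthetic energy traces, and the 375-atom system size (chosen only so that the threshold equals 300 kcal²/mol²) are hypothetical, while the per-atom threshold of 0.8 kcal²/mol²/atom is the value quoted in the text.

```python
def energy_variance(energies):
    """Variance <E^2> - <E>^2 of the energies sampled within one generation."""
    n = len(energies)
    mean = sum(energies) / n
    mean_sq = sum(e * e for e in energies) / n
    return mean_sq - mean * mean

def crossed_barrier(energies, n_atoms, threshold_per_atom=0.8):
    """Flag a transition when the variance exceeds the per-atom threshold times n_atoms.

    threshold_per_atom is in kcal^2/mol^2/atom (0.8 is the value quoted in the text).
    """
    return energy_variance(energies) > threshold_per_atom * n_atoms

# Hypothetical energy samples (kcal/mol) from two 100 ps generations of a
# 375-atom system, for which the threshold is 0.8 * 375 = 300 kcal^2/mol^2.
quiet = [-1200.0 + 5.0 * ((i % 7) - 3) for i in range(100)]
spike = [-1200.0 if i < 50 else -1240.0 for i in range(100)]

print(crossed_barrier(quiet, n_atoms=375))
print(crossed_barrier(spike, n_atoms=375))
```

The first trace only fluctuates about a fixed mean and stays below the threshold, whereas the second contains a 40 kcal/mol step in energy, giving a variance of 400 kcal²/mol² that exceeds it.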

Since transitions occur relatively infrequently (see below), one need not run these simulations on massively parallel supercomputers (with high speed communication); instead, these simulations are well suited for large, decentralized distributed computing clusters, such as the Folding@Home project. Not only is this a demonstration that such distributed computing clusters can be used to study long time-scale kinetics with molecular dynamics; we stress that this is likely the only way such calculations could have been practically performed, considering the great computational demands of these calculations.

RESULTS

Protein α-Helices

To test the methodology presented above, we have simulated the folding of two different α-helical peptides. One sequence we examined, the Fs peptide Ac-A5(A3RA)3A-NH2, has been shown experimentally to have biexponential kinetics, with characteristic times of 10 ns and 160 ± 60 ns.10 Helices are believed to form via nucleation,30,31 which is influenced by the disorder in a system (either as a nucleation accelerator or blocker), analogous to a liquid with impurities. In our system, the arginine residues could be considered to be an analog of these impurities that blocks propagation, and it is interesting to consider the role of this disorder in the sequence above, i.e., whether the arginine residues affect the nucleation processes. To address this, we have also folded a pure poly-A chain, Ac-A20-NH2.

We have been able to fold both of these protein sequences and find rates comparable to experiment at a temperature of 10 °C. The initial configurations were completely elongated chains (135, 135). Qualitatively, both the poly-A helix and the Fs peptide folded by first undergoing nucleation followed by propagation toward the termini. We found propagation in both directions (N to C and C to N), although we do not have sufficient statistics to determine a bias in propagation direction.30 Quantitatively, the Fs peptide folded (i.e., reached 15 helical residues, the value expected from experiment10) in 82 ± 60 ns in our simulations. Note that while 7 runs were used for this average, 2 of the 7 Fs peptide runs did not fold within 160 ns. Since these runs were included in the average as folding in 160 ns, the average is somewhat lower than it should be. However, since the experimental rate is 1/160 ns, one would expect that on average 4.4/7 [63% = 1 − exp(−kt)] of the runs would fold within 160 ns, whereas we observed 5/7, which is well within the experimental bounds.11
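The comparison with experiment in this paragraph rests on single-exponential kinetics and can be reproduced in a few lines. The sketch below uses illustrative function names: it computes how many of 7 runs are expected to fold within 160 ns at the experimental rate, and inverts the same relation to obtain the time constant implied by 5 of 7 runs folding.

```python
import math

def fraction_folded(k, t):
    """Fraction of runs expected to fold by time t for single-exponential kinetics."""
    return 1.0 - math.exp(-k * t)

def time_constant_from_fraction(fraction, t):
    """Invert fraction = 1 - exp(-t / tau) to obtain the time constant tau."""
    return -t / math.log(1.0 - fraction)

t_max = 160.0               # ns, length of each run
k_exp = 1.0 / 160.0         # experimental rate in 1/ns
n_runs = 7

expected = n_runs * fraction_folded(k_exp, t_max)
tau_sim = time_constant_from_fraction(5 / 7, t_max)
print(f"expected folded runs: {expected:.1f} of {n_runs}")
print(f"time constant implied by 5 of 7 folding: {tau_sim:.1f} ns")
```

The implied time constant of about 128 ns corresponds to the 127 ns entry listed for the Fs peptide in Table I.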

Moreover, our simulations capture some finer detail about the nature of folding. We see fast early events, as found experimentally. The alanine-rich N-terminal part of the Fs peptide folded very quickly, in 15 ± 10 ns, consistent with the observation of N-terminal fluorophore quenching in 10 ns by Eaton and co-workers11 and the faster rate observed by Dyer and co-workers.10 The poly-A helix folded considerably faster (18 ± 8 ns, out of 8 runs) and typically had more helical content (17.8 residues vs 15.1 for the Fs peptide).

Apparently, the arginine residues are responsible for the differences in these folding rates by acting as blockers of helix nucleation and propagation. Looking at the formation of secondary structure vs. time (see Figure 2, right), we see that helical propagation halts at the arginine residues (R) and often the completion of helix formation requires additional nucleation events. While our eight poly-A helix runs did show stalling of propagation (Figure 2), these events were not localized to any particular point in the chain. Why does arginine limit propagation? We suggest that the long Arg side chain significantly limits its mobility, and moving into a helical φ/ψ orientation thus occurs much more slowly.

Nonbiological Helices

How generally applicable and accurate is the coupled simulation method? To address this question, we have applied this method to study the folding of a nonbiological helix, a 12-mer of polyphenylacetylene (PPA).12 This polymer can be considered to be a nonbiological analog of polyalanine, since it is a homopolymer with a simple side chain that folds into a helix.12 We have previously shown13 that this polymer folds to a helix on the tens of nanosecond time-scale, in accordance with previous experimental observations.14 We find that our ensemble dynamics method works well for PPA. The mean folding time and folding time distribution are consistent with brute force, traditional simulations of PPA.13 This is demonstrated by the agreement in mean folding times between the two methods and the similarity of the folding time distribution (see Figure 3 and Table I).

C-Terminal β-Hairpin of Protein G

α-Helices and β-hairpins together represent the most ubiquitous secondary structural elements in proteins. In a previous section, we discussed our simulation results for helices and now we concentrate on hairpins. We have recently reported a full-atom, implicit-solvent simulation of folding of the hairpin at a biologically relevant temperature,32 and here we briefly summarize those results. We have obtained a very large ensemble of conformations, which includes mostly partially folded structures, as well as eight complete, fully independent folding trajectories. These data sets allow us to determine the key trends characterizing the folding process and determine several average properties that have been measured or could, in principle, be measured experimentally.

Based on our results, we can estimate the folding rate of the hairpin in the following way: we have simulated 27 independent runs, each consisting of M = 100 clone simulations that, on average, completed approximately 14 ns of simulated time, bringing the total to approximately 38 μs of real time simulation. Out of 2700 simulations, we have detected eight complete folding events, which (if we assume single exponential folding kinetics) results in an estimated folding time of approximately 4.7 μs. This prediction is in excellent agreement with the experimentally measured time of 6 μs.16,32

Our results offer the following picture of the folding mechanism (Figure 4). Folding from a fully extended conformation begins with a rapid collapse to a more compact structure. During this time, various temporary hydrogen bonds form, condensing the peptide and decreasing the costly loop entropy that hampers the formation of the hydrophobic core. These temporary hydrogen bonds form and break; their pattern, which varies from run to run, has no resemblance to the final hydrogen-bonding pattern of the hairpin. Next, an interaction between the hydrophobic core residues is established. This is clearly the central event in the folding process and most probably its rate-limiting step. Note that at this point the core is still not fully formed: the initial hydrophobic interaction most often involves just two hydrophobic residues on the opposite sides of the future hairpin. Full formation of the core typically appears simultaneously with the establishment of final hydrogen bonds.

FIGURE 2 Folding simulation of α-helices. Shown above are trajectory data for simulations of the poly-A helix (left) and Fs peptide (right). Top: number of helical units vs time (dotted line) and energy variance vs time. We see that peaks are associated with nucleation events. Bottom: Secondary structure formation vs time: red, yellow, and blue denote helices, β-sheets, and turns respectively. In both cases, we see nucleation events (corresponding to energy variance peaks). However, in the case of the Fs peptide, nucleation events did not occur at the arginine residues (R) and propagation typically was blocked at these residues (also seen in the other seven runs we performed, data not shown). We estimated the time by multiplying the directly simulated time t by the number of processors M (M = 24 and M = 128 for the left and right trajectories, respectively).

This pathway was also suggested by several other simulation methods. Pande and Rokhsar reported the

FIGURE 3 Quantitative validation of our method. We plot the folding time distribution for a12-mer PPA helix calculated from the ensemble dynamics (gray) and traditional molecular dynam-ics (black) methods. We nd excellent agreement with both simulation and with experiment (whichnds a characteristic time of10 ns). We calculated the folding time as Mt, where t is the simulationtime of the individual trajectory and M 20 processors were used. The quantitative agreement hereshows that one can indeed achieve a linear speed-up using 20 processors for the 12-mer PPA foldingproblem. Inset: a folded PPA 12-mer.

Table I Summary of Predicted vs Experimentally Measured Folding Timesa

Protein/MoleculePredicted time

(ns)Experimental time

(ns) Experiment reference

Polyphenylacetylene (PPA) 5.3b 10 14Fs peptide [AcA5(A3RA)3ANH2] 127c 160 60 10, 11C-terminal -hairpin of protein G

(AcGEWTYDDATKTFTVT-ENH2) 4700 1700 6000 15, 16Villin headpiece 20,000d 11,111 37, 38

a We see a very strong correlation between our prediction and experiment. For a direct correlation, we nd R2 0.993, p value 0.008,and for a correlation of the log of these times, to match Figure 8, we nd R2 0.993, p value 0.000026.

b PPA folds with nonexponential behavior.These numbers report the fast time in a double exponential t.c If average the folding times of the runs that folded, we get 82 60 ns. However, if we include the data for the runs that did not fold,

we see that 5/7 folded in 160 ns; therefore using 5/7 1 exp(kt) leads to a time of 1/k 127 ns.d This number represents an estimate based on one folding event, and therefore has a large error and thus is likely reliable solely as an order

of magnitude prediction.

98 Pande et al.

results of high temperature unfolding and refolding ofthe -hairpin, in which a discrete unfolding pathwaywas recognized to include a hydrophobically stabi-lized intermediate (H state) with only the Val54side chain being released from the core and littlehydrogen bonding occurring.19 Karplus and co-work-ers used multicanonical Monte Carlo simulations tolook at folding of the hairpin with similar results.33Garcia and Sanbonmatsu35 and Berne and co-work-ers34 later veried the existence of this intermediatethrough a temperature-exchange Monte Carlo/molec-ular dynamics hybrid model of unfolding in which thethermodynamics of the unfolding events are well de-scribed.35 They note that these intermediates appearwith, on average, 2 fewer hydrogen bonds than thefolded hairpin. This H intermediate was then ob-served in mechanically driven unfolding simulationsperformed by Bryant et al.,36 who describe it as in-cluding a nearly assembled core and very little back-bone hydrogen bonding.
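The estimate in footnote c of Table I (five of seven Fs peptide runs folding within 160 ns) can be reproduced with a short calculation. The sketch below inverts the single exponential relation f = 1 − exp(−kt) for 1/k; the helper name is a hypothetical choice made for this illustration.

```python
import math


def folding_time_from_fraction(fraction_folded, t_ns):
    """Solve fraction_folded = 1 - exp(-k * t) for the folding time 1/k (ns)."""
    k = -math.log(1.0 - fraction_folded) / t_ns
    return 1.0 / k


# Footnote c of Table I: 5 of 7 runs folded within 160 ns.
tau_ns = folding_time_from_fraction(5 / 7, 160.0)
print(f"1/k = {tau_ns:.1f} ns")
```

The result lies between 127 and 128 ns, consistent with the 127 ns reported in the table.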

The picture of the folding process that emerges is, in essence, a blend of the hydrogen-bond-centric and the hydrophobic-core-centric views of hairpin folding: nonspecific hydrogen bonds are important in the initial stages of folding, but the key event that stabilizes the U-shaped precursor of the hairpin and guides the downstream folding process is the formation of a hydrophobic interaction between core residues. Final hydrogen bonds appear later, around the same time the full formation of the hydrophobic core occurs, and these continue to fluctuate even after folding is complete.

Villin Headpiece

We have also simulated the folding of a thermostable, fast folding,37 36-residue α-helical subdomain (PDB code 1VII) from the villin headpiece38,6 (the C-terminal domain of the much larger villin actin binding protein). Figure 5 details the nature of this folding trajectory. We start from a completely elongated structure and then see rapid relaxation into a random-walk unfolded state (U). Next (Figure 5a), the C-terminal helix forms very quickly (at the tenth generation, G10 ≈ 250 ns = M × t = 250 processors × 10 generations × 0.1 ns/generation; see Methods for details). This time is consistent with helical folding times found experimentally39 and in simulation.8 The protein then collapses, driven by the attraction of its hydrophobic groups. While many residues have native-like secondary structure (see Figure 5b), there is a large degree of non-native side-chain interaction, such as the contact of TRP24 and PHE36 with hydrophobic core residues, although they are solvent exposed in the native structure. This intermediate collapsed thermodynamic state (I) consists of an ensemble of many conformations with partial native secondary structure, but confounded by a lack of native side-chain packing.

The protein remains in this state for a very long time (the equivalent of 3.2 μs) until a thermal fluctuation occurs which breaks key non-native interactions that were preventing the formation of the hydrophobic core. Once these non-native contacts are broken, the protein rapidly folds to its native state (N). Looking at this transition in more detail (Figure 6), we see that in order to break non-native contacts (such as, but not limited to, the interaction between PHE11 and PHE36 in G200), the protein expands, breaking many contacts (G210), and then collapses into its native fold (G225), as identified by a root mean square deviation (RMSD) similar to that found by exploring the native state in our unfolding simulation (i.e., 3–4 Å; see below). This event occurs after the equivalent of 5.5 μs, which is within the time estimated experimentally (on the order of 10 μs).37 Since PHE36 forms non-native (misfolded) contacts in this intermediate state (as well as the intermediate found in the Kollman simulation6), we predict that removing this bulky hydrophobic side chain would likely increase the folding rate.
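The generation labels used here (G10, G200, G210, G225) map onto an equivalent simulated time through the product of generation number, processor count, and time per generation. A minimal sketch of that conversion, assuming the 250 processors and 0.1 ns per generation per processor quoted for the villin runs (the helper name is hypothetical):

```python
def generation_to_ns(generation, n_processors, ns_per_generation=0.1):
    """Equivalent simulated time (ns) reached at a given generation.

    Each generation advances every processor by ns_per_generation, and the
    ensemble dynamics estimate multiplies by the number of processors.
    """
    return generation * n_processors * ns_per_generation


for g in (10, 200, 210, 225):
    print(f"G{g}: {generation_to_ns(g, 250) / 1000:.2f} us")
```

Under these assumptions G10 corresponds to about 250 ns and G225 to roughly 5.6 μs, in line with the approximately 5.5 μs quoted for the folding event.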

We have performed four other coupled simulations that have also each reached the 5 μs time scale (data not shown). All of these trajectories have reached I (RMSD between 5 and 7 Å, radius of gyration Rg between 7 and 10 Å), but none have reached the N state. Statistically, this is not surprising and can be used to estimate the folding rate (see Methods, above): if the mean folding time for villin were 20 μs and it follows exponential kinetics, then one would expect that 20% of runs would fold in 5 μs, in agreement with our results.
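The expectation quoted above follows directly from exponential kinetics. The sketch below evaluates the fraction of runs expected to fold within a given time for an assumed mean folding time; the function name is hypothetical, and the 20 μs value is the one hypothesized in the text.

```python
import math


def expected_fraction_folded(t_us, mean_folding_time_us):
    """Fraction of independent runs expected to fold by time t for exponential kinetics."""
    return 1.0 - math.exp(-t_us / mean_folding_time_us)


p = expected_fraction_folded(5.0, 20.0)
n_runs = 5  # one folding run plus the four additional coupled simulations
print(f"expected fraction folded: {p:.2f}")
print(f"expected number folded out of {n_runs}: {n_runs * p:.1f}")
```

The fraction comes out near 0.22, so about one of five runs would be expected to fold, which is what was observed.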

We have also used Folding@Home to study the native state of villin, i.e., by starting simulations from the NMR structure.38 One use of such simulations is the determination of the variability of conformations in the native state ensemble. Moreover, since our method allows us to simulate events that would occur on the microsecond time scale, we should also be able to simulate villin unfolding under experimental conditions (e.g., 300 K). We see (Figure 7) that conformations within the native state typically have a 3–4 Å RMSD from the NMR structure. Thus, we identify our N state with the native state of this protein since our folding simulation reaches an ensemble of conformations with ~4 Å RMSD (our conformation from the folding run, which was most similar to the NMR structure, had a 3.3 Å RMSD). Moreover, our native state simulation was run long enough to explore unfolding to the intermediate state: the protein did not completely unfold during the simulation time scale (1 μs). However, a transition to the partially folded intermediate (I) was detected.

Finally, we compare our results to what one might expect from protein folding theory. One of our primary results is that folding appears to proceed through transitions between free energy minima: starting in an unfolded state (U) to an intermediate (I) and then to the native state (N) (e.g., see reviews Refs. 13 and 40, and references therein). As previously discussed,13 the collapse to the I state appears to be driven by hydrophobic interactions. However, considering that villin is one of the fastest folding proteins currently known, it is interesting to consider that many of these interactions were non-native.


While there is little structure in U, the intermediate I is collapsed, with some native tertiary structure, much non-native structure, and little side-chain packing. Thus, I is very much like the molten globule intermediates found in other proteins.41 Also, the intermediate state found by Duan and Kollman6 fits our I state, since it is collapsed, partly native, but missing certain native contacts, and satisfies our I state definition in terms of the RMSD and radius of gyration. We see a somewhat cooperative U → I transition (e.g., reflected in free energy barriers in Figure 7a) and a very cooperative I → N transition, as predicted previously.42,43 Indeed, the cooperativity of the I → N transition appears to result from side chain packing in our model, since many non-native contacts must collectively break to allow the formation of native side-chain packing. It seems highly unlikely that the native structure could be reached so rapidly through piecewise movements without this collective event.

It is clear that any potential set employed to model atomic interactions will have its limitations. The relevant question to ask is: How good do they need to be, and what would result from errors in these potentials? Since we do see folding to the native state, it appears that the potentials we used were sufficient in this case. However, we cannot rule out the possibility that in our model, the I state is comparably (or more) stable than the N state. This could be the result of slight errors in the potentials.44 Much like adding denaturant in a physical example, adding errors to a potential reduces the energetic favorability of the N state and thus makes the I state relatively more favorable due to its entropic advantage.44

DISCUSSION

Scalability of the Algorithm

As more computers become accessible to distributed computing methods, it is important to understand the limits of the scalability of the method, i.e., the limits to the number of processors one can use to achieve a speed increase. While this method can yield significant speed improvements for simulating complex systems (and scalability considerably beyond traditional parallel MD), there are some important limitations to its scalability we must consider. For example, simulating a process where the mean time to fold t_fold = 100 ns using M = 10^6 processors will not necessarily mean that one will achieve folding events using only t_fold/M = 100 fs trajectories. The scalability will be inherently limited by the barrier crossing time t_cross (i.e., the time spent actually crossing the barrier, not including the much longer time spent waiting in the free energy minima, which dominates the folding time t_fold). Since the speed increase from our method is due to the elimination of the waiting time, we expect that using M > t_fold/t_cross processors will not give any additional speed increase.8,9 Thus, the bounds of scalability for this method are also related to an interesting physical question: How much time is required to actually cross the free energy barrier? This time can be quantified by using our method to look for the limits of scalability within our technique.
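The scalability bound described above can be illustrated with a simple floor model: each trajectory needs t_fold/M of simulated time, but never less than the barrier crossing time t_cross. This is only a sketch of the argument; the function names and the value t_cross = 1 ns are assumptions made for illustration, not measured quantities.

```python
def max_useful_processors(t_fold_ns, t_cross_ns):
    """Processor count beyond which no further speed-up is expected."""
    return t_fold_ns / t_cross_ns


def required_trajectory_length_ns(t_fold_ns, t_cross_ns, n_processors):
    """Simulated time needed per trajectory: t_fold / M, floored at t_cross."""
    return max(t_fold_ns / n_processors, t_cross_ns)


t_fold, t_cross = 100.0, 1.0  # t_cross = 1 ns is an assumed value
print("useful processors:", max_useful_processors(t_fold, t_cross))
for m in (10, 100, 10**6):
    length = required_trajectory_length_ns(t_fold, t_cross, m)
    print(f"M = {m}: each trajectory needs about {length} ns")
```

In this model, going from 100 to a million processors does not shorten the required trajectory length, which stays pinned at t_cross rather than dropping to 100 fs.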

For the proteins we have examined, it is likely that this time is on the hundreds of picoseconds to nanosecond time scale. It is interesting to consider how this minimum time varies with the folding time. Since there need not be any correlation between these times, it is possible that slower folding proteins (e.g., those which fold on the millisecond and longer time scale) could be folded using our method with current microprocessors by simply employing more of them. Indeed, computational resources on the million-processor scale have been proposed, such as IBM's Blue Gene, as have other distributed computing projects. With such computational resources, it is possible that we could push our simulations from the hundreds of microsecond timescale to fractions of a second, allowing us to reach timescales relevant for slow folders.

FIGURE 4 A detailed analysis of a folding trajectory of the β-hairpin from the C-terminal segment of protein G. (a) Cartoon representation of the folding trajectory; the backbone of the peptide is represented as a gray trace; the core hydrophobic residues (Trp43, Tyr45, Phe52, Val54) are shown in dot representation; (b) RMSD from the 1GB1 structure of the hairpin (residues 43–54), radius of gyration, and the number of backbone–backbone hydrogen bonds; (c) distance between key hydrogen bonding partners (green: Trp43–Val54; red: Tyr45–Phe52), and the minimum distance between Trp43 and Phe52 (black). Note that the minimum distance between Trp43 and Phe52 reaches its final value before the key hydrogen bonds are established; (d) solvation energy (E_solvation), charge–charge energy (E_charge), and total potential energy vs time. The initial hydrophobic collapse of the unfolded peptide correlates with a sharp decrease in E_total, while the attainment of the final structure correlates with E_total reaching its final value. A significant deviation around G160 of E_charge and E_solvation from their final value is correlated with the temporary breaking of the key Tyr45–Phe52 hydrogen bond; (e) a concise summary of the key events along the folding trajectory (color code: yellow = high; violet = low). HB-ij denotes the distance between the hydrogen bonding partners i and j; min-kl denotes the minimum distance between residues k and l. Note that the establishment of the hydrophobic Trp43–Phe52 interaction is the earliest event of significance along the trajectory. Time is reported in the number of generations: roughly, 1 generation corresponds to 100 processors × 0.1 ns/generation/processor = 10 ns/generation.


Limitations of Our Methods to Predict Folding Rates and Mechanisms

Below, we summarize the approximations involved in our rate determination method above, and our justifications and reasoning for these approximations. First, we assume that the barrier crossing probability density is exponential. This does not mean that the total kinetics is single exponential, but that the time to cross an individual barrier is exponential. Second, we assume that transitions are correctly identified, i.e., that there are no false positives or false negatives. This second assumption is of greater concern. While we cannot know for certain that we are correctly identifying transitions, our ability to predict rates suggests that incorrect transition prediction is not an issue. (Perhaps this may be due to the fact that transitions were rarely detected in the larger proteins studied and were typically found at the beginning of the folding trajectories; see the section "Is Transition Detection Really Necessary?" below.) However, we can mathematically address the consequences of incorrect transition detection and nonexponential kinetics, as we have done in a previous work.8 Finally, in our error analysis above, we can calculate the statistical uncertainty of our rates. Even with only tens of successful folding trajectories, the statistical uncertainty is negligible.
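As a hedged illustration of how the statistical uncertainty scales with the number of folding events, the sketch below assumes Poisson counting statistics, so that the relative error of the folding time is about 1/√n for n observed events. This assumption and the helper name are ours and may differ from the error analysis actually used.

```python
import math


def folding_time_with_error(total_simulated_ns, n_events):
    """Folding time estimate and an approximate 1-sigma error.

    Assumes Poisson counting statistics for the number of folding events,
    which gives a relative error of 1 / sqrt(n_events).
    """
    tau = total_simulated_ns / n_events
    return tau, tau / math.sqrt(n_events)


# Hairpin numbers from the text: 27 runs x 100 clones x ~14 ns, 8 events.
tau, err = folding_time_with_error(27 * 100 * 14.0, 8)
print(f"hairpin: {tau:.0f} +/- {err:.0f} ns")

for n in (10, 30, 100):
    print(f"n = {n}: relative error about {1 / math.sqrt(n):.0%}")
```

For the eight hairpin events this gives an error of roughly 1700 ns, comparable to the uncertainty listed for the hairpin in Table I. With tens of events the relative error falls to the 20–30% range, which is small compared with the four orders of magnitude spanned by the folding times considered here.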

Accuracy of Implicit Solvent Models for Protein Folding

For all of the folding simulations presented here, we have used the GB/SA method.27 While GB/SA makes connections to physical arguments about the nature of interactions via internal vs external dielectrics, it is an empirical theory. Nevertheless, GB/SA performs very well at predicting the solvation free energy of small molecules,27 and it is perhaps not surprising that it appears to be sufficiently accurate in the prediction of the folding rate of small proteins and peptides. Moreover, Caflisch and co-workers have also had successful results using even simpler implicit solvation models (distance-dependent dielectric with a surface area term following Still et al.27) in folding simulations.31,45 However, it is unclear whether similar accuracy (in either rate prediction or even in reaching the native state) would be achieved using implicit solvation models for folding simulations of larger proteins.

FIGURE 5 Anatomy of a folding trajectory of the villin headpiece. (a) Significant representative conformations along the trajectory. The protein is visualized as a backbone trace with the aromatic residues (PHE7, PHE11, PHE18, TRP24, PHE36) space filled and colored gray, red, cyan, yellow, and blue, respectively. (b) Secondary structure from DSSP51 (black = helix, gray = turn, white = no structure). (c) Native contact density for each residue (blue = low, red = middle, yellow = high). (d) Radius of gyration and RMSD from the native state (the native state is defined from the average of a 10 ns traditional MD simulation at 300 K starting from the NMR structure38) are plotted; we use only α-carbons in this calculation and omit the first and last 2 residues in the RMS calculation (as they are unstructured in the refined NMR conformation). (e) Solvation free energy (F_solvation), charge/charge energy (E_charge), and total internal free energy (F_total) vs time. While F_total gradually decreases over the whole simulation, we see that F_solvation has an initial decrease, but then gradually increases over the simulation, whereas E_charge consistently decreases. In fact, E_charge and F_solvation are highly correlated (R² = 0.92) during this trajectory. (f) Fraction of all and native contacts vs time. In all frames, time is on the horizontal axis. It is most natural to report time in terms of 100 ps generations (see Figure 1); roughly,8 one can approximate time as 250 processors × 0.1 ns/generation/processor = 25 ns/generation. We label conformations by their generation (e.g., G225 in the upper right).

FIGURE 6 Examination of the I to N transition of the villin headpiece in detail. We see that in order to correctly fold, the protein must first unfold and open its conformation in order for it to form the missing native state interactions. Visualization is the same as in Figure 2a. See text for more details. The final state agrees reasonably well with the average refined NMR conformation from the Protein Data Bank.38

Second, we stress that when employing any implicit solvent model for the faithful reproduction of kinetics, one must take into account the viscosity of the solvent (in addition to the dielectric and hydrophobic aspects). We have done so using Allen's stochastic integrator, as implemented in Tinker.25 This scheme is an extension to Langevin dynamics, and includes both viscous drag and random forces in the force equation, to match the viscosity of the solvent and the random thermal fluctuations that the solvent would apply to the solute. However, unlike pure Langevin dynamics, this method scales the drag and the random force by the solvent-exposed area in order to apply these solvation effects only to atoms that are actually solvent exposed. Often, implicit solvation is run without any viscosity model (or with viscosity considerably lower than that of water,45 i.e., the viscosity parameter γ < 90/ps), which leads to differences in sampling and cannot lead to accurate rate predictions. This is, in some cases, considered to be an advantage of implicit solvation: one would expect that the speed of dynamics is inversely proportional to viscosity. However, since the magnitude of the random forces is also proportional to the viscosity, decreasing the viscosity diminishes the strength of these random forces. Counterintuitively, this may actually decrease the sampling, as it is these very random forces that enable the system to cross free energy barriers.18 Since our goal is the faithful reproduction of folding kinetics, we have chosen a viscosity damping parameter in order to match that of water.46
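To make the dual role of the friction coefficient concrete, the following is a minimal sketch of a Langevin-type velocity update in which both the drag and the random kick are scaled by a per-atom solvent exposure factor between 0 and 1. It is not Allen's integrator as implemented in Tinker: it is a generic Ornstein–Uhlenbeck update, and the linear exposure scaling, the function name, and the use of reduced units (kT in units of mass × velocity²) are assumptions made for illustration.

```python
import math
import random


def langevin_velocity_update(v, force, mass, gamma, exposure, kT, dt, rng):
    """One velocity update with drag and noise scaled by solvent exposure.

    gamma    : friction coefficient (1/time)
    exposure : assumed solvent-exposed fraction of the atom, 0 (buried) to 1
    kT       : thermal energy in units consistent with mass * velocity**2
    """
    gamma_eff = gamma * exposure            # buried atoms feel no solvent drag
    v = v + (force / mass) * dt             # deterministic force kick
    decay = math.exp(-gamma_eff * dt)       # viscous damping of the velocity
    # Fluctuation-dissipation: the noise amplitude grows with the friction.
    sigma = math.sqrt(kT / mass * (1.0 - decay * decay))
    return v * decay + sigma * rng.gauss(0.0, 1.0)


rng = random.Random(0)
for gamma in (1.0, 90.0):
    decay = math.exp(-gamma * 0.001)
    kick = math.sqrt(1.0 - decay * decay)
    print(f"gamma = {gamma}/ps: velocity kept = {decay:.3f}, noise amplitude = {kick:.3f}")
```

Lowering gamma in this update reduces the drag and the random kicks together, which is the coupling the paragraph above points to.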

Third, it is interesting to consider the possible differences between implicit and explicit solvation models. While implicit solvation models can capture many important properties of the solvent, such as the dielectric effect, hydrophobicity, and viscosity, there are effects that are missing. In particular, any physical effect that arises from the discrete nature of water molecules, such as proteins hydrogen bonding to water, solvent-separated minima, or the drying effect, will be lost. Again, it is important to keep in mind that all models are approximations, and the relevant question is not whether a model is "correct" (since all models are incorrect at some level), but whether a given model is "correct enough" to capture the relevant physics to faithfully reproduce and predict the physical effect of interest. For the small proteins we have examined, it appears that the model we have employed is indeed "correct enough" for predicting rates (see Figure 8 and Table I). This implies that either the discrete nature of water is not relevant for folding, that folding rates are fairly robust to such inaccuracies of the model, or that there is a convenient cancellation of errors. In order to discern between these possibilities, one must resimulate these proteins with explicit solvation models and compare the rate and mechanistic predictions; if these predictions agree, then perhaps the potential gain in accuracy of explicit solvent models would indeed not be relevant for folding kinetics. Furthermore, we stress that explicit solvent models make approximations as well,47 and there is no reason why an arbitrary explicit solvation model would necessarily be better than a well-designed implicit model.

FIGURE 7 Rough characterization of the underlying free energy landscape for the villin headpiece. We plot the log of the probability of finding conformations with a given Rg and RMSD in (a) folding and (b) unfolding simulations. We find three distinct probability maxima (which correspond to free energy minima): an unfolded state, molten-globule-like intermediate, and the native state. This landscape generated from kinetic data qualitatively agrees with previous, more extensive thermodynamic calculations.2

Finally, it is important to consider that the question of the validity of implicit solvation models goes beyond a simple debate of the validity of particular computational methodology, but also impacts the way in which one thinks of protein structure in general. If explicit solvation were critical to protein folding, then it is likely that one should not think of protein structures without the requisite cloud of water molecules they interact with, as it is the very discrete and potentially structural aspects of the water that play a large role in folding. However, if implicit solvent models are sufficiently accurate, this suggests that a structural picture of a protein alone (implicitly considering the effects of water, such as hydrophobicity, etc., but not with a discrete, structural form in mind) is indeed sufficient.

Alternative Methods to Simulate Dynamical Events on Long Time Scales Using Low Viscosity Simulation

Water is a relatively viscous solvent. Indeed, in quantitative terms, the damping coefficient of water is on the order of γ ≈ 100/ps. It is intriguing to consider whether one can simulate the effective result of long time-scale events by simulating the effect of much lower viscosity solvents, say γ ≈ 1/ps, while keeping all of the other properties of a water-implicit solvation model unchanged. This is appealing since this is trivial to perform with implicit solvation models, and this ability to explicitly set the viscosity of the solvent in the model may represent one of the great strengths of using implicit solvent models.

FIGURE 8 Comparison of theoretical rate predictions from Folding@Home and the corresponding experimental folding rate determinations. We compare the folding rates for the proteins and polymers described in this review: PPA, polyalanine-based helices, the C-terminal β-hairpin from protein G, and the villin headpiece. If our folding rate prediction were perfect, all points would lie on the diagonal line. The agreement strongly suggests that our method can accurately predict the absolute folding rate for small proteins, peptides, and foldamers.


If one were to decrease the viscosity by 100 times, could one simulate 10 ns and expect to get 1000 ns = 1 μs of sampling? This question has been addressed in many models and systems. For example, Klimov and Thirumalai48 have shown that (for a coarse-grained protein model) the rate of folding increases with decreasing viscosity down to a point (γ ≈ 1/ps), below which the rate decreases with decreasing viscosity. This nonmonotonic dependence can be understood in terms of the dual role of the solvent viscosity: viscosity retards motion through the solvent, but also creates the random forces that are needed to drive the system over energy and free energy barriers. Thus, the rate should be optimal at intermediate viscosity. If the rate vs viscosity curve does peak at γ ≈ 1/ps, then one should expect an increase in sampling at viscosities between 1/ps and 100/ps, and for thermodynamic properties, this increased sampling should be beneficial. However, it is still unclear whether kinetic properties would be unchanged by significant changes in the viscosity.

Alternative Methods to Simulate Dynamical Events on Long Time Scales Using Large-Scale Distributed Computing

In this review, we have discussed protein folding simulations using ensemble dynamics, our parallel replica-like method intended to handle free energy barrier crossing problems. The greatest weakness of this method rests in the need for transitions and for the accurate identification of these transitions. Our suggested means to identify transitions, looking for energy variance spikes during dynamics, has the benefit of being a purely thermodynamic method and thus does not use any information of the protein native state or any folding-related hypothesized reaction coordinates. However, if transitions are incorrectly identified, the validity of the resulting data is put under question. Considering that great computational resources are needed to generate the folding simulations presented here, this limitation could be very expensive computationally: the failure to accurately identify transitions may mean that the resulting data set is invalid.
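The idea of flagging transitions from energy variance spikes can be sketched as a sliding-window variance compared against a baseline. The window length, the threshold factor, and the use of the median variance as the baseline are hypothetical choices made for this illustration; the criterion actually used in the simulations is not reproduced here.

```python
from statistics import median, pvariance


def energy_variance_spikes(energies, window=5, threshold=5.0):
    """Return start indices of windows whose energy variance is a spike.

    A window counts as a spike when its variance exceeds `threshold` times
    the median windowed variance of the whole trace.
    """
    variances = [
        pvariance(energies[i:i + window])
        for i in range(len(energies) - window + 1)
    ]
    baseline = median(variances)
    if baseline <= 0:
        return []
    return [i for i, v in enumerate(variances) if v > threshold * baseline]


# Synthetic trace: small fluctuations, then a sudden drop to a lower basin.
trace = [0.0, 1.0] * 10 + [-20.0, -19.0] * 10
print(energy_variance_spikes(trace))
```

On this synthetic trace only the windows that straddle the drop are flagged. A real energy trace is noisier, so the window and threshold would need to be tuned, and false positives or negatives remain possible, as discussed above.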

It is interesting to consider a simpler method, which does not have the liabilities described above. Namely, instead of loosely coupling simulations (i.e., restarting simulations after transitions have been detected), one could simply run a large number M of completely independent simulations. For single exponential kinetics, we would still gain an M times speed-up (as described above). However, even if the reaction under study did not have single exponential kinetics, independent trajectories might still have value. Indeed, in a sense, a set of thousands of simulations each on the tens of nanosecond time scale is a data set that stands on its own. For example, one could interpret the results for single-exponential kinetics by examining the fraction f(t) that fold in time t and fitting a rate to the slope. However, one would not be limited to this exponential kinetics analysis, and these data could be reanalyzed a posteriori to test new hypotheses or kinetic models. Considering the great computational cost of producing these data sets, this more "pure" method for simulating kinetics has a great appeal. Indeed, we have reexamined the folding of villin with uncoupled trajectories49 in this manner (B. Zagrovic et al., J Mol Biol, 2002, in press) and it will be interesting to determine how the uncoupled simulations differ (e.g., in rates and mechanism) from those presented here.
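The slope-based analysis of independent trajectories can be sketched as follows. For times much shorter than the folding time, f(t) = 1 − exp(−kt) is approximately kt, so a least-squares line through the origin gives the rate. The function names and the synthetic first-passage times are assumptions made for illustration only.

```python
def fraction_folded_curve(first_passage_times_ns, n_trajectories, time_grid_ns):
    """Fraction of all trajectories that have folded by each time on the grid."""
    return [
        sum(1 for t in first_passage_times_ns if t <= tg) / n_trajectories
        for tg in time_grid_ns
    ]


def rate_from_slope(time_grid_ns, fractions):
    """Least-squares slope through the origin of f(t) vs t (valid while f << 1)."""
    numerator = sum(t * f for t, f in zip(time_grid_ns, fractions))
    denominator = sum(t * t for t in time_grid_ns)
    return numerator / denominator


# Synthetic example: 1000 independent 10 ns trajectories, 5 of which fold.
folding_times = [2.0, 4.0, 6.0, 8.0, 10.0]
grid = [2.0, 4.0, 6.0, 8.0, 10.0]
f = fraction_folded_curve(folding_times, 1000, grid)
k = rate_from_slope(grid, f)
print(f"rate = {k:.4f} per ns, folding time = {1 / k:.0f} ns")
```

Because the per-trajectory folding times are stored, the same data could later be fit to a different kinetic model without rerunning any simulation.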

Is Transition Detection Really Necessary?

The discussion above regarding the possibility of using independent trajectories and still gaining a speed increase linear with the number of processors raises the question: Must one bother with transition detection as used here? Another way to examine this question is to ask with what frequency transitions were detected in the examples described here. For the helix folding simulations, 2 or 3 transitions were detected before the simulation reached the folded state. The first transition accompanied the first formation of helical structure and the other transitions occurred during propagation. For the larger molecules, transitions were even more infrequent. For example, the β-hairpin and villin simulations typically had a single transition that occurred early in the folding process, accompanying the collapse of the protein chain.

Thus, we find that for the larger molecules, transition detection was likely not necessary, since the transitions occurred early and thus the simulations were essentially running independently (as suggested in the subsection above). We suggest that the transitions were not needed since these larger molecules fold with single exponential kinetics, and thus have a single rate-limiting step. The helices are potentially different: the rates of nucleation and propagation of helices in our model are not highly separated (e.g., see the trajectories in Figure 2) and thus transition detection may be needed in the helix case, but not for the β-hairpin or villin molecules.


CONCLUSIONS

Comparison to Experiment

With a wide range of molecules under study, from the nonbiological PPA helices to the 36-residue villin headpiece, we have simulated a set of molecules with a range of folding times spanning over four orders of magnitude, from nanoseconds to tens of microseconds. Since the primary means of comparison to experiment is the comparison of rates determined by simulation and experiment, we concentrate on our prediction of rates. Figure 8 shows a striking agreement between predicted and experimental rates (see Table I for details). Of course, with just four molecules simulated, it is unclear whether this agreement is simply fortuitous. In order to more fully address this question, we plan to simulate the folding kinetics of additional molecules, including larger and more slowly folding proteins. Indeed, more recent work on a small ββα fold (C. Snow et al., Nature, 2002, in press) and villin (B. Zagrovic et al., J Mol Biol, 2002, in press) also results in strong agreement with experimental rates.

With this quantitative agreement with experiment, it is also interesting to ask how our results reflect upon the quality of modern force fields. On the surface, one might conclude that our agreement with experiment is evidence that force fields are sufficiently accurate. We stress that the only question that can truly be addressed by our work is whether force fields are sufficiently accurate to reproduce experimental rates and structures. Ignoring for the moment the possibility that the agreement may be fortuitous, the agreement between our simulations and experiments suggests that force fields are sufficiently accurate to predict the folding rates of small proteins. Indeed, this accuracy can be quantified in terms of the strong correlation (R² = 0.996) and low p value (0.000026) of the logarithms of the predicted to experimental rates. However, this statement should definitely not be overgeneralized: it is unclear whether the analogous rate prediction for large protein folding would be similarly accurate or whether these results are fortuitous (such that the simulation of additional proteins would weaken the correlation). We are currently addressing this question by examining the folding of different and larger proteins.

What Have We Learned About the Protein Folding Mechanism?

The question of how proteins fold has been asked for decades, and remains a difficult problem due to the complexities and difficulties of computational and experimental methods. However, the methods presented here have allowed us to understand, for the first time, the folding mechanism for some small fast folding proteins, in atomistic detail with experimentally validated rates (Figure 8 and Table I). We have been able to discern the mechanism of a few particular proteins, but it is unclear whether we can expect these to generalize to larger and more complex proteins. An understanding of the mechanism of larger proteins will likely require further direct simulation. However, considering the diversity of mechanistic results found even in these small proteins, it seems reasonable to consider that there may not be a single, universal folding mechanism. Indeed, evolution may be mechanistically agnostic and may have selected proteins for function, without concern for folding mechanism. This could lead to a variety of protein folding mechanisms (even for sequences which fold to the same structure), and thus there may not be a single answer to the question of how proteins fold.

Future Perspectives

The ensemble dynamics technique coupled with distributed computing has allowed us to break fundamental computational barriers in the dynamics of complex systems, such as protein and polymer folding. However, one need not build a distributed computing infrastructure to gain the benefits of our methods. Indeed, with a cluster accessible to almost any group (e.g., a hundred PCs), one can simulate 100 ns in a day (assuming 1 ns per processor per day). This is a significant advance over the state of the art of traditional parallel molecular dynamics.6 Of course, the combination of our method on top of traditional parallel MD (i.e., using traditional MD to speed up individual simulations to the maximum scalability of parallel MD and then using our method to statistically sample runs) may lead to the greatest advance, especially on massively parallel architectures with millions of processors, such as IBM's proposed million-processor Blue Gene supercomputer.
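
The arithmetic behind this estimate, and the way many short runs can be turned into a rate, can be sketched as follows. This is an illustrative sketch only: it assumes single-exponential folding kinetics, so that the fraction of trajectories folding within a time t is approximately k·t when k·t is small. The function names and the numerical inputs in the usage lines are hypothetical.

```python
import math


def aggregate_sampling_ns(n_processors, ns_per_processor_per_day, days=1.0):
    """Total simulated time (ns) accumulated across independent processors."""
    return n_processors * ns_per_processor_per_day * days


def estimate_rate(n_folded, n_trajectories, traj_length_ns):
    """Rate estimate k ~ N_folded / (M * t); assumes exponential kinetics and k*t << 1."""
    return n_folded / (n_trajectories * traj_length_ns)


def prob_at_least_one_event(rate_per_ns, n_trajectories, traj_length_ns):
    """Probability that at least one of M independent trajectories folds within t."""
    return 1.0 - math.exp(-rate_per_ns * n_trajectories * traj_length_ns)


# The cluster example from the text: 100 PCs at 1 ns per processor per day.
total_ns = aggregate_sampling_ns(100, 1.0)  # 100.0 ns of aggregate sampling per day

# Hypothetical numbers: 5 folding events seen in 1000 trajectories of 50 ns each.
k_est = estimate_rate(5, 1000, 50.0)  # rate in 1/ns
```

Under these assumptions the aggregate simulated time M·t, rather than the length of any single trajectory, controls how many folding events are expected, which is why a modest cluster of independent processors can be useful.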

Moreover, this technique should have broad applicability to any dynamical system that progresses by crossing free energy barriers, especially in the most intractable problems with high free energy barriers. It could also serve to augment existing computational methods, such as path sampling20 (which requires simulating a fast trajectory over the relevant free energy barriers) or the determination of transition states using pfold analysis50 (which is currently hindered by simulations dwelling in transiently stable intermediate states).


Finally, with the ability to reach time scales for protein folding in all-atom simulations (i.e., hundreds of microseconds), it is natural to ask whether the potential sets are adequate for folding. Indeed, due to the great number of calculations involved, distributed computing networks will most likely play an important part in providing sufficient computational power to extensively test and validate new potential sets. Considering the omnipresent role of force fields in structural biology, ranging from simulations, to informatics, to x-ray and NMR refinement, the ability to quantitatively test force fields will likely play a critically important role in structural biology and virtually all related fields.

It is our honor to have this work be part of the memorial issue for Peter Kollman. Peter was a great inspiration as a scientist and a senior colleague, generous with time and with praise and with numerous useful and encouraging suggestions. Indeed, in many ways our work on folding dynamics was inspired by his work with Yong Duan on the villin headpiece, as it opened our eyes to see just how close the field was to simulating time scales relevant for folding and thus to finally directly simulating protein folding.

We would also like to thank Kevin Plaxco for a critical reading of the manuscript, Robert Baldwin and Susan Marqusee for their comments about α-helices, Martin Gruebele and Jeff Moore for their comments about PPA folding kinetics, Dan Raleigh and collaborators for their unpublished results on the experimental folding time for the 36-residue villin fragment, Jay Ponder for allowing our use of the Tinker MD code in the Folding@Home client, and Bill Swope and Jed Pitera for their advice with modifying and extending Tinker. We would also like to thank Scott Griffin and the other members of the Intel distributed computing team for their help with the Folding@Home infrastructure and support of our work.

Much of this method was developed, tested, originally implemented, and run on the T3E at the National Energy Research Scientific Computing (NERSC) center at Lawrence Berkeley National Labs. We also thank the Advanced Biomedical Computing Center, National Cancer Institute (NCI), Frederick, Maryland, for its assistance and use of its SP3 and SV1. The helix and PPA production calculations were performed at NERSC and NCI. The other production calculations (beta hairpin and villin) were run on Folding@Home. We would especially like to thank the tens of thousands of Folding@Home contributors, without whom this work would not be possible (a complete list of contributors can be found at http://Folding.Stanford.edu).

JC and EJS acknowledge support in the form of Computational Science Graduate Fellowships (DOE). SL is a James Clark fellow and acknowledges the support of a Stanford Graduate Fellowship. MS acknowledges the support of a Fannie and John Hertz fellowship and a Stanford Graduate Fellowship. YMR acknowledges the support of a Stanford Graduate Fellowship. CS and BZ each acknowledge support from an HHMI predoctoral fellowship. This work was supported by grants from the ACS PRF (36028-AC4), NSF MRSEC CPIMA (DMR-9808677), NIH BISTI (IP20 GM64782-01), ARO (41778-LS-RIP), and Stanford University (Internet 2), as well as by gifts from the Intel and Google corporations.

REFERENCES

1. Dill, K. A.; Chan, H. S. Nat Struct Biol 1997, 4, 10–19.
2. Brooks, C. L.; Gruebele, M.; Onuchic, J. N.; Wolynes, P. G. Proc Natl Acad Sci USA 1998, 95, 11037–11038.
3. Dobson, C. M.; Sali, A.; Karplus, M. Angew Chem Int Edit Engl 1998, 37, 868–893.
4. Prusiner, S. Proc Natl Acad Sci USA 1998, 95, 13363–13383.
5. Nelson, J. C.; Saven, J. G.; Moore, J. S.; Wolynes, P. G. Science 1997, 277, 1793–1796.
6. Duan, Y.; Kollman, P. A. Science 1998, 282, 740–744.
7. Shirts, M. S.; Pande, V. S. Science 2000, 290, 1903–1904.
8. Shirts, M. S.; Pande, V. S. Phys Rev Lett 2000.
9. Voter, A. F. Phys Rev B 1998, 57, 13985–13988.
10. Williams, S.; et al. Biochemistry 1996, 35, 691–697.
11. Thompson, P.; Eaton, W.; Hofrichter, J. Biochemistry 1997, 36, 9200–9210.
12. Nelson, J. C.; Saven, J. G.; Moore, J. S.; Wolynes, P. G. Science 1997, 277, 1793–1796.
13. Elmer, S.; Pande, V. S. J Phys Chem B 2001, 105, 482–485.
14. Yang, W. Y.; Prince, R. B.; Sabelko, J.; Moore, J. S.; Gruebele, M. J Am Chem Soc 2000, 122, 3248–3249.
15. Blanco, F. J.; Serrano, L. Eur J Biochem 1995, 230, 634–649.
16. Munoz, V.; Thompson, P. A.; Hofrichter, J.; Eaton, W. A. Nature 1997, 390, 196–199.
17. Bryngelson, J. D.; Onuchic, J. N.; Socci, N. D.; Wolynes, P. G. Proteins Struct Funct Genet 1995, 21, 167–195.
18. Chandler, D. J Chem Phys 1978, 68, 2959–2970.
19. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 9062–9067.
20. Dellago, C.; Bolhuis, P. G.; Csajka, F. S.; Chandler, D. J Chem Phys 1998, 108, 1964–1977.
21. Doniach, S.; Eastman, P. A. Curr Opin Struct Biol 1999, 9, 157–163.
22. Elber, R. Curr Opin Struct Biol 1996, 6, 232–235.
23. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 1273–1278.
24. Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Rokhsar, D. S. Curr Opin Struct Biol 1998, 8, 68–79.
25. Pappu, R. V.; Hart, R. K.; Ponder, J. W. J Phys Chem B 1998, 102, 9725–9742.
26. Jorgensen, W. L.; Tirado-Rives, J. J Am Chem Soc 1988, 110, 1666–1671.

27. Qiu, D.; Shenkin, P. S.; Hollinger, F. P.; Still, W. C. J Phys Chem A 1997, 101, 3005–3014.
28. Andersen, H. C. J Comput Phys 1983, 52, 24–34.
29. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 1273–1278.
30. Young, W. S.; Brooks, C. J Mol Biol 1996, 96, 560–572.
31. Ferrara, P.; Apostolakis, J.; Caflisch, A. J Phys Chem B 2000, 104, 5000–5010.
32. Zagrovic, B.; Sorin, E. J.; Pande, V. J Mol Biol 2001, 313, 151–169.
33. Dinner, A. R.; Lazaridis, T.; Karplus, M. Proc Natl Acad Sci USA 1999, 96, 9068–9073.
34. Zhou, R.; Berne, B.; Germain, R. Proc Natl Acad Sci USA 2001, 98, 14931–14936.
35. Garcia, A. E.; Sanbonmatsu, K. Y. Proteins 2001, 42, 345–354.
36. Bryant, Z.; Pande, V. S.; Rokhsar, D. S. Biophys J 2000, 78, 584–589.
37. Raleigh, D.; et al. Private communication.
38. McKnight, C. J.; Matsudaira, P. T.; Kim, P. S. Nat Struct Biol 1997, 4, 180–184.
39. Thompson, P. A.; Eaton, W. A.; Hofrichter, J. Biochemistry 1997, 36, 9200–9210.
40. Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Rokhsar, D. S. Curr Opin Struct Biol 1998, 8, 68–79.
41. Ptitsyn, O. B. Adv Protein Chem 1995, 47, 83–229.
42. Finkelstein, A. V.; Shakhnovich, E. I. Biopolymers 1989, 28, 1667–1680.
43. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci USA 1998, 95, 1490–1494.
44. Pande, V. S.; Grosberg, A. Y.; Tanaka, T. J Chem Phys 1995, 103, 9482–9491.
45. Ferrara, P.; Caflisch, A. Proc Natl Acad Sci 2000, 97, 10780–10785.
46. Cramer, C. J.; Truhlar, D. G. Chem Rev 1999, 99, 2161–2200.
47. Jorgensen, W. L. J Am Chem Soc 1981, 103, 335.
48. Klimov, D.; Thirumalai, D. Phys Rev Lett 1997, 79, 317–320.
49. Zagrovic, B.; Pande, V. S. 2002, in preparation.
50. Du, R.; Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Shakhnovich, E. I. J Chem Phys 1998, 108, 334–350.
51. Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577–2637.

