


Prediction of Protein Configurational Entropy (Popcoen)

Martin Goethe,*,† Jan Gleixner,‡ Ignacio Fita,¶ and J. Miguel Rubi†

†Department of Condensed Matter Physics, University of Barcelona, Carrer Martí i Franquès 1, 08028 Barcelona, Spain
‡Faculty of Biosciences, Heidelberg University, Im Neuenheimer Feld 234, 69120 Heidelberg, Germany
¶Molecular Biology Institute of Barcelona (IBMB-CSIC, Maria de Maeztu Unit of Excellence), Carrer Baldiri Reixac 4-8, 08028 Barcelona, Spain

*S Supporting Information

ABSTRACT: A knowledge-based method for configurational entropy prediction of proteins is presented; this methodology is extremely fast, compared to previous approaches, because it does not involve any type of configurational sampling. Instead, the configurational entropy of a query fold is estimated by evaluating an artificial neural network, which was trained on molecular-dynamics simulations of ∼1000 proteins. The predicted entropy can be incorporated into a large class of protein software based on cost-function minimization/evaluation, in which configurational entropy is currently neglected for performance reasons. Software of this type is used for all major protein tasks such as structure prediction, protein design, NMR and X-ray refinement, docking, and mutation effect prediction. Integrating the predicted entropy can yield a significant accuracy increase, as we show exemplarily for native-state identification with the prominent protein software FoldX. The method has been termed Popcoen, for Prediction of Protein Configurational Entropy. An implementation is freely available at http://fmc.ub.edu/popcoen/.

■ INTRODUCTION

Predicting the protein structure from the amino acid (AA) sequence is a challenging task, especially for larger proteins without sequence homologues in the PDB database.1 For these cases, ab initio methods are employed to derive the protein structure from first principles. A major class of such software is based on some cost (or scoring) function Ĝ that relates a given structure (or conformation) of the protein to a number that is intended to take lower values for structures which are "closer" to the native state of the protein.2 In a nutshell, these programs construct various trial conformations and select one of them as the predicted native fold, where the entire process is guided by Ĝ.

These cost-function methods are based on a well-known statistical physics principle stating that the equilibrium state of a system is the one with the lowest free energy G.3 This guarantees that Ĝ exhibits its minimum for the native state if Ĝ represents a sufficiently accurate approximation to the free energy G of the protein in solvent. Following Lazaridis and Karplus,4 one can decompose G into three major contributions, namely, the average intramolecular energy, the average solvation free energy, and the configurational entropy (multiplied by (−T)). The first two contributions are usually considered within Ĝ. They are modeled as functions of the average atom coordinates, whereupon fluctuation-induced effects on the averages are ignored.5

In contrast, the third contribution, configurational entropy, is often neglected in Ĝ, mainly because this quantity is strongly dependent on information beyond the average protein structure. In detail, configurational entropy is given by

S_conf = −k_B ∫ dx ρ(x) log ρ(x)    (1)

where the integral is performed over the entire configurational space. Here, x denotes the set of protein-atom coordinates (in a body-fixed coordinate system), ρ(x) represents their joint probability distribution (in the presence of the solvent), and k_B is the Boltzmann constant.4 From eq 1, it follows that S_conf is crucially dependent on the spatial fluctuations of the protein atoms and their correlations. This information is not present in Ĝ, since this quantity is only a function of the average atom coordinates. Therefore, incorporating S_conf into Ĝ is cumbersome. Consequently, software tools based on a cost function Ĝ account for S_conf only rudimentarily6,7 or neglect it completely.8−13 This practice might be justified for some especially compact proteins for which configurational entropy plays a minor role, but it is highly problematic in general, since S_conf has a major impact on the native-state selection of most proteins.14−17 Other software does include more elaborate estimates of S_conf in its cost function Ĝ, where S_conf is obtained via sampling of the configurational space in the vicinity of the given structure.18,19 This, however, dramatically increases the runtime of such software and, consequently, limits its applicability, since the sampling process involves many evaluations of the energy function, instead of just one.
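To make the dimensionality obstacle of eq 1 concrete, the following sketch (not part of the original work) estimates the entropy integral with a histogram for a one-dimensional toy distribution. The same plug-in approach becomes infeasible in the (3N_atoms − 6)-dimensional configurational space of a protein, since the number of histogram bins grows exponentially with dimension.

```python
import numpy as np

def entropy_histogram(samples, bins=50):
    """Plug-in estimate of -∫ rho(x) log rho(x) dx for 1-D samples (units of k_B)."""
    counts, edges = np.histogram(samples, bins=bins, density=True)  # counts = density
    widths = np.diff(edges)
    p = counts * widths          # probability mass per bin
    nz = p > 0
    return -np.sum(p[nz] * np.log(counts[nz]))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)            # toy 1-D "configurational" coordinate
S_est = entropy_histogram(x)
S_exact = 0.5 * np.log(2 * np.pi * np.e)     # analytic entropy of a unit Gaussian
print(S_est, S_exact)                        # the two values nearly coincide in 1-D
```

In one dimension the histogram estimate converges quickly; already for a few coupled torsion angles the required sample size explodes, which is why the low-dimensional decomposition of eq 2 below is used instead.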

Received: October 26, 2017
Published: January 19, 2018

Article

pubs.acs.org/JCTC | Cite This: J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

© XXXX American Chemical Society | DOI: 10.1021/acs.jctc.7b01079 | J. Chem. Theory Comput. XXXX, XXX, XXX−XXX


Here, we suggest an alternative approach for incorporating configurational entropy into Ĝ, where the missing information about fluctuations and correlations is estimated with probabilistic rules, instead of by sampling. Let us illustrate this idea using an elementary example. It is well-known that an AA that is buried in the bulk of the protein usually exhibits weaker fluctuations than an AA on the surface of the protein,20 mainly due to strong steric constraints in the bulk. For this reason, surface AAs usually contribute more strongly to S_conf than buried AAs, and it might be reasonable to include an entropic term in Ĝ which favors proteins having a large surface. This term could easily be expressed in terms of the available average atom coordinates, and its incorporation would generally improve the accuracy of Ĝ.

In this work, we pursued a more elaborate approach along these lines, which allows one to exploit more-complex fluctuation and correlation patterns for improving Ĝ. From molecular-dynamics (MD) simulations of ∼1000 proteins, we measured an estimator of configurational entropy, as well as various features of the protein structure (such as solvent accessibility, local density, and hydrogen bonding). With these data, we trained an artificial neural network (NN) on predicting S_conf from the features, as illustrated in Figure 1 (top panel). The NN can now be used to construct a knowledge-based estimator for S_conf from a query structure via feature measurement and NN evaluation (see the bottom panel of Figure 1). This is extremely fast and allows one to incorporate S_conf into cost functions Ĝ of existing protein software without compromising its runtime.
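The workflow of Figure 1 can be caricatured with a generic regression setup. The sketch below uses scikit-learn's MLPRegressor on synthetic features and targets; the real Popcoen network, features, and training data are different. It merely illustrates the point made later in the Results: a nonlinear network can outperform a linear model on feature-to-entropy regression when the target depends nonlinearly on the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n_res, n_feat = 20_000, 10               # stand-ins for residues x structural features
X = rng.normal(size=(n_res, n_feat))
# Synthetic "partial entropy": nonlinear in one feature plus an interaction and noise.
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=n_res)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), early_stopping=True,
                   max_iter=500, random_state=0).fit(X_tr, y_tr)

r2_lin = r2_score(y_te, linear.predict(X_te))
r2_mlp = r2_score(y_te, mlp.predict(X_te))
print(r2_lin, r2_mlp)   # the network captures the interaction term that linear misses
```

The architecture, regularization, and hyper-parameter search used for Popcoen itself are described in the Methods section; this is only a minimal stand-in.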

■ RESULTS

We trained a NN with simulation data to predict the configurational entropy of proteins. First, we outline the underlying entropy approximation, the measurements, and the training process. Then, we report a test of the usefulness of the NN for native-state identification of proteins.

Entropy Approximation. Configurational entropy S_conf is the (3N_atoms − 6)-dimensional integral of eq 1 for a protein containing N_atoms atoms. Because of its high dimension, brute-force evaluation from simulation data is intractable, and approximations must be employed to obtain a feasible expression.21 Using well-known approaches22−24 (with slight adaptations), we obtained the decomposition

S_conf ≈ Σ_{i=1}^{N_res} S_i + constant    (2)

in terms of so-called "partial" entropies {S_i} (i = 1, ..., N_res), which can be related to the entropy contributions of the N_res residues (see the Methods section). The S_i values are functions of specific marginal entropies of torsion angles and mutual information terms that do not involve integrals of dimension larger than two, enabling their reliable measurement from simulations. It turns out that the decomposition given by eq 2 is extremely valuable for entropy predictions, because local quantities are considerably easier to predict than global ones. Note that a similar decomposition property does not hold for eq 1 itself.25
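The flavor of such a low-dimensional construction can be sketched as follows: a per-residue entropy term built from one-dimensional marginal entropies of torsion angles minus a pairwise mutual-information correction, so that no integral of dimension larger than two appears. This is a generic mutual-information-expansion-style sketch on synthetic angles, not the exact formula of the Methods section.

```python
import numpy as np

def marginal_entropy(angles, bins=36):
    """1-D histogram entropy of a torsion angle in [-pi, pi), in units of k_B."""
    counts, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    p = counts / counts.sum()
    nz = p > 0
    width = 2 * np.pi / bins
    return -np.sum(p[nz] * np.log(p[nz])) + np.log(width)  # differential correction

def mutual_information(a, b, bins=36):
    """2-D histogram (plug-in) estimate of I(a; b) >= 0."""
    pab, _, _ = np.histogram2d(a, b, bins=bins,
                               range=[(-np.pi, np.pi), (-np.pi, np.pi)])
    pab = pab / pab.sum()
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return np.sum(pab[nz] * np.log(pab[nz] / np.outer(pa, pb)[nz]))

# Second-order estimate for one residue with two correlated torsions (phi, psi):
rng = np.random.default_rng(2)
phi = rng.vonmises(-2.0, 4.0, 100_000)
psi = 0.8 * phi + rng.vonmises(2.4, 8.0, 100_000)   # psi depends on phi
psi = np.mod(psi + np.pi, 2 * np.pi) - np.pi        # wrap back to [-pi, pi)
S_i = marginal_entropy(phi) + marginal_entropy(psi) - mutual_information(phi, psi)
print(S_i)   # sum of marginal entropies reduced by the pairwise correlation
```

Because only 1-D and 2-D histograms are involved, such terms can be measured reliably from simulation trajectories of realistic length, in contrast to the full integral of eq 1.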

Measurements from Simulations. We analyzed molecular dynamics (MD) simulations of 961 proteins that were available

Figure 1. Workflow illustration. (Top panel) From MD simulations of 961 proteins, we measured 108 structural features per amino acid of the protein structures, as well as the partial entropies {S_i}, which are the configurational entropies of the residues. With these data, we trained a neural network (termed Popcoen) on predicting the S_i values from the features. (Bottom panel) Popcoen allows one to estimate the configurational entropy S_conf just from a query structure via feature measurement and Popcoen evaluation. The approach is extremely fast, because it does not involve any type of configurational sampling.


from a public database of MD trajectories (the MoDEL database26). The simulation details and the data-selection process are outlined in the Methods section. From the simulation of each protein, we measured the partial entropies {S_i} of all AAs, as well as the centroid structure, which is the snapshot of the simulation that is most similar to all other snapshots. From the centroid structures, we then derived 108 features per residue (see the Methods section). The features contain global properties of the protein, such as N_res, the total solvent accessible surface area (SASA), the total number of hydrogen bonds, and properties of the gyration tensor, as well as local properties of the residue and its neighbors, such as the residue type, the burial level, the local density, the relative SASA, the average torsion angles, and the number of hydrogen bonds. Table 1 lists the coefficient of determination R² between S_i and a subset of features, showing that the features are predictive for S_i (complete list in the Supporting Information (SI)). A particularly large value of R² = 0.210(7) is found for the relative solvent accessibility of the residues. This confirms the illustrative example discussed in the Introduction: a properly defined surface term might be used as an estimator for S_conf of low accuracy. However, by combining the information of all features, we were able to construct a substantially better estimator (see discussion below).

Training of the Neural Network. Predicting the partial entropy S_i from the 108 features constitutes a standard regression problem of supervised learning, which can be tackled with various machine-learning approaches. We chose to implement an artificial neural network (NN) because of their excellent scalability and their highly successful application to diverse machine-learning problems in recent years.27 By extensive search over network architectures, learning methods, and other hyper-parameters, we identified a six-layer NN with shortcut connections that provided excellent validation performance when trained with a stochastic gradient descent variant and regularized by dropout and early stopping (see the Methods section). On a separate test set, the trained network predicted S_i with a coefficient of determination of R² = 0.547(6) (corresponding to an RMSE of 0.706(7) k_B and a Pearson correlation coefficient of 0.740(5)), which is substantially better than the prediction performance obtained for single features (see Table 1) or with a linear model (R² = 0.421(7)). This results in an excellent performance for the prediction of

the accumulated entropy

S_PC = Σ_{i=1}^{N_res} S_i    (3)

(referred to as the Popcoen entropy hereafter), as confirmed by an R² value of 0.66(5) (Pearson = 0.81(4)) for the prediction of S_PC/N_res on the test set. Scatter plots of measured versus predicted values are shown in Figure S2 in the SI.

The Popcoen prediction for S_PC/N_res is more accurate than for the individual S_i values (bootstrapped p-value = 0.016). This stems from the fact that the S_i values inside a protein are typically positively correlated, which broadens the S_PC distribution, i.e., std-dev(S_PC/N_res) = η₁ · std-dev(S_i) with η₁ = 3.5(3) > 1. Since the feature set also contains global features of the proteins, Popcoen is able to partially capture these correlations, such that RMSE(S_PC/N_res) = 2.4(3) k_B = η₂ · RMSE(S_i), with η₂ = 2.8(3) < η₁.

Entropy Prediction Tool Popcoen. The entropy prediction program Popcoen has been made available via the Internet at http://fmc.ub.edu/popcoen/. It allows one to estimate the configurational entropy of a protein structure via two calculation steps (see the bottom panel of Figure 1). First, the query structure is loaded and the 108 features per residue are calculated. Second, the NN is evaluated for the features, which yields predictions for the partial entropies {S_i} of the residues. The program outputs the S_i values and the Popcoen entropy S_PC (of eq 3), which represents the Popcoen estimate (in units of k_B) for S_conf + C, where C is constant for a given AA sequence. In addition, a reliability number λ ∈ [0, 1) is output, which reflects the fact that the Popcoen entropy is reliable only when the given query structure is not too different from the training data (see the SI).
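A client for such a locally running server process could look roughly as follows. The wire protocol (a newline-terminated path in, one JSON line out) and the reply fields are invented here for illustration and are not Popcoen's documented interface; a stub server stands in for the real program so the round trip can be exercised end to end.

```python
import json
import socket
import threading

HOST = "127.0.0.1"

def query_entropy_server(pdb_path, host, port):
    """Send one structure path, read back one JSON line (illustrative protocol)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((pdb_path + "\n").encode())
        reply = sock.makefile("r").readline()
    return json.loads(reply)

def stub_server(ready, port_box):
    """Stand-in for a locally running entropy server process (demo only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, 0))                       # let the OS pick a free port
        port_box.append(srv.getsockname()[1])
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            conn.makefile("r").readline()         # consume the request path
            reply = {"S_PC": -123.4, "lambda": 0.87}   # fabricated demo values
            conn.sendall(json.dumps(reply).encode() + b"\n")

ready, port_box = threading.Event(), []
threading.Thread(target=stub_server, args=(ready, port_box), daemon=True).start()
ready.wait()
result = query_entropy_server("query_structure.pdb", HOST, port_box[0])
print(result)
```

A wrapper of this size is consistent with the claim below that integration into existing software reduces to roughly ten lines of socket code.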

Table 1. Coefficient of Determination (R²) between the Partial Entropies {S_i} and Selected Features (or Combinations of Features)^a

  feature (-combination)                 R²^b        description
  totalSASA/[∏_{j=1..3} eval_j]^{1/2}    0.029(9)    totalSASA = total SASA; eval_j = jth eigenvalue of the gyration tensor in ascending order
  relSASA                                0.210(7)    relSASA = relative SASA of residue i
  ResType(−1)                            0.018(3)    ResType(s) = residue-type label (binary coded) for residue i + s
  ResType(0)                             0.17(1)
  ResType(1)                             0.014(2)
  |dist|/R_g                             0.12(1)     dist = vector from Cα atom to center of mass; R_g = radius of gyration
  [evec_1 · dist]²/eval_1                0.026(6)    evec_j = eigenvector associated with eval_j
  [evec_2 · dist]²/eval_2                0.026(6)
  [evec_3 · dist]²/eval_3                0.043(5)
  c(6)                                   0.122(7)    c(r_c) = number of Cα atoms in a sphere of radius r_c around residue i
  c(10)                                  0.19(1)
  c(14)                                  0.20(1)
  c(18)                                  0.162(7)
  c(22)                                  0.11(1)
  totalHbs/N_res                         0.016(3)    totalHbs = total number of hydrogen bonds; N_res = number of residues
  Hbs(−1)                                0.056(5)    Hbs(s) = number of hydrogen bonds of residue i + s
  Hbs(0)                                 0.068(4)
  Hbs(1)                                 0.032(7)
  ψ(−1) for ALA                          0.25(2)^c   ψ(s) = torsion angle ψ of residue i + s of the centroid (or query) structure if residue i is of the given residue type; analogously for other torsion angles
  ψ(0) for ALA                           0.33(1)^c
  ψ(1) for ALA                           0.20(2)^c
  ϕ(0) for GLY                           0.17(2)^c
  χ1(0) for LEU                          0.090(8)^c
  χ2(0) for ILE                          0.13(1)^c
  ω(0) for SER                           0.04(1)^c

^a The full list for all features is given in the Supporting Information (SI). The R² values are significantly larger than zero, which proves that the features are predictive for S_i. The most predictive feature is the relative SASA of the residues (relSASA), which alone allows one to explain 21% of the variance. The features were used for training a neural network, resulting in an improved accuracy of R² = 0.547(6) by combining the information of all features. ^b R² is defined as R² = 1 − MSE/Var(S_i), where MSE represents the mean squared error when the neural network was trained on predicting S_i just from a single feature, and Var represents the variance. ^c MSE and Var measured for data reduced to the given residue type.
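Footnote b's definition of R² can be made concrete with a toy single-feature predictor. The data below are synthetic and the feature name is only a stand-in; the point is the relation R² = 1 − MSE/Var, which makes the tabulated values directly comparable across features.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE/Var(y), the definition given in footnote b of Table 1."""
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)

rng = np.random.default_rng(3)
rel_sasa = rng.uniform(0.0, 1.0, 5_000)              # stand-in single feature
s_i = 1.5 * rel_sasa + rng.normal(0.0, 1.0, 5_000)   # weakly predictive toy target
slope, intercept = np.polyfit(rel_sasa, s_i, 1)      # best linear single-feature fit
r2 = r_squared(s_i, slope * rel_sasa + intercept)
print(r2)   # a single weak feature explains only a modest share of the variance
```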


Popcoen entropies can be obtained in two ways. First, a web server accessible via the Internet at http://fmc.ub.edu/popcoen/ allows easy but slow queries of Popcoen with no installation required. This access is suited for single requests, while large scans cannot be performed, because of the limited computational resources of the server. Second, Popcoen can be fully downloaded and run as a separate server process on local architecture, where input-output communication is realized via socket communication. In this way, incorporating Popcoen into existing protein software reduces to elementary socket communication, which requires ∼10 lines of additional source code. Besides, file-based communication is also supported.

Strictly speaking, querying Popcoen for nonequilibrium states (decoy structures) is based on the mild assumption that S_conf of native states is not fundamentally different from S_conf of other states, since Popcoen was exclusively derived from equilibrium simulations. More precisely, we assume that there is no hidden mechanism in nature that causes the fluctuation and correlation patterns of native states to be somehow special. A similar assumption is also made for software that estimates S_conf via sampling, namely, that the force field used can also be applied to nonequilibrium states, although force fields have only been parametrized from information accessible on experimental time scales.

Performance Test for Native-State Identification. In the following, we show that the accuracy of existing protein software can be improved by incorporating Popcoen entropy. We focus on the task of native-state identification, where protein software aims to identify the native state of a protein among many decoy states of equal AA sequence but distinct structure. This is done by ranking the G values of all structures and predicting the structure with the lowest G value to be the native one. The accuracy of this task can be interpreted as a measure of the accuracy of the free-energy expression G used.

The test was performed for the prominent protein software FoldX.6 FoldX is based on the free-energy expression G_FX, which contains, among others, a contribution for configurational entropy, denoted by S_FX here (see below). We replaced S_FX by S_PC, keeping the remaining contributions of G_FX unchanged, and compared the performances of the two cost functions for native-state identification. This procedure offered an indirect way of comparing the accuracies of S_PC and S_FX. It was used since it does not require knowledge of the exact configurational entropies. For a direct evaluation of S_PC and S_FX, we would need exact S_conf values, which cannot be estimated reliably for proteins of realistic size (see the Methods section).

FoldX was chosen because it arguably includes the most elaborate S_conf estimator among those programs that do not require configurational sampling (see the Introduction). Moreover, since S_FX has clearly been declared to model S_conf inside FoldX, we could perform a meaningful test by simply replacing S_FX with S_PC. Performing a similar test for a free-energy expression that does not contain a well-defined contribution for S_conf would require cleaning all energy terms of remaining entropy contributions. This cleaning procedure could not be performed unambiguously by us, since it requires additional information about the cost function that has not been published to its fullest extent.

Our test was based on a dataset compiled from the structure-prediction competition CASP.1 It contains the experimental structures of 459 AA sequences, together with decoy states for those sequences created by the CASP participants. As in CASP, the experimental structures were interpreted as the native states of the proteins, neglecting experimental artifacts. The decoys were selected to be mutually distinct and also distinct from the corresponding native states (see the Methods section). The number of decoys per sequence ranged from 6 to 315, with an average of 104.

With these data, we tested how reliably FoldX6 identified the native states among the decoys. We computed the free energy of FoldX (G_FX) by applying the FoldX protocols RepairPDB and Stability to all structures.28 We then ranked all structures with equal sequence in terms of G_FX, i.e., we ordered the structures from lowest to largest G_FX values and assigned to each structure its rank in the sorted list. We further counted how often the native state was identified by the cost function, i.e., how often rank 1 was assigned to the native. FoldX identified the native in 78% of the cases (358 of 459), which is much better than the 2% success probability obtained by picking a fold randomly. We then calculated the Popcoen entropy S_PC for all structures. First evidence that Popcoen can improve G_FX was obtained by noticing that S_PC/N_res is specifically large for those natives that

Figure 2. (A) Normalized histograms of S_PC/N_res for all 459 CASP natives and for the reduced set of 101 CASP natives that were not identified by FoldX. The associated medians are indicated by the arrows. FoldX fails to identify the native state preferentially for samples with large Popcoen entropy density (p-value < 10⁻¹⁰, Mann−Whitney−Wilcoxon test). (B) Histograms of the differences ΔE_FX/N_res and (−T)ΔS_PC/N_res between the decoy of lowest E_FX and the native, showing that the Popcoen entropy varies less between samples than the remaining energy contributions of FoldX. The inset shows a scatter plot of the standardized values of E_FX/N_res vs (−T)S_PC/N_res for the CASP natives, revealing an anticorrelation with a correlation coefficient of −0.3 ± 0.1.


could not be identified by FoldX (see Figure 2A; p-value < 10⁻¹⁰, Mann−Whitney−Wilcoxon test).

Then, we modified the FoldX free-energy expression as mentioned above. G_FX is a sum of various enthalpic and entropic contributions, modeling all significant contributions of G. Configurational entropy (multiplied by (−T)) is modeled by two terms, which are referred to as "W_mc T ΔS_mc + W_sc T ΔS_sc" by the authors,29 and, together, are referred to as (−T)S_FX here. S_FX accounts for configurational entropy stemming from backbone and side-chain motion. Its precise formula has not been published, because FoldX is proprietary software, but (−T)S_FX values are output by FoldX for given input structures. We replaced (−T)S_FX by (−T)S_PC = (−0.6 kcal/mol)·S_PC/k_B (with T ≈ 300 K), which did not involve any type of fitting, since FoldX energies are given in units of kcal/mol. The ranking of the structures was repeated with the modified free-energy function G_FX+PC = E_FX − T·S_PC, where E_FX denotes all remaining contributions of G_FX, i.e., E_FX = G_FX + T·S_FX. The substitution did not affect the ranks of 368 natives. With both free-energy expressions, 351 natives were identified (i.e., the natives had rank 1), while 17 natives had unchanged ranks larger than 1. This is a consequence of the fact that the values of E_FX/N_res varied more between structures than those of T·S_PC/N_res (see the histograms in Figure 2B). Precisely, the standard deviations of the differences ΔE_FX/N_res and TΔS_PC/N_res between the decoy of lowest E_FX and the native are 0.32(2) kcal/mol and 0.052(3) kcal/mol, respectively. Interestingly, ΔE_FX/N_res and (−T)ΔS_PC/N_res are anticorrelated, because of an observed anticorrelation between E_FX/N_res and (−T)S_PC/N_res (see the inset of Figure 2B), which might be interpreted as evidence for energy-entropy compensation inside proteins. However, further investigation is needed to decide whether the anticorrelation indeed represents an intrinsic property of proteins, or whether it is rather an artifact of the imprecisions of Popcoen and/or FoldX.

The ranks of the remaining 91 natives did change: they improved in 55 cases but worsened in 36 cases (i.e., in 55 cases, the rank of the native was smaller for G_FX+PC than for G_FX, and larger in 36 cases). This shows that incorporating S_PC into FoldX results in a mild increase in precision (p-value = 0.045, generalized likelihood ratio test). The total number of identified natives increased by 7, because 14 natives were identified with G_FX+PC but not with G_FX, while 7 samples were only identified with G_FX (see Table 2, presented later in this work).

The main reason why the ranking of some natives worsened because of Popcoen can be traced back to the fact that Popcoen's NN was evaluated for structures that are very different from all training data. This is shown in the scatter plot of Figure 3, where two geometrical observables of the proteins are shown for the centroid structures of the training data, as well as for the 91 CASP samples which were ranked differently by the two G values. The observables are the inverse ellipsoidal density, ρ_ellip⁻¹ = (4π/3)·√det(gyration tensor)/N_res, and the ellipsoidal aspect ratio, A_ellip = √(eval_3/eval_1), defined via the gyration tensor and its largest and smallest eigenvalues (eval_3 and eval_1, respectively). The plot shows that natives whose ranking improved due to Popcoen accumulate in regions of high training-data density, while natives whose rank worsened due to Popcoen are located preferentially in regions of low training-data density. This behavior is expected, since, roughly speaking, a neural network is a reliable "interpolation machine" but usually fails to "extrapolate" into regions of the feature space with low training-data density.30
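The two observables can be computed from a set of coordinates as sketched below; the formulas follow the definitions given in the text, and the coordinates here are synthetic (a toy elongated point cloud rather than a real protein).

```python
import numpy as np

def gyration_tensor(coords):
    """3x3 gyration tensor of coordinates (one row per residue/atom)."""
    centered = coords - coords.mean(axis=0)
    return centered.T @ centered / len(coords)

def ellipsoid_observables(coords):
    """rho_ellip^-1 = (4*pi/3)*sqrt(det T)/N_res and A_ellip = sqrt(eval3/eval1),
    with T the gyration tensor and eval1 <= eval2 <= eval3 its eigenvalues."""
    T = gyration_tensor(coords)
    evals = np.sort(np.linalg.eigvalsh(T))     # ascending order
    inv_density = (4.0 * np.pi / 3.0) * np.sqrt(np.prod(evals)) / len(coords)
    aspect = np.sqrt(evals[2] / evals[0])
    return inv_density, aspect

rng = np.random.default_rng(4)
coords = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.0])  # elongated toy blob
inv_density, aspect = ellipsoid_observables(coords)
print(inv_density, aspect)   # aspect ratio well above 1 for this anisotropic cloud
```

The inverse ellipsoidal density is the volume of the gyration ellipsoid per residue, so compact globular folds give small values while loosely packed or elongated folds give large ones.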

A detailed analysis confirmed this observation. We dropped all CASP structures (natives and decoys) with a reliability number λ < 0.5, which reduced the data to those structures with a local training-data density larger than 10/[std-dev(ρ_ellip⁻¹) · std-dev(A_ellip)] (see the SI; std-dev denotes the standard deviation over the training data). For the remaining data (containing 415 natives with an average of 77 decoys per native, in the range of 6−243), we repeated the ranking using G_FX and G_FX+PC. Now, the ranks of 49 natives improved due to Popcoen, while they worsened for only 18 natives (with 348 native ranks unchanged), which represents a highly significant increase in performance (p-value = 0.00013, generalized likelihood ratio test). As a related result, native-state identification also improved significantly (p-value = 0.0015): of the 415 natives, FoldX identified 336 (81%), while FoldX+Popcoen found 348 (84%), because FoldX+Popcoen could identify 14 additional natives while native-state identification was lost in only two cases (see Table 2). To further assess the statistical significance of this result, we randomly permuted the Popcoen entropies between all structures of equal AA sequence. This yielded, with large probability (>0.9993), fewer than 336 identified natives using G_FX+PC. Hence, the permuted Popcoen entropies act as noise, while the unpermuted ones improve the detection rate significantly.
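The permutation check can be sketched generically: shuffle the entropy term within each sequence and recount rank-1 identifications. All numbers below are synthetic stand-ins; only the logic mirrors the test described in the text.

```python
import numpy as np

rng = np.random.default_rng(5)

def identifies_native(energies):
    """True if the native (index 0 by convention here) has the lowest cost."""
    return int(np.argmin(energies)) == 0

# Toy data: per sequence, column 0 is the native; both terms slightly favor it.
n_seq, n_decoys = 400, 80
e_fx = rng.normal(0.0, 1.0, (n_seq, n_decoys + 1))
e_fx[:, 0] -= 1.5                                  # energy-like term prefers natives
s_pc = rng.normal(0.0, 1.0, (n_seq, n_decoys + 1))
s_pc[:, 0] += 1.0                                  # entropy term is also informative

g = e_fx - 0.6 * s_pc                              # combined cost, as in G_FX+PC
hits = sum(identifies_native(row) for row in g)

# Permutation null: shuffle entropies within each sequence, recount identifications.
perm_hits = []
for _ in range(200):
    s_perm = np.apply_along_axis(rng.permutation, 1, s_pc)
    perm_hits.append(sum(identifies_native(row) for row in e_fx - 0.6 * s_perm))
print(hits, np.mean(perm_hits))   # informative entropies beat the shuffled null
```

If the entropy term carried no structure-specific information, shuffling would on average leave the identification count unchanged; a clear drop under shuffling is the signature of a genuinely informative term.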

■ DISCUSSION

We presented the knowledge-based tool Popcoen for the prediction of protein configurational entropy. In contrast to previous approaches, Popcoen does not require any type of configurational sampling. Hence, Popcoen offers a way to include Sconf also in software based on cost-function minimization/

Figure 3. Scatter plot of inverse ellipsoidal density (ρellip−1) vs ellipsoidal aspect ratio (Aellip) for the centroid structures of the training and validation sets (864 proteins), as well as for the 91 CASP natives which were ranked differently by GFX and GFX+PC. Natives whose rank worsened due to Popcoen are located preferentially in regions with low training-data density. This shows that a neural network gives reliable estimates for cases "close" to the training data. Popcoen accounts for this fact via the reliability number λ.

Journal of Chemical Theory and Computation Article

DOI: 10.1021/acs.jctc.7b01079 | J. Chem. Theory Comput. XXXX, XXX, XXX−XXX


evaluation where no configurational sampling is carried out to save computational time.

Entropy is estimated by evaluating a neural network that was trained on simulation data of 961 representative proteins. This allows reliable Sconf predictions for most protein structures. For cases that are very distinct from the training proteins, Popcoen entropies may not be reliable, as indicated by a small value of the reliability number (λ ≲ 0.5). We plan to reparametrize Popcoen on a larger set of proteins once the simulation database contains substantially more trajectories. This will further extend Popcoen's applicability.

In an explicit test, we have shown that incorporating Popcoen

into the FoldX cost function improves the accuracy of FoldX for native-state identification. We expect that a similar precision gain can be obtained for all protein software based on cost-function minimization/evaluation which currently neglects Sconf or accounts for Sconf only rudimentarily. Software of this type has been developed for all major tasks related to proteins, such as structure prediction, protein design, NMR and X-ray refinement, docking, and mutation-effect prediction. Therefore, Popcoen may be helpful for many aspects of protein science.

We have designed Popcoen to be very easily integrable into

existing software, since Popcoen can be run as a separate server process. This enables researchers to test whether Popcoen improves their protein software with very limited effort, as only socket communication must be established. However, to prevent double counting, it might additionally be necessary to modify the original cost function to eliminate partial effects of Sconf already captured within Ĝ.

Because of the large entropic changes upon docking,31

considering Sconf is particularly important for protein-docking predictions. In collaboration with the developers of pyDock,10 we are currently investigating the impact of Popcoen on this task. We are further working on a similar knowledge-based approach for the average solvation free energy. This may result in a more accurate estimator than current approaches (such as Generalized Born or GBSA32), since it will account for the heterogeneous surface characteristics of proteins.

■ METHODS

Entropy Expression. Configurational entropy Sconf is the integral of eq 1, where x denotes the 3Natoms − 6 atom coordinates in a body-fixed coordinate system and ρ(x) is their joint probability density in the presence of the solvent. This

definition stems from a formal decomposition of the full phase-space integral over all degrees of freedom (dofs) of the protein−solvent system into protein dofs and solvent dofs, introduced by Lazaridis and Karplus4 (for details, see the SI). Equation 1 is better expressed in bond-angle-torsion coordinates to facilitate the isolation of dofs which are strongly confined by the covalent structure of the protein.22 The confined coordinates (bond lengths and bond angles) are basically "frozen" at room temperature,20 so that their contribution to Sconf can be neglected and eq 1 reduces to an integral over the flexible torsion angles, usually denoted {ψi}, {ϕi}, and {χi1}, ..., {χi4} (see the SI). The remaining integral is still of very high dimension (namely, ∼3.8·Nres-dimensional), rendering its brute-force evaluation from simulation data unfeasible (curse of dimensionality). Therefore, an approximation must be employed. Here, we used an approach similar to the second-order Maximum Information Spanning Tree (MIST) approximation of entropy,23,24 which allowed us to decompose Sconf into a sum of partial entropies {Si} which could reliably be measured from the available simulation data, since only one- and two-dimensional integrals are involved. We obtained the expression

Sconf ≈ ∑_{i=1}^{Nres} Si + C(T, AA sequence)    (4)

where C is a constant for different conformations of a given AA sequence at constant temperature, pressure, and pH. The partial entropies (except S1 and SNres), given as

Si ≔ ∑_{θ∈Θi} [H(θ) − I(θ; κ(θ))]    (5)

are defined in terms of the marginal entropies

H(θ) ≔ −kB ∫_{−π}^{π} dθ ρ(θ) log ρ(θ)    (6)

(denoted here by H to avoid confusion with Si) and the mutualinformations

I(θ; τ) ≔ H(θ) + H(τ) − H(θ, τ)    (7)

Here, ρ(θ) is the marginal distribution of a single torsion angle θ, and

H(θ, τ) ≔ −kB ∫_{−π}^{π} ∫_{−π}^{π} dθ dτ ρ(θ, τ) log ρ(θ, τ)    (8)

Table 2. Summary of the Performance Test for Native-State Identification^a

Comparison of the Rank of the Native State:

dataset                     | equal for GFX and GFX+PC | smaller for GFX+PC | smaller for GFX | p-value
all CASP data (459 samples) | 368 (80.2%)              | 55 (12.0%)         | 36 (7.8%)       | 0.045
λ filtered (415 samples)    | 348 (83.9%)              | 49 (11.8%)         | 18 (4.3%)       | 0.00013

Native-State Identification:

dataset                     | GFX+PC: +, GFX: + | GFX+PC: −, GFX: − | GFX+PC: +, GFX: − | GFX+PC: −, GFX: + | p-value
all CASP data (459 samples) | 351 (76.5%)       | 87 (19.0%)        | 14 (3.1%)         | 7 (1.5%)          | not significant
λ filtered (415 samples)    | 334 (80.5%)       | 65 (15.7%)        | 14 (3.4%)         | 2 (0.5%)          | 0.0015

^a The FoldX cost function GFX is compared to GFX+PC, which contains the Popcoen entropy for Sconf. In the top portion of the table, the rank of the native state is compared for the entire CASP dataset and for a reduced dataset containing only structures with reliability number λ > 0.5. Given are the numbers (and percentages) of natives that are ranked equally by both cost functions, or better by one or the other. This shows that GFX+PC performs significantly better than GFX (p-values obtained with the generalized likelihood ratio test). In the bottom portion of the table, given are the numbers (and percentages) of natives that have been identified (+) by both cost functions, have not been identified (−) by either, or have been identified by only one cost function. For the reduced dataset, GFX+PC performs significantly better than GFX.


is the entropy of the two-dimensional joint distribution ρ(θ, τ) of two torsion angles θ and τ. The sum of eq 5 runs over the set Θi containing all unfrozen torsion angles of residue i; i.e., Θi is the set {ϕi, ψi} extended by those side-chain angles χi1, χi2, χi3, and χi4 that exist in residue i. The function κ encodes adjacent torsion angles along the covalent structure of the protein via the definitions κ(ϕi) ≔ ψi−1, κ(ψi) ≔ ϕi, κ(χi1) ≔ ψi, κ(χi2) ≔ χi1, κ(χi3) ≔ χi2, and κ(χi4) ≔ χi3.

The detailed calculation leading to eq 5, together with the slightly modified formulas for S1 and SNres, can be found in the SI.
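The histogram estimators behind eqs 6−8 can be sketched as follows (a numpy sketch with kB set to 1 for illustration; bin counts follow the Measurements section below; using the marginals of the same 2D histogram makes the mutual-information estimate exactly non-negative):

```python
import numpy as np

KB = 1.0  # Boltzmann constant set to 1 for illustration

def marginal_entropy(theta, bins=400):
    """Histogram estimate of eq 6: H = -kB * sum p * log(p / width),
    where dividing by the bin width turns bin probabilities into a
    density, as required for a differential entropy."""
    counts, edges = np.histogram(theta, bins=bins, range=(-np.pi, np.pi))
    p = counts / counts.sum()
    width = edges[1] - edges[0]
    nz = p > 0
    return -KB * np.sum(p[nz] * np.log(p[nz] / width))

def mutual_information(theta, tau, bins=20):
    """Histogram estimate of eq 7, I = H(theta) + H(tau) - H(theta, tau).
    The bin-width terms cancel in the combination, so discrete entropies
    suffice; taking both marginals from the same 2D histogram guarantees
    I >= 0 for the estimator."""
    counts, _, _ = np.histogram2d(theta, tau, bins=bins,
                                  range=[(-np.pi, np.pi), (-np.pi, np.pi)])
    p = counts / counts.sum()
    p1, p2 = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    h12 = -np.sum(p[nz] * np.log(p[nz]))
    h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
    h2 = -np.sum(p2[p2 > 0] * np.log(p2[p2 > 0]))
    return KB * (h1 + h2 - h12)
```

For a uniform torsion angle the marginal entropy approaches kB·log(2π), and the mutual information of independent angles approaches zero up to the usual finite-sample bias.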

The dependence of C on the AA sequence is also discussed in the SI. This dependency must be considered when entropy for different sequences is compared, e.g., in mutation studies.

MD Data. We trained the neural network with data measured

from MD simulations. The trajectories were obtained from the MoDEL database, which contains simulation trajectories for a large and representative set of proteins covering all major fold classes.26 All simulations used for this study had been performed with AMBER 8.0 (parm99 force field) under ambient conditions (T = 300 K, p = 1 atm) and constant volume. Each simulation was performed for a single protein in explicit solvent (TIP3P water). After careful equilibration to the simulation conditions, production runs lasted for 10 ns using a time step of 2 fs, with snapshots taken every picosecond. All trajectories used had passed all eight quality criteria of the MoDEL database,26 which guarantees that unfolding events did not occur. In total, we analyzed simulation data for 961 proteins, ranging from 21 AAs to 539 AAs, with an average size of 151.4 AAs.

The simulation data were available from the MoDEL database

in compressed form, where a nonreversible compression algorithm had been employed.33 Enormous compression rates were achieved by reducing the fluctuation spectra of the Cartesian coordinates to the most contributing modes. At first glance, this might be problematic for the present work, because configurational entropy is crucially dependent on fluctuations. However, as we show in the SI, the compression causes only a negligible error when the partial entropies are rescaled appropriately (see eq (S19) in the SI).

Measurements. From the MD trajectories, we measured all

marginal entropies and mutual informations needed to construct the partial entropies {Si}. The corresponding integrals were calculated using histograms, which is efficient and sufficiently accurate.34 Four hundred (400) bins of equal size were used for the one-dimensional integral of H(θ), while 20 bins per angle were used for the one- and two-dimensional integrals needed for computing I, to ensure that I ≥ 0 holds exactly for the estimator. Finally, the rescaling that accounts for the data compression (eq (S19)) was applied.

We further derived the centroid structure, defined as the

snapshot k with the largest ∑l exp(−dkl/d̄), where dkl is the root-mean-square deviation of snapshots k and l, and d̄ is the mean of all dkl encountered for the trajectory. The alignment was performed with mdtraj.35 The calculations were accelerated at the cost of negligible imprecision: first, a centroid was calculated for each of the 100 consecutive subtrajectories of 100 ps length; then, the centroid of these centroids was derived.

From the centroid structures, as well as from the CASP data

(see below), we calculated the following 108 features for all residues (labeled by i, 1 ≤ i ≤ Nres):

Feature 1: Nres = number of residues.

Features 2−4: eval1, eval2, eval3 = eigenvalues of the gyration tensor, sorted in ascending order.

Features 5−7: evec1 · dist, evec2 · dist, evec3 · dist, where evecj is the jth eigenvector of the gyration tensor associated with eigenvalue evalj and directed such that its first component is non-negative; dist is the vector from the center-of-mass to the Cα atom of residue i. The center-of-mass is calculated as the (unweighted) average position of all protein atoms except hydrogens. Note that features 2−7 contain information about the position of residue i inside an ellipsoid representing the protein shape.

Features 8−18: c(2), c(4), c(6), ..., c(22), where c(rc) is the number of Cα atoms with an Euclidean distance of less than rc [in Å] or a chain distance of <3 to the Cα atom of residue i. These features represent a measure for the local density profile. The c(rc) values were not normalized by the associated volumes 4πrc³/3, since such normalization would not affect the NN performance, while integer values are more human-readable.

Feature 19: relSASA is the solvent-accessible surface area (SASA) of residue i, relative to the standard accessibility of the residue type of residue i. SASA calculations are performed with the mdtraj35 implementation of the Shrake−Rupley algorithm (probe radius = 0.14 nm). The standard accessibility values of NACCESS36 are used for normalization. Normalization was introduced since it allows for better human interpretability. It is irrelevant for the network performance (because of feature ResType(0); see below).

Feature 20: totalSASA is the total SASA of the protein in Å².

Features 21−41: torsions(−1), torsions(0), torsions(1), where torsions(s) = {ψ(s), ϕ(s), χ1(s), χ2(s), χ3(s), χ4(s), ω(s)} are the torsion-angle values of residue i + s of the given protein structure, shifted by −π/2, −π, −5π/3, −π/4, −π/3, −π, and −π/2, respectively. Angles are defined on (−π, π]. Nonexistent angles are assigned the value nan. Because of the constant shifts, there is typically small weight at the ends of the interval.

Feature 42: totalHbs is the total number of hydrogen bonds of the protein, where hydrogen bonds are identified using the mdtraj implementation of the Wernet−Nilsson algorithm.37

Features 43−45: Hbs(−1), Hbs(0), Hbs(1), where Hbs(s) is the number of hydrogen bonds of residue i + s (acting as either donor or acceptor).

Features 46−108: ResType(−1), ResType(0), ResType(1), where ResType(s) represents 21 binary variables encoding the residue type of residue i + s.

All features are predictive for Si, as confirmed by the associated R² values (see Section S7 in the SI). We further showed that reasonable groups of features contain additional information that is not inherent in the remaining features (see Section S6 in the SI). Adding a feature for the residue's secondary structure did not improve the performance. Hence, the features compose a reasonable minimal feature set, although we cannot discard that some subset might provide similar prediction accuracy.
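As an illustration of two of the feature groups, the sketch below computes features 2−7 and 8−18 from heavy-atom and Cα coordinates (numpy; function names are ours, and whether residue i counts itself in c(rc) is our assumption, here excluded):

```python
import numpy as np

def gyration_features(heavy_coords, ca_i):
    """Features 2-7 (a sketch): gyration-tensor eigenvalues (ascending)
    and the projections evec_j . dist, where dist points from the
    center of mass (unweighted mean of non-hydrogen atoms) to the
    Calpha of residue i."""
    com = heavy_coords.mean(axis=0)
    centered = heavy_coords - com
    q = centered.T @ centered / len(centered)
    evals, evecs = np.linalg.eigh(q)          # eigenvalues in ascending order
    dist = ca_i - com
    projections = []
    for j in range(3):
        v = evecs[:, j]
        if v[0] < 0:                          # first component non-negative
            v = -v
        projections.append(float(v @ dist))
    return list(evals), projections

def local_density_features(ca_coords, i, radii=range(2, 23, 2)):
    """Features 8-18 (a sketch): c(rc) = number of Calpha atoms within
    rc angstroms of residue i's Calpha, or within chain distance < 3,
    with residue i itself excluded (our assumption)."""
    d_eucl = np.linalg.norm(ca_coords - ca_coords[i], axis=1)
    d_chain = np.abs(np.arange(len(ca_coords)) - i)
    out = []
    for rc in radii:
        mask = (d_eucl < rc) | (d_chain < 3)
        mask[i] = False                       # do not count residue i itself
        out.append(int(mask.sum()))
    return out
```

In production code the coordinates would come from a parsed PDB file or an mdtraj trajectory; here plain arrays keep the sketch self-contained.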

Training of the Neural Network. The MD data were randomly divided into three sets: a training set containing 780 proteins (118 062 residues), a validation set of 84 proteins (13 113 residues), and a test set of 97 proteins (14 311 residues). Optimization of the NN architecture, learning method, and other hyper-parameters was performed by training various NNs on the training set and evaluating their performance in terms of RMSE on the validation set. The manual and random search included architectural variants (simple convolutional architectures, skip-connections,38 batch-normalization layers39), different activation


functions (rectified-linear unit (ReLU), Parametric ReLU, exponential-linear unit (ELU)40), as well as learning methods (the stochastic gradient-descent variants Adam41 and RMSprop42). The random search sampled independently between 0 and 14 hidden layers, between 8 and 1024 neurons per layer, between 0 and 50% dropout, and between 10−2 and 10−5 for the initial learning rate. In all cases, features and targets were standardized for training, parameters were initialized in a data-dependent way with Gaussian noise and within-layer weight normalization,43 and the learning rate was reduced by a factor of 10 at 80 epochs and at 160 epochs. All NNs and their training were implemented in Python using the Theano library.44
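The network selected by this search (the architecture of Figure 4, detailed next) can be sketched as an inference-time forward pass in numpy, with random placeholder weights rather than the trained ones. The exact wiring, with the input shortcut feeding hidden layers 2−5 and the output layer, is our inference; it reproduces the quoted total of 94 709 parameters.

```python
import numpy as np

def elu(x):
    """Exponential-linear unit."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def init_params(n_features=108, width=100, n_hidden=5, seed=0):
    """Random placeholder parameters with the assumed wiring: the first
    hidden layer sees the features; every later layer (and the output)
    sees [previous activation, features] via the input shortcut."""
    rng = np.random.default_rng(seed)
    params, in_dim = [], n_features
    for _ in range(n_hidden):
        params.append((rng.normal(0, 0.05, (in_dim, width)), np.zeros(width)))
        in_dim = width + n_features
    params.append((rng.normal(0, 0.05, (in_dim, 1)), np.zeros(1)))
    # Parameter count: 108*100+100 + 4*(208*100+100) + (208*1+1) = 94 709
    return params

def forward(x, params):
    """Inference-time pass (dropout is a training-only regularizer and is
    therefore omitted here); the output layer is purely affine."""
    h, first = x, True
    for w, b in params[:-1]:
        z = h if first else np.concatenate([h, x])   # shortcut: re-feed input
        h = elu(z @ w + b)
        first = False
    w, b = params[-1]
    return float(np.concatenate([h, x]) @ w + b)
```

The original implementation used Theano44 and trained the parameters by minimizing the squared loss against the measured Si; this sketch only fixes the computational graph.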

Based on this extensive search, superior performance was found for a six-layer NN with shortcut connections (re-feeding the inputs to deeper layers) and, in total, 94 709 weight and bias parameters, trained with Adam for 200 epochs in mini-batches of size 200, with an initial step size of 5 × 10−4 and otherwise default parameters. The precise feed-forward architecture is depicted in Figure 4. It consists of five hidden layers, each of 100 nodes,

which are fully connected to both the previous layer and the input layer, with 30% and 10% dropout, respectively. ELUs serve as the nonlinearity for all layers except the output layer, for which a linear activation function was used. The learning curve of the final NN is shown in Figure S3 in the SI.

CASP Data. CASP is a competition for blind predictions of

protein structure. In a nutshell, the organizers publish AA sequences for which the participants submit models. Then, the experimental structures (called targets) are published and compared with the models.

From the CASP historical data repository, we downloaded all

target files of CASP5−CASP11, together with the associated models submitted in the category of tertiary structure predictions. Entirely prior to the performance test reported in the Results section, the data were cleaned and filtered in various ways, as described in the SI. In summary, we selected those samples that were not canceled by the organizers and whose target pdb-files did not contain errors. Furthermore, we dropped erroneous models, unfolded models, models for improper AA sequences, and models containing insufficient side-chain information. We then performed hierarchical clustering on the models (of a given target) and dropped all models per cluster except a single random one. The remaining models were sequence-aligned with the associated targets and reduced to their common atoms.

Finally, very accurate models were dropped (to ensure that the remaining models represent decoy states of the protein), and targets with fewer than 5 remaining models were dropped. In this way, we obtained 459 targets (= natives), with an average of 104 models (= decoys) per target, ranging from 6 to 315.

Error Estimation. The reported errors represent the standard error of the mean, computed with the bootstrap method. Errors are given in concise notation; i.e., numbers in parentheses represent the error on the last significant figure(s).
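The bootstrap estimate of the standard error of the mean can be sketched as follows (numpy; the resample count and function name are ours):

```python
import numpy as np

def bootstrap_sem(values, n_boot=10000, seed=0):
    """Standard error of the mean via the bootstrap: resample the data
    with replacement, recompute the mean of each resample, and take the
    standard deviation of the resampled means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    return means.std(ddof=1)
```

For a simple mean the bootstrap estimate converges to the analytic s/√n, but the same resampling scheme carries over unchanged to statistics without a closed-form error.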

■ ASSOCIATED CONTENT

*S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.7b01079.

The Supporting Information contains the following: (1) the full derivation of the used entropy expression; (2) additional figures referred to in the paper; (3) an analysis of the effects caused by the compression algorithm of the MoDEL database; (4) details about the reliability number λ; (5) a report of how the CASP dataset was compiled; (6) an analysis confirming the importance of the features; and (7) a list of the R² values of all features (PDF)

■ AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected]

ORCID
Martin Goethe: 0000-0002-5826-2180

Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

We are grateful to the authors of ref 26 (especially Adam Hospital, Josep Lluís Gelpí, and Modesto Orozco) for setting up the MoDEL database and for providing their simulation data. We thank Mareike Poehling for drawing Figure 1.

■ REFERENCES

(1) Moult, J. A decade of CASP: Progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 2005, 15, 285−289.
(2) Bonneau, R.; Baker, D. Ab Initio Protein Structure Prediction: Progress and Prospects. Annu. Rev. Biophys. Biomol. Struct. 2001, 30, 173−189.
(3) Reichl, L. E. A Modern Course in Statistical Physics, 3rd Edition; Wiley−VCH: Weinheim, Germany, 2009.
(4) Lazaridis, T.; Karplus, M. Effective Energy Function for Proteins in Solution. Proteins: Struct., Funct., Genet. 1999, 35, 133−152.
(5) Goethe, M.; Fita, I.; Rubi, J. M. Thermal motion in proteins: Large effects on the time-averaged interaction energies. AIP Adv. 2016, 6, 035020.
(6) Schymkowitz, J.; Borg, J.; Stricher, F.; Nys, R.; Rousseau, F.; Serrano, L. The FoldX web server: An online force field. Nucleic Acids Res. 2005, 33, W382−W388.
(7) Pokala, N.; Handel, T. M. Energy Functions for Protein Design: Adjustment with Protein−Protein Complex Affinities, Models for the Unfolded State, and Negative Design of Solubility and Specificity. J. Mol. Biol. 2005, 347, 203−227.
(8) Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D. Protein Structure Prediction using Rosetta. Methods Enzymol. 2004, 383, 66−93.
(9) Schwieters, C. D.; Kuszewski, J. J.; Tjandra, N.; Clore, G. M. The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 2003, 160, 65−73.

Figure 4. Schematic depiction of the computational graph underlying the NN architecture. The values of the 108 features are passed through 6 feed-forward layers (5 hidden, 1 output) to obtain an estimate y for Si. Each hidden layer Li applies an affine transformation (i.e., multiplication with weight matrix Wi and addition of bias bi) and an exponential-linear transformation. The output layer applies only an affine transformation. Shortcut connections from the input layer to the output of the hidden layers were provided. 10% and 30% dropout was applied to the output of the input layer and of the hidden layers, respectively. All parameters were optimized to minimize the squared loss between y and the targets t = partial entropies Si calculated from the MD simulations.


(10) Cheng, T. M.; Blundell, T. L.; Fernandez-Recio, J. pyDock: Electrostatics and desolvation for effective scoring of rigid-body protein−protein docking. Proteins: Struct., Funct., Genet. 2007, 68, 503−515.
(11) Suarez, M.; Tortosa, P.; Jaramillo, A. PROTDES: CHARMM toolbox for computational protein design. Syst. Synth. Biol. 2008, 2, 105−113.
(12) Pierce, B.; Weng, Z. ZRANK: Reranking protein docking predictions with an optimized energy function. Proteins: Struct., Funct., Genet. 2007, 67, 1078−1086.
(13) Thévenet, P.; Shen, Y.; Maupetit, J.; Guyon, F.; Derreumaux, P.; Tufféry, P. PEP-FOLD: An updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides. Nucleic Acids Res. 2012, 40, W288−W293.
(14) Schäfer, H.; Smith, L. J.; Mark, A. E.; van Gunsteren, W. F. Entropy Calculations on the Molten Globule State of a Protein: Side-Chain Entropies of α-Lactalbumin. Proteins: Struct., Funct., Genet. 2002, 46, 215−224.
(15) Berezovsky, I. N.; Chen, W. W.; Choi, P. J.; Shakhnovich, E. I. Entropic Stabilization of Proteins and Its Proteomic Consequences. PLoS Comput. Biol. 2005, 1, e47.
(16) Zhang, J.; Liu, J. S. On Side-Chain Conformational Entropy of Proteins. PLoS Comput. Biol. 2006, 2, e168.
(17) Goethe, M.; Fita, I.; Rubi, J. M. Vibrational Entropy of a Protein: Large Differences between Distinct Conformations. J. Chem. Theory Comput. 2015, 11, 351−359.
(18) Benedix, A.; Becker, C. M.; de Groot, B. L.; Caflisch, A.; Böckmann, R. A. Predicting free energy changes using structural ensembles. Nat. Methods 2009, 6, 3−4.
(19) Xu, D.; Zhang, Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins: Struct., Funct., Genet. 2012, 80, 1715−1735.
(20) McCammon, J. A.; Harvey, S. C. Dynamics of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, U.K., 1987.
(21) Suárez, D.; Díaz, N. Direct methods for computing single-molecule entropies from molecular simulations. WIREs Comput. Mol. Sci. 2015, 5, 1.
(22) Hnizdo, V.; Gilson, M. K. Thermodynamic and Differential Entropy under a Change of Variables. Entropy 2010, 12, 578−590.
(23) King, B. M.; Tidor, B. MIST: Maximum Information Spanning Trees for dimension reduction of biological data sets. Bioinformatics 2009, 25, 1165−1172.
(24) King, B. M.; Silver, N. W.; Tidor, B. Efficient Calculation of Molecular Configurational Entropies Using an Information Theoretic Approximation. J. Phys. Chem. B 2012, 116, 2891−2904.
(25) Mark, A. E.; van Gunsteren, W. F. Decomposition of the Free Energy of a System in Terms of Specific Interactions: Implications for Theoretical and Experimental Studies. J. Mol. Biol. 1994, 240, 167−176.
(26) Meyer, T.; D'Abramo, M.; Hospital, A.; Rueda, M.; Ferrer-Costa, C.; Perez, A.; Carrillo, O.; Camps, J.; Fenollosa, C.; Repchevsky, D.; et al. MoDEL (Molecular Dynamics Extended Library): A Database of Atomistic Molecular Dynamics Trajectories. Structure 2010, 18, 1399−1409.
(27) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 2015, 61, 85−117.
(28) Delgado, J. Private communication.
(29) Guerois, R.; Nielsen, J. E.; Serrano, L. Predicting Changes in the Stability of Proteins and Protein Complexes: A Study of More Than 1000 Mutations. J. Mol. Biol. 2002, 320, 369−387.
(30) Bishop, C. M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, U.K., 1995.
(31) Chang, C.-E.; Chen, W.; Gilson, M. K. Ligand configurational entropy and protein binding. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 1534−1539.
(32) Cramer, C. J.; Truhlar, D. G. Implicit Solvation Models: Equilibria, Structure, Spectra, and Dynamics. Chem. Rev. 1999, 99, 2161−2200.
(33) Meyer, T.; Ferrer-Costa, C.; Perez, A.; Rueda, M.; Bidon-Chanal, A.; Luque, F. J.; Laughton, C. A.; Orozco, M. Essential Dynamics: A Tool for Efficient Trajectory Compression and Management. J. Chem. Theory Comput. 2006, 2, 251−258.
(34) Lange, O. F.; Grubmüller, H. Full correlation analysis of conformational protein dynamics. Proteins: Struct., Funct., Genet. 2008, 70, 1294−1312.
(35) McGibbon, R. T.; Beauchamp, K. A.; Harrigan, M. P.; Klein, C.; Swails, J. M.; Hernandez, C. X.; Schwantes, C. R.; Wang, L.-P.; Lane, T. J.; Pande, V. S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015, 109, 1528−1532.
(36) Hubbard, S. J.; Thornton, J. M. NACCESS, Computer Program; Department of Biochemistry and Molecular Biology, University College London: London, 1993.
(37) Wernet, P.; Nordlund, D.; Bergmann, U.; Cavalleri, M.; Odelius, M.; Ogasawara, H.; Näslund, L.; Hirsch, T. K.; Ojamäe, L.; Glatzel, P.; et al. The structure of the first coordination shell in liquid water. Science 2004, 304, 995−999.
(38) Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K. Q. Densely Connected Convolutional Networks. Presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017; arXiv:1608.06993, p 2261 (DOI: 10.1109/CVPR.2017.243).
(39) Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015; pp 448−456.
(40) Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289, 2015.
(41) Kingma, D. P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
(42) Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning, 2012.
(43) Krähenbühl, P.; Doersch, C.; Donahue, J.; Darrell, T. Data-dependent Initializations of Convolutional Neural Networks. arXiv:1511.06856, 2015.
(44) Al-Rfou, R.; Alain, G.; Almahairi, A.; Angermüller, C.; et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016.
(45) Li, D.-W.; Meng, D.; Brüschweiler, R. Short-Range Coherence of Internal Protein Dynamics Revealed by High-Precision in Silico Study. J. Am. Chem. Soc. 2009, 131, 14610−14611.
(46) Suárez, E.; Díaz, N.; Suárez, D. Entropy Calculations of Single Molecules by Combining the Rigid-Rotor and Harmonic-Oscillator Approximations with Conformational Entropy Estimations from Molecular Dynamics Simulations. J. Chem. Theory Comput. 2011, 7, 2638−2653.
(47) Sokal, R.; Michener, C. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958, 38, 1409−1438.
(48) Zemla, A. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003, 31, 3370−3374.
