

arXiv:2003.09193v2 [physics.comp-ph] 19 Jun 2020

Optimal estimates of self-diffusion coefficients from molecular dynamics simulations

Jakob Tómas Bullerjahn,¹ Sören von Bülow,¹ and Gerhard Hummer¹,²,a)

1) Department of Theoretical Biophysics, Max Planck Institute of Biophysics, 60438 Frankfurt am Main, Germany
2) Institute of Biophysics, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany

(Dated: June 22, 2020)

Translational diffusion coefficients are routinely estimated from molecular dynamics simulations. Linear fits to mean squared displacement (MSD) curves have become the de facto standard, from simple liquids to complex biomacromolecules. Nonlinearities in MSD curves at short times are handled with a wide variety of ad hoc practices, such as partial and piece-wise fitting of the data. Here, we present a rigorous framework to obtain reliable estimates of the self-diffusion coefficient and its statistical uncertainty. We also assess in a quantitative manner if the observed dynamics is indeed diffusive. By accounting for correlations between MSD values at different times, we reduce the statistical uncertainty of the estimator and thereby increase its efficiency. With a Kolmogorov-Smirnov test, we check for possible anomalous diffusion. We provide an easy-to-use Python data analysis script for the estimation of self-diffusion coefficients. As an illustration, we apply the formalism to molecular dynamics simulation data of pure TIP4P-D water and a single ubiquitin protein. In a companion paper [J. Chem. Phys. XXX, YYYYY (2020)], we demonstrate its ability to recognize deviations from regular diffusion caused by systematic errors in a common trajectory “unwrapping” scheme that is implemented in popular simulation and visualization software.

a) Electronic mail: [email protected]

I. INTRODUCTION

Brownian motion is one of the pillars of biological physics, being observed on both microscopic and mesoscopic scales. Einstein’s seminal work¹ established the mean squared displacement (MSD) as the central observable to characterize the jittering motion of microscopic objects. On sufficiently long time scales, the MSD of a freely diffusing tracer particle or macromolecule grows linearly in time with a slope directly proportional to its self-diffusion coefficient $D$. Accordingly, $D$ is commonly estimated via linear fits to measured MSDs,² which at first glance may seem like a rigorous approach, because the resulting estimate is unbiased. However, the precision of the estimate suffers if too many MSD values are used for the fit.³,⁴ This counter-intuitive behavior results from the fact that the most common estimator for the MSD of a finite time series $\{X_0, X_1, \dots, X_{N-1}, X_N\}$, namely

$$\mathrm{MSD}_i = \sum_{n=0}^{N-i} \frac{(X_{n+i} - X_n)^2}{N - i + 1} , \tag{1}$$

introduces correlations between the $\mathrm{MSD}_i$ at different time lags $t_i = i\Delta t$ with $i = 1, 2, \dots, M \le N$. Furthermore, the underlying dynamics may not be purely diffusive, as is often the case with molecular dynamics (MD) simulations. Non-diffusive dynamics, arising, e.g., due to ballistic motion,⁵ short-lived memory⁶ or caging effects,⁷ affect the MSD values at short times. As an example of a complex diffusion process where the transition to regular diffusion is slow, Vargas and Snurr⁸ analyzed the translational diffusion of alkanes in an anisotropic metal-organic framework. A common strategy to characterize non-diffusive short-time dynamics is to invoke elaborate non-Markov processes,⁹ but these have to be specifically tailored to the data at hand. We therefore face the problem that diffusion coefficient estimates using short-time data are compromised by possible non-diffusive dynamics, and estimates using long-time data suffer from large statistical uncertainties.
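The MSD estimator of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' analysis script; the function name `msd` and its arguments are hypothetical, and the input is assumed to be a one-dimensional series $X_0, \dots, X_N$.

```python
from typing import List, Sequence


def msd(x: Sequence[float], max_lag: int) -> List[float]:
    """Return [MSD_1, ..., MSD_M] from Eq. (1) for a series X_0, ..., X_N."""
    n_last = len(x) - 1  # N, the index of the last observation
    if not 1 <= max_lag <= n_last:
        raise ValueError("need 1 <= M <= N")
    values = []
    for lag in range(1, max_lag + 1):
        # Overlapping windows: n runs from 0 to N - lag, i.e. N - lag + 1 terms.
        squares = [(x[n + lag] - x[n]) ** 2 for n in range(n_last - lag + 1)]
        values.append(sum(squares) / len(squares))
    return values


print(msd([0.0, 1.0, 3.0, 6.0], max_lag=3))
```

Because the windows overlap, every displacement at lag $i$ shares increments with displacements at other lags, which is the origin of the correlations between the $\mathrm{MSD}_i$ mentioned above.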

To address the latter, various proposals for improved diffusion coefficient estimation have been published in recent years,⁴,¹⁰⁻¹² correcting for experimental artifacts and making more efficient use of the available data. Here, we introduce a rigorous framework that combines these sophisticated estimators with a sub-sampling procedure, where the length of the recording-time interval is varied to suppress nonlinearities in the MSD curves on short time scales. The resulting sub-sampled dynamics is then modeled via a diffusive process $X$ combined with an instantaneous stepwise spread, satisfying

$$\langle \mathrm{MSD}_i \rangle \equiv \langle (X_{n+i} - X_n)^2 \rangle = a^2 + i\sigma^2 , \tag{2}$$

whose details are presented at the beginning of Sec. II. The associated self-diffusion coefficient is given by

$$D = \frac{\sigma^2}{2\Delta t} ,$$

where $\Delta t$ denotes the length of the time interval between two consecutive observations $X_i$ and $X_{i+1}$ $\forall i$. By introducing the static noise parameter $a^2$ into our generic model of diffusion, we aim to account for “molecular events” such as correlated collisions and cage diffusion without having to use complex system-specific dynamical models. A price we pay is that the process $X$ becomes valid only for time intervals $\Delta t$ that exceed the time scale of the molecular events. We note that ballistic motion, e.g., in an underdamped Langevin equation would lead to $a^2 < 0$. However, in practice, such down-shifts are normally not observed, because they get compensated by other effects that broaden the MSD.

The process $X$ is compatible with the above-mentioned diffusion coefficient estimators, a few of which we briefly review and then compare performance-wise in Sec. II. To determine the range of interval lengths where $X$ appropriately describes the data at hand, we propose a quality factor based on $\chi^2$-statistics in Sec. III A to decide whether an estimate overfits or underfits the model to the data. As a probe of possible anomalous diffusion, the resulting diffusion coefficient estimate obtained at short times can then be compared to observed long-time dynamics, as described in Sec. III B. We illustrate our framework by applying it to various MD trajectories in Sec. IV: First on an ensemble of pure TIP4P-D water¹³ (Secs. IV A–IV C) and then on a single ubiquitin protein solvated in TIP4P-D water (Sec. IV D). Although the analysis of the latter system reveals some ambiguities, our framework identifies the dynamics as diffusive with a self-diffusion coefficient that also properly captures ubiquitin’s long-time behavior (Sec. IV E). In the companion paper, Ref. 14, we show that unreasonable quality-factor values arise due to previously overlooked shortcomings in a trajectory “unwrapping” scheme used widely in the analysis of MD simulations at constant pressure. Finally, Sec. V provides a summary of our results and the Appendix gives the interested reader further details on some of the more elaborate derivations.

II. DIFFUSION COEFFICIENT ESTIMATION

At short times, inertial dynamics and correlated local motions lead to deviations from simple diffusion, typically resulting in a fast initial spread of the particle position. On longer time scales, the resulting MSD then resembles Eq. (2). Let us therefore consider two discrete-time Wiener processes, $Z$ and $X$, where the latter process evolves on top of each realization $Z_i$ of the former. The dynamics of the two processes is captured by the following iterative equations,

$$Z_{i+1} = Z_i + \sigma R_i , \quad \langle R_i \rangle = 0 , \quad \langle R_i R_j \rangle = \delta_{i,j} , \tag{3a}$$

$$X_i = Z_i + \frac{a}{\sqrt{2}} S_i , \quad \langle S_i \rangle = 0 , \quad \langle S_i S_j \rangle = \delta_{i,j} , \tag{3b}$$

where $R$ and $S$ denote uncorrelated normally distributed random variables with zero mean and unit variance, and $\delta_{i,j}$ is the Kronecker delta that evaluates to one if $i = j$ and zero otherwise. By construction, we have $\langle Z_i S_j \rangle = 0$ $\forall i, j$ and $\langle Z_i R_j \rangle = 0$ $\forall i \le j$. Because $Z$ and $X$ are also both normally distributed, their distributions are fully characterized by the mean and (co)variance, namely

$$\langle Z_i \rangle = 0 , \quad \mathrm{cov}(Z_i, Z_j) = \sigma^2 \min(i, j) ,$$

$$\langle X_i \rangle = \langle Z_i \rangle = 0 , \quad \mathrm{cov}(X_i, X_j) = \sigma^2 \min(i, j) + \frac{a^2}{2} \delta_{i,j} ,$$

which result in Eq. (2) for the MSD of $X$.

In the following, we shall discuss a few established diffusion coefficient estimators within the context of the process $X$, and then compare them performance-wise.
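To make the model concrete, the following sketch draws realizations of the process $X$ from Eqs. (3) and checks Eq. (2) by Monte Carlo. The function name `simulate_x`, the seed and the parameter values are illustrative assumptions; $Z_0 = 0$ is used, consistent with $\mathrm{cov}(Z_i, Z_j) = \sigma^2 \min(i, j)$.

```python
import random
from typing import List


def simulate_x(n_steps: int, sigma: float, a: float, rng: random.Random) -> List[float]:
    """Draw X_0, ..., X_N from Eqs. (3): X_i = Z_i + (a / sqrt(2)) * S_i."""
    z = 0.0  # Z_0 = 0
    xs = []
    for _ in range(n_steps + 1):
        xs.append(z + a / 2 ** 0.5 * rng.gauss(0.0, 1.0))  # Eq. (3b)
        z += sigma * rng.gauss(0.0, 1.0)                    # Eq. (3a)
    return xs


rng = random.Random(1)
sigma, a, lag = 1.0, 0.5, 3
squares = []
for _ in range(20000):
    x = simulate_x(lag, sigma, a, rng)
    squares.append((x[lag] - x[0]) ** 2)
mean_square = sum(squares) / len(squares)
expected = a ** 2 + lag * sigma ** 2  # Eq. (2)
print(mean_square, expected)
```

The sample mean of the squared displacement should agree with $a^2 + i\sigma^2$ up to statistical noise, with the offset $a^2$ coming entirely from the independent terms $S_i$.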

A. Ordinary least squares estimators

A linear fit to the MSD corresponds to minimizing the sum of squared residuals between the data [Eq. (1)] and the model [Eq. (2)], thus giving rise to a set of ordinary least squares (OLS) estimators

$$(a^2_{\mathrm{OLS}}, \sigma^2_{\mathrm{OLS}}) = \underset{a^2,\,\sigma^2 \ge 0}{\arg\min} \sum_{i=1}^{M} (\mathrm{MSD}_i - a^2 - i\sigma^2)^2 \tag{4}$$

for some $M \le N$. In non-pathological cases, the OLS problem [Eq. (4)] is analytically tractable and results in the following expressions for the estimators,

$$a^2_{\mathrm{OLS}} = \frac{\beta\gamma - \alpha\delta}{M\beta - \alpha^2} , \quad \sigma^2_{\mathrm{OLS}} = \frac{M\delta - \alpha\gamma}{M\beta - \alpha^2} , \tag{5}$$

$$\alpha = \frac{M(M+1)}{2} , \quad \beta = \alpha\,\frac{2M+1}{3} , \quad \gamma = \sum_{i=1}^{M} \mathrm{MSD}_i , \quad \delta = \sum_{i=1}^{M} i\,\mathrm{MSD}_i .$$

Although these estimators are unbiased, their precision drastically decreases for increased numbers $M$ of MSD values used in the fit. This counter-intuitive behavior can be read off their corresponding variances (see Appendix A) and, to circumvent this shortcoming, it was suggested in Ref. 4 to vary the value of $M$ to single out the estimators with the smallest variances. The associated standard deviation is typically much larger than ad hoc uncertainty estimates, such as those constructed from fits to hand-selected linear regimes in the data.¹⁵
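The closed-form expressions in Eq. (5) translate directly into code. The sketch below assumes that the MSD values are already available as a list with `msd_values[i - 1]` holding $\mathrm{MSD}_i$; the function name is hypothetical, and the constraint $a^2, \sigma^2 \ge 0$ of Eq. (4) is not enforced, so it covers only the non-pathological case.

```python
from typing import Sequence, Tuple


def ols_estimators(msd_values: Sequence[float]) -> Tuple[float, float]:
    """Return (a2_OLS, sigma2_OLS) from Eq. (5); msd_values[i - 1] is MSD_i."""
    m = len(msd_values)
    if m < 2:
        raise ValueError("need at least two MSD values")
    alpha = m * (m + 1) / 2                      # sum of i for i = 1..M
    beta = alpha * (2 * m + 1) / 3               # sum of i^2 for i = 1..M
    gamma = sum(msd_values)
    delta = sum(i * v for i, v in enumerate(msd_values, start=1))
    denom = m * beta - alpha ** 2
    a2 = (beta * gamma - alpha * delta) / denom
    sigma2 = (m * delta - alpha * gamma) / denom
    return a2, sigma2


# Noise-free MSD values with a^2 = 0.3 and sigma^2 = 2 are recovered exactly.
a2, sigma2 = ols_estimators([0.3 + 2.0 * i for i in range(1, 6)])
dt = 1.0  # illustrative recording-time interval
print(a2, sigma2, sigma2 / (2 * dt))  # last value is D = sigma^2 / (2 dt)
```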

B. Covariance-based estimators

In Ref. 12, an alternative to the OLS estimators was proposed, which circumvents the computation of $\mathrm{MSD}_i$ via Eq. (1) altogether. It makes use of the fact that

$$\langle (X_{n+1} - X_n)(X_n - X_{n-1}) \rangle = -\frac{a^2}{2}$$

must hold $\forall n$, as well as Eq. (2) evaluated at $i = 1$. This results in the unbiased covariance-based estimators (CVE)

$$a^2_{\mathrm{CVE}} = -2 \sum_{n=1}^{N-1} \frac{(X_{n+1} - X_n)(X_n - X_{n-1})}{N - 1} , \tag{6a}$$

$$\sigma^2_{\mathrm{CVE}} = \sum_{n=0}^{N-1} \frac{(X_{n+1} - X_n)^2}{N} - a^2_{\mathrm{CVE}} , \tag{6b}$$

whose variances are given by

$$\mathrm{var}(a^2_{\mathrm{CVE}}) = \frac{7a^4 + 8a^2\sigma^2 + 4\sigma^4}{N - 1} - \frac{2a^4}{(N - 1)^2} , \tag{7a}$$

$$\mathrm{var}(\sigma^2_{\mathrm{CVE}}) = 4\,\frac{a^2\sigma^2 + \sigma^4}{N - 1} + 2\,\frac{a^4 + \sigma^4}{N} + \frac{5a^4 + 4a^2\sigma^2}{N(N - 1)} - \frac{a^4}{(N - 1)^2} - \frac{a^4}{N^2(N - 1)^2} . \tag{7b}$$

While Eqs. (6) and (7) are computationally inexpensive to evaluate, they are only guaranteed to be practically optimal for signal-to-noise ratios $\sigma^2/a^2$ larger than one.¹²
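Equations (6) only need the increments of the time series, as the following sketch shows. The function name `cve_estimators` is an illustrative choice, and the input is assumed to be a one-dimensional series $X_0, \dots, X_N$ with $N \ge 2$.

```python
from typing import Sequence, Tuple


def cve_estimators(x: Sequence[float]) -> Tuple[float, float]:
    """Return (a2_CVE, sigma2_CVE) from Eqs. (6) for a series X_0, ..., X_N."""
    n = len(x) - 1  # N, the number of increments
    if n < 2:
        raise ValueError("need at least three observations")
    d = [x[k + 1] - x[k] for k in range(n)]  # d[k] = X_{k+1} - X_k
    # Eq. (6a): products of neighboring increments, n = 1..N-1.
    a2 = -2.0 * sum(d[k] * d[k - 1] for k in range(1, n)) / (n - 1)
    # Eq. (6b): mean squared increment minus the noise estimate.
    sigma2 = sum(v * v for v in d) / n - a2
    return a2, sigma2


print(cve_estimators([0.0, 1.0, 0.0, 1.0, 0.0]))
```

For a single short series, either estimate can come out negative, as in the zigzag example above; unbiasedness only holds on average over many realizations.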

C. Generalized least squares estimators

To improve upon the OLS scheme, one can take into account correlations between the residuals, which are collected in the covariance matrix of the $\mathrm{MSD}_i$-values with elements (see Appendix B)

$$\Sigma_{i,j}(a^2, \sigma^2) = \Sigma_{i,j}(0, \sigma^2) + \frac{a^4(1 + \delta_{i,j}) + 4a^2\sigma^2 \min(i, j)}{N - \min(i, j) + 1} + \frac{a^4 \max(0, N - i - j + 1)}{(N - i + 1)(N - j + 1)} \tag{8a}$$

for $i, j = 1, 2, \dots, M$ and

$$\Sigma_{i,j}(0, \sigma^2) = \frac{\sigma^4}{3} \left[ \frac{2\min(i, j)\left[1 + 3ij - \min(i, j)^2\right]}{N - \min(i, j) + 1} + \frac{\min(i, j)^2 - \min(i, j)^4}{(N - i + 1)(N - j + 1)} + \Theta(i + j - N - 2)\,\frac{(N + 1 - i - j)^4 - (N + 1 - i - j)^2}{(N - i + 1)(N - j + 1)} \right] , \tag{8b}$$

where $\Theta(z)$ denotes the Heaviside unit step function. Equation (4) is then replaced by

$$(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) = \underset{a^2,\,\sigma^2 \ge 0}{\arg\min}\ \chi^2(a^2, \sigma^2) , \tag{9a}$$

$$\chi^2(a^2, \sigma^2) = \sum_{i,j=1}^{M} (\mathrm{MSD}_i - a^2 - i\sigma^2)\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}})\,(\mathrm{MSD}_j - a^2 - j\sigma^2) , \tag{9b}$$

in a procedure commonly referred to as generalized least squares (GLS). Note that the covariance matrix is evaluated at $(a^2, \sigma^2) = (a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}})$, and thus remains fixed while $a^2$ and $\sigma^2$ are being varied. The a priori unknown $a^2_{\mathrm{GLS}}$ and $\sigma^2_{\mathrm{GLS}}$ are then found by requiring self-consistency.

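In general, there exist no closed-form expressions for the estimators of the GLS problem [Eq. (9)]. They satisfy the following transcendental coupled equations,

$$a^2_{\mathrm{GLS}} = \frac{\mu\nu - \lambda\xi}{\kappa\mu - \lambda^2} , \quad \sigma^2_{\mathrm{GLS}} = \frac{\kappa\xi - \lambda\nu}{\kappa\mu - \lambda^2} , \tag{10}$$

$$\kappa = \sum_{i,j=1}^{M} \Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) ,$$

$$\lambda = \sum_{i,j=1}^{M} i\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) ,$$

$$\mu = \sum_{i,j=1}^{M} i\,j\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) ,$$

$$\nu = \sum_{i,j=1}^{M} \mathrm{MSD}_i\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) ,$$

$$\xi = \sum_{i,j=1}^{M} i\,\mathrm{MSD}_j\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) ,$$

which have to be solved numerically, e.g., via iterative root-finding methods. In Appendix C, we provide one such algorithm, based on fixed-point iteration, which we have implemented in a Python data analysis script.¹⁶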
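A plain fixed-point iteration over Eqs. (8) and (10) might look as follows. This is a hedged sketch using only the standard library, not the algorithm of Appendix C or the authors' script: the names `msd_cov`, `invert` and `gls_estimators` are hypothetical, the starting point (the $M = 2$ closed form) and the stopping rule are assumptions, and clipping at zero is a crude stand-in for the constraint $a^2, \sigma^2 \ge 0$.

```python
from typing import List, Sequence, Tuple


def msd_cov(i: int, j: int, n: int, a2: float, s2: float) -> float:
    """Element Sigma_{i,j}(a^2, sigma^2) of Eq. (8) for a series X_0, ..., X_N."""
    lo = min(i, j)
    ni, nj, nlo = n - i + 1, n - j + 1, n - lo + 1
    brace = 2 * lo * (1 + 3 * i * j - lo ** 2) / nlo + (lo ** 2 - lo ** 4) / (ni * nj)
    if i + j - n - 2 >= 0:  # Heaviside term of Eq. (8b)
        k = n + 1 - i - j
        brace += (k ** 4 - k ** 2) / (ni * nj)
    cov = s2 ** 2 / 3 * brace
    cov += (a2 ** 2 * (1 + (i == j)) + 4 * a2 * s2 * lo) / nlo
    cov += a2 ** 2 * max(0, n - i - j + 1) / (ni * nj)
    return cov


def invert(mat: List[List[float]]) -> List[List[float]]:
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    m = len(mat)
    aug = [list(row) + [float(r == c) for c in range(m)] for r, row in enumerate(mat)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]


def gls_estimators(msd_values: Sequence[float], n: int,
                   tol: float = 1e-10, max_iter: int = 200) -> Tuple[float, float]:
    """Iterate Eq. (10) to self-consistency; msd_values[i - 1] is MSD_i."""
    m = len(msd_values)
    if not 2 <= m <= n:
        raise ValueError("need 2 <= M <= N")
    a2 = max(2 * msd_values[0] - msd_values[1], 0.0)  # start from Eq. (15)
    s2 = max(msd_values[1] - msd_values[0], 1e-12)
    for _ in range(max_iter):
        inv = invert([[msd_cov(i, j, n, a2, s2) for j in range(1, m + 1)]
                      for i in range(1, m + 1)])
        kappa = lam = mu = nu = xi = 0.0
        for i in range(1, m + 1):
            for j in range(1, m + 1):
                w = inv[i - 1][j - 1]
                kappa += w
                lam += i * w
                mu += i * j * w
                nu += msd_values[i - 1] * w
                xi += i * msd_values[j - 1] * w
        det = kappa * mu - lam ** 2
        new_a2 = max((mu * nu - lam * xi) / det, 0.0)
        new_s2 = max((kappa * xi - lam * nu) / det, 0.0)
        converged = abs(new_a2 - a2) + abs(new_s2 - s2) < tol
        a2, s2 = new_a2, new_s2
        if converged:
            break
    return a2, s2


print(gls_estimators([0.5 + 1.0 * i for i in range(1, 6)], n=100))
```

For noise-free input $\mathrm{MSD}_i = a^2 + i\sigma^2$ the weighted fit is exact for any positive-definite weight matrix, so the iteration returns the input parameters immediately; real data require several passes.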

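Analogous to $a^2_{\mathrm{OLS}}$ and $\sigma^2_{\mathrm{OLS}}$, the estimators in Eqs. (10) are asymptotically unbiased. Lower bounds for the variances of the GLS estimators can be inferred from the inverse of the Fisher information matrix $I$ associated with the likelihood

$$L(a^2, \sigma^2) \propto \exp\left(-\chi^2(a^2, \sigma^2)/2\right) ,$$

which is constructed from the $\chi^2$-statistic [Eq. (9b)]. In our case, the components of $I$ read

$$I_{1,1} = \frac{1}{2} \frac{\partial^2 \chi^2(a^2, \sigma^2)}{(\partial a^2)^2} = \kappa ,$$

$$I_{1,2} \equiv I_{2,1} = \frac{1}{2} \frac{\partial^2 \chi^2(a^2, \sigma^2)}{\partial a^2\,\partial \sigma^2} = \lambda ,$$

$$I_{2,2} = \frac{1}{2} \frac{\partial^2 \chi^2(a^2, \sigma^2)}{(\partial \sigma^2)^2} = \mu ,$$

and result in the inverse matrix

$$I^{-1} = \frac{1}{\kappa\mu - \lambda^2} \begin{pmatrix} \mu & -\lambda \\ -\lambda & \kappa \end{pmatrix} .$$

The variances of the GLS estimators are thus estimated from below by

$$\mathrm{var}(a^2_{\mathrm{GLS}}) \ge \frac{\mu}{\kappa\mu - \lambda^2} , \quad \mathrm{var}(\sigma^2_{\mathrm{GLS}}) \ge \frac{\kappa}{\kappa\mu - \lambda^2} . \tag{11}$$

Whenever the estimators are Gaussian, equality holds in Eqs. (11).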
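Given the sums $\kappa$, $\lambda$ and $\mu$ from Eq. (10), the bounds in Eqs. (11) are the diagonal of a 2×2 matrix inverse. The helper below is a small illustration with a hypothetical name; the numerical inputs in the usage line are arbitrary and not taken from any simulation.

```python
from typing import Tuple


def gls_variance_bounds(kappa: float, lam: float, mu: float) -> Tuple[float, float]:
    """Return the lower bounds (var(a2_GLS), var(sigma2_GLS)) of Eqs. (11)."""
    det = kappa * mu - lam ** 2  # determinant of the Fisher information matrix
    if det <= 0.0:
        raise ValueError("Fisher information matrix must be positive definite")
    return mu / det, kappa / det


print(gls_variance_bounds(2.0, 1.0, 3.0))
```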


D. Properties of the GLS estimators

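It can be shown (see Appendix D) that Eqs. (10) are of the form

$$a^2_{\mathrm{GLS}} = \frac{\nu - \mathrm{MSD}_1\,\lambda + O\!\left([\sigma^2/a^2]^{-2}\right)}{\kappa - \lambda + O\!\left([\sigma^2/a^2]^{-2}\right)} , \tag{12a}$$

$$\sigma^2_{\mathrm{GLS}} = \frac{\mathrm{MSD}_1\,\kappa - \nu + O\!\left([\sigma^2/a^2]^{-2}\right)}{\kappa - \lambda + O\!\left([\sigma^2/a^2]^{-2}\right)} , \tag{12b}$$

which implies, on the one hand, that

$$a^2_{\mathrm{GLS}} + \sigma^2_{\mathrm{GLS}} \overset{\sigma^2 \gg a^2}{\sim} \mathrm{MSD}_1 \tag{12c}$$

must hold for sufficiently large signal-to-noise ratios $\sigma^2/a^2$, and, on the other hand, that in the special case of $a \equiv 0$, the estimator becomes analytically tractable $\forall M, N$ and simply reads

$$\sigma^2_{\mathrm{GLS}}\big|_{a=0} \equiv \mathrm{MSD}_1 . \tag{13}$$

In this case, Eqs. (11) reduce to

$$\mathrm{var}\!\left(\sigma^2_{\mathrm{GLS}}\big|_{a=0}\right) \ge \mu^{-1}\big|_{a=0} = \frac{2}{N}\sigma^4 . \tag{14}$$

Another special case for which Eq. (12c) becomes exact is given by $M = 2$. Equations (10) are then analytically tractable and give rise to the closed-form estimators

$$a^2_{M=2} = 2\,\mathrm{MSD}_1 - \mathrm{MSD}_2 , \tag{15a}$$

$$\sigma^2_{M=2} = -\mathrm{MSD}_1 + \mathrm{MSD}_2 , \tag{15b}$$

with the following variances,

$$\mathrm{var}(a^2_{M=2}) = 4\Sigma_{1,1} - 4\Sigma_{1,2} + \Sigma_{2,2} , \tag{16a}$$

$$\mathrm{var}(\sigma^2_{M=2}) = \Sigma_{1,1} - 2\Sigma_{1,2} + \Sigma_{2,2} . \tag{16b}$$

These results coincide with the OLS estimators and their respective variances (given in Appendix A) for $M = 2$.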
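The agreement with the OLS estimators can be checked by hand. Setting $M = 2$ in Eq. (5) gives the worked steps below; the variances in Eqs. (16) then follow from the bilinearity of the covariance applied to these linear combinations of $\mathrm{MSD}_1$ and $\mathrm{MSD}_2$.

```latex
\begin{align*}
\alpha &= 3 , \qquad \beta = 5 , \qquad M\beta - \alpha^2 = 1 , \\
\gamma &= \mathrm{MSD}_1 + \mathrm{MSD}_2 , \qquad \delta = \mathrm{MSD}_1 + 2\,\mathrm{MSD}_2 , \\
a^2_{\mathrm{OLS}} &= \beta\gamma - \alpha\delta
  = 5(\mathrm{MSD}_1 + \mathrm{MSD}_2) - 3(\mathrm{MSD}_1 + 2\,\mathrm{MSD}_2)
  = 2\,\mathrm{MSD}_1 - \mathrm{MSD}_2 , \\
\sigma^2_{\mathrm{OLS}} &= M\delta - \alpha\gamma
  = 2(\mathrm{MSD}_1 + 2\,\mathrm{MSD}_2) - 3(\mathrm{MSD}_1 + \mathrm{MSD}_2)
  = \mathrm{MSD}_2 - \mathrm{MSD}_1 .
\end{align*}
```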

E. Application to three-dimensional time series

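Up until now, we have solely focused on one-dimensional time series, while two- and three-dimensional particle trajectories are recorded in experiments and MD simulations. For a three-dimensional time series, we can decompose the associated MSD in the following way,

$$\mathrm{MSD}^{\mathrm{3D}}_i = \mathrm{MSD}_{x,i} + \mathrm{MSD}_{y,i} + \mathrm{MSD}_{z,i} , \tag{17}$$

where the $\mathrm{MSD}_{d,i}$-values are obtained by evaluating Eq. (1) using the respective one-dimensional time series along each spatial dimension $d \in \{x, y, z\}$. If we furthermore model the stochastic dynamics along every Cartesian coordinate via our minimal diffusion process $X$, then the expectation value of Eq. (17) is simply given by

$$\langle \mathrm{MSD}^{\mathrm{3D}}_i \rangle = a_x^2 + a_y^2 + a_z^2 + i(\sigma_x^2 + \sigma_y^2 + \sigma_z^2) \equiv a^2_{\mathrm{3D}} + i\sigma^2_{\mathrm{3D}} ,$$

with $D = \sigma^2_{\mathrm{3D}}/6\Delta t$. The elements of the associated covariance matrix read

$$\Sigma^{\mathrm{3D}}_{i,j} = \Sigma_{i,j}(a_x^2, \sigma_x^2) + \Sigma_{i,j}(a_y^2, \sigma_y^2) + \Sigma_{i,j}(a_z^2, \sigma_z^2) .$$

Here, we have assumed no correlation between dimensions, i.e., $\langle \mathrm{MSD}_{d_1,i}\,\mathrm{MSD}_{d_2,j} \rangle = \langle \mathrm{MSD}_{d_1,i} \rangle \langle \mathrm{MSD}_{d_2,j} \rangle$. This linear behavior propagates all the way up to the estimators $\theta^2_{\mathrm{EST}} \in \{a^2_{\mathrm{EST}}, \sigma^2_{\mathrm{EST}}\}$ for $\mathrm{EST} \in \{\mathrm{OLS}, \mathrm{GLS}, \mathrm{CVE}\}$ and their respective variances, resulting in

$$\theta^2_{\mathrm{3D,EST}} = \theta^2_{x,\mathrm{EST}} + \theta^2_{y,\mathrm{EST}} + \theta^2_{z,\mathrm{EST}} ,$$

$$\mathrm{var}(\theta^2_{\mathrm{3D,EST}}) = \mathrm{var}(\theta^2_{x,\mathrm{EST}}) + \mathrm{var}(\theta^2_{y,\mathrm{EST}}) + \mathrm{var}(\theta^2_{z,\mathrm{EST}}) .$$

The latter relation follows from the fact that $\langle \theta^2_{d_1,\mathrm{EST}}\,\theta^2_{d_2,\mathrm{EST}} \rangle = \langle \theta^2_{d_1,\mathrm{EST}} \rangle \langle \theta^2_{d_2,\mathrm{EST}} \rangle = \theta^2_{d_1} \theta^2_{d_2}$ holds for all the estimators considered in this paper. Finally, a $\chi^2$-statistic can be calculated for three-dimensional time series by replacing $\mathrm{MSD}_i$, $a^2$, $\sigma^2$ and $\Sigma_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}})$ in Eq. (9b) with their three-dimensional analogues, giving

$$\chi^2 = 3 \sum_{i,j=1}^{M} (\mathrm{MSD}^{\mathrm{3D}}_i - a^2_{\mathrm{3D}} - i\sigma^2_{\mathrm{3D}})\,\Sigma^{-1}_{i,j}(a^2_{\mathrm{3D,GLS}}, \sigma^2_{\mathrm{3D,GLS}})\,(\mathrm{MSD}^{\mathrm{3D}}_j - a^2_{\mathrm{3D}} - j\sigma^2_{\mathrm{3D}}) . \tag{18}$$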
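The decomposition in Eq. (17) and the relation $D = \sigma^2_{\mathrm{3D}}/6\Delta t$ can be sketched as follows. For brevity the example uses the closed-form $M = 2$ estimator of Eq. (15b), which is linear in the MSD values and can therefore be applied to $\mathrm{MSD}^{\mathrm{3D}}$ directly. The function names and the trajectory layout (a list of `(x, y, z)` tuples) are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple


def msd_1d(x: Sequence[float], lag: int) -> float:
    """MSD at one lag along a single coordinate, Eq. (1)."""
    terms = [(x[n + lag] - x[n]) ** 2 for n in range(len(x) - lag)]
    return sum(terms) / len(terms)


def diffusion_coefficient_3d(traj: Sequence[Tuple[float, float, float]],
                             dt: float) -> float:
    """Estimate D from a 3D trajectory using MSD^3D_1 and MSD^3D_2."""
    msd_3d: List[float] = []
    for lag in (1, 2):
        per_dim = [msd_1d([p[d] for p in traj], lag) for d in range(3)]
        msd_3d.append(sum(per_dim))        # Eq. (17)
    sigma2_3d = msd_3d[1] - msd_3d[0]      # Eq. (15b) applied to MSD^3D
    return sigma2_3d / (6.0 * dt)          # D = sigma^2_3D / (6 dt)


# Deterministic toy trajectory, used only to check the arithmetic.
toy = [(float(n), 2.0 * n, 0.0) for n in range(5)]
print(diffusion_coefficient_3d(toy, dt=0.5))
```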

F. Comparing the estimators

Because all the above-mentioned estimators are unbiased, we must compare their respective variances to rank them. Under the assumption that all four estimator pairs predict roughly the same values for $a^2$ and $\sigma^2$, which is realized in practice for $N \gg 1$, we can compare their corresponding variances as functions of the signal-to-noise ratio $\sigma^2/a^2$. In general, the GLS estimators outperform their counterparts at low and intermediate signal-to-noise ratios, as visualized in Fig. 1a. Only for $\sigma^2/a^2 \gg 1$ do the CVE estimators eventually catch up, as seen by their intersection point with the closed-form GLS estimators. For $M = 2$, the estimator variances cross at

$$\left[\frac{\sigma^2}{a^2}\right]_{\mathrm{CVE\text{-}}M=2} = \sqrt{\frac{N - 1}{2N - 4}} . \tag{19}$$

For $M > 2$, the crossing point must be found numerically and behaves roughly like (see Fig. 1b)

$$\left[\frac{\sigma^2}{a^2}\right]_{\mathrm{CVE\text{-}}M>2} \approx \sqrt{N} .$$

Therefore, for $M > 2$, the GLS estimators tend to be superior already for comparatively short trajectory lengths.
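Equation (19) can be verified numerically by evaluating Eq. (7b) and Eq. (16b), with the covariance elements of Eq. (8), at the predicted crossing point. The sketch below does this for $N = 100$ and $a^2 = 1$; all function names are hypothetical.

```python
import math


def msd_cov(i: int, j: int, n: int, a2: float, s2: float) -> float:
    """Element Sigma_{i,j}(a^2, sigma^2) of Eq. (8)."""
    lo = min(i, j)
    ni, nj, nlo = n - i + 1, n - j + 1, n - lo + 1
    brace = 2 * lo * (1 + 3 * i * j - lo ** 2) / nlo + (lo ** 2 - lo ** 4) / (ni * nj)
    if i + j - n - 2 >= 0:  # Heaviside term of Eq. (8b)
        k = n + 1 - i - j
        brace += (k ** 4 - k ** 2) / (ni * nj)
    cov = s2 ** 2 / 3 * brace
    cov += (a2 ** 2 * (1 + (i == j)) + 4 * a2 * s2 * lo) / nlo
    cov += a2 ** 2 * max(0, n - i - j + 1) / (ni * nj)
    return cov


def var_sigma2_cve(n: int, a2: float, s2: float) -> float:
    """Variance of sigma2_CVE, Eq. (7b)."""
    a4, s4 = a2 ** 2, s2 ** 2
    return (4 * (a2 * s2 + s4) / (n - 1) + 2 * (a4 + s4) / n
            + (5 * a4 + 4 * a2 * s2) / (n * (n - 1))
            - a4 / (n - 1) ** 2 - a4 / (n ** 2 * (n - 1) ** 2))


def var_sigma2_m2(n: int, a2: float, s2: float) -> float:
    """Variance of the closed-form M = 2 estimator, Eq. (16b)."""
    return (msd_cov(1, 1, n, a2, s2) - 2 * msd_cov(1, 2, n, a2, s2)
            + msd_cov(2, 2, n, a2, s2))


def crossing_ratio(n: int) -> float:
    """Signal-to-noise ratio at which the two variances cross, Eq. (19)."""
    return math.sqrt((n - 1) / (2 * n - 4))


n_points = 100
ratio = crossing_ratio(n_points)
print(ratio, var_sigma2_cve(n_points, 1.0, ratio), var_sigma2_m2(n_points, 1.0, ratio))
```

Below the crossing ratio the $M = 2$ estimator has the smaller variance, above it the CVE estimator does, in line with the threshold interpretation of Fig. 1b.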

III. STATISTICAL TESTING OF MODEL ASSUMPTIONS AND QUALITY OF FIT

Another advantage of the GLS estimators is their natural compatibility with the quality factor that is introduced below and serves as a measure for the quality of diffusion coefficient estimates. This section also provides a test to probe for possible anomalous diffusion in the long-time limit.

[Figure 1: (a) estimator variances versus signal-to-noise ratio $\sigma^2/a^2$ for CVE, G/OLS with $M = 2$, and GLS and OLS with $M = 3$, $10$ and $N$; (b) crossing point $[\sigma^2/a^2]_{\mathrm{CVE\text{-}GLS}}$ versus number of points $N$, compared with $\sqrt{N}$.]

Figure 1. Quantitative comparison of the estimators. (a) Estimator variances as functions of the signal-to-noise ratio for $N = 100$. For $\sigma^2/a^2 > 1$, the CVE variance [Eq. (7b), solid blue line] is almost indistinguishable from the general GLS solution [Eq. (11), dash-dotted green lines] and the closed-form special solution for $M = 2$ [Eq. (16b), dashed red line], while the OLS variance [Eq. (A1b), dash-double-dotted orange lines] gradually worsens with increasing $M$. In turn, at low signal-to-noise ratios and $M \ll N$, the OLS estimators have comparable uncertainties to their GLS counterparts. (b) Signal-to-noise threshold below which the GLS estimator outperforms the CVE estimator as a function of the number of points $N$ in the time series. Unlike $[\sigma^2/a^2]_{\mathrm{CVE\text{-}}M=2}$, given by Eq. (19) (red dashed line), the CVE-GLS intersection for $M > 2$ is only numerically tractable (black open symbols). Yet, it turns out that its $N$-dependence is, for sufficiently large $M$, remarkably well captured by a square root function (solid blue line).

A. Determining the optimal length of the recording-time interval

If non-diffusive short-time dynamics are present in a time series, their influence can be suppressed via sub-sampling, where intermediate observations from the original time series $\{X_0, X_1, \dots, X_{N-1}, X_N\}$ are removed to generate a set of shorter time series

$$\begin{aligned} &\{X_0, X_1, X_2, \dots\} , \\ &\{X_0, X_2, X_4, \dots\} , \\ &\{X_0, X_3, X_6, \dots\} , \\ &\qquad\vdots \\ &\{X_0, X_n, X_{2n}, \dots\} , \end{aligned}$$

with respective recording-time interval lengths $\Delta t_1, \Delta t_2, \Delta t_3, \dots, \Delta t_n$, where $\Delta t_n = n\Delta t_1$. Alternatively, one can choose different starting points to sub-sample the time series; e.g., it is just as valid to use $\{X_1, X_3, X_5, \dots\}$ instead of $\{X_0, X_2, X_4, \dots\}$. While a longer interval will generally result in a more linear MSD curve, at least for $n \ll N$, the shortened length of the new time series adversely affects the uncertainty of the diffusion coefficient estimate. The estimator uncertainty can be improved somewhat by including in the analysis all $n - 1$ interspersed time series constructed from the discarded points between $X_i$ and $X_{i+n}$. However, these additional time series are not fully independent, so their inclusion has only a modest effect and is therefore not considered in what follows.
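Sub-sampling as described above amounts to strided slicing. The helper below is a minimal sketch with a hypothetical name; `start` selects one of the interspersed series mentioned in the text.

```python
from typing import List, Sequence


def subsample(x: Sequence[float], n: int, start: int = 0) -> List[float]:
    """Keep every n-th observation, {X_start, X_(start+n), X_(start+2n), ...}."""
    if n < 1 or not 0 <= start < n:
        raise ValueError("need n >= 1 and 0 <= start < n")
    return list(x[start::n])


series = [float(k) for k in range(10)]
print(subsample(series, 3))           # recording-time interval 3 * dt_1
print(subsample(series, 2, start=1))  # interspersed series {X_1, X_3, ...}
```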

We balance the competition between systematic and statistical errors at short and long $\Delta t_n$, respectively, by exploiting the fact that the GLS estimators originate from the $\chi^2$-statistic in Eq. (9b). For a sample of GLS estimates $\{a^{2\,(k)}_{\mathrm{GLS}}, \sigma^{2\,(k)}_{\mathrm{GLS}}\}_{k=1,2,\dots,N_s}$, the corresponding $\chi^2$-values should follow a $\chi^2$-distribution with $M - 2$ degrees of freedom whenever the residuals

$$\delta\mathrm{MSD}^{(k)}_i = \sum_{j=1}^{M} \left(\mathrm{MSD}^{(k)}_j - a^{2\,(k)}_{\mathrm{GLS}} - j\sigma^{2\,(k)}_{\mathrm{GLS}}\right) \Sigma^{-1/2}_{i,j}(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}}) \tag{20}$$

are normally distributed, with $\Sigma^{-1/2}_{i,j}$ denoting the elements of the inverse square root of the covariance matrix [Eq. (8)] evaluated at $(a^2, \sigma^2) = (a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}})$. If there are nonlinearities present at short times that skew the distribution of the residuals, we expect atypical $\chi^2$-values and the recording-time interval has to be lengthened. However, instead of focusing on the broadly distributed $\chi^2$-values, we consider the associated quality factor

$$Q(\Delta t_n, M) = 1 - \frac{\gamma(M/2 - 1, \chi^2/2)}{\Gamma(M/2 - 1)} , \tag{21}$$

which coincides with the complementary cumulative distribution function (CDF) of the $\chi^2$-statistic, i.e., one minus the CDF, and therefore only takes values between zero and one. Here,

$$\gamma(a, z) = \int_0^z \mathrm{d}x\, x^{a-1} e^{-x} , \quad \Gamma(a) = \lim_{z \to \infty} \gamma(a, z) ,$$

denote the lower incomplete and ordinary $\Gamma$-functions, respectively. The quality factor can be seen as the probability to observe a $\chi^2$-value greater than $\chi^2(a^2_{\mathrm{GLS}}, \sigma^2_{\mathrm{GLS}})$.
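The quality factor of Eq. (21) is one minus the regularized lower incomplete gamma function. The sketch below evaluates it with a power series using only the standard library; the function names are hypothetical, and the series is an assumption suited to moderate $\chi^2$-values (for very large arguments a library routine such as `scipy.special.gammaincc` would be the more robust choice).

```python
import math


def reg_lower_gamma(s: float, x: float, max_terms: int = 1000) -> float:
    """Regularized lower incomplete gamma function gamma(s, x) / Gamma(s)."""
    if x <= 0.0:
        return 0.0
    # Series: gamma(s, x) = x^s e^(-x) * sum_k x^k / (s (s+1) ... (s+k)).
    term = 1.0 / s
    total = term
    for k in range(1, max_terms):
        term *= x / (s + k)
        total += term
        if term < 1e-16 * total:
            break
    return math.exp(s * math.log(x) - x - math.lgamma(s)) * total


def quality_factor(chi2: float, m: int) -> float:
    """Q of Eq. (21) for a chi^2 value obtained from M > 2 MSD values."""
    if m <= 2:
        raise ValueError("need M > 2 so that M - 2 degrees of freedom remain")
    return 1.0 - reg_lower_gamma(m / 2 - 1, chi2 / 2)


# For M = 4 (two degrees of freedom) Q reduces to exp(-chi2 / 2).
print(quality_factor(2.0, 4), math.exp(-1.0))
```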


Ideally, the $Q^{(k)}$-values, computed from the estimates $a^{2\,(k)}_{\mathrm{GLS}}$ and $\sigma^{2\,(k)}_{\mathrm{GLS}}$, are distributed uniformly on $[0, 1]$ with a sample average $\overline{Q} \approx 1/2$. Here, $\overline{O}$ denotes the arithmetic mean of a finite sample of observations $\{O^{(k)}\}_{k=1,2,\dots,N_s}$, namely

$$\overline{O} = \frac{1}{N_s} \sum_{k=1}^{N_s} O^{(k)} . \tag{22}$$

If the average quality factor is significantly lower than 1/2, certain features of the data are not well captured by our diffusive model, thus requiring us to lengthen the recording-time interval. Conversely, $\overline{Q} > 1/2$ hints at overfitting. The optimal interval length $\Delta t_{\mathrm{opt}}$ marks the instance where the quality factor first reaches $\overline{Q} \approx 1/2$ and the residuals in Eq. (20) follow a normal distribution. If the time series behind the $Q^{(k)}$-values are sufficiently long, then the quality factor should remain approximately constant well beyond $\Delta t_{\mathrm{opt}}$, before slowly drifting off towards unity.

As an alternative to the above procedure, one could use only parts of the data, e.g., the time series along $x$ and $y$, to evaluate the GLS estimators. The quantities $\chi^2$ and $Q$ would then serve as tools to measure how well the estimators fit the remaining data. Such use of the above concepts would only require minor alterations to Eqs. (9b) and (21), in particular changing the number of degrees of freedom from $M - 2$ to $M$.

B. Comparing short-time predictions to observed dynamics at long times

A common concern is that regular diffusion cannot fully account for the dynamics in complex molecular systems, which must then be described using more elaborate models that incorporate memory or trapping effects. To make sure our short-time diffusion coefficient estimate $D(\Delta t_n = \Delta t_{\mathrm{opt}})$ can correctly describe the actual dynamics of our system at long times, we take advantage of the fact that typical MD trajectories are long compared to the time range $M\Delta t_{\mathrm{opt}}$ used to fit the diffusion coefficient. We therefore compare the statistics of the relative endpoints $\Delta X^{(k)} = X^{(k)}_N - X^{(k)}_0$ of each considered time series, characterized by the empirical (cumulative) distribution function (eCDF), to the CDF

$$F(\Delta X) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{\Delta X - \overline{\Delta X}}{\sqrt{2a^2(\Delta t_{\mathrm{opt}}) + 4D(\Delta t_{\mathrm{opt}})N\Delta t_1}}\right) \tag{23}$$

of a diffusive process using the Kolmogorov-Smirnov (KS) statistic

$$S = \max_{1 \le k \le N_s} \left( \frac{k}{N_s} - F(\Delta X^{(k)}) ,\ F(\Delta X^{(k)}) - \frac{k - 1}{N_s} \right) . \tag{24}$$

Here, $\mathrm{erf}(z)$ denotes the error function. The discrepancy test above measures the absolute size of the largest difference between the two distribution functions, and the test statistic $S$ can be used to compute a corresponding p-value, i.e., the probability that, if the endpoints had actually been drawn from $F'(\Delta X)$, the resulting KS statistic would have been greater than or equal to $S$. Alternatively, one can vary $D$ in Eq. (23) to find the diffusion coefficient that minimizes Eq. (24) and thus best describes the observed long-time dynamics. This is possible because $a^2(\Delta t_{\mathrm{opt}}) \ll 2D(\Delta t_{\mathrm{opt}})N\Delta t_1$ for $N \gg 1$, so $a^2$ can either be neglected or kept constant at $a^2 = a^2(\Delta t_{\mathrm{opt}})$ while $D$ is varied.

In the following, we refer to the use of Eqs. (23) and (24) as a KS test.
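Equations (23) and (24) can be evaluated with the standard library alone, as in the following sketch. The function name and argument order are hypothetical; the endpoints are assumed to be one-dimensional, and they are sorted in ascending order before the index $k$ is assigned, which is the usual convention for the KS statistic and is taken here as an assumption. The conversion of $S$ into a p-value is not included.

```python
import math
from typing import Sequence


def ks_statistic(endpoints: Sequence[float], a2: float, d: float,
                 n_steps: int, dt1: float) -> float:
    """KS statistic S of Eq. (24) against the diffusive CDF of Eq. (23)."""
    ns = len(endpoints)
    mean = sum(endpoints) / ns                         # sample mean, Eq. (22)
    scale = math.sqrt(2 * a2 + 4 * d * n_steps * dt1)  # sqrt(2) * std. deviation
    s = 0.0
    for k, dx in enumerate(sorted(endpoints), start=1):
        f = 0.5 + 0.5 * math.erf((dx - mean) / scale)  # Eq. (23)
        s = max(s, k / ns - f, f - (k - 1) / ns)
    return s


# Two endpoints at +/- one standard deviation of the model distribution.
print(ks_statistic([-1.0, 1.0], a2=0.0, d=0.5, n_steps=1, dt1=1.0))
```

Scanning `d` over a grid and keeping the value with the smallest `S` gives the alternative long-time estimate described above.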

IV. RESULTS AND DISCUSSION

A. MD simulation of TIP4P-D water

We tested the applicability of the GLS estimator for the diffusion coefficient by performing a 1 µs MD simulation of water at ambient conditions. We placed 4139 TIP4P-D water molecules¹³ in a cubic simulation box with an approximate edge length of 5 nm and periodic boundary conditions. We recorded the coordinates of each atom in the system at intervals of $\Delta t_1 = 1$ ps to capture possible non-diffusive short-time dynamics. The simulation was run at 300 K¹⁷ and 1 bar¹⁸ using Gromacs/2018.6¹⁹ with temperature- and pressure-coupling constants $\tau_T = 1$ ps and $\tau_p = 5$ ps, respectively. After an initial 100 ps equilibration period in the $NVT$-ensemble, and a consecutive 5 ns run in the $NpT$-ensemble, the production run ensued at the same temperature and pressure. We made use of the leap-frog integrator with a 2 fs time step. All atomic bonds to hydrogens were treated with the LINCS constraint algorithm.²⁰

For data evaluation, we made use of an improved scheme¹⁴ to “unwrap” the three-dimensional center-of-mass trajectories of every water molecule out of the simulation box, which we then split into three independent time series, one for every Cartesian coordinate. These one-dimensional time series were used to estimate the MSD along the spatial dimensions $d \in \{x, y, z\}$ via Eq. (1), after which the estimates were summed up to give the total MSD for every molecule $k = 1, 2, \dots, 4139$ according to Eq. (17). All trajectories displayed a clear non-diffusive initial regime, which became less pronounced as the length of the recording-time interval was increased, followed by a more-or-less linear growth at longer times.

B. Establishing ground truth for TIP4P-D water

To establish a reference value Dref for the diffusioncoefficient to compare our results to, we analyzed a sub-sampled version of our time series with ∆tn = 10 ps,using the OLS estimators for M = 20. The length of therecording-time interval was chosen such that it was longenough to suppress non-diffusive short-time effects, while

Page 7: ad hoc arXiv:2003.09193v2 [physics.comp-ph] 19 Jun 2020

7

2.05

2.1

2.15d

iff.

coeff

.D

[nm

2n

s−1]

(a)DGLS, M = 20 Dref

δDempirical

GLSδDref

δDpredicted

GLS

systematic errordominated

statistical errordominated

∆topt 30 50 70 90

0.2

0.4

0.6

0.8

time-step size ∆tn [ps]

qu

alit

yfa

ctor

Q(∆

t n,M

)

Q(∆tn,M = 20) δQ(b)

(c)DGLS, n = 10 Dref

δDempirical

GLSδDref

δDpredicted

GLS

4 8 12 16 20 24

MSD-values M

Q(∆t10 = 10 ps,M) δQ(d)

Figure 2. Translational diffusion coefficient estimation and quality factor analysis of TIP4P-D water simulation data. (a) Usingthe GLS estimators [Eqs. (10)], in combination with Eqs. (25), we computed the average diffusion coefficient at differentrecording-time interval lengths ∆tn with M = 20 fixed (solid green line). Our results are together with the ground-truth valueDref = 2.08 nm2 ns−1 (dashed gray line) and its uncertainty δDref = ±0.02 nm2 ns−1 (gray shaded area) for comparison, whichwere determined at a fixed interval length of ∆t10 = 10 ps and therefore do not vary with ∆tn. The uncertainty of the GLSestimates (green shaded area) was determined empirically via Eq. (25c) and is nicely reproduced by our analytic prediction[Eq. (26), dotted black lines] for virtually all interval lengths. (b) A lower bound ∆topt to the range of desirable interval lengthswas determined from the instance, where the sample average of the quality factor [Eqs. (21) and (22), solid blue line] convergesto Q ≈ 1/2, namely at ∆topt = 10 ps. The shaded area represents one sample standard deviation of uncertainty. (c–d) Same as(a) and (b), respectively, except that here the length of the recording-time interval was kept fixed at ∆topt = 10 ps, while M wasvaried. The estimated diffusion coefficients are independent of the choice of M .

minimizing its impact on the length $N + 1$ of the time series and, in turn, the variance. For every molecule $k$, we obtained an estimate

$$
D^{(k)}_{d,\mathrm{EST}} = \frac{{\sigma^2_{d,\mathrm{EST}}}^{(k)}}{2\Delta t_n} \tag{25a}
$$

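for the translational diffusion coefficient along each spatial dimension $d$, which we then used to calculate the sample averages $\overline{D}_{d,\mathrm{OLS}}$ via Eq. (22) with $N_s = 4139$. Unsurprisingly, we found perfect agreement of the $\overline{D}_{d,\mathrm{OLS}}$-values [$\overline{D}_{x,\mathrm{OLS}} = \overline{D}_{y,\mathrm{OLS}} = \overline{D}_{z,\mathrm{OLS}} = (2.08 \pm 0.04)$ nm$^2$ ns$^{-1}$], thus confirming that the overall diffusive motion of the water molecules was isotropic with a scalar translational diffusion coefficient

$$
\overline{D}_\mathrm{EST} = \frac{\overline{D}_{x,\mathrm{EST}} + \overline{D}_{y,\mathrm{EST}} + \overline{D}_{z,\mathrm{EST}}}{3} , \tag{25b}
$$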

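whose uncertainty was determined from the sample standard deviation as follows,

$$
\delta D^\mathrm{empirical}_\mathrm{EST} = \frac{1}{3} \sqrt{ \sum_{k=1}^{N_s} \frac{ \left( D^{(k)}_{x,\mathrm{EST}} + D^{(k)}_{y,\mathrm{EST}} + D^{(k)}_{z,\mathrm{EST}} - 3 \overline{D}_\mathrm{EST} \right)^2 }{ N_s - 1 } } . \tag{25c}
$$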
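As a minimal sketch of how Eqs. (25a)–(25c) combine, the following Python snippet turns per-molecule step-variance estimates into a mean diffusion coefficient and its empirical uncertainty. The function name `diffusion_statistics` and the input layout (one `(x, y, z)` triple of $\sigma^2$ estimates per molecule) are assumptions made for illustration and are not taken from the authors' analysis script.

```python
import math


def diffusion_statistics(sigma2, dt):
    """Sample mean and empirical uncertainty of D, following Eqs. (25a)-(25c).

    sigma2 : list of (sx, sy, sz) step-variance estimates, one triple per molecule
    dt     : recording-time interval length (D comes out in length^2 / time)
    """
    # Eq. (25a): per-molecule, per-dimension diffusion coefficients
    d_kd = [[s / (2.0 * dt) for s in triple] for triple in sigma2]
    n_s = len(d_kd)
    # Sample averages along x, y and z, then their mean, Eq. (25b)
    d_dim = [sum(row[d] for row in d_kd) / n_s for d in range(3)]
    d_mean = sum(d_dim) / 3.0
    # Eq. (25c): sample standard deviation of the per-molecule averages
    ss = sum((sum(row) - 3.0 * d_mean) ** 2 for row in d_kd)
    d_err = math.sqrt(ss / (n_s - 1)) / 3.0
    return d_mean, d_err
```

With $\sigma^2$ given in nm$^2$ and `dt` in ns, the returned values are in nm$^2$ ns$^{-1}$.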

Evaluating these expressions with the OLS estimates $\{ {a^2_{d,\mathrm{OLS}}}^{(k)}, {\sigma^2_{d,\mathrm{OLS}}}^{(k)} \}^{d \in \{x,y,z\}}_{k = 1, 2, \dots, 4139}$ finally resulted in the reference value $D_\mathrm{ref} = (2.08 \pm 0.02)$ nm$^2$ ns$^{-1}$, which is well within the range of previously reported values for TIP4P-D water.¹³ It is slightly lower than the experimentally determined value of 2.3 nm$^2$ ns$^{-1}$, measured at 298 K.²¹ The main reasons for this discrepancy are, on the one hand, that the TIP4P-D model slightly overestimates the viscosity of water²² and, on the other hand, that finite-size effects in simulations using periodic boundary conditions further reduce the diffusion coefficient. Although closed-form corrections have been developed both for three-dimensional in-bulk diffusion²³ and for quasi two-dimensional diffusion within membranes,²⁴ we will not make use of them here, because finite-size effects are systematic and therefore not estimator-specific. The in-bulk correction is, however, accounted for in our Python data analysis script.¹⁶

C. Diffusion coefficient estimation and quality factor analysis for TIP4P-D water

Analogous to the previous section, we considered the center-of-mass trajectories of every molecule $k$ along each spatial dimension $d$, for which we computed the associated MSDs via Eq. (1). These were substituted into Eqs. (10) to obtain the GLS estimates ${a^2_{d,\mathrm{GLS}}}^{(k)}$ and ${\sigma^2_{d,\mathrm{GLS}}}^{(k)}$ that, in turn, were used to calculate the sample average and uncertainty of the diffusion coefficient via Eqs. (25). This procedure was repeated for different sub-sampling interval lengths $\Delta t_n$ in the range of 1 ps to 100 ps, where $\Delta t_1 = 1$ ps and $M = 20$ were kept fixed. Our results are summarized in Fig. 2a, where our GLS estimate for the diffusion coefficient, $\overline{D}_\mathrm{GLS}$, is plotted next to the reference value $D_\mathrm{ref}$. Figure 2a also compares the empirically estimated uncertainty [Eq. (25c)] to the predicted uncertainty

$$
\delta D^\mathrm{prediction}_\mathrm{GLS} = \sqrt{ \sum_{d = x,y,z} \frac{ \mathrm{var}(\sigma^2_{d,\mathrm{GLS}}) }{ (6 \Delta t_n)^2 } } , \tag{26}
$$

which was evaluated with the help of Eqs. (11). The empirical and analytical estimates of the uncertainties, Eqs. (25c) and (26), respectively, coincide nearly perfectly for most interval lengths.
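The sub-sampling and MSD evaluation described at the start of this section can be sketched in a few lines of Python. The helper names `subsample` and `msd` are hypothetical; the MSD follows Eq. (1) for a one-dimensional time series $X_0, \dots, X_N$, and sub-sampling is assumed here to mean keeping every $n$-th frame.

```python
def subsample(x, n):
    """Keep every n-th frame, so the interval length becomes n * dt_1."""
    return x[::n]


def msd(x, m_max):
    """MSD_i for lags i = 1..m_max of a 1D time series, following Eq. (1)."""
    n_last = len(x) - 1  # N in Eq. (1); the series holds N + 1 points
    values = []
    for i in range(1, m_max + 1):
        # All N - i + 1 overlapping displacements at lag i
        sq = [(x[n + i] - x[n]) ** 2 for n in range(n_last - i + 1)]
        values.append(sum(sq) / (n_last - i + 1))
    return values
```

For example, `msd(subsample(x, 10), 20)` would give the $M = 20$ MSD values at an interval length of $10\,\Delta t_1$ for one molecule and one spatial dimension.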

A lower bound $\Delta t_\mathrm{opt}$ to the range of viable interval lengths, where the trade-off between estimate quality and uncertainty is balanced, was determined via the quality factor [Eq. (21)], which we evaluated for all molecules present in the system using different recording-time interval lengths $\Delta t_n$. We thereby calculated a corresponding $\chi^2$-value for each molecule $k$ according to Eq. (18). Figure 2b visualizes the sample average and uncertainty of our $Q^{(k)}$-values, where the latter was estimated via the sample standard deviation. Because the sample average converged to $Q \approx 1/2$ around $n = 10$ and did not vary significantly at higher $n$, the lower bound was chosen to be $\Delta t_\mathrm{opt} = 10$ ps. At this interval length, the average GLS estimate for the diffusion coefficient may not be fully converged, but $D_\mathrm{ref}$ and $\overline{D}_\mathrm{GLS}$ are clearly within each other's uncertainty intervals. We verified that the chosen bound was independent of our choice of $M$ by repeating our analysis with $n = 10$ fixed and varying $M$ (see Figs. 2c–d). Indeed, at the chosen interval length, both

[Figure 3, panels (a) and (b). (a) Diffusion coefficient $D$ [nm$^2$ ns$^{-1}$] for the estimators CVE, G/OLS, GLS and OLS. (b) Relative frequency versus scaled residuals $(D^{(k)}_\mathrm{EST} - \overline{D}_\mathrm{EST}) / \mathrm{var}(D^{(k)}_\mathrm{EST})^{1/2}$ for the same estimators, compared with $\mathcal{N}(0, 1)$.]

Figure 3. Comparison of estimator performances when applied to TIP4P-D water simulation data, recorded at $\Delta t_n = 10$ ps, with $M = 20$. (a) Diffusion coefficient predictions by the four estimators discussed in this paper. For each water molecule, we evaluated the OLS, CVE, GLS and G/OLS estimators [Eqs. (5), (6), (10) and (15), respectively], and substituted the results into Eqs. (25) to calculate the sample mean $\overline{D}_\mathrm{EST}$ (open symbols) and standard deviation $\delta D^\mathrm{empirical}_\mathrm{EST}$ (error bars) for all EST ∈ {OLS, GLS, CVE}, respectively. (b) Distribution of scaled residuals $(D^{(k)}_\mathrm{EST} - \overline{D}_\mathrm{EST}) / \mathrm{var}(D^{(k)}_\mathrm{EST})^{1/2}$. The appropriately scaled residuals of all four estimators follow a standard normal distribution (black dotted line), meaning that our uncertainty estimates [Eq. (27)] are appropriate and that the assumptions underlying the use of the quality factor [Eq. (21)] are met.

the diffusion coefficient estimate and the quality factor did not vary significantly with $M$.
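To make the role of the quality factor more concrete, here is a hedged Python sketch. Eq. (21) is not reproduced in this section, so the code assumes that $Q$ is the standard chi-square goodness-of-fit probability, i.e., the probability that a chi-square variable with $M - 2$ degrees of freedom (two fitted parameters) exceeds the observed $\chi^2$. The function name `quality_factor` is hypothetical, and only even $M$ is handled, where this probability reduces to a finite sum.

```python
import math


def quality_factor(chi2, m):
    """Chi-square tail probability for M MSD values and 2 fit parameters.

    ASSUMPTION: Q = P(chi^2 with M - 2 degrees of freedom >= chi2).
    Restricted to even M >= 4, where the tail probability is
    exp(-chi2/2) * sum_{k < (M-2)/2} (chi2/2)^k / k!.
    """
    dof = m - 2
    if dof < 2 or dof % 2:
        raise ValueError("this sketch needs an even M >= 4")
    half = 0.5 * chi2
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total
```

Under this assumption, $Q$ is uniformly distributed on $[0, 1]$ when the model and the predicted uncertainties are adequate, which is why a sample average near 1/2 is the expected signature of a consistent fit.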

As discussed in the companion paper, Ref. 14, we originally used the built-in tool trjconv in Gromacs to unwrap the simulation trajectories, which resulted in a highly oscillating diffusion coefficient estimate for short recording-time intervals, and a slower convergence of the quality factor to $Q \approx 1/2$. These irregularities intensified as the edge length of the simulation box was reduced, which made us wary of the unwrapping scheme implemented in trjconv, and prompted us to develop an improved unwrapping scheme appropriate for simulations at constant pressure.¹⁴

Finally, in Fig. 3a, we give a comparison of all the different estimators discussed in this paper, evaluated for our TIP4P-D water trajectories with $\Delta t_n = \Delta t_\mathrm{opt} = 10$ ps and, when appropriate, $M = 20$. At this interval length, we estimated an average noise parameter of $a^2_\mathrm{GLS} = (8.4 \pm 0.5) \times 10^{-3}$ nm$^2$ for TIP4P-D water, which gives a fairly high signal-to-noise ratio ($\sigma^2/a^2 \approx 15$), where all estimators, aside from the OLS estimators, have essentially the same uncertainty (see Fig. 1). The ratio is still below the $[\sigma^2/a^2]_{\mathrm{CVE}\text{-}M>2}$-threshold, meaning that the GLS estimators outperform, if only marginally, the other three. Figure 3b demonstrates that the scaled residuals $(D^{(k)}_\mathrm{EST} - \overline{D}_\mathrm{EST}) / \mathrm{var}(D^{(k)}_\mathrm{EST})^{1/2}$ follow a $\mathcal{N}(0, 1)$-distribution, i.e., a Gaussian with zero mean and unit variance, for all four considered estimators. Here, $D^{(k)}_\mathrm{EST} = [D^{(k)}_{x,\mathrm{EST}} + D^{(k)}_{y,\mathrm{EST}} + D^{(k)}_{z,\mathrm{EST}}]/3$, $\overline{D}_\mathrm{EST}$ is defined via Eq. (25b) and

$$
\mathrm{var}\big( D^{(k)}_\mathrm{EST} \big) = \sum_{d = x,y,z} \frac{ \mathrm{var}\big( {\sigma^2_{d,\mathrm{EST}}}^{(k)} \big) }{ (6 \Delta t_n)^2 } \tag{27}
$$

is evaluated using either Eqs. (7), (11), (16) or (A1), depending on the estimator EST ∈ {OLS, GLS, CVE}. The good agreement with the standard normal distribution at the chosen interval length implies, on the one hand, that our predictions for the estimator uncertainties are appropriate and, on the other hand, that the ${\sigma^2_\mathrm{EST}}^{(k)} = {\sigma^2_{x,\mathrm{EST}}}^{(k)} + {\sigma^2_{y,\mathrm{EST}}}^{(k)} + {\sigma^2_{z,\mathrm{EST}}}^{(k)}$ are all Gaussian distributed. For the GLS estimators, we can conclude from the latter that the residuals $\delta \mathrm{MSD}_i$ are also normally distributed because Eq. (20) is linear, which, in turn, justifies the use of the quality factor.
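The residual check behind Fig. 3b can be sketched as follows. The snippet evaluates the predicted per-molecule variance of Eq. (27) and the scaled residuals that are compared with a standard normal distribution. Function names and input layout are illustrative assumptions; the variances $\mathrm{var}(\sigma^2_d)$ are taken as given inputs, because the estimator-specific expressions [Eqs. (7), (11), (16) and (A1)] are not reproduced here.

```python
import math


def predicted_variance(var_sigma2_xyz, dt):
    """Eq. (27): variance of a per-molecule D from the three var(sigma^2_d)."""
    return sum(var_sigma2_xyz) / (6.0 * dt) ** 2


def scaled_residuals(d_values, d_mean, d_variances):
    """(D_k - mean) / sqrt(var_k) for every molecule k."""
    return [(d - d_mean) / math.sqrt(v) for d, v in zip(d_values, d_variances)]
```

If the predicted variances are adequate, a histogram of the returned residuals should resemble a Gaussian with zero mean and unit variance.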

D. MD simulation and data analysis for ubiquitin

We tested our framework in a more biological context by simulating a 2 µs trajectory of a single ubiquitin molecule (PDB identification code: 1ubq)²⁵ using the Amber99SB*-ILDN-Q force field²⁶⁻²⁹ in a cubic simulation box with periodic boundary conditions and an approximate edge length of 7.5 nm. The protein was solvated by 13347 TIP4P-D water molecules at a concentration of ∼150 mM NaCl.³⁰ The equilibration and production runs were otherwise performed in the same manner as described in Sec. IV A.

In contrast to pure TIP4P-D water, we had only a single trajectory for the ubiquitin protein, which we analyzed in two ways: First, we applied the GLS estimators [Eqs. (10)] to MSD values computed for the full trajectory at different interval lengths. Then, we split the trajectory into 150 segments of equal length and treated each segment as an individual trajectory in the data analysis, thus allowing for a similar data analysis as performed on the water simulation data. In both cases, $M = 20$ MSD values were considered for the fits. The results of both approaches are presented in Fig. 4a, where we compare

[Figure 4, panels (a) and (b). (a) Diffusion coefficient $D$ [nm$^2$ ns$^{-1}$] versus time-step size $\Delta t_n$ [ps] for the segment average and the full trajectory, with empirical and predicted uncertainties. (b) Quality factor $Q(\Delta t_n, M = 20)$ and its uncertainty $\delta Q$ versus $\Delta t_n$ [ps], with $\Delta t_\mathrm{opt}$ marked.]

Figure 4. Translational diffusion coefficient estimation and quality factor analysis for a single ubiquitin in aqueous solution with $M = 20$ fixed. (a) Splitting the trajectory up into 150 segments of equal length allowed us to calculate the sample average [Eq. (25b)] via the GLS estimators [Eqs. (10), solid green line] and the associated sample standard deviation [Eq. (25c), green shaded area]. For comparison, the trajectory was also analyzed as a whole (dashed gray line), where the corresponding uncertainty was determined via Eq. (26) (gray shaded area). Applying Eq. (26) to the sample average $\overline{D}_\mathrm{GLS}$ resulted in the dotted black lines. (b) Analogous to the TIP4P-D water data (see Fig. 2b), the lower bound $\Delta t_\mathrm{opt}$ was read off the quality factor plot, giving $\Delta t_\mathrm{opt} = 6$ ps. However, unlike for water, the average quality factor (solid blue line) is consistently above 1/2 for $\Delta t_n > \Delta t_\mathrm{opt}$, which is indicative of the trajectory segments being too short (see Fig. 5). The shaded area represents the uncertainty of our estimate, which was determined from the sample standard deviation.

the single-trajectory diffusion coefficient estimate $D_\mathrm{GLS}$ to a sample average $\overline{D}_\mathrm{GLS}$ over the above-mentioned segments. The two estimates coincide almost perfectly and possible deviations are contained within the uncertainty interval of $\overline{D}_\mathrm{GLS}$.
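The segmentation step of the second analysis route can be sketched as below. The helper name `split_into_segments` is hypothetical, and the choice to use contiguous, non-overlapping segments and to drop leftover frames at the end is an assumption of this sketch rather than a documented detail of the authors' procedure.

```python
def split_into_segments(x, n_seg):
    """Split a trajectory into n_seg contiguous segments of equal length.

    Frames that do not fill a complete segment are dropped from the end.
    """
    seg_len = len(x) // n_seg
    if seg_len < 2:
        raise ValueError("trajectory too short for the requested number of segments")
    return [x[k * seg_len:(k + 1) * seg_len] for k in range(n_seg)]
```

Each returned segment can then be treated like an independent trajectory, so that the sample statistics of Eqs. (25) become available even for a single molecule.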

Yet, unlike for pure TIP4P-D water, the diffusion coefficient estimate for ubiquitin did not converge to a constant value at sufficiently large $\Delta t_n$, but instead crept to ever lower values. This might indicate some underlying non-diffusive dynamics on long time scales, but an analysis of the quality factor, depicted in Fig. 4b, showed that it already reached $Q = 1/2$ at $\Delta t_\mathrm{opt} = 6$ ps and


[Figure 5, panels (a) and (b). Quality factor $Q(\Delta t_n, M = 20)$ versus time-step size $\Delta t_n$ [ps] for time series of length $N$ = 1000000, 100000, 10000, 5000 and 2500. Inset of (b): $\overline{D}_\mathrm{GLS}\, p(D^{(k)}_\mathrm{GLS})$ versus $D^{(k)}_\mathrm{GLS} / \overline{D}_\mathrm{GLS}$.]

Figure 5. For short time series, the average quality factor $Q$ deviates from 1/2 as a result of non-Gaussian residuals. (a) $Q$ as function of the recording-time interval length $\Delta t_n$ for the TIP4P-D water time series of Fig. 2 truncated to different lengths $N$. (b) Same as in (a) for ideal diffusive dynamics, generated via Eqs. (3) for $a^2 = 1/2$ and $\sigma^2 = 1$. Inset: Distributions of the corresponding diffusion coefficient estimates for $\Delta t_n = 100$ ps on a semi-logarithmic scale.

then remained essentially flat, thus implying a good fit to our diffusion model. At the optimal interval length $\Delta t_\mathrm{opt}$, the noise parameter for ubiquitin was estimated as $a^2_\mathrm{GLS} = (1.9 \pm 0.6) \times 10^{-4}$ nm$^2$, corresponding to a signal-to-noise ratio of $\sigma^2/a^2 \approx 11$. A closer inspection revealed that $Q$ consistently took values slightly greater than 1/2, indicating that we overestimated $\delta D^\mathrm{prediction}_\mathrm{GLS}$ in this regime. This can also be seen in Fig. 4a, where the black dotted lines [Eq. (26)] clearly envelop the green shaded area [Eq. (25c)]. The elevated $Q$-values beyond $\Delta t_\mathrm{opt}$ are due to the trajectory segments being too short to produce Gaussian-distributed estimates for $D^{(k)}_\mathrm{GLS}$. We also observed this effect for synthetic data generated via Eqs. (3), as well as our TIP4P-D water data when we drastically shortened the underlying time series. The effect is demonstrated in Fig. 5, where we plot the average quality factor computed from time series of different lengths. For short time series, the quality factor increases with the interval length both in TIP4P-D water data and in synthetic data for an ideal diffusion process. As shown in the inset of Fig. 5b, this is associated with deviations from Gaussian statistics for small sample sizes. Furthermore, in the case of the TIP4P-D water data, the lower bound $\Delta t_\mathrm{opt}$ also seems affected by the shortening of the time series. This casts doubt on our initial decision to set $\Delta t_\mathrm{opt} = 6$ ps for ubiquitin, because the optimal value is probably a few picoseconds higher.

E. Verifying short-time predictions on long time scales

Another indication that $\Delta t_\mathrm{opt}$ might be too low for ubiquitin is given by the fact that the uncertainty bounds at $\Delta t_\mathrm{opt} = 6$ ps from the single-trajectory analysis do not account for all diffusion coefficient estimates obtained at longer recording-time intervals. To test whether this discrepancy is due to insufficient statistics or because our model is not entirely adequate to describe the translational dynamics of ubiquitin, we turned to the KS statistic [Eq. (24)], which we evaluated for our water data and the segmented ubiquitin trajectory, respectively. Figure 6a presents the KS test results for TIP4P-D water, which, unsurprisingly, confirm that the system dynamics remains diffusive on long time scales and is well characterized by the optimal diffusion coefficient $D_\mathrm{opt} \equiv D(\Delta t_\mathrm{opt}) = (2.086 \pm 0.010)$ nm$^2$ ns$^{-1}$. For ubiquitin, we observed a significant discrepancy between the global minimum of $S\sqrt{N_s}$ and $D_\mathrm{opt}$, as depicted in Fig. 6b. Yet, due to the large p-value associated with $D_\mathrm{opt}$, we could not reject the null hypothesis that the long-time dynamics of ubiquitin is diffusive with $D \approx D_\mathrm{opt} = (0.060 \pm 0.002)$ nm$^2$ ns$^{-1}$. According to the KS test, the value of $D_\mathrm{opt}$ estimated from the 150 segments, each of length 13.3 ns, is consistent with trajectory data at longer recording-time intervals of 28.5 ns and 66.7 ns, with p-values exceeding 0.5 (see Fig. 6b).
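As an illustration of the kind of test used here, the following sketch computes a one-sample Kolmogorov-Smirnov statistic, scaled by $\sqrt{N_s}$, for displacements compared against a zero-mean Gaussian. Eqs. (23) and (24) are not reproduced in this section, so the Gaussian model with a caller-supplied variance and the function name `scaled_ks_statistic` are assumptions of this sketch.

```python
from statistics import NormalDist


def scaled_ks_statistic(samples, variance):
    """Return S * sqrt(N_s) for a zero-mean Gaussian with the given variance.

    S is the largest vertical distance between the empirical CDF of the
    samples and the model CDF.
    """
    model = NormalDist(mu=0.0, sigma=variance ** 0.5)
    xs = sorted(samples)
    n = len(xs)
    s = 0.0
    for k, x in enumerate(xs):
        f = model.cdf(x)
        # The empirical CDF jumps from k/n to (k + 1)/n at x
        s = max(s, f - k / n, (k + 1) / n - f)
    return s * n ** 0.5
```

Scanning this quantity over trial values of $D$ (through the model variance) would produce a curve of the type shown in Fig. 6, whose minimum marks the diffusion coefficient most compatible with the observed long-time displacements.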

V. CONCLUSIONS

We have proposed a robust framework to extract reliable self-diffusion coefficients and their uncertainties from molecular dynamics simulation trajectories. We fit a diffusion process to the observed MSD curves using GLS estimators [Eqs. (10)], which account for strong correlations in the errors of MSD values at different time lags, and generally outperform other estimators at high and low signal-to-noise ratios. We allow for possible non-diffusive dynamics at short times by including an instantaneous Gaussian spread in the diffusion process. By sub-sampling the time series, we identify remaining deviations from regular diffusion at intermediate times. We extend the recording-time interval until the sub-sampled trajectories become statistically consistent with our diffusion process. The quality factor $Q$ [Eq. (21)] used to quantify the fit consistency arises naturally in the context of GLS fitting procedures and provides a measure of whether the data are overfitted or underfitted. In this way, we obtain an estimate of the translational diffusion coefficient that optimally trades off possible systematic errors from non-diffusive dynamics at short times and statistical errors from increasing uncertainties in the MSD values at longer times.

[Figure 6, panels (a) and (b). Scaled KS statistic $S\sqrt{N_s}$ versus diffusion coefficient $D$ [nm$^2$ ns$^{-1}$], with $D_\mathrm{opt}$ and $\delta D_\mathrm{opt}$ marked. Panel (b) carries an additional p-value axis and curves for $N_\mathrm{seg}$ = 30, 70 and 150. Insets: eCDF and CDF versus $\Delta X / \sqrt{D_\mathrm{opt} \Delta t_\mathrm{opt}}$.]

Figure 6. Kolmogorov-Smirnov test to verify whether $D_\mathrm{opt} = D_\mathrm{GLS}$ obtained for $\Delta t_n = \Delta t_\mathrm{opt}$ is consistent with the observed long-time dynamics. (a) The optimal diffusion coefficient estimate $D_\mathrm{opt} = (2.086 \pm 0.010)$ nm$^2$ ns$^{-1}$ for TIP4P-D water (solid green line), which was read off the quality factor (see Fig. 2b), coincides nearly perfectly with the minimum of the scaled KS statistic [Eq. (24), dashed black line]. Here, $a^2(\Delta t_\mathrm{opt}) = a^2_\mathrm{GLS}|_{\Delta t_n = \Delta t_\mathrm{opt}}$ was kept constant for all $D$, and $N_s = 3 \times 4139$. The shaded area represents one sample standard deviation of uncertainty. Inset: Empirical cumulative distribution function (eCDF, solid blue line) and cumulative distribution function [CDF, Eq. (23), dashed red line] of the relative endpoints $\{\Delta X^{(k)}\}^{d \in \{x,y,z\}}_{k = 1, 2, \dots, 4139}$, where the latter was evaluated using $D_\mathrm{opt}$. (b) Same as in (a) for $N_\mathrm{seg}$ = 30, 70 and 150 ubiquitin trajectory segments using $D_\mathrm{opt} = (6.0 \pm 0.2) \times 10^{-2}$ nm$^2$ ns$^{-1}$ and $N_s = 3 \times N_\mathrm{seg}$. Although our estimate $D_\mathrm{opt}$ is too high compared to the value that minimizes the KS statistic for $N_\mathrm{seg}$ = 30 and 150, we cannot reject it on the basis of the KS test because the corresponding p-values are fairly high (∼0.65 and ∼0.72, respectively). This is reflected in the excellent agreement between eCDF and CDF (see inset).

Our framework can be readily applied to either a single trajectory or a set of trajectories with the help of our Python data analysis script.¹⁶ In the former case, the full trajectory must be split into multiple segments of equal length to perform the quality factor analysis. As a proof of principle and to demonstrate its use in practice, we applied our framework to molecular dynamics simulation data. We estimated the translational diffusion coefficient of TIP4P-D water from an ensemble of 4139 water molecules and compared our results to literature values. We then went on to a system with much sparser statistics, namely a single ubiquitin molecule solvated in TIP4P-D water, where we performed a single-trajectory analysis on the full ubiquitin trajectory and compared the results to statistics obtained by analyzing multiple segments of said trajectory. We found that both approaches give almost identical diffusion coefficient estimates, albeit with very different uncertainties. An inspection of the quality factor for the segments revealed that our predictions for the uncertainty were adequate, but systematically a bit too high due to the segments being too short. With the help of a Kolmogorov-Smirnov test, we confirmed that our optimal diffusion coefficient estimates correctly predict the long-time dynamics observed in our trajectories. In this way, we effectively ruled out possible anomalous diffusion of the protein ubiquitin on intermediate and long time scales. Finally, our framework allowed us to identify systematic errors caused by well-established software packages for unwrapping particle trajectories from simulations at constant pressure, as detailed in the companion paper, Ref. 14.

All in all, our framework supplies the practitioner with an optimal diffusion coefficient estimate and an associated uncertainty with high precision. It furthermore provides evidence on how to optimize the quality of the fit by sub-sampling, thereby trading off systematic and statistical uncertainties, and whether the predicted overall uncertainty is of appropriate size or not.

DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

ACKNOWLEDGMENTS

We thank Attila Szabo for insightful comments on the manuscript and discussions. This research was supported by the Max Planck Society (J.T.B., S.v.B. and G.H.) and the Human Frontier Science Program RGP0026/2017 (S.v.B. and G.H.).


Appendix A: OLS-estimator variances

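The variances of the OLS estimators [Eqs. (5)] are computed in a straightforward fashion, giving

$$
\begin{aligned}
\mathrm{var}(a^2_\mathrm{OLS}) &= \langle (a^2_\mathrm{OLS})^2 \rangle - \langle a^2_\mathrm{OLS} \rangle^2 \\
&= \frac{ \beta^2 \langle \gamma^2 \rangle - 2 \alpha \beta \langle \gamma \delta \rangle + \alpha^2 \langle \delta^2 \rangle }{ (M \beta - \alpha^2)^2 } - a^4 , \\
\mathrm{var}(\sigma^2_\mathrm{OLS}) &= \frac{ M^2 \langle \delta^2 \rangle - 2 M \alpha \langle \gamma \delta \rangle + \alpha^2 \langle \gamma^2 \rangle }{ (M \beta - \alpha^2)^2 } - \sigma^4 ,
\end{aligned}
$$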

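where we can make use of the relation $\langle \mathrm{MSD}_i\, \mathrm{MSD}_j \rangle = \Sigma_{i,j} + (a^2 + i \sigma^2)(a^2 + j \sigma^2)$ to rewrite the ensemble averages appearing in the above expressions as follows,

$$
\begin{aligned}
\langle \gamma^2 \rangle &= \sum_{i,j=1}^{M} \Sigma_{i,j} + (a^2 M + \sigma^2 \alpha)^2 , \\
\langle \gamma \delta \rangle &= \sum_{i,j=1}^{M} i\, \Sigma_{i,j} + a^4 M \alpha + a^2 \sigma^2 (M \beta + \alpha^2) + \sigma^4 \alpha \beta , \\
\langle \delta^2 \rangle &= \sum_{i,j=1}^{M} i\, j\, \Sigma_{i,j} + (a^2 \alpha + \sigma^2 \beta)^2 .
\end{aligned}
$$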

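Here, $\Sigma_{i,j}$ denotes the covariance of the $\mathrm{MSD}_i$, whose functional form is explicitly derived in the next section and given by Eqs. (8). After some algebraic manipulations, we arrive at the final result

$$
\mathrm{var}(a^2_\mathrm{OLS}) = \sum_{i,j=1}^{M} \frac{ (i \alpha - \beta)(j \alpha - \beta) }{ (M \beta - \alpha^2)^2 }\, \Sigma_{i,j}(a^2_\mathrm{OLS}, \sigma^2_\mathrm{OLS}) , \tag{A1a}
$$

$$
\mathrm{var}(\sigma^2_\mathrm{OLS}) = \sum_{i,j=1}^{M} \frac{ (i M - \alpha)(j M - \alpha) }{ (M \beta - \alpha^2)^2 }\, \Sigma_{i,j}(a^2_\mathrm{OLS}, \sigma^2_\mathrm{OLS}) . \tag{A1b}
$$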

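For $\sigma^2/a^2 > 1$, the precision of the OLS estimators deteriorates with increasing $M$, as seen in the limit $M \ll N$ with $a^2 \equiv 0$, where

$$
\begin{aligned}
\mathrm{var}(\sigma^2_\mathrm{OLS}) &= \frac{2 \sigma^4}{3 N} \sum_{i,j=1}^{M} \frac{ (i M - \alpha)(j M - \alpha) }{ (M \beta - \alpha^2)^2 } \min(i,j) \left[ 1 + 3 i j - \min(i,j)^2 \right] + O(N^{-2}) \\
&= \frac{2 \sigma^4}{3 N} \left[ \frac{78 M}{35} + 3 + O(M^{-1}) \right] + O(N^{-2}) .
\end{aligned}
$$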

Including more data points therefore worsens the OLS estimate of the diffusion coefficient.
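The growth of the OLS variance with $M$ can be checked numerically. The sketch below evaluates the double sum of the preceding expression exactly, using $\alpha = \sum_i i$ and $\beta = \sum_i i^2$ (the definitions implied by the OLS formulas above), and can be compared with the leading term $78M/35$. The function name `ols_variance_factor` is hypothetical.

```python
from fractions import Fraction


def ols_variance_factor(m):
    """Exact double sum multiplying 2*sigma^4/(3N) in the a^2 = 0, M << N limit."""
    alpha = sum(range(1, m + 1))                  # sum of lags i
    beta = sum(i * i for i in range(1, m + 1))    # sum of squared lags i^2
    num = 0
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            lo = min(i, j)
            num += (i * m - alpha) * (j * m - alpha) * lo * (1 + 3 * i * j - lo * lo)
    return Fraction(num, (m * beta - alpha ** 2) ** 2)
```

For $M = 2$ the factor equals 9, which matches $\mathrm{var}(\mathrm{MSD}_2 - \mathrm{MSD}_1)$ computed directly from the covariance entries; for large $M$ the ratio to $78M/35$ approaches one.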

By contrast, at low signal-to-noise ratios $\sigma^2/a^2$ other terms dominate the covariance, thus resulting in an uncertainty comparable to the GLS estimators for $M \ll N$ (see Fig. 1).

Appendix B: MSD covariance matrix

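While the mean of $\mathrm{MSD}_i$ is trivially given by

$$
\langle \mathrm{MSD}_i \rangle = \sum_{n=0}^{N-i} \frac{ \langle (X_{n+i} - X_n)^2 \rangle }{ N - i + 1 } = a^2 + i \sigma^2 ,
$$

the calculation of its second moment,

$$
\langle \mathrm{MSD}_i\, \mathrm{MSD}_j \rangle = \sum_{n=0}^{N-i} \sum_{m=0}^{N-j} \frac{ \langle \delta X^2_{n,i}\, \delta X^2_{m,j} \rangle }{ (N - i + 1)(N - j + 1) }
$$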

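with the shorthand notation $\delta Y_{n,i} = Y_{n+i} - Y_n$, is more involved. It essentially reduces to a sum of moments of the form $\langle A_i B_j C_k D_l \rangle$, where $A, B, C, D \in \{Z, X, R, S\}$, which evaluate to

$$
\begin{aligned}
\langle A_i B_j C_k D_l \rangle ={}& \mathrm{cov}(A_i, B_j)\, \mathrm{cov}(C_k, D_l) \\
&+ \mathrm{cov}(A_i, D_l)\, \mathrm{cov}(B_j, C_k) \\
&+ \mathrm{cov}(A_i, C_k)\, \mathrm{cov}(B_j, D_l)
\end{aligned} \tag{B1}
$$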

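for normally distributed processes.³¹ Because of

$$
\begin{aligned}
\langle Z_i Z_j Z_k S_l \rangle &= \sum_\mathrm{combinations} \mathrm{cov}(Z_a, Z_b)\, \mathrm{cov}(Z_c, S_d) = 0 , \\
\langle Z_i S_j S_k S_l \rangle &= \sum_\mathrm{combinations} \mathrm{cov}(Z_a, S_b)\, \mathrm{cov}(S_c, S_d) = 0 ,
\end{aligned}
$$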

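it becomes apparent that the only non-zero terms of $\langle (X_{n+i} - X_n)^2 (X_{m+j} - X_m)^2 \rangle$, when expanded with respect to $Z$ and $S$, are of even order, i.e.,

$$
\begin{aligned}
\langle \delta X^2_{n,i}\, \delta X^2_{m,j} \rangle ={}& \langle \delta Z^2_{n,i}\, \delta Z^2_{m,j} \rangle + \frac{a^2}{2} \langle \delta Z^2_{n,i}\, \delta S^2_{m,j} \rangle \\
&+ 2 a^2 \langle \delta Z_{n,i}\, \delta S_{n,i}\, \delta Z_{m,j}\, \delta S_{m,j} \rangle \\
&+ \frac{a^2}{2} \langle \delta S^2_{n,i}\, \delta Z^2_{m,j} \rangle + \frac{a^4}{4} \langle \delta S^2_{n,i}\, \delta S^2_{m,j} \rangle .
\end{aligned}
$$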

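We can already identify

$$
\Sigma_{i,j}(0, \sigma^2) = \sum_{n=0}^{N-i} \sum_{m=0}^{N-j} \frac{ \langle \delta Z^2_{n,i}\, \delta Z^2_{m,j} \rangle }{ (N - i + 1)(N - j + 1) } - i j \sigma^4 , \tag{B2}
$$

which we will come back to later, and use linear combinations of

$$
\begin{aligned}
\langle Z_i Z_j S_k S_l \rangle &= \sigma^2 \min(i,j)\, \delta_{k,l} , \\
\langle S_i S_j S_k S_l \rangle &= \langle R_i R_j R_k R_l \rangle = \delta_{i,j} \delta_{k,l} + \delta_{i,l} \delta_{j,k} + \delta_{i,k} \delta_{j,l}
\end{aligned}
$$

to calculate the remaining terms, giving

$$
\begin{aligned}
\langle \delta Z^2_{n,i}\, \delta S^2_{m,j} \rangle ={}& 2 i \sigma^2 , \\
\langle \delta Z_{n,i}\, \delta S_{n,i}\, \delta Z_{m,j}\, \delta S_{m,j} \rangle ={}& \sigma^2 ( \delta_{m,n} - \delta_{m,n+i} - \delta_{m+j,n} + \delta_{m+j,n+i} ) \\
&\times [ \min(m,n) - \min(m,n+i) - \min(m+j,n) + \min(m+j,n+i) ] , \\
\langle \delta S^2_{n,i}\, \delta Z^2_{m,j} \rangle ={}& 2 j \sigma^2 , \\
\langle \delta S^2_{n,i}\, \delta S^2_{m,j} \rangle ={}& 2 [ 2 + (1 + 2 \delta_{i,j}) \delta_{m,n} + \delta_{m,n+i} + \delta_{m+j,n} + \delta_{m+j,n+i} ] .
\end{aligned}
$$

Substituting all of these into our expression for the covariance matrix results in Eq. (8a).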

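For the special case of $a \equiv 0$, our covariance matrix reduces to Eq. (B2), where the ensemble average can be rewritten as follows,

$$
\begin{aligned}
\langle \delta Z^2_{n,i}\, \delta Z^2_{m,j} \rangle &= \sigma^4 \left\langle \left[ \sum_{a=n}^{n+i-1} R_a \right]^2 \left[ \sum_{b=m}^{m+j-1} R_b \right]^2 \right\rangle \\
&\overset{\text{(B1)}}{=} \sigma^4 \left( i j + 2 \left[ \sum_{a=n}^{n+i-1} \sum_{b=m}^{m+j-1} \delta_{a,b} \right]^2 \right) .
\end{aligned}
$$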

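The $\sigma^4 i j$-term conveniently cancels with $\langle \mathrm{MSD}_i \rangle \langle \mathrm{MSD}_j \rangle$ and the summation over multiple indices can be reduced to

$$
\begin{aligned}
\sum_{n=0}^{N-i} \sum_{m=0}^{N-j} \left[ \sum_{a=n}^{n+i-1} \sum_{b=m}^{m+j-1} \delta_{a,b} \right]^2 = \frac{1}{6} \Big( & 2 \min(i,j) [ N + 1 - \max(i,j) ] \left[ 1 + 3 i j - \min(i,j)^2 \right] \\
&+ \min(i,j)^2 - \min(i,j)^4 \\
&+ \Theta(i + j - N - 2) \left[ (N + 1 - i - j)^4 - (N + 1 - i - j)^2 \right] \Big)
\end{aligned}
$$

using geometric reasoning. This finally gives rise to Eq. (8b) of the main text.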
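The closed-form sum above can be cross-checked by brute force, because the inner double sum over $\delta_{a,b}$ simply counts the steps shared by the windows $[n, n+i-1]$ and $[m, m+j-1]$. The following sketch compares both sides in exact integer arithmetic; the function names are hypothetical, and the step function $\Theta$ is taken to be one for positive arguments (its value at zero is irrelevant, since the bracket it multiplies vanishes there).

```python
def overlap_sum_brute(n_total, i, j):
    """Left-hand side: sum over all window pairs of the squared overlap."""
    total = 0
    for n in range(n_total - i + 1):
        for m in range(n_total - j + 1):
            # Number of steps shared by [n, n+i-1] and [m, m+j-1]
            overlap = max(0, min(n + i, m + j) - max(n, m))
            total += overlap ** 2
    return total


def six_times_closed_form(n_total, i, j):
    """Six times the right-hand side, kept as an integer to avoid rounding."""
    lo, hi = min(i, j), max(i, j)
    six = 2 * lo * (n_total + 1 - hi) * (1 + 3 * i * j - lo * lo) + lo ** 2 - lo ** 4
    if i + j - n_total - 2 > 0:
        r = n_total + 1 - i - j
        six += r ** 4 - r ** 2
    return six
```

Agreement of `6 * overlap_sum_brute(N, i, j)` with `six_times_closed_form(N, i, j)` for all lags up to $N$ is a quick consistency check on a hand-typed implementation of Eq. (8b).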

Appendix C: Iterative algorithm for the GLS estimators

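Starting from a candidate solution $(a^2_k, \sigma^2_k)$, we can evaluate the covariance matrix [Eq. (8a)] and use it to construct the following auxiliary values,

$$
\begin{aligned}
\kappa_k &= \sum_{i,j=1}^{M} \Sigma^{-1}_{i,j}(a^2_k, \sigma^2_k) , \\
\lambda_k &= \sum_{i,j=1}^{M} i\, \Sigma^{-1}_{i,j}(a^2_k, \sigma^2_k) , \\
\mu_k &= \sum_{i,j=1}^{M} i\, j\, \Sigma^{-1}_{i,j}(a^2_k, \sigma^2_k) , \\
\nu_k &= \sum_{i,j=1}^{M} \mathrm{MSD}_i\, \Sigma^{-1}_{i,j}(a^2_k, \sigma^2_k) , \\
\xi_k &= \sum_{i,j=1}^{M} i\, \mathrm{MSD}_j\, \Sigma^{-1}_{i,j}(a^2_k, \sigma^2_k) .
\end{aligned}
$$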

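Our candidate solution can then be iteratively updated according to

$$
a^2_{k+1} = \frac{ \mu_k \nu_k - \lambda_k \xi_k }{ \kappa_k \mu_k - \lambda_k^2 } , \qquad \sigma^2_{k+1} = \frac{ \kappa_k \xi_k - \lambda_k \nu_k }{ \kappa_k \mu_k - \lambda_k^2 } ,
$$

where the procedure is stopped either when the following inequality is satisfied for some tolerance threshold $\varepsilon$,

$$
\left( a^2_{k+1} - a^2_k \right)^2 + \left( \sigma^2_{k+1} - \sigma^2_k \right)^2 < \varepsilon ,
$$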

or a certain number of iteration steps are completed. As an initial solution, we choose

$$
(a^2_0, \sigma^2_0) = (a^2_{M=2}, \sigma^2_{M=2}) \equiv (2\, \mathrm{MSD}_1 - \mathrm{MSD}_2, -\mathrm{MSD}_1 + \mathrm{MSD}_2) ,
$$

which is also returned if the algorithm fails to converge. Although this is rarely the case for $1 < M \ll N$, the above described procedure is susceptible to considerable numerical errors in the limit $M \to N \gg 1$, where the covariance matrix becomes ill-conditioned.
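The iteration described in this appendix can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' script: the names `solve`, `gls_fit` and `make_standin_cov` are hypothetical, the default tolerance is arbitrary, and because Eq. (8a) is not reproduced in this section the covariance is supplied by the caller as a function `cov_fn(a2, s2)`. The stand-in covariance used for the demonstration is the $a^2 \equiv 0$, $M \ll N$ form quoted in Appendix A. Two linear solves replace the explicit matrix inverse, which gives the same auxiliary sums for a symmetric covariance matrix.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def gls_fit(msd, cov_fn, tol=1e-20, max_iter=100):
    """Iterate the GLS update for (a^2, sigma^2); msd[i-1] holds MSD_i."""
    m = len(msd)
    lags = [float(i) for i in range(1, m + 1)]
    start = (2.0 * msd[0] - msd[1], msd[1] - msd[0])  # M = 2 solution
    a2, s2 = start
    for _ in range(max_iter):
        cov = cov_fn(a2, s2)
        u = solve(cov, [1.0] * m)  # u_i = sum_j inv(cov)_ij
        w = solve(cov, lags)       # w_j = sum_i i * inv(cov)_ij
        kappa, lam = sum(u), sum(w)
        mu = sum(i * wi for i, wi in zip(lags, w))
        nu = sum(y * ui for y, ui in zip(msd, u))
        xi = sum(y * wi for y, wi in zip(msd, w))
        det = kappa * mu - lam ** 2
        a2_new = (mu * nu - lam * xi) / det
        s2_new = (kappa * xi - lam * nu) / det
        converged = (a2_new - a2) ** 2 + (s2_new - s2) ** 2 < tol
        a2, s2 = a2_new, s2_new
        if converged:
            return a2, s2
    return start  # fall back to the initial solution if not converged


def make_standin_cov(m, n):
    """Stand-in covariance (a^2 = 0, M << N limit); NOT the full Eq. (8a)."""
    def cov_fn(a2, s2):
        pre = 2.0 * s2 ** 2 / (3.0 * n)
        return [[pre * min(i, j) * (1 + 3 * i * j - min(i, j) ** 2)
                 for j in range(1, m + 1)] for i in range(1, m + 1)]
    return cov_fn
```

For noise-free input such as `gls_fit([1.5, 2.5, 3.5, 4.5, 5.5], make_standin_cov(5, 1000))`, the fit recovers the intercept and slope of the line $\mathrm{MSD}_i = a^2 + i \sigma^2$ regardless of the covariance used, which makes it a convenient sanity check before plugging in the full covariance matrix.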

Appendix D: Asymptotics of the GLS estimators

According to Eq. (8a), we have the identity

$$
\Sigma_{1,i}(a^2, \sigma^2) = \frac{2 i}{N} \sigma^4 + \frac{ a^4 (2 + \delta_{1,i}) + 4 a^2 \sigma^2 }{ N } - \frac{ a^4 }{ N (N - i + 1) } ,
$$

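from which the relation

$$
\begin{aligned}
b_1 ={}& \frac{2}{N} \sigma^4 \sum_{i,j=1}^{M} i\, b_j\, \Sigma^{-1}_{i,j}(a^2, \sigma^2) + \frac{ 2 a^4 + 4 a^2 \sigma^2 }{ N } \sum_{i,j=1}^{M} b_j\, \Sigma^{-1}_{i,j}(a^2, \sigma^2) \\
&+ \frac{a^4}{N} \sum_{i,j=1}^{M} \delta_{1,i}\, b_j\, \Sigma^{-1}_{i,j}(a^2, \sigma^2) - \sum_{i,j=1}^{M} \frac{ b_j\, a^4 }{ N (N - i + 1) }\, \Sigma^{-1}_{i,j}(a^2, \sigma^2)
\end{aligned}
$$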

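follows for arbitrary $b_i$. For large signal-to-noise ratios, we thus have

$$
\sum_{i,j=1}^{M} i\, b_j\, \Sigma^{-1}_{i,j}(a^2, \sigma^2) = \frac{N}{2 \sigma^4} b_1 - \frac{2 a^2}{\sigma^2} \sum_{i,j=1}^{M} b_j\, \Sigma^{-1}_{i,j}(a^2, \sigma^2) + O\big( [\sigma^2/a^2]^{-2} \big) , \tag{D1}
$$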

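which can be used to derive Eqs. (12), because

$$
\begin{aligned}
\kappa \mu - \lambda^2 &= \frac{N}{2 \sigma^4} [ \kappa - \lambda ] + O\big( [\sigma^2/a^2]^{-2} \big) , \\
\mu \nu - \lambda \xi &= \frac{N}{2 \sigma^4} [ \nu - \mathrm{MSD}_1 \lambda ] + O\big( [\sigma^2/a^2]^{-2} \big) , \\
\kappa \xi - \lambda \nu &= \frac{N}{2 \sigma^4} [ \mathrm{MSD}_1 \kappa - \nu ] + O\big( [\sigma^2/a^2]^{-2} \big)
\end{aligned}
$$

must hold. Finally, Eq. (14) follows from Eq. (D1) for $a \equiv 0$ and $b_j = j$.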

REFERENCES

¹ A. Einstein, Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen, Ann. Phys. 322, 549-560 (1905).

² H. Qian, M. P. Sheetz, and E. L. Elson, Single particle tracking. Analysis of diffusion and flow in two-dimensional systems, Biophys. J. 60, 910-921 (1991).

³ S. Wieser and G. J. Schütz, Tracking single molecules in the live cell plasma membrane – Do’s and don’t’s, Methods 46, 131-140 (2008).

⁴ X. Michalet, Mean square displacement analysis of single-particle trajectories with localization error: Brownian motion in an isotropic medium, Phys. Rev. E 82, 041914 (2010).

⁵ R. Huang, I. Chavez, K. M. Taute, B. Lukić, S. Jeney, M. G. Raizen, and E.-L. Florin, Direct observation of the full transition from ballistic to diffusive Brownian motion in a liquid, Nat. Phys. 7, 576-580 (2011).

⁶ Z. Li, X. Bian, X. Li, and G. E. Karniadakis, Incorporation of memory effects in coarse-grained modeling via the Mori-Zwanzig formalism, J. Chem. Phys. 143, 243128 (2015).

⁷ W. van Megen and H. J. Schöpe, The cage effect in systems of hard spheres, J. Chem. Phys. 146, 104503 (2017).

⁸ E. Vargas L. and R. Q. Snurr, Heterogeneous diffusion of alkanes in the hierarchical metal-organic framework NU-1000, Langmuir 31, 10056-10065 (2015).

⁹ M. Lysy, N. S. Pillai, D. B. Hill, M. G. Forest, J. W. R. Mellnik, P. A. Vasquez, and S. A. McKinley, Model comparison and assessment for single particle tracking in biological fluids, J. Am. Stat. Assoc. 111, 1413 (2016).

¹⁰ A. J. Berglund, Statistics of camera-based single-particle tracking, Phys. Rev. E 82, 011917 (2010).

¹¹ X. Michalet and A. J. Berglund, Optimal diffusion coefficient estimation in single-particle tracking, Phys. Rev. E 85, 061916 (2012).

¹² C. L. Vestergaard, P. C. Blainey, and H. Flyvbjerg, Optimal estimation of diffusion coefficients from single-particle trajectories, Phys. Rev. E 89, 022726 (2014).

¹³ S. Piana, A. G. Donchev, P. Robustelli, and D. E. Shaw, Water dispersion interactions strongly influence simulated structural properties of disordered protein states, J. Phys. Chem. B 119, 5113-5123 (2015).

¹⁴ S. von Bülow, J. T. Bullerjahn, and G. Hummer, Incorrect unwrapping causes systematic errors in diffusion coefficients from long-time MD simulations at constant pressure, J. Chem. Phys. XXX, YYY-ZZZ (2020).

¹⁵ M. J. Abraham, D. van der Spoel, E. Lindahl, B. Hess, and the GROMACS development team, GROMACS User Manual version 2019, www.gromacs.org.

¹⁶ Available for download at https://github.com/bio-phys/DiffusionGLS.

¹⁷ G. Bussi, D. Donadio, and M. Parrinello, Canonical sampling through velocity rescaling, J. Chem. Phys. 126, 014101 (2007).

¹⁸ M. Parrinello and A. Rahman, Polymorphic transitions in single crystals: A new molecular dynamics method, J. Appl. Phys. 52, 7182-7190 (1981).

¹⁹ M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers, SoftwareX 1-2, 19-25 (2015).

²⁰ B. Hess, H. Bekker, H. J. C. Berendsen, and J. G. E. M. Fraaije, LINCS: A linear constraint solver for molecular simulations, J. Comput. Chem. 18, 1463-1472 (1997).

²¹ K. Krynicki, C. D. Green, and D. W. Sawyer, Pressure and temperature dependence of self-diffusion in water, Faraday Discuss. Chem. Soc. 66, 199-208 (1978).

²² S. von Bülow, M. Siggel, M. Linke, and G. Hummer, Dynamic cluster formation determines viscosity and diffusion in dense protein solutions, Proc. Natl. Acad. Sci. USA 116, 9843-9852 (2019).

²³ I.-C. Yeh and G. Hummer, System-size dependence of diffusion coefficients and viscosities from molecular dynamics simulations with periodic boundary conditions, J. Phys. Chem. B 108, 15873-15879 (2004).

²⁴ M. Vögele, J. Köfinger, and G. Hummer, Hydrodynamics of diffusion in lipid membrane simulations, Phys. Rev. Lett. 120, 268104 (2018).

²⁵ S. Vijay-Kumar, C. E. Bugg, and W. J. Cook, Structure of ubiquitin refined at 1.8 Å resolution, J. Mol. Biol. 194, 531-544 (1987).

²⁶ K. Lindorff-Larsen, S. Piana, K. Palmo, P. Maragakis, J. L. Klepeis, R. O. Dror, and D. E. Shaw, Improved side-chain torsion potentials for the Amber ff99SB protein force field, Proteins 78, 1950-1958 (2010).

²⁷ R. B. Best and G. Hummer, Optimized molecular dynamics force fields applied to the helix-coil transition of polypeptides, J. Phys. Chem. B 113, 9004-9015 (2009).

²⁸ V. Hornak, R. Abel, A. Okur, B. Strockbine, A. Roitberg, and C. Simmerling, Comparison of multiple Amber force fields and development of improved protein backbone parameters, Proteins 65, 712-725 (2006).

²⁹ R. B. Best, D. de Sancho, and J. Mittal, Residue-specific α-helix propensities from molecular simulation, Biophys. J. 102, 1462-1467 (2012).

³⁰ I. S. Joung and T. E. Cheatham III, Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulations, J. Phys. Chem. B 112, 9020-9041 (2008).

³¹ C. W. Gardiner, Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences (Springer-Verlag, Berlin, 1985).