On the relationship between cumulative correlation coefficients and the quality of crystallographic data sets

Journal: Protein Science

Manuscript ID: PRO-17-0259.R1

Wiley - Manuscript type: Full-Length Papers

Date Submitted by the Author: 26-Sep-2017

Complete List of Authors:
Wang, Jimin; Yale University, Department of Molecular Biophysics and Biochemistry
Brudvig, Gary; Yale University, Department of Chemistry
Batista, Victor; Yale University, Department of Chemistry
Moore, Peter; Yale University, Department of Chemistry

Keywords: CC1/2, X-ray free-electron laser, femtosecond series crystallography, Photosystem II, PSII, Model Bias, Cumulative Correlation Coefficients

John Wiley & Sons

Protein Science

On the relationship between cumulative correlation coefficients and the

quality of crystallographic data sets

Jimin Wang1,*, Gary W. Brudvig1,2, Victor S. Batista2, Peter B. Moore1,2

Running Title: CC1/2 and data quality

1Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520-8114.

2Department of Chemistry, Yale University, New Haven, CT 06520-8107

*To whom correspondence should be addressed: [email protected]

Summary

In 2012, Karplus and Diederichs demonstrated that the Pearson correlation coefficient

CC1/2 is a far better indicator of the quality and resolution of crystallographic data sets

than more traditional measures like merging R-factor or signal-to-noise ratio. More

specifically, they proposed that CC1/2 be computed for data sets in thin shells of

increasing resolution so that the resolution dependence of that quantity can be examined.

Recently, however, the CC1/2 values of entire data sets, i.e. cumulative correlation

coefficients, have been used as a measure of data quality. Here we show that the

difference in cumulative CC1/2 value between a data set that has been accurately

measured and a data set that has not is likely to be small. Furthermore, structures

obtained by molecular replacement from poorly measured data sets are likely to suffer

from extreme model bias.

Keywords: CC1/2, X-ray Free-Electron Laser, Femtosecond Series Crystallography,

Photosystem II, PSII, Model Bias, Cumulative Correlation Coefficients.

Introduction

In his classic note on regression and the theory of evolution in 1896,1 Karl Pearson

introduced the statistic now known as the Pearson correlation coefficient (CC), and it has

been widely used in the social and behavioral sciences ever since.2 The Pearson CC of the

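data obtained when the same set of observations is made twice (xj,yj) is given by the

following expression:

$$CC = \frac{\sum_j (x_j - \langle x \rangle)(y_j - \langle y \rangle)}{\sqrt{\sum_j (x_j - \langle x \rangle)^2 \, \sum_j (y_j - \langle y \rangle)^2}}$$

(Eq. 1)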

The reflective CC, or CC0, is obtained when the same expression is evaluated with the

mean values replaced by zero (Eq. 2).

$$CC_0 = \frac{\sum_j x_j y_j}{\sqrt{\sum_j x_j^2 \, \sum_j y_j^2}}$$

(Eq. 2)

Both Pearson and reflective CCs can provide useful insights into the information content

of experimental data sets and the reproducibility of measurements.
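
The two coefficients differ only in whether the means are subtracted before the products are formed. A minimal Python sketch of both follows; the function names and the four-point data set are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient: deviations from the means are correlated."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

def reflective_cc(x, y):
    """Reflective correlation coefficient CC0: same expression with the means set to zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Made-up example: y decreases exactly as x increases, but all values are positive.
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 3.0, 2.0, 1.0]
cc = pearson_cc(x, y)
cc0 = reflective_cc(x, y)
print(round(cc, 3), round(cc0, 3))
```

For these perfectly anti-correlated values the Pearson CC is -1, whereas CC0 is about 0.67, because CC0 is dominated by the fact that every value is positive.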

Although some crystallographic applications of reflective CC0 were reported in

the 1960s,3,4 until recently, they were seldom used. Those early studies demonstrated that

the CC0’s of the amplitudes of crystallographic data sets are not good measures of overall

data quality because the range of values possible for that statistic is so small. If two non-

centrosymmetric data sets are being compared, the value of CC0 will be 100% if the two

are identical, but fall only to 78.5% (= π/4) if the two are completely unrelated (see

Online Supporting Materials, OSM). These limits hold for the CC0 values obtained within

thin shells of constant resolution, and for cumulative (or overall) CC0 values. Another

reason correlation coefficients were slow to gain a foothold is that data quality is seldom

an issue in small molecule crystallography. The data are usually accurately measured, and

the models derived from them often account well for them [e.g. Ref.5].
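
The 78.5% limit quoted above for unrelated non-centrosymmetric data sets can be recovered with a short calculation. The sketch below is not the derivation given in the OSM; it assumes that the amplitudes F_1 and F_2 of the two data sets are statistically independent and both follow the acentric Wilson (Rayleigh) distribution with the same mean-square value Σ. These symbols are introduced here for illustration only.

```latex
\[
\langle F \rangle = \int_0^\infty F \, \frac{2F}{\Sigma} \, e^{-F^2/\Sigma} \, dF
                  = \frac{\sqrt{\pi \Sigma}}{2},
\qquad
\langle F^2 \rangle = \Sigma
\]
\[
CC_0 = \frac{\langle F_1 F_2 \rangle}{\sqrt{\langle F_1^2 \rangle \langle F_2^2 \rangle}}
     = \frac{\langle F \rangle^2}{\langle F^2 \rangle}
     = \frac{\pi}{4} \approx 0.785
\]
```

Because amplitudes are never negative, the numerator of Eq. 2 cannot average to zero, which is why CC0 spans such a narrow range.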

In structural biology the situation can be quite different. The data are often harder

to measure well, and the models inferred from them commonly fail to explain the data as

accurately as they were measured.6 As Karplus and Diederichs pointed out a few years

ago, Pearson CCs can play as useful a role in this arena7 as CC0’s (i.e., Fourier shell

correlations) do in electron microscopy (EM), where they have been used for years

to estimate the resolution of EM maps.8,9

The CC1/2 is particularly useful for detecting the presence of weak signals in the

high-resolution shells of crystallographic data sets. The measurements used to estimate

the intensities of each reflection are partitioned into two non-overlapping sets of equal

size that are then used to obtain two independent sets of intensity estimates. CC1/2 is the

Pearson correlation coefficient obtained by comparing these two sets of intensities. As

mentioned earlier, the calculations are usually done after the two sets of intensities have

been divided into thin shells of increasing resolution so that the dependence of CC1/2 on

resolution can be determined. At low resolution, the CC1/2 of crystallographic data sets is

usually close to 100%, and it tends to fall off quite rapidly as the resolution limit of the

data is reached.7,10 Strong reflections in a data set, which are almost always better

measured than weak reflections, tend to dominate correlation coefficients, and that is why

CC1/2 is a better tool for detecting signals in noisy data than data quality indexes that treat

each Bragg reflection equally, such as R(diff), the average fractional intensity difference

between the two half-data sets. More important, CC1/2 can identify very weak signals in

noisy data because it is reliable and raises no scaling issue (even when the two sets of

intensities have to be scaled, its value is independent of the overall scale factor). In

contrast, traditional R-factors such as Rsym, Rmerge, and Rmeas (see Ref.11 for definitions)

may fail to do so because their values are sensitive to the linear and Wilson B scale

factors applied to the individual images that contributed to the two sets of intensities that

are being compared, and those scale factors can be hard to determine accurately when the

data are noisy.
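
The shell-by-shell procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' software: the function names, the equal-count binning on 1/d², and the eight toy intensities are all hypothetical.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

def cc_half_by_shell(half1, half2, inv_d2, n_shells):
    """CC1/2 in n_shells shells holding (nearly) equal numbers of reflections.

    half1, half2: intensity estimates from the two half-data sets.
    inv_d2: 1/d^2 for each reflection, used only to order them by resolution.
    """
    order = sorted(range(len(inv_d2)), key=lambda k: inv_d2[k])
    size = len(order) // n_shells
    ccs = []
    for s in range(n_shells):
        start = s * size
        stop = (s + 1) * size if s < n_shells - 1 else len(order)
        idx = order[start:stop]
        ccs.append(pearson_cc([half1[k] for k in idx], [half2[k] for k in idx]))
    return ccs

# Toy data: four strong, reproducible low-resolution reflections and
# four weak, poorly reproduced high-resolution reflections.
half1 = [100.0, 400.0, 900.0, 250.0, 4.0, 9.0, 2.0, 7.0]
half2 = [110.0, 390.0, 880.0, 260.0, 5.0, 6.0, 4.0, 3.0]
inv_d2 = [0.01, 0.02, 0.03, 0.04, 0.10, 0.11, 0.12, 0.13]

shell_ccs = cc_half_by_shell(half1, half2, inv_d2, n_shells=2)
cumulative_cc = pearson_cc(half1, half2)
print([round(c, 3) for c in shell_ccs], round(cumulative_cc, 3))
```

In this toy case the high-resolution shell has a CC1/2 of about 0.33, while the value computed over all eight reflections stays above 0.99 because the strong reflections dominate it.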

Results and discussion

The R(diff) and Pearson CC1/2 values of crystallographic data sets are related to one

another. If the distribution of intensities in an X-ray diffraction data set obeys Wilson

statistics,12 which it will if properly measured, and the noise in those data is Gaussian

[OSM & Ref.7,11], it can be shown that:

$$CC_{1/2} = \frac{1}{1 + a\,R(\mathrm{diff})^2},$$ (Eq. 3)

where a is a constant that depends only on the symmetry-related multiplicity, and that,

typically, has a value between 1.0 and 2.0. This relationship is independent of both the

average resolution of the reflections in each resolution shell and the thickness of those

shells, which means that it is as applicable to the CC1/2 values of entire data sets (i.e.,

cumulative CC1/2’s) as it is to the CC1/2 values of individual resolution shells. R(diff) in

this equation is a lower-bound estimate because it assumes that the intensities being

compared can be, and have already been, properly scaled.

The data released recently for four of the X-ray free-electron laser (XFEL)

structures of photosystem II (PSII) [5WS5, 5WS6, 5WS0, and 5GTI]13 are complete

enough so that one can see whether equation 3 applies to real data. As Figure 1 shows,

both the CC1/2 values for thin shells of resolution from these data sets and their

cumulative CC1/2 values conform to the upper bound curve predicted by equation 3 (i.e.,

the a = 1.0 curve). Analyses done on other data sets (data not shown), as well as studies

published by Karplus and Diederichs11 further support the conclusion that equation 3 is

generally valid.

When cumulative CC1/2 values are compared to the dependence of CC1/2 on

resolution from the same data sets, it becomes obvious that cumulative CC1/2 values are

insensitive measures of data quality. The cumulative CC1/2 values of the four PSII data

sets mentioned above range between 99.4% and 99.7%, even though CC1/2 falls to ~ 50%

in the highest resolution shell in each of them. The cumulative CC1/2 value of another,

unrelated data set (4IR3),10 which was obtained using conventional synchrotron methods,

is only about one percentage point smaller, 98.5%, even though the CC1/2 value of that data

set in its highest resolution shell is much lower, 13.5%.

Problems can arise when cumulative CC1/2 values fall below ~ 95%, as is the case

for some of the other XFEL data sets reported for PSII.14-17 For example, the 3.0-Å

resolution data set for 5KAF has a cumulative CC1/2 value of 53.2%, and the cumulative

CC1/2 value for the 2.8-Å resolution data set that corresponds to 5KAI is 54.2%.17 These

cumulative CC1/2 values imply that the lower-bound value of R(diff) for these two data

sets, as a whole, is likely to be 92% to 93% (!) (Fig. 1). In two of the other XFEL data

sets (4IXR and 4IXQ) that have been reported for PSII the cumulative CC1/2 values are

66.5% and 79.1%, respectively,15 which suggests R(diff) values that are only slightly

better, ~ 71% and 51%. Thus by normal crystallographic standards, the quality of all of

these data sets is very poor.
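
The conversion from a cumulative CC1/2 value to a lower-bound R(diff) can be sketched as follows. The sketch assumes that equation 3 has the form CC1/2 = 1/(1 + a·R(diff)²), with both quantities expressed as fractions and a = 1.0; that form is an assumption checked only against the 4IXR and 4IXQ values quoted above, and the function name is hypothetical.

```python
import math

def r_diff_lower_bound(cc_half, a=1.0):
    """Invert the assumed relation CC1/2 = 1 / (1 + a * R^2) for R.

    cc_half and the returned R(diff) are fractions, not percentages.
    """
    if not 0.0 < cc_half <= 1.0:
        raise ValueError("cc_half must lie in (0, 1]")
    return math.sqrt((1.0 / cc_half - 1.0) / a)

# Cumulative CC1/2 values quoted in the text for two PSII XFEL data sets.
for pdb_id, cc_half in [("4IXR", 0.665), ("4IXQ", 0.791)]:
    print(pdb_id, round(100.0 * r_diff_lower_bound(cc_half), 1))
# prints:
# 4IXR 71.0
# 4IXQ 51.4
```

Under that assumption the function returns roughly 71% and 51%, in line with the estimates given in the text for these two data sets.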

The four data sets just discussed are similar in quality to many of the other XFEL

data sets that have been deposited in the PDB. Diederichs and colleagues have expressed

concern about the quality of some of them, and about the wisdom of using cumulative

CC1/2 values as a measure of data set quality.18,19 It has been recommended that individual

values for both low- and high-resolution bins be reported, instead of just an overall

value.11 Nevertheless, cumulative CC1/2 values are sometimes the only data quality

statistics reported [e.g. Ref.17,20]. It should be noted that the overall CC value is also

being used as a measure of the correspondence between cryo-EM maps and the atomic

models derived from them [e.g. Ref.21]. This practice too is suspect.

How much structural information can there really be in the data sets that have

quality statistics like those of the four data sets mentioned above? Rossmann has asked

the same question about other XFEL data sets.22,23 The reply sometimes given is that the

quality of data cannot be as bad as it seems because the electron density (ED) maps

obtained from them by molecular replacement “look good”, and the model R-factors

appear acceptable. We have explored this issue computationally in two different ways.

First, starting with the 5KAI model for PSII,17 a 2.80-Å resolution ED map was

calculated using model phases together with a set of amplitudes obtained by Fourier

transformation of a molecular model that was generated from 5KAI by shifting the

positions of all protein atoms at random so that the distribution of their displacements is a

Gaussian function with σ|∆r| = 1.038 Å (see OSM). This shifted-coordinate model is

unphysical, and its transform differs from that of its parent structure, on average, by

42.5% in amplitude and by 67.6% in intensity, which is a large discrepancy. Nevertheless, the

cumulative intensity Pearson CC between these two data sets is 75%, consistent with the

arguments made above about the insensitivity of that statistic to data quality.

Furthermore, the ED map computed using these coordinate-shifted amplitudes and model

phases is strikingly similar to the one obtained using the measured amplitudes and model

phases (Fig. 2A, and 2B, top panels). In fact, the fit of the original protein model to the

ED map calculated using both the amplitudes and phases obtained from the coordinate-

shifted model is reasonably good, and not obviously worse than its fit to the ED map

computed using coordinate-shifted amplitudes but the original model phases (data not

shown).

In the second test, the 5KAI experimental data were divided into 100 thin shells

of resolution, each containing the same number of reflections. A second data set was

obtained from the first by assigning measured amplitudes to Bragg indices at random

within each shell (see OSM). This procedure is called random permutation, and it has

been used in the past to investigate the statistical properties of small molecule crystallographic

data.24 The average Pearson CC between the parent data and the randomized data in each

shell was zero for all intents and purposes: ~ -0.001 ± 0.021. Furthermore, on average,

amplitudes had changed by 55.7% and intensities by 101.9%, which is close to what

would be seen if diffraction data sets were compared that had been obtained from two

unrelated, non-centrosymmetric crystals that happened to have the same symmetry and

unit cell dimensions (58.6% and 100%).25,26 Nevertheless, the cumulative Pearson CC

between the permuted data set and its parent was 22%, rather than the 0% one might have

anticipated (see below). Figures 2A and 2C (top panels) show the same small region of

two ED maps, one (ρobs) computed using the original amplitudes and model phases (Fig.

2A, Fig. S1A), and the other (ρchimeric) computed using these randomized amplitudes and the

original model phases (Fig. 2C, Fig. S1B). Once again, the similarity of the two

is striking. This observation should not come as a surprise because similar results were

reported for computations carried out using small molecule data over half a century

ago.26-29
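
The permutation test can be imitated with synthetic data. The sketch below does not use the 5KAI data: it assumes acentric Wilson statistics (exponentially distributed intensities) within each of 100 shells and a hypothetical exponential fall-off of the mean intensity with resolution, so the numbers it produces illustrate the effect rather than reproduce the values reported above.

```python
import math
import random

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

rng = random.Random(0)          # fixed seed so the run is repeatable
n_shells, per_shell = 100, 200  # hypothetical sizes

parent, permuted, shell_ccs = [], [], []
for s in range(n_shells):
    # Hypothetical fall-off of the shell mean intensity with resolution.
    mean_intensity = 1000.0 * math.exp(-5.0 * s / n_shells)
    shell = [rng.expovariate(1.0 / mean_intensity) for _ in range(per_shell)]
    shuffled = shell[:]
    rng.shuffle(shuffled)       # reassign intensities at random within the shell
    shell_ccs.append(pearson_cc(shell, shuffled))
    parent.extend(shell)
    permuted.extend(shuffled)

mean_shell_cc = sum(shell_ccs) / n_shells
cumulative_cc = pearson_cc(parent, permuted)

# Average fractional changes in intensity and in amplitude (square root of intensity).
r_intensity = sum(abs(a - b) for a, b in zip(parent, permuted)) / sum(parent)
r_amplitude = (sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(parent, permuted))
               / sum(math.sqrt(a) for a in parent))

print(f"mean CC within shells: {mean_shell_cc:.3f}")
print(f"cumulative CC:         {cumulative_cc:.3f}")
print(f"fractional change:     intensity {r_intensity:.3f}, amplitude {r_amplitude:.3f}")
```

With these assumptions the within-shell CC averages close to zero and the fractional changes come out near 100% in intensity and near 59% in amplitude, yet the cumulative CC is well above zero, because both data sets share the same fall-off of mean intensity with resolution.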

What these computations illustrate is the impact that model bias can have on the

outcome of a structure determination that depends on molecular replacement, a problem

that has long been recognized, but may still be underappreciated. They should also serve as a

warning that the claim that an ED map “looks good” is no guarantee that the data on

which it is based are meaningful.

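The similarity between ρobs and ρchimeric maps is easy to understand.

$$\rho_{obs}(\mathbf{r}) = \mathrm{FT}[F_{obs}, \alpha_{obs}];$$ (Eq. 4)

$$\begin{aligned}
\rho_{chimeric}(\mathbf{r}) &= \mathrm{FT}[F_{alt}, \alpha_{obs}] \\
&= \mathrm{FT}[F_{obs}, \alpha_{obs}] + \mathrm{FT}[(F_{alt} - F_{obs}), \alpha_{obs}] \\
&= \rho_{obs}(\mathbf{r}) + \mathrm{FT}[\Delta F, \alpha_{obs}];
\end{aligned}$$ (Eq. 5)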

where FT indicates Fourier transformation, Fobs and αobs are the amplitudes and phases of

the experimentally observed structure factors, Falt are the randomly altered amplitudes, and

∆F are the differences between the altered and experimental amplitudes. The second term in the last line of equation 5 is

a difference Fourier map. It is well known, of course, that if the amplitudes identified

here as Falt derive from crystals that are isomorphous to those used to produce the Fobs

data, and the structures of the molecules in the two crystals are closely related, but not

identical, the FT[∆F,αobs] map obtained will reveal the specific differences between the

two structures.30-32 However, for the computations just described, the amplitude

differences are completely uncorrelated with the structure that gave rise to the model

phases. Thus, this component of the chimeric maps should consist of random features that

extend throughout the unit cell, including its solvent channels, and bear no relationship to

ρobs(r) (Fig. 2). That this is in fact the case is evident in the lower panels in Figure 2 that

are centered on the solvent region in the crystal of interest, which show a much larger

portion of the unit cell than the top panels. The solvent region is almost featureless in the

map computed using model phases and the observed amplitudes (Fig. 2A, lower panels).

It is significantly noisier in the map computed using model phases and amplitudes

derived from a randomly distorted model (Fig. 2B, lower panels), and it is as feature-filled

as the parts of the unit cell that contain protein in the map obtained with randomly

permuted intensities and model phases (Fig. 2C, lower panels). There are other

indications that all is not well with the two randomized data sets. First, the R factors that

measure the correspondence between the observed amplitudes and the computed data are

poor: 42.5% and 55.7%. Second, the density histograms within the protein-containing regions of

the two maps are obviously abnormal.
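
The decomposition in equation 5 is a consequence of the linearity of the Fourier transform, and it can be checked numerically on a toy problem. The sketch below is not the computation performed on 5KAI: it uses a hypothetical one-dimensional "crystal" of eight point atoms on a 64-point grid, and all names in it are illustrative.

```python
import cmath
import math
import random

N = 64  # grid points in a hypothetical one-dimensional unit cell

def structure_factors(rho):
    """F(h) = sum_x rho(x) * exp(+2*pi*i*h*x/N) for h = 0 .. N-1."""
    return [sum(r * cmath.exp(2j * math.pi * h * x / N) for x, r in enumerate(rho))
            for h in range(N)]

def fourier_map(amplitudes, phases):
    """FT[F, alpha]: real-space map from amplitudes (which may be negative) and phases."""
    F = [cmath.rect(a, p) for a, p in zip(amplitudes, phases)]
    return [sum(f * cmath.exp(-2j * math.pi * h * x / N) for h, f in enumerate(F)).real / N
            for x in range(N)]

def pearson_cc(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

rng = random.Random(1)
rho_model = [0.0] * N
for position in rng.sample(range(N), 8):   # eight point "atoms"
    rho_model[position] = 1.0

F_model = structure_factors(rho_model)
F_obs = [abs(f) for f in F_model]          # stands in for the measured amplitudes
alpha = [cmath.phase(f) for f in F_model]  # model phases

# Reassign amplitudes at random among h = 1 .. N/2-1, copying each value onto
# its Friedel mate N-h so that the resulting map remains real.
half = list(range(1, N // 2))
source = half[:]
rng.shuffle(source)
F_alt = F_obs[:]
for h, src in zip(half, source):
    F_alt[h] = F_alt[N - h] = F_obs[src]

delta_F = [a - o for a, o in zip(F_alt, F_obs)]

rho_obs = fourier_map(F_obs, alpha)
rho_chimeric = fourier_map(F_alt, alpha)
rho_difference = fourier_map(delta_F, alpha)

max_residual = max(abs(c - (o + d))
                   for c, o, d in zip(rho_chimeric, rho_obs, rho_difference))
map_cc = pearson_cc(rho_obs, rho_chimeric)
print(f"max |rho_chimeric - (rho_obs + FT[dF, alpha])| = {max_residual:.2e}")
print(f"CC between rho_obs and rho_chimeric = {map_cc:.2f}")
```

The residual is at the level of floating-point rounding, which is equation 5 in numerical form. Because the phases are unchanged and amplitudes are never negative, the correlation between the two maps is positive however the amplitudes are shuffled, which is the essence of model-phase bias.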

It is straightforward to convert intensity R(diff) estimates into estimates of

intensity and amplitude R(sigma) values (OSM),4 and when this is done one finds that

the intensity R(sigma) values for the 5KAF and 5KAI data sets17 should be 65.1% to

65.8%, and their amplitude R(sigma) values should be 37.6% to 38.0%. In light of what

has just been said, it is quite surprising that the amplitude free R-factors reported for the

models derived from these two data sets are significantly lower than these lower bound

estimates of amplitude R(sigma) values of the data from which they derive: 30.30%

(5KAF) and 29.97% (5KAI).17 How can a model derived from a set of data explain those

data more accurately than the data explain themselves? These observations suggest that

both models are over-fitted, and one has to ask why the cross validation procedures used,

which are supposed to prevent over-fitting,33 failed to do so in both cases.

Examples of poor quality data sets can also be found in the electron

crystallographic literature. For example, the overall data merging R-factor34 reported for

a data set obtained from proteinase K crystals at a resolution between 21.91 and 1.30 Å

was 62.9%. Yet, the authors were able to produce what appears to be an outstanding

electrostatic potential (ESP) map of that molecule by molecular replacement, even though

the model they used for molecular replacement took no account of the impact that partial

charges have on electron scattering factors, which is known to be large.35-38 In addition,

a recent analysis done by Spence and colleagues has demonstrated that a number of

electron crystallographic structures are computational artifacts because of the influence that

model bias had on the ESP density functions on which they are based.39

In the past, structural biologists were more concerned about the quality of their

diffraction data than the quantity, and often adopted a very conservative approach40 to

this problem by, for example, rejecting all the data they had measured beyond the

resolution at which the average signal-to-noise ratio dropped to 2.0. The introduction of

the CC1/2 method for evaluating resolution limits has encouraged structural biologists to

use high resolution data they would otherwise have ignored, and by so doing, they have

been able to arrive at better models.7,10,11,41

Nevertheless, the fact that it is a good idea to use high-resolution data that have a

CC1/2 value of, say, 50%, does not mean that it is a good idea to use data sets that have a

cumulative CC1/2 value of 50%. The cumulative Pearson CC1/2 statistic is not very

sensitive to overall data quality, as we have demonstrated above, and while we have not

found a rigorous way to estimate a lower bound value for this parameter, that bound is

clearly greater than zero, and it is easy to understand why. If the average intensities of the

reflections in two data sets have the same dependence on resolution, then, even if the two

data sets are completely unrelated, their cumulative Pearson CC will be greater than zero

because (i) strong reflections predominate at low resolution and weak reflections

predominate at high resolution, and (ii) the two uncorrelated data sets are likely to have

similar mean intensity values within each resolution shell. It appears from the results

described above that the cumulative CC1/2 of a data set that has been well measured by

traditional reproducibility criteria, such as R(diff), R(split), R(sigma), R(merging),

R(meas), or R(pim) (see Ref. 11, for definitions), will certainly exceed 90%.
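
The argument that the cumulative CC of two unrelated data sets must exceed zero can be written out with the law of total covariance. What follows is an idealized sketch, not the rigorous lower bound that the text says is still lacking. It assumes equally populated resolution shells s, and that within each shell the two data sets x and y are uncorrelated and share a common mean μ_s and variance σ_s²; these symbols are introduced here for illustration only.

```latex
\[
\mathrm{Cov}(x, y)
  = \underbrace{\big\langle \mathrm{Cov}(x, y \mid s) \big\rangle_s}_{=\,0}
  + \mathrm{Var}_s(\mu_s),
\qquad
\mathrm{Var}(x) = \mathrm{Var}(y)
  = \langle \sigma_s^2 \rangle_s + \mathrm{Var}_s(\mu_s)
\]
\[
CC_{\mathrm{cumulative}}
  = \frac{\mathrm{Var}_s(\mu_s)}{\langle \sigma_s^2 \rangle_s + \mathrm{Var}_s(\mu_s)}
\]
```

For acentric Wilson statistics σ_s = μ_s, so the expression becomes Var_s(μ_s) / (⟨μ_s²⟩_s + Var_s(μ_s)), which is zero only if the mean intensity is the same in every shell. The steeper the fall-off of intensity with resolution, the larger the cumulative CC of completely unrelated data.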

Over the last few decades, molecular replacement has become the preferred

method for solving crystal structures in structural biology. It is usually the case that the

molecular model used is not a fully accurate representation of the molecules in the

crystals of interest, and the hope is that the ED map that ultimately emerges will not be

identical to that anticipated for the starting model, and that new insights will emerge

when those differences are interpreted. This expectation is unlikely to be satisfied if the

quality of the data set under consideration is low because of the overwhelming impact

that model-phase bias will have on outcomes, as discussed above. Indeed, the recent

literature provides an example of a failure of just this kind that resulted when an XFEL

data set having a resolution of 1.35 Å, and a cumulative CC1/2 of 81.3% was used to

compute a difference map [Figure 6A in Ref. 20].

In conclusion, the CC1/2 method Karplus and Diederichs devised for assessing

data quality was an important advance in macromolecular crystallography.11 By contrast,

the cumulative CC1/2, although related, is a statistic best avoided because it is ill behaved

and less informative than more traditional statistics like R(merge).11 When it is used, it

must be supplemented with other relevant statistics.

Acknowledgement

The authors thank Drs. P.A. Karplus, M. Rossmann, J. Spence, and B.W. Matthews for

critical comments on this manuscript. This work was supported in part by National

Institutes of Health Grant P01 GM022778 (JW), and the U.S. Department of Energy,

Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences,

Geosciences, and Biosciences, via Grants DESC0001423 (VSB), and DE-FG0205ER15646 (GWB).

Conflict of interest statement

The authors declare no conflict of interest in publishing results of this study.

Figure Legends

Figure 1. The relationship between intensity CC1/2 and R(diff) values. The two theoretical limits for the relationship between these two quantities are shown using cyan and magenta solid lines. CC1/2 and R(diff) values have been computed for four XFEL experimental data sets for PSII (5WS5, black spheres; 5WS6, red; 5WS0, green; 5GTI, blue) as a function of resolution. Those data follow the magenta curve, as do their cumulative CC1/2 values (see the horizontal and vertical dotted lines having the same colors).

Figure 2. Electron density (ED) maps. (A) A portion of the ED map computed for 5KAI using observed amplitudes and model phases, with the 5KAI model superimposed. (B) The ED map for the same region as (A) computed by combining (unaltered) 5KAI phases with amplitudes obtained from a model that was derived from 5KAI by shifting the locations of all protein atoms at random with a standard deviation of 1.038 Å. The positions of the atoms in the partially randomized model used to compute these amplitudes are shown as colored spheres. (C) The ED map obtained for the same region shown in (A) using randomly permuted amplitudes and phases from the original model. The 5KAI model is again superimposed. The first row shows map features contoured at +1.5σ for an α-helix next to the oxygen-evolving complex (OEC) of photosystem II. The lower two rows, contoured at +0.5σ and +1.0σ, show map features in solvent channels.

REFERENCES

1. Pearson K (1896) Mathematical contributions to the theory of evolution. III. Regression, Heredity and Panmixia. Philos Trans of Royal Society London 187:253-318.

2. Evans JD (1996) Straightforward statistics for the behavioral sciences. Brooks/Cole Publishing Co., Pacific Grove.

3. Srinivasan R, Chandrasekaran R (1966) Correlation functions connected with structure factors and their application to observed and calculated structure factors. Indian J Pure Ap Phy 4:178-182.

4. Srinivasan R, Parthasarathy S (1976) Some statistical applications in X-ray crystallography. Pergamon Press, Oxford; New York.

5. Koritsanszky T, Flaig R, Zobel D, Krane H, Morgenroth W, Luger P (1998) Accurate experimental electronic properties of dl-proline monohydrate obtained within 1 Day. Science 279:356-358.

6. Holton JM, Classen S, Frankel KA, Tainer JA (2014) The R-factor gap in macromolecular crystallography: an untapped potential for insights on accurate structures. FEBS J 281:4046-4060.

7. Karplus PA, Diederichs K (2012) Linking crystallographic model and data quality. Science 336:1030-1033.

8. Rosenthal PB, Henderson R (2003) Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. J Mol Biol 333:721-745.

9. van Heel M (1987) Similarity measures between images. Ultramicroscopy 21:95-99.

10. Wang J, Wing RA (2014) Diamonds in the rough: a strong case for the inclusion of weak-intensity X-ray diffraction data. Acta Cryst D70:1491-1497.

11. Karplus PA, Diederichs K (2015) Assessing and maximizing data quality in macromolecular crystallography. Current Opinion in Structural Biology 34:60-68.

12. Wilson AJC (1949) The probability distribution of X-ray intensities. Acta Cryst 2:318-321.

13. Suga M, Akita F, Sugahara M, Kubo M, Nakajima Y, Nakane T, Yamashita K, Umena Y, Nakabayashi M, Yamane T, Nakano T, Suzuki M, Masuda T, Inoue S, Kimura T, Nomura T, Yonekura S, Yu LJ, Sakamoto T, Motomura T, Chen JH, Kato Y, Noguchi T, Tono K, Joti Y, Kameshima T, Hatsui T, Nango E, Tanaka R, Naitow H, Matsuura Y, Yamashita A, Yamamoto M, Nureki O, Yabashi M, Ishikawa T, Iwata S, Shen JR (2017) Light-induced structural changes and the site of O=O bond formation in PSII caught by XFEL. Nature 543:131-135.

14. Kern J, Alonso-Mori R, Hellmich J, Tran R, Hattne J, Laksmono H, Glockner C, Echols N, Sierra RG, Sellberg J, Lassalle-Kaiser B, Gildea RJ, Glatzel P, Grosse-Kunstleve RW, Latimer MJ, McQueen TA, DiFiore D, Fry AR, Messerschmidt M, Miahnahri A, Schafer DW, Seibert MM, Sokaras D, Weng TC, Zwart PH, White WE, Adams PD, Bogan MJ, Boutet S, Williams GJ, Messinger J, Sauter NK, Zouni A, Bergmann U, Yano J, Yachandra VK (2012) Room temperature femtosecond X-ray diffraction of photosystem II microcrystals. Proc Natl Acad Sci USA 109:9721-9726.

15. Kern J, Alonso-Mori R, Tran R, Hattne J, Gildea RJ, Echols N, Glockner C, Hellmich J, Laksmono H, Sierra RG, Lassalle-Kaiser B, Koroidov S, Lampe A, Han G, Gul S, Difiore D, Milathianaki D, Fry AR, Miahnahri A, Schafer DW, Messerschmidt M, Seibert MM, Koglin JE, Sokaras D, Weng TC, Sellberg J, Latimer MJ, Grosse-Kunstleve RW, Zwart PH, White WE, Glatzel P, Adams PD, Bogan MJ, Williams GJ, Boutet S, Messinger J, Zouni A, Sauter NK, Yachandra VK, Bergmann U, Yano J (2013) Simultaneous femtosecond X-ray spectroscopy and diffraction of photosystem II at room temperature. Science 340:491-495.

16. Kern J, Tran R, Alonso-Mori R, Koroidov S, Echols N, Hattne J, Ibrahim M, Gul S, Laksmono H, Sierra RG, Gildea RJ, Han G, Hellmich J, Lassalle-Kaiser B, Chatterjee R, Brewster AS, Stan CA, Glockner C, Lampe A, DiFiore D, Milathianaki D, Fry AR, Seibert MM, Koglin JE, Gallo E, Uhlig J, Sokaras D, Weng TC, Zwart PH, Skinner DE, Bogan MJ, Messerschmidt M, Glatzel P, Williams GJ, Boutet S, Adams PD, Zouni A, Messinger J, Sauter NK, Bergmann U, Yano J, Yachandra VK (2014) Taking snapshots of photosynthetic water oxidation using femtosecond X-ray diffraction and spectroscopy. Nature Comm 5:4371.

17. Young ID, Ibrahim M, Chatterjee R, Gul S, Fuller FD, Koroidov S, Brewster AS, Tran R, Alonso-Mori R, Kroll T, Michels-Clark T, Laksmono H, Sierra RG, Stan CA, Hussein R, Zhang M, Douthit L, Kubin M, de Lichtenberg C, Vo Pham L, Nilsson H, Cheah MH, Shevela D, Saracini C, Bean MA, Seuffert I, Sokaras D, Weng TC, Pastor E, Weninger C, Fransson T, Lassalle L, Brauer P, Aller P, Docker PT, Andi B, Orville AM, Glownia JM, Nelson S, Sikorski M, Zhu D, Hunter MS, Lane TJ, Aquila A, Koglin JE, Robinson J, Liang M, Boutet S, Lyubimov AY, Uervirojnangkoorn M, Moriarty NW, Liebschner D, Afonine PV, Waterman DG, Evans G, Wernet P, Dobbek H, Weis WI, Brunger AT, Zwart PH, Adams PD, Zouni A, Messinger J, Bergmann U, Sauter NK, Kern J, Yachandra VK, Yano J (2016) Structure of photosystem II and substrate binding at room temperature. Nature 540:453-457.

18. Assmann G, Brehm W, Diederichs K (2016) Identification of rogue datasets in serial crystallography. J Appl Crystallogr 49:1021-1028.

19. Diederichs K (2017) Dissecting random and systematic differences between noisy composite data sets. Acta Cryst D73:286-293.

20. Uervirojnangkoorn M, Zeldin OB, Lyubimov AY, Hattne J, Brewster AS, Sauter NK, Brunger AT, Weis WI (2015) Enabling X-ray free electron laser crystallography for challenging biological systems from a limited number of crystals. Elife 4:e05421(05421-05429).

21. Hryc CF, Chen DH, Afonine PV, Jakana J, Wang Z, Haase-Pettingell C, Jiang W, Adams PD, King JA, Schmid MF, Chiu W (2017) Accurate model annotation of a near-atomic resolution cryo-EM map. Proc Natl Acad Sci U S A 114:3103-3108.

22. Gati C, Bourenkov G, Klinge M, Rehders D, Stellato F, Oberthur D, Yefanov O, Sommer BP, Mogk S, Duszenko M, Betzel C, Schneider TR, Chapman HN, Redecke L (2014) Serial crystallography on in vivo grown microcrystals using synchrotron radiation. IUCrJ 1:87-94.

23. Rossmann MG (2014) Serial crystallography using synchrotron radiation. IUCrJ 1:84-86.

24. Srinivasan R, Srikrishnan T (1966) New statistical tests for distinguishing between centrosymmetric and non-centrosymmetric structures. Acta Cryst 21:648-652.


25. Wilson AJC (1950) Largest likely values for the reliability index. Acta Cryst 3:397-398.

26. - (1974) Largest likely values of residuals. Acta Cryst A30:836-838.

27. Ramachandran G, Srinivasan R (1961) An apparent paradox in crystal structure analysis. Nature 190:159-161.

28. Srinivasan R (1961) The significance of the phase synthesis. Proc Indian Acad Sci 53A:252-264.

29. Lattman E, DeRosier D (2008) Why phase errors affect the electron function more than amplitude errors. Acta Cryst A64:341-344.

30. Stryer L, Kendrew JC, Watson HC (1964) The Mode of Attachment of the Azide Ion to Sperm Whale Metmyoglobin. J Mol Biol 8:96-104.

31. Kraut J (1965) Structural Studies with X-Rays. Annual Rev Biochem 34:247-268.

32. Wang J, Mauro M, Edwards SL, Oatley SJ, Fishel LA, Ashford VA, Xuong NH, Kraut J (1990) X-ray structures of recombinant yeast cytochrome c peroxidase and three heme-cleft mutants prepared by site-directed mutagenesis. Biochemistry 29:7160-7173.

33. Brunger AT (1992) Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355:472-475.

34. Hattne J, Shi D, de la Cruz MJ, Reyes FE, Gonen T (2016) Modeling truncated pixel values of faint reflections in MicroED images. J Appl Crystallogr 49:1029-1034.

35. Wang J, Moore PB (2017) On the interpretation of electron microscopic maps of biological macromolecules. Protein Science 26:122-129.

36. Wang J (2017) On the appearance of carboxylates in electrostatic potential maps. Protein Science 26:396-402.

37. - (2017) Experimental charge density from electron microscopic maps. Protein Science 26:1619-1629.

38. Wang J, Videla PE, Batista VS (2017) Effects of aligned alpha-helix peptide dipoles on experimental electrostatic potentials. Protein Science 26:1692-1697.

39. Subramanian G, Basu S, Liu H, Zuo JM, Spence JC (2015) Solving protein nanocrystals by cryo-EM diffraction: multiple scattering artifacts. Ultramicroscopy 148:87-93.

40. Wang J (2015) Estimation of the quality of refined protein crystal structures. Protein Science 24:661-669.

41. Diederichs K, Karplus PA (2013) Better models by discarding data? Acta Cryst D69:1215-1222.


Figure 1. The relationship between intensity CC1/2 and R(split) values. The two theoretical limits for the relationship between these two quantities are shown using cyan and magenta solid lines. CC1/2 and R(split) values have been computed for four XFEL experimental data sets for PSII (5WS5, black spheres; 5WS6, red; 5WS0, green; 5GTI, blue) as a function of resolution. Those data follow the magenta curve, as do their cumulative CC1/2 values (see the horizontal and vertical dotted lines having the same colors).


Figure 2. Electron density (ED) maps. (A) A portion of the ED map computed for 5KAI using observed amplitudes and model phases, with the 5KAI model superimposed. (B) The ED map for the same region as (A), computed by combining (unaltered) 5KAI phases with amplitudes obtained from a model that was derived from 5KAI by shifting the locations of all protein atoms at random with a standard deviation of 1.038 Å. The positions of the atoms in the partially randomized model used to compute the amplitudes are shown as colored spheres. (C) The ED map obtained for the same region shown in (A) using randomly permuted amplitudes and phases from the original model. The 5KAI model is again superimposed. The first row shows map features contoured at +1.5σ for an α-helix next to the oxygen-evolving center (OEC) of photosystem II. The lower two rows, contoured at +0.5σ and +1.0σ, show map features in solvent channels.


On the relationship between cumulative correlation coefficients and the quality of crystallographic data sets

Jimin Wang1,*, Gary W. Brudvig1,2, Victor S. Batista2, Peter B. Moore1,2

Reflective correlation coefficient on amplitudes of two unrelated, non-centrosymmetric structures (Ref. 3).

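Let $z_1$ and $z_2$ represent the normalized intensities of two unrelated, non-centrosymmetric diffraction data sets. Both their mean values and standard deviations will be unity, and the probability distribution of their intensities will follow Wilson statistics, i.e. $P(z)\,dz = \exp(-z)\,dz$. With amplitudes $F = \sqrt{z}$, so that $\langle F^2 \rangle = \langle z \rangle = 1$, the reflective correlation coefficient of the two data sets, on amplitudes, is as follows.

$$
CC(F_1, F_2) = \frac{\sum F_1 F_2}{\sqrt{\sum F_1^2 \sum F_2^2}} = \langle F_1 F_2 \rangle = \int_0^\infty \sqrt{z_1}\, e^{-z_1}\, dz_1 \int_0^\infty \sqrt{z_2}\, e^{-z_2}\, dz_2 = \left[ 2 \int_0^\infty x^2 e^{-x^2}\, dx \right]^2 = \frac{\pi}{4}
$$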
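The closed-form value of π/4 (about 0.785) for the reflective correlation coefficient of two unrelated acentric data sets can be checked numerically. The following is a minimal sketch, not part of the original procedures: the function name `reflective_cc_unrelated`, the sample size, and the seed are illustrative assumptions, and the normalized intensities are simply drawn from the Wilson distribution P(z) = exp(-z).

```python
import math
import random


def reflective_cc_unrelated(n_reflections=200_000, seed=0):
    """Reflective (uncentred) CC on amplitudes for two independent,
    Wilson-distributed (acentric) data sets. Hypothetical helper."""
    rng = random.Random(seed)
    sum_f1f2 = sum_f1sq = sum_f2sq = 0.0
    for _ in range(n_reflections):
        # Normalized acentric intensities follow P(z) = exp(-z), so <z> = 1.
        z1 = rng.expovariate(1.0)
        z2 = rng.expovariate(1.0)
        f1, f2 = math.sqrt(z1), math.sqrt(z2)  # amplitudes F = sqrt(z)
        sum_f1f2 += f1 * f2
        sum_f1sq += z1  # F1^2 = z1
        sum_f2sq += z2  # F2^2 = z2
    return sum_f1f2 / math.sqrt(sum_f1sq * sum_f2sq)


cc = reflective_cc_unrelated()
print(f"simulated CC = {cc:.4f}, pi/4 = {math.pi / 4:.4f}")
```

Because no mean is subtracted, even completely unrelated data sets give a large reflective correlation coefficient on amplitudes.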

The relationship between CC1/2 and R(diff) in diffraction data sets.

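The relationship between CC1/2 and Rσ has been worked out by Karplus and Diederichs (Ref. 7, 11). Let $I_1$ and $I_2$ represent the measured intensities of a particular Bragg reflection in the two halves into which a data set has been divided, and let $y_1 = I_1 - \langle I_1 \rangle = \tau + \varepsilon_1$ and $y_2 = I_2 - \langle I_2 \rangle = \tau + \varepsilon_2$, where $\tau$ is the true value of the reflection less the average of the true values of all reflections, and $\varepsilon$ is the noise in the measurement. Thus the mean values of $y_1$ and $y_2$ are zero, and their variances are $\sigma_\tau^2 + \sigma_\varepsilon^2$. This being the case, the correlation between $y_1$ and $y_2$, i.e. CC1/2, will be:

$$
CC_{1/2} = \mathrm{Corr}(y_1, y_2) = \frac{\mathrm{Cov}(y_1, y_2)}{\sqrt{(\sigma_\tau^2 + \sigma_\varepsilon^2)(\sigma_\tau^2 + \sigma_\varepsilon^2)}} = \frac{E[y_1 y_2]}{\sigma_\tau^2 + \sigma_\varepsilon^2} = \frac{E[(\tau + \varepsilon_1)(\tau + \varepsilon_2)]}{\sigma_\tau^2 + \sigma_\varepsilon^2} = \frac{E[\tau^2 + \varepsilon_1 \tau + \varepsilon_2 \tau + \varepsilon_1 \varepsilon_2]}{\sigma_\tau^2 + \sigma_\varepsilon^2} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_\varepsilon^2} = \frac{1}{1 + (\sigma_\varepsilon^2 / \sigma_\tau^2)}
$$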


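In the above equation, Corr is the correlation coefficient, Cov is the covariance, and E is the expectation value. The cross terms vanish because the noise terms $\varepsilon_1$ and $\varepsilon_2$ are taken to have zero mean and to be independent of $\tau$ and of each other. It is easy to show that $\sigma_\tau^2 = \langle I^2 \rangle - \langle I \rangle^2$. Thus if $q^2 = \langle I^2 \rangle / [\langle I^2 \rangle - \langle I \rangle^2]$, then:

$$
CC_{1/2} = \frac{1}{1 + q^2 \sigma_\varepsilon^2 / \langle I^2 \rangle}
$$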
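The signal-plus-noise model behind this expression can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the procedure used for the data in this work: true intensities are drawn from the acentric Wilson distribution, the two half data sets receive independent Gaussian noise of standard deviation `sigma_eps`, and `pearson` and `simulate_cc_half` are hypothetical helper names. The moments <I> and <I^2> are taken from the noise-free intensities.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_cc_half(sigma_eps, n_reflections=100_000, seed=1):
    """Return (simulated CC1/2, predicted CC1/2, q^2) for synthetic half sets."""
    rng = random.Random(seed)
    # Assumed true intensities: acentric Wilson distribution with <I> = 1.
    true_i = [rng.expovariate(1.0) for _ in range(n_reflections)]
    # Each half data set is the true intensity plus independent noise.
    half1 = [i + rng.gauss(0.0, sigma_eps) for i in true_i]
    half2 = [i + rng.gauss(0.0, sigma_eps) for i in true_i]
    mean_i = sum(true_i) / n_reflections
    mean_i_sq = sum(i * i for i in true_i) / n_reflections
    sigma_tau_sq = mean_i_sq - mean_i ** 2      # sigma_tau^2 = <I^2> - <I>^2
    q_sq = mean_i_sq / sigma_tau_sq             # q^2 = <I^2> / (<I^2> - <I>^2)
    predicted = 1.0 / (1.0 + q_sq * sigma_eps ** 2 / mean_i_sq)
    return pearson(half1, half2), predicted, q_sq


simulated, predicted, q_sq = simulate_cc_half(sigma_eps=0.5)
print(f"simulated CC1/2 = {simulated:.3f}, predicted = {predicted:.3f}, q^2 = {q_sq:.2f}")
```

For these synthetic acentric intensities the sampled q^2 comes out close to 2, and the simulated half-set correlation agrees with the predicted value to within sampling error.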

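For acentric reflections that obey Wilson statistics, $q^2 = 2$ at maximum, and $\sigma_\varepsilon^2 = 2\langle \sigma_I \rangle^2$. When this is so, the correlation coefficient for acentric reflections will be:

$$
CC_{1/2}^{\mathrm{acentric}} = \frac{1}{1 + 4 / (\langle I \rangle / \langle \sigma_I \rangle)^2} = \frac{1}{1 + 4 R_\sigma^2}
$$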

where $R_\sigma$ is the intensity R(sigma) ($= \langle \sigma_I \rangle / \langle I \rangle$). For centric reflections, $q^2 = 3/2$. At high resolutions, where random noise begins to dominate the data, $q^2$ will tend to 1.0 because $\langle I \rangle$ will be approaching zero. Thus, when the CC1/2 value of real data is plotted as a function of R(sigma), it should fall between the curves defined by the following two equations.

$$
CC_{1/2} = \frac{1}{1 + 4 R_\sigma^2} \quad \text{and} \quad CC_{1/2} = \frac{1}{1 + 2 R_\sigma^2}
$$
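To make the width of this band concrete, the two limiting curves can be tabulated for a few values of R(sigma). This is a minimal sketch; the function name `cc_half_band` and the sample R(sigma) values are illustrative assumptions.

```python
def cc_half_band(r_sigma):
    """Return the (lower, upper) limits on CC1/2 for a given
    R(sigma) = <sigma_I>/<I>, using the two limiting curves."""
    lower = 1.0 / (1.0 + 4.0 * r_sigma ** 2)
    upper = 1.0 / (1.0 + 2.0 * r_sigma ** 2)
    return lower, upper


for r_sigma in (0.1, 0.25, 0.5, 1.0):
    lo, hi = cc_half_band(r_sigma)
    print(f"R(sigma) = {r_sigma:4.2f}: {lo:.3f} <= CC1/2 <= {hi:.3f}")
```

The two curves nearly coincide at small R(sigma) and separate as the noise level rises.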

When the intensities obtained from two halves of a data set are merged to generate one full data set, the merging R-factor of the data, R(merging), will usually be the root-mean-square fractional difference between each of the two sets of intensities and their mean value. R(merging), therefore, is not the same as R(diff), which is the root-mean-square fractional difference between the two sets of intensities. Thus:

$$
R_{\mathrm{diff}} = \frac{\sqrt{\langle |I_1 - I_2|^2 \rangle}}{\sqrt{\langle \bar{I}^2 \rangle}}; \quad \bar{I} = \frac{I_1 + I_2}{2}; \quad R_{\mathrm{merging}} = \frac{\sqrt{\langle |I_1 - \bar{I}|^2 \rangle + \langle |I_2 - \bar{I}|^2 \rangle}}{\sqrt{\langle \bar{I}^2 \rangle}} = \frac{R_{\mathrm{diff}}}{\sqrt{2}}
$$

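Thus $[R_\sigma]^2 = [R_{\mathrm{merging}}]^2 = [R_{\mathrm{split}}]^2 / 2$. It follows that the equations that relate CC1/2 to R(diff) are:

$$
CC_{1/2} = \frac{1}{1 + 2 R_{\mathrm{diff}}^2} \quad \text{and} \quad CC_{1/2} = \frac{1}{1 + R_{\mathrm{diff}}^2}
$$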
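The relationship between R(diff) and R(merging), and the resulting limits on CC1/2, can be sketched in a few lines. The helper names and the synthetic half data sets below are illustrative assumptions, not data from this work; the only point is that R(merging) equals R(diff)/√2 for any pair of half data sets, and that a given R(diff) maps onto a band of CC1/2 values.

```python
import math
import random


def r_diff_and_r_merging(half1, half2):
    """Root-mean-square R(diff) and R(merging) for two half data sets."""
    n = len(half1)
    means = [(a + b) / 2.0 for a, b in zip(half1, half2)]
    mean_sq_of_mean = sum(m * m for m in means) / n          # <Ibar^2>
    msd_between = sum((a - b) ** 2 for a, b in zip(half1, half2)) / n
    msd_to_mean = sum((a - m) ** 2 + (b - m) ** 2
                      for a, b, m in zip(half1, half2, means)) / n
    r_diff = math.sqrt(msd_between) / math.sqrt(mean_sq_of_mean)
    r_merging = math.sqrt(msd_to_mean) / math.sqrt(mean_sq_of_mean)
    return r_diff, r_merging


def cc_half_limits_from_r_diff(r_diff):
    """Return the (lower, upper) limits on CC1/2 for a given R(diff)."""
    return 1.0 / (1.0 + 2.0 * r_diff ** 2), 1.0 / (1.0 + r_diff ** 2)


# Synthetic half data sets (assumed): Wilson-like signal plus Gaussian noise.
rng = random.Random(2)
truth = [rng.expovariate(1.0) for _ in range(10_000)]
half_a = [t + rng.gauss(0.0, 0.3) for t in truth]
half_b = [t + rng.gauss(0.0, 0.3) for t in truth]

r_diff, r_merging = r_diff_and_r_merging(half_a, half_b)
low, high = cc_half_limits_from_r_diff(r_diff)
print(f"R(diff) = {r_diff:.3f}, R(merging) = {r_merging:.3f}")
print(f"CC1/2 expected between {low:.3f} and {high:.3f}")
```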

An average amplitude and intensity difference between paired structures can be established through the joint probability if the distribution of intensity for both structures follows Wilson statistics. If the difference is due to random noise added to the two structures, the average intensity difference is about 1.73 times the average amplitude difference (Ref. 4).

Computational Procedures

Structure factors were calculated using the Refmac5 program in the CCP4 package, with the number of cycles of rigid-body refinement set to zero so that the geometry of ligands need not be checked (Ref. 42,43). Structure factor comparisons were made and electron density maps were calculated using the CCP4 package (Ref. 43). The resulting maps were analyzed using the graphics programs Coot and PyMOL (Ref. 44,45). All map calculations include a weighting function, determined between the experimental data and the amplitudes calculated from the original atomic model, which serves to damp down high-frequency noise. Gaussian shifts of coordinates were made using the program CNS (Ref. 46). Structure factors were shuffled (randomly permuted) using the Linux utility "shuf".

An example of the script for random coordinate shifts in each of the x, y, and z directions with a standard deviation of 0.600 Å (i.e., a total shift of 1.038 Å in position) is as follows.

###Gaussian_0p6A_shift.inp###
structure @generate_5KAI_ProteinAtoms.mtf end
coordinate @generate_5KAI_ProteinAtoms.pdb
do (x=x+gauss(0.6)) (all)
do (y=y+gauss(0.6)) (all)
do (z=z+gauss(0.6)) (all)
write coordinate format=PDBO output=5KAI_ProteinAtoms_0p6A.pdb end
stop
#####

An example of a script that shuffles the Miller indices of Bragg reflections (i.e., a random permutation of Miller indices) is as follows.

###Shuffle.com#####
#!/bin/csh
# Input file 5kai-sf_amplitude_resol.list contains h, k, l, Fobs, σFobs, and s, where s is
# the reciprocal resolution, and the reflection file has been sorted in ascending
# order of resolution. The number of resolution shells is Nres=100.
# This script depends on one other script, awk_shuffled.
cp /dev/null 5kai-sf_amplitude_shuffled.list
set dmax = "2.8"
@ n = 1
beginning_loop:
# Write the awk program that selects shell n (shells are equal steps in s**3).
cat << eof >! awk.it
BEGIN{dmax3=1.0/$dmax**3; Nres=100; delta=dmax3/Nres; res1=($n-1)*delta; res2=($n)*delta}
{h=\$1; k=\$2; l=\$3; fobs=\$5; sfobs=\$6; res3=\$8**3; if(res3>res1&&res3<=res2) {printf("%5d%5d%5d%10.2f%10.2f\n",h,k,l,fobs,sfobs)}}
eof
awk -f awk.it 5kai-sf_amplitude_resol.list >! 5kai-sf_amplitude_$n.list
set Nref = `wc -l 5kai-sf_amplitude_$n.list | awk '{print $1}'`
# Append a random permutation of 1..Nref to the shell file.
shuf -i 1-$Nref >> 5kai-sf_amplitude_$n.list
awk -f awk_shuffled 5kai-sf_amplitude_$n.list >> 5kai-sf_amplitude_shuffled.list
@ n += 1
if( $n < 101) goto beginning_loop
######

### awk_shuffled ####
#!/bin/csh
# Data file contains two parts: the first Nref lines contain the original h,k,l,Fobs,σFobs
# and the next Nref lines contain the shuffled order of hkl indices.
# Output contains h,k,l,Fobs,σFobs, Fobs(shuffled), σFobs(shuffled).
{h[NR]=$1; k[NR]=$2; l[NR]=$3; fobs[NR]=$4; sfobs[NR]=$5}
END{Nref=NR/2; for(i=1; i<=Nref; i+=1) {h0=h[i]; k0=k[i]; l0=l[i]; fobs0=fobs[i]; sfobs0=sfobs[i]; jshuffled=h[i+Nref]; f2=fobs[jshuffled]; sf2=sfobs[jshuffled]; printf("%5d%5d%5d%10.2f%10.2f%10.2f%10.2f\n",h0,k0,l0,fobs0,sfobs0,f2,sf2)}}
####end of script######

Five Additional References (42-46):

42. G. N. Murshudov, A. A. Vagin, E. J. Dodson, Refinement of macromolecular structures by the maximum-likelihood method. Acta cryst D53, 240-255 (1997).

43. M. D. Winn et al., Overview of the CCP4 suite and current developments. Acta cryst D67, 235-242 (2011).

44. P. Emsley, K. Cowtan, Coot: model-building tools for molecular graphics. Acta cryst D60, 2126-2132 (2004).

45. W. L. DeLano. (2002). The PyMOL Molecular Graphics System. Version 1.8. Schrödinger, LLC.

46. A. T. Brünger et al., Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta cryst D54, 905-921 (1998).


One Supporting Figure:

Figure S1. Continued from Figure 2A and 2C as S1A and S1B at contour levels of +1.0σ, +2.0σ, and +3.0σ.


Cover Illustration - different levels of noise in solvent regions in different amplitude-modified electron density maps contoured at +0.5 sigma (upper panel) and +1.0 sigma (lower panel)
