diber: protein, dna, or both?diber.iimcb.gov.pl/site_media/data/chojnowski_bochtler.pdf · 2011....

22
DIBER: protein, DNA, or both? Grzegorz Chojnowski MPG-PAN Junior Research Group International Institute of Molecular and Cell Biology, 4 Ks. Trojdena Street, 02-109 Warsaw, Poland Institute of Experimental Physics, University of Warsaw, 93 ˙ Zwirki i Wigury Street, 02-089 Warsaw, Poland [email protected], [email protected] Matthias Bochtler MPG-PAN Junior Research Group, International Institute of Molecular and Cell Biology, 4 Ks. Trojdena Street, 02-109 Warsaw, Poland Schools of Chemistry and Biosciences, Cardiff University, Park Place, CF10 3AT Cardiff, United Kingdom [email protected], [email protected] Abstract The program DIBER (an acronym for DNA and FIBER) requires only native diffraction data to predict whether a crystal contains protein, B-form DNA, or both. In standalone mode, the classi- fication is based on the cube root of the recip- rocal unit cell volume and the largest local av- erage of diffraction intensities at 3.4 ˚ A resolu- tion. In combined mode, the PHASER rotation function score (for the 3.4 ˚ A shell and a canon- ical B-DNA search model) is also taken into ac- count. In standalone (combined) mode, DIBER classifies 87.4 ± 0.2% (90.2 ± 0.3%) of the pro- tein, 69.1 ± 0.3% (78.8 ± 0.3%) of the protein- DNA and 92.7 ± 0.2% (90.0 ± 0.2%) of the DNA crystals correctly. Reliable predictions with a correct classification rate above 80% are possi- ble for 36.8 ± 1.0% (60.2 ± 0.4%) of the protein, 43.6 ± 0.5% (59.8 ± 0.3%) of the protein-DNA and 83.3 ± 0.3% (82.6 ± 0.4%) of the DNA struc- tures. Surprisingly, selective use of the diffraction data in the 3.4 ˚ A shell improves the overall suc- cess rate of the combined mode classification. An open-source CCP4/CCP4i compatible version of DIBER is available from the authors’ website at http://diber.iimcb.gov.pl/ and is subject to the GNU Public License. 1 Introduction Structural studies of protein-nucleic acid com- plexes require the co-crystallization of both com- ponents. If a tight protein-DNA complex is not available for crystallization, uncertainty about the crystal content often remains until the structure is finally solved. Part of the difficulty is due to the surprising observation that DNA can be re- quired for crystallization without getting incorpo- rated into the crystal, perhaps by perturbing the pH of the buffer or by other indirect effects [15]. In principle, the crystal content could be clari- fied by spectroscopic methods, but the equipment for such measurements is often unavailable. Al- ternatively, crystals can be washed, dissolved and analyzed by gel electrophoresis with appropriate staining, but this method is destructive and does not always provide a clear-cut answer. On the one hand, components of the crystal can go unno- ticed if crystals are small and detection efficiency is limited. On the other hand, components can be falsely diagnosed if they stick to the crystal surface 1

Upload: others

Post on 28-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • DIBER: protein, DNA, or both?

    Grzegorz ChojnowskiMPG-PAN Junior Research Group

    International Institute of Molecular and Cell Biology,4 Ks. Trojdena Street, 02-109 Warsaw, Poland

    Institute of Experimental Physics, University of Warsaw,93 Żwirki i Wigury Street, 02-089 Warsaw, Poland

    [email protected], [email protected]

    Matthias BochtlerMPG-PAN Junior Research Group,

    International Institute of Molecular and Cell Biology,4 Ks. Trojdena Street, 02-109 Warsaw, Poland

    Schools of Chemistry and Biosciences,Cardiff University,

    Park Place, CF10 3AT Cardiff, United Kingdom

    [email protected], [email protected]

    Abstract

    The program DIBER (an acronym for DNA andFIBER) requires only native diffraction data topredict whether a crystal contains protein, B-formDNA, or both. In standalone mode, the classi-fication is based on the cube root of the recip-rocal unit cell volume and the largest local av-erage of diffraction intensities at 3.4 Å resolu-tion. In combined mode, the PHASER rotationfunction score (for the 3.4 Å shell and a canon-ical B-DNA search model) is also taken into ac-count. In standalone (combined) mode, DIBERclassifies 87.4 ± 0.2% (90.2 ± 0.3%) of the pro-tein, 69.1 ± 0.3% (78.8 ± 0.3%) of the protein-DNA and 92.7 ± 0.2% (90.0 ± 0.2%) of the DNAcrystals correctly. Reliable predictions with acorrect classification rate above 80% are possi-ble for 36.8 ± 1.0% (60.2 ± 0.4%) of the protein,43.6 ± 0.5% (59.8 ± 0.3%) of the protein-DNAand 83.3 ± 0.3% (82.6 ± 0.4%) of the DNA struc-tures. Surprisingly, selective use of the diffractiondata in the 3.4 Å shell improves the overall suc-cess rate of the combined mode classification. Anopen-source CCP4/CCP4i compatible version ofDIBER is available from the authors’ website at

    http://diber.iimcb.gov.pl/ and is subject to theGNU Public License.

    1 Introduction

    Structural studies of protein-nucleic acid com-plexes require the co-crystallization of both com-ponents. If a tight protein-DNA complex is notavailable for crystallization, uncertainty about thecrystal content often remains until the structureis finally solved. Part of the difficulty is due tothe surprising observation that DNA can be re-quired for crystallization without getting incorpo-rated into the crystal, perhaps by perturbing thepH of the buffer or by other indirect effects [15].In principle, the crystal content could be clari-fied by spectroscopic methods, but the equipmentfor such measurements is often unavailable. Al-ternatively, crystals can be washed, dissolved andanalyzed by gel electrophoresis with appropriatestaining, but this method is destructive and doesnot always provide a clear-cut answer. On theone hand, components of the crystal can go unno-ticed if crystals are small and detection efficiencyis limited. On the other hand, components can befalsely diagnosed if they stick to the crystal surface

    1

  • p p

    P P

    1/p 1/p

    (a) (b)

    (c) (d)

    1/P1/P

    Figure 1: Real and reciprocal space representations of continuous and discontinuous helices and double-helices. All calculations were done with pitch P = 34 Å and (average) helix radius r = 7.0 Å. Axialdistance between pearls in (c) and (d) was p = 3.4 Å. Layers have finite width because only two turns ofthe helix were used for the numerical calculations. The arrows highlight the characteristic 3.4 Å peak.

    without being incorporated in the lattice. Clearly,a method that could distinguish between proteincrystals, DNA crystals and crystals of both com-ponents based on the diffraction data alone (in theabsence of any phase information) would be highlydesirable.

    Crystals that contain only DNA typically havemuch smaller unit cells than crystals that con-tain protein (either alone or in combination withDNA) and are therefore easily identifiable. It ismuch harder to distinguish protein crystals fromcrystals that contain protein and DNA, becausetheir cell dimensions are typically comparable. Wereasoned that the presence of double stranded B-DNA (dsDNA) should be deducible from the char-acteristic features of its Fourier transform, even

    though the latter is sampled by the reciprocal lat-tice in three-dimensional diffraction experiments.The key features of the Fourier transform of ds-DNA are well known [4, 10, 7]. The modulus is ap-proximately cylinder symmetric. Slices that con-tain the reciprocal space helix axis reveal a crossat low resolution and a strong maximum at 3.4 Åresolution. This maximum is known as the merid-ional peak in fiber diffraction for its location in atypical setup. It is due to in-phase scattering of allDNA nucleotide pairs (which are related by 3.4 Åshifts along the helix axis and irrelevant rotations)(Fig. 1).

    The transverse (perpendicular to the helix axis)and longitudinal (along the helix axis) profile ofthe characteristic 3.4 Å peak can be analyzed ei-

    2

  • 0.05 0.10 0.15 0.20

    reciprocal space distance [Å-1

    ]

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    Inte

    nsity

    [a.u

    .]

    dsDNA bases only

    0.05 0.10 0.15 0.20

    reciprocal space distance [Å-1

    ]

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    Inte

    nsity

    [a.u

    .]

    dsDNA backbone only

    0 0.05 0.10 0.15 0.20

    reciprocal space distance [Å-1

    ]

    0.0

    2.0

    4.0

    6.0

    Inte

    nsity

    [a.u

    .]Simulation

    Analyticalapproximation

    dsDNA complete molecule

    (a) (b) (c)

    Figure 2: Transverse intensity profile of the 3.4 Å peak of dsDNA. The scattering of dsDNA (a) bases,(b) backbone and (c) complete molecule was estimated analytically (broken lines, formulas 3, 4 and 5 inAppendix A) and calculated numerically with cylindrical averaging (continuous gray lines) for a 10 basepair helix. The vertical line in panel (c) indicates the location of the maximum.

    ther numerically and/or analytically (Figs. 2 and3). The detailed calculations are presented in Ap-pendix A. The transverse profile does not de-pend on helix length and has a complicated shape(Fig. 2). It can be attributed to coherent su-perposition of the structure factors of DNA bases(Fig. 2a) and backbone (Fig. 2b). Interferenceis destructive on-axis due to the location of phos-phates half-way between bases in the axial direc-tion. However, radial dependencies of base andbackbone scattering differ. Therefore, the contri-butions reinforce each other at a radial distanceR = 0.04 Å−1 off axis (Fig. 2c). Fortuitously, the0.08 Å−1 separation between the maxima is ap-proximately the inverse of the 12 Å radius of theDNA helix, and therefore perfectly in agreementwith the reciprocity of real and reciprocal spacedimensions. The longitudinal profile of the 3.4 Åresolution peak can be calculated like the widthof the first maximum in a multiple-slit diffractionexperiment. The half-width at half-maximum ofapproximately 12 (3.4 Å)

    −1 ≈ 0.15 Å−1 divided bythe number of base pairs in the dsDNA helix canbe confirmed numerically and by more detailedanalytical calculations (Fig. 3).

    In this work, we present the CCP4-compatible,GPL-licensed program DIBER (an acronym forDNA and FIBER), which takes 3D-diffractiondata as input and predicts whether a given crystalcontains protein only, protein-DNA, or DNA only.

    DIBER is intended to search for double strandedB-DNA and not the rarer A- or Z-DNA forms ofdouble stranded DNA, double stranded RNA, norfor any single stranded nucleic acid. The programquantifies the intensity average in regions of recip-rocal space that could represent the characteristic3.4 Å dsDNA peak. Moreover it takes into ac-count the reciprocal unit cell size represented bythe cube root of its volume. Assuming equal apriori probabilities for protein only, protein-DNAand DNA only crystals, the program forecasts thecrystal content, and assesses the confidence of theprediction. A graph with the reflection averagesin a thin resolution shell around 3.4 Å is alsoproduced. Regions of exceptionally strong sig-nals may correlate with the position of the dsDNAcharacteristic peak, and may indicate the doublehelix orientation (up to the usual ambiguity ofhand). It must be stressed, however, that thisinformation should be taken with a grain of salt,because DIBER was not written for this purposeand because the feature has not yet been bench-marked.

    2 Methods

    2.1 Training and test data

    Crystal structures solved at 3.0 Å resolution orbetter were downloaded from Protein Data Bank

    3

  • 0.01 0.02 0.03 0.04 0.05 0.06

    reciprocal space distance [Å-1

    ]

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    Inte

    nsity

    [a.u

    .]

    HWHMnumerical

    HWHManalytical

    Figure 3: Longitudinal intensity profile of the3.4 Å peak of dsDNA. The analytical (broken line,formula 8 in Appendix A) and numerical (con-tinuous gray line) results apply to a complete 10base pair dsDNA helix. Vertical lines and arrowsindicate the estimates for the half-width at half-maximum (HWHM) according to formulas 8 and11 in Appendix A.

    (PDB, release date 23 March 2009) together withcorresponding experimental diffraction data. Du-plicates or near duplicates (90% sequence identitycut-off) were removed from the set. Structurescontaining RNA or nucleic acids of less than 2standard Watson-Crick base pairs (as recognizedby 3DNA [11]) were removed as well. The final setcontained 10580 protein only, 791 protein-DNAand 258 DNA only crystal structures. Protein-DNA structures were further subdivided into 762B-DNA structures (containing at least 2 neigh-bouring base pairs of double stranded B-DNA)and 29 others. Similarly, DNA only structureswere partitioned into 151 B-DNA structures and107 others. All DNA containing structures werechecked for continuous helices. For every dou-ble stranded DNA molecule, the centroids of thefour terminal nucleotides of both ends were calcu-lated, stored in pairs of spatially close ends, andexpanded by crystallographic symmetry. DNAwas classified as continuous if at least one cen-troid on each end was within 5 Å of anothercentroid. This procedure should treat DNA du-plexes with sticky ends correctly. However, it doesnot take into account unusual arrangements (like

    DNA on histones). Therefore, the set was alsomanually curated. Finally, we identified struc-tures with translational non-crystallographic sym-metry in all three sets (protein only, protein-DNAand DNA only). These were defined by the pres-ence of strong off-origin peaks in the native Patter-son maps (above 40% of the origin peak height).All reported calculations are based on experimen-tal diffraction data. Structural information wasonly used to select and classify datasets accordingto their macromolecule content.

    The support vector machine algorithm as-sumes that the classifier will operate on datadrawn from the same distribution as the train-ing data. In DIBER, equal a priori probabili-ties for obtaining DNA, protein-DNA and DNAcrystals are assumed, which is not reflected bythe actual numbers of available datasets for thethree classes. Therefore we had to rebalancethe data artificially. For initial tests, we ran-domly pruned protein datasets and replicatedDNA datasets so that their numbers matched thenumber of protein-DNA structures. For the fi-nal optimization, we took the opposite approach,and replicated protein-DNA and DNA structuresuntil their numbers were equal to the numberof protein structures in the set. The classifi-cation performance was estimated using a re-peated stratified sub-sampling validation proce-dure. Classifiers were trained with equal num-bers of structures from each class (roughly 50%of instances). The remaining structures (in gen-eral, unequally distributed among classes) wereused for testing. The average error rate of100 training and testing cycles was used as anestimate of the true error rate. All graphswere prepared with the GRACE (http://plasma-gate.weizmann.ac.il/Grace/) or Matplotlib [9]software.

    2.2 Program implementation

    DIBER is written in C/C++ and comprises lessthan 3000 lines of newly written source code. Theprogram relies extensively on CCP4 [5] and CLIP-PER [6] libraries to handle keyword parsing, crys-tal symmetry issues and diffraction data formats.In addition, the support vector machine LIBSVMlibraries [3] are used for training and decisionmaking. DIBER does not include any routines

    4

  • of the molecular replacement program PHASER[12], but provides an interface to run this pro-gram to obtain a score (the likelihood-enhancedfast rotation function rescored with full likelihoodtarget).

    2.3 Anisotropy correction

    Overall anisotropy was corrected in all DIBERmodes. For calculation of local averages, theCLIPPER routines were used, because the reso-lution dependence of scaling is smooth [6]. Scal-ing factors were applied to all diffraction data.However, they were calculated without the data inthe 3.37 Å to 3.43 Å resolution shell to avoid anydegradation of the helix signal. In the PHASERassisted mode of DIBER, the rotation score wascalculated after applying the PHASER anisotropycorrection [12].

    2.4 Normalization of structure fac-tors and intensities

    Normalization of structure factors poses similarproblems like anisotropy correction. For the cal-culation of local averages, we used CLIPPER rou-tines, which model the resolution dependence ofthe average intensity without dividing diffractiondata into resolution shells (in order to avoid prob-lems at low-resolution [2]). As for anisotropy cor-rection, the normalization factors were calculatedwithout the 3.4 Å resolution shell, but appliedthroughout.

    2.5 The averaging region (stan-dalone mode)

    The size and shape of the characteristic 3.4 Å peakof dsDNA is shown in Figs. 2 and 3. In transversedirection, the profile can be roughly approximatedby a step function. In the longitudinal direction,a Gaussian or quadratic function would be a bet-ter approximation. Nevertheless, considerationsof computational efficiency suggested to use a stepfunction also in this direction. The dimensions ofthe averaging cylinder (with axis pointing towardsthe origin of reciprocal space) were tuned to max-imize the DIBER performance. The percentageof correctly classified datasets (at all costs) wastaken as the criterion of success. A cylinder height

    of 0.04 Å−1 and radius of 0.09 Å−1 were found tobe optimal.

    2.6 The calculation of local aver-ages (standalone mode)

    The crystallographically independent part of the3.4 Å resolution shell was sampled to determinethe maximum local average. To cover this re-gion evenly, we tested pre-computed icosahedralsphere coverings [8] with between 522 to 78032sampling points. The set of 5072 points (corre-sponding approximately to 3◦ sampling) providedsmooth graphics at acceptable computational costand was used throughout. Diffraction data wereexpanded to space group P1 to avoid compu-tationally expensive on the fly use of crystallo-graphic symmetry.

    2.7 Maximum of the likelihood-enhanced fast rotation functionscore (PHASER only mode)

    PHASER 2.1.1 [12] was used for the molecular re-placement calculations with a poly-adenine/poly-thymine dsDNA model generated with 3DNA [11].The likelihood function was defined with defaultsolvent-related parameters Bsol = 300 Å

    2, fsol =0.95. Their exact values have only a minor influ-ence on the classification performance. Parame-ters defining model quality (expected root-mean-square coordinate error RMS and a fraction ofthe scattering power fp) were optimized with re-spect to the classification performance and set toRMS = 0.5 Å2 and fp = 0.5. As input forthe classifier, the largest log-likelihood gain (LLG)score of the likelihood-enhanced rotation functionof order 1 (LERF1) [14] after rescoring its top100 solutions with the Sim maximum likelihoodrotation function (MLRF) [13] was used. Onlythe diffraction data in the resolution range be-tween 3.18 Å and 3.65 Å were used for calcula-tions. Henceforth, we will for simplicity refer tothis score as the rotation score.

    5

  • 2.8 Support vector machine forclassification

    All DIBER predictions were carried out withSupport Vector Machine (SVM) classifiers imple-mented in the LIBSVM library [3], with inputdata scaled linearly from 0 to 1. Input dataand kernel parameters were dependent on DIBERmode. In standalone mode, the cube root of thereciprocal unit cell volume and largest local in-tensity average were used. The kernel parameterswere γ = 0.02, C = 500.0. In PHASER onlymode, the rotation score replaced the largest localintensity average. Moreover, γ = 0.01 instead ofγ = 0.02 was used. In combined mode, the cuberoot of the reciprocal unit cell volume, largest lo-cal intensity average and rotation score were inputto the classifier, and kernel parameters were set toγ = 2.0, C = 500.0.

    3 Results

    3.1 DIBER overview

    DIBER requires a (binary) CCP4 mtz file withdiffraction data to at least 3.0 Å resolution. Theprogram extracts three parameters, (a) the cuberoot of the reciprocal unit cell volume, (b) thelargest local average of reflection intensities (c) arotation score of a PHASER molecular replace-ment run. The DIBER classification of a crystal ofunknown content can be carried out in standalonemode (parameters a and b), PHASER only mode(parameters a and c) or combined mode (all threeparameters). In all three modes, DIBER predictscrystal content with the help of a support vectormachine (SVM) classifier and estimates a proba-bility for correct classification that depends on theactual parameters of the unknown structure. Asa side product of the classification, DIBER alsooutputs a plot of local intensity averages in thethin 3.4 Å resolution shell on a stereographic net(and PHASER solutions if available). This infor-mation can be useful to derive the orientation ofthe DNA structure in the crystal.

    Figure 4: DIBER flow-chart. In standalone modethe predictions are based on the strongest averageintensity at 3.4 Å resolution and the cube root ofthe reciprocal unit cell volume. As an option theprediction reliability may be improved by takinginto account the PHASER rotation function score(dashed box).

    3.2 Standalone search for the 3.4 Åpeak of B-DNA

    In standalone mode, DIBER performs a search ina thin reciprocal space shell around 3.4 Å resolu-tion for a small region with many strong, neigh-bouring reflections. However, the notion of astrong reflection in a newly collected dataset isrelative. Average intensities are resolution depen-dent, and even within a resolution shell, they canvary due to overall temperature factors anisotropy.DIBER corrects for these effects with a globalanisotropy correction, followed by a normaliza-tion of diffraction intensities (Fig. 4). In order todetect the regions with strong reflections DIBER

    6

  • 0.5 1.0 1.5 2.0 2.5The strongest average E2

    -100-80-60-40-20

    020406080

    100

    Top

    LLG

    (a)

    protein

    0 2.0 4.0 6.0 8.0 10.0The strongest average E2

    -100

    -50

    0

    50

    100

    150

    200

    250

    300

    Top

    LLG

    (b)

    protein-DNA

    0 2.0 4.0 6.0 8.0 10.0The strongest average E2

    -100

    -50

    0

    50

    100

    150

    200

    250

    300

    Top

    LLG

    (c)

    DNA

    Figure 5: Correlation of the PHASER rotation function score (Top LLG) with the strongest local averageof normalized intensity (E2) for (a) protein only (green circles), (b) protein-DNA (red circles) and (c)DNA only (blue crosses) structures. The corresponding correlation coefficients are -0.10 for protein only,0.82 for protein-DNA and 0.63 for DNA only structures.

    calculates local averages within suitably orientedcylindrical disks (cylinder axis directed towardsthe origin of reciprocal space) placed at 3.4 Å res-olution. The local averages are calculated withappropriate sampling of a crystallographically in-dependent set of orientations, and the largest av-erage is retained as the score for the classifier.

    3.3 PHASER search for the 3.4 Åpeak of B-DNA

    The molecular replacement procedure imple-mented in PHASER determines the orientation ofa search model according to maximum likelihoodprinciples. For a correct orientation of a searchmodel, the probability to observe the experimen-tally determined data is larger than for the samesearch model in random orientation. The correctorientation is found by maximizing the increase of(the logarithm of) this probability. Missing infor-mation about the model position and hence rel-ative phases of the structure factor contributionsof symmetry related molecules is treated with arandom walk approximation. We reasoned thatwe could use the rotation search of PHASER tolook for the characteristic 3.4 Å peak of dsDNA.As a dsDNA model, we used an idealized 11 basepair long dsDNA. As the score, we took the loglikelihood gain (LLG) of the likelihood-enhancedfast rotation function after rescoring of the top100 solutions with the full likelihood target (in

    0 0.01 0.02 0.031/

    3�V [�A− 1 ]2.04.0

    6.0

    8.0

    10.0

    The

    stro

    nges

    t ave

    rage

    E2

    (a)

    Standalone mode

    0 0.01 0.02 0.031/

    3�V [�A− 1 ]-200.0-100.00.0100.0

    200.0

    300.0

    400.0

    Top

    LLG

    (b)

    PHASER only mode

    Figure 6: Scatter plot of the classifier input pa-rameters for (a) standalone and (b) PHASER onlymode of DIBER. Color codes are the same as inFig. 5.

    this paper referred to as the rotation score).

    3.4 Correlation of the largest localintensity average and the rota-tion score

    The largest local intensity average and the rota-tion score are both measures of the presence or ab-sence of the 3.4 Å peak of B-DNA. In the absenceof DNA, the two measures are essentially uncor-related (Fig. 5a, correlation coefficient -0.10). Incontrast, when DNA is present (either with pro-tein or alone), the two measures detect the 3.4 Åpeak and are clearly correlated, although not verystrongly. The correlation coefficients are 0.82 for

    7

  • 0 0.01 0.02 0.031/

    3�V [�A− 1 ]2.04.0

    6.0

    8.0

    10.0

    The

    stro

    nges

    t ave

    rage

    E2

    (a)

    protein-DNA complex

    DNA

    protein

    Standalone mode

    0 0.01 0.02 0.031/

    3�V [�A− 1 ]-200.0-100.00.0100.0

    200.0

    300.0

    400.0

    Top

    LLG

    (b)

    protein-DNA complex

    DNA

    protein

    PHASER only mode

    Figure 7: SVM classification boundaries for(a) standalone and (b) PHASER only mode ofDIBER. The gray lines correspond to a classifi-cation at all costs. The dashed and continuouslines mark the regions of the scatter plot withcorrect classification probability greater than 80%and 90%, respectively.

    protein-DNA crystals and 0.63 for DNA only crys-tals (Figs. 5b and 5c). These findings promptedus to train the classifier with the two parame-ters either separately or in combination, alwaystogether with the cube root of the reciprocal unitcell volume as an additional input.

    3.5 Scatter of the DIBER classifierinput for datasets of known con-tent

    The two-dimensional scatter-plots presented inFig. 6 illustrate the spread of classifier input pa-rameters for known structures (with equal repre-sentation of protein only, protein-DNA and DNAonly structures). Qualitatively, DNA crystalshave smaller real space and larger reciprocal spaceunit cells than crystals that contain protein (withor without DNA). Moreover, a large local inten-sity average and rotation score correlate with thepresence of DNA (alone or with protein), as an-ticipated. However, there was no clear separationin the scatter plots between structures with con-tinuous DNA and structures with non-continuousDNA (Fig. S1). Apparently, the bendability ofDNA tends to break the phase lock of structurefactor contributions from distant nucleotide pairs.

    3.6 Classifier training with crystalsof known content

    Optimal separation lines between the three scat-ter plot regions for DNA, protein-DNA and pro-tein were determined with a support vector ma-chine. The classifier was separately trained instandalone mode (using cube root of the recip-rocal unit cell size and largest local intensity av-erage), PHASER only mode (using cube root ofthe reciprocal unit cell size and rotation score)and combined mode (all three parameters). Thetraining procedure defined not only the optimaldivision lines between the classes, but also prob-abilities of the correct classification of a structureof unknown content (Fig. 7).

    3.7 Benchmarking DIBER withstructures of unknown content

    Presented with a diffraction dataset of an un-known crystal structure, DIBER parses the deci-sion tree of Fig. 8 to determine its output. DIBERwas benchmarked with the structures from thePDB that were not used in the training phase.Again, equal numbers of structures with only pro-tein, protein-DNA, or only DNA were used fortesting. Classification at all costs led to a cor-rect answer for 80-90% of the protein, 70%-80% ofthe protein-DNA and over 90% of the DNA struc-tures (Fig. 9a and Table 1). About half of allprotein and protein-DNA crystals and over 80%of the DNA crystals were located in regions of thescatter plot with greater than 80% correct clas-sification probability (Fig. 9b). Slightly fewerstructures could even be classified with greaterthan 90% probability (Fig. 9c). Except forDNA crystals, the PHASER dependent classifi-cation was slightly better than the ”quick” stan-dalone classification. The combined mode wasbest, with insignificant extra computational cost(over the PHASER requirements). Therefore,only the standalone (highest efficiency) and com-bined (most accurate results) modes of DIBER areavailable to the user.

    3.8 DIBER for curated datasets

    The performance figures for DIBER in Table 1and Fig. 9 were obtained with the non-curated

    8

  • Figure 8: Decision tree of DIBER for predictionA (protein only). Analogous trees are parsed forevents B (protein-DNA) and C (DNA only).

    training and testing sets described in Methods.Could the performance be improved by exclud-ing unusual data? In a first re-run of DIBERtraining and testing, we excluded protein-DNAand DNA only structures that did not haveat least two neighbouring base pairs of doublestranded B-DNA. DIBER performance improvedonly slightly (Figs. S2 and S3). We also tookinto account that non-crystallographic transla-tional symmetry might affect the DIBER localaverages and PHASER scores, because it can sys-tematically enhance and reduce the intensities insubsets of reflections. Again, the DIBER scoresimproved, but again the improvement was veryslight (Figs. S2 and S3). We also tested the224 protein structures, 11 protein-DNA structures

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure 9: Benchmarking DIBER performance forthe classifications (a) at all costs, (b) with greaterthan 80% classification probability and (c) withgreater than 90% classification probability. Struc-tures were divided into protein only, protein-DNAand DNA only. Bars indicate DIBER predic-tions of the crystal content (green for protein only,red for protein and DNA, blue for DNA only,and white for no prediction). Textures corre-spond to the standalone (left, hatched), PHASERonly (middle, dotted) and combined (right, plain)modes of DIBER.

    9

  • and 14 DNA structures with significant pseudoo-rigin peaks separately. Using standard decisioncriteria, DIBER classified 149 (185) protein, 10(10) protein-DNA, and 11(11) DNA structurescorrectly. We attribute this result to the fact thatnon-translational symmetry tends to affect adja-cent reflections in opposite ways, so that local av-erages are not much perturbed. As the overallDIBER performance did not improve significantlyby excluding any unusual structures, checks forshort DNA (which would have to be stated by theuser) or pseudoorigin peaks (which can be donewithout user intervention) were not implementedin DIBER.

    4 Discussion

    4.1 Alternatives to DIBER

    Many groups, including our own, have rou-tinely assessed crystal content using spectroscopicand/or biochemical techniques. In addition, thereare other sources of information. For small unitcells, the Matthews coefficient will sometimes besufficient to rule out the bulkier component or acomplex. As the calculation can easily be donewith available software, the solvent content is notexplicitly considered in DIBER. However, it is ofcourse implicit in the unit cell size parameter forthe classifier. The presence of DNA (with or with-out protein) tends to show up as a bump in theWilson plot at 3.4 Å resolution. It can also man-ifest itself as a persistent peak that shows up ina fixed location for many higher order symmetryaxes (e.g. 8-fold, 9-fold or 10-fold). If the self ro-tation peaks are due to DNA, much of the signalshould be lost if only the low resolution data (be-low 3.6 Å) are taken into account. A set of two-fold axes on a great circle perpendicular to themain axis strengthens the case for DNA. In manyprotein-DNA complexes, one of the two-fold axesis also a local symmetry axis of a protein dimerand therefore clearly visible.

    4.2 Equal a priori probabilities

    The chances to get a protein-DNA co-crystal arecase-dependent. Tight interaction favours com-plexes, loose interaction promotes the crystalliza-

    tion of single components. Unfortunately, goodestimates for the chances to get protein-DNA co-crystals are not available. Therefore, DIBERmakes the ad hoc assumption that the three pos-sible crystallization outcomes (protein, DNA orboth) are equally probable. DIBER output mustbe read with this assumption in mind.

    4.3 Minimal information from theuser

    DIBER was designed to require minimal inputfrom the user. Therefore, no attempt was made toincorporate information about packing or solventcontent in the DIBER predictions. At present,we do not even request the user to state the ex-pected length of the DNA duplex that was used inthe crystallization experiments, even though theshape of the 3.4 Å peak is affected by this length.The decision was made because kinks and disor-dered ends of the DNA duplex are hard to predict,but can drastically affect the effective length.

    4.4 Minimal input to the classifier

    We also used a minimal number of parameters forthe classifier. The reciprocal unit cell size was rep-resented by a single parameter (the cube root ofits volume). The overall anisotropy of the diffrac-tion data, which was obtained as a by-productof the anisotropy correction, was not used at all.These simplifications were justified because theadditional parameters mostly help to distinguishDNA only crystals from all others which can al-ready be done based on the unit cell size alone.

    4.5 Alternative measure of unit cellsize

    The inverse of the smallest unit cell dimension wastested as an alternative to the cube root of the re-ciprocal lattice volume as input for the classifier.Results were slightly inferior and therefore this op-tion was not used (Figs. S4 and S5). Using thethree unit cell constants separately improved per-formance only very marginally (data not shown)and was therefore not implemented.

    10

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure 10: Quality of DIBER predictions inPHASER only mode versus the resolution range ofinput diffraction data. PHASER rotation scoreswere calculated with default parameters (RMS =1.5 Å2, fp = 0.5) for the diffraction data withoutthe 3.2-3.7 Å resolution shell (left, hatched), forall diffraction data (middle, dotted) and for thediffraction data in the 3.2-3.7 Å resolution shell(right, plain). The classification was done (a) atall costs or with a classification probability above(b) 80%, or (c) 90%. Results are colour-coded likein Fig. 9

    4.6 Alternative molecular replace-ment scores

    First, we considered to substitute MOLREP [16]for the PHASER molecular replacement scores.However, MOLREP was not written to deal withdiffraction data in a thin shell only, and the scoreswere not useful for classification (data not shown).Next, we tested the PHASER Z-score (the num-ber of standard deviations above the mean) asan alternative to the log likelihood gain. DIBERperformance deteriorated for reasons that are cur-rently not clear (Fig. S6). We also considered touse all diffraction data rather than the 3.4 Å shellfor PHASER molecular replacement experiments.The classification performance of DIBER droppedagain and was also poor for the control experi-ment with all data outside the 3.4 Å resolutionshell (Fig. 10). We also attempted to replace therotation score of PHASER with the correspond-ing translation score using either only the 3.4 Ådata shell or all diffraction data. In both cases,DIBER performance was no better than with therotation score, but calculations took much longer(data not shown).

    4.7 Data outside the 3.4 Å resolu-tion shell

    It is surprising that DIBER performs best withthe diffraction data in a thin shell around 3.4 Åresolution. Clearly, the information content in therest of the data cannot be negative, so the datamust be used in the wrong way. We know al-ready that the characteristic 1.5 Å peaks of pro-tein α-helices can be detected in favourable cases(data not shown). For high resolution data, thiscould be used to further confirm the distinctionbetween protein structures (with or without DNA)and structures of DNA alone. We also know thatpeaks in the self-rotation function (which can becalculated without the data in the 3.4 Å shell)can be diagnostic for the presence of DNA. Theinformation could help to distinguish structuresthat contain DNA (with or without protein) fromstructures of protein alone, which is not alwayspossible with the current version of DIBER. In or-der to make good use of the low resolution data,it might already suffice to calculate separate rota-tion functions for different resolution ranges. The

    11

  • scores could be combined into a single parameteror input to the classifier as a vector. A careful in-vestigation of these possibilities remains for futurework.

    A B-DNA characteristicdiffraction signals

    A.1 The Fourier transforms of helixmodels

    Scattering of B-DNA [4, 17, 7] can be under-stood by considering a series of models of in-creasing complexity. In the first step, DNA isapproximated as an infinitely long helical stringof radius r and pitch P along z (coordinatesx = r cos(2πz/P ) = r cos(φ), y = r sin(2πz/P ) =r sin(φ) and z = z). The Fourier transform of thisstructure T (R,ψ, n/P ) is 0 except in layers at dis-tances n/P from the origin. In the nth layer, theFourier transform can be expressed in reciprocal,cylindrical coordinates R and ψ in terms of thenth-order Bessel function [4]

    T (R,ψ, n/P ) = Jn(2πRr) exp

    (

    in(ψ +1

    2π)

    )

    (1)

    The cylinder symmetry of the modulus ofthis function reflects the equivalence of rotationsaround and translations along the helix axis. Thedistance of the first maximum from the originincreases with the order of the Bessel function.Therefore the well-known cross is is seen in sec-tions that include the (reciprocal space) helix axis(Figs. 1a).

    In the second step, an infinitely thin and longdouble helix is considered. The two strands are re-lated to each other by a rotation around a twofoldaxis perpendicular to the helix axis. In the model,the infinitely long and thin helix strands have noorientation. Within this approximation, their re-lationship can be described by a translation alongthe helix axis. The axial shift 0.4 × P models thenon-equivalence of the major and minor grooves ofDNA and translates into a phase shift 0.4×n×2πfor the nth layer of the diffraction pattern. Thephase shift is 0 for the 0th layer. However, it isclose to odd multiples of 2π for the 1st and 4thlayer, which therefore have very little intensity(Fig. 1b).

    In the third step, a discrete single helix made ofpoint scatterers of axial distance p is considered.This structure can be described as the product oftwo functions that describe a helical string and aset of planes of spacing p. The Fourier transformof a set of planes of distance p is a set of planesin reciprocal space 1/p apart. By the convolutiontheorem, the Fourier transform of the discrete he-lix is the Fourier transform of the helical stringwith its origin placed at each of the points (0, 0, 0),(0, 0,±1/p), (0, 0,±2/p), etc [4]. In practice, thestrong decrease of intensity with resolution atten-uates the crosses with origins other than (0, 0,0). Often only the halos of the (0, 0, 0) peak at(0, 0,±1/p) are recognizable (Fig. 1c).

    In the fourth step, the same transition as in thethird step is made for double helices. Assumingthat the point scatterers in the two strands are atthe same height, the above argument can be ap-plied again to predict a diffraction pattern like fora non-discrete double helix, but with increasinglyweak halos at (0, 0,±1/p), (0, 0,±2/p). The DNApeak at 3.4 Å resolution corresponds to the halo ofthe origin peak at (0, 0,±1/p) and is due to con-structive interference of all scattering points (Fig.1d).

    A.2 Transverse width of the 3.4 Åintensity peak

    In order to determine the transverse width of the3.4 Å peak, it is necessary to calculate the Fouriertransform of the B-DNA double helix in the re-ciprocal space layer (x, y, (3.4 Å)−1). The phasesin this layer are unaffected by 3.4 Å translationsalong the helix axis. Therefore, the scattering con-tribution of all base pairs can be calculated by pro-jecting them onto the real space xy plane, wherethey form a filled circle of radius rbps = 5.0 Å.Therefore, the Fourier transform can be expressedin reciprocal space cylindrical coordinates R andψ using the Bessel function identity [1]

    xnJn−1(x)dx = xnJn(x) (2)

    12

  • Crystal content

    protein protein-DNA DNAC

    lass

    ifica

    tion

    resu

    lts

    DNA2.1 ± 0.1% 4.3 ± 0.2% 92.7 ± 0.2%

    (1.2 ± 0.1%) (4.0 ± 0.1%) (90.0 ± 0.2%)

    protein-DNA10.5 ± 0.3% 69.1 ± 0.3% 5.5 ± 0.2%(8.8 ± 0.2%) (78.8 ± 0.3%) (7.2 ± 0.2%)

    protein87.4 ± 0.2% 26.6 ± 0.2% 1.8 ± 0.1%

    (90.2 ± 0.3%) (17.2 ± 0.2%) (2.6 ± 0.1%)

    Table 1: Benchmarking DIBER performance for the classifications at all costs in standalone (combined)mode.

    as

    Tbps(R,ψ) =

    1

    πr2bps

    ∫ rbps

    0

    ∫ 2π

    0

    exp(2πiRr cos(φ− ψ))rdrdφ =

    =2

    r2bps

    ∫ rbps

    0

    rJ0(2πRr)dr =

    =1

    πrbpsRJ1(2πrbpsR)

    (3)

    The contribution from the phosphodiester back-bone of the double helix (with deoxyribose sugars)can be approximated by two non-discrete helicesof radii rbb = 9.0 Å (calculated as a weighted av-erage of the backbone atom positions). As the3.4 Å peak is a halo of the origin peak, it sufficesto calculate the transverse width of the latter. Forthis purpose, the two helices can be replaced bytheir projections onto the xy plane. The projec-tions coincide and form a circle of radius rbb. Asalready implied by equation 1 for the special casen = 0, the Fourier transform of a circle is a Besselfunction of order 0. Up to a multiplicative fac-tor (discussed below) the Fourier transform in the(x, y, (3.4 Å)−1) plane can therefore be written

    Tbb(R,ψ) = J0(2πrbbR) (4)

    The scattering of the complete dsDNA can becalculated by adding the base pair and back-bone atom contributions with proper weights andphases. A simple, but tedious calculation suggests

    a weighing of 3:1 for the contributions of bases andbackbone. Interference on the axis is in antiphase,due to the position of the strongly scattering phos-phorus atoms halfway between base pairs (alongthe helix axis).

    |TdsDNA(R,ψ)| = |Tbps(R,ψ) − Tbb(R,ψ)| =

    = |3 1πrbpsR

    J1(2πrbpsR) − J0(2πrbbR)| (5)

    Both 1πrbpsR

    J1(2πrbpsR) and J0(2πrbbR) tend to-

    wards 1 as R→ 0, and both decrease with increas-ing R. As the 1

    πrbpsRJ1(2πrbpsR) term decreases

    faster, the net sum describes a function with amaximum at R ≈ 0.04 Å−1.

    The predictions of analytical formulas 3, 4 and5 were tested against diffraction patterns of idealB-DNA generated with the program 3DNA [11].The agreement is excellent for base pairs (Fig. 2a),backbones (Fig. 2b) and for complete B-DNA(Fig. 2c). The cylinder radius 0.09 Å−1, whichmaximizes the performance of the DIBER classi-fier in stand-alone mode, is only slightly smallerthan the distance from the axis to the first mini-mum (Fig. 2c).

    A.3 Longitudinal width of the 3.4 Åintensity peak

    In order to determine the longitudinal width of the3.4 Å peak, it suffices to know the Fourier trans-form on the helix axis in reciprocal space. For-tunately, this can be calculated by the projection

    13

  • theorem as the one dimensional Fourier transformof the electron density projection on the real spacehelix axis. Therefore, the structure factor contri-butions of nucleotide pairs m = 0, 1, . . . , N − 1differ only by phase factors e2πiζmp. These de-pend on the wave number ζ in the direction of thehelix axis and on the axial-spacing p = 3.4 Å. Forζ = 1/p the phase factors are all equal to 1. Forζ = (1+δ)/p with (dimensionless) small, but non-zero δ, complex numbers with non-trivial phaserelationships are added. The combined structurefactor F (ζ) (due to residues 0, 1, . . . , N − 1) canbe expressed as the product of the structure factorfor a single nucleotide pair Fs(ζ) with a geometricseries.

    F (ζ) = Fs(ζ) ·N−1∑

    m=0

    e2πiζmp

    = Fs(ζ) ·1 − e2πiζpN1 − e2πiζp (6)

    I(ζ) =

    = F (ζ)F ⋆(ζ) = |F 2s (ζ)|sin2(πζpN)

    sin2(πζp)

    =

    Fs

    (

    1 + δ

    p

    )∣

    2sin2(πNδ)

    sin2(πδ)(7)

    The second term in the product is maximal forδ = 0 and shrinks to 0 for δ = 1/N ≪ 1. Inthis interval, the first term is roughly constantand can be replaced by its value for δ = 0. Withthis approximation and the well-known expansionsin(x) = x · (1 − x2/6 + . . . ) for |x| ≪ 1, the in-tensity can be written as:

    I(ζ) = F (ζ)F ⋆(ζ)

    ≈∣

    Fs

    (

    1

    p

    )∣

    2sin2(πNδ)

    sin2(πδ)(8)

    ≈ N2∣

    Fs

    (

    1

    p

    )∣

    2(

    1 − (πNδ)2

    6

    1 − (πδ)26

    )2

    (9)

    This is down to 50% of the value for δ = 0 for:

    δl =

    6 − 3√

    2

    π√

    N2 − 1/√

    2≈√

    6 − 3√

    2

    πNfor N ≫ 1 (10)

    It translates into a half-width at half-maximum(in wave numbers):

    HWHM l =δlp

    ≈√

    6 − 3√

    2

    πpN≈ 0.12 Å−1 1

    N(11)

    The more exact calculation agrees reasonably wellwith the rough estimate of the Introduction:

    HWHM longitudinal = 0.15 Å−1 1

    N(12)

    The cylinder height 0.04 Å−1, which maximizesthe performance of the DIBER classifier in standalone mode, must be compared with the full-width at half-maximum, which is approximately0.24 Å−1/N , and with the distance between firstminima, which is approximately twice larger. Ap-parently, the optimal height of the averaging cylin-der is in between the distance between first min-ima and the full-width at half-maximum for mostdsDNA helices.

    Acknowledgements

    This work was supported by a Polish Ministry ofScience and Higher Education grant to MB (NN204 240834). We are grateful to Dr HonorataCzapinska for proofreading the manuscript

    References

    [1] G.B. Arfken and H.J. Weber. MathematicalMethods for Physicists, Sixth Edition. Aca-demic Press, 2005.

    [2] M. Bochtler and G. Chojnowski. The high-est reflection intensity in a resolution shell.Acta Crystallographica Section A, 63(2):146–155, Mar 2007.

    [3] Chih-Chung Chang and Chih-Jen Lin. LIB-SVM: a library for support vector ma-chines, 2001. Software available athttp://www.csie.ntu.edu.tw/˜cjlin/libsvm.

    [4] W. Cochran, F. H. Crick, and V. Vand. Thestructure of synthetic polypeptides. I. Thetransform of atoms on a helix. Acta Crys-tallographica, 5(5):581–586, Sep 1952.

    14

  • [5] Collaborative Computational Project, Num-ber 4. The CCP4 suite: programs for proteincrystallography. Acta Crystallographica Sec-tion D, 50(5):760–763, Sep 1994.

    [6] K. Cowtan. The clipper c++ libraries forx-ray crystallography. IUCr Commission onCrystallographic Computing Newsletter, 2:4–9, 2003.

    [7] R.E. Franklin and RG Gosling. Molecu-lar Configuration in Sodium Thymonucleate.Nature, 171:740–741, 1953.

    [8] R. H. Hardin, N. J. A. Sloane, and W. D.Smith. Tables of spherical codes with icosa-hedral symmetry. Published electronically athttp://www.research.att.com/˜njas/icosahedral.codes/, 2000.

    [9] J. D. Hunter. Matplotlib: A 2d graphics en-vironment. Computing in Science and Engi-neering, 9(3):90–95, 2007.

    [10] A. Klug, F. H. C. Crick, and H. W. Wyck-off. Diffraction by helical structures. ActaCrystallographica, 11(3):199–213, Mar 1958.

    [11] X. J. Lu and W. K. Olson. 3dna: a soft-ware package for the analysis, rebuildingand visualization of three-dimensional nu-cleic acid structures. Nucleic Acids Research,31(17):5108–5121, September 2003.

    [12] A. J. McCoy, R. W. Grosse-Kunstleve, P. D.Adams, M. D. Winn, L. C. Storoni, andR. J. Read. Phaser crystallographic soft-ware. Journal of Applied Crystallography,40(4):658–674, Aug 2007.

    [13] Randy J. Read. Pushing the boundaries ofmolecular replacement with maximum like-lihood. Acta Crystallographica Section D,57(10):1373–1382, Oct 2001.

    [14] L. C. Storoni, A. J. McCoy, and R. J. Read.Likelihood-enhanced fast rotation functions.Acta Crystallographica Section D, 60(3):432–438, Mar 2004.

    [15] Giedre Tamulaitiene, Arturas Jakubauskas,Claus Urbanke, Robert Huber, SauliusGrazulis, and Virginijus Siksnys. The crys-tal structure of the rare-cutting restriction

    enzyme SdaI reveals unexpected domain ar-chitecture. Structure, 14(9):1389–1400, Sep2006.

    [16] A. Vagin and A. Teplyakov. MOLREP: anAutomated Program for Molecular Replace-ment. Journal of Applied Crystallography,30(6):1022–1025, Dec 1997.

    [17] MHF Wilkins, WE Seeds, AR Stokes, andHR Wilson. Helical structure of crys-talline deoxypentose nucleic acid. Nature,172(4382):759–762, 1953.

    15

  • A Supplementary material

    16

  • 0 0.01 0.02 0.031/

    3�V [�A�1 ]2.0

    4.0

    6.0

    8.0

    10.0

    The

    stro

    nges

    t ave

    rage

    E2

    (a)

    Standalone mode

    0 0.01 0.02 0.031/

    3�V [�A�1 ]0.0

    100.0

    200.0

    300.0

    400.0To

    p LL

    G

    (b)

    PHASER only mode

    Figure S1: Comparison of characteristic signals of continuous (red circles) and discontinuous (blackcrosses) DNA molecules in the crystal. The (a) largest local intensity average and (b) top rotation scoreare plotted against the cube root of the reciprocal unit cell volume.

    17

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure S2: Standalone mode classifier performance for different training and test sets. Plain bars illus-trate the results for the control set with all structures. Dotted bars describe the results after removingstructures with DNA, but with less than two base pairs in B-DNA conformation. Hatched bars show theresults for the set after removing structures with pseudoorigin peaks (threshold 40% of the origin peak).Bars with open circles summarize the results after removing structures with pseudoorigins or very shortB-DNA fragments. Colour coding is like in Fig. 9 (green for protein only, red for protein and DNA, bluefor DNA only and white for no prediction). Panels (a), (b) and (c) are for classification at all costs andwith greater than 80% and 90% correct classification probability, respectively.18

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure S3: Combined mode classifier performance for different training and test sets. The same symbolsand colours as in Fig. S2 are used.

    19

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure S4: Standalone mode classifier performance with different measures of unit cell size. The cuberoot of the reciprocal unit cell volume (plain bars) is compared with the inverse of the smallest unit celldimension (dotted bars). Colours and panels are as in Fig. S2.

    20

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure S5: Combined mode classifier performance with different measures of unit cell size. The samesymbols and colours as in Fig. S4 are used.

    21

  • protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (a)

    At all costs

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (b)

    Above 80% classification probability

    protein prot-DNA DNA0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    Sco

    res

    (c)

    Above 90% classification probability

    Figure S6: The rotation score (plain bars) versus Z-score (hatched bars) as an input parameter for theclassifier in Phaser only mode. Both scores were calculated with the same Phaser settings and combinedwith the standard cube root of the reciprocal unit cell volume for classification. Colour coding is like inFig. S2.

    22