Computational Crystallography Initiative Physical Biosciences Division First Aid & Pathology Data quality assessment in PHENIX Peter Zwart


Post on 30-Dec-2015



Computational Crystallography Initiative Physical Biosciences Division

First Aid & Pathology: Data quality assessment in PHENIX

Peter Zwart


Introduction

• Structure solution can be enhanced by knowledge of the quality and idiosyncrasies of the merged data
– Anomalous signal?
– Twinning
– Pseudo centering

• Data characterization should extend beyond standard quantities such as Rmerge and nominal resolution

• A full characterization of a data set can give expert systems, such as wizards, useful information on how best to solve a structure


Introduction

• Xtriage is a program that aims to characterize a merged X-ray dataset
– Probabilistic unit cell content analysis
– Likelihood-based Wilson scaling
  • Analysis of mean intensity
  • Ice ring detection
– Outlier analysis
– Twinning / pseudo centering
– Anomalous signal


Likelihood based Wilson Scaling

• Both Wilson B and nominal resolution determine the ‘looks’ of the map

Zwart & Lamzin (2003). Acta Cryst. D59, 2104-2113.

[Figure: two map panels. Bwil = 50 Å²; dmin = 2 Å. Bwil = 9 Å²; dmin = 2 Å.]


Likelihood based Wilson Scaling

• Data can be anisotropic
• Traditional ‘straight-line fitting’ is not reliable at low resolution
• Solution: likelihood-based Wilson scaling
– Results in an estimate of the anisotropic overall B value

Zwart, Grosse-Kunstleve & Adams, CCP4 Newsletter, 2005.


Likelihood based Wilson Scaling

• Likelihood-based scaling is not very sensitive to the resolution cut-off, whereas classic straight-line fitting is.

[Figure: Wilson B versus resolution cut-off for likelihood-based scaling and straight-line fitting. Inset: observed and expected <I> versus 1/resolution².]
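The classic straight-line fit referred to above can be sketched as follows. This is a minimal illustration, not the likelihood-based procedure used in Xtriage: it assumes the textbook Wilson relation ln(<I>/Σf²) = ln k − 2B·s² with s = sinθ/λ = 1/(2d), and it runs on synthetic, noise-free bin averages. The function name `wilson_straight_line` and all sample values are hypothetical.

```python
import math

def wilson_straight_line(d_spacings, ratios):
    """Least-squares line through ln(ratio) versus s^2 = 1/(4 d^2).

    ratios are bin averages of <I_obs> / sum(f_j^2).
    Returns (k, B): the intercept is ln(k) and the slope is -2B.
    """
    x = [1.0 / (4.0 * d * d) for d in d_spacings]
    y = [math.log(r) for r in ratios]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope / 2.0

# Synthetic bins generated with k = 2.0 and B = 30 A^2 (illustrative values only).
d_bins = [4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.2]
true_k, true_b = 2.0, 30.0
ratios = [true_k * math.exp(-2.0 * true_b / (4.0 * d * d)) for d in d_bins]

k_fit, b_fit = wilson_straight_line(d_bins, ratios)
print(f"k = {k_fit:.3f}, B = {b_fit:.2f} A^2")
```

With real data the low-resolution bins deviate from the straight line, which is why the fitted B depends on where the data are cut.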


Likelihood based Wilson Scaling

• Anisotropy is easily detected and can be ‘corrected’ for.
– Useful for molecular replacement and possibly for substructure solution

• Anisotropy correction cleans up your N(Z) plots

[Figure: N(Z) versus Z. Curves: observed; observed, anisotropy corrected; theory.]


Likelihood based Wilson Scaling

• For the ML Wilson scaling an ‘expected Wilson plot’ is needed

• Obtained from over 2000 high quality experimental datasets

• ‘Expected intensity’ and its standard deviation can be obtained

[Figure: <|E|²> - 1 versus 1/resolution².]


Likelihood based Wilson Scaling

• Resolution-dependent problems can be easily/automatically spotted
– Ice rings

• Empirical Wilson plots available for protein and DNA/RNA.

[Figure: <I> versus 1/resolution². Curves: observed; protein standard; DNA/RNA standard. Data is from a DNA structure.]


Outlier analyses

• Assume amplitudes are distributed according to Wilson distribution

• For a dataset of a given size, the cumulative distribution function of the largest |E| values in the dataset can be used to detect outliers

$$P(|E|_{\max} < |E|) = \left[ 1 - \exp\left(-|E|^2\right) \right]^{N_{obs}}$$

[Figure: P(largest |E| < |E|) versus |E| for 1 observation, 1000 observations, 10,000 observations and 100,000 observations.]
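The outlier criterion above can be sketched in a few lines. This is a minimal illustration under the assumption of acentric Wilson statistics, for which P(|E| < e) = 1 − exp(−e²), and of independent reflections; the function names are hypothetical and this is not the Xtriage implementation.

```python
import math

def p_largest_e_below(e_value, n_obs):
    """P(largest |E| among n_obs acentric reflections < e_value)."""
    return (1.0 - math.exp(-e_value * e_value)) ** n_obs

def outlier_p_value(e_max_observed, n_obs):
    """Chance of a largest |E| at least this big under Wilson statistics.

    Uses log1p/expm1 so that very small p-values are not lost to rounding.
    """
    tail = math.exp(-e_max_observed * e_max_observed)
    if tail >= 1.0:  # |E| is (numerically) zero: every dataset exceeds it
        return 1.0
    return -math.expm1(n_obs * math.log1p(-tail))

for n_obs in (1, 1000, 10_000, 100_000):
    print(n_obs, p_largest_e_below(3.5, n_obs), outlier_p_value(3.5, n_obs))
```

The same |E| value is unremarkable in a large dataset and suspicious in a small one, which is why the dataset size enters the test.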


Pseudo Translational Symmetry

• Can cause problems in refinement and MR
– Incorrect likelihood function due to effects of extra translational symmetry on intensity

• Can be helpful during MR
– Effective ASU is smaller if T-NCS info is used.

• The presence of pseudo centering can be detected from an analysis of the Patterson map.
– A Fobs Patterson with truncated resolution should reveal a significant off-origin peak.


Pseudo Translational Symmetry

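• A database analysis reveals that the heights of the largest off-origin peaks in truncated X-ray data sets are distributed according to a cumulative distribution F(Qmax), where Qmax is the relative peak height

[Equation: F(Qmax), an exponential expression in Qmax with two fitted parameters a and b. Only fragments survive in this transcript: the parameter values appear as 6.8*10 (exponent lost) and 3.56.]

[Figure: F(Qmax) versus relative peak height Qmax.]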


Pseudo Translational Symmetry

• 1-F(Qmax): the probability that the largest off-origin peak in your Patterson map is not due to translational NCS. This is a so-called p-value

• If a significance level of 0.01 is set, all off-origin Patterson vectors larger than 20% of the height of the origin are suspected T-NCS vectors.

PDB ID   Height (%)   P-value (%)
1sct     77           9*10^-6
1ihr     45           1*10^-3
1c8u     20           1
1ee2     10           5


Twinning

• Merohedral twinning can occur when the lattice has a higher symmetry than the intensities.

• When twinning does occur, the recorded intensities are the sum of two independent intensities.
– Normal Wilson statistics break down

• Detect twinning using intensity statistics


Twinning

• Cumulative intensity distribution can be used to identify twinning

[Figure: N(Z) versus Z for acentric data. Curves: pseudo centering; normal; perfect twin.]
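The ‘normal’ and ‘perfect twin’ reference curves can be reproduced from the standard textbook expressions for acentric data: N(Z) = 1 − exp(−Z) for untwinned data and N(Z) = 1 − (1 + 2Z)·exp(−2Z) for a perfect twin. The sketch below only tabulates these theoretical curves; the function names are hypothetical, and the pseudo-centering curve is not modelled.

```python
import math

def nz_acentric_untwinned(z):
    """Cumulative intensity distribution for untwinned acentric data."""
    return 1.0 - math.exp(-z)

def nz_acentric_perfect_twin(z):
    """Cumulative intensity distribution for a perfect twin (acentric)."""
    return 1.0 - (1.0 + 2.0 * z) * math.exp(-2.0 * z)

# Tabulate both curves over the range shown in the plot.
for z in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"Z = {z:.1f}  normal = {nz_acentric_untwinned(z):.3f}  "
          f"perfect twin = {nz_acentric_perfect_twin(z):.3f}")
```

Twinning averages independent intensities, so weak reflections become rarer and the twinned curve stays below the untwinned one at low Z, giving the sigmoidal shape.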


Twinning

Pseudo centering + twinning = N(Z) looks normal

• Anisotropy in diffraction data produces a trend similar to pseudo centering
– Anisotropy can, however, be removed

• How to detect twinning in the presence of T-NCS?
– Partition Miller indices on the basis of detected T-NCS vectors
  • Intensities of subgroups follow normal Wilson statistics (approximately)
– Use L-test for twin detection
  • Not very sensitive to T-NCS if partitioning of Miller indices is done properly.
  • No need to know twin laws; not sensitive to pseudo symmetry or certain data processing problems.


Twinning

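$$L = \frac{I(\mathbf{h}_1) - I(\mathbf{h}_2)}{I(\mathbf{h}_1) + I(\mathbf{h}_2)}$$

$$\langle |L| \rangle = \frac{\sum |L|}{N}; \qquad \langle L^2 \rangle = \frac{\sum L^2}{N}$$

where h1 and h2 are reflections close together in reciprocal space and N is the number of pairs.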
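A small simulation shows how the L statistic separates untwinned from twinned data. This is a sketch on synthetic acentric intensities (exponentially distributed for untwinned data, the average of two independent exponentials for a perfect twin); pairing neighbouring reflections is replaced by pairing independent draws, and all names are hypothetical. The theoretical values are <|L|> = 1/2 and <L²> = 1/3 for untwinned data, and 3/8 and 1/5 for a perfect twin.

```python
import random

def l_statistics(pairs):
    """Return (<|L|>, <L^2>) for (I1, I2) intensity pairs."""
    l_values = [(i1 - i2) / (i1 + i2) for i1, i2 in pairs if (i1 + i2) > 0]
    n = len(l_values)
    mean_abs_l = sum(abs(v) for v in l_values) / n
    mean_l_sq = sum(v * v for v in l_values) / n
    return mean_abs_l, mean_l_sq

def simulate_pairs(n_pairs, twinned, rng):
    """Synthetic acentric intensities; a perfect twin averages two crystals."""
    def intensity():
        if twinned:
            return 0.5 * (rng.expovariate(1.0) + rng.expovariate(1.0))
        return rng.expovariate(1.0)
    return [(intensity(), intensity()) for _ in range(n_pairs)]

rng = random.Random(0)  # fixed seed so the run is reproducible
untwinned = l_statistics(simulate_pairs(50_000, False, rng))
perfect_twin = l_statistics(simulate_pairs(50_000, True, rng))
print("untwinned    <|L|>, <L^2>:", untwinned)
print("perfect twin <|L|>, <L^2>:", perfect_twin)
```

Because L is built from local intensity ratios, overall scale and slowly varying modulations largely cancel, which is what makes the test robust.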


Twinning

• A database analysis of high-quality, untwinned datasets reveals that the values of the first and second moments of L follow a narrow distribution

• This distribution can be used to determine a multivariate Z-score
– Large values indicate twinning


Twinning

• Determination of twin laws
– From first principles
  • No twin law will be overlooked
  • PDB analysis: 36% of structures have at least 1 possible twin law
  – 50.9% merohedral; 48.2% pseudo merohedral; 0.9% both
  • 27% of cases with twin laws have intensity statistics that warrant further investigation of whether or not the data are twinned
  – 10% of whole PDB(!)

• Determination of twin fraction
– Fully automated Britton and H analyses, as well as an ML estimate of the twin fraction on the basis of the L statistic.
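As an illustration of twin fraction estimation, the H analysis can be sketched with the standard relation for acentric reflections, <H> = 1/2 − α, where H = |I1 − I2| / (I1 + I2) for twin-related intensities I1 and I2. The code below is a hedged sketch on synthetic, error-free data and is not the automated procedure in Xtriage; the function names are hypothetical, and the Britton plot and ML estimate are not shown.

```python
import random

def h_test_twin_fraction(twin_related_pairs):
    """Estimate the twin fraction alpha from <H> = 1/2 - alpha."""
    h_values = [abs(i1 - i2) / (i1 + i2)
                for i1, i2 in twin_related_pairs if (i1 + i2) > 0]
    mean_h = sum(h_values) / len(h_values)
    return 0.5 - mean_h

def simulate_twinned_pairs(n_pairs, alpha, rng):
    """Mix two independent acentric intensities with twin fraction alpha."""
    pairs = []
    for _ in range(n_pairs):
        j1 = rng.expovariate(1.0)
        j2 = rng.expovariate(1.0)
        i1 = (1.0 - alpha) * j1 + alpha * j2
        i2 = alpha * j1 + (1.0 - alpha) * j2
        pairs.append((i1, i2))
    return pairs

rng = random.Random(1)  # fixed seed so the run is reproducible
alpha_est = h_test_twin_fraction(simulate_twinned_pairs(50_000, 0.30, rng))
print(f"estimated twin fraction: {alpha_est:.3f}")
```

Note that this estimate is also raised by NCS or pseudo symmetry parallel to the twin axis, so a large value alone does not prove twinning.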


Conflicting information

• PDB ID: 1???
– Unit cell: 99.5 60.9 70.96 90 134.5 90
– Space group: C 2
– Twin laws and estimated twin fractions:
  • H,-K,-H-L : 0.44
  • H+2L,-K,-L : 0.01
  • -H-2L, K, H+L : 0.01
– <I²>/<I>² = 2.10 (theory for untwinned data: 2.0)
  • Data does not appear to be twinned
– <L> = 0.49 (theory for untwinned data: 0.5); multivariate Z-score of L test: 0.963
  • Data does not appear to be twinned


Conflicting information

• What is going on?
– Estimated twin fraction is large, but data does not seem to be twinned:
  • Twin law H,-K,-H-L is parallel to an existing NCS axis
  or
  • Twin law H,-K,-H-L is a symmetry axis, and the space group symmetry is too low
  – It should be: C2 + H,-K,-H-L = F222
    » http://www.phenix-online.org/cctbx

• Need images to make decision


Conflicting information

• A DNA example:
– Space group: P65
  • 1 twin law
– Resolution: 1.87 Å
– Native Patterson analysis indicates several significant off-origin peaks

• Intensity statistics indicate pseudo translational symmetry:
– <I^2>/<I>^2: 4.243
– N(Z) plot not very informative

[Figure: N(Z) versus normalized intensity. Curves: N(Z) observed; N(Z) theory.]


Conflicting information

• However
– L test: <L> = 0.46
  • Data might be twinned.
  – Partitioned data might not follow Wilson statistics, however.
– Britton and H analyses estimate of twin fraction is about 40%

• Wrong space group?
– Monomer would not fit in ASU
– Twinning, pseudo symmetry, or both?

• Not clear from experimental data only; use deposited coordinates
– Rwork = 28%; Rfree = 34%
– Twin fractions via Britton plot
  » From Fcalc: 11% (due to pseudo symmetry only)
  » From Fobs: 41% (pseudo symmetry + twinning)

See Lebedev, Vagin, Murshudov (2006) Acta D62, 83-95.

• Data likely to be twinned.
– Difficult to spot due to TPS and RPS effects on intensity statistics


Anomalous data

• Structure solution via experimental methods (especially SAD) is on the rise.

• Presence of anomalous signal is indicated by a quantity called measurability:
– Fraction of Bijvoet differences for which
  • $|\Delta I|/\sigma(\Delta I) > 3$ and ($I(+)/\sigma(I(+)) > 3$ and $I(-)/\sigma(I(-)) > 3$)
– Easy to interpret
  • At 3 Å, 6% of Bijvoet differences are significantly larger than zero
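The measurability definition above translates directly into a counting loop. The sketch below is illustrative only: the function name and the sample numbers are hypothetical, sigmas are assumed to be positive, and σ(ΔI) is assumed to follow from simple error propagation, sqrt(σ(+)² + σ(−)²), which may differ from what a given program uses.

```python
import math

def measurability(bijvoet_pairs, cutoff=3.0):
    """Fraction of Bijvoet pairs passing both significance criteria.

    Each pair is (I_plus, sigma_plus, I_minus, sigma_minus).
    """
    if not bijvoet_pairs:
        return 0.0
    n_pass = 0
    for i_plus, sig_plus, i_minus, sig_minus in bijvoet_pairs:
        # Assumption: sigma of the difference from error propagation.
        sig_delta = math.sqrt(sig_plus ** 2 + sig_minus ** 2)
        strong_difference = abs(i_plus - i_minus) / sig_delta > cutoff
        strong_mates = min(i_plus / sig_plus, i_minus / sig_minus) > cutoff
        if strong_difference and strong_mates:
            n_pass += 1
    return n_pass / len(bijvoet_pairs)

# Made-up example pairs.
pairs = [
    (120.0, 4.0, 95.0, 4.0),   # significant difference, both mates strong
    (50.0, 5.0, 48.0, 5.0),    # difference too small
    (30.0, 12.0, 90.0, 6.0),   # I(+) is weaker than 3 sigma
    (200.0, 5.0, 170.0, 5.0),  # significant difference, both mates strong
]
print(measurability(pairs))
```

In practice the fraction is computed per resolution shell, giving the measurability-versus-resolution curves used in this talk.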


Anomalous data

• Measurability and <ΔI/σ(ΔI)> are closely related

• Measurability more directly translates to the number of ‘useful’ Bijvoet differences in substructure solution/phasing


Anomalous data

[Figure: measurability versus 1/resolution² for 6 (partially occupied) iodines in thaumatin at λ = 1.5 Å, with resolution points A and B marked and the corresponding maps. Raw SAD phases, straight after PHASER.]


Anomalous data

[Figure: measurability versus 1/resolution² for 6 (partially occupied) iodines in thaumatin at λ = 1.5 Å, with resolution points A and B marked and the corresponding maps. Density modified phases.]


Anomalous data

• SAD phasing with PHASER
– Very sensitive residual maps
  • Residual map indicates where a certain type of anomalous scatterer needs to be placed to improve the fit between observed and expected F(+) and F(-)

• Lysozyme soaked with solution containing (NH4)2(OsCl6)
– Wilson B: 13.7 Å²; dmin = 1.7 Å
– Data collected at Os L-III edge (f″ > 10)
– Measurability at 3.0 Å is 67%
  • Anomalous signal is strong
– Partial structure is large
  • $Z_{heavy}^2 / (Z_{heavy}^2 + Z_{protein}^2) = 35\%$

[Figure: PHASER residual map indicating location of main chain atoms.]


[Figure: raw PHASER SAD phases.]


Anomalous data

• Another extreme
– 2 Fe4S4 clusters in 60 residues
  • Wilson B: 6.5 Å²; dmin = 1.2 Å
  • Measurability at 3.0 Å: 6%
  – Data not terribly strong
  • $Z_{Fe}^2 / (Z_{Fe}^2 + Z_{S}^2 + Z_{protein}^2) = 17\%$
  • Fe f″ = 1.25 e; S f″ = 0.35 e
– PHASER residual map from Fe SAD phases clearly shows S positions

[Figure: SAD on Fe; residual maps indicate S positions (green balls).]


Anomalous data

• Inclusion of sulfurs improves phasing
– $(Z_{Fe}^2 + Z_{S}^2) / (Z_{Fe}^2 + Z_{S}^2 + Z_{protein}^2) = 32\%$
– <FOM> = 0.67 (was 0.53)
– Residual maps show almost all non-hydrogen atoms
– Inclusion of non-hydrogen atoms results in <FOM> = 0.98.

[Figure: SAD on Fe, S. Residual maps (purple) and FOM-weighted Fobs map (blue).]


Discussion & Conclusions

• Software tools are available to point out specific problems
– mmtbx.xtriage <input_reflection_file> [params]

• Log files are not just numbers; they also contain an extensive interpretation of the statistics

• Knowing the idiosyncrasies of your X-ray data can help you avoid certain pitfalls.
– Undetected twinning, for instance


First Aid

Analyses at the beamline

If problems are detected while still at the beamline, they can often be solved by recollecting data or adapting the data collection strategy.

The Surgeon and the Peasant – 1524. Lucas van Leyden


Pathology/Autopsy

Analyses at home

The anatomical lesson of dr. Nicolaes Tulp - 1632. Rembrandt van Rijn.


Acknowledgements

Paul Adams

Ralf Grosse-Kunstleve

Pavel Afonine

Nigel Moriarty

Nick Sauter

Michael Hohn

Cambridge

Randy Read

Airlie McCoy

Laurent Storoni

Los Alamos

Tom Terwilliger

Li Wei Hung

Thirumugan Rhadakanan

Texas A&M University

Jim Sacchettini

Tom Ioerger

Eric McKee

Funding:

– LBNL (DE-AC03-76SF00098)

– NIH/NIGMS (P01GM063210)

– PHENIX Industrial Consortium


WWW

• Phenix: www.phenix-online.org

• Xtriage tutorials: www.phenix-online.org/tutorials

• CCTBX: cctbx.sf.net
