addressing protein crystallization bottlenecks by screening multiple homologs

Joint Center for Molecular Modeling

Addressing Protein Crystallization Bottlenecks by Screening Multiple Homologs

Lukasz Jaroszewski, Lukasz Slabinski, John Wooley, Ian A. Wilson, Ashley M. Deacon, Scott A. Lesley, and Adam Godzik

The Protein Structure Initiative "Bottlenecks" Workshop, Bethesda, April 14-16, 2008


The TargetDB database provides the first large and diverse learning sets for studying protein “production” and crystallization

Protein “production” learning set (from cloning to purified protein)

Positive subset (successes): 12,850 targets listed as purified in TargetDB by PSI centers

Negative subset (failures): 13,587 targets: all stopped targets that were listed as cloned but not purified, plus all targets that were cloned but not purified and did not show any further progress after 18 months

Protein crystallization learning set

Positive subset (successes): 3,140 protein structures solved by X-ray crystallography by PSI centers

Negative subset (failures): 5,819 targets: all stopped targets listed as purified but not crystallized and not assigned to NMR, plus all targets that were purified and did not show any progress for more than 18 months.
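The learning-set definitions above can be read as labeling rules applied to each target record. The sketch below is one possible reading, not the authors' code: the `Target` fields, the stage names, and the strict "more than 18 months" comparison are all assumptions made for illustration and do not reflect the actual TargetDB schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

STALL_MONTHS = 18  # inactivity cutoff quoted on the slide


@dataclass
class Target:
    # All field and stage names are hypothetical, not the TargetDB schema.
    target_id: str
    stages: Set[str] = field(default_factory=set)  # e.g. {"cloned", "purified"}
    stopped: bool = False         # work on the target was stopped
    months_inactive: int = 0      # months since the last status change
    assigned_to_nmr: bool = False


def production_label(t: Target) -> Optional[bool]:
    """True = success, False = failure, None = not in the learning set."""
    if "purified" in t.stages:
        return True
    if "cloned" in t.stages and (t.stopped or t.months_inactive > STALL_MONTHS):
        return False
    return None


def crystallization_label(t: Target) -> Optional[bool]:
    """True = X-ray structure solved, False = failure, None = excluded."""
    if "xray_structure" in t.stages:
        return True
    if "purified" in t.stages and "crystallized" not in t.stages:
        if t.stopped and not t.assigned_to_nmr:
            return False
        if t.months_inactive > STALL_MONTHS:
            return False
    return None
```

Targets that return `None` are still in progress (or went to NMR) and would be left out of both subsets under these assumptions.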


The ability to crystallize is correlated only for very close homologs…

[Figure: Homologs of crystallized proteins. Number of crystals and crystallization failures (left axis, 0-600) and success rate (right axis, 0-1) versus sequence identity to the closest crystallized protein, in bins 99-90, 89-60, 59-50, 49-40, 39-30 and 29-20.]


…while difficulties with crystallization are correlated for more distantly related proteins

[Figure: Homologs of not crystallized proteins. Number of crystals and crystallization failures (left axis, 0-700) and success rate (right axis, 0.00-0.45) versus sequence identity to the closest protein which did not crystallize, in bins 99-90, 89-60, 59-50, 49-40, 39-30 and 29-20.]


Probability distributions of protein “production” (stages from cloning to purified protein)

[Figure, four panels. Each panel plots the number of successes and failures (left axis) and the probability P (right axis) per bin of one sequence feature: sequence length (P_length, bins from 30-50 to 700-1200); GRAVY hydrophobicity index (P_GRAVY, bins from (-1.5)-(-1.0) to 0.4-0.5); isoelectric point for length < 345 aa (P_pI, bins from 3-4 to 9-13); isoelectric point for length > 345 aa (P_pI, bins from 3-4 to 9-13).]
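The GRAVY hydrophobicity index used as one of the features is conventionally the grand average of hydropathy, that is, the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The slides do not say which implementation was used, so the following is a minimal sketch under that conventional definition; skipping non-standard residue codes is an assumption made here.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a one-letter protein sequence.

    Residue codes outside the 20 standard amino acids are skipped
    (an assumption of this sketch).
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard residues")
    return sum(values) / len(values)
```

Negative values indicate a more hydrophilic protein, positive values a more hydrophobic one, which matches the sign convention of the bins in the GRAVY panel.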


Probability distributions of protein crystallization

[Figure, four panels. Each panel plots the number of crystals and failures (left axis) and the probability P (right axis) per bin of one sequence feature: sequence length (P_length, bins from 50-70 to 600-700); isoelectric point for length < 345 aa (P_pI, bins from 3-5 to 11-13); isoelectric point for length > 345 aa (P_pI, bins from 3-5 to 9-11); GRAVY hydrophobicity index (P_GRAVY, bins from (-1.5)-(-1.0) to 0.3-0.4).]


Probability distributions of protein crystallization

[Figure, five panels. Each panel plots the number of crystals and failures (left axis) and the probability P (right axis) per bin of one sequence feature: the longest disordered region in aa (P_ldiso, bins 0-10 to 60-); instability index (P_II, bins 5-10 to 60-90); % predicted coil structure (P_coils, bins 0-10 to 50-70); coiled-coil structure prediction (P_cc, bins 0-20 and 20-); % insertions (P_ins, bins 0-5, 5-20 and 20-100).]
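Each panel pairs per-bin counts of crystals and failures with a probability. A natural reading, which is an assumption here rather than something the slides state, is that the probability in a bin is successes / (successes + failures). The sketch below estimates such per-bin rates from raw feature values; the function names and the toy data in the usage note are illustrative only.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the half-open bin [edges[i], edges[i+1]) holding value, or None."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def bin_success_rates(values, outcomes, edges):
    """Per-bin success rate; None for bins that received no targets."""
    n_bins = len(edges) - 1
    successes = [0] * n_bins
    totals = [0] * n_bins
    for value, crystallized in zip(values, outcomes):
        i = bin_index(value, edges)
        if i is None:
            continue  # value falls outside the binned range
        totals[i] += 1
        successes[i] += int(crystallized)
    return [s / t if t else None for s, t in zip(successes, totals)]
```

For example, with made-up disordered-region lengths `[5, 7, 15, 25, 25]`, outcomes `[True, False, True, False, False]` and edges `[0, 10, 20, 30]`, the function returns one rate per bin; looking up a new target's bin with `bin_index` then gives its individual probability for that feature.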


It is possible to combine individual probabilities into one estimate of crystallization probability (“crystallization score”)

$$P = k \prod_{i=1}^{n} p_i^{w_i}$$

k – normalizing constant

p_i – individual probability distributions, such as P_length, P_pI, P_GRAVY, etc.

n – number of individual probability distributions

w_i – weights of the individual probability distributions (we set all weights equal to 1/n, since the size of the learning sets did not allow optimization of individual weights).

We used a method called a logarithmic opinion pool, known from financial risk analysis. The probability of protein crystallization is estimated by the weighted product of the individual probabilities.
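As a minimal sketch of the logarithmic opinion pool described above, the function below multiplies the individual probabilities raised to their weights, working in log space for numerical stability. The default of equal weights 1/n follows the slide; the default `k=1.0` is an assumption, since the slide does not say how the normalizing constant was chosen.

```python
import math


def crystallization_score(probabilities, weights=None, k=1.0):
    """Logarithmic opinion pool: k * prod(p_i ** w_i)."""
    n = len(probabilities)
    if n == 0:
        raise ValueError("need at least one individual probability")
    if weights is None:
        weights = [1.0 / n] * n  # equal weights, as stated on the slide
    if len(weights) != n:
        raise ValueError("one weight per probability is required")
    if any(p < 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if any(p == 0.0 for p in probabilities):
        return 0.0  # a single zero factor makes the whole product zero
    log_pool = sum(w * math.log(p) for w, p in zip(weights, probabilities))
    return k * math.exp(log_pool)
```

With equal weights the score reduces to the geometric mean of the individual probabilities, so one very unfavorable feature pulls the score down more strongly than it would in an arithmetic average.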


Jack-knife tests confirm that the crystallization score has predictive power.

[Figure: success rate (axis 0-70) versus percentile of targets according to crystallization score (0-100), with the percentile axis divided into the classes optimal, suboptimal, average, difficult and very difficult. Curves: Sc, SCs, ScL, Scb.]

Sc – learning set rank-ordered by the crystallization score derived from the same set

SCs – crystallization score derived from the data from four large PSI centers (JCSG, MCSG, NESG, and NYSGXRC) and used to rank-order targets from all other centers (BSGC, BCGI, CESG, ISFI, OPPF, S2F, SECSG, SGPP, SPINE-EU, YSG, TB, and RSGI)

ScL – the opposite of SCs (score derived from the other centers and used to rank-order targets from the four large centers)

Scb – crystallization score used to rank-order targets deposited in TargetDB after the crystallization score was derived

Since the sets of targets used in the tests have different average success rates (from 33% to 41%), the normalized plot is shown in the inset.
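The evaluation step behind such a plot can be sketched as follows: rank-order the targets by score, cut the ranking into equal-sized percentile bins, and compute the success rate in each bin. Dividing by the overall success rate gives a normalized curve comparable across test sets. The equal-sized bins and the function names are assumptions of this sketch, and it covers only the ranking step, not how the score was derived on held-out data.

```python
def success_rate_by_percentile(scores, outcomes, n_bins=5):
    """Success rate per percentile bin, best-scoring bin first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    rates = []
    for b in range(n_bins):
        chunk = order[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            rates.append(sum(bool(outcomes[i]) for i in chunk) / len(chunk))
        else:
            rates.append(None)  # fewer targets than bins
    return rates


def normalize_rates(rates, outcomes):
    """Divide each bin rate by the overall success rate of the test set."""
    overall = sum(bool(o) for o in outcomes) / len(outcomes)
    return [None if r is None else r / overall for r in rates]
```

If the score has predictive power, the rates should fall from the first bin to the last, and the normalized values should cross 1.0 somewhere in the middle of the ranking.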


Crystallization score can be used to split targets from TargetDB into classes with different success rates

Protein crystallization


Each completely sequenced genome brings more suitable targets for about 7 protein families

[Figure, panels A and B: number of covered Pfam families versus number of genomes (1 to 487). Legend: all Pfam families; Pfam families without structures.]
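A coverage curve of this kind can be sketched as a running set union: for each genome, take the set of families in which it contributes at least one suitable target, and count how many distinct families are covered after each genome is added. The input format and family identifiers below are hypothetical, and the order in which genomes are added is an assumption that affects the shape of the curve.

```python
def cumulative_family_coverage(genomes):
    """Number of distinct covered families after adding each genome in turn.

    genomes: iterable of sets of family identifiers, one set per genome.
    """
    covered = set()
    counts = []
    for families in genomes:
        covered |= set(families)
        counts.append(len(covered))
    return counts


def average_gain_per_genome(counts):
    """Mean number of newly covered families per added genome."""
    if len(counts) < 2:
        return 0.0
    return (counts[-1] - counts[0]) / (len(counts) - 1)
```

The per-genome figure quoted in the slide title would correspond to the average slope of such a curve.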


[Figure: distribution of crystallization classes (0-100%) in Pfam families, with protein families sorted by percentage of very difficult targets; second axis: normalized average number of solved structures per Pfam family (0-9).]

Broad distribution of crystallization classes in protein families allows promising targets to be found in many “difficult” families.

The number of structures solved from a family is correlated with the number of “crystallizable” targets from that family.

Crystallization classes: 1 Optimal, 2 Suboptimal, 3 Average, 4 Difficult, 5 Very difficult


Assessing the bias introduced in representative structures from protein families by crystallizability

[Figure, six panels of normalized counts per bin: sequence length distributions (length bins 100-900); GRAVY index distributions (bins -1 to 1); pI distributions (bins 3-13); instability index distributions; the longest disordered fragment distributions; percentage of coil structure distributions.]

Distributions of protein features calculated for:

• Sequences of microbial members of Pfam families without any solved structures (red)

• Sequences of microbial members of Pfam families with at least one solved structure (green)

• Sequences of solved members of Pfam families (blue)

• Actual sequence constructs of solved members of Pfam families (black)


The distribution of crystallizability classes in microbial genomes is more even than in protein families.

[Figure: distribution of crystallization classes (0-100%) across microbial genomes, grouped as thermophiles, pathogens and symbionts, and other. Crystallization classes: 1 Optimal, 2 Suboptimal, 3 Average, 4 Difficult, 5 Very difficult.]

Selected group of organisms           Average percent of “optimal” class    Average percent of “very difficult” class
Archaea                               13                                    39
Bacteria                              11                                    45
Thermophilic and hyperthermophilic    13                                    40
Pathogens and symbionts               11                                    46
Other                                 11                                    44


JCSG crystallization score distribution also confirms high complementarity of X-ray and NMR

[Figure: percentage of targets (0-60%) in each crystallization class (1-5), shown for TargetDB targets solved by X-ray and TargetDB targets solved by NMR.]


Requirements for optimal NMR targets differ from requirements for optimal X-ray targets

[Figure, four panels showing the number of tracts per protein: triple tracts of the same residue; Pro-Pro-Pro tracts; Pro-Xxx-Pro tracts; Pro-Pro tracts.]

Specific sequence features that increase the cost of solving a protein structure by NMR

                       X-ray                              NMR
pI                     preference for acidic proteins     no strong preference
structural disorder    highly detrimental                 no strong preference
sequence length        up to several thousand residues    < ~220 aa

All solved structures vs. NMR-solved proteins (from PDB)
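The tract counts per protein compared on this slide can be approximated with simple pattern matching on the one-letter sequence. The sketch below is an illustration under stated assumptions, not the authors' procedure: overlapping tracts are each counted, and the middle residue of a Pro-Xxx-Pro tract may be any residue, including proline.

```python
import re

# Zero-width lookaheads so that overlapping tracts are each counted
# (an assumption; the slide does not say how overlaps were handled).
TRACT_PATTERNS = {
    "triple_same_residue": re.compile(r"(?=([A-Z])\1\1)"),
    "Pro-Pro-Pro": re.compile(r"(?=PPP)"),
    "Pro-Xxx-Pro": re.compile(r"(?=P[A-Z]P)"),
    "Pro-Pro": re.compile(r"(?=PP)"),
}


def count_tracts(sequence):
    """Count each tract type in a one-letter protein sequence."""
    seq = sequence.upper()
    return {
        name: sum(1 for _ in pattern.finditer(seq))
        for name, pattern in TRACT_PATTERNS.items()
    }


def mean_tracts_per_protein(sequences):
    """Average tract counts over a collection of sequences."""
    sequences = list(sequences)
    totals = {name: 0 for name in TRACT_PATTERNS}
    for seq in sequences:
        for name, count in count_tracts(seq).items():
            totals[name] += count
    return {name: total / len(sequences) for name, total in totals.items()}
```

Running `mean_tracts_per_protein` separately on all solved structures and on NMR-solved proteins would give the two values per panel that the comparison relies on.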


As expected, solving structures from “very difficult” families more often requires nontrivial construct design or use of NMR

[Figure: two pie charts, one for structures from the optimal group of families and one for structures from the very difficult group of families. Slices: X-ray, full sequence; X-ray, in complex with other protein; non-trivial construct design; NMR with non-trivial construct; NMR.]


Target selection and optimization server XtalPred


Summary

[Figures repeated from earlier slides: success rate versus sequence identity for homologues of crystallized proteins; number of covered Pfam families versus number of genomes (panels A and B).]

Crystallization screening of multiple homologous sequences is justified by the observation that the probability of crystallization is correlated only for very close homologs.

Based on statistics derived from the TargetDB database, it is now possible to estimate crystallization probability and separate proteins into “crystallizability classes” with different crystallization success rates.

Most protein families contain proteins from different crystallizability classes. Continuous growth of available sequence data helps in crystallization efforts by providing promising targets from “difficult” protein families.


26 months after the first PSI target draft:

• 1369 Pfam families initially identified: 352 are now solved (203 by 4 large PSI centers)

• JCSG prioritized 742 families in 3 categories, 268 solved (36%)

1-250       optimal           125 solved (50%)
251-500     suboptimal        75 solved (30%)
501-742     difficult         68 solved (27%)
743-1369    very difficult    84 solved (13%)

PSI success rate per protein family is several times higher than success rate per protein


Impact of the PSI on structural coverage of protein families annotated in the Pfam database

[Figure: PSI contribution (% of the World) per year, 2000-2007, shown for large protein families and for protein residues.]

Since start of PSI1:

solved structures: 2894

solved Pfam families: 611 (25% of the World)

solved large Pfam families: 216 (20% of the World)

Since start of PSI2:

solved structures: 1521

solved Pfam families: 312 (43% of the World)

solved large Pfam families: 76 (35% of the World)

(The rapid growth of the PSI contribution in 2007 is partly an effect of the slow release of non-PSI structures.)


UCSD & Burnham, Bioinformatics Core: John Wooley, Adam Godzik, Lukasz Jaroszewski, Slawomir Grzechnik, Sri Krishna Subramanian, Andrew Morse, Tamara Astakhova, Lian Duan, Piotr Kozbial, Dana Weekes, Natasha Sefcovic, Prasad Burra, Josie Alaoen, Cindy Cook

GNF & TSRI, Crystallomics Core: Scott Lesley, Mark Knuth, Heath Klock, Dennis Carlton, Thomas Clayton, Kevin D. Murphy, Christina Trout, Marc Deller, Daniel McMullan, Polat Abdubek, Claire Acosta, Linda M. Columbus, Julie Feuerhelm, Joanna C. Hale, Thamara Janaratne, Hope Johnson, Edward Nigoghossian, Linda Okach, Sebastian Sudek, Aprilfawn White, Ylva Elias, Glen Spraggon, Bernhard Geierstanger, Sanjay Agarwalla, Charlene Cho, Bi-Ying Yeh, Anna Grzechnik, Jessica Canseco, Mimmi Brown

Scientific Advisory Board:

Sir Tom Blundell, Univ. Cambridge

Homme Hellinga, Duke University Medical Center

James Naismith, The Scottish Structural Proteomics Facility, Univ. St. Andrews

James Paulson, Consortium for Functional Glycomics, The Scripps Research Institute

Robert Stroud, Center for Structure of Membrane Proteins, Membrane Protein Expression Center, UCSF

Soichi Wakatsuki, Photon Factory, KEK, Japan

James Wells, UC San Francisco

Todd Yeates, UCLA-DOE, Inst. for Genomics and Proteomics

TSRI, NMR Core: Kurt Wüthrich, Reto Horst, Maggie Johnson, Amaranth Chatterjee, Michael Geralt, Wojtek Augustyniak, Pedro Serrano, Bill Pedrini, William Placzek

TSRI, Administrative Core: Ian Wilson, Marc Elsliger, Gye Won Han, David Marciano, Henry Tien, Lisa van Veen

Stanford / SSRL, Structure Determination Core: Keith Hodgson, Ashley Deacon, Mitchell Miller, Herbert Axelrod, Hsiu-Ju (Jessica) Chiu, Kevin Jin, Christopher Rife, Qingping Xu, Silvya Oommachen, Henry van den Bedem, Scott Talafuse, Ronald Reyes, Abhinav Kumar, Christine Trame, Debanu Das

The JCSG is supported by the NIH Protein Structure Initiative (PSI) Grant U54 GM074898 from NIGMS (www.nigms.nih.gov).

Ex officio founding members: Raymond Stevens, TSRI; Susan Taylor, UCSD; Peter Kuhn, SSRL/TSRI; Duncan McRee, TSRI/Syrrx


JCSG Annual Meeting 2007
