Addressing Protein Crystallization Bottlenecks by Screening Multiple Homologs
Lukasz Jaroszewski, Lukasz Slabinski, John Wooley, Ian A. Wilson, Ashley M. Deacon, Scott A. Lesley, and Adam Godzik. The Protein Structure Initiative "Bottlenecks" Workshop, April 14-16, 2008, Bethesda.
Joint Center for Molecular Modeling
The TargetDB database provides the first large and diverse learning sets for studying protein “production” and crystallization
Protein “production” learning set (from cloning to purified protein)
Positive subset (successes): 12,850 targets listed as purified in TargetDB by PSI centers
Negative subset (failures): 13,587 targets: all stopped targets that were listed as cloned but not purified + all targets that were cloned but not purified and did not show any further progress after 18 months
Protein crystallization learning set
Positive subset (successes): 3,140 protein structures solved by X-ray crystallography by PSI centers
Negative subset (failures): 5,819 targets: all stopped targets listed as purified but not crystallized and not assigned to NMR + all targets that were purified and did not show any progress for more than 18 months
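As a rough illustration of how the crystallization learning set described above could be assembled, the sketch below labels a single target as a success, a failure, or neither. The `Target` record and its field names are hypothetical and do not reflect the actual TargetDB schema. Leaving crystallized but unsolved targets out of the set is also an assumption, since the slide does not say how they were treated.

```python
from dataclasses import dataclass
from typing import Optional

STALL_MONTHS = 18  # inactivity threshold stated on the slide


@dataclass
class Target:
    # Hypothetical record; these fields are illustrative, not the TargetDB schema.
    purified: bool
    crystallized: bool
    solved_by_xray: bool
    assigned_to_nmr: bool
    stopped: bool
    months_since_last_progress: int


def crystallization_label(t: Target) -> Optional[str]:
    """Return 'success', 'failure', or None if the target is left out of the set."""
    if t.solved_by_xray:
        return "success"
    # Only purified targets that never crystallized can count as failures.
    if not t.purified or t.crystallized:
        return None
    # Stopped after purification and not handed over to NMR.
    if t.stopped and not t.assigned_to_nmr:
        return "failure"
    # Purified, but no progress for more than 18 months.
    if t.months_since_last_progress > STALL_MONTHS:
        return "failure"
    return None
```

A target that is still active and has been purified for only a few months gets no label, which keeps recent work in progress out of the negative subset.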
The ability to crystallize is correlated only for very close homologs…
[Figure: Homologs of crystallized proteins. X axis: sequence identity to the closest crystallized protein, in bins 99-90, 89-60, 59-50, 49-40, 39-30, 29-20. Left Y axis: number of crystals and crystallization failures. Right Y axis: success rate.]
…while difficulties with crystallization are correlated for more distantly related proteins
[Figure: Homologs of not crystallized proteins. X axis: sequence identity to the closest protein which did not crystallize, in bins 99-90, 89-60, 59-50, 49-40, 39-30, 29-20. Left Y axis: number of crystals and crystallization failures. Right Y axis: success rate.]
Probability distributions of protein “production” (stages from cloning to purified protein)
[Figure, four panels, each plotting the number of successes and failures per bin together with the derived probability. Panel 1: sequence length (bins from 30-50 to 700-1200), probability P_length. Panel 2: GRAVY hydrophobicity index (bins from -1.5 to 0.5), probability P_GRAVY. Panels 3 and 4: isoelectric point (bins from 3-4 to 9-13) for proteins with length < 345 aa and length > 345 aa, probability P_pI.]
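The probability curves on this slide are plotted next to the per-bin counts of successes and failures. A minimal sketch of one way such a per-bin probability could be computed, assuming it is simply successes / (successes + failures) in each bin, is shown below. The function name and the half-open bin convention are assumptions, not details taken from the slides.

```python
from bisect import bisect_right


def binned_success_rate(values, outcomes, edges):
    """Empirical success probability per bin.

    values   -- one feature value per target (e.g. sequence length)
    outcomes -- True for a success, False for a failure
    edges    -- ascending bin edges; bin i covers [edges[i], edges[i + 1])
    Returns one probability per bin, or None for an empty bin.
    """
    n_bins = len(edges) - 1
    successes = [0] * n_bins
    totals = [0] * n_bins
    for value, ok in zip(values, outcomes):
        i = bisect_right(edges, value) - 1
        if 0 <= i < n_bins:  # values outside the edges are ignored
            totals[i] += 1
            if ok:
                successes[i] += 1
    return [s / t if t else None for s, t in zip(successes, totals)]


# Toy data only; these are not values from TargetDB.
rates = binned_success_rate(
    values=[35, 60, 65, 150, 150, 1300],
    outcomes=[True, False, True, True, True, False],
    edges=[30, 50, 70, 100, 200],
)
```

Returning None for an empty bin avoids reporting a probability where there is no data to support one.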
Probability distributions of protein crystallization
[Figure, four panels, each plotting the number of crystals and failures per bin together with the derived probability. Panel 1: sequence length (bins from 50-70 to 600-700), probability P_length. Panels 2 and 3: isoelectric point for proteins with length < 345 aa (bins from 3-5 to 11-13) and length > 345 aa (bins from 3-5 to 9-11), probability P_pI. Panel 4: GRAVY hydrophobicity index (bins from -1.5 to 0.4), probability P_GRAVY.]
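The GRAVY hydrophobicity index used as a feature here is the grand average of hydropathy: the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The slides do not say which tool computed it, so the following is only a minimal sketch of the standard definition. The function name and the error handling are illustrative choices.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a one-letter protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Negative values indicate a more hydrophilic protein on average and positive values a more hydrophobic one, which matches the sign convention of the bins on the plot.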
Probability distributions of protein crystallization
[Figure, five panels, each plotting the number of crystals and failures per bin together with the derived probability. Panel 1: the longest disordered region in aa (bins from 0-10 to 60 and above), probability P_ldiso. Panel 2: instability index (bins from 5-10 to 60-90), probability P_II. Panel 3: % predicted coil structure (bins from 0-10 to 50-70), probability P_coils. Panel 4: coiled-coil structure prediction (bins 0-20 and 20 and above), probability P_cc. Panel 5: % insertions (bins 0-5, 5-20, 20-100), probability P_ins.]
It is possible to combine individual probabilities into one estimate of crystallization probability (“crystallization score”)
$P = k \prod_{i=1}^{n} p_i^{w_i}$
k – normalizing constant
p_i – individual probability distributions, such as P_length, P_pI, P_GRAVY, etc.
n – number of individual probability distributions
w_i – weights of the individual probability distributions (we set all weights equal to 1/n, since the size of the learning sets did not allow optimization of individual weights)
We used a method called a logarithmic opinion pool, known from financial risk analysis. The probability of protein crystallization is estimated by the weighted product of the individual probabilities.
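The pooling formula on this slide can be sketched in a few lines. With all weights equal to 1/n, the product of p_i raised to w_i is the geometric mean of the individual probabilities, scaled by k. The slide does not state how the normalizing constant k was chosen, so it is left as a plain parameter here. The function name is hypothetical.

```python
import math


def crystallization_score(probs, weights=None, k=1.0):
    """Logarithmic opinion pool: k * product(p_i ** w_i).

    probs   -- individual probabilities (P_length, P_pI, P_GRAVY, ...)
    weights -- one weight per probability; defaults to 1/n each, as on the slide
    k       -- normalizing constant; its value is not given in the slides
    """
    n = len(probs)
    if n == 0:
        raise ValueError("at least one probability is required")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and probs must have the same length")
    if any(p <= 0 for p in probs):
        return 0.0  # a zero factor makes the whole product zero
    # Sum in log space to avoid underflow when many factors are combined.
    log_pool = sum(w * math.log(p) for p, w in zip(probs, weights))
    return k * math.exp(log_pool)
```

For example, two features with probabilities 0.4 and 0.1 pool to about 0.2 with equal weights, so one unfavorable feature pulls the score down more strongly than an arithmetic mean would.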
Jack-knife tests confirm that the crystallization score has predictive power.
[Figure: success rate (%) versus percentile of targets ranked by crystallization score, with the percentile axis divided into the classes optimal, suboptimal, average, difficult, and very difficult. Curves: Sc, SCs, ScL, Scb.]
Sc – learning set rank-ordered by the crystallization score derived from the same set
SCs – crystallization score derived from the data from four large PSI centers (JCSG, MCSG, NESG, and NYSGXRC) and used to rank-order targets from all other centers (BSGC, BCGI, CESG, ISFI, OPPF, S2F, SECSG, SGPP, SPINE-EU, YSG, TB, and RSGI)
ScL – the opposite of SCs (score derived from the other centers and used to rank-order targets from the four large centers)
Scb – crystallization score used to rank-order targets deposited in TargetDB after the crystallization score was derived
Since the sets of targets used in the tests have different average success rates (from 33% to 41%), the normalized plot is shown in the inset.
Crystallization score can be used to split targets from TargetDB into classes with different success rates
Protein crystallization
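Splitting targets into classes amounts to rank-ordering them by crystallization score and cutting the ranking at fixed percentiles. The slides name five classes but do not give the cut points, so the equal 20% bands below are purely an assumption, as is the function name.

```python
CLASS_NAMES = ["optimal", "suboptimal", "average", "difficult", "very difficult"]


def assign_classes(scores, cuts=(20, 40, 60, 80)):
    """Assign each target a class from its percentile rank (best score first).

    scores -- crystallization score per target, higher is better
    cuts   -- ascending percentile boundaries; the defaults are assumed, not sourced
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    classes = [None] * n
    for rank, i in enumerate(order):
        percentile = 100.0 * rank / n
        band = sum(percentile >= c for c in cuts)  # number of boundaries passed
        classes[i] = CLASS_NAMES[band]
    return classes


# Toy scores only.
labels = assign_classes([0.9, 0.1, 0.5, 0.7, 0.3])
```

Counting successes and failures within each class on a labeled set would then give the per-class success rates that the slide refers to.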
Each completely sequenced genome brings more suitable targets for about 7 protein families
[Figure, two panels. X axis: number of genomes (1 to 487). Y axis: number of covered Pfam families. Panel A: all Pfam families. Panel B: Pfam families without structures.]
[Figure: distribution of crystallization classes in Pfam families (0% to 100%), with protein families sorted by percentage of very difficult targets. Second Y axis: normalized average number of solved structures per Pfam family.]
Broad distribution of crystallization classes in protein families allows promising targets to be found in many “difficult” families.
The number of structures solved from a family is correlated with the number of “crystallizable” targets from that family.
Crystallization classes: 1 Optimal, 2 Suboptimal, 3 Average, 4 Difficult, 5 Very difficult
Assessing the bias introduced in representative structures from protein families by crystallizability
[Figure, six panels of normalized counts per bin: sequence length distributions (length bins), GRAVY index distributions (GRAVY index bins), pI distributions (pI bins), instability index distributions (instability index bins), the longest disordered fragment distributions (disordered fragment length bins), and percentage of coil structure distributions (coil percentage bins).]
Distributions of protein features calculated for:
• Sequences of microbial members of Pfam families without any solved structures (red)
• Sequences of microbial members of Pfam families with at least one solved structure (green)
• Sequences of solved members of Pfam families (blue)
• Actual sequence constructs of solved members of Pfam families (black)
The distribution of crystallizability classes in microbial genomes is more even than in protein families.
[Figure: distribution of crystallization classes (0% to 100%) across microbial genomes, grouped as thermophiles, pathogens and symbionts, and other. Crystallization classes: 1 Optimal, 2 Suboptimal, 3 Average, 4 Difficult, 5 Very difficult.]
Selected group of organisms | Average percent of “optimal” crystallization class | Average percent of “very difficult” crystallization class
Archaea | 13 | 39
Bacteria | 11 | 45
Thermophilic and hyperthermophilic | 13 | 40
Pathogens and symbionts | 11 | 46
Other | 11 | 44
JCSG crystallization score distribution also confirms high complementarity of X-ray and NMR
[Figure: percentage of targets in each crystallization class (1 to 5) for two series: TargetDB targets solved by X-ray and TargetDB targets solved by NMR.]
Requirements for optimal NMR targets are different than requirements for optimal X-ray targets
[Figure, four panels showing the number of tracts per protein: triple tracts of the same residue, Pro-Pro-Pro tracts, Pro-Xxx-Pro tracts, and Pro-Pro tracts.]
Specific sequence features which increase the cost of solving protein structure by NMR
Feature | X-ray | NMR
pI | preference for acidic proteins | no strong preference
structural disorder | highly detrimental | no strong preference
sequence length | up to several thousand residues | < ~220 aa
All solved structures vs. NMR-solved proteins (from PDB)
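The tract counts compared on this slide can be obtained with simple pattern matching on the one-letter sequence. The sketch below is only an illustration: the slides do not say whether overlapping tracts were counted or whether Xxx in Pro-Xxx-Pro may itself be proline, so counting overlaps and excluding proline at the Xxx position are both assumptions, as are the function and key names.

```python
import re


def count_tracts(sequence: str) -> dict:
    """Count short sequence tracts, including overlapping occurrences."""
    seq = sequence.upper()

    def overlapping(pattern: str) -> int:
        # A lookahead matches at every start position without consuming text,
        # so overlapping tracts are all counted.
        return len(re.findall(f"(?=({pattern}))", seq))

    return {
        # The lookahead wrapper adds group 1, so the residue is group 2.
        "same_residue_triple": overlapping(r"(.)\2\2"),
        "pro_pro_pro": overlapping("PPP"),
        "pro_xxx_pro": overlapping("P[^P]P"),  # Xxx assumed to be non-proline
        "pro_pro": overlapping("PP"),
    }


# Toy sequence only.
counts = count_tracts("PPPAPAAAG")
```

Dividing each count by the number of proteins in a set would give the per-protein averages plotted in the four panels.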
As expected, solving structures from “very difficult” families more often requires nontrivial construct design or use of NMR
[Figure: two pie charts, one for structures from the optimal group of families and one for structures from the very difficult group of families. Categories in both: X-ray, full sequence; X-ray, in complex with other protein; non-trivial construct design; NMR with non-trivial construct; NMR.]
Target selection and optimization server XtalPred
Summary
[Figures repeated from earlier slides: homologs of crystallized proteins (number of crystals and crystallization failures, and success rate, versus sequence identity) and number of covered Pfam families versus number of genomes (panels A and B).]
Crystallization screening of multiple homologous sequences is justified by the observation that the probability of crystallization is correlated only for very close homologs.
Based on the statistics derived from the TargetDB database, it is now possible to estimate crystallization probability and separate proteins into “crystallizability classes” with different crystallization success rates.
Most protein families contain proteins from different crystallizability classes. The continuous growth of available sequence data helps crystallization efforts by providing promising targets from “difficult” protein families.
26 months after the first PSI target draft:
• 1369 Pfam families initially identified: 352 are now solved (203 by 4 large PSI centers)
• JCSG prioritized 742 families in 3 categories, 268 solved (36%)
1-250, optimal: 125 solved (50%)
251-500, suboptimal: 75 solved (30%)
500-742, difficult: 68 solved (27%)
742-1369, very difficult: 84 solved (13%)
The PSI success rate per protein family is several times higher than the success rate per protein
Impact of the PSI on structural coverage of protein families annotated in the Pfam database
[Figure: PSI contribution (% of the World) by year, 2000 to 2007, in large protein families and in protein residues.]
Since start of PSI1
solved structures: 2894
solved Pfam families: 611 (25% of the World)
solved large Pfam families: 216 (20% of the World)
Since start of PSI2
solved structures: 1521
solved Pfam families: 312 (43% of the World)
solved large Pfam families: 76 (35% of the World)
(The rapid growth of the PSI contribution in 2007 is partly an effect of the slow release of non-PSI structures.)
UCSD & Burnham, Bioinformatics Core: John Wooley, Adam Godzik, Lukasz Jaroszewski, Slawomir Grzechnik, Sri Krishna Subramanian, Andrew Morse, Tamara Astakhova, Lian Duan, Piotr Kozbial, Dana Weekes, Natasha Sefcovic, Prasad Burra, Josie Alaoen, Cindy Cook
GNF & TSRI, Crystallomics Core: Scott Lesley, Mark Knuth, Heath Klock, Dennis Carlton, Thomas Clayton, Kevin D. Murphy, Christina Trout, Marc Deller, Daniel McMullan, Polat Abdubek, Claire Acosta, Linda M. Columbus, Julie Feuerhelm, Joanna C. Hale, Thamara Janaratne, Hope Johnson, Edward Nigoghossian, Linda Okach, Sebastian Sudek, Aprilfawn White, Ylva Elias, Glen Spraggon, Bernhard Geierstanger, Sanjay Agarwalla, Charlene Cho, Bi-Ying Yeh, Anna Grzechnik, Jessica Canseco, Mimmi Brown
Scientific Advisory Board:
Sir Tom Blundell, Univ. Cambridge
Homme Hellinga, Duke University Medical Center
James Naismith, The Scottish Structural Proteomics facility, Univ. St. Andrews
James Paulson, Consortium for Functional Glycomics, The Scripps Research Institute
Robert Stroud, Center for Structure of Membrane Proteins, Membrane Protein Expression Center, UCSF
Soichi Wakatsuki, Photon Factory, KEK, Japan
James Wells, UC San Francisco
Todd Yeates, UCLA-DOE, Inst. for Genomics and Proteomics
TSRI, NMR Core: Kurt Wüthrich, Reto Horst, Maggie Johnson, Amaranth Chatterjee, Michael Geralt, Wojtek Augustyniak, Pedro Serrano, Bill Pedrini, William Placzek
TSRI, Administrative Core: Ian Wilson, Marc Elsliger, Gye Won Han, David Marciano, Henry Tien, Lisa van Veen
Stanford/SSRL, Structure Determination Core: Keith Hodgson, Ashley Deacon, Mitchell Miller, Herbert Axelrod, Hsiu-Ju (Jessica) Chiu, Kevin Jin, Christopher Rife, Qingping Xu, Silvya Oommachen, Henry van den Bedem, Scott Talafuse, Ronald Reyes, Abhinav Kumar, Christine Trame, Debanu Das
The JCSG is supported by the NIH Protein Structure Initiative (PSI) Grant U54 GM074898 from NIGMS (www.nigms.nih.gov).
Ex officio founding members: Raymond Stevens, TSRI; Susan Taylor, UCSD; Peter Kuhn, SSRL/TSRI; Duncan McRee, TSRI/Syrrx
JCSG Annual Meeting 2007