Structure Databases: The Protein Data Bank Swanand Gore & Gerard Kleywegt PDBe – EBI May 7th 2010, 9-10 am Macromolecular Crystallography Course


Structure Databases: The Protein Data Bank

Swanand Gore & Gerard Kleywegt

PDBe – EBI

May 7th 2010, 9-10 am

Macromolecular Crystallography Course


Outline

• Structural Biology and Bioinformatics

• Databases in Structural Bioinformatics

• Protein Data Bank

• PDBe

Promise of Structural Biology

• Basic research
  – Insights into the biophysics of folding
  – Insights into evolution
  – Insights into enzymatic catalysis

• Applications
  – Design of drug / antibody / epitope / pesticide / enzymes
  – Design of new materials
  – Understanding disease

• Structural bioinformatics
  – Big computational and informatics toolbox
  – Full of techniques to translate insights to application
  – Databases are a vital aspect

Sequence-Structure-Function

[Diagram: Sequence, Structure and Function, linked by the activities Prediction, Modelling, Determination, Archival / Retrieval, Classification, Searching, Mining, Comparison, Alignment, Design and Engineering]

A rich toolbox

[Diagram: structural bio-info-computing, with the components Structure Refinement, Databases, Annotation, Classification, Comparison, Analysis, Mining and Prediction]

Databases are central to structural bioinformatics pipeline

[Diagram: Primary Structural Databases and Secondary Structural Databases, connected to the pipeline steps Determine, Annotate, Align, Compare, Mine, Classify, Model and Predict]

Databases help in Structure Determination

• Dihedral preferences
  – Ramachandran contours
  – Sidechain rotamer libraries
  – RNA backbone and puckers

• Likely ring conformations
  – Small molecules (CCDC)

• Molecular replacement
  – Choice of probe using homology
  – Fragment-based MR

• Validation
  – Electron density server and PrEDS

• Dunbrack, R.L., Jr. Rotamer libraries in the 21st century. Curr. Opin. Struct. Biol. 12:431-440, 2002.
• Jane S. Richardson et al. (2008) "RNA Backbone: Consensus All-angle Conformers and Modular String Nomenclature (an RNA Ontology Consortium contribution)" RNA 14:465-481.
• F. H. Allen. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380-388, 2002.
• S.C. Lovell et al. (2003) "Structure Validation by Cα Geometry: φ,ψ and Cβ Deviation." Proteins: Structure, Function and Genetics 50, 437-450.
• Claude et al. CaspR: a web server for automated molecular replacement using homology modelling. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W606-9.
• McCoy, A.J., Grosse-Kunstleve, R.W., Adams, P.D., Winn, M.D., Storoni, L.C. and Read, R.J. (2007). Phaser crystallographic software. J. Appl. Cryst. 40: 658-674.
• Gubbi et al. (2007) Solving Protein Structures Using Molecular Replacement Via Protein Fragments, Lecture Notes In Artificial Intelligence, Vol. 4578, 627.
• GJ Kleywegt et al. (2004) "The Uppsala Electron-Density Server", Acta Crystallographica, D60, 2240-2249.

Databases are vital to archiving structures!

• Structures represent invaluable scientific insights

• But it is costly to solve a structure
  – Time, effort, money

• Organize and safe-keep painstakingly determined data
  – Formal mechanisms of arranging, searching, backing up

• Wide-ranging access to an invaluable repository without compromising data integrity

• Very low cost of maintenance in comparison with the cost of content!

Databases are vital to archiving structures

• “Database is a structured collection of data held in computer storage, often incorporating software to make it accessible in various ways”

• Databases
  – Provide accessibility with safety and persistence
  – Provide context for your data against other data
  – Facilitate comparisons and data-mining

• Primary structural databases
  – Experimental data and model coordinates
    • NDB, wwPDB, BMRB, CSD, EMDB

• Secondary structural databases
  – Classification, function annotation
    • SCOP, EC2PDB, PALI, and many more!

Databases / Archival / Retrieval

• Formats of databases
  – Flat files (csv, tsv, columnar), supporting scripts
  – Relational (MySQL, Oracle): professional, indexed

• Access
  – Modes: read, write, edit, delete (PDB provides entry deposition mechanisms)
  – Means: download (wwPDB ftp), command-line or GUI (SQL queries, Oracle desktop client), web-based interfaces (PDBeDatabase service)
  – Access frequency

• Schema design
  – Tables, primary keys, foreign keys, views….
  – Normal forms: avoid data repetition, inconsistencies
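The schema-design points above (tables, primary keys, foreign keys, views, normal forms) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and the two demo rows are invented for the example and are not the PDBe schema, and SQLite stands in for the MySQL / Oracle systems named on the slide.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with two normalised tables and a view."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    con.executescript("""
        -- One row per entry: entry-level facts are stored exactly once.
        CREATE TABLE entry (
            entry_id TEXT PRIMARY KEY,
            method   TEXT NOT NULL
        );
        -- One row per chain; the foreign key ties each chain to its entry.
        CREATE TABLE chain (
            entry_id   TEXT NOT NULL REFERENCES entry(entry_id),
            chain_id   TEXT NOT NULL,
            n_residues INTEGER,
            PRIMARY KEY (entry_id, chain_id)
        );
        -- A view: a stored query that behaves like a read-only table.
        CREATE VIEW chain_summary AS
            SELECT e.entry_id, e.method, COUNT(c.chain_id) AS n_chains
            FROM entry e LEFT JOIN chain c ON c.entry_id = e.entry_id
            GROUP BY e.entry_id;
    """)
    # Made-up demo rows, not real archive content.
    con.execute("INSERT INTO entry VALUES ('demo', 'X-ray')")
    con.executemany("INSERT INTO chain VALUES (?, ?, ?)",
                    [("demo", "A", 120), ("demo", "B", 120)])
    con.commit()
    return con

if __name__ == "__main__":
    db = build_demo_db()
    print(db.execute("SELECT * FROM chain_summary").fetchall())
```

Because the method is stored once in `entry` rather than on every chain row, it cannot become inconsistent between chains, which is the point of the normal forms mentioned above. Inserting a chain for an unknown entry is rejected by the foreign key.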

Databases for Classification

• Structural hierarchy
  – CATH
    • Class, Architecture, Topology, Homology
  – SCOP
    • Class, Fold, Superfamily, Family

• Enzyme hierarchy
  – EC-PDB
    • Oxidoreductase, ligase, lyase, isomerase, hydrolase, transferase

• Functional ontology
  – GOA
    • Gene Ontology: Cellular component, Biological process, Molecular Function

• Linked to structures via SIFTS

• Christos A. Ouzounis et al. (2003) Classification schemes for protein structure and function. Nature Reviews Genetics 4, 508-519.
• Andreeva et al. (2007) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36:D419.
• Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29.
• Barrell D. et al. (2009) The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Research 2009 37: D396-D403.
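As a small illustration of a classification hierarchy, the sketch below splits an EC number into its levels and names the top-level class. The numbering of the six classes (1 to 6) follows standard EC nomenclature and is not taken from the slide; the function names are invented for this example, and only the six classes listed above are covered.

```python
# Top-level EC classes, restricted to the six named on the slide.
EC_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
}

def _strip_prefix(ec_number):
    """Remove an optional leading 'EC' and surrounding whitespace."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    return text

def ec_class(ec_number):
    """Return the top-level class name for an EC number such as '3.4.21.4'."""
    first = _strip_prefix(ec_number).split(".")[0]
    if not first.isdigit() or int(first) not in EC_CLASSES:
        raise ValueError("unrecognised EC number: %r" % ec_number)
    return EC_CLASSES[int(first)]

def ec_lineage(ec_number):
    """Return every level of the hierarchy, from class down to the full number."""
    parts = _strip_prefix(ec_number).split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

if __name__ == "__main__":
    print(ec_class("EC 3.4.21.4"))
    print(ec_lineage("EC 3.4.21.4"))
```

The same prefix idea applies to the CATH and SCOP hierarchies, where each additional level narrows the group a structure belongs to.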

Databases for Comparison

• Structural and structure-sequence alignments

• Phylogeny
  – Evolutionary trace
    • Evolutionarily important residues
    • Mapping onto structure

• Mizuguchi K, Deane CM, Blundell TL, Overington JP. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 7:2469-2471.
• SISYPHUS - structural alignments for proteins with non-trivial relationships. Andreeva et al., Nucleic Acids Research Database Issue 2007, 35, D253-D259.
• Gowri, V. S. et al. (2003). Integration of related sequences with protein three-dimensional structural families in an updated version of PALI database. Nucleic Acids Res. 2003 31: 486-488.
• Bhaduri A, Pugalenthi G, Sowdhamini R. PASS2: an automated database of protein alignments organised as structural superfamilies. BMC Bioinformatics. 2004, 5:35.
• DBAli tools: mining the protein structure space. Marc A. Marti-Renom et al. Nucleic Acids Research, doi:10.1093/nar/gkm236
• Whelan, S., P.I.W. de Bakker, & N. Goldman. (2003). Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556-1563.
• The Pfam protein families database. R.D. Finn et al., Nucleic Acids Research (2010) Database Issue 38:D211-222.
• Morgan, D.H., D.M. Kristensen, D. Mittleman, and O. Lichtarge. ET Viewer: An Application for Predicting and Visualizing Functional Sites in Protein Structures. Bioinformatics. 2006 Aug 15;22(16):2049-50.

Databases for Annotation

• SNPs

• Domains

• Active / allosteric sites

• Servant F. et al. (2002) ProDom: Automated clustering of homologous domains. Briefings in Bioinformatics. vol 3, no 3:246-251.
• Marchler-Bauer A, et al. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D205-10.
• Hulo N., Bairoch A., Bulliard V., Cerutti L., Cuche B., De Castro E., Lachaize C., Langendijk-Genevaux P.S., Sigrist C.J.A. The 20 years of PROSITE. Nucleic Acids Res. 2007.
• SitesBase: a database for structure-based protein–ligand binding site comparisons. Nicola D. Gold and Richard M. Jackson, Nucleic Acids Research, 2006, Vol. 34, Database issue D231-D234.
• sc-PDB: an Annotated Database of Druggable Binding Sites from the Protein Data Bank. Esther Kellenberger et al., J. Chem. Inf. Model., 2006, 46 (2), pp 717–727.
• Binding MOAD, a high-quality protein–ligand database. Mark L. Benson et al., Nucleic Acids Research 2008 36(Database issue):D674-D678.
• SNPeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous SNPs. Joke Reumers et al., Bioinformatics 2006 22(17):2183-2185.

Databases for Annotation

• Binding partners
  – Small molecule: TIMBAL, CREDO
  – Protein, DNA: PiBase, JAIL, BIPA

• Residues critical to enzyme mechanism

• Surface properties, cavities: Voronoia,

• CREDO: A Protein-Ligand Interaction Database for Drug Discovery. Adrian Schreyer, Tom Blundell. Chemical Biology & Drug Design, Vol. 73, No. 2. (February 2009), pp. 157-167.
• BIPA: a database for protein–nucleic acid interaction in 3D structures. Semin Lee and Tom L Blundell, Bioinformatics 2009 25(12):1559-1560.
• PIBASE: a comprehensive database of structurally defined protein interfaces. Davis FP and Sali A, Bioinformatics. 2005 May 1;21(9):1901-7.
• JAIL: a structure-based interface library for macromolecules. Stefan Günther et al. Nucleic Acids Res. 2009 January; 37(Database issue): D338–D341.
• Elke Michalsky et al., SuperLigands – a database of ligand structures derived from the Protein Data Bank, BMC Bioinformatics 2005, 6:122.
• Voronoia: analyzing packing in protein structures. Rother K et al. Nucleic Acids Res. 2009 Jan;37(Database issue):D393-5.
• CASTp: Computed Atlas of Surface Topography of proteins. Binkowski et al. Nucleic Acids Res. 2003 Jul 1;31(13):3352-5.
• The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton (2004) Nucl. Acids. Res. 32: D129-D133.

Databases of Analysis / Mining

• Secondary structure: SSEP

• Active sites

• Protein-peptide interactions

• Loop databases
  – Protein Coil Library
  – Protein Loop Classification
  – Loops in Proteins

– Protein Topology Graph Library

• Frequent structural motifs

• Oliva et al. (1997) An automated classification of the structure of protein loops. J Mol Biol 266 (4): 814-830.
• SSEP: secondary structural elements of proteins. V. Shanthi, P. Selvarani, Ch. Kiran Kumar, C. S. Mohire and K. Sekar, Nucleic Acids Research, 2003, Vol. 31, No. 13 3404-3405.
• PepX: a structural database of non-redundant protein-peptide complexes. Vanhee F et al., Nucleic Acids Res. 2010 Jan;38(Database issue):D545-51.
• Baeten L, et al. (2008) Reconstruction of Protein Backbones from the BriX Collection of Canonical Protein Fragments. PLoS Comput Biol 4(5): e1000083. doi:10.1371/journal.pcbi.1000083
• Bystroff C & Baker D. (1998). Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281, 565-77.
• LigBase: a database of families of aligned ligand binding sites in known protein sequences and structures. Stuart AC et al., Bioinformatics. 2002 Jan;18(1):200-1.
• PTGL: a web-based database application for protein topologies. Patrick May et al. Bioinformatics 2004 20(17):3277-3279; doi:10.1093/bioinformatics/bth367
• Fitzkee, N. C., Fleming, P. J, Rose G. D. (2005) The Protein Coil Library: a structural database of nonhelix, nonstrand fragments derived from the PDB. Proteins. 58 (4): 852-4.

Databases in Prediction

• Oligomeric state
  – PISA at PDBe

• 3D coordinates
  – ab-initio folding
  – homology models

• Possible binding partners and binding modes
  – small-molecule (PRECISE)
  – protein-protein (ADAN)

• Dynamics, conformational changes
  – MolMovDB

• Cellular location

• LOC3D: annotate sub-cellular localization for protein structures. Nair R, Rost B., Nucleic Acids Res. 2003 Jul 1;31(13):3337-40.
• MolMovDB: analysis and visualization of conformational change and structural flexibility. Echols N et al., Nucleic Acids Res. 2003 Jan 1;31(1):478-82.
• ADAN: a database for prediction of protein-protein interaction of modular domains mediated by linear motifs. Encinar JA et al., Bioinformatics. 2009 Sep 15;25(18):2418-24. Epub 2009 Jul 14.
• PRECISE: a Database of Predicted and Consensus Interaction Sites in Enzymes. Shu-Hsien Sheu et al., Nucleic Acids Research, 2005, Vol. 33, Database issue D206-D211.
• MODBASE, a database of annotated comparative protein structure models and associated resources. Ursula Pieper et al., Nucleic Acids Research 37, D347-D354, 2009.
• Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. (2007) 372:774–797.
• S. M. Larson. Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology. Mod Meth Comp Biol, R. Grant, ed, Horizon Press (2003).

Specialized databases with structures

• MCSIS (GPCRs, Prions etc)

• Carbohydrates
  – KEGG Glycans

• Antibodies (Abysis)

• Lysozymes

• Abysis: http://www.bioinf.org.uk/abysis/
• Horn F., Vriend G., Cohen FE. Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 29:346-349 (2001).
• LySDB - Lysozyme Structural DataBase. Mohan KS et al., Acta Crystallogr D Biol Crystallogr. 2004 Mar;60(Pt 3):597-600.

The Protein Data Bank

• Unique primary database
  – Single archive of experimentally determined macromolecular (biopolymer) structures
  – ~ 65000 entries
  – Distributed online
  – Updated weekly
  – Numerous databases derived and enriched with PDB data
  – Many frontends: RCSB, PDBe, PDBsum, OCA, MMDB, Jena, SIB

• “The PDB” is a flat-file archive
  – PDB formatted coordinate files
  – any experimental data when submitted

The Protein Data Bank

• International effort
  – Curated by RCSB, PDBe, PDBj, BMRB
  – ftp archive currently operated by RCSB

FTP traffic at PDB sites

• RCSB PDB: 200 million data downloads

• PDBe: 37 million data downloads

• PDBj: 14 million data downloads

The Protein Data Bank

• When is a biopolymer PDB-worthy?
  – Polypeptides
    • Gene products
    • Non-ribosomal
    • Synthetic peptides > 23 residues
      – Unless clearly biologically significant
  – Polynucleotides
    • > 3 residues
  – Sugars
    • > 3 sugar residues
  – Fibers
    • Only repeating unit deposited

Annual Growth of PDB

Primary databases differ by orders of magnitude in size.

• UniProtKB: 10^7 protein sequences

• GenBank: 10^11 base pairs, 10^8 gene sequences

• PDB: < 10^5 structures

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
http://www.ebi.ac.uk/uniprot/TrEMBLstats/

Annual Growth of PDB

[Bar chart: entries per year, 1972 to 2009, y-axis 0 to 9000, with series X-ray, NMR and EM]

Dominated by X-ray!

EM rising…

Redundancy in PDB (as in Nov ’08)

• Entries > 54,000

• Chains > 120,000
  – Copies of a chain in same entry
    • Homo-oligomers
  – Same chains in different entries
    • Determined by multiple labs
    • Determined under different conditions
    • Complexed with different partners
    • Mutants

• Chains < 8700 at seq. id < 30%
  – Orthologs, paralogs are very similar

• Using non-redundant chains from PDB
  – PISCES server
  – WHATIF, CATH, SCOP, DALI sets

• G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
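The idea of culling a chain list down to a non-redundant set can be sketched as a greedy filter: keep a chain only if it is below the identity threshold against every chain already kept. This is a toy illustration, not the PISCES method. In particular `percent_identity` is a naive position-by-position comparison standing in for a proper sequence alignment, and the function names and the example sequences are invented.

```python
def percent_identity(seq_a, seq_b):
    """Naive stand-in for alignment-based identity: compare position by position."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / longest

def cull(chains, threshold=30.0, identity=percent_identity):
    """Greedy culling of (name, sequence) pairs.

    A chain is kept only if its identity to every previously kept chain
    is below `threshold` percent. The result depends on the input order,
    so chains are usually sorted by quality (e.g. resolution) first.
    """
    kept = []
    for name, seq in chains:
        if all(identity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept

if __name__ == "__main__":
    demo = [
        ("A", "ACDEFGHIKL"),   # kept: first chain seen
        ("B", "ACDEFGHIKL"),   # dropped: identical copy (homo-oligomer case)
        ("C", "ACDEFGHIKV"),   # dropped: point mutant, 90% identical
        ("D", "WWWWWWWWWW"),   # kept: unrelated
    ]
    print([name for name, _ in cull(demo)])
```

Copies within one entry, the same protein in different entries and point mutants all collapse onto a single representative, which is why the chain count drops so sharply at a 30% cutoff.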

File formats at PDB

• The .pdb format
  – Header
    • Remarks
      – experimental setup
      – Refinement details
      – oligomeric state
      – deviations from expected geometry
    • Biochemical entities
      – Biopolymers, het groups
  – Coordinates
    • 3D model of the entity
    • Multiple coordinates for the same entity can exist
      – MODELs, altloc identifiers

• Structure factors
  – .cif file


File formats at PDB

XML

mmCIF

The PDB format: header

123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
HEADER    RETINOIC-ACID TRANSPORT                 28-SEP-94   1CBS      1CBS   2
COMPND    CELLULAR RETINOIC-ACID-BINDING PROTEIN TYPE II COMPLEXED     1CBS   3
COMPND   2 WITH ALL-TRANS-RETINOIC ACID (THE PRESUMED PHYSIOLOGICAL     1CBS   4
COMPND   3 LIGAND)                                                      1CBS   5
SOURCE    HUMAN (HOMO SAPIENS)                                          1CBS   6
SOURCE   2 EXPRESSION SYSTEM: (ESCHERICHIA COLI) BL21 (DE3)             1CBS   7
SOURCE   3 PLASMID: PET-3A                                              1CBS   8
SOURCE   4 GENE: HUMAN CRABP-II                                         1CBS   9
AUTHOR    G.J.KLEYWEGT,T.BERGFORS,T.A.JONES                             1CBS  10
REVDAT   1   26-JAN-95 1CBS    0                                        1CBS  11

Columns 1-6: record type

Columns 7-72: human-readable, mostly textual information
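The column layout described above lends itself to fixed-width slicing rather than whitespace splitting. The sketch below is a minimal illustration with an invented function name; it treats columns 73-80 as a trailing tag because the old-style 1CBS records shown here carry an identifier and line number there, which is an assumption about this example rather than a general rule for all PDB files.

```python
def split_header_record(line):
    """Split one PDB header line into record type, text and trailing tag.

    Columns are 1-based on the slide, so columns 1-6 are line[0:6]
    and columns 7-72 are line[6:72] in Python's 0-based slicing.
    """
    padded = line.rstrip("\n").ljust(80)
    return {
        "record": padded[0:6].strip(),   # columns 1-6: record type
        "text": padded[6:72].strip(),    # columns 7-72: human-readable text
        "tail": padded[72:80].strip(),   # columns 73-80: assumed id + line number
    }

if __name__ == "__main__":
    # Build the sample line programmatically so the columns are exact.
    sample = ("HEADER".ljust(10)
              + "RETINOIC-ACID TRANSPORT".ljust(40)
              + "28-SEP-94" + "   " + "1CBS").ljust(72) + "1CBS   2"
    print(split_header_record(sample))
```

Slicing by column keeps fields intact even when the text itself contains spaces, as in the COMPND and SOURCE records.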

The PDB format: coordinates

HETATM    1  C   ACE A   0       4.279  14.829  14.190  1.00 19.08           C
HETATM    2  O   ACE A   0       3.706  14.098  15.038  1.00 20.62           O
HETATM    3  CH3 ACE A   0       3.827  16.236  14.001  1.00 20.22           C
ATOM      4  N   MET A   1       5.514  14.621  13.695  1.00 17.77           N
ATOM      5  CA  MET A   1       6.269  13.401  13.959  1.00 16.51           C
ATOM      6  C   MET A   1       6.702  13.319  15.400  1.00 16.41           C
ATOM      7  O   MET A   1       7.036  12.248  15.870  1.00 15.38           O
ATOM      8  CB  MET A   1       7.529  13.301  13.085  1.00 16.52           C
ATOM      9  CG  MET A   1       7.292  12.805  11.676  1.00 16.48           C

Fields, left to right: atom nr, atom name, residue type, chain name, residue nr, X, Y, Z coordinates, occupancy, “B-factor”
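The fields labelled above sit in fixed columns, so a coordinate record can be read by slicing. The sketch below uses the column positions of the published PDB format for ATOM/HETATM records; the function names are invented, and `format_atom_line` exists only to build a correctly aligned test line.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record using fixed column positions."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record: %r" % record)
    return {
        "record": record,
        "serial": int(line[6:11]),         # atom nr, columns 7-11
        "name": line[12:16].strip(),       # atom name, columns 13-16
        "altloc": line[16].strip(),        # alternate location, column 17
        "resname": line[17:20].strip(),    # residue type, columns 18-20
        "chain": line[21],                 # chain name, column 22
        "resseq": int(line[22:26]),        # residue nr, columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "bfactor": float(line[60:66]),     # columns 61-66
    }

def format_atom_line(record, serial, name, resname, chain, resseq,
                     x, y, z, occupancy, bfactor, element):
    """Build a column-exact record (helper for the example only)."""
    template = ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
                "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}")
    return template.format(record, serial, name, "", resname, chain,
                           resseq, "", x, y, z, occupancy, bfactor, element)

if __name__ == "__main__":
    line = format_atom_line("ATOM", 5, " CA", "MET", "A", 1,
                            6.269, 13.401, 13.959, 1.00, 16.51, "C")
    print(line)
    print(parse_atom_line(line))
```

Splitting on whitespace would work for the lines on this slide but fails when adjacent fields run together, for example with large coordinates or four-character atom names, which is why column slicing is the safer choice.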

Protein Data Bank in Europe

• PDBe
  – European node of wwPDB
  – Started 1996 as MSD at EBI
  – Deposition site since 1999
  – Started EMDB in 2002

• PDBe operations
  – Handle deposition and annotation of PDB and EMDB entries
  – Build advanced structure databases
  – Build services for search, browsing, analysis
  – Liaise with broader structural biology community
  – Coordinate with other databases e.g. Uniprot

• Funding

• PDBe: Protein Data Bank in Europe. S. Velankar et al., Nucleic Acids Research, doi:10.1093/nar/gkp916

PDBe Deposition and Annotation

• Checks
  – Is the format correct?
  – Are biopolymer sequences in biochemical entities consistent with 3D models?
  – Are hetero groups named correctly?
  – Where does the model deviate from expected geometry?

• Record various types of information
  – Experiment: method, conditions, data resolution, spacegroup, completeness etc.
  – Sample: source, expression system, engineered etc.
  – Refinement: program, target

AutoDep Deposition Tool

AutoDep provides valuable information to depositors

• Validation of structure factors
  – EDS criteria

• http://www.ebi.ac.uk/pdbe-xdep/autodep/index.jsp


AutoDep provides valuable information to depositors

Heterogen summary and Validation against ideal representations of ligands

AutoDep provides valuable information to depositors

Oligomeric state - PQS

Sequence-structure alignment

Uniprot, Pfam, Interpro

AutoDep provides valuable information to depositors

• Revisions, withdrawal, release
  – Release sequence-only immediately
  – Release coordinates immediately
  – Hold for 1 year
  – Release after publication

• Communication with depositors
  – Help depositors understand and conform to PDB standards
  – Discussing errors

PDBe Services

PISA, SSM / PDBeFold, PDBeMotif, PDBeChem, SIFTS, PDBeStatistics, PDBeSearch, PDBeView

PDBe Services

PDBeView – the Atlas pages

• http://www.ebi.ac.uk/pdbe-srv/view/

PDBe Services

PDBeFold (SSM): has my fold been seen before, or is it novel?

PDB

???

• E. Krissinel and K. Henrick, Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. (2004). D60, 2256-2268.

PDBe Services

PDBeFold (SSM)

• Why compare structures?
  – Reveal conformational changes
    • Ligands, mutations, crystal packing, pH..
  – Judge structural variability
    • NMR ensembles, structure families
  – Discover common structural motifs
  – Identify fold
  – Infer function

• Sequence alignments do not work well for distant evolutionary relationships
  – Structures diverge much more slowly than sequences
  – Structure improves quality of alignment
  – Better inference of function, e.g. when active sites match well

• The relation between the divergence of sequence and structure in proteins. Chothia C, Lesk AM. EMBO J. 1986 Apr;5(4):823-6.

PDBe Services

PDBeFold (SSM) algorithm

[Diagram: two structures drawn as graphs of secondary-structure elements, one labelled H1-H2 and S1-S4, the other H1-H6 and S1-S7]

Match SSE graphs to get initial alignment

Iterative expansion of Cα-alignment

PDBe Services

PDBeFold (SSM)

SSM can carry out genuine multiple structure alignment to reveal a motif common to a family of structures

PDBe Services

PDBePISA

• What is the likely biological assembly of a given structure?
• Can I learn about it from crystal-packing of chains?

[Diagram labels: PDB file (ASU); Crystal Symmetry; ASU; Biological Unit]

PISA
– Generate possible assemblies
– Rank according to free energy

PDBe Services

PDBePISA

PDB entry 1P30: a monomer?

Biological unit of 1P30: homotrimer!

PDBe Services

PDBePISA

PDB entry 2TBV: a trimer?

Biological unit of 2TBV: 180-mer!

PDBe Services

PDBePISA

PDB entry 1E94

2 biological units in 1E94: a dodecamer and a hexamer!

PDBe Services

PDBeMotif

• A very powerful engine to search PDB

• Structure-sequence general searches
  – Chemical substructure
  – Predefined frequent motifs
  – Arbitrary secondary structure patterns
  – Φψ patterns
  – Protein sequences
    • Prosite motif, Uniprot, CSA accessions
    • Raw sequence
    • Regular expression
  – Interactions between ligands, protein
  – Seq-distance between protein motifs

• PDB header searches

• Specialized searches
  – Environment around an interaction
  – Motif binding
  – Occurrence of a motif inside another

• MSDmotif: exploring protein sites and motifs. Adel Golovin and Kim Henrick. BMC Bioinformatics 2008, 9:312

PDBe Services

PDBeMotif: which motif does my substructure bind often?

Staurosporine: kinase inhibitor

PDBe Services

PDBeMotif: which ligands and chemical fragments does my sequence motif bind?

Tyrosine protein kinase-specific active-site signature:

[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3)

Motif binding statistics

Chemical fragments
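A PROSITE-style pattern such as the kinase signature above maps almost directly onto a regular expression: `x` matches any residue, `[...]` lists allowed residues, `{...}` lists forbidden residues and `(n)` is a repeat count. The converter below is a minimal sketch of that mapping using Python's `re` module. The function names are invented, it handles only the syntax elements just listed plus `(n,m)` ranges, and it says nothing about how PDBeMotif itself runs such searches. The test sequence is made up for the example.

```python
import re

def _convert_element(core):
    """Translate one PROSITE element (without its repeat count) to regex."""
    if core == "x":
        return "."                          # any residue
    if core.startswith("[") and core.endswith("]"):
        return core                         # allowed residues: same syntax
    if core.startswith("{") and core.endswith("}"):
        return "[^" + core[1:-1] + "]"      # forbidden residues
    if len(core) == 1 and core.isalpha():
        return core                         # a single fixed residue
    raise ValueError("unsupported pattern element: %r" % core)

def prosite_to_regex(pattern):
    """Convert a PROSITE pattern string into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, low, high = match.groups()
        piece = _convert_element(core)
        if low is not None:
            piece += "{%s}" % low if high is None else "{%s,%s}" % (low, high)
        parts.append(piece)
    return "".join(parts)

if __name__ == "__main__":
    signature = "[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3)"
    regex = prosite_to_regex(signature)
    print(regex)
    hit = re.search(regex, "MKLGHRDLAARNILVGG")  # made-up test sequence
    print(hit.start(), hit.group())
```

Once the pattern is a compiled regular expression, the positions of sequence hits can be mapped back onto residue numbers in a structure to inspect the motif in 3D.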

PDBe Services

PDBeMotif: what does a sequence motif look like in 3D?

Tyrosine protein kinase-specific active-site signature:

[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3)

Sequence hits 3D alignment

PDBe Services

PDBeMotif: which sequences often host a Ramachandran path?

3D fragment

φ/ψ sequence

-156/-155,-103/17,-134/161

Search Sequence pattern
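Searching for a Ramachandran path amounts to sliding a short list of φ/ψ pairs along the φ/ψ values of a chain and accepting windows where every angle agrees within a tolerance. The sketch below illustrates that idea only: the function names, the 30° default tolerance and the demo angles are assumptions made for the example, not PDBeMotif parameters.

```python
def parse_phi_psi(text):
    """Parse a query such as '-156/-155,-103/17,-134/161' into (phi, psi) pairs."""
    return [tuple(float(value) for value in pair.split("/"))
            for pair in text.split(",")]

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (handles wrap-around)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)

def find_path(angles, query, tolerance=30.0):
    """Return start indices where the chain's (phi, psi) values follow the query path."""
    hits = []
    for start in range(len(angles) - len(query) + 1):
        window = angles[start:start + len(query)]
        if all(angle_diff(phi, q_phi) <= tolerance and
               angle_diff(psi, q_psi) <= tolerance
               for (phi, psi), (q_phi, q_psi) in zip(window, query)):
            hits.append(start)
    return hits

if __name__ == "__main__":
    query = parse_phi_psi("-156/-155,-103/17,-134/161")
    # Made-up backbone angles for a five-residue stretch.
    chain = [(-60, -45), (-150, -160), (-100, 20), (-140, 170), (-60, -45)]
    print(find_path(chain, query))
```

Collecting the residue types found at the hit positions across many chains would then give the sequence pattern that tends to host the path.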

PDBe Services

PDBeAnalysis: selections and statistics

• Structure Statistics
  – Frequency plots on 1 or 2 properties of entries

• Residue Statistics
  – Choose residues and make frequency plots of a property
  – Choose residues in entry meeting certain filters, and plot their property

• Atom Statistics
  – Choose atom-sets in entries and plot distances, angles, dihedrals between them

• Structure Selection
  – Create a subset of entries using various filters

• Database Browser
  – Web-based SQL query page to internal database

• Geometric Validation coupled with 3D viewer

• http://www.ebi.ac.uk/pdbe-as/pdbevalidate/

PDBe Services

PDBeAnalysis: selections and statistics

[Plots: Resolution vs Rfactor; CA1-CA2-CA3-CA4 torsion distribution; panels labelled Low res and High res]
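A torsion such as CA1-CA2-CA3-CA4 is the dihedral angle defined by four consecutive points. The sketch below computes it in plain Python with the usual atan2 formulation; the function names are invented, and the sign convention (positive for a clockwise twist when looking from the second point to the third) is the common IUPAC one, assumed here rather than taken from the slide.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Dihedral angle in degrees, in the range (-180, 180], for four 3D points."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)          # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)          # normal of the plane through p2, p3, p4
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

if __name__ == "__main__":
    # Four points forming a right-angle twist about the z axis.
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying `dihedral` to every run of four consecutive Cα positions in a set of chains and histogramming the results would give a torsion distribution of the kind plotted on this slide.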

PDBe Services

PDBeAnalysis: geometric validation

Table and plot of geometric checks

Phi-psi, chi, omega, B-value, bonds, angles, chiralities

AstexViewer coordinated with plots

PDBe Community Work

• X-ray
  – CCP4 software: MMDB, PISA, SSM, harvesting
  – Validation Task Force

• NMR
  – CCP-NMR software
  – Validation Task Force

• EM
  – Validation and standards
  – Ongoing software development

• SIFTS - coordinating with other biodatabases

• CAPRI - provide infrastructure for submission and maintenance of entries

• PiMS - information management system for protein crystallography experiments

PDBe Community Work

• EuroCarbDB
  – Databases and bioinformatic tools in glycobiology and glycomics

• BIObar
  – A toolbar for browsing biological data and databases, a Mozilla plugin for your browser

• Outreach and training
  – Roadshows: invite us!
  – Tutorials

PDBe Services: Future Emphasis

• To go from being a historic structural archive to a valuable resource for structural biomedicine

• PDBeXplore
  – Provide relevant interesting avenues to access structural information
  – Ligands, Assemblies, Enzymes, GO, CATH, Sequences, Publications, Pathways

• PDBe Validation Resource
  – Provide a comprehensive battery of validation tools during deposition and to the end-user
  – Migrate and enhance EDS server
  – Partner with CCDC to bring cutting-edge ligand validation

Summary

• Structural Bioinformatics and Biocomputing are essential to fulfilling the promise of structural biology

• Databases are indispensable to all aspects of structural bioinformatics

• PDB is the primary repository of structures, and numerous databases are built on PDB data.

• PDBe provides high-quality services to depositors and end-users, and is an active member of the structure-determination community.

• PDBe is open to all suggestions to make our services better and more relevant to your work.


Acknowledgements

• Alejandro and organizers at IPMont

• PDBe group
  – Sameer Velankar, Jawahar Swaminathan

• Designers, developers, maintainers of various structural databases at PDBe and elsewhere