Protein Structure Database for Structural Genomics Group
Jessica Lau
December 13, 2004
M.S. Thesis Defense
• Bioinformatics is
  • Analysis of biological data: gene expression, DNA sequence, protein sequence.
  • Data mining and management of biological information through database systems.
• At the Northeast Structural Genomics Consortium (NESG), database management systems play a large role in daily operation
  • Data collection and mining of experimental results
  • Track target progress – status milestones
  • Exchange information with the rest of the world
• My thesis presents work in database management systems at the NESG.
  • Part 1: ZebaView
  • Part 2: Worm Structure Gallery
  • Part 3: Prototype of NESG Structure Gallery
• ZebaView is the official target list of the Northeast Structural Genomics Consortium
• Display summary table of NESG targets.
  – Status milestones
  – Protein properties: DNA and protein sequences, molecular weight, isoelectric point
• New targets are curated and then uploaded to SPiNE.
• 11,284 targets from 88 organisms.
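ZebaView lists molecular weight among the protein properties shown for each target. The slides do not say how that value is computed, so the following is only a minimal sketch of one common approach: sum the standard average residue masses and add one water for the free termini. The function name `molecular_weight` is hypothetical and is not part of ZebaView or SPiNE.

```python
# Standard average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water accounts for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Estimate the average molecular weight (Da) of a protein sequence.

    Hypothetical helper for illustration; only the 20 standard
    one-letter residue codes are supported.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS
```

For example, `molecular_weight("G")` gives roughly 75.07 Da, the mass of free glycine. Isoelectric point needs a set of pKa values and an iterative charge calculation, so it is left out of this sketch.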
Family View
NESG Families
• Unfolded
• Membrane
• Core 50
• NF-κB
[Figure: In PDB / Cloned, Prokaryotic vs. Eukaryotic. x-axis: Organism (H. sapiens (H), D. melanogaster (F), S. cerevisiae (Y), C. elegans (W)); y-axis: Percentage In PDB / Cloned (0 to 35); legend: Prokaryotic, Eukaryotic]
Target Summary Statistics
[Figure: Success of soluble targets, Prokaryotic vs. Eukaryotic. x-axis: Organism (D. melanogaster (F), S. cerevisiae (Y), H. sapiens (H), C. elegans (W)); y-axis: Percentage of Soluble / Cloned (0 to 90); legend: Prokaryotic, Eukaryotic]
Selected → Cloned → Expressed → Soluble → Purified → X-ray or NMR data collection → In PDB
• 4,418 targets cloned
• 141 structures
• 3.4% successful targets
GO, Cellular Localization, and SignalP
• Search for targets that have
  • any of the three GO ontologies defined
  • no GO ontologies defined at all
116 NESG structures do not have Molecular Function defined
LOCTarget
• Secretory proteins require formation of disulfide bonds
• Oxidative folding needed for proper native folding
• 2,132 “Extracellular” NESG targets
Bovine ribonuclease A has four disulfide bonds to stabilize its 3-D structure.
Mahesh Narayan et al. (2000) Acc. Chem. Res., 33 (11), 805-812.
SignalP
• mRNAs are translated with a signal peptide for cellular localization
• The peptide is cleaved upon reaching its destination
• SignalP predicts cleavage of the signal peptide
• Removal of the signal peptide gives the proper native fold
Lodish et al. Molecular Cell Biology 4th edition, Figure 7.1 (2000)
Part 2 – Worm Structure Gallery
Caenorhabditis elegans
– Widely studied model organism
• 2-3 weeks life span, small size (1.5-mm-long), ease of laboratory cultivation, transparent body
• Small genome, yet has complex organ systems similar to higher organisms: digestive, excretory, neuromuscular, reproductive systems
Donald Riddle et al., C. elegans II (1997)
Altun ZF and Hall DH, Atlas of C. elegans Anatomy, WormAtlas (2002-2004)
System Components
• 22,653 C. elegans proteins
• 42 experimentally determined
  • 4 are from NESG
• 24 homology models
  • 14 are from NESG
• 960 C. elegans proteins potentially modeled
• UniProt: Pfam domain, gene name, ORF name
• PDB coordinates
• Structure Validation Report
• Sequence similarities to proteins in PDB
Protein Structure Validation Software
• Suite of quality validation software
  – PROCHECK
    • Quality of experimental data
    • Distribution of φ, ψ angles in Ramachandran plot
  – MolProbity Clashscore
    • Number of H atom clashes per 1,000 atoms
• Z-scores with respect to a set of scores from 129 high-resolution X-ray crystal structures
  • < 500 residues, resolution <= 1.80 Å, R-factor <= 0.25 and R-free <= 0.28
Bhattacharya, A. et al., to be published
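The clashscore definition and the comparison against a reference set described above can be sketched as follows. This is a minimal illustration, not PSVS or MolProbity code: the function names are hypothetical, the reference values in the usage note are made up, and the sign convention (negated so that fewer clashes than the reference mean gives a positive Z-score) is an assumption.

```python
from statistics import mean, stdev


def clashscore(num_clashes: int, num_atoms: int) -> float:
    """Number of clashes per 1,000 atoms."""
    if num_atoms <= 0:
        raise ValueError("num_atoms must be positive")
    return 1000.0 * num_clashes / num_atoms


def z_score(raw_score: float, reference_scores, lower_is_better: bool = False) -> float:
    """Z-score of raw_score relative to a reference set of scores.

    With lower_is_better=True the sign is flipped, so a structure with
    fewer clashes than the reference mean scores above zero (assumed
    convention, not confirmed by the slides).
    """
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("reference scores have zero spread")
    z = (raw_score - mu) / sigma
    return -z if lower_is_better else z
```

With a made-up reference set of `[8.0, 10.0, 12.0]`, a structure with 35 clashes in 2,500 atoms has a clashscore of 14.0 and, under the assumed sign convention, a Z-score of about -2. In the real suite the reference set would be the 129 high-resolution crystal structures.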
Homology Modeling Automatically (HOMA)
• Algorithm based on alignment between query and template sequences.
  – Regions of conserved residues form a set of constraints for modeling
• Sequence identity of 40% or more
• Good quality template
Bad alignment → Bad model
Poor quality template → Poor quality model
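The 40% sequence identity criterion can be illustrated with a short sketch that takes two sequences already aligned to equal length, with `-` marking gaps. The slides do not state how HOMA defines identity, so the denominator used here (columns where both sequences have a residue) is an assumption, and both function names are hypothetical.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over aligned, gap-free columns.

    Both inputs must come from the same pairwise alignment and
    therefore have equal length; '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    aligned_columns = 0
    for q, t in zip(aligned_query.upper(), aligned_template.upper()):
        if q == "-" or t == "-":
            continue  # skip columns with a gap in either sequence
        aligned_columns += 1
        if q == t:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


def meets_identity_threshold(aligned_query: str, aligned_template: str,
                             threshold: float = 40.0) -> bool:
    """True if the template passes the identity cutoff (40% by default)."""
    return percent_identity(aligned_query, aligned_template) >= threshold
```

Identity alone does not guarantee a usable model: the template also has to be of good quality, which is a separate check.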
Quality scores of 3-D structures
[Figure: Quality Z-scores, Homology Models vs. Experimentally Determined Structures. x-axis: Procheck (all) z-score (-10 to 2); y-axis: MolProbity Clashscore z-score (-45 to 5); legend: Homology Models, Experimentally Determined Structures]
Search
• Search for C. elegans proteins in local database.
• Keyword: “Ubiquitin” in any field
Results:
• 72 C. elegans proteins
• 2 experimentally determined structures
• 1 homology model
• 11 potential models
Results:
• 152 C. elegans proteins
• 2 experimentally determined structures
• 1 homology model
• 19 potential models
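A keyword search "in any field" can be sketched as a case-insensitive substring match across every field of each record. This is an in-memory illustration only: the gallery itself queries a MySQL database, and the function name, record layout, and field names below are hypothetical.

```python
def search_any_field(records, keyword):
    """Return records in which any field contains the keyword.

    records: iterable of dicts mapping field name to value.
    Matching is a case-insensitive substring test.
    """
    needle = keyword.casefold()
    return [
        record
        for record in records
        if any(needle in str(value).casefold() for value in record.values())
    ]


# Made-up records for illustration; these are not gallery data.
EXAMPLE_RECORDS = [
    {"gene_name": "gene-1", "description": "Ubiquitin-like protein"},
    {"gene_name": "gene-2", "description": "Kinase"},
    {"gene_name": "ubiquitin-3", "description": "Uncharacterized"},
]
```

Calling `search_any_field(EXAMPLE_RECORDS, "Ubiquitin")` returns the first and third records, because the keyword appears in the description of one and the gene name of the other.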
System Architecture
• Java, Tomcat, MySQL, Perl.
Three-tier architecture
• Client: Web browser
• Application: JSP, Logic components, Data access components
• Data: MySQL
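The separation between the three tiers can be sketched in a few lines. The actual system uses JSP, Java components, and MySQL; this sketch uses Python with an in-memory SQLite database standing in for MySQL purely so it runs without a server. The table, columns, class, and function names are all hypothetical, and the rows are made up.

```python
import sqlite3


class ProteinDao:
    """Data access component: the only code that talks to the database."""

    def __init__(self, conn):
        self.conn = conn

    def find_by_gene_name(self, gene_name):
        cursor = self.conn.execute(
            "SELECT gene_name, structure_type FROM protein WHERE gene_name = ?",
            (gene_name,),
        )
        return cursor.fetchall()


def summarize(dao, gene_name):
    """Logic component: turns raw rows into a summary for display."""
    rows = dao.find_by_gene_name(gene_name)
    return {
        "gene_name": gene_name,
        "count": len(rows),
        "types": sorted({row[1] for row in rows}),
    }


def render(summary):
    """Presentation: plays the role a JSP page has in the original system."""
    types = ", ".join(summary["types"])
    return f"{summary['gene_name']}: {summary['count']} record(s) [{types}]"


def build_demo_connection():
    """Create an in-memory database with made-up rows (data tier stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (gene_name TEXT, structure_type TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [
            ("gene-1", "experimental"),
            ("gene-1", "homology model"),
            ("gene-2", "potential model"),
        ],
    )
    return conn
```

A request then flows through the tiers in order, for example `render(summarize(ProteinDao(build_demo_connection()), "gene-1"))`. Keeping database access inside one component is what would let the same logic be reused by another application.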
Part 3 – NESG Structure Gallery
• Structure files submitted by individual groups
• Structure information is entered into SPiNE manually
• Manually run PSVS and MolScript
• Structure files submitted by automated pipeline
• ADIT integrated with SPiNE for uniform format
• PSVS and images automatically generated
• Structure information from PSVS directly into SPiNE
• Archives structure files.
• Downloads
  – Structure Validation Report
  – Structure related files
    • Atomic coordinates
    • NMR constraints
    • NMR peak lists
    • Chemical shifts
    • Structure factor
• Annotation
  – Functional annotation provided by other NESG members
  – UniProt
  – PDB coordinates file
• Reusing Java components from Worm Structure Gallery
– Enhance ZebaView performance to handle increased load and functionalities
– Integrate annotation from other protein and structure databases.
– Make modules available for other Java-based applications within structural genomics.
– Develop a gallery for other organisms: yeast, fruit fly, human
– Continue specifications for the new NESG Structure Gallery
Advisor: Dr. Gaetano Montelione
Thanks to everyone at the Protein NMR lab and NESG!
Aneerban Bhattacharya
John Everett
All the scientists who solved the structures!