Structural Genomics and the Protein Folding Problem
George N. Phillips, Jr.
University of Wisconsin-Madison
February 15, 2006
High-throughput DNA Sequencing
Gene Model
Functional Assignments
Basic Understanding/Applications
(e.g. therapeutics)
Structure Determination & Experimental Analysis
Modeling & Inference
From DNA to biological function
Developing a gene model
Glimmer (Gene Locator and Interpolated Markov ModelER)
GlimmerHMM for eukaryotic genomes (more advanced)
Genome sequencing
Genome assembly
Regulatory elements
Identification of ORFs
All but the simplest genomes are works in progress. It is estimated that 80% of gene models have errors at present! Comparative genomics should help the process, as will sequencing of expressed sequence tags and other genomics projects.
Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. W.H. Majoros, M. Pertea, and S.L. Salzberg. Bioinformatics 21:9 (2005), 1782-88.
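To make "identification of ORFs" concrete, here is a minimal sketch of a naive open-reading-frame scan. It is not how Glimmer or GlimmerHMM work (those use interpolated and hidden Markov models); it only looks for ATG-to-stop stretches on the forward strand. The function name `find_orfs`, the `min_codons` cutoff, and the toy sequence are assumptions made for illustration.

```python
# Naive ORF scan: forward strand only, three reading frames.
# min_codons is an illustrative cutoff, not a value from the talk.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop stretch.

    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                # count coding codons (the stop codon is excluded)
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence
print(find_orfs(toy))
```

A scan like this ignores introns, the reverse strand, and codon statistics, which is one reason real gene models for eukaryotes need the more advanced tools named on the slide.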
Pfam
Many others…
[Figure: clusters of aligned sequence families, e.g.]
HYSIELNASLLERGV…
HLNIEDNPSCNAMGV…
PLNIELNASLNEPGV…
WERIELNASLNER--…
HQRIEL--SLMMRG-…
The “sequence-space” of proteins
Universe of all protein sequences
PSI-BLAST
HMM
PFAM “domains”
Alex Bateman, Lachlan Coin, Richard Durbin, Robert D. Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L. L. Sonnhammer, David J. Studholme, Corin Yeats and Sean R. Eddy, Nucleic Acids Research (2004) Database Issue 32:D138-D141
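Family-based tools such as PSI-BLAST and profile HMMs start from per-column statistics of an alignment. The sketch below shows only that first step, counting residues per column and reading off a consensus, using the five aligned fragments from the slide (truncated to the 15 columns shown). Real profiles add sequence weighting, pseudocounts, and, for HMMs, insert/delete states; the function names here are assumptions for illustration.

```python
from collections import Counter

# The five aligned fragments shown on the slide (first 15 columns only).
ALIGNMENT = [
    "HYSIELNASLLERGV",
    "HLNIEDNPSCNAMGV",
    "PLNIELNASLNEPGV",
    "WERIELNASLNER--",
    "HQRIEL--SLMMRG-",
]


def column_profile(alignment):
    """Count residues in each alignment column, ignoring gap characters."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    return [
        Counter(seq[i] for seq in alignment if seq[i] != "-")
        for i in range(length)
    ]


def consensus(profile):
    """Most frequent residue per column; '-' if the column is all gaps."""
    return "".join(
        col.most_common(1)[0][0] if col else "-" for col in profile
    )


profile = column_profile(ALIGNMENT)
print(consensus(profile))
print(profile[5])  # column 6: mostly L, one D
```

Columns where the counts tie (for example the third column) have no single best residue, which is exactly where a scored profile carries more information than a consensus string.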
Flow of information from DNA to functional understanding
X-ray Laboratory
Crystallography reveals locations of electron ‘clouds’ of the atoms:
And the polypeptide chain can be traced through space
SCOP
CATH
The “fold-space” of proteins
Universe of all protein structures
Murzin et al. http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html
Glimpses of the “fold space” of proteins
Hou, Sims, Zhang, and Kim, PNAS 100:2386 (2003)
Connections between sequence and structure
Universe of sequences Universe of structures
?
At what level of homology can one trust a structural inference?
Redfern, Orengo et al., J. Chromatography B 815:97 (2005)
What is structural genomics?
• Experimental determination of key structures (target selection is a key part of the idea)
• Modeling of family members
• Inferring function (note “infer”)
• Making direct use of the new structures
Protein Sequences and Folds
• ~100,000 families of proteins that cannot be reliably modeled at present (modeling families: <30% identity over large fraction to a known structure)
• ~50% of all domain families can be assigned to a structure under CATH
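The 30% identity figure above can be made concrete with a small calculation. This is a sketch under stated assumptions: identity is computed over columns where both aligned sequences have a residue (other denominators are also in use), the "over large fraction" coverage requirement is not checked, and the names `percent_identity` and `within_modeling_range` are illustrative.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def within_modeling_range(identity, threshold=30.0):
    """True if identity to a known structure meets the slide's 30% cutoff."""
    return identity >= threshold


# Two of the aligned fragments shown earlier in the talk.
target = "HYSIELNASLLERGV"
template = "HLNIEDNPSCNAMGV"
pid = percent_identity(target, template)
print(round(pid, 1), within_modeling_range(pid))
```

Short fragments like these overstate what identity means for a whole protein; the cutoff is meant to apply over a large fraction of the sequence.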
Protein Structure Initiative (PSI) Mission Statement
“To make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences.”
Generation of new structures
Chandonia and Brenner, Science 311:347 (2006)
Center for Eukaryotic Structural Genomics
Exclusively eukaryotic targets
• 60% fold-space targets (emphasis on eukaryote-only families)
• 20% disease relevant
• 20% outreach – targets from the community
Overall goals are to reduce the costs of determining structures of proteins from eukaryotes by refining all steps in the pipeline
Supported by National Institutes of Health
John Markley, PI; George Phillips and Brian Fox, Co-PIs
University of Wisconsin’s Center for Eukaryotic Structural Genomics
(~75 total, 3/4 unique)
How does one clone, express, purify, and solve structures not previously studied?
An industry-style pipeline
Protein from E. coli cells Protein from cell-free
PCR cloning -> DNA
Protein from E. coli cells
Construct design
Protein from cell-free
Screening: Yield, MS
Functional assays
1-5 mg scale
Fluidigm chip crystallization screening (+)
NMR 15N-1H HSQC or 1H screening (+)
Flexi®Vector plasmids
10-100 mg scale: 13C,15N for NMR, Se-Met for X-ray
2-10 mg scale: 13C,15N for NMR, Se-Met for X-ray
Pipeline details: cell-based and cell-free protein production for X-ray and NMR
Note: project involves sequencing, which aids gene modeling!
Sesame—integrated LIMS in use at CESG
Open access to the public—structures, protocols, reagents, progress… http://www.uwstructuralgenomics.org
Zolnai et al., J. Struct. Func. Genomics 4:11 (2003)
At1g18200
Mis-annotated prior to our work, but structure led to discovery of function.
>>Alignment of GalP_UDP_transf vs 1Z84:A|PDBID|CHAIN|SEQUENCE/15-196
*->kkfsplDhvhrrynpLtlvwilVsphrakRPikqsqsLidlkkeLwq
   ++ ++ + +r p t +w+ sp+rakRP
1Z84:A|PDB 15 GDSVENQSPELRKDPVTNRWVIFSPARAKRP---------------- 45

gavetpkvptdplhdp.dcysakLcpg........atratgevNPdyest
+ ++k p+ p p++c+ c g++++ ++ r++ ++ P +
1Z84:A|PDB 46 -TDFKSKSPQNPNPKPsSCP---FCIGreqecapeLFRVP-DHDPNWKLR 90

yvLkspkkftndFyalseDnpyikvsvSNeaIaknplfqlksvrGhelci
+ +n ++als+ +++ +++++ G +++
1Z84:A|PDB 91 VI-------ENLYPALSRN---LETQ------------STQPETG--TSR 116

VI...CF......SKPehDptlpalakeeirevvdaWqlcteelGyegre
+I + F++ +S P h+ l + i+ ++ a + +
1Z84:A|PDB 117 TIvgfGFhdvvieS-PVHSIQLSDIDPVGIGDILIAYKKRINQIA----- 160

nhpayqnvqIFEmNkGaemGcsnpHPYaYFnEHGQvwatsfiP<-*
h + + q+F N Ga G s H H Q a++ +P
1Z84:A|PDB 161 QHDSINYIQVFK-NQGASAGASMSHS------HSQMMALPVVP 196
Pfam B: 13 and 136 matches to #’s 7198 and 11634
http://www.sanger.ac.uk/Software/Pfam/
Blind prediction of structure: CASP and At5g18200
Function space of proteins
KEGG = Kyoto Encyclopedia of Genes and Genomes
The Gene Ontology project (GO)
Metabolism Cellular Processes
Signal Processing
Enzymes
Don’t forget protein-protein interactions exist also!
At2g17340
Related to a human protein associated with Hallervorden-Spatz syndrome, a neurological disorder?
81 protein samples sent to Toronto: 8 solved CESG structures, 73 randomly chosen
Generalized assays for: phosphatase, esterase, phosphodiesterase, protease, amino acid dehydrogenase, alcohol dehydrogenase, organic acid dehydrogenase, amino acid oxidase, alcohol oxidase, organic acid oxidase, beta-lactamase, beta-galactosidase, arylsulfatase, lipase.
Results:
- Solid hits: 3 phosphatases, 5 esterases
- Weaker hits: 9 more esterases, 6 phosphodiesterases
- No hits: all others
A. Yakunin et al., Current Opinion in Chemical Biology 8:42 (2004)
Parallel Enzyme Activity Testing (Collaboration with University of Toronto)
Activity Assay       Substrate       JR5670
Phosphodiesterase    bis-pNPP         0.016
Dehydrogenase        Amino Acids      0.032
Dehydrogenase        Acids            0.016
Dehydrogenase        Alcohols         0.022
Dehydrogenase        Aldehyde        -0.045
Dehydrogenase        Sugars           0.003
Thioesterase         palmitoyl-CoA    0.108
Oxidase              NAD(P)H Ox      -0.115
Protease             Protease Mix     0.118
Phosphatase          pNPP             > 1
Target: At2g17340/JR5670
• Absorbance >0.25 is a tentative signal, >0.5 is a strong signal.
Initial Assay: Wide-spectrum
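The thresholds on this slide can be applied mechanically to the JR5670 readings. The sketch below does that; the readings are transcribed from the slide, the phosphatase value (reported only as "> 1") is stored as a lower bound of 1.0, and the names `READINGS` and `classify` are assumptions for illustration.

```python
# Readings transcribed from the JR5670 column of the slide.
# The phosphatase value is reported only as "> 1"; 1.0 is stored as a lower bound.
READINGS = {
    ("Phosphodiesterase", "bis-pNPP"): 0.016,
    ("Dehydrogenase", "Amino Acids"): 0.032,
    ("Dehydrogenase", "Acids"): 0.016,
    ("Dehydrogenase", "Alcohols"): 0.022,
    ("Dehydrogenase", "Aldehyde"): -0.045,
    ("Dehydrogenase", "Sugars"): 0.003,
    ("Thioesterase", "palmitoyl-CoA"): 0.108,
    ("Oxidase", "NAD(P)H Ox"): -0.115,
    ("Protease", "Protease Mix"): 0.118,
    ("Phosphatase", "pNPP"): 1.0,
}


def classify(absorbance, tentative=0.25, strong=0.5):
    """Apply the slide's cutoffs: >0.5 strong, >0.25 tentative, else none."""
    if absorbance > strong:
        return "strong"
    if absorbance > tentative:
        return "tentative"
    return "none"


calls = {assay: classify(value) for assay, value in READINGS.items()}
hits = [assay for assay, call in calls.items() if call != "none"]
print(hits)
```

With these readings only the phosphatase assay on pNPP clears either threshold, and it does so as a strong signal.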
At2g17340
Enzyme of unknown specificity.
A functional annotation lesson
Functional Annotation by Inference
From raw DNA sequences, one looks for genomic features such as promoters, alternative splicing of mRNAs, retrotransposons, pseudogenes, tandem duplications, synteny, and homology.
It is homology, both from sequence and from structure, that allows functional inferences to be made.
Prosite, Dali, VAST, FFAS03
Some tools integrate knowledge from many sources into one place, acting as meta-servers of clues.
Connections between structure and function
Universe of structures Universe of functions
Convergent evolution
Divergent evolution
At1g18200
Misleading annotation prior to our work, but structure led to discovery of function.
Summary
Structural genomics efforts are gaining momentum and helping to assign new functions to ORFs and to fill in the space of all possible protein folds.
Administration: Madison (Primm, Troestler, Markley, Phillips, Fox)
Cloning/sequencing pipeline: Madison (Wrobel, Fox)
Expression pipeline: Madison (Frederick, Fox, Riters)
E. coli cell growth pipeline: Madison (Sreenath, Burns, Seder, Fox)
Cell-Free System: Madison (Vinarov, Markley, Newman)
Protein purification pipeline: Madison (Vojtik, Phillips, Fox, Ellefson, Jeon)
Mass spectrometry: Madison (Aceti, Sabat, Sussman)
NMR spectroscopy: Madison NMRFAM (Song, Tyler, Cornilescu, Markley); Milwaukee MCW (Peterson, Volkman, Lytle)
Crystallization / crystallography: Madison (Bingman, Phillips, Bitto, Han, Bae, Meske); Argonne (Advanced Photon Source)
Bioinformatics: Madison (Bingman, Sun, Phillips, Wesenberg); Indianapolis (Dunker); Milwaukee MCW (Twigger, de la Cruz)
Computational support: Madison (Bingman, Ramirez, Phillips)
Sesame: Madison (Zolnai, Markley, Lee)
The Center for Eukaryotic Structural Genomics (supported by NIH GM64598 and GM074901)