bioinformatics and proteomics march 28, 2003 nih proteomics workshop bethesda, md anastasia...
Post on 19-Dec-2015
213 views
TRANSCRIPT
Bioinformatics and Proteomics
March 28, 2003
NIH Proteomics Workshop
Bethesda, MD
Anastasia Nikolskaya, Ph.D. Research Assistant Professor Protein Information Resource Department of Biochemistry and Molecular Biology Georgetown University Medical Center
Overview
• Role of Bioinformatics/Computational Biology in Proteomics Research
• Genomics• Functional Annotation of Proteins• Classification of Proteins• Bioinformatics Databases and Analytical
Tools (Dr. Yeh and Dr. Hu)
Sequence function
Functional Genomics/Proteomics Proteomics studies biological systems based on global
knowledge of genomes, transcriptomes, proteomes, metabolomes. Functional genomics studies biological functions of proteins, complexes, pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences.– Genome: All the Genetic Material in the Chromosomes– Transcriptome: Entire Set of Gene Transcripts – Proteome: Entire Set of Proteins– Metabolome: Entire Set of MetabolitesGenome Transcriptome Proteome Metabolome
Proteomics
• Data: Gene Expression Profiling
- Genome-Wide Analyses of Gene Expression
• Data: Structural Genomics
- Determine 3D Structures of All Protein Families
• Data: Genome Projects (Sequencing) - Functional genomics
- Knowing complete genome sequences of a number of organisms is the basis of the proteomics research
Genomic DNA Sequence
5' UTRPromoter Exon1 Intron Exon2 Intron Exon3 3' UTR
A
G
GT
A
G
Gene Recognition
Exon2Exon1 Exon3
C
A
C
A
C
A
A
T
T
A
T
A
Protein Sequence
A
T
G
AA
T
A
A
A
Structure Determination
Protein Structure
Function Analysis
Gene NetworkMetabolic Pathway
Protein FamilyMolecular Evolution
Family Classification
GT
Gene Gene
DNASequence
Gene
Protein
Sequence
Function
Bioinformatics and Genomics/Proteomics
Sequence,Other Data
Pathways and Regulatory Circuits Hypothetical Cell
UnknownGenes
Putative Functional Groups
Most new proteins come from genome sequencing projects
• Mycoplasma genitalium - 484 proteins
• Escherichia coli - 4,288 proteins
• S. cerevisiae (yeast) - 5,932 proteins
• C. elegans (worm) ~ 19,000 proteins
• Homo sapiens ~ 40,000 proteins
... and have unknown functions
Advantages of knowing the complete genome sequence
• All encoded proteins can be predicted and identified
• The missing functions can be identified and analyzed
• Peculiarities and novelties in each organism can be studied
• Predictions can be made and verified
The changing face of protein science
20th century
• Few well-studied proteins
• Mostly globular with enzymatic activity
• Biased proteinset
21st century
• Many “hypotheti-cal” proteins
• Various, often with no enzymatic activity
• Natural protein set
Properties of the natural protein set
• Unexpected diversity of even common enzymes (analogous, paralogous, xenologous, etc. enzymes )
• Conservation of the reaction chemistry, but not the substrate specificity
• Functional diversity in closely relatedproteins
• Abundance of new structures
• Experimentally characterized Best annotated protein database: SwissProt• “Knowns” = Characterized by similarity (closely related to
experimentally characterized)
– Make sure the assignment is plausible• Function can be predicted
– Extract maximum possible information– Avoid errors and overpredictions– Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)– Rank by importance
Objectives of functional analysisfor different groups of proteins
Escherichia coli Methanococcus jannaschii
Yeast Human
E. coli M. jannaschii S. cerevisiae H. sapiens Characterized experimentally 2046 97 3307 10189 Characterized by similarity 1083 1025 1055 10901 Unknown, conserved 285 211 1007 2723 Unknown, no similarity 874 411 966 7965 from Koonin and Galperin, 2003, with modifications
Problems in functional assignments for “knowns”
• Previous low quality annotations - misinterpreted experimental results
(e.g. suppressors, cofactors)
- biologically senseless annotations
Deinococcus: head morphogenesis protein Arabidopsis: separation anxiety protein-like
Helicobacter: brute force protein
Methanococcus: centromere-binding protein
Plasmodium: frameshift
- propagated mistakes of sequence comparison
Histidine kinase
Periplasmic sensor domain Uncharacterized domain
Periplasmic sensor domain His kinase domain
Problems in functional assignments for “knowns”
• Multi-domain organization of proteins
• Non-orthologous gene displacement
• Enzyme evolution (divergence in sequence and function)
• Low sequence complexity (coiled-coil, non-globular regions)
Problems in functional assignments for “knowns”
Enzyme recruitment:Minor mutational changes convert aglycerol kinase into gluconate kinase
GNTK_BACSU ---MTSYMLGIDIGTTSTKAVLFSENGDVVQKESIGYPLYTPDISTAEQNPEEIFQAVIHTTARITKQHPEKR--ISFISFSSAMHS-VIAIDENDKPLTPCITWADNRSEGWAHKIKE- 113GNTK_BACLI ---MTSYMLGIDIGTTSTKAVLFSEKGDVIQKESIGYALYTPDISTAEQNPDEIFQAVIQSTAKIMQQHPDKQ--PSFISFSSAMHS-VIAMDENDKPLTSCITWADNRSEGWAHKIKE- 113GLPK_BACSU ---METYILSLDQGTTSSRAILFNKEGKIVHSAQKEFTQYFPHPGWVEHNANEIWGSVLAVIASVISESGISASQIAGIGITNQRETTVVWDKDTGSPVYNAIVWQSRQTSGICEELRE- 116GLPK_MYCGE MDLKKQYIIALDEGTSSCRSIVFDHNLNQIAIAQNEFNTFFPNSGWVEQDPLEIWSAQLATMQSAKNKAQIKSHEVIAVGITNQRETIVLWNKENGLPVYNAIVWQDQRTAALCQKFNED 120XYLB_BACSU ----MKYVIGIDLGTSAVKTILVNQNGKVCAETSKRYPLIQEKAGYSEQNPEDWVQQTIEALAELVSISNVQAKDIDGISYSGQMHGLVLLDQDR-QVLRNAILWNDTRTTPQCIR--MT 113XYLB_ECOLI ------MYIGIDLGTSGVKVILLNEQGEVVAAQTEKLTVSRPHPLWSEQDPEQWWQATDRAMKALGDQHSLQDVKALGIAGQMHGATLLDAQQ-R--VLRPAILWNDGRCAQECT---LL 108
GNTK_BACSU ELNGHEVYKRTGTPIHPMAPLSKIAWITNERKEIASKAKK-----YIGIKEYIFKQLFN-EYVIDYSLASATGMMNLKGLDWDEEALRIAGITPDHLSKLVPTTEIFQHCSPEIAIQMGI 227GNTK_BACLI EMNGHNVYKRTGTPIHPMAPLSKITWIVNEHPEIAVKAKK-----YIGIKEYIFKKLFD-QYVVDYSLASAMGMMNLKTLAWDEEALAIAGITPDHLSKLVPTTAIFHHCNPELAAMMGI 227GLPK_BACSU KGYNDKFREKTGLLIDPYFSGTKVKWILDNVEGAREKAEKGELLFGTIDTWLIWKMSGGKAHVTDYSNASRTLMFNIYDLKWDDQLLDILGVPKSMLPEVKPSSHVYAETVDY--HFFGK 236GLPK_MYCGE KLIQTKVKQKTGLPINPYFSATKIAWILKNVPLAKKLMEQKKLLFGTIDSWLIWKLTNGKMHVTDVSNASRTLLFDIVKMEWSKELCDLFEVPVSILPKVLSSNAYFGDIETNHWSSNAK 240XYLB_BACSU EKFGDHLLDITKNRVLEGFTLPKMLWVKEHEPELFKKTAV------FLLPKDYVRFRMTGVIHTEYSDAAGTLLLHITRKEWSNDICNQIGISADICPPLVESHDCVGSLLPHVAAKTGL 227XYLB_ECOLI EARVPQSRVITGNLMMPGFTAPKLLWVQRHEPEIFRQIDK------VLLPKDYLRLRMTGEFASDMSDAAGTMWLDVAKRDWSDVMLQACDLSRDQMPALYEGSEITGALLPEVAKAWGM 222
GNTK_BACSU DPETPFVIGASDGVLSNLGVNAIKKGEIAVTIGTSGAIRTIIDKPQTDEKGRIFCYALTDK----HWVIGGPVNNGGIVLRWIRDEFASSEIETATRLGIDPYDVLTKIAQRVRPGSDGL 343GNTK_BACLI DPQTPFVIGASDGVLSNLGVNAIKKGEIAVTIGTSGAIRPIIDKPQTDEKGRIFCYALTEN----HWVIGGPVNNGGIVLRWIRDEFASSEIETAKRLGIDPYDVLTKIAERVRPGADGL 343GLPK_BACSU GKNIPIAGAAGDQQSALFGQACFEEGMGKNTYGTGCFMLMNTGEKAIKSEHGLLTTIAWGIDGKVNYALEGSIFVAGSAIQWLRDGLRMFQDSSLSES-----------YAEKVDSTDGV 341GLPK_MYCGE G-IVPIRAVLGDQQAALFGQLCTEPGMVKNTYGTGCFVLMNIGDKPTLSKHNLLTTVAWQENHPPVYALEGSVFVAGAAIKWLRDALKIIYSEKESDFY----------AELAKENEQNL 350XYLB_BACSU LEKTKVYAGGADNACGAIGAGILSSGKTLCSIGTSGVILSYEEEKERDFKGKVHFFNHGKKDSFYTMGVTLAAGYSLDWF-------------KRTFAPNESFEQLLQGVEAIPIGANGL 334XYLB_ECOLI A-TVPVVAGGGDNAAGAVGVGMVDANQAMLSLGTSGVYFAVSEGFLSKPESAVHSFCHALPQRWHLMSVMLSAASCLDW--------------AAKLTGLSNVPALIAAAQQADESAEPV 327
GNTK_BACSU LFHPYLAGERAPLWNPDVRGSFFGLTMSHKKEHMIRAALEGVIYNLYTVFLALTECMDGPVTRIQATGGFARSEVWRQMMSDIFESEVVVPESYESSCLGACILGLYATGKIDSFDAVSD 463GNTK_BACLI LFHPYLAGERAPLWNPDVPGSFFGLTMSHKKEHMIRAALEGVIYNLYTVFLALTECMDGPVARIQATGGFARSDVWRQMMADIFESEVVVPESYESSCLGACILGLYATGKIDSFDVVSD 463GLPK_BACSU YVVPAFVGLGTPYWDSDVRGSVFGLTRGTTKEHFIRATLESLAYQTKDVLDAMEADSNISLKTLRVDGGAVKNNFLMQFQGDLLNVPVERPEINETTALGAAYLAGIAVGFWKDRSEIAN 461GLPK_MYCGE VFVPAFSGLGAPWWDASARGIILGIEASTKREHIVKASLESIAFQTNDLLNAMASDLGYKITSIKADGGIVKSNYLMQFQADIADVIVSIPKNKETTAVGVCFLAGLACGFWKDIHQLEK 470XYLB_BACSU LYTPYLVGERTPHADSSIRGSLIGMDGAHNRKHFLRAIMEGITFSLHESIELFR-EAGKSVHTVVSIGGGAKNDTWLQMQADIFNTRVIKLENEQGPAMGAAMLAAFGSGWFESLEECAE 453XYLB_ECOLI WFLPYLSGERTPHNNPQAKGVFFGLTHQHGPNELARAVLEGVGYALADGMDVVHACGIKP-QSVTLIGGGARSEYWRQMLADISGQQLDYRTGGDVGPALGAARLAQIAANPEKSLIELL 446
Differences between gluconate and glycerol/xylulose kinasesDifferences between gluconate and glycerol/xylulose kinasesDifferences between gluconate and glycerol/xylulose kinases
Leads to non-orthologous gene displacement
• Experimentally characterized• “Knowns” = Characterized by similarity (closely
related to experimentally characterized)
– Make sure the assignment is plausible• Function can be predicted
– Extract maximum possible information– Avoid errors and overpredictions– Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)– Rank by importance
Objectives of functional analysisfor different groups of proteins
Functional Prediction:Dealing with “hypothetical” proteins
• Computational analysis– Sequence analysis of the new ORFs
• Mutational analysis • Functional analysis
– Expression profiling– Tracking of cellular localization
• Structural analysis– Determination of the 3D structure
Structural Genomics
Protein Structure Initiative: Determine 3D Structures of All Proteins– Family Classification: Organize Protein Sequences into Families, collect
families without known structures– Target Selection: Select Family Representatives as Targets – Structure Determination: X-Ray Crystallography or NMR Spectroscopy – Homology Modeling: Build Models for Other Proteins by Homology– Functional prediction based on structure
Structural Genomics: Structure-Based Functional Assignments
Methanococcus jannaschii MJ0577 (Hypothetical Protein)
Contains bound ATP => ATPase or ATP-Mediated Molecular Switch
Confirmed by biochemical experiments
• Detailed manual analysis of sequence similarities
• Cluster analysis of protein families (family databases)
• Use of sophisticated database searches (PSI-BLAST, HMM)
Improving functional assignments for “unknowns”
(Functional Prediction)
• Those amino acids that are conserved in divergent proteins (archaeal and bacterial, hyperthermophilic and mesophilic) are likely to be important for catalytic activity.
• Comparative analysis allows us to find subtlesequence similarities in proteins that would not have been noticed otherwise
• Prediction of the 3D fold and general function is much easier than prediction of exact biological (or biochemical) function.
Using comparative genomics for protein analysis
• For some reason, the reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
• Sequence database searches that use exotic or highly divergent query sequences often reveal more subtle relationships than those using queries from humans or standard model organisms (E. coli, yeast, worm, fly).
• Sequence analysis complements structural comparisons and can greatly benefit from them
Using comparative genomics for protein analysis
Poorly characterized protein families
• Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
• Helix-turn-helix motif proteins (predicted transcriptional regulators)
• Membrane transporters
• Phylogenetic distribution
– Wide - most likely essential– Narrow - probably clade-specific– Patchy - most intriguing, niche-
specific
• Domain association – Rosetta Stone for multidomain proteins
• Gene neighborhood (operon organization)
Improving functional assignments for “unknowns”
Problems in functional assignments/predictions
• Identification of protein-coding regions
• Delineation of potential function(s) for distant paralogs
• Identification of domains in the absense of close homologs
• Analysis of proteins with low sequence complexity
“Unknown unknowns”
• Phylogenetic distribution
– Wide - most likely essential– Narrow - probably clade-specific– Patchy - most intriguing,
niche-specific
To deal with the ocean of new sequences, need “natural” protein classification
• Protein families are real and reflect evolutionary relationships
• Protein classification systems can be used to– Improve sensitivity of protein identification– Provide new protein sequence annotation, simplifying the
search for non-obvious relationships– Detect and correct genome annotation errors systematically – Drive other annotations (actve site etc)– Provide basis for evolution, genomics and proteomics research
Discovery of New Knowledge by Using Information Embedded within
Families of Homologous Sequences and Their Structures
The ideal system would be:
• Comprehensive, with each sequence classified either as a member of a family or as an “orphan” sequence, a family of one
• Hierarchical, with families united into superfamilies on the basis of distant homology
• Allow for simultaneous use of the whole protein and domain information (domains mapped onto proteins)
• Allow for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
• Expertly curated (family name, function, evidence attribution (experimental vs predicted), background etc). This is the only way to avoid annotation errors and prevent error propagation
Levels of Protein ClassificationClass / Composition of structural
elementsNo relationships
Fold TIM-Barrel Topology of folded backbone
Possible monophyly above and below
Superfamily Aldolase Recognizable sequence similarity (motifs); basic biochemistry
Monophyletic origin
Family Class I Aldolase High sequence similarity (alignments); biochemical properties
Evolution by ancient duplications
COG 2-keto-3-deoxy-6-phosphogluconate aldolase
Orthology for a given set of species; biochemical activity; biological function
Origin traceable to a single gene in LCA
LSE PA3131 and PA3181
Paralogy within a lineage Evolution by recent duplication and loss
Protein Evolution
• Tree of Life & Evolution of Protein Families (Dayhoff, 1978)
• Can build a tree representing evolution of a protein family, based on sequences
• Othologus Gene Family: Organismal and Sequence Trees Match Well
Protein Evolution
• Homolog– Common Ancestors– Common 3D Structure– Common Active Sites
or Binding Domains• Ortholog
– Derived from Speciation
• Paralog– Derived from
Duplication
Orthologs and ParalogsM
yxin
idae
Tele
osto
mi
Verteb
rata
Tetrap
oda
Mam
malia
Amph
ibia
Hb
(H
agfi
sh)
Myo
(C
od)
Hb
A (
Cod
)H
bB
(C
od)
Myo
(Fr
og)
Hb
A (
Frog
)H
bB
(Fr
og)
Myo
(R
at)
Hb
A (
Rat
)H
bB
(R
at)
Myo
(H
agfi
sh)
Craniata
Orthologs and ParalogsH
b (
Hag
fish
)
Myo
(C
od)
Hb
A (
Cod
)H
bB
(C
od)
Myo
(Fr
og)
Hb
A (
Frog
)H
bB
(Fr
og)
Myo
(R
at)
Hb
A (
Rat
)H
bB
(R
at)
Myo
(H
agfi
sh)
LCA of Craniata
COGmyoglobins
COGhemoglobins
Orthologs and ParalogsMyo (Hagfish)
Myo (Cod)
HbA (Cod)
HbB (Cod)
Myo (Frog)
HbA (Frog)
HbB (Frog)
Myo (Rat)
HbA (Rat)
HbB (Rat)
Orthologs(COG Myo)
SubCOG
Hb (Hagfish)
Orthologs(COG Hb)
Out-paralogs(globin family)
SubCOG
In-paralogs (LSE in Vertebrata)
Orthologs and ParalogsH
b (
Hag
fish
)
Myo
(C
od)
Hb
A (
Cod
)H
bB
(C
od)
Myo
(Fr
og)
Hb
A (
Frog
)H
bB
(Fr
og)
Myo
(R
at)
Hb
A (
Rat
)H
bB
(R
at)
Myo
(H
agfi
sh)
COGmyoglobins
COGhemoglobins
COGhemoglobins A
LCA of Vertebrata
Orthologs and ParalogsMyo (Cod)
HbA (Cod)
HbB (Cod)
Myo (Frog)
HbA (Frog)
HbB (Frog)
Myo (Rat)
HbA (Rat)
HbB (Rat)
Orthologs(COG Myo)
Orthologs(COG HbA)
Out-paralogs(globin family)
Orthologs(COG HbB)
Levels of Protein ClassificationClass / Composition of structural
elementsNo relationships
Fold TIM-Barrel Topology of folded backbone
Possible monophyly above and below
Superfamily Aldolase Recognizable sequence similarity (motifs); basic biochemistry
Monophyletic origin
Family Class I Aldolase High sequence similarity (alignments); biochemical properties
Evolution by ancient duplications
COG 2-keto-3-deoxy-6-phosphogluconate aldolase
Orthology for a given set of species; biochemical activity; biological function
Origin traceable to a single gene in LCA
LSE PA3131 and PA3181
Paralogy within a lineage Evolution by recent duplication and loss
Protein Family-Domain-Motif • Domain: Evolutionary/Functional/Structural Unit Domain = structurally compact, independently folding unit that
forms a stable three-dimentional structure and shows a certain level of evolutionary conservation. Usually, corresponds to an evolutionary unit.
A protein can consist of a single domain or multiple domains. Proteins have modular structure.
• Motif: Conserved Functional/Structural Site
Protein Evolution:Sequence Change vs. Domain Shuffling
If enough similarity remains, one can trace the path to the
common origin
What about these?
Recent Domain Shuffling
CM (AroQ type)
CM (AroQ type)
SF001501
SF001499
SF005547
SF001500
SF006786
PDH
CM (AroQ type) PDT ACT
PDH
PDH
ACT
PDT ACT SF001424
Protein classification: proteins and domains
• Option 1: classify domains- take individual domain sequences, consider
them as independently evolving units and build a classification system
- allows to go all the way to the deepest possible level, the last point of traceable homology and common origin (fold)
- domain databases (Pfam, SMART, CDD) allow to map domains onto a query sequence
Protein classification: proteins and domains
• Option 2: classify full-length proteins- In cases of multidomain proteins, does not
allow to go deep along the evolutionary tree- All proteins in a family will often have a
common biological function, which is very convenient for annotation
- Domains will be mapped onto protein families
Practical Classification of Proteins:Setting Realistic Goals
We strive to reconstruct the natural classification of proteins to the fullest possible extent
BUT
Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)
THUS
The further we extend the classification, the finer is the domain structure we need to consider
SO
We need to compromise between the depth of analysis and protein integrity
Clasification: current status
• PIR Superfamilies: Proteins in PIRPSD: 283,289
Proteins classified: 187,871
2/3 of the PIR proteins
• COGs:
~ 70% of each microbial genome
~ 50% of each Eukaryotic genome in 3-clade COG
~ 20% ? of each Eukaryotic genome in LSEs
PIR Superfamily Concept
– Whole (Full-Length) Proteins– Homeomorphic (Common Domain Architecture)– Monophyletic (Common Evolutionary Origin)– Hierarchical structure (Family and Superfamily)– Non-Overlapping placement within each level
PIR Superfamily vs. Other Concepts
– Evolution: Superfamily hierarchy reflects orthology and paralogy
– Structure: PIR superfamily generally corresponds to SCOP family
– Domain: Domains are mapped onto the Superfamily– Motif: Motifs (functional/structural sites) are mapped onto
the Superfamily– Function: a Superfamily may contain divergent functions
PIR Superfamilies• Created by automated clustering by % identity with
coverage-by-length requirements. Creation of new Superfamilies is an ongoing process.
• Automated classification rules are refined by expert curation: - Evolution rates are very different in different “branches” of
the protein universe, so need very different score cutoffs- Verify/add members• Annotation (at level of orthology): Superfamily Name,
Description, Bibliography• In some cases, more than one orthologous group will be
included into a single Superfamily; these Superfamilies will often be very large and diverse
• Depth of hierarchy will be different for single-domain and multidomain proteins
This is work in progress and will become available through PIR (iProClass) and InterPro
CM-Related Superfamilies• Chorismate Mutase (CM), AroQ class
– SF001501 – CM (Prokaryotic type) [PF01817]– SF001499 – tyrA bifunctional enzyme (Prok) [PF01817-PF02153]– SF001500 – pheA bifunctional enzyme (Prok) [PF01817-PF00800]– SF017318 – CM (Eukaryotic type) [Regulatory Dom-PF01817]
• Chorismate Mutase, AroH class – SF005965 – CM [PF01817]
AroH
AroQ Euk
AroQ Prok
-InterPro is an integrated resource for protein families, domains and sites.
- InterPro combines a number of databases that use different methodologies. By uniting the member databases, InterPro capitalizes on their individual strengths, producing a powerful integrated diagnostic tool.
Member databases: PROSITE, PRINTS, Pfam, SMART, ProDom, and TIGRFAMsPIR to be added soonSWISSPROT and TrEMBL matches used as examples
InterPro
InterPro Entry Type defines the entry as a Family, Domain, Repeat, or Post-translational modification site (other sites to be added: binding site, active site). Family = protein family. PIR SFs will generally belong to this type. “Contains” field lists domains within this protein “Found in”: for domain entries, lists families which contain this domain
InterPro Entry
CM PDT ACT
InterPro Entry Type = Family
SF001500Bifunctional chorismate mutase / prephenate dehydratase (P-protein)
PIR Superfamilies are being integrated into InterPro
-complete genomes- reciprocal best hits
- no score cutoffs
Comparative genomics - a branch of computational biology that
uses complete genome sequences
COGs Clusters of Orthologous Groups
Construction of COGs:
Synechocystis
slr0952
YeastYLR377c
E. coli fbp
Bidirectionalbest hit
Bidirectionalbest hit
Bidirectionalbest hit
Triangle -the simplest
COG
In COGs, the dilemma between the depth of analysis and protein integrity is approached by keeping proteins intact whenever possible, and dividing into modules (single- or multidomain) when necessary
Case Study 1: Prediction verified: GGDEF domain
• Proteins containing this domain: Caulobacter crescentus PleD controls swarmer cell - stalk cell transition (Hecht and Newton, 1995). In Rhizobium leguminosarum, Acetobacter xylinum, required for cellulose biosynthesis (regulation)
• Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc)
• In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). Multidomain protein with GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)
• Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)
• Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)
Case study 2:Defining a novel domain family
CheY-likereceiver
Output
Variable - DNA-binding- Enzymatic
What if domain is not described yet?
CheYreceiver
PSY-BLAST with C-terminal portion alone
Prokaryotic Response Regulatiors (RRs)
Two Groups of Unusual RRs [Receiver-X] SF006198, COG3279
1. AlgR-related• Pseudomonas aeruginosa (AlgR): alginate biosynthesis• Klebsiella pneumoniae (MrkE): formation of adhesive fimbriae• Clostridium perfringens (VirR): virulence factors 2. Regulators of autoinduced peptide-controlled regulons• Staphylococcus aureus (AgrA): virulence factors • Lactobacillus plantarum (PlnC, PlnD): bacteriocin production• Streptococcus pneumoniae (ComE): competence
Properties of the CheY- LytTR transcriptional regulators • Regulate secreted and extracellular factors• Often regulate their own expression • Bind to imperfect direct repeat sites in -80 to - 40 area (or in UAS) • Can be phosphorylated by His kinases, but form operons with HisK-type
sensor ATPases• Contain a conserved LytTR-type DNA-binding domain
LytTR - a new DNA-binding domain not similar to HTH, winged helix, or ribbon-helix-helix DNA-binding domains
Domain organization of LytTR proteinsother than CheY-LytTR
Stand-alone LytTR Streptococcus pneumoniae BlpSPseudomonas phage D3 Orf50
40aa - LytTR Lactococcus lactis L121252Listeria monocytogenes Lmo0984Staphylococcus aureus SA2153Streptococcus pneumoniae SP0161
ABC - LytTR Bacillus halodurans BH3894
MHYT - LytTR Oligotropha carboxydovora CoxC, CoxH
3TM - LytTR Xanthomonas campestris RpfDCaulobacter crescentus CC1610Mesorhizobium loti mll0891
3TM - LytTR Caulobacter crescentus CC0295 4TM - LytTR Caulobacter crescentus CC0330,
CC3036 8TM - LytTR Caulobacter crescentus CC0551
PAS - LytTR Burkholderia cepaciaGeobacter sulfurreducens
Predicted LytTR-regulated genes
Expected
Bacillus subtilis natAB (Na+-ATPase)
Oligotropha carboxidovorans comC, comH (CO growth)
Staphylococcus aureus lrgAB (autolysis)
Streptococcus pneumoniae hld (hemolysin delta)
Unexpected
Bacillus subtilis alr, dinB, rapI, veg,
ybaJ, ybbI, yceA, ydbS, ydjL, yebB, yfiV, ykuA
Staphylococcus aureus capO, coa, hsdR, SA0096, SA0257, SA0285, SA0302,SA0357, SA0358, SA0513
Impact of genomics
• Single protein level– Discovery of new enzymes and superfamilies– Prediction of active sites and 3D structures
• Pathway level– Identification of “missing” enzymes– Prediction of alternative enzyme forms– Identification of potential drug targets
• Cellular metabolism level– Multisubunit protein systems– Membrane energy transducers – Cellular signaling systems
Examples for analysis:
1. Retrieve one of the following protein sequences:
PIR: C69086 D64376 GenBank GI:15679635. Using analysis tools available on the web, check if the functional annotation is correct, and provide correct annotation without looking at internal PIR or COG annotations (Run BLAST with CDsearch and SMART to start with). When you are done, look at the PIR curated SF annotation (still at internal site only):
http://pir.georgetown.edu/test-cgi/sf/pirclassif.pl?id=SF006549http://pir.georgetown.edu/cgi-bin/ipcSF1?id=SF006549 (compare with original
automatic SF annotation at the public site), and at COG annotations. What caused the wrong annotations? In BLAST outputs for these sequences, do you see other wrongly annotated proteins?
• Next, analyze the C-terminal domain of these proteins by PSI-BLAST (and alignment analysis) and suggest any speculations as to its function (homework).
Examples for analysis:
2.• Retrieve the following sequence: GI:7019521• Take a look at the associated publication (reference).• Analyze the sequence to see if any additional information
can be obtained (run PSI-BLAST, and (as a homework) construct multiple alignment).
• Take a look at taxonomy report: what does it tell you? • Find experimental paper associated with one of the
sequences found by PSI-BLAST. What annotation is appropriate for this sequence and for the entire family?
Examples for analysis:
3.Predict the function of the following proteins:• GenBank: GI: 27716853• E. coli YjeE protein
Verify and/or correct the following functional annotations. Can you explain why the erroneous annotations were made?
• PIR: H87387
• GenBank: GI:15606003 GI:15807219
• PIR: F70338
Examples for analysis:
4. Homework: an exercise in transitive relationships:Start with>gi|20093648|ref|NP_613495.1| Uncharacterized membrane protein, conserved in Archaea [Methanopyrus kandleri AV19](this is a short membrane protein); run PSI-BLAST, make sure you have filtering, complexity and CD-search off. There are no good hits but a bunch of sub-threshold ones. Collect "suspect" relations, use them as queries and expand the net. You will be able to come up with two proteins:>gi|21227474|ref|NP_633396.1| hypothetical protein [Methanosarcina mazei Goe1] and>gi|14324537|dbj|BAB59464.1| hypothetical protein [Thermoplasma volcanium]When used as a PSI-BLAST query, the first will tie the Methanopyrus protein into a group, while the second will tie this group to the Sec61 subunit of preprotein translocase.Then, of course, you can obtain the same result with CD-search in a single step .