IMG terms and pathways
Krishna Palaniappan, Amy Chen, Frank Korzeniewski, Yuri Grechkin, Ernest Szeto, Victor Markowitz
Natalia Ivanova, Iain Anderson, Thanos Lykidis, Nikos Kyrpides
MGM Workshop, February 1, 2012
Advancing Science with DNA Sequence
Why so many? What’s the difference? Which one should I use?
New: SEED subsystems, Transport DB, Phenotypes
Where it all comes from
• Experimental data: gene A in a genome X
  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes a certain phenotype
  - …
• This information is recorded in a structured way:
  - ontologies (e.g. Gene Ontology)
  - pathway collections (metabolic and protein-protein interaction)
  - other (reasoning rules, like TIGR Genome Properties)
Modeling the data properly – why nobody does that
• Genes are connected to phenotypes via a multi-step process, with many parameters
• We have very vague ideas about the steps/parameters for the majority of genes/phenotypes
• If we design a relational database for gene/phenotype connections, most tables will be empty
[Diagram: entities connecting gene to phenotype: transcript, protein, enzyme, reaction, pathway, compounds, evidence]
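The "most tables will be empty" point can be illustrated with a minimal sketch. The schema below is hypothetical (one table per entity named on this slide, with made-up columns) and is not the schema of any real database; it only shows what happens when a gene is loaded for which little more than the gene and a predicted protein is known.

```python
import sqlite3

# Hypothetical, simplified schema: one table per entity named on the slide.
TABLES = ["gene", "transcript", "protein", "enzyme", "reaction",
          "pathway", "compound", "evidence", "phenotype"]

def build_db():
    """Create an in-memory database with one empty table per entity."""
    conn = sqlite3.connect(":memory:")
    for name in TABLES:
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, label TEXT)")
    return conn

def empty_tables(conn):
    """Return the names of tables that hold no rows."""
    return [t for t in TABLES
            if conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] == 0]

conn = build_db()
# Assumption: for a typical gene only the gene and a predicted protein are known.
conn.execute("INSERT INTO gene (label) VALUES ('gene A')")
conn.execute("INSERT INTO protein (label) VALUES ('hypothetical protein')")
print(empty_tables(conn))
```

Every table between the gene and the phenotype stays empty until someone supplies the experimental detail, which is the practical obstacle the slide describes.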
What it looks like in real life – KEGG vs MetaCyc
KEGG
http://www.genome.jp/kegg/
MetaCyc
http://metacyc.org/
Ammonia oxidation pathway in KEGG
The same pathway/reaction in MetaCyc
Even the MetaCyc record is still incomplete
• Which subunit has which cofactor?
• Type of Cu2+ cluster, type of Fe2+ cluster?
• One of the subunits is a cytochrome c, yet the enzyme is cytosolic?
• Does it require any help with maturation of metal clusters?
• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the pathway, hydroxylamine reductase. Does it have the entire pathway?
Even bigger mess: bioinformatics inference
• Experimental data: gene A in a genome X
  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes a certain phenotype
  - …
What about gene B in genome Y, which is similar to gene A?
“True or false?” game
• If gene B was manually annotated, the annotation must be correct
• If gene B was manually annotated, and it has a bi-directional best BLAST hit to gene A with an e-value of 1.0e-5, the annotation must be correct
• If gene B was manually annotated, it has >50% identity to gene A, and it is found in the same conserved chromosomal neighborhood as gene A, the annotation must be correct
• …
Poorly done inference - MetaCyc
• Software called PathoLogic
• Parses annotated files and tries to match EC numbers, full product names and partial product names to reactions in the MetaCyc database
• Automatically infers pathway presence based on matches to MetaCyc reactions
• Tries to find candidate genes for “missing” enzymes by running BLAST with the genes assigned to the reaction in other organisms
• Generates a lot of false positives: it inferred the presence of the ammonia oxidation pathway in Staphylococcus from a single gene annotated as ammonia monooxygenase in a GenBank file
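The false-positive mechanism described above can be sketched with a toy example. This is not PathoLogic's actual algorithm: the pathway table, the substring matching and the `min_fraction` parameter are all assumptions made for illustration, and the two enzyme names are simply the ones mentioned in these slides.

```python
# Toy pathway table: pathway name -> enzyme names (illustrative only).
PATHWAYS = {
    "ammonia oxidation": {"ammonia monooxygenase", "hydroxylamine reductase"},
}

def matched_reactions(annotations, reactions):
    """Reactions whose name appears in at least one product annotation."""
    return {r for r in reactions
            if any(r in a.lower() for a in annotations)}

def infer_pathways(annotations, min_fraction):
    """Call a pathway present if at least one reaction matches and the
    matched fraction of its reactions reaches min_fraction."""
    present = []
    for name, reactions in PATHWAYS.items():
        hits = matched_reactions(annotations, reactions)
        if hits and len(hits) / len(reactions) >= min_fraction:
            present.append(name)
    return present

# A genome with a single (possibly wrong) matching product name.
genome = ["Ammonia monooxygenase", "DNA gyrase subunit A"]
naive = infer_pathways(genome, min_fraction=0.0)   # any single match suffices
strict = infer_pathways(genome, min_fraction=1.0)  # every reaction must match
print(naive, strict)
```

With the naive setting one annotated gene is enough to call the whole pathway; requiring full coverage removes this call, at the cost of missing pathways whose genes are merely unannotated.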
Better inference: KEGG
• Annotation is inferred from orthology, defined as bi-directional best BLAST hits and then manually refined using “Ortholog tables” and chromosomal clusters
• Poorly documented, but seems to generate far fewer false positives than PathoLogic
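The bi-directional best hit criterion mentioned above can be sketched as follows. This covers only the automatic step and is not KEGG's actual procedure, which includes manual refinement; the hit tuples, gene names and scores are made up for illustration.

```python
def best_hits(hits):
    """Map each query to its best-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(x_vs_y, y_vs_x):
    """Pairs (a, b) where b is a's best hit in Y and a is b's best hit in X."""
    fwd = best_hits(x_vs_y)
    rev = best_hits(y_vs_x)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)

# Hypothetical BLAST results: genome X against genome Y and the reverse.
x_vs_y = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B1", 120.0)]
y_vs_x = [("B1", "A1", 248.0), ("B2", "A1", 88.0)]
pairs = bidirectional_best_hits(x_vs_y, y_vs_x)
print(pairs)
```

Here A2's best hit is B1, but B1's best hit is A1, so only the A1/B1 pair qualifies. Note that the criterion says nothing about how good the hit is, which is why an e-value of 1.0e-5 alone does not make an annotation correct.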
Even the best structured inference is far from perfect
• Problem: neither BLAST nor Smith-Waterman knows which amino acids are more important for protein function than others
• Using a consensus sequence (either as a PSSM or an HMM) with family-specific bit score cutoffs would be much better
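A minimal sketch of why a position-specific model helps, under strong simplifying assumptions: a tiny ungapped alignment, a uniform background distribution, a fixed pseudocount, and a made-up cutoff. Real PSSM and HMM tools handle gaps, sequence weighting and calibrated cutoffs; none of that is attempted here.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores (bits) per column from equal-length, ungapped sequences."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
                     for aa in ALPHABET})
    return pssm

def score(pssm, seq):
    """Sum of per-position log-odds scores for a sequence of matching length."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

# Made-up family: positions 1, 2 and 4 are conserved, position 3 is variable.
family = ["HCGH", "HCAH", "HCSH", "HCTH"]
pssm = build_pssm(family)
CUTOFF = 5.0  # hypothetical family-specific bit score cutoff

for candidate in ["HCWH", "ACGH"]:
    s = score(pssm, candidate)
    print(candidate, round(s, 2), "member" if s >= CUTOFF else "rejected")
```

Both candidates differ from the family at one position, but the change at the variable position costs much less than the change at the conserved histidine. A pairwise aligner using a general substitution matrix has no access to this family-specific information.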
Pathway collections: KEGG, MetaCyc and others
Which particular set of interactions is a pathway? (i.e., how do we define pathway boundaries within the network?)
Ideal solution: pathway NR
• All pathway collections share a common skeleton of reactions, which consist of reactants (compounds)
• All reactions share the common base of proteins annotated as catalysts
• Can we merge the information from different collections, using the best features of all of them?
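One way to picture a non-redundant merge is to key each reaction on its compounds and pool the catalysts reported by each collection. The sketch below assumes compound names are already normalized across collections (in practice this is a hard problem of its own); the record layout, collection names and records are made up for illustration.

```python
def reaction_key(substrates, products):
    """Identify a reaction by its compounds, ignoring order within each side."""
    return (frozenset(substrates), frozenset(products))

def merge_collections(collections):
    """Merge reactions from several collections into one non-redundant set.

    collections: dict of collection name -> list of reaction records, each a
    dict with 'substrates', 'products' and 'catalysts' lists.
    """
    merged = {}
    for source, reactions in collections.items():
        for rxn in reactions:
            key = reaction_key(rxn["substrates"], rxn["products"])
            entry = merged.setdefault(key, {"sources": set(), "catalysts": set()})
            entry["sources"].add(source)
            entry["catalysts"].update(rxn["catalysts"])
    return merged

# Made-up records from two hypothetical collections.
collections = {
    "collection_1": [
        {"substrates": ["compound A", "compound B"], "products": ["compound C"],
         "catalysts": ["protein 1"]},
    ],
    "collection_2": [
        {"substrates": ["compound B", "compound A"], "products": ["compound C"],
         "catalysts": ["protein 2"]},
        {"substrates": ["compound C"], "products": ["compound D"],
         "catalysts": ["protein 3"]},
    ],
}
merged = merge_collections(collections)
shared = merged[reaction_key(["compound A", "compound B"], ["compound C"])]
print(len(merged), sorted(shared["sources"]), sorted(shared["catalysts"]))
```

The reaction present in both collections is stored once and keeps the catalysts from both, while the reaction unique to one collection is carried over unchanged.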
IMG terms: 3 types
IMG terms of 3 types:
1. gene product
2. multi-subunit protein complex
3. modified protein
[Diagram: a pathway with nodes A, B, C and D and steps R1, R2 (spontaneous), R3 (spontaneous) and R4 (chaperone).
Proteins shown, each labeled "Enzyme (EC x.x.x.x)": unqualified; monomeric, needs cofactor C; monomeric precursor; heterotrimeric, needs cofactor D; heterotrimeric, subunit A precursor; heterotrimeric, subunit A; heterotrimeric, subunit B; heterotrimeric, subunit C.
Callouts: IMG term of the type “Gene product” (twice); IMG term of the type “Protein complex”; IMG term of the type “Modified protein”; Not an IMG term!]
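The three term types can be pictured as a small data structure. This is an illustrative sketch, not IMG's actual schema: the class and field names are made up, and the assignment of types to the individual proteins (subunit A as a modified protein derived from its precursor, subunits B and C as gene products, the heterotrimer as a protein complex) is an assumption about how the diagram's callouts map to its boxes.

```python
from dataclasses import dataclass, field
from enum import Enum

class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "multi-subunit protein complex"
    MODIFIED_PROTEIN = "modified protein"

@dataclass
class Term:
    name: str
    term_type: TermType
    # Child terms: subunits of a complex, or the precursor of a modified protein.
    components: list = field(default_factory=list)

def gene_products(term):
    """Recursively collect the gene-product terms underlying a term."""
    if term.term_type is TermType.GENE_PRODUCT:
        return [term.name]
    names = []
    for child in term.components:
        names.extend(gene_products(child))
    return names

# Assumed mapping of the diagram's boxes to term types (see lead-in).
precursor_a = Term("subunit A precursor", TermType.GENE_PRODUCT)
subunit_a = Term("subunit A", TermType.MODIFIED_PROTEIN, [precursor_a])
subunit_b = Term("subunit B", TermType.GENE_PRODUCT)
subunit_c = Term("subunit C", TermType.GENE_PRODUCT)
enzyme = Term("heterotrimeric enzyme", TermType.PROTEIN_COMPLEX,
              [subunit_a, subunit_b, subunit_c])

print(gene_products(enzyme))
```

Resolving a complex down to gene products makes explicit which genes must all be present before the complex, and hence the reaction it catalyzes, can be asserted for a genome.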
Protein-protein interaction pathways: same model
You’ve been warned!