IMG terms and pathways


Page 1: IMG terms and pathways

Advancing Science with DNA Sequence

IMG terms and pathways

Krishna Palaniappan

Amy Chen

Frank Korzeniewski

Yuri Grechkin

Ernest Szeto

Victor Markowitz

Natalia Ivanova

Iain Anderson

Thanos Lykidis

Nikos Kyrpides

MGM Workshop

February 1, 2012

Page 2: IMG terms and pathways

Why so many?

What’s the difference?

Which one should I use?

New: SEED subsystems, Transport DB, Phenotypes

Page 3: IMG terms and pathways

Where it all comes from

• Experimental data: gene A in a genome X

  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes certain phenotype
  - …

This information is recorded in a structured way:

  - ontologies (e.g. Gene Ontology)
  - pathway collections (metabolic and protein-protein interaction)
  - other (reasoning rules, like TIGR Genome Properties)

Page 4: IMG terms and pathways

Modeling the data properly – why nobody does that

• Genes are connected to phenotypes via a multi-step process, with many parameters

• We have very vague ideas about the steps/parameters for the majority of genes/phenotypes

• If we design a relational database for gene/phenotype connections, most tables will be empty

[Diagram: entities linking gene to phenotype: transcript, protein, enzyme, reaction, pathway, compounds, evidence]
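The "most tables will be empty" point can be sketched with a toy schema. The table names follow the entities in the diagram; the columns and the sample records are assumptions made for illustration and do not describe any real database.

```python
import sqlite3

# Hypothetical tables, one per entity named on the slide.
ENTITIES = ["gene", "transcript", "protein", "enzyme",
            "reaction", "pathway", "compound", "phenotype", "evidence"]

def build_db():
    """Create an in-memory database with one minimal table per entity."""
    con = sqlite3.connect(":memory:")
    for name in ENTITIES:
        con.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, label TEXT)")
    return con

def empty_tables(con):
    """Return the names of tables that hold no rows."""
    return [t for t in ENTITIES
            if con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] == 0]

con = build_db()
# A knock-out experiment typically tells us only this much (made-up record):
con.execute("INSERT INTO gene (label) VALUES ('gene A')")
con.execute("INSERT INTO phenotype (label) VALUES ('no growth on substrate S')")
con.execute("INSERT INTO evidence (label) VALUES ('knock-out of gene A')")

print(empty_tables(con))
```

Every intermediate step between gene and phenotype stays unpopulated for this record, which is the situation the bullets describe.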

Page 5: IMG terms and pathways

What it looks like in real life – KEGG vs MetaCyc

KEGG

http://www.genome.jp/kegg/

MetaCyc

http://metacyc.org/

Page 6: IMG terms and pathways

Ammonia oxidation pathway in KEGG

Page 7: IMG terms and pathways

The same pathway/reaction in MetaCyc

Page 8: IMG terms and pathways

Even the MetaCyc record is still incomplete

• Which subunit has which cofactor?

• Type of Cu2+ cluster, type of Fe2+ cluster?

• One of the subunits is a cytochrome c, yet the enzyme is cytosolic?

• Does it require any help with maturation of metal clusters?

• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the pathway, hydroxylamine reductase. Does it have the entire pathway?

Page 9: IMG terms and pathways

Even bigger mess: bioinformatics inference

• Experimental data: gene A in a genome X

  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes certain phenotype
  - …

What about gene B in genome Y, which is similar to gene A?

Page 10: IMG terms and pathways

“True or false?” game

• If gene B was manually annotated, the annotation must be correct

• If gene B was manually annotated, and it has a bi-directional best BLAST hit to gene A with an e-value of 1.0e-5, the annotation must be correct

• If gene B was manually annotated, it has >50% identity to gene A, and it is found in the same conserved chromosomal neighborhood as gene A, the annotation must be correct

• …

Page 11: IMG terms and pathways

Poorly done inference - MetaCyc

• Software called PathoLogic

• Parses annotated files, tries to find matches between EC numbers/full product names/partial product names and reactions in the MetaCyc database

• Automatically infers pathway presence based on matches to MetaCyc reactions

• Tries to find candidate genes for “missing” enzymes by doing BLAST of the genes assigned to this reaction in other organisms

• Generates a lot of false positives: it inferred the presence of the ammonia oxidation pathway in Staphylococcus based on the presence of 1 gene annotated as ammonia monooxygenase in a GenBank file
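How name matching turns a single annotation into a pathway call can be sketched as follows. This is not PathoLogic's algorithm; the pathway definition, the genome, and the `min_fraction` parameter are hypothetical and exist only to show the failure mode.

```python
# Toy pathway definition (illustrative, not copied from MetaCyc):
# pathway name -> set of enzyme/reaction names.
PATHWAYS = {
    "ammonia oxidation": {"ammonia monooxygenase",
                          "hydroxylamine oxidoreductase"},
}

def matched_reactions(annotations, reactions):
    """Reactions whose name appears inside at least one product annotation."""
    return {r for r in reactions
            if any(r in a.lower() for a in annotations)}

def infer_pathways(annotations, min_fraction):
    """Call a pathway present if at least one reaction is matched and the
    matched fraction of its reactions reaches min_fraction."""
    present = []
    for name, reactions in PATHWAYS.items():
        hits = matched_reactions(annotations, reactions)
        if hits and len(hits) / len(reactions) >= min_fraction:
            present.append(name)
    return present

# Hypothetical genome with one suspicious product name.
genome = ["DNA gyrase subunit A", "putative ammonia monooxygenase"]

permissive = infer_pathways(genome, min_fraction=0.0)  # any match counts
strict = infer_pathways(genome, min_fraction=1.0)      # all reactions needed
print(permissive, strict)
```

With the permissive setting the single mislabeled gene is enough to report the pathway; requiring every reaction removes the call, at the cost of missing pathways with unannotated genes.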

Page 12: IMG terms and pathways

Better inference: KEGG

• Annotation is inferred based on orthology, defined as bi-directional best BLAST hits, manually refined based on “Ortholog tables” and chromosomal clusters

• Poorly documented, but seems to generate far fewer false positives than PathoLogic
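The bi-directional best hit criterion itself is easy to state in code. The sketch below assumes hit lists are already parsed into (query, subject, bit score) tuples; the gene names and scores are made up, and the manual refinement step is not modeled.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(x_vs_y, y_vs_x):
    """Pairs (a, b) where b is a's best hit in Y and a is b's best hit in X."""
    fwd = best_hits(x_vs_y)
    rev = best_hits(y_vs_x)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)

# Made-up hits: genome X searched against genome Y, and the reverse.
x_vs_y = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B1", 120.0)]
y_vs_x = [("B1", "A1", 248.0), ("B2", "A1", 88.0)]

pairs = bidirectional_best_hits(x_vs_y, y_vs_x)
print(pairs)
```

Here A2 also hits B1, but B1's best hit is A1, so only the A1/B1 pair is kept.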

Page 13: IMG terms and pathways

Even the best structured inference is far from perfect

• Problem: neither BLAST nor Smith-Waterman knows which amino acids are more important for protein function than others

• Using a consensus sequence (either as PSSM or HMM) with family-specific bit score cutoffs would be much better
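A minimal sketch of the PSSM idea, assuming an ungapped toy alignment, a uniform background of 1/20 per amino acid, and a pseudocount of 1. The family sequences and the cutoff value are hypothetical; real profiles use weighted sequences and calibrated cutoffs.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM (in bits) from equal-length, ungapped sequences,
    scored against a uniform background frequency of 1/20."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    """Sum of per-position log-odds scores for an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

# Toy family: positions 1, 2 and 5 are conserved, positions 3 and 4 vary.
family = ["HCDEH", "HCEEH", "HCDDH", "HCQEH"]
pssm = build_pssm(family)
CUTOFF_BITS = 5.0  # hypothetical family-specific cutoff

# Both candidates differ from HCDEH at exactly two positions.
variable_changed = "HCAAH"   # changes at the variable positions
conserved_changed = "ACDEA"  # changes at two conserved positions

for seq in (variable_changed, conserved_changed):
    print(seq, round(score(pssm, seq), 2), score(pssm, seq) >= CUTOFF_BITS)
```

Percent identity to HCDEH is the same for both candidates, yet only the one that keeps the conserved positions clears the cutoff, which is the distinction a plain pairwise alignment score does not make.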

Page 14: IMG terms and pathways

Pathway collections: KEGG, MetaCyc and others

Which particular set of interactions is a pathway? (i.e., how do we define pathway boundaries within the network?)

Page 15: IMG terms and pathways

Ideal solution: pathway NR

• All pathway collections share a common skeleton of reactions, which consist of reactants (compounds)

• All reactions share the common base of proteins annotated as catalysts

• Can we merge the information from different collections, using the best features of all of them?
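One way to picture the merge is to key each reaction by its compounds, so that the same reaction from two collections lands in one record. The sketch below assumes the collections already share compound identifiers, which in practice needs a mapping step of its own; all collection names, IDs, and proteins are made up.

```python
from collections import defaultdict

def signature(substrates, products):
    """Direction-independent key built from the reaction's compounds."""
    return frozenset([frozenset(substrates), frozenset(products)])

def merge_collections(collections):
    """Merge reactions that share the same compounds.

    collections: {source: [(reaction_id, substrates, products, proteins)]}
    Returns {signature: {"ids": {source: reaction_id}, "proteins": set}}.
    """
    merged = defaultdict(lambda: {"ids": {}, "proteins": set()})
    for source, reactions in collections.items():
        for rid, subs, prods, proteins in reactions:
            entry = merged[signature(subs, prods)]
            entry["ids"][source] = rid
            entry["proteins"].update(proteins)
    return dict(merged)

# Hypothetical collections describing the same reaction in opposite directions.
collections = {
    "collection_1": [("R0001", {"cpd_A"}, {"cpd_B"}, {"protein_1"})],
    "collection_2": [("RXN-9", {"cpd_B"}, {"cpd_A"}, {"protein_2"})],
}

merged = merge_collections(collections)
record = merged[signature({"cpd_A"}, {"cpd_B"})]
print(record["ids"], sorted(record["proteins"]))
```

The merged record keeps both source identifiers and the union of annotated catalysts, so features from each collection remain reachable from one non-redundant entry.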

Page 16: IMG terms and pathways

IMG terms: 3 types

IMG terms of 3 types:

1. gene product
2. multi-subunit protein complex
3. modified protein

[Figure: example built around Enzyme (EC x.x.x.x). Compounds and cofactors: A, B, C, D. Reactions: R1; R2, spontaneous; R3, spontaneous; R4, chaperone. Boxes: Enzyme (EC x.x.x.x); monomeric, needs cofactor C; heterotrimeric, needs cofactor D; monomeric precursor; heterotrimeric, subunit A precursor; heterotrimeric, subunit A; heterotrimeric, subunit B; heterotrimeric, subunit C. Callouts: IMG term of the type “Modified protein”; IMG term of the type “Protein complex”; IMG term of the type “Gene product” (twice); Not an IMG term!]
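The three term types can be sketched as a small data structure. The class names, the `parts` field, and the assignment of types to individual boxes are assumptions made for illustration; the transcript does not preserve which callout points at which box, and this is not IMG's actual data model.

```python
from dataclasses import dataclass
from enum import Enum

class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "protein complex"
    MODIFIED_PROTEIN = "modified protein"

@dataclass(frozen=True)
class ImgTerm:
    name: str
    term_type: TermType
    # For a complex: its subunit terms. For a modified protein: its precursor.
    parts: tuple = ()

def gene_products(term):
    """Recursively collect the gene-product terms underlying a term."""
    if term.term_type is TermType.GENE_PRODUCT:
        return [term.name]
    return [name for part in term.parts for name in gene_products(part)]

# Assumed typing of the heterotrimeric example from the figure.
precursor_a = ImgTerm("subunit A precursor", TermType.GENE_PRODUCT)
subunit_a = ImgTerm("subunit A", TermType.MODIFIED_PROTEIN, (precursor_a,))
subunit_b = ImgTerm("subunit B", TermType.GENE_PRODUCT)
subunit_c = ImgTerm("subunit C", TermType.GENE_PRODUCT)
heterotrimer = ImgTerm("heterotrimeric enzyme", TermType.PROTEIN_COMPLEX,
                       (subunit_a, subunit_b, subunit_c))

print(gene_products(heterotrimer))
```

Walking from the complex down to gene products shows which genes would have to be present before the complex itself can be asserted.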

Page 17: IMG terms and pathways

Protein-protein interaction pathways: same model

Page 18: IMG terms and pathways

You’ve been warned!