
Page 1: Subsystem Approach to Genome Annotation National Microbial Pathogen Data Resource  Claudia Reich NCSA, University of Illinois, Urbana

Subsystem Approach to Genome Annotation

National Microbial Pathogen Data Resource
www.nmpdr.org

Claudia Reich
NCSA, University of Illinois, Urbana

Page 2:

Complete Microbial Genomes

• 464 complete microbial genomes in NCBI as of 3-1-07
• 691 microbial genomes in progress as of 3-1-07

Page 3:

Making Sense of Genome Data

• Locate Genes: identify ORFs automatically
  GeneMark
  NCBI’s ORF Finder
  Glimmer
  Critica
• Assign Function: by sequence similarity to experimentally characterized proteins
  BLAST family of sequence comparison tools
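The gene-location step can be illustrated with a toy ORF scan. This is only a minimal sketch: GeneMark, Glimmer, and Critica rely on statistical models rather than a plain codon scan, and the function name `find_orfs`, the ATG-only start codon, and the `min_codons` threshold are assumptions made for illustration.

```python
# Toy ORF scan over both strands and all three reading frames.
# Assumptions (not from the slides): ATG is the only start codon,
# and ORFs that run off the end without a stop codon are dropped.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based and end-exclusive on the scanned strand.
    min_codons counts the start and stop codons as well.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))  # [('+', 2, 2, 14)]
```

A real gene finder would also handle alternative start codons, overlapping candidates, and coding-potential scoring, none of which this sketch attempts.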

Page 4:

Problems with Assignments by Similarity

• When ORF is a member of a protein family
• Paralogous genes
• ORFs encoding similar proteins acting on different substrates
• Assignments can be transitive, and many times removed from experimental data

Page 5:

Other Factors Can Aid in Function Assignments

• Molecular phylogeny
• Paralogous and orthologous families
• Conserved gene neighborhood
• Metabolic context
• Bidirectional best hit matches across multiple genomes
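The bidirectional best hit criterion in the last bullet can be sketched as follows. This is an illustrative sketch, not the SEED's implementation: the gene identifiers and scores are hypothetical, and the input is assumed to be a dictionary of pairwise similarity scores (for example, bit scores parsed from a similarity search) for each search direction.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: {(query_gene, subject_gene): similarity_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genome A searched against genome B, and back.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 80,
          ("a2", "b2"): 150, ("a3", "b2"): 90}
b_vs_a = {("b1", "a1"): 195, ("b2", "a2"): 140, ("b2", "a3"): 85}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
# [('a1', 'b1'), ('a2', 'b2')]
```

Here `a3` hits `b2` best, but `b2` prefers `a2`, so `a3` is not paired. This is the kind of paralog confusion the bidirectional requirement is meant to reduce.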

Page 6:

Incorporating Information Other Than Similarity

• KEGG: manually curated pathway and metabolic maps
• GO: vocabularies that describe ORFs as associated with
  biological processes
  cellular components
  molecular function
• MetaCyc: experimentally elucidated metabolic pathways

Page 7:

What is Needed:

• A system that:
  integrates all the above concepts
  organizes genomic data in structured idioms
  allows high-throughput annotation of newly sequenced genomes
  resolves discrepancies in different annotation tools
  informs experimental research

Page 8:

Enter the SEED*

• Database and annotation environment
• Underlies, and is accessible through, NMPDR (www.nmpdr.org)
• Expert annotation via subsystem building
• Provides the most accurate genome annotations available

*Argonne National Lab, University of Chicago, UIUC, FIG

Page 9:

What is a Subsystem?

• Any organizing biological principle:
  metabolic pathway
    • amino acid biosynthesis, nitrogen fixation, glycolysis
  complex structure
    • ribosome, flagellum
  set of defining features
    • virulome, pathogenicity islands
  functional concept
    • bacterial sigma factors, DNA binding proteins

Page 10:

Subsystems are:

• Sets of functional roles, which are functions, or abstractions of functions (such as an EC number), that together implement a specific biological process or concept
• Created manually by expert curators
• Experts annotate single subsystems over the complete collection of genomes, thus contributing and sharing their expertise with the scientific community

Page 11:

How Subsystems are Built

• Create a subsystem for the biological concept, and define the functional roles

• In one (or a few) key organisms that include the subsystem, find the genes and assign meaningful functional names

• Project the annotations to orthologous genes

• Expand to more genomes, creating a Populated Subsystem
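The projection step can be sketched in a few lines. This is a hedged illustration under stated assumptions: the function name `project_annotations`, the gene identifiers, and the precomputed ortholog map are all hypothetical, and how orthologs are identified is outside the sketch.

```python
def project_annotations(reference_roles, orthologs):
    """Project role assignments from a reference genome onto other genomes.

    reference_roles: {reference_gene: functional_role}
    orthologs:       {genome: {reference_gene: orthologous_gene}}
    Returns          {genome: {functional_role: [genes]}}
    """
    populated = {}
    for genome, mapping in orthologs.items():
        row = {}
        for ref_gene, role in reference_roles.items():
            gene = mapping.get(ref_gene)
            if gene is not None:  # leave the role empty when no ortholog exists
                row.setdefault(role, []).append(gene)
        populated[genome] = row
    return populated


# Hypothetical curated assignments in one key organism.
reference_roles = {
    "ref_rpoA": "RNA polymerase alpha subunit",
    "ref_rpoB": "RNA polymerase beta subunit",
}
# Hypothetical ortholog map for two further genomes.
orthologs = {
    "Genome X": {"ref_rpoA": "X_0021", "ref_rpoB": "X_0388"},
    "Genome Y": {"ref_rpoA": "Y_0140"},
}

for genome, row in project_annotations(reference_roles, orthologs).items():
    print(genome, row)
```

Roles left empty by the projection (the beta subunit in "Genome Y" here) are the cases a curator would then examine by hand.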

Page 12:

Populated Subsystems

• Are Spreadsheets where:
  Columns: functional roles
  Rows: specific genomes
  Cells: genes in the organism that implement the functional role
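One way to picture the spreadsheet layout is as a nested mapping from genome to role to genes. The sketch below is an assumption about representation, not the SEED's storage format, and the genome names and gene identifiers are hypothetical.

```python
# Columns: functional roles (hypothetical selection).
ROLES = ["RNA polymerase alpha subunit", "RNA polymerase beta subunit"]

# Rows: genomes. Cells: genes that implement the role in that genome.
SPREADSHEET = {
    "Genome A": {
        "RNA polymerase alpha subunit": ["A_0101"],
        "RNA polymerase beta subunit": ["A_0457"],
    },
    "Genome B": {
        "RNA polymerase alpha subunit": ["B_0033", "B_0912"],
    },
}


def to_rows(roles, spreadsheet):
    """Flatten the nested mapping into a header row plus one row per genome."""
    rows = [["Genome"] + list(roles)]
    for genome in sorted(spreadsheet):
        cells = spreadsheet[genome]
        # An empty string marks a role with no gene assigned in this genome.
        rows.append([genome] + [", ".join(cells.get(role, [])) for role in roles])
    return rows


for row in to_rows(ROLES, SPREADSHEET):
    print("\t".join(row))
```

Keeping each cell as a list allows a cell to hold zero, one, or several genes, which is what makes missing and duplicate assignments visible.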

Page 13:

How to Access Subsystems

• From Search menu
• From Organism pages
• From search results when found protein is included in a subsystem
• From Annotation Overview pages

Page 14:

Subsystem Pages in NMPDR

• Table of Functional Roles
• Subsystem diagram (if appropriate)
• Populated subsystem spreadsheet
• Customizable spreadsheet viewing options
• Functional variants and subsets of roles
• Curator’s notes

Page 15:

Benefits of Subsystems

• More accurate annotations
• Annotation of protein families
• Analysis of sets of functionally related proteins
• Less error-prone automatic projections to novel genomes

Page 16:

Subsystems Reveal Interesting

• Pathway variants:
  Are they clustered by phylogeny?
    • Delta subunit of RNA polymerase only in Bacillales
  Are they clustered by functional niche?
  Horizontal gene transfer?
• Fused genes:
  β and β′ subunits of RNA polymerase fused in Helicobacter
• Fissioned genes:
  β′ subunit of RNA polymerase is fissioned in Cyanobacteria

Page 17:

Subsystems Reveal Interesting

• Duplicate assignments
  More than one gene for one functional role?
    • Alpha subunit of RNA polymerase in Magnetococcus and Francisella
  Same sequenced region in more than one contig in partially assembled genomes?
  Frameshifts or other sequencing errors?
  Annotation errors?
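Finding duplicate assignments amounts to scanning a populated subsystem for cells that hold more than one gene. The following is a minimal sketch with hypothetical genome names and gene identifiers. It only flags candidates; deciding between true paralogs, assembly artifacts, frameshifts, and annotation errors still needs a curator.

```python
def duplicate_assignments(spreadsheet):
    """Return (genome, role, genes) for every cell holding more than one gene.

    spreadsheet: {genome: {functional_role: [genes]}}
    """
    flagged = []
    for genome in sorted(spreadsheet):
        for role, genes in sorted(spreadsheet[genome].items()):
            if len(genes) > 1:
                flagged.append((genome, role, list(genes)))
    return flagged


# Hypothetical populated subsystem with one duplicated role.
example = {
    "Genome A": {"RNA polymerase alpha subunit": ["A_0101"]},
    "Genome B": {"RNA polymerase alpha subunit": ["B_0033", "B_0912"]},
}

for genome, role, genes in duplicate_assignments(example):
    print(f"{genome}: {role} assigned to {', '.join(genes)}")
```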

Page 18:

Subsystems Reveal Interesting

• Missing genes:
  Is the function essential?
  Is the function conserved?
  Does the missing gene cluster with homologs in other organisms?
  Is the function performed by a newly recruited gene?
  Has a gene been acquired by horizontal gene transfer and now performs that function?
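Missing genes show up as empty cells, and how widely a role is filled across genomes gives a first, rough hint about conservation. The sketch below assumes a populated subsystem stored as a nested mapping; the function names, genome names, and gene identifiers are hypothetical.

```python
def missing_roles(roles, spreadsheet):
    """Return {genome: [roles with no gene assigned]}."""
    return {
        genome: [role for role in roles if not cells.get(role)]
        for genome, cells in spreadsheet.items()
    }


def role_coverage(roles, spreadsheet):
    """Return {role: fraction of genomes with at least one gene for it}."""
    total = len(spreadsheet)
    if total == 0:
        return {role: 0.0 for role in roles}
    return {
        role: sum(1 for cells in spreadsheet.values() if cells.get(role)) / total
        for role in roles
    }


# Hypothetical two-role subsystem over two genomes.
roles = ["Role 1", "Role 2"]
subsystem = {
    "Genome A": {"Role 1": ["A_0007"], "Role 2": ["A_0310"]},
    "Genome B": {"Role 1": ["B_0152"]},
}

print(missing_roles(roles, subsystem))  # {'Genome A': [], 'Genome B': ['Role 2']}
print(role_coverage(roles, subsystem))  # {'Role 1': 1.0, 'Role 2': 0.5}
```

A role that is filled in nearly every genome but empty in one is a natural starting point for the questions on this slide.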

Page 19:

Synthesis of Selenocysteinyl-tRNA

• Two known pathway variants
  One step in Bacteria
    • SelA is annotated
  Two steps in Archaea and Eucarya
    • PSTK was missing until very recently

Page 20:

Explore Selenocysteine Usage

• Start by searching for gene name, selA, in an organism known to use Sec, E. coli K12
• Start from subsystem tree; expand category of "Protein metabolism," expand subcategory of "Selenoproteins"
• Open "Selenocysteine metabolism" subsystem from protein page or SS tree
  Genomes arranged phylogenetically
  Roles defined on mouse-over
  What genes are missing in which organisms?
  Are there Sec metabolism genes present in any organisms that do not have proteins that need Sec?
  Are there organisms known to need Sec for certain proteins, but that do not have a complete Sec biosynthesis pathway?
  Why is there a hypothetical protein included in this subsystem?