EBI is an Outstation of the European Molecular Biology Laboratory.
GOA: Looking after GO annotations
Emily Dimmer
Gene Ontology Annotation (GOA) Database
European Bioinformatics Institute
Cambridge
UK
2 EMBRACE Workshop 7-9th November 2007
http://www.geneontology.org
Reactome
E. Coli hub
Gene Ontology Annotation (GOA) Database
• Member of the GO Consortium since 2001
• Largest open-source contributor of annotations to GO
• Provides annotation for more than 139,000 species
• GOA’s priority is to annotate the human proteome
• GOA is responsible for human, chicken and bovine annotations in the GO Consortium
GOA Group
EMBL-EBI
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
GOA office
GOA Group
Emily Dimmer (GOA coordinator)
Evelyn Camon (senior GOA curator)
Daniel Barrell (GOA file releases & database)
Rachael Huntley (GOA curator)
David Binns (QuickGO, protein2go tools)
Along with the help of UniProt curators at the EBI, UniProt controlled vocabularies, HAMAP group, InterPro group, IntAct curators, the IPI group, Ensembl and other EBI groups
…and of course the GO editors and the other GO Consortium annotation groups
How does GOA annotate to the GO?
Electronic Annotation
Manual Annotation
• Both these methods have their advantages
• They can be easily distinguished by the evidence code used.
Status of GOA Annotation
• Annotations provided to over 140,000 taxa
• Total of 415,576 PubMed references included as evidence.
• Manual annotations integrated from external model organism and multi-species databases: AgBase, DictyBase, Ensembl, FlyBase, GDB, GeneDB (S. pombe), Gramene, HGNC, MGI, Reactome, RGD, Roslin, SGD, TAIR, TIGR, WormBase, ZFIN, the IntAct protein-protein interaction database, LIFEdb and the Proteome Inc dataset
October 2007 statistics:
Evidence source         Annotations  Proteins   UniProt coverage
Electronic annotations  22,774,674   3,362,148  63.7%
Manual annotations      450,489      86,778     1.6%
Core information needed for a GO annotation
1. Gene or gene product identifier, e.g. Q9ARH1
2. GO term ID, e.g. GO:0004674 (protein serine/threonine kinase activity)
3. Reference ID, e.g. PubMed ID: 12374299 or GO_REF:0000001
4. Evidence code, e.g. IDA
…and also in some cases:
- Qualifiers, available to modify the interpretation of an annotation:
  NOT
  contributes_to
  colocalizes_with
- ‘With’ column information, to provide further information on the method (evidence code)
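The fields listed above can be sketched as a small Python record. This is only an illustration: the class name `GoAnnotation`, its field names and the "GO: plus seven digits" format check are assumptions made for the example, not part of any GOA tool. The qualifier values are the three named on the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Qualifier values named on the slide.
ALLOWED_QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}

# Assumed format check: "GO:" followed by seven digits, e.g. GO:0004674.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GoAnnotation:
    """Hypothetical container for one GO annotation (illustration only)."""
    object_id: str                   # gene or gene product identifier
    go_id: str                       # GO term ID
    reference: str                   # e.g. "PMID:12374299" or "GO_REF:0000001"
    evidence: str                    # evidence code, e.g. "IDA"
    qualifier: Optional[str] = None  # optional qualifier
    with_info: Optional[str] = None  # optional 'With' column value

    def __post_init__(self) -> None:
        # Reject malformed term IDs and qualifiers not listed on the slide.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO term ID: {self.go_id!r}")
        if self.qualifier is not None and self.qualifier not in ALLOWED_QUALIFIERS:
            raise ValueError(f"unknown qualifier: {self.qualifier!r}")


# The example from the slide: B. napus PERK1 (Q9ARH1).
perk1 = GoAnnotation("Q9ARH1", "GO:0004674", "PMID:12374299", "IDA")
```

The four required fields are positional and the two optional ones default to `None`, mirroring the "core" versus "in some cases" split on the slide.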
Electronic Annotation
• A number of different techniques used by different GO Consortium annotation groups.
• All resulting annotations must be high-quality and provide an explanation of the method (GO_REF)
1. Mapping of external concepts to GO terms
2. Automatic transfer of annotations to orthologs
Electronic annotation: GO mappings
Fatty acid biosynthesis (Swiss-Prot keyword) -> GO: fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) -> GO: acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) -> GO: acetyl-CoA carboxylase activity (GO:0003989)
MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) -> GO: DNA repair (GO:0006281)
Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
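The mapping approach above amounts to a lookup from an external concept to a GO term. A minimal Python sketch follows, using only the four examples from the slide. The key prefixes `SP_KW:` and `HAMAP:`, the function name and the placeholder accession `P00000` are assumptions for illustration; real mapping files are far larger.

```python
# External concept -> GO term ID, limited to the examples on the slide.
CONCEPT_TO_GO = {
    "SP_KW:Fatty acid biosynthesis": "GO:0006633",  # fatty acid biosynthesis
    "EC:6.4.1.2": "GO:0003989",                     # acetyl-CoA carboxylase activity
    "InterPro:IPR000438": "GO:0003989",             # acetyl-CoA carboxylase activity
    "HAMAP:MF_00527": "GO:0006281",                 # DNA repair
}


def electronic_annotations(protein_id, cross_references):
    """Return (protein, GO ID, evidence, with) tuples for mapped cross-references."""
    annotations = []
    for xref in cross_references:
        go_id = CONCEPT_TO_GO.get(xref)
        if go_id is not None:
            # IEA marks the annotation as electronic; the source concept
            # is kept so the inference can be traced.
            annotations.append((protein_id, go_id, "IEA", xref))
    return annotations


# "P00000" and "Pfam:PF00000" are placeholders, not real entries.
result = electronic_annotations(
    "P00000", ["EC:6.4.1.2", "InterPro:IPR000438", "Pfam:PF00000"]
)
```

Both mapped cross-references here lead to the same GO term, as on the slide, so a consumer would typically collapse them when only the set of terms matters.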
http://www.geneontology.org/GO.indices.shtml
Automatic transfer of annotations to orthologs
Ensembl COMPARA
• Homologies between different species calculated
• GO terms projected from MANUAL annotation only (IDA, IEP, IGI, IMP, IPI)
• Only one-to-one and apparent one-to-one orthologies are used
http://www.ensembl.org/info/data/compara
[Figure: projection of annotations between species. Species named in the diagram: Human, Mouse, Rat, Macaque, Chimpanzee, Guinea Pig, Dog, Chicken, Xenopus, Zebrafish, Tetraodon, Fugu, Drosophila, Anopheles, Aedes aegypti]
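The projection rules above can be sketched in Python as follows. This is not the Ensembl Compara pipeline: the function names and the identifiers `H1`, `M1` and so on are hypothetical, "apparent one-to-one" orthologies are not modelled, and labelling projected annotations with IEA plus the source protein is an assumption based on this being an electronic method.

```python
from collections import Counter

# Experimental evidence codes eligible for projection, as listed on the slide.
PROJECTABLE = {"IDA", "IEP", "IGI", "IMP", "IPI"}


def one_to_one(pairs):
    """Keep only ortholog pairs where each side occurs exactly once."""
    source_counts = Counter(source for source, _ in pairs)
    target_counts = Counter(target for _, target in pairs)
    return {
        source: target
        for source, target in pairs
        if source_counts[source] == 1 and target_counts[target] == 1
    }


def project_annotations(annotations, ortholog_pairs):
    """Project manual annotations onto one-to-one orthologs.

    annotations: list of (protein_id, go_id, evidence_code)
    ortholog_pairs: list of (source_protein, target_protein)
    """
    mapping = one_to_one(ortholog_pairs)
    projected = []
    for protein_id, go_id, evidence in annotations:
        if evidence not in PROJECTABLE:
            continue  # not an experimental manual annotation
        target = mapping.get(protein_id)
        if target is None:
            continue  # no unambiguous ortholog
        # Assumption: projected annotations are electronic (IEA) and
        # record the source protein for traceability.
        projected.append((target, go_id, "IEA", protein_id))
    return projected


# Hypothetical identifiers: H2 has two orthologs, so it is excluded.
pairs = [("H1", "M1"), ("H2", "M2"), ("H2", "M3")]
annotations = [
    ("H1", "GO:0004674", "IDA"),
    ("H1", "GO:0005515", "IEA"),
    ("H2", "GO:0006281", "IMP"),
]
projected = project_annotations(annotations, pairs)
```

Only the first annotation survives: the second is already electronic, and the third belongs to a protein with a one-to-many orthology.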
Manual Annotation
• High-quality, specific annotations made using:
  • Peer-reviewed papers
  • A range of evidence codes to categorize the types of evidence found in a paper
• Very time-consuming and requires trained biologists
Finding Annotations
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
Phrases picked out from the text: wound response; serine/threonine kinase activity; integral membrane protein
…for B. napus PERK1 protein (Q9ARH1), PubMed ID: 12374299:
FUNCTION   protein serine/threonine kinase activity  GO:0004674
COMPONENT  integral to plasma membrane               GO:0005887
PROCESS    response to wounding                      GO:0009611
Evidence Codes
IEA Inferred from Electronic Annotation
IDA Inferred from Direct Assay
IMP Inferred from Mutant Phenotype
IPI Inferred from Protein Interaction
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
ISS* Inferred from Sequence or Structural Similarity
IGC Inferred from Genomic Context
RCA Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred from Curator Judgement
ND No Data available
IDA:
• Enzyme assays
• In vitro reconstitution
• Immunofluorescence
• Cell fractionation
TAS:
• In the literature source the original experiments referred to are referenced.
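Because the evidence code records how an annotation was made, it can be used to separate electronic from manual annotations. The Python sketch below uses the codes and expansions from the list above. The function names are hypothetical, and treating IEA as the only electronic code (with every other code counted as manual) is an assumption of this example.

```python
# Evidence codes and their expansions, as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Protein Interaction",
    "IEP": "Inferred from Expression Pattern",
    "IGI": "Inferred from Genetic Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "IGC": "Inferred from Genomic Context",
    "RCA": "Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred from Curator Judgement",
    "ND": "No Data available",
}


def annotation_method(code):
    """Classify an evidence code as 'electronic' or 'manual'."""
    if code not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {code!r}")
    # Assumption: IEA is the only code assigned without curator review.
    return "electronic" if code == "IEA" else "manual"


def count_by_method(codes):
    """Count how many of the given evidence codes fall under each method."""
    counts = {"electronic": 0, "manual": 0}
    for code in codes:
        counts[annotation_method(code)] += 1
    return counts


counts = count_by_method(["IEA", "IDA", "IEA", "TAS"])
```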
The ‘Qualifier’ Column
The Qualifier column is used to modify the interpretation of an annotation.
Allowable values are: NOT, colocalizes_with, contributes_to
The ‘NOT’ qualifier
• 'NOT' is used to make an explicit note that the gene product is not associated with the GO term.
… particularly important when associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).
Also used to document conflicting claims in the literature.
NOT can be used with ALL three GO Ontologies.
e.g. This protein does not have ‘kinase activity’ because it has been found to have a disrupted or missing ‘ATP binding’ domain.
The ‘colocalizes_with’ qualifier
Only used with GO Component Ontology
• Gene products that are transiently or peripherally associated with an organelle or complex may be annotated to the relevant cellular component term, using the 'colocalizes_with' qualifier.
The ‘contributes_to’ qualifier
Only used with GO Function Ontology
• Used where an individual gene product that is part of a complex is annotated to terms that describe the action (function or process) of the whole complex, i.e. annotating 'to the potential of the complex'
• Distinguishes an individual subunit from complex functions
• All gene products annotated using 'contributes_to' must also be annotated to a cellular component term representing the complex that possesses the activity.
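The restrictions on the three qualifiers can be expressed as a simple check. The Python sketch below assumes the single-letter aspect codes used in gene association files (F for Function, P for Process, C for Component); the table and function names are hypothetical.

```python
VALID_ASPECTS = {"F", "P", "C"}

# Which ontologies each qualifier may be used with, per the slides.
QUALIFIER_ASPECTS = {
    "NOT": {"F", "P", "C"},     # all three GO ontologies
    "colocalizes_with": {"C"},  # Component ontology only
    "contributes_to": {"F"},    # Function ontology only
}


def qualifier_allowed(qualifier, aspect):
    """Return True if the qualifier may be used with the given aspect."""
    if aspect not in VALID_ASPECTS:
        raise ValueError(f"unknown aspect: {aspect!r}")
    if not qualifier:
        return True  # the qualifier is optional
    allowed = QUALIFIER_ASPECTS.get(qualifier)
    return allowed is not None and aspect in allowed


checks = [
    qualifier_allowed("NOT", "P"),
    qualifier_allowed("colocalizes_with", "C"),
    qualifier_allowed("contributes_to", "C"),
]
```

The further rule that a 'contributes_to' annotation needs a companion Component annotation for the complex spans several records, so it is not covered by this per-annotation check.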
Where does GOA data go?
QuickGO browser:
http://www.ebi.ac.uk/quickgo
Human Insulin Receptor (P06213)…
GO data in Ensembl
GOA data in Entrez Gene
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Gene Association Files
Tab-delimited files: http://www.geneontology.org/GO.current.annotations.shtml

Columns 1-8:
DB       DB_Object_ID  DB_Object_Symbol  Qualifier*  GO_id       DB:Ref          Evidence  With*
UniProt  Q9H2K8        TAOK3_HUMAN                   GO:0004674  PMID:10559204   IDA
UniProt  O00110        O00110_HUMAN                  GO:0003676  GO_REF:0000002  IEA       InterPro:IPR007087
UniProt  P09884        DPOLA_HUMAN       NOT         GO:0000731  PMID:1730053    IMP
UniProt  P09936        UCHL1_HUMAN                   GO:0005515  PMID:12082530   IPI       UniProt:P46527

Columns 9-15 (same four rows):
Aspect  DB_Object_Name*              DB_Object_Synonym*  DB_Object_Type  Taxon       Date      Assigned By
F       Serine/threonine-protein..   IPI00410485         protein         taxon:9606  20070720  HGNC
F                                                        protein         taxon:9606  20070720  UniProt
P       DNA polymerase alpha..       IPI00220317         protein         taxon:9606  20060825  UniProt
F       UCHL1: Ubiquitin carboxyl..  IPI00018352         protein         taxon:9606  20070720  IntAct
* = optional field
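Since the files are tab-delimited with the 15 columns shown above, a line can be read with a plain split. The Python sketch below is a minimal illustration, not an official parser: the function name and dictionary keys are assumptions, and it presumes that lines beginning with `!` are comments. The sample line is the DPOLA_HUMAN row from the slide, with its truncated name kept as shown.

```python
# Column names follow the slide; keys are adapted for use in a dict.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_id",
    "DB_Ref", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_By",
]


def parse_gaf_line(line):
    """Parse one tab-delimited line into a dict, or return None for comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None  # blank line or comment
    fields = line.split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} columns, found {len(fields)}"
        )
    return dict(zip(GAF_COLUMNS, fields))


# The third example row from the slide; optional fields may be empty.
sample = "\t".join([
    "UniProt", "P09884", "DPOLA_HUMAN", "NOT", "GO:0000731",
    "PMID:1730053", "IMP", "", "P", "DNA polymerase alpha..",
    "IPI00220317", "protein", "taxon:9606", "20060825", "UniProt",
])
record = parse_gaf_line(sample)
```

Empty optional columns come back as empty strings rather than being dropped, so the column positions stay fixed.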
http://www.geneontology.org/GO.current.annotations.shtml
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
http://www.ebi.ac.uk/GOA/downloads.html
Output from the GOA database
Non-Redundant, based on IPI (International Protein Index)
Cow
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
Redundant
625 proteome sets
… annotations are also displayed in:
• All GO Consortium Model Organism Databases integrate and exchange GO annotation data to ensure a comprehensive set of annotations for their organism/area of interest.
• Array Products and data analysis
Affymetrix
Spotfire
Almac
… and numerous third-party tools
(http://www.geneontology.org/GO.tools.shtml)
What’s new on the GO annotation front?
Reference Genomes
Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio (zebrafish), Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Schizosaccharomyces pombe, Gallus gallus, Rattus norvegicus
• Comprehensive annotation of a set of conserved pathway and disease-related proteins in human and orthologs in 11 other selected genomes
• Empowers comparative methods used in first pass annotation of other proteomes.
GOA annotation focuses
Cardiovascular GO annotation
Grant with the British Heart Foundation to support a collaboration with HGNC curators to provide full Gene Ontology annotation to genes associated with cardiovascular processes.
wiki: http://wiki.geneontology.org/index.php/Cardiovascular
Immune GO annotation
Interest in actively GO-annotating immune-relevant genes. GOA, UCL and MGI are collaborating to improve annotation for immunologically important genes; WT grant pending.
wiki: http://wiki.geneontology.org/index.php/Immunology
Electronic Annotation developments
New mappings:
• Swiss-Prot Subcellular Location to GO (just released)
• Swiss-Prot UniPathway
Expansion of existing methods
• Ensembl Compara species expansion
Acknowledgements
The Gene Ontology Consortium and 1.5 members of GOA are currently supported by a P41 grant from the National Human Genome Research Institute (NHGRI) [grant HG002273]. GOA is also supported by core EMBL funding and a BBSRC Tools and Resources grant.
Rolf Apweiler, Head of the EBI protein sequence database group
Emily Dimmer
Evelyn Camon
Rachael Huntley
Daniel Barrell
David Binns
Contact the GOA team: [email protected]
GOA web page: http://www.ebi.ac.uk/goa