Emily Dimmer
GOA group
European Bioinformatics Institute
Wellcome Trust Genome Campus
Cambridge
UK
Gene Ontology (GO)
GO Tutorial Outline:
• Introduction to GO
• Description of the GO ontologies
• How groups annotate to GO
• Practical:
  • Investigating the GO and OBO web sites
  • Browsing the GO using the AmiGO Browser
• Open Biomedical Ontologies
• How GO is being used
• Available Tools
• GO slims
• Practical:
  • Creating your own GO slim
Why is GO needed?
THE PROBLEM:
• Huge body of knowledge with an extremely large vocabulary to describe it
• Vocabulary used is poorly defined, i.e. one word can have different meanings, or the same concept can have different names
• Biological systems are complex and our knowledge of such systems is incomplete
RESULT:
Large databases which are difficult to manage and impossible to mine computationally
• A (part of the) solution:
GO:
“a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”
What is GO?
• Access gene product functional information
• Provide a link between biological knowledge and …
  • gene expression profiles
  • proteomics data
• Find how much of a proteome is involved in a process/function/component in the cell
  • using a GO slim (a slimmed-down version of GO to summarize biological attributes of a proteome)
• Map GO terms and incorporate manual GOA annotation into own databases
  • to enhance your dataset
  • or to validate automated ways of deriving information about gene function (text-mining)
What can scientists do with GO?
[Figure: the synonyms "tactition", "tactile sense" and "taction" all resolve to one GO term: perception of touch ; GO:0050975]
• Molecular Function: elemental activity or task, e.g. DNA binding, catalysis of a reaction
• Biological Process: broad objective or goal, e.g. mitosis, signal transduction, metabolism
• Cellular Component: location or complex, e.g. nucleus, ribosome
GO: Three (Orthogonal) Ontologies
How does GO work?
• Provides a standard, species-neutral way of representing biology
• GO covers ‘normal’ functions and processes
  – No pathological processes
  – No experimental conditions
Molecular Function: 7,493 terms
Biological Process: 9,640 terms
Cellular Component: 1,634 terms
Total: 18,767 terms
Definitions: 16,696 (93.9 %)
Content of GO
What is GO?
• NOT a system of nomenclature or a list of gene products
• GO doesn’t attempt to cover all aspects of biology or evolutionary relationships
Open Biomedical Ontologies: http://obo.sourceforge.net
• NOT a dictated standard
• NOT a way to unify databases
http://www.geneontology.org
Reactome
Anatomy of a GO term
• GO terms are composed of:
  • Term name
  • Unique GO ID
  • Definition (93 % of GO terms are defined)
  • Synonyms (optional)
  • Database references (optional)
  • Relationships to other GO terms
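The parts of a GO term listed above can be pictured as a small record. This is only a sketch: the class name, the field names and the `matches` helper are invented for illustration and are not part of any GO software. The term used is the "perception of touch" example from the earlier slide, and its definition is left empty because the slides do not give it.

```python
from dataclasses import dataclass, field


@dataclass
class GOTerm:
    """One GO term, holding the parts listed on the slide (field names are assumptions)."""
    go_id: str                                         # unique GO ID
    name: str                                          # term name
    definition: str = ""                               # may be empty: not every term is defined
    synonyms: list = field(default_factory=list)       # optional
    xrefs: list = field(default_factory=list)          # optional database references
    relationships: list = field(default_factory=list)  # (relationship, parent GO ID) pairs


touch = GOTerm(
    go_id="GO:0050975",
    name="perception of touch",
    synonyms=["tactition", "tactile sense", "taction"],
)


def matches(term, query):
    """True if the query equals the term name or one of its synonyms, ignoring case."""
    q = query.lower()
    return q == term.name.lower() or q in [s.lower() for s in term.synonyms]
```

With a record like this, a search for "Taction" or "tactile sense" finds the same term and the same GO ID, which is the point of a controlled vocabulary with synonyms.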
Ontologies
• “Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber 1993)
I. The GO Ontologies
Can be used to:
• Formalise the representation of biological knowledge
• Describe a common and defined vocabulary for database annotation
• Standardise database submissions
• Provide unified access to information through ontology-based querying of databases, both human and computational
• Improve management and integration of data within databases.
• Facilitate data mining
Ontology applications
• Ontologies can be represented as graphs, where the vertices (nodes and leaves) are connected by edges.
• The nodes are concepts in the ontology.
• The edges are the relationships between the concepts
[Figure: three nodes connected by edges]
Ontology Structure
• The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG).
• Terms are linked by two relationships
  – is-a
  – part-of
• Terms can have more than one parent
[Figure: simple hierarchies (trees) compared with directed acyclic graphs]
Directed Acyclic Graph
[Figure: example DAG with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane; edges labelled is-a or part-of]
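A DAG in which a term can have more than one parent can be sketched as a mapping from each term to its parents. The term names below come from the figure, but which edge is is-a and which is part-of is an assumption made for illustration, as are the names `PARENTS` and `ancestors`.

```python
# child -> list of (relationship, parent).
# Term names are from the figure; the edge types are assumed for illustration.
PARENTS = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
# prints ['cell', 'chloroplast', 'membrane']
```

In a simple tree, "chloroplast membrane" could sit under only one of "membrane" or "chloroplast"; in the DAG it is reachable from both.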
True Path Rule
• The path from a child term all the way up to its top-level parent(s) must always be true
[Figure: example paths through the terms cell, cytoplasm, nucleus, chromosome and nuclear chromosome; edges labelled is-a or part-of]
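One practical consequence of the True Path Rule is that an annotation to a term also implies every term above it. The sketch below illustrates that idea only; the parent links are assumed from the figure's term names, "geneX" is a made-up gene product, and the function names are hypothetical.

```python
# child -> parents. The links are assumptions based on the figure's term names.
PARENTS = {
    "nuclear chromosome": ["chromosome", "nucleus"],
    "chromosome": ["cell"],
    "nucleus": ["cell"],
    "cell": [],
}


def implied_terms(term, parents):
    """Return the term itself plus every ancestor on any path to the top."""
    implied = {term}
    for parent in parents.get(term, []):
        implied |= implied_terms(parent, parents)
    return implied


def propagate(direct_annotations, parents):
    """Map each gene product to its direct terms plus all terms they imply."""
    return {
        gene: set().union(*(implied_terms(t, parents) for t in terms))
        for gene, terms in direct_annotations.items()
    }


# "geneX" is a hypothetical gene product annotated directly to one term.
result = propagate({"geneX": ["nuclear chromosome"]}, PARENTS)
```

If any implied term would be false for some gene product annotated to the child, the path is not "true" and the ontology structure needs revising.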
• Terms become obsolete when they are removed or redefined
• GO IDs are never deleted
• For each term, a comment is added to explain why the term is now obsolete
Ensuring Stability in a Dynamic Ontology
[Figure: Biological Process, Molecular Function and Cellular Component, alongside Obsolete Biological Process, Obsolete Molecular Function and Obsolete Cellular Component]
Access to the Gene Ontology
• Downloads
  • formats available: OBO, GO, XML, OWL, MySQL (http://www.geneontology.org/GO.downloads)
• Web-based tools
  • AmiGO (http://www.godatabase.org)
  • QuickGO (http://www.ebi.ac.uk/ego)
II. Annotating to GO
Use of GO terms to represent the activities and localizations of gene products.
Basic information needed:
1. Database object (a protein or gene identifier), e.g. Q9ARH1
2. Reference ID, e.g. PubMed ID: 12374299
3. GO term ID, e.g. GO:0004674
4. Evidence code, e.g. TAS
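The four pieces of information above can be held together as one record. The sketch below uses the example values from this slide; the class name, the field names and the `is_well_formed` check are assumptions for illustration, not a GO Consortium format.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Minimal annotation record (field names are illustrative)."""
    db_object_id: str  # protein or gene identifier
    reference: str     # literature reference
    go_id: str         # GO term ID
    evidence: str      # evidence code


example = Annotation(
    db_object_id="Q9ARH1",
    reference="PMID:12374299",
    go_id="GO:0004674",
    evidence="TAS",
)


def is_well_formed(annotation):
    """Check that all four fields are present and the GO ID is 'GO:' plus seven digits."""
    has_all_fields = all(
        [annotation.db_object_id, annotation.reference, annotation.evidence]
    )
    return has_all_fields and re.fullmatch(r"GO:\d{7}", annotation.go_id) is not None
```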
GenNav: http://etbsun2.nlm.nih.gov:8000/perl/gennav.pl
J. Clark et al. Plant Physiology 2005 (in press)
Two types of GO Annotation:
Electronic Annotation
Manual Annotation
All annotations must:
• be attributed to a source.
• indicate what evidence was found to support the GO term-gene/protein association.
Electronic Annotation
• Provides large coverage
• High-quality
• BUT annotations tend to use high-level GO terms and provide little detail.
1. Assignment of GO terms to gene products using existing information within database entries
   • Manual mapping of GO terms to concepts external to GO (‘translation tables’).
   • Proteins are then electronically annotated with the relevant GO term(s).
2. Automatic sequence analyses to transfer annotations between highly similar gene products
Electronic Annotation
External concept -> GO term
Fatty acid biosynthesis (Swiss-Prot keyword) -> GO:fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) -> GO:DNA repair (GO:0006281)
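A translation table of this kind is, in effect, a lookup from an external identifier to one or more GO IDs. The sketch below uses the four mappings shown on this slide; the key prefixes ("SPKW:", "InterPro:", "HAMAP:") and the function name are assumptions made for illustration and do not reproduce the real mapping file format.

```python
# External concept -> GO IDs, taken from the four examples on the slide.
# The key prefixes are illustrative, not the real mapping file syntax.
TRANSLATION_TABLE = {
    "SPKW:Fatty acid biosynthesis": ["GO:0006633"],
    "EC:6.4.1.2": ["GO:0003989"],
    "InterPro:IPR000438": ["GO:0003989"],
    "HAMAP:MF_00527": ["GO:0006281"],
}


def electronic_annotations(cross_references, table=TRANSLATION_TABLE):
    """Return the sorted, de-duplicated GO IDs implied by an entry's cross-references."""
    go_ids = set()
    for xref in cross_references:
        go_ids.update(table.get(xref, []))  # unmapped concepts contribute nothing
    return sorted(go_ids)


# A hypothetical protein entry carrying three of the cross-references above.
predicted = electronic_annotations(
    ["EC:6.4.1.2", "InterPro:IPR000438", "SPKW:Fatty acid biosynthesis"]
)
```

Two different external concepts can point to the same GO term, so the result is de-duplicated before the annotations are written out.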
Electronic Annotation
http://www.geneontology.org/GO.indices.shtml
Mappings of external concepts to GO
Evaluation of the precision of electronic annotation techniques (InterPro2GO, SPKW2GO, EC2GO)
• Compared a manually-curated test set of GO-annotated proteins with the electronic annotations
• InterPro2GO = most coverage
• EC2GO = 67 % of predictions exactly match the manual GO annotation.
• 91-100 % of the time, the 3 mappings predicted GO terms within the same lineage
Camon et al. BMC Bioinformatics 2005 in press
Manual Annotation
• High-quality, specific gene/gene product associations made, using:
  • Peer-reviewed papers
  • Evidence codes to grade evidence
BUT it is very time-consuming and requires trained biologists
Finding GO terms
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
Phrases in the text and the GO terms chosen for them:
• "wound response" -> Process: response to wounding GO:0009611
• "serine/threonine kinase activity" -> Function: protein serine/threonine kinase activity GO:0004674
• "integral membrane protein" -> Component: integral to plasma membrane GO:0005887
…for B. napus PERK1 protein (Q9ARH1)
PubMed ID: 12374299
GO Evidence Codes
* = With column required. All codes except IEA are manually annotated.
Code    Definition
*IEA    Inferred from Electronic Annotation
IDA     Inferred from Direct Assay
IEP     Inferred from Expression Pattern
*IGI    Inferred from Genetic Interaction
IMP     Inferred from Mutant Phenotype
*IPI    Inferred from Physical Interaction
*ISS    Inferred from Sequence Similarity
TAS     Traceable Author Statement
NAS     Non-traceable Author Statement
*IC     Inferred by Curator
RCA     Inferred from Reviewed Computational Analysis
ND      No Data
IDA:
• Enzyme assays
• In vitro reconstitution (transcription)
• Immunofluorescence
• Cell fractionation
TAS:
• In the literature source, the original experiments referred to are traceable (referenced).
GO Evidence Codes: the With column
• An additional identifier is needed for annotations using certain evidence codes (those marked * in the table above)
IGI:
• a gene identifier for the "other" gene involved in the interaction
IPI:
• a gene or protein identifier for the "other" protein involved in the interaction
IC:
• GO term from another annotation used as the basis of a curator inference
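The rule that certain evidence codes need a value in the With column can be expressed as a small check. The sketch below follows the asterisks in the evidence code table; the function name is hypothetical, and real annotation files may apply the rule more loosely than this strict reading.

```python
# Codes marked with * in the evidence code table: With column required.
WITH_REQUIRED = {"IEA", "IGI", "IPI", "ISS", "IC"}
# All evidence codes listed in the table.
ALL_CODES = WITH_REQUIRED | {"IDA", "IEP", "IMP", "TAS", "NAS", "RCA", "ND"}


def with_column_ok(evidence, with_value):
    """Return False when a code that needs a With value has none; True otherwise."""
    if evidence not in ALL_CODES:
        raise ValueError(f"unknown evidence code: {evidence}")
    if evidence in WITH_REQUIRED:
        return bool(with_value.strip())
    return True
```

For example, an IPI annotation would carry the identifier of the interacting protein in the With column, while a TAS annotation can leave it empty.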
• Annotation of a gene product to one ontology is independent from its annotation to other ontologies.
• Gene products are annotated only to terms reflecting a normal activity or location.
• Usage of ‘unknown’ GO terms
(e.g. Molecular function unknown GO:0005554)
…some extra things:
• A set of ‘Qualifier’ terms is also available to curators. The Qualifier column can be used to modify the interpretation of an annotation.
Allowable values:
1. NOT
   • a gene product is not associated with the GO term
   • to document conflicting claims in the literature
2. Contributes to
   • distinguishes between individual subunit functions and whole complex functions
   • (used with GO Function Ontology)
3. Colocalizes with
   • transiently or peripherally associated with an organelle or complex
   • where the resolution of an assay is not accurate
   • (used with GO Component Ontology)
…some extra things: Qualifier Information
Accessing annotations to the Gene Ontology
1. Downloads
   • Annotations – gene association files
   • Ontologies and annotations – MySQL and XML
2. Web-based access
   • AmiGO (http://www.godatabase.org)
   • QuickGO (http://www.ebi.ac.uk/ego)
…among others…
Gene Association File
• via web (GO consortium page): http://www.geneontology.org/GO.current.annotations.shtml
Columns:
DB | DB_Object_ID | DB_Object_Symbol | Qualifier | GOid | DB:Reference | Evidence | With | Aspect | DB_Object_Name | DB_Object_Synonym | DB_Object_Type | taxon | Date | Assigned by
Example rows (empty columns shown as -):
UniProt | P06703 | S106_HUMAN | - | GO:0008083 | GOA:spkw | IEA | - | F | Calcyclin | IPI00027463 | protein | taxon:9606 | 20040426 | UniProt
UniProt | P06703 | S106_HUMAN | NOT | GO:0007409 | PMID:12152788 | NAS | - | P | Calcyclin | IPI00027463 | protein | taxon:9606 | 20030721 | UniProt
UniProt | P06703 | S106_HUMAN | - | GO:0005515 | PMID:12577318 | IPI | UniProt:P50995 | F | Calcyclin | IPI00027463 | protein | taxon:9606 | 20030721 | UniProt
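A gene association file is tab-delimited, with one annotation per line, so reading it comes down to splitting each line into the 15 columns named on this slide. The sketch below is a minimal reader under that assumption; the function name and the underscore in "Assigned_by" are choices made here, and the sample line reuses one of the Calcyclin rows from the slide.

```python
# The 15 column names from the slide, in order.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GOid",
    "DB:Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "taxon", "Date", "Assigned_by",
]


def parse_gaf_line(line):
    """Split one tab-delimited annotation line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


# One of the example rows from the slide; empty columns are empty strings.
sample = "\t".join([
    "UniProt", "P06703", "S106_HUMAN", "NOT", "GO:0007409",
    "PMID:12152788", "NAS", "", "P", "Calcyclin",
    "IPI00027463", "protein", "taxon:9606", "20030721", "UniProt",
])
record = parse_gaf_line(sample)
```

Here `record["Qualifier"]` is "NOT", so this row states that the gene product is not associated with GO:0007409; code that ignores the Qualifier column would read it the wrong way round.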
Summary
• GO is still being developed and updated - it requires a serious and ongoing effort.
– the biological community is involved
• New model organism databases are joining the GO Consortium annotation effort
Practical session
1. Visit the GO website
2. Visit the OBO website
3. Browse the ontologies using the official GO Consortium Browser – AmiGO
Part 1.
GO web site: www.geneontology.org
OBO web site: http://obo.sourceforge.net
AmiGO: http://www.godatabase.org
GO terms with no children
Filter queries by organism, data source or evidence
Search for GO terms or by Gene symbol/name
Querying the GO
GOst tool
QuickGO browser: http://www.ebi.ac.uk/ego
OBO and Gene Ontology Uses and Tools
[Figure: ontology domains: Anatomy, Physiology, Phenotype, Pathway, Disease, Molecular, Metabolic, Developmental Stage]
Beyond GO – Open Biomedical Ontologies
• Orthogonal to existing ontologies to facilitate combinatorial approaches
  - Share unique identifier space
  - Include definitions
• Anatomies
• Cell Types
• Sequence Attributes
• Temporal Attributes
• Phenotypes
• Diseases
• More….
http://obo.sourceforge.net
Sequence Ontology
http://song.sourceforge.net
• ChEBI: ontology of ‘small molecular entities’
http://www.ebi.ac.uk/chebi
http://www.fruitfly.org/cgi-bin/ex/go.cgi
Access to GO and its annotations
How to access the Gene Ontology and its annotations
1. Downloads
   • Ontologies – (various formats: GO, OBO, XML, OWL, MySQL)
   • Annotations – gene association files
   • Ontologies and Annotations – MySQL and XML
2. Web-based access
   • AmiGO (http://www.godatabase.org)
   • QuickGO (http://www.ebi.ac.uk/ego)
among others…
http://www.ncbi.nlm.nih.gov/entrez
www.uniprot.org/
http://www.ebi.ac.uk/intact
SRS view…
http://srs.ebi.ac.uk
www.ensembl.org/
[Figure: gene tree of 14,010 genes clustered by expression over time, attacked versus control samples. Clusters are labelled with GO terms:
• puparial adhesion, molting cycle, hemocyanin
• defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• immune response, Toll regulated genes
• amino acid catabolism, lipid metabolism
• peptidase activity, protein catabolism, immune response]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
…analysis of high-throughput data according to GO
MicroArray data analysis
Proteomics data analysis
Kislinger T et al, Mol Cell Proteomics, 2003
GO classification
…analysis of high-throughput data according to GO
http://www.geneontology.org/GO.tools
Analysis of Data: Clustering
Color indicates up/down regulation
GoMiner Tool, John Weinstein et al, Genome Biol. 4 (R28) 2003
Compare annotations associated with the test set to the entire set of GO annotations….
DNA Repair seems to be a common theme.
Example of VLAD Output
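Comparing the annotations of a test set with the entire set of annotations is often done with a hypergeometric test, which asks how likely it is to see at least this many genes with a given GO term by chance. The sketch below shows that calculation only; the slides do not say which statistic the tools named here use, and all the counts are made-up numbers.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated genes in a test set.

    k: genes in the test set annotated to the GO term
    n: size of the test set
    K: genes in the whole set annotated to the GO term
    N: size of the whole set
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 5 of 50 test genes carry a term that 20 of 1000 genes carry overall.
p = enrichment_p_value(k=5, n=50, K=20, N=1000)
```

By chance alone about one gene in this test set would carry the term, so five is unusual and `p` comes out small. In practice many GO terms are tested at once, so such p-values are usually corrected for multiple testing.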
…overview proteome with GO Slim
http://www.ebi.ac.uk/integr8
http://go.princeton.edu/cgi-bin/GOTermMapper
map2slim.pl
• distributed as part of the go-perl package
• maps a set of annotations up to their parent GO slim terms
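The idea of mapping annotations up to their parent GO slim terms can be sketched in a few lines. This is an illustration of the concept, not a reimplementation of map2slim.pl; the small graph, the slim, the gene names and the function names are all invented for the example.

```python
# child -> parents, for a tiny made-up fragment of an ontology.
PARENTS = {
    "chloroplast membrane": ["membrane", "chloroplast"],
    "mitochondrial membrane": ["membrane"],
    "membrane": ["cell"],
    "chloroplast": ["cell"],
    "cell": [],
}

# A hypothetical GO slim: the high-level terms to report on.
SLIM = {"membrane", "chloroplast"}


def map_to_slim(term, parents, slim):
    """Return the nearest slim terms at or above `term`."""
    if term in slim:
        return {term}
    found = set()
    for parent in parents.get(term, []):
        found |= map_to_slim(parent, parents, slim)
    return found


def slim_counts(annotations, parents, slim):
    """Count how many gene products fall under each slim term."""
    counts = {}
    for _gene, terms in annotations.items():
        slim_terms = set()
        for term in terms:
            slim_terms |= map_to_slim(term, parents, slim)
        for slim_term in slim_terms:  # each gene counts once per slim term
            counts[slim_term] = counts.get(slim_term, 0) + 1
    return counts


# Two hypothetical gene products with one direct annotation each.
counts = slim_counts(
    {"geneA": ["chloroplast membrane"], "geneB": ["mitochondrial membrane"]},
    PARENTS,
    SLIM,
)
```

Because a term can have more than one parent, one annotation can count towards several slim terms, so the per-term counts can add up to more than the number of gene products.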
Off-the-shelf GO slims
Summary
The Gene Ontology project precipitated a generalized implementation for ontologies for molecular biology
Bio-ontologies such as GO have facilitated development of systems for hypothesis generation in biological systems
Further integration – creation of cross-products between different ontologies
Practical II – Creation of GO slims using the DAG-Edit tool.
http://sourceforge.net/projects/geneontology/
…loading the GO
ftp://ftp.geneontology.org/pub/go/ontology/gene_ontology.obo
…browsing the GO
…viewing GO terms
…searching for GO terms
…creating a new GO slim
…creating a renderer for the GO slim
…adding terms to the GO slim
…filtering GO for terms in the GO slim
…removing filters/renderers
…saving the newly created GO slim