The Gene Ontology and its insertion into UMLS
Jane Lomax
The Gene Ontology
Set of three structured vocabularies
Provide functional annotation of gene products
Dynamic
Cross-references to external databases
The vocabularies
Molecular function: elemental activity or task
• nuclease, DNA binding, microtubule motor
Biological process: broad objective or goal
• mitosis, signal transduction, metabolism
Cellular component: location or complex
• nucleus, ribosome
GO structure
Directed acyclic graph (DAG)
• allows multiple parentage
True-path rule
Every path from a node back to the root must be biologically accurate
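As a rough illustration of multiple parentage and the true-path rule, the sketch below lists every path from a term back to the root, so each one can be checked for biological accuracy. The graph, the term "toy DNA-binding nuclease" and the helper `paths_to_root` are hypothetical and not taken from a GO release.

```python
from typing import Dict, List

# Hypothetical toy graph: each term maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "molecular function": [],  # root
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "nuclease": ["catalytic activity"],
    # Two parents: allowed in a DAG, impossible in a tree.
    "toy DNA-binding nuclease": ["DNA binding", "nuclease"],
}


def paths_to_root(term: str, parents: Dict[str, List[str]] = PARENTS) -> List[List[str]]:
    """Return every path from `term` up to the root, child first."""
    direct = parents[term]
    if not direct:
        return [[term]]
    return [[term] + path for p in direct for path in paths_to_root(p, parents)]


# Under the true-path rule, every printed path must make biological sense.
for path in paths_to_root("toy DNA-binding nuclease"):
    print(" -> ".join(path))
```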
Relationship types
is_a
• subclass: a is a type of b
part_of
• physical part of (component)
• sub-process of (process)
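The two relationship types can be sketched as labelled edges in the graph. The edge list below is a simplified, illustrative set of cellular component terms, and `build_index` and `ancestors` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Illustrative edges: (child, relationship, parent).
EDGES: List[Tuple[str, str, str]] = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("nucleus", "part_of", "cell"),
]


def build_index(edges: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each child term to its (relationship, parent) pairs."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for child, rel, parent in edges:
        index[child].append((rel, parent))
    return index


def ancestors(term: str, index: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Collect all terms reachable by following is_a or part_of upwards."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in index.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


INDEX = build_index(EDGES)
print(sorted(ancestors("nucleolus", INDEX)))
```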
What makes up a GO term?
• term name
• go_id
• definition and definition dbxref
• GO synonym
• general dbxref
• comment
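One way to picture these fields is as a small record type. This is a sketch only: the class `GOTerm`, its attribute names and the placeholder values are assumptions for illustration, not GO's own file format. The only check it makes is that a go_id looks like "GO:" followed by seven digits.

```python
import re
from dataclasses import dataclass, field
from typing import List

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass
class GOTerm:
    name: str
    go_id: str
    definition: str = ""
    definition_dbxrefs: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    dbxrefs: List[str] = field(default_factory=list)  # general cross-references
    comment: str = ""

    def __post_init__(self) -> None:
        # Reject identifiers that are not of the form GO:nnnnnnn.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a GO identifier: {self.go_id!r}")


# Placeholder values, not a real GO term.
example = GOTerm(
    name="example activity",
    go_id="GO:0000000",
    definition="Placeholder definition text.",
    synonyms=["example function"],
)
print(example.go_id, example.name)
```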
GO cross-links
Cross-references within GO
• EC
• RESID
• MetaCyc
Mappings
• SWISS-PROT keywords
Links in other databases
• InterPro
• UMLS/MeSH – in progress
Why insert GO into UMLS?
A rich, widely used source for expanding UMLS
• Can be used to improve areas of MeSH
Potential for ‘non-fuzzy’ text mining using GO terms
• MeSH terms manually assigned to papers
Unified Medical Language System (UMLS)
Research project maintained by the National Library of Medicine (NLM)
Aims to
• allow computers to ‘understand’ biomedical meaning
• improve retrieval and integration of computer-readable information
Has three ‘Knowledge sources’:
• UMLS Metathesaurus
• SPECIALIST lexicon
• semantic network
Knowledge sources
UMLS Metathesaurus
• links multiple source vocabularies into unified concepts, includes MeSH (Medical Subject Headings)
• GO to become a source vocabulary
SPECIALIST lexicon
• provides biomedical/English lexical information
semantic network
• for categorizing concepts
Inserting GO into UMLS
inversion
• converting GO to the correct format for UMLS
insertion
• inserting GO using matching algorithms
editing
• every concept containing a GO term is reviewed by hand
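The slides do not describe the matching algorithms used for insertion. As a minimal sketch of the general idea, assuming simple normalized string matching, the example below pairs GO term names with names from other source vocabularies. The functions `normalize` and `match_terms` and the toy term lists are hypothetical.

```python
import re
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def match_terms(go_terms: List[str], other_sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each GO term, list the other sources holding the same normalized name."""
    index: Dict[str, Set[str]] = {}
    for source, names in other_sources.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(source)
    return {term: sorted(index.get(normalize(term), ())) for term in go_terms}


# Toy data for illustration only, not real vocabulary content.
GO_TERMS = ["mitosis", "signal transduction", "DNA binding", "microtubule motor"]
OTHER_SOURCES = {"MSH": ["Mitosis", "Signal Transduction"], "CSP": ["mitosis"]}

matches = match_terms(GO_TERMS, OTHER_SOURCES)
shared = [term for term, sources in matches.items() if sources]
print(f"{100 * len(shared) / len(matches):.1f}% of toy GO terms match another source")
```

Terms with an empty match list would form concepts where GO is the only source. In the real process every resulting concept is still reviewed by hand.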
Statistics
Approximately 23% of GO terms ‘match’ something in another source vocabulary
• 23.03%: GO terms in concepts with other sources
• 76.97%: GO terms in concepts where they are the only source
Statistics
% of GO terms in concepts with other sources, by GO vocabulary
• biological process: 4.6%
• molecular function: 27.8%
• cellular component: 45.2%
Statistics
% of GO terms in concepts with other sources, by source
• CSP2002 (Computer Retrieval of Information on Scientific Projects Thesaurus): 7.34%
• MSH2003_2002_08_14 (Medical Subject Headings): 19.74%
• SNMI98 (Systematized Nomenclature of Human and Veterinary Medicine): 11.05%
[Figure labels: GO, CRISP, MeSH, SNOMED]
[Figure labels: concept name, concept id, GO atoms, MeSH atoms, EC number, contexts, relationships to other concepts, definition]
Challenges with insertion
GO synonyms
• As GO has evolved, not all of its synonyms are still strictly synonymous
GO enzymes
• GO separates enzyme function from enzyme ‘complexes’; most vocabularies don’t
Semantic types
• Which semantic types now apply to concepts with GO atoms?
Future of insertion
Hoped that GO can be released with UMLS early next year
• dependent on ironing out problems
Maintenance of insertion
• GO changes continually, so it will differ substantially from one UMLS release to the next
www.geneontology.org
• FlyBase & Berkeley Drosophila Genome Project
• Saccharomyces Genome Database
• PomBase (Sanger Institute)
• Rat Genome Database
• Genome Knowledge Base (CSHL)
• The Institute for Genomic Research
• Compugen, Inc
• The Arabidopsis Information Resource
• WormBase
• DictyBase
• Mouse Genome Informatics
• Swiss-Prot/TrEMBL/InterPro
• Pathogen Sequencing Unit (Sanger Institute)
• National Library of Medicine
• Alexa McCray
• Stuart Nelson
• Bill Hole
• Oak Ridge Institute for Science and Education
• National Library of Medicine
• U. S. Department of Energy
The Gene Ontology Consortium is supported by an R01 grant from the National Human Genome Research Institute (NHGRI) [grant HG02273]. SGD is supported by a P41, National Resources, grant from the NHGRI [grant HG01315]; MGD by a P41 from the NHGRI [grant HG00330]; GXD by the National Institute of Child Health and Human Development [grant HD33745]; FlyBase by a P41 from the NHGRI [grant HG00739] and by the Medical Research Council, London. TAIR is supported by the National Science Foundation [grant DBI-9978564]. WormBase is supported by a P41, National Resources, grant from the NHGRI [grant HG02223]; RGD is supported by an R01 grant from the NHLBI [grant HL64541]; DictyBase is supported by an R01 grant from the NIGMS [grant GM064426].