www.ebi.ac.uk/arrayexpress
EBI is an Outstation of the European Molecular Biology Laboratory.
Anatomy ontology evaluation @ ArrayExpress
Helen Parkinson, PhD
Content
• ArrayExpress use cases
• Fuzzy matching of ontology terms
• Data driven ontology building
• Wish list
ArrayExpress: Overview
[Diagram labels: Submit; Hybs; Experiment queries; Public/Private; ATLAS; Summarize; Public Only; Re-annotate; Gene queries; Genes; Cross expt/species queries]
Fuzzy matching of ontology terms – why?
• Clean up ArrayExpress OE and synonym tables
• OE based integration
• Constrain OEs on data entry/validation
• Improved searches in repository/DW web interface
• Data integration across species, experiments and experimental designs
• Automated mapping of free text to ontology terms for data import
Phonetic Matching
• Precompute phonetic encodings of all terms in the ontology
• Match each target term by comparing these encodings
• Soundex: Robert Russell and Margaret Odell (1918), famously described by Donald Knuth
• Double Metaphone: Lawrence Philips (2000)
• Metaphone: Lawrence Philips
• Most matches are single
• Highest success rate
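The precompute-then-compare scheme above can be sketched in a few lines. This is a minimal illustration using American Soundex only, not the ArrayExpress implementation: the function names, the per-word encoding of multi-word terms, and the small `ONTOLOGY_TERMS` list are all assumptions made for the example.

```python
from collections import defaultdict

# American Soundex digit classes; vowels and h, w, y carry no digit.
SOUNDEX_DIGITS = {
    letter: digit
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]
    for letter in letters
}


def soundex(word):
    """Return the 4-character Soundex code of one word ('' if it has no a-z letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with the same digit
            prev = digit
    return (code + "000")[:4]


def phonetic_key(term):
    """Encode a (possibly multi-word) term word by word (an assumption of this sketch)."""
    return " ".join(soundex(w) for w in term.split())


def build_index(ontology_terms):
    """Precompute the phonetic encodings of all ontology terms."""
    index = defaultdict(list)
    for term in ontology_terms:
        index[phonetic_key(term)].append(term)
    return index


def match(target, index):
    """Return every ontology term whose encoding equals the target's encoding."""
    return index.get(phonetic_key(target), [])


# Hypothetical ontology fragment, for illustration only.
ONTOLOGY_TERMS = ["kidney", "liver", "cerebral cortex", "hypothalamus"]
INDEX = build_index(ONTOLOGY_TERMS)
```

With this index, `match("kidny", INDEX)` returns `["kidney"]` and `match("spleen", INDEX)` returns an empty list. A target that returns more than one term corresponds to a multiple match, which would still need a curator's decision.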
Algorithm comparisons
[Chart: SAEL vs. AE OrganismPart. Algorithms compared: Soundex, Metaphone, Double Metaphone, Levenshtein. Axis: 0% to 100%. Legend: none, multiple_bad, multiple_okay, single_bad, single_okay, valid]
Percent matches using automated mapping
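For the Levenshtein column of the comparison, a minimal sketch of edit-distance matching and of bucketing each target term is given below. The distance threshold of 2, the function names, and the reading of "valid" as an exact match are assumptions for illustration; the okay/bad split in the chart legend requires manual review and is not modelled here.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def candidates(target, ontology_terms, max_distance=2):
    """Ontology terms within max_distance edits of the target (case-insensitive)."""
    target = target.lower()
    return [t for t in ontology_terms
            if levenshtein(target, t.lower()) <= max_distance]


def outcome(target, ontology_terms, max_distance=2):
    """Bucket a target term as 'valid', 'single', 'multiple' or 'none'."""
    if target.lower() in {t.lower() for t in ontology_terms}:
        return "valid"  # assumed meaning: the term is already an ontology term
    hits = candidates(target, ontology_terms, max_distance)
    if not hits:
        return "none"
    return "single" if len(hits) == 1 else "multiple"


# Hypothetical ontology fragment, for illustration only.
TERMS = ["liver", "lung", "kidney", "hand", "head"]
```

Here `outcome("livr", TERMS)` gives `"single"`, while `outcome("haed", TERMS)` gives `"multiple"` because both "hand" and "head" lie within two edits.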
Failures to match
• Species (or Kingdom)-specific terms (e.g. plant anatomy)
• Conflated terms (e.g. diseased cell types)
• Compound terms (e.g. "cerebral cortex and hypothalamus")
• Genuinely missing terms
  • Esoteric terms less of a priority
• Most trivial misspellings, however, were matched
• Dirty input data
Implications
• Need more terms in some commonly-used ontologies
• Synonyms are important
  • generating less noise
  • better coverage
• Choice of ontology can limit expressivity - this will be frustrating to biologists
Why?
• Clean up ArrayExpress OE and synonym tables
• Add accessions/DB links to these tables
• Constrain OEs on data entry/validation
• Improved searches in repository/DW web interface
• Generate suggestions for new OE terms
• Evaluate domain coverage by a given ontology
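One simple way to read the last two goals together is sketched below: coverage is taken as the fraction of distinct annotation values that match an ontology label or synonym exactly after normalisation, and the unmatched values become candidate suggestions for new terms. The definition of coverage, the data layout, and all names are assumptions for illustration, not the measure used for ArrayExpress.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def coverage(annotations, ontology):
    """Return (fraction of distinct annotations covered, sorted unmatched values).

    ontology maps each label to a list of its synonyms (hypothetical layout).
    """
    known = set()
    for label, synonyms in ontology.items():
        known.add(normalise(label))
        known.update(normalise(s) for s in synonyms)

    distinct = {normalise(a) for a in annotations if a.strip()}
    if not distinct:
        return 0.0, []
    unmatched = sorted(distinct - known)
    return 1 - len(unmatched) / len(distinct), unmatched


# Hypothetical data, for illustration only.
ONTOLOGY = {"liver": ["hepar"], "kidney": []}
ANNOTATIONS = ["Liver", "hepar", "kidney ", "rosette leaf"]
```

For this toy input, `coverage(ANNOTATIONS, ONTOLOGY)` returns `(0.75, ["rosette leaf"])`: the synonym "hepar" counts as covered, and the plant term is left over as a possible new term or a sign that a different ontology is needed.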
ArrayExpress Ontology Development and Future Directions
24.04.23
Developing the Ontology
• Define Scope: ArrayExpress already has some useful structure given the current database plus rich source of use cases and competency questions.
• Build: Ontology Capture: Identify key concepts and relationships within our domain and give explicit definitions to these features:
  • Middle-out approach – specify core of basic terms then specialise and generalise as required
• Mappings – text mining approach to do initial semi-automated mappings to external resources for rapid coverage
• Manual mapping for data warehouse data, and selected data sets
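The split between semi-automated and manual mapping can be sketched as follows. Python's standard `difflib.get_close_matches` stands in here for the text mining step, which the slides do not specify; the 0.8 cutoff, the status names, and the function name are assumptions for illustration.

```python
import difflib


def propose_mappings(free_text_terms, external_labels, cutoff=0.8):
    """Propose a mapping for each free-text term.

    Returns {term: (status, suggestions)} where status is
    'exact'    - identical after normalisation, can be accepted automatically
    'review'   - close matches found, to be confirmed by a curator
    'unmapped' - nothing close enough, left for manual mapping
    """
    by_lower = {label.lower(): label for label in external_labels}
    proposals = {}
    for term in free_text_terms:
        key = " ".join(term.lower().split())
        if key in by_lower:
            proposals[term] = ("exact", [by_lower[key]])
            continue
        close = difflib.get_close_matches(key, list(by_lower), n=3, cutoff=cutoff)
        status = "review" if close else "unmapped"
        proposals[term] = (status, [by_lower[c] for c in close])
    return proposals


# Hypothetical labels from an external resource, for illustration only.
EXTERNAL_LABELS = ["Hypothalamus", "Cerebral cortex", "Liver"]
PROPOSALS = propose_mappings(["liver", "hypothalmus", "rosette leaf"],
                             EXTERNAL_LABELS)
```

In this example "liver" is accepted as exact, "hypothalmus" is queued for review with "Hypothalamus" as the suggestion, and "rosette leaf" is left unmapped for manual handling.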
Capture to Code: Definitions and Hierarchy
Semantic Roadmap
• Position of the ArrayExpress Experimental Factor Ontology in the ‘bigger picture’
[Diagram labels: AE Ontology; Disease Ontology; Common Anatomy Reference Ontology; Cell Type Ontology; Chemical Entities of Biological Interest (ChEBI); NCI; Various Species Anatomy Ontologies]
• Key is orthogonal coverage, reuse of existing resources and shared frameworks
Wish list
• NOT to build our own anatomy ontology
• CARO extension
• CARO evaluation
• Mapping CARO to relevant multi-species ontologies
• Application of CARO to ArrayExpress data
• Use of CARO in ArrayExpress tools
Acknowledgments
• Anna Farne
• Ele Holloway
• James Malone
• Margus Lukk
ArrayExpress Production Team
• Helen Parkinson
• Tim Rayner
• Faisal Rezwan
• Eleanor Williams
• Mengyao Zhao
• Holly Zheng