bio-ontologies for annotation and service discovery

Bio-ontologies for Annotation and Service Discovery Chris Wroe ( + material from Carole Goble, Alan Rector, Jeremy Rogers, Ian Horrocks) University of Manchester, UK

Post on 09-Jan-2016

TRANSCRIPT

Page 1: Bio-ontologies for Annotation and Service Discovery

Bio-ontologies for Annotation and Service Discovery

Chris Wroe (+ material from Carole Goble, Alan Rector, Jeremy Rogers, Ian Horrocks)

University of Manchester, UK

Page 2: Bio-ontologies for Annotation and Service Discovery

Overview

Example-driven tour of the why, what and how of ontologies in life sciences
Cover the key features of an ontology
  Vocabulary, definitions, hierarchies, grammar & reasoning
Cover the key targets of ontology use
  Biological knowledge, service descriptions, (database schema)

Page 3: Bio-ontologies for Annotation and Service Discovery

Ontology – the discipline

Semantics – the meaning of meaning.
Philosophical discipline, branch of philosophy that deals with the nature and the organisation of reality.
Science of Being (Aristotle, Metaphysics, IV, 1)
  What is being?
  What are the features common to all beings?

Page 4: Bio-ontologies for Annotation and Service Discovery

In science… ontology the thing

A resource to aid the precise communication and integration of information

Binds a community to communicate information in some domain of interest in a consistent manner.

Page 5: Bio-ontologies for Annotation and Service Discovery

Gene Ontology – a community effort

Model organism databases need to be integrated
Not possible if they all use a different vocabulary
Gene Ontology Consortium got together to form “a dynamic controlled vocabulary that can be applied to all eukaryotes”

Page 6: Bio-ontologies for Annotation and Service Discovery

Gene Ontology – keeping it simple

Provide three separate vocabularies to describe:

The function a gene product is capable of.
The process a gene product takes part in.
The location at which the gene product has been found.

Page 7: Bio-ontologies for Annotation and Service Discovery

GO annotations

Gene detail page in MGD for the vitamin D receptor gene, Vdr

Annotation

Page 8: Bio-ontologies for Annotation and Service Discovery

Feature 1:

Ontologies provide a shared controlled vocabulary of concepts.
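
A minimal sketch, in Python, of what a shared controlled vocabulary gives an annotating database: annotations are only accepted if they point at a term everybody has agreed on. Only GO:0030262 and its name are quoted in this deck; the EX: identifiers, the Annotation type and the validate helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Stand-in vocabulary: term id -> (vocabulary, term name).
# GO:0030262 is quoted in this deck; the EX: entries are made up.
VOCABULARY = {
    "GO:0030262": ("process", "apoptotic nuclear changes"),
    "EX:0000001": ("function", "example function term"),
    "EX:0000002": ("location", "example location term"),
}


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # database identifier of the annotated gene product
    term_id: str       # must be an id from the shared vocabulary


def validate(annotation: Annotation) -> tuple[str, str]:
    """Return (vocabulary, term name), or raise ValueError for unknown ids."""
    if annotation.term_id not in VOCABULARY:
        raise ValueError(f"{annotation.term_id} is not a term in the shared vocabulary")
    return VOCABULARY[annotation.term_id]


print(validate(Annotation("hypothetical_gene_1", "GO:0030262")))
```

Free-text annotations fail this check, which is the point: two databases that both pass it can be queried with the same term ids.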

Page 9: Bio-ontologies for Annotation and Service Discovery

Gene ontology - definitions

A diverse community, so explicit definitions important.

60% of GO concepts have a textual definition, e.g. apoptotic nuclear changes

GO:0030262 Changes affecting the nucleus and its contents during apoptosis; includes condensation and fragmentation of nuclear DNA and of the nucleus itself.

Page 10: Bio-ontologies for Annotation and Service Discovery

Feature 2:

Ontologies provide an agreed definition for each concept to ensure each concept is used in the same way.

Page 11: Bio-ontologies for Annotation and Service Discovery

Gene ontology – organisation

An alphabetical list of 11000 terms is not enough
Hierarchies allow similar terms to be grouped together.

[Figure: example hierarchy with the nodes biological process, death, cell death, tissue death, necrosis, histolysis]

Page 12: Bio-ontologies for Annotation and Service Discovery
Page 13: Bio-ontologies for Annotation and Service Discovery

Gene ontology – hierarchy use

GO hierarchy is used for

Navigation of concepts by users
Indexing of information in databases
Aggregating information

Page 14: Bio-ontologies for Annotation and Service Discovery

Taxonomy remark 1

The world is not a tree, it’s a lattice

[Figure: lattice with the nodes animal, rodent, cow, cat, mouse, dog, domestic, vermin, wild, pet, working]

Page 15: Bio-ontologies for Annotation and Service Discovery

Taxonomy remark 2

What does the taxonomy mean?

Concept A is a parent of concept B iff every instance of B is also an instance of A
  Superset/subset

[Figure: ICONCLASS entries grouped under Door: Metalwork of a Door, Closing the Door, Monumental Door, Door-Knocker, Door-keeper, Threshold; callouts read “Action associated with a door”, “Something attached to a door” and “Kind of a door”]
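
The is-a criterion on this slide can be written set-theoretically. In the sketch below, ext(X) is notation introduced here (it does not appear on the slide) for the set of all instances of concept X.

```latex
A \text{ is a parent of } B
\;\iff\; \forall x\,\bigl(x \in \mathrm{ext}(B) \Rightarrow x \in \mathrm{ext}(A)\bigr)
\;\iff\; \mathrm{ext}(B) \subseteq \mathrm{ext}(A)
```

Read this way, filing Door-Knocker under Door is not an is-a link: a door-knocker is something attached to a door, not a kind of door.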

Page 16: Bio-ontologies for Annotation and Service Discovery

Classification trickiness

"On those remote pages it is written that animals are divided into:
a. those that belong to the Emperor
b. embalmed ones
c. those that are trained
d. suckling pigs
e. mermaids
f. fabulous ones
g. stray dogs
h. those that are included in this classification
i. those that tremble as if they were mad
j. innumerable ones
k. those drawn with a very fine camel's hair brush
l. others
m. those that have just broken a flower vase
n. those that resemble flies from a distance"

The Celestial Emporium of Benevolent Knowledge, Borges

Page 17: Bio-ontologies for Annotation and Service Discovery

Classification is task and culture specific

Dyirbal classification of objects in the universe

Bayi: men, kangaroos, possums, bats, most snakes, most fishes, some birds, most insects, the moon, storms, rainbows, boomerangs, some spears, etc.

Balan: women, anything connected with water or fire, bandicoots, dogs, platypus, echidna, some snakes, some fishes, most birds, fireflies, scorpions, crickets, the stars, shields, some spears, some trees, etc.

Balam: all edible fruit and the plants that bear them, tubers, ferns, honey, cigarettes, wine, cake.

Bala: parts of the body, meat, bees, wind, yamsticks, some spears, most trees, grass, mud, stones, noises, language, etc.

Page 18: Bio-ontologies for Annotation and Service Discovery

Gene ontology – directed acyclic graphs

Each concept is explicitly grouped either by is-a or part-of relationships
  Functions are often grouped by type
  Cellular components are often grouped by part
Each concept can have multiple parents
A concept's position is represented by a directed acyclic graph
Hierarchies are handcrafted so as to suit the ‘culture’ of biologists
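
The structure described on this slide can be sketched as a list of typed edges. This is an illustrative Python sketch, not GO's own file format: the term names A to D are placeholders, and the helper names are invented here. It uses graphlib from the Python 3.9+ standard library to confirm that the graph has no cycle.

```python
from graphlib import CycleError, TopologicalSorter

# (child, relationship, parent) triples with placeholder term names.
EDGES = [
    ("B", "is_a", "A"),
    ("C", "is_a", "A"),
    ("D", "is_a", "B"),
    ("D", "part_of", "C"),  # D has two parents, reached by different relationships
]


def parents(term, edges=EDGES):
    """Direct parents of a term as (relationship, parent) pairs."""
    return [(rel, parent) for child, rel, parent in edges if child == term]


def ancestors(term, edges=EDGES):
    """All terms reachable by following is_a or part_of links upwards."""
    found = set()
    stack = [term]
    while stack:
        for _, parent in parents(stack.pop(), edges):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_acyclic(edges):
    """True if the child -> parent links contain no cycle."""
    sorter = TopologicalSorter()
    for child, _, parent in edges:
        sorter.add(child, parent)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


print(sorted(ancestors("D")), is_acyclic(EDGES))
```

Keeping the relationship on each edge is what makes the principle of grouping explicit: a consumer can choose to follow only is_a links, only part_of links, or both.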

Page 19: Bio-ontologies for Annotation and Service Discovery
Page 20: Bio-ontologies for Annotation and Service Discovery

Feature 3:

Ontologies organise concepts in multiple ways for multiple uses. Principle of grouping should be explicit.

Page 21: Bio-ontologies for Annotation and Service Discovery

Taking it further

GO concepts are often phrases

insulin control element activator complex, insulin processing, insulin receptor, insulin receptor complex, insulin receptor ligand, insulin receptor signalling pathway, insulin secretion, insulin activated sodium/amino acid transporter

Components of phrase hidden to computer applications

Page 22: Bio-ontologies for Annotation and Service Discovery

Explicit conceptualisation

Semantic similarity searching
Automated maintenance of hierarchies.
What we need is..

A formal grammar with which to compose phrases

Software which can interpret phrases and produce sound and complete hierarchies

Page 23: Bio-ontologies for Annotation and Service Discovery

The exploding bicycle

ICD-9 (E826) 8
READ-2 (T30..) 81
READ-3 87
ICD-10 (V10-19) 587

V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income

W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity

X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities

Page 24: Bio-ontologies for Annotation and Service Discovery

Defusing the exploding bicycle: 500 codes in pieces

10 things to hit…
  Pedestrian / cycle / motorbike / car / HGV / train / unpowered vehicle / a tree / other
5 roles for the injured…
  Driving / passenger / cyclist / getting in / other
5 activities when injured…
  resting / at work / sporting / at leisure / other
2 contexts…
  In traffic / not in traffic

V12.24 Pedal cyclist injured in collision with two- or three-wheeled motor vehicle, unspecified pedal cyclist, nontraffic accident, while resting, sleeping, eating or engaging in other vital activities
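
The arithmetic behind "500 codes in pieces" can be checked with a short Python sketch. The facet values are copied from the slide; note that the first axis is announced as ten things but only nine are listed, so the enumeration over listed values comes out smaller than the stated total. The describe helper and its wording are illustrative assumptions, not an ICD-10 rubric.

```python
from itertools import product
from math import prod

# Values exactly as listed on the slide. The first axis is announced as
# "10 things to hit" but only nine values are shown.
FACETS = {
    "hit": ["pedestrian", "cycle", "motorbike", "car", "HGV", "train",
            "unpowered vehicle", "a tree", "other"],
    "role": ["driving", "passenger", "cyclist", "getting in", "other"],
    "activity": ["resting", "at work", "sporting", "at leisure", "other"],
    "context": ["in traffic", "not in traffic"],
}

# Axis sizes as stated on the slide.
STATED_SIZES = {"hit": 10, "role": 5, "activity": 5, "context": 2}

stated_codes = prod(STATED_SIZES.values())      # 10 * 5 * 5 * 2
stated_pieces = sum(STATED_SIZES.values())      # 10 + 5 + 5 + 2
listed_codes = list(product(*FACETS.values()))  # every combination of listed values


def describe(hit, role, activity, context):
    """Compose one description from four pieces (wording is illustrative)."""
    return f"{role} injured in collision with {hit}, {context}, while {activity}"


print(stated_codes, stated_pieces, len(listed_codes))
print(describe(*listed_codes[0]))
```

The design point is the ratio: a couple of dozen reusable pieces stand in for hundreds of pre-enumerated codes, and adding one value to one axis extends every combination at once.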

Page 25: Bio-ontologies for Annotation and Service Discovery

Coordination: Conceptual Lego

[Figure: building-block terms: hand, extremity, body, acute, chronic, abnormal, normal, ischaemic, deletion, bacterial, polymorphism, cell, protein, gene, infection, inflammation, Lung, expression]

Page 26: Bio-ontologies for Annotation and Service Discovery

Conceptual Lego

“SNPolymorphism of CFTRGene causing Defect in MembraneTransport of ChlorideIon causing Increase in Viscosity of Mucus in CysticFibrosis…”

“Hand which is anatomically normal”

Page 27: Bio-ontologies for Annotation and Service Discovery

DAML+OIL

Specifically designed to compose phrases in a compositional manner
Becoming a standard ontology interchange language
Adopted by W3C and will soon become Web Ontology Language (OWL)

Page 28: Bio-ontologies for Annotation and Service Discovery
Page 29: Bio-ontologies for Annotation and Service Discovery
Page 30: Bio-ontologies for Annotation and Service Discovery

Reasoning support

Consistency: check if knowledge is meaningful
Subsumption: structure knowledge, compute taxonomy
Equivalence: check if two classes denote same set of instances
Instantiation: check if individual i is an instance of class C
Retrieval: retrieve set of individuals that instantiate C
Problems all reducible to consistency (satisfiability)
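
As a hedged illustration of the last point, the usual reductions for a description logic with full negation can be written as follows. The symbols K (the knowledge base) and a (an individual) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
C \sqsubseteq D \ \text{w.r.t. } \mathcal{K}
  &\iff C \sqcap \neg D \ \text{is unsatisfiable w.r.t. } \mathcal{K} \\
C \equiv D \ \text{w.r.t. } \mathcal{K}
  &\iff C \sqsubseteq D \ \text{and} \ D \sqsubseteq C \\
\mathcal{K} \models a : C
  &\iff \mathcal{K} \cup \{\, a : \neg C \,\} \ \text{is inconsistent} \\
\mathrm{retrieve}(C)
  &= \{\, a \mid \mathcal{K} \models a : C \,\}
\end{align*}
```

So a reasoner that can decide satisfiability can, in principle, answer all five questions on the slide by repeated calls to that one test.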

Page 31: Bio-ontologies for Annotation and Service Discovery

Gene Ontology Next Generation

Early aim
  Proof of concept showing DAML+OIL & description logic can practically help in at least one aspect of GO maintenance.
  In cooperation with Mike Ashburner and the GO editorial team
Further aims
  Prototype an evolutionary environment in which the benefits can be replicated on a larger scale

Page 32: Bio-ontologies for Annotation and Service Discovery

Preliminary task

Providing an exhaustive is-a taxonomy

GO is-a poly-hierarchy

It becomes increasingly laborious to make sure that all concepts are linked to all possible is-a parents

Page 33: Bio-ontologies for Annotation and Service Discovery

Metabolism terms: e.g. heparin biosynthesis

Axis 1:

Chemicals

Axis 2:

Process

[chemical] biosynthesis (GO:0009058)

[i] carbohydrate biosynthesis (GO:0016051)

[i] aminoglycan biosynthesis (GO:0006023)

[i] heparin biosynthesis (GO:0030210)

[i] glycosaminoglycan biosynthesis (GO:0006024)

[i] heparin metabolism (GO:0030202)

[i] heparin biosynthesis (GO:0030210)

Page 34: Bio-ontologies for Annotation and Service Discovery

Is this important?

Complete taxonomy not necessary for browsing by biologist (and may actually get in the way)

BUT… improves fidelity of DB record retrieval.

Asking for records annotated with ‘glycosaminoglycan biosynthesis’ or more specific will lead to an additional result

O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
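
The effect on retrieval can be sketched in a few lines of Python. Everything here is illustrative: the asserted link is my reading of the flattened hierarchy listing in this deck, the inferred link is the one reported on the Output slide, the assumption that O94923 is annotated with "heparin biosynthesis" is mine, and REC_A is a made-up record.

```python
# is-a links as (child, parent) pairs.
ASSERTED = {("heparin biosynthesis", "heparin metabolism")}
INFERRED = {("heparin biosynthesis", "glycosaminoglycan biosynthesis")}

# record -> annotated term. The O94923 annotation is an assumption made for
# this sketch; REC_A is a made-up record.
ANNOTATIONS = {
    "O94923": "heparin biosynthesis",
    "REC_A": "glycosaminoglycan biosynthesis",
}


def term_and_descendants(term, is_a):
    """The query term plus every term below it via is-a links."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in is_a:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def retrieve(term, is_a, annotations=ANNOTATIONS):
    """Records annotated with the term 'or more specific'."""
    wanted = term_and_descendants(term, is_a)
    return {rec for rec, annotated in annotations.items() if annotated in wanted}


query = "glycosaminoglycan biosynthesis"
print(sorted(retrieve(query, ASSERTED)))
print(sorted(retrieve(query, ASSERTED | INFERRED)))
```

With only the asserted links the query misses the record annotated with the more specific term; once the missing is-a link is added, the same query returns it.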

Page 35: Bio-ontologies for Annotation and Service Discovery

How can we support the task?

Step 0. Translate to DAML+OIL syntax
  Provided by OilEd

Provide DAML+OIL based definitions of GO concepts – initially in the metabolism area

Page 36: Bio-ontologies for Annotation and Service Discovery

DAML+OIL definitions for metabolism concepts

heparin biosynthesis

class heparin biosynthesis defined
  subClassOf biosynthesis
  restriction onProperty acts_on hasClass heparin
  (acts_on is unique)

Paraphrase: biosynthesis which acts solely on heparin

glycosaminoglycan biosynthesis

class glycosaminoglycan biosynthesis defined
  subClassOf biosynthesis
  restriction onProperty acts_on hasClass glycosaminoglycan

Page 37: Bio-ontologies for Annotation and Service Discovery

Feature 4:

Ontologies provide a formal computer interpretable concept definition.

Page 38: Bio-ontologies for Annotation and Service Discovery

A chemical ontology

Initially used MeSH to create a DAML+OIL ontology from a subset of the chemical taxonomy (using UMLS tools/API)

Provides the following information

carbohydrates
  [i] polysaccharides
    [i] glycosaminoglycans
      [i] heparin

Page 39: Bio-ontologies for Annotation and Service Discovery

Reason over the combination

Combine GO definitions with chemical ontology using OilEd API

Send to FaCT DL reasoner…

Page 40: Bio-ontologies for Annotation and Service Discovery

Paraphrased reasoning process

heparin biosynthesis

class heparin biosynthesis defined
  subClassOf biosynthesis
  restriction onProperty acts_on hasClass heparin

glycosaminoglycan biosynthesis

class glycosaminoglycan biosynthesis defined
  subClassOf biosynthesis
  restriction onProperty acts_on hasClass glycosaminoglycan

Is-a
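
The reasoning paraphrased on this slide can be imitated with a toy structural check in Python. This is a sketch under strong assumptions and not how the FaCT reasoner works internally: it only handles definitions of the exact shape shown here (one named superclass plus one acts_on restriction), and the dictionary layout and function names are invented for illustration.

```python
# Chemical is-a chain (child -> parent) from the chemical ontology slide,
# written in the singular to match the names used in the definitions.
CHEMICAL_PARENT = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "polysaccharide",
    "polysaccharide": "carbohydrate",
}

# Defined class -> (named superclass, filler of the acts_on restriction).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def chemical_is_a(specific, general):
    """True if `specific` equals `general` or lies below it in the chain."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = CHEMICAL_PARENT.get(current)
    return False


def inferred_is_a(definitions=DEFINITIONS):
    """Pairs (child, parent) whose definitions force child is-a parent."""
    links = []
    for child, (child_base, child_chem) in definitions.items():
        for parent, (parent_base, parent_chem) in definitions.items():
            if child == parent:
                continue
            # Same kind of process, acting on a more specific chemical.
            if child_base == parent_base and chemical_is_a(child_chem, parent_chem):
                links.append((child, parent))
    return links


print(inferred_is_a())
```

The link between the two biosynthesis concepts is never written down by hand: it follows from the two definitions plus the fact that heparin is a glycosaminoglycan.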

Page 41: Bio-ontologies for Annotation and Service Discovery

Inferring a new is-a link

[Figure: the two definitions from the previous slide, now connected by two Is-a links]

Page 42: Bio-ontologies for Annotation and Service Discovery

Feature 5:

Ontologies can become a dynamic service with reasoning support.

Page 43: Bio-ontologies for Annotation and Service Discovery

Output

OilEd API reports additional inferred is-a relationships. E.g.

heparin biosynthesis has new is-a parent glycosaminoglycan biosynthesis

Sanitised version sent to GO editorial team for comment.

They (Jane Lomax) make changes to GO if appropriate and send back queries

Page 44: Bio-ontologies for Annotation and Service Discovery

Results

Carbohydrate metabolism
  22 additional is-a links
  17 of which now in GO
Amino acid metabolism
  Further 17 additional is-a links now in GO

Currently preparing results for metabolism as a whole

Page 45: Bio-ontologies for Annotation and Service Discovery

Where next with GONG?

Moving from proof of concept requires dedicated software tools to support the process.
  Authoring/curation of DAML+OIL definitions
  Tracking GO as it evolves
  Tracking suggested changes and response to changes.

Page 46: Bio-ontologies for Annotation and Service Discovery

myGrid & high level ontologies

myGrid: Personalised extensible environments for data-intensive in silico experiments in biology

Higher level services: workflow, databases, knowledge management, provenance…

Bioinformatics services are published as Web services (and soon Grid Services)

http://www.ebi.ac.uk/collab/mygrid/service0/axis/index.html

Page 47: Bio-ontologies for Annotation and Service Discovery

Ontologies for Service Discovery

Find appropriate type of services
  sequence alignment
Find appropriate instances of that service
  BLAST (an algorithm for sequence alignment), as delivered by NCBI
Assist in forming an appropriate assembly of discovered services.
Find, select and execute instances of services while the workflow is being enacted.

Knowledge in the head of expert bioinformatician

Page 48: Bio-ontologies for Annotation and Service Discovery

An in silico experiment as a workflow

[Figure: workflow (WF) with the labels Protein name, Fetch, Similar sequences, Structure modelling, Fetch, View, RASMOL]

Page 49: Bio-ontologies for Annotation and Service Discovery

Four-tiered service descriptions

1. Class of service:
• a protein sequence alignment, a protein sequence database.

2. Specific example of an abstract service:
• BLAST, SWISS-PROT.

3. Instance service description of a specific service:
• BLAST, SWISS-PROT as offered by the EBI.

4. Invoked instance service description:
• BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.

Domain “semantic”

Business “operational”
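
One way to picture the four tiers is as four record types, each pointing at the tier above it. This Python sketch is an illustration only: the class and field names are invented here, and the date and parameter are placeholders, not values from any real myGrid run.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class ServiceClass:        # tier 1: class of service
    name: str


@dataclass(frozen=True)
class AbstractService:     # tier 2: specific example of an abstract service
    name: str
    service_class: ServiceClass


@dataclass(frozen=True)
class ServiceInstance:     # tier 3: a service as offered by one provider
    service: AbstractService
    provider: str


@dataclass(frozen=True)
class Invocation:          # tier 4: one enactment of an instance
    instance: ServiceInstance
    invoked_on: date
    parameters: dict = field(default_factory=dict)


def tiers(invocation):
    """Walk from an invocation back up to its class of service."""
    inst = invocation.instance
    return (
        inst.service.service_class.name,
        inst.service.name,
        inst.provider,
        invocation.invoked_on.isoformat(),
    )


alignment = ServiceClass("protein sequence alignment")
blast = AbstractService("BLAST", alignment)
blast_at_ebi = ServiceInstance(blast, "EBI")
# Placeholder date and parameter, for illustration only.
run = Invocation(blast_at_ebi, date(2003, 1, 1), {"hypothetical_parameter": "value"})
print(tiers(run))
```

Discovery by meaning works on the upper tiers, while provenance and operational detail attach to the lower ones, so keeping them as separate records lets each be queried on its own terms.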

Page 50: Bio-ontologies for Annotation and Service Discovery

Service description phrases

Build up a phrase describing classes of service functionality.

Building blocks for phrase come from a suite of ontologies

Template for the description based on DAML-S specialised for bioinformatics.

Use reasoning to maintain a classification of services

Page 51: Bio-ontologies for Annotation and Service Discovery

Suite

[Figure: suite of ontologies: Bioinformatics ontology, Web service ontology, Task ontology, Publishing ontology, Informatics ontology, Molecular biology ontology, Organisation ontology, Upper level ontology. Two kinds of link are drawn between them:]

Specialises. All concepts are subclassed from those in the more general ontology.

Contributes concepts to form definitions.

Page 52: Bio-ontologies for Annotation and Service Discovery

parameters: input, output, precondition, effect
performs_task
uses-resource
is_function_of

Page 53: Bio-ontologies for Annotation and Service Discovery

class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task (aligning
    has_Class has_feature local
    has_Class has_feature pairwise)
  has_Class produces_result (report
    has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database
    has_Class contains (data
      has_Class encodes (sequence
        has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data
    has_Class encodes (sequence
      has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

Page 54: Bio-ontologies for Annotation and Service Discovery

class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task (aligning
    has_Class has_feature local
    has_Class has_feature pairwise)
  has_Class produces_result (report
    has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database
    has_Class contains (data
      has_Class encodes (sequence
        has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data
    has_Class encodes (sequence
      has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

Page 55: Bio-ontologies for Annotation and Service Discovery

Description driven classification

Page 56: Bio-ontologies for Annotation and Service Discovery

Client framework myGrid.version0

[Figure: architecture with the components Personal Repository (Meta Data), Ontology Server, Workflow Repository (Meta Data), Service Type Directory, Repository Client, Ontology Client, Workflow Client, Portal, Workflow enactment, Bioinformatics services, Service instance directory, DAML+OIL Reasoner (FaCT), Matcher and Ranker]

Page 57: Bio-ontologies for Annotation and Service Discovery

1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives.

2. Once the user has entered a partial description they submit it for matching. The results are displayed below.

3. The user adds the operation to the growing workflow.

4. The workflow specification is complete and ready to match against those in the workflow repository.
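
Steps 1 and 2 can be approximated with flat property matching. This Python sketch is a simplification made for illustration: the deck describes matching backed by a DAML+OIL reasoner, whereas the code below only compares property values for equality. Apart from BLAST-n_service_operation and the property names performs_task, requires_input and produces_result, every service name and value is hypothetical.

```python
# Service name -> property-based description. Only the first service name and
# the property names come from the slides; the rest is made up.
SERVICES = {
    "BLAST-n_service_operation": {
        "performs_task": "aligning",
        "requires_input": "nucleic_acid_sequence",
        "produces_result": "sequence_alignment_report",
    },
    "hypothetical_protein_alignment": {
        "performs_task": "aligning",
        "requires_input": "protein_sequence",
        "produces_result": "sequence_alignment_report",
    },
    "hypothetical_sequence_fetch": {
        "performs_task": "retrieving",
        "requires_input": "accession_number",
        "produces_result": "nucleic_acid_sequence",
    },
}


def match(partial, services=SERVICES):
    """Names of services whose description agrees with every given property."""
    return sorted(
        name
        for name, description in services.items()
        if all(description.get(prop) == value for prop, value in partial.items())
    )


def sensible_values(prop, partial, services=SERVICES):
    """Values for `prop` that still lead to at least one matching service."""
    return sorted(
        {services[name][prop] for name in match(partial, services) if prop in services[name]}
    )


partial = {"performs_task": "aligning"}
print(match(partial))
print(sensible_values("requires_input", partial))
```

The second function is what keeps the drop-down lists "sensible": once the user has chosen a task, only input types offered by some matching service remain selectable.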

Page 58: Bio-ontologies for Annotation and Service Discovery

Ontology grounds out

Link ontology to WSDL and UDDI

[Figure: WSDL elements (types, messages, portType, operation, binding, service), XML Schema, and UDDI elements (businessEntity, businessService, bindingTemplate, tModel)]

Page 59: Bio-ontologies for Annotation and Service Discovery

Other uses of ontology

Labelling data items in databases

Semantic typing for controlling inputs and outputs

Use by distributed query processing
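
Semantic typing of inputs and outputs can be sketched as a compatibility check between workflow steps: an output may feed an input if its type is the same as, or more specific than, the type the input expects. The type names, the hierarchy and the step names below are all hypothetical, chosen only to make the Python example run.

```python
# Hypothetical semantic types: child -> parent.
TYPE_PARENT = {
    "nucleic_acid_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
}


def is_a(specific, general):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = TYPE_PARENT.get(current)
    return False


def can_connect(output_type, input_type):
    """An output can feed an input that expects the same or a broader type."""
    return is_a(output_type, input_type)


def check_workflow(steps):
    """Return (producer, consumer) pairs whose types do not line up.

    Each step is (name, input_type, output_type).
    """
    problems = []
    for (producer, _, out_type), (consumer, in_type, _) in zip(steps, steps[1:]):
        if not can_connect(out_type, in_type):
            problems.append((producer, consumer))
    return problems


workflow = [
    ("fetch", "accession_number", "protein_sequence"),
    ("align", "sequence", "alignment_report"),
    ("model", "protein_structure", "image"),
]
print(check_workflow(workflow))
```

A purely syntactic type system would see only strings or XML documents at each link; the ontology is what lets the check distinguish a protein sequence from an alignment report.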

Page 60: Bio-ontologies for Annotation and Service Discovery

Ontology/registry issues

How to best integrate with existing registry technology such as UDDI?
How do ontological descriptions of data relate to type systems?
How big should the phrases become within the ontology?
Who builds these descriptions?

Page 61: Bio-ontologies for Annotation and Service Discovery

Summary

Different ontologies can have a different selection of features tailored to requirements
  Form a wide spectrum of resources
Powerful technology available

Harness it for end users

Page 62: Bio-ontologies for Annotation and Service Discovery

And finally.. predates computers

Linnaeus 18th Century
  Nomenclature/classification of species
  Language independent (Latin)
  Promoted sharing and integration of knowledge about related species
  A community effort – botanists / zoologists
Farr 19th Century
  Nomenclature of disease for consistent cause of death reporting
  Allowed aggregation/integration of data to discover new knowledge about the aetiology of Cholera.
  A community effort – surgeons

Page 63: Bio-ontologies for Annotation and Service Discovery

Links

All myGrid tools & ontology available from: http://www.mygrid.org.uk
GONG site: http://gong.man.ac.uk
Building ontologies site: http://oiled.man.ac.uk/building

Page 64: Bio-ontologies for Annotation and Service Discovery

Acknowledgements

Manchester metadata team
  Carole Goble, Robert Stevens, Sean Bechhofer, Phil Lord, Alan Rector, Jeremy Rogers, Chris Garwood
myGrid team
GO Consortium
  Esp. Mike Ashburner, Midori Harris, Jane Lomax

Page 65: Bio-ontologies for Annotation and Service Discovery

Sharing info
Sharing meaning

Metadata
  Data describing the content and meaning of resources and services.
  But everyone must speak the same language…
Terminologies
  Shared and common vocabularies
  For search engines, agents, curators, authors and users
  But everyone must mean the same thing…
Ontologies
  Shared and common understanding of a domain
  Essential for search, exchange and discovery

[Figure: many boxes, each labelled Service provider]

Page 66: Bio-ontologies for Annotation and Service Discovery

Origin and History

• Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is only indirectly possible. We do it by creating concepts that refer to things.

• The relation between symbols and things has been described in the form of the meaning triangle:

[Figure: meaning triangle, with the symbol “Jaguar” and the label Concept]

[Ogden, Richards, 1923]

Page 67: Bio-ontologies for Annotation and Service Discovery

So what is an ontology?

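[Figure: spectrum of what is called an ontology, with the labels Catalog/ID, Thesauri, Terms/glossary, Informal is-a, Formal is-a, Formal instance, Frames (properties), General logical constraints, Value restrictions, Disjointness, inverse, part-of; example resources placed along it: Gene Ontology, Mouse Anatomy, EcoCyc, PharmGKB, TAMBIS, Arom]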

[Deborah McGuinness, Stanford]

Page 68: Bio-ontologies for Annotation and Service Discovery

Human and machine communication

[Figure: Human Agent 1 and Human Agent 2 (HA1, HA2) exchange a symbol, e.g. via natural language (“JAGUAR”); Machine Agent 1 and Machine Agent 2 (MA1, MA2) exchange a symbol, e.g. via protocols. Other labels: Things, Symbol, Concept, Meaning Triangle, Internal models, Formal models, a specific domain (e.g. animals), Ontology Description, Ontology, Formal Semantics, and four “commit” links]

[Maedche et al., 2002]

Page 69: Bio-ontologies for Annotation and Service Discovery

Important life science ontologies

SWISS-PROT Keywords: the SWISS-PROT keyword list now has definitions (in natural language) associated with each keyword.
Edinburgh Anatomies: whole or partial anatomy ontologies for adult and developmental stages for several model organisms.
The Ingenuity company has a large knowledge base of experimental findings in biology. Currently, their ontology is not viewable.
The MGED ontology working group aims to develop ontologies for describing gene expression experiments and data.
Semiotes Regulatory Networks Model
PharmGKB: Pharmacogenetics Knowledge Base.
The TAMBIS ontology (TaO): an ontology of bioinformatics and molecular biology.
RiboWeb: an ontology describing ribosomal components, associated data and computations for processing those data.
EcoCyc: an ontology describing the genes, gene product function, metabolism and regulation within E. coli.
Molecular Biology Ontology (MBO): a general, reference ontology for molecular biology.
Gene Ontology (GO): an ontology describing the function, the process and cellular location of gene products from eukaryotes.
Mouse Genome Informatics GO browser
Mouse Anatomical Dictionary
ImMunoGeneTics (IMGT) Ontology
STAR/mmCIF: macromolecule structure ontology.
Signal Transduction Knowledge Environment (STKE).
GENAROM: ontology of gene product interactions.
GeneX: ontologies for comparing gene expression across species.
EpoDB Controlled Vocabulary: function, cell and tissue type, developmental stage and experimental type.
CBIL Controlled Vocabulary: terms for human anatomy.
Japan Bio-Ontology Committee: including Signal Transduction Ontology
FlyBase: controlled vocabulary for fly anatomy used for describing phenotypes.

Page 70: Bio-ontologies for Annotation and Service Discovery