the past, present and future of knowledge in biology

Post on 21-May-2015


DESCRIPTION

Keynote talk at SMBM 2010

TRANSCRIPT

The Past, Present and Future of Knowledge in Biology

Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom

Robert.Stevens@manchester.ac.uk

Overview

• A look at the state of play
• For what are we using ontologies?
• What do we count as knowledge?
• Doing so much more with knowledge
• Stopping text being a dead end

Text and Ontologies: The Terrible Twins of Knowledge in Biology


Biology now has lots of facts

Genome

Proteome

Transcriptome

Interactome

Metabolome

PHENOME

Lots of catalogues

Data are only as Good as their Metadata

• There is a lot of biology out there…
• How these entities are described in our data varies
• We don’t even agree on what entities there are to describe in our data
• This makes analysing data hard: you have to know what your data represent
• …, but also how the entities described in your data relate to each other
• We need to describe our data – their metadata

Creating Woods, not Trees

Genes

Proteins

Pathways

Interactions

Literature

Complex Machines

Virtual Organism

… from biological facts, we make a system that is some model of a real organism

Timeline

There’s a Lot of it About

Searching for “ontology” in five year chunks on the ACM digital portal

Searching for “ontology” in five year chunks on PubMed

It’s all Gruber’s Fault

• “In the context of knowledge sharing, the term ontology means a specification of a conceptualisation. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy.” DOI:10.1006/knac.1993.1008 DOI:10.1006/ijhc.1995.1081

Angels on the head of a pin

Everything with a Blob and Line is called an Ontology

• Wide acceptance criteria
• Narrow evaluation criteria
• Different sort of knowledge for different situations
• Different styles of representation; some scruffy and some formal
• Representing knowledge in biology is more than ontologies
• We could stop calling them ontologies

RDF graph

Database schema

Thesaurus

OWL Ontology

Formal ontology

SKOS vocabulary

Uses of Ontologies

Knowing What We’ve got is so Useful

• We could computationally handle lots of data, but we couldn’t do so with what we know about those data
• Ontologies so far mainly used for a common tongue so that we can compare
• … and it works!
• Still getting lots of mileage from ontology annotation
• …, but there is so much more

GENERIC GENE ONTOLOGY (GO) TERM FINDER

S000003093 MXR1 YPL250C S000004294 SAM3 YIR017C S000003152 MMP1 MET1

Expressed Genes

P-value score

http://go.princeton.edu/cgi-bin/GOTermFinder
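The P-value score in a term finder of this kind is usually an enrichment statistic: how surprising is it that so many of the expressed genes share one GO term? A minimal sketch, assuming a one-sided hypergeometric test (the function name and all counts are hypothetical, and this is not the Princeton tool's own code):

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes by chance.

    population: all genes considered (hypothetical background set)
    annotated:  genes in the population carrying the GO term
    sample:     size of the expressed-gene list
    hits:       genes in the list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(enrichment_p_value(population=6000, annotated=40, sample=9, hits=3))
```

Real tools also correct for testing many terms at once; that step is left out here.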

Classifying a Mouse

Individual Description:

Stops wriggling after 3 sec

Has 3 cm tail

Mass 10g

10 days old (since birth)

Strain C57Bl/6

Class Description:

Class: DepressedMouse
EquivalentTo: Mouse that (wriggles for <= 30 OR swims for <= 45)
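The class description acts as a recognition rule: any individual whose recorded data satisfy the condition is classified as a DepressedMouse. A minimal sketch of that idea in plain Python rather than OWL, assuming both thresholds are in seconds (the slide gives no units) and using hypothetical field names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MouseObservation:
    # Both durations are assumed to be in seconds; None means "not measured".
    wriggle_seconds: Optional[float] = None
    swim_seconds: Optional[float] = None


def is_depressed_mouse(m: MouseObservation) -> bool:
    """Mirror of: Mouse that (wriggles for <= 30 OR swims for <= 45)."""
    wriggles_briefly = m.wriggle_seconds is not None and m.wriggle_seconds <= 30
    swims_briefly = m.swim_seconds is not None and m.swim_seconds <= 45
    return wriggles_briefly or swims_briefly


if __name__ == "__main__":
    # The individual from the slide stops wriggling after 3 sec.
    individual = MouseObservation(wriggle_seconds=3)
    print(is_depressed_mouse(individual))
```

An OWL reasoner does the same job declaratively: the rule lives in the ontology, not in the software.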

Data Transformation

Short tailed mouse

Class: ShortTailedMouse
EquivalentTo: Mouse that hasPart EXACTLY 1 (Tail that hasAssay SOME (LengthAssay that hasValue SOME int[<= 20] and hasUnit SOME Millimetre))
SubClassOf: Mouse that hasPart SOME (Tail that hasQuality SOME Short)

• We can recognise an instance of short-tailed mouse, but we also know that it has the quality “short”

• Even when the fact isn’t asserted
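The two halves of the definition do different jobs: the EquivalentTo part recognises a short-tailed mouse from a measurement, and the SubClassOf part then adds the quality “short” without anyone asserting it. A minimal sketch of that two-step inference in plain Python (function and key names are hypothetical; a real reasoner works over the OWL axioms directly):

```python
SHORT_TAIL_MAX_MM = 20  # threshold taken from the class definition


def classify_tail(tail_length_mm: float) -> dict:
    """Return the classes and tail qualities inferred from a length assay."""
    inferred = {"classes": ["Mouse"], "tail_qualities": []}
    # Step 1 (EquivalentTo): recognise the class from the measured value.
    if tail_length_mm <= SHORT_TAIL_MAX_MM:
        inferred["classes"].append("ShortTailedMouse")
        # Step 2 (SubClassOf): membership implies the quality, unasserted.
        inferred["tail_qualities"].append("Short")
    return inferred


if __name__ == "__main__":
    print(classify_tail(15))
    # A 3 cm tail must first be transformed to 30 mm before the test applies.
    print(classify_tail(3 * 10))
```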


Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV……

…..

InterPro

Instance Store

Reasoner

Translate

Codify

OWL’s Automated Reasoners

• Demonstrably useful in:
– Building ontologies
– Querying ontologies
– Can automatically annotate
– Have made “discoveries”
But there is more than OWL’s reasoning

Separation of Knowledge and Software

• We realised a long time ago that we needed to separate knowledge from software
• We only recently called this knowledge component ontology
• We don’t really need to see the ontology
• We certainly shouldn’t show people OWL; it “scares the horses”
• Ontology for software not humans (L. Hunter)

The Ontology cottage Industry

• We’ve industrialised data production
• We’ve (to some extent) industrialised data analysis
• We’ve not really moved away from hand-crafted, “whittled” ontologies

Can we have Mass Editing of Ontologies?

• Probably not
• Computer scientists in love with synchronous editing
• …, but not really necessary (see CSCW)
• Mass gathering of Knowledge

Mass Gathering of Knowledge and the Application of Patterns or a metamodel

http://rightfield.org.uk
http://www.e-lico.eu/populous

There’s so much more to Ontology Building than editing Axioms

• Gathering knowledge
• Adding labels
• Adding other human orientated content
• Reviewing, checking, suggesting
• Deploying, using, creating “views”
• Ontology comprehension

There’s More to KR than OWL

• OWL and its automated reasoners are useful
• But there is so much more to KR than ontologies and OWL
• Higher order reasoning
• Rules
• Other sorts of reasoning

Generating natural language

Class: HeLa
SubClassOf: Cell,
bearer_of some 'cervical carcinoma',
derives_from some 'Homo sapiens',
derives_from some cervix,
derives_from some 'epithelial cell'

OWL

HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.

Generated natural language
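The step from OWL axioms to sentences can be approximated with simple templates. A minimal sketch, assuming each restriction is reduced to a (property, filler) pair and that property phrases come from a small hand-written table; the function names are hypothetical and this is not the generator used for the slide, so its wording and clause order differ slightly:

```python
# Hand-written phrases for the two properties used in the HeLa example.
PROPERTY_PHRASES = {"bearer_of": "is bearer of", "derives_from": "derives from"}


def article(noun: str) -> str:
    """Pick 'a' or 'an' from the first letter (a crude heuristic)."""
    return "an" if noun[:1].lower() in "aeiou" else "a"


def verbalise(name: str, parent: str, restrictions: list) -> str:
    """Turn a class with 'some' restrictions into one or two sentences."""
    sentence = f"{name} is {article(parent)} {parent.lower()}."
    if not restrictions:
        return sentence
    clauses = []
    for prop, filler in restrictions:
        phrase = PROPERTY_PHRASES.get(prop, prop.replace("_", " "))
        clauses.append(f"something that {phrase} {article(filler)} {filler.lower()}")
    if len(clauses) == 1:
        body = clauses[0]
    else:
        body = ", ".join(clauses[:-1]) + ", and " + clauses[-1]
    return (
        f"{sentence} {article(name).capitalize()} {name.lower()} "
        f"is all of the following: {body}."
    )


if __name__ == "__main__":
    print(verbalise("HeLa", "Cell", [
        ("bearer_of", "cervical carcinoma"),
        ("derives_from", "Homo sapiens"),
        ("derives_from", "cervix"),
        ("derives_from", "epithelial cell"),
    ]))
```

The ontology stays the single source; the text is regenerated whenever the axioms change.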

Experimental Factor Ontology (EFO)
http://www.ebi.ac.uk/efo

Ontology as book

Title: Experimental Factor Ontology

Table of Contents

Chapter 1. Cell line
Chapter 2. Cell type
Chapter 3. Chemical Compound
Chapter 4. Organism

HeLa is a cell line. A hela is all of the following: something that is bearer of a cervical carcinoma, something that derives from a homo sapiens, something that derives from an epithelial cell, and something that derives from a cervix.

entry

Data

Types of Knowledge

Biologist’s head

Papers

Databases

Ontologies

??????

It’s not Just “Things”

• Experiments produce data about things
• Proteins, genes, chemicals, reactions, diseases, size, shape, speed, …
• As well as this knowledge we have knowledge of how it was done
• OBI is still the “things” to do with production
• We still need the methods by which these “things” were deployed
• The protocol

Knowledge about an experiment

Workflow Run

Workflow

Provenance

Organisational

Results and Interpretation

Workflows are knowledge about methods

Get genes in region

Get pathways that contain genes

Merge data into single files

Get gene descriptions

Get pathway descriptions

Cross-reference ids

Methods:

1. A QTL (region of chromosome) is entered into the workflow, specified as base pairs. These base pairs are subsequently used to identify, in the Ensembl database, any genes that lie within this region.

2. Any genes found within this region are subsequently annotated with Entrez and UniProt identifiers.

3. The Entrez and UniProt identifiers are then passed to a KEGG id conversion Web Service, to cross-reference the input ids to KEGG gene identifiers. This enables gene descriptions and biological pathway data to be returned from KEGG.

4. Each KEGG gene id is then used in a search for KEGG pathways. Any pathways found to contain the gene are returned as KEGG pathway ids.

5. Both KEGG gene and pathway ids are then sent to individual services, provided by KEGG, which provide a description of the gene and pathway.

6. The outputs of the workflow are then combined into single flat files, which can be saved locally and used to identify novel pathways and genes within the QTL region.
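The six method steps read almost directly as a pipeline. A minimal sketch, assuming each external service (Ensembl lookup, KEGG id conversion, KEGG pathway search, KEGG descriptions) is passed in as a plain callable; every function and parameter name here is hypothetical, no real web-service API is called, and the Entrez/UniProt annotation of step 2 is folded into the id conversion for brevity:

```python
from typing import Callable, Dict, List


def run_qtl_workflow(
    region: str,
    genes_in_region: Callable[[str], List[str]],
    to_kegg_id: Callable[[str], str],
    pathways_for: Callable[[str], List[str]],
    describe_gene: Callable[[str], str],
    describe_pathway: Callable[[str], str],
) -> Dict[str, dict]:
    # Steps 1-2: find the genes lying inside the QTL region.
    genes = genes_in_region(region)
    # Step 3: cross-reference each gene to a KEGG gene id.
    kegg_ids = [to_kegg_id(g) for g in genes]
    # Step 4: look up the pathways that contain each gene.
    gene_to_pathways = {k: pathways_for(k) for k in kegg_ids}
    # Step 5: fetch descriptions for genes and for each distinct pathway.
    gene_descriptions = {k: describe_gene(k) for k in kegg_ids}
    pathway_ids = sorted({p for ps in gene_to_pathways.values() for p in ps})
    pathway_descriptions = {p: describe_pathway(p) for p in pathway_ids}
    # Step 6: merge everything into one result for saving or inspection.
    return {
        "genes": gene_descriptions,
        "pathways": pathway_descriptions,
        "gene_to_pathways": gene_to_pathways,
    }


if __name__ == "__main__":
    # Stub services returning made-up data, standing in for the real ones.
    fake_pathways = {"kegg:g1": ["p1"], "kegg:g2": ["p1", "p2"]}
    result = run_qtl_workflow(
        "chr1:1000-2000",
        genes_in_region=lambda r: ["g1", "g2"],
        to_kegg_id=lambda g: "kegg:" + g,
        pathways_for=lambda k: fake_pathways[k],
        describe_gene=lambda k: "description of " + k,
        describe_pathway=lambda p: "description of " + p,
    )
    print(result["pathways"])
```

Passing the services in as arguments keeps the method itself, the knowledge, separate from whichever service implementations happen to be available.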

myExperiment

http://www.myexperiment.org

Research Objects

Method

Data

Introduction

Conclusions

Results

Human Written

Workflow

Generated Text

Semantically annotated

Model, View, Controller

Annotated Data

Controller

Projection

Text

Tables

Graphs

Steve Pettifer
http://utopia.cs.man.ac.uk/

What Next?

• Ontologies are not the only fruit
• We could stop calling them ontologies
• We need to produce “ontologies” faster
• We need to do more interesting things with our knowledge
• We need to make them pervade our tools
• We need them to be “agile”
• Open to other forms of KR and other forms of reasoning
• Adding to data automatically
• Generating our descriptions of data

Acknowledgements

• Simon Jupp for the slides
• Alan Rector and Carole Goble
• SysMO-DB for RightField (Katy Wolstencroft, Stuart Owen, Matt Horridge)
• Populous – Simon Jupp
• SWAT – Richard Power, Sandra Williams and Allan Third at the OU
• EFO – James Malone and Helen Parkinson
• Steve Pettifer for the Utopia and MVC
• Paul Fisher and the Taverna team
• The myExperiment team at Southampton and Manchester
