How do we Collect Data for the Ontology?

AmphibiaTree 2006 Workshop, Saturday 11:30–11:45

J. Leopold



Outline

• Different approaches to ontology design
• Text mining
  – Manual curation
  – Document parsing
  – Potential problems
  – Methodology
  – Evaluation

Approaches to Ontology Design

Inspiration

• Start from a premise about why an ontology is needed
• Then design an ontology (from personal expertise about the domain) that aims to meet the recognized need

Warning: may be impractical, may lack theoretical underpinning

Approaches to Ontology Design

Induction

• Ontology developed by observing, examining, and analyzing specific case(s) in the domain
• Then the resulting ontological characterization for the specific case is applied to other cases in the same domain

Warning: may fit a specific case, but not be generalizable

Approaches to Ontology Design

Deduction

• Adopt general principles and adaptively apply them to specific cases
• Filter/distill general notions so they are customized to a particular domain subset

Warning: presupposes existence + selection of appropriate general characteristics from which ontology for specific cases can be devised

Approaches to Ontology Design

Synthesis

• Identify a base set of ontologies, no one of which subsumes any other
• Synthesize parts, creating a unified ontology

Warning: heavily relies on developers’ synthesis skills

Approaches to Ontology Design

Collaboration

• Development is a joint effort reflecting experiences + viewpoints of persons who intentionally cooperate to produce it

• Can start from a proposed ontology with iterative improvements

Advantages: diverse vantage points, builds commitment by iteratively reducing participants’ objections

Collaborative Approach

Preparation → Anchoring → Application → Iterative Improvement

• Preparation: define design criteria; determine boundary conditions; determine evaluation standards
• Anchoring: specify initial seed ontology
• Application: demonstrate uses of the ontology
• Iterative Improvement: identify diverse participants; elicit critiques & comments; revise to address feedback; iterate until consensus

Text Mining

• What about instantiation?
• Experts can design the ontology (classes, hierarchy, etc.)
• But need to systematically go through the literature to identify instances and their properties

• Particularly important to accommodate diversity

Text Mining

Goals:
• Discover new instances and properties
• Increase strength of existing annotations by locating additional paper evidence

FlyBase Curation

• Watch list of ~35 journals
• Each curator inspects the latest issue of a journal to identify papers to curate
• So curation takes place on a paper-by-paper basis (as opposed to topic-by-topic)

FlyBase Curation

• Curator fills out a record for each paper
• Some fields require rephrasing, paraphrasing, summarization
• Other fields record very specific facts using terms from ontologies

FlyBase Curation

• Software like PaperBrowser presents an enhanced display of the text, with recognized terms highlighted (via Named Entity Recognition)

• Parser identifies boundaries of the NP around each term name and its grammatical relations to other NPs in the text

Document Parsing

• PDF is the only standard electronic format in which all relevant papers are available
• PDF-to-text processors are not aware of the typesetting of each journal, and have trouble with some formatting (e.g., 2-column text, footnotes, headers, figure captions, etc.)
• Document parsing is best done with optical character recognition (OCR)
• For images, their captions can be parsed

Potential Problems for Text Mining

• Lexical ambiguity (e.g., words that denote > 1 concept)

• Polysemy (e.g., term present in 2 papers denotes different concepts)

• Abbreviation (e.g., same concept, but different abbreviations in different papers)

Potential Problems for Text Mining

• Digit removal (e.g., 4-hydroxybutan… vs. 2-hydroxybutan…)

• Stemming (e.g., removing prefixes, suffixes, etc.)

• Stop word removal (e.g., “the”, “a”)

Need a domain-specific text miner!!!
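The digit-removal pitfall above is easy to demonstrate. Below is a minimal sketch in Python of a naive, domain-agnostic normalizer; the stop-word list and the example chemical names are hypothetical, chosen only to show how digit stripping collapses chemically distinct terms:

```python
import re

# Hypothetical, domain-agnostic stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in"}

def normalize(text):
    """Lowercase, strip digits, and drop stop words (naively)."""
    tokens = text.lower().split()
    tokens = [re.sub(r"\d+", "", t) for t in tokens]          # digit removal
    return [t for t in tokens if t and t not in STOP_WORDS]   # stop-word removal

# Digit removal merges distinct compounds into one token -- exactly the
# problem a domain-specific text miner must avoid:
print(normalize("the 4-hydroxybutanal"))  # ['-hydroxybutanal']
print(normalize("a 2-hydroxybutanal"))    # ['-hydroxybutanal']
```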

Methodology

• Extract textual elements from papers identifying a term in the ontology
• Construct patterns with reliability scores (confidence that the pattern represents the term)
• Extend the pattern set with longer patterns
• Apply semantic pattern matching techniques (i.e., consider synonyms)
• Annotate terms based on the quality of the match between the pattern and the concept occurring in the text

Training Phase

Objective: construct a set of patterns that characterize indicators for annotation

(1) Find terms in the “training set” papers
(2) Extract significant terms/phrases that appear in the papers
(3) Construct patterns based on significant terms/phrases and the terms surrounding them

Annotation Phase

(1) Look for possible matches to the patterns in the papers

(2) Compute a matching score which indicates the strength of the prediction

(3) Determine the term to be associated with the pattern match

(4) Order new annotation predictions by their scores, and present to user

Pattern Construction

• structured as { LEFT } < MIDDLE > { RIGHT }
• <MIDDLE> is an ordered sequence of significant terms (i.e., identifying elements)
• {LEFT} and {RIGHT} are sets of words that appear around significant terms (i.e., auxiliary descriptors)
• number of words in {LEFT} and {RIGHT} can be limited
• stop words not included in patterns

Pattern Construction

Example: pattern template { LEFT } < rna polymerase ii > { RIGHT }

pattern1: { increase catalytic rate } < rna polymerase ii > { transcription suppressing transient }

pattern2: { proteins regulation transcription } < rna polymerase ii > { initiated search proteins }

Pattern Construction

Example: pattern template { LEFT } < invests > { RIGHT }

pattern1: { frontoparietal } < invests > { sphenethmoid }

pattern2: { anterior ramus pterygoid } < invests > { planum antorbitale }
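The { LEFT } < MIDDLE > { RIGHT } structure above can be sketched in a few lines of Python. This is a minimal illustration assuming a simple token-window construction; the window size and stop-word list are assumptions, not details given on the slides:

```python
# Hypothetical stop-word list; the actual list is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}

def build_pattern(tokens, start, end, window=3):
    """Build a { LEFT } < MIDDLE > { RIGHT } pattern around the
    significant term tokens[start:end].

    <MIDDLE> preserves word order; {LEFT} and {RIGHT} are unordered sets
    of nearby non-stop words, limited to `window` words on each side.
    """
    middle = tuple(tokens[start:end])
    left = {t for t in tokens[max(0, start - window):start] if t not in STOP_WORDS}
    right = {t for t in tokens[end:end + window] if t not in STOP_WORDS}
    return (left, middle, right)

tokens = "the frontoparietal invests the sphenethmoid".split()
print(build_pattern(tokens, 2, 3))
# ({'frontoparietal'}, ('invests',), {'sphenethmoid'})
```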

Pattern Scoring

Calculate score representing how confidently a pattern represents a term

MT = source of <middle>

Patterns whose <middle> exactly matches an ontology term get a higher score

Pattern Scoring

Calculate score representing how confidently a pattern represents a term

TT = type of individual terms in the <middle>

Considers occurrence frequency of a word in <middle> among all ontology terms, and position of word in an ontology term (gets more specific from right to left)

Pattern Scoring

Calculate score representing how confidently a pattern represents a term

PP = term-wise paper frequency of <middle>

Patterns with <middle> that is highly frequent in the paper dataset get higher scores
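The three components (MT, TT, PP) might be combined as below. This is a hedged sketch: the slides give only the intuition behind each component, so these formulas and weights are illustrative stand-ins, not the actual scoring functions:

```python
# Illustrative stand-ins for the MT, TT, and PP scoring components;
# the real formulas and weights are not given on the slides.

def mt_score(middle, ontology_terms):
    """MT: reward a <middle> that exactly matches an ontology term."""
    return 1.0 if " ".join(middle) in ontology_terms else 0.5

def tt_score(middle, ontology_terms):
    """TT: treat <middle> words that are rare across all ontology terms
    as more specific, hence stronger indicators."""
    words = [w for term in ontology_terms for w in term.split()]
    freqs = [words.count(w) / len(words) for w in middle]
    return 1.0 - sum(freqs) / len(freqs)

def pp_score(middle, papers):
    """PP: fraction of papers in which <middle> occurs."""
    phrase = " ".join(middle)
    return sum(phrase in paper for paper in papers) / len(papers)

def pattern_score(middle, ontology_terms, papers, weights=(0.5, 0.3, 0.2)):
    """Hypothetical weighted combination of the three components."""
    return (weights[0] * mt_score(middle, ontology_terms)
            + weights[1] * tt_score(middle, ontology_terms)
            + weights[2] * pp_score(middle, papers))
```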

Evaluation

Recall = (correct responses by software) / (all human responses)

Precision = (correct responses by software) / (all responses by software)
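Both measures are direct to compute. A minimal sketch, assuming software and human annotations are represented as sets of comparable items (the annotation IDs below are made up):

```python
def recall(software, human):
    """Correct software responses / all human responses."""
    return len(software & human) / len(human)

def precision(software, human):
    """Correct software responses / all software responses."""
    return len(software & human) / len(software)

# Hypothetical annotation sets: the software found a1 and a2 correctly
# and produced one spurious annotation (x9).
human = {"a1", "a2", "a3", "a4"}
software = {"a1", "a2", "x9"}
print(recall(software, human))     # 0.5
print(precision(software, human))  # 0.666...
```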

Discussion