How Do We Collect Data for the Ontology? AmphibiaTree 2006 Workshop, Saturday 11:30–11:45, J....
TRANSCRIPT
Outline
• Different approaches to ontology design
• Text mining
  – Manual curation
  – Document parsing
  – Potential problems
  – Methodology
  – Evaluation
Approaches to Ontology Design
Inspiration
• Start from a premise about why an ontology is needed
• Then design an ontology (from personal expertise about the domain) that aims to meet the recognized need
Warning: may be impractical, may lack theoretical underpinning
Approaches to Ontology Design
Induction
• Ontology developed by observing, examining, and analyzing specific case(s) in the domain
• The resulting ontological characterization for the specific case is then applied to other cases in the same domain
Warning: may fit a specific case but not be generalizable
Approaches to Ontology Design
Deduction
• Adopt general principles and adaptively apply them to specific cases
• Filter/distill general notions so they are customized to a particular domain subset
Warning: presupposes the existence and selection of appropriate general characteristics from which an ontology for specific cases can be devised
Approaches to Ontology Design
Synthesis
• Identify a base set of ontologies, no one of which subsumes any other
• Synthesize parts, creating a unified ontology
Warning: relies heavily on the developers' synthesis skills
Approaches to Ontology Design
Collaboration
• Development is a joint effort reflecting the experiences and viewpoints of persons who intentionally cooperate to produce it
• Can start from a proposed ontology, with iterative improvements
Advantages: diverse vantage points; builds commitment by iteratively reducing participants' objections
Collaborative Approach
Preparation
• Define design criteria
• Determine boundary conditions
• Determine evaluation standards
Anchoring
• Specify initial seed ontology
• Identify diverse participants
Iterative Improvement
• Elicit critiques & comments
• Revise to address feedback
• Iterate until consensus
Application
• Demonstrate uses of the ontology
Text Mining
• What about instantiation?
• Experts can design the ontology (classes, hierarchy, etc.)
• But we need to systematically go through the literature to identify instances and their properties
• Particularly important to accommodate diversity
Text Mining
Goals:
• Discover new instances and properties
• Increase the strength of existing annotations by locating additional evidence in papers
FlyBase Curation
• Watch list of ~35 journals
• Each curator inspects the latest issue of a journal to identify papers to curate
• So curation takes place on a paper-by-paper basis (as opposed to topic-by-topic)
FlyBase Curation
• The curator fills out a record for each paper
• Some fields require rephrasing, paraphrasing, or summarization
• Other fields record very specific facts using terms from ontologies
FlyBase Curation
• Software like PaperBrowser presents an enhanced display of the text with recognized terms highlighted (e.g., via Named Entity Recognition)
• A parser identifies the boundaries of the NP around each term name and its grammatical relations to other NPs in the text
Document Parsing
• PDF is the only standard electronic format in which all relevant papers are available
• PDF-to-text processors are not aware of each journal's typesetting and have trouble with some formatting (e.g., two-column text, footnotes, headers, figure captions)
• Document parsing is best done with optical character recognition (OCR)
• For images, their captions can be parsed
Potential Problems for Text Mining
• Lexical ambiguity (e.g., words that denote more than one concept)
• Polysemy (e.g., the same term in two papers denotes different concepts)
• Abbreviation (e.g., the same concept abbreviated differently in different papers)
Potential Problems for Text Mining
• Digit removal (e.g., 4-hydroxybutan… vs. 2-hydroxybutan…)
• Stemming (e.g., removing prefixes, suffixes, etc.)
• Stop word removal (e.g., “the”, “a”)
Need a domain-specific text miner!!!
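The pitfalls above can be sketched with a toy generic normalizer; the chemical names and stop-word list below are illustrative, not from the talk:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in"}

def naive_normalize(term: str) -> str:
    """Generic preprocessing: lowercase, drop stop words, strip digits."""
    words = term.lower().split()
    words = [re.sub(r"\d+", "", w) for w in words if w not in STOP_WORDS]
    return " ".join(w for w in words if w)

# Digit removal collapses chemically distinct species onto one string:
print(naive_normalize("2-hexanone"))  # "-hexanone"
print(naive_normalize("3-hexanone"))  # "-hexanone" -- same key, different compounds
```

This is exactly why a domain-specific text miner is needed: steps that are harmless for newswire text destroy meaning in biological and chemical nomenclature.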
Methodology
• Extract textual elements from papers that identify a term in the ontology
• Construct patterns with reliability scores (confidence that the pattern represents the term)
• Extend the pattern set with longer patterns
• Apply semantic pattern-matching techniques (i.e., consider synonyms)
• Annotate terms based on the quality of the match between a pattern and a concept occurring in the text
Training Phase
Objective: construct a set of patterns that characterize indicators for annotation
(1) Find terms in the "training set" papers
(2) Extract significant terms/phrases that appear in the papers
(3) Construct patterns based on the significant terms/phrases and the terms surrounding them
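A minimal sketch of step (2), assuming significance is approximated by cross-paper frequency; the threshold and stop-word list are illustrative choices, not from the talk:

```python
from collections import Counter

def significant_terms(papers, min_count=2,
                      stop_words=frozenset({"the", "a", "of", "and", "in"})):
    """Step (2): keep words that recur across the training papers.
    The frequency threshold is an illustrative stand-in for whatever
    significance test the real system uses."""
    counts = Counter(
        w for paper in papers
        for w in paper.lower().split()
        if w not in stop_words
    )
    return {w for w, c in counts.items() if c >= min_count}

papers = [
    "the frontoparietal invests the sphenethmoid",
    "anterior ramus of the pterygoid invests the planum",
]
print(significant_terms(papers))  # {'invests'}
```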
Annotation Phase
(1) Look for possible matches to the patterns in the papers
(2) Compute a matching score which indicates the strength of the prediction
(3) Determine the term to be associated with the pattern match
(4) Order new annotation predictions by their scores, and present to user
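Steps (1)–(2) can be sketched as follows; the context-overlap ratio below is a simple stand-in for the matching score, since the talk does not give the actual formula:

```python
def match_score(pattern, text, window=3):
    """Score an occurrence of the pattern's <MIDDLE> in `text` by how many
    of its {LEFT}/{RIGHT} context words appear nearby (overlap ratio)."""
    left, middle, right = pattern          # {LEFT}, <MIDDLE>, {RIGHT}
    words = text.lower().split()
    mid = middle.split()
    n = len(mid)
    best = 0.0
    for i in range(len(words) - n + 1):
        if words[i:i + n] == mid:          # step (1): possible match in paper
            ctx_left = set(words[max(0, i - window):i])
            ctx_right = set(words[i + n:i + n + window])
            hits = len(ctx_left & left) + len(ctx_right & right)
            best = max(best, hits / (len(left) + len(right)))  # step (2)
    return best

pattern = ({"frontoparietal"}, "invests", {"sphenethmoid"})
text = "The frontoparietal invests the sphenethmoid anteriorly"
print(match_score(pattern, text))  # 1.0
```

Predictions for a batch of papers would then be sorted by this score before being shown to the curator, as in step (4).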
Pattern Construction
• Structured as { LEFT } < MIDDLE > { RIGHT }
• <MIDDLE> is an ordered sequence of significant terms (i.e., identifying elements)
• {LEFT} and {RIGHT} are sets of words that appear around the significant terms (i.e., auxiliary descriptors)
• The number of words in {LEFT} and {RIGHT} can be limited
• Stop words are not included in patterns
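A minimal sketch of building one such pattern from a sentence, under the constraints above (stop words excluded, context size limited); the limit of three non-stop words per side is an assumption chosen to mirror the slides' examples:

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def build_pattern(sentence, term, limit=3):
    """Build a { LEFT } < MIDDLE > { RIGHT } pattern around the first
    occurrence of `term`: up to `limit` non-stop context words per side."""
    words = sentence.lower().split()
    mid = term.lower().split()
    n = len(mid)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == mid:
            left = [w for w in words[:i] if w not in STOP_WORDS][-limit:]
            right = [w for w in words[i + n:] if w not in STOP_WORDS][:limit]
            return (left, mid, right)
    return None

sentence = "The anterior ramus of the pterygoid invests the planum antorbitale"
print(build_pattern(sentence, "invests"))
# (['anterior', 'ramus', 'pterygoid'], ['invests'], ['planum', 'antorbitale'])
```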
Pattern Construction
Example: pattern template { LEFT } < rna polymerase ii > { RIGHT }
pattern1: { increase catalytic rate } < rna polymerase ii > { transcription suppressing transient }
pattern2: { proteins regulation transcription } < rna polymerase ii > { initiated search proteins }
Pattern Construction
Example: pattern template { LEFT } < invests > { RIGHT }
pattern1: { frontoparietal } < invests > { sphenethmoid }
pattern2: { anterior ramus pterygoid } < invests > { planum antorbitale }
Pattern Scoring
Calculate a score representing how confidently a pattern represents a term
MT = source of <middle>
Patterns whose <middle> exactly matches an ontology term get a higher score
Pattern Scoring
Calculate score representing how confidently a pattern represents a term
TT = type of individual terms in the <middle>
Considers the occurrence frequency of each word of <middle> among all ontology terms, and the position of the word within an ontology term (terms get more specific from right to left)
Pattern Scoring
Calculate score representing how confidently a pattern represents a term
PP = term-wise paper frequency of <middle>
Patterns whose <middle> is highly frequent in the paper dataset get higher scores
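The three signals (MT, TT, PP) could be combined as in this toy sketch; the simplified TT measure (which ignores word position) and the equal weighting are assumptions, since the talk gives no formulas:

```python
def pattern_score(middle, ontology_terms, paper_texts):
    """Toy confidence score combining the three signals named in the talk:
    MT: exact match of <middle> against an ontology term,
    TT: fraction of <middle> words appearing in any ontology term,
    PP: fraction of papers whose text contains <middle>."""
    middle = middle.lower()
    mt = 1.0 if middle in ontology_terms else 0.0
    vocab = {w for t in ontology_terms for w in t.split()}
    words = middle.split()
    tt = sum(w in vocab for w in words) / len(words)
    pp = sum(middle in p.lower() for p in paper_texts) / len(paper_texts)
    return (mt + tt + pp) / 3  # equal weighting is an assumption

terms = {"rna polymerase ii", "rna polymerase i"}
papers = ["RNA polymerase II transcription is suppressed",
          "proteins regulating RNA polymerase II"]
print(pattern_score("rna polymerase ii", terms, papers))  # 1.0
```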
Evaluation
Recall = (correct responses by software) / (all human responses)
Precision = (correct responses by software) / (all responses by software)
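Both measures can be computed directly over sets of (paper, term) annotations; the data below is hypothetical:

```python
def precision_recall(predicted, gold):
    """Precision and recall of software annotations against human (gold) ones."""
    correct = predicted & gold
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

gold = {("paper1", "invests"), ("paper2", "rna polymerase ii"),
        ("paper3", "sphenethmoid")}
predicted = {("paper1", "invests"), ("paper2", "rna polymerase ii"),
             ("paper2", "planum")}
p, r = precision_recall(predicted, gold)
print(p, r)  # 2/3 precision, 2/3 recall
```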