A Domain Ontology Engineering Tool with General Ontologies and Text Corpus

Naoki Sugiura, Masaki Kurematsu, Naoki Fukuta, Noriaki Izumi, & Takahira Yamaguchi

DODDLE and DODDLE II

Domain Ontology rapiD DeveLopment Environment

Builds taxonomic and non-taxonomic relationships

Uses dictionary approach and text corpus (body) to build relationships

DODDLE & DODDLE II

Large Ontologies are difficult to build by hand

Locates relationships between words based on context similarity, even when the words are separated in the text

Disadvantages: human interaction is still required; low success rate

DODDLE vs DODDLE II

DODDLE only works on taxonomic relationships

DODDLE II: an extension of DODDLE that also finds non-taxonomic relationships

Outline

Overview

Taxonomic Relationships

Non-Taxonomic Relationships

Case Studies

Problems/Future Work

Conclusion

Assessment

Overview

[Figure: system overview. Labels: Domain Terms, Domain Specific Text Corpus, Concept Extraction Module, TRA Module, NTRL Module.]

Overview: TRA Module

[Figure: TRA Module. Labels: MRD (WordNet), Matched Result Analysis, Trimmed Result Analysis, Modification using syntactic strategies, Taxonomic Relationship.]

Overview: NTRL Module

[Figure: NTRL Module. Labels: Domain Specific Text Corpus, Extraction of frequent words, WordSpace creation, Extraction of similar concept pairs, Concept specification templates, Non-Taxonomic Relationship.]

Overview

[Figure: system overview, output side. Labels: Taxonomic Relationship, Non-Taxonomic Relationship, Interaction Module.]

TRA

Matched Result Analysis: constructs PAB and STM

Trimmed Result Analysis: removes unnecessary nodes

Modification using statistical strategies: allows for human input
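As a rough illustration of the trimming step, the sketch below re-attaches every matched term to its nearest matched ancestor and drops the unmatched nodes in between. This is a simplification chosen for illustration, not the tool's actual procedure: the function name `trim_taxonomy`, the parent-map representation, and the toy hierarchy are all hypothetical.

```python
def trim_taxonomy(parent, matched):
    """Re-attach each matched node to its nearest matched ancestor.

    parent:  dict mapping a node to its parent (the root has no entry)
    matched: set of nodes that matched an input domain term
    Unmatched internal nodes are skipped; the root is always kept.
    """
    trimmed = {}
    for node in matched:
        ancestor = parent.get(node)
        # Climb while the ancestor is unmatched and is not the root.
        while (ancestor is not None
               and ancestor not in matched
               and parent.get(ancestor) is not None):
            ancestor = parent[ancestor]
        trimmed[node] = ancestor
    return trimmed


# Hypothetical WordNet-like fragment (illustrative only).
parent = {
    "object": "entity",
    "artifact": "object",
    "document": "artifact",
    "contract": "document",
    "commodity": "artifact",
    "goods": "commodity",
}
matched = {"artifact", "contract", "goods"}

trimmed = trim_taxonomy(parent, matched)
print(trimmed)
```

In this toy case "document", "commodity" and "object" disappear, so "contract" and "goods" hang directly under "artifact". A user would still review the result, which is where the human input mentioned above comes in.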

[Figure: PAB and STM in the TRA module.]

NTRL

Extraction of key words

Primitive: 4-gram (a sequence of 4 words)

Collocation matrix: $a_{i,j}$ = number of times 4-gram $f_i$ appears before 4-gram $f_j$

Example 4-gram sequence: … f8 f4 f3 f7 f8 f4 f1 f3 f4 f9 f2 f5 f1 f7 f1 f5 …
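A minimal sketch of how such a collocation count could be built from the sequence shown on the slide. It assumes "before" means "immediately before" and that the fused tokens in the transcript (`f8f4`, `f2f5`) are separate 4-grams; the function name `collocation_counts` is hypothetical.

```python
from collections import Counter

# The symbolic 4-gram stream from the slide, with fused tokens separated.
stream = "f8 f4 f3 f7 f8 f4 f1 f3 f4 f9 f2 f5 f1 f7 f1 f5".split()


def collocation_counts(seq):
    """Count how often each item appears immediately before another.

    The result maps (f_i, f_j) to a_ij; pairs that never occur are absent
    and read as 0 through Counter's default.
    """
    return Counter(zip(seq, seq[1:]))


counts = collocation_counts(stream)
print(counts[("f8", "f4")])  # f8 directly precedes f4 twice in the stream
print(counts[("f4", "f8")])  # never observed, so 0
```

A sparse mapping is used here instead of a dense matrix because most 4-gram pairs never co-occur; a wider context window would need a different pairing step.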

NTRL

WordSpace Creation

Context vectors

Word vectors: sum of context vectors

$$\tau(w) = \sum_{i \in C(w)} \Bigl( \sum_{f \text{ close to } i} \varphi(f) \Bigr)$$

$\tau(w)$: a vector representation of a word or phrase $w$

$\varphi(f)$: the 4-gram vector of a 4-gram $f$

$C(w)$: the appearance places of a word or phrase $w$

WordSpace is the collection of the vectors $\tau(w)$
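The double sum can be read as: for every place where w appears, add up the 4-gram vectors found near that place (a context vector), then add all context vectors together. The sketch below follows that reading. How "close to" is measured is not stated on the slide, so the fixed `window` parameter is an assumption, as are the function names and the toy vectors.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def word_vector(occurrences, gram_positions, phi, window, dim):
    """Sum of context vectors over all appearance places of a word.

    occurrences:    positions where the word or phrase appears, C(w)
    gram_positions: list of (position, gram_id) for the frequent 4-grams
    phi:            dict mapping gram_id to its 4-gram vector
    window:         assumed meaning of "close to": |position - i| <= window
    """
    total = [0.0] * dim
    for i in occurrences:
        for pos, gram in gram_positions:
            if abs(pos - i) <= window:
                total = add_vectors(total, phi[gram])
    return total


# Toy data (hypothetical): three 4-grams with 2-dimensional vectors.
phi = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [1.0, 1.0]}
gram_positions = [(0, "f1"), (5, "f2"), (10, "f3"), (20, "f1")]

tau = word_vector(occurrences=[4, 11], gram_positions=gram_positions,
                  phi=phi, window=3, dim=2)
print(tau)
```

Here the occurrence at position 4 picks up only `f2` and the occurrence at position 11 picks up only `f3`, so the word vector is the sum of those two 4-gram vectors.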

NTRL

Extraction of Concept Pairs

Each input term has a best-matched “synset”

Synset: a set of synonymous words, each with a word vector

The sum of the word vectors in the synset becomes the vector of the concept that corresponds to the input term

Inner products are computed for all combinations of concept pairs

A pair counts as a match when its inner product passes a user-set threshold

Threshold in the case study: 0.87
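A minimal sketch of the pair-extraction step under stated assumptions: concept vectors are summed from synset word vectors as described, and the vectors are normalized before taking the inner product so that a threshold such as 0.87 is scale-independent. The normalization is an assumption, not something the slides state, and the concept names and vectors are made up.

```python
import math
from itertools import combinations


def concept_vector(word_vectors):
    """Sum the word vectors of a synset into one concept vector."""
    dim = len(word_vectors[0])
    return [sum(v[k] for v in word_vectors) for k in range(dim)]


def normalize(v):
    """Scale a vector to unit length (assumption: not stated on the slide)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def similar_pairs(concepts, threshold):
    """Return (a, b, score) for every concept pair whose score >= threshold."""
    unit = {name: normalize(vec) for name, vec in concepts.items()}
    pairs = []
    for a, b in combinations(sorted(unit), 2):
        score = sum(x * y for x, y in zip(unit[a], unit[b]))
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical concepts and word vectors, for illustration only.
concepts = {
    "goods": concept_vector([[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]),
    "delivery": concept_vector([[2.0, 0.1, 1.0]]),
    "party": concept_vector([[0.0, 1.0, 0.0]]),
}

pairs = similar_pairs(concepts, threshold=0.87)
print([(a, b) for a, b, _ in pairs])
```

Only the "delivery"/"goods" pair passes the threshold in this toy data; lowering the threshold admits more pairs, which is the precision/recall trade-off the user has to tune per domain.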

NTRL

Finding Association Rules

Locates rules of the form $X \Rightarrow Y$, where $X$ and $Y$ are disjoint sets of input terms
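To make the rule search concrete, here is a small sketch that uses the standard definitions of support (share of transactions containing both sides) and confidence (share of transactions containing the left side that also contain the right side). It assumes each sentence is one transaction and restricts both sides to single terms, which is a simplification; the sentences, thresholds and function name are hypothetical.

```python
from itertools import permutations


def find_rules(transactions, min_support, min_confidence):
    """Return single-term rules (x, y, support, confidence) for x => y."""
    n = len(transactions)
    terms = sorted(set().union(*transactions))
    rules = []
    for x, y in permutations(terms, 2):
        n_x = sum(1 for t in transactions if x in t)
        n_xy = sum(1 for t in transactions if x in t and y in t)
        support = n_xy / n
        confidence = n_xy / n_x if n_x else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Each set holds the input terms found in one sentence (toy data).
sentences = [
    {"seller", "goods"},
    {"seller", "goods", "buyer"},
    {"buyer", "price"},
    {"seller", "goods"},
    {"goods", "price"},
]

rules = find_rules(sentences, min_support=0.4, min_confidence=0.8)
for x, y, support, confidence in rules:
    print(f"{x} => {y}  support={support:.2f}  confidence={confidence:.2f}")
```

Note that the rule is directional: "seller => goods" passes, while "goods => seller" fails the confidence threshold because "goods" also appears without "seller".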

NTRL

Constructing Concept Specification Templates

Built from the set of similar concept pairs and the association rules

DODDLE sets priorities between concept pairs, based on the TRA Module and co-occurrence information

Case Study

Law: “Contract for International Sale of Goods”

Business: “XML Common Business Library”

Support: 0.4%; Confidence: 80%

Law Case Study

Input: 46 concepts

WordSpace: 77 concept pairs

Association between input terms: 55 pairs of terms

Templates

Business Case Study

Input: 57 terms

WordSpace: 40 pairs

Association between input terms: 39 pairs

Taxonomic Results

Business          Precision   Recall per path   Recall per subtree
Matched Result    0.2         0.29              0.71
Trimmed Result    0.22        0.13              0.5

Law               Precision   Recall per path   Recall per subtree
Matched Result    0.25        0.23              0.19
Trimmed Result    0.3         0.3               0.15

Non-taxonomic Results

Law                          WS      AR      Join of WS and AR
# Extracted Concept Pairs    77      55      117
# Accepted Concept Pairs     18      13      27
Precision                    0.23    0.24    0.23
Recall                       0.38    0.27    0.56

Business                     WS      AR      Join of WS and AR
# Extracted Concept Pairs    40      39      66
# Accepted Concept Pairs     30      20      39
Precision                    0.75    0.51    0.59
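The precision figures follow directly from the counts: precision is accepted pairs divided by extracted pairs. The short check below recomputes them from the numbers on the slides; only the dictionary layout and names are introduced here.

```python
# (extracted, accepted) concept-pair counts taken from the result slides.
results = {
    "Law": {"WS": (77, 18), "AR": (55, 13), "Join": (117, 27)},
    "Business": {"WS": (40, 30), "AR": (39, 20), "Join": (66, 39)},
}


def precision(extracted, accepted):
    """Share of extracted concept pairs that the user accepted."""
    return accepted / extracted


table = {
    domain: {method: round(precision(*counts), 2)
             for method, counts in methods.items()}
    for domain, methods in results.items()
}

for domain, row in table.items():
    print(domain, row)
```

The recomputed values agree with the reported precision rows for both domains. Recall cannot be rechecked the same way, since the slides do not give the number of correct pairs in the reference ontology.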

Problems/Future Work

Threshold: changes with each domain

Specification of a concept relation: the user still needs to specify the relationships

Ambiguity of terms with multiple meanings, e.g. “transmission”: semantic specialization of multi-definition words is needed

DODDLE-R: uses RDF tags

Conclusion

Uses an MRD and a text corpus

Two strategies for taxonomic relationships: matched result analysis and trimmed result analysis

Non-taxonomic relationships: extracted from co-occurrence information in the text corpus

Concept specification: a way to eliminate concept pairs to build an ontology

Assessment

Designed to be a tool

No timing results reported

Determining thresholds is plug-and-guess

Questions?