LREC 2010: Semi-Automatic Domain Ontology Creation from Text Resources. Authors: Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu. Presented by Chris Irwin Davis.

TRANSCRIPT

Page 1: LREC -  2010

LREC - 2010

Authors

Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu

Presented by

Chris Irwin Davis

Semi-Automatic Domain Ontology Creation from Text Resources

Page 2: LREC -  2010

LREC 2010 - Semi-Automatic Domain Ontology Creation from Text Resources 2

Jaguar Overview

• Jaguar: Builds Ontologies and Knowledge-Bases from the concepts and relationships between those concepts found in text.

• Constituents of a knowledge base

– Concepts/Vocabulary (“weapon”, “WMD”, “launcher”)

– Relations (“anthrax” ISA “biological weapon”, “anthrax” CAU “death”)

• 26 different semantic relation types extracted

– Organization of Relations

• Hierarchical

• Contextual

Page 3: LREC -  2010

Types of Knowledge

• Universal (or ontological)

– Represented in Hierarchies

– Simple binary relations between concepts

– “Chemical weapons such as nerve gas, …”

• Contextual

– Represented in individual (semantic) contexts

– Groups of relations centered on a common concept

– “The forces launched a full-scale attack on Monday”

[Figure: a hierarchy pair (“chemical weapon”, “nerve gas”) and a context centered on “launch” with AGT “forces”, THM “full-scale attack”, TMP “monday”]
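The two kinds of knowledge can be pictured as two small data structures. The sketch below is only an illustration: the class names `Relation` and `Context`, and the convention that a contextual relation points from the central concept to its argument, are assumptions made here and are not taken from Jaguar.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Relation:
    """A binary semantic relation between two concepts (hypothetical layout)."""
    source: str
    label: str
    target: str


@dataclass
class Context:
    """A group of relations centered on one common concept (hypothetical layout)."""
    center: str
    relations: List[Relation] = field(default_factory=list)

    def add(self, label: str, argument: str) -> None:
        # Assumption: each relation points from the central concept to its argument.
        self.relations.append(Relation(self.center, label, argument))


# Universal knowledge: simple binary relations stored in a hierarchy.
hierarchy = [Relation("nerve gas", "ISA", "chemical weapon")]

# Contextual knowledge: "The forces launched a full-scale attack on Monday"
launch = Context("launch")
launch.add("AGT", "forces")
launch.add("THM", "full-scale attack")
launch.add("TMP", "monday")
```

Keeping contexts separate from the hierarchy mirrors the split on this slide: hierarchy entries stand alone, while the relations in a context only make sense together.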

Page 4: LREC -  2010

KB Constituents

[Figure: Knowledge Base diagram with the labels Concept Set (concepts C1, C2, ...), Contextual Knowledge (contexts linking concepts through relations R1 to R5), Hierarchy, and Ontology. Example content: concepts “anthrax” and “biological weapon”; a context centered on “assassinate” with AGT “rebel”, THM “political leader”, TMP “may 21”; hierarchy edges labeled isa and pw]

Page 5: LREC -  2010

Jaguar Overview

[Figure: Documents and Seeds are the inputs to Jaguar. Outputs: Ontology (structured knowledge); Ontology + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)]

Functionality

1. Produce ontologies

2. Link concepts & relations to text

3. Visualize ontology

4. Edit ontology

5. Enhance an existing ontology

6. Merge two ontologies into a consistent ontology

7. Ontological search of documents (search documents using ontology)

Page 6: LREC -  2010

Knowledge Bases

• Ontology/KB creation overview

– Knowledge Extraction from Text

• Pattern recognition; Semantic Parsing

– Knowledge Representation and Storage

• Contextual vs. Universal

• XML; Relational Database

– Knowledge Base Maintenance

• Conflict Resolution; Ontology Merging

• User Interaction; Ontology Modification

Page 7: LREC -  2010

Jaguar – Process & Modules

[Figure: Jaguar pipeline. Inputs: Documents; Seeds (keywords-list or Ontology). Modules: Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance. Outputs: Ontology + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)]

Text Processing modules

• Chopshop: Tokenization

• Post: Part-of-speech Tagging

• Rose: Named Entity Recognition

• Relu: Syntactic Parsing

• Talbot: Word Sense Disambiguation

• Polaris: Semantic Parsing

• PreProcessor: Text Extraction from HTML, MS Word & PDF Docs

• ConceptTagger: Concept/Temporal Tagging

Text Processing

Input: Documents, Seeds

• Extract “concepts” of interest

• Extract binary relations (universal)

• Use Semantic Parser to obtain contextual knowledge

Output: Concepts, Contexts, Binary Relations

Example: “The rebels had access to chemical weapons, such as nerve gas and other poisonous gases.”

Page 8: LREC -  2010

Domain Ontology Creation

• Polaris: Extract semantic relations in text

– Pattern matching and machine learning

– Syntactic parse tree broken down into a number of syntactic patterns

– Syntactic patterns include verbs and their arguments, complex nominals, adjective phrases, adjective clauses, and others.

– There are six primary pattern types discovered within noun phrases:

• N-N and Adj-N (which comprise compound nominals)

• ’s and of (Genitive patterns)

• Adjective Phrases

• Adjective Clauses

• The first five are further subdivided into nominalized and non-nominalized (giving a total of 11 patterns discovered within compound nominals)

– There are also five verb argument level patterns being discovered:

• NP verb

• verb NP

• verb PP

• verb ADVP

• verb S

[Figure: Jaguar pipeline (Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance)]

Page 9: LREC -  2010

Domain Ontology Creation

Classification/Hierarchy Creation

Input: Concepts, Binary Relations

• Classify each concept against every other using defined procedures, obtaining a set of ISA relations

• Add all ISA and other binary relations to the hierarchy using conflict resolution

Output: Hierarchy of relations

Examples:

• “Scud missile” ISA “missile”

• “Squadron” PW “Platoon”

• “weapons inspection team” ISA “inspection team”

[Figure: Jaguar/KAT pipeline (Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance)]

Page 10: LREC -  2010

Domain Ontology Creation

• Classification Procedures:

– Procedure 1: Classify a concept of the form [word, head] with respect to concept [head]

– Procedure 2: Classify a concept [word1, head1] with respect to another concept [word2, head2]

– Procedure 3: Classify a concept [word1, word2, head]

– Procedure 4: Classify a concept [word1, head] with respect to a concept hierarchy under [head]
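The slides name the procedures without giving their internals. As a rough illustration only, the sketch below derives ISA candidates from the surface form of a multi-word concept by repeatedly dropping the leading modifier. The function name `head_isa_relations` and the assumption that the head is the last word are choices made here, not Jaguar's actual classification logic.

```python
from typing import List, Tuple


def head_isa_relations(concept: str) -> List[Tuple[str, str, str]]:
    """Propose ISA relations by stripping leading modifiers one at a time.

    Assumption: the head of the concept is its last word, so
    [word1, word2, head] ISA [word2, head] ISA [head].
    """
    words = concept.split()
    relations = []
    while len(words) > 1:
        narrower = " ".join(words)
        broader = " ".join(words[1:])
        relations.append((narrower, "ISA", broader))
        words = words[1:]
    return relations


scud = head_isa_relations("Scud missile")
team = head_isa_relations("weapons inspection team")
```

Here `scud` holds the single candidate (“Scud missile” ISA “missile”) and `team` starts with (“weapons inspection team” ISA “inspection team”), matching the example relations given for the Classification/Hierarchy Creation step. A real system would need more than string manipulation, for example to reject modifiers that change the head's meaning.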

[Figure: Jaguar/KAT pipeline (Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance)]

Page 11: LREC -  2010

Domain Ontology Creation

Knowledge Base Maintenance

• Knowledge Base Merging

• Visualization

• Knowledge Base Editing

– User Interaction

– Modifications

[Figure: Jaguar/KAT pipeline (Text Processing, Classification, Hierarchy Creation, Knowledge Base Maintenance)]

Page 12: LREC -  2010

Domain Ontology/KB Creation - Example

Page 13: LREC -  2010

Domain Ontology/KB Creation - Example

Page 14: LREC -  2010

Conflict Resolution Algorithm

• Approach Used: Prevention

– Start from an empty hierarchy and an input relation set

– Add a relation from the input set to the hierarchy, if:

• It does not form a cycle

• It is not redundant (does not duplicate a path)

– After the addition of any relation, algorithms (jump link removal) are run to ensure that all jump links are removed
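The prevention approach can be sketched as a guarded insert into a directed graph. The code below is a minimal sketch under assumptions: the `Hierarchy` class and its method names are hypothetical, and a "jump link" is interpreted here as a direct edge that shortcuts a longer path between the same two concepts (the slides do not define the term).

```python
from collections import defaultdict


class Hierarchy:
    """Child -> parents graph that refuses cycles and redundant edges."""

    def __init__(self):
        self.parents = defaultdict(set)

    def _reaches(self, start, goal):
        # Depth-first search along parent edges; start == goal counts as reachable.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents.get(node, ()))
        return False

    def add(self, child, parent):
        if self._reaches(parent, child):  # the new edge would close a cycle
            return False
        if self._reaches(child, parent):  # an existing path already covers it
            return False
        self.parents[child].add(parent)
        self._remove_jump_links()
        return True

    def _remove_jump_links(self):
        # Drop every direct edge that is also implied by a longer path.
        for child in list(self.parents):
            for parent in list(self.parents[child]):
                self.parents[child].discard(parent)
                if not self._reaches(child, parent):
                    self.parents[child].add(parent)  # edge was needed, restore it


h = Hierarchy()
h.add("capital market", "market")
h.add("financial market", "market")
h.add("capital market", "financial market")  # turns the first edge into a jump link
```

After the third insert, the direct edge from “capital market” to “market” is removed because the path through “financial market” already implies it. Re-adding that edge is rejected as redundant, and adding “market” under “capital market” is rejected as a cycle.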

Page 15: LREC -  2010

Knowledge Base Merging

• Current Approach

– Label the bigger ontology L1, and the other L2

– Merge concepts (from those in L2 into those of L1)

– Copy all contexts (from L2 to L1)

– Add all relations (from the hierarchy of L2 to the hierarchy of L1) using the conflict resolution algorithm

– Additionally, classify all concepts in L1’s hierarchy against concepts in L2’s hierarchy (form relation set R)

– Add relations from R into L1’s hierarchy (conflict resolution)
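The merge steps can be sketched as a single function. Everything in the sketch is an assumption made for illustration: ontologies are plain dictionaries, "bigger" is taken to mean more concepts, `classify` is a hypothetical callback returning a (child, parent) pair or `None`, the conflict check is simplified (it omits jump link removal), and the toy input reuses concept names from the slides without claiming to reproduce their hierarchies.

```python
def reaches(parents, start, goal):
    """True if goal can be reached from start along parent edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False


def add_relation(parents, child, parent):
    """Simplified conflict resolution: reject cycles and redundant edges."""
    if reaches(parents, parent, child) or reaches(parents, child, parent):
        return False
    parents.setdefault(child, set()).add(parent)
    return True


def merge(a, b, classify):
    # Label the bigger ontology L1 (assumption: bigger = more concepts).
    l1, l2 = (a, b) if len(a["concepts"]) >= len(b["concepts"]) else (b, a)
    merged = {
        "concepts": l1["concepts"] | l2["concepts"],   # merge concepts
        "contexts": l1["contexts"] + l2["contexts"],   # copy all contexts
        "parents": {c: set(p) for c, p in l1["parents"].items()},
    }
    # Add L2's hierarchy relations through conflict resolution.
    for child, ps in l2["parents"].items():
        for parent in sorted(ps):
            add_relation(merged["parents"], child, parent)
    # Classify L1 concepts against L2 concepts to form relation set R, then add R.
    for c1 in sorted(l1["concepts"]):
        for c2 in sorted(l2["concepts"]):
            relation = classify(c1, c2)
            if relation is not None:
                add_relation(merged["parents"], *relation)
    return merged


# Toy input, invented for this example.
big = {"concepts": {"market", "money_market", "stock_market"}, "contexts": [],
       "parents": {"money_market": {"market"}, "stock_market": {"market"}}}
small = {"concepts": {"money_market", "financial market"}, "contexts": [],
         "parents": {"money_market": {"financial market"}}}


def toy_classify(c1, c2):
    # Stand-in classifier that knows a single fact.
    if (c1, c2) == ("market", "financial market"):
        return ("financial market", "market")
    return None


merged = merge(big, small, toy_classify)
```

Because this simplified check skips jump link removal, the direct edge from “money_market” to “market” survives alongside the longer path through “financial market”; the full algorithm would prune it.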

Page 16: LREC -  2010

Merging Hierarchies

[Figure: two hierarchies to be merged. L1 nodes: stock_market, exchange, work_place, money_market, market, industry. L2 nodes: stock_exchange, money_market, capital market, financial market]

Page 17: LREC -  2010

Merging Hierarchies

[Figure: merged hierarchy L1. Nodes: stock_market, stock_exchange (one merged node), exchange, work_place, money_market, market, industry, capital market, financial market]

Simulating Classification

• “stock_market” SYN “stock_exchange”

• “stock_market” ISA “capital market”

• “capital market” ISA “financial market”

• “money_market” ISA “financial market”

• “financial market” ISA “market”

• “capital market” ISA “market”

Page 18: LREC -  2010

Semantic Relation Evaluation

• Training corpus:

– noun phrase patterns: Wall Street Journal (TreeBank 2), L.A. Times (TREC 9), and XWN 2.0

– verb argument patterns: FrameNet

• Three evaluation corpora to benchmark the Polaris semantic relations:

– TreeBank: we manually annotated 500 random sentences from the Penn Treebank 3 corpus with 5879 semantic relations.

– GlassBox Human: 51 random sentences from the NIMD corpus were manually POS-tagged, syntactically parsed and semantically annotated with 706 semantic relations.

– GlassBox Machine: the same 51 sentences used in the GlassBox Human evaluation corpus were POS-tagged and syntactically parsed by our NLP tools and then manually annotated with 741 semantic relations.

Page 19: LREC -  2010

Semantic Relation Evaluation

• For the TreeBank evaluation corpus:

– Polaris discovered 5245 relations

• 2212 were exact matches to the human annotations

• 630 were partial matches

– partial matches mean that while the relation type was correct and the argument bracketing at least overlapped, there were some extra or missing tokens in the generated arguments

– partial matches are scored using precision, recall, and f-measure on the overlapping tokens

• For the GlassBox Human evaluation corpus:

– Polaris discovered 449 relations

• 311 were perfect matches to the human annotations

• 56 were partial matches

• For the GlassBox Machine evaluation corpus:

– Polaris discovered 464 relations

• 249 were perfect matches to the human annotations

• 71 were partial matches
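The scoring described above can be sketched in a few lines. The helper names `prf` and `token_overlap_prf` are hypothetical, and treating arguments as sets of tokens is an assumption; the slides do not say how duplicate tokens are handled. The exact-match-only figures computed for TreeBank are plain arithmetic on the counts listed here (5245 discovered, 2212 exact, 5879 annotated) and are not the scores reported in the presentation, which also credit partial matches.

```python
def prf(correct, predicted, gold):
    """Precision, recall and F-measure from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def token_overlap_prf(predicted_tokens, gold_tokens):
    """Score a partial match on the tokens shared by both arguments.

    Assumption: arguments are compared as sets of tokens.
    """
    predicted, gold = set(predicted_tokens), set(gold_tokens)
    return prf(len(predicted & gold), len(predicted), len(gold))


# Exact matches only, TreeBank corpus: a lower bound that ignores partial credit.
treebank_exact = prf(correct=2212, predicted=5245, gold=5879)

# A partial match with one extra token in the generated argument (invented example).
partial = token_overlap_prf(["the", "full-scale", "attack"], ["full-scale", "attack"])
```

In the invented partial match, two of three generated tokens are correct and both gold tokens are covered, so the argument earns partial rather than zero credit.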

Page 20: LREC -  2010

Semantic Relation Evaluation

Page 21: LREC -  2010

Domain Ontology Library Creation

• We use Jaguar to create an ontology library for the 33 topics defined in NIPF and 10 topics from the financial domain

– NIPF is the Director of National Intelligence’s (DNI’s) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America

– For each topic, we collected 500 documents from the web and manually verified their relevance to the corresponding topic.

– For each topic, Jaguar is provided with an initial seed set containing on average 47 concepts of interest

Page 22: LREC -  2010

Domain Ontology/KB Evaluation

• We evaluated the quality of 8 Jaguar ontologies by comparing them against manual gold annotations

• Our evaluations are focused on the

– Lexical Level

– Vocabulary, or Data Layer Level

– Other Semantic Relations Level

• Viewing an ontology as a set of semantic relations between two concepts, the human annotators:

– Labeled an entry correct if the concepts and the semantic relation were correctly detected by the system; otherwise marked the entry incorrect

– Labeled a correct entry as irrelevant if any of the concepts or the semantic relation is irrelevant to the domain

– Added new entries from the sentences if the concepts and the semantic relation were omitted by Jaguar

Page 23: LREC -  2010

NIPF Ontology/KB Evaluation - Metrics

Nj(.) gives the counts from Jaguar’s output

Ng(.) gives the counts from the user annotations

Page 24: LREC -  2010

Domain Ontology/KB Evaluation - Results

Page 25: LREC -  2010

Domain Ontology/KB Evaluation - Results

Page 26: LREC -  2010

Conclusions

• We presented a generalized and improved procedure to automatically extract deep semantic information from text resources

• A methodology to rapidly create semantically-rich domain ontologies while keeping the manual intervention to a minimum

• We defined evaluation metrics to assess the quality of the ontologies and presented evaluation results for a subset of the intelligence and financial ontology libraries, semi-automatically created using freely-available textual resources from the Web

• The results show that a decent amount of knowledge can be accurately extracted while keeping the manual intervention in the process to a minimum.