Unsupervised Ontology Acquisition from plain texts: The OntoGain method Efthymios Drymonas Kalliopi Zervanou Euripides G.M. Petrakis Intelligent Systems Laboratory http://www.intelligence.tuc.gr Technical University of Crete (TUC), Chania, Greece


Post on 17-Dec-2015



Unsupervised Ontology Acquisition from plain texts: The OntoGain method

Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis

Intelligent Systems Laboratory, http://www.intelligence.tuc.gr

Technical University of Crete (TUC), Chania, Greece


OntoGain

A platform for unsupervised ontology acquisition from text

Application independent

Ontology of multi-word term concepts

Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts

Outputs ontology in OWL

Good results on Medical and Computer Science corpora


Why multi-word term concepts?

Multi-word terms make up the majority of terminological expressions

They convey classificatory information, expressed as modifiers, e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”

This leads to a more expressive and compact ontology lexicon


Ontology Learning Steps

Concept Extraction: C/NC-value

Taxonomy Induction: Clustering, Formal Concept Analysis

Non-taxonomic Relations: Association Rules, Probabilistic algorithm


The C/NC-Value method [Frantzi et al., 2000]

Identifies multi-word term phrases denoting domain concepts

Noun phrases are extracted first, using the part-of-speech pattern ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun

C-Value: Term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms

NC-Value: Uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
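A minimal sketch of how the part-of-speech pattern above could be applied. It assumes the text is already tagged with the coarse labels adj, noun and prep (an illustrative tag set, not the tagger's own), and the function name candidate_terms is hypothetical. Each tag is mapped to a letter so the pattern becomes an ordinary regular expression.

```python
import re

# Map coarse part-of-speech labels to single letters so that the slide's
# pattern can be written as an ordinary regular expression.
TAG_TO_LETTER = {"adj": "A", "noun": "N", "prep": "P"}

# ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def candidate_terms(tagged):
    """Return every multi-word span of `tagged` that matches the pattern.

    `tagged` is a list of (word, tag) pairs. Tags outside TAG_TO_LETTER
    are mapped to "x", which never matches and so breaks a candidate.
    """
    letters = "".join(TAG_TO_LETTER.get(tag, "x") for _, tag in tagged)
    found = []
    for start in range(len(letters)):
        # Only spans of two or more tokens are kept (multi-word terms).
        for end in range(start + 2, len(letters) + 1):
            if TERM_PATTERN.fullmatch(letters[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found


sentence = [("the", "det"), ("carotid", "adj"), ("artery", "noun"),
            ("disease", "noun"), ("is", "verb"), ("common", "adj")]
print(candidate_terms(sentence))
```

Nested candidates such as "artery disease" inside "carotid artery disease" are returned on purpose, since C-Value needs the frequencies of both.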


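Concept Extraction

C-Value: Statistical Part

For candidate term a:

f(a): total frequency of occurrence of a

T_a: the set of longer terms that contain a

f(b): frequency of a as part of the longer term b

P(T_a): number of these longer terms

|a|: the length of the candidate string

$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested} \\
\log_2 |a| \cdot \Bigl( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \Bigr), & \text{otherwise}
\end{cases}
$$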
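The C-Value definition can be sketched in a few lines. This is an illustration under stated assumptions, not the OntoGain implementation: |a| is taken as the number of words in the term, nesting is tested as contiguous word containment, and the names c_values and _contains are hypothetical.

```python
import math


def _contains(longer, shorter):
    """True if word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_values(term_freq):
    """Compute the C-value of every candidate term.

    `term_freq` maps a candidate term (space-separated words) to f(a),
    its total frequency in the corpus.
    """
    scores = {}
    for a, f_a in term_freq.items():
        words_a = a.split()
        # T_a: longer candidate terms in which a is nested
        longer = [b for b in term_freq if _contains(b.split(), words_a)]
        weight = math.log2(len(words_a))
        if not longer:
            scores[a] = weight * f_a
        else:
            # Subtract the average frequency of the longer terms
            nested_mass = sum(term_freq[b] for b in longer) / len(longer)
            scores[a] = weight * (f_a - nested_mass)
    return scores


# Made-up frequencies, for illustration only
freqs = {
    "artery disease": 10,
    "carotid artery disease": 4,
    "coronary artery disease": 2,
}
scores = c_values(freqs)
```

With these made-up counts, "artery disease" is nested in two longer terms whose average frequency is 3, so its score is log2(2) * (10 - 3) = 7. Single-word candidates score 0 because log2(1) = 0, which fits the focus on multi-word terms.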


C/NC-Value sample results

Output term, C/NC-value

web page 1740.11

information retrieval 1274.14

search engine 1103.99

machine learning 727.70

computer science 723.82

experimental result 655.125

text mining 645.57

natural language processing 582.83

world wide web 557.33

large number 530.67

artificial intelligence 515.73

relevant document 468.22

similarity measure 464.64

information extraction 443.29

knowledge discovery 435.79


Ontology Learning Steps

Preprocessing, Concept Extraction, Taxonomy Induction, Non-taxonomic Relations


Taxonomy Induction

Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms

Two methods in OntoGain: agglomerative clustering and Formal Concept Analysis (FCA)


Agglomerative Clustering

Proceeds bottom-up: at each step, the most similar clusters are merged

Initially each term is considered a cluster

Similarity between all pairs of clusters is computed

The most similar clusters are merged as long as they share terms with common heads

Similarity: group average between clusters, a Dice-like formula between terms
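The merging procedure can be sketched as follows. The exact Dice-like formula is not given on the slide, so this sketch assumes a plain Dice coefficient over the word sets of two terms and takes the last word of a term as its head. All function names are hypothetical.

```python
from itertools import combinations


def dice(t1, t2):
    """Dice coefficient over the word sets of two terms (an assumption)."""
    w1, w2 = set(t1.split()), set(t2.split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))


def head(term):
    """Assume the head of an English noun phrase is its last word."""
    return term.split()[-1]


def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def share_head(c1, c2):
    """True if some term in c1 has the same head as some term in c2."""
    return bool({head(t) for t in c1} & {head(t) for t in c2})


def cluster(terms):
    """Merge bottom-up; return (final clusters, merge history)."""
    clusters = [frozenset([t]) for t in terms]
    history = []
    while True:
        candidates = [
            (group_average(c1, c2), c1, c2)
            for c1, c2 in combinations(clusters, 2)
            if share_head(c1, c2)
        ]
        if not candidates:
            return clusters, history
        # Merge the most similar pair of clusters that share a head
        _, c1, c2 = max(candidates, key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((c1, c2))


terms = ["clustering method", "agglomerative clustering method",
         "hierarchical method", "web page", "home page"]
final_clusters, merges = cluster(terms)
```

The merge history, read from first to last, gives the hierarchy: early merges join narrow, closely related terms and later merges form the broader groups.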


Formal Concept Analysis (FCA) [Ganter et al., 1999]

FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)

Finds common attributes (verbs) between objects and forms object clusters that share common attributes

Formal concepts are connected with the sub-concept relationship

$$
(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \;(\iff A_1 \supseteq A_2)
$$


FCA Example

Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)

submit test describe print compute search

Html form * * *

Hierarchical clustering * *

Text retrieval *

Root node * * * *

Single cluster * * *

Web page * *


FCA Taxonomy


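Formal concepts

({hierarchical clustering, root node, single cluster}, {compute, search})

({html form, web page}, {print, search})

Not all dependencies (c, v) between a concept c and a verb v are interesting; only those whose conditional probability exceeds a threshold t are kept:

$$
P(c \mid v) = \frac{f(c, v)}{f(v)} > t
$$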
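A small sketch of both steps: filtering term-verb pairs by conditional probability, then deriving formal concepts from the resulting context. It rests on assumptions. f(v) is taken as the sum of f(c, v) over all terms. The example context agrees with the two formal concepts listed above, but its remaining marks are invented, because the original matrix layout is not recoverable. The brute-force enumeration only suits tiny contexts, and all names are hypothetical.

```python
from itertools import combinations


def build_context(pair_freq, t):
    """Keep (term, verb) pairs with P(c|v) = f(c, v) / f(v) > t."""
    verb_freq = {}
    for (_, verb), n in pair_freq.items():
        verb_freq[verb] = verb_freq.get(verb, 0) + n
    context = {}
    for (term, verb), n in pair_freq.items():
        if n / verb_freq[verb] > t:
            context.setdefault(term, set()).add(verb)
    return context


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every group of objects."""
    objects = list(context)
    concepts = set()
    for size in range(1, len(objects) + 1):
        for group in combinations(objects, size):
            # Intent: the verbs shared by every term in the group
            intent = frozenset(set.intersection(*(context[o] for o in group)))
            # Extent: every term that has all of those verbs
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, intent))
    return concepts


# Hypothetical context: only the compute/search and print/search marks
# are implied by the listed formal concepts; the rest are made up.
CONTEXT = {
    "html form": {"print", "search", "submit"},
    "hierarchical clustering": {"compute", "search"},
    "text retrieval": {"describe"},
    "root node": {"compute", "search", "describe", "test"},
    "single cluster": {"compute", "search", "describe"},
    "web page": {"print", "search"},
}
concepts = formal_concepts(CONTEXT)
```

Ordering the resulting concepts by inclusion of their extents gives the sub-concept hierarchy used as the taxonomy.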


Non-Taxonomic Relations extraction phase


Concept Extraction, Taxonomy Induction, Non-Taxonomic Relations


Non-Taxonomic Relations

Concepts are also characterized by attributes and relations to other concepts in the hierarchy

Typically expressed by a verb relating a pair of concepts

Two approaches: association rules and a probabilistic algorithm


Association Rules [Agrawal et al., 1993]

Introduced to predict the purchase behavior of customers

Extract terms connected by a subject-verb-object relation

Enhance with general terms from the taxonomy

Eliminate redundant relations: those with predictive accuracy < t, for a threshold t
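A rough sketch of the filtering step, assuming subject-verb-object triples have already been extracted. The slide does not define "predictive accuracy", so rule confidence is used here as a stand-in, and the enrichment with general terms from the taxonomy is left out. The function name mine_relations and the toy triples are hypothetical.

```python
from collections import Counter


def mine_relations(triples, min_confidence):
    """Return {(domain, range): (label, support, confidence)}.

    support    = share of all triples that contain the (domain, range) pair
    confidence = share of the domain's triples that have this range
    Rules with confidence below `min_confidence` are dropped.
    """
    pair_count = Counter((s, o) for s, _, o in triples)
    domain_count = Counter(s for s, _, _ in triples)
    verbs = {}
    for s, v, o in triples:
        verbs.setdefault((s, o), Counter())[v] += 1

    rules = {}
    for (s, o), n in pair_count.items():
        confidence = n / domain_count[s]
        if confidence >= min_confidence:
            # Label the relation with the verb seen most often for the pair
            label = verbs[(s, o)].most_common(1)[0][0]
            rules[(s, o)] = (label, n / len(triples), confidence)
    return rules


# Toy triples, for illustration only
triples = [
    ("a", "cause", "b"), ("a", "cause", "b"), ("a", "follow", "b"),
    ("a", "need", "c"), ("d", "give", "b"),
]
rules = mine_relations(triples, min_confidence=0.5)
```

Here the pair (a, c) is discarded because only one of the four triples with domain a has range c, giving a confidence of 0.25.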


Association Rules: Example

Domain                          Range                           Label
chiasmal syndrome               pituitary disproportion         cause by
medial collateral ligament      surgical treatment              need
blood transfusion               antibiotic prophylaxis          result
lipid peroxidation              cardiopulmonary bypass          lead to
prostate specific antigen       prostatectomy                   follow
chronic fatigue syndrome        cardiac function                yield
right ventricular infraction    radionuclide ventriculography   analyze by
creatinine clearance            arteriovenous hemofiltration    achieve
cardioplegic solution           superoxide dismutase            give
bacterial translocation         antibiotic prophylaxis          decrease
accurate diagnosis              clinical suspicion              depend
ultrasound examination          clinical suspicion              give
total body oxygen consumption   epidural analgesia              attenuate by
coronary arteriography          physician                       perform by


Probabilistic approach [Cimiano et al., 2006]

Collect verbal relations from the corpus

Find the most general relation with respect to each verb, using frequency of occurrence

Suffer_from(man, head_ache)

Suffer_from(woman, stomach_ache)

Suffer_from(patient, ache)

Select relationships satisfying a conditional probability measure

Associations scoring above a threshold t are accepted
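One plausible reading of these steps, sketched under assumptions: each observed relation also counts for every pair of broader terms of its arguments, and a generalized relation is accepted when its conditional probability given the verb exceeds t. The toy taxonomy (man and woman under patient, the two aches under ache) is invented to match the example, and the function names are hypothetical.

```python
from collections import Counter


def ancestors(concept, parent):
    """Yield the concept itself, then each of its broader terms in turn."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)


def accepted_relations(observations, parent, t):
    """Return {(verb, subject_concept, object_concept): probability}.

    Every observed verb(subject, object) is also counted for all pairs of
    broader terms, so general relations accumulate the frequency of their
    specializations. A relation is kept if f(verb, s, o) / f(verb) > t.
    """
    pair_freq = Counter()
    verb_freq = Counter()
    for verb, subj, obj in observations:
        verb_freq[verb] += 1
        for s in ancestors(subj, parent):
            for o in ancestors(obj, parent):
                pair_freq[(verb, s, o)] += 1
    return {
        key: n / verb_freq[key[0]]
        for key, n in pair_freq.items()
        if n / verb_freq[key[0]] > t
    }


# Hypothetical taxonomy: child -> broader term
PARENT = {"man": "patient", "woman": "patient",
          "head_ache": "ache", "stomach_ache": "ache"}
observed = [("suffer_from", "man", "head_ache"),
            ("suffer_from", "woman", "stomach_ache")]
accepted = accepted_relations(observed, PARENT, t=0.6)
```

With t = 0.6, each specific relation covers only half of the observations and is rejected, while suffer_from(patient, ache) covers both and is accepted.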


Evaluation

Relevance judgments are provided by humans

Precision - Recall

We examined the 200 top-ranked concepts and their respective relations in 500 lines

Results on the OhsuMed and Computer Science corpora
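For reference, precision and recall against human judgments can be computed as below. How the relevant set was assembled is not stated on the slide, so the item lists here are illustrative only and the function name is hypothetical.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human judgments."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative lists, not the evaluation data
extracted = ["web page", "search engine", "large number", "text mining"]
relevant = ["web page", "search engine", "text mining",
            "machine learning", "information retrieval"]
precision, recall = precision_recall(extracted, relevant)
```

In this toy case three of the four extracted terms are judged relevant (precision 0.75), and three of the five relevant terms are found (recall 0.6).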


Results


Processing Layer         Method                   Precision (OhsuMed)  Recall (OhsuMed)  Precision (Comp. Science)  Recall (Comp. Science)
Concept Extraction       C/NC-Value               89.7%                91.4%             86.7%                      89.6%
Taxonomic Relations      Formal Concept Analysis  47.1%                41.6%             44.2%                      48.6%
Taxonomic Relations      Hierarchical Clustering  71.2%                67.3%             71.3%                      62.7%
Non-Taxonomic Relations  Association Rules        71.8%                67.7%             72.8%                      61.7%
Non-Taxonomic Relations  Probabilistic            62.7%                55.9%             61.6%                      49.4%


Comparison with Text2Onto [Cimiano & Völker, 2005]


Huge lists of plain single-word terms, and relations lacking semantic meaning

Text2Onto cannot work with big texts

Cannot export results in OWL


Conclusions

OntoGain:

Multi-word term concepts

Exports ontology in OWL

Domain independent

Results:

C/NC-Value yields good results

Clustering outperforms FCA

Association Rules perform better than Verbal Expressions


Future Work

Explore more methods / combinations, e.g., clustering, FCA

Hearst patterns for discovering additional relation types (Part-of)

Discover attributes and cardinality constraints

Incorporate term similarity information from WordNet, MeSH

Resolve term ambiguities


Thank you!

Questions ?


Preprocessing

Tokenization, POS tagging, Shallow parsing (OpenNLP suite)

Lemmatization (WordNet Java Library)

Applies to all steps of OntoGain

Shallow parsing is used in relation acquisition for the detection of verbal dependencies


Terms sharing a head tend to be similar (e.g. hierarchical method and agglomerative method are both methods)

Nested terms are related to each other (e.g. agglomerative clustering method and clustering method should be associated)