Post on 21-Dec-2015


Problems of Ontology Development for a Broad Domain

Loukachevitch Natalia
[email protected]
Leading Researcher of Lomonosov Moscow State University

Center for Information Research
Lomonosov Moscow State University Research Computing Center

Technologies

• Ontologies for Natural Language Processing and Information Retrieval Applications
• Applications
  – Conceptual indexing
  – Query expansion
  – Text Categorization
  – Document Clustering
  – Question-Answering
  – Automatic Summarization
• Linguistic Ontologies
  – RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions)
  – Ontology on Natural Sciences and Technologies (60 thousand concepts)
  – Banking Thesaurus for Information Retrieval applications, et al.

Projects of Our Research Group-1

• State Bodies
  – Central Bank of the Russian Federation (2006 – ..)
    • Development of banking thesaurus, conceptual indexing, text categorization
  – Central Election Committee of the RF (1999 – ..)
    • Information-retrieval system, conceptual indexing, text categorization
  – State Duma of RF (1999 – ..)
    • Information retrieval system on Duma records
  – Accounting Chamber of RF (2003)
    • Creation of a terminology dictionary
  – other state bodies
    • Text categorization, clustering, development of domain-specific ontologies

Projects of Our Research Group-2

• Commercial organizations
  – Rambler Media company (2007 – ..)
    • Automatic clustering, categorization, summarization of news flow
    • Personalization of news and advertisements
    • Spam detection
    • Information extraction
  – Garant Legal Information Company (2002 – …)
    • Text categorization of legal documents
    • Summarization of court decisions
    • Learning to rank in information retrieval
  – etc.

Plan of Tutorial

• Ontologies: general remarks
  – Main paradigms and their problems
  – Level of formalization
• Broad vs. simple domains
  – Boundaries of a domain
  – Main source of knowledge - texts
• Domain-specific texts
  – Concepts and terms, term extraction
  – Synonyms and near-synonyms
  – Ambiguity of terms
  – Establishing relations
• Example: Ontology-based text categorization

Domains and Tasks

• Ontology vs. Machine Learning?

• Description of domains is difficult

• Data may need generalization

• Some knowledge may already be described in ontology-based resources

• Therefore, for many tasks we need Ontology + Machine learning

Ontologies: general remarks

• Ontology - formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts

• Main components:
  – Concepts (classes)
  – Instances (individuals)
  – Relations
  – Attributes
  – Axioms (rules)

[Figure: example taxonomy. Classes: object, organism, animal, mammal, cat, frog, siamese; instances shown below the classes]

Knowledge management domain

Ontology development paradigms

• Formal, logically sound ontologies
  – Logical inference
  – Some domains are difficult to formalize
  – Inconsistency is a huge problem
• Semantic Web
  – Many specific ontologies
  – RDF triples, Same_as links
  – A lot of “messy” data
• Ontologies for Natural Language Processing
  – Less formal
  – Relation to language semantics
  – Formalization is limited by the current state of natural language processing

Ontology-1: Ontology Spectrum (Obrst, 2006)

[Figure: ontology spectrum running from weak semantics to strong semantics, i.e. from less to more expressive.
Levels and their characteristic relations: Taxonomy (Is Sub-Classification of); Thesaurus (Has Narrower Meaning Than); Conceptual Model (Is Subclass of); Logical Theory (Is Disjoint Subclass of, with transitivity property).
Formalisms shown: Relational Model, XML; ER; Extended ER; DB Schemas, XML Schema; XTM; RDF/S; UML; Description Logic; DAML+OIL, OWL; First Order Logic; Modal Logic.
Interoperability scale: Syntactic Interoperability, Structural Interoperability, Semantic Interoperability.]

Expressivity vs. community-size (Hepp, 2007)

Ontology-2: Semantic Web. Linking Data Project
http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Approach 3. Ontologies for Natural Language Processing

• Relations between the concepts and lexical meanings are quite complex

• How to represent synonyms and near-synonyms

• How detailed the representation of lexical senses of ambiguous words should be

• Large volume vs. complexity of description

• WordNet as a symbol of this approach

• (!) For different tasks – different types of ontologies


Complicated vs. simple domains

• Simple domains (wine ontology)
  – Explicit boundaries
  – Boundaries are determined by “physical processes”, e.g. production, services
  – Clear roles of entities
  – Small number of classes (may have many instances) or many uniform classes
• Complicated domains (terrorism, financial control)
  – Vague boundaries
  – The same entities are used in different roles and functions
  – Knowledge is stored in text documents

Wine ontology

http://www.w3.org/TR/owl-guide/wine.rdf

[Figure: fragment of the wine ontology with the classes Wine, WhiteWine, RedWine, WhiteBurgundy, WhiteLoire, WhiteBordeaux, TableWine, SweetWine, Region, Grape, Meal course]

Complicated domains: vague boundaries

• Interdisciplinarity
  – State financial control (economy + law + finances)
  – Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons…)
• Two main parts
  – Center of the domain
  – Additional concepts from neighbour domains

Boundaries of domain: Terrorism

• Center of domain
  – Terrorist acts, groups, terrorists
  – Anti-terrorist activity
• Additional spheres
  – Geographic places
  – Weapons and explosives
  – Transport
  – Financial payment
  – Ideology, Religion etc.
• Re-use of ontologies?

Problem: Distortion of Reality

• General concepts necessary for domain description are treated as subordinates of domain concepts
• The name of a concept is general, but its intended sense is domain-specific
  – Law (= antiterrorist law)
  – Intelligence (= antiterrorist intelligence)
• Problems in ontology mapping, ontology reuse
• Thesaurus on Radiological terrorism
• http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/

Example: distortion of reality


Ontology Development and Domain-Specific Texts

• Knowledge stored in texts
• Domain-specific text collection
  – As many texts as possible
  – Necessary to find exact boundaries
• Automatic extraction of terms from texts (Term acquisition)
  – Terms are expressions corresponding to concepts of a specific domain
• Top-level modeling
• Use of existing ontologies

Automatic Term Acquisition from Texts

• Linguistic criteria (noun groups)

• Lexical restrictions (e.g. evaluative words such as good, bad are rarely parts of terms)

• Statistical criteria (Frequency, Mutual information, and many others)

• (!) Use of machine learning approaches to improve term extraction

• Formation of ordered list of term-candidates
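
The criteria above can be combined in a small ranking routine. The following is only a minimal sketch, not the extraction system used in the projects described here: it looks at two-word candidates only, replaces the linguistic noun-group criterion with a hypothetical stop list (a real system would use a part-of-speech tagger), and uses pointwise mutual information as the single statistical measure. The function name, the stop list and the sample sentences are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical stop list: function words plus evaluative words
# ("good", "bad") that rarely occur inside terms.
STOP = {"the", "of", "a", "is", "was", "are", "and", "in", "good", "bad"}

def rank_bigram_candidates(texts, min_count=2):
    """Return (phrase, frequency, PMI) tuples ordered by frequency, then PMI."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        # Naive tokenization; bigrams may cross sentence boundaries.
        tokens = re.findall(r"[a-z]+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        # Lexical restriction and frequency threshold.
        if count < min_count or w1 in STOP or w2 in STOP:
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log2(p_pair / p_independent)
        scored.append((f"{w1} {w2}", count, round(pmi, 2)))
    return sorted(scored, key=lambda item: (-item[1], -item[2]))

docs = [
    "The federal budget was approved. Resources of the federal budget are limited.",
    "The federal law defines the federal budget and the oblast budget.",
    "A good budget is a federal budget.",
]
candidates = rank_bigram_candidates(docs)
print(candidates[0][:2])  # ('federal budget', 4)
```

The ordered list produced this way is only the input for manual analysis: an expert still has to decide which candidates are terms of the domain.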

The most frequent phrases in documents of the financial control domain

• Translation from Russian
  – Federal budget
  – Russian Federation
  – Accounting Chamber
  – Federal law
  – Overall sum (-)
  – Resources of federal budget (?)
  – Oblast budget
  – Financial means
  – Use of financial means (?)
  – Wages
  – Ministry of finance
  – Budget resources
  – Tax body

Analysis of Term-Candidate List

• At the beginning of the list there are many evident terms

• Further down there are many unclear expressions; it is unclear

– whether they are terms (domain experts can have different opinions)

– whether they are related to the domain

– where the boundary of the domain lies

• A lot of synonymic variants

• Ambiguity of terms

Boundaries of the domain

• Bottom-up+top-down

• Term extraction from texts – a bottom-up stage

• Extracted expressions are necessary to understand what types of entities are needed in the domain – in fact design of top-level taxonomy

• Top-down analysis

• Combined approach to concept selection (frequency from the collection+top-level taxonomy restrictions)

Synonyms and variants of “money laundering”

• CRIMINAL LAUNDERING

• ILLEGAL LAUNDERING

• LAUNDERING

• LAUNDERING ACTIVITIES

• LAUNDERING OF MONEY

• LAUNDERING OPERATIONS

• MONEY LAUNDERING

• MONEY LAUNDERING ACTIVITIES

• MONEY LEGALIZATION

• MONEY WASHING

• PROFIT LAUNDERING

• PROFIT WASHING
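
One way to handle such variants is to attach all of them to a single concept as text entries and to look them up with longest-match search. The sketch below assumes exactly that; the concept label MONEY LAUNDERING, the function names and the sample sentence are illustrative, and no lemmatization is performed, so only the listed word forms are found.

```python
import re

# All variants from the list above are text entries of one concept.
# The concept label is chosen here for illustration only.
CONCEPT_ENTRIES = {
    "MONEY LAUNDERING": [
        "criminal laundering", "illegal laundering", "laundering",
        "laundering activities", "laundering of money", "laundering operations",
        "money laundering", "money laundering activities", "money legalization",
        "money washing", "profit laundering", "profit washing",
    ],
}

def build_index(concept_entries):
    """Map each text entry (as a tuple of tokens) to its concept."""
    index = {}
    for concept, entries in concept_entries.items():
        for entry in entries:
            index[tuple(entry.lower().split())] = concept
    return index

def find_concepts(text, index):
    """Scan left to right and prefer the longest entry at each position."""
    tokens = re.findall(r"[a-z]+", text.lower())
    max_len = max(len(key) for key in index)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            concept = index.get(tuple(tokens[i:i + n]))
            if concept is not None:
                found.append(concept)
                i += n  # skip the matched words
                break
        else:
            i += 1  # no entry starts at this token
    return found

INDEX = build_index(CONCEPT_ENTRIES)
sentence = "The bank reported money laundering activities and profit washing."
print(find_concepts(sentence, INDEX))  # ['MONEY LAUNDERING', 'MONEY LAUNDERING']
```

Longest match matters here because the short entry "laundering" is contained in most of the longer ones; without it, "money laundering activities" would be counted more than once.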

Lexical ambiguity

• Homonyms are words that share the same spelling but have different meanings (unrelated in origin)
  – bank (financial institution vs. land (river bank))
  – rarely occur in the same domain, except in a broad one
  – easily recognized by non-linguists
  – different concepts, different sets of relations
• Polysemes are words with the same spelling and distinct but related meanings
  – bank (financial institution vs. building)
  – occur very often in any domain
  – regular polysemes (institutions and their buildings)
  – difficult for non-linguists to recognize
  – tendency to use the same ontology concept for related senses

Lexical ambiguity (homonyms): bow

Lexical ambiguity (polysemes)

• Transport
  – They have succeeded in stopping the transport of live animals (= moving)
  – mechanism of contactless payment in public transport (= vehicles)
• Regular polysemy
  – Tree – wood (material): birch
• Non-linguists cannot recognize the different senses, but they notice strange deviations in relations

Lexical ambiguity (polysemes)

• How to help yourself: use unambiguous synonymous phrases

– Transport1 = transportation process

– Transport2 = transport vehicle

– Birch1 = birch tree

– Birch2 = birch wood

• This makes it possible to see the different entities behind closely related senses

Relations of an ontology

• The set of relations of an ontology can be non-evident
• Main relations
  – Class-subclass
  – Instance relation
  – Role relations
• Different properties: transitivity et al.
• Old AI books and manuals: the same relation in all cases – “is_a”
• The diagnostic expression “X is a Y” can be appropriate in all cases, so it does not tell these relations apart

Class-subclass relation

• Relation between two sets of entities (classes) (many-to-many): birch - tree
• Properties: transitive, inheritance
• Rules:
  – If class A is a subclass of class B, then each instance of class A is also an instance of B
  – Top-level classes (categories) should coincide for A and B
  – Real example of a mistake: river – water object – water – substance -> Moscow river – is a Substance?
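
The two rules can be checked mechanically. The sketch below is an illustration under stated assumptions, not a tool from the tutorial: the subclass chain is the faulty one from the example, while the top-level categories assigned to each class ("physical object", "substance") are hypothetical labels introduced here.

```python
# Subclass links as drawn in the faulty example (child -> parent).
SUBCLASS_OF = {
    "river": "water object",
    "water object": "water",
    "water": "substance",
}

# Hypothetical top-level categories assigned by the ontology engineer.
TOP_LEVEL = {
    "river": "physical object",
    "water object": "physical object",
    "water": "substance",
    "substance": "substance",
}

def superclasses(cls):
    """Follow subclass links transitively and collect every superclass."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def suspicious_links():
    """Subclass links whose two ends fall under different top-level categories."""
    return [(child, parent) for child, parent in SUBCLASS_OF.items()
            if TOP_LEVEL[child] != TOP_LEVEL[parent]]

# Transitivity makes every river, including the Moscow river, a substance.
print(superclasses("river"))   # ['water object', 'water', 'substance']
# The top-level rule pinpoints the link that causes it.
print(suspicious_links())      # [('water object', 'water')]
```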

Instance relation

• Relation one-to-many
  – Moscow river – instance of river
  – Teacher – instance of profession
• Not transitive
  – Rex, Poodle, dog breed, dog – what relations?
  – Rex is an instance of poodle
  – Poodle is an instance of dog breed
  – Poodle is a subclass of dog
  – Rex is not a dog breed
  – Rex is a dog

[Figure: Rex -> Poodle (Instance); Poodle -> Dog breed (Instance); Poodle -> Dog (Subclass); Rex -> Dog breed (Instance) is crossed out (X)]
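
The Rex example can be expressed as an inference rule: an individual is an instance of its direct class and of all superclasses of that class, but instance links are never chained. A minimal sketch, with dictionary and function names chosen here for illustration:

```python
# Relations from the example above.
SUBCLASS_OF = {"poodle": {"dog"}}
INSTANCE_OF = {"Rex": {"poodle"}, "poodle": {"dog breed"}}

def superclasses(cls):
    """All classes reachable from cls through subclass links."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_instance(individual, cls):
    """True for a direct class or any of its superclasses.

    Instance links are deliberately not followed a second time:
    the instance relation is not transitive.
    """
    direct = INSTANCE_OF.get(individual, set())
    return any(cls == d or cls in superclasses(d) for d in direct)

print(is_instance("Rex", "dog"))           # True: instance + subclass
print(is_instance("poodle", "dog breed"))  # True: direct instance
print(is_instance("Rex", "dog breed"))     # False: instance of instance
```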

Roles and types

• Roles: student, employer, terrorist, player
• Types: person, animal, building, car
• A role is a type under certain conditions
• A student is a person in the role of learning
• Properties of roles:
  – Roles are created dynamically
  – Roles can play other roles
  – A type can play many different roles

Confusion of type-role relations with class-subclass relations

• Frequent mistake of almost every beginner

• Not every person is an employer, an organization is not an employer in all situations

• Problems with inference

[Figure: class-subclass links between Employer and Person and between Employer and Organization, both crossed out (X)]

Text-motivated confusion of types and roles

• Natural substances such as salt, sugar, vinegar, alcohol, .. are also used as traditional preservatives. (wikipedia)

• Often salt and other preservatives are added to canned foods. (http://www.family-health-and-nutrition.com/this-vs-that.html)

• What is the relation between salt and preservative?
  – Class-subclass?
  – Class – instance?
  – ..

• In practice, beginners usually try to establish a “class-subclass” relation here. However, this is a type-role relation: preservative is a role of substances.
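
One possible way to keep the two relations apart is to store them in separate structures, so that subclass inference never passes through a role. The sketch below assumes this design; the relation names and the extra "seasoning" role are hypothetical and only serve the illustration.

```python
# Types: what a substance is (class-subclass relation).
SUBCLASS_OF = {
    "salt": "substance",
    "sugar": "substance",
    "vinegar": "substance",
    "alcohol": "substance",
}

# Roles: what a substance can be used as (type-role relation).
CAN_PLAY_ROLE = {name: {"preservative"} for name in SUBCLASS_OF}
CAN_PLAY_ROLE["salt"].add("seasoning")  # hypothetical extra role

def is_subclass(cls, ancestor):
    """Follow subclass links only; roles are never consulted."""
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls == ancestor:
            return True
    return False

def can_play(cls, role):
    return role in CAN_PLAY_ROLE.get(cls, set())

def fillers(role):
    """All types recorded as able to play the given role."""
    return sorted(c for c, roles in CAN_PLAY_ROLE.items() if role in roles)

print(is_subclass("salt", "substance"))     # True
print(is_subclass("salt", "preservative"))  # False: not every salt is a preservative
print(can_play("salt", "preservative"))     # True
print(fillers("preservative"))              # ['alcohol', 'salt', 'sugar', 'vinegar']
```

With this separation a query such as "find texts about preservatives" can still reach salt through the role link, while inference over subclasses does not conclude that all salt is a preservative.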

Automatic extraction of relations from texts

• A lot of scientific publications: extraction of synonyms, taxonomies, part-whole relations etc.

• But in a complex domain it is impossible to rely fully on automatic tools

• In many cases the relations that are extracted are the evident ones

• Causes
  – Multiword expressions
  – Ambiguity of language expressions
  – Contextual dependence
  – Need to process a very large domain text collection


Automatic text categorization

• Main approaches
  – Knowledge-based methods (based on rules)
  – Machine learning methods – very popular at scientific conferences
• Text categorization in real practice (operational text categorization)
  – A training collection should exist
  – Experts should categorize documents in a consistent way
  – Every category needs a sufficient number of training examples

In practice knowledge-based systems are widely used

• Reuters (provider of the well-known Reuters-21578 training collection) uses a knowledge-based system for text categorization of its own documents

Subjectivity of experts

Experts’ agreement in manual text categorization is around 60%

Our text categorization projects

• Use of both approaches, depending on the task and data
• The knowledge-based approach uses knowledge from our large resource, the RuThes thesaurus
• Projects
  – Classifier for Central Election Committee (450 categories, 4 levels)
  – Classifier of Russian legislation (1169 categories, 3000 categories)
  – Classifier of English economic research papers (700 categories)
  – Classifier of public opinion polls (350 categories)
  – Classifier of banking documents and news (200 categories)
  – General news classifiers
  – and others

Thesaurus on sociopolitical life

• Sociopolitical domain: social life of contemporary society

• Includes: thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts etc.

• Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports

• 36 thousand concepts, 100 thousand terms, 140 thousand direct relations

• Applications: conceptual indexing; automatic text categorization, document clustering, automatic text summarization, question-answering.

Socio-Political Domain

[Figure: levels of hierarchy of the socio-political domain, with the subdomains Law, Accounting, Taxation, Banking]

Thesaurus-based text categorization

• Use of knowledge described in the Thesaurus

• Manual description of Boolean expressions for categories, based on a small number of thesaurus concepts

• Automatic thesaurus-based expansion of Boolean expressions

• Thesaurus-based thematic representation of the text content independent of the genre and the length of a text (lexical chain technique)

Describing a category with supporting concepts

• Categorization of legal acts
• 200.020.020. Heads of states summits

{ ( HEADS OF STATES SUMMIT (Y) )
  OR
  { ( NEGOTIATIONS (N) )
    ( INTERNATIONAL NEGOTIATIONS (Y) )
    ( INTERNATIONAL CONTACTS (N) )
    ( MEETING (N) ) }
  AND
  ( HEAD OF STATE (L) ) }
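
A category description of this kind can be evaluated against the set of thesaurus concepts found in a document. The sketch below is an assumption-laden illustration rather than the actual categorization engine: the letter that follows each concept name is read as an expansion marker and ignored, the four concepts listed inside the inner braces are treated as alternatives (an implicit OR), and the nested-tuple encoding and the function name are invented for this example.

```python
def matches(expr, concepts):
    """Evaluate a nested Boolean expression against a set of concept names."""
    if isinstance(expr, str):
        return expr in concepts
    operator, *operands = expr
    if operator == "OR":
        return any(matches(op, concepts) for op in operands)
    if operator == "AND":
        return all(matches(op, concepts) for op in operands)
    raise ValueError(f"unknown operator: {operator}")

# Category 200.020.020 "Heads of states summits".
SUMMIT_RULE = (
    "OR",
    "HEADS OF STATES SUMMIT",
    ("AND",
     ("OR", "NEGOTIATIONS", "INTERNATIONAL NEGOTIATIONS",
      "INTERNATIONAL CONTACTS", "MEETING"),
     "HEAD OF STATE"),
)

print(matches(SUMMIT_RULE, {"MEETING", "HEAD OF STATE"}))   # True
print(matches(SUMMIT_RULE, {"MEETING"}))                    # False
print(matches(SUMMIT_RULE, {"HEADS OF STATES SUMMIT"}))     # True
```

The rule is written over concepts, not words, so it stays short; the wording variety of real texts is handled by the thesaurus when the document is indexed.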

Expanded representation of the category

{ ( HEADS OF STATES SUMMIT (Y) )
    ( summit, summit meeting, top-level meeting, head of states meeting )
  OR
  { ( NEGOTIATIONS (N) )
      ( negotiations, talks )
    ( INTERNATIONAL NEGOTIATIONS (Y) )
      ( international talks, interstate talks, diplomatic negotiations, international talks, multinational talks, intergovernmental talks, contracting nations, negotiating states …)
    ( INTERNATIONAL CONTACTS (N) )
      ( international intercourse, transnational contacts… )
    ( MEETING (N) ) }
  AND
  ( HEAD OF STATE (L) )
    ( leader of country, president, president of country, federal president, RF president, US president, monarch, …, emir, emir of Kuwait … ) }

ROMIP: Russian Seminar on Information Retrieval

• Russian TREC
• Text categorization task
• Categories: DMOZ, 247 categories of 2nd level Top/World/Russian/*/*
• Training collection: «DMOZ» (presented by Rambler)
  – 300 000 documents, 2100 sites
• Testing collection: Belorussian Internet «BY.web» (granted by Yandex company)
  – 1 500 000 documents, 19 000 sites
• Our task:
  – Thesaurus-based text categorization
  – Measuring of time to create categorization system
  – Evaluation

Knowledge-based approach (8 man-hours)

Category 135 «Martial arts» (F1-measure [OR] = 97%, R=98%, P= 96%)

Boolean expression for the category

MARTIAL ARTS (E)

«E» -- full expansion using the thesaurus tree

The expanded description includes: AIKIDO, JIUJUTSU, JUDO, KARATE, JUDOIST, KARATEKA …
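
Full expansion can be pictured as collecting every concept below the given one in the thesaurus tree. In the sketch below only the concept names come from the expanded description above; the tree shape (for example JUDOIST placed under JUDO) and the function name are assumptions made for illustration.

```python
# Hypothetical fragment of the thesaurus tree (concept -> narrower concepts).
NARROWER = {
    "MARTIAL ARTS": ["AIKIDO", "JIUJUTSU", "JUDO", "KARATE"],
    "JUDO": ["JUDOIST"],
    "KARATE": ["KARATEKA"],
}

def expand_full(concept, narrower):
    """Return the concept together with all of its descendants in the tree."""
    expanded, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in expanded:
            expanded.add(current)
            stack.extend(narrower.get(current, []))
    return expanded

print(sorted(expand_full("MARTIAL ARTS", NARROWER)))
# ['AIKIDO', 'JIUJUTSU', 'JUDO', 'JUDOIST', 'KARATE', 'KARATEKA', 'MARTIAL ARTS']
```

A document mentioning any concept of the expanded set then supports the category, which is why a one-concept Boolean expression can be sufficient.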

ROMIP: web-page categorization [or]

[Figure: bar chart "DMOZ categorization webpages 2007, or, only judged". Vertical axis from 0 to 0.9. Measures: F1, Precision, Recall, each also microaveraged. Runs: xxxx-1, xxxx-2, xxxx-3, xxxx-4, thescateg]

Benefits from Large-Scale Linguistic Ontologies Use in Information Retrieval

Information Retrieval Tasks                                   Benefits
Web Search                                                    0+ %
Corporate Search / Legal Search                               10 %
Long Queries / Verbose Queries                                15 %
Text Categorization                                           15-50 %
News Clustering                                               15 %
Summarization, Visualization, Multi Document Summarization    ++ (SUMMAC)

Conclusion

• Complex domains
  – Broad domains including a lot of heterogeneous entities
  – Vague boundaries
  – Knowledge stored in texts
• Special efforts to find boundaries
• Knowledge acquisition from texts
  – Partial automation
  – Necessity to overcome the ambiguity and vagueness of natural-language texts, even for non-linguists