Post on 19-Jan-2016


Page 1: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Problems of Ontology Development

for a Broad Domain

Loukachevitch Natalia

louk_nat@mail.ru

Leading Researcher

of Lomonosov Moscow State University

Center for Information Research

Lomonosov Moscow State University Research Computing Center

Page 2: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Technologies

• Ontologies for Natural Language Processing and Information Retrieval Applications
• Applications
  – Conceptual indexing
  – Query expansion
  – Text categorization
  – Document clustering
  – Question answering
  – Automatic summarization
• Linguistic Ontologies
  – RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions)
  – Ontology on Natural Sciences and Technologies (60 thousand concepts)
  – Banking Thesaurus for Information Retrieval applications, and others

Page 3: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Projects of Our Research Group-1

• State bodies
  – Central Bank of the Russian Federation (2006 – ..)
    • Development of banking thesaurus, conceptual indexing, text categorization
  – Central Election Committee of the RF (1999 – ..)
    • Information retrieval system, conceptual indexing, text categorization
  – State Duma of the RF (1999 – ..)
    • Information retrieval system on Duma records
  – Accounting Chamber of the RF (2003)
    • Creation of a terminology dictionary
  – Other state bodies
    • Text categorization, clustering, development of domain-specific ontologies

Page 4: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Projects of Our Research Group-2

• Commercial organizations
  – Rambler Media company (2007 – ..)
    • Automatic clustering, categorization and summarization of the news flow
    • Personalization of news and advertisements
    • Spam detection
    • Information extraction
  – Garant Legal Information Company (2002 – …)
    • Text categorization of legal documents
    • Summarization of court decisions
    • Learning to rank in information retrieval
  – etc.

Page 5: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Plan of Tutorial

• Ontologies: general remarks
  – Main paradigms and their problems
  – Level of formalization
• Broad vs. simple domains
  – Boundaries of a domain
  – Main source of knowledge: texts
• Domain-specific texts
  – Concepts and terms, term extraction
  – Synonyms and near-synonyms
  – Ambiguity of terms
  – Establishing relations
• Example: ontology-based text categorization

Page 6: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Domains and Tasks

• Ontology vs. Machine Learning?

• Description of domains is difficult

• Data may need generalization

• Some knowledge may already be described in ontology-based resources

• Therefore, for many tasks we need Ontology + Machine learning

Page 7: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Ontologies: general remarks

• Ontology: a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts

• Main components:
  – Concepts (classes)
  – Instances (individuals)
  – Relations
  – Attributes
  – Axioms (rules)

Page 8: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

[Figure: example taxonomy. Node labels: object, organism, animal, mammal, cat, frog, siamese. Annotations: Taxonomy, Classes, instances.]

Page 9: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Knowledge management domain

Page 10: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Ontology development paradigms

• Formal, logically sound ontologies
  – Logical inference
  – Some domains are difficult to formalize
  – Inconsistency is a huge problem
• Semantic Web
  – A lot of specific ontologies
  – RDF triples, Same_as links
  – A lot of “messy” data
• Ontologies for Natural Language Processing
  – Less formal
  – Relation to language semantics
  – Formalization is restricted by the current state of natural language processing

Page 11: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Ontology-1: Ontology Spectrum (Obrst, 2006)

[Figure: the Ontology Spectrum, ordered from weak semantics to strong semantics (from less to more expressive)]

• Taxonomy: “Is Sub-Classification of”
• Thesaurus: “Has Narrower Meaning Than”
• Conceptual Model: “Is Subclass of”
• Logical Theory: “Is Disjoint Subclass of with transitivity property”
• Formalisms shown in the figure: Relational Model, XML; ER; Extended ER; DB Schemas, XML Schema; RDF/S; XTM; UML; Description Logic; DAML+OIL, OWL; First Order Logic; Modal Logic
• Interoperability levels: Syntactic, Structural, Semantic

Page 12: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Expressivity vs. community-size (Hepp, 2007)

Page 13: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Ontology-2: Semantic Web. Linking Data Project

http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Page 14: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Approach 3. Ontologies for Natural Language Processing

• Relations between the concepts and lexical meanings are quite complex

• How to represent synonyms and near-synonyms

• How detailed the representation of lexical senses of ambiguous words should be

• Large volume vs. complexity of description

• WordNet as a symbol of this approach

• (!) For different tasks – different types of ontologies

Page 16: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Complicated vs. simple domains

• Simple domains (wine ontology)
  – Explicit boundaries
  – Boundaries are determined by “physical processes”, e.g. production, services
  – Clear roles of entities
  – Small number of classes (which may have many instances) or many uniform classes
• Complicated domains (terrorism, financial control)
  – Vague boundaries
  – The same entities are used in different roles and functions
  – Knowledge is stored in text documents

Page 17: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Wine ontology

http://www.w3.org/TR/owl-guide/wine.rdf

[Figure: fragment of the wine ontology. Classes shown: Wine, WhiteWine, RedWine, WhiteBurgundy, WhiteLoire, WhiteBordeaux, TableWine, SweetWine, Region, Grape, Meal course]

Page 18: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Complicated domains: vague boundaries

• Interdisciplinarity
  – State financial control (economy + law + finances)
  – Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons…)
• Two main parts
  – Center of the domain
  – Additional concepts from neighbouring domains

Page 19: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru
Page 20: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Boundaries of domain: Terrorism

• Center of the domain
  – Terrorist acts, groups, terrorists
  – Anti-terrorist activity
• Additional spheres
  – Geographic places
  – Weapons and explosives
  – Transport
  – Financial payment
  – Ideology, religion, etc.
• Re-use of ontologies?

Page 21: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Problem: Distortion of Reality

• General concepts necessary for domain description are treated as subordinates of domain concepts

• The name of a concept is general, but its intended sense is domain-specific
  – Law (= antiterrorist law)
  – Intelligence (= antiterrorist intelligence)

• Problems in ontology mapping, ontology reuse

• Thesaurus on Radiological terrorism
  http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/

Page 22: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Example: distortion of reality

Page 24: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Ontology Development and Domain-Specific Texts

• Knowledge stored in texts
• Domain-specific text collection
  – As many texts as possible
  – Necessary to find exact boundaries
• Automatic extraction of terms from texts (term acquisition)
  – Terms are expressions corresponding to concepts of a specific domain
• Top-level modeling
• Use of existing ontologies

Page 25: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Automatic Term Acquisition from Texts

• Linguistic criteria (noun groups)

• Lexical restrictions (e.g. evaluative words such as good and bad are rarely parts of terms)

• Statistical criteria (frequency, mutual information, and many others)

• (!) Use of machine learning approaches to improve term extraction

• Formation of an ordered list of term candidates
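
The statistical criteria above can be illustrated with a minimal sketch. It assumes whitespace tokenization and considers only two-word candidates; the toy corpus, the stop list and the function name are hypothetical, and the linguistic filter (noun groups) is left out because it would need a part-of-speech tagger.

```python
import math
from collections import Counter

# Hypothetical stop list standing in for the lexical restrictions
# (evaluative words such as "good" or "bad" rarely occur inside terms).
EVALUATIVE = {"good", "bad"}

def rank_bigram_candidates(sentences, min_freq=2):
    """Return bigram term candidates ordered by frequency, then by PMI."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    ranked = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq or w1 in EVALUATIVE or w2 in EVALUATIVE:
            continue
        # Pointwise mutual information: log of p(w1 w2) / (p(w1) * p(w2)).
        p_pair = freq / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        ranked.append((f"{w1} {w2}", freq, math.log(p_pair / p_independent)))
    ranked.sort(key=lambda item: (-item[1], -item[2]))
    return ranked

# Toy corpus (an assumption, not data from the tutorial).
corpus = [
    "federal budget resources were spent",
    "the federal budget was approved",
    "good budget planning helps the federal budget",
]
candidates = rank_bigram_candidates(corpus)
for phrase, freq, pmi in candidates:
    print(f"{phrase}\t{freq}\t{pmi:.2f}")
```

On real collections the ordered list would still need expert review, since frequent phrases such as "the federal" pass purely statistical filters.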

Page 26: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

The most frequent phrases in documents of the financial control domain

• Translation from Russian
  – Federal budget
  – Russian Federation
  – Accounting Chamber
  – Federal law
  – Overall sum (-)
  – Resources of federal budget (?)
  – Oblast budget
  – Financial means
  – Use of financial means (?)
  – Wages
  – Ministry of finance
  – Budget resources
  – Tax body

Page 27: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Analysis of Term-Candidate List

• At the beginning of the list there are many evident terms

• Further down there are many unclear expressions

– whether they are terms (domain experts can have different opinions)

– whether they are related to the domain

– where the boundary of the domain is

• A lot of synonymic variants

• Ambiguity of terms

Page 28: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Boundaries of the domain

• Bottom-up+top-down

• Term extraction from texts – a bottom-up stage

• Extracted expressions are necessary to understand what types of entities are needed in the domain – in fact design of top-level taxonomy

• Top-down analysis

• Combined approach to concept selection (frequency from the collection+top-level taxonomy restrictions)

Page 29: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Synonyms and variants of “money laundering”

• CRIMINAL LAUNDERING

• ILLEGAL LAUNDERING

• LAUNDERING

• LAUNDERING ACTIVITIES

• LAUNDERING OF MONEY

• LAUNDERING OPERATIONS

• MONEY LAUNDERING

• MONEY LAUNDERING ACTIVITIES

• MONEY LEGALIZATION

• MONEY WASHING

• PROFIT LAUNDERING

• PROFIT WASHING
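
One way to handle such variant lists in conceptual indexing is to map every variant to a single concept. The sketch below assumes whitespace tokenization without punctuation handling and uses a greedy longest-match lookup; the concept identifier, function names and sample sentence are hypothetical.

```python
# Variants taken from the list above; the concept name is a hypothetical identifier.
VARIANTS = {
    "MONEY LAUNDERING": [
        "criminal laundering", "illegal laundering", "laundering",
        "laundering activities", "laundering of money", "laundering operations",
        "money laundering", "money laundering activities", "money legalization",
        "money washing", "profit laundering", "profit washing",
    ],
}

def build_index(concept_variants):
    """Map each lower-cased variant to its concept identifier."""
    return {
        variant.lower(): concept
        for concept, variants in concept_variants.items()
        for variant in variants
    }

def find_concepts(text, index):
    """Greedy longest-match lookup of variants in a whitespace-tokenized text."""
    tokens = text.lower().split()
    max_len = max(len(variant.split()) for variant in index)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in index:
                found.append((phrase, index[phrase]))
                i += n
                break
        else:
            i += 1  # no variant starts at this token
    return found

index = build_index(VARIANTS)
found = find_concepts(
    "The bank reported money laundering activities and profit washing", index
)
print(found)
```

Longest match is chosen so that "money laundering activities" is not split into the shorter variant "money laundering" plus a leftover word.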

Page 30: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Lexical ambiguity

• Homonyms are words that share the same spelling but have different meanings (unrelated in origin)
  – bank (financial institution vs. river bank)
  – rarely met in the same domain, except in a broad one
  – easily recognized by non-linguists
  – different concepts, different sets of relations
• Polysemes are words with the same spelling and distinct but related meanings
  – bank (financial institution vs. building)
  – very often met in any domain
  – regular polysemes (institutions and their buildings)
  – difficult for non-linguists to recognize
  – tendency to use the same ontology concept for related senses

Page 31: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Lexical ambiguity (homonyms): bow

Page 32: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Lexical ambiguity (polysemes)

• Transport
  – They have succeeded in stopping the transport of live animals (= moving)
  – mechanism of contactless payment in public transport (= vehicles)
• Regular polysemy
  – Tree – wood (material): birch
• Non-linguists cannot recognize the different senses, but they sense strange deviations in relations

Page 33: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Lexical ambiguity (polysemes)

• How to help yourself: unambiguous synonymous phrases

– Transport1 = Transportation process

– Transport2 = transport vehicle

– Birch1 = birch tree

– Birch2 = birch wood

• Possible to see different entities behind closely related senses

Page 34: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Relations of an ontology

• The set of relations of an ontology can be non-evident
• Main relations
  – Class-subclass
  – Instance relation
  – Role relations
• Different properties: transitivity, etc.
• Old AI books and manuals: the same relation in all cases, “is_a”
• The diagnostic expression “X is a Y” can be appropriate in all cases

Page 35: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Class-subclass relation

• Relation between two sets of entities (classes) (many-to-many): birch – tree
• Properties: transitivity, inheritance
• Rules:
  – If class A is a subclass of class B, then each instance of class A is also an instance of B
  – Top-level classes (categories) should coincide for A and B
  – Real example of a mistake: river – water object – water – substance, hence: is the Moscow river a substance?
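
The river example can be reproduced with a small sketch of the two rules. The dictionaries and function names are hypothetical, and the top-level category assigned to each class ("object" or "substance") is an assumption made for illustration.

```python
def ancestors(cls, subclass_of):
    """All superclasses reachable through the transitive class-subclass relation."""
    seen = []
    stack = list(subclass_of.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen

def suspicious_links(subclass_of, top_level):
    """Class-subclass links whose two ends fall under different top-level categories."""
    return [
        (child, parent)
        for child, parents in subclass_of.items()
        for parent in parents
        if top_level[child] != top_level[parent]
    ]

# The faulty chain from the slide.
subclass_of = {
    "river": ["water object"],
    "water object": ["water"],
    "water": ["substance"],
}
# Assumed top-level categories for each class.
top_level = {
    "river": "object",
    "water object": "object",
    "water": "substance",
    "substance": "substance",
}

river_ancestors = ancestors("river", subclass_of)
bad_links = suspicious_links(subclass_of, top_level)
print(river_ancestors)  # transitivity makes every river a substance
print(bad_links)        # the link that breaks the top-level rule
```

The check points at the link between "water object" and "water", which is where the chain crosses from objects to substances.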

Page 36: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Instance relation

• One-to-many relation
  – Moscow river: instance of river
  – Teacher: instance of profession
• Not transitive
  – Rex, poodle, dog breed, dog: what are the relations?
  – Rex is an instance of poodle
  – Poodle is an instance of dog breed
  – Poodle is a subclass of dog
  – Rex is not a dog breed
  – Rex is a dog

[Figure: Rex is an instance of Poodle; Poodle is a subclass of Dog and an instance of Dog breed; the instance link from Rex to Dog breed is crossed out]
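
The Rex example can be checked mechanically. In this minimal sketch (data structures and function names are hypothetical), instance links are inherited along subclass links but are never chained with other instance links.

```python
SUBCLASS_OF = {"poodle": {"dog"}}
INSTANCE_OF = {"Rex": {"poodle"}, "poodle": {"dog breed"}}

def superclasses(cls):
    """The class itself plus everything reachable via class-subclass links."""
    result, stack = {cls}, [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def is_instance(entity, cls):
    """True if one of the entity's direct classes is cls or a subclass of cls."""
    return any(cls in superclasses(direct) for direct in INSTANCE_OF.get(entity, ()))

print(is_instance("Rex", "poodle"))        # direct instance link
print(is_instance("Rex", "dog"))           # inherited through poodle -> dog
print(is_instance("Rex", "dog breed"))     # instance links are not chained
print(is_instance("poodle", "dog breed"))  # poodle itself is the instance here
```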

Page 37: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Roles and types

• Roles: student, employer, terrorist, player
• Types: person, animal, building, car
• A role is a type under some conditions
• A student is a person in the role of learning
• Properties of roles:
  – Roles are created dynamically
  – Roles can play other roles
  – A type can play many different roles

Page 38: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Confusion of type-role relations with class-subclass relations

• Frequent mistake of almost every beginner
• Not every person is an employer; an organization is not an employer in all situations
• Problems with inference

[Figure: class-subclass links between Employer and Person and between Employer and Organization, both crossed out]

Page 39: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Text-motivated confusion of types and roles

• Natural substances such as salt, sugar, vinegar, alcohol, .. are also used as traditional preservatives. (wikipedia)

• Often salt and other preservatives are added to canned foods. (http://www.family-health-and-nutrition.com/this-vs-that.html)

• What relation is between salt and preservative?– Class-subclass?

– Class – instance?

– ..

• In practice, beginners usually try to establish a class-subclass relation; however, this is a type-role relation: preservative is a role of substances.
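
A minimal sketch of keeping the two relations apart, so that inference over class-subclass links never concludes that salt is always a preservative. The dictionaries and function names are hypothetical; the substances are those named in the quoted sentences.

```python
# Class-subclass links hold in every situation.
SUBCLASS_OF = {
    "salt": "substance",
    "sugar": "substance",
    "vinegar": "substance",
    "alcohol": "substance",
}
# Type-role links hold only in some situations, so they are stored separately.
CAN_PLAY_ROLE = {name: {"preservative"} for name in SUBCLASS_OF}

def is_subclass(child, ancestor):
    """Follow class-subclass links only."""
    while child in SUBCLASS_OF:
        child = SUBCLASS_OF[child]
        if child == ancestor:
            return True
    return False

def can_play(type_name, role):
    """Check the separate type-role relation."""
    return role in CAN_PLAY_ROLE.get(type_name, set())

print(is_subclass("salt", "substance"))     # salt is always a substance
print(is_subclass("salt", "preservative"))  # not a class-subclass link
print(can_play("salt", "preservative"))     # salt can act as a preservative
```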

Page 40: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Automatic extraction of relations from texts

• A lot of scientific publications: extraction of synonyms, taxonomies, part-whole relations, etc.

• But in a complex domain it is impossible to rely fully on automatic tools

• In many cases it is the evident relations that are extracted

• Causes
  – Multiword expressions
  – Ambiguity of language expressions
  – Contextual dependence
  – Necessity of processing a very large domain text collection

Page 42: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Automatic text categorization

• Main approaches
  – Knowledge-based methods (based on rules)
  – Machine learning methods: very popular at scientific conferences
• Text categorization in real practice (operational text categorization)
  – A training collection should exist
  – Experts should categorize documents in a consistent way
  – Every category needs a sufficient number of training examples

In practice, knowledge-based systems are widely used

• Reuters (provider of the well-known Reuters-21578 training collection) uses a knowledge-based system for text categorization of its own documents

Page 43: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Subjectivity of experts

Experts’ agreement in manual text categorization is around 60%

Page 44: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Our text categorization projects

• Use of both approaches, depending on the task and data
• Knowledge-based approach uses knowledge of our large resource, the RuThes thesaurus
• Projects
  – Classifier for Central Election Committee (450 categories, 4 levels)
  – Classifier of Russian legislation (1169 categories, 3000 categories)
  – Classifier of English economic research papers (700 categories)
  – Classifier of public opinion polls (350 categories)
  – Classifier of banking documents and news (200 categories)
  – General news classifiers
  – and others

Page 45: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Thesaurus on sociopolitical life

• Sociopolitical domain: social life of contemporary society

• Includes: thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts etc.

• Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports

• 36 thousand concepts, 100 thousand terms, 140 thousand direct relations

• Applications: conceptual indexing; automatic text categorization, document clustering, automatic text summarization, question-answering.

Page 46: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Socio-Political Domain

[Figure: levels of hierarchy within the Socio-Political Domain; subdomains shown: Law, Accounting, Taxation, Banking]

Page 47: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Thesaurus-based text categorization

• Use of knowledge described in the Thesaurus

• Manual description of Boolean expressions for categories based on small number of thesaurus concepts

• Automatic thesaurus-based expansion of Boolean expressions

• Thesaurus-based thematic representation of the text content independent of the genre and the length of a text (lexical chain technique)

Page 48: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Describing a category with supporting concepts

• Categorization of legal acts
• 200.020.020. Heads of states summits
• { ( HEADS OF STATES SUMMIT_Y )
  OR
  { ( NEGOTIATIONS_N )
    ( INTERNATIONAL NEGOTIATIONS_Y )
    ( INTERNATIONAL CONTACTS_N )
    ( MEETING_N ) }
  AND
  ( HEAD OF STATE_L ) }

Page 49: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Expanded representation of the category

• { ( HEADS OF STATES SUMMIT_Y )
    ( summit, summit meeting, top-level meeting, head of states meeting )
  OR
  { ( NEGOTIATIONS_N )
    ( negotiations, talks )
    ( INTERNATIONAL NEGOTIATIONS_Y )
    ( international talks, interstate talks, diplomatic negotiations, multinational talks, intergovernmental talks, contracting nations, negotiating states … )
    ( INTERNATIONAL CONTACTS_N )
    ( international intercourse, transnational contacts … )
    ( MEETING_N ) }
  AND
  ( HEAD OF STATE_L )
    ( leader of country, president, president of country, federal president, RF president, US president, monarch, …, emir, emir of Kuwait … ) }
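
A minimal sketch of how such an expanded Boolean description could be evaluated against a text. Several points are assumptions rather than the authors' implementation: the four concepts inside the inner braces are combined with OR, concepts are detected by plain substring matching, MEETING is given its own name as its only term, and the term lists are shortened copies of those above.

```python
# Shortened term lists from the expanded representation (assumed subset).
EXPANSION = {
    "HEADS OF STATES SUMMIT": ["summit", "top-level meeting", "head of states meeting"],
    "NEGOTIATIONS": ["negotiations", "talks"],
    "INTERNATIONAL NEGOTIATIONS": ["international talks", "interstate talks",
                                   "diplomatic negotiations"],
    "INTERNATIONAL CONTACTS": ["international intercourse", "transnational contacts"],
    "MEETING": ["meeting"],  # assumption: the concept's own name as its term
    "HEAD OF STATE": ["leader of country", "president", "monarch", "emir"],
}

# Nested (operator, operand, ...) tuples; the inner OR is an assumption.
CATEGORY = (
    "OR",
    "HEADS OF STATES SUMMIT",
    ("AND",
     ("OR", "NEGOTIATIONS", "INTERNATIONAL NEGOTIATIONS",
      "INTERNATIONAL CONTACTS", "MEETING"),
     "HEAD OF STATE"),
)

def concepts_in(text):
    """Concepts with at least one term occurring in the text (crude substring match)."""
    lowered = text.lower()
    return {c for c, terms in EXPANSION.items() if any(t in lowered for t in terms)}

def matches(expr, concepts):
    """Recursively evaluate the Boolean expression over the detected concepts."""
    if isinstance(expr, str):
        return expr in concepts
    op, *args = expr
    results = [matches(arg, concepts) for arg in args]
    return any(results) if op == "OR" else all(results)

for text in [
    "The president held talks with the emir of Kuwait",
    "Trade talks resumed on Monday",
    "The summit opened in Moscow",
]:
    print(text, "->", matches(CATEGORY, concepts_in(text)))
```

Substring matching is deliberately crude (for example, "president" also matches inside "vice-president"); a real system would match thesaurus terms on tokenized, lemmatized text.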

Page 50: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

ROMIP: Russian Seminar on Information Retrieval

• Russian TREC
• Text categorization task
• Categories: DMOZ, 247 categories of the 2nd level, Top/World/Russian/*/*
• Training collection: «DMOZ» (presented by Rambler)
  – 300 000 documents, 2100 sites
• Testing collection: Belarusian Internet «BY.web» (granted by Yandex company)
  – 1 500 000 documents, 19 000 sites
• Our task:
  – Thesaurus-based text categorization
  – Measuring the time needed to create a categorization system
  – Evaluation

Page 51: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Knowledge-based approach (8 man-hours)

Category 135 «Martial arts» (F1-measure [OR] = 97%, R=98%, P= 96%)

Boolean expression for the category

MARTIAL ARTS (Е)

«E» -- full expansion using the thesaurus tree

The expanded description includes: AIKIDO, JIUJUTSU, JUDO, KARATE, JUDOIST, KARATEKA …
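
The reported figures are mutually consistent under the standard definition of F1 as the harmonic mean of precision and recall. A small check (the function name is hypothetical):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for category 135 «Martial arts»: P = 96%, R = 98%.
value = f1(0.96, 0.98)
print(f"{value:.2f}")
```

This gives approximately 0.97, matching the F1-measure stated for the category.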

Page 52: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

ROMIP: web-page categorization [or]

[Figure: bar chart “DMOZ categorization webpages 2007, or onlyJudged”. Vertical axis from 0 to 0.9; bar groups: F1 (microaverage), Precision (microaverage), Recall (microaverage); series: xxxx-1, xxxx-2, xxxx-3, xxxx-4, thescateg]

Page 53: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Benefits from Large-Scale Linguistic Ontologies Use in Information Retrieval

Information Retrieval Tasks                                   Benefits
Web Search                                                    0+ %
Corporate Search / Legal Search                               10 %
Long Queries / Verbose Queries                                15 %
Text Categorization                                           15-50 %
News Clustering                                               15 %
Summarization, Visualization, Multi Document Summarization    ++ (SUMMAC)

Page 54: Problems of Ontology Development  for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru

Conclusion

• Complex domains
  – Broad domains including a lot of heterogeneous entities
  – Vague boundaries
  – Knowledge stored in texts
• Special efforts to find boundaries
• Knowledge acquisition from texts
  – Partial automation
  – Necessity to overcome the ambiguity and vagueness of natural texts, even for non-linguists