J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria. Münster, 26-27 June 2003

Page 1: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Advanced Information Systems Laboratory

http://iaaa.cps.unizar.es

Department of Computer Science and Systems Engineering

GI-DAYS MÜNSTER

A software tool for thesauri management, browsing and supporting advanced searches

J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Page 2: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

21 Apr 2023

Contents

Introduction
Architecture of the ThManager application
Basic capabilities
Enhanced capabilities
Conclusions

Page 3: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Introduction to thesauri

"A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788]

Thesauri are used to improve the precision and recall of information retrieval in digital libraries. They:
  provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings")
  supply users with a suitable vocabulary for retrieval
  allow the expansion of user queries by automatically adding new terms to the query
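As a rough illustration of the query expansion mentioned above, the following Python sketch adds synonyms and narrower terms to a user query. The toy thesaurus, its terms and the `expand_query` function are hypothetical and are not taken from ThManager.

```python
# Hypothetical toy thesaurus: each term maps to synonyms (SYN) and
# narrower terms (NT). The entries are invented for illustration.
TOY_THESAURUS = {
    "accident": {"SYN": ["mishap"], "NT": ["traffic accident", "nuclear accident"]},
    "climate": {"SYN": [], "NT": ["weather"]},
}


def expand_query(query_terms, thesaurus):
    """Return the query terms followed by their synonyms and narrower terms."""
    expanded = []
    for term in query_terms:
        entry = thesaurus.get(term, {})
        candidates = [term] + entry.get("SYN", []) + entry.get("NT", [])
        for candidate in candidates:
            if candidate not in expanded:  # keep order, avoid duplicates
                expanded.append(candidate)
    return expanded
```

Terms that are missing from the thesaurus are passed through unchanged, so the expansion never drops anything the user typed.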

Page 4: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Introduction to thesauri

A thesaurus management tool is a vital component in the development of any kind of digital library.

One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide discovery, evaluation and access to spatial data for a community of users.
  An SDI can be considered a digital library specialised in geographic information resources.

A thesaurus management tool is therefore also a vital component in the development of SDIs.

Page 5: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Architecture of the ThManager application

[Figure: layered architecture diagram. Recoverable labels:]
  Levels: Level 3. Application; Level 2. GUI; Level 1. Model; Level 0. Database
  Application components: ThManager, ThesaurusMngmt (thesaurus management, import/export)
  Packages: Thesaurus.gui (generic GUI components for thesauri visualization), Thesaurus.model, Keywords expansion, Polisemy (polysemy extraction, branch disambiguation), Lexicon, WordNet
  Data: Thesaurus, stored either with 100% SQL (basic) or with Oracle Intermedia Text (enhanced); WordNet files; metadata records; Keywords
  Other labels: <<JDBC>>; basic; enhanced

Page 6: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Basic Capabilities

Editing of thesauri according to ISO norms
  Broader terms (BT), narrower terms (NT)
  Related terms (RT), preferred terms (PT)
  Scope notes (SN), synonyms (SYN, USE)
  Language translations (TR)

Visualization of thesauri
  Hierarchical, alphabetical
  Search of terms

Multilingual access support
  Browsing according to the language selected by users

Import/Export
  Text files in proprietary formats
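The relationships and views listed above can be pictured with a small data model. The following Python sketch is only an illustration under assumed names (`Term`, `add_narrower`, `hierarchical_view`, `alphabetical_view`); it does not reproduce the ThManager classes, and the nesting of the example terms is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus term with the relationships listed above (illustrative)."""
    label: str
    broader: list = field(default_factory=list)       # BT
    narrower: list = field(default_factory=list)      # NT
    related: list = field(default_factory=list)       # RT
    synonyms: list = field(default_factory=list)      # SYN / USE
    translations: dict = field(default_factory=dict)  # TR, e.g. {"es": "accidente"}
    scope_note: str = ""                              # SN


def add_narrower(parent, child):
    """Record a BT/NT pair so that both directions stay consistent."""
    parent.narrower.append(child.label)
    child.broader.append(parent.label)


def hierarchical_view(terms, root, depth=0):
    """Return indented lines for the subtree under `root`."""
    lines = ["  " * depth + root]
    for label in sorted(terms[root].narrower):
        lines += hierarchical_view(terms, label, depth + 1)
    return lines


def alphabetical_view(terms, lang=None):
    """Return labels sorted alphabetically, translated when `lang` is available."""
    return sorted(t.translations.get(lang, t.label) for t in terms.values())


# Hypothetical example data.
terms = {name: Term(name) for name in ("accident", "nuclear accident", "core meltdown")}
add_narrower(terms["accident"], terms["nuclear accident"])
add_narrower(terms["nuclear accident"], terms["core meltdown"])
terms["accident"].translations["es"] = "accidente"
```

Calling `hierarchical_view(terms, "accident")` yields one indented line per term, while `alphabetical_view(terms, "es")` falls back to the original label for terms without a Spanish translation.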

Page 7: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Browsing / Editing

Page 8: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Import/export formats

Formats
  Dot-based notation: succession of narrower terms + additional relationships (SYN, TR, ...)
  Hierarchical numbering of terms

The tool should use more standardized formats: RDFS/XML, ...
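The slides do not specify the exact file layout, so the following Python sketch assumes a simplified dot-based notation in which every line is a path of terms from a root down to a narrower term, separated by dots (for example "accident.nuclear accident.core meltdown"). The function names and the sample terms are hypothetical, and the additional relationships (SYN, TR, ...) are left out.

```python
def parse_dot_notation(lines):
    """Build {term: set of narrower terms} from dot-separated term paths."""
    narrower = {}
    for line in lines:
        path = [part.strip() for part in line.split(".") if part.strip()]
        for term in path:
            narrower.setdefault(term, set())
        # Each term in the path is a narrower term (NT) of the one before it.
        for parent, child in zip(path, path[1:]):
            narrower[parent].add(child)
    return narrower


def export_dot_notation(narrower, root, prefix=""):
    """Write the subtree under `root` back as sorted dot-separated paths."""
    path = prefix + root
    lines = [path]
    for child in sorted(narrower.get(root, ())):
        lines += export_dot_notation(narrower, child, path + ".")
    return lines
```

Importing and then exporting the same lines should return them unchanged, which is a convenient check for any such proprietary text format.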

Page 9: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Enhanced capabilities

Thesauri are intended for the homogeneous classification of resources
  They are used to fill metadata keywords

However, there is still heterogeneity in metadata keywords
  Metadata creators use different thesauri in different application domains
  If metadata catalogs provide access to the general public, queries may not contain the same terms as the keywords in metadata records

A possible solution to bridge the semantic gap
  Disambiguation of thesauri (and queries) in relation to the concepts of an upper-level ontology

Page 10: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Enhanced capabilities

Additional tools around semantic disambiguation
  Browsing WordNet as another thesaurus
  Searching polysemic senses in WordNet
  Thesauri disambiguation
  Automatic expansion of keywords

[Figure. Labels: Thesaurus 1, Thesaurus 2, Thesaurus N; Controlled list 1, Controlled list 2, Controlled list N; Other knowledge representation models; WordNet]

Page 11: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Browsing WordNet

WordNet is structured as a hierarchy of synsets
  Synsets are defined as sets of synonyms representing a particular concept (sense)

WordNet libraries and files are accessed through JNI

Page 12: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Searching polysemic senses in WordNet

Functionality provided by the Polisemy package
  Compound terms are partitioned if no synset is found
  If adjectives are found, their associated nouns are also searched to reduce the number of not-found words
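The lookup strategy described above can be sketched in Python as follows. The dictionaries below are hypothetical stand-ins for WordNet (sense identifiers per word, and a table linking adjectives to nouns), and `find_senses` is an invented name; the real Polisemy package accesses the WordNet files instead.

```python
# Hypothetical stand-ins for WordNet: sense identifiers per word, and a
# table linking adjectives to their associated nouns.
TOY_LEXICON = {
    "accident": ["accident#1", "accident#2"],
    "climate": ["climate#1", "climate#2"],
    "issue": ["issue#1"],
}
ADJECTIVE_TO_NOUN = {"climatic": "climate"}


def find_senses(term, lexicon, adjective_to_noun):
    """Return {word: senses} for a term, splitting compound terms when needed."""
    if term in lexicon:
        return {term: lexicon[term]}
    found = {}
    # No synset for the whole term: partition it into its words.
    for word in term.split():
        candidates = [word]
        if word in adjective_to_noun:  # also search the associated noun
            candidates.append(adjective_to_noun[word])
        for candidate in candidates:
            if candidate in lexicon:
                found[candidate] = lexicon[candidate]
    return found
```

With this toy data, "climatic issue" has no entry of its own, so it is split and resolved through the senses of "climate" and "issue".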

Page 13: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Thesauri Disambiguation

Unsupervised disambiguation method
  The senses of every thesaurus term are searched in WordNet.
  The hierarchical structure of the thesaurus is used as the word context for a voting algorithm to find the closest sense
    Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT)

[Figure: example of thesaurus branches. Terms shown:]

accident

accident source

environmental accident

major accident

traffic accident

work accident

technological accident

shipping accident

nuclear accident

core meltdown

oil slick accident

explosion

leakage

administration

...
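The partition into branches can be sketched as follows in Python. The sketch assumes that every term has at most one broader term and that the BT links contain no cycles; the nesting given to the example terms is hypothetical, since the slide only lists the terms.

```python
def partition_into_branches(broader):
    """Group terms into branches, one per root term (a term without BT).

    `broader` maps each term to its broader term, or None for a root.
    Assumes at most one BT per term and no cycles.
    """
    branches = {}
    for term in broader:
        root = term
        while broader[root] is not None:  # climb BT links up to the root
            root = broader[root]
        branches.setdefault(root, []).append(term)
    return branches


# Hypothetical nesting of some of the terms shown on the slide.
BROADER = {
    "accident": None,
    "technological accident": "accident",
    "nuclear accident": "technological accident",
    "core meltdown": "nuclear accident",
    "administration": None,
}
```

Each key of the result is a root term, and each value lists the terms that provide the context for disambiguating one another.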

Page 14: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Thesauri Disambiguation II

Voting algorithm to obtain the disambiguated synset of a term "a"
  Every synset "s" associated with the rest of the terms in the branch votes (proximity weight) for the synsets of term "a"
    Main weight: number of subsumers in the WordNet hierarchy
      Matches in the WordNet hierarchy of ancestors
    Discounting factors:
      Synset depth
      Branch distance
      Polysemy of the term associated with synset "s"
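The slides name the ingredients of the vote but not the exact formula, so the Python sketch below uses an assumed form: the number of common subsumers divided by the product of the three discounting factors. The hypernym chains are a hypothetical stand-in for WordNet, and all names and sample terms are invented for illustration.

```python
# Hypothetical hypernym chains (root first, ending with the synset itself)
# standing in for the WordNet hierarchy.
HYPERNYMS = {
    "plant#factory": ["entity", "artifact", "building", "plant#factory"],
    "plant#flora": ["entity", "organism", "plant#flora"],
    "tree#flora": ["entity", "organism", "plant#flora", "tree#flora"],
    "tree#diagram": ["entity", "abstraction", "figure", "tree#diagram"],
    "flower#flora": ["entity", "organism", "plant#flora", "flower#flora"],
}


def common_subsumers(synset_a, synset_b):
    """Count the ancestors shared by two synsets (their common subsumers)."""
    return len(set(HYPERNYMS[synset_a]) & set(HYPERNYMS[synset_b]))


def disambiguate(term, senses, branch_distance):
    """Pick the sense of `term` that collects the largest total vote.

    `senses` maps each term of the branch to its candidate synsets;
    `branch_distance` maps every other term to its distance (in BT/NT
    links) from `term`. The discounting used here is an assumed form.
    """
    scores = {}
    for candidate in senses[term]:
        depth = len(HYPERNYMS[candidate])          # synset depth
        total = 0.0
        for other, other_senses in senses.items():
            if other == term:
                continue
            polysemy = len(other_senses)           # polysemy of the voting term
            for s in other_senses:
                weight = common_subsumers(candidate, s)
                total += weight / (depth * branch_distance[other] * polysemy)
        scores[candidate] = total
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical branch with three terms.
SENSES = {
    "plant": ["plant#factory", "plant#flora"],
    "tree": ["tree#flora", "tree#diagram"],
    "flower": ["flower#flora"],
}
best, scores = disambiguate("plant", SENSES, {"tree": 1, "flower": 1})
```

In this toy branch the botanical sense of "plant" collects the larger vote, because it shares more ancestors with the senses of "tree" and "flower".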

Page 15: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Thesauri disambiguation III

Annotation of disambiguated synsets

Page 16: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Automatic expansion of keywords with new disambiguated thesauri

Thesaurus       Original term               Reliability
CEOPARAMETER    earth science atmosphere    100
CEODISCIPLINE   weather & climate           100

Thesaurus   Expanded term                                  Reliability
GEMET       atmosphere                                     99
            climate
            climate weather
            climate weather weather condition
            climatic issue
            climatic issue weather
            climatic issue.weather weather condition
ADL-FTT     regions climatic regions                       49.5
NASA        atmospheric science                            49.5
            atmospheric science atmospheric pressure
            atmospheric science atmospheric temperature

Comparison between the initial collection of synsets and the synsets of a new term. The following ratio is compared with a threshold:

$$\frac{\text{synset\_matches\_of\_new\_term}}{\text{synsets\_of\_new\_term}}$$
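A minimal Python sketch of this comparison is given below. It computes the share of a new term's synsets that match the initial collection and checks it against a threshold; the function names, the use of "greater than or equal" and any threshold value are assumptions, since the slide does not state them.

```python
def match_ratio(collection_synsets, new_term_synsets):
    """Share of the new term's synsets that also occur in the collection."""
    new_term_synsets = set(new_term_synsets)
    if not new_term_synsets:
        return 0.0
    matches = new_term_synsets & set(collection_synsets)
    return len(matches) / len(new_term_synsets)


def accept_expansion(collection_synsets, new_term_synsets, threshold):
    """Accept the new term when its match ratio reaches the threshold (assumed rule)."""
    return match_ratio(collection_synsets, new_term_synsets) >= threshold
```

A term with no synsets at all gets a ratio of 0.0 here, so it is never proposed as an expansion for a positive threshold.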

Page 17: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Expansion of keywords II

Page 18: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Conclusions & future lines

ThManager is a flexible tool to manage thesauri
  It provides enhanced functionality for the improvement of classifications.

This tool can be easily integrated into other tools
  It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields.

Future lines:
  Creation of a thesaurus Web Service providing some of the functionality offered by this tool: thesaurus browsing, WordNet polysemy extraction, keywords expansion, ...
  Concept-based retrieval
    Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs.
    It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets

Page 19: J. Nogueras-Iso , J.A. Bañares, J. Lacasta, J. Zarazaga-Soria Münster, 26-27 June 2003

Advanced Information Systems Laboratory

http://iaaa.cps.unizar.es