J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria. Münster, 26-27 June 2003
Advanced Information Systems Laboratory
http://iaaa.cps.unizar.es
Department of Computer Science and Systems Engineering
GI-DAYS MÜNSTER
A software tool for thesauri management, browsing and supporting advanced searches
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria
Münster, 26-27 June 2003
Contents
- Introduction
- Architecture of ThManager application
- Basic capabilities
- Enhanced capabilities
- Conclusions
Introduction to thesauri
"A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788]
Used to improve the precision and recall of information retrieval in digital libraries:
- provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings")
- supply users with a suitable vocabulary for retrieval
- expand user queries by automatically adding new terms to the query
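The query-expansion use listed above can be sketched in a few lines. This is only an illustration, not ThManager's implementation: the toy thesaurus, the function name `expand_query` and the choice to add synonyms and narrower terms are all assumptions.

```python
# Hypothetical toy thesaurus: each term maps to synonyms (SYN) and narrower terms (NT).
THESAURUS = {
    "accident": {"SYN": ["mishap"], "NT": ["traffic accident", "work accident"]},
    "climate": {"SYN": [], "NT": ["weather"]},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus their synonyms and narrower terms, without duplicates."""
    expanded = []
    for term in terms:
        entry = thesaurus.get(term, {})  # unknown terms are kept but not expanded
        for candidate in [term, *entry.get("SYN", []), *entry.get("NT", [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["accident", "flood"]))
```

Terms missing from the thesaurus pass through unchanged, so the expanded query always contains the user's original wording.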
Introduction to thesauri
- A thesaurus management tool becomes a vital component in the development of any kind of digital library
- One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide discovery, evaluation and access to spatial data for a community of users.
  - An SDI can be considered a digital library specialised in geographic information resources.
- A thesaurus management tool will therefore also be a vital component in the development of SDIs.
Architecture of ThManager application
[Figure: layered architecture diagram. Recoverable labels, by level:]
- Level 3. Application: ThManager (ThesaurusMngmt): thesaurus management, import/export
- Level 2. GUI: Thesaurus.gui, generic GUI components for thesauri visualization
- Level 1. Model: Thesaurus.model; Lexicon; WordNet; Polisemy (polisemy extraction); branch disambiguation; keywords expansion. Components are marked "basic" or "enhanced".
- Level 0. Database: accessed through <<JDBC>>
  - 100% SQL (basic)
  - Oracle interMedia Text (enhanced)
  - Data shown: thesaurus, keywords, WordNet files, metadata records
Basic Capabilities
- Editing of thesauri according to ISO norms
  - Broader terms (BT), narrower terms (NT)
  - Related terms (RT), preferred terms (PT)
  - Scope notes (SN), synonyms (SYN, USE)
  - Language translations (TR)
- Visualization of thesauri
  - Hierarchical, alphabetical
  - Search of terms
- Multilingual access support
  - Browsing according to the language selected by users
- Import/Export
  - Text file proprietary formats
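The relationship types listed above suggest a simple record per term. The following is a sketch under assumptions, not the Thesaurus.model package itself: the class `Term`, its field names and the helper `add_narrower` are hypothetical and only mirror the relationship codes named on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry; field names mirror the relationship codes on the slide."""
    label: str
    language: str = "en"
    bt: list = field(default_factory=list)   # broader terms
    nt: list = field(default_factory=list)   # narrower terms
    rt: list = field(default_factory=list)   # related terms
    syn: list = field(default_factory=list)  # synonyms (non-preferred labels)
    sn: str = ""                             # scope note
    tr: dict = field(default_factory=dict)   # translations: language code -> label


def add_narrower(broader: Term, narrower: Term) -> None:
    """Record a BT/NT pair in both directions so the hierarchy stays consistent."""
    if narrower.label not in broader.nt:
        broader.nt.append(narrower.label)
    if broader.label not in narrower.bt:
        narrower.bt.append(broader.label)


accident = Term("accident", tr={"es": "accidente"})
traffic = Term("traffic accident")
add_narrower(accident, traffic)
print(accident.nt, traffic.bt)
```

Storing both directions of the BT/NT link makes hierarchical browsing cheap in either direction, at the cost of having to update two lists on every edit.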
Browsing / Editing
Import/export formats
- Formats
  - Dot-based notation: succession of narrower terms + additional relationships (SYN, TR, ...)
  - Hierarchical numbering of terms
- The tool should use more standardized formats: RDFS/XML, ...
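The slides do not specify the proprietary dot-based format, so the following is a guess at what reading it might look like. It assumes one path per line, with each dot separating a term from its narrower term; the function name `parse_dot_paths` and the sample lines are hypothetical, and the additional relationships (SYN, TR, ...) are left out.

```python
def parse_dot_paths(lines):
    """Build a narrower-term map from lines such as 'a.b.c' (a > b > c)."""
    narrower = {}
    for line in lines:
        path = [part.strip() for part in line.split(".") if part.strip()]
        # Each consecutive pair in the path is a broader/narrower relationship.
        for parent, child in zip(path, path[1:]):
            children = narrower.setdefault(parent, [])
            if child not in children:
                children.append(child)
        # Make sure leaf terms are present too, with no narrower terms.
        for term in path:
            narrower.setdefault(term, [])
    return narrower


sample = ["climatic issue.weather.weather condition", "climatic issue.weather"]
print(parse_dot_paths(sample))
```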
Enhanced capabilities
- Thesauri are intended for the homogeneous classification of resources
  - They are used to fill metadata keywords
- However, there is still heterogeneity in metadata keywords
  - Metadata creators use different thesauri in different application domains
  - If metadata catalogs provide access to the general public, queries may not contain the same terms as the keywords in metadata records
- A possible solution to bridge the semantic gap
  - Disambiguation of thesauri (and queries) in relation to the concepts of an upper-level ontology
Enhanced capabilities
- Additional tools around semantic disambiguation
  - Browsing WordNet as another thesaurus
  - Searching polysemic senses in WordNet
  - Thesauri disambiguation
  - Automatic expansion of keywords
[Figure: thesauri (Thesaurus 1, Thesaurus 2, ..., Thesaurus N), controlled lists (Controlled list 1, Controlled list 2, ..., Controlled list N) and other knowledge representation models, shown together with WordNet]
Browsing WordNet
- WordNet is structured as a hierarchy of synsets
  - Synsets are defined as sets of synonyms representing a particular concept (sense)
- WordNet libraries and files are accessed through JNI
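To make the synset hierarchy concrete, here is a toy model of it. It does not touch the real WordNet files or the JNI access mentioned above; the class `Synset`, its fields and the three sample synsets are simplified, hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Synset:
    """A set of synonymous words for one sense, linked to a more general synset."""
    words: tuple
    hypernym: Optional["Synset"] = None

    def ancestors(self):
        """Return the chain of more general synsets, nearest first."""
        chain, node = [], self.hypernym
        while node is not None:
            chain.append(node)
            node = node.hypernym
        return chain


# Illustrative data only, not copied from a WordNet release.
event = Synset(("event",))
happening = Synset(("happening", "occurrence"), hypernym=event)
accident = Synset(("accident", "chance event"), hypernym=happening)
print([s.words[0] for s in accident.ancestors()])
```

Browsing "WordNet as another thesaurus" then amounts to treating the hypernym link like a BT relationship and the words of a synset like synonyms.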
Searching polysemic senses in WordNet
- Functionality provided by the Polisemy package
- Compound terms are partitioned if no synset is found
- If adjectives are found, the associated nouns are also searched to reduce the number of not-found words
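The lookup strategy above can be sketched as follows. This is not the Polisemy package: the sense inventory `SENSES`, the adjective table `ADJECTIVE_TO_NOUN` and the function `find_senses` are hypothetical stand-ins for WordNet lookups, and splitting on whitespace is an assumed way of partitioning compound terms.

```python
# Hypothetical sense inventory standing in for WordNet: word -> list of sense ids.
SENSES = {
    "accident": ["accident#1", "accident#2"],
    "traffic": ["traffic#1", "traffic#2", "traffic#3"],
    "environment": ["environment#1", "environment#2"],
}
# Hypothetical adjective -> associated noun table.
ADJECTIVE_TO_NOUN = {"environmental": "environment"}


def find_senses(term):
    """Look up the whole term first; if nothing is found, look up each word separately."""
    if term in SENSES:
        return {term: SENSES[term]}
    found = {}
    for word in term.split():
        senses = list(SENSES.get(word, []))
        if word in ADJECTIVE_TO_NOUN:
            # Also search the noun associated with the adjective.
            senses += SENSES.get(ADJECTIVE_TO_NOUN[word], [])
        found[word] = senses
    return found


print(find_senses("environmental accident"))
```

The number of senses returned per word is the polysemy figure that later serves as a discounting factor in the disambiguation.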
Thesauri Disambiguation
- Unsupervised disambiguation method
  - The senses of every thesaurus term are searched in WordNet.
  - The hierarchical structure of the thesaurus is used as the word context for a voting algorithm to find the closest sense
  - Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT)
[Figure: example thesaurus branches. The branch rooted at "accident" contains: accident source, environmental accident, major accident, traffic accident, work accident, technological accident, shipping accident, nuclear accident, core meltdown, oil sick accident, explosion, leakage. A further branch is rooted at "administration", ...]
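The partition into branches described above can be sketched as a walk up the BT links. The sketch assumes every term has at most one broader term and that there are no cycles; the function name `split_into_branches` and the sample data are illustrative, and the nesting shown is not taken from the figure.

```python
def split_into_branches(broader):
    """Group terms by their root; `broader` maps each term to its BT (or None for a root)."""
    def root_of(term):
        # Follow BT links upwards until a term without a broader term is reached.
        while broader.get(term) is not None:
            term = broader[term]
        return term

    branches = {}
    for term in broader:
        branches.setdefault(root_of(term), []).append(term)
    return branches


sample = {
    "accident": None,
    "traffic accident": "accident",
    "work accident": "accident",
    "administration": None,
}
print(split_into_branches(sample))
```

A thesaurus with polyhierarchy (terms with several BTs) would need a different representation, for example a list of broader terms per entry and one branch membership per root reached.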
Thesauri Disambiguation II
- Voting algorithm to obtain the disambiguated synset of a term "a"
  - Every synset "s" associated with the rest of the terms in the branch votes (with a proximity weight) for the synsets of term "a"
  - Main weight: number of subsumers in the WordNet hierarchy (matches in the WordNet hierarchy of ancestors)
  - Discounting factors:
    - Synset depth
    - Branch distance
    - Polysemy of the term associated with synset "s"
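A rough sketch of such a voting scheme follows. The slides name the main weight and the discounting factors but not how they are combined, so the formula here (shared subsumers divided by the product of depth, branch distance and polysemy) is purely an assumption, as are the function name `disambiguate` and the sample synset ids.

```python
def disambiguate(candidates, voters, ancestors):
    """Pick the candidate synset of a term that collects the highest vote total.

    candidates: synset ids of the term being disambiguated.
    voters: (synset id, branch distance, polysemy of its term) for the other terms.
    ancestors: synset id -> set of subsuming synset ids in the hierarchy.
    """
    scores = {}
    for cand in candidates:
        depth = max(len(ancestors[cand]), 1)  # assumed proxy for synset depth
        total = 0.0
        for voter, distance, polysemy in voters:
            shared = len(ancestors[cand] & ancestors[voter])  # common subsumers
            total += shared / (depth * distance * polysemy)   # assumed discounting
        scores[cand] = total
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical synset ids and ancestor sets, for illustration only.
ANCESTORS = {
    "accident#1": {"event", "happening", "misfortune"},
    "accident#2": {"phenomenon", "chance"},
    "crash#1": {"event", "happening", "misfortune"},
    "leak#1": {"event", "happening"},
}
best, scores = disambiguate(
    ["accident#1", "accident#2"],
    [("crash#1", 1, 2), ("leak#1", 2, 1)],
    ANCESTORS,
)
print(best, scores)
```

In this toy data the first sense wins because it shares subsumers with both voters, while the second sense shares none and receives no votes.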
Thesauri disambiguation III
Annotation of disambiguated synsets
Automatic expansion of keywords with new disambiguated thesauri

Thesaurus      Original term                 Reliability
CEOPARAMETER   earth science atmosphere      100
CEODISCIPLINE  weather & climate             100

Thesaurus  Expanded term                                    Reliability
GEMET      atmosphere                                       99
           climate
           climate weather
           climate weather weather condition
           climatic issue
           climatic issue weather
           climatic issue.weather weather condition
ADL-FTT    regions climatic regions                         49.5
NASA       atmospheric science                              49.5
           atmospheric science atmospheric pressure
           atmospheric science atmospheric temperature
Comparison between the initial collection of synsets and the synsets of a new term:
$$\text{threshold} = \frac{\text{synset matches of new term}}{\text{synsets of new term}}$$
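The ratio above can be computed directly once synsets are available as identifiers. This is a sketch under assumptions: the function name `expansion_ratio`, the synset ids and the reading of "matches" as set intersection are not from the slides, and no claim is made about how the reliability values in the table were obtained.

```python
def expansion_ratio(initial_synsets, new_term_synsets):
    """Share of the new term's synsets that also appear in the initial collection."""
    new = set(new_term_synsets)
    if not new:
        return 0.0  # a term with no synsets cannot match anything
    return len(new & set(initial_synsets)) / len(new)


# Hypothetical synset ids for the original keywords and for a candidate term.
initial = {"atmosphere#1", "weather#1", "climate#1"}
print(expansion_ratio(initial, ["climate#1", "climate#2"]))
```

A candidate term would then be accepted as an expanded keyword when this ratio reaches a chosen cut-off value.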
Expansion of keywords II
Conclusions & future lines
- ThManager is a flexible tool to manage thesauri
  - It provides enhanced functionality for the improvement of classifications.
- This tool can be easily integrated into other tools
  - It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields.
- Future lines:
  - Creation of a thesaurus Web Service providing some of the functionality offered by this tool: thesaurus browsing, WordNet polysemy extraction, keywords expansion, ...
  - Concept-based retrieval
    - Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs.
    - It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets
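Indexing records by disambiguated synsets, as proposed in the last point, could look roughly like this. The sketch is hypothetical throughout: the function `build_synset_index`, the record ids and the keyword-to-synset mapping are invented for illustration.

```python
from collections import defaultdict


def build_synset_index(records, term_to_synset):
    """Map each synset id to the ids of the records whose keywords resolve to it."""
    index = defaultdict(set)
    for record_id, keywords in records.items():
        for keyword in keywords:
            synset = term_to_synset.get(keyword)
            if synset is not None:  # keywords that were not disambiguated are skipped
                index[synset].add(record_id)
    return index


# Hypothetical mapping produced by thesaurus disambiguation.
term_to_synset = {
    "weather": "weather#1",
    "weather condition": "weather#1",
    "climate": "climate#1",
}
records = {"r1": ["weather"], "r2": ["weather condition", "climate"]}
index = build_synset_index(records, term_to_synset)
print(sorted(index["weather#1"]))
```

Two records that use different keywords from different thesauri end up under the same entry when both keywords resolve to the same synset, which is the point of a unified index.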
Advanced Information Systems Laboratory
http://iaaa.cps.unizar.es