
Post on 02-Jan-2016


Quality Control for Wordnet Development in BalkaNet

Pavel Smrž
smrz@fi.muni.cz

Faculty of Informatics, Masaryk University in Brno, Czech Republic

Outline

Introduction, general-purpose language resources
General considerations
Case Study of Quality Control in BalkaNet
Conclusions and Future Directions

Introduction

BalkaNet shares many fundamental principles with EuroWordNet (expected sharing of procedures, policy, structure and tools).

Limitations discovered in the EuroWordNet approach led us to change the data format, to design and implement new applications, and to propose a modified perspective on the future development of lexical semantic databases.

Introduction

application-specific vs. general-purpose LR
procedures of quality control for general-purpose language resources are much less developed
this area has been strongly underestimated in many previous projects
if no quality assurance policy has been applied, the results can differ considerably from what was declared

General Considerations

the availability of documentation of the development process and the final state of data
resource documentation should be comprehensive but at the same time concise to allow a quick scan
project deliverables (longer than necessary, do not describe all aspects, do not reflect the process of development)

The First Commandment!!!

Summarize the description of resources at the end of your project and check the validity of information in all documents that will be part of the documentation!

The Second Commandment!!!

Explicitly define your terminology! (even the meaning of terms that seem to be basic in the context!)
what kinds of variants (typographic, regional, register…) are contained in synsets?
(lake, loch and lough – regional variants of the same concept – form 3 different synsets in PWN, lake is the hypernym of the two others)

Other Requirements

description of the data format in which the resource is provided
XML as the standard for data interchange
DTD, XSD and other XML Schemata
Quantitative characteristics (empty tags may signal inconsistency)
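The remark about empty tags can be illustrated with a small sketch. It counts how often each tag occurs and how often a leaf tag is empty. The tag names ID, POS and DEF appear in the slides; the root element WN, the nesting and the sample IDs are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data: tag names come from the slides,
# the <WN> root and the nesting are assumed for illustration.
SAMPLE = """<WN>
  <SYNSET><ID>ENG20-00001-n</ID><POS>n</POS><DEF>a body of water</DEF></SYNSET>
  <SYNSET><ID>ENG20-00002-n</ID><POS></POS><DEF>   </DEF></SYNSET>
</WN>"""


def tag_statistics(xml_text):
    """Return (frequency of every tag, frequency of empty leaf tags)."""
    root = ET.fromstring(xml_text)
    total, empty = Counter(), Counter()
    for elem in root.iter():
        total[elem.tag] += 1
        # A leaf element with no text, or only whitespace, counts as empty.
        if len(elem) == 0 and not (elem.text or "").strip():
            empty[elem.tag] += 1
    return total, empty


total, empty = tag_statistics(SAMPLE)
for tag in sorted(total):
    print(f"{tag}: {total[tag]} occurrences, {empty[tag]} empty")
```

A non-zero count of empty POS or DEF tags in such a report would point to records that need attention.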

BalkaNet Experience

The most successful procedure to control the quality of linguistic output is to implement a set of validation checks and periodically publish their results. This holds especially for projects with many participants who are not under the same supervision. Validation check reports together with the quantitative assessment can serve as development synchronization points too.

Case Study of Quality Control in BalkaNet

Resource description sheets:
  description of the content of synset records and constraints on data types;
  types of relations included together with examples;
  degree of checking relations borrowed from PWN (related to the expand model);
  numbering scheme of different senses (random, according to their frequency in a balanced corpus, from a particular dictionary, etc.);
  source of definitions and usage examples;
  order of literals in synsets (corpus frequency, familiarity, register or style characteristics)

Quantitative characteristics

tag frequencies
ratio of the number of literals in the national wordnet and in PWN
ID prefix frequencies
frequency of link types
frequency of POS
coverage of BCS
number-of-senses distribution
number of “multi-parent” synsets
number of leaves, inner nodes, roots, free nodes in hyper-hyponymic “trees”
path-length distribution
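Several of the structural counts listed above can be sketched in a few lines. The example assumes the hypernym relation has already been loaded into a dictionary mapping each synset ID to the list of its hypernym IDs. It also assumes that a "free node" is a synset with neither a hypernym nor a hyponym; the slides do not define the term. All IDs are made up.

```python
def hierarchy_stats(hypernyms):
    """Count roots, leaves, inner nodes, free nodes and multi-parent synsets.

    hypernyms maps each synset ID to the list of its hypernym IDs.
    """
    has_children = set()
    for parents in hypernyms.values():
        has_children.update(parents)

    stats = {"roots": 0, "leaves": 0, "inner": 0, "free": 0, "multi_parent": 0}
    for synset, parents in hypernyms.items():
        is_parent = synset in has_children
        if len(parents) > 1:
            stats["multi_parent"] += 1
        if not parents and not is_parent:
            stats["free"] += 1      # assumed meaning: no hypernym, no hyponym
        elif not parents:
            stats["roots"] += 1
        elif not is_parent:
            stats["leaves"] += 1
        else:
            stats["inner"] += 1
    return stats


# Hypothetical toy hierarchy.
hypernyms = {
    "entity": [],
    "object": ["entity"],
    "vehicle": ["object"],
    "boat": ["vehicle"],
    "toy": ["object"],
    "toy_boat": ["boat", "toy"],
    "orphan": [],
}
stats = hierarchy_stats(hypernyms)
print(stats)
```

Publishing such counts for every national wordnet at each checkpoint makes sudden changes easy to spot, for instance a jump in the number of roots after a data merge.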

Automatic and Semi-automatic Quality Checking

Classification according to:
  the amount of human effort
  applicability for all languages (or language-specific)
  the need for additional resources and/or tools (annotated monolingual or parallel corpora, spell-checkers, explanatory or bilingual dictionaries, encyclopedias, lemmatizers, morphological analyzers)

Inconsistencies regularly examined on all BalkaNet data

XML validation – empty ID, POS, SYNONYM, SENSE, ... ;
XML tag data types for POS, SENSE, TYPE (of relation), characters from a defined character set in DEF and USAGE;
duplicate IDs;
duplicate triplets (POS, literal, sense);
duplicate literals in one synset;
mismatch between the POS in the relevant tag and in the ID postfix;
hypernym and holonym links (uplinks) to a synset with different POS;
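Four of the checks above (duplicate IDs, duplicate triplets, duplicate literals in one synset, POS mismatch) are easy to sketch once the records are parsed. The record layout below (dictionaries with `id`, `pos` and `literals`) and the ID shape PREFIX-NUMBER-POS are assumptions for illustration, not the project's actual data model.

```python
from collections import Counter


def consistency_report(synsets):
    """synsets: list of dicts with 'id', 'pos' and 'literals' [(literal, sense), ...]."""
    report = {"duplicate_ids": [], "duplicate_triplets": [],
              "duplicate_literals": [], "pos_mismatch": []}

    id_counts = Counter(s["id"] for s in synsets)
    report["duplicate_ids"] = sorted(i for i, n in id_counts.items() if n > 1)

    triplets = Counter((s["pos"], lit, sense)
                       for s in synsets for lit, sense in s["literals"])
    report["duplicate_triplets"] = sorted(t for t, n in triplets.items() if n > 1)

    for s in synsets:
        words = [lit for lit, _ in s["literals"]]
        if len(words) != len(set(words)):
            report["duplicate_literals"].append(s["id"])
        # Assumed ID shape: PREFIX-NUMBER-POS, e.g. "ENG20-00000002-n".
        if s["id"].rsplit("-", 1)[-1] != s["pos"]:
            report["pos_mismatch"].append(s["id"])
    return report


# Hypothetical records containing one instance of each problem.
synsets = [
    {"id": "ENG20-00000001-n", "pos": "n", "literals": [("lake", 1)]},
    {"id": "ENG20-00000002-n", "pos": "v", "literals": [("loch", 1), ("loch", 2)]},
    {"id": "ENG20-00000003-n", "pos": "n", "literals": [("lake", 1)]},
    {"id": "ENG20-00000003-n", "pos": "n", "literals": [("lough", 1)]},
]
report = consistency_report(synsets)
print(report)
```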

Inconsistencies regularly examined on all BalkaNet data

dangling links (dangling uplinks);
cycles in uplinks (conflicting with PWN, e.g. goalpost:1 is a kind of post:4 is a kind of upright:1; vertical:2 which is a part of goalpost:1);
cycles in other relations;
top-most synset not from the defined set (unique beginners) – missing hypernym or holonym of a synset (see BCS selecting procedure above);
non-compatible links to the same synset;
non-continuous numbering where declared (possibility of automatic renumbering).
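Dangling uplinks and cycles in uplinks can be detected with a standard depth-first search. The sketch below is one possible approach, not the checker used in the project. It assumes hypernym and holonym links have been merged into one dictionary mapping a synset to its uplink targets; the goalpost entries mirror the example on the slide, the lake entry is made up.

```python
def dangling_links(uplinks):
    """Return (source, target) pairs whose target synset does not exist."""
    return sorted((src, tgt) for src, targets in uplinks.items()
                  for tgt in targets if tgt not in uplinks)


def find_cycle(uplinks):
    """Return one cycle as a list of IDs (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2           # unvisited, on current path, finished
    colour = {node: WHITE for node in uplinks}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for tgt in uplinks[node]:
            if tgt not in colour:          # dangling target, reported separately
                continue
            if colour[tgt] == GREY:        # back edge closes a cycle
                return path[path.index(tgt):] + [tgt]
            if colour[tgt] == WHITE:
                cycle = visit(tgt, path)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in uplinks:
        if colour[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


uplinks = {
    "goalpost:1": ["post:4"],              # is a kind of
    "post:4": ["upright:1"],               # is a kind of
    "upright:1": ["goalpost:1"],           # is a part of
    "lake:1": ["body_of_water:1"],         # target missing from the data
}
cycle = find_cycle(uplinks)
dangling = dangling_links(uplinks)
print(cycle)
print(dangling)
```

The search is recursive, which is adequate for the shallow depth of typical hypernym chains; an explicit stack would be the safer choice for very long paths.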

Semi-automatic checks

(additional language resources)
spell-checking of literals, definitions, usage examples and notes;
coverage of the most frequent words from monolingual corpora;
coverage of translations (bilingual dictionaries, parallel corpora);
incompatibility with relations extracted from corpora, dictionaries, or encyclopedias
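The corpus-coverage check could look roughly like this. It assumes a lemma frequency list has already been extracted from a monolingual corpus; the words and counts below are invented, and the function name is hypothetical.

```python
def coverage(frequency_list, wordnet_literals, top_n):
    """Share of the top_n most frequent lemmas that occur as wordnet literals.

    Returns (ratio, list of missing lemmas in frequency order).
    """
    ranked = sorted(frequency_list.items(), key=lambda kv: (-kv[1], kv[0]))
    top = [lemma for lemma, _ in ranked[:top_n]]
    missing = [lemma for lemma in top if lemma not in wordnet_literals]
    ratio = (len(top) - len(missing)) / len(top) if top else 0.0
    return ratio, missing


# Invented frequency counts and wordnet content.
frequencies = {"water": 900, "lake": 500, "river": 450, "loch": 20}
literals = {"lake", "loch", "water"}

ratio, missing = coverage(frequencies, literals, top_n=3)
print(f"coverage of top 3: {ratio:.2f}, missing: {missing}")
```

The list of missing lemmas is the useful part for lexicographers: it is a ready-made to-do list ordered by corpus frequency.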

Lists of “suspicious” synsets

nonlexicalized literals;
literals with many senses;
multi-parent relations;
autohyponymy, automeronymy and other relations between synsets containing the same literal;
longest paths in hyper-hyponymic graphs;
similar definitions;
incorrect occurrences of defined literals in definitions;
presence of literals in usage examples;
dependencies between relations (e.g. near antonyms differing in their hypernyms);
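Two of these lists can be generated mechanically, as the following sketch suggests: literals with many senses, and relations between synsets that share a literal (which covers autohyponymy and automeronymy). The data layout, the threshold and all IDs are assumptions. The output is only a list of candidates for manual review, since such cases are not necessarily errors.

```python
from collections import defaultdict


def many_senses(synsets, threshold):
    """Literals that occur in more than `threshold` synsets."""
    senses = defaultdict(set)
    for sid, literals in synsets.items():
        for lit in literals:
            senses[lit].add(sid)
    return sorted(lit for lit, ids in senses.items() if len(ids) > threshold)


def shared_literal_links(synsets, links):
    """Links (source, relation, target) whose two synsets share a literal."""
    flagged = []
    for src, rel, tgt in links:
        common = set(synsets.get(src, ())) & set(synsets.get(tgt, ()))
        if common:
            flagged.append((src, rel, tgt, sorted(common)))
    return flagged


# Hypothetical synsets (ID -> literals) and links.
synsets = {
    "s1": ["man"],
    "s2": ["man", "adult male"],
    "s3": ["man", "piece"],
    "s4": ["lake"],
}
links = [("s2", "hypernym", "s1"), ("s4", "hypernym", "s1")]

polysemous = many_senses(synsets, threshold=2)
suspicious = shared_literal_links(synsets, links)
print(polysemous)
print(suspicious)
```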

Validation of quality in applications

corpus annotation for WSD experiments (missing senses, impossibility to choose between different senses)

comparison between the semantic classifications from the wordnet with the syntactic patterns based on computational grammar (verb valencies, selectional restrictions)

information retrieval - augmented user-interface for search engines

Conclusions and Future Directions

Quality control has been one of the priorities of the BalkaNet project. As our evaluation shows, even the current data from the second year of the project are more consistent than the results of previous wordnet-development projects.

XSLT and other XML standards to define validation checks in DEB
