Page 1: Quality Control for Wordnet Development in BalkaNet

Pavel Smrž
smrz@fi.muni.cz

Faculty of Informatics, Masaryk University in Brno, Czech Republic

Page 2: Outline

- Introduction, general-purpose language resources
- General considerations
- Case Study of Quality Control in BalkaNet
- Conclusions and Future Directions

Page 3: Introduction

BalkaNet shares many fundamental principles with EuroWordNet (expected sharing of procedures, policy, structure and tools).

Limitations discovered in the EuroWordNet approach led us to change the data format, to design and implement new applications, and to propose a modified perspective on the future development of lexical semantic databases.

Page 4: Introduction

- application-specific vs. general-purpose LR
- procedures of quality control for general-purpose language resources are much less developed
- this area has been strongly underestimated in many previous projects
- if a quality assurance policy has not been applied, the results can differ considerably from what was declared

Page 6: General Considerations

- the availability of documentation of the development process and the final state of data
- resource documentation should be comprehensive but at the same time concise to allow quick scan
- project deliverables (longer than necessary, do not describe all aspects, do not reflect the process of development)

Page 7: The First Commandment!!!

Summarize the description of resources at the end of your project and check the validity of information in all documents that will be part of the documentation!

Page 8: The Second Commandment!!!

Explicitly define your terminology!

- even the meaning of terms that seem to be basic in the context!
- what kinds of variants (typographic, regional, register…) are contained in synsets?
- (lake, loch and lough, regional variants of the same concept, form 3 different synsets in PWN; lake is the hypernym of the two others)

Page 9: Other Requirements

- description of the data format in which the resource is provided
- XML as the standard for data interchange
- DTD, XSD and other XML schemata
- quantitative characteristics (empty tags may signal inconsistency)
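As a minimal sketch of the remark that empty tags may signal inconsistency, the following Python snippet counts tag frequencies and empty elements in a small XML fragment. The element layout (WN, SYNSET, ID, POS, DEF) and the sample data are assumptions made for illustration; they are not taken from the BalkaNet DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data; real BalkaNet files may be laid out differently.
SAMPLE = """
<WN>
  <SYNSET><ID>X-0001-n</ID><POS>n</POS><DEF>a body of water</DEF></SYNSET>
  <SYNSET><ID>X-0002-n</ID><POS></POS><DEF/></SYNSET>
</WN>
"""


def tag_statistics(xml_text):
    """Return (tag frequencies, empty-tag frequencies) for an XML string."""
    root = ET.fromstring(xml_text)
    total, empty = Counter(), Counter()
    for elem in root.iter():
        total[elem.tag] += 1
        has_text = bool(elem.text and elem.text.strip())
        has_children = len(elem) > 0
        # An element with neither text nor children counts as empty.
        if not has_text and not has_children:
            empty[elem.tag] += 1
    return total, empty


total, empty = tag_statistics(SAMPLE)
print(dict(total))
print(dict(empty))
```

A non-zero count of empty POS or DEF elements would then be a candidate for manual inspection rather than a proven error.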

Page 10: BalkaNet Experience

The most successful procedure for controlling the quality of linguistic output is to implement a set of validation checks and periodically publish their results. This holds especially for projects with many participants that are not under the same supervision. Validation check reports, together with the quantitative assessment, can also serve as development synchronization points.

Page 11: Case Study of Quality Control in BalkaNet

Resource description sheets:

- description of the content of synset records and constraints on data types;
- types of relations included together with examples;
- degree of checking relations borrowed from PWN (related to the expand model);
- numbering scheme of different senses (random, according to their frequency in a balanced corpus, from a particular dictionary, etc.);
- source of definitions and usage examples;
- order of literals in synsets (corpus frequency, familiarity, register or style characteristics)

Page 12: Quantitative characteristics

- tag frequencies
- ratio of the number of literals in the national wordnet and in PWN
- ID prefix frequencies
- frequency of link types
- frequency of POS
- coverage of BCS
- number-of-senses distribution
- number of “multi-parent” synsets
- number of leaves, inner nodes, roots, free nodes in hyper-hyponymic “trees”
- path-length distribution
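Several of these characteristics follow directly from the hypernym graph. The sketch below assumes a hypothetical in-memory representation (a dict from synset ID to its list of hypernym IDs) and assumes these definitions: a root has hyponyms but no hypernym, a leaf has a hypernym but no hyponyms, and a free node has neither. The deck itself does not define the terms.

```python
from collections import defaultdict


def hierarchy_counts(hypernyms):
    """Classify the nodes of a hypernym graph.

    `hypernyms` maps each synset ID to the list of its hypernym IDs.
    """
    children = defaultdict(set)
    for node, parents in hypernyms.items():
        for parent in parents:
            children[parent].add(node)

    counts = {"roots": 0, "inner": 0, "leaves": 0, "free": 0, "multi_parent": 0}
    for node, parents in hypernyms.items():
        has_parent = bool(parents)
        has_child = bool(children.get(node))
        if len(set(parents)) > 1:
            counts["multi_parent"] += 1
        if has_parent and has_child:
            counts["inner"] += 1
        elif has_parent:
            counts["leaves"] += 1
        elif has_child:
            counts["roots"] += 1
        else:
            counts["free"] += 1
    return counts


# Hypothetical toy hierarchy, not real wordnet data.
toy = {
    "entity": [],
    "object": ["entity"],
    "lake": ["object"],
    "loch": ["lake"],
    "amphibian": ["object", "entity"],
    "orphan": [],
}
print(hierarchy_counts(toy))
```

Tracking these counts across successive data releases is one way to notice sudden structural changes, such as a jump in the number of free nodes.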

Page 13: Automatic and Semi-automatic Quality Checking

Classification according to:

- the amount of human effort
- applicability for all languages (or language-specific)
- the need for additional resources and/or tools (annotated monolingual or parallel corpora, spell-checkers, explanatory or bilingual dictionaries, encyclopedias, lemmatizers, morphological analyzers)

Page 14: Inconsistencies regularly examined on all BalkaNet data

- XML validation: empty ID, POS, SYNONYM, SENSE, ...;
- XML tag data types for POS, SENSE, TYPE (of relation); characters from a defined character set in DEF and USAGE;
- duplicate IDs;
- duplicate triplets (POS, literal, sense);
- duplicate literals in one synset;
- POS in the relevant tag not corresponding to the POS in the ID postfix;
- hypernym and holonym links (uplinks) to a synset with a different POS;
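Most of the checks on this slide reduce to simple counting and lookups. The following sketch assumes a hypothetical record layout (dicts with `id`, `pos`, `literals` as (literal, sense) pairs, and `uplinks`) and assumes that the POS postfix is the part of the ID after the last hyphen. Neither assumption comes from the BalkaNet specification.

```python
from collections import Counter


def check_synsets(synsets):
    """Return a list of (check name, offending item) pairs."""
    problems = []

    id_counts = Counter(s["id"] for s in synsets)
    problems += [("duplicate ID", i) for i, n in id_counts.items() if n > 1]

    triplets = Counter(
        (s["pos"], lit, sense) for s in synsets for lit, sense in s["literals"]
    )
    problems += [("duplicate triplet", t) for t, n in triplets.items() if n > 1]

    pos_by_id = {s["id"]: s["pos"] for s in synsets}
    for s in synsets:
        words = [lit for lit, _ in s["literals"]]
        if len(words) != len(set(words)):
            problems.append(("duplicate literal in synset", s["id"]))
        # Assumed ID convention: "<prefix>-<number>-<pos>".
        if s["id"].rsplit("-", 1)[-1] != s["pos"]:
            problems.append(("POS differs from ID postfix", s["id"]))
        for target in s.get("uplinks", []):
            if target in pos_by_id and pos_by_id[target] != s["pos"]:
                problems.append(("uplink to different POS", (s["id"], target)))
    return problems


# Hypothetical sample records with deliberately planted errors.
sample = [
    {"id": "X-1-n", "pos": "n", "literals": [("lake", 1)], "uplinks": []},
    {"id": "X-2-n", "pos": "v", "literals": [("loch", 1), ("loch", 2)],
     "uplinks": ["X-1-n"]},
    {"id": "X-3-n", "pos": "n", "literals": [("lake", 1)], "uplinks": ["X-1-n"]},
]
for name, item in check_synsets(sample):
    print(name, item)
```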

Page 15: Inconsistencies regularly examined on all BalkaNet data

- dangling links (dangling uplinks);
- cycles in uplinks (conflicting with PWN, e.g. goalpost:1 is a kind of post:4 is a kind of upright:1; vertical:2 which is a part of goalpost:1);
- cycles in other relations;
- top-most synset not from the defined set (unique beginners): missing hypernym or holonym of a synset (see BCS selecting procedure above);
- non-compatible links to the same synset;
- non-continuous numbering where declared (possibility of automatic renumbering).
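Dangling links and cycles in uplinks can both be found on a plain adjacency map. The sketch below uses a depth-first search with the usual three-state colouring; the function names and the dict representation are hypothetical, and the toy data encodes the goalpost example from the slide plus one invented dangling link.

```python
def find_dangling(uplinks):
    """Return (source, target) pairs whose target is not a known synset."""
    return [
        (source, target)
        for source, targets in uplinks.items()
        for target in targets
        if target not in uplinks
    ]


def find_cycle(uplinks):
    """Return one cycle as a list of IDs, or None, using depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in uplinks}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for target in uplinks.get(node, []):
            if target not in colour:      # dangling link, reported separately
                continue
            if colour[target] == GREY:    # back edge closes a cycle
                return path[path.index(target):] + [target]
            if colour[target] == WHITE:
                found = visit(target, path)
                if found:
                    return found
        colour[node] = BLACK
        path.pop()
        return None

    for node in uplinks:
        if colour[node] == WHITE:
            found = visit(node, [])
            if found:
                return found
    return None


# goalpost -> post -> upright are hypernym links; upright -> goalpost is the
# holonym link from the slide. "net:1" -> "goal:9" is an invented dangling link.
uplinks = {
    "goalpost:1": ["post:4"],
    "post:4": ["upright:1"],
    "upright:1": ["goalpost:1"],
    "net:1": ["goal:9"],
}
print(find_dangling(uplinks))
print(find_cycle(uplinks))
```

The recursive search is adequate for shallow hierarchies; for very long uplink chains an explicit stack would avoid Python's recursion limit.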

Page 16: Semi-automatic checks (additional language resources)

- spell-checking of literals, definitions, usage examples and notes;
- coverage of the most frequent words from monolingual corpora;
- coverage of translations (bilingual dictionaries, parallel corpora);
- incompatibility with relations extracted from corpora, dictionaries, or encyclopedias
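The coverage check for frequent corpus words can be sketched as a set comparison. The function below is a hypothetical illustration: it assumes a frequency-ranked list of lemmas is already available and that case-insensitive matching is acceptable, which may not hold for every language in the project.

```python
def frequency_coverage(frequent_lemmas, wordnet_literals):
    """Return (coverage ratio, missing lemmas) for a ranked lemma list."""
    known = {literal.lower() for literal in wordnet_literals}
    missing = [lemma for lemma in frequent_lemmas if lemma.lower() not in known]
    covered = len(frequent_lemmas) - len(missing)
    ratio = covered / len(frequent_lemmas) if frequent_lemmas else 0.0
    return ratio, missing


# Invented toy data, not taken from any corpus.
frequent = ["time", "year", "people", "way"]
literals = {"time", "year", "way", "lake"}
ratio, missing = frequency_coverage(frequent, literals)
print(ratio, missing)
```

The list of missing lemmas is the useful output here: it gives the lexicographers a prioritized work list.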

Page 17: Lists of “suspicious” synsets

- nonlexicalized literals;
- literals with many senses;
- multi-parent relations;
- autohyponymy, automeronymy and other relations between synsets containing the same literal;
- longest paths in hyper-hyponymic graphs;
- similar definitions;
- incorrect occurrences of defined literals in definitions;
- presence of literals in usage examples;
- dependencies between relations (e.g. near antonyms differing in their hypernyms);
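Two of these lists are easy to generate automatically. The sketch below flags literals with many senses and candidate autohyponymy pairs; the record layout (`id`, `pos`, `literals`, `hypernyms`) and the threshold are assumptions for illustration, and the output is meant as input for human review, not as a list of errors.

```python
from collections import Counter


def many_senses(synsets, threshold):
    """Return (POS, literal) pairs occurring in more than `threshold` synsets."""
    counts = Counter(
        (s["pos"], literal) for s in synsets for literal in set(s["literals"])
    )
    return sorted(key for key, n in counts.items() if n > threshold)


def autohyponymy(synsets):
    """Return (hyponym ID, hypernym ID) pairs that share at least one literal."""
    by_id = {s["id"]: s for s in synsets}
    pairs = []
    for s in synsets:
        for parent_id in s.get("hypernyms", []):
            parent = by_id.get(parent_id)
            if parent and set(s["literals"]) & set(parent["literals"]):
                pairs.append((s["id"], parent_id))
    return pairs


# Invented toy records.
synsets = [
    {"id": "a", "pos": "n", "literals": ["animal"], "hypernyms": []},
    {"id": "b", "pos": "n", "literals": ["man", "human"], "hypernyms": ["a"]},
    {"id": "c", "pos": "n", "literals": ["man"], "hypernyms": ["b"]},
    {"id": "d", "pos": "n", "literals": ["man", "piece"], "hypernyms": []},
]
print(many_senses(synsets, threshold=2))
print(autohyponymy(synsets))
```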

Page 18: Validation of quality in applications

- corpus annotation for WSD experiments (missing senses, impossibility of choosing between different senses)
- comparison of the semantic classifications from the wordnet with the syntactic patterns based on a computational grammar (verb valencies, selectional restrictions)
- information retrieval: augmented user interface for search engines

Page 19: Conclusions and Future Directions

Quality control has been one of the priorities of the BalkaNet project. As our evaluation proves, even the current data from the second year of the project are more consistent than the results of previous wordnet-development projects.

XSLT and other XML standards to define validation checks in DEB