Transcript
Page 1: Dublin April 3 rd ,  2009

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages

Tomaž Erjavec
http://nl.ijs.si/et/

Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia

Dublin, April 3rd, 2009

Page 2: Dublin April 3 rd ,  2009

Erjavec: MULTEXT-East Version 4 Dublin, 4.4.2009

Overview of the talk

1. Part-of-speech tagging, tagsets and interoperability
2. MULTEXT(-East) morphosyntactic specifications
3. Languages, formats, transformations
4. An application: JOS resources for Slovene
5. Conclusions

Page 3: Dublin April 3 rd ,  2009

Part-of-speech tagging

The task of assigning the correct PoS tag to each word in a running text, e.g.

Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …

Important HLT infrastructure
Very useful annotations for linguists
Some applications:
- pre-processing step for further analyses: lemmas, syntactic structure, etc.
- text indexing, e.g. nouns are more useful than verbs
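The word/TAG notation in the example can be read mechanically. The following is a minimal sketch, not part of the original slides, of splitting such text into (word, tag) pairs and picking out the nouns, as one might for the text-indexing application; the function name `parse_tagged` is an assumption made for illustration.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a token such as ',/,'
        # yields the word ',' and the tag ','.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = "Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB"
pairs = parse_tagged(sentence)

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```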

Page 4: Dublin April 3 rd ,  2009

Methods of PoS tagging

PoS tagging:
- determine ambiguity class of word (saw → NN | VBD)
- disambiguate to correct tag in (local) context (“I saw/VBD a saw/NN”)
Tagger training:
- manually annotated corpus: source of probabilities for tags given a (local) context
- + (lexicon: gives possible tags for each word-form)
Popular taggers:
- TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
Tagging usefulness as well as accuracy crucially depends on the tagset
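The two steps, lexicon lookup of the ambiguity class and disambiguation in local context, can be illustrated with a toy tagger. This is only a sketch under loud assumptions: the lexicon and the transition scores below are made up for the example, and the greedy left-to-right choice is a simplification, not the algorithm used by TnT, TreeTagger or TBL.

```python
# Toy lexicon: word-form -> ambiguity class (set of possible tags).
LEXICON = {
    "i": {"PRP"},
    "saw": {"NN", "VBD"},
    "a": {"DT"},
}

# Made-up scores standing in for P(tag | previous tag); a real tagger
# would estimate these from a manually annotated corpus.
TRANSITION = {
    ("<s>", "PRP"): 0.9,
    ("PRP", "VBD"): 0.8,
    ("PRP", "NN"): 0.1,
    ("VBD", "DT"): 0.7,
    ("DT", "NN"): 0.9,
    ("DT", "VBD"): 0.01,
}


def tag(words):
    """Greedily pick, for each word, the candidate tag that scores best
    given the tag chosen for the previous word."""
    previous = "<s>"  # sentence-start marker
    result = []
    for word in words:
        candidates = sorted(LEXICON[word.lower()])  # ambiguity class
        best = max(candidates,
                   key=lambda t: TRANSITION.get((previous, t), 0.0))
        result.append((word, best))
        previous = best
    return result


print(tag(["I", "saw", "a", "saw"]))
```

With these toy numbers the first "saw" comes out as VBD and the second as NN, matching the slide's example.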

Page 5: Dublin April 3 rd ,  2009

English tagsets

Tagging first developed for English (Brown, CLAWS, PTB tagsets)
English inflectionally very poor language → small tagsets ~ 50 different tags
Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g.:
- to/TO (PoS?)
- Delmed/NNP (number?)
- shares/NNS (number?)

Page 6: Dublin April 3 rd ,  2009

Tagsets for other languages

will often have many more morphosyntactic features associated with a word, so tagsets will be larger
e.g. Slovene nouns:
- type: common, proper
- gender: masculine, feminine, neuter
- number: singular, dual, plural
- case: nom., gen., dat., acc., loc., ins.
- (animacy: yes, no)
= 104 “PoS” tags just for Nouns
Russian, Czech, Slovene ~ 1000-2000 word-level syntactic tags
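To see why feature-rich languages lead to large tagsets, one can count the unrestricted combinations of the noun features listed above. This is a rough sketch; the full-length case names are expansions of the slide's abbreviations. The slide's figure of 104 is lower than the plain product computed here, which suggests that not every combination is actually used.

```python
from itertools import product

# Slovene noun features as listed on the slide (animacy left out).
NOUN_FEATURES = {
    "Type": ["common", "proper"],
    "Gender": ["masculine", "feminine", "neuter"],
    "Number": ["singular", "dual", "plural"],
    "Case": ["nominative", "genitive", "dative",
             "accusative", "locative", "instrumental"],
}

# Every combination of one value per attribute: 2 * 3 * 3 * 6.
combos = list(product(*NOUN_FEATURES.values()))
print(len(combos))      # 108
print(len(combos) * 2)  # 216 if animacy (yes/no) applied to every noun
```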

Page 7: Dublin April 3 rd ,  2009

PoS tags vs. MSDs

PoS tags:
- used in corpora for corpus annotations / tagging
- typically synthetic
Morphosyntactic Descriptions (MSDs):
- used in inflectional lexica for lexical annotations / morphological analysis
- typically analytic
Relation of PoS tagsets to MSD tagsets/features
- in general: |PoS| < |MSD|
- but in most MULTEXT-East languages: [PoS] ≡ [MSD]

Page 8: Dublin April 3 rd ,  2009

Developing a multilingual morphosyntactic framework

Interoperability: Tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented

Best practice: Languages that do not yet have a tagset could benefit from an operational framework in which to model it

Page 9: Dublin April 3 rd ,  2009

so, wouldn’t it be nice to have:

an open, standardised, documented, flexible model for MSD/PoS tagset design,
that would be instantiated for lots of languages,
and could be simply applied to any language?

Page 10: Dublin April 3 rd ,  2009

EU standardisation efforts

EAGLES: Expert Advisory Group for Language Engineering Standards (1993-1996)
MULTEXT: Multilingual Text Tools and Corpora (1995)
MULTEXT-East: MULTEXT for Central and Eastern European Languages:
- Version 1: TELRI edition (1998)
- Version 2: Concede edition (2002)
- Version 3: TEI edition (2004)
- Version 4: MondiLex edition (2009?)
...
ISO / TC 37 / LMF / isoCat (2008)

Page 11: Dublin April 3 rd ,  2009

MULTEXT-East morphosyntactic resources

Basic Language Resource Kit:
1. specifications: define features and MSDs
2. lexica (~15,000 lemmas): triplets: word-form / lemma / MSD
3. parallel corpus: MSD and lemma annotated
Freely available for research: http://nl.ijs.si/ME/

Page 12: Dublin April 3 rd ,  2009

1984: aligned and annotated

Page 13: Dublin April 3 rd ,  2009

MULTEXT-East languages

Language      Language Family       Added in
English       Germanic              Version 1
Romanian      Romance               Version 1
Russian       East Slavic           Version 4
Ukrainian     East Slavic           Version 4?
Polish        East Slavic           Version 4?
Czech         West Slavic           Version 1
Slovak        West Slavic           Version 4?
Slovene       South West Slavic     Version 1/4
Resian        dialect of Slovene    Version 3/4
Croatian      South West Slavic     Version 3
Serbian       South West Slavic     Version 2
Macedonian    South East Slavic     Version 4
Bulgarian     South East Slavic     Version 1/4?
Persian       Indo-Iranian          Version 4
Estonian      Finno-Ugric           Version 1
Hungarian     Finno-Ugric           Version 1

Page 14: Dublin April 3 rd ,  2009

The MULTEXT(-East) morphosyntactic specifications

They specify that e.g. “Ncmsn”
- corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
- is a valid MSD for Slovene
Specifications consist of
- Front matter
- Common part: common definitions for all languages (features)
- Language particular parts: particulars for each language (MSD set)
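How an MSD such as “Ncmsn” unpacks into a feature-structure can be sketched in a few lines. The table below is a hypothetical excerpt written for this example: only the codes c, m, s, n, g and d (dative) are confirmed by MSDs shown in these slides; the remaining codes, and the reading of a hyphen as "attribute not applicable", are assumptions.

```python
# Hypothetical excerpt of the positional table for the Noun category:
# the list order gives the position, each dict maps code -> value.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(msd):
    """Return (category, {attribute: value}) for a positional MSD string."""
    category, positions = CATEGORIES[msd[0]]
    features = {}
    for (attribute, codes), code in zip(positions, msd[1:]):
        if code != "-":  # assumed: hyphen = attribute does not apply
            features[attribute] = codes[code]
    return category, features


print(decode_msd("Ncmsn"))
```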

Page 15: Dublin April 3 rd ,  2009

V4 specs draft in HTML

Page 16: Dublin April 3 rd ,  2009

Specifications in Version 4

Encoded in XML / teiLite (in Version 3: LaTeX)
TEI = Text Encoding Initiative Guidelines P4
Still “book-like” in form, to make authoring easier
XSLT into other formats:
- HTML
- tabular mapping formats (e.g. MSD to features)
- XML/TEI feature library
- (OWL)

Page 17: Dublin April 3 rd ,  2009

The common specifications

Define categories (“parts-of-speech”)
For each category define features, i.e. attributes and their values
For each attribute-value specify for which languages it is appropriate
Give positional mapping to MSDs:
- each attribute assigned a position
- each attribute-value assigned a one-character code
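The positional mapping can be illustrated in the encoding direction, from features to an MSD string. This is a sketch, not the project's own tooling: `NOUN_TABLE` is a hypothetical excerpt, and both the hyphen for an unset attribute and the dropping of trailing hyphens are assumptions about the convention.

```python
# Hypothetical common-table excerpt for nouns: list order fixes the
# position of each attribute; each value has a one-character code.
NOUN_TABLE = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "dual": "d", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d"}),
]


def encode_msd(category_code, table, features):
    """Build a positional MSD from a category code and a feature dict."""
    chars = [category_code]
    for attribute, codes in table:
        value = features.get(attribute)
        # Unset attributes are marked with a hyphen to keep positions fixed.
        chars.append(codes[value] if value is not None else "-")
    # Assumed convention: trailing hyphens carry no information.
    return "".join(chars).rstrip("-")


print(encode_msd("N", NOUN_TABLE, {"Type": "common", "Gender": "masculine",
                                   "Number": "singular", "Case": "dative"}))
print(encode_msd("N", NOUN_TABLE, {"Type": "common", "Number": "plural"}))
```

Under these assumptions the two calls print `Ncmsd` and `Nc-p`.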

Page 18: Dublin April 3 rd ,  2009

Common table (HTML)

Page 19: Dublin April 3 rd ,  2009

Common table (source XML/teiLite)

Page 20: Dublin April 3 rd ,  2009

Language particular sections

Recap the feature definitions for the language
Add “combinations”, i.e. feature co-occurrence restrictions
Add “lexicon”, i.e. list of all valid MSDs for language
Possibly localise the features and codes
Possibly give notes and examples

Page 21: Dublin April 3 rd ,  2009

Combinations

Page 22: Dublin April 3 rd ,  2009

Lexicon

Page 23: Dublin April 3 rd ,  2009

Jezikoslovno označevanje slovenščine (JOS, “Linguistic annotation of Slovene”)
http://nl.ijs.si/jos

Page 24: Dublin April 3 rd ,  2009

JOS as a bridge to MULTEXT-East Version 4

[Diagram boxes: FidaPLUS corpus; JOS corpora; MTE V3 slv specifications; JOS (slv) specifications; MTE V4 (slv) specifications; MTE V4 specifications]

Page 25: Dublin April 3 rd ,  2009

Page 26: Dublin April 3 rd ,  2009

JOS specifications

XML/teiLite + XSLT transforms
Allow reordering of attribute positions (Vm-----d → Vmd)
i18n / slv+eng:
- translation: specifications
- localisation: attributes, values, codes
- localisation: TEI element names
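The effect of reordering attribute positions can be sketched as a permutation of the positional codes followed by dropping trailing hyphens. The permutation below is hypothetical, chosen only so that it reproduces the slide's Vm-----d → Vmd example; the actual JOS reordering is defined in the specifications and is not shown here.

```python
def reorder_msd(msd, new_order):
    """Reorder the attribute positions of an MSD.

    new_order[i] is the old position (0-based, after the category
    letter) of the attribute that moves to new position i.
    """
    category, codes = msd[0], msd[1:]
    # Pad with hyphens so every position named in new_order exists.
    codes = codes.ljust(max(new_order) + 1, "-")
    reordered = "".join(codes[i] for i in new_order)
    # Trailing hyphens are dropped, which is what shortens the MSD.
    return (category + reordered).rstrip("-")


# Hypothetical order: the attribute in old position 7 moves to position 2.
HYPOTHETICAL_VERB_ORDER = [0, 6, 1, 2, 3, 4, 5]

print(reorder_msd("Vm-----d", HYPOTHETICAL_VERB_ORDER))  # Vmd
```

Moving frequently used attributes towards the front makes long runs of hyphens less common, so the MSDs become shorter and easier to read.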

Page 27: Dublin April 3 rd ,  2009

Page 28: Dublin April 3 rd ,  2009

Page 29: Dublin April 3 rd ,  2009

MSD conversion tables

Tabular UTF-8 files
MSD-slv to -eng
MSD to features
Collating sequence
e.g.
01N0101010100 Somei Ncmsn
01N0101010200 Somer Ncmsg
01N0101010300 Somed Ncmsd

Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
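A table in the MSD-to-features layout shown above is straightforward to load. The sketch below assumes whitespace-separated columns (the real files may well use tabs, which `split()` also handles) and uses only the three rows from the slide; the function name `parse_feature_line` is made up for the example.

```python
def parse_feature_line(line):
    """Parse 'MSD Category Attr=value ...' into (msd, category, features)."""
    msd, category, *pairs = line.split()
    features = dict(pair.split("=", 1) for pair in pairs)
    return msd, category, features


TABLE = """\
Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
"""

# Index the rows by MSD for quick lookup.
msd_to_features = {}
for line in TABLE.splitlines():
    msd, category, features = parse_feature_line(line)
    msd_to_features[msd] = (category, features)

print(msd_to_features["Ncmsg"][1]["Case"])  # genitive
```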

Page 30: Dublin April 3 rd ,  2009

Adding a new language

XSLT scripts:
- mtems-split.xsl: make a template for the language particular section of a new language
- mtems-merge: merge a new language particular section to the common tables
Maybe shortly to be tested on new Slavic languages in the scope of MondiLex

Page 31: Dublin April 3 rd ,  2009

Critiques

It’s just an exercise in encoding anyway
Same is different, different is same
The Procrustean bed of standards
Policy change: from unification to harmonisation (hippy school)

Page 32: Dublin April 3 rd ,  2009

Conclusions

Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications
Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
Could serve as “hub” encoding for multilingual applications, e.g. MT
and as a framework for new languages

Page 33: Dublin April 3 rd ,  2009

Further work

Finishing MTE V4!
Distribution: LDC, ELDA
Relation to ISO-TC37 standards: LMF, isoCAT
Connecting to GOLD ontology
Adding new languages:
- Slavic completion
- Western European: MULTEXT
- Japanese: chasen tagset, jpWaC(-L2)
- Irish? ☺

