Dublin, April 3rd, 2009

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages. Tomaž Erjavec, http://nl.ijs.si/et/, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia. Dublin, April 3rd, 2009.


  • MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages

    Tomaž Erjavec
    http://nl.ijs.si/et/

    Department of Knowledge Technologies
    Jožef Stefan Institute
    Ljubljana, Slovenia

    Dublin, April 3rd, 2009

  • Overview of the talk

    Part-of-speech tagging, tagsets and interoperability
    MULTEXT(-East) morphosyntactic specifications
    Languages, formats, transformations
    An application: JOS resources for Slovene
    Conclusions

  • Part-of-speech tagging

    The task of assigning the correct PoS tag to each word in a running text, e.g.

    Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP

    Important HLT infrastructure
    Very useful annotations for linguists
    Some applications:
      pre-processing step for further analyses: lemmas, syntactic structure, etc.
      text indexing, e.g. nouns are more useful than verbs

  • Methods of PoS tagging

    PoS tagging:
      determine the ambiguity class of the word (saw → NN | VBD)
      disambiguate to the correct tag in (local) context (I saw/VBD a saw/NN)
    Tagger training:
      manually annotated corpus: source of probabilities for tags given a (local) context
      + (lexicon: gives possible tags for each word-form)
    Popular taggers:
      TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
    Tagging usefulness as well as accuracy crucially depends on the tagset
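The two tagging steps above can be sketched in a few lines of Python. This is a minimal illustration, not how TnT, TreeTagger or TBL work internally: it uses a greedy left-to-right choice based on tag-bigram counts rather than full HMM decoding. The toy training corpus, the function names `train` and `tag`, and the fallback tag for unknown words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy manually annotated corpus (hypothetical; PTB-style tags as in the slides).
TRAINING = [
    [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("saw", "NN")],
    [("the", "DT"), ("saw", "NN"), ("broke", "VBD")],
    [("I", "PRP"), ("broke", "VBD"), ("the", "DT"), ("saw", "NN")],
]


def train(corpus):
    """Collect a lexicon (word -> possible tags) and tag-bigram counts."""
    lexicon = defaultdict(set)
    bigrams = defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"  # sentence-start marker
        for word, tag_ in sentence:
            lexicon[word.lower()].add(tag_)
            bigrams[prev][tag_] += 1
            prev = tag_
    return lexicon, bigrams


def tag(words, lexicon, bigrams):
    """Greedy disambiguation: pick the candidate tag seen most often after the previous tag."""
    prev, output = "<s>", []
    for word in words:
        # Step 1: ambiguity class from the lexicon (unknown words default to NN, an assumption).
        candidates = sorted(lexicon.get(word.lower(), {"NN"}))
        # Step 2: disambiguate in local context (previous tag only).
        best = max(candidates, key=lambda t: bigrams[prev][t])
        output.append((word, best))
        prev = best
    return output


lexicon, bigrams = train(TRAINING)
print(tag(["I", "saw", "a", "saw"], lexicon, bigrams))
```

With this toy corpus the first "saw" follows a pronoun and is tagged VBD, while the second follows a determiner and is tagged NN, matching the example on the slide.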

  • English tagsets

    Tagging was first developed for English (Brown, CLAWS, PTB tagsets)
    English is an inflectionally very poor language → small tagsets, ~50 different tags
    Tags are typically synthetic, i.e. the tag does not transparently map to features, e.g.:
      to/TO (PoS?)
      Delmed/NNP (number?)
      shares/NNS (number?)

  • Tagsets for other languages

    They will often have many more morphosyntactic features associated with a word, so tagsets will be larger
    e.g. Slovene nouns:
      type: common, proper
      gender: masculine, feminine, neuter
      number: singular, dual, plural
      case: nom., gen., dat., acc., loc., ins.
      (animacy: yes, no)
      = 104 PoS tags just for nouns
    Russian, Czech, Slovene: ~1000-2000 word-level syntactic tags

  • PoS tags vs. MSDs

    PoS tags:
      used in corpora for corpus annotations / tagging
      typically synthetic
    Morphosyntactic Descriptions (MSDs):
      used in inflectional lexica for lexical annotations / morphological analysis
      typically analytic
    Relation of PoS tagsets to MSD tagsets/features:
      in general: |PoS| < |MSD|
      but in most MULTEXT-East languages: |PoS| = |MSD|

  • Developing a multilingual morphosyntactic framework

    Interoperability: Tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
    Best practice: Languages that do not yet have a tagset could benefit from an operational framework in which to model it

  • So, wouldn't it be nice to have:

    an open, standardised, documented, flexible model for MSD/PoS tagset design,
    that would be instantiated for lots of languages,
    and could be simply applied to any language?

  • EU standardisation efforts

    EAGLES: Expert Advisory Group for Language Engineering Standards (1993-1996)
    MULTEXT: Multilingual Text Tools and Corpora (1995)
    MULTEXT-East: MULTEXT for Central and Eastern European Languages:
      Version 1: TELRI edition (1998)
      Version 2: Concede edition (2002)
      Version 3: TEI edition (2004)
      Version 4: MondiLex edition (2009?)
    ...
    ISO / TC 37 / LMF / isoCat (2008)

  • MULTEXT-East morphosyntactic resources

    Basic Language Resource Kit:
      specifications: define features and MSDs
      lexica (~15,000 lemmas): triplets: word-form / lemma / MSD
      parallel corpus: MSD and lemma annotated
    Freely available for research: http://nl.ijs.si/ME/

  • 1984: aligned and annotated

  • MULTEXT-East languages

  • The MULTEXT(-East) morphosyntactic specifications

    They specify that e.g. Ncmsn
      corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
      is a valid MSD for Slovene
    Specifications consist of:
      Front matter
      Common part: common definitions for all languages (features)
      Language particular parts: particulars for each language (MSD set)

  • V4 specs draft in HTML

  • Specifications in Version 4

    Encoded in XML / teiLite (in Version 3: LaTeX)
    TEI = Text Encoding Initiative Guidelines P4
    Still book-like in form, to make authoring easier
    XSLT into other formats:
      HTML
      tabular mapping formats (e.g. MSD to features)
      XML/TEI feature library
      (OWL)

  • The common specifications

    Define categories (parts-of-speech)
    For each category define features, i.e. attributes and their values
    For each attribute-value specify for which languages it is appropriate
    Give positional mapping to MSDs:
      each attribute assigned a position
      each attribute-value assigned a one-character code
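The positional mapping can be illustrated with a small decoder, given here as a sketch under stated assumptions. Only the codes occurring in Ncmsn, Ncmsg and Ncmsd, and the use of "-" for a skipped position, are attested in these slides; the remaining one-character codes, the table layout and the name `decode_msd` are assumptions made for the example, and only the Noun category is covered.

```python
# Attribute positions and one-character value codes for nouns.
# Codes not attested in the slides (e.g. "p", "f", "a", "l", "i", "y") are assumptions.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
    ("Animacy", {"n": "no", "y": "yes"}),
]

# The first character of an MSD selects the category.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into (category, {attribute: value})."""
    category, attributes = CATEGORIES[msd[0]]
    features = {}
    # Character i after the category code belongs to the attribute at position i.
    for code, (attribute, values) in zip(msd[1:], attributes):
        if code != "-":  # "-" marks a position with no value
            features[attribute] = values[code]
    return category, features


print(decode_msd("Ncmsn"))
```

Trailing positions that are not specified (here Animacy) are simply absent from the result, which mirrors how the MSD string itself omits them.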

  • Common table (HTML)

  • Common table (source XML/teiLite)

  • Language particular sections

    Recap the feature definitions for the language
    Add combinations, i.e. feature co-occurrence restrictions
    Add lexicon, i.e. list of all valid MSDs for the language
    Possibly localise the features and codes
    Possibly give notes and examples
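One practical use of the per-language list of valid MSDs is to check annotated data against it. The sketch below assumes lexicon entries in the word-form / lemma / MSD triplet shape; the three valid MSDs are the ones quoted in these slides, while the example triplets and the function name `invalid_annotations` are hypothetical.

```python
# Tiny illustrative subset; a real language particular section lists every valid MSD.
VALID_MSDS_SLV = {"Ncmsn", "Ncmsg", "Ncmsd"}

# Hypothetical word-form / lemma / MSD triplets; the last one carries a deliberately bad MSD.
TRIPLETS = [
    ("stol", "stol", "Ncmsn"),
    ("stola", "stol", "Ncmsg"),
    ("stolu", "stol", "Ncmsd"),
    ("stolu", "stol", "Ncmxd"),
]


def invalid_annotations(triplets, valid_msds):
    """Return the triplets whose MSD is not in the language's list of valid MSDs."""
    return [t for t in triplets if t[2] not in valid_msds]


print(invalid_annotations(TRIPLETS, VALID_MSDS_SLV))
```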

  • Combinations

  • Lexicon

  • Jezikoslovno označevanje slovenščine (JOS: Linguistic Annotation of Slovene) http://nl.ijs.si/jos

  • JOS as a bridge to MULTEXT-East Version 4

    FidaPLUS corpus
    JOS corpora
    MTE V3 slv specifications
    JOS (slv) specifications
    MTE V4 (slv) specifications
    MTE V4 specifications

  • JOS specifications

    XML/teiLite + XSLT transforms
    Allow reordering of attribute positions (Vm-----d → Vmd)
    i18n / slv+eng:
      translation: specifications
      localisation: attributes, values, codes
      localisation: TEI element names

  • MSD conversion tables

    Tabular UTF-8 files:
      MSD-slv to -eng
      MSD to features
      Collating sequence

    e.g.

    01N0101010100 Somei Ncmsn
    01N0101010200 Somer Ncmsg
    01N0101010300 Somed Ncmsd

    Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
    Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
    Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
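The conversion tables are simple enough to load with a few lines of Python. The sketch below reads the example rows shown on this slide; it assumes whitespace-separated columns (the actual files may use tabs), assumes the column order collating key / Slovene MSD / English MSD, and uses the hypothetical function names `load_msd_table` and `load_feature_table`.

```python
MSD_TABLE = """\
01N0101010100 Somei Ncmsn
01N0101010200 Somer Ncmsg
01N0101010300 Somed Ncmsd
"""

FEATURE_TABLE = """\
Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
"""


def load_msd_table(text):
    """Return (Slovene MSD -> English MSD, English MSD -> collating key)."""
    slv_to_eng, collation = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, slv, eng = line.split()  # assumed column order
        slv_to_eng[slv] = eng
        collation[eng] = key
    return slv_to_eng, collation


def load_feature_table(text):
    """Return English MSD -> (category, {attribute: value})."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        msd, category, *pairs = line.split()
        table[msd] = (category, dict(pair.split("=", 1) for pair in pairs))
    return table


slv_to_eng, collation = load_msd_table(MSD_TABLE)
features = load_feature_table(FEATURE_TABLE)

print(slv_to_eng["Somer"])          # English MSD for a Slovene-localised MSD
print(features["Ncmsd"][1]["Case"])  # one feature of an MSD
# Sorting by the collating key gives the order defined by the specifications
# rather than plain alphabetical order of the MSD strings.
print(sorted(["Ncmsd", "Ncmsn", "Ncmsg"], key=collation.get))
```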

  • Adding a new language

    XSLT scripts:
      mtems-split.xsl: make a template for the language particular section of a new language
      mtems-merge: merge a new language particular section into the common tables
    May shortly be tested on new Slavic languages within the scope of MondiLex

  • Critiques

    It's just an exercise in encoding anyway
    Same is different, different is same
    The Procrustean bed of standards

    Policy change: from unification to harmonisation (hippy school)

  • Conclusions

    Presented work-in-progress on standardisation of multilingual morphosyntactic specifications
    Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
    Could serve as a hub encoding for multilingual applications, e.g. MT
    and as a framework for new languages

  • Further work

    Finishing MTE V4!
    Distribution: LDC, ELDA
    Relation to ISO-TC37 standards: LMF, isoCAT
    Connecting to GOLD ontology
    Adding new languages:
      Slavic completion
      Western European: MULTEXT
      Japanese: chasen tagset, jpWaC(-L2)
      Irish?