corpus linguistics 2013, lancaster university, july 25th 2013 digital corpora and other electronic...

Corpus Linguistics 2013

Corpus Linguistics 2013, Lancaster University, July 25th 2013

Digital corpora and other electronic resources for Maltese

Albert Gatt
Institute of Linguistics, University of Malta

Slavomír /bulbul/ Čéplö
Institute of the Czech National Corpus, Charles University

www.bulbul.sk/cl2013

Part 1
Some general information

Maltese
  Affiliation: Afro-Asiatic > Semitic > Central Semitic > Arabic > North-African
  Writing system: Latin script
  Spoken in: Malta (~ 400.000), Australia (34.396), 2011 census
  Status: National language (Const. ch. 1, sec. 5.1), official language along with English (Const. ch. 1, sec. 5.2)
  Regulating body: Il-Kunsill Nazzjonali tal-Ilsien Malti

Part 2
A bit of history

Corpus linguistics and Maltese then

MaltiLex (Rosner et al. 2000)
  Groundwork for electronic lexicography in Maltese
  Total size: ca. 1.500.000 tokens, mostly newspapers
  Never publicly released
  Preliminary experiments on POS tagging

PsyCol Maltese Lexical Corpus (Francom, Woudstra and Ussishkin 2009)
  Data retrieved from the web (mostly newspapers)
  Total size: 3.323.325 tokens
  Used primarily in lexical access experiments

Corpus linguistics and Maltese now

Clarin
www.clarin.eu

METANET4U
www.metanet4u.eu

The two corpora discussed here

Part 3
The corpora

MLRS Corpus (University of Malta)
  http://mlrs.research.um.edu.mt/index.php?page=31
  Running on IMS Open Corpus Workbench
  Two versions:
    V1.0: 100 million tokens, mostly publicly available texts, no annotation
    V2.0 beta: 130 million tokens, PoS-tagged

bulbulistan corpus
  www.bulbul.sk/bonito2
  Running on NoSketchEngine
  Alpha version: ~ 150 million tokens, no annotation
  Coming soon: beta version, ~ 160 million tokens, PoS-tagged

Data collection

Web data
  MLRS: Automated keyword-based webcrawling
  bulbulistan: Domain-targeted retrieval
Belles lettres
  MLRS: Author submission (privately or through publisher) and webcrawling
  bulbulistan: Scanning > OCR > checking and processing; straight up purchase
Others
  User submission

Post-processing

Text extraction (web data)
  MLRS: HTML parsers
  bulbulistan: HTML parsers
Structure analysis
  MLRS: Paragraph and sentence splitting, tokenization
  bulbulistan: Sentence splitting, tokenization
Cleaning
  MLRS: Removal of non-Maltese text (semi-automatic)
  bulbulistan: Removal of non-Maltese text on sentence level (see next step)

Post-processing (continued)

Deduplication
  MLRS: On VERT file (Onion deduplication tool) at paragraph level
  bulbulistan: At document level
PoS tagging
  MLRS: TnT trained on ~28k words, 95% accuracy
  bulbulistan: SVMTool / Apache OpenNLP trained on ~25k words, 94% accuracy
Spellcheck
  MLRS: Custom dictionary-based spell check (Rosner et al. 2012)
  bulbulistan: Only to correct diacritics, done as a part of tagging

Post-processing (continued)

MLRS tagset
  EAGLES-like division into 41 major categories with morphosyntactic features
  Two-level annotation scheme:
    Level I: annotation of major category only
    Level II: addition of morphosyntactic features
  Current release (MLRS V2.0 beta) only has Level I annotation. Ongoing work on automatic morphological analysis; the aim is to combine POS tagging with this for Level II.

Word               Level I tag        Level II additional features
raġel 'man'        NN (noun)          sg, masc
mar 'he went'      VV (main verb)     3sg, masc, perfective
għandu 'he has'    VG (pseudo-verb)   3sg, masc, perfective

Post-processing (conclusion)

bulbulistan tagset
  55 categories based on morphological and semantic criteria
  Three levels:
    Major category (NOUN, PRON, VERB)
    Subcategory (NOUN_PROP, PRON_PERS, PART_ACT)
    Some morphological information (VERB.PERF, VERB.IMPF, PRON_PERS.NEG)
  All very much work in progress, with the ultimate goal of aligning the tagset with MLRS

Word                    Tagged                        Notes
raġel 'man'             raġel|NOUN                    noun
mar 'he went'           mar|VERB.PERF                 verb, perfective
mhijiex 'she is not'    m|NEG hijiex|PRON_PERS.NEG    negative particle + negated personal pronoun

Composition

MLRS Corpus
Text type                                                       Tokens
Journalistic texts                                              68.800.000
Parliamentary debates                                           43.400.000
Belles lettres                                                  375.000
Academic texts                                                  170.000
Legal texts                                                     4.800.000
Religious texts                                                 403.700
Speeches                                                        18.000
Web pages (blogs etc., including Maltese Wikipedia articles)    6.500.000
Miscellaneous other texts                                       123.000

bulbulistan corpus
Text type                                                       Tokens
Journalistic texts                                              80.000.000
Parliamentary debates                                           50.000.000
Belles lettres                                                  600.000
Academic texts                                                  100.000
Other (blogs, ads etc.)                                         50.000

Balance and representativeness (MLRS, bulbulistan corpus)

Opportunistic text collections
Ongoing effort to achieve balance by expanding underrepresented text types:

What is balanced / representative in a bilingual society with languages in complementary roles?
  Maltese: belles lettres, humanities
  English: sciences, economics

Collaboration with publishers
Online submission system for registered members (followed by filtering and post-processing)

Collaboration with authors
Scanning and OCR (especially for out-of-copyright works)

Diachronic dimension

Current status:
  Majority of texts date to 1998-2013 (journalistic texts, records of the 9th through 12th legislatures)
  Literary works from the late 19th through early 20th century (~ 200k), some from 1945-1980 (~ 100k)

Goal: extend the corpora to cover the history of Maltese

Two major periods:
  1824 (first book in Maltese published) - 1924
  1924 (establishment of official orthography) - present

Diachronic dimension (1824-1924)

1831    1848    1885    1924

tiegħek
qiegħed
ħwejjeġ
mhux
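Returning to the bulbulistan tagging format shown in the tagset examples (word|TAG, as in m|NEG hijiex|PRON_PERS.NEG), the three levels can be read off a tag mechanically. The following is only a minimal sketch, not code from either corpus project. It assumes, based solely on the examples on the slide, that an underscore separates the major category from the subcategory, that a dot introduces the morphological information, and that tagged tokens are separated by whitespace. The names TaggedToken, parse_token and parse_line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaggedToken:
    form: str                    # surface form, e.g. "mar"
    major: str                   # level 1: major category, e.g. "VERB"
    sub: Optional[str] = None    # level 2: subcategory, e.g. "PERS" in PRON_PERS
    morph: Optional[str] = None  # level 3: morphological info, e.g. "PERF"


def parse_token(item: str) -> TaggedToken:
    """Split one 'form|TAG' item into its form and the three tag levels.

    Assumption (inferred from the slide examples only): '_' separates the
    major category from the subcategory and '.' introduces morphology.
    """
    form, _, tag = item.rpartition("|")
    if not form or not tag:
        raise ValueError(f"not a form|TAG item: {item!r}")
    base, _, morph = tag.partition(".")
    major, _, sub = base.partition("_")
    return TaggedToken(form, major, sub or None, morph or None)


def parse_line(line: str) -> List[TaggedToken]:
    """Parse a whitespace-separated sequence of 'form|TAG' items."""
    return [parse_token(item) for item in line.split()]


if __name__ == "__main__":
    for tok in parse_line("m|NEG hijiex|PRON_PERS.NEG"):
        print(tok.form, tok.major, tok.sub, tok.morph)
    # prints:
    # m NEG None None
    # hijiex PRON PERS NEG
```

Keeping the levels as separate fields would make it straightforward to query at the major-category level only, which is the granularity the MLRS Level I annotation also offers.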

Corpus as research tool

Some recent papers:
  Čéplö, S. 2013. An overview of object reduplication in Maltese (corpus-informed)
  Fabri, R. and Gatt, A. 2013. Morphological Productivity in Maltese: A corpus-based investigation of Romance derivational processes

The corpus is also used by:
  translators
  high-school and college students
  everybody interested in Maltese

Part 4
Beyond corpora

MLRS corpus-related tools

Maltese Language Resource Server (mlrs.research.um.edu.mt)
Maltese Language Software Services (metanet4u.research.um.edu.mt)

Paragraph and sentence splitter
Tokenizer
PoS tagger
Chunker
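To give a rough idea of what the first two tools in this list do, here is a minimal sketch of a paragraph splitter, sentence splitter and tokenizer. It is not the MLRS implementation; all function names and rules are assumptions made for illustration. Paragraphs are assumed to be separated by blank lines, a sentence is assumed to end at ., ! or ? followed by whitespace and an uppercase letter, and a trailing hyphen or straight apostrophe is kept attached to the preceding word so that forms such as il- and ta' stay whole.

```python
import re
from typing import List

# Assumption: a sentence boundary is whitespace that follows ., ! or ?
# and precedes an uppercase letter (including Maltese Ċ, Ġ, Ħ, Ż).
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZĊĠĦŻ])")

# Assumption: a token is a run of word characters, optionally followed by
# one hyphen or apostrophe (il-, ta'), or a single punctuation character.
_TOKEN = re.compile(r"\w+[-']?|[^\w\s]")


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph at the heuristic sentence boundaries."""
    return [s for s in _SENT_BOUNDARY.split(paragraph.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN.findall(sentence)


if __name__ == "__main__":
    text = "Ir-raġel mar id-dar. Għandu ktieb ta' Malta."
    for paragraph in split_paragraphs(text):
        for sentence in split_sentences(paragraph):
            print(tokenize(sentence))
    # prints:
    # ['Ir-', 'raġel', 'mar', 'id-', 'dar', '.']
    # ['Għandu', 'ktieb', "ta'", 'Malta', '.']
```

A heuristic like this breaks on abbreviations and also splits ordinary hyphenated words after the hyphen, so a production splitter would need abbreviation lists and language-specific rules.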

Other tools

Grammatical Framework
  Maltese Resource Grammar Library for Grammatical Framework
  http://www.grammaticalframework.org/
Verbs
  An online database of root-and-pattern verbs (Camilleri and Spagnol 2013)
  http://mlrs.research.um.edu.mt/resources/verbalroots/
Multimodal corpus
  MAMCO (Maltese Multimodal Corpus) (Paggio, Galea and Vella 2013)
  Twelve video-recorded conversations, annotated with speech and gesture
Project Vassalli
  The Maltese equivalent of Project Gutenberg

Part 5
What's next

Next steps
  Merge the two corpora into one
  More texts (duh), especially from areas not represented well
  More annotation levels (basic morphological analysis)
  Creation of a balanced subcorpus
  Syntactic parsing > treebank
  Integrate the two corpora into larger projects:
    SketchEngine
    InterCorp
