NEW ESTONIAN WORDS AND SENSES: DETECTION AND DESCRIPTION
Margit Langemets, Jelena Kallas, Kaisa Norak, Indrek Hein
Institute of the Estonian Language
Globalex Workshop on Lexicography and Neologism, 8 May 2019
DSNA 22 – Indiana University, Bloomington, IN
New words in the dictionaries
• Grenzstein (1884)
• 1,600 words
• Aavik (1919, 2nd ed. 1921)
• 4,000 words
• Erelt, Kull, Meriste 1985
• 150 words (stems)
[Diagram: new words from separate dictionaries, (a) separate general dictionaries and (b) the database of new words, are included into Ekilex (2019), a unified single resource; user interface: Sõnaveeb (Wordweb)]
8 May 2019 Globalex Workshop on Neologism 2
Sõnaveeb (Wordweb): computer and mobile view(Ver. 1.4.0 released in February 2019)
Unified single resource Ekilex
• enables constant updating of different data subsets
• NEW WORD in the database
• provided (ideally) with
• long definition (< general explanatory dic, large bilingual dic) – Detailed view
• short/simpler definition (< learners' dic, bilingual dic) – Simple view
• gloss/signpost (< orthological dic, bilingual dic) – Detailed/Simple view
• terminological definition (< termbase)
• prescriptive advice
• morphological information
• etymological information
• usage examples (for L1, L2, prescriptive advice)
• translation equivalents (different languages)
• synonyms
• etc.
Methods used so far
• the workflow not yet automated
• Estonian National Corpus (NC)
• started in the 1990s
• monitoring corpus (since 2017 every two years)
• Estonian NC 2017 – 1.1 billion tokens
• Estonian NC 2019 (October)
• Sketch Engine
• Wordlist function
• ELEXIS Survey for Lexicographers (2019): 54.8% (of those using any of the 22 CQSs) use Sketch Engine (SkE)
• there are many neologisms that will be missed (Kilgarriff et al. 2015)
An experimental study: detecting new words
• Exclusion Dictionary Architecture (Cartier, 2017)
• extraction of novel forms from monitor corpora
• using lexicographic resources as a reference exclusion dictionary to induce unknown words
• filters to eliminate spelling errors and proper nouns
• no tracking of new meanings (semantic neologisms)
5 stages
Kaisa Norak, Indrek Hein, lexicographers (February–April 2018)
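The exclusion-dictionary idea above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the capitalization heuristic for proper nouns, and the edit-distance heuristic for spelling errors are all assumptions made for the example.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]          # exactly one extra character in b


def find_candidates(sentences, known_words):
    """Return word forms that are not covered by the exclusion dictionary.

    sentences   -- list of token lists (hypothetical input format)
    known_words -- set of lower-cased words from the reference resources
    """
    candidates = []
    for tokens in sentences:
        for position, token in enumerate(tokens):
            if not token.isalpha():
                continue  # skip numbers, punctuation, mixed tokens
            if position > 0 and token[0].isupper():
                continue  # crude proper-noun filter: capitalized mid-sentence
            form = token.lower()
            if form in known_words:
                continue  # already described in a reference resource
            if any(within_one_edit(form, known) for known in known_words):
                continue  # crude spelling-error filter: near miss of a known word
            candidates.append(form)
    return candidates
```

Note that the near-miss filter also discards genuine new words that happen to be one edit away from a known word, so a real system would need a more careful spelling-error model.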
Stages 1-2: extraction of novel word forms, filtering
• extraction of novel word forms from the Institute’s text collection (collected from 2016 to 2018)
• single new words (not MWEs)
• online news, TV subtitles, transcribed books (from heliraamat.eki.ee: text>audio)
• 712,197 word forms that had failed in the automatic morphological analysis
• filtering (first round)
• Python 3 and its library EstNLTK 1.4.1 (for lemmatization and morphological tagging)
• R and its library Tidyverse (for filtering and sorting)
• Excel (sorting)
• Lemmatization
• data selection and (multiple) cleaning of selected lemmas
5,290 lemmas
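A first filtering round of this kind might look like the sketch below. It is only an illustration under stated assumptions: the lemmatizer is passed in as a plain function (in the study this role was played by EstNLTK, which is not called here), and the cleaning rules and thresholds (min_length, min_count) are hypothetical.

```python
from collections import Counter


def clean_lemmas(word_forms, lemmatize, min_length=3, min_count=2):
    """Group unanalysed word forms by lemma, clean them, and sort by frequency.

    word_forms -- iterable of raw word forms
    lemmatize  -- function mapping a lower-cased form to its lemma (assumption:
                  any lemmatizer can be plugged in here)
    """
    counts = Counter()
    for form in word_forms:
        lemma = lemmatize(form.strip().lower())
        if len(lemma) < min_length:
            continue  # very short strings are mostly noise
        if not all(ch.isalpha() or ch == "-" for ch in lemma):
            continue  # drop strings containing digits or symbols
        if lemma.startswith("-") or lemma.endswith("-"):
            continue  # drop fragments left over from tokenization
        counts[lemma] += 1
    kept = [(lemma, n) for lemma, n in counts.items() if n >= min_count]
    # Most frequent first, alphabetical order as tie-breaker
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Toy lookup standing in for a real lemmatizer (illustrative data only)
toy_lemmas = {"süleri": "süler", "sülerit": "süler", "akrojoogat": "akrojooga"}
forms = ["süleri", "sülerit", "süler", "akrojooga", "akrojoogat", "x2", "ok", "-ism"]
result = clean_lemmas(forms, lambda form: toy_lemmas.get(form, form))
```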
Stage 3: reference exclusion word list
• lexicographic resources
• Explanatory Dictionary of the Estonian Language (EKSS 2009)
• Dictionary of Estonian (DicEst 2019)
• Dictionary of Foreign Words (VL 2015)
• Dictionary of Standard Estonian (ÕS 2013)
• in-house database of new words
3,722 lemmas
• English-Estonian Machine Translation Dictionary
• incl. 233 unadapted English loanwords
• weekend, lite, backup, wallet
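Applying several reference resources one after another, and recording how many candidates each one removes, could be sketched as below. The function name and the toy word lists are assumptions for illustration; only the four English loanwords come from the slide.

```python
def apply_exclusion_lists(candidates, resources):
    """Remove candidates found in each reference resource, in order.

    candidates -- iterable of candidate lemmas
    resources  -- list of (name, lemmas) pairs, applied in the given order
    Returns the remaining lemmas (sorted) and a per-resource removal report.
    """
    remaining = {lemma.lower() for lemma in candidates}
    report = []
    for name, lemmas in resources:
        known = {lemma.lower() for lemma in lemmas}
        removed = remaining & known   # candidates already described here
        remaining -= removed
        report.append((name, len(removed)))
    return sorted(remaining), report


# Toy data: the dictionary sample is invented for the example
resources = [
    ("general dictionaries (toy sample)", ["maja", "raamat"]),
    ("MT dictionary (toy sample)", ["weekend", "lite", "backup", "wallet"]),
]
remaining, report = apply_exclusion_lists(
    ["maja", "weekend", "süler", "backup", "akrojooga"], resources
)
```

Keeping a per-resource report makes it easy to see which resource accounts for which part of a reduction such as the one from 5,290 to 3,722 lemmas.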
Stages 4-5: compilation, lexicographic evaluation
• more tokenizing errors
• ganisatsioon ‘ganization’
• common spelling mistakes
• aitähh ‘thank you’
• lemmatization errors, e.g. nouns in genitive and partitive
• direct loans from other languages
• fer-de-lance, fouetté, bordereau, soentjie, societa, bueno, laissez-faire
• Estonian dialect words
• tüdrik ‘girl’, mõlemi ‘both’
• words derived from proper nouns (ca 180)
• lutsiferianism ‘Luciferianism’ and tarsanlik ‘Tarzan-like’
ca 200 new words
• süler ‘laptop’, akrojooga ‘acrobatic yoga’
Registering and presenting new words
• in Ekilex + Sõnaveeb (Wordweb)
• diakooniline ‘diaconic’
• in Ekilex for further examination
• baklavaa ‘baklava’, blog ‘blog’, veelkord ‘once more’ – vs. the standardized lemma forms baklava, blogi, veel kord
• 5,000 words on the waiting list (since 2005)
• 1,500 registered annually (manually), incl. MWEs
• a lot of derivatives and semantically transparent compound words
• digiteerimine ‘digitalizing’
Descriptive vs. prescriptive data: all in one?
• a 100-year line of spelling or orthographic dictionaries (ÕS 2018, 2013 ... 1918)
• government regulations for literary norm (since 2006): printed (!) ÕS
• prescriptive data
• orthography and pronunciation, marking the degree of quantity, stress and palatalization
• inflection
• specifying what belongs to standard Estonian and what does not
• prescriptively pointing out good and bad style in language
• ? meanings
• ? usage examples, ? collocations
• descriptive data
• ? orthography and pronunciation (variation)
• meanings
• usage examples, collocations
• etymological information
Plans for the future
• more advanced tools for neologism detection
• detection of multi-word expressions and new meanings
• ? database of common spelling mistakes
• ? ELEXIS tools
• joining or implementing Néoveille, a web platform for neologism tracking (Cartier 2017)
• visualizing usage and frequency information on the basis of time-stamped corpora
• presenting both descriptive and prescriptive data
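As a rough idea of what frequency information from a time-stamped corpus could look like before it is visualized, the sketch below computes a word's relative frequency per year. The input format (year, token list) and the function name are assumptions made for the example, not part of any existing tool.

```python
from collections import Counter


def yearly_frequency_per_million(documents, lemma):
    """Relative frequency of a lemma per million tokens, grouped by year.

    documents -- iterable of (year, tokens) pairs (hypothetical input format)
    lemma     -- the word to track
    """
    totals = Counter()  # all tokens seen per year
    hits = Counter()    # occurrences of the tracked lemma per year
    for year, tokens in documents:
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == lemma)
    return {
        year: hits[year] * 1_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }
```

The resulting year-to-frequency mapping can be passed to any plotting tool; normalizing per million tokens keeps years with differently sized subcorpora comparable.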
Thank you
Margit Langemets
Jelena Kallas
Kaisa Norak
Indrek Hein