NEW ESTONIAN WORDS AND SENSES: DETECTION AND DESCRIPTION

Margit Langemets, Jelena Kallas, Kaisa Norak, Indrek Hein
Institute of the Estonian Language

Globalex Workshop on Lexicography and Neologism, 8 May 2019
DSNA 22, Indiana University, Bloomington, IN



New words in the dictionaries

• Grenzstein (1884): 1,600 words

• Aavik (1919, 2nd ed. 1921): 4,000 words

• Erelt, Kull, Meriste (1985): 150 words (stems)

• separate dictionaries: (a) separate general dictionaries, (b) database of new words

• Ekilex (2019): new words included in a unified single resource

• user interface: Sõnaveeb (Wordweb)

8 May 2019 Globalex Workshop on Neologism 2


Sõnaveeb (Wordweb): computer and mobile view (Ver. 1.4.0, released in February 2019)


Unified single resource Ekilex

• enables constant updating of different data subsets

• NEW WORD in the database

• provided (ideally) with

• long definition (< general explanatory dic, large bilingual dic) – Detailed view

• short/simpler definition (< learners' dic, bilingual dic) – Simple view

• gloss/signpost (< orthological dic, bilingual dic) – Detailed/Simple view

• terminological definition (< termbase)

• prescriptive advice

• morphological information

• etymological information

• usage examples (for L1, L2, prescriptive advice)

• translation equivalents (different languages)

• synonyms

• etc.


Methods used so far

• the workflow not yet automated

• Estonian National Corpus (NC)

• started in the 1990s

• monitor corpus (updated every two years since 2017)

• Estonian NC 2017 – 1.1 billion tokens

• Estonian NC 2019 (October)

• Sketch Engine

• Wordlist function

• ELEXIS Survey for Lexicographers (2019): 54.8% (of those using 22 CQSs, corpus query systems) use SkE (Sketch Engine)

• there are many neologisms that will be missed (Kilgarriff et al. 2015)


An experimental study: detecting new words

• Exclusion Dictionary Architecture (Cartier, 2017)

• extraction of novel forms from monitor corpora

• using lexicographic resources as a reference exclusion dictionary to induce unknown words

• filters to eliminate spelling errors and proper nouns

• no tracking of new meanings (semantic neologisms)

5 stages

Kaisa Norak, Indrek Hein, lexicographers (February–April 2018)
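
The exclusion-dictionary idea listed above can be illustrated with a minimal Python sketch. It is not Cartier's implementation or the Institute's code: the function names are hypothetical, the proper-noun filter is a simple "only lowercase forms are considered" heuristic, and the spelling-error filter assumes that a form within one edit of a known word is a misspelling.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # exactly one extra character in b


def novel_forms(tokens, exclusion):
    """Return unknown forms that look like neither names nor misspellings."""
    # Assumed proper-noun filter: forms seen only capitalised never enter the pool.
    lowercase_forms = {token for token in tokens if token.islower()}
    candidates = []
    for form in sorted(lowercase_forms):
        if form in exclusion:
            continue  # already described in the reference resources
        if any(within_one_edit(form, known) for known in exclusion):
            continue  # assumed spelling-error filter
        candidates.append(form)
    return candidates


tokens = ["Tallinn", "süler", "aitähh", "raamat", "Süler", "akrojooga"]
exclusion = {"raamat", "aitäh"}
print(novel_forms(tokens, exclusion))  # ['akrojooga', 'süler']
```

A one-edit filter of this kind also discards genuine new words that happen to resemble known ones, which is one reason the final evaluation stage stays manual.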


Stages 1-2: extraction of novel word forms, filtering

• extraction of novel word forms from the Institute’s text collection (collected from 2016 to 2018)

• single new words (not MWEs)

• online news, TV subtitles, transcribed books (from heliraamat.eki.ee: text>audio)

• 712,197 word forms that had failed in the automatic morphological analysis

• filtering (first round)

• Python 3 and its library EstNLTK 1.4.1 (for lemmatization and morphological tagging)

• R and its library Tidyverse (for filtering and sorting)

• Excel (sorting)

• Lemmatization

• data selection and (multiple) cleaning of selected lemmas

5,290 lemmas
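
The first-round filtering and lemma grouping might look roughly like the sketch below. The cleaning rules (minimum length, letters and inner hyphens only) are assumptions, and `toy_lemmatize` is a hypothetical lookup standing in for a real lemmatizer such as EstNLTK, which is not called here.

```python
from collections import Counter


def clean_forms(failed_forms, min_length=3):
    """Keep lowercased forms made only of letters (inner hyphens allowed)."""
    kept = []
    for form in failed_forms:
        form = form.strip().lower()
        parts = form.split("-")
        if len(form) >= min_length and all(part.isalpha() for part in parts):
            kept.append(form)
    return kept


def group_by_lemma(forms, lemmatize):
    """Count forms per lemma; `lemmatize` is any callable mapping form -> lemma."""
    return Counter(lemmatize(form) for form in forms)


# Hypothetical stand-in for a real lemmatizer, used for illustration only.
TOY_LEMMAS = {"akrojoogat": "akrojooga", "akrojoogad": "akrojooga"}


def toy_lemmatize(form):
    return TOY_LEMMAS.get(form, form)


failed = ["akrojoogat", "Akrojoogad", "akrojooga", "http://x", "ab", "2018a", "fer-de-lance"]
forms = clean_forms(failed)
print(group_by_lemma(forms, toy_lemmatize))
# Counter({'akrojooga': 3, 'fer-de-lance': 1})
```

Passing the lemmatizer in as a callable keeps the cleaning logic independent of any particular morphological analyser.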


Stage 3: reference exclusion word list

• lexicographic resources

• Explanatory Dictionary of the Estonian Language (EKSS 2009)

• Dictionary of Estonian (DicEst 2019)

• Dictionary of Foreign Words (VL 2015)

• Dictionary of Standard Estonian (ÕS 2013)

• in-house database of new words

3,722 lemmas

• English-Estonian Machine Translation Dictionary

• incl. 233 unadapted English loanwords

• weekend, lite, backup, wallet
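
As an illustration of Stage 3, the sketch below checks candidate lemmas against several reference resources in turn and then flags unadapted English loanwords. The resource contents are toy data, the function names are hypothetical, and the slides do not say whether flagged loans were dropped or kept, so the code only separates them.

```python
def exclude_in_order(candidates, resources):
    """Remove lemmas found in each resource in turn.

    `resources` is a list of (name, set_of_lemmas) pairs. Returns the
    remaining lemmas and the number of lemmas each resource removed.
    """
    remaining = set(candidates)
    removed_by = {}
    for name, lemmas in resources:
        hits = remaining & lemmas
        removed_by[name] = len(hits)
        remaining -= hits
    return remaining, removed_by


def flag_english_loans(lemmas, english_words):
    """Split lemmas into (unadapted English loans, everything else)."""
    loans = {lemma for lemma in lemmas if lemma in english_words}
    return loans, set(lemmas) - loans


candidates = {"raamat", "blogi", "weekend", "süler", "wallet"}
resources = [("EKSS", {"raamat"}), ("ÕS", {"raamat", "blogi"})]  # toy contents
remaining, removed_by = exclude_in_order(candidates, resources)
loans, rest = flag_english_loans(remaining, {"weekend", "wallet", "backup"})
print(removed_by)     # {'EKSS': 1, 'ÕS': 1}
print(sorted(loans))  # ['wallet', 'weekend']
print(sorted(rest))   # ['süler']
```

Recording how many lemmas each resource removes shows which dictionaries contribute most to the exclusion list.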


Stages 4-5: compilation, lexicographic evaluation

• more tokenizing errors: ganisatsioon ‘ganization’

• common spelling mistakes: aitähh ‘thank you’

• lemmatization errors, e.g. nouns in genitive and partitive

• direct loans from other languages: fer-de-lance, fouetté, bordereau, soentjie, societa, bueno, laissez-faire

• Estonian dialect words: tüdrik ‘girl’, mõlemi ‘both’

• words derived from proper nouns (ca 180): lutsiferianism ‘Luciferianism’ and tarsanlik ‘Tarzan-like’

ca 200 new words: süler ‘laptop’, akrojooga ‘acrobatic yoga’
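
Some of the noise categories above could be pre-flagged before manual evaluation. The sketch below covers only tokenizing errors, on the assumption that a fragment such as ganisatsioon is the tail of a known word (here organisatsioon); the function name and the `max_missing` threshold are hypothetical.

```python
def truncated_from(candidate, known_lemmas, max_missing=3):
    """Return known lemmas that end with `candidate` and are slightly longer."""
    return sorted(
        lemma
        for lemma in known_lemmas
        if lemma != candidate
        and lemma.endswith(candidate)
        and len(lemma) - len(candidate) <= max_missing
    )


known = {"organisatsioon", "raamat"}
print(truncated_from("ganisatsioon", known))  # ['organisatsioon']
print(truncated_from("süler", known))         # []
```

A check like this would also flag legitimate words that happen to be the tail of a longer word, so it can only suggest items for a lexicographer to review.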


Registering and presenting new words

• in Ekilex + Sõnaveeb (Wordweb)

• diakooniline ‘diaconic’

• in Ekilex for further examination

• baklavaa ‘baklava’, blog ‘blog’, veelkord ‘once more’ – vs. the standardized lemma forms baklava, blogi, veel kord

• 5,000 words on the waiting list (since 2005)

• 1,500 registered annually (manually), incl. MWEs

• a lot of derivatives and semantically transparent compound words

• digiteerimine ‘digitizing’


Descriptive vs. prescriptive data: all in one?

• a 100-year tradition of spelling (orthographic) dictionaries (ÕS 2018, 2013 ... 1918)

• government regulation on the literary standard (since 2006): the printed (!) ÕS

• prescriptive data

• orthography and pronunciation, marking the degree of quantity, stress and palatalization

• inflection

• specifying what belongs to standard Estonian and what does not

• prescriptively pointing out good and bad style in language

• ? meanings

• ? usage examples, ? collocations

• descriptive data

• ? orthography and pronunciation (variation)

• meanings

• usage examples, collocations

• etymological information


Plans for the future

• more advanced tools for neologism detection

• detection of multi-word expressions and new meanings

• ? database of common spelling mistakes

• ? ELEXIS tools

• joining or implementing Néoveille, a web platform for neologism tracking (Cartier 2017)

• visualizing usage and frequency information on the basis of time-stamped corpora

• presenting both descriptive and prescriptive data
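
The plan to derive frequency information from time-stamped corpora could be prototyped along the following lines. This is only a sketch: the input format of (year, tokens) pairs and the function name are assumptions, and a real corpus would be queried through a corpus tool rather than held in memory.

```python
from collections import Counter


def yearly_frequency(documents, word):
    """Relative frequency (per million tokens) of `word` for each year.

    `documents` is an iterable of (year, tokens) pairs.
    """
    hits = Counter()
    totals = Counter()
    for year, tokens in documents:
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token.lower() == word)
    return {
        year: hits[year] / totals[year] * 1_000_000
        for year in sorted(totals)
        if totals[year]  # skip years with no tokens
    }


docs = [
    (2016, ["see", "on", "raamat", "siin"]),
    (2017, ["süler", "on", "uus", "Süler"]),
    (2017, ["tere", "tere", "tere", "tere"]),
]
print(yearly_frequency(docs, "süler"))  # {2016: 0.0, 2017: 250000.0}
```

The resulting year-to-frequency mapping is the kind of series that could be plotted to show when a word enters use.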


Thank you

Margit Langemets

Jelena Kallas

Kaisa Norak

Indrek Hein
