NooJ Conference, Inalco, Paris, June 16th, 2012

Post on 11-Jan-2016


DESCRIPTION

Russian Module for NooJ: design and implementation. Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software. NOOJ Conference, Inalco, Paris, June 16th, 2012. Vincent BÉNET, INALCO, CREE Recherche assistée par ordinateur.

TRANSCRIPT

1

NOOJ Conference, Inalco, Paris

June 16th, 2012

Vincent BÉNET, INALCO

CREE Recherche assistée par ordinateur

Russian Module for NooJ: design and implementation

Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software

ORDIDOM

2

Russian Module for NooJ: design and implementation

Design of linguistic resources

Description of the realization: dictionaries / paradigms / grammars

Job left to be done…

3

Writing lexical resources for the Russian language

Build dictionaries from texts

Create one « small » dictionary and many grammars for derivational forms:
раб + а (slave)
раб + от + а + ть (work)
за + раб + от + к + а (salary)

Complete one « big » existing dictionary and create many grammars

4

Writing lexical resources for the Russian language

ZALIZNIAK’s grammatical dictionary: 96,000 entries
A complete dictionary, in inverted alphabetical order, with all grammatical annotation

To obtain, to reach:
Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
Dostigat’ ipf nt 1a$3 (dostignut’//dostich’) has a passive form

5

Writing lexical resources for the Russian language

The problem of accent markers was postponed

Problems encountered:
- the classification is complete, but some tags are absent (V, N…)
- the classification is based on accent markers
- many informal, unclassified annotations have been added

Zalizniak’s dictionary was re-sorted; its classification was modified, simplified and completed for computer use

6

The design of lexical resources for the Russian language has consisted in:

1. creating grammatical tags

2. recoding the dictionary with these tags

3. sorting the dictionary (inverted alphabetical order for each word)

4. fixing a paradigm model list (karta instead of zh1a)

5. writing paradigms

6. handling the problem of ë / e

7. allocating models to the words

8. verifying the results

9. testing with texts

10. correcting and proofreading

7

Writing lexical resources for Russian

1. Creating tags and properties N, A, V, ADV…

A_Forme = fc | fl | adv ;
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
A_Deg = Comp | Sup ;
ADV_Deg = Comp ;

V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Type = Mvt ;
V_Morph = Pvb | Simp | Sufx | PvbSufx ;
V_SsAsp = Det | Indet ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Constr = intr | tr | sja ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
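The property definitions above lend themselves to a simple consistency check. The following is a minimal Python sketch, not part of the NooJ module: the function names `parse_properties` and `unknown_values` are hypothetical, and it assumes definitions of the form `Name = v1 | v2 ;` as shown on the slide. It reads the definitions into a table and reports feature codes that no property of a given category declares.

```python
def parse_properties(text):
    """Parse 'Name = v1 | v2 ;' definitions into {name: set_of_values}."""
    props = {}
    for chunk in text.split(";"):
        if "=" not in chunk:
            continue  # skip blank fragments between definitions
        name, values = chunk.split("=", 1)
        props[name.strip()] = {v.strip() for v in values.split("|") if v.strip()}
    return props


def unknown_values(category, features, props):
    """Return the features that no '<category>_...' property declares."""
    allowed = set()
    for name, values in props.items():
        if name.startswith(category + "_"):
            allowed |= values
    return [f for f in features if f not in allowed]


# A subset of the adjective properties listed on the slide.
DEFINITIONS = """
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
"""

PROPS = parse_properties(DEFINITIONS)
print(unknown_values("A", ["f", "s", "Im"], PROPS))   # []
print(unknown_values("A", ["mo", "s", "Tv"], PROPS))  # ['mo']
```

A check of this kind can flag a mistyped code such as `mo` before the dictionary is compiled.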

8

Writing lexical resources for Russian

2. Recoding the dictionary

3. Sorting the dictionary to get inverted alphabetical ordering
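Inverted (reverse) alphabetical ordering sorts words by their endings, so that words sharing an inflectional type tend to end up next to each other. A minimal Python sketch of the idea follows; the word list is taken from the model names used later in the presentation, and the helper name `inverted_key` is hypothetical.

```python
def inverted_key(word):
    """Sort key for inverted alphabetical order: compare words from the last letter."""
    return word[::-1]


words = ["карта", "корова", "неделя", "книга", "собака"]

# Note: plain code-point comparison is used here; it places ё after я,
# so a real tool would probably need a proper Russian collation.
inverted = sorted(words, key=inverted_key)
print(inverted)  # ['корова', 'книга', 'собака', 'карта', 'неделя']
```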

9

Writing lexical Russian resources

4. Paradigm model list

#j1a=karta
#jo1a=korova
#j2a=nedelja
#jo2a=boginja
#j3a=kniga
#jo3a=sobaka
#j4a=tuča
#jo4a=kassirša
#j5a=ulica
#jo5a=volčica
#j6a=statuja
#jo6a=feja
#j7a=linija
#jo7a=furija

5. Writing paradigms

карта = <E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;
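To make the paradigm notation concrete: `<E>` leaves the lemma unchanged, `<B>` deletes its last letter, and `<Bn>` deletes its last n letters before the suffix is appended. The Python sketch below only illustrates how such a paradigm expands into forms; it is not NooJ's own implementation, the function names `apply_command` and `inflect` are hypothetical, and it assumes alternatives are separated by ` + ` or ` | ` with spaces on both sides, as on the slides.

```python
import re

_BACKSPACE = re.compile(r"<B(\d*)>(.*)")


def apply_command(lemma, command):
    """Apply one command such as '<E>', '<B>у' or '<B2>л' to a lemma."""
    if command == "<E>":
        return lemma
    match = _BACKSPACE.fullmatch(command)
    if match is None:
        raise ValueError(f"unsupported command: {command!r}")
    count = int(match.group(1) or 1)  # '<B>' means one backspace
    if count > len(lemma):
        raise ValueError(f"cannot delete {count} letters from {lemma!r}")
    return lemma[: len(lemma) - count] + match.group(2)


def inflect(lemma, paradigm):
    """Expand 'command/features' alternatives into (form, features) pairs."""
    body = paradigm.strip().rstrip(";").strip()
    forms = []
    for alternative in re.split(r"\s+[+|]\s+", body):
        command, features = alternative.split("/", 1)
        forms.append((apply_command(lemma, command.strip()), features.strip()))
    return forms


KARTA = ("<E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + "
         "<B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + "
         "<B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;")

karta_forms = inflect("карта", KARTA)
for form, features in karta_forms:
    print(form, features)
```

Because the model is written once for карта and then referenced by name, any other noun of the same type can be passed as `lemma` to obtain its forms.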

10

Writing lexical Russian resources

5. Paradigm for verbs

взять = <E>/Inf |
<B4>озьму/1+s+Pre | <B4>озьмешь/2+s+Pre | <B4>озьмет/3+s+Pre | <B4>озьмем/1+p+Pre | <B4>озьмете/2+p+Pre |
<B4>озьмёшь/2+s+Pre | <B4>озьмёт/3+s+Pre | <B4>озьмём/1+p+Pre | <B4>озьмёте/2+p+Pre | <B4>озьмут/3+p+Pre |
<B2>л/m+s+Pa | <B2>ла/f+s+Pa | <B2>ло/n+s+Pa | <B2>ли/p+Pa |
<B4>озьми/2+s+Imp | <B4>озьмите/2+p+Imp |
<B2>в/Ger | <B2>вши/Ger |
<B2>вший/Prtp+Pa+Act+m+s+Im | <B2>вший/Prtp+Pa+Act+m+s+Vi | <B2>вшего/Prtp+Pa+Act+m+an+s+Vi | <B2>вшего/Prtp+Pa+Act+m+s+Ro | <B2>вшему/Prtp+Pa+Act+m+s+Da | <B2>вшим/Prtp+Pa+Act+m+s+Tv | <B2>вшем/Prtp+Pa+Act+m+s+Pr |
<B2>вшая/Prtp+Pa+Act+f+s+Im | <B2>вшую/Prtp+Pa+Act+f+s+Vi | <B2>вшей/Prtp+Pa+Act+f+s+Ro | <B2>вшей/Prtp+Pa+Act+f+s+Da | <B2>вшей/Prtp+Pa+Act+f+s+Tv | <B2>вшею/Prtp+Pa+Act+f+s+Tv | <B2>вшей/Prtp+Pa+Act+f+s+Pr |
<B2>вшее/Prtp+Pa+Act+n+s+Im | <B2>вшее/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Ro | <B2>вшему/Prtp+Pa+Act+n+s+Da | <B2>вшим/Prtp+Pa+Act+n+s+Tv | <B2>вшем/Prtp+Pa+Act+n+s+Pr |
<B2>вшие/Prtp+Pa+Act+p+Im | <B2>вшие/Prtp+Pa+Act+p+Vi | <B2>вших/Prtp+Pa+Act+an+p+Vi | <B2>вших/Prtp+Pa+Act+p+Ro | <B2>вшим/Prtp+Pa+Act+p+Da | <B2>вшими/Prtp+Pa+Act+p+Tv | <B2>вших/Prtp+Pa+Act+p+Pr |
<B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тый/Prtp+Pa+Pss+m+s+Vi | <B2>того/Prtp+Pa+Pss+m+an+s+Vi | <B2>того/Prtp+Pa+Pss+m+s+Ro | <B2>тому/Prtp+Pa+Pss+m+s+Da | <B2>тым/Prtp+Pa+Pss+m+s+Tv | <B2>том/Prtp+Pa+Pss+m+s+Pr |
<B2>тая/Prtp+Pa+Pss+f+s+Im | <B2>тую/Prtp+Pa+Pss+f+s+Vi | <B2>той/Prtp+Pa+Pss+f+s+Ro | <B2>той/Prtp+Pa+Pss+f+s+Da | <B2>той/Prtp+Pa+Pss+f+s+Tv | <B2>тою/Prtp+Pa+Pss+f+s+Tv | <B2>той/Prtp+Pa+Pss+f+s+Pr |
<B2>тое/Prtp+Pa+Pss+n+s+Im | <B2>тое/Prtp+Pa+Pss+n+s+Vi | <B2>того/Prtp+Pa+Pss+n+s+Ro | <B2>тому/Prtp+Pa+Pss+n+s+Da | <B2>тым/Prtp+Pa+Pss+n+s+Tv | <B2>том/Prtp+Pa+Pss+n+s+Pr |
<B2>тые/Prtp+Pa+Pss+p+Im | <B2>тые/Prtp+Pa+Pss+p+Vi | <B2>тых/Prtp+Pa+Pss+an+p+Vi | <B2>тых/Prtp+Pa+Pss+p+Ro | <B2>тым/Prtp+Pa+Pss+p+Da | <B2>тыми/Prtp+Pa+Pss+p+Tv | <B2>тых/Prtp+Pa+Pss+p+Pr |
<B2>т/Prtp+Pa+Pss+m+s+fc | <B2>та/Prtp+Pa+Pss+f+s+fc | <B2>то/Prtp+Pa+Pss+n+s+fc | <B2>ты/Prtp+Pa+Pss+p+fc ;

11

Writing lexical resources for Russian

6. Problem of the letter ë / e (partially solved: two entries or two paradigms)

ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач

жевать = <E>/Inf | <B5>ую/1+s+Pre | <B5>уёшь/2+s+Pre | <B5>уёт/3+s+Pre | <B5>уём/1+p+Pre | <B5>уёте/2+p+Pre | <B5>уешь/2+s+Pre | <B5>ует/3+s+Pre | <B5>уем/1+p+Pre | <B5>уете/2+p+Pre | <B5>уют/3+p+Pre
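The "two entries" solution shown above can be produced mechanically: every entry whose lemma contains ё gets a twin spelled with е. The Python sketch below shows one way to do this; the function name `yo_variants` is hypothetical, and the entry format `lemma,CATEGORY+features+FLX=model` is assumed from the examples on the slide. Only the lemma is rewritten, so the paradigm name after `FLX=` stays intact.

```python
import unicodedata


def yo_variants(entry):
    """Return the entry plus, if its lemma contains ё, a copy spelled with е."""
    # NFC turns 'е' followed by a combining diaeresis into the single letter 'ё'.
    entry = unicodedata.normalize("NFC", entry)
    lemma, separator, info = entry.partition(",")
    plain = lemma.replace("ё", "е").replace("Ё", "Е")
    if plain == lemma:
        return [entry]
    return [entry, plain + separator + info]


print(yo_variants("ёжик,N+m+an+FLX=бульдог"))
# ['ёжик,N+m+an+FLX=бульдог', 'ежик,N+m+an+FLX=бульдог']
```

The alternative shown for жевать, listing both the ё and the е endings inside one paradigm, avoids duplicating entries at the cost of longer paradigms.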

12

Writing lexical Russian resources

7. Allocating models to words

abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist

8. Verifying paradigms
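One simple verification is to confirm that every model named after `FLX=` actually exists among the written paradigms. The Python sketch below illustrates this on the entries shown above; the function names and the `KNOWN_MODELS` set are hypothetical (the set deliberately omits čajnik to show what a missing paradigm looks like), and it says nothing about how NooJ itself reports such errors.

```python
def parse_entry(line):
    """Split 'lemma,CAT+feat+...+FLX=model' into its parts."""
    lemma, info = line.split(",", 1)
    category, *rest = info.split("+")
    model = None
    features = []
    for item in rest:
        if item.startswith("FLX="):
            model = item[len("FLX="):]
        else:
            features.append(item)
    return {"lemma": lemma, "category": category,
            "features": features, "model": model}


def entries_without_model(lines, known_models):
    """Return lemmas whose FLX model is missing or not among the known paradigms."""
    return [entry["lemma"] for entry in map(parse_entry, lines)
            if entry["model"] not in known_models]


ENTRIES = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abazin,N+m+an+FLX=artist",
    "abaz,N+m+inan+FLX=zavod",
    "abak,N+m+inan+FLX=čajnik",
    "abbat,N+m+an+FLX=artist",
]

# Hypothetical list of paradigms written so far (čajnik left out on purpose).
KNOWN_MODELS = {"zavod", "ukrainec", "artist"}

print(entries_without_model(ENTRIES, KNOWN_MODELS))  # ['abak']
```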

13

Writing lexical resources for Russian

9. Testing with Russian texts:

« The Nose » by Gogol

« The Gambler » by Dostoevsky

« The Prisoner of the Caucasus » by Tolstoy

« The Lady with the Dog » by Chekhov

« Short stories » by Kharms

14

Writing lexical resources for Russian

10. Correcting errors:

- bad encoding (mixed Latin/Cyrillic letters): A B E K M H O P C y X, MOCKBA

- errors in paradigms

- bad allocation of model to words

mobile vowel / palatalization
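Mixed Latin/Cyrillic encoding is hard to see by eye because letters such as A, B, E, K, M, H, O, P, C, y, X look the same in both alphabets, but it is easy to detect programmatically. The following Python sketch is one possible check, relying on the standard `unicodedata` module; the function names are hypothetical and the test string is written with escapes so that the two scripts are unambiguous.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts (Latin, Cyrillic) used by the letters of a word."""
    scripts = set()
    for char in word:
        if not char.isalpha():
            continue
        name = unicodedata.name(char, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
    return scripts


def mixed_script_words(text):
    """Return the whitespace-separated words that mix Latin and Cyrillic letters."""
    return [word for word in text.split() if len(scripts_in(word)) > 1]


# 'MOCK' typed with Latin letters followed by Cyrillic 'ВА',
# then the same word typed entirely in Cyrillic.
sample = "MOCK\u0412\u0410 \u041c\u041e\u0421\u041a\u0412\u0410"
suspicious = mixed_script_words(sample)
print(len(suspicious))  # 1
```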

15

Improving lexical resources

Suppress word entries and / or forms?

- useless words are a source of unnecessary ambiguities: the names of the letters а, б, в, и, к, о, с, у, я; archaic, unused words
- repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)

Increase the number of different models?

To avoid generating unexpected or incongruous forms, or failing to recognize existing forms.
Читав? (Čitav?) Пиша? (Piša?) Счастие? (Ŝastie?)

16

Available lexical resources for Russian

1 COMPILED BASIC DICTIONARY containing:

- 1 dictionary of 45,000 nouns (350 paradigms)
- 1 dictionary of 20,000 adjectives (50 paradigms)
- 1 dictionary of 25,000 verbs (600 paradigms)
- 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1,600 adverbs, parenthetical words, etc.

2 COMPILED ADDITIONAL DICTIONARIES (optional use):

- 1 dictionary of proper nouns (cities, countries, rivers…, first names with diminutives)
- 1 dictionary of substantivized adjectives

17

Writing Russian grammars for NooJ

designing disambiguation grammars for

- grammatical agreement between adjectives & nouns
- case usage with numerals
- case usage with prepositions
- case usage with verbs

designing grammars to locate syntagms

- date and time expressions
- adverbial phrases of time, place…
- idiomatic structures (my name is…, I’m … old)
- verbs of motion

18

Writing Russian grammars for NooJ

Syntactic grammar for Russian

19

Writing Russian grammars for NooJ

Syntactic grammar for Russian

20

Grammar to locate the verbs of motion

21

Grammar to locate the verbs of motion

22

The prepositions in Russian

23

The disambiguation of « NA » (on, onto)
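The principle behind disambiguating with a preposition can be illustrated outside NooJ. The preposition на governs either the accusative (direction) or the prepositional (location), so analyses of the following noun in any other case can be discarded. The Python sketch below is only an illustration of that filtering step under this assumption; it is not the author's grammar, and `GOVERNED_CASES` and `filter_after_preposition` are hypothetical names. The case codes follow the tag set of the presentation (Vi, Pr, Da…).

```python
# Cases governed by each preposition (only на is covered in this sketch).
GOVERNED_CASES = {"на": {"Vi", "Pr"}}


def filter_after_preposition(preposition, analyses):
    """Keep only the analyses whose case is compatible with the preposition."""
    allowed = GOVERNED_CASES.get(preposition.lower())
    if allowed is None:
        return analyses  # unknown preposition: leave the ambiguity untouched
    kept = [a for a in analyses if allowed & set(a.split("+"))]
    return kept or analyses  # never remove every analysis


# 'карте' is ambiguous between dative and prepositional singular.
karte = ["N+f+s+Da", "N+f+s+Pr"]
print(filter_after_preposition("на", karte))  # ['N+f+s+Pr']
```

Like the grammars described in the presentation, a rule of this kind depends on word order: it only helps when the noun directly follows the preposition.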

24

Annotating and disambiguating texts

The text with its ambiguities:

25

Verifying grammars

The text was disambiguated with the grammar of « NA »:

26

The disambiguation of « V » (in, into)

27

Russian grammars for NooJ

All these grammars need improvement:

They are very sensitive to word order:
- they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard).

There are no grammars (yet):
- to disambiguate adverbs / adjectives
- to disambiguate adjectives / nouns
- to disambiguate conjunctions / interjections

28

To get reliable resources for the Russian language:

The job left to be done is to design and implement:

• A data bank of verified and annotated texts

• Efficient syntactic grammars

• Semantic tagging

• Unified or harmonized tags for Slavic, Romance, Germanic, etc. languages, to allow further multilingual processing

29

Russian Module for NooJ

http://www.nooj4nlp.net/pages/russian.html

30

NOOJ Conference, Inalco, June 16th, 2012

Russian Module for NooJ: design and implementation

Спасибо за внимание
Thank you for your attention
Merci de votre attention

vincent.benet@inalco.fr, INALCO

ORDIDOM
