Lexical acquisition through particular adjectival endings for Croatian
![Page 1: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/1.jpg)
Lexical acquisition through particular adjectival endings for Croatian
Božo Bekavac, Krešimir Šojat
Institute of Linguistics, Faculty of Philosophy, Zagreb
![Page 2: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/2.jpg)
Motivation & Goals
• Recognition of unknown words necessary for many NLP applications
• No attempt for Croatian so far
• Focus on recognition of adjectives based on characteristic endings
• Addition of recognized adjectives into general lexicon
• Creation of dynamic rule-based resource
![Page 3: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/3.jpg)
Approach
• Assumption: adjectives unrecognized by the common lexicon tend to follow regular derivational patterns
• e.g. cyberski (cyber-), imunobioloških (immuno-biological), eurooptimističnog (eurooptimistic)
• Focus on adjectives, but applicable to other parts of speech
![Page 4: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/4.jpg)
Resources used
• Croatian Morphological Lexicon (CML) - 621.000 types generated from ca 33.000 lemmas
• 30 M newspaper corpus consisting of 195.534 types
• (There will always be words not covered by general lexicons)
![Page 5: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/5.jpg)
Adjectives in Croatian (1): Multext-East specification
1) type (qualificative, possessive)
2) degree (positive, comparative, superlative)
3) gender (masculine, feminine, neuter)
4) number (singular, plural)
5) case (nominative, genitive, dative, accusative, vocative, locative, instrumental)
6) definiteness
7) animate (relevant only for masculine-singular-accusative)
![Page 6: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/6.jpg)
Adjectives in Croatian (2)
• Adjectives: an open and productive class of words
• Morphologic features: derivation + inflection
• Derivation:
  • suffix (e.g. Tomislav → Tomislavov)
  • prefix + suffix (e.g. nad morem → nadmorski)
  • compound + suffix (e.g. primorsko-goranski, srednjoškolski)
![Page 7: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/7.jpg)
Adjectives in Croatian (3)
• Inflection (e.g. singular):
dvojb - en
dvojb - ena
dvojb - enu
dvojb - en
dvojb - enu
dvojb - enim
![Page 8: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/8.jpg)
Consequence
• The potential number of adjectival MSD interpretations is 256
• Extensive overlapping of suffixes (endings and ends) across different POS, especially between adjectives and nouns
![Page 9: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/9.jpg)
Internal homography (1)
• where the same token represents different word-forms of the same lemma
• EXAMPLE: the word-form modalnom of the lemma modalan has five possible MSDs – Amsd, Amsl, Afsi, Ansd, Ansl
• All different MSDs with internal homography grouped under the same ending –alnom
![Page 10: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/10.jpg)
Internal homography (2)
• modalan Afpmsan-n, Afpmsnn
• modalna Afpfsnn, Afpfsny, Afpfsvy, Afpmsan-y, Afpmsgn
• modalni Afpnpan, Afpnpay, Afpnpnn, Afpnpny, Afpnpvy, Afpnsgn, Afpfpan
• modalne Afpfpay, Afpfpnn, Afpfpny, Afpfpvy, Afpfsgn, Afpfsgy, Afpmpan, ...
![Page 11: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/11.jpg)
External homography (1)
• where the same token represents different word-forms (i.e. MSD interpretations) of two or more lemmas
• EXAMPLE: kos
  – noun kos (Nmsn) of the lemma kos (blackbird)
  – adjective kos (Amsa; Amsn) of the lemma kos (slant)
![Page 12: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/12.jpg)
External homography (2) - endings
• Adjectival endings regularly homographic with those of other parts of speech were not taken into consideration at all
• For adjectival paradigms that are only partially homographic, only the unambiguous endings were used
![Page 13: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/13.jpg)
External homography: endings/ends
![Page 14: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/14.jpg)
Order of processing
CML (common lexicon)
RECOGNIZER (lexical transducer)
Temporary lexicon of unknown adjectives
Generation of all word-forms
![Page 15: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/15.jpg)
Lexical transducer –alan.grf
Graph: alan+.grf (Wed Jun 15 10:24:13 2005)
• Pattern: <L> captured as Root, followed by one of the endings captured as Suf
• Suf: alan, alna, alne, alni, alnih, alnim, alnima, alno, alnog, alnoga, alnoj, alnom, alnome, alnomu, alnu
• Graph output: {$Root#$Suf,$Root#alan.A+ 453,452/0/442 af bk+alan}
• Variables (example): Root = mod, Suf = alne
• Output: modalne,modalan.A+453,452/0/442 af bk
• 24 transducers, i.e. different paradigms, used
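What the –alan transducer does can be approximated in a few lines of Python. This is only a sketch: it replaces the graph with a regular expression, the function name `recognize_alan` is hypothetical, and the pattern code string is copied from the slide output without interpreting it.

```python
import re

# Endings of the -alan paradigm, as listed on the slide.
ALAN_ENDINGS = [
    "alan", "alna", "alne", "alni", "alnih", "alnim", "alnima", "alno",
    "alnog", "alnoga", "alnoj", "alnom", "alnome", "alnomu", "alnu",
]

# Longest endings first, so that e.g. "alnima" is tried before "alni".
_SUFFIX = "|".join(sorted(ALAN_ENDINGS, key=len, reverse=True))
_PATTERN = re.compile(rf"^(?P<root>\w+?)(?P<suf>{_SUFFIX})$")

# Inflectional pattern code, copied verbatim from the slide output.
PATTERN_CODE = "A+453,452/0/442 af bk"


def recognize_alan(token):
    """Return 'form,lemma.code' if the token ends in an -alan ending, else None."""
    match = _PATTERN.match(token.lower())
    if match is None:
        return None
    # The lemma is the captured root plus the citation ending "alan".
    return f"{token},{match.group('root')}alan.{PATTERN_CODE}"


print(recognize_alan("modalne"))
print(recognize_alan("bazen"))
```

In the pipeline described on the slides, such a recognizer would only see tokens that the common lexicon did not already cover.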
![Page 16: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/16.jpg)
Lexical transducer –alan.grf applied on running text
ambijentalni,ambijentalan.A+453,452/0/442 af bk
bijenalna,bijenalan.A+ 453,452/0/442 af bk
cerebrospinalne,cerebrospinalan.A+453,452/0/442 af bk
doktrinalnom,doktrinalan.A+453,452/0/442 af bk
dvodimenzionalnima,dvodimenzionalan.A+453,452/0/442 af bk
dvokanalnom,dvokanalan.A+453,452/0/442 af bk ...
inflectional pattern code
![Page 17: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/17.jpg)
Temporary final lexicon
• Results of lexical transducers stored in temporary lexicon
• Inflectional pattern code and lemma used for generation of all word-forms (wfs) of recognized adjectives (A)
• Such order of processing correctly recognizes the wf dvojben as A and does not misclassify wfs with the same ends (e.g. bazen)
• Results of generation stored in final lexicon
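One possible reading of this order of processing, sketched in Python. Everything here is illustrative: the tiny lexicon, the split into "unambiguous" and "all" endings, and the function names are assumptions, not the actual CML data or transducers. The idea is that a bare -en ending is never used for recognition, so a noun such as plamen is not picked up, while dvojben still reaches the final lexicon through generation.

```python
# Hypothetical stand-in for the common lexicon (CML).
KNOWN_LEXICON = {"bazen"}

# Assumed unambiguous endings of the -en paradigm; bare "en" is left out
# on purpose because nouns can end the same way.
UNAMBIGUOUS_EN = ["enog", "enoj", "enih"]

# Assumed (incomplete) set of endings used when generating word-forms.
ALL_EN = ["en", "ena", "enu", "enim", "enog", "enoj", "enih"]


def recognize(tokens, known, unambiguous, lemma_ending):
    """Build a temporary lexicon {lemma: root} from unknown tokens."""
    temp = {}
    for tok in tokens:
        if tok in known:  # the common lexicon is consulted first
            continue
        for suf in sorted(unambiguous, key=len, reverse=True):
            if tok.endswith(suf) and len(tok) > len(suf):
                root = tok[: -len(suf)]
                temp[root + lemma_ending] = root
                break
    return temp


def generate(temp, all_endings):
    """Expand every recognized lemma into all of its word-forms."""
    final = {}
    for lemma, root in temp.items():
        for suf in all_endings:
            final[root + suf] = lemma
    return final


tokens = ["dvojbenog", "bazen", "plamen"]
temp_lexicon = recognize(tokens, KNOWN_LEXICON, UNAMBIGUOUS_EN, "en")
final_lexicon = generate(temp_lexicon, ALL_EN)
print(temp_lexicon)
print(sorted(final_lexicon))
```

Generating only from lemmas that were recognized through an unambiguous ending is what keeps the final lexicon free of forms built from unrelated words.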
![Page 18: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/18.jpg)
Final lexicon
aboridžinska,aboridžinski.A:qtfsn-:qtfsv-:qtrpa-:qtrpn-:qtrpv-
aboridžinske,aboridžinski.A:qtfpa-:qtfpn-:qtfpv-:qtfsg-:qtmpa-
aboridžinski,aboridžinski.A:qtmpn-:qtmpv-:qtmsay--:qtmsn-:qtmsv-
aboridžinskih,aboridžinski.A:qtfpg-:qtmpg-:qtrpg-
aboridžinskim,aboridžinski.A:qtfpd-:qtfpi-:qtfpl-:qtmpd-:qtmpi-:qtmpl-:qtmsi-:qtrpd-:qtrpi-:qtrpl-:qtrsi-
...
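The entries above follow a `form,lemma.POS:tag:tag...` layout. A minimal Python sketch of reading one such line is given below; the function name `parse_entry` and the returned dictionary keys are hypothetical, and the sketch does not interpret the individual tag letters.

```python
def parse_entry(line):
    """Split one final-lexicon line into form, lemma, POS and tag list."""
    form, rest = line.split(",", 1)    # word-form comes before the first comma
    lemma, rest = rest.split(".", 1)   # lemma comes before the first full stop
    pos, *tags = rest.split(":")       # POS code, then colon-separated tags
    return {"form": form, "lemma": lemma, "pos": pos, "tags": tags}


entry = parse_entry("aboridžinskih,aboridžinski.A:qtfpg-:qtmpg-:qtrpg-")
print(entry["form"], entry["lemma"], entry["pos"], entry["tags"])
```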
![Page 19: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/19.jpg)
Results (1)
![Page 20: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/20.jpg)
Results (2)
• 13.933 new adjectival word-forms found by recognizer
• 5.035 word-forms belong to different lemmas
• 4.511 new lemmas added into the CML (after manual inspection); 393 type errors
• Precision: 97.01 %
![Page 21: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/21.jpg)
Problems
• Besides inevitable type errors, 131 wfs were misclassified due to:
1. NE endings/ends homographic with adjectival endings (Joško, Aljaska)
2. Small amount of other POS still not present in the CML (ekosustav)
3. Foreign words and words of foreign origin (certificate)
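The counts on the Results (2) and Problems slides can be cross-checked against each other. The sketch below assumes that 5.035 is the number of candidate lemmas and that each of the 393 type errors and 131 misclassified word-forms removes one candidate; this reading is an inference, not something the slides state. The 97.01 % precision figure is not recomputed, since the slides do not give its formula.

```python
# Figures taken from the slides (written there with "." as thousands separator).
candidate_lemmas = 5035   # assumed reading of "5.035 word-forms belong to different lemmas"
type_errors = 393
misclassified = 131

# Candidates left after removing both kinds of rejected items.
added_to_cml = candidate_lemmas - type_errors - misclassified
print(added_to_cml)
```

Under these assumptions the remainder matches the 4.511 lemmas reported as added to the CML.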
![Page 22: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/22.jpg)
Solution
• AD 1) preprocess the corpus with the NERC system developed for Croatian (Bekavac, 2005)
• AD 2) the problem will be solved after the automatic disambiguation of word-forms when added into the CML
• AD 3) foreign words used in their original spelling (e.g. certificate) are not added into the CML by default; their number is small
![Page 23: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/23.jpg)
Frequency of particular adjectival endings found in corpus
(Chart; endings shown: ski, čki, av, ški, alan, ičan, ljiv, ast, ovan, evan)
![Page 24: Lexical acquisition through particular adjectival endings for Croatian](https://reader031.vdocuments.mx/reader031/viewer/2022032708/56812ea9550346895d9447a1/html5/thumbnails/24.jpg)
Conclusion and future work
• Dynamic resource highly efficient for specific domains
• With the applied order of processing, overgeneration of word-forms is avoided
• Future work: apply the same methodology to other open word classes (nouns and verbs)