Near Language Identification Using NooJ
Božo Bekavac, Kristina Kocijan, Marko Tadić
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
NooJ 2014, Sassari
2014-06-04
Introduction
It is not hard to automatically distinguish very different languages
Similar languages, such as Czech and Slovak, Indonesian and Malay, or Brazilian and European Portuguese, are very hard to distinguish
  even state-of-the-art statistical tools often confuse them
We use NooJ as the core of a system designed for automatic identification of the near languages Croatian and Serbian
Differences: Croatian - Serbian
Lexical level (some differences)
  Reflex of the Proto-Slavic vowel jat: ije/je (hr) vs. e (sr), e.g. milk (en): mlijeko (hr) vs. mleko (sr)
  Verbs ending in -irati (hr) vs. -ovati (sr), e.g. to employ (en): angažirati (hr) vs. angažovati (sr)
Construction of the future tense
  analytic in hr, e.g. pitat ću (I will ask)
  synthetic in sr, e.g. pitaću (I will ask)
Structures typical of each language
  Croatian: modal verb + infinitive, e.g. hoću raditi
  Serbian: modal verb + da + present, e.g. hoću da radim
Formalizing differences
We used only Croatian language resources and designed morphological grammars to recognize tokens that remain unknown in Serbian texts
  some words specific to Serbian are still left unknown, e.g. bread (en): kruh (hr) vs. hleb (sr), but this had no impact on the efficiency of the system
Syntactic and lexical grammars focus on formalizing the differences between the languages
Examples follow…
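The idea of recognizing unknown Serbian tokens with Croatian resources only can be sketched outside NooJ. The Python below is a minimal illustration, not the authors' morphological grammar: the mini-lexicon `CROATIAN_LEXICON` and the helper names are hypothetical, and the only rule modelled is the jat reflex (an unknown token is tried again with one "e" replaced by "ije" or "je").

```python
# Hypothetical mini-lexicon standing in for the Croatian NooJ dictionary.
CROATIAN_LEXICON = {"mlijeko", "predsjednik", "kruh"}


def ijekavian_candidates(token):
    """Yield forms obtained by replacing a single 'e' with 'ije' or 'je'."""
    for i, ch in enumerate(token):
        if ch == "e":
            for reflex in ("ije", "je"):
                yield token[:i] + reflex + token[i + 1:]


def croatian_counterpart(token, lexicon=CROATIAN_LEXICON):
    """Return the Croatian form if an unknown token becomes known after
    restoring a jat reflex; return None if the token is already known
    or no candidate is found in the lexicon."""
    token = token.lower()
    if token in lexicon:
        return None  # already a known Croatian form
    for candidate in ijekavian_candidates(token):
        if candidate in lexicon:
            return candidate
    return None


print(croatian_counterpart("mleko"))       # ekavian form of mlijeko
print(croatian_counterpart("predsednik"))  # ekavian form of predsjednik
print(croatian_counterpart("hleb"))        # purely lexical difference, stays unknown
```

As on the slide, a word such as hleb is not covered by this kind of rule, because it differs from kruh lexically rather than through the jat reflex.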
Lexical grammars (1)
E.g. president (en): predsjednik (hr) vs. predsednik (sr)
Lexical grammars (2)
E.g. to meet (en): sastati (hr) vs. sastaću (sr)
Syntactic grammars (sr)
E.g. should do (en): treba da uradi (sr)
Syntactic grammars (hr)
E.g. should do (en): treba uraditi (hr)
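The two syntactic patterns (Serbian modal + da + present, Croatian modal + infinitive) can be approximated with regular expressions. This is only a rough sketch under strong assumptions: the `MODALS` list is a small illustrative sample, the word after "da" is not checked to be a present-tense verb, and an infinitive is guessed from the endings -ti and -ći. The actual NooJ grammars rely on full linguistic annotation and are not reproduced here.

```python
import re

# Small illustrative sample of modal verb forms (an assumption, not a full list).
MODALS = ("treba", "hoću", "hoćeš", "hoće", "mogu", "možeš", "može",
          "moram", "moraš", "mora")
_MODAL_ALT = "|".join(MODALS)

# Serbian-style: modal + "da" + following word (assumed to be a present-tense verb).
V_DA_V = re.compile(rf"\b(?:{_MODAL_ALT})\s+da\s+\w+", re.IGNORECASE)
# Croatian-style: modal + word ending in -ti or -ći (assumed to be an infinitive).
V_VINF = re.compile(rf"\b(?:{_MODAL_ALT})\s+\w+(?:ti|ći)\b", re.IGNORECASE)


def count_constructions(text):
    """Count rough matches of both constructions in a text."""
    return {
        "V da V": len(V_DA_V.findall(text)),
        "V Vinf": len(V_VINF.findall(text)),
    }


print(count_constructions("Treba da uradi. Hoću da radim."))
print(count_constructions("Treba uraditi. Hoću raditi."))
```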
Implementation
Instead of NoojApply, we used a fully automated process driven by AutoHotkey (http://www.autohotkey.com/)
AutoHotkey: a scripting language for desktop automation (suggested by Max)
  enables emulation of clicking in desktop applications
  provides scripting language capabilities
Pros and cons are discussed in the conclusion
System description
Open text
Apply Croatian linguistic analyses
Count
  No. of tokens
  No. of Serbian lexical units
  No. of syntactic constructions V da V
  No. of syntactic constructions V Vinf
Make a decision based on the percentages of occurrences obtained from the processing above
Write statistics and results
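The counting and decision steps can be sketched as a small function. The slides do not give the actual thresholds or voting rule, so everything specific here is an assumption made for illustration: the function name `classify`, the `lexical_threshold` default of 1%, and the rule that either criterion alone is enough to label a text as Serbian.

```python
def classify(n_tokens, n_serbian_units, n_v_da_v, n_v_vinf,
             lexical_threshold=0.01):
    """Label a text 'sr' or 'hr' from the four counts.

    Assumed rule (not from the slides): vote for Serbian if the share of
    Serbian lexical units exceeds lexical_threshold, or if V da V makes up
    more than half of the observed modal constructions.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")

    lexical_share = n_serbian_units / n_tokens
    n_constructions = n_v_da_v + n_v_vinf
    da_share = n_v_da_v / n_constructions if n_constructions else 0.0

    lexical_vote = lexical_share > lexical_threshold
    syntactic_vote = n_constructions > 0 and da_share > 0.5

    return {
        "label": "sr" if (lexical_vote or syntactic_vote) else "hr",
        "lexical_share": lexical_share,
        "da_share": da_share,
    }


print(classify(n_tokens=200, n_serbian_units=10, n_v_da_v=3, n_v_vinf=0))
print(classify(n_tokens=200, n_serbian_units=0, n_v_da_v=0, n_v_vinf=4))
```

A text with no Serbian lexical units and no modal constructions falls through to "hr" under this rule, which is one way texts with few matching features could end up misclassified.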
Output of processing
Demo
Results
Testing was performed on 2,500 articles from the SETimes corpus (http://www.setimes.com/)
  texts in Serbian and Croatian
  short news items translated from English
The system achieved a precision of 99.82%
  outperforming all known systems on this task
3 Serbian texts were misclassified as Croatian
  texts with low recall on the criteria considered
Conclusion & future work
NooJ and AutoHotkey in combination are sufficient even for very complex tasks
The system is fully automated
Disadvantage: AutoHotkey is highly dependent on screen resolution (automatic clicking)
Future work: there is room to improve the system
  take unknown words into account
  tune the system's voting
  create lists of "forbidden" words
Thank you for your attention!