
NLP in the WILD

-or-

Building a System for Text Language Identification

Vsevolod Dyomkin, 12/2016

A Bit about Me

* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer

https://vseloved.github.io

Roles

Langid Problem

* 150+ langs in Wikipedia
* >10 writing systems (scripts/alphabets) in active use
* script-to-lang mapping: 1:1, 1:2, 1:n, n:1 :)
* Latin covers >50 langs, Cyrillic >20
* Long texts are easy, short ones... hmm
* Internet texts (mixed langs)
* Small task => resource-constrained

Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-identification-performance

Prior Art

* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java: https://github.com/shuyo/language-detection/

http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/

YALI WILD

* All of them use weak models
* Wanted to use Wiktionary: 150+ languages, always evolving
* Wanted to do it in Lisp

Linguistics (domain knowledge)

* Polyglots?
* ISO 639
* Internet lang bias
  https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

* Rule-based ideas (see the sketch after this list):
  - 1:1/1:2 scripts
  - unique letters
* Per-script/per-lang segmentation
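To make the rule-based ideas above concrete, here is a minimal sketch in plain Common Lisp (hypothetical helper names and only a handful of Unicode ranges, not the actual WILD code): characters are mapped to a coarse script, and a text written entirely in a script used by a single language is resolved without any statistical model.

;; Sketch of rule-based script detection (hypothetical helpers, only a few ranges shown).
(defparameter *script-ranges*
  '((:greek    #x0370 #x03FF)
    (:cyrillic #x0400 #x04FF)
    (:armenian #x0530 #x058F)
    (:hebrew   #x0590 #x05FF)
    (:georgian #x10A0 #x10FF)
    (:hangul   #xAC00 #xD7AF)))

;; Scripts used by exactly one language can be resolved without a model.
(defparameter *one-lang-scripts*
  '((:greek . :el) (:armenian . :hy) (:georgian . :ka) (:hangul . :ko)))

(defun char-script (char)
  "Return the script keyword for CHAR, or :LATIN as a fallback."
  (let ((code (char-code char)))
    (or (first (find-if (lambda (range)
                          (<= (second range) code (third range)))
                        *script-ranges*))
        :latin)))

(defun trivial-lang (text)
  "If every alphabetic char of TEXT belongs to a 1:1 script, return its language, else NIL."
  (let ((scripts (remove-duplicates
                  (map 'list #'char-script
                       (remove-if-not #'alpha-char-p text)))))
    (when (= 1 (length scripts))
      (cdr (assoc (first scripts) *one-lang-scripts*)))))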

insight / data

Data

* evaluation data:
  - smoke test
  - in-/out-of-domain data
  - precision-/recall-oriented

* training data:
  - where to get? Wikidata
  - how to get? SAX parsing (see the sketch below)
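A possible reading of the "SAX parsing" item: stream the Wikipedia abstracts dump through CXML's SAX interface instead of loading the whole XML into memory. The sketch below assumes the cxml library and the <abstract> elements of the enwiki-latest-abstract.xml format; the class and callback names are made up for illustration.

;; Streaming extraction of <abstract> texts with CXML SAX (a sketch, not the actual pipeline).
(defclass abstract-handler (sax:default-handler)
  ((in-abstract :initform nil)
   (buffer :initform (make-string-output-stream))
   (callback :initarg :callback :reader handler-callback)))

(defmethod sax:start-element ((h abstract-handler) uri lname qname attrs)
  (declare (ignore uri qname attrs))
  (when (equal lname "abstract")
    (setf (slot-value h 'in-abstract) t
          (slot-value h 'buffer) (make-string-output-stream))))

(defmethod sax:characters ((h abstract-handler) data)
  (when (slot-value h 'in-abstract)
    (write-string data (slot-value h 'buffer))))

(defmethod sax:end-element ((h abstract-handler) uri lname qname)
  (declare (ignore uri qname))
  (when (equal lname "abstract")
    (setf (slot-value h 'in-abstract) nil)
    (funcall (handler-callback h)
             (get-output-stream-string (slot-value h 'buffer)))))

;; Usage: feed every abstract to a collection function, e.g.
;; (cxml:parse-file "enwiki-latest-abstract.xml"
;;                  (make-instance 'abstract-handler :callback #'print))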

Wiktionary

* good source for various dictionaries and word lists (word forms, definitions, synonyms, …)
* ~100 langs


Wikipedia

* >150 langs
* size? Wikipedia abstracts
* automation?
* filtering?

Alternatives

* API (pulling usage examples from the Wordnik API; the code uses the deck's Lisp shorthands: ^(...) for lambda with % as its argument, ? for generic access, fmt for format):

;; make Drakma return application/json responses as text rather than octets
(pushnew '("application" . "json") drakma:*text-content-types* :test 'equal)

(defun get-examples (word)
  ;; fetch usage examples for WORD from Wordnik (auth headers assumed to be
  ;; set up in *wordnik-auth-headers*) and keep the ones that start with an
  ;; uppercase letter, i.e. look like real sentences
  (remove-if-not ^(upper-case-p (char % 0))
                 (mapcar ^(substitute #\Space #\_ (? % "text"))
                         (? (yason:parse
                             (drakma:http-request
                              (fmt "http://api.wordnik.com/v4/word.json/~A/examples"
                                   (drakma:url-encode word :utf-8))
                              :additional-headers *wordnik-auth-headers*))
                            "examples"))))

* Web scraping (a match-html pattern over the page structure):

(defmethod scrape ((site (eql :linguaholic)) source)
  (match-html source
    '(>> article
         (aside (>> a ($ user))
                (>> li (strong "Native Tongue:") ($ lang)))
         (div |...|
              (>> (div :data-role "commentContent") ($ text) (span) |...|))
         !!!))))

Research (quality)

* Simple task => simple models (Naive Bayes)
* Challenges:
  - short texts
  - mixed langs
  - 90% of data is cryptic

ideas / experiments

Naive Bayes

* Features: 3-/4-char ngrams
* Improvement ideas:
  - add words (word unigrams)
  - factor in word lengths
  - use Internet lang bias

Formula:

(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* ^(? 3g-probs %)
                             (word-3gs word)))))
        langs)

http://www.paulgraham.com/spam.html
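A runnable restatement of the formula above in plain Common Lisp, without the rutils shorthands. The data layout (hash-tables PRIORS, WORD-PROBS, and 3G-PROBS keyed by language) and the 1d-6 floor for unseen trigrams are assumptions for the sketch, not the actual model format.

;; Naive Bayes scoring sketch: prior(lang) * P(word | lang).
(defun word-3gs (word)
  "Character 3-grams of WORD."
  (loop :for i :from 0 :to (- (length word) 3)
        :collect (subseq word i (+ i 3))))

(defun word-score (lang word word-probs 3g-probs)
  "P(word | lang): a direct word probability if known, else a product of 3-gram probs."
  (let ((wprobs (gethash lang word-probs))
        (tprobs (gethash lang 3g-probs)))
    (or (and wprobs (gethash word wprobs))
        (reduce #'* (word-3gs word)
                :key (lambda (tri)
                       (or (and tprobs (gethash tri tprobs)) 1d-6))
                :initial-value 1d0))))

(defun detect-lang (word langs priors word-probs 3g-probs)
  "Return the language in LANGS maximizing prior * P(word | lang)."
  (let (best best-score)
    (dolist (lang langs best)
      (let ((score (* (gethash lang priors 0d0)
                      (word-score lang word word-probs 3g-probs))))
        (when (or (null best-score) (> score best-score))
          (setf best lang
                best-score score))))))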

Experiments

* The usual ML setup (70:30 split) doesn't work here
* "If you torment the data too much..." (~c) Yaser Abu-Mostafa
* Comparison with existing systems helps

Confusion Matrix (per-language accuracy | top confusions):

AB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10

Total quality: 0.90

The Ladder of NLP

Rule-based
Linear ML
Decision Trees & co.
Sequence models

Artificial Neural networks

Better Models

What can be improved?

* Account for word order
* Discriminative models per script
* DeepLearning™ model

Marginal gain is not huge…

Engineer (efficiency)

* Just a small piece of the pipeline:
  - good-enough speed
  - minimize space usage
  - minimize external dependencies

* Proper floating-point calculations (see the sketch below)
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
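"Proper floating-point calculations" most likely refers to underflow: multiplying hundreds of small n-gram probabilities quickly rounds to zero. A common remedy, sketched below as an assumption rather than the actual implementation, is to sum log-probabilities and compare scores in log space.

;; Log-space scoring sketch: sum of logs instead of a product of probabilities.
(defun log-score (probs)
  "Sum of log-probabilities, with a tiny floor to avoid (log 0)."
  (reduce #'+ probs :key (lambda (p) (log (max p 1d-300)))))

;; (* 1d-200 1d-200) underflows to 0d0 in double floats,
;; while (log-score '(1d-200 1d-200)) => ~-921.03 still compares correctly.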

implementation / optimization

Model Optimization

Initial model size: ~1G
Target: ~10M :)

How to do it?
- Lossy compression: pruning (see the sketch below)
- Lossless compression: Huffman coding, efficient data structures
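A sketch of the lossy side, pruning: assuming a hypothetical per-language hash-table of n-gram probabilities, entries below a small threshold are simply dropped, trading a little accuracy for a much smaller model.

;; Pruning sketch: drop n-grams whose probability is below THRESHOLD, in place.
(defun prune-model (ngram-probs &key (threshold 1d-7))
  "Remove low-probability n-grams from the NGRAM-PROBS hash-table."
  (maphash (lambda (ngram prob)
             (when (< prob threshold)
               (remhash ngram ngram-probs)))
           ngram-probs)
  ngram-probs)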

API

* Levels of detail:
  - text-langs
  - word-langs
  - window? (see the sketch below)
* UI: library, REPL & Web APIs
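One way the "window?" level could sit between text-langs and word-langs (a sketch reusing the hypothetical DETECT-LANG from the scoring sketch above, not the actual library API): classify each word, then smooth the per-word decisions with a majority vote over a sliding window, so an isolated misclassification inside a run of one language is absorbed.

;; Sliding-window smoothing sketch over per-word language decisions.
(defun majority (items)
  "Most frequent element of ITEMS."
  (let ((counts (make-hash-table :test 'eql)))
    (dolist (item items)
      (incf (gethash item counts 0)))
    (let (best (best-n 0))
      (maphash (lambda (k n)
                 (when (> n best-n)
                   (setf best k best-n n)))
               counts)
      best)))

(defun window-langs (words langs priors word-probs 3g-probs &key (window 3))
  "Return (word . lang) pairs where each lang is the majority vote in a +/- WINDOW neighborhood."
  (let ((raw (mapcar (lambda (w)
                       (detect-lang w langs priors word-probs 3g-probs))
                     words)))
    (loop :for word :in words
          :for i :from 0
          :for lo := (max 0 (- i window))
          :for hi := (min (length words) (+ i window 1))
          :collect (cons word (majority (subseq raw lo hi))))))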

Recap

* Triple view of any knowledge-related problem
* Ladder of approaches to solving NLP problems
* Importance of a productive environment: general- & special-purpose

  - REPL
  - lang API
  - access to data
  - efficient testing

* Main stages of problem solving: data → experiment → implementation → optimization