
NLP in the WILD

-or-

Building a System for Text Language Identification

Vsevolod Dyomkin, 12/2016

A Bit about Me

* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer

https://vseloved.github.io

Roles

Langid Problem

* 150+ langs in Wikipedia
* >10 writing systems (scripts/alphabets) in active use
* script-to-lang mapping: 1:1, 1:2, 1:n, n:1 :)
* Latin covers >50 langs, Cyrillic >20
* Long texts are easy, short ones... hmm
* Internet texts (mixed langs)
* Small task => resource-constrained

Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-identification-performance

Prior Art

* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java: https://github.com/shuyo/language-detection/

http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/

YALI WILD

* All of them use weak models
* Wanted to use Wiktionary: 150+ languages, always evolving
* Wanted to do it in Lisp

Linguistics (domain knowledge)

* Polyglots?
* ISO 639
* Internet lang bias
  https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

* Rule-based ideas (see the sketch after this list):
  - 1:1/1:2 scripts
  - unique letters
* Per-script/per-lang segmentation
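To make the rule-based ideas above concrete, here is a minimal sketch in plain Common Lisp (hypothetical helper names and only a handful of Unicode ranges, not the actual WILD code): characters are mapped to a coarse script, and a text written entirely in a script used by a single language is resolved without any statistical model.

;; Sketch of rule-based script detection (hypothetical helpers, only a few ranges shown).
(defparameter *script-ranges*
  '((:greek    #x0370 #x03FF)
    (:cyrillic #x0400 #x04FF)
    (:armenian #x0530 #x058F)
    (:hebrew   #x0590 #x05FF)
    (:georgian #x10A0 #x10FF)
    (:hangul   #xAC00 #xD7AF)))

;; Scripts used by exactly one language can be resolved without a model.
(defparameter *one-lang-scripts*
  '((:greek . :el) (:armenian . :hy) (:georgian . :ka) (:hangul . :ko)))

(defun char-script (char)
  "Return the script keyword for CHAR, or :LATIN as a fallback."
  (let ((code (char-code char)))
    (or (first (find-if (lambda (range)
                          (<= (second range) code (third range)))
                        *script-ranges*))
        :latin)))

(defun trivial-lang (text)
  "If every alphabetic char of TEXT belongs to a 1:1 script, return its language, else NIL."
  (let ((scripts (remove-duplicates
                  (map 'list #'char-script
                       (remove-if-not #'alpha-char-p text)))))
    (when (= 1 (length scripts))
      (cdr (assoc (first scripts) *one-lang-scripts*)))))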

insight / data

Data

* evaluation data:
  - smoke test
  - in-/out-of-domain data
  - precision-/recall-oriented

* training data:
  - where to get? Wikidata
  - how to get? SAX parsing (see the sketch below)
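A possible reading of the "SAX parsing" item: stream the Wikipedia abstracts dump through CXML's SAX interface instead of loading the whole XML into memory. The sketch below assumes the cxml library and the <abstract> elements of the enwiki-latest-abstract.xml format; the class and callback names are made up for illustration.

;; Streaming extraction of <abstract> texts with CXML SAX (a sketch, not the actual pipeline).
(defclass abstract-handler (sax:default-handler)
  ((in-abstract :initform nil)
   (buffer :initform (make-string-output-stream))
   (callback :initarg :callback :reader handler-callback)))

(defmethod sax:start-element ((h abstract-handler) uri lname qname attrs)
  (declare (ignore uri qname attrs))
  (when (equal lname "abstract")
    (setf (slot-value h 'in-abstract) t
          (slot-value h 'buffer) (make-string-output-stream))))

(defmethod sax:characters ((h abstract-handler) data)
  (when (slot-value h 'in-abstract)
    (write-string data (slot-value h 'buffer))))

(defmethod sax:end-element ((h abstract-handler) uri lname qname)
  (declare (ignore uri qname))
  (when (equal lname "abstract")
    (setf (slot-value h 'in-abstract) nil)
    (funcall (handler-callback h)
             (get-output-stream-string (slot-value h 'buffer)))))

;; Usage: feed every abstract to a collection function, e.g.
;; (cxml:parse-file "enwiki-latest-abstract.xml"
;;                  (make-instance 'abstract-handler :callback #'print))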

Wiktionary

* good source for various dictionaries and word lists (word forms, definitions, synonyms, …)
* ~100 langs


Wikipedia

* >150 langs
* size? Wikipedia abstracts
* automation?
* filtering?

Alternatives

* API (pulling usage examples from the Wordnik API; the code uses the deck's Lisp shorthands: ^(...) for lambda with % as its argument, ? for generic access, fmt for format):

;; make Drakma return application/json responses as text rather than octets
(pushnew '("application" . "json") drakma:*text-content-types* :test 'equal)

(defun get-examples (word)
  ;; fetch usage examples for WORD from Wordnik (auth headers assumed to be
  ;; set up in *wordnik-auth-headers*) and keep the ones that start with an
  ;; uppercase letter, i.e. look like real sentences
  (remove-if-not ^(upper-case-p (char % 0))
                 (mapcar ^(substitute #\Space #\_ (? % "text"))
                         (? (yason:parse
                             (drakma:http-request
                              (fmt "http://api.wordnik.com/v4/word.json/~A/examples"
                                   (drakma:url-encode word :utf-8))
                              :additional-headers *wordnik-auth-headers*))
                            "examples"))))

* Web scraping (a match-html pattern over the page structure):

(defmethod scrape ((site (eql :linguaholic)) source)
  (match-html source
    '(>> article
         (aside (>> a ($ user))
                (>> li (strong "Native Tongue:") ($ lang)))
         (div |...|
              (>> (div :data-role "commentContent") ($ text) (span) |...|))
         !!!))))

Research (quality)

* Simple task => simple models (Naive Bayes)
* Challenges:
  - short texts
  - mixed langs
  - 90% of data is cryptic

ideas / experiments

Naive Bayes

* Features: 3-/4-char ngrams
* Improvement ideas:
  - add words (word unigrams)
  - factor in word lengths
  - use Internet lang bias

Formula:

(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* ^(? 3g-probs %)
                             (word-3gs word)))))
        langs)

http://www.paulgraham.com/spam.html
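A runnable restatement of the formula above in plain Common Lisp, without the rutils shorthands. The data layout (hash-tables PRIORS, WORD-PROBS, and 3G-PROBS keyed by language) and the 1d-6 floor for unseen trigrams are assumptions for the sketch, not the actual model format.

;; Naive Bayes scoring sketch: prior(lang) * P(word | lang).
(defun word-3gs (word)
  "Character 3-grams of WORD."
  (loop :for i :from 0 :to (- (length word) 3)
        :collect (subseq word i (+ i 3))))

(defun word-score (lang word word-probs 3g-probs)
  "P(word | lang): a direct word probability if known, else a product of 3-gram probs."
  (let ((wprobs (gethash lang word-probs))
        (tprobs (gethash lang 3g-probs)))
    (or (and wprobs (gethash word wprobs))
        (reduce #'* (word-3gs word)
                :key (lambda (tri)
                       (or (and tprobs (gethash tri tprobs)) 1d-6))
                :initial-value 1d0))))

(defun detect-lang (word langs priors word-probs 3g-probs)
  "Return the language in LANGS maximizing prior * P(word | lang)."
  (let (best best-score)
    (dolist (lang langs best)
      (let ((score (* (gethash lang priors 0d0)
                      (word-score lang word word-probs 3g-probs))))
        (when (or (null best-score) (> score best-score))
          (setf best lang
                best-score score))))))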

Experiments

* The usual ML setup (70:30 split) doesn't work here
* "If you torment the data too much..." (~c) Yaser Abu-Mostafa
* Comparison with existing systems helps

Confusion Matrix (per-language accuracy | top confusions):

AB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10

Total quality: 0.90

The Ladder of NLP

Rule-based
Linear ML
Decision Trees & co.
Sequence models

Artificial Neural networks

Better Models

What can be improved?

* Account for word order
* Discriminative models per script
* DeepLearning™ model

Marginal gain is not huge…

Engineer (efficiency)

* Just a small piece of the pipeline:
  - good-enough speed
  - minimize space usage
  - minimize external dependencies

* Proper floating-point calculations (see the sketch below)
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
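"Proper floating-point calculations" most likely refers to underflow: multiplying hundreds of small n-gram probabilities quickly rounds to zero. A common remedy, sketched below as an assumption rather than the actual implementation, is to sum log-probabilities and compare scores in log space.

;; Log-space scoring sketch: sum of logs instead of a product of probabilities.
(defun log-score (probs)
  "Sum of log-probabilities, with a tiny floor to avoid (log 0)."
  (reduce #'+ probs :key (lambda (p) (log (max p 1d-300)))))

;; (* 1d-200 1d-200) underflows to 0d0 in double floats,
;; while (log-score '(1d-200 1d-200)) => ~-921.03 still compares correctly.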

implementation / optimization

Model Optimization

Initial model size: ~1G
Target: ~10M :)

How to do it?
- Lossy compression: pruning (see the sketch below)
- Lossless compression: Huffman coding, efficient data structures
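A sketch of the lossy side, pruning: assuming a hypothetical per-language hash-table of n-gram probabilities, entries below a small threshold are simply dropped, trading a little accuracy for a much smaller model.

;; Pruning sketch: drop n-grams whose probability is below THRESHOLD, in place.
(defun prune-model (ngram-probs &key (threshold 1d-7))
  "Remove low-probability n-grams from the NGRAM-PROBS hash-table."
  (maphash (lambda (ngram prob)
             (when (< prob threshold)
               (remhash ngram ngram-probs)))
           ngram-probs)
  ngram-probs)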

API

* Levels of detail:
  - text-langs
  - word-langs
  - window? (see the sketch below)
* UI: library, REPL & Web APIs
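One way the "window?" level could sit between text-langs and word-langs (a sketch reusing the hypothetical DETECT-LANG from the scoring sketch above, not the actual library API): classify each word, then smooth the per-word decisions with a majority vote over a sliding window, so an isolated misclassification inside a run of one language is absorbed.

;; Sliding-window smoothing sketch over per-word language decisions.
(defun majority (items)
  "Most frequent element of ITEMS."
  (let ((counts (make-hash-table :test 'eql)))
    (dolist (item items)
      (incf (gethash item counts 0)))
    (let (best (best-n 0))
      (maphash (lambda (k n)
                 (when (> n best-n)
                   (setf best k best-n n)))
               counts)
      best)))

(defun window-langs (words langs priors word-probs 3g-probs &key (window 3))
  "Return (word . lang) pairs where each lang is the majority vote in a +/- WINDOW neighborhood."
  (let ((raw (mapcar (lambda (w)
                       (detect-lang w langs priors word-probs 3g-probs))
                     words)))
    (loop :for word :in words
          :for i :from 0
          :for lo := (max 0 (- i window))
          :for hi := (min (length words) (+ i window 1))
          :collect (cons word (majority (subseq raw lo hi))))))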

Recap

* Triple view of any knowledge-related problem
* Ladder of approaches to solving NLP problems
* Importance of a productive environment: general- & special-purpose

  - REPL
  - lang API
  - access to data
  - efficient testing

* Main stages of problem solving: data → experiment → implementation → optimization