
NLP Project Full Cycle

Vsevolod Dyomkin, 10/2016

A Bit about Me

* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer

https://vseloved.github.io

Plan

* Overview of NLP
* NLP data
* Common NLP problems and approaches
* Example NLP application: text language identification

What Is NLP?

Transforming free-form text into structured data and back.

Intersection of:
* Computational Linguistics
* CompSci & AI
* ML, Stats, Information Theory

Natural Language

* ambiguous
* noisy
* evolving

Roles

linguist [noun]
1. A specialist in linguistics.

linguistics [noun]
1. The scientific study of language.

NLP Data

Types of text data:
* structured
* semi-structured
* unstructured

“Data is ten times more powerful than algorithms.”
-- Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs

Kinds of Data

* Dictionaries
* Databases/Ontologies
* Corpora
* Internet/user data

Where to Get Data?

* Linguistic Data Consortium: http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites & the academic community: Stanford, Oxford, CMU, ...

Create Your Own!

* Linguists
* Crowdsourcing
* By-product

-- Jonathan Zittrain http://goo.gl/hs4qB

Classic NLP Problems

* Linguistically-motivated: segmentation, tagging, parsing
* Analytical: classification, sentiment analysis
* Transformation: translation, correction, generation
* Conversation: question answering, dialog

engineer [noun]
5. A person skilled in the design and programming of computer systems.

Tokenization

Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "."

Issues:
* Finland’s capital - Finland? Finlands? Finland’s?
* what’re, I’m, isn’t - what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard?
* San Francisco - one token or two?
* m.p.h., PhD.

Regular Expressions

Simplest regex: [^\s]+

More advanced regex:
\w+|[!"#$%&'*+,\./:;<=>?@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’-]

Even more advanced regex:
[+-]?[0-9](?:[0-9,\.]*[0-9])?|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?|["#$%&*+,/:;<=>@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’']|[\.!?]+|-+

In fact, it works:
https://github.com/lang-uk/ner-uk/blob/master/doc/tokenization.md
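To make this concrete, here is a minimal Python sketch of a tokenizer driven by the simplest regex above (the tokenize helper is my naming, not from the talk):

    import re

    # Simplest approach from the slide: any run of non-whitespace is a token.
    TOKEN_RE = re.compile(r"[^\s]+")

    def tokenize(text):
        """Split text into tokens using the whitespace-based regex."""
        return TOKEN_RE.findall(text)

    print(tokenize("This is a test that isn't so simple: 1.23."))
    # => ['This', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple:', '1.23.']

Note how "simple:" and "1.23." stay glued together - exactly the issues the more advanced regexes above are meant to handle.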

Rule-based Approach

* easy to understand and reason about
* can be arbitrarily precise
* iterative, can be used to gather more data

Limitations:
* recall problems
* poor adaptability

Rule-based NLP tools

* SpamAssassin
* LanguageTool
* ELIZA
* GATE

researcher [noun]
1. One who researches.

research [noun]
1. Diligent inquiry or examination to seek or revise facts, principles, theories, applications, etc.; laborious or continued search after truth.

Models

Statistical Approach

“Probability theory is nothing but common sense reduced to calculation.”
-- Pierre-Simon Laplace

Language Models

Question: what is the probability of a sequence of words/sentence?


Answer: Apply the chain rule

P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * …

where S = w0 w1 w2 …

Ngrams

Apply the Markov assumption: each word depends only on the N previous words (in practice, N=1..4, which results in bigram-fivegram models, because we include the current word also).

If n=2: P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * …

By the definition of conditional probability:

P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
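A minimal sketch of estimating these probabilities from counts over a toy corpus (plain maximum-likelihood estimation; all names here are mine):

    from collections import Counter

    corpus = "the cat sat on the mat . the dog sat on the log .".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_cond(w1, w2):
        """MLE estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
        return bigrams[(w1, w2)] / unigrams[w1]

    print(p_cond("the", "cat"))  # count("the cat")=1, count("the")=4 => 0.25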

Spam Filtering

A 2-class classification problem with a bias towards minimizing false positives.

Default approach: rule-based (SpamAssassin)

Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of complex features

Bag-of-words Model

* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant

Pros:
* simple
* fast
* scalable

Limitations:
* independence assumption doesn't hold

A Plan for Spam - http://www.paulgraham.com/spam.html
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%

Naive Bayes Classifier

P(Y|X) = P(Y) * P(X|Y) / P(X)

select Y = argmax P(Y|X)

Naive step:
P(Y|X) = P(Y) * prod(P(x|Y)) for all x in X

(P(X) is marginalized out because it's the same for all Y)
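A minimal sketch of a bag-of-words Naive Bayes spam classifier built on these formulas (the toy data and all names are mine; it uses log-probabilities and add-one smoothing, which any real implementation needs):

    import math
    from collections import Counter, defaultdict

    # Toy training data: (tokens, label)
    examples = [
        ("buy cheap pills now".split(), "spam"),
        ("cheap viagra offer".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham"),
        ("lunch tomorrow ?".split(), "ham"),
    ]

    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)

    vocab = {w for tokens, _ in examples for w in tokens}

    def classify(tokens):
        """argmax over Y of log P(Y) + sum log P(x|Y), with add-one smoothing."""
        best, best_score = None, float("-inf")
        for label in class_counts:
            score = math.log(class_counts[label] / len(examples))
            total = sum(word_counts[label].values())
            for w in tokens:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

    print(classify("cheap pills tomorrow".split()))  # => "spam"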

Machine Learning Approach

Dependency Parsing

nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)

https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

Shift-reduce Parsing

Averaged Perceptron

import random

def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # Reward the correct tag's weights, penalize the wrong guess.
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)
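The model object is left implicit here; a minimal sketch of what train assumes (my naming), where predict scores each tag by summing its feature weights:

    from collections import defaultdict

    class Perceptron:
        def __init__(self, tags):
            self.tags = tags
            # weights[feature][tag] -> weight
            self.weights = defaultdict(lambda: defaultdict(int))

        def predict(self, features):
            """Return the tag with the highest summed feature weight."""
            scores = defaultdict(int)
            for f in features:
                for tag, w in self.weights[f].items():
                    scores[tag] += w
            return max(self.tags, key=lambda t: scores[t])

Strictly speaking, the update shown above is a plain perceptron; the averaged variant also keeps a running average of each weight over all updates and uses those averages at prediction time.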

ML-based Parsing

The parser starts with an empty stack and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input.

SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        # Greedily pick the highest-scoring valid transition.
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
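The helper functions are not shown in the talk; a sketch of get_valid_moves along the lines of the linked post, which restricts the choice to transitions that are legal in the current state:

    def get_valid_moves(i, n, stack_depth):
        """SHIFT needs input left in the buffer; LEFT/RIGHT need items on the stack."""
        moves = []
        if i < n:
            moves.append(SHIFT)
        if stack_depth >= 2:
            moves.append(RIGHT)
        if stack_depth >= 1:
            moves.append(LEFT)
        return moves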

The Hierarchy of ML Models

Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic Regression; Conditional Random Field
* SVM

Non-linear:
* Decision Trees, Random Forests, Boosted Trees
* Artificial Neural Networks

Semantics

Question: how to model relationships between words?

Answer: build a graph

* Wordnet
* Freebase
* DBPedia

Word Similarity

Next question: now, how do we measure those relations?

* different Wordnet similarity measures

* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
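A minimal sketch of computing PMI from co-occurrence counts (the toy counts and names are mine):

    import math

    # Toy counts over a corpus of N observations.
    N = 10000
    count_x = 500      # occurrences of x
    count_y = 400      # occurrences of y
    count_xy = 100     # co-occurrences of x and y

    def pmi(n_xy, n_x, n_y, n):
        """PMI(x,y) = log(p(x,y) / (p(x) * p(y)))."""
        return math.log((n_xy / n) / ((n_x / n) * (n_y / n)))

    print(pmi(count_xy, count_x, count_y, N))  # > 0: x and y co-occur more than chance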

Distributional Semantics

Distributional hypothesis:
“You shall know a word by the company it keeps”
-- John Rupert Firth

Word representations:
* Explicit representation. Number of nonzero dimensions: max: 474234, min: 3, mean: 1595, median: 415
* Dense representation (word2vec, GloVe, …)
* Hierarchical representation (Brown clusters)
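A minimal sketch of comparing dense word representations with cosine similarity (the toy 4-dimensional vectors are mine; real word2vec/GloVe vectors typically have 100-300 dimensions):

    import math

    # Toy dense vectors; real embeddings come from word2vec, GloVe, etc.
    vectors = {
        "cat": [0.9, 0.1, 0.3, 0.0],
        "dog": [0.8, 0.2, 0.4, 0.1],
        "car": [0.1, 0.9, 0.0, 0.7],
    }

    def cosine(u, v):
        """cos(u,v) = (u . v) / (|u| * |v|)"""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    print(cosine(vectors["cat"], vectors["dog"]))  # high: similar words
    print(cosine(vectors["cat"], vectors["car"]))  # low: dissimilar words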

Steps to Develop an NLP System

* Translate real-world requirements into a measurable goal
* Find a suitable level and representation
* Find initial data for experiments
* Find and utilize existing tools and frameworks where possible
* Set up and perform a proper experiment (or a series of experiments)
* Optimize the system for production

Going into Prod

* NLP tasks are usually CPU-intensive but stateless
* General-purpose NLP frameworks are (mostly) not production-ready
* Don't trust research results
* Value pre- and post-processing
* Gather user feedback

Text Language Identification

Not an unsolved problem:
* https://github.com/CLD2Owners/cld2 - C++
* https://github.com/saffsd/langid.py - Python
* https://github.com/shuyo/language-detection/ - Java
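For instance, langid.py can be tried in a couple of lines (classify returns a language code and a score; the example texts are mine):

    import langid  # pip install langid

    # classify() returns (language code, score) for the most likely language.
    print(langid.classify("This is a test"))        # e.g. ('en', ...)
    print(langid.classify("Це просто перевірка"))   # e.g. ('uk', ...)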

To read:
https://blog.twitter.com/2015/evaluating-language-identification-performance
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/

WILD Challenges

[Figure: YALI vs. WILD comparison]

* All of them use weak models
* Wanted to use Wiktionary - 150+ languages, always evolving
* Wanted to do it in Lisp

WILD Linguistics

* Scripts vs languages
  http://www.omniglot.com/writing/langalph.htm
* Languages distribution
  https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
* Frequency word lists
  https://invokeit.wordpress.com/frequency-word-lists/
* Word segmentation?

WILD Data

Wiktionary → Wikipedia data: used abstracts, ~175 languages
- download & store
- process (SAX parsing)
- set up learning & test data sets

10,778,404 unique words
481,581 unique character trigrams
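A minimal sketch of extracting character trigrams, the kind of feature counted above (the function name is mine):

    from collections import Counter

    def char_trigrams(text):
        """Return counts of all 3-character substrings of text."""
        return Counter(text[i:i+3] for i in range(len(text) - 2))

    print(char_trigrams("language"))
    # Counter({'lan': 1, 'ang': 1, 'ngu': 1, 'gua': 1, 'uag': 1, 'age': 1})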

WILD Engineering

* Initial model size ~1G - script hacks & Huffman coding to the rescue
* Model pruning (see the sketch below)
* Proper probability calculations
* Efficient testing
* Properly saving the model
* Library & public API
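A minimal sketch of one way to prune such a model by dropping rare trigrams (the threshold and names are mine; the talk doesn't specify the exact pruning scheme):

    from collections import Counter

    def prune(counts, min_count=5):
        """Drop features seen fewer than min_count times to shrink the model."""
        return Counter({feat: n for feat, n in counts.items() if n >= min_count})

    model = Counter({"the": 120, "qzx": 1, "ing": 95, "zzq": 2})
    print(prune(model))  # Counter({'the': 120, 'ing': 95})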