Natural Language Processing - rose-hulman.edu · "Artificial Intelligence" · Stephen Mayhew


Natural Language Processing

Stephen Mayhew, Oct. 3, 2014

Stephen Mayhew (UIUC) 1 / 30

Outline

Introduction

Machine Learning

Tasks:
  Language Models
  POS Tagging
  Chunking/Parsing
  Named Entity Recognition
  Coreference Resolution
  Sentiment Analysis
  Topic Modeling
  Wikifier
  Machine Translation
  Trustworthiness

Cognitive Computation Group

Stephen Mayhew (UIUC) 2 / 30

What is NLP?

It is NOT:

• Neuro Linguistic Programming

• Speech processing, although they are similar.

It is:

• Subfield of AI

• Synthesis of: statistics, math, linguistics, computer science, probability theory, cognitive science.

Vague goal: Natural Language Understanding.

Stephen Mayhew (UIUC) 3 / 30


History of NLP

• 1940s and 1950s Birth of the computer

• 1957 - 1983 Two camps: grammatical, statistical

• 1983 - 2000 FSMs, “Empiricism Strikes Back”

• 2000 - present Rise of Machine Learning

Stephen Mayhew (UIUC) 4 / 30

Machine Learning

Simple definition: Machine learning is essentially finding a separating hyperplane among a set of points in some dimension.

Complicated definition: take a class.

Stephen Mayhew (UIUC) 5 / 30
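The "separating hyperplane" idea can be made concrete with a tiny perceptron. This is a minimal sketch, not anything from the slides: the 2-D points, labels, and learning rate are invented for illustration.

```python
# Toy illustration of the "separating hyperplane" view of machine
# learning: a perceptron that learns a line separating 2-D points.
# The data and learning rate are made up for this example.

def train_perceptron(points, labels, epochs=20, lr=0.1):
    """Learn weights w and bias b such that sign(w.x + b) = label."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != y:  # misclassified: nudge the hyperplane toward y
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

points = [(2, 3), (3, 3), (-1, -2), (-2, -1)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)
# After training, every point falls on the correct side of the line.
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in points]
print(preds)  # [1, 1, -1, -1]
```

The learned (w, b) is exactly a hyperplane (here, a line) separating the two labeled clusters.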

Training Examples - Label: MAN

Stephen Mayhew (UIUC) 6 / 30


Training Examples - Label: WOMAN

Stephen Mayhew (UIUC) 7 / 30


Testing Examples - Label: ?

Stephen Mayhew (UIUC) 8 / 30


Machine Learning

Stephen Mayhew (UIUC) 9 / 30

NLP Tasks

Stephen Mayhew (UIUC) 10 / 30

Language Models

n-grams: “The cat sat on the mat” →
(The, cat), (cat, sat), (sat, on), (on, the), (the, mat).

A language model is a probability distribution over word sequences, estimated from data. For a trigram model:

Pr(w_i | w_{i-2}, w_{i-1})

What about unseen words? Smoothing.

What size n-grams make sense?

Stephen Mayhew (UIUC) 11 / 30
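A minimal sketch of estimating n-gram probabilities from counts, using a bigram model for brevity. The toy corpus and the add-one (Laplace) smoothing choice are illustrative; real language models use larger corpora and more refined smoothing.

```python
# Estimate bigram probabilities Pr(w | w_prev) from a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def prob(w_prev, w, smooth=True):
    """Pr(w | w_prev). Add-one smoothing gives unseen bigrams a
    small nonzero probability instead of zero."""
    if smooth:
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(prob("the", "cat", smooth=False))  # 0.5: "the" occurs twice, once before "cat"
print(prob("mat", "cat"))                # unseen bigram, nonzero only with smoothing
```

The smoothing branch is what answers "what about unseen words?": without it, any sentence containing an unseen bigram would get probability zero.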


Generating Shakespeare

Generating sentences with random unigrams:
• Every enter now severally so, let
• Hill he late speaks; or! a more to leg less first you enter

With bigrams:
• What means, sir. I confess she? then all sorts, he is trim, captain.
• Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.

Trigrams:
• Sweet prince, Falstaff shall die.
• This shall forbid it should be branded, if renown made it empty.

Quadrigrams:
• What! I will go seek the traitor Gloucester.
• Will you not tell me who I am?

Stephen Mayhew (UIUC) 12 / 30
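Sentences like the ones above are produced by repeatedly sampling the next word from the model. A minimal sketch with a bigram model (the toy corpus is invented, not Shakespeare):

```python
# Generate text by sampling from a bigram model.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ran".split()
nexts = defaultdict(list)
for w_prev, w in zip(corpus, corpus[1:]):
    nexts[w_prev].append(w)  # duplicates preserve the observed frequencies

def generate(start, length=6, seed=0):
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        options = nexts.get(out[-1])
        if not options:  # dead end: word never seen with a successor
            break
        out.append(random.choice(options))
    return " ".join(out)

print(generate("the"))
```

With unigrams the choice ignores context entirely; with longer n-grams the generated text copies longer and longer chunks of the training data, which is why the quadrigram samples read like real Shakespeare lines.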

Google n-grams

https://books.google.com/ngrams/
Interesting examples: “Ford Model T”, “steam engine”, “artificial intelligence”

Stephen Mayhew (UIUC) 13 / 30

Part-of-Speech Tagging

Example:

Fruit flies like a banana, time flies like an arrow.

NNP/Fruit VBZ/flies IN/like DT/a NN/banana ,/, NN/time VBZ/flies IN/like DT/an NN/arrow ./.

• Sequence tagging task

• Choose from a fixed set of tags (in English, about 44)

• Solved using the Viterbi algorithm (HMM)

• In English, state-of-the-art is about 97%. (Solved).

Stephen Mayhew (UIUC) 14 / 30


Chunking

Example:

Chunking is not far from POS tagging.

[NP Chunking] [VP is] not [ADVP far] [PP from] [NP POS tagging].

• Also a sequence tagging task

• Smaller set of fixed tags (NP, VP, ADVP, etc.)

Stephen Mayhew (UIUC) 15 / 30


Parsing

• More complicated than chunking

• No longer a sequence tagging problem

• A difficult problem

• Used as input to other problems

Stephen Mayhew (UIUC) 16 / 30


Named Entity Recognition

Example:

I’ve got a feeling we’re not in [LOC Kansas] anymore.
I’m sorry, [PER Dave], I’m afraid I can’t do that.
I’m going to [LOC Lohmann Park] with [PER Abcde Redbottom] next week.

• Also a sequence labeling task

• Labels: BIO label for each word

• Note: if you have the training data, this can recognize any type of label.

Stephen Mayhew (UIUC) 17 / 30
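The BIO scheme mentioned above labels each token B-X (begins entity X), I-X (inside X), or O (outside any entity). A minimal sketch using the slide's third sentence; the helper function itself is illustrative, not from the slides.

```python
# Convert entity spans to per-token BIO labels.

def to_bio(tokens, spans):
    """spans: list of (start, end, type) token-index ranges, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining tokens of the entity
    return labels

tokens = ["I'm", "going", "to", "Lohmann", "Park", "with", "Abcde", "Redbottom"]
spans = [(3, 5, "LOC"), (6, 8, "PER")]
print(list(zip(tokens, to_bio(tokens, spans))))
# Lohmann -> B-LOC, Park -> I-LOC, Abcde -> B-PER, Redbottom -> I-PER
```

The B/I distinction is what lets a sequence tagger separate two adjacent entities of the same type.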


Coreference Resolution

Example:

The ball crashed through the table because [it] was made of styrofoam.

vs.

The ball crashed through the table because [it] was made of steel.

• Very difficult task, even for humans

Stephen Mayhew (UIUC) 18 / 30

Sentiment Analysis

Positive: “Having never been to a Brazilian steakhouse, this place sets the bar high. Food was awesome! Service was the best I’ve ever had. Always around and promptly responding if anything was needed, and checking on us, but not being annoying. Will definitely be back!”

Negative: “Overall this place could be good but is just a disappointment. They have a great selection of vegetables, meats, sauces, and other ingredients, but even when following their ‘recipes’ the food isn’t that great. It was extremely salty and just not very impressive. I think that the grill maybe got my food mixed up with someone else’s food maybe, it just wasn’t good. Overall it was edible but I would never go back for the price I paid for salty, mediocre stir fry.”

Like Mozart: Too easy for beginners, too hard for experts.

Stephen Mayhew (UIUC) 19 / 30
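The "too easy for beginners" half of that line can be seen in a deliberately naive lexicon-based scorer: counting cue words handles clear-cut reviews like the two above, while sarcasm, negation, and mixed reviews defeat it. The word lists below are invented for the sketch.

```python
# Naive lexicon-based sentiment: count positive cues minus negative cues.
import re

POSITIVE = {"awesome", "best", "great", "good", "definitely"}
NEGATIVE = {"disappointment", "salty", "mediocre", "never", "annoying"}

def score(text):
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenization
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("Food was awesome! Service was the best."))      # > 0
print(score("A disappointment: salty, mediocre stir fry."))  # < 0
```

Note the failure mode already visible in the negative review: "could be good" and "a great selection" contribute positive counts even though the review is negative, which is why the experts' version of the task is hard.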


Topic Modeling

Latent Dirichlet Allocation (LDA).

Stephen Mayhew (UIUC) 20 / 30

Topic Modeling

Topic 1:fire, los, angeles, homes, firefighters, miles, area, officials,people, park, san, ...

Topic 2:health, smoking, medical, children, doctors, cigarettes, percent,public, group, ...

Topic 3:farmers, farm, trade, agriculture, agricultural, yeutter, tons,grain, products, ...

Stephen Mayhew (UIUC) 21 / 30

Wikifier

Stephen Mayhew (UIUC) 22 / 30

Machine Translation

Huge task, very difficult.

E* = argmax_E Pr(E | F)
   = argmax_E Pr(F | E) Pr(E)

Note the need for a language model: the Pr(E) term.

• Parallel Corpora

• Alignment

• Phrase-based translation

• More data gets better results

Stephen Mayhew (UIUC) 23 / 30
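The noisy-channel equation above can be illustrated with a toy reranking step: among candidate translations E, pick the one maximizing Pr(F | E) · Pr(E). All candidates and probabilities here are invented for the example.

```python
# Noisy-channel selection: translation-model score Pr(F|E) times
# language-model score Pr(E). Numbers are made up for illustration.
candidates = {
    # E: (Pr(F|E), Pr(E))
    "the house is small": (0.20, 0.10),
    "house the small is": (0.25, 0.001),  # faithful to F, but bad English
    "the home is little": (0.10, 0.08),
}

best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)  # "the house is small": 0.02 beats 0.00025 and 0.008
```

This shows why the language model is needed: the word-salad candidate has the highest translation score, but its tiny Pr(E) rules it out.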


Vauquois Triangle

Stephen Mayhew (UIUC) 24 / 30

Google Translate Fail

“Tesco found that 40% of apples are wasted, as are just under half of bakery items.”

→ To Spanish →

“Tesco found that 40% of the blocks are wasted because they are slightly less than half of the bakery products.”

Stephen Mayhew (UIUC) 25 / 30

Trustworthiness

[Figure: bipartite graph with sources 1-4 on one side (S) and claims 1-6 on the other (C)]

Given a bipartite source-claim graph, what sort of guarantees or interesting conclusions can we get?

Stephen Mayhew (UIUC) 26 / 30

Things I didn’t talk about

• Grammar induction

• Bayesian methods

• Text generation

• Event extraction

• Information retrieval

• Query expansion

• Word sense disambiguation

• Textual entailment

• Similarity measures

• Context-sensitive spelling correction

• ESL correction

• Relation extraction

• Transliteration

• Concept extraction

• Question answering

• ...

Stephen Mayhew (UIUC) 27 / 30

CCG Tools

• Learning Based Java (LBJ)

• JLIS (structured learning)

• Named Entity Recognition

• Wikifier

• Coreference resolution

• Demos

• Much more...

Stephen Mayhew (UIUC) 29 / 30

References

Concept graph: Chen-Tse Tsai, UIUC

Machine learning graph: http://scikit-learn.org/

n-grams slide: http://www.cs.columbia.edu/~kathy/NLP/ClassSlides/Class3-ngrams09/ngrams.pdf

Wikipedia diagram: Xiao Cheng, UIUC

Parse tree: http://geniferology.blogspot.com/

Topic Models: http://www.cs.princeton.edu/~blei/lda-c/index.html

LDA Graphic:

http://www.cs.cornell.edu/courses/cs6784/2010sp/lecture/30-BleiEtAl03.pdf

Various Examples: http://cogcomp.cs.illinois.edu

Vauquois triangle: Julia Hockenmaier’s slides,

http://courses.engr.illinois.edu/cs498jh/fa2012/Slides/Lecture21HO.pdf

Stephen Mayhew (UIUC) 30 / 30