
Corpora and Language Modeling

Ngrams, information, and monkeys on keyboards

Rob Speer Catherine Havasi

MAS.S60

Corpus Linguistics

•  A corpus is a body of existing text
•  In descriptive linguistics, it provides evidence

•  Ideally*, a corpus should contain documents selected for variety

•  In natural language processing, it provides training data

Plain text corpora

•  Project Gutenberg
•  British National Corpus
•  Presidential inaugural addresses
•  The Universal Declaration of Human Rights (translated into >300 languages)
•  CHILDES (conversations between parents and children)
•  Wikipedia
•  Google Books
•  The entire Web

Annotated corpora

•  Brown corpus (has part-of-speech tags)
•  Penn Treebank (complete parse trees of sentences, mostly from the WSJ)
•  SemCor (distinguishes word senses)
•  Indian POS-tagged corpus (in Bangla, Hindi, Marathi, and Telugu)

What can you do with corpora?

•  Examine trends and statistics of language use.
•  Train or test an NLP system.
•  Build a lexical resource (by hand or automatically).
•  Understand which pairs of words go together and which contain unusual amounts of information.

Distribution

•  Brown corpus
   – "the" is 7%
   – "to" and "of" are 3% each
   – "rabbit" is 0.0011%

Making the Dictionary

•  Remember the concordance tool from last class?

Concordance Toolkits

Examining trends

Application: text prediction

Application: machine translation

What information do you use?

•  My landlord called, asking if I had paid the ____
   – Grammar?
   – N-grams?
   – Semantic relatedness?

The Unreasonable Effectiveness of Data

•  Peter Norvig: it’s better to get more data than better representations

•  “An informal, incomplete grammar of the English language runs over 1,700 pages.”

•  “For many tasks, words and word combinations provide all the representational machinery we need to learn from text.”

Existing data isn’t everything

Loading corpora in NLTK

•  nltk.download()
•  nltk.corpus.gutenberg – gives you an NLTK CorpusReader object
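
A minimal sketch of loading one of these corpora, assuming NLTK is installed and the corpus data has been downloaded:

    import nltk

    # One-time download of the corpus data (or run nltk.download() for the interactive picker)
    nltk.download('gutenberg')

    from nltk.corpus import gutenberg

    print(gutenberg.fileids()[:3])     # e.g. ['austen-emma.txt', ...]
    words = gutenberg.words('austen-emma.txt')
    print(len(words), words[:10])      # token count and the first few tokens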

N-grams

• Given a corpus of text, the n-grams are the sequences of n consecutive words that are in the corpus.

•  N-grams can be built from words or from letters

N-grams: Unigram

“The cat that sat on the sofa also sat on the mat.”

The   3
sat   2
on    2
cat   1
that  1
sofa  1
also  1
mat   1
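
These counts can be reproduced with NLTK's FreqDist (a sketch; the lowercasing and crude tokenization are simplifications for this example):

    from nltk import FreqDist

    sentence = "The cat that sat on the sofa also sat on the mat."
    tokens = sentence.lower().rstrip('.').split()   # crude tokenization for the example
    fd = FreqDist(tokens)
    print(fd.most_common())   # [('the', 3), ('sat', 2), ('on', 2), ('cat', 1), ...]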

How much information is in a word?

•  An event that happens with a probability of 1 in 2^n carries n bits of information
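
Equivalently, an event with probability p carries -log2(p) bits; a quick check with illustrative numbers (not from the slides):

    import math

    # p = 1/8 = 1/2**3, so the event carries 3 bits
    print(-math.log2(1/8))      # 3.0
    # a word with probability 0.07 (like "the" in Brown) carries about 3.8 bits
    print(-math.log2(0.07))     # ~3.84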

Maximum Likelihood

•  An infinite number of probability distributions could have produced your observations

•  Our training data is a sample from an unknown distribution

•  Each sample has a probability of occurring given a distribution

•  The distribution proportional to the observed counts is the most likely one given the sample
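
In other words, the maximum likelihood estimate for a word is just its relative frequency in the sample; for the example sentence above:

    from nltk import FreqDist

    tokens = "the cat that sat on the sofa also sat on the mat".split()
    fd = FreqDist(tokens)
    # MLE: p(w) = count(w) / N
    print(fd['the'], fd.N())    # 3 12
    print(fd.freq('the'))       # 0.25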

Estimating information in a corpus

• Make a FreqDist for English out of the Brown corpus

• How much information is in the word “the”?

• What is the average number of bits per word?

•  Repeat for another language
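
One way to work through this exercise in NLTK (a sketch; the exact numbers depend on tokenization choices):

    import math
    from nltk import FreqDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    fd = FreqDist(w.lower() for w in brown.words())

    # information in "the": -log2 p("the")
    p_the = fd.freq('the')
    print(-math.log2(p_the))        # around 4 bits

    # average bits per word under the unigram model (its entropy)
    entropy = -sum(fd.freq(w) * math.log2(fd.freq(w)) for w in fd)
    print(entropy)                  # on the order of 10 bits per word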

But there are always more words!

•  A new word shouldn’t have zero probability.

•  The MLE is not a realistic language model, because it cannot handle new words.

• What probability should a new word have?

Zipf's Law

Witten-Bell estimation

•  Estimate the probability of an event we haven’t seen yet, based on the number of event types we’ve seen so far

•  Decrease other probabilities accordingly
•  A special case of Good-Turing smoothing
•  Implemented in NLTK's WittenBellProbDist

Witten-Bell estimation

•  Let i range over all unigram types
•  N = total tokens, T = total types
•  The chance of a new event is T / (N + T)

What’s the probability of a specific new event?

• We don’t know anything about the events we haven’t seen, so we assume they’re uniformly distributed

•  So assume there's some finite number of unseen event types, Z

•  Guess the total number of event types, (T + Z); each unseen event then gets probability T / (Z (N + T))
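
A sketch using NLTK's WittenBellProbDist (the extra 10,000 unseen bins below is an arbitrary illustrative guess for Z, not a recommended value):

    from nltk import FreqDist
    from nltk.probability import WittenBellProbDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    fd = FreqDist(w.lower() for w in brown.words())

    # bins = guessed total number of word types, seen plus unseen (T + Z)
    wb = WittenBellProbDist(fd, bins=fd.B() + 10000)

    print(wb.prob('the'))        # slightly less than the MLE estimate
    print(wb.prob('zyzzyva'))    # a small nonzero probability for an unseen word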

Information in sequences of words

• Words don’t actually convey information independently

•  There is less information in each word than the unigrams would suggest

• When hearing a sentence, you can often guess the next _____

N-grams: Bigrams

“The cat that sat on the sofa also sat on the mat.”

sat on    2
on the    2
the cat   1
cat that  1
that sat  1
the sofa  1
sofa also 1
also sat  1
the mat   1

And then there are trigrams, 4-grams, ...
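
The same FreqDist approach works for bigrams via nltk.bigrams (a sketch, with the same crude tokenization as before):

    from nltk import FreqDist, bigrams

    tokens = "the cat that sat on the sofa also sat on the mat".split()
    fd2 = FreqDist(bigrams(tokens))
    print(fd2.most_common(3))   # [(('sat', 'on'), 2), (('on', 'the'), 2), ...]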

Sliding Windows

“[The cat] that sat on the sofa also sat on the mat.”

“The [cat that] sat on the sofa also sat on the mat.”

“The cat [that sat] on the sofa also sat on the mat.”

Pointwise mutual information

•  Is “vice president” a significant phrase, or is it simply a coincidence when the words “vice” and “president” are near each other?
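
Pointwise mutual information compares the observed joint probability to what independence would predict: PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ). NLTK's collocation finders compute this directly; a sketch (the corpus and frequency cutoff are illustrative choices):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.corpus import brown    # requires nltk.download('brown')

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(w.lower() for w in brown.words())
    finder.apply_freq_filter(5)      # ignore very rare bigrams, which get inflated PMI scores

    print(finder.nbest(measures.pmi, 10))   # the ten bigrams with the highest PMI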

What if p(W1, W2) = 0?

• Once again we have an unrealistic model with probabilities that could be 0

•  If the phrase “unrealistic model” isn’t in the corpus, does it have infinite information?

• We need to smooth again

Laplace smoothing

•  The "add one" principle
•  Any event in your model that never happens, happens once instead

•  Simple and often good enough

Laplace smoothing on bigrams
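
A sketch of add-one smoothing over bigrams with NLTK's LaplaceProbDist (the bins value, one bin per possible word pair, is one reasonable choice rather than the only one):

    from nltk import FreqDist, bigrams
    from nltk.probability import LaplaceProbDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    words = [w.lower() for w in brown.words()]
    fd = FreqDist(bigrams(words))

    vocab = len(set(words))
    lp = LaplaceProbDist(fd, bins=vocab * vocab)   # one bin per possible word pair

    print(lp.prob(('of', 'the')))             # a frequent bigram
    print(lp.prob(('unrealistic', 'model')))  # unseen, but no longer zero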

Witten-Bell works here too

•  Let i range over all bigrams
•  N = total bigram tokens, T = total bigram types

You’ve got Spam!

The Turing Test

•  Alan Turing
•  "Can machines think?"
•  The "Imitation Game"

•  Turing is no longer asking whether a machine can think, but whether a machine can act like it is thinking.

The Loebner Prize

•  An annual "Turing Test" competition, held since 1991
•  The Silver (text) and Gold (text + visual) prizes have never been won

•  Bronze: “Most human-like”

Alice

How does Alice work?

•  Heuristic pattern-matching rules
•  Think ELIZA, but with more rules
•  Reacts to words in the input

What wins the Loebner Prize?

•  Spelling and grammar
•  Hiding mathematical knowledge
•  Reacting like a human (timing)
•  Pretending to pretend to be a robot (Elbot)

MegaHal

How does MegaHal Work?

•  Learns how you talk
•  Imitates natural language in general
•  ... with a healthy dose of you

•  Imitation

What is MegaHal trained on?

•  Conversational sentences designed to help it win the Loebner Prize

•  Facts about the world
•  References to The Hitchhiker's Guide to the Galaxy
•  They Might Be Giants (TMBG) lyrics

Markov chains

• Markov chains are structures where future states depend only on the current state

• Given the present state, the future and past states are independent

•  "Forget where you were. You are here now. Decide where to go next."

Generating Text with Markov

• Make a probability distribution of what comes next, given the last n-1 words

•  Iterate
•  A maximum likelihood estimate is fine here
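
A minimal sketch of Markov text generation from bigram counts (the Brown corpus, the start word, and the output length are illustrative choices):

    import random
    from collections import defaultdict

    from nltk import bigrams
    from nltk.corpus import brown    # requires nltk.download('brown')

    # Build the transition table: for each word, collect every word that follows it
    successors = defaultdict(list)
    for w1, w2 in bigrams(w.lower() for w in brown.words()):
        successors[w1].append(w2)

    # Generate by repeatedly sampling what comes next, given the current word
    word = 'the'
    output = [word]
    for _ in range(20):
        nxt = successors.get(word)
        if not nxt:                   # dead end (rare)
            break
        word = random.choice(nxt)     # uniform choice over tokens == sampling the MLE distribution
        output.append(word)

    print(' '.join(output))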

Markov Beyond Bigrams

•  The higher our n, the more sensible our text
   – Text plagiarism vs. generation?

• Our space becomes sparser

Making MegaHal

•  Filter stopwords
•  Pick a word they said (smartly)
•  Build forward and backward Markov chains from that word
   – Backward?

•  Some small fixes
   – my -> your
   – why -> because

•  Do this several times
•  Pick a response with a heuristic

Assignment

•  Load a sufficiently large new text corpus in NLTK

•  Discover words that occur significantly more frequently in your corpus than in the Brown corpus

•  Discover its two-word collocations with the most pointwise mutual information

• Use your code from this class to generate text from it
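
One possible starting point for the first step, assuming your texts are plain .txt files in a local directory (the path and filename pattern below are placeholders):

    from nltk.corpus import PlaintextCorpusReader

    # Point NLTK at a directory of plain-text files (hypothetical path)
    my_corpus = PlaintextCorpusReader('path/to/my_corpus', r'.*\.txt')

    print(my_corpus.fileids())       # the files that were found
    print(len(my_corpus.words()))    # total token count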
