
Corpora and Language Modeling

Ngrams, information, and monkeys on keyboards

Rob Speer Catherine Havasi

MAS.S60

Corpus Linguistics

•  A corpus is a body of existing text
•  In descriptive linguistics, it provides evidence

•  Ideally*, a corpus should contain documents selected for variety

•  In natural language processing, it provides training data

Plain text corpora

•  Project Gutenberg
•  British National Corpus
•  Presidential inaugural addresses
•  The Universal Declaration of Human Rights (translated into >300 languages)
•  CHILDES (conversations between parents and children)
•  Wikipedia
•  Google Books
•  The entire Web

Annotated corpora

•  Brown corpus (has part-of-speech tags)
•  Penn Treebank (complete parse trees of sentences, mostly from the WSJ)
•  SemCor (distinguishes word senses)
•  Indian POS-tagged corpus (in Bangla, Hindi, Marathi, and Telugu)

What can you do with corpora?

•  Examine trends and statistics of language use.
•  Train or test an NLP system.
•  Build a lexical resource (by hand or automatically).
•  Understand which pairs of words go together and which contain unusual amounts of information.

Distribution

•  Brown corpus
   – "the" is 7%
   – "to" and "of" are 3% each
   – "rabbit" is 0.0011%

Making the Dictionary

•  Remember the concordance tool from last class?

Concordance Toolkits

Examining trends

Application: text prediction

Application: machine translation

What information do you use?

•  My landlord called, asking if I had paid the ____
   – Grammar?
   – N-grams?
   – Semantic relatedness?

The Unreasonable Effectiveness of Data

•  Peter Norvig: it’s better to get more data than better representations

•  “An informal, incomplete grammar of the English language runs over 1,700 pages.”

•  “For many tasks, words and word combinations provide all the representational machinery we need to learn from text.”

Existing data isn’t everything

Loading corpora in NLTK

•  nltk.download()
•  nltk.corpus.gutenberg – gives you an NLTK CorpusReader object
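
A minimal sketch of loading one of these corpora, assuming NLTK is installed and the corpus data has been downloaded:

    import nltk

    # One-time download of the corpus data (or run nltk.download() for the interactive picker)
    nltk.download('gutenberg')

    from nltk.corpus import gutenberg

    print(gutenberg.fileids()[:3])     # e.g. ['austen-emma.txt', ...]
    words = gutenberg.words('austen-emma.txt')
    print(len(words), words[:10])      # token count and the first few tokens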

N-grams

• Given a corpus of text, the n-grams are the sequences of n consecutive words that are in the corpus.

•  N-grams can be built from words or from letters

N-grams: Unigram

“The cat that sat on the sofa also sat on the mat.”

The   3
sat   2
on    2
cat   1
that  1
sofa  1
also  1
mat   1
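
These counts can be reproduced with NLTK's FreqDist (a sketch; the lowercasing and crude tokenization are simplifications for this example):

    from nltk import FreqDist

    sentence = "The cat that sat on the sofa also sat on the mat."
    tokens = sentence.lower().rstrip('.').split()   # crude tokenization for the example
    fd = FreqDist(tokens)
    print(fd.most_common())   # [('the', 3), ('sat', 2), ('on', 2), ('cat', 1), ...]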

How much information is in a word?

•  An event that happens with a probability of 1 in 2^n carries n bits of information
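
Equivalently, an event with probability p carries -log2(p) bits; a quick check with illustrative numbers (not from the slides):

    import math

    # p = 1/8 = 1/2**3, so the event carries 3 bits
    print(-math.log2(1/8))      # 3.0
    # a word with probability 0.07 (like "the" in Brown) carries about 3.8 bits
    print(-math.log2(0.07))     # ~3.84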

Maximum Likelihood

•  An infinite number of probability distributions could have produced your observations

•  Our training data is a sample from an unknown distribution

•  Each sample has a probability of occurring given a distribution

•  The distribution proportional to the observed counts is the most likely one given the sample
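
In other words, the maximum likelihood estimate for a word is just its relative frequency in the sample; for the example sentence above:

    from nltk import FreqDist

    tokens = "the cat that sat on the sofa also sat on the mat".split()
    fd = FreqDist(tokens)
    # MLE: p(w) = count(w) / N
    print(fd['the'], fd.N())    # 3 12
    print(fd.freq('the'))       # 0.25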

Estimating information in a corpus

• Make a FreqDist for English out of the Brown corpus

• How much information is in the word “the”?

• What is the average number of bits per word?

•  Repeat for another language
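
One way to work through this exercise in NLTK (a sketch; the exact numbers depend on tokenization choices):

    import math
    from nltk import FreqDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    fd = FreqDist(w.lower() for w in brown.words())

    # information in "the": -log2 p("the")
    p_the = fd.freq('the')
    print(-math.log2(p_the))        # around 4 bits

    # average bits per word under the unigram model (its entropy)
    entropy = -sum(fd.freq(w) * math.log2(fd.freq(w)) for w in fd)
    print(entropy)                  # on the order of 10 bits per word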

But there are always more words!

•  A new word shouldn’t have zero probability.

•  The MLE is not a realistic language model, because it cannot handle new words.

• What probability should a new word have?

Zipf's Law

Witten-Bell estimation

•  Estimate the probability of an event we haven’t seen yet, based on the number of event types we’ve seen so far

•  Decrease other probabilities accordingly
•  A special case of Good-Turing smoothing
•  Implemented in NLTK's WittenBellProbDist

Witten-Bell estimation

•  Let i range over all unigram types
•  N = total tokens, T = total types
•  The chance of a new event is T / (N + T)

What’s the probability of a specific new event?

• We don’t know anything about the events we haven’t seen, so we assume they’re uniformly distributed

•  So assume there's some finite number of unseen event types, Z

•  Guess the total number of event types, (T + Z); each unseen event then gets probability T / (Z (N + T))
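
A sketch using NLTK's WittenBellProbDist (the extra 10,000 unseen bins below is an arbitrary illustrative guess for Z, not a recommended value):

    from nltk import FreqDist
    from nltk.probability import WittenBellProbDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    fd = FreqDist(w.lower() for w in brown.words())

    # bins = guessed total number of word types, seen plus unseen (T + Z)
    wb = WittenBellProbDist(fd, bins=fd.B() + 10000)

    print(wb.prob('the'))        # slightly less than the MLE estimate
    print(wb.prob('zyzzyva'))    # a small nonzero probability for an unseen word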

Information in sequences of words

• Words don’t actually convey information independently

•  There is less information in each word than the unigrams would suggest

• When hearing a sentence, you can often guess the next _____

N-grams: Bigrams

“The cat that sat on the sofa also sat on the mat.”

sat on    2
on the    2
the cat   1
cat that  1
that sat  1
the sofa  1
sofa also 1
also sat  1
the mat   1

And then there are trigrams, 4-grams, ...
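
The same FreqDist approach works for bigrams via nltk.bigrams (a sketch, with the same crude tokenization as before):

    from nltk import FreqDist, bigrams

    tokens = "the cat that sat on the sofa also sat on the mat".split()
    fd2 = FreqDist(bigrams(tokens))
    print(fd2.most_common(3))   # [(('sat', 'on'), 2), (('on', 'the'), 2), ...]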

Sliding Windows

“[The cat] that sat on the sofa also sat on the mat.”

“The [cat that] sat on the sofa also sat on the mat.”

“The cat [that sat] on the sofa also sat on the mat.”

Pointwise mutual information

•  Is “vice president” a significant phrase, or is it simply a coincidence when the words “vice” and “president” are near each other?
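
Pointwise mutual information compares the observed joint probability to what independence would predict: PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ). NLTK's collocation finders compute this directly; a sketch (the corpus and frequency cutoff are illustrative choices):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.corpus import brown    # requires nltk.download('brown')

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(w.lower() for w in brown.words())
    finder.apply_freq_filter(5)      # ignore very rare bigrams, which get inflated PMI scores

    print(finder.nbest(measures.pmi, 10))   # the ten bigrams with the highest PMI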

What if p(W1, W2) = 0?

• Once again we have an unrealistic model with probabilities that could be 0

•  If the phrase “unrealistic model” isn’t in the corpus, does it have infinite information?

• We need to smooth again

Laplace smoothing

•  The "add one" principle
•  Any event in your model that never happens, happens once instead

•  Simple and often good enough

Laplace smoothing on bigrams
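
A sketch of add-one smoothing over bigrams with NLTK's LaplaceProbDist (the bins value, one bin per possible word pair, is one reasonable choice rather than the only one):

    from nltk import FreqDist, bigrams
    from nltk.probability import LaplaceProbDist
    from nltk.corpus import brown    # requires nltk.download('brown')

    words = [w.lower() for w in brown.words()]
    fd = FreqDist(bigrams(words))

    vocab = len(set(words))
    lp = LaplaceProbDist(fd, bins=vocab * vocab)   # one bin per possible word pair

    print(lp.prob(('of', 'the')))             # a frequent bigram
    print(lp.prob(('unrealistic', 'model')))  # unseen, but no longer zero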

Witten-Bell works here too

•  Let i range over all bigrams
•  N = total bigram tokens, T = total bigram types

You’ve got Spam!

The Turing Test

•  Alan Turing
•  "Can machines think?"
•  The "Imitation Game"

•  Turing is no longer asking whether a machine can think, but whether a machine can act like it is thinking.

The Loebner Prize

•  An annual "Turing Test" competition, held since 1991
•  The Silver (text) and Gold (text + visual) prizes have never been won

•  Bronze: “Most human-like”

Alice

How does Alice work?

•  Heuristic pattern-matching rules
•  Think ELIZA, but with more rules
•  Reacts to words in the input

What wins the Loebner Prize?

•  Spelling and grammar
•  Hiding mathematical knowledge
•  Reacting like a human (timing)
•  Pretending to pretend to be a robot (Elbot)

MegaHal

How does MegaHal Work?

•  Learns how you talk
•  Imitates natural language in general
•  ... with a healthy dose of you

•  Imitation

What is MegaHal trained on?

•  Conversational sentences designed to help it win the Loebner Prize

•  Facts about the world
•  References to The Hitchhiker's Guide to the Galaxy
•  They Might Be Giants (TMBG) lyrics

Markov chains

• Markov chains are structures where future states depend only on the current state

• Given the present state, the future and past states are independent

•  "Forget where you were. You are here now. Decide where to go next."

Generating Text with Markov

• Make a probability distribution of what comes next, given the last n-1 words

•  Iterate
•  A maximum likelihood estimate is fine here
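
A minimal sketch of Markov text generation from bigram counts (the Brown corpus, the start word, and the output length are illustrative choices):

    import random
    from collections import defaultdict

    from nltk import bigrams
    from nltk.corpus import brown    # requires nltk.download('brown')

    # Build the transition table: for each word, collect every word that follows it
    successors = defaultdict(list)
    for w1, w2 in bigrams(w.lower() for w in brown.words()):
        successors[w1].append(w2)

    # Generate by repeatedly sampling what comes next, given the current word
    word = 'the'
    output = [word]
    for _ in range(20):
        nxt = successors.get(word)
        if not nxt:                   # dead end (rare)
            break
        word = random.choice(nxt)     # uniform choice over tokens == sampling the MLE distribution
        output.append(word)

    print(' '.join(output))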

Markov Beyond Bigrams

•  The higher our n, the more sensible our text
   – Text plagiarism vs. generation?

• Our space becomes sparser

Making MegaHal

•  Filter stopwords
•  Pick a word they said (smartly)
•  Build forward and backward Markov chains from that word
   – Backward?

•  Some small fixes
   – my -> your
   – why -> because

•  Do this several times
•  Pick a response with a heuristic

Assignment

•  Load a sufficiently large new text corpus in NLTK

•  Discover words that occur significantly more frequently in your corpus than in the Brown corpus

•  Discover its two-word collocations with the most pointwise mutual information

• Use your code from this class to generate text from it
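
One possible starting point for the first step, assuming your texts are plain .txt files in a local directory (the path and filename pattern below are placeholders):

    from nltk.corpus import PlaintextCorpusReader

    # Point NLTK at a directory of plain-text files (hypothetical path)
    my_corpus = PlaintextCorpusReader('path/to/my_corpus', r'.*\.txt')

    print(my_corpus.fileids())       # the files that were found
    print(len(my_corpus.words()))    # total token count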
