N-Grams and Corpus Linguistics
6 July 2012


TRANSCRIPT

  • N-Grams and Corpus Linguistics

    6 July 2012

  • Linguistics vs. Engineering
    "But it must be recognized that the notion of 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." Noam Chomsky (1969)
    "Anytime a linguist leaves the group the speech recognition rate goes up." Fred Jelinek (1988)

  • Next Word Prediction
    From an Economic Times story...
      Stocks ...
      Stocks plunged this ...
      Stocks plunged this morning, despite a cut in interest rates ...
      Stocks plunged this morning, despite a cut in interest rates by the RBI, as Bombay Stock ...
      Stocks plunged this morning, despite a cut in interest rates by the RBI, as Bombay Stock Exchange began ...

  • Stocks plunged this morning, despite a cut in interest rates by the RBI, as Bombay Stock Exchange began trading for the first time since Mumbai ...
    Stocks plunged this morning, despite a cut in interest rates by the RBI, as Bombay Stock Exchange began trading for the first time since the Mumbai terrorist attacks.

  • Human Word Prediction
    Clearly, at least some of us have the ability to predict future words in an utterance. How?
      Domain knowledge: red blood vs. red hat
      Syntactic knowledge: the ...
      Lexical knowledge: baked ...

  • Claim
    A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
    In particular, we'll be interested in the notion of the probability of a sequence (of letters, words, ...).

  • Useful Applications
    Why do we want to predict a word, given some preceding words?
    Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR:
      Theatre owners say popcorn/unicorn sales have doubled...
    Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation:
      The doctor recommended a cat scan. (...billi/cat...)

  • *Real-Word Spelling Errors
    Mental confusions:
      Their/they're/there
      To/too/two
      Weather/whether
      Peace/piece
      Your/you're
    Typos that result in real words:
      Lave for Have

  • *Real Word Spelling Errors
    Collect a set of common pairs of confusions.
    Whenever a member of this set is encountered, compute the probability of the sentence in which it appears.
    Substitute the other possibilities and compute the probability of each resulting sentence.
    Choose the highest one (a sketch follows below).
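
    A minimal sketch of this procedure, assuming a hypothetical toy_score function in place of a real language model (the confusion sets, sentences, and probabilities are invented):

        # Sketch: real-word spelling correction with a confusion set.
        # `toy_score` stands in for a real n-gram language model; the
        # probabilities below are invented to make the example runnable.

        CONFUSION_SETS = [{"their", "there", "they're"}, {"to", "too", "two"}]

        def toy_score(sentence):
            table = {
                "they went to their house": 1e-8,
                "they went to there house": 1e-10,
                "they went to they're house": 1e-11,
            }
            return table.get(sentence, 1e-12)

        def correct(sentence, score=toy_score):
            words = sentence.split()
            best = sentence
            for i, w in enumerate(words):
                for cset in CONFUSION_SETS:
                    if w in cset:
                        for alt in cset:
                            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
                            if score(candidate) > score(best):
                                best = candidate
            return best

        print(correct("they went to there house"))  # -> "they went to their house"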

  • N-Gram Models of Language
    Use the previous N-1 words in a sequence to predict the next word.
    Language Model (LM): unigrams, bigrams, trigrams, ...
    How do we train these models?
      Very large corpora

  • Corpora
    Corpora are online collections of text and speech:
      Brown Corpus
      Wall Street Journal
      AP newswire
      DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator)
      BNC

  • Counting Words in Corpora
    What is a word?
      e.g., are "cat" and "cats" the same word? "September" and "Sept"? "zero" and "oh"?
      Is "_" a word? "*"? "?" ")" "." ","?
      How many words are there in "don't"? In "Gonna"?
      In Japanese and Chinese text, how do we identify a word at all?

  • Terminology
    Sentence: unit of written language
    Utterance: unit of spoken language
    Word form: the inflected form as it actually appears in the corpus
    Lemma: an abstract form shared by word forms having the same stem, part of speech, and word sense; stands for the class of words with the same stem
    Types: number of distinct words in a corpus (vocabulary size)
    Tokens: total number of words (see the small example below)
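
    As a quick illustration of the types/tokens distinction (the sentence is invented):

        # Tokens = total running words; types = distinct word forms (vocabulary size).
        text = "the cat sat on the mat because the mat was warm"
        tokens = text.split()
        print(len(tokens), len(set(tokens)))  # 11 tokens, 8 types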

  • Simple N-Grams
    Assume a language has T word types in its lexicon; how likely is word x to follow word y?
    Simplest model of word probability: 1/T
    Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
      "popcorn" is more likely to occur than "unicorn"
    Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...)
      "mythical unicorn" is more likely than "mythical popcorn"
    (Both alternatives are sketched in code below.)
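
    A minimal sketch of both alternatives, estimated by relative frequency over an invented toy corpus:

        from collections import Counter

        # Toy corpus (invented); in practice the counts come from a very large corpus.
        corpus = "the mythical unicorn ate popcorn . the popcorn was mythical .".split()

        unigrams = Counter(corpus)
        bigrams = Counter(zip(corpus, corpus[1:]))
        N = len(corpus)

        def p_unigram(w):
            # Alternative 1: general frequency of occurrence.
            return unigrams[w] / N

        def p_bigram(w, prev):
            # Alternative 2: condition on the previous word.
            return bigrams[(prev, w)] / unigrams[prev]

        print(p_unigram("popcorn"), p_unigram("unicorn"))  # popcorn is the more frequent unigram
        print(p_bigram("unicorn", "mythical"), p_bigram("popcorn", "mythical"))  # 0.5 vs 0.0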

  • *Chain Rule
    Recall the definition of conditional probabilities: P(A|B) = P(A, B) / P(B)

    Rewriting: P(A, B) = P(A|B) P(B)

    The word sequence from position 1 to n is written w1..wn (also w1^n)

  • Computing the Probability of a Word Sequence
    Compute the product of component conditional probabilities?
      P(the mythical unicorn) = P(the) * P(mythical | the) * P(unicorn | the mythical)
    But the longer the sequence, the less likely we are to find it in a training corpus:
      P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)
    What can we do?

  • Bigram Model
    Approximate P(wn | w1..wn-1) by P(wn | wn-1)
      E.g., approximate P(unicorn | the mythical) by P(unicorn | mythical)
    Markov assumption: the probability of a word depends only on the probability of a limited history
    Generalization: the probability of a word depends only on the probability of the n previous words
      trigrams, 4-grams, 5-grams, ...
      the higher n is, the more data is needed to train
      backoff models

  • Using N-Grams
    For N-gram models, e.g. for bigrams:
      P(wn-1, wn) = P(wn | wn-1) P(wn-1)
      e.g. P(w7, w8) = P(w8 | w7) P(w7)
    By the Chain Rule we can decompose a joint probability, e.g. P(w1, w2, w3), as follows:
      P(w1, w2, ..., wn) = P(w1 | w2, w3, ..., wn) P(w2 | w3, ..., wn) ... P(wn-1 | wn) P(wn)

  • For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams, e.g.
      P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <s>)
    (A worked toy example follows below.)
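
    A worked toy example of that product, with invented bigram probabilities:

        # P(the, mythical, unicorn) under the bigram approximation:
        #   P(the | <s>) * P(mythical | the) * P(unicorn | mythical)
        # The probabilities below are invented for illustration.
        p = {("<s>", "the"): 0.20, ("the", "mythical"): 0.01, ("mythical", "unicorn"): 0.30}

        words = ["<s>", "the", "mythical", "unicorn"]
        prob = 1.0
        for prev, w in zip(words, words[1:]):
            prob *= p[(prev, w)]
        print(prob)  # ~0.0006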

  • *N-Grams
    The big red dog
      Unigrams: P(dog)
      Bigrams: P(dog | red)
      Trigrams: P(dog | big red)
      Four-grams: P(dog | the big red)

    In general, we'll be dealing with P(Word | Some fixed prefix)

  • *Caveat
    The formulation P(Word | Some fixed prefix) is not really appropriate in many applications.
    It is if we're dealing with real-time speech, where we only have access to prefixes.
    But if we're dealing with text, we already have the right and left contexts. There's no a priori reason to stick to left contexts.

  • Training and Testing
    N-gram probabilities come from a training corpus
      overly narrow corpus: probabilities don't generalize
      overly general corpus: probabilities don't reflect task or domain
    A separate test corpus is used to evaluate the model
      held-out test set; development (dev) test set
      simple baseline
      results tested for statistical significance: how do they differ from a baseline? from other results?

  • A Simple Bigram Example
    Estimate the likelihood of the sentence "I want to eat Chinese food":
      P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(</s> | food)
    What do we need to calculate these likelihoods?
      Bigram probabilities for each word pair sequence in the sentence
      Calculated from a large corpus

  • Early Bigram Probabilities from BeRP

  • P(I want to eat British food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
      = .25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081
    Suppose P(</s> | food) = .2?
    How would we calculate "I want to eat Chinese food"?
    The probabilities roughly capture "syntactic" facts and "world knowledge":
      "eat" is often followed by an NP
      British food is not too popular
    N-gram models can be trained by counting and normalization (sketched below)
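
    A minimal sketch of training by counting and normalization (MLE) on an invented toy corpus, not the actual BeRP data:

        from collections import Counter

        # Toy training corpus (invented sentences).
        sentences = [
            "i want to eat chinese food",
            "i want to eat british food",
            "i want chinese food",
        ]

        bigram_counts = Counter()
        context_counts = Counter()
        for s in sentences:
            words = ["<s>"] + s.split() + ["</s>"]
            context_counts.update(words[:-1])             # counts of w_{n-1} contexts
            bigram_counts.update(zip(words, words[1:]))   # counts of (w_{n-1}, w_n)

        def p(w, prev):
            # MLE by counting and normalization: C(w_{n-1}, w_n) / C(w_{n-1})
            return bigram_counts[(prev, w)] / context_counts[prev]

        print(p("want", "i"), p("eat", "to"), p("chinese", "want"))  # 1.0 1.0 0.333...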

  • *An Aside on Logs
    You don't really do all those multiplies: the numbers are too small and lead to underflow.
    Convert the probabilities to logs and then do additions.
    To get the real probability (if you need it), go back to the antilog (see the sketch below).
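
    A tiny sketch of the log trick, reusing the numbers from the slide above:

        import math

        # The six BeRP-style bigram probabilities from the previous slide.
        probs = [.25, .32, .65, .26, .001, .60]

        direct = math.prod(probs)                    # risks underflow for long sentences
        log_sum = sum(math.log(p) for p in probs)    # safe: add log probabilities
        print(direct, math.exp(log_sum))             # both ~8.1e-06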

  • Early BeRP Bigram Counts

  • Early BeRP Bigram Probabilities
    Normalization: divide each row's counts by the appropriate unigram count for wn-1

    Computing the bigram probability of "I I":
      C(I, I) / C(I in all contexts), i.e. P(I | I) = 8 / 3437 = .0023
    Maximum Likelihood Estimation (MLE): relative frequency

  • What do we learn about the language?
    What's being captured with...
      P(want | I) = .32        P(to | want) = .65
      P(eat | to) = .26        P(food | Chinese) = .56
      P(lunch | eat) = .055
    What about...
      P(I | I) = .0023
      P(I | want) = .0025
      P(I | food) = .013

  • P(I | I) = .0023       "I I I I want ..."
    P(I | want) = .0025    "I want I want ..."
    P(I | food) = .013     "... the kind of food I want is ..."

  • *Generation (just a test)
    Choose N-grams according to their probabilities and string them together.
    For bigrams: start by generating a word that has a high probability of starting a sentence, then choose a bigram that is likely given the first word selected, and so on (a sampling sketch follows below).
    We'll see that we get better results with higher-order n-grams.
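
    A minimal sketch of bigram sampling on an invented toy corpus (not the Shakespeare model):

        import random
        from collections import Counter, defaultdict

        # Tiny invented corpus with sentence-boundary markers.
        corpus = "<s> the mythical unicorn ate popcorn </s> <s> the unicorn slept </s>".split()
        bigrams = Counter(zip(corpus, corpus[1:]))

        # Group continuations by left context, repeated by count so sampling is
        # proportional to each bigram's relative frequency.
        continuations = defaultdict(list)
        for (prev, w), c in bigrams.items():
            continuations[prev].extend([w] * c)

        def generate(max_len=20):
            word, out = "<s>", []
            for _ in range(max_len):
                word = random.choice(continuations[word])
                if word == "</s>":
                    break
                out.append(word)
            return " ".join(out)

        print(generate())  # e.g. "the unicorn ate popcorn"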

  • Approximating Shakespeare
    Generating sentences with random unigrams...
      Every enter now severally so, let
      Hill he late speaks; or! a more to leg less first you enter
    With bigrams...
      What means, sir. I confess she? then all sorts, he is trim, captain.
      Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
    With trigrams...
      Sweet prince, Falstaff shall die.
      This shall forbid it should be branded, if renown made it empty.

  • With quadrigrams...
      What! I will go seek the traitor Gloucester.
      Will you not tell me who I am?
    What's coming out here looks like Shakespeare because it is Shakespeare.
    Note: as we increase the value of N, the accuracy of an n-gram model increases, since the choice of next word becomes increasingly constrained.

  • *There are 884,647 tokens, with 29,066 word form types, in the roughly one-million-word Shakespeare corpus.
    Shakespeare produced about 300,000 bigram types out of 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (have zero entries in the table).
    Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.

  • N-Gram Training Sensitivity
    If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?
    Note: this question has major implications for corpus selection or design.

  • WSJ is not Shakespeare: Sentences Generated from WSJ

  • *Some Useful Observations
    A small number of events occur with high frequency
      You can collect reliable statistics on these events with relatively small samples
    A large number of events occur with low frequency
      You might have to wait a long time to gather statistics on the low-frequency events

  • *Some Useful Observations
    Some zeroes are really zeroes
      They represent events that can't or shouldn't occur
    On the other hand, some zeroes aren't really zeroes
      They represent low-frequency events that simply didn't occur in the corpus

  • Smoothing
    Words follow a Zipfian distribution
      A small number of words occur very frequently
      A large number are seen only once
      Zipf's law: a word's frequency is approximately inversely proportional to its rank in the word distribution list
    A zero probability on one bigram causes a zero probability on the entire sentence
    So... how do we estimate the likelihood of unseen n-grams?

  • Slide from Dan Klein

  • Laplace Smoothing
    For unigrams:
      Add 1 to every word (type) count to get an adjusted count c*
      Normalize by N (# tokens) + V (# types)
    Original unigram probability: P(wi) = c(wi) / N

    New unigram probability: P*(wi) = (c(wi) + 1) / (N + V)

  • Unigram Smoothing Example
    Tiny corpus: V = 4, N = 20 (reproduced in the sketch below)

    Word       True Ct   Unigram Prob   New Ct   Adjusted Prob
    eat           10         .5           11         .46
    British        4         .2            5         .21
    food           6         .3            7         .29
    happily        0         .0            1         .04
    Total         20        1.0           24        1.0
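
    A small sketch that reproduces the table above from the formula P*(w) = (c + 1) / (N + V):

        # Add-one (Laplace) smoothing for unigrams.
        counts = {"eat": 10, "British": 4, "food": 6, "happily": 0}   # from the table above
        N = sum(counts.values())   # 20 tokens
        V = len(counts)            # 4 types

        for w, c in counts.items():
            print(f"{w:8s}  true {c/N:.2f}  smoothed {(c + 1) / (N + V):.2f}")
        # eat 0.50 -> 0.46, British 0.20 -> 0.21, food 0.30 -> 0.29, happily 0.00 -> 0.04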

  • So, we lower some (larger) observed counts in order to include unobserved vocabulary.
    For bigrams:

    Original: P(wn | wn-1) = C(wn-1, wn) / C(wn-1)

    New: P*(wn | wn-1) = (C(wn-1, wn) + 1) / (C(wn-1) + V)

    But this changes the counts drastically: too much weight is given to unseen n-grams.
    In practice, unsmoothed bigrams often work better!
    Can we smooth more usefully?

  • Good-Turing Discounting
    Re-estimate the amount of probability mass assigned to zero-count (or low-count) n-grams by looking at n-grams with higher counts.
    Estimate: c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-gram types that occur c times

    E.g. N0's adjusted count is a function of the count of n-grams that occur once, N1.
    Assumes word bigrams each follow a binomial distribution.
    We know the number of unseen bigrams: V*V minus the number of seen bigram types (see the sketch below).
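
    A minimal sketch of the adjusted-count formula, applied to invented count-of-count values:

        from collections import Counter

        # Invented bigram counts, just to show the count-of-counts bookkeeping.
        bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 1,
                                 ("c", "a"): 1, ("b", "c"): 2})
        Nc = Counter(bigram_counts.values())   # N_1 = 3, N_2 = 1, N_3 = 1

        def adjusted_count(c):
            # Good-Turing: c* = (c + 1) * N_{c+1} / N_c
            # For c = 0, N_0 would be the number of unseen bigrams (V*V minus seen types).
            return (c + 1) * Nc[c + 1] / Nc[c] if Nc[c] else None

        print(adjusted_count(1))  # 2 * N_2 / N_1 = 0.67
        print(adjusted_count(2))  # 3 * N_3 / N_2 = 3.0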

  • Backoff Methods (e.g. Katz '87)
    For e.g. a trigram model:
      Compute unigram, bigram and trigram probabilities
    Usage:
      Where the trigram is unavailable, back off to the bigram if available; otherwise back off to the current word's unigram probability
      E.g. "An omnivorous unicorn" (see the sketch below)
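
    A small sketch of the backoff control flow only; unlike Katz's method it applies no discounting or backoff weights, and the probability tables are invented:

        # Illustrative (undiscounted) backoff: use the richest history that was observed.
        trigram_p = {("i", "want", "to"): 0.65}
        bigram_p  = {("want", "to"): 0.65}
        unigram_p = {"unicorn": 0.0001, "to": 0.30}

        def p_backoff(w, w1, w2):
            if (w1, w2, w) in trigram_p:
                return trigram_p[(w1, w2, w)]
            if (w2, w) in bigram_p:
                return bigram_p[(w2, w)]
            return unigram_p.get(w, 0.0)

        print(p_backoff("to", "i", "want"))              # trigram available: 0.65
        print(p_backoff("unicorn", "an", "omnivorous"))  # backs off to the unigram: 0.0001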

  • Class-Based Backoff
    Back off to the class rather than the word.
    Particularly useful for proper nouns (e.g., names).
    Use the count for the class of names in place of the count for the particular name.
    E.g. <N | friendly> instead of <dog | friendly>

  • *Summary
    N-gram probabilities can be used to estimate the likelihood
      of a word occurring in a context (of N-1 previous words)
      of a sentence occurring at all
    Smoothing techniques deal with problems of unseen words in a corpus

  • Google NGrams

  • Example
      serve as the incoming 92
      serve as the incubator 99
      serve as the independent 794
      serve as the index 223
      serve as the indication 72
      serve as the indicator 120
      serve as the indicators 45
      serve as the indispensable 111
      serve as the indispensible 40
      serve as the individual 234

  • *P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
      = .25 * .32 * .65 * .26 * .02 * .52 ≈ .00014

    Note: N = # tokens, V = # types (vocabulary size)