language modeling - ecology...
TRANSCRIPT
Introduction to N-grams
Language Modeling
Many Slides are adapted from slides by Dan Jurafsky
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
• Spell Correction
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
  • P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Why?
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5 … wn)
• Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2 … wn−1), is called a language model.
• Better: the grammar. But language model or LM is standard.
How to compute P(W)
• How to compute this joint probability:
  • P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• With 4 variables: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
• The Chain Rule in general:
  P(x1, x2, x3, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … P(xn | x1, …, xn−1)
The Chain Rule applied to compute the joint probability of words in a sentence

P(“its water is so transparent”) =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

In general:

  P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
How to estimate these probabilities
• Could we just count and divide?

  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)

• No! Too many possible sentences!
• We’ll never see enough data for estimating these
Markov Assumption

  P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)

• In other words, we approximate each component in the product:

  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
Markov Assumption
• Simplifying assumption:

  P(the | its water is so transparent that) ≈ P(the | that)

• Or maybe:

  P(the | its water is so transparent that) ≈ P(the | transparent that)
Andrei Markov
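To make the approximation concrete, here is a minimal Python sketch of a k-th order Markov model of a sentence’s probability: the full history is truncated to the last k words before the conditional probability is looked up. The probability table `cond_prob` and its values are hypothetical, included only so the example runs.

```python
# Minimal sketch of the Markov approximation: P(w_i | w_1..w_{i-1}) ≈ P(w_i | w_{i-k}..w_{i-1}).
# The probability table below is hypothetical, just to make the code runnable.

def markov_context(history, k):
    """Keep only the last k words of the history (the Markov assumption)."""
    return tuple(history[-k:]) if k > 0 else ()

def sentence_prob(words, cond_prob, k=1):
    """Approximate P(w_1..w_n) as a product of P(w_i | last k words)."""
    p = 1.0
    for i, w in enumerate(words):
        context = markov_context(words[:i], k)
        p *= cond_prob.get((context, w), 0.0)   # unseen events get probability 0 here
    return p

# Hypothetical bigram (k=1) probabilities, for illustration only.
cond_prob = {
    ((), "its"): 0.1,
    (("its",), "water"): 0.2,
    (("water",), "is"): 0.3,
}
print(sentence_prob(["its", "water", "is"], cond_prob, k=1))  # 0.1 * 0.2 * 0.3 = 0.006
```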
Simplest case: Unigram model

  P(w1 w2 … wn) ≈ ∏i P(wi)

Some automatically generated sentences from a unigram model:

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
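A minimal sketch of where such word salad comes from: under a unigram model every word is drawn independently from P(w), with no conditioning on context. The tiny probability table below is hypothetical; a real model would estimate P(w) from corpus counts.

```python
import random

# Sampling word sequences from a unigram model: every word is drawn independently from P(w).
# The probability table is hypothetical, for illustration only.
unigram_probs = {"the": 0.4, "of": 0.2, "inflation": 0.1, "dollars": 0.1, "said": 0.2}

def sample_unigram_text(n_words):
    words = list(unigram_probs)
    weights = [unigram_probs[w] for w in words]
    return " ".join(random.choices(words, weights=weights, k=n_words))

print(sample_unigram_text(10))  # word salad, as on the slide: no word order is modeled
```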
Bigram model
• Condition on the previous word:

  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    “The computer which I had just put into the machine room on the fifth floor crashed.”
• But we can often get away with N-gram models: still, most words depend on their previous few words
Introduction to N-grams
Language Modeling
Estimating N-gram Probabilities
Language Modeling
Estimating bigram probabilities
• The Maximum Likelihood Estimate:

  P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
               = c(wi−1, wi) / c(wi−1)
An example

  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

  P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
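A minimal sketch of the maximum likelihood estimate applied to exactly this three-sentence corpus; the counting code is an illustration, not part of the original slides.

```python
from collections import Counter

# MLE for bigrams: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}),
# applied to the three-sentence corpus on the slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: "<s> I" occurs twice, "<s>" occurs three times
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```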
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences

Raw bigram probabilities
• Normalize by unigrams:
• Result:
Bigram estimates of sentence probabilities

  P(<s> I want english food </s>)
    = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
    = .000031
What kinds of knowledge?
• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)

  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
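A minimal sketch of the underflow problem and the log-space fix; the per-word probability used here is a hypothetical value chosen only to make the effect visible.

```python
import math

# Many small probabilities multiplied together underflow to 0.0 in floating point,
# while their log-probabilities add safely.
probs = [1e-5] * 100          # hypothetical per-word probabilities for a 100-word text

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 -- underflow

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                # about -1151.3, perfectly representable
```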
Language Modeling Toolkits
• SRILM
  • http://www.speech.sri.com/projects/srilm/

Google N-Gram Release, August 2006
…
http://ngrams.googlelabs.com/
Estimating N-gram Probabilities
Language Modeling
Evaluation and Perplexity
Language Modeling
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  • Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  • Time-consuming; can take days or weeks
• So
  • Sometimes use intrinsic evaluation: perplexity
Intuition of Perplexity
• The Shannon Game:
  • How well can we predict the next word?
• Unigrams are terrible at this game. (Why?)
• A better model of a text
  • is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Perplexity
• The best language model is one that best predicts an unseen test set
  • Gives the highest P(sentence)
• Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)

• Chain rule: PP(W) = ( ∏i 1 / P(wi | w1 … wi−1) )^(1/N)
• For bigrams: PP(W) = ( ∏i 1 / P(wi | wi−1) )^(1/N)
• Minimizing perplexity is the same as maximizing probability
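A minimal sketch of computing perplexity under a bigram model, using the equivalent log-space form PP(W) = exp(−(1/N) Σ log P(wi | wi−1)). The bigram probabilities below are hypothetical values chosen only so the example runs.

```python
import math

# Perplexity of a token sequence under a bigram model, computed in log space.
# The probability table is hypothetical, for illustration only.
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def perplexity(tokens, probs):
    n = len(tokens) - 1                        # number of predicted words (we don't predict "<s>")
    log_sum = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(probs[(prev, w)])  # fails if a bigram is unseen -- see the zeros discussion
    return math.exp(-log_sum / n)

print(perplexity(["<s>", "i", "want", "food", "</s>"], bigram_probs))
```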
Lower perplexity = better model
• Training 38 million words, test 1.5 million words, WSJ

  N-gram order:  Unigram   Bigram   Trigram
  Perplexity:        962      170       109
Evaluation and Perplexity
Language Modeling
Generalization and zeros
Language Modeling
The Shannon Visualization Method
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together

  <s> I
  I want
  want to
  to eat
  eat Chinese
  Chinese food
  food </s>
  I want to eat Chinese food
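A minimal sketch of this sampling procedure in Python: start at <s>, repeatedly sample the next word from P(w | previous word), and stop at </s>. The bigram distribution here is hypothetical (and deterministic) so that it reproduces the slide’s example sentence.

```python
import random

# Shannon visualization: generate a sentence by sampling bigrams until </s>.
# The distribution is hypothetical; in practice it would come from estimated bigram probabilities.
bigram_dist = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}

def generate(dist, max_len=20):
    word, out = "<s>", []
    while len(out) < max_len:
        choices = dist[word]
        word = random.choices(list(choices), weights=list(choices.values()), k=1)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigram_dist))  # "I want to eat Chinese food", as in the slide's example
```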
Approximating Shakespeare
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what’s coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no offense)
The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  • In real life, it often doesn’t
  • We need to train robust models that generalize!
• One kind of generalization: Zeros!
  • Things that don’t ever occur in the training set
  • But occur in the test set
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan

  P(“offer” | denied the) = 0
Zero probability bigrams
• Bigrams with zero probability
  • mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)!
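A minimal sketch of the problem: one unseen test bigram drives the whole test-set probability to zero, so the perplexity computation breaks. The bigram probabilities are hypothetical MLE-style values for the "denied the …" example.

```python
# One unseen bigram makes the test-set probability 0, and perplexity undefined.
# Hypothetical MLE bigram probabilities from the "denied the ..." training examples:
bigram_probs = {
    ("denied", "the"): 1.0,
    ("the", "allegations"): 0.25,
    ("the", "reports"): 0.25,
    ("the", "claims"): 0.25,
    ("the", "request"): 0.25,
}

test_bigrams = [("denied", "the"), ("the", "offer")]   # "offer" never followed "the" in training
p = 1.0
for bg in test_bigrams:
    p *= bigram_probs.get(bg, 0.0)
print(p)   # 0.0 -- so log(p) is undefined and PP = (1/p)**(1/N) divides by zero
```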
Generalization and zeros
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:

  P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)

• Add-1 estimate (V = vocabulary size):

  P_Add-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
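A minimal sketch of add-1 smoothing applied to the earlier three-sentence corpus; the counting code and the convention of including <s> and </s> in the vocabulary are illustrative choices, not prescribed by the slides.

```python
from collections import Counter

# Add-1 (Laplace) smoothing for bigrams:
# P_add1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V), V = vocabulary size.
corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)        # vocabulary size (counting <s> and </s> here; one common convention)

def p_add1(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_add1("I", "<s>"))      # (2 + 1) / (3 + V): smoothed version of the 2/3 MLE estimate
print(p_add1("ham", "am"))     # unseen bigram now gets a small but nonzero probability
```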
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Laplace-smoothed bigrams

Reconstituted counts

Compare with raw bigram counts
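One way to read the "reconstituted counts" table, sketched here as an assumption rather than quoted from the slides: turn each smoothed probability back into an effective count by multiplying it by the context's unigram count. The numbers in the example are made up.

```python
# Hedged sketch: reconstituting an effective count from an add-1 smoothed probability.
# c*(w_{i-1}, w_i) = (c(w_{i-1}, w_i) + 1) * c(w_{i-1}) / (c(w_{i-1}) + V)
def reconstituted_count(c_bigram, c_prev, V):
    return (c_bigram + 1) * c_prev / (c_prev + V)

# Hypothetical numbers: a bigram seen 600 times in a context seen 2,500 times, vocabulary of 1,500.
print(reconstituted_count(600, 2500, 1500))  # about 375.6 -- far below the raw count of 600
```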
Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams:
  • We’ll see better methods
• But add-1 is used to smooth other NLP models
  • For text classification
  • In domains where the number of zeros isn’t so huge
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Interpolation, Backoff
Language Modeling
Backoff and Interpolation
• Sometimes it helps to use less context
  • Condition on less context for contexts you haven’t learned much about
• Backoff:
  • use trigram if you have good evidence,
  • otherwise bigram, otherwise unigram
• Interpolation:
  • mix unigram, bigram, trigram
• Interpolation works better
Linear Interpolation
• Simple interpolation:

  P̂(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi),   with λ1 + λ2 + λ3 = 1

• Lambdas conditional on context: the weights λ can themselves depend on the preceding words
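A minimal sketch of simple linear interpolation with fixed weights; the component probabilities and the lambda values are hypothetical (in practice the lambdas are tuned on held-out data).

```python
# Simple linear interpolation: mix unigram, bigram and trigram estimates with weights summing to 1.
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even if the trigram estimate is 0 (unseen trigram), the mixture stays nonzero.
print(interpolated_prob(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # 0.0031
```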
N-gram Smoothing Summary
• Add-1 smoothing:
  • OK for text categorization, not for language modeling
• Backoff and interpolation work better
• The most commonly used method:
  • Extended Interpolated Kneser-Ney
Interpolation, Backoff
Language Modeling