language modeling - ecology...
TRANSCRIPT
Introduction to N-grams
Language Modeling
Many Slides are adapted from slides by Dan Jurafsky
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
• Spell Correction
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
  • P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Why?
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5 … wn)
• Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2 … wn−1), is called a language model.
• Better: the grammar. But language model or LM is standard.
How to compute P(W)
• How to compute this joint probability:
  • P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• With 4 variables: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
• The Chain Rule in general:
  P(x1, x2, x3, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … P(xn | x1, …, xn−1)
The Chain Rule applied to compute the joint probability of words in a sentence

P(“its water is so transparent”) =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

In general:

  P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
How to estimate these probabilities
• Could we just count and divide?

  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)

• No! Too many possible sentences!
• We’ll never see enough data for estimating these
Markov Assumption

  P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)

• In other words, we approximate each component in the product:

  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
Markov Assumption
• Simplifying assumption:

  P(the | its water is so transparent that) ≈ P(the | that)

• Or maybe:

  P(the | its water is so transparent that) ≈ P(the | transparent that)
Andrei Markov
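To make the approximation concrete, here is a minimal Python sketch of a k-th order Markov model of a sentence’s probability: the full history is truncated to the last k words before the conditional probability is looked up. The probability table `cond_prob` and its values are hypothetical, included only so the example runs.

```python
# Minimal sketch of the Markov approximation: P(w_i | w_1..w_{i-1}) ≈ P(w_i | w_{i-k}..w_{i-1}).
# The probability table below is hypothetical, just to make the code runnable.

def markov_context(history, k):
    """Keep only the last k words of the history (the Markov assumption)."""
    return tuple(history[-k:]) if k > 0 else ()

def sentence_prob(words, cond_prob, k=1):
    """Approximate P(w_1..w_n) as a product of P(w_i | last k words)."""
    p = 1.0
    for i, w in enumerate(words):
        context = markov_context(words[:i], k)
        p *= cond_prob.get((context, w), 0.0)   # unseen events get probability 0 here
    return p

# Hypothetical bigram (k=1) probabilities, for illustration only.
cond_prob = {
    ((), "its"): 0.1,
    (("its",), "water"): 0.2,
    (("water",), "is"): 0.3,
}
print(sentence_prob(["its", "water", "is"], cond_prob, k=1))  # 0.1 * 0.2 * 0.3 = 0.006
```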
Simplest case: Unigram model

  P(w1 w2 … wn) ≈ ∏i P(wi)

Some automatically generated sentences from a unigram model:

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
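A minimal sketch of where such word salad comes from: under a unigram model every word is drawn independently from P(w), with no conditioning on context. The tiny probability table below is hypothetical; a real model would estimate P(w) from corpus counts.

```python
import random

# Sampling word sequences from a unigram model: every word is drawn independently from P(w).
# The probability table is hypothetical, for illustration only.
unigram_probs = {"the": 0.4, "of": 0.2, "inflation": 0.1, "dollars": 0.1, "said": 0.2}

def sample_unigram_text(n_words):
    words = list(unigram_probs)
    weights = [unigram_probs[w] for w in words]
    return " ".join(random.choices(words, weights=weights, k=n_words))

print(sample_unigram_text(10))  # word salad, as on the slide: no word order is modeled
```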
Bigram model
• Condition on the previous word:

  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    “The computer which I had just put into the machine room on the fifth floor crashed.”
• But we can often get away with N-gram models: still, most words depend on their previous few words
Introduction to N-grams
Language Modeling
Estimating N-gram Probabilities
Language Modeling
Estimating bigram probabilities
• The Maximum Likelihood Estimate:

  P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
               = c(wi−1, wi) / c(wi−1)
An example

  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

  P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
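A minimal sketch of the maximum likelihood estimate applied to exactly this three-sentence corpus; the counting code is an illustration, not part of the original slides.

```python
from collections import Counter

# MLE for bigrams: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}),
# applied to the three-sentence corpus on the slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: "<s> I" occurs twice, "<s>" occurs three times
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```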
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences

Raw bigram probabilities
• Normalize by unigrams:
• Result:
Bigram estimates of sentence probabilities

  P(<s> I want english food </s>)
    = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
    = .000031
What kinds of knowledge?
• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)

  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
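A minimal sketch of the underflow problem and the log-space fix; the per-word probability used here is a hypothetical value chosen only to make the effect visible.

```python
import math

# Many small probabilities multiplied together underflow to 0.0 in floating point,
# while their log-probabilities add safely.
probs = [1e-5] * 100          # hypothetical per-word probabilities for a 100-word text

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 -- underflow

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                # about -1151.3, perfectly representable
```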
Language Modeling Toolkits
• SRILM
  • http://www.speech.sri.com/projects/srilm/

Google N-Gram Release, August 2006
…
http://ngrams.googlelabs.com/
Estimating N-gram Probabilities
Language Modeling
Evaluation and Perplexity
Language Modeling
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  • Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  • Time-consuming; can take days or weeks
• So
  • Sometimes use intrinsic evaluation: perplexity
Intuition of Perplexity
• The Shannon Game:
  • How well can we predict the next word?
• Unigrams are terrible at this game. (Why?)
• A better model of a text
  • is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Perplexity
• The best language model is one that best predicts an unseen test set
  • Gives the highest P(sentence)
• Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)

• Chain rule: PP(W) = ( ∏i 1 / P(wi | w1 … wi−1) )^(1/N)
• For bigrams: PP(W) = ( ∏i 1 / P(wi | wi−1) )^(1/N)
• Minimizing perplexity is the same as maximizing probability
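A minimal sketch of computing perplexity under a bigram model, using the equivalent log-space form PP(W) = exp(−(1/N) Σ log P(wi | wi−1)). The bigram probabilities below are hypothetical values chosen only so the example runs.

```python
import math

# Perplexity of a token sequence under a bigram model, computed in log space.
# The probability table is hypothetical, for illustration only.
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def perplexity(tokens, probs):
    n = len(tokens) - 1                        # number of predicted words (we don't predict "<s>")
    log_sum = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(probs[(prev, w)])  # fails if a bigram is unseen -- see the zeros discussion
    return math.exp(-log_sum / n)

print(perplexity(["<s>", "i", "want", "food", "</s>"], bigram_probs))
```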
Lower perplexity = better model
• Training 38 million words, test 1.5 million words, WSJ

  N-gram order:  Unigram   Bigram   Trigram
  Perplexity:        962      170       109
Evaluation and Perplexity
Language Modeling
Generalization and zeros
Language Modeling
The Shannon Visualization Method
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together

  <s> I
  I want
  want to
  to eat
  eat Chinese
  Chinese food
  food </s>
  I want to eat Chinese food
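A minimal sketch of this sampling procedure in Python: start at <s>, repeatedly sample the next word from P(w | previous word), and stop at </s>. The bigram distribution here is hypothetical (and deterministic) so that it reproduces the slide’s example sentence.

```python
import random

# Shannon visualization: generate a sentence by sampling bigrams until </s>.
# The distribution is hypothetical; in practice it would come from estimated bigram probabilities.
bigram_dist = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}

def generate(dist, max_len=20):
    word, out = "<s>", []
    while len(out) < max_len:
        choices = dist[word]
        word = random.choices(list(choices), weights=list(choices.values()), k=1)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigram_dist))  # "I want to eat Chinese food", as in the slide's example
```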
Approximating Shakespeare
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what’s coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no offense)
The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  • In real life, it often doesn’t
  • We need to train robust models that generalize!
• One kind of generalization: Zeros!
  • Things that don’t ever occur in the training set
  • But occur in the test set
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan

  P(“offer” | denied the) = 0
Zero probability bigrams
• Bigrams with zero probability
  • mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)!
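A minimal sketch of the problem: one unseen test bigram drives the whole test-set probability to zero, so the perplexity computation breaks. The bigram probabilities are hypothetical MLE-style values for the "denied the …" example.

```python
# One unseen bigram makes the test-set probability 0, and perplexity undefined.
# Hypothetical MLE bigram probabilities from the "denied the ..." training examples:
bigram_probs = {
    ("denied", "the"): 1.0,
    ("the", "allegations"): 0.25,
    ("the", "reports"): 0.25,
    ("the", "claims"): 0.25,
    ("the", "request"): 0.25,
}

test_bigrams = [("denied", "the"), ("the", "offer")]   # "offer" never followed "the" in training
p = 1.0
for bg in test_bigrams:
    p *= bigram_probs.get(bg, 0.0)
print(p)   # 0.0 -- so log(p) is undefined and PP = (1/p)**(1/N) divides by zero
```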
Generalization and zeros
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:

  P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)

• Add-1 estimate (V = vocabulary size):

  P_Add-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
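A minimal sketch of add-1 smoothing applied to the earlier three-sentence corpus; the counting code and the convention of including <s> and </s> in the vocabulary are illustrative choices, not prescribed by the slides.

```python
from collections import Counter

# Add-1 (Laplace) smoothing for bigrams:
# P_add1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V), V = vocabulary size.
corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)        # vocabulary size (counting <s> and </s> here; one common convention)

def p_add1(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_add1("I", "<s>"))      # (2 + 1) / (3 + V): smoothed version of the 2/3 MLE estimate
print(p_add1("ham", "am"))     # unseen bigram now gets a small but nonzero probability
```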
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Laplace-smoothed bigrams

Reconstituted counts

Compare with raw bigram counts
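One way to read the "reconstituted counts" table, sketched here as an assumption rather than quoted from the slides: turn each smoothed probability back into an effective count by multiplying it by the context's unigram count. The numbers in the example are made up.

```python
# Hedged sketch: reconstituting an effective count from an add-1 smoothed probability.
# c*(w_{i-1}, w_i) = (c(w_{i-1}, w_i) + 1) * c(w_{i-1}) / (c(w_{i-1}) + V)
def reconstituted_count(c_bigram, c_prev, V):
    return (c_bigram + 1) * c_prev / (c_prev + V)

# Hypothetical numbers: a bigram seen 600 times in a context seen 2,500 times, vocabulary of 1,500.
print(reconstituted_count(600, 2500, 1500))  # about 375.6 -- far below the raw count of 600
```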
Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams:
  • We’ll see better methods
• But add-1 is used to smooth other NLP models
  • For text classification
  • In domains where the number of zeros isn’t so huge
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Interpolation, Backoff
Language Modeling
Backoff and Interpolation
• Sometimes it helps to use less context
  • Condition on less context for contexts you haven’t learned much about
• Backoff:
  • use trigram if you have good evidence,
  • otherwise bigram, otherwise unigram
• Interpolation:
  • mix unigram, bigram, trigram
• Interpolation works better
Linear Interpolation
• Simple interpolation:

  P̂(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi),   with λ1 + λ2 + λ3 = 1

• Lambdas conditional on context: the weights λ can themselves depend on the preceding words
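A minimal sketch of simple linear interpolation with fixed weights; the component probabilities and the lambda values are hypothetical (in practice the lambdas are tuned on held-out data).

```python
# Simple linear interpolation: mix unigram, bigram and trigram estimates with weights summing to 1.
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even if the trigram estimate is 0 (unseen trigram), the mixture stays nonzero.
print(interpolated_prob(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # 0.0031
```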
N-gram Smoothing Summary
• Add-1 smoothing:
  • OK for text categorization, not for language modeling
• Backoff and interpolation work better
• The most commonly used method:
  • Extended Interpolated Kneser-Ney
Interpolation, Backoff
Language Modeling