TRANSCRIPT
Instructor: Alan Ritter
Probability Review and Naïve Bayes
Some slides adapted from Dan Jurafsky and Brendan O'Connor
What is Probability?
• "The probability the coin will land heads is 0.5" – Q: what does this mean?
• 2 interpretations:
  – Frequentist (repeated trials)
    • If we flip the coin many times…
  – Bayesian
    • We believe there is an equal chance of heads/tails
    • Advantage: covers events that do not have long-term frequencies
      Q: What is the probability the polar ice caps will melt by 2050?
Probability Review

Probabilities sum to one:
Σ_x P(X = x) = 1

Conditional Probability:
P(A|B) = P(A, B) / P(B)

Chain Rule:
P(A|B) P(B) = P(A, B)
Probability Review

Disjunction / Union:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Negation:
P(¬A) = 1 − P(A)

Marginalization:
Σ_x P(X = x, Y) = P(Y)
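A quick numeric sanity check of the identities above (a sketch; the joint distribution is made up for illustration):

# Toy joint distribution over two binary variables A, B (made-up numbers).
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

P_A1 = sum(p for (a, b), p in P.items() if a == 1)   # marginal P(A=1) = 0.5
P_B1 = sum(p for (a, b), p in P.items() if b == 1)   # marginal P(B=1) = 0.6
P_A1_given_B1 = P[(1, 1)] / P_B1                     # conditional P(A=1 | B=1)

assert abs(sum(P.values()) - 1.0) < 1e-9             # probabilities sum to one
assert abs(P_A1_given_B1 * P_B1 - P[(1, 1)]) < 1e-9  # chain rule
# Union: P(A=1 or B=1) = P(A=1) + P(B=1) - P(A=1, B=1) = 1 - P(A=0, B=0)
assert abs((P_A1 + P_B1 - P[(1, 1)]) - (1 - P[(0, 0)])) < 1e-9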
Bayes Rule

[Diagram: Hypothesis H (unknown) generates Data D (observed evidence)]
P(D|H): generative model of how the hypothesis causes the data
P(H|D): Bayesian inference

P(H|D) = P(D|H) P(H) / P(D)

Bayes Rule tells us how to flip the conditional: reason from effects back to causes. Useful if you assume a generative model for your data.
Bayes Rule

P(H|D) = P(D|H) P(H) / P(D)
posterior = likelihood × prior / normalizer

Writing out the normalizer:
P(H|D) = P(D|H) P(H) / Σ_h P(D|h) P(h)

Dropping the normalizer:
P(H|D) ∝ P(D|H) P(H)
("proportional to" – the right-hand side doesn't sum to 1)
Bayes Rule Example
• There is a disease that affects a tiny fraction of the population (0.01%)
• Symptoms include a headache and stiff neck
  – 99% of patients with the disease have these symptoms
• 1% of the general population has these symptoms
• Q: assuming you have the symptoms, what is the probability you have the disease?
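A worked computation of the example (a Python sketch; the numbers come straight from the slide):

p_disease = 0.0001                 # prior: 0.01% of the population
p_symptoms_given_disease = 0.99    # likelihood of symptoms given the disease
p_symptoms = 0.01                  # 1% of the general population has the symptoms

# Bayes rule: P(disease | symptoms) = P(symptoms | disease) P(disease) / P(symptoms)
posterior = p_symptoms_given_disease * p_disease / p_symptoms
print(posterior)                   # 0.0099, i.e. only about 1%, despite the 99% above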
Text Classification

Is this Spam?

Who wrote which Federalist papers?
• 1787–8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

[Figure: a MEDLINE article mapped into the MeSH subject category hierarchy]
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Text Classification: definition
• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  – spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  – If rules are carefully refined by an expert
• But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
  – a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  – a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
  – Naïve Bayes
  – Logistic regression
  – Support-vector machines
  – k-Nearest Neighbors
  – …
Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on a very simple representation of the document
  – Bag of words
Bag of words for document classification

[Figure: candidate categories with characteristic word lists – Machine Learning (learning, training, algorithm, shrinkage, network, …), NLP (parser, tag, training, translation, language, …), Garbage Collection (garbage, collection, memory, optimization, region, …), Planning (planning, temporal, reasoning, plan, language, …), GUI – and a test document containing "parser, language, label, translation, …". Which category does it belong to?]
Bayes' Rule Applied to Documents and Classes

• For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)
Naïve Bayes Classifier (I)

MAP is "maximum a posteriori" = most likely class

c_MAP = argmax_{c∈C} P(c|d)
      = argmax_{c∈C} P(d|c) P(c) / P(d)      (Bayes Rule)
      = argmax_{c∈C} P(d|c) P(c)             (dropping the denominator)
Naïve Bayes Classifier (II)

c_MAP = argmax_{c∈C} P(d|c) P(c)
      = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)      (document d represented as features x1 … xn)
Naïve Bayes Classifier (IV)

c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

P(c): "How often does this class occur?" We can just count the relative frequencies in a corpus.
P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available. (E.g., with a 20,000-word vocabulary and 100-word documents, that is 20000^100 parameter settings per class.)
Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c.
Multinomial Naïve Bayes Classifier

c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

c_NB = argmax_{c∈C} P(cj) Π_{x∈X} P(x|c)
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

c_NB = argmax_{cj∈C} P(cj) Π_{i∈positions} P(xi|cj)
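A tiny numeric sketch of this decision rule (the priors and word probabilities below are made up for illustration):

prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.02, "boring": 0.001},
    "neg": {"great": 0.002, "boring": 0.01},
}
doc = ["great", "great", "boring"]        # word positions in the test document

scores = {}
for c in prior:
    score = prior[c]                      # P(c_j)
    for w in doc:                         # product over positions of P(x_i | c_j)
        score *= likelihood[c][w]
    scores[c] = score
print(max(scores, key=scores.get))        # -> "pos"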
Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates
  – simply use the frequencies in the data

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

P̂(cj) = doccount(C = cj) / N_doc

• Create a mega-document for topic j by concatenating all docs in this topic
  – Use frequency of w in the mega-document
Parameter estimation

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

= fraction of times word wi appears among all words in documents of topic cj
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence: a single zero factor drives the whole product to zero.

c_MAP = argmax_c P̂(c) Π_i P̂(xi|c)
Laplace (add-1) smoothing for Naïve Bayes

P̂(wi|c) = count(wi, c) / Σ_{w∈V} count(w, c)

becomes

P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
Multinomial Naïve Bayes: Learning

• Calculate the P(cj) terms
  – For each cj in C do
      docsj ← all docs with class = cj

P(cj) ← |docsj| / |total # documents|
Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate the P(wk|cj) terms
  – Textj ← single doc containing all docsj
  – For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      n ← total # of word occurrences in Textj

P(wk|cj) ← (nk + α) / (n + α |Vocabulary|)
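Putting the learning slides together, a minimal training sketch (assuming docs is a list of token lists with a parallel list of labels; the function and variable names are illustrative, not from the lecture):

from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    vocab = {w for tokens in docs for w in tokens}          # extract Vocabulary
    prior, likelihood = {}, {}
    for c in set(labels):
        class_docs = [t for t, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)              # P(c_j)
        counts = Counter(w for t in class_docs for w in t)  # Text_j (mega-document)
        n = sum(counts.values())                            # total tokens in Text_j
        likelihood[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocab))
                         for w in vocab}                    # P(w_k | c_j)
    return prior, likelihood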
Exercise
Naïve Bayes Classification: Practical Issues

• Multiplying together lots of probabilities
• Probabilities are numbers between 0 and 1
• Q: What could go wrong here?

c_MAP = argmax_c P(c|x1, …, xn)
      = argmax_c P(x1, …, xn|c) P(c)
      = argmax_c P(c) Π_{i=1}^{n} P(xi|c)
Working with probabilities in log space

Log Identities (review):
log(a × b) = log(a) + log(b)
log(a / b) = log(a) − log(b)
log(a^n) = n log(a)
Naïve Bayes with Log Probabilities

c_MAP = argmax_c P(c|x1, …, xn)
      = argmax_c P(c) Π_{i=1}^{n} P(xi|c)
      = argmax_c log( P(c) Π_{i=1}^{n} P(xi|c) )
      = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi|c)
Naïve Bayes with Log Probabilities

c_MAP = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi|c)

• Q: Why don't we have to worry about floating point underflow anymore?
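A sketch of the same decision rule in log space (hypothetical helper names; log_prior and log_likelihood would hold log P(c) and log P(w|c)). Sums of moderately sized negative numbers stay well inside floating-point range, which is why underflow stops being a concern:

import math

def log_score(doc_tokens, c, log_prior, log_likelihood):
    # log P(c) + sum_i log P(x_i | c): a sum of logs replaces a product of probs
    return log_prior[c] + sum(log_likelihood[c][w] for w in doc_tokens)

# The underflow being avoided, in miniature:
p = 1e-3
print(p ** 2000)             # 0.0 (the product underflows to zero)
print(2000 * math.log(p))    # -13815.51..., easily representable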
What if we want to calculate posterior log-probabilities?

P(c|x1, …, xn) = P(c) Π_{i=1}^{n} P(xi|c) / Σ_{c′} P(c′) Π_{i=1}^{n} P(xi|c′)

log P(c|x1, …, xn) = log P(c) + Σ_{i=1}^{n} log P(xi|c) − log[ Σ_{c′} P(c′) Π_{i=1}^{n} P(xi|c′) ]
Log-Sum-Exp Trick: motivation

• We have: a bunch of log probabilities
  – log(p1), log(p2), log(p3), … log(pn)
• We want: log(p1 + p2 + p3 + … + pn)
• We could convert back from log space, sum, then take the log
  – If the probabilities are very small, this will result in floating point underflow
Log-Sum-Exp Trick:

log[ Σ_i exp(xi) ] = x_max + log[ Σ_i exp(xi − x_max) ]
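A minimal implementation of the trick (a sketch): after subtracting the max, at least one term is exp(0) = 1, so the sum never underflows to zero.

import math

def log_sum_exp(xs):
    # log(sum_i exp(x_i)), computed stably by factoring out the largest term
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

# E.g., normalizing the posterior from the previous slide (made-up joint scores;
# naively, math.exp(-1005.2) would already underflow to 0.0):
log_joint = [-1005.2, -1001.7, -1009.3]
log_posteriors = [x - log_sum_exp(log_joint) for x in log_joint]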
Another issue: Smoothing

P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w′∈V} count(w′, c) + |V|)
Another issue: Smoothing

P̂(wi|c) = (count(wi, c) + α) / (Σ_{w′∈V} count(w′, c) + α|V|)

• Alpha doesn't necessarily need to be 1 (it's a hyperparameter)
• Can think of alpha as a "pseudocount": an imaginary number of times this word has been seen
• Q: What if alpha = 0?
• Q: What if alpha = 0.000001?
• Q: What happens as alpha gets very large?
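A quick numeric look at those three questions (made-up counts):

count, n, V = 5, 1000, 20000       # count(wi,c), total tokens in class c, |V|
for alpha in (0.0, 1e-6, 1.0, 1e6):
    print(alpha, (count + alpha) / (n + alpha * V))
# alpha = 0 recovers the MLE (zeros survive); alpha = 1e-6 barely changes it;
# as alpha gets very large, every estimate flattens toward the uniform
# 1/|V| = 5e-5, washing out the observed counts entirely.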
Overfitting
• Model cares too much about the training data
• How to check for overfitting?
  – Training vs. test accuracy
• The pseudocount parameter combats overfitting
Q: how to pick Alpha?
• Split train vs. test
• Try a bunch of different values
• Pick the value of alpha that performs best
• What values to try? Grid search
  – (10^-2, 10^-1, …, 10^2)

[Plot: accuracy vs. α, with an arrow marking the best-performing value: "Use this one"]
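A sketch of that grid search (assuming the illustrative train_multinomial_nb from earlier plus a hypothetical predict(doc, prior, likelihood) helper built from the scoring rule; tuning is done on held-out data, as the next slide recommends):

def pick_alpha(train_docs, train_labels, dev_docs, dev_labels):
    best_alpha, best_acc = None, -1.0
    for alpha in (1e-2, 1e-1, 1.0, 1e1, 1e2):       # log-spaced grid
        prior, likelihood = train_multinomial_nb(train_docs, train_labels, alpha)
        acc = sum(predict(d, prior, likelihood) == y
                  for d, y in zip(dev_docs, dev_labels)) / len(dev_docs)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc       # "use this one"
    return best_alpha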
Data Splitting
• Train vs. test
• Better:
  – Train (used for fitting model parameters)
  – Dev (used for tuning hyperparameters)
  – Test (reserve for final evaluation)
• Cross-validation
Feature Engineering
• What is your word/feature representation? (a tokenizer sketch illustrating these choices follows below)
  – Tokenization rules: splitting on whitespace?
  – Uppercase is the same as lowercase?
  – Numbers?
  – Punctuation?
  – Stemming?
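An illustrative tokenizer exposing these choices as flags (every decision here is an assumption for demonstration, not the lecture's prescription):

import re

def tokenize(text, lowercase=True, keep_numbers=False, keep_punct=False):
    if lowercase:
        text = text.lower()                    # fold case: "Great" == "great"
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words, plus punctuation marks
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if not keep_punct:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens

print(tokenize("Great movie!!! 10/10"))        # -> ['great', 'movie']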