TRANSCRIPT
Instructor: Alan Ritter
Probability Review and Naïve Bayes
Some slides adapted from Dan Jurafsky and Brendan O'Connor
What is Probability?
• "The probability the coin will land heads is 0.5" – Q: what does this mean?
• 2 interpretations:
  – Frequentist (repeated trials)
    • If we flip the coin many times…
  – Bayesian
    • We believe there is an equal chance of heads/tails
    • Advantage: covers events that do not have long-term frequencies
      Q: What is the probability the polar ice caps will melt by 2050?
Probability Review

Probabilities sum to one:
Σ_x P(X = x) = 1

Conditional Probability:
P(A|B) = P(A, B) / P(B)

Chain Rule:
P(A|B) P(B) = P(A, B)
Probability Review

Disjunction / Union:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Negation:
P(¬A) = 1 − P(A)

Marginalization:
Σ_x P(X = x, Y) = P(Y)
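A quick numeric sanity check of the identities above (a sketch; the joint distribution is made up for illustration):

# Toy joint distribution over two binary variables A, B (made-up numbers).
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

P_A1 = sum(p for (a, b), p in P.items() if a == 1)   # marginal P(A=1) = 0.5
P_B1 = sum(p for (a, b), p in P.items() if b == 1)   # marginal P(B=1) = 0.6
P_A1_given_B1 = P[(1, 1)] / P_B1                     # conditional P(A=1 | B=1)

assert abs(sum(P.values()) - 1.0) < 1e-9             # probabilities sum to one
assert abs(P_A1_given_B1 * P_B1 - P[(1, 1)]) < 1e-9  # chain rule
# Union: P(A=1 or B=1) = P(A=1) + P(B=1) - P(A=1, B=1) = 1 - P(A=0, B=0)
assert abs((P_A1 + P_B1 - P[(1, 1)]) - (1 - P[(0, 0)])) < 1e-9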
Bayes Rule

[Diagram: Hypothesis H (unknown) generates Data D (observed evidence)]
P(D|H): generative model of how the hypothesis causes the data
P(H|D): Bayesian inference

P(H|D) = P(D|H) P(H) / P(D)

Bayes Rule tells us how to flip the conditional: reason from effects back to causes. Useful if you assume a generative model for your data.
Bayes Rule

P(H|D) = P(D|H) P(H) / P(D)
posterior = likelihood × prior / normalizer

Writing out the normalizer:
P(H|D) = P(D|H) P(H) / Σ_h P(D|h) P(h)

Dropping the normalizer:
P(H|D) ∝ P(D|H) P(H)
("proportional to" – the right-hand side doesn't sum to 1)
Bayes Rule Example
• There is a disease that affects a tiny fraction of the population (0.01%)
• Symptoms include a headache and stiff neck
  – 99% of patients with the disease have these symptoms
• 1% of the general population has these symptoms
• Q: assuming you have the symptoms, what is the probability you have the disease?
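A worked computation of the example (a Python sketch; the numbers come straight from the slide):

p_disease = 0.0001                 # prior: 0.01% of the population
p_symptoms_given_disease = 0.99    # likelihood of symptoms given the disease
p_symptoms = 0.01                  # 1% of the general population has the symptoms

# Bayes rule: P(disease | symptoms) = P(symptoms | disease) P(disease) / P(symptoms)
posterior = p_symptoms_given_disease * p_disease / p_symptoms
print(posterior)                   # 0.0099, i.e. only about 1%, despite the 99% above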
Text Classification

Is this Spam?

Who wrote which Federalist papers?
• 1787–8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

[Figure: a MEDLINE article mapped into the MeSH subject category hierarchy]
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Text Classification: definition
• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  – spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  – If rules are carefully refined by an expert
• But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
  – a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  – a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
  – Naïve Bayes
  – Logistic regression
  – Support-vector machines
  – k-Nearest Neighbors
  – …
Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on a very simple representation of the document
  – Bag of words
Bag of words for document classification

[Figure: candidate categories with characteristic word lists – Machine Learning (learning, training, algorithm, shrinkage, network, …), NLP (parser, tag, training, translation, language, …), Garbage Collection (garbage, collection, memory, optimization, region, …), Planning (planning, temporal, reasoning, plan, language, …), GUI – and a test document containing "parser, language, label, translation, …". Which category does it belong to?]
Bayes' Rule Applied to Documents and Classes

• For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)
Naïve Bayes Classifier (I)

MAP is "maximum a posteriori" = most likely class

c_MAP = argmax_{c∈C} P(c|d)
      = argmax_{c∈C} P(d|c) P(c) / P(d)      (Bayes Rule)
      = argmax_{c∈C} P(d|c) P(c)             (dropping the denominator)
Naïve Bayes Classifier (II)

c_MAP = argmax_{c∈C} P(d|c) P(c)
      = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)      (document d represented as features x1 … xn)
Naïve Bayes Classifier (IV)

c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

P(c): "How often does this class occur?" We can just count the relative frequencies in a corpus.
P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available. (E.g., with a 20,000-word vocabulary and 100-word documents, that is 20000^100 parameter settings per class.)
Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c.
Multinomial Naïve Bayes Classifier

c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

c_NB = argmax_{c∈C} P(cj) Π_{x∈X} P(x|c)
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

c_NB = argmax_{cj∈C} P(cj) Π_{i∈positions} P(xi|cj)
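A tiny numeric sketch of this decision rule (the priors and word probabilities below are made up for illustration):

prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.02, "boring": 0.001},
    "neg": {"great": 0.002, "boring": 0.01},
}
doc = ["great", "great", "boring"]        # word positions in the test document

scores = {}
for c in prior:
    score = prior[c]                      # P(c_j)
    for w in doc:                         # product over positions of P(x_i | c_j)
        score *= likelihood[c][w]
    scores[c] = score
print(max(scores, key=scores.get))        # -> "pos"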
Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates
  – simply use the frequencies in the data

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

P̂(cj) = doccount(C = cj) / N_doc

• Create a mega-document for topic j by concatenating all docs in this topic
  – Use frequency of w in the mega-document
Parameter estimation

P̂(wi|cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

= fraction of times word wi appears among all words in documents of topic cj
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence: a single zero factor drives the whole product to zero.

c_MAP = argmax_c P̂(c) Π_i P̂(xi|c)
Laplace (add-1) smoothing for Naïve Bayes

P̂(wi|c) = count(wi, c) / Σ_{w∈V} count(w, c)

becomes

P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
Multinomial Naïve Bayes: Learning

• Calculate the P(cj) terms
  – For each cj in C do
      docsj ← all docs with class = cj

P(cj) ← |docsj| / |total # documents|
Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate the P(wk|cj) terms
  – Textj ← single doc containing all docsj
  – For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      n ← total # of word occurrences in Textj

P(wk|cj) ← (nk + α) / (n + α |Vocabulary|)
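Putting the learning slides together, a minimal training sketch (assuming docs is a list of token lists with a parallel list of labels; the function and variable names are illustrative, not from the lecture):

from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    vocab = {w for tokens in docs for w in tokens}          # extract Vocabulary
    prior, likelihood = {}, {}
    for c in set(labels):
        class_docs = [t for t, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)              # P(c_j)
        counts = Counter(w for t in class_docs for w in t)  # Text_j (mega-document)
        n = sum(counts.values())                            # total tokens in Text_j
        likelihood[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocab))
                         for w in vocab}                    # P(w_k | c_j)
    return prior, likelihood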
Exercise
Naïve Bayes Classification: Practical Issues

• Multiplying together lots of probabilities
• Probabilities are numbers between 0 and 1
• Q: What could go wrong here?

c_MAP = argmax_c P(c|x1, …, xn)
      = argmax_c P(x1, …, xn|c) P(c)
      = argmax_c P(c) Π_{i=1}^{n} P(xi|c)
Working with probabilities in log space

Log Identities (review):
log(a × b) = log(a) + log(b)
log(a / b) = log(a) − log(b)
log(a^n) = n log(a)
Naïve Bayes with Log Probabilities

c_MAP = argmax_c P(c|x1, …, xn)
      = argmax_c P(c) Π_{i=1}^{n} P(xi|c)
      = argmax_c log( P(c) Π_{i=1}^{n} P(xi|c) )
      = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi|c)
Naïve Bayes with Log Probabilities

c_MAP = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi|c)

• Q: Why don't we have to worry about floating point underflow anymore?
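A sketch of the same decision rule in log space (hypothetical helper names; log_prior and log_likelihood would hold log P(c) and log P(w|c)). Sums of moderately sized negative numbers stay well inside floating-point range, which is why underflow stops being a concern:

import math

def log_score(doc_tokens, c, log_prior, log_likelihood):
    # log P(c) + sum_i log P(x_i | c): a sum of logs replaces a product of probs
    return log_prior[c] + sum(log_likelihood[c][w] for w in doc_tokens)

# The underflow being avoided, in miniature:
p = 1e-3
print(p ** 2000)             # 0.0 (the product underflows to zero)
print(2000 * math.log(p))    # -13815.51..., easily representable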
What if we want to calculate posterior log-probabilities?

P(c|x1, …, xn) = P(c) Π_{i=1}^{n} P(xi|c) / Σ_{c′} P(c′) Π_{i=1}^{n} P(xi|c′)

log P(c|x1, …, xn) = log P(c) + Σ_{i=1}^{n} log P(xi|c) − log[ Σ_{c′} P(c′) Π_{i=1}^{n} P(xi|c′) ]
Log-Sum-Exp Trick: motivation

• We have: a bunch of log probabilities
  – log(p1), log(p2), log(p3), … log(pn)
• We want: log(p1 + p2 + p3 + … + pn)
• We could convert back from log space, sum, then take the log
  – If the probabilities are very small, this will result in floating point underflow
Log-Sum-Exp Trick:

log[ Σ_i exp(xi) ] = x_max + log[ Σ_i exp(xi − x_max) ]
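A minimal implementation of the trick (a sketch): after subtracting the max, at least one term is exp(0) = 1, so the sum never underflows to zero.

import math

def log_sum_exp(xs):
    # log(sum_i exp(x_i)), computed stably by factoring out the largest term
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

# E.g., normalizing the posterior from the previous slide (made-up joint scores;
# naively, math.exp(-1005.2) would already underflow to 0.0):
log_joint = [-1005.2, -1001.7, -1009.3]
log_posteriors = [x - log_sum_exp(log_joint) for x in log_joint]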
Another issue: Smoothing

P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w′∈V} count(w′, c) + |V|)
Another issue: Smoothing

P̂(wi|c) = (count(wi, c) + α) / (Σ_{w′∈V} count(w′, c) + α|V|)

• Alpha doesn't necessarily need to be 1 (it's a hyperparameter)
• Can think of alpha as a "pseudocount": an imaginary number of times this word has been seen
• Q: What if alpha = 0?
• Q: What if alpha = 0.000001?
• Q: What happens as alpha gets very large?
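A quick numeric look at those three questions (made-up counts):

count, n, V = 5, 1000, 20000       # count(wi,c), total tokens in class c, |V|
for alpha in (0.0, 1e-6, 1.0, 1e6):
    print(alpha, (count + alpha) / (n + alpha * V))
# alpha = 0 recovers the MLE (zeros survive); alpha = 1e-6 barely changes it;
# as alpha gets very large, every estimate flattens toward the uniform
# 1/|V| = 5e-5, washing out the observed counts entirely.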
Overfitting
• Model cares too much about the training data
• How to check for overfitting?
  – Training vs. test accuracy
• The pseudocount parameter combats overfitting
Q: how to pick Alpha?
• Split train vs. test
• Try a bunch of different values
• Pick the value of alpha that performs best
• What values to try? Grid search
  – (10^-2, 10^-1, …, 10^2)

[Plot: accuracy vs. α, with an arrow marking the best-performing value: "Use this one"]
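A sketch of that grid search (assuming the illustrative train_multinomial_nb from earlier plus a hypothetical predict(doc, prior, likelihood) helper built from the scoring rule; tuning is done on held-out data, as the next slide recommends):

def pick_alpha(train_docs, train_labels, dev_docs, dev_labels):
    best_alpha, best_acc = None, -1.0
    for alpha in (1e-2, 1e-1, 1.0, 1e1, 1e2):       # log-spaced grid
        prior, likelihood = train_multinomial_nb(train_docs, train_labels, alpha)
        acc = sum(predict(d, prior, likelihood) == y
                  for d, y in zip(dev_docs, dev_labels)) / len(dev_docs)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc       # "use this one"
    return best_alpha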
Data Splitting
• Train vs. test
• Better:
  – Train (used for fitting model parameters)
  – Dev (used for tuning hyperparameters)
  – Test (reserve for final evaluation)
• Cross-validation
Feature Engineering
• What is your word/feature representation? (a tokenizer sketch illustrating these choices follows below)
  – Tokenization rules: splitting on whitespace?
  – Uppercase is the same as lowercase?
  – Numbers?
  – Punctuation?
  – Stemming?
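An illustrative tokenizer exposing these choices as flags (every decision here is an assumption for demonstration, not the lecture's prescription):

import re

def tokenize(text, lowercase=True, keep_numbers=False, keep_punct=False):
    if lowercase:
        text = text.lower()                    # fold case: "Great" == "great"
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words, plus punctuation marks
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if not keep_punct:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens

print(tokenize("Great movie!!! 10/10"))        # -> ['great', 'movie']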