
Machine Learning
Logistic Regression
1

Where are we?
We have seen the following ideas:
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
– The Naïve Bayes classifier
2

This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization
3

Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: feature vectors $\mathbf{x} \in \Re^d$
– Labels: $y \in \{-1, +1\}$
• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, m examples
5

Classification, but…
The output y is discrete valued (-1 or 1)
Instead of predicting the output, let us try to predict $P(y = 1 \mid \mathbf{x})$
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: $\Re^d \rightarrow \{-1, 1\}$
• Modified problem: $\Re^d \rightarrow [0, 1]$
• Effectively, this makes the problem a regression problem
Many hypothesis spaces are possible
6

The Sigmoid function
The hypothesis space for logistic regression: all functions of the form
$\mathbf{x} \mapsto \sigma(\mathbf{w}^T\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
That is, a linear function, composed with a sigmoid function (the logistic function) σ
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
8
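
To answer the question above: the sigmoid is defined on all of ℝ and its outputs lie strictly between 0 and 1. A minimal sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Domain: any real number. Range: the open interval (0, 1).
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # values close to 0 ... 0.5 ... values close to 1
```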

The Sigmoid function
[Plot: σ(z) as a function of z; an S-shaped curve increasing from 0 to 1]
11

The Sigmoid function
12
What is its derivative with respect to z?
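
The standard identity here is σ′(z) = σ(z)(1 − σ(z)). A small numerical check of that fact, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points.
z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(sigmoid_derivative(z), numeric))  # True
```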

Predicting probabilities
According to the logistic regression model, we have
$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Or equivalently, for $y \in \{-1, +1\}$,
$P(y \mid \mathbf{x}, \mathbf{w}) = \dfrac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$
18
Note that we are directly modeling $P(y \mid \mathbf{x})$ rather than $P(\mathbf{x} \mid y)$ and $P(y)$
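
A small sketch of these two (equivalent) ways of computing the probability, assuming NumPy and some given weight vector w (the numbers below are hypothetical):

```python
import numpy as np

def prob_y_plus1(w, x):
    """P(y = +1 | x, w) = sigma(w^T x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def prob_y(w, x, y):
    """P(y | x, w) for y in {-1, +1}, written in the symmetric form."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.25])
# The two forms agree: P(y=+1|x) equals 1 - P(y=-1|x)
print(prob_y_plus1(w, x), prob_y(w, x, +1), 1 - prob_y(w, x, -1))
```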

Predicting a label with logistic regression
• Compute $P(y = 1 \mid \mathbf{x}; \mathbf{w})$
• If this is greater than half, predict 1; else predict -1
– What does this correspond to in terms of $\mathbf{w}^T\mathbf{x}$?
– Prediction = sgn($\mathbf{w}^T\mathbf{x}$)
20
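
Why the two rules agree (a one-line property of the sigmoid, spelled out here):

```latex
P(y = 1 \mid \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) > \tfrac{1}{2}
\iff \exp(-\mathbf{w}^T\mathbf{x}) < 1
\iff \mathbf{w}^T\mathbf{x} > 0
```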

This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization
21

Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log \dfrac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
Here, the P's represent the naïve Bayes posterior distribution, and $\mathbf{w}$ can be used to calculate the priors and the likelihoods.
That is, $P(y = 1 \mid \mathbf{w}, \mathbf{x})$ is computed using $P(\mathbf{x} \mid y = 1, \mathbf{w})$ and $P(y = 1 \mid \mathbf{w})$
22

Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log \dfrac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$
Substituting into the expression above, we get
$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
25
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
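
Written out, the substitution is a short derivation (abbreviating $p = P(y = +1 \mid \mathbf{x}, \mathbf{w})$):

```latex
\log\frac{p}{1-p} = \mathbf{w}^T\mathbf{x}
\;\Rightarrow\; \frac{p}{1-p} = e^{\mathbf{w}^T\mathbf{x}}
\;\Rightarrow\; p = \frac{e^{\mathbf{w}^T\mathbf{x}}}{1 + e^{\mathbf{w}^T\mathbf{x}}}
= \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}} = \sigma(\mathbf{w}^T\mathbf{x})
```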

This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
– First: maximum likelihood estimation
– Then: adding priors → maximum a posteriori estimation
• Back to loss minimization
26

Maximum likelihood estimation
Let's get back to the problem of learning.
• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, m examples
• What we want
– Find a $\mathbf{w}$ such that $P(S \mid \mathbf{w})$ is maximized
– We know that our examples are drawn independently and are identically distributed (i.i.d.)
– How do we proceed?
27

Maximum likelihood estimation
28
The usual trick: convert products to sums by taking the log.
Recall that this works only because log is an increasing function and the maximizer will not change.
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$

Maximum likelihood estimation
29
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
Equivalent to solving
$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

Maximum likelihood estimation
30
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
But (by definition) we know that
$P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sigma(y_i \mathbf{w}^T\mathbf{x}_i) = \dfrac{1}{1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)}$

Maximum likelihood estimation
33
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
$P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \dfrac{1}{1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)}$
Equivalent to solving
$\max_{\mathbf{w}} \sum_{i} -\log\left(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\right)$
The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
Equivalent to: training a linear classifier by minimizing the logistic loss.
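
A sketch of that final objective as code (assuming NumPy, with X an m-by-d matrix of inputs and y a vector of ±1 labels); maximizing the sum above is the same as minimizing the total logistic loss:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Negative log-likelihood under the logistic model:
    sum_i log(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)                      # y_i * w^T x_i for each example
    return np.sum(np.log1p(np.exp(-margins)))  # log1p(z) = log(1 + z)

# Tiny hypothetical dataset: 3 examples, 2 features, labels in {-1, +1}
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]])
y = np.array([+1, -1, +1])
w = np.zeros(2)
print(logistic_loss(w, X, y))  # equals 3 * log(2) when w = 0
```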

Maximum a posteriori estimation
We could also add a prior on the weights.
Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation σ:
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
34

MAP estimation for logistic regression
36
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Let us work through this procedure again to see what changes.
What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)

MAP estimation for logistic regression
37
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):
$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w}) \, P(\mathbf{w})$

MAP estimation for logistic regression
38
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Learning by solving
$\arg\max_{\mathbf{w}} P(\mathbf{w} \mid S) = \arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) \, P(\mathbf{w})$

MAP estimation for logistic regression
39
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Learning by solving
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) \, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$

MAP estimation for logistic regression
40
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Learning by solving
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) \, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
We have already expanded out the first term:
$\sum_{i} -\log\left(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\right)$

MAP estimation for logistic regression
41
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Learning by solving
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) \, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
Expand the log prior:
$\sum_{i} -\log\left(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\right) + \sum_{j=1}^{d} -\frac{w_j^2}{\sigma^2} + \text{constants}$

MAP estimation for logistic regression
44
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
Learning by solving
$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) \, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
$\max_{\mathbf{w}} \; \sum_{i} -\log\left(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\right) - \frac{1}{\sigma^2} \mathbf{w}^T\mathbf{w}$
Maximizing a negative function is the same as minimizing the function.

Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving
$\min_{\mathbf{w}} \; \sum_{i} \log\left(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\right) + \frac{1}{\sigma^2} \mathbf{w}^T\mathbf{w}$
47
Where have we seen this before?
The first question in the homework: write down the stochastic gradient descent algorithm for this.
Historically, other training algorithms exist. In particular, you might run into LBFGS.
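
A rough sketch of what such an SGD loop could look like (not the homework solution; it assumes NumPy, a fixed learning rate lr, and it spreads the regularizer's gradient evenly across the m examples, all of which are choices rather than things stated on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.1, epochs=100, seed=0):
    """SGD on: sum_i log(1 + exp(-y_i w^T x_i)) + (1 / sigma2) * w^T w."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            # Gradient of the i-th loss term: -sigmoid(-margin) * y_i * x_i,
            # plus the i-th share of the regularizer's gradient (2/sigma2) * w / m.
            grad = -sigmoid(-margin) * y[i] * X[i] + (2.0 / (sigma2 * m)) * w
            w -= lr * grad
    return w
```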

Logistic regression is…
• A classifier that predicts the probability that the label is +1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via MAP or MLE estimation
• A discriminative classifier that minimizes the logistic loss over the training set
48

This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization
49

Learning as loss minimization
• The setup
– Examples x are drawn from a fixed, unknown distribution D
– A hidden oracle classifier f labels the examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: pick a function h ∈ H to minimize the expected loss
– But the distribution D is unknown
• Instead, minimize the empirical loss on the training set
50
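
In symbols (a standard formulation; the notation here is assumed rather than taken from the slide):

```latex
\text{expected loss: } \min_{h \in H} \; \mathbb{E}_{x \sim D}\big[ L(h(x), f(x)) \big]
\qquad
\text{empirical loss: } \min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i)
```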

Empirical loss minimization
Learning = minimize the empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses.
• This is achieved using a regularizer, which penalizes complex hypotheses.
52

Regularized loss minimization
• Learning: minimize the average loss over the training data, plus a regularizer
• With linear classifiers: the same objective, using l2 regularization
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing the average loss over the training data
• What is the ideal loss function for classification?
53
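
The equations on this slide did not survive extraction; a generic form consistent with the rest of the lecture (the exact constants on the original slide may differ) would be:

```latex
\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^{m} L\big(y_i, \mathbf{w}^T\mathbf{x}_i\big) \;+\; \lambda\, \mathbf{w}^T\mathbf{w}
```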

The 0-1 loss
Penalize classification mistakes between the true label y and the prediction y′
• For linear classifiers, the prediction is y′ = sgn($\mathbf{w}^T\mathbf{x}$)
– Mistake if $y\,\mathbf{w}^T\mathbf{x} \leq 0$
Minimizing the 0-1 loss is intractable. We need surrogates.
54

The loss function zoo
Many loss functions exist:
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
55
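
A small sketch of these surrogates as functions of the margin z = y·wᵀx, using their common textbook forms (the exact scaling plotted on the original slides may differ):

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)          # 0-1 loss
def perceptron(z):  return np.maximum(0.0, -z)             # perceptron loss
def hinge(z):       return np.maximum(0.0, 1.0 - z)        # hinge loss (SVM)
def exponential(z): return np.exp(-z)                      # exponential loss (AdaBoost)
def logistic(z):    return np.log1p(np.exp(-z))            # logistic loss (logistic regression)

z = np.linspace(-2, 2, 5)
for name, fn in [("0-1", zero_one), ("perceptron", perceptron), ("hinge", hinge),
                 ("exponential", exponential), ("logistic", logistic)]:
    print(name, fn(z))
```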

The loss function zoo
[Plots, slides 56–63: the loss functions above plotted against the margin y·wᵀx. Successive slides overlay the zero-one, hinge (SVM), perceptron, exponential (AdaBoost), and logistic regression losses on the same axes, then show the curves zoomed out, and zoomed out even more.]