Logistic Regression - svivek


  • Machine Learning

    Logistic Regression

    1

  • Where are we?

    We have seen the following ideas
    – Linear models
    – Learning as loss minimization
    – Bayesian learning criteria (MAP and MLE estimation)
    – The Naïve Bayes classifier

    2

  • This lecture

    • Logistic regression

    • Connection to Naïve Bayes

    • Training a logistic regression classifier

    • Back to loss minimization

    3-4

  • Logistic Regression: Setup

    • The setting
    – Binary classification
    – Inputs: feature vectors 𝐱 ∈ ℝᵈ

  • Classification, but…

    The output y is discrete valued (−1 or 1)

    Instead of predicting the output, let us try to predict P(y = 1 | 𝐱)

    Expand the hypothesis space to functions whose output is in [0, 1]

    • Original problem: predict the discrete output y itself

  • The Sigmoid function

    The hypothesis space for logistic regression: all functions of the form

    σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))

    That is, a linear function, composed with a sigmoid function (the logistic function) σ

    What is the domain and the range of the sigmoid function?

    This is a reasonable choice. We will see why later.

    8-10

  • The Sigmoid function

    [Plot of σ(z) as a function of z]

    11

  • The Sigmoid function

    What is its derivative with respect to z?

    12-13
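For reference, the sigmoid's definition, its domain and range, and its derivative are standard facts that were not recovered from the extracted slides; they can be summarized as:

```latex
% assumes \usepackage{amsmath, amssymb}
\begin{align*}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad \sigma : \mathbb{R} \to (0, 1) \\
\frac{d\sigma}{dz} &= \frac{e^{-z}}{(1 + e^{-z})^{2}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
\end{align*}
```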

  • Predicting probabilities

    According to the logistic regression model, we have

    P(y = 1 | 𝐱, 𝐰) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))

    Or equivalently

    P(y | 𝐱, 𝐰) = 1 / (1 + exp(−y 𝐰ᵀ𝐱)), for y ∈ {−1, +1}

    Note that we are directly modeling P(y | 𝐱) rather than P(𝐱 | y) and P(y)

    14-18
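A minimal numeric sketch of these formulas in Python; the weight vector w and feature vector x below are made-up values for illustration, not data from the lecture:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and feature vectors, chosen only for illustration.
w = np.array([0.5, -1.0, 0.25])
x = np.array([1.0, 2.0, 4.0])

p_plus = sigmoid(w @ x)    # P(y = +1 | x, w)
p_minus = 1.0 - p_plus     # P(y = -1 | x, w)

# Compact form: P(y | x, w) = sigmoid(y * w^T x) for y in {-1, +1}
assert np.isclose(p_minus, sigmoid(-1 * (w @ x)))
print(p_plus, p_minus)
```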

  • Predicting a label with logistic regression

    • Compute P(y = 1 | 𝐱; 𝐰)

    • If this is greater than half, predict 1, else predict −1
    – What does this correspond to in terms of 𝐰ᵀ𝐱?
    – Prediction = sgn(𝐰ᵀ𝐱)

    19-20
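The correspondence the slide asks about follows from a one-line calculation: thresholding the predicted probability at one half is exactly taking the sign of 𝐰ᵀ𝐱.

```latex
% assumes \usepackage{amsmath}
\begin{align*}
P(y = 1 \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^{T}\mathbf{x})} > \frac{1}{2}
\;\Longleftrightarrow\; \exp(-\mathbf{w}^{T}\mathbf{x}) < 1
\;\Longleftrightarrow\; \mathbf{w}^{T}\mathbf{x} > 0
\end{align*}
```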

  • This lecture

    • Logistic regression

    • Connection to Naïve Bayes

    • Training a logistic regression classifier

    • Back to loss minimization

    21

  • Naïve Bayes and Logistic regression

    Remember that the naïve Bayes decision is a linear function:

    log [ P(y = −1 | 𝐱, 𝐰) / P(y = +1 | 𝐱, 𝐰) ] = −𝐰ᵀ𝐱

    Here, the P's represent the naïve Bayes posterior distribution, and 𝐰 can be used to calculate the priors and the likelihoods.

    That is, P(y = 1 | 𝐰, 𝐱) is computed using P(𝐱 | y = 1, 𝐰) and P(y = 1 | 𝐰)

    But we also know that P(y = +1 | 𝐱, 𝐰) = 1 − P(y = −1 | 𝐱, 𝐰)

    Substituting in the above expression, we get

    P(y = +1 | 𝐰, 𝐱) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))

    That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs

    Naïve Bayes is a generative model.

    Logistic Regression is the discriminative version.

    22-25
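Spelling out the substitution described on these slides (the reconstruction assumes the sign convention in which 𝐰ᵀ𝐱 is the log-odds of the positive class):

```latex
% assumes \usepackage{amsmath}
\begin{align*}
\log \frac{P(y = -1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} &= -\mathbf{w}^{T}\mathbf{x}
  && \text{(the na\"ive Bayes decision is linear)} \\
\frac{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} &= \exp(-\mathbf{w}^{T}\mathbf{x})
  && \text{(since } P(y=+1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y=-1 \mid \mathbf{x}, \mathbf{w})\text{)} \\
P(y = +1 \mid \mathbf{x}, \mathbf{w}) &= \frac{1}{1 + \exp(-\mathbf{w}^{T}\mathbf{x})} = \sigma(\mathbf{w}^{T}\mathbf{x})
\end{align*}
```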

  • This lecture

    • Logistic regression

    • Connection to Naïve Bayes

    • Training a logistic regression classifier
    – First: Maximum likelihood estimation
    – Then: Adding priors → Maximum a Posteriori estimation

    • Back to loss minimization

    26

  • Maximum likelihood estimation

    Let's get back to the problem of learning

    • Training data
    – S = {(𝐱ᵢ, yᵢ)}, m examples

    • What we want
    – Find a 𝐰 such that P(S | 𝐰) is maximized
    – We know that our examples are drawn independently and are identically distributed (i.i.d.)
    – How do we proceed?

    27

  • Maximum likelihood estimation

    argmax_𝐰 P(S | 𝐰) = argmax_𝐰 ∏ᵢ P(yᵢ | 𝐱ᵢ, 𝐰)

    The usual trick: convert products to sums by taking the log

    Recall that this works only because log is an increasing function and the maximizer will not change

    Equivalent to solving argmax_𝐰 Σᵢ log P(yᵢ | 𝐱ᵢ, 𝐰)

    But (by definition) we know that P(yᵢ | 𝐱ᵢ, 𝐰) = 1 / (1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))

    28-33
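Putting the steps of slides 28-33 together; the intermediate equations did not survive extraction, so this chain is reconstructed from the model definition above and the objective that appears later in the deck:

```latex
% assumes \usepackage{amsmath}
\begin{align*}
\arg\max_{\mathbf{w}} P(S \mid \mathbf{w})
  &= \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})
  && \text{(i.i.d.\ examples)} \\
  &= \arg\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})
  && \text{(log is increasing)} \\
  &= \arg\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\bigl(1 + \exp(-y_i\,\mathbf{w}^{T}\mathbf{x}_i)\bigr)
  && \text{(logistic model)} \\
  &= \arg\min_{\mathbf{w}} \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i\,\mathbf{w}^{T}\mathbf{x}_i)\bigr)
\end{align*}
```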

  • Maximum a posteriori estimation

    We could also add a prior on the weights

    Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation σ

    p(𝐰) = ∏ⱼ p(wⱼ) = ∏ⱼ (1 / (σ√(2π))) exp(−wⱼ² / (2σ²))

    34
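Taking the log of this prior shows where the l2 penalty used later comes from (a one-line calculation, not text from the slide):

```latex
% assumes \usepackage{amsmath}
\begin{align*}
\log p(\mathbf{w})
  = \sum_{j} \log \left( \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{w_j^{2}}{2\sigma^{2}} \right) \right)
  = -\frac{1}{2\sigma^{2}} \sum_{j} w_j^{2} \; + \; \text{constant}
\end{align*}
```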

  • MAP estimation for logistic regression

    Let us work through this procedure again to see what changes

    What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data)

    To maximize the posterior probability of the model given the data (i.e. to find the most probable model, given the data):

    P(𝐰 | S) ∝ P(S | 𝐰) P(𝐰)

    Learning by solving

    argmax_𝐰 P(𝐰 | S) = argmax_𝐰 P(S | 𝐰) P(𝐰)

    Take log to simplify

    max_𝐰 [ log P(S | 𝐰) + log P(𝐰) ]

    We have already expanded out the first term: Σᵢ −log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))

    35-44
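Combining the expanded log-likelihood with the log of the Gaussian prior gives the MAP learning problem, equivalently l2-regularized logistic loss minimization:

```latex
% assumes \usepackage{amsmath}
\begin{align*}
\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\bigl(1 + \exp(-y_i\,\mathbf{w}^{T}\mathbf{x}_i)\bigr) - \frac{1}{2\sigma^{2}} \sum_{j} w_j^{2}
\;\;\equiv\;\;
\min_{\mathbf{w}} \; \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i\,\mathbf{w}^{T}\mathbf{x}_i)\bigr) + \frac{1}{2\sigma^{2}} \, \lVert \mathbf{w} \rVert_{2}^{2}
\end{align*}
```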

  • Learning a logistic regression classifier

    Learning a logistic regression classifier is equivalent to solving

    min_𝐰 Σᵢ log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ))

    Where have we seen this before?

    The first question in the homework: write down the stochastic gradient descent algorithm for this.

    Historically, other training algorithms exist. In particular, you might run into LBFGS.

    45-47
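The slide leaves the SGD algorithm as a homework exercise. Purely as an editorial sketch, one possible stochastic gradient descent loop for this objective looks as follows; the learning rate, epoch count, optional l2 term, and the toy data are assumptions, not values from the lecture:

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=100, l2=0.0, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) (plus optional l2 penalty) by SGD.

    X: (m, d) array of feature vectors; y: (m,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (w @ X[i])
            # d/dw log(1 + exp(-margin)) = -sigmoid(-margin) * y_i * x_i
            grad = -(1.0 / (1.0 + np.exp(margin))) * y[i] * X[i] + l2 * w
            w -= lr * grad
    return w

# Toy usage on a small linearly separable problem (made-up data, for illustration only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w = sgd_logistic_regression(X, y)
print(np.sign(X @ w))  # should match y
```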

  • Logistic regression is…

    • A classifier that predicts the probability that the label is +1 for a particular input

    • The discriminative counterpart of the naïve Bayes classifier

    • A discriminative classifier that can be trained via MAP or MLE estimation

    • A discriminative classifier that minimizes the logistic loss over the training set

    48

  • This lecture

    • Logistic regression

    • Connection to Naïve Bayes

    • Training a logistic regression classifier

    • Back to loss minimization

    49

  • Learning as loss minimization

    • The setup
    – Examples 𝐱 drawn from a fixed, unknown distribution D
    – A hidden oracle classifier f labels the examples
    – We wish to find a hypothesis h that mimics f

    • The ideal situation
    – Define a function L that penalizes bad hypotheses
    – Learning: pick a function h ∈ H to minimize the expected loss
    – But the distribution D is unknown

    • Instead, minimize empirical loss on the training set

    50
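In symbols, the ideal objective and the empirical substitute described on this slide are as follows; the notation (D, f, L, m) follows the slide's own setup, but the formulas themselves were not extracted:

```latex
% assumes \usepackage{amsmath, amssymb}
\begin{align*}
\text{Ideal:}\quad & \min_{h \in H} \; \mathbb{E}_{\mathbf{x} \sim D}\bigl[ L\bigl(h(\mathbf{x}), f(\mathbf{x})\bigr) \bigr] \\
\text{Instead:}\quad & \min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\bigl(h(\mathbf{x}_i), y_i\bigr)
\end{align*}
```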

  • Empirical loss minimization

    Learning = minimize empirical loss on the training set

    Is there a problem here? Overfitting!

    We need something that biases the learner towards simpler hypotheses
    • Achieved using a regularizer, which penalizes complex hypotheses

    51-52

  • Regularized loss minimization

    • Learning: minimize the empirical loss plus a regularization penalty

    • With linear classifiers: penalize ‖𝐰‖² (using l2 regularization)

    • What is a loss function?
    – Loss functions should penalize mistakes
    – We are minimizing average loss over the training data

    • What is the ideal loss function for classification?

    53
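The slide's formulas were not extracted; with a linear hypothesis and l2 regularization, the learning problem has the following general shape, where λ (an assumed symbol) trades off the loss against the regularizer:

```latex
% assumes \usepackage{amsmath}
\begin{align*}
\min_{\mathbf{w}} \;\; \frac{1}{m} \sum_{i=1}^{m} L\bigl(y_i, \mathbf{w}^{T}\mathbf{x}_i\bigr) \; + \; \lambda \, \lVert \mathbf{w} \rVert_{2}^{2}
\end{align*}
```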

  • The 0-1 loss

    Penalize classification mistakes between the true label y and the prediction y'

    • For linear classifiers, the prediction y' = sgn(𝐰ᵀ𝐱)
    – Mistake if y𝐰ᵀ𝐱 ≤ 0

    Minimizing the 0-1 loss is intractable. Need surrogates.

    54
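Written out, the 0-1 loss for a linear classifier is an indicator of the mistake condition above:

```latex
% assumes \usepackage{amsmath}
\begin{align*}
L_{0\text{-}1}\bigl(y, \mathbf{w}^{T}\mathbf{x}\bigr) =
\begin{cases}
1 & \text{if } y\,\mathbf{w}^{T}\mathbf{x} \le 0 \\
0 & \text{otherwise}
\end{cases}
\end{align*}
```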

  • The loss function zoo

    Many loss functions exist
    – Perceptron loss
    – Hinge loss (SVM)
    – Exponential loss (AdaBoost)
    – Logistic loss (logistic regression)

    55
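The formulas for these losses are not in the extracted text; written as functions of the margin y𝐰ᵀ𝐱, their standard forms are:

```latex
% assumes \usepackage{amsmath}
\begin{align*}
L_{\text{perceptron}}  &= \max\bigl(0, \, -y\,\mathbf{w}^{T}\mathbf{x}\bigr) \\
L_{\text{hinge}}       &= \max\bigl(0, \, 1 - y\,\mathbf{w}^{T}\mathbf{x}\bigr) \\
L_{\text{exponential}} &= \exp\bigl(-y\,\mathbf{w}^{T}\mathbf{x}\bigr) \\
L_{\text{logistic}}    &= \log\bigl(1 + \exp(-y\,\mathbf{w}^{T}\mathbf{x})\bigr)
\end{align*}
```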

  • The loss function zoo

    [Plots of the zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic losses as functions of y𝐰ᵀ𝐱, shown close up and then zoomed out]

    56-63
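The plots on slides 56-63 can be reproduced with a short script. This is one possible sketch, assuming matplotlib and NumPy are available; the axis limits are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin values m = y * w^T x on the horizontal axis
m = np.linspace(-4, 4, 400)

losses = {
    "Zero-one": (m <= 0).astype(float),
    "Perceptron": np.maximum(0.0, -m),
    "Hinge: SVM": np.maximum(0.0, 1.0 - m),
    "Exponential: AdaBoost": np.exp(-m),
    "Logistic regression": np.log(1.0 + np.exp(-m)),
}

for name, values in losses.items():
    plt.plot(m, values, label=name)

plt.xlabel("y * w^T x")
plt.ylabel("loss")
plt.ylim(0, 4)   # zoom in; increase to "zoom out" as in the later slides
plt.legend()
plt.show()
```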