Binary Logistic Regression + Multinomial Logistic Regression (10-601 lecture slides: mgormley/courses/10601/slides/...)


  • Binary Logistic Regression+

    Multinomial Logistic Regression

    1

    10-601 Introduction to Machine Learning

    Matt Gormley, Lecture 10

    Feb. 17, 2020

    Machine Learning Department, School of Computer Science, Carnegie Mellon University

  • Reminders

    Midterm Exam 1: Tue, Feb. 18, 7:00pm – 9:00pm

    Homework 4: Logistic Regression. Out: Wed, Feb. 19. Due: Fri, Feb. 28 at 11:59pm

    Today's In-Class Poll: http://p10.mlcourse.org

    Reading on Probabilistic Learning is reused later in the course for MLE/MAP

    3

  • MLE

    5

    Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

    Maximum Likelihood Estimate (MLE)

    (Figure: a 1-D likelihood curve L(θ) with its maximizer θ^MLE, and a 2-D likelihood surface L(θ1, θ2) with its maximizer (θ1^MLE, θ2^MLE))

  • MLE

    What does maximizing likelihood accomplish? There is only a finite amount of probability mass (i.e. a sum-to-one constraint). MLE tries to allocate as much probability mass as possible to the things we have observed…

    …at the expense of the things we have not observed

    6
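    As a concrete illustration of the principle above, here is a minimal sketch (my own, not from the slides) of MLE for a Bernoulli model of coin flips, where the closed-form estimate is simply the observed fraction of heads:

    ```python
    import numpy as np

    # Hypothetical data: 1 = heads, 0 = tails
    flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

    # Log-likelihood of a Bernoulli(p) model: sum_i log p(x_i | p)
    def log_likelihood(p, x):
        return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    # MLE in closed form: the sample mean maximizes the likelihood
    p_mle = flips.mean()

    # Sanity check: the log-likelihood at p_mle beats nearby values of p
    for p in [0.5, 0.6, p_mle, 0.8]:
        print(f"p = {p:.2f}  log-likelihood = {log_likelihood(p, flips):.3f}")
    ```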

  • MOTIVATION: LOGISTIC REGRESSION

    7

  • Example: Image Classification

    ImageNet LSVRC-2010 contest:

    Dataset: 1.2 million labeled images, 1000 classes. Task: Given a new image, label it with the correct class. Multiclass classification problem.

    Examples from http://image-net.org/

    10

  • (Slides 11-13: example images from the ImageNet dataset)

  • Example: Image Classification

    14

    CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on ImageNet LSVRC-2010 contest

    Input image (pixels)

    Five convolutional layers (w/ max pooling), then three fully connected layers

    1000-way softmax

  • Example: Image Classification

    15

    CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on ImageNet LSVRC-2010 contest

    Input image (pixels)

    Five convolutional layers (w/ max pooling), then three fully connected layers

    1000-way softmax

    This "softmax" layer is Logistic Regression!

    The rest is just some fancy feature extraction (discussed later in the course)

  • LOGISTIC REGRESSION

    16

  • Logistic Regression

    17

    We are back to classification, despite the name logistic regression.

    Data: Inputs are continuous vectors of length M. Outputs are discrete.

  • Linear Models for Classification

    Key idea: Try to learn this hyperplane directly

    Directly modeling the hyperplane would use a decision function: h(x) = sign(θ^T x), for y ∈ {−1, +1}

    Looking ahead: We'll see a number of commonly used Linear Classifiers. These include:

    Perceptron
    Logistic Regression
    Naïve Bayes (under certain conditions)
    Support Vector Machines

  • Background: Hyperplanes

    Hyperplane (Definition 1):

    Hyperplane (Definition 2):

    Half-spaces:

    Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
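    A minimal sketch (my own illustration, not from the slides) of the notation trick above: prepend a constant 1 to each input so the bias folds into the weight vector.

    ```python
    import numpy as np

    # Original parameterization: score = w . x + b
    w = np.array([2.0, -1.0, 0.5])   # weights (hypothetical values)
    b = 0.3                          # bias
    x = np.array([1.0, 4.0, -2.0])

    # Folded parameterization: prepend b to w and 1 to x
    theta = np.concatenate(([b], w))       # theta = [b, w_1, ..., w_M]
    x_aug = np.concatenate(([1.0], x))     # x' = [1, x_1, ..., x_M]

    # Both give the same score
    assert np.isclose(w @ x + b, theta @ x_aug)
    ```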

  • Using gradient ascent for linear classifiers

    Key idea behind today's lecture:
    1. Define a linear classifier (logistic regression)
    2. Define an objective function (likelihood)
    3. Optimize it with gradient descent to learn parameters
    4. Predict the class with highest probability under the model

    20

  • Using gradient ascent for linear classifiers

    21

    This decision function isn't differentiable:

    sign(x)

    Use a differentiable function instead:

    logistic(u) = 1 / (1 + e^(−u))

  • Using gradient ascent for linear classifiers

    22

    This decision function isn't differentiable:

    sign(x)

    Use a differentiable function instead:

    logistic(u) = 1 / (1 + e^(−u))
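    A small sketch (my own, assuming NumPy) of the logistic function above, written in a numerically stable way so that large negative inputs do not overflow:

    ```python
    import numpy as np

    def logistic(u):
        """Numerically stable logistic (sigmoid): 1 / (1 + exp(-u))."""
        out = np.empty_like(u, dtype=float)
        pos = u >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-u[pos]))
        # For negative u, rewrite as exp(u) / (1 + exp(u)) to avoid overflow
        expu = np.exp(u[~pos])
        out[~pos] = expu / (1.0 + expu)
        return out

    u = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
    print(logistic(u))  # smoothly increases from ~0 to ~1, unlike sign(u)
    ```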

  • Logistic Regression

    23

    Data: Inputs are continuous vectors of length M. Outputs are discrete.

    Model: Logistic function applied to dot product of parameters with input vector.

    Learning: Finds the parameters that minimize some objective function.

    Prediction: Output is the most probable class.

  • Logistic Regression

    Whiteboard:
    Bernoulli interpretation
    Logistic Regression Model
    Decision boundary

    24

  • Learning for Logistic Regression

    Whiteboard:
    Partial derivative for Logistic Regression
    Gradient for Logistic Regression (see the sketch below)

    25
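    A hedged sketch (mine, not the whiteboard derivation) of the standard result: for the negative log-likelihood of binary logistic regression with labels y ∈ {0, 1}, the gradient contribution of one example is (σ(θ^T x) − y) x.

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def nll_gradient(theta, x, y):
        """Gradient of the per-example negative log-likelihood
        -[y log σ(θᵀx) + (1 - y) log(1 - σ(θᵀx))] with respect to θ."""
        return (sigmoid(theta @ x) - y) * x

    # Hypothetical example with the bias already folded into x (leading 1)
    theta = np.zeros(3)
    x = np.array([1.0, 2.0, -1.0])
    y = 1
    print(nll_gradient(theta, x, y))  # at θ = 0, σ(θᵀx) = 0.5, so gradient = -0.5 * x
    ```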

  • LOGISTIC REGRESSION ON GAUSSIAN DATA

    26

  • Logistic Regression

    (Slides 27-29: plots of logistic regression fit to data drawn from Gaussians)

  • LEARNING LOGISTIC REGRESSION

    30

  • Maximum Conditional Likelihood Estimation

    31

    Learning: Finds the parameters that minimize some objective function.

    We minimize the negative log conditional likelihood: J(θ) = −log ∏_{i=1}^N p(y^(i) | x^(i), θ)

    Why?
    1. We can't maximize likelihood (as in Naïve Bayes) because we don't have a joint model p(x, y)
    2. It worked well for Linear Regression (least squares is MCLE)
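    A minimal sketch (my own, assuming labels y ∈ {0, 1} and the logistic model above) of this negative log conditional likelihood over a dataset:

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def neg_log_cond_likelihood(theta, X, y):
        """J(θ) = -Σ_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],
        where p_i = σ(θᵀ x_i) is the model's probability that y_i = 1."""
        p = sigmoid(X @ theta)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Hypothetical toy data: N = 4 examples, M = 2 features (bias folded in)
    X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
    y = np.array([1, 0, 1, 0])
    print(neg_log_cond_likelihood(np.zeros(2), X, y))  # = 4 * log 2 at θ = 0
    ```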

  • Maximum Conditional Likelihood Estimation

    32

    Learning: Four approaches to solving the optimization problem:

    Approach 1: Gradient Descent
    (take larger, more certain, steps opposite the gradient)

    Approach 2: Stochastic Gradient Descent (SGD)
    (take many small steps opposite the gradient)

    Approach 3: Newton's Method
    (use second derivatives to better follow curvature)

    Approach 4: Closed Form???
    (set derivatives equal to zero and solve for parameters)

  • Maximum Conditional Likelihood Estimation

    33

    Learning: Four approaches to solving the optimization problem:

    Approach 1: Gradient Descent
    (take larger, more certain, steps opposite the gradient)

    Approach 2: Stochastic Gradient Descent (SGD)
    (take many small steps opposite the gradient)

    Approach 3: Newton's Method
    (use second derivatives to better follow curvature)

    Approach 4: Closed Form???
    (set derivatives equal to zero and solve for parameters)

    Logistic Regression does not have a closed-form solution for MLE parameters.

  • SGD for Logistic Regression

    34

    Question: Which of the following is a correct description of SGD for Logistic Regression?

    Answer: At each step (i.e. iteration) of SGD for Logistic Regression we…

    A. (1) compute the gradient of the log likelihood for all examples, (2) update all the parameters using the gradient
    B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
    C. (1) compute the gradient of the log likelihood for all examples, (2) randomly pick an example, (3) update only the parameters for that example
    D. (1) randomly pick a parameter, (2) compute the partial derivative of the log likelihood with respect to that parameter, (3) update that parameter for all examples
    E. (1) randomly pick an example, (2) compute the gradient of the log likelihood for that example, (3) update all the parameters using that gradient
    F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log likelihood for that example with respect to that parameter, (3) update that parameter using that gradient

  • Gradient Descent

    35

    In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).
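    To make the idea concrete, here is a minimal sketch (my own, not the course's reference code) of batch gradient descent for binary logistic regression, reusing the per-example gradient (σ(θ^T x) − y) x averaged over the whole training set:

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def gradient_descent(X, y, lr=0.1, num_epochs=100):
        """Batch GD: every update uses the full gradient over all N examples."""
        N, M = X.shape
        theta = np.zeros(M)
        for _ in range(num_epochs):
            grad = X.T @ (sigmoid(X @ theta) - y) / N  # gradient of the mean NLL
            theta -= lr * grad                         # step opposite the gradient
        return theta

    # Hypothetical toy dataset (bias folded in as the first column of ones)
    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([1, 1, 0, 0])
    theta = gradient_descent(X, y)
    print(theta, sigmoid(X @ theta))  # predicted probabilities move toward the labels
    ```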

  • Stochastic Gradient Descent (SGD)

    36

    We can also apply SGD to solve the MCLE problem for Logistic Regression.

    We need a per-example objective: J^(i)(θ) = −log p(y^(i) | x^(i), θ)
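    A minimal sketch (my own) of SGD with this per-example objective: each step samples one example and updates all parameters using only that example's gradient.

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def sgd(X, y, lr=0.1, num_epochs=50, seed=0):
        """SGD: each update uses the gradient of one randomly chosen example."""
        rng = np.random.default_rng(seed)
        N, M = X.shape
        theta = np.zeros(M)
        for _ in range(num_epochs):
            for i in rng.permutation(N):                        # random example order
                grad_i = (sigmoid(theta @ X[i]) - y[i]) * X[i]  # per-example gradient
                theta -= lr * grad_i                            # update ALL parameters
        return theta

    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([1, 1, 0, 0])
    print(sgd(X, y))
    ```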

  • Logistic Regression vs. Perceptron

    37

    Question: True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.

    Answer:

  • Matching Game

    Goal: Match the Algorithm to its Update Rule

    38

    1. SGD for Logistic Regression
    2. Least Mean Squares
    3. Perceptron

    4.
    5.
    6.

    A. 1=5, 2=4, 3=6
    B. 1=5, 2=6, 3=4
    C. 1=6, 2=4, 3=4
    D. 1=5, 2=6, 3=6
    E. 1=6, 2=6, 3=6
    F. 1=6, 2=5, 3=5
    G. 1=5, 2=5, 3=5
    H. 1=4, 2=5, 3=6

  • OPTIMIZATION METHOD #4: MINI-BATCH SGD

    39

  • Mini Batch SGD

    Gradient Descent: Compute true gradient exactly from all N examples.
    Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example.
    Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples.

    40
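    A minimal sketch (my own) of the mini-batch variant: each update averages the gradient over K randomly chosen examples, interpolating between the two extremes above (K = N recovers GD, K = 1 recovers SGD).

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def minibatch_sgd(X, y, K=2, lr=0.1, num_steps=200, seed=0):
        """Each step: draw K random examples, average their gradients, update."""
        rng = np.random.default_rng(seed)
        N, M = X.shape
        theta = np.zeros(M)
        for _ in range(num_steps):
            idx = rng.choice(N, size=K, replace=False)       # random mini-batch
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / K     # average gradient
            theta -= lr * grad
        return theta

    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([1, 1, 0, 0])
    print(minibatch_sgd(X, y))
    ```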

  • Mini Batch SGD

    41

    Three variants of first order optimization:

  • Summary

    1. Discriminative classifiers directly model the conditional, p(y|x)

    2. Logistic regression is a simple linear classifier that retains a probabilistic semantics

    3. Parameters in LR are learned by iterative optimization (e.g. SGD)

    50

  • Logistic Regression Objectives

    You should be able to…

    Apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
    Given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
    Explain the practical reasons why we work with the log of the likelihood
    Implement logistic regression for binary or multiclass classification
    Prove that the decision boundary of binary logistic regression is linear
    For linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood

    51

  • MULTINOMIAL LOGISTIC REGRESSION

    54

  • 55

  • Multinomial Logistic Regression

    Chalkboard:
    Background: Multinomial distribution
    Definition: Multi-class classification
    Geometric intuitions
    Multinomial logistic regression model (see the sketch after this list)
    Generative story
    Reduction to binary logistic regression
    Partial derivatives and gradients
    Applying Gradient Descent and SGD
    Implementation w/ sparse features

    56
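    As a concrete reference for the model listed above (a sketch of my own, not the chalkboard derivation): multinomial logistic regression gives each class k its own parameter vector θ_k and normalizes the scores with a softmax, p(y = k | x) = exp(θ_k^T x) / Σ_j exp(θ_j^T x).

    ```python
    import numpy as np

    def softmax(scores):
        """Stable softmax: subtract the max score before exponentiating."""
        z = scores - np.max(scores)
        e = np.exp(z)
        return e / e.sum()

    def predict_proba(Theta, x):
        """Multinomial LR: Theta is a K-by-M matrix, one row of parameters per class."""
        return softmax(Theta @ x)

    # Hypothetical example with K = 3 classes and M = 2 features
    Theta = np.array([[0.5, -1.0],
                      [0.0,  0.3],
                      [-0.2, 0.8]])
    x = np.array([1.0, 2.0])
    p = predict_proba(Theta, x)
    print(p, p.sum())        # probabilities over the 3 classes, summing to 1
    print(np.argmax(p))      # predicted class = most probable class
    ```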

  • Debug that Program!

    In-Class Exercise: Think-Pair-Share. Debug the following program, which is (incorrectly) attempting to run SGD for multinomial logistic regression.

    57

    Buggy Program:

    Assume: a provided routine returns the gradient of the negative log-likelihood of the training example (x[i], y[i]) with respect to the parameter vector. A learning rate is given. N = # of examples. K = # of output classes. M = # of features. The parameters form a K by M matrix.
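    The buggy program itself did not survive this transcript. For reference only, here is a hedged sketch (entirely my own, under the stated assumptions of a per-example gradient routine, a learning rate, and a K-by-M parameter matrix) of what a correct SGD loop for multinomial logistic regression could look like; names such as `grad_nll` and `lr` are hypothetical.

    ```python
    import numpy as np

    def softmax(scores):
        z = scores - np.max(scores)
        e = np.exp(z)
        return e / e.sum()

    def grad_nll(theta, x_i, y_i):
        """Hypothetical helper: gradient of the negative log-likelihood of one
        example (x_i, y_i) with respect to the K-by-M parameter matrix theta."""
        p = softmax(theta @ x_i)            # predicted class probabilities, shape (K,)
        p[y_i] -= 1.0                       # subtract the one-hot true label
        return np.outer(p, x_i)             # (p - onehot(y_i)) x_i^T, shape (K, M)

    def sgd_multinomial(X, y, K, lr=0.1, num_epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        N, M = X.shape
        theta = np.zeros((K, M))            # one row of parameters per class
        for _ in range(num_epochs):
            for i in rng.permutation(N):    # one example per update
                theta -= lr * grad_nll(theta, X[i], y[i])
        return theta

    # Hypothetical toy data with K = 3 classes, M = 2 features
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0, 1, 1, 2])
    theta = sgd_multinomial(X, y, K=3)
    print(np.argmax(softmax(theta @ X[0])))  # predicted class for the first example
    ```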

  • FEATURE ENGINEERING

    59

  • Handcrafted Features

    60

    p(y|x) ∝ exp(θ_y · f)

    born-in

  • Where do features come from?

    61

    Feature Engineering

    Feature Learning

    hand-crafted features

    Sun et al., 2011

    Zhou et al., 2005

    First word before M1
    Second word before M1
    Bag-of-words in M1
    Head word of M1
    Other word in between
    First word after M2
    Second word after M2
    Bag-of-words in M2
    Head word of M2
    Bigrams in between
    Words on dependency path
    Country name list
    Personal relative triggers
    Personal title list
    WordNet Tags
    Heads of chunks in between
    Path of phrase labels
    Combination of entity types

  • Where do features come from?

    62

    Feature Engineering

    Feature Learning

    hand-crafted features

    Sun et al., 2011

    Zhou et al., 2005

    word embeddings: Mikolov et al., 2013

    CBOW model in Mikolov et al. (2013)

    (Figure: input context words are mapped through a look-up table of embeddings, and a classifier predicts the missing word; similar words such as "dog" and "cat" get similar embeddings. This is unsupervised learning.)

  • Where do features come from?

    63

    Feature Engineering

    Feature Learning

    hand-crafted features

    Sun et al., 2011

    Zhou et al., 2005

    word embeddings: Mikolov et al., 2013

    string embeddings: Collobert & Weston, 2008; Socher, 2011

    Convolutional Neural Networks (Collobert and Weston, 2008)
    (Figure: a CNN with pooling over the sentence "The [movie] showed [wars]")

    Recursive Auto-Encoder (Socher, 2011)
    (Figure: an RAE over the sentence "The [movie] showed [wars]")

  • Where do features come from?

    64

    Feature Engineering

    Feature Learning

    hand-crafted features

    Sun et al., 2011

    Zhou et al., 2005

    word embeddings: Mikolov et al., 2013

    tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013

    string embeddings: Collobert & Weston, 2008; Socher, 2011

    (Figure: a parse tree S → NP VP over "The [movie] showed [wars]", with composition weights such as W_NP,VP; W_DT,NN; and W_V,NN)

  • Where do features come from?

    65

    Feature Engineering

    Feature Learning

    hand-crafted features: Sun et al., 2011; Zhou et al., 2005

    word embeddings: Mikolov et al., 2013

    string embeddings: Collobert & Weston, 2008; Socher, 2011

    tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013; Hermann et al., 2014

    word embedding features: Turian et al., 2010; Koo et al., 2008

  • Where do features come from?

    66

    Feature Engineering

    Feature Learning

    hand-crafted features: Sun et al., 2011; Zhou et al., 2005

    word embeddings: Mikolov et al., 2013

    string embeddings: Collobert & Weston, 2008; Socher, 2011

    tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013; Hermann et al., 2014

    word embedding features: Turian et al., 2010; Koo et al., 2008

  • Feature Engineering for NLP

    Suppose you build a logistic regression model to predict a part-of-speech (POS) tag for each word in a sentence.

    What features should you use?

    67

    The movie I watched depicted hope
    deter. noun noun verb verb noun

  • Per-word Features:

    Feature Engineering for NLP

    68

    The movie I watched depicted hope
    deter. noun noun verb verb noun

                          x(1)  x(2)  x(3)  x(4)  x(5)  x(6)
    is-capital(wi)          1     0     1     0     0     0
    endswith(wi, "e")       1     1     0     0     0     1
    endswith(wi, "d")       0     0     0     1     1     0
    endswith(wi, "ed")      0     0     0     1     1     0
    wi == "aardvark"        0     0     0     0     0     0
    wi == "hope"            0     0     0     0     0     1
    …                       …     …     …     …     …     …
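    A small sketch (my own) of how per-word features like those in the table above could be computed for each token; the feature set shown is illustrative, not the assignment's exact one.

    ```python
    def per_word_features(word):
        """Binary per-word features for POS tagging (illustrative set)."""
        return {
            "is-capital": word[0].isupper(),
            'endswith("e")': word.endswith("e"),
            'endswith("d")': word.endswith("d"),
            'endswith("ed")': word.endswith("ed"),
            'word=="aardvark"': word == "aardvark",
            'word=="hope"': word == "hope",
        }

    sentence = ["The", "movie", "I", "watched", "depicted", "hope"]
    for i, w in enumerate(sentence, start=1):
        feats = per_word_features(w)
        print(f"x({i}) {w:>8}:", "".join(str(int(v)) for v in feats.values()))
    ```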

  • Context Features:

    Feature Engineering for NLP

    69

    The movie I watched depicted hope
    deter. noun noun verb verb noun

                          x(1)  x(2)  x(3)  x(4)  x(5)  x(6)
    …                       …     …     …     …     …     …
    wi == "watched"         0     0     0     1     0     0
    wi+1 == "watched"       0     0     1     0     0     0
    wi-1 == "watched"       0     0     0     0     1     0
    wi+2 == "watched"       0     1     0     0     0     0
    wi-2 == "watched"       0     0     0     0     0     1
    …                       …     …     …     …     …     …

  • Context Features:

    Feature Engineering for NLP

    70

    The movie I watched depicted hope
    deter. noun noun verb verb noun

                          x(1)  x(2)  x(3)  x(4)  x(5)  x(6)
    …                       …     …     …     …     …     …
    wi == "I"               0     0     1     0     0     0
    wi+1 == "I"             0     1     0     0     0     0
    wi-1 == "I"             0     0     0     1     0     0
    wi+2 == "I"             1     0     0     0     0     0
    wi-2 == "I"             0     0     0     0     1     0
    …                       …     …     …     …     …     …

  • Feature Engineering for NLP

    71

    The movie I watched depicted hope
    deter. noun noun verb verb noun

    Table from Manning (2011)

  • Feature Engineering for CV

    Edge Detection (Canny)

    76

    Figures from http://opencv.org

    Corner Detection (Harris)

  • Feature Engineering for CV

    Scale Invariant Feature Transform (SIFT)

    77

    Figure from Lowe (1999) and Lowe (2004)

  • NONLINEAR FEATURES

    79

  • Nonlinear Features

    aka. "nonlinear basis functions"

    So far, the input was always x ∈ R^M. Key Idea: let the input be some function of x:

    original input: x ∈ R^M
    new input: b(x) ∈ R^K
    define: b(x) = [b_1(x), b_2(x), …, b_K(x)]

    Examples: (M = 1)

    80

    For a linear model, the output is still a linear function of b(x) even though it is a nonlinear function of x. Examples: Perceptron, Linear regression, Logistic regression.
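    A minimal sketch (my own, using NumPy) of the idea above for M = 1: expand a scalar input x into polynomial basis features, then fit an ordinary linear model on the expanded input.

    ```python
    import numpy as np

    def polynomial_basis(x, degree):
        """Map scalar inputs x to b(x) = [1, x, x^2, ..., x^degree]."""
        return np.vander(x, N=degree + 1, increasing=True)

    # Hypothetical 1-D data: linear trend with Gaussian noise
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = -2.0 * x + 1.0 + 0.1 * rng.standard_normal(10)

    # Least-squares fit is linear in the basis features, nonlinear in x
    B = polynomial_basis(x, degree=3)
    theta, *_ = np.linalg.lstsq(B, y, rcond=None)
    print(theta)          # coefficients on [1, x, x^2, x^3]
    print(B @ theta - y)  # residuals of the fitted curve
    ```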

  • Example: Linear Regression

    (Slides 81-87: a sequence of plots of y vs. x showing polynomial basis function fits to the training points)

    Goal: Learn y = w^T f(x) + b, where f(.) is a polynomial basis function

    True "unknown" target function is linear with negative slope and Gaussian noise

  • Slide courtesy of William Cohen

  • Slide courtesy of William Cohen

  • Example: Linear Regression

    (Slides 90-91: the same plots of y vs. x with polynomial basis function fits; slide 91 is the same as before, but now with N = 100 points)

    Goal: Learn y = w^T f(x) + b, where f(.) is a polynomial basis function

    True "unknown" target function is linear with negative slope and Gaussian noise

  • REGULARIZATION

    92

  • Overfitting

    Definition: The problem of overfitting is when the model captures the noise in the training data instead of the underlying structure.

    Overfitting can occur in all the models we've seen so far:

    Decision Trees (e.g. when tree is too deep)
    KNN (e.g. when k is small)
    Perceptron (e.g. when sample isn't representative)
    Linear Regression (e.g. with nonlinear features)
    Logistic Regression (e.g. with many rare features)

    93

  • Motivation: Regularization

    Example: Stock Prices

    Suppose we wish to predict Google's stock price at time t+1. What features should we use? (putting all computational concerns aside)

    Stock prices of all other stocks at times t, t-1, t-2, …, t-k
    Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets

    Do we believe that all of these features are going to be useful?

    94

  • Motivation: Regularization

    Occam's Razor: prefer the simplest hypothesis

    What does it mean for a hypothesis (or model) to be simple?
    1. small number of features (model selection)
    2. small number of "important" features (shrinkage)

    95

  • Regularization

    Given objective function J(θ), the goal is to find: argmin_θ J(θ) + λ r(θ)

    Key idea: Define regularizer r(θ) s.t. we tradeoff between fitting the data and keeping the model simple

    Choose form of r(θ): Example: q-norm (usually p-norm)

    96
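    A small sketch (my own) of the regularized objective J'(θ) = J(θ) + λ r(θ) with an L2 (squared 2-norm) regularizer added to the logistic regression NLL; the bias term (first component, per the slide later in this deck) is left unpenalized.

    ```python
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def nll(theta, X, y):
        p = sigmoid(X @ theta)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def regularized_objective(theta, X, y, lam):
        """J'(θ) = J(θ) + λ r(θ) with r(θ) = ||θ||² over the non-bias weights."""
        r = np.sum(theta[1:] ** 2)      # skip theta[0], the bias / intercept
        return nll(theta, X, y) + lam * r

    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([1, 1, 0, 0])
    theta = np.array([0.1, 3.0])
    for lam in [0.0, 0.1, 1.0]:
        print(lam, regularized_objective(theta, X, y, lam))  # larger λ penalizes large weights more
    ```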

  • Regularization

    98

    Question: Suppose we are minimizing J'(θ), where J'(θ) = J(θ) + λ r(θ).

    As λ increases, the minimum of J'(θ) will move…

    A. …towards the midpoint between J'(θ) and r(θ)
    B. …towards the minimum of J(θ)
    C. …towards the minimum of r(θ)
    D. …towards a theta vector of positive infinities
    E. …towards a theta vector of negative infinities

  • Regularization Exercise

    In-class Exercise:
    1. Plot train error vs. regularization weight (cartoon)
    2. Plot test error vs. regularization weight (cartoon)

    100

    (Blank axes: error vs. regularization weight)

  • Regularization

    101

    Question: Suppose we are minimizing J'(θ), where J'(θ) = J(θ) + λ r(θ).

    As we increase λ from 0, the validation error will…

    A. …increase
    B. …decrease
    C. …first increase, then decrease
    D. …first decrease, then increase
    E. …stay the same

  • Regularization

    102

    Don't Regularize the Bias (Intercept) Parameter!
    In our models so far, the bias / intercept parameter is usually denoted by θ_0, that is, the parameter for which we fixed x_0 = 1.
    Regularizers always avoid penalizing this bias / intercept parameter.
    Why? Because otherwise the learning algorithms wouldn't be invariant to a shift in the y-values.

    Whitening Data
    It's common to whiten each feature by subtracting its mean and dividing by its standard deviation.
    For regularization, this helps all the features be penalized in the same units (e.g. convert both centimeters and kilometers to z-scores).
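    A minimal sketch (my own) of the whitening step described above: standardize each feature column to zero mean and unit standard deviation; the constant bias column would be added back afterwards.

    ```python
    import numpy as np

    def whiten_features(X):
        """Z-score each feature column: subtract its mean, divide by its standard deviation."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / std, mean, std

    # Hypothetical features in very different units (e.g. centimeters vs. kilometers)
    X = np.array([[170.0, 0.002],
                  [165.0, 0.150],
                  [180.0, 0.087]])
    X_white, mean, std = whiten_features(X)
    print(X_white.mean(axis=0))   # approximately 0 for each column
    print(X_white.std(axis=0))    # exactly 1 for each column
    # The constant bias / intercept column is excluded and added back after whitening.
    ```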

  • Example: Logistic Regression

    For this example, we construct nonlinear features (i.e. feature engineering). Specifically, we add polynomials up to order 9 of the two original features x1 and x2. Thus our classifier is linear in the high-dimensional feature space, but the decision boundary is nonlinear when visualized in low dimensions (i.e. the original two dimensions).

    107

    Training Data

    Test Data

  • Example: Logistic Regression

    (Slides 108-122: plots of error vs. 1/lambda, with the intermediate slides showing the learned model at different regularization strengths)

  • Regularization as MAP

    L1 and L2 regularization can be interpreted as maximum a posteriori (MAP) estimation of the parameters. To be discussed later in the course…

    123

  • Takeaways

    1. Nonlinear basis functions allow linear models (e.g. Linear Regression, Logistic Regression) to capture nonlinear aspects of the original input

    2. Nonlinear features require no changes to the model (i.e. just preprocessing)

    3. Regularization helps to avoid overfitting

    4. Regularization and MAP estimation are equivalent for appropriately chosen priors

    125

  • Feature Engineering / Regularization Objectives

    You should be able to…

    Engineer appropriate features for a new task
    Use feature selection techniques to identify and remove irrelevant features
    Identify when a model is overfitting
    Add a regularizer to an existing objective in order to combat overfitting
    Explain why we should not regularize the bias term
    Convert a linearly inseparable dataset to a linearly separable dataset in higher dimensions
    Describe feature engineering in common application areas

    126