# binary logistic regression multinomial logistic regressionmgormley/courses/10601/slides/... ·...

Post on 01-Aug-2020

4 views

Category:

## Documents

Embed Size (px)

TRANSCRIPT

• Binary Logistic Regression+

Multinomial Logistic Regression

1

10 601 Introduction to Machine Learning

Matt GormleyLecture 10

Feb. 17, 2020

Machine Learning DepartmentSchool of Computer ScienceCarnegie Mellon University

• Reminders

Midterm Exam 1Tue, Feb. 18, 7:00pm – 9:00pm

Homework 4: Logistic RegressionOut: Wed, Feb. 19Due: Fri, Feb. 28 at 11:59pm

Today’s In Class Pollhttp://p10.mlcourse.org

Reading on Probabilistic Learning is reusedlater in the course for MLE/MAP

3

• MLE

5

Principle of Maximum Likelihood Estimation:Choose the parameters that maximize the likelihoodof the data.

Maximum Likelihood Estimate (MLE)

L( )

MLE

MLE2

1

L( 1, 2)

• MLE

What does maximizing likelihood accomplish?There is only a finite amount of probabilitymass (i.e. sum to one constraint)MLE tries to allocate as much probabilitymass as possible to the things we haveobserved…

…at the expense of the things we have notobserved

6

• MOTIVATION:LOGISTIC REGRESSION

7

• Example: Image ClassificationImageNet LSVRC 2010 contest:

Dataset: 1.2 million labeled images, 1000 classesTask: Given a new image, label it with the correct classMulticlass classification problem

Examples from http://image net.org/

10

• 11

• 12

• 13

• Example: Image Classification

14

CNN for Image Classification(Krizhevsky, Sutskever & Hinton, 2011)17.5% error on ImageNet LSVRC 2010 contest

Inputimage(pixels)

Five convolutional layers(w/max pooling)Three fully connected layers

1000 waysoftmax

• Example: Image Classification

15

CNN for Image Classification(Krizhevsky, Sutskever & Hinton, 2011)17.5% error on ImageNet LSVRC 2010 contest

Inputimage(pixels)

Five convolutional layers(w/max pooling)Three fully connected layers

1000 waysoftmax

This “softmax”layer is LogisticRegression!

The rest is justsome fancy

feature extraction(discussed later in

the course)

• LOGISTIC REGRESSION

16

• Logistic Regression

17

We are back toclassification.

Despite the namelogistic regression.

We are back toclassification.

Despite the namelogistic regression.

Data: Inputs are continuous vectors of length M. Outputsare discrete.

• Key idea: Try to learnthis hyperplane directlyKey idea: Try to learn

this hyperplane directly

Linear Models for Classification

Directly modeling thehyperplane would use adecision function:

for:

Looking ahead:We’ll see a number ofcommonly used LinearClassifiersThese include:

PerceptronLogistic RegressionNaïve Bayes (undercertain conditions)Support VectorMachines

• Background: HyperplanesHyperplane (Definition 1):

Hyperplane (Definition 2):

Half spaces:

Notation Trick: fold thebias b and the weightswinto a single vector byprepending a constant to

x and increasingdimensionality by one!

Notation Trick: fold thebias b and the weightswinto a single vector byprepending a constant to

x and increasingdimensionality by one!

• Using gradient ascent for linearclassifiers

Key idea behind today’s lecture:1. Define a linear classifier (logistic regression)2. Define an objective function (likelihood)3. Optimize it with gradient descent to learn

parameters4. Predict the class with highest probability under

the model

20

• Using gradient ascent for linearclassifiers

21

Use a differentiablefunction instead:

ue u

This decision function isn’tdifferentiable:

sign(x)

• Using gradient ascent for linearclassifiers

22

Use a differentiablefunction instead:

ue u

This decision function isn’tdifferentiable:

sign(x)

• Logistic Regression

23

Learning: finds the parameters that minimize someobjective function.

Prediction: Output is the most probable class.

Model: Logistic function applied to dot product ofparameters with input vector.

Data: Inputs are continuous vectors of length M. Outputsare discrete.

• Logistic Regression

WhiteboardBernoulli interpretationLogistic Regression ModelDecision boundary

24

• Learning for Logistic Regression

WhiteboardPartial derivative for Logistic RegressionGradient for Logistic Regression

25

• LOGISTIC REGRESSION ONGAUSSIAN DATA

26

• Logistic Regression

27

• Logistic Regression

28

• Logistic Regression

29

• LEARNING LOGISTIC REGRESSION

30

• Maximum ConditionalLikelihood Estimation

31

Learning: finds the parameters that minimize someobjective function.

Weminimize the negative log conditional likelihood:

Why?1. We can’t maximize likelihood (as in Naïve Bayes)

because we don’t have a joint model p(x,y)2. It worked well for Linear Regression (least squares is

MCLE)

• Maximum ConditionalLikelihood Estimation

32

Learning: Four approaches to solving

Approach 1: Gradient Descent(take larger – more certain – steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD)(take many small steps opposite the gradient)

Approach 3: Newton’s Method(use second derivatives to better follow curvature)

Approach 4: Closed Form???(set derivatives equal to zero and solve for parameters)

• Maximum ConditionalLikelihood Estimation

33

Learning: Four approaches to solving

Approach 1: Gradient Descent(take larger – more certain – steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD)(take many small steps opposite the gradient)

Approach 3: Newton’s Method(use second derivatives to better follow curvature)

Approach 4: Closed Form???(set derivatives equal to zero and solve for parameters)

Logistic Regression does nothave a closed form solutionfor MLE parameters.

Logistic Regression does nothave a closed form solutionfor MLE parameters.

• SGD for Logistic Regression

34

Question:Which of the following is a correct description of SGD for Logistic Regression?Question:Which of the following is a correct description of SGD for Logistic Regression?

Answer:At each step (i.e. iteration) of SGD for Logistic Regression we…A. (1) compute the gradient of the log likelihood for all examples (2) update all

the parameters using the gradientB. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down,

(3) report that answerC. (1) compute the gradient of the log likelihood for all examples (2) randomly

pick an example (3) update only the parameters for that exampleD. (1) randomly pick a parameter, (2) compute the partial derivative of the log

likelihood with respect to that parameter, (3) update that parameter for allexamples

E. (1) randomly pick an example, (2) compute the gradient of the log likelihoodfor that example, (3) update all the parameters using that gradient

F. (1) randomly pick a parameter and an example, (2) compute the gradient ofthe log likelihood for that example with respect to that parameter, (3) updatethat parameter using that gradient

Answer:At each step (i.e. iteration) of SGD for Logistic Regression we…A. (1) compute the gradient of the log likelihood for all examples (2) update all

the parameters using the gradientB. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down,

(3) report that answerC. (1) compute the gradient of the log likelihood for all examples (2) randomly

pick an example (3) update only the parameters for that exampleD. (1) randomly pick a parameter, (2) compute the partial derivative of the log

likelihood with respect to that parameter, (3) update that parameter for allexamples

E. (1) randomly pick an example, (2) compute the gradient of the log likelihoodfor that example, (3) update all the parameters using that gradient

F. (1) randomly pick a parameter and an example, (2) compute the gradient ofthe log likelihood for that example with respect to that parameter, (3) updatethat parameter using that gradient

35

In order to apply GD to LogisticRegression all we need is thegradient of the objectivefunction (i.e. vector of partialderivatives).

• Stochastic Gradient Descent (SGD)

36

We need a per example objective:

We can also apply SGD to solve the MCLEproblem for Logistic Regression.

Logistic Regression vs. Perceptron

37

Question:True or False: Just like Perceptron, onestep (i.e. iteration) of SGD for LogisticRegressionwill result in a change to theparameters only if the current example isincorrectly classified.

Question:True or False: Just like Perceptron, onestep (i.e. iteration) of SGD for LogisticRegressionwill result in a change to theparameters only if the current example isincorrectly classified.

• Matching Game

Goal:Match the Algorithm to its Update Rule

38

1. SGD for Logistic Regression

2. Least Mean Squares

3. Perceptron

4.

5.

6.

A. 1=5, 2=4, 3=6B. 1=5, 2=6, 3=4C. 1=6, 2=4, 3=4D. 1=5, 2=6, 3=6

E. 1=6, 2=6, 3=6F. 1=6, 2=5, 3=5G. 1=5, 2=5, 3=5H. 1=4, 2=5, 3=6

• OPTIMIZATIONMETHOD #4:MINI BATCH SGD

39

• Mini Batch SGD

Gradient Descent:Compute true gradient exactly from all NexamplesStochastic Gradient Descent (SGD):Approximate true gradient by the gradientof one randomly chosen exampleMini Batch SGD:Approximate true gradient by the averagegradient of K randomly chosen examples

40

• Mini Batch SGD

41

Three variants of first order optimization:

• Summary

1. Discriminative classifiers directly model theconditional, p(y|x)

2. Logistic regression is a simple linearclassifier, that retains a probabilisticsemantics

3. Parameters in LR are learned by iterativeoptimization (e.g. SGD)

50

• Logistic Regression ObjectivesYou should be able to…

Apply the principle of maximum likelihood estimation (MLE) tolearn the parameters of a probabilistic modelGiven a discriminative probabilistic model, derive the conditionallog likelihood, its gradient, and the corresponding BayesClassifierExplain the practical reasons why we work with the log of thelikelihoodImplement logistic regression for binary or multiclassclassificationProve that the decision boundary of binary logistic regression islinearFor linear regression, show that the parameters which minimizesquared error are equivalent to those that maximize conditionallikelihood

51

• MULTINOMIAL LOGISTICREGRESSION

54

• 55

• Multinomial Logistic RegressionChalkboard

Background: Multinomial distributionDefinition: Multi class classificationGeometric intuitionsMultinomial logistic regression modelGenerative storyReduction to binary logistic regressionPartial derivatives and gradientsApplying Gradient Descent and SGDImplementation w/ sparse features

56

• Debug that Program!In Class Exercise: Think Pair ShareDebug the following program which is (incorrectly)attempting to run SGD for multinomial logistic regression

In Class Exercise: Think Pair ShareDebug the following program which is (incorrectly)attempting to run SGD for multinomial logistic regression

57

Buggy Program:

Assume: returns the gradient of the negative log likelihood ofthe training example (x[i],y[i]) with respect to vector . is the learningrate. N = # of examples. K = # of output classes. M = # of features. is a K by Mmatrix.

Buggy Program:

Assume: returns the gradient of the negative log likelihood ofthe training example (x[i],y[i]) with respect to vector . is the learningrate. N = # of examples. K = # of output classes. M = # of features. is a K by Mmatrix.

• FEATURE ENGINEERING

59

• Handcrafted Features

60

p(y|x)exp( y f )

born-in

• Where do features come from?

61

FeatureEn

gine

ering

Feature Learning

hand craftedfeatures

Sun et al., 2011

Zhou et al.,2005

First word before M1Second word before M1Bag-of-words in M1Head word of M1Other word in betweenFirst word after M2Second word after M2Bag-of-words in M2Head word of M2Bigrams in betweenWords on dependency pathCountry name listPersonal relative triggersPersonal title listWordNet TagsHeads of chunks in betweenPath of phrase labelsCombination of entity types

• Where do features come from?

62

FeatureEn

gine

ering

Feature Learning

hand craftedfeatures

Sun et al., 2011

Zhou et al.,2005 word

embeddingsMikolov et al.,

2013

CBOWmodel in Mikolov et al. (2013)

input(context words)

input(context words)

embedding

embedding missing wordmissing word

Look up tableLook up table ClassifierClassifier

0.13 .26 … .52

0.11 .23 … .45

dog:

cat:similar words,similar embeddings

unsupervisedlearning

• Where do features come from?

63

FeatureEn

gine

ering

Feature Learning

hand craftedfeatures

Sun et al., 2011

Zhou et al.,2005 word

embeddingsMikolov et al.,

2013

stringembeddings

Collobert &Weston,2008

Socher, 2011

Convolutional Neural Networks(Collobert andWeston 2008)

The [movie] showed [wars]

pooling

CNN

Recursive Auto Encoder(Socher 2011)

The [movie] showed [wars]

RAE

• Where do features come from?

64

FeatureEn

gine

ering

Feature Learning

hand craftedfeatures

Sun et al., 2011

Zhou et al.,2005 word

embeddingsMikolov et al.,

2013

treeembeddingsSocher et al.,

2013Hermann & Blunsom,

2013

stringembeddings

Collobert &Weston,2008

Socher, 2011

The [movie] showed [wars]

WNP,VP

WDT,NN WV,NN

S

NP VP

• Where do features come from?

65

wordembeddings

treeembeddings

hand craftedfeatures

stringembeddings

FeatureEn

gine

ering

Feature Learning

Sun et al., 2011

Zhou et al.,2005

Mikolov et al.,2013

Collobert &Weston,2008

Socher, 2011

Socher et al.,2013

Hermann & Blunsom,2013

Hermann et al.2014

word embeddingfeatures

Turian et al.2010

Koo et al.2008

• Where do features come from?

66

wordembeddings

treeembeddings

word embeddingfeatureshand crafted

features

stringembeddings

FeatureEn

gine

ering

Feature Learning

Sun et al., 2011

Zhou et al.,2005

Mikolov et al.,2013

Collobert &Weston,2008

Socher, 2011

Socher et al.,2013

Turian et al.2010

Koo et al.2008

Hermann et al.2014

Hermann & Blunsom,2013

• Feature Engineering for NLP

Suppose you build a logistic regression modelto predict a part of speech (POS) tag for eachword in a sentence.

What features should you use?

67The movie I watched depicted hopedeter. noun noun nounverb verb

• Per word Features:

Feature Engineering for NLP

68The movie I watched depicted hopedeter. noun noun nounverb verb

is-capital(wi)endswith(wi,“e”)endswith(wi,“d”)endswith(wi,“ed”)wi == “aardvark”wi == “hope”

110000…

010000…

100000…

001100…

001100…

010001…

x(1) x(2) x(3) x(4) x(5) x(6)

• Context Features:

Feature Engineering for NLP

69The movie I watched depicted hopedeter. noun noun nounverb verb

…wi == “watched”wi+1 == “watched”wi-1 == “watched”wi+2 == “watched”wi-2 == “watched”

…00000…

…00010…

…01000…

…10000…

…00100…

…00001…

x(1) x(2) x(3) x(4) x(5) x(6)

• Context Features:

Feature Engineering for NLP

70The movie I watched depicted hopedeter. noun noun nounverb verb

…wi == “I”wi+1 == “I”wi-1 == “I”wi+2 == “I”wi-2 == “I”

…00010…

…01000…

…10000…

…00100…

…00001…

…00000…

x(1) x(2) x(3) x(4) x(5) x(6)

• Feature Engineering for NLP

71The movie I watched depicted hopedeter. noun noun nounverb verb

Table from Manning (2011)

• Feature Engineering for CVEdge detection (Canny)

76Figures from http://opencv.org

Corner Detection (Harris)

• Feature Engineering for CV

Scale Invariant Feature Transform (SIFT)

77Figure from Lowe (1999) and Lowe (2004)

• NON LINEAR FEATURES

79

• Nonlinear Featuresaka. “nonlinear basis functions”So far, input was alwaysKey Idea: let input be some function of x

original input:new input:define

Examples: (M = 1)

80

For a linear model:still a linear functionof b(x) even though anonlinear function ofxExamples:

PerceptronLinear regressionLogistic regression

For a linear model:still a linear functionof b(x) even though anonlinear function ofxExamples:

PerceptronLinear regressionLogistic regression

• Example: Linear Regression

81x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

82x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

83x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

84x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

85x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

86x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

87x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Slide courtesy of William Cohen

• Slide courtesy of William Cohen

• Example: Linear Regression

90x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

true “unknown”target function islinear withnegative slopeand gaussiannoise

• Example: Linear Regression

91x

y

Goal: Learn y =wT f(x) + bwhere f(.) is a polynomialbasis function

Same as before, but nowwith N = 100 points

true “unknown”target function islinear withnegative slopeand gaussiannoise

• REGULARIZATION

92

• OverfittingDefinition: The problem of overfitting is whenthe model captures the noise in the training datainstead of the underlying structure

Overfitting can occur in all the models we’ve seenso far:

Decision Trees (e.g. when tree is too deep)KNN (e.g. when k is small)Perceptron (e.g. when sample isn’t representative)Linear Regression (e.g. with nonlinear features)Logistic Regression (e.g. with many rare features)

93

• Motivation: RegularizationExample: Stock Prices

Suppose we wish to predictGoogle’s stock price at time t+1What features should we use?(putting all computational concernsaside)

Stock prices of all other stocks attimes t, t 1, t 2, …, t kMentions of Google with positive /negative sentiment words in allnewspapers and social media outlets

Do we believe that all of thesefeatures are going to be useful?

94

• Motivation: Regularization

Occam’s Razor: prefer the simplesthypothesis

What does it mean for a hypothesis (ormodel) to be simple?1. small number of features (model selection)2. small number of “important” features

(shrinkage)

95

• RegularizationGiven objective function: J( )Goal is to find:

Key idea: Define regularizer r( ) s.t. we tradeoffbetween fitting the data and keeping the modelsimpleChoose form of r( ):

Example: q norm (usually p norm)

96

• Regularization

98

Question:Suppose we are minimizing J’( )where

As increases, the minimum of J’( )will move…

A. …towards the midpoint betweenJ’( ) and r( )

B. …towards the minimum of J( )C. …towards the minimum of r( )D. …towards a theta vector of positive

infinitiesE. …towards a theta vector of negative

infinities

Question:Suppose we are minimizing J’( )where

As increases, the minimum of J’( )will move…

A. …towards the midpoint betweenJ’( ) and r( )

B. …towards the minimum of J( )C. …towards the minimum of r( )D. …towards a theta vector of positive

infinitiesE. …towards a theta vector of negative

infinities

• Regularization ExerciseIn class Exercise1. Plot train error vs. regularization weight (cartoon)2. Plot test error vs . regularization weight (cartoon)

100

erro

r

regularization weight

• Regularization

101

Question:Suppose we are minimizing J’( )where

As we increase from 0, the thevalidation error will…

A. …increaseB. …decreaseC. …first increase, then decreaseD. …first decrease, then increaseE. …stay the same

Question:Suppose we are minimizing J’( )where

As we increase from 0, the thevalidation error will…

A. …increaseB. …decreaseC. …first increase, then decreaseD. …first decrease, then increaseE. …stay the same

• Regularization

102

Don’t Regularize the Bias (Intercept) Parameter!In our models so far, the bias / intercept parameter isusually denoted by that is, the parameter for whichwe fixedRegularizers always avoid penalizing this bias / interceptparameterWhy? Because otherwise the learning algorithms wouldn’tbe invariant to a shift in the y values

Whitening DataIt’s common towhiten each feature by subtracting itsmean and dividing by its varianceFor regularization, this helps all the features be penalizedin the same units(e.g. convert both centimeters and kilometers to z scores)

• Example: Logistic RegressionFor this example, weconstruct nonlinear features(i.e. feature engineering)Specifically, we addpolynomials up to order 9 ofthe two original features x1and x2Thus our classifier is linear inthe high dimensionalfeature space, but thedecision boundary isnonlinearwhen visualized inlow dimensions (i.e. theoriginal two dimensions)

107

TrainingData

TestData

• Example: Logistic Regression

108

1/lambda

erro

r

• Example: Logistic Regression

109

• Example: Logistic Regression

110

• Example: Logistic Regression

111

• Example: Logistic Regression

112

• Example: Logistic Regression

113

• Example: Logistic Regression

114

• Example: Logistic Regression

115

• Example: Logistic Regression

116

• Example: Logistic Regression

117

• Example: Logistic Regression

118

• Example: Logistic Regression

119

• Example: Logistic Regression

120

• Example: Logistic Regression

121

• Example: Logistic Regression

122

1/lambda

erro

r

• Regularization as MAP

L1 and L2 regularization can be interpretedasmaximum a posteriori (MAP) estimationof the parametersTo be discussed later in the course…

123

• Takeaways

1. Nonlinear basis functions allow linearmodels (e.g. Linear Regression, LogisticRegression) to capture nonlinear aspects ofthe original input

2. Nonlinear features are require no changesto the model (i.e. just preprocessing)

3. Regularization helps to avoid overfitting4. Regularization andMAP estimation are

equivalent for appropriately chosen priors

125

• Feature Engineering / RegularizationObjectives

You should be able to…Engineer appropriate features for a new taskUse feature selection techniques to identify andremove irrelevant featuresIdentify when a model is overfittingAdd a regularizer to an existing objective in order tocombat overfittingExplain why we should not regularize the bias termConvert linearly inseparable dataset to a linearlyseparable dataset in higher dimensionsDescribe feature engineering in common applicationareas

126