18 Logistic Regression - Colorado State University

Text classification: Logistic Regression (Chapter 5 in Martin/Jurafsky)


Page 1

Text classification: Logistic Regression

Chapter 5 in Martin/Jurafsky

Logistic Regression

•  Important analytic tool in natural and social sciences

•  Baseline supervised machine learning tool for classification

•  Is also the foundation of a neural network

Page 2

Logistic Regression: binary and multi-class

•  Binary logistic regression – Two output classes

•  Multinomial logistic regression – More than 2 output classes

Naive Bayes vs Logistic Regression

•  Naive Bayes: computes P(c|d) indirectly, from the likelihood P(d|c) and the prior P(c)

•  Logistic Regression: computes P(c|d) directly


More formally, recall that naive Bayes assigns a class c to a document d not by directly computing P(c|d) but by computing a likelihood and a prior:

\hat{c} = \operatorname*{argmax}_{c \in C} \overbrace{P(d|c)}^{\text{likelihood}} \; \overbrace{P(c)}^{\text{prior}} \qquad (5.1)

A generative model like naive Bayes makes use of this likelihood term, which expresses how to generate the features of a document if we knew it was of class c. By contrast, a discriminative model in this text categorization scenario attempts to directly compute P(c|d). Perhaps it will learn to assign high weight to document features that directly improve its ability to discriminate between possible classes, even if it couldn't generate an example of one of the classes.

Components of a probabilistic machine learning classifier: Like naive Bayes, logistic regression is a probabilistic classifier that makes use of supervised machine learning. Machine learning classifiers require a training corpus of M input/output pairs (x^(i), y^(i)). (We'll use superscripts in parentheses to refer to individual instances in the training set; for sentiment classification each instance might be an individual document to be classified.) A machine learning system for classification then has four components:

1. A feature representation of the input. For each input observation x^(i), this will be a vector of features [x_1, x_2, ..., x_n]. We will generally refer to feature i for input x^(j) as x_i^(j), sometimes simplified as x_i, but we will also see the notation f_i, f_i(x), or, for multiclass classification, f_i(c, x).

2. A classification function that computes \hat{y}, the estimated class, via p(y|x). In the next section we will introduce the sigmoid and softmax tools for classification.

3. An objective function for learning, usually involving minimizing error on training examples. We will introduce the cross-entropy loss function.

4. An algorithm for optimizing the objective function. We introduce the stochastic gradient descent algorithm.

Logistic regression has two phases:

training: we train the system (specifically the weights w and b) using stochastic gradient descent and the cross-entropy loss.

test: given a test example x we compute p(y|x) and return the higher-probability label y = 1 or y = 0.

5.1 Classification: the sigmoid

The goal of binary logistic regression is to train a classifier that can make a binary decision about the class of a new input observation. Here we introduce the sigmoid classifier that will help us make this decision. Consider a single input observation x, which we will represent by a vector of features [x_1, x_2, ..., x_n] (we'll show sample features in the next subsection). The classifier output y can be 1 (meaning the observation is a member of the class) or 0 (the observation is not a member of the class). We want to know the probability P(y = 1|x) that this observation is a member of the class.


Note: P(c|d) is the posterior.

Page 3

Generative and Discriminative Classifiers

•  Naïve Bayes is a generative classifier

•  Logistic regression is a discriminative classifier

Generative and Discriminative Classifiers

Suppose we're distinguishing cat from dog images

[Images: cat and dog examples from ImageNet]

Page 4

Generative Classifier

•  Build a model of what's in a cat image
–  Knows about whiskers, ears, eyes
–  Assigns a probability to any image: how cat-y is this image?

•  Also build a model for dog images

•  Given a new image: run both models and see which one fits better

Discriminative Classifier

Just try to distinguish dogs from cats

Page 5

Logistic regression

•  Training: given a training set of M observations (x^(i), y^(i)) where each x^(i) is a vector of features
–  Learn the parameters of the model

•  Test: given a test example x
–  return the higher-probability class

Labeled data

A labeled dataset:

•  Observations (x^(i), y^(i)) where each x^(i) is a vector of features

•  The labels are discrete for classification problems (e.g., 0 and 1 for binary classification)
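As a concrete sketch of these two phases, here is a minimal train/test example. It assumes scikit-learn's LogisticRegression (the slides don't name a library), and the tiny six-dimensional dataset is invented for illustration:

```python
# Minimal sketch of the training and test phases, assuming scikit-learn;
# the toy dataset is invented for illustration.
from sklearn.linear_model import LogisticRegression

X_train = [[3, 2, 1, 3, 0, 4.15],   # one feature vector x^(i) per row
           [0, 4, 1, 1, 0, 3.69],
           [4, 0, 0, 2, 1, 4.01],
           [1, 5, 1, 0, 0, 3.40]]
y_train = [1, 0, 1, 0]              # discrete labels: 1 = positive, 0 = negative

clf = LogisticRegression()
clf.fit(X_train, y_train)           # training: learn the weights w and bias b

x_test = [[2, 1, 0, 1, 0, 3.91]]
print(clf.predict_proba(x_test))    # [[P(y=0|x), P(y=1|x)]]
print(clf.predict(x_test))          # test: return the higher-probability class
```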

Page 6

Term-document matrix


6.3 Words and Vectors

Vector or distributional models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur. This matrix can be constructed in various ways; let's begin by looking at one such co-occurrence matrix, a term-document matrix.

6.3.1 Vectors and documents

In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents. Fig. 6.2 shows a small selection from a term-document matrix showing the occurrence of four words in four plays by Shakespeare. Each cell in this matrix represents the number of times a particular word (defined by the row) occurs in a particular document (defined by the column). Thus fool appeared 58 times in Twelfth Night.

         As You Like It   Twelfth Night   Julius Caesar   Henry V
battle         1                0               7            13
good         114               80              62            89
fool          36               58               1             4
wit           20               15               2             3

Figure 6.2 The term-document matrix for four words in four Shakespeare plays. Each cell contains the number of times the (row) word occurs in the (column) document.

The term-document matrix of Fig. 6.2 was first defined as part of the vector space model of information retrieval (Salton, 1971). In this model, a document is represented as a count vector, a column in Fig. 6.3.

To review some basic linear algebra, a vector is, at heart, just a list or array of numbers. So As You Like It is represented as the list [1,114,36,20] and Julius Caesar is represented as the list [7,62,1,2]. A vector space is a collection of vectors, characterized by their dimension. In the example in Fig. 6.3, the vectors are of dimension 4, just so they fit on the page; in real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.

The ordering of the numbers in a vector space is not arbitrary; each position indicates a meaningful dimension on which the documents can vary. Thus the first dimension for both these vectors corresponds to the number of times the word battle occurs, and we can compare each dimension, noting for example that the vectors for As You Like It and Twelfth Night have similar values (1 and 0, respectively) for the first dimension.

Figure 6.3 The term-document matrix for four words in four Shakespeare plays (the same counts as Fig. 6.2). Each document is represented as a column vector of length four.

We can think of the vector for a document as identifying a point in |V|-dimensional space; thus the documents in Fig. 6.3 are points in 4-dimensional space. Since 4-dimensional spaces are hard to draw in textbooks, Fig. 6.4 shows a visualization in two dimensions; we've arbitrarily chosen the dimensions corresponding to the words battle and fool.

Each document is represented by a vector of words
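To make that concrete, here is a small sketch that rebuilds the Fig. 6.2/6.3 matrix and pulls out one document's count vector; NumPy is an assumption (the slides don't name a library), and the counts are the ones above:

```python
import numpy as np

words = ["battle", "good", "fool", "wit"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# counts[i, j] = how often words[i] occurs in docs[j] (Fig. 6.2/6.3)
counts = np.array([
    [  1,   0,  7, 13],   # battle
    [114,  80, 62, 89],   # good
    [ 36,  58,  1,  4],   # fool
    [ 20,  15,  2,  3],   # wit
])

# Each document is a column vector of length |V| (here 4):
as_you_like_it = counts[:, 0]
print(list(as_you_like_it))   # [1, 114, 36, 20]
```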

Visualizing document vectors

[Figure 6.4: documents plotted as points in two dimensions, fool on the x-axis and battle on the y-axis: As You Like It [36,1], Twelfth Night [58,0], Julius Caesar [1,7], Henry V [4,13]]

Page 7

Document vectors


Vectors are similar for the two comedies: comedies have more fools and wit and fewer battles.

Classification

•  Consider input observation x
–  a vector of features [x1, x2, ..., xn]
–  maybe they are counts of words

•  Goal: predict the class y
–  We want to know the probability P(y = 1|x) and P(y = 0|x)
–  y = 1: “positive sentiment”
–  y = 0: “negative sentiment”

Page 8

Features in logistic regression

•  For each feature xi, we'll have a weight wi

•  Weight wi tells us how important feature xi is for the classification
–  xi = "awesome": wi very positive (+5)
–  xi = "abysmal": wi very negative (−5)

How to do classification

•  For each feature xi, we'll have a weight wi

•  Plus we'll have a bias b

•  We'll sum up all the weighted features and the bias


So perhaps the decision is “positive sentiment” versus “negative sentiment”; the features represent counts of words in a document, P(y = 1|x) is the probability that the document has positive sentiment, and P(y = 0|x) is the probability that the document has negative sentiment.

Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term. Each weight w_i is a real number, and is associated with one of the input features x_i. The weight w_i represents how important that input feature is to the classification decision, and can be positive (meaning the feature is associated with the class) or negative (meaning the feature is not associated with the class). Thus we might expect in a sentiment task the word awesome to have a high positive weight, and abysmal to have a very negative weight. The bias term, also called the intercept, is another real number that's added to the weighted inputs.

To make a decision on a test instance, after we've learned the weights in training, the classifier first multiplies each x_i by its weight w_i, sums up the weighted features, and adds the bias term b. The resulting single number z expresses the weighted sum of the evidence for the class.

z = \left( \sum_{i=1}^{n} w_i x_i \right) + b \qquad (5.2)

In the rest of the book we'll represent such sums using the dot product notation from linear algebra. The dot product of two vectors a and b, written as a · b, is the sum of the products of the corresponding elements of each vector. Thus the following is an equivalent formulation of Eq. 5.2:

z = w \cdot x + b \qquad (5.3)

But note that nothing in Eq. 5.3 forces z to be a legal probability, that is, to lie between 0 and 1. In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.

Figure 5.1 The sigmoid function y = 1/(1 + e^{-z}) takes a real value and maps it to the range [0,1]. Because it is nearly linear around 0 but has a sharp slope toward the ends, it tends to squash outlier values toward 0 or 1.

To create a probability, we'll pass z through the sigmoid function, σ(z). The sigmoid function (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. The sigmoid has the following equation, shown graphically in Fig. 5.1:

y = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (5.4)


Page 9

Dot products

•  Definition: The Euclidean dot product between two vectors is the expression

w^T x = \sum_{i=1}^{d} w_i x_i

•  The dot product is also referred to as inner product or scalar product. It is sometimes denoted as w · x (hence the name dot product).
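A quick sketch of the definition in code, with made-up vectors; NumPy is an assumption here, and the explicit sum and the library call compute the same thing:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, 4.0])

# w^T x = sum_i w_i * x_i, written out and via NumPy
z_manual = sum(wi * xi for wi, xi in zip(w, x))
z_numpy = w @ x                   # same as np.dot(w, x)
print(z_manual, z_numpy)          # 1.0 1.0  (= 2*1 - 1*3 + 0.5*4)
```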

Dot products

•  Definition: The Euclidean dot product between two vectors is the expression

w^T x = \sum_{i=1}^{d} w_i x_i

•  Geometric interpretation: The dot product between two unit vectors is the cosine of the angle between them.

Page 10

Dot products

•  Geometric interpretation: The dot product between two unit vectors is the cosine of the angle between them.

•  The dot product between a vector and a unit vector is the length of its projection in that direction.

•  And in general:

w^T x = \|w\| \, \|x\| \cos(\theta)

\|x\|^2 = x^T x
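A small sketch of these two identities (NumPy assumed, the vectors made up):

```python
import numpy as np

def cosine(u, v):
    # from w^T x = ||w|| ||x|| cos(theta):  cos(theta) = u.v / (||u|| ||v||)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

x = np.array([3.0, 4.0])
print(x @ x, np.linalg.norm(x) ** 2)   # ||x||^2 = x^T x: both 25 (up to rounding)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal -> 0.0
```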

Linear models for classification

[Figure: a linear decision boundary with normal vector w, separating the region where w^T x + b > 0 from the region where w^T x + b < 0]

Page 11

But z isn't a probability, it's just a number!


•  Solution: use a function of z that converts it to a number between 0 and 1:

y = \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (5.4)

The sigmoid function

[Figure 5.1: the sigmoid function y = 1/(1 + e^{-z}), which maps a real value into the range [0,1], squashing outlier values toward 0 or 1]
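A tiny sketch of the sigmoid in plain Python (standard library only):

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}): maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(5.0))    # ~0.993, squashed toward 1
print(sigmoid(-5.0))   # ~0.007, squashed toward 0
```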

Page 12

Using the model


The sigmoid has a number of advantages; it takes a real-valued number and maps it into the range [0,1], which is just what we want for a probability. Because it is nearly linear around 0 but has a sharp slope toward the ends, it tends to squash outlier values toward 0 or 1. And it's differentiable, which as we'll see in Section 5.8 will be handy for learning.

We're almost there. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows:

P(y = 1) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}

P(y = 0) = 1 - \sigma(w \cdot x + b) = \frac{e^{-(w \cdot x + b)}}{1 + e^{-(w \cdot x + b)}} \qquad (5.5)

Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the decision boundary:

\hat{y} = \begin{cases} 1 & \text{if } P(y = 1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases}
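Eq. 5.5 and the decision boundary translate directly into code; a minimal sketch in plain Python (the function names are mine, not the book's):

```python
import math

def predict_proba(w, x, b):
    # Eq. 5.5: P(y=1|x) = sigma(w . x + b); P(y=0|x) = 1 - P(y=1|x)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p1 = 1.0 / (1.0 + math.exp(-z))
    return p1, 1.0 - p1

def decide(w, x, b):
    # the 0.5 decision boundary
    p1, _ = predict_proba(w, x, b)
    return 1 if p1 > 0.5 else 0
```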

5.1.1 Example: sentiment classification

Let's have an example. Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document doc. We'll represent each input observation by the following 6 features x1...x6 of the input; Fig. 5.2 shows the features in a sample mini test document.

Var   Definition                                   Value in Fig. 5.2
x1    count(positive lexicon words ∈ doc)          3
x2    count(negative lexicon words ∈ doc)          2
x3    1 if “no” ∈ doc, 0 otherwise                 1
x4    count(1st and 2nd person pronouns ∈ doc)     3
x5    1 if “!” ∈ doc, 0 otherwise                  0
x6    log(word count of doc)                       ln(64) = 4.15

Let's assume for the moment that we've already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1. (We'll discuss in the next section how the weights are learned.)
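A sketch of how these six features might be extracted from a document; the tokenizer and the tiny lexicons here are invented stand-ins for illustration, not the book's:

```python
import math
import re

# Toy stand-in lexicons; a real system would use full sentiment lexicons.
POSITIVE = {"enjoyable", "great", "nice"}
NEGATIVE = {"hokey", "second-rate", "sucked"}
PRONOUNS = {"i", "me", "mine", "you", "your", "yours"}

def extract_features(doc):
    tokens = re.findall(r"[a-z0-9'-]+|!", doc.lower())
    words = [t for t in tokens if t != "!"]
    return [
        sum(t in POSITIVE for t in words),    # x1: positive lexicon count
        sum(t in NEGATIVE for t in words),    # x2: negative lexicon count
        1 if "no" in words else 0,            # x3: "no" in doc?
        sum(t in PRONOUNS for t in words),    # x4: 1st/2nd person pronoun count
        1 if "!" in tokens else 0,            # x5: "!" in doc?
        math.log(len(words)),                 # x6: log word count
    ]
```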


Sentiment example: does y=1 or y=0?

•  It's hokey, there are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you.


Page 13

•  It's hokey, there are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you.

Highlighted feature types: positive sentiment words, negative sentiment words, “no”, and the pronouns i/me/mine/you/your/yours.

Figure 5.2 A sample mini test document showing the extracted features in the vector x: “It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you.” Extracted values: x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.15.

The weight w1, for example, indicates how important a feature the number of positive lexicon words (great, nice, enjoyable, etc.) is to a positive sentiment decision, while w2 tells us the importance of negative lexicon words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative words are negatively associated with a positive sentiment decision, and are about twice as important as positive words.

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

p(+|x) = P(y = 1|x) = \sigma(w \cdot x + b)
       = \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.15] + 0.1)
       = \sigma(0.805)
       = 0.69

p(−|x) = P(y = 0|x) = 1 - \sigma(w \cdot x + b)
       = 0.31
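A quick check of this arithmetic in plain Python:

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.15]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = 1.0 / (1.0 + math.exp(-z))
print(round(z, 3), round(p_pos, 2), round(1 - p_pos, 2))  # 0.805 0.69 0.31
```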

Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input can be a feature. Consider the task of period disambiguation: deciding if a period is the end of a sentence or part of a word, by classifying each period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary (“Prof.”) and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties. For example, a period following an upper-cased word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.

x1 = 1 if Case(w_i) = Lower, 0 otherwise

x2 = 1 if w_i ∈ AcronymDict, 0 otherwise

x3 = 1 if w_i = St. and Case(w_{i-1}) = Cap, 0 otherwise
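A toy sketch of these three features in code; the tokenization convention and the abbreviation dictionary are invented stand-ins:

```python
# Invented stand-in abbreviation dictionary, for illustration only.
ACRONYM_DICT = {"Prof.", "Dr.", "Mr.", "St.", "etc."}

def eos_features(words, i):
    """Features for the period attached to words[i] (tokens keep their case)."""
    bare = words[i].rstrip(".")
    x1 = 1 if bare.islower() else 0                      # Case(w_i) = Lower
    x2 = 1 if words[i] in ACRONYM_DICT else 0            # w_i in AcronymDict
    x3 = 1 if (words[i] == "St." and i > 0
               and words[i - 1][:1].isupper()) else 0    # "... Main St."
    return [x1, x2, x3]

print(eos_features(["The", "house", "at", "465", "Main", "St."], 5))  # [0, 1, 1]
```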

Designing features: Features are generally designed by examining the training set with an eye to linguistic intuitions and the linguistic literature on the domain. A careful error analysis on the training or dev set of an early version of a system often provides insights into features.


Page 14

Classifying sentiment for input x


Suppose w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]

b = 0.1

p(+|x) = P(y = 1|x) = \sigma(w \cdot x + b) = \sigma(0.805) = 0.69

p(−|x) = P(y = 0|x) = 1 - \sigma(w \cdot x + b) = 0.31


Page 15

We can build features for logistic regression for any classification task: period disambiguation


“This ends in a period.” → end of sentence
“The house at 465 Main St. is new.” → not end of sentence

Learning in logistic regression

•  OK, how does logistic regression set the values for w and b?


Page 16

Getting to cross entropy loss: log likelihood

•  Our probability: p(y|x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}

•  Taking the log of both sides: \log p(y|x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})
–  (values that maximize p(·) will also maximize log p(·))

•  This is called the log likelihood of the observation

The first thing we need is a loss function; the one commonly used for logistic regression, and also for neural networks, is the cross-entropy loss.

The second thing we need is an optimization algorithm for iteratively updating the weights so as to minimize this loss function. The standard algorithm for this is gradient descent; we'll introduce the stochastic gradient descent algorithm in the following section.

5.3 The cross-entropy loss function

We need a loss function that expresses, for an observation x, how close the classifier output (ŷ = σ(w · x + b)) is to the correct output (y, which is 0 or 1). We'll call this:

L(ŷ, y) = how much ŷ differs from the true y   (5.6)

You could imagine using a simple loss function that just takes the mean squared error between ŷ and y:

L_MSE(ŷ, y) = (1/2)(ŷ − y)²   (5.7)

It turns out that this MSE loss, which is very useful for some algorithms like linear regression, becomes harder to optimize (technically, non-convex) when it's applied to probabilistic classification.

Instead, we use a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters w, b that maximize the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.

Let's derive this loss function, applied to a single observation x. We'd like to learn weights that maximize the probability of the correct label p(y|x). Since there are only two discrete outcomes (1 or 0), this is a Bernoulli distribution, and we can express the probability p(y|x) that our classifier produces for one observation as the following (keeping in mind that if y = 1, Eq. 5.8 simplifies to ŷ; if y = 0, Eq. 5.8 simplifies to 1 − ŷ):

p(y|x) = ŷ^y (1 − ŷ)^{1−y}   (5.8)

Now we take the log of both sides. This will turn out to be handy mathematically, and doesn't hurt us; whatever values maximize a probability will also maximize the log of the probability:

log p(y|x) = log [ŷ^y (1 − ŷ)^{1−y}]
           = y log ŷ + (1 − y) log(1 − ŷ)   (5.9)

Eq. 5.9 describes a log likelihood that should be maximized. In order to turn this into a loss function (something that we need to minimize), we'll just flip the sign on Eq. 5.9. The result is the cross-entropy loss L_CE:

L_CE(ŷ, y) = −log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]   (5.10)

Finally, we can plug in the definition of ŷ = σ(w · x + b):

L_CE(w, b) = −[y log σ(w · x + b) + (1 − y) log(1 − σ(w · x + b))]   (5.11)
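Eq. 5.10 and Eq. 5.11 translate directly into code. A minimal sketch, reusing the sentiment example from earlier (the helper names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(w, x, b, y):
    """L_CE(w, b) for a single observation (Eq. 5.11); y is the gold label, 0 or 1."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The classifier assigns P(+|x) ~= 0.69, so the loss is small if the gold
# label is 1 (+) and larger if the gold label is 0 (-).
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.15]
print(cross_entropy_loss(w, x, 0.1, 1))  # ~0.37
print(cross_entropy_loss(w, x, 0.1, 0))  # ~1.17
```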


Getting to cross entropy loss: negative log likelihood

•  Log likelihood: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)

•  But instead of maximizing log likelihood, we want to minimize a loss: L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]

•  Plugging in ŷ = σ(w · x + b): L_CE(w, b) = −[y log σ(w · x + b) + (1 − y) log(1 − σ(w · x + b))]



The cross entropy loss

•  Why it does the right thing:

•  If y = 1: the loss is −log ŷ

•  If y = 0: the loss is −log(1 − ŷ)

•  The negative log goes from
–  0: negative log of 1, no loss
–  ∞: negative log of 0, infinite loss
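A small numeric check (with arbitrarily chosen values of ŷ) makes this behavior concrete for the y = 1 case:

```python
import math

# How -log(y_hat) behaves when the gold label is y = 1:
for y_hat in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y_hat = {y_hat}: loss = {-math.log(y_hat):.2f}")
# y_hat = 0.99: loss = 0.01   (confident and correct: near-zero loss)
# y_hat = 0.01: loss = 4.61   (confident and wrong: large loss; -> infinity as y_hat -> 0)
```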


Extending it to the whole dataset


Why does minimizing this negative log probability do what we want? A perfect classifier would assign probability 1 to the correct outcome (y = 1 or y = 0) and probability 0 to the incorrect outcome. That means the higher ŷ (the closer it is to 1), the better the classifier; the lower ŷ is (the closer it is to 0), the worse the classifier. The negative log of this probability is a convenient loss metric since it goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss). This loss function also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized; since the two sum to one, any increase in the probability of the correct answer comes at the expense of the incorrect answer. It's called the cross-entropy loss because Eq. 5.9 is also the formula for the cross-entropy between the true probability distribution y and our estimated distribution ŷ.

Let's now extend Eq. 5.10 from one example to the whole training set: we'll continue to use the notation that x^(i) and y^(i) mean the ith training features and training label, respectively. We make the assumption that the training examples are independent:

log p(training labels) = log Π_{i=1}^{m} p(y^(i)|x^(i))   (5.12)
                       = Σ_{i=1}^{m} log p(y^(i)|x^(i))   (5.13)
                       = −Σ_{i=1}^{m} L_CE(ŷ^(i), y^(i))   (5.14)

We'll define the cost function for the whole dataset as the average loss for each example:

Cost(w, b) = (1/m) Σ_{i=1}^{m} L_CE(ŷ^(i), y^(i))
           = −(1/m) Σ_{i=1}^{m} [y^(i) log σ(w · x^(i) + b) + (1 − y^(i)) log(1 − σ(w · x^(i) + b))]   (5.15)

Now we know what we want to minimize; in the next section, we'll see how to find the minimum.

5.4 Gradient Descent

Our goal with gradient descent is to find the optimal weights: minimize the loss function we've defined for the model. In Eq. 5.16 below, we'll explicitly represent the fact that the loss function L is parameterized by the weights, which we'll refer to in machine learning in general as θ (in the case of logistic regression, θ = (w, b)):

θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} L_CE(y^(i), x^(i); θ)   (5.16)
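Eq. 5.15 is a straightforward average. A minimal sketch over a toy dataset (the feature values and labels below are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, X, Y):
    """Average cross-entropy loss over the whole dataset (Eq. 5.15)."""
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(X)

# Toy dataset with two features per example:
X = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
Y = [1, 0, 1]
print(cost([0.5, -0.5], 0.0, X, Y))
```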



We want to find parameters that minimize loss

•  We'll use θ to refer to the parameters w, b



Gradient descent for one scalar variable w


How shall we find the minimum of this (or any) loss function? Gradient descent is a method that finds a minimum of a function by figuring out in which direction (in the space of the parameters θ) the function's slope is rising the most steeply, and moving in the opposite direction. The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.

For logistic regression, this loss function is conveniently convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.

Although the algorithm (and the concept of gradient) are designed for direction vectors, let's first consider a visualization of the case where the parameter of our system is just a single scalar w, shown in Fig. 5.3.

Given a random initialization of w at some value w^1, and assuming the loss function L happened to have the shape in Fig. 5.3, we need the algorithm to tell us whether at the next iteration we should move left (making w^2 smaller than w^1) or right (making w^2 bigger than w^1) to reach the minimum.

Figure 5.3 The first step in iteratively finding the minimum of this loss function, by moving w in the reverse direction from the slope of the function. Since the slope is negative, we need to move w in a positive direction, to the right. Here superscripts are used for learning steps, so w^1 means the initial value of w (which is 0), w^2 the value at the second step, and so on.

The gradient descent algorithm answers this question by finding the gradient of the loss function at the current point and moving in the opposite direction. The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in the function. The gradient is a multi-variable generalization of the slope, so for a function of one variable like the one in Fig. 5.3, we can informally think of the gradient as the slope. The dotted line in Fig. 5.3 shows the slope of this hypothetical loss function at the point w = w^1. You can see that the slope of this dotted line is negative. Thus to find the minimum, gradient descent tells us to go in the opposite direction: moving w in a positive direction.


How much to move


The magnitude of the amount to move in gradient descent is the value of the slope (d/dw) f(x; w) weighted by a learning rate η. A higher (faster) learning rate means that we should move w more on each step. The change we make in our parameter is the learning rate times the gradient (or the slope, in our single-variable example), as the sketch after the equation illustrates:

w^{t+1} = w^t − η (d/dw) f(x; w)   (5.17)

Now in multiple dimensions

Now let's extend the intuition from a function of one scalar variable w to many variables, because we don't just want to move left or right; we want to know where in the N-dimensional space (of the N parameters that make up θ) we should move. The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of those N dimensions. If we're just imagining two weight dimensions (say for one weight w and one bias b), the gradient might be a vector with two orthogonal components, each of which tells us how much the ground slopes in the w dimension and in the b dimension. Fig. 5.4 shows a visualization:

Figure 5.4 Visualization of the gradient vector of Cost(w,b) in two dimensions w and b.

In an actual logistic regression, the parameter vector w is much longer than 1 or 2, since the input feature vector x can be quite long, and we need a weight w_i for each x_i. For each dimension/variable w_i in w (plus the bias b), the gradient will have a component that tells us the slope with respect to that variable. Essentially we're asking: "How much would a small change in that variable w_i influence the total loss function L?"

In each dimension w_i, we express the slope as a partial derivative ∂/∂w_i of the loss function. The gradient is then defined as a vector of these partials. We'll represent ŷ as f(x;θ) to make the dependence on θ more obvious:

$$\nabla_\theta L(f(x;\theta),y) = \begin{bmatrix} \frac{\partial}{\partial w_1} L(f(x;\theta),y) \\ \frac{\partial}{\partial w_2} L(f(x;\theta),y) \\ \vdots \\ \frac{\partial}{\partial w_n} L(f(x;\theta),y) \end{bmatrix} \tag{5.18}$$

The final equation for updating θ based on the gradient is thus

$$\theta^{t+1} = \theta^t - \eta \nabla L(f(x;\theta),y) \tag{5.19}$$

5.4.1 The Gradient for Logistic Regression

In order to update θ, we need a definition for the gradient ∇L(f(x;θ),y). Recall that for logistic regression, the cross-entropy loss function is:

$$L_{CE}(w,b) = -\left[y \log \sigma(w \cdot x + b) + (1-y)\log\left(1 - \sigma(w \cdot x + b)\right)\right] \tag{5.20}$$

It turns out that the derivative of this function for one observation vector x is Eq. 5.21 (the interested reader can see Section 5.8 for the derivation of this equation):

$$\frac{\partial L_{CE}(w,b)}{\partial w_j} = \left[\sigma(w \cdot x + b) - y\right] x_j \tag{5.21}$$


Gradient


What is this gradient? (see textbook for the proof)

•  The gradient of the loss for one example is very intuitive!

•  It is the difference between the predicted and true values, weighted by x
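As a quick sanity check on this intuition, here is a possible numpy sketch of the per-example gradient of Eq. 5.21 (the names below are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_one_example(w, b, x, y):
    # Eq. 5.21: (sigma(w.x + b) - y) * x_j for each weight; the same
    # residual is the gradient for the bias b.
    error = sigmoid(np.dot(w, x) + b) - y   # predicted minus true
    return error * x, error                 # gradients for w and for b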



For an entire dataset

Note in Eq. 5.21 that the gradient with respect to a single weight w_j represents a very intuitive value: the difference between the true y and our estimated ŷ = σ(w · x + b) for that observation, multiplied by the corresponding input value x_j.

The loss for a batch of data or an entire dataset is just the average loss over the m examples:

$$Cost(w,b) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)})\log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \tag{5.22}$$

And the gradient for multiple data points is the sum of the individual gradients:

$$\frac{\partial Cost(w,b)}{\partial w_j} = \sum_{i=1}^{m}\left[\sigma(w \cdot x^{(i)} + b) - y^{(i)}\right] x_j^{(i)} \tag{5.23}$$
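A vectorized numpy sketch of Eq. 5.22 and Eq. 5.23 (assuming, for illustration, that X is an m-by-n matrix whose rows are the feature vectors and y is a length-m vector of 0/1 labels):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(w, b, X, y):
    # Eq. 5.22: average cross-entropy loss over m examples;
    # Eq. 5.23: gradient summed over the m examples.
    p = sigmoid(X @ w + b)                    # predictions for all m examples
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_w = X.T @ (p - y)                    # sum over i of (p_i - y_i) * x_j
    grad_b = np.sum(p - y)
    return cost, grad_w, grad_b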

5.4.2 The Stochastic Gradient Descent Algorithm

Stochastic gradient descent is an online algorithm that minimizes the loss function by computing its gradient after each training example, and nudging θ in the right direction (the opposite direction of the gradient). Fig. 5.5 shows the algorithm.

function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
  # where: L is the loss function
  #        f is a function parameterized by θ
  #        x is the set of training inputs x^(1), x^(2), ..., x^(n)
  #        y is the set of training outputs (labels) y^(1), y^(2), ..., y^(n)
  θ ← 0
  repeat T times
    For each training tuple (x^(i), y^(i)) (in random order)
      Compute ŷ^(i) = f(x^(i); θ)         # What is our estimated output ŷ?
      Compute the loss L(ŷ^(i), y^(i))    # How far off is ŷ^(i) from the true output y^(i)?
      g ← ∇_θ L(f(x^(i); θ), y^(i))       # How should we move θ to maximize loss?
      θ ← θ − η g                         # go the other way instead
  return θ

Figure 5.5 The stochastic gradient descent algorithm

Stochastic gradient descent is called stochastic because it chooses a single random example at a time, moving the weights so as to improve performance on that single example. That can result in very choppy movements, so it's also common to do minibatch gradient descent, which computes the gradient over batches of training instances rather than a single instance.

The learning rate η is a hyperparameter that must be adjusted. If it's too high, the learner will take steps that are too large, overshooting the minimum of the loss function. If it's too low, the learner will take steps that are too small, and take too long to get to the minimum. It is most common to begin the learning rate at a higher value, and then slowly decrease it, so that it is a function of the iteration k of training; you will sometimes see the notation η_k to mean the value of the learning rate at iteration k.
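One possible Python rendering of Fig. 5.5 for binary logistic regression, plugging in the per-example gradient of Eq. 5.21 (a sketch with illustrative names, not the textbook's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, eta=0.1, T=100, seed=0):
    # Stochastic gradient descent (Fig. 5.5) for binary logistic regression.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0                    # theta <- 0
    for _ in range(T):                         # repeat T times
        for i in rng.permutation(m):           # each example, in random order
            error = sigmoid(X[i] @ w + b) - y[i]   # residual from Eq. 5.21
            w -= eta * error * X[i]            # theta <- theta - eta * g
            b -= eta * error
    return w, b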


Working through an example

•  Assign sentiment (1 or 0) to an observation document x with 2 features

•  So we need to learn 3 parameters: w1, w2, and b

•  Let's initialize all weights and the bias to 0

5.4.3 Working through an example

Let's walk through a single step of the gradient descent algorithm. We'll use a simplified version of the example in Fig. 5.2 as it sees a single observation x, whose correct value is y = 1 (this is a positive review), and with only two features:

x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)

Let's assume the initial weights and bias in θ^0 are all set to 0, and the initial learning rate η is 0.1:

w1 = w2 = b = 0
η = 0.1

The single update step requires that we compute the gradient, multiplied by the learning rate:

$$\theta^{t+1} = \theta^t - \eta \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})$$

In our mini example there are three parameters, so the gradient vector has 3 dimensions, for w1, w2, and b. We can compute the first gradient as follows:

$$\nabla_{w,b} = \begin{bmatrix} \frac{\partial L_{CE}(w,b)}{\partial w_1} \\ \frac{\partial L_{CE}(w,b)}{\partial w_2} \\ \frac{\partial L_{CE}(w,b)}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)x_1 \\ (\sigma(w \cdot x + b) - y)x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)x_1 \\ (\sigma(0) - 1)x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5x_1 \\ -0.5x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$

Now that we have a gradient, we compute the new parameter vector θ^2 by moving θ^1 in the opposite direction from the gradient:

$$\theta^2 = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} - \eta \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} .15 \\ .1 \\ .05 \end{bmatrix}$$

So after one step of gradient descent, the weights have shifted to be: w1 = .15, w2 = .1, and b = .05.

Note that this observation x happened to be a positive example. We would expect that after seeing more negative examples with high counts of negative words, the weight w2 would shift to have a negative value.
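We can check this arithmetic with a few lines of numpy (a sketch; the variable names are ours):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.0, 2.0])           # x1 = 3, x2 = 2
y = 1.0                            # positive review
w, b, eta = np.zeros(2), 0.0, 0.1  # all parameters start at 0

error = sigmoid(w @ x + b) - y     # sigma(0) - 1 = -0.5
w = w - eta * error * x            # [0.15, 0.1]
b = b - eta * error                # 0.05
print(w, b)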





Multinomial Logistic Regression

•  Often we have more than 2 classes
   –  Positive/negative/neutral
   –  Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
   –  Classify emergency SMSs into different actionable classes

•  If > 2 classes we use multinomial logistic regression

Multinomial Logistic Regression

•  The probability of everything must still sum to 1:
   P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1

•  Need a generalization of the sigmoid called the softmax
   –  Takes a vector z = [z1, z2, ..., zk] of k values
   –  Outputs a probability distribution
      •  each value in the range [0,1]
      •  all the values summing to 1


The softmax function

Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities

For a vector z of dimensionality k, the softmax is defined as:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} \quad 1 \le i \le k \tag{5.32}$$

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

$$\mathrm{softmax}(z) = \left[ \frac{e^{z_1}}{\sum_{i=1}^{k} e^{z_i}}, \frac{e^{z_2}}{\sum_{i=1}^{k} e^{z_i}}, ..., \frac{e^{z_k}}{\sum_{i=1}^{k} e^{z_i}} \right] \tag{5.33}$$

The denominator $\sum_{i=1}^{k} e^{z_i}$ is used to normalize all the values into probabilities. Thus for example given a vector:

z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]

the result softmax(z) is

[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
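Eq. 5.32 and Eq. 5.33 take only a few lines of numpy; this sketch (ours, with an illustrative helper name) reproduces the example above:

import numpy as np

def softmax(z):
    # Eq. 5.33: exponentiate and normalize so the outputs sum to 1.
    e = np.exp(z - np.max(z))   # subtracting max(z) is a standard numerical-stability trick
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))          # approx [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
print(softmax(z).sum())    # 1.0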

Again like the sigmoid, the input to the softmax will be the dot product between a weight vector w and an input vector x (plus a bias). But now we'll need separate weight vectors (and bias) for each of the K classes:

$$p(y = c|x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{j=1}^{k} e^{w_j \cdot x + b_j}} \tag{5.34}$$

Like the sigmoid, the softmax has the property of squashing values toward 0 or 1. Thus if one of the inputs is larger than the others, it will tend to push its probability toward 1, and suppress the probabilities of the smaller inputs.

5.6.1 Features in Multinomial Logistic Regression

For multiclass classification the input features need to be a function of both the observation x and the candidate output class c. Thus instead of the notation x_i, f_i, or f_i(x), when we're discussing features we will use the notation f_i(c, x), meaning feature i for a particular class c for a given observation x.

In binary classification, a positive weight on a feature pointed toward y=1 and a negative weight toward y=0... but in multiclass a feature could be evidence for or against an individual class.

Let's look at some sample features for a few NLP tasks to help understand this perhaps unintuitive use of features that are functions of both the observation x and the class c.

Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents, as the feature table below shows.





Softmax in multinomial logistic regression

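The original slide here restates Eq. 5.34. As a sketch of what that equation computes, here is one possible numpy rendering with a separate weight vector (a row of W) and bias per class (the matrix W and the names below are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def class_probabilities(W, b, x):
    # Eq. 5.34: p(y = c | x) is the softmax over the K per-class scores w_c . x + b_c.
    # W is a K x n matrix (one weight row per class), b a length-K bias vector.
    return softmax(W @ x + b)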

Features in (binary) logistic regression

f1 = 1 if "!" in doc, 0 otherwise
w1 = 2.5
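Such a binary feature is trivial to compute; a toy sketch (the function name is our own):

def f1(doc):
    # 1 if the document contains an exclamation mark, 0 otherwise
    return 1 if "!" in doc else 0

print(f1("great movie !"))   # 1, contributing its weight w1 = 2.5 to the score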


Features in softmax regression: distinct weights for each class!

Var        Definition                       Wt
f1(0, x)   1 if "!" ∈ doc, 0 otherwise     −4.5
f1(+, x)   1 if "!" ∈ doc, 0 otherwise      2.6
f1(−, x)   1 if "!" ∈ doc, 0 otherwise      1.3

5.6.2 Learning in Multinomial Logistic Regression

Multinomial logistic regression has a slightly different loss function than binary logistic regression because it uses the softmax rather than the sigmoid classifier. The loss function for a single example x is the sum of the logs of the K output classes:

$$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log p(y = k|x) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}} \tag{5.35}$$

This makes use of the function 1{} which evaluates to 1 if the condition in the brackets is true and to 0 otherwise.

The gradient for a single example turns out to be very similar to the gradient for logistic regression, although we don't show the derivation here. It is the difference between the value for the true class k (which is 1) and the probability the classifier outputs for class k, weighted by the value of the input x_k:

$$\frac{\partial L_{CE}}{\partial w_k} = -\left(\mathbb{1}\{y = k\} - p(y = k|x)\right) x_k = -\left(\mathbb{1}\{y = k\} - \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}\right) x_k \tag{5.36}$$

5.7 Interpreting models

Often we want to know more than just the correct classification of an observation. We want to know why the classifier made the decision it did. That is, we want our decision to be interpretable. Interpretability can be hard to define strictly, but the core idea is that as humans we should know why our algorithms reach the conclusions they do. Because the features to logistic regression are often human-designed, one way to understand a classifier's decision is to understand the role each feature plays in the decision. Logistic regression can be combined with statistical tests (the likelihood ratio test, or the Wald test); investigating whether a particular feature is significant by one of these tests, or inspecting its magnitude (how large is the weight w associated with the feature?), can help us interpret why the classifier made the decision it makes. This is enormously important for building transparent models.

Furthermore, in addition to its use as a classifier, logistic regression in NLP and many other fields is widely used as an analytic tool for testing hypotheses about the effect of various explanatory variables.
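A numpy sketch of Eq. 5.35 and Eq. 5.36 for a single example (ours; the one-hot encoding and names are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multinomial_loss_and_gradient(W, b, x, y):
    # W: K x n weight matrix, b: length-K biases, x: feature vector, y: true class index.
    p = softmax(W @ x + b)           # p(y = k | x) for every class k
    loss = -np.log(p[y])             # Eq. 5.35: only the true class's term survives the 1{} sum
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    grad_W = np.outer(p - one_hot, x)   # Eq. 5.36: -(1{y=k} - p(y=k|x)) x, one row per class
    grad_b = p - one_hot
    return loss, grad_W, grad_b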


Binary versus multinomial logistic regression

•  Multinomial is obviously applicable to many more classification tasks
   –  Softmax is important for neural networks