
Lectures on Machine Learning (Fall 2017)

Hyeong In Choi Seoul National University

Lecture 3: Logistic Regression

(Draft: version 0.8.7)

Topics to be covered:

• Binary logistic regression

• Multiclass logistic regression

• Neural network formalism

• Ordinal logistic regression

• Performance evaluation

• Mathematical supplement: cross entropy error

In this lecture, we continue to use the notations and conventions of Lecture 1. Logistic regression, in one form or another, represents one of the oldest and still one of the most important families of algorithms used for classification tasks. Nowadays its importance is even more widely recognized, as it is a prototypical building block of neural networks. The formalism and issues discussed in this lecture will be very relevant when we study deep learning. For this reason, the interested reader is advised to become familiar with the details covered in this lecture.

Since logistic regression is a classification problem, the output variable is a categorical variable, and we assume there are K output categories or labels. For input variables, logistic regression uses numeric input variables only. If there is a categorical input variable, we will use the following so-called one-hot encoding method to turn it into a bunch of (binary) numeric variables.


Definition 1. Let v be a categorical variable with values in M categories. As was done in Lecture 1, we assume v ∈ {1, · · · , M}. Define vj = I(v = j) for j = 1, · · · , M. Then vj ∈ {0, 1} and v1 + · · · + vM = 1, so that one and only one of v1, · · · , vM is 1 and the rest are all 0. This way of representing v with M binary variables v1, · · · , vM is called one-hot encoding, and we write v = (v1, · · · , vM).
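A minimal sketch of this encoding in Python (the function name and the use of NumPy are our own choices, not part of the lecture):

```python
import numpy as np

def one_hot(v, M):
    """One-hot encode a categorical value v in {1, ..., M} as (v1, ..., vM)."""
    e = np.zeros(M)
    e[v - 1] = 1.0          # v_j = I(v = j), with 1-based category labels
    return e

# Example: M = 4 categories, v = 3 gives (0, 0, 1, 0)
print(one_hot(3, 4))
```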

Note that we have already used one-hot encoding in Lecture 2 to rewrite the multiclass categorical response variable y as y = (y1, · · · , yK). In this lecture, we always assume all multiclass categorical input variables are encoded as a set of one-hot encoded variables, and the resulting binary variables are treated as numeric variables with values 0 or 1. After all this preprocessing, we assume the input space is X = R^d and there are d input variables x1, · · · , xd. We write them in vector form as x = (x1, · · · , xd), so x ∈ R^d. In matrix notation, a vector is always written as a column vector, so x = [x1, · · · , xd]^T.

As was done in Lecture 1, we use P (x, y) to denote the probability distribution of X × Y. Although the full joint probability P (x, y) is in the background as we described in Lecture 1, logistic regression does not try to model it per se. Rather, it purports to model the conditional probability P (y |x) in a rather simple way, as we shall see below. Using this conditional probability one can easily construct a classifier by taking the highest probability outcome. In this sense logistic regression is dubbed a discriminative model. This is in contrast with a generative model, which, like the naive Bayes model, directly models the full joint probability distribution.

1 Binary logistic regression

1.1 Probability model

Binary logistic regression assumes there are two output labels, i.e. Y = {0, 1}. Let (x, y) ∈ X × Y be a generic sample point. Binary logistic regression postulates a conditional probability P (y |x) of the form

P (y = 1 |x) = σ(w · x+ b). (1)

Here w = [w1, · · · , wd]^T and b are model parameters to be determined by the data, and w · x = w^T x = w1x1 + · · · + wdxd is the dot product of the two vectors

2

Page 3: Lecture 3: Logistic Regression (Draft: version 0.8.7) · 2017-12-13 · In this sense logistic regression is dubbed a discriminative model. This is in contrast with the generative

w and x. And σ(t) is the sigmoid function defined by

σ(t) = 1/(1 + e^{−t}).

This sigmoid function takes values between 0 and 1, as can be seen in Figure 1, and for this simple reason it is commonly used as a model for probability in machine learning. (Machine learning almost always prefers the simpler option.) The conditional probability P (y = 1 |x) depends on the parameters w and b. So, to be explicit about it, one writes P (y = 1 |x; w, b) or P_{w,b}(y = 1 |x). But the dependence on the parameters w and b is understood in context, and they are usually suppressed in the expression of the conditional probability. Define

z = w · x+ b. (2)

Then (1) is written as

P (y = 1 |x) = σ(z). (3)

Thus we can write

P (y = 1 |x) = e^z/(1 + e^z) = e^{w·x+b}/(1 + e^{w·x+b}),
P (y = 0 |x) = 1/(1 + e^z) = 1/(1 + e^{w·x+b}).

Figure 1: Sigmoid function

Solving for z in Equation (1) we have

z = w · x + b = log( p/(1 − p) ),


where p = P (y = 1 |x). Define the logit function by

logit(p) = log( p/(1 − p) ),

which is the inverse function of σ(t). Define the Odds (of y being 1 against its being 0) given x by

Odds = Odds(y = 1 |x) = P (y = 1 |x) / P (y = 0 |x).

We have thus shown that

logit(p) = log(Odds(y = 1 |x)) = w · x+ b. (4)

This means that binary logistic regression models the log of the Odds as a linear combination of the features of the input. Unraveling the definition in Equation (4), it is easy to see that it is equivalent to the definition in Equation (1).
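The probability model above is easy to express in code. The following sketch (function names are ours) evaluates σ, P (y = 1 |x) and the log odds for given parameters w and b; the printed log odds agrees with w · x + b up to rounding:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def prob_y1(x, w, b):
    """P(y = 1 | x) = sigma(w . x + b), as in Equation (1)."""
    return sigmoid(np.dot(w, x) + b)

def log_odds(x, w, b):
    """log Odds(y = 1 | x) = logit(p) with p = P(y = 1 | x), as in Equation (4)."""
    p = prob_y1(x, w, b)
    return np.log(p / (1.0 - p))

# Example with made-up parameters
w, b = np.array([0.5, -1.0]), 0.2
x = np.array([1.0, 2.0])
print(prob_y1(x, w, b), log_odds(x, w, b), np.dot(w, x) + b)
```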

1.2 Learning

Learning, also called training, means determining the parameters w = (w1, · · · , wd) and b that fit the given dataset, in the sense that the resulting error in doing so is as small as computationally feasible. (There are other criteria, like avoiding overfitting or having small generalization error, which we will deal with in subsequent lectures.) So learning entails optimization in one sense or another. Some may even exaggerate the matter by claiming intelligence is optimization.

We now use the maximum likelihood estimator to determine the parameters w1, · · · , wd and b. In doing so, we also show how the sigmoid function arises naturally. Let D = {(x^(i), y^(i))}_{i=1}^N be a given dataset. Instead of using the sigmoid function, let us set

P (y = 1 |x) = ψ(w · x+ b),

where ψ(t) is any function with value in the interval [0, 1]. Since the value of y is binary, i.e. y ∈ {0, 1}, the conditional probability can be written in the following form:

P (y |x) = ψ(w · x + b)^y (1 − ψ(w · x + b))^{1−y}. (5)


The likelihood function given the dataset D is then

L(w, b) = ∏_{i=1}^N P (y^(i) |x^(i)) = ∏_{i=1}^N ψ(w · x^(i) + b)^{y^(i)} (1 − ψ(w · x^(i) + b))^{1−y^(i)},

and its log likelihood function l(w, b) = logL(w, b) is

l(w, b) = ∑_{i=1}^N { y^(i) log ψ(w · x^(i) + b) + (1 − y^(i)) log(1 − ψ(w · x^(i) + b)) }. (6)

The maximum point of this log likelihood function is the maximum likelihood estimator of w and b.

Now is the time to simplify our notation. Define θi = wi for i = 1, · · · , d, and θ0 = b. We also augment the x-variable by setting x0 = 1. Similarly, for the x-part of the data, define x^(i)_0 = 1. Thus we have vectors θ, x and x^(i) given by

θ = [θ0, θ1, · · · , θd]^T = [b, w1, · · · , wd]^T
x = [x0, x1, · · · , xd]^T = [1, x1, · · · , xd]^T
x^(i) = [x^(i)_0, x^(i)_1, · · · , x^(i)_d]^T = [1, x^(i)_1, · · · , x^(i)_d]^T.

Thus with this new notation, the log likelihood function (6) can be rewritten as

l(θ) = ∑_{i=1}^N { y^(i) log ψ(θ · x^(i)) + (1 − y^(i)) log(1 − ψ(θ · x^(i))) }. (7)

Hence

∂l(θ)/∂θk = ∑_{i=1}^N { y^(i) [ψ′(θ · x^(i))/ψ(θ · x^(i))] x^(i)_k − (1 − y^(i)) [ψ′(θ · x^(i))/(1 − ψ(θ · x^(i)))] x^(i)_k }. (8)

This formula would be greatly simplified if ψ′(t) is proportional to ψ(t)(1−ψ(t)). So set

ψ′(t) = ψ(t)(1− ψ(t)) (9)

and solve this ODE to get ψ(t) = 1/(1 + Ce^{−t}). Letting C = 1, we get

ψ(t) = 1/(1 + e^{−t}).


This is how the sigmoid function came about in the first place. Now use (9) to simplify (8) to get

∂l(θ)/∂θk = ∑_{i=1}^N { y^(i) − σ(θ · x^(i)) } x^(i)_k, (10)

which, in vector form, can be written as

∇θ l(θ) = ∑_{i=1}^N { y^(i) − σ(θ · x^(i)) } x^(i). (11)

Differentiating (10) once more, we get

∂²l(θ)/∂θk∂θr = − ∑_{i=1}^N σ′(θ · x^(i)) x^(i)_k x^(i)_r.

Now let λ = [λ0, λ1, · · · , λd]T ∈ Rd+1 be any non-zero vector. Check that

− ∑_{k,r=0}^d [∂²l(θ)/∂θk∂θr] λk λr = ∑_{i=1}^N σ′(θ · x^(i)) (∑_{k=0}^d x^(i)_k λk)(∑_{r=0}^d x^(i)_r λr)

= ∑_{i=1}^N σ′(θ · x^(i)) |λ · x^(i)|². (12)

Since σ′(t) > 0 always, we see that (12) is always non-negative. Since this is true for any λ, the negative of the Hessian matrix −(∂²l(θ)/∂θk∂θr) is positive semi-definite. Furthermore, (12) is strictly positive if the vectors x^(i) for i = 1, · · · , N span the space R^{d+1}, in which case the negative of the Hessian matrix is positive definite. Therefore we have the following:

Theorem 1. −l(θ) is convex. Furthermore, if the vectors x^(i) for i = 1, · · · , N span the space R^{d+1}, −l(θ) is strictly convex.

For the rest of this lecture, we assume that the condition in the above theorem is always met, so that −(∂²l(θ)/∂θk∂θr) is positive definite, which also implies that −l(θ) has a unique minimum.
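The gradient (11) and the Hessian can be checked numerically. A minimal sketch (ours, on randomly generated data) computes both and verifies that the negative of the Hessian is positive semi-definite, as Theorem 1 asserts:

```python
import numpy as np

def log_lik_grad_hess(theta, X, y):
    """Gradient (11) and Hessian of l(theta); X carries a leading column of ones (x_0 = 1)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))     # sigma(theta . x^(i)) for all i
    grad = X.T @ (y - p)                     # Equation (11)
    s = p * (1.0 - p)                        # sigma'(theta . x^(i))
    hess = -(X.T * s) @ X                    # matrix of second derivatives of l
    return grad, hess

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 augmentation
y = rng.integers(0, 2, size=N).astype(float)
theta = rng.normal(size=d + 1)
grad, hess = log_lik_grad_hess(theta, X, y)
print(np.linalg.eigvalsh(-hess).min() >= -1e-9)   # eigenvalues of -Hessian are >= 0
```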


1.3 Numerical procedures for MLE

Our task now is to find the maximum likelihood estimator for θ, which is the minimum point of −l(θ). Since it is a nonlinear function, we have to use a numerical method. But since it is a convex function, the numerical procedure is pretty easy. We list below some of the widely used numerical methods.

(1) Batch gradient descent
The steepest gradient descent method is one of the oldest and still most widely used methods for finding the minimum of a function; in a nutshell, it is a discretization of the gradient flow. The minimum of −l(θ) is thus found by successively computing θ(new) from θ(old), renaming it θ(old), and iterating:

θ(new) = θ(old) − η ∇θ(−l(θ))|_{θ=θ(old)} = θ(old) + η ∇θ l(θ(old))
       = θ(old) + η ∑_{i=1}^N { y^(i) − σ(θ(old) · x^(i)) } x^(i) (13)

Here η is usually called the learning rate; it determines the step size of the iteration. If it is too small, convergence is too slow; if it is too big, the θ's may oscillate without really converging to the maximum of l(θ).

(2) Stochastic gradient descent (SGD)
The above algorithm is a batch process in that it utilizes all available data at once. If the dataset is very big, its computational burden is great. But in certain cases, like online learning, the data may come sequentially and one has to process them as they come. In that case it is more advantageous to use the following version:

θ(new) = θ(old) + η { y^(i) − σ(θ(old) · x^(i)) } x^(i),

where the data point (x^(i), y^(i)) used above is chosen at random. Even when all the data are available, this sequential procedure has the beneficial effect of randomization.

(3) Mini-batch gradient descent
Stochastic gradient descent may converge too slowly when the dataset is very big. Mini-batch gradient descent is a compromise between the two methods above. It randomly selects a small subset, called a mini-batch, from the dataset D and applies the update (13) only to this subset, i.e. the sum in (13) runs only over the elements of the mini-batch. This typically gives the best performance and is a favored method in contemporary machine learning. (A code sketch of these three update schemes is given at the end of this subsection.)

(4) Newton's method
For simplicity of notation, we write f(θ) for −l(θ); thus f is strictly convex. Suppose we have found θ(old) in the iterative procedure, and write θ̄ = θ(old). The Taylor expansion of f at θ̄ is

f(θ) = f(θ̄) + ∇f(θ̄)^T (θ − θ̄) + (1/2)(θ − θ̄)^T H(θ̄)(θ − θ̄) + remainder,

where H(θ̄) is the Hessian matrix (∂²f(θ̄)/∂θk∂θr). Setting

A = H(θ̄),
b = ∇f(θ̄) − H(θ̄)θ̄,
c = f(θ̄) − ∇f(θ̄)^T θ̄ + (1/2) θ̄^T H(θ̄) θ̄,

we can write

f(θ) = (1/2) θ^T Aθ + b^T θ + c + remainder.

Define the quadratic function g by

g(θ) = (1/2) θ^T Aθ + b^T θ + c.

So g is the second order osculating function of f at θ̄. Taking the gradient of g with respect to θ gives

∇θ g(θ) = Aθ + b.

Since the Hessian matrix H(θ̄) is invertible (see the comment following Theorem 1), g(θ) has a unique minimum at the solution θ(new) of Aθ + b = 0. Thus we set

θ(new) = −A^{−1}b = −H(θ̄)^{−1}{∇f(θ̄) − H(θ̄)θ̄} = θ̄ − H(θ̄)^{−1}∇f(θ̄).


Therefore, we have the following Newton’s algorithm.

θ(new) = θ(old) − H(θ(old))^{−1} ∇f(θ(old)).

Newton's method is a second order method that converges very fast. But it may not be suitable when the computation of the Hessian matrix is expensive.

Figure 2: Graph of f with osculating quadratic surface

(5) Variations
There are many other optimization methods one can think of. Especially when we get to the training of deep neural networks, we will come back to this issue.
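As mentioned in item (3), here is a small sketch (ours, not from the lecture) of the update rule (13) and its single-sample and mini-batch variants for binary logistic regression; Newton's method would replace the scalar step η by the inverse Hessian of −l, which we omit here. Dividing the summed gradient by the batch size only rescales the learning rate.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.5, n_iters=2000, batch_size=None, seed=0):
    """Gradient ascent on l(theta).

    batch_size=None -> full batch as in (13); batch_size=1 -> stochastic
    gradient descent; any other value -> mini-batch gradient descent.
    X must already contain the x_0 = 1 column.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        idx = np.arange(N) if batch_size is None else rng.choice(N, batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (yb - sigmoid(Xb @ theta))   # Equation (11) restricted to the (mini-)batch
        theta = theta + eta * grad / len(idx)
    return theta

# Example on synthetic data
rng = np.random.default_rng(1)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(N) < sigmoid(X @ true_theta)).astype(float)
print(fit_logistic(X, y, batch_size=32))
```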

1.4 Decision making

Suppose we have found θ = (w, b) by the MLE method. Let x be a new input data point. Knowing θ means we can calculate

P (y = 1 |x) = e^{w·x+b}/(1 + e^{w·x+b}),
P (y = 0 |x) = 1/(1 + e^{w·x+b}).


The classifier ϕ : X → Y is obtained by taking the label with the higher probability, i.e.

ϕ(x) = argmax_j P (y = j |x).

Therefore, examining the expressions of the above probabilities, we can easily conclude that

w · x + b = 0 ⇔ P (y = 1 |x) = 1/2
w · x + b > 0 ⇔ P (y = 1 |x) > 1/2
w · x + b < 0 ⇔ P (y = 1 |x) < 1/2.

Thus the decision boundary is the hyperplane defined by

w · x+ b = w1x1 + · · ·+ wdxd + b = 0.

The fact that the decision boundary is linear has profound implications for the accuracy, and the performance in general, of the classifier obtained by logistic regression. So when the dataset itself has a nonlinear clustering pattern, this standard version of logistic regression will not work very well. Numerous methods have been devised to cope with this problem, and we will deal with many of them in the subsequent lectures.
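A minimal sketch (ours) of the resulting classifier, which thresholds w · x + b at 0, or equivalently P (y = 1 |x) at 1/2:

```python
import numpy as np

def classify_binary(X, w, b):
    """phi(x) = 1 if w . x + b > 0 (i.e. P(y = 1 | x) > 1/2), else 0."""
    return (X @ w + b > 0).astype(int)

# Example with made-up parameters: the decision boundary is the line x1 - x2 + 0.5 = 0
w, b = np.array([1.0, -1.0]), 0.5
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print(classify_binary(X, w, b))   # [1 0 1]
```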

1.5 Performance evaluation

1.5.1 Confusion Matrix

Once a classifier ϕ is given, one can analyze its performance by measuring various errors as follows. Define

P = {(x^(i), y^(i)) | y^(i) = 1}
N = {(x^(i), y^(i)) | y^(i) = 0}
TP = {(x^(i), y^(i)) ∈ P | ϕ(x^(i)) = 1} : True Positive Set
TN = {(x^(i), y^(i)) ∈ N | ϕ(x^(i)) = 0} : True Negative Set
FN = {(x^(i), y^(i)) ∈ P | ϕ(x^(i)) = 0} : False Negative Set
FP = {(x^(i), y^(i)) ∈ N | ϕ(x^(i)) = 1} : False Positive Set.


Figure 3: Confusion Matrix

For example, P is the set of given data points whose label is y = 1, and TP is the set of points in P for which the decision turns out to be y = 1, and so on. Similarly, FN is the set of points in P for which the decision turns out to be y = 0. Here, the positive or negative appellation is based on the convention that the points with y = 1 are the ones one wants to identify. (One can just as easily focus on label 0 rather than 1 and do a similar analysis.) From these, we can define various measures:

TPR = True Positive Rate = |TP|/|P| : Sensitivity
TNR = True Negative Rate = |TN|/|N| : Specificity
FNR = False Negative Rate = |FN|/|P| = 1 − TPR
FPR = False Positive Rate = |FP|/|N| = 1 − TNR,

where the absolute value notation (the left and right bars) on a set denotes its cardinality. So, for instance, |TP| denotes the number of elements in the set TP, and so on. This summary is usually organized in a table called the confusion matrix, as shown in Figure 3. It is worth noting that in statistics jargon, asserting the existence of something that is not there is usually called the type I error; thus our FPR corresponds to the type I error. On the other hand, asserting the non-existence of something that is actually there is called the type II error; thus FNR corresponds to the type II error situation. Note also that

|P| = |TP| + |FN|
|N| = |FP| + |TN|.


Many people prefer to look at the following performance measures:

Precision = |TP|/(|TP| + |FP|)
Recall = |TP|/(|TP| + |FN|) = |TP|/|P|.

The precision is the ratio of TP over all cases predicted positive. (It concerns the first vertical column of the confusion matrix.) High precision means that once the classifier predicts ϕ(x) = 1, it is very likely that the true label is also 1. The recall is the ratio of TP over all cases with positive label. (It concerns the first horizontal row of the confusion matrix.) High recall means that most cases with label 1 are reported to be positive by the classifier. Note that recall, sensitivity and TPR are the same thing.

It is also common to combine the precision and recall into one measure called the F1 measure. It is the harmonic mean of the precision and recall, given by

1/F1 = (1/2)(1/Precision + 1/Recall),

which is a single measure frequently used as an evaluation criterion for classifiers.

Finally, the accuracy is defined as the fraction of correct classifications over all cases, i.e.

Accuracy = (|TP| + |TN|)/(|P| + |N|).

1.5.2 Receiver Operating Characteristic (ROC) Curve

The above decision making criterion of taking the higher probability outcome is essentially the default choice in the absence of any preference regarding classifier behavior or performance. However, in a high-security entrance system that uses biometric personal identification, any false positive could represent a potentially serious security breach. So in this case it is desirable to make the FPR minimal while tolerating a reasonably high FNR. On the other hand, in a marketing campaign, especially when the campaign cost is not very high, it is more desirable to reach as many potential customers as possible, even if it means selecting some people who are not very likely to become customers. The situation is depicted in Figure 4.


Figure 4: Moving the Decision Boundary

In the former case, one moves the decision boundary in the direction of (1), resulting in lower FPR and higher FNR, while in the latter case the decision boundary is moved in the direction of (2), which results in higher FPR and lower FNR. Algorithmically, one adjusts the constant c so that one decides y = 1 if and only if θ · x > c. In the former case c gets bigger, and in the latter, smaller. This trade-off relation is drawn in Figure 5. In another form, one draws the ROC curve of TPR versus FPR as in Figure 6. The scale on the x-axis and on the y-axis may be linear or logarithmic.

As for the precision and recall, one can also draw the so-called precision-recall curve, which is similar in spirit to the ROC curve. In fact, it is similar to Figure 5, with Precision on the x-axis and Recall on the y-axis.

Figure 5: ROC Curve: FNR vs. FPR


Figure 6: ROC Curve: TPR vs. FPR
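An ROC curve as in Figure 6 can be traced by sweeping the threshold c in the rule "decide y = 1 if and only if θ · x > c" and recording (FPR, TPR) at each value of c. A minimal sketch (ours), assuming scores holds θ · x for each sample:

```python
import numpy as np

def roc_points(scores, y_true, n_thresholds=101):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold c."""
    pts = []
    for c in np.linspace(scores.min() - 1, scores.max() + 1, n_thresholds):
        y_pred = (scores > c).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tpr = tp / max(np.sum(y_true == 1), 1)
        fpr = fp / max(np.sum(y_true == 0), 1)
        pts.append((fpr, tpr))
    return np.array(pts)

# Illustration with noisy synthetic scores
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)
scores = y_true + rng.normal(scale=1.0, size=200)
print(roc_points(scores, y_true)[:5])
```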

2 Multiclass logistic regression

2.1 Probability model

Suppose now the output y is a categorical variable with values in a set of K elements, so the output space is Y = {1, · · · , K}. Using the convention of Lecture 2, y is represented as

y = (y1, · · · , yK),

where yk = I(y = k). The multiclass logistic regression model postulates that the probability of y = k given x is proportional to an exponential of a linear combination of the input features, i.e.

P (y = k |x) ∝ e^{wk·x+bk} = exp(wk1 x1 + · · · + wkd xd + bk), (14)

where wk is a column vector defined by

wk = [wk1, · · · , wkd]T .

Define the K × d matrix W whose (k, j)-th entry is wkj, so that its k-th row is (wk)^T:

W = [(w1)^T; · · · ; (wk)^T; · · · ; (wK)^T] =
    [ w11 · · · w1j · · · w1d
       ⋮          ⋮          ⋮
      wk1 · · · wkj · · · wkd
       ⋮          ⋮          ⋮
      wK1 · · · wKj · · · wKd ].  (15)


Write b also as a K-dimensional column vector,

b = [b1, · · · , bK]^T.

For k = 1, · · · , K, let

zk = wk · x + bk = ∑_{j=1}^d wkj xj + bk (16)

and let

z = [z1, · · · , zK]^T

be the corresponding K-dimensional column vector. Then in matrix notation we have

z = Wx + b.

The matrix W and the vector b are to be estimated with data. Since the probabilities should sum to one, the proportionality constant in Equation (14) should be 1/∑_{j=1}^K e^{zj}, which implies

P (y = k |x) = e^{zk}/∑_{j=1}^K e^{zj} = e^{wk·x+bk}/∑_{j=1}^K e^{wj·x+bj}. (17)

Now define the following vector valued function called the softmax function:

Definition 2.

softmax(z1, · · · , zK) = ( e^{z1}/∑_{j=1}^K e^{zj}, · · · , e^{zK}/∑_{j=1}^K e^{zj} ),

and the k-th element of its output is denoted by

softmax_k(z1, · · · , zK) = e^{zk}/∑_{j=1}^K e^{zj}.

In vector notation, we write

softmax(z) = softmax(z1, · · · , zK).

Summarizing all these, we have the following:


Proposition 1. The output probability P (y = k |x) is a softmax output as given by

hk = hk(x) = P (y = k |x) = softmax_k(z1, · · · , zK) = e^{zk}/∑_{j=1}^K e^{zj}, (18)

where

zk = ∑_{j=1}^d wkj xj + bk, (19)

i.e., in vector notation,

z = Wx + b. (20)

The message of this proposition is that the output of multiclass logistic regression is the set of probabilities h1, · · · , hK with hk = P (y = k |x) for k = 1, · · · , K.
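A minimal sketch (ours) of the softmax output (18)-(20); subtracting the maximum of z before exponentiating is a standard numerical safeguard, not part of the lecture's formulas, and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z1, ..., zK) as in Definition 2."""
    z = z - np.max(z)              # numerical stability only
    e = np.exp(z)
    return e / np.sum(e)

def multiclass_probs(x, W, b):
    """h_k = P(y = k | x) = softmax_k(Wx + b), as in Proposition 1."""
    return softmax(W @ x + b)

# Example: K = 3 classes, d = 2 features, made-up parameters
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([0.5, 2.0])
h = multiclass_probs(x, W, b)
print(h, h.sum())                  # probabilities summing to 1
```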

2.2 Neural network formalism

The input-output relationship in Proposition 1 can be recast as a simple neural network as in Figure 7. The circles in the left box represent the input variables x1, · · · , xd, and the ones on the right the response variables h1, · · · , hK. The internal state of the k-th output neuron is the variable zk, whose value is as described in (19). In neural network parlance, zk is obtained by summing over all j the products of wkj and xj and then adding the value bk, called the bias. Once the values of z1, · · · , zK are given, the outputs h1, · · · , hK are found by applying the softmax function (18).

2.3 Learning

As is always the case, learning means fixing the parameters. In this case, the parameters are W and b. In neural networks, they are represented as the weights of the edges (links).

Let us now use the maximum likelihood estimator to find the parameters. First, we need a likelihood function. In fact, we need a conditional likelihood function P (y |x), which can be written in the following form:

P (y |x) = ∏_{k=1}^K hk(x)^{yk},


Figure 7: Neural network view of logistic regression

where hk(x) is as in (18). Treating this as a likelihood function, its log likelihood function is

log P (y |x) = ∑_{k=1}^K yk log hk(x). (21)

Now let D = {(x^(i), y^(i))}_{i=1}^N be a given dataset. The conditional likelihood function with respect to D is

L(W, b) = ∏_{i=1}^N P (y^(i) |x^(i)),

where

P (y^(i) |x^(i)) = ∏_{k=1}^K hk(x^(i))^{y^(i)_k}.

Therefore the log likelihood function is

l(W, b) = log L(W, b) = ∑_{i=1}^N ∑_{k=1}^K y^(i)_k log hk(x^(i)). (22)

The maximum likelihood estimates of W and b are the maximum points of (22). Since they cannot be found explicitly, we use a numerical method, in particular the gradient descent method applied to the negative of the log likelihood function above. To simplify notation, we unify the parameters wkj and bk into one family by defining θkj = wkj and θk0 = bk for k = 1, · · · , K and j = 1, · · · , d. So written in matrix form, θ is a K × (d + 1) matrix of the form

θ = [ θ10 θ11 · · · θ1j · · · θ1d        [ b1 w11 · · · w1j · · · w1d
       ⋮    ⋮          ⋮          ⋮          ⋮   ⋮          ⋮          ⋮
      θk0 θk1 · · · θkj · · · θkd    =    bk wk1 · · · wkj · · · wkd
       ⋮    ⋮          ⋮          ⋮          ⋮   ⋮          ⋮          ⋮
      θK0 θK1 · · · θKj · · · θKd ]       bK wK1 · · · wKj · · · wKd ].

We now set up an optimization problem. Writing (22) in terms of θ, we have the following negative log likelihood function:

f(θ) = − log L(W, b) = − ∑_{i=1}^N ∑_{k=1}^K y^(i)_k log hk(x^(i)), (23)

where

hk(x^(i)) = e^{z^(i)_k} / ∑_{ℓ=1}^K e^{z^(i)_ℓ},

and

z^(i)_k = ∑_{j=0}^d θkj x^(i)_j.

Note that the maximum likelihood estimate θ̂ of the parameter θ is the minimum point of f(θ), i.e.

θ̂ = argmin_θ f(θ).

Luckily, this f(θ) is a convex function of θ, which makes the optimization algorithm easy. We will show this fact in the general context of the exponential family of distributions in Lecture 4. The numerical method to use is standard gradient descent. Let us first look at the individual summand

u^(i)(θ) = − ∑_{k=1}^K y^(i)_k log hk(x^(i)), (24)


which is the objective function when only one data point (x^(i), y^(i)) is taken into consideration. (This is the case when we apply the stochastic gradient descent algorithm.) To simplify notation, we drop the parenthesized superscripts in the above formula and write hk instead of hk(x) to get

u = − ∑_{k=1}^K yk log hk, (25)

which is exactly the negative of the log likelihood function of the probability as in (21). The variable dependence is such that u, hence f, depends on h1, · · · , hK, which depend on z1, · · · , zK, which in turn depend on θrs, as described in Proposition 1. The dependence relation is depicted in Figure 8. So by the chain rule, we have

∂u/∂θrs = ∑_{k=1}^K ∑_{t=1}^K (∂u/∂ht)(∂ht/∂zk)(∂zk/∂θrs). (26)

It is trivial to calculate ∂u/∂ht, and ∂ht/∂zk can be easily calculated from (18). The derivative ∂zk/∂θrs can also be calculated from (19), using the fact that θrs = br if s = 0 and θrs = wrs if s ≠ 0.

Figure 8: Variable dependencies

Once we get the derivative ∂u/∂θrs in this way, the derivative ∂u^(i)(θ)/∂θrs can be obtained by replacing x with x^(i). Then the derivative ∂f(θ)/∂θrs is simply the sum of ∂u^(i)(θ)/∂θrs over all i. This way of computing the derivative


∂f(θ)/∂θrs of f(θ) is a basic building block of the so-called back propagation algorithm in neural networks, a systematic exposition of which will be given in our lecture on neural networks.

Once we have the derivative, the learning, i.e. finding good parameters, is done in the usual way. Namely, we can apply batch learning, online learning, or mini-batch learning as described in Section 1.3.
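For the softmax output (18) and the objective (23)-(24), carrying the chain rule (26) through gives the well-known compact form ∂f/∂θ = ∑_i (h(x^(i)) − y^(i)) (x^(i))^T; we state this without the intermediate factors, which the lecture leaves to the reader. A minimal sketch (ours) of a full-batch gradient and a single stochastic gradient step:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def grad_f(theta, X, Y):
    """Gradient of f(theta) = -sum_i sum_k y^(i)_k log h_k(x^(i)).

    X is N x (d+1) with x_0 = 1, Y is N x K one-hot, theta is K x (d+1).
    A standard softmax/cross-entropy computation gives
    grad = sum_i (h(x^(i)) - y^(i)) (x^(i))^T.
    """
    H = softmax_rows(X @ theta.T)     # N x K matrix of probabilities h_k(x^(i))
    return (H - Y).T @ X              # K x (d+1) gradient matrix

def sgd_step(theta, x, y, eta=0.1):
    """One stochastic gradient descent step on a single one-hot pair (x, y)."""
    h = softmax_rows((theta @ x)[None, :])[0]
    return theta - eta * np.outer(h - y, x)

# Tiny example: K = 3, d = 2, so theta is 3 x 3 including the bias column
theta = np.zeros((3, 3))
x = np.array([1.0, 0.4, -1.2])        # the leading 1 is the bias input x_0
y = np.array([0.0, 1.0, 0.0])
theta = sgd_step(theta, x, y)
X = np.array([[1.0, 0.4, -1.2], [1.0, -0.3, 0.7]])
Y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
print(grad_f(theta, X, Y))
```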

2.4 Cross entropy interpretation

Recall Proposition 1. There hk(x) represents the probability hk(x) = P (y = k |x). Similarly, for the input data x^(i), hk(x^(i)) is the probability hk(x^(i)) = P (y = k |x^(i)). Interpret the output data y^(i) = (y^(i)_1, · · · , y^(i)_K) as specifying a probability distribution over y = k for k = 1, · · · , K, although only one of its entries is 1 and the rest are 0. Then the formula (24) can be reinterpreted as the cross entropy

H(y^(i), h(x^(i))) = − ∑_{k=1}^K y^(i)_k log hk(x^(i)).

Then the optimization problem of minimizing (24) is exactly that of minimizing this cross entropy. In view of Proposition 2, it means trying to make the probability h(x^(i)) as close to y^(i) as possible. The overall minimization of the objective function (23) then amounts to trying to make the probabilities h(x^(i)) as close to y^(i) as possible for all i on average.

2.5 Neural network formalism of binary logistic regression

It turns out that the expression (17) has a redundancy coming from the fact that the probabilities in (14) have to add up to 1. To see its effect, define w′j = wj − wK and b′j = bj − bK for j = 1, · · · , K − 1. Then dividing the numerator and the denominator of Equation (17) by e^{wK·x+bK} we have

P (y = k |x) = e^{w′k·x+b′k} / (1 + ∑_{j=1}^{K−1} e^{w′j·x+b′j}), for k = 1, · · · , K − 1,

P (y = K |x) = 1 / (1 + ∑_{j=1}^{K−1} e^{w′j·x+b′j}).


It is easy to see that upon multiplying the numerator and the denominator of the above equations by e^{wK·x+bK}, one can recover Equation (17); namely, the two expressions are equivalent. In particular this shows that for K = 2 the multiclass logistic regression reduces to the binary logistic regression. Since there is only one w′j and b′j in this case, we write them as w and b. Thus we have

P (y = 1 |x) = e^{w·x+b}/(1 + e^{w·x+b}),
P (y = 0 |x) = 1/(1 + e^{w·x+b}).

In the last formula, according to the multiclass convention, P (y = 0 |x) should have been P (y = 2 |x); however, to conform to our binary classification convention, we replace 2 with 0. Its neural network representation is depicted in Figure 9. Note that

z = w1x1 + · · · + wdxd + b,
h = P (y = 1 |x) = σ(z).

Note also that only one output neuron is needed because P (y = 0 |x) = 1 − P (y = 1 |x).

Figure 9: Neural network of binary logistic regression


2.6 Decision making

Let us now discuss how the decision is made once the maximum likelihood estimate θ̂ = (Ŵ, b̂) is found. From (17), the probability can be written as

P (y = k |x) = e^{wk·x+bk} / ∑_{j=1}^K e^{wj·x+bj}. (27)

As usual, the classifier ϕ : X → Y is defined by

ϕ(x) = argmax_j P (y = j |x).

Define

Hj(x) = Hj(x1, · · · , xd) = wj · x+ bj = wj1x1 + · · ·+ wjdxd + bj.

Let ℓ = ϕ(x). Then, in view of (27), it is easy to see that this is equivalent to saying that

Hℓ(x) ≥ Hj(x), for j = 1, · · · , K. (28)
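Since the softmax denominator in (27) is shared by all classes, taking the argmax of the probabilities is the same as taking the argmax of the linear scores Hj(x) = wj · x + bj, which is what (28) says. A minimal sketch (ours):

```python
import numpy as np

def classify_multiclass(x, W, b):
    """phi(x) = argmax_j P(y = j | x) = argmax_j (w_j . x + b_j); labels are 1..K."""
    scores = W @ x + b                 # H_j(x) for j = 1, ..., K
    return int(np.argmax(scores)) + 1

# Example with made-up parameters (K = 3, d = 2)
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.1, -0.1])
print(classify_multiclass(np.array([0.5, 2.0]), W, b))   # class 2
```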

Let us now describe the geometric picture. To do so, we have to be careful about the redundancy mentioned at the beginning of Section 2.5. In other words, the formula (27) remains the same even if we divide the numerator and the denominator by e^{wK·x+bK}, which is equivalent to saying that the linear function Hj(x) is replaced with Hj(x) − HK(x). For this reason, we may assume HK(x) ≡ 0 without loss of generality when it comes to the geometric description.

Example (d = 2, K = 3). In this case there are two variables x1 and x2 and three linear functions H1(x1, x2), H2(x1, x2) and H3(x1, x2). But since H3 = 0, we only have to deal with the two linear functions H1 and H2. In Figure 10, the zero sets {H1 = 0} and {H2 = 0} are drawn. H1^+ represents the region {H1 > 0}, and similarly H1^− represents the region {H1 < 0}, and so on. Applying the decision making method above, R² is divided into three regions as in Figure 11. The region with horizontal shading is where the decision is y = 1, and in the region with vertical shading the decision is y = 2. In the region shaded with slanted lines, y = 3.

Homework. Let d = 2 and K = 4. Suppose H1, H2 and H3 are configured as in Figure 12. Draw the decision regions for the output y ∈ {1, 2, 3, 4}.


Figure 10: A configuration for the case of d = 2 and K = 3

Figure 11: The regions with output

Figure 12: A configuration for the case of d = 2 and K = 4


3 Ordinal logistic regression

There are problems in which the output labels of y have order relations. Typical of such cases is scoring. Credit scoring in banks, rating bonds for default, and page rank in Internet search engines are examples of such problems. Figure 13 shows data points marked with labels A, B, C and D. They have an order relation in the sense that A is more desirable than B, B more than C, and so on. The task is to classify them in such a way that the order relation is respected. Ordinal regression tries to solve this problem by setting up a family of parallel hyperplanes in the input space R^d. In Figure 13, three parallel hyperplanes divide the space into four regions so that these data points are classified more or less correctly.

Figure 13: Illustration of ordinal regression

To set up the problem, let us now assume that there are K classes {1, 2, · · · , K} with an ordinal relation of increasing order. Let D = {(x^(i), y^(i))}_{i=1}^N be a given dataset. The output y^(i) takes one of the values in {1, 2, · · · , K}. For each ℓ = 1, 2, · · · , K − 1, define a new dataset D^(ℓ) = {(x^(i), z^(i)_ℓ)}_{i=1}^N, where

z^(i)_ℓ = 0 if y^(i) ∈ {1, 2, · · · , ℓ}, and z^(i)_ℓ = 1 otherwise,

so that z^(i)_ℓ is a binary output as in Figure 14.

Here we use binary logistic regression as the binary classifier to do the ordinal regression. (But one can use any binary classifier that constructs a hyperplane as the class boundary.)


Figure 14: Definition of z`

Proceeding as in binary logistic regression, the log likelihood function is

l(bℓ, w) = ∑_{i=1}^N { z^(i)_ℓ log σ(bℓ + w1 x^(i)_1 + · · · + wd x^(i)_d) + (1 − z^(i)_ℓ) log(1 − σ(bℓ + w1 x^(i)_1 + · · · + wd x^(i)_d)) }.

Since there are K − 1 classifiers, one for each ℓ = 1, · · · , K − 1, we need to simultaneously find appropriate parameters b1, · · · , bK−1, w1, · · · , wd. One way to do so is to maximize

l(b, w) = ∑_{ℓ=1}^{K−1} l(bℓ, w)
        = ∑_{ℓ=1}^{K−1} ∑_{i=1}^N { z^(i)_ℓ log σ(bℓ + w1 x^(i)_1 + · · · + wd x^(i)_d) + (1 − z^(i)_ℓ) log(1 − σ(bℓ + w1 x^(i)_1 + · · · + wd x^(i)_d)) }.

Now recall that logistic regression is a method of setting up the log of odds as a linear combination of the input (feature) variables x1, · · · , xd, so that

bℓ + w1x1 + · · · + wdxd = log Odds(zℓ = 1 |x)
                         = log [ P (y ≥ ℓ + 1 |x) / P (y ≤ ℓ |x) ]
                         = log [ (p_{ℓ+1} + · · · + p_K) / (p_1 + · · · + p_ℓ) ], (29)

where pi = P (y = i |x) for i = 1, · · · , K. Thus

e^{bℓ} e^{w1x1+···+wdxd} = Odds(zℓ = 1 |x).

This means that for any x = (x1, · · · , xd), the odds Odds(z1 = 1 |x), · · · , Odds(z_{K−1} = 1 |x) are all constant multiples of each other, hence the name proportional odds property of ordinal regression. Also, (29) means that for fixed x = (x1, · · · , xd), bℓ + w1x1 + · · · + wdxd decreases as ℓ increases. Thus we have

b1 > b2 > · · · > b_{K−1}. (30)

Let us now look at how the decision is made and how the decision regions look. First, we assume that the maximum likelihood estimates b̂ and ŵ have already been found. (Note that l(bℓ, w) is (strictly) concave, and so is l(b, w). So it is relatively easy to find the maximum likelihood estimator by applying the gradient descent algorithm.) But for simplicity of notation, let us drop the hat notation and still denote them by b and w. Now define

Hℓ(x1, · · · , xd) = bℓ + w1x1 + · · · + wdxd.

Note that at the point x where H2 = 0, i.e. b2 + w1x1 + · · · + wdxd = 0, we have b1 + w1x1 + · · · + wdxd = b1 − b2 > 0. Therefore the zero set {H2 = 0} lies in the region {H1 > 0}. Figure 15 shows the parallel planes. As before, H1^+ represents the region where {H1 > 0}, and so on. As we have just seen, {H2 = 0} must be in {H1 > 0}, and so on.

Figure 15: Decision regions of ordinal regression

To illustrate how to find the decision region, let us suppose a point x lies between {Hℓ = 0} and {Hℓ+1 = 0}. We want to know what the decision has to be. Note that by assumption

Hℓ(x) > 0,
Hℓ+1(x) < 0.

Thus

1 < e^{Hℓ(x)} = Odds(zℓ = 1 |x) = P (y ≥ ℓ + 1 |x) / P (y ≤ ℓ |x).


Therefore, we have

P (y ≥ ℓ + 1 |x) > P (y ≤ ℓ |x),

which implies that the higher probability decision has to be

y ≥ ℓ + 1. (31)

On the other hand,

1 > e^{Hℓ+1(x)} = Odds(zℓ+1 = 1 |x) = P (y ≥ ℓ + 2 |x) / P (y ≤ ℓ + 1 |x).

Therefore, we have

P (y ≥ ℓ + 2 |x) < P (y ≤ ℓ + 1 |x),

which again implies that

y ≤ ℓ + 1. (32)

Therefore, combining (31) and (32), the decision has to be

y = ℓ + 1.

Now assume H1(x) < 0. Then

1 > e^{H1(x)} = Odds(z1 = 1 |x) = P (y ≥ 2 |x) / P (y ≤ 1 |x).

Therefore, since 1 is the smallest value y can take, the decision has to be

y = 1.

Finally, suppose that H_{K−1}(x) > 0. Then

1 < e^{H_{K−1}(x)} = Odds(z_{K−1} = 1 |x) = P (y ≥ K |x) / P (y ≤ K − 1 |x),

which again must imply that the decision has to be

y = K.

All these decisions together give the decision regions marked in Figure 15.

Defining H_K(x) = −∞, the classifier ϕ(x) can be succinctly written as

ϕ(x) = min{ ℓ | Hℓ(x) ≤ 0 }.
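A minimal sketch (ours) of this decision rule, assuming the parameters w and b1 > · · · > bK−1 have already been fitted:

```python
import numpy as np

def classify_ordinal(x, w, b):
    """phi(x) = min{ l : H_l(x) <= 0 } with H_l(x) = b_l + w . x;
    if H_l(x) > 0 for all l = 1, ..., K-1, the decision is K."""
    K = len(b) + 1
    H = b + np.dot(w, x)               # H_1(x), ..., H_{K-1}(x)
    below = np.nonzero(H <= 0)[0]
    return int(below[0]) + 1 if below.size > 0 else K

# Example: K = 4, three thresholds with b1 > b2 > b3 (made-up values)
w = np.array([1.0, 0.5])
b = np.array([2.0, 0.0, -2.0])
for x in [np.array([-3.0, 0.0]), np.array([0.5, 0.0]), np.array([5.0, 0.0])]:
    print(classify_ordinal(x, w, b))   # 1, 3, 4
```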


4 Mathematical supplements

4.1 Cross entropy

Let p = (p1, · · · , pn) be a probability distribution (PMF) on n objects. Its entropy H(p) is defined by

H(p) = − ∑_{i=1}^n pi log pi.

Let q = (q1, · · · , qn) be a probability distribution on the same n objects. The cross entropy H(p, q) is defined by

H(p, q) = − ∑_{i=1}^n pi log qi.

It is easy to see that H(p, q) can be written in the following form:

H(p, q) = H(p) +DKL(p ‖ q),

where

DKL(p ‖ q) = − ∑_{i=1}^n pi log(qi/pi),

which is called the Kullback-Leibler divergence of q from p.
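A quick numerical illustration of these definitions (ours), checking that H(p, q) = H(p) + DKL(p ‖ q), that DKL(p ‖ q) ≥ 0, and that the divergence vanishes when q = p:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return -np.sum(p * np.log(q / p))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))
print(kl_divergence(p, q) >= 0, np.isclose(kl_divergence(p, p), 0.0))
```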

Proposition 2. DKL(p ‖ q) ≥ 0, and it is zero if and only if p = q. Thus for any q, H(p, q) ≥ H(p). Moreover, for a fixed probability distribution p, H(p, q) achieves its minimum H(p) if and only if q = p.

Proof. Since − log(t) is a convex function, using the Jensen inequality we have

∑_{i=1}^n pi (− log)(qi/pi) ≥ (− log)(∑_{i=1}^n pi (qi/pi)) = − log(∑_{i=1}^n qi) = 0.

Equality holds if and only if qi/pi is the same for all i. The rest of the proof follows immediately.
