Logistic Regression
Jia Li
Department of Statistics, The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/∼jiali
Preserve linear classification boundaries.

- By the Bayes rule:

  G(x) = \arg\max_k \Pr(G = k \mid X = x) .

- The decision boundary between class k and class l is determined by the equation:

  \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) .

- Divide both sides by \Pr(G = l \mid X = x) and take the log. The above equation is equivalent to

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 .
- Since we enforce a linear boundary, we can assume

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j .

- For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).
Assumptions
\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x

\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x

\quad \vdots

\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x
- For any pair (k, l):

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x .

- Number of parameters: (K − 1)(p + 1).

- Denote the entire parameter set by

  \theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, ..., \beta_{(K-1)0}, \beta_{K-1}\} .

- The log-ratios of posterior probabilities are called log-odds or logit transformations.
- Under the assumptions, the posterior probabilities are given by:

  \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, ..., K-1,

  \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} .

- For \Pr(G = k \mid X = x) given above, obviously:
  - The probabilities sum to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
  - A simple calculation shows that the assumptions are satisfied.
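These two formulas are easy to evaluate directly. Below is a minimal NumPy sketch (not from the slides; the function name and the row layout of the coefficient matrix are choices made here), with class K acting as the reference class whose numerator is 1:

```python
import numpy as np

def posteriors(B, x):
    """Posterior probabilities under the logistic model.

    B : (K-1, p+1) array; row k holds (beta_k0, beta_k).
    x : (p,) input vector; the leading 1 is prepended here.
    Returns the length-K vector Pr(G = k | X = x), k = 1, ..., K.
    """
    x1 = np.concatenate(([1.0], x))            # prepend 1 for the intercept
    e = np.exp(B @ x1)                         # exp(beta_k0 + beta_k^T x), k = 1..K-1
    denom = 1.0 + e.sum()
    return np.concatenate((e, [1.0])) / denom  # class K has numerator 1
```

By construction the returned vector sums to 1, which is the first property noted above.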
Comparison with Linear Regression on Indicators

- Similarities:
  - Both attempt to estimate \Pr(G = k \mid X = x).
  - Both have linear classification boundaries.

- Differences:
  - Linear regression on the indicator matrix approximates \Pr(G = k \mid X = x) by a linear function of x. The estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
  - Logistic regression: \Pr(G = k \mid X = x) is a nonlinear function of x. It is guaranteed to range from 0 to 1 and to sum to 1.
Fitting Logistic Regression Models
- Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.

- Denote p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta).

- Given the first input x_1, the posterior probability of its class being g_1 is \Pr(G = g_1 \mid X = x_1).

- Since the samples in the training data set are independent, the posterior probability of the N samples having classes g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is:

  \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) .
- The conditional log-likelihood of the class labels in the training data set is

  L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) .
Binary Classification
- For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.

- Let p_1(x; \theta) = p(x; \theta); then

  p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) .

- Since K = 2, the parameter set is \theta = \{\beta_{10}, \beta_1\}. We denote \beta = (\beta_{10}, \beta_1)^T .
- If y_i = 1, i.e., g_i = 1,

  \log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta) .

  If y_i = 0, i.e., g_i = 2,

  \log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)) .

  Since either y_i = 0 or 1 - y_i = 0, we have

  \log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)) .
- The conditional log-likelihood is

  L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] .

- There are p + 1 parameters in \beta = (\beta_{10}, \beta_1)^T.

- Assume a column vector form for \beta:

  \beta = (\beta_{10}, \beta_{11}, \beta_{12}, ..., \beta_{1,p})^T .
- Here we add the constant term 1 to x to accommodate the intercept:

  x = (1, x_1, x_2, ..., x_p)^T .
- By the assumption of the logistic regression model:

  p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}

  1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)}

- Substituting the above into L(\beta):

  L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] .
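This simplified form is convenient to evaluate numerically; a small sketch (assuming, as above, that X carries a leading column of ones; np.logaddexp gives an overflow-safe log(1 + e^t)):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X : (N, p+1) input matrix with a leading column of ones.
    y : (N,) vector of 0/1 labels.
    """
    eta = X @ beta                                   # beta^T x_i for all i
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # logaddexp(0, t) = log(1 + e^t)
```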
- To maximize L(\beta), we set the first-order partial derivatives of L(\beta) to zero:

  \frac{\partial L(\beta)}{\partial \beta_{1j}}
  = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
  = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
  = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta))

  for all j = 0, 1, ..., p.
- In matrix form, we write

  \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) .

- To solve the set of p + 1 nonlinear equations \frac{\partial L(\beta)}{\partial \beta_{1j}} = 0, j = 0, 1, ..., p, we use the Newton-Raphson algorithm.

- The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:

  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)) .
- The element in the jth row and nth column of the Hessian is (counting from 0):

  \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
  = - \sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}
  = - \sum_{i=1}^{N} \left( x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right)
  = - \sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)) .
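In matrix form, both quantities are one line each; a sketch under the same conventions as the log-likelihood sketch above (the row scaling over w avoids materializing the N × N matrix W):

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient X^T (y - p) and Hessian -X^T W X of L(beta)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # p(x_i; beta)
    grad = X.T @ (y - p)                    # first-order derivatives
    w = p * (1.0 - p)                       # diagonal entries of W
    hess = -(X * w[:, None]).T @ X          # -X^T W X via row scaling
    return grad, hess
```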
- Starting with \beta^{old}, a single Newton-Raphson update is

  \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} ,

  where the derivatives are evaluated at \beta^{old}.
- The iteration can be expressed compactly in matrix form.
  - Let y be the column vector of the y_i.
  - Let X be the N × (p + 1) input matrix.
  - Let p be the N-vector of fitted probabilities, with ith element p(x_i; \beta^{old}).
  - Let W be the N × N diagonal matrix of weights, with ith diagonal element p(x_i; \beta^{old})(1 - p(x_i; \beta^{old})).

- Then

  \frac{\partial L(\beta)}{\partial \beta} = X^T (y - p) , \qquad
  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - X^T W X .
- The Newton-Raphson step is

  \beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)
  = (X^T W X)^{-1} X^T W (X \beta^{old} + W^{-1}(y - p))
  = (X^T W X)^{-1} X^T W z ,

  where z := X \beta^{old} + W^{-1}(y - p).

- If z is viewed as a response and X as the input matrix, \beta^{new} is the solution to the weighted least squares problem:

  \beta^{new} \leftarrow \arg\min_\beta (z - X\beta)^T W (z - X\beta) .

- Recall that linear regression by least squares solves

  \arg\min_\beta (z - X\beta)^T (z - X\beta) .

- z is referred to as the adjusted response.

- The algorithm is referred to as iteratively reweighted least squares, or IRLS.
Pseudo Code

1. \beta \leftarrow 0.
2. Compute y by setting its elements to

   y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, ..., N.

3. Compute p by setting its elements to

   p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} \qquad i = 1, 2, ..., N.

4. Compute the diagonal matrix W; its ith diagonal element is p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N.
5. z \leftarrow X\beta + W^{-1}(y - p).
6. \beta \leftarrow (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
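A direct NumPy translation of this pseudo code might look as follows (a sketch: the tolerance, the iteration cap, and solving the linear system instead of inverting X^T W X are implementation choices, and a production version would also guard against zero weights in step 5):

```python
import numpy as np

def irls(X, y, tol=1e-8, max_iter=100):
    """Fit binary logistic regression by IRLS.

    X : (N, p+1) input matrix with a leading column of ones.
    y : (N,) vector of 0/1 labels (y_i = 1 iff g_i = 1).
    """
    beta = np.zeros(X.shape[1])                        # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # step 3: fitted probabilities
        w = p * (1.0 - p)                              # step 4: diagonal of W
        z = X @ beta + (y - p) / w                     # step 5: adjusted response
        XtW = X.T * w                                  # X^T W without forming W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # step 6: weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:      # step 7: stopping criterion
            return beta_new
        beta = beta_new
    return beta
```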
Computational Efficiency
- Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient.

- A modified pseudo code is provided next.
1. \beta \leftarrow 0.
2. Compute y by setting its elements to

   y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, ..., N.

3. Compute p by setting its elements to

   p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} \qquad i = 1, 2, ..., N.

4. Compute the N × (p + 1) matrix \tilde{X} by multiplying the ith row of the matrix X by p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N:

   X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} , \qquad
   \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}

5. \beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
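In NumPy, scaling the rows of X is a single broadcast, so one update of the modified algorithm can be written as follows (a sketch under the same conventions; note that X^T \tilde{X} equals X^T W X):

```python
import numpy as np

def irls_step(X, y, beta):
    """One Newton update using the row-scaled matrix X-tilde instead of the N x N matrix W."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))                     # step 3
    Xt = X * (p * (1.0 - p))[:, None]                         # step 4: row i scaled by p_i (1 - p_i)
    return beta + np.linalg.solve(X.T @ Xt, X.T @ (y - p))    # step 5
```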
Example
Diabetes data set
- The input X is two-dimensional. X_1 and X_2 are the two principal components of the original 8 variables.

- Class 1: without diabetes; Class 2: with diabetes.

- Applying logistic regression, we obtain

  \beta = (0.7679, -0.6816, -0.3664)^T .
- The posterior probabilities are:

  \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

  \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

- The classification rule is:

  G(x) = \begin{cases} 1 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 \geq 0 \\ 2 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases}
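Applying the rule is just a sign check on the linear score; a small sketch using the fitted coefficients above (the test point is hypothetical):

```python
import numpy as np

beta = np.array([0.7679, -0.6816, -0.3664])   # fitted (intercept, X1, X2) from above

def classify(x1, x2):
    """Class 1 if the linear score is nonnegative, class 2 otherwise."""
    score = beta @ np.array([1.0, x1, x2])
    return 1 if score >= 0 else 2

print(classify(0.5, 0.5))   # score = 0.7679 - 0.3408 - 0.1832 = 0.2439 >= 0, so class 1
```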
[Figure: decision boundaries on the diabetes data. Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.]

- Within the training data set, the classification error rate is 28.12%.
- Sensitivity: 45.9%.
- Specificity: 85.8%.
Multiclass Case (K ≥ 3)

- When K ≥ 3, \beta is a (K − 1)(p + 1)-vector:

  \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
        = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}
- Let \bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix} .

- The likelihood function becomes

  L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
  = \sum_{i=1}^{N} \log \left( \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right)
  = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right] ,

  with the convention \bar{\beta}_K = 0, so that the numerator for class K is 1.
- Note: the indicator function I(·) equals 1 when the argument is true and 0 otherwise.

- First-order derivatives:

  \frac{\partial L(\beta)}{\partial \beta_{kj}}
  = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right]
  = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)
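A sketch of these first-order derivatives, stacking the K − 1 coefficient blocks as rows of a matrix (names and layout are choices made here, not the slides' notation):

```python
import numpy as np

def multiclass_gradient(B, X, g):
    """First-order derivatives of L(beta), one row per class block.

    B : (K-1, p+1) matrix; row k-1 holds (beta_k0, beta_k).
    X : (N, p+1) input matrix with a leading column of ones.
    g : (N,) integer labels in {1, ..., K}.
    Entry (k-1, j) of the result is sum_i x_ij (I(g_i = k) - p_k(x_i; beta)).
    """
    E = np.exp(X @ B.T)                                          # e^{beta_k^T x_i}, shape (N, K-1)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))                 # p_k(x_i; beta)
    Y = (g[:, None] == np.arange(1, len(B) + 1)).astype(float)   # indicators I(g_i = k)
    return (Y - P).T @ X
```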
- Second-order derivatives:

  \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
  = \sum_{i=1}^{N} x_{ij} \cdot \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2} \cdot \left[ - e^{\bar{\beta}_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right]
  = \sum_{i=1}^{N} x_{ij} x_{in} \left( - p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
  = - \sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] .
- Matrix form:
  - y is the concatenated indicator vector, of dimension N(K − 1):

    y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix} , \qquad
    y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} , \quad 1 \leq k \leq K-1 .

  - p is the concatenated vector of fitted probabilities, of dimension N(K − 1):

    p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix} , \qquad
    p_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix} , \quad 1 \leq k \leq K-1 .
- \mathbf{X} is the N(K − 1) × (p + 1)(K − 1) block-diagonal matrix built from X:

  \mathbf{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}
- The matrix W is an N(K − 1) × N(K − 1) square matrix:

  W = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1,K-1} \\ W_{21} & W_{22} & \cdots & W_{2,K-1} \\ \vdots & \vdots & \ddots & \vdots \\ W_{K-1,1} & W_{K-1,2} & \cdots & W_{K-1,K-1} \end{pmatrix}

- Each submatrix W_{km}, 1 ≤ k, m ≤ K − 1, is an N × N diagonal matrix.
  - When k = m, the ith diagonal element of W_{kk} is p_k(x_i; \beta^{old})(1 - p_k(x_i; \beta^{old})).
  - When k ≠ m, the ith diagonal element of W_{km} is -p_k(x_i; \beta^{old}) p_m(x_i; \beta^{old}).
- Just as in binary classification,

  \frac{\partial L(\beta)}{\partial \beta} = \mathbf{X}^T (y - p) , \qquad
  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \mathbf{X}^T W \mathbf{X} .

- The formula for updating \beta^{new} in the binary classification case holds for the multiclass case:

  \beta^{new} = (\mathbf{X}^T W \mathbf{X})^{-1} \mathbf{X}^T W z ,

  where z := \mathbf{X} \beta^{old} + W^{-1}(y - p); or simply:

  \beta^{new} = \beta^{old} + (\mathbf{X}^T W \mathbf{X})^{-1} \mathbf{X}^T (y - p) .
Computation Issues
- Initialization: one option is to use \beta = 0.

- Convergence is not guaranteed, but it usually occurs.

- Usually, the log-likelihood increases after each iteration, but overshooting can occur.

- In the rare cases in which the log-likelihood decreases, cut the step size by half (see the sketch below).
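The slides do not prescribe a particular step-halving rule; one reasonable sketch, reusing a log-likelihood function such as the log_likelihood sketch given earlier:

```python
def damped_update(beta, step, loglik, X, y, max_halvings=30):
    """Halve the Newton step while it would decrease the log-likelihood."""
    L0 = loglik(beta, X, y)
    for _ in range(max_halvings):
        if loglik(beta + step, X, y) >= L0:
            break                      # the step no longer overshoots
        step = step / 2.0              # cut the step size by half
    return beta + step
```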
Connection with LDA
- Under the model of LDA:

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
  = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
  = a_{k0} + a_k^T x .

- The model of LDA satisfies the assumptions of the linear logistic model.

- The linear logistic model only specifies the conditional distribution \Pr(G = k \mid X = x). No assumption is made about \Pr(X).
- The LDA model specifies the joint distribution of X and G. \Pr(X) is a mixture of Gaussians:

  \Pr(X) = \sum_{k=1}^{K} \pi_k \phi(X; \mu_k, \Sigma) ,

  where \phi is the Gaussian density function.

- Linear logistic regression maximizes the conditional likelihood of G given X: \Pr(G = k \mid X = x).

- LDA maximizes the joint likelihood of G and X: \Pr(X = x, G = k).
- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently, by using more information about the data.

- Samples without class labels can be used under the model of LDA.

- LDA is not robust to gross outliers.

- As logistic regression relies on fewer assumptions, it seems to be more robust.

- In practice, logistic regression and LDA often give similar results.
Simulation
- Assume the input X is 1-D.

- The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.

- Each conditional density is a mixture of two normals:
  - Class 1 (red): 0.6 N(−2, 1/4) + 0.4 N(0, 1).
  - Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).

- The class-conditional densities are shown below.
[Figure: the class-conditional densities of the two classes.]
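These mixtures are simple to sample from; a sketch of the data generation (the seed and the helper name are choices made here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, means, sds, w=0.6):
    """Draw n samples from w N(means[0], sds[0]^2) + (1 - w) N(means[1], sds[1]^2)."""
    comp = rng.random(n) < w                       # choose the mixture component
    return np.where(comp,
                    rng.normal(means[0], sds[0], n),
                    rng.normal(means[1], sds[1], n))

x1 = sample_class(2000, (-2.0, 0.0), (0.5, 1.0))   # class 1: 0.6 N(-2, 1/4) + 0.4 N(0, 1)
x2 = sample_class(2000, ( 0.0, 2.0), (0.5, 1.0))   # class 2: 0.6 N(0, 1/4) + 0.4 N(2, 1)
```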
LDA Result
- Training data set: 2000 samples for each class.

- Test data set: 1000 samples for each class.

- The estimates by LDA: \mu_1 = -1.1948, \mu_2 = 0.8224, \sigma^2 = 1.5268. The boundary value between the two classes is (\mu_1 + \mu_2)/2 = -0.1862.

- The classification error rate on the test data is 0.2315.

- Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750, and the error rate is 0.1765.
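The quoted Bayes boundary can be checked numerically: with equal priors, the boundary is where the two class-conditional densities cross. A sketch using scipy (the root bracket [-2, 0] is an assumption read off the density plot):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def density(x, means, sds, w=0.6):
    """Mixture density w N(means[0], sds[0]^2) + (1 - w) N(means[1], sds[1]^2)."""
    return w * norm.pdf(x, means[0], sds[0]) + (1 - w) * norm.pdf(x, means[1], sds[1])

# With equal priors, the Bayes boundary solves f1(x) = f2(x).
diff = lambda x: (density(x, (-2.0, 0.0), (0.5, 1.0))
                  - density(x, (0.0, 2.0), (0.5, 1.0)))
print(brentq(diff, -2.0, 0.0))   # should land near the -0.7750 quoted above
```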
Logistic Regression Result
- Linear logistic regression obtains

  \beta = (-0.3288, -1.3275)^T .

  The boundary value satisfies -0.3288 - 1.3275 X = 0, hence equals -0.2477.

- The error rate on the test data set is 0.2205.

- The estimated posterior probability is:

  \Pr(G = 1 \mid X = x) = \frac{e^{-0.3288 - 1.3275 x}}{1 + e^{-0.3288 - 1.3275 x}} .
The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the graph below.

[Figure: estimated vs. true posterior probability Pr(G = 1 | X = x).]