Logistic Regression
Jia Li
Department of Statistics, The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/∼jiali
Preserve linear classification boundaries.

- By the Bayes rule:

  G(x) = \arg\max_k \Pr(G = k \mid X = x) .

- The decision boundary between class k and class l is determined by the equation:

  \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) .

- Divide both sides by \Pr(G = l \mid X = x) and take the log. The above equation is equivalent to

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 .
- Since we enforce a linear boundary, we can assume

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j .

- For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).
Assumptions
\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x

\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x

\quad \vdots

\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x
- For any pair (k, l):

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x .

- Number of parameters: (K − 1)(p + 1).

- Denote the entire parameter set by

  \theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, ..., \beta_{(K-1)0}, \beta_{K-1}\} .

- The log-ratios of posterior probabilities are called log-odds or logit transformations.
- Under the assumptions, the posterior probabilities are given by:

  \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, ..., K-1,

  \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} .

- For \Pr(G = k \mid X = x) given above, obviously:
  - The probabilities sum to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
  - A simple calculation shows that the assumptions are satisfied.
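These two formulas are easy to evaluate directly. Below is a minimal NumPy sketch (not from the slides; the function name and the row layout of the coefficient matrix are choices made here), with class K acting as the reference class whose numerator is 1:

```python
import numpy as np

def posteriors(B, x):
    """Posterior probabilities under the logistic model.

    B : (K-1, p+1) array; row k holds (beta_k0, beta_k).
    x : (p,) input vector; the leading 1 is prepended here.
    Returns the length-K vector Pr(G = k | X = x), k = 1, ..., K.
    """
    x1 = np.concatenate(([1.0], x))            # prepend 1 for the intercept
    e = np.exp(B @ x1)                         # exp(beta_k0 + beta_k^T x), k = 1..K-1
    denom = 1.0 + e.sum()
    return np.concatenate((e, [1.0])) / denom  # class K has numerator 1
```

By construction the returned vector sums to 1, which is the first property noted above.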
Comparison with Linear Regression on Indicators

- Similarities:
  - Both attempt to estimate \Pr(G = k \mid X = x).
  - Both have linear classification boundaries.

- Differences:
  - Linear regression on the indicator matrix approximates \Pr(G = k \mid X = x) by a linear function of x. The estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
  - Logistic regression: \Pr(G = k \mid X = x) is a nonlinear function of x. It is guaranteed to range from 0 to 1 and to sum to 1.
Fitting Logistic Regression Models
- Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.

- Denote p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta).

- Given the first input x_1, the posterior probability of its class being g_1 is \Pr(G = g_1 \mid X = x_1).

- Since the samples in the training data set are independent, the posterior probability of the N samples having classes g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is:

  \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) .
- The conditional log-likelihood of the class labels in the training data set is

  L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) .
Binary Classification
- For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.

- Let p_1(x; \theta) = p(x; \theta); then

  p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) .

- Since K = 2, the parameter set is \theta = \{\beta_{10}, \beta_1\}. We denote \beta = (\beta_{10}, \beta_1)^T .
- If y_i = 1, i.e., g_i = 1,

  \log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta) .

  If y_i = 0, i.e., g_i = 2,

  \log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)) .

  Since either y_i = 0 or 1 - y_i = 0, we have

  \log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)) .
- The conditional log-likelihood is

  L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] .

- There are p + 1 parameters in \beta = (\beta_{10}, \beta_1)^T.

- Assume a column vector form for \beta:

  \beta = (\beta_{10}, \beta_{11}, \beta_{12}, ..., \beta_{1,p})^T .
- Here we add the constant term 1 to x to accommodate the intercept:

  x = (1, x_1, x_2, ..., x_p)^T .
- By the assumption of the logistic regression model:

  p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}

  1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)}

- Substituting the above into L(\beta):

  L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] .
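This simplified form is convenient to evaluate numerically; a small sketch (assuming, as above, that X carries a leading column of ones; np.logaddexp gives an overflow-safe log(1 + e^t)):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X : (N, p+1) input matrix with a leading column of ones.
    y : (N,) vector of 0/1 labels.
    """
    eta = X @ beta                                   # beta^T x_i for all i
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # logaddexp(0, t) = log(1 + e^t)
```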
- To maximize L(\beta), we set the first-order partial derivatives of L(\beta) to zero:

  \frac{\partial L(\beta)}{\partial \beta_{1j}}
  = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
  = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
  = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta))

  for all j = 0, 1, ..., p.
- In matrix form, we write

  \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) .

- To solve the set of p + 1 nonlinear equations \frac{\partial L(\beta)}{\partial \beta_{1j}} = 0, j = 0, 1, ..., p, we use the Newton-Raphson algorithm.

- The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:

  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)) .
- The element in the jth row and nth column of the Hessian is (counting from 0):

  \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
  = - \sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}
  = - \sum_{i=1}^{N} \left( x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right)
  = - \sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)) .
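In matrix form, both quantities are one line each; a sketch under the same conventions as the log-likelihood sketch above (the row scaling over w avoids materializing the N × N matrix W):

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient X^T (y - p) and Hessian -X^T W X of L(beta)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # p(x_i; beta)
    grad = X.T @ (y - p)                    # first-order derivatives
    w = p * (1.0 - p)                       # diagonal entries of W
    hess = -(X * w[:, None]).T @ X          # -X^T W X via row scaling
    return grad, hess
```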
- Starting with \beta^{old}, a single Newton-Raphson update is

  \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} ,

  where the derivatives are evaluated at \beta^{old}.
- The iteration can be expressed compactly in matrix form.
  - Let y be the column vector of the y_i.
  - Let X be the N × (p + 1) input matrix.
  - Let p be the N-vector of fitted probabilities, with ith element p(x_i; \beta^{old}).
  - Let W be the N × N diagonal matrix of weights, with ith diagonal element p(x_i; \beta^{old})(1 - p(x_i; \beta^{old})).

- Then

  \frac{\partial L(\beta)}{\partial \beta} = X^T (y - p) , \qquad
  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - X^T W X .
- The Newton-Raphson step is

  \beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)
  = (X^T W X)^{-1} X^T W (X \beta^{old} + W^{-1}(y - p))
  = (X^T W X)^{-1} X^T W z ,

  where z := X \beta^{old} + W^{-1}(y - p).

- If z is viewed as a response and X as the input matrix, \beta^{new} is the solution to the weighted least squares problem:

  \beta^{new} \leftarrow \arg\min_\beta (z - X\beta)^T W (z - X\beta) .

- Recall that linear regression by least squares solves

  \arg\min_\beta (z - X\beta)^T (z - X\beta) .

- z is referred to as the adjusted response.

- The algorithm is referred to as iteratively reweighted least squares, or IRLS.
Pseudo Code

1. \beta \leftarrow 0.
2. Compute y by setting its elements to

   y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, ..., N.

3. Compute p by setting its elements to

   p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} \qquad i = 1, 2, ..., N.

4. Compute the diagonal matrix W; its ith diagonal element is p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N.
5. z \leftarrow X\beta + W^{-1}(y - p).
6. \beta \leftarrow (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
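A direct NumPy translation of this pseudo code might look as follows (a sketch: the tolerance, the iteration cap, and solving the linear system instead of inverting X^T W X are implementation choices, and a production version would also guard against zero weights in step 5):

```python
import numpy as np

def irls(X, y, tol=1e-8, max_iter=100):
    """Fit binary logistic regression by IRLS.

    X : (N, p+1) input matrix with a leading column of ones.
    y : (N,) vector of 0/1 labels (y_i = 1 iff g_i = 1).
    """
    beta = np.zeros(X.shape[1])                        # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # step 3: fitted probabilities
        w = p * (1.0 - p)                              # step 4: diagonal of W
        z = X @ beta + (y - p) / w                     # step 5: adjusted response
        XtW = X.T * w                                  # X^T W without forming W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # step 6: weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:      # step 7: stopping criterion
            return beta_new
        beta = beta_new
    return beta
```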
Computational Efficiency
- Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient.

- A modified pseudo code is provided next.
1. \beta \leftarrow 0.
2. Compute y by setting its elements to

   y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} \qquad i = 1, 2, ..., N.

3. Compute p by setting its elements to

   p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} \qquad i = 1, 2, ..., N.

4. Compute the N × (p + 1) matrix \tilde{X} by multiplying the ith row of the matrix X by p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N:

   X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} , \qquad
   \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}

5. \beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
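In NumPy, scaling the rows of X is a single broadcast, so one update of the modified algorithm can be written as follows (a sketch under the same conventions; note that X^T \tilde{X} equals X^T W X):

```python
import numpy as np

def irls_step(X, y, beta):
    """One Newton update using the row-scaled matrix X-tilde instead of the N x N matrix W."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))                     # step 3
    Xt = X * (p * (1.0 - p))[:, None]                         # step 4: row i scaled by p_i (1 - p_i)
    return beta + np.linalg.solve(X.T @ Xt, X.T @ (y - p))    # step 5
```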
Example
Diabetes data set
- The input X is two-dimensional. X_1 and X_2 are the two principal components of the original 8 variables.

- Class 1: without diabetes; Class 2: with diabetes.

- Applying logistic regression, we obtain

  \beta = (0.7679, -0.6816, -0.3664)^T .
- The posterior probabilities are:

  \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

  \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

- The classification rule is:

  G(x) = \begin{cases} 1 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 \geq 0 \\ 2 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases}
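Applying the rule is just a sign check on the linear score; a small sketch using the fitted coefficients above (the test point is hypothetical):

```python
import numpy as np

beta = np.array([0.7679, -0.6816, -0.3664])   # fitted (intercept, X1, X2) from above

def classify(x1, x2):
    """Class 1 if the linear score is nonnegative, class 2 otherwise."""
    score = beta @ np.array([1.0, x1, x2])
    return 1 if score >= 0 else 2

print(classify(0.5, 0.5))   # score = 0.7679 - 0.3408 - 0.1832 = 0.2439 >= 0, so class 1
```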
[Figure: decision boundaries on the diabetes data. Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.]

- Within the training data set, the classification error rate is 28.12%.
- Sensitivity: 45.9%.
- Specificity: 85.8%.
Multiclass Case (K ≥ 3)

- When K ≥ 3, \beta is a (K − 1)(p + 1)-vector:

  \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
        = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}
- Let \bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix} .

- The likelihood function becomes

  L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
  = \sum_{i=1}^{N} \log \left( \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right)
  = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right] ,

  with the convention \bar{\beta}_K = 0, so that the numerator for class K is 1.
- Note: the indicator function I(·) equals 1 when the argument is true and 0 otherwise.

- First-order derivatives:

  \frac{\partial L(\beta)}{\partial \beta_{kj}}
  = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right]
  = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)
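A sketch of these first-order derivatives, stacking the K − 1 coefficient blocks as rows of a matrix (names and layout are choices made here, not the slides' notation):

```python
import numpy as np

def multiclass_gradient(B, X, g):
    """First-order derivatives of L(beta), one row per class block.

    B : (K-1, p+1) matrix; row k-1 holds (beta_k0, beta_k).
    X : (N, p+1) input matrix with a leading column of ones.
    g : (N,) integer labels in {1, ..., K}.
    Entry (k-1, j) of the result is sum_i x_ij (I(g_i = k) - p_k(x_i; beta)).
    """
    E = np.exp(X @ B.T)                                          # e^{beta_k^T x_i}, shape (N, K-1)
    P = E / (1.0 + E.sum(axis=1, keepdims=True))                 # p_k(x_i; beta)
    Y = (g[:, None] == np.arange(1, len(B) + 1)).astype(float)   # indicators I(g_i = k)
    return (Y - P).T @ X
```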
- Second-order derivatives:

  \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
  = \sum_{i=1}^{N} x_{ij} \cdot \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2} \cdot \left[ - e^{\bar{\beta}_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right]
  = \sum_{i=1}^{N} x_{ij} x_{in} \left( - p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
  = - \sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] .
- Matrix form:
  - y is the concatenated indicator vector, of dimension N(K − 1):

    y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix} , \qquad
    y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} , \quad 1 \leq k \leq K-1 .

  - p is the concatenated vector of fitted probabilities, of dimension N(K − 1):

    p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix} , \qquad
    p_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix} , \quad 1 \leq k \leq K-1 .
- \mathbf{X} is the N(K − 1) × (p + 1)(K − 1) block-diagonal matrix built from X:

  \mathbf{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}
- The matrix W is an N(K − 1) × N(K − 1) square matrix:

  W = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1,K-1} \\ W_{21} & W_{22} & \cdots & W_{2,K-1} \\ \vdots & \vdots & \ddots & \vdots \\ W_{K-1,1} & W_{K-1,2} & \cdots & W_{K-1,K-1} \end{pmatrix}

- Each submatrix W_{km}, 1 ≤ k, m ≤ K − 1, is an N × N diagonal matrix.
  - When k = m, the ith diagonal element of W_{kk} is p_k(x_i; \beta^{old})(1 - p_k(x_i; \beta^{old})).
  - When k ≠ m, the ith diagonal element of W_{km} is -p_k(x_i; \beta^{old}) p_m(x_i; \beta^{old}).
- Just as in binary classification,

  \frac{\partial L(\beta)}{\partial \beta} = \mathbf{X}^T (y - p) , \qquad
  \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \mathbf{X}^T W \mathbf{X} .

- The formula for updating \beta^{new} in the binary classification case holds for the multiclass case:

  \beta^{new} = (\mathbf{X}^T W \mathbf{X})^{-1} \mathbf{X}^T W z ,

  where z := \mathbf{X} \beta^{old} + W^{-1}(y - p); or simply:

  \beta^{new} = \beta^{old} + (\mathbf{X}^T W \mathbf{X})^{-1} \mathbf{X}^T (y - p) .
Computation Issues
- Initialization: one option is to use \beta = 0.

- Convergence is not guaranteed, but it usually occurs.

- Usually, the log-likelihood increases after each iteration, but overshooting can occur.

- In the rare cases in which the log-likelihood decreases, cut the step size by half (see the sketch below).
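The slides do not prescribe a particular step-halving rule; one reasonable sketch, reusing a log-likelihood function such as the log_likelihood sketch given earlier:

```python
def damped_update(beta, step, loglik, X, y, max_halvings=30):
    """Halve the Newton step while it would decrease the log-likelihood."""
    L0 = loglik(beta, X, y)
    for _ in range(max_halvings):
        if loglik(beta + step, X, y) >= L0:
            break                      # the step no longer overshoots
        step = step / 2.0              # cut the step size by half
    return beta + step
```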
Connection with LDA
- Under the model of LDA:

  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
  = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
  = a_{k0} + a_k^T x .

- The model of LDA satisfies the assumptions of the linear logistic model.

- The linear logistic model only specifies the conditional distribution \Pr(G = k \mid X = x). No assumption is made about \Pr(X).
- The LDA model specifies the joint distribution of X and G. \Pr(X) is a mixture of Gaussians:

  \Pr(X) = \sum_{k=1}^{K} \pi_k \phi(X; \mu_k, \Sigma) ,

  where \phi is the Gaussian density function.

- Linear logistic regression maximizes the conditional likelihood of G given X: \Pr(G = k \mid X = x).

- LDA maximizes the joint likelihood of G and X: \Pr(X = x, G = k).
- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently, by using more information about the data.

- Samples without class labels can be used under the model of LDA.

- LDA is not robust to gross outliers.

- As logistic regression relies on fewer assumptions, it seems to be more robust.

- In practice, logistic regression and LDA often give similar results.
Simulation
- Assume the input X is 1-D.

- The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.

- Each conditional density is a mixture of two normals:
  - Class 1 (red): 0.6 N(−2, 1/4) + 0.4 N(0, 1).
  - Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).

- The class-conditional densities are shown below.
[Figure: the class-conditional densities of the two classes.]
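These mixtures are simple to sample from; a sketch of the data generation (the seed and the helper name are choices made here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, means, sds, w=0.6):
    """Draw n samples from w N(means[0], sds[0]^2) + (1 - w) N(means[1], sds[1]^2)."""
    comp = rng.random(n) < w                       # choose the mixture component
    return np.where(comp,
                    rng.normal(means[0], sds[0], n),
                    rng.normal(means[1], sds[1], n))

x1 = sample_class(2000, (-2.0, 0.0), (0.5, 1.0))   # class 1: 0.6 N(-2, 1/4) + 0.4 N(0, 1)
x2 = sample_class(2000, ( 0.0, 2.0), (0.5, 1.0))   # class 2: 0.6 N(0, 1/4) + 0.4 N(2, 1)
```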
LDA Result
- Training data set: 2000 samples for each class.

- Test data set: 1000 samples for each class.

- The estimates by LDA: \mu_1 = -1.1948, \mu_2 = 0.8224, \sigma^2 = 1.5268. The boundary value between the two classes is (\mu_1 + \mu_2)/2 = -0.1862.

- The classification error rate on the test data is 0.2315.

- Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750, and the error rate is 0.1765.
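The quoted Bayes boundary can be checked numerically: with equal priors, the boundary is where the two class-conditional densities cross. A sketch using scipy (the root bracket [-2, 0] is an assumption read off the density plot):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def density(x, means, sds, w=0.6):
    """Mixture density w N(means[0], sds[0]^2) + (1 - w) N(means[1], sds[1]^2)."""
    return w * norm.pdf(x, means[0], sds[0]) + (1 - w) * norm.pdf(x, means[1], sds[1])

# With equal priors, the Bayes boundary solves f1(x) = f2(x).
diff = lambda x: (density(x, (-2.0, 0.0), (0.5, 1.0))
                  - density(x, (0.0, 2.0), (0.5, 1.0)))
print(brentq(diff, -2.0, 0.0))   # should land near the -0.7750 quoted above
```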
Logistic Regression Result
- Linear logistic regression obtains

  \beta = (-0.3288, -1.3275)^T .

  The boundary value satisfies -0.3288 - 1.3275 X = 0, hence equals -0.2477.

- The error rate on the test data set is 0.2205.

- The estimated posterior probability is:

  \Pr(G = 1 \mid X = x) = \frac{e^{-0.3288 - 1.3275 x}}{1 + e^{-0.3288 - 1.3275 x}} .
The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the graph below.

[Figure: estimated vs. true posterior probability Pr(G = 1 | X = x).]