FSAN815/ELEG815: Foundations of Statistical Learning
Chapter 14: Logistic Regression
Gonzalo R. Arce
Fall 2014



Course Objectives & Structure

The course provides an introduction to the mathematics of data analysis and a detailed overview of statistical models for inference and prediction.

Course Structure:
- Weekly lectures [notes: www.ece.udel.edu/~arce/Courses]
- Homework & computer assignments [30%]
- Midterm & final examinations [70%]

Textbooks:
- Papoulis and Pillai, Probability, Random Variables, and Stochastic Processes.
- Hastie, Tibshirani and Friedman, The Elements of Statistical Learning.
- Haykin, Adaptive Filter Theory.

Gonzalo R. Arce (ECE, Univ. of Delaware) FSAN-815/ELEG-815: Statistical Learning Fall 2014 2 / 18

Logistic Regression

The logistic regression model arises from the desire to model the posterior probabilities of the K classes via linear functions in x, while at the same time ensuring that they sum to one and remain in [0, 1].

\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x

\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x

\vdots

\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x \qquad (1)

Logistic Regression

Here \beta_k = [\beta_{k1}, \beta_{k2}, \ldots, \beta_{kp}]^T and x = [x_1, x_2, \ldots, x_p]^T. The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one). A simple calculation shows that

\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1, 2, \ldots, K-1,

\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \qquad (2)

To emphasize the dependence on the entire parameter set \theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}, we denote the probabilities

\Pr(G = k \mid X = x) = p_k(x; \theta)
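As a quick numerical illustration of Eq. (2), the class probabilities can be evaluated in a few lines of NumPy. This is a sketch only; the function name and the example parameter values are made up, not from the slides:

```python
import numpy as np

def class_probabilities(x, intercepts, betas):
    """Posterior probabilities Pr(G = k | X = x) from Eq. (2).

    intercepts: (K-1,) array of beta_{k0}
    betas:      (K-1, p) array whose k-th row is beta_k
    Returns a length-K array; the last entry is class K.
    """
    # Linear predictors beta_{k0} + beta_k^T x for k = 1, ..., K-1
    eta = intercepts + betas @ x
    expo = np.exp(eta)
    denom = 1.0 + expo.sum()
    # Classes 1..K-1 get exp(eta_k)/denom; class K gets 1/denom
    return np.append(expo, 1.0) / denom

# Example with K = 3 classes and p = 2 inputs (illustrative numbers)
p = class_probabilities(np.array([0.5, -1.0]),
                        np.array([0.2, -0.3]),
                        np.array([[1.0, 0.5], [-0.5, 2.0]]))
```

Whatever the parameter values, the K entries are positive and sum to one, which is exactly the constraint the logit parameterization enforces.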

Fitting Logistic Regression Models

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X.

The log-likelihood for N observations is

l(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) \qquad (3)

where p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta).


Fitting Logistic Regression Models

We now discuss the two-class case in detail.

Code the two-class g_i via a 0/1 response y_i, where y_i = 1 when g_i = 1 and y_i = 0 when g_i = 2. Let p_1(x; \theta) = p(x; \theta) and p_2(x; \theta) = 1 - p(x; \theta).

The log-likelihood can be written as

l(\beta) = \sum_{i=1}^{N} \left\{ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right\}

         = \sum_{i=1}^{N} \left\{ y_i \left[ \log p(x_i; \beta) - \log(1 - p(x_i; \beta)) \right] + \log(1 - p(x_i; \beta)) \right\}

         = \sum_{i=1}^{N} \left\{ y_i \log \frac{\Pr(G = 1 \mid X = x_i)}{\Pr(G = 2 \mid X = x_i)} + \log \Pr(G = 2 \mid X = x_i) \right\}

         = \sum_{i=1}^{N} \left\{ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right\} \qquad (4)
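The algebra in Eq. (4) can be sanity-checked numerically: the Bernoulli form (first line) and the compact form (last line) are the same function of β. A NumPy sketch, with made-up data and parameter values used only for the check:

```python
import numpy as np

def log_lik_compact(beta, X, y):
    """Last line of Eq. (4): sum_i [y_i beta^T x_i - log(1 + exp(beta^T x_i))]."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def log_lik_bernoulli(beta, X, y):
    """First line of Eq. (4): sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic inputs; the leading column of ones carries the intercept
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20).astype(float)
beta = np.array([0.1, -0.4, 0.7])
```

Evaluating both functions at the same β returns the same number, confirming the line-by-line simplification above.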

Fitting Logistic Regression Models

Here \beta = \{\beta_{10}, \beta_1\}, and we assume that the vector of inputs x_i includes the constant term 1 to accommodate the intercept.

To maximize the log-likelihood, we set its derivatives to zero. These score equations are

\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) = 0 \qquad (5)

which are p+1 equations nonlinear in \beta.

Since the first component of x_i is 1, the first score equation specifies that \sum_{i=1}^{N} y_i = \sum_{i=1}^{N} p(x_i; \beta): the expected number of class ones matches the observed number.
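The score formula in Eq. (5) can be verified against a finite-difference gradient of the log-likelihood. A small sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

def log_lik(beta, X, y):
    """Two-class log-likelihood, last line of Eq. (4)."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def score(beta, X, y):
    """Score vector of Eq. (5): sum_i x_i (y_i - p(x_i; beta)) = X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.integers(0, 2, size=30).astype(float)
beta = np.array([0.2, -0.5, 0.3])

# Central finite differences of l(beta) should match the analytic score
eps = 1e-6
fd = np.array([(log_lik(beta + eps * e, X, y) - log_lik(beta - eps * e, X, y)) / (2 * eps)
               for e in np.eye(3)])
```

The agreement between `fd` and `score(beta, X, y)` confirms that differentiating Eq. (4) term by term yields Eq. (5).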

Newton-Raphson algorithm

To solve the score equations (5), we use the Newton-Raphson algorithm, which requires the second-derivative or Hessian matrix

\frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta) (1 - p(x_i; \beta)) \qquad (6)

Starting with \beta^{old}, a single Newton update is

\beta^{new} = \beta^{old} - \left( \frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial l(\beta)}{\partial \beta} \qquad (7)

where the derivatives are evaluated at \beta^{old}.

Newton-Raphson algorithm

Write the score and Hessian in matrix notation:
- y: the N-vector of y_i values.
- x_i: the (p+1)-vector of inputs for observation i.
- X: the N × (p+1) matrix of x_i values.
- p: the N-vector of fitted probabilities with i-th element p(x_i; \beta^{old}).
- W: the N × N diagonal matrix of weights with i-th diagonal element p(x_i; \beta^{old})(1 - p(x_i; \beta^{old})).

Fitting Logistic Regression Models

Then we have

\frac{\partial l(\beta)}{\partial \beta} = X^T (y - p), \qquad \frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} = -X^T W X \qquad (8)

The Newton step is thus

\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)

            = (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1} (y - p) \right)

            = (X^T W X)^{-1} X^T W z \qquad (9)

Fitting Logistic Regression Models

In the second and third lines we have re-expressed the Newton step as a weighted least squares step, with the response

z = X \beta^{old} + W^{-1} (y - p) \qquad (10)

sometimes known as the adjusted response. These equations are solved repeatedly, since at each iteration p changes, and hence so do W and z.

This algorithm is referred to as iteratively reweighted least squares (IRLS), since each iteration solves the weighted least squares problem

\beta^{new} \leftarrow \arg\min_{\beta} \, (z - X\beta)^T W (z - X\beta) \qquad (11)
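Putting Eqs. (8)-(11) together gives a compact IRLS solver. A minimal NumPy sketch, assuming the data are not linearly separable so the iteration converges without step-size safeguards (function and variable names are illustrative):

```python
import numpy as np

def irls(X, y, n_iter=25):
    """Two-class logistic regression via iteratively reweighted least
    squares, following Eqs. (8)-(11). X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = p * (1.0 - p)                        # diagonal of W
        z = X @ beta + (y - p) / w               # adjusted response, Eq. (10)
        # Weighted least squares step, Eq. (11): beta = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Synthetic data drawn from a known logistic model
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = irls(X, y)
```

At convergence the score equations (5) hold, so in particular the sum of the fitted probabilities matches the observed number of class ones.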

Exercise 4.10

The Weekly data set is part of the ISLR package. It contains 1089 weekly percentage returns for 21 years, from the beginning of 1990 to the end of 2010.

- Lag1-Lag5: the percentage returns for each of the five previous trading weeks.
- Volume: the number of shares traded on the previous week, in billions.
- Today: the percentage return on the week in question.
- Direction: whether the market was Up or Down on this week.

Exercise

(b) Use the full data set to perform a logistic regression.

library(ISLR)
attach(Weekly)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              data = Weekly, family = binomial)
summary(glm.fit)

According to the p-values, Lag1, Lag2, and Lag4 appear closest to statistical significance.

Exercise

(c) Compute the confusion matrix and overall fraction of correct predictions.

glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", 1089)
glm.pred[glm.probs > .5] = "Up"
table(glm.pred, Direction)
mean(glm.pred == Direction)

Exercise

(d) Fit the logistic regression model on the training data, then predict.

train = (Year < 2009)
A = (Year == 2009)
B = (Year == 2010)
Weekly.2009 = Weekly[A, ]
Weekly.2010 = Weekly[B, ]
Direction.2009 = Direction[A]
Direction.2010 = Direction[B]
glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.2009, type = "response")
glm.pred = rep("Down", 52)
glm.pred[glm.probs > .5] = "Up"
table(glm.pred, Direction.2009)
mean(glm.pred == Direction.2009)

Exercise

Figure: prediction for 2009. Figure: prediction for 2010.

Exercise

(e) Fit the LDA model on the training data, then predict.

library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.2009)
lda.class = lda.pred$class
table(lda.class, Direction.2009)
mean(lda.class == Direction.2009)

Figure: prediction for 2009. Figure: prediction for 2010.

Exercise

(f) Fit the QDA model on the training data, then predict.

qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.2009)
qda.class = qda.pred$class
table(qda.class, Direction.2009)
mean(qda.class == Direction.2009)

Figure: prediction for 2009. Figure: prediction for 2010.
