
Logistic Regression and Generalized Linear Models

Sridhar Mahadevan
University of Massachusetts

Sridhar Mahadevan: CMPSCI 689 p. 1/29

Topics

- Generative vs. discriminative models.
- In many cases, it is difficult to model the data using a parametric class-conditional density P(X|y, θ).
- Yet, in many problems, a linear decision boundary is adequate to separate the classes (Gaussian densities with a shared covariance matrix also produce a linear decision boundary).
- Logistic regression: a discriminative model for classification that produces linear decision boundaries.
- The model-fitting problem is solved using maximum likelihood.
- An iterative gradient-based algorithm solves the nonlinear maximum likelihood equations: recursive weighted least squares regression.
- Logistic regression is an instance of a generalized linear model (GLM), a framework that covers a large variety of exponential-family models. GLMs can also be extended to generalized additive models (GAMs).


Discriminative vs. Generative Models

Both generative and discriminative approaches address the problem of modeling the discriminant function P(y|x) of output labels (or values) y conditioned on the input x.

In generative models, we estimate both P(y) and P(x|y), and use Bayes rule to compute the discriminant:

P(y|x) ∝ P(y) P(x|y)

Discriminative approaches model the conditional distribution P(y|x) directly, and ignore the marginal P(x).

We now turn to explore several instances of discriminative models: logistic regression in this lecture, and later several other types, including support vector machines.


Generalized Linear Models

In linear regression, we model the output y as a linear function of the input variables, with a noise term that is zero-mean, constant-variance Gaussian:

y = g(x) + ε, where the conditional mean is E(y|x) = g(x) and ε is the noise term. Here g(x) = θᵀx (where θ₀ is an offset term).

We saw earlier that the maximum likelihood framework justifies the use of a squared-error loss function, provided the errors are IID Gaussian (the variance does not matter).

We want to generalize this idea of specifying a model family by specifying the type of error distribution:

- When the output variable y is discrete (e.g., binary or multinomial), the noise term is not Gaussian, but binomial or multinomial.
- A change in the mean is coupled with a change in the variance, and we want to be able to couple mean and variance in our model.

Generalized linear models provide a rich family of models based on specifying the error distribution.


Logit Function

Since the output variable y only takes on values in {0, 1} (for binary classification), we need a different way of representing E(y|x) so that its range lies in (0, 1).

One convenient form to use is the sigmoid or logistic function. Let us assume a vector-valued input variable x = (x₁, ..., x_p). The logistic function is S-shaped and approaches 0 (as θᵀx → −∞) or 1 (as θᵀx → ∞).

P(y = 1 | x, θ) = μ(x|θ) = e^{θᵀx} / (1 + e^{θᵀx}) = 1 / (1 + e^{−θᵀx})

P(y = 0 | x, θ) = 1 − μ(x|θ) = 1 / (1 + e^{θᵀx})

We assume an extra input x₀ = 1, so that θ₀ is an offset. We can invert the above transformation to get the logit function:

g(x|θ) = log [ μ(x|θ) / (1 − μ(x|θ)) ] = θᵀx
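The sigmoid and the logit are inverses of each other, which is easy to check numerically. A minimal sketch (the function names and example values are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps z = theta^T x from the reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(mu):
    """Inverse of the sigmoid: maps a probability in (0, 1) back to the reals."""
    return np.log(mu / (1.0 - mu))

theta = np.array([0.5, -1.0, 2.0])   # theta_0 = 0.5 is the offset term
x = np.array([1.0, 0.3, 0.7])        # x_0 = 1 absorbs the offset
mu = sigmoid(theta @ x)              # P(y = 1 | x, theta), with theta^T x = 1.6
z = logit(mu)                        # recovers theta^T x
```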


Logistic Regression

[Figure: network view of logistic regression — output node y connected to inputs X₀, X₁, X₂ by weights θ₀, θ₁, θ₂]


Example Dataset for Logistic Regression

The data set we are analyzing concerns coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables.

There are 462 instances, out of which 160 are cases (positive instances) and 302 are controls (negative instances).

The predictor variables are systolic blood pressure, tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea.

Let's focus on a subset of the predictors: sbp, tobacco, ldl, famhist, obesity, alcohol, age.

We want to fit a model of the following form:

P(chd = 1 | x, θ) = 1 / (1 + e^{−θᵀx})

where θᵀx = θ₀ + θ₁x_sbp + θ₂x_tobacco + θ₃x_ldl + θ₄x_famhist + θ₅x_age + θ₆x_alcohol + θ₇x_obesity


Noise Model for Logistic Regression

Let us try to represent the logistic regression model as y = μ(x|θ) + ε and ask ourselves what sort of noise model is represented by ε.

Since y takes on the value 1 with probability μ(x|θ), it follows that ε can also take on only two possible values, namely:

- If y = 1, then ε = 1 − μ(x|θ), with probability μ(x|θ).
- Conversely, if y = 0, then ε = −μ(x|θ), and this happens with probability 1 − μ(x|θ).

This analysis shows that the error term in logistic regression is a binomially distributed random variable. Its moments can be computed readily as shown below:

E(ε) = μ(x|θ)(1 − μ(x|θ)) − (1 − μ(x|θ))μ(x|θ) = 0 (the error term has mean 0).

Var(ε) = E(ε²) − (E(ε))² = E(ε²) = μ(x|θ)(1 − μ(x|θ)) (show this!)
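Because ε takes only two values, both moments can be verified by direct computation over the two outcomes. A small sketch (the particular value of μ is arbitrary):

```python
import numpy as np

mu = 0.73  # an arbitrary value of mu(x|theta) in (0, 1)

# epsilon = 1 - mu with probability mu       (when y = 1)
# epsilon = -mu    with probability 1 - mu   (when y = 0)
values = np.array([1 - mu, -mu])
probs = np.array([mu, 1 - mu])

mean = np.sum(probs * values)                 # E(epsilon) = 0
var = np.sum(probs * values**2) - mean**2     # Var(epsilon) = mu * (1 - mu)
```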


Maximum Likelihood for LR

Suppose we want to fit a logistic regression model to a dataset of n observations X = (x₁, y₁), ..., (x_n, y_n).

We can express the conditional likelihood of a single observation simply as

P(y_i | x_i, θ) = μ(x_i|θ)^{y_i} (1 − μ(x_i|θ))^{1−y_i}

Hence, the conditional likelihood of the entire dataset can be written as

P(Y | X, θ) = ∏_{i=1}^n μ(x_i|θ)^{y_i} (1 − μ(x_i|θ))^{1−y_i}

The conditional log-likelihood is then simply

l(θ | X, Y) = Σ_{i=1}^n [ y_i log μ(x_i|θ) + (1 − y_i) log(1 − μ(x_i|θ)) ]
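The log-likelihood above translates directly into code. A sketch (function names and the tiny dataset are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Conditional log-likelihood l(theta | X, Y) for binary labels y in {0, 1}."""
    mu = sigmoid(X @ theta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Tiny example: 3 observations, 2 parameters (first column is x_0 = 1)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, 0.8])
ll = log_likelihood(theta, X, y)   # a negative number; larger is better
```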


Maximum Likelihood for LR

We solve the conditional log-likelihood equation by taking gradients:

∂l(θ|X, Y)/∂θ_k = Σ_{i=1}^n [ y_i (1/μ(x_i|θ)) ∂μ(x_i|θ)/∂θ_k − (1 − y_i) (1/(1 − μ(x_i|θ))) ∂μ(x_i|θ)/∂θ_k ]

Using the fact that ∂μ(x_i|θ)/∂θ_k = ∂/∂θ_k [ 1 / (1 + e^{−θᵀx_i}) ] = μ(x_i|θ)(1 − μ(x_i|θ)) x_{ik}, we get

∂l(θ|X, Y)/∂θ_k = Σ_{i=1}^n x_{ik} (y_i − μ(x_i|θ))

Setting this to 0, since x_{i0} = 1 the first component of these equations reduces to

Σ_{i=1}^n y_i = Σ_{i=1}^n μ(x_i|θ)

The expected number of instances of each class must match the observed number.
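The gradient formula Σ_i x_{ik}(y_i − μ(x_i|θ)) can be sanity-checked against finite differences of the log-likelihood. A sketch on synthetic data (all names and the data are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    mu = sigmoid(X @ theta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def gradient(theta, X, y):
    """Analytic gradient: sum_i x_i (y_i - mu(x_i | theta))."""
    return X.T @ (y - sigmoid(X @ theta))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

g = gradient(theta, X, y)
# Central finite-difference approximation of each component
eps = 1e-6
g_fd = np.array([
    (log_likelihood(theta + eps * e, X, y) - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
# g and g_fd should agree to several decimal places
```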

Newton-Raphson Method

Newton's method is a general procedure for finding the roots of an equation f(θ) = 0. Newton's algorithm is based on the recursion

θ_{t+1} = θ_t − f(θ_t) / f′(θ_t)

Newton's method finds the zeros of a function f; what we want is the maximum of the log-likelihood equation.

But the maximum of a function f(θ) is exactly where its derivative f′(θ) = 0. So, plugging in f′ for f above, we get

θ_{t+1} = θ_t − f′(θ_t) / f″(θ_t)
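A toy illustration of this recursion; the objective f(θ) = log θ − θ, which has its maximum at θ = 1, is my own choice:

```python
def newton_maximize(fprime, fprime2, theta0, iters=20):
    """Find a stationary point of f by Newton's method on f'(theta) = 0."""
    theta = theta0
    for _ in range(iters):
        theta = theta - fprime(theta) / fprime2(theta)
    return theta

# Toy objective f(theta) = log(theta) - theta, maximized at theta = 1
fprime = lambda t: 1.0 / t - 1.0   # f'(theta)
fprime2 = lambda t: -1.0 / t**2    # f''(theta) < 0, so the stationary point is a maximum
theta_star = newton_maximize(fprime, fprime2, theta0=0.5)
```

Starting from 0.5, the iterates 0.75, 0.9375, 0.996, ... converge quadratically to 1.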


Fisher Scoring

In logistic regression, the parameter θ is a vector, so we have to use the multivariate Newton-Raphson algorithm:

θ_{t+1} = θ_t − H⁻¹ ∇l(θ_t|X, Y)

Here, ∇l(θ_t|X, Y) is the vector of partial derivatives of the log-likelihood equation, and

H_{ij} = ∂²l(θ|X, Y)/∂θ_i∂θ_j

is the Hessian matrix of second-order derivatives.

The use of Newton's method to find the solution of the conditional log-likelihood equation is called Fisher scoring.


Fisher Scoring for Maximum Likelihood

Taking the second derivative of the likelihood score equations gives us

∂²l(θ|X, Y)/∂θ_k∂θ_m = −Σ_{i=1}^n x_{ik} x_{im} μ(x_i|θ)(1 − μ(x_i|θ))

We can use matrix notation to write the Newton-Raphson algorithm for logistic regression. Define the n × n diagonal matrix

W = diag( μ(x₁|θ)(1 − μ(x₁|θ)), μ(x₂|θ)(1 − μ(x₂|θ)), ..., μ(x_n|θ)(1 − μ(x_n|θ)) )

Let Y be an n × 1 column vector of output values, X be the design matrix of size n × (p + 1) of input values, and P be the column vector of fitted probability values μ(x_i|θ).


Iterative Weighted Least Squares

The gradient of the log-likelihood can be written in matrix form as

∂l(θ|X, Y)/∂θ = Σ_{i=1}^n x_i (y_i − μ(x_i|θ)) = Xᵀ(Y − P)

The Hessian can be written as

∂²l(θ|X, Y)/∂θ∂θᵀ = −XᵀWX

The Newton-Raphson algorithm then becomes

θ_new = θ_old + (XᵀWX)⁻¹ Xᵀ(Y − P)
      = (XᵀWX)⁻¹ XᵀW (Xθ_old + W⁻¹(Y − P))
      = (XᵀWX)⁻¹ XᵀW Z, where Z ≡ Xθ_old + W⁻¹(Y − P)
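The update above (compute P, form W and Z, solve the weighted system) gives the iteratively reweighted least squares loop. A sketch on synthetic data; all names are my own, and the linear system is solved directly rather than forming the matrix inverse:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, iters=25):
    """Iteratively reweighted least squares for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        P = sigmoid(X @ theta)          # fitted probabilities mu(x_i | theta)
        w = P * (1 - P)                 # diagonal of W
        Z = X @ theta + (y - P) / w     # adjusted response Z = X theta + W^-1 (Y - P)
        # theta_new = (X^T W X)^-1 X^T W Z, solved as a linear system
        theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * Z))
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
theta_true = np.array([-0.5, 1.0, 2.0])
y = (rng.random(200) < sigmoid(X @ theta_true)).astype(float)
theta_hat = fit_logistic_irls(X, y)
# At convergence the score equations hold: X^T (Y - P) is numerically zero
```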


Weighted Least Squares Regression

Weighted least squares regression finds the best least-squares solution to the equation

WAx ≈ Wb

(WA)ᵀWAx = (WA)ᵀWb

x = (AᵀCA)⁻¹AᵀCb, where C = WᵀW

Returning to logistic regression, we now see that θ_new = (XᵀWX)⁻¹XᵀWZ is a weighted least squares regression (where X is the matrix A above, W is a diagonal weight matrix with entries μ(x_i|θ)(1 − μ(x_i|θ)), and Z corresponds to the vector b above).

It is termed recursive weighted least squares because at each step the weight matrix W keeps changing (since the μ's are changing). We can visualize RWLS as solving the following equation:

θ_new ← argmin_θ (Z − Xθ)ᵀ W (Z − Xθ)
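The equivalence between the weighted normal equations and an ordinary least-squares solve on a rescaled system can be checked directly. A sketch where, as in the logistic-regression update, the diagonal weights enter the normal equations once (the random problem is my own):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)
w = rng.uniform(0.1, 2.0, size=10)   # diagonal weights

# Weighted normal equations: x = (A^T W A)^-1 A^T W b
x_wls = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * b))

# Equivalent: ordinary least squares on the rescaled system sqrt(W) A x ~ sqrt(W) b
x_ols, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * b, rcond=None)
# x_wls and x_ols agree
```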


Stochastic Gradient Ascent

Newton's method is often referred to as a 2nd-order method, because it involves taking the Hessian. This can be difficult in large problems, because it involves matrix inversion.

One way to avoid this is to settle for slower convergence but less work at each step. For each training instance (x, y) we can derive an incremental gradient update rule:

∂l(θ|x, y)/∂θ_j = x_j (y − μ(x|θ))

The stochastic gradient ascent rule can then be written as (for instance (x_i, y_i))

θ_j ← θ_j + α (y_i − μ(x_i|θ)) x_{ij}
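The update rule above can be sketched as a small stochastic gradient ascent loop; the learning rate α = 0.1, the epoch count, and the synthetic data are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=50, seed=0):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit instances in random order
            theta += alpha * (y[i] - sigmoid(X[i] @ theta)) * X[i]
    return theta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 1))])
theta_true = np.array([0.5, -1.5])
y = (rng.random(300) < sigmoid(X @ theta_true)).astype(float)
theta_hat = sgd_logistic(X, y)
# theta_hat hovers near the maximum likelihood solution (within O(alpha))
```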

The c