
# Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Harsha Veeramachaneni

Thomson Reuters Research and Development

April 1, 2010

Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010 1 / 56

1 Logistic Regression (Plain and Simple): Introduction to logistic regression

2 Learning Algorithm: Log-Likelihood, Gradient Descent, Lazy Updates

3 Kernelization

4 Sequence Tagging


What is Logistic Regression?

It's called regression, but it's actually used for classification.

Assume we have a classification problem¹ from x ∈ R^d to y ∈ {0, 1}. Assume further that we want the classifier to output posterior class probabilities.

Example: Say we have x = [x1, x2] and a binary classification problem. We would like a classifier that can compute p(y = 0|x) and p(y = 1|x).

¹ Everything we say applies to multi-class problems, but we'll restrict the talk to binary for simplicity.

In the Beginning There Was a Model²

Say we have two scoring functions s0(x) and s1(x) and we write

p(y = i|x) = si(x) / (s0(x) + s1(x))

As long as si(x) > 0 we obtain a probability-like number.

Say we project x along some vector and map the projection to a positive quantity.

For example, si(x) = exp(wi^T x)

² All models are wrong. Some are useful.

Modeling Continued

According to our definition above,

p(y = 1|x) = s1(x) / (s0(x) + s1(x))

  = exp(w1^T x) / (exp(w0^T x) + exp(w1^T x))

  = exp((w1 − w0)^T x) / (1 + exp((w1 − w0)^T x))

  = exp(θ^T x) / (1 + exp(θ^T x))

where θ = w1 − w0.

f(z) = exp(z) / (1 + exp(z)) is called the logistic function.
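The collapse of the two-score model into a single logistic function of θ^T x is easy to check numerically. A minimal sketch (the weight vectors `w0`, `w1` and input `x` are illustrative values, not from the slides):

```python
import math

def logistic(z):
    """f(z) = exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1.0 + math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def posterior_from_scores(w0, w1, x):
    """p(y=1|x) = s1(x) / (s0(x) + s1(x)) with s_i(x) = exp(w_i^T x)."""
    s0, s1 = math.exp(dot(w0, x)), math.exp(dot(w1, x))
    return s1 / (s0 + s1)

w0, w1, x = [0.5, -1.0], [1.0, 2.0], [0.3, 0.7]
theta = [a - b for a, b in zip(w1, w0)]   # theta = w1 - w0
p_two_scores = posterior_from_scores(w0, w1, x)
p_logistic = logistic(dot(theta, x))      # identical by the derivation above
```

Dividing numerator and denominator by exp(w0^T x) is exactly what makes the two forms agree to machine precision.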


Model

Therefore

p(y = 1|x) = exp(θ^T x) / (1 + exp(θ^T x))

And

p(y = 0|x) = 1 / (1 + exp(θ^T x))

For our two-feature example we have

p(y = 1|x) = exp(θ1 x1 + θ2 x2) / (1 + exp(θ1 x1 + θ2 x2))


Logistic Function Pictures

Figure: Logistic function in one variable.


Adding in a Bias Term

Once we can compute p(y|x) we can get a minimum-error-rate classifier by assigning all feature vectors with p(y = 1|x) > 0.5 to class y = 1 and 0 otherwise.

What is p(y = 1|x) when x = 0?

p(y = 1|x) = exp(0) / (1 + exp(0)) = 0.5

Therefore the classification boundary always passes through the origin.

We change this behavior by augmenting our model with a bias term.


Adding in a Bias Term

So we write

p(y = 1|x) = exp(θ0 + θ^T x) / (1 + exp(θ0 + θ^T x))

For the two-feature example it becomes

p(y = 1|x) = exp(θ0 + θ1 x1 + θ2 x2) / (1 + exp(θ0 + θ1 x1 + θ2 x2))

We can rewrite this as

p(y = 1|x) = exp(θ^T x) / (1 + exp(θ^T x))

where θ = [θ0, θ1, θ2] and x = [1, x1, x2]
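The augmentation trick is just prepending a constant 1 to every feature vector so that θ0 acts as a bias. A sketch with illustrative parameter values:

```python
import math

def p_y1(theta, x_aug):
    """p(y=1|x) = exp(theta^T x) / (1 + exp(theta^T x)), x already augmented."""
    z = sum(t * xi for t, xi in zip(theta, x_aug))
    return math.exp(z) / (1.0 + math.exp(z))

theta = [-20.0, 1.0, 1.0]     # [theta0, theta1, theta2], illustrative values
x = [12.0, 7.0]
x_aug = [1.0] + x             # x = [1, x1, x2]: the bias rides along as feature 0
p = p_y1(theta, x_aug)        # z = -20 + 12 + 7 = -1, so p = logistic(-1)
```

With the bias folded into θ, the decision boundary θ^T x = 0 no longer has to pass through the origin of the original feature space.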


Logistic Function Pictures

Figure: Logistic function in two variables, θ0 = 0, θ1 = 1.0, θ2 = 1.0


Logistic Function Pictures

Figure: Logistic function in two variables, θ0 = 20.0, θ1 = 1.0, θ2 = 1.0


Logistic Function Pictures

Figure: Logistic function in two variables, θ0 = 0.0, θ1 = 3.0, θ2 = 1.0


Logistic Function Pictures

Figure: Logistic function in two variables, θ0 = 0, θ1 = 0.1, θ2 = 0.1


Logistic Function Pictures

Figure: Derivative of the logistic function in two variables, θ0 = 0, θ1 = 0.1, θ2 = 0.1


Decoding and Inference

At test time the given feature vector x is classified (decoded) as class 1 if

p(y = 1|x) > p(y = 0|x)

exp(θ^T x) / (1 + exp(θ^T x)) > 1 / (1 + exp(θ^T x))

exp(θ^T x) > 1

θ^T x > 0

Inference: If we need the probability, we need to compute the partition function (1 + exp(θ^T x)).
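In code, decoding reduces to the sign of θ^T x, while inference needs the extra normalization. A sketch (the weights and inputs are illustrative):

```python
import math

def score(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def decode(theta, x):
    """Class 1 iff theta^T x > 0: no partition function needed."""
    return 1 if score(theta, x) > 0 else 0

def p_y1(theta, x):
    """Inference: the partition function 1 + exp(theta^T x) must be computed."""
    z = score(theta, x)
    return math.exp(z) / (1.0 + math.exp(z))

theta = [0.5, -2.0, 1.0]
# The hard decision always agrees with thresholding the probability at 0.5
for x in ([1.0, 0.1, 0.2], [1.0, 2.0, 0.5]):
    assert (decode(theta, x) == 1) == (p_y1(theta, x) > 0.5)
```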


Some Justification for the Model

Why exp(·)?

The log-odds ratio

log( p(y = 1|x) / p(y = 0|x) ) = θ^T x

is an affine function of the observations.

The log-odds ratio is an affine function of the sufficient statistics if the class-conditional distributions are from a fixed exponential family.

The converse is also true³.

Logistic regression is the same as MAXENT classification (under the right constraints on the features).

³ Banerjee (2007), An Analysis of Logistic Models: Exponential Family Connections and Online Performance.

1 Logistic Regression (Plain and Simple): Introduction to logistic regression

2 Learning Algorithm: Log-Likelihood, Gradient Descent, Lazy Updates

3 Kernelization

4 Sequence Tagging

Learning

Suppose we are given a labeled data set

D = {(x1, y1), (x2, y2), . . . , (xN , yN)}

From this data we wish to learn the logistic regression parameter (weight) vector θ.


Learning

If we assume that the labeled examples are i.i.d., the likelihood of the data is

l(D; θ) = p(D|θ) = ∏_{i=1}^N p(y = yi | xi; θ)

Or the log-likelihood is

L(D; θ) = log l(D; θ) = Σ_{i=1}^N log p(y = yi | xi; θ)
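The log-likelihood is straightforward to evaluate directly from the model. A sketch (data is a list of (x, y) pairs with x already augmented; the toy values are illustrative):

```python
import math

def log_likelihood(theta, data):
    """L(D; theta) = sum_i log p(y = y_i | x_i; theta)."""
    total = 0.0
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        p1 = math.exp(z) / (1.0 + math.exp(z))
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

# At theta = 0 every example gets probability 0.5, so L = N * log(0.5)
data = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]
ll = log_likelihood([0.0, 0.0], data)
```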


Maximum Likelihood (ML) Estimation

The Maximum Likelihood (ML) estimate of the parameter vector is given by

θ̂ = argmax_θ p(D|θ) = argmax_θ L(D; θ)

  = argmax_θ Σ_{i=1}^N log p(y = yi | xi; θ)

What happens if there is a feature which is zero for all the examples except one?


Maximum A Posteriori (MAP) Estimation

We place a prior on θ that penalizes large values.

For example a Gaussian prior: p(θ) = N(θ; 0, σ²I)

The posterior probability of θ is given by

p(θ|D) = p(D|θ) p(θ) / p(D)

So instead of ML we can obtain the Maximum A Posteriori (MAP) estimate

θ̂ = argmax_θ p(θ|D) = argmax_θ p(D|θ) p(θ)

  = argmax_θ ( log p(D|θ) + log p(θ) )

  = argmax_θ ( Σ_{i=1}^N log p(y = yi | xi; θ) + log p(θ) )

How is this maximization actually done?
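For the Gaussian prior, log p(θ) is −‖θ‖² / (2σ²) up to a constant, so the MAP objective is the log-likelihood minus an L2 penalty. A sketch (toy data and σ² are illustrative):

```python
import math

def log_posterior(theta, data, sigma2):
    """log p(D|theta) + log p(theta), dropping theta-independent constants.
    The Gaussian prior N(0, sigma^2 I) contributes -||theta||^2 / (2 sigma^2)."""
    total = 0.0
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        p1 = math.exp(z) / (1.0 + math.exp(z))
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total - sum(t * t for t in theta) / (2.0 * sigma2)

# One example with x = 0: the likelihood term is log(0.5) regardless of theta,
# so the objective is log(0.5) minus the prior penalty on theta
data = [([0.0], 1)]
obj = log_posterior([1.0], data, 1.0)
```

This penalized objective is what keeps the weight on a feature seen in only one example from diverging to infinity.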

Gradient Ascent

Start with θ = 0 and iteratively move along the direction of steepest ascent.

Since the function we are optimizing is concave, any local maximum we find is also the global maximum.

We need to be able to compute the gradient ∇θ log p(θ|D) at any point.

The choice of the size of the step is important.


Gradient Ascent

Gradient = ∇θ log p(θ|D) = ( ∂/∂θ0 log p(θ|D), ∂/∂θ1 log p(θ|D), ... )

∂/∂θj log p(θ|D) = ∂/∂θj { Σ_{i=1}^N log p(y = yi | xi; θ) + log p(θ) }

So the update rule would be

θj^(t+1) = θj^(t) + η ∂/∂θj log p(θ|D)

where η is the step size.
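For logistic regression the per-example gradient has the well-known closed form ∂/∂θj log p(y = yi | xi; θ) = (yi − p(y = 1|xi)) xij. A sketch of one full-batch ascent step, with the Gaussian prior term included (the step size `eta` and the toy data are illustrative assumptions):

```python
import math

def gradient_step(theta, data, eta, sigma2):
    """One gradient-ascent step on log p(theta|D):
    grad_j = sum_i (y_i - p(y=1|x_i)) x_ij  -  theta_j / sigma^2."""
    grad = [0.0] * len(theta)
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        p1 = math.exp(z) / (1.0 + math.exp(z))
        for j, xj in enumerate(x):
            grad[j] += (y - p1) * xj
    return [t + eta * (g - t / sigma2) for t, g in zip(theta, grad)]

# Perfectly balanced data: at theta = 0 the gradient is zero, so theta stays put
data = [([1.0], 1), ([1.0], 0)]
theta = gradient_step([0.0], data, eta=0.1, sigma2=10.0)
```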


Stochastic Gradient Ascent

Gradient = ∇θ log p(θ|D) = ( ∂/∂θ0 log p(θ|D), ∂/∂θ1 log p(θ|D), ... )

We can write

∂/∂θj log p(θ|D) = ∂/∂θj { Σ_{i=1}^N log p(y = yi | xi; θ) + log p(θ) }

  = Σ_{i=1}^N ∂/∂θj log p(y = yi | xi; θ) + ∂/∂θj log p(θ)

  = Σ_{i=1}^N ( ∂/∂θj log p(y = yi | xi; θ) + (1/N) ∂/∂θj log p(θ) )

  = Σ_{i=1}^N g(xi, yi)    (at a fixed value of θ)


Stochastic Gradient Ascent⁴

The update rule is

θj^(t+1) = θj^(t) + η ∂/∂θj log p(θ|D)

  = θj^(t) + η Σ_{i=1}^N g(xi, yi)

  = θj^(t) + ηN ( (1/N) Σ_{i=1}^N g(xi, yi) )

  ≈ θj^(t) + ηN E[g(x, y)]

⁴ Usually talked about as Stochastic Gradient Descent (SGD), because minimization is canonical.

Stochastic Gradient Ascent

We could randomly sample a small batch of L examples from thetraining set to estimate E [g(x , y)].

What if we went to the extreme and made L = 1?

We obtain online (incremental) learning.

Can be shown to converge to the same optimum under somereasonable assumptions.


Stochastic Gradient Ascent

The learning algorithm in its entirety is:

Algorithm 1: SGD
Input: D = {xi, yi}_{i=1...N}, step size η, MaxEpochs, convergence threshold
Output: θ
θ ← 0
for e = 1 : MaxEpochs, or until the log-likelihood converges, do
    for i = 1 : N do
        θ ← θ + η g(xi, yi)
return θ

If x ∈ R^d, the complexity of each epoch is O(Nd).
