
Page 1: Classification: Logistic Regression from Data

Classification: Logistic Regression from Data

Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder, Lecture 3

Slides adapted from Emily Fox

Page 2: Classification: Logistic Regression from Data

Reminder: Logistic Regression

P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)

P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)

• Discriminative prediction: p(y | x)

• Classification uses: ad placement, spam detection

• What we didn’t talk about is how to learn β from data

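As a quick sketch of equations (1)-(2) (not from the slides; the weights and features below are made up for illustration):

```python
import numpy as np

def p_y1(x, beta):
    """P(Y = 1 | X = x), equation (2); beta[0] is the intercept beta_0."""
    z = beta[0] + np.dot(beta[1:], x)
    return np.exp(z) / (1.0 + np.exp(z))

beta = np.array([-1.0, 2.0, 0.5])  # made-up weights
x = np.array([1.0, -0.5])          # made-up feature vector
print(p_y1(x, beta))        # P(Y = 1 | x)
print(1.0 - p_y1(x, beta))  # P(Y = 0 | x), equation (1): the two sum to 1
```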

Page 3: Classification: Logistic Regression from Data

Logistic Regression: Objective Function

\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)

= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \quad (4)

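A minimal Python sketch of equation (4), assuming a design matrix X with one row per example, 0/1 labels y, and beta[0] as the intercept (the tiny dataset is made up):

```python
import numpy as np

def log_likelihood(X, y, beta):
    """Conditional log likelihood, equation (4)."""
    z = beta[0] + X @ beta[1:]                   # beta_0 + sum_i beta_i x_i^(j)
    return np.sum(y * z - np.log1p(np.exp(z)))   # log1p(exp(z)) = ln(1 + e^z)

X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3]])  # made-up examples
y = np.array([1, 0, 1])
print(log_likelihood(X, y, np.zeros(3)))  # 3 * ln(0.5) at beta = 0
```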

Pages 4-5: Classification: Logistic Regression from Data

Convexity

• Convex function

• Doesn't matter where you start, if you "walk up" the objective

• Gradient!

Pages 6-15: Classification: Logistic Regression from Data

Gradient Descent (non-convex)

Goal

Optimize log likelihood with respect to variables W and b

[Figure: animation of gradient steps on a non-convex objective, axes "Parameter" (x) vs. "Objective" (y); successive iterates 0, 1, 2, 3 climb the curve, with the unexplored region labeled "Undiscovered Country". The final frame shows the update rule W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}.]
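As an illustrative sketch of the picture above (the one-dimensional objective and starting point are made up), a bare-bones gradient ascent loop:

```python
def gradient_ascent(grad, beta, eta=0.1, n_steps=100):
    """Repeatedly 'walk up' the objective by stepping along its gradient.
    For a concave objective the start doesn't matter; for a non-convex
    one, different starts can end in different local optima."""
    for _ in range(n_steps):
        beta = beta + eta * grad(beta)
    return beta

# Toy objective f(b) = -(b - 3)^2 with gradient -2(b - 3); maximum at b = 3
print(gradient_ascent(lambda b: -2.0 * (b - 3.0), beta=0.0))
```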

Pages 16-19: Classification: Logistic Regression from Data

Gradient for Logistic Regression

Gradient

\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \dots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right] \quad (5)

Update

\Delta \beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta}) \quad (6)

\beta_i \leftarrow \beta_i' + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \quad (7)

Why are we adding? What would we do if we wanted to do descent?

η: step size, must be greater than zero

NB: Conjugate gradient is usually better, but harder to implement
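A sketch of the update in equation (7) applied to the objective in equation (4); the gradient expression (y − p) · x used below is the standard derivative of equation (4) (not derived on these slides), and the tiny dataset repeats the made-up one above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, beta):
    """Gradient of equation (4): sum_j (y^(j) - p(y = 1 | x^(j))) * [1, x^(j)]."""
    resid = y - sigmoid(beta[0] + X @ beta[1:])
    return np.concatenate(([resid.sum()], X.T @ resid))

def ascent_step(X, y, beta, eta=0.1):
    """Equation (7): beta_i <- beta_i' + eta * dL/dbeta_i (add: we maximize)."""
    return beta + eta * gradient(X, y, beta)

X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3]])  # made-up examples
y = np.array([1, 0, 1])
beta = np.zeros(3)
for _ in range(100):
    beta = ascent_step(X, y, beta)
print(beta)
```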

Pages 20-24: Classification: Logistic Regression from Data

Choosing Step Size

[Figure: animation of gradient steps of different sizes on the "Parameter" (x) vs. "Objective" (y) plot.]
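The animated figures don't survive extraction, but a toy sketch (made-up one-dimensional objective) shows the trade-off: too small a step crawls, too large a step diverges.

```python
# Maximize f(b) = -(b - 3)^2 from b = 0; gradient is -2(b - 3), maximum at b = 3.
for eta in (0.01, 0.1, 1.1):  # too small, reasonable, too large
    b = 0.0
    for _ in range(50):
        b = b + eta * (-2.0 * (b - 3.0))
    print(f"eta={eta}: b={b:.3f}")  # 0.01 crawls toward 3, 0.1 converges, 1.1 blows up
```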

Page 25: Classification: Logistic Regression from Data

Remaining issues

• When to stop?

• What if β keeps getting bigger?


Pages 26-27: Classification: Logistic Regression from Data

Regularized Conditional Log Likelihood

Unregularized

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (8)

Regularized

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (9)

µ is a "regularization" parameter that trades off between likelihood and having small parameters
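Equation (9) as a sketch on top of the log-likelihood code above; the slides don't say whether the intercept β₀ is penalized, so penalizing every βᵢ here is an assumption:

```python
import numpy as np

def regularized_ll(X, y, beta, mu=0.1):
    """Equation (9): conditional log likelihood minus mu * sum_i beta_i^2.
    (Assumes the intercept beta[0] is penalized too; conventions differ.)"""
    z = beta[0] + X @ beta[1:]
    return np.sum(y * z - np.log1p(np.exp(z))) - mu * np.sum(beta ** 2)
```

Larger µ pulls the maximizer toward smaller weights; µ = 0 recovers equation (8).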

Pages 28-30: Classification: Logistic Regression from Data

Approximating the Gradient

• Our datasets are big (too big to fit into memory)

• . . . or data are changing / streaming

• Hard to compute true gradient

\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right] \quad (10)

• Average over all observations

• What if we compute an update just from one observation?

Page 31: Classification: Logistic Regression from Data

Getting to Union Station

Pretend it’s a pre-smartphone world and you want to get to Union Station


Pages 32-33: Classification: Logistic Regression from Data

Stochastic Gradient for Logistic Regression

Given a single observation x chosen at random from the dataset,

\beta_i \leftarrow \beta_i' + \eta \left( -\mu \beta_i' + x_i \left( y - p(y = 1 \mid \vec{x}, \vec{\beta}') \right) \right) \quad (11)

Examples in class.
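A sketch of equation (11) in the same assumed setup, with one randomly chosen example per update; the intercept is handled via a constant feature x₀ = 1 and, for simplicity, regularized along with the other weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(X, y, beta, eta=0.1, mu=0.01):
    """One stochastic update, equation (11), from a single random observation."""
    j = rng.integers(len(y))                   # x chosen at random
    z = beta[0] + X[j] @ beta[1:]
    resid = y[j] - 1.0 / (1.0 + np.exp(-z))    # y - p(y = 1 | x, beta')
    x_full = np.concatenate(([1.0], X[j]))     # constant feature for the intercept
    return beta + eta * (-mu * beta + x_full * resid)

X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3]])  # made-up examples
y = np.array([1, 0, 1])
beta = np.zeros(3)
for _ in range(1000):
    beta = sgd_step(X, y, beta)
print(beta)
```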

Pages 34-35: Classification: Logistic Regression from Data

Stochastic Gradient for Regularized Regression

\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (12)

Taking the derivative (with respect to example x_i),

\frac{\partial \mathcal{L}}{\partial \beta_j} = \left( y_i - p(y_i = 1 \mid \vec{x}_i; \beta) \right) x_j - 2\mu \beta_j \quad (13)
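A quick finite-difference check of equation (13) on a single made-up example (all values below are illustrative):

```python
import numpy as np

mu, y_i = 0.1, 1.0
x = np.array([1.0, 0.5, -2.0])     # x_0 = 1 plays the intercept's role
beta = np.array([0.3, -0.2, 0.7])

def objective(b):
    """Equation (12) for one example: log p(y | x; b) - mu * sum_j b_j^2."""
    z = x @ b
    return y_i * z - np.log1p(np.exp(z)) - mu * np.sum(b ** 2)

# Analytic gradient, equation (13)
p = 1.0 / (1.0 + np.exp(-(x @ beta)))
analytic = (y_i - p) * x - 2.0 * mu * beta

# Central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(objective(beta + eps * e) - objective(beta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True: equation (13) checks out
```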

Page 36: Classification: Logistic Regression from Data

Proofs about Stochastic Gradient

• Depends on convexity of objective and how close (ε) you want to get to the actual answer

• Best bounds depend on changing η over time and per dimension (not all features created equal)


Page 37: Classification: Logistic Regression from Data

In class

• Your questions!

• Working through simple example

• Preparing for the logistic regression homework
