
Page 1

Spring 2020
Mehdi Allahyari

Georgia Southern University

(slides borrowed from Tom Mitchell, Barnabás Póczos & Aarti Singh)

1

Logistic Regression

CSCI 4520 - Introduction to Machine Learning

Page 2

Linear Regression & Linear Classification

2

[Figure: height vs. weight data, shown once with a linear fit (regression) and once with a linear decision boundary (classification)]

Page 3

Naïve Bayes Recap…

3

• NB Assumption:
• NB Classifier:
  (standard forms of both are sketched below)
• Assume parametric form for P(Xi|Y) and P(Y)
  – Estimate parameters using MLE/MAP and plug in
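For reference, a sketch of the standard forms behind these bullets (textbook notation, with d features):

```latex
% NB assumption: features are conditionally independent given the class
\[
P(X_1, \dots, X_d \mid Y) \;=\; \prod_{i=1}^{d} P(X_i \mid Y)
\]

% NB classifier: predict the class with the largest posterior
\[
\hat{y} \;=\; \arg\max_{y_k} \; P(Y = y_k) \prod_{i=1}^{d} P(X_i \mid Y = y_k)
\]
```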

Page 4

Generative vs. Discriminative Classifiers

4

Generative classifiers (e.g. Naïve Bayes)
• Assume some functional form for P(X,Y) (or P(X|Y) and P(Y))
• Estimate parameters of P(X|Y), P(Y) directly from training data

But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X). Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?

Discriminative classifiers (e.g. Logistic Regression)
• Assume some functional form for P(Y|X) or for the decision boundary
• Estimate parameters of P(Y|X) directly from training data

Page 5

Logistic Regression

5

Idea:
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?

Page 6

GNB with equal variance is a linear classifier

6

• Consider learning f: X → Y, where
  – X is a vector of real-valued features, <X1 … Xn>
  – Y is boolean
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = yk) as Gaussian N(µik, σi)
  – model P(Y) as Bernoulli(π)

•  What does that imply about the form of P(Y|X)?

Page 7

7

Derive the form of P(Y|X) for Gaussian P(Xi | Y = yk), assuming σik = σi (sketched below).
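A sketch of that derivation under the stated assumptions (conditionally independent Xi, P(Xi | Y = yk) = N(µik, σi) with class-independent variance, and P(Y = 1) = π); the sign convention, with a positive linear score for Y = 1, is a choice made here:

```latex
% Bayes rule plus conditional independence of the X_i given Y:
\[
P(Y=1 \mid X)
  = \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)}
  = \frac{1}{1 + \exp\Big(\ln\frac{1-\pi}{\pi} + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}
\]

% With Gaussian class conditionals sharing the variance \sigma_i, the X_i^2 terms cancel:
\[
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}
\]

% Absorbing the signs into the weights, P(Y=1|X) is a sigmoid of a linear function of X:
\[
P(Y=1 \mid X) = \frac{1}{1+\exp\big(-(w_0 + \sum_i w_i X_i)\big)},\qquad
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},\qquad
w_0 = \ln\frac{\pi}{1-\pi} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
\]
```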

Page 8

8

Very convenient!

P(Y = 1 | X = <X1, …, Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))

implies

P(Y = 0 | X = <X1, …, Xn>) = 1 / (1 + exp(w0 + Σi wi Xi))

implies

P(Y = 1 | X) / P(Y = 0 | X) = exp(w0 + Σi wi Xi)

implies

ln [ P(Y = 1 | X) / P(Y = 0 | X) ] = w0 + Σi wi Xi

Page 9

9

Very convenient!

ln [ P(Y = 1 | X) / P(Y = 0 | X) ] = w0 + Σi wi Xi

is a linear classification rule: predict Y = 1 when w0 + Σi wi Xi > 0, and Y = 0 otherwise.

Page 10

Logistic Function

10

The logistic (sigmoid) function σ(z) = 1 / (1 + e^(−z)) maps any real z to (0, 1); its plot is the familiar S-shaped curve.
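A minimal Python sketch of the logistic function and of P(Y = 1 | X) as a sigmoid of a linear score (names and data are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    """P(Y = 1 | X) = sigmoid(w0 + X . w) for each row of X."""
    return sigmoid(w0 + X @ w)

# Example: 2 features, arbitrary weights
X = np.array([[1.0, 2.0], [0.5, -1.0]])
w = np.array([0.3, -0.7])
print(predict_proba(X, w, w0=0.1))   # probabilities in (0, 1)
```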

Page 11

Logistic regression more generally

11

• Logistic regression when Y is not boolean (but still discrete-valued)
• Now y ∈ {y1 … yR}: learn R−1 sets of weights (the standard forms are sketched below)

for k<R

for k=R
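In the usual parameterization, with one weight vector for each of the first R−1 classes, the assumed forms are:

```latex
% for k < R (each of the first R-1 classes has its own weight vector)
\[
P(Y = y_k \mid X) \;=\;
  \frac{\exp\big(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\big)}
       {1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}
\]

% for k = R (no weights of its own; the R probabilities sum to 1)
\[
P(Y = y_R \mid X) \;=\;
  \frac{1}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}
\]
```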

Page 12

Training Logistic Regression: MCLE

12

We'll focus on binary classification:

How to learn the parameters w0, w1, …, wd?

Training Data

Maximum Likelihood Estimates

But there is a problem … Don't have a model for P(X) or P(X|Y) – only for P(Y|X)


Page 13

Training Logistic Regression: MCLE

13

• we have L training examples:
• maximum likelihood estimate for parameters W
• maximum conditional likelihood estimate


Page 14

Training Logistic Regression: MCLE

14

• Choose parameters W = <w0, …, wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1, Y^1>, …, <X^L, Y^L>}
• Data likelihood = Π_l P(X^l, Y^l | W)
• Data conditional likelihood = Π_l P(Y^l | X^l, W)

where P(Y^l | X^l, W) is given by the logistic form above.

Page 15

Expressing Conditional Log Likelihood

15

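Spelled out (a standard expansion, using the same sign convention as the earlier GNB derivation, i.e. P(Y=1|X,W) = 1/(1+exp(−(w0 + Σi wi Xi)))):

```latex
\[
l(W) \;\equiv\; \ln \prod_{l} P(Y^l \mid X^l, W)
  \;=\; \sum_{l} \Big[ Y^l \ln P(Y^l{=}1 \mid X^l, W) + (1-Y^l)\ln P(Y^l{=}0 \mid X^l, W) \Big]
\]

% substituting the logistic form for P(Y=1|X,W) and simplifying:
\[
l(W) \;=\; \sum_{l} \Big[\, Y^l \big(w_0 + \textstyle\sum_i w_i X_i^l\big)
           \;-\; \ln\big(1 + \exp(w_0 + \textstyle\sum_i w_i X_i^l)\big) \Big]
\]
```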

Page 16

Maximizing Conditional Log Likelihood

16

Good news: l(W) is a concave function of W
→ no locally optimal solutions!

Bad news: no closed-form solution to maximize l(W)

Good news: concave functions are “easy” to optimize

Page 17

Optimizing concave/convex functions

17

• Conditional likelihood for Logistic Regression is concave
• Maximum of a concave function = minimum of a convex function

Gradient Ascent (concave) / Gradient Descent (convex)

Gradient:

Update rule (learning rate η > 0):
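In symbols, a sketch of the generic gradient and update (η is the learning rate):

```latex
\[
\nabla_{\mathbf w}\, l(\mathbf w) =
  \Big[\tfrac{\partial l(\mathbf w)}{\partial w_0},\ \dots,\ \tfrac{\partial l(\mathbf w)}{\partial w_d}\Big]^{\top}
\]

% gradient ascent (maximizing a concave l), with learning rate \eta > 0:
\[
\mathbf w^{(t+1)} = \mathbf w^{(t)} + \eta\, \nabla_{\mathbf w}\, l\big(\mathbf w^{(t)}\big)
\]

% gradient descent (minimizing a convex loss E) flips the sign:
\[
\mathbf w^{(t+1)} = \mathbf w^{(t)} - \eta\, \nabla_{\mathbf w}\, E\big(\mathbf w^{(t)}\big)
\]
```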

Page 18

18

Gradient Descent:

Batch gradient: use the error over the entire training set D
Do until satisfied:
1. Compute the gradient
2. Update the vector of parameters

Stochastic gradient: use the error over single examples
Do until satisfied:
1. Choose (with replacement) a random training example
2. Compute the gradient just for that example
3. Update the vector of parameters

Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use the error over subsets of D
(a sketch of the batch and stochastic variants follows below)
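A minimal Python sketch contrasting the two variants for logistic regression (function names, learning rates, and iteration counts are illustrative choices, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Batch: each step uses the gradient over the entire training set."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a column of 1s for w0
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                   # P(Y=1 | x, w) for every example
        w += eta * Xb.T @ (y - p)             # w_i += eta * sum_l X_i^l (y^l - p^l)
    return w

def stochastic_gradient_ascent(X, y, eta=0.01, n_iters=10000, seed=0):
    """Stochastic: each step uses one randomly chosen (with replacement) example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        l = rng.integers(n)                   # pick a random training example
        p = sigmoid(Xb[l] @ w)
        w += eta * Xb[l] * (y[l] - p)         # same update, single-example gradient
    return w
```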

Page 19

Maximize Conditional Log Likelihood: Gradient Ascent

19

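The gradient of the conditional log likelihood has the standard form (writing X_0^l ≡ 1 so that the same expression covers w0, and using the sign convention adopted earlier):

```latex
\[
\frac{\partial\, l(W)}{\partial w_i}
  \;=\; \sum_{l} X_i^l \Big( Y^l - \hat{P}\big(Y^l{=}1 \mid X^l, W\big) \Big)
\]
```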

Page 20

Maximize Conditional Log Likelihood: Gradient Ascent

20

Gradient ascent algorithm: iterate until change < ε
For all i, repeat:
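Using that gradient, the per-weight update inside the loop takes the standard form:

```latex
\[
w_i \;\leftarrow\; w_i + \eta \sum_{l} X_i^l \Big( Y^l - \hat{P}\big(Y^l{=}1 \mid X^l, W\big) \Big)
\]
```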

Page 21

Effect of step size η

21

Large η => fast convergence, but larger residual error; also possible oscillations

Small η => slow convergence, but small residual error

Page 22

That’s all for M(C)LE. How about MAP?

22

• One common approach is to define priors on W
  – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate (sketched below)
• Let's assume a Gaussian prior: W ~ N(0, σ)
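A sketch of the MAP objective and the resulting update under the zero-mean Gaussian prior (writing λ = 1/σ²; a standard result, stated under the same conventions as before):

```latex
\[
W_{MAP} \;=\; \arg\max_{W}\ \ln\Big[ P(W) \prod_{l} P(Y^l \mid X^l, W) \Big],
\qquad
P(W) \propto \exp\Big(-\tfrac{1}{2\sigma^2}\sum_i w_i^2\Big)
\]

% the prior adds a quadratic penalty to the conditional log likelihood, and the
% gradient-ascent update gains a shrinkage term (with \lambda = 1/\sigma^2):
\[
w_i \;\leftarrow\; w_i - \eta\,\lambda\, w_i
      + \eta \sum_{l} X_i^l \Big( Y^l - \hat{P}\big(Y^l{=}1 \mid X^l, W\big) \Big)
\]
```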

Page 23

MLE vs. MAP

23

• Maximum conditional likelihood estimate

•  Maximum a posteriori estimate with prior W~N(0,σI)

Page 24

MAP estimates and Regularization

24

• Maximum a posteriori estimate with prior W ~ N(0, σI)
• The prior term (the quadratic penalty on the weights) is called a “regularization” term
  – helps reduce overfitting
  – keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
  – used very frequently in Logistic Regression
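As a practical aside (not from the slides): L2-regularized logistic regression is what libraries such as scikit-learn fit by default, with C acting as an inverse regularization strength (roughly 1/λ). A minimal usage sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# penalty="l2" corresponds to the zero-mean Gaussian prior / quadratic penalty above;
# smaller C means stronger regularization (weights pulled harder toward zero).
clf = LogisticRegression(penalty="l2", C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict_proba(X[:3]))   # estimated P(Y | X) for the first three examples
```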

Page 25

The Bottom Line

25

• Consider learning f: X → Y, where
  – X is a vector of real-valued features, <X1 … Xn>
  – Y is boolean
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = yk) as Gaussian N(µik, σi)
  – model P(Y) as Bernoulli(π)

• Then P(Y|X) has the logistic (sigmoid-of-linear) form derived earlier, and we can directly estimate W

• Furthermore, the same holds if the Xi are boolean
  – try proving that to yourself

Page 26

Generative vs. Discriminative Classifiers

26

Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes rule to calculate P(Y | X = xi)

Discriminative classifiers (e.g., Logistic regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data

Page 27

Use Naïve Bayes or Logistic Regression?

27

Consider:
• Restrictiveness of modeling assumptions
• Rate of convergence (in amount of training data) toward the asymptotic hypothesis

Page 28

Use Naïve Bayes or Logistic Regression?

28

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = <X1 … Xn>
Number of parameters to estimate:
• NB:
• LR:

Page 29

Use Naïve Bayes or Logistic Regression?

29

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = <X1 … Xn>
Number of parameters:
• NB: 4n + 1 (a mean and a variance for each of the n features in each of the 2 classes, plus the prior π)
• LR: n + 1 (one weight per feature, plus w0)

Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled
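To see why the NB estimates are “uncoupled”: each NB parameter has its own closed-form MLE computed from counts and per-class averages (standard formulas, sketched below), whereas the LR weights appear together inside every P(Y^l | X^l, W) term and must be fit jointly by the iterative optimization above.

```latex
\[
\hat{\pi} = \frac{\#\{l : Y^l = 1\}}{L},\qquad
\hat{\mu}_{ik} = \frac{1}{\#\{l : Y^l = y_k\}} \sum_{l:\,Y^l = y_k} X_i^l,\qquad
\hat{\sigma}_{ik}^2 = \frac{1}{\#\{l : Y^l = y_k\}} \sum_{l:\,Y^l = y_k} \big(X_i^l - \hat{\mu}_{ik}\big)^2
\]
```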

Page 30

G. Naïve Bayes vs. Logistic Regression

30

Recall the two assumptions used to derive the form of LR from Gaussian Naïve Bayes:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(µik, σi), not N(µik, σik)

Consider three learning methods:
• GNB (assumption 1 only)
• GNB2 (assumptions 1 and 2)
• LR

Which method works better if we have infinite training data, and…
• Both (1) and (2) are satisfied
• Neither (1) nor (2) is satisfied
• (1) is satisfied, but not (2)

Page 31

G. Naïve Bayes vs. Logistic Regression

31

Recall the two assumptions used to derive the form of LR from Gaussian Naïve Bayes:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(µik, σi), not N(µik, σik)

Consider three learning methods:
• GNB (assumption 1 only) – decision surface can be non-linear
• GNB2 (assumptions 1 and 2) – decision surface linear
• LR – decision surface linear, trained without assumption 1

Which method works better if we have infinite training data, and…
• Both (1) and (2) are satisfied: LR = GNB2 = GNB
• (1) is satisfied, but not (2): GNB > GNB2, GNB > LR, LR > GNB2
• Neither (1) nor (2) is satisfied: GNB > GNB2, LR > GNB2, LR >< GNB (either may win)

[Ng & Jordan, 2002]

Page 32

G. Naïve Bayes vs. Logistic Regression

32

What if we have only finite training data?

They converge at different rates to their asymptotic (∞-data) error.

Consider the expected error of learning algorithm A after n training examples, and let d be the number of features: <X1 … Xd>.

GNB requires n = O(log d) examples to converge to its asymptotic error, but LR requires n = O(d).

[Ng & Jordan, 2002]

Page 33

33

Some experiments from UCI data sets

[Ng & Jordan, 2002]  ©Carlos Guestrin 2005-2009

[Figure: test error vs. number of training examples on several UCI data sets, comparing Naïve Bayes and Logistic Regression]

Page 34

Naïve Bayes vs. Logistic Regression

34

The bottom line:
• GNB2 and LR both use linear decision surfaces; GNB need not.
• Given infinite data, LR is better than or equal to GNB2, because the LR training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
• But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error.
• And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might outperform the other.

Page 35

What you should know:

35

• Logistic regression
  – Functional form follows from the Naïve Bayes assumptions
    • for Gaussian Naïve Bayes assuming variance σi,k = σi
    • for discrete-valued Naïve Bayes too
  – But the training procedure picks parameters without making the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • “regularization”
    • helps reduce overfitting
• Gradient ascent/descent
  – General approach when closed-form solutions are unavailable
• Generative vs. Discriminative classifiers
  – Bias vs. variance tradeoff

Page 36

What you should know:

36

What you should know about Logistic Regression (LR)

• Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
  – the solutions differ because of the objective (loss) function
• In general, NB and LR make different assumptions
  – NB: features independent given class → an assumption on P(X|Y)
  – LR: functional form of P(Y|X), no assumption on P(X|Y)
• LR is a linear classifier
  – the decision rule is a hyperplane
• LR is optimized by conditional likelihood
  – no closed-form solution
  – concave → global optimum with gradient ascent
  – maximum conditional a posteriori corresponds to regularization
• Convergence rates
  – GNB (usually) needs less data
  – LR (usually) gets to better solutions in the limit