
Page 1: Naïve Bayes and Logistic Regression

Nakul Gopalan

Georgia Tech

Naïve Bayes and Logistic

Regression

Machine Learning CS 4641

These slides are adapted from slides by Le Song, Eric Eaton, Mahdi Roozbahani, and Chao Zhang.

Page 2: Naïve Bayes and Logistic Regression
Page 3: Naïve Bayes and Logistic Regression

Outline

• Generative and Discriminative Classification

• The Logistic Regression Model

• Understanding the Objective Function

• Gradient Descent for Parameter Learning

• Multiclass Logistic Regression


Page 4: Naïve Bayes and Logistic Regression

Classification

(Figure: inputs 1, x_1, ..., x_d feed a weighted sum Σ that produces h(x) ∈ {−1, 1}.)

h(x) = sign(xθ) → linear classification (Perceptron)

Page 5: Naïve Bayes and Logistic Regression

Decision Making: Dividing the Feature Space


Page 6: Naïve Bayes and Logistic Regression

How to Determine the Decision Boundary?


Page 7: Naïve Bayes and Logistic Regression

Bayes Decision Rule


Page 8: Naïve Bayes and Logistic Regression

Bayes Decision Rule

The decision boundary is where the posterior probabilities of the two classes are equal, i.e., their difference = 0.

Page 9: Naïve Bayes and Logistic Regression

What do People do in Practice?

• Generative models

Model the prior and likelihood explicitly

"Generative" means able to generate synthetic data points

Examples: Naïve Bayes, Hidden Markov Models

• Discriminative models

Directly estimate the posterior probabilities

No need to model the underlying prior and likelihood distributions

Examples: Logistic Regression, SVM, Neural Networks

Page 10: Naïve Bayes and Logistic Regression

Generative Model: Naïve Bayes

Assumption: dimensions are independent (given the class label).

Page 11: Naïve Bayes and Logistic Regression

"Naïve" conditional independence assumption

P(y | x) = P(x | y) P(y) / P(x) = P(x, y) / P(x)

Joint probability model (chain rule):

P(x, y_label=1) = P(x_1, ..., x_d, y_label=1)

= P(x_1 | x_2, ..., x_d, y_label=1) P(x_2, ..., x_d, y_label=1)

= P(x_1 | x_2, ..., x_d, y_label=1) P(x_2 | x_3, ..., x_d, y_label=1) P(x_3, ..., x_d, y_label=1)

= ...

= P(x_1 | x_2, ..., x_d, y_label=1) P(x_2 | x_3, ..., x_d, y_label=1) ... P(x_{d−1} | x_d, y_label=1) P(x_d | y_label=1) P(y_label=1)

The naïve Bayes assumption lets us rewrite this as:

P(x, y_label=1) = P(x_1 | y_label=1) P(x_2 | y_label=1) ... P(x_d | y_label=1) P(y_label=1) = P(y_label=1) ∏_{i=1}^{d} P(x_i | y_label=1)

Example: Gaussian naïve Bayes, a typical assumption for continuous features.
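
As a concrete illustration of this factorization, here is a minimal Gaussian naïve Bayes sketch in NumPy (not from the slides; the array names X, y and the small variance floor are illustrative assumptions, with labels y in {0, 1}):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate P(y) and a per-dimension Gaussian P(x_i | y) for each binary label."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = {
            "prior": Xc.shape[0] / X.shape[0],
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,   # small floor avoids division by zero
        }
    return params

def log_joint(x, p):
    """log P(x, y=c) = log P(y=c) + sum_i log P(x_i | y=c), using the naive independence assumption."""
    log_lik = -0.5 * (np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
    return np.log(p["prior"]) + log_lik.sum()

def predict(x, params):
    """Return the label with the larger joint probability P(x, y=c)."""
    return max((0, 1), key=lambda c: log_joint(x, params[c]))
```

Maximizing the joint P(x, y=c) over c is the same as maximizing the posterior P(y=c | x), since P(x) does not depend on the class.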

Page 12: Naïve Bayes and Logistic Regression

Naïve Bayes cat vs dog!

Page 13: Naïve Bayes and Logistic Regression

Administrative things

• Project team composition due this weekend

• Quiz out today. Let us know if you have problems with it, and take as many as you can; it will only help!

• Homework due next week. Deadlines will start coming closer together from now on.

Page 14: Naïve Bayes and Logistic Regression

Naïve Bayes

Page 15: Naïve Bayes and Logistic Regression

Naïve Bayes

Page 16: Naïve Bayes and Logistic Regression

Discriminative Models

(Figure: comparison of the discriminative model with the generative model.)

Page 17: Naïve Bayes and Logistic Regression

Outline

• Generative and Discriminative Classification

• The Logistic Regression Model

• Understanding the Objective Function

• Gradient Descent for Parameter Learning

• Multiclass Logistic Regression


Page 18: Naïve Bayes and Logistic Regression

Gaussian Naïve Bayes

P(y = 1 | x) = P(x | y = 1) P(y = 1) / P(x) = P(y = 1) ∏_{i=1}^{d} P(x_i | y = 1) / P(x)

Page 19: Naïve Bayes and Logistic Regression

Take exp(ln u) of the numerator and denominator (the denominator sums over the labels; the derivation continues on the slide).

Page 20: Naïve Bayes and Logistic Regression


Page 21: Naïve Bayes and Logistic Regression

After simplification, the Gaussian naïve Bayes posterior takes a logistic form:

P(y = 1 | x) = 1 / (1 + exp[−(Σ_i θ_i x_i + θ_0)]) = 1 / (1 + exp(−s))

Number of parameters:

Gaussian naïve Bayes: 2d + 1 → d means, d variances, and 1 for the prior

Logistic regression: d + 1 → θ_0, θ_1, θ_2, ..., θ_d

Why not directly learn P(y = 1 | x), i.e., the θ parameters?

Gaussian naïve Bayes is a subset of logistic regression.
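
A sketch of the algebra behind this reduction, assuming each class-conditional P(x_i | y) is Gaussian with class-specific means μ_{i1}, μ_{i0} and a variance σ_i² shared by both classes (the symbols μ and σ are mine, not from the slides):

```latex
\begin{align*}
P(y{=}1 \mid x)
  &= \frac{P(y{=}1)\prod_{i=1}^{d} \mathcal{N}(x_i;\,\mu_{i1},\sigma_i^2)}
          {P(y{=}1)\prod_{i=1}^{d} \mathcal{N}(x_i;\,\mu_{i1},\sigma_i^2)
           + P(y{=}0)\prod_{i=1}^{d} \mathcal{N}(x_i;\,\mu_{i0},\sigma_i^2)} \\
  &= \frac{1}{1 + \exp\!\left(\ln\frac{P(y{=}0)}{P(y{=}1)}
           + \sum_{i=1}^{d} \frac{(x_i-\mu_{i1})^2 - (x_i-\mu_{i0})^2}{2\sigma_i^2}\right)} \\
  &= \frac{1}{1 + \exp\!\left(-\Big(\theta_0 + \sum_{i=1}^{d} \theta_i x_i\Big)\right)},
  \qquad
  \theta_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},
  \quad
  \theta_0 = \ln\frac{P(y{=}1)}{P(y{=}0)} + \sum_{i=1}^{d} \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}.
\end{align*}
```

Collapsing the exponent into a linear function of x is exactly the exp(ln u) step from the previous slides; the Gaussian means, variances, and prior all fold into the d + 1 parameters θ_0, ..., θ_d.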

Page 22: Naïve Bayes and Logistic Regression

Logistic function for posterior probability

(Figure: g(s) plotted against s, an S-shaped curve.)

Many equations can give us this shape. Let's use the following function:

g(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}),   with s = xθ

It is easier to use this function for optimization. This formula is called the sigmoid function.
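
A small numerically stable implementation of g (a sketch, not from the slides; the exp(−|s|) rewrite is a standard overflow-avoidance trick):

```python
import numpy as np

def sigmoid(s):
    """g(s) = e^s / (1 + e^s) = 1 / (1 + e^(-s)), computed without overflow for large |s|."""
    s = np.asarray(s, dtype=float)
    z = np.exp(-np.abs(s))                    # always in (0, 1], so it never overflows
    return np.where(s >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```

For example, sigmoid(0.0) is 0.5, and sigmoid(x @ theta) gives the modeled posterior for a feature row x.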

Page 23: Naïve Bayes and Logistic Regression

Sigmoid Function

s = Σ_{i=0}^{d} x_i θ_i = θ_0 + θ_1 x_1 + ... + θ_d x_d

(Figure: inputs 1, x_1, ..., x_d feed a weighted sum Σ that produces h(x).)

h(x) = g(xθ) → logistic regression (soft classification; the output is a posterior probability)

g(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

Page 24: Naïve Bayes and Logistic Regression

Three linear models

s = Σ_{i=0}^{d} x_i θ_i = θ_0 + θ_1 x_1 + ... + θ_d x_d

(Figures: three network diagrams, each feeding inputs 1, x_1, ..., x_d into a weighted sum Σ that produces h(x).)

h(x) = sign(xθ) → linear classification (Perceptron): hard classification

h(x) = xθ → linear regression

h(x) = g(xθ) → logistic regression: soft classification, posterior probability
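
The shared signal s = xθ makes the comparison easy to state in code (a sketch; the numbers in theta and x are made up for illustration, with x_0 = 1 as the bias feature):

```python
import numpy as np

theta = np.array([-0.5, 2.0, 1.0])        # theta_0, theta_1, theta_2 (illustrative values)
x = np.array([1.0, 0.3, -1.2])            # x_0 = 1 (bias), then the actual features

s = x @ theta                             # the same linear signal feeds all three models
h_perceptron = np.sign(s)                 # hard classification: outputs -1 or +1
h_linear = s                              # linear regression: real-valued output
h_logistic = 1.0 / (1.0 + np.exp(-s))     # logistic regression: posterior probability in (0, 1)
```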

Page 25: Naïve Bayes and Logistic Regression

g(s) is interpreted as a probability.

Example: prediction of heart attacks

Input x: cholesterol level, age, weight, finger size, etc.

g(s): probability of a heart attack within a certain time

s = xθ: let's call this the risk score

We can't make a hard prediction here.

Using the posterior probability directly:

h_θ(x) = p(y | x) = g(s) if y = 1, and 1 − g(s) if y = 0

Page 26: Naïve Bayes and Logistic Regression

Logistic regression model

p(y = 1 | x) = 1 / (1 + exp(−xθ))

p(y = 0 | x) = 1 − 1 / (1 + exp(−xθ)) = exp(−xθ) / (1 + exp(−xθ))

We need to find the θ parameters, so let's set up the log-likelihood for n data points. Taking the log of the product and substituting the two cases above (each term is y_i log p(y_i = 1 | x_i) + (1 − y_i) log p(y_i = 0 | x_i)) gives:

l(θ) := log ∏_{i=1}^{n} p(y_i | x_i, θ) = Σ_i [ θ^T x_i^T (y_i − 1) − log(1 + exp(−x_i θ)) ]

This form is concave; its negative is convex.
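
The same log-likelihood for a whole dataset, as a sketch (X is assumed to be an n x (d+1) matrix whose first column is all ones and y is a 0/1 vector; np.logaddexp(0, −s) computes log(1 + exp(−s)), a stability detail not on the slide):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ (y_i - 1) * (x_i theta) - log(1 + exp(-x_i theta)) ]."""
    s = X @ theta                                # s_i = x_i theta for every row at once
    return np.sum((y - 1.0) * s - np.logaddexp(0.0, -s))
```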

Page 27: Naïve Bayes and Logistic Regression

The gradient of 𝑙(𝜃)

Starting from l(θ) := log ∏_{i=1}^{n} p(y_i | x_i, θ) = Σ_i [ θ^T x_i^T (y_i − 1) − log(1 + exp(−x_i θ)) ], the gradient is

∇_θ l(θ) = Σ_i [ x_i^T (y_i − 1) + x_i^T exp(−x_i θ) / (1 + exp(−x_i θ)) ]
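
In code the bracketed factor simplifies, since (y_i − 1) + exp(−x_i θ) / (1 + exp(−x_i θ)) = y_i − g(x_i θ); a sketch under the same X, y assumptions as before:

```python
import numpy as np

def gradient(theta, X, y):
    """grad l(theta) = sum_i x_i^T (y_i - g(x_i theta)), the slide's expression after simplification."""
    g = 1.0 / (1.0 + np.exp(-(X @ theta)))   # g(x_i theta) = modeled P(y=1 | x_i)
    return X.T @ (y - g)
```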

Page 28: Naïve Bayes and Logistic Regression

The Objective Function

max_θ  l(θ) := log ∏_{i=1}^{n} p(y_i | x_i, θ)

Page 29: Naïve Bayes and Logistic Regression

Gradient Descent


Page 30: Naïve Bayes and Logistic Regression

Gradient Ascent (concave) / Descent (convex) algorithm

θ^{t+1} ← θ^t + η Σ_i [ x_i^T (y_i − 1) + x_i^T exp(−x_i θ^t) / (1 + exp(−x_i θ^t)) ]
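
A minimal gradient-ascent loop implementing this update (a sketch; the step size eta, the iteration count, and the zero initialization are illustrative choices, not from the slides):

```python
import numpy as np

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Maximize the concave log-likelihood l(theta) with the update theta <- theta + eta * grad."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = 1.0 / (1.0 + np.exp(-(X @ theta)))   # current posterior estimates g(x_i theta)
        theta = theta + eta * (X.T @ (y - g))    # ascent step along the gradient of l(theta)
    return theta
```

Minimizing the negative log-likelihood by gradient descent is the same computation with the sign flipped.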

Page 31: Naïve Bayes and Logistic Regression

Logistic Regression

g(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}),   with s = xθ

(Figures: the sigmoid g(s) plotted against s = xθ.)

Page 32: Naïve Bayes and Logistic Regression

Outline

• Generative and Discriminative Classification

• The Logistic Regression Model

• Understanding the Objective Function

• Gradient Descent for Parameter Learning

• Multiclass Logistic Regression


Page 33: Naïve Bayes and Logistic Regression

Multiclass Logistic Regression


Page 34: Naïve Bayes and Logistic Regression

One-vs-all (one-vs-rest)

(Figure: three binary classifiers, h_θ^(1)(x), h_θ^(2)(x), and h_θ^(3)(x), one per class.)

h_θ^(i)(x) = p(y = 1 | x, θ^(i)), for i = 1, 2, 3

Page 35: Naïve Bayes and Logistic Regression

One-vs-all (one-vs-rest)

Train a logistic regression h_θ^(i)(x) for each class i.

To predict the label of a new input x, pick the class i that maximizes h_θ^(i)(x):

predicted class = argmax_i h_θ^(i)(x)
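
A one-vs-rest sketch built on the binary model above (the helper names and the gradient-ascent settings inside _fit_binary are illustrative, not from the slides):

```python
import numpy as np

def _fit_binary(X, y01, eta=0.1, n_iters=1000):
    # plain gradient ascent on the logistic log-likelihood (same update as the earlier slide)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta = theta + eta * (X.T @ (y01 - g))
    return theta

def fit_one_vs_rest(X, y, classes):
    """Train one binary logistic regression per class: class c vs. everything else."""
    return {c: _fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_rest(X, thetas):
    """Pick the class whose classifier gives the largest posterior g(x theta^(i))."""
    classes = list(thetas)
    scores = np.column_stack([1.0 / (1.0 + np.exp(-(X @ thetas[c]))) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```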

Page 36: Naïve Bayes and Logistic Regression

Take-Home Messages

• Generative and Discriminative Classification

• The Logistic Regression Model

• Understanding the Objective Function

• Gradient Descent for Parameter Learning

• Multiclass Logistic Regression
