revisiting logistic regression & naïve logistic regression...

46
Revisiting Revisiting Logistic Regression & Naïve Logistic Regression & Naïve Bayes Bayes Aarti Singh Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010

Upload: others

Post on 23-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Revisiting Revisiting

Logistic Regression & Naïve Logistic Regression & Naïve BayesBayes

Aarti SinghAarti Singh

Machine Learning 10-701/15-781

Jan 27, 2010

Page 2: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Generative and Discriminative

Classifiers

Training classifiers involves learning a mapping f: X -> Y, or P(Y|X)

Generative classifiers (e.g. Naïve Bayes)

• Assume some functional form for P(X,Y) (or P(X|Y) and P(Y))

• Estimate parameters of P(X|Y), P(Y) directly from training data

2

• Estimate parameters of P(X|Y), P(Y) directly from training data

• Use Bayes rule to calculate P(Y|X)

Discriminative classifiers (e.g. Logistic Regression)

• Assume some functional form for P(Y|X)

• Estimate parameters of P(Y|X) directly from training data

Page 3: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Logistic Regression

Assumes the following functional form for P(Y|X):

Alternatively,

3

Alternatively,

(Linear Decision Boundary)

DOES NOT require any conditional independence assumptions

Page 4: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Connection to Gaussian Naïve Bayes

There are several distributions that can lead to a linear decision boundary.

As another example, consider a generative model:

Exponential family

4

Observe that Gaussian is a special case

Page 5: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Connection to Gaussian Naïve Bayes

5

Constant term First-order term

Special case: P(X|Y=y) ~ Gaussian( µy,Σy) where Σ0 = Σ1 (cij,0 = cij,1)

Conditionally independent cij,y = 0 , i ≠ j

(Gaussian Naïve Bayes)

Page 6: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Generative vs Discriminative

Given infinite data (asymptotically),

If conditional independence assumption holds,

Discriminative and generative perform similar.

6

If conditional independence assumption does NOT holds,

Discriminative outperforms generative.

Page 7: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Generative vs Discriminative

Given finite data (n data points, p features),

Ng-Jordan

paper

7

Naïve Bayes (generative) requires n = O(log p) to converge to its

asymptotic error, whereas Logistic regression (discriminative)

requires n = O(p).

Why? “Independent class conditional densities”

* smaller classes are easier to learn

* parameter estimates not coupled – each parameter is learnt

independently, not jointly, from training data.

Page 8: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Naïve Bayes vs Logistic Regression

Verdict

Both learn a linear decision boundary.

Naïve Bayes makes more restrictive assumptions

8

Naïve Bayes makes more restrictive assumptions

and has higher asymptotic error,

BUT

converges faster to its less accurate asymptotic

error.

Page 9: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Experimental Comparison (Ng-Jordan’01)

UCI Machine Learning Repository 15 datasets, 8 continuous features, 7 discrete features

9Logistic RegressionNaïve Bayes

More in

Paper…

Page 10: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Classification so far … (Recap)Classification so far … (Recap)

10

Page 11: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Tax Fraud Detection

Diagnosing sickle

cell anemia

Anemic cell

Healthy cell

Features, X Labels, Y

Classification Tasks

Tax Fraud Detection

Web ClassificationSports

Science

News

Predict squirrel hill

residentDrive to CMU, Rachel’s fan,

Shop at SH Giant Eagle

Resident

Not resident11

Page 12: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Goal:

Classification

Sports

12

Sports

Science

News

Features, X Labels, Y

Probability of Error

Page 13: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Classification

Optimal predictor:(Bayes classifier)

13

Depends on unknown distribution

Page 14: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Classification algorithms

Independent and identically distributed

However, we can learn a good prediction rule from training data

Learning algorithm

So far …

14

Decision Trees

K-Nearest Neighbor

Naïve Bayes

Logistic Regression

Page 15: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Linear RegressionLinear Regression

Aarti SinghAarti Singh

Machine Learning 10-701/15-781

Jan 27, 2010

Page 16: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Discrete to Continuous Labels

Sports

Science

News

Classification

Anemic cell

Healthy cell

16

Regression

Stock Market

PredictionY = ?

X = Feb01

X = Document Y = Topic X = Cell Image Y = Diagnosis

Page 17: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression Tasks

Weather Prediction

Y = Temp

17

X = 7 pm

Estimating

Contamination

X = new location

Y = sensor reading

Page 18: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Supervised Learning

Sports

Goal:

18

Sports

Science

News

Classification: Regression:

Probability of Error Mean Squared Error

Y = ?

X = Feb01

Page 19: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression

Optimal predictor:(Conditional Mean)

Dropping subscripts

for notational convenience

19

for notational convenience

Page 20: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression

Optimal predictor:(Conditional Mean)

20

Depends on unknown distribution

Intuition: Signal plus (zero-mean) Noise model

Page 21: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression algorithms

Learning algorithm

Linear Regression

21

Linear Regression

Lasso, Ridge regression (Regularized Linear Regression)

Nonlinear Regression

Kernel Regression

Regression Trees, Splines, Wavelet estimators, …

Empirical Risk Minimizer:

Empirical mean

Page 22: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Linear Regression

- Class of Linear functions

Least Squares Estimator

22

ββββ1 - intercept

ββββ2 = slopeUni-variate case:

Multi-variate case: 1

where ,

Page 23: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Least Squares Estimator

23

Page 24: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Least Squares Estimator

24

Page 25: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Normal Equations

If is invertible,

p xp p x1 p x1

25

When is invertible ? (Homework 2)

Recall: Full rank matrices are invertible. What is rank of ?

What if is not invertible ? (Homework 2)

Regularization (later)

Page 26: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Geometric Interpretation

Difference in prediction on training set:

26

is the orthogonal projection of

onto the linear subspace spanned by the

columns of

0

Page 27: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Revisiting Gradient Descent

Even when is invertible, might be computationally expensive if A is huge.

Gradient Descent

27

Initialize:

Update:

0 if =

Stop: when some criterion met e.g. fixed # iterations, or < ε.

Gradient Descent

Page 28: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Effect of step-size α

28

Large α => Fast convergence but larger residual error

Also possible oscillations

Small α => Slow convergence but small residual error

Page 29: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

When does Gradient Descent

succeed?View of the algorithm is myopic.

29

Guaranteed to converge to local minima if Converges as in jth direction

Convergence depends on eigenvalue spread

http://demonstrations.wolfram.comhttp://www.ce.berkeley.edu/~bayen/

Page 30: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Least Squares and MLE

Intuition: Signal plus (zero-mean) Noise model

30

Least Square Estimate is same as Maximum Likelihood Estimate under a Gaussian model !

log likelihood

Page 31: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regularized Least Squares and MAP

What if is not invertible ?

log likelihood log prior

31Prior belief that β is Gaussian with zero-mean biases solution to “small” β

I) Gaussian Prior

0

Ridge Regression

Closed form: HW

Page 32: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regularized Least Squares and MAP

What if is not invertible ?

log likelihood log prior

32Prior belief that β is Laplace with zero-mean biases solution to “small” β

Lasso

Closed form: HW

II) Laplace Prior

Page 33: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Ridge Regression vs Lasso

Ridge Regression: Lasso: HOT!Ideally l0 penalty,

but optimization

becomes non-convex

βs with constant J(β)

(level sets of J(β))

33

Lasso (l1 penalty) results in sparse solutions – vector with more zero coordinates

Good for high-dimensional problems – don’t have to store all coordinates!

βs with

constant

l1 norm

βs with

constant

l0 norm

(level sets of J(β))

βs with

constant

l2 norm

β2

β1

Page 34: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Beyond Linear Regression

Polynomial regression

Regression with nonlinear features/basis functions

34

Kernel regression - Local/Weighted regression

Regression trees – Spatially adaptive regression

h

Page 35: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Polynomial Regression

Univariate case:

where ,

35

Nonlinear features

Weight ofeach feature

Page 36: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Nonlinear Regression

Fourier Basis Wavelet Basis

Nonlinear features/basis functionsBasis coefficients

36

Good representation for oscillatory functions Good representation for functions

localized at multiple scales

Page 37: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Local Regression

Nonlinear features/basis functionsBasis coefficients

37

Globally supported

basis functions

(polynomial, fourier)

will not yield a good

representation

Page 38: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Local Regression

Nonlinear features/basis functionsBasis coefficients

38

Globally supported

basis functions

(polynomial, fourier)

will not yield a good

representation

Page 39: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Kernel Regression (Local)

Weighted Least Squares

Weigh each training point based on distance to test point

39

K – Kernel

h – Bandwidth of kernel

Page 40: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Nadaraya-Watson Kernel Regression

constant

40

Page 41: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Nadaraya-Watson Kernel Regression

constant

41

h

Recall: NN classifierAverage <-> majority vote

with box-car kernel

Sum of Ys in h ball around X#pts in h ball around X

Page 42: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Choice of Bandwidth

hShould depend on n, # training data

(determines variance)

Should depend on smoothness of

function

42

Large Bandwidth – average more data points, reduce noise

Small Bandwidth – less smoothing, more accurate fit

(Lower variance)

(Lower bias)

Bias – Variance tradeoff : More to come in later lectures

function

(determines bias)

Page 43: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Spatially adaptive regression

h

h

43

If function smoothness varies spatially, we want to allow bandwidth h to

depend on X

Local polynomials, splines, wavelets, regression trees …

Page 44: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression trees

Num Children?

≥ 2 < 2

Binary Decision Tree

44

Average (fit a constant ) on the leaves

Page 45: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Regression trees

h

h

Quad Decision Tree

45

- Polynomial fit on each leaf

If , then split

Else stop Compare residual error with and without split

Page 46: Revisiting Logistic Regression & Naïve Logistic Regression ...epxing/Class/10701-10s/Lecture/lecture6.pdf · Connection to Gaussian Naïve Bayes There are several distributions that

Summary

Discriminative vs Generative Classifiers

- Naïve Bayes vs Logistic Regression

Regression

- Linear Regression

Least Squares Estimator

46

Least Squares Estimator

Normal Equations

Gradient Descent

Geometric Interpretation

Probabilistic Interpretation (connection to MLE)

- Regularized Linear Regression (connection to MAP)

Ridge Regression, Lasso

- Polynomial Regression, Basis (Fourier, Wavelet) Estimators

- Kernel Regression (Localized)

- Regression Trees