Logistic regression

Page 1: Logistic regression

Discriminative classifier and logistic regression

Machine Learning
CS 7641 / CSE/ISYE 6740, Fall 2015

Le Song

Page 2: Logistic regression

Classification

Represent the data

A label is provided for each data point, e.g., $y \in \{-1, +1\}$

Classifier


Page 3: Logistic regression

Boys vs. Girls (demo)


Page 4: Logistic regression

How to come up with decision boundary

Given the class conditional distributions $P(x \mid y = 1)$, $P(x \mid y = -1)$, and the class priors $P(y = 1)$, $P(y = -1)$:

$P(x \mid y = 1) = \mathcal{N}(x;\, \mu_1, \Sigma_1)$

$P(x \mid y = -1) = \mathcal{N}(x;\, \mu_{-1}, \Sigma_{-1})$

Where should the decision boundary go?

Page 5: Logistic regression

Use Bayes rule

$P(y \mid x) = \dfrac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}$

posterior = likelihood $\times$ prior / normalization constant
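To make the rule concrete, here is a minimal sketch in Python, assuming two Gaussian class conditionals with made-up means and equal priors (the `prior` and `cond` names and all parameter values are illustrative, not from the slides):

```python
from scipy.stats import multivariate_normal

# Illustrative class priors and Gaussian class conditionals (made-up values).
prior = {+1: 0.5, -1: 0.5}                      # class priors P(y)
cond = {
    +1: multivariate_normal(mean=[1.0, 1.0]),   # P(x | y = +1), identity cov
    -1: multivariate_normal(mean=[-1.0, -1.0]), # P(x | y = -1), identity cov
}

def posterior(x):
    """Return P(y | x) for both classes via Bayes rule."""
    joint = {y: cond[y].pdf(x) * prior[y] for y in (+1, -1)}
    z = sum(joint.values())                     # normalization constant P(x)
    return {y: joint[y] / z for y in (+1, -1)}

print(posterior([0.5, 0.5]))                    # posterior mass on each class
```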

Page 6: Logistic regression

Bayes Decision Rule

Learning: prior $P(y)$, class conditional distribution $P(x \mid y)$

The posterior probability of a test point:

$q_i(x) := P(y = i \mid x) \propto P(x \mid y = i)\, P(y = i)$

Bayes decision rule:

If $q_1(x) > q_{-1}(x)$, then $y = 1$; otherwise $y = -1$

Alternatively:

If the ratio $\dfrac{P(x \mid y = 1)}{P(x \mid y = -1)} > \dfrac{P(y = -1)}{P(y = 1)}$, then $y = 1$; otherwise $y = -1$

Or look at the log-likelihood ratio $h(x) = \ln \dfrac{q_1(x)}{q_{-1}(x)}$
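Continuing with the hypothetical `prior`/`cond` objects from the sketch above, the decision rule becomes a sign test on $h(x)$:

```python
import numpy as np

def bayes_decide(x):
    """Bayes decision rule via the log-likelihood ratio h(x)."""
    # h(x) = ln q_1(x) - ln q_{-1}(x); the normalization constant cancels.
    h = (np.log(cond[+1].pdf(x)) + np.log(prior[+1])
         - np.log(cond[-1].pdf(x)) - np.log(prior[-1]))
    return +1 if h > 0 else -1
```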


Page 7: Logistic regression

More on Bayes error of Bayes rule

The Bayes error is a lower bound on the probability of classification error.

The Bayes decision rule is the theoretically best classifier: it minimizes the probability of classification error.

However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?

Need density estimation.

Need to do an integral, e.g., $\int P(x \mid y = 1)\, P(y = 1)\, dx$.

Page 8: Logistic regression

What do people do in practice?

Use simplifying assumptions for $P(x \mid y = 1)$:

Assume $P(x \mid y = 1)$ is Gaussian, $\mathcal{N}(\mu_1, \Sigma_1)$

Assume $P(x \mid y = 1)$ is fully factorized

Use geometric intuitions:

k-nearest neighbor classifier

Support vector machine

Directly go for the decision boundary $h(x) = \ln \dfrac{q_1(x)}{q_{-1}(x)}$:

Logistic regression

Neural networks

Page 9: Logistic regression

Naïve Bayes Classifier

Use the Bayes decision rule for classification:

$P(y \mid x) \propto P(x \mid y)\, P(y)$

But assume $P(x \mid y = 1)$ is fully factorized:

$P(x \mid y = 1) = \prod_{j=1}^{d} P(x_j \mid y = 1)$

In other words, the variables corresponding to each dimension of the data are independent given the label.
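A minimal sketch of a naive Bayes classifier with per-dimension Gaussian conditionals; the fitting helper, variable names, and the variance floor are illustrative assumptions, not from the slides:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Fit a class prior and per-dimension Gaussian conditionals per class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),        # one mean per dimension
            "var": Xc.var(axis=0) + 1e-9,   # one variance per dimension
        }
    return model

def log_joint(model, c, x):
    """log P(x | y=c) + log P(y=c) under the factorized Gaussian model."""
    m = model[c]
    log_cond = -0.5 * (np.log(2 * np.pi * m["var"])
                       + (x - m["mean"]) ** 2 / m["var"])
    return log_cond.sum() + np.log(m["prior"])  # sum over dimensions

def predict(model, x):
    return max(model, key=lambda c: log_joint(model, c, x))
```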


Page 10: Logistic regression

Naïve Bayes classifier is a generative model

Once you have the model, you can generate samples from it. For each data point $i$:

Sample a label $y \in \{1, 2\}$ according to the class prior $P(y)$

Sample the value of $x$ from the class conditional $P(x \mid y)$

Naïve Bayes: conditioned on $y$, generate the first dimension $x_1$, the second dimension $x_2$, ..., independently
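A minimal sketch of this ancestral sampling, reusing the hypothetical `model` dictionary from the naive Bayes snippet above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(model, n):
    """Generate n labeled points from the fitted naive Bayes model."""
    classes = list(model)
    priors = [model[c]["prior"] for c in classes]
    data = []
    for _ in range(n):
        c = rng.choice(classes, p=priors)         # sample the label first
        x = rng.normal(model[c]["mean"],          # then each dimension
                       np.sqrt(model[c]["var"]))  # independently, given c
        data.append((x, c))
    return data
```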

Differences from mixture of Gaussians models:

Purpose is different (density estimation vs. classification)

Data is different (with vs. without labels)

Learning is different (EM vs. no EM)

[Figure: graphical model with a label node $y$ pointing to the dimension nodes $x_1, x_2, \dots$]

Page 11: Logistic regression

K-nearest neighbors

k-nearest neighbor classifier: assign $x$ a label by taking a majority vote over the $k$ training points $x_i$ closest to $x$.

For $k > 1$, the k-nearest neighbor rule generalizes the nearest neighbor rule.

To define this more mathematically:

$I_k(x) :=$ indices of the $k$ training points closest to $x$.

If $y_i \in \{-1, +1\}$, then we can write the k-nearest neighbor classifier as:

$h_k(x) := \operatorname{sign}\Bigl( \sum_{i \in I_k(x)} y_i \Bigr)$
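A brute-force sketch of this rule (the Euclidean metric and the variable names are assumptions):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k):
    """k-NN by majority vote; labels are in {-1, +1}."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every point
    idx = np.argsort(dists)[:k]                  # I_k(x): k nearest indices
    return int(np.sign(y_train[idx].sum()))      # sign vote (ties give 0)
```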


Page 12: Logistic regression

Example


K = 1

Page 13: Logistic regression

Example


K = 3

Page 14: Logistic regression

Example


K = 5

Page 15: Logistic regression

Example


K = 25

Page 16: Logistic regression

Example


K = 51

Page 17: Logistic regression

Example


K = 101

Page 18: Logistic regression

Computations in K-NN

Similar to KDE, there is essentially no “training” or “learning” phase; computation is needed when applying the classifier.

Memory: $O(Nd)$

Finding the nearest neighbors out of a set of millions of examples is still pretty hard.

Test computation: $O(Nd)$

Use smart data structures and algorithms to index the training data:

Memory: $O(Nd)$

Training computation: $O(N \log N)$

Test computation: $O(\log N)$

Examples: KD-tree, ball tree, cover tree
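As one concrete option, SciPy's `cKDTree` provides such an index; a minimal sketch with illustrative toy data:

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.randn(10000, 2)     # N training points in R^d
y_train = np.sign(X_train[:, 0])        # toy labels in {-1, +1}

tree = cKDTree(X_train)                 # build the index once

def knn_classify_fast(x, k):
    _, idx = tree.query(x, k=k)         # fast k-nearest-neighbor lookup
    return int(np.sign(y_train[idx].sum()))
```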


Page 19: Logistic regression

Discriminative classifier

Directly estimate the decision boundary $h(x) = \ln \dfrac{q_1(x)}{q_{-1}(x)}$ or the posterior distribution $P(y \mid x)$:

Logistic regression, neural networks

Do not estimate $P(x \mid y)$ and $P(y)$:

$h(x)$ or $P(y = 1 \mid x)$ is a function of $x$, and does not have probabilistic meaning for $x$ itself, hence it cannot be used to sample data points.

Why a discriminative classifier?

Avoid the difficult density estimation problem

Empirically achieve better classification results

Page 20: Logistic regression

What is the logistic regression model?

Assume that the posterior distribution $P(y = 1 \mid x)$ takes a particular form:

$P(y = 1 \mid x, \theta) = \dfrac{1}{1 + \exp(-\theta^T x)}$

Logistic function: $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
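A minimal sketch of this model (this bare form can overflow for very negative $\theta^T x$; a production version would use a numerically stable formulation):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def posterior_y1(theta, x):
    """P(y = 1 | x, theta) under the logistic regression model."""
    return sigmoid(theta @ x)
```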


Page 21: Logistic regression

Learning parameters in logistic regression

Find $\theta$ such that the conditional likelihood of the labels is maximized:

$\max_\theta\ \ell(\theta) := \log \prod_{i=1}^{N} P(y_i \mid x_i, \theta)$

Good news: $\ell(\theta)$ is a concave function of $\theta$, and there is a single global optimum.

Bad news: no closed form solution (resort to numerical methods).

Page 22: Logistic regression

The objective function $\ell(\theta)$

Logistic regression model:

$P(y = 1 \mid x, \theta) = \dfrac{1}{1 + \exp(-\theta^T x)}$

Note that

$P(y = 0 \mid x, \theta) = 1 - \dfrac{1}{1 + \exp(-\theta^T x)} = \dfrac{\exp(-\theta^T x)}{1 + \exp(-\theta^T x)}$

Plugging in:

$\ell(\theta) := \log \prod_{i=1}^{N} P(y_i \mid x_i, \theta) = \sum_{i=1}^{N} \Bigl[ \mathbb{1}(y_i = 1)\, \theta^T x_i - \log\bigl(1 + \exp(\theta^T x_i)\bigr) \Bigr]$


Page 23: Logistic regression

The gradient of $\ell(\theta)$

$\ell(\theta) := \log \prod_{i=1}^{N} P(y_i \mid x_i, \theta) = \sum_{i=1}^{N} \Bigl[ \mathbb{1}(y_i = 1)\, \theta^T x_i - \log\bigl(1 + \exp(\theta^T x_i)\bigr) \Bigr]$

Gradient:

$\dfrac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{N} \Bigl[ \mathbb{1}(y_i = 1) - \dfrac{\exp(\theta^T x_i)}{1 + \exp(\theta^T x_i)} \Bigr]\, x_i$

Setting it to 0 does not lead to a closed form solution.


Page 24: Logistic regression

Gradient descent/ascent

One way to solve an unconstrained optimization problem is gradient descent.

Given an initial guess, we iteratively refine the guess by taking the direction of the negative gradient.

Think about going down a hill by taking the steepest direction at each step.

Update rule:

$x_{t+1} = x_t - \eta_t \nabla f(x_t)$

$\eta_t$ is called the step size or learning rate.

Page 25: Logistic regression

Gradient Ascent/Descent algorithm

Initialize parameter $\theta^0$

Do:

$\theta^{t+1} \leftarrow \theta^t + \eta \sum_{i=1}^{N} \Bigl[ \mathbb{1}(y_i = 1) - \dfrac{\exp\bigl((\theta^t)^T x_i\bigr)}{1 + \exp\bigl((\theta^t)^T x_i\bigr)} \Bigr]\, x_i$

While $\lVert \theta^{t+1} - \theta^t \rVert > \epsilon$
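A minimal sketch of this loop, assuming labels in $\{0, 1\}$ (so $\mathbb{1}(y_i = 1) = y_i$) and illustrative values for the step size $\eta$ and tolerance $\epsilon$:

```python
import numpy as np

def train_logistic(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    """Gradient ascent on the conditional log-likelihood l(theta)."""
    theta = np.zeros(X.shape[1])               # initialize theta^0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # P(y=1 | x_i, theta)
        grad = X.T @ (y - p)                   # sum_i [1(y_i=1) - p_i] x_i
        theta_new = theta + eta * grad         # ascent step
        if np.linalg.norm(theta_new - theta) <= eps:
            return theta_new
        theta = theta_new
    return theta
```

A fixed step size keeps the sketch close to the slide's pseudocode; in practice one would shrink $\eta$ or use line search to guarantee convergence.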


Page 26: Logistic regression

Boys vs. Girls (demo)

Page 27: Logistic regression

Naïve Bayes vs. logistic regression

Consider $y \in \{1, -1\}$, $x \in \mathbb{R}^n$.

Number of parameters:

Naïve Bayes:

$2n + 1$, when all random variables are binary

$4n + 1$ for Gaussians: $2n$ means, $2n$ variances, and 1 for the prior

Logistic regression:

$n + 1$: $\theta_0, \theta_1, \theta_2, \dots, \theta_n$

Page 28: Logistic regression

Naïve Bayes vs logistic regression II

When the model assumptions are correct:

Both Naïve Bayes and logistic regression produce good classifiers.

When the model assumptions are incorrect:

Logistic regression is less biased: it does not assume conditional independence.

Logistic regression has fewer parameters.

It is therefore expected to outperform Naïve Bayes in practice.

Page 29: Logistic regression

Naïve Bayes vs logistic regression III

Estimation method:

Naïve Bayes parameter estimates are decoupled (super easy)

Logistic regression parameter estimates are coupled (less easy)

How to estimate the parameters in logistic regression?

Maximum likelihood estimation

More specifically, maximize the conditional likelihood of the labels.

Page 30: Logistic regression

Handwritten digits (demo)


Page 31: Logistic regression

Multiclass logistic regression

Assign input vectors $x_i$, $i = 1, \dots, N$, into one of the classes $k$, $k = 1, \dots, K$.

Assume that the posterior distribution takes a particular form:

$P(y_i = k \mid x_i, \theta_1, \dots, \theta_K) = \dfrac{\exp(\theta_k^T x_i)}{\sum_{k'} \exp(\theta_{k'}^T x_i)}$

Now, let's introduce some notation:

$\pi_{ki} := P(y_i = k \mid x_i, \theta_1, \dots, \theta_K)$, and $y_{ki} := \mathbb{1}(y_i = k)$
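A minimal sketch of this softmax posterior; the max-subtraction shift is a standard numerical-stability trick, not something on the slide:

```python
import numpy as np

def softmax_posterior(Theta, x):
    """P(y = k | x) for each class k; Theta has one row theta_k per class."""
    scores = Theta @ x        # theta_k^T x for every k
    scores -= scores.max()    # stability shift; ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()        # normalize over classes
```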


Page 32: Logistic regression

Learning parameters in multiclass logistic regression

Given all the input data

$(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$

the log-likelihood can be written as:

$\ell(\theta) := \log \prod_{i=1}^{N} \prod_{k=1}^{K} \pi_{ki}^{\,y_{ki}} = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ki} \log \pi_{ki} = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ki} \Bigl[ \theta_k^T x_i - \log \sum_{k'} \exp(\theta_{k'}^T x_i) \Bigr]$

Page 33: Logistic regression

Learning parameters in multiclass logistic regression

Find $\theta$ such that the conditional likelihood of the labels is maximized.

$-\ell(\theta)$ is also known as the cross-entropy error function for the multiclass case.

Compute the gradient of $\ell(\theta)$ with respect to one parameter vector $\theta_k$:

$\dfrac{\partial \ell}{\partial \theta_k} = -\sum_{i=1}^{N} \bigl( \pi_{ki} - y_{ki} \bigr)\, x_i$
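A minimal sketch of batch gradient ascent on $\ell(\theta)$ for the multiclass model, assuming one-hot labels $y_{ki}$ as the rows of `Y` and an illustrative step size:

```python
import numpy as np

def train_multiclass(X, Y, eta=0.1, n_steps=1000):
    """Gradient ascent on l(theta); X is (N, d), Y is one-hot (N, K)."""
    K, d = Y.shape[1], X.shape[1]
    Theta = np.zeros((K, d))
    for _ in range(n_steps):
        scores = X @ Theta.T                        # (N, K): theta_k^T x_i
        scores -= scores.max(axis=1, keepdims=True) # stability shift
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)           # pi_ki for all i, k
        grad = (Y - P).T @ X                        # row k: sum_i (y_ki - pi_ki) x_i
        Theta += eta * grad                         # ascent step
    return Theta
```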
