Discriminative Classifiers and Logistic Regression
Machine Learning
CS 7641,CSE/ISYE 6740, Fall 2015
Le Song
Classification
Represent the data
A label is provided for each data point, e.g., $y \in \{-1, +1\}$
Classifier: a function that maps a data point $x$ to a label $y$
Boys vs. Girls (demo)
How to come up with a decision boundary?
Given the class conditional distributions $P(x \mid y = 1)$, $P(x \mid y = -1)$, and the class priors $P(y = 1)$, $P(y = -1)$:
[Figure: two Gaussian class conditionals $P(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1)$ and $P(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$, with the decision boundary marked "?"]
Use Bayes rule:
$P(y \mid x) = \dfrac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}$
Here $P(y \mid x)$ is the posterior, $P(x \mid y)$ the likelihood, $P(y)$ the prior, and the denominator the normalization constant.
Bayes Decision Rule
Learning: prior $P(y)$, class conditional distribution $P(x \mid y)$
The posterior probability of a test point:
$q_i(x) := P(y = i \mid x) = \dfrac{P(x \mid y = i)\, P(y = i)}{P(x)}$
Bayes decision rule:
If $q_1(x) > q_{-1}(x)$, then $y = 1$; otherwise $y = -1$
Alternatively:
If the ratio $\dfrac{P(x \mid y = 1)}{P(x \mid y = -1)} > \dfrac{P(y = -1)}{P(y = 1)}$, then $y = 1$; otherwise $y = -1$
Or look at the log-likelihood ratio $h(x) = \ln \dfrac{q_1(x)}{q_{-1}(x)}$
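To make the rule concrete, here is a minimal Python sketch for two 1-D Gaussian class conditionals; all parameter values are hypothetical, not from the lecture:

```python
# Minimal sketch of the Bayes decision rule for two 1-D Gaussian
# class conditionals. All parameter values here are hypothetical.
from scipy.stats import norm

mu_pos, sigma_pos, prior_pos = 0.0, 1.0, 0.4   # class y = +1
mu_neg, sigma_neg, prior_neg = 2.0, 1.5, 0.6   # class y = -1

def bayes_decide(x):
    """Return +1 if q_{+1}(x) > q_{-1}(x), else -1."""
    q_pos = norm.pdf(x, mu_pos, sigma_pos) * prior_pos  # P(x|y=+1) P(y=+1)
    q_neg = norm.pdf(x, mu_neg, sigma_neg) * prior_neg  # P(x|y=-1) P(y=-1)
    return 1 if q_pos > q_neg else -1

print(bayes_decide(0.3))   # +1: x lies closer to the positive-class mean
```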
More on the Bayes error of the Bayes rule
The Bayes error is a lower bound on the probability of classification error for any classifier.
The Bayes decision rule is the theoretically best classifier: it minimizes the probability of classification error.
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation
Need to do integrals, e.g., $\int P(x \mid y = 1)\, P(y = 1)\, dx$
What do people do in practice?
Use simplifying assumptions for $P(x \mid y = i)$:
Assume $P(x \mid y = i)$ is Gaussian, $\mathcal{N}(\mu_i, \Sigma_i)$
Assume $P(x \mid y = i)$ is fully factorized
Use geometric intuitions:
k-nearest neighbor classifier
Support vector machine
Directly go for the decision boundary $h(x) = \ln \frac{q_1(x)}{q_{-1}(x)}$:
Logistic regression
Neural networks
Naïve Bayes Classifier
Use the Bayes decision rule for classification:
$P(y \mid x) \propto P(x \mid y)\, P(y)$
But assume $P(x \mid y = i)$ is fully factorized:
$P(x \mid y = i) = \prod_{d=1}^{D} P(x_d \mid y = i)$
In other words, the variables corresponding to each dimension of the data are independent given the label.
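As a brief illustration (not from the slides), the factorized likelihood turns the product over dimensions into a sum of per-dimension log densities; the Gaussian per-dimension model and all parameter values below are assumptions:

```python
# Sketch of an (unnormalized) naive Bayes log-posterior with a fully
# factorized Gaussian class conditional; mu, sigma, prior are assumed
# to have been estimated already (hypothetical values in the demo call).
import numpy as np
from scipy.stats import norm

def log_posterior_unnorm(x, mu, sigma, prior):
    # log P(y) + sum_d log P(x_d | y): factorization makes this a sum
    return np.log(prior) + np.sum(norm.logpdf(x, mu, sigma))

x = np.array([0.5, -1.2])
score_pos = log_posterior_unnorm(x, np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.4)
score_neg = log_posterior_unnorm(x, np.array([2.0, 1.0]), np.array([1.5, 1.0]), 0.6)
print(1 if score_pos > score_neg else -1)   # Bayes decision with factorized likelihood
```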
Naïve Bayes classifier is a generative model
Once you have the model, you can generate samples from it.
For each data point $i$:
Sample a label $y \in \{1, 2\}$ according to the class prior $P(y)$
Sample the value of $x$ from the class conditional $P(x \mid y)$
Naïve Bayes: conditioned on $y$, generate the first dimension $x_1$, the second dimension $x_2$, ..., independently (see the sketch below)
Differences from mixture of Gaussians models:
The purpose is different (density estimation vs. classification)
The data are different (with/without labels)
The learning is different (EM or not)
[Figure: graphical model with a label node $y$ pointing to independent dimension nodes $x_1, \ldots, x_D$]
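A minimal sketch of this ancestral sampling process, assuming a Gaussian naive Bayes model with hypothetical parameters:

```python
# Sample from a Gaussian naive Bayes model: first the label from the
# class prior, then each dimension independently given the label.
# All parameters below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.3, 0.7])                 # P(y=1), P(y=2)
means = np.array([[0.0, 0.0], [3.0, 1.0]])   # per-class, per-dimension means
stds  = np.array([[1.0, 2.0], [0.5, 1.0]])   # per-class, per-dimension stds

def sample_point():
    c = rng.choice(2, p=prior)               # sample label index from P(y)
    x = rng.normal(means[c], stds[c])        # each dimension drawn independently
    return x, c + 1                          # report labels as 1, 2

print(sample_point())
```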
K-nearest neighbors
k-nearest neighbor classifier: assign $x$ a label by taking a majority vote over the $k$ training points $x_i$ closest to $x$.
For $k > 1$, the k-nearest neighbor rule generalizes the nearest neighbor rule.
To define this more mathematically:
$I_k(x) :=$ indices of the $k$ training points closest to $x$
If $y_i \in \{-1, +1\}$, then we can write the k-nearest neighbor classifier as:
$f_k(x) := \operatorname{sign}\left( \sum_{i \in I_k(x)} y_i \right)$
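A direct (brute-force) sketch of this rule; the Euclidean metric is an assumption, and an odd $k$ avoids ties:

```python
# Direct implementation of the k-NN rule f_k(x) = sign(sum of labels of
# the k closest training points); assumes y_i in {-1, +1} and Euclidean
# distance. Use an odd k so the vote cannot tie.
import numpy as np

def knn_classify(x, X_train, y_train, k):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    I_k = np.argsort(dists)[:k]                  # indices of the k nearest points
    return int(np.sign(y_train[I_k].sum()))      # majority vote via the sign
```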
Examples
[Figures: k-NN decision boundaries on the demo data for K = 1, 3, 5, 25, 51, 101]
Computations in K-NN
Similar to KDE, there is essentially no "training" or "learning" phase; computation is needed when applying the classifier.
Memory: $O(nd)$
Finding the nearest neighbors out of a set of millions of examples is still pretty hard.
Test computation: $O(nd)$
Use smart data structures and algorithms to index the training data (see the sketch below):
Memory: $O(nd)$; training computation: $O(n \log n)$; test computation: $O(\log n)$ (KD-tree, ball tree, cover tree)
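For example, SciPy's cKDTree gives roughly this trade-off; the data here are random placeholders:

```python
# Indexing the training points with SciPy's cKDTree so each nearest-
# neighbor query costs roughly O(log n) instead of O(n d).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 3))
tree = cKDTree(X_train)                          # build once: O(n log n)
dist, idx = tree.query(rng.normal(size=3), k=5)  # 5 nearest neighbors of a test point
print(idx)                                       # indices into X_train
```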
Discriminative classifier
Directly estimate the decision boundary $h(x) = \ln \frac{q_1(x)}{q_{-1}(x)}$ or the posterior distribution $P(y \mid x)$:
Logistic regression, neural networks
Do not estimate $P(x \mid y)$ and $P(y)$
$P(y \mid x)$ or $h(x) := P(y = 1 \mid x)$ is a function of $x$ and does not have a probabilistic meaning for $x$; hence it cannot be used to sample data points
Why discriminative classifiers?
Avoid the difficult density estimation problem
Empirically achieve better classification results
What is the logistic regression model?
Assume that the posterior distribution $P(y = 1 \mid x)$ takes a particular form:
$P(y = 1 \mid x, \theta) = \dfrac{1}{1 + \exp(-\theta^\top x)}$
Logistic function: $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
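In code, the model is just the logistic function applied to a linear score; the parameter values below are hypothetical:

```python
# The logistic regression posterior from the slide:
# P(y = 1 | x, theta) = 1 / (1 + exp(-theta^T x)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta):
    return sigmoid(theta @ x)   # P(y = 1 | x, theta)

theta = np.array([1.0, -2.0])   # hypothetical parameters
print(predict_proba(np.array([0.5, 0.1]), theta))   # sigmoid(0.3) ~ 0.574
```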
Learning parameters in logistic regression
Find $\theta$ such that the conditional likelihood of the labels is maximized:
$\max_\theta\; \ell(\theta) := \log \prod_{i=1}^{n} P(y_i \mid x_i, \theta)$
Good news: $\ell(\theta)$ is a concave function of $\theta$, and there is a single global optimum.
Bad news: no closed form solution (resort to numerical methods)
The objective function $\ell(\theta)$
Logistic regression model:
$P(y = 1 \mid x, \theta) = \dfrac{1}{1 + \exp(-\theta^\top x)}$
Note that
$P(y = 0 \mid x, \theta) = 1 - \dfrac{1}{1 + \exp(-\theta^\top x)} = \dfrac{\exp(-\theta^\top x)}{1 + \exp(-\theta^\top x)}$
Plug in:
$\ell(\theta) := \log \prod_{i=1}^{n} P(y_i \mid x_i, \theta) = \sum_{i=1}^{n} \left[ (y_i - 1)\, \theta^\top x_i - \log\left(1 + \exp(-\theta^\top x_i)\right) \right]$
The gradient of $\ell(\theta)$
$\ell(\theta) := \log \prod_{i=1}^{n} P(y_i \mid x_i, \theta) = \sum_{i=1}^{n} \left[ (y_i - 1)\, \theta^\top x_i - \log\left(1 + \exp(-\theta^\top x_i)\right) \right]$
Gradient:
$\dfrac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{n} \left[ (y_i - 1)\, x_i + \dfrac{\exp(-\theta^\top x_i)}{1 + \exp(-\theta^\top x_i)}\, x_i \right]$
Setting it to 0 does not lead to a closed form solution.
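A sketch of this objective and gradient, assuming labels $y_i \in \{0, 1\}$ as in the derivation and a design matrix X with one example per row:

```python
# Log-likelihood and gradient as derived above; X has shape (n, d),
# y holds labels in {0, 1}.
import numpy as np

def log_likelihood(theta, X, y):
    z = X @ theta
    return np.sum((y - 1) * z - np.log1p(np.exp(-z)))

def gradient(theta, X, y):
    z = X @ theta
    w = np.exp(-z) / (1.0 + np.exp(-z))   # exp(-theta^T x_i) / (1 + exp(-theta^T x_i))
    return X.T @ (y - 1 + w)              # sum_i [(y_i - 1) + w_i] x_i
```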
Gradient descent/ascent
One way to solve an unconstrained optimization problem is gradient descent.
Given an initial guess, we iteratively refine the guess by taking the direction of the negative gradient.
Think about going down a hill by taking the steepest direction at each step.
Update rule:
$x^{t+1} = x^t - \eta_t \nabla f(x^t)$
$\eta_t$ is called the step size or learning rate.
Gradient Ascent/Descent algorithm
Initialize parameter $\theta^0$
Do
$\theta^{t+1} \leftarrow \theta^t + \eta \sum_{i=1}^{n} \left[ (y_i - 1)\, x_i + \dfrac{\exp(-\theta^{t\top} x_i)}{1 + \exp(-\theta^{t\top} x_i)}\, x_i \right]$
While $\|\theta^{t+1} - \theta^t\| > \epsilon$
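Putting the update rule and stopping criterion together, a minimal sketch (the step size $\eta$ and tolerance $\epsilon$ are hypothetical choices):

```python
# Minimal gradient-ascent loop for the update rule above.
import numpy as np

def fit_logistic(X, y, eta=0.1, eps=1e-6, max_iter=10_000):
    theta = np.zeros(X.shape[1])                 # initialize theta^0
    for _ in range(max_iter):
        z = X @ theta
        grad = X.T @ (y - 1 + np.exp(-z) / (1.0 + np.exp(-z)))
        step = eta * grad
        theta = theta + step                     # ascend the log-likelihood
        if np.linalg.norm(step) < eps:           # ||theta^{t+1} - theta^t|| < eps
            break
    return theta
```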
Boys vs. girls (demo)
Naïve Bayes vs. logistic regression
Consider $y \in \{1, -1\}$, $x \in \mathbb{R}^n$
Number of parameters:
Naïve Bayes:
$2n + 1$, when all random variables are binary
$4n + 1$ for Gaussians: $2n$ means, $2n$ variances, and 1 for the prior
Logistic regression:
$n + 1$: $\theta_0, \theta_1, \theta_2, \ldots, \theta_n$
Naïve Bayes vs. logistic regression II
When the model assumptions are correct:
Both naïve Bayes and logistic regression produce good classifiers.
When the model assumptions are incorrect:
Logistic regression is less biased: it does not assume conditional independence.
Logistic regression has fewer parameters.
Logistic regression is therefore expected to outperform naïve Bayes in practice.
Naïve Bayes vs. logistic regression III
Estimation method:
Naïve Bayes parameter estimates are decoupled (super easy)
Logistic regression parameter estimates are coupled (less easy)
How to estimate the parameters in logistic regression?
Maximum likelihood estimation
More specifically, maximize the conditional likelihood of the labels
Handwritten digits (demo)
Multiclass logistic regression
Assign input vector $x_i$, $i = 1, \ldots, n$, into one of the classes $k$, $k = 1, \ldots, K$.
Assume that the posterior distribution takes a particular form:
$P(y_i = k \mid x_i, \theta_1, \ldots, \theta_K) = \dfrac{\exp(\theta_k^\top x_i)}{\sum_{k'} \exp(\theta_{k'}^\top x_i)}$
Now, let's introduce some notation:
$\mu_{ki} := P(y_i = k \mid x_i, \theta_1, \ldots, \theta_K)$, and $t_{ki} := \mathbb{1}(y_i = k)$
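A sketch of this softmax posterior, stacking one hypothetical parameter vector $\theta_k$ per row of Theta:

```python
# The softmax posterior above; Theta has shape (K, d), one row per class.
import numpy as np

def softmax_posterior(x, Theta):
    scores = Theta @ x                 # theta_k^T x for every class k
    scores -= scores.max()             # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                 # P(y = k | x) for k = 1..K

Theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical, K=3, d=2
print(softmax_posterior(np.array([0.5, 0.2]), Theta))     # probabilities sum to 1
```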
Learning parameters in multiclass logistic regression
Given all the input data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
The log-likelihood can be written as:
$\ell(\theta) := \log \prod_{i=1}^{n} \prod_{k=1}^{K} \mu_{ki}^{t_{ki}} = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ki} \log \mu_{ki} = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ki} \left[ \theta_k^\top x_i - \log \sum_{k'} \exp(\theta_{k'}^\top x_i) \right]$
Find $\theta$ such that the conditional likelihood of the labels is maximized.
$-\ell(\theta)$ is also known as the cross-entropy error function for the multiclass case.
Compute the gradient of $\ell(\theta)$ with respect to one parameter vector $\theta_k$:
$\dfrac{\partial \ell}{\partial \theta_k} = -\sum_{i=1}^{n} (\mu_{ki} - t_{ki})\, x_i$
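The same gradient in matrix form, assuming the one-hot targets $t_{ki}$ are stored as a matrix T; shapes are noted in the comments:

```python
# Multiclass gradient above: X is (n, d), one-hot targets T are (n, K),
# Theta is (K, d); returns the (K, d) gradient of the log-likelihood.
import numpy as np

def multiclass_log_likelihood_grad(Theta, X, T):
    scores = X @ Theta.T                         # (n, K): theta_k^T x_i
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    Mu = np.exp(scores)
    Mu /= Mu.sum(axis=1, keepdims=True)          # mu_{ki} = P(y_i = k | x_i)
    return -(Mu - T).T @ X                       # row k is -sum_i (mu_ki - t_ki) x_i
```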