CSCI 590: Machine Learning

Lecture 11: Perceptron algorithm, probabilistic generative models, probabilistic discriminative models

Instructor: Murat Dundar

Acknowledgement: These slides are prepared using the course textbook http://research.microsoft.com/~cmbishop/prml/

Perceptron Algorithm (1)

Another example of a linear discriminant model is the perceptron of Rosenblatt (1962)

where the nonlinear activation function f(.) is given by a step function of the form
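The equations referenced here are images in the original slides and did not survive extraction; in the textbook's notation the model and its step activation are:

$$y(\mathbf{x}) = f\!\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})\right), \qquad f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$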

Perceptron Algorithm (2)

Targets: t = +1 for class C1 and t = −1 for class C2.

Error function: the total number of misclassified patterns? This does not work because of discontinuities.

Methods based on optimizing w using the gradient of the error function cannot be applied, because the gradient is zero almost everywhere.

Perceptron Algorithm (3)

Perceptron criterion, where M denotes the set of all misclassified samples, and its stochastic gradient descent update are given below.
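Reconstructed from the textbook, with ϕn = ϕ(xn) and learning rate η:

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^{T}\boldsymbol{\phi}_n t_n, \qquad \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\boldsymbol{\phi}_n t_n$$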

Perceptron Algorithm (4)

Perceptron algorithm:

• For each sample xn, evaluate y(xn).

• Is xn correctly classified?

  yes: do nothing

  no: if tn = +1, add ϕ(xn) to the current estimate of w; if tn = −1, subtract ϕ(xn) from the current estimate of w (see the code sketch below).
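A minimal NumPy sketch of this update loop; the function name, learning rate, and stopping rule are illustrative choices, not part of the slides:

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning on features Phi (N x M) with targets t in {+1, -1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        n_errors = 0
        for phi_n, t_n in zip(Phi, t):
            # y(x_n) = f(w^T phi(x_n)) with f a step function
            y_n = 1 if w @ phi_n >= 0 else -1
            if y_n != t_n:              # misclassified
                w += eta * t_n * phi_n  # add phi_n if t_n = +1, subtract if t_n = -1
                n_errors += 1
        if n_errors == 0:               # every sample correct: stop
            break
    return w
```

By the convergence theorem stated below, the inner loop eventually makes no updates whenever the classes are linearly separable in the feature space.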

Perceptron Algorithm (5)

The contribution to the error from a misclassified pattern will be reduced, but the total error may still increase because the change in w may cause the contribution of other samples to the error function to increase.

Perceptron convergence theorem: If there exists an exact solution, i.e., the classes are linearly separable, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

Probabilistic Generative Models (1)

For a two-class classification problem the posterior probability for class C1 can be written in terms of the logistic sigmoid function.
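The expressions themselves are images on the original slide; from the textbook they are:

$$p(C_1\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid C_1)\,p(C_1)}{p(\mathbf{x}\mid C_1)\,p(C_1) + p(\mathbf{x}\mid C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(\mathbf{x}\mid C_1)\,p(C_1)}{p(\mathbf{x}\mid C_2)\,p(C_2)}, \qquad \sigma(a) = \frac{1}{1+e^{-a}}$$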

Probabilistic Generative Models (2)

Symmetry property and inverse function: the inverse of the logistic sigmoid is also known as the logit function and represents the log of the ratio of probabilities, ln [p(C1|x)/p(C2|x)].
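In symbols (standard forms, restored here):

$$\sigma(-a) = 1 - \sigma(a), \qquad a = \ln\!\left(\frac{\sigma}{1-\sigma}\right) = \ln\frac{p(C_1\mid\mathbf{x})}{p(C_2\mid\mathbf{x})}$$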

Probabilistic Generative Models (3)

For the case of K > 2 classes, the posterior probabilities take the form given below.
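The missing expression is the softmax (normalized exponential), as in the textbook:

$$p(C_k\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid C_k)\,p(C_k)}{\sum_j p(\mathbf{x}\mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x}\mid C_k)\,p(C_k)\big)$$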

Probabilistic Generative Models (4)

Continuous Inputs: Class conditional densities are Gaussian with a shared covariance matrix

To find wᵀx + w0 we evaluate a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ].
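Carrying out this evaluation (the quadratic terms in x cancel because the covariance is shared) gives the textbook result:

$$p(C_1\mid\mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x} + w_0), \qquad \mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$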

Probabilistic Generative Models (5)

Maximum Likelihood Solution: We have a dataset {xn, tn}, n = 1, …, N, with tn ∈ {1, −1}.
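The derivation occupies the next several slides; the final maximum likelihood estimates, restored from the textbook and written with N1 and N2 the numbers of samples in C1 and C2, are:

$$\pi = p(C_1) = \frac{N_1}{N}, \qquad \boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n\in C_1}\mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n\in C_2}\mathbf{x}_n, \qquad \boldsymbol{\Sigma} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2$$

where $\mathbf{S}_k = \frac{1}{N_k}\sum_{n\in C_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{T}$.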

Probabilistic Generative Models (6)

Probabilistic Generative Models (7)

Probabilistic Generative Models (8)

Probabilistic Generative Models (9)
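A minimal NumPy sketch tying the pieces together: maximum-likelihood fit of the shared-covariance Gaussian model, then classification through the logistic sigmoid (function and variable names are illustrative):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML fit of a two-class Gaussian model with shared covariance.
    X: (N, M) inputs; t: (N,) labels, 1 for C1 and -1 for C2."""
    c1, c2 = (t == 1), (t == -1)
    N1, N2 = c1.sum(), c2.sum()
    pi = N1 / (N1 + N2)                              # prior p(C1)
    mu1, mu2 = X[c1].mean(axis=0), X[c2].mean(axis=0)
    S1 = (X[c1] - mu1).T @ (X[c1] - mu1) / N1        # within-class covariances
    S2 = (X[c2] - mu2).T @ (X[c2] - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)          # shared (pooled) covariance
    return pi, mu1, mu2, Sigma

def posterior_c1(X, pi, mu1, mu2, Sigma):
    """p(C1|x) = sigma(w^T x + w0) using the closed-form w and w0."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))
```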

Probabilistic Discriminative Models (1)

Logistic Regression: the posterior probability of class C1 is written as a logistic sigmoid acting on a linear function of the feature vector, p(C1|ϕ) = y(ϕ) = σ(wᵀϕ), with p(C2|ϕ) = 1 − p(C1|ϕ).

The probabilistic generative model with Gaussian class densities had M(M+1)/2 + 2M + 2 parameters. In contrast, logistic regression has only M parameters.

For large values of M there is a clear advantage to working with logistic regression, since the parameter count of the generative model grows quadratically with M while that of logistic regression grows only linearly.
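For instance, with M = 100 the count above gives

$$\tfrac{100\cdot 101}{2} + 2\cdot 100 + 2 = 5252 \quad\text{parameters for the generative model, versus } 100 \text{ for logistic regression.}$$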

Probabilistic Discriminative Models (2)

Maximum Likelihood for Logistic Regression

For a data set {ϕn, tn}, where tn ∈ {0, 1} and ϕn = ϕ(xn), with n = 1, …, N, the likelihood function can be written

Negative log likelihood (cross-entropy function)
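Restored from the textbook, with yn = σ(wᵀϕn):

$$p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,(1-y_n)^{1-t_n}, \qquad E(\mathbf{w}) = -\ln p(\mathbf{t}\mid\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\big\}$$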

Probabilistic Discriminative Models (3)

Taking the gradient of the error function with respect to w:

We used the fact that:
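In the textbook's notation, the gradient and the sigmoid derivative used here are:

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\boldsymbol{\phi}_n, \qquad \frac{d\sigma}{da} = \sigma(1-\sigma)$$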

Probabilistic Discriminative Models (4)

Maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when yn = tn for all samples, which happens when the sigmoid function saturates, i.e., wᵀϕn → ±∞.

Probabilistic Discriminative Models (5)

For the linear regression models discussed earlier, the maximum likelihood solution under the assumption of a Gaussian noise model has a closed form. This was a consequence of the quadratic dependence of the log likelihood function on the parameter vector w. For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function.

Probabilistic Discriminative Models (6)

The departure from a quadratic form is not substantial. The error function is convex and hence has a unique minimum. The error function can be minimized by an efficient iterative technique based on the Newton-Raphson iterative optimization scheme, which uses a local quadratic approximation to the log likelihood function.
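The Newton-Raphson update for a general error function E(w) with Hessian H is, as in the textbook:

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\,\nabla E(\mathbf{w})$$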

Probabilistic Discriminative Models (7)
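The quantities needed for this update in the logistic regression case are, with Φ the N × M design matrix whose n-th row is ϕnᵀ, y the vector of yn = σ(wᵀϕn), and t the vector of targets (restored from the textbook):

$$\nabla E(\mathbf{w}) = \boldsymbol{\Phi}^{T}(\mathbf{y} - \mathbf{t}), \qquad \mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1-y_n)\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{T} = \boldsymbol{\Phi}^{T}\mathbf{R}\boldsymbol{\Phi}$$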

Probabilistic Discriminative Models (8)

The Newton-Raphson update formula for the logistic regression model then takes the form given below, where R is a diagonal weighting matrix with elements Rnn = yn(1 − yn). Compare this with the corresponding solution for linear regression.

Probabilistic Discriminative Models (9)

The update formula takes the form of a weighted least-squares solution. Because the weighting matrix R is not constant but depends on the parameter vector w, there is no closed-form solution, unlike the least-squares solution to the linear regression problem. We apply the equations iteratively, each time using the new weight vector w to compute a revised weighting matrix R. For this reason the algorithm is known as iterative reweighted least squares, or IRLS (Rubin, 1983).
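A minimal NumPy sketch of IRLS as described above (iteration count, tolerance, and names are illustrative; for linearly separable data the Hessian can become ill-conditioned, as discussed earlier):

```python
import numpy as np

def logistic_irls(Phi, t, n_iter=25, tol=1e-8):
    """Iterative reweighted least squares for logistic regression.
    Phi: (N, M) design matrix; t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))    # y_n = sigma(w^T phi_n)
        r = y * (1.0 - y)                      # diagonal of the weighting matrix R
        grad = Phi.T @ (y - t)                 # gradient of the cross-entropy error
        H = Phi.T @ (Phi * r[:, None])         # Hessian  Phi^T R Phi
        step = np.linalg.solve(H, grad)        # Newton-Raphson step  H^{-1} grad
        w = w - step
        if np.max(np.abs(step)) < tol:         # stop when the update is negligible
            break
    return w
```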

Probabilistic Discriminative Models (10)

Probit regression: For a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities (for instance, if the class-conditional densities are modelled using Gaussian mixtures).

Probabilistic Discriminative Models (11)

The generalized linear model based on an inverse probit activation function is known as probit regression.

Inverse Probit function:
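Its standard form, the cumulative distribution function of a zero-mean, unit-variance Gaussian, is:

$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta\mid 0, 1)\, d\theta$$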

Results obtained by probit regression are usually similar to those of logistic regression. However, the two behave differently in the presence of outliers: the tails of the logistic sigmoid decay asymptotically like exp(−x), whereas those of the probit function decay like exp(−x²), which makes probit regression more sensitive to outliers.