Introduction to Machine Learning: Regression
Computer Science, Tel-Aviv University, 2014-15


Page 1

Introduction to Machine Learning: Regression

Computer Science, Tel-Aviv University, 2014-15


Page 2

Classification

Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)
- Other structures (e.g., strings, graphs, etc.)

Output: Y
- Discrete (0, 1, 2, ...)

Page 3

Regression

Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)

Output: Y
- Real-valued scalars or vectors over the reals.

Page 4

Examples: Regression

- Weight + height → cholesterol level
- Age + gender → time spent in front of the TV
- Past choices of a user → 'Netflix score'
- Profile of a job (user, machine, time) → memory usage of a submitted process

Page 5

Page 6

Page 7

Linear Regression

Input: a set of points (x_i, y_i).
- Assume there is a linear relation between y and x.

Page 8

Linear Regression

Input: a set of points (x_i, y_i).
- Assume there is a linear relation between y and x.
- Find a, b by solving the minimization below.
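The formula on the slide is only an image in this transcript; presumably it is the least-squares objective:

$$\min_{a,b} \; \sum_{i=1}^{n} \big(y_i - a \cdot x_i - b\big)^2$$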

Page 9

Regression: Minimize the Residuals

y = a·x + b

Page 10

Regression: Minimize the Residuals

r_i = y_i − a·x_i − b   (the residual of the point (x_i, y_i))

y = a·x + b

Page 11

Likelihood Formulation

Model assumptions:

Input:


Page 13

Likelihood Formulation

Model assumptions:

Input:

The likelihood is maximized when the sum of squared residuals is minimized.
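The model assumptions and the likelihood appear only as images on the slides; under the presumable Gaussian-noise model y_i = a·x_i + b + ε_i with ε_i ~ N(0, σ²), the log-likelihood is

$$\log L(a, b) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - a \cdot x_i - b\big)^2 - \frac{n}{2}\log\big(2\pi\sigma^2\big),$$

so maximizing the likelihood is exactly minimizing the sum of squared residuals.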

Page 14

Note:

We can add another variable x_{d+1} = 1 and set a_{d+1} = b. Therefore, without loss of generality, we can drop the explicit intercept b and fit y ≈ a·x alone.
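A one-line NumPy sketch of this trick (the matrix X below is made-up example data):

```python
import numpy as np

# Made-up design matrix with n = 3 samples and d = 2 features.
X = np.array([[0.5, 1.2],
              [1.0, 0.7],
              [1.5, 2.3]])

# Append the constant feature x_{d+1} = 1; its coefficient a_{d+1} plays the role of b.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
print(X_aug)
```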

Page 15

Matrix Notations
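The notation on this slide is an image; the standard setup it presumably introduces stacks the inputs as rows of a matrix and the outputs as a vector:

$$X = \begin{pmatrix} x_1^t \\ \vdots \\ x_n^t \end{pmatrix} \in \mathbb{R}^{n \times d}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^{n}, \qquad \sum_{i=1}^{n} (y_i - a \cdot x_i)^2 = \|y - Xa\|^2 .$$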

Page 16

The Normal Equations
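The equations themselves are an image on the slide; in the notation above they follow from setting the gradient of ‖y − Xa‖² to zero:

$$\nabla_a \|y - Xa\|^2 = 2X^t(Xa - y) = 0 \;\Longrightarrow\; X^t X\, a = X^t y \;\Longrightarrow\; a = (X^t X)^{-1} X^t y \quad (\text{when } X^t X \text{ is invertible}).$$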

Page 17

Page 18

Functions over n-dimensions

For a function f(x_1, …, x_n), the gradient is the vector of partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

In one dimension, this is simply the derivative.

Page 19

Gradient Descent

- Goal: Minimize a function f.
- Algorithm:

1. Start from a point.
2. Compute u = ∇f(x_1^i, …, x_n^i).
3. Update the current point by taking a step in the direction −u.
4. Return to (2), unless converged.
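A minimal gradient-descent sketch for the least-squares objective f(a) = ‖y − Xa‖², with an assumed fixed step size eta (the names X, y, eta, n_iters are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent_ls(X, y, eta=0.01, n_iters=1000):
    """Minimize ||y - X a||^2 by gradient descent."""
    a = np.zeros(X.shape[1])          # 1. start from a point
    for _ in range(n_iters):
        u = 2 * X.T @ (X @ a - y)     # 2. gradient of ||y - X a||^2
        a = a - eta * u               # 3. step against the gradient
    return a                          # 4. fixed iteration count stands in for a convergence test
```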

Page 20

Gradient Descent

Gradient Descent iteration:

Advantage: simple, efficient.

Page 21

Online Least Squares

Online update step:

Advantage: Efficient, similar to perceptron.
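The update step is shown only as an image; a plausible form is the standard LMS (Widrow-Hoff) rule, which processes one example (x_i, y_i) at a time with learning rate η:

$$a \leftarrow a + \eta \,(y_i - a \cdot x_i)\, x_i$$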

Page 22

Singularity issues

- Not very efficient, since we need to invert a matrix.
- The solution is unique if X^tX is invertible.
- If X^tX is singular, we have an infinite number of solutions to the equations. What is the solution minimizing ‖y − Xa‖²?
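A small illustration of the singular case (the data below is made up): when one column of X is a multiple of another, X^tX has no inverse, and a least-squares routine must pick among the infinitely many solutions.

```python
import numpy as np

# Second column is exactly twice the first -> collinear features.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))          # 1, so XtX is singular
# np.linalg.solve(XtX, X.T @ y)            # would raise LinAlgError
a, *_ = np.linalg.lstsq(X, y, rcond=None)  # returns the minimum-norm solution
print(a)
```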

Page 23

The Singular Case

Page 24

The Singular Case


Page 25

The Singular Case

Page 26

The Singular Case

Page 27

Risk of Overfitting

- Say we have a very large number of variables (d is large).
- When the number of variables is large, we usually have collinear variables, and therefore X^tX is singular.
- Even if X^tX is non-singular, there is a risk of over-fitting. For instance, if d = n we can get a perfect fit of the training data (zero residual).
- Intuitively, we want a small number of variables to explain y.

Page 28

Regularization

Let λ be a regularization parameter. Ideally, we need to solve the following, which is a hard problem (NP-hard).
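The objective is an image on the slide; given that it is described as NP-hard and is then relaxed to Lasso/Ridge, it is presumably the L0-penalized (subset-selection) problem:

$$\min_a \; \|y - Xa\|^2 + \lambda \,\#\{\, i : a_i \neq 0 \,\}$$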

Page 29

Shrinkage Methods

Lasso regression and Ridge regression, shown below.
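The two objectives appear as images on the slide; their standard penalized forms are:

$$\text{Lasso:}\quad \min_a \; \|y - Xa\|^2 + \lambda \sum_{i=1}^{d} |a_i| \qquad\qquad \text{Ridge:}\quad \min_a \; \|y - Xa\|^2 + \lambda \sum_{i=1}^{d} a_i^2$$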

Page 30

Ridge Regression

Page 31

Ridge Regression

Page 32

Ridge Regression

X^tX + λI is positive definite and therefore nonsingular.
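The derivation on these slides is in images; the resulting closed form is the usual ridge estimate:

$$\hat{a}_{\text{ridge}} = \big(X^t X + \lambda I\big)^{-1} X^t y$$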

Page 33

Ridge Regression – Bayesian View

Prior on a

Page 34

Ridge Regression – Bayesian View

Prior on a: each coordinate a_i ~ N(0, τ²), a zero-mean Gaussian.

Maximizing the posterior is equivalent to Ridge regression with λ = σ²/τ²:

$$\log \mathrm{Posterior}(a \mid \sigma, \tau, \mathrm{Data}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a \cdot x_i)^2 \;-\; \frac{1}{2\tau^2} \sum_{i=1}^{d} a_i^2 \;-\; \frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{d}{2}\log(2\pi\tau^2)$$

Page 35

Lasso Regression

Page 36

Lasso Regression

The above is equivalent to the following quadratic program:
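The quadratic program itself is an image; one standard way to write the Lasso objective as a QP splits each coefficient into its positive and negative parts:

$$\min_{a^+,\, a^- \ge 0} \; \|y - X(a^+ - a^-)\|^2 + \lambda \sum_{i=1}^{d} \big(a_i^+ + a_i^-\big)$$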

Page 37

Lasso Regression – Bayesian View

Prior on a

Page 38

Lasso Regression – Bayesian View

Prior on a: each coordinate a_i follows a zero-mean Laplace (double-exponential) distribution.

$$\log \mathrm{Posterior}(a \mid \mathrm{Data}, \sigma, \lambda) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a \cdot x_i)^2 \;-\; \frac{n}{2}\log(2\pi\sigma^2) \;+\; d\,\log\!\left(\frac{\lambda}{4\sigma^2}\right) \;-\; \frac{\lambda}{2\sigma^2} \sum_{i=1}^{d} |a_i|$$

Page 39

Lasso vs. Ridge

(Figure: Laplace vs. Normal priors, both with mean 0 and variance 1.)

Page 40

An Equivalent Formulation

Lasso and Ridge in constrained form (see below):

Claim: for every λ there is a λ′ that produces the same solution.
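The formulations on the slide are images; presumably they are the constrained versions of the two problems:

$$\text{Lasso:}\quad \min_a \|y - Xa\|^2 \ \text{ s.t. } \sum_{i=1}^{d} |a_i| \le \lambda' \qquad\qquad \text{Ridge:}\quad \min_a \|y - Xa\|^2 \ \text{ s.t. } \sum_{i=1}^{d} a_i^2 \le \lambda'$$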

Page 41

Page 42

Page 43

Breaking Linearity

This can be solved using the usual linear regression machinery by plugging in a design matrix X whose columns are the transformed (non-linear) features.
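A minimal sketch of this idea (the degree-2 polynomial basis and the data here are illustrative choices, not taken from the slides): build the expanded design matrix and reuse the normal equations.

```python
import numpy as np

def poly_design(x, degree=2):
    """Map a 1-D input x to the features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 5.0, 10.0])       # made-up, roughly quadratic data

X = poly_design(x, degree=2)
a = np.linalg.solve(X.T @ X, X.T @ y)     # ordinary least squares on the new features
print(a)
```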

Page 44

Regression for Classification

Input: X
- Real-valued scalars or vectors over the reals.
- Discrete values (0, 1, 2, ...)
- Other structures (e.g., strings, graphs, etc.)

Output: Y
- Discrete (0 or 1)

We treat the probability Pr(Y=1|X) as a linear function of X. Problem: Pr(Y=1|X) should be bounded in [0,1].

Page 45

Logistic Regression

Model:
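The slide shows the model as a formula and a plot of the sigmoid, both images here; its standard form is

$$\Pr(Y = 1 \mid X = x) \;=\; \frac{1}{1 + e^{-a \cdot x}},$$

which is bounded in [0,1] as required.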


Page 46

Logistic Regression

Given training data, we can write down the likelihood; its log form is shown below.

There is a unique solution, which can be found using gradient descent.
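The likelihood itself is an image on the slide; the standard log-likelihood of the logistic model is

$$\ell(a) \;=\; \sum_{i=1}^{n} \Big[\, y_i\,(a \cdot x_i) \;-\; \log\big(1 + e^{\,a \cdot x_i}\big) \Big].$$

The first term is linear in a and the second is concave in a (this is what the "linear" and "concave" labels on the slide refer to), so ℓ is concave and gradient-based optimization reaches its maximum.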