Advanced Machine Learning Lecture 3 - Computer Vision

Perceptual and Sensory Augmented Computing, Advanced Machine Learning, Winter '15

Advanced Machine Learning Lecture 3: Linear Regression II
02.11.2015, Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/, [email protected]


Page 1: Advanced Machine Learning Lecture 3 - Computer Vision


Advanced Machine Learning

Lecture 3

Linear Regression II

02.11.2015

Bastian Leibe

RWTH Aachen

http://www.vision.rwth-aachen.de/

[email protected]


Page 2: Advanced Machine Learning Lecture 3 - Computer Vision


This Lecture: Advanced Machine Learning

• Regression Approaches

Linear Regression

Regularization (Ridge, Lasso)

Support Vector Regression

Gaussian Processes

• Learning with Latent Variables

EM and Generalizations

Dirichlet Processes

• Structured Output Learning

Large-margin Learning

B. Leibe

Page 3: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

3 B. Leibe

Page 4: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Probabilistic Regression

• First assumption:

Our target function values t are generated by adding noise to the ideal function estimate:

t = y(x, w) + ε

(t: target function value, y: regression function, x: input value, w: weights or parameters, ε: noise)

• Second assumption:

The noise is Gaussian distributed:

p(t | x, w, β) = N(t | y(x, w), β^{-1})

(mean y(x, w), variance β^{-1}, where β denotes the precision)

4 B. Leibe

Slide adapted from Bernt Schiele

Page 5: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Probabilistic Regression

• Given

Training data points: X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}

Associated function values: t = [t_1, \dots, t_n]^T

• Conditional likelihood (assuming i.i.d. data)

p(t | X, w, β) = \prod_{i=1}^{n} N(t_i | y(x_i, w), β^{-1}), with the generalized linear regression function y(x, w) = w^T φ(x)

Maximize w.r.t. w, β

5 B. Leibe

Slide adapted from Bernt Schiele

Page 6: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Maximum Likelihood Regression

• Setting the gradient to zero:

Least-squares regression is equivalent to Maximum Likelihood

under the assumption of Gaussian noise.

6 B. Leibe

Same as in least-squares

regression!

Slide adapted from Bernt Schiele

Φ = [φ(x_1), \dots, φ(x_n)]
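As a concrete illustration, here is a minimal NumPy sketch (not from the slides; data, basis choice, and variable names are illustrative) of the maximum-likelihood fit: build a design matrix, solve the least-squares problem for w_ML, and estimate the noise precision β_ML from the residual variance.

# Minimal sketch: ML fit of a generalized linear regression model under Gaussian noise.
import numpy as np

def polynomial_design_matrix(x, M):
    """Rows [phi_0(x_n), ..., phi_{M-1}(x_n)] with phi_j(x) = x^j."""
    return np.vander(x, M, increasing=True)          # shape (N, M)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy targets

Phi = polynomial_design_matrix(x, M=4)

# w_ML solves the normal equations Phi^T Phi w = Phi^T t (Moore-Penrose pseudo-inverse).
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the precision beta: its inverse is the residual variance (next slide).
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
print(w_ml, beta_ml)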

Page 7: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Role of the Precision Parameter

• Also use ML to determine the precision parameter β.

• Setting the gradient w.r.t. β to zero yields

\frac{1}{β_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - w_{ML}^T φ(x_n)\}^2

The inverse of the noise precision is given by the residual variance of the target values around the regression function.

7

B. Leibe

Page 8: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Predictive Distribution

• Having determined the parameters w and β, we can now make predictions for new values of x.

• This means

p(t | x, w_{ML}, β_{ML}) = N(t | y(x, w_{ML}), β_{ML}^{-1})

Rather than giving a point

estimate, we can now also

give an estimate of the

estimation uncertainty.

8 B. Leibe Image source: C.M. Bishop, 2006

Page 9: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Maximum-A-Posteriori Estimation

• Introduce a prior distribution over the coefficients w.

For simplicity, assume a zero-mean Gaussian distribution:

p(w | α) = N(w | 0, α^{-1} I)

The new hyperparameter α controls the distribution of the model parameters.

• Express the posterior distribution over w.

Using Bayes' theorem: p(w | X, t) ∝ p(t | X, w, β) p(w | α)

We can now determine w by maximizing the posterior.

This technique is called maximum-a-posteriori (MAP). 9

B. Leibe

Page 10: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: MAP Solution

• Minimize the negative logarithm of the posterior

-\ln p(w | X, t) = \frac{β}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2 + \frac{α}{2} w^T w + \text{const}

• The MAP solution is therefore the minimizer of

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2 + \frac{λ}{2} \|w\|^2

Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error (with λ = α/β).

10

B. Leibe

Page 11: Advanced Machine Learning Lecture 3 - Computer Vision


MAP Solution (2)

• Setting the gradient to zero:

B. Leibe

Φ = [φ(x_1), \dots, φ(x_n)]

11

Effect of regularization:

Keeps the inverse well-conditioned
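A minimal sketch of this closed-form solution, assuming the standard ridge-regression form w = (λI + Φ^T Φ)^{-1} Φ^T t with Φ the design matrix and λ = α/β in the MAP view; function and variable names are illustrative.

# Sketch: adding lambda * I to Phi^T Phi before solving keeps the system
# well-conditioned even for near-collinear basis functions.
import numpy as np

def fit_ridge(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t (regularized least squares)."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# Usage with the Phi, t from the earlier sketch:
# w_map = fit_ridge(Phi, t, lam=1e-3)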

Page 12: Advanced Machine Learning Lecture 3 - Computer Vision


Recap: Bayesian Curve Fitting

• Given

Training data points: X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}

Associated function values: t = [t_1, \dots, t_n]^T

Our goal is to predict the value of t for a new point x.

• Evaluate the predictive distribution

p(t | x, X, t) = \int p(t | x, w) p(w | X, t)\, dw

Noise distribution – again assume a Gaussian here. The posterior p(w | X, t) is what we just computed for MAP.

Assume that the parameters α and β are fixed and known for now.

12 B. Leibe

Page 13: Advanced Machine Learning Lecture 3 - Computer Vision


Bayesian Curve Fitting

• Under those assumptions, the predictive distribution is a Gaussian and can be evaluated analytically:

p(t | x, X, t) = N(t | m(x), s^2(x))

where the mean and variance are given by

m(x) = β φ(x)^T S \sum_{n=1}^{N} φ(x_n) t_n, \qquad s^2(x) = β^{-1} + φ(x)^T S φ(x)

and S is the regularized covariance matrix

S^{-1} = α I + β \sum_{n=1}^{N} φ(x_n) φ(x_n)^T

13 B. Leibe Image source: C.M. Bishop, 2006
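A small sketch of these predictive equations, assuming the Bishop-style formulas above; names are illustrative, and S is formed by explicit inversion for clarity rather than efficiency.

# Sketch: Bayesian predictive mean and variance for curve fitting.
import numpy as np

def predictive(phi_train, t, phi_test, alpha, beta):
    """Return predictive mean m(x) and variance s^2(x) for each test feature vector."""
    M = phi_train.shape[1]
    # S^-1 = alpha * I + beta * sum_n phi(x_n) phi(x_n)^T
    S = np.linalg.inv(alpha * np.eye(M) + beta * phi_train.T @ phi_train)
    mean = beta * phi_test @ S @ (phi_train.T @ t)
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_test, S, phi_test)
    return mean, var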

Page 14: Advanced Machine Learning Lecture 3 - Computer Vision


Analyzing the result

• Analyzing the variance of the predictive distribution

s^2(x) = β^{-1} + φ(x)^T S φ(x)

The β^{-1} term is the uncertainty in the predicted value due to noise on the target variables (expressed already in ML).

The φ(x)^T S φ(x) term is the uncertainty in the parameters w (a consequence of the Bayesian treatment).

14 B. Leibe

Page 15: Advanced Machine Learning Lecture 3 - Computer Vision


Bayesian Predictive Distribution

• Important difference to previous example

Uncertainty may vary with test point x!

15 B. Leibe Image source: C.M. Bishop, 2006

Page 16: Advanced Machine Learning Lecture 3 - Computer Vision


Discussion

• We now have a better understanding of regression

Least-squares regression: Assumption of Gaussian noise

We can now also plug in different noise models and explore how

they affect the error function.

L2 regularization as a Gaussian prior on parameters w.

We can now also use different regularizers and explore what

they mean.

This lecture…

General formulation with basis functions Á(x).

We can now also use different basis functions.

16 B. Leibe

Page 17: Advanced Machine Learning Lecture 3 - Computer Vision


Discussion

• General regression formulation

In principle, we can perform regression in arbitrary spaces

and with many different types of basis functions

However, there is a caveat… Can you see what it is?

• Example: Polynomial curve fitting, M = 3

Number of coefficients grows as D^M!

The approach quickly becomes impractical for high dimensions.

This is known as the curse of dimensionality.

We will encounter some ways to deal with this later... 17

B. Leibe
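A quick illustrative calculation (not from the slides) of this growth: the number of monomials of total degree up to M in D input dimensions is C(D + M, M), which grows roughly like D^M for fixed M.

# Count polynomial coefficients for a cubic (M = 3) model in D dimensions.
from math import comb

for D in (1, 3, 10, 100):
    print(D, comb(D + 3, 3))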

Page 18: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

18 B. Leibe

Page 19: Advanced Machine Learning Lecture 3 - Computer Vision


Loss Functions for Regression

• Given p(y | x, w, β), how do we actually estimate a function value y_t for a new point x_t?

• We need a loss function, just as in the classification case

• Optimal prediction: Minimize the expected loss

E[L] = \int\int L(t, y(x))\, p(x, t)\, dx\, dt

19 B. Leibe Slide adapted from Stefan Roth

Page 20: Advanced Machine Learning Lecture 3 - Computer Vision


Loss Functions for Regression

• Simplest case

Squared loss: L(t, y(x)) = \{y(x) - t\}^2

Expected loss: E[L] = \int\int \{y(x) - t\}^2 p(x, t)\, dx\, dt

Setting the derivative to zero:

\frac{\partial E[L]}{\partial y(x)} = 2 \int \{y(x) - t\}\, p(x, t)\, dt \overset{!}{=} 0

\Leftrightarrow \int t\, p(x, t)\, dt = y(x) \int p(x, t)\, dt

20 B. Leibe Slide adapted from Stefan Roth

Page 21: Advanced Machine Learning Lecture 3 - Computer Vision


Loss Functions for Regression

• Important result

Under Squared loss, the optimal regression function is the

mean E [t|x] of the posterior p(t|x).

Also called mean prediction.

For our generalized linear regression function and squared loss, we obtain as a result:

\int t\, p(x, t)\, dt = y(x) \int p(x, t)\, dt

\Leftrightarrow y(x) = \int t\, \frac{p(x, t)}{p(x)}\, dt = \int t\, p(t | x)\, dt

\Leftrightarrow y(x) = E[t | x]

21 B. Leibe Slide adapted from Stefan Roth

Page 22: Advanced Machine Learning Lecture 3 - Computer Vision


Visualization of Mean Prediction

22 B. Leibe

mean prediction

Slide adapted from Stefan Roth Image source: C.M. Bishop, 2006

Page 23: Advanced Machine Learning Lecture 3 - Computer Vision


Loss Functions for Regression

• Different derivation: Expand the square term as follows

\{y(x) - t\}^2 = \{y(x) - E[t|x] + E[t|x] - t\}^2
= \{y(x) - E[t|x]\}^2 + \{E[t|x] - t\}^2 + 2\{y(x) - E[t|x]\}\{E[t|x] - t\}

• Substituting into the loss function

The cross-term vanishes, and we end up with

E[L] = \int \{y(x) - E[t|x]\}^2 p(x)\, dx + \int\int \{E[t|x] - t\}^2 p(x, t)\, dx\, dt

The first term is minimized by the optimal least-squares predictor, given by the conditional mean. The second term is the intrinsic variability of the target data, the irreducible minimum value of the loss function.

23 B. Leibe

Page 24: Advanced Machine Learning Lecture 3 - Computer Vision


Other Loss Functions

• The squared loss is not the only possible choice

Poor choice when conditional distribution p(t|x) is multimodal.

• Simple generalization: Minkowski loss

L_q = |y(x) - t|^q

Expectation

E[L_q] = \int\int |y(x) - t|^q p(x, t)\, dx\, dt

• Minimum of E[L_q] is given by

Conditional mean for q = 2,

Conditional median for q = 1,

Conditional mode for q = 0.

24 B. Leibe
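A small numerical sketch of these minimizers (an assumed setup, not from the slides): for samples from a skewed conditional distribution, the constant prediction that minimizes the empirical Minkowski loss is close to the sample mean for q = 2 and close to the sample median for q = 1.

# Compare the minimizers of the empirical Minkowski loss for q = 2 and q = 1.
import numpy as np

rng = np.random.default_rng(1)
t = rng.lognormal(mean=0.0, sigma=0.75, size=10_000)      # skewed "conditional" samples

ys = np.linspace(t.min(), t.max(), 2_000)
best_q2 = ys[np.argmin([(np.abs(y - t) ** 2).mean() for y in ys])]
best_q1 = ys[np.argmin([(np.abs(y - t) ** 1).mean() for y in ys])]

print(best_q2, t.mean())       # q = 2 minimizer is close to the mean
print(best_q1, np.median(t))   # q = 1 minimizer is close to the median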

Page 25: Advanced Machine Learning Lecture 3 - Computer Vision


Minkowski Loss Functions

25 B. Leibe Image source: C.M. Bishop, 2006

Page 26: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

26 B. Leibe

Page 27: Advanced Machine Learning Lecture 3 - Computer Vision


Linear Basis Function Models

• Generally, we consider models of the following form

y(x, w) = \sum_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)

where the φ_j(x) are known as basis functions.

Typically, φ_0(x) = 1, so that w_0 acts as a bias.

In the simplest case, we use linear basis functions: φ_d(x) = x_d.

• Let’s take a look at some other possible basis

functions...

27

B. Leibe Slide adapted from C.M. Bishop, 2006

Page 28: Advanced Machine Learning Lecture 3 - Computer Vision


Linear Basis Function Models (2)

• Polynomial basis functions

φ_j(x) = x^j

• Properties

Global

A small change in x affects all

basis functions.

28 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 29: Advanced Machine Learning Lecture 3 - Computer Vision


Linear Basis Function Models (3)

• Gaussian basis functions

φ_j(x) = \exp\left(-\frac{(x - μ_j)^2}{2 s^2}\right)

• Properties

Local

A small change in x affects

only nearby basis functions.

μ_j and s control location and

scale (width).

29 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 30: Advanced Machine Learning Lecture 3 - Computer Vision


Linear Basis Function Models (4)

• Sigmoid basis functions

φ_j(x) = σ\left(\frac{x - μ_j}{s}\right)

where

σ(a) = \frac{1}{1 + \exp(-a)}

• Properties

Local

A small change in x affects

only nearby basis functions.

μ_j and s control location and

scale (slope).

30 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006
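For reference, a short sketch of the three basis-function families just discussed, for scalar x; parameter names are illustrative.

# Polynomial (global), Gaussian (local bump), and sigmoid (smooth step) basis functions.
import numpy as np

def polynomial_basis(x, j):
    return x ** j

def gaussian_basis(x, mu_j, s):
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))     # local bump around mu_j

def sigmoid_basis(x, mu_j, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))        # smooth step centred at mu_j

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 2), gaussian_basis(x, 0.0, 0.3), sigmoid_basis(x, 0.0, 0.1))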

Page 31: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

31 B. Leibe

Page 32: Advanced Machine Learning Lecture 3 - Computer Vision


Multiple Outputs

• Multiple Output Formulation

So far only considered the case of a single target variable t.

We may wish to predict K > 1 target variables in a vector t.

We can write this in matrix form: y(x, W) = W^T φ(x)

where W is an M × K matrix of parameters and φ(x) = [φ_0(x), \dots, φ_{M-1}(x)]^T.

32 B. Leibe

Page 33: Advanced Machine Learning Lecture 3 - Computer Vision


Multiple Outputs (2)

• Analogously to the single output case, we have

p(t | x, W, β) = N(t | W^T φ(x), β^{-1} I)

• Given observed inputs X and targets T, we obtain the log likelihood function

33 B. Leibe Slide adapted from C.M. Bishop, 2006

Page 34: Advanced Machine Learning Lecture 3 - Computer Vision


Multiple Outputs (3)

• Maximizing with respect to W, we obtain the same normal-equation solution as before, now applied to the whole target matrix T.

• If we consider a single target variable, t_k, we see that the solution decouples into an independent fit for each output, which is identical with the single output case.

34 B. Leibe Slide adapted from C.M. Bishop, 2006
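A small sketch illustrating the decoupling, assuming the standard design-matrix form W_ML = (Φ^T Φ)^{-1} Φ^T T (data and names are illustrative): fitting all K outputs at once gives the same columns as K independent single-output fits.

# Multi-output least squares: one solve with a target matrix T equals K single-output fits.
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 6))            # design matrix, rows phi(x_n)^T
T = rng.normal(size=(50, 3))              # K = 3 target variables per sample

W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)            # shape (6, 3)
w_0, *_ = np.linalg.lstsq(Phi, T[:, 0], rcond=None)       # single-output fit for t_0
print(np.allclose(W_ml[:, 0], w_0))                        # True: the solution decouples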

Page 35: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

35 B. Leibe

Page 36: Advanced Machine Learning Lecture 3 - Computer Vision


Sequential Learning

• Up to now, we have mainly considered batch methods

All data was used at the same time

Instead, we can also consider data items one at a time

(a.k.a. online learning)

• Stochastic (sequential) gradient descent:

w^{(τ+1)} = w^{(τ)} - η ∇E_n = w^{(τ)} + η (t_n - w^{(τ)T} φ(x_n)) φ(x_n)

• This is known as the least-mean-squares (LMS)

algorithm.

• Issue: how to choose the learning rate η?

36 B. Leibe Slide adapted from C.M. Bishop, 2006
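A minimal sketch of the LMS update, assuming the form w ← w + η(t_n − w^T φ(x_n)) φ(x_n) shown above; the random visiting order and all names are illustrative.

# LMS: process one (phi_n, t_n) pair at a time, stepping along that example's negative gradient.
import numpy as np

def lms(Phi, t, eta=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):          # visit examples in random order
            error = t[n] - w @ Phi[n]
            w += eta * error * Phi[n]              # w <- w + eta * (t_n - w^T phi_n) phi_n
    return w

# Usage: w_lms = lms(Phi, t). Too large an eta diverges; too small converges slowly.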

Page 37: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

37 B. Leibe

Page 38: Advanced Machine Learning Lecture 3 - Computer Vision


Regularization Revisited

• Consider the error function

E_D(w) + λ E_W(w)    (data term + regularization term)

where λ is called the regularization coefficient.

• With the sum-of-squares error function and a quadratic regularizer, we get

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2 + \frac{λ}{2} w^T w

• which is minimized by

w = (λ I + Φ^T Φ)^{-1} Φ^T t

38 B. Leibe Slide adapted from C.M. Bishop, 2006

Page 39: Advanced Machine Learning Lecture 3 - Computer Vision


Regularized Least-Squares

• Let's look at more general regularizers

\frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2 + \frac{λ}{2} \sum_{j=1}^{M} |w_j|^q

• "L_q norms": the case q = 1 is the "Lasso", q = 2 is "Ridge Regression".

39 B. Leibe

Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 40: Advanced Machine Learning Lecture 3 - Computer Vision


Recall: Lagrange Multipliers

40 B. Leibe

Page 41: Advanced Machine Learning Lecture 3 - Computer Vision


Regularized Least-Squares

• We want to minimize

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2 + \frac{λ}{2} \sum_{j=1}^{M} |w_j|^q

• This is equivalent to minimizing the unregularized sum-of-squares error

\frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2

subject to the constraint

\sum_{j=1}^{M} |w_j|^q \leq η

(for some suitably chosen η)

41 B. Leibe

Page 42: Advanced Machine Learning Lecture 3 - Computer Vision


Regularized Least-Squares

• Effect: Sparsity for q ≤ 1.

Minimization tends to set many coefficients to zero

• Why is this good?

• Why don’t we always do it, then? Any problems? 42

B. Leibe Image source: C.M. Bishop, 2006

Page 43: Advanced Machine Learning Lecture 3 - Computer Vision


The Lasso

• Consider the following regressor

\hat{w}_{lasso} = \arg\min_w \sum_{n=1}^{N} \{t_n - w^T φ(x_n)\}^2  subject to  \sum_{j=1}^{M} |w_j| \leq η

This formulation is known as the Lasso.

• Properties

L1 regularization ⇒ the solution will be sparse (only few coefficients will be non-zero).

The L1 penalty makes the problem non-linear: there is no closed-form solution.

Need to solve a quadratic programming problem. However, efficient algorithms are available with the same computational cost as for ridge regression.

43 B. Leibe Image source: C.M. Bishop, 2006
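A small sketch of one such efficient algorithm, coordinate descent with soft-thresholding, applied to the equivalent penalized objective 0.5‖t − Φw‖² + λ‖w‖₁ (an assumed formulation; all names are illustrative). It produces exactly-zero coefficients.

# Lasso by coordinate descent with soft-thresholding.
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(Phi, t, lam, n_iter=200):
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = (Phi ** 2).sum(axis=0)                        # Phi_j^T Phi_j for each column
    for _ in range(n_iter):
        for j in range(M):
            residual = t - Phi @ w + Phi[:, j] * w[j]      # leave coordinate j out
            rho = Phi[:, j] @ residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Usage: w_lasso = lasso_cd(Phi, t, lam=0.1); many entries of w_lasso end up exactly zero.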

Page 44: Advanced Machine Learning Lecture 3 - Computer Vision


Lasso as Bayes Estimation

• Interpretation as Bayes Estimation

We can think of |w_j|^q as the log-prior density for w_j.

• Prior for Lasso (q = 1): Laplacian distribution

p(w_j) = \frac{1}{2τ} \exp(-|w_j| / τ)  with  τ = \frac{1}{λ}

44 B. Leibe Image source: Friedman, Hastie, Tibshirani, 2009

Page 45: Advanced Machine Learning Lecture 3 - Computer Vision


Analysis

• Equicontours of the prior distribution

• Analysis

For q ≤ 1, the prior is not uniform in direction, but

concentrates more mass on the coordinate directions.

The case q = 1 (lasso) is the smallest q such that the constraint

region is convex.

Non-convexity makes the optimization problem more difficult.

Limit for q = 0: the regularization term becomes \sum_{j=1}^{M} 1 = M.

This is known as Best Subset Selection.

45

B. Leibe Image source: Friedman, Hastie, Tibshirani, 2009

Page 46: Advanced Machine Learning Lecture 3 - Computer Vision


Discussion

• Bayesian analysis

Lasso, Ridge regression and Best Subset Selection are Bayes

estimates with different priors.

However, they are derived as maximizers of the posterior.

Should ideally use the posterior mean as the Bayes estimate!

Ridge regression solution is also the posterior mean, but Lasso

and Best Subset Selection are not.

• We might also try using other values of q besides 0,1,2…

However, experience shows that this is not worth the effort.

Values of q ∈ (1, 2) are a compromise between lasso and ridge.

However, |w_j|^q with q > 1 is differentiable at 0,

so it loses the ability of the lasso to set coefficients exactly to zero.

46 B. Leibe

Page 47: Advanced Machine Learning Lecture 3 - Computer Vision


Topics of This Lecture

• Recap: Probabilistic View on Regression

• Properties of Linear Regression

Loss functions for regression

Basis functions

Multiple Outputs

Sequential Estimation

• Regularization revisited

Regularized Least-squares

The Lasso

Discussion

• Bias-Variance Decomposition

47 B. Leibe

Page 48: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Recall the expected squared loss,

E[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \int\int \{h(x) - t\}^2 p(x, t)\, dx\, dt

where

h(x) = E[t | x] = \int t\, p(t | x)\, dt

• The second term of E[L] corresponds to the noise

inherent in the random variable t.

• What about the first term?

48 B. Leibe Slide adapted from C.M. Bishop, 2006

Page 49: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Suppose we were given multiple data sets, each of size

N. Any particular data set D will give a particular

function y(x; D). We then have

\{y(x; D) - h(x)\}^2 = \{y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)\}^2

• Taking the expectation over D yields

E_D[\{y(x; D) - h(x)\}^2] = \{E_D[y(x; D)] - h(x)\}^2 + E_D[\{y(x; D) - E_D[y(x; D)]\}^2]

49 B. Leibe Slide adapted from C.M. Bishop, 2006

Page 50: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Thus we can write

expected loss = (bias)^2 + variance + noise

where

(bias)^2 = \int \{E_D[y(x; D)] - h(x)\}^2 p(x)\, dx

variance = \int E_D[\{y(x; D) - E_D[y(x; D)]\}^2] p(x)\, dx

noise = \int\int \{h(x) - t\}^2 p(x, t)\, dx\, dt

50 B. Leibe Slide adapted from C.M. Bishop, 2006

Page 51: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Example

25 data sets from the sinusoidal, varying the degree of

regularization, λ.

51 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006
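A sketch of this experiment under assumed settings (Gaussian basis functions, ridge fits, noise level 0.3; all names are illustrative): 25 data sets drawn from the sinusoid, fitted for several values of λ, with the empirical bias² and variance of the resulting ensemble of curves.

# Bias-variance experiment: over-regularized fits have high bias, under-regularized high variance.
import numpy as np

rng = np.random.default_rng(3)
centres = np.linspace(0, 1, 24)

def features(x, s=0.1):
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

def h(x):                                    # the true (noise-free) function
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 100)
for lam in (1e-3, 1.0, 100.0):               # under-, medium-, over-regularized
    preds = []
    for _ in range(25):                      # 25 independent data sets of size N = 25
        x = rng.uniform(0, 1, 25)
        t = h(x) + rng.normal(scale=0.3, size=25)
        Phi = features(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(features(x_test) @ w)
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - h(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"lambda={lam:7.3f}  bias^2={bias2:.4f}  variance={variance:.4f}")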

Page 52: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Example

25 data sets from the sinusoidal, varying the degree of

regularization, λ.

52 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 53: Advanced Machine Learning Lecture 3 - Computer Vision


Bias-Variance Decomposition

• Example

25 data sets from the sinusoidal, varying the degree of

regularization, λ.

53 B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 54: Advanced Machine Learning Lecture 3 - Computer Vision


The Bias-Variance Trade-Off

• Result from these plots

An over-regularized model (large λ) will have a high bias.

An under-regularized model (small λ) will have a high variance.

• We can compute an estimate for the generalization

capability this way (magenta curve)!

Can you see where the problem is with this?

Computation is based on average w.r.t. ensembles of data sets.

Unfortunately of little practical value…

54

B. Leibe Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Page 55: Advanced Machine Learning Lecture 3 - Computer Vision


References and Further Reading

• More information on linear regression, including a discussion of regularization, can be found in Sections 1.5.5 and 3.1-3.2 of the Bishop book.

• Additional information on the Lasso, including efficient

algorithms to solve it, can be found in Chapter 3.4 of the

Hastie book.

B. Leibe 55

Christopher M. Bishop

Pattern Recognition and Machine Learning

Springer, 2006

T. Hastie, R. Tibshirani, J. Friedman

Elements of Statistical Learning

2nd edition, Springer, 2009