
Regression - Machine Learning and Pattern Recognition

Chris Williams

School of Informatics, University of Edinburgh

September 2014

(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)


Classification or Regression?

- Classification: want to learn a discrete target variable
- Regression: want to learn a continuous target variable
  - Linear regression, linear-in-the-parameters models
  - Linear regression is a conditional Gaussian model
  - Maximum likelihood solution - ordinary least squares
  - Can use nonlinear basis functions
  - Ridge regression
  - Full Bayesian treatment
- Reading: Murphy chapter 7 (not all sections needed), Barber (17.1, 17.2, 18.1.1)


One Dimensional Data

[Figure: scatter plot of one-dimensional training data; x axis roughly −2 to 3, y axis roughly 0 to 2.5]


Linear Regression

- Simple example: one-dimensional linear regression.
- Suppose we have data of the form (x, y), and we believe the data should follow a straight line, i.e. have a fit of the form y = w0 + w1 x.
- However, we also believe the target values y are subject to measurement error, which we will assume to be Gaussian. So y = w0 + w1 x + η, where η is a Gaussian noise term with mean 0 and variance σ²_η.
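To make the model concrete, here is a minimal Python/NumPy sketch (not from the slides; the parameter values are hypothetical) of generating data from y = w0 + w1 x + η with Gaussian noise, much like the "Generated Data" plots that follow.

```python
# Minimal sketch, assuming illustrative parameter values (not from the lecture).
import numpy as np

rng = np.random.default_rng(0)
w0, w1 = 0.5, 0.6      # hypothetical "true" intercept and slope
sigma_eta = 0.2        # noise standard deviation

x = rng.uniform(-2.0, 3.0, size=30)                          # inputs, roughly the plotted range
y = w0 + w1 * x + rng.normal(0.0, sigma_eta, size=x.shape)   # noisy targets
```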


Figure credit: http://jedismedicine.blogspot.co.uk/2014/01/

- Linear regression is just a conditional version of estimating a Gaussian (conditional on the input x)


Generated Data

[Figure: generated data; x axis roughly −2 to 3, y axis roughly 0 to 2.5]


Multivariate Case

- Consider the case where we are interested in y = f(x) for D-dimensional x: y = w0 + w1 x1 + ... + wD xD + η, where η ∼ N(0, σ²_η).
- Examples? Final grade depends on time spent on work for each tutorial.
- We set w = (w0, w1, ..., wD)^T and introduce φ = (1, x^T)^T; then we can write y = w^T φ + η instead.
- This implies p(y|φ, w) = N(y; w^T φ, σ²_η).
- Assume that the training data are iid, i.e. p(y¹, ..., y^N | x¹, ..., x^N, w) = ∏_{n=1}^N p(y^n | x^n, w).
- Given data {(x^n, y^n), n = 1, 2, ..., N}, the log likelihood is

L(w) = \log p(y^1, \ldots, y^N | x^1, \ldots, x^N, w) = -\frac{1}{2\sigma_\eta^2} \sum_{n=1}^N (y^n - w^T \phi^n)^2 - \frac{N}{2}\log(2\pi\sigma_\eta^2)
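A minimal NumPy sketch (assumed implementation, not part of the slides) of evaluating this log likelihood for a given weight vector:

```python
import numpy as np

def log_likelihood(w, Phi, y, sigma_eta):
    """L(w) = -1/(2 sigma_eta^2) * sum_n (y^n - w^T phi^n)^2 - N/2 * log(2 pi sigma_eta^2).

    Phi: (N, D+1) design matrix with one feature vector phi^n per row.
    y:   (N,) vector of targets.
    """
    N = len(y)
    residuals = y - Phi @ w
    return (-0.5 / sigma_eta**2) * np.sum(residuals**2) \
           - 0.5 * N * np.log(2 * np.pi * sigma_eta**2)
```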


Minimizing Squared Error

L(w) = -\frac{1}{2\sigma_\eta^2} \sum_{n=1}^N (y^n - w^T \phi^n)^2 - \frac{N}{2}\log(2\pi\sigma_\eta^2) = -C_1 \sum_{n=1}^N (y^n - w^T \phi^n)^2 - C_2

where C_1 > 0 and C_2 do not depend on w. Now:

- Multiplying by a positive constant doesn't change the maximum.
- Adding a constant doesn't change the maximum.
- \sum_{n=1}^N (y^n - w^T \phi^n)^2 is the sum of squared errors made if you use w.

So maximizing the likelihood is the same as minimizing the total squared error of the linear predictor.

So you don't have to believe the Gaussian assumption. You can simply believe that you want to minimize the squared error.


Maximum Likelihood Solution I

- Write Φ = (φ¹, φ², ..., φ^N)^T and y = (y¹, y², ..., y^N)^T.
- Φ is called the design matrix; it has N rows, one for each example.

L(w) = -\frac{1}{2\sigma_\eta^2} (y - \Phi w)^T (y - \Phi w) - C_2

- Take derivatives of the log likelihood:

\nabla_w L(w) = -\frac{1}{\sigma_\eta^2} \Phi^T (\Phi w - y)


Maximum Likelihood Solution II

- Setting the derivatives to zero to find the maximum of the log likelihood (equivalently the minimum of the squared error) gives

Φ^T Φ w = Φ^T y

- This means the maximum likelihood w is given by

w = (Φ^T Φ)^(-1) Φ^T y

The matrix (Φ^T Φ)^(-1) Φ^T is called the pseudo-inverse.

- This is the ordinary least squares (OLS) solution for w.
- MLE for the variance:

\sigma_\eta^2 = \frac{1}{N} \sum_{n=1}^N (y^n - w^T \phi^n)^2

i.e. the average of the squared residuals
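A minimal NumPy sketch (assumed implementation) of the OLS/maximum-likelihood fit, where Phi is assumed to already contain a leading column of ones for the bias weight w0. It uses np.linalg.lstsq rather than forming the pseudo-inverse explicitly, which is numerically safer but solves the same normal equations:

```python
import numpy as np

def fit_ols(Phi, y):
    # Solve Phi^T Phi w = Phi^T y via least squares.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residuals = y - Phi @ w
    sigma2_eta = np.mean(residuals**2)   # ML estimate: average squared residual
    return w, sigma2_eta
```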


Generated Data

[Figure: generated data with fitted regression line; x axis roughly −2 to 3, y axis roughly −0.5 to 2.5]

The black line is the maximum likelihood fit to the data.


Nonlinear regression

- All this just used φ.
- We chose to put the x values in φ, but we could have put anything in there, including nonlinear transformations of the x values.
- In fact we can choose any useful form for φ, so long as the model remains linear in the parameters w. We can even change the size of φ.
- We already have the maximum likelihood solution in the case of Gaussian noise: the pseudo-inverse solution.
- Models of this form are called general linear models or linear-in-the-parameters models.


Example: polynomial fitting

- Model y = w1 + w2 x + w3 x² + w4 x³.
- Set φ = (1, x, x², x³)^T and w = (w1, w2, w3, w4)^T.
- Can immediately write down the ML solution: w = (Φ^T Φ)^(-1) Φ^T y, where Φ and y are defined as before.
- Could use any features we want: e.g. features that are only active in certain local regions (radial basis functions, RBFs). See the sketch below.
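A minimal sketch (assumed helper functions, not from the slides) of building the design matrix Φ for a cubic polynomial or for a set of RBF features; either design matrix can be passed to the same pseudo-inverse/OLS solution, since the model is still linear in w:

```python
import numpy as np

def poly_design(x, degree=3):
    # Rows are phi(x) = (1, x, x^2, ..., x^degree)
    return np.vander(x, N=degree + 1, increasing=True)

def rbf_design(x, centres, width=1.0):
    # Rows are phi(x) = (1, exp(-(x - c_1)^2 / (2 width^2)), ..., exp(-(x - c_K)^2 / (2 width^2)))
    rbf = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * width**2))
    return np.hstack([np.ones((len(x), 1)), rbf])
```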

Figure credit: David Barber, BRML Fig 17.6


Dimensionality issues

- How many radial basis functions do we need?
- Suppose we need only three per dimension.
- Then we would need 3^D for a D-dimensional problem (see the illustration below).
- This becomes large very fast: this is commonly called the curse of dimensionality.
- Gaussian processes (see later) can help with these issues.
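A quick illustration of this growth, assuming a grid with three RBF centres per input dimension:

```python
# Number of grid RBF centres needed for D input dimensions, 3 per dimension.
for D in (1, 2, 5, 10, 20):
    print(D, 3 ** D)   # 3, 9, 243, 59049, 3486784401
```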


Higher dimensional outputs

- Suppose the target values are vectors.
- Then we introduce a different weight vector w_i for each output dimension y_i.
- Then we can do regression independently for each output, as in the sketch below.
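A minimal sketch (assumed): with vector targets Y of shape (N, K), the same least-squares call fits all K output columns as independent regressions in one go:

```python
import numpy as np

def fit_multi_output(Phi, Y):
    """Independent OLS fit for each output column of Y (shape (N, K))."""
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # W has shape (num_features, K)
    return W
```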


Adding a Prior

- Put a prior over the parameters, e.g.,

p(y|φ, w) = N(y; w^T φ, σ²_η)
p(w) = N(w; 0, τ² I)

- I is the identity matrix.
- The log posterior is

\log p(w|\mathcal{D}) = \text{const} - \frac{1}{2\sigma_\eta^2} \sum_{n=1}^N (y^n - w^T \phi^n)^2 - \frac{N}{2}\log(2\pi\sigma_\eta^2) - \underbrace{\frac{1}{2\tau^2} w^T w}_{\text{penalty on large weights}} - \frac{D}{2}\log(2\pi\tau^2)

- The MAP solution can be computed analytically. The derivation is almost the same as for the MLE. With λ = σ²_η / τ²,

w_MAP = (Φ^T Φ + λ I)^(-1) Φ^T y

This is called ridge regression. A sketch is given below.
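A minimal NumPy sketch (assumed implementation) of the ridge/MAP solution:

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """w_MAP = (Phi^T Phi + lam * I)^(-1) Phi^T y, with lam = sigma_eta^2 / tau^2."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
```

Note that this sketch penalizes all weights equally, as in the slide's formula with a zero-mean prior on the whole of w; in practice the bias weight is sometimes left unpenalized.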


Effect of Ridge Regression

- Collecting constant terms from the log posterior on the last slide:

\log p(w|\mathcal{D}) = \text{const} - \frac{1}{2\sigma_\eta^2} \sum_{n=1}^N (y^n - w^T \phi^n)^2 - \frac{1}{2\tau^2} \underbrace{w^T w}_{\|w\|_2^2 \text{ penalty term}}

- This is called ℓ2 regularization or weight decay. The second term is the squared Euclidean (also called ℓ2) norm of w.
- The idea is to reduce overfitting by forcing the function to be simple. The simplest possible function is the constant w = 0, so we encourage w to be closer to that.
- τ is a parameter of the method. It trades off how well you fit the training data against how simple the model is. It is most commonly set via cross validation (see the sketch below).
- Regularization is a general term for adding a "second term" to an objective function to encourage simple models.
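A minimal sketch (assumed, not from the slides) of choosing the regularization strength by k-fold cross validation on held-out squared error:

```python
import numpy as np

def ridge_w(Phi, y, lam):
    # MAP / ridge solution from the previous slide
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def choose_lambda(Phi, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)

    def cv_error(lam):
        errs = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            w = ridge_w(Phi[train], y[train], lam)
            errs.append(np.mean((y[fold] - Phi[fold] @ w) ** 2))
        return np.mean(errs)

    return min(lambdas, key=cv_error)
```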


Effect of Ridge Regression (Graphic)

[Figure: two panels showing fits with ln λ = −20.135 and ln λ = −8.571]

Figure credit: Murphy Fig 7.7

Degree 14 polynomial fit with and without regularization


Why Ridge Regression Works (Graphic)

[Figure: weight-space sketch (axes u1, u2) marking the prior mean, the ML estimate, and the MAP estimate]

Figure credit: Murphy Fig 7.9


Bayesian Regression

- Bayesian regression model:

p(y|φ, w) = N(y; w^T φ, σ²_η)
p(w) = N(w; 0, τ² I)

- It is possible to compute the posterior distribution analytically, because linear Gaussian models are jointly Gaussian (see Murphy §7.6.1 for details):

p(w|Φ, y, σ²_η) ∝ p(w) p(y|Φ, w, σ²_η), giving p(w|Φ, y, σ²_η) = N(w; w_N, V_N) with

w_N = \frac{1}{\sigma_\eta^2} V_N \Phi^T y

V_N = \sigma_\eta^2 \left(\frac{\sigma_\eta^2}{\tau^2} I + \Phi^T \Phi\right)^{-1}
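A minimal NumPy sketch (assumed implementation) of these posterior formulas:

```python
import numpy as np

def posterior(Phi, y, sigma2_eta, tau2):
    """Posterior N(w_N, V_N) for Bayesian linear regression with prior N(0, tau2 * I)."""
    D = Phi.shape[1]
    V_N = sigma2_eta * np.linalg.inv((sigma2_eta / tau2) * np.eye(D) + Phi.T @ Phi)
    w_N = (V_N @ Phi.T @ y) / sigma2_eta
    return w_N, V_N
```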


Making predictions

- For a new test point x* with corresponding feature vector φ*, we have

f(x*) = w^T φ* + η,   where w ∼ N(w_N, V_N).

- Hence

p(y*|x*, D) = N(y*; w_N^T φ*, (φ*)^T V_N φ* + σ²_η)
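A minimal sketch (assumed) of the corresponding predictive mean and variance for a single test feature vector phi_star:

```python
import numpy as np

def predict(phi_star, w_N, V_N, sigma2_eta):
    """Predictive mean and variance of y* at feature vector phi_star."""
    mean = phi_star @ w_N
    var = phi_star @ V_N @ phi_star + sigma2_eta
    return mean, var
```

The (φ*)^T V_N φ* term is what makes the error bars grow away from the training data, as in the example two slides ahead.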


Example of Bayesian Regression

[Figure: panels showing the likelihood, the prior/posterior over (W0, W1), and the data space (x, y) as data points are observed]

Figure credit: Murphy Fig 7.11


Another Example

[Figure: four panels comparing MLE and Bayes. Top: plugin approximation (MLE) prediction vs posterior predictive (known variance) prediction, each with the training data. Bottom: functions sampled from the plugin approximation to the posterior vs functions sampled from the posterior]

Figure credit: Murphy Fig 7.12

Fitting a quadratic. Notice how the error bars get larger further away from the training data.


Summary

- Linear regression is a conditional Gaussian model
- Maximum likelihood solution - ordinary least squares
- Can use nonlinear basis functions
- Ridge regression
- Full Bayesian treatment
