04 Linear Regression
J. Elder, CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition


DESCRIPTION

Machine learning: supervised learning algorithms for finding parameter values.

TRANSCRIPT

Slide 1: LINEAR REGRESSION
J. Elder, CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Slide 2: Credits
- Some of these slides were sourced and/or modified from:
  - Christopher Bishop, Microsoft UK

Slide 3: Relevant Problems from Murphy
- Problems 7.4, 7.6, 7.7, 7.9
- Please do 7.9 at least. We will discuss the solution in class.

Slide 4: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 5: What is Linear Regression?
- In classification, we seek to identify the categorical class $C_k$ associated with a given input vector $\mathbf{x}$.
- In regression, we seek to identify (or estimate) a continuous variable $y$ associated with a given input vector $\mathbf{x}$.
- $y$ is called the dependent variable.
- $\mathbf{x}$ is called the independent variable.
- If $y$ is a vector, we call this multiple regression.
- We will focus on the case where $y$ is a scalar.
- Notation:
  - $y$ will denote the continuous model of the dependent variable.
  - $t$ will denote discrete noisy observations of the dependent variable (sometimes called the target variable).

Slide 6: Where is the Linear in Linear Regression?
- In regression we assume that $y$ is a function of $\mathbf{x}$. The exact nature of this function is governed by an unknown parameter vector $\mathbf{w}$:
  $y = y(\mathbf{x}, \mathbf{w})$
- The regression is linear if $y$ is linear in $\mathbf{w}$. In other words, we can express $y$ as
  $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$
  where $\boldsymbol{\phi}(\mathbf{x})$ is some (potentially nonlinear) function of $\mathbf{x}$.

Slide 7: Linear Basis Function Models
- Generally,
  $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$
  where the $\phi_j(\mathbf{x})$ are known as basis functions.
- Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias.
- In the simplest case, we use linear basis functions: $\phi_d(\mathbf{x}) = x_d$. (A MATLAB sketch of the model follows.)
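As a concrete illustration, here is a minimal MATLAB sketch of the model above, assuming a 1-D input and polynomial basis functions $\phi_j(x) = x^j$; the names x, M, Phi, w, and y are illustrative, not from the slides:

    % Build an N x M design matrix Phi with Phi(n, j+1) = phi_j(x_n) = x_n^j,
    % then evaluate y = Phi * w, which is linear in the weights w.
    x   = linspace(0, 1, 10)';   % N = 10 inputs (column vector)
    M   = 4;                     % number of basis functions (cubic + bias)
    Phi = x .^ (0:M-1);          % implicit expansion: column j+1 is x.^j
    w   = randn(M, 1);           % an arbitrary weight vector
    y   = Phi * w;               % model outputs at the N inputs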

Slide 8: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 9: Example: Polynomial Bases
- Polynomial basis functions:
  $\phi_j(x) = x^j$
- These are global:
  - a small change in $x$ affects all basis functions.
  - a small change in a basis function affects $y$ for all $x$.

Slide 10: Example: Polynomial Curve Fitting

Slide 11: Sum-of-Squares Error Function

Slide 12: 1st Order Polynomial

Slide 13: 3rd Order Polynomial

Slide 14: 9th Order Polynomial

Slide 15: Regularization
- Penalize large coefficient values

Slide 16: Regularization
- 9th Order Polynomial [figure]

Slide 17: Regularization
- 9th Order Polynomial [figure]

Slide 18: Regularization
- 9th Order Polynomial [figure]

Slide 19: Probabilistic View of Curve Fitting
- Why least squares?
- Model noise (deviation of data from model) as i.i.d. Gaussian:
  $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$
  where $\beta \equiv 1/\sigma^2$ is the precision of the noise. (A simulation sketch follows.)

Slide 20: Maximum Likelihood
- We determine $\mathbf{w}_{ML}$ by minimizing the squared error $E(\mathbf{w})$.
- Thus least-squares regression reflects an assumption that the noise is i.i.d. Gaussian.

Slide 21: Maximum Likelihood
- We determine $\mathbf{w}_{ML}$ by minimizing the squared error $E(\mathbf{w})$.
- Now, given $\mathbf{w}_{ML}$, we can estimate the variance of the noise:
  $\sigma_{ML}^2 = \frac{1}{N} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}_{ML}) - t_n \right)^2$
  (see the sketch below).
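A short MATLAB sketch of this plug-in variance estimate, assuming a design matrix Phi and targets t as in the earlier sketches (hypothetical variables):

    % ML weights, then the mean squared residual as the noise-variance estimate.
    wML      = pinv(Phi) * t;
    sigma2ML = mean((t - Phi*wML).^2);    % (1/N) * sum of squared residuals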

Slide 22: Predictive Distribution
[Figure: generating function, observed data, maximum likelihood prediction, and posterior over t]

Slide 23: MAP: A Step towards Bayes
- Prior knowledge about probable values of $\mathbf{w}$ can be incorporated into the regression.
- Now the posterior over $\mathbf{w}$ is proportional to the product of the likelihood times the prior:
  $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$
- The result is to introduce a new quadratic term in $\mathbf{w}$ into the error function to be minimized.
- Thus regularized (ridge) regression reflects a 0-mean isotropic Gaussian prior on the weights.

Slide 24: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 25: Gaussian Bases
- Gaussian basis functions:
  $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
- These are local:
  - a small change in $x$ affects only nearby basis functions.
  - a small change in a basis function affects $y$ only for nearby $x$.
- $\mu_j$ and $s$ control location and scale (width).
- Think of these as interpolation functions. (A MATLAB sketch follows.)
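A minimal MATLAB sketch of such a basis bank; the centres, width, and the choice of 9 functions are assumptions for illustration (matching the later sinusoidal examples):

    % Evaluate 9 Gaussian basis functions at 100 inputs: PhiG is 100 x 9.
    mu   = linspace(0, 1, 9);               % assumed centres mu_j
    s    = 0.1;                             % assumed common width
    x    = linspace(0, 1, 100)';
    PhiG = exp(-(x - mu).^2 / (2*s^2));     % implicit expansion over centres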

Slide 26: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 27: Maximum Likelihood and Linear Least Squares
- Assume observations from a deterministic function with added Gaussian noise:
  $t = y(\mathbf{x}, \mathbf{w}) + \varepsilon$, where $p(\varepsilon \mid \beta) = \mathcal{N}\left(\varepsilon \mid 0, \beta^{-1}\right)$
- which is the same as saying
  $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left( t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1} \right)$
- where
  $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$

Slide 28: Maximum Likelihood and Linear Least Squares
- Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{t} = [t_1, \ldots, t_N]^\top$, we obtain the likelihood function
  $p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \right)$

Slide 29: Maximum Likelihood and Linear Least Squares
- Taking the logarithm, we get
  $\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(\mathbf{w})$
- where
  $E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right)^2$
  is the sum-of-squares error.

Slide 30: Maximum Likelihood and Least Squares
- Computing the gradient and setting it to zero yields
  $\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right) \boldsymbol{\phi}(\mathbf{x}_n)^\top = 0$
- Solving for $\mathbf{w}$, we get
  $\mathbf{w}_{ML} = \boldsymbol{\Phi}^\dagger \mathbf{t}$
- where $\boldsymbol{\Phi}^\dagger \equiv \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top$ is the Moore-Penrose pseudo-inverse and $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with $\Phi_{nj} = \phi_j(\mathbf{x}_n)$. (A MATLAB sketch follows.)
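A minimal MATLAB sketch of this solution, assuming Phi and t as in the earlier sketches; backslash is used rather than an explicit inverse for numerical stability:

    % Least-squares / ML weights via the normal equations ...
    wML  = (Phi' * Phi) \ (Phi' * t);
    % ... or, equivalently, via the Moore-Penrose pseudo-inverse.
    wML2 = pinv(Phi) * t;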

Slide 31: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 32: Regularized Least Squares
- Consider the error function (data term + regularization term):
  $E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$
  where $\lambda$ is called the regularization coefficient.
- With the sum-of-squares error function and a quadratic regularizer, we get
  $\frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}$
- which is minimized by
  $\mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$
- Thus the name ridge regression. (A MATLAB sketch follows.)
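A minimal MATLAB sketch of the ridge closed form, with an assumed value for lambda and with Phi, t as before:

    % Ridge regression: shrink weights by adding lambda to the Gram matrix.
    lambda = 1e-3;                                   % assumed regularization coefficient
    M      = size(Phi, 2);
    wRidge = (lambda*eye(M) + Phi'*Phi) \ (Phi'*t);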

Slide 33: Application: Colour Restoration
[Scatter plots: green channel vs. red channel, and green channel vs. blue channel, axes 0-300]

Slide 34: Application: Colour Restoration
[Figure: original image; remove green to obtain red and blue channels only; restore green to obtain predicted image]

Slide 35: Regularized Least Squares
- A more general regularizer:
  $\frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$
- $q = 1$ gives the lasso (least absolute shrinkage and selection operator); $q = 2$ gives the quadratic (ridge) regularizer. (A MATLAB sketch follows.)
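As a usage sketch: MATLAB's Statistics and Machine Learning Toolbox provides lasso() for the q = 1 penalty. The predictor matrix X, targets y, and the lambda value here are assumptions for illustration:

    % Lasso fit at a single regularization strength; B is typically sparse.
    [B, FitInfo] = lasso(X, y, 'Lambda', 0.1);
    k = nnz(B);    % number of surviving (nonzero) coefficients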

Slide 36: Regularized Least Squares
- Lasso generates sparse solutions.
- [Figure: iso-contours of the data term $E_D(\mathbf{w})$ and an iso-contour of the regularization term $E_W(\mathbf{w})$, for the quadratic and lasso regularizers]

Slide 37: Solving Regularized Systems
- Quadratic regularization has the advantage that the solution is closed form.
- Non-quadratic regularizers generally do not have closed-form solutions.
- Lasso can be framed as minimizing a quadratic error with linear constraints, and thus represents a convex optimization problem that can be solved by quadratic programming or other convex optimization methods.
- We will discuss quadratic programming when we cover SVMs.

Slide 38: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 39: Multiple Outputs
- Analogous to the single-output case we have:
  $\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^\top \boldsymbol{\phi}(\mathbf{x})$
- Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^\top$, we obtain the log likelihood function
  $\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left( \mathbf{t}_n \mid \mathbf{W}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \mathbf{I} \right)$

Slide 40: Multiple Outputs
- Maximizing with respect to $\mathbf{W}$, we obtain
  $\mathbf{W}_{ML} = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{T}$
- If we consider a single target variable $t_k$, we see that
  $\mathbf{w}_k = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t}_k = \boldsymbol{\Phi}^\dagger \mathbf{t}_k$
- where $\mathbf{t}_k = [t_{1k}, \ldots, t_{Nk}]^\top$, which is identical with the single-output case. (A MATLAB sketch follows.)
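A one-line MATLAB sketch: with the K target columns stacked in an N x K matrix T (assumed, along with Phi as before), the same pseudo-inverse solves all K regressions at once:

    % Column k of WML equals pinv(Phi) * T(:, k), the single-output solution.
    WML = pinv(Phi) * T;    % M x K weight matrix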

Slide 41: Some Useful MATLAB Functions
- polyfit: least-squares fit of a polynomial of specified order to given data
- regress: more general function that computes linear weights for a least-squares fit
- (A usage sketch follows.)
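A brief usage sketch of both built-ins on assumed toy data; regress is in the Statistics Toolbox and expects an explicit column of ones for the bias:

    % Cubic fit two ways: polyfit/polyval, and regress on a design matrix.
    x    = linspace(0, 1, 20)';
    t    = sin(2*pi*x) + 0.1*randn(20, 1);   % assumed noisy targets
    p    = polyfit(x, t, 3);                 % coefficients, highest power first
    tHat = polyval(p, x);                    % fitted values
    Phi  = [ones(20, 1), x, x.^2, x.^3];     % explicit design matrix
    b    = regress(t, Phi);                  % same fit (coefficient order reversed)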

Slide 42: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression

Slide 43: Bayesian Linear Regression
- Rev. Thomas Bayes, 1702-1761

Slide 44: Bayesian Linear Regression
- In least squares, we determine the weights $\mathbf{w}$ that minimize the least squared error between the model and the training data.
- This can result in overlearning!
- Overlearning can be reduced by adding a regularizing term to the error function being minimized.
- Under specific conditions this is equivalent to a Bayesian approach, where we specify a prior distribution over the weight vector.

Slide 45: Bayesian Linear Regression
- Define a conjugate prior over $\mathbf{w}$:
  $p(\mathbf{w}) = \mathcal{N}\left( \mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0 \right)$
- Combining this with the likelihood function and matching terms, we obtain the posterior
  $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}\left( \mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N \right)$
- where
  $\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^\top \mathbf{t} \right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$

Slide 46: Bayesian Linear Regression
- A common choice for the prior is
  $p(\mathbf{w}) = \mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I} \right)$
- for which
  $\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$
- Thus $\mathbf{m}_N$ represents the ridge regression solution with $\lambda = \alpha / \beta$.
- Next we consider an example. (A MATLAB sketch of the posterior follows.)
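A minimal MATLAB sketch of this posterior, with assumed values for alpha and beta and with Phi, t as before:

    % Posterior over w for the isotropic prior: S_N^-1 = alpha*I + beta*Phi'*Phi,
    % m_N = beta * S_N * Phi' * t (the ridge solution with lambda = alpha/beta).
    alpha = 2;  beta = 25;                   % assumed hyperparameters
    M     = size(Phi, 2);
    SNinv = alpha*eye(M) + beta*(Phi'*Phi);  % posterior precision
    mN    = beta * (SNinv \ (Phi'*t));       % posterior mean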

Slide 47: Bayesian Linear Regression
- Example: fitting a straight line
- 0 data points observed
- [Figure panels: prior; data space]

Slide 48: Bayesian Linear Regression
- 1 data point observed
- [Figure panels: likelihood for (x1, t1); posterior; data space]

Slide 49: Bayesian Linear Regression
- 2 data points observed
- [Figure panels: likelihood for (x2, t2); posterior; data space]

Slide 50: Bayesian Linear Regression
- 20 data points observed
- [Figure panels: likelihood for (x20, t20); posterior; data space]

Slide 51: Bayesian Prediction
- In least squares, or regularized least squares, we determine specific weights $\mathbf{w}$ that allow us to predict a specific value $y(\mathbf{x}, \mathbf{w})$ for every observed input $\mathbf{x}$.
- However, our estimate of the weight vector $\mathbf{w}$ will never be perfect! This will introduce error into our prediction.
- In Bayesian prediction, we model the posterior distribution over our predictions, taking into account our uncertainty in model parameters.

Slide 52: Predictive Distribution
- Predict $t$ for new values of $x$ by integrating over $\mathbf{w}$:
  $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\left( t \mid \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}) \right)$
- where
  $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$
  (a MATLAB sketch follows).
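Continuing the posterior sketch above (mN, SNinv, beta, and M assumed available, with a polynomial basis assumed for phi), the predictive mean and variance at a new input:

    % Predictive mean m_N' * phi and variance 1/beta + phi' * S_N * phi.
    xNew    = 0.5;                            % assumed query point
    phiNew  = xNew .^ (0:M-1)';               % phi(xNew) for the basis phi_j(x) = x^j
    predMu  = mN' * phiNew;
    predVar = 1/beta + phiNew' * (SNinv \ phiNew);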

Slide 53: Predictive Distribution
- Example: sinusoidal data, 9 Gaussian basis functions, 1 data point
- [Figure panels: $E[t \mid \mathbf{t}, \alpha, \beta]$; $p(t \mid \mathbf{t}, \alpha, \beta)$; samples of $y(x, \mathbf{w})$]
- Notice how much bigger our uncertainty is relative to the ML method!

Slide 54: Predictive Distribution
- Example: sinusoidal data, 9 Gaussian basis functions, 2 data points
- [Figure panels: $E[t \mid \mathbf{t}, \alpha, \beta]$; $p(t \mid \mathbf{t}, \alpha, \beta)$; samples of $y(x, \mathbf{w})$]

Slide 55: Predictive Distribution
- Example: sinusoidal data, 9 Gaussian basis functions, 4 data points
- [Figure panels: $E[t \mid \mathbf{t}, \alpha, \beta]$; $p(t \mid \mathbf{t}, \alpha, \beta)$; samples of $y(x, \mathbf{w})$]

Slide 56: Predictive Distribution
- Example: sinusoidal data, 9 Gaussian basis functions, 25 data points
- [Figure panels: $E[t \mid \mathbf{t}, \alpha, \beta]$; $p(t \mid \mathbf{t}, \alpha, \beta)$; samples of $y(x, \mathbf{w})$]

Slide 57: Linear Regression Topics
- What is linear regression?
- Example: polynomial curve fitting
- Other basis families
- Solving linear regression problems
- Regularized regression
- Multiple linear regression
- Bayesian linear regression