CpSc 881: Machine Learning Regression


TRANSCRIPT

Page 1

CpSc 881: Machine Learning

Regression

Page 2

Copyright Notice

Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!

Page 3

Regression problems

The goal is to make quantitative (real-valued) predictions on the basis of a vector of features or attributes.

Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.

Questions: What can we assume about the problem? How do we formalize the regression problem? How do we evaluate predictions?

Page 4

A generic regression problem

The input attributes are given as fixed-length vectors $x = [x_1, \ldots, x_d]^T$, where each component $x_i$ may be discrete or real valued.

The outputs are assumed to be real valued, $y \in \mathbb{R}$ (or a restricted subset of the real values).

We have access to a set of n training examples, $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, sampled independently at random from some fixed but unknown distribution $P(x, y)$.

The goal is to minimize the prediction error/loss on new examples $(x, y)$ drawn at random from the same distribution $P(x, y)$. The loss may be, for example, the squared loss

$$\mathrm{loss}(\hat{y}, y) = (\hat{y} - y)^2$$

where $\hat{y}$ denotes our prediction in response to x.
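To make the loss concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates the squared loss and its empirical average; the toy data and the particular prediction rule are assumptions made only for illustration.

```python
# Squared loss on toy data; the prediction rule is a placeholder for illustration.
import numpy as np

def squared_loss(y_hat, y):
    """loss(y_hat, y) = (y_hat - y)^2"""
    return (y_hat - y) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=100)   # stands in for samples from P(x, y)

y_hat = 0.6 + 1.8 * x                            # some prediction rule
print("average squared loss:", squared_loss(y_hat, y).mean())
```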

Page 5

Linear regression

We need to define a class of functions (the types of predictions we will try to make), such as linear predictions:

$$f(x; w_0, w_1) = w_0 + w_1 x$$

where $w_1, w_0$ are the parameters we need to set.

Page 6

Estimation criterion

• We need an estimation criterion so as to be able to select appropriate values for our parameters ($w_1$ and $w_0$) based on the training set $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

• For example, we can use the empirical loss:

$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i; w_0, w_1)\right)^2$$

Page 7

Empirical loss

• Ideally, we would like to find the parameters $w_1, w_0$ that minimize the expected loss (assuming unlimited training data):

$$J(w_0, w_1) = E_{(x,y) \sim P}\left[\left(y - f(x; w_0, w_1)\right)^2\right]$$

where the expectation is over samples from $P(x, y)$.

• When the number of training examples n is large, however, the empirical error is approximately what we want:

$$\frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i; w_0, w_1)\right)^2 \approx E_{(x,y) \sim P}\left[\left(y - f(x; w_0, w_1)\right)^2\right]$$

Page 8

Estimating the parameters

We minimize the empirical squared loss

$$J_n(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i; w_0, w_1)\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - w_0 - w_1 x_i\right)^2$$

By setting the derivatives with respect to $w_1$ and $w_0$ to zero, we get necessary conditions for the "optimal" parameter values:

$$\frac{\partial}{\partial w_0} J_n(w_0, w_1) = 0, \qquad \frac{\partial}{\partial w_1} J_n(w_0, w_1) = 0$$
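Solving these two conditions gives the familiar closed-form estimates. The NumPy sketch below is an illustration, not part of the slides; the data-generating line is an assumption made only for the example.

```python
# Closed-form solution of the two first-order conditions for (w0, w1).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=200)   # toy data, for illustration only

# Setting the derivatives to zero yields: w1 = Cov(x, y) / Var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
w0 = y.mean() - w1 * x.mean()
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")           # should be close to 0.5 and 2.0

# Empirical squared loss J_n at the fitted parameters
J_n = np.mean((y - (w0 + w1 * x)) ** 2)
print(f"J_n = {J_n:.4f}")
```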

Page 9

Types of error

Structural error measures the error introduced by the limited function class (infinite training data):

$$E_{(x,y) \sim P}\left[\left(y - w_0^* - w_1^* x\right)^2\right] = \min_{w_0, w_1} E_{(x,y) \sim P}\left[\left(y - w_0 - w_1 x\right)^2\right]$$

where $(w_0^*, w_1^*)$ are the optimal linear regression parameters.

Approximation error measures how close we can get to the optimal linear predictions with limited training data:

$$E_{(x,y) \sim P}\left[\left(w_0^* + w_1^* x - \hat{w}_0 - \hat{w}_1 x\right)^2\right]$$

where $(\hat{w}_0, \hat{w}_1)$ are the parameter estimates based on a small training set (and are therefore themselves random variables).

Page 10

Multivariate Regression

Write the matrices X and Y as

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$$

(there are R datapoints; each input has m components).

The linear regression model assumes a vector w such that

$$\mathrm{Out}(x) = w^T x = w_1 x[1] + w_2 x[2] + \cdots + w_m x[m]$$

The result is the same as in the one-dimensional case:

$$\hat{w} = (X^T X)^{-1} X^T Y$$
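A minimal NumPy sketch of this closed-form solution (an illustration, not part of the slides); the toy data are assumptions, and `np.linalg.lstsq` is included as the numerically safer alternative to forming the normal equations explicitly.

```python
# Multivariate least squares: w_hat = (X^T X)^{-1} X^T Y, on toy data.
import numpy as np

rng = np.random.default_rng(2)
R, m = 100, 3                                      # R datapoints, m input components
X = rng.normal(size=(R, m))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.05 * rng.normal(size=R)

w_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)    # numerically preferred in practice
print(w_hat.round(3), w_lstsq.round(3))
```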

Page 11

Beyond linear regression

• The linear regression functions

$$f: \quad f(x; w) = w_0 + w_1 x$$

$$f: \quad f(\mathbf{x}; w) = w_0 + w_1 x_1 + \cdots + w_d x_d$$

are convenient because they are linear in the parameters, not necessarily in the input x.

• We can easily generalize these classes of functions to be non-linear functions of the inputs x but still linear in the parameters w. For example, the mth-order polynomial prediction:

$$f: \quad f(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_m x^m$$
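One way to realize this in code is to expand the input into polynomial features and then fit an ordinary linear model on them. The sketch below is illustrative and assumes scikit-learn's `PolynomialFeatures` and `LinearRegression`; the toy data and the order m = 2 are arbitrary choices.

```python
# mth-order polynomial regression: non-linear in x, still linear in the parameters w.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(100, 1))
y = 1.0 - 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2 + 0.1 * rng.normal(size=100)

m = 2                                                # polynomial order
model = make_pipeline(PolynomialFeatures(degree=m, include_bias=False),
                      LinearRegression())
model.fit(x, y)
lr = model.named_steps["linearregression"]
print(lr.intercept_, lr.coef_)                       # roughly w0 = 1.0, w1 = -2.0, w2 = 3.0
```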

Page 12

Subset Selection and Shrinkage: Motivation

Bias-variance trade-off. Goal: choose the model to minimize the error

$$E\left[(\hat{\theta} - \theta)^2\right] = \mathrm{Var}(\hat{\theta}) + \left[E(\hat{\theta}) - \theta\right]^2$$

Method: sacrifice a little bit of bias to reduce the variance.

Better interpretation: find the strongest factors from the input space.

Page 13

Shrinkage

Intuition: continuous version of subset selection

Goal: impose a penalty on model complexity to get lower variance.

Two examples: ridge regression and the lasso.

Page 14

Ridge Regression

Penalize by the sum of squares of the parameters:

$$\hat{\beta}^{\,\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i} \Big(y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

or

$$\hat{\beta}^{\,\mathrm{ridge}} = \arg\min_{\beta} \sum_{i} \Big(y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$
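As a hedged illustration, the sketch below computes the ridge solution two ways: the closed form $(X^T X + \lambda I)^{-1} X^T y$ on centered data, and scikit-learn's `Ridge`, whose `alpha` parameter plays the role of λ. The toy data are assumptions for the example.

```python
# Ridge regression two ways: closed form on centered data vs. sklearn.linear_model.Ridge.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])
y = X @ beta_true + 0.1 * rng.normal(size=80)

lam = 1.0
Xc, yc = X - X.mean(axis=0), y - y.mean()            # center so the intercept drops out
beta_closed = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

beta_ridge = Ridge(alpha=lam).fit(X, y).coef_        # same penalty, alpha = lambda
print(beta_closed.round(3))
print(beta_ridge.round(3))
```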

Page 15

Understanding of Ridge Regression

Find the orthogonal principal components (basis vectors), and then apply a greater amount of shrinkage to the basis vectors with small variance.

Assumption: y varies most in the directions of high variance. Intuitive example: stop words in text classification, if we assume no covariance between words.

Relation to MAP estimation. If $\beta \sim N(0, \tau I)$ and $y \sim N(X\beta, \sigma^2 I)$, then:

$$\hat{\beta}^{\,\mathrm{ridge}} = \arg\max_{\beta} \; \left[ P(Data \mid \beta)\, P(\beta) \right]$$

Page 16

Lasso (Least Absolute Shrinkage and Selection Operator)

A popular model selection and shrinkage estimation method.

In a linear regression set-up:

$Y \in \mathbb{R}^n$: continuous response

$X$: $n \times p$ design matrix

$\beta \in \mathbb{R}^p$: parameter vector

The lasso estimator is then defined as:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \Big\| Y - \sum_{j=1}^{p} X_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)$$

where $\|u\|_2^2 = \sum_{i=1}^{n} u_i^2$, and a larger $\lambda$ sets some coefficients exactly to 0.
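A brief scikit-learn sketch of this sparsity effect (illustration only). Note that sklearn's `Lasso` minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its `alpha` is a rescaled version of the λ above; the toy data with a sparse true coefficient vector are an assumption for the example.

```python
# Lasso on toy data with a sparse true coefficient vector: many estimates are exactly 0.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[[0, 3]] = [2.0, -1.5]                      # only two active predictors
y = X @ beta_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(3))                          # most entries are exactly zero
```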

Page 17

Lasso (Least Absolute Shrinkage and Selection Operator)

Features: sparse solutions.

Let $\hat{\beta}_j^{0}$ be the full least squares estimates and let $t_0 = \sum_j |\hat{\beta}_j^{0}|$.

Values $t < t_0$ will cause the shrinkage.

Let $s = t / t_0 = t / \sum_j |\hat{\beta}_j^{0}|$ be the scaled lasso parameter.

Page 18

Why Lasso?

LASSO is proposed because:
Ridge regression is not parsimonious.
Ridge regression may generate huge prediction errors when the true (unknown) coefficient vector is sparse.

LASSO can outperform ridge regression if:
The true (unknown) coefficients contain many zeros.

Page 19

Why Lasso?

Prediction Accuracy

Assume $y = f(x) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$; then the prediction error of the estimate $\hat{f}(x)$ is

$$\mathrm{Err}(x) = E\left[(y - \hat{f}(x))^2\right] = \sigma^2 + \left[E\hat{f}(x) - f(x)\right]^2 + E\left[\hat{f}(x) - E\hat{f}(x)\right]^2 = \sigma^2 + \mathrm{bias}^2(\hat{f}(x)) + \mathrm{var}(\hat{f}(x))$$

OLS estimates often have low bias but large variance; the lasso can improve the overall prediction accuracy by sacrificing a little bias to reduce the variance of the predicted value.

Page 20

Why Lasso?

Interpretation

In many cases, the response y is determined by just a small subset of the predictor variables.

Page 21

How to solve the problem?

The absolute value constraint $\sum_j |\beta_j| \le t$ can be translated into $2^p$ linear inequality constraints, $G\beta \le t\,\mathbf{1}$ (p stands for the number of predictor variables),

where $G$ is a $2^p \times p$ matrix corresponding to the $2^p$ linear inequality constraints.

But direct application of this procedure is not practical, because $2^p$ may be very large.

Lawson, C. and Hanson, R. (1974) Solving Least Squares Problems. Prentice Hall.

Page 22

How to solve the problem?

Outline of the algorithm: sequentially introduce the inequality constraints.

In practice, the average number of iteration steps required is in the range (0.5p, 0.75p), so the algorithm is acceptable.

Lawson, C. and Hanson, R. (1974) Solving Least Squares Problems. Prentice Hall.

Page 23

Group Lasso

In some cases not only continuous but also categorical predictors (factors) are present; the lasso solution is then not satisfactory, since it selects individual dummy variables rather than whole factors.

Extending the lasso penalty, the group lasso estimator is:

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} \|\beta_{I_g}\|_2 \right)$$

$I_g$: the index set belonging to the g-th group of variables.

The penalty does variable selection at the group level and is intermediate between the $\ell_1$- and $\ell_2$-type penalties.

It encourages that either $\hat{\beta}_g = 0$ or $\hat{\beta}_{g,j} \neq 0$ for all $j \in \{1, \ldots, df_g\}$.
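Since the group lasso is not part of core scikit-learn, the sketch below minimizes a closely related objective, $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_g \|\beta_{I_g}\|_2$, with a simple proximal-gradient (ISTA) loop in NumPy. It is a minimal illustration under assumed toy data, group structure, step size, and iteration count, not a production solver.

```python
# Minimal proximal-gradient (ISTA) sketch for 0.5*||y - X b||^2 + lam * sum_g ||b_g||_2.
import numpy as np

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """Proximal gradient for the group lasso; `groups` is a list of index arrays."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))     # gradient step on the smooth part
        for g in groups:                             # block soft-thresholding (prox of the penalty)
            norm_g = np.linalg.norm(z[g])
            shrink = max(0.0, 1.0 - step * lam / norm_g) if norm_g > 0 else 0.0
            beta[g] = shrink * z[g]                  # whole group zeroed when norm_g <= step*lam
    return beta

# Hypothetical usage: 6 predictors in 3 groups of 2; only the first group is active.
rng = np.random.default_rng(6)
X = rng.normal(size=(50, 6))
y = X[:, :2] @ np.array([2.0, -1.5]) + 0.1 * rng.normal(size=50)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(group_lasso_ista(X, y, groups, lam=5.0).round(3))
```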

Page 24

Elastic Net

Compromise Between ℓ1 and ℓ2 to Improve Reliability

Training set: $D = (X_i, y_i),\ i = 1, \ldots, n$

Residual sum of squares: $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2$

Lasso penalty: $L_1(\lambda_1, \beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j|, \quad \lambda_1 \ge 0$

Ridge penalty: $L_2(\lambda_2, \beta) = \lambda_2 \sum_{j=1}^{p} \beta_j^2, \quad \lambda_2 \ge 0$

Objective: find $\hat{\beta} = \arg\min_{\beta}\left[\mathrm{RSS}(\beta) + L_1(\lambda_1, \beta) + L_2(\lambda_2, \beta)\right]$
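A brief scikit-learn sketch (illustration only). `ElasticNet` uses the parameterization $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\rho\|\beta\|_1 + \frac{\alpha(1-\rho)}{2}\|\beta\|_2^2$ with `l1_ratio` = ρ, so `alpha` and `l1_ratio` jointly encode the λ1 and λ2 above; the correlated toy data are an assumption for the example.

```python
# Elastic net: a compromise between the lasso (l1) and ridge (l2) penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)      # two strongly correlated predictors
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_.round(3))   # correlated predictors tend to be kept together
```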

Page 25

Elastic Net

[Figure: contours of the ridge penalty (λ2), the lasso penalty (λ1), and the elastic net penalty.]

Page 26

Principal component regression

Goal: use linear combinations of the inputs as derived inputs in the regression.

Usually the derived input directions are orthogonal to each other.

Principal component regression:

Get $v_m$ using the SVD of X.

Use $z_m = X v_m$ as inputs in the regression:

$$\hat{y}^{\,\mathrm{pcr}} = \bar{y} + \sum_{m=1}^{M} \hat{\theta}_m z_m$$
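A minimal PCR sketch, under the assumption that scikit-learn's `PCA` plus `LinearRegression` are acceptable stand-ins for the SVD step: the scores $z_m = X v_m$ are computed first, and the regression is then run on the first M of them. The toy data and M = 2 are illustrative choices.

```python
# Principal component regression: regress y on the first M principal-component scores z_m = X v_m.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

M = 2                                                # number of components kept
pcr = make_pipeline(PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:3]))
```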

Page 27

Partial Least Squares

Idea: find directions that have high variance and have high correlation with y.

Unlike general multiple linear regression, PLS regression can handle strongly collinear data and data in which the number of predictors is larger than the number of observations.

PLS builds the relationship between the response and the predictors through a few latent variables constructed from the predictors. The number of latent variables is much smaller than the number of original predictors.

Page 28

Partial Least Squares

Let vector y (n×1) denote the single response, matrix X (n×p) denote the n observations of the p predictors, and matrix T (n×h) denote the n values of the h latent variables. The latent variables are linear combinations of the original predictors:

$$T_{ij} = \sum_{k} W_{kj} X_{ik}$$

where matrix W (p×h) contains the weights. Then the response and the observations of the predictors can be expressed using T as follows (Wold S., et al., 2001):

$$X_{ik} = \sum_{j} T_{ij} P_{jk} + E_{ik}, \qquad y_i = \sum_{m} T_{im} C_m + f_i$$

where matrix P (h×p) is called the loadings and vector C (h×1) contains the regression coefficients of T. The matrix E (n×p) and vector f (n×1) are the random errors of X and y. PLS regression decomposes X and y simultaneously to find a set of latent variables that explain the covariance between X and y as much as possible.
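A short sketch using scikit-learn's `PLSRegression` as an assumed implementation of the model above, with `n_components` playing the role of h; the strongly collinear toy data (more predictors than observations) are only meant to illustrate the point about collinearity.

```python
# PLS regression with a few latent variables on strongly collinear toy data (p > n).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(9)
n, p = 40, 60                                        # more predictors than observations
latent = rng.normal(size=(n, 2))                     # two underlying latent directions
X = latent @ rng.normal(size=(2, p)) + 0.01 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n)

pls = PLSRegression(n_components=2).fit(X, y)        # n_components plays the role of h
print(pls.predict(X[:3]).ravel())                    # predictions from the fitted model
print(np.asarray(pls.coef_).size)                    # p regression coefficients (the B below)
```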

Page 29

Partial Least Squares

PLS regression also establishes the relation between the response y and the original predictors X as a multiple regression model:

$$y_i = \sum_{k} X_{ik} B_k + f'_i$$

where vector f′ (n×1) contains the regression errors and vector B (p×1) contains the PLS regression coefficients, which can be calculated as:

$$B_m = \sum_{i} W_{mi} C_i$$

Then the significant predictors can be selected based on the values of the regression coefficients from the PLS regression; this is called the PLS-Beta method.

Page 30

PCR vs. PLS vs. Ridge Regression

PCR discards the smallest-eigenvalue components (the low-variance directions). The mth component $v_m$ solves:

$$\max_{\|v\|=1,\; v^T S v_l = 0,\; l=1,\ldots,m-1} \mathrm{Var}(Xv)$$

PLS shrinks the low-variance directions, while it can inflate the high-variance directions. The mth component $v_m$ solves:

$$\max_{\|v\|=1,\; v^T S v_l = 0,\; l=1,\ldots,m-1} \mathrm{Corr}^2(y, Xv)\,\mathrm{Var}(Xv)$$

Ridge regression shrinks the coefficients of the principal components; the low-variance directions are shrunk more.