3. Linear Methods for Regression

Page 1: 3. Linear Methods for Regression

3. Linear Methods for Regression

Page 2: 3. Linear Methods for Regression

Contents

Least Squares Regression

QR decomposition for Multiple Regression

Subset Selection

Coefficient Shrinkage

Page 3: 3. Linear Methods for Regression

1. Introduction

Outline
• The simple linear regression model

• Multiple linear regression

• Model selection and shrinkage—the state of the art

Page 4: 3. Linear Methods for Regression

Regression

[Scatter plot of sample data: X on the horizontal axis (0–10), Y on the vertical axis (0–16).]

How can we model the generative process for this data?

Page 5: 3. Linear Methods for Regression

Linear Assumption

A linear model assumes the regression function E(Y | X) is reasonably approximated as linear

i.e.

• Recall that the regression function f(x) = E(Y | X = x) is the minimizer of expected squared prediction error

• Making this assumption yields high bias but low variance

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \qquad X = (X_1, \ldots, X_p)

Page 6: 3. Linear Methods for Regression

Least Squares Regression

Estimate the parameters based on a set of training data: (x1, y1)…(xN, yN)

Minimize residual sum of squares

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2

Reasonable criterion when…
• Training samples are random, independent draws
• OR, the y_i's are conditionally independent given the x_i's

Page 7: 3. Linear Methods for Regression

Matrix Notation

X is the N × (p+1) matrix of input vectors

y is the N-vector of outputs (labels)

β is the (p+1)-vector of parameters

X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}

Page 8: 3. Linear Methods for Regression

Perfectly Linear Data

When the data is exactly linear, there exists β such that

y = X\beta

(the linear regression model in matrix form)

Usually the data is not an exact fit, so…

Page 9: 3. Linear Methods for Regression

Finding the Best Fit?

[Scatter plot with fitted line: data generated from Y = 1.5X + 0.35 + N(0, 1.2); X on the horizontal axis (0–10), Y on the vertical axis (−4 to 20).]

Page 10: 3. Linear Methods for Regression

Minimize the RSS

We can rewrite the RSS in Matrix form

Getting a least squares fit involves minimizing the RSS

Solve for the parameters for which the first derivative of the RSS is zero

RSS(\beta) = (y - X\beta)^T (y - X\beta)

Page 11: 3. Linear Methods for Regression

Solving Least Squares

Derivative of a quadratic product:

\frac{d}{dx}\,(Ax + b)^T C (Dx + e) = A^T C (Dx + e) + D^T C^T (Ax + b)

Writing the RSS as

RSS(\beta) = (y - X\beta)^T I_N (y - X\beta)

Then,

\frac{\partial RSS}{\partial \beta} = -2 X^T (y - X\beta)

Setting the first derivative to zero:

X^T X \beta = X^T y

Page 12: 3. Linear Methods for Regression

Least Squares Solution

• Least squares coefficients:

\hat{\beta} = (X^T X)^{-1} X^T y

• Least squares predictions:

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y

• Estimated variance:

\hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
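A minimal numpy sketch of these three quantities on synthetic data (the generated data and all variable names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    # Design matrix with a leading column of ones for the intercept
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
    y = X @ np.array([0.35, 1.5, -2.0, 0.7]) + rng.normal(scale=1.2, size=N)

    # Least squares coefficients: solve the normal equations X^T X beta = X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Least squares predictions
    y_hat = X @ beta_hat

    # Unbiased variance estimate: RSS / (N - p - 1)
    sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)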

Page 13: 3. Linear Methods for Regression

The N-dimensional Geometry of Least Squares Regression

Page 14: 3. Linear Methods for Regression

Statistics of Least Squares

We can draw inferences about the parameters, β, by assuming the true model is linear with additive Gaussian noise, i.e.

Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

Then,

\hat{\beta} \sim N\big(\beta,\ (X^T X)^{-1} \sigma^2\big)

(N - p - 1)\,\hat{\sigma}^2 \sim \sigma^2\, \chi^2_{N-p-1}

Page 15: 3. Linear Methods for Regression

Significance of One Parameter

Can we eliminate one parameter, X_j (j ≠ 0)?

Look at the standardized coefficient (z-score):

z_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}} \sim t_{N-p-1}

where v_j is the jth diagonal element of (X^T X)^{-1}.
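A small numpy sketch of the z-score computation under the same kind of synthetic setup (names illustrative); each z[j] would be compared against quantiles of the t distribution with N − p − 1 degrees of freedom:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
    y = X @ np.array([0.35, 1.5, -2.0, 0.7]) + rng.normal(scale=1.2, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

    v = np.diag(np.linalg.inv(X.T @ X))      # v_j: diagonal of (X^T X)^{-1}
    z = beta_hat / (sigma_hat * np.sqrt(v))  # standardized coefficients z_j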

Page 16: 3. Linear Methods for Regression

Significance of Many Parameters

We may want to test many features at once: compare model M1 with p1 + 1 parameters to a nested model M0 built from p0 + 1 of the parameters of M1 (p0 < p1).

Use the F statistic:

F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)} \sim F(p_1 - p_0,\ N - p_1 - 1)
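A sketch of this F-test on synthetic data, comparing a full model M1 to a nested submodel M0 (model sizes, data, and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p1, p0 = 100, 3, 1
    X1 = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p1))])  # M1
    y = X1 @ np.array([0.35, 1.5, -2.0, 0.0]) + rng.normal(size=N)
    X0 = X1[:, :p0 + 1]                                          # M0: nested in M1

    def rss(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    RSS1, RSS0 = rss(X1, y), rss(X0, y)
    F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))
    # Compare F against the F(p1 - p0, N - p1 - 1) distribution.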

Page 17: 3. Linear Methods for Regression

Confidence Interval for Beta

We can find a confidence interval for β_j.

Confidence interval for a single parameter (a 1 − 2α confidence interval for β_j):

\big( \hat{\beta}_j - z^{(1-\alpha)} \sqrt{v_j}\,\hat{\sigma},\ \ \hat{\beta}_j + z^{(1-\alpha)} \sqrt{v_j}\,\hat{\sigma} \big)

Confidence set for the entire parameter vector (bounds on β):

C_\beta = \big\{ \beta : (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le \hat{\sigma}^2\, {\chi^2_{p+1}}^{(1-\alpha)} \big\}
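A sketch of the single-parameter interval, taking z^(1−α) ≈ 1.96 for an approximate 95% interval (synthetic data; names illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
    y = X @ np.array([0.35, 1.5, -2.0, 0.7]) + rng.normal(scale=1.2, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (N - p - 1))
    v = np.diag(np.linalg.inv(X.T @ X))

    z_q = 1.96  # normal quantile for an approximate 95% interval
    lower = beta_hat - z_q * np.sqrt(v) * sigma_hat
    upper = beta_hat + z_q * np.sqrt(v) * sigma_hat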

Page 18: 3. Linear Methods for Regression

2.1 Prostate Cancer (Example)

Data
• lcavol: log cancer volume
• lweight: log prostate weight
• age: age
• lbph: log of benign prostatic hyperplasia amount
• svi: seminal vesicle invasion
• lcp: log of capsular penetration
• gleason: Gleason score
• pgg45: percent of Gleason scores 4 or 5

Page 19: 3. Linear Methods for Regression

Technique for Multiple Regression

Computing \hat{\beta} = (X^T X)^{-1} X^T y directly has poor numeric properties.

QR decomposition of X: decompose X = QR, where
• Q is an N × (p+1) matrix with orthonormal columns (Q^T Q = I_{p+1})
• R is a (p+1) × (p+1) upper triangular matrix

Then

\hat{\beta} = (X^T X)^{-1} X^T y = (R^T Q^T Q R)^{-1} R^T Q^T y = R^{-1} Q^T y

\hat{y} = X\hat{\beta} = Q Q^T y

The columns of X expand in the columns of Q:

x_1 = r_{11} q_1
x_2 = r_{12} q_1 + r_{22} q_2
x_3 = r_{13} q_1 + r_{23} q_2 + r_{33} q_3
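A sketch of the QR route to the least squares fit using numpy's built-in factorization (synthetic data; names illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
    y = X @ np.array([0.35, 1.5, -2.0, 0.7]) + rng.normal(scale=1.2, size=N)

    Q, R = np.linalg.qr(X)  # Q: N x (p+1), orthonormal columns; R: upper triangular
    # beta_hat = R^{-1} Q^T y; R is triangular, so back-substitution suffices
    beta_hat = np.linalg.solve(R, Q.T @ y)
    y_hat = Q @ (Q.T @ y)   # projection of y onto the column space of X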

Page 20: 3. Linear Methods for Regression

Gram-Schmidt Procedure

1) Initialize z_0 = x_0 = 1
2) For j = 1 to p:
   For k = 0 to j − 1, regress x_j on the z_k's (univariate least squares estimates):

   \hat{\gamma}_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}

   Then compute the next residual:

   z_j = x_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj} z_k

3) Let Z = [z_0\ z_1\ \cdots\ z_p] and let \Gamma be upper triangular with entries \hat{\gamma}_{kj}. Then

   X = Z\Gamma = Z D^{-1} D \Gamma = QR

   where D is diagonal with D_{jj} = \lVert z_j \rVert.
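A direct numpy transcription of the procedure above (the function name is illustrative; this is classical Gram-Schmidt, so a library QR routine would be preferred numerically):

    import numpy as np

    def gram_schmidt_qr(X):
        """Successive orthogonalization of the columns of X; returns Q, R with X = QR."""
        N, P = X.shape                    # P = p + 1 columns; the first is the intercept
        Z = np.array(X, dtype=float)      # residual vectors z_j, built in place
        Gamma = np.eye(P)                 # upper triangular, entries gamma_hat_kj
        for j in range(P):
            for k in range(j):
                Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])  # univariate fit
                Z[:, j] -= Gamma[k, j] * Z[:, k]                         # next residual
        norms = np.linalg.norm(Z, axis=0) # D_jj = ||z_j||
        Q = Z / norms                     # Q = Z D^{-1}
        R = norms[:, None] * Gamma        # R = D Gamma
        return Q, R

Here Q @ R reproduces X up to floating-point error, matching step 3 above.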

Page 21: 3. Linear Methods for Regression

Subset Selection

We want to eliminate unnecessary features

Best subset regression• Choose the subset of size k with lowest RSS

• Leaps and Bounds procedure works with p up to 40

Forward Stepwise Selection• Continually add features to with the largest F-ratio

Backward Stepwise Selection• Remove features from with small F-ratio

Greedy techniques – not guaranteed to find the best model

)1,1(~

1

1

11

10

pNF

pNRSS

RSSRSSF
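A sketch of forward stepwise selection (names illustrative). Since RSS_0 is fixed at each step, adding the candidate with the smallest RSS_1 is the same as adding the one with the largest F-ratio:

    import numpy as np

    def rss(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    def forward_stepwise(X, y, k):
        """Greedily pick k features; the intercept is always included."""
        N, p = X.shape
        chosen, remaining = [], list(range(p))
        current = np.ones((N, 1))                        # intercept-only start
        for _ in range(k):
            scores = [rss(np.hstack([current, X[:, [j]]]), y) for j in remaining]
            best = remaining[int(np.argmin(scores))]     # min RSS = max F-ratio
            chosen.append(best)
            remaining.remove(best)
            current = np.hstack([current, X[:, [best]]])
        return chosen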

Page 22: 3. Linear Methods for Regression

Coefficient Shrinkage

Use additional penalties to reduce coefficients

Ridge Regression
• Minimize least squares subject to \sum_{j=1}^{p} \beta_j^2 \le s

The Lasso
• Minimize least squares subject to \sum_{j=1}^{p} |\beta_j| \le s

Principal Components Regression
• Regress on M < p principal components of X

Partial Least Squares
• Regress on M < p directions of X, weighted by y

Page 23: 3. Linear Methods for Regression

4.2 Prostate Cancer Data Example (Continued)

Page 24: 3. Linear Methods for Regression

Error Comparison

Page 25: 3. Linear Methods for Regression

Shrinkage Methods (Ridge Regression)

Minimize RSS(\beta) + \lambda\, \beta^T \beta
• Use centered data, so \beta_0 is not penalized:

\hat{\beta}_0 = \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad x_{ij} \leftarrow x_{ij} - \bar{x}_j

• The input vectors are now of length p, no longer including the initial 1

RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\, \beta^T \beta

The ridge estimates are:

\hat{\beta}^{\,ridge} = (X^T X + \lambda I_p)^{-1} X^T y
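A sketch of the ridge estimate on centered data (synthetic data; the value of λ and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = rng.normal(size=(N, p))
    y = X @ np.array([1.5, -2.0, 0.7]) + 0.35 + rng.normal(size=N)

    lam = 10.0
    Xc = X - X.mean(axis=0)          # center the inputs: beta_0 is not penalized
    beta0_hat = y.mean()             # beta_0_hat = y_bar
    beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p),
                                 Xc.T @ (y - beta0_hat))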

Page 26: 3. Linear Methods for Regression

Shrinkage Methods (Ridge Regression)

Page 27: 3. Linear Methods for Regression

The Lasso

Use centered data, as before

The L1 penalty makes the solutions nonlinear in the y_i
• Quadratic programming is used to compute them

Minimize

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
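The slides point to quadratic programming; as a lighter-weight alternative, here is a sketch of cyclic coordinate descent with soft-thresholding, which solves the equivalent penalized form min ½‖y − Xβ‖² + λΣ|β_j| on centered data (function names, λ, and the iteration count are illustrative):

    import numpy as np

    def soft_threshold(a, t):
        return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

    def lasso_cd(X, y, lam, n_iter=200):
        """Cyclic coordinate descent for the lasso; X and y assumed centered."""
        N, p = X.shape
        b = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for _ in range(n_iter):
            for j in range(p):
                r_j = y - X @ b + X[:, j] * b[j]      # partial residual excluding j
                b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
        return b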

Page 28: 3. Linear Methods for Regression

Shrinkage Methods (Lasso Regression)

Page 29: 3. Linear Methods for Regression

Principal Components Regression

Singular value decomposition (SVD) of X:

X = U D V^T

• U is N × p, V is p × p; both have orthonormal columns
• D is a p × p diagonal matrix of singular values, in decreasing order

Use linear combinations of X as new features:
• v_j is the principal component (column of V) corresponding to the jth largest element of D
• the v_j are the directions of maximal sample variance
• use only M < p features: [z_1 \cdots z_M] replaces X

z_j = X v_j, \qquad j = 1, \ldots, M

\hat{y}^{\,pcr} = \bar{y} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y \rangle}{\langle z_m, z_m \rangle}
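A sketch of principal components regression via the SVD (synthetic data; M and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, M = 100, 5, 2
    X = rng.normal(size=(N, p))
    y = X @ rng.normal(size=p) + rng.normal(size=N)

    Xc = X - X.mean(axis=0)                            # centered inputs
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U D V^T
    Z = Xc @ Vt.T[:, :M]                               # z_m = X v_m, M largest directions
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)         # theta_m = <z_m, y>/<z_m, z_m>
    y_hat = y.mean() + Z @ theta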

Page 30: 3. Linear Methods for Regression

Partial Least Squares

Construct linear combinations of inputs incorporating y

Finds directions with maximum variance and correlation with the output

The variance aspect tends to dominate, so partial least squares often operates much like principal components regression.
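A sketch of one common formulation of the PLS iteration (standardize the inputs, form a direction weighted by the inner products ⟨x_j, y⟩, fit it, orthogonalize, repeat); the details and names here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, M = 100, 5, 2
    X = rng.normal(size=(N, p))
    y = X @ rng.normal(size=p) + rng.normal(size=N)

    Xw = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized working copy
    y_hat = np.full(N, y.mean())
    for m in range(M):
        phi = Xw.T @ y                          # weights <x_j, y>: uses the response
        z = Xw @ phi                            # derived direction z_m
        theta = (z @ y) / (z @ z)
        y_hat = y_hat + theta * z
        Xw -= np.outer(z, z @ Xw) / (z @ z)     # orthogonalize inputs w.r.t. z_m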

Page 31: 3. Linear Methods for Regression

4.4 Methods Using Derived Input Directions (PLS)

• Partial Least Squares

Page 32: 3. Linear Methods for Regression

Discussion: a comparison of the selection and shrinkage methods

Page 33: 3. Linear Methods for Regression

4.5 Discussion: a comparison of the selection and shrinkage methods

Page 34: 3. Linear Methods for Regression

A Unifying View

We can view all the linear regression techniques under a common framework

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}

λ includes bias; q indicates a prior distribution on β:
• λ = 0: least squares
• λ > 0, q = 0: subset selection (the penalty counts the number of nonzero parameters)
• λ > 0, q = 1: the lasso
• λ > 0, q = 2: ridge regression

Page 35: 3. Linear Methods for Regression

Discussion: a comparison of the selection and shrinkage methods

• Family of Shrinkage Regression