Linear Regression Models Based on Chapter 3 of Hastie, Tibshirani and Friedman


Page 1

Linear Regression Models

Based on Chapter 3 of Hastie, Tibshirani and Friedman

Page 2

Linear Regression Models

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

Here the X’s might be:

•Raw predictor variables (continuous or coded-categorical)

•Transformed predictors (X4=log X3)

•Basis expansions (X4 = X3^2, X5 = X3^3, etc.)

•Interactions (X4=X2 X3 )

Popular choice for estimation is least squares:

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
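As an illustrative sketch (not from the slides), such a model with raw, transformed, and interaction terms can be fit by least squares in R with lm(); the data and variable names below are made up for the example:

# Illustrative only: simulated data with a raw predictor, a log-transformed
# predictor, and an interaction, all entering the model linearly.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- runif(n, 1, 10)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * log(x2) + 1.5 * x1 * x3 + rnorm(n)
fit <- lm(y ~ x1 + log(x2) + x1:x3)   # least squares fit of the linear model
summary(fit)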

Page 3

Page 4

Least Squares

RSS(\beta) = (y - X\beta)^T (y - X\beta)

Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals

\hat\beta = (X^T X)^{-1} X^T y

\hat{y} = X\hat\beta = X (X^T X)^{-1} X^T y   (X(X^T X)^{-1}X^T is the "hat" matrix)
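A minimal sketch in R (simulated data, assumed for illustration) of the normal-equations solution and the hat matrix, checked against lm():

# Compute the least squares coefficients and fitted values directly.
set.seed(2)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))     # design matrix with intercept column
beta <- c(1, 2, -1, 0.5)
y <- X %*% beta + rnorm(N)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)         # hat matrix
y.hat <- H %*% y                              # fitted values, equal to X %*% beta.hat
coef(lm(y ~ X - 1))                           # matches beta.hat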

Page 5

Gauss-Markov Theorem

Consider any linear combination of the β's:

\theta = a^T \beta

The least squares estimate of θ is:

\hat\theta = a^T \hat\beta = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat\theta) = E(a^T (X^T X)^{-1} X^T y) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde\theta = c^T y (i.e., E(c^T y) = a^T \beta):

\mathrm{Var}(a^T \hat\beta) \le \mathrm{Var}(c^T y)

Of course, there might be a biased estimator with lower MSE…
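For reference (standard formulas assuming Var(y) = σ²I, not spelled out on the slide), the two variances being compared are:

\mathrm{Var}(a^T \hat\beta) = \sigma^2\, a^T (X^T X)^{-1} a \qquad \text{and} \qquad \mathrm{Var}(c^T y) = \sigma^2\, c^T c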

Page 6

Bias-Variance

For any estimator \tilde\theta:

MSE(\tilde\theta) = E(\tilde\theta - \theta)^2
= E\big(\tilde\theta - E(\tilde\theta) + E(\tilde\theta) - \theta\big)^2
= \mathrm{Var}(\tilde\theta) + \big(E(\tilde\theta) - \theta\big)^2

where E(\tilde\theta) - \theta is the bias.

Note MSE closely related to prediction error:

E(Y_0 - x_0^T \tilde\theta)^2 = E(Y_0 - x_0^T \theta)^2 + E(x_0^T \theta - x_0^T \tilde\theta)^2 = \sigma^2 + MSE(x_0^T \tilde\theta)
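A small simulation (illustrative only, not from the slides) of the earlier remark that a deliberately biased, shrunken estimator can have lower MSE than least squares:

# Compare the MSE of the LS slope with a shrunken (biased) version of it.
set.seed(3)
beta.true <- 0.3
n <- 20; nsim <- 5000
x <- rnorm(n)
ls <- shrunk <- numeric(nsim)
for (s in 1:nsim) {
  y <- beta.true * x + rnorm(n)
  b <- sum(x * y) / sum(x^2)    # least squares slope (no intercept)
  ls[s] <- b
  shrunk[s] <- 0.5 * b          # shrink toward zero: adds bias, cuts variance
}
mean((ls - beta.true)^2)        # MSE of least squares
mean((shrunk - beta.true)^2)    # typically smaller for this weak signal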

Page 7

Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions":

1. Subset selection (all-subsets + leaps-and-bounds, stepwise methods; score with AIC, BIC, etc.)

2. Shrinkage/Ridge Regression

3. Derived Inputs

Page 8

Subset Selection

•Standard "all-subsets" finds the subset of size k, k = 1, …, p, that minimizes RSS

•Choice of subset size requires a tradeoff – AIC, BIC, marginal likelihood, cross-validation, etc.

•"Leaps and bounds" is an efficient algorithm to do all-subsets
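A sketch in R using the leaps package (assumed installed; the data are simulated for illustration). regsubsets() runs the branch-and-bound all-subsets search:

library(leaps)
set.seed(4)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 3] + rnorm(n)
dat <- data.frame(y, X)
all.sub <- regsubsets(y ~ ., data = dat, nvmax = p)   # best subset of each size
summary(all.sub)$bic                                   # BIC score for each subset size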

Page 9

Cross-Validation

•e.g. 10-fold cross-validation:

Randomly divide the data into ten parts

Train the model using 9 tenths and compute prediction error on the remaining 1 tenth

Do this for each 1 tenth of the data

Average the 10 prediction error estimates

"One standard error rule": pick the simplest model within one standard error of the minimum
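A minimal sketch of this procedure in R (the data frame and model are assumptions for the example):

set.seed(5)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 + rnorm(n)
fold <- sample(rep(1:10, length.out = n))   # random assignment to 10 folds
cv.err <- numeric(10)
for (k in 1:10) {
  fit  <- lm(y ~ ., data = dat[fold != k, ])          # train on 9 tenths
  pred <- predict(fit, newdata = dat[fold == k, ])    # predict the held-out tenth
  cv.err[k] <- mean((dat$y[fold == k] - pred)^2)
}
mean(cv.err)            # cross-validation estimate of prediction error
sd(cv.err) / sqrt(10)   # its standard error, for the one standard error rule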

Page 10

Shrinkage Methods

•Subset selection is a discrete process – individual variables are either in or out

•This method can have high variance – a different dataset from the same source can result in a totally different model

•Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.

Page 11

Ridge Regression

\hat\beta^{\,\mathrm{ridge}} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2   subject to:   \sum_{j=1}^{p} \beta_j^2 \le s

Equivalently:

\hat\beta^{\,\mathrm{ridge}} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

This leads to:

\hat\beta^{\,\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y   (works even when X^T X is singular)

Choose λ by cross-validation. Predictors should be centered.
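A sketch of the closed-form solution in R on simulated, centered data (illustrative; MASS::lm.ridge or glmnet with alpha = 0 give equivalent fits, each with its own scaling conventions):

set.seed(6)
n <- 100; p <- 5; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)  # centered predictors
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
y <- y - mean(y)                                                      # centered response
beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)        # (X'X + lambda I)^{-1} X'y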

Page 12

(Figure: effective number of X's)
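The "effective number of X's" here presumably refers to the effective degrees of freedom of the ridge fit, a standard quantity not written out on the slide:

df(\lambda) = \mathrm{tr}\big[ X (X^T X + \lambda I)^{-1} X^T \big],

which equals p at λ = 0 and decreases toward 0 as λ grows.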

Page 13

Ridge Regression = Bayesian Regression

y_i \sim N(\beta_0 + x_i^T \beta,\ \sigma^2)

\beta_j \sim N(0,\ \tau^2)

same as ridge with \lambda = \sigma^2 / \tau^2
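To make the correspondence explicit (a one-line check, not on the slide; with a flat prior on β₀), the negative log posterior is

-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^T \beta \big)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \mathrm{const},

so the posterior mode is the ridge estimate with \lambda = \sigma^2 / \tau^2.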

Page 14

The Lasso

\hat\beta^{\,\mathrm{lasso}} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2   subject to:   \sum_{j=1}^{p} |\beta_j| \le s

Quadratic programming algorithm needed to solve for the parameter estimates. Choose s via cross-validation.

More generally, penalizing with |\beta_j|^q:

\tilde\beta = \arg\min_\beta \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}

q = 0: variable selection;  q = 1: lasso;  q = 2: ridge.  Learn q?
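A sketch using the glmnet package (assumed installed; alpha = 1 gives the lasso, alpha = 0 ridge). Modern implementations use coordinate descent rather than quadratic programming, and cv.glmnet chooses the penalty by cross-validation:

library(glmnet)
set.seed(7)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
cv.fit <- cv.glmnet(X, y, alpha = 1)          # 10-fold CV over the lambda path
coef(cv.fit, s = "lambda.1se")                # sparse coefficients, one standard error rule
plot(cv.fit$glmnet.fit, xvar = "lambda")      # coefficient paths against log(lambda)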

Page 15

Page 16

(Figure: function of 1/lambda)

Page 17

Principal Component Regression

Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X):

X^T X = V D^2 V^T

(X is N x p and is first centered.)

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 \ge d_2 \ge \dots \ge d_p.

Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:

\mathrm{Var}(Xv_1) = d_1^2 / N

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.

Page 18

Page 19

Principal Component Regression

PC Regression regresses on the first M principal components, where M < p

Similar to ridge regression in some respects – see HTF, p.66
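A sketch of the idea in R (simulated data; in practice pcr() in the pls package automates this):

set.seed(8)
n <- 100; p <- 6; M <- 2
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)  # centered X
y <- X[, 1] + rnorm(n)
pc <- prcomp(X, center = FALSE)      # principal components of the centered X
Z  <- pc$x[, 1:M]                    # scores on the first M components
fit.pcr <- lm(y ~ Z)                 # regress y on the first M components only
coef(fit.pcr)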

Page 20

www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf

Page 21

Page 22

x1 <- rnorm(10)
x2 <- rnorm(10)
y  <- (3 * x1) + x2 + rnorm(10, 0.1)

# Plot y against each predictor and overlay the single-predictor least squares fits
par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

# Incremental forward stagewise: at each step, nudge the coefficient of the
# predictor more correlated with the current residual by a small amount epsilon
epsilon <- 0.1
r <- y              # current residual (both coefficients start at 0)
beta <- c(0, 0)
numIter <- 25

for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    delta <- epsilon * ((2 * (sum(r * x1) > 0)) - 1)   # +/- epsilon, sign of r'x1
    beta[1] <- beta[1] + delta
    r <- r - (delta * x1)
    par(mfg = c(1, 1))                                 # draw on the left panel
    abline(0, beta[1], col = "red")
  } else {
    delta <- epsilon * ((2 * (sum(r * x2) > 0)) - 1)   # +/- epsilon, sign of r'x2
    beta[2] <- beta[2] + delta
    r <- r - (delta * x2)
    par(mfg = c(1, 2))                                 # draw on the right panel
    abline(0, beta[2], col = "green")
  }
}

Page 23

Page 24

Page 25

LARS

► Start with all coefficients bj = 0

► Find the predictor xj most correlated with y

► Increase bj in the direction of the sign of its correlation with y. Take residuals r = y - yhat along the way. Stop when some other predictor xk has as much correlation with r as xj has

► Increase (bj, bk) in their joint least squares direction until some other predictor xm has as much correlation with the residual r

► Continue until all predictors are in the model
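A sketch using the lars package (assumed installed), which implements this algorithm; type = "lar" gives LARS, type = "lasso" the lasso modification:

library(lars)
set.seed(9)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
fit <- lars(X, y, type = "lar")
plot(fit)      # coefficient paths as predictors enter the model
coef(fit)      # coefficients at each step of the path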

Page 26

Page 27

Fused Lasso

• If there are many correlated features, lasso gives non-zero weight to only one of them

• Maybe correlated features (e.g. time-ordered) should have similar coefficients?

Tibshirani et al. (2005)
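For reference (the standard form as defined in Tibshirani et al., 2005, added here; not copied from the slide), the fused lasso penalizes both the coefficients and their successive differences:

\hat\beta = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|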

Page 28

Page 29

Group Lasso

• Suppose you represent a categorical predictor with indicator variables

• Might want the set of indicators to be in or out

Yuan and Lin (2006)

regular lasso:

group lasso:
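(Penalty formulas added for reference, assuming the standard definitions: the regular lasso penalty is \lambda \sum_{j=1}^{p} |\beta_j|, while the group lasso penalty of Yuan and Lin (2006) is \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2, where \beta_g collects the p_g coefficients of group g, e.g. the indicators for one categorical predictor. Because the \ell_2 norm is not squared, whole groups enter or leave the model together.)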

Page 30

Page 31