
Linear, Ridge Regression, and Principal Component Analysis

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/∼jiali


Introduction to Regression

▶ Input vector: $X = (X_1, X_2, \ldots, X_p)$.

▶ Output $Y$ is real-valued.

▶ Predict $Y$ from $X$ by $f(X)$ so that the expected loss $E(L(Y, f(X)))$ is minimized.

▶ Square loss: $L(Y, f(X)) = (Y - f(X))^2$.

▶ The optimal predictor is
$$f^*(X) = \arg\min_{f(X)} E(Y - f(X))^2 = E(Y \mid X) .$$

▶ The function $E(Y \mid X)$ is the regression function.
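To see why the conditional mean is optimal (a standard one-step decomposition, added here for completeness rather than taken from the slide): conditioning on $X$ and expanding around $E(Y \mid X)$,
$$E\big[(Y - f(X))^2 \mid X\big] = E\big[(Y - E(Y \mid X))^2 \mid X\big] + \big(E(Y \mid X) - f(X)\big)^2 ,$$
because the cross term has zero conditional mean. The first term does not depend on $f$, so the expected loss is minimized pointwise by $f(X) = E(Y \mid X)$.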


Example

The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by $Y$, is expected to be related to total population ($X_1$, measured in thousands), land area ($X_2$, measured in square miles), and total personal income ($X_3$, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

  i :      1      2      3   ...    139    140    141
  X1:   9387   7031   7017   ...    233    232    231
  X2:   1348   4069   3719   ...   1011    813    654
  X3:  72100  52737  54542   ...   1337   1589   1148
  Y :  25627  15389  13326   ...    264    371    140

Goal: Predict $Y$ from $X_1$, $X_2$, and $X_3$.


Linear Methods

▶ The linear regression model:
$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j .$$

▶ What if the model is not true?
  ▶ It is a good approximation.
  ▶ Because of the lack of training data and/or smarter algorithms, it is the most we can extract robustly from the data.

▶ Comments on $X_j$:
  ▶ Quantitative inputs.
  ▶ Transformations of quantitative inputs, e.g., $\log(\cdot)$, $\sqrt{\cdot}$.
  ▶ Basis expansions: $X_2 = X_1^2$, $X_3 = X_1^3$, $X_3 = X_1 \cdot X_2$.
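As an illustration of the last two bullets (my own sketch with synthetic variables, not from the slides): transformed inputs and basis expansions still give a model that is linear in the coefficients, so ordinary least squares applies unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=100)      # quantitative inputs
x2 = rng.uniform(1.0, 10.0, size=100)
y = 2.0 + 0.5 * np.log(x1) + 0.1 * x1**2 + 0.3 * x1 * x2 + rng.normal(0.0, 0.5, 100)

# Design matrix: intercept, log and square-root transforms, a polynomial term,
# and an interaction.  The model stays linear in beta.
X = np.column_stack([np.ones_like(x1), np.log(x1), np.sqrt(x2), x1**2, x1 * x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```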


Estimation

▶ The issue of finding the regression function $E(Y \mid X)$ is converted to estimating $\beta_j$, $j = 0, 1, \ldots, p$.

▶ Training data:
$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \quad \text{where } x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}) .$$

▶ Denote $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$.

▶ The loss function $E(Y - f(X))^2$ is approximated by the empirical loss $\mathrm{RSS}(\beta)/N$:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 .$$
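A direct transcription of $\mathrm{RSS}(\beta)$ into code (a minimal sketch, assuming `X` holds the $p$ raw inputs and a column of ones is prepended for the intercept):

```python
import numpy as np

def rss(beta, X, y):
    """Residual sum of squares for beta = (beta_0, beta_1, ..., beta_p)."""
    X1 = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    residuals = y - X1 @ beta
    return float(residuals @ residuals)
```

Dividing by $N$ gives the empirical approximation of $E(Y - f(X))^2$.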


Notation

▶ The input matrix $X$ of dimension $N \times (p + 1)$:
$$\begin{pmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{N,1} & x_{N,2} & \cdots & x_{N,p}
\end{pmatrix}$$

▶ Output vector:
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$


▶ The estimated $\beta$ is $\hat{\beta}$.

▶ The fitted values at the training inputs:
$$\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} x_{ij} \hat{\beta}_j
\qquad \text{and} \qquad
\hat{y} = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}$$


Point Estimate

▶ The least squares estimate of $\beta$ is
$$\hat{\beta} = (X^T X)^{-1} X^T y$$

▶ The fitted value vector is
$$\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y$$

▶ Hat matrix:
$$H = X (X^T X)^{-1} X^T$$
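The boxed formulas translate directly into NumPy (a sketch under the assumption that $X^T X$ is invertible; in practice `np.linalg.lstsq` or a QR factorization is preferred to an explicit inverse):

```python
import numpy as np

def ols_fit(X, y):
    """Least squares fit; X holds the p inputs, an intercept column is added."""
    X1 = np.column_stack([np.ones(len(y)), X])   # N x (p+1)
    XtX_inv = np.linalg.inv(X1.T @ X1)           # (X^T X)^{-1}
    beta_hat = XtX_inv @ X1.T @ y                # (X^T X)^{-1} X^T y
    H = X1 @ XtX_inv @ X1.T                      # hat matrix
    y_hat = H @ y                                # fitted values: "y-hat" = H y
    return beta_hat, y_hat, H
```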


Geometric Interpretation

▶ Each column of $X$ is a vector in an $N$-dimensional space (NOT the $p$-dimensional feature vector space):
$$X = (x_0, x_1, \ldots, x_p) .$$

▶ The fitted output vector $\hat{y}$ is a linear combination of the column vectors $x_j$, $j = 0, 1, \ldots, p$.

▶ $\hat{y}$ lies in the subspace spanned by $x_j$, $j = 0, 1, \ldots, p$.

▶ $\mathrm{RSS}(\hat{\beta}) = \| y - \hat{y} \|^2$.

▶ $y - \hat{y}$ is perpendicular to the subspace, i.e., $\hat{y}$ is the projection of $y$ on the subspace.

▶ The geometric interpretation is very helpful for understanding coefficient shrinkage and subset selection (a small numerical check of the projection property follows).
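A quick numerical check (mine, with synthetic data): the residual $y - \hat{y}$ is orthogonal to every column of $X$, which is exactly the projection statement above.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # columns live in R^50
y = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta_hat
print(np.allclose(X1.T @ resid, 0.0, atol=1e-8))   # True: residual is perpendicular to the column space
```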


Example Results for the SMSA Problem

▶ $\hat{Y}_i = -143.89 + 0.341\,X_{i1} - 0.0193\,X_{i2} + 0.255\,X_{i3}$.
▶ $\mathrm{RSS}(\hat{\beta}) = 52942336$.
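Just to make the fitted equation concrete (a sketch using only the three SMSAs shown earlier in the table; the full data set is not reproduced here):

```python
import numpy as np

beta = np.array([-143.89, 0.341, -0.0193, 0.255])   # coefficients reported above

# (X1, X2, X3) and Y for the first three SMSAs in the example table.
X_rows = np.array([
    [9387.0, 1348.0, 72100.0],
    [7031.0, 4069.0, 52737.0],
    [7017.0, 3719.0, 54542.0],
])
y_obs = np.array([25627.0, 15389.0, 13326.0])

y_fit = beta[0] + X_rows @ beta[1:]
print(np.round(y_fit))    # fitted number of physicians for these three SMSAs
print(y_obs - y_fit)      # their residuals
```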


If the Linear Model Is True

▶ $E(Y \mid X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$.

▶ The least squares estimate of $\beta$ is unbiased:
$$E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, \ldots, p .$$

▶ To draw inferences about $\beta$, further assume
$$Y = E(Y \mid X) + \epsilon ,$$
where $\epsilon \sim N(0, \sigma^2)$ and is independent of $X$.

▶ $X_{ij}$ are regarded as fixed; $Y_i$ are random due to $\epsilon$.

▶ Estimation accuracy: $\mathrm{Var}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$.

▶ Under the assumption, $\hat{\beta} \sim N(\beta, (X^T X)^{-1}\sigma^2)$.

▶ Confidence intervals can be computed and significance tests can be done (a sketch of these computations follows).
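A hedged sketch of the standard computations implied by the normal model (assuming $\sigma^2$ is estimated by $\mathrm{RSS}/(N - p - 1)$; SciPy is used only for the $t$ quantile):

```python
import numpy as np
from scipy import stats

def ols_inference(X, y, level=0.95):
    """Coefficient estimates, standard errors, and confidence intervals."""
    N, p = X.shape
    X1 = np.column_stack([np.ones(N), X])
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta_hat = XtX_inv @ X1.T @ y
    resid = y - X1 @ beta_hat
    sigma2_hat = resid @ resid / (N - p - 1)          # unbiased estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))       # from Var(beta_hat) = (X^T X)^{-1} sigma^2
    t = stats.t.ppf(0.5 + level / 2.0, df=N - p - 1)
    ci = np.column_stack([beta_hat - t * se, beta_hat + t * se])
    return beta_hat, se, ci
```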


Gauss-Markov Theorem

▶ Assume the linear model is true.

▶ For any linear combination of the parameters $\beta_0, \ldots, \beta_p$, denoted by $\theta = a^T\beta$, $a^T\hat{\beta}$ is an unbiased estimate since $\hat{\beta}$ is unbiased.

▶ The least squares estimate of $\theta$ is
$$\hat{\theta} = a^T\hat{\beta} = a^T (X^T X)^{-1} X^T y \triangleq \tilde{a}^T y ,$$
which is linear in $y$.


▶ Suppose $c^T y$ is another unbiased linear estimate of $\theta$, i.e., $E(c^T y) = \theta$.

▶ The least squares estimate yields the minimum variance among all linear unbiased estimates:
$$\mathrm{Var}(\tilde{a}^T y) \le \mathrm{Var}(c^T y) .$$

▶ $\beta_j$, $j = 0, 1, \ldots, p$, are special cases of $a^T\beta$, where $a$ has only one non-zero element, which equals 1.


Subset Selection and Coefficient Shrinkage

▶ Biased estimation may yield better prediction accuracy.

▶ Squared loss: $E(\hat{\beta} - 1)^2 = \mathrm{Var}(\hat{\beta})$ when the true value is 1 (as in the figure below, where $\hat{\beta} \sim N(1, 1)$). For the shrunken estimator $\tilde{\beta} = \hat{\beta}/a$, $a \ge 1$,
$$E(\tilde{\beta} - 1)^2 = \mathrm{Var}(\tilde{\beta}) + \big(E(\tilde{\beta}) - 1\big)^2 = \frac{1}{a^2} + \Big(\frac{1}{a} - 1\Big)^2 .$$

▶ Practical consideration: interpretation. Sometimes we are not satisfied with a "black box".
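To make this concrete (a small worked minimization, mine, consistent with the $\hat{\beta} \sim N(1, 1)$ setting of the figure below): writing $t = 1/a$, the shrunken loss is $g(t) = t^2 + (t - 1)^2$, with $g'(t) = 2t + 2(t - 1) = 0$ at $t = 1/2$. So the loss is minimized at $a = 2$, where it equals $1/4 + 1/4 = 1/2$, versus a loss of $1$ with no shrinkage ($a = 1$): some shrinkage strictly helps even though the estimator becomes biased.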


[Figure] Assume $\hat{\beta} \sim N(1, 1)$. The squared error loss is reduced by shrinking the estimate.


Subset Selection

▶ To choose $k$ predicting variables from the total of $p$ variables, search for the subset yielding minimum $\mathrm{RSS}(\beta)$.

▶ Forward stepwise selection: start with the intercept, then sequentially add to the model the predictor that most improves the fit.

▶ Backward stepwise selection: start with the full model, and sequentially delete predictors.

▶ How to choose $k$: stop forward or backward stepwise selection when no predictor produces an $F$-ratio statistic greater than a threshold (a greedy sketch of forward selection follows).
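A minimal greedy version of forward stepwise selection (my sketch; it stops after a fixed number of steps `k` instead of using the F-ratio stopping rule mentioned above):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the predictor (column of X) that most reduces RSS."""
    N, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            X1 = np.column_stack([np.ones(N), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
            cur_rss = float(np.sum((y - X1 @ beta) ** 2))
            if cur_rss < best_rss:
                best_j, best_rss = j, cur_rss
        selected.append(best_j)
    return selected
```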


Ridge Regression

Centered inputs

▶ Suppose $x_j$, $j = 1, \ldots, p$, are mean removed.

▶ $\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{N} y_i / N$.

▶ If we remove the mean of $y_i$, we can assume
$$E(Y \mid X) = \sum_{j=1}^{p} \beta_j X_j .$$

▶ The input matrix $X$ has $p$ (rather than $p + 1$) columns.

▶ $\hat{\beta} = (X^T X)^{-1} X^T y$

▶ $\hat{y} = X (X^T X)^{-1} X^T y$
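A quick check of the centering claims (my sketch with synthetic data): after centering the inputs and the response, the no-intercept fit recovers the same slopes, and the intercept is simply $\bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.3, 200)

Xc = X - X.mean(axis=0)               # centered inputs
yc = y - y.mean()                     # centered response

beta_c, *_ = np.linalg.lstsq(Xc, yc, rcond=None)   # no intercept column needed
print(beta_c)                         # slope estimates
print(y.mean())                       # intercept estimate: beta_0 = ybar
```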


Singular Value Decomposition (SVD)

▶ If the column vectors of $X$ are orthonormal (i.e., the variables $X_j$, $j = 1, 2, \ldots, p$, are uncorrelated and have unit norm), then the $\hat{\beta}_j$ are the coordinates of $y$ on the orthonormal basis formed by the columns of $X$.

▶ In general,
$$X = U D V^T .$$

▶ $U = (u_1, u_2, \ldots, u_p)$ is an $N \times p$ orthogonal matrix; $u_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the column vectors of $X$.

▶ $V = (v_1, v_2, \ldots, v_p)$ is a $p \times p$ orthogonal matrix; $v_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the row vectors of $X$.

▶ $D = \mathrm{diag}(d_1, d_2, \ldots, d_p)$, where $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ are the singular values of $X$.
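In NumPy this is the "thin" SVD (a sketch with synthetic centered inputs):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)                                 # centered inputs

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # U: N x p, d: length p, Vt: p x p
print(np.allclose(X, U @ np.diag(d) @ Vt))          # True: X = U D V^T
print(d)                                            # singular values, in decreasing order
```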


Principal Components

▶ The sample covariance matrix of $X$ is
$$S = X^T X / N .$$

▶ Eigen decomposition of $X^T X$:
$$X^T X = (U D V^T)^T (U D V^T) = V D U^T U D V^T = V D^2 V^T .$$

▶ The eigenvectors $v_j$ of $X^T X$ are called the principal component directions of $X$.


▶ It’s easy to see that $z_j = X v_j = u_j d_j$. Hence $u_j$ is simply the projection of the row vectors of $X$, i.e., the input predictor vectors, on the direction $v_j$, scaled by $d_j$. For example,
$$z_1 = \begin{pmatrix}
X_{1,1} v_{1,1} + X_{1,2} v_{1,2} + \cdots + X_{1,p} v_{1,p} \\
X_{2,1} v_{1,1} + X_{2,2} v_{1,2} + \cdots + X_{2,p} v_{1,p} \\
\vdots \\
X_{N,1} v_{1,1} + X_{N,2} v_{1,2} + \cdots + X_{N,p} v_{1,p}
\end{pmatrix}$$

▶ The principal components of $X$ are $z_j = d_j u_j$, $j = 1, \ldots, p$.

▶ The first principal component of $X$, $z_1$, has the largest sample variance amongst all normalized linear combinations of the columns of $X$:
$$\mathrm{Var}(z_1) = d_1^2 / N .$$

▶ Subsequent principal components $z_j$ have maximum variance $d_j^2 / N$, subject to being orthogonal to the earlier ones.
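These facts are easy to confirm numerically (a sketch; the mixing matrix below is arbitrary and just creates correlated columns):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.3, 0.0],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.5]])
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                       # principal components z_j = X v_j
print(np.allclose(Z, U * d))       # True: z_j = d_j u_j
print(Z.var(axis=0, ddof=0))       # sample variances of the components...
print(d**2 / X.shape[0])           # ...equal d_j^2 / N, in decreasing order
```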


Ridge Regression

▶ Minimize a penalized residual sum of squares:
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

▶ Equivalently,
$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s .$$

▶ $\lambda$ or $s$ controls the model complexity.


Solution

▶ With centered inputs,
$$\mathrm{RSS}(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta ,$$
and
$$\hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y .$$

▶ Solution exists even when $X^T X$ is singular, i.e., has zero eigenvalues.

▶ When $X^T X$ is ill-conditioned (nearly singular), the ridge regression solution is more robust (a small sketch follows).
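A sketch of the closed-form ridge solution (centered inputs and response; the nearly collinear columns below make $X^T X$ badly conditioned, which is exactly where the ridge term helps):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T y on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(5)
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=200), rng.normal(size=200)])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(0.0, 0.1, 200)

print(np.linalg.cond(X.T @ X))     # huge condition number: X^T X is nearly singular
print(ridge_fit(X, y, 0.0))        # plain least squares: erratic for the two collinear columns
print(ridge_fit(X, y, 1.0))        # ridge: stabilized coefficients
```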


Geometric Interpretation

▶ Center the inputs.

▶ Consider the fitted response
$$\hat{y} = X\hat{\beta}^{\mathrm{ridge}}
= X (X^T X + \lambda I)^{-1} X^T y
= U D (D^2 + \lambda I)^{-1} D U^T y
= \sum_{j=1}^{p} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^T y ,$$
where the $u_j$ are the normalized principal components of $X$.

▶ Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.

▶ The coordinate with respect to a principal component with smaller variance is shrunk more (see the numerical check below).
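A numerical check of the identity above (my sketch): the fitted values from the closed-form ridge solution agree with the sum of shrunken principal-component coordinates.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 4)); X -= X.mean(axis=0)
y = rng.normal(size=150);      y -= y.mean()
lam = 2.0

# Direct ridge fit.
y_hat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# The same fit as shrunken coordinates on the principal components u_j.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                 # shrinkage factor per component
y_hat_svd = U @ (shrink * (U.T @ y))         # sum_j u_j [d_j^2/(d_j^2+lam)] u_j^T y

print(np.allclose(y_hat_direct, y_hat_svd))  # True
print(shrink)                                # smaller d_j  ->  stronger shrinkage
```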


▶ Instead of using $X = (X_1, X_2, \ldots, X_p)$ as predicting variables, use the transformed variables
$$(X v_1, X v_2, \ldots, X v_p)$$
as predictors.

▶ The input matrix for these new predictors is $U D$ (note $X = U D V^T$).

▶ Then for the new inputs,
$$\hat{\beta}_j^{\mathrm{ridge}} = \frac{d_j}{d_j^2 + \lambda} \, u_j^T y ,
\qquad
\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2} ,$$
where $\sigma^2$ is the variance of the error term $\epsilon$ in the linear model.

▶ The factor of shrinkage given by ridge regression is
$$\frac{d_j^2}{d_j^2 + \lambda} .$$


[Figure] The geometric interpretation of principal components and shrinkage by ridge regression.


Compare squared loss $E(\hat{\beta}_j - \beta_j)^2$

▶ Without shrinkage: $\sigma^2 / d_j^2$.

▶ With shrinkage: $\mathrm{Bias}^2 + \mathrm{Variance}$:
$$\Big( \beta_j - \beta_j \cdot \frac{d_j^2}{d_j^2 + \lambda} \Big)^2
+ \frac{\sigma^2}{d_j^2} \cdot \Big( \frac{d_j^2}{d_j^2 + \lambda} \Big)^2
= \frac{\sigma^2}{d_j^2} \cdot \frac{d_j^2 \big( d_j^2 + \lambda^2 \frac{\beta_j^2}{\sigma^2} \big)}{(d_j^2 + \lambda)^2}$$

▶ Consider the ratio between the squared losses (with shrinkage over without):
$$\frac{d_j^2 \big( d_j^2 + \lambda^2 \frac{\beta_j^2}{\sigma^2} \big)}{(d_j^2 + \lambda)^2} .$$
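The figure described on the next slide can be reproduced from this ratio (a sketch; only printed summaries are shown here rather than the plot):

```python
import numpy as np

lam = 1.0
d2 = np.linspace(0.01, 4.0, 400)                  # grid of d_j^2 values

for b2_over_s2 in (0.5, 1.0, 2.0, 4.0):           # beta_j^2 / sigma^2
    ratio = d2 * (d2 + lam**2 * b2_over_s2) / (d2 + lam) ** 2
    print(b2_over_s2, bool(np.all(ratio < 1.0)))  # shrinkage helps wherever ratio < 1
```

For $\beta^2/\sigma^2 = 0.5, 1.0, 2.0$ the ratio stays below 1 everywhere; for $\beta^2/\sigma^2 = 4.0$ it exceeds 1 once $d_j$ is large enough, matching the description of the figure.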


[Figure] The ratio between the squared loss with and without shrinkage. The amount of shrinkage is set by $\lambda = 1.0$. The four curves correspond to $\beta^2/\sigma^2 = 0.5, 1.0, 2.0, 4.0$. When $\beta^2/\sigma^2 = 0.5, 1.0, 2.0$, shrinkage always leads to lower squared loss. When $\beta^2/\sigma^2 = 4.0$, shrinkage leads to lower squared loss when $d_j \le 0.71$. Shrinkage is more beneficial when $d_j^2$ is small.


Principal Components Regression (PCR)

▶ Instead of smoothly shrinking the coordinates on the principal components, PCR either does not shrink a coordinate at all or shrinks it to zero.

▶ Principal components regression forms the derived input columns $z_m = X v_m$ and then regresses $y$ on $z_1, z_2, \ldots, z_M$ for some $M \le p$.

▶ Principal components regression discards the $p - M$ smallest-eigenvalue components (a sketch follows).
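A compact PCR sketch (mine): regress the centered response on the first $M$ components and map the result back to coefficients on the original inputs.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression using the first M components."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coefficient on each retained component z_m = X v_m is u_m^T y / d_m;
    # the p - M smallest-eigenvalue components are simply dropped.
    theta = (U[:, :M].T @ yc) / d[:M]
    return Vt[:M].T @ theta          # coefficients expressed on the original inputs
```

With `M = p` (and full-rank $X$) this reproduces ordinary least squares on the centered data; smaller `M` discards the low-variance directions entirely.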


The Lasso

▶ The lasso estimate is defined by
$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

▶ Comparison with ridge regression: the $L_2$ penalty $\sum_{j=1}^{p} \beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{j=1}^{p} |\beta_j|$.

▶ Some of the coefficients may be shrunk to exactly zero.

▶ Orthonormal columns in $X$ are assumed in the following figure (and in the soft-thresholding sketch below).
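In that orthonormal setting the lasso has a closed form: soft-threshold the least squares coefficients. A sketch using the Lagrangian (penalty) form with parameter `lam`, which corresponds to some bound $s$ in the constrained form above; ridge, by contrast, only rescales the coefficients and never sets them exactly to zero.

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """Lasso when X has orthonormal columns: soft-threshold the OLS coefficients.

    Minimizes sum (y - X beta)^2 + lam * sum |beta_j|.
    """
    beta_ols = X.T @ y                          # OLS, since X^T X = I
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

def ridge_orthonormal(X, y, lam):
    """Ridge in the same setting: every OLS coefficient is scaled by 1/(1+lam)."""
    return (X.T @ y) / (1.0 + lam)
```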
