Linear Methods for Regression (The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman). Presented by Junyan Li.


  • Linear Methods for Regression

    The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman

    Presented by Junyan Li

  • Linear regression model

    Input: $X^T = (X_1, X_2, \dots, X_p)$.

    Linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$.

    The linear model could be a reasonable approximation, since the inputs may be transformed:

    Basis expansions, e.g. $X_2 = X_1^2$, $X_3 = X_1^3$ => a polynomial representation.

    Interactions between variables, e.g. $X_3 = X_1 X_2$.

    The intercept $\beta_0$ is added so that $f(x)$ does not have to pass through the origin.

  • Linear regression model

    Residual sum of squares (RSS), given a set of training data $(x_i, y_i)$, $i = 1, \dots, N$:

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$,

    where we denote by $\mathbf{X}$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position).

    Setting the first derivative to zero, $\frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$, and noting that the second derivative $\frac{\partial^2 \mathrm{RSS}}{\partial \beta\,\partial \beta^T} = 2\mathbf{X}^T\mathbf{X}$ is positive definite if $\mathbf{X}$ is of full rank (if not, remove the redundancies), we get

    $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, $\quad \hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$;

    here $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix.
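    As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch (the data `X`, `y` are hypothetical examples, not from the slides); in practice a QR- or SVD-based solver such as `numpy.linalg.lstsq` is preferred to forming the normal equations explicitly.

```python
import numpy as np

# Hypothetical example data: N = 100 observations, p = 3 predictors.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # add the column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares via the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residual is orthogonal to the column space of X (up to rounding error).
print(np.allclose(X.T @ (y - y_hat), 0.0))
```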

  • Linear regression model

    $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the column vectors of $\mathbf{X}$, $\mathbf{x}_0, \dots, \mathbf{x}_p$, with $\mathbf{x}_0 \equiv 1$.

    $\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}) = 0$ => the residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the subspace.

  • Linear regression model

    Assume the error $\varepsilon$ has constant variance $\sigma^2$.

    Assume the deviations of $y$ around its expectation are additive and Gaussian, with $\varepsilon \sim N(0, \sigma^2)$. Then $\hat{\beta} \sim N\left(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\right)$.

  • Linear regression model

    The residuals are constrained by $p+1$ equalities ($\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}) = 0$). So

    $\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

    is an unbiased estimator of $\sigma^2$, and $(N - p - 1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{N-p-1}$.

    To test the hypothesis that a particular coefficient $\beta_j = 0$, use $z_j = \hat{\beta}_j/(\sigma\sqrt{v_j}) \sim N(0, 1)$ (if $\sigma$ is given) or $t_j = \hat{\beta}_j/(\hat{\sigma}\sqrt{v_j}) \sim t_{N-p-1}$ (if $\sigma$ is not given), where $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$. When $N - p - 1$ is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.

    Simultaneously, we can use the F statistic to test whether a group of parameters can be removed:

    $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$,

    where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1 + 1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0 + 1$ parameters.
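    A minimal NumPy sketch of the quantities above, computing the z-scores and the F statistic for dropping one predictor (hypothetical data; only the statistics are computed, not their p-values):

```python
import numpy as np

# Hypothetical example data (same setup as the earlier sketch).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=N)

def rss(X, y):
    """Residual sum of squares and coefficients of the least squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r, beta

rss1, beta_hat = rss(X, y)
sigma2_hat = rss1 / (N - p - 1)                 # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))             # v_j, diagonal of (X^T X)^{-1}
z = beta_hat / np.sqrt(sigma2_hat * v)          # z-scores (t-statistics)

# F statistic for dropping the last predictor (nested model, p1 - p0 = 1).
rss0, _ = rss(X[:, :-1], y)
F = (rss0 - rss1) / (rss1 / (N - p - 1))
print(z.round(2), F.round(2))
```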

  • Gauss-Markov Theorem

    The least squares estimates of the parameters $\beta$ have the smallest variance among all linear unbiased estimates.

    Proof sketch: let $\tilde{\theta} = \mathbf{c}^T\mathbf{y}$ be another linear unbiased estimator of $\mathbf{a}^T\beta$, and write $\mathbf{c}^T = \mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$. Since $E(\varepsilon) = 0$, unbiasedness $E(\mathbf{c}^T\mathbf{y}) = \mathbf{a}^T\beta$ for all $\beta$ forces $\mathbf{D}\mathbf{X} = 0$, and the variance comparison below follows.

    However, there may exist a biased estimator with smaller mean squared error, which is intimately related to prediction accuracy.
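    A short LaTeX expansion of the two claims above (a sketch reconstructed from the standard argument; notation follows the slide):

```latex
% With c^T = a^T (X^T X)^{-1} X^T + D and D X = 0 from unbiasedness:
\begin{align*}
\operatorname{Var}(\mathbf{c}^T\mathbf{y})
  &= \sigma^2\,\mathbf{c}^T\mathbf{c}
   = \sigma^2\left[\mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{a} + \mathbf{D}\mathbf{D}^T\right]
   \ge \sigma^2\,\mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{a}
   = \operatorname{Var}(\mathbf{a}^T\hat{\beta}), \\
% The mean squared error decomposition, which is why a biased estimator can still win:
\operatorname{MSE}(\tilde{\theta})
  &= E(\tilde{\theta} - \theta)^2
   = \operatorname{Var}(\tilde{\theta}) + \left[E(\tilde{\theta}) - \theta\right]^2 .
\end{align*}
```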

  • Multiple Regression from Simple Univariate Regression

    If $p = 1$ and there is no intercept, $\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle}$, with residuals $r_i = y_i - x_i\hat{\beta}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product.

    Let $\mathbf{x}_1, \dots, \mathbf{x}_p$ be the columns of $\mathbf{X}$. If $\langle \mathbf{x}_i, \mathbf{x}_j \rangle = 0$ (orthogonal) for each $i \neq j$, then the multiple regression coefficients are just the univariate ones: $\hat{\beta}_j = \frac{\langle \mathbf{x}_j, \mathbf{y} \rangle}{\langle \mathbf{x}_j, \mathbf{x}_j \rangle}$.

    For the residual $\mathbf{r} = \mathbf{y} - \hat{\beta}\mathbf{x}$ of a univariate fit,

    $\langle \mathbf{x}, \mathbf{r} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle - \hat{\beta}\langle \mathbf{x}, \mathbf{x} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle - \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle}\langle \mathbf{x}, \mathbf{x} \rangle = 0$

    => $\mathbf{r}$ and $\mathbf{x}$ are orthogonal.

  • Multiple Regression from Simple Univariate Regression

    Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

    Regression by Successive Orthogonalization (see the code sketch below):
    1. Initialize $\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{1}$.
    2. For $j = 1, 2, \dots, p$: regress $\mathbf{x}_j$ on $\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_{j-1}$ to produce coefficients $\hat{\gamma}_{lj} = \frac{\langle \mathbf{z}_l, \mathbf{x}_j \rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle}$, $l = 0, \dots, j-1$, and residual vector $\mathbf{z}_j = \mathbf{x}_j - \sum_{l=0}^{j-1}\hat{\gamma}_{lj}\mathbf{z}_l$.
    3. Regress $\mathbf{y}$ on the residual $\mathbf{z}_p$ to give the estimate $\hat{\beta}_p = \frac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}$.

    We can see that each of the $\mathbf{x}_j$ is a linear combination of the $\mathbf{z}_k$, $k \leq j$, and that the $\mathbf{z}_k$ are orthogonal to each other.
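    A minimal NumPy sketch of the successive-orthogonalization algorithm just described; the data `X`, `y` are hypothetical, and the final coefficient is checked against the ordinary least squares fit.

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Regression by successive orthogonalization (Gram-Schmidt).

    X is N x (p+1) with a leading column of 1s; returns the coefficient of the
    last predictor, which equals its multiple-regression coefficient.
    """
    N, p1 = X.shape
    Z = np.empty_like(X, dtype=float)
    Z[:, 0] = X[:, 0]                          # z_0 = x_0 = 1
    for j in range(1, p1):
        zj = X[:, j].astype(float).copy()
        for l in range(j):                     # regress x_j on z_0, ..., z_{j-1}
            gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
            zj -= gamma * Z[:, l]
        Z[:, j] = zj                           # residual vector z_j
    # Regress y on the last residual z_p to get beta_hat_p.
    return (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])

# Hypothetical example: the result matches the last least squares coefficient.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(successive_orthogonalization(X, y), beta_ls[-1]))
```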

  • Multiple Regression from Simple Univariate Regression

    In matrix form: $\mathbf{X} = \mathbf{Z}\boldsymbol{\Gamma}$, where $\mathbf{Z}$ has columns $\mathbf{z}_j$ (in order) and $\boldsymbol{\Gamma}$ is the upper triangular matrix with entries $\hat{\gamma}_{lj}$. Introducing the diagonal matrix $\mathbf{D}$ with $j$th diagonal entry $D_{jj} = \|\mathbf{z}_j\|$, let $\mathbf{Q} = \mathbf{Z}\mathbf{D}^{-1}$ and $\mathbf{R} = \mathbf{D}\boldsymbol{\Gamma}$, so that $\mathbf{X} = \mathbf{Q}\mathbf{R}$.

    The QR decomposition represents a convenient orthogonal basis for the column space of $\mathbf{X}$. $\mathbf{Q}$ is an $N \times (p+1)$ orthogonal matrix ($\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$), and $\mathbf{R}$ is a $(p+1) \times (p+1)$ upper triangular matrix. Then $\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}$ and $\hat{\mathbf{y}} = \mathbf{Q}\mathbf{Q}^T\mathbf{y}$.
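    A minimal sketch of solving least squares through the QR decomposition, using NumPy's built-in `numpy.linalg.qr` rather than the Gram-Schmidt construction above (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # N x (p+1), hypothetical
y = rng.normal(size=50)

Q, R = np.linalg.qr(X)                    # X = QR, Q: N x (p+1), R: upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)     # beta_hat = R^{-1} Q^T y
y_hat = Q @ (Q.T @ y)                     # fitted values via the orthogonal basis Q

print(np.allclose(beta_qr, np.linalg.lstsq(X, y, rcond=None)[0]))
```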

  • Subset Selection

    There are two reasons why we are often not satisfied with the least squares estimates:

    Prediction accuracy: the least squares estimates often have low bias but large variance.

    Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

    1. Best Subset Selection

    Best subset regression finds, for each $k \in \{0, 1, 2, \dots, p\}$, the subset of size $k$ that gives the smallest residual sum of squares. Infeasible for $p$ much larger than 40.

  • Subset Selection

    2. Forward- and Backward-Stepwise Selection

    1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.

    Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).

    Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

    2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (using z-scores; requires N > p+1).

    3) Hybrid strategies consider both forward and backward moves at each step.

  • Subset Selection

    3. Forward-Stagewise Regression

    Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to $\bar{y}$, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

    In forward-stepwise selection a variable is added completely at each step, but in FS variables are added only partially, which works better in very high-dimensional problems. A code sketch follows.
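    A minimal NumPy sketch of the forward-stagewise procedure just described (hypothetical data; `tol` and `max_iter` are illustrative stopping parameters, not from the slides):

```python
import numpy as np

def forward_stagewise(X, y, tol=1e-6, max_iter=10000):
    """Forward-stagewise regression on centered, standardized predictors."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale predictors
    r = y - y.mean()                              # intercept = mean of y; work with residual
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        corr = X.T @ r                            # proportional to correlations with residual
        j = np.argmax(np.abs(corr))
        if np.abs(corr[j]) < tol:                 # stop when no variable is correlated with r
            break
        delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])   # simple regression coefficient
        beta[j] += delta                          # add it to the current coefficient
        r = r - delta * X[:, j]                   # update the residual
    return beta

# Hypothetical example data.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=80)
print(forward_stagewise(X, y).round(2))
```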

  • Shrinkage Methods

    Because subset selection is a discrete process, it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

    1. Ridge Regression

    $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$, $\lambda \geq 0$,

    or equivalently

    $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$, subject to $\sum_{j=1}^{p}\beta_j^2 \leq t$,

    penalizing by the sum of squares of the coefficients.

    There is a one-to-one correspondence between $\lambda$ in the penalized form and $t$ in the constrained form.

  • Shrinkage Methods

    $\mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta$,

    $\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$,

    which can be obtained by calculating the first and second derivatives.

    Singular value decomposition (SVD): $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are $N \times p$ and $p \times p$ orthogonal matrices, with the columns of $\mathbf{U}$ spanning the column space of $\mathbf{X}$, and the columns of $\mathbf{V}$ spanning the row space. $\mathbf{D}$ is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \geq d_2 \geq \dots \geq d_p \geq 0$.

    Like linear regression, ridge regression computes the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. It then shrinks these coordinates by the factors $\frac{d_j^2}{d_j^2 + \lambda}$:

    $\mathbf{X}\hat{\beta}^{\mathrm{ridge}} = \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{y}$.
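    A minimal NumPy sketch of ridge regression via the SVD, checking that the shrinkage form agrees with the closed-form solution (hypothetical centered data; the intercept is left out, as in the book's treatment):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
X = X - X.mean(axis=0)                      # centered inputs, no intercept column
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(scale=0.3, size=60)
lam = 5.0

# Closed form: beta_ridge = (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# SVD form: fitted values are sum_j u_j * d_j^2/(d_j^2 + lambda) * (u_j^T y)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                # shrinkage factors, closest to 1 for large d_j
fit_svd = U @ (shrink * (U.T @ y))

print(shrink.round(3))
print(np.allclose(fit_svd, X @ beta_ridge))
```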

  • Shrinkage Methods

    Effective degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\left[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\right] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}$.

    Hence the small singular values $d_j$ correspond to directions in the column space of $\mathbf{X}$ having small variance, and ridge regression shrinks these directions the most.

    Eigendecomposition: $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$, and the eigenvectors $v_j$ (columns of $\mathbf{V}$) are also called the principal components (or Karhunen-Loeve) directions of $\mathbf{X}$.

  • Shrinkage Methods

    2. Lasso Regression

    $\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$, subject to $\sum_{j=1}^{p}|\beta_j| \leq t$,

    or in the equivalent Lagrangian form $\arg\min_{\beta}\left\{\frac{1}{2}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}$, $\lambda \geq 0$.

    The latter constraint makes the solutions nonlinear in the $y_i$; there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

    If the solution occurs at a corner of the constraint region, then it has one parameter $\beta_j$ equal to zero.

  • Shrinkage Methods

    Assume that the columns of $\mathbf{X}$ are orthonormal, so $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ and the least squares estimate is $\hat{\beta} = \mathbf{X}^T\mathbf{y}$. The criteria then separate coordinate-wise: minimizing $\frac{1}{2}(\beta_j - \hat{\beta}_j)^2$ plus the penalty term gives explicit solutions. Best subset selection (size M) keeps the M largest coefficients (hard thresholding), ridge gives $\hat{\beta}_j/(1 + \lambda)$ (proportional shrinkage), and the lasso gives the soft-thresholding rule $\mathrm{sign}(\hat{\beta}_j)(|\hat{\beta}_j| - \lambda)_+$, obtained by checking the two cases $\beta_j = 0$ and $\beta_j \neq 0$ of the penalized coordinate-wise problem (see the sketch below).

    Desirable properties of a penalized estimator:
    1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
    2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
    3. Continuity: the resulting estimator is continuous in the data, to avoid instability in model prediction.
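    A minimal NumPy sketch of the orthonormal-case estimators described above: ridge shrinks proportionally, the lasso soft-thresholds (the least squares coefficients `beta_ls` are hypothetical):

```python
import numpy as np

def ridge_orthonormal(beta_ls, lam):
    """Ridge estimate when X^T X = I: proportional shrinkage."""
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Lasso estimate when X^T X = I: soft thresholding."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Hypothetical least squares coefficients.
beta_ls = np.array([3.0, -1.5, 0.4, -0.2])
print(ridge_orthonormal(beta_ls, lam=1.0))   # [ 1.5  -0.75  0.2  -0.1 ]
print(lasso_orthonormal(beta_ls, lam=1.0))   # [ 2.  -0.5  0.  -0. ]
```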

  • Shrinkage Methods

    3. Least Angle Regression

    Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves.

    1. Standardize the predictors to have mean zero and unit norm. Start with the residual $\mathbf{r} = \mathbf{y} - \bar{y}$ and $\beta_1, \beta_2, \dots, \beta_p = 0$.
    2. Find the predictor $\mathbf{x}_j$ most correlated with $\mathbf{r}$ (cosine).
    3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle \mathbf{x}_j, \mathbf{r} \rangle$, until some other competitor $\mathbf{x}_k$ has as much correlation with the current residual as does $\mathbf{x}_j$.
    4. Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(\mathbf{x}_j, \mathbf{x}_k)$, until some other competitor $\mathbf{x}_l$ has as much correlation with the current residual.
    4a. If a non-zero coefficient hits zero, drop its variable from the active set and recompute the current joint least squares direction. (With this step added, the procedure gives the lasso path.)
    5. Continue in this way until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least-squares solution.

  • Linear Methods for Classification

    With K classes, fit linear models $\hat{f}_k(x) = \hat{\beta}_{k0} + \hat{\beta}_k^T x$. The decision boundary between class k and class l is the set of points for which $\hat{f}_k(x) = \hat{f}_l(x)$, i.e. $\{x : (\hat{\beta}_{k0} - \hat{\beta}_{l0}) + (\hat{\beta}_k - \hat{\beta}_l)^T x = 0\}$, a hyperplane.

    Alternatively, model the posterior probabilities $P(G = k \mid X = x)$. For two classes, a popular model is

    $\log\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \beta_0 + \beta^T x$.

    Separating hyperplanes: a third approach is to explicitly model the decision boundaries as linear.

  • Linear Methods for Classification

    Linear regression of an indicator matrix: if G has K classes, there will be K such indicators $Y_k$, $k = 1, \dots, K$, with $Y_k = 1$ if $G = k$, else 0. These are collected together in a vector $Y = (Y_1, \dots, Y_K)$.

    Classify according to $\hat{G}(x) = \arg\max_k \hat{f}_k(x)$. Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.

    A loose but general rule is that if $K \geq 3$ classes are lined up, polynomial terms up to degree $K - 1$ might be needed to resolve them.

  • Linear Methods for Classification

    Suppose $f_k(x)$ is the class-conditional density of X in class $G = k$ and $\pi_k$ is the prior probability of class k. By Bayes' theorem,

    $P(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}$.

    Suppose each class density is multivariate Gaussian,

    $f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$.

    In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k. Then

    $\log\frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l)$,

    which is linear in x; the decision boundary is a hyperplane in the p-dimensional input space.

  • Linear Methods for Classification

    The linear discriminant function:

    $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$,

    using the fact that $\Sigma^{-1}$ is symmetric and $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose is still a number. The term $-\frac{1}{2}x^T\Sigma^{-1}x$ is ignored, as it is constant across classes.

    In practice, we estimate the parameters using training data:

    $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-k observations;

    $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$;

    $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$.
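    A minimal NumPy sketch of the LDA discriminant functions with the plug-in estimates above (hypothetical two-class Gaussian data):

```python
import numpy as np

def lda_fit(X, g):
    """Estimate LDA parameters: priors, class means, pooled covariance (inverted)."""
    classes = np.unique(g)
    N, K = X.shape[0], len(classes)
    priors = np.array([np.mean(g == k) for k in classes])
    means = np.array([X[g == k].mean(axis=0) for k in classes])
    Sigma = sum(
        (X[g == k] - means[i]).T @ (X[g == k] - means[i])
        for i, k in enumerate(classes)
    ) / (N - K)
    return classes, priors, means, np.linalg.inv(Sigma)

def lda_predict(X, classes, priors, means, Sigma_inv):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    deltas = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(deltas, axis=1)]

# Hypothetical data: two Gaussian classes with a shared covariance.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([2, 2], 1.0, (100, 2))])
g = np.repeat([0, 1], 100)
print(np.mean(lda_predict(X, *lda_fit(X, g)) == g))  # training accuracy
```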

  • Linear Methods for Classification

    In quadratic discriminant analysis (QDA) we assume that the $\Sigma_k$ are not equal. The quadratic discriminant function is

    $\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$.

    LDA in the enlarged quadratic polynomial space is quite similar to QDA.

    Regularized discriminant analysis (RDA): $\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$. In practice $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation.

    For computation, use the eigendecomposition $\hat{\Sigma}_k = \mathbf{U}_k\mathbf{D}_k\mathbf{U}_k^T$, where $\mathbf{U}_k$ is a $p \times p$ orthonormal matrix and $\mathbf{D}_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

    $(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k) = \left[\mathbf{U}_k^T(x - \hat{\mu}_k)\right]^T\mathbf{D}_k^{-1}\left[\mathbf{U}_k^T(x - \hat{\mu}_k)\right]$ and $\log|\hat{\Sigma}_k| = \sum_{l}\log d_{kl}$.

  • Linear Methods for Classification

    Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space. Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space. This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

  • Linear Methods for Classification

    Fisher's problem: find the linear combination $Z = a^T X$ such that the between-class variance is maximized relative to the within-class variance:

    $\max_a \frac{a^T\mathbf{B}a}{a^T\mathbf{W}a}$,

    where $\mathbf{B}$ is the between-classes scatter matrix and $\mathbf{W}$ is the within-classes scatter matrix.

    Since the ratio is unchanged when $a$ is rescaled, this is equivalent to $\min_a -\frac{1}{2}a^T\mathbf{B}a$ subject to $a^T\mathbf{W}a = 1$.

    The Lagrangian is $L = -\frac{1}{2}a^T\mathbf{B}a + \frac{1}{2}\lambda\left(a^T\mathbf{W}a - 1\right)$; setting $\partial L/\partial a = 0$ gives $\mathbf{B}a = \lambda\mathbf{W}a$, i.e. $\mathbf{W}^{-1}\mathbf{B}a = \lambda a$, a generalized eigenvalue problem whose leading eigenvector gives the first discriminant direction.
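    A minimal NumPy/SciPy sketch of solving the generalized eigenproblem $\mathbf{B}a = \lambda\mathbf{W}a$ for the leading discriminant direction (hypothetical data; B and W are built as between- and within-class scatter matrices):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, g):
    """Leading discriminant direction a solving B a = lambda W a."""
    classes = np.unique(g)
    mean_all = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))                       # between-class scatter
    W = np.zeros((p, p))                       # within-class scatter
    for k in classes:
        Xk = X[g == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - mean_all, mk - mean_all)
        W += (Xk - mk).T @ (Xk - mk)
    # Generalized symmetric eigenproblem B a = lambda W a; take the top eigenvector.
    eigvals, eigvecs = eigh(B, W)
    return eigvecs[:, -1]                      # eigh returns eigenvalues in ascending order

# Hypothetical three-class data in two dimensions.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in ([0, 0], [2, 1], [4, 2])])
g = np.repeat([0, 1, 2], 50)
print(fisher_direction(X, g))
```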

  • Logistic Regression

    $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$, $k = 1, \dots, K-1$,

    so that

    $P(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$, $k = 1, \dots, K-1$, and $P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$.

    Let $P(G = k \mid X = x) = p_k(x; \theta)$. The log-likelihood is

    $\ell(\theta) = \sum_{i=1}^{N}\log p_{g_i}(x_i; \theta)$.

    In the two-class case, via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$ and $y_i = 0$ when $g_i = 2$:

    $\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\log p(x_i; \beta) + (1 - y_i)\log\left(1 - p(x_i; \beta)\right)\right\} = \sum_{i=1}^{N}\left\{y_i\beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right)\right\}$.

    Setting the score to zero,

    $\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\left(y_i - p(x_i; \beta)\right) = 0$,

    with Hessian

    $\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i; \beta)\left(1 - p(x_i; \beta)\right)$,

    and solve by the Newton-Raphson algorithm.

  • Logistic Regression

    In matrix notation: $\mathbf{y}$ denotes the vector of $y_i$ values; $\mathbf{X}$ the $N \times (p+1)$ matrix of $x_i$ values; $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i; \beta^{\mathrm{old}})$; and $\mathbf{W}$ an $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i; \beta^{\mathrm{old}})(1 - p(x_i; \beta^{\mathrm{old}}))$. The Newton step is

    $\beta^{\mathrm{new}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$, with adjusted response $\mathbf{z} = \mathbf{X}\beta^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$,

    i.e. iteratively reweighted least squares (IRLS).

    Because exact Newton updates get computationally demanding, many software packages use quadratic approximations to logistic regression, including for L1-regularized logistic regression.
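    A minimal NumPy sketch of the IRLS / Newton-Raphson update above for two-class logistic regression (hypothetical data; a fixed number of iterations stands in for a convergence check):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Two-class logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities p(x_i; beta_old)
        w = p * (1.0 - p)                          # diagonal of W
        z = X @ beta + (y - p) / w                 # adjusted response z
        # beta_new = (X^T W X)^{-1} X^T W z, a weighted least squares fit
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Hypothetical data: X includes a column of 1s for the intercept.
rng = np.random.default_rng(7)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(logistic_irls(X, y).round(2))
```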

  • LDA or Logistic Regression

    LDA: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$.

    Logistic regression: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$.

    Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.

    Logistic regression fits the parameters by maximizing the conditional likelihood, the multinomial likelihood with probabilities $P(G = k \mid X)$, where $P(X)$ is ignored.

    LDA fits the parameters by maximizing the full log-likelihood, based on the joint density $P(X, G = k) = \phi(X; \mu_k, \Sigma)\pi_k$, where $P(X) = \sum_k P(X, G = k)$ does play a role. If we assume the $f_k(x)$ are Gaussian, this gives more efficient estimates.

    It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

  • Any Questions?