Linear Methods for Regression (The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman). Presented by Junyan Li.


  • Linear Methods for Regression

    The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman

    Presented by Junyan Li

  • Linear regression model

    Input: $X^T = (X_1, X_2, \dots, X_p)$.

    Linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$.

    The linear model could be a reasonable approximation, since the inputs may be transformed:

    Basis expansions, e.g. $X_2 = X_1^2$, $X_3 = X_1^3$ => a polynomial representation.

    Interactions between variables, e.g. $X_3 = X_1 X_2$.

    The intercept $\beta_0$ is added so that $f(x)$ does not have to pass through the origin.

  • Linear regression model

    Residual sum of squares (RSS), given a set of training data $(x_i, y_i)$, $i = 1, \dots, N$:

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$,

    where we denote by $\mathbf{X}$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position).

    Setting the first derivative to zero, $\frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$, and noting that the second derivative $\frac{\partial^2 \mathrm{RSS}}{\partial \beta\,\partial \beta^T} = 2\mathbf{X}^T\mathbf{X}$ is positive definite if $\mathbf{X}$ is of full rank (if not, remove the redundancies), we get

    $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, $\quad \hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$;

    here $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix.
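    As a concrete illustration of the closed-form solution above, here is a minimal NumPy sketch (the data `X`, `y` are hypothetical examples, not from the slides); in practice a QR- or SVD-based solver such as `numpy.linalg.lstsq` is preferred to forming the normal equations explicitly.

```python
import numpy as np

# Hypothetical example data: N = 100 observations, p = 3 predictors.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # add the column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares via the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residual is orthogonal to the column space of X (up to rounding error).
print(np.allclose(X.T @ (y - y_hat), 0.0))
```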

  • Linear regression model

    $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the column vectors of $\mathbf{X}$, $\mathbf{x}_0, \dots, \mathbf{x}_p$, with $\mathbf{x}_0 \equiv 1$.

    $\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}) = 0$ => the residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the subspace.

  • Linear regression model

    Assume the error $\varepsilon$ has constant variance $\sigma^2$.

    Assume the deviations of $y$ around its expectation are additive and Gaussian, with $\varepsilon \sim N(0, \sigma^2)$. Then $\hat{\beta} \sim N\left(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\right)$.

  • Linear regression model

    The residuals are constrained by $p+1$ equalities ($\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}) = 0$). So

    $\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

    is an unbiased estimator of $\sigma^2$, and $(N - p - 1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{N-p-1}$.

    To test the hypothesis that a particular coefficient $\beta_j = 0$, use $z_j = \hat{\beta}_j/(\sigma\sqrt{v_j}) \sim N(0, 1)$ (if $\sigma$ is given) or $t_j = \hat{\beta}_j/(\hat{\sigma}\sqrt{v_j}) \sim t_{N-p-1}$ (if $\sigma$ is not given), where $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$. When $N - p - 1$ is large enough, the difference between the tail quantiles of a t-distribution and a standard normal distribution is negligible, so we use the standard normal distribution regardless of whether $\sigma$ is given or not.

    Simultaneously, we can use the F statistic to test whether a group of parameters can be removed:

    $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$,

    where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the bigger model with $p_1 + 1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0 + 1$ parameters.
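    A minimal NumPy sketch of the quantities above, computing the z-scores and the F statistic for dropping one predictor (hypothetical data; only the statistics are computed, not their p-values):

```python
import numpy as np

# Hypothetical example data (same setup as the earlier sketch).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=N)

def rss(X, y):
    """Residual sum of squares and coefficients of the least squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r, beta

rss1, beta_hat = rss(X, y)
sigma2_hat = rss1 / (N - p - 1)                 # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))             # v_j, diagonal of (X^T X)^{-1}
z = beta_hat / np.sqrt(sigma2_hat * v)          # z-scores (t-statistics)

# F statistic for dropping the last predictor (nested model, p1 - p0 = 1).
rss0, _ = rss(X[:, :-1], y)
F = (rss0 - rss1) / (rss1 / (N - p - 1))
print(z.round(2), F.round(2))
```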

  • Gauss-Markov Theorem

    The least squares estimates of the parameters $\beta$ have the smallest variance among all linear unbiased estimates.

    Proof sketch: let $\tilde{\theta} = \mathbf{c}^T\mathbf{y}$ be another linear unbiased estimator of $\mathbf{a}^T\beta$, and write $\mathbf{c}^T = \mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$. Since $E(\varepsilon) = 0$, unbiasedness $E(\mathbf{c}^T\mathbf{y}) = \mathbf{a}^T\beta$ for all $\beta$ forces $\mathbf{D}\mathbf{X} = 0$, and the variance comparison below follows.

    However, there may exist a biased estimator with smaller mean squared error, which is intimately related to prediction accuracy.
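    A short LaTeX expansion of the two claims above (a sketch reconstructed from the standard argument; notation follows the slide):

```latex
% With c^T = a^T (X^T X)^{-1} X^T + D and D X = 0 from unbiasedness:
\begin{align*}
\operatorname{Var}(\mathbf{c}^T\mathbf{y})
  &= \sigma^2\,\mathbf{c}^T\mathbf{c}
   = \sigma^2\left[\mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{a} + \mathbf{D}\mathbf{D}^T\right]
   \ge \sigma^2\,\mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{a}
   = \operatorname{Var}(\mathbf{a}^T\hat{\beta}), \\
% The mean squared error decomposition, which is why a biased estimator can still win:
\operatorname{MSE}(\tilde{\theta})
  &= E(\tilde{\theta} - \theta)^2
   = \operatorname{Var}(\tilde{\theta}) + \left[E(\tilde{\theta}) - \theta\right]^2 .
\end{align*}
```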

  • Multiple Regression from Simple Univariate Regression

    If $p = 1$ and there is no intercept, $\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle}$, with residuals $r_i = y_i - x_i\hat{\beta}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product.

    Let $\mathbf{x}_1, \dots, \mathbf{x}_p$ be the columns of $\mathbf{X}$. If $\langle \mathbf{x}_i, \mathbf{x}_j \rangle = 0$ (orthogonal) for each $i \neq j$, then the multiple regression coefficients are just the univariate ones: $\hat{\beta}_j = \frac{\langle \mathbf{x}_j, \mathbf{y} \rangle}{\langle \mathbf{x}_j, \mathbf{x}_j \rangle}$.

    For the residual $\mathbf{r} = \mathbf{y} - \hat{\beta}\mathbf{x}$ of a univariate fit,

    $\langle \mathbf{x}, \mathbf{r} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle - \hat{\beta}\langle \mathbf{x}, \mathbf{x} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle - \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle}\langle \mathbf{x}, \mathbf{x} \rangle = 0$

    => $\mathbf{r}$ and $\mathbf{x}$ are orthogonal.

  • Multiple Regression from Simple Univariate Regression

    Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

    Regression by Successive Orthogonalization (see the code sketch below):
    1. Initialize $\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{1}$.
    2. For $j = 1, 2, \dots, p$: regress $\mathbf{x}_j$ on $\mathbf{z}_0, \mathbf{z}_1, \dots, \mathbf{z}_{j-1}$ to produce coefficients $\hat{\gamma}_{lj} = \frac{\langle \mathbf{z}_l, \mathbf{x}_j \rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle}$, $l = 0, \dots, j-1$, and residual vector $\mathbf{z}_j = \mathbf{x}_j - \sum_{l=0}^{j-1}\hat{\gamma}_{lj}\mathbf{z}_l$.
    3. Regress $\mathbf{y}$ on the residual $\mathbf{z}_p$ to give the estimate $\hat{\beta}_p = \frac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}$.

    We can see that each of the $\mathbf{x}_j$ is a linear combination of the $\mathbf{z}_k$, $k \leq j$, and that the $\mathbf{z}_k$ are orthogonal to each other.
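    A minimal NumPy sketch of the successive-orthogonalization algorithm just described; the data `X`, `y` are hypothetical, and the final coefficient is checked against the ordinary least squares fit.

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Regression by successive orthogonalization (Gram-Schmidt).

    X is N x (p+1) with a leading column of 1s; returns the coefficient of the
    last predictor, which equals its multiple-regression coefficient.
    """
    N, p1 = X.shape
    Z = np.empty_like(X, dtype=float)
    Z[:, 0] = X[:, 0]                          # z_0 = x_0 = 1
    for j in range(1, p1):
        zj = X[:, j].astype(float).copy()
        for l in range(j):                     # regress x_j on z_0, ..., z_{j-1}
            gamma = Z[:, l] @ X[:, j] / (Z[:, l] @ Z[:, l])
            zj -= gamma * Z[:, l]
        Z[:, j] = zj                           # residual vector z_j
    # Regress y on the last residual z_p to get beta_hat_p.
    return (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])

# Hypothetical example: the result matches the last least squares coefficient.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(successive_orthogonalization(X, y), beta_ls[-1]))
```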

  • Multiple Regression from Simple Univariate Regression

    In matrix form: $\mathbf{X} = \mathbf{Z}\boldsymbol{\Gamma}$, where $\mathbf{Z}$ has columns $\mathbf{z}_j$ (in order) and $\boldsymbol{\Gamma}$ is the upper triangular matrix with entries $\hat{\gamma}_{lj}$. Introducing the diagonal matrix $\mathbf{D}$ with $j$th diagonal entry $D_{jj} = \|\mathbf{z}_j\|$, let $\mathbf{Q} = \mathbf{Z}\mathbf{D}^{-1}$ and $\mathbf{R} = \mathbf{D}\boldsymbol{\Gamma}$, so that $\mathbf{X} = \mathbf{Q}\mathbf{R}$.

    The QR decomposition represents a convenient orthogonal basis for the column space of $\mathbf{X}$. $\mathbf{Q}$ is an $N \times (p+1)$ orthogonal matrix ($\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$), and $\mathbf{R}$ is a $(p+1) \times (p+1)$ upper triangular matrix. Then $\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}$ and $\hat{\mathbf{y}} = \mathbf{Q}\mathbf{Q}^T\mathbf{y}$.
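    A minimal sketch of solving least squares through the QR decomposition, using NumPy's built-in `numpy.linalg.qr` rather than the Gram-Schmidt construction above (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # N x (p+1), hypothetical
y = rng.normal(size=50)

Q, R = np.linalg.qr(X)                    # X = QR, Q: N x (p+1), R: upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)     # beta_hat = R^{-1} Q^T y
y_hat = Q @ (Q.T @ y)                     # fitted values via the orthogonal basis Q

print(np.allclose(beta_qr, np.linalg.lstsq(X, y, rcond=None)[0]))
```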

  • Subset Selection

    There are two reasons why we are often not satisfied with the least squares estimates:

    Prediction accuracy: the least squares estimates often have low bias but large variance.

    Interpretation: with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

    1. Best Subset Selection

    Best subset regression finds, for each $k \in \{0, 1, 2, \dots, p\}$, the subset of size $k$ that gives the smallest residual sum of squares. Infeasible for $p$ much larger than 40.

  • Subset Selection

    2. Forward- and Backward-Stepwise Selection

    1) Forward-stepwise selection is a greedy algorithm: it starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.

    Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p >> N).

    Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

    2) Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (using z-scores; requires N > p+1).

    3) Hybrid strategies consider both forward and backward moves at each step.

  • Subset Selection

    3. Forward-Stagewise Regression

    Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to $\bar{y}$, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This is continued until none of the variables have correlation with the residuals.

    In forward-stepwise selection a variable is added completely at each step, but in FS variables are added only partially, which works better in very high-dimensional problems. A code sketch follows.
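    A minimal NumPy sketch of the forward-stagewise procedure just described (hypothetical data; `tol` and `max_iter` are illustrative stopping parameters, not from the slides):

```python
import numpy as np

def forward_stagewise(X, y, tol=1e-6, max_iter=10000):
    """Forward-stagewise regression on centered, standardized predictors."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale predictors
    r = y - y.mean()                              # intercept = mean of y; work with residual
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        corr = X.T @ r                            # proportional to correlations with residual
        j = np.argmax(np.abs(corr))
        if np.abs(corr[j]) < tol:                 # stop when no variable is correlated with r
            break
        delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])   # simple regression coefficient
        beta[j] += delta                          # add it to the current coefficient
        r = r - delta * X[:, j]                   # update the residual
    return beta

# Hypothetical example data.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=80)
print(forward_stagewise(X, y).round(2))
```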

  • Shrinkage Methods

    Because subset selection is a discrete process, it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

    1. Ridge Regression

    $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$, $\lambda \geq 0$,

    or equivalently

    $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$, subject to $\sum_{j=1}^{p}\beta_j^2 \leq t$,

    penalizing by the sum of squares of the coefficients.

    There is a one-to-one correspondence between $\lambda$ in the penalized form and $t$ in the constrained form.

  • Shrinkage Methods

    $\mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta$,

    $\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$,

    which can be obtained by calculating the first and second derivatives.

    Singular value decomposition (SVD): $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are $N \times p$ and $p \times p$ orthogonal matrices, with the columns of $\mathbf{U}$ spanning the column space of $\mathbf{X}$, and the columns of $\mathbf{V}$ spanning the row space. $\mathbf{D}$ is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \geq d_2 \geq \dots \geq d_p \geq 0$.

    Like linear regression, ridge regression computes the coordinates of $\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. It then shrinks these coordinates by the factors $\frac{d_j^2}{d_j^2 + \lambda}$:

    $\mathbf{X}\hat{\beta}^{\mathrm{ridge}} = \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{y}$.
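    A minimal NumPy sketch of ridge regression via the SVD, checking that the shrinkage form agrees with the closed-form solution (hypothetical centered data; the intercept is left out, as in the book's treatment):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
X = X - X.mean(axis=0)                      # centered inputs, no intercept column
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(scale=0.3, size=60)
lam = 5.0

# Closed form: beta_ridge = (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# SVD form: fitted values are sum_j u_j * d_j^2/(d_j^2 + lambda) * (u_j^T y)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                # shrinkage factors, closest to 1 for large d_j
fit_svd = U @ (shrink * (U.T @ y))

print(shrink.round(3))
print(np.allclose(fit_svd, X @ beta_ridge))
```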

  • Shrinkage Methods

    Effective degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\left[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\right] = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}$.

    Hence the small singular values $d_j$ correspond to directions in the column space of $\mathbf{X}$ having small variance, and ridge regression shrinks these directions the most.

    Eigendecomposition: $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$, and the eigenvectors $v_j$ (columns of $\mathbf{V}$) are also called the principal components (or Karhunen-Loeve) directions of $\mathbf{X}$.

  • Shrinkage Methods

    2. Lasso Regression

    $\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$, subject to $\sum_{j=1}^{p}|\beta_j| \leq t$,

    or in the equivalent Lagrangian form $\arg\min_{\beta}\left\{\frac{1}{2}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}$, $\lambda \geq 0$.

    The latter constraint makes the solutions nonlinear in the $y_i$; there is no closed-form expression as in ridge regression. Computing the lasso solution is a quadratic programming problem.

    If the solution occurs at a corner of the constraint region, then it has one parameter $\beta_j$ equal to zero.

  • Shrinkage Methods

    Assume that the columns of $\mathbf{X}$ are orthonormal, so $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ and the least squares estimate is $\hat{\beta} = \mathbf{X}^T\mathbf{y}$. The criteria then separate coordinate-wise: minimizing $\frac{1}{2}(\beta_j - \hat{\beta}_j)^2$ plus the penalty term gives explicit solutions. Best subset selection (size M) keeps the M largest coefficients (hard thresholding), ridge gives $\hat{\beta}_j/(1 + \lambda)$ (proportional shrinkage), and the lasso gives the soft-thresholding rule $\mathrm{sign}(\hat{\beta}_j)(|\hat{\beta}_j| - \lambda)_+$, obtained by checking the two cases $\beta_j = 0$ and $\beta_j \neq 0$ of the penalized coordinate-wise problem (see the sketch below).

    Desirable properties of a penalized estimator:
    1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
    2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
    3. Continuity: the resulting estimator is continuous in the data, to avoid instability in model prediction.
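    A minimal NumPy sketch of the orthonormal-case estimators described above: ridge shrinks proportionally, the lasso soft-thresholds (the least squares coefficients `beta_ls` are hypothetical):

```python
import numpy as np

def ridge_orthonormal(beta_ls, lam):
    """Ridge estimate when X^T X = I: proportional shrinkage."""
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Lasso estimate when X^T X = I: soft thresholding."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Hypothetical least squares coefficients.
beta_ls = np.array([3.0, -1.5, 0.4, -0.2])
print(ridge_orthonormal(beta_ls, lam=1.0))   # [ 1.5  -0.75  0.2  -0.1 ]
print(lasso_orthonormal(beta_ls, lam=1.0))   # [ 2.  -0.5  0.  -0. ]
```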

  • Shrinkage Methods

    3. Least Angle Regression

    Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves.

    1. Standardize the predictors to have mean zero and unit norm. Start with the residual $\mathbf{r} = \mathbf{y} - \bar{y}$ and $\beta_1, \beta_2, \dots, \beta_p = 0$.
    2. Find the predictor $\mathbf{x}_j$ most correlated with $\mathbf{r}$ (cosine).
    3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle \mathbf{x}_j, \mathbf{r} \rangle$, until some other competitor $\mathbf{x}_k$ has as much correlation with the current residual as does $\mathbf{x}_j$.
    4. Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(\mathbf{x}_j, \mathbf{x}_k)$, until some other competitor $\mathbf{x}_l$ has as much correlation with the current residual.
    4a. If a non-zero coefficient hits zero, drop its variable from the active set and recompute the current joint least squares direction. (With this step added, the procedure gives the lasso path.)
    5. Continue in this way until all p predictors have been entered. After min(N-1, p) steps, we arrive at the full least-squares solution.

  • Linear Methods for Classification

    With K classes, fit linear models $\hat{f}_k(x) = \hat{\beta}_{k0} + \hat{\beta}_k^T x$. The decision boundary between class k and class l is the set of points for which $\hat{f}_k(x) = \hat{f}_l(x)$, i.e. $\{x : (\hat{\beta}_{k0} - \hat{\beta}_{l0}) + (\hat{\beta}_k - \hat{\beta}_l)^T x = 0\}$, a hyperplane.

    Alternatively, model the posterior probabilities $P(G = k \mid X = x)$. For two classes, a popular model is

    $\log\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \beta_0 + \beta^T x$.

    Separating hyperplanes: a third approach is to explicitly model the decision boundaries as linear.

  • Linear Methods for Classification

    Linear regression of an indicator matrix: if G has K classes, there will be K such indicators $Y_k$, $k = 1, \dots, K$, with $Y_k = 1$ if $G = k$, else 0. These are collected together in a vector $Y = (Y_1, \dots, Y_K)$.

    Classify according to $\hat{G}(x) = \arg\max_k \hat{f}_k(x)$. Because of the rigid nature of the regression model, classes may be masked by others, even though they are perfectly separated.

    A loose but general rule is that if $K \geq 3$ classes are lined up, polynomial terms up to degree $K - 1$ might be needed to resolve them.

  • Linear Methods for Classification

    Suppose $f_k(x)$ is the class-conditional density of X in class $G = k$ and $\pi_k$ is the prior probability of class k. By Bayes' theorem,

    $P(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}$.

    Suppose each class density is multivariate Gaussian,

    $f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$.

    In linear discriminant analysis (LDA) we assume that $\Sigma_k = \Sigma$ for each k. Then

    $\log\frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l)$,

    which is linear in x; the decision boundary is a hyperplane in the p-dimensional input space.

  • Linear Methods for Classification

    The linear discriminant function:

    $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$,

    using the fact that $\Sigma^{-1}$ is symmetric and $x^T\Sigma^{-1}\mu_k$ is a number, whose transpose is still a number. The term $-\frac{1}{2}x^T\Sigma^{-1}x$ is ignored, as it is constant across classes.

    In practice, we estimate the parameters using training data:

    $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-k observations;

    $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$;

    $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$.
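    A minimal NumPy sketch of the LDA discriminant functions with the plug-in estimates above (hypothetical two-class Gaussian data):

```python
import numpy as np

def lda_fit(X, g):
    """Estimate LDA parameters: priors, class means, pooled covariance (inverted)."""
    classes = np.unique(g)
    N, K = X.shape[0], len(classes)
    priors = np.array([np.mean(g == k) for k in classes])
    means = np.array([X[g == k].mean(axis=0) for k in classes])
    Sigma = sum(
        (X[g == k] - means[i]).T @ (X[g == k] - means[i])
        for i, k in enumerate(classes)
    ) / (N - K)
    return classes, priors, means, np.linalg.inv(Sigma)

def lda_predict(X, classes, priors, means, Sigma_inv):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    deltas = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(deltas, axis=1)]

# Hypothetical data: two Gaussian classes with a shared covariance.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([2, 2], 1.0, (100, 2))])
g = np.repeat([0, 1], 100)
print(np.mean(lda_predict(X, *lda_fit(X, g)) == g))  # training accuracy
```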

  • Linear Methods for Classification

    In quadratic discriminant analysis (QDA) we assume that the $\Sigma_k$ are not equal. The quadratic discriminant function is

    $\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$.

    LDA in the enlarged quadratic polynomial space is quite similar to QDA.

    Regularized discriminant analysis (RDA): $\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$. In practice $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation.

    For computation, use the eigendecomposition $\hat{\Sigma}_k = \mathbf{U}_k\mathbf{D}_k\mathbf{U}_k^T$, where $\mathbf{U}_k$ is a $p \times p$ orthonormal matrix and $\mathbf{D}_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

    $(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k) = \left[\mathbf{U}_k^T(x - \hat{\mu}_k)\right]^T\mathbf{D}_k^{-1}\left[\mathbf{U}_k^T(x - \hat{\mu}_k)\right]$ and $\log|\hat{\Sigma}_k| = \sum_{l}\log d_{kl}$.

  • Linear Methods for Classification

    Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo $\log\pi_k$) in the sphered space. Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space. This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.

  • Linear Methods for Classification

    Fisher's problem: find the linear combination $Z = a^T X$ such that the between-class variance is maximized relative to the within-class variance:

    $\max_a \frac{a^T\mathbf{B}a}{a^T\mathbf{W}a}$,

    where $\mathbf{B}$ is the between-classes scatter matrix and $\mathbf{W}$ is the within-classes scatter matrix.

    Since the ratio is unchanged when $a$ is rescaled, this is equivalent to $\min_a -\frac{1}{2}a^T\mathbf{B}a$ subject to $a^T\mathbf{W}a = 1$.

    The Lagrangian is $L = -\frac{1}{2}a^T\mathbf{B}a + \frac{1}{2}\lambda\left(a^T\mathbf{W}a - 1\right)$; setting $\partial L/\partial a = 0$ gives $\mathbf{B}a = \lambda\mathbf{W}a$, i.e. $\mathbf{W}^{-1}\mathbf{B}a = \lambda a$, a generalized eigenvalue problem whose leading eigenvector gives the first discriminant direction.
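    A minimal NumPy/SciPy sketch of solving the generalized eigenproblem $\mathbf{B}a = \lambda\mathbf{W}a$ for the leading discriminant direction (hypothetical data; B and W are built as between- and within-class scatter matrices):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, g):
    """Leading discriminant direction a solving B a = lambda W a."""
    classes = np.unique(g)
    mean_all = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))                       # between-class scatter
    W = np.zeros((p, p))                       # within-class scatter
    for k in classes:
        Xk = X[g == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - mean_all, mk - mean_all)
        W += (Xk - mk).T @ (Xk - mk)
    # Generalized symmetric eigenproblem B a = lambda W a; take the top eigenvector.
    eigvals, eigvecs = eigh(B, W)
    return eigvecs[:, -1]                      # eigh returns eigenvalues in ascending order

# Hypothetical three-class data in two dimensions.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in ([0, 0], [2, 1], [4, 2])])
g = np.repeat([0, 1, 2], 50)
print(fisher_direction(X, g))
```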

  • Logistic Regression

    $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$, $k = 1, \dots, K-1$,

    so that

    $P(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$, $k = 1, \dots, K-1$, and $P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$.

    Let $P(G = k \mid X = x) = p_k(x; \theta)$. The log-likelihood is

    $\ell(\theta) = \sum_{i=1}^{N}\log p_{g_i}(x_i; \theta)$.

    In the two-class case, via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$ and $y_i = 0$ when $g_i = 2$:

    $\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\log p(x_i; \beta) + (1 - y_i)\log\left(1 - p(x_i; \beta)\right)\right\} = \sum_{i=1}^{N}\left\{y_i\beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right)\right\}$.

    Setting the score to zero,

    $\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\left(y_i - p(x_i; \beta)\right) = 0$,

    with Hessian

    $\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i; \beta)\left(1 - p(x_i; \beta)\right)$,

    and solve by the Newton-Raphson algorithm.

  • Logistic Regression

    In matrix notation: $\mathbf{y}$ denotes the vector of $y_i$ values; $\mathbf{X}$ the $N \times (p+1)$ matrix of $x_i$ values; $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i; \beta^{\mathrm{old}})$; and $\mathbf{W}$ an $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i; \beta^{\mathrm{old}})(1 - p(x_i; \beta^{\mathrm{old}}))$. The Newton step is

    $\beta^{\mathrm{new}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$, with adjusted response $\mathbf{z} = \mathbf{X}\beta^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$,

    i.e. iteratively reweighted least squares (IRLS).

    Because exact Newton updates get computationally demanding, many software packages use quadratic approximations to logistic regression, including for L1-regularized logistic regression.
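    A minimal NumPy sketch of the IRLS / Newton-Raphson update above for two-class logistic regression (hypothetical data; a fixed number of iterations stands in for a convergence check):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Two-class logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities p(x_i; beta_old)
        w = p * (1.0 - p)                          # diagonal of W
        z = X @ beta + (y - p) / w                 # adjusted response z
        # beta_new = (X^T W X)^{-1} X^T W z, a weighted least squares fit
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Hypothetical data: X includes a column of 1s for the intercept.
rng = np.random.default_rng(7)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(logistic_irls(X, y).round(2))
```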

  • LDA or Logistic Regression

    LDA: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$.

    Logistic regression: $\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$.

    Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions.

    Logistic regression fits the parameters by maximizing the conditional likelihood, the multinomial likelihood with probabilities $P(G = k \mid X)$, where $P(X)$ is ignored.

    LDA fits the parameters by maximizing the full log-likelihood, based on the joint density $P(X, G = k) = \phi(X; \mu_k, \Sigma)\pi_k$, where $P(X) = \sum_k P(X, G = k)$ does play a role. If we assume the $f_k(x)$ are Gaussian, this gives more efficient estimates.

    It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately.

  • Any Questions?