Chapter 11: Multiple Linear Regression


Post on 03-Jan-2016






  • Chapter 11

    Multiple Linear Regression

  • Our Group Members:

  • Content:

    Multiple Regression Model: Yifan Wang

    Statistical Inference: Shaonan Zhang & Yicheng Li

    Variable Selection Methods & SAS: Guangtao Li & Ruixue Wang

    Strategy for Building a Model and Data Transformation: Xiaoyu Zhang & Siyuan Luo

    Topics in Regression Modeling: Yikang Chai & Tao Li

    Summary: Xing Chen

  • Ch 11.1-11.3 Introduction to Multiple Linear Regression, by Yifan Wang, Dec. 6th, 2007

  • In Chapter 10 we studied how to fit a linear relationship between a response variable y and a single predictor variable x. Simple linear regression cannot handle a problem in which there are two or more predictor variables. For example, the salary of a company employee may depend on job category, years of experience, education, and performance evaluations.

  • We therefore extend the simple linear regression model to the case of two or more predictor variables. Multiple Linear Regression (or simply Multiple Regression) is the statistical methodology used to fit such models.

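  • Multiple Linear Regression

    In multiple regression we fit a model of the form (excluding the error term)

    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$,

    where $x_1, \ldots, x_k$ are predictor variables and $\beta_0, \beta_1, \ldots, \beta_k$ are k+1 unknown parameters. This model includes the kth degree polynomial model in a single variable x, namely

    $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$,

    since we can put $x_1 = x, x_2 = x^2, \ldots, x_k = x^k$. The model is called linear because it is linear in the parameters $\beta_j$, not necessarily in the x's.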

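  • 11.1 A Probabilistic Model for Multiple Linear Regression

    Regard the response variable as random. Regard the predictor variables as nonrandom. The data for multiple regression consist of n vectors of observations $(x_{i1}, x_{i2}, \ldots, x_{ik}; y_i)$ for i = 1, 2, ..., n.

    Example 1: the response variable $y_i$ is the salary of the ith person in the sample. The predictor variables are $x_{i1}$, his/her years of experience, and $x_{i2}$, his/her years of education.

  • Example 2: $y_i$ is the observed value of the r.v. $Y_i$, which depends on fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$ according to the following model:

    $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$ (i = 1, 2, ..., n),

    where $\epsilon_i$ is a random error with $E(\epsilon_i) = 0$, and $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters. Assume the $\epsilon_i$ are independent $N(0, \sigma^2)$ random variables. Then the $Y_i$ are independent $N(\mu_i, \sigma^2)$ random variables with

    $\mu_i = E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$.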

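  • 11.2 Fitting the Multiple Regression Model

    11.2.1 Least Squares (LS) Fit

    The LS estimates of the unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ minimize

    $Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})]^2$.

    The LS estimates can be found by setting the first partial derivatives of Q with respect to $\beta_0, \beta_1, \ldots, \beta_k$ equal to zero. The result is a set of simultaneous linear equations in (k+1) unknowns. The resulting solutions, $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$, are the least squares (LS) estimates of $\beta_0, \beta_1, \ldots, \beta_k$, respectively.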

  • 11.2.2 Goodness of Fit of the Model

    To assess the goodness of fit of the LS model, we use the residuals defined by

    $e_i = y_i - \hat y_i$ (i = 1, 2, ..., n),

    where the $\hat y_i$ are the fitted values:

    $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$.

    An overall measure of the goodness of fit is the error sum of squares (SSE):

    $SSE = \sum_{i=1}^{n} e_i^2$.

    Compare it to the total sum of squares (SST):

    $SST = \sum_{i=1}^{n} (y_i - \bar y)^2$.

    As in Chapter 10, define the regression sum of squares (SSR), given by

    $SSR = SST - SSE$.

  • The coefficient of multiple determination is $r^2 = SSR/SST = 1 - SSE/SST$. Values closer to 1 represent better fits. Adding predictor variables never decreases $r^2$ and generally increases it, so $r^2$ can be made to approach 1 simply by increasing the number of predictors.

    Multiple correlation coefficient (the positive square root of $r^2$):

    $r = +\sqrt{r^2}$

    Only the positive square root is used. r is a measure of the strength of the association between the predictor variables and the one response variable.

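  • 11.3 Multiple Regression Model in Matrix Notation

    The multiple regression model can be presented in a compact form by using matrix notation. Let

    $Y = (Y_1, \ldots, Y_n)'$, $y = (y_1, \ldots, y_n)'$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)'$

    be the n x 1 vectors of the r.v.s $Y_i$, their observed values $y_i$, and random errors $\epsilon_i$, respectively. Next let

    $X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}$

    be the n x (k+1) matrix of the values of predictor variables (with a leading column of 1s for the intercept).

  • Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)'$ be the (k + 1) x 1 vectors of unknown parameters and their LS estimates, respectively.

    The model can be rewritten as $Y = X\beta + \epsilon$. The simultaneous linear equations whose solution yields the LS estimates can be written in matrix notation as $X'X\hat\beta = X'y$. If the inverse of the matrix $X'X$ exists, then the solution is given by $\hat\beta = (X'X)^{-1}X'y$.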
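    The matrix recipe above can be tried out on a small case. The following is a minimal sketch in plain Python, not part of the original slides: the helper names (`solve`, `fit_ls`, `r_squared`) and the five made-up observations are assumptions chosen for illustration, and the data are constructed to lie exactly on the plane y = 1 + 2 x1 + 3 x2 so the result is easy to check by eye.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z


def fit_ls(x_rows, y):
    """Return the LS estimates by solving the normal equations X'X b = X'y."""
    X = [[1.0] + list(row) for row in x_rows]  # leading 1 for the intercept
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(x_rows, y, beta):
    """Coefficient of multiple determination, 1 - SSE/SST."""
    fitted = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in x_rows]
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data lying exactly on y = 1 + 2*x1 + 3*x2
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * a + 3 * b for a, b in x_rows]

beta_hat = fit_ls(x_rows, y)
r2 = r_squared(x_rows, y, beta_hat)
print(beta_hat, r2)
```

    Solving the normal equations directly mirrors the slide; for badly conditioned data a package routine would normally be preferred.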

  • 11.4 Statistical Inference, by Shaonan Zhang & Yicheng Li

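  • Statistical Inference on $\beta$'s: General Hypothesis Test

    To determine the statistical significance of the predictor variables, we test the hypotheses:

    $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$

    If we can't reject $H_{0j}$, then $x_j$ can be dropped from the model.

  • Statistical Inference on $\beta$'s: General Hypothesis Test

    Pivotal quantity. Recall that $\hat\beta_j \sim N(\beta_j, \sigma^2 v_{jj})$, where $v_{jj}$ is the jth diagonal entry of $V = (X'X)^{-1}$. The unbiased estimate of $\sigma^2$ is

    $S^2 = \frac{SSE}{n-(k+1)} = MSE$,

    where n-(k+1) is the error degrees of freedom. Hence

    $T = \frac{\hat\beta_j - \beta_j}{S\sqrt{v_{jj}}} \sim t_{n-(k+1)}$.

  • Statistical Inference on $\beta$'s: General Hypothesis Test

    Confidence interval for $\beta_j$. Note that

    $P\left(-t_{n-(k+1),\alpha/2} \le \frac{\hat\beta_j - \beta_j}{S\sqrt{v_{jj}}} \le t_{n-(k+1),\alpha/2}\right) = 1 - \alpha$.

    So the CI is $\hat\beta_j \pm t_{n-(k+1),\alpha/2}\, SE(\hat\beta_j)$, where $SE(\hat\beta_j) = s\sqrt{v_{jj}}$.

  • Statistical Inference on $\beta$'s: General Hypothesis Test

    Hypothesis test: $H_{0j}: \beta_j = \beta_j^0$ vs. $H_{1j}: \beta_j \neq \beta_j^0$.

    In particular, when $\beta_j^0 = 0$, we reject $H_{0j}$ if $|t_j| = \frac{|\hat\beta_j|}{SE(\hat\beta_j)} > t_{n-(k+1),\alpha/2}$.

    P(Reject $H_0$ | $H_0$ is true) = $\alpha$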

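  • Statistical Inference on $\beta$'s: Another Hypothesis Test

    Hypotheses: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one j.

    Pivotal quantity: $F = \frac{MSR}{MSE} \sim F_{k,\,n-(k+1)}$ under $H_0$. Reject $H_0$ if $F > f_{k,\,n-(k+1),\,\alpha}$.

    P-value: if the P-value is less than $\alpha$, we reject $H_0$. In that case the previous tests on the individual $\beta_j$ are used to judge which predictors matter.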

  • Statistical Inference on $\beta$'s: Another Hypothesis Test

    ANOVA Table for Multiple Regression

    Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square               F
    (Source)              (SS)             (d.f.)               (MS)
    Regression            SSR              k                    MSR = SSR/k               F = MSR/MSE
    Error                 SSE              n - (k+1)            MSE = SSE/[n - (k+1)]
    Total                 SST              n - 1
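    As a quick check of how the ANOVA entries fit together, here is a small sketch (not from the slides) that fills in the table from SST and SSE. The function name `anova_f` is an assumption; the input figures are the rounded values SAS reports for the two-predictor standby-hours model later in these slides (SST = 56465, SSE = 28802, n = 26, k = 2).

```python
def anova_f(sst, sse, n, k):
    """Return (SSR, MSR, MSE, F) for a regression with k predictors and n observations."""
    ssr = sst - sse
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return ssr, msr, mse, msr / mse


# Rounded sums of squares taken from the SAS listing for the model with x1 and x2
ssr, msr, mse, f_stat = anova_f(sst=56465, sse=28802, n=26, k=2)
print(ssr, round(msr, 1), round(mse, 2), round(f_stat, 2))
```

    Because the inputs are rounded, the F value agrees with the SAS figure of 11.05 only to about two decimal places.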

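  • Statistical Inference on $\beta$'s: Test Subsets of Parameters

    Full model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$

    Partial model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$

    Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one j with k-m+1 ≤ j ≤ k.

    Test statistic: $F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]}$

    Reject $H_0$ when $F > f_{m,\,n-(k+1),\,\alpha}$.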

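  • Prediction of Future Observations

    Let $x^* = (1, x_1^*, \ldots, x_k^*)'$ be a specified vector of predictor values, and let $\mu^* = E(Y^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^* = x^{*\prime}\beta$.

    Whether we want a CI (Confidence Interval) for $\mu^*$ or a PI (Prediction Interval) for $Y^*$, the point estimate is the same: $\hat\mu^* = \hat Y^* = x^{*\prime}\hat\beta$.

    Pivotal quantity: $T = \frac{\hat\mu^* - \mu^*}{s\sqrt{x^{*\prime} V x^*}} \sim t_{n-(k+1)}$, where $V = (X'X)^{-1}$.

    A $(1-\alpha)$-level CI to estimate $\mu^*$: $\hat\mu^* \pm t_{n-(k+1),\alpha/2}\, s\sqrt{x^{*\prime} V x^*}$

    A $(1-\alpha)$-level PI to predict $Y^*$: $\hat Y^* \pm t_{n-(k+1),\alpha/2}\, s\sqrt{1 + x^{*\prime} V x^*}$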

  • 11.7 Variable Selection Methods

    Guangtao Li, RuiXue Wang

  • 1. Why do we need variable selection methods?

    2. Two methods are introduced: Stepwise Regression and Best Subsets Regression

  • 11.7.1 Stepwise Regression, by Guangtao Li

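  • Recall the test for subsets of parameters in 11.4.

    Full model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$ (i = 1, 2, ..., n)

    Partial model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$ (i = 1, 2, ..., n)

    Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one j with k-m+1 ≤ j ≤ k.

    Reject $H_0$ when $F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} > f_{m,\,n-(k+1),\,\alpha}$.

  • (p-1)-variable model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$

    p-variable model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{ip} + \epsilon_i$

  • Partial F-test: $H_{0p}: \beta_p = 0$ vs. $H_{1p}: \beta_p \neq 0$, with test statistic

    $F_p = \frac{SSE_{p-1} - SSE_p}{SSE_p/[n-(p+1)]}$

    Reject $H_{0p}$ if $F_p > f_{1,\,n-(p+1),\,\alpha}$.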
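    The partial F statistic is simple arithmetic on two error sums of squares. The sketch below is illustrative only: the function name `partial_f` is an assumption, and the inputs are the rounded SSE values from the SAS stepwise listing that appears later in these slides (SSE = 35797 with x1 alone, SSE = 28802 with x1 and x2, 23 error d.f. for the larger model).

```python
def partial_f(sse_reduced, sse_full, m, df_error_full):
    """Partial F statistic for dropping m predictors from the full model."""
    return ((sse_reduced - sse_full) / m) / (sse_full / df_error_full)


# Does x2 add anything once x1 is in the model? (rounded figures from the SAS output)
f_x2 = partial_f(sse_reduced=35797, sse_full=28802, m=1, df_error_full=23)
print(round(f_x2, 2))
```

    The value should agree with the F of 5.59 that SAS reports for x2 at Step 2, up to the rounding of the inputs.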

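  • Partial Correlation Coefficients

    $r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}}$, so that $F_p = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$.

    We should add $x_p$ to the regression equation only if $F_p$ is large enough, i.e., only if the partial correlation $r_{yx_p|x_1,\ldots,x_{p-1}}$ is statistically significant.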

  • Stepwise Regression Algorithm

  • Example: The Director of Broadcasting Operations for a television station wants to study the issue of standby hours, which are hours where unionized graphic artists at the station are paid but are not actually involved in any activity. We are trying to predict the total number of Standby Hours per Week (Y). Possible explanatory variables are: Total Staff Present (X1), Remote Hours (X2), Dubner Hours (X3) and Total Labor Hours (X4). The results for 26 weeks are given below.

    SAS Program for the Algorithm

  • /* Standby-hours data: y = standby hours, x1-x4 = candidate predictors */
    data test;
      input y x1 x2 x3 x4;
      datalines;
    245 338 414 323 2001
    177 333 598 340 2030
    271 358 656 340 2226
    211 372 631 352 2154
    196 339 528 380 2078
    135 289 409 339 2080
    195 334 382 331 2073
    118 293 399 311 1758
    116 325 343 328 1624
    147 311 338 353 1889
    154 304 353 518 1988
    146 312 289 440 2049
    115 283 388 276 1796
    161 307 402 207 1720
    274 322 151 287 2056
    245 335 228 290 1890
    201 350 271 355 2187
    183 339 440 300 2032
    237 327 475 284 1856
    175 328 347 337 2068
    152 319 449 279 1813
    188 325 336 244 1808
    188 322 267 253 1834
    197 317 235 272 1973
    261 315 164 223 1839
    232 331 270 272 1935
    run;

    /* Stepwise selection among x1-x4 */
    proc reg data=test;
      model y = x1 x2 x3 x4 / SELECTION=stepwise;
    run;

  • Selected SAS Output

    Stepwise Selection: Step 1

    Variable x1 Entered: R-Square = 0.3660 and C(p) = 13.3215

    Analysis of Variance

                                Sum of          Mean
    Source              DF     Squares        Square   F Value   Pr > F
    Model                1       20667         20667     13.86   0.0011
    Error               24       35797    1491.55073
    Corrected Total     25       56465

                 Parameter     Standard
    Variable      Estimate        Error   Type II SS   F Value   Pr > F
    Intercept   -272.38165    124.24020   7169.17926      4.81   0.0383
    x1             1.42405      0.38256        20667     13.86   0.0011

    Bounds on condition number: 1, 1

  • Stepwise Selection: Step 2

    Variable x2 Entered: R-Square = 0.4899 and C(p) = 8.4193

    Analysis of Variance

                                Sum of          Mean
    Source              DF     Squares        Square   F Value   Pr > F
    Model                2       27663         13831     11.05   0.0004
    Error               23       28802    1252.26402
    Corrected Total     25       56465

                 Parameter     Standard
    Variable      Estimate        Error   Type II SS   F Value   Pr > F
    Intercept   -330.67483    116.48022        10092      8.06   0.0093
    x1             1.76486      0.37904        27149     21.68   0.0001
    x2            -0.13897      0.05880   6995.14489      5.59   0.0269

  • SAS Output (cont.)

    All variables left in the model are significant at the 0.1500 level.

    No other variable met the 0.1500 significance level for entry into the model.

    Summary of Stepwise Selection

            Variable   Variable   Number     Partial      Model
    Step    Entered    Removed    Vars In   R-Square   R-Square      C(p)   F Value   Pr > F
       1    x1                          1     0.3660     0.3660   13.3215     13.86   0.0011
       2    x2                          2     0.1239     0.4899    8.4193      5.59   0.0269
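    To make the stepwise idea concrete, here is a rough sketch of the procedure in plain Python, run on the same 26 weeks of data. It is an illustration under stated assumptions rather than a reproduction of SAS: PROC REG uses significance levels (0.15) for entry and removal, whereas this sketch uses fixed F-to-enter and F-to-remove cutoffs (`f_in = f_out = 4.0`, hypothetical values), and all function names are made up. The sketch assumes `f_out <= f_in`.

```python
DATA = """
245 338 414 323 2001  177 333 598 340 2030  271 358 656 340 2226  211 372 631 352 2154
196 339 528 380 2078  135 289 409 339 2080  195 334 382 331 2073  118 293 399 311 1758
116 325 343 328 1624  147 311 338 353 1889  154 304 353 518 1988  146 312 289 440 2049
115 283 388 276 1796  161 307 402 207 1720  274 322 151 287 2056  245 335 228 290 1890
201 350 271 355 2187  183 339 440 300 2032  237 327 475 284 1856  175 328 347 337 2068
152 319 449 279 1813  188 325 336 244 1808  188 322 267 253 1834  197 317 235 272 1973
261 315 164 223 1839  232 331 270 272 1935
"""
nums = [float(v) for v in DATA.split()]
rows = [nums[i:i + 5] for i in range(0, len(nums), 5)]
y = [r[0] for r in rows]
cols = [[r[j] for r in rows] for j in range(1, 5)]  # columns x1..x4
names = ["x1", "x2", "x3", "x4"]


def solve(a, b):
    """Gaussian elimination with partial pivoting for the square system a z = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z


def sse(subset):
    """SSE of y regressed (with intercept) on the predictors indexed by subset."""
    n = len(y)
    yc = [v - sum(y) / n for v in y]  # centering removes the intercept
    if not subset:
        return sum(v * v for v in yc)
    xc = [[v - sum(cols[j]) / n for v in cols[j]] for j in subset]
    a = [[sum(u * v for u, v in zip(p, q)) for q in xc] for p in xc]
    b = [sum(u * v for u, v in zip(p, yc)) for p in xc]
    beta = solve(a, b)
    return sum((yc[i] - sum(bj * c[i] for bj, c in zip(beta, xc))) ** 2 for i in range(n))


def partial_f(subset, j):
    """Partial F for predictor j within the model whose predictors are subset."""
    full = sse(subset)
    reduced = sse([v for v in subset if v != j])
    return (reduced - full) / (full / (len(y) - len(subset) - 1))


def stepwise(f_in=4.0, f_out=4.0):
    chosen = []
    while True:
        scored = [(partial_f(chosen + [j], j), j) for j in range(len(cols)) if j not in chosen]
        if not scored or max(scored)[0] < f_in:
            return chosen
        chosen.append(max(scored)[1])  # enter the candidate with the largest partial F
        while len(chosen) > 1:         # then drop the weakest variable if it fell below f_out
            worst = min((partial_f(chosen, j), j) for j in chosen)
            if worst[0] >= f_out:
                break
            chosen.remove(worst[1])


selected = [names[j] for j in stepwise()]
print(selected)
```

    With these cutoffs the sketch is expected to enter x1 first and then x2, matching the order in the SAS summary; different cutoffs can give a different model.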

  • 11.7.2 Best Subsets Regression

  • 11.7.2 Best Subsets Regression

    In practice there are often several almost equally good models, and the choice of the final model may depend on side considerations such as the number of variables, the ease of observing and/or controlling variables, etc. The best subsets regression algorithm permits determination of a specified number of best subsets of size p = 1, 2, ..., k, from which the choice of the final model can be made by the investigator.

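  • 11.7.2 Best Subsets Regression: Optimality Criteria

    $r_p^2$-Criterion: $r_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST}$

    Adjusted $r_p^2$-Criterion: $r_{adj,p}^2 = 1 - \frac{SSE_p/[n-(p+1)]}{SST/(n-1)} = 1 - \frac{MSE_p}{MST}$

  • $C_p$-Criterion (recommended for its ease of computation and its ability to judge the predictive power of a model): the standardized mean square error of prediction is

    $\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n} E[\hat Y_{ip} - E(Y_i)]^2$.

    The sample estimator, Mallows' $C_p$-statistic, is given by

    $C_p = \frac{SSE_p}{\hat\sigma^2} + 2(p+1) - n$,

    where $\hat\sigma^2$ is the MSE of the full model. $C_p$ is an almost unbiased estimator of $\Gamma_p$.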

  • $PRESS_p$ Criterion: the total prediction error sum of squares (PRESS) is

    $PRESS_p = \sum_{i=1}^{n} (y_i - \hat y_{(i)p})^2$,

    where $\hat y_{(i)p}$ is the predicted value for the ith observation from the p-variable model fitted without that observation.

    This criterion evaluates the predictive ability of a postulated model by omitting one observation at a time, fitting the model based on the remaining observations, and computing the predicted value for the omitted observation.

    The $PRESS_p$ criterion is intuitively easier to grasp than the $C_p$-criterion, but it is computationally much more intensive and is not available in many packages.
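    The leave-one-out idea behind PRESS can be sketched in a few lines. This is a minimal illustration for a single predictor only; the function names and the six data points are made up for the example and do not come from the slides.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def sse(x, y):
    """Ordinary error sum of squares using all observations."""
    b0, b1 = fit_line(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def press(x, y):
    """Refit with each observation left out and sum the squared prediction errors."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Hypothetical data, roughly linear with some scatter
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(sse(x, y), press(x, y))
```

    PRESS is never smaller than SSE, because each point is predicted by a model that did not see it; the n refits are what make the criterion expensive for large problems.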

  • SAS Program

    /* Same standby-hours data as in the stepwise example */
    data test;
      input y x1 x2 x3 x4;
      datalines;
    245 338 414 323 2001
    177 333 598 340 2030
    271 358 656 340 2226
    211 372 631 352 2154
    196 339 528 380 2078
    135 289 409 339 2080
    195 334 382 331 2073
    118 293 399 311 1758
    116 325 343 328 1624
    147 311 338 353 1889
    154 304 353 518 1988
    146 312 289 440 2049
    115 283 388 276 1796
    161 307 402 207 1720
    274 322 151 287 2056
    245 335 228 290 1890
    201 350 271 355 2187
    183 339 440 300 2032
    237 327 475 284 1856
    175 328 347 337 2068
    152 319 449 279 1813
    188 325 336 244 1808
    188 322 267 253 1834
    197 317 235 272 1973
    261 315 164 223 1839
    232 331 270 272 1935
    run;

    /* All subsets, reporting R-square, adjusted R-square, C(p) and MSE */
    proc reg data=test;
      model y = x1 x2 x3 x4 / SELECTION=RSQUARE adjrsq CP mse;
    run;

  • Results

    Number in              Adjusted
    Model       R-Square   R-Square      C(p)          MSE   Variables in Model
    1             0.3660     0.3396   13.3215   1491.55073   x1
    1             0.1710     0.1365   24.1846   1950.27491   x4
    1             0.0597     0.0205   30.3884   2212.24598   x3
    1             0.0091     -.0322   33.2078   2331.30545   x2
    ---------------------------------------------------------------------------
    2             0.4899     0.4456    8.4193   1252.26402   x1 x2
    2             0.4499     0.4021   10.6486   1350.49234   x1 x3
    2             0.4288     0.3791   11.8231   1402.24672   x3 x4
    2             0.3754     0.3211   14.7982   1533.34044   x1 x4
    2             0.2238     0.1563   23.2481   1905.67595   x2 x4
    2             0.0612     -.0205   32.3067   2304.83375   x2 x3
    ---------------------------------------------------------------------------
    3             0.5378     0.4748    7.7517   1186.29444   x1 x3 x4
    3             0.5362     0.4729    7.8418   1190.44739   x1 x2 x3
    3             0.5092     0.4423    9.3449   1259.69053   x1 x2 x4
    3             0.4591     0.3853   12.1381   1388.36444   x2 x3 x4
    ---------------------------------------------------------------------------
    4             0.6231     0.5513    5.0000   1013.46770   x1 x2 x3 x4
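    The C(p) and adjusted R-square columns can be recomputed from the MSE column, which is a useful sanity check on the criteria. The sketch below is illustrative: the function names are assumptions, the MSE values are copied from the listing, and SST = 56465 is the rounded corrected total from the stepwise output.

```python
N = 26                   # number of weeks
MSE_FULL = 1013.46770    # MSE of the four-predictor model, used as the estimate of sigma^2
SST = 56465              # corrected total sum of squares (rounded in the SAS output)


def mallows_cp(mse_p, p, n=N, mse_full=MSE_FULL):
    """C_p = SSE_p / sigma_hat^2 + 2(p+1) - n, with SSE_p recovered from MSE_p."""
    sse_p = mse_p * (n - (p + 1))
    return sse_p / mse_full + 2 * (p + 1) - n


def adjusted_r2(mse_p, n=N, sst=SST):
    """Adjusted r^2 = 1 - MSE_p / MST, where MST = SST / (n - 1)."""
    return 1 - mse_p / (sst / (n - 1))


cp_x1 = mallows_cp(1491.55073, p=1)      # model with x1 only
cp_x1x2 = mallows_cp(1252.26402, p=2)    # model with x1 and x2
adj_x1 = adjusted_r2(1491.55073)
print(round(cp_x1, 4), round(cp_x1x2, 4), round(adj_x1, 4))
```

    Note that the full model always has C(p) equal to its number of parameters (5.0000 here), so the criterion is only informative for the smaller subsets.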

  • 11.7.2 Best Subsets Regression & SAS

    The source of the example is

  • 11.5, 11.8 Building a Multiple Regression Model, by SiYuan Luo & Xiaoyu Zhang

  • Building a multiple regression model consists of 7 steps.

    Though it is not necessary to follow each and every step in exact sequence shown on the next slide, the general approach and major steps should be followed.

    Model building is an iterative process; it may take several cycles of the steps before arriving at the final model.


  • The 7 steps

    1. Decide the type

    2. Collect the data

    3. Explore the data

    4. Divide the data

    5. Fit candidate models

    6. Select and evaluate

    7. Select the final model

  • Step 1: Decide the type

    Decide the type of model needed. Different types of models include:

    Predictive: a model used to predict the response variable from a chosen set of predictor variables.

    Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.

    Control: a model used to control a response variable by manipulating predictor variables.

    Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.

    Data summary: a model used primarily as a device to summarize a large set of data by a single equation.

    Often a model can be used for multiple purposes. The type of model dictates the type of data needed.

  • Step 2: Collect the data

    Decide the variables (predictor and response) on which to collect data. Measurement of the variables should be done the right way depending on the type of subject.

    See chapter 3 for precautions necessary to obtain relevant, bias-free data.

  • Step 3: Explore the data

    The data should be examined for outliers, gross errors, missing values, etc. on a univariate basis using the techniques discussed in chapter 4. Outliers cannot just be omitted because much useful information can be lost. See chapter 10 for how to deal with outliers.

    Scatter plots should be made to study bivariate relationships between the response variable and each of the predictors. They are useful in suggesting possible transformations to linearize the relationships.

  • Step 4: Divide the data

    Divide the data into training and test sets: only a subset of the data, the training set, should be used to fit the model (steps 5 and 6); the remainder, called the test set, should be used for cross-validation of the fitted model (step 7).

    The reason for using an independent data set to test the model is that if the same data are used for both fitting and testing, then an overoptimistic estimate of the predictive ability of the fitted model is obtained.

    The split for the two sets should be done randomly.
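    A random split of this kind might look like the sketch below. The function name, the 30% test fraction, and the fixed seed are all assumptions made for illustration; the slides do not prescribe particular values.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into a training set and a test set."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(len(rows) * (1 - test_fraction)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical example: 26 observation labels, e.g. the weeks of the standby-hours study
weeks = list(range(1, 27))
train, test = split_train_test(weeks)
print(len(train), len(test))
```

    Every observation lands in exactly one of the two sets, so the test set stays independent of the data used for fitting.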

  • Step 5: Fit candidate models

    Generally several equally good models can be identified using the training data set.

    By conducting several stepwise regression runs with different FIN and FOUT values, we can identify several models that fit the training set well.

  • Step 6: Select and evaluate

    From the list of candidate models we are now ready to select two or three good models based on criteria such as the Cp-statistic, the number of predictors (p), and the nature of predictors.

    These selected models should be checked for violation of model assumptions using standard diagnostic techniques, in particular, residual plots. Transformations in the response variable or some of the predictor variables may be necessary to improve model fits.

  • Step 7: Select the final model

    This is the step where we compare competing models by cross-validating them against the test data.

    The model with the smaller cross-validation SSE is the better predictive model.

    The final selection of the model is based on a number of considerations, both statistical and nonstatistical. These include residual plots, outliers, parsimony, relevance, and ease of measurement of predictors. A final test of any model is that it makes practical sense and the client is willing to buy it.

  • Regression Diagnostics (Step VI)

    Graphical Analysis of Residuals

    Plot estimated errors vs. Xi values. The estimated errors, the differences between actual Yi and predicted Yi, are called residuals.

    Plot a histogram or stem-and-leaf display of the residuals.

    Purposes: examine functional form (linearity); evaluate violations of assumptions.

  • Linear Regression Assumptions

    Mean of Probability Distribution of Error Is 0

    Probability Distribution of Error Has Constant Variance

    Probability Distribution of Error is Normal

    Errors Are Independent

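  • Residual Plot for Functional Form (Linearity)

    [Figure: two residual plots, one labeled "Add X^2 Term" and one labeled "Correct Specification"]

  • Residual Plot for Equal Variance

    [Figure: two residual plots, one labeled "Unequal Variance" (fan-shaped) and one labeled "Correct Specification"]

    Standardized residuals are typically used (residual divided by standard error of prediction).

  • Residual Plot for Independence

    [Figure: two residual plots, one labeled "Not Independent" and one labeled "Correct Specification"]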

  • Data transformations

    Why do we need data transformations?

    They can make seemingly nonlinear models linear.

    Sometimes a transformation gives a better explanation of the variation in the data.

  • How do we do the data transformations?

    Power family of transformations on the response: the Box-Cox method.

    Requirements: all the data are positive, and the ratio of the largest observed Y to the smallest is at least 10.

  • Transformation form:

    $V = \frac{Y^\lambda - 1}{\lambda \dot{Y}^{\lambda - 1}}$ for $\lambda \neq 0$, and $V = \dot{Y} \ln Y$ for $\lambda = 0$,

    where $\dot{Y} = (Y_1 Y_2 \cdots Y_n)^{1/n}$ is the geometric mean of the $Y_i$.

  • How to estimate $\lambda$

    1. Choose a value of $\lambda$ from a selected range. Usually we look for it in the range (-1, 1), and we would usually cover the selected range with about 11-21 values of $\lambda$.

    2. For each $\lambda$ value, evaluate V by applying each Y to the formula above. This creates a vector $V = (V_1, \ldots, V_n)$; use it to fit a linear model by the least squares method. Record the residual sum of squares $S(\lambda, V)$ for the regression.

    3. Plot $S(\lambda, V)$ versus $\lambda$. Draw a smooth curve through the plotted points, and find at what value of $\lambda$ the lowest point of the curve lies. That value, $\hat\lambda$, is the maximum likelihood estimate of $\lambda$.
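    The three steps above can be sketched as a simple grid search. The code below is an illustration under assumptions, not the presenters' program: it uses a single predictor, a grid of 21 values from -1 to 1, and made-up data generated so that ln Y is exactly linear in x (so the search should land on $\lambda = 0$). All names are hypothetical.

```python
import math


def box_cox(y, lam):
    """Scaled power transform V; the geometric mean keeps the SSEs comparable across lambda."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if lam == 0:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]


def sse_line(x, v):
    """Residual sum of squares from a straight-line LS fit of v on x."""
    n = len(x)
    xbar, vbar = sum(x) / n, sum(v) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (vi - vbar) for xi, vi in zip(x, v)) / sxx
    b0 = vbar - b1 * xbar
    return sum((vi - (b0 + b1 * xi)) ** 2 for xi, vi in zip(x, v))


def best_lambda(x, y, grid):
    """Return the grid value of lambda with the smallest residual sum of squares."""
    return min(grid, key=lambda lam: sse_line(x, box_cox(y, lam)))


# Hypothetical data: Y grows exponentially in x, so the log transform linearizes it
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]
grid = [i / 10 for i in range(-10, 11)]   # -1.0, -0.9, ..., 1.0

lam_hat = best_lambda(x, y, grid)
print(lam_hat)
```

    With real data the minimum is rarely exactly on a grid point, which is why the slides suggest drawing a smooth curve through the plotted points and then rounding to a convenient value.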

  • Example: The data in the table are part of a more extensive set given by Derringer (1974). This paper has been adapted with permission of John Wiley & Sons, Inc. We wish to find a transformation of the form $V = (Y^\lambda - 1)/\lambda$ for $\lambda \neq 0$, or $V = \ln Y$ for $\lambda = 0$, which will provide a good first-order fit to the data. Our model form is $V = \beta_0 + \beta_1 f + \beta_2 p + \epsilon$, where f is the filler level and p is the plasticizer level.

  • Note that the response data range from 157 to 13, a ratio of 157/13 = 12.1 > 10, hence a transformation on Y is likely to be effective. The geometric mean is $\dot{Y} = 41.5461$ for this set of data.

    The next table shows $S(\lambda, V)$ for selected values of $\lambda$. We pick 20 different values of $\lambda$ from (-1, 1) in this case.

  • A smooth curve through these points is plotted in the next figure. We see that the minimum occurs at about $\lambda = -0.05$. This is close to zero, suggesting the transformation $V = \dot{Y} \ln Y$, or more simply $V = \ln Y$.

  • Applying the transformation to the original data gives a set of data that are better linearly related. The best plane, fitted to these transformed data by least squares, is $\ln \hat{Y} = 3.212 + 0.03088f - 0.03152p$. The ANOVA table for this model is

  • If we had fitted a first-order model to the untransformed data, we would have obtained $\hat{Y} = 28.184 + 1.55f - 1.717p$. The ANOVA table for this model is

  • We find that the transformed model has a much larger F-value.

  • 11.6.1-11.6.3 Topics in Regression Modeling, by Yikang Chai & Tao Li

  • 11.6.1 Multicollinearity

    Definition: the columns of the X matrix are exactly or approximately linearly dependent. It means the predictor variables are related.

    Why are we concerned about it? It can cause serious numerical and statistical difficulties in fitting the regression model unless the extra predictor variables are deleted.

  • How does multicollinearity cause difficulties?

    Multicollinearity leads to the following problems:

    1. $X'X$ is nearly singular, which makes $\hat\beta$ numerically unstable. This is reflected in large changes in the magnitudes of the $\hat\beta_j$ with small changes in the data.

    2. The matrix $V = (X'X)^{-1}$ has very large elements. Therefore the $Var(\hat\beta_j) = \sigma^2 v_{jj}$ are large, which makes the $\hat\beta_j$ statistically nonsignificant.

  • Measures of Multicollinearity

    Three ways:

    1. The correlation matrix R. Easy, but it can't reflect linear relationships among more than two variables.

    2. The determinant of R can be used as a measure of the singularity of $X'X$.

    3. Variance Inflation Factors (VIF): the diagonal elements of $R^{-1}$. Generally, VIF > 10 is regarded as unacceptable.
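    For intuition about VIFs, the two-predictor case is easy to compute by hand: the diagonal of $R^{-1}$ is then $1/(1 - r_{12}^2)$ for both predictors. The sketch below covers only that special case; the function names and the two short data columns are made up for illustration.

```python
import math


def corr(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    sab = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    saa = sum((u - abar) ** 2 for u in a)
    sbb = sum((v - bbar) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = corr(x1, x2)
    return 1 / (1 - r * r)


# Hypothetical predictors with correlation 0.8
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(round(corr(x1, x2), 3), round(vif, 3))
```

    A correlation of 0.8 gives a VIF of about 2.8, well under the rule-of-thumb limit of 10; the correlation would need to exceed roughly 0.95 before that limit is reached.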

  • 11.6.2 Polynomial Regression

    Consider the special case: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \epsilon$

    Problems:

    1. The powers of x, i.e., $x, x^2, \ldots, x^k$, tend to be highly correlated.

    2. If k is large, the magnitudes of these powers tend to vary over a rather wide range.

    These problems lead to numerical errors.

  • How to solve these problems? Two ways:

    1. Center the x-variable, i.e., use $x - \bar{x}$ in place of x. This removes the non-essential multicollinearity in the data.

    2. Standardize the x-variable, i.e., use $(x - \bar{x})/s_x$. This alleviates the problem of x varying over a wide range.
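    The effect of centering is easy to see numerically. In this small sketch (made-up, evenly spaced x values; the helper name `corr` is an assumption), x and x^2 are almost perfectly correlated before centering and uncorrelated afterwards.

```python
import math


def corr(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    sab = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    saa = sum((u - abar) ** 2 for u in a)
    sbb = sum((v - bbar) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


x = [float(v) for v in range(1, 10)]          # hypothetical x values 1..9
xbar = sum(x) / len(x)
xc = [v - xbar for v in x]                    # centered x

raw = corr(x, [v ** 2 for v in x])            # correlation of x with x^2
centered = corr(xc, [v ** 2 for v in xc])     # same after centering
print(round(raw, 4), round(centered, 4))
```

    The correlation drops all the way to zero here because the x values are symmetric about their mean; with asymmetric data centering typically reduces the correlation without eliminating it.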

  • 11.6.3 Dummy Predictor Variables

    This is a method for dealing with categorical variables.

    1. For ordinal categorical variables, such as the prognosis of a patient (poor, average, good), just assign numerical scores to the categories (poor=1, average=2, good=3).

    2. For a nominal variable with c >= 2 categories, use c-1 indicator variables, $x_1, \ldots, x_{c-1}$, called Dummy Variables, to code it.

  • How to code? Set $x_i = 1$ (and all other indicators to 0) for the ith category, i = 1, ..., c-1, and $x_1 = \cdots = x_{c-1} = 0$ for the cth category.

    Why don't we just use c indicator variables $x_1, x_2, \ldots, x_c$? Because there will be a linear dependency among them:

    $x_1 + x_2 + \cdots + x_c = 1$

    This will cause multicollinearity.

  • Example: The season of a year can be coded with three indicators: x1 (winter), x2 (spring), x3 (summer). With this coding, (1,0,0) stands for Winter, (0,1,0) for Spring, (0,0,1) for Summer, and (0,0,0) for Fall.

    Consider modeling the temperature of an area over a year as a function of the season (X) and its latitude (A). We get the following model:

    $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 A + \epsilon$

    For winter: $Y = (\beta_0 + \beta_1) + \beta_4 A + \epsilon$

    For spring: $Y = (\beta_0 + \beta_2) + \beta_4 A + \epsilon$

    For summer: $Y = (\beta_0 + \beta_3) + \beta_4 A + \epsilon$

    For fall: $Y = \beta_0 + \beta_4 A + \epsilon$
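    The season coding can be written as a small helper. This is a sketch with hypothetical names; it treats the last category in the list (fall) as the baseline that is coded by all zeros, as in the example above.

```python
SEASONS = ["winter", "spring", "summer", "fall"]  # the last entry is the baseline


def dummy_code(category, categories=SEASONS):
    """Return the c-1 indicator values for one observation."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return tuple(1 if category == c else 0 for c in categories[:-1])


for season in SEASONS:
    print(season, dummy_code(season))
```

    Using c-1 indicators rather than c avoids the exact linear dependency with the intercept column described above.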

  • Logistic Regression Model

    1938, by R. A. Fisher and Frank Yates: logistic transform for analyzing binary data.

  • Logistic Regression Model: The Importance of the Logistic Regression Model

    The logistic regression model is the most popular model for binary data.

    The logistic regression model is generally used for binary response variables: Y = 1 (true, success, YES, etc.), while Y = 0 (false, failure, NO, etc.).

  • Logistic Regression Model: Details of the Regression Model

    Main step: consider a response variable Y taking the value 0 or 1 and a single predictor variable x. Model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of x:

    $\ln\left[\frac{P(Y=1|x)}{1 - P(Y=1|x)}\right] = \beta_0 + \beta_1 x$

  • Logistic Regression Model: Example

    (i)    (ii)           (iii)          (iv)        (v)            (vi)          (vii)
           Instances of   Instances of   Total       Y as Observed  Y as Odds     Y as Log
    X      Y Coded as 0   Y Coded as 1   (ii+iii)    Probability    Ratio         Odds Ratio
    28     4              2              6           .3333          .5000         -.6931
    29     3              2              5           .4000          .6667         -.4055
    30     2              7              9           .7778          3.5000        1.2528
    31     2              7              9           .7778          3.5000        1.2528
    32     4              16             20          .8000          4.0000        1.3863
    33     1              14             15          .9333          14.0000       2.6391
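    Columns (iv) to (vii) of the example follow directly from the two count columns, as this short sketch shows. The variable and function names are assumptions; the counts are those in the example.

```python
import math

x_values = [28, 29, 30, 31, 32, 33]
count_y0 = [4, 3, 2, 2, 4, 1]      # column (ii): instances with Y = 0
count_y1 = [2, 2, 7, 7, 16, 14]    # column (iii): instances with Y = 1


def summarize(n0, n1):
    """Return (total, observed probability, odds, log odds) for one x value."""
    total = n0 + n1
    prob = n1 / total
    odds = n1 / n0                 # same as prob / (1 - prob)
    return total, prob, odds, math.log(odds)


table = [summarize(n0, n1) for n0, n1 in zip(count_y0, count_y1)]
for x, (total, prob, odds, log_odds) in zip(x_values, table):
    print(x, total, round(prob, 4), round(odds, 4), round(log_odds, 4))
```

    The log odds in the last column are the values on which the linear fit in the logistic model operates.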

  • Logistic Regression Model

    [Figure: two fitted curves, A. Ordinary Linear Regression and B. Logistic Regression]

  • Logistic Regression Model

    [Figure: Weighted Linear Regression of Observed Log Odds Ratios on X]


  • Logistic Regression Model: Properties of the Regression Model

    E(Y|x) = P(Y=1|x) * 1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x. This is not true if we use the linear model $P(Y=1|x) = \beta_0 + \beta_1 x$.

    In logistic regression, the regression coefficient $\beta_1$ has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x.

    Extension to multiple predictor variables: $\ln\left[\frac{P(Y=1|x_1, \ldots, x_k)}{1 - P(Y=1|x_1, \ldots, x_k)}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$

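  • Standardized Regression Coefficients

    Why do we need standardized regression coefficients? Recall the regression equation for the linear regression model:

    $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$

    The magnitudes of the $\hat\beta_j$ cannot be directly used to judge the relative effects of the $x_j$ on y, because they depend on the units in which the variables are measured. By using standardized regression coefficients, we may be able to judge the importance of different predictors.

  • Standardized Regression Coefficients

    Standardized transform: $y_i^* = \frac{y_i - \bar{y}}{s_y}$, $x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{s_{x_j}}$

    Standardized regression coefficients: $\hat\beta_j^* = \hat\beta_j \left(\frac{s_{x_j}}{s_y}\right)$ (j = 1, 2, ..., k)

  • Standardized Regression Coefficients: Example (industrial sales data from the textbook)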

    Linear Model:The regression equation:

    Notice: but thus has a much larger effect than on y

  • Chapter Summary

    Multiple linear regression model

    Fitting the multiple regression model: least squares fit; goodness of fit of the model (SSE, SST, SSR, r^2)

    Statistical inference for multiple regression:

    1. t-test of $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$

    2. F-test of $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one j

    3. F-test of $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one j with k-m+1 ≤ j ≤ k

    How do we select variables (SAS)? Stepwise regression, with its fancy algorithm; best subsets regression, which is more realistic and flexible

    What if the data are not linear? Data transformation

    Building a multiple regression model: 7 steps

  • We very much appreciate your attention =) Please feel free to ask questions.

  • The End

    Thank You!

    Galton example of his work: height of sons of 71 fathers is 67, height of sons of 64 fathers is 67.

    SST is the SSE obtained when fitting the model Yi = B0 + ei, which ignores all the x's.

    R^2 = 0.5 means 50% of the variation in y is accounted for by x (in this case, all the x's).