Multiple Linear Regression Review



  • Multiple Linear Regression Review

  • Outline
    Simple Linear Regression
    Multiple Regression
    Understanding the Regression Output
    Coefficient of Determination R²
    Validating the Regression Model

  • Linear Regression: An Example
    [Table: Appleglo first-year advertising expenditures ($ millions) and first-year sales ($ millions) for 14 regions: Maine, New Hampshire, Vermont, Massachusetts, Connecticut, Rhode Island, New York, New Jersey, Pennsylvania, Delaware, Maryland, West Virginia, Virginia, Ohio]
    Questions:
    a) How do we relate advertising expenditure to sales?
    b) What are the expected first-year sales if advertising expenditure is $2.2 million?
    c) How confident are we in that estimate? How good is the fit?

  • The Basic Model: Simple Linear Regression
    Data: (x1, y1), (x2, y2), . . . , (xn, yn)
    Model of the population: Yi = β0 + β1 xi + εi, where ε1, ε2, . . . , εn are i.i.d. random variables ~ N(0, σ).
    This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.
    Comments:
    E(Yi | xi) = β0 + β1 xi; SD(Yi | xi) = σ
    The relationship is linear (described by a line).
    β0 = baseline value of Y (i.e., value of Y when x = 0)
    β1 = slope of the line (average change in Y per unit change in x)

  • How do we choose the line that best fits the data?
    Regression coefficients: b0 and b1 are estimates of β0 and β1.
    Regression estimate for Y at xi (prediction): ŷi = b0 + b1 xi
    Residual (error): ei = yi - ŷi
    The best regression line chooses b0 and b1 to minimize the residual sum of squares: Σ (yi - b0 - b1 xi)²
    Best choices for the Appleglo data: b0 = 13.82, b1 = 48.60
    [Plot: First-Year Sales ($M) vs. Advertising Expenditures ($M), with fitted line]
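To make the fitting step concrete, here is a minimal NumPy sketch of the least-squares computation. The x and y arrays are made-up stand-ins, not the Appleglo data, so the printed coefficients will not match the slide's b0 = 13.82, b1 = 48.60.

```python
import numpy as np

# Hypothetical advertising (x, $M) and first-year sales (y, $M) pairs;
# the actual Appleglo numbers are not reproduced in this transcript.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([60.0, 85.0, 110.0, 140.0, 160.0, 185.0])

# Least squares: choose b0, b1 to minimize the residual sum of squares.
b1, b0 = np.polyfit(x, y, deg=1)  # returns (slope, intercept)
y_hat = b0 + b1 * x               # regression estimates
residuals = y - y_hat             # errors e_i = y_i - y_hat_i

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print("residual sum of squares:", np.sum(residuals ** 2))
```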

  • Example: Sales of Nature-Bar ($ million)
    [Table: sales, advertising, promotions, and competitors' sales for 15 regions: Selkirk, Susquehanna, Kittery, Acton, Finger Lakes, Berkshire, Central, Providence, Nashua, Dunster, Endicott, Five-Towns, Waldeboro, Jackson, Stowe]

  • Multiple Regression
    In general, many factors in addition to advertising expenditures affect sales. Multiple regression allows more than one x variable.
    Independent variables: x1, x2, . . . , xk (k of them)
    Data: (y1, x11, x21, . . . , xk1), . . . , (yn, x1n, x2n, . . . , xkn)
    Population model: Yi = β0 + β1 x1i + . . . + βk xki + εi, where ε1, ε2, . . . , εn are i.i.d. random variables ~ N(0, σ).

    Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk.
    Regression estimate of yi: ŷi = b0 + b1 x1i + . . . + bk xki
    Goal: choose b0, b1, . . . , bk to minimize the residual sum of squares, i.e., minimize: Σ (yi - ŷi)²
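A sketch of the same minimization with k = 3 x variables, using NumPy's least-squares solver; the data values below are invented for illustration, not taken from the Nature-Bar table.

```python
import numpy as np

# Hypothetical rows: advertising, promotions, competitors' sales.
X = np.array([[2.0, 1.0, 30.0],
              [3.0, 2.5, 25.0],
              [1.5, 1.0, 40.0],
              [4.0, 3.0, 20.0],
              [2.5, 2.0, 35.0]])
y = np.array([250.0, 380.0, 180.0, 460.0, 300.0])

# Prepend a column of ones so that b0 acts as the intercept.
A = np.column_stack([np.ones(len(y)), X])

# lstsq chooses b to minimize ||A b - y||^2, the residual sum of squares.
b, rss, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("b0, b1, b2, b3 =", b)
```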

  • Regression Output (from Excel)
    Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations

    Analysis of Variance: df, Sum of Squares, Mean Square, F, Significance F (rows for Regression, Residual, Total)
    Coefficient table: Coefficients, Standard Error, t Statistic, P-value, Lower 95%, Upper 95% (rows for Intercept, Advertising, Promotions, Competitors' Sales)
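The same output can be reproduced outside Excel. Below is a sketch using Python's statsmodels; the data are synthetic, generated from the fitted coefficients quoted later in these slides (65.705, 48.979, 59.654, -1.838; s = 17.60), not the original 15-region Nature-Bar numbers.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the Nature-Bar data, generated from the slides'
# fitted coefficients purely for illustration.
rng = np.random.default_rng(0)
n = 15
df = pd.DataFrame({
    "advertising": rng.uniform(1.0, 5.0, n),
    "promotions": rng.uniform(0.0, 3.0, n),
    "competitors_sales": rng.uniform(20.0, 50.0, n),
})
df["sales"] = (65.705 + 48.979 * df["advertising"] + 59.654 * df["promotions"]
               - 1.838 * df["competitors_sales"] + rng.normal(0.0, 17.60, n))

# OLS with an intercept; summary() prints coefficients, standard errors,
# t statistics, p-values, 95% intervals, R Square, and the ANOVA F test.
X = sm.add_constant(df[["advertising", "promotions", "competitors_sales"]])
print(sm.OLS(df["sales"], X).fit().summary())
```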

  • Understanding Regression Output
    1) Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk based on sample data. Fact: E[bj] = βj.
    Example:
    b0 = 65.705 (its interpretation is context dependent)
    b1 = 48.979 (an additional $1 million in advertising is expected to result in an additional $49 million in sales)
    b2 = 59.654 (an additional $1 million in promotions is expected to result in an additional $60 million in sales)
    b3 = -1.838 (an increase of $1 million in competitors' sales is expected to decrease sales by $1.8 million)

  • Understanding Regression Output, Continued
    2) Standard error s: an estimate of σ, the SD of each εi. It is a measure of the amount of noise in the model. Example: s = 17.60
    3) Degrees of freedom: #cases - #parameters; relates to the over-fitting phenomenon.
    4) Standard errors of the coefficients: sb0, sb1, . . . , sbk. They are just the standard deviations of the estimates b0, b1, . . . , bk, and are useful in assessing the quality of the coefficient estimates and validating the model.

  • R² takes values between 0 and 1 (it is a percentage).
    R² = 0: the x values account for none of the variation in the Y values.
    R² = 1: the x values account for all of the variation in the Y values.
    R² = 0.833 in our Appleglo example.

  • Understanding Regression Output, Continued
    5) Coefficient of determination R²: a measure of the overall quality of the regression. Specifically, it is the percentage of the total variation exhibited in the yi data that is accounted for by the sample regression line.
    The sample mean of Y: ȳ = (1/n) Σ yi
    Total variation in Y = Σ (yi - ȳ)²
    Residual (unaccounted) variation in Y = Σ (yi - ŷi)²
    R² = (variation accounted for by the x variables) / (total variation) = 1 - (variation not accounted for by the x variables) / (total variation)
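The definition translates directly to code; a small sketch:

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the total variation in y accounted for by the regression."""
    rss = np.sum((y - y_hat) ** 2)       # residual (unaccounted) variation
    tss = np.sum((y - np.mean(y)) ** 2)  # total variation about the sample mean
    return 1.0 - rss / tss

# Sanity checks: a perfect fit accounts for all variation (R^2 = 1),
# while always predicting the mean accounts for none (R^2 = 0).
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # 1.0
print(r_squared(y, np.full(4, y.mean())))   # 0.0
```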

  • Coefficient of Determination: R²
    A high R² means that most of the variation we observe in the yi data can be attributed to their corresponding x values, a desired property.
    In simple regression, R² is higher if the data points are better aligned along a line. But beware outliers: recall the Anscombe example.
    How high an R² is good enough depends on the situation (for example, the intended use of the regression and the complexity of the problem).
    Users of regression tend to be fixated on R², but it is not the whole story. It is important that the regression model is valid.

  • Coefficient of Determination: R²
    One should not include x variables unrelated to Y in the model just to make R² fictitiously high. (With more x variables there will be more freedom in choosing the bi's to make the residual variation closer to 0.)

    Multiple R is just the square root of R².

  • Validating the Regression Model
    Assumptions about the population:
    Yi = β0 + β1 x1i + . . . + βk xki + εi (i = 1, . . . , n), where ε1, ε2, . . . , εn are i.i.d. random variables ~ N(0, σ).
    1) Linearity: if k = 1 (simple regression), one can check visually from the scatter plot.

    Sanity check: do the signs of the coefficients make sense? Is there a reason for non-linearity?
    2) Normality of εi: plot a histogram of the residuals. Usually, results are fairly robust with respect to this assumption.

  • 3) Heteroscedasticity: do the error terms have constant standard deviation? (i.e., is SD(εi) = σ for all i?) Check scatter plots of the residuals vs. ŷ and vs. the x variables.
    [Plots: residuals vs. advertising expenditures; one shows no evidence of heteroscedasticity, the other shows evidence of it]
    May be fixed by introducing a transformation, or by introducing or eliminating some independent variables.

  • 4) Autocorrelation: are the error terms independent? Plot the residuals in order and check for patterns.
    [Time plots of residuals: one shows no evidence of autocorrelation, the other shows evidence of it]
    Autocorrelation may be present if observations have a natural sequential order (for example, time). May be fixed by introducing a variable or transforming a variable.
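The three residual checks above (normality, heteroscedasticity, autocorrelation) can all be eyeballed from plots; a sketch with made-up residuals, using matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up fitted values and residuals standing in for a real regression run.
rng = np.random.default_rng(2)
y_hat = np.linspace(50, 300, 30)
residuals = rng.normal(0, 17.6, 30)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(residuals, bins=8)        # normality: roughly bell-shaped?
axes[0].set_title("Histogram of residuals")

axes[1].scatter(y_hat, residuals)      # heteroscedasticity: constant spread?
axes[1].axhline(0, linestyle="--")
axes[1].set_title("Residuals vs. fitted")

axes[2].plot(residuals, marker="o")    # autocorrelation: patterns over order?
axes[2].axhline(0, linestyle="--")
axes[2].set_title("Residuals in order")

plt.tight_layout()
plt.show()
```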

  • Pitfalls and Issues
    1) Overspecification: including too many x variables to make R² fictitiously high. Rule of thumb: maintain n ≥ 5(k + 2).

    2) Extrapolating beyond the range of the data
    [Plot: fitted line extended beyond the observed range of advertising data]

  • Validating the Regression Model
    3) Multicollinearity: occurs when two of the x variables are strongly correlated. Can give very wrong estimates of the βi's.
    Tell-tale signs:
    - Regression coefficients (bi's) have the wrong sign.
    - Addition/deletion of an independent variable results in large changes of the regression coefficients.
    - Regression coefficients (bi's) are not significantly different from 0.
    May be fixed by deleting one or more independent variables.

  • Example
    [Table: graduate GPA, college GPA, and GMAT score by student number]

  • Regression Output
    [Two Excel outputs: one regressing graduate GPA on college GPA and GMAT, one on college GPA alone; each shows Coefficients, Standard Error, R Square, Standard Error, Observations]
    What happened? College GPA and GMAT are highly correlated! Eliminate GMAT.
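A quick way to spot this kind of multicollinearity is to compute pairwise correlations of the x variables before regressing; a sketch with invented GPA/GMAT numbers:

```python
import pandas as pd

# Invented values; in practice use your regression's x columns.
df = pd.DataFrame({
    "college_gpa": [3.1, 3.5, 2.9, 3.8, 3.3],
    "gmat":        [560, 640, 520, 700, 600],
})

# Pairwise correlations near +/-1 are a tell-tale sign of multicollinearity.
print(df.corr())
```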

  • Regression Models
    In linear regression, we choose the best coefficients b0, b1, . . . , bk as the estimates for β0, β1, . . . , βk. We know that on average each bj hits the right target βj. However, we also want to know how confident we are about our estimates.

  • Back to Regression Output
    [Excel output for the Nature-Bar regression: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations), Analysis of Variance (df, Sum of Squares, Mean Square), and the coefficient table (Coefficients, Standard Error, t Statistic, P-value, Lower 95%, Upper 95%) for Intercept, Advertising, Promotions, and Competitors' Sales]

  • Regression Output Analysis
    1) Degrees of freedom (dof): residual dof = n - (k + 1). (We used up (k + 1) degrees of freedom in forming the (k + 1) sample estimates b0, b1, . . . , bk.)
    2) Standard errors of the coefficients: sb0, sb1, . . . , sbk. They are just the SDs of the estimates b0, b1, . . . , bk.
    Fact: (bj - βj) / sbj obeys a t-distribution with dof = n - k - 1, the same dof as the residual. We will use this fact to assess the quality of our estimates bj.
    What is a 95% confidence interval for βj? Does the interval contain 0? Why do we care about this?

  • 3) t-Statistic: a measure of the statistical significance of each individual xj in accounting for the variability in Y.
    Let c be the number for which P(-c ≤ T ≤ c) = 95%, where T obeys a t-distribution with dof = n - k - 1.
    If |bj / sbj| > c, then the 95% C.I. for βj does not contain zero. In this case, we are 95% confident that βj is different from zero.
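A sketch of the t-statistic and 95% C.I. computation with SciPy. Here b_j is the advertising coefficient quoted earlier in these slides, while its standard error s_bj is an invented placeholder:

```python
from scipy import stats

n, k = 15, 3                 # observations and x variables (Nature-Bar sizes)
dof = n - k - 1              # residual degrees of freedom
b_j = 48.979                 # advertising coefficient from the slides
s_bj = 10.763                # hypothetical standard error, for illustration

t_stat = b_j / s_bj
c = stats.t.ppf(0.975, dof)  # P(-c <= T <= c) = 95% for T ~ t(dof)
ci = (b_j - c * s_bj, b_j + c * s_bj)
print(f"t = {t_stat:.2f}, c = {c:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
# Here |t| > c, so the 95% C.I. excludes zero: the coefficient is significant.
```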

  • Example: Executive Compensation
    [Table: pay ($1,000), years in position, change in stock price (%), change in sales (%), and MBA? by executive number]

  • Dummy Variables
    Often, some of the explanatory variables in a regression are categorical rather than numeric.
    If we think that whether an executive has an MBA affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise.
    If we think the season of the year is an important factor in determining sales, how do we create dummy variables? How many? What is the problem with creating 4 dummy variables?
    In general, if there are m categories an x variable can belong to, then we need to create m - 1 dummy variables for it.
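With pandas, the m - 1 rule is handled by drop_first; a sketch for the four-season example:

```python
import pandas as pd

# Hypothetical categorical column with m = 4 seasons.
df = pd.DataFrame({"season": ["winter", "spring", "summer", "fall", "winter"]})

# m = 4 categories -> m - 1 = 3 dummy columns; drop_first drops one category
# to avoid a redundant fourth dummy (perfectly collinear with the intercept).
dummies = pd.get_dummies(df["season"], drop_first=True)
print(dummies)
```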

  • OILPLUS data
    [Table: monthly heating oil consumption and temperature, August 1989 through December 1990]

  • [Plot: heating oil consumption (1,000 gallons) vs. average temperature (degrees Fahrenheit)]

  • [Plot and table: heating oil consumption vs. inverse temperature; columns month, heating oil, temperature, inverse temperature]
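A sketch of the transformation: regressing consumption on 1/temperature instead of temperature. The temperature/consumption pairs below are invented stand-ins for the OILPLUS data.

```python
import numpy as np

# Hypothetical (temperature in deg F, heating oil in 1,000 gallons) pairs.
temperature = np.array([40.0, 45.0, 50.0, 60.0, 70.0, 75.0])
heating_oil = np.array([140.0, 120.0, 105.0, 80.0, 60.0, 55.0])

# The raw relationship is curved; regressing on 1/temperature straightens it.
inv_temp = 1.0 / temperature
b1, b0 = np.polyfit(inv_temp, heating_oil, deg=1)
print(f"oil = {b0:.1f} + {b1:.1f} * (1/temperature)")
```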

  • The Practice of Regression
    Choose which independent variables to include in the model, based on common sense and context-specific knowledge.
    Collect data (create dummy variables if necessary).
    Run the regression: the easy part.
    Analyze the output and make changes in the model: this is where the action is.
    Test the regression result on out-of-sample data, as in the sketch below.
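A minimal out-of-sample (holdout) check, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 10 + 3x + noise; A includes an intercept column.
x = rng.uniform(1, 5, 20)
A = np.column_stack([np.ones(20), x])
y = 10 + 3 * x + rng.normal(0, 1, 20)

# Hold out the last 5 observations; fit only on the first 15.
A_train, y_train = A[:15], y[:15]
A_test, y_test = A[15:], y[15:]

b, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
pred = A_test @ b
print("out-of-sample RMSE:", np.sqrt(np.mean((y_test - pred) ** 2)))
```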

  • The Post-Regression Checklist
    1) Statistics checklist:
    Calculate the correlation between pairs of x variables: watch for evidence of multicollinearity.
    Check the signs of the coefficients: do they make sense?
    Check the 95% C.I.s (use t-statistics as a quick scan): are the coefficients significantly different from zero?
    R²: the overall quality of the regression, but not the only measure.

    2) Residual checklist:
    Normality: look at a histogram of the residuals.
    Heteroscedasticity: plot residuals against each x variable.
    Autocorrelation: if the data have a natural order, plot residuals in order and check for a pattern.

  • The Grand Checklist
    Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful.
    t-statistics: are the coefficients significantly different from zero? Look at the width of the confidence intervals.
    F-tests: for subsets of coefficients, and for equality of coefficients.
    R²: is it reasonably high in the context?
    Influential observations: outliers in predictor space and in dependent-variable space.
    Normality: plot a histogram of the residuals; studentized residuals.
    Heteroscedasticity: plot residuals against each x variable; transform if necessary (Box-Cox transformations).
    Autocorrelation: time series plot.
    Multicollinearity: compute correlations of the x variables; do the signs of the coefficients agree with intuition? Principal components.
    Missing values.