chapter 8 linear regression. objectives & learning goals understand linear regression (linear...

Chapter 8 Linear Regression

Author: bruno-atkinson

Post on 18-Jan-2016


TRANSCRIPT

  • Chapter 8 Linear Regression

  • Objectives & Learning Goals

    Understand linear regression (linear modeling):

      • Create and interpret a linear regression model using technology, using equations (when statistics are provided), and when given computer output;
      • Discuss slope and y-intercept in context;
      • Check conditions by graphing and interpreting residuals;
      • Make predictions within the range of data using the created model.

  • Fast Food - Fat Versus Protein

    r = 0.83; describe the association. There is a strong, positive, linear association between fat and protein: in general, as protein increases, so does fat. We may be concerned about a few points as possible outliers. We can say even more with a model.

  • The Linear Model

    The linear model is the equation of the line of best fit. Models aren't perfect: no matter how good our line of best fit is, points will fall above and below it.

    Residuals are the vertical distances from the line of best fit to each data point. They are the parts of the relationship between the variables that our model can't explain (the model's errors).

  • More on Residuals

    To find the residual of a data point, subtract the predicted value from the observed value:

    $\text{Residual} = \text{observed} - \text{predicted} = y - \hat{y}$

  • Residuals (cont.)

    A negative residual means the model's prediction is too large (the model overestimates).

    A positive residual means the predicted value is too small (it underestimates).

    In this example, the linear model overestimates the fat content of a sandwich: it predicts 33 g, but the actual value is 25 g, so the residual is 25 − 33 = −8 g.

    Residuals are key because our technology uses them to determine which line is best.

  • Best Fit Means Least Squares

    Some residuals are positive and others are negative; on average, these errors cancel each other out.

    We can't assess how well the line fits by adding up all the residuals (we would just get zero!).

    Solution?? Anyone?? Anyone?? Bueller??

    Similar to what we did with standard deviation, we square the residuals, then add up the squares. The smaller the sum, the better the line.

    The line of best fit is the line that minimizes the sum of the squared residuals. We call this the least squares regression line (LSRL).
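
    The least squares idea can be sketched in a few lines of Python. The data below are invented purely for illustration (they are not from the slides), and the slope and intercept formulas are the standard least squares ones, assumed here rather than taken from this slide. The sketch checks that the residuals of the fitted line sum to (essentially) zero and that nudging the line makes the sum of squared residuals larger.

```python
# Hypothetical data, invented for illustration only (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus predicted, one residual per data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    return sum(e ** 2 for e in residuals(xs, ys, intercept, slope))


b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# The plain sum of residuals is (numerically) zero, so it cannot rank lines.
print("sum of residuals:", sum(residuals(xs, ys, b0, b1)))

# Any other line has a larger sum of squared residuals.
print("LSRL:", best)
print("steeper line:", sum_squared_residuals(xs, ys, b0, b1 + 0.1))
print("shifted line:", sum_squared_residuals(xs, ys, b0 + 0.5, b1))
```

    The two altered lines are arbitrary choices; any line other than the LSRL would show the same thing.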

  • Conditions for Regression

    We check the same conditions for regression as we did for correlation. These are???

    Quantitative Variables Condition

    Straight Enough Condition

    Outlier Condition

  • The Regression Line in Real Units

    In Algebra we learned that the equation of a line can be written as? ___________

    In Statistics we use slightly different notation:

    $\hat{y} = b_0 + b_1 x$

    $b_0$ is the y-intercept and $b_1$ is the slope.

    We use $\hat{y}$ (y-hat) to emphasize that the points that satisfy the equation are predictions made using a model, not data.

  • Back to Burgers! An Example

    The LSRL is shown for the Burger King sandwich data. The regression equation is:

    To predict the fat content ("fat-hat") for a BK sandwich with 30 g of protein, PLUG AND CHUG the formula (do it now)!!

    Conclude with a statement like: According to our model, we expect a sandwich with 30 grams of protein to have about 36 g of fat (35.9 g).

  • Class Example

    The table below shows the % of teens in several countries who use marijuana recreationally and the % of teens who use other, more dangerous drugs. Enter the data in your calculator lists and let's build and analyze a model!!

    Country      Marijuana (%)   Other Drugs (%)
    Czech Rep    22              4
    Denmark      17              3
    England      40              21
    Finland      5               1
    Ireland      37              16
    Italy        19              8
    Ireland      23              14
    Norway       6               3
    Portugal     7               3
    Scotland     53              31
    USA          34              24
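
    For anyone working without a calculator, here is a minimal Python sketch of the same fit. It assumes the table reads as (marijuana %, other drugs %) pairs in the order listed; the function name `fit_line` is made up for this example, and the formulas are the standard least squares ones.

```python
from math import sqrt

# Class example data, assumed to read as (% marijuana, % other drugs) per row.
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]


def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


b0, b1, r = fit_line(marijuana, other_drugs)
print(f"other-hat = {b0:.2f} + {b1:.3f} * marijuana")
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")
```

    The slope comes out near 0.615 and the intercept near −3.07, so the model predicts roughly 0.6 more percentage points of other-drug use for each additional percentage point of marijuana use.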

  • Correlation and the Line of Best Fit

    This figure shows the scatterplot of the z-scores for fat and protein.

    If a burger has an average protein content, it should also have an average fat content. So in z-score world, the LSRL passes through the origin: (0, 0).

    This means that in the real world the LSRL passes through:

    _________

    In z-score world, r is the slope of the regression line!

  • In z-score world, r is the slope of the LSRL!

    Moving one standard deviation away from the mean of x moves us r standard deviations away from the mean in y.

  • How Big Can Predicted Values Get?

    Because (in z-score world) we move r units in the y direction for every unit in the x direction and −1 ≤ r ≤ +1, each y-hat is at least as close to its mean, measured in standard deviations, as its corresponding x value is.

    This property of the linear model is called regression to the mean (predictions using the LSRL will tend toward the mean of y).
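
    A small sketch of regression to the mean in z-score terms, using the r = 0.83 from the fast-food example. The helper name `predicted_z` is hypothetical; it simply applies the rule that the predicted z-score of y is r times the z-score of x.

```python
r = 0.83  # correlation from the fast-food example


def predicted_z(z_x, r):
    """Predicted z-score of y for a point z_x standard deviations from the mean of x."""
    return r * z_x


for z_x in (-2, -1, 0, 1, 2):
    z_y_hat = predicted_z(z_x, r)
    # The prediction is never farther from the mean (in SDs) than x was.
    print(f"z_x = {z_x:+d}  ->  predicted z_y = {z_y_hat:+.2f}")
```

    A sandwich 2 SDs above the mean in protein is predicted to be only 1.66 SDs above the mean in fat.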

  • R²: Accounting for the Model's Variation

    The variation in the residuals is the key to assessing how well the model fits.

    In our BK sandwich example, the standard deviation of the variable fat is 16.4 grams.

    The standard deviation of the residuals (the errors after we've applied the model) is only 9.2 grams.

  • R²: The Variation Accounted For (cont.)

    If the correlation were 1.0 and the model predicted fat values perfectly, all the residuals would be zero (no variation).

    As it is, the correlation is 0.83: not perfect, but pretty good!

    The model's residuals have less variation than total fat alone. So the model explains some of the variability in fat by accounting for protein content.

    To build a model that accounts for the remaining errors, we would use multiple regression (beyond the scope of this course).

  • R²: The Variation Accounted For (cont.)

    The correlation coefficient squared (r²) gives the fraction of the data's variance accounted for by the model.

    For the BK model, r² = 0.83² = 0.69. This tells us that 69% of the variability in fat can be explained by variation in protein. The remaining 31% of the variability in fat is left in the residuals (other reasons).
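
    As a rough cross-check, the two standard deviations quoted for the BK example (16.4 g for fat, 9.2 g for the residuals) should tell about the same story as r². The sketch below compares the two routes; the agreement is only approximate because the quoted values are rounded and the residual standard deviation uses a slightly different denominator.

```python
sd_fat = 16.4        # standard deviation of fat (from the slides)
sd_residuals = 9.2   # standard deviation of the residuals (from the slides)
r = 0.83             # correlation between fat and protein (from the slides)

# Fraction of the variance left over in the residuals.
fraction_left = (sd_residuals / sd_fat) ** 2

r_squared_from_sds = 1 - fraction_left
r_squared_from_r = r ** 2

print(f"1 - (9.2 / 16.4)^2 = {r_squared_from_sds:.3f}")
print(f"0.83^2             = {r_squared_from_r:.3f}")
```

    Both routes round to 0.69, matching the 69% quoted above.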

  • How Big Should R² Be?

    Well (you guessed it!) it depends!

    R² is always between 0% and 100%, and it is never larger than |r| (the two are equal only when r = 0, 1, or −1). What makes a good R² value depends on the kind of data you're analyzing and what you want to do with it. Examples?

    Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.

  • AP Test Requirements

    For the AP test, you are expected to be able to find / analyze regression equations THREE ways:

      • Using summary statistics & formulas;
      • Using technology (calculators) when given data;
      • From computer-generated regression output.

    Worksheet!!!

  • The Regression Line in Real Units

    Since the linear model is built in z-score world (where the slope is r), the line of best fit we use is a little more complicated than the ones you're used to from your algebra courses. We want our model to be useful in real units, so we have to back out of z-score world...

    To find a linear model's slope, we use:

    $b_1 = r \dfrac{s_y}{s_x}$

    To find the model's y-intercept:

    $b_0 = \bar{y} - b_1 \bar{x}$

    Income vs. Housing Cost Example

  • Income v. Housing Costs Example

    Some governmental organization is interested in building a model to predict a person's housing costs based on the person's income (using tax return data). They capture a sample of data and find that the mean income is $46,209 with a standard deviation of $7,004. The mean housing cost for this same sample is $324 with a standard deviation of $119; r = 0.62.

      • Is a linear model appropriate?
      • Find the regression equation.
      • Explain what the slope and the y-intercept mean in context.
      • Compute and interpret r-squared.
      • The organization then decides to use this same data to predict a person's income based on their housing costs. Find the new regression equation.
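
    One way to check your work on this example is to code the summary-statistics route directly: slope = r × (SD of y) / (SD of x), intercept = (mean of y) − slope × (mean of x). The function name `line_from_summary` is made up for this sketch; the numbers are the ones given in the problem.

```python
def line_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (intercept, slope) of the LSRL for predicting y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


mean_income, sd_income = 46209, 7004
mean_housing, sd_housing = 324, 119
r = 0.62

# Predict housing cost from income.
b0, b1 = line_from_summary(mean_income, sd_income, mean_housing, sd_housing, r)
print(f"housing-hat = {b0:.2f} + {b1:.5f} * income")

# Predict income from housing cost: swap the roles of x and y and refit.
c0, c1 = line_from_summary(mean_housing, sd_housing, mean_income, sd_income, r)
print(f"income-hat = {c0:.2f} + {c1:.2f} * housing")

print(f"R^2 = {r ** 2:.4f}")
```

    The second equation is not the first one solved for income: each regression has its own slope, and the two slopes multiply to r² rather than to 1.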

  • Reading Computer Output: HP vs. MPG

    Write the regression equation now.

  • More on Residuals

    The linear model assumes the relationship between the two variables is a straight line. The residuals are the part of the data that hasn't been accounted for by the model.

    Actual Data = Model + Residual, or Residual = Actual − Prediction (A − P)

    In symbols:

    $e = y - \hat{y}$

  • Residuals (cont.)

    Residuals help us to see whether the model is appropriate. When it is, there should be nothing interesting left behind in the residuals (just random error).

    After we fit a regression model, we ALWAYS make a scatterplot of the residuals vs. x or y-hat (both show the same pattern) hoping to find... well, nothing interesting.

  • Residuals (cont.)

    The residual plot for the BK sandwich regression looks random (no pattern) and that's good!

  • Residuals (cont.)

    This residual plot shows a pattern, indicating that our assumption of linearity may be wrong.

  • Residuals (cont.)

    Another bad residual plot

  • Residual Standard Deviation

    The standard deviation of the residuals, $s_e$, measures how spread out the points are around the regression line. The equation is:

    $s_e = \sqrt{\dfrac{\sum e^2}{n - 2}}$

    Once again, we don't actually calculate this manually; our technology does it for us!!!
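
    To see what the technology is doing, here is a minimal sketch that computes the residual standard deviation for the class example data (marijuana vs. other drugs), assuming the table reads as listed and using the usual n − 2 denominator. The function name `residual_sd` is hypothetical.

```python
from math import sqrt

# Class example data, assumed to read as (% marijuana, % other drugs) per row.
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]


def residual_sd(xs, ys):
    """Standard deviation of the residuals around the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed minus predicted.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Divide by n - 2 because two quantities (slope, intercept) were estimated.
    return sqrt(sum(e ** 2 for e in residuals) / (n - 2))


s_e = residual_sd(marijuana, other_drugs)
print(f"s_e = {s_e:.2f} percentage points")
```

    The result is in the units of the response variable, here percentage points of other-drug use.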

  • The Residual Standard Deviation

    Examine the residual plot to make sure the residuals have about the same amount of scatter throughout (this is called the Equal Variance Assumption). A Normal probability plot or a histogram of the residuals shows the shape of their distribution.

  • Reality Check: Does the Regression Make Sense?

    When statistics are based on real data, the results of a statistical analysis should reinforce your common sense.

    If the results are surprising, then you've either learned something new about the world or your analysis is incorrect.

    Which do you think is more likely?

    When you perform a regression, think about what the coefficients mean and ask yourself whether they make sense.

  • What Can Go Wrong?

      • Don't fit a straight line to a nonlinear relationship.
      • Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
      • Don't infer that x causes y just because there is a good linear model for their relationship: association is not causation.
      • Don't choose a model based on R² alone.

  • Get it??

    http://xkcd.com/552/

  • What Have We Learned?

    When the relationship between two quantitative variables is straight enough, a linear model can help summarize that relationship.

    The regression line doesn't pass through all points, but it is the best line because it minimizes the sum of squared residuals.

  • What Have We Learned?

    Correlation tells us lots of things:

      • The slope of the line is based on the correlation, adjusted for the units of x and y.
      • For each SD in x that we are away from x-bar, we expect to be r SDs in y away from y-bar.
      • Since r is always between −1 and +1, each y-hat is no more SDs away from its mean than the corresponding x was (regression to the mean).
      • R² tells us the percent of the variation in the response accounted for by the regression model; the rest is left in the residuals.
