chapter 8 linear regression. objectives & learning goals understand linear regression (linear...

Chapter 8Linear Regression

Objectives & Learning Goals

Understand Linear Regression (linear modeling): Create and interpret a linear regression model

using technology, equations (when statistics are provided) and when given computer output:

Discuss slope and y-intercept in context; Check conditions by graphing and interpreting residuals; Make predictions within the range of data using the created

model.

Fast Food - Fat Versus Protein r = 0.83; describe the association…

“There is a strong, positive linear association between fat and protein. In general, as protein increases, so does fat. We may be concerned about these points as possible outliers…”

We can say even more with a model…

The Linear Model The linear model is the equation of the “line of best fit.”.

Models aren’t perfect. No matter how “best” our line of best fit is; points will fall above and below the line.

Residuals are the distances from the best fit line to each data point – they’re the parts of the relationship between the variables that our model can’t explain – or model errors.

Residual

More on Residuals

To find the residual of a data point, subtract the predicted value from the observed value:

Residual

predictedactualyyresidual ˆ

Residuals (cont.) A negative residual means

the model’s prediction is too large (the model overestimates).

A positive residual means the predicted value is too small (it underestimates).

In this example, the linear model overestimates the fat content of a sandwich (33g); the actual value is 25g, so the residual is negative 8g.

Residuals are key because our technology uses them to

determine which line is “best.”

“Best Fit” Means Least Squares Some residuals are positive, others are negative, on

average these errors cancel each other out.

We can’t assess how well the line fits by adding up all the residuals (we would just get zero!).

Solution?? Anyone?? Anyone?? Bueller?? Similar to what we did with standard deviation, we square the

residuals, then add up the squares. The smaller the sum, the better the line.

The line of best fit is the line that minimizes the sum of the squared residuals. We call this the least squares regression line (LSRL).

Conditions for Regression

We check the same conditions for regression as we did for correlations. These are???

Quantitative Variables Condition

Straight Enough Condition

Outlier Condition

The Regression Line in Real Units In Algebra we learned that the equation of a

line can be written as? ___________

In Statistics we use slightly different notation:

b0 is the y-intercept

b1 is the slope

We use to emphasize that the points that satisfy the equation are predictions made using a model, not data.

ymx b

yb0 b1xy

Back to Burgers! An Example The LSRL is shown for the

for the Burger King sandwich data. The regression equation is:

To predict the fat content (!!“fat hat”!!) for a BK sandwich with 30g of protein; PLUG AND CHUG the formula (do it now) !!

Conclude with a statement like: “According to our model, we expect a sandwich with 30 grams of protein to have about 36g of fat.”

35.9 g

Class Example…Country Marijuana (%) Other Drugs (%)

Czech Rep 22 4Denmark 17 3England 40 21Finland 5 1Ireland 37 16

Italy 19 8Ireland 23 14Norway 6 3Portugal 7 3Scotland 53 31

USA 34 24

The table above shows the % of city teens who use Marijuana recreationally and the % of teens who use other, more dangerous drugs. Enter the data in your calculator lists and let’s build and analyze a model!!

Correlation and the Line of Best Fit This figure shows the

scatterplot of the z-scores for fat and protein.

If a burger has an average protein content, it should also have an average fat content. So in z-score world, the LSRL passes through the origin: (0, 0).

This means that in the “real” world the LSRL passes through:

_________

In z-score world, r is the slope of the regression line!

),( yx

In z-score world, r is the slope

of the LSRL!

Moving one standard deviation away from the mean of x moves us r standard deviations away from the mean in y.

How Big Can Predicted Values Get?

Because we move r units in the y direction for every unit in the x direction and r = -1 ≤ r ≤ +1, each y-hat must be closer to its mean than its corresponding x value.

This property of the linear model is called regression to the mean (predictions using the LSRL will tend toward the mean of y).

R2— Accounting for the Model’s Variation The variation in the

residuals is the key to assessing how well the model fits.

In our BK sandwich example, the standard deviation of the variable fat is 16.4 grams.

The standard deviation of the residuals (the errors after we’ve applied the model) is only 9.2 grams.

*Original After Data* Model

R2—The Variation Accounted For (cont.) If the correlation were 1.0 and the

model predicted fat values perfectly, all the residuals would be zero (no variation).

As it is, the correlation is 0.83 - not perfect; but pretty good!

The model’s residuals have less variation than total fat alone. So the model “explains” some of the variability in fat by accounting for protein content.

To build a model that accounts for the remaining errors we would use multiple regression – beyond the scope of this course).

R2—The Variation Accounted For (cont.)

The correlation coefficient squared (r2), gives the percentage of the data’s variance accounted for by the model.

For the BK model, r2 = 0.832 = 0.69, this tells us that 69% of the variability fat changes can be explained by changes in protein levels.

The remaining 31% of the variability in fat is left in the residuals (“other” reasons).

How Big Should R2 Be? Well……(you guessed it!)……it depends!

R2 is always between 0% and 100% (and it is always less than r (unless r = 1 or -1). What makes a “good” R2 value depends on the kind of data you’re analyzing and what you want to do with it.

Examples?

Along with the slope and intercept for a regression, you should always report R2 so that readers can judge for themselves how successful the regression is at fitting the data.

AP Test Requirements

For the AP test, you are expected to be able to find / analyze regression equations THREE ways:1) Using summary statistics & formulas;

2) Using technology (calculators) when given data;

3) From computer-generated regression output.

Worksheet!!!

The Regression Line in Real Units Since the linear model integrates “z-score

world” (r), the line of best fit we use is a little more complicated than the ones you’re used to from your algebra courses…

We want our model to be useful in real units so we have to back out of z-score world...

To find a linear model’s slope, we use:

To find the model’s y-intercept:

b1 rsysx

b0 y b1x

yb0 b1x

Income vs. Housing Cost Example

Income v. Housing Costs ExampleSome governmental organization is interested in building a model to predict a person’s housing costs based on the person’s income (using tax return data). They capture a sample of data and find that the mean income is $46,209 with a standard deviation of $7,004. The mean housing cost for this same sample is $324 with a standard deviation of $119; r=0.62. Is a linear model appropriate? Find the regression equation. Explain what the slope and the y-intercept mean in context. Compute and interpret r-squared. The organization then decides to use this same data to predict a person’s income based on their housing costs. Find the new regression equation.

Reading Computer Output – HP vs. MPG

)(0918175.04542.38ˆ hpgpm

Write the regression equation now…

More on Residuals

The linear model assumes the relationship between the two variables is a straight line. The residuals are the part of the data that hasn’t been accounted for by the model.

Actual Data = Model + Residual

or

Residual = Actual – Prediction (AP)

In symbols:

ey y

Residuals (cont.) Residuals help us to see whether the

model is appropriate. When it is, there should be nothing interesting left behind in the residuals (just random error).

After we fit a regression model, we ALWAYS make a scatterplot of the residuals vs. X or y-hat (both will look identical) hoping to find……….. well, nothing interesting.

Residuals (cont.) The residual plot for the BK sandwich

regression looks random (no pattern) – and that’s good!

Residuals (cont.) This residual plot shows a pattern, indicating

that our assumption of linearity may be wrong.

Residuals (cont.) Another “bad” residual plot…

Residual Standard Deviation The standard deviation of the residuals, se,

measures how spread out the points are around the regression line. The equation is:

se e2

n 2

Once again, we don’t actually calculate this manually, our technology does it for us!!!

The Residual Standard Deviation Examine a Normal Probability Plot or a Histogram

of the residuals to make sure the residuals have about the same amount of scatter throughout (this is called the Equal Variance Assumption).

Reality Check: Does the Regression Make Sense? When statistics are based on real data, the results

of a statistical analysis should reinforce your common sense.

If the results are surprising, then you’ve either learned something new about the world or your analysis is incorrect.

Which do you think is more likely?

When you perform a regression, think about what the coefficients mean and ask yourself whether they make sense.

What Can Go Wrong? Don’t fit a straight line to a nonlinear

relationship. Beware extraordinary points (y-values that

stand off from the linear pattern or extreme x-values).

Don’t infer that x causes y just because there is a good linear model for their relationship - association is not causation.

Don’t choose a model based on R2 alone.

Get it??

http://xkcd.com/552/

What Have We Learned?

When the relationship between two quantitative variables is “straight enough”, a linear model can help summarize that relationship. The regression line doesn’t pass through all points, but

it is the “best” line because it minimizes the sum of squared residuals.

What Have We Learned? Correlation tells us lots of things:

The slope of the line is based on the correlation, adjusted for the units of x and y.

For each SD in x that we are away from x-bar, we expect to be r SDs in y away from y-bar.

Since r is always between –1 and +1, each y-hat is fewer SDs away from its mean than the corresponding x was (regression to the mean).

R2 tells us the percent of the response accounted for by the regression model; the rest is error.

chapter 8 linear regression. objectives & learning goals understand linear regression (linear...

Documents