more simple linear regression

More Simple Linear Regression

1

Variation

2

Remember to calculate the standard deviation of a variable we take each value and subtract off the mean and then square the result. (We also the divided by something, but that is not important in this discussion.)In a regression setting on the dependent variable Y we define the total sum of squares SST asΣ(Yi – Ybar)2 .SST can be rewritten asSST = Σ(Yi – Ŷi + Ŷi –Ybar)2 = Σ(Ŷi –Ybar)2 + Σ(Yi – Ŷi)2 = SSR + SSE.

Note: you may recall from algebra that (a + b)2 = a2 + 2ab + b2. In our story here 2ab = 0. While this is not true in general in algebra it is in this context of regression. If this note makes no sense to you do not worry, just use SST = SSR + SSE

Variation

3

So we have SST = Σ(Yi – Ybar)2, SSR = Σ(Ŷi –Ybar)2 and SSE = Σ(Yi – Ŷi)2 .On the next slide I have a graph of the data with the regression line put in and a line showing the mean of Y. For each point we could look at the how far the point is from the mean line. This is what SST is looking at. But SSR is indicating that of all the difference in the point and the mean the regression line is able to account for some of that variation. The rest of the difference is SSE.

VariationY Least Squares regression

Line = Ŷi

X

4

Y bar

Two examples of what is going into SSR

Two examples of what is going into SSE

The Coefficient of Determination

5

The coefficient of determination, often denoted r2, measures the proportion in the variation in Y that is explained by the independent variable X in the regression model.

r2 = SSR/SST. In our example from the text about the apparel company we have r2 = SSR/SST = 105.7476/116.9543 = 0.9042. This means that 90.42 percent of the variation in sales is explained by the variability in the store square footage.

Plus, only 9.58% of the variability in sales is due to other factors.

Coefficient of Determination

6

Say we didn’t have an X variable to help us predict the Y variable. Then a reasonable way to predict Y would be to just use its average or mean value. But, with a regression, by using an X variable it is thought we can do better than just using the mean of Y as a predictor. In a simple linear regression r2 is an indicator of the strength of the relationship between two variables because the use of the regression model would reduce the variability in predicting the sales by just using the mean sales by the percentage obtained.

In different areas of study (like marketing, management, and so on) the idea of what a good r2 is varies. But, you can be sure if r2 is .8 or above you have a strong relationship.

t Test for slope

7

Hypothesis test about the population slope B1.

Remember we have taken a sample of data. In this context we have taken a sample and estimated the unknown population regression.

Our real point in a study like this is to see if a relationship exists between the two variables in the population. If the slope is not zero in the population, then the X variable has an influence on the outcome of Y. Now, in a sample, the estimated slope may or may not be zero. But the sample provides a basis for a test of the true unknown population slope being zero.

For the test we will use the t distribution. Admittedly, degrees of freedom is a term without much meaning to you, but in the context of simple regression equals the sample size minus 2.

t Test for slope

8

Back to our hypothesis test about the slope. The null hypothesis is that B1 = 0, and the alternative is that B1 is not equal to zero. Since the alternative is not equal to zero we have a two-tailed test.

If we have 14 data points (pairs of points in regression) the df = 12 and if we want alpha = .05 we divide that in half because of the two tail test and our critical values are -2.1788 and 2.1788. If the sample based statistic, tstat, is between the two critical values we can not reject the null and we would conclude the data supports a statement of no relationship between X and Y. If the tstat is outside the critical values we reject the null and go with the alternative and say the data supports that a relationship exists between the variables.

t Test for the Slope

9

In a class such as ours the point is usually not to do a lot of calculations in regression, but interpret results. On page 579 we see an Excel printout for the apparel company. Note on cell d18 we have the calculated tstat for the problem. Since it is outside the critical values we reject Ho and go with the alternative.

The p-value approach is that if the p-value < alpha we reject the null. The p-value printed in Excel in this area is a two tail p-value. Since we have a p-value of essentially 0 we can reject the null.

Confidence Interval For the Slope

10

You may recall from previous work that when get a point estimate we often want to build in an interval around the point estimate because we know about sampling variability.

Excel also gives the confidence interval for the slope estimate. Page on page 579 for our apparel example we see in cells f18 and g18 the lower and upper interval values. We have the interval (1.3280, 2.0118).This interval has the interpretation that we are 95% confidence that the UNKNOWN POPULATION SLOPE is somewhere in this interval.

11

This is to be used on an assignment. Note in cell E57 the notation 6.4878E-11. The E-11 means move the decimal 11 places to the left. So the p-value is .000000000064878.

If the E has a plus after it you move the decimal to the right.

more simple linear regression

Documents

regression model

context of regression

y variable

mean of y

unknown population regression

x variable

mean line

mean sales