Chapter 14 Part I
ISDS 2001, Matt Levy

TRANSCRIPT

Page 1: Chapter 14 Part I

Chapter 14 Part I

ISDS 2001, Matt Levy

Page 2: Chapter 14 Part I

Introduction

Regression is the term used to describe techniques for modeling and analyzing the relationship between variables.

The focus is on a dependent variable and one or more independent variables.

Simple Linear Regression means 1 independent variable.

Regression, like other statistical modeling techniques, gives us the power to infer, or predict, future outcomes.

An understanding of regression, and of the techniques used to validate your models, will provide you with a sound methodology to do just that.

Page 3: Chapter 14 Part I

Simple Linear Regression

As previously mentioned, simple linear regression means we have 1 dependent variable (y) and 1 independent variable (x).

In order to make a prediction about y using x, we need sample data (on both x and y) to generate some additional terms, namely the parameters (β0 and β1) and an error term (ε).

The parameters, β0 and β1, can be thought of as capturing the explained variability.

The error term (ε) accounts for unexplained variability.

Thus, the simple linear regression model is: y = β0 + β1x + ε
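To make the model concrete, here is a minimal Python sketch that simulates data from it. The values β0 = 60 and β1 = 5 come from the example equation used on Page 6 of these slides; the error standard deviation, sample size, and range of x are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)        # seeded so the sketch is reproducible
    beta0, beta1 = 60.0, 5.0               # population parameters (from the Page 6 example)
    sigma = 10.0                           # assumed standard deviation of the error term
    n = 30                                 # assumed sample size

    x = rng.uniform(0, 20, size=n)         # sample values of the independent variable
    eps = rng.normal(0, sigma, size=n)     # the error term: unexplained variability
    y = beta0 + beta1 * x + eps            # the model: y = β0 + β1x + ε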

Page 4: Chapter 14 Part I

Estimating the Regression Equation

If we were fortunate enough to know the population parameters, we could use the equation on the previous slide to compute the mean of y.

Unfortunately for us, we must use sample data to estimate these parameters, and consequently we use different symbols to denote our estimated parameters: ŷ = b0 + b1x

Note that we place a hat over y (pronounced "y-hat") and use English letters to denote our estimated parameters.

We now have an equation that graphs a "regression line":

ŷ is the point estimator of E(y), the mean.
b0 is the y-intercept.
b1 is the slope.

Page 5: Chapter 14 Part I

The Estimation Process for Simple Linear Regression

Page 6: Chapter 14 Part I

The Estimation Process for Simple Linear Regression

So how do we estimate b0 and b1?

To do this we use a method known as least squares.

In simple linear regression, finding b0 and b1 is relatively straightforward.

Equations 14.6 and 14.7 in your book show the procedure for b0 and b1, respectively.

Once b0 and b1 are obtained, the estimated simple linear regression equation will resemble the following: ŷ = 60 + 5x

It is important to note that you will have a ŷi for every yi in the sample dataset.

It is up to you to determine whether the differences between them are small enough to deem the equation an accurate predictor.
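As a sketch of that procedure, the standard least squares formulas (presumably what equations 14.6 and 14.7 present) can be coded directly. The x and y arrays continue from the simulation sketch above.

    def least_squares(x, y):
        """Estimate the intercept b0 and slope b1 by the method of least squares."""
        x_bar, y_bar = x.mean(), y.mean()
        b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        b0 = y_bar - b1 * x_bar
        return b0, b1

    b0, b1 = least_squares(x, y)
    y_hat = b0 + b1 * x    # one fitted value ŷi for every yi in the dataset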

Page 7: Chapter 14 Part I

Coefficient of Determination

The Coefficient of Determination (r²) provides us one measure to judge how well our regression equation (for example, ŷ = 60 + 5x) fits the actual data.

Let's take some time to build r² and learn some important terms along the way:

◆ Remember that we have an estimated dependent variable (ŷi) and an actual dependent variable (yi) for each observation.

◆ (yi - ŷi) is known as the ith residual.

◆ When we take (yi - ŷi), square it, and sum the squares, we get the Sum of Squares due to Error (SSE): SSE = ∑(yi - ŷi)².

◆ When we take (yi - ȳ), square it, and sum the squares, we get the Total Sum of Squares (SST): SST = ∑(yi - ȳ)².

◆ Lastly, when we take (ŷi - ȳ), square it, and sum the squares, we get a measure of how much the estimated values on the regression line deviate from the actual mean.

◆ This is known as the Sum of Squares due to Regression: SSR = ∑(ŷi - ȳ)².

Page 8: Chapter 14 Part I

Coefficient of Determination (cont'd)

The relationship between SSR, SST, and SSE is one of the most important facts to know in statistics:

SST = SSR + SSE

Now, if (yi - ŷi) = 0 for each ith observation, then SST = SSR and we have a perfect fit of the data. In practice, this is never the case.

On the flip side, if SSR = 0 (so that SSE = SST), we have the worst possible fit, because everything is in the error term, the unexplained portion of the equation.

Hence, to measure goodness of fit, we look at the ratio of SSR to SST:

r² = SSR/SST

This yields a value between 0 and 1.

r² can be interpreted as the percentage of the total sum of squares (SST) that can be explained by using your estimated regression equation.
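Continuing the running sketch, the three sums of squares, the identity SST = SSR + SSE, and r² take only a few lines:

    y_bar = y.mean()
    SSE = np.sum((y - y_hat) ** 2)       # sum of squares due to error
    SST = np.sum((y - y_bar) ** 2)       # total sum of squares
    SSR = np.sum((y_hat - y_bar) ** 2)   # sum of squares due to regression

    assert np.isclose(SST, SSR + SSE)    # the key identity: SST = SSR + SSE
    r2 = SSR / SST                       # coefficient of determination, between 0 and 1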

Page 9: Chapter 14 Part I

Correlation Coefficient

The correlation coefficient, denoted rxy, is a measure of the strength of the linear association between the independent variable (x) and the dependent variable (y).

rxy = (sign of b1) √r²

rxy always yields a value between -1 and +1.

A value of +1 indicates a perfect positive linear relationship.

A value of -1 indicates a perfect negative linear relationship.

A value of zero indicates no linear relationship.

In practice, rxy is used much less, as it only measures the strength of the linear relationship between two variables.

r², by contrast, can be used to measure goodness of fit for both linear and nonlinear relationships.
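In the running sketch, rxy follows directly from r² and the sign of the slope, and it can be checked against NumPy's built-in Pearson correlation:

    r_xy = np.sign(b1) * np.sqrt(r2)                  # rxy = (sign of b1) √r²
    assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])  # agrees with the Pearson correlation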

Page 10: Chapter 14 Part I

Estimating the Regression Equation

In this model, y can be thought of as having a distribution for each given value of x.

As we have learned in the past, a distribution has a mean or expected value.

Thus the regression equation for the mean is as follows: E(y) = β0 + β1x

Notice that to obtain the mean, we simply drop the error term, i.e., the unexplained variability.

Page 11: Chapter 14 Part I

Model Assumptions

It is important to understand that r² is not enough to ensure we have an appropriate regression equation.

There are numerous other tests and measures we must use.

All of these tests are based on assumptions about the error term (ε):

1. E(ε) = 0. Implication: E(y) = β0 + β1x.

2. The variance of ε, denoted by σ², is the same for all values of x. Implication: the variance of y equals σ² and is the same for all values of x.

3. The values of ε are independent (uncorrelated). Implication: the value of y for any x is not related to the value of y for any other x.

4. ε is a normally distributed random variable. Implication: because y is a linear function of ε, y is also normally distributed.

Table 14.14 in the text provides a complete explanation.

Page 12: Chapter 14 Part I

Testing for Significance

In simple linear regression, the mean or expected value of y is a linear function of x: E(y) = β0 + β1x.

If the value of β1 = 0, then E(y) = β0 + 0x = β0.

Hence, in this case we can conclude x and y are not linearly related.

In the next couple of slides we present a few tests: the t-test, an evaluation of the confidence interval for β1, and the F-test.

Each of these tests is based on the following hypotheses:

H0: β1 = 0
Ha: β1 ≠ 0

This starts to tell us more about the appropriateness of our model.

Page 13: Chapter 14 Part I

Estimating σ²

As a precursor to running our tests, we need an estimate of σ².

Recall one of our key assumptions: the variance of ε also represents the variance of y.

Also recall that the deviations of y about the regression line are called residuals.

Hence we can call upon the SSE to calculate the Mean Square Error (MSE) as an estimate of σ², which we denote s².

s² = MSE = SSE/(n - 2), where n is the sample size and (n - 2) is the degrees of freedom associated with SSE.

Consequently, the standard error of the estimate is s = √MSE.
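In the running sketch, the estimate of σ² and the standard error of the estimate are:

    n = len(y)
    MSE = SSE / (n - 2)   # mean square error: s², our estimate of σ²
    s = np.sqrt(MSE)      # standard error of the estimate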

Page 14: Chapter 14 Part I

t-Test

Remember we are testing the following: H0: β1 = 0; Ha: β1 ≠ 0.

To do this we need information about the distribution of b1 (see figure 14.17); specifically, we need its estimated standard deviation, sb1 (see figure 14.18).

Once we have sb1, we can compute the test statistic: t = b1/sb1.

And using the t-table and our well-known rejection rules:

Reject H0 if p-value ≤ α (equivalently, if t ≤ -tα/2 or t ≥ tα/2),

where tα/2 is based on a t-distribution with (n - 2) degrees of freedom.
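Here is a sketch of the full t-test, using the usual formula sb1 = s/√∑(xi - x̄)² for the estimated standard deviation of b1 and SciPy for the p-value; α = 0.05 is an illustrative choice.

    from scipy import stats

    s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))   # estimated standard deviation of b1
    t_stat = b1 / s_b1                                # test statistic t = b1/sb1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed p-value, n-2 df

    alpha = 0.05                                      # illustrative significance level
    reject_H0 = p_value <= alpha                      # True means reject H0: β1 = 0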

Page 15: Chapter 14 Part I

Confidence Interval for β1

As an alternative to the t-test, we can check the confidence interval for β1.

We are essentially checking to see whether the confidence interval for β1 contains 0.

The form of the confidence interval is as follows:

b1 ± tα/2*sb1

If this interval contains zero at the designated significance level, we cannot reject the null hypothesis (H0).
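Continuing the sketch, the interval check is:

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # tα/2 with n-2 degrees of freedom
    lower = b1 - t_crit * s_b1
    upper = b1 + t_crit * s_b1
    cannot_reject_H0 = (lower <= 0.0 <= upper)      # does the interval contain zero?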

Page 16: Chapter 14 Part I

F-Test

Based on the F probability distribution (hence, using our F-table).

In simple linear regression this does the same thing as the t-test.

With more than one independent variable (multiple regression) ONLY the F-test can be used to test for overall significance.

To arrive at the F-test statistic, we need the Mean Square due to Regression (MSR):

MSR = SSR / (number of independent variables)

F = MSR/MSE (Just like when we first learned ANOVA)

And using the F-table and our well-known rejection rules:

Reject H0 if p-value ≤ α (equivalently, if F ≥ Fα),

where Fα is based on an F distribution with 1 degree of freedom in the numerator (for simple linear regression) and (n - 2) degrees of freedom in the denominator.
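Here is a sketch of the F-test on the running example; in simple linear regression F = t², so the two tests always agree.

    num_ind_vars = 1                                  # one independent variable in SLR
    MSR = SSR / num_ind_vars                          # mean square due to regression
    F_stat = MSR / MSE                                # F test statistic
    p_value_F = stats.f.sf(F_stat, dfn=1, dfd=n - 2)  # upper-tail F probability

    assert np.isclose(F_stat, t_stat ** 2)            # F equals t² in simple linear regression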

Page 17: Chapter 14 Part I

Caution about the Interpretation of Significance Testing

Correlation is not causation!

Just because we reject H0 does not guarantee a cause-and-effect relationship; theoretical justification is still required.

Furthermore, just because we can reject H0 does not mean the relationship between x and y is linear.