bcor 1020 business statistics lecture 25 – april 22, 2008

BCOR 1020 Business Statistics

Lecture 25 – April 22, 2008

Overview

Chapter 12 – Linear Regression
– Ordinary Least Squares Formulas
– Tests for Significance
– Analysis of Variance: Overall Fit
– Confidence and Prediction Intervals for Y
– Example(s)

Chapter 12 – Ordinary Least Squares Formulas

Slope and Intercept:

• The ordinary least squares (OLS) method estimates the slope and intercept of the regression line so that the residuals are small.

• Recall that the residuals are the differences between the observed y-values and the fitted y-values on the line:

$e_i = y_i - \hat{y}_i$

• The sum of the residuals is 0 for any line through the point of means (x̄, ȳ), so it cannot tell a good fit from a poor one.

• So, we consider the sum of the squared residuals (the SSE).


Slope and Intercept:

• To find our OLS estimators, we need to find the values of b0 and b1 that minimize the SSE.

• The OLS estimator for the slope is:

$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

• The OLS estimator for the intercept is:

$b_0 = \bar{y} - b_1 \bar{x}$

These are computed by the regression function on your computer or calculator.
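To see roughly what the calculator or software does, here is a minimal sketch that codes the two estimators directly: the slope is the sum of cross-deviations over the sum of squared x-deviations, and the intercept is ȳ − b1·x̄. The function name `ols_fit` and the five data points are hypothetical and chosen only for illustration; they are not the ShipCost data.

```python
def ols_fit(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point of means (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical illustrative data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, sum(residuals))
```

For this toy data the fitted line is approximately y = 2.2 + 0.6x, and the residuals sum to zero up to floating-point rounding, as they do for any least squares line.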


Example (Regression Output):

• We will consider the dataset “ShipCost” from your text (12.19 on p.438), which considers the relationship between Number of Orders (X) and Shipping Costs (Y).

• Using MegaStat we can generate a regression output (in handout)…

• Demonstration in Excel…

[Figure: scatter plot of Ship Cost (Y) against Orders (X) with fitted line y = 4.9322x - 31.19, R² = 0.6717]


Example (Regression Output):

Regression Analysis

r²           0.672      n           12
r            0.820      k           1
Std. Error   599.029    Dep. Var.   Ship Cost (Y)

Regression output (95% confidence interval in the last two columns)

variables    coefficients   std. error   t (df=10)   p-value   95% lower     95% upper
Intercept    -31.1895       1,059.8678   -0.029      .9771     -2,392.7222   2,330.3432
Orders (X)   4.9322         1.0905       4.523       .0011     2.5024        7.3619

ANOVA table

Source       SS                df   MS               F       p-value
Regression   7,340,819.5514    1    7,340,819.5514   20.46   .0011
Residual     3,588,357.1152    10   358,835.7115
Total        10,929,176.6667   11


Assessing Fit:

• We want to explain the total variation in Y around its mean (SST, the total sum of squares): $SST = \sum (y_i - \bar{y})^2$

• The regression sum of squares (SSR) is the explained variation in Y: $SSR = \sum (\hat{y}_i - \bar{y})^2$

• The error sum of squares (SSE) is the unexplained variation in Y: $SSE = \sum (y_i - \hat{y}_i)^2$

• If the fit is good, SSE will be small relative to SST.

• A perfect fit is indicated by SSE = 0.

• The magnitude of SSE depends on n and on the units of measurement.


Coefficient of Determination:

0 ≤ R² ≤ 1

• Often expressed as a percent, R² = 1 (i.e., 100%) indicates a perfect fit.

• In a bivariate regression, R² = r², the square of the sample correlation.

• R² is a measure of relative fit based on a comparison of SSR and SST: R² = SSR / SST.
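As a quick check, the sums of squares in the ShipCost ANOVA table reproduce the reported r² and r. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Sums of squares from the ShipCost ANOVA table.
ssr = 7_340_819.5514   # regression (explained) sum of squares
sse = 3_588_357.1152   # residual (unexplained) sum of squares
sst = 10_929_176.6667  # total sum of squares

r_squared = ssr / sst          # explained share of total variation
r_squared_alt = 1 - sse / sst  # equivalent form, since SST = SSR + SSE

# In a bivariate regression |r| is the square root of R^2; the sign of r
# matches the sign of the slope (positive here, b1 = 4.9322).
r = r_squared ** 0.5

print(round(r_squared, 3), round(r, 3))
```

Both forms give R² of about 0.672, so roughly 67% of the variation in shipping cost is explained by the number of orders.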

Clickers

Suppose you have found the regression model for a given set of bivariate data. If the correlation is r = -0.72, what is the coefficient of determination?

(A) -0.5184

(B) 0.5184

(C) 0.7200

(D) 0.8485

(E) -0.8485

Chapter 12 – Test for Significance

Standard Error of Regression:

• The standard error ($s_{yx}$) is an overall measure of model fit:

$s_{yx} = \sqrt{\dfrac{SSE}{n - 2}}$

• If the fitted model’s predictions are perfect (SSE = 0), then $s_{yx}$ = 0. Thus, a smaller $s_{yx}$ indicates a better fit.

• $s_{yx}$ is used to construct confidence intervals.

• The magnitude of $s_{yx}$ depends on the units of measurement of Y and on data magnitude.
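For the ShipCost output, the standard error can be recovered from the residual sum of squares and its n − 2 degrees of freedom. A minimal sketch of that arithmetic, with illustrative variable names:

```python
import math

sse = 3_588_357.1152  # residual sum of squares from the ShipCost ANOVA table
n = 12                # number of observations

# Standard error of the regression: root of SSE over its n - 2 degrees of freedom.
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 3))
```

The result agrees with the "Std. Error 599.029" line in the regression output. It is in the units of Y (shipping cost), which is why it cannot be compared across datasets measured on different scales.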


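Confidence Intervals for Slope and Intercept:

• Standard error of the slope:

$s_{b_1} = \dfrac{s_{yx}}{\sqrt{\sum (x_i - \bar{x})^2}}$

• Standard error of the intercept:

$s_{b_0} = s_{yx} \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$

• Confidence interval for the true slope:

$b_1 - t_{\alpha/2,\,n-2}\, s_{b_1} \le \beta_1 \le b_1 + t_{\alpha/2,\,n-2}\, s_{b_1}$

• Confidence interval for the true intercept:

$b_0 - t_{\alpha/2,\,n-2}\, s_{b_0} \le \beta_0 \le b_0 + t_{\alpha/2,\,n-2}\, s_{b_0}$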
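The 95% limits in the ShipCost output can be approximated by hand as estimate ± t × standard error. The sketch below assumes the tabled two-tailed critical value t = 2.228 for 10 degrees of freedom rather than computing it, so the limits match the software only up to rounding.

```python
# Values read from the ShipCost regression output.
b1, s_b1 = 4.9322, 1.0905       # slope and its standard error
b0, s_b0 = -31.1895, 1059.8678  # intercept and its standard error

# Two-tailed 95% critical value for n - 2 = 10 degrees of freedom,
# taken from a t table (an assumed, rounded value).
t_crit = 2.228

slope_ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
intercept_ci = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)
print(slope_ci)
print(intercept_ci)
```

The slope interval excludes zero while the intercept interval includes it, which is consistent with the p-values reported for the two coefficients.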


Hypothesis Tests:

• If β1 = 0, then X cannot influence Y and the regression model collapses to a constant β0 plus random error.

• The hypotheses to be tested are:

H0: β1 = 0
H1: β1 ≠ 0

• These are tested in the standard regression output in any statistics package like MegaStat.


Hypothesis Tests:

• A t test is used with ν = n − 2 degrees of freedom. The test statistics for the slope and intercept are:

$t = \dfrac{b_1 - 0}{s_{b_1}}$ and $t = \dfrac{b_0 - 0}{s_{b_0}}$

• $t_{\alpha/2,\,n-2}$ is obtained from Appendix D or Excel for a given α.

• Reject H0 if |t| > $t_{\alpha/2,\,n-2}$ or if p-value < α.

• The p-value is provided in the regression output.
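Applied to the ShipCost output, each test statistic is the coefficient divided by its standard error. The sketch below assumes a two-tailed test at α = .05 (the slides do not fix α here) and uses the tabled critical value 2.228 for 10 degrees of freedom.

```python
# Coefficients and standard errors from the ShipCost regression output.
b1, s_b1 = 4.9322, 1.0905
b0, s_b0 = -31.1895, 1059.8678

# Assumed: alpha = .05, two-tailed, 10 d.f. (value from a t table).
t_crit = 2.228

t_slope = (b1 - 0) / s_b1        # tests H0: beta1 = 0
t_intercept = (b0 - 0) / s_b0    # tests H0: beta0 = 0

reject_slope = abs(t_slope) > t_crit
reject_intercept = abs(t_intercept) > t_crit
print(round(t_slope, 3), reject_slope)
print(round(t_intercept, 3), reject_intercept)
```

The statistics match the t column of the output (4.523 and −0.029): the slope differs significantly from zero, while the intercept does not.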


Example (Regression Output):

• Let’s revisit the regression output from the dataset “ShipCost” from your text (12.19 on p.438), which considers the relationship between Number of Orders (X) and Shipping Costs (Y).

• Go through tests for significance on β0 and β1.

Chapter 12 – Analysis of Variance

Decomposition of Variance:

• To explain the variation in the dependent variable around its mean, use the formula

$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$

• This same decomposition for the sums of squares is

$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$

• The decomposition of variance is written as

SST (total variation around the mean) = SSE (unexplained or error variation) + SSR (variation explained by the regression)


F Statistic for Overall Fit:

• For a bivariate regression, the F statistic is

$F = \dfrac{MSR}{MSE} = \dfrac{SSR / 1}{SSE / (n - 2)}$

• For a given sample size, a larger F statistic indicates a better fit.

• Reject H0 if F > $F_{1,\,n-2}$ from Appendix F for a given significance level α or if p-value < α.
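The ShipCost ANOVA table can be rebuilt from its sums of squares. This sketch checks that SSR and SSE add up to SST, forms the mean squares, and compares F with the square of the slope's t statistic, since the two agree in a bivariate regression. Variable names are illustrative.

```python
# Sums of squares from the ShipCost ANOVA table.
ssr = 7_340_819.5514
sse = 3_588_357.1152
sst = 10_929_176.6667
n = 12

# Decomposition check: explained + unexplained = total (up to rounding).
decomposition_gap = abs(ssr + sse - sst)

msr = ssr / 1        # regression mean square (one predictor)
mse = sse / (n - 2)  # error mean square
f_stat = msr / mse

t_slope = 4.523      # slope t statistic from the regression output
print(round(f_stat, 2), round(t_slope ** 2, 2), decomposition_gap < 0.01)
```

F comes out near 20.46, matching both the ANOVA table and t², which is why the F test and the slope's t test report the same p-value (.0011) here.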


Example (Regression Output):

• Let’s revisit the regression output from the dataset “ShipCost” from your text (12.19 on p.438), which considers the relationship between Number of Orders (X) and Shipping Costs (Y).

• Go through the Analysis of Variance (ANOVA) to assess overall fit.

Chapter 12 – Example

Example (Exam Scores):

• We will consider the dataset “ExamScores” from your text (Table 12.3 on p.434), which considers the relationship between Study Hours (X) and Exam Scores (Y).

• Generate MegaStat regression output.

• Output on Overhead…

Clickers

If a randomly selected student had studied 12 hours for this exam, what score would this model predict (to the nearest %)?

(A) 51%

(B) 61%

(C) 73%

(D) 82%

Clickers

Find the p-value on the hypothesis test…

H0: β1 = 0
H1: β1 ≠ 0

(A) 0.0012

(B) 0.0520

(C) 0.3940

(D) 1.9641

Clickers

Recall from Tuesday’s lecture, the critical value for testing whether the correlation is significant is given by

$r_{\alpha} = \dfrac{t_{\alpha/2,\,n-2}}{\sqrt{t_{\alpha/2,\,n-2}^{2} + n - 2}}$

Compute the critical value and determine whether the correlation is significant using α = 10%.

(A) Yes, r is significant.

(B) No, r is not significant.

Clickers – Work…

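Since n = 10 and α = 10%, $t_{\alpha/2,\,n-2} = t_{.05,8} = 1.860$. From the output, r = 0.628.

$r_{\alpha} = \dfrac{t_{\alpha/2,\,n-2}}{\sqrt{t_{\alpha/2,\,n-2}^{2} + n - 2}} = \dfrac{1.860}{\sqrt{1.860^{2} + 10 - 2}} = 0.549$

Since |r| > $r_{\alpha}$, we can reject H0: ρ = 0 in favor of H1: ρ ≠ 0.

Or, using

$T^{*} = r \sqrt{\dfrac{n - 2}{1 - r^{2}}} = 0.628 \sqrt{\dfrac{10 - 2}{1 - 0.628^{2}}} = 2.282$

Since |T*| > $t_{\alpha/2,\,n-2} = t_{.05,8} = 1.860$, we reach the same conclusion. The correlation is significant.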
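The two calculations above can be checked with a few lines of arithmetic. This is only a sketch using the values given in the worked solution (n = 10, r = 0.628, t = 1.860); the variable names are illustrative.

```python
import math

n = 10
r = 0.628
t_crit = 1.860  # t_{.05, 8}, as given in the worked solution

# Critical value for the correlation coefficient.
r_crit = t_crit / math.sqrt(t_crit ** 2 + n - 2)

# Equivalent t statistic for the correlation.
t_star = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(r_crit, 3), abs(r) > r_crit)
print(round(t_star, 3), abs(t_star) > t_crit)
```

Both comparisons lead to the same decision because they are algebraic rearrangements of the same inequality.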

Chapter 12 – Confidence & Prediction Intervals for Y

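How to Construct an Interval Estimate for Y

• The regression line is an estimate of the conditional mean of Y.

• An interval estimate is used to show a range of likely values around the point estimate.

• For a chosen value $x_0$ of X with fitted value $\hat{y}_0 = b_0 + b_1 x_0$, the confidence interval for the conditional mean of Y is

$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{yx} \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

• The prediction interval for individual values of Y is

$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{yx} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

• Prediction intervals are wider than confidence intervals because individual Y values vary more than the mean of Y.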
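To make the difference between the two intervals concrete, here is a minimal sketch that computes both at a chosen x-value using the standard bivariate formulas. The function name `interval_estimates`, the five data points, and the tabled critical value are all assumptions made for illustration; this is not MegaStat's implementation.

```python
import math


def interval_estimates(x, y, x0, t_crit):
    """Return (fit, ci, pi) at x0 for a bivariate least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))  # standard error of the regression

    fit = b0 + b1 * x0
    # Term shared by both intervals; grows as x0 moves away from x_bar.
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * s_yx * math.sqrt(leverage)      # conditional mean of Y
    pi_half = t_crit * s_yx * math.sqrt(1 + leverage)  # individual Y value
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


# Hypothetical data; 3.182 is the tabled two-tailed 95% t value for 3 d.f.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit, ci, pi = interval_estimates(x, y, x0=3, t_crit=3.182)
print(fit, ci, pi)
```

The extra 1 under the square root is the only difference between the two formulas, and it is what makes the prediction interval wider at every value of x.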


MegaStat’s Confidence and Prediction Intervals:
