Chapter 13: Inference in Regression. Lecture PowerPoint Slides. Discovering Statistics, 2nd Edition. Daniel T. Larose.


TRANSCRIPT

Page 1:


Chapter 13: Inference in Regression

Lecture PowerPoint Slides

Discovering Statistics

2nd Edition Daniel T. Larose

Page 2:

Chapter 13 Overview

13.1 Inference About the Slope of the Regression Line

13.2 Confidence Intervals and Prediction Intervals

13.3 Multiple Regression

Page 3:

The Big Picture

Where we are coming from and where we are headed…

In the later chapters of Discovering Statistics, we have been studying more advanced methods in statistical inference.

Here in Chapter 13, we return to regression analysis, first discussed in Chapter 4. At that time, we learned descriptive methods for regression analysis; now it is time to learn how to perform statistical inference in regression.

In the last chapter, we will explore nonparametric statistics.

Page 4:

13.1: Inference About the Slope of the Regression Line

Objectives:

Explain the regression model and the regression model assumptions.

Perform the hypothesis test for the slope β1 of the population regression equation.

Construct confidence intervals for the slope β1.

Use confidence intervals to perform the hypothesis test for the slope β1.

Page 5:

The Regression Model

Recall that the regression line approximates the relationship between two continuous variables and is described by the regression equation y-hat = b1x + b0.

Regression Model

The population regression equation is defined as:

$$y = \beta_1 x + \beta_0 + \varepsilon$$

where β0 is the y-intercept of the population regression line, β1 is the slope, and ε is the error term.

Regression Model Assumptions

1. Zero Mean: The error term ε is a random variable with mean = 0.

2. Constant Variance: The variance of ε is the same regardless of the value of x.

3. Independence: The values of ε are independent of each other.

4. Normality: The error term ε is a normal random variable.
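The model and its assumptions can be illustrated by simulating data that satisfies them. A minimal Python/numpy sketch, with the slope, intercept, and error standard deviation chosen arbitrarily for illustration (they are not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

beta1, beta0, sigma = 2.0, 5.0, 1.5   # hypothetical population slope, intercept, error SD
x = np.linspace(1, 10, 30)            # fixed values of the predictor

# Error term: independent, mean 0, constant variance, normally distributed
epsilon = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Population regression model: y = beta1*x + beta0 + epsilon
y = beta1 * x + beta0 + epsilon
```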

Page 6:

Hypothesis Tests for β1

To test whether there is a relationship between x and y, we begin with the hypothesis test to determine whether or not β1 equals 0.

H0: β1 = 0  There is no linear relationship between x and y.
Ha: β1 ≠ 0  There is a linear relationship between x and y.

Test Statistic t_data

$$t_{\text{data}} = \frac{b_1}{s \big/ \sqrt{\sum (x - \bar{x})^2}}$$

where b1 represents the slope of the regression line,

$$s = \sqrt{\frac{\text{SSE}}{n - 2}}$$

represents the standard error of the estimate, and

$$\sum (x - \bar{x})^2$$

represents the numerator of the sample variance of the x data.
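A minimal numpy sketch of these formulas; the least-squares slope and intercept are obtained here with np.polyfit, and x and y stand for any paired data (for example, the simulated values above):

```python
import numpy as np

def slope_t_statistic(x, y):
    """Compute t_data = b1 / (s / sqrt(sum((x - xbar)^2))) from paired data."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)            # least-squares slope and intercept
    residuals = y - (b1 * x + b0)
    sse = np.sum(residuals ** 2)                # SSE = sum of squared residuals
    s = np.sqrt(sse / (n - 2))                  # standard error of the estimate
    ssx = np.sum((x - np.mean(x)) ** 2)         # sum((x - xbar)^2)
    t_data = b1 / (s / np.sqrt(ssx))
    return b1, s, t_data
```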

Page 7:

Hypothesis Tests for β1

H0: β1 = 0  There is no linear relationship between x and y.
Ha: β1 ≠ 0  There is a linear relationship between x and y.

Hypothesis Test for Slope β1

If the conditions for the regression model are met:

Step 1: State the hypotheses.

Step 2: Find the t critical value and the rejection rule.

Step 3: Calculate the test statistic and p-value.

Step 4: State the conclusion and the interpretation.

$$t_{\text{data}} = \frac{b_1}{s \big/ \sqrt{\sum (x - \bar{x})^2}}$$
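As an illustration of the p-value approach to these steps, scipy.stats.linregress (assumed available) returns the sample slope and the two-sided p-value for H0: β1 = 0; the default significance level below is an arbitrary choice:

```python
from scipy import stats

def test_slope(x, y, alpha=0.05):
    # Step 1: H0: beta1 = 0 vs. Ha: beta1 != 0
    result = stats.linregress(x, y)
    # Steps 2-3: test statistic (slope) and two-sided p-value for the slope
    print(f"b1 = {result.slope:.4f}, p-value = {result.pvalue:.4f}")
    # Step 4: conclusion and interpretation
    if result.pvalue < alpha:
        print("Reject H0: evidence of a linear relationship between x and y.")
    else:
        print("Do not reject H0: no evidence of a linear relationship.")
```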

Page 8: + Chapter 13: Inference in Regression Lecture PowerPoint Slides Discovering Statistics 2nd Edition Daniel T. Larose

8

Example

Ten subjects were given a set of nonsense words to memorize within a certain amount of time and were later scored on the number of words they could remember. The results are in Table 13.4.

Test whether there is a relationship between time and score using level of significance α = 0.01. Note the graphs on page 640, indicating the conditions for the regression model have been met.

H0: β1 = 0  There is no linear relationship between time and score.

Ha: β1 ≠ 0  There is a linear relationship between time and score.

Reject H0 if the p-value is less than α = 0.01.

Page 9:

Example

Page 10:

Example

Page 11:

Example

Since the p-value of about 0.000 is less than α = 0.01, we reject H0.

There is evidence for a linear relationship between time and score.

Page 12:

Confidence Interval for β1

Confidence Interval for Slope β1

When the regression assumptions are met, a 100(1 − α)% confidence interval for β1 is given by:

$$b_1 \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

where tα/2 has n − 2 degrees of freedom.

Margin of Error

The margin of error for a 100(1 − α)% confidence interval for β1 is given by:

$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

As in earlier sections, we may use a confidence interval for the slope to perform a two-tailed test for β1. If the interval does not contain 0, we would reject the null hypothesis.
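A sketch of the interval and of its use as a two-tailed test, computed directly from the formulas above; the confidence level is an arbitrary choice:

```python
import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, confidence=0.95):
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    s = np.sqrt(np.sum((y - (b1 * x + b0)) ** 2) / (n - 2))   # standard error of the estimate
    ssx = np.sum((x - np.mean(x)) ** 2)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)  # t_{alpha/2} with n - 2 df
    margin = t_crit * s / np.sqrt(ssx)                        # margin of error E
    lower, upper = b1 - margin, b1 + margin
    # Equivalent two-tailed test: reject H0: beta1 = 0 if 0 lies outside the interval
    reject_h0 = not (lower <= 0 <= upper)
    return lower, upper, reject_h0
```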

Page 13:

13.2: Confidence Intervals and Prediction Intervals

Objectives:

Construct confidence intervals for the mean value of y for a given value of x.

Construct prediction intervals for a randomly chosen value of y for a given value of x.

Page 14:

Confidence Interval for the Mean Value of y for a Given x

Confidence Interval for the Mean Value of y for a Given x

A 100(1 − α)% confidence interval for the mean response, that is, for the population mean of all values of y, given a value of x, may be constructed using the following lower and upper bounds:

$$\text{Lower Bound: } \hat{y} - t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$\text{Upper Bound: } \hat{y} + t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

where x* represents the given value of the predictor variable. The requirements are that the regression assumptions are met or the sample size is large.
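A numpy/scipy sketch of these bounds; x_star denotes the given value x*, and the remaining quantities are the same as in the earlier slope sketches:

```python
import numpy as np
from scipy import stats

def mean_response_ci(x, y, x_star, confidence=0.95):
    """Confidence interval for the mean of y at a given value x_star."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b1 * x_star + b0                                   # point estimate of the mean response
    s = np.sqrt(np.sum((y - (b1 * x + b0)) ** 2) / (n - 2))
    ssx = np.sum((x - np.mean(x)) ** 2)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)
    half_width = t_crit * s * np.sqrt(1 / n + (x_star - np.mean(x)) ** 2 / ssx)
    return y_hat - half_width, y_hat + half_width
```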

Page 15:

Prediction Interval for an Individual Value of y for a Given x

Prediction Interval for an Individual Value of y for a Given x

A 100(1 − α)% prediction interval for a randomly selected value of y given a value of x may be constructed using the following lower and upper bounds:

$$\text{Lower Bound: } \hat{y} - t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$\text{Upper Bound: } \hat{y} + t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

where x* represents the given value of the predictor variable. The requirements are that the regression assumptions are met or the sample size is large.
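The prediction interval differs from the confidence interval for the mean response only by the extra 1 under the square root, which reflects the additional variability of an individual observation. A sketch under the same assumptions as above:

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_star, confidence=0.95):
    """Prediction interval for an individual y at a given value x_star."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b1 * x_star + b0
    s = np.sqrt(np.sum((y - (b1 * x + b0)) ** 2) / (n - 2))
    ssx = np.sum((x - np.mean(x)) ** 2)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)
    # Note the extra "1 +" compared with the mean-response interval
    half_width = t_crit * s * np.sqrt(1 + 1 / n + (x_star - np.mean(x)) ** 2 / ssx)
    return y_hat - half_width, y_hat + half_width
```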

Page 16:

13.3: Multiple Regression

Objectives:

Find the multiple regression equation, interpret the multiple regression coefficients, and use the multiple regression equation to make predictions.

Calculate and interpret the adjusted coefficient of determination.

Perform the F test for the overall significance of the multiple regression.

Conduct t tests for the significance of individual predictor variables.

Explain the use and effect of dummy variables in multiple regression.

Apply the strategy for building a multiple regression model.

Page 17:

Multiple Regression

Thus far, we have examined the relationship between the response variable y and a single predictor variable x. In our data-filled world, however, we often encounter situations where we can use more than one x variable to predict the y variable.

Multiple regression describes the linear relationship between one response variable y and more than one predictor variable x1, x2, …. The multiple regression equation is an extension of the regression equation:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

where k represents the number of x variables in the equation and b0, b1, … represent the multiple regression coefficients.

The interpretation of the regression coefficients is similar to the interpretation of the slope in simple linear regression, except that we add that the other x variables are held constant.
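A minimal numpy sketch of fitting a multiple regression equation and using it for prediction; the data values and the new observation are hypothetical:

```python
import numpy as np

# Hypothetical data: two predictors x1, x2 and a response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.1, 6.9, 9.2, 10.8, 13.1, 14.9])

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares multiple regression coefficients b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction: y_hat = b0 + b1*x1 + b2*x2 for a new observation (x1 = 3.5, x2 = 2.0)
y_hat_new = b @ np.array([1.0, 3.5, 2.0])
```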

Page 18:

Adjusted Coefficient of Determination

We measure the goodness of a regression equation using the coefficient of determination r2 = SSR/SST. In multiple regression, we use the same formula for the coefficient of determination (though the letter r is promoted to a capital R).

Multiple Coefficient of Determination R2

The multiple coefficient of determination is given by:

R2 = SSR/SST, where 0 ≤ R2 ≤ 1

where SSR is the sum of squares regression and SST is the total sum of squares. The multiple coefficient of determination represents the proportion of the variability in the response y that is explained by the multiple regression equation.

Page 19: + Chapter 13: Inference in Regression Lecture PowerPoint Slides Discovering Statistics 2nd Edition Daniel T. Larose

19

Adjusted Coefficient of Determination

Unfortunately, when a new x variable is added to the multiple regression equation, the value of R2 always increases, even when the variable is not useful for predicting y. So, we need a way to adjust the value of R2 as a penalty for having too many unhelpful x variables in the equation.

Adjusted Coefficient of Determination R2adj

The adjusted coefficient of determination is given by:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

where n is the number of observations, k is the number of x variables, and R2 is the multiple coefficient of determination.
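A sketch computing both R2 and R2adj from a fitted model, continuing from the numpy fit above (X is the design matrix including the intercept column and b the coefficient vector):

```python
import numpy as np

def r_squared_and_adjusted(X, y, b):
    """Multiple R^2 = SSR/SST and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    n, k = X.shape[0], X.shape[1] - 1          # k = number of x variables (exclude intercept column)
    y_hat = X @ b
    sst = np.sum((y - np.mean(y)) ** 2)        # total sum of squares
    ssr = np.sum((y_hat - np.mean(y)) ** 2)    # sum of squares regression
    r2 = ssr / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, r2_adj
```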

Page 20:

F Test for Multiple Regression

The multiple regression model is an extension of the model from Section 13.1, and approximates the relationship between y and the collection of x variables.

The population parameters are unknown, so we must perform inference to learn about them. We begin by asking: Is our multiple regression useful? To answer this, we perform the F test for the overall significance of the multiple regression.

Multiple Regression Model

The population multiple regression equation is defined as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where β1, β2, …, βk are the parameters of the population regression equation, k is the number of x variables, and ε is the error term that follows a normal distribution with mean 0 and constant variance.

Page 21:

F Test for Multiple Regression

The hypotheses for the F test are:

H0: β1 = β2 = … = βk = 0
Ha: At least one of the β's ≠ 0.

The F test is not valid if there is strong evidence that the regression assumptions have been violated.

F Test for Multiple Regression

If the conditions for the regression model are met

Step 1: State the hypotheses and the rejection rule.

Step 2: Find the F statistic and the p-value. (Located in the ANOVA table of computer output.)

Step 3: State the conclusion and the interpretation.
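One way to read off the F statistic and its p-value is from regression software output; a sketch using the statsmodels package (assumed available), where X_predictors holds the x variables as columns without an intercept column:

```python
import statsmodels.api as sm

def f_test_overall(X_predictors, y):
    """F test for overall significance: H0: beta1 = ... = betak = 0."""
    X = sm.add_constant(X_predictors)          # add the intercept term
    fit = sm.OLS(y, X).fit()
    # The F statistic and its p-value appear in the ANOVA portion of the output
    print(f"F = {fit.fvalue:.3f}, p-value = {fit.f_pvalue:.4f}")
    return fit
```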

Page 22:

t Test for Individual Predictor Variables

To determine whether a particular x variable has a significant linear relationship with the response variable y, we perform the t test that was used in Section 13.1 to test for the significance of that x variable.

t Test for Individual Predictor Variables

One may perform as many t tests as there are predictor variables in the model, which is k.

If the conditions for the regression model are met:

Step 1: For each hypothesis test, state the hypotheses and the rejection rule.

Step 2: For each hypothesis test, find the t statistic and the p-value.

Step 3: For each hypothesis test, state the conclusion and the interpretation.
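Continuing the statsmodels sketch, the fitted model's pvalues attribute contains one two-sided p-value per coefficient, so the k individual t tests can be read off and compared with α (an arbitrary level here):

```python
import numpy as np

def t_tests_for_predictors(fit, alpha=0.05):
    """Report which individual predictors are significant (index 0 is the intercept, so skip it)."""
    for i, p in enumerate(np.asarray(fit.pvalues)[1:], start=1):
        decision = "significant" if p < alpha else "not significant"
        print(f"x{i}: p-value = {p:.4f} -> {decision} at alpha = {alpha}")
```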

Page 23:

Dummy Variables

It is possible to include binomial categorical variables in multiple regression by using a "dummy variable."

Substituting each value of the dummy variable into the multiple regression equation results in two different regression equations, one for each value of the categorical variable.

These two regression equations will have the same slope, but different y-intercepts.

A dummy variable is a predictor variable used to recode a binomial categorical variable in regression by taking values 0 or 1.
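A sketch of the recoding and of the two resulting parallel lines; the category labels and fitted coefficients are hypothetical:

```python
import numpy as np

# Hypothetical binomial categorical variable recoded as a 0/1 dummy
group = np.array(["A", "A", "B", "B", "A", "B"])
dummy = (group == "B").astype(float)           # 0 for "A", 1 for "B"

# Suppose the fitted equation is y_hat = b0 + b1*x + b2*dummy
b0, b1, b2 = 3.0, 2.0, 1.5                     # hypothetical fitted coefficients

# Setting dummy = 0 and dummy = 1 gives two regression lines
# with the same slope b1 but different y-intercepts:
#   group A: y_hat = b0 + b1*x
#   group B: y_hat = (b0 + b2) + b1*x
x = np.linspace(0, 10, 5)
line_A = b0 + b1 * x
line_B = (b0 + b2) + b1 * x
```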

Page 24:

Building a Multiple Regression Model

Strategy for Building a Multiple Regression Model

Step 1: The F Test – Construct the multiple regression equation using all relevant predictor variables. Apply the F test in order to make sure that a linear relationship exists between the response y and at least one of the predictor variables.

Step 2: The t Tests – Perform the t tests for the individual predictors. If at least one of the predictors is not significant, then eliminate the x variable with the largest p-value from the model. Repeat until all remaining predictors are significant.

Step 3: Verify the Assumptions – For your final model, verify the regression assumptions.

Step 4: Report and Interpret Your Final Model – Provide the multiple regression equation, interpret the multiple regression coefficients, and report and interpret the standard error of the estimate and the adjusted coefficient of determination.
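A sketch of Step 2 as a backward-elimination loop over the individual t tests (the F test of Step 1 and the assumption checks of Step 3 are left to the reader); statsmodels is assumed available and the significance level is illustrative:

```python
import numpy as np
import statsmodels.api as sm

def build_model(X_predictors, y, alpha=0.05):
    """Backward elimination: drop the least significant predictor until all are significant."""
    X = np.asarray(X_predictors, dtype=float)
    cols = list(range(X.shape[1]))             # track which original x variables remain
    while True:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = np.asarray(fit.pvalues)[1:]    # skip the intercept
        worst = np.argmax(pvals)
        if pvals[worst] <= alpha or len(cols) == 1:
            return fit, cols                   # then verify assumptions and report this model
        cols.pop(worst)                        # eliminate the x variable with the largest p-value
```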

Page 25:

Chapter 13 Overview

13.1 Inference About the Slope of the Regression Line

13.2 Confidence Intervals and Prediction Intervals

13.3 Multiple Regression
