9 correlation regression


Upload: zuera-opeq

Post on 21-May-2017


Correlation and Regression

Elementary Statistics, Larson Farber

Chapter 9

[Figure: scatter plot of Accidents (y-axis, 0 to 60) against Hours of Training (x-axis, 0 to 20)]

Ch. 9 Larson/Farber 2

Correlation

A relationship between two variables.

x: Explanatory (Independent) Variable

y: Response (Dependent) Variable

x                            y
Hours of Training            Number of Accidents
Shoe Size                    Height
Cigarettes smoked per day    Lung Capacity
Score on SAT                 Grade Point Average
Height                       IQ

What type of relationship exists between the two variables, and is the correlation significant?

Scatter Plots and Types of Correlation

Negative correlation: as x increases, y decreases.
x = hours of training, y = number of accidents

Positive correlation: as x increases, y increases.
x = SAT score, y = GPA

No linear correlation.
x = height, y = IQ

[Figures: one scatter plot for each of the three cases above]

Application

x = Absences, y = Final Grade

x     y
8     78
2     92
5     90
12    58
15    43
9     74
6     81

[Figure: scatter plot of Final Grade (y-axis, 40 to 95) against Absences (x-axis, 0 to 16)]

Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables.

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

The range of r is from -1 to 1.

If r is close to 1, there is a strong positive correlation.

If r is close to -1, there is a strong negative correlation.

If r is close to 0, there is no linear correlation.

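Computation of r

       x     y     xy      x^2    y^2
1      8     78    624     64     6084
2      2     92    184     4      8464
3      5     90    450     25     8100
4      12    58    696     144    3364
5      15    43    645     225    1849
6      9     74    666     81     5476
7      6     81    486     36     6561
Sum    57    516   3751    579    39898

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{7(3751) - (57)(516)}{\sqrt{7(579) - (57)^2}\,\sqrt{7(39898) - (516)^2}} $$

$$ r = \frac{-3155}{\sqrt{804}\,\sqrt{13030}} = -0.975 $$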
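The column sums and the value of r can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the slides.

```python
import math

# Absences (x) and final grades (y) from the table above.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

# Column sums used in the formula for r.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 3))
```

The printed sums should match the bottom row of the table, and r should round to -0.975.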

Test for the Significance of r

r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).

Hypothesis test for the significance of r:

Left-tailed: H0: ρ ≥ 0 (no significant negative correlation), Ha: ρ < 0 (significant negative correlation)

Right-tailed: H0: ρ ≤ 0 (no significant positive correlation), Ha: ρ > 0 (significant positive correlation)

Two-tailed: H0: ρ = 0 (no significant correlation), Ha: ρ ≠ 0 (significant correlation)

The sampling distribution for r is a t-distribution with n - 2 degrees of freedom.

Standardized test statistic:

$$ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
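As a quick check of the test statistic, the sketch below evaluates it for the absences data (r = -0.975 from n = 7 pairs). The function name `t_statistic` is hypothetical and chosen only for this illustration.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """Standardized test statistic for a sample correlation r from n data pairs."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Values from the absences / final grade example.
r = -0.975
n = 7
t = t_statistic(r, n)
print(round(t, 2))
```

The result is about -9.81. Because the input r is itself rounded, the last digit can differ slightly from a hand calculation.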

Test for Significance of r

In finding the correlation between the number of times absent and a final grade, you used seven pairs of data to find r = -0.975. Test the significance of this correlation. Use α = 0.01.

1. Write the null and alternative hypotheses.

H0: ρ = 0 (no significant correlation), Ha: ρ ≠ 0 (significant correlation, two-tailed)

2. State the level of significance.

α = 0.01

3. Identify the sampling distribution.

A t-distribution with n - 2 = 7 - 2 = 5 degrees of freedom.

4. Find the critical values.

Critical values: t0 = -4.032 and t0 = 4.032

5. Find the rejection regions.

Rejection regions: t < -4.032 or t > 4.032

6. Find the test statistic.

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} = -9.811 $$

7. Make your decision.

t = -9.811 falls in the rejection region. Reject the null hypothesis.

8. Interpret your decision.

There is a significant correlation between the number of times absent and final grades.

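[Figure: scatter plot of revenue (y-axis, 180 to 260) against Ad $ (x-axis, 1.5 to 3.0) with a fitted line; the vertical distance $d_i$ between a data point and the line is marked]

$(x_i, y_i)$ = a data point

$(x_i, \hat{y}_i)$ = a point on the line with the same x-value

$d_i = y_i - \hat{y}_i$ is called a residual.

For the line of best fit, $\sum d_i^2$ is a minimum.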
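To make the "sum of squared residuals is a minimum" criterion concrete, the sketch below tries many candidate lines and keeps the one with the smallest sum. It uses the absences and final grade data from this chapter, because the ad and revenue values behind the figure are not in the transcript. The grid search is purely illustrative and is not how a regression line is normally computed.

```python
# Absences (x) and final grades (y) from the chapter's running example.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

def sum_squared_residuals(m, b):
    """Sum of d_i^2, where d_i = y_i - (m * x_i + b)."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

# Brute-force search over candidate slopes and intercepts (assumed ranges).
best_m, best_b, best_sse = None, None, float("inf")
for m_hundredths in range(-600, -199):   # slopes -6.00 to -2.00, step 0.01
    for b_tenths in range(900, 1201):    # intercepts 90.0 to 120.0, step 0.1
        m = m_hundredths / 100
        b = b_tenths / 10
        sse = sum_squared_residuals(m, b)
        if sse < best_sse:
            best_m, best_b, best_sse = m, b, sse

print(best_m, best_b, round(best_sse, 2))
```

The best candidate on the grid lands close to a slope of -3.92 and an intercept of 105.7, with a sum of squared residuals just under 93.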

The Line of Regression

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.

From algebra, the equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.

The line of regression is:

$$ \hat{y} = mx + b $$

The slope m is found by

$$ m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$

The y-intercept is

$$ b = \bar{y} - m\bar{x} $$

Calculate m and b

Write the equation of the line of regression with x = number of times absent and y = final grade.

       x     y     xy      x^2    y^2
1      8     78    624     64     6084
2      2     92    184     4      8464
3      5     90    450     25     8100
4      12    58    696     144    3364
5      15    43    645     225    1849
6      9     74    666     81     5476
7      6     81    486     36     6561
Sum    57    516   3751    579    39898

$$ m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{7(3751) - (57)(516)}{7(579) - (57)^2} = -3.924 $$

$$ b = \bar{y} - m\bar{x} = 73.714 - (-3.924)(8.143) = 105.667 $$

The line of regression is:

$$ \hat{y} = -3.924x + 105.667 $$
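The slope and intercept can be reproduced directly from the data. This is a minimal sketch with illustrative variable names. It carries full precision throughout, so the intercept comes out near 105.668 rather than the 105.667 obtained above from rounded intermediate values.

```python
# Absences (x) and final grades (y).
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# Slope: m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: b = y_bar - m * x_bar
x_bar = sum_x / n
y_bar = sum_y / n
b = y_bar - m * x_bar

print(round(x_bar, 3), round(y_bar, 3))
print(round(m, 3), round(b, 3))
```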

Line of Regression

m = -3.924 and b = 105.667

The line of regression is:

$$ \hat{y} = -3.924x + 105.667 $$

[Figure: scatter plot of Final Grade (y-axis, 40 to 95) against Absences (x-axis, 0 to 16) with the line of regression drawn]

Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.

Predicting Values

The regression line can be used to predict values of y for values of x within the range of the data.

The regression equation for number of times absent and final grade is:

$$ \hat{y} = -3.924x + 105.667 $$

Use this equation to predict the expected grade for a student with (a) 3 absences, (b) 12 absences.

(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$

(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
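The two predictions can be expressed as a small function. This sketch uses the rounded coefficients from the slide. The range check (2 to 15 absences, the smallest and largest x in the data) is an added assumption that enforces the "within the range of the data" caveat, and the function name is hypothetical.

```python
# Rounded regression coefficients from the slide.
SLOPE = -3.924
INTERCEPT = 105.667

# Smallest and largest number of absences in the sample data.
X_MIN, X_MAX = 2, 15

def predict_grade(absences: float) -> float:
    """Predicted final grade; refuses to extrapolate outside the data range."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError("x is outside the range of the data")
    return SLOPE * absences + INTERCEPT

for absences in (3, 12):
    print(absences, round(predict_grade(absences), 3))
```

Calling `predict_grade(20)` raises a `ValueError` instead of returning an extrapolated grade.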

The Coefficient of Determination

The coefficient of determination, r², is the ratio of the explained variation to the total variation.

$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

The correlation coefficient of number of times absent and final grade is r = -0.975. Then the coefficient of determination is (-0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
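The ratio can also be computed from its definition rather than by squaring r. In the sketch below, the explained variation is taken as the sum of squared deviations of the predicted values from the mean of y, and the total variation as the sum of squared deviations of the observed values from that mean. Variable names are illustrative.

```python
# Absences (x) and final grades (y).
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# Least-squares line, unrounded.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
y_bar = sum_y / n
b = y_bar - m * sum_x / n

y_hat = [m * xi + b for xi in x]
explained = sum((yh - y_bar) ** 2 for yh in y_hat)   # explained variation
total = sum((yi - y_bar) ** 2 for yi in y)           # total variation
r_squared = explained / total

print(round(explained, 2), round(total, 2), round(r_squared, 4))
```

With unrounded values the ratio is about 0.9502. The slide's 0.9506 comes from squaring the rounded r = -0.975. Both support the "about 95%" interpretation.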

The Standard Error of Estimate

The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted value $\hat{y}_i$.

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

       x     y     ŷ         (y - ŷ)^2
1      8     78    74.275    13.8756
2      2     92    97.819    33.8608
3      5     90    86.047    15.6262
4      12    58    58.579    0.3352
5      15    43    46.807    14.4932
6      9     74    70.351    13.3152
7      6     81    82.123    1.2611
Sum                          92.767

$$ s_e = \sqrt{\frac{92.767}{5}} = 4.307 $$
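The table and the final value can be reproduced as follows. This is a sketch using the slide's rounded line ŷ = -3.924x + 105.667 so that the predicted values match the table. Variable names are illustrative.

```python
import math

# Absences (x) and final grades (y).
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

# Rounded regression coefficients from the slides.
m, b = -3.924, 105.667

y_hat = [m * xi + b for xi in x]
squared_residuals = [(yi - yh) ** 2 for yi, yh in zip(y, y_hat)]

for xi, yi, yh, d2 in zip(x, y, y_hat, squared_residuals):
    print(xi, yi, round(yh, 3), round(d2, 4))

sse = sum(squared_residuals)       # sum of squared residuals
s_e = math.sqrt(sse / (n - 2))     # standard error of estimate
print(round(sse, 3), round(s_e, 3))
```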

Prediction Intervals

Given a specific linear regression equation and $x_0$, a specific value of x, a c-prediction interval for y is:

$$ \hat{y} - E < y < \hat{y} + E $$

where

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}} $$

The point estimate is $\hat{y}$ and E is the maximum error of estimate.

Use a t-distribution with n - 2 degrees of freedom.

Application

Construct a 90% prediction interval for a final grade when a student has been absent 6 times.

1. Find the point estimate:

$$ \hat{y} = -3.924x + 105.667 = -3.924(6) + 105.667 = 82.123 $$

The point (6, 82.123) is the point on the regression line with x-coordinate of 6.

2. Find E:

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}} = 2.015(4.307)\sqrt{1 + \frac{1}{7} + \frac{7(6 - 8.14)^2}{7(579) - (57)^2}} $$

$$ E = 2.015(4.307)\sqrt{1.18273} = 9.438 $$

At the 90% level of confidence, the maximum error of estimate is 9.438.

3. Find the endpoints:

$$ \hat{y} - E = 82.123 - 9.438 = 72.685 $$

$$ \hat{y} + E = 82.123 + 9.438 = 91.561 $$

$$ 72.685 < y < 91.561 $$

When x = 6, the 90% prediction interval is from 72.685 to 91.561.
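The three steps can be combined into one short calculation. This sketch takes the critical value t_c = 2.015 from a t-table (5 degrees of freedom, 90% level), as in the worked example, rather than computing it. All other quantities are computed from the data without intermediate rounding, so the results can differ from the hand calculation in the third decimal place.

```python
import math

# Absences (x) and final grades (y).
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

T_C = 2.015   # table value: t-distribution, n - 2 = 5 d.f., c = 0.90
x0 = 6        # number of absences to predict for

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
x_bar, y_bar = sum_x / n, sum_y / n

# Least-squares line and standard error of estimate.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = y_bar - m * x_bar
sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))

# Point estimate, maximum error of estimate, and interval endpoints.
y_hat = m * x0 + b
E = T_C * s_e * math.sqrt(
    1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
)
lower, upper = y_hat - E, y_hat + E

print(round(y_hat, 3), round(E, 3))
print(round(lower, 2), round(upper, 2))
```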

Minitab Output

Regression Analysis

The regression equation is
y = 106 - 3.92 x

Predictor    Coef       StDev     T        P
Constant     105.668    3.655     28.91    0.000
x            -3.9241    0.4019    -9.76    0.000

S = 4.307    R-Sq = 95.0%    R-Sq(adj) = 94.0%
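The figures in this output can be recovered from the data. The sketch below assumes the standard simple linear regression formulas for the coefficient standard errors and for adjusted R-squared. It does not claim to describe how Minitab computes them internally, and the variable names are illustrative. The P column is omitted because it requires a t-distribution function.

```python
import math

# Absences (x) and final grades (y).
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Coefficients (Coef column).
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# S: standard error of estimate.
sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# StDev column: standard errors of the coefficients.
se_slope = s / math.sqrt(s_xx)
se_intercept = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)

# T column: coefficient divided by its standard error.
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

# R-Sq and R-Sq(adj).
r_sq = 1 - sse / s_yy
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)

print("Constant", round(intercept, 3), round(se_intercept, 3), round(t_intercept, 2))
print("x", round(slope, 4), round(se_slope, 4), round(t_slope, 2))
print(round(s, 3), round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
```

The T value for x, about -9.76, is the same test statistic as the -9.811 in the significance test, computed here without rounding r first.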