# Simple Linear Regression

Post on 14-Apr-2018


• 7/30/2019 Simple Linear Regression


Linear Regression and Correlation Analysis


Chapter Goals

To understand the methods for displaying and describing the relationship between two variables.

Methods for Studying Relationships

Graphical: scatter plots, line plots, 3-D plots

Models: linear regression, correlations, frequency tables


Two Quantitative Variables

The response variable, also called the dependent variable, is the variable we want to predict, and is usually denoted by y.

The explanatory variable, also called the independent variable, is the variable that attempts to explain the response, and is denoted by x.


Scatter Plots and Correlation

A scatter plot (or scatter diagram) is used to show the relationship between two variables.

Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.

Only concerned with the strength of the relationship

No causal effect is implied


Example

The following graph shows the scatter plot of Exam 1 score (x) and Exam 2 score (y) for 354 students in a class.

Is there a relationship between x and y?


Scatter Plot Examples

[Figure: four scatter plots of y against x, two showing linear relationships and two showing curvilinear relationships.]


Scatter Plot Examples (continued)

[Figure: two scatter plots of y against x showing no relationship.]


Correlation Coefficient

The population correlation coefficient $\rho$ (rho) measures the strength of the association between the variables.

The sample correlation coefficient $r$ is an estimate of $\rho$ and is used to measure the strength of the linear relationship in the sample observations.
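As a rough illustration of how the sample correlation coefficient can be computed, the sketch below uses the usual Pearson formula (the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations). The slides do not show this formula, so treat it as an assumption; the function name `pearson_r` and the toy data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data, not taken from the slides.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # exact positive linear relationship
y_down = [10, 8, 6, 4, 2]  # exact negative linear relationship

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
# Rescaling x (a change of units) leaves r unchanged.
r_rescaled = pearson_r([100 * v for v in x], y_up)
print(r_up, r_down, r_rescaled)
```

The two exact linear patterns sit at the ends of the range of r, and changing the units of x does not move r.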


Features of $\rho$ and $r$

Unit free

Range between -1 and 1

The closer to -1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker the linear relationship


Examples of Approximate r Values

[Figure: five scatter plots of y against x.]

Tag with appropriate value: -1, -.6, 0, +.3, 1


Earlier Example

Correlations

| | | Exam 1 | Exam 2 |
|---|---|---|---|
| Exam1 | Pearson Correlation | 1 | .400\*\* |
| | Sig. (2-tailed) | | .000 |
| | N | 366 | 351 |
| Exam2 | Pearson Correlation | .400\*\* | 1 |
| | Sig. (2-tailed) | .000 | |
| | N | 351 | 356 |

\*\* Correlation is significant at the 0.01 level.


Questions?

What kind of relationship would you expect in the following situations:

Age (in years) of a car, and its price.

Number of calories consumed per day and weight.

Height and IQ of a person.


Exercise

Identify the two variables that vary and decide which should be the independent variable and which should be the dependent variable. Sketch a graph that you think best represents the relationship between the two variables.

1. The size of a person's vocabulary over his or her lifetime.

2. The distance from the ceiling to the tip of the minute hand of a clock hung on the wall.


Introduction to Regression Analysis

Regression analysis is used to:

Predict the value of a dependent variable based on the value of at least one independent variable.

Explain the impact of changes in an independent variable on the dependent variable.

Dependent variable: the variable we wish to explain.

Independent variable: the variable used to explain the dependent variable.


Simple Linear Regression Model

Only one independent variable, x.

Relationship between x and y is described by a linear function.

Changes in y are assumed to be caused by changes in x.


Types of Regression Models

[Figure: four panels illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship.]


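Population Linear Regression

The population regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where:

$Y_i$ = dependent variable

$\beta_0$ = population y-intercept

$\beta_1$ = population slope coefficient

$X_i$ = independent variable

$\varepsilon_i$ = random error term, or residual

$\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component.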


Linear Regression Assumptions

The error terms $\varepsilon_i$, $i = 1, 2, \ldots, n$, are independent and $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$.

The error terms $\varepsilon_i$, $i = 1, 2, \ldots, n$, have constant variance $\sigma^2$.

The underlying relationship between the X variable and the Y variable is linear.
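To make the error-term assumptions concrete, here is a minimal simulation sketch. It draws independent Normal(0, σ²) errors and checks that their sample mean is near 0 and their sample variance is near σ². The value of `sigma`, the sample size, and the seed are hypothetical choices for illustration only.

```python
import random

rng = random.Random(0)   # fixed seed so repeated runs give the same draws
sigma = 1.0              # hypothetical error standard deviation
n = 5000                 # hypothetical number of simulated error terms

# Independent draws eps_i ~ Normal(0, sigma^2)
eps = [rng.gauss(0.0, sigma) for _ in range(n)]

eps_mean = sum(eps) / n
eps_var = sum((e - eps_mean) ** 2 for e in eps) / (n - 1)
print(eps_mean, eps_var)
```

With a sample this large, the sample mean and variance should land close to 0 and σ² respectively, though not exactly on them.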


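Population Linear Regression (continued)

[Figure: the population regression line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ plotted with Y against X, with slope $\beta_1$ and Y-intercept $\beta_0$. At a given $x_i$, the observed value of $y_i$ differs from the predicted value of $y_i$ by the random error $\varepsilon_i$ for this $x_i$ value.]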


Estimated Regression Model

The sample regression line provides an estimate of the population regression function.

$$\hat{Y}_i = b_0 + b_1 X_i$$

where:

$\hat{Y}_i$ = estimated (or predicted) y value

$b_0$ = estimate of the regression y-intercept

$b_1$ = estimate of the regression slope

$X_i$ = independent variable

The individual random error terms $e_i$ have a mean of zero.

The regression function:

$$E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i$$


Earlier Example


Residual

A residual is the difference between the observed response $y_i$ and the predicted response $\hat{y}_i$. Thus, for each pair of observations $(x_i, y_i)$, the $i$th residual is

$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$

Least Squares Criterion

$b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$


Interpretation of the Slope and the Intercept

$b_0$ is the estimated average value of y when the value of x is zero.

$b_1$ is the estimated change in the average value of y as a result of a one-unit change in x.


The Least Squares Equation

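The formulas for $b_1$ and $b_0$ are:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$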
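The two forms of the slope formula can be checked against each other with a few lines of code. The sketch below is only an illustration: the function name `least_squares` and the five-point data set are hypothetical and do not come from the slides.

```python
def least_squares(xs, ys):
    """Return (b0, b1 from deviations, b1 from raw sums)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Deviation form: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    b1_dev = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))

    # Algebraic equivalent using raw sums
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1_raw = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

    # Intercept: b0 = ybar - b1 * xbar
    b0 = mean_y - b1_dev * mean_x
    return b0, b1_dev, b1_raw

# Hypothetical toy data for illustration only.
b0, b1_dev, b1_raw = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1_dev, b1_raw)
```

Both slope computations should agree up to floating-point rounding. The raw-sum form is convenient by hand, while the deviation form tends to be numerically safer when x values are large.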


Finding the Least Squares Equation

The coefficients $b_0$ and $b_1$ will usually be found using computer software, such as Excel, Minitab, or SPSS.

Other regression measures will also be computed as part of computer-based regression analysis.


Simple Linear Regression Example

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).

A random sample of 10 houses is selected.

Dependent variable (y) = house price in \$1000s

Independent variable (x) = square feet


Sample Data for House Price Model

| House Price in \$1000s (y) | Square Feet (x) |
|---|---|
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |


SPSS Output

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .762 (a) | .581 | .528 | 41.33032 |

a. Predictors: (Constant), Square Feet

Coefficients (a)

| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 98.248 | 58.033 | | 1.693 | .129 |
| | Square Feet | .110 | .033 | .762 | 3.329 | .010 |

a. Dependent Variable: House Price

The regression equation is:

house price = 98.248 + 0.110 (square feet)
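The reported coefficients can be reproduced directly from the ten sample houses. The following is a minimal sketch in plain Python rather than SPSS; the variable names are illustrative, and the data are the sample values listed for the house price model.

```python
# Sample data for the house price model (price in $1000s, size in square feet)
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# Least squares slope and intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
sxx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

print(f"house price = {b0:.3f} + {b1:.3f} (square feet)")
```

Up to rounding, the result should match the B column of the Coefficients table.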


Graphical Presentation

House price model: scatter plot and regression line

[Figure: scatter plot of House Price (\$1000s) against Square Feet with the fitted regression line.]

house price = 98.248 + 0.110 (square feet)

Slope = 0.110

Intercept = 98.248


Interpretation of the Intercept, b0

house price = 98.248 + 0.110 (square feet)

$b_0$ is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values).

Here, no houses had 0 square feet, so $b_0 = 98.24833$ just indicates that, for houses within the range of sizes observed, \$98,248.33 is the portion of the house price not explained by square feet.


Interpretation of the Slope Coefficient, b1

house price = 98.24833 + 0.10977 (square feet)

$b_1$ measures the estimated change in the average value of Y as a result of a one-unit change in X.

Here, $b_1 = .10977$ tells us that the value of a house increases by .10977(\$1000) = \$109.77, on average, for each additional one square foot of size.


Least Squares Regression Properties

The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$.

The sum of the squared residuals is a minimum: $\sum (y - \hat{y})^2$ is minimized.

The simple regression line always passes through the mean of the y variable and the mean of the x variable.

The least squares coefficients $b_0$ and $b_1$ are unbiased estimates of $\beta_0$ and $\beta_1$.
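The first and third properties are easy to check numerically on the house price sample. This is a small sketch with illustrative variable names, not part of the original slides.

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x

# Property 1: the residuals y - y_hat sum to zero (up to rounding error).
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
residual_sum = sum(residuals)

# Property 3: the fitted line evaluated at the mean of x gives the mean of y.
fitted_at_mean_x = b0 + b1 * mean_x

print(residual_sum, fitted_at_mean_x, mean_y)
```

In floating-point arithmetic the residual sum comes out as a tiny number rather than exactly 0.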


Exercise

The growth of children from early childhood through adolescence generally follows a linear pattern. Data on the heights of female Americans during childhood, from four to nine years old, were compiled and the least squares regression line was obtained as $\hat{y} = 32 + 2.4x$, where $\hat{y}$ is the predicted height in inches, and $x$ is age in years.

Interpret the value of the estimated slope $b_1 = 2.4$.

Would interpretation of the value of the estimated y-intercept, $b_0 = 32$, make sense here?

What would you predict the height to be for a female American at 8 years old?

What would you predict the height to be for a female American at 25 years old? How does the quality of this answer compare to the previous question?


Coefficient of Determination, $R^2$

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

The coefficient of determination is also called R-squared and is denoted as $R^2$.

$$0 \le R^2 \le 1$$


Coefficient of Determination, $R^2$ (continued)

Note: In the single independent variable case, the coefficient of determination is

$$R^2 = r^2$$

where:

$R^2$ = coefficient of determination

$r$ = simple correlation coefficient
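The identity between the coefficient of determination and the squared correlation can be illustrated on the house price sample. In this sketch, R-squared is computed as explained variation over total variation and compared with the square of the Pearson r; the variable names are illustrative, and the formula for r is the standard one rather than something shown on the slides.

```python
import math

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
sxx = sum((x - mean_x) ** 2 for x in sqft)
syy = sum((y - mean_y) ** 2 for y in price)   # total variation in y

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Variation explained by the regression line
explained = sum((b0 + b1 * x - mean_y) ** 2 for x in sqft)
r_squared = explained / syy

# Simple (Pearson) correlation coefficient
r = sxy / math.sqrt(sxx * syy)

print(round(r_squared, 3), round(r, 3), round(r ** 2, 3))
```

The two routes should agree to rounding, and both line up with the R and R Square values in the SPSS Model Summary for this data.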


Examples of Approximate $R^2$ Values

[Figure: four scatter plots of y against x.]


Examples of Approximate $R^2$ Values

$R^2 = 0$

No linear relationship between x and y: the value of Y does not depend on x. (None of the variation in y is explained by variation in x.)

[Figure: scatter plot of y against x with $R^2 = 0$.]


SPSS Output

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .762 (a) | .581 | .528 | 41.33032 |

a. Predictors: (Constant), Square Feet

ANOVA (b)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 18934.935 | 1 | 18934.935 | 11.085 | .010 (a) |
| | Residual | 13665.565 | 8 | 1708.196 | | |
| | Total | 32600.500 | 9 | | | |

a. Predictors: (Constant), Square Feet

b. Dependent Variable: House Price

Coefficients (a)

| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 98.248 | 58.033 | | 1.693 | .129 |
| | Square Feet | .110 | .033 | .762 | 3.329 | .010 |

a. Dependent Variable: House Price
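The ANOVA rows can be rebuilt from the raw data as a check on how the pieces fit together. The sketch below assumes the usual decomposition (total = regression + residual sum of squares, with F as the ratio of the two mean squares); the abbreviations `ssr`, `sse`, and `sst` are illustrative names, not labels from the slides.

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in sqft]

sst = sum((y - mean_y) ** 2 for y in price)                # Total
ssr = sum((f - mean_y) ** 2 for f in fitted)               # Regression
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))     # Residual

msr = ssr / 1          # regression df = 1 (one independent variable)
mse = sse / (n - 2)    # residual df = n - 2
f_stat = msr / mse

print(round(ssr, 3), round(sse, 3), round(sst, 3), round(f_stat, 3))
```

The regression and residual sums of squares add up to the total, which is why only two of the three need to be computed in practice.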


Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is called the standard error of estimate, $s$.

The standard error of the regression slope coefficient ($b_1$) is given by $s_{b_1}$.


SPSS Output

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .762 (a) | .581 | .528 | 41.33032 |

a. Predictors: (Constant), Square Feet

Coefficients (a)

| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 98.248 | 58.033 | | 1.693 | .129 |
| | Square Feet | .110 | .033 | .762 | 3.329 | .010 |

a. Dependent Variable: House Price

$s = 41.33032$

$s_{b_1} = 0.03297$
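The transcript does not preserve the formulas behind these two values. A minimal sketch, assuming the standard definitions $s = \sqrt{SSE/(n-2)}$ and $s_{b_1} = s / \sqrt{\sum (x - \bar{x})^2}$, recovers both numbers from the house price sample; the variable names are illustrative.

```python
import math

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / sxx
b0 = mean_y - b1 * mean_x

# Sum of squared residuals around the fitted line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))

# Standard error of estimate (assumed formula: sqrt(SSE / (n - 2)))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope (assumed formula: s / sqrt(Sxx))
s_b1 = s / math.sqrt(sxx)

print(round(s, 5), round(s_b1, 5))
```

Agreement with the SPSS values suggests these are the definitions the software uses for this model.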


Comparing Standard Errors

[Figure: four scatter plots of y against x. Two compare small $s$ with large $s$ (variation of observed y values from the regression line). Two compare small $s_{b_1}$ with large $s_{b_1}$ (variation in the slope of regression lines from different possible samples).]


Inference about the Slope: t-Test

t-test for a population slope: is there a linear relationship between X and Y?

Null and alternate hypotheses:

$H_0: \beta_1 = 0$ (no linear relationship)

$H_1: \beta_1 \ne 0$ (linear relationship does exist)

Test statistic:

$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$

Degrees of freedom:

$$\text{d.f.} = n - 2$$

where:

$b_1$ = sample regression slope coefficient

$\beta_1$ = hypothesized slope

$s_{b_1}$ = estimator of the standard error of the slope


Example: Inference about the Slope: t Test

| House Price in \$1000s (y) | Square Feet (x) |
|---|---|
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |

Estimated Regression Equation:

house price = 98.25 + 0.1098 (sq.ft.)

The slope of this model is 0.1098.

Does square footage of the house affect its sales price?


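Inferences about the Slope: t Test Example (continued)

$H_0: \beta_1 = 0$

$H_A: \beta_1 \ne 0$

From Excel output:

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 |

In the Square Feet row, the coefficient is $b_1$, the standard error is $s_{b_1}$, and the t Stat is the test statistic $t^*$.

d.f. = 10 - 2 = 8

[Figure: t distribution with rejection regions (Reject $H_0$) below $-t_{\alpha/2} = -2.3060$ and above $t_{(1-\alpha/2)} = 2.3060$, each with area $\alpha/2 = .025$, and a Do not reject $H_0$ region between them around 0. The test statistic 3.329 falls in the upper rejection region.]

Test Statistic: t = 3.329

Decision: Reject $H_0$

Conclusion: There is sufficient evidence that square footage affects house price.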
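The decision rule in this example can be written out in a few lines. The sketch below recomputes the test statistic from the data and compares it with the critical value 2.3060 quoted on the slide for 8 degrees of freedom. The critical value is taken from the slide rather than computed, and the formulas for $s$ and $s_{b_1}$ are the standard ones, which the transcript does not show.

```python
import math

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / sxx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)   # assumed standard formula

beta1_hypothesized = 0.0            # H0: beta1 = 0
t_stat = (b1 - beta1_hypothesized) / s_b1

t_critical = 2.3060                 # from the slide: alpha/2 = .025, d.f. = 8
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 3), reject_h0)
```

The two-sided comparison uses the absolute value of t, so a strongly negative slope would lead to rejection just as a strongly positive one does.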


Confidence Interval for the Slope

Confidence Interval Estimate of the Slope:

$$b_1 \pm t_{1-\alpha/2}\, s_{b_1}, \qquad \text{d.f.} = n - 2$$

Excel Printout for House Prices:

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |

At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
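The interval in the Excel printout follows from the slope, its standard error, and the t critical value. As a small sketch, the code below plugs in the rounded values from the printout together with the critical value 2.3060 for 8 degrees of freedom, so the endpoints can differ from Excel's in the last digit.

```python
b1 = 0.10977         # slope coefficient from the Excel printout
s_b1 = 0.03297       # standard error of the slope from the printout
t_critical = 2.3060  # t value for 95% confidence with d.f. = 10 - 2 = 8

margin = t_critical * s_b1
lower = b1 - margin
upper = b1 + margin

# Convert from $1000s per square foot to dollars per square foot
lower_dollars = lower * 1000
upper_dollars = upper * 1000

print(round(lower, 4), round(upper, 4))
print(round(lower_dollars, 2), round(upper_dollars, 2))
```

Because the interval lies entirely above 0, it supports the same conclusion as the two-sided t test at the .05 level.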


Confidence Interval for the Slope

Since the units of the house price variable are \$1000s, we are 95% confident that the average impact on sales price is between \$33.70 and \$185.80 per square foot of house size.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance.


Residual Analysis

Purposes

Examine for linearity assumption

Examine for constant variance for all levels of x

Evaluate normal distribution assumption

Graphical Analysis of Residuals

Can plot residuals vs. x

Can create histogram of residuals to check for normality


Residual Analysis for Linearity

[Figure: two pairs of plots, labeled Not Linear and Linear. Each pair shows y against x and the corresponding residuals against x.]


Residual Analysis for Constant Variance

[Figure: two pairs of plots, labeled Non-constant variance and Constant variance. Each pair shows y against x and the corresponding residuals against x.]


House Price Model Residual Plot

[Figure: residuals plotted against Square Feet.]

Example: Residual Output

RESIDUAL OUTPUT

| Observation | Predicted House Price | Residuals |
|---|---|---|
| 1 | 251.92316 | -6.923162 |
| 2 | 273.87671 | 38.12329 |
| 3 | 284.85348 | -5.853484 |
| 4 | 304.06284 | 3.937162 |
| 5 | 218.99284 | -19.99284 |
| 6 | 268.38832 | -49.38832 |
| 7 | 356.20251 | 48.79749 |
| 8 | 367.17929 | -43.17929 |
| 9 | 254.6674 | 64.33264 |
| 10 | 284.85348 | -29.85348 |

10 284.85348 -29.85348