Correlation & Regression A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest. The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).

Upload: edwin-patrick

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Correlation & Regression A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of

Correlation & Regression

• A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest.

• The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).


Quantitative Variables

• Dependent Variable (Y)
  • the variable being predicted
  • called the response variable

• Independent Variable (X)
  • the variable used to explain or predict Y
  • called the explanatory or predictor variable


Correlation & Regression

• Correlation
  • Addresses the questions:
    “Is there a relationship between X and Y?”
    “If so, how strong is it?”

• Regression
  • Addresses the question:
    “What is the relationship between X and Y?”


Simple Linear Relationship

• A linear (straight-line) relationship between Y and a single X.

• The form of the equation is Y = b0 + b1 X, where b0 is the y-intercept and b1 is the slope.

• A scatter plot of Y versus X is useful for spotting linear relationships and obvious departures from linearity.
  • Always start with a scatter plot!


Correlation

• A correlation exists between two variables when they are related in some way.

• Linear Correlation Coefficient (r)
  • measures the strength of the linear relationship between X and Y

• Properties of r
  • -1 ≤ r ≤ 1
  • r = 1 for a perfect positive linear relationship
  • r = -1 for a perfect negative linear relationship
  • r = 0 if there is no linear relationship


Sample Correlation Coefficient

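• A statistic that is useful for estimating the linear correlation coefficient:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \quad \text{where}$$

$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$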
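The computational formulas r = S_xy / sqrt(S_xx S_yy), with S_xy = Σxy − (Σx)(Σy)/n and the analogous S_xx and S_yy, can be sketched in a few lines of Python. The function names and the five data points below are illustrative assumptions, not part of the slides.

```python
from math import sqrt

def sums_of_squares(x, y):
    """Return (S_xx, S_yy, S_xy) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def sample_correlation(x, y):
    """Sample linear correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

For this data S_xx = 10, S_yy = 6 and S_xy = 6, so r = 6 / sqrt(60).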


Coefficient of Determination

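• The coefficient of determination is the proportion of variability in Y that can be explained by its linear relationship to X.
  • Computed by squaring the sample correlation coefficient (r²):

$$r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} = 1 - \frac{SSE}{TSS}$$

where SSE is the error (residual) sum of squares and TSS is the total sum of squares of Y.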
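As a quick numerical check of the identity r² = S_xy² / (S_xx S_yy) = 1 − SSE/TSS, the sketch below computes both sides on a small made-up data set. It assumes the least squares estimates b1 = S_xy/S_xx and b0 = ȳ − b1 x̄ given later in the slides, and uses the deviation form of the sums of squares, which is algebraically equivalent to the computational form.

```python
# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares written as sums of deviations from the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares line, needed to compute the residual sum of squares.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
tss = s_yy                                                     # total variation in Y

r_squared = s_xy ** 2 / (s_xx * s_yy)
print(round(r_squared, 4), round(1 - sse / tss, 4))
```

Both printed values agree (0.6 for this data), meaning 60% of the variability in Y is explained by the line.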


Hypothesis Testing of the Linear Correlation Coefficient

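• Appropriate Hypotheses (ρ is the population linear correlation coefficient):

$$H_0: \rho = 0 \quad \text{(No Linear Relationship)}$$

$$H_1: \rho \neq 0 \quad \text{(Linear Relationship)}$$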


Testing r

• Test Statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad df = n - 2$$

• Rejection Region (3 cases of H1)
  1. Two-tailed: For H1: ρ ≠ 0, reject H0 for |t| ≥ t_{α/2}
  2. Left-tailed: For H1: ρ < 0, reject H0 for t ≤ -t_{α}
  3. Right-tailed: For H1: ρ > 0, reject H0 for t ≥ t_{α}
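A minimal sketch of the two-tailed test of H0: ρ = 0, using t = r sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. The inputs are illustrative assumptions, and the critical value is simply typed in from a standard t table rather than computed.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """Return (t, df) for testing H0: rho = 0 from a sample correlation r."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df

# Illustrative inputs (not from the slides): r^2 = 0.6 observed on n = 5 pairs.
r, n = sqrt(0.6), 5
t, df = correlation_t_statistic(r, n)

# Two-tailed critical value t_{0.025} with 3 df, read from a standard t table.
t_crit = 3.182
reject_h0 = abs(t) >= t_crit
print(round(t, 4), df, reject_h0)
```

Here t is about 2.12, which is below 3.182, so H0 is not rejected at α = 0.05 even though r is fairly large; with only five observations the test has little power.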


Simple Linear Regression

• The Least Squares Regression line is our "best" line for explaining the relationship between Y and X.
  • It minimizes the sum of squared errors (the squared vertical distances between the observed values and the values predicted by the line):

$$f(b_0, b_1) = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2$$

• The predicted value of Y for any X can be found by plugging X into the least squares regression line.


Simple Linear Regression Line

• The equation is:

$$\hat{y} = b_0 + b_1 x$$

where

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
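The slope and intercept formulas b1 = S_xy / S_xx and b0 = ȳ − b1 x̄ translate directly into code. The following is a minimal sketch; `fit_line`, `predict`, and the data are hypothetical names and values chosen for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

def predict(b0, b1, x_new):
    """Plug x_new into the fitted line."""
    return b0 + b1 * x_new

# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 4), round(b1, 4), round(predict(b0, b1, 3), 4))
```

Note that the fitted line always passes through the point (x̄, ȳ), which is why the prediction at x = 3 equals the mean of y for this data.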


Proper Use of Correlation & Regression

• Correlation does not imply causation.
• Simple linear regression is appropriate only if the data cluster about a line.
• Do not extrapolate beyond the range of the observed X values.
• Do not apply the model to other populations.
• For multiple regression, the size of a parameter estimate does not indicate its importance.


Effect of Extreme Values

• Extreme values can have a very large effect on correlation and regression analysis.

• Influential outliers can strongly affect the model fit.
  • Regression Applet by Webster West
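To illustrate how much a single extreme value can matter, the sketch below computes r for a small made-up data set, then again after appending one outlying point. The data and the added point (10, 0) are assumptions chosen only to make the effect visible.

```python
from math import sqrt

def sample_correlation(x, y):
    """Sample linear correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data with a moderately strong positive linear trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_without = sample_correlation(x, y)

# The same data plus one extreme point far from the rest.
r_with = sample_correlation(x + [10], y + [0])

print(round(r_without, 3), round(r_with, 3))
```

One added observation flips the sign of the correlation here, which is why a scatter plot should be inspected before trusting r or a fitted line.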


Model Assumptions for Inference

The difference between the observed and the model-predicted values is called the residual, and is denoted by e:

$$e = y - \hat{y}$$

The residuals are assumed to be independent and identically normally distributed with mean 0 and standard deviation σ_e.

So for a particular X, the distribution of Y can be described as normal with mean equal to the predicted value of Y for that X, and standard deviation equal to σ_e.
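A short sketch of how residuals are obtained from a fitted line, and how the residual standard deviation is commonly estimated by s = sqrt(SSE / (n − 2)). The data are made up, and the least squares formulas b1 = S_xy/S_xx and b0 = ȳ − b1 x̄ are taken from the earlier slides.

```python
from math import sqrt

# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed value minus value predicted by the line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Estimate of the residual standard deviation, with n - 2 degrees of freedom.
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))

print([round(e, 2) for e in residuals], round(s, 4))
```

For a least squares line with an intercept, the residuals always sum to zero (up to rounding), so their mean carries no information; it is their spread and pattern that are checked against the assumptions.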


Inference about the Simple Linear Regression Model Parameters

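Is there a significant relationship between X and Y? H0: β1 = 0 versus H1: β1 ≠ 0, where β1 is the true (population) slope.

• Test Statistic (with β1 = 0 under H0):

$$T = \frac{b_1 - \beta_1}{s/\sqrt{S_{xx}}}, \qquad df = n - 2,$$

where

$$s = \sqrt{\frac{SSE}{n-2}}, \qquad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$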
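The slope test can be sketched as follows, assuming the statistic T = b1 / (s / sqrt(S_xx)) under H0: β1 = 0, with s = sqrt(SSE / (n − 2)) and SSE = S_yy − S_xy²/S_xx. The function name and data are illustrative assumptions.

```python
from math import sqrt

def slope_t_statistic(x, y):
    """Return (T, df) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = s_xy / s_xx
    sse = s_yy - s_xy ** 2 / s_xx      # shortcut formula for SSE
    s = sqrt(sse / (n - 2))            # residual standard deviation estimate
    t_stat = b1 / (s / sqrt(s_xx))     # b1 divided by its standard error
    return t_stat, n - 2

# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, df = slope_t_statistic(x, y)
print(round(t_stat, 4), df)
```

In simple linear regression this statistic is numerically identical to the t statistic for testing ρ = 0, so the two tests always reach the same conclusion.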


Inference about the Simple Linear Regression Model Parameters

• Rejection Region (3 cases of H1)
  1. Two-tailed: For H1: β1 ≠ 0, reject H0 for |T| ≥ t_{α/2}
  2. Left-tailed: For H1: β1 < 0, reject H0 for T ≤ -t_{α}
  3. Right-tailed: For H1: β1 > 0, reject H0 for T ≥ t_{α}


Inference about the Simple Linear Regression Model Parameters

Is there a non-zero y-intercept in the linear relationship between X and Y?

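H0: β0 = 0 versus H1: β0 ≠ 0, where β0 is the true (population) y-intercept.

• Test Statistic (with β0 = 0 under H0):

$$T = \frac{b_0 - \beta_0}{s\sqrt{\dfrac{\sum x^2}{n\,S_{xx}}}}, \qquad df = n - 2,$$

where

$$s = \sqrt{\frac{SSE}{n-2}}, \qquad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$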
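A matching sketch for the intercept test, assuming the statistic T = b0 / (s sqrt(Σx² / (n S_xx))) under H0: β0 = 0, with s = sqrt(SSE / (n − 2)). As before, the function name and data are hypothetical.

```python
from math import sqrt

def intercept_t_statistic(x, y):
    """Return (T, df) for testing H0: beta0 = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

    # Standard error of b0: s * sqrt(sum of x^2 / (n * S_xx)).
    sum_x2 = sum(xi * xi for xi in x)
    se_b0 = s * sqrt(sum_x2 / (n * s_xx))
    return b0 / se_b0, n - 2

# Made-up example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, df = intercept_t_statistic(x, y)
print(round(t_stat, 4), df)
```

The resulting value is compared with the same t critical values (n − 2 degrees of freedom) used for the slope test.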


Inference about a Regression Line

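E(Y) is the expected value of Y. For a given X = x*, E(Y) is estimated by evaluating the fitted simple linear regression equation at x*. A t-distribution with n − 2 degrees of freedom gives a confidence interval for the true mean value of Y at that X:

$$\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}, \qquad \hat{y}^* = b_0 + b_1 x^*$$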


Inference about Y for a Given X

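The predicted value of a single observation of Y for a given X = x* is the estimate of E(Y) at x*. A t-distribution with n − 2 degrees of freedom allows the construction of a prediction interval for a single observation at that value of X; the extra 1 under the square root accounts for the variability of an individual observation about its mean:

$$\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}, \qquad \hat{y}^* = b_0 + b_1 x^*$$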
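The confidence interval for the mean of Y and the prediction interval for a single Y differ only by the extra 1 under the square root. A minimal sketch, assuming half-widths of t_{α/2} s sqrt(1/n + (x* − x̄)²/S_xx) and t_{α/2} s sqrt(1 + 1/n + (x* − x̄)²/S_xx); the function name, the data, and the table value for the critical t are assumptions for illustration.

```python
from math import sqrt

def regression_intervals(x, y, x_star, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at x_star.

    t_crit is the t_{alpha/2} critical value with n - 2 degrees of freedom,
    supplied by the caller (for example from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    s = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

    y_hat = b0 + b1 * x_star
    leverage = 1 / n + (x_star - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * sqrt(leverage)        # mean of Y at x_star
    pi_half = t_crit * s * sqrt(1 + leverage)    # single new observation
    return y_hat, ci_half, pi_half

# Made-up example data; 3.182 is t_{0.025} with 3 df from a standard t table.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat, ci_half, pi_half = regression_intervals(x, y, x_star=3, t_crit=3.182)
print(round(y_hat, 3), round(ci_half, 4), round(pi_half, 4))
```

The prediction interval is always wider than the confidence interval at the same x*, and both widen as x* moves away from x̄.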


Residual Analysis

Can be useful for checking the model assumptions, which for the linear regression model are:
• Independent observations
• Residuals have an N(0, σ²) distribution

Plots (for example, residuals versus X or versus predicted values) can be useful for spotting model inadequacy.


Variable Selection in Multiple Regression

• Compare all possible regressions
• Backward elimination
• Forward selection
• Stepwise elimination