correlation & regression

CORRELATION & REGRESSION Chapter 10

Upload: dexter-mcknight

Post on 02-Jan-2016


TRANSCRIPT

Page 1: Correlation & Regression

CORRELATION & REGRESSION
Chapter 10

Page 2: Correlation & Regression

Introduction
• Another area of inferential statistics involves determining whether a relationship exists between two or more quantitative variables
• For example:
  • Business person deciding whether volume of sales for a given month is related to amount of advertising the firm does that month
  • Educators interested in how number of hours a student studies is related to student’s score on an exam
  • Medical researchers interested in determining if caffeine is related to heart damage

Page 3: Correlation & Regression

Introduction cont.
• Correlation
  • Statistical method used to determine whether a relationship between variables exists
• Regression
  • Statistical method used to describe nature of relationship between variables, that is, positive or negative, linear or nonlinear
• Questions to be answered
  1. Are two or more variables related?
  2. If so, what is strength of relationship?
  3. What type of relationship exists?
  4. What kind of predictions can be made from relationship?

Page 4: Correlation & Regression

Types of Relationships
• Two types of relationships: simple and multiple
• Simple relationship
  • One independent (explanatory) variable, and one dependent (response) variable
  • Simple relationship analysis is called simple regression
  • Positive relationship – exists when both variables increase or decrease at the same time
  • Negative relationship – exists when one variable increases as the other decreases, and vice versa
• Multiple relationship
  • Two or more independent variables are used to predict one dependent variable

Page 5: Correlation & Regression

10.1 – Scatter Plots & Regression
• In simple correlation and regression studies, researcher collects data on two quantitative variables to see whether a relationship exists between them

• Independent variable can be controlled or manipulated (designated as x-axis variable)

• Dependent variable cannot be controlled or manipulated (designated as y-axis variable)

Page 6: Correlation & Regression

Scatter Plots
• Scatter plot
  • Graph of ordered pairs (x, y) of numbers consisting of independent variable x and the dependent variable y
  • Visual way to describe nature of relationship between independent and dependent variables
• After plot is drawn, it should be analyzed to determine which type of relationship, if any, exists
• Example 10-1 (p. 536)
• Example 10-2 (p. 537)
• Example 10-3 (p. 538)

Page 7: Correlation & Regression

Correlation
• Statisticians use correlation coefficient to determine strength of linear relationship between two variables
• Pearson product moment correlation coefficient (PPMC)
  • Named after statistician Karl Pearson, who pioneered research in this area
• Correlation coefficient
  • Computed from sample data; measures strength and direction of linear relationship between two variables
  • Symbol for sample correlation coefficient is r
  • Symbol for population correlation coefficient is ρ (Greek letter rho)

Page 8: Correlation & Regression

Formula for Correlation Coefficient
• Range of the correlation coefficient is from -1 to +1
  • Value of r close to +1 suggests strong positive linear relationship
  • Value of r close to -1 suggests strong negative linear relationship
  • Value of r close to 0 suggests weak or no linear relationship
• Formula for Correlation Coefficient r

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}}$$

Where n is the number of data pairs
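A minimal Python sketch of the usual sum-based (computational) form of r, written out in the docstring. The function name `pearson_r` is illustrative, and the data are the hypothetical x and y values from the Regression Model slide (Page 22), chosen only because that slide also reports r = 0.919 for comparison.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from sums of the data:
    r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])
    where n is the number of data pairs.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data taken from the Regression Model slide (Page 22).
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 12, 16, 20]
r = pearson_r(x_values, y_values)
print(round(r, 3))  # 0.919
```

The sketch does not guard against a constant x or y sample, for which the denominator is zero and r is undefined.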

Page 9: Correlation & Regression

Example 10-4
• Compute the correlation coefficient for data in Example 10-1

Page 10: Correlation & Regression

Significance of Correlation Coefficient
• Question arises: when is value of r due to chance, and when does it suggest a significant linear relationship between the variables?
• Since value of r is computed from samples, two possibilities exist when r is not equal to zero
  • Either value of r is high enough to conclude there is significant linear relationship OR
  • Value of r is due to chance
• To make a decision, use a hypothesis-testing procedure similar to the traditional method

Page 11: Correlation & Regression

Population Correlation Coefficient
• Sample correlation coefficient can be used as an estimator of ρ (rho) if following assumptions are valid
  1. Variables x and y are linearly related
  2. Variables are random variables
  3. Two variables have a bivariate normal distribution
• Population correlation coefficient
  • Correlation computed by using all possible pairs of data values (x, y) taken from a population

Page 12: Correlation & Regression

Hypothesis Testing
• In hypothesis testing, one of these is true
  • H0: ρ = 0 (there is no correlation between the x and y variables in the population)
  • OR
  • H1: ρ ≠ 0 (there is a significant correlation between the variables in the population)
• When null hypothesis is rejected at a specific level, it means there is a significant difference between the value of r and 0.
• When null hypothesis is not rejected, it means value of r is not significantly different from 0 and is probably due to chance
• Do not have to identify claim, since question will always be whether there is significant linear relationship between the variables

Page 13: Correlation & Regression

Formula for t Test
• Formula for t Test for Correlation Coefficient

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

with degrees of freedom equal to n – 2

• Example 10-7
  • Test the significance of the correlation coefficient found in Example 10-4. Use α = 0.05 and r = 0.982
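A hedged sketch of how the test in Example 10-7 might be carried out, assuming the usual statistic t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The sample size is not stated in this transcript, so n = 6 below is an assumption made purely for illustration. The critical value 2.776 is the standard two-tailed t-table entry for α = 0.05 and 4 degrees of freedom, and it changes if n differs.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 data pairs")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.982            # value given in Example 10-7
n = 6                # ASSUMPTION: sample size is not given in the transcript
critical_t = 2.776   # t table: two-tailed, alpha = 0.05, d.f. = n - 2 = 4

t = t_statistic(r, n)
reject_null = abs(t) > critical_t  # True means r differs significantly from 0
print(round(t, 2), reject_null)
```

Under these assumptions the statistic falls far beyond the critical value, so the null hypothesis ρ = 0 would be rejected.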

Page 14: Correlation & Regression

Correlation and Causation
• When a hypothesis test indicates that a significant linear relationship exists between variables, researchers must consider possibilities outlined next.
• Possible Relationships Between Variables
  • When null hypothesis has been rejected for a specific α value, any of the following five possibilities can exist:

1. There is a direct cause-and-effect relationship between variables

2. There is a reverse cause-and-effect relationship between variables

3. Relationship between variables may be caused by a third variable

4. There may be a complexity of interrelationships among many variables

5. Relationship may be coincidental

• Remember, correlation does not necessarily imply causation

Page 15: Correlation & Regression

10.2 – Regression
• If value of correlation coefficient is significant, next step is to determine equation of regression line
• Regression line
  • Data’s line of best fit
  • Allows researcher to see trend and make predictions on basis of the data

Page 16: Correlation & Regression

Line of Best Fit
• Given a scatter plot, you must be able to draw the line of best fit
• Line of best fit
  • Line drawn so that sum of squares of vertical distances from each point in scatter plot to line is at a minimum

Page 17: Correlation & Regression

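Determination of Regression Line Equation
• Linear equation in algebra is written as

$$y = mx + b$$

• In statistics, regression line is written as

$$y' = a + bx$$

Where a is the y’ intercept and b is the slope of the line

• Formula for Regression Line y’ = a + bx

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

and

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

• Rounding rule: round values of a and b to three decimal places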
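As a rough illustration, the intercept a and slope b of y’ = a + bx can be computed from the same sums used for r. The sketch below assumes the standard least-squares formulas quoted in the docstring; `regression_line` is an illustrative name. It is applied to the hypothetical data from the Regression Model slide (Page 22), where x = 1 is reported to give y’ = 7.6.

```python
def regression_line(xs, ys):
    """Least-squares intercept a and slope b for y' = a + b*x:
    a = [Sy*Sxx - Sx*Sxy] / [n*Sxx - Sx^2]
    b = [n*Sxy - Sx*Sy]   / [n*Sxx - Sx^2]
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

# Hypothetical data taken from the Regression Model slide (Page 22).
a, b = regression_line([1, 2, 3, 4, 5], [10, 8, 12, 16, 20])
print(round(a, 3), round(b, 3))  # 4.8 2.8

# Prediction at x = 1, matching the y' = 7.6 quoted on that slide.
y_pred_at_1 = a + b * 1
```

Following the rounding rule, a and b are rounded to three decimal places only when reported, not during the calculation.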

Page 18: Correlation & Regression

Examples
• 10-9
  • Find the equation of the regression line for data in Example 10-4 and graph the line on the scatter plot of the data
• 10-11
  • Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles

Page 19: Correlation & Regression

Assumptions
• Marginal change
  • Magnitude of change in one variable when the other variable changes exactly 1 unit
• When r is not significantly different from 0, best predictor of y is mean of data values of y
• For valid predictions, value of correlation coefficient must be significant; also, two other assumptions must be met:
  1. For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line
  2. The standard deviation of each of the dependent variables must be the same for each value of the independent variable

Page 20: Correlation & Regression

Checking for Outliers
• All scatter plots should be checked for outliers
• Influential points / influential observations
  • Points that can affect equation of regression line
  • When point on scatter plot seems to be an outlier, it should be checked to see if it is an influential point, because influential points tend to “pull” regression line toward themselves
• Researchers should use their judgment whether to include influential observations in final analysis of data
  • If researcher feels observation is not necessary, then it should be excluded so it does not influence results of study
  • If researcher feels that it is necessary, he or she may want to obtain additional data values whose x values are near x value of influential point
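One way to check whether a suspected outlier is influential is to fit the line with and without it and compare the slopes. The sketch below does this for the hypothetical data from the Regression Model slide (Page 22) plus one made-up point, (10, 5), that is not in the deck and was invented only to show the "pull" effect.

```python
def slope(points):
    """Least-squares slope b for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Hypothetical data from the Regression Model slide (Page 22).
base_points = [(1, 10), (2, 8), (3, 12), (4, 16), (5, 20)]
# ASSUMPTION: an invented point, far from the others in x and off the trend.
suspect_point = (10, 5)

slope_without = slope(base_points)
slope_with = slope(base_points + [suspect_point])

# A large change in slope suggests the point is influential.
print(round(slope_without, 3), round(slope_with, 3))
```

Here a single point is enough to turn a clearly positive slope negative, which is the kind of change that calls for the researcher's judgment described above.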

Page 21: Correlation & Regression

10.3 – Coefficient of Determination & Standard Error of the Estimate
• If correlation coefficient is significant, then equation of regression line can be determined
• Other measures are associated with correlation and regression techniques:
  • Coefficient of determination
  • Standard error of the estimate
  • Prediction interval

Page 22: Correlation & Regression

Regression Model
• Consider this hypothetical regression model
  • X values: {1, 2, 3, 4, 5}
  • Y values: {10, 8, 12, 16, 20}
• Regression line equation is y’ = 4.8 + 2.8x, and r = 0.919
• For each value of x there is an observed y value and a predicted y’ value
  • When x = 1, y = 10, and y’ = 7.6
• Recall that the closer the y’ values are to the actual y values, the better the fit and the closer r is to +1 or -1

Page 23: Correlation & Regression

Total Variation
• Total variation
  • Sum of squares of vertical distances each point is from mean
• Explained variation
  • Variation obtained from the relationship (y’ predicted values)
• Unexplained variation
  • Variation due to chance
• *Total variation = Explained variation + unexplained variation*

$$\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$$
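The decomposition can be checked numerically. This sketch uses the hypothetical data from the Regression Model slide (Page 22) and assumes the least-squares line for that data has intercept 4.8 and slope 2.8, which reproduces the slide's y’ = 7.6 at x = 1. Variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]

# Least-squares fit of the data above (gives y' = 7.6 at x = 1, as on Page 22).
a, b = 4.8, 2.8

mean_y = sum(ys) / len(ys)
predicted = [a + b * x for x in xs]

# Total variation: squared distances of the observed y values from the mean.
total_variation = sum((y - mean_y) ** 2 for y in ys)
# Explained variation: squared distances of the predicted y' values from the mean.
explained_variation = sum((yp - mean_y) ** 2 for yp in predicted)
# Unexplained variation: squared distances of observed y from predicted y'.
unexplained_variation = sum((y - yp) ** 2 for y, yp in zip(ys, predicted))

print(round(total_variation, 1),
      round(explained_variation, 1),
      round(unexplained_variation, 1))  # 92.8 78.4 14.4
```

The identity holds exactly only when the line is the least-squares line; for any other line the two parts generally do not add up to the total.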

Page 24: Correlation & Regression

Residuals & Least-Squares
• Residual
  • Difference between actual value of y and predicted y’ value for a given x value
• Least-squares line
  • Another name for a regression line, because it is computed so that sum of squares of residuals is the smallest possible value
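To make the "smallest possible value" idea concrete, the sketch below compares the sum of squared residuals for the least-squares line against a few other candidate lines. The data are the hypothetical values from the Regression Model slide (Page 22). The fitted coefficients 4.8 and 2.8 are assumed to be that data's least-squares fit, and the alternative lines are arbitrary choices made for illustration.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of (y - y')^2, where y' = a + b*x is the predicted value."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]

# Least-squares fit of the data above (gives y' = 7.6 at x = 1, as on Page 22).
fit_a, fit_b = 4.8, 2.8
residuals = [y - (fit_a + fit_b * x) for x, y in zip(xs, ys)]
best = sum_squared_residuals(fit_a, fit_b, xs, ys)

# Arbitrary alternative (intercept, slope) pairs, chosen only for comparison.
alternatives = [(4.8, 3.0), (5.5, 2.8), (4.0, 3.0), (6.0, 2.5)]
others = [sum_squared_residuals(a, b, xs, ys) for a, b in alternatives]

# Every alternative line should have a larger sum of squared residuals.
print(round(best, 1), all(value > best for value in others))
```

Note that the residuals of the fitted line sum to zero, which is a general property of least-squares lines that include an intercept.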

Page 25: Correlation & Regression

Coefficient of Determination
• Coefficient of determination
  • Measure of the variation of the dependent variable that is explained by the regression line and the independent variable
  • Ratio of explained variation to total variation
  • Can also be found by squaring the r value
• Coefficient of nondetermination
  • Found by subtracting coefficient of determination from 1
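A short worked example, using the hypothetical data from the Regression Model slide (Page 22). The explained variation (78.4) and total variation (92.8) are computed here from that data and its least-squares line, not quoted from the deck, so treat them as a check rather than as the author's figures.

```latex
r^2 = \frac{\text{explained variation}}{\text{total variation}}
    = \frac{\sum (y' - \bar{y})^2}{\sum (y - \bar{y})^2}
    = \frac{78.4}{92.8} \approx 0.845
\qquad
\text{and} \qquad (0.919)^2 \approx 0.845

1 - r^2 \approx 1 - 0.845 = 0.155
```

Read this way, roughly 84.5% of the variation in y is explained by the regression line, and about 15.5% is left unexplained.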

Page 26: Correlation & Regression

Standard Error of the Estimate
• When a y’ value is predicted for a specific x value, prediction is a point estimate
• Standard error of the estimate
  • Denoted by s_est, is the standard deviation of the observed y values about the predicted y’ values
  • Prediction interval uses this statistic
  • Formula for standard error of estimate is

$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$$
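A minimal sketch of the calculation, assuming the usual definition s_est = sqrt(Σ(y − y’)² / (n − 2)). It uses the hypothetical data from the Regression Model slide (Page 22) rather than the copy machine data of Example 10-12, which is not reproduced in this transcript. The coefficients 4.8 and 2.8 are assumed to be the least-squares fit of that data.

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    """s_est = sqrt( sum((y - y')^2) / (n - 2) ), with y' = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 data pairs")
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(unexplained / (n - 2))

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]

# Least-squares fit of the data above (gives y' = 7.6 at x = 1, as on Page 22).
s_est = standard_error_of_estimate(xs, ys, 4.8, 2.8)
print(round(s_est, 3))  # 2.191
```

The divisor is n − 2 rather than n − 1 because two quantities, a and b, are estimated from the data.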

Page 27: Correlation & Regression

Examples
• 10-12
  • A researcher collects the following data (page 569) and determines that there is a significant relationship between age of a copy machine and its monthly maintenance cost. The regression line is [equation missing from transcript]. Find the standard error of the estimate

Page 28: Correlation & Regression

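Prediction Interval
• Prediction interval
  • Similar to a confidence interval where the standard error of the estimate is used to create an interval about a y’ value
  • By selecting an α value, you can achieve a confidence that the interval contains the actual mean of the y values that correspond to the given x value
• Formula for the Prediction Interval about a Value y’

$$y' - t_{\alpha/2}\, s_{est}\sqrt{1 + \frac{1}{n} + \frac{n(x - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} < y < y' + t_{\alpha/2}\, s_{est}\sqrt{1 + \frac{1}{n} + \frac{n(x - \bar{x})^2}{n\sum x^2 - (\sum x)^2}}$$

• With d.f. = n – 2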
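A hedged sketch of the interval calculation, assuming the common textbook form y’ ± t(α/2) · s_est · sqrt(1 + 1/n + n(x − x̄)² / (nΣx² − (Σx)²)) with n − 2 degrees of freedom. It uses the hypothetical data from the Regression Model slide (Page 22), not the copy machine data. The critical value 3.182 is the standard two-tailed t-table entry for α = 0.05 and 3 degrees of freedom. The function name and the choice of x = 4 are illustrative.

```python
import math

def prediction_interval(xs, ys, x_new, t_crit):
    """Return (low, high) bounds of the prediction interval at x_new.

    t_crit is the two-tailed critical t value for d.f. = n - 2,
    looked up by the caller in a t table.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2

    # Least-squares regression line y' = a + b*x.
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    y_pred = a + b * x_new

    # Standard error of the estimate.
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_est = math.sqrt(unexplained / (n - 2))

    x_bar = sum_x / n
    margin = t_crit * s_est * math.sqrt(
        1 + 1 / n + n * (x_new - x_bar) ** 2 / denominator
    )
    return y_pred - margin, y_pred + margin

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
t_crit = 3.182  # t table: two-tailed, alpha = 0.05, d.f. = 5 - 2 = 3

low, high = prediction_interval(xs, ys, x_new=4, t_crit=t_crit)
print(round(low, 2), round(high, 2))
```

The interval is centered on y’ and widens as x moves away from the mean of the x values, because of the (x − x̄)² term.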

Page 29: Correlation & Regression

Example 10-14
• For the data in Example 10-12, find the 95% prediction interval for the monthly maintenance cost of a machine that is 3 years old