Correlation and Simple Linear Regression


Upload: gillian-henry

Post on 03-Jan-2016


DESCRIPTION

Correlation and Simple Linear Regression. Basics. Correlation: the linear association between two variables. Strength of relationship is based on how tightly points in an X,Y scatterplot cluster about a straight line. Ranges from -1 to 1; unitless. Observations should be quantitative. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Correlation and Simple Linear Regression
Page 2: Correlation and Simple Linear Regression

Correlation
• The linear association between two variables
• Strength of relationship is based on how tightly points in an X,Y scatterplot cluster about a straight line
• Ranges from -1 to 1; unitless
• Observations should be quantitative (no categorical variables, even if recoded)
• Evaluate a visual scatterplot
• Requires independent samples
• Correlation does not imply causality
• Do not assume infinite ranges of linearity

Ho: there is no linear relationship between the 2 variables

Ha: there is a linear relationship between the 2 variables

Page 3: Correlation and Simple Linear Regression

Simple Linear Regression
• Examines the relationship between one predictor variable (independent) and a single quantitative response variable (dependent)
• Produces a regression equation used for prediction
• Assumes normality, equal variances, and independence
• Based on the Least Squares Principle
• Do not extrapolate
• Analyze residuals

Ho: there is no slope, no linear relationship between the 2 variables

Ha: there is a slope, a linear relationship between the 2 variables

Page 4: Correlation and Simple Linear Regression

Positive correlation: Indicates that the values on the two variables being analyzed move in the same direction. That is, as scores on one variable go up, scores on the other variable go up as well (on average) & vice versa

Negative correlation: Indicates that the values on the two variables being analyzed move in opposite directions. That is, as scores on one variable go up, scores on the other variable go down, and vice-versa (on average)

Page 5: Correlation and Simple Linear Regression

Correlation coefficients range in strength from -1.00 to +1.00

The closer the correlation coefficient is to either -1.00 or + 1.00, the stronger the relationship is between the two variables

Perfect positive correlation of +1.00 reveals that for every member of the sample or population, a higher score on one variable is related to higher score on the other variable

Perfect negative correlation of –1.00 indicates that for every member of the sample or population, a higher score on one variable is related to a lower score on the other variable

Perfect correlations are never found in actual social science research

Page 6: Correlation and Simple Linear Regression

Positive and negative correlations are represented by scattergrams.

Scattergrams: graphs that indicate the scores of each case in a sample simultaneously on two variables.

r: the symbol for the sample Pearson correlation coefficient.

[Scattergram: Positive Correlation. Hours Spent Studying (x-axis, 1 to 11) vs. Score on Exam (y-axis, 0 to 100)]

[Scattergram: Negative Correlation. Hours Spent Studying (x-axis, 1 to 11) vs. Score on Exam (y-axis, 0 to 100)]

The scattergrams presented here represent very strong positive and negative correlations (r = 0.97 and r = -0.97 for the positive and negative correlations, respectively).

Page 7: Correlation and Simple Linear Regression

No discernable pattern between the scores on the two variables

We learn it is virtually impossible to predict an individual’s test score simply by knowing how many hours the person studied for the exam

A scattergram representing virtually no correlation between the number of hours spent studying and the scores on the exam is presented.

[Scattergram: No Correlation Between Hours Spent Studying and Exam Scores. Hours Spent Studying (x-axis, 0 to 12) vs. Scores on Exam (y-axis, 0 to 100)]

Page 8: Correlation and Simple Linear Regression

The first step in understanding how Pearson correlation coefficients are calculated is to notice that we are concerned with a sample’s scores on two variables at the same time

The data shown are scores on two variables: hours spent studying and exam score. These data are for a randomly selected sample of five students.

To be used in a correlation analysis, it is critical that the scores on the two variables are paired.

Data for Correlation Coefficient

            Hours Spent Studying (X variable)   Exam Score (Y variable)
Student 1                   5                            80
Student 2                   6                            85
Student 3                   7                            70
Student 4                   8                            90
Student 5                   9                            85

Each student's score on the X variable must be matched with his or her own score on the Y variable.

Once this is done, a person can determine whether, on average, hours spent studying is related to exam scores.

Page 9: Correlation and Simple Linear Regression

Finding the Pearson correlation coefficient is simple when following these steps:

1. Find the z scores on each of the two variables being examined for each case in the sample

2. Multiply each individual's z score on one variable with that individual's z score on the second variable (i.e., find a cross-product)

3. Sum those cross-products across all of the individuals in the sample

4. Divide by N

Definitional Formula for Pearson Correlation

r = Σ(zx zy) / N

where:
r = the Pearson product-moment correlation coefficient
zx = a z score for variable X
zy = the paired z score for variable Y
N = the number of pairs of X and Y scores

You then have an average standardized cross-product. If we had not standardized these scores, we would have produced a covariance.
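The four steps and the definitional formula can be sketched in Python. This is a minimal illustration using made-up data for five students; note that dividing by N in the formula implies population (divide-by-N) standard deviations when computing the z scores:

```python
import math

# Paired scores for a hypothetical sample of five students
hours = [5, 6, 7, 8, 9]        # X variable
scores = [80, 85, 70, 90, 85]  # Y variable
n = len(hours)

def z_scores(values):
    """Step 1: convert raw scores to z scores (population SD, since r divides by N)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(hours), z_scores(scores)

# Steps 2-4: form cross-products, sum them, divide by N
r = sum(x * y for x, y in zip(zx, zy)) / n
print(round(r, 4))  # 0.3128 for this made-up data: a weak-to-moderate positive correlation
```

Skipping the standardization step and averaging the raw cross-products of deviation scores would instead produce the covariance, as the slide notes.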

Page 10: Correlation and Simple Linear Regression

This formula requires that you standardize your variables.
• Note: When you standardize a variable, you are simply subtracting the mean from each score in your sample and dividing by the standard deviation. What this does is provide a z score for each case in the sample. Members of the sample with scores below the mean will have negative z scores, whereas those members of the sample with scores above the mean will have positive z scores.

Page 11: Correlation and Simple Linear Regression

Correlation coefficients such as the Pearson are very powerful statistics. They allow us to determine whether, on average, the values on one variable are associated with the values on a second variable

People often confuse the concepts of correlation and causation:
• Correlation (co-relation) simply means that variation in the scores on one variable corresponds with variation in the scores on a second variable
• Causation means that variation in the scores on one variable causes or creates variation in the scores on a second variable. Correlation does not equal causation.

Page 12: Correlation and Simple Linear Regression

Simple Pearson correlations are designed to examine linear relations among variables. In other words, they describe average straight-line relations among variables.

Not all relations between variables are linear

As previously mentioned, people often confuse the concepts of correlation and causation

• Example: There is a curvilinear relationship between anxiety and performance on a number of academic and non-academic behaviors, as shown in the figure below.

We call this a curvilinear relationship because what began as a positive relationship (between performance and anxiety) at lower levels of anxiety becomes a negative relationship at higher levels of anxiety.

[Figure: Curvilinear relationship between Anxiety (x-axis, 1 to 5) and Performance (y-axis, 0 to 70)]

Page 13: Correlation and Simple Linear Regression

The problem of truncated range is another common problem that arises when examining correlation coefficients. This problem is encountered when the scores on one or both of the variables in the analysis do not have much variance in the distribution of scores, possibly due to a ceiling or floor effect

The data in the table at right show that all of the students did well on the test, whether they spent many hours studying for it or not.

Data for Studying-Exam Score Correlation

            Hours Spent Studying (X variable)   Exam Score (Y variable)
Student 1                   0                            95
Student 2                   2                            95
Student 3                   4                           100
Student 4                   7                            95
Student 5                  10                           100

• The weak correlation that will be produced by the data in the table may not reflect the true relationship between how much students study and how much they learn, because the test was too easy. A ceiling effect may have occurred, thereby truncating the range of scores on the exam.

Page 14: Correlation and Simple Linear Regression

Researchers test whether the correlation coefficient is statistically significant

To test whether a correlation coefficient is statistically significant, the researcher begins with the null hypothesis that there is absolutely no relationship between the two variables in the population, or that the correlation coefficient in the population equals zero

The alternative hypothesis is that there is, in fact, a statistical relationship between the two variables in the population, and that the population correlation coefficient is not equal to zero. So what we are testing here is whether our correlation coefficient is statistically significantly different from 0.

Page 15: Correlation and Simple Linear Regression

What we want to be able to do with a measure of association, like a correlation coefficient, is be able to explain some of the variance in the scores on one variable with the scores on a second variable. The coefficient of determination tells us how much of the variance in the scores of one variable can be understood, or explained, by the scores on a second variable

One way to conceptualize explained variance is to understand that when two variables are correlated with each other, they share a certain percentage of their variance.

See next slide for a visual.

Page 16: Correlation and Simple Linear Regression

r = 0.00, r² = 0.00

In this picture, the two squares are not touching each other, suggesting that all of the variance in each variable is independent of the other variable. There is no overlap.

The precise percentage of shared, or explained, variance can be determined by squaring the correlation coefficient. This squared correlation coefficient is known as the coefficient of determination.

r = 0.30, r² = 0.09
r = 0.55, r² = 0.30
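The squaring step is easy to verify; this short sketch reproduces the three r values shown on the slide (0.55² is 0.3025 before rounding):

```python
# Coefficient of determination: square r to get the proportion of shared variance
for r in (0.00, 0.30, 0.55):
    print(f"r = {r:.2f}  ->  r^2 = {r * r:.2f}  ({r * r:.0%} shared variance)")
```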

Page 17: Correlation and Simple Linear Regression

All of these statistics are very similar to the Pearson correlation, and each produces a correlation coefficient that is similar to the Pearson r.

Phi:

Sometimes researchers want to know whether two dichotomous variables are correlated. In this case, we would calculate a phi coefficient (φ), which is a specialized version of the Pearson r.

For example, suppose you wanted to know whether gender (male, female) was associated with whether one smokes cigarettes or not (smoker, non-smoker).

• In this case, with two dichotomous variables, you would calculate a phi coefficient.

• Note: Readers familiar with chi-square analysis will notice that two dichotomous variables can also be analyzed using a chi-square test (see Chapter 14).
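Because phi is just the Pearson r applied to 0/1-coded variables, it can be computed with the same definitional machinery. The gender/smoking codes below are invented purely for illustration:

```python
import math

# Hypothetical 0/1 codings: gender (0 = male, 1 = female), smoker (0 = no, 1 = yes)
gender = [0, 0, 0, 0, 1, 1, 1, 1]
smoker = [0, 0, 1, 1, 0, 1, 1, 1]

def pearson(x, y):
    """Pearson r; applied to two dichotomous variables, this is the phi coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

phi = pearson(gender, smoker)
print(round(phi, 4))  # 0.2582: a small positive association in this toy table
```

The same value falls out of the 2x2 contingency-table formula for phi, which is one way to see the connection to chi-square analysis that the slide mentions.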

Page 18: Correlation and Simple Linear Regression

Point Biserial:

When one of our variables is a continuous variable (i.e., measured on an interval or ratio scale) and the other is a dichotomous variable, we need to calculate a point-biserial correlation coefficient.

This coefficient is a specialized version of the Pearson correlation coefficient.

For example, suppose you wanted to know whether there is a relationship between whether a person owns a car (yes or no) and their score on a written test of traffic rule knowledge, such as the tests one must pass to get a driver's license.

• In this example, we are examining the relation between one categorical variable with two categories (whether one owns a car) and one continuous variable (one's score on the driver's test).

• Therefore, the point-biserial correlation is the appropriate statistic in this instance.

Page 19: Correlation and Simple Linear Regression

Spearman Rho:

Sometimes data are recorded as ranks. Because ranks are a form of ordinal data, and the other correlation coefficients discussed so far involve either continuous (interval, ratio) or dichotomous variables, we need a different type of statistic to calculate the correlation between two variables that use ranked data.

The Spearman rho is a specialized form of the Pearson r that is appropriate for such data.

For example, many schools use students' grade point averages (a continuous scale) to rank students (an ordinal scale).

• In addition, students' scores on standardized achievement tests can be ranked.

• To see whether a student's rank in their school is related to their rank on the standardized test, a Spearman rho coefficient can be calculated.
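Spearman's rho can be sketched as "rank the data, then compute Pearson's r on the ranks." The GPA and test-score figures below are invented, and this simple version assumes no tied values (ties require averaged ranks, which library routines such as scipy.stats.spearmanr handle):

```python
import math

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties, for simplicity."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(x, y):
    """Plain Pearson r on two paired lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical GPAs and standardized-test scores for five students
gpa = [3.9, 3.1, 3.5, 2.8, 3.7]
test = [1450, 1200, 1180, 1100, 1400]

rho = pearson(ranks(gpa), ranks(test))
print(round(rho, 2))  # 0.90: school rank and test rank agree closely in this toy data
```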

Page 20: Correlation and Simple Linear Regression

The correlations on the diagonal show the correlation between a single variable and itself. Because we always get a correlation of 1.00 when we correlate a variable with itself, these correlations presented on the diagonal are meaningless. That is why there is not a p value reported for them

SPSS Printout of Correlation Analysis

                 Grade        Test Score
Grade            1.0000
                 (  314)
                 P = .

Test Score       0.4291       1.0000
                 (  314)      (  314)
                 P = 0.000    P = .

The numbers in the parentheses, just below the correlation coefficients, report the sample size. There were 314 eleventh-grade students in this sample.

From the correlation coefficient that is off the diagonal, we can see that students' grade point average (Grade) was moderately correlated with their scores on the test (r = 0.4291). This correlation is statistically significant, with a p value of less than 0.0001 (p < 0.0001).
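The printout's P value comes from a significance test. Although the slide does not show the formula, the standard test for a correlation coefficient uses t = r * sqrt(N - 2) / sqrt(1 - r^2) with N - 2 degrees of freedom; a sketch with the slide's values:

```python
import math

r, n = 0.4291, 314  # correlation and sample size from the SPSS printout

# t statistic for testing H0: population correlation = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))  # 8.39: far beyond the two-tailed 0.001 critical value (about 3.29) at df = 312
```

A t this large corresponds to a p value well below 0.0001, which is why SPSS prints P = 0.000.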

Page 21: Correlation and Simple Linear Regression

To gain a clearer understanding of the relationship between grades and test scores, we can calculate a coefficient of determination. We do this by squaring the correlation coefficient. When we square this correlation coefficient (0.4291 * 0.4291 = 0.1841), we see that grades explain a little bit more than 18% of the variance in the test scores.


Because roughly 82% of the variance is unexplained, we must conclude that teacher-assigned grades reflect something substantially different from, and more than, just scores on tests.

Same table as in previous slide

Page 22: Correlation and Simple Linear Regression

Allows researchers to examine:
• How variables are related to each other
• The strength of the relations
• The relative predictive power of several independent variables on a dependent variable
• The unique contribution of one or more independent variables when controlling for one or more covariates

Page 23: Correlation and Simple Linear Regression

Simple Regression
• Simple regression analysis involves a single independent, or predictor, variable and a single dependent, or outcome, variable

Multiple Regression
• Multiple regression involves models that have two or more predictor variables and a single dependent variable

Page 24: Correlation and Simple Linear Regression

The dependent and independent variables need to be measured on an interval or ratio scale.

Dichotomous (i.e., categorical variables with two categories) predictor variables can also be used.
• There is a special form of regression analysis, logit regression, that allows us to examine dichotomous dependent variables.

Page 25: Correlation and Simple Linear Regression

Regression analysis yields more information. The regression equation allows us to think about the relation between the two variables of interest in a more intuitive way, using the original scales of measurement rather than converting to standardized scores.

Regression analysis yields a formula for calculating the predicted value of one variable when we know the actual value of the second variable.

Page 26: Correlation and Simple Linear Regression

Assumes the two variables are linearly related.
• In other words, if the two variables are actually related to each other, we assume that every time there is an increase of a given size in value on the X variable (called the predictor or independent variable), there is a corresponding increase (if there is a positive correlation) or decrease (if there is a negative correlation) of a specific size in the Y variable (called the dependent, or outcome, or criterion variable).

Page 27: Correlation and Simple Linear Regression

Ŷ = bX + a

where:
Ŷ = the predicted value of the Y variable
b = the unstandardized regression coefficient, or the slope
a = the intercept (i.e., the point where the regression line intercepts the Y axis; this is also the predicted value of Y when X is zero)

Page 28: Correlation and Simple Linear Regression

Is there a relationship between the amount of education people have and their monthly income?

                          Education Level (X)   Monthly Income (Y)
                          in years              in thousands
Case 1                          6                    $1
Case 2                          8                    $1.5
Case 3                         11                    $1
Case 4                         12                    $2
Case 5                         12                    $4
Case 6                         13                    $2.5
Case 7                         14                    $5
Case 8                         16                    $6
Case 9                         16                    $10
Case 10                        21                    $8
Mean                           12.9                  $4.1
Standard Deviation              4.25                 $3.12
Correlation Coefficient              0.83

Page 29: Correlation and Simple Linear Regression

Scatterplot for education and income:

With the data provided in the table, we can calculate a regression. The regression equation allows us to do two things:

1) find predicted values for the Y variable for any given value of the X variable
2) produce the regression line

The regression line is the basis for linear regression and can help us understand how regression works.

[Scatterplot: Education in years (x-axis, 0 to 22) vs. Income in thousands (y-axis, -4 to 12)]

Page 30: Correlation and Simple Linear Regression

OLS is the most commonly used regression formula

It is based on an idea that we have seen before: the sum of squares

To do OLS: find the line of least squares (i.e., the straight line that produces the smallest sum of squared deviations from the line)

Sum of Squares: Σ (observed value - predicted value)²
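The least-squares idea can be demonstrated directly: compute the sum of squared deviations for the OLS line, then check that nudging the slope in either direction (keeping the best intercept for that slope) only makes the sum worse. This sketch uses the education/income data from the earlier table; computed exactly, b is about 0.61 and a about -3.71 (the slide's -3.77 comes from rounding the slope to 0.61 before computing the intercept):

```python
# Education/income data from the earlier table (income in thousands)
x = [6, 8, 11, 12, 12, 13, 14, 16, 16, 21]
y = [1, 1.5, 1, 2, 4, 2.5, 5, 6, 10, 8]

def sum_sq(b, a):
    """Sum of Squares: sum of (observed value - predicted value)^2."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

# OLS estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
print(round(b, 2), round(a, 2))  # 0.61 -3.71

best = sum_sq(b, a)
for db in (-0.1, 0.1):
    # any other slope (with its own best intercept) gives a larger sum of squares
    assert sum_sq(b + db, a - db * mx) > best
```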

Page 31: Correlation and Simple Linear Regression

b = r * (sy / sx)

where:
b = the regression coefficient
r = the correlation between the X and Y variables
sy = the standard deviation of the Y variable
sx = the standard deviation of the X variable

Page 32: Correlation and Simple Linear Regression

a = Ȳ - bX̄

where:
Ȳ = the average value of Y
X̄ = the average value of X
b = the regression coefficient
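Plugging the summary statistics from the education/income table into these two formulas reproduces the slide's regression equation. Note that the slide rounds the slope to 0.61 before computing the intercept, which is how it arrives at -3.77:

```python
# Summary statistics from the education/income table
r, s_y, s_x = 0.83, 3.12, 4.25   # correlation and sample standard deviations
mean_y, mean_x = 4.1, 12.9       # means of Y (income) and X (education)

b = r * (s_y / s_x)              # b = r * (sy / sx)
b = round(b, 2)                  # the slide rounds the slope to 0.61 first
a = mean_y - b * mean_x          # a = Y-bar - b * X-bar

print(b, round(a, 2))  # 0.61 -3.77
```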

Page 33: Correlation and Simple Linear Regression

The regression equation does not calculate the actual value of Y. It can only make predictions about the value of Y, so error (e) is bound to occur.
• Error is the difference between the actual, or observed, value of Y and the predicted value of Y

To calculate error, use one of two equations:

e = Y - Ŷ    OR    e = Y - (a + bX)

where:
Y = the actual, or observed, value of Y
Ŷ = the predicted value of Y

Page 34: Correlation and Simple Linear Regression

For the predicted value of Y:    Ŷ = bX + a

For the actual/observed value of Y, which takes error (e) into account:    Y = bX + a + e

Page 35: Correlation and Simple Linear Regression

Example: Is there a relationship between the amount of education people have and their monthly income?

Ŷ = -3.77 + 0.61X

For every unit of increase in X, there is a corresponding predicted increase of 0.61 units in Ŷ.

OR: For every additional year of education, we would predict an increase of 0.61 ($1,000), or $610, in monthly income.

Page 36: Correlation and Simple Linear Regression

Example: What would we predict the monthly income to be for a person with 9 years of formal education?

Ŷ = -3.77 + 0.61(9)
Ŷ = -3.77 + 5.49
Ŷ = 1.72

So we would predict that a person with 9 years of education would make $1,720 per month, plus or minus our error in prediction (e).
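The prediction can be checked with a one-line helper, using the slope and intercept from the regression equation Ŷ = -3.77 + 0.61X (note 0.61 x 9 = 5.49, so the prediction at 9 years is 1.72 thousand):

```python
b, a = 0.61, -3.77  # slope and intercept from the education/income regression

def predict_income(years_of_education):
    """Predicted monthly income (in thousands) for a given number of years of education."""
    return a + b * years_of_education

print(round(predict_income(9), 2))   # 1.72 -> about $1,720 per month
print(round(predict_income(25), 2))  # 11.48 -> the second point used to draw the line
```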

Page 37: Correlation and Simple Linear Regression

Drawing the Regression Line

To do this we need to calculate two points:

Ŷ = -3.77 + 0.61(9)          Ŷ = -3.77 + 0.61(25)
Ŷ = -3.77 + 5.49             Ŷ = -3.77 + 15.25
Ŷ = 1.72                     Ŷ = 11.48

[Scatterplot: Education in years (x-axis, 0 to 22) vs. Income in thousands (y-axis, -4 to 12), with the regression line drawn through the two calculated points]

Page 38: Correlation and Simple Linear Regression

The regression line does not always accurately predict the actual Y values. In some cases there is a little error, and in other cases there is a larger error.
• Residuals = errors in prediction

In some cases, our predicted value is greater than our observed value.
• Overpredicted = observed values of Y at given values of X that are below the predicted values of Y. Produces negative residuals.

Sometimes our predicted value is less than our observed value.
• Underpredicted = observed values of Y at given values of X that are above the predicted values of Y. Produces positive residuals.
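Residuals for the education/income data can be listed directly; positive residuals mark underpredicted cases and negative residuals mark overpredicted cases (a sketch using the fitted equation from the earlier slides):

```python
# Education/income data and the fitted equation Y-hat = -3.77 + 0.61X
x = [6, 8, 11, 12, 12, 13, 14, 16, 16, 21]
y = [1, 1.5, 1, 2, 4, 2.5, 5, 6, 10, 8]
b, a = 0.61, -3.77

for xi, yi in zip(x, y):
    y_hat = a + b * xi
    e = yi - y_hat  # residual: e = Y - Y-hat
    label = "underpredicted" if e > 0 else "overpredicted"
    print(f"X={xi:>2}  Y={yi:>4}  predicted={y_hat:5.2f}  residual={e:+5.2f}  ({label})")
```

For instance, Case 1 (X = 6, Y = 1) has a predicted value of -0.11 and a positive residual of 1.11 (underpredicted), while Case 3 (X = 11, Y = 1) has a predicted value of 2.94 and a negative residual of -1.94 (overpredicted).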