Statistics for the Social Sciences (Psychology 340, Fall 2013): Correlation and Regression


Page 1: Statistics for the Social Sciences Psychology 340 Fall 2013 Correlation and Regression

Statistics for the Social Sciences

Psychology 340, Fall 2013

Correlation and Regression

Page 2:

Homework #12 due 11/19: Chapter 16: 1, 2, 7, 8, 10, 22 (use SPSS), 24

(HW #13 – the last homework – is due on 11/21)

Page 3:

Last Time:

•We reviewed Pearson’s r, and how to calculate it from raw scores and Z scores, and with SPSS

•We learned about Spearman’s Rho

•We learned how to get a scatter plot using SPSS

•We learned about bivariate regression

•One x (predictor, independent) variable is used to predict one y (outcome, dependent) variable

•We reviewed the formula for a line (y = mx + b)

•We applied the formula for a line to regression (Y = a + bX + error, or Y = β0 + β1X + error)

Page 4:

This time:

• Clarification and review of some regression concepts

• Multiple regression

• Regression in SPSS

• Scatterplots and simple linear (bivariate) regression in Excel

Page 5:

Regression is all about prediction

• If you want to predict (make an educated guess about) an individual person’s score on a variable (Y), what’s the best estimate, based on the information available to you?

• Suppose you only know the mean of the variable (MY) and nothing else. Then the best estimate of any individual’s score is the mean (more scores are close to the mean than far from it).

• Suppose you also know that the variable is correlated with another variable (X), and you know a person’s score on X, but not on Y.

• Regression involves using the person’s score on X, combined with what you know about the relationship between X & Y to get a much more accurate prediction of the person’s score on Y than you can get just using the mean of Y (MY).

• If X and Y are uncorrelated, then the best predictor of any individual score on Y is the mean of Y (Y = MY + Error). The regression line is flat (has no slope). Note that in this scenario, the average error is based on deviations of Y values from the mean of Y. (It’s the standard deviation)
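The flat-line case can be checked numerically. A minimal sketch (ours, using the five Y scores that appear later in these slides): when every prediction is the mean, the errors are just deviations from M_Y, and their "average" size is the standard deviation.

```python
# Predicting everyone at the mean of Y: errors are deviations from M_Y.
ys = [6, 2, 6, 4, 2]
my = sum(ys) / len(ys)                      # M_Y = 4.0
errors = [y - my for y in ys]               # deviations from the mean
ss_total = sum(e ** 2 for e in errors)      # 16.0 = SS_Y
sd = (ss_total / len(ys)) ** 0.5            # "average error" = standard deviation
print(my, ss_total, round(sd, 2))
```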

Page 6:

Ŷ vs. Y vs. µY

Ŷ refers to the predicted value of Y.

µY refers to the expected (mean) value of Y. In the context of regression, it refers to the expected value of Y for a given value of X.

Y refers to an actual (observed) value of Y.

If you know an individual’s score for Y and X, and you know the regression equation, you can calculate a residual score for that individual.

By looking at the residuals for a group of individuals, you can determine how good a “fit” the regression model is (i.e., how much error there is)

Page 7:

Regression

• The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points)

[Scatterplot: data points with a candidate best-fitting line; X and Y axes run from 1 to 6]

• Rather than compare the errors from different lines and picking the best, we will directly compute the equation for the best fitting line

Page 8:

Regression

• The linear model

Y = intercept + slope (X) + error

Betas (β) are sometimes called parameters. They come in two types:

• standardized

• unstandardized

Now let’s go through an example computing these things

Page 9:

From when we computed Pearson’s r:

 X   Y    X-MX   Y-MY   (X-MX)(Y-MY)   (X-MX)²   (Y-MY)²
 6   6     2.4    2.0        4.8         5.76      4.0
 1   2    -2.6   -2.0        5.2         6.76      4.0
 5   6     1.4    2.0        2.8         1.96      4.0
 3   4    -0.6    0.0        0.0         0.36      0.0
 3   2    -0.6   -2.0        1.2         0.36      4.0

mean 3.6 4.0   sums: 0.0, 0.0,  SP = 14.0,  SSX = 15.2,  SSY = 16.0
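These quantities can be recomputed in a short Python sketch (our own illustration, not part of the course materials; the data are the five (X, Y) pairs from the table):

```python
# Recompute SP, SSX, SSY, and Pearson's r from the slide's data.
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

mx = sum(xs) / len(xs)   # mean of X = 3.6
my = sum(ys) / len(ys)   # mean of Y = 4.0

sp   = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of cross-products
ss_x = sum((x - mx) ** 2 for x in xs)                    # sum of squares for X
ss_y = sum((y - my) ** 2 for y in ys)                    # sum of squares for Y

r = sp / (ss_x * ss_y) ** 0.5   # Pearson's r = SP / sqrt(SSX * SSY)
print(sp, ss_x, ss_y, round(r, 3))
```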

Page 10:

Computing the regression line (with raw scores)

X: 6, 1, 5, 3, 3   Y: 6, 2, 6, 4, 2   (mean 3.6, 4.0; SP = 14.0, SSX = 15.2, SSY = 16.0)

slope: b = SP / SSX = 14.0 / 15.2 ≈ 0.92
intercept: a = MY - b(MX) = 4.0 - (0.92)(3.6) = 0.688

Regression equation: Ŷ = 0.688 + 0.92X
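The slope and intercept can be computed the same way in Python (a minimal sketch of the raw-score method; variable names are ours):

```python
# Least-squares slope and intercept from the slide's data.
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

mx = sum(xs) / len(xs)   # 3.6
my = sum(ys) / len(ys)   # 4.0
sp   = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # SP = 14.0
ss_x = sum((x - mx) ** 2 for x in xs)                    # SSX = 15.2

b = sp / ss_x       # slope ≈ 0.921 (the slides round to 0.92)
a = my - b * mx     # intercept ≈ 0.684 (the slides get 0.688 by rounding b first)
print(round(b, 2), round(a, 3))
```

Note the small difference in the intercept: the slides round b to 0.92 before computing a, which gives 0.688 instead of the unrounded 0.684.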

Page 11:

Computing the regression line (with raw scores)

X: 6, 1, 5, 3, 3   Y: 6, 2, 6, 4, 2   (mean 3.6, 4.0)

[Scatterplot: the five data points plotted with X and Y axes from 1 to 6]

Page 12:

Computing the regression line (with raw scores)

[Scatterplot: the five data points with the fitted regression line drawn through them]

The two means will be on the line: the regression line always passes through (MX, MY).

Page 13:

Computing the regression line (with z-scores)

  ZX      ZY
  1.38    1.1
 -1.49   -1.1
  0.80    1.1
 -0.34    0.0
 -0.34   -1.1

mean 0.0 0.0

With z-scores the intercept is 0 and the slope is r, so ẐY = r·ZX (here r ≈ .90).
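The z-score version can be verified numerically (our own sketch; like the slides, it uses the population standard deviation, i.e. dividing by N):

```python
# With standardized scores, the best-fitting slope equals Pearson's r.
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5   # population SD of X
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5   # population SD of Y

zx = [(x - mx) / sx for x in xs]   # ≈ 1.38, -1.49, 0.80, -0.34, -0.34
zy = [(y - my) / sy for y in ys]   # ≈ 1.1, -1.1, 1.1, 0.0, -1.1

# Least-squares slope through the origin for the z-scores:
beta = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)
print(round(beta, 3))   # this is Pearson's r
```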

Page 14:

Regression

• Error
– Actual score minus the predicted score

• Measures of error
– r² (r-squared)
– Proportionate reduction in error

• Note: Total squared error when predicting from the mean = SSTotal = SSY
– Squared error using the prediction model = sum of the squared residuals = SSresidual = SSerror

Page 15:

Computing error around the line

• Compute the difference between the predicted values and the observed values (“residuals”)
• Square the differences
• Add up the squared differences
• Sum of the squared residuals = SSresidual = SSerror

[Scatterplot: data points, regression line, and vertical lines showing the residuals]

Page 16:

Computing error around the line

X: 6, 1, 5, 3, 3   Y: 6, 2, 6, 4, 2   (mean 3.6, 4.0)

Predicted values of Y (points on the line)

• Sum of the squared residuals = SSresidual = SSerror

Page 17:

Computing error around the line

X: 6, 1, 5, 3, 3   Y: 6, 2, 6, 4, 2   (mean 3.6, 4.0)

Predicted values of Y (points on the line): 6.2 = (0.92)(6) + 0.688

• Sum of the squared residuals = SSresidual = SSerror

Page 18:

Computing error around the line

Predicted values of Y (points on the line):

6.2  = (0.92)(6) + 0.688
1.6  = (0.92)(1) + 0.688
5.3  = (0.92)(5) + 0.688
3.45 = (0.92)(3) + 0.688
3.45 = (0.92)(3) + 0.688

• Sum of the squared residuals = SSresidual = SSerror

Page 19:

Computing error around the line

[Scatterplot: observed (X, Y) points with the predicted values 6.2, 1.6, 5.3, 3.45, 3.45 marked on the regression line]

• Sum of the squared residuals = SSresidual = SSerror

Page 20:

Computing error around the line

Quick check (residual = Y - Ŷ):

 X   Y    Ŷ      residual
 6   6   6.2     6 - 6.2  = -0.20
 1   2   1.6     2 - 1.6  =  0.40
 5   6   5.3     6 - 5.3  =  0.70
 3   4   3.45    4 - 3.45 =  0.55
 3   2   3.45    2 - 3.45 = -1.45

mean 3.6 4.0; the residuals sum to 0.00

• Sum of the squared residuals = SSresidual = SSerror

Page 21:

Computing error around the line

 X   Y    Ŷ      residual   residual²
 6   6   6.2     -0.20        0.04
 1   2   1.6      0.40        0.16
 5   6   5.3      0.70        0.49
 3   4   3.45     0.55        0.30
 3   2   3.45    -1.45        2.10

• Sum of the squared residuals = SSresidual = SSerror = 3.09

Page 22:

Computing error around the line

 X   Y    Ŷ      residual   residual²   (Y-MY)²
 6   6   6.2     -0.20        0.04        4.0
 1   2   1.6      0.40        0.16        4.0
 5   6   5.3      0.70        0.49        4.0
 3   4   3.45     0.55        0.30        0.0
 3   2   3.45    -1.45        2.10        4.0

SSerror = 3.09   SSY = 16.0

• Sum of the squared residuals = SSresidual = SSerror

Page 23:

Computing error around the line

• Proportionate reduction in error = (SSY - SSerror) / SSY = (16.0 - 3.09) / 16.0 ≈ .81
• This also (like r²) represents the proportion of variance in Y accounted for by X
• In fact, in bivariate regression it is mathematically identical to r²
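The whole error computation fits in a few lines of Python (a sketch of our own; because we use the slides’ rounded coefficients 0.92 and 0.688, SSerror comes out near, not exactly at, 3.09):

```python
# Residuals, SSerror, and the proportionate reduction in error.
a, b = 0.688, 0.92           # intercept and slope from the slides (rounded)
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

preds = [a + b * x for x in xs]                 # points on the line
resid = [y - p for y, p in zip(ys, preds)]      # observed minus predicted
ss_error = sum(e ** 2 for e in resid)           # ≈ 3.11 (slides get 3.09 from rounded Ŷ values)

my = sum(ys) / len(ys)
ss_y = sum((y - my) ** 2 for y in ys)           # 16.0 = total squared error from the mean
prop_reduction = (ss_y - ss_error) / ss_y       # ≈ .81, matching r²
print(round(ss_error, 2), round(prop_reduction, 2))
```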

Page 24:

Regression in SPSS

• Running the analysis in SPSS is pretty easy
– Analyze: Regression: Linear
– X or predictor variable(s) go into the ‘Independent variable’ field
– Y or predicted variable goes into the ‘Dependent variable’ field

• You get a lot of output

Page 25:

Regression in SPSS

• The variables in the model

• r

• r2

• Unstandardized coefficients

• Slope (indep var name)• Intercept (constant)

• Standardized coefficients

• We’ll get back to these numbers in a few weeks

Page 26:

In Excel

• With the Data Analysis “ToolPak” you can perform regression analysis
• With the standard software package, you can get the bivariate correlation (which is the same as the standardized regression coefficient), create a scatterplot, and request a trend line, which is a regression line (what is y and what is x in that case?)

Page 27:

Multiple Regression

• Multiple regression prediction models:

Y = a + b1X1 + b2X2 + … + error, where the predicted part is the “fit” and the error is the “residual”

Page 28:

Prediction in Research Articles

• Bivariate prediction models rarely reported

• Multiple regression results commonly reported

Page 29:

Multiple Regression

• Typically researchers are interested in predicting with more than one explanatory variable

• In multiple regression, an additional predictor variable (or set of variables) is used to predict the residuals left over from the first predictor.

Page 30:

Multiple Regression

• Bivariate regression prediction models:

Y = intercept + slope(X) + error

Page 31:

Multiple Regression

• Bivariate regression prediction models:

Y = intercept + slope(X) + error

• Multiple regression prediction models:

Y = a + b1X1 + b2X2 + … + error, where the predicted part is the “fit” and the error is the “residual”

Page 32:

Multiple Regression

• Multiple regression prediction models

[Diagram: the total variability in Y is divided among the First, Second, Third, and Fourth Explanatory Variables, with “whatever variability is left over” as the residual]

Page 33:

Multiple Regression

• Predict test performance based on:
• Study time • Test time • What you eat for breakfast • Hours of sleep

[Diagram: each of the four explanatory variables accounts for part of test performance; whatever variability is left over is the residual]

Page 34:

Multiple Regression

• Predict test performance based on:
• Study time • Test time • What you eat for breakfast • Hours of sleep

• Typically your analysis consists of testing multiple regression models to see which “fits” best (comparing the R²s of the models)

• For example: model 1 versus model 2 versus model 3

Page 35:

Multiple Regression

Model #1: Some covariance between the two variables

Response variable: total variability in test performance
Predictor: total study time, r = .6

R² for model = .36 (64% of variance unexplained)

• If we know the total study time, we can predict 36% of the variance in test performance

Page 36:

Multiple Regression

Model #2: Add test time to the model

Response variable: total variability in test performance
Predictors: total study time, r = .6; test time, r = .1

R² for model = .37 (63% of variance unexplained)

• There is little covariance between test performance and test time
• We can explain only a little more of the variance in test performance

Page 37:

Multiple Regression

Model #3: No covariance between test performance and breakfast food

Response variable: total variability in test performance
Predictors: total study time, r = .6; test time, r = .1; breakfast, r = .0

R² for model = .37 (63% of variance unexplained)

• Breakfast is not related, so we can NOT explain any more of the variance in test performance

Page 38:

Multiple Regression

Model #4: Some covariance between test performance and hours of sleep

Response variable: total variability in test performance
Predictors: total study time, r = .6; test time, r = .1; breakfast, r = .0; hours of sleep, r = .45

R² for model = .45 (55% of variance unexplained)

• We can explain more of the variance
• But notice what happens with the overlap (covariation between explanatory variables): you can’t just add r’s or r²’s

Page 39:

Multiple Regression

The “least squares” regression equation when there are multiple intercorrelated predictor (x) variables is found by calculating “partial regression coefficients” for each x

A partial regression coefficient for x1 shows the relationship between y and x1 while statistically controlling for the other x variables (or holding the other x variables constant)

Page 40:

Multiple Regression

The formula for the partial regression coefficient is:

b1 = [(rY1 - rY2·r12) / (1 - r12²)] · (sY / s1)

where
rY1 = correlation of x1 and y
rY2 = correlation of x2 and y
r12 = correlation of x1 and x2
sY = standard deviation of y
s1 = standard deviation of x1
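The formula can be sketched directly in Python; note that the correlation and standard-deviation values below are made up purely to exercise the formula, not taken from the course data:

```python
# Partial regression coefficient b1 for x1, controlling for x2.
# All numbers here are hypothetical, chosen only for illustration.
r_y1, r_y2, r_12 = 0.6, 0.4, 0.3   # correlations: (x1, y), (x2, y), (x1, x2)
s_y, s_1 = 10.0, 5.0               # standard deviations of y and x1

b1 = ((r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)) * (s_y / s_1)
print(round(b1, 3))
```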

Page 41:

Multiple Regression

• Multiple correlation coefficient (R) is an estimate of the relationship between the dependent variable (y) and the best linear combination of predictor variables (correlation of y and y-pred.)

• R2 tells you the amount of variance in y explained by the particular multiple regression model being tested.
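Outside SPSS, the same model-comparison idea can be sketched with NumPy (our own toy data; the variable names, seed, and simulated effect sizes are all assumptions for illustration):

```python
import numpy as np

# Simulate test scores driven by study time and hours of sleep.
rng = np.random.default_rng(0)
n = 200
study = rng.normal(size=n)          # hypothetical study-time scores
sleep = rng.normal(size=n)          # hypothetical hours-of-sleep scores
score = 2.0 * study + 1.0 * sleep + rng.normal(size=n)

def r_squared(predictors, y):
    """R^2 for an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_one = r_squared([study], score)          # model 1: study time only
r2_two = r_squared([study, sleep], score)   # model 2: add hours of sleep
print(round(r2_one, 2), round(r2_two, 2))
```

Adding a useful predictor raises R²; the question SPSS’s “R squared change” statistics answer is whether that increase is statistically significant.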

Page 42:

Multiple Regression in SPSS

Setup as before: Variables (explanatory and response) are entered into columns

• A couple of different ways to use SPSS to compare different models

Page 43:

Regression in SPSS

• Analyze: Regression: Linear

Page 44:

Multiple Regression in SPSS

• Method 1: enter all the explanatory variables together
– Enter:

• All of the predictor variables into the Independent Variable field

• Predicted (criterion) variable into Dependent Variable field

Page 45:

Multiple Regression in SPSS

• The variables in the model

• R for the entire model

• R² for the entire model

• Unstandardized coefficients

• Coefficient for var1 (var name)

• Coefficient for var2 (var name)

Page 46:

Multiple Regression in SPSS

• The variables in the model

• R for the entire model

• R² for the entire model

• Standardized coefficients

• Coefficient for var1 (var name)

• Coefficient for var2 (var name)

Page 47:

Multiple Regression

– Which coefficient to use, standardized or unstandardized?

– Unstandardized b’s are easier to use if you want to predict a raw score based on raw scores (no z-scores needed).

– Standardized β’s are nice for directly comparing which variable is most “important” in the equation

Page 48:

Multiple Regression in SPSS

• Method 2: enter the first model, then add another variable for the second model, etc.
– Enter:
• Predicted (criterion) variable into the Dependent Variable field
• First predictor variable into the Independent Variable field
• Click the Next button

Page 49:

Multiple Regression in SPSS

• Method 2, continued:
– Enter:
• Second predictor variable into the Independent Variable field
• Click Statistics

Page 50:

Multiple Regression in SPSS

– Click the ‘R squared change’ box

Page 51:

Multiple Regression in SPSS

• Shows the results of two models
• The variables in the first model (math SAT)
• The variables in the second model (math and verbal SAT)

Page 52:

Multiple Regression in SPSS

• The variables in the first model (math SAT)

• r2 for the first model

• Coefficients for var1 (var name)

• Shows the results of two models

• The variables in the second model (math and verbal SAT)

• Model 1

Page 53:

Multiple Regression in SPSS

• The variables in the first model (math SAT)

• Coefficients for var1 (var name)

• Coefficients for var2 (var name)

• Shows the results of two models

• r2 for the second model

• The variables in the second model (math and verbal SAT)

• Model 2

Page 54:

Multiple Regression in SPSS

• Shows the results of two models
• The variables in the first model (math SAT)
• The variables in the second model (math and verbal SAT)

• Change statistics: is the change in R² from Model 1 to Model 2 statistically significant?

Page 55:

Cautions in Multiple Regression

• We can use as many predictors as we wish, but we should be careful not to use more predictors than is warranted.
– Simpler models are more likely to generalize to other samples.
– If you use as many predictors as you have participants in your study, you can predict 100% of the variance. Although this may seem like a good thing, it is unlikely that your results would generalize to any other sample, and thus they are not valid.
– You probably should have at least 10 participants per predictor variable (and probably should aim for about 30).