Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Post on 19-Dec-2015


Page 1: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Statistics for the Social Sciences

Psychology 340

Spring 2005

Prediction

Page 2: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Outline (for week)

• Simple bivariate regression, least-squares fit line
– The general linear model
– Residual plots
– Using SPSS

• Multiple regression
– Comparing models (Δr²)
– Using SPSS

Page 3: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Last time: with correlation, we examined whether variables X & Y are related

• This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

Page 4: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis”

• For regression this is NOT the case

• The variable that you are predicting goes on the Y-axis (criterion variable)

• The variable that you are making the prediction based on goes on the X-axis (predictor variable)

[Figure: scatterplot with the predicted variable (quiz performance) on the Y-axis and the predicting variable (hours of study) on the X-axis; both axes run from 1 to 6]

Page 5: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Last time: “Imagine a line through the points”

• But there are lots of possible lines

• One line is the “best fitting line”

• Today: learn how to compute the equation corresponding to this “best fitting line”

[Figure: scatterplot of quiz performance (Y) against hours of study (X); both axes run from 1 to 6]

Page 6: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

The equation for a line

• A brief review of geometry

Y = (X)(slope) + (intercept)

Y = intercept, when X = 0

[Figure: a line on X and Y axes (0 to 6) that crosses the Y-axis at 2.0, the intercept]

Page 7: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

The equation for a line

• A brief review of geometry

Y = (X)(slope) + (intercept)

slope = (change in Y) / (change in X) = 1 / 2 = 0.5

[Figure: the same line on X and Y axes (0 to 6), with intercept 2.0; Y rises by 1 for every increase of 2 in X]

Page 8: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

The equation for a line

• A brief review of geometry

Y = (X)(slope) + (intercept)

Y = (X)(0.5) + 2.0

[Figure: the line Y = (X)(0.5) + 2.0 on X and Y axes (0 to 6)]

Page 9: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• A brief review of geometry

• Consider a perfect correlation

Y = (X)(0.5) + (2.0)

• Can make specific predictions about Y based on X

X = 5

Y = ?

Y = (5)(0.5) + (2.0)

Y = 2.5 + 2 = 4.5

[Figure: scatterplot (axes 1 to 6) in which every point falls on the line; at X = 5 the line gives Y = 4.5]
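The prediction step on this slide can be sketched in a few lines of Python. The function name `predict` and its default arguments are illustrative choices, not something from the lecture; the slope (0.5) and intercept (2.0) are the values used on the slide.

```python
def predict(x, slope=0.5, intercept=2.0):
    """Return the Y value on the line Y = (X)(slope) + (intercept)."""
    return x * slope + intercept


print(predict(5))  # 4.5, the prediction worked out on the slide
print(predict(0))  # 2.0, the intercept (the value of Y when X = 0)
```

Calling the function with other values of X gives the corresponding point on the same line.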

Page 10: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Consider a less than perfect correlation

• The line still represents the predicted values of Y given X

Y = (X)(0.5) + (2.0)

X = 5

Y = ?

Y = (5)(0.5) + (2.0)

Y = 2.5 + 2 = 4.5

[Figure: scatterplot (axes 1 to 6) with points scattered around the line; the predicted value at X = 5 is still Y = 4.5]

Page 11: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points)

• Rather than comparing the errors from different lines and picking the best, we will directly compute the equation for the best fitting line

[Figure: scatterplot (axes 1 to 6) with a line through the points]
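To make "minimizes the error" concrete, here is a small sketch that compares the total squared error of two candidate lines. It borrows the five-point dataset and the fitted line (slope 0.92, intercept 0.688) worked through in this lecture; the helper name `sum_squared_error` is an assumption made for illustration.

```python
# Five (X, Y) pairs from the lecture's worked example.
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]


def sum_squared_error(slope, intercept):
    """Total squared distance between the actual Y values and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# The geometry-review line versus the fitted line computed in this lecture.
error_review_line = sum_squared_error(0.5, 2.0)
error_fitted_line = sum_squared_error(0.92, 0.688)

print(error_review_line)            # 6.0
print(round(error_fitted_line, 2))  # 3.11
```

The fitted line has the smaller total squared error, which is the sense in which it is the "best fitting" line.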

Page 12: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• The linear model

Y = intercept + slope (X) + error

$Y = \beta_0 + \beta_1 X + \varepsilon$

Betas (β) are sometimes called parameters

Come in two types:

• unstandardized: $Y = \beta_0 + \beta_1 X + \varepsilon$

• standardized: $Z_Y = (\beta)(Z_X) + \varepsilon$

Now let’s go through an example computing these things

Page 13: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Scatterplot

• Using the dataset from our correlation lecture

X    Y
6    6
1    2
5    6
3    4
3    2

[Figure: scatterplot of the five (X, Y) points; both axes run from 1 to 6]

Page 14: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

From the Computing Pearson’s r lecture

X      Y      X − X̄     Y − Ȳ     (X − X̄)(Y − Ȳ)    (X − X̄)²    (Y − Ȳ)²
6      6      2.4       2.0       4.8               5.76        4.0
1      2      -2.6      -2.0      5.2               6.76        4.0
5      6      1.4       2.0       2.8               1.96        4.0
3      4      -0.6      0.0       0.0               0.36        0.0
3      2      -0.6      -2.0      1.2               0.36        4.0

mean   3.6    4.0
sum           0.0       0.0       14.0 (SP)         15.20 (SSX) 16.0 (SSY)
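The table's three totals can be checked with a short Python sketch. The variable names (`sp`, `ss_x`, `ss_y`) are chosen here for illustration; the data are the five pairs from the slide.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]

mean_x = sum(xs) / len(xs)  # 3.6
mean_y = sum(ys) / len(ys)  # 4.0

# SP: sum of the products of the paired deviations from the means.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
# SSX and SSY: sums of the squared deviations from each mean.
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)

print(round(sp, 2), round(ss_x, 2), round(ss_y, 2))  # 14.0 15.2 16.0
```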

Page 15: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing regression line (with raw scores)

Mean X = 3.6, mean Y = 4.0; SP = 14.0, SSX = 15.20, SSY = 16.0

$\text{slope} = b = \frac{SP}{SS_X} = \frac{14}{15.2} = 0.92$

$\text{intercept} = a = \bar{Y} - b\bar{X} = 4.0 - (0.92)(3.6) = 0.688$
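A minimal sketch of the same two formulas in Python, starting from the totals on the slide. The variable names are illustrative. Note that the slide rounds the slope to 0.92 before computing the intercept; carrying the unrounded slope through gives a slightly different intercept.

```python
sp, ss_x = 14.0, 15.2      # SP and SSX from the slide
mean_x, mean_y = 3.6, 4.0  # means of X and Y

b = sp / ss_x              # slope = SP / SSX
a = mean_y - b * mean_x    # intercept = mean of Y minus slope times mean of X

# Intercept as computed on the slide, using the slope rounded to 0.92.
a_slide = mean_y - round(b, 2) * mean_x

print(round(b, 2))        # 0.92
print(round(a, 3))        # 0.684
print(round(a_slide, 3))  # 0.688
```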

Page 16: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing regression line (with raw scores)

slope = b = 0.92

intercept = 0.688

Y = 0.92X + 0.688

[Figure: scatterplot of the five data points (axes 1 to 6) with the regression line Y = 0.92X + 0.688]

Page 17: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

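Computing regression line (with raw scores)

slope = b = 0.92

intercept = 0.688

Y = 0.92X + 0.688

The two means will be on the line: at X = 3.6 (the mean of X), the line gives Y = (0.92)(3.6) + 0.688 = 4.0 (the mean of Y)

[Figure: scatterplot of the five data points (axes 1 to 6) with the regression line passing through the point (3.6, 4.0)]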

Page 18: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized.
– Computed based on z-scores rather than with raw scores

X      Y      X − X̄     (X − X̄)²    Y − Ȳ     (Y − Ȳ)²    ZX       ZY
6      6      2.4       5.76        2.0       4.0         1.38     1.1
1      2      -2.6      6.76        -2.0      4.0         -1.49    -1.1
5      6      1.4       1.96        2.0       4.0         0.8      1.1
3      4      -0.6      0.36        0.0       0.0         -0.34    0.0
3      2      -0.6      0.36        -2.0      4.0         -0.34    -1.1

Mean      3.6    4.0                                       0.0      0.0
Sum                     0.0       15.20       0.0       16.0
Std dev   1.74   1.79
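The z-score columns can be reproduced with Python's standard library. This sketch assumes the standard deviations divide by N (the population formula, `statistics.pstdev`), which is what matches the 1.74 and 1.79 shown in the table; the helper name `z_scores` is illustrative.

```python
from statistics import mean, pstdev

xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]


def z_scores(values):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # divides by N, matching the table's 1.74 and 1.79
    return [(v - m) / sd for v in values]


zx = z_scores(xs)
zy = z_scores(ys)

print(round(pstdev(xs), 2), round(pstdev(ys), 2))  # 1.74 1.79
print([round(z, 2) for z in zx])  # [1.38, -1.49, 0.8, -0.34, -0.34]
print([round(z, 2) for z in zy])  # [1.12, -1.12, 1.12, 0.0, -1.12]
```

The table shows the ZY values to one decimal place (1.1 and -1.1); the sketch prints two.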

Page 19: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized.
– Computed based on z-scores rather than with raw scores

• Prediction model
– Predicted Z score (on criterion variable) = standardized regression coefficient multiplied by Z score on predictor variable
– Formula: $\hat{Z}_Y = (\beta)(Z_X)$
– The standardized regression coefficient (β)

• In bivariate prediction, β = r
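The claim that β equals r with a single predictor can be checked numerically. The sketch below computes r from SP, SSX, and SSY, then computes the least-squares slope of ZY on ZX separately; all variable names are illustrative, and the z-scores use the population standard deviation (dividing by N).

```python
from math import sqrt

xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)

# Pearson's r from the sums of squares and products.
r = sp / sqrt(ss_x * ss_y)

# Standardize both variables, then fit the slope of ZY on ZX.
zx = [(x - mean_x) / sqrt(ss_x / n) for x in xs]
zy = [(y - mean_y) / sqrt(ss_y / n) for y in ys]
beta = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

print(round(r, 3), round(beta, 3))  # 0.898 0.898
```

Because z-scores have a mean of 0, the intercept of the standardized line is 0 and only the slope needs to be estimated.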

Page 20: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing regression line (with z-scores)

slope = β = r = 0.90

intercept = 0.0 (the mean of the z-scores)

$\hat{Z}_Y = (\beta)(Z_X)$

[Figure: scatterplot of ZY against ZX (both axes run from -2 to 2) with the standardized regression line passing through the origin]

Page 21: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Also need a measure of error

Y = intercept + slope (X) + error

• The linear equation isn’t the whole thing

• Same line, but different relationships (strength difference)

[Figure: two scatterplots (axes 1 to 6), each described by Y = X(.5) + (2.0) + error; the line is the same in both panels, but the strength of the relationship differs]

Page 22: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression

• Error
– Actual score minus the predicted score

• Measures of error
– r² (r-squared)
– Proportionate reduction in error $= \frac{SS_{total} - SS_{error}}{SS_{total}}$

• Note: Total squared error when predicting from the mean = SStotal = SSY
– Squared error using prediction model = Sum of the squared residuals = SSresidual = SSerror

Page 23: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

R-squared

• r² represents the proportion of variance in Y accounted for by X (multiply by 100 for the percent)

[Figure: two scatterplots (axes 1 to 6). Left panel: r = 0.8, r² = 0.64, 64% variance explained. Right panel: r = 0.5, r² = 0.25, 25% variance explained]

Page 24: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Compute the difference between the predicted values and the observed values (“residuals”)

• Square the differences

• Add up the squared differences

• Sum of the squared residuals = SSresidual = SSerror

[Figure: scatterplot of the data (axes 1 to 6) with the regression line]

Page 25: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

$\hat{Y} = 0.92X + 0.688$ gives the predicted values of Y (points on the line)

X      Y
6      6
1      2
5      6
3      4
3      2

mean   3.6    4.0

Page 26: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

$\hat{Y} = 0.92X + 0.688$ gives the predicted values of Y (points on the line)

For the first pair (X = 6, Y = 6): $\hat{Y} = (0.92)(6) + 0.688 = 6.2$

Page 27: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

X      Y      Ŷ = 0.92X + 0.688
6      6      6.2  = (0.92)(6) + 0.688
1      2      1.6  = (0.92)(1) + 0.688
5      6      5.3  = (0.92)(5) + 0.688
3      4      3.45 = (0.92)(3) + 0.688
3      2      3.45 = (0.92)(3) + 0.688

mean   3.6    4.0

Page 28: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

X      Y      Ŷ
6      6      6.2
1      2      1.6
5      6      5.3
3      4      3.45
3      2      3.45

[Figure: scatterplot of the data (axes 1 to 6) with the line Y = 0.92X + 0.688; the predicted values 6.2, 1.6, 5.3, and 3.45 are marked on the line]

Page 29: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

Ŷ = 0.92X + 0.688

X      Y      Ŷ       Y − Ŷ (residuals)
6      6      6.2     6 - 6.2  = -0.20
1      2      1.6     2 - 1.6  = 0.40
5      6      5.3     6 - 5.3  = 0.70
3      4      3.45    4 - 3.45 = 0.55
3      2      3.45    2 - 3.45 = -1.45

mean   3.6    4.0

Quick check: the residuals sum to 0.00

Page 30: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

Ŷ = 0.92X + 0.688

X      Y      Ŷ       Y − Ŷ     (Y − Ŷ)²
6      6      6.2     -0.20     0.04
1      2      1.6     0.40      0.16
5      6      5.3     0.70      0.49
3      4      3.45    0.55      0.30
3      2      3.45    -1.45     2.10

mean   3.6    4.0
sum                   0.00      3.09 (SSERROR)
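The three steps (predict, subtract, square and add) can be sketched in Python as follows. The slope and intercept are the rounded values from the slides; the variable names are illustrative. Because this sketch does not round the predicted values before subtracting, its total comes out at about 3.11 rather than the 3.09 obtained from the rounded predictions in the table.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
b, a = 0.92, 0.688  # slope and intercept as rounded on the slides

# Step 1: predicted values (points on the line).
predicted = [b * x + a for x in xs]
# Step 2: residuals, actual score minus predicted score.
residuals = [y - p for y, p in zip(ys, predicted)]
# Step 3: square the residuals and add them up.
ss_error = sum(e ** 2 for e in residuals)

print([round(p, 2) for p in predicted])  # [6.21, 1.61, 5.29, 3.45, 3.45]
print([round(e, 2) for e in residuals])  # [-0.21, 0.39, 0.71, 0.55, -1.45]
print(round(ss_error, 2))                # 3.11
```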

Page 31: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror

Ŷ = 0.92X + 0.688

X      Y      Ŷ       Y − Ŷ     (Y − Ŷ)²         (Y − Ȳ)²
6      6      6.2     -0.20     0.04             4.0
1      2      1.6     0.40      0.16             4.0
5      6      5.3     0.70      0.49             4.0
3      4      3.45    0.55      0.30             0.0
3      2      3.45    -1.45     2.10             4.0

mean   3.6    4.0
sum                   0.00      3.09 (SSERROR)   16.0 (SSY)

Page 32: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror = 3.09

• Total squared error when predicting from the mean = SSY = 16.0

– Proportionate reduction in error $= \frac{SS_{total} - SS_{error}}{SS_{total}} = \frac{16.0 - 3.09}{16.0} = 0.81$

• Also (like r²) represents the percent variance in Y accounted for by X

• In fact, it is mathematically identical to r²
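As a quick numerical check of that identity, the sketch below computes the proportionate reduction in error from the slide's totals and, separately, r² from SP, SSX, and SSY. The variable names are illustrative.

```python
ss_total = 16.0  # SSY: squared error when predicting from the mean
ss_error = 3.09  # squared error around the regression line (from the slide)

# Proportionate reduction in error.
pre = (ss_total - ss_error) / ss_total

# r squared computed from SP, SSX and SSY.
sp, ss_x = 14.0, 15.2
r_squared = sp ** 2 / (ss_x * ss_total)

print(round(pre, 2), round(r_squared, 2))  # 0.81 0.81
```

The two values agree to two decimal places; the small difference beyond that comes from the rounding of the predicted values that went into 3.09.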

Page 33: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Seeing patterns in the error

• Residual plots

• The sum of the residuals should always equal 0 (as should the mean).
– The least squares regression line balances the error: the positive residuals (points above the line) and the negative residuals (points below the line) cancel each other out.

• In addition to summing to zero, we also want the residuals to be randomly distributed.
– That is, there should be no pattern to the residuals.
– If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.

• Residual plots are very useful tools to examine the relationship even further.
– These are basically scatterplots of the residuals (Y − Ŷ) against the explanatory (X) variable (note: the examples actually plot residuals that have been transformed into z-scores).
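The zero-sum property is easy to verify on the lecture's five-point dataset. This sketch fits the line with unrounded least-squares coefficients (so the sum is zero up to floating-point error) and then lists each residual next to its X value, which is the raw material for a residual plot. Variable names are illustrative, and no plotting library is assumed.

```python
xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)

b = sp / ss_x            # least-squares slope
a = mean_y - b * mean_x  # least-squares intercept

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

print(abs(residual_sum) < 1e-9)  # True: the residuals sum to zero

# Residuals paired with X, sorted by X, ready to be plotted.
for x, e in sorted(zip(xs, residuals)):
    print(f"X = {x}: residual = {e:+.2f}")
```

With only five points there is too little data to judge a pattern; the listing just shows what would go on the two axes of a residual plot.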

Page 34: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Seeing patterns in the error

• The scatterplot shows a nice linear relationship.

• The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.

[Figure: two panels, a scatter plot and the corresponding residual plot]

Page 35: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Seeing patterns in the error

• The scatterplot also shows a nice linear relationship.

• The residual plot shows that the residuals get larger as X increases.

• This suggests that the variability around the line is not constant across values of X.

• This is referred to as a violation of homogeneity of variance.

[Figure: two panels, a scatter plot and the corresponding residual plot]

Page 36: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Seeing patterns in the error

• The scatterplot shows what may be a linear relationship.

• The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).

[Figure: two panels, a scatter plot and the corresponding residual plot]

Page 37: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression in SPSS

• Running the analysis is pretty easy
– Analyze: Regression: Linear
– Predictor variables go into the ‘independent variable’ field
– The predicted variable goes into the ‘dependent variable’ field

• You get a lot of output

Page 38: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Regression in SPSS

[Figure: SPSS regression output with the following parts labeled]

• The variables in the model

• r

• r²

• Unstandardized coefficients
– Slope (indep var name)
– Intercept (constant)

• Standardized coefficients

• We’ll get back to these numbers in a few weeks

Page 39: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Multiple Regression

• Multiple regression prediction models

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

“fit”: $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$

“residual”: $\varepsilon$

Page 40: Statistics for the Social Sciences Psychology 340 Spring 2005 Prediction

Prediction in Research Articles

• Bivariate prediction models rarely reported

• Multiple regression results commonly reported