Chapter 10: Linear Regression
Prepared by Samantha Gaies, M.A.
• A perfect correlation implies the ability to predict one score from another perfectly.
• Perfect predictions:
  – When dealing with z-scores, the z-score you predict for the Y variable is exactly the same as the z-score for the X variable.
  – That is, when r = +1.0: zY' = zX
  – And, when r = –1.0: zY' = –zX
• When r is less than perfect, this rule must be modified according to the strength of the correlation. The modified rule is the standardized regression equation, as shown on the next slide.
• Predicting with z-scores
  – Standardized Regression Equation: zY' = r zX
• If r = –1 or +1, the magnitude of the predicted z score is the same as the z score from which we are predicting.
• If r = 0, the z score prediction is always zero (i.e., the mean), which implies that, given no other information, our best prediction for a variable is its own mean.
– As the magnitude of r becomes smaller, there is less of a tendency to expect an extreme score on one variable to be associated with an equally extreme score on the other. This is consistent with Galton’s concept of “regression toward mediocrity” (i.e., regression toward the mean).
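The shrinkage rule above can be sketched in a couple of lines of Python (the function name is just for illustration):

```python
# Standardized regression: the predicted z-score for Y is the
# z-score for X shrunk toward 0 (the mean) by a factor of r.
def predict_zy(r, z_x):
    return r * z_x

# r = +1: the prediction is exactly as extreme as the observed score
assert predict_zy(1.0, 1.8) == 1.8
# r = .5: an extreme X predicts a less extreme Y (regression to the mean)
assert predict_zy(0.5, 2.0) == 1.0
# r = 0: the best prediction is always the mean (z = 0)
assert predict_zy(0.0, 2.0) == 0.0
```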
[Figure: two scatter plots of the same exam data. Raw-score graph: Exam2 = 1.119 Exam1 – 41.8 (r² = .83), with a horizontal line at the mean of Exam2 for reference. z-score graph: zscore2 = 0.909 zscore1 (r² = .83); the intercept is zero, so the line passes through the origin.]
Regression Formulas when Dealing with a Population
– A basic formula for linear regression in terms of population means and standard deviations is as follows:

Y' = μY + r (σY / σX)(X – μX)

– This formula can be simplified to the basic equation for a straight line:

Y' = bYX X + aYX

where

bYX = r (σY / σX)

and

aYX = μY – bYX μX
Regression Formulas for Making Predictions from Samples
– The same raw-score regression equation is used when working with samples:

Y' = bYX X + aYX

except that the slope of the line is now found from the unbiased SDs:

bYX = r (sY / sX)

and the Y-intercept is now expressed in terms of the sample means:

aYX = Ȳ – bYX X̄
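The sample slope and intercept formulas can be checked numerically. This sketch (assuming NumPy is available; the data are made up) confirms that the textbook formulas reproduce the ordinary least-squares line:

```python
import numpy as np

def regression_line(x, y):
    """Slope and intercept for predicting y from x, computed from
    Pearson's r and the unbiased standard deviations."""
    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std(ddof=1) / x.std(ddof=1)   # b_YX = r * s_Y / s_X
    a = y.mean() - b * x.mean()             # a_YX = y-bar - b * x-bar
    return b, a

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
b, a = regression_line(x, y)

# Same line as NumPy's own least-squares fit
assert np.allclose([b, a], np.polyfit(x, y, 1))
```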
Quantifying the Errors around the Regression Line
– Residual: the difference between the actual Y value and the predicted Y value (Y – Y’). Each residual can be thought of as an error of prediction.
– The positive and negative residuals will balance out so that the sum of the residuals will always be zero.
– The linear regression equation gives us the straight line that minimizes the sum of the squared residuals (i.e., the sum of squared errors). Therefore, it is called the least-squares regression line.
– The regression line functions like a running average of Y, in that it passes through the mean of the Y values (approximately) for each value of X.
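Two of these properties are easy to verify numerically; a small sketch with simulated data (the variable names and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)   # noisy linear relationship

b, a = np.polyfit(x, y, 1)          # least-squares slope and intercept
residuals = y - (b * x + a)         # errors of prediction, Y - Y'

# The positive and negative residuals balance out to (essentially) zero.
assert abs(residuals.sum()) < 1e-8

# No other line achieves a smaller sum of squared residuals.
sse = (residuals ** 2).sum()
nudged = ((y - ((b + 0.1) * x + a)) ** 2).sum()
assert sse < nudged
```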
The Variance of the Estimate in a Population
– Quantifies the average amount of squared error in the predictions:

σ²estY = Σ (Y – Y')² / N

– The variance of the estimate (or residual variance) is the variance of the data points around the regression line.
– As long as r is not zero, σ²estY will be less than σ²Y (the ordinary variance of the Y values); the amount by which it is less represents the advantage of performing regression.
– Larger rs (in absolute value) will lead to less error in prediction (i.e., points closer to the regression line) and therefore a smaller value for σ²estY.
– This relation between σ²estY and Pearson's r is shown in the following formula:

σ²estY = σ²Y (1 – r²)
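A quick numerical check of this relation, using simulated data (names and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)

r = np.corrcoef(x, y)[0, 1]
b, a = np.polyfit(x, y, 1)

# Population-style variance (dividing by N) of the points around the line
var_est = np.mean((y - (b * x + a)) ** 2)
var_y = y.var()

# sigma^2_estY = sigma^2_Y * (1 - r^2), and regression beats no predictor
assert np.isclose(var_est, var_y * (1 - r ** 2))
assert var_est < var_y
```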
Coefficient of Determination
– The proportion of variance in the predicted variable that is not accounted for by the predicting variable is found by rearranging the formula for the variance of the estimate in the previous slide:

1 – r² = unexplained variance / total variance = σ²estY / σ²Y

– The ratio of the variance of the estimate to the ordinary variance of Y is called the coefficient of nondetermination, and it is sometimes symbolized as k².
– Larger absolute values of r are associated with smaller values for k².
– The proportion of the total variance that is explained by the predictor variable is called the coefficient of determination, and it is simply equal to r²:

r² = explained variance / total variance = 1 – k²
Here is a concrete example of linear regression, from Lockhart, Robert S. (1998). Introduction to Statistics and Data Analysis. New York: W. H. Freeman & Company:

X (Age): M = 98.14 months, sX = 21.0
Y (Score): M = 30.35 items correct, sY = 7.25
r = .72, N = 100
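Plugging these summary statistics into the sample regression formulas gives the concrete prediction line (the 120-month prediction is just an illustration):

```python
# Summary statistics from the Lockhart example
mean_x, mean_y = 98.14, 30.35    # age (months), score (items correct)
s_x, s_y = 21.0, 7.25
r = 0.72

b = r * s_y / s_x                # slope: 0.72 * 7.25 / 21.0
a = mean_y - b * mean_x          # intercept

# Predicted score for a 120-month-old child
y_pred = b * 120 + a

assert round(b, 4) == 0.2486
assert round(a, 2) == 5.96
assert round(y_pred, 1) == 35.8
```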
Estimating the Variance of the Estimate from a Sample
– When using a sample to estimate the variance of the estimate, we need to correct for bias, even though we are basing our formula on the unbiased estimate of the ordinary variance:

s²estY = ((N – 1) / (N – 2)) s²Y (1 – r²)

Standard Error of the Estimate
– The standard error of the estimate is just the square root of the variance of the estimate. When estimating from a sample, the formula is:

sestY = √[((N – 1) / (N – 2)) s²Y (1 – r²)]
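Continuing the Lockhart example, the bias-corrected standard error of the estimate works out as follows (a sketch; the summary numbers come from the example above):

```python
import math

s_y, r, n = 7.25, 0.72, 100      # from the Lockhart example

# Bias-corrected variance of the estimate, then its square root
var_est = (n - 1) / (n - 2) * s_y ** 2 * (1 - r ** 2)
se_est = math.sqrt(var_est)

# Typical prediction error is about 5 points, versus s_Y = 7.25
# if we had no predictor at all.
assert round(se_est, 2) == 5.06
assert se_est < s_y
```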
Assumptions Underlying Linear Regression
– Independent random sampling
– Bivariate normal distribution
– Linearity of the relationship between the two variables
– Homoscedasticity (i.e., the variance around the regression line is the same for every X value)

Uses for Linear Regression
– Prediction
– Statistical control (i.e., removing the linear effect of one variable on another)
– Quantifying the relationship between a DV and a manipulated IV with quantitative levels
The Point-Biserial Correlation Coefficient
– An ordinary Pearson's r calculated for one continuous multivalued variable and one dichotomous (i.e., grouping) variable. The sign of rpb is arbitrary and therefore usually ignored.
– An rpb can be tested for significance with a one-sample t test as follows:

t = rpb √(N – 2) / √(1 – r²pb)

– By solving for rpb, we obtain a simple formula for converting a two-sample pooled-variance t value into a correlational measure of effect size:

rpb = √[t² / (t² + df)]
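These two conversions are inverses of each other, which is easy to verify (the function names are just for illustration):

```python
import math

def t_from_rpb(r_pb, n):
    """t value for testing r_pb against zero, with df = N - 2."""
    return r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)

def rpb_from_t(t, df):
    """Convert a two-sample t value into r_pb as an effect-size measure."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

# Converting r_pb to t and back recovers the original value.
t = t_from_rpb(0.3, 50)
assert math.isclose(rpb_from_t(t, 48), 0.3)
```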
The Proportion of Variance Accounted for in a Two-Sample Comparison
– Squaring rpb gives the proportion of variance in your DV accounted for by your two-level IV (i.e., group membership).
– Even when you obtain a large t value, it is possible that little variance is accounted for; therefore, rpb is a useful supplement to the two-sample t value.
– rpb is an alternative to g for expressing the effect size found in your samples. The two measures have a fairly simple relationship:

r²pb = g² / (g² + 4 (df / N))

where N is the total number of cases across both groups, and df = N – 2.
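For two equal-sized groups, g = 2t/√N, so the g-based and t-based routes to rpb agree; a quick check (assuming equal groups, with arbitrary example numbers):

```python
import math

def rpb_from_g(g, n):
    """r_pb from g for two groups totalling N cases; df = N - 2.
    r_pb^2 = g^2 / (g^2 + 4 * df / N)"""
    df = n - 2
    return math.sqrt(g ** 2 / (g ** 2 + 4 * df / n))

def rpb_from_t(t, df):
    return math.sqrt(t ** 2 / (t ** 2 + df))

# With two equal groups of 50, g = 2t / sqrt(N),
# and both conversions give the same r_pb.
n, t = 100, 2.5
g = 2 * t / math.sqrt(n)
assert math.isclose(rpb_from_g(g, n), rpb_from_t(t, n - 2))
```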
Estimating the Proportion of Variance Accounted for in the Population
– r²pb from a sample tends to overestimate the proportion of variance accounted for in the population. This bias can be corrected with the following formula:

est. ω² = (t² – 1) / (t² + df + 1)

– ω² and d² are two different measures of the effect size in the population. They have a very simple relationship, as shown by the following formula:

ω² = d² / (d² + 4)
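Both formulas are simple enough to sketch directly (a small illustration, not tied to any particular data set; function names are just for illustration):

```python
def omega_sq_from_t(t, df):
    """Bias-corrected estimate of omega squared from a two-sample t."""
    return (t ** 2 - 1) / (t ** 2 + df + 1)

def omega_sq_from_d(d):
    """Population relation: omega^2 = d^2 / (d^2 + 4)."""
    return d ** 2 / (d ** 2 + 4)

# A medium-sized population effect (d = .5) accounts for only
# about 6% of the variance.
assert round(omega_sq_from_d(0.5), 3) == 0.059
assert omega_sq_from_t(2.0, 48) > 0
```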