
Page 1

Regression

Understanding relationships and predicting outcomes

Page 2

Key concepts in understanding regression:
- The General Linear Model
- Prediction and errors in prediction
- Coefficients/weights
- Variance explained, variance not accounted for
- Effect of outliers
- Assumptions

Page 3

Relations among variables

A goal of science is prediction and explanation of phenomena

In order to do so we must find events that are related in some way such that knowledge about one will lead to knowledge about the other

In psychology we seek to understand the relationships among variables that serve as indicators of an innumerable amount of information about human nature, in order to better understand ourselves and why we are the way we are.

Page 4

Before getting too far

We will be getting ‘mathy’ in our discussion of regression; there’s no way around it. All the analyses you see in articles are ‘simply’ mathematical models fit to the data collected. Without an understanding of that aspect on some level, there is no way to do or understand psychological science in any meaningful way.

However, it is important to remember why we are doing this. Stats, as a reminder, is simply a tool. Our primary interest is in understanding human behavior, and potentially the underlying causes of it.

We are interested in predicting what causes physical and emotional pain, individual happiness, how the mind works, how and why we make the choices we do, and so on.

So to aid you in your own understanding, before going on, pick a simple relationship between two variables you would be interested in, and keep them ‘in mind’ as we go through the following. Identify one as the predictor, one as the outcome. Write them down and refer to them as we go along.

Page 5

Correlation

While we could just use our N of 1 personal experience to try to understand human behavior, a scientific (and better) means of understanding the relationship between variables is by assessing correlation.

Two variables take on different values, but if they are related in some fashion they will covary

They may do so in a way in which their values tend to move in the same direction, or they may tend to move in opposite directions

The underlying statistic assessing this is covariance, which is at the heart of every statistical procedure you are likely to use inferentially

Page 6

Covariance and Correlation

Covariance as a statistical construct is unbounded and thus difficult to interpret in its raw form

Correlation (Pearson’s r) is a measure of the direction and degree of a linear association between two variables

Correlation is the standardized covariance between two variables

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$$

$$-1 \le r_{xy} \le +1$$
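As a quick illustration (not from the original slides; the data are invented), a minimal Python sketch shows that the standardized covariance and the average product of z-scores both give Pearson’s r:

```python
import numpy as np

# Invented data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

# Covariance: cross-product of deviations, with n - 1 in the denominator
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson r: the covariance standardized by the two standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Equivalent: the 'average' product of z-scores (again over n - 1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r_z = np.sum(zx * zy) / (n - 1)

print(cov_xy, r, r_z)           # r and r_z agree
print(np.corrcoef(x, y)[0, 1])  # matches numpy's built-in correlation
```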

Page 7

Regression

Regression allows us to use the information about covariance to make predictions

Given knowledge regarding the value of one variable, we can predict an outcome with some level of accuracy

The basic model is that of a straight line (the General Linear Model). The formula for a straight line is:

Y = bX + a

where:
- Y = the calculated value for the variable on the vertical axis
- a = the intercept, where the line crosses the Y axis
- b = the slope of the line
- X = values for the variable on the horizontal axis

Only one possible straight line can be drawn once the slope and intercept are specified, and once this line is specified, we can calculate the corresponding value of Y for any value of X entered.

In more general terms, Y = Xb + e, where these elements represent vectors and/or matrices (of the outcome, data, coefficients, and error respectively), is the general linear model to which most of the techniques in psychological research adhere.
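To make this concrete, here is a minimal sketch (the intercept, slope, and X values are invented) showing that the scalar form Y = bX + a and the matrix form Y = Xb give identical predictions:

```python
import numpy as np

a, b = 2.0, 0.5                      # invented intercept and slope
x = np.array([0.0, 1.0, 2.0, 3.0])   # invented predictor values

# Scalar form of the straight line
y_hat = b * x + a

# General linear model form: a design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones_like(x), x])
y_hat_glm = X @ np.array([a, b])

print(y_hat)      # [2.  2.5 3.  3.5]
print(y_hat_glm)  # identical
```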

Page 8

The Line of Best Fit

Real data do not conform perfectly to a straight line. The best-fit straight line is the one that minimizes the amount of variation of the data points from the line. The common approach, though by no means the only acceptable method, is to derive a least squares regression line, which minimizes the squared deviations of the points from it.

The equation for this line can be used to predict or estimate an individual’s score on some outcome on the basis of his or her score on the predictor. Y-hat here is the predicted (fitted) value for the DV, not the actual value of the DV for a case:

$$\hat{Y} = bX + a$$
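A least squares line can be fit in a couple of lines of numpy; a minimal sketch with invented data (np.polyfit minimizes exactly the squared deviations described above):

```python
import numpy as np

# Invented data that roughly follow a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates of the slope (b) and intercept (a)
b, a = np.polyfit(x, y, deg=1)

y_hat = b * x + a   # predicted (fitted) values, Y-hat
resid = y - y_hat   # errors in prediction

print(f"Y-hat = {b:.3f}X + {a:.3f}")
print("Residual sum of squares:", np.sum(resid ** 2))
```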

Page 9

Least Squares Modeling

When the relations between variables are expressed in this manner, we call the relevant equation(s) mathematical models, and they reflect our theoretical models.

The intercept and weight values are called the parameters of the model.

While typical regression analysis by itself does not determine causal relations, the assumption indicated by such a model is that the variable on the left-hand side of the previous equation is being caused by the variable(s) on the right side. The arrows explicitly go from the predictors to the outcome, not vice versa.

[Path diagram: predictors Variable X, Variable Y, and Variable Z, with arrows A, B, and C pointing to the Criterion]

Page 10

Parameter Estimation Example

Let’s assume that we believe there is a linear relationship between X and Y.

Which set of parameter values will bring us closest to representing the data accurately?

Page 11

Estimation Example

We begin by picking some values, plugging them into the equation, and seeing how well the implied values correspond to the observed values

We can quantify what we mean by “how well” by examining the difference between the model-implied Y and the actual Y value

This difference between our observed value and the one predicted, $Y - \hat{Y}$, is often called the error in prediction, or the residual.

The residual Sum of Squares here is 160

$$\hat{Y} = 2X + 2$$
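The slides’ trial-and-error search can be mimicked in code. The data below are invented stand-ins chosen to reproduce the residual sum of squares of 160 quoted above (the transcript does not include the actual data points), but the procedure is the same: plug in candidate parameters and sum the squared residuals.

```python
import numpy as np

# Stand-in data generated from Y = 2X - 2, mirroring the slides' perfect-fit endpoint
x = np.arange(1, 11, dtype=float)
y = 2 * x - 2

def residual_ss(a, b, x, y):
    """Sum of squared differences between observed Y and model-implied Y-hat."""
    y_hat = b * x + a
    return np.sum((y - y_hat) ** 2)

# Try the sequence of intercepts from the slides, slope fixed at 2
for a in [2, 1, 0, -1, -2]:
    print(f"Y-hat = 2X + ({a}): residual SS = {residual_ss(a, 2.0, x, y):.0f}")
```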

Page 12

Estimation Example

Let’s try a different value for one of our parameters, i.e. a different coefficient, and see what happens

Now the implied values of Y are getting closer to the actual values of Y, but we’re still off by quite a bit

$$\hat{Y} = 2X + 1$$

Page 13

Estimation Example

Things are getting better, but they could certainly still improve

$$\hat{Y} = 2X$$

Page 14

Estimation Example

Getting better still

$$\hat{Y} = 2X - 1$$

Page 15

Estimation Example

Now we’ve got it: there is a perfect correspondence between the predicted values of Y and the actual values of Y, i.e. no residual variance. There is also no chance of this ever happening with real data.

$$\hat{Y} = 2X - 2$$

Page 16

Estimates of the constant and coefficient in the simple setting

Estimating the slope of the line: this is our regression coefficient, and it represents the amount of change in the outcome seen with a 1 unit change in the predictor. It requires first estimating the covariance:

$$b = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$$

Estimating the Y intercept:

$$a = \bar{Y} - b\bar{X}$$

where $\bar{Y}$ and $\bar{X}$ are the means of the Y and X values respectively, and b is the estimated slope of the line.

These calculations ensure that the regression line passes through the point on the scatterplot defined by the two means.
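These two estimators are easy to check numerically; a sketch with invented data, verified against np.polyfit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # invented predictor values
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # invented outcome values

# Slope: covariance of X and Y over the variance of X
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Intercept: forces the line through the point (mean of X, mean of Y)
a = y.mean() - b * x.mean()

print(a, b)
print(np.polyfit(x, y, 1))  # returns [b, a]: the same estimates
```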

Page 17

In terms of the Pearson r

Alternatively, the slope is

$$b = r\,\frac{s_y}{s_x}$$

so by substituting into $\hat{Y} = a + bX$ we get

$$\hat{Y} = \bar{Y} + r\,\frac{s_y}{s_x}\,(X - \bar{X})$$
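The r-based form of the slope can be verified with the same invented data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r = np.corrcoef(x, y)[0, 1]

b_from_r = r * (y.std(ddof=1) / x.std(ddof=1))            # slope via Pearson r
b_direct = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(b_from_r, b_direct)  # identical
```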

Page 18

Break time

Stop and look at your chosen variables of interest. Write down our general linear model, substituting your predictor and outcome for the X and Y respectively.

- Do you understand how the measurable relationship between the two comes into play?
- Can you understand the slope in terms of your predictor and its effect on the outcome?
- Can you understand the intercept in terms of a pictorial relationship of this model?
- Can you understand the notion of a ‘fitted’ value with regard to your outcome?

If you’re okay at this point, it’s time to see how good a job we’re doing in this prediction business.

Page 19

Breaking Down the Variance

Total variance = predicted variance + error variance

Total variability in the dependent variable (i.e. how the values bounce about the mean) comes from two sources:

- Variability predicted by the model: the variability in the dependent variable that is due to the predictor, i.e. how far the predicted values are from the mean of Y
- Error or residual variability: variability not explained by the predictor variable, i.e. the difference between the predicted values and the observed values

$$SS_Y = \sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$$
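A quick numerical check (invented data again) that the total sum of squares splits exactly into model and residual pieces for a least squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = b * x + a

ss_total = np.sum((y - y.mean()) ** 2)      # total variability about the mean
ss_model = np.sum((y_hat - y.mean()) ** 2)  # variability predicted by the model
ss_resid = np.sum((y - y_hat) ** 2)         # variability left unexplained

print(ss_total, ss_model + ss_resid)        # equal, up to floating point error
```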

Page 20

Regression and Variance

It’s important to understand this conceptually in terms of the variance in the DV we are trying to account for:

- With perfect prediction, we’d have zero residual variance: all variance in the outcome variable is accounted for
- With zero prediction, all variance would be residual variance: essentially the same as ‘predicting’ the mean each time (note that if we knew nothing else, the mean is all we could predict)
- The fact that there is a correlation between the two allows us to do better: no correlation, no fit

Page 21

R²: the coefficient of determination

The square of the correlation, R², is the fraction of the variation in the values of the outcome that is explained by our predictor.

We can show this graphically using a Venn diagram: R² is the proportion of variability shared by two variables (X and Y), and the larger the area of overlap, the greater the strength of the association between the two variables.

R² = variance of predicted values divided by the total variance of observed DV values.

R² is also the square of the correlation between those fitted values and the original DV.
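Both routes to R² can be computed directly: as the ratio of predicted to observed variance, and as the squared correlation between fitted and observed values. A sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = b * x + a

# R-squared as variance of predicted values over total variance of observed values
r2_var = np.var(y_hat, ddof=1) / np.var(y, ddof=1)

# R-squared as the squared correlation between fitted values and the original DV
r2_corr = np.corrcoef(y_hat, y)[0, 1] ** 2

print(r2_var, r2_corr)  # same value
```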

Page 22

Predicted variance and R²

$$s^2_{\hat{Y}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{n - 1}$$

$$s^2_{\hat{Y}} = r^2 s^2_Y \quad\Longrightarrow\quad r^2 = \frac{s^2_{\hat{Y}}}{s^2_Y}$$

Page 23

Measures of ‘fit’

Many measures of fit are available, though with regression you will typically see (adjusted) R². Some others include:

- Proportional improvement in prediction (as seen in Howell)
- From the path analysis/SEM literature:
  - χ² (typically a poor approach, as we have to ‘accept the null’)
  - GFI (goodness of fit index)
  - AIC (Akaike information criterion)
  - BIC (Bayesian information criterion)

Some of these, e.g. the BIC, have little utility except in terms of model comparison (a small sketch follows below).

One of the means of getting around NHST is changing our question from ‘Is it significant?’ to ‘Which model is better?’
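As a rough sketch of that model-comparison idea (using the standard Gaussian-likelihood forms of AIC and BIC, up to additive constants; the data and the two candidate models are invented):

```python
import numpy as np

def aic_bic(y, y_hat, k):
    """AIC and BIC for a Gaussian model with k estimated parameters,
    computed from the residual sum of squares (up to additive constants)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    fit = n * np.log(rss / n)
    return fit + 2 * k, fit + np.log(n) * k

# Invented data and two candidate models
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.8])

# Model 1: intercept only, i.e. 'predicting the mean' each time
aic1, bic1 = aic_bic(y, np.full_like(y, y.mean()), k=1)

# Model 2: simple linear regression
b, a = np.polyfit(x, y, 1)
aic2, bic2 = aic_bic(y, b * x + a, k=2)

print("mean-only model: ", aic1, bic1)
print("regression model:", aic2, bic2)  # lower values indicate the better model
```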

Page 24

The Accuracy of Prediction

How else might we measure model fit?

The error associated with a prediction (of a Y value from a known X value) is a function of the deviations of Y about the predicted point.

The standard error of estimate provides an assessment of the accuracy of prediction: it is the standard deviation of Y predicted from X.

In terms of R², we can see that the more variance we account for, the smaller our standard error of estimate will be.

$$s_{Y \cdot X} = \sqrt{\frac{SS_{\text{residual}}}{df_{\text{residual}}}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}}$$

$$s_{Y \cdot X} = s_Y \sqrt{(1 - R^2)\,\frac{N - 1}{N - 2}}$$
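Both expressions for the standard error of estimate give the same number, which a short sketch with invented data confirms:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(y)

b, a = np.polyfit(x, y, 1)
y_hat = b * x + a

# Definition: square root of residual SS over residual degrees of freedom
see_def = np.sqrt(np.sum((y - y_hat) ** 2) / (N - 2))

# In terms of R-squared
r2 = np.corrcoef(x, y)[0, 1] ** 2
see_r2 = y.std(ddof=1) * np.sqrt((1 - r2) * (N - 1) / (N - 2))

print(see_def, see_r2)  # identical
```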

Page 25

Example Output: Study hours predicted by Book cost

The assumption is that greater cost is indicative of more classes and/or required reading.

- Given just the df and sums of squares, you should be able to fill out the rest of the ANOVA summary table save the p-value
- Given the coefficient and standard error, you should be able to calculate the t (see the sketch after the table)
- Note the relationship of the t-statistic and p-value for the predictor to the F-statistic and p-value for the model

Notice the small coefficient? What does this mean? Think of the Book Cost scale and the Hours studied per day: a one unit movement in Book Cost is only a dollar, and corresponds to .0037 hours. With a more meaningful increase of 100 dollars, we can expect study time to increase by .37 hours, or about 22 minutes per day.

Source      Df   SS        Mean Sq   F       p-value   Res. Std. Error   R²
Model        1   19.044    19.044    5.669   .023      1.833             .147
Residuals   33   110.856    3.359
Total       34   129.90

Coefficients   Estimate   Std. Error   t-value   p-value
Intercept      2.28       .748         3.049     0.005
BookCostF08    .0037      .0016        2.381     0.023
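The fill-in-the-blanks exercise can be checked in code using only the values reported above; scipy’s F and t distributions supply the p-values:

```python
import numpy as np
from scipy import stats

# Values taken from the table above
df_model, ss_model = 1, 19.044
df_resid, ss_resid = 33, 110.856

ms_model = ss_model / df_model           # 19.044
ms_resid = ss_resid / df_resid           # 3.359
F = ms_model / ms_resid                  # 5.669
p_F = stats.f.sf(F, df_model, df_resid)  # .023

# t for the predictor from its coefficient and standard error
t = 0.0037 / 0.0016                      # ~2.31 (2.381 in the table; the
p_t = 2 * stats.t.sf(abs(t), df_resid)   # reported coef/SE are rounded)

r2 = ss_model / (ss_model + ss_resid)    # .147
resid_se = np.sqrt(ms_resid)             # 1.833

print(F, p_F, t, p_t, r2, resid_se)
# For one predictor, F = t**2 and the two p-values match
```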

Page 26

Interpreting regression: Summary of the Basics

- Intercept: the value of the outcome when the predictor value is 0. Often not meaningful, particularly if it’s practically impossible to have a value of 0 for a predictor (e.g. weight)
- Slope: the amount of change in the outcome seen with a 1 unit change in the predictor
- Standardized regression coefficient: the amount of change in the outcome, in standard deviation units, seen with a 1 standard deviation change in the predictor. In simple regression it is equivalent to the Pearson r for the two variables
- Standard error of estimate: gives a measure of the accuracy of prediction
- R²: the proportion of variance explained by the model

Page 27

Other things to consider

The mean of the predicted values equals the mean of the original DV

The regression line passes through the point representing the mean of both variables

In tests of significance, we can expect sample size, scatter of points about the regression line, and range of predictor values to all have an effect

Coefficients can be of the same size, yet their statistical significance and SSreg will vary (due to different standard errors)

Page 28

Hold on a second…

And you thought we were finished! In order to test for model adequacy, we have to run the regression first.

So yes, we are just getting started. The next notes refer to testing the integrity of the model in simple regression, but know that there are many more issues once additional predictors are added (i.e. the usual case).