Regression
TRANSCRIPT
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of the
University of New South Wales pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.
Slide 3
Aims
Understand linear regression with one predictor
Understand how we assess the fit of a regression model:
• Total Sum of Squares
• Model Sum of Squares
• Residual Sum of Squares
• F
• R2
Know how to do Regression on IBM SPSS
Interpret a regression model
Summary of Linear Regression
• Simple regression is a way of predicting one variable from another.
• We do this by fitting a statistical model to the data in the form of a straight line.
• This line is the line that best summarises the pattern of data.
• We have to assess how well the line fits the data using:
• R squared, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared with the predictor variable.
• F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad the model is).
• The b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS table) then the predictor variable significantly predicts the outcome variable.
Remember that we have previously talked about fitting models…
We do this by fitting a statistical model to the data in the form of a straight line
Slide 6
What is Regression?
A way of predicting the value of one variable from another.
• It is a hypothetical model of the relationship between two variables.
• The model used is a linear one.
• Therefore, we describe the relationship using the equation of a straight line.
• Remember that a straight line is: y = mx + b
Slide 7
b1
• Regression coefficient for the predictor
• Gradient (slope) of the regression line
• Direction/Strength of Relationship
b0
• Intercept (value of Y when X = 0)
• Point at which the regression line crosses the Y-axis (ordinate)
Describing a Straight Line
y = mx + b, which can also be written y = b + mx
So the model is ‘b + mx’: y = model + error
Y i = b0 + b1 X i + Error i
Regression co-efficients (b)
b0
• Intercept (value of Y when X = 0)
• Point at which the regression line crosses the Y-axis (ordinate)
b1
• Regression coefficient for the predictor
• Gradient (slope) of the regression line
• Direction/Strength of Relationship
Outcome: Album sales
Predictor: $ spent on advertising
Album sales i = b0 + b1 advertising budget i + Error i
Let’s say that the values of b0 & b1 turned out to be 50 & 100 respectively:
Album sales i = 50 + (100 x advertising budget i) + Error i
How much money do you want to spend on advertising per album? Say $5?
Album sales i = 50 + (100 x 5) + Error i
= 550 + Error i
Predicted album sales is 550. This predicted value is not perfect.
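As a minimal sketch, the worked example above can be written as a tiny Python function. The coefficients 50 and 100 are the slide's hypothetical values, not estimates from real data:

```python
# The album-sales model from the slide, with the hypothetical
# coefficients b0 = 50 (intercept) and b1 = 100 (slope).

def predict_sales(advertising_budget, b0=50.0, b1=100.0):
    """Predicted album sales = b0 + b1 * advertising budget."""
    return b0 + b1 * advertising_budget

predicted = predict_sales(5)  # spend $5 on advertising per album
print(predicted)              # 550.0; any observed value differs by the error term
```

The prediction is the systematic part of the model only; the error term is whatever the line cannot explain for a given observation.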
Example of simple regression
Slide 10
The Method of Least Squares
Gives us the error (residual sums of squares)
https://www.youtube.com/watch?v=0T0z8d0_aY4
https://www.youtube.com/watch?v=ocGEhiLwDVc
Video
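The least-squares idea from the videos can be sketched in plain Python using the closed-form solution for one predictor (b1 = covariance of x and y divided by the variance of x). The budget/sales numbers are made up for illustration:

```python
# Ordinary least squares for one predictor, via the closed-form solution:
#   b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b0 = mean(y) - b1 * mean(x)
# These are the values that minimise the residual sum of squares.

def least_squares(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

budget = [1, 2, 3, 4, 5]            # made-up advertising spend
sales = [155, 240, 360, 455, 540]   # made-up album sales
b0, b1 = least_squares(budget, sales)
print(b0, b1)  # intercept and slope of the best-fitting line
```

In SPSS these coefficients appear in the Coefficients table as the B column for the constant and the predictor.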
The regression line is only a model based on the data.
This model might not reflect reality.
We need some way of testing how well the model fits the observed data.
How?
Slide 12
Interpretation: How Good is the Model?
Most basic model: the mean. Total sum of squares (SST): no relationship between the two variables (the same amount of album sales no matter how big the advertising budget).
Residual sum of squares (SSR) describes the error in the model.
How much better is the model than just using the mean? The reduction in inaccuracy (the improvement) is the model sum of squares (SSM): the difference between the mean and the regression line.
Slide 15
Summary
SST
• Total variability (variability between scores and the mean).
SSR
• Residual/Error variability (variability between the regression model and the actual data).
SSM
• Model variability (difference in variability between the model and the mean).
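The SST/SSR/SSM decomposition above can be checked numerically. This sketch uses made-up data; because the line is fitted by least squares, SST = SSM + SSR holds exactly:

```python
# Sums of squares for a simple regression, illustrating SST = SSM + SSR.
# The x/y data are made up for illustration.

def sums_of_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    predicted = [b0 + b1 * a for a in x]
    sst = sum((b - my) ** 2 for b in y)                     # scores vs the mean
    ssr = sum((b - p) ** 2 for b, p in zip(y, predicted))   # scores vs the model
    ssm = sum((p - my) ** 2 for p in predicted)             # model vs the mean
    return sst, ssm, ssr

sst, ssm, ssr = sums_of_squares([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(sst, ssm, ssr)  # SST equals SSM + SSR (up to floating-point rounding)
```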
Slide 16
Testing the Model
If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR
SSR
Error in Model
SSM
Improvement Due to the Model
SST
Total Variance In The Data
Slide 17
Testing the Model: R2
R2
• The proportion of variance accounted for by the regression model.
• The Pearson Correlation Coefficient Squared
R2 = SSM / SST
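For simple regression, R2 = SSM / SST equals the squared Pearson correlation between predictor and outcome, so it can be computed either way. A sketch with made-up data:

```python
# R^2 as the squared Pearson correlation between predictor and outcome,
# which for one predictor equals SSM / SST. Data are made up.

from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(r ** 2)  # proportion of variance in the outcome explained by the model
```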
Slide 18
Testing the Model: ANOVA
Mean Squared Error
• Sums of Squares are total values.
• They can be expressed as averages.
• These are called Mean Squares, MS
• F is a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
• Good model has a large F (>1)
F = MSM / MSR
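The F-ratio above can be sketched numerically: divide each sum of squares by its degrees of freedom (1 for the model with one predictor, n − 2 for the residuals) and take the ratio. Data are made up:

```python
# F = MSM / MSR for a simple regression.
# MSM = SSM / df_model, MSR = SSR / df_residual.
# With one predictor, df_model = 1 and df_residual = n - 2.

def f_ratio(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    pred = [b0 + b1 * a for a in x]
    ssm = sum((p - my) ** 2 for p in pred)
    ssr = sum((b - p) ** 2 for b, p in zip(y, pred))
    msm = ssm / 1          # one predictor -> 1 model degree of freedom
    msr = ssr / (n - 2)    # n - 2 residual degrees of freedom
    return msm / msr

F = f_ratio([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(F)  # a good model has F well above 1
```

In SPSS this is the F in the ANOVA table of the regression output.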
• Gradient of regression line.
• Change in outcome as a result of 1 unit change in predictor.
• Bad model: b = zero. E.g. the mean as a predictor of album sales.
• If a variable, such as advertising, predicts an outcome, it should have a non-zero value.
• The hypothesis that something is ‘non-zero’ can be tested using a t-test.
• The t-statistic tells us how confident we can be that something is non-zero. If we are confident that b is non-zero that means that our variable (adverts) is predicting sales.
• We are usually confident that it’s non-zero if the result (significance) of the t-test is <.05.
The b-value (b1)
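In the lecture this t-test is read off the SPSS Coefficients table. As a sketch only, assuming SciPy is available, `scipy.stats.linregress` reports the same slope and its two-tailed p-value; the data are made up:

```python
# Testing whether b1 is non-zero, as the slide describes.
# Assumes SciPy is installed; the budget/sales data are made up.

from scipy import stats

budget = [1, 2, 3, 4, 5]
sales = [155, 240, 360, 455, 540]

result = stats.linregress(budget, sales)
print(result.slope)   # b1: change in sales per unit change in budget
print(result.pvalue)  # two-tailed p for the t-test that b1 = 0
significant = result.pvalue < 0.05  # the Sig. < 0.05 criterion from the slide
```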
Slide 22
Confidence Intervals
Domjan et al. (1998)
• ‘Conditioned’ sperm release in Japanese Quail.
True Mean
• 15 Million sperm
Sample Mean
• 17 Million sperm
Interval estimate
• 12 to 22 million (contains true value)
• 16 to 18 million (misses true value)
• CIs constructed such that 95% contain the true value.
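The interval construction can be sketched with the usual normal approximation (mean ± 1.96 standard errors). The sample values below are hypothetical, chosen so the sample mean is 17 million as in the example:

```python
# A 95% confidence interval for a mean using the normal approximation
# (mean +/- 1.96 * SE). Sample values (in millions) are made up so
# the sample mean is 17, matching the slide's example.

from math import sqrt
from statistics import mean, stdev

sperm_millions = [12, 15, 19, 21, 14, 18, 17, 20, 16, 18]
m = mean(sperm_millions)
se = stdev(sperm_millions) / sqrt(len(sperm_millions))  # standard error of the mean
lower, upper = m - 1.96 * se, m + 1.96 * se
print(lower, upper)  # any single interval either contains the true mean or it doesn't
```

For small samples like this one, software would use a t critical value rather than 1.96; the normal value is used here only to keep the sketch simple.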
Something other than evidence is affecting your conclusion. A source of bias comes from violating assumptions.
An assumption is a condition that ensures what you’re attempting to do works.
Bias
If Juno had 16 friends this would pull the mean up and incorrectly make KC and SK seem to be more popular. This affects our Sums of Squares and further calculations on the data.
We spot outliers by looking at graphs
Outliers
http://donaldearlcollins.com/2012/12/13/december-doctoral-decisions/graphic-1/
Additivity and Linearity
• The outcome variable is, in reality, linearly related to any predictors.
• If you have several predictors then their combined effect is best described by adding their effects together.
• If this assumption is not met then your model is invalid.
Homoscedasticity/ Homogeneity of Variance p175
When testing several groups of participants, samples should come from populations with the same variance.
In correlational designs, the variance of the outcome variable should be stable at all levels of the predictor variable.
E.g. the spread of hearing loss at each concert in Syd, Melb, BrisVegas is the same.
Residuals vs values of outcomes predicted by model.
Is there a systematic relationship between what comes out of model (predicted values) and errors in the model?
We want NO relationship.
A funnel shape (residuals spreading out) means homogeneity of variance is violated.
Curve in residuals = Non-linear relationship between outcome and predictors. This means linearity is broken.
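The visual check described above can be roughed out numerically: correlate the predicted values with the absolute residuals. A clearly positive correlation suggests the funnelling-out pattern; this is only an informal sketch with made-up data, not a formal test:

```python
# Informal check of the residual-plot idea: does the spread of the
# residuals (their absolute size) grow with the predicted values?
# Data are made up for illustration.

from math import sqrt

def fit_and_residuals(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    pred = [b0 + b1 * a for a in x]
    resid = [b - p for b, p in zip(y, pred)]
    return pred, resid

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return (sum((a - mu) * (b - mv) for a, b in zip(u, v))
            / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)))

pred, resid = fit_and_residuals([1, 2, 3, 4, 5, 6], [10, 21, 29, 44, 48, 62])
spread_trend = corr(pred, [abs(r) for r in resid])
print(spread_trend)  # values near 0 suggest roughly constant spread
```

In practice the scatterplot of residuals against predicted values (ZRESID vs ZPRED in SPSS) is inspected by eye, as the slide describes.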
Spotting problems with Linearity or Homoscedasticity p174
Normally Distributed Something or Other
The normal distribution is relevant to:
• Parameters
• Confidence intervals around a parameter
• Null hypothesis significance testing
This assumption tends to get incorrectly translated as ‘your data need to be normally distributed’.
Usually it means that the sampling distribution of what’s being tested must be normal.
More on this in tutorials and on page 168 of your textbook.
When does the Assumption of Normality Matter?
In small samples.
• The central limit theorem allows us to forget about this assumption in larger samples.
In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.
Spotting Normality
We don’t have access to the sampling distribution, so we usually test the observed data.
Central Limit Theorem
• If N > 30, the sampling distribution is approximately normal anyway
Graphical displays
• P-P Plot (or Q-Q plot)
• Histogram
Values of Skew/Kurtosis
• 0 in a normal distribution
• Convert to z (by dividing value by SE)
Kolmogorov-Smirnov Test
• Tests if data differ from a normal distribution
• Significant = non-Normal data
• Non-Significant = Normal data
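The checks above (skew/kurtosis values and a Kolmogorov-Smirnov test) can be sketched in Python, assuming NumPy and SciPy are available; the slides themselves use SPSS. The data below are simulated, not from the lecture:

```python
# Sketch of the normality checks on the slide: skewness and kurtosis
# (both near 0 for normal data) and a Kolmogorov-Smirnov test against
# a normal distribution with the sample's mean and SD.
# Assumes NumPy and SciPy are installed; data are simulated.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=200)  # hypothetical scores

print(stats.skew(sample), stats.kurtosis(sample))  # both should be near 0 here

ks_stat, p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))
normal_enough = p > 0.05  # non-significant = consistent with normality
print(p, normal_enough)
```

Note that with large samples these tests flag even trivial departures from normality, which is one reason the slides say outliers matter more than normality in practice.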
Slide 36
Summary of Linear Regression
• Simple regression is a way of predicting one variable from another.
• We do this by fitting a statistical model to the data in the form of a straight line.
• This line is the line that best summarises the pattern of data.
• We have to assess how well the line fits the data using:
• R squared, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared with the predictor variable.
• F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad the model is).
• The b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS table) then the predictor variable significantly predicts the outcome variable.
Linear regression
https://www.youtube.com/watch?v=KsVBBJRb9TE&list=PLvxOuBpazmsND0vmkP1ECjTloiVz-pXla
How to calculate Rsquared https://www.youtube.com/watch?v=w2FKXOa0HGA
Cheesy comic video intro to regression
https://www.youtube.com/watch?v=lZ72O-dXhtM
More videos on regression
http://statslc.com/videos/#playlists
Helpful stats channel
http://www.uk.sagepub.com/field4e/main.htm
http://www.uk.sagepub.com/field4e/study/default.htm