Regression
TRANSCRIPT
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of the
University of New South Wales pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.
Slide 3
Aims
Understand linear regression with one predictor
Understand how we assess the fit of a regression model:
• Total Sum of Squares
• Model Sum of Squares
• Residual Sum of Squares
• F
• R2
Know how to do Regression on IBM SPSS
Interpret a regression model
Summary of Linear Regression
• Simple regression is a way of predicting one variable from another.
• We do this by fitting a statistical model to the data in the form of a straight line.
• This line is the line that best summarises the pattern of data.
• We have to assess how well the line fits the data using:
• R squared, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared with the predictor variable.
• F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad the model is).
• The b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS table) then the predictor variable significantly predicts the outcome variable.
Remember that we have previously talked about fitting models…
We do this by fitting a statistical model to the data in the form of a straight line
Slide 6
What is Regression?
A way of predicting the value of one variable from another.
• It is a hypothetical model of the relationship between two variables.
• The model used is a linear one.
• Therefore, we describe the relationship using the equation of a straight line.
• Remember that a straight line is: y = mx + b
Slide 7
b1
• Regression coefficient for the predictor
• Gradient (slope) of the regression line
• Direction/Strength of Relationship
b0
• Intercept (value of Y when X = 0)
• Point at which the regression line crosses the Y-axis (ordinate)
Describing a Straight Line
y = mx + b, which can also be written y = b + mx
So the model is ‘b + mx’: y = model + error
Y i = b0 + b1 X i + Error i
Regression co-efficients (b)
b0
• Intercept (value of Y when X = 0)
• Point at which the regression line crosses the Y-axis (ordinate)
b1
• Regression coefficient for the predictor
• Gradient (slope) of the regression line
• Direction/Strength of Relationship
Outcome: Album sales
Predictor: $ spent on advertising
Album sales i = b0 + b1 advertising budget i + Error i
Let’s say that the values of b0 & b1 turned out to be 50 & 100 respectively:
Album sales i = 50 + (100 x advertising budget i) + Error i
How much money do you want to spend on advertising per album? Say $5?
Album sales i = 50 + (100 x 5) + Error i
= 550 + Error i
Predicted album sales is 550. This predicted value is not perfect.
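As a minimal sketch, the worked example above can be written as a tiny Python function. The coefficients 50 and 100 are the slide's hypothetical values, not estimates from real data:

```python
# The album-sales model from the slide, with the hypothetical
# coefficients b0 = 50 (intercept) and b1 = 100 (slope).

def predict_sales(advertising_budget, b0=50.0, b1=100.0):
    """Predicted album sales = b0 + b1 * advertising budget."""
    return b0 + b1 * advertising_budget

predicted = predict_sales(5)  # spend $5 on advertising per album
print(predicted)              # 550.0; any observed value differs by the error term
```

The prediction is the systematic part of the model only; the error term is whatever the line cannot explain for a given observation.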
Example of simple regression
Slide 10
The Method of Least Squares
Gives us the error (residual sums of squares)
https://www.youtube.com/watch?v=0T0z8d0_aY4
https://www.youtube.com/watch?v=ocGEhiLwDVc
Video
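The least-squares idea from the videos can be sketched in plain Python using the closed-form solution for one predictor (b1 = covariance of x and y divided by the variance of x). The budget/sales numbers are made up for illustration:

```python
# Ordinary least squares for one predictor, via the closed-form solution:
#   b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b0 = mean(y) - b1 * mean(x)
# These are the values that minimise the residual sum of squares.

def least_squares(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

budget = [1, 2, 3, 4, 5]            # made-up advertising spend
sales = [155, 240, 360, 455, 540]   # made-up album sales
b0, b1 = least_squares(budget, sales)
print(b0, b1)  # intercept and slope of the best-fitting line
```

In SPSS these coefficients appear in the Coefficients table as the B column for the constant and the predictor.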
The regression line is only a model based on the data.
This model might not reflect reality.
We need some way of testing how well the model fits the observed data.
How?
Slide 12
Interpretation: How Good is the Model?
Most basic model: the mean. Total sum of squares (SST): no relationship between the two variables (the same amount of album sales no matter how big the advertising budget).
Residual sum of squares (SSR) describes the error in the model.
How much better is the model than just using the mean? The reduction in inaccuracy (the improvement) is the model sum of squares (SSM): the difference between the mean and the regression line.
Slide 15
Summary
SST
• Total variability (variability between scores and the mean).
SSR
• Residual/Error variability (variability between the regression model and the actual data).
SSM
• Model variability (difference in variability between the model and the mean).
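The SST/SSR/SSM decomposition above can be checked numerically. This sketch uses made-up data; because the line is fitted by least squares, SST = SSM + SSR holds exactly:

```python
# Sums of squares for a simple regression, illustrating SST = SSM + SSR.
# The x/y data are made up for illustration.

def sums_of_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    predicted = [b0 + b1 * a for a in x]
    sst = sum((b - my) ** 2 for b in y)                     # scores vs the mean
    ssr = sum((b - p) ** 2 for b, p in zip(y, predicted))   # scores vs the model
    ssm = sum((p - my) ** 2 for p in predicted)             # model vs the mean
    return sst, ssm, ssr

sst, ssm, ssr = sums_of_squares([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(sst, ssm, ssr)  # SST equals SSM + SSR (up to floating-point rounding)
```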
Slide 16
Testing the Model
If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR
SSR
Error in Model
SSM
Improvement Due to the Model
SST
Total Variance In The Data
Slide 17
Testing the Model: R2
R2
• The proportion of variance accounted for by the regression model.
• The Pearson Correlation Coefficient Squared
R2 = SSM / SST
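For simple regression, R2 = SSM / SST equals the squared Pearson correlation between predictor and outcome, so it can be computed either way. A sketch with made-up data:

```python
# R^2 as the squared Pearson correlation between predictor and outcome,
# which for one predictor equals SSM / SST. Data are made up.

from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(r ** 2)  # proportion of variance in the outcome explained by the model
```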
Slide 18
Testing the Model: ANOVA
Mean Squared Error
• Sums of Squares are total values.
• They can be expressed as averages.
• These are called Mean Squares, MS
• F is a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
• Good model has a large F (>1)
F = MSM / MSR
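The F-ratio above can be sketched numerically: divide each sum of squares by its degrees of freedom (1 for the model with one predictor, n − 2 for the residuals) and take the ratio. Data are made up:

```python
# F = MSM / MSR for a simple regression.
# MSM = SSM / df_model, MSR = SSR / df_residual.
# With one predictor, df_model = 1 and df_residual = n - 2.

def f_ratio(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    pred = [b0 + b1 * a for a in x]
    ssm = sum((p - my) ** 2 for p in pred)
    ssr = sum((b - p) ** 2 for b, p in zip(y, pred))
    msm = ssm / 1          # one predictor -> 1 model degree of freedom
    msr = ssr / (n - 2)    # n - 2 residual degrees of freedom
    return msm / msr

F = f_ratio([1, 2, 3, 4, 5], [155, 240, 360, 455, 540])
print(F)  # a good model has F well above 1
```

In SPSS this is the F in the ANOVA table of the regression output.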
• Gradient of regression line.
• Change in outcome as a result of 1 unit change in predictor.
• Bad model: b = zero. E.g. the mean as a predictor of album sales.
• If a variable, such as advertising, predicts an outcome, it should have a non-zero value.
• The hypothesis that something is ‘non-zero’ can be tested using a t-test.
• The t-statistic tells us how confident we can be that something is non-zero. If we are confident that b is non-zero that means that our variable (adverts) is predicting sales.
• We are usually confident that it’s non-zero if the result (significance) of the t-test is <.05.
The b-value (b1)
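In the lecture this t-test is read off the SPSS Coefficients table. As a sketch only, assuming SciPy is available, `scipy.stats.linregress` reports the same slope and its two-tailed p-value; the data are made up:

```python
# Testing whether b1 is non-zero, as the slide describes.
# Assumes SciPy is installed; the budget/sales data are made up.

from scipy import stats

budget = [1, 2, 3, 4, 5]
sales = [155, 240, 360, 455, 540]

result = stats.linregress(budget, sales)
print(result.slope)   # b1: change in sales per unit change in budget
print(result.pvalue)  # two-tailed p for the t-test that b1 = 0
significant = result.pvalue < 0.05  # the Sig. < 0.05 criterion from the slide
```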
Slide 22
Confidence Intervals
Domjan et al. (1998)
• ‘Conditioned’ sperm release in Japanese Quail.
True Mean
• 15 Million sperm
Sample Mean
• 17 Million sperm
Interval estimate
• 12 to 22 million (contains true value)
• 16 to 18 million (misses true value)
• CIs constructed such that 95% contain the true value.
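The interval construction can be sketched with the usual normal approximation (mean ± 1.96 standard errors). The sample values below are hypothetical, chosen so the sample mean is 17 million as in the example:

```python
# A 95% confidence interval for a mean using the normal approximation
# (mean +/- 1.96 * SE). Sample values (in millions) are made up so
# the sample mean is 17, matching the slide's example.

from math import sqrt
from statistics import mean, stdev

sperm_millions = [12, 15, 19, 21, 14, 18, 17, 20, 16, 18]
m = mean(sperm_millions)
se = stdev(sperm_millions) / sqrt(len(sperm_millions))  # standard error of the mean
lower, upper = m - 1.96 * se, m + 1.96 * se
print(lower, upper)  # any single interval either contains the true mean or it doesn't
```

For small samples like this one, software would use a t critical value rather than 1.96; the normal value is used here only to keep the sketch simple.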
Something other than evidence is affecting your conclusion. A source of bias comes from violating assumptions.
An assumption is a condition that ensures what you’re attempting to do works.
Bias
If Juno had 16 friends this would pull the mean up and incorrectly make KC and SK seem to be more popular. This affects our Sums of Squares and further calculations on the data.
We spot outliers by looking at graphs
Outliers
http://donaldearlcollins.com/2012/12/13/december-doctoral-decisions/graphic-1/
Additivity and Linearity
• The outcome variable is, in reality, linearly related to any predictors.
• If you have several predictors then their combined effect is best described by adding their effects together.
• If this assumption is not met then your model is invalid.
Homoscedasticity/ Homogeneity of Variance p175
When testing several groups of participants, samples should come from populations with the same variance.
In correlational designs, the variance of the outcome variable should be stable at all levels of the predictor variable.
E.g. the spread of hearing loss at each concert in Syd, Melb, BrisVegas is the same.
Residuals vs values of outcomes predicted by model.
Is there a systematic relationship between what comes out of model (predicted values) and errors in the model?
We want NO relationship.
A funnel shape (residuals spreading out) means homogeneity of variance is violated.
Curve in residuals = Non-linear relationship between outcome and predictors. This means linearity is broken.
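The visual check described above can be roughed out numerically: correlate the predicted values with the absolute residuals. A clearly positive correlation suggests the funnelling-out pattern; this is only an informal sketch with made-up data, not a formal test:

```python
# Informal check of the residual-plot idea: does the spread of the
# residuals (their absolute size) grow with the predicted values?
# Data are made up for illustration.

from math import sqrt

def fit_and_residuals(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    pred = [b0 + b1 * a for a in x]
    resid = [b - p for b, p in zip(y, pred)]
    return pred, resid

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return (sum((a - mu) * (b - mv) for a, b in zip(u, v))
            / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)))

pred, resid = fit_and_residuals([1, 2, 3, 4, 5, 6], [10, 21, 29, 44, 48, 62])
spread_trend = corr(pred, [abs(r) for r in resid])
print(spread_trend)  # values near 0 suggest roughly constant spread
```

In practice the scatterplot of residuals against predicted values (ZRESID vs ZPRED in SPSS) is inspected by eye, as the slide describes.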
Spotting problems with Linearity or Homoscedasticity p174
Normally Distributed Something or Other
The normal distribution is relevant to:
• Parameters
• Confidence intervals around a parameter
• Null hypothesis significance testing
This assumption tends to get incorrectly translated as ‘your data need to be normally distributed’.
Usually it means that the sampling distribution of what’s being tested must be normal.
More on this in tutorials and on page 168 of your textbook.
When does the Assumption of Normality Matter?
In small samples.
• The central limit theorem allows us to forget about this assumption in larger samples.
In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.
Spotting Normality
We don’t have access to the sampling distribution, so we usually test the observed data.
Central Limit Theorem
• If N > 30, the sampling distribution is approximately normal anyway
Graphical displays
• P-P Plot (or Q-Q plot)
• Histogram
Values of Skew/Kurtosis
• 0 in a normal distribution
• Convert to z (by dividing value by SE)
Kolmogorov-Smirnov Test
• Tests if data differ from a normal distribution
• Significant = non-Normal data
• Non-Significant = Normal data
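The checks above (skew/kurtosis values and a Kolmogorov-Smirnov test) can be sketched in Python, assuming NumPy and SciPy are available; the slides themselves use SPSS. The data below are simulated, not from the lecture:

```python
# Sketch of the normality checks on the slide: skewness and kurtosis
# (both near 0 for normal data) and a Kolmogorov-Smirnov test against
# a normal distribution with the sample's mean and SD.
# Assumes NumPy and SciPy are installed; data are simulated.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=200)  # hypothetical scores

print(stats.skew(sample), stats.kurtosis(sample))  # both should be near 0 here

ks_stat, p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))
normal_enough = p > 0.05  # non-significant = consistent with normality
print(p, normal_enough)
```

Note that with large samples these tests flag even trivial departures from normality, which is one reason the slides say outliers matter more than normality in practice.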
Slide 36
Summary of Linear Regression
• Simple regression is a way of predicting one variable from another.
• We do this by fitting a statistical model to the data in the form of a straight line.
• This line is the line that best summarises the pattern of data.
• We have to assess how well the line fits the data using:
• R squared, which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared with the predictor variable.
• F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how good the model is compared to how bad the model is).
• The b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < 0.05 in the SPSS table) then the predictor variable significantly predicts the outcome variable.
Linear regression
https://www.youtube.com/watch?v=KsVBBJRb9TE&list=PLvxOuBpazmsND0vmkP1ECjTloiVz-pXla
How to calculate Rsquared https://www.youtube.com/watch?v=w2FKXOa0HGA
Cheesy comic video intro to regression
https://www.youtube.com/watch?v=lZ72O-dXhtM
More videos on regression
http://statslc.com/videos/#playlists
Helpful stats channel
http://www.uk.sagepub.com/field4e/main.htm
http://www.uk.sagepub.com/field4e/study/default.htm