Correlation and Regression - UNESCO Bangkok
Australian Council for Educational Research
Analysing and Understanding Learning Assessment for Evidence-based Policy Making
Correlation and Regression
Bangkok, 14-18 Sept. 2015
Correlation
• The strength of a mutual relation between 2 (or more) things
• You need to know 2 things about each unit of analysis
– student (e.g. maths and reading performance)
– school (e.g. funding level and mean reading performance)
– country (e.g. mean performance in 2010 and in 2013)
• No assumption about the direction of the relationship
• Correlation is simply standardised covariance, i.e., covariance divided by the product of the standard deviations of the variables.
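Formulas
• Variance: $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
• Standard deviation: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$
• Covariance: $\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
• Correlation (Pearson's r): $r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$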
A note on sample vs population estimators
• Sample variances: $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
• Sample covariances: $\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$
• Estimate of variance based on a sample is biased: it underestimates the true variance
• Needs a correction factor of $\frac{N}{N - 1}$ to produce an unbiased estimate
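As a small check of the correction factor, the sketch below uses Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by N - 1. The scores are made-up values.

```python
import statistics

scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
n = len(scores)

biased = statistics.pvariance(scores)   # divides by N
unbiased = statistics.variance(scores)  # divides by N - 1
corrected = biased * n / (n - 1)        # apply the N / (N - 1) factor

print(biased, unbiased, corrected)
```

The corrected value matches the N - 1 estimate, and the uncorrected one is smaller.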
Type of correlation
• The correlation coefficient to use depends on the level of measurement of the variables
Ordinal – ranks, Likert scales, ordered categories
• Spearman correlation (ρ), Kendall's tau (τ)
Interval/Ratio – metric scales, measures of magnitude
• Pearson correlation (r)
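One way to see how the two coefficients relate: Spearman's ρ is Pearson's r computed on the ranks of the data. The sketch below assumes tied values receive the average of their ranks; the Likert responses and scores are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(xs, ys):
    """Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

likert = [1, 2, 2, 3, 5]        # hypothetical ordinal responses
score = [10, 40, 30, 90, 400]   # hypothetical skewed scores

rho = spearman_rho(likert, score)
print(average_ranks(likert), round(rho, 3))
```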
Things to remember
• Independence – are the two values independent of each other?
• Linearity – is the relationship between the two values linear?
• Normality – are the two values distributed normally? (if not, non-parametric correlation should be used)
Correlation values
0 = no relationship
1.0 = perfect positive relationship
-1.0 = perfect negative relationship
0.1 = weak relationship (if significant)
0.3 = moderate relationship (if significant)
0.5 = strong relationship (if significant)
Strong correlation: r = .80
Perfect correlations: r = 1 and r = -1
Moderate correlation: r = .36
No correlation: r = .06
Correlation vs Regression
• Correlation is not directional. The degree of association goes both ways.
• Correlation is not appropriate if the substantive meaning of X being associated with Y is different from Y being associated with X. For example, Height and Weight.
• Not appropriate when one of the variables is being manipulated, or being used to explain the other. Use regression instead.
Practical exercises
• Be careful about spurious correlations. Just because two variables correlate highly does not mean there is a valid relationship between them.
• Correlation is not causation.
• With large enough data, anything can be significantly correlated with something.
Regression
• Also describes a relationship between 2 things (or more), but assumes a direction
• Explain one variable with one (or more) other variable(s)
– How well does SES predict performance?
Regression – cont.
• Two main statistics
– Size of the effect or slope
– Strength of the effect or explained variance
The General Idea
Simple regression considers the relation between a single explanatory variable and response variable
Line of best fit (OLS)
Size of the effect
[Figure: line of best fit with slope = 50 per 1 unit]
Size of the effect – cont.
[Figure: line of best fit with slope = 25 per 1 unit]
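A minimal sketch of how the ordinary least squares line and its slope can be computed for one predictor. The ESCS and reading values are made up and exactly linear, chosen so that the slope comes out at 50 points per unit as in the first figure; `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx               # change in y per 1 unit of x
    intercept = my - slope * mx     # the line passes through the means
    return intercept, slope

# Hypothetical, noise-free data for illustration.
escs = [-1.0, 0.0, 1.0, 2.0]
reading = [450.0, 500.0, 550.0, 600.0]

intercept, slope = fit_line(escs, reading)
print(intercept, slope)
```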
The R2
The proportion of the total sample variance that is not explained by the regression will be:
$\frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$
Therefore, the proportion of the variance in the dependent variable that is explained by the independent variable (R2) will be:
$R^2 = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$
Strength of the effect
For example, if the residual variance is a small proportion of the total variance
R2 = 1 – (162.5/1250)
R2 = 0.87
87% of the variation in reading is explained by ESCS
Strength – cont.
For example, if the residual variance is a large proportion of the total variance
R2 = 1 – (1075/1250)
R2 = 0.14
Only 14% of the variation in reading is explained by ESCS
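The two worked examples can be reproduced with a short Python sketch. The function names are illustrative; the second helper shows how the residual and total sums of squares would be obtained from observed and predicted values.

```python
def r_squared(residual, total):
    """R2 = 1 - residual / total (sums of squares or variances)."""
    return 1 - residual / total

def r_squared_from_fit(observed, predicted):
    """Compute R2 from observed outcomes and model predictions."""
    mean_y = sum(observed) / len(observed)
    rss = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    tss = sum((y - mean_y) ** 2 for y in observed)
    return 1 - rss / tss

strong = r_squared(162.5, 1250)  # small residual share
weak = r_squared(1075, 1250)     # large residual share
print(round(strong, 2), round(weak, 2))
```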
Multiple Regression
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y
The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders
Source: Gerstman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
Regression Modeling
• A simple regression model (one independent variable) fits a regression line in 2-dimensional space
• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
• This concept can be extended indefinitely but visualisation is no longer possible for >3 variables.
Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimizing $\sum \text{residuals}^2$ to derive this multiple regression model:
$\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
Again, the standard error of the regression is based on the $\sum \text{residuals}^2$ of all $x_n$:
$s_{Y|x} = \sqrt{\frac{\sum \text{residuals}^2}{n - k - 1}}$
Multiple Regression Model
• Intercept α predicts where the regression plane crosses the Y axis
• Slope for variable X1 (β1) predicts the change in Y per unit X1 holding X2 constant
• The slope for variable X2 (β2) predicts the change in Y per unit X2 holding X1 constant
Main purpose of regression analysis
• Prediction
– Developing a prediction model based on a set of predictor/independent variables. This purpose also allows for the evaluation of the predictive powers between different models as well as different sets of predictors within a model.
• Explanation
– Validating or confirming an existing prediction model using new data. This purpose also allows for the assessment of the relationship between predictor and outcome variables.
Regression works provided assumptions are met
• Linearity
– Check using partial regression plots (PLOTS Produce all partial plots)
• Uniform variance (homoscedasticity)
– Check by plotting residuals against the predicted value (PLOTS Y:ZRESID, X:ZPRED)
– For ANOVA, check using Levene’s test for homogeneity of variance (EXPLORE PLOTS Spread vs Level)
• Independence of error terms
– Check by plotting residuals against a sequencing variable (PLOTS Produce all partial plots)
• Normality of the residuals
– Check using Normal P-P plots of the residuals (PLOTS Normal probability plot)
Sample size
• Thorough method: a priori power analysis
– Compute sample sizes for given effect sizes, alpha levels, and power values (G*Power 3: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/)
• Fast method (but less thorough): rules of thumb, where k is the number of predictors
– For R2 significance testing: 50 + 8k
– For b-values significance testing: 104 + k
– For both, use the larger number
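The rules of thumb can be written as a small Python helper. This is only a sketch: `rule_of_thumb_n` is an illustrative name, and k is taken to be the number of predictors in the model.

```python
def rule_of_thumb_n(k):
    """Minimum sample size for a regression with k predictors."""
    n_for_r2 = 50 + 8 * k      # testing the overall R2
    n_for_slopes = 104 + k     # testing individual b-values
    return max(n_for_r2, n_for_slopes)  # for both, use the larger number

for k in (2, 7, 8):
    print(k, rule_of_thumb_n(k))
```

With few predictors the 104 + k rule dominates; from k = 8 onwards the 50 + 8k rule gives the larger number.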
Multicollinearity
y = b0 + b1x1
y = b0 + b1x1 + b2x2, but if x2 = x1 + 3
y = b0 + b1x1 + b2(x1 + 3)
y = b0 + b1x1 + b2x1 + 3b2
y = (b0 + 3b2) + (b1 + b2)x1, so b1 and b2 cannot be estimated separately
Checking for multicollinearity
For overall multicollinearity: VIF > 10; Tolerance < 0.10.
For individual variables: identify Condition Index > 15, then check whether the Variance Proportions of each coefficient are > .90.
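For the special case of two predictors, tolerance is 1 minus the squared correlation between them and VIF is its reciprocal. The Python sketch below assumes that two-predictor case (with more predictors, each one would be regressed on all the others); the data and function names are made up.

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - r^2 and VIF = 1 / tolerance for two predictors."""
    tolerance = 1 - pearson_r(x1, x2) ** 2
    if tolerance <= 1e-12:          # perfectly collinear predictors
        return 0.0, float("inf")
    return tolerance, 1 / tolerance

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]   # correlated with x1, but not perfectly
x3 = [x + 3 for x in x1]              # the x2 = x1 + 3 case from the slide

tol, vif = tolerance_and_vif(x1, x2)
tol_collinear, vif_collinear = tolerance_and_vif(x1, x3)
print(round(tol, 3), round(vif, 2), vif_collinear)
```

The first pair stays within the thresholds quoted above, while the x1 + 3 predictor has zero tolerance and an infinite VIF.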
Influential values
• Influential values are outliers that have substantial effect on the regression line.
Source: Field, A. (2005). Discovering statistics using SPSS. (2nd ed). London: Sage.
When does linear regression modelling become inappropriate?
• When the dependent variable is dichotomous or polytomous (use Logistic Regression).
• When data are sequential over time and variables are autocorrelated (use Time Series Analysis).
• When context effects need to be analysed and slopes are different across higher level units (use Multi-level Analysis).
Application: Illustrative Example
Childhood respiratory health survey.
• Binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker
• Response variable Forced Expiratory Volume (FEV) is measured in liters/second (lung capacity)
• Regress FEV on SMOKE; least squares regression line: ŷ = 2.566 + 0.711x
• The mean FEV in nonsmokers is 2.566
• The mean FEV in smokers is 3.277
Example, cont.
• ŷ = 2.566 + 0.711x
• Intercept (2.566) = the mean FEV of group 0
• Slope = the mean difference in FEV (because x is 0,1): 3.277 − 2.566 = 0.711
• tstat = 6.464 with 652 df, p < .01 (b1 is significant)
• The 95% CI for slope is 0.495 to 0.927
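The claim that the intercept is the mean of group 0 and the slope is the difference in group means can be checked with a tiny Python sketch. The seven FEV values are invented for illustration and are not the survey data.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: 0 = non-smoker, 1 = smoker.
smoke = [0, 0, 0, 0, 1, 1, 1]
fev = [2.0, 2.5, 3.0, 2.5, 3.0, 3.5, 3.4]

intercept, slope = fit_line(smoke, fev)

group0 = [y for x, y in zip(smoke, fev) if x == 0]
group1 = [y for x, y in zip(smoke, fev) if x == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(round(intercept, 3), round(slope, 3), round(mean1 - mean0, 3))
```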
Smoking increases lung capacity?
• Children who smoked had higher mean FEV
• How can this be true given what we know about the deleterious respiratory effects of smoking?
• ANS: Smokers were older than the nonsmokers
• AGE confounded the relationship between SMOKE and FEV
• A multiple regression model can be used to adjust for AGE in this situation
Extending the analysis: Multiple regression
The multiple regression model is:
FEV = 0.367 + −.209(SMOKE) + .231(AGE)
SPSS output for our example: Intercept a, Slope b1, Slope b2
Multiple Regression Coefficients, cont.
• The slope coefficient for SMOKE is −.209, suggesting that smokers have .209 less FEV on average compared to non-smokers (after adjusting for age)
• The slope coefficient for AGE is .231, suggesting that each year of age is associated with an increase of .231 FEV units on average (after adjusting for SMOKE)
Coefficients (Dependent Variable: fev)
Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
(Constant) | .367 | .081 | | 4.511 | .000
smoke | -.209 | .081 | -.072 | -2.588 | .010
age | .231 | .008 | .786 | 28.176 | .000
Inference About the Coefficients
Inferential statistics are calculated for each regression coefficient. For example, in testing
H0: β1 = 0 (SMOKE coefficient controlling for AGE)
tstat = −2.588 and P = 0.010
df = n – k – 1 = 654 – 2 – 1 = 651
Inference About the Coefficients
The 95% confidence interval for this slope of SMOKE controlling for AGE is −0.368 to −0.050.
Coefficients (Dependent Variable: fev)
Model 1 | 95% Confidence Interval for B: Lower Bound | Upper Bound
(Constant) | .207 | .527
smoke | -.368 | -.050
age | .215 | .247
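The t ratio and the confidence interval both follow from the coefficient and its standard error. The Python sketch below uses the rounded values from the coefficients table and, as a simplifying assumption, the normal critical value instead of the t distribution with 651 df (the two are very close at this sample size). Function names are illustrative.

```python
from statistics import NormalDist

def t_ratio(b, se):
    """Coefficient divided by its standard error."""
    return b / se

def approx_ci(b, se, level=0.95):
    """Confidence interval b +/- z * SE, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return b - z * se, b + z * se

b_smoke, se_smoke = -0.209, 0.081  # rounded values from the table

t = t_ratio(b_smoke, se_smoke)
lower, upper = approx_ci(b_smoke, se_smoke)
print(round(t, 2), round(lower, 3), round(upper, 3))
```

The t ratio differs slightly from the tabled −2.588 because the inputs here are rounded to three decimals. The interval excludes zero, which agrees with the significant t-test.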
Assessing the significance of the model
• R Square (R2) – represents the proportion of variance in the outcome variable that is accounted for by the predictors in the model. For example, if for our previous model R2 = .23, then 23% of the variance in FEV is accounted for by smoking status and age.
• Adjusted R2 – compensates for the inflation of R2 due to overfitting. Useful for comparing the amount of variance explained across several models.
• Standard error of the estimate – measure of accuracy of the predictions. For example, if the SE of the estimate = 0.35 for our previous model:
FEV = 0.367 + −.209(SMOKE) + .231(AGE)
then the predicted FEV for a non-smoker aged 12 years is
FEV = 3.139 +/- (t x 0.35)
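The prediction can be reproduced with a short Python sketch. The critical value t = 1.96 is an assumed, illustrative choice for a 95% band with a large sample; the slide leaves t unspecified. Function names are hypothetical.

```python
def predict_fev(smoke, age):
    """Prediction equation from the fitted multiple regression model."""
    return 0.367 + -0.209 * smoke + 0.231 * age

def prediction_band(predicted, se_estimate, t_value):
    """Predicted value +/- (t x standard error of the estimate)."""
    margin = t_value * se_estimate
    return predicted - margin, predicted + margin

pred = predict_fev(smoke=0, age=12)          # non-smoker aged 12
low, high = prediction_band(pred, 0.35, 1.96)  # t = 1.96 is an assumption
print(round(pred, 3), round(low, 3), round(high, 3))
```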
Hierarchical models
Suppose
Model 1: FEV = 0.367 + −.209(SMOKE) + .231(AGE), R2 = .23
Model 2: FEV = 0.367 + −.209(SMOKE) + .231(AGE) + .04(GENDER), R2 = .29
What is the amount of unique variance explained by gender above and beyond that explained by smoking status and age?
[Figure: two diagrams of FEV, one with AGE and SMOKE, one with AGE, GENDER and SMOKE]
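The unique variance is the change in R2 between the two models, and it can be tested with the usual F test for an R2 change. The Python sketch below assumes that Model 2 is fitted to the same 654 children as the earlier example; `r2_change_f` is an illustrative name.

```python
def r2_change_f(r2_reduced, r2_full, n, k_full, added):
    """Return (change in R2, F statistic for that change).

    n      : sample size
    k_full : number of predictors in the larger model
    added  : number of predictors added to the smaller model
    """
    delta = r2_full - r2_reduced
    f = (delta / added) / ((1 - r2_full) / (n - k_full - 1))
    return delta, f

# n = 654 is assumed to carry over from the earlier FEV example.
delta, f = r2_change_f(0.23, 0.29, n=654, k_full=3, added=1)
print(round(delta, 2), round(f, 2))
```

Here gender would explain an additional 6% of the variance in FEV beyond smoking status and age, with the F statistic evaluated on 1 and 650 degrees of freedom.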
Hierarchical regression in SPSS
Dummy Variables
More than two levels
For categorical variables with k categories, use k–1 dummy variables
Ex. SMOKE2 has three levels, initially coded
0 = non-smoker
1 = former smoker
2 = current smoker
Use k – 1 = 3 – 1 = 2 dummy variables to code this information like this:
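A minimal Python sketch of k – 1 dummy coding for SMOKE2, assuming non-smoker (0) is the reference category. The helper `dummy_code` and the labels FORMER and CURRENT are illustrative names, not taken from the source.

```python
def dummy_code(level, reference=0, levels=(0, 1, 2)):
    """Return k - 1 indicator values, one per non-reference level."""
    return [1 if level == other else 0 for other in levels if other != reference]

# Each SMOKE2 code maps to two dummies: [FORMER, CURRENT].
for code, label in [(0, "non-smoker"), (1, "former smoker"), (2, "current smoker")]:
    print(label, dummy_code(code))
```

The reference category is the one coded 0 on every dummy, so each slope is then read as a difference from non-smokers.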
Use of standardised coefficients
• Often thought to be ‘easier’ to interpret.
• Standardisation depends on variances of independent variables.
• Unstandardised coefficient can be translated directly.
• Unstandardised coefficients cannot always be compared if different units are used for the variables.
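To illustrate the last two points: a standardised coefficient is the unstandardised slope multiplied by the ratio of the standard deviations, so it does not change when the predictor's unit changes. The Python sketch below uses made-up study-time data expressed in hours and in minutes; all names are hypothetical.

```python
import statistics

def slope(xs, ys):
    """Unstandardised least-squares slope of y on x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def standardised_beta(b, xs, ys):
    """beta = b * SD(x) / SD(y)."""
    return b * statistics.stdev(xs) / statistics.stdev(ys)

hours = [1.0, 2.0, 3.0, 4.0]          # hypothetical study time
minutes = [h * 60 for h in hours]     # same information, different unit
score = [3.0, 5.0, 4.0, 8.0]          # hypothetical outcome

b_hours = slope(hours, score)
b_minutes = slope(minutes, score)
beta_hours = standardised_beta(b_hours, hours, score)
beta_minutes = standardised_beta(b_minutes, minutes, score)

print(b_hours, b_minutes, round(beta_hours, 3), round(beta_minutes, 3))
```

The unstandardised slopes differ by a factor of 60, while the standardised coefficients are identical.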
Finding the best regression model
• The set of predictors must be chosen based on theory
• Avoid the “whatever sticks to the wall” approach.
• The grouping of predictors and the ordering of entry will matter.
• Selecting the “best” final model can sometimes be a judgment call.
How to judge whether a model is good?
• Explained variance proportion as measured by R2
• Size of regression coefficients.
• Significance tests (F-test for model, t-tests for parameters)
• Inclusion of all relevant variables (Theory!)
• Is method appropriate?
The six steps to interpreting results
1. Look at the prediction equation to see an estimate of the relationship.
2. Refer to the standard error of the estimate (in the appropriate model) when making predictions for individuals.
3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.
4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.
5. Use the coefficient of determination (R2) to measure the potential explanatory power of the model.
6. Compare the beta-weights of the explanatory variables in order to rank them in order of explanatory importance.
Notes on interpreting the results
• Prediction is NOT causation.
• In inferring causation, there has to be at least temporal precedence, but temporal precedence alone is still not sufficient.
• Avoid extrapolating the prediction equation beyond the data range.
• Always consider the standard errors and the confidence intervals of the parameter estimates.
• The magnitude of the coefficient of determination (R2), in terms of explanatory power, is a judgment call.
Practice exercises!
Study: Mathematics Beliefs and Achievement of Elementary School Students in Japan and the United States: Results From the Third International Mathematics and Science Study (TIMSS). House, J. D., 2006
• Interpret the parameter estimates
• Interpret the statistical significance of the predictors
• Make substantive interpretation about the findings
Extensions: Regression
Multiple regression considers the relation between a set of explanatory variables and response or outcome variable
[Figure: diagram relating independent predictor (x1) and independent predictor (x2) to outcome (y)]
Moderating effect
Moderated regression
When the independent variable does not affect the outcome directly but rather affects the relationship between the predictor and the outcome.
[Figure: diagram with independent predictor (x1), independent variable (x2) and outcome (y)]
Moderating effect
Simple Moderating effect
When a categorical independent variable affects the relationship between the predictor and the outcome.
[Figure: relationship between X and Y for categories C1, C2 and C3]
Moderating effects
[Figure panels: Categorical moderator; Continuous moderator. y = actual scaled score in the Multidimensional Perfectionism Scale (Hewitt & Flett)]
Types of moderators (Sharma et al., 1981)
 | Related to predictor and/or outcome | Not related to predictor and/or outcome
No interaction with predictor | Independent predictor | Homologizer
Interaction with predictor variable | Quasi-moderator | Pure moderator
Homologizer variables affect the strength (rather than the form) of the relationship between predictor and outcome (Zedeck, 1971)
Testing Moderation
• Moderation effects are also known as interaction effects.
• Interaction terms are product terms of the moderator and the relevant predictor (the variable that the moderator interacts with)
– Y = b0 + b1x1 + b2x2 + b3m
– Interaction term = x1*m = i1
• Choosing the moderator and the relevant predictor must have theoretical support. For example, it is possible that the moderator interacts with x2 instead (i.e., x2*m = i1).
• Testing for the interaction effect necessitates the inclusion of the interaction term/s in the regression equation:
– Y = b0 + b1x1 + b2x2 + b3m + b4i1
– And test H0: b4 = 0
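A sketch of how the interaction term enters the regression, written in plain Python. It builds the product term i1 = x1*m, adds it as a column, and solves the least-squares normal equations. The data are made up and noise-free, generated with b4 = 3, so the fit simply recovers that value. Testing H0: b4 = 0 on real data would additionally need the standard error of b4, which is not computed here. All function names are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical design: every combination of x1, x2 and a binary moderator m.
data = [(x1, x2, m) for x1 in (0, 1, 2) for x2 in (0, 1, 3) for m in (0, 1)]
ys = [1 + 2 * x1 + 0.5 * x2 + 1 * m + 3 * x1 * m for x1, x2, m in data]

# Columns: constant, x1, x2, m, and the interaction term i1 = x1 * m.
rows = [[1, x1, x2, m, x1 * m] for x1, x2, m in data]
b0, b1, b2, b3, b4 = ols(rows, ys)
print([round(v, 3) for v in (b0, b1, b2, b3, b4)])
```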
Mediating effect
Mediated regression
When the independent predictor does not affect the outcome directly but affects it through an intermediary variable (the mediator).
[Figure: diagram with independent predictor (x1), intermediary predictor (x2) and outcome (y)]
Mediation vs Moderation
Mediators explain why or how an independent variable X causes the outcome Y while a moderator variable affects the magnitude and direction of the relationship between X and Y (Saunders, 1956).
These two approaches can be combined for more complex analyses:
• Moderated mediation
• Mediated moderation
Checklists
• Moderation
– Collinearity between predictor and moderator (especially true for quasi-moderators).
– Unequal variances between groups based on the moderator.
– Reliability of measures (measurement errors are magnified when creating the product terms).
• Mediation
– Theoretical assumptions on the mediator
– Rationale for selecting the mediator
– Significance and type (full/partial) of the mediation effect.
– Implied causation (i.e., directional paths).