Correlation and Regression - UNESCO Bangkok

Australian Council for Educational Research

Analysing and Understanding Learning Assessment for Evidence-based Policy Making

Correlation and Regression

Bangkok, 14-18 Sept. 2015

Correlation

• The strength of a mutual relation between 2 (or more) things

• You need to know 2 things about each unit of analysis
  – student (e.g. maths and reading performance)
  – school (e.g. funding level and mean reading performance)
  – country (e.g. mean performance in 2010 and in 2013)

• No assumption about the direction of the relationship

• Correlation is simply standardised covariance, i.e., covariance divided by the product of the standard deviations of the variables.

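Formulas

• Variances: $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

• Standard deviation: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$

• Covariances: $\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$

• Correlation (Pearson’s r): $r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$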
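The formulas above can be traced step by step in a few lines of Python. This is a minimal sketch, not part of the original slides: the function names and the two small score lists are made up for illustration, and the N - 1 denominator follows the formulas on this slide.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Covariance with the N - 1 denominator used on this slide."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def variance(x):
    # The variance is the covariance of a variable with itself.
    return covariance(x, x)

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (sqrt(variance(x)) * sqrt(variance(y)))

maths = [420, 480, 510, 550, 600]    # hypothetical maths scores
reading = [400, 470, 530, 540, 610]  # hypothetical reading scores
print(round(pearson_r(maths, reading), 3))
```

Because the covariance is divided by both standard deviations, the result does not change if either variable is rescaled (for example from points to percentages).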

A note on sample vs population estimators

• Sample variances: $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$

• Sample covariances: $\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$

• Estimate of variance based on a sample is biased: it underestimates the true variance

• Needs a correction factor of $\frac{N}{N - 1}$ to produce an unbiased estimate
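A small numeric check may help to see what the correction factor does. The sketch below is illustrative only: the five scores are invented, and the two function names are not from the slides.

```python
def population_variance(x):
    """Divide by N: biased when x is only a sample."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def corrected_variance(x):
    """Apply the N / (N - 1) correction to the N-denominator formula."""
    n = len(x)
    return population_variance(x) * n / (n - 1)

scores = [450, 500, 520, 480, 550]  # hypothetical sample of five scores
print(population_variance(scores))  # 1160.0
print(corrected_variance(scores))   # 1450.0
```

With N = 5 the correction factor is 5/4, so the corrected estimate is 25% larger. As N grows the factor approaches 1 and the two estimates converge.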

Presenter

Presentation Notes

Conventional formulas are as follows for variance and covariance. But these are for the population. If applied on a sample, they produce a biased estimate.

Type of correlation

• The correlation coefficient to use depends on the level of measurement of the variables

Ordinal – ranks, Likert scales, ordered categories
• Spearman correlation (ρ), Kendall’s tau (τ)

Interval/Ratio – metric scales, measures of magnitude
• Pearson correlation (r)
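One way to see the difference between the two families is that Spearman's coefficient is Pearson's coefficient computed on ranks rather than on the raw values. The sketch below assumes that definition (with tied values sharing their average rank); the helper names and the data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank from 1 to N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

likert = [1, 2, 2, 4, 5]           # hypothetical ordinal ratings
score = [300, 420, 410, 600, 900]  # hypothetical test scores
print(average_ranks(likert))       # [1.0, 2.5, 2.5, 4.0, 5.0]
print(round(spearman_rho(likert, score), 3))
```

Because only the ordering matters, a relationship that is monotone but curved still yields a Spearman coefficient of 1, while Pearson's r falls below 1.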

Things to remember

• Independence – are the two values independent of each other?

• Linearity – is the relationship between the two values linear?

• Normality – are the two values distributed normally? (if not, non-parametric correlation should be used)

Correlation values

0 = no relationship
1.0 = perfect positive relationship
-1.0 = perfect negative relationship
0.1 = weak relationship (if significant)
0.3 = moderate relationship (if significant)
0.5 = strong relationship (if significant)

Presenter

Presentation Notes

If significant – if the correlation is significantly different from zero

Strong correlation

r = .80

Perfect correlations

r = 1 r = -1

Presenter

Presentation Notes

These are not common in surveys.

Moderate correlation

r = .36

Presenter

Presentation Notes

“Cloud” of dots less well defined

No correlation

r = .06

Presenter

Presentation Notes

No correlation between maths and reading – if you find this in your results it might indicate a problem

Correlation vs Regression

• Correlation is not directional. The degree of association goes both ways.

• Correlation is not appropriate if the substantive meaning of X being associated with Y is different from Y being associated with X. For example, Height and Weight.

• Not appropriate when one of the variables is being manipulated, or being used to explain the other. Use regression instead.

Practical exercises

• Be careful about spurious correlations. Just because two variables correlate highly does not mean there is a valid relationship between them.

• Correlation is not causation.

• With large enough data, anything can be significantly correlated with something.

Regression

• Also describes a relationship between 2 things (or more), but assumes a direction

• Explain one variable with one (or more) other variable(s)
  – How well does SES predict performance?

Regression – cont.

• Two main statistics
  – Size of the effect or slope
  – Strength of the effect or explained variance

The General Idea

Simple regression considers the relation between a single explanatory variable and a response variable

Line of best fit (OLS)

Presenter

Presentation Notes

Students have an estimate of ESCS (SES) and a score on a reading test

Size of the effect

1 unit

50 = slope

Presenter

Presentation Notes

The best-fitting line is the line that minimizes the sum of the squared vertical distances between the dots and the line. This line can be steep or flat. To express how steep the line is: when you increase the value on your x-axis by one unit (one SD on ESCS), how much does your reading variable increase? In this case it goes up by 50. That means the slope of this line is 50. Of course this value depends on the unit of your vertical scale. In this case the mean of the reading scale is 500 and the standard deviation is 100, so reading goes up by half an SD.

Size of the effect – cont.

1 unit

25 = slope

Presenter

Presentation Notes

In this example the slope is less steep. If we go up one SD on ESCS, the line only goes up by 25 reading points: a difference of one SD in ESCS is associated with an increase of 25 points on the reading scale (1/4 of an SD). The steeper the line, the larger the effect of ESCS on reading. The unit of measurement of the variable on the x-axis changes the slope of the line. We call the variable on the Y-axis our dependent variable, and the variable on the X-axis our independent variable. An increase in the variable on the Y-axis (in this case reading) is dependent upon an increase in the variable on the X-axis (ESCS). In other words, we are trying to explain reading using ESCS. Note that we cannot infer causation from the types of surveys we use. We could only determine causation by manipulating the presumed cause, using control and experimental groups.
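The slope of the line of best fit, and the way it depends on the unit of the x-axis, can be sketched as follows. The data are invented so that the arithmetic is easy to follow, and `ols_line` is an illustrative helper rather than anything from the slides. Note that the slope of 25 obtained here comes from rescaling the same data, which is a different situation from the flatter scatter plot shown on the slide.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

escs = [-2.0, -1.0, 0.0, 1.0, 2.0]             # hypothetical ESCS, in SD units
reading = [400.0, 450.0, 500.0, 550.0, 600.0]  # hypothetical reading scores

intercept, slope = ols_line(escs, reading)
print(intercept, slope)  # 500.0 50.0

# Expressing ESCS in half-SD units doubles every x value and halves the slope.
escs_half_units = [2 * v for v in escs]
_, slope_half_units = ols_line(escs_half_units, reading)
print(slope_half_units)  # 25.0
```

The relationship between ESCS and reading is identical in both fits. Only the size of "one unit" on the x-axis has changed, which is why slopes should always be reported together with their units.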

The R2

The proportion of the total sample variance that is not explained by the regression will be:

$\frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$

Therefore, the proportion of the variance in the dependent variable that is explained by the independent variable (R2) will be:

$R^2 = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$


Strength of the effect

For example, if the residual variance is a small proportion of the total variance

R2 = 1 – (162.5/1250)

R2 = 0.87

87 % of the variation in reading is explained by ESCS

Presenter

Presentation Notes

The other thing you look at with regression is the strength. If we look at the best-fitting line in this case, and the distance of each of these dots from the line, we see that all of the dots are quite close to the line. This means that the relationship is quite strong. You can look at it another way. If you take a student with an ESCS value of -1, their predicted reading performance (from the line) is 450. However, the real reading performances of students with an ESCS value of -1 range from approximately 430 to 480. Remember this to compare with the next slide. The strength is related to the correlation. In fact, in simple regression the strength is just the correlation squared. In this case we had a very strong correlation (.93). If we square that, the value we get (.87) is the proportion of variation in reading that is explained by ESCS: across all of the students that have reading scores, 87% of the variation in their performance is explained by ESCS. Only 13% of the variation in students’ reading performance is not explained by ESCS. Note that this is not very realistic. We have also computed the CI here: we are 95% sure that the true slope is between 47 and 51.
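The link between explained variance and the squared correlation can be checked numerically. The sketch below computes R2 from the residuals of a fitted line and compares it with the squared Pearson correlation. The data and the function name are hypothetical; the last two lines simply redo the arithmetic shown on the slides.

```python
def r_squared_two_ways(x, y):
    """Return (1 - residual SS / total SS, squared Pearson correlation)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    total_ss = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    from_residuals = 1 - residual_ss / total_ss
    from_correlation = sxy ** 2 / (sxx * total_ss)
    return from_residuals, from_correlation

escs = [-2.0, -1.0, 0.0, 1.0, 2.0]             # hypothetical ESCS values
reading = [410.0, 440.0, 505.0, 560.0, 585.0]  # hypothetical reading scores
print(r_squared_two_ways(escs, reading))

# Arithmetic from the two "strength" slides:
print(round(1 - 162.5 / 1250, 2))  # 0.87
print(round(1 - 1075 / 1250, 2))   # 0.14
```

The two values agree only for simple regression with one predictor. With several predictors, R2 is the squared correlation between the observed and the predicted values instead.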

Strength – cont.

For example, if the residual variance is a large proportion of the total variance

R2 = 1 – (1075/1250)

R2 = 0.14

Only 14% of the variation in reading is explained by ESCS

Presenter

Presentation Notes

Here we have a wider scatter plot (but the slope of the line is the same as on the previous slide). If we again look at the distances between each dot and the best-fitting line, we see that these distances are much larger than on the previous slide. Also, if we take a student with a value of -1 on ESCS, we would estimate the student's reading performance at just over 400. But in reality, students with an ESCS of -1 have a very wide range of reading scores: the best student has a reading performance of 800, the weakest just under 200. What you can already tell from this is that ESCS explains reading a little bit, but not as much as in the previous example. In this case the correlation of this cloud was .37 (moderate correlation) and R-squared is .14. This is much more realistic than the previous example. In other words, 86% of the variation in reading is explained by other things (not ESCS). The other thing you can see here is that, even though the slope is the same as on the previous slide, we’re much less certain about it: the CI is from 34 to 106. The less variance you explain, the larger the standard error on your slope estimate.

Multiple Regression

Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y

The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders

Source: Gerstman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publi

Regression Modeling

• A simple regression model (one independent variable) fits a regression line in 2-dimensional space

• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space

• This concept can be extended indefinitely but visualisation is no longer possible for >3 variables.

Source: Gerstman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

residual

Presenter

Presentation Notes

What SPSS does is fit a regression line (or plane) such that the sum of the squared residuals is minimised

Multiple Regression Model

Again, estimates for the multiple slope coefficients are derived by minimizing ∑residuals² to derive this multiple regression model:

Again, the standard error of the regression is based on the ∑residuals² of all xn:
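The two equations referred to on this slide are not reproduced in this text. As an assumption, they can be written in the standard textbook form for a model with k explanatory variables and n observations, using the α and β notation of the next slide and the n − k − 1 degrees of freedom used later in the FEV example:

```latex
% Multiple regression model with k explanatory variables (standard form, assumed)
\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k

% Standard error of the regression, based on the sum of squared residuals
s_{Y \mid x} = \sqrt{\frac{\sum \text{residuals}^2}{n - k - 1}}
```

The symbol $s_{Y \mid x}$ is one common label for this quantity, which SPSS reports as the "standard error of the estimate".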


Multiple Regression Model

• Intercept α predicts where the regression plane crosses the Y axis

• Slope for variable X1 (β1) predicts the change in Y per unit X1 holding X2 constant

• The slope for variable X2 (β2) predicts the change in Y per unit X2 holding X1 constant


Main purpose of regression analysis

• Prediction
  – Developing a prediction model based on a set of predictor/independent variables. This purpose also allows for the evaluation of the predictive powers between different models as well as different sets of predictors within a model.

• Explanation
  – Validating or confirming an existing prediction model using new data. This purpose also allows for the assessment of the relationship between predictor and outcome variables.

Presenter

Presentation Notes

We will utilise regression for both purposes in the exercise.

Regression works provided assumptions are met

• Linearity
  – Check using partial regression plots (PLOTS Produce all partial plots)

• Uniform variance (homoscedasticity)
  – Check by plotting residuals against the predicted value (PLOTS Y:ZRESID, X:ZPRED)
  – For ANOVA, check using Levene’s test for homogeneity of variance (EXPLORE PLOTS Spread vs Level)

• Independence of error terms
  – Check by plotting residuals against a sequencing variable (PLOTS Produce all partial plots)

• Normality of the residuals
  – Check using Normal P-P plots of the residuals (PLOTS Normal probability plot)

Sample size

• Thorough method: a priori power analysis
  – Compute sample sizes for given effect sizes, alpha levels, and power values (G*Power 3: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/)

• Fast method (but less thorough): rules of thumb
  – For R2 significance testing: 50 + 8k
  – For b-values significance testing: 104 + k
  – For both, use the larger number

Presenter

Presentation Notes

This shows that according to the rule of thumb, a reasonable minimum sample size is 105 (104+1)
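The rules of thumb are easy to tabulate. The sketch below assumes that k stands for the number of predictors in the model, which is consistent with the 104 + 1 example in the note above; the function name is made up for illustration.

```python
def minimum_sample_size(k):
    """Rule-of-thumb minimum N for a regression with k predictors."""
    for_r_squared = 50 + 8 * k  # testing the overall R2
    for_b_values = 104 + k      # testing individual coefficients
    return max(for_r_squared, for_b_values)

for k in (1, 2, 5, 8, 10):
    print(k, minimum_sample_size(k))
```

Up to seven predictors the 104 + k rule gives the larger number; from eight predictors onwards the 50 + 8k rule takes over.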


Multicollinearity

y = b0 + b1x1
y = b0 + b1x1 + b2x2, but if x2 = x1 + 3:
y = b0 + b1x1 + b2(x1 + 3)
y = b0 + b1x1 + b2x1 + 3b2

Checking for multicollinearity

For overall multicollinearity: VIF > 10; Tolerance < 0.10.
For individual variables: Identify Condition Index > 15, then check the Variance Proportions of each coefficient > .90.

Presenter

Presentation Notes

In example above, b1 and b2 cannot be interpreted differentially.
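To make the VIF and Tolerance thresholds concrete, the sketch below computes both for a model with exactly two predictors. It relies on the standard definitions Tolerance = 1 − R2 and VIF = 1 / Tolerance, where R2 comes from regressing one predictor on the other; with only two predictors this R2 is the squared correlation between them. The function name and the data are illustrative assumptions.

```python
from math import inf

def tolerance_and_vif(x1, x2):
    """Tolerance and VIF when the model has exactly two predictors."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    tolerance = 1 - r_squared
    if tolerance < 1e-12:  # perfectly collinear, e.g. x2 = x1 + 3
        return 0.0, inf
    return tolerance, 1 / tolerance

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
print(tolerance_and_vif(x1, [v + 3 for v in x1]))        # the slide's x2 = x1 + 3
print(tolerance_and_vif(x1, [2.0, 1.0, 4.0, 3.0, 5.0]))  # related, but not identical
```

In the first case x2 carries no information beyond x1, so the tolerance is zero and b1 and b2 cannot be separated. In the second case the tolerance is 0.36 (VIF of about 2.8), well inside the thresholds on the slide.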

Influential values

• Influential values are outliers that have substantial effect on the regression line.

Source: Field, A. (2005). Discovering statistics using SPSS. (2nd ed). London: Sage.

When does linear regression modelling become inappropriate?

• When the dependent variable is dichotomous or polytomous (use Logistic Regression).

• When data are sequential over time and variables are ‘auto correlated’ (use Time Series Analysis).

• When context effects need to be analysed and slopes are different across higher level units (use Multi-level Analysis).

Application: Illustrative Example

Childhood respiratory health survey.

• Binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker

• Response variable Forced Expiratory Volume (FEV) is measured in liters/second (lung capacity)

• Regress FEV on SMOKE; least squares regression line: ŷ = 2.566 + 0.711x

• The mean FEV in nonsmokers is 2.566

• The mean FEV in smokers is 3.277


Presenter

Presentation Notes

FEV can be thought of as lung capacity

Example, cont.

• ŷ = 2.566 + 0.711x

• Intercept (2.566) = the mean FEV of group 0

• Slope = the mean difference in FEV (because x is 0,1): 3.277 − 2.566 = 0.711

• tstat = 6.464 with 652 df, p < .01 (b1 is significant)

• The 95% CI for slope is 0.495 to 0.927


Smoking increases lung capacity?

• Children who smoked had higher mean FEV

• How can this be true given what we know about the deleterious respiratory effects of smoking?

• ANS: Smokers were older than the nonsmokers

• AGE confounded the relationship between SMOKE and FEV

• A multiple regression model can be used to adjust for AGE in this situation
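The reversal described above can be reproduced with a tiny invented data set. In the sketch below, FEV is constructed so that smoking lowers it by 0.2 and each year of age raises it by 0.23, and the smokers are older than the non-smokers. None of these numbers come from the survey, and `two_predictor_ols` is an illustrative helper that solves the least-squares equations for exactly two predictors.

```python
def two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 via centred sums of products."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Hypothetical children: the smokers are older than the non-smokers.
smoke = [0, 0, 0, 0, 0, 1, 1, 1, 1]
age = [6, 8, 10, 12, 14, 12, 14, 16, 18]
fev = [0.4 - 0.2 * s + 0.23 * a for s, a in zip(smoke, age)]

# Simple regression on a 0/1 predictor: the slope is the difference in group means.
mean_smokers = sum(f for s, f in zip(smoke, fev) if s == 1) / smoke.count(1)
mean_nonsmokers = sum(f for s, f in zip(smoke, fev) if s == 0) / smoke.count(0)
crude_slope = mean_smokers - mean_nonsmokers

intercept, b_smoke, b_age = two_predictor_ols(smoke, age, fev)
print(round(crude_slope, 3))                   # positive: smokers look "better"
print(round(b_smoke, 3), round(b_age, 3))      # negative once AGE is held constant
```

The unadjusted slope is positive only because the smokers are, on average, five years older. Once AGE enters the model, the SMOKE coefficient returns to the negative value that was built into the data.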


Extending the analysis: Multiple regression

The multiple regression model is: FEV = 0.367 + −.209(SMOKE) + .231(AGE)

SPSS output for our example: Intercept a, Slope b1, Slope b2


Multiple Regression Coefficients, cont.

• The slope coefficient for SMOKE is −.209, suggesting that smokers have .209 less FEV on average compared to non-smokers (after adjusting for age)

• The slope coefficient for AGE is .231, suggesting that each year of age is associated with an increase of .231 FEV units on average (after adjusting for SMOKE)


Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B        Std. Error           Beta                        t         Sig.
(Constant)    .367     .081                                             4.511     .000
smoke         -.209    .081                 -.072                       -2.588    .010
age           .231     .008                 .786                        28.176    .000

a. Dependent Variable: fev

Inference About the Coefficients

Inferential statistics are calculated for each regression coefficient. For example, in testing

H0: β1 = 0 (SMOKE coefficient controlling for AGE)

tstat = −2.588 and P = 0.010

df = n – k – 1 = 654 – 2 – 1 = 651

Source: Gerstman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Inference About the Coefficients

The 95% confidence interval for this slope of SMOKE controlling for AGE is −0.368 to −0.050.

Coefficients (a)

              95% Confidence Interval for B
Model 1       Lower Bound    Upper Bound
(Constant)    .207           .527
smoke         -.368          -.050
age           .215           .247

a. Dependent Variable: fev
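The interval for SMOKE can be approximately reproduced from the B and Std. Error columns of the coefficients table. The sketch below assumes the usual form b ± critical value × SE and, as a simplification, uses the normal critical value instead of the t value with 651 df (with a sample this large the two are almost identical). SPSS works from unrounded values, so small differences in the third decimal are possible for other rows.

```python
from statistics import NormalDist

def confidence_interval(b, std_error, level=0.95):
    """Large-sample CI: b +/- z * SE, using the normal approximation to t."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return b - z * std_error, b + z * std_error

lower, upper = confidence_interval(-0.209, 0.081)  # SMOKE row: B and Std. Error
print(round(lower, 3), round(upper, 3))            # -0.368 -0.05
```

The interval excludes zero, which matches the significant t statistic (P = 0.010) reported for SMOKE.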


Assessing the significance of the model

• R Square (R2) – represents the proportion of variance in the outcome variable that is accounted for by the predictors in the model. For example, if for our previous model R2 = .23, then 23% of the variance in FEV is accounted for by smoking status and age.

• Adjusted R2 – compensates for the inflation of R2 due to overfitting. Useful for comparing the amount of variance explained across several models.

• Standard error of the estimate – measure of accuracy of the predictions. For example, if the SE of the estimate = 0.35 for our previous model:

FEV = 0.367 + −.209(SMOKE) + .231(AGE)

then the predicted FEV for a non-smoker aged 12 years is

FEV = 3.139 +/- (t x 0.35)

Assessing the significance of the model

Hierarchical models

Suppose

Model 1: FEV = 0.367 + −.209(SMOKE) + .231(AGE), R2 = .23

Model 2: FEV = 0.367 + −.209(SMOKE) + .231(AGE) + .04(GENDER), R2 = .29

What is the amount of unique variance explained by gender above and beyond that explained by smoking status and age?
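The question can be answered by subtracting the two R2 values, and the change can be tested with the usual F-change statistic for nested models. The sketch below is illustrative: the R2 values are the supposed ones from this slide, n = 654 is borrowed from the earlier degrees-of-freedom calculation as an assumption, and the function name is made up.

```python
def r_squared_change(r2_reduced, r2_full, n, k_full, added=1):
    """Return (delta R2, F-change) for predictors added to a nested model."""
    delta = r2_full - r2_reduced
    df_residual = n - k_full - 1
    f_change = (delta / added) / ((1 - r2_full) / df_residual)
    return delta, f_change

# Model 1 has 2 predictors, Model 2 has 3 (GENDER added); n = 654 is assumed.
delta, f_change = r_squared_change(0.23, 0.29, n=654, k_full=3)
print(round(delta, 2))     # 0.06
print(round(f_change, 2))  # 54.93, on 1 and 650 df
```

Under these assumptions, gender would uniquely explain 6% of the variance in FEV beyond smoking status and age.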

[Figure: two diagrams, one relating FEV to AGE and SMOKE (Model 1) and one relating FEV to AGE, GENDER and SMOKE (Model 2)]

Hierarchical regression in SPSS

Dummy Variables

More than two levels

For categorical variables with k categories, use k–1 dummy variables

Ex. SMOKE2 has three levels, initially coded
0 = non-smoker
1 = former smoker
2 = current smoker

Use k – 1 = 3 – 1 = 2 dummy variables to code this information like this:
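The coding table from the original slide is not reproduced in this text. One common scheme, sketched below as an assumption, treats non-smokers as the reference category and creates one 0/1 indicator for each remaining level; the function name is hypothetical.

```python
def dummy_code(value, levels, reference):
    """Return k - 1 indicators (0/1), one per non-reference level."""
    return [1 if value == level else 0 for level in levels if level != reference]

levels = [0, 1, 2]  # 0 = non-smoker, 1 = former smoker, 2 = current smoker
for smoke2 in levels:
    print(smoke2, dummy_code(smoke2, levels, reference=0))
# 0 [0, 0]
# 1 [1, 0]
# 2 [0, 1]
```

With this coding, each dummy's regression coefficient is the mean difference between that group and the reference group (non-smokers), holding the other predictors constant.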


Use of standardised coefficients

• Often thought to be ‘easier’ to interpret.

• Standardisation depends on variances of independent variables.

• Unstandardised coefficient can be translated directly.

• Unstandardised coefficients cannot always be compared if different units are used for the variables.
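The dependence on variances can be made explicit with the standard conversion between the two kinds of coefficient. The notation below is an assumption, since the slide gives no formula: $b_j$ is the unstandardised slope, $s_{x_j}$ and $s_y$ are the sample standard deviations of the predictor and the outcome.

```latex
% Standardised (beta) coefficient from the unstandardised slope
\beta_j = b_j \, \frac{s_{x_j}}{s_y}
```

A beta of .786 for age, for example, would mean that one SD more of age is associated with .786 SD more of the outcome. The same unstandardised slope yields a different beta in a sample where age varies more or less.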

Finding the best regression model

• The set of predictors must be chosen based on theory

• Avoid the “whatever sticks to the wall” approach.

• The grouping of predictors and the ordering of entry will matter.

• Selecting the “best” final model can sometimes be a judgment call.

Presenter

Presentation Notes

So “whatever sticks to the wall” is ok as long as nobody knows?

How to judge whether a model is good?

• Explained variance proportion as measured by R2

• Size of regression coefficients.

• Significance tests (F-test for model, t-tests for parameters)

• Inclusion of all relevant variables (Theory!)

• Is method appropriate?

The six steps to interpreting results

1. Look at the prediction equation to see an estimate of the relationship.

2. Refer to the standard error of the estimate (in the appropriate model) when making predictions for individuals.

3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.

4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.

5. Use the coefficient of determination (R2) to measure the potential explanatory power of the model.

6. Compare the beta-weights of the explanatory variables in order to rank them in order of explanatory importance.

Notes on interpreting the results

• Prediction is NOT causation.

• In inferring causation, there has to be at least temporal precedence, but temporal precedence alone is still not sufficient.

• Avoid extrapolating the prediction equation beyond the data range.

• Always consider the standard errors and the confidence intervals of the parameter estimates.

• The magnitude of the coefficient of determination (R2), in terms of explanatory power, is a judgment call.

Practice exercises!

Study: Mathematics Beliefs and Achievement of Elementary School Students in Japan and the United States: Results From the Third International Mathematics and Science Study (TIMSS). House, J. D., 2006

• Interpret the parameter estimates

• Interpret the statistical significance of the predictors

• Make substantive interpretation about the findings

Extensions: Regression

Multiple regression considers the relation between a set of explanatory variables and response or outcome variable

Independent predictor (x1)

Outcome (y)


Moderating effect

Moderated regression

When the independent variable does not affect the outcome directly but rather affects the relationship between the predictor and the outcome.


Outcome (y)

Independent variable (x2)

Moderating effect

Simple moderating effect

When a categorical independent variable affects the relationship between the predictor and the outcome.

[Figure: relationship between X and Y shown separately for categories C1, C2 and C3]

Moderating effects

y = actual scaled score in the Multidimensional Perfectionism Scale (Hewitt & Flett)

Categorical moderator Continuous moderator

Presenter

Presentation Notes

EDI- eating disorder inventory

Types of moderators (Sharma et al., 1981)

                                       Related to predictor      Not related to predictor
                                       and/or outcome            and/or outcome
No interaction with predictor          Independent predictor     Homologizer
Interaction with predictor variable    Quasi-moderator           Pure moderator

Homologizer variables affect the strength (rather than the form) of the relationship between predictor and outcome (Zedeck, 1971)


Testing Moderation

• Moderation effects are also known as interaction effects.

• Interaction terms are product terms of the moderator and the relevant predictor (the variable that the moderator interacts with)
  – Y = b0 + b1x1 + b2x2 + b3m
  – Interaction term = x1*m = i1

• Choosing the moderator and the relevant predictor must have theoretical support. For example, it is possible that the moderator interacts with x2 instead (i.e., x2*m = i1).

• Testing for the interaction effect necessitates the inclusion of the interaction term/s in the regression equation:
  – Y = b0 + b1x1 + b2x2 + b3m + b4i1
  – And test H0: b4 = 0
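What the interaction coefficient measures is easiest to see in a simplified case. The sketch below drops x2 and assumes a 0/1 moderator m, so the model is Y = b0 + b1x + b3m + b4(x*m). For that special case the least-squares fit is the same as fitting a separate line in each group, and b4 is the difference between the two group slopes. The data and helper names are invented, and the sketch only estimates b4: testing H0: b4 = 0 additionally needs its standard error, which a statistics package reports.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def interaction_coefficients(x, m, y):
    """b0, b1, b3, b4 of y = b0 + b1*x + b3*m + b4*(x*m) for a 0/1 moderator m."""
    group0 = [(a, c) for a, g, c in zip(x, m, y) if g == 0]
    group1 = [(a, c) for a, g, c in zip(x, m, y) if g == 1]
    a0, s0 = ols_line(*zip(*group0))
    a1, s1 = ols_line(*zip(*group1))
    return a0, s0, a1 - a0, s1 - s0

x = [1, 2, 3, 4, 1, 2, 3, 4]
m = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical outcome: slope 2 when m = 0, slope 5 when m = 1.
y = [10 + 2 * a if g == 0 else 12 + 5 * a for a, g in zip(x, m)]

b0, b1, b3, b4 = interaction_coefficients(x, m, y)
print(b0, b1, b3, b4)
```

Here b4 comes out as 3: the effect of x on y is three units stronger per unit of x when m = 1 than when m = 0, which is what "the moderator affects the relationship" means in this model.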

Mediating effect

Mediated regression

When the independent predictor does not affect the outcome directly but affects it through an intermediary variable (the mediator).


Outcome (y)

Intermediary predictor (x2)

Mediation vs Moderation

Mediators explain why or how an independent variable X causes the outcome Y while a moderator variable affects the magnitude and direction of the relationship between X and Y (Saunders, 1956).

These two approaches can be combined for more complex analyses:

• Moderated mediation

• Mediated moderation

Checklists

• Moderation
  – Collinearity between predictor and moderator (especially true for quasi-moderators).
  – Unequal variances between groups based on the moderator.
  – Reliability of measures (measurement errors are magnified when creating the product terms).

• Mediation
  – Theoretical assumptions on the mediator
  – Rationale for selecting the mediator
  – Significance and type (full/partial) of the mediation effect.
  – Implied causation (i.e., directional paths).
