# multiple linear regression

Embed Size (px)

TRANSCRIPT

Survey Design II

Lecture 7

Survey Research & Design in Psychology

James Neill, 2017

Creative Commons Attribution 4.0

Image source:http://commons.wikimedia.org/wiki/File:Vidrarias_de_Laboratorio.jpg

Multiple Linear Regression I

7126/6667 Survey Research & Design in PsychologySemester 1, 2017, University of Canberra, ACT, AustraliaJames T. NeillHome page: http://en.wikiversity.org/wiki/Survey_research_and_design_in_psychologyLecture page: http://en.wikiversity.org/wiki/Survey_research_and_design_in_psychology/Lectures/Multiple_linear_regression_I

Image source:http://commons.wikimedia.org/wiki/File:Vidrarias_de_Laboratorio.jpg License: Public domain

Description: Introduces and explains the use of linear regression and multiple linear regression in the context of psychology.This lecture is accompanied by a computer-lab based tutorial, the notes for which are available here: http://ucspace.canberra.edu.au/display/7126/Tutorial+-+Multiple+linear+regression

Overview

Readings

Correlation (Review)

Simple linear regression

Multiple linear regression

Summary

MLR I Quiz - Practice questions

Image source: http://commons.wikimedia.org/wiki/File:Information_icon4.svgLicense: Public domain

Assumed knowledge: A solid understanding of linear correlation.

Howitt & Cramer (2014):

Regression: Prediction with precision

[Ch 9] [Textbook/eReserve]

Multiple regression & multiple correlation

[Ch 32] [Textbook/eReserve]

StatSoft (2016). How to find relationship between variables, multiple regression. StatSoft Electronic Statistics Handbook. [Online]

Tabachnick & Fidell (2013).

Multiple regression

(includes example write-ups) [eReserve]

Readings

Correlation (Review)

Linear relation between two variables

Image source: http://commons.wikimedia.org/wiki/File:Information_icon4.svgLicense: Public domain

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/

Purposes of

correlational statistics

Explanatory - Regression

e.g., cross-sectional study (all data collected at same time)

Predictive - Regression

e.g., longitudinal study (predictors collected prior to outcome
measures)

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/Regression tends not to be used for Exploratory or Descriptive purposes.

Linear correlation

Linear relations between interval or ratio variables

Best fitting straight-line on a scatterplot

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/

Correlation Key points

Covariance = sum of cross-products (unstandardised)

Correlation = sum of cross-products (standardised), ranging from -1 to 1 (sign indicates direction, value indicates size)

Coefficient of determination (r2) indicates % of shared variance

Correlation does not necessarily equal causality

Correlation is shared variance

Venn diagrams are helpful for depicting relations between variables.

.32

.68

.68

r2 =

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/

Simple linear regression

Explains and predicts a Dependent Variable (DV) based on a linear relation with an Independent Variable (IV)

Image source: http://commons.wikimedia.org/wiki/File:Information_icon4.svgLicense: Public domain

Image name:James NeillLicense: Public domain

What is simple linear regression?

An extension of correlation

Best-fitting straight line for a scatterplot between two variables:

predictor (X) also called an independent variable (IV)

outcome (Y) - also called a

dependent variable (DV) or criterion variable

LR uses an IV to explain/predict a DV

Help to understand relationships and possible causal effects of one variable on another.

Least squares criterion

The line of best fit minimises the total sum of squares of the vertical deviations for each case.

a = point at which line of best fit crosses the Y-axis.

b = slope

of the line of best fit

Least squares criterion

residuals

= vertical (Y) distance between line of best fit and each
observation

(unexplained variance)

Image source: UnknownTo describe the regression line, need the slope of the line of best fit and the point at which it touches the Y-axis.A line of best fit can be applied using any method e.g., by eye/hand.Another way is to use the Method of Least Squares a formula which minimizes the sum of the vertical deviationsSee also - http://www.hrma-agrh.gc.ca/hr-rh/psds-dfps/dafps_basic_stat2_e.asp#D

Linear Regression - Example:

Cigarettes & coronary heart disease

IV = Cigarette consumption

DV = Coronary Heart Disease

IV = Cigarette consumption

Landwehr & Watkins (1987, cited in Howell, 2004, pp. 216-218)

Image sources: UnknownExample from Landwehr & Watkins (1987), cited in Howell (2004, pp. 216-218) and accompanying powerpoint lecture notes).

Linear regression - Example:

Cigarettes & coronary heart disease

(Howell, 2004)

Research question:

How fast does CHD mortality rise with a one unit increase in
smoking?IV = Av. # of cigs per adult per day

DV = CHD mortality rate (deaths per 10,000 per year due to CHD)

Unit of analysis = Country

Image source: Unknown

Linear regression - Data:

Cigarettes & coronary heart disease

(Howell, 2004)

Image source: Howell (2004, pp. 216-218)

Linear regression - Example:

Scatterplot with Line of Best Fit

Linear regression equation

(without error)

predicted

values of Y

Y-intercept = level of Y when X is 0

b = slope = rate of predicted / for Y scores for each unit increase in X

Image source: Public domainhttp://commons.wikimedia.org/wiki/File:Scatterplot_WithLRForumulaIndicated.png

Y = bX + a + eX = IV valuesY = DV values

a = Y-axis interceptb = slope of line of best fit(regression
coefficient)e = error

Linear regression equation

(with error)

Y = a + bx + eX = predictor value (IV)Y = predicted value (DV)a
= Y axis intercept

(Y-intercept i.e., when X is 0)b = unstandardised regression
coefficient

(i.e. B in SPSS)

(regression coefficient - slope line of best fit average rate at
which Y changes with one unit change in X)e = error

Linear regression Example:

Equation

Variables: (DV) = predicted rate of CHD mortality

X (IV) = mean # of cigarettes per adult per day per country

Regression co-efficients:b = rate of / of CHD mortality for each extra cigarette smoked per day

a = baseline level of CHD (i.e., CHD when no cigarettes are smoked)

LaTeX: \hat Y=bx+a

Linear regression Example:

Explained variance

r = .71

r2 = .712 = .51

p < .05

Approximately 50% in variability of incidence of CHD mortality is associated with variability in countries' smoking rates.

Linear regression Example:

Test for overall significance

r = .71, r2 = .51, p < .05

R2 needs to be provided firstImage source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/

Linear regression Example:

Regression coefficients - SPSS

a

b

The intercept is labeled constant.Slope is labeled by name of predictor variable.

Linear regression - Example:

Making a prediction

What if we want to predict CHD mortality when cigarette consumption is 6?

We predict that 14.61 / 10,000 people in a country with an average cigarette consumption of 6 per person will die of CHD per annum.

Linear regression - Example:

Accuracy of prediction - Residual

Finnish smokers smoke 6 cigarettes/adult/day

We predict 14.61 deaths /10,000

But Finland actually has 23 deaths / 10,000

Therefore, the error (residual) for this case is 23 - 14.61 = 8.39

The variance of these residuals is indicated by the standard error in the regression coefficients table

Residual

Prediction

Image source: Unknown

Hypothesis testing

Null hypotheses (H0):a (Y-intercept) = 0

Unless the DV is ratio (meaningful 0), we are not usually very
interested in the a value (starting value of Y when X is 0).

b (slope of line of best fit) = 0

(rho - population correlation) = 0

Linear regression Example:

Testing slope and intercept

a

b

a is not significant - baseline CHD may be neglible.

b is significant (+ve) - smoking is +vely associated with CHD

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/Significance tests of the slope and intercept are given as t-tests.The t values in the second from right column are tests on slope and intercept.The associated p values are next to them.The slope is significantly different from zero, but not the intercept.

Linear regression - Example

Does a tendency to

ignore problems (IV)

predict

psychological distress (DV)?

Line of best fit seeks to minimise sum of squared residualsPD is measured in the direction of mental health i.e., high scores mean less distress.Higher IP scores indicate greater frequency of ignoring problems as a way of coping.Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/Note that high scores indicate good mental health, i.e., absence of distress

Ignoring Problems accounts for ~10% of the variation in Psychological DistressLinear regression - Example

R = .32, R2 = .11, Adjusted R2 = .10The predictor (Ignore the Problem) explains approximately 10% of the variance in the dependent variable (Psychological Distress).

R = correlation [multiple correlation in MLR]R2 = % of variance explainedAdjusted R2 = % of variance, reduced estimate, bigger adjustments for small samplesIn this case, Ignoring Problems accounts for ~10% of the variation in Psychological Distress

The population relationship between Ignoring Problems and Psychological Distress is unlikely to be 0% because p = .000(i.e., reject the null hypothesis that there is no relationship)

Linear regression - Example

The MLR ANOVA table provides a significance test of RIt is NOT a normal ANOVA (test of mean differences)tests whether a significant (non-zero) amount of variance is explained? (null hypothesis is zero variance explained)In this case a significant amount of Psychological Distress variance is explained by Ignoring Problems, F(1,218) = 25.78, p < .01

PD = 119 - 9.5*IP

There is a sig. a or constant (Y-intercept) - this is the baseline level of Psychological Distress.In addition, Ignore Problems (IP) is a significant predictor of Psychological Distress (PD).

Linear regression - Example

Multiple regression coefficient tableAnalyses the relationship
of each IV with the DVFor each IV, examine B, Beta, t and sig.B =
unstandardised regression coefficient

[use in prediction equations]Beta (b) = standardised regression
coefficient

[use to compare predictors with one another]t-test & sig. shows
the statistical likelihood of a DV-IV relationship being caused by
chance alone

a = 119 b = -9.5 PD = 119 - 9.5*IP

e = errorImage source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/

Linear regression summary

Linear regression is for explaining or predicting the linear relationship between two variables

Y = bx + a + e

= bx + a

(b is the slope; a is the Y-intercept)

Multiple Linear Regression

Linear relations between two or more IVs and a single DV

Image source: http://commons.wikimedia.org/wiki/File:Information_icon4.svgLicense: Public domain

Image soruce: http://commons.wikimedia.org/wiki/File:Example_path_diagram_%28conceptual%29.pngLicense: Public domain

Use of several IVs to predict a DV

Linear Regression

X Y

Multiple Linear RegressionX1X2 X3YX4X5

What is multiple linear regression (MLR)?

Visual model

Single predictor

Multiple predictors

In MLR there are:multiple predictor X variables (IVs) and

a single predicted Y (DV)

What is MLR?

Use of several IVs to predict a DV

Weights each predictor (IV) according to the strength of its linear relationship with the DV

Makes adjustments for inter-relationships among predictors

Provides a measure of overall fit (R)

Interrelationships between predictors e.g. IVs = height, gender DV = weight

CorrelationRegression

CorrelationPartial correlationMultiple linear regression

YYXX1X2What is MLR?

What is MLR?

A 3-way scatterplot can depict the correlational relationship between 3 variables.

However, it is difficult to graph/visualise 4+-way relationships via scatterplot.

Image source: Figure 11.2 Three-dimensional plot of teaching evaluation data (Howell, 2004, p. 248)

General steps

Develop a diagrammatic model and express a research question and/or hypotheses

Check assumptions

Choose type of MLR

Interpret output

Develop a regression equation (if needed)

~50% of the variance in CHD mortality could be explained by cigarette smoking (using LR)

Strong effect - but what about the other 50% (unexplained variance)?

What about other predictors?e.g., exercise and cholesterol?

LR MLR example:

Cigarettes & coronary heart disease

MLR Example

Research question 1

How well do these three IVs:# of cigarettes / day (IV1)

exercise (IV2) and

cholesterol (IV3)

predict CHD mortality (DV)?

CigarettesExercise CHD MortalityCholesterol

MLR Example

Research question 2

To what extent do personality factors (IVs) predict annual income (DV)?

ExtraversionNeuroticism IncomePsychoticism

MLR - Example

Research question 3

Does the # of years of formal study of psychology (IV1) and the
no. of years of experience as a psychologist (IV2) predict clinical
psychologists effectiveness in treating mental illness
(DV)?

StudyExperience Effectiveness

MLR - Example

Your example

Generate your own MLR research question

(e.g., based on some of the following variables):Gender &
Age

Enrolment Type

Hours

Stress

Time managementPlanning

Procrastination

Effective actions

Time perspectivePast-Negative

Past-Positive

Present-Hedonistic

Present-Fatalistic

Future-Positive

Future-Negative

Assumptions

Levels of measurement

Sample size

Normality (univariate, bivariate, and multivariate)

Linearity: Linear relations between IVs & DVs

Homoscedasticity

MulticollinearityIVs are not overly correlated with one another (e.g., not over .7)

Residuals are normally distributed

Levels of measurement

DV = Continuous

(Interval or Ratio)

IV = Continuous or Dichotomous

(if neither, may need to recode

into a dichotomous variable

or create dummy variables)

IVs = metric (interval or ratio) or dichotomous, e.g. age and gender DV = metric (interval or ratio), e.g., pulseLinear relations exist between IVs & DVs, e.g., check scatterplotsIVs are not overly correlated with one another (e.g., over .7) if so, apply cautiouslyAssumptions for correlation apply e.g., watch out for outliers, non-linear relationships, etc.Homoscedasticity similar/even spread of data from line of best throughout the distributionFor more on assumptions, see http://www.visualstatistics.net/web%20Visual%20Statistics/Visual%20Statistics%20Multimedia/correlation_assumtions.htm

Dummy coding

Dummy coding converts a more complex variable into a series of
dichotomous variables

(i.e., 0 or 1)

Several dummy variables can be created from a variable with a higher level of measurement.

Dummy coding - Example

Religion

(1 = Christian; 2 = Muslim; 3 = Atheist)

in this format, can't be an IV in regression

(a linear correlation with a categorical variable doesn't make
sense)

However, it can be dummy coded into dichotomous variables:Christian (0 = no; 1 = yes)

Muslim (0 = no; 1 = yes)

Atheist (0 = no; 1 = yes) (redundant)

These variables can then be used as IVs.

More information (Dummy variable (statistics), Wikiversity)

Sample size:

Rules of thumb

Enough data is needed to provide reliable estimates of the correlations.

N >= 50 cases and N >= 10 to 20 cases x no. of IVs, otherwise the estimates of the regression line are probably unstable and are unlikely to replicate if the study is repeated.

Green (1991) and Tabachnick & Fidell (2013) suggest:50 + 8(k) for testing an overall regression model and

104 + k when testing individual predictors (where k is the number of IVs)

Based on detecting a medium effect size ( >= .20), with critical 50 cases

N > 20 cases x 4 = 80 cases

N > 50 + 8 x 4 = 82 cases

N > 104 + 4 = 108 cases

IVs = metric (interval or ratio) or dichotomous, e.g. age and gender DV = metric (interval or ratio), e.g., pulseLinear relations exist between IVs & DVs, e.g., check scatterplotsIVs are not overly correlated with one another (e.g., over .7) if so, apply cautiouslyAssumptions for correlation apply e.g., watch out for outliers, non-linear relationships, etc.Homoscedasticity similar/even spread of data from line of best throughout the distributionFor more on assumptions, see http://www.visualstatistics.net/web%20Visual%20Statistics/Visual%20Statistics%20Multimedia/correlation_assumtions.htm

Dealing with outliers

Extreme cases should be deleted or modified if they are overly
influential.Univariate outliers -

detect via initial data screening

(e.g., min. and max.)

Bivariate outliers -

detect via scatterplots

Multivariate outliers -

unusual combination of predictors detect via Mahalanobis'
distance

Multivariate outliers

A case may be within normal range for each variable individually, but be a multivariate outlier based on an unusual combination of responses which unduly influences multivariate test results.

e.g., a person who:Is 18 years old

Has 3 children

Has a post-graduate degree

Multivariate outliers

Identify & check unusual cases

Use Mahalanobis' distance or Cooks D as a MVO screening procedure

Multivariate outliers

Mahalanobis' distance (MD)Distributed as 2 with df equal to the number of predictors (with critical = .001)

Cases with a MD greater than the critical value are multivariate outliers.

Cooks DCases with CD values > 1 are multivariate outliers.

Use either MD or CD

Examine cases with extreme MD or CD scores - if in doubt, remove & re-run.

Normality & homoscedasticity

NormalityIf variables are non-normal, this will create heteroscedasticity

HomoscedasticityVariance around the regression line should be the same throughout the distribution

Even spread in residual plots

Image source: Unknown

Multicollinearity

Multicollinearity IVs shouldn't be overly correlated (e.g., over .7) if so, consider combining them into a single variable or removing one.

Singularity - perfect correlations among IVs.

Leads to unstable regression coefficients.

Multicollinearity

Detect via:Correlation matrix - are there large correlations among IVs?

Tolerance statistics - if < .3 then exclude that variable.

Variance Inflation Factor (VIF) if > 3, then exclude that variable.

VIF is the reciprocal of Tolerance (so use one or the other not both)

Causality

Like correlation, regression does not tell us about the causal relationship between variables.

In many analyses, the IVs and DVs could be swapped around therefore, it is important to:Take a theoretical position

Acknowledge alternative explanations

Regression can establish correlational link, but cannot determine causation.

Multiple correlation coefficient

(R)

Big R (capitalised)

Equivalent of r, but takes into account that there are multiple predictors (IVs)

Always positive, between 0 and 1

Interpretation is similar to that for r (correlation coefficient)

Coefficient of determination (R2)

Big R squared

Squared multiple correlation coefficient

Always include R2

Indicates the % of variance in DV explained by combined effects of the IVs

Analogous to r2

The coefficient of determination is a measure of how well the regression line represents the data. If the regression line passes exactly through every point on the scatter plot, it would be able to explain all of the variation and R2 would be 1. The further the line is away from the points, the less it is able to explain. If the scatterplot is completely random and there is zero relationship between the IVs and the DV, then R2 will be 0.

Rule of thumb for interpretation of R2

.00 = no linear relationship

.10 = small (R ~ .3)

.25 = moderate (R ~ .5)

.50 = strong (R ~ .7)

1.00 = perfect linear relationship

R2 > .30

is good in social sciences

Adjusted R2

R2 is explained variance in a sample.

Adjusted R2 is used for estimating explained variance in a population.

Report R2 and adjusted R2.

Particularly for small N and where results are to be generalised, take more note of adjusted R2.

As number of predictors approaches N, R2 is inflated

Multiple linear regression

Test for overall significance

Shows if there is a significant linear relationship between the X variables taken together and Y

Examine F and p in the ANOVA table to determine the likelihood that the explained variance in Y could have occurred by chance

Regression coefficients

Y-intercept (a)

Slopes (b):Unstandardised

Standardised

Slopes are the weighted loading of each IV on the DV, adjusted for the other IVs in the model.

= r in LR but this is only true in MLR when the IVs are uncorrelated.

Unstandardised

regression coefficients

B = unstandardised regression coefficient

Used for regression equations

Used for predicting Y scores

But cant be compared with other Bs unless all IVs are measured on the same scale

Standardised

regression coefficients

Beta () = standardised regression coefficient

Useful for comparing the relative strength of predictors

= r in LR but this is only true in MLR when the IVs are uncorrelated.

Test for significance:

Independent variables

Indicates the likelihood of a linear relationship between each
IV (Xi) and Y occurring by chance.

Hypotheses:H0: i = 0 (No linear relationship)

H1: i 0 (Linear relationship between Xi and Y)

Relative importance of IVs

Which IVs are the most important?

To answer this, compare the standardised regression coefficients (s)

If IVs are uncorrelated (usually not the case) then you can simply use the correlations between the IVs and the DV to determine the strength of the predictors.

If the IVs are standardised (usually not the case), then the unstandardised regression coefficients (B) can be compared to determine the strength of the predictors.

If the IVs are measured using the same scale (sometimes the case), then the unstandardised regression coefficients (B) can meaningfully be compared.

Y = b1x1 + b2x2 +.....+ bixi + a + eY = observed DV scores

bi = unstandardised regression coefficients (the Bs in SPSS) - slopes

x1 to xi = IV scores

a = Y axis intercept

e = error (residual)

Regression equation

The MLR equation has multiple regression coefficients and a constant (intercept).

Multiple linear regression - Example

Does ignoring problems (IV1)

and worrying (IV2)

predict psychological distress (DV)

Image source: http://www.imaja.com/as/poetry/gj/Worry.html

.32

.52.35YX1X2Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/It is a good idea to get into the habit of drawing Venn diagrams to represent the degree of linear relationship between variables.

Multiple linear regression -

Example

Together, Ignoring Problems and Worrying explain 30% of the variance in Psychological Distress in the Australian adolescent population (R2 = .30, Adjusted R2 = .29).

Multiple linear regression -

Example

The explained variance in the population is unlikely to be 0 (p = .00).

Multiple linear regression -

Example

Worry predicts about three times as much variance in Psychological Distress than Ignoring the Problem, although both are significant, negative predictors of mental health.

Linear RegressionPD (hat) = 119 9.50*IgnoreR2 = .11

Multiple Linear Regression

PD (hat) = 139 - .4.7*Ignore - 11.5*WorryR2 = .30

Multiple linear regression -

Example Prediction equations

Confidence interval for the slope

Mental Health (PD) is reduced by between 8.5 and 14.5 units per increase of Worry units.Mental Health (PD) is reduced by between 1.2 and 8.2 units per increase in Ignore the Problem units.

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/95% CI

Multiple linear regression - Example

Effect of violence, stress, social support on internalising
behaviour problems

Kliewer, Lepore, Oskin, & Johnson, (1998)

Image source: http://cloudking.com/artists/noa-terliuc/family-violence.php

Image source: http://cloudking.com/artists/noa-terliuc/family-violence.phpKliewer, Lepore, Oskin, & Johnson, (1998)

Multiple linear regression Example Violence study

Participants were children: 8 - 12 years

Lived in high-violence areas, USA

Hypotheses: Stress internalising behaviour

Violence internalising behaviour

Social support internalising behaviour

Multiple linear regression Example - Variables

Predictors Degree of witnessing violence

Measure of life stress

Measure of social support

OutcomeInternalising behaviour

(e.g., depression, anxiety, withdrawal symptoms) measured using the
Child Behavior Checklist (CBCL)

Data available at www.duxbury.com/dhowell/StatPages/More_Stuff/Kliewer.dat

Correlations

Correlations

amongst

the IVsCorrelations

between the

IVs and the DV

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/CBCL = Child Behavior ChecklistPredictors are largely independent of each other.Stress and Witnessing Violence are significantly correlated with Internalizing.

R2

13.5% of the variance in children's internalising symptoms can be explained by the 3 predictors.

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/R2 has same interpretation as r2. 13.5% of variability in Internal accounted for by variability in Witness, Stress, and SocSupp.

Regression coefficients

2 predictors have

p < .05

Image source: James Neill, Creative Commons Attribution-Share Alike 2.5 Australia, http://creativecommons.org/licenses/by-sa/2.5/au/t test on two slopes (Violence and Stress) are positive and significant. SocSupp is negative and not significant. However the size of the effect is not much different from the two significant effects.

Regression equation

A separate coefficient or slope for each variable

An intercept (here its called b0)

Image source: Unknown

Interpretation

Slopes for Witness and Stress are +ve;

slope for Social Support is -ve.

Ignoring Stress and Social Support, a one unit increase in Witness would produce .038 unit increase in Internalising symptoms.

Image source: UnknownRe the 2nd point - the same holds true for other predictors.

Predictions

Q: If Witness = 20, Stress = 5, and SocSupp = 35, what we would predict internalising symptoms to be?A: .012

Image source: Unknown

Multiple linear regression - Example

The role of human, social, built, and natural capital in explaining
life satisfaction at the country level:

Towards a National Well-Being Index (NWI)

Vemuri & Costanza (2006)

Image source: http://cloudking.com/artists/noa-terliuc/family-violence.phpKliewer, Lepore, Oskin, & Johnson, (1998)

Variables

IVs:Human & Built Capital

(Human Development Index)

Natural Capital

(Ecosystem services per km2)

Social Capital

(Press Freedom)

DV = Life satisfaction

Units of analysis: Countries

(N = 57; mostly developed countries, e.g., in Europe and
America)

There are moderately strong positive and statistically significant linear relations between the IVs and the DV

The IVs have small to moderate positive

inter-correlations.

Image source::Vemuri & Constanza (2006).

R2 = .35

Two sig. IVs (not Social Capital - dropped)

Image source::Vemuri & Constanza (2006).

6 outliers

Image source::Vemuri & Constanza (2006).

Nigeria, India, Bangladesh, Ghana, China and Philippines were treated as outliers and excluded from the analysis.

R2 = .72

(after dropping 6 outliers)

Image source::Vemuri & Constanza (2006).

Types of MLR

Standard or direct (simultaneous)

Hierarchical or sequential

Stepwise (forward & backward)

Image source: https://commons.wikimedia.org/wiki/File:IStumbler.png

Image source: https://commons.wikimedia.org/wiki/File:IStumbler.pngImage author: http://www.istumbler.net/

All predictor variables are entered together (simultaneously)

Allows assessment of the relationship between all predictor variables and the outcome (Y) variable if there is good theoretical reason for doing so.

Manual technique & commonly used.

If you're not sure what type of MLR to use, start with this approach.

Direct or Standard

IVs are entered in blocks or stages.Researcher defines order of entry for the variables, based on theory.

May enter nuisance variables first to control for them, then test purer effect of next block of important variables.

R2 change - additional variance in Y explained at each stage of the regression. F test of R2 change.

Hierarchical (Sequential)

Manual technique & commonly used.e.g., some treatment variables may be less expensive and these could be entered first to find out whether or not there is additional justification for the more expensive treatments

Example Drug A is a cheap, well-proven drug which reduces AIDS symptoms

Drug B is an expensive, experimental drug which could help to cure AIDS

Hierarchical linear regression:Step 1: Drug A (IV1)

Step 2: Drug B (IV2)

DV = AIDS symptoms

Research question: To what extent does Drug B reduce AIDS symptoms above and beyond the effect of Drug A?

Examine the change in R2 between Step 1 & Step 2

Hierarchical (Sequential)

Manual technique & commonly used.e.g., some treatment variables may be less expensive and these could be entered first to find out whether or not there is additional justification for the more expensive treatments

Computer-driven controversial.

Starts with 0 predictors, then the strongest predictor is entered into the model, then the next strongest etc. if they reach a criteria (e.g., p < .05)

Forward selection

Computer-driven controversial.

All predictor variables are entered, then the weakest predictors are removed, one by one, if they meet a criteria (e.g., p > .05)

Backward elimination

Computer-driven controversial.

Combines forward & backward.

At each step, variables may be entered or removed if they meet certain criteria.

Useful for developing the best prediction equation from a large number of variables.

Redundant predictors are removed.

Stepwise

Which method?

Standard: To assess impact of all IVs simultaneously

Hierarchical: To test IVs in a specific order (based on hypotheses derived from theory)

Stepwise: If the goal is accurate statistical prediction from a large # of variables - computer driven

Summary

Summary: General steps

Develop model and hypotheses

Check assumptions

Choose type

Interpret output

Develop a regression equation

(if needed)

Summary: Linear regression

Best-fitting straight line for a scatterplot of two variables

Y = bX + a + ePredictor (X; IV)

Outcome (Y; DV)

Least squares criterion

Residuals are the vertical distance between actual and predicted values

Summary:

MLR assumptions

Level of measurement

Sample size

Normality

Linearity

Homoscedasticity

Collinearity

Multivariate outliers

Residuals should be normally distributed

These residual slides are based on Francis (2007) MLR (Section 5.1.4) Practical Issues & Assumptions, pp. 126-127 and Allen and Bennett (2008)

Note that according to Francis, residual analysis can test:Additivity (i.e., no interactions b/w Ivs) (but this has been left out for the sake of simplicity)

Summary:

Level of measurement and dummy coding

Levels of measurementDV = Interval or ratio

IV = Interval or ratio or dichotomous

Dummy codingConvert complex variables into series of dichotomous IVs

Summary:

MLR types

Standard

Hierarchical

Stepwise / Forward / Backward

Summary:

MLR output

Overall fit

R, R2, Adjusted R2

F, p

CoefficientsRelation between each IV and the DV, adjusted for the other IVs

B, , t, p, and rp

Regression equation (if useful)

Y = b1x1 + b2x2 +.....+ bixi + a + e

Practice quiz

MLR I Quiz

Practice question 1

A linear regression analysis produces the equation Y = 0.4X + 3. This indicates that: When Y = 0.4, X = 3

When Y = 0, X = 3

When X = 3, Y = 0.4

When X = 0, Y = 3

None of the above

Answer: When X = 0, Y = 3

MLR I Quiz

Practice question 1

Multiple linear regression is a ________ type of statistical analysis. univariate

bivariate

multivariate

Answer: Multivariate

MLR I Quiz

Practice question 3

Multiple linear regression is a ________ type of statistical analysis. univariate

bivariate

multivariate

Answer: Multivariate

MLR I Quiz

Practice question 4

The following types of data can be used in MLR (choose all that apply): Interval or higher DV

Interval or higher IVs

Dichotomous Ivs

All of the above

None of the above

Answer: (d)

MLR I Quiz

Practice question 5

In MLR, the square of the multiple correlation coefficient, R2, is called the: Coefficient of determination

Variance

Covariance

Cross-product

Big R

Answer: Coefficient of determination

MLR I Quiz

Practice question 6

In MLR, a residual is the difference between the predicted Y and actual Y values. True

False

Answer: True

Next lecture

Review of MLR I

Semi-partial correlations

Residual analysis

Interactions

Analysis of change

References

Howell, D. C. (2004). Chapter 9: Regression. In D. C. Howell.. Fundamental statistics for the behavioral sciences (5th ed.) (pp. 203-235). Belmont, CA: Wadsworth.Howitt, D. & Cramer, D. (2011). Introduction to statistics in psychology (5th ed.). Harlow, UK: Pearson.Kliewer, W., Lepore, S.J., Oskin, D., & Johnson, P.D. (1998). The role of social and cognitive processes in childrens adjustment to community violence. Journal of Consulting and Clinical Psychology, 66, 199-209.Landwehr, J.M. & Watkins, A.E. (1987) Exploring data: Teachers edition. Palo Alto, CA: Dale Seymour Publications.Tabachnick, B. G., & Fidell, L. S. (2013) (6th ed. - International ed.). Multiple regression [includes example write-ups]. In Using multivariate statistics (pp. 117-170). Boston, MA: Allyn and Bacon.Vemuri, A. W., & Constanza, R. (2006). The role of human, social, built, and natural capital in explaining life satisfaction at the country level: Toward a National Well-Being Index (NWI). Ecological Economics, 58(1), 119-133.

Open Office Impress

This presentation was made using Open Office Impress.

Free and open source software.

http://www.openoffice.org/product/impress.html

Click to edit the title text format

Click to edit the outline text formatSecond Outline LevelThird Outline LevelFourth Outline LevelFifth Outline LevelSixth Outline LevelSeventh Outline LevelEighth Outline LevelNinth Outline Level

Click to edit the title text format

Click to edit the outline text formatSecond Outline LevelThird Outline LevelFourth Outline LevelFifth Outline LevelSixth Outline LevelSeventh Outline LevelEighth Outline LevelNinth Outline Level

Click to edit the title text format

Click to edit the outline text formatSecond Outline LevelThird Outline LevelFourth Outline LevelFifth Outline LevelSixth Outline LevelSeventh Outline LevelEighth Outline LevelNinth Outline Level

Click to edit the title text format

Click to edit the title text format