Download - Multiple Regression Review Sociology 229A Copyright © 2008 by Evan Schofer Do not copy or distribute without permission

Multiple Regression Review

Sociology 229A

Copyright © 2008 by Evan Schofer

Do not copy or distribute without permission

Multiple Regression

• Question: What if a dependent variable is affected by more than one independent variable?

• Strategy #1: Do separate bivariate regressions– One regression for each independent variable

• This yields separate slope estimates for each independent variable– Bivariate slope estimates implicitly assume that

neither independent variable mediates the other– In reality, there might be no effect of family wealth

over and above education

Multiple Regression

Coefficientsa

35.608 1.290 27.611 .000

2.075 .446 .122 4.652 .000

(Constant)

RS FAMILY INCOMEWHEN 16 YRS OLD

Model1

B Std. Error

UnstandardizedCoefficients

Beta

Standardized

Coefficients

t Sig.

Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1970)a.

Coefficientsa

9.417 1.421 6.625 .000

2.488 .108 .520 23.056 .000

(Constant)

HIGHEST YEAR OFSCHOOL COMPLETED

Model1

B Std. Error


Beta

Standardized

Coefficients

t Sig.


• Job Prestige: Two separate regression models

Both variables have positive, significant slopes

Multiple Regression

• Idea #2: Use Multiple Regression

• Multiple regression can examine “partial” relationships– Partial = Relationships after the effects of other

variables have been “controlled” (taken into account)

• This lets you determine the effects of variables “over and above” other variables– And shows the relative impact of different factors on

a dependent variable

• And, you can use several independent variables to improve your predictions of the dependent var

Coefficientsa

8.977 1.629 5.512 .000

2.487 .111 .520 22.403 .000

.178 .394 .011 .453 .651

(Constant)



Model1

B Std. Error


Beta

Standardized

Coefficients

t Sig.


Multiple Regression

• Job Prestige: 2 variable multiple regression

Education slope is basically unchanged

Family Income slope decreases compared to bivariate analysis

(bivariate: b = 2.07) And, outcome of hypothesis

test changes – t < 1.96

Multiple Regression• Ex: Job Prestige: 2 variable multiple regression• 1. Education has a large slope effect controlling

for (i.e. “over and above”) family income• 2. Family income does not have much effect

controlling for education• Despite a strong bivariate relationship

• Possible interpretations: • Family income may lead to education, but education is the

critical predictor of job prestige• Or, family income is wholly unrelated to job prestige… but

is coincidentally correlated with a variable that is (education), which generated a spurious “effect”.

The Multiple Regression Model

• A two-independent variable regression model:

iiii eXbXbaY 2211

• Note: There are now two X variables

• And a slope (b) is estimated for each one

• The full multiple regression model is:

ikikiii eXbXbXbaY 2211

• For k independent variables

Multiple Regression: Slopes

• Regression slope for the two variable case:

21

21

2121

11 XX

XXYXYX

X

Y

r

rrr

s

sb

• b1 = slope for X1 – controlling for the other independent variable X2

• b2 is computed symmetrically. Swap X1s, X2s

• Compare to bivariate slope: YXX

YYX r

s

sb

Multiple Regression Slopes

• Let’s look more closely at the formulas:

21

21

2121

11 XX

XXYXYX

X

Y

r

rrr

s

sb

• What happens to b1 if X1 and X2 are totally uncorrelated?

• Answer: The formula reduces to the bivariate

• What if X1 and X2 are correlated with each other AND X2 is more correlated with Y than X1?

• Answer: b1 gets smaller (compared to bivariate)

YXX

YYX r

s

sbversus

Regression Slopes

• So, if two variables (X1, X2) are correlated and both predict Y:

• The X variable that is more correlated with Y will have a higher slope in multivariate regression– The slope of the less-correlated variable will shrink

• Thus, slopes for each variable are adjusted to how well the other variable predicts Y– It is the slope “controlling” for other variables.

Multiple Regression Slopes

• One last thing to keep in mind…

21

21

2121

11 XX

XXYXYX

X

Y

r

rrr

s

sb

• What happens to b1 if X1 and X2 are almost perfectly correlated?

• Answer: The denominator approaches Zero

• The slope “blows up”, approaching infinity

• Highly correlated independent variables can cause trouble for regression models… watch out

YXX

YYX r

s

sbversus

Interpreting Results

• (Over)Simplified rules for interpretation– Assumes good sample, measures, models, etc.

• Multivariate regression with two variables: A, B

• If slopes of A, B are the same as bivariate, then each has an independent effect

• If A remains large, B shrinks to zero we typically conclude that effect of B was spurious, or operates through A

• If both A and B shrink a little, each has an effect, but some overlap or mediation is occurring

Interpreting Multivariate Results

• Things to watch out for:

• 1. Remember: Correlation is not causation– Ability to “control” for many variables can help detect

spurious relationships… but it isn’t perfect.– Be aware that other (omitted) variables may be

affecting your model. Don’t over-interpret results.

• 2. Reverse causality– Many sociological processes involve bi-directional

causality. Regression slopes (and correlations) do not identify which variable “causes” the other.

• Ex: self-esteem and test scores.

Standardized Regression Coefficients

• Regression slopes reflect the units of the independent variables

• Question: How do you compare how “strong” the effects of two variables if they have totally different units?

• Example: Education, family wealth, job prestige– Education measured in years, b = 2.5– Family wealth measured on 1-5 scale, b = .18– Which is a “bigger” effect? Units aren’t comparable!

• Answer: Create “standardized” coefficients


• Standardized Coefficients– Also called “Betas” or Beta Weights”– Symbol: Greek b with asterisk: * – Equivalent to Z-scoring (standardizing) all

independent variables before doing the regression

• Formula of coeficient for Xj:j

Y

X

j bs

sj

*

• Result: The unit is standard deviations

• Betas: Indicates the effect a 1 standard deviation change in Xj on Y


• Ex: Education, family income, and job prestige:Coefficientsa

8.977 1.629 5.512 .000

2.487 .111 .520 22.403 .000

.178 .394 .011 .453 .651

(Constant)



Model1

B Std. Error


Beta

Standardized

Coefficients

t Sig.


An increase of 1 standard deviation in Education results

in a .52 standard deviation increase in job prestige Betas give you a sense of

which variables “matter most”

What is the interpretation of the “family income” beta?

R-Square in Multiple Regression• Multivariate R-square is much like bivariate:

TOTAL

REGRESSION

SS

SSR 2

• But, SSregression is based on the multivariate regression

• The addition of new variables results in better prediction of Y, less error (e), higher R-square.

Model Summary

.522a .272 .271 12.41Model1

R R SquareAdjustedR Square

Std. Error ofthe Estimate

Predictors: (Constant), INCOM16, EDUCa.

R-Square in Multiple Regression• Example:

• R-square of .272 indicates that education, parents wealth explain 27% of variance in job prestige

• “Adjusted R-square” is a more conservative, more accurate measure in multiple regression– Generally, you should report Adjusted R-square.

Dummy Variables

• Question: How can we incorporate nominal variables (e.g., race, gender) into regression?

• Option 1: Analyze each sub-group separately– Generates different slope, constant for each group

• Option 2: Dummy variables– “Dummy” = a dichotomous variables coded to

indicate the presence or absence of something– Absence coded as zero, presence coded as 1.

Dummy Variables

• Strategy: Create a separate dummy variable for all nominal categories

• Ex: Gender – make female & male variables– DFEMALE: coded as 1 for all women, zero for men– DMALE: coded as 1 for all men

• Next: Include all but one dummy variables into a multiple regression model

• If two dummies, include 1; If 5 dummies, include 4.

Dummy Variables

• Question: Why can’t you include DFEMALE and DMALE in the same regression model?

• Answer: They are perfectly correlated (negatively): r = -1– Result: Regression model “blows up”

• For any set of nominal categories, a full set of dummies contains redundant information– DMALE and DFEMALE contain same information– Dropping one removes redundant information.

Dummy Variables: Interpretation

• Consider the following regression equation:

iiii eDFEMALEbINCOMEbaY 21

• Question: What if the case is a male?

• Answer: DFEMALE is 0, so the entire term becomes zero.– Result: Males are modeled using the familiar

regression model: a + b1X + e.



iiii eDFEMALEbINCOMEbaY 21

• Question: What if the case is a female?

• Answer: DFEMALE is 1, so b2(1) stays in the equation (and is added to the constant)– Result: Females are modeled using a different

regression line: (a+b2) + b1X + e

– Thus, the coefficient of b2 reflects difference in the constant for women.


• Remember, a different constant generates a different line, either higher or lower– Variable: DFEMALE (women = 1, men = 0)– A positive coefficient (b) indicates that women are

consistently higher compared to men (on dep. var.)– A negative coefficient indicated women are lower

• Example: If DFEMALE coeff = 1.2:– “Women are on average 1.2 points higher than men”.

Dummy Variables: Interpretation• Visually: Women = blue, Men = red

INCOME

100000800006000040000200000

HA

PP

Y

10

9

8

7

6

5

4

3

2

1

0

Overall slope for all data points

Note: Line for men, women have same slope… but one is

high other is lower. The constant differs!

If women=1, men=0: The constant (a) reflects

men only. Dummy coefficient (b) reflects

increase for women (relative to men)

Dummy Variables

• What if you want to compare more than 2 groups?

• Example: Race– Coded 1=white, 2=black, 3=other (like GSS)

• Make 3 dummy variables:– “DWHITE” is 1 for whites, 0 for everyone else– “DBLACK” is 1 for Af. Am., 0 for everyone else– “DOTHER” is 1 for “others”, 0 for everyone else

• Then, include two of the three variables in the multiple regression model.

Coefficientsa

9.666 1.672 5.780 .000

2.476 .111 .517 22.271 .000

6.282E-02 .397 .004 .158 .874

-2.666 1.117 -.055 -2.388 .017

1.114 1.777 .014 .627 .531

(Constant)

EDUC

INCOM16

DBLACK

DOTHER

Model1

B Std. Error


Beta

Standardized

Coefficients

t Sig.

Dependent Variable: PRESTIGEa.


• Ex: Job Prestige

• Negative coefficient for DBLACK indicates a lower level of job prestige compared to whites– T- and P-values indicate if difference is significant.


• Comments:

• 1. Dummy coefficients shouldn’t be called slopes– Referring to the “slope” of gender doesn’t make sense– Rather, it is the difference in the constant (or “level”)

• 2. The contrast is always with the nominal category that was left out of the equation– If DFEMALE is included, the contrast is with males– If DBLACK, DOTHER are included, coefficients

reflect difference in constant compared to whites.

Interaction Terms

• Question: What if you suspect that a variable has a totally different slope for two different sub-groups in your data?

• Example: Income and Happiness– Perhaps men are more materialistic -- an extra dollar

increases their happiness a lot– If women are less materialistic, each dollar has a

smaller effect on income (compared to men)

• Issue isn’t men = “more” or “less” than women– Rather, the slope of a variable (income) differs across

groups

Interaction Terms

• Issue isn’t men = “more” or “less” than women– Rather, the slope of a variable coefficient (for income)

differs across groups

• Again, we want to specify a different regression line for each group– We want lines with different slopes, not parallel lines

that are higher or lower.

Interaction Terms• Visually: Women = blue, Men = red

INCOME

100000800006000040000200000

HA

PP

Y

10

9

8

7

6

5

4

3

2

1

0

Overall slope for all data points

Note: Here, the slope for men and women

differs.

The effect of income on happiness (X1 on Y)

varies with gender (X2). This is called an

“interaction effect”

Interaction Terms

• Examples of interaction:– Effect of education on income may interact with type

of school attended (public vs. private)• Private schooling has bigger effect on income

– Effect of aspirations on educational attainment interacts with poverty

• Aspirations matter less if you don’t have money to pay for college

• Question: Can you think of examples of two variables that might interact?

• Either from your final project? Or anything else?

Interaction Terms

• Interaction effects: Differences in the relationship (slope) between two variables for each category of a third variable

• Option #1: Analyze each group separately• Look for different sized slope in each group

• Option #2: Multiply the two variables of interest: (DFEMALE, INCOME) to create a new variable– Called: DFEMALE*INCOME– Add that variable to the multiple regression model.

Interaction Terms


iiii eINCDFEMbINCOMEbaY *21

• Question: What if the case is male?

• Answer: DFEMALE is 0, so b2(DFEM*INC) drops out of the equation– Result: Males are modeled using the ordinary

regression equation: a + b1X + e.

Interaction Terms


iiii eINCDFEMbINCOMEbaY *21

• Question: What if the case is female?

• Answer: DFEMALE is 1, so b2(DFEM*INC) becomes b2*INCOME, which is added to b1

– Result: Females are modeled using a different regression line: a + (b1+b2) X + e

– Thus, the coefficient of b2 reflects difference in the slope of INCOME for women.

Interpreting Interaction Terms

• Interpreting interaction terms:

• A positive b for DFEMALE*INCOME indicates the slope for income is higher for women vs. men– A negative effect indicates the slope is lower– Size of coefficient indicates actual difference in slope

• Example: DFEMALE*INCOME. Observed b’s:– Income: b = .5– DFEMALE * INCOME: b = -.2

• Interpretation: Slope is .5 for men, .3 for women.

Coefficientsa

8.855 1.744 5.076 .000

2.541 .118 .531 21.563 .000

6.636E-02 .396 .004 .167 .867

4.293 4.193 .088 1.024 .306

-.576 .332 -.149 -1.735 .083

(Constant)

EDUC

INCOM16

DBLACK

BL_EDUC

Model1

B Std. Error


Beta

Standardized

Coefficients

t Sig.

Dependent Variable: PRESTIGEa.

Interpreting Interaction Terms• Example: Interaction of Race and Education

affecting Job Prestige:

DBLACK*EDUC has a negative effect (nearly significant). Coefficient of -.576 indicates that the slope of education and job

prestige is .576 points lower for Blacks than for non-blacks.

Continuous Interaction Terms

• Two continuous variables can also interact

• Example: Effect of education and income on happiness– Perhaps highly educated people are less materialistic– As education increases, the slope between between

income and happiness would decrease

• Simply multiply Education and Income to create the interaction term “EDUCATION*INCOME”

• And add it to the model.


• How do you interpret continuous variable interactions?

• Example: EDUCATION*INCOME: Coefficient = 2.0

• Answer: For each unit change in education, the slope of income vs. happiness increases by 2– Note: coefficient is symmetrical: For each unit

change in income, education slope increases by 2

• Dummy interactions effectively estimate 2 slopes: one for each group

• Continuous interactions result in many slopes: Each value of education*income yields a different slope.


• Interaction terms alters the interpretation of “main effect” coefficients

• Including “EDUC*INCOME changes the interpretation of EDUC and of INCOME

• See Allison p. 166-9

– Specifically, coefficient for EDUC represents slope of EDUC when INCOME = 0

• Likewise, INCOME shows slope when EDUC=0

– Thus, main effects are like “baseline” slopes• And, the interaction effect coefficient shows how the slope

grows (or shrinks) for a given unit change.

Dummy Interactions

• It is also possible to construct interaction terms based on two dummy variables– Instead of a “slope” interaction, dummy interactions

show difference in constants• Constant (not slope) differs across values of a third variable

– Example: Effect of of race on school success varies by gender

• African Americans do less well in school; but the difference is much larger for black males.

Dummy Interactions

• Strategy for dummy interaction is the same: Multiply both variables– Example: Multiply DBLACK, DMALE to create

DBLACK*DMALE• Then, include all 3 variables in the model

– Effect of DBLACK*DMALE reflects difference in constant (level) for black males, compared to white males and black females

• You would observe a negative coefficient, indicating that black males fare worse in schools than black females or white males.

Interaction Terms: Remarks

• 1. If you make an interaction you should also include the component variables in the model:– A model with “DFEMALE * INCOME” should also

include DFEMALE and INCOME• There are rare exceptions. But when in doubt, include them

• 2. Sometimes interaction terms are highly correlated with its components

• That can cause problems (multicollinearity – which we’ll discuss more soon)

Interaction Terms: Remarks

• 3. Make sure you have enough cases in each group for your interaction terms– Interaction terms involve estimating slopes for sub-

groups (e.g., black females vs black males). • If you there are hardly any black females in the dataset, you

can have problems

• 4. “Three-way” interactions are also possible!• An interaction effect that varies across categories of yet

another variable– Ex: DMale*DBlack interaction may vary across class

• They are mainly used in experimental research settings with large sample sizes… but they are possible.

Download - Multiple Regression Review Sociology 229A Copyright © 2008 by Evan Schofer Do not copy or distribute without permission

Top Related