multilevel/hierarchical models


Page 1: Multilevel/Hierarchical Models

Motivation Multilevel Models Applications Postestimation Conclusion

Multilevel/Hierarchical Models

David A. Hughes, Ph.D.

Auburn University at Montgomery

[email protected]

April 10, 2020

1 / 38

Page 2: Multilevel/Hierarchical Models

Overview

1 Motivation

2 Multilevel Models

3 Applications

4 Postestimation

5 Conclusion

Page 3: Multilevel/Hierarchical Models

Introduction

• In both the CLRM and MLE contexts, we’ve touched on issues that can arise when our datasets have a nested structure to them.

• For example, when we have observations repeated over time across similar geographies, problems relating to efficiency and consistency can arise using OLS or MLE models that fail to account for such structure in the data.

Page 4: Multilevel/Hierarchical Models

Multilevel or hierarchical datasets

• In all honesty, most of the data we deal with as political scientists has some form of structure to it, meaning that we’re often at risk of violating the assumptions of OLS or MLE.

• Let’s see if we can identify the structure in the following sets of data:
• A longitudinal survey of student learning and achievement
• A cross-sectional dyadic analysis of whether nation-states engage in armed conflict
• A CSTS study of nation-state GDP

Page 5: Multilevel/Hierarchical Models

Addressing hierarchical data structures in regression

• We could disaggregate groups and run individual regressions on each.
• Upside: we get group-level coefficients which might be of theoretical interest.
• Downside: un-modeled macro-effects get dumped into the error term.

• We could estimate fixed effects for the groups we think are correlated (i.e., dummy variables).
• Upside: we deal with the correlation in the error term.
• Downside: we add G − 1 new independent variables (where G is the number of groups), which eats up G − 1 degrees of freedom.

• We could “cluster” the standard errors by group.
• Upside: can address the problem of correlation in the error term.
• Downside: Stata won’t let you cluster on more than one variable.
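In Stata, the three workarounds sketched above look like this (a sketch with hypothetical variables y, x, and grouping variable group):

```stata
* 1. Disaggregate: run a separate regression within each group
bysort group: regress y x

* 2. Fixed effects: add dummy variables for G-1 of the groups
regress y x i.group

* 3. Clustered standard errors by group
regress y x, vce(cluster group)
```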

Page 6: Multilevel/Hierarchical Models


An example

• Suppose we’re studying Democratic vote-shares across Alabama’s 67 counties.

• For our outcome variable, then, we gather data on every Democratic candidate seeking state-wide office in 2018.

• We then measure candidate i’s vote-share in every county, k.

• For explanatory variables, we’ll look at the percent of the county that is black, the percent of the county with an undergraduate degree, the percent of a county that is urban, and the percent of the county that is unemployed.

Page 7: Multilevel/Hierarchical Models

Nested data as a single matrix

Suppose we have i ∈ N Alabama voters nested in k ∈ K counties. Our data matrix might look something like this:

Voter ID | DV  | Individual Indicators | County ID | County     | County Indicators
1        | 3   | ...                   | 1         | Jefferson  | ...
2        | 1   | ...                   | 1         | Jefferson  | ...
...      | ... | ...                   | ...       | ...        | ...
321      | 4   | ...                   | 2         | Montgomery | ...
322      | 2   | ...                   | 2         | Montgomery | ...
...      | ... | ...                   | ...       | ...        | ...
N        | 4   | ...                   | K         | Winston    | ...

Page 8: Multilevel/Hierarchical Models

Multilevel modeling

• Multilevel models are mere extensions of single-level models.1

• The difference between the two is that the former allow coefficients to vary by groups nested in the data.

• Multilevel models are attractive due to their extreme flexibility in how we choose to fit them.

1In this section, I rely heavily upon the work of Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, 2007, Cambridge University Press.

Page 9: Multilevel/Hierarchical Models

Varying-intercept model

• A regression that includes model estimates for groups is called a varying-intercept model.

• It literally estimates a different intercept term for each group.

• Let there be observations i ∈ N across j ∈ J groups.

• Then consider the following varying-intercept model:

Yi = αj[i] + βXi + εi. (1)

• Such an approach may be desirable if we believe a variable X has a constant effect on Y but that each group “starts” at a different place.
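As a preview of the Stata syntax introduced later in these slides, a varying-intercept model of this form can be fit with mixed (a sketch with hypothetical variables y, x, and grouping variable group):

```stata
* Varying-intercept model: each group gets its own intercept
mixed y x || group:
```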

Page 10: Multilevel/Hierarchical Models

Varying-slope model

• Likewise, we could model a situation in which we allow a slope coefficient to vary by group, but every group shares a similar constant.

• Consider the following varying-slope model:

Yi = α + βj[i]Xi + εi. (2)

• Such an approach may be desirable if we believe that the effect a variable X has on Y differs in magnitude across disparate groups.

• A trivially simple extension can demonstrate that we can also estimate varying-slope and varying-intercept models.
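In Stata’s mixed syntax, the varying-slope variants look like this (a sketch with hypothetical y, x, and group; the noconstant option in the random-effects equation suppresses the random intercept so that only the slope varies):

```stata
* Varying slope only (random intercept suppressed)
mixed y x || group: x, noconstant

* Varying slope and varying intercept
mixed y x || group: x
```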

Page 11: Multilevel/Hierarchical Models

Graphical display of multilevel coefficients

[Figure: three panels plotting Y against X for Groups 1–3 — varying intercepts; varying slopes; varying slopes and intercepts.]

Page 12: Multilevel/Hierarchical Models

Multilevel modeling in Stata

• The command we use depends on the type of regression we wish to run:

1. mixed (linear regression)
2. melogit (logit)
3. meologit (ordered logit)
4. mepoisson (Poisson)
5. menbreg (negative binomial)
6. metobit (tobit)

• If you wish to run a multilevel multinomial logit, you’ll need to use Stata’s structural equation modeling language. For a primer on this, see: https://www.stata.com/manuals13/semexample41g.pdf

Page 13: Multilevel/Hierarchical Models

Multilevel modeling in Stata (cont’d.)

• Once you’ve chosen a multilevel model command, you proceed to estimate the level-one predictors:
[command] y x1 x2 ...

• To specify level-two predictors, we write the following:
[command] y x1 x2 ... || [group]: [z1 z2 ...]

• The “group” variable identifies the group level at which you want coefficients to vary.

• You could stop there, and you’d have yourself a varying-intercept model. Or you could proceed to include the “z” variables, which will estimate varying-slope coefficients.
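For instance, a hypothetical two-level linear model with varying intercepts and a varying slope on x2 would be specified as:

```stata
* Level-one predictors x1 and x2; intercept and slope on x2 vary by group
mixed y x1 x2 || group: x2
```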

Page 14: Multilevel/Hierarchical Models

An example with data

• For today’s lesson, we’ll focus on student performance data made public by Princeton University (https://dss.princeton.edu/training/schools.dta).2

• The dependent variable is a continuous measure of a student’s performance on a test.

• Independent variables include a student’s gender and the type of school they attend (all boy, all girl, or integrated).

2I’ll rely heavily upon the work of Oscar Torres-Reyna as I discuss these methods. You can review his presentation of multilevel data here: https://dss.princeton.edu/training/Multilevel101.pdf.

Page 15: Multilevel/Hierarchical Models

Student performance data

[Figure: individual Student Score and Mean School Score plotted by School id.]

Page 16: Multilevel/Hierarchical Models

Estimating the single-level model

• We can begin by estimating a single-level model like the following:

Scorei = α + β1Readingi + β2Femalei + β3Girlsi + β4Boysi + εi,

where integrated schools are the omitted category, i reflects a given student, and all group-level effects are absorbed by the error term.

Page 17: Multilevel/Hierarchical Models

Single-level results

. regress score reading female boys girls

Source | SS df MS Number of obs = 4,059

-------------+---------------------------------- F(4, 4054) = 580.15

Model | 147413.623 4 36853.4058 Prob > F = 0.0000

Residual | 257526.693 4,054 63.5240979 R-squared = 0.3640

-------------+---------------------------------- Adj R-squared = 0.3634

Total | 404940.316 4,058 99.7881508 Root MSE = 7.9702

------------------------------------------------------------------------------

score | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

reading | .5910456 .0126259 46.81 0.000 .5662919 .6157993

female | 1.326415 .3431437 3.87 0.000 .6536643 1.999165

boys | 1.825545 .4256816 4.29 0.000 .9909749 2.660114

girls | 1.680108 .3259095 5.16 0.000 1.041146 2.319069

_cons | -1.608613 .2395156 -6.72 0.000 -2.078195 -1.139031

------------------------------------------------------------------------------

Page 18: Multilevel/Hierarchical Models

Assessing the single-level model

• We can check some of the assumptions underlying the CLRM:

1. Homoskedasticity: Breusch-Pagan test: p = 0.68.2. Serial correlation within panels: p = 0.08.3. Cross-sectional serial correlation: p = 0.01.

• It looks like we could have some serial correlation in the errorterm both within and across panels.

• This suggests we have some unmodeled unit effects gettingpicked up by the error term.
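As a sketch of how these checks can be run in Stata (estat hettest is built in; xtserial and xtcdf are user-written commands available from SSC, and both require the panel structure to be declared first):

```stata
* After: regress score reading female boys girls
estat hettest                  // Breusch-Pagan test for heteroskedasticity

xtset school student           // declare panel structure (schools as panels)
xtserial score reading female  // serial correlation within panels
xtcdf score                    // cross-sectional dependence across panels
```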

Page 19: Multilevel/Hierarchical Models

Addressing the unit effects issue

• To address the unit effects, we could reestimate the model andcontrol for the units themselves on the RHS of the equation.

• That is, we could estimate “fixed effects” by including adummy variable for K − 1 of the units (schools) in thedataset.

Page 20: Multilevel/Hierarchical Models

Results with fixed effects

. reg score reading female i.type i.school

note: 64.school omitted because of collinearity

note: 65.school omitted because of collinearity

Source | SS df MS Number of obs = 4,059

-------------+---------------------------------- F(66, 3992) = 48.62

Model | 180460.578 66 2734.25118 Prob > F = 0.0000

Residual | 224479.738 3,992 56.2323993 R-squared = 0.4456

-------------+---------------------------------- Adj R-squared = 0.4365

Total | 404940.316 4,058 99.7881508 Root MSE = 7.4988

------------------------------------------------------------------------------

score | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

reading | .5557072 .0125195 44.39 0.000 .5311619 .5802524

female | 1.705288 .3428503 4.97 0.000 1.03311 2.377466

|

type |

Boys only | -2.409221 1.319982 -1.83 0.068 -4.997122 .1786804

Girls only | -6.918818 1.233569 -5.61 0.000 -9.337302 -4.500334

|

school | (INCLUDED)

Page 21: Multilevel/Hierarchical Models

Assessing the previous model

• The model with fixed effects (predictably) improves on model fit.

• Clearly some of the results change in light of the fixed-effects parameters.

• But an assessment of the residuals still suggests spatial correlation (see next slide).

• While the fixed effects cut down on the cross-sectional dependence, it’s still significant.

Page 22: Multilevel/Hierarchical Models

Spatial dependence in the error term

. xtcdf score res1 res2

xtcd test on variables score res1 res2

Panelvar: school

Timevar: student

-------------------------------------------------------------------------------
 Variable | CD-test  p-value  average joint T | mean rho  mean abs(rho)
----------+--------------------------------------+----------------------------
 score    |  3.043    0.002        48.40        |  0.01       0.12
 res1     |  2.59     0.010        48.40        |  0.01       0.12
 res2     |  2.72     0.007        48.40        |  0.01       0.12
-------------------------------------------------------------------------------

Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)

P-values close to zero indicate data are correlated across panel groups.

Page 23: Multilevel/Hierarchical Models

Varying intercept-only model

• Multilevel modeling would allow us to directly address the fact that we have unit dependence in the data.

• Let’s start with the null model (intercept only) where we let the intercept vary by school.

• We’ll have scores measured across students, i ∈ N , who are nested within schools, k ∈ K:

Scorei = αk + εi.

• In Stata, we estimate this model using the following code:
mixed score || school:

Page 24: Multilevel/Hierarchical Models

Varying intercept-only model

Mixed-effects ML regression Number of obs = 4,059

Group variable: school Number of groups = 65

Obs per group:

min = 2

avg = 62.4

max = 198

Wald chi2(0) = .

Log likelihood = -14851.502 Prob > chi2 = .

------------------------------------------------------------------------------

score | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_cons | -.1317107 .5362734 -0.25 0.806 -1.182787 .9193659

------------------------------------------------------------------------------

------------------------------------------------------------------------------

Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]

-----------------------------+------------------------------------------------

school: Identity |

var(_cons) | 16.86388 3.28458 11.51248 24.7028

-----------------------------+------------------------------------------------

var(Residual) | 84.77541 1.897109 81.13751 88.57642

------------------------------------------------------------------------------

LR test vs. linear model: chibar2(01) = 498.72 Prob >= chibar2 = 0.0000

Page 25: Multilevel/Hierarchical Models

Reading the Stata printout

• Let’s break down the results from the previous slide:

1. The first set of results are the level-one results. We see that “_cons” gives us the mean of the (varying) intercepts from all 65 schools.

2. The “Random-effects Parameters” provide the second-level estimates. The parameter “var(_cons)” gives us the variance across the 65 estimated intercept terms.

3. The value “var(Residual)” gives us the variance at the individual level (students).

4. The last line of the printout tests the hypothesis that the random-effects parameters are equal to 0.
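A related quantity worth computing from this printout is the intraclass correlation (ICC), the share of total variance attributable to schools: var(_cons)/(var(_cons) + var(Residual)) = 16.86/(16.86 + 84.78) ≈ 0.17, i.e., about 17% of score variance is between schools. Stata computes this directly after mixed:

```stata
* Intraclass correlation after the null model
estat icc
```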

Page 26: Multilevel/Hierarchical Models

Varying intercept model with covariates

• Now that we’ve seen that there’s a good deal of variance among the 65 schools and that some schools have a baseline of aptitude others don’t, let’s include our covariates, sticking to the varying-intercept model for the time being.

• The regression we’ll estimate takes the following form:

Scorei = αk + β1Readingi + β2Femalei + β3AllGirlsi + β4AllBoysi + εi.

• In Stata, we’ll estimate:
mixed score reading female i.type || school:

Page 27: Multilevel/Hierarchical Models

Varying-intercept model with covariates

Mixed-effects ML regression Number of obs = 4,059

Group variable: school Number of groups = 65

Obs per group:

min = 2

avg = 62.4

max = 198

Wald chi2(4) = 2093.27

Log likelihood = -14008.891 Prob > chi2 = 0.0000

------------------------------------------------------------------------------

score | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

reading | .5599627 .0124435 45.00 0.000 .5355738 .5843516

female | 1.672275 .340817 4.91 0.000 1.004286 2.340264

|

type |

Boys only | 1.776196 1.107532 1.60 0.109 -.394526 3.946918

Girls only | 1.589599 .8725465 1.82 0.068 -.1205607 3.299759

|

_cons | -1.681537 .5399933 -3.11 0.002 -2.739905 -.6231696

------------------------------------------------------------------------------

------------------------------------------------------------------------------

Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]

-----------------------------+------------------------------------------------

school: Identity |

var(_cons) | 8.110744 1.654673 5.437597 12.09802

-----------------------------+------------------------------------------------

var(Residual) | 56.22689 1.258533 53.81353 58.74848

------------------------------------------------------------------------------

LR test vs. linear model: chibar2(01) = 346.77 Prob >= chibar2 = 0.0000

Page 28: Multilevel/Hierarchical Models

Varying-intercept and varying-slope model

• Now let’s suppose that students’ reading abilities have varyingeffects across units (schools).

• We can then estimate the following (simplified) model:

Scorei = αk + βkReadingi + εi.

Page 29: Multilevel/Hierarchical Models

Results from the varying-intercept/slope model

. mixed score reading || school: reading

Mixed-effects ML regression Number of obs = 4,059

Group variable: school Number of groups = 65

Obs per group:

min = 2

avg = 62.4

max = 198

Wald chi2(1) = 782.26

Log likelihood = -14008.737 Prob > chi2 = 0.0000

------------------------------------------------------------------------------

score | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

reading | .5572047 .0199223 27.97 0.000 .5181578 .5962516

_cons | -.0790234 .3975575 -0.20 0.842 -.8582218 .700175

------------------------------------------------------------------------------

------------------------------------------------------------------------------

Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]

-----------------------------+------------------------------------------------

school: Independent |

var(reading) | .0143169 .0045226 .0077084 .0265911

var(_cons) | 9.027575 1.831721 6.065405 13.43638

-----------------------------+------------------------------------------------

var(Residual) | 55.36395 1.249216 52.96888 57.86732

------------------------------------------------------------------------------

LR test vs. linear model: chi2(2) = 435.39 Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

Page 30: Multilevel/Hierarchical Models

Calculating the fixed effects parameters

• The same suite of commands used in single-level regressions can be used in the multilevel regression environment.

• For example, we can use predict [var], [option] to get the linear predictor, standard errors, and residuals.
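For example, after fitting a mixed model, the following postestimation sketch recovers fixed-portion predictions and residuals (option names as in Stata’s predict after mixed; the generated variable names are illustrative):

```stata
* After: mixed score reading || school: reading
predict yhat, xb         // linear predictor from the fixed portion
predict res, residuals   // observation-level residuals
predict se_xb, stdp      // standard error of the linear prediction
```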

Page 31: Multilevel/Hierarchical Models

Calculating group-level parameters

• Stata’s multilevel modeling suite of commands allows us to easily recover group-level parameters such as the group-level errors, linear predictor, and so on.

• Estimate the following model: mixed score reading || school: reading.

• Recover the group-level error terms for the multilevel constant and slope coefficient: predict u*, reffects.

• The relative value of these group-level figures could be of substantive interest.

Page 32: Multilevel/Hierarchical Models

Sample printout of group-level error terms

. list school u2 u1 if school<=10 & student==1

+--------------------------------+

| school u2 u1 |

|--------------------------------|

1. | 1 3.679571 .1061788 |

74. | 2 4.634149 .1410424 |

129. | 3 5.016491 .0311951 |

181. | 4 .1649343 .1375719 |

260. | 5 2.425044 .0437221 |

|--------------------------------|

295. | 6 5.491126 .0093491 |

375. | 7 3.830561 -.20025 |

463. | 8 -.1640639 .0077805 |

565. | 9 -1.615116 -.0774711 |

599. | 10 -3.074617 -.1150779 |

+--------------------------------+

Page 33: Multilevel/Hierarchical Models

Recovering other group-level estimates

• We can estimate the group-level slopes and intercepts using the group-level errors we just found.

• Constant: gen re_cons = _b[_cons] + u2

• Slope: gen re_beta = _b[reading] + u1

• Linear prediction: predict re_yhat, fitted

Page 34: Multilevel/Hierarchical Models

Sample printout of group-level estimates

. list school re_cons re_beta re_yhat if school<10 & student==1

+-------------------------------------------+

| school re_cons re_beta re_yhat |

|-------------------------------------------|

1. | 1 3.600548 .6633835 7.70729 |

74. | 2 4.555126 .6982471 5.415017 |

129. | 3 4.937468 .5883998 4.689445 |

181. | 4 .085911 .6947766 2.664227 |

260. | 5 2.346021 .6009269 .106006 |

|-------------------------------------------|

295. | 6 5.412103 .5665538 2.83196 |

375. | 7 3.751537 .3569548 4.781173 |

463. | 8 -.2430873 .5649853 -4.216968 |

565. | 9 -1.69414 .4797336 -.7068478 |

+-------------------------------------------+

Page 35: Multilevel/Hierarchical Models

Comparing residuals between the fixed and random effects

• Note that because we have linear predictions resulting from both the level-1 and level-2 estimates, we can recover two sets of residuals:

• gen re_res = score - re_yhat

• gen fe_res = score - yhat

• We can then compare the variance across the two values. We find that Var(re_res) ≈ 0 while Var(fe_res) ≈ 18.6.

• Clearly, accounting for the group-level variance (sdtest re_res == fe_res) helps to reduce variance in the residuals (F(4058, 4058) = 0.83, p < 0.001), i.e., it increases efficiency.

Page 36: Multilevel/Hierarchical Models

Putting it all together

• Finally, a great way to get a handle on how our variables have differing effects across groups is to present these group-level effects graphically.

• We can easily accomplish this using the estimates we just calculated:

twoway connected re_yhat reading if school<=10, connect(L)

Page 37: Multilevel/Hierarchical Models

Graphical display of multilevel results

[Figure: group-level fitted lines (re_yhat against reading) for schools 1–10.]

Page 38: Multilevel/Hierarchical Models

Discussion

• In this unit, we’ve learned what multilevel models are, thecontexts in which they are helpful, and some of the Statacommands we have at our disposal to estimate them.

• Put simply, multilevel modeling is an extremely flexible meansby which we can estimate changes in a wide variety of nesteddata.

• Doing so allows us to address group-level variance notaccounted for by our independent variables that wouldnormally accumulate in the error term and call into questionour models’ efficiency or consistency.
