
Observational Studies 5 (2019) 21-35 Submitted 7/19; Published 7/19

Assessing Treatment Effect Variation in Observational Studies: Results from a Data Challenge

Carlos Carvalho [email protected]
Department of Information, Risk and Operations Management
The University of Texas at Austin
Austin, TX 78712, USA

Avi Feller [email protected]
Goldman School of Public Policy
The University of California, Berkeley
Berkeley, CA 94720, USA

Jared Murray [email protected]
Department of Information, Risk and Operations Management
The University of Texas at Austin
Austin, TX 78712, USA

Spencer Woody [email protected]
Department of Statistics and Data Science
The University of Texas at Austin
Austin, TX 78712, USA

David Yeager [email protected]

Department of Psychology

The University of Texas at Austin

Austin, TX 78712, USA

Abstract

A growing number of methods aim to assess the challenging question of treatment effect variation in observational studies. This special section of Observational Studies reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to understand the similarities and differences across these methods. We invited eight groups of researchers to analyze a synthetic observational data set that was generated using a recent large-scale randomized trial in education. Overall, participants employed a diverse set of methods, ranging from matching and flexible outcome modeling to semiparametric estimation and ensemble approaches. While there was broad consensus on the topline estimate, there were also large differences in estimated treatment effect moderation. This highlights the fact that estimating varying treatment effects in observational studies is often more challenging than estimating the average treatment effect alone. We suggest several directions for future work arising from this workshop.

Keywords: Heterogeneous treatment effects, effect modification, average treatment effect

1. Introduction

Spurred by recent statistical advances and new sources of data, a growing number of methods aim to assess treatment effect variation in observational studies. This is an inherently

©2019 Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, and David Yeager.


challenging problem. Even estimating a single overall effect in non-randomized settings is difficult, let alone estimating how effects vary across units. As applied researchers begin to use these tools, it is therefore important to understand both how well these approaches work in practice and how they relate to each other.

This special section of Observational Studies reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to address these questions. Specifically, we invited researchers to analyze a common data set using their preferred approach for estimating varying treatment effects in observational settings. The synthetic data set, which we describe in detail in Section 2.2, was based on data from the National Study of Learning Mindsets, a large-scale randomized trial of a behavioral intervention (Yeager et al., 2019). Unlike the original study, the simulated dataset was constructed to include meaningful measured confounding, though the assumption of no unmeasured confounding still holds. We then asked participants to answer specific questions related to treatment effect variation in this simulated data set and to present these results at the ACIC workshop.

Workshop participants submitted a total of eight separate analyses. At a high level, all contributed analyses followed similar two-step procedures: (1) use a flexible approach to estimate unit-level treatment effects; (2) find low-dimensional summaries of the estimates from the first step to answer the substantive questions of interest. The analyses, however, differed widely in their choice of flexible modeling, including matching, machine learning models, semi-parametric methods, and ensemble approaches. In the end, there was broad consensus on the topline estimate, but large differences in estimated treatment effect moderation. This underscores that estimating varying treatment effects in observational studies is more challenging than estimating the ATE alone. Section 2 gives an overview of the data challenge and the proposed methods. Section 3 discusses the contributed analyses. Section 4 addresses common themes and highlights some directions for future research. Participants' analyses appear subsequently in this volume.

2. Overview of the data challenge

2.1 Background and problem setup

The basis for the data challenge is the National Study of Learning Mindsets (Yeager et al., 2019; Yeager, 2019), which several workshop organizers helped to design and analyze. NSLM is a large-scale randomized evaluation of a low-cost "nudge-like" intervention designed to instill students with a growth mindset. At a high level, a growth mindset is the belief that people can develop intelligence, as opposed to a fixed mindset, which views intelligence as an innate trait that is fixed from birth. NSLM assessed this intervention by randomizing students separately within 76 schools drawn from a national probability sample of U.S. public high schools. In addition to assessing the overall impact, the study was designed to measure impact variation, both across students and across schools. See Yeager et al. (2019) for additional discussion.

The goal in generating the synthetic data was to create an observational study that emulated NSLM in key ways, including covariate distributions, data structures, and effect sizes, but that also introduced additional confounding not present in the original randomized trial. The final dataset included 10,000 students across 76 schools, with four student-level


Covariate  Description

S3  Student's self-reported expectations for success in the future, a proxy for prior achievement, measured prior to random assignment
C1  Categorical variable for student race/ethnicity
C2  Categorical variable for student identified gender
C3  Categorical variable for student first-generation status (i.e., first in family to go to college)
XC  School-level categorical variable for urbanicity of the school (i.e., rural, suburban, etc.)
X1  School-level mean of students' fixed mindsets, reported prior to random assignment
X2  School achievement level, as measured by test scores and college preparation for the previous four cohorts of students
X3  School racial/ethnic minority composition, i.e., percentage black, Latino, or Native American
X4  School poverty concentration, i.e., percentage of students who are from families whose incomes fall below the federal poverty line
X5  School size: total number of students in all four grade levels in the school

Table 1: Descriptions of available covariates in the ACIC workshop synthetic dataset

covariates and six school-level covariates shown in Table 1. These covariates were drawn from the original National Study but were slightly perturbed to ensure privacy and other data restrictions (see Appendix A for details). For each student, we then generated a simulated outcome Y, representing a continuous measure of achievement, and a simulated binary treatment variable Z. We describe the data generating process in Section 2.2, with details in Appendix A.

Participants were presented with several objectives for their analysis:

1. To assess whether the mindset intervention is effective in improving student achievement

2. To assess two potential effect moderators of primary scientific interest: pre-existing mindset norms (X1) and school-level achievement (X2). In particular, participants were asked to evaluate two competing hypotheses about how X2 moderates the effect of the intervention: if it is an effect modifier, researchers hypothesize that either it is


largest in middle-achieving schools (a “Goldilocks effect”) or is decreasing in school-level achievement.

3. To assess whether there are any other effect modifiers among the recorded variables.

We chose these objectives in part because they arose in the design and analysis of the original National Study. We intentionally kept the wording statistically vague and did not map the objectives onto specific estimands, in order to emulate part of our experience working with collaborators.

2.2 Overview of data generation

According to the model generating synthetic data:

1. The mindset intervention has a relatively large, positive effect on average.

2. The impact varies across both pre-existing mindset norms (X1) and school-level achievement (X2). However, there is no "Goldilocks" effect present, at least when controlling for other variables (C1 and X1).

3. The impact also varies across race/ethnicity (C1).

Where possible we anchored the synthetic data generating process to the original data, borrowing the original covariate distribution (with slight perturbations) and using semiparametric models fit to an immediate post-treatment outcome from the original study.

Specifically, let $w_{ij}$ denote the vector of the variables in Table 1 for student $i$ in school $j$. We generated the data from the following model:

$$y^{\mathrm{obs}}_{ij} = \alpha_j + \mu(w_{ij}) + \left[\tau(x_{j1}, x_{j2}, c_{ij1}) + \gamma_j\right] z_{ij} + \varepsilon_{ij}, \qquad (1)$$

where $\mu$ is an additive function obtained approximately by fitting a generalized additive model to the control arm of the original data; see Appendix A. We simulated $\alpha_j \sim N(0, 0.15^2)$ and $\gamma_j \sim N(0, 0.105^2)$ independently. We drew iid samples of $\varepsilon_{ij}$ by jittering and resampling the residuals from a model fit to the original data, scaling them to have standard deviation 0.5; the distribution of the error terms is displayed in Figure 3. Finally, we generated treatment effects from the following model:

$$\tau(x_1, x_2, c_1) = 0.228 + 0.05 \cdot \mathbf{1}(x_1 < 0.07) - 0.05 \cdot \mathbf{1}(x_2 < -0.69) - 0.08 \cdot \mathbf{1}(c_1 \in \{1, 13, 14\}), \qquad (2)$$

where $x_1$ is a measure of pre-existing mindset norms, $x_2$ is school-level achievement, and $c_1$ is a categorical race/ethnicity variable. All three appeared to be associated with treatment effect variation in preliminary data from the original National Study, although we adjusted the pattern for the synthetic dataset. Note that there is no additional idiosyncratic treatment effect variation and the control and treatment potential outcomes have the same (conditional) variance.
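To make Equations (1) and (2) concrete, here is a minimal R sketch of the outcome-generating step under stated assumptions: dat is a student-level data frame carrying the Table 1 covariates, a school index, and the treatment indicator Z; mu.hat holds the GAM-based values of the mean function; and resid0 holds residuals from the original-data fit. These names are illustrative, not the authors' actual code.

    # Treatment-effect function from Equation (2)
    tau <- function(x1, x2, c1) {
      0.228 + 0.05 * (x1 < 0.07) - 0.05 * (x2 < -0.69) -
        0.08 * (c1 %in% c(1, 13, 14))
    }

    J     <- 76
    alpha <- rnorm(J, 0, 0.15)     # school-level intercepts
    gamma <- rnorm(J, 0, 0.105)    # school-level treatment effect shifts
    eps   <- sample(jitter(resid0), nrow(dat), replace = TRUE)
    eps   <- eps * 0.5 / sd(eps)   # rescale resampled residuals to sd 0.5
    j     <- dat$school            # school membership for each student
    y     <- alpha[j] + mu.hat +
      (tau(dat$X1, dat$X2, dat$C1) + gamma[j]) * dat$Z + eps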

We generated confounding via a two-step process. First, we dropped observations from the treatment arm with probability $1 - \Phi[-0.5 + 1.5\,\mu(w_{ij})]$, where $\Phi$ is the standard normal CDF. This simulates a scenario where students with high expected outcomes under control were more likely to receive the treatment, yielding naive treatment effect estimates


that are too high. Second, similar to the design in Hill (2011), we dropped select units from the treatment arm to induce a more complicated functional form for the confounding structure. Specifically, we dropped students from the treatment arm if they were above the 80th percentile on an additional covariate from the National Study that was not included in the synthetic dataset (and not used in generating the synthetic outcomes). In principle this has the potential to induce violations of overlap in the presence of very strong dependence, but overlap in the final dataset was quite good (see, e.g., Carnegie et al. and Johannsson for overlap checks).
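A hedged sketch of these two dropping steps, continuing the notation of the previous snippet (u stands in for the withheld extra covariate; all names are illustrative):

    # Step 1: keep treated students with probability pnorm(-0.5 + 1.5 * mu),
    # i.e., drop them with probability 1 - pnorm(-0.5 + 1.5 * mu)
    keep1 <- dat$Z == 0 | runif(nrow(dat)) < pnorm(-0.5 + 1.5 * mu.hat)
    # Step 2: drop treated students above the 80th percentile of u
    keep2 <- dat$Z == 0 | u <= quantile(u, 0.80)
    dat   <- dat[keep1 & keep2, ]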

As with any synthetic data simulation, the process involved an extensive number of modeling choices. Appendix A contains more details about the data generating process and discusses the principles we formulated prior to generating the data, including decisions about the relative magnitude of the average treatment effect, treatment effect heterogeneity, and residual variance.

3. Overview of contributed analyses

3.1 Summary of contributed methods

Participants submitted eight separate analyses using a wide range of methods.1 At a high level, all of the contributed analyses followed similar two-step procedures: (1) use a flexible approach to impute student-level (or school-level) treatment effects; (2) find low-dimensional summaries of the flexible model to answer the substantive questions. Specifically, the proposed first-stage methods fall into three broad categories:

• Matching: Keller et al.; Keele and Pimentel; Parikh et al.

• Outcome modeling: Carnegie et al.; Johannsson.

• Machine learning and semi-parametric estimation: Zhao & Panigrahi; Athey & Wager.

Kunzel et al. proposed an ensemble approach that could incorporate any number of candidate estimators. Approaches for the second step range from graphical summaries and simple aggregation (e.g., averaging student-level impact estimates by school) to fitting regression or tree models to the estimated student-level impacts.

3.2 Summary of findings

At a high level, most analyses resulted in remarkably similar estimates for the average treatment effect. Conversely, the proposed methods generally do not agree on treatment effect heterogeneity, though the presented results rarely contradict each other: while some methods reported effect modifiers that others did not detect, no two methods, for example, found effects varying in opposite directions.

Objective 1: average effect. Table 2 shows the reported estimates of the Average Treatment Effect and corresponding 95% uncertainty intervals.2 Overall the point estimates

1. Alejandro Schuler participated in the ACIC workshop but did not submit a written analysis.
2. The original prompt was intentionally vague on the target estimand. Most respondents reported estimates for an Average Treatment Effect. Keele & Pimentel specifically reported estimates for an Average Treatment Effect on the Treated. Athey & Wager reported an Average Treatment Effect for a population of sites. In the end, the true values are quite similar across these estimands in the synthetic data set. Finally, all uncertainty intervals are confidence intervals, except for Carnegie et al., who report credible intervals.


Author              ATE estimate (95% C.I.)
Athey & Wager       0.25 (0.21, 0.29)
Carnegie et al.     0.25 (0.23, 0.27)
Johannsson          0.27 (0.22, 0.31)
Keele & Pimentel    0.27 (0.25, 0.30)
Keller et al.       0.26 (0.22, 0.30)
Kunzel et al.       0.25 (0.22, 0.27)
Parikh et al.       0.26 (0.25, 0.26)
Zhao & Panigrahi    0.26 (0.24, 0.28)

Table 2: Submitted estimates for the average treatment effect and corresponding 95% uncertainty intervals.

are quite similar, ranging from 0.25 to 0.27, compared to a true value of 0.24. The interval widths varied widely, however, from 0.01 for Parikh et al. to 0.09 for Johannsson. And while most intervals bracketed the true value, with the exceptions of Parikh et al. and Keele and Pimentel,3 we cannot assess the quality of these intervals based on a single realized synthetic data set.

These results are consistent with our own experiences and anecdotal evidence suggesting that, in settings where the identifying assumptions hold, sample sizes are large, and the signal is reasonably strong, estimates of the overall effect are fairly robust to modeling or analytic choices. Inference may be another story, however. While all eight analyses point to large, positive effects overall, the variation in interval widths suggests differences in efficiency and/or frequentist validity of uncertainty intervals, or possibly differences in the target estimand; for example, Athey and Wager target the population school average treatment effect, while Keele and Pimentel estimate the sample ATT.

Objective 2: variation across pre-specified moderators. In contrast to responses concerning the overall effect, there is considerable disagreement about treatment effect variation across the two pre-specified moderators; see Table 3. All participants found at least some support for effect modification across baseline levels of mindset beliefs (X1) and generally agreed on the direction: higher baseline mindset beliefs associated with lower treatment effects on average. However, there was some disagreement on whether this trend should be considered "statistically significant." As we discuss in Section 4, this is due in part to different standards for evidence as well as challenges in quantifying uncertainty for some approaches.

Results were more mixed for variation by school achievement level (X2). Half of the analyses found essentially no support for variation across X2. The remaining analyses were split as to the magnitude and strength of evidence for variation, though all were suggestive of an increasing trend.


3. The true Sample Average Treatment Effect on the Treated was also about 0.24.


Conclusions by author for school mindset norms (X1) and school achievement level (X2):

Athey & Wager. X1: Yes, though negative effect modification is insignificant after multiplicity adjustment. X2: No.

Carnegie et al. X1: Yes, decreasing trend in treatment effect. X2: Yes, an increasing trend in treatment effect, though without evidence of a Goldilocks effect.

Johannsson. X1: Yes, decreasing trend in treatment effect. X2: No.

Keele & Pimentel. X1: Yes, decreasing trend in treatment effect, though confidence intervals for quintiles overlap. X2: Moderate support; lowest quintile has lower treatment effect, though there is little separation among other quintiles.

Keller et al. X1: Moderate support from exploratory analysis. X2: Moderate support for presence of a Goldilocks effect.

Kunzel et al. X1: Yes, increasing trend in treatment effect. X2: No, though there is cursory evidence in CATE plots.

Parikh et al. X1: Yes, increasing then decreasing trend in treatment effect. X2: Yes, exploratory support for Goldilocks effect.

Zhao & Panigrahi. X1: Yes, though negative effect modification is insignificant after selection adjustment. X2: No.

Table 3: Conclusions about Objective 2 concerning effect modification by X1 and X2

Objective 3: exploratory treatment effect variation. The contributed analyses also differed in their conclusions about additional effect modifiers. (Recall from Equation (2) that in the data generating process, X1, X2, and C1 were all "true" treatment effect modifiers, with treatment effects decreasing in X1, increasing in X2, and smaller for C1 = 1, 13, or 14.)

With the exception of Athey & Wager, who found no additional treatment effect variation, the remaining seven analyses all identified urbanicity (XC) as an effect modifier, although they varied somewhat in their assessment of the strength of evidence. Some authors also identified student self-reported expectations for future success (S3) as a possible effect modifier, generally with an increasing trend. These findings were interesting in part because neither XC nor S3 is a "true" effect modifier in the sense of appearing in the data generating process for τ in Equation (2), although XC is associated with τ in the population.

This highlights two important issues in assessing treatment effect variation in non-randomized studies. First, even in a randomized trial, estimating varying treatment effects is, in some sense, an inherently observational problem and is generally susceptible to Simpson's paradox (VanderWeele and Knol, 2011). For example, urbanicity is strongly related to the true effect modifiers X1 and X2: if the analysis does not condition on X1 and X2, the


Figure 1: Relationships between the urbanicity variable (XC) and other quantities, from left to right: base mindset norms (X1), school achievement (X2), and the true CATE. In the left two panels we see how XC correlates with X1 and X2. The dotted lines are the locations of the step for the CATE function τ in Equation (2) for each variable (X1 and X2). Note that, in addition to the sample correlation between X1/X2 and XC, observations with XC = 3 tended to have norms (X1) above the X1 cutpoint in τ and achievement (X2) below the X2 cutpoint in τ. In the last panel we see that marginally the true CATEs for observations with XC = 3 are notably lower than the other levels.

estimated treatment effects will vary across levels of XC; see Figure 1.4 Therefore, whether urbanicity should be considered a "true" effect modifier depends on whether the analysis conditions on X1, X2, and C1, and on whether we are estimating heterogeneity across sampled schools or in the population. The research question given to workshop participants was intentionally vague on this point, and different authors interpreted the question in different ways.

Second, disentangling differential confounding from treatment effect variation is inherently challenging in observational settings. Unlike XC, S3 has essentially no relationship with the true individual-level treatment effects unconditionally, but is a very important predictor of selection into treatment. In the analyses that found evidence for treatment effect heterogeneity in S3, the estimated pattern of heterogeneity lined up well with the patterns of confounding (higher estimated treatment effects in the upper two levels of S3), suggesting that the estimated treatment effect variation may be due to residual confounding; see Figure 2. These results are noteworthy in part because the submitted analyses were able to more or less correctly estimate the overall treatment effect despite confounding, which underscores that estimating varying treatment effects in observational studies is more challenging than estimating the ATE.

4. The observed relationship between urbanicity and treatment effects is further amplified by the step function specification in Equation (2), as well as the particular draw of random slopes, γj, in the realized synthetic data set.


Figure 2: True CATE and estimated propensity score by level of expected success (S3). Unlike XC, we see relatively little variability in CATE by S3 (left panel). However, S3 was one of the most important confounders (the right panel above shows how the propensity score varies with S3; Figure 4 shows that S3 is also very predictive of the outcome).

4. Common themes and open questions

4.1 Making substantive questions statistically precise

As with all real-world analyses, there are important open-ended questions in terms of translating the substantive goals of the research study into corresponding statistical methods.

One important challenge is weighing evidence for pre-defined subgroups versus exploratory treatment effect variation. The proposed methods differ widely. At one extreme, Zhao and Panigrahi have different inferential procedures for pre-determined candidate subgroup effects (school-level fixed mindset X1 and achievement level X2, as specified in Question 2) versus "discovered" subgroup effects, as mentioned in Question 3. They argue that using the same data for both discovery and estimation of effect modification will generally lead to bias (Fithian et al., 2014), which motivates their post-selection inference procedure (Lee et al., 2016). At the other extreme, Carnegie et al. treat all variation the same, and conduct their investigation into treatment effect heterogeneity by leveraging posterior uncertainty from BART. Most approaches lie between these two extremes, such as using different sample splits to define subgroups versus estimating their impacts (Kunzel et al.).

Another difference across methods is how submitted analyses addressed the multilevel structure of the data. Approaches included using "cluster-robust" random forests (Wager & Athey), bootstrapping and sample splitting at the school level (Johannson; Kunzel et al.), including fixed or random intercepts at the school level (Carnegie et al.), and considering matching both between and within schools (Keele & Pimentel). These approaches make different assumptions about patterns of treatment effect moderation as well as the estimand of interest. For example, Carnegie et al.'s choice of random intercepts per school relaxes the assumption of iid errors at the student level but does not allow the effect of treatment effect moderators to vary across schools. Alternatively, Athey & Wager use a cluster-robust approach and an estimator that targets a population of schools, rather than (as was often implicitly targeted by the other analyses) a population of students drawn from the 76 observed schools.


Third, as we mention in Section 3, the contributed analyses all use a two-step approach for assessing treatment effect variation, first fitting a flexible model for impacts and then finding a low-dimensional summary of that model. While a promising framework, the statistical properties of such procedures are not well understood. As Hahn et al. (2017) argue, this is generally valid from a Bayesian perspective, since it simply entails summarizing the posterior distribution after conditioning on the data only once. The story, however, is more complicated from a frequentist perspective. Chernozhukov et al. (2018) offer some promising directions in the randomized trial setting; see also the discussion in Athey & Wager and Zhao & Panigrahi. These ideas merit further exploration.

4.2 Tailoring methods for observational studies

The (simulated) data for this data challenge come from an observational study rather than a randomized trial. Therefore, appropriately accounting for confounding is a first-order concern. While all the analyses adjusted for confounding, the contributions varied in the level of emphasis put on the observational aspect of the data exercise, which creates additional challenges for estimating treatment effect variation. In particular, when a method only partially accounts for confounding, it is possible to confuse differential confounding for varying treatment effects. This appears to be driving some of the findings of S3 as an effect modifier, as we discuss in Section 3. Understanding this dimension will be critical for broader adoption of these methods.

First, it is common to check for global covariate balance when estimating the overall impact in an observational study. A natural extension to heterogeneous treatment effect estimation would be to also check for local covariate balance, for example separately by candidate subgroup. Even if we attain global covariate balance to a reasonable tolerance, imbalance within subgroups should give us pause. Such analyses were largely absent from the contributed papers, suggesting that the field should develop standards in this area.
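As an illustration of such a local balance check, here is a hypothetical R sketch computing absolute standardized mean differences between treatment arms within each candidate subgroup; the function name and inputs are ours, not from any contributed analysis:

    # x: numeric covariate; z: binary treatment; g: subgroup factor
    local_smd <- function(x, z, g) {
      sapply(split(data.frame(x = x, z = z), g), function(d)
        abs(mean(d$x[d$z == 1]) - mean(d$x[d$z == 0])) / sd(d$x))
    }
    # e.g., balance on school poverty (X4) within quintiles of achievement (X2):
    # local_smd(dat$X4, dat$Z,
    #           cut(dat$X2, quantile(dat$X2, 0:5 / 5), include.lowest = TRUE))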

Another element that is critical for observational studies is assessing the consequences when the unconfoundedness assumption fails, either globally or locally. Keele & Pimentel assess sensitivity within the matching framework; Carnegie et al. instead use a model-based approach, and Kunzel et al. use a permutation approach. Sensitivity analysis for treatment effect variation remains an active research area, with some recent proposals within the minimax framework (Kallus and Zhou, 2018; Yadlowsky et al., 2018).

5. Conclusion

Our goal for this workshop was to understand how different researchers would approach the kinds of questions we routinely face in our own applied work, as a complement to existing data analysis competitions like the ACIC data challenge (Dorie et al., 2019), where methods are formally evaluated based on their operating characteristics. We were fortunate to bring together a "methodologically diverse" panel and set them to analyze a single dataset, giving us a unique opportunity to compare perspectives. Our hope was that we would learn more about new methods while finding areas of overlap and points of divergence that suggest new lines of research. We suggest a few directions for future work above, but fully expect this synthesis is the first word and not the last. We sincerely thank the workshop participants and the other contributors to this volume for making the workshop such a success.


Acknowledgments

We gratefully acknowledge support from the Center for Enterprise and Policy Analytics at the McCombs School of Business at the University of Texas at Austin. Preparation of the manuscript was supported in part by the National Institute of Child Health and Human Development (Grant No. R01HD084772), P2C-HD042849 (to the Population Research Center [PRC] at The University of Texas at Austin), and the National Science Foundation under grant number HRD 1761179.

Appendix A. Additional details for generating synthetic data

A.1 Principles for data generation

We had several goals for our data generating process. Since we had a single dataset to generate based on a scientific problem with which we were quite familiar, we had the luxury of thinking carefully about plausible structures for treatment effect heterogeneity and confounding.

Our model for generating treatment effect heterogeneity was based on the following principles:

1. The average treatment effect should be relatively well-estimated by any reasonable procedure for confounding adjustment.

2. The variability in conditional average treatment effects (CATEs) should be relatively modest; here the covariate-dependent component of the CATEs ranged from 0.10 to 0.28, while the total CATEs (including school-level heterogeneity unexplained by the covariates) ranged from -0.08 to 0.53 and have a sample average of 0.24 and standard deviation 0.10. "Relative" here should be understood in terms of the marginal standard deviation of Y (about 0.6) and the standard deviation of the error term in the model we used to generate the data (0.5).

3. It should be possible to approximately recover the treatment effect heterogeneity at conventional levels of statistical significance given complete knowledge of the correct functional form. This is an obvious baseline. We also tried to ensure that it was possible to recover at least some aspects of treatment effect moderation using plausible methods in a fully exploratory fashion, or when the true set of treatment effect moderators was known.

4. There should be no additional unmeasured treatment effect moderation at the individual level, since this is inestimable. This is wholly unrealistic and was primarily a variance-reduction decision; to the extent that other simulation exercises are explicit about this point, they tend to simulate unmeasured treatment effect moderation as due to independent unmeasured variables (e.g., Dorie et al. (2019)). This is closer to reality of course, but in our context with a single dataset to be analyzed we saw little to be gained by injecting additional variability into the problem.

5. Unexplained treatment effect heterogeneity at the group level should be present at reasonable levels. We achieved this by simulating a random slope per school from a


normal distribution with mean zero and standard deviation slightly larger than what we observed in models fit to the original data.

Our confounding structure was based on the following principles:

1. The confounding should be strong enough to matter, but not so strong as to be unrealistic and/or induce practical violations of overlap. In our context this meant a naive, unadjusted average treatment effect (ATE) estimate was about 25% higher than the estimate using a correctly specified model (roughly, 0.30 unadjusted versus 0.24 adjusted, a difference of multiple standard errors).

2. The confounding should be explicitly modeled, at least in part, rather than induced solely by randomly sampling coefficients in outcome and selection models. In particular, we assumed that selection into treatment was partially based on expected outcomes. Here students with higher expected outcomes under control were more likely to be treated. Rather than violate conditional ignorability/unconfoundedness via a (perhaps more plausible) latent variable model, we made the selection model depend directly on $\mu_0(w) := \mathbb{E}(Y(0) \mid W = w)$, where we use W generically to denote the collection of potential effect moderators and confounders. Hahn et al. (2017) refer to this confounding structure as "targeted selection."

3. We entertained inducing confounding at the group level that was unexplained by group-level covariates, but decided against it. This was largely a pragmatic decision, as it was already surprisingly difficult to design a data generating process that met our other desiderata.

We wanted the covariate distribution to be realistic. We were interested in the practical effects of dependence in the covariates on estimating effect variation, and in how participants would interpret the research questions in light of this dependence. We took the covariates almost directly from the early Mindset study data. To satisfy privacy constraints and avoid premature disclosure of variables constructed by the Mindset team, the original covariates were slightly perturbed. We added noise to continuous variables, where the disturbances were sampled from multivariate normal distributions that preserved the covariance structure (marginally over categorical variables). Categorical variables were subject to low levels of data swapping. In both cases we preserved the original multilevel data structure.
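A rough R sketch of this perturbation scheme under stated assumptions: schools is a school-level data frame, the noise scale (0.1 here) and the 2% swap rate are placeholders, and none of these names come from the authors' code:

    library(MASS)
    cont  <- c("X1", "X2", "X3", "X4", "X5")
    Sigma <- cov(schools[, cont])          # preserve the covariance structure
    schools[, cont] <- schools[, cont] +
      mvrnorm(nrow(schools), rep(0, length(cont)), 0.1 * Sigma)
    swap <- sample(nrow(dat), ceiling(0.02 * nrow(dat)))  # low-rate data swapping
    dat$C1[swap] <- sample(dat$C1[swap])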

A.2 Summary of original GAM estimate

See Figure 3.

Appendix B. List of participants

In total there were nine participants in the workshop, all of whom gave presentations of their findings. Here we list the presenters by order of appearance, along with a brief description of their methodology:

• Stefan Wager (Stanford): Causal random forests and the R-Learner transformed outcome.


Figure 3: Nonlinear terms in µ and error distribution (bottom right panel) for the data generating model.

• Alejandro Schuler (Stanford): Shallow CART fit to transformed outcome (R-Learner).

• Luke Keele (Penn) and Sam Pimentel (UC-Berkeley): Various matching-based approaches.

• Qingyuan Zhao (Penn): Linear model on a transformed outcome and treatment with lasso regularization, with selection adjustment for inference about nonzero coefficients.

• Soren Kunzel (UC-Berkeley): Ensemble method to detect candidate subgroups, using standard ATE estimators within these subgroups.

• Nicole Carnegie (Montana State) and Jennifer Hill (New York University): BART with and without random intercepts.

• Alexander Volfovsky (Duke): Matching after learning to stretch (MALTS), a matching method that infers a distance metric.

• Frederik Johannsson (MIT): Ridge regression, neural networks, and random forests.

• Bryan Keller (Columbia): Propensity score matching and CART summarization.

All presenters, with the exception of Schuler, submitted a written report (most with other coauthor(s)). These submissions are all included in this issue.


Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.279038   0.014847 -18.794  < 2e-16 ***
S3 == 2     -0.009130   0.014948  -0.611 0.541351
S3 == 3     -0.017494   0.013360  -1.309 0.190402
S3 == 4      0.046260   0.012716   3.638 0.000276 ***
S3 == 5      0.252158   0.012290  20.518  < 2e-16 ***
S3 == 6      0.614424   0.012315  49.893  < 2e-16 ***
S3 == 7      0.964031   0.012872  74.891  < 2e-16 ***
C1 == 2      0.011635   0.005728   2.031 0.042255 *
C1 == 3     -0.035327   0.012573  -2.810 0.004966 **
C1 == 4     -0.107783   0.004981 -21.638  < 2e-16 ***
C1 == 5      0.215135   0.008003  26.882  < 2e-16 ***
C1 == 6      0.001796   0.021055   0.085 0.932026
C1 == 7      0.034712   0.021602   1.607 0.108097
C1 == 8     -0.073001   0.010092  -7.233 4.96e-13 ***
C1 == 9     -0.138227   0.011785 -11.729  < 2e-16 ***
C1 == 10    -0.099958   0.010868  -9.197  < 2e-16 ***
C1 == 11    -0.084852   0.010809  -7.850 4.45e-15 ***
C1 == 12     0.049686   0.008453   5.878 4.25e-09 ***
C1 == 13    -0.100919   0.010573  -9.545  < 2e-16 ***
C1 == 14     0.064214   0.006531   9.832  < 2e-16 ***
C1 == 15    -0.019233   0.008204  -2.344 0.019078 *
C2          -0.155658   0.002546 -61.127  < 2e-16 ***
C3          -0.093056   0.002850 -32.655  < 2e-16 ***
XC           0.019555   0.002527   7.740 1.07e-14 ***
---
        edf Ref.df      F p-value
s(X1) 8.718  8.950 128.08  <2e-16 ***
s(X2) 8.920  8.987  25.08  <2e-16 ***
s(X3) 6.726  7.661  62.31  <2e-16 ***
s(X4) 7.827  8.554  19.73  <2e-16 ***
s(X5) 8.818  8.974  58.23  <2e-16 ***
---

Figure 4: GAM summary table of µ for the true expected outcomes under control; see Figure 3 for the forms of the nonlinear terms in X1 through X5.


References

Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2018). Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research.

Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68.

Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv e-prints, page arXiv:1410.2597.

Hahn, P. R., Murray, J. S., and Carvalho, C. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv e-prints.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240. Available from: https://doi.org/10.1198/jcgs.2010.08162.

Kallus, N. and Zhou, A. (2018). Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pages 9269–9279.

Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Annals of Statistics, 44(3):907–927. Available from: https://doi.org/10.1214/15-AOS1371.

VanderWeele, T. J. and Knol, M. J. (2011). Interpretation of subgroup analyses in randomized trials: heterogeneity versus secondary interventions. Annals of Internal Medicine, 154(10):680–683.

Yadlowsky, S., Namkoong, H., Basu, S., Duchi, J., and Tian, L. (2018). Bounds on the conditional and average treatment effect in the presence of unobserved confounders. arXiv preprint arXiv:1808.09521.

Yeager, D. S. (2019). The National Study of Learning Mindsets, [United States], 2015–2016. Inter-university Consortium for Political and Social Research [distributor].

Yeager, D. S., Hanselman, P., Walton, G. M., Murray, J. S., Crosnoe, R., Muller, C., Tipton, E., Schneider, B., Hulleman, C. S., Hinojosa, C. P., Paunesku, D., Romero, C., Flint, K., Roberts, A., Trott, J., Iachan, R., Buontempo, J., Yang, S. M., Carvalho, C. M., Hahn, P. R., Gopalan, M., Mhatre, P., Ferguson, R., Duckworth, A. L., and Dweck, C. S. (2019). A national experiment reveals where a growth mindset improves achievement. Nature. Available from: https://doi.org/10.1038/s41586-019-1466-y.


Observational Studies 5 (2019) 36-51 Submitted 7/19; Published 8/19

Estimating Treatment Effects with Causal Forests: An Application

Susan Athey [email protected]
Stanford Graduate School of Business, Stanford, CA 94305

Stefan Wager [email protected]
Stanford Graduate School of Business, Stanford, CA 94305

Abstract

We apply causal forests to a dataset derived from the National Study of Learning Mindsets, and discuss resulting practical and conceptual challenges. This note will appear in an upcoming issue of Observational Studies, Empirical Investigation of Methods for Heterogeneity, that compiles several analyses of the same dataset.

1. Methodology and Motivation

There has been considerable recent interest in methods for heterogeneous treatment effect estimation in observational studies (Athey and Imbens, 2016; Athey et al., 2019; Ding et al., 2016; Dorie et al., 2017; Hahn et al., 2017; Hill, 2011; Imai and Ratkovic, 2013; Kunzel et al., 2017; Luedtke and van der Laan, 2016; Nie and Wager, 2017; Shalit et al., 2017; Su et al., 2009; Wager and Athey, 2018; Zhao et al., 2017). In order to help elucidate the drivers of successful approaches to treatment effect estimation, Carlos Carvalho, Jennifer Hill, Avi Feller and Jared Murray organized a workshop at the 2018 Atlantic Causal Inference Conference and asked several authors to analyze a shared dataset derived from the National Study of Learning Mindsets (Yeager et al., 2016).

This note presents an analysis using causal forests (Athey et al., 2019; Wager and Athey, 2018); other approaches will be discussed in a forthcoming issue of Observational Studies with the title "Empirical Investigation of Methods for Heterogeneity." All analyses are carried out using the R package grf, version 0.10.2 (Tibshirani et al., 2018; R Core Team, 2017). Full replication files are available at github.com/grf-labs/grf, in the directory experiments/acic18.

1.1 The National Study of Learning Mindsets

The National Study of Learning Mindsets is a randomized study conducted in U.S. public high schools, the purpose of which was to evaluate the impact on student achievement of a nudge-like intervention designed to instill students with a growth mindset.1 To protect student privacy, the present analysis is not based on data from the original study, but rather on data simulated from a model fit to the National Study dataset by the workshop organizers. The present analysis could serve as a pre-analysis plan to be applied to the original National Study dataset (Nosek et al., 2015).

Our analysis is based on data from n = 10,391 children from a probability sample of J = 76 schools.2 For each child i = 1, ..., n, we observe a binary treatment indicator $Z_i$, a real-valued outcome $Y_i$, as well as 10 categorical or real-valued covariates described in Table 1.

1. According to the National Study, "A growth mindset is the belief that intelligence can be developed. Students with a growth mindset understand they can get smarter through hard work, the use of effective strategies, and help from others when needed. It is contrasted with a fixed mindset: the belief that intelligence is a fixed trait that is set in stone at birth."

2. Initially, 139 schools were recruited into the study using a stratified probability sampling method (Gopalan and Tipton, 2018). Of these 139 recruited schools, 76 agreed to participate in the study; then, students were individually randomized within the participating schools. In this note, we do not discuss potential bias from the non-randomized selection of 76 schools among the 139 recruited ones.

©2019 Susan Athey and Stefan Wager.


S3  Student's self-reported expectations for success in the future, a proxy for prior achievement, measured prior to random assignment
C1  Categorical variable for student race/ethnicity
C2  Categorical variable for student identified gender
C3  Categorical variable for student first-generation status, i.e., first in family to go to college
XC  School-level categorical variable for urbanicity of the school, i.e., rural, suburban, etc.
X1  School-level mean of students' fixed mindsets, reported prior to random assignment
X2  School achievement level, as measured by test scores and college preparation for the previous 4 cohorts of students
X3  School racial/ethnic minority composition, i.e., percentage of student body that is Black, Latino, or Native American
X4  School poverty concentration, i.e., percentage of students who are from families whose incomes fall below the federal poverty line
X5  School size, i.e., total number of students in all four grade levels in the school
Y   Post-treatment outcome, a continuous measure of achievement
Z   Treatment, i.e., receipt of the intervention

Table 1: Definition of variables measured in the National Study of Learning Mindsets

We expanded out categorical random variables via one-hot encoding, thus resulting in covariates $X_i \in \mathbb{R}^p$ with p = 28. Given this data, the workshop organizers expressed particular interest in the three following questions:

1. Was the mindset intervention effective in improving student achievement?

2. Was the effect of the intervention moderated by school-level achievement (X2) or pre-existing mindset norms (X1)? In particular there are two competing hypotheses about how X2 moderates the effect of the intervention: either it is largest in middle-achieving schools (a "Goldilocks effect") or it is decreasing in school-level achievement.

3. Do other covariates moderate treatment effects?

We define causal effects via the potential outcomes model (Imbens and Rubin, 2015): for each sample i, we posit potential outcomes $Y_i(0)$ and $Y_i(1)$ corresponding to the outcome we would have observed had we assigned control or treatment to the i-th sample, and assume that we observe $Y_i = Y_i(Z_i)$. The average treatment effect is then defined as $\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$, and the conditional average treatment effect function is $\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]$.

This dataset exhibits two methodological challenges. First, although the National Study itself was a randomized study, there seem to be some selection effects in the synthetic data used here. As seen in Figure 1, students with a higher expectation of success appear to be more likely to receive treatment. For this reason, we analyze the study as an observational rather than randomized study. In order to identify causal effects, we assume unconfoundedness, i.e., that treatment assignment is as good as random conditionally on covariates (Rosenbaum and Rubin, 1983):

$$\{Y_i(0), Y_i(1)\} \perp\!\!\!\perp Z_i \mid X_i. \qquad (1)$$



Figure 1: Visualizing estimated treatment propensities against student expectation of success. [Scatter plot; x-axis: student expectation of success, 1 through 7; y-axis: estimated propensity score, roughly 0.25 to 0.40.]

To relax this assumption, one could try to find an instrument for treatment assignment (Angrist and Pischke, 2008), or conduct a sensitivity analysis for hidden confounding (Rosenbaum, 2002).

Second, the students in this study are not independently sampled; rather, they are all drawn from 76 randomly selected schools, and there appears to be considerable heterogeneity across schools. Such a situation could arise if there are unobserved school-level features that are important treatment effect modifiers; for example, some schools may have leadership teams who implemented the intervention better than others, or may have a student culture that is more receptive to the treatment. If we want our conclusions to generalize outside of the 76 schools we ran the experiment in, we must run an analysis that robustly accounts for the sampling variability of potentially unexplained school-level effects. Here, we take a conservative approach, and assume that the outcomes $Y_i$ of students within the same school may be arbitrarily correlated within a school (or "cluster"), and then apply cluster-robust analysis tools (Abadie et al., 2017).

The rest of this section presents a brief overview of causal forests, with an emphasis on how they address issues related to clustered observations and selection bias. Causal forests are an adaptation of the random forest algorithm of Breiman (2001) to the problem of heterogeneous treatment effect estimation. For simplicity, we start below by discussing how to make random forests cluster-robust in the classical case of non-parametric regression, where we observe pairs $(X_i, Y_i)$ and want to estimate $\mu(x) = \mathbb{E}[Y_i \mid X_i = x]$. Then, in the next section, we review how forests can be used for treatment effect estimation in observational studies.

1.2 Cluster-Robust Random Forests

When observations are grouped in unevenly sized clusters, it is important to carefully define the underlying target of inference. For example, in our setting, do we want to fit a model that accurately reflects heterogeneity in our available sample of J = 76 schools, or one that we hope will generalize to students from other schools also? Should we give more weight in our analysis to schools from which we observe more students?

Here, we assume that we want results that generalize beyond our J schools, and that we give each school equal weight; quantitatively, we want models that are accurate for predicting effects on a new student from a new school. Thus, if we only observed outcomes $Y_i$ for students with school membership $A_i \in \{1, ..., J\}$, we would estimate the global mean as $\hat\mu$ with standard error $\hat\sigma$, with

$$\hat\mu_j = \frac{1}{n_j} \sum_{\{i : A_i = j\}} Y_i, \qquad \hat\mu = \frac{1}{J} \sum_{j=1}^{J} \hat\mu_j, \qquad \hat\sigma^2 = \frac{1}{J(J-1)} \sum_{j=1}^{J} \left(\hat\mu_j - \hat\mu\right)^2, \qquad (2)$$

where $n_j$ denotes the number of students in school j. Our challenge is then to use random forests to bring covariates into an analysis of type (2). Formally, we seek to carry out a type of non-parametric random effects modeling, where each school is assumed to have some effect on the student's outcome, but we do not make assumptions about its distribution (in particular, we do not assume that school effects are Gaussian or additive).
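For reference, the estimator in Equation (2) is a few lines of R; this sketch assumes a data frame dat with outcome Y and school membership A (names are ours):

    mu.j  <- tapply(dat$Y, dat$A, mean)                # per-school means
    J     <- length(mu.j)
    mu    <- mean(mu.j)                                # equal-weight global mean
    sigma <- sqrt(sum((mu.j - mu)^2) / (J * (J - 1)))  # standard error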

At a high level, random forests make predictions as an average of B trees, as follows: (1) For each b = 1, ..., B, draw a subsample $S_b \subseteq \{1, ..., n\}$; (2) Grow a tree via recursive partitioning on each such subsample of the data; and (3) Make predictions

$$\hat\mu(x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} \frac{Y_i \, \mathbf{1}\left(\{X_i \in L_b(x),\, i \in S_b\}\right)}{\left|\{i : X_i \in L_b(x),\, i \in S_b\}\right|}, \qquad (3)$$

where $L_b(x)$ denotes the leaf of the b-th tree containing the training sample x. In the case of out-of-bag prediction, we estimate $\hat\mu^{(-i)}(X_i)$ by only considering those trees b for which $i \notin S_b$. This short description of forests of course leaves many details implicit. We refer to Biau and Scornet (2016) for a recent overview of random forests and note that, throughout, all our forests are "honest" in the sense of Wager and Athey (2018).

When working with clustered data, we adapt the random forest algorithm as follows. In step (1), rather than directly drawing a subsample of observations, we draw a subsample of clusters $J_b \subseteq \{1, ..., J\}$; then, we generate the set $S_b$ by drawing k samples at random from each cluster $j \in J_b$.3 The other point where clustering matters is when we want to make out-of-bag predictions in step (3). Here, to account for potential correlations within each cluster, we only consider an observation i to be out-of-bag if its cluster was not drawn in step (1), i.e., if $A_i \notin J_b$.
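In grf, this cluster-robust scheme is exposed through a clusters argument; a hedged sketch, assuming a covariate matrix X, an outcome vector Y, and school ids school.id (our names, with the argument as in grf 0.10.2):

    library(grf)
    rf <- regression_forest(X, Y, clusters = school.id)
    mu.hat <- predict(rf)$predictions   # out-of-bag predictions respect the clustering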

1.3 Causal Forests for Observational Studies

One promising avenue to heterogeneous treatment effect estimation starts from an early result of Robinson (1988) on inference in the partially linear model (Nie and Wager, 2017; Zhao et al., 2017). Write e(x) = P[Zi = 1 | Xi = x] for the propensity score and m(x) = E[Yi | Xi = x] for the expected outcome marginalizing over treatment. If the conditional average treatment effect function is constant, i.e., τ(x) = τ for all x ∈ X, then the following estimator is semiparametrically efficient for τ under unconfoundedness (1) (Chernozhukov et al., 2017; Robinson, 1988):

$$\hat\tau = \frac{\frac{1}{n} \sum_{i=1}^{n} \bigl(Y_i - \hat m^{(-i)}(X_i)\bigr)\bigl(Z_i - \hat e^{(-i)}(X_i)\bigr)}{\frac{1}{n} \sum_{i=1}^{n} \bigl(Z_i - \hat e^{(-i)}(X_i)\bigr)^2}, \tag{4}$$

assuming that the estimates of m and e are o(n−1/4)-consistent for m and e respectively in root-mean-squared error, that the data is independent and identically distributed, and that we have overlap, i.e., that the propensities e(x) are uniformly bounded away from 0 and 1. The (−i)-superscripts denote "out-of-bag" or "out-of-fold" predictions, meaning that, e.g., Yi was not used to compute m(−i)(Xi).
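Given out-of-fold predictions Y.hat and Z.hat (computed, e.g., as in Algorithm 1 below), the estimator (4) is a one-liner; this sketch is for illustration only and is not how grf computes its ATE estimates:

# Residual-on-residual estimator (4) for a constant treatment effect
tau.hat.const = mean((Y - Y.hat) * (Z - Z.hat)) / mean((Z - Z.hat)^2)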

3. If k ≤ nj for all j = 1, ..., J, then each cluster contributes the same number of observations to the forest as in (2). In grf, however, we also allow users to specify a value of k larger than the smallest nj; in this case, for clusters with nj ≤ k, we simply use the whole cluster (without duplicates) every time j ∈ Jb. This latter option may be helpful in cases where there are some clusters with a very small number of observations, yet we want Sb to be reasonably large so that the tree-growing algorithm is stable.


Although the original estimator (4) was designed for constant treatment effect estimation, Nie and Wager (2017) showed that we can use it to motivate an "R-learner" objective function for heterogeneous treatment effect estimation,

$$\hat\tau(\cdot) = \operatorname{argmin}_{\tau} \left\{ \sum_{i=1}^{n} \Bigl( \bigl(Y_i - \hat m^{(-i)}(X_i)\bigr) - \tau(X_i)\bigl(Z_i - \hat e^{(-i)}(X_i)\bigr) \Bigr)^2 + \Lambda_n\bigl(\tau(\cdot)\bigr) \right\}, \tag{5}$$

where Λn(τ(·)) is a regularizer that controls the complexity of the learned τ(·) function. A desirable property of this approach is that, if the true conditional average treatment effect function τ(·) is simpler than the main effect function m(·) or the propensity function e(·) (e.g., qualitatively, if τ(·) allows for a sparser representation than m(·) or e(·)), then the function τ(·) learned by optimizing (5) may converge faster than the estimates for m(·) or e(·) used to form the objective function.
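The (unregularized) empirical R-objective is likewise easy to write down; a sketch, assuming out-of-bag vectors Y.hat, Z.hat and CATE estimates tau.hat. grf's cross-validation compares tuning parameters using out-of-bag estimates of exactly this kind of quantity:

# Out-of-bag R-loss from (5); smaller values indicate a better fit of tau(.)
r.loss = mean(((Y - Y.hat) - tau.hat * (Z - Z.hat))^2)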

Causal forests as implemented in grf can be seen as a forest-based method motivated by the R-learner (5). Typically, random forests (Breiman, 2001) are understood as an ensemble method: A random forest prediction is an average of predictions made by individual trees. However, as discussed in Athey et al. (2019), we can equivalently think of random forests as an adaptive kernel method; for example, we can re-write the regression forest from (3) as

$$\hat\mu(x) = \sum_{i=1}^{n} \alpha_i(x) Y_i, \qquad \alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{\mathbf{1}\bigl(\{X_i \in L_b(x),\ i \in S_b\}\bigr)}{\bigl|\{i : X_i \in L_b(x),\ i \in S_b\}\bigr|}, \tag{6}$$

where, qualitatively, αi(x) is a data-adaptive kernel that measures how often the i-th training example falls in the same leaf as the test point x. This kernel-based perspective on forests suggests a natural way to use them for treatment effect estimation based on (4) and (5): First, we grow a forest to get weights αi(x), and then set

$$\hat\tau(x) = \frac{\sum_{i=1}^{n} \alpha_i(x) \bigl(Y_i - \hat m^{(-i)}(X_i)\bigr)\bigl(Z_i - \hat e^{(-i)}(X_i)\bigr)}{\sum_{i=1}^{n} \alpha_i(x) \bigl(Z_i - \hat e^{(-i)}(X_i)\bigr)^2}. \tag{7}$$

Athey et al. (2019) discuss this approach in more detail, including how to design a splitting rule for a forest that will be used to estimate predictions via (7). Finally, we address clustered observations by modifying the random forest sampling procedure in an analogous way to the one used in Section 1.2.
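To illustrate the kernel form (7): if alpha were a length-n vector holding the forest weights αi(x) at a single test point x (a hypothetical input here; the weights come from the trained forest), the CATE estimate at x would be the weighted residual-on-residual regression:

# Weighted version of (4), evaluated at one test point x
tau.x = sum(alpha * (Y - Y.hat) * (Z - Z.hat)) / sum(alpha * (Z - Z.hat)^2)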

Concretely, the grf implementation of causal forests starts by fitting two separate regression forests to estimate m(·) and e(·). It then makes out-of-bag predictions using these two first-stage forests, and uses them to grow a causal forest via (7). Causal forests have several tuning parameters (e.g., minimum node size for individual trees), and we choose those tuning parameters by cross-validation on the R-objective (5); i.e., we train causal forests with different values of the tuning parameters, and choose the ones that make out-of-bag estimates of the objective minimized in (5) as small as possible.

We provide an exact implementation of our treatment effect estimation strategy with causal forests in Algorithm 1. We train the Y.forest and Z.forest using default settings, as their predictions are simply used as inputs to the causal forest and default parameter choices often perform reasonably well with random forests.⁴ For our final causal forest, however, we deploy some tweaks for improved precision. Motivated by Basu et al. (2018), we start by training a pilot random forest on all the features, and then train a second forest on only those features that saw a reasonable number of

4. The nuisance components Y.hat or Z.hat need not be estimated by a regression forest. We could also use other predictive methods (e.g., boosting with cross-fitting) or use oracle values (e.g., the true randomization probabilities for Z.hat in a randomized trial). If we simply run the command causal_forest(X, Y, Z) without specifying Y.hat or Z.hat, then the software silently estimates Y.hat or Z.hat via regression forests.


Algorithm 1 Estimating treatment effects with causal forests. Throughout this note, we follow editorial guidelines for the special issue of Observational Studies, and denote treatment assignment by Z. However, the grf interface has different conventions, and treatment assignment is denoted by W rather than Z (e.g., the function causal_forest actually has an argument W.hat, not Z.hat). In order to get these code snippets to run in grf, all the "Z" need to be replaced with "W".

Y.forest = regression_forest(X, Y, clusters = school.id)
Y.hat = predict(Y.forest)$predictions
Z.forest = regression_forest(X, Z, clusters = school.id)
Z.hat = predict(Z.forest)$predictions

cf.raw = causal_forest(X, Y, Z,
                       Y.hat = Y.hat, Z.hat = Z.hat,
                       clusters = school.id)
varimp = variable_importance(cf.raw)
selected.idx = which(varimp > mean(varimp))

cf = causal_forest(X[, selected.idx], Y, Z,
                   Y.hat = Y.hat, Z.hat = Z.hat,
                   clusters = school.id,
                   samples_per_cluster = 50,
                   tune.parameters = TRUE)
tau.hat = predict(cf)$predictions

splits in the first step.⁵ This enables the forest to make more splits on the most important features in low-signal situations. Second, we increase the samples_per_cluster parameter (called k in Section 1.2) to increase the number of samples used to grow each tree. Finally, the option tune.parameters = TRUE has the forest cross-validate tuning parameters using the R-objective rather than just setting defaults.

2. Workshop Results

We now use our causal forest as trained in Algorithm 1 to explore the questions from Section 1.1.

2.1 The average treatment effect

The first question asks about the overall effectiveness of the intervention. The package grf has a built-in function for average treatment effect estimation, based on a variant of augmented inverse-propensity weighting (Robins et al., 1994). With clusters, we compute an average treatment effect estimate τ̂ and a standard error estimate σ̂ as follows:

$$\hat\tau_j = \frac{1}{n_j} \sum_{\{i : A_i = j\}} \hat\Gamma_i, \qquad \hat\tau = \frac{1}{J} \sum_{j=1}^{J} \hat\tau_j, \qquad \hat\sigma^2 = \frac{1}{J(J-1)} \sum_{j=1}^{J} (\hat\tau_j - \hat\tau)^2,$$

$$\hat\Gamma_i = \hat\tau^{(-i)}(X_i) + \frac{Z_i - \hat e^{(-i)}(X_i)}{\hat e^{(-i)}(X_i)\bigl(1 - \hat e^{(-i)}(X_i)\bigr)} \Bigl( Y_i - \hat m^{(-i)}(X_i) - \bigl(Z_i - \hat e^{(-i)}(X_i)\bigr)\hat\tau^{(-i)}(X_i) \Bigr). \tag{8}$$

See Section 2.1 of Farrell (2015) for a discussion of estimators with this functional form, and Section 2.4 of Athey et al. (2018) for a recent literature review. The value of cross-fitting is stressed in Chernozhukov et al. (2017). An application of this method suggests that the treatment had a large positive effect on average.

5. Given good estimates of Y.hat and Z.hat, the construction (7) eliminates confounding effects. Thus, we do not need to give the causal forest all features X that may be confounders. Rather, we can focus on features that we believe may be treatment modifiers; see Zhao et al. (2017) for a further discussion.

ATE = average_treatment_effect(cf)

> "95% CI for the ATE: 0.247 +/- 0.04"

2.2 Assessing treatment heterogeneity

The next two questions pertain to treatment heterogeneity. Before addressing these questions, however, it is useful to ask whether the causal forest has succeeded in accurately estimating treatment heterogeneity. As seen in Figure 2, the causal forest CATE estimates obviously exhibit variation; but this does not automatically imply that τ(−i)(Xi) is a better estimate of τ(Xi) than the overall average treatment effect estimate τ̂ from (8). Below, we seek an overall hypothesis test for whether heterogeneity in τ(−i)(Xi) is associated with heterogeneity in τ(Xi).

A first, simple approach to testing for heterogeneity involves grouping observations according to whether their out-of-bag CATE estimates are above or below the median CATE estimate, and then estimating average treatment effects in these two subgroups separately using the doubly robust approach (8). This procedure is somewhat heuristic, as the "high" and "low" subgroups are not independent of the scores Γi used to estimate the within-group effects; however, the subgroup definition does not directly depend on the outcomes or treatments (Yi, Zi) themselves, and it appears that this approach can provide at least qualitative insights about the strength of heterogeneity.

We also try a second test for heterogeneity, motivated by the "best linear predictor" method of Chernozhukov et al. (2018), that seeks to fit the CATE as a linear function of the out-of-bag causal forest estimates τ(−i)(Xi). Concretely, following (4), we create two synthetic predictors, Ci = τ̄(Zi − e(−i)(Xi)) and Di = (τ(−i)(Xi) − τ̄)(Zi − e(−i)(Xi)), where τ̄ is the average of the out-of-bag treatment effect estimates, and regress Yi − m(−i)(Xi) against Ci and Di. Then, we can interpret the coefficient of Di as a measure of the quality of the estimates of treatment heterogeneity, while Ci absorbs the average treatment effect. If the coefficient on Di is 1, then the treatment heterogeneity estimates are well calibrated, while if the coefficient on Di is significant and positive, then at least we have evidence of a useful association between τ(−i)(Xi) and τ(Xi). More formally, one could use the p-value for the coefficient of Di to test the hypothesis that the causal forest succeeded in finding heterogeneity; however, we caution that asymptotic results justifying such inference are not presently available.
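For intuition, a simplified, non-cluster-robust version of this regression can be written out by hand; the test_calibration function automates this and uses cluster-robust inference. A sketch, assuming the out-of-bag vectors Y.hat, Z.hat and tau.hat from Algorithm 1:

tau.bar = mean(tau.hat)
C = tau.bar * (Z - Z.hat)               # absorbs the average treatment effect
D = (tau.hat - tau.bar) * (Z - Z.hat)   # measures heterogeneity calibration
Y.resid = Y - Y.hat
summary(lm(Y.resid ~ C + D - 1))        # coefficient on D near 1 = well calibrated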

Below, we show output from running both analyses (note that all results are cluster-robust, where each cluster gets the same weight). The overall picture appears somewhat mixed: Although point estimates are consistent with the presence of heterogeneity, neither detection is significant. Thus, at least if we insist on cluster-robust inference, any treatment heterogeneity that may be present appears to be relatively weak, and causal forests do not identify subgroups with effects that obviously stand out. We discuss the role of cluster-robustness further in Section 3.1.

# Compare regions with high and low estimated CATEs
high_effect = tau.hat > median(tau.hat)
ate.high = average_treatment_effect(cf, subset = high_effect)
ate.low = average_treatment_effect(cf, subset = !high_effect)
> "95% CI for difference in ATE: 0.053 +/- 0.071"

# Run best linear predictor analysis
test_calibration(cf)
>                         Estimate Std. Error t value Pr(>|t|)
> mean.prediction         1.007477   0.083463 12.0710   <2e-16 ***
> differential.prediction 0.321932   0.306738  1.0495    0.294


Figure 2: Histogram of out-of-bag CATE estimates from a causal forest trained as in Algorithm 1.

(a) variation with school-level mindset    (b) evaluating forest trained on τ̂j from (8)

Figure 3: Panel (a) plots students' CATE estimates against school-level mindset X1. Panel (b) compares estimates from a regression forest trained to predict the per-school doubly robust treatment effect estimates τ̂j from (8) using school-level covariates, to school-wise averages of the causal forest estimates τ(−i)(Xi) trained as in Algorithm 1.

2.3 The effect of X1 and X2

Although our omnibus tests did not find strong evidence of treatment heterogeneity, this does not mean there is no heterogeneity present. Researchers had pre-specified interest in heterogeneity along two specific variables, namely pre-existing mindset (X1) and school-level achievement (X2), and it is plausible that a test for heterogeneity that focuses on these two variables may have more power than the agnostic tests explored above.


Both X1 and X2 are school-level variables, so we here design tests based on the per-school doubly robust treatment effect estimates τ̂j computed in (8). As seen below, this more targeted analysis uncovers notable heterogeneity along X1: schools with larger values of X1 appear to experience smaller effects than schools with smaller values of X1 (note the negative sign of the t-statistic in the output below). Conversely, we do not see much heterogeneity along X2, whether we divide schools into 2 subgroups (to test the monotone hypothesis) or into 3 subgroups (to test the goldilocks hypothesis).

Although the p-value for heterogeneity along X1 is not small enough to withstand a Bonferroni test, it seems reasonable to take the detection along X1 seriously because heterogeneity along X1 was one of two pre-specified hypotheses. Interestingly, we also note that X1 was the most important variable in the causal forest: The final causal forest was trained on 9 "selected" variables, and spent 24% of its splits on X1, with splits weighted by depth (as in the function variable_importance). The left panel of Figure 3 plots the relationship between X1 and τ(−i)(Xi).

dr.score = tau.hat + Z / cf$Z.hat *
    (Y - cf$Y.hat - (1 - cf$Z.hat) * tau.hat) -
    (1 - Z) / (1 - cf$Z.hat) * (Y - cf$Y.hat + cf$Z.hat * tau.hat)
school.score = t(school.mat) %*% dr.score / school.size

school.X1 = t(school.mat) %*% X$X1 / school.size
high.X1 = school.X1 > median(school.X1)
t.test(school.score[high.X1], school.score[!high.X1])
> t = -3.0205, df = 72.087, p-value = 0.00349
> 95 percent confidence interval: -0.1937 -0.0397

school.X2 = t(school.mat) %*% X$X2 / school.size
high.X2 = school.X2 > median(school.X2)
t.test(school.score[high.X2], school.score[!high.X2])
> t = 1.043, df = 72.431, p-value = 0.3004
> 95 percent confidence interval: -0.0386 0.1234

school.X2.levels = cut(school.X2,
    breaks = c(-Inf, quantile(school.X2, c(1/3, 2/3)), Inf))
summary(aov(school.score ~ school.X2.levels))
>                  Df Sum Sq Mean Sq F value Pr(>F)
> school.X2.levels  2  0.085 0.04249   1.365  0.262
> Residuals        73  2.272 0.03112

2.4 Looking for school-level heterogeneity

Our omnibus test for heterogeneity from Section 2.2 produced mixed results; however, when we zoomed in on the pre-specified covariates X1 and X2 in Section 2.3, we uncovered interesting results. Noticing that both X1 and X2 are school-level (as opposed to student-level) covariates, it is natural to ask whether an analysis that focuses only on school-level effects may have had more power than our original analysis following Algorithm 1.

Here, we examine this question by fitting models to the school-level estimates τ̂j from (8) using only school-level covariates. We considered both an analysis using a regression forest, as well as classical linear regression modeling. Both methods, however, result in conclusions that are in line with the ones obtained above. The strength of the heterogeneity found by the regression forest trained on the τ̂j, as measured by the "calibration test," is comparable to the strength of the heterogeneity found by our original causal forest; moreover, as seen in the right panel of Figure 3, the predictions made by this regression forest are closely aligned with school-wise averaged predictions from the original causal forest. Meanwhile, a basic linear regression analysis uncovers a borderline amount of effect modification along X1, and nothing else stands out.

The overall picture is that, by looking at the predictor X1 alone, we can find credible effect modification that is negatively correlated with X1. However, there does not appear to be strong enough heterogeneity for us to be able to accurately fit a more complex model for τ(·): Even a linear model for effect modification starts to suffer from low signal, and it is not quite clear whether X1 is an effect modifier after we control for the other school-level covariates.

# Regression forest analysis
school.forest = regression_forest(school.X, school.score)
school.pred = predict(school.forest)$predictions
test_calibration(school.forest)
>                         Estimate Std. Error t value Pr(>|t|)
> mean.prediction         0.998765   0.083454 11.9679   <2e-16 ***
> differential.prediction 0.619299   0.706514  0.8766   0.3836

# Ordinary least-squares analysis
coeftest(lm(school.score ~ school.X), vcov = vcovHC)
>               Estimate Std. Error t value Pr(>|t|)
> (Intercept)  0.2434703  0.0770302  3.1607 0.002377 **
> X1          -0.0493032  0.0291403 -1.6919 0.095377 .
> X2           0.0143625  0.0340139  0.4223 0.674211
> X3           0.0092693  0.0264267  0.3508 0.726888
> X4           0.0248985  0.0258527  0.9631 0.339019
> X5          -0.0336325  0.0265401 -1.2672 0.209525
> XC.1        -0.0024447  0.0928801 -0.0263 0.979081
> XC.2         0.0826898  0.1052411  0.7857 0.434845
> XC.3        -0.1376920  0.0876108 -1.5716 0.120818
> XC.4         0.0408624  0.0820938  0.4978 0.620313

3. Post-workshop analysis

Two notable differences between the causal forest analysis used here and a more direct machine-learning-based analysis were our use of cluster-robust methods, and of orthogonalization for robustness to confounding as in (7). To understand the value of these features, we revisit some analyses from Section 2 without them.

3.1 The value of clustering

If we train a causal forest on students without clustering by school, we obtain markedly different results from before: The confidence interval for the average treatment effect is now roughly half as long as before, and there appears to be unambiguously detectable heterogeneity according to the test_calibration function. Moreover, as seen in the left panel of Figure 4, the CATE estimates τ(−i)(Xi) obtained without clustering are much more dispersed than those obtained with clustering (see Figure 2): The sample variance of the τ(−i)(Xi) increases by a factor of 5.82 without clustering.

It appears that these strong detections without clustering are explained by excess optimism from ignoring variation due to idiosyncratic school-specific effects, rather than from a true gain in power from using a version of causal forests without clustering. The right panel of Figure 4 shows per-school estimates of τ(−i)(Xi) from the non-cluster-robust causal forest, and compares them to predictions for the mean CATE in the school obtained in a way that is cluster-robust. The differences are striking: For example, the left-most school in the right panel of Figure 4 has non-cluster-robust τ(−i)(Xi) estimates that vary from 0.26 to 0.36, whereas the cluster-robust estimate of its mean CATE was roughly 0.2. A simple explanation for how this could happen is that students in this school happened to have unusually high treatment effects, and that the non-cluster-robust forest was able to overfit to this school-level effect because it does not account for potential correlations between different students in the same school.

To gain deeper insights into the behavior of non-cluster-robust forests, we tried a 5-fold version of this algorithm where the forests themselves are not cluster-robust, but the estimation folds are cluster-aligned. Specifically, we split the clusters into 5 folds; then, for each fold, we fit a causal forest without clustering on observations belonging to clusters in the other 4 folds, and made CATE estimates on the held-out fold. Finally, re-running a best linear predictor test on out-of-fold predictions as in the test_calibration function, we found at best tenuous evidence for the presence of heterogeneity (in fact, the resulting t-statistic for heterogeneity, 0.058, was weaker than the one in Section 2.2). In other words, if we use evaluation methods that are robust to clustering, then the apparent gains from non-cluster-robust forests wash away.
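Cluster-aligned folds can be constructed by assigning whole schools, rather than students, to folds; a minimal sketch (the per-fold model fitting and evaluation loop is omitted):

schools = unique(school.id)
fold.of.school = sample(rep(1:5, length.out = length(schools)))   # schools -> folds
fold = fold.of.school[match(school.id, schools)]                  # student-level labels
# For each f in 1:5, train a non-cluster-robust causal forest on fold != f,
# predict on fold == f, and run the best-linear-predictor test on the
# out-of-fold predictions.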

Thus, it appears that different schools have very different values of τj; however, most of the school-wise effects appear to be idiosyncratic, and cannot be explained using covariates. In order to gain insights that generalize to new schools we need to cluster by school; and, once we do so, much of the apparent heterogeneity between schools ends up looking like noise.

cf.noclust = causal_forest(X[, selected.idx], Y, Z,
                           Y.hat = Y.hat, Z.hat = Z.hat,
                           tune.parameters = TRUE)

ATE.noclust = average_treatment_effect(cf.noclust)
> "95% CI for the ATE: 0.253 +/- 0.022"

test_calibration(cf.noclust)
>                         Estimate Std. Error t value  Pr(>|t|)
> mean.prediction         1.003796   0.044779 22.4164 < 2.2e-16 ***
> differential.prediction 0.634163   0.132700  4.7789 1.786e-06 ***

3.2 The value of orthogonalization

In this dataset, orthogonalization appears to be less important than clustering. If we train a causal forest without estimating the propensity score, or, more specifically, using the trivial propensity model $\hat e(X_i) = \bar Z = n^{-1} \sum_{i=1}^{n} Z_i$, we uncover essentially the same average treatment effect estimate as with orthogonalization. Moreover, as shown in Figure 5, the causal forests trained with or without orthogonalization yield essentially the same CATE estimates τ(−i)(Xi).

One reason for this phenomenon may be that, here, the most important confounders are also important for predicting Y: In Algorithm 1, the most important predictor for both the Z- and Y-forests is S3, with 22% of splits and 70% of splits respectively (both weighted by depth as in the variable_importance function). Meanwhile, as argued in Belloni et al. (2014), orthogonalization is often most important when there are some features that are highly predictive of treatment propensities but not very predictive of Y. Thus, it is possible that the non-orthogonalized forest does well here because we were lucky, and there were no confounders that only had a strong effect on the propensity model.

To explore this hypothesis, we present a synthetic example where some variables have stronger effects on Z than on Y and see that, as expected, orthogonalization is now important. There is clearly no treatment effect, yet the non-orthogonalized forest appears to find a non-zero effect.


(a) histogram of CATE estimates w/o clustering    (b) per-school CATE estimates w/o clustering

Figure 4: Panel (a) is a histogram of CATE estimates τ(−i)(Xi) trained using a causal forest that does not account for school-level clustering. Panel (b) compares per-student predictions τ(−i)(Xi) from a non-cluster-robust causal forest to per-school mean treatment effect predictions from a forest trained on per-school responses as in Section 2.4.

Figure 5: Comparison of estimates from a forest trained with a trivial propensity model $\hat e(X_i) = \bar Z = n^{-1} \sum_{i=1}^{n} Z_i$ to predictions from the forest trained as in Algorithm 1.


cf.noprop = causal_forest(X[, selected.idx], Y, Z,
                          Y.hat = Y.hat, Z.hat = mean(Z),
                          tune.parameters = TRUE,
                          samples_per_cluster = 50,
                          clusters = school.id)
ATE.noprop = average_treatment_effect(cf.noprop)
> "95% CI for the ATE: 0.253 +/- 0.04"

n.synth = 1000; p.synth = 10
X.synth = matrix(rnorm(n.synth * p.synth), n.synth, p.synth)
Z.synth = rbinom(n.synth, 1, 1 / (1 + exp(-X.synth[, 1])))
Y.synth = 2 * rowMeans(X.synth[, 1:6]) + rnorm(n.synth)
...
cf.synth = causal_forest(X.synth, Y.synth, Z.synth,
                         Y.hat = Y.hat.synth, Z.hat = Z.hat.synth)
ATE.synth = average_treatment_effect(cf.synth)
> "95% CI for the ATE: 0.125 +/- 0.151"

cf.synth.noprop = causal_forest(X.synth, Y.synth, Z.synth,
                                Y.hat = Y.hat.synth, Z.hat = mean(Z.synth))
ATE.synth.noprop = average_treatment_effect(cf.synth.noprop)
> "95% CI for the ATE: 0.220 +/- 0.142"

4. Discussion

We applied causal forests to study treatment heterogeneity on a dataset derived from the National Study of Learning Mindsets. Two challenges in this setting involved an observational study design with unknown treatment propensities, and clustering of outcomes at the school level. Causal forests allow for an algorithmic specification that addresses both challenges. Of these two challenges, school-level clustering had a dramatic effect on our analysis. If we properly account for the clustering, we find hints of the presence of treatment heterogeneity (Section 2.3), but accurate non-parametric estimation of τ(x) is difficult (Section 2.2). In contrast, an analysis that ignores clusters claims to find very strong heterogeneity in τ(x) that can accurately be estimated (Section 3.1).

This result highlights the need for a deeper discussion of how to work with clustered observations when modeling treatment heterogeneity. The traditional approach is to capture cluster effects via "fixed effect" or "random effect" models of the form

$$Y_i = m(X_i) + Z_i \tau(X_i) + \beta_{A_i} + Z_i \gamma_{A_i} + \varepsilon_i, \tag{9}$$

where Ai ∈ {1, ..., J} denotes the cluster membership of the i-th sample, whereas βj and γj denote per-cluster offsets on the main effect and treatment effect respectively, and the nomenclature around fixed or random effects reflects modeling choices for β and γ (Wooldridge, 2010). In a non-parametric setting, however, assuming that clusters have an additive effect on Yi seems rather restrictive. The approach we took in this note can be interpreted as fitting a functional random effects model

$$Y_i = m_{A_i}(X_i) + Z_i \tau_{A_i}(X_i) + \varepsilon_i, \qquad \tau(x) = \mathbb{E}[\tau_j(x)], \tag{10}$$

where each cluster has its own main and treatment effect function, and the expectation above is defined with respect to the distribution of per-cluster treatment effect functions. It would be of considerable interest to develop a better understanding of the pros and cons of different approaches to heterogeneous treatment effect estimation on clustered data.
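For contrast with (10), if m(·) and τ(·) are taken to be linear, the additive specification (9) with fixed effects reduces to ordinary least squares; a sketch, assuming a covariate matrix X:

# Linear version of (9): per-school intercepts and treatment-effect offsets
fit.fe = lm(Y ~ X + Z + Z:X + factor(school.id) + Z:factor(school.id))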


References

Abadie, A., Athey, S., Imbens, G., and Wooldridge, J. (2017). When should you adjust standard errors for clustering? arXiv preprint arXiv:1710.02926.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.

Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623.

Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics, forthcoming.

Basu, S., Kumbier, K., Brown, J. B., and Yu, B. (2018). Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences, page 201711236.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.

Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal.

Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2018). Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research.

Ding, P., Feller, A., and Miratrix, L. (2016). Randomization inference for treatment effect variation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3):655–671.

Dorie, V., Hill, J., Shalit, U., Scott, M., and Cervone, D. (2017). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. arXiv preprint arXiv:1707.02641.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23.

Gopalan, M. and Tipton, E. (2018). Is the National Study of Learning Mindsets nationally representative? PsyArXiv. November, 3.

Hahn, P. R., Murray, J. S., and Carvalho, C. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv preprint arXiv:1706.09523.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1).

Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Kunzel, S., Sekhon, J., Bickel, P., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461.

Luedtke, A. R. and van der Laan, M. J. (2016). Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1):305–332.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv preprint arXiv:1712.04912.

Nosek, B. A. et al. (2015). Promoting an open research culture. Science, 348(6242):1422–1425.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available from: https://www.R-project.org/.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, pages 931–954.

Rosenbaum, P. R. (2002). Observational Studies. Springer.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In ICML, pages 3076–3085.

Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., and Li, B. (2009). Subgroup analysis via recursive partitioning. The Journal of Machine Learning Research, 10:141–158.

Tibshirani, J., Athey, S., Friedberg, R., Hadad, V., Miner, L., Wager, S., and Wright, M. (2018). grf: Generalized Random Forests (Beta). R package version 0.10.2. Available from: https://github.com/grf-labs/grf.

Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Yeager, D. S. et al. (2016). Using design thinking to improve psychological interventions: The case of the growth mindset during the transition to high school. Journal of Educational Psychology, 108(3):374.

Zhao, Q., Small, D. S., and Ertefaie, A. (2017). Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020.


Observational Studies 5 (2019) 52-70 Submitted 9/18; Published 8/19

Examining treatment effect heterogeneity using BART

Nicole Carnegie [email protected]

Montana State University, Bozeman, MT, USA

Vincent Dorie [email protected]

Columbia University, New York, NY, USA

Jennifer L. Hill [email protected]

New York University, New York, NY, USA

Keywords: Causal Inference, Bayesian Additive Regression Trees, Treatment Effect Modification, Group-structured Data

1. Methodology and Motivation

We were presented with the challenge of estimating causal effects using simulated data that was intended to roughly mirror "preliminary data extracted from the National Study of Learning Mindsets." In particular, we were asked to address three research goals:

1. Was the mindset intervention effective in improving student achievement?

2. Researchers hypothesize that the effect of the intervention is moderated by school-level achievement (X2) and pre-existing mindset norms (X1). In particular, there are two competing hypotheses about how X2 moderates the effect of the intervention: either it is largest in middle-achieving schools (a "Goldilocks effect") or it is decreasing in school-level achievement.

3. Researchers also collected other covariates and are interested in exploring their possible role in moderating treatment effects.

We discuss our approach to these three research goals as well as a summary of our results.

1.1 Assumptions

Given that the simulated dataset was based on data from a large-scale randomized experiment, the Learning Mindsets study, we were hopeful that the simulated data satisfied ignorability for the research questions posed. To be conservative, we assumed that ignorability was conditional on the full set of observed covariates; that is, Y(0), Y(1) ⊥ Z | X (Rubin 1979). In the post-workshop analyses we examined the sensitivity of our estimates to violations of ignorability and were satisfied that it was not an unreasonable assumption.

Given this ignorability assumption, any analysis would require appropriate conditioning on covariates to achieve unbiased estimates of E[Y(0) | X] and E[Y(1) | X]. We had two strategies for avoiding strong parametric assumptions. First, we checked that each variable we controlled for satisfied balance and overlap. This helped ensure that empirical counterfactuals existed for all observations. Second, we used a very flexible modeling strategy for estimating these conditional expectations.

At different stages in our analysis, we made several different types of modeling choices with respect to the grouped data structure. Each has its own set of assumptions. When we represented this structure through school-specific fixed effects, α, our ignorability assumption generalized to Y(0), Y(1) ⊥ Z | X, α. When we instead modeled school-level variation as varying intercepts (or "random effects"), we imposed the additional assumption that the random effects were uncorrelated with the (school-level aggregates of) covariates and treatment indicator. This would be violated if an unobserved school-level covariate was predictive of both school-level treatment rates and mean response.

We had no way of knowing whether SUTVA was satisfied. We performed analyses under the assumption that it was satisfied.

1.2 Choice of BART as the foundation of our approach

Without information about the true parametric form of the response surface, we opted for a method that flexibly fit the response surface. Recent evidence demonstrates the advantages of machine learning algorithms as an approach to causal effect estimation (for instance, Hill 2011; Dorie et al. 2018). Within this class of estimators, we prefer automated algorithms that have been integrated into Bayesian inferential frameworks. This combination allows for uncertainty quantification and is more flexible in accommodating several other complications such as grouped data structures and missing outcome data. One such modeling strategy, based on Bayesian Additive Regression Trees (BART; Chipman et al. 2007, 2010), already has a proven track record of superior performance in causal inference settings (Hill 2011; Hill et al. 2011; Hill and Su 2013; Dorie et al. 2016; Kern et al. 2016; Wendling et al. 2018). Here we briefly introduce BART and its uses for causal inference.

Bayesian Additive Regression Trees. The BART algorithm consists of a sum-of-trees model and a regularization prior. The prior avoids overfit by specifying the number of trees, the probability distribution for the size of each tree, the shrinkage applied to the fit from each tree, and the degrees of freedom for the prior distribution for the residual standard error. Interested readers can find more information on the model, prior, and fitting algorithms in Chipman et al. (2007, 2010). The key point is that BART can be used to flexibly fit even highly nonlinear response surfaces, which is consistent with our goal to fit E[Y(1) | X] − E[Y(0) | X] without making undue parametric assumptions.

BART for causal inference. It is straightforward to use BART to estimate the average treatment effect (ATE). First fit BART to the observed data (Y given Z and X). Next make predictions for two datasets (Hill 2011). X is kept intact for both; however, in one all treatment values are set to 0, and in the other they are all set to 1. This allows BART to draw from the posterior distribution for E[Y(1) | X] and E[Y(0) | X] for each person, implying we can also obtain draws from E[Y(1) − Y(0) | X] for each person. These posterior distributions for individual-level treatment effects can then be aggregated to obtain posterior distributions of average treatment effects either for the full dataset or any subset thereof.
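A minimal sketch of this procedure using the dbarts package follows; object names are hypothetical, and our actual fits additionally include the estimated propensity score and school effects, as described below:

library(dbarts)
n = nrow(X)
# Stack two copies of the covariates: everyone untreated, everyone treated
x.test = rbind(cbind(X, Z = 0), cbind(X, Z = 1))
fit = bart(x.train = cbind(X, Z = Z), y.train = Y, x.test = x.test,
           verbose = FALSE)
# fit$yhat.test has one column per test row; draws of E[Y(1) - Y(0) | X]
ite.draws = fit$yhat.test[, (n + 1):(2 * n)] - fit$yhat.test[, 1:n]
ate.draws = rowMeans(ite.draws)            # posterior draws of the ATE
quantile(ate.draws, c(0.025, 0.975))       # 95% credible interval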

Adding the propensity score as a covariate. While the approach described in Hill (2011) has good properties across a variety of settings (Hill et al. 2011; Hill and Su 2013; Dorie et al. 2016; Kern et al. 2016; Wendling et al. 2018), recent work (Hahn et al. 2017) reveals situations where the performance of BART can be compromised due to regularization-induced confounding. While this is less of a concern in settings like the present one, in which covariates are outnumbered by observations and data are well-behaved, in general the "best practice" recommendation for using BART for causal inference is to guard against this potential source of bias. One approach to doing so, suggested by Hahn et al. (2017), is to include an estimate of the propensity score as a covariate. We used BART to fit a propensity score model (as described below) and included the estimate in response models.

Cross-validation to choose hyperparameters. BART tends to perform well using the default prior specification described by Chipman et al. (2007), but performance can sometimes be improved by choosing hyperparameters via cross-validation (Chipman et al. 2010). This is particularly important when using BART for non-continuous outcomes, a case for which off-the-shelf BART is currently not optimized (Dorie et al. 2016).¹

Overlap. BART has certain advantages over propensity score approaches to evaluating overlap, which can be misled by covariates that are strongly predictive of the treatment but are not strongly associated with the outcome. Therefore, in addition to checking overlap marginally for each covariate and for the propensity score, we also checked using a BART-specific approach as in Hill and Su (2013); results are described in Section 3.

Causal inference with group-structured data. The data have a multilevel structure. Treatment was assigned at the individual level, but students are grouped within schools. A primary goal was to decide whether and how to model this structure. During the workshop, we presented results from a fixed-effects specification. Ultimately, our preferred model for the response surface is the random-effects specification. However, robustness checks (below) reveal that this choice made little difference in overall results.²

1. Murray (2017) has derived models for a wide class of generalized linear model extensions to BART that are optimized for binary, count, and multiple-category outcomes; however, these are not yet available in shareable software.

2. We used fixed effects for the propensity score model, since random effects with a binary response are not yet implemented in dbarts, or elsewhere in R, to our knowledge.


Overview of preferred modeling strategy. Our preferred modeling strategy proceeds as follows: 1) Fit a propensity score model with BART using all covariates and including school ID as a fixed effect, using cross-validation to choose hyperparameters (75 trees with k of 8). 2) Fit a response model on observed covariates and the estimated propensity score with BART including schools as random effects, using cross-validation to choose hyperparameters (350 trees with k = 1.5). For both fits we run 4 chains with 1000 iterations each (in addition to 500 burn-in iterations). Given the symmetry of the posterior distributions of interest, we report credible intervals based on normal approximations.

2. Results from analyses run for the workshop

During the workshop we discussed assumptions and addressed the questions posed.

2.1 Checking assumptions

Our first step was to check balance and overlap of covariates between treatment groups. We checked each covariate individually as well as the propensity score and found overwhelming support for both balance and overlap (see Figures A1 and A2 in the Supplemental Appendix). We revisit overlap in Section 3 with more sophisticated diagnostics.

2.2 Goal 1: Intervention effectiveness

We addressed this question by using BART in the manner described above with a focus on estimating an average causal effect and associated uncertainty interval. The posterior distribution of this effect is reasonably symmetric, so we reported only an effect estimate (posterior mean) of 0.248 with a 95% credible interval of (0.227, 0.270). By this measure, we deem the intervention to be effective on average. The choice of grouping adjustment makes little difference to the estimate of the overall ATE, leading to differences of less than .003 in posterior means and interval endpoints.

2.3 Goal 2: Moderation by specific covariates

We had several related strategies for exploring moderation. These capitalize on the fact that BART provides a posterior distribution of the causal effect for each observation. It is thus straightforward to examine the relationship between the expected effect for each person (represented by the mean of the corresponding posterior distribution) and any covariate of interest. We can do the same with respect to school-level effects and covariates. We present a few of the myriad methods for portraying these relationships.

The role of urbanicity. Before discussing our results for Goal 2, we address an important discovery made in our pursuit of that goal. While exploring the role of X1 ("fixed mindset") and X2 ("achievement") as moderators, we created scatterplots of individual treatment effect estimates (posterior means) versus X1 and X2. These plots revealed a group of schools with average treatment effects substantially lower than the rest, as displayed in the left-most plots of Figure 1.³ Fitting a regression tree for the individual effects given all covariates (using the rpart package in R) easily identified the five-category, school-level covariate XC, or "urbanicity", as the culprit. Color-coding by urbanicity levels displays this visually. Posterior distributions of the treatment effect for each urbanicity level, displayed in the right-most plot of this figure, would have alerted us to this phenomenon as well. Of course, researchers do not typically check for moderation with respect to all covariates (and in fact are often discouraged from doing so out of fear of data snooping). Therefore, in the absence of a specific hypothesis about urbanicity, the substantial differences in treatment effects across its levels might have gone undetected with a more traditional test of moderation.

Figure 1: The role of urbanicity. [Panels plot individual ATE estimates against X1 (fixed mindset) and X2 (achievement), and show individual ATEs by urbanicity level.]

Moderation by X1 and X2. Given the distinctive role that urbanicity plays in predicting

school-level treatment effects, we opted to subtract the variation due to urbanicity from

the school-level treatment effects. This was accomplished by centering the posterior mean

individual ATEs on the average individual ATE within urbanicity category before computing

school-level averages. In practice, this choice would be made in conjunction with the applied

researcher, since it subtly changes the nature of the research question. In essence, we are

now examining whether treatment effects vary across schools with the same urbanicity

rating that differ in their average level of fixed mindset or their achievement.
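A minimal sketch of this centering step; the data frame d, with columns tau_hat, xc, and schoolid, is an illustrative stand-in for our posterior summaries:

    # d: one row per student, with posterior-mean individual effect (tau_hat),
    # urbanicity category (xc), and school ID (schoolid); all names hypothetical.
    d$tau_centered <- d$tau_hat - ave(d$tau_hat, d$xc)      # remove urbanicity means
    school_ate <- tapply(d$tau_centered, d$schoolid, mean)  # school-level averages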

For the workshop, we presented plots of the relationship between the school-level treatment effects and each of these potential moderators as lowess curves with uncertainty bounds, as in Figure 2. These provide weak evidence of moderation by X1, with a trend towards smaller effects for schools that had higher levels of fixed mindset. Similarly, there appears

to be some evidence for a positive association between school achievement and treatment effect. However, we weren’t satisfied with using the default uncertainty bounds provided by ggplot for lowess. Our post-workshop analyses provide more clarity regarding these trends and associated uncertainty, but do not alter our overall conclusions.

3. Actually, we first created lowess plots of these relationships. These masked this phenomenon! This is a testimony to the power of plotting your data!


Figure 2: Lowess representations of X1 (left) and X2 (right) as moderators.

2.4 Goal 3: Moderation by other covariates

We presented moderation plots similar to those above for each of the continuous covariates

(X3, X4, and X5); these are displayed in the Supplemental Appendix as Figure A3. We

made more informative plots after the workshop; our conclusions did not change.

For binary covariates, we assessed moderation using the posterior distribution of the

difference in ATE between groups. For “first-generation status” (C3), we observe a mean

difference of -0.025 with 95% credible interval (-0.018, 0.060); the treatment effect is slightly

(but not significantly) lower for first-generation students. There is no evidence of a difference

in treatment effects by gender (difference estimate: -0.0095, 95% CI: (-0.44, 0.32)).

For multi-category factors, we began by simply examining side-by-side boxplots of the

adjusted individual ATEs by level. As can be seen in Figure A4 in the Supplemental

Appendix, there appears to be little evidence of a race effect. It is possible that there is a

trend of increasing ATE as student expected success increases.

3. Post-workshop analyses

After the workshop we examined a few issues in more depth, as summarized here.

3.1 Re-examination of modeling choices

Our initial comparison of modeling strategies with regard to the grouping variable only

considered differences in the overall ATE and corresponding uncertainty intervals. Given


the focus on moderation, we are interested in understanding whether the individual ATE

estimates varied much based on this choice. Figure A5 in the Supplemental Appendix

presents scatter plots of individual ATEs across all pairs of the three choices. There is

little difference in estimates excluding the school variable and including it as a fixed effect.

However, modeling the grouped structure using a random effect creates a noticeably wider

distribution of individual-level effects.

We also examined more closely the impact of including the propensity score, by comparing our results to models without this feature. The estimate of the ATE in a model that

excludes propensity score from the covariate set is 0.249 with associated credible interval

(0.228, 0.270). This is almost identical to that of our preferred analyses. The correlation

between posterior means of individual effects between these analyses is 0.896. We provide

a more detailed comparison in the Supplemental Appendix.

3.2 Revisiting Moderation

We redid some of our original moderation analyses for several reasons. First, we found a

way of graphically displaying our uncertainty about the relationship trends that is easier

to interpret. Second, we wanted to more explicitly test the “Goldilocks” hypothesis posed

by the research team. The analyses reported here are net of the impact of urbanicity on

the treatment effects. Figure A7 in the Supplemental Appendix displays similar results

without adjustment for urbanicity. These relationships are so dominated by the urbanicity-specific treatment effect differentials that they nearly all demonstrate a “reverse Goldilocks” phenomenon. We start by discussing moderation by school-level achievement, X2, since

the hypotheses regarding X2 are more complicated.

Moderation by X2. We explore the research questions about potential moderation by

X2 in two ways. Our first approach partitions X2 into 3 subgroups using a recursive

partitioning algorithm. Then treatment effects are averaged within subgroup to create

draws from the posterior distribution of the average treatment effect for “low”, “medium”,

and “high” values of X2. We can compare the “low” and “medium” subgroups or the

“high” and “medium” subgroup by differencing the corresponding posterior distributions,

as displayed on the left side of Figure 3. The posterior probability that the average treatment

effect for the medium subgroup is greater than for the low subgroup is 99.9%. However, the effect for the medium subgroup is not likely to be greater than that of the “high” subgroup; indeed, we find a posterior probability of 97.8% that the high subgroup has a larger average effect.

This analysis does not provide evidence for the Goldilocks effect.
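A minimal sketch of this subgroup comparison, assuming a draws-by-students matrix tau_draws of posterior draws and a subgroup factor x2_group (both illustrative names):

    # tau_draws: (n_draws x n_students) posterior draws of individual effects;
    # x2_group: each student's X2 subgroup from the recursive partitioning step.
    ate_by_group <- sapply(c("low", "medium", "high"),
                           function(g) rowMeans(tau_draws[, x2_group == g]))

    mean(ate_by_group[, "medium"] > ate_by_group[, "low"])    # posterior P(mid > low)
    mean(ate_by_group[, "high"]   > ate_by_group[, "medium"]) # posterior P(high > mid)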

Second, we display on the right side of Figure 3 a scatter plot of school-level average

treatment effects (net of urbanicity-specific means) versus X2 with a sample of quadratic

fits to the posterior draws to illustrate our uncertainty about this fit. While limited by its

simplistic parametric form, examining the coefficient of the squared term offers a straightforward test for a rise-then-fall relationship. The posterior mean indicates a slight Goldilocks effect; however, the posterior uncertainty in the squared-term coefficient is consistent with no effect: there is only an approximate 68.7% posterior probability of this term being negative. Even cursory visual inspection of Figure 3 discounts the alternative hypothesis that higher school-level achievement is associated with smaller treatment effects.

Figure 3: Left: Histograms of posterior distributions of differences in school average treatment effects between the medium and low X2 subgroups (top, p = 0.004) and the high and medium subgroups (bottom, p = 0.006). Right: Posterior distributions for school average treatment effects as a function of X2, after controlling for XC. Points are the posterior means of school average treatment effects and vertical lines show associated 95% posterior credible intervals. Curved lines show 30 samples from the posterior distribution of quadratic regressions fit to the school average effects (gray lines) and the posterior mean of all such regressions (black line).
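A minimal sketch of this squared-term test; school_te_draws and x2 are illustrative names for the urbanicity-adjusted school ATE draws and school-level achievement:

    # school_te_draws: (n_draws x n_schools) matrix of urbanicity-adjusted
    # school ATE draws; x2: school-level achievement. Fit a quadratic to each
    # posterior draw and summarize the squared-term coefficient.
    quad_coef <- apply(school_te_draws, 1, function(te)
      coef(lm(te ~ x2 + I(x2^2)))["I(x2^2)"])

    mean(quad_coef < 0)   # posterior probability of a rise-then-fall shape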

Moderation by X1. The relationship between school-level treatment effects and fixed

mindset, X1, is decreasing without strong evidence of quadratic curvature, as displayed in

Figure 4. The probability that the linear part of this decreasing trend is less than zero is

approximately 96.6%, and the probability that the quadratic part is less than zero is 75.8%.

Figure 4: Posterior distributions for school average treatment effects as a function of X1, after controlling for XC. All else is as described in Figure 3.

Moderation by the other school-level continuous covariates. In Figure 5 we display moderation plots similar to those in the previous section for the remaining continuous school-level variables: “minority composition” (X3), “poverty concentration” (X4), and “school size” (X5). The posterior probabilities that the linear terms are negative are approximately 84.0%, 73.0%, and 10.6% respectively, while the corresponding probabilities for the quadratic terms are 90.8%, 91.4%, and 6.3%. This provides some evidence of a Goldilocks effect for poverty concentration, but nothing earth-shattering.


Figure 5: Posterior distributions for school average treatment effects as a function of X3,

X4, and X5, net of XC. All else is as described in Figure 3.

Moderation by Student Expected Success: A closer look. We return to examine

moderation by S3, student expected success, because our initial results provided some

evidence for a moderated effect but we performed no formal tests.⁴ Figure 6 displays a

line plot connecting the posterior means of the ordered categories along with corresponding

credible intervals. The right plot tests whether levels 6 and 7 have larger effects than those

below; there is moderate support (91% probability) for this hypothesis.

4. The Supplemental Appendix provides a somewhat similar reanalysis for the race variable.


Figure 6: Left: Means and 95% credible intervals of posterior distributions (impact of XC

removed) for each level of ordered categorical variable S3 presented as a line plot.

Right: Posterior distribution for the difference in mean effects between the two

top levels of S3 and the rest. About 91% of this distribution lies above zero.

3.3 More formal checks of assumptions

BART has already been incorporated into diagnostic frameworks to examine the plausibility

of two crucial causal assumptions: ignorability and overlap.

Dorie et al. (2016) demonstrate how BART can be incorporated into a sensitivity analysis

framework to help researchers understand under what conditions their results might be

sensitive to unobserved confounding. This approach is available in the treatSens package

on CRAN as the function treatSens.BART. The results from this sensitivity analysis are

displayed in Figure A10 in the Supplementary Appendix. This plot reveals that the level of

unobserved confounding would need to be extremely strong in order to remove the estimated

positive effect. This sensitivity analysis strongly supports our assumption of ignorability.⁵
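A hedged sketch of the call: the formula style follows the package’s response ~ treatment + confounders convention, but any further arguments and the exact plotting helper should be checked against the treatSens documentation.

    library(treatSens)

    # Illustrative call only: Y is the outcome, Z the treatment, and the X's
    # the school-level confounders; argument details are assumptions.
    sens <- treatSens.BART(Y ~ Z + X1 + X2 + X3 + X4 + X5)
    sensPlot(sens)   # sensitivity contours akin to Figure A10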

Our covariate-by-covariate examination of overlap in the previous section (results displayed in the Supplemental Appendix) provided strong evidence in support of marginal

balance and overlap. In the Supplemental Appendix we present two additional looks at

the issue. The first, displayed on the left side of Figure A9, is a scatter plot of the joint

distribution (estimated posterior means of) Y (0) and Y (1) for each observation for both

treated (red) and control (blue) observations that suggests strong overlap.

To investigate local overlap (as suggested in the workshop discussion) we calculated an

overlap statistic recommended by Hill and Su (2013). For each person we calculate the ratio

of the variance of the posterior distribution of their counterfactual outcome relative to the

variance of the posterior distribution for their factual outcome. The distribution of these

ratios is displayed in the right panel of Figure A9. We gauge the extremity of any such ratio relative to a chi-squared distribution; the 10% cutoff would be about 2.7. No ratios even come close to this threshold, suggesting that overlap is present locally as well as marginally.

5. The treatSens package does not yet accommodate random effects, so it was run with fixed effects.
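A minimal sketch of this variance-ratio diagnostic; the matrix names are ours, and the df = 1 reference is our reading of the 2.7 cutoff:

    # y_cf_draws, y_f_draws: (n_draws x n_students) posterior draws of each
    # student's counterfactual and factual outcomes from the BART fit.
    ratio  <- apply(y_cf_draws, 2, var) / apply(y_f_draws, 2, var)
    cutoff <- qchisq(0.90, df = 1)   # roughly 2.7, matching the 10% rule
    sum(ratio > cutoff)              # students flagged for poor local overlap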

4. Discussion

Several features make BART a powerful tool for causal inference. The sum-of-trees model

flexibly fits even highly non-linear response surfaces. The Bayesian inferential framework

allows us to easily quantify our uncertainty not only about the average treatment effect

and individual-level treatment effects but also any functions of the potential outcomes (all

without re-using our data). The recent extensions that accommodate varying treatment

effects extend the applicability of this tool to simple multilevel data structures.

We used BART to address the questions posed and found strong evidence of a large positive average effect of the intervention (the “effect size” is about 0.4 and the credible interval has near zero probability of covering 0). Urbanicity strongly moderates this treatment effect; therefore we addressed the other questions after adjusting for this.⁶ Net of urbanicity, we

find some evidence of moderation at the school level: the school-level mean of students’

fixed mindsets, X1, is moderately negatively associated with the size of effect; achievement,

X2, is moderately positively associated. Of all student-level variables there is most support

for moderation by student expected success, S3.

Our results are predicated on satisfying several assumptions—ignorability, overlap, etc.—

that in some situations can be heroic. The BART extensions that easily allow examination

of the evidence for and implications of these assumptions add credibility to our analyses.

We found strong support that these assumptions were satisfied.

6. In a “real-life” situation where we could interact with the applied researcher, we might make a different choice based on their understanding of the theoretical questions of primary interest. However, we were making decisions in the absence of such an interaction.

References

Chipman, H., George, E., and McCulloch, R. (2007). Bayesian ensemble learning. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4(1):266–298.

Dorie, V., Carnegie, N. B., Harada, M., and Hill, J. (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine, 35(20):3453–3470.

Dorie, V., Hill, J., Shalit, U., Scott, M., and Cervone, D. (2018). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, accepted with discussion.

Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. ArXiv e-prints.

Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Hill, J. and Su, Y.-S. (2013). Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children’s cognitive outcomes. Annals of Applied Statistics, 7(3):1386–1420. Available from: http://dx.doi.org/10.1214/13-AOAS630.

Hill, J. L., Weiss, C., and Zhai, F. (2011). Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research, 46:477–513.

Kern, H. L., Stuart, E. A., Hill, J. L., and Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target samples. Journal of Research on Educational Effectiveness, 9:103–127.

Murray, J. S. (2017). Log-linear Bayesian additive regression trees for categorical and count responses. ArXiv e-prints.

Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74:318–328.

Wendling, T., Jung, K., Callahan, A., Schuler, A., Shah, N., and Gallego, B. (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine.

Appendix A. Supplemental Appendix to Carnegie et al. Discussion

A.1 Additional workshop analyses

Overlap plots. We examined the overlap and balance of each of the covariates marginally

through a variety of plots; see Figure A1 and Figure A2. These demonstrated a high degree

of both balance and overlap.


Figure A1: Overlap in student-level binary variables (left) and multiple-level categorical

variables (right)


Figure A2: Overlap in school-level continuous variables


Effect modification by continuous variables: lowess plots. For the workshop, we

presented plots of the relationship between the school-level treatment effects and each of

the potential moderators as lowess curves with uncertainty bounds. Figure A3 gives the

resulting plots for school-level continuous covariates X3 through X5.


Figure A3: Lowess representations of X3 (left) through X5 (right) as moderators.

Effect modification by categorical variables: boxplots. For categorical variables, we used simple side-by-side boxplots to evaluate potential effect modification. There was little evidence of an effect of race using this method, but some suggestion of an increasing effect with student expected success (S3). In particular, it appears that the treatment effect is larger for students whose expected success is greater than 5.


Figure A4: Boxplots of adjusted individual ATE by level of S3 (left) and C1 (right) as moderators.

A.2 Additional results from post-workshop analyses

The impact of group-level modeling strategies. Figure A5 displays scatterplots of individual ATEs showing how they vary across adjustment methodologies: no adjustment (that is, including school ID as a continuous covariate so that BART is forced to make splits based on differences between contiguous categories), fixed effects, and random effects.

Figure A5: Scatterplots of individual ATE estimates across adjustment methodologies.

Impact of adding the propensity score as a covariate. We discuss in the main text the high degree of correspondence between the individual-level treatment effect estimates produced using an estimation strategy that includes the estimated propensity score as a covariate versus one that excludes it. However, a scatter plot of the two sets of estimates (Figure A6) reveals an interesting feature: these estimates appear to come from a mixture of two subpopulations. When we predict the random effect estimates using a regression tree, S3 emerges as by far the strongest predictor, highlighting the pattern we saw in Figure A4.

Treatment Effect Modification when urbanicity has not been netted out. The plots in the main text display results examining associations between covariates and treatment effects. We felt it was also important to examine how different these results might be if we had decided not to net out urbanicity (XC). Figure A7 shows school average treatment effects as a function of X2 with levels of XC highlighted by color. Urbanicity category 4 is markedly below the others, and it complicates one of the primary objectives of this exercise: characterizing the moderating effect of X2. Not only are the hypotheses that X2 has a “Goldilocks” impact on the treatment effect or that it steadily decreases effectiveness ruled out; if urbanicity is not controlled for, one can reach an opposite conclusion, that of least effect in the middle range. Consequently, all subsequent analyses are done by controlling for urbanicity and subtracting out the level average effects.

Figure A6: Scatterplot of individual ATE estimates using random effects models with and without the propensity score included as a predictor.

Moderation by Race: A closer look. We revisited moderation by race after the workshop to implement some more formal tests. Figure A8 shows the racial average treatment effects after controlling for urbanicity (XC). It provides some evidence of racial moderation of the treatment effect; however, many of the effects are consistent across race categories. The highest and lowest race average treatment effects are estimated with a considerable degree of uncertainty due to their sample sizes; however, the posterior distribution of the difference between the highest and lowest racial averages, based on their posterior means, yields a borderline “statistically significant” difference. The distribution over this difference has 5.2% probability assigned to negative values, so that a one-sided posterior credible interval would just barely include 0. However, this contrast was chosen after looking at the plots and without a clear hypothesis about “race level 11” as a specific moderator. Therefore we see such analyses as exploratory.

Diagnostics that assess plausibility of the ignorability and overlap assumptions. Figure A9 displays evidence regarding overlap based on BART output, as described in the main text.

Sensitivity to unobserved confounding. We see from Figure A10 that the amount of confounding necessary to substantively change our results would be quite extreme and certainly far exceeds the current levels of associations with observed covariates.


Figure A7: Posterior distributions for school average treatment effects as a function of (left to right, top to bottom) X1, X2, X3, X4, and X5, with urbanicity (XC) not controlled. The points are the posterior means in each school, while the curved line is the posterior mean of quadratic regressions fit to the school average treatment effects.


Figure A8: Left: Posterior means and 95% credible intervals of average treatment effects

for each race category after controlling for XC. Right: Histogram of posterior

samples for the difference between the highest and lowest treatment effect races.


Figure A9: Left: Overlap across treatment (red) and control (blue) groups with regard

to distribution of Y(0) and Y(1). Right: Distribution of variance ratios for

counterfactual versus factual outcomes.

Figure A10: Sensitivity to unobserved confounding.


Observational Studies 5 (2019) 71-82 Submitted 7/19; Published 8/19

Machine Learning Analysis of Heterogeneity in the Effect of Student Mindset Interventions

Fredrik D. Johansson [email protected]

Institute for Medical Engineering & Science

Massachusetts Institute of Technology

Abstract

We study heterogeneity in the effect of a mindset intervention on student-level performance through an observational dataset from the National Study of Learning Mindsets (NSLM). Our analysis uses machine learning (ML) to address the following associated problems: assessing treatment group overlap and covariate balance, imputing conditional average treatment effects, and interpreting imputed effects. By comparing several different model families we illustrate the flexibility of both off-the-shelf and purpose-built estimators. We find that the mindset intervention has a positive average effect of 0.26, 95%-CI [0.22, 0.30], and that heterogeneity in the range of [0.1, 0.4] is moderated by school-level achievement level, poverty concentration, urbanicity, and student prior expectations.

Keywords: Machine learning, interpretability, counterfactual estimation

1. Methodology and Motivation

Machine learning (ML) has had widespread success in solving prediction problems in applications ranging from image and speech recognition (LeCun et al., 2015) to personalized medicine (Kononenko, 2001). This makes it an attractive tool also for studying heterogeneity in causal effects. In fact, ML excels at overcoming well-known limitations of traditional methods used to solve this task. For example, matching methods struggle to perform well when confounders and covariates are high-dimensional (Rubin and Thomas, 1996); generalized linear models are not flexible enough to discover variable interactions and non-linear trends; and propensity-based methods suffer from variance issues in estimation (Lee et al., 2011). In contrast, supervised machine learning has proven highly useful in discovering patterns in high-dimensional data (LeCun et al., 2015), approximating complex functions and trading off bias and variance (Swaminathan and Joachims, 2015).

In this observational study based on data from the National Study of Learning Mindsets (NSLM), we apply both off-the-shelf and purpose-built ML estimators to characterize heterogeneity in the effect of a student mindset intervention on future academic performance. In particular, we compare estimates of conditional average treatment effects based on linear models, random forests, gradient boosting and deep neural networks. Below, we introduce the problem of discovering treatment effect heterogeneity and describe our methodology.

1.1 Problem setup

We study the effect of a student mindset intervention based on observations of 10391 students in 76 schools from the NSLM study. The intervention, assigned at student level, is represented by a binary variable Z ∈ {0, 1} and the performance outcome by a real-valued variable Y ∈ ℝ. Students are observed through covariates S3, C1, C2, C3 and schools through covariates X1, ..., X5.¹ For convenience, we let X = [S3, C1, C2, C3, X1, ..., X5]ᵀ represent the full set of covariates of a student-school pair. We let (xij, zi, yi) denote the observation corresponding to a student i ∈ {1, ..., m} in a school j ∈ {1, ..., n}. As each student is enrolled in at most one school, we omit the index j in the sequel. Observed treatment groups G0 (control) and G1 (treated) are defined by Gz = {i ∈ {1, ..., m} : zi = z}. The full dataset of observations is denoted D = {(x1, z1, y1), ..., (xm, zm, ym)}, and the density of all variables p(X, Z, Y).

We adopt the Neyman-Rubin causal model (Rubin, 2005), and denote by Y(0), Y(1) the potential outcomes corresponding to interventions Z = 0 and Z = 1 respectively. The goal of this study is to characterize heterogeneity in the treatment effect Y(1) − Y(0) across students and schools. As is well known, this effect is not identifiable without additional assumptions, as each student is observed in only one treatment group. Instead, we estimate the conditional average treatment effect (CATE) with respect to observed covariates X:

τ(x) = E[Y(1) − Y(0) | X = x]    (1)

CATE is identifiable from observational data under the standard assumptions of ignorability,

Y(1), Y(0) ⊥⊥ Z | X,

consistency, Y = Z·Y(1) + (1 − Z)·Y(0), and overlap (positivity),

∀x : p(Z = 0 | X = x) > 0 ⇔ p(Z = 1 | X = x) > 0.

The CATE conditioned on the full set of covariates X is the closest we get to estimating the treatment effect for an individual student. However, to design policies, it is rarely necessary to have this level of resolution. In later sections, we estimate conditional effects also with respect to subsets or functions of X, such as the average effect stratified by school achievement level. By first identifying τ(x) and then marginalizing it with respect to such functions, we adjust for confounding to the best of our ability.

1.2 Methodology overview

The flexibility of ML estimators creates both opportunities and challenges. For example, it is typical for the number of parameters of the best performing model on a task to exceed the number of available samples. This is made possible by mitigating overfitting (high variance) through appropriate regularization. Indeed, many models achieve the best fit only after having carefully set several tuning parameters that control the bias and variance trade-off. It is standard practice in ML to use sample splitting for this purpose. Here, we apply such a pipeline to CATE estimation, proceeding through the following steps.

1. Split the observed data D into two partitions, a training set Dt and a validation set Dv, for parameter fitting and model selection respectively.

2. Fit estimators f0, f1 of potential outcomes Y(0) and Y(1) to Dt and select tuning parameters based on held-out error on Dv.

3. Impute CATE, τi := f1(xi) − f0(xi), for every student in D and fit an interpretable model h(x) ∼ τ to characterize treatment effect heterogeneity.

¹ The meanings of the different covariates are described in later sections of the manuscript.

This pipeline allows us to find the best fitting (black-box) estimators possible in Step 2 without regard for the interpretability of their predictions. By fitting a simpler, more interpretable model to the imputed effects in Step 3, we may explain the predictions of the more complex model in terms of known quantities. This procedure is particularly well suited when the effect is a simpler function than the response, and it also allows us to control the granularity at which we study heterogeneity.

The data from NSLM has a multi-level nature; students (level 1) are grouped into schools (level 2) and each level is associated with its own set of covariates. The literature is rich with studies of causal effects in multi-level settings; see for example Gelman and Hill (2006). However, this is primarily targeted towards studying the effects of high-level (e.g. school-level) interventions on lower-level subjects (e.g. students), and the increased uncertainty that comes with such an analysis. While interventions are assigned at student level, it is important to note that only 76 values of school-level variables are observed, which introduces the risk of overfitting to these covariates specifically. We adjust for the multi-level nature of the data in sample splitting, bootstrapping and the analysis of imputed effects.

In the following sections, we describe each step of our methodology in detail.

Step 1. Sample splitting

To enable unbiased estimation of prediction error and select tuning parameters, we divide the dataset D into two parts, with 80% of the data used for the training set Dt and 20% for a validation set Dv. We partition the set of schools, rather than students, making sure that the entire student body of any one school appears only in either Dt or Dv. This is to mitigate overfitting to school-level covariates. As there are only 76 schools, random sampling may create sets that have very different characteristics. To mitigate this, we balance Dt and Dv by constructing a large number of splits uniformly at random and selecting the one that minimizes the Euclidean distance between summary statistics of the two sets. In particular, we compare the first and second order moments of all covariates. We increase the influence of the treatment variable Z by a factor 10 in this comparison to ensure that treatment groups are split evenly across Dt and Dv.
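A minimal sketch of this balanced splitting scheme; the school-level summary table and the ×10 weighting of the treated fraction are illustrative assumptions:

    # schools: one row per school, with numeric columns of covariate first and
    # second moments plus the school's treated fraction; all names hypothetical.
    set.seed(1)
    stats <- scale(cbind(schools, 10 * schools$treated_frac))  # upweight Z by 10

    best <- Inf
    for (s in 1:1000) {
      tr <- sample(nrow(schools), round(0.8 * nrow(schools)))
      d  <- sqrt(sum((colMeans(stats[tr, ]) - colMeans(stats[-tr, ]))^2))
      if (d < best) { best <- d; train_schools <- tr }
    }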

Step 2. Estimation of potential outcomes

The conditional average treatment effect is the difference between expected potential outcomes, here denoted µ0 and µ1. Under ignorability w.r.t. X (see above), we have that

µz(x) := E[Y(z) | X = x] = E[Y | X = x, Z = z] for z ∈ {0, 1},

and thus τ(x) = µ1(x) − µ0(x). A straightforward route to estimating τ is to independently fit the conditionals E[Y | X = x, Z = z] for each value of z ∈ {0, 1} and compute their difference. This has recently been dubbed the T-learner approach to distinguish it from other learning paradigms (Künzel et al., 2017). Below, we briefly cover theory that motivates this method and point out some of its shortcomings. To study heterogeneity, we consider several T-learners as well as two alternative approaches described below.

We approximate µ0, µ1 using hypotheses f0, f1 and measure their quality by the mean squared error. The group-conditional expected and empirical risks are defined as follows:

Rz(fz) := E[(µz(x) − fz(x))² | Z = z]   (expected group-conditional risk),
R̂z(fz) := (1/|Gz|) ∑_{i ∈ Gz} (f(xi; θ) − yi)²   (empirical group-conditional risk).    (2)

We never observe µz directly, but learn from noisy observations y. Statistical learning theory helps resolve this issue by bounding the expected risk in terms of its empirical counterpart and a measure of function complexity (Vapnik, 1999). For hypotheses in a class F with a particular complexity measure CF(δ, n) with logarithmic dependence on n (e.g. a function of the covering number), it holds with probability greater than 1 − δ that

∀fz ∈ F : Rz(fz) ≤ R̂z(fz) + CF(δ, n)/√n − σ²Y,    (3)

where σ²Y is a bound on the expected variance in Y (see Johansson et al. (2018) for a full derivation). This class of bounds illustrates the bias-variance trade-off that is typical for machine learning and motivates the use of regularization to control model complexity. In our experiments, we consider several T-learner models that estimate each potential outcome independently using regularized empirical risk minimization, solving the following problem:

fz = argmin_{f(·; θ) ∈ F} R̂z(f(x; θ)) + λ r(θ)    (4)

Here, f(x; θ) is a function parameterized by θ and r(θ) a regularizer of model parameters such as the ℓ1-norm (LASSO) or ℓ2-norm (Ridge) penalties. In our analysis, we compare four commonly used off-the-shelf machine learning estimators: ridge regression, random forests, gradient boosting and deep neural networks.
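As a minimal T-learner sketch with the ridge penalty (using glmnet; for brevity we select lambda by cross-validation rather than the school-level validation split, and all object names are illustrative):

    library(glmnet)

    # X: covariate matrix; z: treatment indicator; y: outcome (all hypothetical).
    f0 <- cv.glmnet(X[z == 0, ], y[z == 0], alpha = 0)  # ridge fit, control arm
    f1 <- cv.glmnet(X[z == 1, ], y[z == 1], alpha = 0)  # ridge fit, treated arm

    # Impute CATE as the difference of predicted potential outcomes
    tau_hat <- predict(f1, newx = X, s = "lambda.min") -
               predict(f0, newx = X, s = "lambda.min")
    mean(tau_hat)   # plug-in ATE estimate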

Sharing power between treatment groups. A drawback of T-learners is that no information is shared between estimators of different potential outcomes. In problems where the baseline response Y(0) is a more complex function than the effect τ itself, the T-learner is wasteful in terms of statistical power (Künzel et al., 2017; Nie and Wager, 2017). As an alternative, we apply the Treatment-Agnostic Representation (TARNet) neural network architecture of Shalit et al. (2017). TARNet estimates all potential outcomes {Y(z)} jointly as compositions fz(x) := hz(Φ(x)) of treatment-specific hypotheses hz(Φ) and treatment-agnostic representations Φ(x). Trained by minimizing the overall empirical risk, as described in (4), this choice of architecture encourages sharing of information across treatment groups in learning the average response, while capturing heterogeneity in treatment effects. For an illustration comparing T-learners and TARNet, see Figure 1.

Generalizing across treatment groups

The careful reader may have noticed that the population and empirical risks in equations (2)–(4) are defined with respect to the observed treatment assignments. To estimate CATE, we want our estimates of potential outcomes to be accurate for the counterfactual assignment as well. In other words, we want the risk on the full cohort,

R(fz) := E[(µz(x) − fz(x))²],

to be small. When treatment groups p(X | Z = 0) and p(X | Z = 1) are highly imbalanced, the expected risk within one treatment group may not be representative of the risk on the full population. This is another drawback of T-learner estimators, which do not adjust for this discrepancy.

Figure 1: Illustration of the T-learner estimator (left) and the Treatment-Agnostic Representation (TARNet) architecture (right) (Shalit et al., 2017). TARNet estimators learn representations of covariates Φ(x) that are shared between treatment groups and model different potential outcomes Y(z) as functions of Φ. Counterfactual Regression (CFR) (Shalit et al., 2017) extends TARNet by regularizing Φ to encourage balance across different treatment groups.

In recent work, Shalit et al. (2017) characterize the difference between R(fz) and Rz(fz) and bound the error in CATE estimates using a distance metric between treatment groups. In particular, they considered the integral probability metric (IPM) family of distances, defined with respect to a function family G and densities p, q as

IPM_G(p, q) := sup_{g ∈ G} | ∫ g(x)(p(x) − q(x)) dx |,

resulting in the following relation between population and treatment group risk:

R(fz) ≤ Rz(fz) + IPM_G(p(X | Z = 0), p(X | Z = 1)),    (5)

where R(fz) is the population risk, Rz(fz) the treatment group risk, and the IPM term measures treatment group imbalance. This bound holds under appropriate assumptions and inspired the estimator Counterfactual Regression (CFR), in which the TARNet architecture (see above) is trained to minimize the upper bound in (5) applied to the learned representation Φ, instead of the empirical risk. This encourages balance between treatment groups in the learned representation space. In our analysis, we apply CFR with G the family of functions in the reproducing-kernel Hilbert space defined by the Gaussian RBF-kernel; the resulting IPM is known as the Maximum Mean Discrepancy (Gretton et al., 2012) and may be estimated efficiently from samples.
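A minimal sketch of a plug-in estimate of the squared MMD with a Gaussian RBF kernel; the bandwidth sigma is an illustrative choice:

    # Squared-exponential kernel matrix between rows of A and B
    rbf <- function(A, B, sigma = 1) {
      d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
      exp(-d2 / (2 * sigma^2))
    }

    # Plug-in estimate of squared MMD between control (X0) and treated (X1) rows
    mmd2 <- function(X0, X1, sigma = 1) {
      mean(rbf(X0, X0, sigma)) + mean(rbf(X1, X1, sigma)) -
        2 * mean(rbf(X0, X1, sigma))
    }
    # Usage: mmd2(X[z == 0, ], X[z == 1, ]); values near zero indicate balance.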

Figure 2: Examination of overlap through covariate marginal distributions (left) and a low-dimensional t-SNE projection of covariates (right) (Maaten and Hinton, 2008). Marker color corresponds to treatment assignment Z. Best viewed in color.

Step 3. Characterization of heterogeneity in CATE

After fitting models f0, f1 for each potential outcome, the conditional average treatment effect is imputed for each student by τi = f1(xi) − f0(xi). Unlike with linear regressors, the predictions of most ML estimators are difficult to interpret directly through model parameters. For this reason, ML models are often considered black-box methods (Lipton, 2016). However, in the study of heterogeneity, it is crucial to characterize for which subjects the effect of an intervention is low and for which it is high. To accomplish this, we adopt the common practice of post-hoc interpretation: fitting a simpler, more interpretable model h ∈ H to the imputed effects {τi}.

In its very simplest form, h(xi) may be a function of a single attribute, such as the school size, effectively averaging over other attributes. This is usually a good way of discovering global trends in the data but will neglect meaningful interactions between variables, much like a linear model. As a more flexible alternative, we also fit decision tree models and inspect the learned trees. Trees of only two variables may be visualized directly in the covariate space, and larger trees in terms of their decision rules.
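For instance, a one-variable summary can be as simple as averaging imputed effects within bins; a sketch with illustrative names:

    # tau_hat: imputed CATEs; covs$X5: school size. Binning by quintile and
    # averaging reveals the global trend in the effect along one attribute.
    bins <- cut(covs$X5, breaks = quantile(covs$X5, 0:5 / 5), include.lowest = TRUE)
    tapply(tau_hat, bins, mean)   # mean imputed CATE per school-size quintile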

2. Workshop results

We present the first results of our analysis as shown in the workshop Empirical Investigation of Methods for Heterogeneity at the Atlantic Causal Inference Conference, 2018.

2.1 Covariate balance

First, we investigate the extent to which the overlap assumption holds true by comparing the covariate statistics of the treatment and control groups. In Figure 2, we visualize the marginal distributions of each covariate, as well as a 2D t-SNE projection of the entire covariate set (Maaten and Hinton, 2008). The observed difference between the marginal covariate distributions of the two treatment groups is very small. Also, the non-linear t-SNE projection reveals little difference between treatment groups. The less imbalance between treatment groups, the closer our problem is to standard supervised learning. Said differently, the density ratio p(Z = 1 | X)/p(Z = 0 | X) is close to 1.0 and the IPM distance between conditional distributions, see (5), is small. Hence, we expect neither propensity re-weighting nor balanced representations (e.g. CFR) to have a large effect on the results.

Estimator   ATE [95% CI]          R² [95% CI]
Naïve       0.30                  —
RR          0.26 [0.22, 0.29]     0.26 [0.17, 0.29]
RF          0.27 [0.23, 0.30]     0.25 [0.22, 0.29]
GB          0.26 [0.21, 0.30]     0.25 [0.20, 0.30]
NN          0.27 [0.17, 0.38]     0.14 [−0.08, 0.23]
TARNet      0.26 [0.23, 0.30]     0.27 [0.21, 0.31]
CFR         0.26 [0.22, 0.30]     0.27 [0.22, 0.31]

(a) ATE and held-out R² score with 95% school-level cluster bootstrap confidence intervals. (b) Histogram of CATE (CFR), centered near ATE = 0.26 (not reproduced here).

Figure 3: Comparison between the naïve estimator, T-learners, and representation learning methods (left). Heterogeneity in treatment effect estimated by CFR (right).

2.2 Estimation of potential outcomes

In our analysis, we compare four T-learners based on ridge regression (RR), random forests (RF), gradient boosting (GB), and neural networks (NN). In addition, we compare the representation-learning algorithms TARNet and Counterfactual Regression (CFR) (Shalit et al., 2017). For each estimator family, we fit models of both potential outcomes on the training set Dt and select tuning parameters based on the held-out R² score on the validation set Dv. To estimate uncertainty in model predictions, we perform school-level bootstrapping of the training set (Cameron et al., 2008), fitting each model to each bootstrap sample.² In Figure 3(a), we give the estimate of the average treatment effect (ATE) from each model, the held-out R² score of the fit of factual outcomes, and 95% confidence intervals based on the empirical bootstrap. In addition, we give the naïve estimate of the ATE: the difference between observed average outcomes in the two treatment groups.

We see that all methods produce very similar estimates of ATE and perform comparably in terms of R². As expected, based on the small covariate imbalance shown in the previous section, the regression adjusted estimates are close to the naïve estimate of the ATE. This likely also explains the small difference between TARNet and CFR, as even for moderate to large imbalance regularization, the empirical risk dominates the objective function. The performance of the neural network T-learner would likely be improved with a different choice of architecture or tuning parameters. This is consistent with Shalit et al. (2017), in which the TARNet architecture achieved half of the error of the T-learner on the IHDP benchmark.

² The bootstrap analysis was added after the workshop results, but is presented here for completeness.
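A sketch of the school-level cluster bootstrap; a percentile interval is shown for simplicity, and fit_and_ate is a hypothetical refitting helper:

    # d: student-level data frame with a schoolid column (names hypothetical).
    set.seed(1)
    ates <- replicate(500, {
      boot_schools <- sample(unique(d$schoolid), replace = TRUE)
      boot_d <- do.call(rbind, lapply(boot_schools, function(s) d[d$schoolid == s, ]))
      fit_and_ate(boot_d)   # refit the chosen estimator, return its ATE
    })
    quantile(ates, c(0.025, 0.975))   # 95% bootstrap confidence interval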

Figure 4: Heterogeneity in causal effect estimated using counterfactual regression (CFR), stratified by different covariates (X1 school-level student mindsets, X4 school poverty concentration, XC urbanicity, X5 school size, S3 student's expectations, X2 school achievement level). Bars indicate variation in point estimates across subjects.

2.3 Heterogeneity in causal effect

We examine further the CATE for each student imputed by the best fitting model. As CFR had a slight edge in R² over T-learning estimators (although confidence intervals overlap) and has stronger theoretical justification, we analyze the effects imputed by CFR below. In Figure 3b, we visualize the distribution of imputed CATEs. We see that for almost all students, the effect is estimated to be positive, indicating an improvement in performance as an effect of the mindset intervention. Recall that the average treatment effect was estimated to be 0.26. Around 95% of students were estimated to have an effect in the range [0.05, 0.45]. For reference, the mean of the observed outcome was 0.10 and the standard deviation 0.64.

To discover drivers of heterogeneity, we fit a random forest estimator to the imputed effects and inspect the feature importance of each variable, i.e. the frequency with which it is used to split the nodes of a tree. The five most important variables of the random forest were X1, X4, XC, X5 and S3. In Figure 4 we stratify imputed CATE with respect to these variables, as well as X2, which is of interest to the study organizers. We see a strong trend that the effect of the intervention decreases with prior school-level student mindset, X1. The urbanicity of the school, XC, is a categorical variable for which Category D appears to be associated with a substantially lower effect. In contrast, the effect of the intervention appears to increase with students’ prior expectations, S3. One of the questions of the original study was whether there exists a “Goldilocks effect” for school achievement level X2, meaning that the intervention only has an effect for schools that are neither achieving too poorly nor too well. These results can neither confirm nor reject this hypothesis.
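A sketch of the importance ranking (randomForest; we use the package's permutation importance as a stand-in for the split-frequency measure described above):

    library(randomForest)

    # tau_hat: CFR-imputed CATEs; covs: covariate data frame (names hypothetical).
    rf  <- randomForest(x = covs, y = as.numeric(tau_hat), importance = TRUE)
    imp <- importance(rf)[, "%IncMSE"]
    head(sort(imp, decreasing = TRUE), 5)   # top-5 candidate effect moderators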

Figure 5: Interpretation of CATE estimates using regression trees fit to pairs of covariates: X1 and X4 (school-level, left) and X1 and S3 (student-level, right). Each dot represents a single school (left) or student (right). The color represents the predicted CATE. Black lines correspond to leaf boundaries. Background color and numbers in boxes correspond to the average predicted CATE in that box. Best viewed in color.

3. Post-workshop analysis

Heterogeneity in the treatment effect may be a non-linear or non-additive function of observed covariates. Such patterns remain hidden when analyzing the CATE as a function of a single variable at a time or using linear regression. To reveal richer patterns of heterogeneity, we fit highly regularized regression tree models and inspect their decision rules. First, we consider only pairs of variables at a time. We note that for school-level variables, only 76 unique values exist, one for each school. To prevent overfitting to these variables, we require that each leaf in the regression tree contain samples from at least 10 schools. When student-level covariates are included, we require leaves to contain samples from at least 1000 students.
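A minimal sketch of such a pairwise tree fit, assuming a data frame `students` with imputed effects `cate_hat` (hypothetical names). The student-level leaf constraint maps directly onto rpart's minbucket option; the ten-school constraint has no built-in analogue and would require a custom splitting rule, so it is omitted here.

    library(rpart)

    # student-level pair: require at least 1000 students per leaf
    tree_x1_s3 <- rpart(cate_hat ~ X1 + S3, data = students, method = "anova",
                        control = rpart.control(minbucket = 1000, cp = 0.001))
    print(tree_x1_s3)  # decision rules and mean predicted CATE per leaf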

In Figure 5, we visualize trees fit to two distinct variable pairs. We note a very slight non-linear pattern in heterogeneity as a function of X1 (school-level student mindset) and X4 (school poverty concentration), and that X1 explains much of the variance observed at moderate values of X4 in Figure 4. We emphasize, however, that the sample size at the school level is small, and that the observed patterns have high variance. In the right-hand figure, S3 (student's expectations) appears associated with a larger effect only if the average mindset of the school is sufficiently high. This pattern disappears when using a linear model. In the Appendix, we show a regression tree fit to the entire covariate set.

4. Discussion

Machine learning offers a broad range of tools for flexible function approximation and provides theoretical guarantees for statistical estimation under model misspecification. This makes it a suitable framework for estimating causal effects from non-linear, imbalanced or high-dimensional observational data. The flexibility of machine learning comes at a price, however: many methods come with tuning parameters that are challenging to set for causal estimation; models are often difficult to optimize globally; and the interpretability


of models suffers. While progress has been made independently on each of these problems, a standardized set of tools has yet to emerge.

In the analysis of the NLSM data, machine learning appears well-suited to studying overlap, potential outcomes and heterogeneity in imputed effects. However, the analysis also opens some methodological questions. The multi-level nature of the covariates is not accounted for in most off-the-shelf ML models, and regularization of models applied to multi-level data has been studied comparatively less than for single-level data. In addition, as pointed out by several authors (Künzel et al., 2017; Nie and Wager, 2017), the T-learner approach to causal effect estimation may suffer from compounding bias and from wasting statistical power. This may be one of the reasons we observe a slight advantage for representation learning methods such as TARNet and CFR.

References

Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3):414–427.

Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.

Johansson, F. D., Kallus, N., Shalit, U., and Sontag, D. (2018). Learning weighted representations for generalization across designs. arXiv:1802.08598.

Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv:1706.03461.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Lee, B. K., Lessler, J., and Stuart, E. A. (2011). Weight trimming and propensity score weighting. PLoS ONE, 6(3):e18174.

Lipton, Z. C. (2016). The mythos of model interpretability. arXiv:1606.03490.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv:1712.04912.

Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.

Rubin, D. B. and Thomas, N. (1996). Matching using estimated propensity scores: relating theory to practice. Biometrics, 52(1):249–264.


Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085.

Swaminathan, A. and Joachims, T. (2015). Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823.

Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999.


Appendix A. Regression tree explanation of CATE

[Figure 6 appears here: a regression tree fit to the imputed CATE values over the full covariate set; leaf-level average imputed effects τ range from 0.10 (N = 1333) to 0.37 (N = 1063).]

Figure 6: Visualization of a regression tree fit to the imputed CATE values based on the full covariate set.


Observational Studies 5 (2019) 83-92 Submitted 7/19; Published 8/19

Matching with attention to effect modification in a data challenge

Luke Keele [email protected]

University of Pennsylvania

3400 Spruce St, Philadelphia, PA 19130

Samuel D. Pimentel [email protected]

University of California, Berkeley

367 Evans Hall, Berkeley, CA 94720

Abstract

For this data challenge, we implement a data analytic approach based on matching. First, we use matching to control for observed confounders. Next, we implement a randomization test that combines results from subgroups of matched pairs formed using potential effect modifiers. This analysis, which can be conducted in either a planned or an exploratory manner, is designed both to detect effect modification in specific subgroups and to exploit any variation in effect size to make the study more robust to bias from a hidden confounder. We find that accounting for possible effect modification does make the study results less sensitive to hidden bias. However, we also find that an exploratory analysis that does not fully exploit outcome information fails to fully discover possible effect modifiers.

Keywords: matching, effect modification, sensitivity analysis

1. Methodology and Motivation

1.1 Matching in observational studies

Matching is a flexible and intuitive tool for designing effective observational studies. In its simplest form, matching involves pairing each subject from a group receiving a treatment or exposure of interest to a control subject that appears similar on observed pre-treatment covariates. When observed covariates include all relevant confounders and sufficiently similar pairs are available, the resulting paired design may be analyzed as though it were a randomized study. Analyzing observational data using matched comparisons has many attractive features. For instance, problems with covariate overlap are quickly identified (Stuart 2010), the separation of design from analysis ensures valid inference even if multiple matched designs are initially considered (Rubin 2007), and the preservation of the original units of analysis permits helpful qualitative input from scientific collaborators (Rosenbaum 2002).

For our purposes, two such features are particularly important. One is the natural way matched designs capture information about treatment effect variation. Specifically, each distinct matched pair difference is an estimate of the treatment effect at a particular point in multidimensional covariate space. A single matched design may permit both an average treatment effect estimate over the entire sample and a more granular study of treatment effect heterogeneity or, in the parlance of the matching literature, effect modification by

©2019 Luke Keele and Samuel D. Pimentel.


pre-treatment covariates. The second key advantage of matching methods is the set of tools available for conducting sensitivity analysis after matching in order to assess the impact of possible violations of the strong ignorability, or "no unmeasured confounders," assumption. Since some form of this uncheckable assumption underlies all analysis of observational data, effective methods for causal inference should be able to demonstrate some robustness to a violation of this assumption. Sensitivity analysis can provide just such a certificate, or identify situations in which results are very fragile with respect to failures of ignorability.

1.2 Identifying effect modification in matched designs

Effect modification occurs when the magnitude of the treatment effect varies with the value of one or more pretreatment covariates, so that certain subgroups of matched pairs experience a different effect than others. A recent literature has studied methods for detecting and leveraging effect modification in matched observational studies (Hsu et al. 2013, 2015; Lee et al. 2018). These works identify two basic strategies for studying effect modification. In the first strategy, a small set of subgroups, perhaps defined by one or two nominal variables, is identified a priori based on scientific considerations. In addition to analyzing the data for the entire match, investigators conduct inference separately within each of these subgroups. This strategy is very similar to standard subgroup analysis in social and medical applications.

The second approach is to identify subgroups likely to experience effect modification in an adaptive way, from the data itself. Hsu et al. (2013) describe one way to do this by fitting a CART model to matched pair differences using a large set of pre-treatment covariates, and measuring treatment effects within the subgroups identified by the leaves of the resulting tree model. Importantly, these trees are fit not to the actual matched pair outcome differences but to their absolute values, or to the ranks of the absolute values. Using the signed outcomes to select subgroups for inference would violate the design principle of outcome blinding that is fundamental in matching approaches and would invalidate tests for treatment effects in the resulting groups. While the absolute values are an imperfect proxy for the actual matched pair differences, they preserve valid inference since randomization inference varies only the signs of these differences.
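A minimal sketch of this outcome-blinded step, under assumed names: `pair_df` holds one row per exactly matched pair with signed differences `tau_i`. Only the ranked magnitudes ever enter the fit.

    library(rpart)

    # ranks of absolute pair differences; the signs never enter the model
    pair_df$rank_abs <- rank(abs(pair_df$tau_i))

    blinded_tree <- rpart(rank_abs ~ X1 + X2 + XC + S3, data = pair_df,
                          method = "anova")
    # prune at the cp minimizing ten-fold cross-validated error
    cp_min <- blinded_tree$cptable[which.min(blinded_tree$cptable[, "xerror"]), "CP"]
    blinded_tree <- prune(blinded_tree, cp = cp_min)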

In principle these methods may be applied to any matched design, whether or not it was constructed with the study of effect modification in mind. However, a subtle aspect of the approaches just described is that they rely on pairs exactly matched on the potential effect modifier of interest. For example, if gender is a possible effect modifier, an effect may be estimated within a group of pairs in which both subjects are women and another effect may be estimated in a group of pairs in which both subjects are men. But if the study also contains pairs that include both a woman and a man, these last pairs are not directly useful for studying effect modification. One way to understand this principle is to recognize that effect modification is the result of an interaction between a treatment variable and a pre-treatment covariate. Pairs matched exactly on the covariate remove the main effect of the pre-treatment covariate and allow study of the interaction, while outcome differences in pairs with non-identical covariates may be partly explained by the main effect of the covariate and partly explained by the interaction.


In practice, this requirement means that investigators with an interest in effect modification must take care to ensure high proportions of matched pairs share identical values of the potential effect modifiers when they construct the match. This requirement stands in contrast to recent recommendations and techniques for matching that focus on achieving marginal balance, a less stringent goal than exact matching (Zubizarreta 2012; Pimentel et al. 2015). Note that this requirement applies both when subgroups of interest are identified a priori and when the exploratory technique is used; in the latter case the CART model is fit only to the subset of pairs matched exactly on all the covariates used in the fit.

1.3 Impact of effect modification on inference and sensitivity to bias

While understanding effect modification for a given dataset is usually scientifically interesting in itself, another important benefit is reduced sensitivity to bias in testing for the overall presence or absence of a treatment effect in the entire sample. In particular, matched data are often analyzed by using randomization inference to test Fisher-style sharp null hypotheses of no treatment effect. Such hypotheses specify that the treatment has no effect for any individual anywhere in the study, in contrast to the more common Neyman hypothesis framework, which specifies that an average effect is equal to zero but allows for nonzero individual effects. A rejection of the sharp null hypothesis of zero effect for a specific subgroup of the sample is also a rejection of the overall sharp null; note that this is not the case in a Neyman-style testing framework. If effect modification is structured so that certain small subgroups of the sample experience very large effects, a test that combines results from individual tests in several subgroups may be more powerful than a single test incorporating all the data. For example, Hsu et al. (2015) use a truncated product of p-values generated from tests in several subgroups.

More important yet, analyses that pay attention to effect modification may also reduce sensitivity to unmeasured bias. To make this claim clearer, we briefly review Rosenbaum's method of sensitivity analysis, described in greater detail in Rosenbaum (2005) and Rosenbaum (2002). Randomization inference in a matched design typically proceeds under the assumption that individuals in the same pair have the same true propensity to receive treatment; mathematically, their odds of treatment are assumed to be identical, allowing construction of a known and tractable randomization distribution. In sensitivity analysis, this assumption is relaxed so that the odds of treatment for paired individuals may differ by up to a multiplicative factor Γ. For any specific choice of Γ, a worst-case analysis may be conducted to obtain the largest possible p-value or the widest possible confidence interval that might have been achieved under these settings. Analysts often repeat the analysis for many Γ-values and report a threshold value at which the hypothesis test ceases to reject or at which the confidence interval barely covers 0. For example, say we observe an estimated p-value of 0.01. This p-value assumes Γ = 1, that is, the unobserved confounder does not change the odds of treatment within matched pairs. To perform a sensitivity analysis, we increase Γ until the worst-case p-value reaches or exceeds 0.05. If this occurs for a relatively small value of Γ, we can conclude that a confounder with a small effect on the treatment odds would change our conclusions. However, if the value of Γ is large when the bounds include zero, then we have greater confidence that a weak confounder would not change the conclusions of the statistical analysis.
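The Γ search just described might be carried out as in the following sketch, using the sensitivitymw package, which implements Huber-Maritz M-tests for matched pairs; this tooling is our illustration, not necessarily what was used here. `ymat` is a hypothetical n x 2 matrix with treated outcomes in the first column and matched control outcomes in the second.

    library(sensitivitymw)

    # report the worst-case (upper bound) p-value at each Gamma
    for (G in seq(1, 3, by = 0.25)) {
      out <- senmw(ymat, gamma = G)   # M-test sensitivity bound at this Gamma
      cat(sprintf("Gamma = %.2f  upper bound on p-value = %.4f\n", G, out$pval))
    }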


When effect modification is present, hypothesis tests for different subgroups of a sample may vary not only in their degree of statistical significance but in their sensitivity to bias. Results from multiple sensitivity analyses may be combined using the truncated product of p-values just as standard test results are combined, resulting in an overall sensitivity analysis, and Hsu et al. (2013) demonstrate that in this setting the level of sensitivity to bias tends to be driven by the subgroup with the strongest or most robust effect. As such, combining the results of several subgroup analyses may produce more reliable inferences, less sensitive to unmeasured confounding, than the results of a single test for the entire sample.

1.4 Study design for the workshop data

We demonstrate both matching and a search for effect modification in a synthetic dataset constructed from the National Study of Learning Mindsets for a workshop at the 2018 Atlantic Causal Inference Conference, described earlier in this journal issue in Carvalho et al. (2019). We approached the workshop data by forming a matched pair design and conducting randomization inferences and sensitivity analyses designed to detect and leverage possible effect modification. We first used methods based on a priori identification of two important effect modifiers, and then conducted an exploratory analysis designed to identify additional potential modifiers. Given the framing of the workshop data as an observational study, a primary concern guiding the design was the possibility of unmeasured confounding between the treatment and control groups.

When constructing the matched design itself, we sought to match exactly, as frequently as possible, on variables judged to be likely effect modifiers. While it is in general a difficult task to match exactly on many variables, in this case the large majority of covariates, including the two pre-identified possible effect modifiers X1 and X2, were measured at the school level. Since in addition each school contained proportionally similar numbers of treated and control students, we chose to match students within schools, thus ensuring that all pairs in the study shared identical values for all school-level covariates. (However, as discussed below, we also conducted an across-schools match for comparative purposes.) Within schools we matched students optimally on a robust Mahalanobis distance incorporating all student covariates. We also estimated a propensity score using all available covariates and imposed a matching caliper of 0.15 standard deviations of the estimated propensity score, requiring that all pairs formed contain individuals whose estimated propensity scores differ by no more than that amount. This ensures that in the absence of unmeasured confounding, matched individuals have near-identical odds of treatment, an important assumption underlying later randomization inferences. The caliper restriction prevented some treated units in the study from being matched, as described in the next section, but we retained a large proportion of the treated students.
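A sketch of such a within-school pair match with a propensity caliper, using the optmatch package; the names (`dat`, `treat`, `Y`, `school`, and the student covariates) are stand-ins for the challenge data, and this is an illustration rather than the exact match specification used.

    library(optmatch)

    # propensity model on all available covariates
    ps_mod <- glm(treat ~ . - Y, family = binomial, data = dat)

    # robust (rank-based) Mahalanobis distance on student covariates,
    # restricted to within-school comparisons
    mahal <- match_on(treat ~ S3 + C1 + C2 + C3, data = dat,
                      method = "rank_mahalanobis",
                      within = exactMatch(treat ~ school, data = dat))

    # match_on applied to a glm scales distances in propensity-score SD units,
    # so width = 0.15 corresponds to the 0.15-SD caliper described above
    cal <- caliper(match_on(ps_mod, data = dat), width = 0.15)

    pm <- pairmatch(mahal + cal, data = dat)
    summary(pm)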

To account for effect modification by X1 and X2, the a priori possible effect modifiers, we made two separate partitions of the matched pairs, one by quintiles of X1 and one by quintiles of X2. For each partition, we conduct a separate randomization test of the sharp null hypothesis using a member of the family of Huber-Maritz M-tests chosen to be insensitive to unmeasured bias (Rosenbaum 2007). We then combine the 5 p-values from these tests using the truncated product of p-values (Zaykin et al. 2002). We also conduct a sensitivity analysis for this procedure, reporting the threshold level of unobserved


confounding Γ at which the combined p-value ceases to be significant. For comparative purposes, we also conduct a Huber-Maritz M-test and sensitivity analysis for the overall sample. We also conducted exploratory analyses to detect effect modification using CART models fit to absolute matched pair differences, as discussed below.
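As an illustration of the combining step, the sketch below computes the truncated product statistic of Zaykin et al. (2002) and approximates its null distribution by Monte Carlo under independent uniform p-values; Zaykin et al. give a closed-form distribution, and the simulation and example p-values here are ours.

    # W is the product of the p-values at or below the truncation point tau
    truncated_product <- function(p, tau = 0.2) prod(p[p <= tau])

    combined_p <- function(p, tau = 0.2, nsim = 1e5) {
      w_obs  <- truncated_product(p, tau)
      w_null <- replicate(nsim, truncated_product(runif(length(p)), tau))
      mean(w_null <= w_obs)  # small W is evidence against the overall sharp null
    }

    combined_p(c(0.002, 0.03, 0.18, 0.40, 0.25))  # five quintile p-values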

2. Workshop Results

We first present a few details on the results of the two matches. First, there was very little observed bias in the data. To assess the quality of the match, we used a measure of standardized difference, which for a given variable is computed by taking the mean difference between matched units and dividing by the pooled standard deviation before matching (Silber et al. 2001; Rosenbaum and Rubin 1985; Cochran and Rubin 1973). In matching, we generally seek to make all standardized differences less than 0.10, or less than one-tenth of a standard deviation, which is often considered an acceptable discrepancy (Silber et al. 2001; Rosenbaum and Rubin 1985; Cochran and Rubin 1973; Rosenbaum 2010).
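For concreteness, the standardized difference just described can be computed as in this small sketch (variable and data names hypothetical).

    # mean difference between groups divided by the pooled pre-matching SD;
    # z is the treatment indicator
    std_diff <- function(x, z) {
      s_pool <- sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
      (mean(x[z == 1]) - mean(x[z == 0])) / s_pool
    }

    sapply(dat[, c("S3", "X1", "X2", "X3", "X4", "X5")], std_diff, z = dat$treat)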

Table 1: Balance Statistics for the Unmatched Data and Two Matched Comparisons

Variable      Std. Diff.   Std. Diff.            Std. Diff.
              Unmatched    Within School Match   Across School Match
S3              0.13         0.00                 -0.01
X1             -0.10         0.00                  0.01
X2              0.06         0.00                 -0.00
X3             -0.00         0.00                  0.00
X4             -0.02         0.00                 -0.00
X5              0.07         0.00                 -0.01
C1 Cat. 1      -0.04         0.00                  0.01
C1 Cat. 2      -0.00        -0.02                 -0.02
C1 Cat. 3      -0.02         0.03                  0.02
C1 Cat. 4       0.04        -0.04                 -0.07
C1 Cat. 5       0.01         0.03                  0.02
C1 Cat. 6       0.03        -0.01                  0.00
C1 Cat. 7       0.02         0.00                  0.03
C1 Cat. 8       0.02         0.02                  0.04
C1 Cat. 9      -0.03         0.02                  0.02
C1 Cat. 10      0.01         0.02                  0.03
C1 Cat. 11     -0.01         0.01                  0.05
C1 Cat. 12     -0.01         0.02                  0.02
C1 Cat. 13     -0.05         0.01                  0.01
C1 Cat. 14     -0.02        -0.01                  0.01
C1 Cat. 15      0.01         0.03                  0.04
C2              0.05         0.01                  0.00
C3             -0.10        -0.01                 -0.03
XC Cat. 1       0.01         0.00                  0.00
XC Cat. 2      -0.03         0.00                  0.00
XC Cat. 3       0.03         0.00                  0.01
XC Cat. 4      -0.05         0.00                 -0.00
XC Cat. 5       0.03         0.00                 -0.00

Table 1 contains balance statistics for both the unmatched data and the two matches we implemented. First, it is clear that there is little overt bias in the unmatched data. For example, the largest standardized difference across treated and control groups before


matching was 0.13. The matching further improves balance. After matching, with one exception, all the standardized differences were 0.05 or less. For the within-school match, the standardized differences are zero for all discretized school-level variables, since by design each pair is exactly matched on these variables. Table 2 further illustrates the primary difference between the two matches. Neither match matches exactly on student-level covariates, but the within-school match matches exactly on all school-level covariates, which is a major advantage for studying effect modification by these covariates.

Table 2: Proportion of pairs matched exactly on each variable

Variable   Across School Match   Within School Match
S3               0.79                  0.78
C1               0.84                  0.82
C2               0.90                  0.87
C3               0.92                  0.91
XC               0.98                  1.00
X1               0.90                  1.00
X2               0.90                  1.00
X3               0.90                  1.00
X4               0.90                  1.00
X5               0.90                  1.00

Table 3 contains the outcome estimates. We estimated treatment effects and confidence intervals after matching by inverting Huber-Maritz M-tests and computing Hodges-Lehmann estimates as described in Rosenbaum (2002); for the unadjusted case we used simple regression on the treatment indicator. The results from both matches are nearly identical, with treatment effect estimates of 0.26 and 0.27 for the across- and within-school matches respectively. Moreover, consistent with the fact that there is very little overt bias, the unadjusted estimate does not differ much from the adjusted estimates. The sample sizes do vary somewhat as a result of the matching. In the unmatched data, there were 3,384 treated units and 7,007 controls. In the across-school match, we matched 3,375 of the treated units one to one. In the within-school match, we matched 3,079 of the treated units one to one. Some loss of sample size is often a by-product of exact matching, especially when combined with a caliper restriction.

Next, we explored the possibility of effect modification by X1, focusing only on results from the within-school match. For this analysis, we simply stratified the matched pairs by the quintiles of X1 and estimated the treatment effect within each quintile, again by inverting randomization tests and computing Hodges-Lehmann estimates. Table 4 contains the results from this stratified analysis. The point estimates indicate quite clearly that the intervention appears to be more effective for students with lower X1 scores. However, the confidence intervals overlap to an extent that these estimates cannot be distinguished statistically.

Next, we repeated this analysis for the X2 covariate. Again, we stratified the matched pairs by the quintiles of the X2 variable and estimated the treatment effect within each


Table 3: Results from Outcome Analysis

                          Outcome estimate   95% CI
Unadjusted                      0.30         [0.28, 0.33]
Across School (Match 1)         0.26         [0.24, 0.29]
Within School (Match 2)         0.27         [0.25, 0.30]

Table 4: Treatment Effect Estimates Stratified by X1 Quintile

             Point Estimate   95% CI
Quintile 1        0.33        [0.27, 0.38]
Quintile 2        0.27        [0.21, 0.34]
Quintile 3        0.31        [0.26, 0.36]
Quintile 4        0.22        [0.16, 0.28]
Quintile 5        0.21        [0.14, 0.27]

quintile. Table 5 contains the results from this stratified analysis. Here the pattern is less clear; however, the treatment effect estimate in the lowest quintile is much lower. Again, the confidence intervals overlap to an extent that these estimates cannot be distinguished statistically.

Table 5: Treatment Effect Estimates Stratified by X2 Quintile

             Point Estimate   95% CI
Quintile 1        0.17        [0.11, 0.24]
Quintile 2        0.30        [0.24, 0.36]
Quintile 3        0.29        [0.24, 0.34]
Quintile 4        0.33        [0.28, 0.39]
Quintile 5        0.24        [0.19, 0.30]

Next, we explored whether effect modification by these two covariates reduced sensitivity to bias from unobserved confounders. Table 6 contains the results from three different sensitivity analyses. In the first sensitivity analysis, we calculated the upper bound on the one-sided p-value for the unconditional treatment effect. Here, we find that Γ = 2.51. This implies that a hidden confounder would need to change the odds of treatment within matched pairs by a factor of roughly 2.5 before the observed results could be fully explained under the null hypothesis. Next, using the methods we outlined above, we conducted two additional sensitivity analyses that account for effect modification by the mindset and school achievement variables. In both cases, we find the results are less sensitive to hidden bias, as the value of Γ is now 2.87 for the X1 effect modifier and 2.83 for the X2 effect modifier.


Table 6: Sensitivity Analysis for Unconditional and Conditional Treatment Effect Estimates

     Unconditional   X1 Effect Modifier   X2 Effect Modifier
Γ        2.51              2.87                 2.83

To conduct a more exploratory analysis of effect modification, we calculated within-pair differences in outcome for each matched pair in our study, which we denote τi. Following Hsu et al. (2013), we then regressed the rank of |τi| on possible effect modifiers using CART in order to identify subgroups with differential effects (without jeopardizing later inference by using the signs of the τi). We fit a CART model both to all covariates, school- and student-level, using only those pairs matched exactly on all variables, and to all school-level covariates using all pairs (since all pairs were matched exactly on school-level covariates). However, in neither case did the CART model make any splits after pruning using cross-validation, meaning that no subgroups for effect modification were identified.

3. Post-workshop analysis

We conducted one additional analysis after the workshop. Specifically, we were curious as to why the CART regression did not detect any effect modification. We suspected it was due to the fact that we used the unsigned differences in treated and control outcomes in the CART model. As we noted above, using the unsigned ranks is necessary so that the statistical tests on which the sensitivity analyses are based have the correct level. To explore what role this might have played in our analysis, we re-fit the CART regression to all matched pairs using the signed within-pair matched differences as outcomes and the school-level covariates as regressors. We pruned the fitted tree to reduce overfitting. The results are in Figure 1. Now the tree splits the data on both the X1 covariate and one of the indicators for the XC variable. Of course, a hypothesis test or sensitivity analysis based on this search procedure violates the design principle by using the same outcome information both to identify subgroups and to conduct inference.

4. Discussion

In our analysis, we demonstrated how to explore effect modification through a sensitivity analysis. Under this approach, investigators can both test whether the magnitude of the treatment effect varies with a baseline covariate and explore whether effect modification makes the results less sensitive to bias from hidden confounders. This approach is quite flexible, since it can be applied to both planned and exploratory analyses of effect modification. That is, one can use flexible tree-based methods to search for subgroups where treatment effects may be larger. Our analysis did reveal some interesting tradeoffs that are required when the analysis is exploratory. To yield valid tests, the exploratory model must be blind to some outcome information. However, we showed that ignoring the full outcome distribution caused us to overlook some evidence of effect modification.


[Figure 1 appears here: a regression tree splitting first on mindset >= −1.2 and then on an urbanicity indicator (urb2 >= 1), with node-level mean pair differences ranging from −0.19 to 0.24.]

Figure 1: The regression tree formed by fitting matched pair differences in outcomes on baseline covariates

We argue that this result has two implications. First, it may be valuable to clarify conditions under which effect modification cannot be detected without signed data, as occurred here. This line of research could provide guidance about exactly when and how researchers pay a cost by blinding themselves to the signs of pair differences. Second, we suggest a two-step analysis of effect modification when subgroups are to be selected from the data. In the first step, the CART procedure is carried out using partial outcome information to preserve the validity of a formal confirmatory test. In the second step, a purely exploratory analysis is conducted using the full outcome information. This allows recovery of strong signals in the data that may be hidden by the partially-blinded analysis, even if inference cannot easily be obtained.

Finally, a promising new direction for studying effect modification in matched sets, called the submax method, is presented in recent work by Lee et al. (2017). Simulations show that the submax method may have greater power to detect moderate effect modification than the approaches based on CART that we describe here. While it is beyond the scope of our current investigation to adapt the submax framework to the problem in the data challenge, we recommend its further study and generalization.

References

Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, and David Yeager. Assessing Treatment Effect Variation in Observational Studies: Results from a Data Challenge. Observational Studies, 5:21–35, 2019.

William G. Cochran and Donald B. Rubin. Controlling bias in observational studies. Sankhyā: The Indian Journal of Statistics, Series A, 35:417–446, December 1973.

Jesse Y Hsu, Dylan S Small, and Paul R Rosenbaum. Effect modification and design sensitivity in observational studies. Journal of the American Statistical Association, 108(501):135–148, 2013.

Jesse Y Hsu, Jose R Zubizarreta, Dylan S Small, and Paul R Rosenbaum. Strong control of the familywise error rate in observational studies that discover effect modification by exploratory methods. Biometrika, 102(4):767–782, 2015.

Kwonsang Lee, Dylan S Small, and Paul R Rosenbaum. A powerful approach to the study of moderate effect modification in observational studies. Biometrics, 2017.

Kwonsang Lee, Dylan S Small, Jesse Y Hsu, Jeffrey H Silber, and Paul R Rosenbaum. Discovering effect modification in an observational study of surgical mortality at hospitals with superior nursing. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(2):535–546, 2018.

Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527, 2015.

Paul R. Rosenbaum. Observational Studies. Springer, New York, NY, 2nd edition, 2002.

Paul R. Rosenbaum. Sensitivity analysis in observational studies. In Brian S. Everitt and David C. Howell, editors, Encyclopedia of Statistics in Behavioral Science, volume 4, pages 1809–1814. John Wiley and Sons, Chichester, UK, 2005.

Paul R Rosenbaum. Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics, 63(2):456–464, 2007.

Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, New York, 2010.

Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38, February 1985.

Donald B Rubin. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.

Jeffrey H Silber, Paul R Rosenbaum, Martha E Trudeau, Orit Even-Shoshan, Wei Chen, Xuemei Zhang, and Rachel E Mosher. Multivariate matching and bias reduction in the surgical outcomes study. Medical Care, 39(10):1048–1064, 2001.

Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1–21, 2010.

Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated product method for combining p-values. Genetic Epidemiology, 22(2):170–185, 2002.

Jose R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.


Observational Studies 5 (2019) 93-104 Submitted 7/19; Published 8/19

Heterogeneous Subgroup Identification with Observational Data: A Case Study Based on the National Study of Learning Mindsets

Bryan Keller [email protected]

Department of Human Development

Teachers College, Columbia University

New York, NY 10027, USA

Jianshen Chen [email protected]

Evaluation, and Research

College Board

Yardley, PA 19607, USA

Tianyang Zhang [email protected]

Department of Human Development

Teachers College, Columbia University

New York, NY 10027, USA

Abstract

In this paper, we use a two-step approach for heterogeneous subgroup identification with a synthetic data set motivated by the National Study of Learning Mindsets. In the first step, optimal full propensity score matching is used to estimate stratum-specific treatment effects. In the second step, regression trees identify key subgroups, based on covariates, for which the treatment effect varies. In working with regression trees, we emphasize the role of the cost-complexity tuning parameter, selected through permutation-based Type I error rate studies, in justifying inferential decision-making, which we contrast with graphical and quantitative exploration for future study. Results indicate that the mindset intervention was effective, overall, in improving student achievement. While our exploratory analyses identified XC, C1, and X1 as potential effect modifiers worthy of further study, we find no statistically significant evidence of effect heterogeneity, with the exception of urbanicity category XC = 3, and that finding is not robust to the propensity score estimation method.

Keywords: Heterogeneous Treatment Effect, Observational Studies, Propensity Score Matching, Regression Trees

1. Methodology and Motivation

1.1 Introduction

Despite the overwhelming focus on the overall average treatment effect (ATE) in the statistics and causal inference literatures, there are many scenarios in which the efficacy of a treatment may vary depending on unit background characteristics. Methods that target conditional average treatment effects can explain how pretreatment variables interact with

©2019 Bryan Keller, Jianshen Chen and Tianyang Zhang.


treatment exposure to cause heterogeneity in treatment efficacy. The identification of such heterogeneity, to the extent that it exists, is of tremendous interest to stakeholders because it can provide insight into which types of participants are likely to be helped the most, helped the least, or even harmed by an intervention. In this paper, we begin with an overview of the synthetic data set generated for the Workshop for Empirical Investigation of Methods for Heterogeneity, a workshop that co-occurred with the 2018 Meeting of the Atlantic Causal Inference Conference in Pittsburgh, PA. We then describe our approach for heterogeneous subgroup identification based on propensity score matching and regression trees. We then discuss the data analysis results presented at the workshop, followed by the results of further analyses conducted after the workshop. We conclude with some discussion.

1.2 The Data

The workshop data analyzed herein are synthetic, but were motivated by the National Study of Learning Mindsets, a randomized controlled trial of an intervention designed to encourage a growth mindset in high school students (Mindset Scholars Network, 2018). Approximately 10,000 cases, nested in 76 schools, were simulated to emulate an observational study based on four categorical student-level covariates and six numeric school-level covariates.

The three research questions we were asked to address for the workshop are as follows:

1. Was the mindset intervention effective in improving student achievement?

2. X1 is a measure of the average fixed mindset rating for each school; X2 is a measure of school-level academic achievement; both were measured before the intervention. Researchers suspect either that (a) the effect is largest in middle-achieving schools, or that (b) the effect is decreasing in school-level achievement. Is there any evidence that X1 and/or X2 moderate the effect of the intervention on student-level academic achievement?

3. Is there evidence that any other covariates moderate the intervention effect?

1.3 Notation

Let Y_i^1 and Y_i^0 be the potential outcomes (Neyman, 1923; Rubin, 1974) under the treatment (Z_i = 1) and comparison (Z_i = 0) conditions, respectively. The average treatment effect, or ATE, is defined as the average of individual treatment effects; that is, ATE = E[Y_i^1 − Y_i^0]. A conditional average treatment effect, or CATE, is defined as the average of individual treatment effects given that a vector of covariates X_i1, X_i2, ..., X_ip takes on particular values; that is, CATE = E[Y_i^1 − Y_i^0 | X_i1 = x_i1, X_i2 = x_i2, ..., X_ip = x_ip]. The propensity score, e_i(X_i) = pr(Z_i = 1 | X_i), is the probability that unit i is assigned to (or selects) the treatment group, given the observed covariates. For identification of the ATE and CATE, propensity score analysis and other conditioning strategies rely on the strong ignorability assumption (Rosenbaum and Rubin, 1983), which specifies

1. ignorability: the potential outcomes are independent of the treatment assignment given observed covariates X; that is, {Y^0, Y^1} ⊥⊥ Z | X,

2. reliable measurement: observed covariates X have been reliably measured (Steiner et al., 2011), and


3. positivity: the propensity score for each unit lies strictly between zero and one; that is, 0 < e_i(X_i) < 1 for all i.

The observed outcome for unit i, Y_i, is defined via the potential outcomes and the treatment indicator as Y_i = Z_i Y_i^1 + (1 − Z_i) Y_i^0.

1.4 Methodology

Our approach to heterogeneous subgroup identification is based on the fact that, under ignorability, X ⊥⊥ Z | e(X) (Rosenbaum and Rubin, 1983). That is, by conditioning on the propensity score, balance on the observed covariates across treated and comparison groups may be restored to what would have been expected in a randomized experiment; namely, covariate distributions are identical (in the limit) across groups. We use optimal full propensity score matching to stratify units into S strata, each of which contains at least one treated case and at least one comparison case. For each stratum s ∈ {1, ..., S}, the estimate of the stratum-specific treatment effect is calculated as the difference in sample averages, treated group minus comparison group. That is,

    ÂTE_s = (1/n_{T_s}) Σ_{i ∈ T_s} Y_i − (1/n_{C_s}) Σ_{i ∈ C_s} Y_i,

where T_s and C_s are, respectively, the sets of indices of the treated and comparison cases in stratum s, and n_{T_s} and n_{C_s} respectively represent the cardinalities of T_s and C_s. Once stratum-specific treatment effect estimates have been calculated, we use those values as estimates of the individual treatment effect for each unit in the stratum. We then regress the vector of individual treatment effects on the set of predictors using a single regression tree. Any predictors identified by the regression tree as important, meaning that the regression tree split on those variables, are interpreted as evidence for effect heterogeneity on the variable or variables involved in the splits.
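A minimal sketch of these two quantities, assuming the optmatch package and hypothetical names (`dat`, `treat`, `Y`, `school`); the within-school restriction mirrors the design described below.

    library(optmatch)

    ps_mod <- glm(treat ~ . - Y, family = binomial, data = dat)
    dat$stratum <- fullmatch(match_on(ps_mod, data = dat,
                                      within = exactMatch(treat ~ school, data = dat)),
                             data = dat)

    # difference in treated and comparison means within each stratum
    m1 <- with(dat, tapply(Y[treat == 1], stratum[treat == 1], mean))
    m0 <- with(dat, tapply(Y[treat == 0], stratum[treat == 0], mean))
    ate_s <- m1 - m0

    # each unit inherits its stratum's effect as an imputed individual effect
    dat$ite_hat <- ate_s[as.character(dat$stratum)]

    # overall ATE: stratum effects weighted by stratum size
    weighted.mean(ate_s, table(dat$stratum))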

1.4.1 Regression Trees

A regression tree is an algorithmic method invented by Breiman et al. (1984) that models the response surface for an outcome variable, Y, based on predictors, X_1, ..., X_p, by iteratively splitting units into subgroups based on rectangular regions of predictor values. At each iteration, a split creates two subgroups, called nodes, and a node that has not been split is referred to as a terminal node. For unit i in terminal node t, where N_t represents the set of units in t, the tree-predicted value for unit i is simply the mean score on the outcome variable for all units in that node: Ŷ_i = (1/|N_t|) Σ_{i ∈ N_t} Y_i. The deviance for a tree T, dev(T) = Σ_i (Y_i − Ŷ_i)^2, is used as a cost function to determine the split point at each iteration. After considering all possible splits on all possible variables, the split that yields the largest decrease in deviance is selected.

If left unchecked, regression trees would continue to split until each terminal node contained only one point. A commonly used approach to prevent this kind of overfitting is based on adding a term to the squared error that penalizes the number of terminal nodes, |T|, in tree T: C_cp(T) = dev(T) + cp |T|. This approach is referred to as cost-complexity


pruning, and is implemented in the rpart package (Therneau et al., 2015) in R (R Core Team, 2018), which we use to fit regression trees. The tuning parameter, cp, is analogous to the smoothing parameter in the lasso or regularized regression, and is typically selected through cross-validation.
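For reference, a sketch of cost-complexity pruning with rpart, reusing the hypothetical `dat` and imputed effects `ite_hat` from the matching step; the one-SE selection shown here is the rpart manual's heuristic discussed later in the paper.

    library(rpart)

    fit <- rpart(ite_hat ~ X1 + X2 + X3 + X4 + X5 + XC + C1 + C2 + C3 + S3,
                 data = dat, method = "anova",
                 control = rpart.control(cp = 1e-4, xval = 10))
    printcp(fit)  # CP table: cp, number of splits, CV error (xerror), SE (xstd)

    # one-SE rule: largest cp whose CV error is within one SE of the minimum
    tab    <- fit$cptable
    thresh <- min(tab[, "xerror"]) + tab[which.min(tab[, "xerror"]), "xstd"]
    cp_1se <- tab[which(tab[, "xerror"] <= thresh)[1], "CP"]
    pruned <- prune(fit, cp = cp_1se)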

2. Workshop Results

In the synthetic workshop data, school sample sizes for the 76 schools ranged from 14 to 529, with a median of 111 and a mean of 136.7. Furthermore, the treatment was non-randomly assigned within schools, such that each school sample contained a proportion of treated cases that ranged from about 17% to about 45%. This design feature allowed us to estimate propensity scores and create matches within schools.¹ As a result of within-school matching, all were matched exactly on the five continuously measured school-level covariates, X1, ..., X5. For the workshop, we used two methods to estimate propensity scores: random forests (RF) and generalized boosted modeling (GBM). Both methods are based on regression trees and, therefore, algorithmically handle interactions and nonlinear relationships.

2.1 Research Question 1

To address the first question, we used standard propensity score methodology and simply took weighted averages of the stratum-specific treatment effect estimates. The overall ATE was estimated to be 0.25 or 0.26, based on GBM or RF, respectively, for propensity score estimation. The distribution of estimated individual treatment effects, along with a vertical line denoting the average, is shown for the RF analysis in Figure 1. While the results suggest a positive treatment effect, we did not present standard errors, so we made no claims regarding evidence for an overall effect.

2.2 Research Question 2

We fit regression trees and varied the level of the complexity parameter to search for heterogeneity on X1 and X2. With analyses based on propensity scores estimated by RF and GBM, we noted, based on the regression tree output shown in Figure 2, that the treatment effect did appear to vary with X2 and X1.

Figure 2 shows the results of a regression tree fit based on random forests with a complexity parameter of 0.0033. Note that at the root node, the overall ATE is estimated to be 0.26 based on 8910 cases. The first split was at X2 = −0.71, which led to conditional ATE estimates of 0.12 for X2 < −0.71 and 0.29 for X2 ≥ −0.71. The next split was also on X2, thereby modeling a quadratic relationship. In particular, we see estimated CATEs of 0.23 for X2 ≥ 0.83 and 0.31 for −0.71 < X2 ≤ 0.83. In other words, the estimated average treatment effect for schools with academic achievement scores between −0.71 and 0.83 was 0.31, higher than the estimate of 0.12 for schools with pretest achievement below −0.71, and higher than the estimate of 0.23 for schools with pretest achievement above 0.83. Finally, the last split was on X1, suggesting that X1 and X2 interacted such that, for those

1. Note, however, that two schools, numbers 11 and 31, were dropped due to insufficient sample sizes of 21 and 14, respectively.


Figure 1: Average Treatment Effect Estimate by the Random Forest (RF) Method

Figure 2: Regression Tree Based on Regressing Individual Treatment Effect Estimates onObserved Covariates; Propensity Scores Estimated by Random Forests

schools with X2 values in the middle range between −0.71 and 0.83, the treatment was more effective for schools with fixed mindset scores lower than −0.37 at pretest.

Although we did examine the results of ten-fold cross-validation for cp produced by the rpart package, we encountered multiple situations in which the cross-validated error rate continued to decrease without bound as the value of the tuning parameter decreased (i.e., favoring more and more complex tree structures; see Figure 3 for an example). The advice given in the rpart manual (Therneau et al., 2015) is that "A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line," where the "horizontal line" represents one standard error above the minimum value of the


Figure 3: Cross-validated error based on output from package rpart; the horizontal line represents one standard error above the minimum value of the cross-validated error curve

cross-validated error curve. Despite this rule of thumb, we often encountered tree solutions that were very volatile at the one-SE mark. Thus, a limitation of the exploratory approach used for the workshop analyses is the lack of a rationale for the selection of the cp value, which had the potential to drastically impact results.

2.3 Research Question 3

We noted that the student-level variables C1, a fifteen-category race variable, and XC, a five-category urbanicity variable, were identified in some of the RF and GBM regression tree fits, but did not discuss their roles in detail.

3. Post-Workshop Analysis

For the post-workshop analyses, we included main-effects logistic regression (LR) for propensity score estimation, in addition to RF and GBM. Propensity score strata based on optimal full matching were created in each school, as described above. The number of strata per school varied both with the school sample size and with the method of propensity score analysis. For propensity scores estimated by GBM, for example, the number of strata per school ranged from 4 for school 13 (n = 24) to 161 for school 62 (n = 529), with a mean of 36 and a median of 27; the numbers of strata based on LR and RF were similar.

Furthermore, we ran a series of Type I error rate studies, using random permutation, to select cp values that yielded a 5% Type I error rate. Following Chen and Keller (2019), for each permutation we shuffled yoked outcome/treatment pairs while leaving covariate values fixed. Under this permutation scheme, the overall average treatment effect and the covariate marginal distributions and interrelationships remain unperturbed; meanwhile, any


dependence between covariates and individual treatment effects is destroyed, which provides recourse to the permutation null hypothesis of no effect (Rubin, 1980; Keller, 2012).
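A sketch of this calibration, with `impute_effects` standing in for the full matching-and-stratification pipeline described in Section 1.4 (all names here are hypothetical).

    library(rpart)

    false_positive_rate <- function(dat, cp, nperm = 200) {
      mean(replicate(nperm, {
        idx  <- sample(nrow(dat))
        perm <- dat
        perm[, c("Y", "Z")] <- dat[idx, c("Y", "Z")]   # shuffle yoked (Y, Z) pairs
        perm$ite_hat <- impute_effects(perm)           # hypothetical: PS match + strata
        tree <- rpart(ite_hat ~ . - Y - Z, data = perm, method = "anova",
                      control = rpart.control(cp = cp))
        nrow(tree$frame) > 1        # any split at all counts as a false positive
      }))
    }
    # calibrate: pick the cp for which false_positive_rate(dat, cp) is near 0.05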

For research questions 2 and 3 in the post-workshop analyses, we distinguish between testing and exploration. We test for effect heterogeneity by using cp values that were found, through permutation, to hold the rate of false positives to the nominal 5% level; results based on these cp values are appropriate for inferential decision-making. We explore (a) graphically, by examining graphical depictions of key relationships, and (b) quantitatively, by ranking variable importance ratings from random forest fits. Although these explorations are suitable for hypothesis generation for future study, they are not appropriate for inferential decision-making.

3.1 Research Question 1

We found that the desired nominal Type I error rate of approximately 5% was attained for GBM, RF, and LR, respectively, for cp values of 0.006, 0.008, and 0.006. The overall ATE estimates were hardly changed when using the cp values determined through permutation. For propensity score estimation via GBM, RF, and LR, respectively, the overall ATE estimates, with 95% nonparametric bootstrap confidence intervals (percentile method), were 0.25 (0.22, 0.30), 0.26 (0.22, 0.30), and 0.27 (0.24, 0.28).

3.2 Research Questions 2 & 3

3.2.1 Testing

For propensity scores estimated via LR, and with cp = 0.006, one split, on variable XC = 3, was flagged. No splits were identified using propensity scores estimated via RF with cp = 0.008, nor via GBM with cp = 0.006. Thus, we found some evidence of effect heterogeneity based on XC, but the finding was not robust to the propensity score specification. There was no evidence of significant effect heterogeneity for any other covariates.

3.2.2 Exploration

In Figure 4 we plot nonparametric regression curves to show the relationship between school-level estimates of the average treatment effect, on the vertical axis, and each school-level covariate. The notion that the intervention was more effective for schools in a "middle range" on X2 and with lower values on X1 is not inconsistent with the relationships shown in the first two panels of Figure 4.

In Figure 5, because the student-level covariates are categorical, we use conditional boxplots to show how individual treatment effect estimates vary by category across the five student-level covariates. We note what appears to be considerable variability in both median and interquartile range across levels of C1, the 15-category race variable. We also note a lower median for category XC = 3 as compared with the other categories of XC, the five-category urbanicity variable.

Finally, we fit random forests using the vector of individual treatment effect estimates as the outcome and the school- and student-level variables as predictors to calculate variable importance ratings. Because these data constitute a mix of continuously and categorically measured predictor variables, and especially because several of the categorical variables


Figure 4: Average School-Level Treatment Effect as a Function of School-Level Covariates

have many categories, traditional random forest variable importance (Breiman, 2001) will result in biased importance rankings by unjustly favoring variables with many categories (Strobl et al., 2007). Instead, we report variable importance from random forests based on conditional inference trees, as implemented in the R package party (Hothorn et al., 2006a), which produce unbiased importance values with multi-category predictors.

Conditional inference trees differ from traditional recursive partitioning approaches in that splitting is based on p-values for linear test statistics derived from permutation theory. The p-values are associated with tests of null hypotheses of conditional independence between each predictor and the response, given the tree structure. At each step, these statistics are aggregated to form a global test of the null hypothesis. If the result of the global test is not significant, splitting stops; thus, tree pruning is not needed. If the result of the global test is significant, the p-values for individual predictors are ranked, and the next split occurs on the variable with the smallest p-value. Because p-values are not affected by the scales of the predictor variables, fair comparisons may be made even among variables on different scales; see Hothorn et al. (2006b) for more details.
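A sketch of this step with the party package, assuming a data frame ite_df (our name, not the authors') whose column ite_hat holds the estimated individual treatment effects and whose remaining columns are the ten predictors, with categorical ones coded as factors:

```r
## Conditional random forest variable importance (Hothorn et al., 2006a).
library(party)

cf <- cforest(ite_hat ~ ., data = ite_df,
              controls = cforest_unbiased(ntree = 500))  # unbiased splits

## conditional importance adjusts for correlated predictors and avoids
## the bias toward many-category variables noted by Strobl et al. (2007)
vi <- varimp(cf, conditional = TRUE)
sort(vi, decreasing = TRUE)
```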

After fitting random forests based on conditional inference trees, we find that variable XC is ranked as the most important predictor of variability in the individual treatment effect across all three propensity score estimation methods: GBM, RF, and LR. The average importance ranks across the three PS estimation methods identify XC, X1, and C1 as the three most important predictors, respectively. Notably, X2 is among the three least important predictors of effect heterogeneity, according to the variable importance rankings.

Figure 5: Individual Treatment Effect as a Function of Student-Level Predictors by Propensity Score Estimation Method; GBM = Generalized Boosted Modeling, RF = Random Forests, LR = Logistic Regression

Figure 6: Variable Importance Rankings from Random Forest Runs Regressing the Individual Treatment Effect Estimates on the Ten Predictors of Interest

4. Discussion

We implemented a two-step approach to detect treatment effect heterogeneity, characterized by (1) optimal full propensity score matching within schools to estimate individual (stratum-specific) treatment effects, followed by (2) fitting a regression tree of the estimated individual treatment effects on covariates. In the analyses prepared for the workshop, we focused on the second research question by exploring the relationships between X1, X2, and the estimated school-level treatment effects. For the post-workshop analyses, we further demarcated analyses by distinguishing between testing and exploration.

In general, our analyses leaned heavily on the regression tree algorithm, which was used (a) in estimating propensity scores via random forests and boosted modeling, (b) to test for effect heterogeneity through regression tree analysis of individual treatment effect estimates, and (c) for additional exploration through conditional random forest variable importance. With respect to fitting regression trees, we noted that ten-fold cross-validation and the one-SE rule of thumb, both typically used to select the cost-complexity pruning parameter, cp, are inconclusive with respect to the Type I error rate. Instead, we used a simple permutation approach to select cp values that yielded the desired Type I error rate and enabled testing.

For the first research question, we found that the average intervention effect estimates produced by the different methods were all positive, with 95% bootstrap confidence intervals indicating that the mindset intervention was effective in improving student achievement. For the second and third research questions, we found evidence of heterogeneity based on membership in the third category of the urbanicity variable, but the finding was not robust to the propensity score estimation method. We found no other significant evidence of treatment effect modification. Based on the exploratory analyses, if we were to plan a follow-up study to search for effect modification, we would recommend focusing on the student-level urbanicity variable, XC, the student-level race variable, C1, and the school-level fixed mindset rating variable, X1. We would not recommend prioritizing X2, the school-level achievement variable.

As noted by Feller and Holmes (2009), the assumptions required for identification of CATEs are identical to those required for the overall ATE (i.e., strong ignorability, no interference between units, a single version of each treatment). We assume these key assumptions are met here. Furthermore, the usual recommended steps for specifying the propensity score, including iterative respecification to achieve acceptable balance on observed covariates and an examination of overlap, are also important, but details are omitted because our focus is on heterogeneous subgroup identification. Finally, resampling approaches such as the jackknife and the bootstrap may be used to obtain error bounds on ATEs and CATEs estimated via our two-step approach; however, care must be taken when using resampling techniques to estimate standard errors for estimators that involve matching (Abadie and Imbens, 2008; Austin and Small, 2014).

Acknowledgments

We thank Carlos Carvalho, Avi Feller, Jennifer Hill, and Jared Murray for organizing the workshop and inviting our submission. Jianshen Chen was employed by Educational Testing Service when this work was carried out.

References

Abadie, A. and Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76:1537–1557.

Austin, P. C. and Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics in Medicine, 33:4306–4319.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole, Monterey, CA.

Chen, J. and Keller, B. (2019). Heterogeneous subgroup identification in observational studies. Journal of Research on Educational Effectiveness. Available from: https://doi.org/10.1080/19345747.2019.1615159.

Feller, A. and Holmes, C. (2009). Beyond toplines: Heterogeneous treatment effects in randomized experiments. Technical report, University of Oxford.

Hothorn, T., Buehlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. (2006a). Survival ensembles. Biostatistics, 7(3):355–373.

Hothorn, T., Hornik, K., and Zeileis, A. (2006b). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15:651–674.

Keller, B. (2012). Detecting treatment effects with small samples: The power of some tests under the randomization model. Psychometrika, 77:324–338.

Mindset Scholars Network (2018). National study of learning mindsets. Available from: http://mindsetscholarsnetwork.org.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. In Roczniki Nauk Rolniczych, volume X, pages 1–51. In Polish; English translation by D. Dabrowska and T. Speed in Statistical Science, 5:465–472, 1990.

R Core Team (2018). R: A language and environment for statistical computing. Available from: http://www.R-project.org/.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75:591–593.

Steiner, P. M., Cook, T. D., and Shadish, W. R. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, pages 213–236.

Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25.

Therneau, T., Atkinson, B., and Ripley, B. (2015). rpart: Recursive partitioning and regression trees. R package version 4.1-9. Available from: http://CRAN.R-project.org/package=rpart.


Observational Studies 5 (2019) 105-117 Submitted 7/19; Published 8/19

Causaltoolbox—Estimator Stability for Heterogeneous Treatment Effects

Soren R. Kunzel∗ [email protected]
Department of Statistics
University of California, Berkeley

Simon J. S. Walter∗ [email protected]
Department of Statistics
University of California, Berkeley

Jasjeet S. Sekhon [email protected]

Department of Political Science & Department of Statistics

University of California, Berkeley

Abstract

Estimating heterogeneous treatment effects has become increasingly important in many fields: for example, such estimates are required to select a personalized treatment for a patient, which may be a life-or-death decision. Recently, a variety of procedures relying on different assumptions have been suggested for estimating heterogeneous treatment effects. Unfortunately, there are no compelling approaches that allow identification of the procedure whose assumptions hew closest to the process generating the data set under study, and researchers often select one arbitrarily. This approach risks making inferences that rely on incorrect assumptions and gives the experimenter too much scope for p-hacking. A single estimator will also tend to overlook patterns other estimators could have picked up. We believe that the conclusions of many published papers might change had a different estimator been chosen, and we suggest that practitioners should evaluate many estimators and assess their similarity when investigating heterogeneous treatment effects. We demonstrate this by applying 28 different estimation procedures to an emulated observational data set; this analysis shows that different estimation procedures may give starkly different estimates. We also provide an extensible R package which makes it straightforward for practitioners to follow our recommendations.

Keywords: Heterogeneous treatment effects, conditional average treatment effect, X-learner, joint estimation.

∗. These authors contributed equally to this work.

© 2019 Soren R. Kunzel, Simon J. S. Walter, and Jasjeet S. Sekhon.

1. Introduction

Heterogeneous Treatment Effect (HTE) estimation is now a mainstay in many disciplines, including personalized medicine (Henderson et al., 2016; Powers et al., 2018), digital experimentation (Taddy et al., 2016), economics (Athey and Imbens, 2016), political science (Green and Kern, 2012), and statistics (Tian et al., 2014). Its prominence has been driven by a combination of the rise of big data, which permits the estimation of fine-grained heterogeneity, and recognition that many interventions have heterogeneous effects, suggesting that much can be gained by targeting only the individuals likely to experience the most positive response. This increase in interest amongst applied statisticians has been accompanied by a burgeoning methodological and theoretical literature: there are now many methods to characterize and estimate heterogeneity; some recent examples include Hill (2011), Athey and Imbens (2015), Kunzel et al. (2017), Wager and Athey (2017a), and Nie and Wager (2017). Many of these methods are accompanied by guarantees suggesting they possess desirable properties when specific assumptions are met; however, verifying these assumptions may be impossible in many applications, so practitioners are given little guidance for choosing the best estimator for a particular data set. As an alternative to verifying these assumptions, we suggest practitioners construct a large family of HTE estimators and consider their similarities and differences.

Treatment effect estimation contrasts with prediction, where researchers can use cross-validation (CV) or a validation set to compare the performance of different estimators or to combine them in an ensemble. This is infeasible for treatment effect estimation because of the fundamental problem of causal inference: we can never observe the treatment effect for any individual unit directly, so we have no source of truth to validate or cross-validate against. Partial progress has been made in addressing this problem; for example, Athey and Imbens (2015) suggest using the transformed outcome as the truth, a quantity equal in expectation to the individual treatment effect, and Kunzel et al. (2017) suggest using matching to impute a quantity similar to the unobserved potential outcome. However, even if there were a reliable procedure for identifying the estimator with the best predictive performance, we maintain that using multiple estimates can still be superior, because the best performing method or ensemble of methods may perform well in some regions of the feature space and badly in others; using many estimates simultaneously may permit identification of this phenomenon. For example, researchers can construct a worst-case estimator that is equal to the most pessimistic point estimate at each point in the feature space, or they can consider stability (Yu, 2013) to assess whether one can trust estimates for a particular subset of units.

2. Methods

2.1 Study setting

The data set we analyzed was constructed for the Empirical Investigation of Methods for Heterogeneity Workshop at the 2018 Atlantic Causal Inference Conference. The organizers of the workshop (Carlos Carvalho, Jennifer Hill, Jared Murray, and Avi Feller) used the National Study of Learning Mindsets, a randomized controlled trial in a probability sample of U.S. public high schools, to simulate an observational study. The organizers did not disclose how the simulated observational data were derived from the experimental data, because the workshop was intended to evaluate procedures for analyzing observational studies, where the mechanism of treatment assignment is not known a priori.

2.2 Measured variables

The outcome was a measure of student achievement; the treatment was the completion of online exercises designed to foster a learning mindset. Eleven covariates were available for each student: four are specific to the student and describe the self-reported expectations for success in the future, race, gender, and whether the student is the first in the family to go to college; the remaining seven variables describe the school the student is attending, measuring urbanicity, poverty concentration, racial composition, the number of pupils, average student performance, and the extent to which students at the school had fixed mindsets; an anonymized school id recorded which students went to the same school.

2.3 Notation and estimands

For each student, indexed by i, we observed a continuous outcome, Y_i, a treatment indicator, Z_i, equal to 1 if the student was in the treatment group and 0 if she was in the control group, and a feature vector X_i. We adopt the notation of the Neyman-Rubin causal model: for each student we assume there exist two potential outcomes; if a student is assigned to treatment we observe the outcome Y_i = Y_i(1), and if the student is assigned to control we observe Y_i = Y_i(0). Our task was to assess whether the treatment was effective and, if so, whether the effect is heterogeneous. In particular, we are interested in discerning whether there is a subset of units for which the treatment effect is particularly large or small.

To assess whether the treatment is effective, we considered the average treatment effect,

\[ \text{ATE} := E[Y_i(1) - Y_i(0)], \]

and to analyze the heterogeneity of the data, we considered average treatment effects for a selected subgroup S,

\[ E[Y_i(1) - Y_i(0) \mid X_i \in S], \]

and the Conditional Average Treatment Effect (CATE) function,

\[ \tau(x) := E[Y_i(1) - Y_i(0) \mid X_i = x]. \]

2.4 Estimating average effects

Wherever we computed the ATE or the ATE for some subset, we used four estimators. Three were based on the CausalGAM package of Glynn and Quinn (2017). This package uses generalized additive models to estimate the expected potential outcomes, µ_0(x) := E[Y_i(0) | X_i = x] and µ_1(x) := E[Y_i(1) | X_i = x], and the propensity score, e(x) := E[Z_i | X_i = x]. With these estimates we computed the inverse probability weighting (IPW) estimator,

\[ \widehat{\text{ATE}}_{\text{IPW}} := \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i Z_i}{e_i} - \frac{Y_i (1 - Z_i)}{1 - e_i} \right), \]

the regression estimator,

\[ \widehat{\text{ATE}}_{\text{Reg}} := \frac{1}{n} \sum_{i=1}^{n} \left[ \mu_1(X_i) - \mu_0(X_i) \right], \]

and the augmented inverse probability weighted (AIPW) estimator,

\[ \widehat{\text{ATE}}_{\text{AIPW}} := \frac{1}{2n} \sum_{i=1}^{n} \left( \frac{[Y_i - \mu_0(X_i)]\, Z_i}{e_i} + \frac{[\mu_1(X_i) - Y_i]\,[1 - Z_i]}{1 - e_i} \right). \]


We also used the Matching package of Sekhon (2011) to construct a matching estimator for the ATE. Matches were required to attend the same school as the student to which they were matched and to be assigned to the opposite treatment status. Amongst possible matches satisfying these criteria, we selected the student minimizing the Mahalanobis distance on the four student-specific features.
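A sketch of this matching estimator with the Matching package; the data frame students and its column names are stand-ins for the challenge data, and the categorical student features are used as numeric codes here purely for illustration:

```r
## Within-school Mahalanobis matching for the ATE (Sekhon, 2011).
library(Matching)

X <- cbind(students$schoolid,                      # exact-matched below
           students$S3, students$C1, students$C2, students$C3)

m <- Match(Y = students$Y, Tr = students$Z, X = X,
           estimand = "ATE", M = 1,
           exact  = c(TRUE, FALSE, FALSE, FALSE, FALSE),  # same school only
           Weight = 2)    # Weight = 2 requests Mahalanobis distance
summary(m)
```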

2.5 Characterizing heterogeneous treatment effects

In any data set there might be some units where estimators significantly disagree; when this happens, we should not trust any estimate unless we understand why certain estimates are unreasonable for these units.1 Instead of simply reporting an estimate that is likely to be wrong, we should acknowledge that conclusions cannot be drawn and that more data or domain-specific knowledge is needed. Figure 1 demonstrates this phenomenon arising in practice. It shows the estimated treatment effect for ten subjects under 28 CATE estimators (these estimates arise from the data analyzed in the remainder of this paper). Some of these estimators may have better generalization error than others; however, a reasonable analyst could have selected any one of them. We can see that for five units the estimators all fall in a tight cluster, but for the remaining units the estimators disagree markedly. This may be due to those units being in regions with little overlap, where the estimators overcome data scarcity by pooling information in different ways.
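A simple numerical version of this diagnostic, assuming cate_hat is an n x 28 matrix with one column per estimator (a hypothetical object, not part of our package output):

```r
## Per-unit disagreement across the 28 CATE estimators.
spread <- apply(cate_hat, 1, function(est) diff(range(est)))

## flag units whose estimators span a wide interval, and form the
## worst-case (most pessimistic) estimator mentioned in Section 1
suspect    <- which(spread > 0.5)     # threshold chosen for illustration
worst_case <- apply(cate_hat, 1, min)
```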

Since an estimate of the entire CATE function is hard to interpret, and drawing statistically significant conclusions based on it is difficult, we decided to use our estimate of the CATE function to identify large subgroups with markedly different average treatment effects and then form conclusions based on the differences in ATE for these subgroups.

To ensure the treatment effect estimates for the selected subgroups were valid, we divided the data into an exploration set and an equally sized validation set. We used the exploration set to identify subsets that may behave differently. To do this, we trained all 28 CATE estimators on the exploration set and formulated hypotheses based on plots of the marginal CATE: for example, based on plots of the CATE estimates we might theorize that students in schools with more than 900 students have a much higher treatment effect than those in schools with fewer than 300 students. Next, we used the validation set to verify our findings by estimating the ATEs of each of the subgroups.

The exploration and validation sets were constructed by randomly associating schools (not individual students) with each set, so students who attended the same school were never split between the exploration and validation sets. We adopted this procedure because it excludes the possibility of dependence between the exploration and validation sets if students who attend the same school influence each others’ outcomes; it also mirrors the probability sampling approach used to construct the full sample, and it means that we can argue that the estimand captured by evaluating our hypotheses on the validation set is the estimand corresponding to the population from which all schools were drawn.
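A sketch of the school-level split, again with assumed names for the data:

```r
## Randomly assign whole schools to the exploration or validation set.
set.seed(1)
schools     <- unique(students$schoolid)
expl_school <- sample(schools, size = floor(length(schools) / 2))

exploration <- students[students$schoolid %in% expl_school, ]
validation  <- students[!(students$schoolid %in% expl_school), ]
```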

1. A standard approach to capturing uncertainty is to report standard errors or confidence regions for a single estimator, and we recommend using this approach in addition to ours. However, for CATE estimation, these methods can be misleading and cannot be trusted blindly. For example, in Appendix C of Kunzel et al. (2017), the authors found that in regions without overlap, bootstrap confidence intervals were smaller than in regions with overlap. This suggests that estimates were more trustworthy in regions without overlap but, in fact, the opposite was true.



Figure 1: CATE estimation for ten units. For each unit, the CATE is estimated using 28 different estimators.

2.6 CATE estimators

We use several procedures to estimate the CATE; we give a brief overview of them here, but interested readers should consult the referenced papers for a complete exposition.

Many of the procedures can be classified as meta-learners: they modify base learners designed for standard non-causal prediction problems to estimate the CATE. This is advantageous because we can select a base learner that is designed to work well on the data we are analyzing.

1. The T-Learner is the most common meta-learner. Base learners are used to estimate the control and treatment response functions separately, µ_1(x) := E[Y_i(1) | X_i = x] and µ_0(x) := E[Y_i(0) | X_i = x]. The CATE estimate is then the difference between these two estimates, τ_T(x) := µ_1(x) − µ_0(x).

2. The S-Learner uses one base learner to estimate the joint outcome function, µ(x, z) := E[Y_i | X_i = x, Z_i = z]. The predicted CATE is the difference between the predicted values when the treatment assignment indicator is changed from treatment to control, τ_S(x) := µ(x, 1) − µ(x, 0).

3. The MO-Learner (Rubin and van der Laan, 2007; Walter et al., 2018) is a two-stage meta-learner. It first uses the base learners to estimate the propensity score, e(x) := E[Z_i | X_i = x], and the control and treatment response functions. It then defines the adjusted modified outcome as

\[ R_i := \frac{Z_i - e(x_i)}{e(x_i)[1 - e(x_i)]} \left( Y_i - \mu_1(x_i)[1 - e(x_i)] - \mu_0(x_i)\, e(x_i) \right). \]

An estimate of the CATE is obtained by using a base learner to estimate the conditional expectation of R_i given X_i, τ_{MO}(x) := E[R_i | X_i = x].

4. The X-Learner (Kunzel et al., 2017) also uses base learners to estimate the response functions and the propensity score. It then defines the imputed treatment effects for the treatment group and the control group separately as D^1_i := Y_i(1) − µ_0(X_i) and D^0_i := µ_1(X_i) − Y_i(0). Two estimators for the CATE are obtained by using base learners to estimate the conditional expectations of the imputed treatment effects, τ_{X1}(x) := E[D^1_i | X_i = x] and τ_{X0}(x) := E[D^0_i | X_i = x]. The final estimate is then a convex combination of these two estimators,

\[ \tau_X(x) := e(x)\, \tau_{X0}(x) + (1 - e(x))\, \tau_{X1}(x). \]

All of these meta-learners have different strengths and weaknesses. For example, the T-Learner performs particularly well when the control and treatment response functions are simpler than the CATE. The S-Learner performs particularly well when the expected treatment effect is mostly zero or constant. The X-Learner, on the other hand, has very desirable properties when either the treatment or the control group is much larger than the other.

Note, however, that all of these meta-learners need base learners to be fully defined. We believe that tree-based estimators perform well on mostly discrete and low-dimensional data sets. Therefore, we use the causalToolbox package (Kunzel et al., 2018), which implements all of these estimators combined with RF and BART.

Using two different tree estimators is desirable because CATE estimators based on BART perform very well when the data-generating process has some global structure (e.g., global sparsity or linearity), while random forests are better when the data has local structure that does not necessarily generalize to the entire space. However, to protect our analysis from biases caused by using only tree-based approaches, we also included methods based on neural networks. We followed Kunzel et al. (2018) and implemented the S-, T-, and X-Neural Network methods.
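A sketch of fitting two of these meta-learners with causalToolbox; we believe the constructor/EstimateCate interface below matches the package, but the exact function names and the data frame students should be treated as assumptions:

```r
## X-learner with RF base learners and T-learner with BART base learners.
library(causalToolbox)

feat <- students[, c("S3", "C1", "C2", "C3", "XC",
                     "X1", "X2", "X3", "X4", "X5")]

xl_rf   <- X_RF(feat = feat, tr = students$Z, yobs = students$Y)
tl_bart <- T_BART(feat = feat, tr = students$Z, yobs = students$Y)

cate_x <- EstimateCate(xl_rf,   feature_new = feat)
cate_t <- EstimateCate(tl_bart, feature_new = feat)
```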

We also included tree-based learners that are not meta-learners and that we believe would work well on this data set:

5. The causal forest algorithm (Wager and Athey, 2017b) is a generalization of the random forest algorithm that estimates the CATE directly. Similar to a random forest, it is an ensemble of many tree estimators. Each tree estimator follows a greedy splitting strategy to generate leaves for which the CATE function is as homogeneous as possible. The final estimate for each tree, for a unit with features x, is the difference-in-means estimate over all units in the training set that fall in the same leaf as x.


6. The R-Learner (Nie and Wager, 2017) is a set of algorithms that use an approximation of the following optimization problem to estimate the CATE,

\[ \arg\min_{\tau} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \big( Y_i - \mu^{(-i)}(X_i) \big) - \big( Z_i - e^{(-i)}(X_i) \big)\, \tau(X_i) \right)^2 + \Lambda_n(\tau(\cdot)) \right\}, \]

where Λ_n(τ(·)) is a regularizer and µ^{(−i)}(X_i) and e^{(−i)}(X_i) are held-out predictions of µ(x) := E[Y_i | X_i = x] and the propensity score, e(x), respectively. There are several versions of the R-Learner; we use one based on XGBoost (Chen and Guestrin, 2016) and one based on RF.

Although we expected there would be school-level effects, and that both the expected performance of each student and the CATE would vary from school to school, it was not clear how to incorporate the school id. The two choices we considered were to include a categorical variable recording the school id, or to ignore it entirely. The former makes parameters associated with the six school-level features essentially uninterpretable, because they cannot be identified separately from the school id; the second may lead to less efficient estimates, because we are denying our estimation procedure the use of all the data available to us. Because we did not want our inference to depend on this decision, we fit each of our estimators twice: once including school id as a feature and once excluding it. We considered 14 different CATE estimation procedures; since each procedure was applied twice, a total of 28 estimators were computed.

3. Workshop Results

Our sample consisted of about 10,000 students enrolled at 76 different schools. The intervention was applied to 33% of the students. Pre-treatment features were similar in the treatment and control groups, but some statistically significant differences were present. Specifically, a variable capturing self-reported expectations for success in the future had mean 5.22 (95% CI, 5.20–5.25) in the control group and mean 5.36 (5.33–5.40) in the treatment group. This means students with higher expectations of achievement were more likely to be treated.

We assessed whether overlap held by fitting a propensity score model; the propensity score estimates for all students in the study were between 0.15 and 0.46, so the overlap condition is likely to be satisfied.

3.1 Average treatment effects

The IPW, regression, and AIPW estimators yielded estimates identical up to two significant figures: 0.25, with a 95% bootstrap confidence interval of (0.22, 0.27). The matching estimator gave a similar ATE estimate of 0.26, with confidence interval (0.23, 0.28).

The similarity of all the estimates we evaluated is reassuring, but we cannot exclude the possibility that the experiment is affected by an unobserved confounder that affects all estimators in a similar way. To address this, we characterize the extent of hidden bias required to explain away our conclusion. We conducted a sensitivity analysis for the matching estimator using the sensitivitymv package of Rosenbaum (2018). We found that a permutation test for the matching estimator still finds a significant positive treatment effect provided the ratio of the odds of treatment assignment for the treated unit relative to the odds for the control unit in each pair is bounded by 0.40 and 2.52. This bound is not very large, and it is plausible that there exists an unobserved confounder that increases the treatment assignment probability for some units by a factor of more than 2.52. More information about the treatment assignment mechanism would be required to determine whether this extent of confounding exists.
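A sketch of this sensitivity analysis; pairs is an assumed n x 2 matrix whose first column holds treated outcomes and whose second column holds the matched control outcomes:

```r
## Rosenbaum-style sensitivity analysis with sensitivitymv.
library(sensitivitymv)

## gamma = 1 is the usual permutation test; larger gamma allows the
## within-pair odds of treatment to differ by up to that factor
for (gamma in c(1, 1.5, 2, 2.52)) {
  p <- senmv(pairs, gamma = gamma)$pval
  cat(sprintf("gamma = %.2f -> upper-bound p-value = %.4f\n", gamma, p))
}
```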

3.2 Heterogeneous effects

The marginal distributions and partial dependence plots for the 28 CATE estimators as a function of school-level pre-existing mindset norms are shown on the left-hand side of Figure 2. There appears to be substantial heterogeneity: students at schools with mindset norms lower than 0.15 may have a larger treatment effect than students at schools with higher mindset norms. However, the figure suggests the conclusion is not consistent across all 28 estimators. A similar analysis of the feature recording school achievement level is shown on the right-hand side of the figure. Again, we appear to find heterogeneity: students with a school achievement level near the middle of the range had the most positive response to treatment. On the basis of this figure, we identified thresholds of −0.8 and 1.1 for defining low, middle, and high achievement level subgroups.


Figure 2: Marginal CATE and Partial Dependence Plot (PDP) of the CATE as a function of school-level pre-existing mindset norms and school achievement level.

We then used the validation set to construct ATE estimates for each of the subgroups. We found that students who attended schools where the measure of fixed mindsets was less than 0.15 had a higher treatment effect (0.31; 95% CI, 0.26–0.35) than students where the fixed mindset measure was higher (0.21; 0.17–0.26). Testing for equality of the ATEs for these two groups yielded a p-value of 0.003. However, when we considered the subsets defined by school achievement level, the differences were not as pronounced. Students at the lowest-achieving schools had the smallest ATE estimate, 0.19 (0.10–0.32), while students at middle- and high-achieving schools had similar ATE estimates: 0.28 (0.24–0.32) and 0.24 (0.16–0.31), respectively. However, none of the pairwise differences between the three groups were significant.
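The subgroup comparison can be implemented as a two-sample z-test on the difference of subgroup ATE estimates. A sketch, assuming each subgroup's estimate and bootstrap standard error are available as ate1/se1 and ate2/se2 (one plausible implementation, not necessarily the exact test we used):

```r
## Test of equal ATEs in two disjoint validation-set subgroups.
z    <- (ate1 - ate2) / sqrt(se1^2 + se2^2)
pval <- 2 * pnorm(-abs(z))   # two-sided p-value
```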

4. Post-Workshop Results

During the workshop, other contributors found that the variable recording the urbanicity of the schools might explain some of the heterogeneity, so we investigate that finding here. The left-hand side of Figure 3 shows the CATE as a function of urbanicity, and the right-hand side shows the CATE as a function of the student's self-reported expectation of success. We formulated two hypotheses: students at schools with an urbanicity of 3 seemed to have a lower treatment effect than students at other schools, and students with a self-reported expectation of 4 might enjoy a higher treatment effect.

These hypotheses were formulated using only the exploration set; to confirm or refute them we used the validation set. The validation set confirmed the hypothesis that students at schools with an urbanicity of 3 had a lower treatment effect (0.16; 0.08–0.24) compared to students at schools with a different urbanicity (0.28; 0.25–0.31); however, we could not reject the null hypothesis of no difference for the subsets identified by the self-reported expectation measure. The urbanicity test yielded a p-value of 0.008 and the self-reported expectation test a p-value of 0.56.


Figure 3: Marginal CATE and PDP of Urbanicity and self-reported expectations.


5. Discussion

5.1 The importance of considering multiple estimators

The results of our analysis confirm that point estimates of the CATE can differ markedly depending on subtle modelling choices; thus, an analyst's discretion may be the deciding factor in whether, and what kind of, heterogeneity is found. As the methodological literature on heterogeneous treatment effect estimation continues to expand, this problem will become more serious. To facilitate applying many estimation procedures, we have authored an R package, causalToolbox, that provides a uniform interface for constructing many common heterogeneous treatment effect estimators. The design of the package makes it straightforward to add new estimators as they are proposed and gain currency.

Differences that arise in our estimation of the CATE function translate directly into suboptimal real-world applications of the treatment considered. To see this, we propose a thought experiment: suppose we wanted to determine the treatment for a particular student. A natural treatment rule is to allocate her to treatment if her estimated CATE exceeds a small positive threshold and to withhold treatment if it is below the threshold. A CATE estimator may be chosen on the basis of personal preference or prior experience, and it is likely that, for some experimental subjects, the choice of estimator will affect the estimated CATE to such an extent that it changes the treatment decision. This is particularly problematic in studies where analysts have a vested interest in a particular result and are working without a pre-analysis plan, as they should not have discretion to select a procedure that pushes the results in the direction they desire. On the other hand, if analysts consider a wide variety of estimators, as we recommend, and if most estimators agree for an individual, we can be confident that our decision for that individual is not a consequence of arbitrary modelling choices. Conversely, if some estimators predict a positive and some a negative response, we should reserve judgment for that unit until more conclusive data are available and admit that we do not know the best treatment decision.

5.2 Would we recommend the online exercises?

We find that the overall effect of the treatment is significant and positive. We were not able to identify a subgroup of units with a significant and negative treatment effect, and we would therefore recommend the treatment for every student. In addition, our sensitivity analysis suggested that our findings would still hold provided any confounder is not too strong. Nonetheless, we cannot exclude the possibility of a strong confounder, and we would need to better understand the assignment mechanism to eliminate this possibility. This issue deserves investigation because we saw that students with higher expectations for success in the future were more likely to be in the treatment group. Uncovering the heterogeneity in the CATE function proved to be difficult. We found that heterogeneity could be identified from school-level pre-existing mindset norms and urbanicity, but in general we had limited power to detect heterogeneous effects. For example, experts believe that the heterogeneity might be moderated by pre-existing mindset norms and school-level achievement. For both covariates, we see that most CATE estimators produce estimates consistent with this theory. Domain experts also believe there could be a “Goldilocks effect” in which middle-achieving schools have the largest treatment effect. We are not able to verify this statistically, but we do observe that most CATE estimators describe such an effect.

Acknowledgments

We thank Carlos Carvalho, Jennifer Hill, Jared Murray, and Avi Feller for organizing the Empirical Investigation of Methods for Heterogeneity Workshop and for their valuable feedback. This work was supported by Office of Naval Research (ONR) grant N00014-15-1-2367.


References

Athey, S. and Imbens, G. W. (2015). Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5).

Athey, S. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7353–60.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. ACM.

Glynn, A. and Quinn, K. (2017). CausalGAM: Estimation of Causal Effects with Generalized Additive Models. R package version 0.1-4.

Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly, 76(3):491–511.

Henderson, N. C., Louis, T. A., Wang, C., and Varadhan, R. (2016). Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research. Health Services and Outcomes Research Methodology, 16(4):213–233.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Kunzel, S., Sekhon, J., Bickel, P., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461.

Kunzel, S., Tang, A., Xie, L., Saarinen, T., Bickel, P., Yu, B., and Sekhon, J. (2018). causalToolbox: Toolbox for Causal Inference with emphasis on Heterogeneous Treatment Effect Estimators. R package version 0.0.1.000.

Kunzel, S. R., Stadie, B. C., Vemuri, N., Ramakrishnan, V., Sekhon, J. S., and Abbeel, P. (2018). Transfer learning for estimating causal effects using neural networks. arXiv preprint arXiv:1808.07804.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv preprint arXiv:1712.04912.

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine.

Rosenbaum, P. R. (2018). sensitivitymv: Sensitivity Analysis in Observational Studies. R package version 1.4.3.

Rubin, D. and van der Laan, M. J. (2007). A doubly robust censoring unbiased transformation. The International Journal of Biostatistics, 3(1).

Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7):1–52.

Taddy, M., Gardner, M., Chen, L., and Draper, D. (2016). A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672.

Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532.

Wager, S. and Athey, S. (2017a). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.

Wager, S. and Athey, S. (2017b). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just accepted).

Walter, S., Sekhon, J., and Yu, B. (2018). Analyzing the modified outcome for heterogeneous treatment effect estimation. Unpublished manuscript.

Yu, B. (2013). Stability. Bernoulli, 19:1484–1500.


Observational Studies 5 (2019) 118-130 Submitted 7/19; Published 8/19

An Application of Matching After Learning To Stretch (MALTS) to the ACIC 2018 Causal Inference Challenge Data

Harsh Parikh [email protected]
Department of Computer Science
Duke University
Durham, NC 27710, USA

Cynthia Rudin1 [email protected]
Department of Computer Science
Department of Electrical and Computer Engineering
Department of Statistical Sciences
Duke University
Durham, NC 27710, USA

Alexander Volfovsky1 [email protected]

Department of Statistical Sciences

Duke University

Durham, NC 27710, USA

Abstract

In the learning-to-match framework for causal inference, a parameterized distance metric is trained on a held-out training set so that the matching yields accurate estimated conditional average treatment effects. This way, the matching can be as accurate as other black-box machine learning techniques for causal inference. We use a new learning-to-match algorithm called Matching-After-Learning-To-Stretch (MALTS) (Parikh et al., 2018) to study an observational dataset from the Atlantic Causal Inference Challenge. Beyond providing estimates for (conditional) average treatment effects, the MALTS procedure allows practitioners to evaluate matched groups directly, understand where more data might need to be collected, and gain an understanding of when estimates can be trusted.

Keywords: Matching Algorithm, Causal Inference, Nearest Neighbors

1. Introduction

Matching methods should place “similar” units into matched groups, but the question of whether two units are similar is much more complicated than it might seem. The choice of distance metric used to form the matches can have a large impact on causal conclusions. If we construct a high-quality distance metric for matching, not only will we make correct causal conclusions, but we will also have interpretable matches that we can understand and troubleshoot. Examining the matches will allow us to locate possible sources of confounding, areas where treatment and control units do not overlap, or perhaps it will allow us to troubleshoot other problems with our data.

1. Denotes approximately equal contribution

© 2019 Harsh Parikh, Cynthia Rudin, and Alexander Volfovsky.


Conversely, a low-quality distance metric can lead to poor matches and incorrect conclusions. If the experimenter chooses a fixed distance metric beforehand, such as Euclidean distance (as is typically used in K-nearest neighbors) or logit distance (as in propensity score matching), then it weighs distances between important covariates equally with distances between irrelevant covariates. Even a few irrelevant covariates in the data could easily degrade the distance metric far enough to affect the treatment effect estimates (Wang et al., 2017). An alternative to Euclidean distance would be to ask experimenters to construct the distance metric using domain expertise, for instance, using a weighted Euclidean distance or by performing coarsened exact matching (Iacus et al., 2011). This, however, requires the user to construct a high-dimensional distance function. Humans are not naturally adept at constructing high-dimensional functions manually, and the high degree of freedom in choosing this metric exposes how easily the choice could go wrong.

Ideally, the distance metric should match units together so that matched groups yield accurate estimated conditional treatment effects. The learning-to-match framework proposed by Parikh et al. (2018) aims to do this. In this framework, the distance metric is trained on a held-out training set. Because the distances are trained, matched groups tend to have more accurate estimated treatment effects. Irrelevant covariates are automatically ignored as part of the learning process, and similarities along relevant covariates tend to be weighted more highly. The impact of arbitrary human choices is reduced, and we can quantitatively judge the quality of the distance metric prior to using it.

Matching After Learning to Stretch (MALTS) (Parikh et al., 2018) is a learning-to-match method that stretches important covariates and shrinks less relevant ones. By stretching relevant covariates, it forces the distance to be more sensitive to small changes in those covariates; by shrinking less relevant covariates, the distance metric becomes less sensitive to their changes. The stretching and shrinking parameters in the distance metric are learned from a held-out training set, as a special case of learning the parameters of a diagonal Mahalanobis distance matrix. If the Mahalanobis distance were not forced to be diagonal, it would induce more general distances, including rotations, but in this work we consider only stretching and shrinking of individual covariates, for interpretability. MALTS handles categorical covariates by exact matching on as many relevant categorical covariates as possible, using ideas from the FLAME algorithm of Wang et al. (2017).

We applied the new learning-to-match framework to the Empirical Investigation of Methods for Heterogeneity Workshop data from the 2018 Atlantic Causal Inference Conference (ACIC). The ACIC dataset emulates the data and intervention in the National Study of Learning Mindsets. It includes school identifiers, self-reported expectation for success in the future (S3), race (C1), gender (C2), first-generation status (C3), level of urbanicity (XC), school-level mean mindset (X1), school achievement level (X2), racial/ethnic composition of the school (X3), poverty concentration of the school (X4), and school size (X5). The outcome is a continuous measure of student achievement (Y), and the treatment is a mindset intervention indicated in the data by T. We find that even though MALTS induces only simple stretching and shrinking of the covariates, it tends to produce results similar to black-box machine learning methods for treatment effect estimation such as BART (Chipman et al., 2010), Causal Forest (Wager and Athey, 2017), and CFR (Johansson et al., 2016). Moreover, it produces matched groups that we can display, critique, and troubleshoot before estimation takes place. The matched groups are not created post hoc to explain a black box; they are created as part of an interpretable, auditable process.

Briefly, our findings using the Matching After Learning to Stretch (MALTS) methodology highlight the importance of self-reported expectations for success in the future in determining the outcome for both the treatment and control sets jointly. We also observe hints of a Goldilocks effect for covariates describing school-level achievement. The average trend observed jointly on self-reported expectation for success in the future and urbanicity suggests that higher values of both are possibly correlated with higher treatment effects. We find interesting behavior for level 3 of urbanicity that differs from the other levels of urbanicity. Finally, we observed only small heterogeneity in treatment effects across gender and almost no heterogeneity across first-generation status. We present these results as an analysis pipeline: we first assess the quality of matched groups and then provide estimates that account for that quality.

2. Methodology and Motivation

Matching After Learning to Stretch (MALTS) is a matching method that estimates conditional average treatment effects (CATEs), defined as the expected difference between the outcome under treatment, Y^(t), and the outcome under control, Y^(c), at a given point x in covariate space, E[(Y^(t) − Y^(c)) | X = x]. Counterfactual outcomes at a given location in covariate space are estimated by K-nearest neighbors (KNN) with a learned distance metric (Parikh et al., 2018). Because the distance metric is learned from a training set, MALTS, and its discrete counterpart, the “Fast Large-scale Almost Matching Exactly Approach to Causal Inference” (FLAME), are part of the learning-to-match framework (Wang et al., 2017; Dieng et al., 2019). MALTS assumes that there is no unobserved confounding.

MALTS learns a distance metric L such that the following minimization problem is approximately solved on the normalized training set. The training set is normalized so that each covariate has zero sample mean and unit sample variance. The objective in Equation (1) accounts for the aggregate error in estimating each y_i as the average of the outcomes y_k of its K nearest neighbors under the learned distance metric:

\[ L \in \arg\min_{L'} \sum_{S \in \{C, T\}} \sum_{i \in S} \left( y_i - \frac{1}{K} \sum_{k \in \mathrm{KNN}(y_i, L') \subseteq S} y_k \right)^2. \tag{1} \]

The learned distance metric forces samples with similar outcomes to be closer in covariate space. In this implementation, MALTS’ distance metric for KNN is parameterized by a matrix, termed L, that handles continuous and discrete variables differently. Let the subscript c indicate continuous variables and d indicate discrete variables; then the distance metric is

\[ \mathrm{distance}_L(a, b) = d_{L_c}(a_c, b_c) + d_{L_d}(a_d, b_d), \quad \text{where } L = [L_c, L_d], \]

\[ d_{L_c}(a_c, b_c) = \lVert L_c a_c - L_c b_c \rVert_2^2, \qquad d_{L_d}(a_d, b_d) = \sum_{j=0}^{n_d} \big( L_d^{j,j} \big)^2 \, \mathbf{1}\big[ a_d(j) \neq b_d(j) \big]. \]

MALTS learns the distance metric parameter L using a non-gradient-based optimization method, with the help of Python 3’s scipy package (Jones et al., 2001).


The learned distance metric is used for matching the units in the estimation set. In principle, the proposed procedure allows the full data to be used for estimation of CATEs; however, it might be preferable to keep the two sets completely separate. This separation allows the training to inform additional data collection (for example, if in the training set we find a few nearest neighbors that form very tight groups while others form loose groups, then we might want to explore the space of the loose groups to see whether additional data might lead to tighter matches). For each unit, we find both the k nearest neighbors in the control group (KNN_c) and the k nearest neighbors in the treatment group (KNN_t). We estimate the outcome under control and the outcome under treatment as

\[ \hat y_i^{(c)} = \frac{1}{K} \sum_{k \in \mathrm{KNN}_c} y_k^{(c)}, \qquad \hat y_i^{(t)} = \frac{1}{K} \sum_{k \in \mathrm{KNN}_t} y_k^{(t)}, \]

where the superscript (t) refers to units in the treated set and the superscript (c) to units in the control set. Thus we estimate the conditional average treatment effect (CATE), the expected difference between outcomes under treatment and control conditional on the covariates, using the relation

\[ \hat E[Y^{(t)} - Y^{(c)} \mid X = x_i] = \hat y_i^{(t)} - \hat y_i^{(c)}. \]

We are able to use MALTS to identify outliers or low-quality matches by examining the diameter of the matched group. The diameter provides an intuitive comparison between matched groups: matched groups that span a large portion of covariate space tend to be low-quality matches. If desired, one can prune the matched groups of large diameter. The diameter of the matched group MG(x_i) for query unit x_i is defined as diameter(x_i) = max_{j ∈ MG(x_i)} distance_L(x_i, x_j), where the matched group MG(x_i) is formed by the union of KNN_c(x_i) and KNN_t(x_i). Instead of pruning, one can create an aggregate estimator by weighting each group inversely proportionally to a function of its diameter.
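A generic sketch of the KNN counterfactual estimates and the diameter diagnostic (written in R for consistency with the examples in this volume; the authors' implementation is in Python). D is an assumed precomputed n x n matrix of learned distances, z the treatment indicator, and y the outcomes:

```r
## CATE and matched-group diameter for unit i under a learned metric.
cate_and_diameter <- function(D, y, z, i, K = 10) {
  knn_t <- order(D[i, z == 1])[1:K]   # K nearest treated neighbors
  knn_c <- order(D[i, z == 0])[1:K]   # K nearest control neighbors
  ## for simplicity a unit may match itself; exclude index i in practice
  cate  <- mean(y[z == 1][knn_t]) - mean(y[z == 0][knn_c])
  dia   <- max(D[i, z == 1][knn_t], D[i, z == 0][knn_c])
  c(cate = cate, diameter = dia)
}
```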

3. Workshop Results

We presented results for a preliminary version of the matching algorithm during the ACIC challenge workshop. In this preliminary approach, we embedded the discrete covariates in Euclidean space and learned a single distance metric for the combination of continuous and embedded (discrete) covariates. This approach yielded interpretable matched groups, and we learned from this exercise that it is important to match on a student's self-reported expectations of future success (S3) and probably less important to match on gender and first-generation status. However, the embedding of the discrete variables is somewhat misleading within our distance metric learning framework. Specifically, it can make units appear artificially similar if they both belong to rare groups. To address this challenge in the post-workshop analysis, we considered a distance metric that explicitly penalizes non-exact matching on discrete covariates without requiring an a priori embedding of the covariates in Euclidean space, which we described above. We note that generalizations of this method (discussed in Parikh et al., 2018) allow for flexible data-driven embeddings of complex data that complement the distance metric learning problem. Distance metric learning for nearest neighbors has also been considered in non-causal settings (Goldberger et al., 2005).


4. Post-Workshop Analysis

We estimate treatment effects by using the Matching-After-Learning-to-Stretch (MALTS) methodology to construct matched groups and evaluate conditional average treatment effects. Covariates in the ACIC dataset include a school identifier, self-reported expectation of success in the future (S3), race (C1), gender (C2), first-generation status (C3), level of urbanicity (XC), school-level mean mindset (X1), school achievement level (X2), racial/ethnic composition of the school (X3), poverty concentration of the school (X4), school size (X5), the outcome variable (Y), and the treatment indicator (T). For each unit we have continuous measurements X1, ..., X5, discrete covariates S3, C1, C2, C3, and XC, a continuous outcome variable Y, and a binary treatment indicator T.

4.1 Learning a Distance Metric and Match Group Quality Analysis

We use a distance metric of the form defined in Equation (2) below, where MALTS learns a scaling on the exact-matching distance for the discrete covariates and a stretch on the Euclidean distance for the continuous covariates:

$$\begin{aligned}
\text{distance}_L(a, b) ={}& L^2_{1,1}\mathbb{1}[S3(a) \neq S3(b)] + L^2_{2,2}\mathbb{1}[C1(a) \neq C1(b)] + L^2_{3,3}\mathbb{1}[C2(a) \neq C2(b)] \\
&+ L^2_{4,4}\mathbb{1}[C3(a) \neq C3(b)] + L^2_{5,5}\mathbb{1}[XC(a) \neq XC(b)] + \sum_{j=1}^{5} L^2_{j+5,\,j+5}\,\big(X_j(a) - X_j(b)\big)^2. \tag{2}
\end{aligned}$$

MALTS requires a training set and an estimation set, so we randomly partition the data into these two components: 5% is reserved for training, while the rest is used for estimation of conditional average treatment effects (CATEs). MALTS uses a non-gradient-based optimization method to find the optimal $L$ that minimizes (or approximately minimizes) the quadratic loss in Equation (1). The optimum for one run is presented in Table 1, with each element of the vector describing the relevant diagonal entry. Exact matching on S3 appears to be important: as we can interpret from the learned $L$ and the distance metric, a mismatch on S3 incurs a cost of at least 17.81 (equal to $L^2_{1,1}$); in comparison, being matched to an individual of the wrong race or ethnicity only costs 1.21 (equal to $L^2_{2,2}$). These costs translate directly into the quality of the matches we can find for each unit in our estimation set. This means S3 is important in determining the outcome under control, $Y_c$, and the outcome under treatment, $Y_t$, jointly.

Table 1: Stretch Values: Diagonal entries of the L matrix learned using MALTS.

                               S3      C1     C2     C3     XC     X1     X2     X3     X4     X5
Index in L matrix              (1,1)   (2,2)  (3,3)  (4,4)  (5,5)  (6,6)  (7,7)  (8,8)  (9,9)  (10,10)
Corresponding L value          4.22    1.1    0.65   0.04   2.78   2.56   0.32   1.92   0.64   1.83
Corresponding L^2 value        17.808  1.210  0.422  0.002  7.728  6.554  0.102  3.686  0.409  3.349
Relative importance
((value/max value)*100)        100     26.02  15.39  0.89   65.87  60.52  7.52   45.52  15.22  43.24

Figure 1: Trends for S3 and Diameter: (a) Variation in values of Y for different levels of self-reported expectation of success (S3). (b) Variation in diameter of matched groups for different levels of self-reported expectation of success (S3). Levels 1 and 2 do not have high-quality matched groups. (c) Histogram of the diameter of matched groups in the estimation set. Groups whose diameter is too large could potentially be omitted from the analysis.

Figure 1 (a) shows that as the level of S3 increases, so does the value of Y. To evaluate whether we have sufficient data to make statements about conditional treatment effects for the different levels of S3, we consider the distribution of diameters of matched groups for each of the levels of S3. Again, the diameter is defined as the maximal distance to one of the nearest neighbors of each unit. Figure 1 (b) demonstrates that individuals at Levels 1 and 2 of S3 are generally not well matched. While it is true that these are the lowest-frequency levels of S3, the matched-group diameter provides a deeper insight: we are likely to get bad matches if we fail to match exactly on these levels, while if we do match exactly on them we appear to pay a large penalty for failing to match on other covariates. This is evidence of non-overlap, and in our analysis it is important to prune or down-weight such bad matched groups to avoid poor performance of the estimator. Figure 1 (c) shows the histogram of diameters of all matched groups for units in our estimation set. If we keep the tightest 75% or 50% of the matched groups, then we need to prune groups with diameter greater than 2.79 or 1.21, respectively.
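The pruning rule used throughout this section (keeping the tightest 75% or 50% of matched groups) can be sketched in a few lines of Python; the helper name is ours.

```python
import numpy as np

def prune_by_diameter(cates, diameters, keep_fraction=0.75):
    """Keep only the tightest matched groups: those whose diameter lies
    below the keep_fraction quantile of all matched-group diameters."""
    cutoff = np.quantile(diameters, keep_fraction)
    keep = diameters <= cutoff
    return cates[keep], cutoff
```

With the diameters shown in Figure 1 (c), keep_fraction values of 0.75 and 0.50 yield cutoffs of roughly 2.79 and 1.21, respectively.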

4.2 Heterogeneity of Treatment Effect

For different levels of pruning based on the diameter of matched groups, Figure 2 shows the variation of CATEs across the levels of S3. Analyzing Figure 2 (a), there appears to be an initial decrease and then an increase in average CATEs across the levels of S3; however, from the match quality analysis we know that the matches for Levels 1 and 2 are not very reliable, so this non-monotonicity could be misleading. Figure 2 (b) shows the trend after we prune at the 75th percentile, where Levels 1 and 2 have been completely removed from the matched-group analysis. We see that the trend in the CATEs across levels of S3 dissipates. A similar analysis can also be performed for the dataset pruned at the 50th percentile, as shown in Figure 2 (c).

Following a similar evaluation to that for S3, we observe that exact matching on urbanicity (XC) appears important. Since S3 and XC have many levels, we present the joint variability of CATEs in Figures 3 (a), (b), and (c) with no pruning on the diameter, pruning at the 75% level, and pruning at the 50% level, respectively. It is evident that pruning removes spurious matched groups, allowing us to evaluate underlying trends. One trend we can infer from the plot is that units with a low level of urbanicity (XC) and a high level of self-reported expectation of success (S3) tend to have higher individual treatment effects on average. We also note that, marginally, urbanicity Level 3 exhibits substantially lower CATEs than the other levels.

Figure 2: CATEs for different values of S3: Variation in predicted CATEs for different levels of self-reported expectation of success (S3) for (a) all the samples in the estimation set, (b) samples beneath the 75th percentile of the estimation set based on the diameter of the matched group, and (c) samples beneath the 50th percentile of the estimation set based on the diameter of the matched group.

Figure 3: Treatment Effects with respect to S3 and XC: Variation in predicted CATEs for different levels of self-reported expectation of success (S3) and urbanicity (XC) for (a) all the samples in the estimation set, (b) samples beneath the 75th percentile of the estimation set, and (c) samples beneath the 50th percentile of the estimation set, based on the diameter of the matched group.

4.2.1 Who did we match?

Table 2 (a), (b), and (c) highlights three specific examples: a good, a bad, and an "ugly" matched group. The good matched group is tight, with diameter equal to zero. The bad matched group is interesting because for most of its units it is trying to force a match on S3, and hence does not make good matches on the other covariates. The ugly one is unable to find any sample similar to the query sample. During estimation the latter two matched groups may be pruned, as poor-quality matches are likely to produce low-quality estimates.


Table 2: Example Matched Groups - The Good, The Bad, The Ugly: (a) Example of a good matched group produced by MALTS with diameter equal to zero (exact match on all covariates). For the query point, MALTS finds 10 nearest neighbors in the control group and 10 nearest neighbors in the treatment group, all of whom are exactly matched. We report only the averaged Yc and Yt over all 10 matches for each group, respectively. (b) Examples of a bad and (c) an ugly matched group produced by MALTS.

[The Good, diameter: 0.0; The Bad, diameter: 19.1; The Ugly, diameter: 21.5. The rotated table listing each group's covariates (S3, C1, C2, C3, XC, X1-X5), outcome Y, and treatment indicator T is not recoverable from the text extraction.]


4.2.2 Evaluating the treatment effects

The average treatment effect (ATE), that is, the expected difference between the outcome under treatment, $Y^{(t)}$, and the outcome under control, $Y^{(c)}$, namely $\mathbb{E}[Y^{(t)} - Y^{(c)}]$, estimated above is not sensitive to pruning or weighting of the matched groups based on the matched-group diameter. Table 3 illustrates this behavior over three different cases: two different levels of pruning and an exponential weighting of the matched groups. The ATE is always estimated to lie in the interval 0.257 to 0.261.

Table 3: Average Treatment Effect (ATE) estimates on the full estimation set, the 75th-percentile estimation set, the 50th-percentile estimation set, and by exponentially weighting each matched group in the estimation set with $w_i = e^{-0.1 \times \text{diameter}(x_i)}$.

                          Estimated ATE ($\hat\mu$)   Standard Error ($\hat\sigma/\sqrt{n}$)
Unpruned                  0.2570                      0.00253
75th Percentile           0.2604                      0.00292
50th Percentile           0.2599                      0.00353
Exponentially Weighted    0.2575                      0.00219
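As a minimal sketch of the exponential-weighting alternative in Table 3, assuming arrays of per-unit CATE estimates and matched-group diameters from the matching step:

```python
import numpy as np

def weighted_ate(cates, diameters, rate=0.1):
    """ATE as an exponentially weighted average of the estimated CATEs,
    with weights w_i = exp(-rate * diameter(x_i)) as in Table 3."""
    weights = np.exp(-rate * diameters)
    return float(np.average(cates, weights=weights))
```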

Following our analysis of matched groups, we investigate the "Goldilocks" hypothesis for the X2 variable: that the treatment effect is higher for average school-level achievement than for low or high achievement. Figure 4 (a) shows that the match quality for low levels of school achievement is poor; in fact, it might be hard to make meaningful statements about matched groups with diameter greater than 4. While eliminating these bad matched groups allows us to comment on X2, we note from Figure 4 (b) that this pruning removes all large and small values of X1 (mean fixed mindset). This means that we can primarily comment on schools with middle-mindset individuals and medium to high achievement levels. This is a slight restatement of the Goldilocks hypothesis, as it adds conditions on medium mindset. In Figure 5 (a) we can observe that there is a peak near approximately X2 = 1, and the CATE decreases as we move away in either direction, i.e., towards X2 = 2 or X2 = 0. This behavior is consistent for both the unpruned and pruned versions. The pruned version, however, is much smoother and shows that a Goldilocks effect is potentially present, though it may not be large.

We also analyze the variability in CATEs as a function of X1: we observe a sharp transition in trends for schools with middle-level mindset. We notice that for middle-level-mindset schools, the treatment effect decreases as the mindset increases. Evaluating X1 and X2 jointly, Figure 5 (c) shows that for high values of both X1 and X2 the variability in the treatment effect appears to be large, but this could be due to a lack of data at the extremes; for middle values of both covariates, the treatment effects are more stable.

Figure 4: Diameter variation with respect to (a) X2 and (b) X1: Variation in the diameter of matched groups for (a) different values of school-level achievement (X2) and (b) different values of mean fixed mindset (X1). The black curves show trend curves fitted with support vector regression.

Figure 5: X1, X2 versus CATEs: Variation in predicted CATEs for different values of (a) school-level achievement (X2), for all the samples in the estimation set and for the samples remaining after pruning matched groups with diameter exceeding 4 (the blue and orange curves are fitted to the CATE as a function of X2 using support vector regression with a Gaussian kernel); (b) school-level students' fixed mindset (X1), for the unpruned estimation set (the black curve is fitted to the CATE as a function of X1 using support vector regression with a Gaussian kernel); (c) contour plot of the variation in predicted CATEs for different values of school-level students' fixed mindset (X1) and school-level achievement (X2) for all matched groups.

Lastly, we evaluate the marginal treatment effect for different levels of the categorical variables race (C1) and urbanicity (XC). As we can infer from Figure 6 (a), the average diameters at all levels are almost the same, so pruning diameters greater than 4 still yields enough samples to analyze. The average CATEs for all other levels are similar to the ATE or slightly higher, while the average CATE for individuals at Level 3 is lower. This can potentially indicate an effect involving urbanicity at Level 3. Figure 7 shows the marginal treatment effect for gender (C2) and first-generation status (C3). There is little variability in average CATEs across the first-generation statuses, while the average CATE for gender Level 2 is on average higher than the average CATE for gender Level 1.

Figure 6: Trends for XC: Variation in (a) the diameter of matched groups and (b) predicted CATEs, for different levels of urbanicity (XC) after pruning groups with diameter more than 4.

Figure 7: Trends for C2 and C3: Variation in CATEs for (a) different levels of gender in the full estimation set, (b) different levels of gender in the estimation set pruned at diameters greater than 4, (c) different levels of first-generation status in the full estimation set, and (d) different levels of first-generation status in the estimation set pruned at diameters greater than 4.


5. Conclusion and Discussion

In a nutshell, the average treatment effect estimated by MALTS is approximately 0.25, and some covariates moderate this effect. For urbanicity, XC, we see different behavior specifically for Level 3, while for the other levels the estimates are fairly consistent. Since matching on S3 appears important, we probably should collect additional data for lower levels of self-reported expectations. Alternatively, it is possible that the scale for this variable simply needs to be recalibrated; perhaps Levels 1 and 2 are not meaningful.

The school-level achievement covariate (X2) shows signs of a Goldilocks effect and should be tested further, while the covariate for the school-level mean of students' fixed mindset (X1) shows non-monotonicity in the middle region of its distribution. Finally, the treatment effect seems to vary slightly across genders and shows almost no variation with respect to first-generation status.

MALTS makes several important aspects of our analysis easier: understanding the importance of different covariates, uncovering the underlying trends in treatment effect heterogeneity, and interpreting those trends by analyzing matched groups. It also provides a measure of confidence in the predicted treatment effects based on the diameter of the matched groups. In certain situations, like the one shown in Figure 5, pruning poor-quality matches yields more stable results. We also observe that the ATE estimated by MALTS for this dataset appears fairly robust to analyst choices.

Acknowledgments

This work was supported by DHHS, PHS, NIH, and NIBIB under grant 1R01EB025021-01, and also by the Duke Energy Initiative.

References

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, pages 266–298.

Dieng, A., Liu, Y., Roy, S., Rudin, C., and Volfovsky, A. (2019). Almost-exact matching with replacement for causal inference. In Proceedings of Artificial Intelligence and Statistics (AISTATS).

Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. (2005). Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520.

Iacus, S. M., King, G., and Porro, G. (2011). Causal inference without balance checking: Coarsened exact matching. Political Analysis, page mpr013.

Johansson, F., Shalit, U., and Sontag, D. (2016). Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029.

Jones, E., Oliphant, T., Peterson, P., et al. (2001–). SciPy: Open source scientific tools for Python. [Online; accessed 2018-11-10]. Available from: http://www.scipy.org/.

Parikh, H., Rudin, C., and Volfovsky, A. (2018). MALTS: Matching after learning to stretch. arXiv:1811.07415.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.

Wang, T., Roy, S., Rudin, C., and Volfovsky, A. (2017). FLAME: A fast large-scale almost matching exactly approach to causal inference. arXiv:1707.06315.


Observational Studies 5 (2019) 131-140 Submitted 7/19; Published 8/19

Selective Inference for Effect Modification: An Empirical Investigation

Qingyuan Zhao [email protected]

Department of Statistics

The Wharton School, University of Pennsylvania

Philadelphia, PA 19104, USA

Snigdha Panigrahi [email protected]

Department of Statistics

University of Michigan

Ann Arbor, MI 48109, USA

Abstract

We demonstrate a selective inferential approach for discovering and making confident conclusions about treatment effect heterogeneity. Our method consists of two stages. First, we use Robinson's transformation to eliminate confounding in the observational study. Next, we select a simple model for effect modification using lasso-regularized regression and then use recently developed tools in selective inference to make valid statistical inference for the discovered effect modifiers. We analyze the Mindset Study dataset provided by the workshop organizers and compare our approach with other benchmark methods.

Keywords: Lasso, Semiparametric regression, Selective sampler, Variable selection.

1. Methodology and Motivation

1.1 Motivation

In the 2018 Atlantic Causal Inference Conference (ACIC 2018), we were kindly invited to participate in a workshop titled "Empirical Investigation of Methods for Heterogeneity". The workshop organizers provided an observational dataset simulated from the National Study of Learning Mindsets (Mindset Study hereafter) and tasked the participants with analyzing how the treatment effect of the mindset intervention varies among students in the study. This workshop, in the words of the organizers, "is not intended to be a 'bake off' but rather an opportunity to understand the strengths and weaknesses of methods for addressing important scientific questions". More specifically, the organizers sought answers to the following three research questions about the Mindset Study:

Question 1: Is the intervention effective in improving student achievement?

Question 2: Do two hypothesized covariates (X1 and X2) moderate the treatment effect?

Question 3: Are there other covariates moderating the treatment effect?

In this report, we will attempt to answer these questions using a method proposed in our earlier paper (Zhao et al., 2017), which neatly combines Robinson's transformation (Robinson, 1988) to remove confounding with the recently developed selective inference framework (Taylor and Tibshirani, 2015) to discover and make confident conclusions about effect modifiers (covariates moderating the treatment effect).

Effect modification, or treatment effect heterogeneity, is an old topic in statistics but has gained a lot of attention in recent years, possibly due to the increased complexity of empirical datasets and the development of new statistical learning methods that are much more powerful at discovering effect modification. Though the literature on this topic is massive, an executive summary must include three related but different formulations of the problem:

1. What is the optimal treatment assignment rule for future experimental objects?

2. What is the conditional average treatment effect (CATE) as a function of the covariates?

3. What are the potential effect modifiers and how certain are we about them?

See Zhao et al. (2017) for more discussion and references. The questions of the workshop organizers clearly fall into the third category. In fact, we believe this is quite common in practice. Empirical researchers often want to use observational or experimental data to test existing scientific hypotheses about effect modification, generate new hypotheses, and gather information for intelligent decision making. However, prior to Zhao et al. (2017), the majority of statistical methods in the third category focused on discovering potential effect modifiers, with little attention given to providing statistical inference (such as confidence intervals for the discovered covariates). When the goal is to calibrate the strengths of effect modifiers in such problems, the researcher often relies on sample splitting, where some of the samples are used for discovery and the remaining samples are used for inference (Athey and Imbens, 2016). However, sample splitting does not optimally utilize the information in the discovery samples and often results in a loss of power. In contrast, the selective inference framework described in this paper does not waste any data, as it leverages a conditional approach that discards only the information used in model selection (Lee et al., 2016; Fithian et al., 2014).

1.2 Main method

To introduce the methodology, let us first fix some notation. Let $Y$ be the observed outcome (a continuous measure of academic achievement), $Z$ be the binary intervention (0 for control and 1 for treated), and $X = (X_1, \dots, X_p)$ be the covariates ($p = 10$ in the Mindset Study). Furthermore, denote by $Y(0)$ and $Y(1)$ the two potential outcomes, so that $Y = Y(Z)$. We assume throughout the paper that there are no unmeasured confounders, i.e., $Y(z) \perp\!\!\!\perp Z \mid X$ for $z = 0, 1$.

Below we elaborate on the two-step method proposed in Zhao et al. (2017):

Step 1 (Robinson’s transformation): Use machine learning methods to estimate µz(x) =E[Z|X = x] = P(Z = 1|X = x) (the “propensity’ score”) and µy(x) = E[Y |X = x].Let the estimates be µy(x) and µz(x). In R, there are many off-the-shelf implemen-tations available to learn µy(x) and µz(x) without any ex ante model specification.It is helpful to use an algorithm called “cross-fitting” in this step for the purpose of

132

Page 112: Assessing Treatment E ect Variation in Observational Studies: … · 2019-09-09 · Observational Studies 5 (2019) 21-35 Submitted 7/19; Published 7/19 Assessing Treatment E ect Variation

Selective inference for effect modification: An empirical investigation

proving theoretical properties (Schick, 1986; Chernozhukov et al., 2018). Cross-fittingis only implemented for the post-workshop analysis. See Section 3.1 for more detail.

Notice that it is straightforward to show (see Zhao et al., 2017) that the CATE $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$ satisfies

$$\mathbb{E}[Y - \mu_y(X) \mid Z, X] = (Z - \mu_z(X))\,\tau(X). \tag{1}$$

Step 2 (Statistical inference): Approximating the CATE by a linear model, $\tau(x) \approx \beta_0 + x^T\beta$, equation (1) implies that

$$Y - \hat{\mu}_y(X) \approx (Z - \hat{\mu}_z(X))(\beta_0 + X^T\beta) + \text{approximation error} + \text{noise}.$$

This motivates us to treat $\tilde{Y} = Y - \hat{\mu}_y(X)$ as the (transformed) response and $\tilde{X} = (Z - \hat{\mu}_z(X))X$ as the (transformed) predictors. We can then use different specifications of $\tau(x)$ to answer the three questions posed by the workshop organizers:

Step 2.1 (answering Question 1): Model the CATE by just an intercept term: $\tau(x) \approx \beta_0$. In R, we can report the results of the linear regression lm($\tilde{Y}$ ~ $\tilde{Z}$), where $\tilde{Z} = Z - \hat{\mu}_z(X)$.

Step 2.2 (answering Question 2): Suppose $X_{\mathcal{M}}$ are the hypothesized effect modifiers ($X1$ and $X2$ in the Mindset Study). We can model the CATE by an intercept and $X_{\mathcal{M}}$: $\tau(x) \approx \beta_0 + X_{\mathcal{M}}^T\beta_{\mathcal{M}}$. The coefficient $\beta_{\mathcal{M}}$ can be interpreted as the coefficient in the best linear approximation to the actual $\tau(x)$. More precisely, it is defined as (see Zhao et al., 2017):

$$(\beta_0, \beta_{\mathcal{M}}) = \arg\min\; \mathbb{E}_n\big[(Z - \mu_z(X))^2(\tau(X) - \beta_0 - X_{\mathcal{M}}^T\beta_{\mathcal{M}})^2\big], \tag{2}$$

where $\mathbb{E}_n$ stands for averaging over the $n$ samples. In R, we can report the results of the linear regression lm($\tilde{Y}$ ~ $\tilde{Z}$ + $\tilde{Z}$:$X_{\mathcal{M}}$).

Step 2.3 (answering Question 3): Use the lasso-regularized regression of Tibshirani (1996) (or potentially other automated variable selection methods) to select a subset of covariates $\mathcal{M} \subseteq \{1, 2, \dots, p\}$. More specifically, $\mathcal{M}$ contains the positions of the non-zero entries in the solution to the following problem:

$$\underset{\beta_0,\,\beta}{\text{minimize}} \;\; \frac{n}{2}\,\mathbb{E}_n\big[\tilde{Y} - \tilde{Z}(\beta_0 + \beta^T X)\big]^2 + \lambda\|\beta\|_1. \tag{3}$$

Then we can use existing selective inference methods to make inference about the linear submodel $\tau(x) \approx \beta_0 + x_{\mathcal{M}}^T\beta_{\mathcal{M}}$ that is selected using the data. The estimand $(\beta_0, \beta_{\mathcal{M}})$ is defined in the same way as (2), treating $\mathcal{M}$ as fixed.

The central idea behind the selective inferential methods is to base inference upon a conditional likelihood that truncates the usual (pre-selection) likelihood to the realizations of data that can lead to the same selection event. Lee et al. (2016) proposed the first method along this conditional perspective to overcome the bias encountered in inferring about a data-adaptive target. Assuming Gaussian noise in a linear regression setting, Lee et al. (2016) derived a pivotal statistic that can be computed in closed form and has a truncated Gaussian law for a class of polyhedral selection rules including the lasso (3). An implementation of this method can be found in the selectiveInference R package (Tibshirani et al., 2017). In principle, more sophisticated selective inference can be used in this step too; we explore this in Section 3.

Page 113: Assessing Treatment E ect Variation in Observational Studies: … · 2019-09-09 · Observational Studies 5 (2019) 21-35 Submitted 7/19; Published 7/19 Assessing Treatment E ect Variation

Zhao and Panigrahi

can be computed in closed-form and has a truncated Gaussian law for a classof polyhedral selection rules including the lasso (3). An implementation of thismethod can be found in the selectiveInference R package (Tibshirani et al.,2017). In principle, more sophisticated selective inference can be used in thisstep too. We will explore them in Section 3.

Compared to other methods, the approach outlined above has several appealing properties. First, the nuisance parameters, $\mu_z(x)$ and $\mu_y(x)$, are estimated by flexible machine learning methods. Because Robinson's transformation is used, each nuisance parameter only needs to be estimated at a rate faster than $n^{-1/4}$ to ensure the asymptotic validity of the non-selective or selective inference in Step 2 (Zhao et al., 2017). This echoes the suggestion of combining machine learning methods and doubly robust estimation by van der Laan and Rose (2011) and Chernozhukov et al. (2018). Second, all the scientific questions raised by the workshop organizers can be answered in the same manner; the data analyst only needs to change the specification of the model for $\tau(x)$. Third, when answering Question 3, an effective variable selection procedure (such as the lasso) can often find an interpretable model that includes most of the important effect modifiers. Selective inference can then provide valid statistical significance and confidence intervals for the selected effect modifiers. Lastly, the implementation of this procedure is straightforward by harnessing existing software for machine learning and selective inference. We refer the reader to Zhao et al. (2017) for a more detailed discussion of the strengths and weaknesses of our approach.

1.3 Alternative methods

To provide a more comprehensive picture of our selective inference approach (referred to as method "lasso" below), we decided before seeing any real data in the Mindset Study that we would also use four benchmark methods considered in the applied example in Zhao et al. (2017). These alternative methods are:

Method "naive": This method simply fits a linear model with all the treatment-by-covariate interactions (and, of course, all the main effects). In R, we can simply use lm(Y ~ Z * X), which is equivalent to lm(Y ~ Z + X + Z:X). To investigate effect modification, we can just report results for the interactions. This method is called "naive" because the linear model may be misspecified and may be insufficient for removing confounding.

Method "marginal": After Robinson's transformation (Step 1 above), this method fits univariate linear regressions lm($\tilde{Y}$ ~ $\tilde{Z}$ + $\tilde{Z}$:$X_j$) for $j = 1, \dots, p$. This is a special case of Step 2.2 with fixed model $\mathcal{M} = \{j\}$.

Method "full": After Step 1, this method fits a full linear model lm($\tilde{Y}$ ~ $\tilde{Z}$ + $\tilde{Z}$:$X$). This is a special case of Step 2.2 with fixed model $\mathcal{M} = \{1, 2, \dots, p\}$.

Method "snooping": This method is similar to method "lasso" except for the very last step. Instead of selective inference, it directly reports the results of lm($\tilde{Y}$ ~ $\tilde{Z}$ + $\tilde{Z}$:$X_{\mathcal{M}}$), treating $\mathcal{M}$ as given rather than learned from the data. This method is used as a straw man to illustrate that ignoring model selection (aka "data snooping") may lead to over-confident inference.


2. Workshop results

2.1 Implementation details

In our workshop analysis, we used the random forest (Breiman, 2001) to estimate the nuisance parameters in Step 1. In particular, we used the "honest" forest implementation in the grf package (Athey et al., 2018) with tune.parameters = TRUE (so some parameters are tuned by cross-validation) and all other options set to their defaults. In Step 2, categorical covariates are transformed to dummy variables; for example, XC (with five levels: 0, 1, 2, 3, 4) is transformed to XC-1, XC-2, XC-3, XC-4. In Step 2.3, we used the theoretical value $\lambda = 1.1 \times \mathbb{E}[\|X^T\varepsilon\|_\infty]$ (Negahban et al., 2012) for model selection, where $\varepsilon$ is the vector of noise in the outcome. In the real data analysis, $\lambda$ is computed by simulating $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \hat{\sigma}^2)$, where $\hat{\sigma}^2$ is the estimated noise level; see Lee et al. (2016). We then used the fixedLassoInf function in the selectiveInference package to make the selective inference. Details of the implementation can be found in the supplementary R markdown file.
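As an illustration, this theoretical tuning parameter can be approximated by Monte Carlo, as in the following hedged Python sketch; the function name and defaults are ours, and X stands for the (transformed) design matrix of the lasso problem (3).

```python
import numpy as np

def theoretical_lambda(X, sigma_hat, n_sims=1000, scale=1.1, seed=0):
    """Monte Carlo approximation of lambda = scale * E[||X^T eps||_inf],
    with eps_i drawn i.i.d. from N(0, sigma_hat^2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sups = [np.abs(X.T @ rng.normal(0.0, sigma_hat, size=n)).max()
            for _ in range(n_sims)]
    return scale * float(np.mean(sups))
```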

2.2 Results

By simply specifying an intercept term for $\tau(x)$ as described in Step 2.1, the (weighted) average treatment effect is estimated to be 0.256 with confidence interval [0.235, 0.277] (this does not exactly estimate the average treatment effect because of the regression setup; see equation (2) above). Thus the mindset intervention is indeed effective.

Our results for effect modification are summarized in Figure 1. Notice that although all the methods are plotted in the same figure for ease of visualization, they may be fitting different linear approximations to $\tau(x)$, and the coefficients for the same covariate may have different meanings. Several covariates (X1, X5, XC-4) are significant using method "marginal" but non-significant using method "full", indicating that they may be correlated with the actual effect modifier(s). We find the full model difficult to interpret because it consists of all the covariates. The lasso-regularized regression selects two covariates, X1 and XC-3, as potential effect modifiers, and the application of selective inference shows that XC-3 is statistically significant even after adjusting for the model selection. In contrast, the "snooping" inference that ignores the bias from model selection would incorrectly declare that X1 is also statistically significant.

To summarize, our workshop analysis suggests that: X1 is possibly an effect modifier, but more data may be needed before a decisive conclusion can be made; X2 does not moderate the treatment effect; and XC-3 is an effect modifier that the data supports as important. In fact, with selective inference we are able to estimate the strength of the effect modifier XC-3 through both interval and point estimates.

3. Post-workshop analysis

3.1 More advanced methods

A major objection to the polyhedral pivot in Lee et al. (2016) is that the selective confidence intervals are often excessively long. For example, in Figure 1 (method "lasso"), the confidence interval for X1 is very asymmetric: most of it lies above 0, yet the point estimate is negative. A more extreme example of this kind can be found in Table 1 below.


Figure 1: Workshop results: This figure plots the 95% confidence intervals of effect modification by the covariates, for methods "naive", "marginal", "full", and "lasso" (red solid intervals do not cover 0). [Interval plot not recoverable from the text extraction; the covariates shown include S3, the C1 dummies, C2-2, C3-1, X1-X5, and XC-1 through XC-4.]

This problem is due to the ill behavior of the polyhedral pivot when the observed data lie close to the selection boundary. Such a phenomenon was observed in the original article by Lee et al. (2016). More recently, Kivaranovic and Leeb (2018) proved that the expected length of a selective confidence interval constructed this way is infinite.

3.1.1 Randomized response

To mitigate this problem, Tian and Taylor (2018) proposed to randomize the response before model selection, thereby smoothing out the selection boundary. This also allows the statistician to reserve more information in the data during the selection stage, leading to increased power in the inference stage. With a moderate amount of injected noise, the increase in inferential power does not compromise the ability of the procedure to select a good model.

In the effect modification problem, this "randomized lasso" algorithm can be directly applied in Step 2.3 by replacing (3) with the following optimization problem:

$$\underset{\beta_0,\,\beta}{\text{minimize}} \;\; \frac{n}{2}\,\mathbb{E}_n\big[\tilde{Y} - \tilde{Z}(\beta_0 + \beta^T X)\big]^2 + \lambda\|\beta\|_1 - \omega^T\beta, \tag{4}$$

where $\omega \sim \mathcal{N}(0, \eta^2 I_p)$ is the injected Gaussian noise. Note that the randomized lasso has two tuning parameters: the amount of $\ell_1$ penalty, $\lambda$, and the amount of injected noise, measured by $\eta^2$. In our analysis below we use the same penalty $\lambda$ as before and set $\eta^2 = \hat{\sigma}^2/2$.
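A minimal sketch of (4), assuming the transformed design matrix and response from Step 2 are available: we solve the perturbed problem by proximal gradient descent with soft-thresholding. This is illustrative only; the authors' analyses rely on the selective-inference Python software referenced below.

```python
import numpy as np

def randomized_lasso(design, y_tilde, lam, eta, n_iter=500, seed=0):
    """Randomized lasso of (4): the usual lasso objective perturbed by a
    linear term -omega^T beta, with omega ~ N(0, eta^2 I_p), solved by ISTA."""
    rng = np.random.default_rng(seed)
    p = design.shape[1]
    omega = rng.normal(0.0, eta, size=p)
    step = 1.0 / np.linalg.norm(design, 2) ** 2   # 1/L for the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = design.T @ (design @ beta - y_tilde) - omega  # smooth-part gradient
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta, omega
```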

The polyhedral lemma of Lee et al. (2016) no longer applies to the randomized lasso because selection now depends on both the data and the injected noise $\omega$. To construct selective confidence intervals after the randomized lasso, Tian Harris et al. (2016) proposed to use Monte Carlo methods and developed a general selective sampler to sample realizations of the data truncated to the randomized selection region. To obtain a point estimate of the coefficient, Panigrahi et al. (2016) and Panigrahi and Taylor (2018) introduced the "selection-adjusted" maximum likelihood estimate (selective MLE) that maximizes the conditional likelihood given the selection event. These latest selective inference methods are implemented as Python software available at https://github.com/selective-inference/Python-software.

3.1.2 Switching the target of selective inference

In our workshop analysis, the target of selective inference is the partial regression coefficient $\beta_{\mathcal{M}}$ defined in (2). Alternatively, one might be interested in the full regression coefficient $(\beta_{\{1,2,\dots,p\}})_{\mathcal{M}}$, which contains the entries of $\beta_{\{1,2,\dots,p\}}$ that correspond to the selected covariates $X_{\mathcal{M}}$. In other words, instead of targeting all the full regression coefficients as in method "full" above, this approach focuses only on certain selected entries. The selective inference framework in Lee et al. (2016) and Tian Harris et al. (2016) can be effortlessly applied to full regression coefficients because they, like partial regression coefficients, can be written as linear functions of the underlying parameters (in our case $\tau(x)$).

3.1.3 Cross-fitting

Cross-fitting (Schick, 1986; Chernozhukov et al., 2018) is a general algorithm in semiparametric inference to eliminate the dependence of nuisance parameter estimates on the corresponding data point (e.g., the dependence of $\hat{\mu}_z(X_i)$ on $Z_i$). In our case, it simply amounts to splitting the data into two halves and estimating $\hat{\mu}_z(X_i)$ and $\hat{\mu}_y(X_i)$ in Step 1 using models trained on the half of the data that does not contain the $i$-th data point. We implemented this algorithm for our post-workshop analysis. Cross-fitting is useful for proving theoretical properties of the semiparametric estimator. In practice, we rarely find that using cross-fitting drastically changes the results.
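A minimal sketch of the two-fold cross-fitting just described, assuming a generic scikit-learn regressor; the helper name is ours.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

def two_fold_crossfit(X, target, estimator=None, seed=0):
    """Two-fold cross-fitting: each unit's nuisance estimate comes from a
    model trained on the half of the data that does not contain that unit."""
    est = estimator if estimator is not None else RandomForestRegressor(random_state=seed)
    n = X.shape[0]
    fold = np.random.default_rng(seed).permutation(n) % 2   # random halves
    out = np.empty(n)
    for f in (0, 1):
        train, test = fold != f, fold == f
        model = clone(est).fit(X[train], target[train])
        out[test] = model.predict(X[test])
    return out
```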

3.2 Results

Table 1 shows the post-workshop analysis results. There are in total four analyses, targeting partial or full coefficients and using either the polyhedral pivot for the lasso or the selective sampler for the randomized lasso.


Table 1: Results of different selective inference methods in the Mindset Study dataset.

Target   Selective inference method     Covariate  Estimate  CI                p-value
Partial  Lasso + polyhedral pivot       C1-11       0.223    [-∞, -2.164]      0.005
                                        XC-3       -0.141    [-0.156, ∞]       0.234
                                        X1         -0.025    [-0.042, 0.203]   0.736
         Randomized lasso + sampler     S3         -0.013    [-0.060, 0.017]   0.908
                                        C1-4       -0.000    [-0.106, 0.082]   0.675
                                        C1-11       0.121    [-0.339, 0.428]   0.400
                                        XC-3       -0.151    [-0.265, -0.045]  0.004
                                        X4          0.002    [-0.063, 0.046]   0.596
Full     Lasso + polyhedral pivot       C1-11       0.284    [-∞, -2.103]      0.005
                                        XC-3       -0.148    [-0.180, 4.257]   0.256
                                        X1         -0.031    [-4.345, 0.564]   0.872
         Randomized lasso + sampler     S3         -0.011    [-0.051, 0.016]   0.214
                                        C1-4        0.061    [-0.092, 0.180]   0.185
                                        C1-11       0.180    [-0.375, 0.505]   0.568
                                        XC-3       -0.139    [-0.293, 0.002]   0.052
                                        X4          0.011    [-0.046, 0.053]   0.819

Thus the first analysis in Table 1 (lasso + polyhedral pivot) is the same as method "lasso" in Figure 1, except that here we used cross-fitting. The randomized lasso selects three more covariates in the post-workshop analysis; this is typically the case due to the injected noise. However, all selected covariates besides XC-3 are not statistically significant in the post-selection inference, suggesting that they are probably not effect modifiers.

The biggest advantage of using the randomized lasso and selective sampler is the shorter selective confidence interval (CI). For example, for XC-3 the CI is reduced from $[-0.156, \infty]$ to $[-0.265, -0.045]$. A careful reader might have noticed that in the first row of Table 1, the naive point estimate for C1-11, obtained by regressing $\tilde{Y}$ on the selected covariates C1-11, XC-3, and X1, is not covered by the CI. This can happen when the data lie very close to the decision boundary; see Lee et al. (2016, Fig. 5). The selective MLE point estimates (for the randomized lasso) are always covered by the CIs and are close to the centers of the CIs in Table 1. Switching the inferential target from partial coefficients to full coefficients does not seem to change the results much. This is likely due to the lack of strong effect modifiers and the lack of dependence between the covariates. In the full model, the covariate XC-3 is not significant at level 0.05. One possible explanation is that using a selected model often adds power to the analysis when the data can be accurately described by a sparse generative model (as opposed to fitting a full model). These observations demonstrate the practical benefits of using the randomized lasso and selective MLE.

4. Discussion

In this paper we have presented a comprehensive yet transparent approach, based on Zhao et al. (2017), to analyzing treatment effect heterogeneity in observational studies. The same procedure can be applied to randomized experiments as well, and Zhao et al. (2017) showed that in this case it is sufficient to estimate $\mu_y$ consistently for the polyhedral pivot to be asymptotically valid. The proposed procedure can be easily implemented using existing machine learning packages (to estimate $\mu_z$ and $\mu_y$) and selective inference software. The R and Python code for our analyses is attached with this report.

We want to re-emphasize some points made in Zhao et al. (2017) about when selective inference is a good approach for analyzing effect modification. Compared to classical statistical analysis, the selective inference framework makes it possible to use the same data to generate new scientific questions and then answer them. This is not useful for inference about the average treatment effect, because it is a deterministic quantity independent of any model selection. However, selective inference can be tremendously useful for effect modification, especially when the analyst wants to discover effect modifiers using the data and make confident conclusions about their effect sizes. We believe that this is indeed the motivation behind the workshop organizers' Question 3, making selective inference a very appealing choice for analyzing datasets like the Mindset Study.

On the other hand, since part of the information in the data is reserved for post-selection inference, the selective inference framework is sub-optimal for making predictions (in our case, estimating $\tau(x)$). There is a long line of literature on estimating the optimal treatment regime or the CATE from data; this has become a hot topic recently due to the availability of flexible machine learning methods. We refer the reader to Zhao et al. (2017) for some references in this direction. When prediction accuracy is the foremost goal, these machine learning methods should be preferred to selective inference.

Berk et al. (2013) proposed an alternative post-selection procedure that constructs universally valid confidence intervals regardless of the model selection algorithm. However, this may be overly conservative when the selection algorithm is pre-specified by the data analyst (for example, the lasso with a fixed $\lambda$). Small (2018) discussed connections of this alternative approach to observational studies.

The application to effect modification also suggests new research directions for selective inference. For example, during the workshop several participants attempted to describe the effect modification using decision trees. Results presented in this way are easy to interpret and may have immediate implications for decision making. With the nodes and cutoffs selected in a data-adaptive fashion, this poses yet another post-selection inference problem. Reserving a hold-out data set for a confirmatory analysis of the effects may lead to a loss of power that can potentially be avoided with selective inference. Obtaining optimal inference after exploration via regression trees is an interesting direction for future work.

References

Athey, S., Tibshirani, J., and Wager, S. (2018). Generalized random forests. Annals of Statistics, to appear.

Athey, S. C. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41(2):802–837.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv:1410.2597.

Kivaranovic, D. and Leeb, H. (2018). Expected length of post-model-selection confidence intervals conditional on polyhedral constraints. arXiv:1803.01665.

Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Annals of Statistics, 44(3):907–927.

Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. K. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Panigrahi, S. and Taylor, J. (2018). Scalable methods for Bayesian selective inference. Electronic Journal of Statistics, 12(2):2355–2400.

Panigrahi, S., Taylor, J., and Weinstein, A. (2016). Bayesian post-selection inference in the linear model. arXiv:1605.08824.

Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954.

Schick, A. (1986). On asymptotically efficient estimation in semiparametric models. Annals of Statistics, 14(3):1139–1151.

Small, D. S. (2018). Larry Brown: Remembrance and connections of his work to observational studies. Observational Studies, 4:250–259.

Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634.

Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. Annals of Statistics, 46(2):679–710.

Tian Harris, X., Panigrahi, S., Markovic, J., Bi, N., and Taylor, J. (2016). Selective sampling after solving a convex problem. arXiv:1609.05609.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288.

Tibshirani, R., Tibshirani, R., Taylor, J., Loftus, J., and Reid, S. (2017). selectiveInference: Tools for Post-Selection Inference. R package version 1.2.4. Available from: https://CRAN.R-project.org/package=selectiveInference.

van der Laan, M. J. and Rose, S. (2011). Targeted Learning. Springer.

Zhao, Q., Small, D. S., and Ertefaie, A. (2017). Selective inference for effect modification via the lasso. arXiv:1705.08020.
