-
MAKING CAUSAL CONCLUSIONS
FROM HETEROGENEOUS DATA SOURCES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Evan Taylor Ragosa Rosenman
August 2020
-
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pq377kc2214
© 2020 by Evan Taylor Ragosa Rosenman. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
-
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mike Baiocchi, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Art Owen, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Julia Palacios
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
-
Abstract
As datasets grow larger and more complex, the field of statistics must provide commensurate methods
for synthesizing and gathering evidence. This thesis presents new methodology for related questions
at the intersection of observational and experimental causal inference. I consider problems of “data
fusion,” in which one set of causal estimates is derived by merging results from an observational
dataset and an experimental dataset. I also consider how to design experiments informed by the
results of observational studies.
I begin by providing an overview of relevant concepts and prior work in causal inference, data
fusion, sensitivity analysis, and optimization. Chapter 2 considers the data fusion question in the
case when all confounding variables are measured; I propose several estimators and derive when
each is expected to outperform the others. In Chapter 3, I remove the unconfoundedness assumption, which
leads to a new class of estimators based on a shrinkage approach. Chapter 4 considers the design
question, and proposes a novel solution for regret-minimizing designs in the case of a binary covariate.
Throughout, I use data from the Women’s Health Initiative, a 1991 study of the efficacy of hormone
therapy involving thousands of postmenopausal women, to demonstrate the utility of my methods.
The contents of this thesis are drawn from two existing manuscripts. The second chapter is
adapted from “Propensity Score Methods for Merging Observational and Experimental Datasets,”
jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack (Rosenman et al., 2018).
The third chapter is adapted from “Combining Observational and Experimental Datasets Using
Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi
(Rosenman et al., 2020). The fourth chapter is forthcoming as its own paper.
This work touches on a variety of methods from disparate areas of the literature. While the
central problem is one of causal inference, tools from Empirical Bayes, decision theory, and convex
optimization are deployed throughout. It is my hope that this work can serve two purposes. First, I
hope it can excite methodologists to consider new ways of posing causal questions and new methods
for solving emergent challenges. Second, I hope to empower practitioners to more efficiently use
their data to identify treatment effects.
-
Acknowledgments
If it takes a village to raise a child, it requires a small city to shepherd someone to a doctorate. With
apologies for the lack of brevity, I would like to thank the many teachers, friends, and colleagues
who have played a role in my intellectual and personal development over the past three decades.
From the third through twelfth grades, I attended the Pingry School in Martinsville, New Jersey,
and I am forever grateful for the extraordinary education I received there. In the eighth grade, my
English teacher, Marnie McKoy, took to referring to me as “Dr. Rosenman,” a moniker that I can
finally inhabit more than fifteen years later. The wonderful educators I encountered at Pingry –
Mr. and Mrs. Grant, Dr. Dineen, Mrs. Landau, Mr. Coe, Mr. Tramontana, Dr. Korfhage, Dr.
DeSimone, Mrs. O’Mara, and many others – have had a lasting impact on my intellectual life. My
time at Pingry also provided me with lifelong friends who have been sources of great encouragement
throughout this doctorate, including Kerry Bickford, Darina Shtrakhman, Biff Parker-Magyar,
Melinda Zoephel, and Meredith Skiba.
My undergraduate years at Harvard were when I first discovered my ardor for STEM. Harvard
was an intimidating place, in which impostor syndrome ran rampant. Inspiring teachers and mentors
are the only reason I emerged with my passion for math and science intact. I’d like to acknowledge
Yiling Chen, Joe Blitzstein, Sarah Koch, David Morin, and Mike Ruberry for all the guidance and
excitement that they provided. Harvard also gave me a terrific group of friends, many of whom have
patiently listened to me kvetch over the past five years. To Matt Chartier, Olga Zinoveva, Kevin
Fogarty, Alice Li-Fogarty, Stephanie Wang, Danielle Kolin-Freeman, and Amy Guan – a hearty
thanks.
The choice to move to Washington, D.C. after undergrad, to work at a midsize tech company
called Applied Predictive Technologies, remains one of my best decisions. I am so grateful for the
wonderful friends I made in my three years there, many of whom have moved westward and become
my support network here in the Bay Area. Many thanks to Alex Svistunov, Nitin Viswanathan,
Kathy Qian, Chao Xue, Brady Kelly, Simon Krauss, J.D. Astudillo, and Liz Casano. I am also
appreciative of the teachers and mentors I met while studying as a nighttime Master’s student at
Georgetown. David Caraballo, Sivan Leviyang, Ken Shaw, and Ali Arab all played a role in nurturing
my love of Statistics and helping me on the path to the Ph.D.
One of my greatest blessings has been sharing the doctoral experience with two badass Harvard
statisticians who also matriculated in the fall of 2015. Both of these women influenced my decision to
pursue a Ph.D., and I would never have completed this degree without them. To Michele Zemplenyi:
there will never be another person who makes me laugh like you do. Thank you for all the moments
of levity and delight, and for always being just a phone call away. To Kristen Hunter, who has
enriched my life for a dozen years, thank you for so many adventures, for your endless willingness
to call out the bullshit, for being my lifetime partner in crime.
Any Ph.D. is an emotional rollercoaster – a cacophony of self-doubt, pride, disappointment, and
fleeting moments of insight. If one is exceedingly lucky, out of this mess emerges a doctor. I have
been so lucky. I must thank my wonderful committee members: Guillaume Basse, Julia Palacios,
and Stefan Wager, as well as my longtime mentor, Clea Sarnquist. Each of these individuals has been
generous with time and with wisdom, and each has kept the proverbial office door ajar. Moreover,
they have written recommendation letters for me, provided me guidance on career choices, and
periodically reminded me to keep perspective. These faculty have modeled what it is to be a
curious, responsible researcher and I will be proud to follow in their footsteps.
It is a rare and wonderful thing to find an advisor who is simultaneously brilliant and kind. I
have been fortunate enough to find two of them. So much can be said about Mike Baiocchi: that he
is unrelentingly supportive of his students, incredibly down-to-earth, fearless, resilient, and smart as
hell. He is a credit to the field of Statistics and to the state of Maine. Mike, you are the reason I
have stuck with causal inference, and you have profoundly shaped my career. Just as importantly,
you have made me feel valued, competent, and worthy at times when I did not believe in myself. I
am so grateful to have had the chance to collaborate with you.
Art Owen, whose statistical prowess is eclipsed only by his unfailing generosity and fundamental
decency, has been an extraordinary mentor as well. Over the past four years, the opportunity to
spend an hour a week talking to him about research and ideas has been a true privilege – and a
complete blast. Art, you have made me a better researcher and a better communicator. You have
been an advocate for me in every way: the person who cheers me up after a journal rejection; who
is always willing to provide career advice; who reminds me to keep an eye on the bigger picture.
Thank you for everything.
I could not have imagined going through this experience without my cohort: nine brilliant,
quirky individuals, who came together from all over the world to share in these five years. We may
have seen each other less in recent times, but I will always remember the birthday cakes, the Patio
outings, the barbecues, the hushed laughs shared in the backs of classrooms. I’d especially like to
thank Rina, my sometimes travel buddy and perpetual coauthor, who has made some of the most
trying times bearable; Claire, who has listened patiently to my many complaints, and always been a
delightful co-conspirator in dark humor; and Andy, mensch among mensches, the guy who will teach
you probability and also take you out for ramen when you’re feeling down. You have all enriched
this experience.
Friendship and guidance from students in prior cohorts, most notably Alex Chin, Jelena Markovic,
Jessica Hwang, and Paulo Orenstein, have also been essential, as have many wonderful relationships
outside of Sequoia Hall. Jason Weinreb, Murphy Temple, Surajit Bose, Matt Seymour, Jeff Sheng,
and Daniel Kremer: thank you, all.
I would also like to thank my family. I find myself thinking of my grandparents, the last of
whom – my paternal grandfather, Lawrence Rosenman – passed away just as I was beginning my
doctorate. In the final voicemail he left me, Grandpa said, “I have love for all my grandkids – they
all went to great universities, so many I don’t know what to do with them all. Have a wonderful
day and do well. I love you.” It’s a simple message, and one I have cherished always. I also can’t
help but think of my maternal grandparents, Frances and Amerigo Ragosa. Stung by the Great
Depression, both cut their educations short in order to go to work: Mema barely managed to finish
high school while Pop-Pop, an Italian immigrant, dropped out before getting his diploma. Denied
opportunities for formal schooling, they built a successful life, but always yearned to learn more. I
hope that I make them proud.
My immediate family has spread all over the contiguous United States, but they have continued
to be my rock, and I am immensely lucky to have them in my life. Thanks to my brother, Michael,
for his words of encouragement, offered from far-away Wisconsin. My father, Mark, the most happily
retired attorney in the whole history of north-central New Jersey, has nonetheless been willing to
proofread every single manuscript, application, poster, and presentation that I have produced in
my graduate career – including this thesis. Dad has played the role of consummate problem-solver,
always willing to go the extra mile to make my day more manageable, whether by buying me a
coffee mug or listening to me kvetch about my students. Dad, you have two characteristics that
have profoundly influenced my graduate career. First, you are a stubborn man, and I inherited that
quality, and it has served me well. Second, your unshakeable belief in my abilities is almost certainly
somewhat misplaced, but it has gone a long way in helping me to stay the course. Thank you.
And last but not least, to the BAMF-without-an-F, the OG embodiment of “Nevertheless, she
persisted,” the woman who showed early promise in math, who graduated summa cum laude from
a 98% male engineering college, who managed assembly lines, who earned a nighttime law degree
while raising two children and working full-time, who became an attorney at 49, who scrambled
and got knocked down and picked herself back up, who made partner, who has worked nonstop
since age 21 and faced down a lifetime of misogynist bullshit and who still fights every day to make
things a little bit easier for the next crop of women – to the incomparable Diane Ragosa: there is
nothing better in life than having a baller for a mother. I would never have attempted something so
ambitious without your example, and I would never have finished it without your unwavering love
and support. Thank you.
-
Contents
Abstract iv
Acknowledgments v
1 Introduction 1
1.1 The Data Revolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Causal Inference Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Potential Outcome Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Sources of Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Estimation Methods for Observational Studies . . . . . . . . . . . . . . . . . 4
1.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Combining Datasets Under SITA 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Scientific assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Notation, assumptions and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Sampling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Stratification and treatment effect assumptions . . . . . . . . . . . . . . . 12
2.2.3 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Delta method results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Population quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Delta method means and variances . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4 The dynamic weighted estimator . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.6 Proof of Corollary 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Derivation: Dynamic Weighted Estimator . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 Simulation of the ideal case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.2 Restrictive enrollment criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.3 Violation of Assumption 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 WHI data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.1 Propensity score construction and covariate balance . . . . . . . . . . . . . . 30
2.6.2 Gold standard causal e↵ect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.3 Prognostic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Combining Datasets Without SITA 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Notation, Assumptions, and Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Assumptions and Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Estimator Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Proposed Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 κ1, Common Shrinkage Factor . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 κ2, Variance-Weighted Shrinkage Factors . . . . . . . . . . . . . . . . . . 49
3.4.4 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Estimating Implied Γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.1 Simulation Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.2 Identical Observational and RCT Covariate Distributions . . . . . . . . . . . 56
3.6.3 Differing Observational and RCT Covariate Distributions . . . . . . . . . 59
3.7 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 61
3.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Designing Experiments Using Observational Studies 67
4.1 Problem Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 A Change in Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.2 Stratification and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.3 Loss and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Converting to an Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Naïve Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.2 Regret Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Tractable Case: Binary Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 74
4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.2 Detailed Example: Γ = 1.5, Fine Stratification . . . . . . . . . . . . . . . 75
4.4.3 Performance Over Multiple Conditions . . . . . . . . . . . . . . . . . . . . . . 76
4.5 Future Work: General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Conclusion 80
5.1 Discussion and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A Appendix to Chapter 2 91
A.1 Lengthier Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.1.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.1.2 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 WHI Data Example: Additional Details . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.2.1 Observational study propensity modeling and covariate balance . . . . . . . . 93
A.2.2 RCT covariate balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
B Appendix to Chapter 3 95
B.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
C Appendix to Chapter 4 100
C.1 Proof of Validity of Confidence Regions . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.1.1 Review of Proof in Zhao et al. (2019) . . . . . . . . . . . . . . . . . . . . . . 100
C.1.2 Extension to Design Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
C.2 Proof of Concavity of Minimax Problem . . . . . . . . . . . . . . . . . . . . . . . . . 104
-
List of Tables
2.1 These are the four γ vectors used in our simulations. The first two correlate with the
mean response vector β, while the second two do not. The second and fourth imply
larger sampling biases than the first and third do. . . . . . . . . . . . . . . . . . . . 24
2.2 MSEs for treatment effect in the ideal setting. Column 1 gives treatment (constant,
linear, quadratic). Column 2 shows whether the propensity was correlated with the
mean response. Column 3 indicates the magnitude of the propensity vector γ. The
remaining columns are mean squared errors for the overall treatment from our 5
estimators and an oracle. In every case, the spiked-in estimator using (2.4) has lowest
MSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 MSEs for treatment effect in the setting with restricted enrollments. The columns are
the same as in Table 2.2. Here the oracle estimator is always best and the dynamic
estimator is the best of the ones that can be implemented. . . . . . . . . . . . . . . 27
2.4 These are the results of the simulations where Assumption 3 is violated but the RCT
has the same x distribution as the ODB. . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 These are the results of the simulations where Assumption 3 is violated and the x in
the RCT are subject to restrictive enrollment criteria. . . . . . . . . . . . . . . . . . 29
2.6 Standardized differences (SD) between treated and control populations in the obser-
vational dataset, before and after stratification on the propensity score, for clinical
risk factors for coronary heart disease. . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Standardized differences (SD) between treated and control populations in the obser-
vational database, before and after stratification on the propensity score, for ethnicity
category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Standardized differences (SD) between treated and control populations in the obser-
vational database, before and after stratification on the propensity score, for smoking
category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Distribution of age variable values in the observational study, RCT, and RCT “silver”
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Distribution of history of cardiovascular disease in the observational study, RCT, and
RCT “silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Distribution of Langley scatter categories in the observational study, RCT, and RCT
“silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Simulation results for each stratification scheme. The third column gives the average
L2 loss over 1,000 replicates of τ̂r, the RCT-only estimator (assuming an RCT of
size 1,000). The following five columns give the average L2 loss of various shrinkage
estimators as a percentage of the average L2 loss of τ̂r. . . . . . . . . . . . . . . . . 63
3.5 Frequency across simulations that the conditions under which κ1+ and κ2+ dominate
τ̂r are met. These conditions are given in Lemma 1 and Lemma 3. . . . . . . . . . . 65
4.1 L2 loss comparisons for regret-minimizing allocations relative to equal allocation. For
starred entries, the regret-minimizing allocation defaults to equal allocation. . . . . . 77
4.2 L2 loss comparisons for regret-minimizing allocations relative to naïve allocation. . . 78
A.1 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for clinical risk factors for coronary heart disease. . . . . . . . . . . . . . . . 94
A.2 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for ethnicity category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.3 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for smoking category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
-
List of Figures
2.1 Performance measures across all 2,000 simulations run in the ideal case. Bias squared
is shown in black, and variance in gray, so that total bar height represents the MSE.
The much larger values for the RCT estimator are excluded to make visual comparison
easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Performance measures across all 2,000 simulations run in the restricted enrollment
case. Bias squared is shown in black, and variance in gray, so that total bar height
represents the MSE. The much larger values for the RCT estimator are excluded to
make visual comparison easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Propensity score distributions among treated and control women (left panel) and
marginal propensity score distributions (right panel) for the ODB and RCT Women’s
Health Initiative populations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 ROC AUC scores for logistic regression prognostic score model in the control popu-
lations of the ODB and RCT silver datasets. . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Causal estimators computed over 100 bootstrap replicates for small and large RCT
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Root mean square error when estimating the causal effect of HT on CHD, across
100 bootstrap replicates for small and large RCT sizes. The gold standard causal
effect is taken to be the age-stratified reweighted estimator, the magnitude of which
is shown via the dashed gold line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed without any adjustment for selection bias, yielding a highly
biased estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed by stabilized inverse probability of treatment weighting, such
that some of the selection bias is removed. . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed without any adjustment for selection bias, yielding a highly
biased estimator. We also induce different distributions for the covariates Xi among
the observational and RCT units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed by stabilized inverse probability of treatment weighting, such
that some of the selection bias is removed. We also induce different distributions for
the covariates Xi among the observational and RCT units. . . . . . . . . . . . . . . 61
4.1 Simulated example of confidence regions in four strata under Γ = 1.2. . . . . . . . . 73
4.2 Allocation of units to strata under naïve scheme and regret-minimizing scheme. . . 75
4.3 Average loss over 1,000 resamples of 1,000-unit experiments under equal-allocation,
naïve-allocation, and regret-minimizing allocation designs. . . . . . . . . . . . . . . . 76
A.1 Nominal and cross-validated receiver operating characteristic area under curve for
propensity models with different numbers of variables. . . . . . . . . . . . . . . . . . 94
-
Chapter 1
Introduction
1.1 The Data Revolution
Passive data collection is a defining feature of modern life. Massive online social networks continually
monitor user interactions (Eckles and Bakshy, 2017); hospitals capture patient medical records in
electronic health databases (Charles et al., 2015); e-commerce giants record real-time sales data as
their customers shop (Bajari et al., 2019). For applied researchers in the social and medical sciences,
this ever-expanding global panopticon yields data that is both promising and perilous.
In the optimistic view, these observational data can provide insight into the causal effect of a
proposed treatment, such as a novel drug regimen or a new marketing strategy. If the strongest
assumptions hold, such data can be used to identify any desired causal effect, obviating the need
to run randomized trials. This prospect has yielded recent conjecture that Big Data may supplant
experimentation as the future of decision-making (Bareinboim and Pearl, 2016).
Yet the past half-century of causal inference research engenders deep skepticism toward such an
approach (Imbens and Rubin, 2015). Researchers do not control the treatment assignment in obser-
vational data, and, as a result, cannot be certain that treated individuals and untreated individuals
are otherwise comparable. This challenge can be overcome only by making untestable assumptions
– and even if these assumptions hold, careful modeling is necessary to remove the selection effect.
The applied literature includes myriad examples of treatments that showed promise in observational
studies only to be overturned by later randomized trials (Hartman et al., 2015). One prominent
case, the effect of hormone therapy on the health of postmenopausal women, will be discussed at
length in this manuscript (Writing Group for the Women’s Health Initiative Investigators, 2002).
The “virtuous” counterpart to observational data is the well-designed experiment. Data from a
randomized trial yield unbiased estimates of a causal effect without the need for problematic
statistical assumptions. Yet experiments suffer two significant drawbacks. First, they are frequently
expensive, and, as a consequence, generally involve fewer units. Especially if one is interested in
subgroup causal effects, this means experimental estimates can be imprecise. Second, experiments often
involve inclusion criteria that can make them dissimilar from target populations of interest. Hence,
while observational studies may suffer significant selection bias due to unmeasured confounding,
experimental data will frequently have high variance and may suffer from bias as well.
There has been considerable recent interest in the development of statistical methods to synthesize
evidence from these two types of data (Mueller et al., 2018; Bareinboim and Pearl, 2016; Kallus
et al., 2018). Yet the current literature – discussed at length in Section 1.3 – offers few concrete
methodological recommendations for applied researchers. This thesis will seek to fulfill the unmet
need. We consider the data fusion problem from three angles. First, we develop methods for merging
experimental and observational causal effect estimates in the case when all confounding variables are
measured in the observational studies. Next, we remove the unconfoundedness assumption, which
leads to a new class of estimators based on a shrinkage approach. Finally, we propose a novel solution
for designing experiments informed by observational studies, making use of the regret minimization
framework. Throughout, we deploy tools from disparate areas of the literature, including Empirical
Bayes, decision theory, and convex optimization.
1.2 Causal Inference Review
1.2.1 Potential Outcome Model
In the vein of Chin (2019), I provide a brief review of causal inference concepts that will be relevant
to this manuscript.
We suppose we have access to a finite population of n individuals. We are considering a treatment
of interest, such as an experimental drug or a behavioral intervention. In this thesis, we will only
consider binary treatments, and we associate with each unit $i$ a random variable $W_i \in \{0, 1\}$, where $W_i = 1$ indicates that unit $i$ receives the treatment and $W_i = 0$ indicates that unit $i$ does not receive the treatment. We will also assume the units have an associated outcome $Y_i$, where we typically suppose $Y_i \in \mathbb{R}$. Lastly, we will assume the units have measured covariates $X_i \in \mathbb{R}^p$.

Throughout, we will adopt the potential outcomes framework of Neyman and Rubin (Rubin, 1974). We associate with each subject $i$ two values, $Y_i(1)$ and $Y_i(0)$. These values represent the
realized outcome for unit i if the unit is treated or not treated respectively. Hence, the observed
outcome $Y_i$ satisfies
$$Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0) .$$
Implicit in this definition is the “Stable Unit Treatment Value Assumption” (SUTVA) – the
assumption that a unit’s potential outcomes do not vary with treatments assigned to other units,
and that there is only one version of the treatment (Rubin, 1980). We will make this assumption
throughout. Additionally, note that in Chapter 2, we use slightly different notation; the potential outcomes are denoted $(Y_{it}, Y_{ic})$ and we use $W_{it}$ to denote $W_i$ and $W_{ic}$ to denote $1 - W_i$.

The typical target of inference will be the average treatment effect (ATE),
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i(1) - Y_i(0) \right) . \tag{1.1}$$
We will often also be interested in treatment effects involving subgroups. Suppose we have subgroups indexed by $k = 1, \ldots, K$ and we have associated indexing sets $S_k$ such that
$$\bigcup_{k} S_k = \{1, 2, \ldots, n\} \quad \text{and} \quad S_k \cap S_j = \emptyset \ \text{ for any } k \neq j.$$
Then the treatment effect for subgroup $k$ is simply
$$\tau_k = \frac{1}{|S_k|} \sum_{i \in S_k} \left( Y_i(1) - Y_i(0) \right) .$$
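As a concrete illustration, the finite-population estimands above can be computed directly when both potential outcomes are available, as in a simulation. The following oracle sketch (the function names are ours, not part of the manuscript) makes the definitions explicit:

```python
import numpy as np

def ate(y1, y0):
    # Average treatment effect (1.1): mean of Y_i(1) - Y_i(0) over all n units.
    return float(np.mean(np.asarray(y1) - np.asarray(y0)))

def subgroup_ates(y1, y0, groups):
    # Treatment effect tau_k within each subgroup; `groups` gives each unit's label k.
    y1, y0, groups = map(np.asarray, (y1, y0, groups))
    return {str(k): float((y1[groups == k] - y0[groups == k]).mean())
            for k in np.unique(groups)}

# Toy example: four units split into two subgroups.
y1 = [3.0, 1.0, 2.0, 4.0]
y0 = [1.0, 1.0, 0.0, 2.0]
groups = ["a", "a", "b", "b"]
overall = ate(y1, y0)                    # 1.5
by_group = subgroup_ates(y1, y0, groups) # {"a": 1.0, "b": 2.0}
```

In practice only one of $Y_i(1)$ and $Y_i(0)$ is observed per unit, so these quantities must be estimated rather than computed.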
1.2.2 Sources of Randomness
$W_i$ is a random variable. Characterizing the joint distribution of all the $W_i$ variables – known as the
“assignment mechanism” – is a central part of inferring causality. One core dichotomy is between
experimental and observational datasets. In the former, the researcher controls the treatment as-
signment and thus knows the assignment mechanism explicitly. In the latter, the researcher typically
makes some assumptions about the distribution and then seeks to infer it from the data.
Define $p_i = P(W_i = 1)$ to be the unit-level treatment probability for unit $i$. A closely related concept is the “propensity score,” given by
$$e(x) = \frac{1}{\sum_i I(X_i = x)} \sum_{i : X_i = x} p_i ,$$
the average treatment probability for all units such that $X_i = x$. These values will be extremely important to our analysis of observational studies, though they are also defined in the setting of experiments. Following Chapter 3 of Imbens and Rubin (2015), we can characterize assignment mechanisms with some additional descriptors based on the dependencies of $p_i$. Namely, an assignment mechanism is
• probabilistic if $0 < p_i < 1$ for all $i = 1, \ldots, n$, signifying that all units have some positive probability of receiving the treatment and some positive probability of receiving the control condition;
• individualistic if $p_i$ depends only on $Y_i(0)$, $Y_i(1)$, and $X_i$ for $i = 1, \ldots, n$, and exhibits no dependency on the covariates or potential outcomes of units $j \neq i$; and
• unconfounded if $p_i$ does not depend on the potential outcomes conditional on the covariates, i.e. $X_i = X_j \implies p_i = p_j$ for any $i$ and $j$, even if their potential outcomes differ.
Throughout this manuscript, we will assume probabilistic and individualistic assignment for all
data. Unconfounded assignment is also always assumed for the experimental data, and is assumed for
the observational data in Chapter 2 but not in subsequent chapters. The combination of probabilistic,
individualistic, and unconfounded assignment is known as “strongly ignorable treatment assignment”
(SITA) and we will often refer to whether or not SITA is assumed.
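To make the propensity score definition concrete, a short sketch (our own illustration, with hypothetical unit-level probabilities) averages the $p_i$ within each covariate level, exactly as in the display above:

```python
from collections import defaultdict

def propensity_score(x_vals, p_vals):
    # e(x): average of the unit-level treatment probabilities p_i
    # over all units with X_i = x.
    sums, counts = defaultdict(float), defaultdict(int)
    for x, p in zip(x_vals, p_vals):
        sums[x] += p
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

# Five units with a binary covariate and hypothetical treatment probabilities.
x_vals = ["lo", "lo", "hi", "hi", "hi"]
p_vals = [0.1, 0.3, 0.6, 0.8, 0.7]
e_hat = propensity_score(x_vals, p_vals)   # e("lo") ≈ 0.2, e("hi") ≈ 0.7
```

This toy assignment mechanism is probabilistic (all $p_i$ strictly between 0 and 1) and, since the $p_i$ depend on the units only through $X_i$, unconfounded.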
In the typical setting, $W_i$ is our only source of randomness and we treat $Y_i(1)$ and $Y_i(0)$ as fixed – but only partially observed – constants. This is the approach we will take in Chapters 2 and 3. An alternative formulation considers $Y_i(0)$ and $Y_i(1)$ to themselves be random variables (see e.g. VanderWeele and Robins, 2012). In this setting, the estimand (1.1) can be easily redefined as
$$\tau = \frac{1}{n} \sum_{i=1}^{n} E\left( Y_i(1) - Y_i(0) \right) .$$
We can now also define a useful quantity that is related to the subgroup treatment effect: the conditional average treatment effect (CATE),
$$\tau(x) = \frac{1}{n} \sum_{i=1}^{n} E\left( Y_i(1) - Y_i(0) \mid X_i = x \right) .$$
The above definitions for the assignment mechanism also easily extend to this case; individualistic
and unconfounded assignment become properties of the joint distribution of $W_i$, $X_i$, $Y_i(0)$, and $Y_i(1)$, rather than properties of the finite sample of $n$ units. We find that this formulation is
more appropriate for discussion of experimental design, and will adopt it as needed in Chapter 4.
1.2.3 Estimation Methods for Observational Studies
In observational studies in which unconfoundedness holds, the treatment assignment is approximately independent of the potential outcomes among units who have sufficiently similar covariate values. Hence, we can treat these units as being drawn from a local experiment, and directly compare treated and control units to infer causal effects. A further insight, first offered in Rosenbaum and Rubin (1983), is that the propensity score is a balancing score. This means that we need only find units with sufficiently similar values of the propensity score – not the entire covariate vector – in order to obtain independence of the treatment assignment from the covariates and the potential outcomes.
This discovery gave rise to causal inference techniques that rely on estimating the propensity
score as a function of the covariates, and then using the fitted propensity score estimates to create
cohorts of units with similar values. Common methods include pair-matching treated and control
units; stratifying units; and weighting units in order to recover a local experiment (Austin, 2011). Propensity score stratification approaches will be used in Chapter 2, while weighting approaches, which give rise to “inverse probability weighting” (IPW) estimators, will play a major role in Chapters 3 and 4.
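As an illustration of the weighting approach, a minimal Hájek-style IPW estimator of the ATE can be sketched as follows (the function and the toy data are ours; Chapters 3 and 4 develop the estimators actually used):

```python
import numpy as np

def ipw_ate(y, w, e_hat):
    # Hajek-style IPW: weight treated units by 1/e(x) and controls by
    # 1/(1 - e(x)), then take the difference of the weighted outcome means.
    y, w, e = map(np.asarray, (y, w, e_hat))
    wt_t = w / e
    wt_c = (1 - w) / (1 - e)
    mu_t = np.sum(wt_t * y) / np.sum(wt_t)
    mu_c = np.sum(wt_c * y) / np.sum(wt_c)
    return float(mu_t - mu_c)

# Toy data: constant propensity 0.5, so IPW reduces to a difference in means.
y = [2.0, 3.0, 1.0, 0.0]
w = [1, 1, 0, 0]
e_hat = [0.5, 0.5, 0.5, 0.5]
est = ipw_ate(y, w, e_hat)   # 2.0
```

With non-constant fitted propensities, the same weights reweight the observed sample toward the local experiment described above.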
There exist many more complex estimators that rely on modeling both the propensity score and
the potential outcomes as functions of the covariates. Imbens and Rubin (2015) suggest combining
propensity score stratification with regression adjustment to improve precision. “Doubly robust”
estimators (Kang et al., 2007) are formulated using both models such that they are asymptotically
consistent if either the propensity model or the outcome model is correctly specified. Heterogeneous treatment effect estimation in observational studies is a very active area of research; many modern
methods involve both outcome and propensity modeling (see e.g. Künzel et al., 2019; Chernozhukov
et al., 2018) and attain this property.
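The doubly robust idea can be made concrete with the standard augmented IPW (AIPW) form, shown here as a generic sketch rather than as any specific estimator from the works cited:

```python
import numpy as np

def aipw_ate(y, w, e_hat, mu1_hat, mu0_hat):
    # Augmented IPW: outcome-model predictions mu1_hat/mu0_hat, corrected by
    # propensity-weighted residuals. Consistent if either model is correct.
    y, w, e, m1, m0 = map(np.asarray, (y, w, e_hat, mu1_hat, mu0_hat))
    psi1 = m1 + w * (y - m1) / e
    psi0 = m0 + (1 - w) * (y - m0) / (1 - e)
    return float(np.mean(psi1 - psi0))

# Toy data with crude outcome models and a constant propensity.
y = [1.0, 3.0, 0.0, 2.0]
w = [1, 1, 0, 0]
e_hat = [0.5] * 4
mu1_hat = [2.0] * 4
mu0_hat = [1.0] * 4
est = aipw_ate(y, w, e_hat, mu1_hat, mu0_hat)   # 1.0
```

The residual-correction terms vanish in expectation when the outcome models are correct, and the weighting recovers consistency when the propensity model is correct.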
In the absence of unconfoundedness, inferential methodology is much more limited. Causal
estimates can be reliably identified in certain quasi-experimental settings – regression discontinuity designs and instrumental variable analyses being two prominent examples – but these are not
generic methods. Researchers frequently focus not on point identification, but rather on methods to
determine whether measured causal effects are sufficiently robust to the possibility of unmeasured
confounding. This area, known as “sensitivity analysis,” has yielded a variety of new methods in
recent years (Ding and VanderWeele, 2016; Fogarty, 2019; VanderWeele and Ding, 2017). We will
make use of a new method offered by Zhao et al. (2019) in Chapters 3 and 4.
1.3 Literature Review
The data fusion problem relates to several different areas in the statistical and epidemiological
literature. For example, the meta-analysis literature considers the broad challenges of evidence
synthesis across multiple studies. Papers in this area have been highlighting the lack of consensus
about how to merge observational and experimental studies for at least a quarter of a century
(Pladevall-Vila et al., 1996). Yet even without clear methodological guidelines, observational studies
are frequently included in systematic reviews; a 2014 survey found that just 36% of 300 such reviews
were restricted to experimental and quasi-experimental data, with the remainder including at least
some types of observational studies (Page et al., 2016).
Mueller et al. (2018) recently published a summary of methods to incorporate observational
studies into meta-analyses. They considered 93 relevant articles published between 1994 and 2016,
finding that many recommendations for the inclusion of observational studies were essentially unchanged from those used for randomized controlled trials (RCTs). While about 40% of the articles
made recommendations on the assessment of bias in observational studies, there was little agreement
on best practices for combining the data. The methodological questions considered in most of the
papers were whether or not to report a single effect estimate, and whether or not to use a fixed vs. a random effects model to combine the individual study estimates. These questions relate to the heterogeneity of effect estimates, but they do not engage with the unique challenges of confounding
in observational studies.
Mueller and coauthors highlight a few exceptions. Thompson et al. (2011) propose estimating bias
reduction based on the subjective judgment of a panel of assessors, and adjusting the observational
study results accordingly. Their method requires a high degree of subject matter expertise. Prevost
et al. (2000) suggest a hierarchical Bayes approach in which the difference between observational
and experimental results is modeled explicitly. The authors emphasize certain advantages to this
approach but also note that results are sensitive to the choice of prior. In totality, the meta-analysis
literature underscores the need for a more robust toolbox to synthesize heterogeneous data sources,
while accounting for the unique challenges of causal inference with observational data.
Another closely related area of the literature is that of transportability and generalizability. In
the middle of the twentieth century, Campbell (1957) introduced the concepts of “internal validity”
and “external validity” to distinguish between challenges of treatment effect estimation and generalizability in quasi-experimental research. This paradigm was widely adopted among social scientists. The problem of extending causal findings across different domains is now known under the broader banner of “transportability,” which subsumes results from the meta-analysis and treatment effect
heterogeneity literatures (Bareinboim and Pearl, 2016).
In this context, there is substantial research on using observational data to determine how to
generalize causal effects from an experiment to a target population. Early work focused heavily on cases in which such generalization is invalid (see e.g. Manski, 2009; Höfler et al., 2010). In a series of papers (Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2013, 2016), Pearl and Bareinboim formalized this problem, derived conditions under which “transport” is possible, and developed
algorithms for returning the correct “transport formula.” While they considered the problem in the
context of graphical models, it has also been considered in the more classical causal inference setting.
For example, Hartman et al. (2015) derived assumptions and placebo tests for identifying population
treatment effects from RCTs. Stuart et al. (2011) advocated the use of propensity scores as a tool
to assess generalizability. A variety of other work has advocated reweighting approaches in order to
generalize results (Cole and Stuart, 2010; Andrews and Oster, 2017).
We are explicitly interested in using causal estimates derived from observational studies, an area
that has received comparatively less attention than the transportability question. The hesitancy
to use these estimates is not particularly surprising: if a researcher has access to an experiment,
he or she will likely be cautious about incorporating observational data that might introduce bias.
One approach is to assume unconfoundedness in the observational study, meaning that all variables
affecting the treatment assignment and the outcome are measured. This is our approach in Chapter
2; it is also used in Athey et al. (2019).
A small number of prior papers have attempted to weaken the unconfoundedness assumption
and proceed with merged estimation. They often introduce alternative assumptions. In Kallus et al.
(2018), the authors assume that the hidden confounding has a parametric structure that can be modeled effectively. They suggest fitting a model $\hat{\omega}$ to the observational study to predict local average treatment effects, and then learning a second model $\hat{\eta}$ which interpolates between the predictions of $\hat{\omega}$ on the RCT units and the actual observed outcomes in the RCT. They posit that the sum of these functions, $\hat{\omega}(x) + \hat{\eta}(x)$, is a good estimate for the CATE at $x$. Yet their theoretical guarantees rely heavily on determining the correct functional form for $\hat{\eta}(\cdot)$.

In Peysakhovich and Lada (2016), it is assumed the bias preserves unit-level relative rank ordering (as the authors say, “bigger causal effects imply bigger bias”). They argue that their set of assumptions is reasonable in their setting, which involves time series data with multiple observations per unit. But it does not easily generalize to the more standard case where each unit’s outcome is observed only at a single time point.
A number of other approaches have been suggested, such as methods that make use of Bayesian
networks (Cooper and Yoo, 1999) or structural causal models (Mooij et al., 2016). Yet this inquiry is unquestionably in its infancy. One can find a chorus of recent papers explicitly calling for more
methodological development in the area of combining observational and experimental data (Mueller
et al., 2018; Shalit, 2020). This thesis seeks to heed that call.
1.4 Contributions
The contents of this thesis are divided among the three chapters to follow, each of which corresponds
to a distinct manuscript. Each chapter considers the data fusion problem from a different angle.
The second chapter is adapted from “Propensity Score Methods for Merging Observational and
Experimental Datasets,” jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack
(Rosenman et al., 2018). This chapter considers estimation of the average treatment effect from a
completed pair of studies: a randomized controlled trial and an observational study. We assume that
SITA holds in the latter study. We propose a general procedure in which the data is jointly stratified
on the output of a propensity score estimation function fitted to the observational study, as well as
causal effect moderators. We propose three novel estimators for the causal effect within each stratum,
and use the delta method to determine when each would be expected to outperform. We apply our
methods to data from the Women’s Health Initiative, a study of thousands of postmenopausal women
which has both observational and experimental data on hormone therapy (HT).
The third chapter is adapted from “Combining Observational and Experimental Datasets Using
Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi
(Rosenman et al., 2020). This chapter considers the same setting, but we remove the assumption
that all confounders are measured and choose as our objective the $L_2$ loss in measuring a vector of
stratum-specific treatment effects. We propose a generic procedure for deriving shrinkage estimators
in this setting, making use of a generalized unbiased risk estimate. Then, we develop two new
estimators, prove finite sample conditions under which they have lower risk than an estimator using
only experimental data, and show that each achieves a notion of asymptotic optimality. Lastly, we
draw connections between our approach and state-of-the-art results in sensitivity analysis, including
proposing a method for evaluating the feasibility of our estimators.
The fourth chapter is forthcoming as its own manuscript. In this chapter, we depart from the
estimation setting and consider how to design a stratified experiment making use of data from a
completed observational study. Again, our objective is the $L_2$ loss in measuring the vector of treatment effects. In the case of a binary outcome, we obtain valid, bias-aware confidence regions for the
pilot estimates of the stratum-specific variances derived from the observational study, generalizing
recent results from Zhao et al. (2019). Then, we show that experiments can be designed to minimize
a notion of regret by solving a convex optimization problem. We again demonstrate the utility of
our methods with an application to data from the Women’s Health Initiative.
In the final chapter, we discuss limitations and future directions for this line of research.
Chapter 2
Combining Datasets Under SITA
2.1 Introduction
We first consider how to combine the information from a large observational database (ODB) with
data from a smaller randomized controlled trial (RCT), under the assumption that all confounders
are measured in the observational study. Our goal is to obtain a treatment e↵ect estimate that is
more accurate than either source could yield on its own.
We present three methods to combine our two sources of data. The key technique underlying all
of these methods is to score subjects in the RCT according to their propensity for treatment had they
been in the ODB instead. They are then placed in pooled strata containing some ODB and some
RCT observations with comparable propensities. To see why this might help, consider a stratum in
the ODB comprised entirely of subjects with a very low treatment propensity. The RCT samples in
that same stratum will be more evenly split between treatment and control, increasing a critically
low within-stratum sample size of treated subjects. We are therefore extending the stratification of
observational data by propensity, as described by Imbens and Rubin (2015) and Stuart and Rubin
(2007).
For any data combination method such as ours to succeed, it is necessary to make assumptions
that cannot be tested within the available data and have to be judged instead on scientific grounds.
At a minimum, we require that treatment effects for subjects in a stratum not depend too strongly,
if at all, on the data set the subjects came from. We will describe this and our other assumptions
below.
We apply our methods to data from the Women’s Health Initiative, or WHI (Writing Group for
the Women’s Health Initiative Investigators, 2002). This study includes an RCT paired with an
ODB. The treatment is hormone therapy (HT) and the outcome measure we consider is coronary
heart disease (CHD). Conclusions from the observational data alone proved to be misleading due to
differences between the treated and untreated subjects, as revealed by a large RCT. Accordingly,
it would be interesting to see if a smaller RCT combined with the ODB would have been more
accurate than the ODB alone, potentially providing an earlier warning. The WHI’s RCT was quite
large. This allows us to split it into a holdout sample that we use to define a true treatment effect.
Using that holdout estimate as a gold standard, we can then compare our combined methods with
methods using only one of the data sources.
2.1.1 Scientific assumptions
Here we describe our assumptions in qualitative terms.
As mentioned above, we require the RCT and ODB to have comparable treatment effects within strata. The strong form of this assumption is that everybody in each stratum we form has the same treatment effect. Most of our results use a weaker form in which only the average treatment effects have to be the same for RCT and ODB subjects in each stratum. Note however that equal average treatment effects implies equal average differential outcomes, but does not imply equal average outcomes under either the test or control conditions.
We will assume that the propensity for treatment in the ODB is nearly constant within each of
our strata. This assumption requires that the variables we use to form strata include the important
quantities predictive of whether treatment or control was assigned. It also requires that we estimate
the propensity well. In our motivating problems, the ODB is so large that getting good propensity
estimates is a very reasonable assumption, provided that suitable predictor variables are present.
Our goal is to estimate a population average treatment effect. For that, we need to know the
correct proportion of the population corresponding to each stratum. In our motivating medical
context, we suppose that the owner of the large ODB is interested in stratum proportions given by
that very same patient population. The RCT, on the other hand, might have different sampling
proportions due, for example, to enrollment criteria that restrict who may participate. In our
examples, the stratum proportions are exactly those of the ODB.
2.1.2 Outline
This chapter is organized as follows. In Section 2.2, we define our notation, assumptions and estimators of the average treatment effect. Our estimates combine some within-stratum estimates, computed in various ways. Our first proposed estimator, called the spiked-in estimator, simply places RCT units into propensity-based strata defined on the ODB, and estimates the treatment effect in each stratum without regard to which data set the subjects came from. Our second estimator takes a sample-size-weighted average of RCT and ODB-based treatment effect estimates within each stratum. Our third new estimator, called the dynamic weighted average, uses data-driven weights instead of sample size weights to combine ODB and RCT estimates within each stratum so as to minimize an estimate of the mean squared error in each stratum. For the large sample sizes of interest to us, delta method approximations to the mean and variance are accurate enough.
Section 2.3 presents delta method estimates of the within-stratum bias and variance for our
estimators. We see theoretically that one of our estimators, the spiked-in estimator, can have an
enormous bias if average test and control outcomes are not both comparable between data sets.
Another estimator, the dynamic weighted estimator, is more robust.
Section 2.5 gives numerical illustrations of our method for an ODB of size 5,000 and an RCT
of size 200. Section 2.6 gives more background on the WHI data. It then gives a detailed model for the WHI data, developing a propensity and defining a gold standard average treatment effect on a holdout sample from the WHI’s RCT. A variant of the spiked-in estimator is introduced in Section 2.6.3. That variant refines propensity strata via a prognostic score predictive of coronary heart disease in untreated subjects, leading to a “dual-spiked” estimator. All the estimators are compared via bootstrap simulations in Section 2.6.4, and the dual-spiked estimate is most accurate for the WHI data. Lengthier proofs can be found in the Appendix. Section 2.7 summarizes our conclusions.
2.2 Notation, assumptions and estimators
Some subjects belong to the randomized controlled trial (RCT) and others to the observational database (ODB). We assume that no subject is in both data sets. We write $i \in \mathcal{R}$ if subject $i$ is in the RCT and $i \in \mathcal{O}$ otherwise. Subject $i$ has an outcome $Y_i \in \mathbb{R}$ and some covariates that we encode in the vector $x_i \in \mathbb{R}^d$. Subject $i$ receives either the test or control condition.

The condition of subject $i$ is given by a treatment variable $W_i \in \{0, 1\}$, where $W_i = 1$ if subject $i$ is in the test condition (and 0 otherwise). Some formulas simplify when we can use parallel notation for both test and control settings. Accordingly we introduce $W_{it} = W_i$ and $W_{ic} = 1 - W_i$. Other formulas look better when focused on the test condition. For instance, letting $p_{it} = \Pr(W_{it} = 1)$ and $p_{ic} = \Pr(W_{ic} = 1)$, the expression $p_{it}(1 - p_{it})$ is immediately recognizable as a Bernoulli variance and is preferred to $p_{it} p_{ic}$.
2.2.1 Sampling assumptions
We adopt the potential outcomes framework of Neyman and Rubin (Rubin, 1974). Subject $i$ has two potential outcomes, $Y_{it}$ and $Y_{ic}$, corresponding to test and control conditions respectively. Then $Y_i = W_{it} Y_{it} + W_{ic} Y_{ic}$. The potential outcomes $(Y_{it}, Y_{ic})$ are non-random and we will assume that they are bounded. The treatment effect for subject $i$ is $Y_{it} - Y_{ic}$. We work conditionally on the observed values of covariates and so the $x_i$ are also non-random.

All of the randomness in our model comes from the treatment variables $W_i$. We write $\mathrm{Bern}(p)$ for a Bernoulli random variable taking the value 1 with probability $p$ and 0 with probability $1 - p$. The ODB and RCT differ in how the $W_i$ are distributed.
Assumption 1 (ODB sampling). If $i \in \mathcal{O}$, then $W_i \sim \mathrm{Bern}(p_i)$ independently, where $p_i = e(x_i)$ with $0 < p_i < 1$.
The function $e(\cdot)$ in Assumption 1 is the propensity. Because the propensity depends only on $x$, and is never 0 or 1, the ODB has a strongly ignorable treatment assignment (SITA) (Rosenbaum and Rubin, 1984). Because the $W_i$ are independent, the outcome for subject $i$ is unaffected by the treatment $W_{i'}$ for any subject $i' \neq i$. That is, our model for the ODB satisfies the stable unit treatment value assumption or SUTVA (Imbens and Rubin, 2015).
Assumption 2 (RCT sampling). If $i \in \mathcal{R}$, then $W_i \sim \mathrm{Bern}(p_r)$ independently for a common probability $0 < p_r < 1$.

The RCT will commonly have $p_r = 1/2$ but we do not assume this. We additionally assume that the ODB is independent of the RCT.
2.2.2 Stratification and treatment effect assumptions
We will use $K$ strata indexed by $k = 1, \ldots, K$. The stratum for subject $i$ depends on $x_i$. The sets $\mathcal{O}_k$ and $\mathcal{R}_k$ contain the subjects in stratum $k$ from the ODB and RCT respectively. We assume that each stratum contains only a narrow range of propensity values $e(x_i)$. Strata defined by propensity ranges may be further partitioned by variables in $x_i$, using domain knowledge if applicable, in order to make the treatment effect more nearly constant within strata. Propensity score stratification with sub-stratification on other important predictors is a commonly used strategy for causal inference in observational studies (Imbens and Rubin, 2015; Stuart and Rubin, 2007).
Our model allows the treatment effect to vary by stratum. We begin with a strong assumption about the treatment effects.

Assumption 3. For all strata $k = 1, \ldots, K$, there is a treatment effect $\tau_k$ with $Y_{it} - Y_{ic} = \tau_k$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$.
In most of our work, we can weaken this assumption to just require equality on average within each stratum. The weakened version is given as Assumption 4 below. Let the sample sizes of the ODB and RCT be $n_o$ and $n_r$ respectively. Ordinarily $n_o \gg n_r$. The ODB and RCT sample sizes within stratum $k$ are $n_{ok}$ and $n_{rk}$. The within-stratum average treatment effects are
$$\tau_{ok} = \frac{1}{n_{ok}} \sum_{i \in \mathcal{O}_k} (Y_{it} - Y_{ic}) \quad \text{and} \quad \tau_{rk} = \frac{1}{n_{rk}} \sum_{i \in \mathcal{R}_k} (Y_{it} - Y_{ic}), \tag{2.1}$$
when their denominator counts are positive. We will never use strata with $n_{ok} = 0$ when we later weight strata proportionally to their ODB sizes.
Assumption 4. For $k = 1, \ldots, K$, if $\min(n_{ok}, n_{rk}) > 0$ then $\tau_{ok} = \tau_{rk}$ and we call their common value $\tau_k$. If $n_{ok} > n_{rk} = 0$ take $\tau_k = \tau_{ok}$, and if $n_{rk} > n_{ok} = 0$ take $\tau_k = \tau_{rk}$.
Assumption 4 might be unrealistic if the treatment is applied differently in the ODB versus the RCT. We thus suppose some form of “treatment version irrelevance” (Lesko et al., 2017).
We need the strong Assumption 3 in one place to estimate a quantity that depends on both potential outcomes of a single subject. Because our strata will be based at least partially on propensity, Assumption 3 is very nearly true under the model of Xie et al. (2012b). In the Appendix, some simulations will involve data that violate Assumption 3.
2.2.3 Estimators
Our estimand is the global average treatment effect defined by
$$\tau = \sum_{k=1}^{K} \omega_k \tau_k$$
for weights $\omega_k > 0$ with $\sum_{k=1}^{K} \omega_k = 1$. The weights can be chosen to match population characteristics. We use $\omega_k = n_{ok}/n_o$. Then $\omega_k = 0$ whenever $n_{ok} = 0$ and we have a well-defined $\tau_k$ for every stratum that contributes to $\tau$. We may still have $n_{rk} = 0$ for some strata with $\omega_k > 0$. Our estimators all take the form $\sum_k \omega_k \hat{\tau}_k$ for different within-stratum estimates $\hat{\tau}_k$.
We begin with “single data source” estimators before describing our proposed new estimators. An ODB-only estimate of the treatment effect in stratum $k$ is
$$\hat{\tau}_{ok} = \frac{\sum_{i \in \mathcal{O}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{O}_k} W_{it}} - \frac{\sum_{i \in \mathcal{O}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{O}_k} W_{ic}} . \tag{2.2}$$
Then $\hat{\tau}_o = \sum_k \omega_k \hat{\tau}_{ok}$. A potential problem with $\hat{\tau}_o$ comes from bins $k$ with very small propensity values. Then $\mathcal{O}_k$ may contain very few observations with $W_{it} = 1$ and $\hat{\tau}_{ok}$ may have high variance. Similarly, for bins $k$ associated with large propensity values, $\mathcal{O}_k$ may contain very few observations with $W_{ic} = 1$, which again leads to high variance. That is, the “edge bins” can have very skewed sample sizes, causing problems for $\hat{\tau}_o$.
The ODB estimate (2.2) is a difference of ratio estimators, because the denominators are random. We will see in Section 2.3 that there can also be a severe bias in the edge bins. An analogous RCT-only estimator is $\hat{\tau}_r = \sum_k \omega_k \hat{\tau}_{rk}$, where
$$\hat{\tau}_{rk} = \frac{\sum_{i \in \mathcal{R}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{R}_k} W_{it}} - \frac{\sum_{i \in \mathcal{R}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{R}_k} W_{ic}} . \tag{2.3}$$
Because the RCT assigns treatments with constant probability, the edge bins have less imbalanced treatment outcomes. However, because the RCT is small, we may find several of the strata have very small sample sizes $n_{rk}$.
Our first hybrid estimator is $\hat{\tau}_s = \sum_k \omega_k \hat{\tau}_{sk}$, where
$$\hat{\tau}_{sk} = \frac{\sum_{i \in \mathcal{O}_k} W_{it} Y_{it} + \sum_{i \in \mathcal{R}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{O}_k} W_{it} + \sum_{i \in \mathcal{R}_k} W_{it}} - \frac{\sum_{i \in \mathcal{O}_k} W_{ic} Y_{ic} + \sum_{i \in \mathcal{R}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{O}_k} W_{ic} + \sum_{i \in \mathcal{R}_k} W_{ic}} . \tag{2.4}$$
The RCT data are “spiked” into the ODB strata. This spiked-in estimator can improve upon
the ODB estimator by increasing the number of treated units in the low-propensity edge bins and
increasing the number of control units in the high-propensity edge bins. Even a small number of
such balancing observations can be extremely valuable.
The spiked-in estimator is not a convex combination of $\hat{\tau}_{ok}$ and $\hat{\tau}_{rk}$, because the pooling is first done among the test and control units. Our final two estimators are constructed as convex combinations of $\hat{\tau}_{ok}$ and $\hat{\tau}_{rk}$.
The weighted average estimator $\hat{\tau}_w$ uses
$$\hat{\tau}_{wk} = \lambda_k \hat{\tau}_{ok} + (1 - \lambda_k)\hat{\tau}_{rk}, \quad \text{where} \quad \lambda_k = \frac{n_{ok}}{n_{ok} + n_{rk}} . \tag{2.5}$$
It weights $\hat{\tau}_{rk}$ and $\hat{\tau}_{ok}$ according to the number of data points involved in each estimate.
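The stratum-level estimators (2.2)–(2.5) can be sketched compactly for a single stratum; this is our own illustrative code, with helper names that do not appear in the manuscript:

```python
import numpy as np

def diff_in_means(y, w):
    # Difference of treated and control outcome means, as in (2.2) and (2.3).
    y, w = np.asarray(y, float), np.asarray(w)
    return float(y[w == 1].mean() - y[w == 0].mean())

def spiked_in(y_o, w_o, y_r, w_r):
    # Spiked-in estimate (2.4): pool ODB and RCT units in the stratum first.
    return diff_in_means(np.concatenate([y_o, y_r]),
                         np.concatenate([w_o, w_r]))

def weighted_avg(y_o, w_o, y_r, w_r):
    # Sample-size weighted estimate (2.5).
    lam = len(y_o) / (len(y_o) + len(y_r))
    return lam * diff_in_means(y_o, w_o) + (1 - lam) * diff_in_means(y_r, w_r)

# A low-propensity stratum: the ODB has a single treated unit; the RCT is balanced.
y_o, w_o = [5.0, 1.0, 1.0, 1.0], [1, 0, 0, 0]
y_r, w_r = [4.0, 2.0], [1, 0]
tau_o = diff_in_means(y_o, w_o)            # ODB-only: 4.0
tau_s = spiked_in(y_o, w_o, y_r, w_r)      # spiked-in: 3.25
tau_w = weighted_avg(y_o, w_o, y_r, w_r)   # weighted average
```

The spiked-in estimate moves substantially away from the ODB-only answer here because the RCT supplies a second treated unit to an edge bin.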
Our final estimator is a "dynamic weighted average" $\hat\tau_d$. It uses weights for $\hat\tau_{rk}$ and $\hat\tau_{ok}$ that are estimated from the data. Those weights are chosen to minimize an estimate of mean squared error (MSE) derived using the delta method in the following section. We can observe its approximate optimality via the following result, recalling that the RCT estimator will in general be unbiased.
Proposition 1. Let $\hat\theta_1$ and $\hat\theta_2$ be independent estimators of a common quantity $\theta$, with bias, variance, and mean squared errors satisfying $\mathrm{Bias}(\hat\theta_1) \in (-\infty,\infty)$, $\mathrm{Bias}(\hat\theta_2) = 0$, and $\mathrm{var}(\hat\theta_j), \mathrm{MSE}(\hat\theta_j) \in (0,\infty)$ for $j = 1, 2$. For $c \in \mathbb{R}$, let $\hat\theta_c = c\hat\theta_1 + (1-c)\hat\theta_2$. Then
\[
c^* \equiv \arg\min_c \mathrm{MSE}(\hat\theta_c) = \frac{\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}.
\]
This linear combination has
\[
\mathrm{Bias}(\hat\theta_{c^*}) = \frac{\mathrm{Bias}(\hat\theta_1)\,\mathrm{MSE}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{MSE}(\hat\theta_2)}, \quad
\mathrm{var}(\hat\theta_{c^*}) = (c^*)^2\mathrm{var}(\hat\theta_1) + (1-c^*)^2\mathrm{var}(\hat\theta_2), \quad\text{and}\quad
\mathrm{MSE}(\hat\theta_{c^*}) = \frac{\mathrm{MSE}(\hat\theta_1)\,\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}. \tag{2.6}
\]

Proof. Independence of the $\hat\theta_j$ yields $\mathrm{var}(\hat\theta_c) = c^2\mathrm{var}(\hat\theta_1) + (1-c)^2\mathrm{var}(\hat\theta_2)$, while linearity of expectation yields $\mathrm{Bias}(\hat\theta_c) = c\,\mathrm{Bias}(\hat\theta_1)$. Optimizing $\mathrm{MSE}(\hat\theta_c)$ over $c$ yields the result.
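Proposition 1 is straightforward to check numerically. The sketch below (our own verification, with made-up moments) grids over $c$ and confirms both the minimizer and the minimized MSE in (2.6):

```python
import numpy as np

# Hypothetical moments: estimator 1 is biased, estimator 2 is unbiased.
bias1, var1, var2 = 0.5, 1.0, 2.0
mse1 = bias1**2 + var1

def mse_c(c):
    # MSE of c*theta1_hat + (1-c)*theta2_hat for independent estimators:
    # bias is c*bias1; variance is c^2*var1 + (1-c)^2*var2.
    return (c * bias1)**2 + c**2 * var1 + (1 - c)**2 * var2

c_star = var2 / (mse1 + var2)           # closed form from Proposition 1
grid = np.linspace(0.0, 1.0, 100001)
c_grid = grid[np.argmin(mse_c(grid))]   # brute-force minimizer

print(c_star, c_grid)                   # the two minimizers agree
print(mse_c(c_star), mse1 * var2 / (mse1 + var2))  # minimized MSE matches (2.6)
```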
2.3 Delta method results

Let $X$ be a random vector with mean $\mu$ and a finite covariance matrix. Let $f$ be a function of $X$ that is twice differentiable in an open set containing $\mu$, and let $f_1$ and $f_2$ be first and second order Taylor approximations to $f$ around $\mu$. Then the delta method mean and variance of $f(X)$ are
\[
E_\delta(f(X)) = E(f_2(X)) \quad\text{and}\quad \mathrm{var}_\delta(f(X)) = \mathrm{var}(f_1(X)).
\]
Sometimes, to combine estimates, we will need a delta method mean for a weighted sum of those estimates. We will also need a delta method variance for a weighted sum of independent random variables. We use the following natural expressions without resorting to Taylor approximations:
\[
E_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j E_\delta(\hat\tau_j) \tag{2.1}
\]
\[
\mathrm{var}_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j^2 \mathrm{var}_\delta(\hat\tau_j), \quad\text{for independent } \hat\tau_j. \tag{2.2}
\]
2.3.1 Population quantities

We will study our estimators in terms of some population quantities. These involve some unobserved values of $Y_{it}$ or $Y_{ic}$. For instance, the test and control stratum averages in the ODB are
\[
\mu_{okt} = \frac{\sum_{i\in O_k} Y_{it}}{n_{ok}} \quad\text{and}\quad \mu_{okc} = \frac{\sum_{i\in O_k} Y_{ic}}{n_{ok}}
\]
and it is typical that both of these are unobserved. Corresponding values for the RCT are $\mu_{rkt}$ and $\mu_{rkc}$.

When we merge ODB and RCT strata we will have to consider a kind of skew in which the within-stratum mean responses above differ between the two data sets. To this end, define $\Lambda_{kt} = \mu_{okt} - \mu_{rkt}$ and $\Lambda_{kc} = \mu_{okc} - \mu_{rkc}$. Under either the stronger Assumption 3 or the weaker Assumption 4, $\Lambda_{kt} = (\tau_k + \mu_{okc}) - (\tau_k + \mu_{rkc}) = \Lambda_{kc}$. We will use $\Lambda_k = \Lambda_{kt} = \Lambda_{kc}$. We will see that large values of $\Lambda_k$ can bias the spiked-in estimator. Reducing that bias is the main motivation for our dynamic weighted average estimator.
Now we define several other population quantities. Let $S$ be a finite non-empty set of $n = n(S)$ indices, such as one of our strata $O_k$ or $R_k$. For each $i \in S$, let $(Y_{it}, Y_{ic}) \in [-B,B]^2$ be a pair of bounded potential outcomes, let $W_i = W_{it}$ be independent $\mathrm{Bern}(p_i)$ random variables, and let $W_{ic} = 1 - W_{it}$. Some of our results add the condition that all $p_i \in [\epsilon, 1-\epsilon]$ for some $\epsilon > 0$. For $S$ so equipped, we define average responses
\[
\mu_t = \mu_t(S) = \frac{1}{n}\sum_{i\in S} Y_{it} \quad\text{and}\quad \mu_c = \mu_c(S) = \frac{1}{n}\sum_{i\in S} Y_{ic}. \tag{2.3}
\]
For example, $\mu_{okt}$ above is $\mu_t(O_k)$. We use average treatment probabilities
\[
p_t = p_t(S) = \frac{1}{n}\sum_{i\in S} p_i \quad\text{and}\quad p_c = p_c(S) = 1 - p_t(S). \tag{2.4}
\]
These become $p_{okt}$, $p_{okc}$, $p_{rkt}$ and $p_{rkc}$ in a natural notation when $S$ is $O_k$ or $R_k$. The above quantities are averages over $i$ uniformly distributed in $S$, as distinct from expectations with respect to random $W_i$. We also need some covariances of this type between response and propensity values,
\[
\begin{aligned}
s_t = s_t(S) &= \frac{1}{n}\sum_{i\in S} Y_{it}p_i - \mu_t p_t \quad\text{and} \\
s_c = s_c(S) &= \frac{1}{n}\sum_{i\in S} Y_{ic}(1-p_i) - \mu_c p_c.
\end{aligned} \tag{2.5}
\]
We will find that these quantities play an important role in bias. If, for instance, the larger values
of Yit tend to co-occur with higher propensities pi, then averages are biased up.
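A small simulation (our own toy, not from the thesis) illustrates the direction of the effect: when $Y_{it}$ increases with $p_i$, the treated-arm ratio estimator overshoots $\mu_t$, in line with the $s_t/p_t$ term in (2.8) below:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # propensities
y_t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # Y_it rises with p_i
mu_t, p_t = y_t.mean(), p.mean()
s_t = np.mean(y_t * p) - mu_t * p_t        # s_t from (2.5); here 0.4 > 0

# Monte Carlo mean of the treated-arm ratio estimator sum(W Y)/sum(W),
# conditional on at least one treated unit.
reps = 200_000
W = rng.random((reps, len(p))) < p
keep = W.sum(axis=1) > 0
est = (W[keep] * y_t).sum(axis=1) / W[keep].sum(axis=1)
print(mu_t, est.mean())  # the simulated mean clearly exceeds mu_t = 3.0
```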
The delta method variances of our estimators depend on the following weighted averages of
squares and cross products
\[
\begin{aligned}
S_{tt} = S_{tt}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{it} - \rho_t)^2, \\
S_{cc} = S_{cc}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{ic} - \rho_c)^2, \quad\text{and} \\
S_{tc} = S_{tc}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{it} - \rho_t)(Y_{ic} - \rho_c),
\end{aligned} \tag{2.6}
\]
where $\rho_t = \rho_t(S) = \mu_t(S) + s_t(S)/p_t(S)$ and $\rho_c = \rho_c(S) = \mu_c(S) + s_c(S)/p_c(S)$. The quantity $\rho_t$ is the lead term in $E_\delta\bigl(\sum_{i\in S} W_{it}Y_{it}/\sum_{i\in S} W_{it}\bigr)$ and $\rho_c$ is similar. More details about these quantities are in the Appendix where Theorem 1 is proved.
Proposition 2. Let $S$ be $O_k$, $R_k$ or $O_k \cup R_k$. Then under Assumption 3, $s_c(S) = -s_t(S)$.

Proof. Under Assumption 3, we can set $Y_{it} = Y_{ic} + \tau_k$ and $\mu_t = \mu_c + \tau_k$ in (2.5).
2.3.2 Main theorem
We will compare the efficiency of our five estimators using their delta method approximations. We
state two elementary propositions without proof and then give our main theorem. Results for our
various estimators are mostly direct corollaries of that theorem.
Proposition 3. Let $x$ and $y$ be jointly distributed random variables with means $x_0 \neq 0$ and $y_0$ respectively, and finite variances. Let $\rho = y_0/x_0$. Then
\[
E_\delta\Bigl(\frac{y}{x}\Bigr) = \rho - \frac{\mathrm{cov}(y - \rho x, x)}{x_0^2}, \quad\text{and}\quad
\mathrm{var}_\delta\Bigl(\frac{y}{x}\Bigr) = \frac{\mathrm{var}(y - \rho x)}{x_0^2}.
\]
Proposition 4. Let $x_t$, $x_c$, $y_t$, $y_c$ be jointly distributed random variables with finite variances and means $x_{j,0} \neq 0$ and $y_{j,0}$ respectively, for $j \in \{t, c\}$. Let $\rho_j = y_{j,0}/x_{j,0}$. Then
\[
\mathrm{var}_\delta\Bigl(\frac{y_t}{x_t} \pm \frac{y_c}{x_c}\Bigr) = \frac{\mathrm{var}(y_t - \rho_t x_t)}{x_{t,0}^2} + \frac{\mathrm{var}(y_c - \rho_c x_c)}{x_{c,0}^2} \pm \frac{2\,\mathrm{cov}(y_t - \rho_t x_t,\, y_c - \rho_c x_c)}{x_{t,0}x_{c,0}}.
\]
Theorem 1. Let $S$ be an index set of finite cardinality $n > 0$. For $i \in S$, let $W_{it} \sim \mathrm{Bern}(p_i)$ be independent with $W_{ic} = 1 - W_{it}$, $0 < p_i < 1$. Let
\[
\hat\tau = \frac{\sum_{i\in S} W_{it}Y_{it}}{\sum_{i\in S} W_{it}} - \frac{\sum_{i\in S} W_{ic}Y_{ic}}{\sum_{i\in S} W_{ic}}
\]
where $(Y_{it}, Y_{ic}) \in [-B,B]^2$, for $B < \infty$. Then with $\mu_t$, $\mu_c$, $p_t$, $p_c$, $s_t$, $s_c$, $S_{tt}$, $S_{cc}$, $S_{tc}$ defined at equations (2.3) through (2.6),
\[
\mathrm{var}_\delta(\hat\tau) = \frac{1}{n}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr). \tag{2.7}
\]
If all $p_i \in [\epsilon, 1-\epsilon]$ for some $\epsilon > 0$, then
\[
E_\delta(\hat\tau) = (\mu_t - \mu_c) + \Bigl(\frac{s_t}{p_t} - \frac{s_c}{p_c}\Bigr) + O\Bigl(\frac{1}{n}\Bigr). \tag{2.8}
\]
Proof. See Appendix Section A.1.1.
The implied constant in O(1/n) for equation (2.8) holds for all n > 1.
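Theorem 1 can also be checked by simulation. The following sketch (our own check with synthetic stratum data, not from the thesis) compares the Monte Carlo variance of $\hat\tau$ against formula (2.7):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
p = rng.uniform(0.2, 0.8, n)      # propensities bounded away from 0 and 1
y_t = rng.normal(2.0, 1.0, n)     # fixed potential outcomes Y_it
y_c = rng.normal(0.0, 1.0, n)     # and Y_ic

# Population quantities (2.3)-(2.6) for this index set
p_t, p_c = p.mean(), 1 - p.mean()
mu_t, mu_c = y_t.mean(), y_c.mean()
s_t = np.mean(y_t * p) - mu_t * p_t
s_c = np.mean(y_c * (1 - p)) - mu_c * p_c
rho_t, rho_c = mu_t + s_t / p_t, mu_c + s_c / p_c
S_tt = np.mean(p * (1 - p) * (y_t - rho_t) ** 2)
S_cc = np.mean(p * (1 - p) * (y_c - rho_c) ** 2)
S_tc = np.mean(p * (1 - p) * (y_t - rho_t) * (y_c - rho_c))
var_delta = (S_tt / p_t**2 + S_cc / p_c**2 + 2 * S_tc / (p_t * p_c)) / n

# Monte Carlo over the treatment assignment W_it ~ Bern(p_i)
reps = 40_000
W = rng.random((reps, n)) < p
tau_hat = (W * y_t).sum(1) / W.sum(1) - ((~W) * y_c).sum(1) / (~W).sum(1)
print(var_delta, tau_hat.var())   # the two should be close
```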
2.3.3 Delta method means and variances

We define the delta method bias of an estimate $\hat\tau_k$ via $\mathrm{Bias}_\delta(\hat\tau_k) = E_\delta(\hat\tau_k) - \tau_k$. We also assume $0 < \epsilon < e(x_i) < 1-\epsilon$ for some $\epsilon$.
Corollary 1. Let $\hat\tau_{ok}$ be the ODB-only estimator from (2.2). Then
\[
\mathrm{var}_\delta(\hat\tau_{ok}) = \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),
\]
where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $S = O_k$. If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok}}\Bigr).
\]
If also Assumption 3 holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t(1-p_t)} + O\Bigl(\frac{1}{n_{ok}}\Bigr).
\]

Proof. The first two claims follow from Theorem 1, using $e(x_i) \in [\epsilon, 1-\epsilon]$ for $i \in O_k \cup R_k$ in the second one. Under Assumption 3, $s_c = -s_t$, so the lead term in $E_\delta(\hat\tau_k)$ is $s_t(1/p_t + 1/p_c) = s_t(p_t + p_c)/(p_t(1-p_t)) = s_t/(p_t(1-p_t))$.
Corollary 2. Let $\hat\tau_{rk}$ be the RCT-only estimator from (2.3). Then $\hat\tau_{rk}$ is known to be unbiased, and
\[
\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{n_{rk}p_r(1-p_r)}, \quad\text{where}\quad
\bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k}\bigl[(Y_{it} - \mu_{rkt})(1-p_r) + (Y_{ic} - \mu_{rkc})p_r\bigr]^2, \tag{2.9}
\]
for $\mu_{rkt} = \mu_t(R_k)$ and $\mu_{rkc} = \mu_c(R_k)$. Under Assumption 3, $\bar\sigma^2_{rk} = \sigma^2_{rkt} \equiv (1/n_{rk})\sum_{i\in R_k}(Y_{it} - \mu_{rkt})^2$. If $p_r = 1/2$, then
\[
\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{4}{n_{rk}^2}\sum_{i\in R_k}\Bigl(\bar Y_i - \frac{\mu_{rkt} + \mu_{rkc}}{2}\Bigr)^2
\]
for $\bar Y_i = (Y_{it} + Y_{ic})/2$.
Proof. See Appendix Section A.1.2.
Corollary 3. Let $\hat\tau_{wk}$ be the weighted-average estimator (2.5). Then, with $\lambda_k = n_{ok}/(n_{ok} + n_{rk})$,
\[
\mathrm{var}_\delta(\hat\tau_{wk}) = \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1-\lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1-p_r)},
\]
where $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equation (2.6) with $S = O_k$, and $\bar\sigma^2_{rk}$ is defined at (2.9). If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k\Bigl(\frac{s_{okt}}{p_{okt}} - \frac{s_{okc}}{p_{okc}}\Bigr) + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr),
\]
where $s_{okt}$, $p_{okt}$, $s_{okc}$, and $p_{okc}$ are defined by equations (2.4) and (2.5) for $S = O_k$. If Assumption 3 also holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{wk}) = \frac{\lambda_k s_{okt}}{p_{okt}(1-p_{okt})} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
Proof. See Section 2.3.6.
In our motivating scenarios we anticipate that $n_o \gg n_r$, so that $\lambda_k \approx 1$ for most $k$. Then the first term in $\mathrm{var}_\delta(\hat\tau_{wk})$ is only slightly smaller than $\mathrm{var}_\delta(\hat\tau_{ok})$ for the ODB-only estimate, and at most a small variance reduction is to be expected from weighting.
The spiked-in estimator’s bias and variance cannot be computed as a corollary of Theorem 1,
but they can be computed directly.
Corollary 4. Let $\hat\tau_{sk}$ be the spiked-in estimator (2.4). Then
\[
\mathrm{var}_\delta(\hat\tau_{sk}) = \frac{1}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),
\]
where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $S = O_k \cup R_k$. If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
If Assumption 3 also holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t(1-p_t)} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
Proof. The spiked-in estimates are computed by pooling $O_k$ and $R_k$ into their union.

To relate the bias of $\hat\tau_{sk}$ to that of the other estimators, we write it in terms of the quantities computed using $S = O_k$ and $S = R_k$. Denoting these quantities using an additional subscript of $o$ and $r$,
\[
\begin{aligned}
\mathrm{Bias}_\delta(\hat\tau_{sk}) &= \Lambda_k n_{ok}\Bigl(\frac{p_{okt}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{p_{okc}}{n_{ok}p_{okc} + n_{rk}p_{rkc}}\Bigr) \\
&\quad + \frac{s_{okt}n_{ok}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{s_{okc}n_{ok}}{n_{ok}p_{okc} + n_{rk}p_{rkc}} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\end{aligned} \tag{2.10}
\]
The bias for $\hat\tau_{rk}$ is zero. The bias for $\hat\tau_{ok}$ has terms analogous to the second and third (and error) terms above, but the first term is new to $\hat\tau_{sk}$. This term is linear in $\Lambda_k$. For large values of $\Lambda_k$, this term will dominate, yielding biases that can easily exceed those of $\hat\tau_{ok}$. This is the fundamental danger of the spiked-in estimator: if the mean potential outcomes differ substantially between ODB and RCT subjects with similar values of the propensity score function, then the estimation will be poor due to large bias.
2.3.4 The dynamic weighted estimator

The bias-variance tradeoffs are intrinsically different in each stratum. Using results from the prior section, we derive a dynamic weighted estimator that uses different weights in each stratum. Our
dynamic weighted estimator is based on Assumption 3, though we will test it in settings where that
assumption does not hold.
From Proposition 1, the MSE-optimal convex combination of $\hat\tau_{ok}$ and $\hat\tau_{rk}$ is $c_k^*\hat\tau_{ok} + (1-c_k^*)\hat\tau_{rk}$ where $c_k^* = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. The dynamic weighted estimator is
\[
\hat\tau_{dk} = \hat c_k^*\hat\tau_{ok} + (1-\hat c_k^*)\hat\tau_{rk}, \quad\text{with}\quad \hat c_k^* = \frac{\widehat{\mathrm{var}}(\hat\tau_{rk})}{\widehat{\mathrm{var}}(\hat\tau_{rk}) + \widehat{\mathrm{MSE}}(\hat\tau_{ok})}, \tag{2.11}
\]
for plug-in estimators of $\mathrm{MSE}(\hat\tau_{ok})$ and $\mathrm{var}(\hat\tau_{rk})$. To obtain our MSE estimates we use $\widetilde{\mathrm{MSE}}(\cdot) = \mathrm{Bias}_\delta(\cdot)^2 + \mathrm{var}_\delta(\cdot)$, taking the delta method moments from Corollaries 1 and 2. These expressions include some unknown population quantities that we then approximate from the data to get $\widehat{\mathrm{MSE}}(\cdot)$.
For the ODB estimate we use
\[
\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1-p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr)
\]
where the quantities on the right hand side are given in Section 2.3.1 with $S = O_k$. For the RCT estimate we use
\[
\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1-p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k} W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}
\]
where $\hat\sigma^2_{rkt}$, $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively.
Both of these estimates use Assumption 3.
The values of $p_t$ and $p_c$ are known: $p_t = \sum_{i\in O_k} p_{it}/n_{ok}$, where $p_{it}$ is the propensity $e(x_i)$, and $p_c = 1 - p_t$. We use Horvitz-Thompson style inverse probability weighting to estimate other quantities. Full details can be found in Section 2.4.
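As a sketch (our own simplified naming; the full plug-ins are in Section 2.4), the per-stratum dynamic weight (2.11) and the RCT variance plug-in above can be coded as:

```python
import numpy as np

def rct_var_plugin(w, y, p_r):
    """tilde-var(tau_rk): sigma-bar^2 mixes the treated and control sample
    variances in proportion to arm sizes, then is scaled by 1/(p_r(1-p_r)n_rk)."""
    n = len(y)
    var_t = y[w == 1].var(ddof=1)   # sigma-hat^2_rkt
    var_c = y[w == 0].var(ddof=1)   # sigma-hat^2_rkc
    sigma_bar2 = ((w == 1).sum() * var_t + (w == 0).sum() * var_c) / n
    return sigma_bar2 / (p_r * (1 - p_r) * n)

def dynamic_estimate(tau_ok, tau_rk, mse_ok_hat, var_rk_hat):
    """Dynamic weighted estimate (2.11) from the two per-source estimates."""
    c_hat = var_rk_hat / (var_rk_hat + mse_ok_hat)
    return c_hat * tau_ok + (1 - c_hat) * tau_rk, c_hat

# When the estimated ODB MSE is large, weight shifts toward the RCT.
tau_d, c_hat = dynamic_estimate(tau_ok=1.0, tau_rk=2.0, mse_ok_hat=3.0, var_rk_hat=1.0)
print(tau_d, c_hat)  # 1.75 0.25
```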
2.3.5 Performance comparison
The ideal dynamic estimator with the optimal weight $c_k^*$ must be at least as good as $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$, because those estimators are all special cases of weighting estimators belonging to the class that $c_k^*$ optimizes over. Our estimator $\hat\tau_{dk}$ will not always be better than those other estimators, because in estimating $\hat c_k^*$, we may introduce enough error to make it less efficient.
When combining stratum-based estimates $\hat\tau_k$ into the weighted estimator $\hat\tau = \sum_k \omega_k \hat\tau_k$, there is the possibility of biases canceling between strata. None of the competing estimators we consider are designed to exploit such cancellation. For large strata, $c_k^*$ should be well estimated. To arrange cancellations among biased within-stratum estimates would require domain-specific assumptions that we do not make here.
The comparison to the spiked-in estimator is more complex. As we saw in equation (2.10), the bias can grow without bound in $\Lambda_k$, so for large $\Lambda_k$ this estimator will have the largest MSE.
However, for small values of $\Lambda_k$, the spiked-in estimator can outperform all the other estimators. To see why, we make a direct comparison with the dynamic weighted estimator and reference our prior discussion showing that the dynamic weighted estimator will generally outperform $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$.
We introduce sample counterparts of $\Lambda_k$, given by
\[
\begin{aligned}
\hat\Lambda_{kt} &= \frac{\sum_{i\in O_k} W_{it}Y_{it}}{\sum_{i\in O_k} W_{it}} - \frac{\sum_{i\in R_k} W_{it}Y_{it}}{\sum_{i\in R_k} W_{it}}, \quad\text{and} \\
\hat\Lambda_{kc} &= \frac{\sum_{i\in O_k} W_{ic}Y_{ic}}{\sum_{i\in O_k} W_{ic}} - \frac{\sum_{i\in R_k} W_{ic}Y_{ic}}{\sum_{i\in R_k} W_{ic}}.
\end{aligned}
\]
Then, after some algebra, $\hat\tau_{sk}$ differs from the RCT estimate as follows,
\[
\hat\tau_{sk} - \hat\tau_{rk} = c_{kt}\hat\Lambda_{kt} - c_{kc}\hat\Lambda_{kc} \tag{2.12}
\]
for sample size proportions
\[
c_{kt} = \frac{\sum_{i\in O_k} W_{it}}{\sum_{i\in O_k\cup R_k} W_{it}} \quad\text{and}\quad c_{kc} = \frac{\sum_{i\in O_k} W_{ic}}{\sum_{i\in O_k\cup R_k} W_{ic}}.
\]
By comparison,
\[
\hat\tau_{dk} - \hat\tau_{rk} = \hat c_k^*\hat\Lambda_{kt} - \hat c_k^*\hat\Lambda_{kc}, \tag{2.13}
\]
where the dynamic estimator tunes $\hat c_k^*$ to the available data. An oracle could choose $c_k^*$ optimally using Proposition 1. While the oracle is working in a one parameter family (2.13) for each bin $k$, the spiked-in estimator uses two weights $c_{kt}$ and $c_{kc}$ (2.12) that are not necessarily within the family that the oracle optimizes over. This is why it is possible for the spiked-in estimator to outperform the oracle.
2.3.6 Proof of Corollary 3

Using (2.1) and Corollaries 1 and 2, $\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k \times \mathrm{Bias}_\delta(\hat\tau_{ok})$ for $\lambda_k$ given in (2.5). This yields the lead terms in both expressions for $\mathrm{Bias}_\delta(\hat\tau_{wk})$. The error terms are $\lambda_k O(1/n_{ok}) = O(1/(n_{ok} + n_{rk}))$.

Using independence of the RCT and ODB, Corollaries 1 and 2, and definition (2.2),
\[
\begin{aligned}
\mathrm{var}_\delta(\hat\tau_{wk}) &= \frac{\lambda_k^2}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + (1-\lambda_k)^2\frac{\bar\sigma^2_{rk}}{n_{rk}p_r(1-p_r)} \\
&= \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1-\lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1-p_r)}.
\end{aligned}
\]
2.4 Derivation: Dynamic Weighted Estimator
Recall we seek to estimate $c_k^*\hat\tau_{ok} + (1-c_k^*)\hat\tau_{rk}$ where the weight $c_k^* = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. For the ODB, our plug-in estimate is
\[
\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1-p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr)
\]
and for the RCT estimate we use
\[
\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1-p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k} W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}
\]
where $\hat\sigma^2_{rkt}$, $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively.
We use Horvitz-Thompson style inverse probability weighting to estimate key quantities, as follows:
\[
\hat\rho_t = \frac{\sum_{i\in O_k} W_{it}Y_{it}}{\sum_{i\in O_k} W_{it}}, \quad
\hat\rho_c = \frac{\sum_{i\in O_k} W_{ic}Y_{ic}}{\sum_{i\in O_k} W_{ic}},
\]
\[
\hat s_t = \frac{\sum_{i\in O_k} W_{it}}{n_{ok}}\Bigl(\sum_{i\in O_k} W_{it}Y_{it} - p_t\sum_{i\in O_k} W_{it}Y_{it}/p_{it}\Bigr)
+ \frac{\sum_{i\in O_k} W_{ic}}{n_{ok}}\Bigl(\sum_{i\in O_k} W_{ic}Y_{ic} - p_c\sum_{i\in O_k} W_{ic}Y_{ic}/p_{ic}\Bigr),
\]
\[
\hat S_{tt} = \frac{\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \hat\rho_t)^2}{\sum_{i\in O_k} W_{it}}, \quad\text{and}\quad
\hat S_{cc} = \frac{\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{ic} - \hat\rho_c)^2}{\sum_{i\in O_k} W_{ic}}.
\]
The sole quantity that does not have a Horvitz-Thompson estimator is $S_{tc}(O_k)$, because we never observe both potential outcomes for a given unit. First, we write $S_{tc}$ as
\[
\frac{1}{n}\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c) + \frac{1}{n}\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c).
\]
Next, under Assumption 3,
\[
Y_{it} - \rho_t = Y_{ic} + \tau_k - \mu_t - s_t/p_t = Y_{ic} - \rho_c - \frac{s_t}{p_tp_c},
\]
and similarly $Y_{ic} - \rho_c = Y_{it} - \rho_t + s_t/(p_tp_c)$. Therefore
\[
\begin{aligned}
S_{tc} &= \frac{1}{n}\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \rho_t)^2 + \frac{1}{n}\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{ic} - \rho_c)^2 \\
&\quad + \frac{s_t}{np_t(1-p_t)}\sum_{i\in O_k} p_{it}(1-p_{it})\Bigl(W_{it}(Y_{it} - \rho_t) - W_{ic}(Y_{ic} - \rho_c)\Bigr)
\end{aligned} \tag{2.14}
\]
and we get $\hat S_{tc}$ by plugging the above estimates of $\rho_t$, $\rho_c$ and known values of $p_t$, $p_c$ into (2.14).
Although Assumption 3 is used to derive the estimator, some of our simulations in Section 2.5 test it under a violation of that assumption.
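The ratio-form plug-ins above for one ODB stratum can be sketched as follows (our own function and variable names; `y_obs` is the single observed outcome per unit):

```python
import numpy as np

def ht_stratum_quantities(w, y_obs, p):
    """Plug-in estimates for one ODB stratum, mirroring the displays above:
    rho_hat_t, rho_hat_c, S_tt_hat, S_cc_hat. y_obs holds the observed
    outcome for each unit (Y_it if treated, Y_ic otherwise)."""
    t, c = (w == 1), (w == 0)
    rho_t = y_obs[t].mean()      # sum(W_it Y_it) / sum(W_it)
    rho_c = y_obs[c].mean()      # sum(W_ic Y_ic) / sum(W_ic)
    pq = p * (1 - p)             # p_it (1 - p_it)
    S_tt = (pq[t] * (y_obs[t] - rho_t) ** 2).sum() / t.sum()
    S_cc = (pq[c] * (y_obs[c] - rho_c) ** 2).sum() / c.sum()
    return rho_t, rho_c, S_tt, S_cc

w = np.array([1, 1, 0, 0])
y_obs = np.array([2.0, 4.0, 1.0, 3.0])
p = np.full(4, 0.5)
print(ht_stratum_quantities(w, y_obs, p))  # (3.0, 2.0, 0.25, 0.25)
```

$\hat S_{tc}$ then follows from (2.14) by substituting these estimates together with the known $p_t$ and $p_c$.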
2.5 Simulations
Our goal is to estimate the average treatment effect in the target population, from which we assume the ODB data was randomly sampled. The value of the RCT is that it can substitute for ODB data in places where that data is sparse due to the treatment assignment mechanism.

We simulate two high-level scenarios. In one, the RCT is a random sample from the same population that the ODB came from. Then the RCT and ODB data differ only in their treatment assignment mechanisms. We consider this case the idea