-
MAKING CAUSAL CONCLUSIONS
FROM HETEROGENEOUS DATA SOURCES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Evan Taylor Ragosa Rosenman
August 2020
-
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pq377kc2214
© 2020 by Evan Taylor Ragosa Rosenman. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
-
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mike Baiocchi, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Art Owen, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Julia Palacios
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
-
Abstract
As datasets grow larger and more complex, the field of statistics must provide commensurate methods
for synthesizing and gathering evidence. This thesis presents new methodology for related questions
at the intersection of observational and experimental causal inference. I consider problems of “data
fusion,” in which one set of causal estimates is derived by merging results from an observational
dataset and an experimental dataset. I also consider how to design experiments informed by the
results of observational studies.
I begin by providing an overview of relevant concepts and prior work in causal inference, data
fusion, sensitivity analysis, and optimization. Chapter 2 considers the data fusion question in the
case when all confounding variables are measured; I propose several estimators and derive when
each is expected to outperform the others. In Chapter 3, I remove the unconfoundedness assumption, which
leads to a new class of estimators based on a shrinkage approach. Chapter 4 considers the design
question, and proposes a novel solution for regret-minimizing designs in the case of a binary covariate.
Throughout, I use data from the Women’s Health Initiative, a 1991 study of the efficacy of hormone
therapy involving thousands of postmenopausal women, to demonstrate the utility of my methods.
The contents of this thesis are drawn from two existing manuscripts. The second chapter is
adapted from “Propensity Score Methods for Merging Observational and Experimental Datasets,”
jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack (Rosenman et al., 2018).
The third chapter is adapted from “Combining Observational and Experimental Datasets Using
Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi
(Rosenman et al., 2020). The fourth chapter is forthcoming as its own paper.
This work touches on a variety of methods from disparate areas of the literature. While the
central problem is one of causal inference, tools from Empirical Bayes, decision theory, and convex
optimization are deployed throughout. It is my hope that this work can serve two purposes. First, I
hope it can excite methodologists to consider new ways of posing causal questions and new methods
for solving emergent challenges. Second, I hope to empower practitioners to more efficiently use
their data to identify treatment effects.
-
Acknowledgments
If it takes a village to raise a child, it requires a small city to shepherd someone to a doctorate. With
apologies for the lack of brevity, I would like to thank the many teachers, friends, and colleagues
who have played a role in my intellectual and personal development over the past three decades.
From the third through twelfth grades, I attended the Pingry School in Martinsville, New Jersey,
and I am forever grateful for the extraordinary education I received there. In the eighth grade, my
English teacher, Marnie McKoy, took to referring to me as “Dr. Rosenman,” a moniker that I can
finally inhabit more than fifteen years later. The wonderful educators I encountered at Pingry –
Mr. and Mrs. Grant, Dr. Dineen, Mrs. Landau, Mr. Coe, Mr. Tramontana, Dr. Korfhage, Dr.
DeSimone, Mrs. O’Mara, and many others – have had a lasting impact on my intellectual life. My
time at Pingry also provided me with lifelong friends who have been sources of great encouragement
throughout this doctorate, including Kerry Bickford, Darina Shtrakhman, Biff Parker-Magyar,
Melinda Zoephel, and Meredith Skiba.
My undergraduate years at Harvard were when I first discovered my ardor for STEM. Harvard
was an intimidating place, in which impostor syndrome ran rampant. Inspiring teachers and mentors
are the only reason I emerged with my passion for math and science intact. I’d like to acknowledge
Yiling Chen, Joe Blitzstein, Sarah Koch, David Morin, and Mike Ruberry for all the guidance and
excitement that they provided. Harvard also gave me a terrific group of friends, many of whom have
patiently listened to me kvetch over the past five years. To Matt Chartier, Olga Zinoveva, Kevin
Fogarty, Alice Li-Fogarty, Stephanie Wang, Danielle Kolin-Freeman, and Amy Guan – a hearty
thanks.
The choice to move to Washington, D.C. after undergrad, to work at a midsize tech company
called Applied Predictive Technologies, remains one of my best decisions. I am so grateful for the
wonderful friends I made in my three years there, many of whom have moved westward and become
my support network here in the Bay Area. Many thanks to Alex Svistunov, Nitin Viswanathan,
Kathy Qian, Chao Xue, Brady Kelly, Simon Krauss, J.D. Astudillo, and Liz Casano. I am also
appreciative of the teachers and mentors I met while studying as a nighttime Master’s student at
Georgetown. David Caraballo, Sivan Leviyang, Ken Shaw, and Ali Arab all played a role in nurturing
my love of Statistics and helping me on the path to the Ph.D.
One of my greatest blessings has been sharing the doctoral experience with two badass Harvard
statisticians who also matriculated in the fall of 2015. Both of these women influenced my decision to
pursue a Ph.D., and I would never have completed this degree without them. To Michele Zemplenyi:
there will never be another person who makes me laugh like you do. Thank you for all the moments
of levity and delight, and for always being just a phone call away. To Kristen Hunter, who has
enriched my life for a dozen years, thank you for so many adventures, for your endless willingness
to call out the bullshit, for being my lifetime partner in crime.
Any Ph.D. is an emotional rollercoaster – a cacophony of self-doubt, pride, disappointment, and
fleeting moments of insight. If one is exceedingly lucky, out of this mess emerges a doctor. I have
been so lucky. I must thank my wonderful committee members: Guillaume Basse, Julia Palacios,
and Stefan Wager, as well as my longtime mentor, Clea Sarnquist. Each of these individuals has been
generous with time and with wisdom, and each has kept the proverbial office door ajar. Moreover,
they have written recommendation letters for me, provided me guidance on career choices, and
periodically reminded me to keep perspective. These faculty have modeled what it is to be a
curious, responsible researcher and I will be proud to follow in their footsteps.
It is a rare and wonderful thing to find an advisor who is simultaneously brilliant and kind. I
have been fortunate enough to find two of them. So much can be said about Mike Baiocchi: that he
is unrelentingly supportive of his students, incredibly down-to-earth, fearless, resilient, and smart as
hell. He is a credit to the field of Statistics and to the state of Maine. Mike, you are the reason I
have stuck with causal inference, and you have profoundly shaped my career. Just as importantly,
you have made me feel valued, competent, and worthy at times when I did not believe in myself. I
am so grateful to have had the chance to collaborate with you.
Art Owen, whose statistical prowess is eclipsed only by his unfailing generosity and fundamental
decency, has been an extraordinary mentor as well. Over the past four years, the opportunity to
spend an hour a week talking to him about research and ideas has been a true privilege – and a
complete blast. Art, you have made me a better researcher and a better communicator. You have
been an advocate for me in every way: the person who cheers me up after a journal rejection; who
is always willing to provide career advice; who reminds me to keep an eye on the bigger picture.
Thank you for everything.
I could not have imagined going through this experience without my cohort: nine brilliant,
quirky individuals, who came together from all over the world to share in these five years. We may
have seen each other less in recent times, but I will always remember the birthday cakes, the Patio
outings, the barbecues, the hushed laughs shared in the backs of classrooms. I’d especially like to
thank Rina, my sometimes travel buddy and perpetual coauthor, who has made some of the most
trying times bearable; Claire, who has listened patiently to my many complaints, and always been a
delightful co-conspirator in dark humor; and Andy, mensch among mensches, the guy who will teach
you probability and also take you out for ramen when you’re feeling down. You have all enriched
this experience.
Friendship and guidance from students in prior cohorts, most notably Alex Chin, Jelena Markovic,
Jessica Hwang, and Paulo Orenstein, have also been essential, as have many wonderful relationships
outside of Sequoia Hall. Jason Weinreb, Murphy Temple, Surajit Bose, Matt Seymour, Jeff Sheng,
and Daniel Kremer: thank you, all.
I would also like to thank my family. I find myself thinking of my grandparents, the last of
whom – my paternal grandfather, Lawrence Rosenman – passed away just as I was beginning my
doctorate. In the final voicemail he left me, Grandpa said, “I have love for all my grandkids – they
all went to great universities, so many I don’t know what to do with them all. Have a wonderful
day and do well. I love you.” It’s a simple message, and one I have cherished always. I also can’t
help but think of my maternal grandparents, Frances and Amerigo Ragosa. Stung by the Great
Depression, both cut their educations short in order to go to work: Mema barely managed to finish
high school while Pop-Pop, an Italian immigrant, dropped out before getting his diploma. Denied
opportunities for formal schooling, they built a successful life, but always yearned to learn more. I
hope that I make them proud.
My immediate family has spread all over the contiguous United States, but they have continued
to be my rock, and I am immensely lucky to have them in my life. Thanks to my brother, Michael,
for his words of encouragement, offered from far-away Wisconsin. My father, Mark, the most happily
retired attorney in the whole history of north-central New Jersey, has nonetheless been willing to
proofread every single manuscript, application, poster, and presentation that I have produced in
my graduate career – including this thesis. Dad has played the role of consummate problem-solver,
always willing to go the extra mile to make my day more manageable, whether by buying me a
coffee mug or listening to me kvetch about my students. Dad, you have two characteristics that
have profoundly influenced my graduate career. First, you are a stubborn man, and I inherited that
quality, and it has served me well. Second, your unshakeable belief in my abilities is almost certainly
somewhat misplaced, but it has gone a long way in helping me to stay the course. Thank you.
And last but not least, to the BAMF-without-an-F, the OG embodiment of “Nevertheless, she
persisted,” the woman who showed early promise in math, who graduated summa cum laude from
a 98% male engineering college, who managed assembly lines, who earned a nighttime law degree
while raising two children and working full-time, who became an attorney at 49, who scrambled
and got knocked down and picked herself back up, who made partner, who has worked nonstop
since age 21 and faced down a lifetime of misogynist bullshit and who still fights every day to make
things a little bit easier for the next crop of women – to the incomparable Diane Ragosa: there is
nothing better in life than having a baller for a mother. I would never have attempted something so
ambitious without your example, and I would never have finished it without your unwavering love
and support. Thank you.
-
Contents
Abstract iv
Acknowledgments v
1 Introduction 1
1.1 The Data Revolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Causal Inference Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Potential Outcome Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Sources of Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Estimation Methods for Observational Studies . . . . . . . . . . . . . . . . . 4
1.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Combining Datasets Under SITA 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Scientific assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Notation, assumptions and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Sampling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Stratification and treatment effect assumptions . . . . . . . . . . . . . . . 12
2.2.3 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Delta method results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Population quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Delta method means and variances . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4 The dynamic weighted estimator . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.6 Proof of Corollary 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Derivation: Dynamic Weighted Estimator . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.1 Simulation of the ideal case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.2 Restrictive enrollment criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.3 Violation of Assumption 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 WHI data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.1 Propensity score construction and covariate balance . . . . . . . . . . . . . . 30
2.6.2 Gold standard causal e↵ect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.3 Prognostic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Combining Datasets Without SITA 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Notation, Assumptions, and Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Assumptions and Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Estimator Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Proposed Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 κ1, Common Shrinkage Factor . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 κ2, Variance-Weighted Shrinkage Factors . . . . . . . . . . . . . . . . . . 49
3.4.4 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Estimating Implied Γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.1 Simulation Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.2 Identical Observational and RCT Covariate Distributions . . . . . . . . . . . 56
3.6.3 Differing Observational and RCT Covariate Distributions . . . . . . . . . 59
3.7 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 61
3.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Designing Experiments Using Observational Studies 67
4.1 Problem Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 A Change in Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.2 Stratification and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.3 Loss and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Converting to an Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Naïve Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.2 Regret Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Tractable Case: Binary Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Application to the Women’s Health Initiative Data . . . . . . . . . . . . . . . . . . . 74
4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.2 Detailed Example: Γ = 1.5, Fine Stratification . . . . . . . . . . . . . . . 75
4.4.3 Performance Over Multiple Conditions . . . . . . . . . . . . . . . . . . . . . . 76
4.5 Future Work: General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Conclusion 80
5.1 Discussion and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A Appendix to Chapter 2 91
A.1 Lengthier Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.1.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.1.2 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 WHI Data Example: Additional Details . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.2.1 Observational study propensity modeling and covariate balance . . . . . . . . 93
A.2.2 RCT covariate balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
B Appendix to Chapter 3 95
B.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
C Appendix to Chapter 4 100
C.1 Proof of Validity of Confidence Regions . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.1.1 Review of Proof in Zhao et al. (2019) . . . . . . . . . . . . . . . . . . . . . . 100
C.1.2 Extension to Design Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
C.2 Proof of Concavity of Minimax Problem . . . . . . . . . . . . . . . . . . . . . . . . . 104
-
List of Tables
2.1 These are the four γ vectors used in our simulations. The first two correlate with the
mean response vector β, while the second two do not. The second and fourth imply
larger sampling biases than the first and third do. . . . . . . . . . . . . . . . . . . . 24
2.2 MSEs for treatment effect in the ideal setting. Column 1 gives treatment (constant,
linear, quadratic). Column 2 shows whether the propensity was correlated with the
mean response. Column 3 indicates the magnitude of the propensity vector γ. The
remaining columns are mean squared errors for the overall treatment from our 5
estimators and an oracle. In every case, the spiked-in estimator using (2.4) has lowest
MSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 MSEs for treatment effect in the setting with restricted enrollments. The columns are
the same as in Table 2.2. Here the oracle estimator is always best and the dynamic
estimator is the best of the ones that can be implemented. . . . . . . . . . . . . . . 27
2.4 These are the results of the simulations where Assumption 3 is violated but the RCT
has the same x distribution as the ODB. . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 These are the results of the simulations where Assumption 3 is violated and the x in
the RCT are subject to restrictive enrollment criteria. . . . . . . . . . . . . . . . . . 29
2.6 Standardized differences (SD) between treated and control populations in the obser-
vational dataset, before and after stratification on the propensity score, for clinical
risk factors for coronary heart disease. . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Standardized differences (SD) between treated and control populations in the obser-
vational database, before and after stratification on the propensity score, for ethnicity
category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Standardized differences (SD) between treated and control populations in the obser-
vational database, before and after stratification on the propensity score, for smoking
category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Distribution of age variable values in the observational study, RCT, and RCT “silver”
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Distribution of history of cardiovascular disease in the observational study, RCT, and
RCT “silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Distribution of Langley scatter categories in the observational study, RCT, and RCT
“silver” datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Simulation results for each stratification scheme. The third column gives the average
L2 loss over 1,000 replicates of τ̂r, the RCT-only estimator (assuming an RCT of
size 1,000). The following five columns give the average L2 loss of various shrinkage
estimators as a percentage of the average L2 loss of τ̂r. . . . . . . . . . . . . . . . . 63
3.5 Frequency across simulations that the conditions under which κ1+ and κ2+ dominate
τ̂r are met. These conditions are given in Lemma 1 and Lemma 3. . . . . . . . . . . 65
4.1 L2 loss comparisons for regret-minimizing allocations relative to equal allocation. For
starred entries, the regret-minimizing allocation defaults to equal allocation. . . . . . 77
4.2 L2 loss comparisons for regret-minimizing allocations relative to naïve allocation. . . 78
A.1 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for clinical risk factors for coronary heart disease. . . . . . . . . . . . . . . . 94
A.2 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for ethnicity category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.3 Standardized differences (SD) between treated and control populations in RCT gold
dataset, for smoking category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
-
List of Figures
2.1 Performance measures across all 2,000 simulations run in the ideal case. Bias squared
is shown in black, and variance in gray, so that total bar height represents the MSE.
The much larger values for the RCT estimator are excluded to make visual comparison
easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Performance measures across all 2,000 simulations run in the restricted enrollment
case. Bias squared is shown in black, and variance in gray, so that total bar height
represents the MSE. The much larger values for the RCT estimator are excluded to
make visual comparison easier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Propensity score distributions among treated and control women (left panel) and
marginal propensity score distributions (right panel) for the ODB and RCT Women’s
Health Initiative populations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 ROC AUC scores for logistic regression prognostic score model in the control popu-
lations of the ODB and RCT silver datasets. . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Causal estimators computed over 100 bootstrap replicates for small and large RCT
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Root mean square error when estimating the causal effect of HT on CHD, across
100 bootstrap replicates for small and large RCT sizes. The gold standard causal
effect is taken to be the age-stratified reweighted estimator, the magnitude of which
is shown via the dashed gold line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed without any adjustment for selection bias, yielding a highly
biased estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed by stabilized inverse probability of treatment weighting, such
that some of the selection bias is removed. . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed without any adjustment for selection bias, yielding a highly
biased estimator. We also induce different distributions for the covariates Xi among
the observational and RCT units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Percent reduction in risk relative to τ̂r for our proposed estimators, the Green and
Strawderman estimators, and an oracle under four different conditions. Here, we
assume τ̂o is computed by stabilized inverse probability of treatment weighting, such
that some of the selection bias is removed. We also induce different distributions for
the covariates Xi among the observational and RCT units. . . . . . . . . . . . . . . 61
4.1 Simulated example of confidence regions in four strata under Γ = 1.2. . . . . . . . . 73
4.2 Allocation of units to strata under naïve scheme and regret-minimizing scheme. . . 75
4.3 Average loss over 1,000 resamples of 1,000-unit experiments under equal-allocation,
naïve-allocation, and regret-minimizing allocation designs. . . . . . . . . . . . . . . . 76
A.1 Nominal and cross-validated receiver operating characteristic area under curve for
propensity models with different numbers of variables. . . . . . . . . . . . . . . . . . 94
-
Chapter 1
Introduction
1.1 The Data Revolution
Passive data collection is a defining feature of modern life. Massive online social networks continually
monitor user interactions (Eckles and Bakshy, 2017); hospitals capture patient medical records in
electronic health databases (Charles et al., 2015); e-commerce giants record real-time sales data as
their customers shop (Bajari et al., 2019). For applied researchers in the social and medical sciences,
this ever-expanding global panopticon yields data that is both promising and perilous.
In the optimistic view, these observational data can provide insight into the causal effect of a
proposed treatment, such as a novel drug regimen or a new marketing strategy. If the strongest
assumptions hold, such data can be used to identify any desired causal effect, obviating the need
to run randomized trials. This prospect has yielded recent conjecture that Big Data may supplant
experimentation as the future of decision-making (Bareinboim and Pearl, 2016).
Yet the past half-century of causal inference research engenders deep skepticism toward such an
approach (Imbens and Rubin, 2015). Researchers do not control the treatment assignment in obser-
vational data, and, as a result, cannot be certain that treated individuals and untreated individuals
are otherwise comparable. This challenge can be overcome only by making untestable assumptions
– and even if these assumptions hold, careful modeling is necessary to remove the selection effect.
The applied literature includes myriad examples of treatments that showed promise in observational
studies only to be overturned by later randomized trials (Hartman et al., 2015). One prominent
case, the effect of hormone therapy on the health of postmenopausal women, will be discussed at
length in this manuscript (Writing Group for the Women’s Health Initiative Investigators, 2002).
The “virtuous” counterpart to observational data is the well-designed experiment. Data from a
randomized trial yield unbiased estimates of a causal effect without the need for problematic
statistical assumptions. Yet experiments suffer two significant drawbacks. First, they are frequently
expensive, and, as a consequence, generally involve fewer units. Especially if one is interested in
subgroup causal effects, this means experimental estimates can be imprecise. Second, experiments often
involve inclusion criteria that can make them dissimilar from target populations of interest. Hence,
while observational studies may suffer significant selection bias due to unmeasured confounding,
experimental data will frequently have high variance and may suffer from bias as well.
There has been considerable recent interest in the development of statistical methods to synthesize
evidence from these two types of data (Mueller et al., 2018; Bareinboim and Pearl, 2016; Kallus
et al., 2018). Yet the current literature – discussed at length in Section 1.3 – offers few concrete
methodological recommendations for applied researchers. This thesis will seek to fulfill the unmet
need. We consider the data fusion problem from three angles. First, we develop methods for merging
experimental and observational causal effect estimates in the case when all confounding variables are
measured in the observational studies. Next, we remove the unconfoundedness assumption, which
leads to a new class of estimators based on a shrinkage approach. Finally, we propose a novel solution
for designing experiments informed by observational studies, making use of the regret minimization
framework. Throughout, we deploy tools from disparate areas of the literature, including Empirical
Bayes, decision theory, and convex optimization.
1.2 Causal Inference Review
1.2.1 Potential Outcome Model
In the vein of Chin (2019), I provide a brief review of causal inference concepts that will be relevant
to this manuscript.
We suppose we have access to a finite population of n individuals. We are considering a treatment
of interest, such as an experimental drug or a behavioral intervention. In this thesis, we will only
consider binary treatments, and we associate with each unit $i$ a random variable $W_i \in \{0, 1\}$, where $W_i = 1$ indicates that unit $i$ receives the treatment and $W_i = 0$ indicates that unit $i$ does not receive the treatment. We will also assume the units have an associated outcome $Y_i$, where we typically suppose $Y_i \in \mathbb{R}$. Lastly, we will assume the units have measured covariates $X_i \in \mathbb{R}^p$.

Throughout, we will adopt the potential outcomes framework of Neyman and Rubin (Rubin, 1974). We associate with each subject $i$ two values, $Y_i(1)$ and $Y_i(0)$. These values represent the
realized outcome for unit i if the unit is treated or not treated respectively. Hence, the observed
outcome $Y_i$ satisfies
$$Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0) .$$
Implicit in this definition is the “Stable Unit Treatment Value Assumption” (SUTVA) – the
assumption that a unit’s potential outcomes do not vary with treatments assigned to other units,
and that there is only one version of the treatment (Rubin, 1980). We will make this assumption
throughout. Additionally, note that in Chapter 2, we use slightly different notation; the potential outcomes are denoted $(Y_{it}, Y_{ic})$ and we use $W_{it}$ to denote $W_i$ and $W_{ic}$ to denote $1 - W_i$.

The typical target of inference will be the average treatment effect (ATE),
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i(1) - Y_i(0) \right) . \tag{1.1}$$
We will often also be interested in treatment effects involving subgroups. Suppose we have subgroups indexed by $k = 1, \ldots, K$ and we have associated indexing sets $S_k$ such that
$$\bigcup_{k} S_k = \{1, 2, \ldots, n\} \quad \text{and} \quad S_k \cap S_j = \emptyset \ \text{ for any } k \neq j.$$
Then the treatment effect for subgroup $k$ is simply
$$\tau_k = \frac{1}{|S_k|} \sum_{i \in S_k} \left( Y_i(1) - Y_i(0) \right) .$$
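As a concrete illustration, the finite-population estimands above can be computed directly when both potential outcomes are available, as in a simulation. The following oracle sketch (the function names are ours, not part of the manuscript) makes the definitions explicit:

```python
import numpy as np

def ate(y1, y0):
    # Average treatment effect (1.1): mean of Y_i(1) - Y_i(0) over all n units.
    return float(np.mean(np.asarray(y1) - np.asarray(y0)))

def subgroup_ates(y1, y0, groups):
    # Treatment effect tau_k within each subgroup; `groups` gives each unit's label k.
    y1, y0, groups = map(np.asarray, (y1, y0, groups))
    return {str(k): float((y1[groups == k] - y0[groups == k]).mean())
            for k in np.unique(groups)}

# Toy example: four units split into two subgroups.
y1 = [3.0, 1.0, 2.0, 4.0]
y0 = [1.0, 1.0, 0.0, 2.0]
groups = ["a", "a", "b", "b"]
overall = ate(y1, y0)                    # 1.5
by_group = subgroup_ates(y1, y0, groups) # {"a": 1.0, "b": 2.0}
```

In practice only one of $Y_i(1)$ and $Y_i(0)$ is observed per unit, so these quantities must be estimated rather than computed.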
1.2.2 Sources of Randomness
$W_i$ is a random variable. Characterizing the joint distribution of all the $W_i$ variables – known as the
“assignment mechanism” – is a central part of inferring causality. One core dichotomy is between
experimental and observational datasets. In the former, the researcher controls the treatment as-
signment and thus knows the assignment mechanism explicitly. In the latter, the researcher typically
makes some assumptions about the distribution and then seeks to infer it from the data.
Define $p_i = P(W_i = 1)$ to be the unit-level treatment probability for unit $i$. A closely related concept is the “propensity score,” given by
$$e(x) = \frac{1}{\sum_i I(X_i = x)} \sum_{i : X_i = x} p_i ,$$
the average treatment probability for all units such that $X_i = x$. These values will be extremely important to our analysis of observational studies, though they are also defined in the setting of experiments. Following Chapter 3 of Imbens and Rubin (2015), we can characterize assignment mechanisms with some additional descriptors based on the dependencies of $p_i$. Namely, an assignment mechanism is
• probabilistic if $0 < p_i < 1$ for all $i = 1, \ldots, n$, signifying that all units have some positive probability of receiving the treatment and some positive probability of receiving the control condition;
• individualistic if $p_i$ depends only on $Y_i(0)$, $Y_i(1)$, and $X_i$ for $i = 1, \ldots, n$, and exhibits no dependency on the covariates or potential outcomes of units $j \neq i$; and
• unconfounded if $p_i$ does not depend on the potential outcomes conditional on the covariates, i.e. $X_i = X_j \implies p_i = p_j$ for any $i$ and $j$, even if their potential outcomes differ.
Throughout this manuscript, we will assume probabilistic and individualistic assignment for all
data. Unconfounded assignment is also always assumed for the experimental data, and is assumed for
the observational data in Chapter 2 but not in subsequent chapters. The combination of probabilistic,
individualistic, and unconfounded assignment is known as “strongly ignorable treatment assignment”
(SITA) and we will often refer to whether or not SITA is assumed.
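To make the propensity score definition concrete, a short sketch (our own illustration, with hypothetical unit-level probabilities) averages the $p_i$ within each covariate level, exactly as in the display above:

```python
from collections import defaultdict

def propensity_score(x_vals, p_vals):
    # e(x): average of the unit-level treatment probabilities p_i
    # over all units with X_i = x.
    sums, counts = defaultdict(float), defaultdict(int)
    for x, p in zip(x_vals, p_vals):
        sums[x] += p
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

# Five units with a binary covariate and hypothetical treatment probabilities.
x_vals = ["lo", "lo", "hi", "hi", "hi"]
p_vals = [0.1, 0.3, 0.6, 0.8, 0.7]
e_hat = propensity_score(x_vals, p_vals)   # e("lo") ≈ 0.2, e("hi") ≈ 0.7
```

This toy assignment mechanism is probabilistic (all $p_i$ strictly between 0 and 1) and, since the $p_i$ depend on the units only through $X_i$, unconfounded.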
In the typical setting, $W_i$ is our only source of randomness and we treat $Y_i(1)$ and $Y_i(0)$ as fixed – but only partially observed – constants. This is the approach we will take in Chapters 2 and 3. An alternative formulation considers $Y_i(0)$ and $Y_i(1)$ to themselves be random variables (see e.g. VanderWeele and Robins, 2012). In this setting, the estimand (1.1) can be easily redefined as
$$\tau = \frac{1}{n} \sum_{i=1}^{n} E\left( Y_i(1) - Y_i(0) \right) .$$
We can now also define a useful quantity that is related to the subgroup treatment effect: the conditional average treatment effect (CATE),
$$\tau(x) = \frac{1}{n} \sum_{i=1}^{n} E\left( Y_i(1) - Y_i(0) \mid X_i = x \right) .$$
The above definitions for the assignment mechanism also easily extend to this case; individualistic
and unconfounded assignment become properties of the joint distribution of $W_i$, $X_i$, $Y_i(0)$, and $Y_i(1)$, rather than properties of the finite sample of $n$ units. We find that this formulation is
more appropriate for discussion of experimental design, and will adopt it as needed in Chapter 4.
1.2.3 Estimation Methods for Observational Studies
In observational studies in which unconfoundedness holds, the treatment assignment is approximately independent of the potential outcomes among units who have sufficiently similar covariate values. Hence, we can treat these units as being drawn from a local experiment, and directly compare treated and control units to infer causal effects. A further insight, first offered in Rosenbaum and Rubin (1983), is that the propensity score is a balancing score. This means that we need only find units with sufficiently similar values of the propensity score – not the entire covariate vector – in order to obtain independence of the treatment assignment from the covariates and the potential outcomes.
This discovery gave rise to causal inference techniques that rely on estimating the propensity
score as a function of the covariates, and then using the fitted propensity score estimates to create
cohorts of units with similar values. Common methods include pair-matching treated and control
units; stratifying units; and weighting units in order to recover a local experiment (Austin, 2011). Propensity score stratification approaches will be used in Chapter 2, while weighting approaches, which give rise to “inverse probability weighting” (IPW) estimators, will play a major role in Chapters 3 and 4.
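As an illustration of the weighting approach, a minimal Hájek-style IPW estimator of the ATE can be sketched as follows (the function and the toy data are ours; Chapters 3 and 4 develop the estimators actually used):

```python
import numpy as np

def ipw_ate(y, w, e_hat):
    # Hajek-style IPW: weight treated units by 1/e(x) and controls by
    # 1/(1 - e(x)), then take the difference of the weighted outcome means.
    y, w, e = map(np.asarray, (y, w, e_hat))
    wt_t = w / e
    wt_c = (1 - w) / (1 - e)
    mu_t = np.sum(wt_t * y) / np.sum(wt_t)
    mu_c = np.sum(wt_c * y) / np.sum(wt_c)
    return float(mu_t - mu_c)

# Toy data: constant propensity 0.5, so IPW reduces to a difference in means.
y = [2.0, 3.0, 1.0, 0.0]
w = [1, 1, 0, 0]
e_hat = [0.5, 0.5, 0.5, 0.5]
est = ipw_ate(y, w, e_hat)   # 2.0
```

With non-constant fitted propensities, the same weights reweight the observed sample toward the local experiment described above.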
There exist many more complex estimators that rely on modeling both the propensity score and
the potential outcomes as functions of the covariates. Imbens and Rubin (2015) suggest combining
propensity score stratification with regression adjustment to improve precision. “Doubly robust”
estimators (Kang et al., 2007) are formulated using both models such that they are asymptotically
consistent if either the propensity model or the outcome model is correctly specified. Heterogeneous treatment effect estimation in observational studies is a very active area of research; many modern
methods involve both outcome and propensity modeling (see e.g. Künzel et al., 2019; Chernozhukov
et al., 2018) and attain this property.
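The doubly robust idea can be made concrete with the standard augmented IPW (AIPW) form, shown here as a generic sketch rather than as any specific estimator from the works cited:

```python
import numpy as np

def aipw_ate(y, w, e_hat, mu1_hat, mu0_hat):
    # Augmented IPW: outcome-model predictions mu1_hat/mu0_hat, corrected by
    # propensity-weighted residuals. Consistent if either model is correct.
    y, w, e, m1, m0 = map(np.asarray, (y, w, e_hat, mu1_hat, mu0_hat))
    psi1 = m1 + w * (y - m1) / e
    psi0 = m0 + (1 - w) * (y - m0) / (1 - e)
    return float(np.mean(psi1 - psi0))

# Toy data with crude outcome models and a constant propensity.
y = [1.0, 3.0, 0.0, 2.0]
w = [1, 1, 0, 0]
e_hat = [0.5] * 4
mu1_hat = [2.0] * 4
mu0_hat = [1.0] * 4
est = aipw_ate(y, w, e_hat, mu1_hat, mu0_hat)   # 1.0
```

The residual-correction terms vanish in expectation when the outcome models are correct, and the weighting recovers consistency when the propensity model is correct.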
In the absence of unconfoundedness, inferential methodology is much more limited. Causal
estimates can be reliably identified in certain quasi-experimental settings – regression discontinuity designs and instrumental variable analyses being two prominent examples – but these are not
generic methods. Researchers frequently focus not on point identification, but rather on methods to
determine whether measured causal effects are sufficiently robust to the possibility of unmeasured
confounding. This area, known as “sensitivity analysis,” has yielded a variety of new methods in
recent years (Ding and VanderWeele, 2016; Fogarty, 2019; VanderWeele and Ding, 2017). We will
make use of a new method offered by Zhao et al. (2019) in Chapters 3 and 4.
1.3 Literature Review
The data fusion problem relates to several different areas in the statistical and epidemiological
literature. For example, the meta-analysis literature considers the broad challenges of evidence
synthesis across multiple studies. Papers in this area have been highlighting the lack of consensus
about how to merge observational and experimental studies for at least a quarter of a century
(Pladevall-Vila et al., 1996). Yet even without clear methodological guidelines, observational studies
are frequently included in systematic reviews; a 2014 survey found that just 36% of 300 such reviews
were restricted to experimental and quasi-experimental data, with the remainder including at least
some types of observational studies (Page et al., 2016).
Mueller et al. (2018) recently published a summary of methods to incorporate observational
studies into meta-analyses. They considered 93 relevant articles published between 1994 and 2016,
finding that many recommendations for the inclusion of observational studies were essentially unchanged from those used for randomized controlled trials (RCTs). While about 40% of the articles
made recommendations on the assessment of bias in observational studies, there was little agreement
on best practices for combining the data. The methodological questions considered in most of the
papers were whether or not to report a single effect estimate, and whether or not to use a fixed vs. a random effects model to combine the individual study estimates. These questions relate to the heterogeneity of effect estimates, but they do not engage with the unique challenges of confounding
in observational studies.
Mueller and coauthors highlight a few exceptions. Thompson et al. (2011) propose estimating bias
reduction based on the subjective judgment of a panel of assessors, and adjusting the observational
study results accordingly. Their method requires a high degree of subject matter expertise. Prevost
et al. (2000) suggest a hierarchical Bayes approach in which the difference between observational
and experimental results is modeled explicitly. The authors emphasize certain advantages to this
approach but also note that results are sensitive to the choice of prior. In totality, the meta-analysis
literature underscores the need for a more robust toolbox to synthesize heterogeneous data sources,
while accounting for the unique challenges of causal inference with observational data.
Another closely related area of the literature is that of transportability and generalizability. In
the middle of the twentieth century, Campbell (1957) introduced the concepts of “internal validity”
and “external validity” to distinguish between challenges of treatment effect estimation and generalizability in quasi-experimental research. This paradigm was widely adopted among social scientists. The problem of extending causal findings across different domains is now known under the broader banner of “transportability,” which subsumes results from the meta-analysis and treatment effect
heterogeneity literatures (Bareinboim and Pearl, 2016).
In this context, there is substantial research on using observational data to determine how to
generalize causal effects from an experiment to a target population. Early work focused heavily on cases in which such generalization is invalid (see e.g. Manski, 2009; Höfler et al., 2010). In a series of papers (Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2013, 2016), Pearl and Bareinboim formalized this problem, derived conditions under which “transport” is possible, and developed
algorithms for returning the correct “transport formula.” While they considered the problem in the
context of graphical models, it has also been considered in the more classical causal inference setting.
For example, Hartman et al. (2015) derived assumptions and placebo tests for identifying population
treatment effects from RCTs. Stuart et al. (2011) advocated the use of propensity scores as a tool
to assess generalizability. A variety of other work has advocated reweighting approaches in order to
generalize results (Cole and Stuart, 2010; Andrews and Oster, 2017).
We are explicitly interested in using causal estimates derived from observational studies, an area
that has received comparatively less attention than the transportability question. The hesitancy
to use these estimates is not particularly surprising: if a researcher has access to an experiment,
he or she will likely be cautious about incorporating observational data that might introduce bias.
One approach is to assume unconfoundedness in the observational study, meaning that all variables
affecting the treatment assignment and the outcome are measured. This is our approach in Chapter
2; it is also used in Athey et al. (2019).
A small number of prior papers have attempted to weaken the unconfoundedness assumption
and proceed with merged estimation. They often introduce alternative assumptions. In Kallus et al.
(2018), the authors assume that the hidden confounding has a parametric structure that can be modeled effectively. They suggest fitting a model $\hat{\omega}$ to the observational study to predict local average treatment effects, and then learning a second model $\hat{\eta}$ which interpolates between the predictions of $\hat{\omega}$ on the RCT units and the actual observed outcomes in the RCT. They posit that the sum of these functions, $\hat{\omega}(x) + \hat{\eta}(x)$, is a good estimate for the CATE at $x$. Yet their theoretical guarantees rely heavily on determining the correct functional form for $\hat{\eta}(\cdot)$.

In Peysakhovich and Lada (2016), it is assumed the bias preserves unit-level relative rank ordering (as the authors say, “bigger causal effects imply bigger bias”). They argue that their set of assumptions is reasonable in their setting, which involves time series data with multiple observations per unit. But it does not easily generalize to the more standard case where each unit’s outcome is observed only at a single time point.
A number of other approaches have been suggested, such as methods that make use of Bayesian
networks (Cooper and Yoo, 1999) or structural causal models (Mooij et al., 2016). Yet this inquiry is unquestionably in its infancy. One can find a chorus of recent papers explicitly calling for more
methodological development in the area of combining observational and experimental data (Mueller
et al., 2018; Shalit, 2020). This thesis seeks to heed that call.
1.4 Contributions
The contents of this thesis are divided among the three chapters to follow, each of which corresponds
to a distinct manuscript. Each chapter considers the data fusion problem from a different angle.
The second chapter is adapted from “Propensity Score Methods for Merging Observational and
Experimental Datasets,” jointly authored with Art Owen, Michael Baiocchi, and Hailey Banack
(Rosenman et al., 2018). This chapter considers estimation of the average treatment effect from a
completed pair of studies: a randomized controlled trial and an observational study. We assume that
SITA holds in the latter study. We propose a general procedure in which the data is jointly stratified
on the output of a propensity score estimation function fitted to the observational study, as well as
causal effect moderators. We propose three novel estimators for the causal effect within each stratum,
and use the delta method to determine when each would be expected to outperform. We apply our
methods to data from the Women’s Health Initiative, a study of thousands of postmenopausal women
which has both observational and experimental data on hormone therapy (HT).
The third chapter is adapted from “Combining Observational and Experimental Datasets Using
Shrinkage Estimators,” jointly authored with Guillaume Basse, Art Owen, and Michael Baiocchi
(Rosenman et al., 2020). This chapter considers the same setting, but we remove the assumption
that all confounders are measured and choose as our objective the $L_2$ loss in measuring a vector of
stratum-specific treatment effects. We propose a generic procedure for deriving shrinkage estimators
in this setting, making use of a generalized unbiased risk estimate. Then, we develop two new
estimators, prove finite sample conditions under which they have lower risk than an estimator using
only experimental data, and show that each achieves a notion of asymptotic optimality. Lastly, we
draw connections between our approach and state-of-the-art results in sensitivity analysis, including
proposing a method for evaluating the feasibility of our estimators.
The fourth chapter is forthcoming as its own manuscript. In this chapter, we depart from the
estimation setting and consider how to design a stratified experiment making use of data from a
completed observational study. Again, our objective is the $L_2$ loss in measuring the vector of treatment effects. In the case of a binary outcome, we obtain valid, bias-aware confidence regions for the
pilot estimates of the stratum-specific variances derived from the observational study, generalizing
recent results from Zhao et al. (2019). Then, we show that experiments can be designed to minimize
a notion of regret by solving a convex optimization problem. We again demonstrate the utility of
our methods with an application to data from the Women’s Health Initiative.
In the final chapter, we discuss limitations and future directions for this line of research.
Chapter 2
Combining Datasets Under SITA
2.1 Introduction
We first consider how to combine the information from a large observational database (ODB) with
data from a smaller randomized controlled trial (RCT), under the assumption that all confounders
are measured in the observational study. Our goal is to obtain a treatment e↵ect estimate that is
more accurate than either source could yield on its own.
We present three methods to combine our two sources of data. The key technique underlying all
of these methods is to score subjects in the RCT according to their propensity for treatment had they
been in the ODB instead. They are then placed in pooled strata containing some ODB and some
RCT observations with comparable propensities. To see why this might help, consider a stratum in
the ODB comprised entirely of subjects with a very low treatment propensity. The RCT samples in
that same stratum will be more evenly split between treatment and control, increasing a critically
low within-stratum sample size of treated subjects. We are therefore extending the stratification of
observational data by propensity, as described by Imbens and Rubin (2015) and Stuart and Rubin
(2007).
For any data combination method such as ours to succeed, it is necessary to make assumptions
that cannot be tested within the available data and have to be judged instead on scientific grounds.
At a minimum, we require that treatment effects for subjects in a stratum not depend too strongly,
if at all, on the data set the subjects came from. We will describe this and our other assumptions
below.
We apply our methods to data from the Women’s Health Initiative, or WHI (Writing Group for
the Women’s Health Initiative Investigators, 2002). This study includes an RCT paired with an
ODB. The treatment is hormone therapy (HT) and the outcome measure we consider is coronary
heart disease (CHD). Conclusions from the observational data alone proved to be misleading due to
differences between the treated and untreated subjects, as revealed by a large RCT. Accordingly,
it would be interesting to see if a smaller RCT combined with the ODB would have been more
accurate than the ODB alone, potentially providing an earlier warning. The WHI’s RCT was quite
large. This allows us to split it into a holdout sample that we use to define a true treatment effect.
Using that holdout estimate as a gold standard, we can then compare our combined methods with
methods using only one of the data sources.
2.1.1 Scientific assumptions
Here we describe our assumptions in qualitative terms.
As mentioned above, we require the RCT and ODB to have comparable treatment effects within strata. The strong form of this assumption is that everybody in each stratum we form has the same treatment effect. Most of our results use a weaker form in which only the average treatment effects have to be the same for RCT and ODB subjects in each stratum. Note however that equal average treatment effects implies equal average differential outcomes, but does not imply equal average outcomes under either the test or control conditions.
We will assume that the propensity for treatment in the ODB is nearly constant within each of
our strata. This assumption requires that the variables we use to form strata include the important
quantities predictive of whether treatment or control was assigned. It also requires that we estimate
the propensity well. In our motivating problems, the ODB is so large that getting good propensity
estimates is a very reasonable assumption, provided that suitable predictor variables are present.
Our goal is to estimate a population average treatment effect. For that, we need to know the
correct proportion of the population corresponding to each stratum. In our motivating medical
context, we suppose that the owner of the large ODB is interested in stratum proportions given by
that very same patient population. The RCT, on the other hand, might have different sampling
proportions due, for example, to enrollment criteria that restrict who may participate. In our
examples, the stratum proportions are exactly those of the ODB.
2.1.2 Outline
This chapter is organized as follows. In Section 2.2, we define our notation, assumptions and estimators of the average treatment effect. Our estimates combine some within-stratum estimates, computed in various ways. Our first proposed estimator, called the spiked-in estimator, simply places RCT units into propensity-based strata defined on the ODB, and estimates the treatment effect in each stratum without regard to which data set the subjects came from. Our second estimator takes a sample-size-weighted average of RCT and ODB-based treatment effect estimates within each stratum. Our third new estimator, called the dynamic weighted average, uses data-driven weights instead of sample size weights to combine ODB and RCT estimates within each stratum so as to minimize an estimate of the mean squared error in each stratum. For the large sample sizes of interest to us, delta method approximations to the mean and variance are accurate enough.
Section 2.3 presents delta method estimates of the within-stratum bias and variance for our
estimators. We see theoretically that one of our estimators, the spiked-in estimator, can have an
enormous bias if average test and control outcomes are not both comparable between data sets.
Another estimator, the dynamic weighted estimator, is more robust.
Section 2.5 gives numerical illustrations of our method for an ODB of size 5,000 and an RCT
of size 200. Section 2.6 gives more background on the WHI data. It then gives a detailed model for the WHI data, developing a propensity and defining a gold standard average treatment effect on a holdout sample from the WHI’s RCT. A variant of the spiked-in estimator is introduced in Section 2.6.3. That variant refines propensity strata via a prognostic score predictive of coronary heart disease in untreated subjects, leading to a “dual-spiked” estimator. All the estimators are compared via bootstrap simulations in Section 2.6.4, and the dual-spiked estimate is most accurate for the WHI data. Lengthier proofs can be found in the Appendix. Section 2.7 summarizes our conclusions.
2.2 Notation, assumptions and estimators
Some subjects belong to the randomized controlled trial (RCT) and others to the observational database (ODB). We assume that no subject is in both data sets. We write $i \in \mathcal{R}$ if subject $i$ is in the RCT and $i \in \mathcal{O}$ otherwise. Subject $i$ has an outcome $Y_i \in \mathbb{R}$ and some covariates that we encode in the vector $x_i \in \mathbb{R}^d$. Subject $i$ receives either the test or control condition.

The condition of subject $i$ is given by a treatment variable $W_i \in \{0, 1\}$, where $W_i = 1$ if subject $i$ is in the test condition (and 0 otherwise). Some formulas simplify when we can use parallel notation for both test and control settings. Accordingly we introduce $W_{it} = W_i$ and $W_{ic} = 1 - W_i$. Other formulas look better when focused on the test condition. For instance, letting $p_{it} = \Pr(W_{it} = 1)$ and $p_{ic} = \Pr(W_{ic} = 1)$, the expression $p_{it}(1 - p_{it})$ is immediately recognizable as a Bernoulli variance and is preferred to $p_{it} p_{ic}$.
2.2.1 Sampling assumptions
We adopt the potential outcomes framework of Neyman and Rubin (Rubin, 1974). Subject $i$ has two potential outcomes, $Y_{it}$ and $Y_{ic}$, corresponding to test and control conditions respectively. Then $Y_i = W_{it} Y_{it} + W_{ic} Y_{ic}$. The potential outcomes $(Y_{it}, Y_{ic})$ are non-random and we will assume that they are bounded. The treatment effect for subject $i$ is $Y_{it} - Y_{ic}$. We work conditionally on the observed values of covariates and so the $x_i$ are also non-random.

All of the randomness in our model comes from the treatment variables $W_i$. We write $\mathrm{Bern}(p)$ for a Bernoulli random variable taking the value 1 with probability $p$ and 0 with probability $1 - p$. The ODB and RCT differ in how the $W_i$ are distributed.
Assumption 1 (ODB sampling). If $i \in \mathcal{O}$, then $W_i \sim \mathrm{Bern}(p_i)$ independently, where $p_i = e(x_i)$ with $0 < p_i < 1$.
The function $e(\cdot)$ in Assumption 1 is the propensity. Because the propensity depends only on $x$, and is never 0 or 1, the ODB has a strongly ignorable treatment assignment (SITA) (Rosenbaum and Rubin, 1984). Because the $W_i$ are independent, the outcome for subject $i$ is unaffected by the treatment $W_{i'}$ for any subject $i' \neq i$. That is, our model for the ODB satisfies the stable unit treatment value assumption or SUTVA (Imbens and Rubin, 2015).
Assumption 2 (RCT sampling). If $i \in \mathcal{R}$, then $W_i \sim \mathrm{Bern}(p_r)$ independently for a common probability $0 < p_r < 1$.

The RCT will commonly have $p_r = 1/2$ but we do not assume this. We additionally assume that the ODB is independent of the RCT.
2.2.2 Stratification and treatment effect assumptions
We will use $K$ strata indexed by $k = 1, \ldots, K$. The stratum for subject $i$ depends on $x_i$. The sets $\mathcal{O}_k$ and $\mathcal{R}_k$ contain the subjects in stratum $k$ from the ODB and RCT respectively. We assume that each stratum contains only a narrow range of propensity values $e(x_i)$. Strata defined by propensity ranges may be further partitioned by variables in $x_i$, using domain knowledge if applicable, in order to make the treatment effect more nearly constant within strata. Propensity score stratification with sub-stratification on other important predictors is a commonly used strategy for causal inference in observational studies (Imbens and Rubin, 2015; Stuart and Rubin, 2007).
Our model allows the treatment effect to vary by stratum. We begin with a strong assumption about the treatment effects.

Assumption 3. For all strata $k = 1, \ldots, K$, there is a treatment effect $\tau_k$ with $Y_{it} - Y_{ic} = \tau_k$ for all $i \in \mathcal{O}_k \cup \mathcal{R}_k$.
In most of our work, we can weaken this assumption to just require equality on average within each stratum. The weakened version is given as Assumption 4 below. Let the sample sizes of the ODB and RCT be $n_o$ and $n_r$ respectively. Ordinarily $n_o \gg n_r$. The ODB and RCT sample sizes within stratum $k$ are $n_{ok}$ and $n_{rk}$. The within-stratum average treatment effects are
$$\tau_{ok} = \frac{1}{n_{ok}} \sum_{i \in \mathcal{O}_k} (Y_{it} - Y_{ic}) \quad \text{and} \quad \tau_{rk} = \frac{1}{n_{rk}} \sum_{i \in \mathcal{R}_k} (Y_{it} - Y_{ic}), \tag{2.1}$$
when their denominator counts are positive. We will never use strata with $n_{ok} = 0$ when we later weight strata proportionally to their ODB sizes.
Assumption 4. For $k = 1, \ldots, K$, if $\min(n_{ok}, n_{rk}) > 0$ then $\tau_{ok} = \tau_{rk}$ and we call their common value $\tau_k$. If $n_{ok} > n_{rk} = 0$ take $\tau_k = \tau_{ok}$, and if $n_{rk} > n_{ok} = 0$ take $\tau_k = \tau_{rk}$.
Assumption 4 might be unrealistic if the treatment is applied differently in the ODB versus the RCT. We thus suppose some form of “treatment version irrelevance” (Lesko et al., 2017).
We need the strong Assumption 3 in one place to estimate a quantity that depends on both potential outcomes of a single subject. Because our strata will be based at least partially on propensity, Assumption 3 is very nearly true under the model of Xie et al. (2012b). In the Appendix, some simulations will involve data that violate Assumption 3.
2.2.3 Estimators
Our estimand is the global average treatment effect defined by
$$\tau = \sum_{k=1}^{K} \omega_k \tau_k$$
for weights $\omega_k > 0$ with $\sum_{k=1}^{K} \omega_k = 1$. The weights can be chosen to match population characteristics. We use $\omega_k = n_{ok}/n_o$. Then $\omega_k = 0$ whenever $n_{ok} = 0$ and we have a well-defined $\tau_k$ for every stratum that contributes to $\tau$. We may still have $n_{rk} = 0$ for some strata with $\omega_k > 0$. Our estimators all take the form $\sum_k \omega_k \hat{\tau}_k$ for different within-stratum estimates $\hat{\tau}_k$.
We begin with “single data source” estimators before describing our proposed new estimators. An ODB-only estimate of the treatment effect in stratum $k$ is
$$\hat{\tau}_{ok} = \frac{\sum_{i \in \mathcal{O}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{O}_k} W_{it}} - \frac{\sum_{i \in \mathcal{O}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{O}_k} W_{ic}} . \tag{2.2}$$
Then $\hat{\tau}_o = \sum_k \omega_k \hat{\tau}_{ok}$. A potential problem with $\hat{\tau}_o$ comes from bins $k$ with very small propensity values. Then $\mathcal{O}_k$ may contain very few observations with $W_{it} = 1$ and $\hat{\tau}_{ok}$ may have high variance. Similarly, for bins $k$ associated with large propensity values, $\mathcal{O}_k$ may contain very few observations with $W_{ic} = 1$, which again leads to high variance. That is, the “edge bins” can have very skewed sample sizes, causing problems for $\hat{\tau}_o$.
The ODB estimate (2.2) is a difference of ratio estimators, because the denominators are random. We will see in Section 2.3 that there can also be a severe bias in the edge bins. An analogous RCT-only estimator is $\hat{\tau}_r = \sum_k \omega_k \hat{\tau}_{rk}$, where
$$\hat{\tau}_{rk} = \frac{\sum_{i \in \mathcal{R}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{R}_k} W_{it}} - \frac{\sum_{i \in \mathcal{R}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{R}_k} W_{ic}} . \tag{2.3}$$
Because the RCT assigns treatments with constant probability, the edge bins have less imbalanced treatment outcomes. However, because the RCT is small, we may find several of the strata have very small sample sizes $n_{rk}$.
Our first hybrid estimator is $\hat{\tau}_s = \sum_k \omega_k \hat{\tau}_{sk}$, where
$$\hat{\tau}_{sk} = \frac{\sum_{i \in \mathcal{O}_k} W_{it} Y_{it} + \sum_{i \in \mathcal{R}_k} W_{it} Y_{it}}{\sum_{i \in \mathcal{O}_k} W_{it} + \sum_{i \in \mathcal{R}_k} W_{it}} - \frac{\sum_{i \in \mathcal{O}_k} W_{ic} Y_{ic} + \sum_{i \in \mathcal{R}_k} W_{ic} Y_{ic}}{\sum_{i \in \mathcal{O}_k} W_{ic} + \sum_{i \in \mathcal{R}_k} W_{ic}} . \tag{2.4}$$
The RCT data are “spiked” into the ODB strata. This spiked-in estimator can improve upon
the ODB estimator by increasing the number of treated units in the low-propensity edge bins and
increasing the number of control units in the high-propensity edge bins. Even a small number of
such balancing observations can be extremely valuable.
The spiked-in estimator is not a convex combination of $\hat{\tau}_{ok}$ and $\hat{\tau}_{rk}$, because the pooling is first done among the test and control units. Our final two estimators are constructed as convex combinations of $\hat{\tau}_{ok}$ and $\hat{\tau}_{rk}$.
The weighted average estimator $\hat{\tau}_w$ uses
$$\hat{\tau}_{wk} = \lambda_k \hat{\tau}_{ok} + (1 - \lambda_k)\hat{\tau}_{rk}, \quad \text{where} \quad \lambda_k = \frac{n_{ok}}{n_{ok} + n_{rk}} . \tag{2.5}$$
It weights $\hat{\tau}_{rk}$ and $\hat{\tau}_{ok}$ according to the number of data points involved in each estimate.
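The stratum-level estimators (2.2)–(2.5) can be sketched compactly for a single stratum; this is our own illustrative code, with helper names that do not appear in the manuscript:

```python
import numpy as np

def diff_in_means(y, w):
    # Difference of treated and control outcome means, as in (2.2) and (2.3).
    y, w = np.asarray(y, float), np.asarray(w)
    return float(y[w == 1].mean() - y[w == 0].mean())

def spiked_in(y_o, w_o, y_r, w_r):
    # Spiked-in estimate (2.4): pool ODB and RCT units in the stratum first.
    return diff_in_means(np.concatenate([y_o, y_r]),
                         np.concatenate([w_o, w_r]))

def weighted_avg(y_o, w_o, y_r, w_r):
    # Sample-size weighted estimate (2.5).
    lam = len(y_o) / (len(y_o) + len(y_r))
    return lam * diff_in_means(y_o, w_o) + (1 - lam) * diff_in_means(y_r, w_r)

# A low-propensity stratum: the ODB has a single treated unit; the RCT is balanced.
y_o, w_o = [5.0, 1.0, 1.0, 1.0], [1, 0, 0, 0]
y_r, w_r = [4.0, 2.0], [1, 0]
tau_o = diff_in_means(y_o, w_o)            # ODB-only: 4.0
tau_s = spiked_in(y_o, w_o, y_r, w_r)      # spiked-in: 3.25
tau_w = weighted_avg(y_o, w_o, y_r, w_r)   # weighted average
```

The spiked-in estimate moves substantially away from the ODB-only answer here because the RCT supplies a second treated unit to an edge bin.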
Our final estimator is a "dynamic weighted average" $\hat\tau_d$. It uses weights for $\hat\tau_{rk}$ and $\hat\tau_{ok}$ that are estimated from the data. Those weights are chosen to minimize an estimate of mean squared error (MSE) derived using the delta method in the following section. We can observe its approximate optimality via the following result, recalling that the RCT estimator will in general be unbiased.
Proposition 1. Let $\hat\theta_1$ and $\hat\theta_2$ be independent estimators of a common quantity $\theta$, with bias, variance, and mean squared errors satisfying $\mathrm{Bias}(\hat\theta_1) \in (-\infty,\infty)$, $\mathrm{Bias}(\hat\theta_2) = 0$, and $\mathrm{var}(\hat\theta_j), \mathrm{MSE}(\hat\theta_j) \in (0,\infty)$ for $j = 1, 2$. For $c \in \mathbb{R}$, let $\hat\theta_c = c\hat\theta_1 + (1-c)\hat\theta_2$. Then
\[
c^* \equiv \arg\min_c \mathrm{MSE}(\hat\theta_c) = \frac{\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}.
\]
This linear combination has
\[
\mathrm{Bias}(\hat\theta_{c^*}) = \frac{\mathrm{Bias}(\hat\theta_1)\,\mathrm{MSE}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{MSE}(\hat\theta_2)}, \quad
\mathrm{var}(\hat\theta_{c^*}) = (c^*)^2\mathrm{var}(\hat\theta_1) + (1-c^*)^2\mathrm{var}(\hat\theta_2), \quad\text{and}\quad
\mathrm{MSE}(\hat\theta_{c^*}) = \frac{\mathrm{MSE}(\hat\theta_1)\,\mathrm{var}(\hat\theta_2)}{\mathrm{MSE}(\hat\theta_1) + \mathrm{var}(\hat\theta_2)}. \tag{2.6}
\]

Proof. Independence of the $\hat\theta_j$ yields $\mathrm{var}(\hat\theta_c) = c^2\mathrm{var}(\hat\theta_1) + (1-c)^2\mathrm{var}(\hat\theta_2)$, while linearity of expectation yields $\mathrm{Bias}(\hat\theta_c) = c\,\mathrm{Bias}(\hat\theta_1)$. Optimizing $\mathrm{MSE}(\hat\theta_c)$ over $c$ yields the result.
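Proposition 1 is straightforward to check numerically. The sketch below (our own verification, with made-up moments) grids over $c$ and confirms both the minimizer and the minimized MSE in (2.6):

```python
import numpy as np

# Hypothetical moments: estimator 1 is biased, estimator 2 is unbiased.
bias1, var1, var2 = 0.5, 1.0, 2.0
mse1 = bias1**2 + var1

def mse_c(c):
    # MSE of c*theta1_hat + (1-c)*theta2_hat for independent estimators:
    # bias is c*bias1; variance is c^2*var1 + (1-c)^2*var2.
    return (c * bias1)**2 + c**2 * var1 + (1 - c)**2 * var2

c_star = var2 / (mse1 + var2)           # closed form from Proposition 1
grid = np.linspace(0.0, 1.0, 100001)
c_grid = grid[np.argmin(mse_c(grid))]   # brute-force minimizer

print(c_star, c_grid)                   # the two minimizers agree
print(mse_c(c_star), mse1 * var2 / (mse1 + var2))  # minimized MSE matches (2.6)
```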
2.3 Delta method results

Let $X$ be a random vector with mean $\mu$ and a finite covariance matrix. Let $f$ be a function of $X$ that is twice differentiable in an open set containing $\mu$, and let $f_1$ and $f_2$ be first and second order Taylor approximations to $f$ around $\mu$. Then the delta method mean and variance of $f(X)$ are
\[
E_\delta(f(X)) = E(f_2(X)) \quad\text{and}\quad \mathrm{var}_\delta(f(X)) = \mathrm{var}(f_1(X)).
\]
Sometimes, to combine estimates, we will need a delta method mean for a weighted sum of those estimates. We will also need a delta method variance for a weighted sum of independent random variables. We use the following natural expressions without resorting to Taylor approximations:
\[
E_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j E_\delta(\hat\tau_j) \tag{2.1}
\]
\[
\mathrm{var}_\delta\Bigl(\sum_j \lambda_j \hat\tau_j\Bigr) = \sum_j \lambda_j^2 \mathrm{var}_\delta(\hat\tau_j), \quad\text{for independent } \hat\tau_j. \tag{2.2}
\]
2.3.1 Population quantities

We will study our estimators in terms of some population quantities. These involve some unobserved values of $Y_{it}$ or $Y_{ic}$. For instance, the test and control stratum averages in the ODB are
\[
\mu_{okt} = \frac{\sum_{i\in O_k} Y_{it}}{n_{ok}} \quad\text{and}\quad \mu_{okc} = \frac{\sum_{i\in O_k} Y_{ic}}{n_{ok}}
\]
and it is typical that both of these are unobserved. Corresponding values for the RCT are $\mu_{rkt}$ and $\mu_{rkc}$.

When we merge ODB and RCT strata we will have to consider a kind of skew in which the within-stratum mean responses above differ between the two data sets. To this end, define $\Lambda_{kt} = \mu_{okt} - \mu_{rkt}$ and $\Lambda_{kc} = \mu_{okc} - \mu_{rkc}$. Under either the stronger Assumption 3 or the weaker Assumption 4, $\Lambda_{kt} = (\tau_k + \mu_{okc}) - (\tau_k + \mu_{rkc}) = \Lambda_{kc}$. We will use $\Lambda_k = \Lambda_{kt} = \Lambda_{kc}$. We will see that large values of $\Lambda_k$ can bias the spiked-in estimator. Reducing that bias is the main motivation for our dynamic weighted average estimator.
Now we define several other population quantities. Let $S$ be a finite non-empty set of $n = n(S)$ indices, such as one of our strata $O_k$ or $R_k$. For each $i \in S$, let $(Y_{it}, Y_{ic}) \in [-B,B]^2$ be a pair of bounded potential outcomes, let $W_i = W_{it}$ be independent $\mathrm{Bern}(p_i)$ random variables, and let $W_{ic} = 1 - W_{it}$. Some of our results add the condition that all $p_i \in [\epsilon, 1-\epsilon]$ for some $\epsilon > 0$. For $S$ so equipped, we define average responses
\[
\mu_t = \mu_t(S) = \frac{1}{n}\sum_{i\in S} Y_{it} \quad\text{and}\quad \mu_c = \mu_c(S) = \frac{1}{n}\sum_{i\in S} Y_{ic}. \tag{2.3}
\]
For example, $\mu_{okt}$ above is $\mu_t(O_k)$. We use average treatment probabilities
\[
p_t = p_t(S) = \frac{1}{n}\sum_{i\in S} p_i \quad\text{and}\quad p_c = p_c(S) = 1 - p_t(S). \tag{2.4}
\]
These become $p_{okt}$, $p_{okc}$, $p_{rkt}$ and $p_{rkc}$ in a natural notation when $S$ is $O_k$ or $R_k$. The above quantities are averages over $i$ uniformly distributed in $S$, as distinct from expectations with respect to random $W_i$. We also need some covariances of this type between response and propensity values,
\[
\begin{aligned}
s_t = s_t(S) &= \frac{1}{n}\sum_{i\in S} Y_{it}p_i - \mu_t p_t \quad\text{and} \\
s_c = s_c(S) &= \frac{1}{n}\sum_{i\in S} Y_{ic}(1-p_i) - \mu_c p_c.
\end{aligned} \tag{2.5}
\]
We will find that these quantities play an important role in bias. If, for instance, the larger values
of Yit tend to co-occur with higher propensities pi, then averages are biased up.
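A small simulation (our own toy, not from the thesis) illustrates the direction of the effect: when $Y_{it}$ increases with $p_i$, the treated-arm ratio estimator overshoots $\mu_t$, in line with the $s_t/p_t$ term in (2.8) below:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # propensities
y_t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # Y_it rises with p_i
mu_t, p_t = y_t.mean(), p.mean()
s_t = np.mean(y_t * p) - mu_t * p_t        # s_t from (2.5); here 0.4 > 0

# Monte Carlo mean of the treated-arm ratio estimator sum(W Y)/sum(W),
# conditional on at least one treated unit.
reps = 200_000
W = rng.random((reps, len(p))) < p
keep = W.sum(axis=1) > 0
est = (W[keep] * y_t).sum(axis=1) / W[keep].sum(axis=1)
print(mu_t, est.mean())  # the simulated mean clearly exceeds mu_t = 3.0
```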
The delta method variances of our estimators depend on the following weighted averages of
squares and cross products
\[
\begin{aligned}
S_{tt} = S_{tt}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{it} - \rho_t)^2, \\
S_{cc} = S_{cc}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{ic} - \rho_c)^2, \quad\text{and} \\
S_{tc} = S_{tc}(S) &= \frac{1}{n}\sum_{i\in S} p_i(1-p_i)(Y_{it} - \rho_t)(Y_{ic} - \rho_c),
\end{aligned} \tag{2.6}
\]
where $\rho_t = \rho_t(S) = \mu_t(S) + s_t(S)/p_t(S)$ and $\rho_c = \rho_c(S) = \mu_c(S) + s_c(S)/p_c(S)$. The quantity $\rho_t$ is the lead term in $E_\delta\bigl(\sum_{i\in S} W_{it}Y_{it}/\sum_{i\in S} W_{it}\bigr)$ and $\rho_c$ is similar. More details about these quantities are in the Appendix where Theorem 1 is proved.
Proposition 2. Let $S$ be $O_k$, $R_k$ or $O_k \cup R_k$. Then under Assumption 3, $s_c(S) = -s_t(S)$.

Proof. Under Assumption 3, we can set $Y_{it} = Y_{ic} + \tau_k$ and $\mu_t = \mu_c + \tau_k$ in (2.5).
2.3.2 Main theorem
We will compare the efficiency of our five estimators using their delta method approximations. We
state two elementary propositions without proof and then give our main theorem. Results for our
various estimators are mostly direct corollaries of that theorem.
Proposition 3. Let $x$ and $y$ be jointly distributed random variables with means $x_0 \neq 0$ and $y_0$ respectively, and finite variances. Let $\rho = y_0/x_0$. Then
\[
E_\delta\Bigl(\frac{y}{x}\Bigr) = \rho - \frac{\mathrm{cov}(y - \rho x, x)}{x_0^2}, \quad\text{and}\quad
\mathrm{var}_\delta\Bigl(\frac{y}{x}\Bigr) = \frac{\mathrm{var}(y - \rho x)}{x_0^2}.
\]
Proposition 4. Let $x_t$, $x_c$, $y_t$, $y_c$ be jointly distributed random variables with finite variances and means $x_{j,0} \neq 0$ and $y_{j,0}$ respectively, for $j \in \{t, c\}$. Let $\rho_j = y_{j,0}/x_{j,0}$. Then
\[
\mathrm{var}_\delta\Bigl(\frac{y_t}{x_t} \pm \frac{y_c}{x_c}\Bigr) = \frac{\mathrm{var}(y_t - \rho_t x_t)}{x_{t,0}^2} + \frac{\mathrm{var}(y_c - \rho_c x_c)}{x_{c,0}^2} \pm \frac{2\,\mathrm{cov}(y_t - \rho_t x_t,\, y_c - \rho_c x_c)}{x_{t,0}x_{c,0}}.
\]
Theorem 1. Let $S$ be an index set of finite cardinality $n > 0$. For $i \in S$, let $W_{it} \sim \mathrm{Bern}(p_i)$ be independent with $W_{ic} = 1 - W_{it}$, $0 < p_i < 1$. Let
\[
\hat\tau = \frac{\sum_{i\in S} W_{it}Y_{it}}{\sum_{i\in S} W_{it}} - \frac{\sum_{i\in S} W_{ic}Y_{ic}}{\sum_{i\in S} W_{ic}}
\]
where $(Y_{it}, Y_{ic}) \in [-B,B]^2$, for $B < \infty$. Then with $\mu_t$, $\mu_c$, $p_t$, $p_c$, $s_t$, $s_c$, $S_{tt}$, $S_{cc}$, $S_{tc}$ defined at equations (2.3) through (2.6),
\[
\mathrm{var}_\delta(\hat\tau) = \frac{1}{n}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr). \tag{2.7}
\]
If all $p_i \in [\epsilon, 1-\epsilon]$ for some $\epsilon > 0$, then
\[
E_\delta(\hat\tau) = (\mu_t - \mu_c) + \Bigl(\frac{s_t}{p_t} - \frac{s_c}{p_c}\Bigr) + O\Bigl(\frac{1}{n}\Bigr). \tag{2.8}
\]
Proof. See Appendix Section A.1.1.
The implied constant in O(1/n) for equation (2.8) holds for all n > 1.
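Theorem 1 can also be checked by simulation. The following sketch (our own check with synthetic stratum data, not from the thesis) compares the Monte Carlo variance of $\hat\tau$ against formula (2.7):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
p = rng.uniform(0.2, 0.8, n)      # propensities bounded away from 0 and 1
y_t = rng.normal(2.0, 1.0, n)     # fixed potential outcomes Y_it
y_c = rng.normal(0.0, 1.0, n)     # and Y_ic

# Population quantities (2.3)-(2.6) for this index set
p_t, p_c = p.mean(), 1 - p.mean()
mu_t, mu_c = y_t.mean(), y_c.mean()
s_t = np.mean(y_t * p) - mu_t * p_t
s_c = np.mean(y_c * (1 - p)) - mu_c * p_c
rho_t, rho_c = mu_t + s_t / p_t, mu_c + s_c / p_c
S_tt = np.mean(p * (1 - p) * (y_t - rho_t) ** 2)
S_cc = np.mean(p * (1 - p) * (y_c - rho_c) ** 2)
S_tc = np.mean(p * (1 - p) * (y_t - rho_t) * (y_c - rho_c))
var_delta = (S_tt / p_t**2 + S_cc / p_c**2 + 2 * S_tc / (p_t * p_c)) / n

# Monte Carlo over the treatment assignment W_it ~ Bern(p_i)
reps = 40_000
W = rng.random((reps, n)) < p
tau_hat = (W * y_t).sum(1) / W.sum(1) - ((~W) * y_c).sum(1) / (~W).sum(1)
print(var_delta, tau_hat.var())   # the two should be close
```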
2.3.3 Delta method means and variances

We define the delta method bias of an estimate $\hat\tau_k$ via $\mathrm{Bias}_\delta(\hat\tau_k) = E_\delta(\hat\tau_k) - \tau_k$. We also assume $0 < \epsilon < e(x_i) < 1-\epsilon$ for some $\epsilon$.
Corollary 1. Let $\hat\tau_{ok}$ be the ODB-only estimator from (2.2). Then
\[
\mathrm{var}_\delta(\hat\tau_{ok}) = \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),
\]
where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $S = O_k$. If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok}}\Bigr).
\]
If also Assumption 3 holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{ok}) = \frac{s_t}{p_t(1-p_t)} + O\Bigl(\frac{1}{n_{ok}}\Bigr).
\]

Proof. The first two claims follow from Theorem 1, using $e(x_i) \in [\epsilon, 1-\epsilon]$ for $i \in O_k \cup R_k$ in the second one. Under Assumption 3, $s_c = -s_t$, so the lead term in $E_\delta(\hat\tau_k)$ is $s_t(1/p_t + 1/p_c) = s_t(p_t + p_c)/(p_t(1-p_t)) = s_t/(p_t(1-p_t))$.
Corollary 2. Let $\hat\tau_{rk}$ be the RCT-only estimator from (2.3). Then $\hat\tau_{rk}$ is known to be unbiased, and
\[
\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{n_{rk}p_r(1-p_r)}, \quad\text{where}\quad
\bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k}\bigl[(Y_{it} - \mu_{rkt})(1-p_r) + (Y_{ic} - \mu_{rkc})p_r\bigr]^2, \tag{2.9}
\]
for $\mu_{rkt} = \mu_t(R_k)$ and $\mu_{rkc} = \mu_c(R_k)$. Under Assumption 3, $\bar\sigma^2_{rk} = \sigma^2_{rkt} \equiv (1/n_{rk})\sum_{i\in R_k}(Y_{it} - \mu_{rkt})^2$. If $p_r = 1/2$, then
\[
\mathrm{var}_\delta(\hat\tau_{rk}) = \frac{4}{n_{rk}^2}\sum_{i\in R_k}\Bigl(\bar Y_i - \frac{\mu_{rkt} + \mu_{rkc}}{2}\Bigr)^2
\]
for $\bar Y_i = (Y_{it} + Y_{ic})/2$.
Proof. See Appendix Section A.1.2.
Corollary 3. Let $\hat\tau_{wk}$ be the weighted-average estimator (2.5). Then, with $\lambda_k = n_{ok}/(n_{ok} + n_{rk})$,
\[
\mathrm{var}_\delta(\hat\tau_{wk}) = \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1-\lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1-p_r)},
\]
where $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equation (2.6) with $S = O_k$, and $\bar\sigma^2_{rk}$ is defined at (2.9). If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k\Bigl(\frac{s_{okt}}{p_{okt}} - \frac{s_{okc}}{p_{okc}}\Bigr) + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr),
\]
where $s_{okt}$, $p_{okt}$, $s_{okc}$, and $p_{okc}$ are defined by equations (2.4) and (2.5) for $S = O_k$. If Assumption 3 also holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{wk}) = \frac{\lambda_k s_{okt}}{p_{okt}(1-p_{okt})} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
Proof. See Section 2.3.6.
In our motivating scenarios we anticipate that $n_o \gg n_r$, so that $\lambda_k \approx 1$ for most $k$. Then the first term in $\mathrm{var}_\delta(\hat\tau_{wk})$ is only slightly smaller than $\mathrm{var}_\delta(\hat\tau_{ok})$ for the ODB-only estimate, and at most a small variance reduction is to be expected from weighting.
The spiked-in estimator’s bias and variance cannot be computed as a corollary of Theorem 1,
but they can be computed directly.
Corollary 4. Let $\hat\tau_{sk}$ be the spiked-in estimator (2.4). Then
\[
\mathrm{var}_\delta(\hat\tau_{sk}) = \frac{1}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr),
\]
where $s_t$, $s_c$, $p_t$, $p_c$, $S_{tt}$, $S_{cc}$ and $S_{tc}$ are given in equations (2.3) through (2.6) with $S = O_k \cup R_k$. If $0 < \epsilon < e(x_i) < 1-\epsilon$ for all $i \in O_k \cup R_k$, then
\[
\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t} - \frac{s_c}{p_c} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
If Assumption 3 also holds, then
\[
\mathrm{Bias}_\delta(\hat\tau_{sk}) = \frac{s_t}{p_t(1-p_t)} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\]
Proof. The spiked-in estimates are computed by pooling $O_k$ and $R_k$ into their union.

To relate the bias of $\hat\tau_{sk}$ to that of the other estimators, we write it in terms of the quantities computed using $S = O_k$ and $S = R_k$. Denoting these quantities using an additional subscript of $o$ and $r$,
\[
\begin{aligned}
\mathrm{Bias}_\delta(\hat\tau_{sk}) &= \Lambda_k n_{ok}\Bigl(\frac{p_{okt}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{p_{okc}}{n_{ok}p_{okc} + n_{rk}p_{rkc}}\Bigr) \\
&\quad + \frac{s_{okt}n_{ok}}{n_{ok}p_{okt} + n_{rk}p_{rkt}} - \frac{s_{okc}n_{ok}}{n_{ok}p_{okc} + n_{rk}p_{rkc}} + O\Bigl(\frac{1}{n_{ok} + n_{rk}}\Bigr).
\end{aligned} \tag{2.10}
\]
The bias for $\hat\tau_{rk}$ is zero. The bias for $\hat\tau_{ok}$ has terms analogous to the second and third (and error) terms above, but the first term is new to $\hat\tau_{sk}$. This term is linear in $\Lambda_k$. For large values of $\Lambda_k$, this term will dominate, yielding biases that can easily exceed those of $\hat\tau_{ok}$. This is the fundamental danger of the spiked-in estimator: if the mean potential outcomes differ substantially between ODB and RCT subjects with similar values of the propensity score function, then the estimation will be poor due to large bias.
2.3.4 The dynamic weighted estimator

The bias-variance tradeoffs are intrinsically different in each stratum. Using results from the prior section, we derive a dynamic weighted estimator that uses different weights in each stratum. Our
dynamic weighted estimator is based on Assumption 3, though we will test it in settings where that
assumption does not hold.
From Proposition 1, the MSE-optimal convex combination of $\hat\tau_{ok}$ and $\hat\tau_{rk}$ is $c_k^*\hat\tau_{ok} + (1-c_k^*)\hat\tau_{rk}$ where $c_k^* = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. The dynamic weighted estimator is
\[
\hat\tau_{dk} = \hat c_k^*\hat\tau_{ok} + (1-\hat c_k^*)\hat\tau_{rk}, \quad\text{with}\quad \hat c_k^* = \frac{\widehat{\mathrm{var}}(\hat\tau_{rk})}{\widehat{\mathrm{var}}(\hat\tau_{rk}) + \widehat{\mathrm{MSE}}(\hat\tau_{ok})}, \tag{2.11}
\]
for plug-in estimators of $\mathrm{MSE}(\hat\tau_{ok})$ and $\mathrm{var}(\hat\tau_{rk})$. To obtain our MSE estimates we use $\widetilde{\mathrm{MSE}}(\cdot) = \mathrm{Bias}_\delta(\cdot)^2 + \mathrm{var}_\delta(\cdot)$, taking the delta method moments from Corollaries 1 and 2. These expressions include some unknown population quantities that we then approximate from the data to get $\widehat{\mathrm{MSE}}(\cdot)$.
For the ODB estimate we use
\[
\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1-p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr)
\]
where the quantities on the right hand side are given in Section 2.3.1 with $S = O_k$. For the RCT estimate we use
\[
\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1-p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k} W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}
\]
where $\hat\sigma^2_{rkt}$, $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively.
Both of these estimates use Assumption 3.
The values of $p_t$ and $p_c$ are known: $p_t = \sum_{i\in O_k} p_{it}/n_{ok}$, where $p_{it}$ is the propensity $e(x_i)$, and $p_c = 1 - p_t$. We use Horvitz-Thompson style inverse probability weighting to estimate other quantities. Full details can be found in Section 2.4.
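As a sketch (our own simplified naming; the full plug-ins are in Section 2.4), the per-stratum dynamic weight (2.11) and the RCT variance plug-in above can be coded as:

```python
import numpy as np

def rct_var_plugin(w, y, p_r):
    """tilde-var(tau_rk): sigma-bar^2 mixes the treated and control sample
    variances in proportion to arm sizes, then is scaled by 1/(p_r(1-p_r)n_rk)."""
    n = len(y)
    var_t = y[w == 1].var(ddof=1)   # sigma-hat^2_rkt
    var_c = y[w == 0].var(ddof=1)   # sigma-hat^2_rkc
    sigma_bar2 = ((w == 1).sum() * var_t + (w == 0).sum() * var_c) / n
    return sigma_bar2 / (p_r * (1 - p_r) * n)

def dynamic_estimate(tau_ok, tau_rk, mse_ok_hat, var_rk_hat):
    """Dynamic weighted estimate (2.11) from the two per-source estimates."""
    c_hat = var_rk_hat / (var_rk_hat + mse_ok_hat)
    return c_hat * tau_ok + (1 - c_hat) * tau_rk, c_hat

# When the estimated ODB MSE is large, weight shifts toward the RCT.
tau_d, c_hat = dynamic_estimate(tau_ok=1.0, tau_rk=2.0, mse_ok_hat=3.0, var_rk_hat=1.0)
print(tau_d, c_hat)  # 1.75 0.25
```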
2.3.5 Performance comparison
The ideal dynamic estimator with the optimal weight $c_k^*$ must be at least as good as $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$, because those estimators are all special cases of weighting estimators belonging to the class that $c_k^*$ optimizes over. Our estimator $\hat\tau_{dk}$ will not always be better than those other estimators, because in estimating $\hat c_k^*$, we may introduce enough error to make it less efficient.
When combining stratum-based estimates $\hat\tau_k$ into the weighted estimator $\hat\tau = \sum_k \omega_k \hat\tau_k$, there is the possibility of biases canceling between strata. None of the competing estimators we consider are designed to exploit such cancellation. For large strata, $c_k^*$ should be well estimated. To arrange cancellations among biased within-stratum estimates would require domain-specific assumptions that we do not make here.
The comparison to the spiked-in estimator is more complex. As we saw in equation (2.10), the bias can grow without bound in $\Lambda_k$, so for large $\Lambda_k$ this estimator will have the largest MSE.
However, for small values of $\Lambda_k$, the spiked-in estimator can outperform all the other estimators. To see why, we make a direct comparison with the dynamic weighted estimator and reference our prior discussion showing that the dynamic weighted estimator will generally outperform $\hat\tau_{ok}$, $\hat\tau_{rk}$ and $\hat\tau_{wk}$.
We introduce sample counterparts of $\Lambda_k$, given by
\[
\begin{aligned}
\hat\Lambda_{kt} &= \frac{\sum_{i\in O_k} W_{it}Y_{it}}{\sum_{i\in O_k} W_{it}} - \frac{\sum_{i\in R_k} W_{it}Y_{it}}{\sum_{i\in R_k} W_{it}}, \quad\text{and} \\
\hat\Lambda_{kc} &= \frac{\sum_{i\in O_k} W_{ic}Y_{ic}}{\sum_{i\in O_k} W_{ic}} - \frac{\sum_{i\in R_k} W_{ic}Y_{ic}}{\sum_{i\in R_k} W_{ic}}.
\end{aligned}
\]
Then, after some algebra, $\hat\tau_{sk}$ differs from the RCT estimate as follows,
\[
\hat\tau_{sk} - \hat\tau_{rk} = c_{kt}\hat\Lambda_{kt} - c_{kc}\hat\Lambda_{kc} \tag{2.12}
\]
for sample size proportions
\[
c_{kt} = \frac{\sum_{i\in O_k} W_{it}}{\sum_{i\in O_k\cup R_k} W_{it}} \quad\text{and}\quad c_{kc} = \frac{\sum_{i\in O_k} W_{ic}}{\sum_{i\in O_k\cup R_k} W_{ic}}.
\]
By comparison,
\[
\hat\tau_{dk} - \hat\tau_{rk} = \hat c_k^*\hat\Lambda_{kt} - \hat c_k^*\hat\Lambda_{kc}, \tag{2.13}
\]
where the dynamic estimator tunes $\hat c_k^*$ to the available data. An oracle could choose $c_k^*$ optimally using Proposition 1. While the oracle is working in a one parameter family (2.13) for each bin $k$, the spiked-in estimator uses two weights $c_{kt}$ and $c_{kc}$ (2.12) that are not necessarily within the family that the oracle optimizes over. This is why it is possible for the spiked-in estimator to outperform the oracle.
2.3.6 Proof of Corollary 3

Using (2.1) and Corollaries 1 and 2, $\mathrm{Bias}_\delta(\hat\tau_{wk}) = \lambda_k \times \mathrm{Bias}_\delta(\hat\tau_{ok})$ for $\lambda_k$ given in (2.5). This yields the lead terms in both expressions for $\mathrm{Bias}_\delta(\hat\tau_{wk})$. The error terms are $\lambda_k O(1/n_{ok}) = O(1/(n_{ok} + n_{rk}))$.

Using independence of the RCT and ODB, Corollaries 1 and 2, and definition (2.2),
\[
\begin{aligned}
\mathrm{var}_\delta(\hat\tau_{wk}) &= \frac{\lambda_k^2}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + (1-\lambda_k)^2\frac{\bar\sigma^2_{rk}}{n_{rk}p_r(1-p_r)} \\
&= \frac{\lambda_k}{n_{ok} + n_{rk}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr) + \frac{1-\lambda_k}{n_{ok} + n_{rk}}\,\frac{\bar\sigma^2_{rk}}{p_r(1-p_r)}.
\end{aligned}
\]
2.4 Derivation: Dynamic Weighted Estimator
Recall we seek to estimate $c_k^*\hat\tau_{ok} + (1-c_k^*)\hat\tau_{rk}$ where the weight $c_k^* = \mathrm{var}(\hat\tau_{rk})/(\mathrm{var}(\hat\tau_{rk}) + \mathrm{MSE}(\hat\tau_{ok}))$. For the ODB, our plug-in estimate is
\[
\widetilde{\mathrm{MSE}}(\hat\tau_{ok}) = \Bigl(\frac{s_t}{p_t(1-p_t)}\Bigr)^2 + \frac{1}{n_{ok}}\Bigl(\frac{S_{tt}}{p_t^2} + \frac{S_{cc}}{p_c^2} + \frac{2S_{tc}}{p_tp_c}\Bigr)
\]
and for the RCT estimate we use
\[
\widetilde{\mathrm{var}}(\hat\tau_{rk}) = \frac{\bar\sigma^2_{rk}}{p_r(1-p_r)n_{rk}}, \quad\text{with}\quad \bar\sigma^2_{rk} = \frac{1}{n_{rk}}\sum_{i\in R_k} W_{it}\hat\sigma^2_{rkt} + W_{ic}\hat\sigma^2_{rkc}
\]
where $\hat\sigma^2_{rkt}$, $\hat\sigma^2_{rkc}$ are the sample variances observed among the treated and control units respectively.
We use Horvitz-Thompson style inverse probability weighting to estimate key quantities, as follows:
\[
\hat\rho_t = \frac{\sum_{i\in O_k} W_{it}Y_{it}}{\sum_{i\in O_k} W_{it}}, \quad
\hat\rho_c = \frac{\sum_{i\in O_k} W_{ic}Y_{ic}}{\sum_{i\in O_k} W_{ic}},
\]
\[
\hat s_t = \frac{\sum_{i\in O_k} W_{it}}{n_{ok}}\Bigl(\sum_{i\in O_k} W_{it}Y_{it} - p_t\sum_{i\in O_k} W_{it}Y_{it}/p_{it}\Bigr)
+ \frac{\sum_{i\in O_k} W_{ic}}{n_{ok}}\Bigl(\sum_{i\in O_k} W_{ic}Y_{ic} - p_c\sum_{i\in O_k} W_{ic}Y_{ic}/p_{ic}\Bigr),
\]
\[
\hat S_{tt} = \frac{\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \hat\rho_t)^2}{\sum_{i\in O_k} W_{it}}, \quad\text{and}\quad
\hat S_{cc} = \frac{\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{ic} - \hat\rho_c)^2}{\sum_{i\in O_k} W_{ic}}.
\]
The sole quantity that does not have a Horvitz-Thompson estimator is $S_{tc}(O_k)$, because we never observe both potential outcomes for a given unit. First, we write $S_{tc}$ as
\[
\frac{1}{n}\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c) + \frac{1}{n}\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{it} - \rho_t)(Y_{ic} - \rho_c).
\]
Next, under Assumption 3,
\[
Y_{it} - \rho_t = Y_{ic} + \tau_k - \mu_t - s_t/p_t = Y_{ic} - \rho_c - \frac{s_t}{p_tp_c},
\]
and similarly $Y_{ic} - \rho_c = Y_{it} - \rho_t + s_t/(p_tp_c)$. Therefore
\[
\begin{aligned}
S_{tc} &= \frac{1}{n}\sum_{i\in O_k} W_{it}p_{it}(1-p_{it})(Y_{it} - \rho_t)^2 + \frac{1}{n}\sum_{i\in O_k} W_{ic}p_{it}(1-p_{it})(Y_{ic} - \rho_c)^2 \\
&\quad + \frac{s_t}{np_t(1-p_t)}\sum_{i\in O_k} p_{it}(1-p_{it})\Bigl(W_{it}(Y_{it} - \rho_t) - W_{ic}(Y_{ic} - \rho_c)\Bigr)
\end{aligned} \tag{2.14}
\]
and we get $\hat S_{tc}$ by plugging the above estimates of $\rho_t$, $\rho_c$ and known values of $p_t$, $p_c$ into (2.14).
Although Assumption 3 is used to derive the estimator, some of our simulations in Section 2.5 test it under a violation of that assumption.
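The ratio-form plug-ins above for one ODB stratum can be sketched as follows (our own function and variable names; `y_obs` is the single observed outcome per unit):

```python
import numpy as np

def ht_stratum_quantities(w, y_obs, p):
    """Plug-in estimates for one ODB stratum, mirroring the displays above:
    rho_hat_t, rho_hat_c, S_tt_hat, S_cc_hat. y_obs holds the observed
    outcome for each unit (Y_it if treated, Y_ic otherwise)."""
    t, c = (w == 1), (w == 0)
    rho_t = y_obs[t].mean()      # sum(W_it Y_it) / sum(W_it)
    rho_c = y_obs[c].mean()      # sum(W_ic Y_ic) / sum(W_ic)
    pq = p * (1 - p)             # p_it (1 - p_it)
    S_tt = (pq[t] * (y_obs[t] - rho_t) ** 2).sum() / t.sum()
    S_cc = (pq[c] * (y_obs[c] - rho_c) ** 2).sum() / c.sum()
    return rho_t, rho_c, S_tt, S_cc

w = np.array([1, 1, 0, 0])
y_obs = np.array([2.0, 4.0, 1.0, 3.0])
p = np.full(4, 0.5)
print(ht_stratum_quantities(w, y_obs, p))  # (3.0, 2.0, 0.25, 0.25)
```

$\hat S_{tc}$ then follows from (2.14) by substituting these estimates together with the known $p_t$ and $p_c$.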
2.5 Simulations
Our goal is to estimate the average treatment effect in the target population, from which we assume the ODB data was randomly sampled. The value of the RCT is that it can substitute for ODB data in places where that data is sparse due to the treatment assignment mechanism.

We simulate two high-level scenarios. In one, the RCT is a random sample from the same population that the ODB came from. Then the RCT and ODB data differ only in their treatment assignment mechanisms. We consider this case the idea