
Off-policy Policy Evaluation Under Unobserved Confounding

Anonymous Authors 1

Abstract

When observed decisions depend only on observed features, off-policy policy evaluation (OPE) methods for sequential decision making problems allow evaluating the performance of evaluation policies before deploying them. This assumption is often violated due to the presence of unobserved confounders, variables that impact both the decisions and their outcomes. We assess the robustness of OPE methods by developing worst-case bounds on the performance of an evaluation policy under different models of confounding. When unobserved confounders can affect every decision in an episode, we demonstrate that even small amounts of per-decision confounding can heavily bias OPE methods. Fortunately, in a number of important settings found in healthcare, policy-making, operations, and technology, unobserved confounders may primarily affect only one of the many decisions made. Under this less pessimistic model of one-decision confounding, we propose an efficient loss-minimization-based procedure for computing worst-case bounds on OPE estimates, and prove its statistical consistency. On simulated healthcare examples, we demonstrate that our method allows reliable off-policy evaluation by invalidating non-robust results and providing certificates of robustness.

1. Introduction

New technology and regulatory shifts allow collection of unprecedented amounts of data on past decisions and their associated outcomes, ranging from product recommendation systems to medical treatment decisions. This presents unique opportunities to leverage off-policy observational data to inform better decision-making. When online experimentation is expensive or risky, it is crucial to leverage prior data to evaluate the performance of a sequential decision policy before deploying it.

1 Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author <[email protected]>.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

While epidemiology has long been interested in dynamic treatment regime estimation, the reinforcement learning community is increasingly paying attention to batch reinforcement learning (RL) because of new models and data availability (see, e.g., Thomas et al., 2019; Liu et al., 2018; Le et al., 2019; Thomas et al., 2015; Komorowski et al., 2018b; Hanna et al., 2017; Gottesman et al., 2019a;b). We focus on the common scenario where decisions are made in episodes by an unknown behavioral policy, each episode involving a sequence of decisions.

A central challenge in OPE is that the estimand is inherently a counterfactual quantity: what would the resulting outcomes be if an alternate policy had been used (the counterfactual) instead of the behavior policy used in the collected data (the factual)? As a result, OPE requires causal reasoning about whether the decisions caused observed differences in rewards, as opposed to being caused by some unobserved confounding variable that simultaneously affects both the observed decisions and the states or rewards (Hernán and Robins, 2020; Pearl, 2009).

In order to make counterfactual evaluations possible, a standard assumption—albeit often overlooked and unstated—is to require that the behavior policy does not depend on any unobserved/latent variables that also affect the future states or rewards (no unobserved confounding). We refer to this assumption as sequential ignorability, following the line of work on dynamic treatment regimes (Robins, 1986; 1997; Murphy et al., 2001; Murphy, 2003). Sequential ignorability, however, is frequently violated in observational problems where the behavioral policy is unknown. In healthcare, business operations, and even automated systems in tech, decisions are often made with respect to unlogged data correlated with future potential outcomes.

In this work, we develop and analyze a framework that can quantify the impact of unobserved confounders on off-policy policy evaluations, providing certificates of robustness. Since OPE is generally impossible under arbitrary amounts of unobserved confounding, we begin by positing a model that explicitly limits their influence on decisions. In Section 4, we illustrate that when unobserved confounders can affect all decisions, even small amounts of confounding can have an exponential (in the number of decision steps) impact on the error of the resulting off-policy evaluation. In this sense, the validity of OPE can almost always be questioned in the presence of unobserved confounding that affects all time steps. Fortunately, in a number of important applications, unobserved confounders may only affect a single decision, particularly in scenarios where an expert acts as a high-level decision-maker who uses unrecorded information to make an initial decision, after which a standard set of protocols is followed based on well-recorded observations.

Under our less pessimistic model of single-decision confounding, we develop bounds on the expected rewards of the evaluation policy (Section 5). We use functional convex duality to derive a dual relaxation, and show that it can be computed by solving a loss minimization problem. Our procedure allows analyzing the sensitivity of OPE methods to confounding in realistic scenarios involving continuous states and rewards, over a potentially large horizon. We prove that an empirical approximation of our procedure is consistent, allowing estimation from observed past decisions.

On simulation examples of dynamic treatment regimes for autism and sepsis management, we illustrate how our single-decision confounding model allows us to obtain informative bounds over meaningful amounts of potential confounding. Our approach can both provide a certificate of the robustness of OPE under a certain amount of unobserved confounding, and identify when bias in OPE raises concerns about the validity of selecting the best policy among a set of candidates. As we illustrate, developing tools for a meaningful sensitivity analysis is nontrivial: a naive bound yields prohibitively conservative estimates that lose robustness certificates for even negligible amounts of confounding, whereas our loss-minimization-based bounds on policy values are informative.

1.1. Motivating example: managing sepsis patients

Managing sepsis in ICU patients is an extremely important problem, accounting for 1/3 of deaths in hospitals (Howell and Davis, 2017). Sepsis treatment decisions are made by a clinical care team, including nurses, residents, and ICU attending physicians and specialists (Rhodes et al., 2017). Difficulties of care often lead to decisions based on imperfect information, leaving substantial room for improvement. AI-based approaches provide an opportunity for optimal automated management of medications, freeing the care team to allocate more resources to critical cases. Automated approaches can manage important medications for sepsis, including antibiotics and vasopressors, and decide when to notify the care team that a patient should be placed on a mechanical ventilator. Motivated by these opportunities, and the availability of ICU data from MIMIC-III (Johnson et al., 2016), several AI-based approaches for sepsis management have been proposed (Futoma et al., 2018; Komorowski et al., 2018a; Raghu et al., 2017).

Due to safety concerns, new treatment policies need to be evaluated offline before a more thorough validation. Confounding, however, is a serious issue in data generated from an ICU. Patients in emergency departments often do not have an existing record in the hospital's electronic health system, leaving a substantial amount of patient-specific information unobserved in subsequent offline analysis. As a prominent example, comorbidities that significantly complicate cases of sepsis (Brent, 2017) are often unrecorded. Private communication with an emergency department physician revealed that the initial administration of antibiotics at admission to the hospital is often confounded by unrecorded factors that affect the eventual outcome (death or discharge from the ICU). For example, comorbidities such as undiagnosed or improperly treated heart failure can delay diagnosis of sepsis, leading to slower implementation of antibiotic treatments. More generally, there is considerable discussion in the medical literature on the importance of quickly beginning antibiotic treatment, with frequently noted concerns about confounding, as these discussions are largely based on off-policy observational data collected from ICUs (Seymour et al., 2017; Sterling et al., 2015). Antibiotics are of particular interest given the recent debate regarding the importance of early treatment, and the risks of over-prescription.

We consider a scenario where one wishes to choose between two automated policies: optimal treatment policies that differ only in initially avoiding, or prescribing, antibiotics. The latter is often considered a better treatment for sepsis, as sepsis is caused by an infection. In this example, unobserved factors most critically affect the first decision on prescribing antibiotics upon arrival; since the care team is highly trained for treating sepsis, we assume they follow standard protocols based on observed vital signs and lab measurements in subsequent time steps. In what follows, we assess the impact of the confounding factors discussed above on OPE of automated policies, and provide certificates of robustness that guarantee gains over existing policies.

2. Related Work

Most methods for OPE for batch reinforcement learning rely (implicitly or explicitly) on sequential ignorability. There is an extensive body of work on off-policy evaluation and optimization under this assumption, including doubly robust methods (Jiang and Li, 2015; Thomas and Brunskill, 2016) and recent work that provides semiparametric efficiency bounds (Kallus and Uehara, 2019). Often the probabilities of observing particular decisions (the behavior policy) are assumed to be known, though prior work has highlighted how errors in these quantities can bias value estimates (Liu et al., 2018) or provided estimators that learn and leverage a predictor of the decision probabilities (Nie et al., 2019; Hanna et al., 2019). Unfortunately, doubly robust estimators suffer from the same bias when sequential ignorability does not hold, since neither the outcome model nor the importance sampling weights can correct for the effect of the unobserved confounder. The do-calculus and its sequential backdoor criterion on the associated directed acyclic graph (Pearl, 2009) also give identification results for OPE. Like sequential ignorability, this precludes the existence of unobserved confounding variables; therefore, methods assuming the sequential backdoor criterion holds will be biased in their presence.

The focus of this work is to study how unobserved confounding affects OPE in sequential decision making problems, and to derive bounds on the evaluation policy performance in the presence of confounding. Zhang and Bareinboim (2019) derived partial identification bounds on policy performance without making model assumptions about the unobserved confounder, similar to work by Manski (1990) on bounding treatment effects. Robins et al. (2000); Robins (2004); Brumback et al. (2004) instead posit a model for how the confounding bias in each time step affects the outcome of interest and derive bounds under this model, motivated by potential confounding in the analysis of the effects of dynamic treatment regimes for HIV therapy on CD4 counts in HIV-positive men. Our work is complementary to these in that we instead assume a model for how the unobserved confounder affects the behavior policy, motivated by the nature of confounding in the management of sepsis patients and developmental interventions for autistic children.

For single decision making problems, a variety of methods developed in the econometrics, statistics, and epidemiology literature estimate bounds on treatment effects and mean potential outcomes based on a model for the effect of the unobserved confounder on the behavior policy (Cornfield et al., 1959; Rosenbaum and Rubin, 1983; Robins et al., 2000; Imbens, 2003; Brumback et al., 2004). Recent work has extended this to heterogeneous treatment effect estimates (Yadlowsky et al., 2018; Kallus et al., 2018) closely related to policy evaluation, and to policy optimization (Kallus and Zhou, 2018). Our model is closely related to these, and naturally extends these approaches to sequential decision making (see Section 4 for a detailed discussion).

3. Formulation

Notation conventions vary substantially in the diverse set of communities interested in learning from observational data gathered on sequences of decisions and their outcomes. In this paper, we use the potential outcomes notation to make explicit which action we wish to evaluate versus which action was actually observed. In this setting, we imagine that all counterfactual (potential) states and rewards exist, but we only observe the one corresponding to the action taken (also known as partial feedback). In batch off-policy reinforcement learning, sequential ignorability (as we discuss further) is almost always assumed, in which case the distribution of states and rewards conditional on taking an action is equivalent to the potential outcome evaluated at that action. However, since our aim is to consider the impact of potential confounding, clarifying the difference between factual and counterfactual states and rewards is important.

We focus on domains modeled by episodic stochastic decision processes with a discrete set of actions. Let $\mathcal{A}_t$ be a finite set of actions available at time $t = 1, \ldots, T$. Denote a sequence of actions $a_1 \in \mathcal{A}_1, \ldots, a_T \in \mathcal{A}_T$ by $a_{1:T}$ (and similarly $a_{t:t'}$ for arbitrary indices $1 \le t \le t' \le T$, with the convention $a_{1:0} = \emptyset$). For any sequence of actions $a_{1:T}$, let $S_t(a_{1:t-1})$ and $R_t(a_{1:t})$ be the state and reward at time $t$; note that in general there are many potentially realizable states for a particular prior sequence of actions. $Y(a_{1:T}) := \sum_{t=1}^{T} \gamma^{t-1} R_t(a_{1:t})$ is the corresponding discounted sum of rewards. We denote by $W(a_{1:T}) = (S_1(a_1), \ldots, S_T(a_{1:T-1}), R_1(a_1), \ldots, R_T(a_{1:T}))$ all potential outcomes (over rewards and states) associated with the action sequence $a_{1:T}$. Any sum $\sum_{a_{1:t}}$ over action sequences is taken over all $a_{1:T} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_T$.

In the off-policy setting, we observe sequences of actions $A_1, \ldots, A_T$ generated by an unknown behavioral policy $\pi_1, \ldots, \pi_T$. Let $H_t$ denote the observed history until time $t$, so that $H_1 := S_1$, and for $t = 2, \ldots, T$, $H_t := (S_1, A_1, S_2(A_1), A_2, \ldots, S_t(A_{1:t-1}))$. As a notational shorthand, for any fixed sequence of actions $a_{1:T}$, we denote an instantiation of the observed history following the action sequence by $H_t(a_{1:t-1})$, so that $H_1(a_{1:0}) := H_1 = S_1$, and for $t = 2, \ldots, T$, $H_t(a_{1:t-1}) = (S_1, A_1 = a_1, S_2(a_1), \ldots, A_{t-1} = a_{t-1}, S_t(a_{1:t-1}))$. We denote by $\mathcal{H}_t$ the set over which this history takes values.

When there is no unobserved confounding, $A_t \sim \pi_t(\cdot \mid H_t)$, since actions are generated conditional on the history $H_t$. When there is an unobserved confounder $U_t$, the behavioral policy draws actions $A_t \sim \pi_t(\cdot \mid H_t, U_t)$, and we denote by $\pi_t(\cdot \mid H_t)$ the conditional distribution of $A_t$ given only the observed history $H_t$, meaning we marginalize out the dependence on $U_t$. For simplicity, we assume that previously observed rewards are included in the states, so $R_s(A_{1:s})$ is deterministic given $H_t$, the history of states and previous actions. We define $Y_t(a_t) := Y(A_{1:t-1}, a_t, A_{t+1:T})$ as a shorthand: semantically, this is the sum of rewards for a trajectory that matches the executed actions on all but one action, where on time step $t$ action $a_t$ is taken. Since $a_t$ may not be identical to the taken action $A_t$, the resulting expression for $Y$ represents a potential outcome.

Our goal is to reliably bound the bias of evaluating the performance of an evaluation policy $\bar\pi_1, \ldots, \bar\pi_T$ in a confounded multi-decision off-policy environment. In standard batch RL notation, this means we wish to bound the bias of estimating $V_{\bar\pi}$ using behavioral data generated under the presence of confounding variables. Let $\bar A_t \sim \bar\pi_t(\cdot \mid \bar H_t)$ be the actions generated by the evaluation policy at time $t$, where we use $\bar H_t := (S_1, \bar A_1, S_2(\bar A_1), \bar A_2, \ldots, S_t(\bar A_{1:t-1}))$ and $\bar H_t(a_{1:t-1}) := (S_1, \bar A_1 = a_1, S_2(a_1), \bar A_2 = a_2, \ldots, S_t(a_{1:t-1}))$ to denote the history under the evaluation policy, analogously to the shorthands $H_t$, $H_t(a_{1:t-1})$. We are interested in statistical estimation of the expected cumulative reward $\mathbb{E}[Y(\bar A_{1:T})]$ under the evaluation policy, which we call the performance of the evaluation policy (i.e., $V_{\bar\pi}$ in batch RL). Throughout, we assume that for all $t$ and $a_t$, and almost every $H_t$, $\pi_t(a_t \mid H_t) > 0$ whenever $\bar\pi_t(a_t \mid H_t) > 0$.

We can now state sequential ignorability in terms of the relationship between actions and potential outcomes.

Definition 1 (Sequential Ignorability). A policy $\pi$ satisfies sequential ignorability (see, e.g., Robins, 1986; 2004; Murphy, 2003) if for all $t = 1, \ldots, T$, conditional on the history $H_t$ generated by the policy $\pi$, the action $A_t \sim \pi_t(\cdot \mid H_t)$ is independent of the potential outcomes $R_t(a_{1:t}), S_{t+1}(a_{1:t}), R_{t+1}(a_{1:t+1}), S_{t+2}(a_{1:t+1}), \ldots, S_T(a_{1:T-1}), R_T(a_{1:T})$ for all $a_{1:T} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_T$.

Sequential ignorability is a natural condition required for the evaluation policy to be well-defined: any additional randomization used by the evaluation policy $\bar\pi_t(\cdot \mid \bar H_t)$ cannot depend on unobserved confounders. We assume that the evaluation policy always satisfies this assumption.

Assumption A. The evaluation policy satisfies sequential ignorability (Definition 1).

Off-policy policy evaluation fundamentally requires counterfactual reasoning since we only observe the state evolution $S_t(A_{1:t-1})$ and rewards $R_t(A_{1:t})$ corresponding to the actions taken by the behavioral policy. The canonical assumption in batch off-policy reinforcement learning is that sequential ignorability holds for the behavior policy. We now briefly review how this allows identification (and thus, accurate estimation) of $\mathbb{E}[Y(\bar A_{1:T})]$.

Because we only observe the potential outcomes $W(A_{1:T})$ evaluated at the actions $A_{1:T}$ taken by the behavior policy $\pi_t$, we need to express $\mathbb{E}[Y(\bar A_{1:T})]$ in terms of observable data generated by the behavioral policy $\pi_t$. Sequential ignorability of both the behavior policy and the evaluation policy allows such counterfactual reasoning. The following identity is standard (see the appendix for its derivation).

Lemma 1. Assume sequential ignorability (Definition 1) holds for both the behavioral and the evaluation policy. Then,

$$\mathbb{E}[Y(\bar A_{1:T})] = \mathbb{E}\left[ Y(A_{1:T}) \prod_{t=1}^{T} \frac{\bar\pi_t(A_t \mid \bar H_t(A_{1:t-1}))}{\pi_t(A_t \mid H_t)} \right].$$

The right-hand side is called the importance sampling formula. To ease notation, we write

$$\rho_t := \frac{\bar\pi_t(A_t \mid \bar H_t(A_{1:t-1}))}{\pi_t(A_t \mid H_t)}. \qquad (1)$$
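As a concrete illustration of the importance sampling formula in Lemma 1, the following minimal sketch (ours, not the paper's; the array conventions are assumptions) shows how the per-decision ratios $\rho_t$ combine into a trajectory-level OPE estimate when sequential ignorability holds:

```python
import numpy as np

def importance_sampling_ope(pi_eval, pi_behav, returns):
    """Estimate the evaluation policy value via the identity in Lemma 1.

    pi_eval, pi_behav: (n, T) arrays holding, for each trajectory and step,
    the evaluation / behavior policy's probability of the action actually
    taken (the numerator and denominator of rho_t).
    returns: (n,) array of observed discounted returns Y(A_{1:T}).
    """
    rho = pi_eval / pi_behav             # per-decision ratios rho_t
    weights = np.prod(rho, axis=1)       # trajectory weight prod_t rho_t
    return float(np.mean(weights * returns))
```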

4. Bounds under unobserved confounding

Despite its advantageous implications, it is often unrealistic to assume that the behavioral policy $\pi_t$ satisfies sequential ignorability (Definition 1). To address such challenges, we relax sequential ignorability of the behavioral policy, and instead posit a model of bounded confounding. We develop worst-case bounds on the evaluation policy performance $\mathbb{E}[Y(\bar A_{1:T})]$. In addition to the observed state $S_t(A_{1:t-1})$ available in the data, we assume that there is an unobserved confounder $U_t$ available only to the behavioral policy at each time $t$. The behavioral policy observes the history $H_t$ and the unobserved confounder $U_t$, and generates an action $A_t \sim \pi_t(\cdot \mid H_t, U_t)$. If $U_t$ contains information about unseen potential outcomes, then sequential ignorability (Definition 1) will fail to hold for the behavioral policy.

Without loss of generality, let $U_t$ be such that the potential outcomes are independent of $A_t$ when controlling for $U_t$ alongside the observed states. Such an unobserved confounder always exists, since we can define $U_t$ to be the tuple of all unseen potential outcomes.

Assumption B. For all $t = 1, \ldots, T$, there exists a random vector $U_t$ such that, conditional on the history $H_t$ generated by the behavioral policy and $U_t$, the action $A_t \sim \pi_t(\cdot \mid H_t, U_t)$ is independent of the potential outcomes $R_t(a_{1:t}), S_{t+1}(a_{1:t}), R_{t+1}(a_{1:t+1}), S_{t+2}(a_{1:t+1}), \ldots, S_T(a_{1:T-1}), R_T(a_{1:T})$ for all $a_{1:T} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_T$.

Identification of $\mathbb{E}[Y(\bar A_{1:T})]$ is impossible under arbitrary unobserved confounding. However, it is often plausible to posit that the unobserved confounder $U_t$ has a limited influence on the decisions of the behavioral policy. When the influence of unobserved confounding on each action is limited, we might expect that estimates of the evaluation policy performance assuming sequential ignorability are not too biased. We consider the following model, which bounds the influence of unobserved confounding on the behavioral policy's decisions.

Assumption C. For $t = 1, \ldots, T$, there is a $\Gamma_t \ge 1$ satisfying

$$\frac{\pi_t(a_t \mid H_t, U_t = u_t)}{\pi_t(a'_t \mid H_t, U_t = u_t)} \cdot \frac{\pi_t(a'_t \mid H_t, U_t = u'_t)}{\pi_t(a_t \mid H_t, U_t = u'_t)} \le \Gamma_t \qquad (2)$$

for any $a_t, a'_t \in \mathcal{A}_t$, almost surely over $H_t$, and any $u_t, u'_t$, and sequential ignorability holds conditional on $H_t$ and $U_t$.

When the action space is binary, $\mathcal{A}_t = \{0, 1\}$, the above bounded unobserved confounding assumption is equivalent (Rosenbaum, 2002) to the following logistic model:

$$\log \frac{P(A_t = 1 \mid H_t, U_t)}{P(A_t = 0 \mid H_t, U_t)} = g(H_t) + (\log \Gamma_t) \cdot b(U_t)$$

for some measurable function $g(\cdot)$ and a bounded measurable function $b(\cdot)$ taking values in $[0, 1]$. When $T = 1$, the bounded unobserved confounding assumption (2) reduces to a classical model that has been extensively studied by many of the authors mentioned in the related work.
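As a quick numerical sanity check of this equivalence, the sketch below (our own illustration; the function names are not from the paper) confirms that under the logistic model the odds ratio in (2) between any two values of $b(U_t) \in [0, 1]$ never exceeds $\Gamma_t$:

```python
import numpy as np

def propensity(g_h, b_u, gamma):
    """Logistic behavior policy: log-odds of A_t = 1 is g(H_t) + log(gamma) * b(U_t)."""
    return 1.0 / (1.0 + np.exp(-(g_h + np.log(gamma) * b_u)))

def odds(p):
    return p / (1.0 - p)

gamma, g_h = 3.0, 0.4
p_hi = propensity(g_h, 1.0, gamma)   # b(U) at its maximum value 1
p_lo = propensity(g_h, 0.0, gamma)   # b(U) at its minimum value 0
# The odds ratio equals gamma at the extremes and is smaller otherwise.
assert odds(p_hi) / odds(p_lo) <= gamma + 1e-9
```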

Under this model of confounding, OPE is almost always unreliable; in sequential decision making, the effects of confounding can create exponentially large (in the horizon $T$) over-sampling of large (or small) rewards, introducing an extremely large, uncorrectable bias. As an illustration, consider applying OPE in the following dramatically simplified setting. Letting $U \sim \mathrm{Unif}(\{-1, 1\})$ be a single unobserved confounder, consider a sequence of actions $A_1, \ldots, A_T \in \{0, 1\}$, each drawn conditionally on $U$ but independent of one another, with the conditional distribution $P(A_t = 1 \mid U = 1) = \sqrt{\Gamma}/(1 + \sqrt{\Gamma})$ and $P(A_t = 1 \mid U = -1) = 1/(1 + \sqrt{\Gamma})$. Finally, consider the reward $R = U$. This reward is independent of the actions taken; yet, in the observed data, the likelihood of observing the data $((A_t = 1)_{t=1}^{T}, R = 1)$ is $\Gamma^{T/2} / (2(1 + \sqrt{\Gamma})^{T})$, whereas the likelihood of observing $((A_t = 1)_{t=1}^{T}, R = -1)$ is $1 / (2(1 + \sqrt{\Gamma})^{T})$. Therefore, even as $n \to \infty$, OPE will mistakenly suggest that the policy which always takes $A_t = 1$ leads to much better rewards than one which always takes $A_t = 0$, because without observing $U$, the importance weights of both of the above samples are equal in OPE. While this example is unrealistic in that it lacks states, and rewards that depend on the states and actions, the core issue in terms of confounding remains: the unobserved confounder makes certain observed data samples exponentially more likely than others, without the OPE algorithm being able to detect or correct for these differences.
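The exponential bias is easy to reproduce by simulation. Below is a minimal sketch of the toy example above (our own illustration, with assumed function names), applying naive importance sampling to evaluate the always-treat policy; the true value is $\mathbb{E}[R] = 0$, yet the estimate converges to a strictly positive value that grows with $T$:

```python
import numpy as np

def naive_ope_toy(gamma_conf=2.0, T=10, n=200_000, seed=0):
    """Toy example of Section 4: R = U is independent of the actions, but a
    naive importance sampling estimate for the always-take-A_t=1 policy is
    biased upward because the confounder U is unobserved."""
    rng = np.random.default_rng(seed)
    p_hi = np.sqrt(gamma_conf) / (1 + np.sqrt(gamma_conf))  # P(A_t=1 | U=+1)
    p_lo = 1 / (1 + np.sqrt(gamma_conf))                    # P(A_t=1 | U=-1)

    U = rng.choice([-1.0, 1.0], size=n)
    A = rng.random((n, T)) < np.where(U > 0, p_hi, p_lo)[:, None]
    R = U                                                   # reward ignores actions

    # The analyst only sees the marginal behavior policy P(A_t = 1) = 1/2,
    # so a trajectory matching the always-treat policy gets weight 2^T.
    weights = np.where(A.all(axis=1), 2.0 ** T, 0.0)
    return float(np.mean(weights * R))                      # true value is 0

print(naive_ope_toy())   # roughly 2.4 for Gamma = 2 and T = 10, not 0
```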

5. Confounding in a single decision

In many important applications, it is realistic to assume there is only a single step of confounding at a known time step $t^\star$. Under this assumption, we outline in this section how we obtain a computationally and statistically feasible procedure for computing a lower (or upper) bound on the value $\mathbb{E}[Y(\bar A_{1:T})]$ of an evaluation policy $\bar\pi$. After introducing our model of confounding precisely, we show in Proposition 1 how the evaluation policy value can be expressed using standard importance sampling weighting over steps prior to the confounded time step $t^\star$, along with likelihood ratios over potential outcomes that relate the potential outcomes of observed (factual) actions to those of counterfactual actions not taken. These likelihood ratios over potential outcomes are unobserved, but a lower bound on the evaluation policy value can be computed by minimizing over all feasible likelihood ratios that satisfy our confounding model assumptions. Towards computational tractability, we derive a dual relaxation that can be represented as a loss minimization procedure. All proofs of results in this section are in the appendix.

We now define the confounding model for when there is an unobserved confounding variable $U$ that only affects the behavioral policy's action at a single time period $t^\star \in [T]$. For example, in looking at the impact of confounders on antibiotics in sepsis management (Section 1.1), it is plausible to assume that after the decision made when the patient arrives, unobserved confounders no longer affect later treatment decisions.

Assumption D. For all $t \neq t^\star$, conditional on the history $H_t$ generated by the behavioral policy, the action $A_t \sim \pi_t(\cdot \mid H_t)$ is independent of the potential outcomes $R_t(a_{1:t}), S_{t+1}(a_{1:t}), R_{t+1}(a_{1:t+1}), S_{t+2}(a_{1:t+1}), \ldots, S_T(a_{1:T-1}), R_T(a_{1:T})$ for all $a_{1:T} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_T$. For $t = t^\star$, there exists a random variable $U$ such that the same conditional independence holds only when conditioning on both the history $H_{t^\star}$ and $U$.

We assume the unobserved confounder has bounded influence on the behavioral policy's choice of action $A_{t^\star}$:

Assumption E. There is a $\Gamma \ge 1$ satisfying

$$\frac{\pi_{t^\star}(a_{t^\star} \mid H_{t^\star}, U = u)}{\pi_{t^\star}(a'_{t^\star} \mid H_{t^\star}, U = u)} \cdot \frac{\pi_{t^\star}(a'_{t^\star} \mid H_{t^\star}, U = u')}{\pi_{t^\star}(a_{t^\star} \mid H_{t^\star}, U = u')} \le \Gamma \qquad (3)$$

for any $a_{t^\star}, a'_{t^\star} \in \mathcal{A}_{t^\star}$, almost surely over $H_{t^\star}$, and any $u, u'$.

Selecting the amount of unobserved confounding $\Gamma$ is a modeling task, and the above confounding model's simplicity and interpretability make it advantageous for enabling modelers to choose a plausible value of $\Gamma$. As in any applied modeling problem, the amount of unobserved confounding $\Gamma$ should be chosen with expert knowledge (e.g., by consulting the doctors who make behavioral decisions). In Section 6, we give various application contexts in which a realistic range of $\Gamma$ can be posited. One of the most interpretable ways to assess the level of robustness to confounding is via the design sensitivity of the analysis (Rosenbaum, 2010): the value of $\Gamma$ at which the bounds on the evaluation policy's value cross a landmark threshold (e.g., the performance of the behavioral policy or some known safety threshold).

Under Assumption E, the likelihood ratio between the observed and unobserved distributions at $t^\star$ can vary by at most a factor of $\Gamma$. Recall that $W(a_{1:T})$ is the tuple of all potential outcomes associated with the actions $a_{1:T}$. The following observation is due to Yadlowsky et al. (2018, Lemma 2.1).

Lemma 2. Under Assumptions D, E, for all $a_{t^\star} \neq a'_{t^\star}$, the likelihood ratio over $\{W(a_{1:T})\}_{a_{1:T}}$,

$$L(\cdot\,; H_{t^\star}, a_{t^\star}, a'_{t^\star}) := \frac{dP_W(\cdot \mid H_{t^\star}, A_{t^\star} = a'_{t^\star})}{dP_W(\cdot \mid H_{t^\star}, A_{t^\star} = a_{t^\star})},$$

exists, and for $P_W(\cdot \mid H_{t^\star}, A_{t^\star} = a_{t^\star})$-almost all $w, w'$,

$$L(w; H_{t^\star}, a_{t^\star}, a'_{t^\star}) \le \Gamma\, L(w'; H_{t^\star}, a_{t^\star}, a'_{t^\star}). \qquad (4)$$


We let $L(\cdot\,; H_{t^\star}, a_{t^\star}, a_{t^\star}) \equiv 1$. Using these (unknown) likelihood ratios, we have the following representation of $\mathbb{E}[Y(\bar A_{1:T})]$ under confounding.

Proposition 1. Under Assumptions A, D, E,

$$\mathbb{E}[Y(\bar A_{1:T})] = \mathbb{E}\Bigg[ \prod_{t=1}^{t^\star - 1} \rho_t \sum_{a_{t^\star}, a'_{t^\star}} \bar\pi_{t^\star}(a_{t^\star} \mid \bar H_{t^\star}(A_{1:t^\star-1}))\, \pi_{t^\star}(a'_{t^\star} \mid H_{t^\star}) \times \mathbb{E}\Big[ L(W; H_{t^\star}, a_{t^\star}, a'_{t^\star})\, Y_{t^\star}(a_{t^\star}) \prod_{t=t^\star+1}^{T} \rho_t \;\Big|\; H_{t^\star}, A_{t^\star} = a_{t^\star} \Big] \Bigg],$$

using the shorthand $Y_{t^\star}(a_{t^\star}) := Y(A_{1:t^\star-1}, a_{t^\star}, A_{t^\star+1:T})$.

Proposition 1 implies a natural bound on the value $\mathbb{E}[Y(\bar A_{1:T})]$ under bounded unobserved confounding. Since the likelihood ratios $L(\cdot\,; \cdot, a_{t^\star}, a'_{t^\star})$ are fundamentally unobservable due to their counterfactual nature, we take a worst-case approach over all likelihood ratios that satisfy condition (4), and derive a bound that depends only on observable distributions. Towards this goal, define the set

$$\mathcal{L} := \Big\{ L : \mathcal{W} \times \mathcal{H}_{t^\star} \to \mathbb{R}_+ \;\Big|\; L(w; H_{t^\star}) \le \Gamma\, L(w'; H_{t^\star}) \text{ a.s. for all } w, w', \text{ and } \mathbb{E}[L(W; H_{t^\star}) \mid H_{t^\star}, A_{t^\star} = a_{t^\star}] = 1 \Big\}. \qquad (5)$$

Taking the infimum over the inner expectation in the expression derived in Proposition 1, and noting that it does not depend on $a'_{t^\star}$, define

$$\eta^\star(H_{t^\star}; a_{t^\star}) := \inf_{L \in \mathcal{L}} \mathbb{E}\left[ L(W; H_{t^\star})\, Y_{t^\star}(a_{t^\star}) \prod_{t=t^\star+1}^{T} \rho_t \;\middle|\; H_{t^\star}, A_{t^\star} = a_{t^\star} \right].$$

Since $\eta^\star(\cdot\,; \cdot)$ is difficult to compute, we use functional convex duality to derive a dual relaxation that can be computed by solving a loss minimization problem over any well-specified model class. This allows us to compute a meaningful lower bound on $\mathbb{E}[Y(\bar A_{1:T})]$ even when rewards and states are continuous, by simply fitting a model using standard supervised learning methods. For $(s)_+ = \max(s, 0)$ and $(s)_- = -\min(s, 0)$, define the weighted squared loss $\ell_\Gamma(z) := \Gamma (z)_-^2 + (z)_+^2$.

Theorem 2. Under Assumptions A, D, E, $\eta^\star(H_{t^\star}; a_{t^\star})$ is lower bounded a.s. by the unique solution $\kappa^\star(H_{t^\star}; a_{t^\star})$ to

$$\min_{f(H_{t^\star})} \; \mathbb{E}\left[ \frac{\mathbf{1}\{A_{t^\star} = a_{t^\star}\}}{\pi_{t^\star}(a_{t^\star} \mid H_{t^\star})}\, \ell_\Gamma\!\left( Y_{t^\star}(a_{t^\star}) \prod_{t=t^\star+1}^{T} \rho_t - f(H_{t^\star}) \right) \right].$$
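Since $\ell_\Gamma$ is a convex, asymmetrically weighted squared loss, the minimization in Theorem 2 is an ordinary weighted regression problem. The sketch below (our own, with assumed input conventions) fits a linear model $f_\theta$ to the empirical version of this objective:

```python
import numpy as np
from scipy.optimize import minimize

def loss_gamma(z, gamma):
    """Weighted squared loss ell_Gamma(z) = Gamma * (z)_-^2 + (z)_+^2."""
    return gamma * np.minimum(z, 0.0) ** 2 + np.maximum(z, 0.0) ** 2

def fit_kappa(X, Z, is_target_action, prop, gamma):
    """Empirical version of the Theorem 2 objective for a linear model
    f_theta(H) = X @ theta.

    X: (n, d) features of H_{t*};  Z: (n,) values of Y_{t*}(a_{t*}) times the
    post-t* importance ratios;  is_target_action: (n,) indicator of A_{t*}=a_{t*};
    prop: (n,) estimated behavior propensities pi_{t*}(a_{t*} | H_{t*})."""
    w = is_target_action / prop

    def objective(theta):
        return np.mean(w * loss_gamma(Z - X @ theta, gamma))

    res = minimize(objective, np.zeros(X.shape[1]), method="BFGS")
    return res.x   # kappa_hat(H_{t*}; a_{t*}) is then X @ res.x
```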

From Theorem 2 and Proposition 1, our final lower bound on $\mathbb{E}[Y(\bar A_{1:T})]$ is given by

$$\mathbb{E}\Bigg[ \prod_{t=1}^{t^\star - 1} \rho_t \sum_{a_{t^\star}} \bar\pi_{t^\star}(a_{t^\star} \mid \bar H_{t^\star}(A_{1:t^\star-1})) \times (1 - \pi_{t^\star}(a_{t^\star} \mid H_{t^\star}))\, \kappa^\star(H_{t^\star}; a_{t^\star}) \Bigg] + \mathbb{E}\Bigg[ \pi_{t^\star}(A_{t^\star} \mid H_{t^\star})\, Y(A_{1:T}) \prod_{t=1}^{T} \rho_t \Bigg]. \qquad (6)$$

Note that this results in a loss minimization problem for each possible action at $t^\star$, with the fitted function evaluated at each observed history $H_{t^\star}$ in the dataset generated from the behavioral policy. If confounding occurs very late in a decision process, the space of histories can be very large, and this may incur a significant computational cost. However, if confounding occurs early in the process, the space of possible histories is small and the computation is very tractable. This is the scenario for the domains we consider in our experiments.

Consistency. We now show that an empirical approximation to our loss minimization problem yields a consistent estimate of $\kappa^\star(\cdot)$. We require the following standard overlap assumption, which states that the behavioral policy has a uniformly positive probability of playing any action.

Assumption F. There exists $C < \infty$ such that for all $t$ and $a_t$, $\bar\pi_t(a_t \mid H_t)/\pi_t(a_t \mid H_t) \le C$ almost surely.

Since it is not feasible to optimize over the class of all functions $f(H_{t^\star})$, we consider a parameterization $f_\theta(H_{t^\star})$ where $\theta \in \mathbb{R}^d$. We provide provable guarantees in the simplified setting where $\theta \mapsto f_\theta$ is linear, so that the loss minimization problem is convex. That is, we assume that $f_\theta$ is represented by a finite linear combination of arbitrary basis functions of $H_{t^\star}$. As long as the parameterization is well-specified, so that $\kappa^\star(H_{t^\star}; a_{t^\star}) = f_{\theta^\star}(H_{t^\star})$ for some $\theta^\star \in \Theta$, an empirical plug-in solution converges to $\kappa^\star$ as the number of samples $n$ grows to infinity. We let $\Theta \subseteq \mathbb{R}^d$ be our model space; our theorem allows $\Theta = \mathbb{R}^d$.

In the result below, let $\hat\pi_t(a_t \mid H_t)$ be a consistent estimator of $\pi_t(a_t \mid H_t)$ trained on a separate dataset $\mathcal{D}_n$ with the same underlying distribution; such estimators can be trained using sample splitting and standard supervised learning methods. Define the set $S_\epsilon$ of $\epsilon$-approximate optimizers of the empirical plug-in problem

$$\min_{f(H_{t^\star})} \; \hat{\mathbb{E}}_n\left[ \frac{\mathbf{1}\{A_{t^\star} = a_{t^\star}\}}{\hat\pi_{t^\star}(a_{t^\star} \mid H_{t^\star})}\, \ell_\Gamma\!\left( Y_{t^\star}(a_{t^\star}) \prod_{t=t^\star+1}^{T} \hat\rho_t - f(H_{t^\star}) \right) \right],$$

where $\hat{\mathbb{E}}_n$ is the empirical distribution over data statistically independent from $\mathcal{D}_n$, and $\hat\rho_t := \bar\pi_t(A_t \mid \bar H_t(A_{1:t-1})) / \hat\pi_t(A_t \mid H_t)$.
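Putting the pieces together, a plug-in estimate of the lower bound (6) combines the pre-$t^\star$ importance ratios, the fitted regression from Theorem 2, and the estimated propensities at $t^\star$. The following is a sketch under assumed array conventions, not the paper's implementation:

```python
import numpy as np

def lower_bound_estimate(rho, pi_eval_tstar, pi_hat_tstar, kappa_hat, Y, a_obs, t_star):
    """Plug-in version of the lower bound (6).

    rho: (n, T) estimated per-step ratios rho_t;  pi_eval_tstar[i, a] and
    pi_hat_tstar[i, a]: evaluation policy and estimated behavior propensity
    for action a at step t_star (0-indexed);  kappa_hat[i, a]: fitted value
    of the Theorem 2 regression;  Y: (n,) observed returns;  a_obs: (n,)
    indices of the actions actually taken at t_star."""
    n = Y.shape[0]
    pre = np.prod(rho[:, :t_star], axis=1)    # prod over steps before t_star
    full = np.prod(rho, axis=1)               # prod over all steps

    # Counterfactual actions at t_star, handled worst-case through kappa_hat.
    term1 = pre * np.sum(pi_eval_tstar * (1.0 - pi_hat_tstar) * kappa_hat, axis=1)
    # The factual action at t_star, handled by ordinary importance sampling.
    term2 = pi_hat_tstar[np.arange(n), a_obs] * Y * full
    return float(np.mean(term1 + term2))
```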

We assume that we observe independent and identically distributed trajectories and, formally, that the observed cumulative reward is the potential outcome evaluated at the observed action sequence, $Y_{\mathrm{obs}} = Y(A_{1:T})$, so that trajectories (units) do not affect one another; together, these imply the stable unit treatment value assumption (SUTVA) (Rubin, 1980) from the statistics literature.


Figure 1. Sepsis simulation. Data generation process with level of confounding $\Gamma^\star = 2.0$. Estimated outcome $\mathbb{E}[Y(\bar A_{1:T})]$ of the Without Antibiotics, With Antibiotics, and Optimal Policy treatments with OPE, along with the true value. Black lines show the estimated upper and lower bounds on the outcome using our approach and red lines correspond to the naive approach, both with $\Gamma = 2.0$. Dashed lines represent 95% quantiles. [Plot omitted.]

Figure 2. Sepsis simulator design sensitivity. Data generation process with level of confounding $\Gamma^\star = 5$. Estimated lower and upper bounds on the outcome $\mathbb{E}[Y(\bar A_{1:T})]$ of two policies (with and without antibiotics) as a function of the level of confounding $\Gamma$, under (a) our approach, with design sensitivity 5.8, and (b) the naive approach, with design sensitivity 1.8. [Plots omitted.]

Theorem 3. Let Assumptions A, D, E, F hold, and let $\theta \mapsto f_\theta$ be linear with $f_{\theta^\star}(\cdot) = \kappa^\star(\cdot\,; a_{t^\star})$ for some unique $\theta^\star \in \mathbb{R}^d$. Let $\mathbb{E}|Y_{t^\star}(a_{t^\star})|^4 < \infty$ and $\mathbb{E}[|f_\theta(H_{t^\star})|^4] < \infty$ for all $\theta \in \Theta$. If for all $t$, $\hat\pi_t(\cdot \mid \cdot) \to \pi_t(\cdot \mid \cdot)$ pointwise a.s., $\bar\pi_t(\cdot \mid \cdot)/\hat\pi_t(\cdot \mid \cdot) \le 2C$, and there exists $c$ such that $0 < c \le \hat\pi_{t^\star}(a_{t^\star} \mid H_{t^\star}) \le 1$ a.s., then $\mathrm{dist}(\theta^\star, S_{\varepsilon_n}) \stackrel{p}{\to} 0$ as $n \to \infty$ for any $\varepsilon_n \downarrow 0$.

Hence, a plug-in estimator of the lower bound (6) is consistent as $n \to \infty$ under the hypotheses of Theorem 3.

6. Experiments

We provide a number of examples of how our method could be applied in real off-policy evaluation settings where confounding is primarily an issue in a single decision within the sequence. After introducing these settings and why our model for confounding might fit them, we demonstrate using the method to certify the reliability of (or raise concerns about the unreliability of) OPE in these settings. Because the gold-standard real counterfactual outcomes are only known in simulations, we focus on simulation examples motivated by the real OPE applications. First, we introduce the real-world setting, the corresponding simulators, and how we introduce confounding in each simulator to model the realistic source of confounding that might exist in off-policy data in these settings. Then, we use these examples to demonstrate that our approach can be fairly tight in some cases, meaning that our bounds are close to the true evaluation policy performance after introducing confounding in our simulations and applying our method. We also demonstrate that our method allows us to certify robustness to confounding with much larger values of $\Gamma$ than the naive approach.

Managing sepsis patients. To simulate data as in the example in Section 1.1, we used the sepsis simulator developed by Oberst and Sontag (2019). To simulate the unrecorded comorbidities that introduce confounding, we extract some of the randomness that goes into choosing the state transitions into a confounding variable, so that the confounding variable is correlated with better state transitions in the simulation. In the first time step, we take the optimal action with respect to all other drugs, and select antibiotics with probability $\sqrt{\Gamma}/(1 + \sqrt{\Gamma})$ if the confounding variable is large and with probability $1/(1 + \sqrt{\Gamma})$ if the confounding variable is small, satisfying Assumption E. We assume that the care team acts nearly optimally, except for some randomness due to the challenges of the ICU, guaranteeing overlap (Assumption F) with respect to the optimal evaluation policy. In all but the first time step, we implemented the behavior policy to take the optimal next treatment action with probability 0.85, and otherwise switch the vasopressor status, independent of the confounders, satisfying Assumption D.
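For concreteness, the confounded first-step antibiotic decision can be generated as in the sketch below (our own illustration; the simulator interface is not shown), which makes the odds ratio between the two values of the latent comorbidity variable exactly $\Gamma$, so Assumption E holds with equality:

```python
import numpy as np

def first_step_antibiotics(u_is_large, gamma_conf, rng):
    """Prescribe antibiotics with probability sqrt(Gamma)/(1+sqrt(Gamma)) when
    the latent comorbidity variable is 'large' and 1/(1+sqrt(Gamma)) otherwise;
    the resulting odds ratio across the two latent states equals Gamma."""
    sqrt_g = np.sqrt(gamma_conf)
    p = sqrt_g / (1 + sqrt_g) if u_is_large else 1 / (1 + sqrt_g)
    return bool(rng.random() < p)
```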

We imagine that, using existing medical knowledge, an automated policy implementing the optimal treatment policy is deployed, and we would like to evaluate its benefit relative to the current standard of care. We learn the optimal policy with respect to this simulation online (without confounding) using policy iteration, as done in Oberst and Sontag (2019).

Communication interventions for minimally verbal children with autism. Minimally verbal children represent 25–30% of children with autism, and often have a poor prognosis in terms of social functioning (Rutter et al., 1967; Anderson et al., 2009). See Kasari et al. (2014) for more background on the challenges of treating these patients.

We compare the number of speech utterances by such children under an adaptive policy that starts with behavioral language interventions (BLI) for 12 weeks and then augments BLI with an augmented or alternative communication (AAC) approach, against a non-adaptive policy that uses AAC throughout the whole treatment. Kasari et al. (2014) note that there are very few randomized trials of these interventions, and the number of individuals in these trials tends to be small.


Figure 3. Autism simulation. Outcome $\mathbb{E}[Y(\bar A_{1:T})]$ over 36 weeks of two different policies, the confounded adaptive policy (BLI+AAC) and the unconfounded non-adaptive policy (AAC). Data generation process with level of confounding $\Gamma^\star = 2.0$. (a) Case I: effect size 0.3, with our bounds at $\Gamma = 2.0$ and $\Gamma = 3.0$. (b) Case II: effect size 0.8, with our bounds at $\Gamma = 2.0$ and $\Gamma = 3.7$. Each panel also shows AAC, standard OPE, the naive bounds, and the true value for BLI+AAC. [Plots omitted.]

Figure 4. Autism simulation design sensitivity (effect size 1.1). Data generation process with level of confounding $\Gamma^\star = 1.0$. True value of the adaptive (BLI+AAC) and non-adaptive (AAC) policies, along with the estimated lower bound on the outcome $\mathbb{E}[Y(\bar A_{1:T})]$ using our approach and the naive approach, as a function of the level of confounding $\Gamma$. [Plot omitted.]

It is natural in such cases to consider using existing off-policy data to evaluate this intervention protocol.

At the beginning of the treatment, children are assigned to BLI or AAC pseudo-randomly due to the availability of AAC devices. However, at the follow-up visit after 12 weeks, a clinician may decide to use AAC devices for some children who started with BLI. Since this intervention requires a specialized device, it is likely that the clinicians working with the children only give the devices to those for whom the device is most effective. Assessing the effectiveness of the intervention is likely based on the clinician's interactions with the patients, not information encoded in the reported covariates, which contain partial, noisy information about the outcome. Therefore, while there is confounding, it affects only the second decision, so Assumption D is plausible.

The simulation for comparing developmental interventions for autistic children comes from Lu et al. (2016), based on modeling the data from Kasari et al. (2014). Lu et al. (2016) provide plausible ranges for the parameters of the simulation, based on the observed results of the SMART trial and realistic effect sizes. We create the aforementioned confounding variable in our experiments by making the variable in the simulation that corresponds to the effectiveness of the intervention a randomly selected value that is unobserved.

We introduce confounding by simulating the decision in the second time step based on this latent variable, in accordance with the model in Assumptions D and E.

Results. All implementation and model details can be found in the appendix. We compare three different approaches: applying standard OPE methods that assume sequential ignorability holds, computing lower and upper bounds on the evaluation policy performance using the naive bound provided in the appendix, and computing these bounds using our proposed loss minimization approach.

Sepsis simulator. We evaluate three different policies: 1. Without antibiotics (WO), which does not administer antibiotics at the first time step; 2. With antibiotics (W), which administers antibiotics at the first time step; and 3. the optimal policy learned by policy iteration. Figure 1 shows the outcomes of these policies estimated on data generated with $\Gamma^\star = 2.0$. Confounding leads standard OPE methods to underestimate the outcome of the WO policy and overestimate the outcomes of the optimal and W policies, which makes the W and optimal policies look much better than WO. The naive bound cannot guarantee the superiority of the W and optimal policies over WO with $\Gamma = 2.0$; however, our proposed method shows that the lower bounds on W and optimal do not cross the upper bound on WO, certifying the robustness of the benefit of immediately administering antibiotics.

Figure 2 compares the design sensitivity of our method versus the naive approach. We generated the data with $\Gamma^\star = 5.0$. Figures 2(a,b) show that using our method (respectively, the naive method), the lower bound on the W policy meets the upper bound of the WO policy at $\Gamma = 5.8$ (respectively, $\Gamma = 1.8$). This indicates the improved robustness of our algorithm to conservative choices of $\Gamma$.

Autism simulator. We consider two different cases. Case I, effect size 0.3 (Figure 3): the adaptive policy (BLI+AAC) has a lower true outcome than the non-adaptive policy (AAC). We injected confounding at level $\Gamma^\star = 2.0$ in this simulation, which makes the standard OPE approach overestimate the outcome of the adaptive policy. Using our method to compute a lower bound on the adaptive policy, Figure 3(a) shows that we cannot guarantee this superiority with the amount of confounding $\Gamma = 2.0$. Case II, effect size 0.8: the adaptive policy has a higher outcome than the non-adaptive policy with this effect size, and with the amount of confounding $\Gamma^\star = 2.0$, standard OPE methods overestimate this value. Figure 3(b) shows that, unlike the naive method, our method guarantees the superiority of the adaptive policy for $\Gamma \in [2.0, 3.7]$. Figure 4 shows the design sensitivity of our method. In this example, the effect size is 1.1 and the generated data is unconfounded. The lower bound computed by the naive method yields a design sensitivity of $\Gamma = 1.28$, while our method is robust to more conservative choices up to $\Gamma = 2.28$.


References

D. K. Anderson, R. S. Oti, C. Lord, and K. Welch. Patterns of growth in adaptive social abilities among children with autism spectrum disorders. Journal of Abnormal Child Psychology, 37(7):1019–1034, Oct 2009. ISSN 1573-2835. doi: 10.1007/s10802-009-9326-0. URL https://doi.org/10.1007/s10802-009-9326-0.

A. J. Brent. Meta-analysis of time to antimicrobial therapy in sepsis: Confounding as well as bias. Critical Care Medicine, 45(2), 2017.

B. A. Brumback, M. A. Hernan, S. J. P. A. Haneuse, and J. M. Robins. Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures. Statistics in Medicine, 23(5):749–767, 2004.

J. Cornfield, W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder. Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1):173–203, 1959.

J. Futoma, A. Lin, M. Sendak, A. Bedoya, M. Clement, C. O'Brien, and K. Heller. Learning to treat sepsis with multi-output Gaussian process deep recurrent Q-networks, 2018. URL https://openreview.net/forum?id=SyxCqGbRZ.

O. Gottesman, F. Johansson, M. Komorowski, A. Faisal, D. Sontag, F. Doshi-Velez, and L. A. Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019a.

O. Gottesman, Y. Liu, S. Sussex, E. Brunskill, and F. Doshi-Velez. Combining parametric and nonparametric models for off-policy evaluation. In International Conference on Machine Learning, pages 2366–2375, 2019b.

J. Hanna, S. Niekum, and P. Stone. Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning (ICML), June 2019.

J. P. Hanna, P. Stone, and S. Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

M. Hernán and J. Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.

M. D. Howell and A. M. Davis. Management of sepsis and septic shock. JAMA, 317(8):847–848, 02 2017. ISSN 0098-7484. doi: 10.1001/jama.2017.0131. URL https://doi.org/10.1001/jama.2017.0131.

T.-C. Hu, F. Moricz, and R. Taylor. Strong laws of large numbers for arrays of rowwise independent random variables. Acta Mathematica Hungarica, 54(1-2):153–162, 1989.

G. W. Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132, 2003.

N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

N. Kallus and M. Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019.

N. Kallus and A. Zhou. Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pages 9269–9279, 2018.

N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under unobserved confounding. arXiv preprint arXiv:1810.02894, 2018.

C. Kasari, A. Kaiser, K. Goods, J. Nietfeld, P. Mathy, R. Landa, S. Murphy, and D. Almirall. Communication interventions for minimally verbal children with autism: A sequential multiple assignment randomized trial. Journal of the American Academy of Child & Adolescent Psychiatry, 53(6):635–646, 2014.

A. J. King and R. J. Wets. Epi-consistency of convex stochastic programs. Stochastics and Stochastic Reports, 34(1-2):83–92, 1991.

M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720, 2018a.

M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720, 2018b.

H. M. Le, C. Voloshin, and Y. Yue. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.

Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2644–2653, 2018.


X. Lu, I. Nahum-Shani, C. Kasari, K. G. Lynch, D. W. Oslin, W. E. Pelham, G. Fabiano, and D. Almirall. Comparing dynamic treatment regimes using repeated-measures outcomes: modeling considerations in SMART studies. Statistics in Medicine, 35(10):1595–1615, 2016.

D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.

C. F. Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.

S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.

S. A. Murphy, M. J. van der Laan, and J. M. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

X. Nie, E. Brunskill, and S. Wager. Learning when-to-treat policies. arXiv preprint arXiv:1905.09751, 2019.

M. Oberst and D. Sontag. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4881–4890, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/oberst19a.html.

J. Pearl. Causality. Cambridge University Press, 2009.

A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi. Continuous state-space models for optimal sepsis treatment - a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.

A. Rhodes, L. E. Evans, W. Alhazzani, et al. Surviving sepsis campaign: International guidelines for management of sepsis and septic shock: 2016. Intensive Care Medicine, 43(3):304–377, 2017.

J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.

J. M. Robins. Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality, pages 69–117. Springer, 1997.

J. M. Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer, 2004.

J. M. Robins, A. Rotnitzky, and D. O. Scharfstein. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In M. E. Halloran and D. Berry, editors, Statistical Models in Epidemiology, the Environment, and Clinical Trials, pages 1–94, New York, NY, 2000. Springer New York. ISBN 978-1-4612-1284-3.

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, New York, 1998.

P. R. Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.

P. R. Rosenbaum. Design of Observational Studies, volume 10. Springer, 2010.

P. R. Rosenbaum and D. B. Rubin. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218, 1983.

D. B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.

M. Rutter, D. Greenfeld, and L. Lockyer. A five to fifteen year follow-up study of infantile psychosis: II. Social and behavioural outcome. British Journal of Psychiatry, 113(504):1183–1199, 1967. doi: 10.1192/bjp.113.504.1183.

C. W. Seymour, F. Gesten, H. C. Prescott, M. E. Friedrich, T. J. Iwashyna, G. S. Phillips, S. Lemeshow, T. Osborn, K. M. Terry, and M. M. Levy. Time to treatment and mortality during mandated emergency care for sepsis. New England Journal of Medicine, 376(23):2235–2244, 2017. doi: 10.1056/NEJMoa1703058. URL https://doi.org/10.1056/NEJMoa1703058. PMID: 28528569.

S. A. Sterling, W. R. Miller, J. Pryor, M. A. Puskarich, and A. E. Jones. The impact of timing of antibiotics on outcomes in severe sepsis and septic shock: a systematic review and meta-analysis. Critical Care Medicine, 43(9):1907, 2015.

P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.


P. S. Thomas, B. C. da Silva, A. G. Barto, S. Giguere, Y. Brun, and E. Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, 2019.

S. Yadlowsky, H. Namkoong, S. Basu, J. Duchi, and L. Tian. Bounds on the conditional and average treatment effect in the presence of unobserved confounders. arXiv:1808.09521 [stat.ME], 2018.

J. Zhang and E. Bareinboim. Near-optimal reinforcement learning in dynamic treatment regimes. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13401–13411. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9496-near-optimal-reinforcement-learning-in-dynamic-treatment-regimes.pdf.