

Algorithmic recourse under imperfect causal knowledge: a probabilistic approach

Amir-Hossein Karimi*1,2, Julius von Kügelgen*1,3, Bernhard Schölkopf 1, Isabel Valera 1,4

1 Max Planck Institute for Intelligent Systems, Tübingen, Germany
2 Max Planck ETH Center for Learning Systems, Zürich, Switzerland
3 Department of Engineering, University of Cambridge, United Kingdom
4 Department of Computer Science, Saarland University, Saarbrücken, Germany

{amir, jvk, bs, ivalera}@tue.mpg.de

Abstract

Recent work has discussed the limitations of counterfactual explanations to recommend actions for algorithmic recourse, and argued for the need of taking causal relationships between features into consideration. Unfortunately, in practice, the true underlying structural causal model is generally unknown. In this work, we first show that it is impossible to guarantee recourse without access to the true structural equations. To address this limitation, we propose two probabilistic approaches to select optimal actions that achieve recourse with high probability given limited causal knowledge (e.g., only the causal graph). The first captures uncertainty over structural equations under additive Gaussian noise, and uses Bayesian model averaging to estimate the counterfactual distribution. The second removes any assumptions on the structural equations by instead computing the average effect of recourse actions on individuals similar to the person who seeks recourse, leading to a novel subpopulation-based interventional notion of recourse. We then derive a gradient-based procedure for selecting optimal recourse actions, and empirically show that the proposed approaches lead to more reliable recommendations under imperfect causal knowledge than non-probabilistic baselines.

1 Introduction

As machine learning algorithms are increasingly used to assist consequential decision making in a wide range of real-world settings [36, 41], providing explanations for the decisions of these black-box models becomes crucial [7, 58]. A popular approach is that of (nearest) counterfactual explanations, which refer to the closest feature instantiations that would have resulted in a changed prediction [59]. While providing some insight (explanation) into the underlying black-box classifier, such counterfactual explanations do not directly translate into actionable recommendations to individuals for obtaining a more favourable prediction [22, 5], a related task referred to as algorithmic recourse [54, 55, 19, 21]. Importantly, prior work on both counterfactual explanations and algorithmic recourse treats features as independently manipulable inputs, thus ignoring the causal relationships between features.

In this context, recent work [22] has argued for the need of taking into account the causal structure between features to find a minimal set of actions (in the form of interventions) that guarantees recourse. However, while this approach is theoretically sound, it involves computing counterfactuals in the true underlying structural causal model (SCM) [35], and thus relies on strong impractical assumptions; specifically, it requires complete knowledge of the true structural equations. While for many applications it is possible to draw a causal diagram from expert knowledge, assumptions about the form of structural equations are, in general, not testable and may thus not hold in practice [38]. As a result, counterfactuals computed using a misspecified causal model may be inaccurate and recommend actions that are sub-optimal or, even worse, ineffective to achieve recourse.

* Equal contribution

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2006.06831v3 [cs.LG] 23 Oct 2020

In this work, we focus on the problem of algorithmic recourse when only limited causal knowledge is available (as is generally the case). To this end, we propose two probabilistic approaches which allow us to relax the strong assumption of a fully-specified SCM made in [22]. In the first approach, we assume that, while the underlying SCM is unknown, it belongs to the family of additive Gaussian noise models [16, 37]. We then make use of Gaussian processes (GPs) [62] to average predictions over a whole family of SCMs and thus to obtain a distribution over counterfactual outcomes which forms the basis for individualised algorithmic recourse. The second approach considers a different subpopulation-based notion of algorithmic recourse by estimating the effect of interventions for individuals similar to the one for which we aim to achieve recourse. It thus addresses a different (rung 2) target quantity than the counterfactual/individualised (rung 3) approach, which allows us to further relax our assumptions by removing any assumptions on the form of the structural equations. This approach is based on the idea of the conditional average treatment effect (CATE) [1], and relies on conditional variational autoencoders (CVAEs) [48] to estimate the interventional distribution. In both cases, we assume that the causal graph is known or can be postulated from expert knowledge, as without such an assumption causal reasoning from observational data is not possible [38, Prop. 4.1].

In more detail, we first demonstrate as a motivating negative result that recourse guarantees are only possible if the true SCM is known (§3). Then, we introduce two probabilistic approaches for handling different levels of uncertainty in the structural equations (§4 and §5), and propose a gradient-based method to find a set of actions that achieves recourse with a given probability at minimum cost (§6). Our experiments (§7) on synthetic and semi-synthetic loan approval data show the need for probabilistic approaches to achieve algorithmic recourse in practice, as point estimates of the underlying true SCM often propose invalid recommendations or achieve recourse only at higher cost. Importantly, our results also show that subpopulation-based recourse is the right approach to adopt when assumptions such as additive noise do not hold. A user-friendly implementation of all methods that only requires specification of the causal graph and a training set is available at https://github.com/amirhk/recourse.

2 Background and related work

Causality: structural causal models, interventions, and counterfactuals. To reason formally about causal relations between features $X = \{X_1, \ldots, X_d\}$, we adopt the structural causal model (SCM) framework [35].$^{2}$ Specifically, we assume that the data-generating process of $X$ is described by an (unknown) underlying SCM $\mathcal{M}$ of the general form

$$\mathcal{M} = (S, P_U), \quad S = \{X_r := f_r(X_{\mathrm{pa}(r)}, U_r)\}_{r=1}^{d}, \quad P_U = P_{U_1} \times \ldots \times P_{U_d}, \tag{1}$$

where the structural equations $S$ are a set of assignments generating each observed variable $X_r$ as a deterministic function $f_r$ of its causal parents $X_{\mathrm{pa}(r)} \subseteq X \setminus X_r$ and an unobserved noise variable $U_r$. The assumption of mutually independent noises (i.e., a fully factorised $P_U$) entails that there is no hidden confounding and is referred to as causal sufficiency. An SCM is often illustrated by its associated causal graph $G$, which is obtained by drawing a directed edge from each node in $X_{\mathrm{pa}(r)}$ to $X_r$ for $r \in [d] := \{1, \ldots, d\}$, see Fig. 1b and 1c for an example. We assume throughout that $G$ is acyclic. In this case, $\mathcal{M}$ implies a unique observational distribution $P_X$, which factorises over $G$, defined as the push-forward of $P_U$ via $S$.$^{3}$

Importantly, the SCM framework also entails interventional distributions describing a situation in which some variables are manipulated externally. E.g., using the do-operator, an intervention which fixes $X_I$ to $\theta$ (where $I \subseteq [d]$) is denoted by $do(X_I = \theta)$. The corresponding distribution of the remaining variables $X_{-I}$ can be computed by replacing the structural equations for $X_I$ in $S$ to obtain the new set of equations $S^{do(X_I=\theta)}$. The interventional distribution $P_{X_{-I} \mid do(X_I=\theta)}$ is then given by the observational distribution implied by the manipulated SCM $(S^{do(X_I=\theta)}, P_U)$.

[Figure 1, three panels. (a) Classifier-centric view: $X_1$, $X_2$, $X_3$ as inputs to $h$. (b) $\mathcal{M} = (S, P_U)$ with $S = \{X_1 := f_1(U_1),\ X_2 := f_2(X_1, U_2),\ X_3 := f_3(X_1, X_2, U_3)\}$ and $P_U = P_{U_1} \times P_{U_2} \times P_{U_3}$. (c) Causal graph $G$ for $\mathcal{M}$ over $X_1$, $X_2$, $X_3$, and $h$.]

Figure 1: A view commonly adopted for counterfactual explanations (a) treats features as independently manipulable inputs to a given fixed and deterministic classifier $h$. In the causal approach to algorithmic recourse taken in this work, we instead view variables as causally related to each other by a structural causal model (SCM) $\mathcal{M}$ (b) with associated causal graph $G$ (c).

$^{2}$ Also known as non-parametric structural equation model with independent errors (NPSEM-IE).

$^{3}$ I.e., for $r \in [d]$, $P_{X_r \mid X_{\mathrm{pa}(r)}}(X_r \mid X_{\mathrm{pa}(r)}) := P_{U_r}(f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)}))$, where $f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)})$ denotes the pre-image of $X_r$ given $X_{\mathrm{pa}(r)}$ under $f_r$, i.e., $f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)}) := \{u \in \mathcal{U}_r : f_r(X_{\mathrm{pa}(r)}, u) = X_r\}$.

Similarly, an SCM also implies distributions over counterfactuals: statements about a world in which a hypothetical intervention was performed all else being equal. For example, given observation $x^F$ we can ask what would have happened if $X_I$ had instead taken the value $\theta$. We denote the counterfactual variable by $X(do(X_I = \theta)) \mid x^F$, whose distribution can be computed in three steps [35]:

1. Abduction: compute the posterior distribution over background variables given $x^F$, $P_{U \mid x^F}$;
2. Action: perform the intervention to obtain the new structural equations $S^{do(X_I=\theta)}$; and,
3. Prediction: $P_{X(do(X_I=\theta)) \mid x^F}$ is the distribution induced by the resulting SCM $(S^{do(X_I=\theta)}, P_{U \mid x^F})$.
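The three steps can be made concrete with a minimal sketch, assuming a hypothetical linear additive-noise SCM on the graph of Fig. 1c. The functions `f2`, `f3`, their coefficients, and the helper `counterfactual` are illustrative assumptions and are not taken from the paper or its code release. With additive noise, abduction reduces to computing residuals, so the posterior over the noise is a point mass.

```python
# Hypothetical linear ANM on the graph X1 -> X2, (X1, X2) -> X3.
# All coefficients are illustrative assumptions.

def f2(x1):
    return 0.5 * x1


def f3(x1, x2):
    return 0.25 * x1 + 0.5 * x2


def counterfactual(x_factual, intervention):
    """Structural counterfactual for X_r := f_r(parents) + U_r.

    x_factual: dict {1: x1, 2: x2, 3: x3} with the observed values.
    intervention: dict {index: value} encoding do(X_I = theta).
    """
    # 1. Abduction: with additive noise, U is recovered exactly as a residual.
    u1 = x_factual[1]
    u2 = x_factual[2] - f2(x_factual[1])
    u3 = x_factual[3] - f3(x_factual[1], x_factual[2])

    # 2. Action: intervened variables are set to constants.
    # 3. Prediction: the remaining variables are recomputed in topological
    #    order, reusing the abducted noise values.
    x1 = intervention.get(1, u1)
    x2 = intervention.get(2, f2(x1) + u2)
    x3 = intervention.get(3, f3(x1, x2) + u3)
    return {1: x1, 2: x2, 3: x3}


x_f = {1: 2.0, 2: 1.5, 3: 2.0}
x_cf = counterfactual(x_f, {1: 4.0})   # "what if X1 had been 4?"
x_same = counterfactual(x_f, {})       # no intervention reproduces x_f
```

Note that the effect of the intervention on $X_1$ propagates to both descendants, while the individual's noise values stay fixed.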

Explainable ML: “counterfactual” explanations and (causal) algorithmic recourse. Assume that we are given a binary probabilistic classifier $h : \mathcal{X} \to [0, 1]$ trained to make decisions about i.i.d. samples from the data distribution $P_X$.$^{4}$ For ease of illustration, we adopt the setting of loan approval as a running example, i.e., $h(x) \geq 0.5$ denotes that a loan is granted and $h(x) < 0.5$ that it is denied. For a given individual $x^F$ that was denied a loan, $h(x^F) < 0.5$, we aim to answer the following questions: “Why did individual $x^F$ not get the loan?” and “What would they have to change, preferably with minimal effort, to increase their chances for a future application?”.

A popular approach to this task is to find so-called (nearest) counterfactual explanations [59], where the term “counterfactual” is meant in the sense of the closest possible world with a different outcome [30]. Translating this idea to our setting, a counterfactual explanation $x^{CE}$ for an individual $x^F$ is given by a solution to the following optimisation problem:

$$x^{CE} \in \operatorname*{argmin}_{x \in \mathcal{X}} \ \mathrm{dist}(x, x^F) \quad \text{subject to} \quad h(x) \geq 0.5, \tag{2}$$

where $\mathrm{dist}(\cdot, \cdot)$ is a similarity metric on $\mathcal{X}$, and additional constraints may be added to reflect plausibility, feasibility, or diversity of the obtained counterfactual explanations [19, 20, 32, 33, 39, 44].

Importantly, while $x^{CE}$ signifies the most similar individual to $x^F$ that would receive the loan, it does not inform $x^F$ on the actions they should perform to become $x^{CE}$. To address this limitation, the recently proposed framework of algorithmic recourse focuses instead on the actions an individual can perform to achieve a more favourable outcome [54]. The emphasis is thus shifted from minimising a distance as in (2) to optimising a personalised cost function $\mathrm{cost}^F(\cdot)$ over a set of actions $\mathcal{A}^F$ which individual $x^F$ can perform. However, most prior work on both counterfactual explanations and algorithmic recourse considers features as independently manipulable inputs to the classifier $h$ (see Fig. 1a), and therefore ignores the potentially rich causal structure over $X$ (see Fig. 1c). A number of authors have argued for the need to consider causal relations between variables when generating counterfactual explanations [59, 54, 20, 33, 32]; however, the resulting counterfactuals fail to imply feasible and optimal recourse actions [22].

In the work most relevant to ours [22], the authors approach the algorithmic recourse problem from a causal perspective within the SCM framework and propose to view recourse actions $a \in \mathcal{A}^F$ as interventions of the form $do(X_I = \theta)$. For the class of invertible SCMs, such as additive noise models (ANM) [16], where the structural equations $S$ are of the form

$$S = \{X_r := f_r(X_{\mathrm{pa}(r)}) + U_r\}_{r=1}^{d} \implies u_r^F = x_r^F - f_r(x^F_{\mathrm{pa}(r)}), \quad r \in [d], \tag{3}$$

they propose to use the three steps of structural counterfactuals in [35] to assign a single counterfactual $x^{SCF}(a) := x(a) \mid x^F$ to each action $a = do(X_I = \theta) \in \mathcal{A}^F$, and solve the optimisation problem

$$a^F = \operatorname*{argmin}_{a = do(X_I=\theta) \in \mathcal{A}^F} \ \mathrm{cost}^F(a) \quad \text{subject to} \quad h(x^{SCF}(a)) \geq 0.5. \tag{4}$$

$^{4}$ Following the related literature, we consider a binary classification task by convention; most of our considerations extend to multi-class classification or regression settings as well.
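To illustrate problem (4), the following sketch enumerates single-variable interventions on a grid and keeps the cheapest one whose structural counterfactual is classified favourably. Everything concrete here is an assumption made for illustration: the two-variable ANM ($X_1 := U_1$, $X_2 := 2X_1 + U_2$), the logistic classifier `h`, the absolute-difference cost, and the grid. Variables are indexed from 0 in the code.

```python
import math


def h(x1, x2):
    """Hypothetical logistic classifier; the loan is granted if h >= 0.5."""
    return 1.0 / (1.0 + math.exp(-(x1 + x2 - 6.0)))


def f2(x1):
    """Hypothetical structural function of the ANM X2 := f2(X1) + U2."""
    return 2.0 * x1


def structural_counterfactual(x_f, target, theta):
    """Point counterfactual of do(X_target = theta) for individual x_f."""
    u1 = x_f[0]                     # abduction via residuals, as in (3)
    u2 = x_f[1] - f2(x_f[0])
    x1 = theta if target == 0 else u1
    x2 = theta if target == 1 else f2(x1) + u2
    return (x1, x2)


def best_action(x_f, grid):
    """Cheapest valid action, returned as (cost, target, theta), or None."""
    best = None
    for target in (0, 1):
        for theta in grid:
            cost = abs(theta - x_f[target])
            valid = h(*structural_counterfactual(x_f, target, theta)) >= 0.5
            if valid and (best is None or cost < best[0]):
                best = (cost, target, theta)
    return best


x_f = (1.0, 2.5)                          # denied: h(x_f) < 0.5
grid = [0.25 * i for i in range(25)]      # candidate values 0, 0.25, ..., 6
best = best_action(x_f, grid)
```

In this toy setting the cheapest action intervenes on $X_1$, because its effect also propagates to $X_2$; treating the features as independent inputs would miss this downstream effect.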

3 Negative result: no recourse guarantees for unknown structural equations

In practice, the structural counterfactual $x^{SCF}(a)$ can only be computed using an approximate (and likely imperfect) SCM $\mathcal{M} = (S, P_U)$, which is estimated from data assuming a particular form of the structural equations as in (3). However, assumptions on the form of the true structural equations $S_\star$ are generally untestable, not even with a randomised experiment, since there exist multiple SCMs which imply the same observational and interventional distributions, but entail different structural counterfactuals.

Example 1 (adapted from 6.19 in [38]). Consider the following two SCMs $\mathcal{M}_A$ and $\mathcal{M}_B$ which arise from the general form in Figure 1b by choosing $U_1, U_2 \sim \mathrm{Bernoulli}(0.5)$ and $U_3 \sim \mathrm{Uniform}(\{0, \ldots, K\})$ independently in both $\mathcal{M}_A$ and $\mathcal{M}_B$, with structural equations

$$\begin{aligned}
X_1 &:= U_1, && \text{in } \{\mathcal{M}_A, \mathcal{M}_B\}, \\
X_2 &:= X_1 (1 - U_2), && \text{in } \{\mathcal{M}_A, \mathcal{M}_B\}, \\
X_3 &:= \mathbb{I}_{X_1 \neq X_2} \left(\mathbb{I}_{U_3 > 0} X_1 + \mathbb{I}_{U_3 = 0} X_2\right) + \mathbb{I}_{X_1 = X_2} U_3, && \text{in } \mathcal{M}_A, \\
X_3 &:= \mathbb{I}_{X_1 \neq X_2} \left(\mathbb{I}_{U_3 > 0} X_1 + \mathbb{I}_{U_3 = 0} X_2\right) + \mathbb{I}_{X_1 = X_2} (K - U_3), && \text{in } \mathcal{M}_B.
\end{aligned}$$

Then $\mathcal{M}_A$ and $\mathcal{M}_B$ both imply exactly the same observational and interventional distributions, and thus are indistinguishable from empirical data. However, having observed $x^F = (1, 0, 0)$, they predict different counterfactuals had $X_1$ been 0, i.e., $x^{SCF}(X_1 = 0) = (0, 0, 0)$ and $(0, 0, K)$, respectively.$^{5}$
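Because all noise variables in Example 1 are discrete and uniformly distributed, its claims can be checked by exhaustive enumeration. The sketch below does so for the illustrative choice $K = 3$; the function names and the set of interventions checked (all interventions on $X_1$ and/or $X_2$) are our own assumptions for this illustration.

```python
from collections import Counter
from itertools import product

K = 3  # illustrative choice; any K >= 1 works


def scm(u, model, do=None):
    """Evaluate M_A or M_B of Example 1 for noise u under an intervention."""
    do = do or {}
    u1, u2, u3 = u
    x1 = do.get(1, u1)
    x2 = do.get(2, x1 * (1 - u2))
    if x1 != x2:
        x3 = x1 if u3 > 0 else x2
    else:
        x3 = u3 if model == "A" else K - u3
    return (x1, x2, do.get(3, x3))


# All noise configurations are equally likely under P_U.
NOISES = list(product((0, 1), (0, 1), range(K + 1)))


def distribution(model, do=None):
    """Outcome counts over all noise values (proportional to probabilities)."""
    return Counter(scm(u, model, do) for u in NOISES)


def counterfactuals(x_factual, model, do):
    """Abduction by filtering noise values consistent with x_factual,
    followed by action and prediction."""
    return {scm(u, model, do) for u in NOISES if scm(u, model) == x_factual}


interventions = [
    dict(zip(idx, vals))
    for idx in [(), (1,), (2,), (1, 2)]
    for vals in product((0, 1), repeat=len(idx))
]
same_distributions = all(
    distribution("A", d) == distribution("B", d) for d in interventions
)
cf_a = counterfactuals((1, 0, 0), "A", {1: 0})
cf_b = counterfactuals((1, 0, 0), "B", {1: 0})
```

The two models agree on every observational and interventional distribution checked, yet `cf_a` and `cf_b` differ in the third coordinate.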

Confirming or refuting an assumed form of $S_\star$ would thus require counterfactual data which is, by definition, never available. Thus, Example 1 proves the following proposition by contradiction.

Proposition 2 (Lack of recourse guarantees). Unless the set of descendants of intervened-upon variables is empty, algorithmic recourse can, in general, be guaranteed only if the true structural equations are known, irrespective of the amount and type of available data.

Remark 3. The converse of Proposition 2 does not hold. E.g., given $x^F = (1, 0, 1)$ in Example 1, abduction in either model yields $U_3 > 0$, so the counterfactual of $X_3$ cannot be predicted exactly.

Building on the framework in [22], we next present two novel approaches for causal algorithmic recourse under unknown structural equations. The first approach in §4 aims to estimate the counterfactual distribution under the assumption of ANMs (3) with Gaussian noise for the structural equations. The second approach in §5 makes no assumptions about the structural equations, and instead of approximating the structural equations, it considers the effect of interventions on a subpopulation similar to $x^F$. We recall that the causal graph is assumed to be known throughout.

4 Individualised algorithmic recourse via (probabilistic) counterfactuals

Since the true SCM $\mathcal{M}_\star$ is unknown, one approach to solving (4) is to learn an approximate SCM $\mathcal{M}$ within a given model class from training data $\{x^i\}_{i=1}^{n}$. For example, for an ANM (3) with zero-mean noise, the functions $f_r$ can be learned via linear or kernel (ridge) regression of $X_r$ given $X_{\mathrm{pa}(r)}$ as input. We refer to these approaches as $\mathcal{M}_{\text{LIN}}$ and $\mathcal{M}_{\text{KR}}$, respectively. $\mathcal{M}$ can then be used in place of $\mathcal{M}_\star$ to infer the noise values as in (3), and subsequently to predict a single-point counterfactual $x^{SCF}(a)$ to be used in (4). However, the learned causal model $\mathcal{M}$ may be imperfect, and thus lead to wrong counterfactuals due to, e.g., the finite sample of the observed data, or more importantly, due to model misspecification (i.e., assuming a wrong parametric form for the structural equations).

To address this limitation, we adopt a Bayesian approach to account for the uncertainty in the estimation of the structural equations. Specifically, we assume additive Gaussian noise and rely on probabilistic regression using a Gaussian process (GP) prior over the functions $f_r$ [62].

$^{5}$ This follows from abduction on $x^F = (1, 0, 0)$, which for both $\mathcal{M}_A$ and $\mathcal{M}_B$ implies $U_3 = 0$.


Definition 4 (GP-SCM). A Gaussian process SCM (GP-SCM) over $X$ refers to the model

$$X_r := f_r(X_{\mathrm{pa}(r)}) + U_r, \quad f_r \sim \mathcal{GP}(0, k_r), \quad U_r \sim \mathcal{N}(0, \sigma_r^2), \quad r \in [d], \tag{5}$$

with covariance functions $k_r : \mathcal{X}_{\mathrm{pa}(r)} \times \mathcal{X}_{\mathrm{pa}(r)} \to \mathbb{R}$, e.g., RBF kernels for continuous $X_{\mathrm{pa}(r)}$.

While GPs have previously been studied in a causal context for structure learning [13, 56], estimating treatment effects [2, 43], or learning SCMs with latent variables and measurement error [47], our goal here is to account for the uncertainty over $f_r$ in the computation of the posterior over $U_r$, and thus to obtain a counterfactual distribution, as summarised in the following propositions.

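Proposition 5 (GP-SCM noise posterior). Let $\{x^i\}_{i=1}^{n}$ be an observational sample from (5). For each $r \in [d]$ with non-empty parent set $|\mathrm{pa}(r)| > 0$, the posterior distribution of the noise vector $u_r = (u_r^1, \ldots, u_r^n)$, conditioned on $x_r = (x_r^1, \ldots, x_r^n)$ and $X_{\mathrm{pa}(r)} = (x^1_{\mathrm{pa}(r)}, \ldots, x^n_{\mathrm{pa}(r)})$, is given by

$$u_r \mid X_{\mathrm{pa}(r)}, x_r \sim \mathcal{N}\left(\sigma_r^2 (K + \sigma_r^2 I)^{-1} x_r,\ \sigma_r^2 \left(I - \sigma_r^2 (K + \sigma_r^2 I)^{-1}\right)\right), \tag{6}$$

where $K := \left(k_r(x^i_{\mathrm{pa}(r)}, x^j_{\mathrm{pa}(r)})\right)_{ij}$ denotes the Gram matrix.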

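Next, in order to compute counterfactual distributions, we rely on ancestral sampling (according to the causal graph) of the descendants of the intervention targets $X_I$ using the noise posterior of (6). The counterfactual distribution of each descendant $X_r$ is given by the following proposition.

Proposition 6 (GP-SCM counterfactual distribution). Let $\{x^i\}_{i=1}^{n}$ be an observational sample from (5). Then, for $r \in [d]$ with $|\mathrm{pa}(r)| > 0$, the counterfactual distribution over $X_r$ had $X_{\mathrm{pa}(r)}$ been $\tilde{x}_{\mathrm{pa}(r)}$ (instead of $x^F_{\mathrm{pa}(r)}$) for individual $x^F \in \{x^i\}_{i=1}^{n}$ is given by

$$X_r(X_{\mathrm{pa}(r)} = \tilde{x}_{\mathrm{pa}(r)}) \mid x^F, \{x^i\}_{i=1}^{n} \sim \mathcal{N}\left(\mu_r^F + \tilde{\mathbf{k}}^T (K + \sigma_r^2 I)^{-1} x_r,\ s_r^F + \tilde{k} - \tilde{\mathbf{k}}^T (K + \sigma_r^2 I)^{-1} \tilde{\mathbf{k}}\right), \tag{7}$$

where $\tilde{k} := k_r(\tilde{x}_{\mathrm{pa}(r)}, \tilde{x}_{\mathrm{pa}(r)})$, $\tilde{\mathbf{k}} := \left(k_r(\tilde{x}_{\mathrm{pa}(r)}, x^1_{\mathrm{pa}(r)}), \ldots, k_r(\tilde{x}_{\mathrm{pa}(r)}, x^n_{\mathrm{pa}(r)})\right)$, $x_r$ and $K$ are as defined in Proposition 5, and $\mu_r^F$ and $s_r^F$ are the posterior mean and variance of $u_r^F$ given by (6).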
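The expressions in (6) and (7) translate almost directly into linear algebra. The following is a minimal sketch for a single variable $X_r$, assuming NumPy, a unit-lengthscale RBF kernel, a known noise variance, and synthetic data; the function names, hyperparameters, and data-generating choices are assumptions for illustration and do not reproduce the authors' implementation (which, for instance, would also fit kernel hyperparameters).

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between row-stacked inputs a (n, p) and b (m, p)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_scm_counterfactual(X_pa, x_r, sigma2, i_factual, x_pa_new):
    """Mean and variance of X_r(X_pa = x_pa_new) | x^F, data, following (7).

    X_pa: (n, p) parent values; x_r: (n,) child values; sigma2: noise variance;
    i_factual: index of the factual individual within the sample.
    """
    n = len(x_r)
    A = np.linalg.inv(rbf(X_pa, X_pa) + sigma2 * np.eye(n))  # (K + s^2 I)^-1

    # Noise posterior, Eq. (6): mean vector and covariance of u_r.
    u_mean = sigma2 * A @ x_r
    u_cov = sigma2 * (np.eye(n) - sigma2 * A)
    mu_f, s_f = u_mean[i_factual], u_cov[i_factual, i_factual]

    # Counterfactual distribution, Eq. (7).
    k_vec = rbf(x_pa_new[None, :], X_pa)[0]                   # vector k-tilde
    k_new = rbf(x_pa_new[None, :], x_pa_new[None, :])[0, 0]   # scalar k-tilde
    mean = mu_f + k_vec @ A @ x_r
    var = s_f + k_new - k_vec @ A @ k_vec
    return mean, var


# Synthetic sample from a hypothetical non-linear ANM with one parent.
rng = np.random.default_rng(0)
X_pa = rng.normal(size=(30, 1))
x_r = np.sin(2.0 * X_pa[:, 0]) + 0.1 * rng.normal(size=30)

mean_same, var_same = gp_scm_counterfactual(X_pa, x_r, 0.01, 0, X_pa[0])
mean_new, var_new = gp_scm_counterfactual(X_pa, x_r, 0.01, 0, X_pa[0] + 0.5)
```

A useful sanity check is that, when the parents are left at their factual values, the mean of (7) recovers the factual observation, since the posterior means of $f_r$ and $u_r$ at a training point sum to $x_r^F$.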

All proofs can be found in Appendix A. We can now generalise the recourse problem (4) to our probabilistic setting by replacing the single-point counterfactual $x^{SCF}(a)$ with the counterfactual random variable $X^{SCF}(a) := X(a) \mid x^F$. As a consequence, it no longer makes sense to consider a hard constraint of the form $h(x^{SCF}(a)) \geq 0.5$, i.e., that the prediction needs to change. Instead, we can reason about the expected classifier output under the counterfactual distribution, leading to the following probabilistic version of the individualised recourse optimisation problem:

$$\min_{a = do(X_I=\theta) \in \mathcal{A}^F} \ \mathrm{cost}^F(a) \quad \text{subject to} \quad \mathbb{E}_{X^{SCF}(a)}\left[h\left(X^{SCF}(a)\right)\right] \geq \mathrm{thresh}(a). \tag{8}$$

Note that the threshold $\mathrm{thresh}(a)$ is allowed to depend on $a$. For example, an intuitive choice is

$$\mathrm{thresh}(a) = 0.5 + \gamma_{\text{LCB}} \sqrt{\mathrm{Var}_{X^{SCF}(a)}\left[h\left(X^{SCF}(a)\right)\right]} \tag{9}$$

which has the interpretation of the lower-confidence bound crossing the decision boundary of 0.5. Note that larger values of the hyperparameter $\gamma_{\text{LCB}}$ lead to a more conservative approach to recourse, while for $\gamma_{\text{LCB}} = 0$ merely crossing the decision boundary with ≥ 50% chance suffices.
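As a small illustration of the constraint in (8) with the threshold (9), the sketch below checks it from a handful of classifier outputs $h(x^{(m)})$ evaluated on samples of $X^{SCF}(a)$. The sample values are made up, and the use of the population (rather than sample) standard deviation is an assumption of this sketch.

```python
import statistics


def lcb_constraint(h_samples, gamma_lcb):
    """Check E[h] >= 0.5 + gamma_lcb * sqrt(Var[h]) from samples of h.

    Returns (constraint satisfied?, lower confidence bound E[h] - gamma * std).
    """
    mean = statistics.fmean(h_samples)
    std = statistics.pstdev(h_samples)  # population std; an assumption here
    thresh = 0.5 + gamma_lcb * std
    return mean >= thresh, mean - gamma_lcb * std


# Made-up classifier outputs on counterfactual samples for some action a.
h_samples = [0.6, 0.7, 0.8, 0.7, 0.7]

ok_0, lcb_0 = lcb_constraint(h_samples, gamma_lcb=0.0)  # mean only
ok_2, lcb_2 = lcb_constraint(h_samples, gamma_lcb=2.0)  # more conservative
ok_4, lcb_4 = lcb_constraint(h_samples, gamma_lcb=4.0)  # too conservative here
```

The same action can thus be accepted or rejected depending on $\gamma_{\text{LCB}}$, which is the trade-off between validity and cost explored in the experiments.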

5 Subpopulation-based algorithmic recourse via interventions and CATEs

The GP-SCM approach in §4 allows us to average over an infinite number of (non-)linear structural equations, under the assumption of additive Gaussian noise. However, this assumption may still not hold under the true SCM, leading to sub-optimal or inefficient solutions to the recourse problem. Next, we remove any assumptions about the structural equations, and propose a second approach that does not aim to approximate an individualised counterfactual distribution, but instead considers the effect of interventions on a subpopulation defined by certain shared characteristics with the given (factual) individual $x^F$. The key idea behind this approach resembles the notion of conditional average treatment effects (CATE) [1] (illustrated in Fig. 2a) and is based on the fact that any intervention $do(X_I = \theta)$ only influences the descendants $d(I)$ of the intervened-upon variables, while the non-descendants $nd(I)$ remain unaffected. Thus, when evaluating an intervention, we can condition on $X_{nd(I)} = x^F_{nd(I)}$, thus selecting a subpopulation of individuals similar to the factual subject.


[Figure 2, three panels. (a) Sketch of a decision boundary $h(x) = 0.5$ separating "loan denied" ($y = 0$) from "loan approved" ($y = 1$), with the factual individual $x^F$, the structural counterfactual $x^{SCF}$ under $\mathcal{M}_\star$, and $\text{CATE}_\star$. (b) Causal graph over the nodes A, G, E, L, D, I, S. (c) Two plots of Validity (%) and Cost (%) against $\gamma_{\text{LCB}}$ (ranging from 0 to 2.5), with one curve per compared method.]

Figure 2: (a) Illustration of point- and subpopulation-based recourse approaches. (b) Assumed causal graph for the semi-synthetic loan approval dataset. (c) Trade-off between validity and cost which can be controlled via $\gamma_{\text{LCB}}$ for the probabilistic recourse methods.

Specifically, we propose to solve the following subpopulation-based recourse optimisation problem

$$\min_{a \in \mathcal{A}^F} \ \mathrm{cost}^F(a) \quad \text{subject to} \quad \mathbb{E}_{X_{d(I)} \mid do(X_I=\theta),\, x^F_{nd(I)}}\left[h\left(x^F_{nd(I)}, \theta, X_{d(I)}\right)\right] \geq \mathrm{thresh}(a), \tag{10}$$

where, in contrast to (8), the expectation is taken over the corresponding interventional distribution.

In general, this interventional distribution does not match the conditional distribution, i.e., $P_{X_{d(I)} \mid do(X_I=\theta),\, x^F_{nd(I)}} \neq P_{X_{d(I)} \mid X_I=\theta,\, x^F_{nd(I)}}$, because some spurious correlations in the observational distribution do not transfer to the interventional setting. For example, in Fig. 1c we have that $P_{X_2 \mid do(X_1=x_1, X_3=x_3)} = P_{X_2 \mid X_1=x_1} \neq P_{X_2 \mid X_1=x_1, X_3=x_3}$. Fortunately, the interventional distribution can still be identified from the observational one, as stated in the following proposition.

Proposition 7. Subject to causal sufficiency, $P_{X_{d(I)} \mid do(X_I=\theta),\, x^F_{nd(I)}}$ is observationally identifiable:

$$p\left(X_{d(I)} \mid do(X_I = \theta), x^F_{nd(I)}\right) = \prod_{r \in d(I)} p\left(X_r \mid X_{\mathrm{pa}(r)}\right) \Big|_{X_I=\theta,\ X_{nd(I)}=x^F_{nd(I)}}. \tag{11}$$
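The difference between intervening and conditioning, and the factorisation in (11), can be checked exactly on a small discrete model. The sketch below assumes binary variables on the graph of Fig. 1c with made-up conditional probability tables; it computes $P(X_2 = 1 \mid do(X_1 = 1, X_3 = 1))$ by keeping only the kernels of non-intervened variables, and compares it with the conditional $P(X_2 = 1 \mid X_1 = 1, X_3 = 1)$.

```python
from itertools import product

# Made-up conditional probability tables for X1 -> X2, (X1, X2) -> X3.
P2 = {0: 0.2, 1: 0.7}                                       # P(X2=1 | X1)
P3 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.9}   # P(X3=1 | X1, X2)


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def kernels(x1, x2, x3):
    """Causal Markov kernels p(x_r | x_pa(r)) evaluated at (x1, x2, x3)."""
    return {1: bern(0.5, x1), 2: bern(P2[x1], x2), 3: bern(P3[(x1, x2)], x3)}


def weight(x, fixed, intervened):
    """Unnormalised probability of x when the variables in `fixed` are
    clamped; kernels of variables in `intervened` are dropped."""
    if any(x[i - 1] != v for i, v in fixed.items()):
        return 0.0
    out = 1.0
    for r, k in kernels(*x).items():
        if r not in intervened:
            out *= k
    return out


def prob_x2_one(fixed, intervened):
    states = list(product((0, 1), repeat=3))
    num = sum(weight(x, fixed, intervened) for x in states if x[1] == 1)
    den = sum(weight(x, fixed, intervened) for x in states)
    return num / den


fixed = {1: 1, 3: 1}
p_do = prob_x2_one(fixed, intervened={1, 3})    # P(X2=1 | do(X1=1, X3=1))
p_cond = prob_x2_one(fixed, intervened=set())   # P(X2=1 | X1=1, X3=1)
```

Here `p_do` coincides with the kernel $P(X_2 = 1 \mid X_1 = 1)$, whereas conditioning on the child $X_3$ shifts the probability, which is exactly the kind of observational dependence that does not transfer to the interventional setting.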

As evident from Proposition 7, tackling the optimisation problem in (10) in the general case (i.e., for arbitrary graphs and intervention sets $I$) requires estimating the stable conditionals $P_{X_r \mid X_{\mathrm{pa}(r)}}$ (a.k.a. causal Markov kernels) in order to compute the interventional expectation via (11). For convenience (see §6 for details), here we opt for latent-variable implicit density models, but other conditional density estimation approaches may also be used [e.g., 6, 8, 53]. Specifically, we model each conditional $p(x_r \mid x_{\mathrm{pa}(r)})$ with a conditional variational autoencoder (CVAE) [48] as:

$$p(x_r \mid x_{\mathrm{pa}(r)}) \approx p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}) = \int p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}, z_r)\, p(z_r)\, dz_r, \quad p(z_r) := \mathcal{N}(0, I). \tag{12}$$

To facilitate sampling $x_r$ (and in analogy to the deterministic mechanisms $f_r$ in SCMs), we opt for deterministic decoders in the form of neural nets $D_r$ parametrised by $\psi_r$, i.e., $p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}, z_r) := \delta\left(x_r - D_r(x_{\mathrm{pa}(r)}, z_r; \psi_r)\right)$, and rely on variational inference [60], amortised with approximate posteriors $q_{\phi_r}(z_r \mid x_r, x_{\mathrm{pa}(r)})$ parametrised by encoders in the form of neural nets with parameters $\phi_r$. We learn both the encoder and decoder parameters by maximising the evidence lower bound (ELBO) using stochastic gradient descent [9, 26, 27, 40]. For further details, we refer to Appendix D.

Remark 8. The collection of CVAEs can be interpreted as learning an approximate SCM of the form

$$\mathcal{M}_{\text{CVAE}} : \quad S = \{X_r := D_r(X_{\mathrm{pa}(r)}, z_r; \psi_r)\}_{r=1}^{d}, \quad z_r \sim \mathcal{N}(0, I) \quad \forall r \in [d]. \tag{13}$$

However, this family of SCMs may not allow us to identify the true SCM (provided it can be expressed as above) from data without additional assumptions. Moreover, exact posterior inference over $z_r$ given $x^F$ is intractable, and we need to resort to approximations instead. It is thus unclear whether sampling from $q_{\phi_r}(z_r \mid x^F_r, x^F_{\mathrm{pa}(r)})$ instead of from $p(z_r)$ in (12) can be interpreted as a counterfactual within (13). For further discussion on such “pseudo-counterfactuals” we refer to Appendix C.


6 Solving the probabilistic-recourse optimisation problems

We now discuss how to solve the resulting optimisation problems in (8) and (10). First, note that the two problems differ only in the distribution over which the expectation in the constraint is taken: in (8) this is the counterfactual distribution of the descendants given in Proposition 6; and in (10) it is the interventional distribution identified in Proposition 7. In either case, computing the expectation for an arbitrary classifier $h$ is intractable. Here, we approximate these integrals via Monte Carlo by sampling $x^{(m)}_{d(I)}$ from the interventional or counterfactual distributions resulting from $a = do(X_I = \theta)$, i.e.,

$$\mathbb{E}_{X_{d(I)} \mid \theta}\left[h\left(x^F_{nd(I)}, \theta, X_{d(I)}\right)\right] \approx \frac{1}{M} \sum_{m=1}^{M} h\left(x^F_{nd(I)}, \theta, x^{(m)}_{d(I)}\right).$$

Brute-force approach. A way to solve (8) and (10) is to (i) iterate over $a \in \mathcal{A}^F$, with $\mathcal{A}^F$ being a finite set of feasible actions (possibly as a result of discretising in the case of a continuous search space); (ii) approximately evaluate the constraint via Monte Carlo; and (iii) select a minimum cost action amongst all evaluated candidates satisfying the constraint. However, this may be computationally prohibitive and yield suboptimal interventions due to discretisation.
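Steps (i) to (iii) can be sketched as follows for a single intervention target. The sampler `sample_x2` is a stand-in for either the counterfactual or the interventional distribution of the descendant; it, the logistic classifier `h`, the absolute-difference cost, and the grid are all assumptions made for this illustration.

```python
import math
import random
import statistics


def h(x1, x2):
    """Hypothetical logistic classifier."""
    return 1.0 / (1.0 + math.exp(-(x1 + x2 - 3.3)))


def sample_x2(theta, rng):
    """Hypothetical sampler for the descendant X2 under do(X1 = theta)."""
    return 0.5 * theta + rng.gauss(0.0, 0.5)


def constraint_ok(theta, gamma_lcb, rng, n_samples=500):
    """(ii) Monte Carlo estimate of E[h] compared against thresh(a) from (9)."""
    hs = [h(theta, sample_x2(theta, rng)) for _ in range(n_samples)]
    return statistics.fmean(hs) >= 0.5 + gamma_lcb * statistics.pstdev(hs)


def brute_force(x1_factual, grid, gamma_lcb=0.0, seed=0):
    """(i) iterate over candidate values, (iii) return the cheapest valid one
    as (cost, theta), or None if no candidate satisfies the constraint."""
    rng = random.Random(seed)
    feasible = [
        (abs(theta - x1_factual), theta)
        for theta in grid
        if constraint_ok(theta, gamma_lcb, rng)
    ]
    return min(feasible) if feasible else None


grid = [0.5 * i for i in range(11)]   # candidate values 0, 0.5, ..., 5
best = brute_force(1.0, grid)
```

With this grid, the selected value overshoots the continuous optimum (where the expected classifier output is exactly 0.5), which illustrates the suboptimality due to discretisation mentioned above.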

Gradient-based approach. Recall that, for actions of the form $a = do(X_I = \theta)$, we need to optimise over both the intervention targets $I$ and the intervention values $\theta$. Selecting targets is a hard combinatorial optimisation problem, as there are $2^{d'}$ possible choices for $d' \leq d$ actionable features, with a potentially infinite number of intervention values. We therefore consider different choices of targets $I$ in parallel, and propose a gradient-based approach suitable for differentiable classifiers to efficiently find an optimal $\theta$ for a given intervention set $I$.$^{6}$ In particular, we first rewrite the constrained optimisation problem in unconstrained form with Lagrangian [23, 28]:

$$\mathcal{L}(\theta, \lambda) := \mathrm{cost}^F(a) + \lambda \left(\mathrm{thresh}(a) - \mathbb{E}_{X_{d(I)} \mid \theta}\left[h\left(x^F_{nd(I)}, \theta, X_{d(I)}\right)\right]\right). \tag{14}$$

We then solve the saddle point problem $\min_\theta \max_\lambda \mathcal{L}(\theta, \lambda)$ arising from (14) with stochastic gradient descent [9, 26]. Since both the GP-SCM counterfactual (7) and the CVAE interventional distributions (12) admit a reparametrisation trick [27, 40], we can differentiate through the constraint:

$$\nabla_\theta\, \mathbb{E}_{X_{d(I)}}\left[h\left(x^F_{nd(I)}, \theta, X_{d(I)}\right)\right] = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\nabla_\theta\, h\left(x^F_{nd(I)}, \theta, x_{d(I)}(z)\right)\right]. \tag{15}$$

Here, $x_{d(I)}(z)$ is obtained by iteratively computing all descendants in topological order: either substituting $z$ together with the other parents into the decoders $D_r$ for the CVAEs, or by using the Gaussian reparametrisation $x_r(z) = \mu + \sigma z$ with $\mu$ and $\sigma$ given by (7) for the GP-SCM. A similar gradient estimator for the variance which enters $\mathrm{thresh}(a)$ for $\gamma_{\text{LCB}} \neq 0$ is derived in Appendix F.
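The saddle point iteration on (14) with the reparametrised gradient (15) can be sketched in a few lines for a single intervention target and a single Gaussian descendant. The setting is a toy assumption: a logistic classifier with hand-written gradient, a descendant distributed as $\mathcal{N}(0.5\,\theta, 0.5^2)$, a squared-distance cost, $\gamma_{\text{LCB}} = 0$ (so that $\mathrm{thresh}(a) = 0.5$), and noise draws that are fixed once for reproducibility rather than resampled at every step. In practice an automatic differentiation library would replace the manual chain rule.

```python
import math
import random

W1, W2, B = 1.0, 1.0, -3.3   # hypothetical logistic classifier weights
SLOPE, SIGMA = 0.5, 0.5      # hypothetical X2 | do(X1=theta) ~ N(SLOPE*theta, SIGMA^2)


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def expected_h_and_grad(theta, zs):
    """Monte Carlo estimate of E[h] and of its gradient w.r.t. theta, Eq. (15)."""
    total_h = total_grad = 0.0
    for z in zs:
        x2 = SLOPE * theta + SIGMA * z             # reparametrised sample x(z)
        p = sigmoid(W1 * theta + W2 * x2 + B)
        total_h += p
        # Chain rule: direct effect of theta plus its effect through x2(theta).
        total_grad += p * (1.0 - p) * (W1 + W2 * SLOPE)
    return total_h / len(zs), total_grad / len(zs)


def solve(x1_factual, steps=2000, lr_theta=0.05, lr_lambda=0.25, seed=0):
    """Gradient descent in theta and projected gradient ascent in lambda."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(200)]  # fixed draws (assumption)
    theta, lam = x1_factual, 0.0
    for _ in range(steps):
        e_h, grad_e_h = expected_h_and_grad(theta, zs)
        # d/dtheta of cost + lam * (0.5 - E[h]) with cost = (theta - x1_factual)^2
        grad_theta = 2.0 * (theta - x1_factual) - lam * grad_e_h
        theta -= lr_theta * grad_theta
        lam = max(0.0, lam + lr_lambda * (0.5 - e_h))
    return theta, lam, expected_h_and_grad(theta, zs)[0]


theta_opt, lam_opt, e_h_opt = solve(1.0)
```

At convergence the constraint is active, i.e., the estimated expectation sits at the threshold, and the multiplier is strictly positive; the step sizes above were chosen by hand for this toy problem and would need tuning elsewhere.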

7 Experimental results

In our experiments, we compare different approaches for causal algorithmic recourse on synthetic and semi-synthetic data sets. Additional results can be found in Appendix B.

Compared methods. We compare the naive point-based recourse approaches $\mathcal{M}_{\text{LIN}}$ and $\mathcal{M}_{\text{KR}}$ mentioned at the beginning of §4 as baselines with the proposed counterfactual GP-SCM $\mathcal{M}_{\text{GP}}$ and the CVAE approach for subpopulation-based recourse ($\text{CATE}_{\text{CVAE}}$). For completeness, we also consider a $\text{CATE}_{\text{GP}}$ approach, as a GP can also be seen as modelling each conditional as a Gaussian,$^{7}$ and also evaluate the “pseudo-counterfactual” $\mathcal{M}_{\text{CVAE}}$ approach discussed in Remark 8. Finally, we report oracle performance for individualised ($\mathcal{M}_\star$) and subpopulation-based ($\text{CATE}_\star$) recourse methods by sampling counterfactuals and interventions from the true underlying SCM. We note that a comparison with non-causal recourse approaches that assume independent features [54, 44] or consider causal relations to generate counterfactual explanations but not recourse actions [19, 32] is neither natural nor straightforward, because it is unclear whether descendant variables should be allowed to change, whether keeping their value constant should incur a cost, and, if so, how much, cf. [22].

⁶ For large $d$, when enumerating all $I$ becomes computationally prohibitive, we can upper-bound the allowed number of variables to be intervened on simultaneously (e.g., $|I| \leq 3$), or choose a greedy approach to select $I$.

⁷ Sampling from the noise prior instead of the posterior in (6) leads to an interventional distribution in (7).
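
The size cap on intervention sets mentioned in footnote 6 can be sketched in a few lines of Python. The feature names and the helper `candidate_intervention_sets` are hypothetical and only illustrate the enumeration; they are not part of our implementation.

```python
from itertools import combinations

def candidate_intervention_sets(actionable, max_size=3):
    """Yield every non-empty subset of actionable variables with at most max_size elements."""
    for size in range(1, max_size + 1):
        yield from combinations(actionable, size)

# Hypothetical feature names, used only for illustration.
actionable = ["x1", "x2", "x3", "x4", "x5"]
candidates = list(candidate_intervention_sets(actionable, max_size=3))
```

With five actionable variables this yields 25 candidate sets instead of the 31 non-empty subsets; the saving grows quickly with the number of variables.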


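Table 1: Experimental results for the gradient-based approach on different 3-variable SCMs. We show average performance ±1 standard deviation for $N_{\text{runs}} = 100$, $N_{\text{MC-samples}} = 100$, and $\gamma_{\mathrm{LCB}} = 2$.

| Method | Linear SCM: Valid⋆ (%) | Linear SCM: LCB | Linear SCM: Cost (%) | Non-linear ANM: Valid⋆ (%) | Non-linear ANM: LCB | Non-linear ANM: Cost (%) | Non-additive SCM: Valid⋆ (%) | Non-additive SCM: LCB | Non-additive SCM: Cost (%) |
|---|---|---|---|---|---|---|---|---|---|
| $M_\star$ | 100 | - | 10.9±7.9 | 100 | - | 20.1±12.3 | 100 | - | 13.2±11.0 |
| $M_{\mathrm{LIN}}$ | 100 | - | 11.0±7.0 | 54 | - | 20.6±11.0 | 98 | - | 14.0±13.5 |
| $M_{\mathrm{KR}}$ | 90 | - | 10.7±6.5 | 91 | - | 20.6±12.5 | 70 | - | 13.2±11.6 |
| $M_{\mathrm{GP}}$ | 100 | .55±.04 | 12.2±8.3 | 100 | .54±.03 | 21.9±12.9 | 95 | .52±.04 | 13.4±12.8 |
| $M_{\mathrm{CVAE}}$ | 100 | .55±.07 | 11.8±7.7 | 97 | .54±.05 | 22.6±12.3 | 95 | .51±.01 | 13.4±12.2 |
| $\mathrm{CATE}_\star$ | 90 | .56±.07 | 11.9±9.2 | 97 | .55±.05 | 26.3±21.4 | 100 | .52±.02 | 13.5±13.0 |
| $\mathrm{CATE}_{\mathrm{GP}}$ | 93 | .56±.05 | 12.2±8.4 | 94 | .55±.06 | 25.0±14.8 | 94 | .52±.03 | 13.2±13.1 |
| $\mathrm{CATE}_{\mathrm{CVAE}}$ | 89 | .56±.08 | 12.1±8.9 | 98 | .54±.05 | 26.0±14.3 | 100 | .52±.05 | 13.6±12.9 |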

Metrics. We compare recourse actions recommended by the different methods in terms of cost, computed as the L2-norm between the intervention $\theta_I$ and the factual value $\mathbf{x}^F_I$, normalised by the range of each feature $r \in I$ observed in the training data; and validity, computed as the percentage of individuals for which the recommended actions result in a favourable prediction under the true (oracle) SCM. For our probabilistic recourse methods, we also report the lower confidence bound $\mathrm{LCB} := \mathbb{E}[h] - \gamma_{\mathrm{LCB}} \sqrt{\mathrm{Var}[h]}$ of the selected action under the given method.
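
The two quantities defined above can be computed as in the following short Python sketch. Two details are assumptions made for illustration: each coordinate difference is divided by its feature range before the L2-norm is taken, and the variance is the plain (biased) Monte Carlo estimate. The function names are hypothetical.

```python
import math

def normalised_cost(theta, x_factual, feature_ranges):
    """L2-norm of the range-normalised differences (theta_r - x_r^F) / range_r."""
    return math.sqrt(sum(((t - x) / r) ** 2
                         for t, x, r in zip(theta, x_factual, feature_ranges)))

def lower_confidence_bound(h_samples, gamma_lcb):
    """LCB = E[h] - gamma_LCB * sqrt(Var[h]), estimated from Monte Carlo samples of h."""
    n = len(h_samples)
    mean = sum(h_samples) / n
    var = sum((s - mean) ** 2 for s in h_samples) / n
    return mean - gamma_lcb * math.sqrt(var)

cost = normalised_cost([3.0, 1.0], [0.0, 1.0], [6.0, 2.0])
lcb = lower_confidence_bound([0.6, 0.8], gamma_lcb=2.0)
```

In this toy call only the first feature changes, by half of its observed range, and the two classifier samples have mean 0.7 and standard deviation 0.1.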

Synthetic 3-variable SCMs under different assumptions. In our first set of experiments, we consider three classes of SCMs over three variables with the same causal graph as in Fig. 1c. To test robustness of the different methods to assumptions about the form of the true structural equations, we consider a linear SCM, a non-linear ANM, and a more general, multi-modal SCM with non-additive noise. For further details on the exact form we refer to Appendix E.

Results are shown in Table 1. We observe that the point-based recourse approaches perform (relatively) well in terms of both validity and cost when their underlying assumptions are met (i.e., $M_{\mathrm{LIN}}$ on the linear SCM and $M_{\mathrm{KR}}$ on the non-linear ANM). Otherwise, validity drops significantly, as expected (see, e.g., the results of $M_{\mathrm{LIN}}$ on the non-linear ANM, or of $M_{\mathrm{KR}}$ on the non-additive SCM). Moreover, we note that the inferior performance of $M_{\mathrm{KR}}$ compared to $M_{\mathrm{LIN}}$ on the linear SCM suggests an overfitting problem, which does not occur for its more conservative probabilistic counterpart $M_{\mathrm{GP}}$. Generally, the individualised approaches $M_{\mathrm{GP}}$ and $M_{\mathrm{CVAE}}$ perform very competitively in terms of cost and validity, especially on the linear and non-linear ANMs. The subpopulation-based CATE approaches, on the other hand, perform particularly well on the challenging non-additive SCM (on which the assumptions of the GP approaches are violated), where $\mathrm{CATE}_{\mathrm{CVAE}}$ is the only non-oracle method to achieve perfect validity. As expected, the subpopulation-based approaches generally lead to higher cost than the individualised ones, since the latter aim to achieve recourse only for a given individual, while the former do so for an entire group (see Fig. 2a).

Semi-synthetic 7-variable SCM for loan-approval. We also test our methods on a larger semi-synthetic SCM inspired by the German Credit UCI dataset [34]. We consider the variables age $A$, gender $G$, education-level $E$, loan amount $L$, duration $D$, income $I$, and savings $S$ with causal graph shown in Fig. 2b. We model age $A$, gender $G$ and loan duration $D$ as non-actionable variables, but consider $D$ to be mutable, i.e., it cannot be manipulated directly but is allowed to change (e.g., as a consequence of an intervention on $L$). The SCM includes linear and non-linear relationships, as well as different types of variables and noise distributions, and is described in more detail in Appendix E.

The results are summarised in Table 2, where we observe that the insights discussed above similarly apply for data generated from a more complex SCM, and for different classifiers. Finally, we show the influence of $\gamma_{\mathrm{LCB}}$ on the performance of the proposed probabilistic approaches in Fig. 2c. We observe that lower values of $\gamma_{\mathrm{LCB}}$ lead to lower validity (and cost), especially for the CATE approaches. As $\gamma_{\mathrm{LCB}}$ increases, validity approaches that of the corresponding oracles $M_\star$ and $\mathrm{CATE}_\star$, outperforming the point-based recourse approaches. In summary, our probabilistic recourse approaches are not only more robust, but also allow controlling the trade-off between validity and cost using $\gamma_{\mathrm{LCB}}$.


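Table 2: Experimental results for the 7-variable SCM for loan-approval. We show average performance ±1 standard deviation for $N_{\text{runs}} = 100$, $N_{\text{MC-samples}} = 100$, and $\gamma_{\mathrm{LCB}} = 2.5$. For linear and non-linear logistic regression as classifiers, we use the gradient-based approach, whereas for the non-differentiable random forest classifier we rely on the brute-force approach (with 10 discretised bins per dimension) to solve the recourse optimisation problems.

| Method | Linear log. regr.: Valid⋆ (%) | Linear log. regr.: LCB | Linear log. regr.: Cost (%) | Non-lin. log. regr. (MLP): Valid⋆ (%) | Non-lin. log. regr. (MLP): LCB | Non-lin. log. regr. (MLP): Cost (%) | Random forest (brute-force): Valid⋆ (%) | Random forest (brute-force): LCB | Random forest (brute-force): Cost (%) |
|---|---|---|---|---|---|---|---|---|---|
| $M_\star$ | 100 | - | 15.8±7.6 | 100 | - | 11.0±7.0 | 100 | - | 15.2±7.5 |
| $M_{\mathrm{LIN}}$ | 19 | - | 15.4±7.4 | 80 | - | 11.0±6.9 | 94 | - | 15.6±7.6 |
| $M_{\mathrm{KR}}$ | 41 | - | 15.6±7.5 | 87 | - | 11.1±7.0 | 92 | - | 15.1±7.4 |
| $M_{\mathrm{GP}}$ | 100 | .50±.00 | 18.0±7.7 | 100 | .52±.04 | 11.7±7.3 | 100 | .66±.14 | 16.3±7.4 |
| $M_{\mathrm{CVAE}}$ | 100 | .50±.00 | 16.6±7.6 | 99 | .51±.01 | 11.3±6.9 | 100 | .66±.14 | 15.9±7.4 |
| $\mathrm{CATE}_\star$ | 93 | .50±.01 | 22.0±9.4 | 95 | .52±.05 | 12.0±7.7 | 98 | .66±.15 | 17.0±7.3 |
| $\mathrm{CATE}_{\mathrm{GP}}$ | 93 | .50±.02 | 21.7±9.2 | 93 | .51±.06 | 12.0±7.4 | 100 | .67±.15 | 17.1±7.4 |
| $\mathrm{CATE}_{\mathrm{CVAE}}$ | 94 | .49±.01 | 23.7±11.3 | 95 | .51±.03 | 12.0±7.8 | 100 | .68±.15 | 17.9±7.4 |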

8 Discussion

Assumptions, limitations, and extensions. Throughout the paper, we have assumed a known causal graph and causal sufficiency. While this may not hold for all settings, it is the minimal necessary set of assumptions for causal reasoning from observational data alone. Access to instrumental variables or experimental data may help further relax these assumptions [3, 11, 50]. Moreover, if only a partial graph is available or some relations are known to be confounded, one will need to restrict recourse actions to the subset of interventions that are still identifiable [45, 46, 51]. An alternative approach could address causal sufficiency violations by relying on latent variable models to estimate confounders from multiple causes [61] or proxy variables [31], or to work with bounds on causal effects instead [4, 49]. We relegate the investigation of these settings to future work.

On the counterfactual vs interventional nature of recourse. Given that we address two different notions of recourse, counterfactual/individualised (rung 3) vs. interventional/subpopulation-based (rung 2), one may ask which framing is more appropriate. Since the main difference is whether the background variables $\mathbf{U}$ are assumed fixed (counterfactual) or not (interventional) when reasoning about actions, we believe that this question is best addressed by thinking about the type of environment and interpretation of $\mathbf{U}$: if the environment is static, or if $\mathbf{U}$ (mostly) captures unobserved information about the individual, the counterfactual notion seems to be the right one; if, on the other hand, $\mathbf{U}$ also captures environmental factors which may change, e.g., between consecutive loan applications, then the interventional notion of recourse may be more appropriate. In practice, both notions may be present (for different variables), and the proposed approaches can be combined depending on the available domain knowledge since each parent-child causal relation is treated separately. We emphasise that the subpopulation-based approach is also practically motivated by a reluctance to make (parametric) assumptions about the structural equations which are untestable but necessary for counterfactual reasoning. It may therefore be useful to avoid problems of misspecification, even for counterfactual recourse, as demonstrated experimentally for the non-additive SCM.

9 Conclusion

In this work, we studied the problem of algorithmic recourse from a causal perspective. As a negative result, we first showed that algorithmic recourse cannot be guaranteed in the absence of perfect knowledge about the underlying SCM governing the world, which unfortunately is not available in practice. To address this limitation, we proposed two probabilistic approaches to achieve recourse under more realistic assumptions. In particular, we derived i) an individual-level recourse approach based on GPs that approximates the counterfactual distribution by averaging over the family of additive Gaussian SCMs; and ii) a subpopulation-based approach, which assumes that only the causal graph is known and makes use of CVAEs to estimate the conditional average treatment effect of an intervention on a subpopulation similar to the individual seeking recourse. Our experiments showed that the proposed probabilistic approaches not only result in more robust recourse interventions than approaches based on point estimates of the SCM, but also allow trading off validity against cost.


Broader Impact

Our work falls into the domain of explainable AI, which, given the increasing use of often intransparent (“blackbox”) machine learning models in consequential decision making, is of rapidly growing societal importance. In particular, we consider the task of enabling and facilitating algorithmic recourse, which aims to provide individuals with guidance and recommendations on how best (i.e., efficiently and ideally at low cost) to recover from unfavourable decisions made by an automated system. To address this task, we build on the framework of causal modelling, which constitutes a principled and mathematically rigorous way to reason about the downstream effects of actions. Since correlation does not imply causation, this requires making additional assumptions based on a general understanding of the domain at hand. While this may perhaps seem restrictive at first, we point out that other approaches to explainability also make implicit assumptions of a causal nature (e.g., that all features can be changed at will without affecting others in the case of “counterfactual” explanations), without explicitly and clearly stating such assumptions. The advantage of phrasing assumptions about relations between features in the form of a causal graph is that the latter is transparent and intuitive to understand and can thus be challenged by decision makers and individuals alike.

While theoretically sound from a causal perspective, at the same time, our method is aimed at being practical by not making further assumptions beyond the causal graph which would be hard or impossible to test or challenge empirically, in contrast to the assumed known specification of the full SCM in [22]. We start from the position that the model is only partially known, and use this to motivate probabilistic approaches to causal algorithmic recourse which take uncertainty into account. Our approaches are more robust to misspecification than naive point-based recourse methods (as demonstrated experimentally): “system-failure” is thus fundamentally baked in to our methods. Moreover, the interpretable “conservativeness parameter” $\gamma_{\mathrm{LCB}}$ can be used to trade off the desired level of robustness against the effort an individual is willing to put into achieving recourse.

The importance of causal reasoning for an ethical and socially beneficial use of ML-assisted technology has also been stressed in a number of recent works in the field of explainability and fair algorithmic decision making [29, 42, 24, 63, 64, 10, 57, 15]. We thus hope that some of the probabilistic approaches for causal reasoning under imperfect knowledge proposed in this work may also prove useful for related tasks such as fairness, accountability, and transparency. To this end, we have created a user-friendly implementation of all the approaches proposed in this work that we will make publicly available to be scrutinised, re-used, and further improved by the community. The code is highly flexible and only requires the specification of a causal graph, as well as a labelled training dataset.

Since our work considers the classifier as given, it is possible that it is explicitly discriminatory or reproduces biases in the data. While not directly addressing this problem, our work aims to enable individuals to overcome a potentially unfairly obtained decision with minimal effort. If successful recourse examples are included in future training data, this may help de-bias a system over time; we consider the intersection of our work with fair decision making in the context of a classifier evolving over time as the result of further data collection [25] a fruitful and important direction for future research. In addition, observing that certain minority groups consistently receive more costly recourse recommendations may be a way to reveal bias in the underlying decision making system.

While our framework is intended to help individuals increase their chances for a more favourable prediction given that they were, e.g., denied a loan or bail, we cannot rule out a priori that the same approach could also be used by foes in unintended ways, e.g., to “game” a spam filter or similar system built to protect society from harm. However, since our framework requires the specification of a causal graph, which usually requires an understanding of the domain and the causal influences at play, it is unlikely that it could be abused by a purely virtual system without a human in the loop.

Acknowledgments and Disclosure of Funding

The authors would like to thank Adrian Weller, Floyd Kretschmar, Junhyung Park, Matthias Bauer, Miriam Rateike, Nicolo Ruggeri, Umang Bhatt, and Vidhi Lalchand for helpful feedback and discussions. Moreover, a special thanks to Adrià Garriga-Alonso for insightful input on some of the GP-derivations and to Adrián Javaloy Bornás for invaluable help with the CVAE-training. AHK acknowledges NSERC and CLS for generous funding support.


References

[1] Jason Abrevaya, Yu-Chin Hsu, and Robert P Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015.

[2] Ahmed M Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. In Advances in Neural Information Processing Systems, pages 3424–3432, 2017.

[3] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455, 1996.

[4] Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46–54. Elsevier, 1994.

[5] Solon Barocas, Andrew D Selbst, and Manish Raghavan. The hidden assumptions behind counterfactual explanations and principal reasons. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 80–89, 2020.

[6] David M Bashtannyk and Rob J Hyndman. Bandwidth selection for kernel conditional density estimation. Computational Statistics & Data Analysis, 36(3):279–298, 2001.

[7] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 648–657, 2020.

[8] Christopher M Bishop. Mixture density networks. 1994.

[9] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.

[10] Silvia Chiappa. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7801–7808, 2019.

[11] Gregory F Cooper and Changwon Yoo. Causal discovery from a mixture of experimental and observational data. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 116–125, 1999.

[12] G. Darmois. Analyse des liaisons de probabilité. In Proc. Int. Stat. Conferences 1947, page 231, 1951.

[13] Nir Friedman and Iftach Nachman. Gaussian process networks. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 211–219, 2000.

[14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[15] Vivek Gupta, Pegah Nokhiz, Chitradeep Dutta Roy, and Suresh Venkatasubramanian. Equalizing recourse across groups. arXiv preprint arXiv:1909.03166, 2019.

[16] Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pages 689–696, 2009.

[17] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

[18] Dominik Janzing and Bernhard Schölkopf. Causal inference using the algorithmic markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.

[19] Shalmali Joshi, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615, 2019.

[20] Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics, pages 895–905, 2020.

[21] Amir-Hossein Karimi, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050, 2020.


[22] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: from counterfactual explanations to interventions. arXiv preprint arXiv:2002.06278, 2020.

[23] W. Karush. Minima of functions of several variables with inequalities as side conditions. Master’s Thesis, Department of Mathematics, University of Chicago, 1939.

[24] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.

[25] Niki Kilbertus, Manuel Gomez-Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. AISTATS, 2019.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations, 2015.

[27] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, 2014.

[28] Harold W Kuhn and Albert W Tucker. Nonlinear programming. In J. Neyman, editor, Proceedings of the second Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, 1951.

[29] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.

[30] David Lewis. Counterfactuals. Harvard University Press, 1973.

[31] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446–6456, 2017.

[32] Divyat Mahajan, Chenhao Tan, and Amit Sharma. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv preprint arXiv:1912.03277, 2019.

[33] Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 607–617, 2020.

[34] Patrick M Murphy. UCI repository of machine learning databases. ftp:/pub/machine-learning-databaseonics. uci. edu, 1994.

[35] Judea Pearl. Causality. Cambridge university press, 2009.

[36] Walt L Perry. Predictive policing: The role of crime forecasting in law enforcement operations. Rand Corporation, 2013.

[37] Jonas Peters and Peter Bühlmann. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.

[38] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.

[39] Rafael Poyiadzi, Kacper Sokol, Raul Santos-Rodriguez, Tijl De Bie, and Peter Flach. FACE: Feasible and actionable counterfactual explanations. arXiv preprint arXiv:1909.09369, 2019.

[40] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[41] Cristóbal Romero and Sebastián Ventura. Preface to the special issue on data mining for personalised educational systems. User Modeling and User Adapted Interaction, 21(1):1, 2011.

[42] Chris Russell, Matt J Kusner, Joshua Loftus, and Ricardo Silva. When worlds collide: integrating different counterfactual assumptions in fairness. In Advances in Neural Information Processing Systems, pages 6414–6423, 2017.

[43] Peter Schulam and Suchi Saria. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697–1708, 2017.


[44] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. Certifai: A common framework to provide explanations and analyse the fairness and robustness of black-box models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 166–172, 2020.

[45] Ilya Shpitser and Judea Pearl. Identification of conditional interventional distributions. In 22nd Conference on Uncertainty in Artificial Intelligence, UAI 2006, pages 437–444, 2006.

[46] Ilya Shpitser and Judea Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(Sep):1941–1979, 2008.

[47] Ricardo Silva and Robert B Gramacy. Gaussian process structural equation models with latent variables. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 537–545, 2010.

[48] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015.

[49] Jin Tian and Judea Pearl. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1-4):287–313, 2000.

[50] Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 512–521, 2001.

[51] Jin Tian and Judea Pearl. A general identification condition for causal effects. In Eighteenth national conference on Artificial intelligence, pages 567–573, 2002.

[52] Marc Toussaint. Lecture notes: Gaussian identities. 2011.

[53] Brian L Trippe and Richard E Turner. Conditional density estimation with bayesian normalising flows. arXiv preprint arXiv:1802.04908, 2018.

[54] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19, 2019.

[55] Suresh Venkatasubramanian and Mark Alfano. The philosophical basis of algorithmic recourse. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 284–293, 2020.

[56] Julius von Kügelgen, Paul K Rubenstein, Bernhard Schölkopf, and Adrian Weller. Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks. NeurIPS Workshop “Do the right thing”: machine learning and causal inference for improved decision making, 2019.

[57] Julius von Kügelgen, Umang Bhatt, Amir-Hossein Karimi, Isabel Valera, Adrian Weller, and Bernhard Schölkopf. On the fairness of causal algorithmic recourse. arXiv preprint arXiv:2010.06529, 2020.

[58] Sandra Wachter, Brent Mittelstadt, and Luciano Floridi. Why a right to explanation of automated decision-making does not exist in the general data protection regulation. International Data Privacy Law, 7(2):76–99, 2017.

[59] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.

[60] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.

[61] Yixin Wang and David M Blei. The blessings of multiple causes. Journal of the American Statistical Association, pages 1–71, 2019.

[62] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.

[63] Junzhe Zhang and Elias Bareinboim. Equality of opportunity in classification: A causal approach. In Advances in Neural Information Processing Systems, pages 3671–3681, 2018.

[64] Junzhe Zhang and Elias Bareinboim. Fairness in decision-making—the causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[65] K Zhang and A Hyvärinen. On the identifiability of the post-nonlinear causal model. In 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 647–655. AUAI Press, 2009.


A Proofs

A.1 Proof of Proposition 5

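Proposition 5 (GP-SCM noise posterior). Let $\{\mathbf{x}^i\}_{i=1}^n$ be an observational sample from (5). For each $r \in [d]$ with non-empty parent set $|\mathrm{pa}(r)| > 0$, the posterior distribution of the noise vector $\mathbf{u}_r = (u_r^1, \dots, u_r^n)$, conditioned on $\mathbf{x}_r = (x_r^1, \dots, x_r^n)$ and $\mathbf{X}_{\mathrm{pa}(r)} = (\mathbf{x}^1_{\mathrm{pa}(r)}, \dots, \mathbf{x}^n_{\mathrm{pa}(r)})$, is given by

$$\mathbf{u}_r \mid \mathbf{X}_{\mathrm{pa}(r)}, \mathbf{x}_r \sim \mathcal{N}\left(\sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \mathbf{x}_r,\; \sigma_r^2 \left(\mathbf{I} - \sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1}\right)\right), \tag{6}$$

where $\mathbf{K} := \left(k_r(\mathbf{x}^i_{\mathrm{pa}(r)}, \mathbf{x}^j_{\mathrm{pa}(r)})\right)_{ij}$ denotes the Gram matrix.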

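Proof. First, note that, by definition, $\mathbf{u}_r$ is independent of $\mathbf{f}_r = (f_r(\mathbf{x}^1_{\mathrm{pa}(r)}), \dots, f_r(\mathbf{x}^n_{\mathrm{pa}(r)}))$ given $\mathbf{X}_{\mathrm{pa}(r)}$. Moreover, it follows from the assumed GP-SCM model in (5) and Definition 4, as well as properties of the GP prior, that both are multivariate Gaussian random variables with distributions given by

$$\mathbf{u}_r \sim \mathcal{N}(\mathbf{0}, \sigma_r^2 \mathbf{I}) \quad \text{independently of } \mathbf{X}_{\mathrm{pa}(r)}, \text{ and} \tag{A.1}$$

$$\mathbf{f}_r \mid \mathbf{X}_{\mathrm{pa}(r)} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), \tag{A.2}$$

where $\mathbf{0}$ denotes the zero vector (or matrix, see below) and $\mathbf{K}$ is as defined in Proposition 5.

Since independent multivariate Gaussian random variables are jointly multivariate Gaussian, we thus have

$$\begin{pmatrix} \mathbf{u}_r \\ \mathbf{f}_r \end{pmatrix} \Big|\, \mathbf{X}_{\mathrm{pa}(r)} \sim \mathcal{N}(\mathbf{0}, \Sigma), \quad \text{where } \Sigma = \begin{pmatrix} \sigma_r^2 \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{K} \end{pmatrix}. \tag{A.3}$$

Noting that $\mathbf{x}_r = \mathbf{f}_r + \mathbf{u}_r$ and applying a linear transformation to (A.3), we then obtain

$$\begin{pmatrix} \mathbf{u}_r \\ \mathbf{x}_r \end{pmatrix} \Big|\, \mathbf{X}_{\mathrm{pa}(r)} = \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{I} & \mathbf{I} \end{pmatrix} \begin{pmatrix} \mathbf{u}_r \\ \mathbf{f}_r \end{pmatrix} \Big|\, \mathbf{X}_{\mathrm{pa}(r)} \sim \mathcal{N}(\mathbf{0}, \tilde{\Sigma}), \quad \text{where } \tilde{\Sigma} = \begin{pmatrix} \sigma_r^2 \mathbf{I} & \sigma_r^2 \mathbf{I} \\ \sigma_r^2 \mathbf{I} & \mathbf{K} + \sigma_r^2 \mathbf{I} \end{pmatrix}. \tag{A.4}$$

Conditioning on $\mathbf{x}_r$ and using the conditioning formula [e.g., 52], the result follows:

$$\mathbf{u}_r \mid \mathbf{X}_{\mathrm{pa}(r)}, \mathbf{x}_r \sim \mathcal{N}\left(\mathbf{0} + \sigma_r^2 \mathbf{I} (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} (\mathbf{x}_r - \mathbf{0}),\; \sigma_r^2 \mathbf{I} - \sigma_r^2 \mathbf{I} (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \sigma_r^2 \mathbf{I}\right) \tag{A.5}$$

$$\sim \mathcal{N}\left(\sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \mathbf{x}_r,\; \sigma_r^2 \left(\mathbf{I} - \sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1}\right)\right) \tag{A.6}$$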
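
As a numerical sanity check of (6), the following NumPy sketch evaluates the posterior mean and covariance of the noise for a small synthetic sample. The squared-exponential kernel, the one-dimensional parent, the data-generating function and the noise variance are all assumptions made for this illustration; the helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between two 1-D input arrays (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def noise_posterior(K, x_r, sigma2):
    """Mean and covariance of u_r | X_pa(r), x_r following (6)."""
    n = len(x_r)
    A_inv = np.linalg.inv(K + sigma2 * np.eye(n))
    mean = sigma2 * A_inv @ x_r
    cov = sigma2 * (np.eye(n) - sigma2 * A_inv)
    return mean, cov

rng = np.random.default_rng(0)
x_pa = rng.normal(size=6)                       # toy parent values
x_r = np.sin(x_pa) + 0.1 * rng.normal(size=6)   # toy child values
sigma2 = 0.01
K = rbf_kernel(x_pa, x_pa)
mean_u, cov_u = noise_posterior(K, x_r, sigma2)

# Since x_r = f_r + u_r holds exactly, the posterior means of f_r and u_r must sum to x_r.
mean_f = K @ np.linalg.solve(K + sigma2 * np.eye(6), x_r)
```

The final check uses the standard GP posterior mean of $\mathbf{f}_r$ at the training inputs, $\mathbf{K}(\mathbf{K} + \sigma_r^2\mathbf{I})^{-1}\mathbf{x}_r$, which together with the mean in (6) adds up to $\mathbf{x}_r$.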

A.2 Proof of Proposition 6

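Proposition 6 (GP-SCM counterfactual distribution). Let $\{\mathbf{x}^i\}_{i=1}^n$ be an observational sample from (5). Then, for $r \in [d]$ with $|\mathrm{pa}(r)| > 0$, the counterfactual distribution over $X_r$ had $\mathbf{X}_{\mathrm{pa}(r)}$ been $\tilde{\mathbf{x}}_{\mathrm{pa}(r)}$ (instead of $\mathbf{x}^F_{\mathrm{pa}(r)}$) for individual $\mathbf{x}^F \in \{\mathbf{x}^i\}_{i=1}^n$ is given by

$$X_r(\mathbf{X}_{\mathrm{pa}(r)} = \tilde{\mathbf{x}}_{\mathrm{pa}(r)}) \mid \mathbf{x}^F, \{\mathbf{x}^i\}_{i=1}^n \sim \mathcal{N}\left(\mu^F_r + \tilde{\mathbf{k}}^T (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \mathbf{x}_r,\; s^F_r + \tilde{k} - \tilde{\mathbf{k}}^T (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \tilde{\mathbf{k}}\right), \tag{7}$$

where $\tilde{k} := k_r(\tilde{\mathbf{x}}_{\mathrm{pa}(r)}, \tilde{\mathbf{x}}_{\mathrm{pa}(r)})$, $\tilde{\mathbf{k}} := \left(k_r(\tilde{\mathbf{x}}_{\mathrm{pa}(r)}, \mathbf{x}^1_{\mathrm{pa}(r)}), \dots, k_r(\tilde{\mathbf{x}}_{\mathrm{pa}(r)}, \mathbf{x}^n_{\mathrm{pa}(r)})\right)$, $\mathbf{x}_r$ and $\mathbf{K}$ are as defined in Proposition 5, and $\mu^F_r$ and $s^F_r$ are the posterior mean and variance of $u^F_r$ given by (6).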

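Proof. We follow the three steps of abduction, action, and prediction for computing counterfactual distributions (see §2 for more details). Starting from the factual observation $\mathbf{x}^F \in \{\mathbf{x}^i\}_{i=1}^n$ generated according to

$$x^F_r := f_r(\mathbf{x}^F_{\mathrm{pa}(r)}) + u^F_r, \tag{A.7}$$

we first compute the noise posterior (abduction). According to Proposition 5 it is given by a marginal of (6), i.e.,

$$u^F_r \mid \mathbf{X}_{\mathrm{pa}(r)}, \mathbf{x}_r \sim \mathcal{N}(\mu^F_r, s^F_r), \tag{A.8}$$

where $\mu^F_r$ is given by element $F$ of the mean vector

$$\boldsymbol{\mu}_r = \sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \mathbf{x}_r \tag{A.9}$$

and $s^F_r$ is given by element $(F, F)$ of the covariance matrix

$$\mathbf{S}_r = \sigma_r^2 \left(\mathbf{I} - \sigma_r^2 (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1}\right) \tag{A.10}$$

of the noise posterior given by (6).

Next, we simulate the hypothetical intervention by updating the structural equation (A.7) (action step),

$$x^F_r(\mathbf{X}_{\mathrm{pa}(r)} = \tilde{\mathbf{x}}_{\mathrm{pa}(r)}) := f_r(\tilde{\mathbf{x}}_{\mathrm{pa}(r)}) + u^F_r. \tag{A.11}$$

The GP predictive posterior at the new input $\tilde{\mathbf{x}}_{\mathrm{pa}(r)}$ has distribution [see, e.g., 62],

$$f_r(\tilde{\mathbf{x}}_{\mathrm{pa}(r)}) \mid \mathbf{X}_{\mathrm{pa}(r)}, \mathbf{x}_r \sim \mathcal{N}\left(\tilde{\mathbf{k}}^T (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \mathbf{x}_r,\; \tilde{k} - \tilde{\mathbf{k}}^T (\mathbf{K} + \sigma_r^2 \mathbf{I})^{-1} \tilde{\mathbf{k}}\right). \tag{A.12}$$

Substituting (A.12) and (A.8) into (A.11) and noting that the sum of two Gaussians is again Gaussian with mean and variance equal to the sums of means and variances of the two individual Gaussians (prediction step) completes the proof.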
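
The three steps of the proof translate directly into a short NumPy sketch that evaluates the mean and variance in (7). As before, the squared-exponential kernel, the single one-dimensional parent, the toy data and the function name `gp_scm_counterfactual` are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between 1-D inputs (assumed kernel choice)."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_scm_counterfactual(x_pa, x_r, factual_idx, x_pa_new, sigma2):
    """Mean and variance of the counterfactual distribution (7) for one individual."""
    n = len(x_r)
    A_inv = np.linalg.inv(rbf_kernel(x_pa, x_pa) + sigma2 * np.eye(n))
    # Abduction: marginal noise posterior of the factual individual, (A.8)-(A.10).
    mu_F = sigma2 * (A_inv @ x_r)[factual_idx]
    s_F = sigma2 * (1.0 - sigma2 * A_inv[factual_idx, factual_idx])
    # Action and prediction: GP predictive posterior at the new parent value, (A.12).
    k_vec = rbf_kernel(x_pa_new, x_pa)[0]
    k_scalar = rbf_kernel(x_pa_new, x_pa_new)[0, 0]
    mean = mu_F + k_vec @ A_inv @ x_r
    var = s_F + k_scalar - k_vec @ A_inv @ k_vec
    return mean, var

rng = np.random.default_rng(0)
x_pa = rng.normal(size=8)                       # toy parent values
x_r = np.sin(x_pa) + 0.1 * rng.normal(size=8)   # toy child values
mean_same, var_same = gp_scm_counterfactual(x_pa, x_r, 2, x_pa[2], sigma2=0.01)
mean_new, var_new = gp_scm_counterfactual(x_pa, x_r, 2, x_pa[2] + 1.0, sigma2=0.01)
```

When the parent is left at its factual value, the counterfactual mean coincides with the factual observation $x^F_r$, because the noise mean (A.9) and the predictive mean (A.12) then sum to $\mathbf{x}_r$ at the factual index.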


A.3 Proof of Proposition 7

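Proposition 7. Subject to causal sufficiency, $P_{\mathbf{X}_{\mathrm{d}(I)} \mid do(\mathbf{X}_I = \theta), \mathbf{x}^F_{\mathrm{nd}(I)}}$ is observationally identifiable:

$$p\left(\mathbf{X}_{\mathrm{d}(I)} \mid do(\mathbf{X}_I = \theta), \mathbf{x}^F_{\mathrm{nd}(I)}\right) = \prod_{r \in \mathrm{d}(I)} p\left(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}\right) \Big|_{\mathbf{X}_I = \theta,\, \mathbf{X}_{\mathrm{nd}(I)} = \mathbf{x}^F_{\mathrm{nd}(I)}}. \tag{11}$$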

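Proof. This is a direct consequence of the properties of causally sufficient (Markovian) causal models, but we include a derivation for completeness. Recall that $P$ factorises over its underlying causal graph $\mathcal{G}$ as follows,

$$p(\mathbf{X}) = \prod_{r \in [d]} p(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}). \tag{A.13}$$

This joint distribution is transformed by the intervention $do(\mathbf{X}_I = \theta)$ as follows,

$$P(\mathbf{X}_{-I}, do(\mathbf{X}_I = \theta)) = \delta(\mathbf{X}_I = \theta) \prod_{r \in [d] \setminus I} P(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}). \tag{A.14}$$

Splitting the non-intervened variables into descendants $\mathrm{d}(I)$ and non-descendants $\mathrm{nd}(I)$, and conditioning on the intervened variables $do(\mathbf{X}_I = \theta)$, we obtain

$$P(\mathbf{X}_{\mathrm{nd}(I)}, \mathbf{X}_{\mathrm{d}(I)} \mid do(\mathbf{X}_I = \theta)) = \prod_{r \in \mathrm{nd}(I) \cup \mathrm{d}(I)} P(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}) \Big|_{\mathbf{X}_I = \theta}. \tag{A.15}$$

As the non-descendants $\mathbf{X}_{\mathrm{nd}(I)}$ are, by their very definition, not affected by the intervention, we can write

$$P(\mathbf{X}_{\mathrm{nd}(I)}, \mathbf{X}_{\mathrm{d}(I)} \mid do(\mathbf{X}_I = \theta)) = \prod_{r \in \mathrm{d}(I)} P(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}) \Big|_{\mathbf{X}_I = \theta} \prod_{r \in \mathrm{nd}(I)} P(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}).$$

We can thus condition on a particular value of $\mathbf{X}_{\mathrm{nd}(I)}$ to obtain

$$P\left(\mathbf{X}_{\mathrm{d}(I)} \mid do(\mathbf{X}_I = \theta), \mathbf{X}_{\mathrm{nd}(I)} = \mathbf{x}^F_{\mathrm{nd}(I)}\right) = \prod_{r \in \mathrm{d}(I)} P(X_r \mid \mathbf{X}_{\mathrm{pa}(r)}) \Big|_{\mathbf{X}_I = \theta,\, \mathbf{X}_{\mathrm{nd}(I)} = \mathbf{x}^F_{\mathrm{nd}(I)}}. \tag{A.16}$$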
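
The factorisation in (A.16) suggests a simple ancestral sampling scheme: fix the intervened variables to $\theta$, keep non-descendants at their factual values, and sample each descendant from its conditional in topological order. The Python sketch below illustrates this on a hypothetical three-variable graph with assumed linear Gaussian conditionals; in practice the conditionals would be learned (e.g., by the CVAEs), and none of the names or coefficients are taken from our experiments.

```python
import random

# Hypothetical graph: x1 -> x2 -> x3, plus the edge x1 -> x3.
PARENTS = {"x1": [], "x2": ["x1"], "x3": ["x1", "x2"]}
TOPOLOGICAL_ORDER = ["x1", "x2", "x3"]

# Assumed conditional samplers for p(X_r | X_pa(r)).
SAMPLERS = {
    "x2": lambda pa, rng: 0.5 * pa["x1"] + rng.gauss(0.0, 1.0),
    "x3": lambda pa, rng: 1.0 * pa["x1"] + 2.0 * pa["x2"] + rng.gauss(0.0, 1.0),
}

def descendants(targets):
    """Non-intervened nodes with an intervened ancestor (one pass in topological order)."""
    found = set()
    for node in TOPOLOGICAL_ORDER:
        if node not in targets and any(p in targets or p in found for p in PARENTS[node]):
            found.add(node)
    return found

def sample_interventional(x_factual, intervention, rng):
    """One draw of all variables under do(X_I = theta), given factual non-descendants."""
    desc = descendants(set(intervention))
    values = {}
    for node in TOPOLOGICAL_ORDER:
        if node in intervention:
            values[node] = intervention[node]          # intervened: fixed to theta
        elif node in desc:
            parent_values = {p: values[p] for p in PARENTS[node]}
            values[node] = SAMPLERS[node](parent_values, rng)  # descendant: resampled
        else:
            values[node] = x_factual[node]             # non-descendant: factual value
    return values

rng = random.Random(0)
x_factual = {"x1": 2.0, "x2": -1.0, "x3": 0.3}
draws = [sample_interventional(x_factual, {"x2": 1.0}, rng) for _ in range(2000)]
mean_x3 = sum(d["x3"] for d in draws) / len(draws)
```

In this toy setting the only descendant of the intervened variable is `x3`, whose conditional mean under the intervention is 1.0 · 2.0 + 2.0 · 1.0 = 4, so the Monte Carlo average `mean_x3` should lie close to that value.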


B Additional results

This section presents additional results complementing those from Section 7. Table 3 presents results that mirror those in Table 1, where the brute-force approach discussed at the beginning of §6 is used instead of the gradient-based optimisation. Here, each real-valued feature was discretised into 20 bins within the range of its observed values in the training dataset.
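
A brute-force search of this kind can be sketched as follows for a single actionable feature. The sketch is illustrative only: the toy classifier and descendant model, the use of 20 equally spaced grid values over the observed range, the feasibility threshold of 0.5 on the lower confidence bound, and all function names are assumptions rather than a description of our implementation.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def h_samples(theta, zs):
    """Toy classifier outputs for action theta; the descendant is 0.8 * theta + 0.5 * z."""
    return [sigmoid(1.0 * theta + 2.0 * (0.8 * theta + 0.5 * z) - 1.5) for z in zs]

def lcb(samples, gamma):
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean - gamma * math.sqrt(var)

def brute_force_action(x_factual, low, high, gamma, zs, n_bins=20, threshold=0.5):
    """Cheapest grid value whose LCB clears the threshold, or None if no value does."""
    step = (high - low) / (n_bins - 1)
    grid = [low + i * step for i in range(n_bins)]
    feasible = [t for t in grid if lcb(h_samples(t, zs), gamma) >= threshold]
    if not feasible:
        return None
    return min(feasible, key=lambda t: abs(t - x_factual) / (high - low))

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(100)]
action_low = brute_force_action(0.0, -3.0, 3.0, gamma=0.0, zs=zs)
action_high = brute_force_action(0.0, -3.0, 3.0, gamma=2.5, zs=zs)
```

Because a larger value of `gamma` can only shrink the feasible set, the action selected with `gamma=2.5` is never cheaper than the one selected with `gamma=0.0`, which mirrors the validity-cost trade-off controlled by $\gamma_{\mathrm{LCB}}$.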

Fig. 3 mirrors the results in Fig. 2c, for which a snapshot ($\gamma_{\mathrm{LCB}} = 2.5$) is also provided in Table 2. Here we show the trade-off between validity and cost by varying the values of $\gamma_{\mathrm{LCB}}$, using as trained classifiers a non-linear multilayer perceptron (MLP) in (a) and a non-differentiable random forest classifier in (b). Note that optimisation for the latter can only be done with the brute-force approach. All these additional results mostly confirm the insights presented in the main body.

Finally, Table 4 provides a qualitative comparison of the proposed recourse approaches against the oracles and baselines in terms of their selection of intervention targets. We show empirically, on the three synthetic datasets, that CATE approaches have more predictable behaviour, as they are less sensitive to model assumptions, and are thus preferable for the individual seeking recourse under imperfect causal knowledge.

Table 3: Experimental results for the brute-force (20-bin discretization) approach on different 3-variable SCMs. We show average performance for Nruns = 100, NMC-samples = 100, and γLCB = 2. The relative trends reflect those in Table 1.

| Method | Linear SCM: Valid⋆ (%) | Linear SCM: LCB | Linear SCM: Cost (%) | Non-linear ANM: Valid⋆ (%) | Non-linear ANM: LCB | Non-linear ANM: Cost (%) | Non-additive SCM: Valid⋆ (%) | Non-additive SCM: LCB | Non-additive SCM: Cost (%) |
|---|---|---|---|---|---|---|---|---|---|
| M⋆ | 100 | - | 11.0±5.6 | 100 | - | 20.7±11.0 | 100 | - | 15.8±8.9 |
| M_LIN | 100 | - | 11.3±5.8 | 60 | - | 19.9±8.9 | 92 | - | 17.0±10.4 |
| M_KR | 95 | - | 11.2±5.6 | 88 | - | 20.5±10.7 | 47 | - | 15.8±10.6 |
| M_GP | 100 | .55±.04 | 11.6±5.8 | 99 | .55±.04 | 21.2±10.9 | 88 | .58±.05 | 16.8±10.3 |
| M_CVAE | 100 | .55±.04 | 11.5±5.8 | 95 | .55±.03 | 21.7±10.7 | 95 | .59±.07 | 16.9±10.3 |
| CATE⋆ | 90 | .57±.07 | 11.0±5.5 | 95 | .55±.05 | 22.8±10.8 | 99 | .57±.06 | 16.2±8.9 |
| CATE_GP | 92 | .56±.07 | 11.2±5.5 | 95 | .55±.04 | 22.8±10.9 | 85 | .58±.07 | 16.4±10.5 |
| CATE_CVAE | 90 | .57±.06 | 11.1±5.4 | 96 | .55±.03 | 23.0±10.8 | 94 | .59±.07 | 16.8±10.2 |

[Figure 3 plots, panels (a) MLP and (b) random forest: each panel shows validity (%) and cost (%) against γLCB (ticks from 0 to 2.5) for LIN, KRR, GP, CVAE, CATE, CATE_GP, and CATE_CVAE.]

Figure 3: Trade-off between validity and cost, which can be controlled via γLCB for the probabilistic recourse methods. Shown is the same setting as in Fig. 2c, using instead a non-linear logistic regression in the form of a multilayer perceptron (MLP; left) and a random forest (right) as classifiers h.


Table 4: Experimental results for the gradient-descent approach on different 3-variable SCMs (top to bottom: linear SCM, non-linear ANM, non-additive SCM). We show average performance for Nruns = 100, NMC-samples = 100, and γLCB = 2, and display the number (out of Nruns) of performed interventions on all subsets of variables by each recourse type. The two right-most columns display how many of the intervention sets for each recourse type agreed with the suggestions made by the oracle methods, M⋆ and CATE⋆, respectively. We observe that interventions proposed by the subpopulation-based oracle often differ from the ones proposed at the individual level, which can be visually explained by Fig. 2a. Importantly, we observe general agreement among all CATE approaches in their selection of intervened-upon variables. In contrast, we observe that individual-based methods deviate from their oracle (i.e., M⋆) in their selection of variables to intervene upon for recourse. This result further suggests that the CATE approaches presented in this work exhibit more predictable behaviour, as they are less sensitive to model assumptions, and are thus preferable for the individual seeking recourse under imperfect causal knowledge.

| SCM | Method | Valid⋆ (%) | LCB | Cost (%) | {X1} | {X2} | {X3} | {X1, X2} | {X1, X3} | {X2, X3} | {X1, X2, X3} | Identical to M⋆ | Identical to CATE⋆ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear SCM | M⋆ | 100 | - | 10.9±7.9 | 0 | 25 | 0 | 56 | 0 | 0 | 19 | 100 | 23 |
| Linear SCM | M_LIN | 100 | - | 11.0±7.0 | 0 | 26 | 0 | 50 | 0 | 1 | 23 | 52 | 23 |
| Linear SCM | M_KR | 90 | - | 10.7±6.5 | 0 | 22 | 0 | 44 | 0 | 0 | 34 | 54 | 27 |
| Linear SCM | M_GP | 100 | .55±.04 | 12.2±8.3 | 0 | 6 | 0 | 13 | 0 | 7 | 74 | 25 | 61 |
| Linear SCM | M_CVAE | 100 | .55±.07 | 11.8±7.7 | 0 | 12 | 0 | 25 | 0 | 5 | 58 | 31 | 57 |
| Linear SCM | CATE⋆ | 90 | .56±.07 | 11.9±9.2 | 0 | 6 | 0 | 11 | 0 | 13 | 70 | 23 | 100 |
| Linear SCM | CATE_GP | 93 | .56±.05 | 12.2±8.4 | 0 | 3 | 0 | 9 | 1 | 15 | 72 | 18 | 76 |
| Linear SCM | CATE_CVAE | 89 | .56±.08 | 12.1±8.9 | 0 | 6 | 1 | 11 | 0 | 16 | 66 | 18 | 78 |
| Non-linear ANM | M⋆ | 100 | - | 20.1±12.3 | 70 | 0 | 0 | 2 | 16 | 0 | 11 | 99 | 17 |
| Non-linear ANM | M_LIN | 54 | - | 20.6±11.0 | 13 | 0 | 0 | 0 | 81 | 0 | 5 | 20 | 41 |
| Non-linear ANM | M_KR | 91 | - | 20.6±12.5 | 65 | 0 | 0 | 1 | 23 | 0 | 10 | 76 | 22 |
| Non-linear ANM | M_GP | 100 | .54±.03 | 21.9±12.9 | 39 | 0 | 0 | 0 | 38 | 0 | 22 | 54 | 38 |
| Non-linear ANM | M_CVAE | 97 | .54±.05 | 22.6±12.3 | 33 | 0 | 0 | 0 | 51 | 0 | 15 | 45 | 42 |
| Non-linear ANM | CATE⋆ | 97 | .55±.05 | 26.3±21.4 | 4 | 0 | 0 | 0 | 44 | 2 | 49 | 17 | 99 |
| Non-linear ANM | CATE_GP | 94 | .55±.06 | 25.0±14.8 | 4 | 1 | 0 | 0 | 37 | 4 | 53 | 11 | 69 |
| Non-linear ANM | CATE_CVAE | 98 | .54±.05 | 26.0±14.3 | 3 | 0 | 0 | 1 | 32 | 1 | 62 | 12 | 70 |
| Non-additive SCM | M⋆ | 100 | - | 13.2±11.0 | 0 | 0 | 1 | 0 | 11 | 78 | 7 | 97 | 78 |
| Non-additive SCM | M_LIN | 98 | - | 14.0±13.5 | 0 | 0 | 0 | 1 | 0 | 85 | 11 | 81 | 77 |
| Non-additive SCM | M_KR | 70 | - | 13.2±11.6 | 0 | 17 | 0 | 4 | 10 | 59 | 7 | 55 | 53 |
| Non-additive SCM | M_GP | 95 | .52±.04 | 13.4±12.8 | 3 | 1 | 2 | 0 | 0 | 82 | 9 | 73 | 78 |
| Non-additive SCM | M_CVAE | 95 | .51±.01 | 13.4±12.2 | 0 | 3 | 1 | 5 | 2 | 71 | 15 | 72 | 76 |
| Non-additive SCM | CATE⋆ | 100 | .52±.02 | 13.5±13.0 | 0 | 0 | 2 | 0 | 9 | 77 | 9 | 78 | 97 |
| Non-additive SCM | CATE_GP | 94 | .52±.03 | 13.2±13.1 | 3 | 1 | 5 | 0 | 3 | 73 | 12 | 70 | 76 |
| Non-additive SCM | CATE_CVAE | 100 | .52±.05 | 13.6±12.9 | 0 | 1 | 2 | 0 | 1 | 82 | 11 | 78 | 78 |


C (Non-)identifiability of SCMs under different assumptions

In general form, i.e., without any further assumption on the structural equations S or noise distribution PU, SCMs are not identifiable from data alone, meaning that there are multiple different SCMs (possibly with different underlying causal graphs) which imply the same observational distribution [38]. One possible construction relies on the use of the inverse cumulative distribution function (cdf) in combination with uniformly-distributed random variables [12] and is also used in non-identifiability proofs for non-linear independent component analysis (ICA) [17]. Even knowing the causal graph is generally not enough, as summarised in the following proposition.

Proposition 9. Even when the causal graph is known, the conditionals P (Xr|Xpa(r)) alone are insufficient to uniquely determine the structural equations Xr := fr(Xpa(r), Ur) without further assumptions.

Proof. This can be shown by using the following argument from [18, Footnote 1] (adapted to our notation):

“let $U_r$ consist of (possibly uncountably many) real-valued random variables $U_r[x_{pa(r)}]$, one for each value $x_{pa(r)}$ of the parents $X_{pa(r)}$. Let $U_r[x_{pa(r)}]$ be distributed according to $P_{X_r \mid x_{pa(r)}}$ and define $f_r(x_{pa(r)}, U_r) := U_r[x_{pa(r)}]$. Then $X_r \mid X_{pa(r)}$ has distribution $P_{X_r \mid X_{pa(r)}}$”.

We can now build on this formulation to construct a second SCM with the same observational distribution and causal graph, e.g., by shifting the noise variables and structural equations by some fixed constant C as follows.

For $r \in [d]$, define $Y_r := X_r - C$. Let $U_r$ consist of (possibly uncountably many) real-valued random variables $U_r[x_{pa(r)}]$, one for each value $x_{pa(r)}$ of the parents $X_{pa(r)}$. Let $U_r[x_{pa(r)}]$ be distributed according to $P_{Y_r \mid x_{pa(r)}}$ and define $f_r(x_{pa(r)}, U_r) := U_r[x_{pa(r)}] + C$. Then $X_r \mid X_{pa(r)}$ also has distribution $P_{X_r \mid X_{pa(r)}}$, but for $C \neq 0$ the structural equations and noise distributions are different from the previous construction.
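To make the practical consequence of Proposition 9 concrete, the following Python sketch compares two hypothetical SCMs that are not the construction used in the proof but are a simpler illustration chosen here: model A uses X2 := X1 + U2 and model B uses X2 := X1 + sgn(X1) U2, both with U2 ~ N(0, 1). By symmetry of the Gaussian, both imply the same conditional X2 | X1 ~ N(X1, 1), yet they can disagree on counterfactuals.

```python
import numpy as np

def sign(x):
    """Sign function with sign(0) := +1, so that sign(x) ** 2 == 1 everywhere."""
    return np.where(x >= 0, 1.0, -1.0)

def cf_model_a(x1, x2, x1_new):
    """Counterfactual X2 under do(X1 = x1_new) for X2 := X1 + U2."""
    u2 = x2 - x1                  # abduction
    return x1_new + u2            # action + prediction

def cf_model_b(x1, x2, x1_new):
    """Counterfactual X2 under do(X1 = x1_new) for X2 := X1 + sign(X1) * U2."""
    u2 = (x2 - x1) * sign(x1)     # abduction (dividing by sign equals multiplying)
    return x1_new + sign(x1_new) * u2

# Both models produce residuals X2 - X1 with the same N(0, 1) distribution.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(100_000)
u2 = rng.standard_normal(100_000)
resid_a = (x1 + u2) - x1
resid_b = (x1 + sign(x1) * u2) - x1

# The same factual individual receives different counterfactuals.
cf_a = float(cf_model_a(-1.0, 0.0, 1.0))
cf_b = float(cf_model_b(-1.0, 0.0, 1.0))
```

For the factual observation (x1, x2) = (-1, 0) and the intervention do(X1 = 1), model A predicts X2 = 2 while model B predicts X2 = 0, even though no amount of observational data could distinguish the two models.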

In the case of the CVAE-SCM model from (13), the setting is slightly less general than the above, since we additionally assume that: (i) the noise distributions are isotropic multivariate Gaussian distributions of fixed dimension, $z_r \sim \mathcal{N}_{d_{z_r}}(0, I)$; and (ii) the structural equations $D_r$ are from the class of functions that can be expressed as feedforward neural networks of fixed width and depth with learnable parameters $\psi_r$.

Unfortunately, we are not aware of any identifiability results for this particular setting, and further investigation into this matter is beyond the scope of the current work. It is interesting to note, however, that the CVAE-SCM from (13) can be understood as a non-linear extension of the linear Gaussian model with equal error variances considered by [37], for which identifiability has been shown.

In general, there seem to be very few works addressing identifiability of SCMs in the non-linear case; we refer to [38, §7.1] for an overview of existing results. Of particular interest for our setting is the post-nonlinear model of [65], which refers to the setting in which a non-linearity g is applied on top of an ANM, i.e., $X_r := g_r(f_r(X_{pa(r)}) + U_r)$, and for which complete conditions on $\{f_r, g_r\}$ have been provided that lead to identifiability. Given the form of the decoders $D_r$ (feedforward neural networks with stacked layers of simple non-linearities applied to linear transformations of the previous layers' output), it may be possible that the CVAE-SCM from (13) can be interpreted as a nested post-nonlinear model. We consider this an interesting direction, but leave further investigations into this matter for future work.


D Further details on CVAE training

To learn the CVAE latent variable models, we perform amortised variational inference with approximate posteriors q parameterised by encoders $E_r$ in the form of neural nets with parameters $\phi_r$,

$$p_{\psi_r}(z_r \mid x_r, x_{pa(r)}) \approx q_{\phi_r}(z_r \mid x_r, x_{pa(r)}) := \mathcal{N}(\mu_r, \sigma_r^2), \qquad (\mu_r, \sigma_r^2) := E_r(x_r, x_{pa(r)}; \phi_r). \tag{D.1}$$

The training objective in the form of the evidence lower bound (ELBO) given data $\{x^i\}_{i=1}^n$ is given by

$$\mathcal{L}_r(\psi_r, \phi_r) = \sum_{i=1}^{n} \mathbb{E}_{q_{\phi_r}(z \mid x^i_r, x^i_{pa(r)})}\Big[\big\|x^i_r - D_r(x^i_{pa(r)}, z; \psi_r)\big\|^2\Big] + \beta_r D_{KL}\Big(q_{\phi_r}(z \mid x^i_r, x^i_{pa(r)}) \,\Big\|\, p(z)\Big) \tag{D.2}$$

We learn both ψr and φr simultaneously via stochastic gradient descent on Lr, with gradients computed by Monte Carlo sampling from qφr with reparametrisation. Since the pairs of encoder and decoder parameters (ψr, φr) are independent for different r, this can be done in parallel.
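A Monte Carlo evaluation of the objective (D.2) for one conditional can be sketched in NumPy as below. This is only a schematic under stated assumptions: the encoder and decoder are passed in as plain callables (the actual models are neural networks), the prior p(z) is taken to be a standard Gaussian so that the KL term has a closed form, the KL term is summed over data points together with the reconstruction term, and the function name `cvae_objective` is hypothetical.

```python
import numpy as np

def cvae_objective(x_r, x_pa, encoder, decoder, beta, rng, n_mc=1):
    """Monte Carlo estimate of (D.2) for a single conditional X_r | X_pa(r).

    x_r:     array of shape (n,), child values
    x_pa:    array of shape (n, p), parent values
    encoder: callable (x_r, x_pa) -> (mu, sigma2), each of shape (n, dz)
    decoder: callable (x_pa, z) -> array of shape (n,)
    beta:    weight of the KL term
    """
    mu, sigma2 = encoder(x_r, x_pa)
    recon = 0.0
    for _ in range(n_mc):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.sqrt(sigma2) * eps            # reparametrised sample from q
        recon += np.sum((x_r - decoder(x_pa, z)) ** 2)
    recon /= n_mc
    # Closed-form KL( N(mu, sigma2) || N(0, I) ), summed over data and latent dims.
    kl = 0.5 * np.sum(mu ** 2 + sigma2 - 1.0 - np.log(sigma2))
    return recon + beta * kl

# Toy usage with hand-written (non-learned) encoder and decoder.
rng = np.random.default_rng(0)
x_pa = rng.standard_normal((8, 1))
x_r = -x_pa[:, 0] + rng.standard_normal(8)
toy_encoder = lambda xr, xpa: (0.1 * xr[:, None], np.ones((len(xr), 1)))
toy_decoder = lambda xpa, z: -xpa[:, 0] + z[:, 0]
loss = cvae_objective(x_r, x_pa, toy_encoder, toy_decoder, beta=0.01, rng=rng)
```

Because the sample z is written as a deterministic function of (μ, σ²) and independent noise, an automatic-differentiation framework could propagate gradients of this quantity to both parameter sets, which is what the reparametrisation mentioned above relies on.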

D.1 Hyperparameter selection for CVAE training

A CVAE model was trained for every Xr|Xpa(r) relation. Generally, hyperparameters were selected by comparing the distribution of real samples from the dataset against reconstructed samples from the trained CVAE obtained by sampling noise from the prior. The selection of hyperparameters was done either manually, or by performing a grid search over various encoder and decoder architectures, latent-space dimensions, and values of the hyperparameters βr that trade off the MSE and KL terms in the CVAE objective (D.2). For the case of automatic selection, the setup resulting in the smallest maximum mean discrepancy (MMD) statistic [14] between real and reconstructed samples was chosen as hyperparameter configuration. Further details on the search space considered and the selected values are provided in Table 5.
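The MMD-based selection criterion could be computed along the following lines. The sketch assumes a Gaussian (RBF) kernel with a fixed bandwidth and the biased estimator of the squared MMD; the kernel, the bandwidth, and the function names are illustrative choices, since the text does not specify them.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 2))
good_recon = rng.standard_normal((500, 2))        # same distribution as `real`
bad_recon = rng.standard_normal((500, 2)) + 3.0   # shifted distribution

scores = {"good": mmd2_biased(real, good_recon),
          "bad": mmd2_biased(real, bad_recon)}
best = min(scores, key=scores.get)                # configuration with smallest MMD
```

In an automatic search, `scores` would hold one entry per hyperparameter configuration, with the reconstructed samples drawn from the corresponding trained CVAE.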

Table 5: Selection of hyperparameters for CVAE training was either performed manually (for Linear SCM, Non-linear ANM, Non-additive SCM) or automatically (for 7-variable semi-synthetic loan approval) by selecting the setting that resulted in the minimum MMD statistic between real and reconstructed samples.

| SCM | Conditional | Encoder Arch. | Decoder Arch. | Latent Dim. | λKLD |
|---|---|---|---|---|---|
| Linear SCM | X2 \| X1 | 1×32×32×32 | 5×5×1 | 1 | 0.01 |
| Linear SCM | X3 \| X1, X2 | 1×32×32×32 | 32×32×32×1 | 1 | 0.01 |
| Non-linear ANM | X2 \| X1 | 1×32×32 | 32×32×1 | 5 | 0.01 |
| Non-linear ANM | X3 \| X1, X2 | 1×32×32×32 | 32×32×1 | 1 | 0.01 |
| Non-additive SCM | X2 \| X1 | 1×32×32×32 | 32×32×1 | 3 | 0.5 |
| Non-additive SCM | X3 \| X1, X2 | 1×32×32×32 | 5×5×1 | 3 | 0.1 |
| 7-variable semi-synthetic loan approval | any | 1×3×3, 1×5×5, 1×3×3×3 | 2×1, 2×2×1, 3×3×1, 5×5×1, 3×3×3×1 | 1, 2 | 5, 1, 0.5, 0.1, 0.05, 0.01, 0.005 |


E Experimental details, hyperparameter choices, and specification of SCMs

E.1 Specification of SCMs used in our experiments

The following is a specification of all SCMs used in our experiments on synthetic and semi-synthetic data, both for data generation and to evaluate the validity of recourse actions proposed by the different approaches by computing the corresponding counterfactual in the ground-truth SCMs.

In addition, we also specify the model used to generate training labels. Note, however, that these labels are only used to train a new classifier (e.g., a logistic regression, multi-layer perceptron, or random forest) from scratch: this is the h(x) referred to in the main paper. The label generating process is thus only used for obtaining labels to train a classifier on and is subsequently disregarded in favour of h.

In selecting the structural equations and label generating process, we tried to pick combinations that resulted in roughly centred features, as well as roughly balanced datasets (i.e., with a similar proportion of positive and negative training examples) that are not perfectly linearly-separable (i.e., with some class overlap). Moreover, we tried to select settings that result in a diverse set of intervention targets selected by the oracle for different factual instances, i.e., we try to avoid situations in which the optimal action is to always intervene on the same (set of) variable(s). To induce more interesting behaviour, we sample root nodes from mixtures of Gaussians.

E.1.1 3-variable synthetic SCMs used for Table 1

A visual summary of the 3-variable synthetic SCMs used for Table 1 is provided in Fig. 4.

[Figure 4 plots: pairwise grids over X1, X2, and X3 for panels (a) Linear SCM, (b) Non-linear ANM, and (c) Non-additive SCM.]

Figure 4: Histograms and scatter plots of pairwise feature relations for the synthetic 3-variable SCMs.

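Linear SCM: The linear 3-variable SCM consists of the following structural equations and noise distributions:

$$X_1 := U_1, \qquad U_1 \sim \mathrm{MoG}\big(0.5\,\mathcal{N}(-2, 1.5) + 0.5\,\mathcal{N}(1, 1)\big) \tag{E.1}$$

$$X_2 := -X_1 + U_2, \qquad U_2 \sim \mathcal{N}(0, 1) \tag{E.2}$$

$$X_3 := 0.05\,X_1 + 0.25\,X_2 + U_3, \qquad U_3 \sim \mathcal{N}(0, 1) \tag{E.3}$$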
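As an illustration of how this SCM can serve both for data generation and for computing ground-truth counterfactuals, the sketch below samples from (E.1) to (E.3) and applies abduction, action, and prediction to a factual instance. It assumes that the second argument of N(·, ·) denotes a variance; the function names are hypothetical.

```python
import numpy as np

def sample_linear_scm(n, rng):
    """Draw n samples (X1, X2, X3) from the linear SCM (E.1)-(E.3).

    Assumption: N(m, v) is parameterised by mean m and variance v.
    """
    first = rng.random(n) < 0.5                       # mixture component indicator
    u1 = np.where(first,
                  rng.normal(-2.0, np.sqrt(1.5), n),
                  rng.normal(1.0, 1.0, n))
    u2 = rng.standard_normal(n)
    u3 = rng.standard_normal(n)
    x1 = u1
    x2 = -x1 + u2
    x3 = 0.05 * x1 + 0.25 * x2 + u3
    return np.column_stack([x1, x2, x3])

def counterfactual_linear(x_factual, intervention):
    """Counterfactual of x_factual under hard interventions {index: value}."""
    x1, x2, x3 = x_factual
    # Abduction: recover the noise values consistent with the factual instance.
    u1 = x1
    u2 = x2 + x1
    u3 = x3 - 0.05 * x1 - 0.25 * x2
    # Action and prediction: replace intervened equations, propagate the rest.
    c1 = intervention.get(0, u1)
    c2 = intervention.get(1, -c1 + u2)
    c3 = intervention.get(2, 0.05 * c1 + 0.25 * c2 + u3)
    return np.array([c1, c2, c3])

rng = np.random.default_rng(0)
X = sample_linear_scm(100_000, rng)
x_cf = counterfactual_linear(np.array([1.0, 0.5, 0.2]), {0: 2.0})
```

Increasing X1 by one unit lowers X2 by one unit and changes X3 by 0.05 − 0.25 = −0.2, which is the downstream effect that independently-manipulable-feature approaches ignore.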

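Non-linear ANM: The non-linear 3-variable ANM consists of the following structural equations and noise distributions:

$$X_1 := U_1, \qquad U_1 \sim \mathrm{MoG}\big(0.5\,\mathcal{N}(-2, 1.5) + 0.5\,\mathcal{N}(1, 1)\big) \tag{E.4}$$

$$X_2 := -1 + \frac{3}{1 + e^{-2X_1}} + U_2, \qquad U_2 \sim \mathcal{N}(0, 0.1) \tag{E.5}$$

$$X_3 := -0.05\,X_1 + 0.25\,X_2^2 + U_3, \qquad U_3 \sim \mathcal{N}(0, 1) \tag{E.6}$$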

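Non-additive SCM: The non-additive 3-variable SCM consists of the following structural equations and noise distributions:

$$X_1 := U_1, \qquad U_1 \sim \mathrm{MoG}\big(0.5\,\mathcal{N}(-2.5, 1) + 0.5\,\mathcal{N}(2.5, 1)\big) \tag{E.7}$$

$$X_2 := 0.25\,\mathrm{sgn}(U_2)\,X_1^2\,(1 + U_2^2), \qquad U_2 \sim \mathcal{N}(0, 0.25) \tag{E.8}$$

$$X_3 := -1 + 0.1\,\mathrm{sgn}(U_3)\,(X_1^2 + X_2^2) + U_3, \qquad U_3 \sim \mathcal{N}(0, 0.25^2) \tag{E.9}$$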
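A sampler for the non-additive SCM can be sketched as follows, again under the assumption that the second argument of N(·, ·) is a variance (so U2 has standard deviation 0.5 and U3 has standard deviation 0.25). The function name is hypothetical, and the noise values are returned alongside the features only to make the non-additive dependence easy to inspect.

```python
import numpy as np

def sample_nonadditive_scm(n, rng):
    """Draw n samples from the non-additive SCM (E.7)-(E.9).

    Returns (X, U) with columns (X1, X2, X3) and (U1, U2, U3).
    Assumption: N(m, v) is parameterised by mean m and variance v.
    """
    first = rng.random(n) < 0.5
    u1 = np.where(first, rng.normal(-2.5, 1.0, n), rng.normal(2.5, 1.0, n))
    u2 = rng.normal(0.0, 0.5, n)      # variance 0.25
    u3 = rng.normal(0.0, 0.25, n)     # variance 0.25 ** 2
    x1 = u1
    x2 = 0.25 * np.sign(u2) * x1 ** 2 * (1.0 + u2 ** 2)
    x3 = -1.0 + 0.1 * np.sign(u3) * (x1 ** 2 + x2 ** 2) + u3
    return np.column_stack([x1, x2, x3]), np.column_stack([u1, u2, u3])

rng = np.random.default_rng(0)
X, U = sample_nonadditive_scm(10_000, rng)
```

Here the noise enters multiplicatively through sgn(U2) and (1 + U2²), so the sign of X2 is decided entirely by the noise; an additive-noise model fitted to such data is misspecified, which is the situation this SCM is meant to probe.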


Label generation: For all 3-variable SCMs, labels Y were sampled according to

$$Y \sim \mathrm{Bernoulli}\Big(\big(1 + e^{-2.5\,\rho^{-1}(X_1 + X_2 + X_3)}\big)^{-1}\Big) \tag{E.10}$$

where ρ is the average of (X1 +X2 +X3) across all training samples.
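The label-generating process can be sketched as below, reading the exponent in (E.10) as −2.5 ρ⁻¹ (X1 + X2 + X3). That reading, the function names, and the option to pass ρ explicitly are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function (1 + exp(-t)) ** -1."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def sample_labels(X, rng, rho=None):
    """Sample Y ~ Bernoulli(sigmoid(2.5 * (X1 + X2 + X3) / rho)), cf. (E.10).

    If rho is None, it is set to the average of (X1 + X2 + X3) over the rows
    of X, i.e. X is treated as the training set.
    """
    s = X.sum(axis=1)
    if rho is None:
        rho = s.mean()
    p = sigmoid(2.5 * s / rho)
    return rng.binomial(1, p), p

rng = np.random.default_rng(0)
X_demo = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
y_demo, p_demo = sample_labels(X_demo, rng, rho=3.0)
```

Since the labels are drawn from a Bernoulli distribution rather than thresholded, the classes overlap, consistent with the stated goal of datasets that are not perfectly linearly separable.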

E.1.2 7-variable semi-synthetic loan approval SCM used for Table 2

For the semi-synthetic dataset, we wanted to capture some relations between the involved variables that seemed somewhat intuitive to us and to some limited extent reflect a loan approval setting in the real world:

• loan amount and duration being largest for mid-aged people who may want to build a house and start a family, and smaller for younger and older people;

• loan duration increasing with loan amount due to an upper limit on monthly payments that can be afforded;

• savings increasing once income passes a certain (minimal-sustenance) threshold;

• income increasing with age;

• education increasing with age initially before eventually saturating;

• gender differences in income and (access to) education due to existing gender-discrimination and inequality of opportunities in the population.

A visual summary of the 7-variable semi-synthetic loan SCM is shown in Fig. 5.

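Semi-synthetic SCM: The loan approval SCM consists of the following structural equations and noise distributions:

$$G := U_G, \qquad U_G \sim \mathrm{Bernoulli}(0.5) \tag{E.11}$$

$$A := -35 + U_A, \qquad U_A \sim \mathrm{Gamma}(10, 3.5) \tag{E.12}$$

$$E := -0.5 + \Big(1 + e^{-\left(-1 + 0.5\,G + (1 + e^{-0.1A})^{-1} + U_E\right)}\Big)^{-1}, \qquad U_E \sim \mathcal{N}(0, 0.25) \tag{E.13}$$

$$L := 1 + 0.01\,(A - 5)(5 - A) + G + U_L, \qquad U_L \sim \mathcal{N}(0, 4) \tag{E.14}$$

$$D := -1 + 0.1\,A + 2\,G + L + U_D, \qquad U_D \sim \mathcal{N}(0, 9) \tag{E.15}$$

$$I := -4 + 0.1\,(A + 35) + 2\,G + G\,E + U_I, \qquad U_I \sim \mathcal{N}(0, 4) \tag{E.16}$$

$$S := -4 + 1.5\,\mathbb{I}_{\{I > 0\}}\,I + U_S, \qquad U_S \sim \mathcal{N}(0, 25) \tag{E.17}$$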

Note that variables in the above SCM often have a relative meaning in terms of deviation from the mean, e.g., we centre the Gamma-distributed age around its mean of 35, so that A has the meaning of “age-difference from the mean of 35” (and similarly for other variables).

Label generation: Labels Y were sampled according to

$$Y \sim \mathrm{Bernoulli}\Big(\big(1 + e^{-0.3\,(-L - D + I + S + IS)}\big)^{-1}\Big). \tag{E.18}$$

Note that this label generation process only depends on loan duration and amount, income and savings, but not on gender, age or education level.
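The semi-synthetic SCM and its label process can be sampled as in the sketch below. It assumes that Gamma(10, 3.5) is parameterised by shape 10 and scale 3.5 (which gives the stated mean of 35) and that the second argument of N(·, ·) is a variance; the function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function (1 + exp(-t)) ** -1."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def sample_loan_scm(n, rng):
    """Draw n samples from the loan approval SCM (E.11)-(E.17) with labels (E.18).

    Assumptions: Gamma(shape=10, scale=3.5); N(m, v) uses variance v.
    """
    G = rng.binomial(1, 0.5, n).astype(float)
    A = -35.0 + rng.gamma(shape=10.0, scale=3.5, size=n)
    E = -0.5 + sigmoid(-1.0 + 0.5 * G + sigmoid(0.1 * A) + rng.normal(0.0, 0.5, n))
    L = 1.0 + 0.01 * (A - 5.0) * (5.0 - A) + G + rng.normal(0.0, 2.0, n)
    D = -1.0 + 0.1 * A + 2.0 * G + L + rng.normal(0.0, 3.0, n)
    I = -4.0 + 0.1 * (A + 35.0) + 2.0 * G + G * E + rng.normal(0.0, 2.0, n)
    S = -4.0 + 1.5 * (I > 0) * I + rng.normal(0.0, 5.0, n)
    p = sigmoid(0.3 * (-L - D + I + S + I * S))
    Y = rng.binomial(1, p)
    return {"G": G, "A": A, "E": E, "L": L, "D": D, "I": I, "S": S, "Y": Y}

rng = np.random.default_rng(0)
data = sample_loan_scm(20_000, rng)
```

The variables are generated in a topological order of the causal graph (G and A first, then E and L, then D, I, and S), so each structural equation only reads values that have already been drawn.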


[Figure 5 plot: pairwise grid over the variables G, A, E, L, D, I, and S.]

Figure 5: Histograms and scatter plots of pairwise feature relations for the semi-synthetic loan SCM.


F Derivation of a Monte-Carlo estimator for the gradient of the variance

We now derive an estimator for the gradient of the square-root of the variance (i.e., standard deviation) of h over the interventional or counterfactual distribution of Xd(I) w.r.t. θ, which appears (multiplied by γLCB) in the threshold thresh(a) of the optimisation constraint/regulariser.

First, we use the chain rule of differentiation to write

$$\nabla_\theta \sqrt{\mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big]} = \frac{\nabla_\theta \mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big]}{2\sqrt{\mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big]}} \tag{F.1}$$

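Next, we write the variance as an expectation and, assuming the interventional or counterfactual distribution of Xd(I) admits reparametrisation (as is the case for the GP-SCM and CVAE models used in this paper), use the reparametrisation trick to differentiate through the expectation operator as in (15).

\begin{align}
&\nabla_\theta \mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big] \tag{F.2} \\
&= \nabla_\theta \mathbb{E}_{X_{d(I)}}\Big[\big(h(X_{d(I)}, \theta, x^F_{nd(I)}) - \mathbb{E}_{X'_{d(I)}}\big[h(X'_{d(I)}, \theta, x^F_{nd(I)})\big]\big)^2\Big] \tag{F.3} \\
&= \nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[\big(h(x_{d(I)}(z; \theta), \theta, x^F_{nd(I)}) - \mathbb{E}_{z' \sim \mathcal{N}(0, I)}\big[h(x_{d(I)}(z'; \theta), \theta, x^F_{nd(I)})\big]\big)^2\Big] \tag{F.4} \\
&= \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[\nabla_\theta \big(h(x_{d(I)}(z; \theta), \theta, x^F_{nd(I)}) - \mathbb{E}_{z' \sim \mathcal{N}(0, I)}\big[h(x_{d(I)}(z'; \theta), \theta, x^F_{nd(I)})\big]\big)^2\Big] \tag{F.5} \\
&= \mathbb{E}_{z \sim \mathcal{N}(0, I)}\Big[2\,\big(h(x_{d(I)}(z; \theta), \theta, x^F_{nd(I)}) - \mathbb{E}_{z' \sim \mathcal{N}(0, I)}\big[h(x_{d(I)}(z'; \theta), \theta, x^F_{nd(I)})\big]\big) \tag{F.6} \\
&\qquad \times \big(\nabla_\theta h(x_{d(I)}(z; \theta), \theta, x^F_{nd(I)}) - \mathbb{E}_{z' \sim \mathcal{N}(0, I)}\big[\nabla_\theta h(x_{d(I)}(z'; \theta), \theta, x^F_{nd(I)})\big]\big)\Big] \tag{F.7}
\end{align}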

We can now obtain an estimate of the gradient with two independent sets of Monte Carlo samples of Xd(I), drawn via reparametrisation from the interventional or counterfactual distribution,

$$\big\{x^{(m)}_{d(I)} := x_{d(I)}(z^{(m)}; \theta)\big\}_{m=1}^{M}, \quad \big\{x^{(m')}_{d(I)} := x_{d(I)}(z^{(m')}; \theta)\big\}_{m'=1}^{M'} \quad \text{where } z^{(m)}, z^{(m')} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I). \tag{F.8}$$

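This yields the following Monte Carlo gradient estimator of the variance:

\begin{align}
\nabla_\theta \mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big] \approx \frac{1}{M} \sum_{m=1}^{M} \Big[ &2\,\Big(h(x^{(m)}_{d(I)}, \theta, x^F_{nd(I)}) - \frac{1}{M'} \sum_{m'=1}^{M'} h(x^{(m')}_{d(I)}, \theta, x^F_{nd(I)})\Big) \tag{F.9} \\
&\times \Big(\nabla_\theta h(x^{(m)}_{d(I)}, \theta, x^F_{nd(I)}) - \frac{1}{M'} \sum_{m'=1}^{M'} \nabla_\theta h(x^{(m')}_{d(I)}, \theta, x^F_{nd(I)})\Big)\Big] \tag{F.10}
\end{align}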

Substituting the above expression, together with the following Monte Carlo estimate of the (undifferentiated) variance

$$\mathbb{V}_{X_{d(I)}}\big[h(X_{d(I)}, \theta, x^F_{nd(I)})\big] \approx \frac{1}{M - 1} \sum_{m=1}^{M} \Big(h(x^{(m)}_{d(I)}, \theta, x^F_{nd(I)}) - \frac{1}{M'} \sum_{m'=1}^{M'} h(x^{(m')}_{d(I)}, \theta, x^F_{nd(I)})\Big)^2, \tag{F.11}$$

into (F.1) gives the desired estimate for the gradient of the standard deviation of h.
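The estimator can be checked numerically on a toy problem. The sketch below assumes a scalar θ, a single descendant with reparametrisation x_d(z; θ) = aθ + sz, and a logistic classifier h; these modelling choices, the parameter values, and the function names are illustrative and not taken from the experiments. It compares the estimate built from (F.9) to (F.11) and (F.1) against a finite-difference approximation.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function (1 + exp(-t)) ** -1."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def h_and_grad(theta, z, a=0.8, s=1.0, w_x=1.5, w_theta=0.5, b=-1.0):
    """Toy classifier output h and its total derivative w.r.t. theta.

    Assumed reparametrisation of the descendant: x_d(z; theta) = a * theta + s * z.
    """
    x_d = a * theta + s * z
    h = sigmoid(w_x * x_d + w_theta * theta + b)
    dh = h * (1.0 - h) * (w_x * a + w_theta)      # chain rule through x_d and theta
    return h, dh

def grad_std_mc(theta, z, z_prime):
    """Monte Carlo estimate of d/dtheta sqrt(Var[h]) via (F.9)-(F.11) and (F.1)."""
    h, dh = h_and_grad(theta, z)
    h_p, dh_p = h_and_grad(theta, z_prime)
    grad_var = np.mean(2.0 * (h - h_p.mean()) * (dh - dh_p.mean()))   # (F.9)-(F.10)
    var = np.sum((h - h_p.mean()) ** 2) / (len(z) - 1)                # (F.11)
    return grad_var / (2.0 * np.sqrt(var))                            # (F.1)

rng = np.random.default_rng(0)
M = 200_000
z, z_prime = rng.standard_normal(M), rng.standard_normal(M)
theta = 2.0
estimate = grad_std_mc(theta, z, z_prime)

# Central finite difference of the sample standard deviation, reusing the same z.
eps = 1e-4
std_plus = np.std(h_and_grad(theta + eps, z)[0])
std_minus = np.std(h_and_grad(theta - eps, z)[0])
finite_diff = (std_plus - std_minus) / (2.0 * eps)
```

With a large number of samples the two quantities should agree closely; in practice far fewer samples would be used per gradient step, at the price of a noisier estimate.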
