Preprint. Under review. arXiv:2204.09938v3 [stat.ML] 17 Jul 2022

Ultra-marginal Feature Importance

Joseph Janssen
Department of Earth, Ocean and Atmospheric Sciences
University of British Columbia
Vancouver
[email protected]

Vincent Guan
Department of Mathematics
University of British Columbia
Vancouver
[email protected]

Abstract

Scientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal feature importance methods, such as marginal contribution feature importance (MCI), attempt to break this trend by providing a useful framework for quantifying the relationships in data in an interpretable fashion. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses preprocessing methods from the AI fairness literature to remove dependencies in the feature set prior to model evaluation. We show on real and simulated data that UMFI performs at least as well as MCI, with significantly better performance in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and substantially reducing the exponential runtime of MCI to super-linear.

1 Introduction

Scientists often seek to determine the true relationships between a set of characteristics and some outcome of interest. These relationships are ideally determined by performing carefully controlled experiments so that causality can be established. However, experiments can be difficult and costly to pursue, unethical to perform, or impossible to control [78], leaving only observational data available. The relationships that are hidden within vast quantities of observational data are often difficult to determine, so statistical tools, such as feature importance, have been explored. Feature importance methods quantify and, more importantly, rank how related explanatory features are to a response.

There is no consensus on how to define feature importance, so choosing an appropriate method strongly depends on the questions that the user seeks to answer [74, 36, 16]. The most widely referenced distinction between feature importance methods is the conditional versus marginal divide, which defines the two extremes of feature importance methods [26]. This division is sometimes also framed as true to the model versus true to the data [18]. The distinction between these two types of methods is only evident in the presence of dependent features. Indeed, if all explanatory features in a dataset are independent, conditional and marginal importances are exactly the same [26, 57]. Conditional methods are typically used for feature selection, dimensionality reduction, and improving prediction performance. On the other hand, marginal methods are specifically developed to explain and interpret data [36], seeking to avoid the tendency of conditional methods to explain the model and give correlated features too little importance [18]. For example, if a scientist wants to build a database of genes associated with some disease, then a marginal approach could be preferred [26, 16, 18]. Alternatively, if an engineer wants to determine the smallest number of features to obtain good model predictions, a conditional importance metric would be more suitable [36, 16]. A more detailed discussion about the differences between marginal and conditional methods is provided in Appendix A.2.


Recently, feature importance methods such as Shapley values [62, 20, 52], SAGE [23], accumulated local effects (ALE) [4], permutation importance [11], and conditional permutation importance (CPI) [26] have been used in high-impact journal papers by scientists who want to explain the mechanisms within data [2, 7, 65, 44, 60, 34, 40]. However, these methods do not adequately account for correlated features and feature interactions. ALE can only easily show first-order effects [56], and although CPI improves upon some limitations of permutation importance, CPI has the property that two perfectly correlated features with significant predictive power would both be deemed unimportant [23]. Further, only one model is trained in ALE, CPI, and permutation importance. Thus, correlated features, which can alter the model assembly process, could be given artificially low importance if the goal is to explain the data [38]. Developers of feature importance methods like SAGE and marginal contribution feature importance (MCI) have attempted to address these issues by evaluating the difference in accuracy between a model trained with the feature of interest and a model trained without it, across all feature subsets [17, 23]. In particular, MCI was shown to have better quality and robustness when compared to Shapley values, SAGE, ablation, and bivariate methods [17].

In this paper, we introduce ultra-marginal feature importance (UMFI), a new marginal feature importance method that performs at least as well as previous methods while drastically reducing runtime. UMFI was developed to overcome three key shortcomings of MCI. First, MCI is given by a maximization problem that requires searching across all subsets of the feature space $F$. Exact computation requires an exponential number of model trainings, which makes MCI ineffective for interpreting large datasets (e.g., gene expression studies). Second, although it can handle complex feature interactions and data with correlated features, MCI underestimates the importance of correlated features that interact in the expression for the response variable. Third, MCI can give non-zero importance to features that are completely unrelated to the response variable.

The rest of this paper is organized as follows. Axioms for explaining the data are proposed in Section 2. The framework for UMFI is then formally presented in Section 3 along with its theoretical properties and its simple algorithm. In Section 4, we conduct experiments on simulated and real data to assess the quality, robustness, and time complexity of UMFI compared to MCI. Finally, an overview of the work, its limitations, and ideas for future work are discussed in Section 5.

Related work

This paper is greatly inspired by the development of marginal contribution feature importance (MCI) by Catav et al. [17]. Although other methods, such as SAGE [23], have been retooled to better explain data [16], up until this point, MCI had been the only feature importance method developed specifically to explain data. Let $F$ be the set of features used to predict the response variable, $Y$. Given an evaluation function $\nu$ (e.g., random forest's OOB $R^2$), Catav et al. [17] defined the marginal contribution feature importance (MCI) of a feature $f \in F$ as

$$I_\nu(f) = \max_{S \subseteq F} \left[ \nu(S \cup \{f\}) - \nu(S) \right]. \tag{1}$$
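To make the cost of Equation 1 concrete, the following is a minimal brute-force sketch rather than the implementation of Catav et al. [17]. The evaluation function `toy_nu` and the feature names `a`, `b`, `c` are hypothetical: the toy function rewards a model only when the two interacting features `a` and `b` are both present. Subsets that already contain $f$ contribute a gain of exactly zero, so the search starts from zero and enumerates only subsets of $F \setminus \{f\}$.

```python
from itertools import combinations


def mci(feature, features, nu):
    """Brute-force Equation 1: max over S of nu(S | {f}) - nu(S).

    Returns the importance and the number of calls made to nu.
    """
    others = [g for g in features if g != feature]
    best = 0.0  # any S that already contains f yields a gain of exactly 0
    evaluations = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            best = max(best, nu(s | {feature}) - nu(s))
            evaluations += 2
    return best, evaluations


def toy_nu(s):
    """Hypothetical evaluation function: a and b are only useful together."""
    return 1.0 if {"a", "b"} <= s else 0.0


FEATURES = ["a", "b", "c"]
for name in FEATURES:
    print(name, mci(name, FEATURES, toy_nu))
```

With $p$ features, the loop visits $2^{p-1}$ subsets per feature, which is the exponential cost that motivates UMFI.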

The machine learning evaluation function $\nu$ aims to approximate the universal predictive power of some set of features. Let $y$ be the optimal constant predictor and $G(S)$ be the set of all predictive models restricted to using features in $S \subseteq F$. Given a loss function $l$, the universal predictive power satisfies

$$\nu(S) = \min_{y} \mathbb{E}[l(y, Y)] - \min_{g \in G(S)} \mathbb{E}[l(g(S), Y)]. \tag{2}$$

We found that MCI has issues with correlated interactions and that it occasionally provides importance to features that are completely unimportant. Additionally, since MCI is a subset-based importance metric, its exponential runtime is not suitable for large feature sets. Our goal is to introduce a marginal feature importance metric that performs at least as well as MCI while improving upon its shortcomings. To achieve this, when evaluating the importance of a feature of interest $f$, we preprocess the data to remove dependencies on $f$. Ideally, this would be done in a manner that preserves as much information from the original features as possible. Since many of the complications for computing feature importance emerge from dependencies within the data, removing dependencies on the feature of interest enables accurate scores across diverse settings. This also eliminates the need for subset-based scoring, and hence drastically reduces runtime compared to previous methods.

Finding orthogonal predictors for resolving the controversies of feature importance in the presence of multicollinearity is not strictly new [33]. However, the discussion of orthogonal predictors for better feature importance methods has been seemingly limited to applications to multiple linear regression, mostly in the domain of psychology [9, 79]. While the techniques for orthogonalizing predictors have been limited to fairly simple linear algebra, more advanced and more general dependency removal methods have seen great progress within the AI fairness and privacy literature. The need for these independent and information-preserving data representations for better feature importance methods has even been mentioned as future work in König et al. [47] and Chen et al. [18]. Some examples of these techniques include linear regression [10], optimal transport [43], neural networks [15, 63], convex optimization [15], and principal inertial components [71]. Linear regression and optimal transport were implemented for UMFI in this paper.

2 Axioms for explaining data

Any attempt to build a method that explains the data should begin by rigorously defining what explaining the data truly means. Different definitions and goals have been formulated by Chen et al. [18] and Catav et al. [17]. Inspired by these definitions, we attempt a more accurate, justified, and rigorous definition. Chen et al. [18] suggest that a method that is true to the data should spread importance among correlated features rather than sparsely choosing features that are needed for model performance and should aim to understand the underlying causal graph. Instead of these more qualitative descriptors, Catav et al. [17] define a set of axioms and an additional set of mathematical properties, including the marginal contribution axiom, elimination axiom, minimalism axiom, duplication invariance property, and self contribution property, among many others.

Given a feature set $F$, a response $Y$, and a feature of interest $f \in F$, the feature importance of $f$ is defined as $\mathrm{Imp}_{F,Y}(f) \in \mathbb{R}_{\geq 0}$. We define the following three axioms as vital for any method that claims to explain the data:

1. Elimination axiom: Eliminating a feature from the feature set $F$ can only decrease the importance of the feature of interest:

$$\mathrm{Imp}_{F \setminus \{x\},Y}(f) \leq \mathrm{Imp}_{F,Y}(f).$$

2. Duplication invariance and symmetry axiom: Adding a duplicate copy $\tilde{x} = x$ of a feature $x \in F$ already in the feature set will not change the importance of the feature of interest, and the duplicated feature will have importance equal to the original feature:

$$\forall f \in F, \quad \mathrm{Imp}_{F,Y}(f) = \mathrm{Imp}_{F \cup \{\tilde{x}\},Y}(f) \quad \text{and} \quad \mathrm{Imp}_{F \cup \{\tilde{x}\},Y}(x) = \mathrm{Imp}_{F \cup \{\tilde{x}\},Y}(\tilde{x}).$$

3. Blood relation axiom: If data is generated from a causal graph, feature $f$ will be given non-zero and positive importance if and only if it is blood related to the response $Y$ in the causal graph. Two vertices in a causal graph are said to be blood related if there is a directed path between them or if there is a backdoor path between them via a common ancestor.

$$\mathrm{Imp}_{F,Y}(f) > 0 \iff f \in \mathrm{BR}(Y).$$

The elimination axiom comes directly from Catav et al. [17]. In part, the justification for this axiom comes from the fact that a feature's relation to the response may only be revealed when another feature it interacts with is observed. Thus, in cases where this synergistic feature is dropped, the features it has interactive synergies with should have decreased importance. Further, this axiom is justified by the fact that once a feature is observed to be significantly related to the response, the relationship strength between the feature and response should not drop, regardless of the additional features added.

The duplication invariance and symmetry axiom is one of the key properties that separates feature importance methods that are for data explanation rather than model exploration or optimization [17]. A model may equally spread the use of the relevant information among the two features (random forests or elastic net), or only one of the perfectly correlated features may be selected (lasso) [18]. However, from the data's perspective, both features should be equally related to the response and the original importance found before duplication should still be true. After duplication, no additional interaction capability is available [35], so feature importance across all features should not increase, and from the elimination axiom, the importances should not decrease either.

The blood relation axiom asserts that feature importance scores intended for data explanation should extract reliable knowledge about the underlying causal graph and data generating process. In fact, a statistical association between a feature and the response exists precisely when the two are blood related, or equivalently, when there is an open path between them [76]. Thus, a feature importance metric satisfying this axiom would give non-zero importance to a feature if and only if there is a statistical association between that feature and the response. Additionally, if the goal is to construct a causal graph to represent the relationships in the data, then a feature importance metric satisfying this axiom can partition the feature set into features that are blood related to the response and features that are not blood related to the response (features with either no path to the response or only a closed path to the response).

We intentionally exclude some of the MCI axioms and properties included by Catav et al. [17]. For example, the marginal contribution axiom is not included because it conflicts directly with the blood relation axiom. The collider example presented by Harel et al. [37] uses the causal graph $Y \leftarrow S \rightarrow G \leftarrow E$, where $S$ is unmeasured. When predicting the response $Y$, the marginal contribution axiom requires that feature $E$ is given importance since if we know $G$, then feature $E$ can help predict the response. However, feature $E$ has no relation to the response $Y$, so it would be more reasonable to give $E$ zero importance. Indeed, $E$ is given zero importance under the blood relation axiom, so the blood relation axiom is more reasonable and justified compared to the marginal contribution axiom.
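The blood relation in the collider example can be checked mechanically. The sketch below is illustrative only: it encodes a graph as a hypothetical `{child: [parents]}` dictionary and uses the fact that two vertices are blood related exactly when their ancestor sets (each including the vertex itself) intersect, which covers both the directed-path case and the common-ancestor case.

```python
def ancestors(node, parents):
    """Return all ancestors of `node` in a DAG given as {child: [parents]}."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def blood_related(a, b, parents):
    """True if a directed path joins a and b or they share a common ancestor."""
    up_a = ancestors(a, parents) | {a}
    up_b = ancestors(b, parents) | {b}
    return bool(up_a & up_b)


# Collider example: Y <- S -> G <- E
collider = {"Y": ["S"], "G": ["S", "E"]}
for feature in ("S", "G", "E"):
    print(feature, blood_related(feature, "Y", collider))
```

Here `G` is blood related to `Y` through the common ancestor `S`, while `E` reaches `Y` only through the collider `G` and is therefore not blood related.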

3 Ultra-marginal feature importance

Let $F = \{f_1, \ldots, f_p\}$ be a set of $p$ features of arbitrary type used to predict the response $Y$. We note that features may be viewed as random variables, or realizations of random variables according to their joint distribution in the form of a dataset. We define the space of information subsets of a feature set $F$ as $I(F) = \{g(S) : S \subseteq F\}$ where $g$ is any function that may act on $S$. We call these information subsets of $F$ because $I(Y; g(S)) \leq I(Y; S) \leq I(Y; F)$ holds for any function $g$ and any $S \subseteq F$ by Theorem A.3 and the monotonicity of mutual information.

Given an evaluation function $\nu : I(F) \to \mathbb{R}_{\geq 0}$ and a feature set $F$, we define the ultra-marginal feature importance (UMFI) of a feature $f \in F$ as

$$U_\nu^{F,Y}(f) = \nu(S_f^F \cup \{f\}) - \nu(S_f^F). \tag{3}$$

Definition 1. We define $S_f^F$ as the preprocessed feature set after dependencies on the feature of interest $f$ have been removed from $F$. We say that the preprocessed feature set, $S_f^F$, is optimal if it obeys the following properties:

1. $S_f^F = g(F)$

2. $S_f^F \perp\!\!\!\perp f$

3. $I(Y; S_f^F, f) = I(Y; F)$

The first property states that the modified feature set $S_f^F$ is obtained via a function of $F$, which ensures that $S_f^F \in I(F)$. The second property upholds that $S_f^F$ is completely independent of $f$, and the last property affirms the optimality of $S_f^F$ in the sense that there is no unnecessary information loss incurred during preprocessing with respect to the mutual information between the response $Y$ and the original feature set $F$. We note that the optimal preprocessed feature set is not necessarily unique, and by Theorem A.3, the last condition is met when the function $g$, used to obtain $S_f^F$, is injective. In practice, the last two properties can be difficult to guarantee, but we see later in Section 4 that UMFI can provide accurate feature importance scores even without optimal preprocessings.
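As an illustration of why the second property is hard to guarantee, consider a minimal sketch that removes dependence on $f$ by taking least-squares residuals. This is an assumed, simplified form of linear-regression preprocessing, not the authors' Appendix E implementation, and the helper names are hypothetical. The residuals are uncorrelated with $f$ by construction, but a nonlinear dependence (here a feature equal to $f^2$) survives the preprocessing.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def residualize(g, f):
    """Residuals of a simple least-squares regression of g on f."""
    slope = covariance(g, f) / covariance(f, f)
    mg, mf = mean(g), mean(f)
    return [gi - mg - slope * (fi - mf) for gi, fi in zip(g, f)]


random.seed(0)
f = [random.gauss(0, 1) for _ in range(1000)]
g_linear = [fi + random.gauss(0, 1) for fi in f]  # linear dependence on f
g_square = [fi * fi for fi in f]                  # nonlinear dependence on f

res_linear = residualize(g_linear, f)
res_square = residualize(g_square, f)

print(covariance(res_linear, f))         # numerically zero
print(covariance(res_square, f))         # also numerically zero
print(covariance(res_square, g_square))  # far from zero: still a function of f
```

Zero correlation coincides with independence for jointly Gaussian variables, which is one reason the linear approach can work well in that setting; more general dependence needs a stronger method such as optimal transport.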

Next, we prove that UMFI obeys the three axioms given in Section 2 under certain assumptions. One of the main assumptions we make is that $\nu(F)$ behaves similarly to the mutual information function $I(Y; F)$. In the case of classification, the evaluation function is equivalent to mutual information in the idealized setting where we know the Bayes classifier and $\nu$ is defined with the cross-entropy loss function [23]. In the case of regression, one can also closely relate mutual information to the explained variance of a model. See Covert et al. [24] and Appendix A.3 for a more thorough overview of this relationship. Thus, we assert that this is a reasonable assumption. In practice, the accuracy of the approximation $\nu(S) \approx I(Y; S)$ depends on the quality of the machine learning model, the specified loss function, and the response variable's distribution [24].


Theorem 3.1 (Elimination axiom). When preprocessing is performed using optimal transport with chaining and when $\nu(S) = I(Y; S)$, $U_\nu^{F,Y}(f) \leq U_\nu^{F \cup \{x\},Y}(f)$.

Proof. Let $S_f^{F \cup \{x\}}$ be the preprocessed version of $F \cup \{x\}$ and let $S_f^F$ be the preprocessed version of $F$. By optimal transport with chaining [43], we may assume that $S_f^{F \cup \{x\}} = S_f^F \cup \{x\}$ and that $S_f^F$, $f$, and $x$ are mutually independent. It follows from supermodularity of mutual information under mutual independence (Theorem A.1) that

$$U_\nu^{F \cup \{x\},Y}(f) = I(Y; S_f^{F \cup \{x\}}, f) - I(Y; S_f^{F \cup \{x\}}) = I(Y; S_f^F, x, f) - I(Y; S_f^F, x) \geq I(Y; S_f^F, f) - I(Y; S_f^F) = U_\nu^{F,Y}(f).$$

Theorem 3.2 (Duplication invariance and symmetry axiom). Let $x \in F$, $\tilde{x} \notin F$, and $\tilde{x} = x$. Suppose that $\nu(S) = I(Y; S)$ and that all preprocessed feature sets $S_f^F$ and $S_f^{F \cup \{\tilde{x}\}}$ are optimal preprocessings. Then, $\forall f \in F$, $U_\nu^{F,Y}(f) = U_\nu^{F \cup \{\tilde{x}\},Y}(f)$ and $U_\nu^{F \cup \{\tilde{x}\},Y}(x) = U_\nu^{F \cup \{\tilde{x}\},Y}(\tilde{x})$.

Proof. To prove the claims, we first show that every $S_f^F$ which satisfies the optimality conditions for $S_f^F$ listed in Definition 1 also satisfies the optimality conditions for $S_f^{F \cup \{\tilde{x}\}}$, and that every $S_f^{F \cup \{\tilde{x}\}}$ which satisfies the optimality conditions for $S_f^{F \cup \{\tilde{x}\}}$ also satisfies the optimality conditions for $S_f^F$. We prove this for all $f \in F$.

The first two properties in Definition 1 follow immediately from the fact that a function with repeated arguments can be defined to be equal to the same function without repeated arguments and the fact that both $S_f^F$ and $S_f^{F \cup \{\tilde{x}\}}$ must be independent of $f$ by optimality. Then, since mutual information is invariant under duplicate information and since $S_f^F$ and $S_f^{F \cup \{\tilde{x}\}}$ are optimal, we know that

$$I(Y; F, \tilde{x}) = I(Y; S_f^{F \cup \{\tilde{x}\}}, f) = I(Y; F) = I(Y; S_f^F, f).$$

Hence, we may assume that $S_f^F$ and $S_f^{F \cup \{\tilde{x}\}}$ are interchangeable for all $f \in F$, and it follows that

$$U_\nu^{F,Y}(f) = I(Y; S_f^F, f) - I(Y; S_f^F) = I(Y; S_f^{F \cup \{\tilde{x}\}}, f) - I(Y; S_f^{F \cup \{\tilde{x}\}}) = U_\nu^{F \cup \{\tilde{x}\},Y}(f).$$

Finally, since $\tilde{x} = x$, we can see that $S_x^{F \cup \{\tilde{x}\}}$ and $S_{\tilde{x}}^{F \cup \{\tilde{x}\}}$ are interchangeable, which proves the symmetry axiom

$$U_\nu^{F \cup \{\tilde{x}\},Y}(x) = U_\nu^{F \cup \{\tilde{x}\},Y}(\tilde{x}).$$

We note that this proof holds when $S_f^F$ and $S_f^{F \cup \{\tilde{x}\}}$ are interchangeable. This fact also holds when the removal of dependencies on a feature $f$ is done in a pairwise fashion (see Algorithm 3) or via chaining [43].

Theorem 3.3 (Blood relation axiom). Let $\nu(F) = I(Y; F)$. Assuming the data is generated from a Gaussian graphical model obeying the global Markov property and faithfulness, $U_\nu^{F,Y}(f) > 0$ if and only if $f \in \mathrm{BR}(Y)$.

Proof. To start, we know $U_\nu^{F,Y}(f) = I(Y; S_f^F, f) - I(Y; S_f^F) = I(Y; f \mid S_f^F)$. And from the definition of conditional mutual information, we know $U_\nu^{F,Y}(f) = 0 \iff I(Y; f \mid S_f^F) = 0 \iff Y \perp\!\!\!\perp f \mid S_f^F$. Since we have a Gaussian graphical model, where all variables are jointly Gaussian, we may write the conditional independence statements in terms of covariance block matrices

$$Y \perp\!\!\!\perp f \mid S_f^F \iff \Sigma_{f,Y} - \Sigma_{f,S_f^F} \, \Sigma_{S_f^F,S_f^F}^{-1} \, \Sigma_{S_f^F,Y} = 0. \tag{4}$$

Since $f \perp\!\!\!\perp S_f^F$ and $\Sigma_{X,Y} = 0 \iff X \perp\!\!\!\perp Y$, we know that $Y \perp\!\!\!\perp f \mid S_f^F \iff f \perp\!\!\!\perp Y$.
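Written out, the substitution behind this step is short. The following is a sketch that uses only Equation 4 and the independence of $f$ and $S_f^F$, under the proof's assumption that all variables are jointly Gaussian:

```latex
\begin{align*}
f \perp\!\!\!\perp S_f^F
  &\implies \Sigma_{f,S_f^F} = 0 \\
  &\implies \Sigma_{f,Y} - \Sigma_{f,S_f^F}\,\Sigma_{S_f^F,S_f^F}^{-1}\,\Sigma_{S_f^F,Y} = \Sigma_{f,Y}.
\end{align*}
```

The condition in Equation 4 therefore reduces to $\Sigma_{f,Y} = 0$, which is equivalent to $f \perp\!\!\!\perp Y$ for jointly Gaussian variables.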

All that is left to prove is $f \perp\!\!\!\perp Y \iff f \notin \mathrm{BR}(Y)$. First, if $f \notin \mathrm{BR}(Y)$, then $f \perp\!\!\!\perp Y$ follows from the global Markov property and the fact that $f$ and $Y$ are d-separated by the empty set. Indeed, every path from $f$ to $Y$ must have at least one collider. We consider two cases. (1) The edge coming out of $Y$ is outgoing. Then since $f$ is not a descendant of $Y$, the path must reverse its orientation at some vertex before meeting $f$. That vertex is a collider. (2) The edge connecting to $Y$ points towards $Y$. Then the path must reverse its orientation at some point since $f$ is not an ancestor of $Y$. The path must then reverse another time because otherwise, $f$ would share a common ancestor with $Y$ (the vertex of the first reversal). The vertex with the second reversal is a collider.

Conversely, let $f \in \mathrm{BR}(Y)$. By the faithfulness assumption, it suffices to show that $f$ and $Y$ are d-connected by the empty set. Since $f \in \mathrm{BR}(Y)$, there are two possible cases: either there is a directed path between $f$ and $Y$, or $f$ and $Y$ share a common ancestor. In the first case, we simply choose the directed path between $f$ and $Y$ and observe that there cannot be a collider. Similarly, in the second case, we may pick the path beginning at $Y$ and trace it up to the common ancestor and then travel to $f$. There can be no colliders along the path since every vertex has at least one outgoing edge by construction. Also, the empty set cannot contain any non-colliders.

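Thus, we are left with

$$U_\nu^{F,Y}(f) = 0 \iff I(Y; f \mid S_f^F) = 0 \iff Y \perp\!\!\!\perp f \mid S_f^F \iff f \perp\!\!\!\perp Y \iff f \notin \mathrm{BR}(Y),$$

or equivalently,

$$U_\nu^{F,Y}(f) > 0 \iff I(Y; f \mid S_f^F) > 0 \iff Y \not\perp\!\!\!\perp f \mid S_f^F \iff f \not\perp\!\!\!\perp Y \iff f \in \mathrm{BR}(Y),$$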

which completes the proof.

We also prove in Appendix C via partial information decomposition that UMFI obeys the blood relation axiom for all features $f \in F$ if there is no interaction information between $f$ and $S_f^F$ [75, 35].

Since UMFI is model-agnostic, we provide a general algorithm for computing the ultra-marginal feature importance of a feature $f \in F$ which can be applied using any pair of preprocessing and modelling techniques (Algorithm 1).

Algorithm 1: Algorithm for computing UMFI
1: Let $Y$ be the response variable of the set of predictors $F$. Choose a feature $f \in F$.
2: Estimate $S_f^F$ by removing dependencies on $f$ from $F$ with minimal information loss.
3: Specify a model and evaluation function $\nu$.
4: Train a model using features $S_f^F$ to predict $Y$ and compute $\nu(S_f^F)$.
5: Train a model using features $(S_f^F, f)$ to predict $Y$ and compute $\nu(S_f^F \cup \{f\})$.
6: return $U_\nu^{F,Y}(f) = \nu(S_f^F \cup \{f\}) - \nu(S_f^F)$

We emphasize that the critical preprocessing step in line 2 of the algorithm can be implemented in various ways. Appendix E details our implementations of this step using pairwise optimal transport and linear regression. We also note that since the distributions of the features are unknown in practice, all of the steps in the algorithm are typically done with respect to the observed dataset rather than on the random variables themselves.
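The steps of Algorithm 1 can be sketched in a few lines once a preprocessing routine and an evaluation function are supplied. The sketch below is only illustrative and makes two loud assumptions: dependencies are removed with pairwise least-squares residuals (`residualize`), and $\nu$ is replaced by the in-sample $R^2$ of a linear fit (`r2_linear`) instead of the random forest OOB score used in the experiments. All function names are hypothetical, and the toy data only loosely follows the correlation study of Section 4.1.3, with an assumed noise standard deviation of 0.1.

```python
import random


def residualize(g, f):
    """Assumed preprocessing: remove the linear dependence of g on f."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    slope = (sum((a - mf) * (b - mg) for a, b in zip(f, g))
             / sum((a - mf) ** 2 for a in f))
    return [b - mg - slope * (a - mf) for a, b in zip(f, g)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r2_linear(columns, y):
    """Stand-in for nu: in-sample R^2 of a least-squares fit with intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_res / sum((yi - my) ** 2 for yi in y)


def umfi(f, others, y, remove_dependence, nu):
    """Algorithm 1 for one feature column f."""
    s_f = [remove_dependence(g, f) for g in others]  # line 2
    return nu(s_f + [f], y) - nu(s_f, y)             # lines 4 to 6


random.seed(1)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x4 = [random.gauss(0, 1) for _ in range(n)]
x3 = [a + random.gauss(0, 0.1) for a in x1]  # near-duplicate of x1
y = [a + b for a, b in zip(x1, x2)]
data = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}

scores = {
    name: umfi(col, [c for other, c in data.items() if other != name],
               y, residualize, r2_linear)
    for name, col in data.items()
}
print(scores)
```

Each feature requires only two model fits, so the total cost grows with the number of features rather than with the number of feature subsets.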

4 Experiments

We perform experiments to compare UMFI and MCI with respect to quality, robustness, and time complexity. To implement UMFI, we consider optimal transport [43] (UMFI_OT) and linear regression [10] (UMFI_LR) as methods to remove dependencies from the data. A detailed overview of these implementations is shown in Appendix E and experiments comparing these methods appear in Appendix F. For all experiments, we use random forests' out-of-bag accuracy (OOB $R^2$ for regression tasks and OOB classification accuracy for classification tasks) as the evaluation metric $\nu$ [11]. We use the ranger R package to implement random forests with default hyperparameters and 100 for the number of trees [77]. All experiments were run in Microsoft R Open Version 4.0.2 [55]. Appendix G contains additional experiments comparing UMFI and MCI with other feature importance metrics including ablation, permutation importance, and conditional permutation importance. In the same section, we rerun the experiments comparing MCI and UMFI using extremely randomized trees instead of random forests and do an additional comparison on a real dataset from hydrology [1]. Code for all experiments can be found at https://github.com/joej1997/UMFI.

4.1 Experiments on simulated data

We run UMFI on simulated data to verify that it performs well compared to MCI. The data in all simulation studies contains one response variable $Y$, four explanatory features $x_1, x_2, x_3, x_4$, and 1000 randomly generated observations. Each of the simulation studies is repeated 100 times so that we can also test the stability of each method.

4.1.1 Nonlinear interactions

Interaction effects are common in many scientific disciplines where assessing feature importance is prevalent, including hydrology [41, 2, 49], genomics [16, 73, 58], and glaciology [29, 5, 13, 61]. So, as was done in Catav et al. [17], we assess the ability of MCI and UMFI to detect nonlinear interaction effects in the data [53]. We consider:

$$x_1, x_2, x_3, x_4 \sim \mathcal{N}(0, 1)$$

$$Y = x_1 + x_2 + \operatorname{sign}(x_1 x_2) + x_3 + x_4.$$

Ideally, the results of a feature importance metric would conclude that $x_1$ and $x_2$ have higher importance compared to $x_3$ and $x_4$ because of the extra interaction term, $\operatorname{sign}(x_1 x_2)$. Figure 1a shows consistently good performance across all methods. Each method gave high relative importance scores to $x_1$ and $x_2$, while $x_3$ and $x_4$ received less, but still substantial importance. All methods show similar variability.

4.1.2 Correlated interactions

Interacting features are often correlated [39, 41]. So, this simulation study aims to repeat the nonlinear interactions study, except now $x_1$ is correlated with $x_2$ and $x_3$ is correlated with $x_4$. Let $A, B, C, D, E, G \sim \mathcal{N}(0, 1)$. We consider:

$$x_1 = A + B, \quad x_2 = B + C, \quad x_3 = D + E, \quad x_4 = E + G$$

$$Y = x_1 + x_2 + \operatorname{sign}(x_1 x_2) + x_3 + x_4.$$

In this simulation, $x_1$ is correlated with $x_2$, and $x_3$ is correlated with $x_4$ in the same way. Just as with the interaction experiment with independent features, we would expect $x_1$ and $x_2$ to be more important than $x_3$ and $x_4$ because of the extra interaction term, $\operatorname{sign}(x_1 x_2)$. The results in Figure 1b clearly show that UMFI provides better estimations of feature importance compared to MCI when correlated interactions are present. The variability of all estimates is approximately the same across methods. However, MCI estimates that all features have approximately the same feature importance scores, while both UMFI methods show significantly greater importance for $x_1$ and $x_2$ compared to $x_3$ and $x_4$. MCI fails in this experiment because $x_2$ contains most of the important information coming from $x_1$, which implies that $I(Y; x_1, x_2) - I(Y; x_1) \leq I(Y; x_1) - I(Y; \emptyset)$. Therefore, when computing the importance for $x_1$, the maximizing subset that solves Equation 1 cannot contain $x_2$ if $\nu(S) \sim I(Y; S)$. UMFI is able to detect this interaction because it can extract the information from $x_2$ that interacts with $x_1$ while keeping this extracted feature independent of $x_1$.
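For readers who want to reproduce the setup, the data-generating process above can be sketched as follows. The function name, seed, and sample size are illustrative choices rather than the authors' code. The construction implies a population correlation of 0.5 between $x_1$ and $x_2$ (shared term $B$, each with variance 2), which the sample correlation should approximate.

```python
import random


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def simulate_correlated_interactions(n, seed=0):
    """Draw n observations and return them as named columns."""
    rng = random.Random(seed)
    cols = {name: [] for name in ("x1", "x2", "x3", "x4", "Y")}
    for _ in range(n):
        a, b, c, d, e, g = (rng.gauss(0, 1) for _ in range(6))
        x1, x2, x3, x4 = a + b, b + c, d + e, e + g
        sign = (x1 * x2 > 0) - (x1 * x2 < 0)  # sign(x1 * x2) as -1, 0 or 1
        row = (x1, x2, x3, x4, x1 + x2 + sign + x3 + x4)
        for name, value in zip(cols, row):
            cols[name].append(value)
    return cols


data = simulate_correlated_interactions(2000)
print(correlation(data["x1"], data["x2"]))  # close to 0.5
print(correlation(data["x1"], data["x3"]))  # close to 0
```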

4.1.3 Correlation

Figure 1: Results for the experiments on simulated data from Subsection 4.1, in four panels: (a) Nonlinear interactions, (b) Correlated interactions, (c) Correlation, (d) Blood relation. Feature importance scores are shown as a percentage of the total for each of $x_1$ to $x_4$ from 100 replications. Results are shown for marginal contribution feature importance (MCI), ultra-marginal feature importance with linear regression (UMFI_LR), and ultra-marginal feature importance with pairwise optimal transport (UMFI_OT).

Marginal feature importance methods such as MCI and UMFI should not change the measured importance of features in the presence of highly correlated or duplicated variables according to the duplication invariance and symmetry axiom. To test this, we implement a simulation study similar to the ones found in Catav et al. [17]. Let $\varepsilon \sim \mathcal{N}(0, 0.01)$. We consider:

$$x_1, x_2, x_4 \sim \mathcal{N}(0, 1), \quad x_3 = x_1 + \varepsilon$$

$$Y = x_1 + x_2.$$

The addition of $x_3$, which is approximately a duplicate of $x_1$, should not alter the importance of $x_1$, and $x_1$ should remain as important as $x_2$, since they have the same influence on the response $Y$. The results in Figure 1c show that both MCI and UMFI work reasonably well. As with the previous simulation experiment, the variability is consistent across methods. As was desired, UMFI with linear regression shows equal relative importance scores for $x_1$ and $x_2$. The importance given to $x_2$ was slightly greater than $x_1$ according to MCI and UMFI with optimal transport. Interestingly, MCI assigns some importance to $x_4$, which was independent of the response, while both UMFI methods assign importance scores close to zero. Because of this, we conclude that UMFI with linear regression performs the best in this simulated scenario.

4.1.4 Blood relation

Figure 2: Causal graph which generates the data for the blood relation simulation experiment (nodes: S, x1, x2, x3, Y, x4).

To ensure that UMFI is true to the data and could be used to learn part of the structure of the causal graph in theory as well as in practice, we implement the blood relation simulation experiment. In this study, data is generated from the causal graph in Figure 2, which was inspired by the collider causal graph found in Harel et al. [37]. The feature S is unobserved, thus only x3 and x4 are blood related to the response Y. Because of this, according to the blood relation axiom, x3 and x4 should be given high and positive importance while x1 and x2 should receive zero importance. In Section 3, we proved that in ideal scenarios, UMFI will only give non-zero importance to blood related features. We hypothesize that we can extend this to real-world scenarios where non-Gaussian features and interaction information appear. To test this, we consider:

x1, S ∼ N (0, 1), δ ∼ U(−1, 1), ε ∼ U(−0.5, 0.5), γ ∼ Exp(1)

x2 = 3 ∗ x1 + δ, x3 = x2 + S

Y = S + ε

x4 = Y + γ.

The results shown in Figure 1d indicate that MCI fails to find the blood related features since once x3 is known, x2 and x1 can help predict the response by denoising x3. While x1 and x2 can increase model performance by denoising x3, x1 and x2 are completely unrelated to Y, yet most of the importance is given to these features by MCI. On the other hand, both UMFI_LR and UMFI_OT detect that x1 and x2 should have zero importance while giving most of the importance to x4 and the rest of the relative importance to x3.
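The denoising effect described above can be illustrated numerically. The following sketch generates data from the stated structural equations and compares in-sample linear R² values; the use of linear regression as the evaluation function and the helper name `r_squared` are assumptions made for illustration only.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit of y on the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 20000
x1 = rng.standard_normal(n)
S = rng.standard_normal(n)          # unobserved feature
delta = rng.uniform(-1.0, 1.0, n)
eps = rng.uniform(-0.5, 0.5, n)
gamma = rng.exponential(1.0, n)

x2 = 3 * x1 + delta
x3 = x2 + S
y = S + eps
x4 = y + gamma

r2_x2 = r_squared(x2[:, None], y)                      # x2 alone carries no information about Y
r2_x3 = r_squared(x3[:, None], y)                      # x3 alone is a noisy view of S
r2_x3_x2 = r_squared(np.column_stack([x3, x2]), y)     # x2 denoises x3, since x3 - x2 = S
print(r2_x2, r2_x3, r2_x3_x2)
```

Although x2 is unrelated to Y on its own, adding it to a model that already contains x3 raises the score sharply, which is the mechanism by which a maximization over subsets can assign it importance.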

4.2 BRCA experiments

We use the same breast cancer (BRCA) classification dataset [70] used in previous feature importance studies including Catav et al. [17] and Covert et al. [23] to test the quality and robustness of UMFI on real data. The original data contains over 17,000 genes and 571 anonymous patients that have been diagnosed with one of 4 breast cancer sub-types. We consider the same subset of 50 genes as in Catav et al. [17] and Covert et al. [23] for easier computation and result visualization. Of the 50 selected genes, 10 are known to be associated with breast cancer, while the other 40 genes are randomly sampled. This data was downloaded from https://github.com/TAU-MLwell/Marginal-Contribution-Feature-Importance/tree/main/BRCA_dataset (MIT License). In Catav et al. [17] and Covert et al. [23], these 40 randomly sampled genes are assumed to be unassociated with breast cancer. However, to ensure a more definitive ground truth, we also randomly permute the values of these 40 genes across their respective 571 observations to further reduce the chance that these genes have any association with breast cancer. Quality is then measured with the true positive and true negative rates: the 10 BRCA-associated genes should have some non-zero importance (positive), and the other 40 genes should have exactly zero importance (negative). These experiments were run 200 times on different seeds and with a different random sample of 500 patients for each iteration. Robustness is measured using the standardized interquartile range (SIQR) from the repeated experiments, which is calculated by dividing the average IQR across the 50 features by the average median. This experiment is too computationally intensive for MCI to be calculated exactly, so we implement MCI assuming soft 2-size submodularity. This approximation will provide a lower bound on the exact MCI importance of each feature.
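The quality and robustness metrics described above can be sketched as follows. The score matrix here is synthetic and is not the BRCA result; the function names `siqr` and `tpr_tnr` are hypothetical, and thresholding the per-feature median at zero is an assumption about how "non-zero importance" is decided.

```python
import numpy as np

def siqr(scores):
    """Average IQR across features divided by the average median.

    scores has shape (n_runs, n_features).
    """
    q1, med, q3 = np.percentile(scores, [25, 50, 75], axis=0)
    return (q3 - q1).mean() / med.mean()

def tpr_tnr(median_scores, is_associated, tol=0.0):
    """True positive and true negative rates for 'importance > tol'."""
    positive = median_scores > tol
    tpr = positive[is_associated].mean()
    tnr = (~positive[~is_associated]).mean()
    return float(tpr), float(tnr)

# Synthetic stand-in: 200 runs, 50 features, the first 10 treated as associated.
rng = np.random.default_rng(0)
scores = np.zeros((200, 50))
scores[:, :10] = rng.uniform(0.5, 1.5, size=(200, 10))
is_associated = np.arange(50) < 10

robustness = siqr(scores)
tpr, tnr = tpr_tnr(np.median(scores, axis=0), is_associated)
print(robustness, tpr, tnr)
```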

We found that MCI and UMFI (UMFI_LR and UMFI_OT) correctly gave significant importance to the 10 genes that are known to be associated with breast cancer (Figure 3). However, MCI consistently gives non-zero importance to all features, while UMFI gives zero importance to the majority of the randomized genes as desired. Of the 40 randomized genes, the few that UMFI gives a non-zero score to have a score close to zero, and importantly, their scores are much smaller compared to the scores of any of the 10 BRCA-associated genes. During these experiments, we noticed that a greater proportion of permuted genes are given zero importance as the number of iterations grows. To confirm the convergence of these importance scores, we ran the UMFI experiments 5000 times and noted that both UMFI methods have a perfect overall accuracy when distinguishing between important and permuted features (Appendix G.2.1). Interestingly, the ordering of important features was similar across methods, with BCL11A and SLC22A5 always being the most important and TEX14 always being the least important of the 10 BRCA-associated genes. Although UMFI scores have higher variability than MCI (Table 1), it is clear from the feature importance plot that UMFI separates the 10 associated genes from the 40 unassociated genes better than MCI does even after considering UMFI’s higher SIQR.

Figure 3: Median feature importance scores provided by (a) MCI, (b) UMFI with linear regression, and (c) UMFI with pairwise optimal transport, for each gene in the BRCA dataset after 200 iterations. Genes colored in blue are known to be associated with breast cancer while genes colored in grey are random permutations of randomly selected genes, which we assume to be unassociated with breast cancer. The first and third quartiles of the scores are visualized for each gene.

4.3 Computational complexity

MCI must train and evaluate a model for each element of the power set of the feature set, which implies O(2^p) model trainings if there are p features. If the evaluation function ν obeys soft k-size submodularity, then the maximizing subset has no more than k elements, which reduces the number of model trainings to O(p^(k+1)) [17]. UMFI circumvents the exponential training time since it can be evaluated immediately after removing the dependencies of f from the feature set F. To confirm the above statements, and to show that the extra model trainings required for MCI dominate the computation time for removing dependencies in UMFI, we ran a simple experiment. For a range of dataset sizes from the BRCA data, we evaluate the computation time for calculating the feature importance scores of all features using MCI and UMFI. We ran this experiment for a dataset with 5 features, and then slowly added features until our given time budget of 1 hour ran out. Once all 50 BRCA features were used, more features were randomly generated. All datasets had 571 observations. These experiments were run using an Intel Core i9-9980HK CPU 2.40GHz with 32GB of RAM. Code was parallelized in R, and 12 of the 16 available threads were used.

Table 1: The standardized interquartile range (SIQR), true positive rate (TPR), true negative rate (TNR), overall accuracy (OA), and the number of features for which feature importance can be calculated within 1, 15, and 60 minute(s) are displayed after running the methods on the BRCA data. The best results from each column are bolded.

Method      SIQR    TPR   TNR     OA     @1min   @15min   @1hr
MCI (k=2)   6.6%    1     0       0.20   35      80       130
UMFI (LR)   41.9%   1     0.975   0.98   500     2000     4010
UMFI (OT)   28.5%   1     0.775   0.82   300     1500     3000

Figure 4: Computation time of MCI (dark red), MCI with the soft 2-size-submodularity assumption (pink), UMFI_OT (light blue), and UMFI_LR (dark blue) plotted against number of processed features from the BRCA data.

From Figure 4, we can observe that UMFI is approximately superlinear, with UMFI_OT incurring more computational cost compared to UMFI_LR. Giving each method 1 hour to run, we found that MCI could process 19 features, MCI with the soft 2-size submodularity assumption could process 130 features, UMFI_OT could process about 3000 features, and UMFI_LR could process about 4000 features (Table 1).
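To make the gap in model trainings concrete, the counts behind these complexity classes can be tabulated. This is a rough sketch: it counts distinct feature subsets that need a trained model, assumes UMFI needs two model evaluations per feature (with and without f), and ignores the cost of the dependence-removal step itself. The function names are hypothetical.

```python
from math import comb

def mci_exact_trainings(p):
    """One model per element of the power set of p features."""
    return 2 ** p

def mci_soft_k_trainings(p, k):
    """Subsets of size at most k + 1, i.e. S with |S| <= k and S plus one feature."""
    return sum(comb(p, i) for i in range(k + 2))

def umfi_trainings(p):
    """Assumed two evaluations per feature: nu(S and f together) and nu(S)."""
    return 2 * p

for p in (19, 50, 130):
    print(p, mci_exact_trainings(p), mci_soft_k_trainings(p, 2), umfi_trainings(p))
```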

5 Conclusion

In this study, we introduced ultra-marginal feature importance (UMFI), a new method that uses preprocessing techniques, originally developed in the domain of AI fairness, to provide fast and accurate feature importance scores for the purposes of explaining data. We introduced three ideal axioms that feature importance measures should satisfy if they claim to explain the data and proved that UMFI satisfies these axioms under some basic assumptions. Optimal transport and linear regression were explored as preprocessing techniques to remove dependencies from data. When compared with MCI, the previous state-of-the-art method for explaining data, experimental results showed that UMFI was able to provide faster and more accurate estimates of feature importance on real and simulated data, particularly in the presence of correlated interactions and unrelated features.

Throughout the work on this paper, several shortcomings appeared. First, we only considered two simple methods for removing dependencies, linear regression and pairwise optimal transport. Other methods certainly exist in the literature, including optimal transport with chaining [43], neural networks [15, 63], or principal inertial components [72]. Though our two methods performed fairly well on the real and simulated datasets in Section 4, optimal transport and linear regression failed to find representations of the data that were independent of the protected attribute when we tested the methods on a hydrology dataset with more shared information compared to BRCA [1] (Appendix G.4). However, neural nets or principal inertial components certainly could have given better results. Also, despite requiring significantly more computational cost, better methods for estimating the conditional CDF, or using optimal transport with chaining, should give better estimates for $S_f^F$ when implementing UMFI_OT. Second, UMFI scores are less robust than MCI scores since they have much higher variability; however, because of the significantly lower computational cost, UMFI can be run multiple times and averaged to increase robustness. Third, it is not clear precisely how closely ν approximates mutual information in practice. Finally, though UMFI can work for any arbitrary feature type, in this paper, we have only considered datasets with continuous explanatory variables.

In future work, we would like to test how well other methods, such as neural networks, pair with UMFI while further testing on a wider variety of random variable types such as binary, categorical, and ordinal features. Further, we would like to explore how well dependence can be removed and UMFI can be estimated on real data as the number of features increases to sizes much larger than 50.

To reiterate, UMFI is a powerful tool to explain relationships in data using feature importance. We emphasise that UMFI is just a framework. A variety of other methods can be used to estimate the evaluation function including, but not limited to, XGBoost, neural networks, or Gaussian processes. Furthermore, new preprocessing techniques for dependence removal are still being developed in the AI fairness community, so these, in addition to other existing methods, can be used in future applications of UMFI for additional improvements.

We hope that UMFI will be a useful tool in a variety of disciplines including bioinformatics, ecology, earth sciences, and health science for discovering scientific processes and relationships hidden within data. Further, we hope our work provides further justification and encouragement for the development of better AI fairness preprocessing methods and better machine learning models that more closely approximate mutual information.

Acknowledgments

We thank Dr. Elina Robeva for her support and encouragement during this process. We thank all of the authors of the papers we cited, especially Dr. Chandra Nair for his help with concepts in information theory and Boyang Fu for his experimental advice.


Appendix

A Mutual information

A.1 Properties of mutual information

Theorem A.1 (Supermodularity under mutual independence). Let S, f, X be mutually independent random variables. Then, I(Y ; S, f, X) − I(Y ; S, X) ≥ I(Y ; S, f) − I(Y ; S) [48, 66].

Proof.

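$$
\begin{aligned}
I(Y; S, f, X) - I(Y; S, X) &= I(Y; S) + I(Y; X \mid S) + I(Y; f \mid S, X) - \left[ I(Y; S) + I(Y; X \mid S) \right] && \text{(by the chain rule)} \\
&= I(Y; f \mid S, X) = I(Y, S, X; f) && \text{(by mutual independence)} \\
&\geq I(Y, S; f) && \text{(by monotonicity of } I(\cdot\,; f)\text{)} \\
&= I(Y; f \mid S) && \text{(by mutual independence)} \\
&= I(Y; S, f) - I(Y; S) && \text{(by the chain rule for mutual information)}
\end{aligned}
$$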
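The two steps justified "by mutual independence" each hide one application of the chain rule. As a short worked expansion, using only the chain rule and the assumption that S, f, X are mutually independent (so that I(S, X; f) = 0 and I(S; f) = 0):

```latex
\begin{align*}
I(Y, S, X; f) &= I(S, X; f) + I(Y; f \mid S, X) = 0 + I(Y; f \mid S, X), \\
I(Y, S; f)    &= I(S; f) + I(Y; f \mid S) = 0 + I(Y; f \mid S).
\end{align*}
```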

Theorem A.2 (Data processing inequality). Let X, Y, Z be three random variables forming a Markov chain X → Y → Z, i.e. X ⊥⊥ Z | Y. Then, I(X; Y ) ≥ I(X; Z).

Proof. The proof can be found in Cover and Thomas [22, p. 32].

Theorem A.3. Let F be a set of features used to predict the response Y. Then I(Y ; F ) ≥ I(Y ; g(F )) for any function g. If g is injective, then I(Y ; F ) = I(Y ; g(F )).

Proof. The first claim I(Y ; F ) ≥ I(Y ; g(F )) follows from the data processing inequality (Theorem A.2) since Y → F → g(F ) forms a Markov chain.

If g is injective, then we may write F = h(g(F )), where h : Im(g) → F is the inverse of g restricted to the image of g. Hence, it follows that Y → g(F ) → F is a Markov chain: the condition Y ⊥⊥ F | g(F ) is equivalent to Y ⊥⊥ h(g(F )) | g(F ), which holds because F = h(g(F )) is a constant given g(F ). By the data processing inequality, I(Y ; g(F )) ≥ I(Y ; F ), and combining with the above inequality yields the desired claim, I(Y ; g(F )) = I(Y ; F ) when g is injective.
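Theorem A.3 can be checked numerically on a small discrete example. The sketch below uses a made-up joint distribution of a binary Y and a four-valued F, computes mutual information exactly from the joint table, and compares a non-injective map (merging values) with an injective one (a permutation). The helper names `mutual_information` and `push_forward` are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table p(y, f)."""
    joint = np.asarray(joint, dtype=float)
    py = joint.sum(axis=1, keepdims=True)
    pf = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (py @ pf)[mask])))

def push_forward(joint, g, n_out):
    """Joint table of (Y, g(F)) obtained by summing columns with equal g(f)."""
    out = np.zeros((joint.shape[0], n_out))
    for f in range(joint.shape[1]):
        out[:, g(f)] += joint[:, f]
    return out

# Illustrative joint distribution: rows index Y in {0, 1}, columns index F in {0, 1, 2, 3}.
joint = np.array([[0.20, 0.05, 0.15, 0.10],
                  [0.05, 0.20, 0.10, 0.15]])

mi_full = mutual_information(joint)
mi_merged = mutual_information(push_forward(joint, lambda f: f // 2, 2))          # not injective
mi_permuted = mutual_information(push_forward(joint, lambda f: (f + 1) % 4, 4))   # injective
print(mi_full, mi_merged, mi_permuted)
```

In this example the merging map discards all information about Y, while the permutation leaves the mutual information unchanged.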

A.2 Mutual information and feature importance

Let F = {f1, ..., fn} be a set of features used to predict Y. As shown in Griffith and Koch [35], the mutual information I(Y ; F ) = I(Y ; f1, ..., fn) can be visualized using a partial information (PI) diagram [75]. We may interpret the mutual information shared between Y and F as a collection of non-negative pieces of information, whose sum forms I(Y ; F ). Each of these pieces of information can be classified as (1) unique, (2) redundant, or (3) synergistic (Figure 5).

1. The unique parts correspond to mutual information between Y and the single feature f, which cannot be found elsewhere in F, either in other features or in any groups of features. For example, if f, g ∈ F predict Y = 3f + fg + g, then the unique part for the feature importance of f comes entirely from the term 3f. Although the term fg also helps predict Y, this arises from an interaction that also uses another feature g.

2. The redundant parts correspond to mutual information between Y and f that can also be found within other predictors or groups of predictors in F. For example, if f, g ∈ F are highly correlated and help predict Y, then most of their respective feature importances would come from redundant information, as f and g share much of the same information.

3. The synergistic parts correspond to mutual information between Y and F that arise from interactions between f and other features in F. For example, if Y = fg where f, g ∈ F are independent, then all of the feature importance of f would come from synergistic information via the interaction fg. This would also hold for the feature importance of g.


We note that the distinction between marginal and conditional methods for feature importance comes from their treatment of redundant information, i.e. their treatment of dependent features. A marginal method, like MCI or UMFI, should count all of the redundant information pertaining to f in I(Y ; F ) towards the feature importance of f. Indeed, even though this information can be found elsewhere in the model, redundant information still constitutes part of the information that f shares about Y in the data. Conversely, a conditional approach, like conditional permutation importance (CPI), would count none of the redundant information towards the evaluation of a feature’s importance. This is because under a conditional framework, a feature’s importance is defined to be the additional predictive power supplied by f beyond that which is available from all other predictors [14].

Figure 5: PI-diagrams taken from Griffith and Koch [35] for I(Y ; F ) when |F | = 2 (left) and |F | = 3 (right). Magenta represents unique information, redundant information is colored with yellow, and synergistic information is in cyan. The starred regions represent a single region.

Mutual information itself is a common choice in the context of feature selection [6, 3, 80, 8]. However, due to the computational cost and the limited number of observations available for the calculation of the high-dimensional joint probability density function, it is not practical to compute I(Y ; S). For feature selection, users are only interested in the importance given to the top k features. Therefore, mutual information-based feature selection methods typically bypass the computation of I(Y ; S) by instead studying the mutual information between the candidate feature and the response along with the mutual information between the candidate and the previously selected features [8, 6]. These methods are much less suitable for feature importance when the goal is to explain the data since interactions cannot be considered, which is why the prevalent approach is to train machine learning models to determine marginal feature importance.

Another connection between feature importance and mutual information comes from Louppe et al. [51], who showed that when extremely randomized trees’ mean decrease in impurity (MDI) is used as a feature importance score, the MDI of a single feature converges to a quantity that is defined by conditional mutual information [51, Eq. 4], as the number of trees and the number of observations go to infinity. Also, the sum of the MDI scores across the feature set F converges to I(Y ; F ).

A.3 Mutual information and machine learning evaluation functions

The evaluation function for a machine learning model ν(S) : P(S) → R≥0 measures how well the response Y can be predicted using the features S ⊆ F. Intuitively, ν(S) should ideally mirror or at least covary with mutual information I(Y ; S). Direct relationships between mutual information and machine learning evaluation functions have been observed in previous works. For example, the Gini value is equivalent to the first order Taylor approximation of information entropy [82]. The Gini impurity index is the central mechanism for choosing splits in random forests [77]. Also, with some assumptions, mutual information and R² accuracy are related. Since mutual information can be expressed in terms of the linear correlation coefficient, if we assume the response and predictions are joint Gaussian and the predictions are unbiased [22], we can approximate the mutual information between Y and F as:

$$
I(Y; F) \geq I(Y; g(F)) = I(Y; \hat{Y}) = -\frac{1}{2}\log\left[1 - \rho^2(Y, \hat{Y})\right] = -\frac{1}{2}\log\left[1 - R^2\right].
$$
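As a small worked instance of this bound, the following sketch converts an R² value into the corresponding mutual information in nats. It is valid only under the joint Gaussian and unbiasedness assumptions stated above, and the function name is hypothetical.

```python
import math

def mi_lower_bound_from_r2(r2):
    """Return -0.5 * log(1 - R^2) in nats; requires 0 <= R^2 < 1."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return -0.5 * math.log(1.0 - r2)

for r2 in (0.0, 0.5, 0.75, 0.99):
    print(r2, mi_lower_bound_from_r2(r2))
```

The bound is zero at R² = 0 and grows without limit as R² approaches 1, so it is monotone in R²; this is the sense in which ν(S) can covary with I(Y ; S).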

Machine learning evaluation functions and mutual information have been equated many times in the feature importance literature. Covert et al. [23] demonstrated equivalence when the Bayes classifier is known and cross entropy loss is used. In a simple example, Catav et al. [16] used mutual information directly as the evaluation function. The connection between machine learning evaluation functions and mutual information was further used by Sutera et al. [69] to relate random forest feature importance with Shapley values.

B Additional information about marginal contribution feature importance (MCI)

Two of the methods that are compared with MCI in Catav et al. [17] include ablation and bivariate association. Ablation methods determine feature importance based on the difference in accuracy between the full model and the full model without the feature of interest, i.e. Aν(f) = ν(F ) − ν(F \ f). Bivariate methods are among the most popular methods for genome-wide association studies [21, 28, 68]. In this case, the feature importance is given by the difference in the evaluation function of the model with just the feature of interest and the null model, i.e. Bν(f) = ν(f) − ν(∅). The three feature importance axioms proposed by Catav et al. [17] were partially motivated by the shortcomings of these two methods.

1. Marginal contribution: Ablation methods may underestimate the importance of features in cases where the correlation between features is high. In these scenarios, ν(F ) may be approximately equal to ν(F \ f) even in cases where f is highly related to the response. Because of this, the importance of a feature Iν(f) should be at least as large as the importance given by ablation methods.

2. Elimination: Bivariate methods may underestimate the importance of features in cases where interactions exist between features. Many high-order interactions may be present in the data, so eliminating features from the feature set could prevent the detection of an important interaction. Thus, eliminating features from F should only be able to decrease the feature importance of f.

3. Minimalism: Catav et al. [17] decided to impose the minimalism axiom so that MCI can be unique. If Iν(f) satisfies the first two axioms, then multiplying Iν(f) by any constant λ > 1 would not change this. The minimalism axiom helps disambiguate MCI from these trivial variations.
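The shortcomings of ablation and bivariate association that motivate these axioms can be seen on a toy evaluation function. In the sketch below, ν is a made-up set function (not a trained model): features a and b are duplicates of each other, and c and d are only useful together. MCI is computed by brute force as the largest marginal contribution of f over all subsets of the remaining features.

```python
from itertools import combinations

FEATURES = ("a", "b", "c", "d")

def nu(subset):
    """Toy evaluation function: a and b are redundant, c and d interact."""
    s = set(subset)
    return float("a" in s or "b" in s) + float({"c", "d"} <= s)

def ablation(f):
    full = set(FEATURES)
    return nu(full) - nu(full - {f})

def bivariate(f):
    return nu({f}) - nu(set())

def mci(f):
    """Maximum of nu(S with f) - nu(S) over all subsets S of the other features."""
    others = [g for g in FEATURES if g != f]
    best = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            best = max(best, nu(set(S) | {f}) - nu(S))
    return best

for f in FEATURES:
    print(f, ablation(f), bivariate(f), mci(f))
```

Ablation gives the duplicated feature a no importance, bivariate association gives the interacting feature c no importance, and the brute-force maximum recovers a positive score for both.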

C Additional information about ultra-marginal feature importance (UMFI)

Theorem C.1 (Existence of optimal preprocessing $S_f^F$ when all variables are jointly Gaussian). Suppose that all features in random vector F are joint normally distributed with mean 0 and that the preprocessed matrix $S_f^F$ is obtained via multiple linear regression with the model:

$$F \setminus \{f\} = \beta f + \varepsilon,$$

where $\varepsilon = S_f^F$, f is a random variable in F, and β is the column vector of size |F| − 1 that minimizes the least squares error. Then, $S_f^F$ is an optimal preprocessing.

Proof. To show that $S_f^F$ is an optimal preprocessing (Definition 1), it suffices to show that $S_f^F \perp\!\!\!\perp f$ and that $I(Y; F) = I(Y; S_f^F, f)$, since $S_f^F$ is a function of F by construction.

From the normal equations and the definition of covariance, we know that $\mathrm{Cov}(S_f^F, f) = 0$, as shown in the proof of Theorem E.3. Since $S_f^F = F \setminus \{f\} - \beta f$, and all features in F are joint normally distributed, it follows that $(S_f^F, f)$ is joint normally distributed as well, since $(S_f^F, f)$ can be obtained via the transformation $AF = (S_f^F, f)$, where the main diagonal entries of A are 1, the other |F| − 1 entries of the column corresponding to f are given by the entries of −β, and all other entries are 0.

$$
A = \begin{pmatrix}
1 & 0 & \cdots & \cdots & -\beta_1 \\
0 & 1 & 0 & \cdots & -\beta_2 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & \cdots & 1
\end{pmatrix},
\qquad
A^{-1} = \begin{pmatrix}
1 & 0 & \cdots & \cdots & \beta_1 \\
0 & 1 & 0 & \cdots & \beta_2 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & \cdots & 1
\end{pmatrix}
$$

Hence, $\mathrm{Cov}(f, S_f^F) = 0 \implies S_f^F \perp\!\!\!\perp f$ from the properties of multivariate Gaussians.

To prove the second claim $I(Y; F) = I(Y; S_f^F, f)$, by Theorem A.3, it suffices to show that the map $g(F) = (S_f^F, f) = AF$ is injective. This is immediate from the fact that the matrix A, defined above, is invertible and thus bijective (injective and surjective).
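The two ingredients of this proof, zero covariance between the residuals and f and invertibility of A, can be checked on simulated Gaussian data. This is a sketch under illustrative assumptions: the covariance matrix is arbitrary, f is ordered last to match the layout of A, and the sample is centered so that the mean-zero assumption holds in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
data = rng.multivariate_normal(np.zeros(3), cov, size=5000)
data -= data.mean(axis=0)             # enforce mean zero in the sample
others, f = data[:, :2], data[:, 2]   # f is the last feature

# Least-squares slope of each remaining feature on f (no intercept, mean zero).
beta = others.T @ f / (f @ f)
S = others - np.outer(f, beta)        # residuals play the role of the preprocessing

# Linear map A with A F = (S, f), and its claimed inverse.
A = np.eye(3)
A[:2, 2] = -beta
A_inv = np.eye(3)
A_inv[:2, 2] = beta

print(S.T @ f)                        # numerically zero: residuals are orthogonal to f
print(np.allclose(A @ A_inv, np.eye(3)))
```

Because A is invertible, the original features can be recovered exactly from the residuals and f, which is why no information about Y is lost.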

Theorem C.2 (Blood relation axiom in the absence of interactions). Suppose that there is no synergistic information $I_{syn}(Y; S_f^F, f)$ about Y between f and $S_f^F$ for all f ∈ F. Then, if the graphical model obeys the global Markov property and faithfulness, $U_\nu^{F,Y}(f) > 0$ if and only if f ∈ BR(Y ).

Proof. As in the proof of Theorem 3.3, it suffices to show that $I(Y; f \mid S_f^F) = 0$ if and only if f ∉ BR(Y ).

We may further rewrite $I(Y; f \mid S_f^F) = 0$ as $I(Y; S_f^F, f) = I(Y; S_f^F)$. Using partial information decomposition [75], and since $S_f^F \perp\!\!\!\perp f$, we may decompose

$$I(Y; S_f^F, f) = I(Y; f) + I(Y; S_f^F) + I_{syn}(Y; S_f^F, f),$$

where we note that $I(Y; f) = I_{uniq}(Y; f)$ and $I(Y; S_f^F)$ captures both the unique information that $S_f^F$ shares with Y as well as synergistic information within the preprocessing $S_f^F$ that is shared with Y. As proven in Theorem 3.3, I(Y ; f) = 0 if f ∉ BR(Y ) and I(Y ; f) > 0 if f ∈ BR(Y ) by the global Markov property and faithfulness. Since $I_{syn}(Y; S_f^F, f) = 0$ by assumption, this gives us the desired statement $I(Y; S_f^F, f) = I(Y; S_f^F)$ if and only if f ∉ BR(Y ).

D Additional information about other feature importance methods

Historically, feature importance methods were developed in the pursuit of scientific questions, but current research in this area typically focuses on model explainability or model optimization. Early forms of feature importance assessed the strength of the relationships between variables within animal biology or human psychology using methods such as the correlation coefficient [31], Spearman’s rank correlation coefficient [64], multiple linear regression [25], and partial correlation [78]. Although these methods are perfectly interpretable, they are inadequate for modelling and therefore explaining complex data, since they cannot quantify the unknown interactions between multiple features. To counteract this severe limitation, Breiman was instrumental with his introduction of variable importance within classification and regression trees [12]. At that time, Breiman seemed more concerned about the true strength of the relationships between the explanatory variables and the response, as he posited that a feature that is related to the response should be given some importance even if it does not appear in the final model [12]. However, starting with Breiman’s random forests, feature importance began to prioritize machine learning model explanation rather than data exploration. A good overview of the properties of some popular feature importance metrics is shown in Covert et al. [23].

D.1 Disagreements with previous feature importance papers

Claim #1: Random forests permutation importance is a marginal method [36, 67, 26].


Refutation: Marginal and conditional feature importance metrics mark the two extreme perspectives of feature importance; however, the two perspectives agree if all features are independent. In a linear setting, the squared correlation is an example of a linear marginal method. Though the marginal and conditional divide is not as clear when considering random forests instead of linear regression [26], the experiments in Appendix G.1.3 show that random forest permutation importance is in the middle of the marginal and conditional frameworks. In random forest’s permutation importance, when a highly correlated feature is added to the data, the importance weights are divided equally among the correlated features. Marginal methods, such as the squared correlation [26], would give the full shared importance to each correlated feature, whereas in conditional methods (CPI), none of the shared importance is given to the correlated features. Thus, permutation importance does not provide feature importances corresponding to one of the extremes, and it is approximately in the middle of the marginal and conditional frameworks.

Claim #2: Conditional feature importance (e.g., conditional permutation importance (CPI)) gives causal features high importance and non-causal features low importance [26].

Refutation: Contrary to the claim above, it has been argued by other authors that conditional methods are for prediction purposes while marginal methods are for explanatory or causal purposes [36]. While it is true that conditional importance measures can bear a resemblance to conditional independence statements, which often form the basis of causal graphs, we agree more with the perspective of Grömping [36] for several reasons. First, any time that a feature importance metric is based on a single model with permuted values, the metric is focusing on interpreting the model rather than the data, and interpreting the model can be misleading since machine learning models are often poor representations of underlying processes [81, 83]. Second, the experiment from Appendix G.1.4 shows that both conditional methods, CPI and ablation, performed poorly in assigning appropriate importance scores to features based on statistical association. Third, Fellinghauer et al. [30] attempted to use CPI to learn causal graphs, but they found that the method was inaccurate and slow compared to permutation importance.

Claim #3: A good feature importance method should incorporate some conditional and some marginal aspects [26, 36].

Refutation: To justify these claims, the authors cite Budescu [14] and Johnson and LeBreton [45]. Johnson and LeBreton [45] argue that individually analyzing marginal and conditional importance requires subjective interpretation, thus motivating the need for a single index that is descriptive of both perspectives. While this approach may be simpler to interpret, the separate consideration of marginal and conditional feature importance scores could give a user powerful information about the data. For example, by using both, one can quantify the redundant information that the feature set has about the response as well as each feature’s influence on the predictive power of the full model. If a third method is used that does not quantify interactions, perhaps the unique, redundant, and synergistic information from the partial information diagrams referenced in Appendix A.2 can each be quantified and compared. This may be an interesting project for future work.

Budescu [14] was concerned that current feature importance metrics depend on the subset of features used in a linear regression model. To counteract this, he introduced dominance analysis, which integrated both marginal and conditional importance. While the reasoning of Budescu [14] makes sense in the context of linear regression, the importance of features will certainly change depending on the other features in the feature set when interactions are considered. Thus, it is our position that even though the most suitable feature importance method usually depends on the problem the user seeks to answer, it is better to consider marginal importance, conditional importance, or both of them separately, than it is to seek a metric that mixes the marginal and conditional frameworks.

E Preprocessing methods for removing dependencies

Finding information-preserving independent representations of our data is the central step of UMFI. These representations were first considered for AI fairness and privacy algorithms in order to give unbiased predictions in the face of sensitive attributes. For example, if one wants to remove the influence of race on recidivism likelihood predictions, preprocessing methods can be used to alter the original dataset such that the set of predictors are independent of race. In the following subsections, we discuss how optimal transport and linear regression can be used for finding these representations.


E.1 Optimal transport

Most of the results and methods explained in this section can be found in Johndrow and Lum [43]. To avoid confusion with the probability density function, we denote a feature f ∈ F by Z. Let X ∈ F \ Z. We would like to remove the dependencies of Z from X with minimal information loss with respect to X. To do so using optimal transport, we consider the Monge problem:

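$$
g_c(X, \tilde{X}) = \inf_{g:\, g(X) \sim \tilde{X}} \mathbb{E}[c(X, g(X))] = \inf_{g:\, g(X) \sim \tilde{X}} \int_{\mathbb{R}} c(x, g(x))\, d\mu(x). \tag{2.1.1}
$$

The quantity $g_c(X, \tilde{X})$ represents the transportation cost of moving $X$ to $\tilde{X}$ with respect to some cost function $c$, and in our case, we desire $\tilde{X} \perp\!\!\!\perp Z$. It is natural to use $c(x, \tilde{x}) = d^q(x, \tilde{x})$, where $d$ is the Euclidean norm. The transportation cost is also given by the Wasserstein-$q$ distance, $g_c(X, \tilde{X}) = W_q^q(X, \tilde{X})$, defined below for one-dimensional distributions.

$$
W_q(X, \tilde{X})^q = \int_0^1 \left| F^{\leftarrow}(p) - \tilde{F}^{\leftarrow}(p) \right|^q dp,
$$

where $F$ and $\tilde{F}$ are the CDFs of $X$ and $\tilde{X}$, and $F^{\leftarrow}(p) = \sup\{x \in \mathbb{R} : F(x) \leq p\}$. It can be shown that given any continuous one-dimensional distributions $X$ and $\tilde{X}$, the optimal transport map $g : X \to \tilde{X}$ is given by $g = \tilde{F}^{\leftarrow} \circ F$.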

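Theorem E.1. Let $X$ be a r.v. with density $f$ and CDF $F$. Let $\tilde{X}$ have CDF $\tilde{F}$. Then $g = \tilde{F}^{\leftarrow} \circ F$ is the minimizer to (2.1.1). Hence, $g$ optimally transports $X$ to $\tilde{X} = \tilde{F}^{\leftarrow}(F(X))$.

Proof. We show $\mathbb{E}[|X - g(X)|^q] = \int_0^1 |F^{\leftarrow}(p) - \tilde{F}^{\leftarrow}(p)|^q\, dp$ for $g = \tilde{F}^{\leftarrow} \circ F$:

$$
\begin{aligned}
\mathbb{E}[|X - g(X)|^q] &= \int_{-\infty}^{\infty} \left| x - \tilde{F}^{\leftarrow}(F(x)) \right|^q f(x)\, dx \\
&= \int_{-\infty}^{\infty} \left| F^{\leftarrow}(F(x)) - \tilde{F}^{\leftarrow}(F(x)) \right|^q f(x)\, dx = \int_0^1 \left| F^{\leftarrow}(p) - \tilde{F}^{\leftarrow}(p) \right|^q dp
\end{aligned}
$$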
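An empirical analogue of Theorem E.1 is that, for two equally sized samples, pairing the i-th smallest value of one with the i-th smallest value of the other minimizes the average cost among all pairings when q ≥ 1. The sketch below checks this against random pairings; the distributions, sample size, and function name are arbitrary choices made for illustration.

```python
import numpy as np

def pairing_cost(x, y, q=2):
    """Average transport cost |x_i - y_i|^q of a given pairing."""
    return float(np.mean(np.abs(x - y) ** q))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)         # sample from X
target = rng.exponential(1.0, 1000)    # sample from the target distribution

# Empirical quantile map: send the i-th smallest x to the i-th smallest target value.
ranks = np.argsort(np.argsort(x))
transported = np.sort(target)[ranks]

monotone_cost = pairing_cost(x, transported)
random_costs = [pairing_cost(x, rng.permutation(target)) for _ in range(200)]
print(monotone_cost, min(random_costs))
```

The transported values are a rearrangement of the target sample, so the target distribution is matched exactly while the cost is kept minimal.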

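Theorem E.2. Let $F_{X|z}(x) = P(X \leq x \mid Z = z)$ denote the CDF of $X|\{Z = z\}$. Then $g = \tilde{F}^{\leftarrow} \circ F_{X|z}$ optimally transports $X|\{Z = z\}$ to $\tilde{X} \perp\!\!\!\perp Z$ for any CDF $\tilde{F}$.

Proof. We apply Theorem E.1 on the random variable $X|\{Z = z\}$ and note that $X|\{Z = z\}$ is independent of $Z$. In particular, $g(X|Z = z) \perp\!\!\!\perp Z$ for any choice of $\tilde{F}$, since the transported variable has CDF $\tilde{F}$ for every value of $z$.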

Theorem E.2 suggests an algorithm for transporting data $(x_1, ..., x_n)$ sampled from $X$, to $(\tilde{x}_1, ..., \tilde{x}_n) \perp\!\!\!\perp (z_1, ..., z_n)$. Since $x_j$ is taken jointly with $z_j$, as they are attributes coming from the $j$th sample in the dataset, $x_j$ is a realization of the distribution $X|\{Z = z_j\}$. Consequently, for each $j = 1, ..., n$, we should transport $x_j$ to $\tilde{x}_j = \tilde{F}^{\leftarrow}(F_{X|z_j}(x_j))$. This procedure can also be adapted for features sampled from discrete r.v.’s, as shown in Johndrow and Lum [43].

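Algorithm 2: Algorithm for removing dependencies of $Z$ from $X$

Require: $X = [x_1, \ldots, x_n]$, $Z = [z_1, \ldots, z_n]$, $X \mid (Z = z_j) \sim F_{X|z_j}$, $\tilde{F}$ is a CDF
for $j = 1, \ldots, n$ do
    $\tilde{x}_j = \tilde{F}^{\leftarrow}(F_{X|z_j}(x_j))$
end for
return $\tilde{x} = [\tilde{x}_1, \ldots, \tilde{x}_n]$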

We would ideally pick $\tilde{F}$ such that it minimizes the transportation cost $g_c(X, \tilde{X}) = g_c(X, \tilde{F}^{\leftarrow}(F_{X|z_j}(X)))$ across all CDFs $\tilde{F}$ in order to minimize information loss. However, in practice, the choice of $\tilde{F}$ does not matter much. In fact, as long as the support of $\tilde{F}$ is at least as large as the support of $F_X$, any rank-based prediction rule, e.g. random forest, will be invariant to the choice of $\tilde{F}$ [43]. A standard choice for $\tilde{F}$ is $F_X$ so that we can recover the original quantiles of $X$.

Furthermore, $F_{X|z_j}$ is not usually known and must be estimated from the data. For example, this can be done by splitting $Z$ into $N$ quantiles and using the empirical CDF $P(X \le x_j \mid Z \in z_j\text{'s quantile})$. The ability of this method to remove dependencies on $Z$ from $X$ relies significantly on the accuracy of this estimate.
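The quantile-based estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `remove_dependence`, the parameter `n_bins`, and the simulated data are hypothetical. The conditional CDF is estimated by the empirical CDF of $X$ within equal-frequency bins of $Z$, and the target CDF is taken to be the empirical CDF of $X$.

```python
import numpy as np

def remove_dependence(x, z, n_bins=20):
    """Sketch of Algorithm 2 with a binned estimate of F_{X|z_j}.

    Assumptions: equal-frequency bins of z stand in for "Z's quantiles",
    and the target CDF is the empirical CDF of x.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = x.size
    # Rank of each z_j, then an equal-frequency bin index in {0, ..., n_bins - 1}.
    z_rank = np.argsort(np.argsort(z))
    bin_id = (z_rank * n_bins) // n
    x_new = np.empty(n)
    for b in range(n_bins):
        idx = np.flatnonzero(bin_id == b)
        if idx.size == 0:
            continue
        # Empirical conditional CDF value of x_j within its bin, kept inside (0, 1).
        within_rank = np.argsort(np.argsort(x[idx]))
        p = (within_rank + 0.5) / idx.size
        # Inverse of the target CDF: empirical quantiles of the pooled x.
        x_new[idx] = np.quantile(x, p)
    return x_new

# Illustrative data with a nonlinear dependence of x on z (assumption).
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z ** 2 + 0.5 * rng.normal(size=5000)
x_new = remove_dependence(x, z, n_bins=50)

corr_before = np.corrcoef(x, z ** 2)[0, 1]
corr_after = np.corrcoef(x_new, z ** 2)[0, 1]
print(corr_before, corr_after)
```

Because each bin's values are mapped onto the pooled quantiles of `x`, the transformed feature keeps the marginal range of the original while its dependence on `z` is reduced. Some dependence remains within each bin, which is why the accuracy of the conditional CDF estimate matters.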

We may iterate Algorithm 2 over each feature in $F \setminus Z$ to obtain pairwise independence between the transported variables $\tilde{X}_j$ and $Z$. It is also possible to iterate Algorithm 2 via chaining to achieve mutual independence between the transformed variables $\tilde{X}_j$ and $Z$ [43, 2.4]. However, this is computationally expensive, and pairwise independence should suffice for an accurate UMFI score, as will be explored further in Section F. Step 2 of Algorithm 1 can therefore be implemented with Algorithm 3.

Algorithm 3: Algorithm for estimating $S^F_Z$ via pairwise optimal transport

Require: $Z = [z_1, \ldots, z_n]$, $X_j = [x_{j1}, \ldots, x_{jn}]$ for $X_j$ in $F \setminus Z$
$S^F_Z = \emptyset$
for $X_j$ in $F \setminus Z$ do
    $\tilde{X}_j$ = output of Algorithm 2 with $X_j$ and $Z$
    add $\tilde{X}_j$ to $S^F_Z$
end for
return $S^F_Z$

In other words, we may estimate $S^F_f = S^F_Z$ as:

$$S^F_Z = \{F_X^{\leftarrow}(F_{X|z}(X)) : X \in F \setminus Z\}.$$

E.2 Linear regression

The most basic method for removing dependencies is linear regression. Even though it is quite simple, it can be shown to be optimal with a few assumptions (Theorem E.3). This preprocessing technique is implemented in the popular Python package fairlearn [10, 54].

To reiterate, removing dependencies requires methods to make a feature or set of features $S$ independent of a protected attribute $f$, while keeping as much of the original information as possible. The overarching idea is that under the assumption that the residuals and the protected attribute are jointly Gaussian, we may show that the residuals can be utilized as a representation of $S$, which is independent of $f$.

Theorem E.3. Assuming no intercept term, if one specifies a linear regression model with

$$Y = \beta X + \varepsilon$$

and $X$ and $\varepsilon$ are jointly normally distributed, then (1) $\varepsilon \perp\!\!\!\perp X$ and (2) $\varepsilon$ is correlated with $Y$ unless $Y$ can be completely predicted from $X$.

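Proof. (1) From the normal equations, the definition of covariance, and the fact that $E[\varepsilon] = 0$, it follows that

$$\begin{aligned}
\mathrm{Cov}(X, \varepsilon) &= E[X^T \varepsilon] - E[\varepsilon]E[X] = E[X^T \varepsilon] = E[X^T (Y - X\beta)] \\
&= E[X^T (Y - X(X^T X)^{-1} X^T Y)] = E[X^T Y - X^T X (X^T X)^{-1} X^T Y] = E[X^T Y - X^T Y] = 0.
\end{aligned}$$

Then, since $X$ and $\varepsilon$ are jointly normal, $X \perp\!\!\!\perp \varepsilon$.

(2) From the definition of the response variable $Y$ and the distributive property for covariances, we know

$$\mathrm{Cov}(Y, \varepsilon) = \mathrm{Cov}(X\beta + \varepsilon, \varepsilon) = \beta\,\mathrm{Cov}(X, \varepsilon) + \mathrm{Cov}(\varepsilon, \varepsilon) = \mathrm{Var}(\varepsilon),$$

which is nonzero unless $\varepsilon = 0$ almost surely, that is, unless $Y$ can be completely predicted from $X$.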


Thus, in step 2 of the algorithm for UMFI (Algorithm 1), we can estimate

$$S^F_f = \{\varepsilon_i = X_i - \beta_{0,i} - \beta_{1,i} f : X_i \in F \setminus f\}.$$
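The residual-based estimate above can be sketched with ordinary least squares. This is a minimal illustration, not the implementation used in the experiments: the function name `residualize` and the simulated data are assumptions, and the significance check on the slope described in Section F is omitted for brevity.

```python
import numpy as np

def residualize(features, f):
    """Replace each column X_i by the residual of an OLS fit X_i ~ b0 + b1 * f."""
    features = np.asarray(features, dtype=float)
    f = np.asarray(f, dtype=float)
    # Design matrix with an intercept column and the feature of interest.
    design = np.column_stack([np.ones_like(f), f])
    # One least-squares fit per column of `features`; coef has shape (2, p).
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef

# Illustrative data (assumption): one column depends linearly on f, one does not.
rng = np.random.default_rng(0)
f = rng.normal(size=1000)
features = np.column_stack([
    2.0 * f + rng.normal(size=1000),
    rng.normal(size=1000),
])
residuals = residualize(features, f)

# Sample correlation between each residual column and f.
max_abs_corr = max(
    abs(np.corrcoef(residuals[:, i], f)[0, 1]) for i in range(residuals.shape[1])
)
print(max_abs_corr)
```

By construction the residuals are uncorrelated with `f` in sample. As Theorem E.3 indicates, this yields independence only under joint normality, so nonlinear dependence on `f` can survive this step.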

F Experiments comparing linear regression and optimal transport

In the following subsections, we compare the ability of linear regression and pairwise optimal transport to remove the information of a feature from data while distorting the original data as little as possible. It can be concluded that while linear regression works optimally when the data is jointly Gaussian, on real data, such as the BRCA dataset, pairwise optimal transport can find independent representations of the data, while linear regression fails (Section F.1).

To implement UMFI paired with linear regression, we only remove dependencies when the regression slope coefficient is statistically significant (p-value < 0.01). To implement UMFI paired with pairwise optimal transport, when removing dependencies on the feature $Z$ from the dataset, we estimate $F_{X|Z}$ by breaking up $Z$ into quantiles of size 150 and running linear regression on each quantile. The new orthogonal predictors are then given by the values of the inverse empirical CDF of the residuals from the mentioned linear regression model.

F.1 Removing dependencies

It is crucial for our linear regression and optimal transport preprocessing methods to remove the information associated with the feature of interest, $f$, from the rest of the dataset $F \setminus \{f\}$. Therefore, we would like the preprocessed dataset $S^F_f$ to share zero mutual information with $f$. The mutual information $I(f; S^F_f)$ is difficult to calculate, but it is closely related to the optimal predictor of $f$ given $S^F_f$ [63]. For example, if $I(f; S^F_f) = 0$, as is desired, then the optimal predictor of $f$ will have zero accuracy given $S^F_f$. If the opposite is true and $S^F_f$ contains all of the information from $f$, then an optimal predictor of $f$ should be able to perfectly predict $f$ from the given information in $S^F_f$. In the following experiments, we assume that random forests can form the optimal predictor of $f$ given $S^F_f$. We use the OOB-$R^2$ value coming from the random forest model to give a relative measure of the mutual information between $f$ and the transformed dataset $S^F_f$.

We used the BRCA dataset with 50 features to test the ability of optimal transport and linear regression to remove dependencies [23, 16]. All 50 features are continuous and the response is categorical. For each individual feature, we first use random forest OOB-$R^2$ to give a relative measure of the mutual information $I(f; F \setminus \{f\})$ between the feature of interest $f$ and the other 49 features. We then consider the case where the 49 remaining features are preprocessed to have dependencies on $f$ removed via linear regression or pairwise optimal transport. Similarly, random forest's OOB-$R^2$ is used to give a relative measure of $I(f; S^F_f)$.

The results are plotted in Figure 6. It is clear that the raw data (black line) shares considerable information across features. Most features can be predicted from the other untransformed features with an accuracy of $R^2 > 0.2$ and many can even be predicted with accuracies over 0.4. Since the data has extremely nonlinear dependencies between features, simple linear regression is unable to remove all the mutual information between the protected attributes and the rest of the features. Indeed, the data certainly cannot be approximated with multivariate Gaussians. Conversely, pairwise optimal transport can successfully remove most of the mutual information present in the data. For all 50 features in the dataset, $f$ cannot be predicted successfully by random forest (OOB-$R^2$ = 0) from the other features after $F \setminus f$ is transformed with pairwise optimal transport.

F.2 Distortion

Not only do we require that the transformed features are independent of the feature of interest, but we also require that as much of the information present in the original data as possible is preserved in the transformed data. To measure the amount of distortion imposed on the original data, we measure the dependence between the original and perturbed data using the maximal information coefficient [46]. For each feature in the BRCA dataset with 50 features [23, 16], the information from the current feature is removed from all other features with either linear regression or pairwise optimal transport (Figure 7).


Figure 6: The relative mutual information $I_{rel}(f_i; F \setminus \{f_i\})$ between the $i$th feature in the BRCA dataset and all other features is plotted (black) for each $i \in \{1, 2, \ldots, 50\}$. The relative mutual information $I_{rel}(f_i; S^F_{f_i})$ between the $i$th feature and all other features after preprocessing with linear regression (red) and optimal transport (blue) is also plotted. Relative mutual information is measured by random forest's OOB-$R^2$.

Figure 7: Cell $(i, j)$ indicates how similar the $j$th variable in the BRCA dataset is compared to its transformation via pairwise optimal transport or linear regression with respect to feature $i$. This is measured with the maximal information coefficient, which is comparable to $R^2$. To make the plots more clear and accessible, only the first 15 features are shown.

Linear regression does not distort the transformed features in most cases. The dependence between the original and perturbed features usually remains near 1, though the dependence does go as low as 0.42 in one case (Figure 7). While linear regression transformed these features with minimal distortion, these results are moot since linear regression failed to remove the original dependencies in a significant way, which was the main goal of the method (Figure 6).

Compared to linear regression, pairwise optimal transport has a much more sizable effect on the distorted features, though this may have been necessary to completely remove dependence. The dependence between original and perturbed features mostly ranges from 0.6 to 0.9, though some are as low as 0.37 (Figure 7). While only the first 15 features are shown, the results are similar for the other 35 features.

G Further feature importance experiments

This section comprises additional experiments performed on the simulated data introduced in Section 4.1, the BRCA dataset with permuted random genes, the original BRCA dataset with unpermuted random genes [70, 23, 17], and the CAMELS hydrology dataset [1]. MCI and UMFI used either random forests or extremely randomized trees [11, 32]. Both of these are implemented using the ranger R package [77]. Ablation, permutation importance, and conditional permutation importance used random forests. Ablation and permutation importance were implemented with the ranger R package [77], while conditional permutation importance was implemented with the randomForest and permimp packages [27, 50]. All experiments were run in Microsoft R Open Version 4.0.2 [55].

G.1 Extra experiments on simulated data

We repeat our previous experiments on simulated data from Section 4.1 to test how ablation, permutation importance (PI), and conditional permutation importance (CPI) behave in the presence of nonlinear interactions (Section G.1.1), correlated interactions (Section G.1.2), correlation (Section G.1.3), and blood and non-blood related features (Section G.1.4). Further, we test how using extremely randomized trees instead of random forests for MCI and UMFI changes the results of the same simulation experiments. Although other methods such as XGBoost [19] could have been implemented for these experiments, XGBoost requires greater care when optimizing hyperparameters, so we chose to use extremely randomized trees instead, which is faster than random forests and provides similarly good predictions [32]. Both random forests and extremely randomized trees are not sensitive to hyperparameters [59]. For these simulation studies, we also perturb the size of the quantiles used by UMFI_OT. We now use quantiles of size 30 instead of size 150. Quantiles of size 30 worked better on the hydrology data used in later experiments, so we test to see if the simulation results were sensitive to this choice in quantile size for dependency removal via optimal transport.

G.1.1 Nonlinear interactions

The first experiment on simulated data handles the case where two variables, $x_1$ and $x_2$, interact in a nonlinear way in the response $Y$. As explained in Section 4.1.1, we should expect $x_1$ and $x_2$ to contribute more than half of the total importance, while $x_3$ and $x_4$ should be important, but less important compared to $x_1$ and $x_2$. Figure 8a shows that ablation, PI, and CPI all provide accurate scores.

When tested with extremely randomized trees, the nonlinear interactions simulation experiment results for MCI and UMFI, shown in Figure 8e, remain mostly unchanged compared to the results from the experiment with random forests given in Figure 1a.

G.1.2 Correlated interactions

The second experiment considers the case where two correlated variables, $x_1$ and $x_2$, interact together in the response $Y$. Thus, as explained in Section 4.1.2, we should expect $x_1$ and $x_2$ to have more importance compared to $x_3$ and $x_4$. Figure 8b shows that ablation, PI, and CPI all correctly weigh the importance of $x_1$ and $x_2$ higher relative to $x_3$ and $x_4$. The only notable difference is that the ablation method attributes an additional ∼3% importance to each of $x_1$ and $x_2$ compared to PI, CPI, MCI, and UMFI (Figure 8b).

When tested with extremely randomized trees instead of random forests, the correlated interaction simulation experiment results (Figure 8f) for MCI and UMFI are similar to the earlier results shown in Figure 1b. MCI gave slightly more importance to $x_1$ and $x_2$ compared to $x_3$ and $x_4$, though the differences are seemingly insignificant. On the other hand, both UMFI methods gave significantly more importance to $x_1$ and $x_2$ compared to $x_3$ and $x_4$, as expected.


G.1.3 Correlation

The third experiment tests how the metrics allocate importance to correlated features. As explained in Section 4.1.3, $x_1$ and $x_2$ should remain around the same relative importance, and $x_3 = x_1 + \varepsilon$ should have just slightly less importance compared to $x_1$ and $x_2$. Figure 8c indicates that CPI and ablation give near zero importance to the two heavily correlated features $x_1$ and $x_3$. This aligns with the discussion about conditional feature importance methods in Section A.2 since these methods base their scores on the importance of a feature conditioned on all other variables. Ablation performs similarly to CPI in this test, albeit with slightly less drastic results. Finally, we see that PI splits the importance detected from $x_1$ and $x_3$ proportionally across both features. This shows that PI is in between the marginal and conditional approaches. The marginal approaches (MCI and UMFI) allocate all of the redundant information to the feature. The conditional approaches (CPI and ablation) allocate none of the redundant information to the feature. PI evenly splits the redundant information across the relevant correlated features.

When tested with extremely randomized trees, the correlation simulation experiment results (Figure 8g) for MCI and UMFI change slightly compared to the experiment with random forests in Figure 1c. MCI works well, though it still gives some non-zero importance to $x_4$. With random forests, the relative importance of $x_4$ was usually above 5%, but with extremely randomized trees, the relative importance dropped below 5%. The performance of UMFI with linear regression got slightly worse, as now the importance of $x_1$ is slightly greater than that of $x_2$ on average. The performance of UMFI with optimal transport changed for the better, and now the importances of $x_1$ and $x_2$ are almost identical, which was not true before. In this experiment, UMFI_OT performed the best.

G.1.4 Blood relation

For the last simulation experiment, we revisit the blood relation experiment performed in Section 4.1.4 using data generated from the causal graph in Figure 2. The feature $S$ is unobserved, so the only blood related features to $Y$ in $F$ are $x_3$ and $x_4$. $x_3$ and $x_4$ should therefore be given high importance while $x_1$ and $x_2$ should receive zero importance. When tested on ablation, CPI, and PI, we notice that all three metrics fail to capture the desired importance, since they each give significant importance to $x_2$, which is not blood related to $Y$.

When the experiment was re-tested on MCI and both implementations of UMFI using extremely randomized trees instead of random forest, we observe that UMFI_LR and UMFI_OT both continue to give positive importance to the blood related features $x_3$ and $x_4$, while giving near-zero importance to the two remaining observed features (Figure 8h). However, we note that $x_3$ is given much more importance relative to $x_4$ when implemented with extremely randomized trees compared to random forests (Figure 1d). On the other hand, MCI gives positive importance to $x_2$ in this experiment. However, we note that it also correctly gave $x_1$ almost zero importance while giving $x_3$ and $x_4$ significantly more importance compared to the random forest implementation. Across most simulation studies, it appears MCI performs better using extremely randomized trees compared to random forests.


[Figure 8 panels: (a) RF: Nonlinear interactions; (b) RF: Correlated interactions; (c) RF: Correlation; (d) RF: Blood relation; (e) ET: Nonlinear interactions; (f) ET: Correlated interactions; (g) ET: Correlation; (h) ET: Blood relation.]

Figure 8: Results for the experiments on simulated data from Subsection G.1. The results for ablation, conditional permutation importance (CPI), and permutation importance (PI) were implemented with random forest (RF), and are shown in Figures 8a, 8b, 8c, and 8d. The results for MCI, UMFI_LR, and UMFI_OT were implemented with extremely randomized trees (ET), and are shown in Figures 8e, 8f, 8g, and 8h. Feature importance scores are shown as a percentage of the total for each of $x_1$ to $x_4$ from 100 replications.


G.2 Extra BRCA experiments with known ground-truth feature importance

The following experiments are performed on the BRCA dataset with 571 patients, each with one of four breast cancer subtypes, and 50 continuous predictor genes. The experiments use the same setting as in Section 4.2, where the 40 randomly chosen genes are also permuted so that the ground-truth feature importances are known. We observed that the overall classification accuracy of random forests for this dataset was 0.76.

G.2.1 Running 5000 iterations of UMFI

The original BRCA experiment conducted in Section 4.2 showed that UMFI_LR and UMFI_OT performed impressively on real data, providing significantly more accurate feature importance scores than MCI after 200 iterations of the experiment. Both UMFI_LR and UMFI_OT correctly gave high importance to the ten BRCA-associated genes, while giving zero median importance to about 80% of the unassociated genes. Additionally, in an overnight study spanning less than ten hours, UMFI_LR and UMFI_OT displayed ideal results after running 5000 iterations of the BRCA experiment. As shown in Figure 9, both implementations of UMFI achieve 100% overall accuracy by giving high importance to the ten BRCA-associated genes and zero median feature importance to all 40 unassociated genes. These results indicate that UMFI's relatively low computational cost can be leveraged via aggregation to achieve superior performance on complex data within a reasonable time budget.

Figure 9: Median feature importance scores provided by (a) UMFI with linear regression, and (b) UMFI with pairwise optimal transport, for each gene in the permuted BRCA dataset after 5000 iterations. Genes colored in blue are known to be associated with breast cancer while genes colored in grey are random permutations of randomly selected genes, which we assume to be unassociated with breast cancer subtype. The first and third quantiles of the scores are visualized for each gene.


G.2.2 Ablation, PI, and CPI

We also test the quality and robustness of other feature importance metrics, including ablation, PI, and CPI, by running 200 iterations of the BRCA experiment from Section 4.2 for each method. Results are shown in Figure 10. Ablation importance scores are small and have large uncertainties compared to their medians, which makes the scores impractical to interpret. Eight of the ten important genes are identified by ablation, but all other genes are given exactly zero median importance. All ten important genes are given non-zero importance by CPI; however, some randomly permuted genes are given more importance than some genes known to be important, such as CDK6. PI gave more reliable and stable results compared to ablation and CPI in this experiment, exhibiting similar performance to UMFI_LR and UMFI_OT from the analogous experiment shown in Figure 3. We note that PI assigned zero importance to 29 of the 40 unassociated genes, making its TNR of 0.725 slightly lower than that of UMFI in the analogous experiment from Section 4.2.
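The true negative rate quoted above follows from counting unassociated genes whose median importance is zero. A minimal sketch of that calculation is given below. The function name `true_negative_rate`, the array layout, and the synthetic scores are assumptions for illustration; the scores are constructed to reproduce the 29-of-40 count rather than taken from the experiment.

```python
import numpy as np

def true_negative_rate(scores, is_associated):
    """Fraction of unassociated features whose median importance is exactly zero.

    scores: array of shape (iterations, features) with importance scores.
    is_associated: boolean array, True for features assumed to be important.
    """
    median_scores = np.median(np.asarray(scores, dtype=float), axis=0)
    negatives = ~np.asarray(is_associated, dtype=bool)
    return np.mean(median_scores[negatives] == 0)

# Synthetic scores (assumption): 5 iterations, 10 associated and 40 unassociated genes.
scores = np.zeros((5, 50))
is_associated = np.zeros(50, dtype=bool)
is_associated[:10] = True
scores[:, :10] = 1.0     # associated genes receive positive importance
scores[:, 10:21] = 0.01  # 11 unassociated genes receive spurious importance

tnr = true_negative_rate(scores, is_associated)
print(tnr)
```

With 29 of the 40 unassociated genes at zero median importance, the rate is 29/40 = 0.725, matching the value reported for PI.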

Figure 10: Median feature importance scores provided by (a) ablation, (b) permutation importance, and (c) conditional permutation importance, for each gene in the permuted BRCA dataset after 200 iterations. Genes colored in blue are known to be associated with breast cancer while genes colored in grey are random permutations of randomly selected genes, which we assume to be unassociated with breast cancer subtype. The first and third quantiles of the scores are visualized for each gene.


G.3 Experiments on unpermuted BRCA data

Additional BRCA experiments were performed on the original randomized genes, as done in Covert et al. [23] and Catav et al. [17]. The observed overall classification accuracy of random forests for this dataset was 0.79.

Feature importance scores on this dataset were first computed with MCI, UMFI_LR, and UMFI_OT over 100 iterations, as shown in Figure 11. The ordering of the BRCA associated genes is fairly similar across MCI and both UMFI methods. BCL11A and SLC22A5 are always the top two features and TEX14 is always the least important BRCA associated gene. While there are clear similarities in the results of all methods, the glaring difference is the number of features given zero importance. While MCI gives non-zero median importance to all 50 features, 14 features are given zero median importance by UMFI with linear regression, and 10 features are given zero median importance by UMFI with pairwise optimal transport. It is unlikely that all 40 randomly selected genes, which have not shown any association with breast cancer in previous studies, share information about breast cancer, so in this respect, we conclude that UMFI performs better than MCI.

Figure 11: Median feature importance scores provided by (a) MCI, (b) UMFI with linear regression, and (c) UMFI with pairwise optimal transport, for each gene in the unpermuted BRCA dataset after 100 iterations. Genes colored in blue are associated with breast cancer while genes colored in grey are randomly selected genes. The first and third quantiles of the scores are visualized for each gene.


Feature importance scores on the unpermuted BRCA dataset were also computed with ablation, CPI, and PI over 100 iterations, as shown in Figure 12. When also considering these results, we observe that MCI, UMFI, and PI give similar importance scores, while ablation and CPI performed significantly worse. Once again, ablation's high relative variance hampers its interpretability. Meanwhile, CPI gave by far the highest importance to SLC25A1, which is not known to have any association with breast cancer. In the results of MCI, UMFI, and PI, BCL11A is the most important while CST9L is always among the most important non-BRCA associated genes. Contrary to this, ablation and CPI give high importance to BRCA1, BRCA2, TEX14, EZH2, and IGF1R for BRCA associated genes, and SLC25A1 for non-BRCA associated genes.

Figure 12: Median feature importance scores provided by (a) ablation, (b) permutation importance, and (c) conditional permutation importance, for each gene in the unpermuted BRCA dataset after 100 iterations. Genes colored in blue are associated with breast cancer while genes colored in grey are randomly selected genes. The first and third quantiles of the scores are visualized for each gene.

G.3.1 Computational complexity

We compare the computational complexity of UMFI and MCI against the other feature importance methods that were explored in this section: ablation, PI, and CPI. To do so, we ran 10 iterations of the BRCA experiment, which has 50 features, each with 571 observations. We recorded the average time for each method to compute feature importance for 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 features.


Figure 13: The average computation time for each method to process $p$ features over 10 iterations of the original BRCA data is plotted for each $p \in \{5, 10, 15, 20, 25, 30, 35, 40, 45, 50\}$.

Figure 13 shows that PI is the fastest method, processing 50 features in 50 milliseconds on average, followed by ablation (50 features in 1.8 seconds), UMFI (50 features in 3 seconds when parallelized), CPI (50 features in 30 seconds), and finally MCI with soft 2-size submodularity (50 features in 205 seconds).

G.4 Experiments on hydrology data

The final experiments for this study were conducted on a large-sample hydrology dataset called CAMELS [1]. This dataset records catchment averaged climate, soil, geology, topography, and land cover characteristics for 643 catchments across the contiguous United States. With these, there are 29 continuous explanatory variables. The response variable is averaged yearly streamflow, which is also continuous. Extremely randomized trees were used in this experiment with an overall OOB-$R^2$ of 0.91.

Figure 14, which is analogous to Figure 6 in Appendix F, shows that both preprocessing methods fail to completely remove dependencies from the CAMELS dataset. This can likely be attributed to the fact that each feature is almost completely dependent on the other explanatory features ($R^2 \ge 0.65$).

Figure 14: The relative mutual information $I_{rel}(f_i; F \setminus \{f_i\})$ between the $i$th feature in the CAMELS dataset and all other features is plotted (black) for each $i \in \{1, 2, \ldots, 30\}$. The relative mutual information $I_{rel}(f_i; S^F_{f_i})$ between the $i$th feature and all other features after preprocessing with linear regression (red) and optimal transport (blue) is also plotted. Relative mutual information is measured by random forest's OOB-$R^2$.


The feature importance scores indicated in Figure 15 show that mean precipitation and aridity index are the features with the strongest relationships with mean annual streamflow. Geology and soil attributes such as bedrock permeability and soil porosity are always among the least important features. These conclusions are in line with previous studies [2, 42].

Figure 15: Median feature importance scores provided by (a) MCI, (b) UMFI with linear regression, and (c) UMFI with pairwise optimal transport, for each explanatory variable in the CAMELS dataset, taken after 100 iterations. The first and third quantiles of the scores are visualized for each feature.


References

[1] Nans Addor, Andrew J Newman, Naoki Mizukami, and Martyn P Clark. The camels data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences, 21(10):5293–5313, 2017. 7, 12, 22, 29

[2] Nans Addor, Grey Nearing, Cristina Prieto, AJ Newman, Nataliya Le Vine, and Martyn P Clark. A ranking of hydrological signatures based on their predictability in space. Water Resources Research, 54(11):8792–8812, 2018. 2, 7, 30

[3] Ahmed Al-Ani, Mohamed Deriche, and Jalel Chebil. A new mutual information based measure for feature selection. Intelligent Data Analysis, 7(1):43–57, 2003. 14

[4] Daniel W Apley and Jingyu Zhu. Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):1059–1086, 2020. 2

[5] Eviatar Bach, Valentina Radic, and Christian Schoof. How sensitive are mountain glaciers to climate change? insights from a block model. Journal of Glaciology, 64(244):247–258, 2018. 7

[6] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on neural networks, 5(4):537–550, 1994. 14

[7] Adrián Bazaga, Dan Leggate, and Hendrik Weisser. Genome-wide investigation of gene-cancer associations for the prediction of novel therapeutic targets in oncology. Scientific reports, 10(1):1–10, 2020. 2

[8] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42(22):8520–8532, 2015. 14

[9] Jian Bi. A review of statistical methods for determination of relative importance of correlated predictors and identification of drivers of consumer liking. Journal of Sensory Studies, 27(2):87–101, 2012. 3

[10] Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in ai. Microsoft, Tech. Rep. MSR-TR-2020-32, 2020. 3, 6, 19

[11] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. 2, 7, 22

[12] Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Routledge, 2017. 16

[13] Alexander Brenning and GF Azócar. Statistical analysis of topographic and climatic controls and multispectral signatures of rock glaciers in the dry andes, chile (27–33 s). Permafrost and Periglacial Processes, 21(1):54–66, 2010. 7

[14] David V Budescu. Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psychological bulletin, 114(3):542, 1993. 14, 17

[15] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. Advances in neural information processing systems, 30, 2017. 3, 12

[16] Amnon Catav, Boyang Fu, Jason Ernst, Sriram Sankararaman, and Ran Gilad-Bachrach. Marginal contribution feature importance–an axiomatic approach for the natural case. arXiv preprint arXiv:2010.07910, 2020. 1, 2, 7, 15, 20

[17] Amnon Catav, Boyang Fu, Yazeed Zoabi, Ahuva Libi Weiss Meilik, Noam Shomron, Jason Ernst, Sriram Sankararaman, and Ran Gilad-Bachrach. Marginal contribution feature importance - an axiomatic approach for explaining data. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1324–1335. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/catav21a.html. 2, 3, 4, 7, 8, 9, 10, 15, 22, 27


[18] Hugh Chen, Joseph D Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to thedata? arXiv preprint arXiv:2006.16234, 2020. 1, 3

[19] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, KailongChen, et al. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4):1–4, 2015. 22

[20] Shay Cohen, Gideon Dror, and Eytan Ruppin. Feature selection via coalitional game theory.Neural Computation, 19(7):1939–1961, 2007. 2

[21] Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007. 15

[22] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory 2nd Edition (WileySeries in Telecommunications and Signal Processing). Wiley-Interscience, July 2006. ISBN0471241954. 13, 14

[23] Ian Covert, Scott M Lundberg, and Su-In Lee. Understanding global feature contributionswith additive importance measures. Advances in Neural Information Processing Systems, 33:17212–17223, 2020. 2, 4, 9, 15, 16, 20, 22, 27

[24] Ian Covert, Scott M Lundberg, and Su-In Lee. Explaining by removing: A unified frameworkfor model explanation. J. Mach. Learn. Res., 22:209–1, 2021. 4

[25] Richard B Darlington. Multiple regression in psychological research and practice. Psychologicalbulletin, 69(3):161, 1968. 16

[26] Dries Debeer and Carolin Strobl. Conditional permutation importance revisited. BMC bioinfor-matics, 21(1):1–30, 2020. 1, 2, 16, 17

[27] Dries Debeer, Torsten Hothorn, Carolin Strobl, and Maintainer Dries Debeer. Package ‘per-mimp’. 2021. 22

[28] Douglas F Easton, Karen A Pooley, Alison M Dunning, Paul DP Pharoah, Deborah Thompson,Dennis G Ballinger, Jeffery P Struewing, Jonathan Morrison, Helen Field, Robert Luben, et al.Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 447(7148):1087–1093, 2007. 15

[29] Tamsin L Edwards, Sophie Nowicki, Ben Marzeion, Regine Hock, Heiko Goelzer, HélèneSeroussi, Nicolas C Jourdain, Donald A Slater, Fiona E Turner, Christopher J Smith, et al.Projected land ice contributions to twenty-first-century sea level rise. Nature, 593(7857):74–82,2021. 7

[30] Bernd Fellinghauer, Peter Bühlmann, Martin Ryffel, Michael Von Rhein, and Jan D Reinhardt. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Computational Statistics & Data Analysis, 64:132–152, 2013. 17

[31] Francis Galton. I. co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45(273-279):135–145, 1889. 16

[32] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine learning, 63(1):3–42, 2006. 22

[33] WA Gibson. Orthogonal predictors: A possible resolution of the hoffman-ward controversy. Psychological reports, 11(1):32–34, 1962. 2

[34] David A Gill, Michael B Mascia, Gabby N Ahmadia, Louise Glew, Sarah E Lester, Megan Barnes, Ian Craigie, Emily S Darling, Christopher M Free, Jonas Geldmann, et al. Capacity shortfalls hinder the performance of marine protected areas globally. Nature, 543(7647):665–669, 2017. 2

[35] Virgil Griffith and Christof Koch. Quantifying synergistic mutual information. arXiv preprint arXiv:1205.4265, 2012. 3, 6, 13, 14

[36] Ulrike Grömping. Variable importance in regression models. Wiley interdisciplinary reviews: Computational statistics, 7(2):137–152, 2015. 1, 16, 17

[37] Nimrod Harel, Ran Gilad-Bachrach, and Uri Obolski. Inherent inconsistencies of feature importance. arXiv preprint arXiv:2206.08204, 2022. 4, 8

[38] Giles Hooker, Lucas Mentch, and Siyu Zhou. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Statistics and Computing, 31(6):1–16, 2021. 2

[39] Aleks Jakulin and Ivan Bratko. Quantifying and visualizing attribute interactions: An approach based on entropy. 2003. 7

[40] Alexander Janssen, Mark Hoogendoorn, Marjon H Cnossen, Ron AA Mathôt, OPTI-CLOT Study Group, SYMPHONY Consortium, MH Cnossen, SH Reitsma, FWG Leebeek, RAA Mathôt, K Fijnvandraat, et al. Application of shap values for inferring the optimal functional form of covariates in pharmacokinetic modeling. CPT: Pharmacometrics & Systems Pharmacology, 2022. 2

[41] Joseph Janssen and Ali A Ameli. A hydrologic functional approach for improving large-sample hydrology performance in poorly gauged regions. Water Resources Research, 57(9):e2021WR030263, 2021. 7

[42] Florian U Jehn, Konrad Bestian, Lutz Breuer, Philipp Kraft, and Tobias Houska. Using hydrological and climatic catchment clusters to explore drivers of catchment behavior. Hydrology and Earth System Sciences, 24(3):1081–1100, 2020. 30

[43] James E Johndrow and Kristian Lum. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1):189–220, 2019. 3, 5, 6, 12, 18, 19

[44] Pål V Johnsen, Signe Riemer-Sørensen, Andrew Thomas DeWan, Megan E Cahill, and Mette Langaas. A new method for exploring gene–gene and gene–environment interactions in gwas with tree ensemble methods and shap values. BMC bioinformatics, 22(1):1–29, 2021. 2

[45] Jeff W Johnson and James M LeBreton. History and use of relative importance indices in organizational research. Organizational research methods, 7(3):238–257, 2004. 17

[46] Justin B Kinney and Gurinder S Atwal. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359, 2014. 20

[47] Gunnar König, Timo Freiesleben, Bernd Bischl, Giuseppe Casalicchio, and Moritz Grosse-Wentrup. Decomposition of global feature importance into direct and associative components (dedact). arXiv preprint arXiv:2106.08086, 2021. 3

[48] Ken Lau, Chandra Nair, and David Ng. A mutual information inequality and some applications. 13

[49] Edward Le, Ali Ameli, Joseph Janssen, and John Hammond. Snow persistence explains stream high flow and low flow signatures with differing relationships by aridity and climatic seasonality. Hydrology and Earth System Sciences Discussions, pages 1–22, 2022. 7

[50] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002. 22

[51] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. Advances in neural information processing systems, 26, 2013. 14

[52] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. 2

[53] Alexander Marx, Arthur Gretton, and Joris M Mooij. A weaker faithfulness assumption based on triple interactions. In Uncertainty in Artificial Intelligence, pages 451–460. PMLR, 2021. 7

[54] Matthijs, Vincent Warmerdam, and ManyOthers. scikit-fairness. scikit-fairness. https://github.com/koaning/scikit-fairness, 2019. 19

[55] R Core Team Microsoft. Microsoft R Open. Microsoft, Redmond, Washington, 2017. URL https://mran.microsoft.com/. 7, 22

[56] Christoph Molnar. Interpretable machine learning. Lulu.com, 2020. 2

[57] Christoph Molnar, Gunnar König, Bernd Bischl, and Giuseppe Casalicchio. Model-agnostic feature importance and effects with dependent features–a conditional subgroup approach. arXiv preprint arXiv:2006.04628, 2020. 1

[58] Alena Orlenko and Jason H Moore. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData mining, 14(1):1–17, 2021. 7

[59] Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. Tunability: importance of hyperparameters of machine learning algorithms. The Journal of Machine Learning Research, 20(1):1934–1965, 2019. 22

[60] Lennart Schmidt, Falk Heße, Sabine Attinger, and Rohini Kumar. Challenges in applying machine learning models for hydrological inference: A case study for flooding events across germany. Water Resources Research, 56(5):e2019WR025924, 2020. 2

[61] Heïdi Sevestre and Douglas I Benn. Climatic and geometric controls on the global distribution of surge-type glaciers: implications for a unifying model of surging. Journal of Glaciology, 61(228):646–662, 2015. 7

[62] Lloyd S Shapley. A value for n-person games, contributions to the theory of games, 2, 307–317, 1953. 2

[63] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2164–2173. PMLR, 2019. 3, 12, 20

[64] Charles Spearman. "General intelligence" objectively determined and measured. 1961. 16

[65] Lina Stein, Martyn P Clark, Wouter JM Knoben, Francesca Pianosi, and Ross A Woods. How do climate and catchment attributes influence flood generating processes? a large-sample study for 671 catchments across the contiguous usa. Water Resources Research, 57(4):e2020WR028300, 2021. 2

[66] Bastian Steudel and Nihat Ay. Information-theoretic inference of common ancestors. Entropy, 17(4):2304–2327, 2015. 13

[67] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC bioinformatics, 9(1):1–11, 2008. 16

[68] Shanwen Sun, Benzhi Dong, and Quan Zou. Revisiting genome-wide association studies from statistical modelling to machine learning. Briefings in Bioinformatics, 22(4):bbaa263, 2021. 15

[69] Antonio Sutera, Gilles Louppe, Van Anh Huynh-Thu, Louis Wehenkel, and Pierre Geurts. From global to local mdi variable importances for random forests and when they are shapley values. Advances in Neural Information Processing Systems, 34, 2021. 15

[70] Katarzyna Tomczak, Patrycja Czerwinska, and Maciej Wiznerowicz. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary oncology, 19(1A):A68, 2015. 9, 22

[71] Hao Wang and Flavio P Calmon. An estimation-theoretic view of privacy. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 886–893. IEEE, 2017. 3

[72] Hao Wang, Lisa Vo, Flavio P Calmon, Muriel Médard, Ken R Duffy, and Mayank Varia. Privacy with estimation guarantees. IEEE Transactions on Information Theory, 65(12):8025–8042, 2019. 12

[73] Hui Wang, David A Bennett, Philip L De Jager, Qing-Ye Zhang, and Hong-Yu Zhang. Genome-wide epistasis analysis for alzheimer’s disease and implications for genetic risk prediction. Alzheimer’s research & therapy, 13(1):1–13, 2021. 7

[74] Pengfei Wei, Zhenzhou Lu, and Jingwen Song. Variable importance analysis: a comprehensive review. Reliability Engineering & System Safety, 142:399–432, 2015. 1

[75] Paul L Williams and Randall D Beer. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515, 2010. 6, 13, 16

[76] Thomas C Williams, Cathrine C Bach, Niels B Matthiesen, Tine B Henriksen, and Luigi Gagliardi. Directed acyclic graphs: a tool for causal studies in paediatrics. Pediatric research, 84(4):487–493, 2018. 4

[77] Marvin N Wright and Andreas Ziegler. ranger: A fast implementation of random forests for high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409, 2015. 7, 14, 22

[78] Sewall Wright. Correlation and causation. 1921. 1, 16

[79] Lee H Wurm and Sebastiano A Fisicaro. What residualizing predictors in regression analyses does (and what it does not do). Journal of memory and language, 72:37–48, 2014. 3

[80] Jian-Bo Yang and Chong-Jin Ong. An effective feature selection method via mutual information estimation. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1550–1559, 2012. 14

[81] Yang Yang and Ting Fong May Chui. Reliability assessment of machine learning models in hydrological predictions through metamorphic testing. Water Resources Research, 57(9):e2020WR029471, 2021. 17

[82] Ye Yuan, Liji Wu, and Xiangmin Zhang. Gini-impurity index analysis. IEEE Transactions on Information Forensics and Security, 16:3154–3169, 2021. 14

[83] Marius Zumwald, Christoph Baumberger, David N Bresch, and Reto Knutti. Assessing the representational accuracy of data-driven models: The case of the effect of urban green infrastructure on temperature. Environmental Modelling & Software, 141:105048, 2021. 17
