
Received: 8 November 2017 | Revised: 23 January 2018 | Accepted: 2 April 2018

DOI: 10.1002/sim.7803

SPECIAL ISSUE PAPER

Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Hemant Ishwaran | Min Lu

Division of Biostatistics, Miller School of Medicine, University of Miami, Miami, Florida, USA

Correspondence: Hemant Ishwaran, Division of Biostatistics, 1120 NW 14th Street, University of Miami, Miami, FL 33136, USA. Email: hemant.ishwaran@gmail.com

Funding information: NIH, Grant/Award Number: R01 GM125072

Copyright © 2018 John Wiley & Sons, Ltd.

Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and, in particular, find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.

KEYWORDS

bootstrap, delete-d jackknife, permutation importance, prediction error, subsampling, VIMP

1 | INTRODUCTION

Random forests (RF)1 are a popular tree-based learning method with broad applications to machine learning and data mining. RF was originally designed for regression and classification problems, but over time, the methodology has been extended to other important settings. For example, random survival forests (RSF)2,3 extends RF to right-censored survival and competing risk settings (see also Hothorn et al4 and Zhu and Kosorok5 for other tree ensemble approaches to survival analysis). Two guiding principles are at the core of RF's success. One is the use of deep trees. Another is injecting randomization into the tree growing process. First, trees are randomly grown by using a bootstrap sample of the data. Second, random feature selection is used when growing the tree. Thus, rather than splitting a node using all variables, the node is split using the best candidate from a randomly selected subset of variables. The purpose of this 2-step randomization is to decorrelate trees, which encourages low variance for the ensemble due to bagging.6 When combined with the strategy of using deep trees, which is a bias reduction technique, this reduces generalization error and results in superior performance for the ensemble.

Statistics in Medicine. 2019;38:558-582. wileyonlinelibrary.com/journal/sim


While RF's popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP).1,2,7-9 Variable importance allows users to identify which variables play a key role in prediction, thus providing insight into the underlying mechanism for what otherwise might be considered a black box. We note that the concept of VIMP is not specific to RF and has a long history. One of the earliest examples was CART,10 which calculated VIMP by summing the reduction in node impurity due to a variable over all tree nodes. Another approach calculated importance using surrogate splitting (see chapter 5.3 of Breiman et al10).

Early prototypes of RF software developed by Leo Breiman and his student Adele Cutler provided various options for calculating VIMP.11 One procedure used for classification forests was to estimate VIMP using the forest averaged decrease in Gini impurity (somewhat akin to the node impurity approaches of CART). However, while Gini importance12 saw widespread initial use with RF, over time it has become less popular.8 By far the most frequently used measure of importance was another measure provided by the Breiman-Cutler software, called permutation importance (sometimes also referred to as Breiman-Cutler importance). Unlike Gini importance, which estimates importance using in-sample impurity, permutation importance adopts a prediction-based approach by using the prediction error attributable to the variable. A clever feature is that rather than using cross-validation, which is computationally expensive for forests, permutation importance estimates prediction error by making use of out-of-bootstrap cases. Recall that each tree is calculated from a bootstrap sample of the original data. The approximately 1-.632=.368 of the data left out of the bootstrap represents out-of-sample data, which can be used for estimating prediction performance. These data are called out-of-bag (OOB), and prediction error obtained from it is called OOB error.13 Permutation importance permutes a variable's OOB data and compares the resulting OOB prediction error to the original OOB prediction error, the motivation being that a large positive value indicates a variable with predictive importance.

Permutation (Breiman-Cutler) Importance

In the OOB cases for a tree, randomly permute all values of the jth variable. Put these new covariate values down the tree and compute a new internal error rate. The amount by which this new error exceeds the original OOB error is defined as the importance of the jth variable for the tree. Averaging over the forest yields VIMP.

Measure 1 (Manual On Setting Up, Using, And Understanding Random Forests V3.1)
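As a concrete illustration of Measure 1, the short sketch below fits a regression forest with the randomForestSRC R-package (used throughout this paper) and requests Breiman-Cutler permutation importance; the airquality example data and the importance = "permute" option are our illustrative choices, not part of the original analysis.

```r
library(randomForestSRC)

## Fit a regression forest and request Breiman-Cutler permutation importance:
## each variable's OOB values are permuted, dropped down the tree, and the
## resulting increase in OOB error is averaged over the forest (Measure 1).
o <- rfsrc(Ozone ~ ., data = airquality, ntree = 250, importance = "permute")

## one permutation VIMP value per variable; large positive values indicate
## variables with predictive importance
print(o$importance)
```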

We focus on Breiman-Cutler permutation importance in this manuscript (for simplicity, hereafter simply referred to as VIMP). One of the tremendous advantages of VIMP is that it removes the arbitrariness of having to select a cutoff value when determining the effectiveness of a variable. Regardless of the problem, a VIMP of zero always represents an appropriate cutoff, as it reflects the point at which a variable no longer contributes predictive power to the model. However, in practice, one may observe values close to zero, and the meaning of what constitutes being zero becomes unclear. One way to resolve this is to calculate the variance of VIMP, but this is challenging due to the complex nature of RF. Unfortunately, while the empirical properties of VIMP are well documented,14-16 much less is known about VIMP's theoretical properties outside of a few studies.7,17

Given the difficulties of theoretical analysis, an alternative approach is to approximate the distribution of VIMP through some form of resampling. This has been the favored approach used in RF regression for assessing variability of RF predicted values. Methods that have been used include bootstrapping18 for estimating the variance, and the infinitesimal jackknife19 and infinite order U-statistics20 for confidence intervals. These methods, however, only apply to RF predicted values and not to VIMP, which involves prediction error. This greatly complicates matters and requires a more general approach.

For this reason, we base our approach on subsampling,21 a general methodology for approximating the distribution of a complex statistic. Section 3 provides a description of our subsampling procedure for estimating the variance. Notational framework and a formal definition of VIMP are provided in Section 2. Section 3 begins by introducing a bootstrap solution to be used as a comparison procedure. Interestingly, we find the bootstrap cannot be applied directly due to ties that occur in the OOB data. This is precisely due to the fact that VIMP is prediction error based. We propose a solution to this problem called the .164 bootstrap estimator. The subsampling variance estimator and the delete-d jackknife variance estimator,22 a close cousin, are described later in Section 3. Sections 4, 5, and 6 consider regression, classification, and survival settings and extensively evaluate performance of the 2 subsampling methods and the .164 bootstrap estimator. We also show how to construct confidence intervals for VIMP using the estimated variance. The results are very promising for the subsampling methods. Section 7 summarizes our findings and provides practical guidelines for use of the methodology. Some theoretical results for VIMP are provided in the appendix.

2 | NOTATIONAL FRAMEWORK AND DEFINITION OF VIMP

2.1 | Notation

We assume $Y\in\mathcal{Y}$ is the response and $X\in\mathcal{X}$ is the $p$-dimensional feature, where $Y$ can be continuous, binary, categorical, or survival, and $X$ can be continuous or discrete. We assume the underlying problem involves a nonparametric regression framework where the goal is to estimate a functional $h(x)$ of the response given $X=x$. Estimation is based on the learning data $\mathcal{L} = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, where $(X_i,Y_i)$ are independently distributed with the same distribution $P$ as $(X,Y)$.

Examples of $h(x)$ are the following:
1. The conditional mean $h(x) = E[Y \mid X=x]$ in regression.
2. The conditional class probabilities $h(x) = (p_1(x),\ldots,p_K(x))$ in a $K$-multiclass problem, where $p_k(x) = P\{Y=k \mid X=x\}$.
3. The survival function $h(x) = P\{T^o > t \mid X=x\}$ in survival analysis. Here $Y=(T,\delta)$ represents the bivariate response composed of the observed survival time $T = \min(T^o, C^o)$ and censoring indicator $\delta = 1_{\{T^o \le C^o\}}$, where $(T^o, C^o)$ are the unobserved event and censoring times.

2.2 | Random forest predictor

As in Breiman,1 we define an RF as a collection of randomized tree predictors $\{h(\cdot;\Theta_m,\mathcal{L}),\ m = 1,\ldots,M\}$. Here $h(x;\Theta_m,\mathcal{L})$ denotes the $m$th random tree predictor of $h(x)$, and $\Theta_m$ are independent identically distributed random quantities encoding the randomization needed for constructing a tree. Note that $\Theta_m$ is selected prior to growing the tree and is independent of the learning data $\mathcal{L}$.

The tree predictors are combined to form the finite forest estimator of $h(x)$:
$$h(x;\Theta_1,\ldots,\Theta_M,\mathcal{L}) = \frac{1}{M}\sum_{m=1}^{M} h(x;\Theta_m,\mathcal{L}). \qquad (1)$$
The infinite forest estimator is obtained by taking the limit as $M\to\infty$ and equals
$$h(x;\mathcal{L}) = E_\Theta[h(x;\Theta,\mathcal{L})]. \qquad (2)$$

2.3 | Loss function

Calculating VIMP assumes some well-defined notion of prediction error. Therefore, we assume there is an appropriately prechosen loss function $\ell(Y,\hat{h}) \ge 0$ used to measure performance of a predictor $\hat{h}$ in predicting $h$. Examples include the following:
1. Squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$ in regression problems.
2. For classification problems, widely used measures of performance are the misclassification error or the Brier score. For the latter, $\ell(Y,\hat{h}) = (1/K)\sum_{k=1}^{K}(1\{Y=k\}-\hat{p}_k)^2$, where $\hat{p}_k$ is the estimator for the conditional probability $p_k$.
3. For survival, the weighted Brier score23,24 can be used. Section 6 provides further details.

The choice of $\ell$ can be very general, and we do not impose any specific conditions on how it must be selected. As described later in Section 3, the conditions needed for our methodology to hold require only the existence of a limiting distribution for VIMP. Although such a limit may be satisfied by imposing specific conditions on $\ell$, such as requiring the true function $h$ to yield the minimum value of $E[\ell(Y,h)]$, we do not impose such assumptions so as to retain as general an approach as possible.


2.4 | Tree VIMP

Let $\mathcal{L}^*(\Theta_m)$ be the $m$th bootstrap sample and let $\mathcal{L}^{**}(\Theta_m) = \mathcal{L}\setminus\mathcal{L}^*(\Theta_m)$ be the corresponding OOB data. Write $X = (X^{(1)},\ldots,X^{(p)})$, where $X^{(j)}$ denotes the $j$th feature coordinate. The permuted value of the $j$th coordinate of $X$ is denoted by $\tilde{X}^{(j)}$. Substituting this into the $j$th coordinate of $X$ yields
$$\tilde{\boldsymbol{X}}^{(j)} = (X^{(1)},\ldots,X^{(j-1)},\tilde{X}^{(j)},X^{(j+1)},\ldots,X^{(p)}).$$

Variable importance is calculated by taking the difference between prediction error under the perturbed $\tilde{\boldsymbol{X}}^{(j)}$ and prediction error under the original $X$ over OOB data. More formally, let $I(X^{(j)};\Theta_m,\mathcal{L})$ denote the VIMP for $X^{(j)}$ for the $m$th tree. It follows that
$$I(X^{(j)};\Theta_m,\mathcal{L}) = \frac{\sum_{i\in\mathcal{L}^{**}(\Theta_m)}\ell\big(Y_i, h(\tilde{\boldsymbol{X}}_i^{(j)};\Theta_m,\mathcal{L})\big)}{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} 1} - \frac{\sum_{i\in\mathcal{L}^{**}(\Theta_m)}\ell\big(Y_i, h(X_i;\Theta_m,\mathcal{L})\big)}{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} 1}. \qquad (3)$$

Note that in the first sum, we implicitly assume $\Theta_m$ embeds the additional randomization for permuting OOB data to define $\tilde{\boldsymbol{X}}_i^{(j)}$. Because this additional randomization only requires knowledge of OOB membership, and therefore can be parameterized in terms of $\Theta$, we assume without loss of generality that $\Theta$ encodes both the randomization for growing a tree and for permuting OOB data.

Expression (3) can be written more compactly by noting that the denominator in each sum equals the OOB sample size. Let $N(\Theta_m)$ be this value. Then
$$I(X^{(j)};\Theta_m,\mathcal{L}) = \frac{1}{N(\Theta_m)}\sum_{i\in\mathcal{L}^{**}(\Theta_m)}\left[\ell\big(Y_i, h(\tilde{\boldsymbol{X}}_i^{(j)};\Theta_m,\mathcal{L})\big) - \ell\big(Y_i, h(X_i;\Theta_m,\mathcal{L})\big)\right].$$
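To make the tree-level calculation concrete, here is a minimal sketch of the compact formula above for a single variable; predict_tree() is a hypothetical single-tree predictor, and squared error loss stands in for a generic ℓ.

```r
## Tree VIMP for variable j (sketch). `tree` is a fitted tree, `oob` a data
## frame holding the tree's OOB rows (response in column y), and predict_tree()
## a hypothetical predictor returning h(x; Theta, L) for each row.
tree_vimp <- function(tree, oob, j, loss = function(y, h) (y - h)^2) {
  perm <- oob
  perm[[j]] <- sample(perm[[j]])  # permute the jth coordinate over OOB data
  ## mean loss under the permuted X^(j) minus mean loss under the original X
  mean(loss(oob$y, predict_tree(tree, perm))) -
    mean(loss(oob$y, predict_tree(tree, oob)))
}
```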

2.5 | Variable importance

Averaging tree VIMP over the forest yields VIMP:
$$I(X^{(j)};\Theta_1,\ldots,\Theta_M,\mathcal{L}) = \frac{1}{M}\sum_{m=1}^{M} I(X^{(j)};\Theta_m,\mathcal{L}). \qquad (4)$$
An infinite forest estimator for VIMP can be defined analogously by taking the limit as $M\to\infty$:
$$I(X^{(j)};\mathcal{L}) = E_\Theta[I(X^{(j)};\Theta,\mathcal{L})]. \qquad (5)$$
It is worth noting that (4) and (5) do not explicitly make use of the forest predictors (1) or (2). This is a unique feature of permutation VIMP because it is a tree-based estimator of importance.

3 | SAMPLING APPROACHES FOR ESTIMATING VIMP VARIANCE

3.1 | The .164 bootstrap estimator

The bootstrap is a popular method that can be used for estimating the variance of an estimator. So why not use the bootstrap to estimate the standard error for VIMP? One problem is that running a bootstrap on a forest is computationally expensive. Another, more serious, problem is that a direct application of the bootstrap will not work for VIMP. This is because RF trees already use bootstrap data, and applying the bootstrap creates double-bootstrap data that affects the coherence of being OOB.

To explain what goes wrong, let us simplify our previous notation by writing $I_{n,M}^{(j)}$ for the finite forest estimator (4). Let $\mathbb{P}_n$ denote the empirical measure for $\mathcal{L}$. The bootstrap estimator of $\mathrm{Var}(I_{n,M}^{(j)})$ is
$$\mathrm{Var}^*(I_{n,M}^{(j)}) = \mathrm{Var}_{\mathbb{P}_n}(I_{n,M}^{*(j)}). \qquad (6)$$

To calculate (6), we must draw a sample from $\mathbb{P}_n$. Call this bootstrap sample $\mathcal{L}^*$. Because $\mathcal{L}^*$ represents the learning data, we must draw a bootstrap sample from $\mathcal{L}^*$ to construct a RF tree. Let $\mathcal{L}^*(\Theta^*)$ denote this bootstrap sample, where $\Theta^*$ represents the tree growing instructions. This is a double-bootstrap draw. The problem is that if a specific case in $\mathcal{L}^*$ is duplicated $l>1$ times, there is no guarantee that all $l$ cases appear in the bootstrap draw $\mathcal{L}^*(\Theta^*)$. These remaining duplicated values are assigned to the OOB data, but they are not truly OOB, which compromises the coherence of the OOB data.

Double-bootstrap data lower the probability of being truly OOB to a value much smaller than .368, the value expected for a true OOB sample. We can work out exactly how much smaller this probability is. Let $n_i$ be the number of occurrences of case $i$ in $\mathcal{L}^*$. Then
$$\Pr\{i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*)\} = \sum_{l=1}^{n}\Pr\{i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*) \mid n_i = l\}\Pr\{n_i = l\}. \qquad (7)$$

We have
$$(n_1,\ldots,n_n) \sim \text{Multinomial}(n,(1/n,\ldots,1/n)), \qquad n_i \sim \text{Binomial}(n, 1/n) \approx \text{Poisson}(1).$$

Hence, (7) can be seen to equal
$$\sum_{l=1}^{n}\left(\frac{n-l}{n}\right)^{n}\Pr\{n_i=l\} \approx \sum_{l=1}^{n}\left(\frac{n-l}{n}\right)^{n}\frac{e^{-1}}{l!} = e^{-1}\sum_{l=1}^{n}\left(1-\frac{l}{n}\right)^{n}\frac{1}{l!} \approx e^{-1}\sum_{l=1}^{n}\frac{e^{-l}}{l!} \approx .1635.$$
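This calculation is easy to verify by Monte Carlo; in the sketch below (with arbitrary choices of n = 500 and 200 replications), a case is truly OOB if it has at least one copy in L* but none of its copies are drawn into the double bootstrap.

```r
set.seed(1)
n <- 500
p.oob <- mean(replicate(200, {
  Lstar <- sample(n, n, replace = TRUE)       # bootstrap sample L*
  tree  <- sample(Lstar, n, replace = TRUE)   # double bootstrap L*(Theta*)
  in.Lstar <- tabulate(Lstar, n) > 0          # cases with >= 1 copy in L*
  in.tree  <- tabulate(tree, n) > 0           # cases drawn into the tree sample
  mean(in.Lstar & !in.tree)                   # fraction of truly OOB cases
}))
p.oob   # approximately .164
```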

Therefore, double-bootstrap data have an OOB size of .164n.

The above discussion points to a simple solution to the problem, which we call the .164 bootstrap estimator. The .164 estimator is a bootstrap variance estimator that is careful to use only truly OOB data. Let $\mathcal{L}^* = \{Z_1 = (X_{i_1},Y_{i_1}),\ldots,Z_n = (X_{i_n},Y_{i_n})\}$ denote the bootstrap sample used for learning, and let $\mathcal{L}^*(\Theta^*) = \{Z_i : i\in\Theta^*\}$ be the bootstrap sample used to grow the tree. The OOB data for the double-bootstrap data are defined as $\{(X_{i_l},Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$. However, there is another subtle issue at play regarding duplicates in the OOB data. Even though $\{(X_{i_l},Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$ are data points from $\mathcal{L}^*$ truly excluded from the double-bootstrap sample, and therefore technically meet the criteria of being OOB, there is no guarantee they are all unique. This is because these values originated from $\mathcal{L}^*$, a bootstrap draw, and therefore could very well be duplicated. To ensure this does not happen, we further process the OOB data to retain only the unique values.

The steps for implementing the .164 estimator can be summarized as follows.

.164 bootstrap estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Draw a bootstrap sample $\mathcal{L}^* = \{Z_1 = (X_{i_1},Y_{i_1}),\ldots,Z_n = (X_{i_n},Y_{i_n})\}$.
2. Let $\mathcal{L}^*(\Theta^*) = \{Z_i : i\in\Theta^*\}$ be a bootstrap draw from $\mathcal{L}^*$. Use $\mathcal{L}^*(\Theta^*)$ to grow a tree predictor.
3. Define the OOB data to be the unique values in $\{(X_{i_l},Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$.
4. Calculate the tree VIMP $I(X^{(j)};\Theta^*,\mathcal{L}^*)$ using the OOB data of step 3.
5. Repeat steps 2 to 4 independently $M$ times. Average the VIMP values to obtain $\theta_n^{*(j)}$.
6. Repeat the entire procedure $K>1$ times, obtaining $\theta_{n,1}^{*(j)},\ldots,\theta_{n,K}^{*(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by the bootstrap sample variance $(1/K)\sum_{k=1}^{K}\big(\theta_{n,k}^{*(j)} - \frac{1}{K}\sum_{k'=1}^{K}\theta_{n,k'}^{*(j)}\big)^2$.
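The sketch below mirrors these 6 steps for a single variable j, reusing the hypothetical grow_tree() helper and the tree_vimp() sketch from Section 2.4; it is meant to show the bookkeeping, not to be an efficient implementation.

```r
## .164 bootstrap estimator for Var(I^(j)_{n,M}) (sketch)
var.164 <- function(data, j, M = 250, K = 100) {
  n <- nrow(data)
  theta.star <- replicate(K, {                      # step 6: repeat K times
    idx <- sample(n, n, replace = TRUE)             # step 1: bootstrap L* (case ids)
    mean(replicate(M, {                             # step 5: average over M trees
      tree.idx <- sample(idx, n, replace = TRUE)    # step 2: double bootstrap of L*
      oob.idx  <- setdiff(idx, tree.idx)            # step 3: setdiff drops in-bag
                                                    #   cases and keeps unique values
      tree_vimp(grow_tree(data[tree.idx, ]),        # step 4: tree VIMP computed on
                data[oob.idx, ], j)                 #   truly OOB data only
    }))
  })
  mean((theta.star - mean(theta.star))^2)           # bootstrap sample variance (1/K)
}
```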

3.2 | Subsampling and the delete-d jackknife

A problem with the .164 bootstrap estimator is that its OOB data set is smaller than that of a typical OOB estimator. Truly OOB data from a double bootstrap can be less than half the size of the OOB data used in a standard VIMP calculation (.164 versus .368). Thus, in a forest of 1000 trees, the .164 estimator uses about 164 trees on average to calculate VIMP for a case, compared with 368 trees used in a standard calculation. This can reduce efficiency of the .164 estimator. Another problem is computational expense. The .164 estimator requires repeatedly fitting RF to bootstrap data, which becomes expensive as n increases.

To avoid these problems, we propose a more efficient procedure based on subsampling theory.21 The idea rests on calculating VIMP over small iid subsets of the data. Because sampling is without replacement, this avoids the ties in the OOB data that create problems for the bootstrap. Also, because each calculation is fast, the procedure is computationally efficient, especially in big n settings.

3.2.1 | Subsampling theory

We begin by first reviewing some basic theory of subsampling. Let $X_1,\ldots,X_n$ be iid random values with common distribution $P$. Let $\hat{\theta}_n = \hat{\theta}(X_1,\ldots,X_n)$ be some estimator for $\theta(P)$, an unknown real-valued parameter we wish to estimate. The bootstrap estimator for the variance of $\hat{\theta}_n$ is based on the following simple idea. Let $\mathbb{P}_n = (1/n)\sum_{i=1}^{n}\delta_{X_i}$ be the empirical measure for the data. Let $X_1^*,\ldots,X_n^*$ be a bootstrap sample obtained by independently sampling $n$ points from $\mathbb{P}_n$. Because $\mathbb{P}_n$ converges to $P$, we should expect the moments of the bootstrap estimator $\hat{\theta}_n^* = \hat{\theta}(X_1^*,\ldots,X_n^*)$ to closely approximate those of $\hat{\theta}_n$. In particular, we should expect the bootstrap variance $\mathrm{Var}_{\mathbb{P}_n}(\hat{\theta}_n^*)$ to closely approximate $\mathrm{Var}(\hat{\theta}_n)$. This is the rationale for the variance estimator (6) described earlier.

Subsampling21 employs the same strategy as the bootstrap but is based on sampling without replacement. For $b = b(n)$ such that $b/n\to 0$, let $S_b$ be the entire collection of subsets of $\{1,\ldots,n\}$ of size $b$. For each $s = \{i_1,\ldots,i_b\}\in S_b$, let $\hat{\theta}_{n,b,s} = \hat{\theta}(X_{i_1},\ldots,X_{i_b})$ be the estimator evaluated using $s$. The goal is to estimate the sampling distribution of $n^{1/2}(\hat{\theta}_n - \theta(P))$. It turns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions. Let $Q_n$ denote the distribution of $n^{1/2}(\hat{\theta}_n - \theta(P))$. Assume $Q_n$ converges weakly to a proper limiting distribution $Q$:
$$Q_n \rightsquigarrow Q. \qquad (8)$$

Then it follows21 that the distribution function for the statistic $n^{1/2}(\hat{\theta}_n - \theta(P))$ can be approximated by the subsampling estimator
$$\tilde{U}_{n,b}(x) = \frac{1}{C_b}\sum_{s\in S_b} 1\{b^{1/2}(\hat{\theta}_{n,b,s} - \hat{\theta}_n) \le x\}, \qquad (9)$$
where $C_b = \binom{n}{b}$ is the cardinality of $S_b$. More formally, assuming (8) and $b/n\to 0$ for $b\to\infty$, then $\tilde{U}_{n,b}(x)\to_p F(x) = Q[-\infty,x]$ for each $x$ that is a continuity point of the limiting cumulative distribution function $F$. The key to this argument is to recognize that, because of (8) and $b/n\to 0$, $\tilde{U}_{n,b}$ closely approximates
$$U_{n,b}(x) = \frac{1}{C_b}\sum_{s\in S_b} 1\{b^{1/2}(\hat{\theta}_{n,b,s} - \theta(P)) \le x\},$$
which is a U-statistic25 of order $b$ (see Politis and Romano21 for details).

The ability to approximate the distribution of $\hat{\theta}_n$ suggests, similar to the bootstrap, that we can approximate moments of $\hat{\theta}_n$ with those from the subsampled estimator; in particular, we should be able to approximate the variance. Unlike the bootstrap, however, subsampled statistics are calculated using a sample size $b$ and not $n$. Therefore, to estimate the variance of $\hat{\theta}_n$, we must apply a scaling factor to correct for sample size. The subsampled estimator for the variance is (see Radulović26 and section 3.3.1 of Politis and Romano21)
$$\hat{v}_b = \frac{b/n}{C_b}\sum_{s\in S_b}\left(\hat{\theta}_{n,b,s} - \frac{1}{C_b}\sum_{s'\in S_b}\hat{\theta}_{n,b,s'}\right)^2. \qquad (10)$$

The estimator (10) is closely related to the delete-d jackknife.22 The delete-d estimator works on subsets of size $r = n-d$ and is defined as follows:
$$\hat{v}_{J(d)} = \frac{r/d}{C_r}\sum_{s\in S_r}(\hat{\theta}_{n,r,s} - \hat{\theta}_n)^2.$$
With a little bit of rearrangement, this can be rewritten as follows:
$$\hat{v}_{J(d)} = \frac{r/d}{C_r}\sum_{s\in S_r}\left(\hat{\theta}_{n,r,s} - \frac{1}{C_r}\sum_{s'\in S_r}\hat{\theta}_{n,r,s'}\right)^2 + \frac{r}{d}\left(\frac{1}{C_r}\sum_{s\in S_r}\hat{\theta}_{n,r,s} - \hat{\theta}_n\right)^2.$$
Setting $d = n-b$, we obtain
$$\hat{v}_{J(d)} = \frac{b/(n-b)}{C_b}\sum_{s\in S_b}\left(\hat{\theta}_{n,b,s} - \frac{1}{C_b}\sum_{s'\in S_b}\hat{\theta}_{n,b,s'}\right)^2 + \underbrace{\frac{b}{n-b}\left(\frac{1}{C_b}\sum_{s\in S_b}\hat{\theta}_{n,b,s} - \hat{\theta}_n\right)^2}_{\text{bias}}. \qquad (11)$$
The first term closely approximates (10) since $b/n\to 0$, while the second term is a bias estimate of the subsampled estimator. Thus, the delete-d estimator (11) can be seen to be a bias corrected version of (10). Furthermore, this correction is always upwards because the bias term is squared and always positive.

3.2.2 | Subsampling and delete-d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP. In the following, we assume $b$ is some integer much smaller than $n$ such that $b/n\to 0$.

b-subsampling estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Draw a subsampling set $s\in S_b$. Let $\mathcal{L}_s$ be $\mathcal{L}$ restricted to $s$.
2. Calculate $I_{n,M}^{(j)}(\mathcal{L}_s)$, the finite forest estimator for VIMP, using $\mathcal{L}_s$. Let $\hat{\theta}_{n,b,s}^{(j)}$ denote this value.
3. Repeat $K>1$ times, obtaining $\hat{\theta}_{n,b,s_1}^{(j)},\ldots,\hat{\theta}_{n,b,s_K}^{(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by $\frac{b}{nK}\sum_{k=1}^{K}\big(\hat{\theta}_{n,b,s_k}^{(j)} - \frac{1}{K}\sum_{k'=1}^{K}\hat{\theta}_{n,b,s_{k'}}^{(j)}\big)^2$.

The delete-d jackknife estimator is obtained by a slight modification to the above algorithm.

delete-d jackknife estimator ($d = n-b$) for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Using the entire learning set $\mathcal{L}$, calculate the forest VIMP estimator $I_{n,M}^{(j)}(\mathcal{L})$. Let $\hat{\theta}_n^{(j)}$ denote this value.
2. Run the b-subsampling estimator, but replace the estimator in step 3 with $\frac{b}{(n-b)K}\sum_{k=1}^{K}\big(\hat{\theta}_{n,b,s_k}^{(j)} - \hat{\theta}_n^{(j)}\big)^2$.
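A minimal R sketch of both estimators follows, using rfsrc() from randomForestSRC to compute the finite forest VIMP; the formula interface and the importance = "permute" option are assumptions on our part, and the package's dedicated subsample() function (see Section 4.1) automates the same calculations.

```r
library(randomForestSRC)

## b-subsampling and delete-d jackknife variance estimators for VIMP (sketch)
## f: model formula; b: subsample size; K: number of subsample draws
vimp.var <- function(f, data, b, K = 100, ntree = 250) {
  n <- nrow(data)
  ## full-sample forest VIMP, theta_n (step 1 of the delete-d estimator)
  theta.n <- rfsrc(f, data, ntree = ntree, importance = "permute")$importance
  ## subsampled VIMP replicates theta_{n,b,s_k}, one column per draw
  theta.b <- replicate(K, {
    s <- sample(n, b, replace = FALSE)   # draw s in S_b without replacement
    rfsrc(f, data[s, ], ntree = ntree, importance = "permute")$importance
  })
  list(
    ## b-subsampling: [b/(nK)] sum_k (theta_k - mean over draws)^2
    subsample = (b / n) * apply(theta.b, 1, function(x) mean((x - mean(x))^2)),
    ## delete-d jackknife: {b/[(n-b)K]} sum_k (theta_k - theta_n)^2
    jackknife = (b / (n - b)) * rowMeans((theta.b - theta.n)^2)
  )
}
```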

4 | RANDOM FOREST REGRESSION (RF-R)

4.1 | Simulations

In the following sections (Sections 4, 5, and 6), we evaluate the performance of the .164 bootstrap estimator, the b-subsampling estimator, and the delete-d jackknife variance estimator. We begin by looking at the regression setting. We used the following simulations to assess performance:

1. $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon$, $x_j\sim U(0,1)$, $\varepsilon\sim N(0,1)$.
2. $y = \big(x_1^2 + [x_2x_3 - (x_2x_4)^{-1}]^2\big)^{1/2} + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi, 560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,125^2)$.
3. $y = \tan^{-1}\big([x_2x_3 - (x_2x_4)^{-1}]/x_1\big) + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi, 560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,.1^2)$.
4. $y = x_1x_2 + x_3^2 - x_4x_7 + x_8x_{10} - x_6^2 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
5. $y = 1\{x_1>0\} + x_2^3 + 1\{x_4+x_6-x_8-x_9 > 1+x_{10}\} + \exp(-x_2^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
6. $y = x_1^2 + 3x_2^2x_3\exp(-|x_4|) + x_6 - x_8 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
7. $y = 1\{x_1 + x_4^3 + x_9 + \sin(x_2x_8) + \varepsilon > 0.38\}$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
8. $y = \log(x_1 + x_2x_3) - \exp(x_4x_5^{-1} - x_6) + \varepsilon$, $x_j\sim U(0.5,1)$, $\varepsilon\sim N(0,.1^2)$.
9. $y = x_1x_2^2|x_3|^{1/2} + \lfloor x_4 - x_5x_6\rfloor + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
10. $y = x_3(x_1+1)|x_2| - \big(x_5^2[|x_4|+|x_5|+|x_6|]^{-1}\big)^{1/2} + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
11. $y = \cos(x_1-x_2) + \sin^{-1}(x_1x_3) - \tan^{-1}(x_2-x_3^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$.
12. $y = \varepsilon$, $\varepsilon\sim N(0,1)$.

In all 12 simulations, the dimension of the feature space was set to p=20. This was done by adding variables unrelated to y to the design matrix. We call these noise variables. In simulations 1 to 3, noise variables were U(0,1); for simulations 4 to 11, noise variables were U(-1,1); and for simulation 12, noise variables were N(0,1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations.6,27 Simulations 4 to 11 were inspired from COBRA.28 Simulation 12 is a pure noise model.
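For reference, a data draw for simulation 1 (Friedman 1), including the appended noise variables, might look as follows; the remaining simulations follow the same pattern.

```r
## simulation 1 (Friedman 1): x1-x5 carry signal, x6-x20 are U(0,1) noise
sim.friedman1 <- function(n = 250, p = 20) {
  x <- matrix(runif(n * p), n, p)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x)
}
```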

The sample size was set at n=250. Subsampling was set at a rate of $b=n^{1/2}$, which in this case is b=15.8. We can see that, practically speaking, this is a very small sample size, which allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d=n-b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstrap or subsampled replicates. However, they represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package.29 The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large scale computing clusters. The package now includes a dedicated function "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$.
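Under these settings, a typical call looks like the following sketch (reusing the sim.friedman1() generator above; the subsample() arguments shown are assumptions based on the package's documented interface).

```r
library(randomForestSRC)

dat <- sim.friedman1()                      # simulation 1 data, n = 250, p = 20
o <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5,
           importance = "permute")          # mtry defaults to p/3 for regression
smp <- subsample(o, B = 100)                # 100 subsample draws
plot.subsample(smp)                         # VIMP confidence intervals
```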

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of $\mathrm{Var}(I_{n,M}^{(j)})$. We took the square root of this to obtain an estimate for the standard error of VIMP, $\big(\mathrm{Var}(I_{n,M}^{(j)})\big)^{1/2}$. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter $\mathrm{Var}(I_{n,M}^{(j)})$. For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP $I_{n,M}^{(j)}$. The same sample size of n=250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate $\mathrm{Var}(I_{n,M}^{(j)})$. We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate $E[I_{n,M}^{(j)}] = E[I(X^{(j)};\mathcal{L})]$. We call this the true finite VIMP.

4.3 | Results

Performance of the methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account the signal strength of a variable. In our simulations, there are noisy variables with no signal. There are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by the size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p=20 variables). These 240 variables were stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates variance. The delete-d estimator does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest regression simulations. In total there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots labeled "ALL" display performance for all 240 variables simultaneously.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate $b=n^{3/4}$ (b=62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

FIGURE 2 Results from random forest regression simulations but with increased subsampling rate $b=n^{3/4}$. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator.

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.21 The general idea for constructing a confidence interval for a target parameter $\theta(P)$ is as follows. Define the $1-\alpha$ quantile for the subsampling distribution as $c_{n,b}(1-\alpha) = \inf\{x : \tilde{U}_{n,b}(x)\ge 1-\alpha\}$. Similarly, define the $1-\alpha$ quantile for the limiting distribution $Q$ of $n^{1/2}(\hat{\theta}_n - \theta(P))$ as $c(1-\alpha) = \inf\{t : F(t) = Q[-\infty,t]\ge 1-\alpha\}$. Then, assuming (8) and $b/n\to 0$, the interval
$$\big[\hat{\theta}_n - n^{-1/2}c_{n,b}(1-\alpha),\ \infty\big) \qquad (12)$$

contains $\theta(P)$ with asymptotic probability $1-\alpha$ if $c(1-\alpha)$ is a continuity point of $F$.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let $\hat{\theta}_n^{(j)} = I_{n,M}^{(j)}$ denote the finite forest estimator for VIMP. We call the limit of $\hat{\theta}_n^{(j)}$ as $n, M\to\infty$ the true VIMP and denote this value by $\theta_0^{(j)}$:
$$\theta_0^{(j)} = \lim_{n,M\to\infty}\hat{\theta}_n^{(j)} = \lim_{n,M\to\infty} I_{n,M}^{(j)}.$$

Let $\hat{v}_n^{(j)}$ be an estimator for $\mathrm{Var}(I_{n,M}^{(j)})$. Assuming asymptotic normality,
$$\frac{\hat{\theta}_n^{(j)} - \theta_0^{(j)}}{\sqrt{\hat{v}_n^{(j)}}} \rightsquigarrow N(0,1), \qquad (13)$$
an asymptotic $100(1-\alpha)\%$ confidence region for $\theta_0^{(j)}$, the true VIMP, can be defined as
$$\hat{\theta}_n^{(j)} \pm z_{\alpha/2}\sqrt{\hat{v}_n^{(j)}},$$
where $z_{\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal: $\Pr\{N(0,1)\le z_{\alpha/2}\} = 1-\alpha/2$.
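Given a VIMP estimate and a variance estimate from either subsampling method, the interval is immediate; the helper below is our own illustration.

```r
## 100(1 - alpha)% asymptotic normal confidence interval for the true VIMP
ci.vimp <- function(theta, v, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)   # z_{alpha/2}
  cbind(lower = theta - z * sqrt(v), estimate = theta, upper = theta + z * sqrt(v))
}
```

For example, ci.vimp(theta, v, alpha = 0.1) yields 90% intervals of the kind evaluated below.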

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n=2500. The left-hand side of Figure 3 displays normal quantile plots for standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while on the x-axis are the corresponding N(0,1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for the 5, 10, 25, 50, 75, 90, 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only 0.02. Also observe that the overall mean bias is near zero (far right boxplot).

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate $b=n^{3/4}$). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap, which achieve coverage rates of 97%, which are too high.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest regression simulations. Left- and right-hand side figures based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values.

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations.


5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of the methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.30
2. Two class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ=.75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.
5. Same as 2, but with 15 linear variables and a ρ=.75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U[-1,1] distribution. Experimental parameters were set as in the RF-R simulations: n=250, $b=n^{1/2}, n^{3/4}$, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used $p^{1/2}$ random features (default setting). The entire procedure was repeated 250 times.

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let $Y\in\{1,\ldots,K\}$ be the response. If $0\le\hat{p}_k\le 1$ denotes the predicted probability that $Y$ equals class $k$, $k=1,\ldots,K$, the normalized Brier score is defined as follows:
$$\mathrm{BS}^* = \frac{100K}{K-1}\sum_{k=1}^{K}\big(1\{Y=k\} - \hat{p}_k\big)^2.$$
Note that the normalizing constant $100K/(K-1)$ used here is different than the value $1/K$ typically used for the Brier score. We multiply the traditional Brier score by $100K^2/(K-1)$ because we have noticed that the value of the Brier score under random guessing depends on the number of classes $K$. If $K$ increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of $K$. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
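A direct implementation of BS*, written by us to match the formula above, is:

```r
## normalized Brier score: 100 under random guessing, 0 for a perfect classifier
## y: factor response with K levels; prob: n x K matrix of predicted probabilities
brier.norm <- function(y, prob) {
  K <- ncol(prob)
  onehot <- diag(K)[as.integer(y), , drop = FALSE]   # rows of 1{Y = k}
  100 * K / (K - 1) * mean(rowSums((onehot - prob)^2))
}
```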

Although misclassification error is the most commonly used performance measure in classification problems, it cannot be overstated how much more stable the Brier score is. This stability naturally lends itself to more stable estimates of VIMP, which is why we have chosen to use it. As a simple illustration, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for the Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20000 trees, in contrast to the stable behavior of the Brier score.


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than that of the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

FIGURE 6 Results from random forest classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by first defining the survival framework using the notation of Section 2. Following this, we discuss 2 different methods that can be used for measuring prediction error in survival settings. Illustrative examples then follow.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y=(T,\delta)$, where $T=\min(T^o,C^o)$ is the observed survival time and $\delta = 1_{\{T^o\le C^o\}}$ is the right-censoring indicator. Here $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, δ=1 denotes an event such as death, while δ=0 denotes a right-censored case. The target function $h$ is the conditional survival function $h(x) = P\{T^o > t \mid X=x\}$, where $t$ is some selected time point.

6.2 | Weighted Brier score

Let $\hat{h}$ be an estimator of $h$. One method for measuring performance of $\hat{h}$ is the weighted Brier score,23,24 defined as follows:
$$\mathrm{wBS}(t) = \big(1\{T>t\} - \hat{h}\big)^2 w(t,Y,G),$$
where $w(t,Y,G)$ is the weight defined by
$$w(t,Y,G) = \frac{1\{T\le t\}\,\delta}{G(T-)} + \frac{1\{T>t\}}{G(t)},$$
and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function $\ell$ under the weighted Brier score can be written as follows:
$$\ell(Y,\hat{h}) = \big(1\{T>t\} - \hat{h}\big)^2 w(t,Y,G).$$
This assumes $G$ is a known function, but in practice $G$ must be estimated.23,24 Thus, if $\hat{G}$ is an estimator of $G$, $w(t,Y,G)$ is replaced by $w(t,Y,\hat{G})$.

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor $\hat{h}$ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and therefore is very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values at a $b=n^{1/2}$ subsampling rate (b=47.2). We preferred to use the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as percentages. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP whose confidence regions are bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention.
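This analysis corresponds to a call of roughly the following form; the data set is not public, so heart.df below is a hypothetical stand-in, and the subsample() interface is assumed from the package documentation.

```r
library(randomForestSRC)

## RSF fit: 250 trees, nodesize 30, other parameters at package defaults;
## heart.df is a hypothetical data frame with columns time, status, and
## p = 39 patient variables
o <- rfsrc(Surv(time, status) ~ ., data = heart.df, ntree = 250, nodesize = 30)

## delete-d jackknife VIMP intervals from 1000 subsamples at rate b = n^(1/2)
smp <- subsample(o, B = 1000)
plot.subsample(smp)   # confidence intervals for VIMP
```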

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let $h(t,x)$ be the hazard function for covariate $x$ at time $t$. Simulations are as follows:

1. $h(t,x) = \exp(x_1 - 2x_4 + 2x_5)$.
2. If $x_1\le .5$, $h(t,x) = \exp(x_2)\,1\{t\notin[.5, 2.5]\}$; if $x_1 > .5$, $h(t,x) = \exp(x_3)\,1\{t\notin[2.5, 4.5]\}$.
3. $h(t,x) = (1 + z_2 t)\exp(z_1 + z_2 t)$, where $z_1 = 5x_1$ and $z_2 = x_4 + x_5$.

In all simulations, covariates were independently sampled from a U[0,1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
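A data draw for simulation 1 might look as follows; the censoring rate parameter is our own rough choice, tuned to approximate the 19% rate reported above.

```r
## simulation 1: exponential survival with hazard exp(x1 - 2*x4 + 2*x5);
## x6-x10 are U(0,1) noise variables
sim.surv1 <- function(n = 250, p = 10, cens.rate = 0.25) {
  x  <- matrix(runif(n * p), n, p)
  To <- rexp(n, rate = exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5]))
  Co <- rexp(n, rate = cens.rate)   # censoring independent of event time
  data.frame(time = pmin(To, Co), status = as.numeric(To <= Co), x)
}
```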

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: all p covariate effects are linear.
2. Quadratic model: a subset of the p covariate effects are quadratic.
3. Interaction model: same as 1, but interactions between certain variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will cover the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b=n^{1/2}$ reduces this to O(nM), a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting:

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b=n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.
3. The α value for the confidence region should be chosen using typical values such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1-α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. TPR was defined as the fraction of true signal variables identified. TNR was the fraction of noisy variables identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significanceα=25 1 01 001 The top panel was calculated under a subsampling rate b=n12 while the bottom panel used b=n34A similar pattern for TPR and TNR values is generally observed for both sampling rates TPR decreases with increasingα TNR increases with α For b=n34 TNR rates are slightly better however for b=n12 TPR rates are slightly betterThus if the goal is finding true signal the edge goes to b=n12

Focusing on the top panel, corresponding to the recommended n^{1/2} subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α = .25. The value α = .25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α = .1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

\[ E[I(X^{(j)}, \Theta, \mathcal{L})] = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2 \big] + 2\sigma^2 (1 - \rho_j) + o(1), \]

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of h attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072.

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Statistics Department, University of California, Berkeley, CA 94708; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures; Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in RF-R simulations to assess normality of VIMP for RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before with n = 2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.
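In code, the standardization and quantile bias computation amount to the following minimal R sketch (vimp.reps is a hypothetical vector holding the 1000 Monte Carlo VIMP values for one variable):

## vimp.reps: hypothetical vector of 1000 Monte Carlo VIMP values for one variable
probs <- c(.05, .10, .25, .50, .75, .90, .95)
z <- (vimp.reps - mean(vimp.reps)) / sd(vimp.reps)   # standardized VIMP
bias <- quantile(z, probs) - qnorm(probs)            # quantile bias versus N(0, 1)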

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

\[ Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{A1} \]

where $(X_i, \varepsilon_i)$ are iid with distribution P such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice (A1) implies that the target function h has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

\[ \theta_n^{(j)} = E_\Theta\left[ \frac{1}{N(\Theta)} \sum_{i \in \mathcal{L}^{**}(\Theta)} \left[ (Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L}))^2 - (Y_i - h(X_i, \Theta, \mathcal{L}))^2 \right] \right]. \]

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(ne^{-1})$, where Round(·) is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

\[ \theta_n^{(j)} = \frac{1}{N_n} E_\Theta\left[ \sum_{i \in \mathcal{L}^{**}(\Theta)} \left[ (Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L}))^2 - (Y_i - h(X_i, \Theta, \mathcal{L}))^2 \right] \right]. \]

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n,0}^{(j)} = E_{\mathcal{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{n,i} = h(X_i, \Theta, \mathcal{L})$ and $\tilde{h}_{n,i} = h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})$. Likewise, let $h_n = h(X, \Theta, \mathcal{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrary.com]


\[ \theta_{n,0}^{(j)} = \frac{1}{N_n} E\left[ \sum_{i \in \mathcal{L}^{**}(\Theta)} \left[ (Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2 \right] \right] = E\left[ (Y - \tilde{h}_n)^2 - (Y - h_n)^2 \right], \]

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{n,i}, \tilde{h}_{n,i})$ if i is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with (X, Y)). Now making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing h for h(X) and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

\[ \theta_{n,0}^{(j)} = E\left[ (Y - \tilde{h}_n)^2 - (Y - h_n)^2 \right] = E\left[ -2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h) \right] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \tag{A2} \]

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathcal{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case i is OOB). By the definition of a RF-R tree, we have

\[ h_n = h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i, \]

where $\{W_i(X, \Theta)\}_1^n$ are the forest weights, defined as follows:

\[ W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|}, \]

where $R(X, \Theta)$ is the tree terminal node containing X and $|R(X, \Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex, since $0 \le W_i(X, \Theta) \le 1$ and

\[ \sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1. \]

Similarly, we have

\[ \tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}. \]

Therefore,

\[ \tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) - h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i. \]

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then,

\[ \tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|} \sum_{i \in R} n_i Y_i = \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|} \sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|} \sum_{i \in R} n_i \varepsilon_i \]
\[ = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}, \]

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

\[ h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}), \]

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion $h_j$ changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

\[ h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X), \]

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

\[ \theta_{n,0}^{(j)} = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2 \big] + 2\sigma^2 (1 - \rho_j) + o(1), \]

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop X and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive, because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwartz inequality,

\[ E[\tilde{\Delta}_n(h_n - h)] \le E\big[ |\tilde{\Delta}_n| \, |h_n - h| \big] \le \sqrt{E[\tilde{\Delta}_n^2]} \sqrt{E[(h_n - h)^2]}. \]

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

\[ \tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}. \]

Assumption 4 implies h (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

\[ E[\tilde{\Delta}_n(X)^2] = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2 \big] + 2E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}) \big] + E\big[ (\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2 \big] + o(1). \]

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function g(X). Hence,

\[ 0 = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})} \big] - E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)} \big] = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}) \big]. \]

Appealing to exchangeability once more, we have

\[ E\big[ (\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2 \big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})} \varepsilon_{i(R)}] = 2\sigma^2 (1 - \rho_j), \]

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

\[ E[\tilde{\Delta}_n(X)^2] = E\big[ (h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2 \big] + 2\sigma^2 (1 - \rho_j) + o(1), \]

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption: Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.


While RF's popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP).1,2,7-9 Variable importance allows users to identify which variables play a key role in prediction, thus providing insight into the underlying mechanism for what otherwise might be considered a black box. We note that the concept of VIMP is not specific to RF and has a long history. One of the earliest examples was CART,10 which calculated VIMP by summing the reduction in node impurity due to a variable over all tree nodes. Another approach calculated importance using surrogate splitting (see chapter 5.3 of Breiman et al10).

Early prototypes of RF software developed by Leo Breiman and his student Adele Cutler provided for various options for calculating VIMP.11 One procedure used for classification forests was to estimate VIMP using the forest averaged decrease in Gini impurity (somewhat akin to the node impurity approaches of CART). However, while Gini importance12 saw widespread initial use with RF, over time it has become less popular.8 By far the most frequently used measure of importance was another measure provided by the Breiman-Cutler software, called permutation importance (sometimes also referred to as Breiman-Cutler importance). Unlike Gini importance, which estimates importance using in-sample impurity, permutation importance adopts a prediction-based approach by using prediction error attributable to the variable. A clever feature is that rather than using cross-validation, which is computationally expensive for forests, permutation importance estimates prediction error by making use of out-of-bootstrap cases. Recall that each tree is calculated from a bootstrap sample of the original data. The approximately 1 − .632 = .368 left from the bootstrap represents out-of-sample data, which can be used for estimating prediction performance. These data are called out-of-bag (OOB), and prediction error obtained from it is called OOB error.13 Permutation importance permutes a variable's OOB data and compares the resulting OOB prediction error to the original OOB prediction error, the motivation being that a large positive value indicates a variable with predictive importance.

Permutation (Breiman‐Cutler) Importance

In the OOB cases for a tree, randomly permute all values of the jth variable. Put these new covariate values down the tree and compute a new internal error rate. The amount by which this new error exceeds the original OOB error is defined as the importance of the jth variable for the tree. Averaging over the forest yields VIMP.

-- Measure 1 (Manual on Setting Up, Using, and Understanding Random Forests v3.1)

We focus on Breiman-Cutler permutation importance in this manuscript (for simplicity, hereafter simply referred to as VIMP). One of the tremendous advantages of VIMP is that it removes the arbitrariness of having to select a cutoff value when determining the effectiveness of a variable. Regardless of the problem, a VIMP of zero always represents an appropriate cutoff, as it reflects the point at which a variable no longer contributes predictive power to the model. However, in practice, one may observe values close to zero, and the meaning of what constitutes being zero becomes unclear. One way to resolve this is to calculate the variance of VIMP, but this is challenging due to the complex nature of RF. Unfortunately, while the empirical properties of VIMP are well documented,14-16 much less is known about VIMP's theoretical properties outside of a few studies.7,17

Given the difficulties of theoretical analysis, an alternative approach is to approximate the distribution of VIMP through some form of resampling. This has been the favored approach used for RF regression for assessing variability of RF predicted values. Methods that have been used include bootstrapping18 for estimating the variance, and the infinitesimal jackknife19 and infinite order U-statistics20 for confidence intervals. These methods, however, only apply to RF predicted values and not to VIMP, which involves prediction error. This greatly complicates matters and requires a more general approach.

For this reason, we base our approach on subsampling,21 a general methodology for approximating the distribution of a complex statistic. Section 3 provides a description of our subsampling procedure for estimating the variance. Notational framework and a formal definition of VIMP are provided in Section 2. Section 3 begins by introducing a bootstrap solution to be used as a comparison procedure. Interestingly, we find the bootstrap cannot be applied directly due to ties that occur in the OOB data. This is precisely due to the fact that VIMP is prediction error based. We propose a solution to this problem called the .164 bootstrap estimator. The subsampling variance estimator and the delete-d jackknife variance estimator,22 a close cousin, are described later in Section 3. Sections 4, 5, and 6 consider regression, classification, and survival settings and extensively evaluate performance of the 2 subsampling methods and the .164 bootstrap


estimator. We also show how to construct confidence intervals for VIMP using the estimated variance. The results are very promising for the subsampling methods. Section 7 summarizes our findings and provides practical guidelines for use of the methodology. Some theoretical results for VIMP are provided in the appendix.

2 | NOTATIONAL FRAMEWORK AND DEFINITION OF VIMP

2.1 | Notation

We assume $Y \in \mathcal{Y}$ is the response and $X \in \mathcal{X}$ is the p-dimensional feature, where Y can be continuous, binary, categorical, or survival, and X can be continuous or discrete. We assume the underlying problem involves a nonparametric regression framework where the goal is to estimate a functional h(x) of the response, given X = x. Estimation is based on the learning data $\mathcal{L} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $(X_i, Y_i)$ are independently distributed with the same distribution P as (X, Y).

Examples of h(x) are the following:
1. The conditional mean $h(x) = E[Y | X = x]$ in regression.
2. The conditional class probabilities $h(x) = (p_1(x), \ldots, p_K(x))$ in a K-multiclass problem, where $p_k(x) = P\{Y = k | X = x\}$.
3. The survival function $h(x) = P\{T^o > t | X = x\}$ in survival analysis. Here, $Y = (T, \delta)$ represents the bivariate response composed of the observed survival time $T = \min(T^o, C^o)$ and censoring indicator $\delta = 1\{T^o \le C^o\}$, where $(T^o, C^o)$ are the unobserved event and censoring times.

2.2 | Random forest predictor

As in Breiman,1 we define an RF as a collection of randomized tree predictors $\{h(\cdot, \Theta_m, \mathcal{L}), m = 1, \ldots, M\}$. Here, $h(x, \Theta_m, \mathcal{L})$ denotes the mth random tree predictor of h(x), and $\Theta_m$ are independent, identically distributed random quantities encoding the randomization needed for constructing a tree. Note that $\Theta_m$ is selected prior to growing the tree and is independent of the learning data $\mathcal{L}$.

The tree predictors are combined to form the finite forest estimator of h(x),

\[ h(x, \Theta_1, \ldots, \Theta_M, \mathcal{L}) = \frac{1}{M} \sum_{m=1}^{M} h(x, \Theta_m, \mathcal{L}). \tag{1} \]

The infinite forest estimator is obtained by taking the limit as $M \to \infty$ and equals

\[ h(x, \mathcal{L}) = E_\Theta[h(x, \Theta, \mathcal{L})]. \tag{2} \]

2.3 | Loss function

Calculating VIMP assumes some well-defined notion of prediction error. Therefore, we assume there is an appropriately prechosen loss function $\ell(Y, \hat{h}) \ge 0$ used to measure performance of a predictor $\hat{h}$ in predicting h. Examples include the following:
1. Squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$ in regression problems.
2. For classification problems, widely used measures of performance are the misclassification error or the Brier score. For the latter, $\ell(Y, \hat{h}) = (1/K)\sum_{k=1}^{K} (1\{Y = k\} - \hat{p}_k)^2$, where $\hat{p}_k$ is the estimator for the conditional probability $p_k$.
3. For survival, the weighted Brier score23,24 can be used. Section 6 provides further details.

The choice of ℓ can be very general, and we do not impose any specific conditions on how it must be selected. As described later in Section 3, the conditions needed for our methodology to hold require only the existence of a limiting distribution for VIMP. Although such a limit may be satisfied by imposing specific conditions on ℓ, such as requiring the true function h to yield the minimum value of $E[\ell(Y, h)]$, we do not impose such assumptions so as to retain as general an approach as possible.


2.4 | Tree VIMP

Let $\mathcal{L}^*(\Theta_m)$ be the mth bootstrap sample, and let $\mathcal{L}^{**}(\Theta_m) = \mathcal{L} \setminus \mathcal{L}^*(\Theta_m)$ be the corresponding OOB data. Write $X = (X^{(1)}, \ldots, X^{(p)})$, where $X^{(j)}$ denotes the jth feature coordinate. The permuted value of the jth coordinate of X is denoted by $\tilde{X}^{(j)}$. Substituting this into the jth coordinate of X yields

\[ \tilde{X}^{(j)} = (X^{(1)}, \ldots, X^{(j-1)}, \tilde{X}^{(j)}, X^{(j+1)}, \ldots, X^{(p)}). \]

Variable importance is calculated by taking the difference in prediction error under the original X to prediction error under the perturbed $\tilde{X}^{(j)}$, over OOB data. More formally, let $I(X^{(j)}, \Theta_m, \mathcal{L})$ denote the VIMP for $X^{(j)}$ for the mth tree. It follows that

\[ I(X^{(j)}, \Theta_m, \mathcal{L}) = \frac{\sum_{i \in \mathcal{L}^{**}(\Theta_m)} \ell(Y_i, h(\tilde{X}_i^{(j)}, \Theta_m, \mathcal{L}))}{\sum_{i \in \mathcal{L}^{**}(\Theta_m)} 1} - \frac{\sum_{i \in \mathcal{L}^{**}(\Theta_m)} \ell(Y_i, h(X_i, \Theta_m, \mathcal{L}))}{\sum_{i \in \mathcal{L}^{**}(\Theta_m)} 1}. \tag{3} \]

Note that in the first sum, we implicitly assume $\Theta_m$ embeds the additional randomization for permuting OOB data to define $\tilde{X}_i^{(j)}$. Because this additional randomization only requires knowledge of OOB membership, and therefore can be parameterized in terms of Θ, we assume without loss of generality that Θ encodes both the randomization for growing a tree and for permuting OOB data.

Expression (3) can be written more compactly by noting that the denominator in each sum equals the OOB sample size. Let $N(\Theta_m)$ be this value. Then,

\[ I(X^{(j)}, \Theta_m, \mathcal{L}) = \frac{1}{N(\Theta_m)} \sum_{i \in \mathcal{L}^{**}(\Theta_m)} \left[ \ell(Y_i, h(\tilde{X}_i^{(j)}, \Theta_m, \mathcal{L})) - \ell(Y_i, h(X_i, \Theta_m, \mathcal{L})) \right]. \]

2.5 | Variable importance

Averaging tree VIMP over the forest yields VIMP:

\[ I(X^{(j)}, \Theta_1, \ldots, \Theta_M, \mathcal{L}) = \frac{1}{M} \sum_{m=1}^{M} I(X^{(j)}, \Theta_m, \mathcal{L}). \tag{4} \]

An infinite forest estimator for VIMP can be defined analogously by taking the limit as $M \to \infty$:

\[ I(X^{(j)}, \mathcal{L}) = E_\Theta[I(X^{(j)}, \Theta, \mathcal{L})]. \tag{5} \]

It is worth noting that (4) and (5) do not explicitly make use of the forest predictors (1) or (2). This is a unique feature of permutation VIMP because it is a tree-based estimator of importance.

3 | SAMPLING APPROACHES FOR ESTIMATING VIMP VARIANCE

3.1 | The .164 bootstrap estimator

The bootstrap is a popular method that can be used for estimating the variance of an estimator. So why not use the bootstrap to estimate the standard error for VIMP? One problem is that running a bootstrap on a forest is computationally expensive. Another, more serious, problem, however, is that a direct application of the bootstrap will not work for VIMP. This is because RF trees already use bootstrap data, and applying the bootstrap creates double-bootstrap data that affects the coherence of being OOB.

To explain what goes wrong, let us simplify our previous notation by writing $I_{n,M}^{(j)}$ for the finite forest estimator (4). Let $P_n$ denote the empirical measure for $\mathcal{L}$. The bootstrap estimator of $\mathrm{Var}(I_{n,M}^{(j)})$ is


\[ \mathrm{Var}^*(I_{n,M}^{(j)}) = \mathrm{Var}_{P_n}(I_{n,M}^{*(j)}). \tag{6} \]

To calculate (6), we must draw a sample from $P_n$; call this bootstrap sample $\mathcal{L}^*$. Because $\mathcal{L}^*$ represents the learning data, we must draw a bootstrap sample from $\mathcal{L}^*$ to construct a RF tree. Let $\mathcal{L}^*(\Theta^*)$ denote this bootstrap sample, where $\Theta^*$ represents the tree growing instructions. This is a double-bootstrap draw. The problem is that if a specific case in $\mathcal{L}^*$ is duplicated $l > 1$ times, there is no guarantee that all l cases appear in the bootstrap draw $\mathcal{L}^*(\Theta^*)$. These remaining duplicated values are assigned to the OOB data, but these values are not truly OOB, which compromises the coherence of the OOB data.

Double-bootstrap data lower the probability of being truly OOB to a value much smaller than .368, which is the value expected for a true OOB sample. We can work out exactly how much smaller this probability is. Let $n_i$ be the number of occurrences of case i in $\mathcal{L}^*$. Then,

\[ \Pr\{ i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*) \} = \sum_{l=1}^{n} \Pr\{ i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*) \mid n_i = l \} \Pr\{ n_i = l \}. \tag{7} \]

We have

\[ (n_1, \ldots, n_n) \sim \text{Multinomial}(n, (1/n, \ldots, 1/n)), \qquad n_i \sim \text{Binomial}(n, 1/n) \asymp \text{Poisson}(1). \]

Hence, (7) can be seen to equal

\[ \sum_{l=1}^{n} \left( \frac{n-l}{n} \right)^n \Pr\{ n_i = l \} \asymp \sum_{l=1}^{n} \left( \frac{n-l}{n} \right)^n \frac{e^{-1}}{l!} = e^{-1} \sum_{l=1}^{n} \left( 1 - \frac{l}{n} \right)^n \frac{1}{l!} \asymp e^{-1} \sum_{l=1}^{n} \frac{e^{-l}}{l!} \asymp .1635. \]

Therefore, double-bootstrap data have an OOB size of .164n.
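The value .1635 is easy to check by simulation; the following quick Monte Carlo sketch in R is ours, for verification only:

## Monte Carlo check of the truly-OOB probability under a double bootstrap
set.seed(1)
n <- 1000; reps <- 2000
oob <- replicate(reps, {
  b1 <- sample(n, n, replace = TRUE)   # first-level bootstrap L*
  b2 <- sample(b1, n, replace = TRUE)  # double bootstrap L*(Theta*)
  (1 %in% b1) && !(1 %in% b2)          # case 1 appears in L* but never in L*(Theta*)
})
mean(oob)  # approximately .1635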

The above discussion points to a simple solution to the problem, which we call the .164 bootstrap estimator. The .164 estimator is a bootstrap variance estimator but is careful to use only truly OOB data. Let $\mathcal{L}^* = \{Z_1 = (X_{i_1}, Y_{i_1}), \ldots, Z_n = (X_{i_n}, Y_{i_n})\}$ denote the bootstrap sample used for learning, and let $\mathcal{L}^*(\Theta^*) = \{Z_i, i \in \Theta^*\}$ be the bootstrap sample used to grow the tree. The OOB data for the double-bootstrap data are defined as $\{(X_{i_l}, Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$. However, there is another subtle issue at play regarding duplicates in the OOB data. Even though $\{(X_{i_l}, Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$ are data points from $\mathcal{L}^*$ truly excluded from the double-bootstrap sample, and therefore technically meet the criteria of being OOB, there is no guarantee they are all unique. This is because these values originated from $\mathcal{L}^*$, a bootstrap draw, and therefore could very well be duplicated. To ensure this does not happen, we further process the OOB data to retain only the unique values.

The steps for implementing the .164 estimator can be summarized as follows.

.164 bootstrap estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Draw a bootstrap sample $\mathcal{L}^* = \{Z_1 = (X_{i_1}, Y_{i_1}), \ldots, Z_n = (X_{i_n}, Y_{i_n})\}$.
2. Let $\mathcal{L}^*(\Theta^*) = \{Z_i, i \in \Theta^*\}$ be a bootstrap draw from $\mathcal{L}^*$. Use $\mathcal{L}^*(\Theta^*)$ to grow a tree predictor.
3. Define OOB data to be the unique values in $\{(X_{i_l}, Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$.
4. Calculate the tree VIMP $I(X^{(j)}, \Theta^*, \mathcal{L}^*)$ using OOB data of step 3.
5. Repeat steps 2 to 4 independently M times. Average the VIMP values to obtain $\theta_n^{*(j)}$.
6. Repeat the entire procedure $K > 1$ times, obtaining $\theta_{n,1}^{*(j)}, \ldots, \theta_{n,K}^{*(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by the bootstrap sample variance $(1/K) \sum_{k=1}^{K} \big( \theta_{n,k}^{*(j)} - \frac{1}{K} \sum_{k'=1}^{K} \theta_{n,k'}^{*(j)} \big)^2$.

3.2 | Subsampling and the delete-d jackknife

A problem with the .164 bootstrap estimator is that its OOB data set is smaller than a typical OOB estimator. Truly OOB data from a double bootstrap can be less than half the size of OOB data used in a standard VIMP calculation (.164


versus .368). Thus, in a forest of 1000 trees, the .164 estimator uses about 164 trees on average to calculate VIMP for a case, compared with 368 trees used in a standard calculation. This can reduce efficiency of the .164 estimator. Another problem is computational expense. The .164 estimator requires repeatedly fitting RF to bootstrap data, which becomes expensive as n increases.

To avoid these problems, we propose a more efficient procedure based on subsampling theory.21 The idea rests on calculating VIMP over small iid subsets of the data. Because sampling is without replacement, this avoids the ties in the OOB data that create problems for the bootstrap. Also, because each calculation is fast, the procedure is computationally efficient, especially in big n settings.

3.2.1 | Subsampling theory

We begin by first reviewing some basic theory of subsampling. Let $X_1, \ldots, X_n$ be iid random values with common distribution P. Let $\theta_n = \theta(X_1, \ldots, X_n)$ be some estimator for θ(P), an unknown real-valued parameter we wish to estimate. The bootstrap estimator for the variance of $\theta_n$ is based on the following simple idea. Let $P_n = (1/n) \sum_{i=1}^{n} \delta_{X_i}$ be the empirical measure for the data. Let $X_1^*, \ldots, X_n^*$ be a bootstrap sample obtained by independently sampling n points from $P_n$. Because $P_n$ converges to P, we should expect the moments of the bootstrap estimator $\theta_n^* = \theta(X_1^*, \ldots, X_n^*)$ to closely approximate those of $\theta_n$. In particular, we should expect the bootstrap variance $\mathrm{Var}_{P_n}(\theta_n^*)$ to closely approximate $\mathrm{Var}(\theta_n)$. This is the rationale for the variance estimator (6) described earlier.

Subsampling21 employs the same strategy as the bootstrap but is based on sampling without replacement. For b = b(n) such that b/n → 0, let $S_b$ be the entire collection of subsets of $\{1, \ldots, n\}$ of size b. For each $s = \{i_1, \ldots, i_b\} \in S_b$, let $\theta_{n,b,s} = \theta(X_{i_1}, \ldots, X_{i_b})$ be the estimator evaluated using s. The goal is to estimate the sampling distribution of $n^{1/2}(\theta_n - \theta(P))$. It turns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions. Let $Q_n$ denote the distribution of $n^{1/2}(\theta_n - \theta(P))$. Assume $Q_n$ converges weakly to a proper limiting distribution Q:

\[ Q_n \rightsquigarrow Q. \tag{8} \]

Then it follows21 that the distribution function for the statistic $n^{1/2}(\theta_n - \theta(P))$ can be approximated by the subsampling estimator

\[ \tilde{U}_{n,b}(x) = \frac{1}{C_b} \sum_{s \in S_b} 1\{ b^{1/2}(\theta_{n,b,s} - \theta_n) \le x \}, \tag{9} \]

where $C_b = \binom{n}{b}$ is the cardinality of $S_b$. More formally, assuming (8), b/n → 0, and b → ∞, then $\tilde{U}_{n,b}(x) \overset{p}{\to} F(x) = Q[-\infty, x]$ for each x that is a continuity point of the limiting cumulative distribution function F. The key to this argument is to recognize that, because of (8) and b/n → 0, $\tilde{U}_{n,b}$ closely approximates

\[ U_{n,b}(x) = \frac{1}{C_b} \sum_{s \in S_b} 1\{ b^{1/2}(\theta_{n,b,s} - \theta(P)) \le x \}, \]

which is a U-statistic25 of order b (see Politis and Romano21 for details).

The ability to approximate the distribution of $\theta_n$ suggests, similar to the bootstrap, that we can approximate moments of $\theta_n$ with those from the subsampled estimator; in particular, we should be able to approximate the variance. Unlike the bootstrap, however, subsampled statistics are calculated using a sample size b and not n. Therefore, to estimate the variance of $\theta_n$, we must apply a scaling factor to correct for sample size. The subsampled estimator for the variance is (see Radulović26 and section 3.3.1 from Politis and Romano21)

\[ \upsilon_b = \frac{b/n}{C_b} \sum_{s \in S_b} \left( \theta_{n,b,s} - \frac{1}{C_b} \sum_{s' \in S_b} \theta_{n,b,s'} \right)^2. \tag{10} \]

The estimator (10) is closely related to the delete-d jackknife.22 The delete-d estimator works on subsets of size r = n − d and is defined as follows:

\[ \upsilon_{J(d)} = \frac{r/d}{C_r} \sum_{s \in S_r} (\theta_{n,r,s} - \theta_n)^2. \]

With a little bit of rearrangement, this can be rewritten as follows:

\[ \upsilon_{J(d)} = \frac{r/d}{C_r} \sum_{s \in S_r} \left( \theta_{n,r,s} - \frac{1}{C_r} \sum_{s' \in S_r} \theta_{n,r,s'} \right)^2 + \frac{r}{d} \left( \frac{1}{C_r} \sum_{s \in S_r} \theta_{n,r,s} - \theta_n \right)^2. \]

Setting d = n − b, we obtain

\[ \upsilon_{J(d)} = \frac{b/(n-b)}{C_b} \sum_{s \in S_b} \left( \theta_{n,b,s} - \frac{1}{C_b} \sum_{s' \in S_b} \theta_{n,b,s'} \right)^2 + \underbrace{\frac{b}{n-b} \left( \frac{1}{C_b} \sum_{s \in S_b} \theta_{n,b,s} - \theta_n \right)^2}_{\text{bias}}. \tag{11} \]

The first term closely approximates (10), since b/n → 0, while the second term is a bias estimate of the subsampled estimator. Thus, the delete-d estimator (11) can be seen to be a bias corrected version of (10). Furthermore, this correction is always upwards, because the bias term is squared and always positive.

3.2.2 | Subsampling and delete-d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP. In the following, we assume b is some integer much smaller than n, such that b/n → 0.

b-subsampling estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Draw a subsampling set $s \in S_b$. Let $\mathcal{L}_s$ be $\mathcal{L}$ restricted to s.
2. Calculate $I_{n,M}^{(j)}(\mathcal{L}_s)$, the finite forest estimator for VIMP using $\mathcal{L}_s$. Let $\theta_{n,b,s}^{(j)}$ denote this value.
3. Repeat $K > 1$ times, obtaining $\theta_{n,b,s_1}^{(j)}, \ldots, \theta_{n,b,s_K}^{(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by $[b/(nK)] \sum_{k=1}^{K} \big( \theta_{n,b,s_k}^{(j)} - \frac{1}{K} \sum_{k'=1}^{K} \theta_{n,b,s_{k'}}^{(j)} \big)^2$.

The delete-d jackknife estimator is obtained by a slight modification to the above algorithm:

delete-d jackknife estimator (d = n − b) for $\mathrm{Var}(I_{n,M}^{(j)})$:
1. Using the entire learning set $\mathcal{L}$, calculate the forest VIMP estimator $I_{n,M}^{(j)}(\mathcal{L})$. Let $\theta_n^{(j)}$ denote this value.
2. Run the b-subsampling estimator, but replace the estimator in step 3 with $\{ b/[(n-b)K] \} \sum_{k=1}^{K} (\theta_{n,b,s_k}^{(j)} - \theta_n^{(j)})^2$.

4 | RANDOM FOREST REGRESSION (RF-R)

4.1 | Simulations

In the following sections (Sections 4, 5, and 6), we evaluate the performance of the .164 bootstrap estimator, the b-subsampling estimator, and the delete-d jackknife variance estimator. We begin by looking at the regression setting. We used the following simulations to assess performance:

1. $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon$, $x_j \sim U(0, 1)$, $\varepsilon \sim N(0, 1)$.
2. $y = (x_1^2 + [x_2 x_3 - (x_2 x_4)^{-1}]^2)^{1/2} + \varepsilon$, $x_1 \sim U(0, 100)$, $x_2 \sim U(40\pi, 560\pi)$, $x_3 \sim U(0, 1)$, $x_4 \sim U(1, 11)$, $\varepsilon \sim N(0, 125^2)$.
3. $y = \tan^{-1}\big([x_2 x_3 - (x_2 x_4)^{-1}] / x_1\big) + \varepsilon$, $x_1 \sim U(0, 100)$, $x_2 \sim U(40\pi, 560\pi)$, $x_3 \sim U(0, 1)$, $x_4 \sim U(1, 11)$, $\varepsilon \sim N(0, .1^2)$.
4. $y = x_1 x_2 + x_3^2 - x_4 x_7 + x_8 x_{10} - x_6^2 + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
5. $y = 1\{x_1 > 0\} + x_2^3 + 1\{x_4 + x_6 - x_8 - x_9 > 1 + x_{10}\} + \exp(-x_2^2) + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.


6. $y = x_1^2 + 3 x_2^2 x_3 \exp(-|x_4|) + x_6 - x_8 + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
7. $y = 1\{x_1 + x_4^3 + x_9 + \sin(x_2 x_8) + \varepsilon > .38\}$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
8. $y = \log(x_1 + x_2 x_3) - \exp(x_4 x_5^{-1} - x_6) + \varepsilon$, $x_j \sim U(.5, 1)$, $\varepsilon \sim N(0, .1^2)$.
9. $y = x_1 x_2^2 |x_3|^{1/2} + \lfloor x_4 - x_5 x_6 \rfloor + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
10. $y = x_3(x_1 + 1)|x_2| - (x_5^2 [|x_4| + |x_5| + |x_6|]^{-1})^{1/2} + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
11. $y = \cos(x_1 - x_2) + \sin^{-1}(x_1 x_3) - \tan^{-1}(x_2 - x_3^2) + \varepsilon$, $x_j \sim U(-1, 1)$, $\varepsilon \sim N(0, .1^2)$.
12. $y = \varepsilon$, $\varepsilon \sim N(0, 1)$.

In all 12 simulations, the dimension of the feature space was set to p = 20. This was done by adding variables unrelated to y to the design matrix; we call these noise variables. In simulations 1 to 3, noise variables were U(0, 1); for simulations 4 to 11, noise variables were U(−1, 1); and for simulation 12, noise variables were N(0, 1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations.6,27 Simulations 4 to 11 were inspired from COBRA.28 Simulation 12 is a pure noise model.
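For concreteness, a minimal R sketch generating data from simulation 1, including its 15 noise variables (the function name sim1 is ours):

sim1 <- function(n = 250, p = 20) {
  x <- matrix(runif(n * p), n, p)  # 5 signal variables plus 15 U(0,1) noise variables
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x)
}
dat <- sim1()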

The sample size was set at n = 250. Subsampling was set at a rate of b = n^{1/2}, which in this case is b = 15.8. We can see that, practically speaking, this is a very small sample size, which allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d = n − b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstraps or subsampled replicates. However, they represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package.29 The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large scale computing clusters. The package now includes a dedicated function "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$.

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of $\mathrm{Var}(I_{n,M}^{(j)})$. We took the square root of this to obtain an estimate for the standard error of VIMP, $(\mathrm{Var}(I_{n,M}^{(j)}))^{1/2}$. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter $\mathrm{Var}(I_{n,M}^{(j)})$. For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP $I_{n,M}^{(j)}$. The same sample size of n = 250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate $\mathrm{Var}(I_{n,M}^{(j)})$. We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate $E[I_{n,M}^{(j)}] = E[I(X^{(j)}, \mathcal{L})]$. We call this the true finite VIMP.

4.3 | Results

Performance of methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account signal strength of a variable. In our simulations, there are noisy variables with no signal. There are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p = 20 variables). These 240 variables were stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates variance. The delete-d estimator

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest-regression simulations. In total, there are 240 variables (12 simulations, p = 20 variables in each simulation). These 240 variables have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots labeled "ALL" display performance for all 240 variables simultaneously [Colour figure can be viewed at wileyonlinelibrary.com]


does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate b = n^{3/4} (b = 62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows. Define the 1−α quantile for the subsampling distribution as $c_{n,b}(1-\alpha) = \inf\{ x : \tilde{U}_{n,b}(x) \ge 1-\alpha \}$. Similarly, define the 1−α quantile for the limiting distribution Q of $n^{1/2}(\theta_n - \theta(P))$ as $c(1-\alpha) = \inf\{ t : F(t) = Q[-\infty, t] \ge 1-\alpha \}$. Then, assuming (8) and b/n → 0, the interval

FIGURE 2 Results from random forest-regression simulations but with increased subsampling rate b = n^{3/4}. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator [Colour figure can be viewed at wileyonlinelibrary.com]


\[ \big[ \theta_n - n^{-1/2} c_{n,b}(1-\alpha), \infty \big) \tag{12} \]

contains θ(P) with asymptotic probability 1−α, if c(1−α) is a continuity point of F.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let $\theta_n^{(j)} = I_{n,M}^{(j)}$ denote the finite forest estimator for VIMP. We call the limit of $\theta_n^{(j)}$ as $n, M \to \infty$ the true VIMP and denote this value by $\theta_0^{(j)}$:

\[ \theta_0^{(j)} = \lim_{n,M \to \infty} \theta_n^{(j)} = \lim_{n,M \to \infty} I_{n,M}^{(j)}. \]

Let $\upsilon_n^{(j)}$ be an estimator for $\mathrm{Var}(I_{n,M}^{(j)})$. Assuming asymptotic normality,

\[ \frac{\theta_n^{(j)} - \theta_0^{(j)}}{\sqrt{\upsilon_n^{(j)}}} \rightsquigarrow N(0, 1), \tag{13} \]

an asymptotic 100(1−α)% confidence region for $\theta_0^{(j)}$, the true VIMP, can be defined as

\[ \theta_n^{(j)} \pm z_{\alpha/2} \sqrt{\upsilon_n^{(j)}}, \]

where $z_{\alpha/2}$ is the 1−α/2 quantile from a standard normal: $\Pr\{ N(0, 1) \le z_{\alpha/2} \} = 1 - \alpha/2$.

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n = 2500. The left-hand side of Figure 3 displays normal quantile plots for standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while on the x-axis are corresponding N(0, 1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP to the quantile for a standard normal. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only a value of 0.02. Also observe that overall mean bias is near zero (far right boxplot).

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate b = n^{3/4}). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP, in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap, which achieve coverage rates of 97%, which are too high.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest-regression simulations. Left- and right-hand side figures based on subsampling rates b = n^{1/2} and b = n^{3/4}, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values [Colour figure can be viewed at wileyonlinelibrary.com]

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest-regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations


5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:
1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.30
2. Two class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ = .75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.


5. Same as 2, but with 15 linear variables and a ρ = .75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p = 20. Simulations 2 to 4 added d = 10 noise variables (see the caret package for details). Simulations 6 to 10 added d = 10 noise variables from a U[−1, 1] distribution. Experimental parameters were set as in RF-R simulations: n = 250; b = n^{1/2}, n^{3/4}; 100 bootstrap samples; and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used p^{1/2} random features (default setting). The entire procedure was repeated 250 times.

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let $Y \in \{1, \ldots, K\}$ be the response. If $0 \le \hat{p}_k \le 1$ denotes the predicted probability that Y equals class k, k = 1, …, K, the normalized Brier score is defined as follows:

\[ \mathrm{BS}^* = \frac{100 K}{K - 1} \sum_{k=1}^{K} (1\{Y = k\} - \hat{p}_k)^2. \]

Note that the normalizing constant 100K/(K−1) used here is different than the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K^2/(K−1) because we have noticed that the value for the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
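A minimal R sketch of the normalized Brier score as defined above (y is the vector of observed class labels coded 1, …, K, and prob a matrix of predicted class probabilities; both are hypothetical inputs):

norm.brier <- function(y, prob) {
  K <- ncol(prob)
  y01 <- outer(y, seq_len(K), "==") * 1  # one-hot indicators 1{Y = k}
  mean(100 * K / (K - 1) * rowSums((y01 - prob)^2))
}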

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much stabler the Brier score is. This stability will naturally lend itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M = 20000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20,000 trees, in contrast to the stable behavior of the Brier score


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability

FIGURE 6 Results from random forest-classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


for the delete-d jackknife is better than for the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings. Illustrative examples follow.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y=(T,\delta)$, where $T = \min(T^o, C^o)$ is the observed survival time and $\delta = 1\{T^o \le C^o\}$ is the right-censoring indicator. Here, $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, $\delta=1$ denotes an event, such as death, while $\delta=0$ denotes a right-censored case. The target function $h$ is the conditional survival function $h(x) = P\{T^o > t \mid X = x\}$, where $t$ is some selected time point.

6.2 | Weighted Brier score

Let $\hat{h}$ be an estimator of $h$. One method for measuring performance of $\hat{h}$ is the weighted Brier score,23,24 defined as follows:

$$\mathrm{wBS}(t) = \big(1\{T>t\}-\hat{h}\big)^2 w(t, Y, G),$$

where $w(t,Y,G)$ is the weight defined by

$$w(t,Y,G) = \frac{1\{T\le t\}\,\delta}{G(T-)} + \frac{1\{T>t\}}{G(t)},$$

and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function $\ell$ under the weighted Brier score can be written as follows:

$$\ell(Y,\hat{h}) = \big(1\{T>t\}-\hat{h}\big)^2 w(t, Y, G).$$

This assumes $G$ is a known function, but in practice, $G$ must be estimated.23,24 Thus, if $\hat{G}$ is an estimator of $G$, $w(t,Y,G)$ is replaced by $w(t,Y,\hat{G})$.
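For intuition, a minimal R sketch of the weight $w(t,Y,\hat{G})$, with $G$ estimated by reverse Kaplan-Meier (the helper name and the left-limit approximation for $G(T-)$ are ours):

```r
library(survival)

# Sketch of the IPCW weight w(t, Y, G-hat): the censoring survival function G
# is estimated by treating censoring as the "event" (reverse Kaplan-Meier)
ipcw_weight <- function(time, status, t) {
  Gfit <- survfit(Surv(time, 1 - status) ~ 1)
  G <- stepfun(Gfit$time, c(1, Gfit$surv))   # right-continuous estimate of G
  eps <- .Machine$double.eps^0.5             # approximate the left limit G(T-)
  (time <= t) * status / G(time - eps) + (time > t) / G(t)
}
```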

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor $\hat{h}$ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, it is not possible to express it in terms of the $\ell$-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it poses no problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2
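For intuition, a minimal R sketch of Harrell's index over ordered pairs (ties in survival time are ignored for simplicity; names are ours):

```r
# Sketch of Harrell's concordance index: over all usable pairs (the shorter
# follow-up ended in an event), count pairs where the higher-risk subject
# failed first; ties in risk count 1/2
harrell_c <- function(time, status, risk) {
  conc <- 0; usable <- 0
  n <- length(time)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    if (time[i] < time[j] && status[i] == 1) {   # usable: i failed before j
      usable <- usable + 1
      if (risk[i] > risk[j]) conc <- conc + 1
      else if (risk[i] == risk[j]) conc <- conc + 0.5
    }
  }
  conc / usable
}
```

This pairwise scan is exactly the O(n^2) operation referenced in the computational speed discussion of Section 7.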

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index error, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values at a $b=n^{1/2}$ subsampling rate ($b \approx 47$). We preferred the delete-d jackknife over the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP whose confidence regions are bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald Lectures.34 Let $h(t,x)$ be the hazard function for covariate $x$ at time $t$. Simulations are as follows:

1. $h(t,x) = \exp(x_1 - 2x_4 + 2x_5)$.
2. If $x_1 \le .5$, $h(t,x) = \exp(x_2)\,1\{t \notin [.5, 2.5]\}$; if $x_1 > .5$, $h(t,x) = \exp(x_3)\,1\{t \notin [2.5, 4.5]\}$.
3. $h(t,x) = (1 + z_2 t)\exp(z_1 + z_2 t)$, where $z_1 = .5x_1$ and $z_2 = x_4 + x_5$.

In all simulations, covariates were independently sampled from a U[0,1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention


hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
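For example, simulation 1 has a constant (exponential) hazard, so event times can be drawn by inverse transform sampling; a minimal sketch, in which the censoring distribution is our own illustrative choice (independent of time, as the text requires):

```r
# Sketch of survival simulation 1: constant hazard exp(x1 - 2*x4 + 2*x5)
set.seed(1)
n <- 250; p <- 10
x  <- matrix(runif(n * p), n, p)                               # U(0,1) covariates
To <- -log(runif(n)) / exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])   # event times
Co <- rexp(n, rate = 0.1)                                      # illustrative censoring
time <- pmin(To, Co); status <- as.numeric(To <= Co)
```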

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


6.6 | Competing risk simulations

Here, we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards, as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model. All p covariate effects are linear.
2. Quadratic model. A subset of the p covariate effects are quadratic.
3. Interaction model. Same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2, and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n^2) operation. With M trees, the number of computations is O(n^2 M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b=n^{1/2}$ reduces this to O(nM), a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting; a short code sketch applying them follows the list.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b=n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.
3. The α value for the confidence region should be chosen using typical values, such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
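A minimal sketch applying these guidelines (inputs `vimp` and `se` are assumed precomputed, eg, VIMP and delete-d jackknife standard errors obtained at a $b=n^{1/2}$ subsampling rate):

```r
# Keep variables whose 100(1 - alpha)% delete-d jackknife normal confidence
# interval for VIMP lies strictly above zero; vimp and se are named vectors
select_vars <- function(vimp, se, alpha = 0.1) {
  names(vimp)[vimp - qnorm(1 - alpha / 2) * se > 0]
}
```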

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, by way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noise variables identified (ie, correctly excluded).

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α=.25, .1, .01, .001. The top panel was calculated under a subsampling rate $b=n^{1/2}$, while the bottom panel used $b=n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases as α decreases, while TNR increases as α decreases. For $b=n^{3/4}$, TNR rates are slightly better; however, for $b=n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b=n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that, while TPR declines as α shrinks, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .01. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α=.1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^p h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)},\Theta,\mathcal{L})] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of $h$ attributed to $X^{(j)}$. For noisy variables, $h_j=0$ and $\rho_j=1$; thus, VIMP will converge to zero. For strong variables, $h_j\ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072.

ORCID

Hemant Ishwaran  http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.

2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.

3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.

4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.

5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.

6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.

7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.

8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.

9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.

10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.

11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.

12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems. 2013:431-439.

13. Breiman L. Out-of-bag estimation. Technical report. Statistics Department, University of California, Berkeley, CA; 1996.

14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.

15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.

16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.

17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.

18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.

19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.

20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.

21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.

22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.

23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.

24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.

25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.

26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.

27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.

28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.

29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.

30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.

31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.

32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.

33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.

34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.

35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.

36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.


APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)},\mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^p h_j(X_i^{(j)}) + \varepsilon_i, \qquad i = 1,\dots,n, \quad (A1)$$

where $(X_i,\varepsilon_i)$ are iid with distribution $P$ such that $X_i$ and $\varepsilon_i$ are independent, and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice (A1) implies that the target function $h$ has an additive expansion $h(X) = \sum_{j=1}^p h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$, we have

$$\theta_n^{(j)} = E_\Theta\left[\frac{1}{N(\Theta)}\sum_{i\in\mathcal{L}^{**}(\Theta)}\left\{\big(Y_i - h(\tilde{X}_i^{(j)},\Theta,\mathcal{L})\big)^2 - \big(Y_i - h(X_i,\Theta,\mathcal{L})\big)^2\right\}\right].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(ne^{-1})$, where $\mathrm{Round}(\cdot)$ is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n}E_\Theta\left[\sum_{i\in\mathcal{L}^{**}(\Theta)}\left\{\big(Y_i - h(\tilde{X}_i^{(j)},\Theta,\mathcal{L})\big)^2 - \big(Y_i - h(X_i,\Theta,\mathcal{L})\big)^2\right\}\right].$$

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n,0}^{(j)} = E_{\mathcal{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{n,i} = h(X_i,\Theta,\mathcal{L})$ and $\tilde{h}_{n,i} = h(\tilde{X}_i^{(j)},\Theta,\mathcal{L})$. Likewise, let $h_n = h(X,\Theta,\mathcal{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)},\Theta,\mathcal{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


$$\theta_{n,0}^{(j)} = \frac{1}{N_n}E\left[\sum_{i\in\mathcal{L}^{**}(\Theta)}\left\{(Y_i-\tilde{h}_{n,i})^2 - (Y_i-h_{n,i})^2\right\}\right] = E\big[(Y-\tilde{h}_n)^2 - (Y-h_n)^2\big],$$

where the right-hand side follows because $(X,Y,h_n,\tilde{h}_n) \stackrel{d}{=} (X_i,Y_i,h_{n,i},\tilde{h}_{n,i})$ if $i$ is OOB (ie, because the tree does not use information about $(X_i,Y_i)$ in its construction, we can replace $(X_i,Y_i)$ with $(X,Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing $h$ for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta_{n,0}^{(j)} = E\big[(Y-\tilde{h}_n)^2-(Y-h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n-h)\big] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n-h)], \quad (A2)$$

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i=1,\dots,n$ in $\mathcal{L}^*(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of an RF-R tree, we have

$$h_n = h(X,\Theta,\mathcal{L}) = \sum_{i=1}^n W_i(X,\Theta)Y_i,$$

where $\{W_i(X,\Theta)\}_1^n$ are the forest weights, defined as follows:

$$W_i(X,\Theta) = \frac{n_i 1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|},$$

where $R(X,\Theta)$ is the tree terminal node containing $X$, and $|R(X,\Theta)|$ is the cardinality equal to the number of bootstrap cases in $R(X,\Theta)$. Notice that the weights are convex, since $0 \le W_i(X,\Theta) \le 1$ and

$$\sum_{i=1}^n W_i(X,\Theta) = \sum_{i=1}^n \frac{n_i 1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|} = \sum_{i=1}^n \frac{n_i 1\{X_i\in R(X,\Theta)\}}{\sum_{i'=1}^n n_{i'} 1\{X_{i'}\in R(X,\Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)},\Theta,\mathcal{L}) = \sum_{i=1}^n W_i(\tilde{X}^{(j)},\Theta)Y_i, \quad\text{where } W_i(\tilde{X}^{(j)},\Theta) = \frac{n_i 1\{X_i\in R(\tilde{X}^{(j)},\Theta)\}}{|R(\tilde{X}^{(j)},\Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)},\Theta,\mathcal{L}) - h(X,\Theta,\mathcal{L}) = \sum_{i=1}^n W_i(\tilde{X}^{(j)},\Theta)Y_i - \sum_{i=1}^n W_i(X,\Theta)Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)},\Theta)$ and $R = R(X,\Theta)$. Then

$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i\in R} n_i Y_i = \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i\in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i\varepsilon_i - \frac{1}{|R|}\sum_{i\in R} n_i\varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)},\Theta)$ and $R = R(X,\Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first 2 terms in the last line and therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n,0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n-h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n-h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde{\Delta}_n(h_n-h)] \le E[|\tilde{\Delta}_n||h_n-h|] \le \sqrt{E[\tilde{\Delta}_n^2]}\sqrt{E[(h_n-h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})}-\varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})}-\varepsilon_{i(R)})^2\big] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})}-\varepsilon_{i(R)})\big].$$

Appealing to exchangeability once more, we have

$$E\big[(\varepsilon_{i(\tilde{R})}-\varepsilon_{i(R)})^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1-\rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})},\varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)})-h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established L2-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.


estimator. We also show how to construct confidence intervals for VIMP using the estimated variance. The results are very promising for the subsampling methods. Section 7 summarizes our findings and provides practical guidelines for use of the methodology. Some theoretical results for VIMP are provided in the appendix.

2 | NOTATIONAL FRAMEWORK AND DEFINITION OF VIMP

2.1 | Notation

We assume $Y\in\mathcal{Y}$ is the response and $X\in\mathcal{X}$ is the $p$-dimensional feature, where $Y$ can be continuous, binary, categorical, or survival, and $X$ can be continuous or discrete. We assume the underlying problem involves a nonparametric regression framework where the goal is to estimate a functional $h(x)$ of the response given $X=x$. Estimation is based on the learning data $\mathcal{L} = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$, where $(X_i,Y_i)$ are independently distributed with the same distribution $P$ as $(X,Y)$.

Examples of h(x) are the following:

1. The conditional mean $h(x) = E[Y \mid X=x]$ in regression.
2. The conditional class probabilities $h(x) = (p_1(x),\dots,p_K(x))$ in a $K$-multiclass problem, where $p_k(x) = P\{Y=k \mid X=x\}$.
3. The survival function $h(x) = P\{T^o > t \mid X=x\}$ in survival analysis. Here, $Y=(T,\delta)$ represents the bivariate response composed of the observed survival time $T = \min(T^o, C^o)$ and censoring indicator $\delta = 1\{T^o \le C^o\}$, where $(T^o, C^o)$ are the unobserved event and censoring times.

2.2 | Random forest predictor

As in Breiman,1 we define an RF as a collection of randomized tree predictors $\{h(\cdot,\Theta_m,\mathcal{L}),\, m=1,\dots,M\}$. Here, $h(x,\Theta_m,\mathcal{L})$ denotes the $m$th random tree predictor of $h(x)$, and $\Theta_m$ are independent, identically distributed random quantities encoding the randomization needed for constructing a tree. Note that $\Theta_m$ is selected prior to growing the tree and is independent of the learning data $\mathcal{L}$.

The tree predictors are combined to form the finite forest estimator of $h(x)$:

$$h(x,\Theta_1,\dots,\Theta_M,\mathcal{L}) = \frac{1}{M}\sum_{m=1}^M h(x,\Theta_m,\mathcal{L}). \quad (1)$$

The infinite forest estimator is obtained by taking the limit as $M\to\infty$ and equals

$$h(x,\mathcal{L}) = E_\Theta[h(x,\Theta,\mathcal{L})]. \quad (2)$$

2.3 | Loss function

Calculating VIMP assumes some well-defined notion of prediction error. Therefore, we assume there is an appropriately prechosen loss function $\ell(Y,\hat{h})\ge 0$ used to measure performance of a predictor $\hat{h}$ in predicting $h$. Examples include the following:

1. Squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$ in regression problems.
2. For classification problems, widely used measures of performance are the misclassification error or the Brier score. For the latter, $\ell(Y,\hat{h}) = (1/K)\sum_{k=1}^K (1\{Y=k\}-\hat{p}_k)^2$, where $\hat{p}_k$ is the estimator for the conditional probability $p_k$.
3. For survival, the weighted Brier score23,24 can be used. Section 6 provides further details.

The choice of $\ell$ can be very general, and we do not impose any specific conditions on how it must be selected. As described later in Section 3, the conditions needed for our methodology to hold require only the existence of a limiting distribution for VIMP. Although such a limit may be satisfied by imposing specific conditions on $\ell$, such as requiring the true function $h$ to yield the minimum value of $E[\ell(Y,h)]$, we do not impose such assumptions, so as to retain as general an approach as possible.


2.4 | Tree VIMP

Let $\mathcal{L}^*(\Theta_m)$ be the $m$th bootstrap sample, and let $\mathcal{L}^{**}(\Theta_m) = \mathcal{L}\setminus\mathcal{L}^*(\Theta_m)$ be the corresponding OOB data. Write $X=(X^{(1)},\dots,X^{(p)})$, where $X^{(j)}$ denotes the $j$th feature coordinate. The permuted value of the $j$th coordinate of $X$ is denoted by $\tilde{X}^{(j)}$. Substituting this into the $j$th coordinate of $X$ yields

$$\tilde{X}^{(j)} = (X^{(1)},\dots,X^{(j-1)},\tilde{X}^{(j)},X^{(j+1)},\dots,X^{(p)}).$$

Variable importance is calculated by taking the difference between prediction error under the original $X$ and prediction error under the perturbed $\tilde{X}^{(j)}$, over OOB data. More formally, let $I(X^{(j)},\Theta_m,\mathcal{L})$ denote the VIMP for $X^{(j)}$ for the $m$th tree. It follows that

$$I(X^{(j)},\Theta_m,\mathcal{L}) = \frac{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} \ell\big(Y_i, h(\tilde{X}_i^{(j)},\Theta_m,\mathcal{L})\big)}{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} 1} - \frac{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} \ell\big(Y_i, h(X_i,\Theta_m,\mathcal{L})\big)}{\sum_{i\in\mathcal{L}^{**}(\Theta_m)} 1}. \quad (3)$$

Note that in the first sum, we implicitly assume $\Theta_m$ embeds the additional randomization for permuting OOB data to define $\tilde{X}_i^{(j)}$. Because this additional randomization only requires knowledge of OOB membership, and therefore can be parameterized in terms of $\Theta$, we assume without loss of generality that $\Theta$ encodes both the randomization for growing a tree and for permuting OOB data.

Expression (3) can be written more compactly by noting that the denominator in each sum equals the OOB sample size. Let $N(\Theta_m)$ be this value. Then

$$I(X^{(j)},\Theta_m,\mathcal{L}) = \frac{1}{N(\Theta_m)}\sum_{i\in\mathcal{L}^{**}(\Theta_m)}\big[\ell\big(Y_i,h(\tilde{X}_i^{(j)},\Theta_m,\mathcal{L})\big) - \ell\big(Y_i,h(X_i,\Theta_m,\mathcal{L})\big)\big].$$
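As an illustration of this formula, a minimal R sketch of tree-level permutation VIMP under squared error loss (the function and argument names are ours; `predict_tree` stands in for any fitted tree's prediction routine):

```r
# Tree VIMP for variable j: OOB prediction error after permuting the jth
# coordinate, minus the original OOB prediction error
tree_vimp <- function(predict_tree, x_oob, y_oob, j) {
  x_perm <- x_oob
  x_perm[, j] <- sample(x_perm[, j])           # permute X^(j) over OOB cases
  mean((y_oob - predict_tree(x_perm))^2) -     # noised-up error
    mean((y_oob - predict_tree(x_oob))^2)      # original error
}
```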

2.5 | Variable importance

Averaging tree VIMP over the forest yields VIMP:

$$I(X^{(j)},\Theta_1,\dots,\Theta_M,\mathcal{L}) = \frac{1}{M}\sum_{m=1}^M I(X^{(j)},\Theta_m,\mathcal{L}). \quad (4)$$

An infinite forest estimator for VIMP can be defined analogously by taking the limit as $M\to\infty$:

$$I(X^{(j)},\mathcal{L}) = E_\Theta[I(X^{(j)},\Theta,\mathcal{L})]. \quad (5)$$

It is worth noting that (4) and (5) do not explicitly make use of the forest predictors (1) or (2). This is a unique feature of permutation VIMP, because it is a tree-based estimator of importance.

3 | SAMPLING APPROACHES FOR ESTIMATING VIMP VARIANCE

3.1 | The .164 bootstrap estimator

The bootstrap is a popular method that can be used for estimating the variance of an estimator. So why not use the bootstrap to estimate the standard error for VIMP? One problem is that running a bootstrap on a forest is computationally expensive. Another, more serious, problem, however, is that a direct application of the bootstrap will not work for VIMP. This is because RF trees already use bootstrap data, and applying the bootstrap creates double-bootstrap data that affect the coherence of being OOB.

To explain what goes wrong, let us simplify our previous notation by writing $I_{n,M}^{(j)}$ for the finite forest estimator (4). Let $P_n$ denote the empirical measure for $\mathcal{L}$. The bootstrap estimator of $\mathrm{Var}(I_{n,M}^{(j)})$ is

$$\mathrm{Var}^*(I_{n,M}^{(j)}) = \mathrm{Var}_{P_n}(I_{n,M}^{*(j)}). \quad (6)$$

To calculate (6), we must draw a sample from $P_n$. Call this bootstrap sample $\mathcal{L}^*$. Because $\mathcal{L}^*$ represents the learning data, we must draw a bootstrap sample from $\mathcal{L}^*$ to construct an RF tree. Let $\mathcal{L}^*(\Theta^*)$ denote this bootstrap sample, where $\Theta^*$ represents the tree growing instructions. This is a double-bootstrap draw. The problem is that if a specific case in $\mathcal{L}^*$ is duplicated $l>1$ times, there is no guarantee that all $l$ cases appear in the bootstrap draw $\mathcal{L}^*(\Theta^*)$. These remaining duplicated values are assigned to the OOB data, but these values are not truly OOB, which compromises the coherence of the OOB data.

Double-bootstrap data lower the probability of being truly OOB to a value much smaller than .368, which is the value expected for a true OOB sample. We can work out exactly how much smaller this probability is. Let $n_i$ be the number of occurrences of case $i$ in $\mathcal{L}^*$. Then,

$$\Pr\{i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*)\} = \sum_{l=1}^n \Pr\{i \text{ is truly OOB in } \mathcal{L}^*(\Theta^*) \mid n_i = l\}\Pr\{n_i = l\}. \quad (7)$$

We have

$$(n_1,\dots,n_n) \sim \mathrm{Multinomial}\big(n, (1/n,\dots,1/n)\big), \qquad n_i \sim \mathrm{Binomial}(n, 1/n) \asymp \mathrm{Poisson}(1).$$

Hence, (7) can be seen to equal

$$\sum_{l=1}^n \left(\frac{n-l}{n}\right)^n \Pr\{n_i = l\} \asymp \sum_{l=1}^n \left(\frac{n-l}{n}\right)^n \frac{e^{-1}}{l!} = e^{-1}\sum_{l=1}^n \left(1-\frac{l}{n}\right)^n \frac{1}{l!} \asymp e^{-1}\sum_{l=1}^n \frac{e^{-l}}{l!} \asymp .1635.$$

Therefore, double-bootstrap data have an OOB size of .164n. The above discussion points to a simple solution to the problem, which we call the .164 bootstrap estimator. The .164 estimator is a bootstrap variance estimator, but is careful to use only truly OOB data. Let $\mathcal{L}^* = \{Z_1 = (X_{i_1},Y_{i_1}),\dots,Z_n = (X_{i_n},Y_{i_n})\}$ denote the bootstrap sample used for learning, and let $\mathcal{L}^*(\Theta^*) = \{Z_i,\, i\in\Theta^*\}$ be the bootstrap sample used to grow the tree. The OOB data for the double-bootstrap data are defined as $\{(X_{i_l},Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$. However, there is another subtle issue at play regarding duplicates in the OOB data. Even though $\{(X_{i_l},Y_{i_l})\notin\mathcal{L}^*(\Theta^*)\}$ are data points from $\mathcal{L}^*$ truly excluded from the double-bootstrap sample, and therefore technically meet the criteria of being OOB, there is no guarantee they are all unique. This is because these values originated from $\mathcal{L}^*$, a bootstrap draw, and therefore could very well be duplicated. To ensure this does not happen, we further process the OOB data to retain only the unique values.

The steps for implementing the .164 estimator can be summarized as follows.

.164 bootstrap estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:

1. Draw a bootstrap sample $\mathcal{L}^* = \{Z_1 = (X_{i_1},Y_{i_1}),\dots,Z_n = (X_{i_n},Y_{i_n})\}$.
2. Let $\mathcal{L}^*(\Theta^*) = \{Z_i,\, i\in\Theta^*\}$ be a bootstrap draw from $\mathcal{L}^*$. Use $\mathcal{L}^*(\Theta^*)$ to grow a tree predictor.
3. Define OOB data to be the unique values in $\{(X_{i_l},Y_{i_l}) \notin \mathcal{L}^*(\Theta^*)\}$.
4. Calculate the tree VIMP $I(X^{(j)},\Theta^*,\mathcal{L}^*)$ using the OOB data of step 3.
5. Repeat steps 2 to 4 independently $M$ times. Average the VIMP values to obtain $\theta_n^{*(j)}$.
6. Repeat the entire procedure $K>1$ times, obtaining $\theta_{n,1}^{*(j)},\dots,\theta_{n,K}^{*(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by the bootstrap sample variance

$$\frac{1}{K}\sum_{k=1}^K\left(\theta_{n,k}^{*(j)} - \frac{1}{K}\sum_{k'=1}^K \theta_{n,k'}^{*(j)}\right)^2.$$
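As a quick numerical check of the .164 calculation, the truly OOB fraction under the double bootstrap (steps 1 to 3) can be simulated directly; a minimal R sketch (variable names are ours):

```r
# Draw L*, then a double-bootstrap L*(Theta*), and keep only unique cases of
# L* that are truly excluded from L*(Theta*)
set.seed(1)
n <- 10000
Lstar      <- sample(n, n, replace = TRUE)        # bootstrap sample L*
LstarTheta <- sample(Lstar, n, replace = TRUE)    # double-bootstrap draw
oob <- unique(Lstar[!(Lstar %in% LstarTheta)])    # truly OOB, duplicates removed
length(oob) / n                                   # close to .164
```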

3.2 | Subsampling and the delete-d jackknife

A problem with the .164 bootstrap estimator is that its OOB data set is smaller than that of a typical OOB estimator. Truly OOB data from a double bootstrap can be less than half the size of OOB data used in a standard VIMP calculation (.164 versus .368). Thus, in a forest of 1000 trees, the .164 estimator uses about 164 trees on average to calculate VIMP for a case, compared with 368 trees used in a standard calculation. This can reduce efficiency of the .164 estimator. Another problem is computational expense. The .164 estimator requires repeatedly fitting RF to bootstrap data, which becomes expensive as n increases.

To avoid these problems, we propose a more efficient procedure based on subsampling theory.21 The idea rests on calculating VIMP over small iid subsets of the data. Because sampling is without replacement, this avoids the ties in the OOB data that create problems for the bootstrap. Also, because each calculation is fast, the procedure is computationally efficient, especially in big n settings.

3.2.1 | Subsampling theory

We begin by first reviewing some basic theory of subsampling. Let $X_1,\dots,X_n$ be iid random values with common distribution $P$. Let $\theta_n = \theta(X_1,\dots,X_n)$ be some estimator for $\theta(P)$, an unknown real-valued parameter we wish to estimate. The bootstrap estimator for the variance of $\theta_n$ is based on the following simple idea. Let $P_n = (1/n)\sum_{i=1}^n \delta_{X_i}$ be the empirical measure for the data. Let $X_1^*,\dots,X_n^*$ be a bootstrap sample obtained by independently sampling $n$ points from $P_n$. Because $P_n$ converges to $P$, we should expect the moments of the bootstrap estimator $\theta_n^* = \theta(X_1^*,\dots,X_n^*)$ to closely approximate those of $\theta_n$. In particular, we should expect the bootstrap variance $\mathrm{Var}_{P_n}(\theta_n^*)$ to closely approximate $\mathrm{Var}(\theta_n)$. This is the rationale for the variance estimator (6) described earlier.

Subsampling21 employs the same strategy as the bootstrap, but is based on sampling without replacement. For $b=b(n)$ such that $b/n\to 0$, let $S_b$ be the entire collection of subsets of $\{1,\dots,n\}$ of size $b$. For each $s = \{i_1,\dots,i_b\}\in S_b$, let $\theta_{n,b,s} = \theta(X_{i_1},\dots,X_{i_b})$ be the estimator evaluated using $s$. The goal is to estimate the sampling distribution of $n^{1/2}(\theta_n - \theta(P))$. It turns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions. Let $Q_n$ denote the distribution of $n^{1/2}(\theta_n - \theta(P))$. Assume $Q_n$ converges weakly to a proper limiting distribution $Q$:

$$Q_n \stackrel{d}{\rightarrow} Q. \quad (8)$$

Then, it follows21 that the distribution function for the statistic $n^{1/2}(\theta_n - \theta(P))$ can be approximated by the subsampling estimator

$$\tilde{U}_{n,b}(x) = \frac{1}{C_b}\sum_{s\in S_b} 1\big\{b^{1/2}(\theta_{n,b,s} - \theta_n) \le x\big\}, \quad (9)$$

where $C_b = \binom{n}{b}$ is the cardinality of $S_b$. More formally, assuming (8) and $b/n\to 0$ for $b\to\infty$, then $\tilde{U}_{n,b}(x) \stackrel{p}{\rightarrow} F(x) = Q[-\infty,x]$ for each $x$ that is a continuity point of the limiting cumulative distribution function $F$. The key to this argument is to recognize that, because of (8) and $b/n\to 0$, $\tilde{U}_{n,b}$ closely approximates

$$U_{n,b}(x) = \frac{1}{C_b}\sum_{s\in S_b} 1\big\{b^{1/2}(\theta_{n,b,s} - \theta(P)) \le x\big\},$$

which is a U-statistic25 of order $b$ (see Politis and Romano21 for details).

The ability to approximate the distribution of $\theta_n$ suggests, similar to the bootstrap, that we can approximate moments of $\theta_n$ with those from the subsampled estimator; in particular, we should be able to approximate the variance. Unlike the bootstrap, however, subsampled statistics are calculated using a sample size $b$ and not $n$. Therefore, to estimate the variance of $\theta_n$, we must apply a scaling factor to correct for sample size. The subsampled estimator for the variance is (see Radulović26 and section 3.3.1 of Politis and Romano21)

$$\upsilon_b = \frac{b/n}{C_b}\sum_{s\in S_b}\left(\theta_{n,b,s} - \frac{1}{C_b}\sum_{s'\in S_b}\theta_{n,b,s'}\right)^2. \quad (10)$$

The estimator (10) is closely related to the delete-d jackknife.22 The delete-d estimator works on subsets of size $r=n-d$ and is defined as follows:

$$\upsilon_{J(d)} = \frac{r/d}{C_r}\sum_{s\in S_r}(\theta_{n,r,s} - \theta_n)^2.$$

With a little bit of rearrangement, this can be rewritten as follows:

$$\upsilon_{J(d)} = \frac{r/d}{C_r}\sum_{s\in S_r}\left(\theta_{n,r,s} - \frac{1}{C_r}\sum_{s'\in S_r}\theta_{n,r,s'}\right)^2 + \frac{r}{d}\left(\frac{1}{C_r}\sum_{s\in S_r}\theta_{n,r,s} - \theta_n\right)^2.$$

Setting $d=n-b$, we obtain

$$\upsilon_{J(d)} = \frac{b/(n-b)}{C_b}\sum_{s\in S_b}\left(\theta_{n,b,s} - \frac{1}{C_b}\sum_{s'\in S_b}\theta_{n,b,s'}\right)^2 + \underbrace{\frac{b}{n-b}\left(\frac{1}{C_b}\sum_{s\in S_b}\theta_{n,b,s} - \theta_n\right)^2}_{\text{bias}}. \quad (11)$$

The first term closely approximates (10), since $b/n\to 0$, while the second term is a bias estimate of the subsampled estimator. Thus, the delete-d estimator (11) can be seen to be a bias-corrected version of (10). Furthermore, this correction is always upwards, because the bias term is squared and always positive.

3.2.2 | Subsampling and delete-d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP. In the following, we assume $b$ is some integer much smaller than $n$, such that $b/n\to 0$.

b-subsampling estimator for $\mathrm{Var}(I_{n,M}^{(j)})$:

1. Draw a subsampling set $s\in S_b$. Let $\mathcal{L}_s$ be $\mathcal{L}$ restricted to $s$.
2. Calculate $I_{n,M}^{(j)}(\mathcal{L}_s)$, the finite forest estimator for VIMP using $\mathcal{L}_s$. Let $\theta_{n,b,s}^{(j)}$ denote this value.
3. Repeat $K>1$ times, obtaining $\theta_{n,b,s_1}^{(j)},\dots,\theta_{n,b,s_K}^{(j)}$. Estimate $\mathrm{Var}(I_{n,M}^{(j)})$ by

$$\frac{b}{nK}\sum_{k=1}^K\left(\theta_{n,b,s_k}^{(j)} - \frac{1}{K}\sum_{k'=1}^K \theta_{n,b,s_{k'}}^{(j)}\right)^2.$$

The delete-d jackknife estimator is obtained by a slight modification to the above algorithm.

delete-d jackknife estimator ($d=n-b$) for $\mathrm{Var}(I_{n,M}^{(j)})$:

1. Using the entire learning set $\mathcal{L}$, calculate the forest VIMP estimator $I_{n,M}^{(j)}(\mathcal{L})$. Let $\theta_n^{(j)}$ denote this value.
2. Run the b-subsampling estimator, but replace the estimator in step 3 with

$$\frac{b}{(n-b)K}\sum_{k=1}^K\left(\theta_{n,b,s_k}^{(j)} - \theta_n^{(j)}\right)^2.$$
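A generic sketch of both algorithms for a scalar statistic (for VIMP, `stat` would refit a forest on the subsample and return a variable's VIMP; names are ours):

```r
# b-subsampling (default) and delete-d jackknife (jackknife = TRUE) variance
# estimators for a statistic computed on rows of `data`
subsample_var <- function(data, stat, b, K = 100, jackknife = FALSE) {
  n <- nrow(data)
  theta_s <- replicate(K, stat(data[sample(n, b), , drop = FALSE]))
  if (jackknife) {
    theta_n <- stat(data)                        # full-sample estimate
    b / (n - b) * mean((theta_s - theta_n)^2)    # centered at theta_n
  } else {
    b / n * mean((theta_s - mean(theta_s))^2)    # centered at subsample mean
  }
}
```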

4 | RANDOM FOREST REGRESSION (RF-R)

4.1 | Simulations

In the following sections (Sections 4, 5, and 6), we evaluate the performance of the .164 bootstrap estimator, the b-subsampling estimator, and the delete-d jackknife variance estimator. We begin by looking at the regression setting. We used the following simulations to assess performance:

1. $y = 10\sin(\pi x_1 x_2) + 20(x_3-0.5)^2 + 10x_4 + 5x_5 + \varepsilon$, $x_j\sim U(0,1)$, $\varepsilon\sim N(0,1)$
2. $y = \big(x_1^2 + [x_2 x_3 - (x_2 x_4)^{-1}]^2\big)^{1/2} + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi,560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,125^2)$
3. $y = \tan^{-1}\big([x_2 x_3 - (x_2 x_4)^{-1}]/x_1\big) + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi,560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,.1^2)$
4. $y = x_1 x_2 + x_3^2 - x_4 x_7 + x_8 x_{10} - x_6^2 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
5. $y = 1\{x_1>0\} + x_2^3 + 1\{x_4 + x_6 - x_8 - x_9 > 1 + x_{10}\} + \exp(-x_2^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
6. $y = x_1^2 + 3x_2^2 x_3 \exp(-|x_4|) + x_6 - x_8 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
7. $y = 1\{x_1 + x_4^3 + x_9 + \sin(x_2 x_8) + \varepsilon > .38\}$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
8. $y = \log(x_1 + x_2 x_3) - \exp(x_4 x_5^{-1} - x_6) + \varepsilon$, $x_j\sim U(0.5,1)$, $\varepsilon\sim N(0,.1^2)$
9. $y = x_1 x_2^2 |x_3|^{1/2} + \lfloor x_4 - x_5 x_6 \rfloor + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
10. $y = x_3(x_1+1)|x_2| - \big(x_5^2[|x_4| + |x_5| + |x_6|]^{-1}\big)^{1/2} + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
11. $y = \cos(x_1-x_2) + \sin^{-1}(x_1 x_3) - \tan^{-1}(x_2 - x_3^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,.1^2)$
12. $y = \varepsilon$, $\varepsilon\sim N(0,1)$

In all 12 simulations, the dimension of the feature space was set to p=20. This was done by adding variables unrelated to y to the design matrix; we call these noise variables. In simulations 1 to 3, noise variables were U(0,1); for simulations 4 to 11, noise variables were U(−1,1); and for simulation 12, noise variables were N(0,1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations.6,27 Simulations 4 to 11 were inspired by COBRA.28 Simulation 12 is a pure noise model.
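For example, simulation 1 can be generated as follows (a minimal sketch; the function name is ours):

```r
# Sketch of RF-R simulation 1 (Friedman 1), padded to p = 20 with noise
sim1 <- function(n = 250, p = 20) {
  x <- matrix(runif(n * p), n, p)          # columns 1-5 carry the signal
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x)
}
```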

The sample size was set at n=250. Subsampling was set at a rate of $b=n^{1/2}$, which in this case gives b≈16. We can see that, practically speaking, this is a very small sample size, and it allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d=n−b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstrap or subsampled replicates; however, these represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package.29 The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large scale computing clusters. The package now includes a dedicated function, "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$.
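A sketch of the workflow with randomForestSRC (the `subsample` call reflects recent versions of the package; exact argument names may differ by version, so consult the package documentation):

```r
library(randomForestSRC)
dat <- sim1()                               # simulated data from the sketch above
o   <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5)  # regression forest
smp <- subsample(o, B = 100)                # subsampling/delete-d VIMP variances
plot.subsample(smp)                         # VIMP confidence intervals
```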

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of $\mathrm{Var}(I_{n,M}^{(j)})$. We took the square root of this to obtain an estimate for the standard error of VIMP, $\big(\mathrm{Var}(I_{n,M}^{(j)})\big)^{1/2}$. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter $\mathrm{Var}(I_{n,M}^{(j)})$. For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP $I_{n,M}^{(j)}$. The same sample size of n=250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate $\mathrm{Var}(I_{n,M}^{(j)})$. We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate $E[I_{n,M}^{(j)}] = E[I(X^{(j)},\mathcal{L})]$. We call this the true finite VIMP.

4.3 | Results

Performance of methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account the signal strength of a variable. In our simulations, there are noisy variables with no signal; there are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p=20 variables). These 240 variables were stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates variance. The delete-d estimator

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest-regression simulations. In total, there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots, labeled "ALL", display performance for all 240 variables simultaneously


does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased subsampling rate b = n^{3/4} (b = 62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows. Define the 1 − α quantile for the subsampling distribution as

c_{n,b}(1 − α) = inf{x : Ũ_{n,b}(x) ≥ 1 − α}.

Similarly, define the 1 − α quantile for the limiting distribution Q of n^{1/2}(θ_n − θ(P)) as

c(1 − α) = inf{x : F(x) = Q[−∞, x] ≥ 1 − α}.

Then, assuming (8) and b/n → 0, the interval

FIGURE 2. Results from random forest regression simulations, but with increased subsampling rate b = n^{3/4}. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator.

[θ_n − n^{−1/2} c_{n,b}(1 − α), ∞)    (12)

contains θ(P) with asymptotic probability 1 − α if c(1 − α) is a continuity point of F.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let θ_n^{(j)} = I_{n,M}^{(j)} denote the finite forest estimator for VIMP. We call the limit of θ_n^{(j)} as n, M → ∞ the true VIMP and denote this value by θ_0^{(j)}:

θ_0^{(j)} = lim_{n,M→∞} θ_n^{(j)} = lim_{n,M→∞} I_{n,M}^{(j)}.

Let υ_n^{(j)} be an estimator for Var(I_{n,M}^{(j)}). Assuming asymptotic normality,

(θ_n^{(j)} − θ_0^{(j)}) / (υ_n^{(j)})^{1/2} →_d N(0, 1),    (13)

an asymptotic 100(1 − α)% confidence region for θ_0^{(j)}, the true VIMP, can be defined as

θ_n^{(j)} ± z_{α/2} (υ_n^{(j)})^{1/2},

where z_{α/2} is the 1 − α/2 quantile from a standard normal: Pr{N(0, 1) ≤ z_{α/2}} = 1 − α/2.
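In code form, the interval is immediate given a VIMP estimate and a variance estimate from any of the 3 procedures (a minimal sketch):

```r
## 100(1 - alpha)% asymptotic normal confidence interval for true VIMP.
vimp.ci <- function(theta.hat, v.hat, alpha = .1) {
  z <- qnorm(1 - alpha / 2)              # z_{alpha/2}
  cbind(lower = theta.hat - z * sqrt(v.hat),
        upper = theta.hat + z * sqrt(v.hat))
}
```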

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n = 2500. The left-hand side of Figure 3 displays normal quantile plots for standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while on the x-axis are the corresponding N(0, 1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus the standardized VIMP quantile differs from the true standard normal value of 1.645 by only a value of 0.02. Also observe that overall mean bias is near zero (far right boxplot).

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate, b = n^{3/4}). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, much better than the delete-d and bootstrap, whose coverage rates of 97% are too high.

FIGURE 4. Coverage of variable importance 90% asymptotic normal confidence intervals from random forest regression simulations. Left- and right-hand side figures based on subsampling rates b = n^{1/2} and b = n^{3/4}, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values.

FIGURE 3. Assessing asymptotic normality of variable importance (VIMP) from random forest regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations.

5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.30
2. Two class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ = .75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.
5. Same as 2, but with 15 linear variables and a ρ = .75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p = 20. Simulations 2 to 4 added d = 10 noise variables (see the caret package for details). Simulations 6 to 10 added d = 10 noise variables from a U[−1, 1] distribution. Experimental parameters were set as in the RF-R simulations: n = 250, b = n^{1/2}, n^{3/4}, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used p^{1/2} random features (default setting). The entire procedure was repeated 250 times.

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let Y ∈ {1,…,K} be the response. If 0 ≤ p_k ≤ 1 denotes the predicted probability that Y equals class k, k = 1,…,K, the normalized Brier score is defined as follows:

BS* = (100K/(K − 1)) Σ_{k=1}^{K} (1{Y = k} − p_k)².

Note that the normalizing constant 100K/(K − 1) used here is different than the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K²/(K − 1) because we have noticed that the value for the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much more stable the Brier score is. This stability naturally lends itself to more stable estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M = 20000 trees.

FIGURE 5. Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for the Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20000 trees, in contrast to the stable behavior of the Brier score.

5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b.

FIGURE 6. Results from random forest classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings, followed by illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is Y = (T, δ), where T = min(T°, C°) is the observed survival time and δ = 1{T° ≤ C°} is the right-censoring indicator. Here (T°, C°) denote the unobserved event and censoring times. Thus, δ = 1 denotes an event, such as death, while δ = 0 denotes a right-censored case. The target function h is the conditional survival function h(x) = P{T° > t | X = x}, where t is some selected time point.

6.2 | Weighted Brier score

Let ĥ be an estimator of h. One method for measuring performance of ĥ is the weighted Brier score,23,24 defined as follows:

wBS(t) = (1{T > t} − ĥ)² w(t, Y, G),

where w(t, Y, G) is the weight defined by

w(t, Y, G) = 1{T ≤ t} δ / G(T−) + 1{T > t} / G(t),

and G(t) = P{C° > t} is the survival function of the censoring variable C°. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

ℓ(Y, ĥ) = (1{T > t} − ĥ)² w(t, Y, G).

This assumes G is a known function, but in practice G must be estimated.23,24 Thus, if Ĝ is an estimator of G, w(t, Y, G) is replaced by w(t, Y, Ĝ).

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and is therefore very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems to the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2
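For reference, a bare-bones version of the pairwise calculation is sketched below; ties in survival time are ignored for simplicity, and production implementations handle them more carefully.

```r
## Sketch of Harrell's concordance index. risk: predicted risk score
## (higher risk should mean shorter survival).
harrell.c <- function(time, status, risk) {
  num <- den <- 0
  n <- length(time)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    ## usable pair: the shorter time must be an observed event
    if (time[i] < time[j] && status[i] == 1) {
      den <- den + 1
      if (risk[i] > risk[j]) num <- num + 1           # concordant
      else if (risk[i] == risk[j]) num <- num + 0.5   # tied risks
    }
  }
  num / den
}
```

The double loop makes the O(n²) cost of the concordance calculation explicit, a point returned to in Section 7.2.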

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n = 2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p = 39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus Harrell's concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values with a b = n^{1/2} subsampling rate (b = 47.2). We preferred to use the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP which have confidence regions bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let h(t, x) be the hazard function for covariate x at time t. Simulations are as follows:

1. h(t, x) = exp(x1 − 2x4 + 2x5).
2. If x1 ≤ .5, h(t, x) = exp(x2) 1{t ∉ [.5, 2.5]}; if x1 > .5, h(t, x) = exp(x3) 1{t ∉ [2.5, 4.5]}.
3. h(t, x) = (1 + z2 t) exp(z1 + z2 t), where z1 = .5x1 and z2 = x4 + x5.

In all simulations, covariates were independently sampled from a U(0, 1) distribution. Noise variables were added to increase the dimension to p = 10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards models.

FIGURE 7. Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention.

Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
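To make the setup concrete, here is a sketch of simulation 1. Since the hazard is constant in t, event times are exponential; the uniform censoring shown is only an illustrative choice consistent with the stated independent censoring.

```r
## Sketch of survival simulation 1 (constant hazard => exponential times).
n <- 250; p <- 10
x <- matrix(runif(n * p), n, p)
rate <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])         # hazard h(t, x)
event.time <- rexp(n, rate = rate)
cens.time  <- runif(n, 0, quantile(event.time, .95))  # illustrative censoring
dat <- data.frame(time   = pmin(event.time, cens.time),
                  status = as.numeric(event.time <= cens.time),
                  x)
```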

RSF was fit using randomForestSRC using the same tuning values as in the RF-C simulations (nodesize 5, random feature selection p^{1/2}). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8. Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: all p covariate effects are linear.
2. Quadratic model: a subset of the p covariate effects are quadratic.
3. Interaction model: same as 1, but interactions between certain p variables were included.

The feature dimension was p = 12 for simulations 1 and 2, and p = 17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will cover the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of b = n^{1/2} reduces this to O(nM), a factor of n smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting; a code sketch of the resulting selection rule follows the list.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is b = n^{1/2}. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n = 250.
3. The α value for the confidence region should be chosen using typical values such as α = .1 or α = .05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
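A minimal sketch of the selection rule implied by guidelines 1 to 3: keep a variable when its 100(1 − α)% delete-d jackknife interval lies entirely above zero.

```r
## Variable selection via delete-d jackknife confidence intervals.
## theta.hat: VIMP estimates; v.hat: delete-d jackknife variance estimates.
select.vars <- function(theta.hat, v.hat, alpha = .1) {
  lower <- theta.hat - qnorm(1 - alpha / 2) * sqrt(v.hat)
  which(lower > 0)   # signal variables: interval bounded away from zero
}
```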

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1 − α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noisy variables identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = .25, .1, .01, .001. The top panel was calculated under a subsampling rate b = n^{1/2}, while the bottom panel used b = n^{3/4}. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases with increasing α, while TNR increases with α. For b = n^{3/4}, TNR rates are slightly better; however, for b = n^{1/2}, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to b = n^{1/2}.

Focusing on the top panel, corresponding to the recommended n^{1/2} subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .01. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α = .25. The value α = .25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α = .1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9. Results from variable selection experiment using random forest regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1 − α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}.

7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model h(X) = Σ_{j=1}^{p} h_j(X^{(j)}), the population mean for VIMP equals

E[I(X^{(j)}, Θ, L)] = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))²] + 2σ²(1 − ρ_j) + o(1),

where ρ_j is a correlation coefficient and h_j is the additive expansion of h attributed to X^{(j)}. For noisy variables, h_j = 0 and ρ_j = 1; thus VIMP will converge to zero. For strong variables, h_j ≠ 0. Our theory suggests that the value of ρ_j will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that h_j changes when X^{(j)} is permuted, thus showing that VIMP correctly isolates the effect of X^{(j)} in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe in general that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully in the future these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.

2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.

3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.

4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.

5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.

6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.

7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.

8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.

9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.

10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.

11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.

12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.

13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.

14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.

15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.

16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.

17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.

18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.

19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.

20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.

21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.

22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.

23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.

24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.

25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.

26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.

27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.

28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.

29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0. 2017.

30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1. 2010.

31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77. 2017.

32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.

33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.

34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.

35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.

36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n = 2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1. Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values.

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let θ_n^{(j)} = I(X^{(j)}, L) be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

Y_i = Σ_{j=1}^{p} h_j(X_i^{(j)}) + ε_i,  i = 1,…,n,    (A1)

where (X_i, ε_i) are iid with distribution P such that X_i and ε_i are independent and E(ε_i) = 0, Var(ε_i) = σ² < ∞. Notice (A1) implies that the target function h has an additive expansion, h(X) = Σ_{j=1}^{p} h_j(X^{(j)}). This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss ℓ(Y, ĥ) = (Y − ĥ)², we have

θ_n^{(j)} = E_Θ[(1/N(Θ)) Σ_{i∈L**(Θ)} [(Y_i − h(X̃_i^{(j)}, Θ, L))² − (Y_i − h(X_i, Θ, L))²]].

We will assume the number of OOB cases N(Θ) is always fixed at Round(ne^{−1}), where Round(·) is the nearest integer function. Write N_n for N(Θ). Because this is a fixed value,

θ_n^{(j)} = (1/N_n) E_Θ[Σ_{i∈L**(Θ)} [(Y_i − h(X̃_i^{(j)}, Θ, L))² − (Y_i − h(X_i, Θ, L))²]].

To study θ_n^{(j)}, we will evaluate its mean θ_{n,0}^{(j)} = E_L[θ_n^{(j)}]. For ease of notation, write h_{n,i} = h(X_i, Θ, L) and h̃_{n,i} = h(X̃_i^{(j)}, Θ, L). Likewise, let h_n = h(X, Θ, L) and h̃_n = h(X̃^{(j)}, Θ, L).

FIGURE A2. Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

We have

θ_{n,0}^{(j)} = (1/N_n) E[Σ_{i∈L**(Θ)} [(Y_i − h̃_{n,i})² − (Y_i − h_{n,i})²]] = E[(Y − h̃_n)² − (Y − h_n)²],

where the right-hand side follows because (X, Y, h_n, h̃_n) =_d (X_i, Y_i, h_{n,i}, h̃_{n,i}) if i is OOB (ie, because the tree does not use information about (X_i, Y_i) in its construction, we can replace (X_i, Y_i) with (X, Y)). Now, making use of the representation Y = h(X) + ε, which holds by the assumed regression model (A1), and writing h for h(X) and Δ̃_n = h̃_n − h_n,

θ_{n,0}^{(j)} = E[(Y − h̃_n)² − (Y − h_n)²] = E[−2εΔ̃_n + Δ̃_n² + 2Δ̃_n(h_n − h)] = E[Δ̃_n²] + 2E[Δ̃_n(h_n − h)],    (A2)

where in the last line we have used E(ε) = 0 and that ε is independent of {Δ̃_n, h_n, h}.


We can see that (A2) is driven by the 2 terms Δ̃_n and h_n − h. Define integer values n_i = n_i(Θ) ≥ 0 recording the bootstrap frequency of case i = 1,…,n in L*(Θ) (notice that n_i = 0 implies case i is OOB). By the definition of a RF-R tree, we have

h_n = h(X, Θ, L) = Σ_{i=1}^{n} W_i(X, Θ) Y_i,

where {W_i(X, Θ)}_{i=1}^{n} are the forest weights, defined as follows:

W_i(X, Θ) = n_i 1{X_i ∈ R(X, Θ)} / |R(X, Θ)|,

where R(X, Θ) is the tree terminal node containing X, and |R(X, Θ)| is the cardinality, equal to the number of bootstrap cases in R(X, Θ). Notice that the weights are convex, since 0 ≤ W_i(X, Θ) ≤ 1 and

Σ_{i=1}^{n} W_i(X, Θ) = Σ_{i=1}^{n} n_i 1{X_i ∈ R(X, Θ)} / |R(X, Θ)| = Σ_{i=1}^{n} n_i 1{X_i ∈ R(X, Θ)} / Σ_{i′=1}^{n} n_{i′} 1{X_{i′} ∈ R(X, Θ)} = 1.

Similarly, we have

h̃_n = h(X̃^{(j)}, Θ, L) = Σ_{i=1}^{n} W_i(X̃^{(j)}, Θ) Y_i,  where  W_i(X̃^{(j)}, Θ) = n_i 1{X_i ∈ R(X̃^{(j)}, Θ)} / |R(X̃^{(j)}, Θ)|.

Therefore,

Δ̃_n(X) = h(X̃^{(j)}, Θ, L) − h(X, Θ, L) = Σ_{i=1}^{n} W_i(X̃^{(j)}, Θ) Y_i − Σ_{i=1}^{n} W_i(X, Θ) Y_i.

To study Δ̃_n in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write R̃ = R(X̃^{(j)}, Θ) and R = R(X, Θ). Then

Δ̃_n(X) = (1/|R̃|) Σ_{i∈R̃} n_i Y_i − (1/|R|) Σ_{i∈R} n_i Y_i
        = (1/|R̃|) Σ_{i∈R̃} n_i h(X_i) − (1/|R|) Σ_{i∈R} n_i h(X_i) + (1/|R̃|) Σ_{i∈R̃} n_i ε_i − (1/|R|) Σ_{i∈R} n_i ε_i
        = h(X_{i(R̃)}) − h(X_{i(R)}) + ε_{i(R̃)} − ε_{i(R)},

where ieth~RTHORN and i(R) identify the index for the bootstrap case in ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ) respectively (note that

ieth~RTHORN and i(R) are functions of X and j but this is suppressed for notational simplicity) We can see that the informationin the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understandingVIMP Notice that if hethXieth~RTHORNTHORN≍heth~Xeth jTHORNTHORN and h(Xi(R))≍h(X) which is what we would expect asymptotically with a deeptree then

hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN≍heth~Xeth jTHORNTHORNminushethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORN

where the right‐hand side follows by our assumption of an additive model (A1) This shows that VIMP for X(j) isassessed by how much its contribution to the additive expansion hj changes when X(j) is permuted This motivatesthe following assumption


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

h(X̃^{(j)}) = h(X_{i(R̃)}) + ζ̃_n(X),  h(X) = h(X_{i(R)}) + ζ_n(X),

where E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1).

This deals with the first 2 terms in the expansion of Δ̃_n(X). We also need to deal with the remaining term, involving the measurement errors ε_{i(R̃)} − ε_{i(R)}. For this we will rely on a fairly mild exchangeability assumption.

Assumption 3. {ε_{i(R̃)}, ε_{i(R)}} is a finite exchangeable sequence with variance σ².

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is L2-consistent, E[(h_n − h)²] → 0, where E[h²] < ∞.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

θ_{n,0}^{(j)} = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))²] + 2σ²(1 − ρ_j) + o(1),

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}).

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when X^{(j)} is a noisy variable. Ideally, this means the tree is split without ever using X^{(j)}. Therefore, if we drop X and X̃^{(j)} down the tree, they will occupy the same terminal node. Hence, R̃ = R and ε_{i(R̃)} = ε_{i(R)}, and therefore ρ_j = 1. Furthermore, because h_j must be zero for a noisy variable, it follows that θ_{n,0}^{(j)} = o(1). Thus the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because h_j ≠ 0 and ρ_j < 1 for strong variables.

Proof. By (A2), we have θ_{n,0}^{(j)} = E[Δ̃_n²] + 2E[Δ̃_n(h_n − h)]. We start by dealing with the second term, E[Δ̃_n(h_n − h)]. By the Cauchy-Schwartz inequality,

E[Δ̃_n(h_n − h)] ≤ E[|Δ̃_n| |h_n − h|] ≤ (E[Δ̃_n²])^{1/2} (E[(h_n − h)²])^{1/2}.

By Assumption 4, the right-hand side converges to zero if E[Δ̃_n²] remains bounded. By Assumption 2 and the assumption of an additive model (A1),

Δ̃_n(X) = h_j(X̃^{(j)}) − h_j(X^{(j)}) − ζ̃_n(X) + ζ_n(X) + ε_{i(R̃)} − ε_{i(R)}.

Assumption 4 implies h (and therefore h_j) is square-integrable. Assumption 3 implies that ε_{i(R̃)}, ε_{i(R)} have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1), deduce that

E[Δ̃_n(X)²] = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))²] + 2E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))(ε_{i(R̃)} − ε_{i(R)})] + E[(ε_{i(R̃)} − ε_{i(R)})²] + o(1).

By exchangeability, E[g(X)ε_{i(R̃)}] = E[g(X)ε_{i(R)}] for any function g(X). Hence,

0 = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))ε_{i(R̃)}] − E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))ε_{i(R)}] = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))(ε_{i(R̃)} − ε_{i(R)})].

Appealing to exchangeability once more, we have

E[(ε_{i(R̃)} − ε_{i(R)})²] = E[ε_{i(R̃)}²] + E[ε_{i(R)}²] − 2E[ε_{i(R̃)} ε_{i(R)}] = 2σ²(1 − ρ_j),

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}). Therefore, we have shown

E[Δ̃_n(X)²] = E[(h_j(X̃^{(j)}) − h_j(X^{(j)}))²] + 2σ²(1 − ρ_j) + o(1),

which verifies boundedness of E[Δ̃_n²], and that θ_{n,0}^{(j)} = E[Δ̃_n²] + o(1). □


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor h_n: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established L2-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.


2.4 | Tree VIMP

Let L*(Θ_m) be the mth bootstrap sample, and let L**(Θ_m) = L \ L*(Θ_m) be the corresponding OOB data. Write X = (X^{(1)},…,X^{(p)}), where X^{(j)} denotes the jth feature coordinate. The permuted value of the jth coordinate of X is denoted by X̃^{(j)}. Substituting this into the jth coordinate of X yields X̃^{(j)}:

X̃^{(j)} = (X^{(1)},…,X^{(j−1)}, X̃^{(j)}, X^{(j+1)},…,X^{(p)}).

Variable importance is calculated by taking the difference in prediction error under the original X to prediction error under the perturbed X̃^{(j)}, over OOB data. More formally, let I(X^{(j)}, Θ_m, L) denote the VIMP for X^{(j)} for the mth tree. It follows that

I(X^{(j)}, Θ_m, L) = Σ_{i∈L**(Θ_m)} ℓ(Y_i, h(X̃_i^{(j)}, Θ_m, L)) / Σ_{i∈L**(Θ_m)} 1 − Σ_{i∈L**(Θ_m)} ℓ(Y_i, h(X_i, Θ_m, L)) / Σ_{i∈L**(Θ_m)} 1.    (3)

Note that in the first sum, we implicitly assume Θ_m embeds the additional randomization for permuting OOB data to define X̃_i^{(j)}. Because this additional randomization only requires knowledge of OOB membership, and therefore can be parameterized in terms of Θ, we assume without loss of generality that Θ encodes both the randomization for growing a tree and for permuting OOB data.

Expression (3) can be written more compactly by noting that the denominator in each sum equals the OOB sample size. Let N(Θ_m) be this value. Then

I(X^{(j)}, Θ_m, L) = (1/N(Θ_m)) Σ_{i∈L**(Θ_m)} [ℓ(Y_i, h(X̃_i^{(j)}, Θ_m, L)) − ℓ(Y_i, h(X_i, Θ_m, L))].

2.5 | Variable importance

Averaging tree VIMP over the forest yields VIMP:

I(X^{(j)}, Θ_1,…,Θ_M, L) = (1/M) Σ_{m=1}^{M} I(X^{(j)}, Θ_m, L).    (4)

An infinite forest estimator for VIMP can be defined analogously by taking the limit as M → ∞:

I(X^{(j)}, L) = E_Θ[I(X^{(j)}, Θ, L)].    (5)

It is worth noting that (4) and (5) do not explicitly make use of the forest predictors (1) or (2). This is a unique feature of permutation VIMP, because it is a tree-based estimator of importance.

3 | SAMPLING APPROACHES FOR ESTIMATING VIMP VARIANCE

3.1 | The .164 bootstrap estimator

The bootstrap is a popular method that can be used for estimating the variance of an estimator. So why not use the bootstrap to estimate the standard error for VIMP? One problem is that running a bootstrap on a forest is computationally expensive. Another, more serious, problem is that a direct application of the bootstrap will not work for VIMP. This is because RF trees already use bootstrap data, and applying the bootstrap creates double-bootstrap data that affects the coherence of being OOB.

To explain what goes wrong, let us simplify our previous notation by writing I_{n,M}^{(j)} for the finite forest estimator (4). Let P_n denote the empirical measure for L. The bootstrap estimator of Var(I_{n,M}^{(j)}) is

Var*(I_{n,M}^{(j)}) = Var_{P_n}(I*_{n,M}^{(j)}).    (6)

To calculate (6), we must draw a sample from P_n. Call this bootstrap sample L*. Because L* represents the learning data, we must draw a bootstrap sample from L* to construct a RF tree. Let L*(Θ*) denote this bootstrap sample, where Θ* represents the tree growing instructions. This is a double-bootstrap draw. The problem is that if a specific case in L* is duplicated l > 1 times, there is no guarantee that all l cases appear in the bootstrap draw L*(Θ*). These remaining duplicated values are assigned to the OOB data, but these values are not truly OOB, which compromises the coherence of the OOB data.

Double-bootstrap data lower the probability of being truly OOB to a value much smaller than .368, which is the value expected for a true OOB sample. We can work out exactly how much smaller this probability is. Let n_i be the number of occurrences of case i in L*. Then

Pr{i is truly OOB in L*(Θ*)} = Σ_{l=1}^{n} Pr{i is truly OOB in L*(Θ*) | n_i = l} Pr{n_i = l}.    (7)

We have

(n_1,…,n_n) ~ Multinomial(n, (1/n,…,1/n)),  n_i ~ Binomial(n, 1/n) ≈ Poisson(1).

Hence, (7) can be seen to equal

Σ_{l=1}^{n} ((n − l)/n)^n Pr{n_i = l} ≈ Σ_{l=1}^{n} ((n − l)/n)^n e^{−1} (1/l!) = e^{−1} Σ_{l=1}^{n} (1 − l/n)^n (1/l!) ≈ e^{−1} Σ_{l=1}^{n} e^{−l}/l! ≈ .1635.

Therefore, double-bootstrap data have an OOB size of .164n.

The above discussion points to a simple solution to the problem, which we call the .164 bootstrap estimator. The .164 estimator is a bootstrap variance estimator but is careful to use only truly OOB data. Let L* = {Z_1 = (X_{i_1}, Y_{i_1}),…,Z_n = (X_{i_n}, Y_{i_n})} denote the bootstrap sample used for learning, and let L*(Θ*) = {Z_i : i ∈ Θ*} be the bootstrap sample used to grow the tree. The OOB data for the double-bootstrap data are defined as {(X_{i_l}, Y_{i_l}) ∉ L*(Θ*)}. However, there is another subtle issue at play regarding duplicates in the OOB data. Even though {(X_{i_l}, Y_{i_l}) ∉ L*(Θ*)} are data points from L* truly excluded from the double-bootstrap sample, and therefore technically meet the criteria of being OOB, there is no guarantee they are all unique. This is because these values originated from L*, a bootstrap draw, and therefore could very well be duplicated. To ensure this does not happen, we further process the OOB data to retain only the unique values.
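A quick Monte Carlo check of the .164 value (a sketch; 2000 replications of the double bootstrap for a single fixed original case):

```r
## Probability that a fixed original case is truly OOB under the
## double bootstrap; should come out near .164.
set.seed(1)
n <- 1000
mean(replicate(2000, {
  L.star <- sample(n, n, replace = TRUE)        # bootstrap learning sample
  L.tree <- sample(L.star, n, replace = TRUE)   # double-bootstrap tree sample
  (1 %in% L.star) && !(1 %in% L.tree)           # case 1 in L*, absent from tree
}))
```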

estimator is a bootstrap variance estimator but is careful to use only truly OOB data Let Llowast frac14 fZ1 frac14 ethXi1 Yi1THORNhellipZn frac14 ethXin YinTHORNg denote the bootstrap sample used for learning and letLlowastethΘlowastTHORN frac14 fZiiisinΘlowastg be the bootstrap sample usedto grow the tree The OOB data for the double‐bootstrap data are defined as fethXil YilTHORNnotinLlowastethΘlowastTHORNg However there isanother subtle issue at play regarding duplicates in the OOB data Even though fethXil YilTHORNnotinLlowastethΘlowastTHORNg are data points fromLlowast truly excluded from the double‐bootstrap sample and therefore technically meet the criteria of being OOB there is noguarantee they are all unique This is because these values originated fromLlowast a bootstrap draw and therefore could verywell be duplicated To ensure this does not happen we further process the OOB data to retain only the unique values

The steps for implementing the 164 estimator can be summarized as follows

164 bootstrap estimator for VarethIeth jTHORNnMTHORN

1 Draw a bootstrap sample Llowast frac14 fZ1 frac14 ethXi1 Yi1THORNhellipZn frac14 ethXin YinTHORNg2 Let LlowastethΘlowastTHORN frac14 fZiiisinΘlowastg be a bootstrap draw from Llowast Use LlowastethΘlowastTHORN to grow a tree predictor3 Define OOB data to be the unique values in fethXil YilTHORNnotinLlowastethΘlowastTHORNg4 Calculate the tree VIMP IethXeth jTHORNΘlowastLlowastTHORN using OOB data of step 35 Repeat steps 2 to 4 independently M times Average the VIMP values to obtain θlowasteth jTHORNn 6 Repeat the entire procedure Kgt1 times obtaining θlowasteth jTHORNn1 hellip θ

lowasteth jTHORNnK Estimate VarethIeth jTHORNnMTHORN by the bootstrap sample

variance eth1=KTHORNsumKkfrac141ethθlowasteth jTHORNnk minus

1KsumK

kprimefrac141θlowasteth jTHORNnkprime

THORN2

3.2 | Subsampling and the delete-d jackknife

A problem with the .164 bootstrap estimator is that its OOB data set is smaller than that of a typical OOB estimator. Truly OOB data from a double bootstrap can be less than half the size of OOB data used in a standard VIMP calculation (.164 versus .368). Thus, in a forest of 1000 trees, the .164 estimator uses about 164 trees on average to calculate VIMP for a case, compared with 368 trees used in a standard calculation. This can reduce efficiency of the .164 estimator. Another problem is computational expense. The .164 estimator requires repeatedly fitting RF to bootstrap data, which becomes expensive as n increases.

To avoid these problems, we propose a more efficient procedure based on subsampling theory.21 The idea rests on calculating VIMP over small iid subsets of the data. Because sampling is without replacement, this avoids the ties in the OOB data that create problems for the bootstrap. Also, because each calculation is fast, the procedure is computationally efficient, especially in big n settings.

3.2.1 | Subsampling theory

We begin by first reviewing some basic theory of subsampling. Let X_1,…,X_n be iid random values with common distribution P. Let θ_n = θ(X_1,…,X_n) be some estimator for θ(P), an unknown real-valued parameter we wish to estimate. The bootstrap estimator for the variance of θ_n is based on the following simple idea. Let P_n = (1/n) Σ_{i=1}^{n} δ_{X_i} be the empirical measure for the data. Let X*_1,…,X*_n be a bootstrap sample obtained by independently sampling n points from P_n. Because P_n converges to P, we should expect the moments of the bootstrap estimator θ*_n = θ(X*_1,…,X*_n) to closely approximate those of θ_n. In particular, we should expect the bootstrap variance Var_{P_n}(θ*_n) to closely approximate Var(θ_n). This is the rationale for the variance estimator (6) described earlier.

Subsampling21 employs the same strategy as the bootstrap but is based on sampling without replacement. For b = b(n) such that b/n → 0, let S_b be the entire collection of subsets of {1,…,n} of size b. For each s = {i_1,…,i_b} ∈ S_b, let θ_{n,b,s} = θ(X_{i_1},…,X_{i_b}) be the estimator evaluated using s. The goal is to estimate the sampling distribution of n^{1/2}(θ_n − θ(P)). It turns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions. Let Q_n denote the distribution of n^{1/2}(θ_n − θ(P)). Assume Q_n converges weakly to a proper limiting distribution Q:

Q_n →_d Q.    (8)

Then it follows21 that the distribution function for the statistic n^{1/2}(θ_n − θ(P)) can be approximated by the subsampling estimator

Ũ_{n,b}(x) = (1/C_b) Σ_{s∈S_b} 1{b^{1/2}(θ_{n,b,s} − θ_n) ≤ x},    (9)

where C_b = (n choose b) is the cardinality of S_b. More formally, assuming (8) and b/n → 0 for b → ∞, then Ũ_{n,b}(x) →_p F(x) = Q[−∞, x] for each x that is a continuity point of the limiting cumulative distribution function F. The key to this argument is to recognize that, because of (8) and b/n → 0, Ũ_{n,b} closely approximates

U_{n,b}(x) = (1/C_b) Σ_{s∈S_b} 1{b^{1/2}(θ_{n,b,s} − θ(P)) ≤ x},

which is a U-statistic25 of order b (see Politis and Romano21 for details).

The ability to approximate the distribution of θ_n suggests, similar to the bootstrap, that we can approximate moments of θ_n with those from the subsampled estimator; in particular, we should be able to approximate the variance. Unlike the bootstrap, however, subsampled statistics are calculated using a sample size b and not n. Therefore, to estimate the variance of θ_n, we must apply a scaling factor to correct for sample size. The subsampled estimator for the variance is (see Radulović26 and section 3.3.1 of Politis and Romano21)

υ_b = (b/n) (1/C_b) Σ_{s∈S_b} (θ_{n,b,s} − (1/C_b) Σ_{s′∈S_b} θ_{n,b,s′})².    (10)

The estimator (10) is closely related to the delete-d jackknife.22 The delete-d estimator works on subsets of size r = n − d and is defined as follows:

υ_{J(d)} = (r/d) (1/C_r) Σ_{s∈S_r} (θ_{n,r,s} − θ_n)².

With a little bit of rearrangement, this can be rewritten as follows:

υ_{J(d)} = (r/d) (1/C_r) Σ_{s∈S_r} (θ_{n,r,s} − (1/C_r) Σ_{s′∈S_r} θ_{n,r,s′})² + (r/d) ((1/C_r) Σ_{s∈S_r} θ_{n,r,s} − θ_n)².

Setting d = n − b, we obtain

υ_{J(d)} = (b/(n − b)) (1/C_b) Σ_{s∈S_b} (θ_{n,b,s} − (1/C_b) Σ_{s′∈S_b} θ_{n,b,s′})² + (b/(n − b)) ((1/C_b) Σ_{s∈S_b} θ_{n,b,s} − θ_n)²,    (11)

where the second term is the bias term. The first term closely approximates (10) since b/n → 0, while the second term is a bias estimate of the subsampled estimator. Thus, the delete-d estimator (11) can be seen to be a bias-corrected version of (10). Furthermore, this correction is always upwards, because the bias term is squared and always positive.

3.2.2 | Subsampling and delete-d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP. In the following, we assume b is some integer much smaller than n such that b/n → 0.

b-subsampling estimator for Var(I_{n,M}^{(j)}):

1. Draw a subsampling set s ∈ S_b. Let L_s be L restricted to s.
2. Calculate I_{n,M}^{(j)}(L_s), the finite forest estimator for VIMP using L_s. Let θ_{n,b,s}^{(j)} denote this value.
3. Repeat K > 1 times, obtaining θ_{n,b,s_1}^{(j)},…,θ_{n,b,s_K}^{(j)}. Estimate Var(I_{n,M}^{(j)}) by [b/(nK)] Σ_{k=1}^{K} (θ_{n,b,s_k}^{(j)} − (1/K) Σ_{k′=1}^{K} θ_{n,b,s_{k′}}^{(j)})².

The delete-d jackknife estimator is obtained by a slight modification to the above algorithm.

delete-d jackknife estimator (d = n − b) for Var(I_{n,M}^{(j)}):

1. Using the entire learning set L, calculate the forest VIMP estimator I_{n,M}^{(j)}(L). Let θ_n^{(j)} denote this value.
2. Run the b-subsampling estimator, but replace the estimator in step 3 with {b/[(n − b)K]} Σ_{k=1}^{K} (θ_{n,b,s_k}^{(j)} − θ_n^{(j)})².

4 | RANDOM FOREST REGRESSION (RF-R)

4.1 | Simulations

In the following sections (Sections 4, 5, and 6), we evaluate the performance of the .164 bootstrap estimator, the b-subsampling estimator, and the delete-d jackknife variance estimator. We begin by looking at the regression setting. We used the following simulations to assess performance:

1. y = 10 sin(πx1x2) + 20(x3 − .5)² + 10x4 + 5x5 + ε, xj ~ U(0, 1), ε ~ N(0, 1).
2. y = (x1² + (x2x3 − (x2x4)^{−1})²)^{1/2} + ε, x1 ~ U(0, 100), x2 ~ U(40π, 560π), x3 ~ U(0, 1), x4 ~ U(1, 11), ε ~ N(0, 125²).
3. y = tan^{−1}((x2x3 − (x2x4)^{−1})/x1) + ε, x1 ~ U(0, 100), x2 ~ U(40π, 560π), x3 ~ U(0, 1), x4 ~ U(1, 11), ε ~ N(0, .1²).
4. y = x1x2 + x3² − x4x7 + x8x10 − x6² + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
5. y = 1{x1 > 0} + x2³ + 1{x4 + x6 − x8 − x9 > 1 + x10} + exp(−x2²) + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
6. y = x1² + 3x2²x3 exp(−|x4|) + x6 − x8 + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
7. y = 1{x1 + x4³ + x9 + sin(x2x8) + ε > .38}, xj ~ U(−1, 1), ε ~ N(0, .1²).
8. y = log(x1 + x2x3) − exp(x4x5^{−1} − x6) + ε, xj ~ U(.5, 1), ε ~ N(0, .1²).
9. y = x1x2²|x3|^{1/2} + ⌊x4 − x5x6⌋ + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
10. y = x3(x1 + 1)|x2| − (x5²(|x4| + |x5| + |x6|)^{−1})^{1/2} + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
11. y = cos(x1 − x2) + sin^{−1}(x1x3) − tan^{−1}(x2 − x3²) + ε, xj ~ U(−1, 1), ε ~ N(0, .1²).
12. y = ε, ε ~ N(0, 1).

In all 12 simulations, the dimension of the feature space was set to p=20. This was done by adding variables unrelated to y to the design matrix; we call these noise variables. In simulations 1 to 3, noise variables were U(0,1); for simulations 4 to 11, noise variables were U(−1,1); and for simulation 12, noise variables were N(0,1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations.^{6,27} Simulations 4 to 11 were inspired by COBRA.^{28} Simulation 12 is a pure noise model.
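As an illustration, the following sketch generates a data set from simulation 1 (the Friedman 1 signal in the first 5 columns, plus 15 independent U(0,1) noise variables to bring the dimension to p=20); the function and object names are illustrative:

```r
sim1 <- function(n = 250, p = 20) {
  x <- matrix(runif(n * p), n, p)   ## only the first 5 columns carry signal
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - .5)^2 +
       10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x = x)
}
dat <- sim1()
```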

The sample size was set at n=250. Subsampling was set at a rate of $b=n^{1/2}$, which in this case is b=15.8. We can see that, practically speaking, this is a very small sample size, and it allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d=n−b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstraps or subsampled replicates. However, they represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package.^{29} The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large-scale computing clusters. The package now includes a dedicated function "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$.
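A minimal sketch of this workflow is shown below. The subsample() call is the package function named above; the argument names used here (B for the number of subsamples and subratio for b/n) are our reading of the package interface and should be checked against the documentation of the installed version:

```r
library(randomForestSRC)
## dat is the simulated regression data from the sketch above
o <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5)
## 100 subsamples at rate b = sqrt(n), ie, subratio = sqrt(n)/n
smp <- subsample(o, B = 100, subratio = sqrt(nrow(dat)) / nrow(dat))
## standard errors and confidence intervals for VIMP can then be
## extracted or plotted from smp (eg, via plot.subsample)
```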

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of $\mathrm{Var}(I^{(j)}_{n,M})$. We took the square root of this to obtain an estimate for the standard error of VIMP, $(\mathrm{Var}(I^{(j)}_{n,M}))^{1/2}$. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter $\mathrm{Var}(I^{(j)}_{n,M})$. For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP $I^{(j)}_{n,M}$. The same sample size of n=250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate $\mathrm{Var}(I^{(j)}_{n,M})$. We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate $E[I^{(j)}_{n,M}] = E[I(X^{(j)}, \mathcal{L})]$. We call this the true finite VIMP.

4.3 | Results

Performance of methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account the signal strength of a variable. In our simulations, there are noisy variables with no signal; there are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by the size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p=20 variables). These 240 variables were stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates the variance.

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest regression simulations. In total, there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots, labeled "ALL", display performance for all 240 variables simultaneously

The delete-d estimator does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate $b=n^{3/4}$ (b=62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.^{21} The general idea for constructing a confidence interval for a target parameter θ(P) is as follows. Define the 1−α quantile for the subsampling distribution as $c_{n,b}(1-\alpha) = \inf\{x : \tilde{U}_{n,b}(x) \ge 1-\alpha\}$. Similarly, define the 1−α quantile for the limiting distribution Q of $n^{1/2}(\theta_n - \theta(P))$ as $c(1-\alpha) = \inf\{x : F(x) \ge 1-\alpha\}$, where $F(x) = Q[-\infty, x]$. Then, assuming (8) and $b/n\rightarrow 0$, the interval

FIGURE 2 Results from random forest regression simulations but with increased subsampling rate $b=n^{3/4}$. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator

$$\big[\theta_n - n^{-1/2}c_{n,b}(1-\alpha),\ \infty\big) \tag{12}$$

contains θ(P) with asymptotic probability 1−α if c(1−α) is a continuity point of F.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let $\theta^{(j)}_n = I^{(j)}_{n,M}$ denote the finite forest estimator for VIMP. We call the limit of $\theta^{(j)}_n$ as $n, M\rightarrow\infty$ the true VIMP and denote this value by $\theta^{(j)}_0$:

$$\theta^{(j)}_0 = \lim_{n,M\rightarrow\infty}\theta^{(j)}_n = \lim_{n,M\rightarrow\infty}I^{(j)}_{n,M}.$$

Let $\upsilon^{(j)}_n$ be an estimator for $\mathrm{Var}(I^{(j)}_{n,M})$. Assuming asymptotic normality,

$$\frac{\theta^{(j)}_n - \theta^{(j)}_0}{\sqrt{\upsilon^{(j)}_n}} \xrightarrow{d} \mathrm{N}(0,1), \tag{13}$$

an asymptotic 100(1−α)% confidence region for $\theta^{(j)}_0$, the true VIMP, can be defined as

$$\theta^{(j)}_n \pm z_{\alpha/2}\sqrt{\upsilon^{(j)}_n},$$

where $z_{\alpha/2}$ is the 1−α/2 quantile of a standard normal: $\Pr\{\mathrm{N}(0,1)\le z_{\alpha/2}\} = 1-\alpha/2$.
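Given a VIMP estimate and a variance estimate from either algorithm of Section 3.2.2, the interval is immediate to compute. A minimal sketch (function and argument names are illustrative):

```r
## asymptotic normal CI for the VIMP of one variable:
## theta = finite forest VIMP estimate, v = variance estimate
vimp.ci <- function(theta, v, alpha = .1) {
  z <- qnorm(1 - alpha / 2)            ## 1 - alpha/2 standard normal quantile
  c(lower = theta - z * sqrt(v), upper = theta + z * sqrt(v))
}
```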

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n=2500. The left-hand side of Figure 3 displays normal quantile plots of standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while the x-axis shows the corresponding N(0,1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only a value of 0.02. Also observe that the overall mean bias is near zero (far right boxplot).

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate, $b=n^{3/4}$). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP, in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap, which achieve coverage rates of 97%, too high a value.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest regression simulations. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest regression simulations. Left-hand figure displays normal quantile plots of standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations

5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.^{30}
2. Two class simulation using twoClassSim from the caret R-package,^{31} with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ=.75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.
5. Same as 2, but with 15 linear variables and a ρ=.75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U(−1,1) distribution. Experimental parameters were set as in the RF-R simulations: n=250, $b = n^{1/2}, n^{3/4}$, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used $p^{1/2}$ random features (default setting). The entire procedure was repeated 250 times.
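As an illustration, the following sketch runs simulation 1; mlbench.threenorm is the mlbench function named above, while the remaining settings mirror the text (250 trees, nodesize 5, $p^{1/2}$ random features):

```r
library(mlbench)
library(randomForestSRC)
tn <- mlbench.threenorm(250, d = 20)        ## threenorm data, p = 20
dat.c <- data.frame(y = tn$classes, tn$x)
o.c <- rfsrc(y ~ ., dat.c, ntree = 250, nodesize = 5,
             mtry = ceiling(sqrt(20)), importance = TRUE)
```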

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let $Y\in\{1,\ldots,K\}$ be the response. If $0\le p_k\le 1$ denotes the predicted probability that Y equals class k, $k=1,\ldots,K$, the normalized Brier score is defined as follows:

$$\mathrm{BS}^{*} = 100\cdot\frac{K}{K-1}\sum_{k=1}^{K}\big(1\{Y=k\} - p_k\big)^2.$$

Note that the normalizing constant $100K/(K-1)$ used here is different than the value $1/K$ typically used for the Brier score. We multiply the traditional Brier score by $100K^2/(K-1)$ because we have noticed that the value of the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
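A direct transcription of BS* is straightforward. In the sketch below, y is the observed class coded 1,…,K, and prob is an n × K matrix of predicted class probabilities (eg, OOB predicted probabilities from a forest); the score is averaged over observations:

```r
norm.brier <- function(y, prob) {
  K <- ncol(prob)
  ind <- outer(y, seq_len(K), "==")             ## 1{Y = k} indicators
  100 * (K / (K - 1)) * mean(rowSums((ind - prob)^2))
}
## random guessing scores 100 regardless of K, eg:
## norm.brier(sample(1:4, 1e4, TRUE), matrix(1/4, 1e4, 4))
```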

Although misclassification error is the commonly used performance measure in classification problems, the Brier score is considerably more stable. This stability naturally lends itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20000 trees, in contrast to the stable behavior of the Brier score

5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b.

FIGURE 6 Results from random forest classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than for the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by first defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings and conclude with illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y=(T,\delta)$, where $T=\min(T^o, C^o)$ is the observed survival time and $\delta = 1\{T^o\le C^o\}$ is the right-censoring indicator. Here, $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, δ=1 denotes an event such as death, while δ=0 denotes a right-censored case. The target function h is the conditional survival function $h(x) = P\{T^o > t \mid X = x\}$, where t is some selected time point.

6.2 | Weighted Brier score

Let ĥ be an estimator of h. One method for measuring the performance of ĥ is the weighted Brier score,^{23,24} defined as follows:

$$\mathrm{wBS}(t) = \big(1\{T>t\} - \hat{h}\big)^2 w(t, Y, G),$$

where $w(t,Y,G)$ is the weight defined by

$$w(t,Y,G) = \frac{1\{T\le t\}\,\delta}{G(T-)} + \frac{1\{T>t\}}{G(t)},$$

and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

$$\ell(Y, \hat{h}) = \big(1\{T>t\} - \hat{h}\big)^2 w(t, Y, G).$$

This assumes G is a known function, but in practice G must be estimated.^{23,24} Thus, if Ĝ is an estimator of G, $w(t,Y,G)$ is replaced by $w(t,Y,\hat{G})$.
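A sketch of this calculation is given below, with G estimated by a reverse Kaplan-Meier curve (a Kaplan-Meier fit with the censoring indicator flipped). The inputs time, status, surv.prob (each subject's predicted $P(T^o > t_0)$), and t0 are illustrative, and $G(T-)$ is approximated by evaluating Ĝ just before T:

```r
library(survival)
wbrier <- function(time, status, surv.prob, t0) {
  ## reverse Kaplan-Meier estimate of the censoring survival function G
  G <- survfit(Surv(time, 1 - status) ~ 1)
  Gfun <- stepfun(G$time, c(1, G$surv))      ## right-continuous step function
  ## weights: early uncensored failures get 1/G(T-), survivors get 1/G(t0),
  ## early censored cases get weight 0
  w <- ifelse(time <= t0 & status == 1, 1 / Gfun(time - 1e-8), 0) +
       ifelse(time > t0, 1 / Gfun(t0), 0)
  mean(((time > t0) - surv.prob)^2 * w)      ## averaged over subjects
}
```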

6.3 | Concordance index

Harrell's concordance index^{32} is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and is therefore very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems to the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.^{2}
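For orientation, here is a naive $O(n^2)$ sketch of the index (the calculations in this paper use the randomForestSRC implementation). risk is a predicted risk score, with higher values meaning shorter expected survival; a pair is usable when the smaller observed time is an event, and tied survival times are skipped for brevity:

```r
harrell.c <- function(time, status, risk) {
  conc <- usable <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    lo <- if (time[i] <= time[j]) i else j         ## subject with smaller time
    hi <- if (lo == i) j else i
    if (time[lo] < time[hi] && status[lo] == 1) {  ## usable pair
      usable <- usable + 1
      if (risk[lo] > risk[hi]) conc <- conc + 1    ## concordant
      else if (risk[lo] == risk[hi]) conc <- conc + .5
    }
  }
  conc / usable
}
```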

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.^{33}

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as 1 minus the Harrell concordance index.^{2} The delete-d jackknife estimator was calculated using 1000 subsampled values at a $b=n^{1/2}$ subsampling rate (b=47.2). We preferred to use the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.
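A sketch of this analysis is below; heart is a hypothetical data frame standing in for the study data (columns time, death, and the p=39 covariates), and the subsample() arguments are, as before, our reading of the package interface:

```r
library(randomForestSRC)
o.hf <- rfsrc(Surv(time, death) ~ ., data = heart,
              ntree = 250, nodesize = 30)
## 1000 subsamples at rate b = sqrt(n); delete-d jackknife intervals follow
smp.hf <- subsample(o.hf, B = 1000,
                    subratio = sqrt(nrow(heart)) / nrow(heart))
```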

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as percentages. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP that have confidence regions bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.^{34} Let h(t,x) be the hazard function for covariate x at time t. Simulations are as follows:

1. $h(t,x) = \exp(x_1 - 2x_4 + 2x_5)$.
2. If $x_1\le .5$: $h(t,x) = \exp(x_2)\,1\{t\notin[.5,\ 2.5]\}$. If $x_1 > .5$: $h(t,x) = \exp(x_3)\,1\{t\notin[2.5,\ 4.5]\}$.
3. $h(t,x) = (1+z_2t)\exp(z_1+z_2t)$, where $z_1 = 5x_1$ and $z_2 = x_4+x_5$.

In all simulations, covariates were independently sampled from a U(0,1) distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention

Simulations 2 and 3 are nonproportional hazards models. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
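Because the hazard in simulation 1 is constant in t, event times are exponential with rate $\exp(x_1 - 2x_4 + 2x_5)$, which makes the model easy to simulate. The sketch below uses independent uniform censoring; the paper states only that censoring was independent of time, so the censoring bound c.max is an illustrative choice:

```r
sim.surv1 <- function(n = 250, p = 10, c.max = 15) {
  x <- matrix(runif(n * p), n, p)
  rate <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])   ## subject-specific hazard
  To <- rexp(n, rate = rate)                      ## event times
  Co <- runif(n, 0, c.max)                        ## independent censoring
  data.frame(time = pmin(To, Co), status = as.numeric(To <= Co), x = x)
}
```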

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in the previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

6.6 | Competing risk simulations

Here, we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.^{3} Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al^{3}:

1. Linear model. All p covariate effects are linear.
2. Quadratic model. A subset of the p covariate effects are quadratic.
3. Interaction model. Same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.^{3} Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,^{22} to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With M trees, the number of computations is $O(n^2M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b=n^{1/2}$ reduces this to $O(nM)$, a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting:

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b=n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.

3. The α value for the confidence region should be chosen using typical values such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. The true positive rate was defined as the fraction of true signal variables identified. The true negative rate was the fraction of noisy variables identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α=.25, .1, .01, .001. The top panel was calculated under a subsampling rate $b=n^{1/2}$, while the bottom panel used $b=n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases as α decreases, while TNR increases. For $b=n^{3/4}$, TNR rates are slightly better; however, for $b=n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b=n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α=.1, .05, .01 perform very well under both TNR and TPR.
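A sketch of the selection rule used in this experiment is below: variable j is kept when the lower limit of its 100(1−α)% delete-d jackknife interval lies above zero. The inputs theta and v (VIMP and variance estimates per variable) and the logical vector signal flagging the truly informative variables are illustrative:

```r
select.vars <- function(theta, v, alpha = .1) {
  z <- qnorm(1 - alpha / 2)
  (theta - z * sqrt(v)) > 0                 ## lower confidence limit above 0
}
tpr <- function(keep, signal) mean(keep[signal])     ## true positive rate
tnr <- function(keep, signal) mean(!keep[!signal])   ## true negative rate
```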

FIGURE 9 Results from variable selection experiment using random forest regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$

7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p}h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, \mathcal{L})] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of h attributed to $X^{(j)}$. For noisy variables, $h_j=0$ and $\rho_j=1$; thus, VIMP will converge to zero. For strong variables, $h_j\ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.^{35}

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor. Wager^{19} established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node). Mentch and Hooker^{20} established a similar result for the finite forest predictor. See Biau and Scornet^{36} for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Statistics Department, University of California, Berkeley, CA 94708; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentile values

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al^{3} were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.^{3} Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta^{(j)}_n = I(X^{(j)}, \mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p}h_j(X^{(j)}_i) + \varepsilon_i, \qquad i = 1,\ldots,n, \tag{A1}$$

where $(X_i, \varepsilon_i)$ are iid with distribution P such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice that (A1) implies that the target function h has an additive expansion $h(X) = \sum_{j=1}^{p}h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,^{35} which we will use later in establishing our results. Assuming squared error loss $\ell(Y,\hat{h}) = (Y-\hat{h})^2$, we have

$$\theta^{(j)}_n = E_\Theta\Bigg[\frac{1}{N(\Theta)}\sum_{i\in\mathcal{L}^{**}(\Theta)}\Big[\big(Y_i - h(\tilde{X}^{(j)}_i;\Theta,\mathcal{L})\big)^2 - \big(Y_i - h(X_i;\Theta,\mathcal{L})\big)^2\Big]\Bigg].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(ne^{-1})$, where Round(·) is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta^{(j)}_n = \frac{1}{N_n}E_\Theta\Bigg[\sum_{i\in\mathcal{L}^{**}(\Theta)}\Big[\big(Y_i - h(\tilde{X}^{(j)}_i;\Theta,\mathcal{L})\big)^2 - \big(Y_i - h(X_i;\Theta,\mathcal{L})\big)^2\Big]\Bigg].$$

To study $\theta^{(j)}_n$, we will evaluate its mean $\theta^{(j)}_{n,0} = E_{\mathcal{L}}[\theta^{(j)}_n]$. For ease of notation, write $h_{n,i} = h(X_i;\Theta,\mathcal{L})$ and $\tilde{h}_{n,i} = h(\tilde{X}^{(j)}_i;\Theta,\mathcal{L})$. Likewise, let $h_n = h(X;\Theta,\mathcal{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)};\Theta,\mathcal{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

$$\theta^{(j)}_{n,0} = \frac{1}{N_n}E\Bigg[\sum_{i\in\mathcal{L}^{**}(\Theta)}\Big[(Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2\Big]\Bigg] = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{n,i}, \tilde{h}_{n,i})$ if i is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing h for h(X) and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta^{(j)}_{n,0} = E\big[(Y-\tilde{h}_n)^2 - (Y-h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\big] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \tag{A2}$$

where in the last line we have used $E(\varepsilon) = 0$ and that ε is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta)\ge 0$ recording the bootstrap frequency of case $i = 1,\ldots,n$ in $\mathcal{L}^{*}(\Theta)$ (notice that $n_i=0$ implies case i is OOB). By the definition of a RF-R tree, we have

$$h_n = h(X;\Theta,\mathcal{L}) = \sum_{i=1}^{n}W_i(X,\Theta)Y_i,$$

where $\{W_i(X,\Theta)\}_1^n$ are the forest weights defined as follows:

$$W_i(X,\Theta) = \frac{n_i 1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|},$$

where $R(X,\Theta)$ is the tree terminal node containing X and $|R(X,\Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X,\Theta)$. Notice that the weights are convex, since $0\le W_i(X,\Theta)\le 1$ and

$$\sum_{i=1}^{n}W_i(X,\Theta) = \sum_{i=1}^{n}\frac{n_i 1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|} = \sum_{i=1}^{n}\frac{n_i 1\{X_i\in R(X,\Theta)\}}{\sum_{i'=1}^{n}n_{i'}1\{X_{i'}\in R(X,\Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)};\Theta,\mathcal{L}) = \sum_{i=1}^{n}W_i(\tilde{X}^{(j)},\Theta)Y_i, \quad\text{where } W_i(\tilde{X}^{(j)},\Theta) = \frac{n_i 1\{X_i\in R(\tilde{X}^{(j)},\Theta)\}}{|R(\tilde{X}^{(j)},\Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)};\Theta,\mathcal{L}) - h(X;\Theta,\mathcal{L}) = \sum_{i=1}^{n}W_i(\tilde{X}^{(j)},\Theta)Y_i - \sum_{i=1}^{n}W_i(X,\Theta)Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)},\Theta)$ and $R = R(X,\Theta)$. Then,

$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}}n_iY_i - \frac{1}{|R|}\sum_{i\in R}n_iY_i = \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}}n_ih(X_i) - \frac{1}{|R|}\sum_{i\in R}n_ih(X_i) + \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}}n_i\varepsilon_i - \frac{1}{|R|}\sum_{i\in R}n_i\varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$

where $i(\tilde{R})$ and $i(R)$ identify the index of the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)},\Theta)$ and $R = R(X,\Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line, which therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})})\asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)})\asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2]\rightarrow 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta^{(j)}_{n,0} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop X and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta^{(j)}_{n,0} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j\ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta^{(j)}_{n,0} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n-h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n-h)]$. By the Cauchy-Schwartz inequality,

$$E[\tilde{\Delta}_n(h_n-h)] \le E[|\tilde{\Delta}_n||h_n-h|] \le \sqrt{E[\tilde{\Delta}_n^2]}\sqrt{E[(h_n-h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies h (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function g(X). Hence,

$$0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].$$

Appealing to exchangeability once more, we have

$$E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E[\varepsilon^2_{i(\tilde{R})}] + E[\varepsilon^2_{i(R)}] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1-\rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1-\rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta^{(j)}_{n,0} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al^{35} established $L_2$-consistency of RF-R for additive models, allowing the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

562 ISHWARAN AND LU

VarlowastethIeth jTHORNnMTHORN frac14 VarPnethIlowasteth jTHORNnMTHORN (6)

To calculate (6) we must draw a sample from Pn Call this bootstrap sample Llowast Because Llowast represents the learningdata we must draw a bootstrap sample from Llowast to construct a RF tree Let LlowastethΘlowastTHORN denote this bootstrap sample whereΘlowast represents the tree growing instructions This is a double‐bootstrap draw The problem is that if a specific case in Llowast

is duplicated lgt1 times there is no guarantee that all l cases appear in the bootstrap draw LlowastethΘlowastTHORN These remainingduplicated values are assigned to the OOB data but these values are not truly OOB which compromises the coherenceof the OOB data

Double‐bootstrap data lower the probability of being truly OOB to a value much smaller than 368 which is thevalue expected for a true OOB sample We can work out exactly how much smaller this probability is Let ni be the num-ber of occurrences of case i in Llowast Then

Prfi is truly OOB inLlowastethΘlowastTHORNg frac14 sumn

lfrac141Prfi is truly OOB inLlowastethΘlowastTHORNjni frac14 lgPr fni frac14 lg (7)

We have

ethn1hellipnnTHORNsimMultinomialethn eth1=nhellip 1=nTHORNTHORNnisimBinomialethn 1=nTHORN≍Poissoneth1THORN

Hence (7) can be seen to equal

sumn

lfrac141

nminusln

n

Prfni frac14 lg≍sumn

lfrac141

nminusln

n eminus11l

l

frac14 eminus1sum

n

lfrac1411minus

ln

n1l≍eminus1sum

n

lfrac141

eminusl

l≍1635

Therefore double‐bootstrap data have an OOB size of 164nThe above discussion points to a simple solution to the problem which we call the 164 bootstrap estimator The 164

estimator is a bootstrap variance estimator but is careful to use only truly OOB data Let Llowast frac14 fZ1 frac14 ethXi1 Yi1THORNhellipZn frac14 ethXin YinTHORNg denote the bootstrap sample used for learning and letLlowastethΘlowastTHORN frac14 fZiiisinΘlowastg be the bootstrap sample usedto grow the tree The OOB data for the double‐bootstrap data are defined as fethXil YilTHORNnotinLlowastethΘlowastTHORNg However there isanother subtle issue at play regarding duplicates in the OOB data Even though fethXil YilTHORNnotinLlowastethΘlowastTHORNg are data points fromLlowast truly excluded from the double‐bootstrap sample and therefore technically meet the criteria of being OOB there is noguarantee they are all unique This is because these values originated fromLlowast a bootstrap draw and therefore could verywell be duplicated To ensure this does not happen we further process the OOB data to retain only the unique values

The steps for implementing the 164 estimator can be summarized as follows

164 bootstrap estimator for VarethIeth jTHORNnMTHORN

1 Draw a bootstrap sample Llowast frac14 fZ1 frac14 ethXi1 Yi1THORNhellipZn frac14 ethXin YinTHORNg2 Let LlowastethΘlowastTHORN frac14 fZiiisinΘlowastg be a bootstrap draw from Llowast Use LlowastethΘlowastTHORN to grow a tree predictor3 Define OOB data to be the unique values in fethXil YilTHORNnotinLlowastethΘlowastTHORNg4 Calculate the tree VIMP IethXeth jTHORNΘlowastLlowastTHORN using OOB data of step 35 Repeat steps 2 to 4 independently M times Average the VIMP values to obtain θlowasteth jTHORNn 6 Repeat the entire procedure Kgt1 times obtaining θlowasteth jTHORNn1 hellip θ

lowasteth jTHORNnK Estimate VarethIeth jTHORNnMTHORN by the bootstrap sample

variance eth1=KTHORNsumKkfrac141ethθlowasteth jTHORNnk minus

1KsumK

kprimefrac141θlowasteth jTHORNnkprime

THORN2

32 | Subsampling and the delete‐d jackknife

A problem with the 164 bootstrap estimator is that its OOB data set is smaller than a typical OOB estimator Truly OOBdata from a double bootstrap can be less than half the size of OOB data used in a standard VIMP calculation (164

ISHWARAN AND LU 563

versus 368) Thus in a forest of 1000 trees the 164 estimator uses about 164 trees on average to calculate VIMP for acase compared with 368 trees used in a standard calculation This can reduce efficiency of the 164 estimator Anotherproblem is computational expense The 164 estimator requires repeatedly fitting RF to bootstrap data which becomesexpensive as n increases

To avoid these problems we propose a more efficient procedure based on subsampling theory21 The idea rests oncalculating VIMP over small iid subsets of the data Because sampling is without replacement this avoid ties in theOOB data that creates problems for the bootstrap Also because each calculation is fast the procedure is computation-ally efficient especially in big n settings

321 | Subsampling theory

We begin by first reviewing some basic theory of subsampling Let X1hellipXn be iid random values with common distri-

bution P Let θn frac14 θethX1hellipXnTHORN be some estimator for θ(P) an unknown real‐valued parameter we wish to estimateThe bootstrap estimator for the variance of θn is based on the following simple idea Let Pn frac14 eth1=nTHORNsumn

ifrac141δXi be theempirical measure for the data Let Xlowast

1 hellipXlowastn be a bootstrap sample obtained by independently sampling n points from

Pn Because Pn converges to P we should expect the moments of the bootstrap estimator θlowastn frac14 θethXlowast1 hellipX

lowastnTHORN to closely

approximate those of θ In particular we should expect the bootstrap variance VarPnethθlowastnTHORN to closely approximate

VarethθnTHORN This is the rationale for the variance estimator (6) described earlierSubsampling21 employs the same strategy as the bootstrap but is based on sampling without replacement For b=b(n)

such that bnrarr0 let Sb be the entire collection of subsets of 1hellipn of size b For each s=i1hellipibisinSb let θnbs frac14θethXi1 hellipXibTHORN be the estimator evaluated using s The goal is to estimate the sampling distribution of n1=2ethθnminusθethPTHORNTHORN Itturns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions LetQn denotethe distribution of n1=2ethθnminusθethPTHORNTHORN Assume Qn converges weakly to a proper limiting distribution Q

Qn dQ (8)

Then it follows21 that the distribution function for the statistic n1=2ethθnminusθethPTHORNTHORN can be approximated by the subsamplingestimator

ŨnbethxTHORN frac14 1Cb

sumsisinSb

1fb1=2ethθnbsminusθnTHORNlexg (9)

where Cb frac14 nb

is the cardinality of Sb More formally assuming (8) and bnrarr0 for brarrinfin then ŨnbethxTHORNp FethxTHORN frac14

Qfrac12minusinfin x for each x that is a continuity point of the limiting cumulative distribution function F The key to this argumentis to recognize that because of (8) and bnrarr0 Ũnb closely approximates

UnbethxTHORN frac14 1Cb

sumsisinSb

1fb1=2ethθnbsminusθethPTHORNTHORNlexg

which is a U‐statistic25 of order b (see Politis and Romano21 for details)

The ability to approximate the distribution of θn suggests similar to the bootstrap that we can approximatemoments of θn with those from the subsampled estimator in particular we should be able to approximate the varianceUnlike the bootstrap however subsampled statistics are calculated using a sample size b and not n Therefore to esti-mate the variance of θn we must apply a scaling factor to correct for sample size The subsampled estimator for the var-iance is (see Radulović26 and section 331 from Politis and Romano21)

υb frac14 b=nCb

sumsisinSb

θnbsminus1Cb

sumsprimeisinSb

θnbsprime

2

(10)

The estimator (10) is closely related to the delete‐d jackknife22 The delete‐d estimator works on subsets of size r=nminusdand is defined as follows

564 ISHWARAN AND LU

υJethdTHORN frac14 r=dCr

sumsisinSr

ethθnrsminusθnTHORN2

With a little bit or rearrangement this can be rewritten as follows

υJethdTHORN frac14 r=dCr

sumsisinSr

θnrsminus1Cr

sumsprimeisinSr

θnrsprime

2

thorn rd

1Cr

sumsisinSr

θnrsminusθn

2

Setting d=nminusb we obtain

υJethdTHORN frac14 b=ethnminusbTHORNCb

sumsisinSb

θnbsminus1Cb

sumsprimeisinSb

θnbsprime

2

thorn bnminusb

1Cb

sumsisinSb

θnbsminusθn

2

|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflzfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflbias

(11)

The first term closely approximates (10) since bnrarr0 while the second term is a bias estimate of the subsampled estima-tor Thus the delete‐d estimator (11) can be seen to be a bias corrected version of (10) Furthermore this correction isalways upwards because the bias term is squared and always positive

322 | Subsampling and delete‐d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP In the following we assume b is some inte-ger much smaller than n such that bnrarr0

b-subsampling estimator for Var(I^{(j)}_{n,M}):

1. Draw a subsampling set s ∈ S_b. Let L_s be L restricted to s.
2. Calculate I^{(j)}_{n,M}(L_s), the finite forest estimator for VIMP using L_s. Let θ^{(j)}_{n,b,s} denote this value.
3. Repeat K > 1 times, obtaining θ^{(j)}_{n,b,s_1}, …, θ^{(j)}_{n,b,s_K}. Estimate Var(I^{(j)}_{n,M}) by

\[ \frac{b}{nK} \sum_{k=1}^{K} \left( \theta^{(j)}_{n,b,s_k} - \frac{1}{K} \sum_{k'=1}^{K} \theta^{(j)}_{n,b,s_{k'}} \right)^2. \]

The delete-d jackknife estimator is obtained by a slight modification to the above algorithm.

delete-d jackknife estimator (d = n − b) for Var(I^{(j)}_{n,M}):

1. Using the entire learning set L, calculate the forest VIMP estimator I^{(j)}_{n,M}(L). Let θ^{(j)}_n denote this value.
2. Run the b-subsampling estimator, but replace the estimator in step 3 with

\[ \frac{b}{(n-b)K} \sum_{k=1}^{K} \left( \theta^{(j)}_{n,b,s_k} - \theta^{(j)}_n \right)^2. \]

A code sketch of both estimators is given below.
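To make the 2 algorithms concrete, the following minimal R sketch implements both variance estimators for a generic statistic theta (a function of a data frame); the function and argument names are illustrative and are not part of any package. For VIMP, theta would fit a forest to the subsampled data and return the importance of variable j.

```r
# Minimal sketch: b-subsampling and delete-d jackknife variance estimators
# for a generic statistic theta(); all names here are illustrative.
subsample.var <- function(data, theta, b, K = 100) {
  n <- nrow(data)
  # K subsample replicates of theta, each drawn without replacement
  theta.b <- replicate(K, {
    s <- sample(n, size = b, replace = FALSE)
    theta(data[s, , drop = FALSE])
  })
  theta.n <- theta(data)  # estimator on the full learning set
  list(
    # b-subsampling estimator: center at the subsample mean, scale by b/n
    subsample = (b / n) * mean((theta.b - mean(theta.b))^2),
    # delete-d jackknife (d = n - b): center at theta.n, scale by b/(n - b)
    jackknife = (b / (n - b)) * mean((theta.b - theta.n)^2)
  )
}
```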

4 | RANDOM FOREST REGRESSION (RF-R)

4.1 | Simulations

In the following sections (Sections 4, 5, and 6), we evaluate the performance of the .164 bootstrap estimator, the b-subsampling estimator, and the delete-d jackknife variance estimator. We begin by looking at the regression setting. We used the following simulations to assess performance:

1. y = 10 sin(πx1x2) + 20(x3 − 0.5)² + 10x4 + 5x5 + ε, xj ∼ U(0, 1), ε ∼ N(0, 1)
2. y = (x1² + [x2x3 − (x2x4)^{−1}]²)^{1/2} + ε, x1 ∼ U(0, 100), x2 ∼ U(40π, 560π), x3 ∼ U(0, 1), x4 ∼ U(1, 11), ε ∼ N(0, 125²)
3. y = tan^{−1}([x2x3 − (x2x4)^{−1}]/x1) + ε, x1 ∼ U(0, 100), x2 ∼ U(40π, 560π), x3 ∼ U(0, 1), x4 ∼ U(1, 11), ε ∼ N(0, .1²)
4. y = x1x2 + x3² − x4x7 + x8x10 − x6² + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
5. y = 1{x1 > 0} + x2³ + 1{x4 + x6 − x8 − x9 > 1 + x10} + exp(−x2²) + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)


6. y = x1² + 3x2²x3 exp(−|x4|) + x6 − x8 + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
7. y = 1{x1 + x4³ + x9 + sin(x2x8) + ε > 0.38}, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
8. y = log(x1 + x2x3) − exp(x4x5^{−1} − x6) + ε, xj ∼ U(0.5, 1), ε ∼ N(0, .1²)
9. y = x1x2²|x3|^{1/2} + ⌊x4 − x5x6⌋ + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
10. y = x3(x1 + 1)|x2| − (x5²[|x4| + |x5| + |x6|]^{−1})^{1/2} + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
11. y = cos(x1 − x2) + sin^{−1}(x1x3) − tan^{−1}(x2 − x3²) + ε, xj ∼ U(−1, 1), ε ∼ N(0, .1²)
12. y = ε, ε ∼ N(0, 1)

In all 12 simulations, the dimension of the feature space was set to p=20. This was done by adding variables unrelated to y to the design matrix. We call these noise variables. In simulations 1 to 3, noise variables were U(0, 1); for simulations 4 to 11, noise variables were U(−1, 1); and for simulation 12, noise variables were N(0, 1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations.6,27 Simulations 4 to 11 were inspired from COBRA.28 Simulation 12 is a pure noise model.
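For concreteness, here is a minimal R sketch (ours, with illustrative names) of the data generator for simulation 1 under this setup:

```r
# Data generator for simulation 1 (Friedman 1) with p = 20 features;
# x1-x5 carry signal, the remaining 15 are U(0,1) noise variables
sim1 <- function(n = 250, p = 20) {
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x)
}
dat <- sim1()
```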

The sample size was set at n=250. Subsampling was set at a rate of b = n^{1/2}, which in this case is b ≈ 15.8. We can see that, practically speaking, this is a very small sample size, and it allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d = n − b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstrap or subsampled replicates. However, they represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package.29 The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large-scale computing clusters. The package now includes a dedicated function, "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss ℓ(Y, ĥ) = (Y − ĥ)².
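A minimal sketch of this workflow is given below; it reflects our reading of the randomForestSRC documentation, and argument names such as B and subratio may vary across package versions.

```r
library(randomForestSRC)

# Fit a regression forest (dat as generated in the sketch above)
o <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5)

# Subsampling / delete-d jackknife variance estimators for VIMP;
# subratio = b/n, here targeting b = n^(1/2)
smp <- subsample(o, B = 100, subratio = sqrt(nrow(dat)) / nrow(dat))
plot.subsample(smp)  # boxplots with VIMP confidence intervals
```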

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of Var(I^{(j)}_{n,M}). We took the square root of this to obtain an estimate for the standard error of VIMP, (Var(I^{(j)}_{n,M}))^{1/2}. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter Var(I^{(j)}_{n,M}). For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP I^{(j)}_{n,M}. The same sample size of n=250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate Var(I^{(j)}_{n,M}). We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate E[I^{(j)}_{n,M}] = E[I(X^{(j)}, L)]. We call this the true finite VIMP.
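A sketch of this Monte Carlo strategy, reusing the sim1 generator from above, might look as follows (the importance extraction is our assumption about the package interface, and 1000 replicates is computationally expensive):

```r
# Monte Carlo approximation of the true finite standard error and
# true finite VIMP for the first variable in simulation 1
vimp.rep <- replicate(1000, {
  dat <- sim1(n = 250)
  o <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5,
             importance = TRUE)
  o$importance[1]            # finite forest VIMP for the first variable
})
true.se   <- sd(vimp.rep)    # true finite standard error
true.vimp <- mean(vimp.rep)  # true finite VIMP
```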

4.3 | Results

Performance of methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account the signal strength of a variable. In our simulations, there are noisy variables with no signal. There are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by the size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p=20 variables). These 240 variables were stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates the variance. The delete-d estimator

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest regression simulations. In total there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots, labeled "ALL", display performance for all 240 variables simultaneously


does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when the signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate b = n^{3/4} (b ≈ 62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows. Define the 1−α quantile for the subsampling distribution as c_{n,b}(1−α) = inf{x : Ũ_{n,b}(x) ≥ 1−α}. Similarly, define the 1−α quantile for the limiting distribution Q of n^{1/2}(θ_n − θ(P)) as c(1−α) = inf{x : F(x) ≥ 1−α}, where F(x) = Q[−∞, x]. Then, assuming (8) and b/n → 0, the interval

FIGURE 2 Results from random forest regression simulations but with increased subsampling rate b = n^{3/4}. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator


\[ \big[ \theta_n - n^{-1/2} c_{n,b}(1-\alpha), \infty \big) \tag{12} \]

contains θ(P) with asymptotic probability 1−α if c(1−α) is a continuity point of F.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let θ^{(j)}_n = I^{(j)}_{n,M} denote the finite forest estimator for VIMP. We call the limit of θ^{(j)}_n as n, M → ∞ the true VIMP and denote this value by θ^{(j)}_0:

\[ \theta_0^{(j)} = \lim_{n,M \to \infty} \theta_n^{(j)} = \lim_{n,M \to \infty} I_{n,M}^{(j)}. \]

Let υ^{(j)}_n be an estimator for Var(I^{(j)}_{n,M}). Assuming asymptotic normality,

\[ \frac{\theta_n^{(j)} - \theta_0^{(j)}}{\sqrt{\upsilon_n^{(j)}}} \xrightarrow{d} N(0, 1), \tag{13} \]

an asymptotic 100(1−α)% confidence region for θ^{(j)}_0, the true VIMP, can be defined as

\[ \theta_n^{(j)} \pm z_{\alpha/2} \sqrt{\upsilon_n^{(j)}}, \]

where z_{α/2} is the 1−α/2 quantile from a standard normal: Pr{N(0, 1) ≤ z_{α/2}} = 1 − α/2.
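As a sketch, this confidence interval amounts to a few lines of R (names illustrative):

```r
# Asymptotic normal confidence interval for VIMP given a variance
# estimate v.j (eg, from the delete-d jackknife)
ci.vimp <- function(theta.j, v.j, alpha = 0.10) {
  z <- qnorm(1 - alpha / 2)
  c(lower = theta.j - z * sqrt(v.j), upper = theta.j + z * sqrt(v.j))
}
```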

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n=2500. The left-hand side of Figure 3 displays normal quantile plots for standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while on the x-axis are the corresponding N(0, 1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about .02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only .02. Also observe that the overall mean bias is near zero (far right boxplot).
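A minimal R sketch of this check, assuming the vimp.rep replicates from the earlier Monte Carlo sketch, is:

```r
# Standardize the Monte Carlo VIMP replicates and compare with N(0,1)
z <- (vimp.rep - mean(vimp.rep)) / sd(vimp.rep)
qqnorm(z); qqline(z)  # normal quantile plot of standardized VIMP
```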

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate, b = n^{3/4}). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP, in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap, which achieve coverage rates of 97%, which are too high.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest regression simulations. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations


5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package30

2. Two-class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables

3. Same as 2 but with a ρ = .75 exchangeable correlation
4. Same as 2 but with 15 linear variables


5. Same as 2 but with 15 linear variables and a ρ = .75 exchangeable correlation
6. RF-R simulation 6 with y discretized into 2 classes based on its median
7. RF-R simulation 8 with y discretized into 2 classes based on its median
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U(−1, 1) distribution. Experimental parameters were set as in the RF-R simulations: n=250, b = n^{1/2}, n^{3/4}, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used p^{1/2} random features (default setting). The entire procedure was repeated 250 times.

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let Y ∈ {1, …, K} be the response. If 0 ≤ p_k ≤ 1 denotes the predicted probability that Y equals class k, k = 1, …, K, the normalized Brier score is defined as follows:

\[ \mathrm{BS}^* = \frac{100K}{K-1} \sum_{k=1}^{K} \big( 1\{Y = k\} - p_k \big)^2. \]

Note that the normalizing constant 100K/(K−1) used here is different than the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K²/(K−1) because we have noticed that the value of the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
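A minimal R sketch of the normalized Brier score (our own helper, not a package function) is given below; y is assumed to take integer class labels 1, …, K. With p_k = 1/K for all k, it returns 100, matching the random-guessing calibration above.

```r
# Normalized Brier score: 100 under random guessing, 0 for a perfect
# classifier, regardless of the number of classes K
normalized.brier <- function(y, prob) {
  K <- ncol(prob)                          # prob: n x K class probabilities
  y.mat <- outer(y, seq_len(K), "==") * 1  # one-hot encoding of labels
  mean((100 * K / (K - 1)) * rowSums((y.mat - prob)^2))
}
```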

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much stabler the Brier score is. This stability will naturally lend itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for the Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20000 trees, in contrast to the stable behavior of the Brier score


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability

FIGURE 6 Results from random forest classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


for the delete-d jackknife is better than for the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by first defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings, followed by illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is Y = (T, δ), where T = min(T^o, C^o) is the observed survival time and δ = 1{T^o ≤ C^o} is the right-censoring indicator. Here, (T^o, C^o) denote the unobserved event and censoring times. Thus, δ=1 denotes an event, such as death, while δ=0 denotes a right-censored case. The target function h is the conditional survival function h(x) = P{T^o > t | X = x}, where t is some selected time point.

6.2 | Weighted Brier score

Let ĥ be an estimator of h. One method for measuring performance of ĥ is the weighted Brier score,23,24 defined as follows:

\[ \mathrm{wBS}(t) = \big( 1\{T > t\} - \hat{h} \big)^2 w(t, Y, G), \]

where w(t, Y, G) is the weight defined by

\[ w(t, Y, G) = \frac{1\{T \le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)}, \]

and G(t) = P{C^o > t} is the survival function of the censoring variable C^o. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

\[ \ell(Y, \hat{h}) = \big( 1\{T > t\} - \hat{h} \big)^2 w(t, Y, G). \]

This assumes G is a known function, but in practice G must be estimated.23,24 Thus, if Ĝ is an estimator of G, w(t, Y, G) is replaced by w(t, Y, Ĝ).
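The following R sketch computes the weighted Brier score at a single time point t, estimating G by the reverse Kaplan-Meier method via the survival package; this is our own helper with illustrative names, and G(T−) is approximated by G(T) for simplicity.

```r
library(survival)

# Weighted Brier score at time t; surv.pred is the vector of predicted
# survival probabilities P{T > t} for each subject (sketch)
weighted.brier <- function(time, status, surv.pred, t) {
  G <- survfit(Surv(time, 1 - status) ~ 1)  # censoring survival function
  G.fn <- stepfun(G$time, c(1, G$surv))     # step-function version of G
  # event weight uses G(T-) (approximated by G(T)); at-risk weight uses G(t)
  w <- ifelse(time <= t & status == 1, 1 / G.fn(time), 0) +
       ifelse(time > t, 1 / G.fn(t), 0)
  mean((as.numeric(time > t) - surv.pred)^2 * w)
}
```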

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and therefore is very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline


characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as 1 minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values under a b = n^{1/2} subsampling rate (b ≈ 47.2). We preferred the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP which have confidence regions bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let h(t, x) be the hazard function for covariate x at time t. Simulations are as follows:

1. h(t, x) = exp(x1 − 2x4 + 2x5)
2. If x1 ≤ .5, h(t, x) = exp(x2) 1{t ∉ [.5, 2.5]}; if x1 > .5, h(t, x) = exp(x3) 1{t ∉ [2.5, 4.5]}
3. h(t, x) = (1 + z2 t) exp(z1 + z2 t), where z1 = 5x1 and z2 = x4 + x5

In all simulations, covariates were independently sampled from a U(0, 1) distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention


hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
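For illustration, a minimal R sketch of simulation 1 follows; the exponential censoring distribution is our assumption, chosen only to induce independent censoring, as the paper does not specify its exact form.

```r
# Survival simulation 1: constant hazard exp(x1 - 2x4 + 2x5), a Cox model
sim.surv1 <- function(n = 250, p = 10, cens.rate = 0.25) {
  x <- matrix(runif(n * p), n, p)
  To <- rexp(n, rate = exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5]))  # event time
  Co <- rexp(n, rate = cens.rate)  # hypothetical independent censoring time
  data.frame(time = pmin(To, Co), status = as.numeric(To <= Co), x)
}
```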

RSF was fit using randomForestSRC using the same tuning values as in the RF-C simulations (nodesize 5, random feature selection p^{1/2}). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: all p covariate effects are linear.
2. Quadratic model: a subset of the p covariate effects are quadratic.
3. Interaction model: same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of b = n^{1/2} reduces this to O(nM), a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator (a code sketch follows this list).
2. A good default subsampling rate is b = n^{1/2}. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.


3. The α value for the confidence region should be chosen using typical values, such as α = .1 or α = .05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
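As a sketch, guideline 1 reduces to the following selection rule in R (vector names illustrative), flagging a variable as signal when its interval lies above zero:

```r
# Select variables whose 100(1 - alpha)% normal-theory interval for VIMP
# excludes zero; vimp and se are vectors over the p variables
select.vars <- function(vimp, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  which(vimp - z * se > 0)  # lower bound above zero => signal variable
}
```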

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, by way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. The TPR was defined as the fraction of true signal variables identified. The TNR was the fraction of noise variables correctly identified as noise.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = .25, .1, .01, .001. The top panel was calculated under a subsampling rate b = n^{1/2}, while the bottom panel used b = n^{3/4}. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases with decreasing α, while TNR increases. For b = n^{3/4}, TNR rates are slightly better; however, for b = n^{1/2}, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to b = n^{1/2}.

Focusing on the top panel, corresponding to the recommended n^{1/2} subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α = .25. The value α = .25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α = .1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model h(X) = Σ_{j=1}^p h_j(X^{(j)}), the population mean for VIMP equals

\[ E[I(X^{(j)}, \Theta, L)] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2\sigma^2(1 - \rho_j) + o(1), \]

where ρ_j is a correlation coefficient and h_j is the additive expansion of h attributed to X^{(j)}. For noisy variables, h_j = 0 and ρ_j = 1; thus, VIMP will converge to zero. For strong variables, h_j ≠ 0. Our theory suggests that the value of ρ_j will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that h_j changes when X^{(j)} is permuted, thus showing that VIMP correctly isolates the effect of X^{(j)} in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072.

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let θ^{(j)}_n = I(X^{(j)}, L) be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

\[ Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{A1} \]

where (X_i, ε_i) are iid with distribution P such that X_i and ε_i are independent and E(ε_i) = 0, Var(ε_i) = σ² < ∞. Notice (A1) implies that the target function h has an additive expansion h(X) = Σ_{j=1}^p h_j(X^{(j)}). This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss ℓ(Y, ĥ) = (Y − ĥ)², we have

\[ \theta_n^{(j)} = E_\Theta\left[ \frac{1}{N(\Theta)} \sum_{i \in L^{**}(\Theta)} \Big[ \big( Y_i - h(\tilde{X}_i^{(j)}, \Theta, L) \big)^2 - \big( Y_i - h(X_i, \Theta, L) \big)^2 \Big] \right]. \]

We will assume the number of OOB cases N(Θ) is always fixed at Round(n e^{−1}), where Round(·) is the nearest integer function. Write N_n for N(Θ). Because this is a fixed value,

\[ \theta_n^{(j)} = \frac{1}{N_n} E_\Theta\left[ \sum_{i \in L^{**}(\Theta)} \Big[ \big( Y_i - h(\tilde{X}_i^{(j)}, \Theta, L) \big)^2 - \big( Y_i - h(X_i, \Theta, L) \big)^2 \Big] \right]. \]

To study θ^{(j)}_n, we will evaluate its mean θ^{(j)}_{n,0} = E_L[θ^{(j)}_n]. For ease of notation, write h_{n,i} = h(X_i, Θ, L) and h̃_{n,i} = h(X̃_i^{(j)}, Θ, L). Likewise, let h_n = h(X, Θ, L) and h̃_n = h(X̃^{(j)}, Θ, L). We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b = n^{1/2} and b = n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


\[ \theta_{n,0}^{(j)} = \frac{1}{N_n} E\left[ \sum_{i \in L^{**}(\Theta)} \Big[ (Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2 \Big] \right] = E\Big[ (Y - \tilde{h}_n)^2 - (Y - h_n)^2 \Big], \]

where the right-hand side follows because (X, Y, h_n, h̃_n) =_d (X_i, Y_i, h_{n,i}, h̃_{n,i}) if i is OOB (ie, because the tree does not use information about (X_i, Y_i) in its construction, we can replace (X_i, Y_i) with (X, Y)). Now, making use of the representation Y = h(X) + ε, which holds by the assumed regression model (A1), and writing h for h(X) and Δ̃_n = h̃_n − h_n,

\[ \theta_{n,0}^{(j)} = E\big[ (Y - \tilde{h}_n)^2 - (Y - h_n)^2 \big] = E\big[ -2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h) \big] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \tag{A2} \]

where in the last line we have used E(ε) = 0 and that ε is independent of {Δ̃_n, h_n, h}.


We can see that (A2) is driven by the 2 terms Δ̃_n and h_n − h. Define integer values n_i = n_i(Θ) ≥ 0 recording the bootstrap frequency of case i = 1, …, n in L*(Θ) (notice that n_i = 0 implies case i is OOB). By the definition of a RF-R tree, we have

\[ h_n = h(X, \Theta, L) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i, \]

where {W_i(X, Θ)}_1^n are the forest weights, defined as follows:

\[ W_i(X, \Theta) = \frac{n_i \, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|}, \]

where R(X, Θ) is the tree terminal node containing X, and |R(X, Θ)| is the cardinality, equal to the number of bootstrap cases in R(X, Θ). Notice that the weights are convex, since 0 ≤ W_i(X, Θ) ≤ 1 and

\[ \sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i \, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i \, 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} \, 1\{X_{i'} \in R(X, \Theta)\}} = 1. \]

Similarly, we have

\[ \tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i \, 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}. \]

Therefore,

\[ \tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, L) - h(X, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i. \]

To study Δ̃_n in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write R̃ = R(X̃^{(j)}, Θ) and R = R(X, Θ). Then

\[ \tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i \in R} n_i Y_i = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i \in R} n_i \varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}, \]

where i(R̃) and i(R) identify the index of the bootstrap case in R̃ = R(X̃^{(j)}, Θ) and R = R(X, Θ), respectively (note that i(R̃) and i(R) are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line, which therefore will be crucial to understanding VIMP. Notice that if h(X_{i(R̃)}) ≍ h(X̃^{(j)}) and h(X_{i(R)}) ≍ h(X), which is what we would expect asymptotically with a deep tree, then

\[ h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}), \]

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for X^{(j)} is assessed by how much its contribution to the additive expansion, h_j, changes when X^{(j)} is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

\[ h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X), \]

where E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1).

This deals with the first 2 terms in the expansion of Δ̃_n(X). We also need to deal with the remaining term involving the measurement errors ε_{i(R̃)} − ε_{i(R)}. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. {ε_{i(R̃)}, ε_{i(R)}} is a finite exchangeable sequence with variance σ².

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is L2-consistent: E[(h_n − h)²] → 0, where E[h²] < ∞.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

\[ \theta_{n,0}^{(j)} = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2\sigma^2(1 - \rho_j) + o(1), \]

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}).

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when X^{(j)} is a noisy variable. Ideally, this means the tree is split without ever using X^{(j)}. Therefore, if we drop X and X̃^{(j)} down the tree, they will occupy the same terminal node. Hence, R̃ = R and ε_{i(R̃)} = ε_{i(R)}, and therefore ρ_j = 1. Furthermore, because h_j must be zero for a noisy variable, it follows that θ^{(j)}_{n,0} = o(1). Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because h_j ≠ 0 and ρ_j < 1 for strong variables.

Proof. By (A2), we have θ^{(j)}_{n,0} = E[Δ̃_n²] + 2E[Δ̃_n(h_n − h)]. We start by dealing with the second term, E[Δ̃_n(h_n − h)]. By the Cauchy-Schwartz inequality,

\[ E[\tilde{\Delta}_n(h_n - h)] \le E\big[|\tilde{\Delta}_n|\,|h_n - h|\big] \le \sqrt{E[\tilde{\Delta}_n^2]}\,\sqrt{E[(h_n - h)^2]}. \]

By Assumption 4, the right-hand side converges to zero if E[Δ̃_n²] remains bounded. By Assumption 2 and the assumption of an additive model (A1),

\[ \tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}. \]

Assumption 4 implies h (and therefore h_j) is square-integrable. Assumption 3 implies that ε_{i(R̃)}, ε_{i(R)} have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1), deduce that

\[ E[\tilde{\Delta}_n(X)^2] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\big] + E\big[\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)^2\big] + o(1). \]

By exchangeability, E[g(X)ε_{i(R̃)}] = E[g(X)ε_{i(R)}] for any function g(X). Hence,

\[ 0 = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(\tilde{R})}\big] - E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(R)}\big] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\big]. \]

Appealing to exchangeability once more, we have

\[ E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j), \]

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}). Therefore, we have shown

\[ E[\tilde{\Delta}_n(X)^2] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2\sigma^2(1 - \rho_j) + o(1), \]

which verifies boundedness of E[Δ̃_n²] and that θ^{(j)}_{n,0} = E[Δ̃_n²] + o(1). □


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor h_n: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established L2-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

ISHWARAN AND LU 563

versus 368) Thus in a forest of 1000 trees the 164 estimator uses about 164 trees on average to calculate VIMP for acase compared with 368 trees used in a standard calculation This can reduce efficiency of the 164 estimator Anotherproblem is computational expense The 164 estimator requires repeatedly fitting RF to bootstrap data which becomesexpensive as n increases

To avoid these problems we propose a more efficient procedure based on subsampling theory21 The idea rests oncalculating VIMP over small iid subsets of the data Because sampling is without replacement this avoid ties in theOOB data that creates problems for the bootstrap Also because each calculation is fast the procedure is computation-ally efficient especially in big n settings

321 | Subsampling theory

We begin by first reviewing some basic theory of subsampling Let X1hellipXn be iid random values with common distri-

bution P Let θn frac14 θethX1hellipXnTHORN be some estimator for θ(P) an unknown real‐valued parameter we wish to estimateThe bootstrap estimator for the variance of θn is based on the following simple idea Let Pn frac14 eth1=nTHORNsumn

ifrac141δXi be theempirical measure for the data Let Xlowast

1 hellipXlowastn be a bootstrap sample obtained by independently sampling n points from

Pn Because Pn converges to P we should expect the moments of the bootstrap estimator θlowastn frac14 θethXlowast1 hellipX

lowastnTHORN to closely

approximate those of θ In particular we should expect the bootstrap variance VarPnethθlowastnTHORN to closely approximate

VarethθnTHORN This is the rationale for the variance estimator (6) described earlierSubsampling21 employs the same strategy as the bootstrap but is based on sampling without replacement For b=b(n)

such that bnrarr0 let Sb be the entire collection of subsets of 1hellipn of size b For each s=i1hellipibisinSb let θnbs frac14θethXi1 hellipXibTHORN be the estimator evaluated using s The goal is to estimate the sampling distribution of n1=2ethθnminusθethPTHORNTHORN Itturns out that subsampling provides a consistent estimate of this distribution under fairly mild conditions LetQn denotethe distribution of n1=2ethθnminusθethPTHORNTHORN Assume Qn converges weakly to a proper limiting distribution Q

Qn dQ (8)

Then it follows21 that the distribution function for the statistic n1=2ethθnminusθethPTHORNTHORN can be approximated by the subsamplingestimator

ŨnbethxTHORN frac14 1Cb

sumsisinSb

1fb1=2ethθnbsminusθnTHORNlexg (9)

where Cb frac14 nb

is the cardinality of Sb More formally assuming (8) and bnrarr0 for brarrinfin then ŨnbethxTHORNp FethxTHORN frac14

Qfrac12minusinfin x for each x that is a continuity point of the limiting cumulative distribution function F The key to this argumentis to recognize that because of (8) and bnrarr0 Ũnb closely approximates

UnbethxTHORN frac14 1Cb

sumsisinSb

1fb1=2ethθnbsminusθethPTHORNTHORNlexg

which is a U‐statistic25 of order b (see Politis and Romano21 for details)

The ability to approximate the distribution of θn suggests similar to the bootstrap that we can approximatemoments of θn with those from the subsampled estimator in particular we should be able to approximate the varianceUnlike the bootstrap however subsampled statistics are calculated using a sample size b and not n Therefore to esti-mate the variance of θn we must apply a scaling factor to correct for sample size The subsampled estimator for the var-iance is (see Radulović26 and section 331 from Politis and Romano21)

υb frac14 b=nCb

sumsisinSb

θnbsminus1Cb

sumsprimeisinSb

θnbsprime

2

(10)

The estimator (10) is closely related to the delete‐d jackknife22 The delete‐d estimator works on subsets of size r=nminusdand is defined as follows

564 ISHWARAN AND LU

υJethdTHORN frac14 r=dCr

sumsisinSr

ethθnrsminusθnTHORN2

With a little bit or rearrangement this can be rewritten as follows

υJethdTHORN frac14 r=dCr

sumsisinSr

θnrsminus1Cr

sumsprimeisinSr

θnrsprime

2

thorn rd

1Cr

sumsisinSr

θnrsminusθn

2

Setting d=nminusb we obtain

υJethdTHORN frac14 b=ethnminusbTHORNCb

sumsisinSb

θnbsminus1Cb

sumsprimeisinSb

θnbsprime

2

thorn bnminusb

1Cb

sumsisinSb

θnbsminusθn

2

|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflzfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflbias

(11)

The first term closely approximates (10) since bnrarr0 while the second term is a bias estimate of the subsampled estima-tor Thus the delete‐d estimator (11) can be seen to be a bias corrected version of (10) Furthermore this correction isalways upwards because the bias term is squared and always positive

322 | Subsampling and delete‐d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP In the following we assume b is some inte-ger much smaller than n such that bnrarr0

b‐subsampling estimator for VarethIeth jTHORNnMTHORN1 Draw a subsampling set sisinSb Let Ls be L restricted to s2 Calculate Ieth jTHORNnMethLsTHORN the finite forest estimator for VIMP using Ls Let θ

eth jTHORNnbs denote this value

3 Repeat Kgt1 times obtaining θeth jTHORNnbs1hellip θeth jTHORNnbsK

Estimate VarethIeth jTHORNnMTHORN by frac12b=ethnKTHORNsumKkfrac141ethθeth jTHORNnbsk

minus1KsumK

kprimefrac141θeth jTHORNnbskprime

THORN2

The delete‐d jackknife estimator is obtained by a slight modification to the above algorithm

delete‐d jackknife estimator ethd frac14 nminusbTHORN for VarethIeth jTHORNnMTHORN1 Using the entire learning set L calculate the forest VIMP estimator Ieth jTHORNnMethLTHORN Let θeth jTHORNn denote this value2 Run the b‐subsampling estimator but replace the estimator in step 3 with fb=frac12ethnminusbTHORNKgsumK

kfrac141ethθeth jTHORNnbskminusθeth jTHORNn THORN2

4 | RANDOM FOREST REGRESSION RF ‐R

41 | Simulations

In the following sections (Sections 4 5 and 6) we evaluate the performance of the 164 bootstrap estimator the b‐sub-sampling estimator and the delete‐d jackknife variance estimator We begin by looking at the regression setting Weused the following simulations to assess performance

1. $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon$, $x_j\sim U(0,1)$, $\varepsilon\sim N(0,1)$
2. $y = \left(x_1^2 + \left[x_2 x_3 - (x_2 x_4)^{-1}\right]^2\right)^{1/2} + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi, 560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,125^2)$
3. $y = \tan^{-1}\!\left(\left[x_2 x_3 - (x_2 x_4)^{-1}\right]/x_1\right) + \varepsilon$, $x_1\sim U(0,100)$, $x_2\sim U(40\pi, 560\pi)$, $x_3\sim U(0,1)$, $x_4\sim U(1,11)$, $\varepsilon\sim N(0,1^2)$
4. $y = x_1 x_2 + x_3^2 - x_4 x_7 + x_8 x_{10} - x_6^2 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
5. $y = 1\{x_1 > 0\} + x_2^3 + 1\{x_4 + x_6 - x_8 - x_9 > 1 + x_{10}\} + \exp(-x_2^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$


6. $y = x_1^2 + 3x_2^2 x_3 \exp(-|x_4|) + x_6 - x_8 + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
7. $y = 1\{x_1 + x_4^3 + x_9 + \sin(x_2 x_8) + \varepsilon > 0.38\}$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
8. $y = \log(x_1 + x_2 x_3) - \exp(x_4 x_5^{-1} - x_6) + \varepsilon$, $x_j\sim U(0.5,1)$, $\varepsilon\sim N(0,1^2)$
9. $y = x_1 x_2^2 |x_3|^{1/2} + \lfloor x_4 - x_5 x_6\rfloor + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
10. $y = x_3(x_1 + 1)|x_2| - \left(x_5^2\left[|x_4| + |x_5| + |x_6|\right]^{-1}\right)^{1/2} + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
11. $y = \cos(x_1 - x_2) + \sin^{-1}(x_1 x_3) - \tan^{-1}(x_2 - x_3^2) + \varepsilon$, $x_j\sim U(-1,1)$, $\varepsilon\sim N(0,1^2)$
12. $y = \varepsilon$, $\varepsilon\sim N(0,1)$

In all 12 simulations, the dimension of the feature space was set to p=20. This was done by adding variables unrelated to y to the design matrix. We call these noise variables. In simulations 1 to 3, noise variables were U(0,1); for simulations 4 to 11, noise variables were U(-1,1); and for simulation 12, noise variables were N(0,1). All features (strong and noisy) were simulated independently. Simulations 1 to 3 are the well-known Friedman 1, 2, 3 simulations6,27. Simulations 4 to 11 were inspired from COBRA28. Simulation 12 is a pure noise model.

The sample size was set at n=250. Subsampling was set at a rate of $b=n^{1/2}$, which in this case is b=15.8. We can see that, practically speaking, this is a very small sample size, and it allows subsampled VIMP to be rapidly computed. The value of d for the delete-d jackknife was always set to d=n-b. The number of bootstraps was set to 100, and the number of subsamples was set to 100. Note that this is not a large number of bootstraps or subsampled replicates. However, they represent values practitioners are likely to use in practice, especially for big data, due to computational costs.

All RF calculations were implemented using the randomForestSRC R-package29. The package runs in OpenMP parallel processing mode, which allows for parallel processing on user desktops as well as large-scale computing clusters. The package now includes a dedicated function "subsample", which implements the 3 methods studied here. The subsample function was used for all calculations. All RF calculations used 250 trees. Tree nodesize was set to 5, and p/3 random feature selection was used (these are default settings for regression). Each simulation was repeated independently 250 times. Random forest parameters were kept fixed over simulations. All calculations related to prediction error and VIMP were based on squared error loss $\ell(Y,\hat h) = (Y-\hat h)^2$.
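For reference, a call matching the settings above might look as follows. This is a sketch only: dat is a placeholder data frame, and the exact arguments of subsample may differ across package versions (see its help file).

    library(randomForestSRC)
    ## Forest settings used in the RF-R simulations: 250 trees, nodesize 5,
    ## p/3 random feature selection, permutation VIMP under squared error loss.
    o <- rfsrc(y ~ ., data = dat, ntree = 250, nodesize = 5,
               mtry = ceiling(20 / 3), importance = "permute")
    ## subsample() implements the 3 variance estimators studied here;
    ## B is the number of subsampled forests (here 100, as in the text).
    vimp.ci <- subsample(o, B = 100)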

4.2 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of $\mathrm{Var}(I^{(j)}_{n,M})$. We took the square root of this to obtain an estimate for the standard error of VIMP, $(\mathrm{Var}(I^{(j)}_{n,M}))^{1/2}$. To assess performance in estimating the standard error, we used the following strategy to approximate the unknown parameter $\mathrm{Var}(I^{(j)}_{n,M})$. For each simulation model, we drew 1000 independent copies of the data, and for each of these copies, we calculated the finite forest VIMP $I^{(j)}_{n,M}$. The same sample size of n=250 was used, and all forest tuning parameters were kept the same as outlined above. We used the variance of these 1000 values to estimate $\mathrm{Var}(I^{(j)}_{n,M})$. We refer to the square root of this value as the true finite standard error. Additionally, we averaged the 1000 values to estimate $E[I^{(j)}_{n,M}] = E[I(X^{(j)},\mathcal{L})]$. We call this the true finite VIMP.
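A sketch of this Monte Carlo strategy, where sim.fn is a hypothetical generator for one of the 12 simulation models:

    ## Monte Carlo approximation of the true finite SE and true finite VIMP.
    truth <- replicate(1000, {
      dat <- sim.fn(n = 250)                       # independent copy of the data
      rfsrc(y ~ ., dat, ntree = 250, nodesize = 5,
            importance = "permute")$importance     # finite forest VIMP
    })                                             # p x 1000 matrix
    true.se   <- apply(truth, 1, sd)               # true finite standard error
    true.vimp <- rowMeans(truth)                   # true finite VIMP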

4.3 | Results

Performance of methods was assessed by bias and standardized mean-squared error (SMSE). The bias for a method was obtained by averaging its estimated standard error over the 250 replications and taking the difference between this and the true finite standard error. Mean-squared error was estimated by averaging the squared difference between a method's estimated value for the standard error and the true finite standard error. Standardized mean-squared error was defined by dividing MSE by the true finite standard error. In evaluating these performance values, we realized it was important to take into account the signal strength of a variable. In our simulations, there are noisy variables with no signal. There are also variables with strong and moderately strong signal. Therefore, to better understand performance differences, results were stratified by the size of a variable's true finite VIMP. In total, there were 240 variables to be dealt with (12 simulations, each with p=20 variables). These 240 variables were stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable across simulations). Bias and SMSE for the 6 groups are displayed in Figure 1.

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest-regression simulations. In total there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots labeled "ALL" display performance for all 240 variables simultaneously [Colour figure can be viewed at wileyonlinelibrary.com]

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates variance. The delete-d estimator does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate $b=n^{3/4}$ (b=62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals21. The general idea for constructing a confidence interval for a target parameter $\theta(P)$ is as follows. Define the $1-\alpha$ quantile for the subsampling distribution as $c_{n,b}(1-\alpha) = \inf\{x : \tilde U_{n,b}(x)\ge 1-\alpha\}$. Similarly, define the $1-\alpha$ quantile for the limiting distribution $Q$ of $n^{1/2}(\hat\theta_n-\theta(P))$ as $c(1-\alpha) = \inf\{x : F(x) = Q[-\infty,x]\ge 1-\alpha\}$. Then, assuming (8) and $b/n\to 0$, the interval

$$\left[\hat\theta_n - n^{-1/2}c_{n,b}(1-\alpha),\ \infty\right) \qquad (12)$$

contains $\theta(P)$ with asymptotic probability $1-\alpha$ if $c(1-\alpha)$ is a continuity point of $F$.

FIGURE 2 Results from random forest-regression simulations but with increased subsampling rate $b=n^{3/4}$. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator [Colour figure can be viewed at wileyonlinelibrary.com]

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable

solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let $\hat\theta^{(j)}_n = I^{(j)}_{n,M}$ denote the finite forest estimator for VIMP. We call the limit of $\hat\theta^{(j)}_n$ as $n,M\to\infty$ the true VIMP and denote this value by $\theta^{(j)}_0$:

$$\theta^{(j)}_0 = \lim_{n,M\to\infty}\hat\theta^{(j)}_n = \lim_{n,M\to\infty} I^{(j)}_{n,M}.$$

Let $\hat v^{(j)}_n$ be an estimator for $\mathrm{Var}(I^{(j)}_{n,M})$. Assuming asymptotic normality,

$$\frac{\hat\theta^{(j)}_n - \theta^{(j)}_0}{\sqrt{\hat v^{(j)}_n}} \xrightarrow{d} N(0,1), \qquad (13)$$

an asymptotic $100(1-\alpha)\%$ confidence region for $\theta^{(j)}_0$, the true VIMP, can be defined as

$$\hat\theta^{(j)}_n \pm z_{\alpha/2}\sqrt{\hat v^{(j)}_n},$$

where $z_{\alpha/2}$ is the $1-\alpha/2$ quantile from a standard normal: $\Pr\{N(0,1)\le z_{\alpha/2}\} = 1-\alpha/2$.
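In code, the interval is immediate once a VIMP point estimate and a variance estimate from Section 3.2.2 are in hand. A sketch, where vimp.hat and v.hat denote length-p vectors of VIMP estimates and estimated variances:

    alpha <- 0.10                            # 90% confidence region
    z     <- qnorm(1 - alpha / 2)            # 1 - alpha/2 standard normal quantile
    lower <- vimp.hat - z * sqrt(v.hat)
    upper <- vimp.hat + z * sqrt(v.hat)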

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n=2500. The left-hand side of Figure 3 displays normal quantile plots of standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while on the x-axis are the corresponding N(0,1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only a value of 0.02. Also observe that overall mean bias is near zero (far right boxplot).
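A sketch of this standardization and quantile-bias calculation, assuming vimp.mc is a hypothetical 1000 x p matrix of finite forest VIMP values (one row per Monte Carlo replicate):

    z.vimp <- scale(vimp.mc)                 # center by mean, divide by sd, per column
    probs  <- c(.05, .10, .25, .50, .75, .90, .95)
    ## quantile bias: standardized VIMP quantiles minus N(0,1) quantiles
    q.bias <- apply(z.vimp, 2, quantile, probs = probs) - qnorm(probs)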

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate $b=n^{3/4}$). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP, in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap, which achieve coverage rates of 97%, which are too high.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest-regression simulations. Left- and right-hand side figures based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values [Colour figure can be viewed at wileyonlinelibrary.com]

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest-regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations.


5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package30.
2. Two-class simulation using twoClassSim from the caret R-package31, with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ=.75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.


5. Same as 2, but with 15 linear variables and a ρ=.75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U[-1,1] distribution. Experimental parameters were set as in the RF-R simulations: n=250, $b=n^{1/2}, n^{3/4}$, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used $p^{1/2}$ random features (the default setting). The entire procedure was repeated 250 times.

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let $Y\in\{1,\ldots,K\}$ be the response. If $0\le \hat p_k\le 1$ denotes the predicted probability that Y equals class k, $k=1,\ldots,K$, the normalized Brier score is defined as follows:

$$\mathrm{BS}^* = \frac{100\,K}{K-1}\sum_{k=1}^{K}\left(1\{Y=k\} - \hat p_k\right)^2.$$

Note that the normalizing constant 100K/(K-1) used here is different than the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K²/(K-1) because we have noticed that the value of the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
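A sketch of the score as an R helper of our own, where y is a factor of observed classes and prob is an n x K matrix of predicted class probabilities with columns ordered as levels(y):

    ## Normalized Brier score: 100 = random guessing for any K, 0 = perfect.
    norm.brier <- function(y, prob) {
      K <- nlevels(y)
      Y <- diag(K)[as.integer(y), , drop = FALSE]   # one-hot indicators 1{Y = k}
      mean((100 * K / (K - 1)) * rowSums((Y - prob)^2))
    }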

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much stabler the Brier score is. This stability will naturally lend itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20,000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20,000 trees, in contrast to the stable behavior of the Brier score.


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

FIGURE 6 Results from random forest-classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrary.com]

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by first defining the survival framework using the notation of Section 2. Following this, we discuss 2 different methods that can be used for measuring prediction error in survival settings, and then turn to illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y=(T,\delta)$, where $T=\min(T^o,C^o)$ is the observed survival time and $\delta = 1\{T^o\le C^o\}$ is the right-censoring indicator. Here, $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, δ=1 denotes an event such as death, while δ=0 denotes a right-censored case. The target function h is the conditional survival function $h(x) = P\{T^o > t\mid X=x\}$, where t is some selected time point.

6.2 | Weighted Brier score

Let $\hat h$ be an estimator of h. One method for measuring performance of $\hat h$ is the weighted Brier score23,24, defined as follows:

$$\mathrm{wBS}(t) = \left(1\{T > t\} - \hat h\right)^2 w(t,Y,G),$$

where $w(t,Y,G)$ is the weight defined by

$$w(t,Y,G) = \frac{1\{T\le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)},$$

and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

$$\ell(Y,\hat h) = \left(1\{T > t\} - \hat h\right)^2 w(t,Y,G).$$

This assumes G is a known function, but in practice, G must be estimated23,24. Thus, if $\hat G$ is an estimator of G, $w(t,Y,G)$ is replaced by $w(t,Y,\hat G)$.
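As an illustration, the following sketch estimates G by reverse Kaplan-Meier using the survival package and evaluates the weighted Brier score at a time point t. The helper is our own, and for simplicity it uses Ĝ(T) in place of the left limit Ĝ(T-).

    library(survival)

    ## Weighted Brier score at time t with IPCW weights; surv.pred is the
    ## predicted survival probability h-hat(x) = P(T > t | X = x) per case.
    weighted.brier <- function(time, status, surv.pred, t) {
      ## reverse Kaplan-Meier: treat censoring (status == 0) as the event
      G.fit <- survfit(Surv(time, 1 - status) ~ 1)
      G <- stepfun(G.fit$time, c(1, G.fit$surv))      # G-hat(u) = P(C > u)
      w <- ifelse(time <= t & status == 1, 1 / G(time),  # observed early death
           ifelse(time > t, 1 / G(t), 0))                # still at risk at t
      mean(w * ((time > t) - surv.pred)^2)               # weighted Brier at t
    }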

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor $\hat h$ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and therefore is very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; this does not pose any problems to the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated2.
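For concreteness, here is a bare-bones O(n²) sketch of the calculation over usable pairs (our own helper; the quadratic pair loop is exactly what makes the concordance index expensive, a point we return to in Section 7.2):

    ## Harrell's concordance index: fraction of usable pairs whose predicted
    ## risk ordering agrees with the observed survival ordering.
    cindex <- function(time, status, risk) {
      num <- den <- 0
      n <- length(time)
      for (i in 1:(n - 1)) for (j in (i + 1):n) {
        if (time[i] == time[j]) next                 # skip tied times
        s <- if (time[i] < time[j]) i else j         # shorter follow-up
        l <- if (time[i] < time[j]) j else i
        if (status[s] == 0) next                     # censored first: unusable pair
        den <- den + 1
        if (risk[s] > risk[l]) num <- num + 1        # higher risk died first
        else if (risk[s] == risk[l]) num <- num + 0.5
      }
      num / den
    }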

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al33.

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index2. The delete-d jackknife estimator was calculated using 1000 subsampled values using a $b=n^{1/2}$ subsampling rate (b=47.2). We preferred to use the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP, which have confidence regions bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures34. Let h(t,x) be the hazard function for covariate x at time t. Simulations are as follows:

1. $h(t,x) = \exp(x_1 - 2x_4 + 2x_5)$
2. If $x_1\le .5$, $h(t,x) = \exp(x_2)\,1\{t\notin[.5,\,2.5]\}$; if $x_1 > .5$, $h(t,x) = \exp(x_3)\,1\{t\notin[2.5,\,4.5]\}$
3. $h(t,x) = (1 + z_2 t)\exp(z_1 + z_2 t)$, where $z_1 = .5x_1$ and $z_2 = x_4 + x_5$

In all simulations, covariates were independently sampled from a U[0,1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention [Colour figure can be viewed at wileyonlinelibrary.com]

RSF was fit using randomForestSRC using the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in the previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrary.com]


6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al3. Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from Section 6.1 of Ishwaran et al3:

1. Linear model. All p covariate effects are linear.
2. Quadratic model. A subset of the p covariate effects are quadratic.
3. Interaction model. Same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index3. Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife22, to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b=n^{1/2}$ reduces this to O(nM), a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting:

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b=n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.
3. The α value for the confidence region should be chosen using typical values such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1-α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noisy variables identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α=.25, .1, .01, .001. The top panel was calculated under a subsampling rate $b=n^{1/2}$, while the bottom panel used $b=n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases with increasing α; TNR increases with α. For $b=n^{3/4}$, TNR rates are slightly better; however, for $b=n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b=n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α=.1, .05, .01 perform very well under both TNR and TPR.
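One plausible rendering of this selection rule in code (a sketch; vimp.hat and v.jack denote the VIMP point estimates and delete-d jackknife variance estimates for the p variables):

    alpha    <- 0.10                                 # guideline default
    z        <- qnorm(1 - alpha / 2)
    ## keep variables whose CI lower bound is above zero
    selected <- which(vimp.hat - z * sqrt(v.jack) > 0)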

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1-α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$.


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)},\Theta,\mathcal{L})] = E\left[\left(h_j(\tilde X^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of h attributed to $X^{(j)}$. For noisy variables, $h_j=0$ and $\rho_j=1$; thus, VIMP will converge to zero. For strong variables, $h_j\ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al35.

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor. Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node). Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1; 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Statistics Department, University of California, Berkeley, CA 94708; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures; Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values.


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index3. Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta^{(j)}_n = I(X^{(j)},\mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j(X^{(j)}_i) + \varepsilon_i, \qquad i=1,\ldots,n, \qquad (A1)$$

where $(X_i,\varepsilon_i)$ are iid with distribution P such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i)=0$, $\mathrm{Var}(\varepsilon_i)=\sigma^2<\infty$. Notice (A1) implies that the target function h has an additive expansion $h(X)=\sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models35, which we will use later in establishing our results. Assuming squared error loss $\ell(Y,\hat h)=(Y-\hat h)^2$, we have

$$\theta^{(j)}_n = E_\Theta\!\left[\frac{1}{N(\Theta)}\sum_{i\in\mathcal{L}^{**}(\Theta)}\left[\left(Y_i - h(\tilde X^{(j)}_i,\Theta,\mathcal{L})\right)^2 - \left(Y_i - h(X_i,\Theta,\mathcal{L})\right)^2\right]\right].$$

We will assume the number of OOB cases N(Θ) is always fixed at Round($ne^{-1}$), where Round(·) is the nearest integer function. Write $N_n$ for N(Θ). Because this is a fixed value,

$$\theta^{(j)}_n = \frac{1}{N_n} E_\Theta\!\left[\sum_{i\in\mathcal{L}^{**}(\Theta)}\left[\left(Y_i - h(\tilde X^{(j)}_i,\Theta,\mathcal{L})\right)^2 - \left(Y_i - h(X_i,\Theta,\mathcal{L})\right)^2\right]\right].$$

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrary.com]

To study $\theta^{(j)}_n$, we will evaluate its mean, $\theta^{(j)}_{n,0} = E_{\mathcal{L}}[\theta^{(j)}_n]$. For ease of notation, write $h_{n,i} = h(X_i,\Theta,\mathcal{L})$ and $\tilde h_{n,i} = h(\tilde X^{(j)}_i,\Theta,\mathcal{L})$. Likewise, let $h_n = h(X,\Theta,\mathcal{L})$ and $\tilde h_n = h(\tilde X^{(j)},\Theta,\mathcal{L})$. We have

$$\theta^{(j)}_{n,0} = \frac{1}{N_n} E\!\left[\sum_{i\in\mathcal{L}^{**}(\Theta)}\left[\left(Y_i-\tilde h_{n,i}\right)^2 - \left(Y_i-h_{n,i}\right)^2\right]\right] = E\left[\left(Y-\tilde h_n\right)^2 - \left(Y-h_n\right)^2\right],$$

where the right-hand side follows because $(X,Y,h_n,\tilde h_n) \stackrel{d}{=} (X_i,Y_i,h_{n,i},\tilde h_{n,i})$ if i is OOB (ie, because the tree does not use information about $(X_i,Y_i)$ in its construction, we can replace $(X_i,Y_i)$ with $(X,Y)$). Now, making use of the representation $Y=h(X)+\varepsilon$, which holds by the assumed regression model (A1), and writing h for h(X) and $\tilde\Delta_n = \tilde h_n - h_n$,

$$\theta^{(j)}_{n,0} = E\left[(Y-\tilde h_n)^2-(Y-h_n)^2\right] = E\left[-2\varepsilon\tilde\Delta_n + \tilde\Delta_n^2 + 2\tilde\Delta_n(h_n-h)\right] = E[\tilde\Delta_n^2] + 2E[\tilde\Delta_n(h_n-h)], \qquad (A2)$$

where in the last line we have used $E(\varepsilon)=0$ and that ε is independent of $\{\tilde\Delta_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde\Delta_n$ and $h_n-h$. Define integer values $n_i=n_i(\Theta)\ge 0$ recording the bootstrap frequency of case $i=1,\ldots,n$ in $\mathcal{L}^{*}(\Theta)$ (notice that $n_i=0$ implies case i is OOB). By the definition of a RF-R tree, we have

$$h_n = h(X,\Theta,\mathcal{L}) = \sum_{i=1}^{n} W_i(X,\Theta)Y_i,$$

where $\{W_i(X,\Theta)\}_{1}^{n}$ are the forest weights defined as follows:

$$W_i(X,\Theta) = \frac{n_i\,1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|},$$

where R(X,Θ) is the tree terminal node containing X, and |R(X,Θ)| is the cardinality equal to the number of bootstrap cases in R(X,Θ). Notice that the weights are convex, since $0\le W_i(X,\Theta)\le 1$ and

$$\sum_{i=1}^{n} W_i(X,\Theta) = \sum_{i=1}^{n}\frac{n_i\,1\{X_i\in R(X,\Theta)\}}{|R(X,\Theta)|} = \sum_{i=1}^{n}\frac{n_i\,1\{X_i\in R(X,\Theta)\}}{\sum_{i'=1}^{n} n_{i'}\,1\{X_{i'}\in R(X,\Theta)\}} = 1.$$

Similarly, we have

$$\tilde h_n = h(\tilde X^{(j)},\Theta,\mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde X^{(j)},\Theta)Y_i, \quad\text{where}\quad W_i(\tilde X^{(j)},\Theta) = \frac{n_i\,1\{X_i\in R(\tilde X^{(j)},\Theta)\}}{|R(\tilde X^{(j)},\Theta)|}.$$

Therefore,

$$\tilde\Delta_n(X) = h(\tilde X^{(j)},\Theta,\mathcal{L}) - h(X,\Theta,\mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde X^{(j)},\Theta)Y_i - \sum_{i=1}^{n} W_i(X,\Theta)Y_i.$$

To study $\tilde\Delta_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde R = R(\tilde X^{(j)},\Theta)$ and $R = R(X,\Theta)$. Then

$$\tilde\Delta_n(X) = \frac{1}{|\tilde R|}\sum_{i\in\tilde R} n_i Y_i - \frac{1}{|R|}\sum_{i\in R} n_i Y_i = \left[\frac{1}{|\tilde R|}\sum_{i\in\tilde R} n_i h(X_i) - \frac{1}{|R|}\sum_{i\in R} n_i h(X_i)\right] + \left[\frac{1}{|\tilde R|}\sum_{i\in\tilde R} n_i\varepsilon_i - \frac{1}{|R|}\sum_{i\in R} n_i\varepsilon_i\right] = h(X_{i(\tilde R)}) - h(X_{i(R)}) + \varepsilon_{i(\tilde R)} - \varepsilon_{i(R)},$$

where $i(\tilde R)$ and $i(R)$ identify the index for the bootstrap case in $\tilde R = R(\tilde X^{(j)},\Theta)$ and $R=R(X,\Theta)$, respectively (note that $i(\tilde R)$ and $i(R)$ are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde R)})\asymp h(\tilde X^{(j)})$ and $h(X_{i(R)})\asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde R)}) - h(X_{i(R)}) \asymp h(\tilde X^{(j)}) - h(X) = h_j(\tilde X^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion $h_j$ changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde X^{(j)}) = h(X_{i(\tilde R)}) + \tilde\zeta_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde\zeta_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde\Delta_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde R)} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\{\varepsilon_{i(\tilde R)}, \varepsilon_{i(R)}\}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is L2-consistent: $E[(h_n-h)^2]\to 0$, where $E[h^2]<\infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta^{(j)}_{n,0} = E\left[\left(h_j(\tilde X^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1-\rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde R)}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop X and $\tilde X^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde R = R$ and $\varepsilon_{i(\tilde R)} = \varepsilon_{i(R)}$, and therefore $\rho_j=1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta^{(j)}_{n,0} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j\ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta^{(j)}_{n,0} = E[\tilde\Delta_n^2] + 2E[\tilde\Delta_n(h_n-h)]$. We start by dealing with the second term, $E[\tilde\Delta_n(h_n-h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde\Delta_n(h_n-h)] \le E\left[|\tilde\Delta_n|\,|h_n-h|\right] \le \sqrt{E[\tilde\Delta_n^2]}\,\sqrt{E[(h_n-h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde\Delta_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde\Delta_n(X) = h_j(\tilde X^{(j)}) - h_j(X^{(j)}) - \tilde\zeta_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde R)} - \varepsilon_{i(R)}.$$

Assumption 4 implies h (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde R)}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde\zeta_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde\Delta_n(X)^2] = E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)^2\right] + 2E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)\left(\varepsilon_{i(\tilde R)}-\varepsilon_{i(R)}\right)\right] + E\left[\left(\varepsilon_{i(\tilde R)}-\varepsilon_{i(R)}\right)^2\right] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde R)}] = E[g(X)\varepsilon_{i(R)}]$ for any function g(X). Hence,

$$0 = E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)\varepsilon_{i(\tilde R)}\right] - E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)\varepsilon_{i(R)}\right] = E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)\left(\varepsilon_{i(\tilde R)}-\varepsilon_{i(R)}\right)\right].$$

Appealing to exchangeability once more, we have

$$E\left[\left(\varepsilon_{i(\tilde R)}-\varepsilon_{i(R)}\right)^2\right] = E[\varepsilon^2_{i(\tilde R)}] + E[\varepsilon^2_{i(R)}] - 2E[\varepsilon_{i(\tilde R)}\varepsilon_{i(R)}] = 2\sigma^2(1-\rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde R)}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde\Delta_n(X)^2] = E\left[\left(h_j(\tilde X^{(j)})-h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1-\rho_j) + o(1),$$

which verifies boundedness of $E[\tilde\Delta_n^2]$ and that $\theta^{(j)}_{n,0} = E[\tilde\Delta_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established L2-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

564 ISHWARAN AND LU

υJethdTHORN frac14 r=dCr

sumsisinSr

ethθnrsminusθnTHORN2

With a little bit or rearrangement this can be rewritten as follows

υJethdTHORN frac14 r=dCr

sumsisinSr

θnrsminus1Cr

sumsprimeisinSr

θnrsprime

2

thorn rd

1Cr

sumsisinSr

θnrsminusθn

2

Setting d=nminusb we obtain

υJethdTHORN frac14 b=ethnminusbTHORNCb

sumsisinSb

θnbsminus1Cb

sumsprimeisinSb

θnbsprime

2

thorn bnminusb

1Cb

sumsisinSb

θnbsminusθn

2

|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflzfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflbias

(11)

The first term closely approximates (10) since bnrarr0 while the second term is a bias estimate of the subsampled estima-tor Thus the delete‐d estimator (11) can be seen to be a bias corrected version of (10) Furthermore this correction isalways upwards because the bias term is squared and always positive

322 | Subsampling and delete‐d jackknife algorithms

We can now describe our subsampling estimator for the variance of VIMP In the following we assume b is some inte-ger much smaller than n such that bnrarr0

b‐subsampling estimator for VarethIeth jTHORNnMTHORN1 Draw a subsampling set sisinSb Let Ls be L restricted to s2 Calculate Ieth jTHORNnMethLsTHORN the finite forest estimator for VIMP using Ls Let θ

eth jTHORNnbs denote this value

3 Repeat Kgt1 times obtaining θeth jTHORNnbs1hellip θeth jTHORNnbsK

Estimate VarethIeth jTHORNnMTHORN by frac12b=ethnKTHORNsumKkfrac141ethθeth jTHORNnbsk

minus1KsumK

kprimefrac141θeth jTHORNnbskprime

THORN2

The delete‐d jackknife estimator is obtained by a slight modification to the above algorithm

delete‐d jackknife estimator ethd frac14 nminusbTHORN for VarethIeth jTHORNnMTHORN1 Using the entire learning set L calculate the forest VIMP estimator Ieth jTHORNnMethLTHORN Let θeth jTHORNn denote this value2 Run the b‐subsampling estimator but replace the estimator in step 3 with fb=frac12ethnminusbTHORNKgsumK

kfrac141ethθeth jTHORNnbskminusθeth jTHORNn THORN2

4 | RANDOM FOREST REGRESSION RF ‐R

41 | Simulations

In the following sections (Sections 4 5 and 6) we evaluate the performance of the 164 bootstrap estimator the b‐sub-sampling estimator and the delete‐d jackknife variance estimator We begin by looking at the regression setting Weused the following simulations to assess performance

1 y frac14 10 sinethπx1x2THORN thorn 20ethx3minus05THORN2 thorn 10x4 thorn 5x5 thorn ε xjsimU(0 1) εsimN(0 1)2 y frac14 ethx21 thorn frac12x2x3minusethx2x4THORNminus12THORN1=2 thorn ε x1simU(0 100) x2simU(40π 560π) x3simU(0 1) x4simU(1 11) εsimN(0 1252)3 y frac14 tanminus1ethfrac12x2x3minusethx2x4THORNminus1=x1THORN thorn ε x1simU(0 100) x2simU(40π 560π) x3simU(0 1) x4simU(1 11) εsimN(0 12)4 y frac14 x1x2 thorn x23minusx4x7 thorn x8x10minusx26 thorn ε xjsimU(minus1 1) εsimN(0 12)5 y frac14 1fx1gt0g thorn x32 thorn 1fx4 thorn x6minusx8minusx9gt1thorn x10g thorn expethminusx22THORN thorn ε xjsimU(minus1 1) εsimN(0 12)

ISHWARAN AND LU 565

6 y frac14 x21 thorn 3x22x3expethminusjx4jTHORN thorn x6minusx8 thorn ε xjsimU(minus1 1) εsimN(0 12)7 y frac14 1fx1 thorn x34 thorn x9 thorn sinethx2x8THORN thorn εgt038g xjsimU(minus1 1) εsimN(0 12)8 y frac14 logethx1 thorn x2x3THORNminusexpethx4xminus15 minusx6THORN thorn ε xjsimU(05 1) εsimN(0 12)9 y frac14 x1x22jx3j1=2 thorn lfloorx4minusx5x6rfloorthorn ε xjsimU(minus1 1) εsimN(0 12)10 y frac14 x3ethx1 thorn 1THORNjx2jminusethx25frac12jx4j thorn jx5j thorn jx6jminus1THORN1=2 thorn ε xjsimU(minus1 1) εsimN(0 12)11 y frac14 cosethx1minusx2THORN thorn sinminus1ethx1x3THORNminustanminus1ethx2minusx23THORN thorn ε xjsimU(minus1 1) εsimN(0 12)12 y= ε εsimN(0 1)

In all 12 simulations the dimension of the feature space was set to p=20 This was done by adding variables unre-lated to y to the design matrix We call these noise variables In simulations 1 to 3 noise variables were U(0 1) for sim-ulations 4 to 11 noise variables were U(minus1 1) and for simulation 12 noise variables were N(0 1) All features (strongand noisy) were simulated independently Simulations 1 to 3 are the well‐known Friedman 1 2 3 simulations627 Sim-ulations 4 to 11 were inspired from COBRA28 Simulation 12 is a pure noise model

The sample size was set at n=250 Subsampling was set at a rate of b=n12 which in this case is b=158 We cansee that practically speaking this is a very small sample size and allows subsampled VIMP to be rapidly computedThe value of d for the delete‐d jackknife was always set to d=nminusb The number of bootstraps was set to 100 and thenumber of subsamples was set to 100 Note that this is not a large number of bootstraps or subsampled replicatesHowever they represent values practitioners are likely to use in practice especially for big data due tocomputational costs

All RF calculations were implemented using the randomForestSRC R‐package29 The package runs in OpenMPparallel processing mode which allows for parallel processing on user desktops as well as large scale computing clus-ters The package now includes a dedicated function ldquosubsamplerdquo which implements the 3 methods studied here Thesubsample function was used for all calculations All RF calculations used 250 trees Tree nodesize was set to 5 and p3 random feature selection used (these are default settings for regression) Each simulation was repeated independently250 times Random forest parameters were kept fixed over simulations All calculations related to prediction error and

VIMP were based on squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2

42 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of the VarethIeth jTHORNnMTHORN We took the square root of this to obtain an estimate for the stan-

dard error of VIMP ethVarethIeth jTHORNnMTHORNTHORN1=2 To assess performance in estimating the standard error we used the following strat-egy to approximate the unknown parameter VarethIeth jTHORNnMTHORN For each simulation model we drew 1000 independent copies ofthe data and for each of these copies we calculated the finite forest VIMP Ieth jTHORNnM The same sample size of n=250 wasused and all forest tuning parameters were kept the same as outlined above We used the variance of these 1000 valuesto estimate VarethIeth jTHORNnMTHORN We refer to the square root of this value as the true finite standard error Additionally we aver-aged the 1000 values to estimate Efrac12Ieth jTHORNnM frac14 Efrac12IethXeth jTHORNLTHORN We call this the true finite VIMP

43 | Results

Performance of methods was assessed by bias and standardized mean‐squared error (SMSE) The bias for a method wasobtained by averaging its estimated standard error over the 250 replications and taking the difference between this andthe true finite standard error Mean‐squared error was estimated by averaging the squared difference between amethods estimated value for the standard error and the true finite standard error Standardized mean‐squared errorwas defined by dividing MSE by the true finite standard error In evaluating these performance values we realized itwas important to take into account signal strength of a variable In our simulations there are noisy variables with nosignal There are also variables with strong and moderately strong signal Therefore to better understand performancedifferences results were stratified by size of a variables true finite VIMP In total there were 240 variables to be dealtwith (12 simulations each with p=20 variables) These 240 variables were stratified into 6 groups based on 10th 25th50th 75th and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable acrosssimulations) Bias and SMSE for the 6 groups are displayed in Figure 1

All methods exhibit low bias for small VIMP. As VIMP increases, corresponding to stronger variables, bias for the subsampling estimator increases. Its bias is negative, showing that it underestimates the variance. The delete-d estimator does much better. This is due to the bias correction factor discussed earlier (see (11)), which kicks in when signal increases. The pattern seen for bias is reflected in the results for SMSE: the delete-d is similar to the subsampling estimator, except for large VIMP, where it does better. Overall, the .164 estimator is the best of all 3 methods. On the other hand, it is hundreds of times slower.

FIGURE 1 Bias and standardized mean-squared error (SMSE) performance for estimating VIMP standard error from random forest-regression simulations. In total, there are 240 variables (12 simulations, p=20 variables in each simulation). These 240 variables have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. Extreme right boxplots labeled "ALL" display performance for all 240 variables simultaneously.

This shows that the delete-d estimator should be used when bias is an issue. However, bias of the subsampling estimator can be improved by increasing the subsampling rate. Figure 2 reports the results from the same set of simulations but using an increased sampling rate of b=n^{3/4} (b≈62.8). Both estimators improve overall, but note the improvement in bias and SMSE for the subsampling estimator relative to the delete-d estimator. Also notice that both estimators now outperform the .164 bootstrap.

4.4 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence intervals.21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows. Define the 1−α quantile for the subsampling distribution as

$$c_{n,b}(1-\alpha) = \inf\{x : \tilde{U}_{n,b}(x) \ge 1-\alpha\}.$$

Similarly, define the 1−α quantile for the limiting distribution Q of n^{1/2}(θ_n − θ(P)) as

$$c(1-\alpha) = \inf\{x : F(x) = Q[-\infty, x] \ge 1-\alpha\}.$$

Then, assuming (8) and b/n → 0, the interval

$$[\theta_n - n^{-1/2} c_{n,b}(1-\alpha), \infty) \tag{12}$$

contains θ(P) with asymptotic probability 1−α if c(1−α) is a continuity point of F.

FIGURE 2 Results from random forest-regression simulations but with increased subsampling rate b=n^{3/4}. Notice the improvement in bias and standardized mean-squared error (SMSE) for the subsampling estimator.

While (12) can be used to calculate a nonparametric confidence interval for VIMP, we have found that a more stable solution can be obtained if we are willing to strengthen our assumptions to include asymptotic normality. Let θ_n^{(j)} = I^{(j)}_{n,M} denote the finite forest estimator for VIMP. We call the limit of θ_n^{(j)} as n, M → ∞ the true VIMP, and denote this value by θ_0^{(j)}:

$$\theta_0^{(j)} = \lim_{n,M\to\infty} \theta_n^{(j)} = \lim_{n,M\to\infty} I^{(j)}_{n,M}.$$

Let υ_n^{(j)} be an estimator for Var(I^{(j)}_{n,M}). Assuming asymptotic normality,

$$\frac{\theta_n^{(j)} - \theta_0^{(j)}}{\sqrt{\upsilon_n^{(j)}}} \xrightarrow{d} N(0, 1), \tag{13}$$

an asymptotic 100(1−α)% confidence region for θ_0^{(j)}, the true VIMP, can be defined as

$$\theta_n^{(j)} \pm z_{\alpha/2} \sqrt{\upsilon_n^{(j)}},$$

where z_{α/2} is the 1−α/2 quantile from a standard normal: Pr{N(0, 1) ≤ z_{α/2}} = 1−α/2.
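A direct implementation of this interval under assumption (13) is shown below; theta_j and v_j stand for the finite forest VIMP and its variance estimate for variable j (hypothetical inputs, eg, taken from the subsampling or delete-d jackknife output).

```r
## Asymptotic normal confidence interval for the true VIMP of variable j.
## alpha = .10 gives a 90% interval.
ci_vimp <- function(theta_j, v_j, alpha = 0.10) {
  z <- qnorm(1 - alpha / 2)
  c(lower = theta_j - z * sqrt(v_j),
    upper = theta_j + z * sqrt(v_j))
}
```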

4.5 | Justification of normality

To provide justification for assumption (13), we reran our previous simulations 1000 times independently. For each Monte Carlo replication and each simulation model, we calculated the finite forest VIMP for a variable, centered it by its mean value from the 1000 simulations, and divided this centered value by the standard deviation of the 1000 VIMP values. All experimental parameters were held at the same values as before, except for the sample size, which was increased to n=2500. The left-hand side of Figure 3 displays normal quantile plots of standardized VIMP for each of the 12 simulation models. On the y-axis are the quantiles for the standardized VIMP, while the x-axis shows the corresponding N(0, 1) quantiles. The right-hand side displays quantile bias, defined as the difference between the quantile for standardized VIMP and the quantile for a standard normal. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values. The results generally confirm that (13) holds. Deviations from normality occur primarily in the tails, but these are reasonable and expected in finite sample settings. For example, the median bias is about 0.02 for the 95th percentile. Thus, the standardized VIMP quantile differs from the true standard normal value of 1.645 by only a value of 0.02. Also observe that overall mean bias is near zero (far right boxplot).
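This check is easy to reproduce for any model. A sketch, assuming vimp_reps holds the 1000 Monte Carlo VIMP values for one variable (generated as in Section 4.2 but with n=2500), is given below.

```r
## Standardize the Monte Carlo VIMP values and compare with N(0, 1).
z <- (vimp_reps - mean(vimp_reps)) / sd(vimp_reps)
qqnorm(z); qqline(z)            # normal quantile plot (Figure 3, left)

## Quantile bias at the percentiles reported in Figure 3 (right).
p <- c(.05, .10, .25, .50, .75, .90, .95)
quantile(z, p) - qnorm(p)
```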

Coverage probabilities for 90% confidence intervals are provided in Figure 4. The left and right figures correspond to the simulations of Figures 1 and 2, respectively (recall Figure 2 is based on a larger subsampling rate, b=n^{3/4}). Confidence intervals for the subsampling and delete-d estimators use asymptotic normality. For direct comparison, asymptotic normal bootstrap confidence intervals are also provided. All procedures tend to produce confidence intervals that are too large when VIMP is small (reflected by coverage probabilities exceeding the targeted 90% level). This is actually a good feature, as it implies they tend to overestimate confidence intervals for noisy variables, thereby making them less likely to be selected. For larger VIMP in the left figure, the subsampling estimator tends to produce intervals that are too small. As mentioned earlier, this is because it tends to underestimate the variance. The delete-d estimator performs much better. However, when the subsampling rate is increased (right figure), the subsampling estimator is generally superior to the delete-d estimator. Its overall mean coverage rate is 92%, which is much better than the delete-d and bootstrap estimators, whose coverage rates of 97% are too high.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest-regression simulations. Left- and right-hand side figures based on subsampling rates b=n^{1/2} and b=n^{3/4}, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values.

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest-regression simulations. Left-hand figure displays normal quantile plots of standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations.


5 | RANDOM FOREST CLASSIFICATION (RF-C)

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.30
2. Two-class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ=.75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.
5. Same as 2, but with 15 linear variables and a ρ=.75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U[−1, 1] distribution. Experimental parameters were set as in the RF-R simulations: n=250, b=n^{1/2}, n^{3/4}, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used p^{1/2} random features (the default setting). The entire procedure was repeated 250 times.
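For reference, the first 2 designs can be generated as sketched below; argument names follow the mlbench and caret documentation (check ?mlbench.threenorm and ?twoClassSim), and defaults may vary across package versions.

```r
## Sketch of two of the RF-C simulation designs.
library(mlbench)
library(caret)

## Simulation 1: threenorm with p = 20 features.
dat1 <- as.data.frame(mlbench.threenorm(250, d = 20))

## Simulation 2: two-class data with 5 linear variables plus 10 noise
## variables (twoClassSim always includes 2 factors and 3 nonlinear terms).
dat2 <- twoClassSim(250, linearVars = 5, noiseVars = 10)
```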

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let Y ∈ {1, …, K} be the response. If 0 ≤ p_k ≤ 1 denotes the predicted probability that Y equals class k, k = 1, …, K, the normalized Brier score is defined as follows:

$$BS^{*} = \frac{100K}{K-1} \sum_{k=1}^{K} (1\{Y = k\} - p_k)^2.$$

Note that the normalizing constant 100K/(K−1) used here is different than the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K²/(K−1) because we have noticed that the value for the Brier score under random guessing depends on the number of classes K. If K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing, regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
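A direct implementation of BS* averaged over a test set is short; in the sketch below, y contains the observed classes coded 1, …, K and prob is an n × K matrix of predicted class probabilities (hypothetical inputs).

```r
## Normalized Brier score as defined above.
normalized_brier <- function(y, prob) {
  K <- ncol(prob)
  ind <- outer(y, seq_len(K), "==") * 1   # one-hot indicator 1{Y = k}
  100 * K / (K - 1) * mean(rowSums((ind - prob)^2))
}
## Sanity check: random guessing (prob = 1/K everywhere) returns 100
## for any K; a perfect classifier returns 0.
```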

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much more stable the Brier score is. This stability naturally lends itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for the Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20000 trees, in contrast to the stable behavior of the Brier score.


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

FIGURE 6 Results from random forest-classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings, followed by illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is Y = (T, δ), where T = min(T^o, C^o) is the observed survival time and δ = 1{T^o ≤ C^o} is the right-censoring indicator. Here, (T^o, C^o) denote the unobserved event and censoring times. Thus, δ=1 denotes an event, such as death, while δ=0 denotes a right-censored case. The target function h is the conditional survival function h(x) = P{T^o > t | X = x}, where t is some selected time point.

6.2 | Weighted Brier score

Let ĥ be an estimator of h. One method for measuring performance of ĥ is the weighted Brier score,23,24 defined as follows:

$$\mathrm{wBS}(t) = (1\{T > t\} - \hat{h})^2\, w(t, Y, G),$$

where w(t, Y, G) is the weight defined by

$$w(t, Y, G) = \frac{1\{T \le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)},$$

and G(t) = P{C^o > t} is the survival function of the censoring variable C^o. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

$$\ell(Y, \hat{h}) = (1\{T > t\} - \hat{h})^2\, w(t, Y, G).$$

This assumes G is a known function, but in practice G must be estimated.23,24 Thus, if Ĝ is an estimator of G, w(t, Y, G) is replaced by w(t, Y, Ĝ).
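A sketch of computing the weights w(t, Y, Ĝ) is given below, with Ĝ taken to be the Kaplan-Meier estimator of the censoring survival function (a common choice in the cited literature, used here as an illustrative assumption); the small epsilon used to form the left limit Ĝ(s−) is a shortcut a production implementation would handle more carefully (as would division when Ĝ is near zero in the tails).

```r
## IPCW weights w(t, Y, G-hat): flip the status indicator so that
## "events" for the Kaplan-Meier fit are the censoring times.
library(survival)

ipcw_weights <- function(time, status, t0) {
  km_cens <- survfit(Surv(time, 1 - status) ~ 1)    # G-hat
  G <- stepfun(km_cens$time, c(1, km_cens$surv))    # right-continuous G-hat(s)
  Gminus <- function(s) G(s - 1e-8)                 # crude left limit G-hat(s-)
  ifelse(time <= t0,
         status / Gminus(time),                     # 1{T <= t} delta / G(T-)
         1 / G(t0))                                 # 1{T > t} / G(t)
}
```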

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and is therefore very complex, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values at a b=n^{1/2} subsampling rate (b≈47.2). We preferred the delete-d jackknife over the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.
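A sketch of this analysis is given below, assuming a hypothetical data frame hf containing the survival time, death indicator, and the p=39 patient variables; argument names follow the randomForestSRC documentation (check ?subsample and ?plot.subsample for your version).

```r
## Sketch of the RSF analysis of the systolic heart failure data.
library(randomForestSRC)

o <- rfsrc(Surv(time, status) ~ ., data = hf,
           ntree = 250, nodesize = 30)

## Delete-d jackknife VIMP standard errors: 1000 subsamples drawn at
## rate b = n^(1/2), ie, b is roughly 47 when n = 2231.
smp <- subsample(o, B = 1000, subratio = sqrt(nrow(hf)) / nrow(hf))
plot.subsample(smp, alpha = .05)   # 95% asymptotic normal intervals
```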

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP whose confidence regions are bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let h(t, x) be the hazard function for covariate x at time t. Simulations are as follows:

1. h(t, x) = exp(x1 − 2x4 + 2x5).
2. If x1 ≤ .5, h(t, x) = exp(x2)1{t ∉ [.5, 2.5]}; if x1 > .5, h(t, x) = exp(x3)1{t ∉ [2.5, 4.5]}.
3. h(t, x) = (1 + z2 t)exp(z1 + z2 t), where z1 = .5x1 and z2 = x4 + x5.

In all simulations, covariates were independently sampled from a U[0, 1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards models. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention.
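To make the setup concrete, the sketch below generates data from simulation 1: because the hazard is constant in t, event times are exponential with rate exp(x1 − 2x4 + 2x5). The censoring mechanism is not spelled out in the text, so the exponential censoring used here is an illustrative assumption, with its rate tuned to give roughly the reported 19% censoring.

```r
## Sketch of survival simulation 1 (constant Cox/exponential hazard).
sim_surv1 <- function(n = 250, p = 10, cens_rate = 0.4) {
  x <- matrix(runif(n * p), n, p)            # covariates U(0, 1)
  rate <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])
  To <- rexp(n, rate = rate)                 # event times
  Co <- rexp(n, rate = cens_rate)            # independent censoring (assumed)
  data.frame(x, time = pmin(To, Co), status = as.numeric(To <= Co))
}
```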

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection p^{1/2}). Experimental parameters were kept the same as in the previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


6.6 | Competing risk simulations

Here, we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: All p covariate effects are linear.
2. Quadratic model: A subset of the p covariate effects are quadratic.
3. Interaction model: Same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of b=n^{1/2} reduces this to O(nM), a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting:

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator (a sketch of the resulting selection rule follows this list).
2. A good default subsampling rate is b=n^{1/2}. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.
3. The α value for the confidence region should be chosen using typical values, such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
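A minimal sketch of the selection rule referenced in guideline 1 follows; vimp and se_jk stand for the per-variable VIMP estimates and delete-d jackknife standard errors (hypothetical inputs, eg, extracted from the subsample output).

```r
## Select a variable when its 100(1 - alpha)% delete-d jackknife
## asymptotic normal interval excludes zero (lower bound above zero).
select_vars <- function(vimp, se_jk, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  names(vimp)[vimp - z * se_jk > 0]
}
```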

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, by way of support for these recommendations, we ran the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noisy variables identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α=.25, .1, .01, .001. The top panel was calculated under a subsampling rate of b=n^{1/2}, while the bottom panel used b=n^{3/4}. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases as α decreases, while TNR increases as α decreases. For b=n^{3/4}, TNR rates are slightly better; however, for b=n^{1/2}, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to b=n^{1/2}.

Focusing on the top panel, corresponding to the recommended n^{1/2} subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These extreme values are unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α=.1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α=.25, .1, .01, .001. Top and bottom figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}.


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model h(X) = Σ_{j=1}^p h_j(X^{(j)}), the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, L)] = E\left[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\right] + 2\sigma^2(1-\rho_j) + o(1),$$

where ρ_j is a correlation coefficient and h_j is the additive expansion of h attributed to X^{(j)}. For noisy variables, h_j = 0 and ρ_j = 1; thus, VIMP will converge to zero. For strong variables, h_j ≠ 0. Our theory suggests that the value of ρ_j will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that h_j changes when X^{(j)} is permuted, thus showing that VIMP correctly isolates the effect of X^{(j)} in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0. 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1. 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77. 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values.


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let θ_n^{(j)} = I(X^{(j)}, L) be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{A1}$$

where (X_i, ε_i) are iid with distribution P, such that X_i and ε_i are independent and E(ε_i) = 0, Var(ε_i) = σ² < ∞. Notice (A1) implies that the target function h has an additive expansion, h(X) = Σ_{j=1}^p h_j(X^{(j)}). This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss ℓ(Y, ĥ) = (Y − ĥ)², we have

$$\theta_n^{(j)} = E_{\Theta}\left[\frac{1}{N(\Theta)} \sum_{i \in L^{**}(\Theta)} \left[(Y_i - h(\tilde{X}_i^{(j)}, \Theta, L))^2 - (Y_i - h(X_i, \Theta, L))^2\right]\right].$$

We will assume the number of OOB cases, N(Θ), is always fixed at Round(ne^{−1}), where Round(·) is the nearest integer function. Write N_n for N(Θ). Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n} E_{\Theta}\left[\sum_{i \in L^{**}(\Theta)} \left[(Y_i - h(\tilde{X}_i^{(j)}, \Theta, L))^2 - (Y_i - h(X_i, \Theta, L))^2\right]\right].$$

To study θ_n^{(j)}, we will evaluate its mean, θ_{n,0}^{(j)} = E_L[θ_n^{(j)}]. For ease of notation, write h_{n,i} = h(X_i, Θ, L) and h̃_{n,i} = h(X̃_i^{(j)}, Θ, L). Likewise, let h_n = h(X, Θ, L) and h̃_n = h(X̃^{(j)}, Θ, L). We have

$$\theta_{n,0}^{(j)} = \frac{1}{N_n} E\left[\sum_{i \in L^{**}(\Theta)} \left[(Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2\right]\right] = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \stackrel{d}{=} (X_i, Y_i, h_{n,i}, \tilde{h}_{n,i})$ if i is OOB (ie, because the tree does not use information about (X_i, Y_i) in its construction, we can replace (X_i, Y_i) with (X, Y)). Now, making use of the representation Y = h(X) + ε, which holds by the assumed regression model (A1), and writing h for h(X) and Δ̃_n = h̃_n − h_n,

$$\theta_{n,0}^{(j)} = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right] = E\left[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\right] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \tag{A2}$$

where in the last line we have used E(ε) = 0 and that ε is independent of {Δ̃_n, h_n, h}.

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.

We can see that (A2) is driven by the 2 terms Δ̃_n and h_n − h. Define integer values n_i = n_i(Θ) ≥ 0 recording the bootstrap frequency of case i = 1, …, n in L*(Θ) (notice that n_i = 0 implies case i is OOB). By the definition of a RF-R tree, we have

$$h_n = h(X, \Theta, L) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$

where {W_i(X, Θ)}_{i=1}^n are the forest weights, defined as follows:

$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$

where R(X, Θ) is the tree terminal node containing X and |R(X, Θ)| is the cardinality, equal to the number of bootstrap cases in R(X, Θ). Notice that the weights are convex, since 0 ≤ W_i(X, Θ) ≤ 1 and

$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, L) - h(X, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$

To study Δ̃_n in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write R̃ = R(X̃^{(j)}, Θ) and R = R(X, Θ). Then

$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i \in R} n_i Y_i = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i \in R} n_i \varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$

where i(R̃) and i(R) identify the index of the bootstrap case in R̃ = R(X̃^{(j)}, Θ) and R = R(X, Θ), respectively (note that i(R̃) and i(R) are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line, which will therefore be crucial to understanding VIMP. Notice that if h(X_{i(R̃)}) ≈ h(X̃^{(j)}) and h(X_{i(R)}) ≈ h(X), which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \approx h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for X^{(j)} is assessed by how much its contribution to the additive expansion, h_j, changes when X^{(j)} is permuted. This motivates the following assumption.

Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1).

This deals with the first 2 terms in the expansion of Δ̃_n(X). We also need to deal with the remaining term involving the measurement errors, ε_{i(R̃)} − ε_{i(R)}. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. {ε_{i(R̃)}, ε_{i(R)}} is a finite exchangeable sequence with variance σ².

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is L2-consistent: E[(h_n − h)²] → 0, where E[h²] < ∞.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n,0}^{(j)} = E\left[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}).

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when X^{(j)} is a noisy variable. Ideally, this means the tree is split without ever using X^{(j)}. Therefore, if we drop X and X̃^{(j)} down the tree, they will occupy the same terminal node. Hence, R̃ = R and ε_{i(R̃)} = ε_{i(R)}, and therefore ρ_j = 1. Furthermore, because h_j must be zero for a noisy variable, it follows that θ_{n,0}^{(j)} = o(1). Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive, because h_j ≠ 0 and ρ_j < 1 for strong variables.

Proof. By (A2), we have θ_{n,0}^{(j)} = E[Δ̃_n²] + 2E[Δ̃_n(h_n − h)]. We start by dealing with the second term, E[Δ̃_n(h_n − h)]. By the Cauchy-Schwartz inequality,

$$E[\tilde{\Delta}_n(h_n - h)] \le E[|\tilde{\Delta}_n||h_n - h|] \le \sqrt{E[\tilde{\Delta}_n^2]}\,\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if E[Δ̃_n²] remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies h (and therefore h_j) is square-integrable. Assumption 3 implies that ε_{i(R̃)}, ε_{i(R)} have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using E(ζ̃_n²) = o(1) and E(ζ_n²) = o(1), deduce that

$$E[\tilde{\Delta}_n(X)^2] = E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2] + 2E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})] + E[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2] + o(1).$$

By exchangeability, E[g(X)ε_{i(R̃)}] = E[g(X)ε_{i(R)}] for any function g(X). Hence,

$$0 = E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}] - E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}] = E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})].$$

Appealing to exchangeability once more, we have

$$E[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$

where ρ_j = corr(ε_{i(R̃)}, ε_{i(R)}). Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of E[Δ̃_n²] and that θ_{n,0}^{(j)} = E[Δ̃_n²] + o(1). □

The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor h_n. It asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established L2-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

ISHWARAN AND LU 565

6 y frac14 x21 thorn 3x22x3expethminusjx4jTHORN thorn x6minusx8 thorn ε xjsimU(minus1 1) εsimN(0 12)7 y frac14 1fx1 thorn x34 thorn x9 thorn sinethx2x8THORN thorn εgt038g xjsimU(minus1 1) εsimN(0 12)8 y frac14 logethx1 thorn x2x3THORNminusexpethx4xminus15 minusx6THORN thorn ε xjsimU(05 1) εsimN(0 12)9 y frac14 x1x22jx3j1=2 thorn lfloorx4minusx5x6rfloorthorn ε xjsimU(minus1 1) εsimN(0 12)10 y frac14 x3ethx1 thorn 1THORNjx2jminusethx25frac12jx4j thorn jx5j thorn jx6jminus1THORN1=2 thorn ε xjsimU(minus1 1) εsimN(0 12)11 y frac14 cosethx1minusx2THORN thorn sinminus1ethx1x3THORNminustanminus1ethx2minusx23THORN thorn ε xjsimU(minus1 1) εsimN(0 12)12 y= ε εsimN(0 1)

In all 12 simulations the dimension of the feature space was set to p=20 This was done by adding variables unre-lated to y to the design matrix We call these noise variables In simulations 1 to 3 noise variables were U(0 1) for sim-ulations 4 to 11 noise variables were U(minus1 1) and for simulation 12 noise variables were N(0 1) All features (strongand noisy) were simulated independently Simulations 1 to 3 are the well‐known Friedman 1 2 3 simulations627 Sim-ulations 4 to 11 were inspired from COBRA28 Simulation 12 is a pure noise model

The sample size was set at n=250 Subsampling was set at a rate of b=n12 which in this case is b=158 We cansee that practically speaking this is a very small sample size and allows subsampled VIMP to be rapidly computedThe value of d for the delete‐d jackknife was always set to d=nminusb The number of bootstraps was set to 100 and thenumber of subsamples was set to 100 Note that this is not a large number of bootstraps or subsampled replicatesHowever they represent values practitioners are likely to use in practice especially for big data due tocomputational costs

All RF calculations were implemented using the randomForestSRC R‐package29 The package runs in OpenMPparallel processing mode which allows for parallel processing on user desktops as well as large scale computing clus-ters The package now includes a dedicated function ldquosubsamplerdquo which implements the 3 methods studied here Thesubsample function was used for all calculations All RF calculations used 250 trees Tree nodesize was set to 5 and p3 random feature selection used (these are default settings for regression) Each simulation was repeated independently250 times Random forest parameters were kept fixed over simulations All calculations related to prediction error and

VIMP were based on squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2

42 | Estimating the true finite standard error and true finite VIMP

Each procedure provides an estimate of the VarethIeth jTHORNnMTHORN We took the square root of this to obtain an estimate for the stan-

dard error of VIMP ethVarethIeth jTHORNnMTHORNTHORN1=2 To assess performance in estimating the standard error we used the following strat-egy to approximate the unknown parameter VarethIeth jTHORNnMTHORN For each simulation model we drew 1000 independent copies ofthe data and for each of these copies we calculated the finite forest VIMP Ieth jTHORNnM The same sample size of n=250 wasused and all forest tuning parameters were kept the same as outlined above We used the variance of these 1000 valuesto estimate VarethIeth jTHORNnMTHORN We refer to the square root of this value as the true finite standard error Additionally we aver-aged the 1000 values to estimate Efrac12Ieth jTHORNnM frac14 Efrac12IethXeth jTHORNLTHORN We call this the true finite VIMP

43 | Results

Performance of methods was assessed by bias and standardized mean‐squared error (SMSE) The bias for a method wasobtained by averaging its estimated standard error over the 250 replications and taking the difference between this andthe true finite standard error Mean‐squared error was estimated by averaging the squared difference between amethods estimated value for the standard error and the true finite standard error Standardized mean‐squared errorwas defined by dividing MSE by the true finite standard error In evaluating these performance values we realized itwas important to take into account signal strength of a variable In our simulations there are noisy variables with nosignal There are also variables with strong and moderately strong signal Therefore to better understand performancedifferences results were stratified by size of a variables true finite VIMP In total there were 240 variables to be dealtwith (12 simulations each with p=20 variables) These 240 variables were stratified into 6 groups based on 10th 25th50th 75th and 90th percentiles of true finite VIMP (standardized by the Y variance to make VIMP comparable acrosssimulations) Bias and SMSE for the 6 groups are displayed in Figure 1

All methods exhibit low bias for small VIMP As VIMP increases corresponding to stronger variables bias for thesubsampling estimator increases Its bias is negative showing that it underestimates variance The delete‐d estimator

FIGURE 1 Bias and standardized mean‐squared error (SMSE) performance for estimating VIMP standard error from random forestndash

regression simulations In total there are 240 variables (12 simulations p=20 variables in each simulation) These 240 variables have been

stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP Extreme right boxplots labeled ldquoALLrdquo display

performance for all 240 variables simultaneously [Colour figure can be viewed at wileyonlinelibrarycom]

566 ISHWARAN AND LU

does much better This is due to the bias correction factor discussed earlier (see (11)) which kicks in when signalincreases The pattern seen for bias is reflected in the results for SMSE the delete‐d is similar to the subsampling esti-mator except for large VIMP where it does better Overall the 164 estimator is the best of all 3 methods On the otherhand it is hundreds of times slower

This shows that the delete‐d estimator should be used when bias is an issue However bias of the subsampling esti-mator can be improved by increasing the subsampling rate Figure 2 reports the results from the same set of simulationsbut using an increased sampling rate b=n34 (b=628) Both estimators improve overall but note the improvement inbias and SMSE for the subsampling estimator relative to the delete‐d estimator Also notice that both estimators nowoutperform the 164 bootstrap

44 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence inter-vals21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows Define the

1minusα quantile for the subsampling distribution as cnbeth1minusαTHORN frac14 inffx~UnbethxTHORNge1minusαg Similarly define the 1minusα quantilefor the limiting distribution Q of n1=2ethθnminusθethPTHORNTHORN as ceth1minusαTHORN frac14 infftFethxTHORN frac14 Qfrac12minusinfin xge1minusαg Then assuming (8) andbnrarr0 the interval

FIGURE 2 Results from random forestndashregression simulations but with increased subsampling rate b=n34 Notice the improvement in

bias and standardized mean‐squared error (SMSE) for the subsampling estimator [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 567

frac12θnminusnminus1=2cnbeth1minusαTHORNinfinTHORN (12)

contains θ(P) with asymptotic probability 1minusα if c(1minusα) is a continuity point of FWhile (12) can be used to calculate a nonparametric confidence interval for VIMP we have found that a more stable

solution can be obtained if we are willing to strengthen our asssumptions to include asympotic normality Let θeth jTHORNn frac14 Ieth jTHORNnM

denote the finite forest estimator for VIMP We call the limit of θeth jTHORNn as n Mrarrinfin the true VIMP and denote thisvalue by θeth jTHORN0

θeth jTHORN0 frac14 limnMrarrinfin

θeth jTHORNn frac14 limnMrarrinfin

Ieth jTHORNnM

Let υeth jTHORNn be an estimator for VarethIeth jTHORNnMTHORN Assuming asymptotic normality

θeth jTHORNn minusθeth jTHORN0ffiffiffiffiffiffiffiυeth jTHORNn

q dNeth0 1THORN (13)

an asymptotic 100(1minusα)confidence region for θeth jTHORN0 the true VIMP can be defined as

θeth jTHORNn plusmnzα=2

ffiffiffiffiffiffiffiυeth jTHORNn

q

where zα2 is the 1minusα2‐quantile from a standard normal PrN(0 1)le zα2=1minusα2

45 | Justification of normality

To provide justification for assumption (13) we reran our previous simulations 1000 times independently For eachMonte Carlo replication and each simulation model we calculated the finite forest VIMP for a variable and centeredit by its mean value from the 1000 simulations and divided this centered value by the standard deviation of the 1000VIMP values All experimental parameters were held at the same values as before except for the sample size whichwas increased to n=2500 The left‐hand side of Figure 3 displays normal quantile plots for standardized VIMP for eachof the 12 simulation models On the y‐axis are the quantiles for the standardized VIMP while the x‐axis are correspond-ing N(0 1) quantiles The right‐hand side displays quantile bias defined as the difference between the quantile for stan-dardized VIMP to the quantile for a standard normal Values are displayed for 5 10 25 50 75 90 95 percentile valuesThe results generally confirm that (13) holds Deviations from normality occur primarily in the tails but these are rea-sonable and expected in finite sample settings For example the median bias is about 002 for the 95th percentile Thusthe standardized VIMP quantile differs from the true standard normal value of 1645 by only a value of 002 Alsoobserve that overall mean bias is near zero (far right boxplot)

Coverage probabilities for 90 confidence intervals are provided in Figure 4 The left and right figures correspond tosimulations of Figures 1 and 2 respectively (recall Figure 2 is based on a larger subsampling rate b=n34) Confidenceintervals for the subsampling and delete‐d estimators use asymptotic normality For direct comparison asymptotic nor-mal bootstrap confidence intervals are also provided All procedures tend to produce confidence intervals that are toolarge when VIMP is small (reflected by coverage probabilities exceeding the targeted 90 level) This is actually a goodfeature as it implies they tend to overestimate confidence intervals for noisy variables thereby making them less likelyto be selected For larger VIMP in the left figure the subsampling estimator tends to produce intervals that are toosmall As mentioned earlier this is because it tends to underestimate the variance The delete‐d estimator performsmuch better However when the subsampling rate is increased (right figure) the subsampling estimator is generallysuperior to the delete‐d estimator Its overall mean coverage rate is 92 which is much better than the delete‐d and boot-strap which achieve coverage rates of 97 which are too high

FIGURE 4 Coverage of variable importance 90 asymptotic normal confidence intervals from random forestndashregression simulations Left‐

and right‐hand side figures based on subsampling rates b=n12 and b=n34 respectively Confidence regions for the 240 variables from the 12

simulation experiments have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite variable

importance values [Colour figure can be viewed at wileyonlinelibrarycom]

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forestndashregression simulations Left‐hand figure

displays normal quantile plots for standardized VIMP for each of the 12 simulations Right‐hand figure displays bias of VIMP quantiles

compared with standard normal quantiles for all 240 variables from all 12 simulations

568 ISHWARAN AND LU

5 | RANDOM FOREST CLASSIFICATION RF ‐C

51 | Simulations

The following simulations were used to study performance of methods in the classification problem1 Threenorm simulation using mlbenchthreenorm from the mlbench R‐package30

2 Two class simulation using twoClassSim from the caret R‐package31 with 2 factors 5 linear and 3 nonlinearvariables

3 Same as 2 but with a ρ=75 exchangeable correlation4 Same as 2 but with 15 linear variables

ISHWARAN AND LU 569

5 Same as 2 but with 15 linear variables and a ρ=75 exchangeable correlation6 RF‐R simulation 6 with y discretized into 2 classes based on its median7 RF‐R simulation 8 with y discretized into 2 classes based on its median8 RF‐R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles9 RF‐R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles10 RF‐R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles

In simulation 1, the feature space dimension was p=20. Simulations 2 to 4 added d=10 noise variables (see the caret package for details). Simulations 6 to 10 added d=10 noise variables from a U[−1,1] distribution. Experimental parameters were set as in the RF-R simulations: n=250, b=n^{1/2}, n^{3/4}, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used p^{1/2} random features (the default setting). The entire procedure was repeated 250 times.
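As an illustration of this setup, here is a minimal R sketch (ours) of simulations 1 and 2, assuming current CRAN versions of mlbench, caret, and randomForestSRC; argument names follow the respective package documentation.

library(mlbench)
library(caret)
library(randomForestSRC)
n <- 250
# simulation 1: threenorm with p = 20 features
sim1 <- mlbench.threenorm(n, d = 20)
dat1 <- data.frame(y = sim1$classes, sim1$x)
# simulation 2: two-class problem with 5 linear, 3 nonlinear variables, 10 noise
dat2 <- twoClassSim(n, linearVars = 5, noiseVars = 10)
# classification forests with the default p^{1/2} random feature selection
o1 <- rfsrc(y ~ ., data = dat1, ntree = 1000)
o2 <- rfsrc(Class ~ ., data = dat2, ntree = 1000)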

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let Y ∈ {1, …, K} be the response. If 0 ≤ p_k ≤ 1 denotes the predicted probability that Y equals class k, k = 1, …, K, the normalized Brier score is defined as follows:

$$\mathrm{BS}^{*} = \frac{100\,K}{K-1} \sum_{k=1}^{K} \left( 1\{Y = k\} - p_k \right)^2.$$

Note that the normalizing constant 100K/(K−1) used here is different from the value 1/K typically used for the Brier score. We multiply the traditional Brier score by 100K²/(K−1) because we have noticed that the value of the Brier score under random guessing depends on the number of classes K: as K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
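To make the normalization concrete, here is a small R sketch (ours) of the score; under pure guessing (p_k = 1/K for all k), it returns exactly 100.

# normalized Brier score: y is a factor of observed classes,
# prob is an n x K matrix of predicted class probabilities
brier_norm <- function(y, prob) {
  K <- nlevels(y)
  Y <- model.matrix(~ y - 1)      # one-hot indicators 1{Y = k}
  mean(100 * K / (K - 1) * rowSums((Y - prob)^2))
}
y <- factor(sample(3, 500, replace = TRUE))
brier_norm(y, matrix(1/3, 500, 3))   # = 100: random guessing with K = 3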

Although misclassification error is the commonly used performance measure in classification problems, the Brier score is far more stable. This stability naturally lends itself to more stable estimates of VIMP, which is why we have chosen to use it. As a simple illustration, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M=20,000 trees.
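The comparison in Figure 5 can be approximated with the sketch below (ours); the predicted.oob and class.oob components follow the rfsrc documentation for classification forests, and the column order of predicted.oob is assumed to match the factor levels.

library(mlbench)
library(randomForestSRC)
data(BreastCancer)
bc <- na.omit(BreastCancer[, -1])                 # drop Id column and missing rows
o <- rfsrc(Class ~ ., data = bc, ntree = 20000)   # heavy; reduce ntree for a quick check
# OOB misclassification rate
mean(o$class.oob != bc$Class)
# OOB normalized Brier score (K = 2, so the constant is 100 * 2 / 1 = 200)
mean(200 * rowSums((model.matrix(~ bc$Class - 1) - o$predicted.oob)^2))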

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for the Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20,000 trees, in contrast to the stable behavior of the Brier score.


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality, as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b.

FIGURE 6 Results from random forest-classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than for the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings, followed by illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is Y = (T, δ), where T = min(T^o, C^o) is the observed survival time and δ = 1{T^o ≤ C^o} is the right-censoring indicator. Here, (T^o, C^o) denote the unobserved event and censoring times. Thus, δ=1 denotes an event, such as death, while δ=0 denotes a right-censored case. The target function h is the conditional survival function h(x) = P{T^o > t | X = x}, where t is some selected time point.

6.2 | Weighted Brier score

Let ĥ be an estimator of h. One method for measuring performance of ĥ is the weighted Brier score,23,24 defined as follows:

$$\mathrm{wBS}(t) = \left(1\{T > t\} - \hat{h}\right)^2 w(t, Y, G),$$

where w(t, Y, G) is the weight defined by

$$w(t, Y, G) = \frac{1\{T \le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)},$$

and G(t) = P{C^o > t} is the survival function of the censoring variable C^o. Using the notation of Section 2, the loss function ℓ under the weighted Brier score can be written as follows:

$$\ell(Y, \hat{h}) = \left(1\{T > t\} - \hat{h}\right)^2 w(t, Y, G).$$

This assumes G is a known function, but in practice, G must be estimated.23,24 Thus, if Ĝ is an estimator of G, w(t, Y, G) is replaced by w(t, Y, Ĝ).
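A rough R sketch (ours) of this estimate, using a Kaplan-Meier fit on the censoring indicator for Ĝ; for simplicity, we evaluate G at T rather than at the left limit G(T−).

library(survival)
# weighted Brier score at time t; hhat is predicted P{T^o > t | X} per subject
wbrier <- function(time, status, hhat, t) {
  Gfit <- survfit(Surv(time, 1 - status) ~ 1)   # KM estimate of censoring survival G
  G <- stepfun(Gfit$time, c(1, Gfit$surv))      # G(u) as a right-continuous step function
  w <- (time <= t) * status / G(time) + (time > t) / G(t)
  mean((as.numeric(time > t) - hhat)^2 * w)
}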

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, it is very complex and cannot be expressed in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it poses no problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2
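For concreteness, here is a minimal sketch (ours) of Harrell's concordance index over usable pairs; tie handling is simplified relative to the full definition.

# time, status: observed survival data; risk: predicted risk (higher = worse)
cindex <- function(time, status, risk) {
  conc <- disc <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] == time[j]) next            # skip tied times in this sketch
    s <- if (time[i] < time[j]) i else j    # subject with the shorter time
    l <- if (s == i) j else i
    if (status[s] == 0) next                # usable only if the shorter time is an event
    if (risk[s] > risk[l]) conc <- conc + 1
    else if (risk[s] < risk[l]) disc <- disc + 1
    else { conc <- conc + 0.5; disc <- disc + 0.5 }   # tied risks count 1/2
  }
  conc / (conc + disc)                      # the O(n^2) pairwise cost noted in Section 7.2
}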

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values at a b=n^{1/2} subsampling rate (b=47). We preferred the delete-d jackknife over the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.
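In code, an analysis of this form might look as follows; this is a hedged sketch using the subsample() utility shipped with recent randomForestSRC releases (argument names per the package documentation), with `heart` standing in for the nonpublic study data.

library(randomForestSRC)
o <- rfsrc(Surv(time, death) ~ ., data = heart, ntree = 250, nodesize = 30)
# subsampled VIMP; the package default subsampling rate is approximately n^{1/2}
smp <- subsample(o, B = 1000)
# delete-d jackknife 95% asymptotic normal confidence intervals
plot.subsample(smp, alpha = 0.05)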

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as percentages. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP whose confidence regions are bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let h(t, x) be the hazard function for covariate x at time t. Simulations are as follows:

1. h(t, x) = exp(x1 − 2x4 + 2x5).
2. If x1 ≤ 0.5, h(t, x) = exp(x2) 1{t ∉ [0.5, 2.5]}; if x1 > 0.5, h(t, x) = exp(x3) 1{t ∉ [2.5, 4.5]}.
3. h(t, x) = (1 + z2 t) exp(z1 + z2 t), where z1 = 5x1 and z2 = x4 + x5.

In all simulations, covariates were independently sampled from a U[0,1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards models.
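For instance, simulation 1 has a constant (time-independent) hazard, so event times are exponential and easy to draw by hand; the sketch below (ours) generates one such data set, with an illustrative censoring rate.

set.seed(1)
n <- 250; p <- 10
x <- matrix(runif(n * p), n, p)               # U[0,1] covariates; 7 of them are noise
lambda <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])
To <- rexp(n, rate = lambda)                  # event times from hazard exp(x1 - 2x4 + 2x5)
Co <- rexp(n, rate = 0.08)                    # independent censoring; rate is illustrative
dat <- data.frame(time = pmin(To, Co), status = as.numeric(To <= Co), x)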

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention.


Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection p^{1/2}). Experimental parameters were kept the same as in the previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


6.6 | Competing risk simulations

Here, we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards, as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: all p covariate effects are linear.
2. Quadratic model: a subset of the p covariate effects are quadratic.
3. Interaction model: same as 1, but interactions between certain variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time-truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause); a sketch of this setup follows below. RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.
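A hedged sketch of the cause-specific weighting: per the randomForestSRC documentation, the cause argument of rfsrc accepts a vector of nonnegative split weights over the competing events (treat this as an assumption); crdata is hypothetical, with status coded 0 (censored), 1, or 2.

library(randomForestSRC)
# log-rank splitting with weight 1 on event 1 and weight 0 on event 2
o <- rfsrc(Surv(time, status) ~ ., data = crdata, ntree = 1000,
           splitrule = "logrank", cause = c(1, 0))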

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF “black box” is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will cover the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an O(n²) operation. With M trees, the number of computations is O(n²M) for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of b=n^{1/2} reduces this to O(nM), a factor of n times smaller; for example, with n=10^6 and M=1000, this is the difference between roughly 10^15 and 10^9 operations. The resulting increase in speed will be of tremendous advantage in big data settings.

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting (a worked sketch follows the list).

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is b=n^{1/2}. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.


3. The α value for the confidence region should be chosen using typical values, such as α=0.1 or α=0.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
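The guidelines combine into a short workflow; the sketch below (ours) relies on the subsample() and plot.subsample() utilities in recent randomForestSRC releases, with argument names taken from the package documentation and to be treated as assumptions.

library(randomForestSRC)
o <- rfsrc(y ~ ., data = dat, ntree = 1000)   # dat: any regression data set
smp <- subsample(o, B = 100)                  # default rate approximately n^{1/2}
# delete-d jackknife 90% asymptotic normal intervals (guidelines 1 and 3);
# variables whose intervals sit above zero are candidates for selection
plot.subsample(smp, alpha = 0.1)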

The above guidelines are only meant to be starting points for the analyst, and obviously, there will be exceptions to the rules. However, in support of these recommendations, we ran the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. TPR was defined as the fraction of true signal variables identified; TNR was the fraction of noise variables correctly identified as noise.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = 0.25, 0.1, 0.01, 0.001. The top panel was calculated under a subsampling rate b=n^{1/2}, while the bottom panel used b=n^{3/4}. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases as α decreases, while TNR increases as α decreases. For b=n^{3/4}, TNR rates are slightly better; however, for b=n^{1/2}, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to b=n^{1/2}.

Focusing on the top panel, corresponding to the recommended n^{1/2} subsampling rate, a striking pattern is that while TPR declines as α decreases, the decline is fairly slow. This is interesting given the wide range of α values, from 0.25 to 0.001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=0.25. The value α=0.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α = 0.1, 0.05, 0.01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = 0.25, 0.1, 0.01, 0.001. Top and bottom figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}.


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, \mathscr{L})] = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of $h$ attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus, VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, up to a constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is L2-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties of VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran  http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York, NY: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and that of a standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values.


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time-truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathscr{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j\left(X_i^{(j)}\right) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{A1}$$

where $(X_i, \varepsilon_i)$ are iid with distribution $P$ such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice that (A1) implies the target function $h$ has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

$$\theta_n^{(j)} = E_{\Theta}\left[\frac{1}{N(\Theta)} \sum_{i \in \mathscr{L}^{**}(\Theta)} \left[\left(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\right)^2 - \left(Y_i - h(X_i, \Theta, \mathscr{L})\right)^2\right]\right].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(n e^{-1})$, where $\mathrm{Round}(\cdot)$ is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n} E_{\Theta}\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \left[\left(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\right)^2 - \left(Y_i - h(X_i, \Theta, \mathscr{L})\right)^2\right]\right].$$

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n0}^{(j)} = E_{\mathscr{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{ni} = h(X_i, \Theta, \mathscr{L})$ and $\tilde{h}_{ni} = h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})$. Likewise, let $h_n = h(X, \Theta, \mathscr{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates b=n^{1/2} and b=n^{3/4}. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


$$\theta_{n0}^{(j)} = \frac{1}{N_n} E\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \left[(Y_i - \tilde{h}_{ni})^2 - (Y_i - h_{ni})^2\right]\right] = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{ni}, \tilde{h}_{ni})$ if $i$ is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing $h$ for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta_{n0}^{(j)} = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right] = E\left[-2\varepsilon \tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\right] = E\left[\tilde{\Delta}_n^2\right] + 2E\left[\tilde{\Delta}_n(h_n - h)\right], \tag{A2}$$

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathscr{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of a RF-R tree, we have

$$h_n = h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$

where $\{W_i(X, \Theta)\}_{1}^{n}$ are the forest weights defined as follows:

$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$

where $R(X, \Theta)$ is the tree terminal node containing $X$, and $|R(X, \Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex, since $0 \le W_i(X, \Theta) \le 1$ and

$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \qquad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) - h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then

$$\begin{aligned}
\tilde{\Delta}_n(X) &= \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|} \sum_{i \in R} n_i Y_i \\
&= \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|} \sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|} \sum_{i \in R} n_i \varepsilon_i \\
&= h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},
\end{aligned}$$

where $i(\tilde{R})$ and $i(R)$ identify the index of the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first 2 terms in the last line, and these will therefore be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n0}^{(j)} = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different from the limit of a strong variable, which must be strictly positive, because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde{\Delta}_n(h_n - h)] \le E\left[|\tilde{\Delta}_n|\, |h_n - h|\right] \le \sqrt{E[\tilde{\Delta}_n^2]} \sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}$ and $\varepsilon_{i(R)}$ have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)\left(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\right)\right] + E\left[\left(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\right)^2\right] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)\varepsilon_{i(\tilde{R})}\right] - E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)\varepsilon_{i(R)}\right] = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)\left(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\right)\right].$$

Appealing to exchangeability once more, we have

$$E\left[\left(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\right)^2\right] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\left[\left(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\right)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h; however, it is also an assertion about the approximating behavior of the forest predictor $h_n$. It asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption: Scornet et al35 established L2-consistency of RF-R for additive models, allowing the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

FIGURE 1 Bias and standardized mean‐squared error (SMSE) performance for estimating VIMP standard error from random forestndash

regression simulations In total there are 240 variables (12 simulations p=20 variables in each simulation) These 240 variables have been

stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP Extreme right boxplots labeled ldquoALLrdquo display

performance for all 240 variables simultaneously [Colour figure can be viewed at wileyonlinelibrarycom]

566 ISHWARAN AND LU

does much better This is due to the bias correction factor discussed earlier (see (11)) which kicks in when signalincreases The pattern seen for bias is reflected in the results for SMSE the delete‐d is similar to the subsampling esti-mator except for large VIMP where it does better Overall the 164 estimator is the best of all 3 methods On the otherhand it is hundreds of times slower

This shows that the delete‐d estimator should be used when bias is an issue However bias of the subsampling esti-mator can be improved by increasing the subsampling rate Figure 2 reports the results from the same set of simulationsbut using an increased sampling rate b=n34 (b=628) Both estimators improve overall but note the improvement inbias and SMSE for the subsampling estimator relative to the delete‐d estimator Also notice that both estimators nowoutperform the 164 bootstrap

44 | Confidence intervals for VIMP

The subsampling distribution (9) discussed in Section 3 can also be used to calculate nonparametric confidence inter-vals21 The general idea for constructing a confidence interval for a target parameter θ(P) is as follows Define the

1minusα quantile for the subsampling distribution as cnbeth1minusαTHORN frac14 inffx~UnbethxTHORNge1minusαg Similarly define the 1minusα quantilefor the limiting distribution Q of n1=2ethθnminusθethPTHORNTHORN as ceth1minusαTHORN frac14 infftFethxTHORN frac14 Qfrac12minusinfin xge1minusαg Then assuming (8) andbnrarr0 the interval

FIGURE 2 Results from random forestndashregression simulations but with increased subsampling rate b=n34 Notice the improvement in

bias and standardized mean‐squared error (SMSE) for the subsampling estimator [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 567

frac12θnminusnminus1=2cnbeth1minusαTHORNinfinTHORN (12)

contains θ(P) with asymptotic probability 1minusα if c(1minusα) is a continuity point of FWhile (12) can be used to calculate a nonparametric confidence interval for VIMP we have found that a more stable

solution can be obtained if we are willing to strengthen our asssumptions to include asympotic normality Let θeth jTHORNn frac14 Ieth jTHORNnM

denote the finite forest estimator for VIMP We call the limit of θeth jTHORNn as n Mrarrinfin the true VIMP and denote thisvalue by θeth jTHORN0

θeth jTHORN0 frac14 limnMrarrinfin

θeth jTHORNn frac14 limnMrarrinfin

Ieth jTHORNnM

Let υeth jTHORNn be an estimator for VarethIeth jTHORNnMTHORN Assuming asymptotic normality

θeth jTHORNn minusθeth jTHORN0ffiffiffiffiffiffiffiυeth jTHORNn

q dNeth0 1THORN (13)

an asymptotic 100(1minusα)confidence region for θeth jTHORN0 the true VIMP can be defined as

θeth jTHORNn plusmnzα=2

ffiffiffiffiffiffiffiυeth jTHORNn

q

where zα2 is the 1minusα2‐quantile from a standard normal PrN(0 1)le zα2=1minusα2

45 | Justification of normality

To provide justification for assumption (13) we reran our previous simulations 1000 times independently For eachMonte Carlo replication and each simulation model we calculated the finite forest VIMP for a variable and centeredit by its mean value from the 1000 simulations and divided this centered value by the standard deviation of the 1000VIMP values All experimental parameters were held at the same values as before except for the sample size whichwas increased to n=2500 The left‐hand side of Figure 3 displays normal quantile plots for standardized VIMP for eachof the 12 simulation models On the y‐axis are the quantiles for the standardized VIMP while the x‐axis are correspond-ing N(0 1) quantiles The right‐hand side displays quantile bias defined as the difference between the quantile for stan-dardized VIMP to the quantile for a standard normal Values are displayed for 5 10 25 50 75 90 95 percentile valuesThe results generally confirm that (13) holds Deviations from normality occur primarily in the tails but these are rea-sonable and expected in finite sample settings For example the median bias is about 002 for the 95th percentile Thusthe standardized VIMP quantile differs from the true standard normal value of 1645 by only a value of 002 Alsoobserve that overall mean bias is near zero (far right boxplot)

Coverage probabilities for 90 confidence intervals are provided in Figure 4 The left and right figures correspond tosimulations of Figures 1 and 2 respectively (recall Figure 2 is based on a larger subsampling rate b=n34) Confidenceintervals for the subsampling and delete‐d estimators use asymptotic normality For direct comparison asymptotic nor-mal bootstrap confidence intervals are also provided All procedures tend to produce confidence intervals that are toolarge when VIMP is small (reflected by coverage probabilities exceeding the targeted 90 level) This is actually a goodfeature as it implies they tend to overestimate confidence intervals for noisy variables thereby making them less likelyto be selected For larger VIMP in the left figure the subsampling estimator tends to produce intervals that are toosmall As mentioned earlier this is because it tends to underestimate the variance The delete‐d estimator performsmuch better However when the subsampling rate is increased (right figure) the subsampling estimator is generallysuperior to the delete‐d estimator Its overall mean coverage rate is 92 which is much better than the delete‐d and boot-strap which achieve coverage rates of 97 which are too high

FIGURE 4 Coverage of variable importance 90 asymptotic normal confidence intervals from random forestndashregression simulations Left‐

and right‐hand side figures based on subsampling rates b=n12 and b=n34 respectively Confidence regions for the 240 variables from the 12

simulation experiments have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite variable

importance values [Colour figure can be viewed at wileyonlinelibrarycom]

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forestndashregression simulations Left‐hand figure

displays normal quantile plots for standardized VIMP for each of the 12 simulations Right‐hand figure displays bias of VIMP quantiles

compared with standard normal quantiles for all 240 variables from all 12 simulations

568 ISHWARAN AND LU

5 | RANDOM FOREST CLASSIFICATION RF ‐C

51 | Simulations

The following simulations were used to study performance of methods in the classification problem1 Threenorm simulation using mlbenchthreenorm from the mlbench R‐package30

2 Two class simulation using twoClassSim from the caret R‐package31 with 2 factors 5 linear and 3 nonlinearvariables

3 Same as 2 but with a ρ=75 exchangeable correlation4 Same as 2 but with 15 linear variables

ISHWARAN AND LU 569

5 Same as 2 but with 15 linear variables and a ρ=75 exchangeable correlation6 RF‐R simulation 6 with y discretized into 2 classes based on its median7 RF‐R simulation 8 with y discretized into 2 classes based on its median8 RF‐R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles9 RF‐R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles10 RF‐R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles

In simulation 1 the feature space dimension was p=20 Simulations 2 to 4 added d=10 noise variables (see the caretpackage for details) Simulations 6 to 10 added d=10 noise variables from aU [minus11] distribution Experimental param-eters were set as in RF‐R simulations n=250 b=n12n34 100 bootstrap samples and 100 subsample draws Parame-ters for randomForestSRC were set as in RF‐R except for random feature selection which used p12 random features(default setting) The entire procedure was repeated 250 times

52 | Brier score

Error performance was assessed using the normalized Brier score Let Yisin1hellipK be the response If 0lepkle1 denotes thepredicted probability that Y equals class k k=1hellipK the normalized Brier score is defined as follows

BSlowast frac14 100KKminus1

sumK

kfrac141eth1fY frac14 kgminuspkTHORN2

Note that the normalizing constant 100K(Kminus1) used here is different than the value 1K typically used for the Brierscore We multiply the traditional Brier score by 100K2(Kminus1) because we have noticed that the value for the Brier scoreunder random guessing depends on the number of classes K If K increases the Brier score under random guessing con-verges to 1 The normalizing constant used here resolves this problem and yields a value of 100 for random guessingregardless of K Thus anything below 100 signifies a classifier that is better than pure guessing A perfect classifierhas value 0

Although misclassification error is the commonly used performance measure in classification problems it cannot beoverstated how much stabler the Brier score is This stability will naturally lend itself to stabler estimates of VIMPwhich is why we have chosen to use it As a simple illustration of this consider Figure 5 that displays the OOB misclas-sification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R‐packageObserve that misclassification error remains unstable even with M=20000 trees

FIGURE 5 Out‐of‐bag misclassification error rate versus out‐of‐bag normalized Brier score for Wisconsin breast cancer data (obtained

from the mlbench R‐package) Note the fluctuations in misclassification error even after 20000 trees in contrast to the stable behavior of

the Brier score

570 ISHWARAN AND LU

53 | Results

Figure 6 displays the results from the RF‐C simulations Values are displayed as in RF‐R Left‐ and right‐hand side fig-ures are based on subsampling rates b=n12 and b=n34 The top and middle figures show bias and standardized MSE forestimating the standard error The bottom figures are coverage probabilities for 90 confidence intervals Variableimportance confidence intervals were calculated using asymptotic normality as in RF‐R (see the appendix for justifica-tion of normality) The conclusions from Figure 6 are similar to those for RF‐R The delete‐d jackknife is more accuratein estimating the standard error for strong variables when the subsampling rate is small but as b increases the subsam-pling estimator improves Both estimators generally improve with increased b Note unlike RF‐R coverage probability

FIGURE 6 Results from random forest‐classification simulations showing performance of subsampling delete‐d jackknife and 164

bootstrap Left‐ and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and

standardized mean‐squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for

VIMP 90 asymptotic normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th

percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 571

for the delete‐d jackknife is better than the subsampling estimator This is probably because there are more variableswith moderate signal in these simulations

6 | RANDOM SURVIVAL FORESTS RSF

Now we consider the survival setting We begin by first defining the survival framework using the notation of Section 2Following this we discuss 2 different methods that can be used for measuring prediction error in survival settings Fol-lowing this are illustrative examples

61 | Notation

We assume a traditional right‐censoring framework The response is Y=(T δ) where T frac14 minethToCoTHORN is the observedsurvival time and δ=1ToleCo is the right‐censoring indicator Here (To Co) denote the unobserved event and censoringtimes Thus δ=1 denotes an event such as death while δ=0 denotes a right‐censored case The target function h is theconditional survival function hethxTHORN frac14 PfTogttjX frac14 xg where t is some selected time point

62 | Weighted brier score

Let ĥ be an estimator of h One method for measuring performance of ĥ is the weighted Brier score2324 defined as fol-lows

wBSethtTHORN frac14 eth1fTgttgminusĥTHORN2wethtY GTHORN

where w(t Y G) is the weight defined by

wethtY GTHORN frac14 1fTletgδGethTminusTHORN thorn 1fTgttg

GethtTHORN

and GethtTHORN frac14 PfCogttg is the survival function of the censoring variable Co Using the notation of Section 2 the loss func-tion ℓ under the weighted Brier score can be written as follows

ℓethY ĥTHORN frac14 eth1fTgttgminusĥTHORN2wethtY GTHORN

This assumes G is a known function but in practice Gmust be estimated2324 Thus ifĜ is an estimator of G w(t Y G) isreplaced by wethtY ĜTHORN

63 | Concordance index

Harrells concordance index32 is another measure of prediction performance that can be used in survival settings The

concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival A valueof 1 represents an estimator that has perfect discrimination whereas a value of 05 indicates performance on par with arandom coin toss This intuitive interpretation of performance has made Harrells concordance index very popular andfor this reason we will base our analysis on it Note that because the concordance index is calculated by comparing dis-cordant pairs to concordant pairs and therefore is very complex it is not possible to express it in terms of the ℓ‐lossfunction of Section 2 However this just means that VIMP based on Harrells concordance index is not easily describednotationally in terms of a formal loss but this does not pose any problems to the application of our methodology Per-mutation VIMP based on Harrells concordance index is well defined and can be readily calculated2

64 | Systolic heart failure

For our first illustration we consider a survival data set of n=2231 cardiovascular patients All patients suffered fromsystolic heart failure and all underwent cardiopulmonary stress testing The outcome was defined as all cause mor-tality Over a mean follow‐up of 5 years 742 of the patients died Patient variables included baseline

572 ISHWARAN AND LU

characteristics and exercise stress test results (p=39) More detailed information regarding the data can be found inHsich et al33

An RSF analysis was run on the data A total of 250 survival trees were grown using a nodesize value of 30 with all otherparameters set to default values used by RSF in randomForestSRC software Performance was measured using the C‐index defined as one minus the Harrell concordance index2 The delete‐d jackknife estimator was calculated using 1000subsampled values using a b=n12 subsampling rate (b=472) We preferred to use the delete‐d jackknife rather than thesubsampling estimator because of the low subsampling rate Also we did not use the 164 estimator because it was too slow

The 95 asymptotic normal confidence intervals are given in Figure 7 Variable importance values have been mul-tiplied by 100 for convenient interpretation as percentage Blood urea nitrogen exercise time and peak VO2 have thelargest VIMP with confidence intervals well bounded away from zero All 3 variables are known to be highly predictiveof heart failure and these findings are not surprising More interesting however are several variables with moderatesized VIMP which have confidence regions bounded away from zero Some examples are creatinine clearance sex leftventricular ejection fraction and use of beta‐blockers The finding for sex is especially interesting because sex is oftenunder appreciated for predicting heart failure

65 | Survival simulations

Next we study peformance using simulations We use the 3 survival simulation models described by Breiman in his2002 Wald lectures34 Let h(tx) be the hazard function for covariate x at time t Simulations are as follows

1 hetht xTHORN frac14 expethx1minus2x4 thorn 2x5THORN2 If x1le5 hetht xTHORN frac14 expethx2THORN1ftnotinfrac125 25g If x1gt5 hetht xTHORN frac14 expethx3THORN1ftnotinfrac1225 45g3 hetht xTHORN frac14 eth1thorn z2tTHORNexpethz1 thorn z2tTHORN where z1=5x1 and z2=x4+x5

In all simulations covariates were independently sampled from aU [01] distribution Noise variables were added toincrease the dimension to p=10 Simulation 1 corresponds to a Cox model Simulations 2 and 3 are nonproportional

FIGURE 7 Delete‐d jackknife 95 asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure

data BUN blood urea nitrogen CABG coronary artery bypass graft ICD implantable cardioverter‐defibrillator LVEF left ventricular

ejection fraction PCI percutaneous coronary intervention [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 573

hazards Censoring was simulated independently of time in all simulations Censoring rates were 19 15 and 29respectively

RSF was fit using randomForestSRC using the same tuning values as in RF‐C simulations (nodesize 5 randomfeature selection p12) Experimental parameters were kept the same as previous simulations Experiments wererepeated 250 times Results are displayed using the same format as RF‐R and RF‐C and are provided in Figure 8 Theresults generally mirror our earlier findings bias for the subsampling estimator improves relative to the delete‐d jack-knife with increasing subsampling rate

FIGURE 8 Results from random survival forest simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap

Left‐ and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized

mean‐squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90

asymptotic normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true

finite VIMP [Colour figure can be viewed at wileyonlinelibrarycom]

574 ISHWARAN AND LU

66 | Competing risk simulations

Here we study performance of the methods in a competing risk setting For our analysis we use the competing risksimulations from Ishwaran et al3 Simulations were based on a Cox exponential hazards model with 2 competing eventsCovariates had differing effects on the hazards Models included covariates common to both hazards as well as covar-iates unique to only one hazard We considered 3 of the simulations from section 61 of Ishwaran et al3

1 Linear model All p covariate effects are linear2 Quadratic model A subset of the p covariate effects are quadratic3 Interaction model Same as 1 but interactions between certain p variables were included

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie 5 interaction terms wereadded) Covariates were sampled from both continuous and discrete distributions Performance was measured usingthe time truncated concordance index3 Without loss of generality we record performance for variables related to event1 only RSF competing risk trees were constructed using log‐rank splitting with weight 1 on event 1 and weight 0 onevent 2 (this ensures VIMP identifies only those variables affecting the event 1 cause) RSF parameters and experimentalparameters were identical to the previous simulations For brevity results are given in the appendix in Figure A2 Theresults mirror our previous findings

7 | DISCUSSION

71 | Summary

One widely used tool for peering inside the RF ldquoblack boxrdquo is VIMP But analyzing VIMP is difficult because of the com-plex nature of RF Given the difficulties of theoretical analysis our strategy was to approximate the distribution of VIMPthrough the use of subsampling a general methodology for approximating distributions of complex statistics Wedescribed a general procedure for estimating the variance of VIMP and for constructing confidence intervals

We compared our subsampling estimator and also the closely related delete‐d jackknife22 to the 164 bootstrap esti-mator a modified bootstrap procedure designed to address ties in OOB data Using extensive simulations involvingregression classification and survival data a consistent pattern of performance emerged for the 3 estimators All pro-cedures tended to under estimate variance for strong variables and over estimate variance for weak variables This wasespecially problematic for the subsampling estimator in low subsampling rate scenarios The delete‐d jackknife didmuch better in this case due to its bias correction Both of these methods improved with increasing subsampling rateeventually outperforming the 164 bootstrap

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With M trees, the number of computations is $O(n^2 M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b=n^{1/2}$ reduces this to $O(nM)$, a factor of n times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.
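To make the $O(n^2)$ pair computation concrete, here is a naive R sketch of Harrell's concordance index. The inputs time, status, and pred (higher values = higher predicted risk) are hypothetical, and tie handling is deliberately simplified.

```r
## Naive O(n^2) Harrell concordance index. A pair is usable when the case
## with the shorter observed time is an event; it counts as concordant when
## that case also has the higher predicted risk.
cindex <- function(time, status, pred) {
  concordant <- usable <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      a <- if (time[i] <= time[j]) i else j   # shorter observed time
      b <- if (time[i] <= time[j]) j else i
      if (status[a] == 1 && time[a] < time[b]) {
        usable <- usable + 1
        if (pred[a] > pred[b]) {
          concordant <- concordant + 1
        } else if (pred[a] == pred[b]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / usable
}
```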

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b=n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.


3. The α value for the confidence region should be chosen using typical values, such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation and results averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noise variables correctly identified as noise.
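A minimal sketch of this selection rule, with hypothetical inputs: vimp.hat and se.hat are the delete-d jackknife VIMP estimates and standard errors for the p variables, and signal is a logical vector flagging the true signal variables.

```r
## Select variables whose 100(1 - alpha)% asymptotic normal confidence
## interval for VIMP lies strictly above zero.
alpha <- 0.1
z <- qnorm(1 - alpha / 2)
selected <- (vimp.hat - z * se.hat) > 0

## TPR: fraction of true signal variables selected.
## TNR: fraction of noise variables left unselected.
TPR <- mean(selected[signal])
TNR <- mean(!selected[!signal])
```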

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α=.25, .1, .01, .001. The top panel was calculated under a subsampling rate $b=n^{1/2}$, while the bottom panel used $b=n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases as α decreases, while TNR increases as α decreases. For $b=n^{3/4}$, TNR rates are slightly better; however, for $b=n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b=n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR declines as α decreases, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α=.1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest-regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$.


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, L)] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of h attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus, VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran http://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems. 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0. 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1. 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77. 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patient with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile of standardized VIMP and the corresponding standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.
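A minimal sketch of this standardization, assuming a hypothetical 1000 × p matrix V whose rows hold the finite forest VIMP values from the independent replications:

```r
## Standardize each variable's VIMP across the 1000 replications, then
## compare selected quantiles of the standardized values with standard
## normal quantiles to obtain the quantile bias plotted in Figure A1.
probs <- c(.05, .10, .25, .50, .75, .90, .95)
Z <- scale(V)  # center by column means, scale by column standard deviations
quantile.bias <- apply(Z, 2, quantile, probs = probs) - qnorm(probs)
```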

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values.


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, L)$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (A1)$$

where $(X_i, \varepsilon_i)$ are iid with distribution P such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice (A1) implies that the target function h has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

$$\theta_n^{(j)} = E_\Theta\left[\frac{1}{N(\Theta)}\sum_{i \in L^{**}(\Theta)} \left[(Y_i - h(\tilde{X}_i^{(j)}, \Theta, L))^2 - (Y_i - h(X_i, \Theta, L))^2\right]\right].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(n e^{-1})$, where $\mathrm{Round}(\cdot)$ is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n} E_\Theta\left[\sum_{i \in L^{**}(\Theta)} \left[(Y_i - h(\tilde{X}_i^{(j)}, \Theta, L))^2 - (Y_i - h(X_i, \Theta, L))^2\right]\right].$$
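The constant $e^{-1} \approx 0.368$ is the familiar large-n limit of the out-of-bag fraction under bootstrapping; a quick simulation (ours, with arbitrary settings) illustrates it.

```r
## Fraction of cases left out of a bootstrap sample approaches exp(-1).
set.seed(1)
n <- 1000
oob.frac <- replicate(500, {
  boot <- sample.int(n, n, replace = TRUE)
  mean(!(seq_len(n) %in% boot))
})
c(mean(oob.frac), exp(-1))   # both approximately 0.368
```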

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n,0}^{(j)} = E_L[\theta_n^{(j)}]$. For ease of notation, write $h_{n,i} = h(X_i, \Theta, L)$ and $\tilde{h}_{n,i} = h(\tilde{X}_i^{(j)}, \Theta, L)$. Likewise, let $h_n = h(X, \Theta, L)$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, L)$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b=n^{1/2}$ and $b=n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. [Colour figure can be viewed at wileyonlinelibrary.com]


$$\theta_{n,0}^{(j)} = \frac{1}{N_n} E\left[\sum_{i \in L^{**}(\Theta)} \left[(Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2\right]\right] = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \stackrel{d}{=} (X_i, Y_i, h_{n,i}, \tilde{h}_{n,i})$ if i is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing h for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta_{n,0}^{(j)} = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right] = E\left[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\right] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \qquad (A2)$$

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $L^{*}(\Theta)$ (notice that $n_i = 0$ implies case i is OOB). By the definition of an RF-R tree, we have

$$h_n = h(X, \Theta, L) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$

where $\{W_i(X, \Theta)\}_1^n$ are the forest weights, defined as follows:

$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$

where $R(X, \Theta)$ is the tree terminal node containing X and $|R(X, \Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex since $0 \le W_i(X, \Theta) \le 1$ and

$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, L) - h(X, \Theta, L) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then

$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i \in R} n_i Y_i = \left[\frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i \in R} n_i h(X_i)\right] + \left[\frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i \in R} n_i \varepsilon_i\right] = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion $h_j$ changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n,0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop X and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence, $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.
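As a concrete illustration of the two limits, consider a hypothetical linear coordinate $h_j(x) = \beta x$ with $X^{(j)} \sim U[0,1]$ and $\tilde{X}^{(j)}$ an independent copy (this worked example is ours, under the theorem's assumptions):

```latex
% Strong linear coordinate: h_j(x) = beta * x with X^(j) ~ U[0,1].
\[
E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big]
  = \beta^2 E\big[(\tilde{X}^{(j)} - X^{(j)})^2\big]
  = 2\beta^2 \,\mathrm{Var}(X^{(j)})
  = \frac{\beta^2}{6},
\]
% so Theorem 1 gives a strictly positive limit beta^2/6 + 2*sigma^2*(1 - rho_j)
% that grows with the signal strength beta, whereas a noise coordinate
% (beta = 0, rho_j = 1) has limit zero.
```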

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde{\Delta}_n(h_n - h)] \le E[|\tilde{\Delta}_n||h_n - h|] \le \sqrt{E[\tilde{\Delta}_n^2]}\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies h (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].$$

Appealing to exchangeability once more, we have

$$E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

RSF was fit using randomForestSRC using the same tuning values as in RF‐C simulations (nodesize 5 randomfeature selection p12) Experimental parameters were kept the same as previous simulations Experiments wererepeated 250 times Results are displayed using the same format as RF‐R and RF‐C and are provided in Figure 8 Theresults generally mirror our earlier findings bias for the subsampling estimator improves relative to the delete‐d jack-knife with increasing subsampling rate

FIGURE 8 Results from random survival forest simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap

Left‐ and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized

mean‐squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90

asymptotic normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true

finite VIMP [Colour figure can be viewed at wileyonlinelibrarycom]

574 ISHWARAN AND LU

66 | Competing risk simulations

Here we study performance of the methods in a competing risk setting For our analysis we use the competing risksimulations from Ishwaran et al3 Simulations were based on a Cox exponential hazards model with 2 competing eventsCovariates had differing effects on the hazards Models included covariates common to both hazards as well as covar-iates unique to only one hazard We considered 3 of the simulations from section 61 of Ishwaran et al3

1 Linear model All p covariate effects are linear2 Quadratic model A subset of the p covariate effects are quadratic3 Interaction model Same as 1 but interactions between certain p variables were included

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie 5 interaction terms wereadded) Covariates were sampled from both continuous and discrete distributions Performance was measured usingthe time truncated concordance index3 Without loss of generality we record performance for variables related to event1 only RSF competing risk trees were constructed using log‐rank splitting with weight 1 on event 1 and weight 0 onevent 2 (this ensures VIMP identifies only those variables affecting the event 1 cause) RSF parameters and experimentalparameters were identical to the previous simulations For brevity results are given in the appendix in Figure A2 Theresults mirror our previous findings

7 | DISCUSSION

71 | Summary

One widely used tool for peering inside the RF ldquoblack boxrdquo is VIMP But analyzing VIMP is difficult because of the com-plex nature of RF Given the difficulties of theoretical analysis our strategy was to approximate the distribution of VIMPthrough the use of subsampling a general methodology for approximating distributions of complex statistics Wedescribed a general procedure for estimating the variance of VIMP and for constructing confidence intervals

We compared our subsampling estimator and also the closely related delete‐d jackknife22 to the 164 bootstrap esti-mator a modified bootstrap procedure designed to address ties in OOB data Using extensive simulations involvingregression classification and survival data a consistent pattern of performance emerged for the 3 estimators All pro-cedures tended to under estimate variance for strong variables and over estimate variance for weak variables This wasespecially problematic for the subsampling estimator in low subsampling rate scenarios The delete‐d jackknife didmuch better in this case due to its bias correction Both of these methods improved with increasing subsampling rateeventually outperforming the 164 bootstrap

72 | Computational speed

Overall we generally prefer the delete‐d jackknife because of its better peformance under low subsampling rates whichwe feel will be the bulk of applications due to the computational complexity of VIMP Consider survival with concor-dance error rates the most computationally expensive setting for VIMP The concordance index measures concordanceand discordance over pairs of points a O(n2) operation With M trees the number of computations is O(n2M) for amethod like the 164 bootstrap On the other hand employing a subsampling rate of b=n12 reduces this to O(nM) afactor of ntimes smaller The resulting increase in speed will be of tremendous advantage in big data settings

73 | Practical guidelines

One of the major applications of our methodolgy will be variable selection Below we provide some practical guidelinesfor this setting

1 Use asymptotic normal confidence intervals derived from the delete‐d jackknife variance estimator2 A good default subsampling rate is b=n12 As mentioned this will substantially reduce computational costs in big n

problems In small n problems while this might seem overly aggressive leading to small subsamples our resultshave shown solid performance even when n=250

ISHWARAN AND LU 575

3 The α value for the confidence region should be chosen using typical values such as α=1 or α=05 Outside ofextreme settings such as high‐dimensional problems our experience suggests this should work well

The above guidelines are only meant to be starting points for the analyst and obviously there will beexceptions to the rules However as way of support for these recommendations we did the following variableselection experiment We reran the RF‐R simulations of Section 4 Variables were selected using 100(1minusα)delete‐d jackknife asymptotic normal confidence intervals The true positive rate (TPR) and true negative rate(TNR) was calculated for each simulation and results averaged over 250 independent runs True positive ratewas defined as the fraction of true signal variables identified True negative rate was the fraction of noisy signalvariables identified

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significanceα=25 1 01 001 The top panel was calculated under a subsampling rate b=n12 while the bottom panel used b=n34A similar pattern for TPR and TNR values is generally observed for both sampling rates TPR decreases with increasingα TNR increases with α For b=n34 TNR rates are slightly better however for b=n12 TPR rates are slightly betterThus if the goal is finding true signal the edge goes to b=n12

Focusing on the top panel corresponding to the recommended n12subsampling rate a striking pattern for TPR isthat while TPR decreases with α the decline is fairly slow This is interesting given the wide range of α values from25 to 01 These values are extreme and unlikely to be used in practice and yet TPR results remain quite robustValues for TNR are also fairly robust to α although TNR values appear relatively too small when α=25 The valueα=25 is too extreme and creates overly narrow confidence regions causing noisy variables to be misclassified as signalvariables Generally however values α=1 05 01 perform very well under both TNR and TPR

FIGURE 9 Results from variable selection experiment using random forestndashregression simulations of Section 4 Displayed are the true

positive rate and true negative rate for variables selected using 100(1minusα) delete‐d jackknife confidence regions where α= 25 1 01

001 Top and bottom figures are based on subsampling rates b=n12 and b=n34

576 ISHWARAN AND LU

74 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator However asdiscussed earlier theoretical results for VIMP are difficult to come by and establishing a result like (8) for something ascomplicated as permuation importance is not easy As a token we would like to offer some partial insight into VIMP forthe regression case (RF‐R) perhaps pointing the way for more work in this area As shown in the appendix (see The-

orem 1) assuming an additive model hethXTHORN frac14 sumpjfrac141hjethXeth jTHORN

i THORN the population mean for VIMP equals

Efrac12IethXeth jTHORNΘLTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj is a correlation coefficient and hj is the additive expansion of h attributed to X(j) For noisy variables hj=0 andρj=1 thus VIMP will converge to zero For strong variables hjne0 Our theory suggests that the value of ρj will be thesame for all strong variables Therefore for strong variables except for some constant VIMP equals the amount that hjchanges when X(j) is permuted thus showing that VIMP correctly isolates the effect of X(j) in the model

The technical assumptions required by Theorem 1 are provided in the appendix however there are 2 key conditionsworth briefly mentioning One is the use of deep trees in which terminal nodes contain exactly one unique value (rep-licated values due to bootstrapping are allowed) A second condition is that the forest predictor is L2‐consistent for h Asdiscussed in the appendix this latter assumption is reasonable in our setting and has been proven by Scornet et al35

It is interesting that the above property for VIMP is tied to the consistency of the forest We believe in general thatproperties for VIMP such as its limiting distribution will rely on analogous results for the RF predictor Hopefully inthe future these results for VIMP will be proven At least in the case of RF‐R we know that distributional results existfor the predictor Wager19 established asymptotic normality of the infinite forest predictor (assuming one observationper terminal node) Mentch and Hooker20 established a similar result for the finite forest predictor See Biau andScornet36 for a comprehensive discussion of known theoretical results for RF

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran httporcidorg0000-0003-2758-9647

REFERENCES

1 Breiman L Random forests Mach Learn 2001455‐32

2 Ishwaran H Kogalur UB Blackstone EH Lauer MS Random survival forests Ann Appl Stat 20082(3)841‐860

3 Ishwaran H Gerds TA Kogalur UB Moore RD Gange SJ Lau BM Random survival forests for competing risks Biostatistics201415(4)757‐773

4 Hothorn T Hornik K Zeileis A Unbiased recursive partitioning a conditional inference framework J Comput Graph Stat200615(3)651‐674

5 Zhu R Kosorok MR Recursively imputed survival trees J Am Stat Assoc 2012107(497)331‐340

6 Breiman L Bagging predictors Mach Learn 199624(2)123‐140

7 Ishwaran H Variable importance in binary regression trees and forests Electron J Stat 20071519‐537

8 Groumlmping U Variable importance assessment in regression linear regression versus random forest Am Stat 200963(4)308‐319

9 Genuer R Poggi J‐M Tuleau‐Malot Christine Variable selection using random forests Pattern Recogn Lett 201031(14)2225‐2236

10 Breiman L Friedman J Stone CJ Olshen RA Classification and Regression Trees New York CRC press 1984

11 Breiman L Manual on setting up using and understanding random forests v31 2002

12 Louppe G Wehenkel L Sutera A Geurts P Understanding variable importances in forests of randomized trees In Advances in NeuralInformation Processing Systems 2013431‐439

13 Breiman L Out‐of‐bag estimation Technical report Berkeley CA 94708 Statistics Department University of California Berkeley 1996b199633 34

ISHWARAN AND LU 577

14 Archer KJ Kimes RV Empirical characterization of random forest variable importance measures Comput Stat Data Anal200852(4)2249‐2260

15 Nicodemus KK Malley JD Strobl C Ziegler A The behaviour of random forest permutation‐based variable importance measures underpredictor correlation BMC Bioinformatics 201011(1)110‐123

16 Gregorutti B Michel B Saint‐Pierre P Correlation and variable importance in random forests Stat Comput 201727(3)659‐678

17 Zhu R Zeng D Kosorok MR Reinforcement learning trees J Am Stat Assoc 2015110(512)1770‐1784

18 Sexton J Laake P Standard errors for bagged and random forest estimators Comput Stat Data Anal 200953(3)801‐811

19 Wager S Hastie T Efron B Confidence intervals for random forests the jackknife and the infinitesimal jackknife J Mach Learn Res201415(1)1625‐1651

20 Mentch L Hooker G Quantifying uncertainty in random forests via confidence intervals and hypothesis tests J Mach Learn Res201617(1)841‐881

21 Politis DN Romano JP Large sample confidence regions based on subsamples under minimal assumptions Ann Stat199422(4)2031‐2050

22 Shao J Wu CJ A general theory for jackknife variance estimation Ann Stat 198917(3)1176‐1197

23 Graf E Schmoor C Sauerbrei W Schumacher M Assessment and comparison of prognostic classification schemes for survival data StatMed 199918(17‐18)2529‐2545

24 Gerds TA Schumacher M Consistent estimation of the expected brier score in general survival models with right‐censored event timesBiom J 200648(6)1029‐1040

25 Hoeffding W Probability inequalities for sums of bounded random variables J Am Stat Assoc 196358(301)13‐30

26 Radulović D On the subsample bootstrap variance estimation Test 19987(2)295‐306

27 Friedman JH Multivariate adaptive regression splines Ann Stat 19911‐67

28 Biau G Fischer A Guedj B Malley JD COBRA a combined regression strategy J Multivar Anal 201614618‐28

29 Ishwaran H Kogalur UB Random forests for survival regression and classification (RF‐SRC) https cranr-projectorgwebpackagesrandomForestSRC R package version 250 2017

30 Leisch F Dimitriadou E Mlbench machine learning benchmark problems httpscranr-projectorg webpackagesmlbench R packageversion 21‐1 2010

31 Kuhn M Caret classification and regression training httpscranr-projectorgwebpackagescaret R package version 60‐77 2017

32 Harrell FE Califf RM Pryor DB Lee KL Rosati RA Evaluating the yield of medical tests JAMA 1982247(18)2543‐2546

33 Hsich E Gorodeski EZ Blackstone EH Ishwaran H Lauer MS Identifying important risk factors for survival in patient with systolicheart failure using random survival forests Circ Cardiovasc Qual Outcomes 20114(1)39‐45

34 Breiman L Software for the masses Banff Alberta Canada Lecture III IMS Wald Lectures 2002

35 Scornet E Biau G Vert J‐P Consistency of random forests Ann Stat 201543(4)1716‐1741

36 Biau G Scornet E A random forest guided tour Test 201625(2)197‐227

How to cite this article Ishwaran H Lu M Standard errors and confidence intervals for variable importancein random forest regression classification and survival Statistics in Medicine 201938558ndash582 httpsdoiorg101002sim7803

APPENDIX

Assessing normality for classification survival and competing risk

We applied the same strategy as in RF‐R simulations to assess normality of VIMP for RF‐C RSF and RSF competingrisk simulations Specifically for each setting we ran simulations 1000 times independently Experimental parameterswere set as before with n=2500 The finite forest VIMP for a variable was centered by its averaged value from the1000 simulations This centered value was then divided by the standard deviation of the 1000 VIMP values Quantilebias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normalquantile Quantile bias is displayed in Figure A1 for the 3 families

FIGURE A1 Assessing asymptotic normality of variable importance from RF‐C RSF and RSF competing risk simulations Figure

displays bias of standardized variable importance quantiles compared with standard normal quantiles Values are displayed for 5 10 25

50 75 90 95 percentile values

578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathscr{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of a RF-R tree, we have
$$h_n = h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$
where $\{W_i(X, \Theta)\}_1^n$ are the forest weights defined as follows:
$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$
where $R(X, \Theta)$ is the tree terminal node containing $X$ and $|R(X, \Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex since $0 \le W_i(X, \Theta) \le 1$ and
$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$
Similarly, we have
$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$
Therefore,
$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) - h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then
$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|} \sum_{i \in R} n_i Y_i = \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|} \sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|} \sum_{i \in R} n_i \varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$
where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first 2 terms in the last line and therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then
$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$
where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,
$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$
where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then
$$\theta_{n,0}^{(j)} = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$
where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,
$$E[\tilde{\Delta}_n(h_n - h)] \le E\big[|\tilde{\Delta}_n|\,|h_n - h|\big] \le \sqrt{E[\tilde{\Delta}_n^2]} \sqrt{E[(h_n - h)^2]}.$$
By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),
$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$
Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, we deduce that
$$E[\tilde{\Delta}_n(X)^2] = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\right] + 2E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\right] + E\left[\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)^2\right] + o(1).$$
By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,
$$0 = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(\tilde{R})}\right] - E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(R)}\right] = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\right].$$
Appealing to exchangeability once more, we have
$$E\left[\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)^2\right] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$
where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown
$$E[\tilde{\Delta}_n(X)^2] = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$
which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for $h$. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features $X$ within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing the number of terminal nodes to grow at a rate of $n$ (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required $X$ to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

FIGURE 4 Coverage of variable importance 90% asymptotic normal confidence intervals from random forest–regression simulations. Left- and right-hand side figures based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$, respectively. Confidence regions for the 240 variables from the 12 simulation experiments have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite variable importance values. [Colour figure can be viewed at wileyonlinelibrary.com]

FIGURE 3 Assessing asymptotic normality of variable importance (VIMP) from random forest–regression simulations. Left-hand figure displays normal quantile plots for standardized VIMP for each of the 12 simulations. Right-hand figure displays bias of VIMP quantiles compared with standard normal quantiles for all 240 variables from all 12 simulations.


5 | RANDOM FOREST CLASSIFICATION: RF-C

5.1 | Simulations

The following simulations were used to study performance of methods in the classification problem:

1. Threenorm simulation using mlbench.threenorm from the mlbench R-package.30
2. Two class simulation using twoClassSim from the caret R-package,31 with 2 factors, 5 linear, and 3 nonlinear variables.
3. Same as 2, but with a ρ = .75 exchangeable correlation.
4. Same as 2, but with 15 linear variables.


5. Same as 2, but with 15 linear variables and a ρ = .75 exchangeable correlation.
6. RF-R simulation 6 with y discretized into 2 classes based on its median.
7. RF-R simulation 8 with y discretized into 2 classes based on its median.
8. RF-R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles.
9. RF-R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles.
10. RF-R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles.

In simulation 1, the feature space dimension was p = 20. Simulations 2 to 4 added d = 10 noise variables (see the caret package for details). Simulations 6 to 10 added d = 10 noise variables from a U[−1, 1] distribution. Experimental parameters were set as in the RF-R simulations: n = 250, $b = n^{1/2}, n^{3/4}$, 100 bootstrap samples, and 100 subsample draws. Parameters for randomForestSRC were set as in RF-R, except for random feature selection, which used $p^{1/2}$ random features (default setting). The entire procedure was repeated 250 times.
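As a concrete sketch of simulation 1 under the settings above, data could be generated and a classification forest fit as follows. The generator mlbench.threenorm and the rfsrc() fitter are the tools named in the text, but the seed and ntree value shown here are illustrative assumptions, not the exact experimental settings.

    library(mlbench)          # mlbench.threenorm benchmark generator
    library(randomForestSRC)  # rfsrc()

    ## simulation 1: threenorm data with p = 20 features, n = 250
    set.seed(1)
    sim <- mlbench.threenorm(250, d = 20)
    dat <- data.frame(y = sim$classes, sim$x)

    ## classification forest; mtry defaults to p^(1/2) for factor outcomes
    o <- rfsrc(y ~ ., data = dat, ntree = 1000, importance = "permute")
    print(o$importance)       # permutation VIMP for each of the 20 features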

5.2 | Brier score

Error performance was assessed using the normalized Brier score. Let $Y \in \{1, \ldots, K\}$ be the response. If $0 \le p_k \le 1$ denotes the predicted probability that $Y$ equals class $k$, $k = 1, \ldots, K$, the normalized Brier score is defined as follows:
$$BS^* = \frac{100 K}{K - 1} \sum_{k=1}^{K} \big(1\{Y = k\} - p_k\big)^2.$$
Note that the normalizing constant $100K/(K-1)$ used here is different than the value $1/K$ typically used for the Brier score. We multiply the traditional Brier score by $100K^2/(K-1)$ because we have noticed that the value of the Brier score under random guessing depends on the number of classes K. As K increases, the Brier score under random guessing converges to 1. The normalizing constant used here resolves this problem and yields a value of 100 for random guessing regardless of K. Thus, anything below 100 signifies a classifier that is better than pure guessing. A perfect classifier has value 0.
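As a minimal sketch, the normalized Brier score can be computed as follows; the function name brier_norm and its arguments are ours, not part of any package. Under random guessing ($p_k = 1/K$), the sum equals $1 - 1/K$, so the normalization $100K/(K-1)$ returns exactly 100 for any K.

    ## normalized Brier score: y is the observed class (integer in 1..K),
    ## prob is the length-K vector of predicted class probabilities
    brier_norm <- function(y, prob) {
      K <- length(prob)
      100 * K / (K - 1) * sum((as.numeric(seq_len(K) == y) - prob)^2)
    }

    brier_norm(2, rep(1/4, 4))   # random guessing: returns 100 for any K
    brier_norm(2, c(0, 1, 0, 0)) # perfect prediction: returns 0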

Although misclassification error is the commonly used performance measure in classification problems, it cannot be overstated how much stabler the Brier score is. This stability will naturally lend itself to stabler estimates of VIMP, which is why we have chosen to use it. As a simple illustration of this, consider Figure 5, which displays the OOB misclassification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R-package. Observe that misclassification error remains unstable even with M = 20,000 trees.

FIGURE 5 Out-of-bag misclassification error rate versus out-of-bag normalized Brier score for Wisconsin breast cancer data (obtained from the mlbench R-package). Note the fluctuations in misclassification error even after 20,000 trees, in contrast to the stable behavior of the Brier score.


5.3 | Results

Figure 6 displays the results from the RF-C simulations. Values are displayed as in RF-R. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. The top and middle figures show bias and standardized MSE for estimating the standard error. The bottom figures are coverage probabilities for 90% confidence intervals. Variable importance confidence intervals were calculated using asymptotic normality as in RF-R (see the appendix for justification of normality). The conclusions from Figure 6 are similar to those for RF-R. The delete-d jackknife is more accurate in estimating the standard error for strong variables when the subsampling rate is small, but as b increases, the subsampling estimator improves. Both estimators generally improve with increased b. Note that, unlike RF-R, coverage probability for the delete-d jackknife is better than the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

FIGURE 6 Results from random forest–classification simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. [Colour figure can be viewed at wileyonlinelibrary.com]

6 | RANDOM SURVIVAL FORESTS: RSF

Now we consider the survival setting. We begin by first defining the survival framework using the notation of Section 2. Following this, we discuss 2 different methods that can be used for measuring prediction error in survival settings. Illustrative examples follow.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y = (T, \delta)$, where $T = \min(T^o, C^o)$ is the observed survival time and $\delta = 1\{T^o \le C^o\}$ is the right-censoring indicator. Here, $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, $\delta = 1$ denotes an event, such as death, while $\delta = 0$ denotes a right-censored case. The target function $h$ is the conditional survival function $h(x) = P\{T^o > t \mid X = x\}$, where $t$ is some selected time point.

6.2 | Weighted Brier score

Let $\hat{h}$ be an estimator of $h$. One method for measuring performance of $\hat{h}$ is the weighted Brier score,23,24 defined as follows:
$$\mathrm{wBS}(t) = \big(1\{T > t\} - \hat{h}\big)^2 w(t, Y, G),$$
where $w(t, Y, G)$ is the weight defined by
$$w(t, Y, G) = \frac{1\{T \le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)},$$
and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function $\ell$ under the weighted Brier score can be written as follows:
$$\ell(Y, \hat{h}) = \big(1\{T > t\} - \hat{h}\big)^2 w(t, Y, G).$$
This assumes $G$ is a known function, but in practice, $G$ must be estimated.23,24 Thus, if $\hat{G}$ is an estimator of $G$, $w(t, Y, G)$ is replaced by $w(t, Y, \hat{G})$.
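The following R sketch illustrates one way the weighted Brier score could be computed at a single time point, estimating $G$ with a Kaplan-Meier fit to the censoring distribution (status indicator flipped). The helper name wbrier is ours, and this is a sketch under those assumptions rather than the exact estimators studied in Graf et al23 or Gerds and Schumacher.24

    library(survival)  # survfit(), Surv()

    ## weighted Brier score at time t0 for predicted survival probabilities surv_hat;
    ## time, status: observed follow-up and event indicator (1 = event, 0 = censored)
    wbrier <- function(time, status, surv_hat, t0) {
      ## Kaplan-Meier estimate of the censoring survival function G
      km_cens <- survfit(Surv(time, 1 - status) ~ 1)
      G <- stepfun(km_cens$time, c(1, km_cens$surv))
      ## weights: events before t0 use G(T-); cases still at risk at t0 use G(t0);
      ## cases censored before t0 get weight zero
      w <- ifelse(time <= t0 & status == 1, 1 / G(time - 1e-8),
                  ifelse(time > t0, 1 / G(t0), 0))
      mean((as.numeric(time > t0) - surv_hat)^2 * w)
    }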

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor $\hat{h}$ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason, we will base our analysis on it. Note that because the concordance index is calculated by comparing discordant pairs to concordant pairs, and therefore is very complex, it is not possible to express it in terms of the $\ell$-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; it does not pose any problems to the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2
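As a rough sketch of the pairwise logic, Harrell's concordance index can be computed by looping over usable pairs. The simplified implementation below is ours (ties are ignored) and is not the exact estimator used by the software.

    ## Harrell's concordance index: risk is a predicted risk score
    ## (higher risk should imply shorter survival)
    cindex <- function(time, status, risk) {
      conc <- disc <- 0
      n <- length(time)
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          ## a pair is usable if the shorter follow-up ended in an event
          if (time[i] < time[j] && status[i] == 1) {
            if (risk[i] > risk[j]) conc <- conc + 1 else disc <- disc + 1
          } else if (time[j] < time[i] && status[j] == 1) {
            if (risk[j] > risk[i]) conc <- conc + 1 else disc <- disc + 1
          }
        }
      }
      conc / (conc + disc)
    }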

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n = 2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p = 39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated from 1000 subsampled values at a $b = n^{1/2}$ subsampling rate (b = 472). We preferred the delete-d jackknife to the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.
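In code, an analysis of this kind might look like the sketch below. The data set itself is not public, so hf_data (with columns time and status) is a placeholder; the subsample() call reflects the variance-estimation utility shipped with recent versions of randomForestSRC, but its exact arguments should be checked against the package documentation.

    library(randomForestSRC)

    ## RSF fit: 250 trees, nodesize 30, remaining parameters at package defaults
    ## (hf_data is a placeholder for the systolic heart failure data)
    o <- rfsrc(Surv(time, status) ~ ., data = hf_data,
               ntree = 250, nodesize = 30, importance = "permute")

    ## delete-d jackknife variance for VIMP via subsampling
    smp <- subsample(o, B = 1000, bootstrap = FALSE)

    ## 95% asymptotic normal confidence intervals for VIMP
    plot.subsample(smp, alpha = .05)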

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP which have confidence regions bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention. [Colour figure can be viewed at wileyonlinelibrary.com]

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald Lectures.34 Let $h(t, x)$ be the hazard function for covariate $x$ at time $t$. Simulations are as follows:

1. $h(t, x) = \exp(x_1 - 2x_4 + 2x_5)$.
2. If $x_1 \le .5$, $h(t, x) = \exp(x_2)\,1\{t \notin [.5, 2.5]\}$. If $x_1 > .5$, $h(t, x) = \exp(x_3)\,1\{t \notin [2.5, 4.5]\}$.
3. $h(t, x) = (1 + z_2 t) \exp(z_1 + z_2 t)$, where $z_1 = 5x_1$ and $z_2 = x_4 + x_5$.

In all simulations, covariates were independently sampled from a U[0, 1] distribution. Noise variables were added to increase the dimension to p = 10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
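For a constant hazard like simulation 1, event times can be drawn by inverse transform sampling, since $T^o$ is exponential with rate $\exp(x_1 - 2x_4 + 2x_5)$. A minimal sketch follows; the uniform censoring bound c_max is our assumption, chosen only to illustrate censoring drawn independently of the event time.

    ## simulation 1: Cox model with constant hazard exp(x1 - 2x4 + 2x5)
    set.seed(1)
    n <- 250; p <- 10
    x <- matrix(runif(n * p), n, p)      # U[0,1] covariates; unused coordinates are noise
    rate <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])
    event_time <- rexp(n, rate = rate)   # constant hazard => exponential event times
    c_max <- 6                           # assumed censoring bound, for illustration only
    cens_time <- runif(n, 0, c_max)      # censoring independent of event time
    dat <- data.frame(time = pmin(event_time, cens_time),
                      status = as.numeric(event_time <= cens_time), x)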


RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in the previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP. [Colour figure can be viewed at wileyonlinelibrary.com]

6.6 | Competing risk simulations

Here, we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards, as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model. All p covariate effects are linear.
2. Quadratic model. A subset of the p covariate effects are quadratic.
3. Interaction model. Same as 1, but interactions between certain p variables were included.

The feature dimension was p = 12 for simulations 1 and 2 and p = 17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With M trees, the number of computations is $O(n^2 M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b = n^{1/2}$ reduces this to $O(nM)$, a factor of $n$ times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.
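To make the accounting concrete, here is a bare-bones subsampling loop. The function fit_and_vimp() is a placeholder for any VIMP computation for a single variable (eg, a concordance-based one costing $O(b^2)$ pairs per call); the $(b/n)$ rescaling follows the general subsampling variance argument of Politis and Romano,21 and the delete-d variant shown uses a Shao-and-Wu-style $b/(n-b)$ rescaling22 as an assumption of this sketch, not necessarily the exact estimator defined earlier in the paper.

    ## generic subsampling variance sketch for a scalar VIMP statistic;
    ## fit_and_vimp(data) is a placeholder returning VIMP for one variable
    subsample_var <- function(data, fit_and_vimp, b, B = 100) {
      n <- nrow(data)
      theta_n <- fit_and_vimp(data)                 # full-sample VIMP
      theta_b <- replicate(B, {
        idx <- sample.int(n, b)                     # size-b subsample, no replacement
        fit_and_vimp(data[idx, , drop = FALSE])     # each call costs O(b^2) pairs
      })
      list(subsample = (b / n) * var(theta_b),      # Politis-Romano scaling
           jackknife = (b / (n - b)) *
             mean((theta_b - theta_n)^2))           # delete-d form (assumed)
    }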

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b = n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n = 250.


3. The α value for the confidence region should be chosen using typical values, such as α = .1 or α = .05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was defined as the fraction of noisy variables identified.
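As an illustration of guideline 1, the selection rule can be implemented in a few lines given VIMP point estimates and delete-d jackknife standard errors. Here, vimp_hat and se_hat are placeholders for those quantities, and signal_idx marks which variables truly carry signal in a simulation.

    ## select variables whose 100(1 - alpha)% CI lies strictly above zero
    select_vars <- function(vimp_hat, se_hat, alpha = .1) {
      lower <- vimp_hat - qnorm(1 - alpha / 2) * se_hat
      which(lower > 0)
    }

    ## TPR/TNR against the known truth of a simulation (signal_idx assumed known)
    tpr_tnr <- function(selected, signal_idx, p) {
      noise_idx <- setdiff(seq_len(p), signal_idx)
      c(TPR = mean(signal_idx %in% selected),
        TNR = mean(!(noise_idx %in% selected)))
    }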

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = .25, .1, .01, .001. The top panel was calculated under a subsampling rate $b = n^{1/2}$, while the bottom panel used $b = n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases and TNR increases as α decreases. For $b = n^{3/4}$, TNR rates are slightly better; however, for $b = n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b = n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α = .25. The value α = .25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α = .1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest–regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$.


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals
$$E[I(X^{(j)}, \Theta, \mathscr{L})] = E\left[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\right] + 2\sigma^2(1 - \rho_j) + o(1),$$
where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of $h$ attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus, VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for $h$. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072.

ORCID

Hemant Ishwaran https://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0. 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1. 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77. 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803


578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg

580 ISHWARAN AND LU

We can see that (A2) is driven by the 2 terms ~Δn and hnminush Define integer values ni=ni(Θ)ge0 recording the bootstrapfrequency of case i=1hellipn in LlowastethΘTHORN (notice that ni=0 implies case i is OOB) By the definition of a RF‐R tree we have

hn frac14 hethXΘLTHORN frac14 sumn

ifrac141WiethXΘTHORNYi

where fWiethXΘTHORNgn1 are the forest weights defined as follows

WiethXΘTHORN frac14 ni1fXiisinRethXΘTHORNgjRethXΘTHORNj

where R(XΘ) is the tree terminal node containingX and |R(XΘ)| is the cardinality equal to the number of bootstrap casesin R(X Θ) Notice that the weights are convex since 0leWi(X Θ)le1 and

sumn

ifrac141WiethXΘTHORN frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgjRethXΘTHORNj frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgsumn

iprimefrac141niprime1fXiprimeisinRethXΘTHORNgfrac14 1

Similarly we have

~hn frac14 heth~Xeth jTHORNΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYi whereWieth~Xeth jTHORNΘTHORN frac14 ni1fXiisinReth~Xeth jTHORNΘTHORNg

jReth~Xeth jTHORNΘTHORNj

Therefore

~ΔnethXTHORN frac14 heth~Xeth jTHORNΘLTHORNminushethXΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYiminussum

n

ifrac141WiethXΘTHORNYi

To study ~Δn in more detail we will assume deep trees containing one unique case per terminal node

Assumption 1 We assume each terminal node contains exactly one unique value That is each terminalnode contains the bootstrap copies of a unique data point

Assumption 1 results in the following useful simplification For notational ease write ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ)Then

~ΔnethXTHORN frac14 1

j~Rjsumiisin~R

niY iminus1jRjsumiisinRniY i

frac14 1

j~Rjsumiisin~R

nihethXiTHORNminus 1jRjsumiisinRnihethXiTHORN thorn 1

j~Rjsumiisin~R

niεiminus1jRjsumiisinRniεi

frac14 hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN thorn εieth~RTHORNminusεiethRTHORN

where ieth~RTHORN and i(R) identify the index for the bootstrap case in ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ) respectively (note that

ieth~RTHORN and i(R) are functions of X and j but this is suppressed for notational simplicity) We can see that the informationin the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understandingVIMP Notice that if hethXieth~RTHORNTHORN≍heth~Xeth jTHORNTHORN and h(Xi(R))≍h(X) which is what we would expect asymptotically with a deeptree then

hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN≍heth~Xeth jTHORNTHORNminushethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORN

where the right‐hand side follows by our assumption of an additive model (A1) This shows that VIMP for X(j) isassessed by how much its contribution to the additive expansion hj changes when X(j) is permuted This motivatesthe following assumption

ISHWARAN AND LU 581

Assumption 2 Assuming a deep tree with one unique value in a terminal node

heth~Xeth jTHORNTHORN frac14 hethXieth~RTHORNTHORN thorn ~ζnethXTHORN hethXTHORN frac14 hethXiethRTHORNTHORN thorn ζnethXTHORN

where Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN

This deals with the first 2 terms in the expansion of ~ΔnethXTHORN We also need to deal with the remaining term involving

the measurement errors εieth~RTHORNminusεiethRTHORN For this we will rely on a fairly mild exchangeability assumption

Assumption 3 ε ε is a finite exchangeable sequence with variance σ2

ieth~RTHORN iethRTHORN

Finally a further assumption we will need is consistency of the forest predictor2 2

Assumption 4 The forest predictor is L2‐consistent Efrac12ethhnminushTHORN rarr0 where Efrac12h ltinfin

Putting all of the above together we can now state our main result

Theorem 1 If Assumptions 1 2 3 and 4 hold then

θeth jTHORNn0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN

Note that the asymptotic limit will be heavily dependent on the strength of the variable Consider when X(j) is a noisy

variable Ideally this means the tree is split without ever using X(j) Therefore if we dropX and ~Xeth jTHORN down the tree they willoccupy the same terminal node Hence ~R frac14 R and εieth~RTHORN frac14 εiethRTHORN and therefore ρj=1 Furthermore because hj must be zerofor a noisy variable it follows that θeth jTHORNn0 frac14 oeth1THORN Thus the limit is zero for a noisy variable Obviously this is much differentthan the limit of a strong variable which must be strictly positive because hjne0 and ρjlt1 for strong variables

eth jTHORN ~2 ~

Proof By (A2) we have θn0 frac14 Efrac12Δn thorn 2Efrac12ΔnethhnminushTHORN We start by dealing with the second term

Efrac12~ΔnethhnminushTHORN By the Cauchy‐Schwartz inequality

Efrac12~ΔnethhnminushTHORNleEfrac12j~ΔnjjhnminushjleffiffiffiffiffiffiffiffiffiffiffiEfrac12~Δ2

nq ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Efrac12ethhnminushTHORN2q

By Assumption 4 the right‐hand side converges to zero if Efrac12~Δ2n remains bounded By Assumption 2 and

the assumption of an additive model (A1)

~ΔnethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNminus~ζnethXTHORN thorn ζnethXTHORN thorn εieth~RTHORNminusεiethRTHORN

Assumption 4 implies h (and therefore hj) is square‐integrable Assumption 3 implies that εieth~RTHORN εiethRTHORN havefinite second moment and are square‐integrable Therefore squaring and taking expectations and using

Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN deduce that

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2Efrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN thorn Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 thorn oeth1THORN

By exchangeability Efrac12gethXTHORNεieth~RTHORN frac14 Efrac12gethXTHORNεiethRTHORN for any function g(X) Hence

0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORNεieth~RTHORNminusEfrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNεiethRTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN

Appealing to exchangeability once more we have

Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 frac14 Efrac12ε2ieth~RTHORN thorn Efrac12ε2iethRTHORNminus2Efrac12ethεieth~RTHORNεiethRTHORNTHORN frac14 2σ2eth1minusρjTHORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN Therefore we have shown

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

which verifies boundedness of Efrac12~Δ2n and that θeth jTHORNn0 frac14 Efrac12~Δ2

n thorn oeth1THORN

582 ISHWARAN AND LU

The conditions needed to establish Theorem 1 are fairly reasonable Assumption 2 can be viewed as a type of continuitycondition for h However it is also an assertion about the approximating behavior of the forest predictor hn It asserts thatall features X within a terminal node have h(X) values close to one another which can be seen as an indirect way ofasserting good local prediction behavior for the tree Thus Assumption 2 is very similar to Assumption 4 The latterassumption of consistency is reasonable for deep trees under an additive model assumption Scornet et al35 establishedL2‐consistency of RF‐R for additive models allowing for the number of terminal nodes to grow at rate of n (see theorem2 of their paper) For technical reasons their proof replaced bootstrapping with subsampling and required X to be uni-formly distributed but other than this their result can be seen as strong support for our assumptions

ISHWARAN AND LU 569

5 Same as 2 but with 15 linear variables and a ρ=75 exchangeable correlation6 RF‐R simulation 6 with y discretized into 2 classes based on its median7 RF‐R simulation 8 with y discretized into 2 classes based on its median8 RF‐R simulation 9 with y discretized into 3 classes based on its 20th and 75th quantiles9 RF‐R simulation 10 with y discretized into 3 classes based on its 20th and 75th quantiles10 RF‐R simulation 11 with y discretized into 3 classes based on its 20th and 75th quantiles

In simulation 1 the feature space dimension was p=20 Simulations 2 to 4 added d=10 noise variables (see the caretpackage for details) Simulations 6 to 10 added d=10 noise variables from aU [minus11] distribution Experimental param-eters were set as in RF‐R simulations n=250 b=n12n34 100 bootstrap samples and 100 subsample draws Parame-ters for randomForestSRC were set as in RF‐R except for random feature selection which used p12 random features(default setting) The entire procedure was repeated 250 times

52 | Brier score

Error performance was assessed using the normalized Brier score Let Yisin1hellipK be the response If 0lepkle1 denotes thepredicted probability that Y equals class k k=1hellipK the normalized Brier score is defined as follows

BSlowast frac14 100KKminus1

sumK

kfrac141eth1fY frac14 kgminuspkTHORN2

Note that the normalizing constant 100K(Kminus1) used here is different than the value 1K typically used for the Brierscore We multiply the traditional Brier score by 100K2(Kminus1) because we have noticed that the value for the Brier scoreunder random guessing depends on the number of classes K If K increases the Brier score under random guessing con-verges to 1 The normalizing constant used here resolves this problem and yields a value of 100 for random guessingregardless of K Thus anything below 100 signifies a classifier that is better than pure guessing A perfect classifierhas value 0

Although misclassification error is the commonly used performance measure in classification problems it cannot beoverstated how much stabler the Brier score is This stability will naturally lend itself to stabler estimates of VIMPwhich is why we have chosen to use it As a simple illustration of this consider Figure 5 that displays the OOB misclas-sification rate and OOB normalized Brier score for the Wisconsin breast cancer data from the mlbench R‐packageObserve that misclassification error remains unstable even with M=20000 trees

FIGURE 5 Out‐of‐bag misclassification error rate versus out‐of‐bag normalized Brier score for Wisconsin breast cancer data (obtained

from the mlbench R‐package) Note the fluctuations in misclassification error even after 20000 trees in contrast to the stable behavior of

the Brier score

570 ISHWARAN AND LU

53 | Results

Figure 6 displays the results from the RF‐C simulations Values are displayed as in RF‐R Left‐ and right‐hand side fig-ures are based on subsampling rates b=n12 and b=n34 The top and middle figures show bias and standardized MSE forestimating the standard error The bottom figures are coverage probabilities for 90 confidence intervals Variableimportance confidence intervals were calculated using asymptotic normality as in RF‐R (see the appendix for justifica-tion of normality) The conclusions from Figure 6 are similar to those for RF‐R The delete‐d jackknife is more accuratein estimating the standard error for strong variables when the subsampling rate is small but as b increases the subsam-pling estimator improves Both estimators generally improve with increased b Note unlike RF‐R coverage probability

FIGURE 6 Results from random forest‐classification simulations showing performance of subsampling delete‐d jackknife and 164

bootstrap Left‐ and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and

standardized mean‐squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for

VIMP 90 asymptotic normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th

percentiles of true finite VIMP [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 571

for the delete‐d jackknife is better than the subsampling estimator This is probably because there are more variableswith moderate signal in these simulations

6 | RANDOM SURVIVAL FORESTS RSF

Now we consider the survival setting We begin by first defining the survival framework using the notation of Section 2Following this we discuss 2 different methods that can be used for measuring prediction error in survival settings Fol-lowing this are illustrative examples

61 | Notation

We assume a traditional right‐censoring framework The response is Y=(T δ) where T frac14 minethToCoTHORN is the observedsurvival time and δ=1ToleCo is the right‐censoring indicator Here (To Co) denote the unobserved event and censoringtimes Thus δ=1 denotes an event such as death while δ=0 denotes a right‐censored case The target function h is theconditional survival function hethxTHORN frac14 PfTogttjX frac14 xg where t is some selected time point

62 | Weighted brier score

Let ĥ be an estimator of h One method for measuring performance of ĥ is the weighted Brier score2324 defined as fol-lows

wBSethtTHORN frac14 eth1fTgttgminusĥTHORN2wethtY GTHORN

where w(t Y G) is the weight defined by

wethtY GTHORN frac14 1fTletgδGethTminusTHORN thorn 1fTgttg

GethtTHORN

and GethtTHORN frac14 PfCogttg is the survival function of the censoring variable Co Using the notation of Section 2 the loss func-tion ℓ under the weighted Brier score can be written as follows

ℓethY ĥTHORN frac14 eth1fTgttgminusĥTHORN2wethtY GTHORN

This assumes G is a known function but in practice Gmust be estimated2324 Thus ifĜ is an estimator of G w(t Y G) isreplaced by wethtY ĜTHORN

63 | Concordance index

Harrells concordance index32 is another measure of prediction performance that can be used in survival settings The

concordance index estimates the accuracy of the predictor ĥ in ranking 2 individuals in terms of their survival A valueof 1 represents an estimator that has perfect discrimination whereas a value of 05 indicates performance on par with arandom coin toss This intuitive interpretation of performance has made Harrells concordance index very popular andfor this reason we will base our analysis on it Note that because the concordance index is calculated by comparing dis-cordant pairs to concordant pairs and therefore is very complex it is not possible to express it in terms of the ℓ‐lossfunction of Section 2 However this just means that VIMP based on Harrells concordance index is not easily describednotationally in terms of a formal loss but this does not pose any problems to the application of our methodology Per-mutation VIMP based on Harrells concordance index is well defined and can be readily calculated2

64 | Systolic heart failure

For our first illustration we consider a survival data set of n=2231 cardiovascular patients All patients suffered fromsystolic heart failure and all underwent cardiopulmonary stress testing The outcome was defined as all cause mor-tality Over a mean follow‐up of 5 years 742 of the patients died Patient variables included baseline

572 ISHWARAN AND LU

characteristics and exercise stress test results (p=39) More detailed information regarding the data can be found inHsich et al33

An RSF analysis was run on the data A total of 250 survival trees were grown using a nodesize value of 30 with all otherparameters set to default values used by RSF in randomForestSRC software Performance was measured using the C‐index defined as one minus the Harrell concordance index2 The delete‐d jackknife estimator was calculated using 1000subsampled values using a b=n12 subsampling rate (b=472) We preferred to use the delete‐d jackknife rather than thesubsampling estimator because of the low subsampling rate Also we did not use the 164 estimator because it was too slow

The 95 asymptotic normal confidence intervals are given in Figure 7 Variable importance values have been mul-tiplied by 100 for convenient interpretation as percentage Blood urea nitrogen exercise time and peak VO2 have thelargest VIMP with confidence intervals well bounded away from zero All 3 variables are known to be highly predictiveof heart failure and these findings are not surprising More interesting however are several variables with moderatesized VIMP which have confidence regions bounded away from zero Some examples are creatinine clearance sex leftventricular ejection fraction and use of beta‐blockers The finding for sex is especially interesting because sex is oftenunder appreciated for predicting heart failure

65 | Survival simulations

Next we study peformance using simulations We use the 3 survival simulation models described by Breiman in his2002 Wald lectures34 Let h(tx) be the hazard function for covariate x at time t Simulations are as follows

1 hetht xTHORN frac14 expethx1minus2x4 thorn 2x5THORN2 If x1le5 hetht xTHORN frac14 expethx2THORN1ftnotinfrac125 25g If x1gt5 hetht xTHORN frac14 expethx3THORN1ftnotinfrac1225 45g3 hetht xTHORN frac14 eth1thorn z2tTHORNexpethz1 thorn z2tTHORN where z1=5x1 and z2=x4+x5

In all simulations covariates were independently sampled from aU [01] distribution Noise variables were added toincrease the dimension to p=10 Simulation 1 corresponds to a Cox model Simulations 2 and 3 are nonproportional

FIGURE 7 Delete‐d jackknife 95 asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure

data BUN blood urea nitrogen CABG coronary artery bypass graft ICD implantable cardioverter‐defibrillator LVEF left ventricular

ejection fraction PCI percutaneous coronary intervention [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 573

hazards Censoring was simulated independently of time in all simulations Censoring rates were 19 15 and 29respectively

RSF was fit using randomForestSRC using the same tuning values as in RF‐C simulations (nodesize 5 randomfeature selection p12) Experimental parameters were kept the same as previous simulations Experiments wererepeated 250 times Results are displayed using the same format as RF‐R and RF‐C and are provided in Figure 8 Theresults generally mirror our earlier findings bias for the subsampling estimator improves relative to the delete‐d jack-knife with increasing subsampling rate

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards, as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: all p covariate effects are linear.
2. Quadratic model: a subset of the p covariate effects are quadratic.
3. Interaction model: same as 1, but interactions between certain variables were included.

The feature dimension was p = 12 for simulations 1 and 2 and p = 17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time-truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With $M$ trees, the number of computations is $O(n^2 M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b = n^{1/2}$ reduces this to $O(nM)$, a factor of $n$ times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.
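As a quick back-of-the-envelope check of this accounting (with the number of trees $M$ treated as fixed): the concordance calculation on a subsample of size $b$ costs $O(b^2)$, so

$$\frac{O(n^2 M)}{O(b^2 M)} = \frac{n^2}{(n^{1/2})^2} = n \quad \text{when } b = n^{1/2}; \qquad \text{eg, } n = 10^4 \text{ yields a } 10^4\text{-fold saving per VIMP evaluation}.$$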

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below, we provide some practical guidelines for this setting.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator (spelled out in the sketch following this list).
2. A good default subsampling rate is $b = n^{1/2}$. As mentioned, this will substantially reduce computational costs in big-$n$ problems. In small-$n$ problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when $n = 250$.

3. The $\alpha$ value for the confidence region should be chosen using typical values, such as $\alpha = .1$ or $\alpha = .05$. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
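Spelled out in our notation, where $\hat\theta_n^{(j)}$ denotes the estimated VIMP for variable $X^{(j)}$, $\hat v^{(j)}$ the delete-d jackknife variance estimate, and $z_{1-\alpha/2}$ the standard normal quantile, guideline 1 amounts to intervals of the form

$$\hat\theta_n^{(j)} \pm z_{1-\alpha/2}\sqrt{\hat v^{(j)}}, \qquad \text{eg, } \alpha = .05 \text{ gives } z_{.975} = 1.96.$$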

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, in support of these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using $100(1-\alpha)\%$ delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was the fraction of noise variables correctly identified as noise.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance $\alpha = .25, .1, .01, .001$. The top panel was calculated under a subsampling rate $b = n^{1/2}$, while the bottom panel used $b = n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases with increasing $\alpha$, whereas TNR increases with $\alpha$. For $b = n^{3/4}$, TNR rates are slightly better; however, for $b = n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b = n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with $\alpha$, the decline is fairly slow. This is interesting given the wide range of $\alpha$ values, from .25 to .01. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to $\alpha$, although TNR values appear relatively too small when $\alpha = .25$. The value $\alpha = .25$ is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values $\alpha = .1, .05, .01$ perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using $100(1-\alpha)\%$ delete-d jackknife confidence regions, where $\alpha = .25, .1, .01, .001$. Top and bottom figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$

7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, \mathcal{L})] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the component of the additive expansion of $h$ attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, up to a constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for $h$. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor. Wager et al19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node). Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran https://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems. 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0. 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1. 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77. 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n = 2500. The finite forest VIMP for a variable was centered by its averaged value over the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantiles of standardized VIMP and the corresponding standard normal quantiles. Quantile bias is displayed in Figure A1 for the 3 families.
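A minimal R sketch of this quantile-bias calculation, with a Gaussian stand-in for the 1000 simulated VIMP values of a single variable (the stand-in is illustrative only; in the actual experiment v holds the 1000 finite forest VIMP values):

v <- rnorm(1000, mean = .02, sd = .01)      # stand-in for 1000 simulated VIMP values
z <- (v - mean(v)) / sd(v)                  # center by the average, scale by the SD

p <- c(.05, .10, .25, .50, .75, .90, .95)
qbias <- quantile(z, probs = p) - qnorm(p)  # standardized VIMP quantile minus normal quantile
round(qbias, 3)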

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for 5, 10, 25, 50, 75, 90, 95 percentile values

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time-truncated concordance index.3

Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{A1}$$

where $(X_i, \varepsilon_i)$ are iid with distribution $P$ such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice that (A1) implies the target function $h$ has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

$$\theta_n^{(j)} = E_\Theta\!\left[\frac{1}{N(\Theta)} \sum_{i \in \mathcal{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathcal{L})\big)^2\Big]\right].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(ne^{-1})$, where $\mathrm{Round}(\cdot)$ is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n} E_\Theta\!\left[\sum_{i \in \mathcal{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathcal{L})\big)^2\Big]\right].$$

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n0}^{(j)} = E_{\mathcal{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{ni} = h(X_i, \Theta, \mathcal{L})$ and $\tilde{h}_{ni} = h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})$. Likewise, let $h_n = h(X, \Theta, \mathcal{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

$$\theta_{n0}^{(j)} = \frac{1}{N_n} E\!\left[\sum_{i \in \mathcal{L}^{**}(\Theta)} \Big[(Y_i - \tilde{h}_{ni})^2 - (Y_i - h_{ni})^2\Big]\right] = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{ni}, \tilde{h}_{ni})$ if $i$ is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing $h$ for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta_{n0}^{(j)} = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\big] = E\big[\tilde{\Delta}_n^2\big] + 2E\big[\tilde{\Delta}_n(h_n - h)\big], \tag{A2}$$

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.
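For the reader filling in the algebra behind the middle equality of (A2): substituting $Y = h + \varepsilon$ and writing $a = h + \varepsilon - h_n$, so that $Y - h_n = a$ and $Y - \tilde{h}_n = a - \tilde{\Delta}_n$,

$$(Y - \tilde{h}_n)^2 - (Y - h_n)^2 = (a - \tilde{\Delta}_n)^2 - a^2 = \tilde{\Delta}_n^2 - 2a\tilde{\Delta}_n = \tilde{\Delta}_n^2 - 2\varepsilon\tilde{\Delta}_n + 2\tilde{\Delta}_n(h_n - h).$$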


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathcal{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of an RF-R tree, we have

$$h_n = h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$

where $\{W_i(X, \Theta)\}_1^n$ are the forest weights, defined as follows:

$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$

where $R(X, \Theta)$ is the tree terminal node containing $X$ and $|R(X, \Theta)|$ is the cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex, since $0 \le W_i(X, \Theta) \le 1$ and

$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) - h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then

$$\begin{aligned} \tilde{\Delta}_n(X) &= \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i\in R} n_i Y_i \\ &= \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i\in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i\in\tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i\in R} n_i \varepsilon_i \\ &= h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}, \end{aligned}$$

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first 2 terms in the last line, which will therefore be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$E\big[\tilde{\Delta}_n(h_n - h)\big] \le E\big[|\tilde{\Delta}_n||h_n - h|\big] \le \sqrt{E[\tilde{\Delta}_n^2]}\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].$$

Appealing to exchangeability once more, we have

$$E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E\big[\varepsilon_{i(\tilde{R})}^2\big] + E\big[\varepsilon_{i(R)}^2\big] - 2E\big[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}\big] = 2\sigma^2(1 - \rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$, completing the proof.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for $h$. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features $X$ within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing the number of terminal nodes to grow at a rate of $n$ (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required $X$ to be uniformly distributed, but otherwise their result can be seen as strong support for our assumptions.


An RSF analysis was run on the data A total of 250 survival trees were grown using a nodesize value of 30 with all otherparameters set to default values used by RSF in randomForestSRC software Performance was measured using the C‐index defined as one minus the Harrell concordance index2 The delete‐d jackknife estimator was calculated using 1000subsampled values using a b=n12 subsampling rate (b=472) We preferred to use the delete‐d jackknife rather than thesubsampling estimator because of the low subsampling rate Also we did not use the 164 estimator because it was too slow

The 95 asymptotic normal confidence intervals are given in Figure 7 Variable importance values have been mul-tiplied by 100 for convenient interpretation as percentage Blood urea nitrogen exercise time and peak VO2 have thelargest VIMP with confidence intervals well bounded away from zero All 3 variables are known to be highly predictiveof heart failure and these findings are not surprising More interesting however are several variables with moderatesized VIMP which have confidence regions bounded away from zero Some examples are creatinine clearance sex leftventricular ejection fraction and use of beta‐blockers The finding for sex is especially interesting because sex is oftenunder appreciated for predicting heart failure

65 | Survival simulations

Next we study peformance using simulations We use the 3 survival simulation models described by Breiman in his2002 Wald lectures34 Let h(tx) be the hazard function for covariate x at time t Simulations are as follows

1 hetht xTHORN frac14 expethx1minus2x4 thorn 2x5THORN2 If x1le5 hetht xTHORN frac14 expethx2THORN1ftnotinfrac125 25g If x1gt5 hetht xTHORN frac14 expethx3THORN1ftnotinfrac1225 45g3 hetht xTHORN frac14 eth1thorn z2tTHORNexpethz1 thorn z2tTHORN where z1=5x1 and z2=x4+x5

In all simulations covariates were independently sampled from aU [01] distribution Noise variables were added toincrease the dimension to p=10 Simulation 1 corresponds to a Cox model Simulations 2 and 3 are nonproportional

FIGURE 7 Delete‐d jackknife 95 asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure

data BUN blood urea nitrogen CABG coronary artery bypass graft ICD implantable cardioverter‐defibrillator LVEF left ventricular

ejection fraction PCI percutaneous coronary intervention [Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 573

hazards Censoring was simulated independently of time in all simulations Censoring rates were 19 15 and 29respectively

RSF was fit using randomForestSRC using the same tuning values as in RF‐C simulations (nodesize 5 randomfeature selection p12) Experimental parameters were kept the same as previous simulations Experiments wererepeated 250 times Results are displayed using the same format as RF‐R and RF‐C and are provided in Figure 8 Theresults generally mirror our earlier findings bias for the subsampling estimator improves relative to the delete‐d jack-knife with increasing subsampling rate

FIGURE 8 Results from random survival forest simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap

Left‐ and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized

mean‐squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90

asymptotic normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true

finite VIMP [Colour figure can be viewed at wileyonlinelibrarycom]

574 ISHWARAN AND LU

66 | Competing risk simulations

Here we study performance of the methods in a competing risk setting For our analysis we use the competing risksimulations from Ishwaran et al3 Simulations were based on a Cox exponential hazards model with 2 competing eventsCovariates had differing effects on the hazards Models included covariates common to both hazards as well as covar-iates unique to only one hazard We considered 3 of the simulations from section 61 of Ishwaran et al3

1 Linear model All p covariate effects are linear2 Quadratic model A subset of the p covariate effects are quadratic3 Interaction model Same as 1 but interactions between certain p variables were included

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie 5 interaction terms wereadded) Covariates were sampled from both continuous and discrete distributions Performance was measured usingthe time truncated concordance index3 Without loss of generality we record performance for variables related to event1 only RSF competing risk trees were constructed using log‐rank splitting with weight 1 on event 1 and weight 0 onevent 2 (this ensures VIMP identifies only those variables affecting the event 1 cause) RSF parameters and experimentalparameters were identical to the previous simulations For brevity results are given in the appendix in Figure A2 Theresults mirror our previous findings

7 | DISCUSSION

71 | Summary

One widely used tool for peering inside the RF ldquoblack boxrdquo is VIMP But analyzing VIMP is difficult because of the com-plex nature of RF Given the difficulties of theoretical analysis our strategy was to approximate the distribution of VIMPthrough the use of subsampling a general methodology for approximating distributions of complex statistics Wedescribed a general procedure for estimating the variance of VIMP and for constructing confidence intervals

We compared our subsampling estimator and also the closely related delete‐d jackknife22 to the 164 bootstrap esti-mator a modified bootstrap procedure designed to address ties in OOB data Using extensive simulations involvingregression classification and survival data a consistent pattern of performance emerged for the 3 estimators All pro-cedures tended to under estimate variance for strong variables and over estimate variance for weak variables This wasespecially problematic for the subsampling estimator in low subsampling rate scenarios The delete‐d jackknife didmuch better in this case due to its bias correction Both of these methods improved with increasing subsampling rateeventually outperforming the 164 bootstrap

72 | Computational speed

Overall we generally prefer the delete‐d jackknife because of its better peformance under low subsampling rates whichwe feel will be the bulk of applications due to the computational complexity of VIMP Consider survival with concor-dance error rates the most computationally expensive setting for VIMP The concordance index measures concordanceand discordance over pairs of points a O(n2) operation With M trees the number of computations is O(n2M) for amethod like the 164 bootstrap On the other hand employing a subsampling rate of b=n12 reduces this to O(nM) afactor of ntimes smaller The resulting increase in speed will be of tremendous advantage in big data settings

73 | Practical guidelines

One of the major applications of our methodolgy will be variable selection Below we provide some practical guidelinesfor this setting

1 Use asymptotic normal confidence intervals derived from the delete‐d jackknife variance estimator2 A good default subsampling rate is b=n12 As mentioned this will substantially reduce computational costs in big n

problems In small n problems while this might seem overly aggressive leading to small subsamples our resultshave shown solid performance even when n=250

ISHWARAN AND LU 575

3 The α value for the confidence region should be chosen using typical values such as α=1 or α=05 Outside ofextreme settings such as high‐dimensional problems our experience suggests this should work well

The above guidelines are only meant to be starting points for the analyst and obviously there will beexceptions to the rules However as way of support for these recommendations we did the following variableselection experiment We reran the RF‐R simulations of Section 4 Variables were selected using 100(1minusα)delete‐d jackknife asymptotic normal confidence intervals The true positive rate (TPR) and true negative rate(TNR) was calculated for each simulation and results averaged over 250 independent runs True positive ratewas defined as the fraction of true signal variables identified True negative rate was the fraction of noisy signalvariables identified

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significanceα=25 1 01 001 The top panel was calculated under a subsampling rate b=n12 while the bottom panel used b=n34A similar pattern for TPR and TNR values is generally observed for both sampling rates TPR decreases with increasingα TNR increases with α For b=n34 TNR rates are slightly better however for b=n12 TPR rates are slightly betterThus if the goal is finding true signal the edge goes to b=n12

Focusing on the top panel corresponding to the recommended n12subsampling rate a striking pattern for TPR isthat while TPR decreases with α the decline is fairly slow This is interesting given the wide range of α values from25 to 01 These values are extreme and unlikely to be used in practice and yet TPR results remain quite robustValues for TNR are also fairly robust to α although TNR values appear relatively too small when α=25 The valueα=25 is too extreme and creates overly narrow confidence regions causing noisy variables to be misclassified as signalvariables Generally however values α=1 05 01 perform very well under both TNR and TPR

FIGURE 9 Results from variable selection experiment using random forestndashregression simulations of Section 4 Displayed are the true

positive rate and true negative rate for variables selected using 100(1minusα) delete‐d jackknife confidence regions where α= 25 1 01

001 Top and bottom figures are based on subsampling rates b=n12 and b=n34

576 ISHWARAN AND LU

74 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator However asdiscussed earlier theoretical results for VIMP are difficult to come by and establishing a result like (8) for something ascomplicated as permuation importance is not easy As a token we would like to offer some partial insight into VIMP forthe regression case (RF‐R) perhaps pointing the way for more work in this area As shown in the appendix (see The-

orem 1) assuming an additive model hethXTHORN frac14 sumpjfrac141hjethXeth jTHORN

i THORN the population mean for VIMP equals

Efrac12IethXeth jTHORNΘLTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj is a correlation coefficient and hj is the additive expansion of h attributed to X(j) For noisy variables hj=0 andρj=1 thus VIMP will converge to zero For strong variables hjne0 Our theory suggests that the value of ρj will be thesame for all strong variables Therefore for strong variables except for some constant VIMP equals the amount that hjchanges when X(j) is permuted thus showing that VIMP correctly isolates the effect of X(j) in the model

The technical assumptions required by Theorem 1 are provided in the appendix however there are 2 key conditionsworth briefly mentioning One is the use of deep trees in which terminal nodes contain exactly one unique value (rep-licated values due to bootstrapping are allowed) A second condition is that the forest predictor is L2‐consistent for h Asdiscussed in the appendix this latter assumption is reasonable in our setting and has been proven by Scornet et al35

It is interesting that the above property for VIMP is tied to the consistency of the forest We believe in general thatproperties for VIMP such as its limiting distribution will rely on analogous results for the RF predictor Hopefully inthe future these results for VIMP will be proven At least in the case of RF‐R we know that distributional results existfor the predictor Wager19 established asymptotic normality of the infinite forest predictor (assuming one observationper terminal node) Mentch and Hooker20 established a similar result for the finite forest predictor See Biau andScornet36 for a comprehensive discussion of known theoretical results for RF

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran httporcidorg0000-0003-2758-9647

REFERENCES

1 Breiman L Random forests Mach Learn 2001455‐32

2 Ishwaran H Kogalur UB Blackstone EH Lauer MS Random survival forests Ann Appl Stat 20082(3)841‐860

3 Ishwaran H Gerds TA Kogalur UB Moore RD Gange SJ Lau BM Random survival forests for competing risks Biostatistics201415(4)757‐773

4 Hothorn T Hornik K Zeileis A Unbiased recursive partitioning a conditional inference framework J Comput Graph Stat200615(3)651‐674

5 Zhu R Kosorok MR Recursively imputed survival trees J Am Stat Assoc 2012107(497)331‐340

6 Breiman L Bagging predictors Mach Learn 199624(2)123‐140

7 Ishwaran H Variable importance in binary regression trees and forests Electron J Stat 20071519‐537

8 Groumlmping U Variable importance assessment in regression linear regression versus random forest Am Stat 200963(4)308‐319

9 Genuer R Poggi J‐M Tuleau‐Malot Christine Variable selection using random forests Pattern Recogn Lett 201031(14)2225‐2236

10 Breiman L Friedman J Stone CJ Olshen RA Classification and Regression Trees New York CRC press 1984

11 Breiman L Manual on setting up using and understanding random forests v31 2002

12 Louppe G Wehenkel L Sutera A Geurts P Understanding variable importances in forests of randomized trees In Advances in NeuralInformation Processing Systems 2013431‐439

13 Breiman L Out‐of‐bag estimation Technical report Berkeley CA 94708 Statistics Department University of California Berkeley 1996b199633 34

ISHWARAN AND LU 577

14 Archer KJ Kimes RV Empirical characterization of random forest variable importance measures Comput Stat Data Anal200852(4)2249‐2260

15 Nicodemus KK Malley JD Strobl C Ziegler A The behaviour of random forest permutation‐based variable importance measures underpredictor correlation BMC Bioinformatics 201011(1)110‐123

16 Gregorutti B Michel B Saint‐Pierre P Correlation and variable importance in random forests Stat Comput 201727(3)659‐678

17 Zhu R Zeng D Kosorok MR Reinforcement learning trees J Am Stat Assoc 2015110(512)1770‐1784

18 Sexton J Laake P Standard errors for bagged and random forest estimators Comput Stat Data Anal 200953(3)801‐811

19 Wager S Hastie T Efron B Confidence intervals for random forests the jackknife and the infinitesimal jackknife J Mach Learn Res201415(1)1625‐1651

20 Mentch L Hooker G Quantifying uncertainty in random forests via confidence intervals and hypothesis tests J Mach Learn Res201617(1)841‐881

21 Politis DN Romano JP Large sample confidence regions based on subsamples under minimal assumptions Ann Stat199422(4)2031‐2050

22 Shao J Wu CJ A general theory for jackknife variance estimation Ann Stat 198917(3)1176‐1197

23 Graf E Schmoor C Sauerbrei W Schumacher M Assessment and comparison of prognostic classification schemes for survival data StatMed 199918(17‐18)2529‐2545

24 Gerds TA Schumacher M Consistent estimation of the expected brier score in general survival models with right‐censored event timesBiom J 200648(6)1029‐1040

25 Hoeffding W Probability inequalities for sums of bounded random variables J Am Stat Assoc 196358(301)13‐30

26 Radulović D On the subsample bootstrap variance estimation Test 19987(2)295‐306

27 Friedman JH Multivariate adaptive regression splines Ann Stat 19911‐67

28 Biau G Fischer A Guedj B Malley JD COBRA a combined regression strategy J Multivar Anal 201614618‐28

29 Ishwaran H Kogalur UB Random forests for survival regression and classification (RF‐SRC) https cranr-projectorgwebpackagesrandomForestSRC R package version 250 2017

30 Leisch F Dimitriadou E Mlbench machine learning benchmark problems httpscranr-projectorg webpackagesmlbench R packageversion 21‐1 2010

31 Kuhn M Caret classification and regression training httpscranr-projectorgwebpackagescaret R package version 60‐77 2017

32 Harrell FE Califf RM Pryor DB Lee KL Rosati RA Evaluating the yield of medical tests JAMA 1982247(18)2543‐2546

33 Hsich E Gorodeski EZ Blackstone EH Ishwaran H Lauer MS Identifying important risk factors for survival in patient with systolicheart failure using random survival forests Circ Cardiovasc Qual Outcomes 20114(1)39‐45

34 Breiman L Software for the masses Banff Alberta Canada Lecture III IMS Wald Lectures 2002

35 Scornet E Biau G Vert J‐P Consistency of random forests Ann Stat 201543(4)1716‐1741

36 Biau G Scornet E A random forest guided tour Test 201625(2)197‐227

How to cite this article Ishwaran H Lu M Standard errors and confidence intervals for variable importancein random forest regression classification and survival Statistics in Medicine 201938558ndash582 httpsdoiorg101002sim7803

APPENDIX

Assessing normality for classification survival and competing risk

We applied the same strategy as in RF‐R simulations to assess normality of VIMP for RF‐C RSF and RSF competingrisk simulations Specifically for each setting we ran simulations 1000 times independently Experimental parameterswere set as before with n=2500 The finite forest VIMP for a variable was centered by its averaged value from the1000 simulations This centered value was then divided by the standard deviation of the 1000 VIMP values Quantilebias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normalquantile Quantile bias is displayed in Figure A1 for the 3 families

FIGURE A1 Assessing asymptotic normality of variable importance from RF‐C RSF and RSF competing risk simulations Figure

displays bias of standardized variable importance quantiles compared with standard normal quantiles Values are displayed for 5 10 25

50 75 90 95 percentile values

578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg

580 ISHWARAN AND LU

We can see that (A2) is driven by the 2 terms ~Δn and hnminush Define integer values ni=ni(Θ)ge0 recording the bootstrapfrequency of case i=1hellipn in LlowastethΘTHORN (notice that ni=0 implies case i is OOB) By the definition of a RF‐R tree we have

hn frac14 hethXΘLTHORN frac14 sumn

ifrac141WiethXΘTHORNYi

where fWiethXΘTHORNgn1 are the forest weights defined as follows

WiethXΘTHORN frac14 ni1fXiisinRethXΘTHORNgjRethXΘTHORNj

where R(XΘ) is the tree terminal node containingX and |R(XΘ)| is the cardinality equal to the number of bootstrap casesin R(X Θ) Notice that the weights are convex since 0leWi(X Θ)le1 and

sumn

ifrac141WiethXΘTHORN frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgjRethXΘTHORNj frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgsumn

iprimefrac141niprime1fXiprimeisinRethXΘTHORNgfrac14 1

Similarly we have

~hn frac14 heth~Xeth jTHORNΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYi whereWieth~Xeth jTHORNΘTHORN frac14 ni1fXiisinReth~Xeth jTHORNΘTHORNg

jReth~Xeth jTHORNΘTHORNj

Therefore

~ΔnethXTHORN frac14 heth~Xeth jTHORNΘLTHORNminushethXΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYiminussum

n

ifrac141WiethXΘTHORNYi

To study ~Δn in more detail we will assume deep trees containing one unique case per terminal node

Assumption 1 We assume each terminal node contains exactly one unique value That is each terminalnode contains the bootstrap copies of a unique data point


for the delete-d jackknife is better than the subsampling estimator. This is probably because there are more variables with moderate signal in these simulations.

6 | RANDOM SURVIVAL FORESTS (RSF)

Now we consider the survival setting. We begin by defining the survival framework using the notation of Section 2. We then discuss 2 different methods that can be used for measuring prediction error in survival settings, followed by illustrative examples.

6.1 | Notation

We assume a traditional right-censoring framework. The response is $Y = (T, \delta)$, where $T = \min(T^o, C^o)$ is the observed survival time and $\delta = 1\{T^o \le C^o\}$ is the right-censoring indicator. Here $(T^o, C^o)$ denote the unobserved event and censoring times. Thus, $\delta = 1$ denotes an event, such as death, while $\delta = 0$ denotes a right-censored case. The target function $h$ is the conditional survival function $h(x) = P\{T^o > t \mid X = x\}$, where $t$ is some selected time point.

6.2 | Weighted Brier score

Let $\hat{h}$ be an estimator of $h$. One method for measuring performance of $\hat{h}$ is the weighted Brier score,23,24 defined as follows:

$$\mathrm{wBS}(t) = (1\{T > t\} - \hat{h})^2\, w(t, Y, G),$$

where $w(t, Y, G)$ is the weight defined by

$$w(t, Y, G) = \frac{1\{T \le t\}\,\delta}{G(T-)} + \frac{1\{T > t\}}{G(t)},$$

and $G(t) = P\{C^o > t\}$ is the survival function of the censoring variable $C^o$. Using the notation of Section 2, the loss function $\ell$ under the weighted Brier score can be written as follows:

$$\ell(Y, \hat{h}) = (1\{T > t\} - \hat{h})^2\, w(t, Y, G).$$

This assumes $G$ is a known function, but in practice $G$ must be estimated.23,24 Thus, if $\hat{G}$ is an estimator of $G$, then $w(t, Y, G)$ is replaced by $w(t, Y, \hat{G})$.
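As a concrete illustration, the following is a minimal R sketch of the weighted Brier score at a time point t, assuming predicted survival probabilities surv.hat estimating $P\{T^o > t \mid X\}$ and a Kaplan-Meier estimate of the censoring survival function $G$; the helper name weighted.brier and the use of the survival package are our own choices, not part of the paper.

library(survival)

weighted.brier <- function(time, status, surv.hat, t) {
  ## Censoring survival G(t) = P{C^o > t}, estimated by Kaplan-Meier
  ## with the censoring indicator reversed.
  km.cens <- survfit(Surv(time, 1 - status) ~ 1)
  G <- stepfun(km.cens$time, c(1, km.cens$surv))
  ## IPCW weights: w = 1{T <= t} delta / G(T-) + 1{T > t} / G(t);
  ## cases censored before t receive weight zero.
  w <- ifelse(time <= t & status == 1, 1 / G(time - 1e-8),
              ifelse(time > t, 1 / G(t), 0))
  ## Weighted Brier score at time t.
  mean((as.numeric(time > t) - surv.hat)^2 * w)
}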

6.3 | Concordance index

Harrell's concordance index32 is another measure of prediction performance that can be used in survival settings. The concordance index estimates the accuracy of the predictor $\hat{h}$ in ranking 2 individuals in terms of their survival. A value of 1 represents an estimator that has perfect discrimination, whereas a value of 0.5 indicates performance on par with a random coin toss. This intuitive interpretation of performance has made Harrell's concordance index very popular, and for this reason we will base our analysis on it. Note that because the concordance index is calculated by comparing concordant and discordant pairs, and is therefore a complex statistic, it is not possible to express it in terms of the ℓ-loss function of Section 2. However, this just means that VIMP based on Harrell's concordance index is not easily described notationally in terms of a formal loss; this poses no problems for the application of our methodology. Permutation VIMP based on Harrell's concordance index is well defined and can be readily calculated.2
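To make the pairwise computation concrete, here is a minimal R sketch of Harrell's concordance index for right-censored data (our own illustration; risk is an assumed predicted risk score where higher means worse survival, and ties in observed time are ignored for simplicity):

harrell.c <- function(time, status, risk) {
  num <- den <- 0
  n <- length(time)
  for (i in 1:n) {
    for (j in 1:n) {
      ## A pair is usable when the shorter observed time is an event.
      if (time[i] < time[j] && status[i] == 1) {
        den <- den + 1
        if (risk[i] > risk[j]) num <- num + 1          # concordant
        else if (risk[i] == risk[j]) num <- num + 0.5  # tied prediction
      }
    }
  }
  num / den
}

The double loop makes the O(n^2) cost of the index explicit, a point we return to in Section 7.2.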

6.4 | Systolic heart failure

For our first illustration, we consider a survival data set of n=2231 cardiovascular patients. All patients suffered from systolic heart failure, and all underwent cardiopulmonary stress testing. The outcome was defined as all-cause mortality. Over a mean follow-up of 5 years, 742 of the patients died. Patient variables included baseline characteristics and exercise stress test results (p=39). More detailed information regarding the data can be found in Hsich et al.33

An RSF analysis was run on the data. A total of 250 survival trees were grown using a nodesize value of 30, with all other parameters set to the default values used by RSF in the randomForestSRC software. Performance was measured using the C-index, defined as one minus the Harrell concordance index.2 The delete-d jackknife estimator was calculated using 1000 subsampled values with a $b = n^{1/2}$ subsampling rate ($b \approx 47$). We preferred to use the delete-d jackknife rather than the subsampling estimator because of the low subsampling rate. Also, we did not use the .164 estimator because it was too slow.
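A rough R sketch of this analysis is given below. The data frame heart and its column names are hypothetical placeholders (the registry data are not public), and the variance formula is the generic delete-d jackknife scaling with d = n − b from Shao and Wu22; consult the estimators defined earlier in the paper for the exact expressions. Recent versions of randomForestSRC also provide a subsample() function that packages these calculations.

library(randomForestSRC)

n <- nrow(heart)                 # n = 2231 in the actual data
b <- round(sqrt(n))              # subsampling rate b = n^(1/2)
B <- 1000                        # number of subsamples

## Full-sample forest and VIMP (C-index based for survival families).
o <- rfsrc(Surv(time, status) ~ ., data = heart,
           ntree = 250, nodesize = 30, importance = TRUE)
theta.n <- o$importance

## VIMP recomputed over B subsamples of size b.
theta.b <- t(replicate(B, {
  ob <- rfsrc(Surv(time, status) ~ ., data = heart[sample(n, b), ],
              ntree = 250, nodesize = 30, importance = TRUE)
  ob$importance
}))

## Delete-d jackknife variance (d = n - b) and 95% normal CIs for VIMP.
v.hat <- (b / (n - b)) * colMeans(sweep(theta.b, 2, theta.n)^2)
ci <- cbind(lower = theta.n - qnorm(.975) * sqrt(v.hat),
            upper = theta.n + qnorm(.975) * sqrt(v.hat))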

The 95% asymptotic normal confidence intervals are given in Figure 7. Variable importance values have been multiplied by 100 for convenient interpretation as a percentage. Blood urea nitrogen, exercise time, and peak VO2 have the largest VIMP, with confidence intervals well bounded away from zero. All 3 variables are known to be highly predictive of heart failure, and these findings are not surprising. More interesting, however, are several variables with moderate-sized VIMP whose confidence regions are bounded away from zero. Some examples are creatinine clearance, sex, left ventricular ejection fraction, and use of beta-blockers. The finding for sex is especially interesting because sex is often underappreciated for predicting heart failure.

FIGURE 7 Delete-d jackknife 95% asymptotic normal confidence intervals from random survival forest analysis of systolic heart failure data. BUN, blood urea nitrogen; CABG, coronary artery bypass graft; ICD, implantable cardioverter-defibrillator; LVEF, left ventricular ejection fraction; PCI, percutaneous coronary intervention

6.5 | Survival simulations

Next, we study performance using simulations. We use the 3 survival simulation models described by Breiman in his 2002 Wald lectures.34 Let $h(t, x)$ be the hazard function for covariate $x$ at time $t$. Simulations are as follows:

1. $h(t, x) = \exp(x_1 - 2x_4 + 2x_5)$.
2. If $x_1 \le .5$, $h(t, x) = \exp(x_2)\,1\{t \notin [.5, 2.5]\}$. If $x_1 > .5$, $h(t, x) = \exp(x_3)\,1\{t \notin [2.5, 4.5]\}$.
3. $h(t, x) = (1 + z_2 t)\exp(z_1 + z_2 t)$, where $z_1 = .5x_1$ and $z_2 = x_4 + x_5$.

In all simulations, covariates were independently sampled from a U[0,1] distribution. Noise variables were added to increase the dimension to p=10. Simulation 1 corresponds to a Cox model. Simulations 2 and 3 are nonproportional hazards models. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.
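For concreteness, here is a minimal R sketch of simulation model 1 under the setup just described. Because the hazard does not depend on $t$, event times are exponential with rate $\exp(x_1 - 2x_4 + 2x_5)$; the exponential censoring distribution (rate 0.4) is our own assumption, tuned only to give roughly the stated 19% censoring rate.

sim1 <- function(n = 2500, p = 10) {
  x <- matrix(runif(n * p), n, p)                  # covariates U[0,1]
  lambda <- exp(x[, 1] - 2 * x[, 4] + 2 * x[, 5])  # hazard (model 1)
  To <- rexp(n, rate = lambda)                     # event times
  Co <- rexp(n, rate = 0.4)                        # censoring (assumed form)
  data.frame(x, time = pmin(To, Co), status = as.numeric(To <= Co))
}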

RSF was fit using randomForestSRC using the same tuning values as in the RF-C simulations (nodesize 5, random feature selection $p^{1/2}$). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards, as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model. All p covariate effects are linear.
2. Quadratic model. A subset of the p covariate effects are quadratic.
3. Interaction model. Same as 1, but interactions between certain p variables were included.

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time-truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will be the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With $M$ trees, the number of computations is $O(n^2 M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b = n^{1/2}$ reduces this to $O(nM)$, a factor of $n$ times smaller. The resulting increase in speed will be of tremendous advantage in big data settings.
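As a rough illustration of the magnitudes involved (our own arithmetic, not from the original analysis): with $n = 10{,}000$ cases, a full-sample concordance evaluation requires on the order of $n^2 = 10^8$ pair comparisons, whereas a subsample of size $b = n^{1/2} = 100$ requires only $b^2 = n = 10^4$, so each of the $M$ trees is evaluated roughly $10{,}000$ times faster.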

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b = n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n=250.
3. The α value for the confidence region should be chosen using typical values, such as α=.1 or α=.05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, by way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation, and results were averaged over 250 independent runs. The TPR was defined as the fraction of true signal variables identified; the TNR was the fraction of noisy variables correctly identified as noise.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = .25, .1, .01, .001. The top panel was calculated under a subsampling rate $b = n^{1/2}$, while the bottom panel used $b = n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases and TNR increases as α decreases. For $b = n^{3/4}$, TNR rates are slightly better; however, for $b = n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b = n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with shrinking α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α=.25. The value α=.25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, the values α=.1, .05, .01 perform very well under both TNR and TPR.
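The selection rule itself is simple to express. The following R sketch (our own illustration) selects a variable when its 100(1−α)% normal-theory confidence interval lies above zero, given VIMP point estimates theta.n and delete-d jackknife standard errors se (eg, from the sketch in Section 6.4); signal marks the true signal variables in a simulated data set.

select.vars <- function(theta.n, se, alpha = .1) {
  ## Select when the lower 100(1 - alpha)% confidence bound exceeds zero.
  theta.n - qnorm(1 - alpha / 2) * se > 0
}

tpr <- function(sel, signal) mean(sel[signal])    # fraction of signal found
tnr <- function(sel, signal) mean(!sel[!signal])  # fraction of noise rejected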

FIGURE 9 Results from variable selection experiment using random forest–regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$

7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$E[I(X^{(j)}, \Theta, \mathcal{L})] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of $h$ attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus VIMP will converge to zero. For strong variables, $h_j \ne 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, except for some constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L_2$-consistent for $h$. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor. Wager et al19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node). Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran https://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.

2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.

3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.

4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.

5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.

6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.

7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.

8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.

9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.

10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.

11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.

12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems; 2013:431-439.

13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.

14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.

15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.

16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.

17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.

18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.

19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.

20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.

21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.

22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.

23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.

24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.

25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.

26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.

27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.

28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.

29. Ishwaran H, Kogalur UB. Random forests for survival, regression and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.

30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.

31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.

32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.

33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.

34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.

35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.

36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for the RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n=2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and the corresponding standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.
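In R, the quantile-bias calculation amounts to a few lines. The sketch below is our own illustration, assuming vimp.mat is a 1000 × p matrix of finite forest VIMP values over the independent replicates (one column per variable).

quantile.bias <- function(vimp.mat,
                          probs = c(.05, .10, .25, .50, .75, .90, .95)) {
  ## Standardize each variable's VIMP by its simulation mean and SD.
  z <- scale(vimp.mat)
  ## Standardized VIMP quantiles minus standard normal quantiles.
  apply(z, 2, quantile, probs = probs) - qnorm(probs)
}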

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time-truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathcal{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \quad i = 1, \dots, n, \qquad \text{(A1)}$$

where $(X_i, \varepsilon_i)$ are iid with distribution $\mathrm{P}$ such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice (A1) implies that the target function $h$ has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

$$\theta_n^{(j)} = E_\Theta\Bigg[\frac{1}{N(\Theta)} \sum_{i \in \mathcal{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathcal{L})\big)^2\Big]\Bigg].$$

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\mathrm{Round}(n e^{-1})$, where Round(·) is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,

$$\theta_n^{(j)} = \frac{1}{N_n} E_\Theta\Bigg[\sum_{i \in \mathcal{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathcal{L})\big)^2\Big]\Bigg].$$

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n,0}^{(j)} = E_{\mathcal{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{ni} = h(X_i, \Theta, \mathcal{L})$ and $\tilde{h}_{ni} = h(\tilde{X}_i^{(j)}, \Theta, \mathcal{L})$. Likewise, let $h_n = h(X, \Theta, \mathcal{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L})$. We have

$$\theta_{n,0}^{(j)} = \frac{1}{N_n} E\Bigg[\sum_{i \in \mathcal{L}^{**}(\Theta)} \big[(Y_i - \tilde{h}_{ni})^2 - (Y_i - h_{ni})^2\big]\Bigg] = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big],$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{ni}, \tilde{h}_{ni})$ if $i$ is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing $h$ for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$\theta_{n,0}^{(j)} = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\big] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \qquad \text{(A2)}$$

where in the last line we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.
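For readers who want the intermediate step, the algebra behind (A2) is the following (substituting $Y = h + \varepsilon$ and expanding both squares):

$$(Y - \tilde{h}_n)^2 - (Y - h_n)^2 = \big((Y - h_n) - \tilde{\Delta}_n\big)^2 - (Y - h_n)^2 = -2\tilde{\Delta}_n(h + \varepsilon - h_n) + \tilde{\Delta}_n^2 = -2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h),$$

and taking expectations eliminates the term $-2\varepsilon\tilde{\Delta}_n$, by the independence of $\varepsilon$ and $\tilde{\Delta}_n$ together with $E(\varepsilon) = 0$.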

We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \dots, n$ in $\mathcal{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of a RF-R tree, we have

$$h_n = h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(X, \Theta)\, Y_i,$$

where $\{W_i(X, \Theta)\}_{1}^{n}$ are the forest weights, defined as follows:

$$W_i(X, \Theta) = \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$

where $R(X, \Theta)$ is the tree terminal node containing $X$, and $|R(X, \Theta)|$ is the cardinality equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex, since $0 \le W_i(X, \Theta) \le 1$ and

$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'}\, 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$

Similarly, we have

$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta)\, Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i\, 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$

Therefore,

$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathcal{L}) - h(X, \Theta, \mathcal{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta)\, Y_i - \sum_{i=1}^{n} W_i(X, \Theta)\, Y_i.$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then

$$\tilde{\Delta}_n(X) = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i \in R} n_i Y_i = \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i \in R} n_i \varepsilon_i = h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},$$

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first 2 terms in the last line, and these will therefore be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors, $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L_2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$\theta_{n,0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus the limit is zero for a noisy variable. Obviously, this is much different from the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde{\Delta}_n(h_n - h)] \le E\big[|\tilde{\Delta}_n|\,|h_n - h|\big] \le \sqrt{E[\tilde{\Delta}_n^2]}\,\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, we deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].$$

Appealing to exchangeability once more, we have

$$E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$.


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for $h$. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$. It asserts that all features $X$ within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of $n$ (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required $X$ to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.


Putting all of the above together we can now state our main result

Theorem 1 If Assumptions 1 2 3 and 4 hold then

θeth jTHORNn0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN

Note that the asymptotic limit will be heavily dependent on the strength of the variable Consider when X(j) is a noisy

variable Ideally this means the tree is split without ever using X(j) Therefore if we dropX and ~Xeth jTHORN down the tree they willoccupy the same terminal node Hence ~R frac14 R and εieth~RTHORN frac14 εiethRTHORN and therefore ρj=1 Furthermore because hj must be zerofor a noisy variable it follows that θeth jTHORNn0 frac14 oeth1THORN Thus the limit is zero for a noisy variable Obviously this is much differentthan the limit of a strong variable which must be strictly positive because hjne0 and ρjlt1 for strong variables

eth jTHORN ~2 ~

Proof By (A2) we have θn0 frac14 Efrac12Δn thorn 2Efrac12ΔnethhnminushTHORN We start by dealing with the second term

Efrac12~ΔnethhnminushTHORN By the Cauchy‐Schwartz inequality

Efrac12~ΔnethhnminushTHORNleEfrac12j~ΔnjjhnminushjleffiffiffiffiffiffiffiffiffiffiffiEfrac12~Δ2

nq ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Efrac12ethhnminushTHORN2q

By Assumption 4 the right‐hand side converges to zero if Efrac12~Δ2n remains bounded By Assumption 2 and

the assumption of an additive model (A1)

~ΔnethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNminus~ζnethXTHORN thorn ζnethXTHORN thorn εieth~RTHORNminusεiethRTHORN

Assumption 4 implies h (and therefore hj) is square‐integrable Assumption 3 implies that εieth~RTHORN εiethRTHORN havefinite second moment and are square‐integrable Therefore squaring and taking expectations and using

Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN deduce that

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2Efrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN thorn Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 thorn oeth1THORN

By exchangeability Efrac12gethXTHORNεieth~RTHORN frac14 Efrac12gethXTHORNεiethRTHORN for any function g(X) Hence

0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORNεieth~RTHORNminusEfrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNεiethRTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN

Appealing to exchangeability once more we have

Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 frac14 Efrac12ε2ieth~RTHORN thorn Efrac12ε2iethRTHORNminus2Efrac12ethεieth~RTHORNεiethRTHORNTHORN frac14 2σ2eth1minusρjTHORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN Therefore we have shown

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

which verifies boundedness of Efrac12~Δ2n and that θeth jTHORNn0 frac14 Efrac12~Δ2

n thorn oeth1THORN

582 ISHWARAN AND LU

The conditions needed to establish Theorem 1 are fairly reasonable Assumption 2 can be viewed as a type of continuitycondition for h However it is also an assertion about the approximating behavior of the forest predictor hn It asserts thatall features X within a terminal node have h(X) values close to one another which can be seen as an indirect way ofasserting good local prediction behavior for the tree Thus Assumption 2 is very similar to Assumption 4 The latterassumption of consistency is reasonable for deep trees under an additive model assumption Scornet et al35 establishedL2‐consistency of RF‐R for additive models allowing for the number of terminal nodes to grow at rate of n (see theorem2 of their paper) For technical reasons their proof replaced bootstrapping with subsampling and required X to be uni-formly distributed but other than this their result can be seen as strong support for our assumptions

ISHWARAN AND LU 573

hazards. Censoring was simulated independently of time in all simulations. Censoring rates were 19%, 15%, and 29%, respectively.

RSF was fit using randomForestSRC with the same tuning values as in the RF-C simulations (nodesize 5, random feature selection with $p^{1/2}$ candidate variables). Experimental parameters were kept the same as in previous simulations. Experiments were repeated 250 times. Results are displayed using the same format as RF-R and RF-C and are provided in Figure 8. The results generally mirror our earlier findings: bias for the subsampling estimator improves relative to the delete-d jackknife with increasing subsampling rate.
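As a minimal illustration of this fit (our sketch, not the paper's code; the veteran survival data shipped with randomForestSRC stands in for the simulated data):

```r
## Sketch of the RSF fit described above: nodesize 5 and random feature
## selection with p^(1/2) candidate variables (mtry).
library(randomForestSRC)

data(veteran, package = "randomForestSRC")   # example right-censored data
p <- ncol(veteran) - 2                       # predictors, excluding time/status
o <- rfsrc(Surv(time, status) ~ ., data = veteran,
           nodesize = 5, mtry = ceiling(sqrt(p)),
           importance = "permute")
head(o$importance)                           # permutation VIMP per variable
```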

FIGURE 8 Results from random survival forest simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


6.6 | Competing risk simulations

Here we study performance of the methods in a competing risk setting. For our analysis, we use the competing risk simulations from Ishwaran et al.3 Simulations were based on a Cox exponential hazards model with 2 competing events. Covariates had differing effects on the hazards. Models included covariates common to both hazards as well as covariates unique to only one hazard. We considered 3 of the simulations from section 6.1 of Ishwaran et al3:

1. Linear model: All p covariate effects are linear.
2. Quadratic model: A subset of the p covariate effects are quadratic.
3. Interaction model: Same as 1, but interactions between certain variables were included.

The feature dimension was p = 12 for simulations 1 and 2 and p = 17 for simulation 3 (ie, 5 interaction terms were added). Covariates were sampled from both continuous and discrete distributions. Performance was measured using the time-truncated concordance index.3 Without loss of generality, we record performance for variables related to event 1 only. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2 (this ensures VIMP identifies only those variables affecting the event 1 cause). RSF parameters and experimental parameters were identical to the previous simulations. For brevity, results are given in the appendix in Figure A2. The results mirror our previous findings.
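As an illustration of such a cause-specific fit, here is a minimal sketch (ours, not the paper's code) using the follic competing risk data shipped with randomForestSRC; we assume, per the package documentation, that the cause argument of rfsrc accepts a vector of event weights for splitting.

```r
## Sketch: RSF competing risk trees with log-rank splitting weight 1 on
## event 1 and weight 0 on event 2, so VIMP targets cause 1 only.
library(randomForestSRC)

data(follic, package = "randomForestSRC")    # follicular lymphoma, 2 events
o <- rfsrc(Surv(time, status) ~ ., data = follic,
           splitrule = "logrank", cause = c(1, 0),
           nodesize = 5, importance = "permute")
## importance is event-specific for competing risks; column 1 = cause 1
vimp.event1 <- o$importance[, 1]
```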

7 | DISCUSSION

7.1 | Summary

One widely used tool for peering inside the RF "black box" is VIMP. But analyzing VIMP is difficult because of the complex nature of RF. Given the difficulties of theoretical analysis, our strategy was to approximate the distribution of VIMP through the use of subsampling, a general methodology for approximating distributions of complex statistics. We described a general procedure for estimating the variance of VIMP and for constructing confidence intervals.

We compared our subsampling estimator, and also the closely related delete-d jackknife,22 to the .164 bootstrap estimator, a modified bootstrap procedure designed to address ties in OOB data. Using extensive simulations involving regression, classification, and survival data, a consistent pattern of performance emerged for the 3 estimators. All procedures tended to underestimate variance for strong variables and overestimate variance for weak variables. This was especially problematic for the subsampling estimator in low subsampling rate scenarios. The delete-d jackknife did much better in this case due to its bias correction. Both of these methods improved with increasing subsampling rate, eventually outperforming the .164 bootstrap.

7.2 | Computational speed

Overall, we generally prefer the delete-d jackknife because of its better performance under low subsampling rates, which we feel will cover the bulk of applications due to the computational complexity of VIMP. Consider survival with concordance error rates, the most computationally expensive setting for VIMP. The concordance index measures concordance and discordance over pairs of points, an $O(n^2)$ operation. With M trees, the number of computations is $O(n^2 M)$ for a method like the .164 bootstrap. On the other hand, employing a subsampling rate of $b = n^{1/2}$ reduces this to $O(nM)$, a factor of n smaller, as the short calculation below makes explicit. The resulting increase in speed will be of tremendous advantage in big data settings.
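To spell out the order-of-magnitude arithmetic behind this claim (our gloss): each concordance evaluation on a subsample of size b touches $O(b^2)$ pairs per tree, so

$$
O(b^2 M) = O\big((n^{1/2})^2 M\big) = O(nM) \quad \text{versus} \quad O(n^2 M) \text{ on the full data},
$$

a factor of n fewer operations.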

7.3 | Practical guidelines

One of the major applications of our methodology will be variable selection. Below we provide some practical guidelines for this setting; a code sketch implementing them follows the list.

1. Use asymptotic normal confidence intervals derived from the delete-d jackknife variance estimator.
2. A good default subsampling rate is $b = n^{1/2}$. As mentioned, this will substantially reduce computational costs in big n problems. In small n problems, while this might seem overly aggressive, leading to small subsamples, our results have shown solid performance even when n = 250.


3. The α value for the confidence region should be chosen using typical values such as α = .1 or α = .05. Outside of extreme settings, such as high-dimensional problems, our experience suggests this should work well.
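Here is a minimal sketch of guidelines 1 to 3 in R (our illustration, not the paper's code). The simulated data, the Shao-Wu form of the delete-d jackknife variance (factor b/(n − b), centered at the full-sample VIMP), and the rule of selecting variables whose interval excludes zero are our assumptions; substitute the paper's exact estimator where they differ. Recent versions of randomForestSRC also provide a subsample() function for this purpose (see the package documentation).

```r
## Sketch: delete-d jackknife confidence intervals for VIMP (guidelines 1-3).
library(randomForestSRC)

set.seed(1)
n <- 250; p <- 10
dat <- data.frame(x = matrix(rnorm(n * p), n, p))
dat$y <- dat$x.1 + dat$x.2 + rnorm(n)            # x.1, x.2 signal; rest noise

## full-data forest and permutation VIMP (theta.n)
theta.n <- rfsrc(y ~ ., data = dat, importance = "permute")$importance

## VIMP over B subsamples of size b = n^(1/2), drawn without replacement
b <- ceiling(sqrt(n)); B <- 100
theta.b <- t(sapply(seq_len(B), function(s) {
  rfsrc(y ~ ., data = dat[sample(n, b), ], importance = "permute")$importance
}))

## delete-d jackknife variance, d = n - b deleted cases (assumed Shao-Wu form)
v.jack <- (b / (n - b)) * colMeans(sweep(theta.b, 2, theta.n)^2)

## 100(1 - alpha)% asymptotic normal intervals; alpha = .1 per guideline 3
alpha <- 0.1
z <- qnorm(1 - alpha / 2)
ci <- cbind(lower = theta.n - z * sqrt(v.jack),
            upper = theta.n + z * sqrt(v.jack))
selected <- rownames(ci)[ci[, "lower"] > 0]      # CIs excluding zero
```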

The above guidelines are only meant to be starting points for the analyst, and obviously there will be exceptions to the rules. However, as a way of support for these recommendations, we did the following variable selection experiment. We reran the RF-R simulations of Section 4. Variables were selected using 100(1−α)% delete-d jackknife asymptotic normal confidence intervals. The true positive rate (TPR) and true negative rate (TNR) were calculated for each simulation and results averaged over 250 independent runs. True positive rate was defined as the fraction of true signal variables identified. True negative rate was defined as the fraction of noisy variables correctly identified.

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significance α = .25, .1, .01, .001. The top panel was calculated under a subsampling rate $b = n^{1/2}$, while the bottom panel used $b = n^{3/4}$. A similar pattern for TPR and TNR values is generally observed for both sampling rates: TPR decreases with decreasing α, while TNR increases as α decreases. For $b = n^{3/4}$, TNR rates are slightly better; however, for $b = n^{1/2}$, TPR rates are slightly better. Thus, if the goal is finding true signal, the edge goes to $b = n^{1/2}$.

Focusing on the top panel, corresponding to the recommended $n^{1/2}$ subsampling rate, a striking pattern for TPR is that while TPR decreases with α, the decline is fairly slow. This is interesting given the wide range of α values, from .25 to .001. These values are extreme and unlikely to be used in practice, and yet TPR results remain quite robust. Values for TNR are also fairly robust to α, although TNR values appear relatively too small when α = .25. The value α = .25 is too extreme and creates overly narrow confidence regions, causing noisy variables to be misclassified as signal variables. Generally, however, values α = .1, .05, .01 perform very well under both TNR and TPR.

FIGURE 9 Results from variable selection experiment using random forest regression simulations of Section 4. Displayed are the true positive rate and true negative rate for variables selected using 100(1−α)% delete-d jackknife confidence regions, where α = .25, .1, .01, .001. Top and bottom figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$


7.4 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator. However, as discussed earlier, theoretical results for VIMP are difficult to come by, and establishing a result like (8) for something as complicated as permutation importance is not easy. As a token, we would like to offer some partial insight into VIMP for the regression case (RF-R), perhaps pointing the way for more work in this area. As shown in the appendix (see Theorem 1), assuming an additive model $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$, the population mean for VIMP equals

$$
E\big[I(X^{(j)}, \Theta, \mathscr{L})\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),
$$

where $\rho_j$ is a correlation coefficient and $h_j$ is the additive expansion of h attributed to $X^{(j)}$. For noisy variables, $h_j = 0$ and $\rho_j = 1$; thus VIMP will converge to zero. For strong variables, $h_j \neq 0$. Our theory suggests that the value of $\rho_j$ will be the same for all strong variables. Therefore, for strong variables, up to a common constant, VIMP equals the amount that $h_j$ changes when $X^{(j)}$ is permuted, thus showing that VIMP correctly isolates the effect of $X^{(j)}$ in the model.

The technical assumptions required by Theorem 1 are provided in the appendix; however, there are 2 key conditions worth briefly mentioning. One is the use of deep trees in which terminal nodes contain exactly one unique value (replicated values due to bootstrapping are allowed). A second condition is that the forest predictor is $L^2$-consistent for h. As discussed in the appendix, this latter assumption is reasonable in our setting and has been proven by Scornet et al.35

It is interesting that the above property for VIMP is tied to the consistency of the forest. We believe, in general, that properties for VIMP, such as its limiting distribution, will rely on analogous results for the RF predictor. Hopefully, in the future, these results for VIMP will be proven. At least in the case of RF-R, we know that distributional results exist for the predictor: Wager et al19 established asymptotic normality of the infinite forest predictor (assuming one observation per terminal node), and Mentch and Hooker20 established a similar result for the finite forest predictor. See Biau and Scornet36 for a comprehensive discussion of known theoretical results for RF.

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072.

ORCID

Hemant Ishwaran https://orcid.org/0000-0003-2758-9647

REFERENCES

1. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
2. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841-860.
3. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, Lau BM. Random survival forests for competing risks. Biostatistics. 2014;15(4):757-773.
4. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674.
5. Zhu R, Kosorok MR. Recursively imputed survival trees. J Am Stat Assoc. 2012;107(497):331-340.
6. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123-140.
7. Ishwaran H. Variable importance in binary regression trees and forests. Electron J Stat. 2007;1:519-537.
8. Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat. 2009;63(4):308-319.
9. Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225-2236.
10. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. New York: CRC Press; 1984.
11. Breiman L. Manual on setting up, using, and understanding random forests v3.1. 2002.
12. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems. 2013:431-439.
13. Breiman L. Out-of-bag estimation. Technical report. Berkeley, CA: Statistics Department, University of California, Berkeley; 1996.
14. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Comput Stat Data Anal. 2008;52(4):2249-2260.
15. Nicodemus KK, Malley JD, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11(1):110-123.
16. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Stat Comput. 2017;27(3):659-678.
17. Zhu R, Zeng D, Kosorok MR. Reinforcement learning trees. J Am Stat Assoc. 2015;110(512):1770-1784.
18. Sexton J, Laake P. Standard errors for bagged and random forest estimators. Comput Stat Data Anal. 2009;53(3):801-811.
19. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res. 2014;15(1):1625-1651.
20. Mentch L, Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res. 2016;17(1):841-881.
21. Politis DN, Romano JP. Large sample confidence regions based on subsamples under minimal assumptions. Ann Stat. 1994;22(4):2031-2050.
22. Shao J, Wu CJ. A general theory for jackknife variance estimation. Ann Stat. 1989;17(3):1176-1197.
23. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17-18):2529-2545.
24. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48(6):1029-1040.
25. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc. 1963;58(301):13-30.
26. Radulović D. On the subsample bootstrap variance estimation. Test. 1998;7(2):295-306.
27. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19(1):1-67.
28. Biau G, Fischer A, Guedj B, Malley JD. COBRA: a combined regression strategy. J Multivar Anal. 2016;146:18-28.
29. Ishwaran H, Kogalur UB. Random forests for survival, regression, and classification (RF-SRC). https://cran.r-project.org/web/packages/randomForestSRC. R package version 2.5.0; 2017.
30. Leisch F, Dimitriadou E. mlbench: machine learning benchmark problems. https://cran.r-project.org/web/packages/mlbench. R package version 2.1-1; 2010.
31. Kuhn M. caret: classification and regression training. https://cran.r-project.org/web/packages/caret. R package version 6.0-77; 2017.
32. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543-2546.
33. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circ Cardiovasc Qual Outcomes. 2011;4(1):39-45.
34. Breiman L. Software for the masses. Lecture III, IMS Wald Lectures. Banff, Alberta, Canada; 2002.
35. Scornet E, Biau G, Vert J-P. Consistency of random forests. Ann Stat. 2015;43(4):1716-1741.
36. Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197-227.

How to cite this article: Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine. 2019;38:558-582. https://doi.org/10.1002/sim.7803

APPENDIX

Assessing normality for classification, survival, and competing risk

We applied the same strategy as in the RF-R simulations to assess normality of VIMP for RF-C, RSF, and RSF competing risk simulations. Specifically, for each setting, we ran simulations 1000 times independently. Experimental parameters were set as before, with n = 2500. The finite forest VIMP for a variable was centered by its averaged value from the 1000 simulations. This centered value was then divided by the standard deviation of the 1000 VIMP values. Quantile bias was calculated by taking the difference between the quantile for standardized VIMP and the corresponding standard normal quantile. Quantile bias is displayed in Figure A1 for the 3 families.
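As a small illustration of this computation (ours; the matrix of replicated VIMP values is a placeholder), the quantile bias for each variable can be obtained as follows:

```r
## Sketch: quantile bias of standardized VIMP versus standard normal quantiles.
## vimp.rep stands in for the 1000 x p matrix of finite-forest VIMP values,
## one row per independent simulation run (placeholder values used here).
set.seed(1)
vimp.rep <- matrix(rnorm(1000 * 5), 1000, 5)     # placeholder replicates
probs <- c(.05, .10, .25, .50, .75, .90, .95)    # percentiles in Figure A1

quantile.bias <- apply(vimp.rep, 2, function(v) {
  z <- (v - mean(v)) / sd(v)                     # center and scale as described
  quantile(z, probs) - qnorm(probs)              # standardized VIMP vs N(0,1)
})
```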

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles


Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time-truncated concordance index.3

Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.

Some theoretical results for VIMP in RF‐R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathscr{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:

$$
Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{A1}
$$

where $(X_i, \varepsilon_i)$ are iid with distribution P such that $X_i$ and $\varepsilon_i$ are independent and $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice (A1) implies that the target function h has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have

$$
\theta_n^{(j)} = E_{\Theta}\left[\frac{1}{N(\Theta)} \sum_{i \in \mathscr{L}^{**}(\Theta)} \left[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathscr{L})\big)^2\right]\right].
$$

We will assume the number of OOB cases N(Θ) is always fixed at Round($ne^{-1}$), where Round(·) is the nearest integer function. (Recall that a case is OOB when it does not appear in the bootstrap sample, which happens with probability $(1 - 1/n)^n \to e^{-1}$, so $ne^{-1}$ is the expected OOB count.) Write $N_n$ for N(Θ). Because this is a fixed value,

$$
\theta_n^{(j)} = \frac{1}{N_n} E_{\Theta}\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \left[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathscr{L})\big)^2\right]\right].
$$

To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n,0}^{(j)} = E_{\mathscr{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{n,i} = h(X_i, \Theta, \mathscr{L})$ and $\tilde{h}_{n,i} = h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})$. Likewise, let $h_n = h(X, \Theta, \mathscr{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L})$. We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP


$$
\theta_{n,0}^{(j)} = \frac{1}{N_n} E\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \left[(Y_i - \tilde{h}_{n,i})^2 - (Y_i - h_{n,i})^2\right]\right] = E\left[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\right],
$$

where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \stackrel{d}{=} (X_i, Y_i, h_{n,i}, \tilde{h}_{n,i})$ if i is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing h for h(X) and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,

$$
\theta_{n,0}^{(j)} = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\big] = E\big[\tilde{\Delta}_n^2\big] + 2E\big[\tilde{\Delta}_n(h_n - h)\big], \tag{A2}
$$

where in the last line we have used $E(\varepsilon) = 0$ and that ε is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the 2 terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \ge 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathscr{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case i is OOB). By the definition of a RF-R tree, we have

$$
h_n = h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,
$$

where $\{W_i(X, \Theta)\}_{1}^{n}$ are the forest weights, defined as follows:

$$
W_i(X, \Theta) = \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},
$$

where R(X, Θ) is the tree terminal node containing X, and |R(X, Θ)| is the cardinality, equal to the number of bootstrap cases in R(X, Θ). Notice that the weights are convex, since $0 \le W_i(X, \Theta) \le 1$ and

$$
\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i\, 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'}\, 1\{X_{i'} \in R(X, \Theta)\}} = 1.
$$

Similarly, we have

$$
\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \quad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i\, 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.
$$

Therefore

$$
\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) - h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.
$$

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. We assume each terminal node contains exactly one unique value. That is, each terminal node contains the bootstrap copies of a unique data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then

$$
\begin{aligned}
\tilde{\Delta}_n(X) &= \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|} \sum_{i \in R} n_i Y_i \\
&= \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|} \sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|} \sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|} \sum_{i \in R} n_i \varepsilon_i \\
&= h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},
\end{aligned}
$$

where $i(\tilde{R})$ and $i(R)$ identify the index for the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of X and j, but this is suppressed for notational simplicity). We can see that the information in the target function h is captured by the first 2 terms in the last line, which therefore will be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then

$$
h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),
$$

where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,

$$
h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),
$$

where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first 2 terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)}$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L^2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then

$$
\theta_{n,0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),
$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop X and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n,0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. Obviously, this is much different than the limit of a strong variable, which must be strictly positive because $h_j \ne 0$ and $\rho_j < 1$ for strong variables.
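As a concrete illustration (ours, not from the paper): suppose coordinate j enters linearly, $h_j(x) = \beta_j x$, with $\mathrm{Var}(X^{(j)}) = \tau_j^2$, and treat the permuted coordinate $\tilde{X}^{(j)}$ as an independent copy of $X^{(j)}$ with the same marginal. Then

$$
E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] = \beta_j^2\, E\big[(\tilde{X}^{(j)} - X^{(j)})^2\big] = 2\beta_j^2 \tau_j^2,
$$

so Theorem 1 gives a limiting VIMP of $2\beta_j^2\tau_j^2 + 2\sigma^2(1 - \rho_j)$: quadratic in the coefficient strength, plus a noise term that vanishes only when $\rho_j = 1$.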

Proof. By (A2), we have $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$
E\big[\tilde{\Delta}_n(h_n - h)\big] \le E\big[|\tilde{\Delta}_n||h_n - h|\big] \le \sqrt{E\big[\tilde{\Delta}_n^2\big]} \sqrt{E\big[(h_n - h)^2\big]}.
$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$
\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.
$$

Assumption 4 implies h (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}$, $\varepsilon_{i(R)}$ have finite second moment and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, we deduce that

$$
E\big[\tilde{\Delta}_n(X)^2\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).
$$

By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function g(X). Hence,

$$
0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].
$$

Appealing to exchangeability once more, we have

$$
E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E\big[\varepsilon_{i(\tilde{R})}^2\big] + E\big[\varepsilon_{i(R)}^2\big] - 2E\big[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}\big] = 2\sigma^2(1 - \rho_j),
$$

where $\rho_j = \mathrm{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$
E\big[\tilde{\Delta}_n(X)^2\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),
$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n,0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$. □


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for h. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features X within a terminal node have h(X) values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L^2$-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of n (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required X to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.

574 ISHWARAN AND LU

66 | Competing risk simulations

Here we study performance of the methods in a competing risk setting For our analysis we use the competing risksimulations from Ishwaran et al3 Simulations were based on a Cox exponential hazards model with 2 competing eventsCovariates had differing effects on the hazards Models included covariates common to both hazards as well as covar-iates unique to only one hazard We considered 3 of the simulations from section 61 of Ishwaran et al3

1 Linear model All p covariate effects are linear2 Quadratic model A subset of the p covariate effects are quadratic3 Interaction model Same as 1 but interactions between certain p variables were included

The feature dimension was p=12 for simulations 1 and 2 and p=17 for simulation 3 (ie 5 interaction terms wereadded) Covariates were sampled from both continuous and discrete distributions Performance was measured usingthe time truncated concordance index3 Without loss of generality we record performance for variables related to event1 only RSF competing risk trees were constructed using log‐rank splitting with weight 1 on event 1 and weight 0 onevent 2 (this ensures VIMP identifies only those variables affecting the event 1 cause) RSF parameters and experimentalparameters were identical to the previous simulations For brevity results are given in the appendix in Figure A2 Theresults mirror our previous findings

7 | DISCUSSION

71 | Summary

One widely used tool for peering inside the RF ldquoblack boxrdquo is VIMP But analyzing VIMP is difficult because of the com-plex nature of RF Given the difficulties of theoretical analysis our strategy was to approximate the distribution of VIMPthrough the use of subsampling a general methodology for approximating distributions of complex statistics Wedescribed a general procedure for estimating the variance of VIMP and for constructing confidence intervals

We compared our subsampling estimator and also the closely related delete‐d jackknife22 to the 164 bootstrap esti-mator a modified bootstrap procedure designed to address ties in OOB data Using extensive simulations involvingregression classification and survival data a consistent pattern of performance emerged for the 3 estimators All pro-cedures tended to under estimate variance for strong variables and over estimate variance for weak variables This wasespecially problematic for the subsampling estimator in low subsampling rate scenarios The delete‐d jackknife didmuch better in this case due to its bias correction Both of these methods improved with increasing subsampling rateeventually outperforming the 164 bootstrap

72 | Computational speed

Overall we generally prefer the delete‐d jackknife because of its better peformance under low subsampling rates whichwe feel will be the bulk of applications due to the computational complexity of VIMP Consider survival with concor-dance error rates the most computationally expensive setting for VIMP The concordance index measures concordanceand discordance over pairs of points a O(n2) operation With M trees the number of computations is O(n2M) for amethod like the 164 bootstrap On the other hand employing a subsampling rate of b=n12 reduces this to O(nM) afactor of ntimes smaller The resulting increase in speed will be of tremendous advantage in big data settings

73 | Practical guidelines

One of the major applications of our methodolgy will be variable selection Below we provide some practical guidelinesfor this setting

1 Use asymptotic normal confidence intervals derived from the delete‐d jackknife variance estimator2 A good default subsampling rate is b=n12 As mentioned this will substantially reduce computational costs in big n

problems In small n problems while this might seem overly aggressive leading to small subsamples our resultshave shown solid performance even when n=250

ISHWARAN AND LU 575

3 The α value for the confidence region should be chosen using typical values such as α=1 or α=05 Outside ofextreme settings such as high‐dimensional problems our experience suggests this should work well

The above guidelines are only meant to be starting points for the analyst and obviously there will beexceptions to the rules However as way of support for these recommendations we did the following variableselection experiment We reran the RF‐R simulations of Section 4 Variables were selected using 100(1minusα)delete‐d jackknife asymptotic normal confidence intervals The true positive rate (TPR) and true negative rate(TNR) was calculated for each simulation and results averaged over 250 independent runs True positive ratewas defined as the fraction of true signal variables identified True negative rate was the fraction of noisy signalvariables identified

Figure 9 displays the averaged TPR and TNR for each of the 12 simulation experiments under levels of significanceα=25 1 01 001 The top panel was calculated under a subsampling rate b=n12 while the bottom panel used b=n34A similar pattern for TPR and TNR values is generally observed for both sampling rates TPR decreases with increasingα TNR increases with α For b=n34 TNR rates are slightly better however for b=n12 TPR rates are slightly betterThus if the goal is finding true signal the edge goes to b=n12

Focusing on the top panel corresponding to the recommended n12subsampling rate a striking pattern for TPR isthat while TPR decreases with α the decline is fairly slow This is interesting given the wide range of α values from25 to 01 These values are extreme and unlikely to be used in practice and yet TPR results remain quite robustValues for TNR are also fairly robust to α although TNR values appear relatively too small when α=25 The valueα=25 is too extreme and creates overly narrow confidence regions causing noisy variables to be misclassified as signalvariables Generally however values α=1 05 01 perform very well under both TNR and TPR

FIGURE 9 Results from variable selection experiment using random forestndashregression simulations of Section 4 Displayed are the true

positive rate and true negative rate for variables selected using 100(1minusα) delete‐d jackknife confidence regions where α= 25 1 01

001 Top and bottom figures are based on subsampling rates b=n12 and b=n34

576 ISHWARAN AND LU

74 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator However asdiscussed earlier theoretical results for VIMP are difficult to come by and establishing a result like (8) for something ascomplicated as permuation importance is not easy As a token we would like to offer some partial insight into VIMP forthe regression case (RF‐R) perhaps pointing the way for more work in this area As shown in the appendix (see The-

orem 1) assuming an additive model hethXTHORN frac14 sumpjfrac141hjethXeth jTHORN

i THORN the population mean for VIMP equals

Efrac12IethXeth jTHORNΘLTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj is a correlation coefficient and hj is the additive expansion of h attributed to X(j) For noisy variables hj=0 andρj=1 thus VIMP will converge to zero For strong variables hjne0 Our theory suggests that the value of ρj will be thesame for all strong variables Therefore for strong variables except for some constant VIMP equals the amount that hjchanges when X(j) is permuted thus showing that VIMP correctly isolates the effect of X(j) in the model

The technical assumptions required by Theorem 1 are provided in the appendix however there are 2 key conditionsworth briefly mentioning One is the use of deep trees in which terminal nodes contain exactly one unique value (rep-licated values due to bootstrapping are allowed) A second condition is that the forest predictor is L2‐consistent for h Asdiscussed in the appendix this latter assumption is reasonable in our setting and has been proven by Scornet et al35

It is interesting that the above property for VIMP is tied to the consistency of the forest We believe in general thatproperties for VIMP such as its limiting distribution will rely on analogous results for the RF predictor Hopefully inthe future these results for VIMP will be proven At least in the case of RF‐R we know that distributional results existfor the predictor Wager19 established asymptotic normality of the infinite forest predictor (assuming one observationper terminal node) Mentch and Hooker20 established a similar result for the finite forest predictor See Biau andScornet36 for a comprehensive discussion of known theoretical results for RF

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran httporcidorg0000-0003-2758-9647

REFERENCES

1 Breiman L Random forests Mach Learn 2001455‐32

2 Ishwaran H Kogalur UB Blackstone EH Lauer MS Random survival forests Ann Appl Stat 20082(3)841‐860

3 Ishwaran H Gerds TA Kogalur UB Moore RD Gange SJ Lau BM Random survival forests for competing risks Biostatistics201415(4)757‐773

4 Hothorn T Hornik K Zeileis A Unbiased recursive partitioning a conditional inference framework J Comput Graph Stat200615(3)651‐674

5 Zhu R Kosorok MR Recursively imputed survival trees J Am Stat Assoc 2012107(497)331‐340

6 Breiman L Bagging predictors Mach Learn 199624(2)123‐140

7 Ishwaran H Variable importance in binary regression trees and forests Electron J Stat 20071519‐537

8 Groumlmping U Variable importance assessment in regression linear regression versus random forest Am Stat 200963(4)308‐319

9 Genuer R Poggi J‐M Tuleau‐Malot Christine Variable selection using random forests Pattern Recogn Lett 201031(14)2225‐2236

10 Breiman L Friedman J Stone CJ Olshen RA Classification and Regression Trees New York CRC press 1984

11 Breiman L Manual on setting up using and understanding random forests v31 2002

12 Louppe G Wehenkel L Sutera A Geurts P Understanding variable importances in forests of randomized trees In Advances in NeuralInformation Processing Systems 2013431‐439

13 Breiman L Out‐of‐bag estimation Technical report Berkeley CA 94708 Statistics Department University of California Berkeley 1996b199633 34

ISHWARAN AND LU 577

14 Archer KJ Kimes RV Empirical characterization of random forest variable importance measures Comput Stat Data Anal200852(4)2249‐2260

15 Nicodemus KK Malley JD Strobl C Ziegler A The behaviour of random forest permutation‐based variable importance measures underpredictor correlation BMC Bioinformatics 201011(1)110‐123

16 Gregorutti B Michel B Saint‐Pierre P Correlation and variable importance in random forests Stat Comput 201727(3)659‐678

17 Zhu R Zeng D Kosorok MR Reinforcement learning trees J Am Stat Assoc 2015110(512)1770‐1784

18 Sexton J Laake P Standard errors for bagged and random forest estimators Comput Stat Data Anal 200953(3)801‐811

19 Wager S Hastie T Efron B Confidence intervals for random forests the jackknife and the infinitesimal jackknife J Mach Learn Res201415(1)1625‐1651

20 Mentch L Hooker G Quantifying uncertainty in random forests via confidence intervals and hypothesis tests J Mach Learn Res201617(1)841‐881

21 Politis DN Romano JP Large sample confidence regions based on subsamples under minimal assumptions Ann Stat199422(4)2031‐2050

22 Shao J Wu CJ A general theory for jackknife variance estimation Ann Stat 198917(3)1176‐1197

23 Graf E Schmoor C Sauerbrei W Schumacher M Assessment and comparison of prognostic classification schemes for survival data StatMed 199918(17‐18)2529‐2545

24 Gerds TA Schumacher M Consistent estimation of the expected brier score in general survival models with right‐censored event timesBiom J 200648(6)1029‐1040

25 Hoeffding W Probability inequalities for sums of bounded random variables J Am Stat Assoc 196358(301)13‐30

26 Radulović D On the subsample bootstrap variance estimation Test 19987(2)295‐306

27 Friedman JH Multivariate adaptive regression splines Ann Stat 19911‐67

28 Biau G Fischer A Guedj B Malley JD COBRA a combined regression strategy J Multivar Anal 201614618‐28

29 Ishwaran H Kogalur UB Random forests for survival regression and classification (RF‐SRC) https cranr-projectorgwebpackagesrandomForestSRC R package version 250 2017

30 Leisch F Dimitriadou E Mlbench machine learning benchmark problems httpscranr-projectorg webpackagesmlbench R packageversion 21‐1 2010

31 Kuhn M Caret classification and regression training httpscranr-projectorgwebpackagescaret R package version 60‐77 2017

32 Harrell FE Califf RM Pryor DB Lee KL Rosati RA Evaluating the yield of medical tests JAMA 1982247(18)2543‐2546

33 Hsich E Gorodeski EZ Blackstone EH Ishwaran H Lauer MS Identifying important risk factors for survival in patient with systolicheart failure using random survival forests Circ Cardiovasc Qual Outcomes 20114(1)39‐45

34 Breiman L Software for the masses Banff Alberta Canada Lecture III IMS Wald Lectures 2002

35 Scornet E Biau G Vert J‐P Consistency of random forests Ann Stat 201543(4)1716‐1741

36 Biau G Scornet E A random forest guided tour Test 201625(2)197‐227

How to cite this article Ishwaran H Lu M Standard errors and confidence intervals for variable importancein random forest regression classification and survival Statistics in Medicine 201938558ndash582 httpsdoiorg101002sim7803

APPENDIX

Assessing normality for classification survival and competing risk

We applied the same strategy as in RF‐R simulations to assess normality of VIMP for RF‐C RSF and RSF competingrisk simulations Specifically for each setting we ran simulations 1000 times independently Experimental parameterswere set as before with n=2500 The finite forest VIMP for a variable was centered by its averaged value from the1000 simulations This centered value was then divided by the standard deviation of the 1000 VIMP values Quantilebias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normalquantile Quantile bias is displayed in Figure A1 for the 3 families

FIGURE A1 Assessing asymptotic normality of variable importance from RF‐C RSF and RSF competing risk simulations Figure

displays bias of standardized variable importance quantiles compared with standard normal quantiles Values are displayed for 5 10 25

50 75 90 95 percentile values

578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg

580 ISHWARAN AND LU

We can see that (A2) is driven by the 2 terms ~Δn and hnminush Define integer values ni=ni(Θ)ge0 recording the bootstrapfrequency of case i=1hellipn in LlowastethΘTHORN (notice that ni=0 implies case i is OOB) By the definition of a RF‐R tree we have

hn frac14 hethXΘLTHORN frac14 sumn

ifrac141WiethXΘTHORNYi

where fWiethXΘTHORNgn1 are the forest weights defined as follows

WiethXΘTHORN frac14 ni1fXiisinRethXΘTHORNgjRethXΘTHORNj

where R(XΘ) is the tree terminal node containingX and |R(XΘ)| is the cardinality equal to the number of bootstrap casesin R(X Θ) Notice that the weights are convex since 0leWi(X Θ)le1 and

sumn

ifrac141WiethXΘTHORN frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgjRethXΘTHORNj frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgsumn

iprimefrac141niprime1fXiprimeisinRethXΘTHORNgfrac14 1

Similarly we have

~hn frac14 heth~Xeth jTHORNΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYi whereWieth~Xeth jTHORNΘTHORN frac14 ni1fXiisinReth~Xeth jTHORNΘTHORNg

jReth~Xeth jTHORNΘTHORNj

Therefore

~ΔnethXTHORN frac14 heth~Xeth jTHORNΘLTHORNminushethXΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYiminussum

n

ifrac141WiethXΘTHORNYi

To study ~Δn in more detail we will assume deep trees containing one unique case per terminal node

Assumption 1 We assume each terminal node contains exactly one unique value That is each terminalnode contains the bootstrap copies of a unique data point

Assumption 1 results in the following useful simplification For notational ease write ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ)Then

~ΔnethXTHORN frac14 1

j~Rjsumiisin~R

niY iminus1jRjsumiisinRniY i

frac14 1

j~Rjsumiisin~R

nihethXiTHORNminus 1jRjsumiisinRnihethXiTHORN thorn 1

j~Rjsumiisin~R

niεiminus1jRjsumiisinRniεi

frac14 hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN thorn εieth~RTHORNminusεiethRTHORN

where ieth~RTHORN and i(R) identify the index for the bootstrap case in ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ) respectively (note that

ieth~RTHORN and i(R) are functions of X and j but this is suppressed for notational simplicity) We can see that the informationin the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understandingVIMP Notice that if hethXieth~RTHORNTHORN≍heth~Xeth jTHORNTHORN and h(Xi(R))≍h(X) which is what we would expect asymptotically with a deeptree then

hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN≍heth~Xeth jTHORNTHORNminushethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORN

where the right‐hand side follows by our assumption of an additive model (A1) This shows that VIMP for X(j) isassessed by how much its contribution to the additive expansion hj changes when X(j) is permuted This motivatesthe following assumption

ISHWARAN AND LU 581

Assumption 2 Assuming a deep tree with one unique value in a terminal node

heth~Xeth jTHORNTHORN frac14 hethXieth~RTHORNTHORN thorn ~ζnethXTHORN hethXTHORN frac14 hethXiethRTHORNTHORN thorn ζnethXTHORN

where Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN

This deals with the first 2 terms in the expansion of ~ΔnethXTHORN We also need to deal with the remaining term involving

the measurement errors εieth~RTHORNminusεiethRTHORN For this we will rely on a fairly mild exchangeability assumption

Assumption 3 ε ε is a finite exchangeable sequence with variance σ2

ieth~RTHORN iethRTHORN

Finally a further assumption we will need is consistency of the forest predictor2 2

Assumption 4 The forest predictor is L2‐consistent Efrac12ethhnminushTHORN rarr0 where Efrac12h ltinfin

Putting all of the above together we can now state our main result

Theorem 1 If Assumptions 1 2 3 and 4 hold then

θeth jTHORNn0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN

Note that the asymptotic limit will be heavily dependent on the strength of the variable Consider when X(j) is a noisy

variable Ideally this means the tree is split without ever using X(j) Therefore if we dropX and ~Xeth jTHORN down the tree they willoccupy the same terminal node Hence ~R frac14 R and εieth~RTHORN frac14 εiethRTHORN and therefore ρj=1 Furthermore because hj must be zerofor a noisy variable it follows that θeth jTHORNn0 frac14 oeth1THORN Thus the limit is zero for a noisy variable Obviously this is much differentthan the limit of a strong variable which must be strictly positive because hjne0 and ρjlt1 for strong variables

Proof. By (A2), we have $\theta^{(j)}_{n,0} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy-Schwarz inequality,

$$E[\tilde{\Delta}_n(h_n - h)] \le E\big[|\tilde{\Delta}_n|\,|h_n - h|\big] \le \sqrt{E[\tilde{\Delta}_n^2]}\,\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),

$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$

Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}$ and $\varepsilon_{i(R)}$ have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, we deduce that

$$E[\tilde{\Delta}_n(X)^2] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\big] + E\big[\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)^2\big] + o(1).$$

By exchangeability, $E[g(X)\,\varepsilon_{i(\tilde{R})}] = E[g(X)\,\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,

$$0 = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(\tilde{R})}\big] - E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\varepsilon_{i(R)}\big] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)\big].$$

Appealing to exchangeability once more, we have

$$E\big[\big(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}\big)^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\,\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$

where $\rho_j = \operatorname{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown

$$E[\tilde{\Delta}_n(X)^2] = E\big[\big(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)})\big)^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$

which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta^{(j)}_{n,0} = E[\tilde{\Delta}_n^2] + o(1)$. ∎


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for $h$. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features $X$ within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L_2$-consistency of RF-R for additive models, allowing for the number of terminal nodes to grow at a rate of $n$ (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required $X$ to be uniformly distributed, but other than this, their result can be seen as strong support for our assumptions.
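Finally, although this appendix has focused on the population quantity, the subsampling and delete-d jackknife variance estimators studied in the body of the paper are available directly from the same software. A small usage sketch (ours, continuing the forest object o from the simulation sketch above; the subsample call and its B argument reflect our reading of the randomForestSRC29 interface and should be checked against its documentation):

  # Subsampled VIMP: standard errors and delete-d jackknife confidence intervals
  # computed from B subsampled forests.
  smp <- subsample(o, B = 100)
  plot(smp)   # boxplots of subsampled VIMP with confidence regions per variable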




576 ISHWARAN AND LU

74 | Theoretical considerations

The key assumption underlying subsampling is the existence of a limiting distribution (8) for the estimator However asdiscussed earlier theoretical results for VIMP are difficult to come by and establishing a result like (8) for something ascomplicated as permuation importance is not easy As a token we would like to offer some partial insight into VIMP forthe regression case (RF‐R) perhaps pointing the way for more work in this area As shown in the appendix (see The-

orem 1) assuming an additive model hethXTHORN frac14 sumpjfrac141hjethXeth jTHORN

i THORN the population mean for VIMP equals

Efrac12IethXeth jTHORNΘLTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj is a correlation coefficient and hj is the additive expansion of h attributed to X(j) For noisy variables hj=0 andρj=1 thus VIMP will converge to zero For strong variables hjne0 Our theory suggests that the value of ρj will be thesame for all strong variables Therefore for strong variables except for some constant VIMP equals the amount that hjchanges when X(j) is permuted thus showing that VIMP correctly isolates the effect of X(j) in the model

The technical assumptions required by Theorem 1 are provided in the appendix however there are 2 key conditionsworth briefly mentioning One is the use of deep trees in which terminal nodes contain exactly one unique value (rep-licated values due to bootstrapping are allowed) A second condition is that the forest predictor is L2‐consistent for h Asdiscussed in the appendix this latter assumption is reasonable in our setting and has been proven by Scornet et al35

It is interesting that the above property for VIMP is tied to the consistency of the forest We believe in general thatproperties for VIMP such as its limiting distribution will rely on analogous results for the RF predictor Hopefully inthe future these results for VIMP will be proven At least in the case of RF‐R we know that distributional results existfor the predictor Wager19 established asymptotic normality of the infinite forest predictor (assuming one observationper terminal node) Mentch and Hooker20 established a similar result for the finite forest predictor See Biau andScornet36 for a comprehensive discussion of known theoretical results for RF

ACKNOWLEDGEMENTS

This research was supported by NIH grant R01 GM125072

ORCID

Hemant Ishwaran httporcidorg0000-0003-2758-9647

REFERENCES

1 Breiman L Random forests Mach Learn 2001455‐32

2 Ishwaran H Kogalur UB Blackstone EH Lauer MS Random survival forests Ann Appl Stat 20082(3)841‐860

3 Ishwaran H Gerds TA Kogalur UB Moore RD Gange SJ Lau BM Random survival forests for competing risks Biostatistics201415(4)757‐773

4 Hothorn T Hornik K Zeileis A Unbiased recursive partitioning a conditional inference framework J Comput Graph Stat200615(3)651‐674

5 Zhu R Kosorok MR Recursively imputed survival trees J Am Stat Assoc 2012107(497)331‐340

6 Breiman L Bagging predictors Mach Learn 199624(2)123‐140

7 Ishwaran H Variable importance in binary regression trees and forests Electron J Stat 20071519‐537

8 Groumlmping U Variable importance assessment in regression linear regression versus random forest Am Stat 200963(4)308‐319

9 Genuer R Poggi J‐M Tuleau‐Malot Christine Variable selection using random forests Pattern Recogn Lett 201031(14)2225‐2236

10 Breiman L Friedman J Stone CJ Olshen RA Classification and Regression Trees New York CRC press 1984

11 Breiman L Manual on setting up using and understanding random forests v31 2002

12 Louppe G Wehenkel L Sutera A Geurts P Understanding variable importances in forests of randomized trees In Advances in NeuralInformation Processing Systems 2013431‐439

13 Breiman L Out‐of‐bag estimation Technical report Berkeley CA 94708 Statistics Department University of California Berkeley 1996b199633 34

ISHWARAN AND LU 577

14 Archer KJ Kimes RV Empirical characterization of random forest variable importance measures Comput Stat Data Anal200852(4)2249‐2260

15 Nicodemus KK Malley JD Strobl C Ziegler A The behaviour of random forest permutation‐based variable importance measures underpredictor correlation BMC Bioinformatics 201011(1)110‐123

16 Gregorutti B Michel B Saint‐Pierre P Correlation and variable importance in random forests Stat Comput 201727(3)659‐678

17 Zhu R Zeng D Kosorok MR Reinforcement learning trees J Am Stat Assoc 2015110(512)1770‐1784

18 Sexton J Laake P Standard errors for bagged and random forest estimators Comput Stat Data Anal 200953(3)801‐811

19 Wager S Hastie T Efron B Confidence intervals for random forests the jackknife and the infinitesimal jackknife J Mach Learn Res201415(1)1625‐1651

20 Mentch L Hooker G Quantifying uncertainty in random forests via confidence intervals and hypothesis tests J Mach Learn Res201617(1)841‐881

21 Politis DN Romano JP Large sample confidence regions based on subsamples under minimal assumptions Ann Stat199422(4)2031‐2050

22 Shao J Wu CJ A general theory for jackknife variance estimation Ann Stat 198917(3)1176‐1197

23 Graf E Schmoor C Sauerbrei W Schumacher M Assessment and comparison of prognostic classification schemes for survival data StatMed 199918(17‐18)2529‐2545

24 Gerds TA Schumacher M Consistent estimation of the expected brier score in general survival models with right‐censored event timesBiom J 200648(6)1029‐1040

25 Hoeffding W Probability inequalities for sums of bounded random variables J Am Stat Assoc 196358(301)13‐30

26 Radulović D On the subsample bootstrap variance estimation Test 19987(2)295‐306

27 Friedman JH Multivariate adaptive regression splines Ann Stat 19911‐67

28 Biau G Fischer A Guedj B Malley JD COBRA a combined regression strategy J Multivar Anal 201614618‐28

29 Ishwaran H Kogalur UB Random forests for survival regression and classification (RF‐SRC) https cranr-projectorgwebpackagesrandomForestSRC R package version 250 2017

30 Leisch F Dimitriadou E Mlbench machine learning benchmark problems httpscranr-projectorg webpackagesmlbench R packageversion 21‐1 2010

31 Kuhn M Caret classification and regression training httpscranr-projectorgwebpackagescaret R package version 60‐77 2017

32 Harrell FE Califf RM Pryor DB Lee KL Rosati RA Evaluating the yield of medical tests JAMA 1982247(18)2543‐2546

33 Hsich E Gorodeski EZ Blackstone EH Ishwaran H Lauer MS Identifying important risk factors for survival in patient with systolicheart failure using random survival forests Circ Cardiovasc Qual Outcomes 20114(1)39‐45

34 Breiman L Software for the masses Banff Alberta Canada Lecture III IMS Wald Lectures 2002

35 Scornet E Biau G Vert J‐P Consistency of random forests Ann Stat 201543(4)1716‐1741

36 Biau G Scornet E A random forest guided tour Test 201625(2)197‐227

How to cite this article Ishwaran H Lu M Standard errors and confidence intervals for variable importancein random forest regression classification and survival Statistics in Medicine 201938558ndash582 httpsdoiorg101002sim7803

APPENDIX

Assessing normality for classification survival and competing risk

We applied the same strategy as in RF‐R simulations to assess normality of VIMP for RF‐C RSF and RSF competingrisk simulations Specifically for each setting we ran simulations 1000 times independently Experimental parameterswere set as before with n=2500 The finite forest VIMP for a variable was centered by its averaged value from the1000 simulations This centered value was then divided by the standard deviation of the 1000 VIMP values Quantilebias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normalquantile Quantile bias is displayed in Figure A1 for the 3 families

FIGURE A1 Assessing asymptotic normality of variable importance from RF‐C RSF and RSF competing risk simulations Figure

displays bias of standardized variable importance quantiles compared with standard normal quantiles Values are displayed for 5 10 25

50 75 90 95 percentile values

578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg

580 ISHWARAN AND LU

We can see that (A2) is driven by the 2 terms ~Δn and hnminush Define integer values ni=ni(Θ)ge0 recording the bootstrapfrequency of case i=1hellipn in LlowastethΘTHORN (notice that ni=0 implies case i is OOB) By the definition of a RF‐R tree we have

hn frac14 hethXΘLTHORN frac14 sumn

ifrac141WiethXΘTHORNYi

where fWiethXΘTHORNgn1 are the forest weights defined as follows

WiethXΘTHORN frac14 ni1fXiisinRethXΘTHORNgjRethXΘTHORNj

where R(XΘ) is the tree terminal node containingX and |R(XΘ)| is the cardinality equal to the number of bootstrap casesin R(X Θ) Notice that the weights are convex since 0leWi(X Θ)le1 and

sumn

ifrac141WiethXΘTHORN frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgjRethXΘTHORNj frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgsumn

iprimefrac141niprime1fXiprimeisinRethXΘTHORNgfrac14 1

Similarly we have

~hn frac14 heth~Xeth jTHORNΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYi whereWieth~Xeth jTHORNΘTHORN frac14 ni1fXiisinReth~Xeth jTHORNΘTHORNg

jReth~Xeth jTHORNΘTHORNj

Therefore

~ΔnethXTHORN frac14 heth~Xeth jTHORNΘLTHORNminushethXΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYiminussum

n

ifrac141WiethXΘTHORNYi

To study ~Δn in more detail we will assume deep trees containing one unique case per terminal node

Assumption 1 We assume each terminal node contains exactly one unique value That is each terminalnode contains the bootstrap copies of a unique data point

Assumption 1 results in the following useful simplification For notational ease write ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ)Then

~ΔnethXTHORN frac14 1

j~Rjsumiisin~R

niY iminus1jRjsumiisinRniY i

frac14 1

j~Rjsumiisin~R

nihethXiTHORNminus 1jRjsumiisinRnihethXiTHORN thorn 1

j~Rjsumiisin~R

niεiminus1jRjsumiisinRniεi

frac14 hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN thorn εieth~RTHORNminusεiethRTHORN

where ieth~RTHORN and i(R) identify the index for the bootstrap case in ~R frac14 Reth~Xeth jTHORNΘTHORN and R=R(X Θ) respectively (note that

ieth~RTHORN and i(R) are functions of X and j but this is suppressed for notational simplicity) We can see that the informationin the target function h is captured by the first 2 terms in the last line and therefore will be crucial to understandingVIMP Notice that if hethXieth~RTHORNTHORN≍heth~Xeth jTHORNTHORN and h(Xi(R))≍h(X) which is what we would expect asymptotically with a deeptree then

hethXieth~RTHORNTHORNminushethXiethRTHORNTHORN≍heth~Xeth jTHORNTHORNminushethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORN

where the right‐hand side follows by our assumption of an additive model (A1) This shows that VIMP for X(j) isassessed by how much its contribution to the additive expansion hj changes when X(j) is permuted This motivatesthe following assumption

ISHWARAN AND LU 581

Assumption 2 Assuming a deep tree with one unique value in a terminal node

heth~Xeth jTHORNTHORN frac14 hethXieth~RTHORNTHORN thorn ~ζnethXTHORN hethXTHORN frac14 hethXiethRTHORNTHORN thorn ζnethXTHORN

where Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN

This deals with the first 2 terms in the expansion of ~ΔnethXTHORN We also need to deal with the remaining term involving

the measurement errors εieth~RTHORNminusεiethRTHORN For this we will rely on a fairly mild exchangeability assumption

Assumption 3 ε ε is a finite exchangeable sequence with variance σ2

ieth~RTHORN iethRTHORN

Finally a further assumption we will need is consistency of the forest predictor2 2

Assumption 4 The forest predictor is L2‐consistent Efrac12ethhnminushTHORN rarr0 where Efrac12h ltinfin

Putting all of the above together we can now state our main result

Theorem 1 If Assumptions 1 2 3 and 4 hold then

θeth jTHORNn0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN

Note that the asymptotic limit will be heavily dependent on the strength of the variable Consider when X(j) is a noisy

variable Ideally this means the tree is split without ever using X(j) Therefore if we dropX and ~Xeth jTHORN down the tree they willoccupy the same terminal node Hence ~R frac14 R and εieth~RTHORN frac14 εiethRTHORN and therefore ρj=1 Furthermore because hj must be zerofor a noisy variable it follows that θeth jTHORNn0 frac14 oeth1THORN Thus the limit is zero for a noisy variable Obviously this is much differentthan the limit of a strong variable which must be strictly positive because hjne0 and ρjlt1 for strong variables

eth jTHORN ~2 ~

Proof By (A2) we have θn0 frac14 Efrac12Δn thorn 2Efrac12ΔnethhnminushTHORN We start by dealing with the second term

Efrac12~ΔnethhnminushTHORN By the Cauchy‐Schwartz inequality

Efrac12~ΔnethhnminushTHORNleEfrac12j~ΔnjjhnminushjleffiffiffiffiffiffiffiffiffiffiffiEfrac12~Δ2

nq ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Efrac12ethhnminushTHORN2q

By Assumption 4 the right‐hand side converges to zero if Efrac12~Δ2n remains bounded By Assumption 2 and

the assumption of an additive model (A1)

~ΔnethXTHORN frac14 hjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNminus~ζnethXTHORN thorn ζnethXTHORN thorn εieth~RTHORNminusεiethRTHORN

Assumption 4 implies h (and therefore hj) is square‐integrable Assumption 3 implies that εieth~RTHORN εiethRTHORN havefinite second moment and are square‐integrable Therefore squaring and taking expectations and using

Eeth~ζ 2nTHORN frac14 oeth1THORN and Eethζ 2nTHORN frac14 oeth1THORN deduce that

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORN2 thorn 2Efrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN thorn Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 thorn oeth1THORN

By exchangeability Efrac12gethXTHORNεieth~RTHORN frac14 Efrac12gethXTHORNεiethRTHORN for any function g(X) Hence

0 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORNεieth~RTHORNminusEfrac12ethhjeth~Xeth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNεiethRTHORN frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethX eth jTHORNTHORNTHORNethεieth~RTHORNminusεiethRTHORNTHORN

Appealing to exchangeability once more we have

Efrac12ethεieth~RTHORNminusεiethRTHORNTHORN2 frac14 Efrac12ε2ieth~RTHORN thorn Efrac12ε2iethRTHORNminus2Efrac12ethεieth~RTHORNεiethRTHORNTHORN frac14 2σ2eth1minusρjTHORN

where ρj frac14 correthεieth~RTHORN εiethRTHORNTHORN Therefore we have shown

Efrac12~ΔnethXTHORN2 frac14 Efrac12ethhjeth~X eth jTHORNTHORNminushjethXeth jTHORNTHORNTHORN2 thorn 2σ2eth1minusρjTHORN thorn oeth1THORN

which verifies boundedness of Efrac12~Δ2n and that θeth jTHORNn0 frac14 Efrac12~Δ2

n thorn oeth1THORN

582 ISHWARAN AND LU

The conditions needed to establish Theorem 1 are fairly reasonable Assumption 2 can be viewed as a type of continuitycondition for h However it is also an assertion about the approximating behavior of the forest predictor hn It asserts thatall features X within a terminal node have h(X) values close to one another which can be seen as an indirect way ofasserting good local prediction behavior for the tree Thus Assumption 2 is very similar to Assumption 4 The latterassumption of consistency is reasonable for deep trees under an additive model assumption Scornet et al35 establishedL2‐consistency of RF‐R for additive models allowing for the number of terminal nodes to grow at rate of n (see theorem2 of their paper) For technical reasons their proof replaced bootstrapping with subsampling and required X to be uni-formly distributed but other than this their result can be seen as strong support for our assumptions

ISHWARAN AND LU 577

14 Archer KJ Kimes RV Empirical characterization of random forest variable importance measures Comput Stat Data Anal200852(4)2249‐2260

15 Nicodemus KK Malley JD Strobl C Ziegler A The behaviour of random forest permutation‐based variable importance measures underpredictor correlation BMC Bioinformatics 201011(1)110‐123

16 Gregorutti B Michel B Saint‐Pierre P Correlation and variable importance in random forests Stat Comput 201727(3)659‐678

17 Zhu R Zeng D Kosorok MR Reinforcement learning trees J Am Stat Assoc 2015110(512)1770‐1784

18 Sexton J Laake P Standard errors for bagged and random forest estimators Comput Stat Data Anal 200953(3)801‐811

19 Wager S Hastie T Efron B Confidence intervals for random forests the jackknife and the infinitesimal jackknife J Mach Learn Res201415(1)1625‐1651

20 Mentch L Hooker G Quantifying uncertainty in random forests via confidence intervals and hypothesis tests J Mach Learn Res201617(1)841‐881

21 Politis DN Romano JP Large sample confidence regions based on subsamples under minimal assumptions Ann Stat199422(4)2031‐2050

22 Shao J Wu CJ A general theory for jackknife variance estimation Ann Stat 198917(3)1176‐1197

23 Graf E Schmoor C Sauerbrei W Schumacher M Assessment and comparison of prognostic classification schemes for survival data StatMed 199918(17‐18)2529‐2545

24 Gerds TA Schumacher M Consistent estimation of the expected brier score in general survival models with right‐censored event timesBiom J 200648(6)1029‐1040

25 Hoeffding W Probability inequalities for sums of bounded random variables J Am Stat Assoc 196358(301)13‐30

26 Radulović D On the subsample bootstrap variance estimation Test 19987(2)295‐306

27 Friedman JH Multivariate adaptive regression splines Ann Stat 19911‐67

28 Biau G Fischer A Guedj B Malley JD COBRA a combined regression strategy J Multivar Anal 201614618‐28

29 Ishwaran H Kogalur UB Random forests for survival regression and classification (RF‐SRC) https cranr-projectorgwebpackagesrandomForestSRC R package version 250 2017

30 Leisch F Dimitriadou E Mlbench machine learning benchmark problems httpscranr-projectorg webpackagesmlbench R packageversion 21‐1 2010

31 Kuhn M Caret classification and regression training httpscranr-projectorgwebpackagescaret R package version 60‐77 2017

32 Harrell FE Califf RM Pryor DB Lee KL Rosati RA Evaluating the yield of medical tests JAMA 1982247(18)2543‐2546

33 Hsich E Gorodeski EZ Blackstone EH Ishwaran H Lauer MS Identifying important risk factors for survival in patient with systolicheart failure using random survival forests Circ Cardiovasc Qual Outcomes 20114(1)39‐45

34 Breiman L Software for the masses Banff Alberta Canada Lecture III IMS Wald Lectures 2002

35 Scornet E Biau G Vert J‐P Consistency of random forests Ann Stat 201543(4)1716‐1741

36 Biau G Scornet E A random forest guided tour Test 201625(2)197‐227

How to cite this article Ishwaran H Lu M Standard errors and confidence intervals for variable importancein random forest regression classification and survival Statistics in Medicine 201938558ndash582 httpsdoiorg101002sim7803

APPENDIX

Assessing normality for classification survival and competing risk

We applied the same strategy as in RF‐R simulations to assess normality of VIMP for RF‐C RSF and RSF competingrisk simulations Specifically for each setting we ran simulations 1000 times independently Experimental parameterswere set as before with n=2500 The finite forest VIMP for a variable was centered by its averaged value from the1000 simulations This centered value was then divided by the standard deviation of the 1000 VIMP values Quantilebias was calculated by taking the difference between the quantile for standardized VIMP to that of a standard normalquantile Quantile bias is displayed in Figure A1 for the 3 families

FIGURE A1 Assessing asymptotic normality of variable importance from RF‐C RSF and RSF competing risk simulations Figure

displays bias of standardized variable importance quantiles compared with standard normal quantiles Values are displayed for 5 10 25

50 75 90 95 percentile values

578 ISHWARAN AND LU

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling delete‐djackknife and 164 bootstrap estimators Performance was measured using the time truncated concordance index3

Analysis focused on variables affecting cause 1 event RSF competing risk trees were constructed using log‐rank splittingwith weight 1 on event 1 and weight 0 on event 2 This ensured VIMP identified only those variables affecting event 1Results are displayed in Figure A2

Some theoretical results for VIMP in RF‐R

Let θeth jTHORNn frac14 IethX eth jTHORNLTHORN be the infinite forest estimator for VIMP (5) We assume the following additive regression modelholds

Yi frac14 sump

jfrac141hjethXeth jTHORN

i THORN thorn εi i frac14 1hellipn (A1)

where (Xi εi) are iid with distribution P such that Xi and εi are independent and EethεiTHORN frac14 0 Var(εi) = σ2ltinfin Notice (A1)implies that the target function h has an additive expansion hethXTHORN frac14 sump

jfrac141hjethXeth jTHORNTHORN This is a useful assumption because itwill allow us to isolate the effect of VIMP Also there are known consistency results for RF in additive models35 whichwe will use later in establishing our results Assuming squared error loss ℓethY ĥTHORN frac14 ethYminusĥTHORN2 we have

θeth jTHORNn frac14 EΘ1

NethΘTHORN sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

We will assume the number of OOB cases N(Θ) is always fixed at Round(neminus1) where Round(middot) is the nearest integerfunction Write Nn for N(Θ) Because this is a fixed value

θeth jTHORNn frac14 1Nn

EΘ sumiisinLlowastlowastethΘTHORN

frac12ethYiminusheth~Xeth jTHORNi ΘLTHORNTHORN2minusethYiminushethXiΘLTHORNTHORN2

To study θeth jTHORNn we will evaluate its mean θeth jTHORNn0 frac14 ELfrac12θeth jTHORNn For ease of notation write hni frac14 hethXiΘLTHORN and

~hni frac14 heth~Xeth jTHORNi ΘLTHORN Likewise let hn frac14 hethXΘLTHORN and ~hn frac14 heth~Xeth jTHORNΘLTHORN We have

FIGURE A2 Results from competing risk simulations showing performance of subsampling delete‐d jackknife and 164 bootstrap Left‐

and right‐hand side figures are based on subsampling rates b=n12 and b=n34 Top and middle figures display bias and standardized mean‐

squared error (SMSE) for estimating variable importance (VIMP) standard error Bottom figure displays coverage for VIMP 90 asymptotic

normal confidence intervals Results have been stratified into 6 groups based on 10th 25th 50th 75th and 90th percentiles of true finite VIMP

[Colour figure can be viewed at wileyonlinelibrarycom]

ISHWARAN AND LU 579

θeth jTHORNn0 frac141Nn

E sumiisinLlowastlowastethΘTHORN

frac12ethYiminus~hniTHORN2minusethYiminushniTHORN2

frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2

where the right‐hand side follows because ethXY hn ~hnTHORNfrac14d ethXiYi hni ~hniTHORN if i is OOB (ie because the tree does not useinformation about (Xi Yi) in its construction we can replace (Xi Yi) with (X Y)) Now making using of the represen-tation Y=h(X)+ε which holds by the assumed regression model (A1) and writing h for h(X) and ~Δn frac14 ~hnminushn

θeth jTHORNn0 frac14 Efrac12ethYminus~hnTHORN2minusethYminushnTHORN2 frac14 Efrac12minus2ε~Δn thorn ~Δ2n thorn 2~ΔnethhnminushTHORN frac14 Efrac12~Δ2

n thorn 2Efrac12~ΔnethhnminushTHORN (A2)

where in the last line we have used EethεTHORN frac14 0 and that ε is independent of f~Δn hn hg

580 ISHWARAN AND LU

We can see that (A2) is driven by the 2 terms ~Δn and hnminush Define integer values ni=ni(Θ)ge0 recording the bootstrapfrequency of case i=1hellipn in LlowastethΘTHORN (notice that ni=0 implies case i is OOB) By the definition of a RF‐R tree we have

hn frac14 hethXΘLTHORN frac14 sumn

ifrac141WiethXΘTHORNYi

where fWiethXΘTHORNgn1 are the forest weights defined as follows

WiethXΘTHORN frac14 ni1fXiisinRethXΘTHORNgjRethXΘTHORNj

where R(XΘ) is the tree terminal node containingX and |R(XΘ)| is the cardinality equal to the number of bootstrap casesin R(X Θ) Notice that the weights are convex since 0leWi(X Θ)le1 and

sumn

ifrac141WiethXΘTHORN frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgjRethXΘTHORNj frac14 sum

n

ifrac141

ni1fXiisinRethXΘTHORNgsumn

iprimefrac141niprime1fXiprimeisinRethXΘTHORNgfrac14 1

Similarly we have

~hn frac14 heth~Xeth jTHORNΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYi whereWieth~Xeth jTHORNΘTHORN frac14 ni1fXiisinReth~Xeth jTHORNΘTHORNg

jReth~Xeth jTHORNΘTHORNj

Therefore

~ΔnethXTHORN frac14 heth~Xeth jTHORNΘLTHORNminushethXΘLTHORN frac14 sumn

ifrac141Wieth~Xeth jTHORNΘTHORNYiminussum

n

ifrac141WiethXΘTHORNYi

To study ~Δn in more detail we will assume deep trees containing one unique case per terminal node

Assumption 1 We assume each terminal node contains exactly one unique value That is each terminalnode contains the bootstrap copies of a unique data point

FIGURE A1 Assessing asymptotic normality of variable importance from RF-C, RSF, and RSF competing risk simulations. Figure displays bias of standardized variable importance quantiles compared with standard normal quantiles. Values are displayed for the 5, 10, 25, 50, 75, 90, and 95 percentile values.

Performance results from competing risk simulations

Competing risk simulations from Ishwaran et al3 were used to assess performance of the subsampling, delete-d jackknife, and .164 bootstrap estimators. Performance was measured using the time-truncated concordance index.3 Analysis focused on variables affecting the cause 1 event. RSF competing risk trees were constructed using log-rank splitting with weight 1 on event 1 and weight 0 on event 2. This ensured that VIMP identified only those variables affecting event 1. Results are displayed in Figure A2.
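To fix ideas, here is a minimal Python sketch (ours, purely illustrative, not the authors' implementation) of the two subsampling-based variance estimators compared above, written for a generic statistic. Here `stat` is a hypothetical callable standing in for the VIMP of one variable recomputed on a subsample, and the scaling constants follow one standard form of these estimators.

```python
# Minimal sketch (assumed form) of the subsampling and delete-d jackknife
# variance estimators for a statistic computed on an (n, ...) numpy array.
import numpy as np

def subsample_variance(data, stat, b, n_subsamples=100, rng=None):
    """Subsampling estimator: variance of the statistic over size-b
    subsamples (drawn without replacement), rescaled by b/n."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(data)
    thetas = np.array([stat(data[rng.choice(n, size=b, replace=False)])
                       for _ in range(n_subsamples)])
    return (b / n) * np.var(thetas, ddof=1)

def delete_d_jackknife_variance(data, stat, b, n_subsamples=100, rng=None):
    """Delete-d jackknife with d = n - b: center the subsampled statistics
    at the full-sample estimate, which corrects bias at low rates b/n."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(data)
    theta_full = stat(data)
    thetas = np.array([stat(data[rng.choice(n, size=b, replace=False)])
                       for _ in range(n_subsamples)])
    return (b / (n - b)) * np.mean((thetas - theta_full) ** 2)
```

In the competing risk setting above, `stat` would grow an RSF competing risk forest on the subsampled rows and return the VIMP of a given variable; the square roots of these variance estimates are the standard errors whose bias, SMSE, and confidence interval coverage Figure A2 reports.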

Some theoretical results for VIMP in RF-R

Let $\theta_n^{(j)} = I(X^{(j)}, \mathscr{L})$ be the infinite forest estimator for VIMP (5). We assume the following additive regression model holds:
$$Y_i = \sum_{j=1}^{p} h_j(X_i^{(j)}) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{A1}$$
where $(X_i, \varepsilon_i)$ are iid with distribution $\mathsf{P}$ such that $X_i$ and $\varepsilon_i$ are independent, $E(\varepsilon_i) = 0$, and $\operatorname{Var}(\varepsilon_i) = \sigma^2 < \infty$. Notice that (A1) implies the target function $h$ has an additive expansion $h(X) = \sum_{j=1}^{p} h_j(X^{(j)})$. This is a useful assumption because it will allow us to isolate the effect of VIMP. Also, there are known consistency results for RF in additive models,35 which we will use later in establishing our results. Assuming squared error loss $\ell(Y, \hat{h}) = (Y - \hat{h})^2$, we have
$$\theta_n^{(j)} = E_\Theta\left[\frac{1}{N(\Theta)} \sum_{i \in \mathscr{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathscr{L})\big)^2\Big]\right].$$
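To make this estimand concrete, the following is a small simulation sketch (ours, illustrative only): it draws data from an additive model of the form (A1), grows bagged regression trees, and computes a finite-forest version of $\theta_n^{(j)}$ by averaging over trees the OOB increase in squared error after permuting coordinate $j$. It uses scikit-learn's `BaggingRegressor` (the `estimator` argument assumes scikit-learn ≥ 1.2) and omits random feature selection for simplicity.

```python
# Illustrative finite-forest OOB permutation VIMP under an additive model.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p = 1000, 4
X = rng.uniform(-1, 1, size=(n, p))
h = 2 * X[:, 0] + np.sin(np.pi * X[:, 1])   # h(X) = h_1(X^(1)) + h_2(X^(2)); X^(3), X^(4) are noise
y = h + rng.normal(scale=0.5, size=n)

# Deep trees grown on bootstrap samples.
forest = BaggingRegressor(estimator=DecisionTreeRegressor(min_samples_leaf=1),
                          n_estimators=300, bootstrap=True, random_state=0).fit(X, y)

def oob_permutation_vimp(forest, X, y, j, rng):
    """Average over trees of the OOB increase in squared error when
    coordinate j is permuted (finite-forest analogue of theta_n^(j))."""
    deltas = []
    for tree, sampled in zip(forest.estimators_, forest.estimators_samples_):
        oob = np.setdiff1d(np.arange(len(y)), sampled)   # OOB cases: n_i = 0
        X_oob, X_perm = X[oob], X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])     # permute X^(j) over OOB cases
        err_perm = np.mean((y[oob] - tree.predict(X_perm)) ** 2)
        err = np.mean((y[oob] - tree.predict(X_oob)) ** 2)
        deltas.append(err_perm - err)
    return float(np.mean(deltas))

for j in range(p):
    print(f"VIMP X({j + 1}): {oob_permutation_vimp(forest, X, y, j, rng):.4f}")
```

Under this design, the signal coordinates should receive clearly positive VIMP while the noise coordinates concentrate near zero, anticipating the noisy-variable discussion after Theorem 1 below.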

We will assume the number of OOB cases $N(\Theta)$ is always fixed at $\operatorname{Round}(n e^{-1})$, where $\operatorname{Round}(\cdot)$ is the nearest integer function. Write $N_n$ for $N(\Theta)$. Because this is a fixed value,
$$\theta_n^{(j)} = \frac{1}{N_n} E_\Theta\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \Big[\big(Y_i - h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})\big)^2 - \big(Y_i - h(X_i, \Theta, \mathscr{L})\big)^2\Big]\right].$$
To study $\theta_n^{(j)}$, we will evaluate its mean $\theta_{n0}^{(j)} = E_{\mathscr{L}}[\theta_n^{(j)}]$. For ease of notation, write $h_{ni} = h(X_i, \Theta, \mathscr{L})$ and $\tilde{h}_{ni} = h(\tilde{X}_i^{(j)}, \Theta, \mathscr{L})$. Likewise, let $h_n = h(X, \Theta, \mathscr{L})$ and $\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L})$.

FIGURE A2 Results from competing risk simulations showing performance of subsampling, delete-d jackknife, and .164 bootstrap. Left- and right-hand side figures are based on subsampling rates $b = n^{1/2}$ and $b = n^{3/4}$. Top and middle figures display bias and standardized mean-squared error (SMSE) for estimating variable importance (VIMP) standard error. Bottom figure displays coverage for VIMP 90% asymptotic normal confidence intervals. Results have been stratified into 6 groups based on the 10th, 25th, 50th, 75th, and 90th percentiles of true finite VIMP.


We then have
$$\theta_{n0}^{(j)} = \frac{1}{N_n} E\left[\sum_{i \in \mathscr{L}^{**}(\Theta)} \big[(Y_i - \tilde{h}_{ni})^2 - (Y_i - h_{ni})^2\big]\right] = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big],$$
where the right-hand side follows because $(X, Y, h_n, \tilde{h}_n) \overset{d}{=} (X_i, Y_i, h_{ni}, \tilde{h}_{ni})$ if $i$ is OOB (ie, because the tree does not use information about $(X_i, Y_i)$ in its construction, we can replace $(X_i, Y_i)$ with $(X, Y)$). Now, making use of the representation $Y = h(X) + \varepsilon$, which holds by the assumed regression model (A1), and writing $h$ for $h(X)$ and $\tilde{\Delta}_n = \tilde{h}_n - h_n$,
$$\theta_{n0}^{(j)} = E\big[(Y - \tilde{h}_n)^2 - (Y - h_n)^2\big] = E\big[-2\varepsilon\tilde{\Delta}_n + \tilde{\Delta}_n^2 + 2\tilde{\Delta}_n(h_n - h)\big] = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)], \tag{A2}$$
where in the last equality we have used $E(\varepsilon) = 0$ and that $\varepsilon$ is independent of $\{\tilde{\Delta}_n, h_n, h\}$.


We can see that (A2) is driven by the two terms $\tilde{\Delta}_n$ and $h_n - h$. Define integer values $n_i = n_i(\Theta) \geq 0$ recording the bootstrap frequency of case $i = 1, \ldots, n$ in $\mathscr{L}^{*}(\Theta)$ (notice that $n_i = 0$ implies case $i$ is OOB). By the definition of an RF-R tree, we have
$$h_n = h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(X, \Theta) Y_i,$$
where $\{W_i(X, \Theta)\}_1^n$ are the forest weights, defined as follows:
$$W_i(X, \Theta) = \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|},$$
where $R(X, \Theta)$ is the tree terminal node containing $X$ and $|R(X, \Theta)|$ is its cardinality, equal to the number of bootstrap cases in $R(X, \Theta)$. Notice that the weights are convex, since $0 \leq W_i(X, \Theta) \leq 1$ and
$$\sum_{i=1}^{n} W_i(X, \Theta) = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{|R(X, \Theta)|} = \sum_{i=1}^{n} \frac{n_i 1\{X_i \in R(X, \Theta)\}}{\sum_{i'=1}^{n} n_{i'} 1\{X_{i'} \in R(X, \Theta)\}} = 1.$$
Similarly, we have
$$\tilde{h}_n = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i, \qquad \text{where } W_i(\tilde{X}^{(j)}, \Theta) = \frac{n_i 1\{X_i \in R(\tilde{X}^{(j)}, \Theta)\}}{|R(\tilde{X}^{(j)}, \Theta)|}.$$
Therefore,
$$\tilde{\Delta}_n(X) = h(\tilde{X}^{(j)}, \Theta, \mathscr{L}) - h(X, \Theta, \mathscr{L}) = \sum_{i=1}^{n} W_i(\tilde{X}^{(j)}, \Theta) Y_i - \sum_{i=1}^{n} W_i(X, \Theta) Y_i.$$
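The weight representation is easy to check numerically. Below is a short sketch (ours, not from the paper) that grows a single tree on a bootstrap sample with scikit-learn and confirms that its prediction at a new point $x$ equals $\sum_{i=1}^{n} W_i(x, \Theta) Y_i$ with the weights defined above.

```python
# Numerical check: a bootstrap-grown tree's prediction equals sum_i W_i(x) * y_i.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=n)

boot = rng.integers(0, n, size=n)        # bootstrap draw; n_i = frequency of case i
n_i = np.bincount(boot, minlength=n)
tree = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X[boot], y[boot])

x_new = rng.uniform(-1, 1, size=(1, 2))
in_leaf = tree.apply(X) == tree.apply(x_new)[0]   # 1{X_i in R(x, Theta)}
W = n_i * in_leaf / np.sum(n_i * in_leaf)         # |R| counts bootstrap cases
assert np.isclose(W.sum(), 1.0)                   # the weights are convex
print(np.dot(W, y), tree.predict(x_new)[0])       # the two values coincide
```

The same bookkeeping, applied to the permuted point $\tilde{x}^{(j)}$, yields $\tilde{h}_n$.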

To study $\tilde{\Delta}_n$ in more detail, we will assume deep trees containing one unique case per terminal node.

Assumption 1. Each terminal node contains exactly one unique value; that is, each terminal node contains the bootstrap copies of a single data point.

Assumption 1 results in the following useful simplification. For notational ease, write $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$. Then
$$\begin{aligned}
\tilde{\Delta}_n(X) &= \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i Y_i - \frac{1}{|R|}\sum_{i \in R} n_i Y_i \\
&= \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i h(X_i) - \frac{1}{|R|}\sum_{i \in R} n_i h(X_i) + \frac{1}{|\tilde{R}|}\sum_{i \in \tilde{R}} n_i \varepsilon_i - \frac{1}{|R|}\sum_{i \in R} n_i \varepsilon_i \\
&= h(X_{i(\tilde{R})}) - h(X_{i(R)}) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)},
\end{aligned}$$
where $i(\tilde{R})$ and $i(R)$ identify the index of the bootstrap case in $\tilde{R} = R(\tilde{X}^{(j)}, \Theta)$ and $R = R(X, \Theta)$, respectively (note that $i(\tilde{R})$ and $i(R)$ are functions of $X$ and $j$, but this is suppressed for notational simplicity). We can see that the information in the target function $h$ is captured by the first two terms in the last line, which will therefore be crucial to understanding VIMP. Notice that if $h(X_{i(\tilde{R})}) \asymp h(\tilde{X}^{(j)})$ and $h(X_{i(R)}) \asymp h(X)$, which is what we would expect asymptotically with a deep tree, then
$$h(X_{i(\tilde{R})}) - h(X_{i(R)}) \asymp h(\tilde{X}^{(j)}) - h(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}),$$
where the right-hand side follows by our assumption of an additive model (A1). This shows that VIMP for $X^{(j)}$ is assessed by how much its contribution to the additive expansion, $h_j$, changes when $X^{(j)}$ is permuted. This motivates the following assumption.


Assumption 2. Assuming a deep tree with one unique value in a terminal node,
$$h(\tilde{X}^{(j)}) = h(X_{i(\tilde{R})}) + \tilde{\zeta}_n(X), \qquad h(X) = h(X_{i(R)}) + \zeta_n(X),$$
where $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$.

This deals with the first two terms in the expansion of $\tilde{\Delta}_n(X)$. We also need to deal with the remaining term involving the measurement errors, $\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}$. For this, we will rely on a fairly mild exchangeability assumption.

Assumption 3. $(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$ is a finite exchangeable sequence with variance $\sigma^2$.

Finally, a further assumption we will need is consistency of the forest predictor.

Assumption 4. The forest predictor is $L^2$-consistent: $E[(h_n - h)^2] \to 0$, where $E[h^2] < \infty$.

Putting all of the above together, we can now state our main result.

Theorem 1. If Assumptions 1, 2, 3, and 4 hold, then
$$\theta_{n0}^{(j)} = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$
where $\rho_j = \operatorname{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$.

Note that the asymptotic limit will be heavily dependent on the strength of the variable. Consider when $X^{(j)}$ is a noisy variable. Ideally, this means the tree is split without ever using $X^{(j)}$. Therefore, if we drop $X$ and $\tilde{X}^{(j)}$ down the tree, they will occupy the same terminal node. Hence $\tilde{R} = R$ and $\varepsilon_{i(\tilde{R})} = \varepsilon_{i(R)}$, and therefore $\rho_j = 1$. Furthermore, because $h_j$ must be zero for a noisy variable, it follows that $\theta_{n0}^{(j)} = o(1)$. Thus, the limit is zero for a noisy variable. This is much different from the limit of a strong variable, which must be strictly positive because $h_j \neq 0$ and $\rho_j < 1$ for strong variables.
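For a concrete sense of the strong-variable limit, consider a worked example of ours (not from the paper): a linear coordinate $h_j(x) = \beta_j x$ with $\operatorname{Var}(X^{(j)}) = 1$, where the permuted coordinate $\tilde{X}^{(j)}$ behaves asymptotically as an independent copy of $X^{(j)}$:

```latex
% Worked example (illustrative): h_j(x) = \beta_j x, Var(X^{(j)}) = 1,
% \tilde{X}^{(j)} an independent copy of X^{(j)}.
E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big]
  = \beta_j^2 \, E\big[(\tilde{X}^{(j)} - X^{(j)})^2\big]
  = 2\beta_j^2,
\qquad \text{so} \qquad
\theta_{n0}^{(j)} = 2\beta_j^2 + 2\sigma^2(1 - \rho_j) + o(1).
```

The first term grows with the squared signal strength $\beta_j^2$, while the second term vanishes as $\rho_j \to 1$, which is exactly the noisy-variable behavior described above.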

Proof. By (A2), we have $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + 2E[\tilde{\Delta}_n(h_n - h)]$. We start by dealing with the second term, $E[\tilde{\Delta}_n(h_n - h)]$. By the Cauchy–Schwarz inequality,
$$E[\tilde{\Delta}_n(h_n - h)] \leq E[|\tilde{\Delta}_n||h_n - h|] \leq \sqrt{E[\tilde{\Delta}_n^2]}\,\sqrt{E[(h_n - h)^2]}.$$

By Assumption 4, the right-hand side converges to zero if $E[\tilde{\Delta}_n^2]$ remains bounded. By Assumption 2 and the assumption of an additive model (A1),
$$\tilde{\Delta}_n(X) = h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}) - \tilde{\zeta}_n(X) + \zeta_n(X) + \varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)}.$$
Assumption 4 implies $h$ (and therefore $h_j$) is square-integrable. Assumption 3 implies that $\varepsilon_{i(\tilde{R})}$ and $\varepsilon_{i(R)}$ have finite second moments and are square-integrable. Therefore, squaring and taking expectations, and using $E(\tilde{\zeta}_n^2) = o(1)$ and $E(\zeta_n^2) = o(1)$, we deduce that
$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big] + E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] + o(1).$$
By exchangeability, $E[g(X)\varepsilon_{i(\tilde{R})}] = E[g(X)\varepsilon_{i(R)}]$ for any function $g(X)$. Hence,
$$0 = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(\tilde{R})}\big] - E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))\varepsilon_{i(R)}\big] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})\big].$$
Appealing to exchangeability once more, we have
$$E\big[(\varepsilon_{i(\tilde{R})} - \varepsilon_{i(R)})^2\big] = E[\varepsilon_{i(\tilde{R})}^2] + E[\varepsilon_{i(R)}^2] - 2E[\varepsilon_{i(\tilde{R})}\varepsilon_{i(R)}] = 2\sigma^2(1 - \rho_j),$$
where $\rho_j = \operatorname{corr}(\varepsilon_{i(\tilde{R})}, \varepsilon_{i(R)})$. Therefore, we have shown
$$E[\tilde{\Delta}_n(X)^2] = E\big[(h_j(\tilde{X}^{(j)}) - h_j(X^{(j)}))^2\big] + 2\sigma^2(1 - \rho_j) + o(1),$$
which verifies boundedness of $E[\tilde{\Delta}_n^2]$ and that $\theta_{n0}^{(j)} = E[\tilde{\Delta}_n^2] + o(1)$. $\square$


The conditions needed to establish Theorem 1 are fairly reasonable. Assumption 2 can be viewed as a type of continuity condition for $h$. However, it is also an assertion about the approximating behavior of the forest predictor $h_n$: it asserts that all features $X$ within a terminal node have $h(X)$ values close to one another, which can be seen as an indirect way of asserting good local prediction behavior for the tree. Thus, Assumption 2 is very similar to Assumption 4. The latter assumption of consistency is reasonable for deep trees under an additive model assumption. Scornet et al35 established $L^2$-consistency of RF-R for additive models, allowing the number of terminal nodes to grow at a rate of $n$ (see theorem 2 of their paper). For technical reasons, their proof replaced bootstrapping with subsampling and required $X$ to be uniformly distributed, but otherwise their result can be seen as strong support for our assumptions.
