Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts. Dept. of EE & CS & GIGA-R, Université de Liège, Belgium. June 6, 2014.


Page 1: Understanding variable importances in forests of randomized trees

Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts

Dept. of EE & CS & GIGA-R, Université de Liège, Belgium

June 6, 2014

1 / 12

Page 2: Understanding variable importances in forests of randomized trees

Ensembles of randomized trees

[Diagram: an ensemble of randomized trees, combined by majority vote or prediction average]

• Improve standard classification and regression trees by reducing their variance.

• Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).

• Standard Random Forests: bootstrap sampling + random selection of K features at each node.

2 / 12
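The three ensemble methods named above can be sketched side by side. This is an illustrative example, not the authors' code: the dataset, hyperparameters, and use of scikit-learn are assumptions made here for demonstration.

```python
# Sketch (illustrative, not from the slides): the three ensemble variants,
# as implemented in scikit-learn, compared by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Bagging (Breiman, 1996)": BaggingClassifier(n_estimators=100, random_state=0),
    "Random Forest (Breiman, 2001)": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra-Trees (Geurts et al., 2006)": ExtraTreesClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```

In scikit-learn, the "random selection of K features at each node" corresponds to the `max_features` parameter.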

Page 3: Understanding variable importances in forests of randomized trees

Strengths and weaknesses

• Universal approximation

• Robustness to outliers

• Robustness to irrelevant attributes (to some extent)

• Invariance to scaling of inputs

• Good computational efficiency and scalability

• Very good accuracy

• Loss of interpretability w.r.t. single decision trees

3 / 12


Page 5: Understanding variable importances in forests of randomized trees

Variable importances

• Interpretability can be recovered through variable importances.

• Two main importance measures:
  • The mean decrease of impurity (MDI): summing total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
  • The mean decrease of accuracy (MDA): measuring accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

• We focus here on MDI because:
  • it is faster to compute;
  • it does not require bootstrap sampling;
  • in practice, it correlates well with the MDA measure (except in specific conditions).

4 / 12
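Both measures are available in scikit-learn, which can make the contrast concrete. This is a sketch under assumptions not in the slides: the dataset is synthetic, and scikit-learn's `permutation_importance` evaluates on whatever data you pass rather than strictly on out-of-bag samples.

```python
# Sketch (illustrative): MDI vs. MDA in scikit-learn.
# MDI is the fitted forest's feature_importances_;
# MDA corresponds to permutation importance (here computed on the
# training data, not on out-of-bag samples as in Breiman, 2001).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

mdi = forest.feature_importances_   # mean decrease of impurity
mda = permutation_importance(forest, X, y, n_repeats=10,
                             random_state=0).importances_mean

for j in range(X.shape[1]):
    print(f"X{j+1}: MDI={mdi[j]:.3f}  MDA={mda[j]:.3f}")
```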


Page 8: Understanding variable importances in forests of randomized trees

Mean decrease of impurity

• Importance of variable X_j for an ensemble of M trees φ_m is:

  Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t),   (1)

where j_t denotes the variable used at node t, p(t) = N_t / N, and Δi(t) is the impurity reduction at node t:

  \Delta i(t) = i(t) - \frac{N_{t_L}}{N_t} i(t_L) - \frac{N_{t_R}}{N_t} i(t_R)   (2)

• Impurity i(t) can be Shannon entropy, Gini index, variance, ...

5 / 12
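Equations (1) and (2) can be recomputed by hand from a fitted tree and checked against scikit-learn's built-in importances, which are exactly this quantity normalized to sum to one. This is a sketch under assumptions not in the slides (scikit-learn internals, synthetic data, a single tree, i.e. M = 1).

```python
# Sketch: recomputing p(t) * Delta i(t), summed over the nodes where each
# variable splits, from scikit-learn's tree internals, and checking it
# against the built-in (normalized) feature_importances_.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    j = t.feature[node]
    if j < 0:        # leaf node: no split, no contribution
        continue
    left, right = t.children_left[node], t.children_right[node]
    n, nl, nr = (t.weighted_n_node_samples[k] for k in (node, left, right))
    # Delta i(t) = i(t) - (N_tL/N_t) i(t_L) - (N_tR/N_t) i(t_R)
    delta_i = t.impurity[node] - (nl / n) * t.impurity[left] - (nr / n) * t.impurity[right]
    # accumulate p(t) * Delta i(t), with p(t) = N_t / N
    imp[j] += (n / t.weighted_n_node_samples[0]) * delta_i

assert np.allclose(imp / imp.sum(), tree.feature_importances_)
```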

Page 9: Understanding variable importances in forests of randomized trees

Motivation and assumptions

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)

• MDI works well, but it is not well understood theoretically;

• We would like to better characterize it and derive its main properties from this characterization.

• Working assumptions:
  • all variables are discrete;
  • multi-way splits à la C4.5 (i.e., one branch per value);
  • Shannon entropy as impurity measure:

    i(t) = - \sum_c \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}

  • totally randomized trees (RF with K = 1);
  • asymptotic conditions: N → ∞, M → ∞.

6 / 12
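The impurity measure assumed above is just the Shannon entropy of the class counts at a node. A minimal sketch (the function name and the choice of natural logarithm are ours; the base only rescales importances):

```python
# Sketch of the assumed impurity: Shannon entropy of the class
# counts N_{t,c} at a node t.
import numpy as np

def shannon_entropy(counts):
    """i(t) = -sum_c (N_{t,c}/N_t) log(N_{t,c}/N_t), natural log."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    return float(-(p * np.log(p)).sum())

assert np.isclose(shannon_entropy([5, 5]), np.log(2))  # maximally impure binary node
assert shannon_entropy([10, 0]) == 0.0                 # pure node
```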


Page 16: Understanding variable importances in forests of randomized trees

Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) Decomposition of the information jointly provided by all input variables about the output, in terms of the MDI importance of each input variable:

  I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} Imp(X_j)

ii), iii) Decomposition of each importance along the degrees k of interaction with the other variables, and along all interaction terms B of a given degree k:

  Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)

E.g.: p = 3,

  Imp(X_1) = \frac{1}{3} I(X_1;Y) + \frac{1}{6}\left(I(X_1;Y \mid X_2) + I(X_1;Y \mid X_3)\right) + \frac{1}{3} I(X_1;Y \mid X_2, X_3)

7 / 12
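The decomposition above can be checked numerically by brute force. This is not the authors' code: it uses plug-in (empirical) entropies on a toy problem of our own construction (p = 3 binary inputs, y = x1 XOR x2, x3 irrelevant), in the totally randomized / asymptotic setting where the formula is exact.

```python
# Sketch: brute-force check that sum_j Imp(X_j) = I(X_1,...,X_p; Y)
# on a tiny discrete toy problem, using plug-in entropies.
from itertools import combinations
from math import comb, log
from collections import Counter

def H(rows):
    """Plug-in Shannon entropy of the empirical distribution of `rows`."""
    n = len(rows)
    return -sum(c / n * log(c / n) for c in Counter(rows).values())

def cond_mi(data, xs, ys, bs):
    """I(X;Y|B) = H(X,B) + H(Y,B) - H(X,Y,B) - H(B), empirical counts."""
    col = lambda idx: [tuple(r[i] for i in idx) for r in data]
    return H(col(xs + bs)) + H(col(ys + bs)) - H(col(xs + ys + bs)) - H(col(bs))

# Toy dataset: rows are (x1, x2, x3, y); y = x1 XOR x2, x3 irrelevant.
data = [(a, b, c, a ^ b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
p = 3

def imp(j):
    others = [i for i in range(p) if i != j]
    return sum(
        1 / comb(p, k) * 1 / (p - k) * cond_mi(data, [j], [p], list(B))
        for k in range(p)
        for B in combinations(others, k)
    )

total = sum(imp(j) for j in range(p))
assert abs(total - cond_mi(data, [0, 1, 2], [p], [])) < 1e-9  # = I(X1,X2,X3;Y)
```

Note the XOR structure: X1 and X2 are individually uninformative (I(X1;Y) = 0) yet each receives nonzero importance through the conditional terms, while the irrelevant X3 gets zero.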


Page 20: Understanding variable importances in forests of randomized trees

Illustration (Breiman et al., 1984)

[Diagram: seven-segment display with segments X1, ..., X7]

  y | x1 x2 x3 x4 x5 x6 x7
  --+---------------------
  0 |  1  1  1  0  1  1  1
  1 |  0  0  1  0  0  1  0
  2 |  1  0  1  1  1  0  1
  3 |  1  0  1  1  0  1  1
  4 |  0  1  1  1  0  1  0
  5 |  1  1  0  1  0  1  1
  6 |  1  1  0  1  1  1  1
  7 |  1  0  1  0  0  1  0
  8 |  1  1  1  1  1  1  1
  9 |  1  1  1  1  0  1  1

8 / 12

Page 21: Understanding variable importances in forests of randomized trees

Illustration (Breiman et al., 1984)

[Diagram: seven-segment display with segments X1, ..., X7]

  Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)

  Var | Imp
  ----+------
  X1  | 0.412
  X2  | 0.581
  X3  | 0.531
  X4  | 0.542
  X5  | 0.656
  X6  | 0.225
  X7  | 0.372
  Σ   | 3.321

[Bar chart: decomposition of each importance along the interaction degrees k = 0, ..., 6]

8 / 12
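The importance table for the seven-segment example can be recomputed by brute force from the ten LED patterns. This is our own sketch, not the authors' code; it uses base-2 logarithms, consistent with the total 3.321 ≈ log2(10) on the slide, and plug-in entropies over the ten equiprobable patterns.

```python
# Sketch: applying the decomposition formula to the 10 LED patterns
# (Breiman et al., 1984). The total importance provably equals
# I(X1,...,X7;Y) = H(Y) = log2(10) ~ 3.321 bits.
from itertools import combinations
from math import comb, log2
from collections import Counter

led = [  # rows: (x1, ..., x7, y)
    (1,1,1,0,1,1,1,0), (0,0,1,0,0,1,0,1), (1,0,1,1,1,0,1,2), (1,0,1,1,0,1,1,3),
    (0,1,1,1,0,1,0,4), (1,1,0,1,0,1,1,5), (1,1,0,1,1,1,1,6), (1,0,1,0,0,1,0,7),
    (1,1,1,1,1,1,1,8), (1,1,1,1,0,1,1,9),
]
p = 7

def H(rows):
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())

def cond_mi(xs, ys, bs):
    col = lambda idx: [tuple(r[i] for i in idx) for r in led]
    return H(col(xs + bs)) + H(col(ys + bs)) - H(col(xs + ys + bs)) - H(col(bs))

def imp(j):
    others = [i for i in range(p) if i != j]
    return sum(1 / comb(p, k) / (p - k) * cond_mi([j], [p], list(B))
               for k in range(p) for B in combinations(others, k))

for j in range(p):
    print(f"X{j+1}: {imp(j):.3f}")
print(f"sum : {sum(imp(j) for j in range(p)):.3f}")  # = log2(10)
```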

Page 22: Understanding variable importances in forests of randomized trees

Result 2 : Irrelevant variables

Variable importances depend only on the relevant variables.

Definition (Kohavi & John, 1997) : A variable X is irrelevant (to Ywith respect to V ) if, for all B ⊆ V , I (X ;Y |B) = 0. A variable isrelevant if it is not irrelevant.

A variable Xj is irrelevant if and only if Imp(Xj) = 0.

The importance of a relevant variable is insensitive to the additionor the removal of irrelevant variables.

9 / 12


Page 26: Understanding variable importances in forests of randomized trees

Non-totally randomized trees

Most properties are lost as soon as K > 1.

⇒ There can be relevant variables with zero importances (due to masking effects).

Example: I(X_1;Y) = H(Y), I(X_1;Y) ≈ I(X_2;Y), I(X_1;Y|X_2) = ε and I(X_2;Y|X_1) = 0

• K = 1 → Imp_{K=1}(X_1) ≈ (1/2) I(X_1;Y) + ε and Imp_{K=1}(X_2) ≈ (1/2) I(X_2;Y)

• K = 2 → Imp_{K=2}(X_1) = I(X_1;Y) and Imp_{K=2}(X_2) = 0

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables.

• K = 2 and we add a new irrelevant variable X_3 → Imp_{K=2}(X_2) > 0

10 / 12
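The masking effect is easy to reproduce empirically. This is an illustrative construction of ours, mirroring the slide's example (Y = X1 exactly, X2 a noisy copy of X1 so that I(X2;Y|X1) = 0, plus an irrelevant X3); note that scikit-learn with `max_features=1` only randomizes the choice of the split variable, not the threshold, so it approximates rather than exactly implements totally randomized trees.

```python
# Sketch of the masking effect: with greedy variable selection (K = p),
# the redundant-but-relevant x2 gets zero importance; with K = 1 it
# resurfaces. Illustrative data, not the authors' experiment.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
x1 = rng.integers(0, 2, n)
x2 = x1 ^ (rng.random(n) < 0.05).astype(int)  # noisy copy: relevant, redundant given x1
x3 = rng.integers(0, 2, n)                    # irrelevant
X, y = np.column_stack([x1, x2, x3]), x1      # y = x1 exactly

# K = p (all features considered at each node): x1 always wins, x2 is masked
full = RandomForestClassifier(n_estimators=100, max_features=None,
                              random_state=0).fit(X, y)
# K = 1 (randomized choice of the split variable): x2 resurfaces
rand = RandomForestClassifier(n_estimators=100, max_features=1,
                              random_state=0).fit(X, y)

print("K=p:", full.feature_importances_)  # x2 importance ~ 0
print("K=1:", rand.feature_importances_)  # x2 importance > 0
```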


Page 33: Understanding variable importances in forests of randomized trees

Illustration (continued)

[Diagram: seven-segment display with segments X1, ..., X7]

[Plot: importances of X1, ..., X7 as a function of K = 1, ..., 7]

11 / 12

Page 34: Understanding variable importances in forests of randomized trees

Conclusions

• We propose a theoretical characterization of MDI importance scores in asymptotic conditions;

• In the case of totally randomized trees, variable importances actually show sound and desirable theoretical properties, which are lost when using non-totally randomized trees;

• Main results remain valid for a large range of impurity measures (Gini index, variance, kernelized variance, etc.);

• Future work:
  • more formal treatment of the non-totally random case (K > 1);
  • derive distributions of importances in the finite setting;
  • use these results to design better importance score estimators.

12 / 12