Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts. Dept. of EE & CS & GIGA-R, Université de Liège, Belgium. June 6, 2014.


Page 1: Understanding variable importances in forests of randomized trees

Understanding variable importances in forests of randomized trees

Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts

Dept. of EE & CS & GIGA-R, Université de Liège, Belgium

June 6, 2014

1 / 12

Page 2: Understanding variable importances in forests of randomized trees

Ensembles of randomized trees

[Diagram: an ensemble of randomized trees, combined by majority vote or prediction average]

• Improve standard classification and regression trees by reducing their variance.

• Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).

• Standard Random Forests: bootstrap sampling + random selection of K features at each node.

2 / 12
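The three ensemble methods named above can be sketched side by side. This is an illustrative example, not the authors' code: the dataset, hyperparameters, and use of scikit-learn are assumptions made here for demonstration.

```python
# Sketch (illustrative, not from the slides): the three ensemble variants,
# as implemented in scikit-learn, compared by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Bagging (Breiman, 1996)": BaggingClassifier(n_estimators=100, random_state=0),
    "Random Forest (Breiman, 2001)": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra-Trees (Geurts et al., 2006)": ExtraTreesClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```

In scikit-learn, the "random selection of K features at each node" corresponds to the `max_features` parameter.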

Page 3: Understanding variable importances in forests of randomized trees

Strengths and weaknesses

• Universal approximation

• Robustness to outliers

• Robustness to irrelevant attributes (to some extent)

• Invariance to scaling of inputs

• Good computational efficiency and scalability

• Very good accuracy

• Loss of interpretability w.r.t. single decision trees

3 / 12


Page 5: Understanding variable importances in forests of randomized trees

Variable importances

• Interpretability can be recovered through variable importances.

• Two main importance measures:
  • The mean decrease of impurity (MDI): summing total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
  • The mean decrease of accuracy (MDA): measuring accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

• We focus here on MDI because:
  • it is faster to compute;
  • it does not require bootstrap sampling;
  • in practice, it correlates well with the MDA measure (except in specific conditions).

4 / 12
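Both measures are available in scikit-learn, which can make the contrast concrete. This is a sketch under assumptions not in the slides: the dataset is synthetic, and scikit-learn's `permutation_importance` evaluates on whatever data you pass rather than strictly on out-of-bag samples.

```python
# Sketch (illustrative): MDI vs. MDA in scikit-learn.
# MDI is the fitted forest's feature_importances_;
# MDA corresponds to permutation importance (here computed on the
# training data, not on out-of-bag samples as in Breiman, 2001).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

mdi = forest.feature_importances_   # mean decrease of impurity
mda = permutation_importance(forest, X, y, n_repeats=10,
                             random_state=0).importances_mean

for j in range(X.shape[1]):
    print(f"X{j+1}: MDI={mdi[j]:.3f}  MDA={mda[j]:.3f}")
```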


Page 8: Understanding variable importances in forests of randomized trees

Mean decrease of impurity

• Importance of variable X_j for an ensemble of M trees φ_m is:

  Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t),   (1)

where j_t denotes the variable used at node t, p(t) = N_t / N, and Δi(t) is the impurity reduction at node t:

  \Delta i(t) = i(t) - \frac{N_{t_L}}{N_t} i(t_L) - \frac{N_{t_R}}{N_t} i(t_R)   (2)

• Impurity i(t) can be Shannon entropy, Gini index, variance, ...

5 / 12
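Equations (1) and (2) can be recomputed by hand from a fitted tree and checked against scikit-learn's built-in importances, which are exactly this quantity normalized to sum to one. This is a sketch under assumptions not in the slides (scikit-learn internals, synthetic data, a single tree, i.e. M = 1).

```python
# Sketch: recomputing p(t) * Delta i(t), summed over the nodes where each
# variable splits, from scikit-learn's tree internals, and checking it
# against the built-in (normalized) feature_importances_.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

imp = np.zeros(X.shape[1])
for node in range(t.node_count):
    j = t.feature[node]
    if j < 0:        # leaf node: no split, no contribution
        continue
    left, right = t.children_left[node], t.children_right[node]
    n, nl, nr = (t.weighted_n_node_samples[k] for k in (node, left, right))
    # Delta i(t) = i(t) - (N_tL/N_t) i(t_L) - (N_tR/N_t) i(t_R)
    delta_i = t.impurity[node] - (nl / n) * t.impurity[left] - (nr / n) * t.impurity[right]
    # accumulate p(t) * Delta i(t), with p(t) = N_t / N
    imp[j] += (n / t.weighted_n_node_samples[0]) * delta_i

assert np.allclose(imp / imp.sum(), tree.feature_importances_)
```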

Page 9: Understanding variable importances in forests of randomized trees

Motivation and assumptions

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \, p(t) \, \Delta i(t)

• MDI works well, but it is not well understood theoretically;

• We would like to better characterize it and derive its main properties from this characterization.

• Working assumptions:
  • all variables are discrete;
  • multi-way splits à la C4.5 (i.e., one branch per value);
  • Shannon entropy as impurity measure:

    i(t) = - \sum_c \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}

  • totally randomized trees (RF with K = 1);
  • asymptotic conditions: N → ∞, M → ∞.

6 / 12
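The impurity measure assumed above is just the Shannon entropy of the class counts at a node. A minimal sketch (the function name and the choice of natural logarithm are ours; the base only rescales importances):

```python
# Sketch of the assumed impurity: Shannon entropy of the class
# counts N_{t,c} at a node t.
import numpy as np

def shannon_entropy(counts):
    """i(t) = -sum_c (N_{t,c}/N_t) log(N_{t,c}/N_t), natural log."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    return float(-(p * np.log(p)).sum())

assert np.isclose(shannon_entropy([5, 5]), np.log(2))  # maximally impure binary node
assert shannon_entropy([10, 0]) == 0.0                 # pure node
```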


Page 16: Understanding variable importances in forests of randomized trees

Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) Decomposition of the information jointly provided by all input variables about the output, in terms of the MDI importance of each input variable:

  I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} Imp(X_j)

ii), iii) Decomposition of each importance along the degrees k of interaction with the other variables, and along all interaction terms B of a given degree k:

  Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)

E.g.: p = 3,

  Imp(X_1) = \frac{1}{3} I(X_1;Y) + \frac{1}{6}\left(I(X_1;Y \mid X_2) + I(X_1;Y \mid X_3)\right) + \frac{1}{3} I(X_1;Y \mid X_2, X_3)

7 / 12
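The decomposition above can be checked numerically by brute force. This is not the authors' code: it uses plug-in (empirical) entropies on a toy problem of our own construction (p = 3 binary inputs, y = x1 XOR x2, x3 irrelevant), in the totally randomized / asymptotic setting where the formula is exact.

```python
# Sketch: brute-force check that sum_j Imp(X_j) = I(X_1,...,X_p; Y)
# on a tiny discrete toy problem, using plug-in entropies.
from itertools import combinations
from math import comb, log
from collections import Counter

def H(rows):
    """Plug-in Shannon entropy of the empirical distribution of `rows`."""
    n = len(rows)
    return -sum(c / n * log(c / n) for c in Counter(rows).values())

def cond_mi(data, xs, ys, bs):
    """I(X;Y|B) = H(X,B) + H(Y,B) - H(X,Y,B) - H(B), empirical counts."""
    col = lambda idx: [tuple(r[i] for i in idx) for r in data]
    return H(col(xs + bs)) + H(col(ys + bs)) - H(col(xs + ys + bs)) - H(col(bs))

# Toy dataset: rows are (x1, x2, x3, y); y = x1 XOR x2, x3 irrelevant.
data = [(a, b, c, a ^ b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
p = 3

def imp(j):
    others = [i for i in range(p) if i != j]
    return sum(
        1 / comb(p, k) * 1 / (p - k) * cond_mi(data, [j], [p], list(B))
        for k in range(p)
        for B in combinations(others, k)
    )

total = sum(imp(j) for j in range(p))
assert abs(total - cond_mi(data, [0, 1, 2], [p], [])) < 1e-9  # = I(X1,X2,X3;Y)
```

Note the XOR structure: X1 and X2 are individually uninformative (I(X1;Y) = 0) yet each receives nonzero importance through the conditional terms, while the irrelevant X3 gets zero.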


Page 20: Understanding variable importances in forests of randomized trees

Illustration (Breiman et al., 1984)

[Diagram: seven-segment display with segments X1, ..., X7]

  y | x1 x2 x3 x4 x5 x6 x7
  --+---------------------
  0 |  1  1  1  0  1  1  1
  1 |  0  0  1  0  0  1  0
  2 |  1  0  1  1  1  0  1
  3 |  1  0  1  1  0  1  1
  4 |  0  1  1  1  0  1  0
  5 |  1  1  0  1  0  1  1
  6 |  1  1  0  1  1  1  1
  7 |  1  0  1  0  0  1  0
  8 |  1  1  1  1  1  1  1
  9 |  1  1  1  1  0  1  1

8 / 12

Page 21: Understanding variable importances in forests of randomized trees

Illustration (Breiman et al., 1984)

[Diagram: seven-segment display with segments X1, ..., X7]

  Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)

  Var | Imp
  ----+------
  X1  | 0.412
  X2  | 0.581
  X3  | 0.531
  X4  | 0.542
  X5  | 0.656
  X6  | 0.225
  X7  | 0.372
  Σ   | 3.321

[Bar chart: decomposition of each importance along the interaction degrees k = 0, ..., 6]

8 / 12
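The importance table for the seven-segment example can be recomputed by brute force from the ten LED patterns. This is our own sketch, not the authors' code; it uses base-2 logarithms, consistent with the total 3.321 ≈ log2(10) on the slide, and plug-in entropies over the ten equiprobable patterns.

```python
# Sketch: applying the decomposition formula to the 10 LED patterns
# (Breiman et al., 1984). The total importance provably equals
# I(X1,...,X7;Y) = H(Y) = log2(10) ~ 3.321 bits.
from itertools import combinations
from math import comb, log2
from collections import Counter

led = [  # rows: (x1, ..., x7, y)
    (1,1,1,0,1,1,1,0), (0,0,1,0,0,1,0,1), (1,0,1,1,1,0,1,2), (1,0,1,1,0,1,1,3),
    (0,1,1,1,0,1,0,4), (1,1,0,1,0,1,1,5), (1,1,0,1,1,1,1,6), (1,0,1,0,0,1,0,7),
    (1,1,1,1,1,1,1,8), (1,1,1,1,0,1,1,9),
]
p = 7

def H(rows):
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())

def cond_mi(xs, ys, bs):
    col = lambda idx: [tuple(r[i] for i in idx) for r in led]
    return H(col(xs + bs)) + H(col(ys + bs)) - H(col(xs + ys + bs)) - H(col(bs))

def imp(j):
    others = [i for i in range(p) if i != j]
    return sum(1 / comb(p, k) / (p - k) * cond_mi([j], [p], list(B))
               for k in range(p) for B in combinations(others, k))

for j in range(p):
    print(f"X{j+1}: {imp(j):.3f}")
print(f"sum : {sum(imp(j) for j in range(p)):.3f}")  # = log2(10)
```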

Page 22: Understanding variable importances in forests of randomized trees

Result 2 : Irrelevant variables

Variable importances depend only on the relevant variables.

Definition (Kohavi & John, 1997) : A variable X is irrelevant (to Ywith respect to V ) if, for all B ⊆ V , I (X ;Y |B) = 0. A variable isrelevant if it is not irrelevant.

A variable Xj is irrelevant if and only if Imp(Xj) = 0.

The importance of a relevant variable is insensitive to the additionor the removal of irrelevant variables.

9 / 12


Page 26: Understanding variable importances in forests of randomized trees

Non-totally randomized trees

Most properties are lost as soon as K > 1.

⇒ There can be relevant variables with zero importances (due to masking effects).

Example: I(X_1;Y) = H(Y), I(X_1;Y) ≈ I(X_2;Y), I(X_1;Y|X_2) = ε and I(X_2;Y|X_1) = 0

• K = 1 → Imp_{K=1}(X_1) ≈ (1/2) I(X_1;Y) + ε and Imp_{K=1}(X_2) ≈ (1/2) I(X_2;Y)

• K = 2 → Imp_{K=2}(X_1) = I(X_1;Y) and Imp_{K=2}(X_2) = 0

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables.

• K = 2 and we add a new irrelevant variable X_3 → Imp_{K=2}(X_2) > 0

10 / 12
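The masking effect is easy to reproduce empirically. This is an illustrative construction of ours, mirroring the slide's example (Y = X1 exactly, X2 a noisy copy of X1 so that I(X2;Y|X1) = 0, plus an irrelevant X3); note that scikit-learn with `max_features=1` only randomizes the choice of the split variable, not the threshold, so it approximates rather than exactly implements totally randomized trees.

```python
# Sketch of the masking effect: with greedy variable selection (K = p),
# the redundant-but-relevant x2 gets zero importance; with K = 1 it
# resurfaces. Illustrative data, not the authors' experiment.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
x1 = rng.integers(0, 2, n)
x2 = x1 ^ (rng.random(n) < 0.05).astype(int)  # noisy copy: relevant, redundant given x1
x3 = rng.integers(0, 2, n)                    # irrelevant
X, y = np.column_stack([x1, x2, x3]), x1      # y = x1 exactly

# K = p (all features considered at each node): x1 always wins, x2 is masked
full = RandomForestClassifier(n_estimators=100, max_features=None,
                              random_state=0).fit(X, y)
# K = 1 (randomized choice of the split variable): x2 resurfaces
rand = RandomForestClassifier(n_estimators=100, max_features=1,
                              random_state=0).fit(X, y)

print("K=p:", full.feature_importances_)  # x2 importance ~ 0
print("K=1:", rand.feature_importances_)  # x2 importance > 0
```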


Page 33: Understanding variable importances in forests of randomized trees

Illustration (continued)

[Diagram: seven-segment display with segments X1, ..., X7]

[Plot: importances of X1, ..., X7 as a function of K = 1, ..., 7]

11 / 12

Page 34: Understanding variable importances in forests of randomized trees

Conclusions

• We propose a theoretical characterization of MDI importance scores in asymptotic conditions;

• In the case of totally randomized trees, variable importances actually show sound and desirable theoretical properties, which are lost when using non-totally randomized trees;

• Main results remain valid for a large range of impurity measures (Gini index, variance, kernelized variance, etc.);

• Future work:
  • more formal treatment of the non-totally random case (K > 1);
  • derive distributions of importances in the finite setting;
  • use these results to design better importance score estimators.

12 / 12