Overfitting, Bias/Variance tradeoff, and Ensemble methods
Pierre Geurts
Stochastic methods (Prof. L. Wehenkel), University of Liège

Page 1

Overfitting, Bias/Variance tradeoff, and Ensemble methods

Pierre Geurts

Stochastic methods (Prof. L. Wehenkel)

University of Liège

Page 2

2

Content of the presentation

• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques

Page 3

3

Content of the presentation

• Bias and variance definitions:
– A simple regression problem with no input

– Generalization to full regression problems

– A short discussion about classification

• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques

Page 4

4

Regression problem - no input

• Goal: predict as well as possible the height of a Belgian male adult

• More precisely:
  – Choose an error measure, for example the square error.

  – Find an estimate ŷ such that the expectation Ey{(y-ŷ)2} over the whole population of Belgian male adults is minimized.

[Figure: distribution of heights in the population, centered around 180 cm]

Page 5

5

Regression problem - no input

• The estimate that minimizes this error is found by setting the derivative to zero: d/dŷ Ey{(y-ŷ)2} = -2·(Ey{y}-ŷ) = 0

• So, the estimate that minimizes the error is ŷ = Ey{y}. In automatic learning, it is called the Bayes model.

• But in practice, we cannot compute the exact value of Ey{y} (this would require measuring the height of every Belgian male adult).

Page 6

6

Learning algorithm

• As p(y) is unknown, we find an estimate ŷ from a sample of individuals, LS={y1,y2,…,yN}, drawn from the Belgian male adult population.

• Examples of learning algorithms:

  – ŷ1 = (y1+y2+…+yN)/N, the sample mean

  – ŷ2 = an estimate that shrinks the sample mean towards 180 (if we know that the height is close to 180)

Page 7

7

Good learning algorithm

• As the LS is randomly drawn, the prediction ŷ is also a random variable

• A good learning algorithm should not be good on one particular learning sample only, but on average over all learning samples (of size N): we want to minimize E = ELS{Ey{(y-ŷ)2}}

• Let us analyse this error in more detail

[Figure: the distribution pLS(ŷ) of the prediction ŷ over all learning samples]

Page 8

8

Bias/variance decomposition (1)

Page 9

9

Bias/variance decomposition (2)

E = Ey{(y-Ey{y})2} + ELS{(Ey{y}-ŷ)2}

[Figure: distribution of y, centered at Ey{y} with spread vary{y}]

The first term, vary{y}, is the residual error = the minimal attainable error.

Page 10

10

Bias/variance decomposition (3)

Page 11

11

Bias/variance decomposition (4)

E = vary{y} + (Ey{y}-ELS{ŷ})2 + …

ELS{ŷ} = average model (over all LS)

bias2 = (Ey{y}-ELS{ŷ})2 = error between the Bayes model and the average model

[Figure: distribution of ŷ, with the Bayes model Ey{y}, the average model ELS{ŷ}, and the bias2 between them]

Page 12

12

Bias/variance decomposition (5)

E = vary{y} + bias2 + ELS{(ŷ-ELS{ŷ})2}

varLS{ŷ} = ELS{(ŷ-ELS{ŷ})2} = estimation variance = a consequence of over-fitting

[Figure: distribution of ŷ around its average ELS{ŷ}, with spread varLS{ŷ}]

Page 13

13

Bias/variance decomposition (6)

E = vary{y} + bias2 + varLS{ŷ}

[Figure: distributions of y and ŷ, annotated with vary{y}, bias2, and varLS{ŷ}]
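The derivation that pages 8 to 13 step through is the standard one; since the formula images did not survive in this transcript, it is reconstructed below from the definitions above (Ey{y} = Bayes model, ELS{ŷ} = average model):

```latex
\begin{align*}
E &= E_{LS}\{E_y\{(y-\hat{y})^2\}\}
   = E_{LS}\{E_y\{\big((y-E_y\{y\}) + (E_y\{y\}-\hat{y})\big)^2\}\}\\
  &= \underbrace{E_y\{(y-E_y\{y\})^2\}}_{\mathrm{var}_y\{y\}}
   + E_{LS}\{(E_y\{y\}-\hat{y})^2\}
   &&\text{(cross term vanishes: } E_y\{y-E_y\{y\}\}=0\text{)}\\
  &= \mathrm{var}_y\{y\}
   + \underbrace{(E_y\{y\}-E_{LS}\{\hat{y}\})^2}_{\mathrm{bias}^2}
   + \underbrace{E_{LS}\{(\hat{y}-E_{LS}\{\hat{y}\})^2\}}_{\mathrm{var}_{LS}\{\hat{y}\}}
   &&\text{(cross term vanishes: } E_{LS}\{\hat{y}-E_{LS}\{\hat{y}\}\}=0\text{)}
\end{align*}
```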

Page 14

14

Our simple example

• ŷ1 (the sample mean):
  – From statistics, ŷ1 is the best estimate with zero bias, but its variance is vary{y}/N.

• ŷ2 (shrunk towards 180): some bias, but a smaller variance.

• So, the first one may not be the best estimator, because of variance (there is a bias/variance tradeoff with respect to the amount of shrinkage towards 180), as the sketch below illustrates.
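To make this tradeoff concrete, here is a small Monte-Carlo sketch that is not part of the original slides; the population N(178, 7), the sample size N = 10 and the shrinkage weight lam are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population of heights: y ~ N(178, 7), so Ey{y} = 178 is the Bayes model
# (deliberately not exactly 180, so that shrinking towards 180 introduces some bias).
MU, SIGMA, N, N_RUNS = 178.0, 7.0, 10, 100_000

def y_hat_1(ls):
    """Sample mean: zero bias, variance vary{y}/N."""
    return ls.mean()

def y_hat_2(ls, lam=5.0):
    """Illustrative shrinkage towards 180 (lam is an arbitrary prior weight)."""
    return (lam * 180.0 + ls.sum()) / (lam + len(ls))

for name, estimator in [("sample mean", y_hat_1), ("shrunk towards 180", y_hat_2)]:
    preds = np.array([estimator(rng.normal(MU, SIGMA, N)) for _ in range(N_RUNS)])
    bias2 = (MU - preds.mean()) ** 2      # (Ey{y} - ELS{yhat})^2
    var = preds.var()                     # ELS{(yhat - ELS{yhat})^2}
    print(f"{name:20s} bias2={bias2:5.2f}  variance={var:5.2f}  total={bias2 + var:5.2f}")
```

With these values the shrunk estimator pays a small bias (about 0.4) but more than halves the variance, so its total error is lower than that of the unbiased sample mean.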

Page 15

15

Bayesian approach (1)

• Hypotheses:
  – The average height μ is close to 180 cm: a priori, μ ~ N(180, σ02)
  – The height of one individual is Gaussian around the mean: y ~ N(μ, σ2)

• What is the most probable value of μ after having seen the learning sample?

Page 16

16

Bayesian approach (2)

P(μ|LS) ∝ P(LS|μ)·P(μ)   (Bayes theorem, and P(LS) is constant)

P(LS|μ) = Πi P(yi|μ)   (independence of the learning cases)
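Carrying the computation through (a standard Gaussian prior/likelihood calculation, reconstructed here rather than copied from the slide) gives a most probable value that interpolates between the prior guess 180 and the sample mean; it behaves exactly like the shrinkage estimator ŷ2 above:

```latex
\begin{align*}
P(\mu \mid LS) &\propto \prod_{i=1}^{N}\exp\!\Big(-\frac{(y_i-\mu)^2}{2\sigma^2}\Big)\cdot
                 \exp\!\Big(-\frac{(\mu-180)^2}{2\sigma_0^2}\Big)\\
\hat{\mu}_{MAP} &= \arg\max_{\mu} P(\mu \mid LS)
  = \frac{\sigma^2\cdot 180 \;+\; \sigma_0^2\sum_{i=1}^{N} y_i}{\sigma^2 + N\,\sigma_0^2}
  \;\xrightarrow{\;N\to\infty\;}\; \frac{1}{N}\sum_{i=1}^{N} y_i
\end{align*}
```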

Page 17

17

Regression problem – full (1)

• Actually, we want to find a function ŷ(x) of several inputs, so we average over the whole input space.

• The error becomes: Ex{Ey|x{(y-ŷ(x))2}}

• Over all learning sets: ELS{Ex{Ey|x{(y-ŷ(x))2}}}

Page 18

18

Regression problem – full (2)

ELS{Ey|x{(y-ŷ(x))2}} = Noise(x) + Bias2(x) + Variance(x)

• Noise(x) = Ey|x{(y-hB(x))2}

Quantifies how much y varies from hB(x) = Ey|x{y}, the Bayes model.

• Bias2(x) = (hB(x)-ELS{ŷ(x)})2:

Measures the error between the Bayes model and the average model.

• Variance(x) = ELS{(ŷ(x)-ELS{ŷ(x)})2}:

Quantifies how much ŷ(x) varies from one learning sample to another.

Page 19

19

Illustration (1)

• Problem definition:
  – One input x, a uniform random variable in [0,1]
  – y = h(x) + ε, where ε ~ N(0,1)

[Figure: the target function h(x) = Ey|x{y} and noisy samples, with x on the horizontal axis and y on the vertical axis]

Page 20

20

Illustration (2)

• Low variance, high bias method ⇒ underfitting

[Figure: the average model ELS{ŷ(x)} of an over-simple method, compared with h(x)]

Page 21

21

Illustration (3)

• Low bias, high variance method ⇒ overfitting

[Figure: the average model ELS{ŷ(x)} of an over-complex method, compared with h(x)]

Page 22

22

Illustration (4)

• No noise doesn’t imply no variance (but less variance)

[Figure: the average model ELS{ŷ(x)} on the noise-free version of the problem]

Page 23

23

Classification problems (1)

• The mean misclassification error is: Ex,y{I(y ≠ ŷ(x))}

• The best possible model is the Bayes model: hB(x) = arg maxc P(y=c|x)

• The “average” model is the majority vote over learning samples: arg maxc PLS(ŷ(x)=c)

• Unfortunately, there is no such additive decomposition of the mean misclassification error into bias and variance terms.

• Nevertheless, we observe the same phenomena

Page 24

24

Classification problems (2)

[Figure: class boundaries on [0,1]x[0,1] learned from two learning samples LS1 and LS2, for a tree with one test node and for a full decision tree]

Page 25

25

Classification problems (3)

• Bias = systematic error component (independent of the learning sample)

• Variance = error due to the variability of the model with respect to the learning sample randomness

• There are errors due to bias and errors due to variance

[Figure: decision boundaries of a one-test-node tree (errors mainly due to bias) and of a full decision tree (errors mainly due to variance)]

Page 26

26

Content of the presentation

• Bias and variance definitions
• Parameters that influence bias and variance

– Complexity of the model

– Complexity of the Bayes model

– Noise

– Learning sample size

– Learning algorithm

• Decision/regression tree variance
• Bias and variance reduction techniques

Page 27

27

Illustrative problem

• Artificial problem with 10 inputs, all uniform random variables in [0,1]

• The true function depends only on 5 inputs:

y = 10·sin(π·x1·x2) + 20·(x3-0.5)2 + 10·x4 + 5·x5 + ε,

where ε is a N(0,1) random variable

• Experiments:
  – ELS ≈ average over 50 learning sets of size 500
  – Ex,y ≈ average over 2000 test cases
  ⇒ estimate variance and bias (+ residual error), as sketched below
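A sketch of this experimental protocol in Python (not from the slides): sklearn's unpruned regression trees stand in for the learning algorithm, so the numbers will only roughly match the tables on the following slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def friedman(n):
    """Draw n cases from the illustrative problem (10 inputs, only 5 relevant)."""
    X = rng.uniform(0, 1, size=(n, 10))
    h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    return X, h + rng.normal(0, 1, n), h

X_test, y_test, h_test = friedman(2000)        # fixed test set approximating E_{x,y}

preds = []
for _ in range(50):                            # 50 learning sets of size 500
    X_ls, y_ls, _ = friedman(500)
    model = DecisionTreeRegressor().fit(X_ls, y_ls)   # full (unpruned) regression tree
    preds.append(model.predict(X_test))
preds = np.array(preds)                        # shape (50, 2000)

avg_model = preds.mean(axis=0)                 # ELS{yhat(x)} at each test point
bias2 = np.mean((h_test - avg_model) ** 2)     # average Bias2(x)
variance = np.mean(preds.var(axis=0))          # average Variance(x)
error = np.mean((y_test - preds) ** 2)         # ELS{E_{x,y}{(y - yhat(x))^2}}
print(f"error={error:.2f}  vs  noise(1.0) + bias2({bias2:.2f}) + variance({variance:.2f})")
```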

Page 28

28

Complexity of the model

Usually, the bias is a decreasing function of the complexity, while variance is an increasing function of the complexity.

[Figure: bias2 decreases and variance increases with model complexity; the total error E = bias2 + var is minimal at an intermediate complexity]

Page 29

29

Complexity of the model – neural networks

• Error, bias, and variance w.r.t. the number of neurons in the hidden layer

[Figure: Error, Bias R, and Var R as functions of the number of hidden perceptrons (0 to 12)]

Page 30

30

Complexity of the model – regression trees

• Error, bias, and variance w.r.t. the number of test nodes

[Figure: Error, Bias R, and Var R as functions of the number of test nodes (0 to 50)]

Page 31

31

Complexity of the model – k-NN

• Error, bias, and variance w.r.t. k, the number of neighbors

[Figure: Error, Bias R, and Var R as functions of the number of neighbors k (1 to 30)]

Page 32

32

Learning problem

• Complexity of the Bayes model:
  – At fixed model complexity, bias increases with the complexity of the Bayes model. However, the effect on variance is difficult to predict.

• Noise:
  – Variance increases with noise and bias is mainly unaffected.
  – E.g. with (full) regression trees:

[Figure: Error, Noise, Bias R, and Var R as functions of the noise standard deviation (0 to 6), for full regression trees]

Page 33

33

Learning sample size (1)

• At fixed model complexity, bias remains constant and variance decreases with the learning sample size. E.g. linear regression

[Figure: Error, Bias R, and Var R as functions of the learning sample size (0 to 2000), for linear regression]

Page 34

34

Learning sample size (2)

• When the complexity of the model depends on the learning sample size, both bias and variance decrease with the learning sample size. E.g. regression trees

[Figure: Error, Bias R, and Var R as functions of the learning sample size (0 to 2000), for regression trees]

Page 35

35

Learning algorithms – linear regression

• Very few parameters: small variance
• The true function is not linear: high bias

Method          Err2   Bias2+Noise   Variance
Linear regr.     7.0       6.8          0.2
k-NN (k=1)      15.4       5.0         10.4
k-NN (k=10)      8.5       7.2          1.3
MLP (10)         2.0       1.2          0.8
MLP (10 – 10)    4.6       1.4          3.2
Regr. Tree      10.2       3.5          6.7

Page 36

36

Learning algorithms – k-NN

• Small k: high variance and moderate bias
• High k: smaller variance but higher bias

Method          Err2   Bias2+Noise   Variance
Linear regr.     7.0       6.8          0.2
k-NN (k=1)      15.4       5.0         10.4
k-NN (k=10)      8.5       7.2          1.3
MLP (10)         2.0       1.2          0.8
MLP (10 – 10)    4.6       1.4          3.2
Regr. Tree      10.2       3.5          6.7

Page 37

37

Learning algorithms - MLP

• Small bias
• Variance increases with the model complexity

Method          Err2   Bias2+Noise   Variance
Linear regr.     7.0       6.8          0.2
k-NN (k=1)      15.4       5.0         10.4
k-NN (k=10)      8.5       7.2          1.3
MLP (10)         2.0       1.2          0.8
MLP (10 – 10)    4.6       1.4          3.2
Regr. Tree      10.2       3.5          6.7

Page 38

38

Learning algorithms – regression trees

• Small bias: a (complex enough) tree can approximate any non-linear function

• High variance

Method          Err2   Bias2+Noise   Variance
Linear regr.     7.0       6.8          0.2
k-NN (k=1)      15.4       5.0         10.4
k-NN (k=10)      8.5       7.2          1.3
MLP (10)         2.0       1.2          0.8
MLP (10 – 10)    4.6       1.4          3.2
Regr. Tree      10.2       3.5          6.7

Page 39

39

Content of the presentation

• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques

Page 40

40

Decision/regression tree variance (1)

• DT/RT are among the machine learning methods with the highest variance: even a small change of the learning sample can result in a very different tree.

• Even small trees have a high variance

Method                   E     Bias   Variance
k-NN (k=10)              8.5    7.2     1.3
MLP (10 – 10)            4.6    1.4     3.2
RT, no test             25.5   25.4     0.1
RT, 1 test              19.0   17.7     1.3
RT, 3 tests             14.8   11.1     3.7
RT, full (250 tests)    10.2    3.5     6.7

Page 41

41

Decision/regression tree variance (2)

• Possible sources of variance:
  – Discretization of numerical attributes
    • The selected threshold has a high variance (see next slide).
  – Structure choice
    • Sometimes, attribute scores are very close.
  – Estimation at leaf nodes
    • Because of the recursive partitioning, predictions at leaf nodes are based on very small samples of objects.

• Consequences:
  – sub-optimality in terms of accuracy
  – questionable interpretability, since the parameters cannot be trusted

Page 42

42


Decision/regression tree variance (3)

• The discretization thresholds chosen in trees are very unstable

• This variance calls the interpretability of the trees into question

[Figure: trees built from three different learning samples split on A1 at thresholds 0.48, 0.61, and 0.82]

Page 43

43

Content of the presentation

• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques

– Introduction

– Dealing with the bias/variance tradeoff of one algorithm

– Ensemble methods

Page 44

44

Bias and variance reduction techniques

• In the context of a given method:
  – Adapt the learning algorithm to find the best trade-off between bias and variance.
  – Not a panacea but the least we can do.
  – Example: pruning, weight decay.

• Ensemble methods:
  – Change the bias/variance trade-off.
  – Universal but destroys some features of the initial method.
  – Example: bagging, boosting.

Page 45

45

Variance reduction: 1 model (1)

• General idea: reduce the ability of the learning algorithm to fit the LS
  – Pruning
    • reduces the model complexity explicitly
  – Early stopping
    • reduces the amount of search
  – Regularization
    • reduces the size of the hypothesis space
    • weight decay with neural networks consists in penalizing high weight values

Page 46

46

Variance reduction: 1 model (2)

• Selection of the optimal level of fitting:
  – a priori (not optimal)
  – by cross-validation (less efficient): bias2 ≈ error on the learning set, E ≈ error on an independent test set (see the sketch below)

[Figure: bias2, var, and E = bias2 + var as functions of the level of fitting; the optimal fitting level minimizes E]
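One way to implement the cross-validation selection is sketched below with sklearn regression trees; the data generator and the complexity grid (number of leaf nodes) are illustrative choices, not taken from the slides.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy learning sample standing in for the illustrative problem
X = rng.uniform(0, 1, size=(500, 10))
y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
     + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, 500))

# Cross-validation selects the level of fitting (here, the number of leaf nodes)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_leaf_nodes": [2, 5, 10, 20, 50, 100, 200, None]},
    cv=10,
    scoring="neg_mean_squared_error",
).fit(X, y)

print("selected complexity:", search.best_params_)
print("cross-validated error:", -search.best_score_)
```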

Page 47

47

Variance reduction: 1 model (3)

• Examples:
  – Post-pruning of regression trees
  – Early stopping of MLP by cross-validation

Method                  E     Bias   Variance
Full regr. Tree (250)   10.2   3.5     6.7
Pr. regr. Tree (45)      9.1   4.3     4.8
Full learned MLP         4.6   1.4     3.2
Early stopped MLP        3.8   1.5     2.3

• As expected, variance decreases but bias increases

Page 48

48

Ensemble methods

• Combine the predictions of several models built with a learning algorithm, in order to improve over the use of a single model

• Two important families:
  – Averaging techniques
    • Grow several models independently and simply average their predictions
    • Ex: bagging, random forests
    • Decrease mainly variance
  – Boosting-type algorithms
    • Grow several models sequentially
    • Ex: Adaboost, MART
    • Decrease mainly bias

Page 49

49

Bagging (1)

ELS{Err(x)} = Ey|x{(y-hB(x))2} + (hB(x)-ELS{ŷ(x)})2 + ELS{(ŷ(x)-ELS{ŷ(x)})2}

• Idea: the average model ELS{ŷ(x)} has the same bias as the original method but zero variance

• Bagging (Bootstrap AGGregatING):
  – To compute ELS{ŷ(x)}, we should draw an infinite number of LS (of size N)
  – Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  – Bootstrap sampling = sampling with replacement of N objects from LS (N is the size of LS)

Page 50

50

Bagging (2)

[Diagram: from the learning sample LS, T bootstrap samples LS1, LS2, …, LST are drawn; a model ŷ1(x), ŷ2(x), …, ŷT(x) is built on each, and their predictions at the input x are aggregated]

In regression: ŷ(x) = (1/T)·(ŷ1(x)+ŷ2(x)+…+ŷT(x))

In classification: ŷ(x) = the majority class in {ŷ1(x),…,ŷT(x)}
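A minimal sketch of this procedure for regression, with sklearn trees as the base learner; the function names are mine, and sklearn's BaggingRegressor packages the same idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, T=25, seed=0):
    """Build T regression trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)        # sampling with replacement of N objects
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Regression aggregation: yhat(x) = (1/T) * sum_t yhat_t(x)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```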

Page 51

51

Bagging (3)

• Usually, bagging strongly reduces the variance without increasing the bias too much.

• Application to regression trees

Method              E     Bias   Variance
3 Test regr. Tree   14.8  11.1     3.7
Bagged (T=25)       11.7  10.7     1.0
Full regr. Tree     10.2   3.5     6.7
Bagged (T=25)        5.3   3.8     1.5

• Strong variance reduction without increasing the bias (although the model is much more complex than a single tree)

Page 52

52

Bagging (4)

[Figure: illustration of bagging on the one-input problem (x vs. y)]

Page 53

53

Other averaging techniques

• Perturb and Combine paradigm:
  – Perturb the data or the learning algorithm to obtain several models that are good on the learning sample.
  – Combine the predictions of these models.

• Usually, these methods decrease the variance (because of averaging) but (slightly) increase the bias (because of the perturbation)

• Examples:
  – Bagging perturbs the learning sample.
  – Learn several neural networks with random initial weights.
  – Random forests.

Method               E     Bias   Variance
MLP (10-10)          4.6    1.4    3.2
Average of 10 MLPs   2.0    1.4    0.6

Page 54

54

Random forests (1)

• Perturb and combine algorithm specifically designed for trees

• Combine bagging and random attribute subset selection:
– Build the tree from a bootstrap sample

– Instead of choosing the best split among all attributes, select the best split among a random subset of k attributes

(= bagging when k is equal to the number of attributes)

• There is a bias/variance tradeoff with k: the smaller k, the greater the reduction of variance, but also the higher the increase of bias (see the sketch below)
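A sketch of this tradeoff with sklearn, where k is called max_features; the synthetic data only loosely mirrors the experiment on the next slide, and the out-of-bag score is used as a cheap error estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 10))
y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
     + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, 500))

# max_features = k: size of the random attribute subset tried at each split;
# k = 10 (all attributes) reduces to plain bagging.
for k in (10, 7, 5, 3):
    forest = RandomForestRegressor(n_estimators=100, max_features=k,
                                   oob_score=True, random_state=0).fit(X, y)
    print(f"k={k:2d}  out-of-bag R^2 = {forest.oob_score_:.3f}")
```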

Page 55

55

Random forests (2)

• Application to our illustrative problem:

• Other advantage: it decreases computing times with respect to bagging since only a subset of all attributes needs to be considered when splitting a node.

Method                  E     Bias   Variance
Full regr. Tree         10.2   3.5    6.7
Bagging (k=10)           5.3   3.8    1.5
Random Forests (k=7)     4.8   3.8    1.0
Random Forests (k=5)     4.9   4.0    0.9
Random Forests (k=3)     5.6   4.7    0.8

Page 56

56

Boosting methods (1)

• The motivation of boosting is to combine the outputs of many « weak » models to produce a powerful ensemble of models.

• Weak model = a model that has a high bias (strictly, in classification, a model slightly better than random guessing)

• Differences with previous ensemble methods:
  – Models are built sequentially on modified versions of the data
  – The predictions of the models are combined through a weighted sum/vote

Page 57

57

Boosting methods (2)

[Diagram: models are built sequentially; each learning sample LS1, LS2, …, LST is a modified version of LS, and each model ŷ1(x), ŷ2(x), …, ŷT(x) gets a weight β1, β2, …, βT]

In regression: ŷ(x) = β1·ŷ1(x) + β2·ŷ2(x) + … + βT·ŷT(x)

In classification: ŷ(x) = the majority class in {ŷ1(x),…,ŷT(x)} according to the weights {β1,β2,…,βT}

Page 58

58

Adaboost (1)

• Assume that the learning algorithm accepts weighted objects

• This is the case of many learning algorithms:
  – With trees, simply take into account the weights when counting objects
  – In neural networks, minimize the weighted squared error

• At each step, adaboost increases the weights of cases from the learning sample misclassified by the last model

• Thus, the algorithm focuses on the difficult cases from the learning sample

• In the weighted majority vote, adaboost gives higher influence to the more accurate models

Page 59

59

Adaboost (2)

• Input: a learning algorithm and a learning sample {(xi,yi): i=1,…,N}

• Initialize the weights wi=1/N, i=1,…,N

• For t=1 to T:
  – Build a model ŷt(x) with the learning algorithm, using the weights wi
  – Compute the weighted error: errt = Σi wi·I(yi ≠ ŷt(xi)) / Σi wi
  – Compute αt = log((1-errt)/errt)
  – Change the weights: wi ← wi·exp[αt·I(yi ≠ ŷt(xi))]
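A direct transcription of this pseudo-code into Python, assuming a binary classification problem and decision stumps as the weak learners (the helper names are mine; sklearn's AdaBoostClassifier is a production version of the same algorithm).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost sketch following the slide: reweight misclassified cases at each step."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initialize the weights wi = 1/N
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y           # I(yi != yhat_t(xi))
        err = np.sum(w * miss) / np.sum(w)     # weighted error err_t
        if err <= 0 or err >= 0.5:             # stop if the weak learner is perfect or useless
            break
        alpha = np.log((1 - err) / err)        # alpha_t
        w = w * np.exp(alpha * miss)           # increase weights of misclassified cases
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X, classes=(0, 1)):
    """Weighted majority vote over the T models."""
    votes = np.zeros((len(X), len(classes)))
    for model, alpha in zip(models, alphas):
        pred = model.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]
```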

Page 60

60

MART (multiple additive regression trees)

MART is a boosting algorithm for regression

• Input: a learning sample {(xi,yi): i=1,…,N}

• Initialize:
  – ŷ0(x) = (1/N)·Σi yi ; ri = yi, i=1,…,N

• For t=1 to T:
  – For i=1 to N, compute the residuals: ri ← ri - ŷt-1(xi)
  – Build a regression tree ŷt(x) from the learning sample {(xi,ri): i=1,…,N}

• Return the model ŷ(x) = ŷ0(x) + ŷ1(x) + … + ŷT(x)
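A sketch of this residual-fitting loop with sklearn regression trees; max_depth is an arbitrary choice that keeps the base models weak (sklearn's GradientBoostingRegressor implements the same family of algorithms).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, T=50, max_depth=2):
    """Fit T small regression trees sequentially on the residuals, as in the pseudo-code."""
    y = np.asarray(y, dtype=float)
    y0 = y.mean()                              # yhat_0(x) = (1/N) * sum_i yi
    r = y - y0                                 # residuals after the constant model
    trees = []
    for _ in range(T):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        r = r - tree.predict(X)                # ri <- ri - yhat_t(xi)
        trees.append(tree)
    return y0, trees

def mart_predict(y0, trees, X):
    """Return yhat(x) = yhat_0 + yhat_1(x) + ... + yhat_T(x)."""
    return y0 + sum(tree.predict(X) for tree in trees)
```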

Page 61

61

Boosting methods

• Adaboost and MART are only two boosting variants. There are many other boosting type algorithms.

• Boosting decision/regression trees improves their accuracy, often dramatically. However, boosting is more sensitive to noise than averaging techniques (overfitting).

• For boosting to work, the models must not be perfect on the learning sample. With trees, there are two possible strategies:
  – Use pruned trees (pre-pruned or post-pruned by cross-validation)
  – Limit the number of tree tests (and split the most impure nodes first)

⇒ There is again a bias/variance tradeoff with respect to the tree size.

Page 62

62

Experiment with MART

• On our illustrative problem:

• Boosting reduces the bias but increases the variance. However, with respect to full trees, it decreases both bias and variance.

Method                    E     Bias   Variance
Full regr. Tree           10.2   3.5    6.7
Regr. Tree with 1 test    18.9  17.8    1.1
  + MART (T=50)            5.0   3.1    1.9
  + Bagging (T=50)        17.9  17.3    0.6
Regr. Tree with 5 tests   11.7   8.8    2.9
  + MART (T=50)            6.4   1.7    4.7
  + Bagging (T=50)         9.1   8.7    0.4

Page 63

63

Interpretability and efficiency of ensembles

• Since we average several models, we lose interpretability and efficiency, which are two of the main advantages of decision/regression trees.

• However:
  – We can still use the ensembles to compute variable importance, by averaging the importances over all trees. Actually, this even stabilizes the estimates (see the sketch below).
  – Averaging techniques can be parallelized, and boosting-type algorithms use smaller trees. So, the increase in computing times is not so detrimental.
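For example, with sklearn ensembles the importance averaged over all trees is available directly; the synthetic dataset below is only a stand-in, not the microarray data of the next slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 100 attributes, only a handful of them informative
X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ = importance of each attribute, averaged over all trees
ranking = np.argsort(forest.feature_importances_)[::-1]
for i in ranking[:5]:
    print(f"attribute {i:3d}  importance = {forest.feature_importances_[i]:.3f}")
```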

Page 64

64

Experiments on Golub’s microarray data

• 72 objects, 7129 numerical attributes (gene expressions), 2 classes (ALL and AML)

• Leave-one-out error with several variants

• Variable importance with boosting

Method                           Error
1 decision tree                  22.2% (16/72)
Random forests (k=85, T=500)      9.7% (7/72)
Extra-trees (sth=0.5, T=500)      5.5% (4/72)
Adaboost (1 test node, T=500)     1.4% (1/72)

[Figure: variable importance (computed with boosting) for the top-ranked variables]

Page 65

65

Conclusion (1)

• The notions of bias and variance are very useful to predict how changing the (learning and problem) parameters will affect the accuracy. E.g. this explains why very simple methods can work much better than more complex ones on very difficult tasks

• Variance reduction is a very important topic:
  – Reducing bias is easy, but keeping variance low is not as easy.
  – Especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, text mining…

• Not all learning algorithms are equal in terms of variance. Trees are among the worst methods according to this criterion.

Page 66

66

Conclusion (2)

• Ensemble methods are very effective techniques to reduce bias and/or variance. They can turn a mediocre method into a competitive one in terms of accuracy.

• Adaboost with trees is considered one of the best « off-the-shelf » classification methods.

• Interpretability of the model and efficiency of the method are difficult to preserve if we want to reduce variance significantly.

• There are other ways to tackle the variance/overfitting problem, e.g.:
  – Bayesian approaches (related to averaging techniques)
  – Support vector machines (they maintain a low variance by maximizing the classification margin)

Page 67

67

References

• About bias and variance:
  – Neural networks and the bias/variance dilemma, S. Geman et al., Neural Computation, 4(1), 1992, 1-58
  – Neural networks for statistical pattern recognition, C.M. Bishop, Oxford University Press, 1994
  – The elements of statistical learning, T. Hastie et al., Springer, 2001
  – Contribution to decision tree induction: bias/variance tradeoff and time series classification, P. Geurts, PhD thesis, 2002

• About ensemble methods:
  – Bagging predictors, L. Breiman, Machine Learning, 24, 1996
  – A decision theoretic generalization of on-line learning and an application to boosting, Y. Freund and R. Schapire, Journal of Computer and System Sciences, 1995
  – Random Forests, L. Breiman, Machine Learning, 45, 2001
  – Ensemble methods in machine learning, T. Dietterich, First International Workshop on Multiple Classifier Systems, 2000
  – An introduction to boosting and leveraging, R. Meir and G. Ratsch, Advanced Lectures on Machine Learning, Springer, 2003

Page 68

68

Software

• Random forests:
  – http://stat-www.berkeley.edu/users/breiman/rf.html
  – R package randomForest

• Boosting:
  – See www.boosting.org