
Page 1: Decision Trees II and Bias/Variance and Cross-Validation

Decision Trees II and Bias/Variance and Cross-Validation

Nipun Batra and teaching staff
January 11, 2020

IIT Gandhinagar

Page 2: Decision Trees II and Bias/Variance and Cross-Validation

Real Input Real Output

Page 3: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

Let us consider the dataset given below

[Figure: scatter plot of the dataset; Feature X (1.0 to 6.0) vs Label Y (0 to 2)]

Page 4: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

What would be the prediction for a decision tree with depth 0?

[Figure: scatter plot of the dataset; Feature X vs Label Y]

Page 5: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

Prediction for a decision tree with depth 0. The horizontal dashed line shows the predicted Y value; it is the average of the Y values of all datapoints.

[Figure: the dataset with the depth-0 prediction drawn as a horizontal dashed line at the mean of Y]

Page 6: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

What would be the decision tree with depth 1?

[Figure: scatter plot of the dataset; Feature X vs Label Y]

Page 7: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

Decision tree with depth 1

X[0] <= 2.5 | mse = 0.667 | samples = 6 | value = 1.0
├── True:  mse = 0.0  | samples = 2 | value = 0.0
└── False: mse = 0.25 | samples = 4 | value = 1.5

Page 8: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

The Decision Boundary

[Figure: the depth-1 decision boundary at X = 2.5 over the dataset; Feature X vs Label Y]

Page 9: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

What would be the decision tree with depth 2?

[Figure: scatter plot of the dataset; Feature X vs Label Y]

Page 10: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

Decision tree with depth 2

X[0] <= 2.5 | mse = 0.667 | samples = 6 | value = 1.0
├── True:  mse = 0.0 | samples = 2 | value = 0.0
└── False: X[0] <= 4.5 | mse = 0.25 | samples = 4 | value = 1.5
    ├── True:  mse = 0.0 | samples = 2 | value = 1.0
    └── False: mse = 0.0 | samples = 2 | value = 2.0

Page 11: Decision Trees II and Bias/Variance and Cross-Validation

Example 1

The Decision Boundary

[Figure: the depth-2 decision boundaries at X = 2.5 and X = 4.5 over the dataset; Feature X vs Label Y]


Page 15: Decision Trees II and Bias/Variance and Cross-Validation

Objective function

Here, the feature is denoted by X and the label by Y. Let the "decision boundary" or "split" be at X = S. Let the region X < S be region R1, and the region X > S be region R2.

Then, let

$$C_1 = \mathrm{mean}(Y_i \mid X_i \in R_1), \qquad C_2 = \mathrm{mean}(Y_i \mid X_i \in R_2)$$

$$\mathrm{Loss}(S) = \sum_{i:\, X_i \in R_1} (Y_i - C_1)^2 \;+\; \sum_{i:\, X_i \in R_2} (Y_i - C_2)^2$$

Our objective is to minimize the loss, i.e. to find

$$\min_S \; \Big[ \sum_{i:\, X_i \in R_1} (Y_i - C_1)^2 + \sum_{i:\, X_i \in R_2} (Y_i - C_2)^2 \Big]$$


Page 18: Decision Trees II and Bias/Variance and Cross-Validation

How to find optimal split “S”?

1. Sort all datapoints (X, Y) in increasing order of X.

2. Evaluate the loss function for every candidate split $S = \frac{X_i + X_{i+1}}{2}$ and select the S with minimum loss.
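To make the procedure concrete, here is a minimal Python sketch (not the lecture's own code; the toy dataset is an assumed reconstruction of Example 1 with X = 1..6 and Y = [0, 0, 1, 1, 2, 2]):

```python
import numpy as np

def sse(y):
    # sum of squared errors of y around its own mean (the loss of one region)
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(X, y):
    # scan candidate splits S = (X_i + X_{i+1}) / 2 over the sorted X values
    order = np.argsort(X)
    X, y = X[order], y[order]
    best_s, best_loss = None, np.inf
    for i in range(len(X) - 1):
        s = (X[i] + X[i + 1]) / 2
        loss = sse(y[X < s]) + sse(y[X >= s])   # Loss(S) from the objective above
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # assumed reconstruction of Example 1
Y = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
print(best_split(X, Y))   # -> (2.5, 1.0); S = 4.5 ties, 2.5 is found first (cf. the depth-1 tree)
```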

Page 19: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

Draw a regression tree for Y = sin(X), 0 ≤ X ≤ 2π


Page 20: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

Dataset of Y = sin(X), 0 ≤ X ≤ 7 with 10,000 points

[Figure: scatter plot of Y = sin(X) for X in [0, 7]; Feature X vs Label Y (−1 to 1)]

Page 21: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

Regression tree of depth 1

X[0] <= 3.033 | mse = 0.463 | samples = 10000 | value = 0.035
├── True:  mse = 0.086 | samples = 4333 | value = 0.657
└── False: mse = 0.23  | samples = 5667 | value = -0.441
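For reference, a short scikit-learn sketch (an assumed reproduction, not the lecture's code) that fits this kind of regression stump, and a deeper tree, on sin(X); the exact split and mse values may differ slightly from the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.linspace(0, 7, 10_000).reshape(-1, 1)   # 10,000 points in [0, 7]
y = np.sin(X).ravel()

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(stump))                      # shows the single split and the two leaf values

deep = DecisionTreeRegressor().fit(X, y)       # no depth limit
print(deep.get_depth())                        # a deep tree (the slide reports depth 20)
```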

Page 22: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

Decision Boundary

[Figure: the depth-1 regression tree's decision boundary over the sin(X) data]

Page 23: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

The regression tree with no depth limit is too big to fit on a slide. It has depth 20. The decision boundaries are shown in the figure below.

[Figure: decision boundaries of the unrestricted regression tree over the sin(X) data]

Page 24: Decision Trees II and Bias/Variance and Cross-Validation

Summary

• Interpretability an important goal
• Decision trees: well known interpretable models
• Learning optimal tree is hard
• Greedy approach:
  • Recursively split to maximize "performance gain"
• Issues:
  • Can overfit easily!
  • Empirically not as powerful as other methods

Page 25: Decision Trees II and Bias/Variance and Cross-Validation

A Question!

What would be the decision boundary of a decision tree classifier?

[Figure: scatter plot of a two-class dataset; x-axis X1 (1 to 7), y-axis X2 (1.0 to 4.0)]

Page 26: Decision Trees II and Bias/Variance and Cross-Validation

Decision Boundary for a tree with depth 1

(a) Decision Boundary: [Figure: the X1–X2 plane split vertically at X1 = 4.0]

X[0] <= 4.0 | entropy = 1.0 | samples = 12 | value = [6, 6]
├── True:  entropy = 0.65 | samples = 6 | value = [5, 1]
└── False: entropy = 0.65 | samples = 6 | value = [1, 5]

(b) Decision Tree

Page 27: Decision Trees II and Bias/Variance and Cross-Validation

Decision Boundary for a tree with no depth limit

(c) Decision Boundary: [Figure: the X1–X2 plane partitioned by the full-depth tree]

X[0] <= 4.0 | entropy = 1.0 | samples = 12 | value = [6, 6]
├── True:  X[1] <= 1.5 | entropy = 0.65 | samples = 6 | value = [5, 1]
│   ├── True:  entropy = 0.0 | samples = 3 | value = [3, 0]
│   └── False: X[0] <= 1.5 | entropy = 0.918 | samples = 3 | value = [2, 1]
│       ├── True:  entropy = 0.0 | samples = 2 | value = [2, 0]
│       └── False: entropy = 0.0 | samples = 1 | value = [0, 1]
└── False: X[1] <= 1.5 | entropy = 0.65 | samples = 6 | value = [1, 5]
    ├── True:  entropy = 0.0 | samples = 3 | value = [0, 3]
    └── False: X[0] <= 6.5 | entropy = 0.918 | samples = 3 | value = [1, 2]
        ├── True:  entropy = 0.0 | samples = 1 | value = [1, 0]
        └── False: entropy = 0.0 | samples = 2 | value = [0, 2]

(d) Decision Tree
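A short scikit-learn sketch (with hypothetical data, not the slide's exact twelve points) of how a depth-1 stump and an unrestricted classification tree like the ones above can be fit and printed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(low=[1.0, 1.0], high=[7.0, 4.0], size=(12, 2))   # 12 points in the X1-X2 plane
y = (X[:, 0] > 4.0).astype(int)                                  # class mostly determined by X1

stump = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(X, y)
full = DecisionTreeClassifier(criterion="entropy").fit(X, y)     # no depth limit

print(export_text(stump, feature_names=["X1", "X2"]))
print(export_text(full, feature_names=["X1", "X2"]))
```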


Page 29: Decision Trees II and Bias/Variance and Cross-Validation

Are deeper trees always better?

As we saw, deeper trees learn more complex decision boundaries.

But sometimes this can lead to poor generalization.

Page 30: Decision Trees II and Bias/Variance and Cross-Validation

An example

Consider the dataset below

(e) Train Set: [Figure: train dataset scatter plot, X1 vs X2 (both 1 to 6)]
(f) Test Set: [Figure: test dataset scatter plot, X1 vs X2 (both 1 to 6)]

Page 31: Decision Trees II and Bias/Variance and Cross-Validation

Underfitting

Underfitting is also known as high bias, since the model makes a very biased, incorrect assumption about the data.

(g) Decision Boundary: [Figure: the underfit decision boundary in the X1–X2 plane]

X[1] <= 1.5 | entropy = 0.964 | samples = 36 | value = [22, 14]
├── True:  entropy = 0.0 | samples = 6 | value = [6, 0]
└── False: X[1] <= 5.5 | entropy = 0.997 | samples = 30 | value = [16, 14]
    ├── True:  entropy = 0.98 | samples = 24 | value = [10, 14]
    └── False: entropy = 0.0 | samples = 6 | value = [6, 0]

(h) Decision Tree

Page 32: Decision Trees II and Bias/Variance and Cross-Validation

Overfitting

Overfitting is also known as high variance, since very small changes in the data can lead to very different models. The decision tree learned here has a depth of 10.

[Figure: the overfit decision boundary in the X1–X2 plane]

Page 33: Decision Trees II and Bias/Variance and Cross-Validation

Intuition for Variance

A small change in data can lead to very different models.

[Figure: Dataset 1 (train dataset), scatter plot of X1 vs X2]
[Figure: Dataset 2, scatter plot of X1 vs X2]

Page 34: Decision Trees II and Bias/Variance and Cross-Validation

Intuition for Variance

[Figure: the two full-depth decision trees learned from Dataset 1 and Dataset 2. Both trees start with the same splits (X[1] <= 1.5, then X[1] <= 5.5, X[0] <= 1.5, X[0] <= 5.5), but the deeper splits differ between the two trees even though the datasets differ only slightly.]

Page 35: Decision Trees II and Bias/Variance and Cross-Validation

A Good Fit

[Figure: decision boundary of the depth-limited tree in the X1–X2 plane]

X[1] <= 1.5 | entropy = 0.964 | samples = 36 | value = [22, 14]
├── True:  entropy = 0.0 | samples = 6 | value = [6, 0]
└── False: X[1] <= 5.5 | entropy = 0.997 | samples = 30 | value = [16, 14]
    ├── True:  X[0] <= 1.5 | entropy = 0.98 | samples = 24 | value = [10, 14]
    │   ├── True:  entropy = 0.0 | samples = 4 | value = [4, 0]
    │   └── False: X[0] <= 5.5 | entropy = 0.881 | samples = 20 | value = [6, 14]
    │       ├── True:  entropy = 0.544 | samples = 16 | value = [2, 14]
    │       └── False: entropy = 0.0 | samples = 4 | value = [4, 0]
    └── False: entropy = 0.0 | samples = 6 | value = [6, 0]

Page 36: Decision Trees II and Bias/Variance and Cross-Validation

Accuracy vs Depth Curve

[Figure: train and test accuracy (roughly 60–100%) vs tree depth (0–10)]

• As depth increases, train accuracy improves
• As depth increases, test accuracy improves till a point
• At very high depths, test accuracy is not good (overfitting)
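One way such a curve can be produced, as a rough sketch with hypothetical data (make_classification here is an assumption, not the lecture's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          round(clf.score(X_train, y_train), 3),   # train accuracy keeps improving with depth
          round(clf.score(X_test, y_test), 3))     # test accuracy typically peaks, then degrades
```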


Page 40: Decision Trees II and Bias/Variance and Cross-Validation

Accuracy vs Depth Curve: Underfitting

The highlighted region is the underfitting region. The model is too simple (too little depth) to learn from the data.

[Figure: train and test accuracy vs depth, with the low-depth (underfitting) region highlighted]

Page 41: Decision Trees II and Bias/Variance and Cross-Validation

Accuracy vs Depth Curve: Overfitting

The highlighted region is the overfitting region. The model is complex (high depth) and hence also learns the anomalies in the data.

[Figure: train and test accuracy vs depth, with the high-depth (overfitting) region highlighted]

Page 42: Decision Trees II and Bias/Variance and Cross-Validation

Accuracy vs Depth Curve

The highlighted region is the good-fit region. We want to maximize test accuracy while staying in this region.

[Figure: train and test accuracy vs depth, with the good-fit region highlighted]


Page 44: Decision Trees II and Bias/Variance and Cross-Validation

The big question!?

How to find the optimal depth for a decision tree?

Use cross-validation!


Page 45: Decision Trees II and Bias/Variance and Cross-Validation

Our General Training Flow

[Diagram: Train Data → Training → Model; the Model predicts on the Test Data; Predicted Labels are compared with Actual Labels in an Error Metric Calculation to produce the Accuracy]

Page 46: Decision Trees II and Bias/Variance and Cross-Validation

K-Fold cross-validation: Utilise full dataset for testing

FOLD 1: Train | Train | Train | Test
FOLD 2: Train | Train | Test  | Train
FOLD 3: Train | Test  | Train | Train
FOLD 4: Test  | Train | Train | Train
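A minimal scikit-learn sketch of the same idea (the 8-sample toy arrays are hypothetical):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(16).reshape(8, 2)                  # 8 toy samples, 2 features
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=4).split(X), start=1):
    # each fold holds out a different quarter of the samples as the test set
    print(f"FOLD {fold}: train={train_idx}, test={test_idx}")
```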

Page 47: Decision Trees II and Bias/Variance and Cross-Validation

The Validation Set

[Diagram: part of the train data is held out as a Validation set. Models with Depth = 1, 2, and 3 are trained on the remaining train data and evaluated on the validation set; the test data is used only for the final accuracy.]

Accuracy on the validation set:
Depth 1 : 70%
Depth 2 : 90%
Depth 3 : 80%

Select the model having the highest validation accuracy.
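A rough sketch of this validation-set workflow with scikit-learn (hypothetical data; the 70/90/80% numbers above are not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)

val_acc = {}
for depth in (1, 2, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_acc[depth] = clf.score(X_val, y_val)           # validation accuracy for this depth

best_depth = max(val_acc, key=val_acc.get)             # model with highest validation accuracy
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_trainval, y_trainval)
print(best_depth, final.score(X_test, y_test))         # the test set is only touched once, at the end
```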

Page 48: Decision Trees II and Bias/Variance and Cross-Validation

Nested Cross Validation

Divide your training set into K equal parts. Cyclically use one part as the "validation set" and the rest for training. Here K = 4.

FOLD 1: Train | Train | Train | Validation
FOLD 2: Train | Train | Validation | Train
FOLD 3: Train | Validation | Train | Train
FOLD 4: Validation | Train | Train | Train

Page 49: Decision Trees II and Bias/Variance and Cross-Validation

Nested Cross Validation

Average the validation accuracy across all the folds. Use the model with the highest average validation accuracy.

[Diagram: the same four folds as on the previous slide, each using a different part as the Validation set]

Average out Validation Accuracy across all folds.

Select the hyperparameter for which the model gives the highest average validation accuracy

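As a sketch of this procedure with scikit-learn (hypothetical data; GridSearchCV is one convenient way to average validation accuracy over folds, not necessarily what the lecture used):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": range(1, 11)},
                      cv=4)                       # 4-fold cross-validation on the training set
search.fit(X_train, y_train)

print(search.best_params_)                        # depth with the highest average validation accuracy
print(search.score(X_test, y_test))               # final accuracy on the untouched test set
```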

Page 50: Decision Trees II and Bias/Variance and Cross-Validation

Next time: Ensemble Learning

• How to combine various models?
• Why combine multiple models?
• How can we reduce bias?
• How can we reduce variance?