
Ensemble methods

CS 446


Why ensembles?

Standard machine learning setup:

- We have some data.
- We train 10 predictors (3-nn, least squares, SVM, ResNet, ...).
- We output the best on a validation set.

Question: can we do better than the best?

What if we use an ensemble/aggregate/combination?

We'll consider two approaches: boosting and bagging.


Bagging


Bagging?

This first approach is based upon a simple idea:

- If the predictors have independent errors, a majority vote of their outputs should be good.

Let's first check this.


Combining classifiers

Suppose we have n classifiers, and suppose each is wrong independently with probability 0.4. Model the classifier errors as random variables (Z_i)_{i=1}^n (thus E[Z_i] = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

Red: all classifiers wrong. Green: at least half the classifiers wrong.

[Figure: Binom(n, 0.4) probability mass functions. For n = 2, 3, 4, 5, 6, 7 the fraction red (all wrong) is 0.16, 0.064, 0.0256, 0.01024, 0.004096, 0.0016384; for n = 10, 20, 30, 40, 50, 60 the fraction green (at least half wrong) is 0.366897, 0.244663, 0.175369, 0.129766, 0.0978074, 0.0746237.]

The green region is the error of the majority vote! 0.075 ≪ 0.4 !!!
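The fraction-green numbers in the figure can be checked directly by evaluating this binomial tail; a minimal sketch in plain Python (standard library only):

from math import comb, ceil

def majority_error(n, p=0.4):
    # P[Binom(n, p) >= n/2]: probability that at least half the classifiers are wrong.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(ceil(n / 2), n + 1))

for n in (10, 20, 30, 40, 50, 60):
    print(f"n = {n:2d}: majority-vote error = {majority_error(n):.6f}")
# n = 10 gives 0.366897, ..., n = 60 gives 0.074624, matching the figure.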


Majority vote

[Figure: Binom(n, 0.4) PMF with the green (majority wrong) region shaded; e.g. n = 10 gives fraction green = 0.366897.]

The green region is the error of the majority vote! Suppose y_i ∈ {−1, +1}. Define

MAJ(y_1, ..., y_n) := +1 when \sum_i y_i \ge 0, and −1 when \sum_i y_i < 0.

Error rate of the majority classifier (with individual error probability p):

\Pr[\mathrm{Binom}(n, p) \ge n/2] = \sum_{i = \lceil n/2 \rceil}^{n} \binom{n}{i} p^i (1-p)^{n-i} \le \exp\big(-n(1/2 - p)^2\big).
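As a sanity check on this bound, here is a small simulation sketch (my construction, assuming NumPy is available) that implements MAJ on ±1 votes and compares the empirical majority-vote error against exp(−n(1/2 − p)²):

import numpy as np

rng = np.random.default_rng(0)

def maj(votes):
    # MAJ(y_1, ..., y_n): +1 when the sum is >= 0, -1 otherwise.
    return np.where(votes.sum(axis=1) >= 0, 1, -1)

p, trials = 0.4, 200_000
for n in (10, 20, 40, 60):
    # True label is -1; each classifier independently votes the wrong label (+1) with prob p.
    votes = np.where(rng.random((trials, n)) < p, 1, -1)
    err = np.mean(maj(votes) == 1)          # majority wrong <=> at least half wrong
    bound = np.exp(-n * (0.5 - p) ** 2)     # exp(-n (1/2 - p)^2)
    print(f"n={n:2d}  empirical error={err:.4f}  bound={bound:.4f}")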


Bottom line

[Figure: Binom(60, 0.4) PMF; fraction green (majority wrong) = 0.0746237.]

The green region is the error of the majority vote!

The error of the majority-vote classifier goes down exponentially in n:

\Pr[\mathrm{Binom}(n, p) \ge n/2] = \sum_{i = \lceil n/2 \rceil}^{n} \binom{n}{i} p^i (1-p)^{n-i} \le \exp\big(-n(1/2 - p)^2\big).


From independent errors to an algorithm

How do we use independent errors in an algorithm?

1. For t = 1, 2, ..., T:
   1.1 Obtain IID data S_t := ((x_i^{(t)}, y_i^{(t)}))_{i=1}^n,
   1.2 Train classifier f_t on S_t.
2. Output x \mapsto MAJ(f_1(x), ..., f_T(x)).

- Good news: errors are independent! (Our exponential error estimate from before is valid.)
- Bad news: each classifier is trained on only a 1/T fraction of the data (why not just train ResNet on all of it...).
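To make the scheme concrete, here is a minimal sketch (my construction, assuming scikit-learn and NumPy, labels in {−1, +1}, and shallow decision trees as a stand-in base classifier): split one dataset into T disjoint chunks to mimic the T independent samples, train one classifier per chunk, and take a majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_vote_ensemble(X, y, T=10, seed=0):
    """Train T classifiers on disjoint chunks of (X, y); predict by majority vote.
    X, y are NumPy arrays and labels are assumed to be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    chunks = np.array_split(idx, T)                   # disjoint "independent" samples
    models = [DecisionTreeClassifier(max_depth=3).fit(X[c], y[c]) for c in chunks]

    def predict(X_new):
        votes = np.stack([m.predict(X_new) for m in models])   # shape (T, n_new)
        return np.where(votes.sum(axis=0) >= 0, 1, -1)         # MAJ over +/-1 votes
    return predict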


Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, ..., T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining "bootstrap sample" S_t.
   2.2 Train classifier f_t on S_t.
3. Output x \mapsto MAJ(f_1(x), ..., f_T(x)).

- Good news: using most of the data for each f_t!
- Bad news: errors no longer independent...?
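A minimal bagging sketch under the same assumptions as the previous sketch (scikit-learn, NumPy, ±1 labels, decision trees as a stand-in base learner); the only change is that each classifier sees a bootstrap resample of the full dataset instead of a disjoint chunk.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=50, seed=0):
    """Bootstrap aggregating: T classifiers on bootstrap samples, majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)       # sample n points with replacement
        models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

    def predict(X_new):
        votes = np.stack([m.predict(X_new) for m in models])
        return np.where(votes.sum(axis=0) >= 0, 1, -1)
    return predict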


Sampling with replacement?

Question: take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked?

Answer:

(1 - 1/n)^n; for large n: \lim_{n \to \infty} (1 - 1/n)^n = 1/e \approx 0.3679.

Implications for bagging:

- Each bootstrap sample contains about 63% of the data set.
- The remaining 37% can be used to estimate the error rate of the classifier trained on the bootstrap sample.
- If we have three classifiers, some of their error estimates must share examples! Independence is violated!
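The 63%/37% split is easy to verify numerically; the sketch below (plain Python plus NumPy, my choice) measures the fraction of distinct points in one bootstrap sample and compares it with 1 − 1/e.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)                  # one bootstrap sample of size n
in_bag = np.unique(idx).size / n                  # fraction of distinct points included
print(f"in-bag fraction ~ {in_bag:.4f}  (1 - 1/e = {1 - np.exp(-1):.4f})")
# The ~37% left out ("out-of-bag" points) can serve as a validation set for this sample.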


Random Forests

Random Forests (Leo Breiman, 2001).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, ..., T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining "bootstrap sample" S_t.
   2.2 Train a decision tree f_t on S_t as follows: when greedily splitting tree nodes, consider only \sqrt{d} of the d possible features.
3. Output x \mapsto MAJ(f_1(x), ..., f_T(x)).

- Heuristic news: maybe errors are more independent now?
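In practice one rarely hand-rolls this; as a usage sketch (assuming scikit-learn is available), RandomForestClassifier already does the bootstrap resampling and the per-split feature subsampling:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# T = 100 trees; each split considers only sqrt(d) candidate features.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))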


Boosting — a quick look


Boosting overview

- We no longer assume classifiers have independent errors.
- We no longer output a simple majority: we reweight the classifiers via optimization.
- There is a rich theory with many interpretations.


Simplified boosting scheme

1. Start with data ((x_i, y_i))_{i=1}^n and classifiers (h_1, ..., h_T).
2. Find weights w \in \mathbb{R}^T which approximately minimize

   \frac{1}{n} \sum_{i=1}^n \ell\Big( y_i \sum_{j=1}^T w_j h_j(x_i) \Big) = \frac{1}{n} \sum_{i=1}^n \ell\big( y_i w^\top z_i \big),

   where z_i = (h_1(x_i), ..., h_T(x_i)) \in \mathbb{R}^T. (We use the classifiers to give us features.)
3. Predict with x \mapsto \sum_{j=1}^T w_j h_j(x).

Remarks.

- If \ell is convex, this is standard linear prediction: convex in w.
- In the classical setting: \ell(r) = \exp(-r), optimizer = coordinate descent, T = \infty.
- Most commonly, (h_1, ..., h_T) are decision stumps.
- Popular software implementation: xgboost.
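To make step 2 concrete, here is a small sketch (my construction, assuming NumPy and labels in {−1, +1}): precompute a feature matrix Z with Z[i, j] = h_j(x_i) for a fixed bank of stumps, then minimize the average exponential loss over w by plain gradient descent.

import numpy as np

def stump_features(X, thresholds):
    """Z[i, j] = h_j(x_i) for stumps h_j(x) = sign(x[feat] - b), one per (feat, b) pair."""
    cols = [np.sign(X[:, f] - b) for (f, b) in thresholds]
    return np.stack(cols, axis=1)            # shape (n, T)

def fit_boosting_weights(Z, y, steps=500, eta=0.1):
    """Gradient descent on (1/n) sum_i exp(-y_i w^T z_i)."""
    n, T = Z.shape
    w = np.zeros(T)
    for _ in range(steps):
        margins = y * (Z @ w)                # y_i w^T z_i
        q = -np.exp(-margins)                # q_i = l'(y_i w^T z_i) for l(r) = exp(-r)
        grad = (Z * (q * y)[:, None]).mean(axis=0)   # (1/n) sum_i q_i y_i z_i
        w -= eta * grad
    return w

def predict(Z_new, w):
    return np.sign(Z_new @ w)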


Decision stumps?

[Figure: iris data plotted with sepal length/width on the horizontal axis and petal length/width on the vertical axis, with a single split at x_1 > 1.7.]

Classifying irises by sepal and petal measurements:

- X = R^2, Y = {1, 2, 3}
- x_1 = ratio of sepal length to width
- x_2 = ratio of petal length to width

A decision stump for this data: test x_1 > 1.7, with leaves y = 1 and y = 3 ... and stop there!
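As a sketch (assuming NumPy; the 1.7 threshold and the leaf labels are read off the slide's figure, and which side gets which label is my guess from that figure), the stump above is essentially a one-line predicate:

import numpy as np

def iris_stump(x1):
    """Depth-one tree from the slide: predict class 3 when x1 > 1.7, else class 1.
    (x1 = sepal length / sepal width; class 2 is simply never predicted.)"""
    x1 = np.asarray(x1)
    return np.where(x1 > 1.7, 3, 1)

print(iris_stump([1.5, 2.0, 2.4]))   # -> [1 3 3]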


Boosting decision stumps

Minimizing \frac{1}{n} \sum_{i=1}^n \ell\big( y_i \sum_{j=1}^T w_j h_j(x_i) \big) over w \in \mathbb{R}^T, where (h_1, ..., h_T) are decision stumps.

[Figure: decision-boundary contour plots on 2-d data, one panel per model: boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.); repeated across several slides with progressively larger contour values.]


Boosting — classical perspective


Coordinate descent?

The classical methods used coordinate descent:

- Find the maximum-magnitude coordinate of the gradient:

  \operatorname{argmax}_j \Big| \frac{d}{dw_j} \sum_{i=1}^n \ell\Big( \sum_j w_j h_j(x_i) y_i \Big) \Big|
  = \operatorname{argmax}_j \Big| \sum_{i=1}^n \ell'\Big( \sum_j w_j h_j(x_i) y_i \Big) h_j(x_i) y_i \Big|
  = \operatorname{argmax}_j \Big| \sum_{i=1}^n q_i h_j(x_i) y_i \Big|,

  where we've defined q_i := \ell'\big( \sum_j w_j h_j(x_i) y_i \big).

- Iterate: w' := w - \eta s e_j, where j is the maximum coordinate, s \in \{-1, +1\} is its sign, and \eta is a step size.
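Continuing the earlier sketch (same assumptions: NumPy, labels in {−1, +1}, a precomputed matrix Z with Z[i, j] = h_j(x_i)), one coordinate-descent step looks roughly like this:

import numpy as np

def coordinate_descent_step(Z, y, w, eta=0.1):
    """One boosting step as coordinate descent on sum_i exp(-y_i w^T z_i)."""
    margins = y * (Z @ w)
    q = -np.exp(-margins)                    # q_i = l'(sum_j w_j h_j(x_i) y_i)
    corr = (q * y) @ Z                       # corr_j = sum_i q_i y_i h_j(x_i)
    j = np.argmax(np.abs(corr))              # max-magnitude gradient coordinate
    s = np.sign(corr[j])                     # its sign
    w = w.copy()
    w[j] -= eta * s                          # w' = w - eta * s * e_j
    return w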


Interpreting coordinate descent

Suppose h_j : \mathbb{R}^d \to \{-1, +1\}; then h_j(x) y = 2 \cdot 1[h_j(x) = y] - 1, and each step solves

\operatorname{argmax}_j \Big| \sum_{i=1}^n q_i h_j(x_i) y_i \Big| = \operatorname{argmax}_j \Big| \sum_{i=1}^n q_i \big( 1[h_j(x_i) = y_i] - 1/2 \big) \Big|.

We are solving a weighted zero-one loss minimization problem.

Remarks:

- The classical choice of coordinate descent is equivalent to solving a problem akin to weighted zero-one loss minimization.
- We can abstract away a finite set (h_1, ..., h_T), and have an arbitrary set of predictors (e.g., all linear classifiers).


Classical boosting setup

There is a Weak Learning Oracle, and a corresponding γ-weak-learnable assumption:

A set of points is γ-weak-learnable by a weak learning oracle if, for any weighting q, it returns a predictor h so that E_q[h(X) Y] \ge γ.

Interpretation: for any reweighting q, we get a predictor h which is at least γ-correlated with the target.

Remarks:

- The classical methods iteratively invoke the oracle with different weightings and then output a final aggregated predictor.
- The best-known method, AdaBoost, performs coordinate-descent updates (invoking the oracle) with a specific step size, and needs O\big( \frac{1}{γ^2} \ln(\frac{1}{ε}) \big) iterations for accuracy ε > 0.
- The original description of AdaBoost is in terms of the sequence of weightings q_1, q_2, ..., and says nothing about coordinate descent.
- Adaptive Boosting: the method doesn't need to know γ, and adapts to varying γ_t := E_{q_t}[h_t(X) Y].
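For reference, a compact sketch of the classical AdaBoost loop (standard textbook form; assumes NumPy, ±1 labels, and a hypothetical weak_learner(X, y, q) callable that returns a classifier with a .predict method giving ±1 outputs):

import numpy as np

def adaboost(X, y, weak_learner, T=100):
    """Classical AdaBoost: maintain example weights q, reweight after each round."""
    n = len(y)
    q = np.full(n, 1.0 / n)
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, q)                    # oracle call with weighting q
        pred = h.predict(X)                          # values in {-1, +1}
        eps = np.clip(np.sum(q * (pred != y)), 1e-12, 1 - 1e-12)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)        # step size
        q = q * np.exp(-alpha * y * pred)            # upweight mistakes
        q /= q.sum()
        hs.append(h); alphas.append(alpha)

    def predict(X_new):
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hs))
        return np.sign(scores)
    return predict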


Example: AdaBoost with decision stumps

(This example is from Schapire & Freund's book.)

Weak learning oracle (WLO): pick the best decision stump, meaning

F := \{ x \mapsto \mathrm{sign}(x_i - b) : i \in \{1, ..., d\}, b \in \mathbb{R} \}.

(It is straightforward to handle weights in ERM.)

Remark:

- Only need to consider O(n) stumps (why?).
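A sketch of this weak learner (my implementation, assuming NumPy and ±1 labels): for each feature, only thresholds between consecutive sorted data values need to be checked, which is why O(n) candidate stumps per feature suffice. For clarity this version scans candidates naively rather than using the usual sorted cumulative-sum speedup; it also matches the weak_learner interface assumed in the AdaBoost sketch above.

import numpy as np

class Stump:
    def __init__(self, feat, b, sign):
        self.feat, self.b, self.sign = feat, b, sign
    def predict(self, X):
        return self.sign * np.where(X[:, self.feat] > self.b, 1, -1)

def best_stump(X, y, q):
    """Weighted ERM over stumps: minimize sum_i q_i 1[h(x_i) != y_i]."""
    n, d = X.shape
    best, best_err = None, np.inf
    for feat in range(d):
        # Midpoints between consecutive sorted values: O(n) candidate thresholds.
        vals = np.unique(X[:, feat])
        cands = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2))
        for b in cands:
            raw = np.where(X[:, feat] > b, 1, -1)
            for sign in (+1, -1):                    # also allow the flipped stump
                err = np.sum(q * (sign * raw != y))
                if err < best_err:
                    best, best_err = Stump(feat, b, sign), err
    return best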


Example: execution of AdaBoost

[Figure: the example weightings D_1, D_2, D_3 and the stumps f_1, f_2, f_3 chosen in three rounds of AdaBoost on a toy 2-d dataset.]


Example: final classifier from AdaBoost

[Figure: the stumps f_1, f_2, f_3 and the resulting combined decision regions on the toy dataset.]

Final classifier: f(x) = \mathrm{sign}\big( 0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x) \big).

(Zero training error rate!)


A typical run of boosting.

AdaBoost + C4.5 on the "letters" dataset.

[Figure 1.7 from the Schapire & Freund text: training and test percent error rates of boosting on an OCR dataset with C4.5 as the base learner, plotted against the number of rounds T (log scale, 10 to 1000). The top and bottom curves are AdaBoost's test and training error, respectively; the top horizontal line shows the test error of C4.5 alone, and the bottom horizontal line shows AdaBoost's final test error after 1000 rounds.]

(The number of nodes across all decision trees in f is > 2 × 10^6.)

The training error rate is zero after just five rounds, but the test error rate continues to decrease, even up to 1000 rounds!


Boosting the margin.

Final classifier from AdaBoost:

f(x) = \mathrm{sign}\Bigg( \underbrace{\frac{\sum_{t=1}^T \alpha_t f_t(x)}{\sum_{t=1}^T |\alpha_t|}}_{g(x) \in [-1, +1]} \Bigg).

Call y \cdot g(x) \in [-1, +1] the margin achieved on example (x, y). (Note: \ell_1, not \ell_2, normalized.)

Margin theory [Schapire, Freund, Bartlett, and Lee, 1998]:

- Larger margins ⇒ better generalization, independent of T.
- AdaBoost tends to increase margins on training examples.

"letters" dataset:

                        T = 5    T = 100   T = 1000
  training error rate    0.0%     0.0%      0.0%
  test error rate        8.4%     3.3%      3.1%
  % margins ≤ 0.5        7.7%     0.0%      0.0%
  min. margin            0.14     0.52      0.55

- A similar phenomenon appears in deep networks trained with gradient descent.
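A small sketch (my construction, assuming NumPy, ±1 labels, and access to the per-round classifiers hs and weights alphas, which the AdaBoost sketch above would need to expose alongside its predictor) of how one would compute these normalized margins on the training set:

import numpy as np

def normalized_margins(X, y, hs, alphas):
    """y * g(x) with g(x) = sum_t alpha_t f_t(x) / sum_t |alpha_t|, values in [-1, +1]."""
    alphas = np.asarray(alphas)
    scores = sum(a * h.predict(X) for a, h in zip(alphas, hs))
    g = scores / np.abs(alphas).sum()        # l1-normalized combination
    return y * g

# Summary statistics as in the "letters" table:
# m = normalized_margins(X_train, y_train, hs, alphas)
# print("min margin:", m.min(), "  fraction of margins <= 0.5:", np.mean(m <= 0.5))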


Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

f(x_i)_{y_i} - \max_{y \ne y_i} f(x_i)_y.

[Figure: empirical CDFs of the unnormalized margins, shown alongside the corresponding decision-boundary contour plots, for boosted stumps (O(n) param.), a 2-layer ReLU network (O(n) param.), and a 3-layer ReLU network (O(n) param.); repeated across several slides with the margin distributions shifting toward larger values.]
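Computing this distribution from a matrix of per-class scores takes a few lines (sketch, assuming NumPy, a scores array of shape (n, k), and integer class labels y):

import numpy as np

def unnormalized_margins(scores, y):
    """margin_i = f(x_i)_{y_i} - max_{y' != y_i} f(x_i)_{y'} for scores of shape (n, k)."""
    n = scores.shape[0]
    true_scores = scores[np.arange(n), y]
    masked = scores.copy()
    masked[np.arange(n), y] = -np.inf            # exclude the true class from the max
    return true_scores - masked.max(axis=1)

# To reproduce a margin plot (assuming matplotlib):
# plt.hist(unnormalized_margins(scores, y), bins=50, cumulative=True, density=True)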


Summary


- We can do better than the best single predictor.
- (Bagging.) If errors are independent, majority vote works well.
- (Boosting.) If they are not independent, a reweighted majority works well; the weights can be found with convex optimization.