Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag


Page 1:
Variance Reduction and Ensemble Methods
Nicholas Ruozzi, University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag

Page 2:
Last Time
• PAC learning
• Bias/variance tradeoff
  • small hypothesis spaces (not enough flexibility) can have high bias
  • rich hypothesis spaces (too much flexibility) can have high variance
• Today: more on this phenomenon and how to get around it

Page 3:
Intuition
• Bias
  • Measures the accuracy or quality of the algorithm
  • High bias means a poor match
• Variance
  • Measures the precision or specificity of the match
  • High variance means a weak match
• We would like to minimize both of these
• Unfortunately, we can't do this independently; there is a trade-off

Page 4:
Bias-Variance Analysis in Regression
• True function is y = f(x) + ε
• The noise ε is normally distributed with zero mean and standard deviation σ
• Given a set of training examples (x^(1), y^(1)), ..., (x^(n), y^(n)), we fit a hypothesis g(x) = wᵀx + b to the data to minimize the squared error
  Σ_i (y^(i) − g(x^(i)))²

Page 5:
2-D Example
Sample 20 points from f(x) = x + 2 sin(1.5x) + N(0, 0.2)
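This sampling setup is easy to reproduce. Below is a minimal sketch in numpy, assuming N(0, 0.2) denotes a standard deviation of 0.2 and assuming inputs drawn uniformly from [0, 10]; the input range is not specified on the slide:

```python
import numpy as np

def f(x):
    # True function from the slide: f(x) = x + 2 sin(1.5 x)
    return x + 2.0 * np.sin(1.5 * x)

def sample_dataset(n=20, rng=None):
    # Draw n inputs uniformly on [0, 10] (an assumption; the slide does not
    # specify the input range) and add Gaussian noise with std 0.2 to f(x)
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 10.0, size=n)
    y = f(x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x, y = sample_dataset(n=20, rng=0)
# Fit the linear hypothesis g(x) = w*x + b by least squares
w, b = np.polyfit(x, y, deg=1)
```

`np.polyfit` with `deg=1` returns the least-squares slope and intercept, i.e. the linear hypothesis that minimizes the squared error from the previous slide.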

Page 6:
2-D Example
50 fits (20 examples each)

Page 7:
Bias-Variance Analysis
• Given a new data point x′ with observed value y′ = f(x′) + ε, we want to understand the expected prediction error
• Suppose that training samples S are drawn independently from a distribution p(S); we want to compute the expected error of the estimator
  E[(y′ − g_S(x′))²]

Page 8:
Probability Reminder
• Variance of a random variable Z
  Var(Z) = E[(Z − E[Z])²]
         = E[Z² − 2 Z E[Z] + E[Z]²]
         = E[Z²] − E[Z]²
• Properties of Var(Z)
  Var(aZ) = E[a²Z²] − E[aZ]² = a² Var(Z)
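Both identities are easy to sanity-check numerically. A quick Monte Carlo sketch; the distribution of Z and the constant a are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Z ~ N(3, 2^2), so E[Z] = 3 and Var(Z) = 4 (an illustrative choice)
z = rng.normal(3.0, 2.0, size=200_000)

# Var(Z) = E[Z^2] - E[Z]^2
var_z = np.mean(z**2) - np.mean(z)**2

# Var(aZ) = a^2 Var(Z)
a = 5.0
var_az = np.var(a * z)
```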

Page 9:
Bias-Variance-Noise Decomposition

E[(y′ − g_S(x′))²]
 = E[g_S(x′)² − 2 g_S(x′) y′ + y′²]
 = E[g_S(x′)²] − 2 E[g_S(x′)] E[y′] + E[y′²]
 = Var(g_S(x′)) + E[g_S(x′)]² − 2 E[g_S(x′)] f(x′) + Var(y′) + f(x′)²
 = Var(g_S(x′)) + (E[g_S(x′)] − f(x′))² + Var(ε)
 = Var(g_S(x′)) + (E[g_S(x′)] − f(x′))² + σ²

Page 10:
Bias-Variance-Noise Decomposition (same derivation as the previous page)
The cross term factors because the samples S and the noise ε are independent

Page 11:
Bias-Variance-Noise Decomposition (same derivation as the previous pages)
The third line follows from the definition of variance

Page 12:
Bias-Variance-Noise Decomposition (same derivation as the previous pages)
E[y′] = f(x′)

Page 13:
Bias-Variance-Noise Decomposition (same derivation as the previous pages)
The three terms of the final line are the variance, the (squared) bias, and the noise

Page 14:
Bias, Variance, and Noise
• Variance: E[(g_S(x′) − E[g_S(x′)])²]
  • Describes how much g_S(x′) varies from one training set S to another
• Bias: E[g_S(x′)] − f(x′)
  • Describes the average error of g_S(x′)
• Noise: E[(y′ − f(x′))²] = E[ε²] = σ²
  • Describes how much y′ varies from f(x′)
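These three quantities can be estimated by simulation: refit the linear hypothesis on many independent training sets and examine its predictions at one query point. A sketch using the 2-D example's f(x) = x + 2 sin(1.5x) with σ = 0.2; the input range [0, 10] and the query point x′ = 5 are illustrative assumptions:

```python
import numpy as np

def f(x):
    return x + 2.0 * np.sin(1.5 * x)

rng = np.random.default_rng(0)
sigma = 0.2          # noise standard deviation
x_prime = 5.0        # query point x' (an illustrative choice)
n_train, n_sets = 20, 2000

# Fit the linear hypothesis g_S on n_sets independent training sets S
preds = np.empty(n_sets)
for s in range(n_sets):
    x = rng.uniform(0.0, 10.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)
    w, b = np.polyfit(x, y, deg=1)
    preds[s] = w * x_prime + b   # g_S(x')

variance = np.var(preds)                     # Var(g_S(x'))
bias_sq = (np.mean(preds) - f(x_prime))**2   # (E[g_S(x')] - f(x'))^2
noise = sigma**2                             # sigma^2
expected_error = variance + bias_sq + noise
```

By the decomposition above, `variance + bias_sq + noise` should match a direct simulation of the expected squared error E[(y′ − g_S(x′))²].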

Page 15:
2-D Example
50 fits (20 examples each)

Page 16:
Bias
(figure)

Page 17:
Variance
(figure)

Page 18:
Noise
(figure)

Page 19:
Bias
• Low bias
  • ?
• High bias
  • ?

Page 20:
Bias
• Low bias
  • Linear regression applied to linear data
  • 2nd-degree polynomial applied to quadratic data
• High bias
  • Constant function applied to non-constant data
  • Linear regression applied to highly non-linear data

Page 21:
Variance
• Low variance
  • ?
• High variance
  • ?

Page 22:
Variance
• Low variance
  • Constant function
  • Model independent of training data
• High variance
  • High-degree polynomial

Page 23:
Bias/Variance Tradeoff
• (bias² + variance) is what counts for prediction
• As we saw in PAC learning, we often have
  • Low bias ⇒ high variance
  • Low variance ⇒ high bias
• How can we deal with this in practice?

Page 24:
Reduce Variance Without Increasing Bias
• Averaging reduces variance: let Z_1, ..., Z_N be i.i.d. random variables
  Var((1/N) Σ_i Z_i) = (1/N) Var(Z_i)
• Idea: average models to reduce model variance
• The problem
  • Only one training set
  • Where do multiple models come from?
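The averaging identity is easy to verify numerically. A sketch with N = 25 averaged i.i.d. draws; the normal distribution and its parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 25, 100_000

# Each row holds N i.i.d. draws Z_1, ..., Z_N with Var(Z_i) = 9
z = rng.normal(0.0, 3.0, size=(trials, N))
avg = z.mean(axis=1)   # (1/N) sum_i Z_i, once per trial

var_single = np.var(z[:, 0])   # estimates Var(Z_i)
var_avg = np.var(avg)          # estimates Var(Z_i) / N
```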

Page 25:
Bagging: Bootstrap Aggregation
• Take repeated bootstrap samples from training set D (Breiman, 1994)
• Bootstrap sampling: given a set D containing N training examples, create D′ by drawing N examples at random with replacement from D
• Bagging:
  • Create k bootstrap samples D_1, ..., D_k
  • Train a distinct classifier on each D_i
  • Classify a new instance by majority vote / average
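The three bagging steps can be sketched directly. This is a minimal illustration rather than the lecture's code: it uses a depth-1 decision stump as a stand-in base classifier, and the names `Stump`, `bagging_fit`, and `bagging_predict` are hypothetical:

```python
import numpy as np

def majority(y):
    vals, counts = np.unique(y, return_counts=True)
    return vals[np.argmax(counts)]

class Stump:
    """Depth-1 decision tree (a stand-in base learner for illustration)."""
    def fit(self, X, y):
        # Fallback: predict the overall majority label if no split is found
        self.j, self.t = 0, np.inf
        self.left = self.right = majority(y)
        best_err = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:  # exclude max so both sides are nonempty
                mask = X[:, j] <= t
                left, right = majority(y[mask]), majority(y[~mask])
                err = np.mean(np.where(mask, left, right) != y)
                if err < best_err:
                    best_err = err
                    self.j, self.t, self.left, self.right = j, t, left, right
        return self

    def predict(self, X):
        return np.where(X[:, self.j] <= self.t, self.left, self.right)

def bagging_fit(X, y, k=25, rng=None):
    # Create k bootstrap samples D_1, ..., D_k and train a classifier on each
    rng = np.random.default_rng(rng)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))  # N draws with replacement
        models.append(Stump().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Classify each new instance by majority vote over the ensemble
    votes = np.stack([m.predict(X) for m in models])
    return np.array([majority(votes[:, i]) for i in range(X.shape[0])])
```

The majority vote across the k stumps implements the classification rule on the slide; for regression one would average the k predictions instead.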

Page 26:
Bagging: Bootstrap Aggregation
[image from the slides of David Sontag]

Page 27:
Bagging

Data  1  2  3  4   5  6  7  8  9   10
BS 1  7  1  9  10  7  8  8  4  7   2
BS 2  8  1  3  1   1  9  7  4  10  1
BS 3  5  4  8  8   2  5  5  7  8   8

• Build a classifier from each bootstrap sample
• In each bootstrap sample, each data point has probability (1 − 1/N)^N of not being selected
• Expected number of distinct data points in each sample is then
  N · (1 − (1 − 1/N)^N) ≈ N · (1 − exp(−1)) = 0.632 · N

Page 28:
Bagging
(same bootstrap-sample table as on the previous slide)
• Build a classifier from each bootstrap sample
• In each bootstrap sample, each data point has probability (1 − 1/N)^N of not being selected
• If we have 1 TB of data, each bootstrap sample will contain only ~632 GB of distinct data (this can present computational challenges, e.g., you shouldn't replicate the data)

Page 29:
Decision Tree Bagging
[image from the slides of David Sontag]

Page 30:
Decision Tree Bagging (100 Bagged Trees)
[image from the slides of David Sontag]

Page 31:
Bagging Results
Breiman, "Bagging Predictors," Berkeley Statistics Department TR #421, 1994
(figure: results without bagging vs. with bagging)

Page 32:
Random Forests

Page 33:
Random Forests
• Ensemble method specifically designed for decision tree classifiers
• Introduces two sources of randomness: "bagging" and "random input vectors"
  • Bagging method: each tree is grown using a bootstrap sample of the training data
  • Random vector method: the best split at each node is chosen from a random sample of m attributes instead of all attributes

Page 34:
Random Forest Algorithm
• For b = 1 to B
  • Draw a bootstrap sample of size N from the data
  • Grow a tree T_b using the bootstrap sample as follows
    • Choose m attributes uniformly at random from the data
    • Choose the best attribute among the m to split on
    • Split on the best attribute and recurse (until partitions have fewer than s_min nodes)
• Prediction for a new data point x
  • Regression: (1/B) Σ_b T_b(x)
  • Classification: choose the majority class label among T_1(x), ..., T_B(x)
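The loop above can be sketched end to end. The code below is an illustrative implementation, not the lecture's: it scores splits by misclassification count rather than the information-gain criterion a decision tree would normally use, and names like `grow_tree` and `forest_predict` are hypothetical:

```python
import numpy as np

def majority(y):
    vals, counts = np.unique(y, return_counts=True)
    return vals[np.argmax(counts)]

def grow_tree(X, y, m, s_min, rng):
    # Stop splitting small or pure partitions; predict the majority label
    if len(y) < s_min or len(np.unique(y)) == 1:
        return ('leaf', majority(y))
    # Choose m attributes uniformly at random, then the best split among them.
    # Splits are scored by misclassification count (a simple stand-in for
    # information gain).
    feats = rng.choice(X.shape[1], size=min(m, X.shape[1]), replace=False)
    best = None
    for j in feats:
        for t in np.unique(X[:, j])[:-1]:  # exclude max so both sides are nonempty
            mask = X[:, j] <= t
            err = (np.sum(y[mask] != majority(y[mask]))
                   + np.sum(y[~mask] != majority(y[~mask])))
            if best is None or err < best[0]:
                best = (err, j, t)
    if best is None:                       # all sampled attributes were constant
        return ('leaf', majority(y))
    _, j, t = best
    mask = X[:, j] <= t
    return ('node', j, t,
            grow_tree(X[mask], y[mask], m, s_min, rng),
            grow_tree(X[~mask], y[~mask], m, s_min, rng))

def tree_predict(tree, x):
    while tree[0] == 'node':
        _, j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree[1]

def random_forest(X, y, B=25, m=1, s_min=5, rng=None):
    # For b = 1..B: draw a bootstrap sample of size N, grow a randomized tree
    rng = np.random.default_rng(rng)
    forest = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))
        forest.append(grow_tree(X[idx], y[idx], m, s_min, rng))
    return forest

def forest_predict(forest, x):
    # Classification: majority class label among T_1(x), ..., T_B(x)
    return majority(np.array([tree_predict(t, x) for t in forest]))
```

Here B, m, and s_min correspond directly to the slide's B trees, m sampled attributes per node, and minimum partition size.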

Page 35:
Random Forest Demo
A demo of random forests implemented in JavaScript

Page 36:
When Will Bagging Improve Accuracy?
• Depends on the stability of the base-level classifiers
• A learner is unstable if a small change to the training set causes a large change in the output hypothesis
• If small changes in D cause large changes in the output, then there will likely be an improvement in performance with bagging
• Bagging can help unstable procedures but could hurt the performance of stable procedures
  • Decision trees are unstable
  • k-nearest neighbor is stable