bootstrapped aggregation and random forests

26
Bootstrapped Aggregation and Random Forests

Upload: others

Post on 13-May-2022

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bootstrapped Aggregation and Random Forests

Bootstrapped Aggregationand Random Forests

Page 2: Bootstrapped Aggregation and Random Forests

Bonus: Regression Trees

2

o Suppose you want to PREDICT the salary of a MLB player based on two features: how many hits they average a year and how many years they’ve been in the league

Years

Hits

1

117.5

238

1 4.5 24

R1

R3

R2

|Years < 4.5

Hits < 117.55.11

6.00 6.74

.

• ×(

Ri-

i

'

Mgau( SAHR ) 1=5.11Rz 123

Page 3: Bootstrapped Aggregation and Random Forests

Bonus: Regression Trees

3

o Suppose you want to PREDICT the salary of a MLB player based on two features: how many hits they average a year and how many years they’ve been in the league

o Perform the usual binary splitting by feature

o Instead of classification, predict the response based on average response in leaf node

Page 4: Bootstrapped Aggregation and Random Forests

Bonus: Regression Trees

4

o Instead of classification, predict the response based on average response in leaf node

|

t1

t2

t3

t4

R1

R1

R2

R2

R3

R3

R4

R4

R5

R5

X1

X1X1

X2

X2

X2

X1 ≤ t1

X2 ≤ t2 X1 ≤ t3

X2 ≤ t4

|

t1

t2

t3

t4

R1

R1

R2

R2

R3

R3

R4

R4

R5

R5

X1

X1X1

X2

X2

X2

X1 ≤ t1

X2 ≤ t2 X1 ≤ t3

X2 ≤ t4

|

t1

t2

t3

t4

R1

R1

R2

R2

R3

R3

R4

R4

R5

R5

X1

X1X1

X2

X2

X2

X1 ≤ t1

X2 ≤ t2 X1 ≤ t3

X2 ≤ t4

L gift.

,:|q

Page 5: Bootstrapped Aggregation and Random Forests

Bonus: Regression Trees

5

o What measure do we use to decide which feature to split on?

o The goal is to find boxes that minimize the

o Consider splitting the data on a node into two boxes. Choose feature and split to minimize

where

RSS =JX

j=1

X

i2Rj

�yi � yRj

�2

<latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit>

X

i: xi2R1(j,s)

(yi � yR1)2 +

X

i: xi2R2(j,s)

(yi � yR2)2

<latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit>

R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj � s}<latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit>

,# regions

Ey= =

-

=

= -

Page 6: Bootstrapped Aggregation and Random Forests

Previously on CSCI 4622

6

Last time we saw that full Decision Tree Classifiers tended to overfit

Page 7: Bootstrapped Aggregation and Random Forests

Previously on CSCI 4622

7

Last time we saw that full Decision Tree Classifiers tended to overfit

We could alleviate this a bit with pruning

Prepruning:o Don’t let the tree have too many levels

o Stopping conditions

Postpruning: o Build the complete tree

o Go back and remove vertices that don’t affect performance too much

Page 8: Bootstrapped Aggregation and Random Forests

Ensemble Methods

8

Today we’ll see how we can use many decision trees to do even better

Ensemble methods are very popular in Machine Learning of late

General Idea: o Train numerous slightly different classifiers

o Use each classifier to make a prediction on a query point

o Make the ensembled prediction by taking majority vote from individual classifiers

Page 9: Bootstrapped Aggregation and Random Forests

Bagging

9

This week we’ll look at three different types of Ensembled Decision Tree Classifiers

The simplest is called Bootstrapped Aggregation (or Bagging)

Idea: What would have happened if those trainingexamples that we clearly overfit to weren’t there?

00

Page 10: Bootstrapped Aggregation and Random Forests

Bagging

10

The simplest is called Bootstrapped Aggregation (or Bagging)

Page 11: Bootstrapped Aggregation and Random Forests

Bagging

11

The simplest is called Bootstrapped Aggregation (or Bagging)

o Sample n training examples with replacemento Fit full-depth Decision Tree Classifier

o Repeat many times

o Aggregate results by majority vote

n Training8×5

⇒-

Page 12: Bootstrapped Aggregation and Random Forests

Bagging

12

The simplest is called Bootstrapped Aggregation (or Bagging)

Page 13: Bootstrapped Aggregation and Random Forests

Bagging

13

The simplest is called Bootstrapped Aggregation (or Bagging)

Page 14: Bootstrapped Aggregation and Random Forests

Bagging

14

The simplest is called Bootstrapped Aggregation (or Bagging)

213 's IR

⇒ naV⇒B

• ••

B B R

Page 15: Bootstrapped Aggregation and Random Forests

Bootstrap Refresher

15

.tt#TF )

-

=

p§*YEw0¥÷ ;;¥¥⇒£⇒201L

30kEMPIRICAL

D 1St.

YouR , = 10/20,20 ,

20,20 ,zo

- Rz= 10110,70/30,10 ,2i

Page 16: Bootstrapped Aggregation and Random Forests

Bagging

16

In general, assume that we define and each individual DCT has

Then we can define the Bagged Classifier as

How do we Validate the Bagged model?

o Could use cross-validation or a held-out validation set, as usual

o But there is a better way.

y 2 {�1, 1}<latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit>

hk(x)<latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit>

H(x) = sign

"KX

k=1

hk(x)

#

<latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit>

* ,-1,

1,1a=

=

-2=2Sign (2) =/

Page 17: Bootstrapped Aggregation and Random Forests

Out-of-Bag Error Estimation

17

o Trees are repeatedly fit on bootstrapped re-samples of the training set

o Can be shown that around 2/3 of training examples present in each bootstrapped re-sample

o OOB Error Estimation makes use of the training examples NOT in the re-sample

Here’s How: 560

Xi ⇒ "gi I ,hiucx;)) ⇒ +1 or

- I

has ¥Rin

Page 18: Bootstrapped Aggregation and Random Forests

Bagging Pros and Cons

18

Pros:o Results in a much lower variance classifier than a single Decision Tree

Cons: o Results in a much less interpretable model

We essentially give up some interpretability in favor of better prediction accuracy

But we can still get insight into what our model is doing using Bagging

Page 19: Bootstrapped Aggregation and Random Forests

Bagging and Feature Importance

19

In a single Decision Tree Classifier we can get an idea of feature importance by

o Measuring Information Gain / reduction in Impurity is achieved by splitting on feature

o Compare these values at the end and rank features

With Bagging we can still do this.

We just compute these measures over all trees and average

Thal

Ca

ChestPain

Oldpeak

MaxHR

RestBP

Age

Chol

Slope

Sex

ExAng

RestECG

Fbs

0 20 40 60 80 100

Variable Importance

Page 20: Bootstrapped Aggregation and Random Forests

An Even Better Approach

20

It turns out, we can take the Bagging idea and make it even better

Suppose you have a particular feature that is a very strong predictor for the response

So strong that in almost all Bagged Decision Trees, it’s the first feature that gets split on

When this happens, the trees can all look very similar. They’re too correlated

Maybe always splitting on the same feature isn’t the best idea

Maybe, like with bootstrapping training examples, we split on a random subset of features too

Page 21: Bootstrapped Aggregation and Random Forests

Random Forests

21

o Still doing Bagging, in the sense that we fit bootstrapped resamples of training data

o BUT, every time we have to choose a feature to split on, don’t consider all p features

o Instead, consider only features to split on

Advantages:o Since at each split, considering less than half of features, this gives very different trees

o Very different trees in your ensemble decreases correlation, and gives more robust results

o Also, slightly cheaper because you don’t have to evaluate each feature to choose best split

m ⇡ pp

<latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit>

mm

Page 22: Bootstrapped Aggregation and Random Forests

Practical Results

22

o Look at expression of 500 different genes in tissue samples. Predict 15 cancer states

0 100 200 300 400 500

0.2

0.3

0.4

0.5

Number of Trees

Test

Cla

ssifi

catio

n E

rro

r

m=p

m=p/2m= p

[James,etal]

Page 23: Bootstrapped Aggregation and Random Forests

Where Do We Go From Here?

23

o Unfortunately, Decision Tree Classifiers are prone to overfitting

o On their own, tend to do worse than other classifiers we’ve seen

But WITH THEIR POWERS COMBINED!

Next up, Ensemble Methods.

o Take various weaker classifiers, and combine them to get strong classifiers

o Bagging and Random Forests

o Boosted Decision Trees and the AdaBoost Algorithm

Page 24: Bootstrapped Aggregation and Random Forests

24

Page 25: Bootstrapped Aggregation and Random Forests

25

Page 26: Bootstrapped Aggregation and Random Forests

26