Trees, Bagging, Random Forests and Boosting - MSRI


Page 1: Trees, Bagging, Random Forests and Boosting - MSRI


Trees, Bagging, Random Forests and Boosting

• Classification Trees

• Bagging: Averaging Trees

• Random Forests: Cleverer Averaging of Trees

• Boosting: Cleverest Averaging of Trees

Methods for improving the performance of weak learners such as trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.

Page 2: Trees, Bagging, Random Forests and Boosting - MSRI


Two-class Classification

• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.

• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.

• We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that each of the xi are vectors: xi = (xi1, xi2, . . . , xip).

• Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.

• The technology described here estimates C(X) directly, or via the probability function P(C = k|X).

Page 3: Trees, Bagging, Random Forests and Boosting - MSRI


Classification Trees

• Represented by a series of binary splits.

• Each internal node represents a value query on one of the variables — e.g. “Is X3 > 0.4”. If the answer is “Yes”, go right, else go left.

• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.

• The tree is grown using training data, by recursive splitting.

• The tree is often pruned to an optimal size, evaluated by cross-validation.

• New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.

Page 4: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Classification Tree. A small tree with splits x.2 < 0.39 and x.3 < -1.575; each node is labelled with a count (10/30, 3/21, 2/5, 0/16, 2/9) and its predicted class (0 or 1).]

Page 5: Trees, Bagging, Random Forests and Boosting - MSRI


Properties of Trees

✔ Can handle huge datasets

✔ Can handle mixed predictors—quantitative and qualitative

✔ Easily ignore redundant variables

✔ Handle missing data elegantly

✔ Small trees are easy to interpret

✖ Large trees are hard to interpret

✖ Often prediction performance is poor

Page 6: Trees, Bagging, Random Forests and Boosting - MSRI


Example: Predicting e-mail spam

• data from 4601 email messages

• goal: predict whether an email message is spam (junk email) or good.

• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.

• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.

• we coded spam as 1 and email as 0.

• A system like this would be trained for each user separately (e.g. their word lists would be different).

Page 7: Trees, Bagging, Random Forests and Boosting - MSRI


Predictors

• 48 quantitative predictors—the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.

• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

• The average length of uninterrupted sequences of capital letters: CAPAVE.

• The length of the longest uninterrupted sequence of capital letters: CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

Page 8: Trees, Bagging, Random Forests and Boosting - MSRI


Details

• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.

• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.

• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.

• We then compute the test error and ROC curve on the test data (a code sketch of this recipe follows).
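
A minimal sketch of this recipe in R with the rpart package (listed on the Software slide at the end); the data frame spam_df and its factor response type are assumptions standing in for the 4601-message spam data, so treat this as illustrative rather than the author's code.

  library(rpart)

  set.seed(1)
  test_idx <- sample(nrow(spam_df), 1536)      # random test set of size 1536
  train    <- spam_df[-test_idx, ]
  test     <- spam_df[test_idx, ]

  # Grow a bushy tree: keep splitting down to a minimum bucket size of 5
  fit <- rpart(type ~ ., data = train, method = "class",
               control = rpart.control(minbucket = 5, cp = 0, xval = 10))

  # Cost-complexity pruning; tree size chosen by 10-fold cross-validation
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned  <- prune(fit, cp = best_cp)

  # Test error
  pred <- predict(pruned, test, type = "class")
  mean(pred != test$type)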

Page 9: Trees, Bagging, Random Forests and Boosting - MSRI


Some important features

39% of the training data were spam.

Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

          george    you   your     hp   free    hpl
  spam      0.00   2.26   1.38   0.02   0.52   0.01
  email     1.27   1.27   0.44   0.90   0.07   0.43

               !    our     re    edu  remove
  spam      0.51   0.51   0.13   0.01    0.28
  email     0.11   0.18   0.42   0.29    0.01

Page 10: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Pruned classification tree for the spam data (root node 600/1536). The splits involve ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, george < 0.15, hp < 0.405, CAPAVE < 2.907, and 1999 < 0.58; terminal nodes are labelled spam or email, with counts such as 0/209 and 100/204.]

Page 11: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: "ROC curve for pruned tree on SPAM data" — sensitivity plotted against specificity. TREE error: 8.7%.]

SPAM Data

Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam: Specificity 95% ⇒ Sensitivity 79%.
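
As an illustration, sensitivity and specificity at a given threshold can be computed in R from the class probabilities of the pruned tree sketched earlier; the class labels "spam" and "email" are assumptions about how the response is coded.

  prob_spam <- predict(pruned, test, type = "prob")[, "spam"]
  c0 <- 0.5                                               # classification threshold
  call_spam <- prob_spam > c0
  sensitivity <- mean(call_spam[test$type == "spam"])     # true spam identified
  specificity <- mean(!call_spam[test$type == "email"])   # true email identified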

Page 12: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: "ROC curve for TREE vs SVM on SPAM data". SVM error: 6.7%, TREE error: 8.7%.]

TREE vs SVM

Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.

Page 13: Trees, Bagging, Random Forests and Boosting - MSRI


Toy Classification Problem

[Figure: Training data for the toy problem, plotted in (X1, X2) with class labels 0 and 1. Bayes Error Rate: 0.25.]

• Data X and Y, with Y taking values +1 or −1.

• Here X = (X1, X2).

• The black boundary is the Bayes Decision Boundary - the best one can do.

• Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, 1}.

• Also estimate the probability of the class labels P(Y = +1|X).

Page 14: Trees, Bagging, Random Forests and Boosting - MSRI


Toy Example - No Noise

[Figure: Training data for the noiseless toy problem, plotted in (X1, X2) with class labels 0 and 1. Bayes Error Rate: 0.]

• Deterministic problem; noise comes from the sampling distribution of X.

• Use a training sample of size 200.

• Here Bayes Error is 0%.

Page 15: Trees, Bagging, Random Forests and Boosting - MSRI


Classification Tree

[Figure: Classification tree grown on the 200-point training sample (root node 94/200), with splits on x.1 and x.2 such as x.2 < -1.06711, x.2 < 1.14988, x.1 < 1.13632, x.1 < -0.900735, x.1 < -1.1668, x.1 < -1.07831, and x.2 < -0.823968; each node shows a count and its predicted class.]

Page 16: Trees, Bagging, Random Forests and Boosting - MSRI


Decision Boundary: Tree

[Figure: Decision boundary of the classification tree in (X1, X2), overlaid on the training data. Error Rate: 0.073.]

When the nested spheres are in 10 dimensions, Classification Trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.

Page 17: Trees, Bagging, Random Forests and Boosting - MSRI


Model Averaging

Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.

• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.

• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.

• Random Forests (Breiman 1999): Fancier version of bagging.

In general Boosting > Random Forests > Bagging > Single Tree.

Page 18: Trees, Bagging, Random Forests and Boosting - MSRI


Bagging

Bagging or bootstrap aggregation averages a given procedure over many samples, to reduce its variance — a poor man’s Bayes. See pp 246.

Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.

To bag C, we draw bootstrap samples $S^{*1}, \ldots, S^{*B}$, each of size N, with replacement from the training data. Then

$$C_{\mathrm{bag}}(x) = \mathrm{Majority\ Vote}\ \{ C(S^{*b}, x) \}_{b=1}^{B}.$$

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
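
A bare-bones bagging sketch in R, reusing the hypothetical train/test data frames from the earlier tree example; it is meant to mirror the recipe above, not to reproduce any particular implementation.

  library(rpart)

  B <- 200
  set.seed(1)
  # Fit one large tree to each bootstrap sample S*b of the training data
  boot_trees <- lapply(1:B, function(b) {
    idx <- sample(nrow(train), replace = TRUE)
    rpart(type ~ ., data = train[idx, ], method = "class",
          control = rpart.control(minbucket = 5, cp = 0))
  })

  # C_bag(x): majority vote over the B trees
  votes <- sapply(boot_trees, function(fit)
    as.character(predict(fit, newdata = test, type = "class")))
  bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
  mean(bag_pred != test$type)   # bagged test error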

Page 19: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: The original tree and five trees grown on bootstrap samples of the same data (Bootstrap Tree 1 to Bootstrap Tree 5). The bootstrap trees differ from the original in their splits (for example x.2 < 0.36, x.4 < 0.395, x.2 < 0.255, x.2 < 0.38, x.3 < -1.385, x.3 < -1.61) and in their node counts.]

Page 20: Trees, Bagging, Random Forests and Boosting - MSRI


Decision Boundary: Bagging

[Figure: Decision boundary from bagging in (X1, X2), overlaid on the training data. Error Rate: 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.

Page 21: Trees, Bagging, Random Forests and Boosting - MSRI


Random forests

• refinement of bagged trees; quite popular

• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically $m = \sqrt{p}$ or $\log_2 p$, where p is the number of features.

• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.

• random forests tries to improve on bagging by “de-correlating” the trees. Each tree has the same expectation. (A randomForest sketch follows.)
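
A minimal sketch with the randomForest package (named on the Software slide), again using the hypothetical train/test spam data frames from the tree example; 500 trees matches the later SPAM slide, the rest are package defaults.

  library(randomForest)

  set.seed(1)
  rf <- randomForest(type ~ ., data = train, ntree = 500)
  # mtry defaults to roughly sqrt(p) for classification

  rf$err.rate[rf$ntree, "OOB"]            # out-of-bag error estimate
  mean(predict(rf, test) != test$type)    # test-set error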

Page 22: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: "ROC curve for TREE, SVM and Random Forest on SPAM data". Random Forest error: 5.0%, SVM error: 6.7%, TREE error: 8.7%.]

TREE, SVM and RF

Random Forest dominates both other methods on the SPAM data — 5.0% error. Used 500 trees with default settings for the randomForest package in R.

Page 23: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Schematic of boosting. Classifiers C1(x), C2(x), C3(x), ..., CM(x) are fit in sequence, the first to the training sample and each subsequent one to a reweighted version of the training sample.]

Boosting

• Average many trees, each grown to re-weighted versions of the training data.

• Final Classifier is a weighted average of classifiers:
$$C(x) = \mathrm{sign}\Big[\sum_{m=1}^{M} \alpha_m C_m(x)\Big]$$

Page 24: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Test error vs. number of terms (0 to 400) for Bagging and AdaBoost, using 100-node trees.]

Boosting vs Bagging

• 2000 points from Nested Spheres in $\mathbb{R}^{10}$.

• Bayes error rate is 0%.

• Trees are grown best-first without pruning.

• Leftmost term is a single tree.

Page 25: Trees, Bagging, Random Forests and Boosting - MSRI


AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. For m = 1 to M repeat steps (a)–(d):

(a) Fit a classifier $C_m(x)$ to the training data using weights $w_i$.

(b) Compute the weighted error of the newest tree
$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i\, I(y_i \neq C_m(x_i))}{\sum_{i=1}^{N} w_i}.$$

(c) Compute $\alpha_m = \log[(1 - \mathrm{err}_m)/\mathrm{err}_m]$.

(d) Update the weights for $i = 1, \ldots, N$:
$$w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq C_m(x_i))]$$
and renormalize the $w_i$ to sum to 1.

3. Output $C(x) = \mathrm{sign}\big[\sum_{m=1}^{M} \alpha_m C_m(x)\big]$. (An R sketch of this algorithm follows.)
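
A compact R translation of steps 1–3, using stumps (single-split rpart trees) as the classifiers; it is a sketch under the assumption that X is a data frame of predictors and y is coded +1/−1, with no guard for the degenerate case err_m = 0.

  library(rpart)

  adaboost_stumps <- function(X, y, M = 400) {
    N <- nrow(X)
    w <- rep(1 / N, N)                        # 1. initialize weights
    dat <- data.frame(X, .y = factor(y))
    fits <- vector("list", M)
    alpha <- numeric(M)
    for (m in 1:M) {
      # (a) fit a stump to the training data using weights w
      fits[[m]] <- rpart(.y ~ ., data = dat, weights = w, method = "class",
                         control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
      pred <- as.numeric(as.character(predict(fits[[m]], dat, type = "class")))
      err  <- sum(w * (pred != y)) / sum(w)   # (b) weighted error
      alpha[m] <- log((1 - err) / err)        # (c)
      w <- w * exp(alpha[m] * (pred != y))    # (d) reweight ...
      w <- w / sum(w)                         #     ... and renormalize
    }
    list(fits = fits, alpha = alpha)
  }

  # 3. weighted-majority-vote classifier C(x) = sign(sum_m alpha_m C_m(x))
  adaboost_predict <- function(model, newX) {
    scores <- sapply(seq_along(model$fits), function(m)
      model$alpha[m] * as.numeric(as.character(
        predict(model$fits[[m]], newX, type = "class"))))
    sign(rowSums(scores))
  }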

Page 26: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Test error vs. boosting iterations (0 to 400) for boosted stumps; the errors of a single stump and of a single 400-node tree are marked for comparison.]

Boosting Stumps

A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.

Page 27: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Training and test error vs. number of terms (0 to 600) for boosting on the nested-spheres problem; the training error curve is labelled.]

• Nested spheres in 10 dimensions.

• Bayes error is 0%.

• Boosting drives the training error to zero.

• Further iterations continue to improve test error in many examples.

Page 28: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Training and test error vs. number of terms (0 to 600) for boosting with stumps on a noisy problem; the Bayes error is marked.]

Noisy Problems

• Nested Gaussians in 10 dimensions.

• Bayes error is 25%.

• Boosting with stumps.

• Here the test error does increase, but quite slowly.

Page 29: Trees, Bagging, Random Forests and Boosting - MSRI


Stagewise Additive Modeling

Boosting builds an additive model
$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m).$$
Here $b(x; \gamma_m)$ is a tree, and $\gamma_m$ parametrizes the splits.

We do things like that in statistics all the time!

• GAMs: $f(x) = \sum_j f_j(x_j)$

• Basis expansions: $f(x) = \sum_{m=1}^{M} \theta_m h_m(x)$

Traditionally the parameters $f_m$, $\theta_m$ are fit jointly (i.e. least squares, maximum likelihood).

With boosting, the parameters $(\beta_m, \gamma_m)$ are fit in a stagewise fashion. This slows the process down, and overfits less quickly.

Page 30: Trees, Bagging, Random Forests and Boosting - MSRI


Additive Trees

• Simple example: stagewise least-squares?

• Fix the past M − 1 functions, and update the Mth using a tree:
$$\min_{f_M \in \mathrm{Tree}(x)} \; E\Big(Y - \sum_{m=1}^{M-1} f_m(x) - f_M(x)\Big)^2$$

• If we define the current residuals to be
$$R = Y - \sum_{m=1}^{M-1} f_m(x)$$
then at each stage we fit a tree to the residuals:
$$\min_{f_M \in \mathrm{Tree}(x)} \; E\big(R - f_M(x)\big)^2$$

Page 31: Trees, Bagging, Random Forests and Boosting - MSRI


Stagewise Least Squares

Suppose we have available a basis family b(x; γ) parametrized by γ.

• After m − 1 steps, suppose we have the model $f_{m-1}(x) = \sum_{j=1}^{m-1} \beta_j b(x; \gamma_j)$.

• At the mth step we solve
$$\min_{\beta, \gamma} \sum_{i=1}^{N} \big(y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)\big)^2$$

• Denoting the residuals at the mth stage by $r_{im} = y_i - f_{m-1}(x_i)$, the previous step amounts to
$$\min_{\beta, \gamma} \sum_{i=1}^{N} \big(r_{im} - \beta b(x_i; \gamma)\big)^2.$$

• Thus the term $\beta_m b(x; \gamma_m)$ that best fits the current residuals is added to the expansion at each step. (A small code sketch follows.)
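
A toy R sketch of this stagewise least-squares loop, with small rpart regression trees playing the role of the basis $b(x; \gamma)$ and an optional shrinkage factor ν (anticipating the General Stagewise Algorithm slide); X, y, and the tuning values are illustrative assumptions.

  library(rpart)

  # X: data frame of predictors; y: numeric response.
  stagewise_trees <- function(X, y, M = 200, nu = 0.1, depth = 2) {
    f <- rep(mean(y), length(y))              # start from a constant fit
    trees <- vector("list", M)
    for (m in 1:M) {
      r <- y - f                              # current residuals r_im
      trees[[m]] <- rpart(r ~ ., data = data.frame(X, r = r), method = "anova",
                          control = rpart.control(maxdepth = depth, cp = 0))
      f <- f + nu * predict(trees[[m]], X)    # add the best-fitting term (shrunk by nu)
    }
    list(f0 = mean(y), trees = trees, nu = nu)
  }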

Page 32: Trees, Bagging, Random Forests and Boosting - MSRI


AdaBoost: Stagewise Modeling

• AdaBoost builds an additive logistic regression model
$$f(x) = \log \frac{\Pr(Y = 1|x)}{\Pr(Y = -1|x)} = \sum_{m=1}^{M} \alpha_m G_m(x)$$
by stagewise fitting using the loss function
$$L(y, f(x)) = \exp(-y\, f(x)).$$

• Given the current $f_{m-1}(x)$, our solution for $(\beta_m, G_m)$ is
$$\arg\min_{\beta, G} \sum_{i=1}^{N} \exp\big[-y_i\,(f_{m-1}(x_i) + \beta\, G(x_i))\big]$$
where $G_m(x) \in \{-1, 1\}$ is a tree classifier and $\beta_m$ is a coefficient.

Page 33: Trees, Bagging, Random Forests and Boosting - MSRI


• With $w_i^{(m)} = \exp(-y_i\, f_{m-1}(x_i))$, this can be re-expressed as
$$\arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp(-\beta\, y_i\, G(x_i)).$$

• We can show that this leads to the AdaBoost algorithm (a sketch of the argument follows); see pp 305.
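
Sketch of that argument (a paraphrase of the standard derivation, not text from the slides): splitting the sum according to whether G classifies correctly gives
$$\sum_{i=1}^{N} w_i^{(m)} e^{-\beta y_i G(x_i)} = (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)},$$
so for any β > 0 the minimizing $G_m$ is the classifier with smallest weighted error $\mathrm{err}_m$. Minimizing over β then gives $\beta_m = \tfrac{1}{2} \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$, and the implied weight update is
$$w_i^{(m+1)} = w_i^{(m)} e^{-\beta_m y_i G_m(x_i)} \;\propto\; w_i^{(m)} e^{\alpha_m I(y_i \neq G_m(x_i))}, \qquad \alpha_m = 2\beta_m,$$
which is exactly step (d) of the AdaBoost algorithm (the constant factor is absorbed by the renormalization).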

Page 34: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Loss as a function of the margin y · f for misclassification, exponential, binomial deviance, squared error, and support vector losses.]

Why Exponential Loss?

• $e^{-y F(x)}$ is a monotone, smooth upper bound on misclassification loss at x.

• Leads to a simple reweighting scheme.

• Has the logit transform as population minimizer:
$$f^*(x) = \frac{1}{2} \log \frac{\Pr(Y = 1|x)}{\Pr(Y = -1|x)}$$

• Other more robust loss functions, like binomial deviance.

Page 35: Trees, Bagging, Random Forests and Boosting - MSRI


General Stagewise Algorithm

We can do the same for more general loss functions, not only least squares.

1. Initialize $f_0(x) = 0$.

2. For m = 1 to M:

(a) Compute
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big).$$

(b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

Sometimes we replace step (b) in item 2 by

(b*) Set $f_m(x) = f_{m-1}(x) + \nu\, \beta_m b(x; \gamma_m)$.

Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.

Page 36: Trees, Bagging, Random Forests and Boosting - MSRI


Gradient Boosting

• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.

• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.

• Tree size is a parameter that determines the order of interaction (next slide).

• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance (a gbm sketch follows).

• Gradient Boosting is described in detail in , section 10.10.
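
A minimal gbm sketch (the package named on the Software slide) for a binary spam-type problem; the hypothetical train/test data frames with a 0/1 response spam01, and the tuning values (number of trees, depth, shrinkage), are illustrative assumptions, not the settings behind the results quoted later.

  library(gbm)

  set.seed(1)
  fit <- gbm(spam01 ~ ., data = train, distribution = "bernoulli",
             n.trees = 1000, interaction.depth = 5,   # tree size controls the interaction order
             shrinkage = 0.05, cv.folds = 5)

  best <- gbm.perf(fit, method = "cv")                # number of trees chosen by CV
  phat <- predict(fit, test, n.trees = best, type = "response")
  mean((phat > 0.5) != test$spam01)                   # test error

  summary(fit)                 # relative variable importance (cf. the later slide)
  plot(fit, i.var = "remove")  # partial dependence on one predictor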

Page 37: Trees, Bagging, Random Forests and Boosting - MSRI


[Figure: Test error vs. number of terms (0 to 400) for boosting with stumps, 10-node trees, 100-node trees, and AdaBoost.]

Tree Size

The tree size J determines the interaction order of the model:
$$\eta(X) = \sum_j \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots$$

Page 38: Trees, Bagging, Random Forests and Boosting - MSRI


Stumps win!

Since the true decision boundary is the surface of a sphere, the function that describes it has the form
$$f(X) = X_1^2 + X_2^2 + \ldots + X_p^2 - c = 0.$$

Boosted stumps via Gradient Boosting returns reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees: the estimated coordinate functions f1(x1), ..., f10(x10) from boosted stumps.]

Page 39: Trees, Bagging, Random Forests and Boosting - MSRI


Spam Example Results

With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model
$$f(x) = \log \frac{\Pr(\mathrm{spam}|x)}{\Pr(\mathrm{email}|x)}$$
using trees with J = 6 terminal nodes.

Gradient Boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for Random Forests, and 8.7% for CART.

Page 40: Trees, Bagging, Random Forests and Boosting - MSRI


Spam: Variable Importance

[Figure: Relative importance (on a 0–100 scale) of the spam predictors, in decreasing order. The most important are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, our, money, will, 1999, business, re, (, receive, and internet; the least important include telnet, labs, addresses, and 3d.]

Page 41: Trees, Bagging, Random Forests and Boosting - MSRI


Spam: Partial Dependence

[Figure: Partial dependence of the fit on four predictors: ! and remove (partial dependence ranging from about -0.2 to 1.0), and edu and hp (from about -1.0 to 0.2).]

Page 42: Trees, Bagging, Random Forests and Boosting - MSRI


Comparison of Learning Methods

Some characteristics of different learning methods.

[Table: Characteristics of Neural Nets, SVM, CART, GAM, KNN/Kernel, and Gradient Boosting, each rated good, fair, or poor on: natural handling of data of “mixed” type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; and predictive power.]

Page 43: Trees, Bagging, Random Forests and Boosting - MSRI


Software

• R: free GPL statistical computing environment available from CRAN, implements the S language. Includes:

– randomForest: implementation of Leo Breiman's algorithms.

– rpart: Terry Therneau's implementation of classification and regression trees.

– gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.

• Salford Systems: Commercial implementation of trees, random forests and gradient boosting.

• Splus (Insightful): Commercial version of S.

• Weka: GPL software from University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures.