Trees, Bagging, Random Forests and Boosting - MSRI
TRANSCRIPT
Trees, Bagging, Random Forests and Boosting
• Classification Trees
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
Methods for improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.
Two-class Classification
• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.
• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.
• We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that each of the xi are vectors: xi = (xi1, xi2, . . . , xip).
• Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.
• The technology described here estimates C(X) directly, or via the probability function P(C = k|X).
Classification Trees
• Represented by a series of binary splits.
• Each internal node represents a value query on one of the variables — e.g. “Is X3 > 0.4?”. If the answer is “Yes”, go right, else go left.
• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.
• The tree is grown using training data, by recursive splitting.
• The tree is often pruned to an optimal size, evaluated by cross-validation.
• New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
[Figure: Classification Tree. A small tree for a toy example, splitting first on x.2 < 0.39 and then on x.3 < −1.575; each node shows its misclassification count (10/30 at the root) and its majority class.]
Properties of Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors—quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor
Example: Predicting e-mail spam
• data from 4601 email messages
• goal: predict whether an email message is spam (junk email) or good.
• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.
• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.
• we coded spam as 1 and email as 0.
• A system like this would be trained for each user separately (e.g. their word lists would be different).
Predictors
• 48 quantitative predictors—the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
• The average length of uninterrupted sequences of capital letters: CAPAVE.
• The length of the longest uninterrupted sequence of capital letters: CAPMAX.
• The sum of the length of uninterrupted sequences of capital letters: CAPTOT.
Details
• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.
• We then compute the test error and ROC curve on the test data.
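A minimal R sketch of this recipe with the rpart package (listed later under Software), assuming the spam data are available as the spam data frame from the kernlab package; the seed and object names are illustrative.

library(rpart)
library(kernlab)
data(spam)                                    # 4601 messages, 57 predictors, response `type`

set.seed(1)
test  <- sample(nrow(spam), 1536)             # random test set of size 1536
train <- setdiff(seq_len(nrow(spam)), test)   # 3065 training observations

## Grow a bushy tree: keep splitting until the minimum bucket size of 5 is reached
fit <- rpart(type ~ ., data = spam[train, ], method = "class",
             control = rpart.control(minbucket = 5, cp = 0, xval = 10))

## Cost-complexity pruning, with the tree size chosen by 10-fold cross-validation
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)

## Test error of the pruned tree
pred <- predict(pruned, spam[test, ], type = "class")
mean(pred != spam$type[test])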
Some important features
39% of the training data were spam.
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

        george   you  your    hp  free   hpl     !   our    re   edu  remove
spam      0.00  2.26  1.38  0.02  0.52  0.01  0.51  0.51  0.13  0.01    0.28
email     1.27  1.27  0.44  0.90  0.07  0.43  0.11  0.18  0.42  0.29    0.01
[Figure: pruned classification tree for the spam data. The first splits are on ch$, remove and ch!; deeper splits involve george, hp, CAPMAX, receive, edu, our, CAPAVE, free, business and 1999. Each node shows its misclassification count, and terminal nodes are labeled with their majority class.]
[Figure: ROC curve (sensitivity vs. specificity) for the pruned tree on the SPAM data. TREE error: 8.7%.]
Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam: Specificity 95% ⇒ Sensitivity 79%.
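A small sketch of tracing such an ROC curve by sweeping the threshold c0 over predicted spam probabilities, continuing the hypothetical pruned, spam and test objects from the tree sketch above.

prob  <- predict(pruned, spam[test, ], type = "prob")[, "spam"]
truth <- spam$type[test] == "spam"

roc <- t(sapply(seq(0, 1, by = 0.01), function(c0) {
  call.spam <- prob > c0                       # classify as spam if P(spam|x) > c0
  c(sensitivity = mean(call.spam[truth]),      # proportion of true spam identified
    specificity = mean(!call.spam[!truth]))    # proportion of true email identified
}))
plot(roc[, "specificity"], roc[, "sensitivity"], type = "l",
     xlab = "Specificity", ylab = "Sensitivity")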
[Figure: ROC curves for TREE vs SVM on the SPAM data. SVM error: 6.7%; TREE error: 8.7%.]
Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
Toy Classification Problem
[Figure: scatterplot of the two-class training data (classes 0 and 1) in the (X1, X2) plane, with the Bayes decision boundary drawn in black. Bayes Error Rate: 0.25.]
• Data X and Y, with Y taking values +1 or −1.
• Here X = (X1, X2).
• The black boundary is the Bayes Decision Boundary - the best one can do.
• Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels P(Y = +1|X).
Toy Example - No Noise
[Figure: scatterplot of the two-class training data (classes 0 and 1) in the (X1, X2) plane. Bayes Error Rate: 0.]
• Deterministic problem; noise comes from sampling distribution of X.
• Use a training sample of size 200.
• Here Bayes Error is 0%.
Classification Tree
[Figure: classification tree grown on the 200-point toy training sample, with successive splits on x.2 and x.1 (e.g. x.2 < −1.06711, x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735); each node shows its misclassification count and majority class, with 94/200 at the root.]
Decision Boundary: Tree
[Figure: decision boundary of the classification tree on the two-dimensional toy example, overlaid on the training data. Error Rate: 0.073.]
When the nested spheres are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.
Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
Bagging
Bagging or bootstrap aggregation averages a given procedure over many samples, to reduce its variance — a poor man’s Bayes. See p. 246.
Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N, with replacement from the training data. Then

Cbag(x) = Majority Vote {C(S*b, x)}, b = 1, . . . , B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However any simple structure in C (e.g. a tree) is lost.
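A minimal sketch of this bagging recipe, assuming a training data frame train whose factor response is named y and a test data frame test with the same predictors; these names are illustrative, not from the slides.

library(rpart)

bag_trees <- function(train, test, B = 100) {
  votes <- matrix(NA_character_, nrow(test), B)
  for (b in seq_len(B)) {
    idx  <- sample(nrow(train), replace = TRUE)                     # bootstrap sample S*b
    tree <- rpart(y ~ ., data = train[idx, ], method = "class",
                  control = rpart.control(cp = 0, minbucket = 5))   # grow a large tree
    votes[, b] <- as.character(predict(tree, test, type = "class"))
  }
  ## Cbag(x): majority vote over the B bootstrap trees
  apply(votes, 1, function(v) names(which.max(table(v))))
}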
[Figure: the Original Tree and five trees grown on bootstrap samples of the same training data (Bootstrap Trees 1-5). The bootstrap trees split on different variables and cutpoints (x.1, x.2, x.3, x.4), illustrating the instability that bagging averages over.]
Decision Boundary: Bagging
[Figure: decision boundary from bagging on the two-dimensional toy example, overlaid on the training data. Error Rate: 0.032.]
Bagging averages many trees, and produces smoother decision boundaries.
Random forests
• refinement of bagged trees; quite popular
• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2(p), where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.
• random forests tries to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
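A short sketch with the randomForest package (listed later under Software), assuming the spam data frame and the train/test indices from the earlier tree sketch; the next slide quotes 5.0% test error using 500 trees with default settings.

library(randomForest)

rf <- randomForest(type ~ ., data = spam[train, ], ntree = 500)  # default mtry is about sqrt(p)
rf$err.rate[rf$ntree, "OOB"]                  # out-of-bag error estimate
pred <- predict(rf, spam[test, ])
mean(pred != spam$type[test])                 # test error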
[Figure: ROC curves for TREE, SVM and Random Forest on the SPAM data. Random Forest error: 5.0%; SVM error: 6.7%; TREE error: 8.7%.]
Random Forest dominates both other methods on the SPAM data — 5.0% error. We used 500 trees with default settings for the randomForest package in R.
[Figure: schematic of boosting. Classifiers C1(x), C2(x), C3(x), . . . , CM(x) are fit in sequence, the first to the training sample and each subsequent one to a reweighted version of the sample.]
Boosting
• Average many trees, each grown to re-weighted versions of the training data.
• Final Classifier is weighted average of classifiers:

C(x) = sign[ ∑_{m=1}^M αm Cm(x) ]
[Figure: test error vs. number of terms (0 to 400) for Bagging and AdaBoost using 100-node trees.]
Boosting vs Bagging
• 2000 points from Nested Spheres in R^10.
• Bayes error rate is 0%.
• Trees are grown best first without pruning.
• Leftmost term is a single tree.
AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N .
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm(x) to the training data using weights wi.
(b) Compute the weighted error of the newest tree:

errm = ∑_{i=1}^N wi I(yi ≠ Cm(xi)) / ∑_{i=1}^N wi.

(c) Compute αm = log[(1 − errm)/errm].
(d) Update the weights for i = 1, . . . , N:

wi ← wi · exp[αm · I(yi ≠ Cm(xi))],

and renormalize so that the wi sum to 1.
3. Output C(x) = sign[∑_{m=1}^M αm Cm(x)].
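A direct, minimal transcription of this algorithm into R, using single-split rpart trees (stumps) as the classifiers Cm(x). The function and object names are illustrative; y is assumed to be coded as −1/+1.

library(rpart)

adaboost <- function(x, y, M = 400) {
  N <- length(y)
  d <- data.frame(x, .y = factor(y))            # response coded -1/+1
  w <- rep(1 / N, N)                            # step 1: equal starting weights
  trees <- vector("list", M)
  alpha <- numeric(M)
  for (m in seq_len(M)) {
    ## (a) fit a classifier to the training data using weights w
    trees[[m]] <- rpart(.y ~ ., data = d, weights = w, method = "class",
                        control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(trees[[m]], d, type = "class") == "1", 1, -1)
    ## (b) weighted error of the newest tree
    err <- sum(w * (y != pred)) / sum(w)
    ## (c) classifier weight
    alpha[m] <- log((1 - err) / err)
    ## (d) up-weight the misclassified observations and renormalize
    w <- w * exp(alpha[m] * (y != pred))
    w <- w / sum(w)
  }
  list(trees = trees, alpha = alpha)
}

## Step 3: C(x) is the sign of the weighted vote over the M trees
predict_adaboost <- function(fit, newdata) {
  score <- Reduce(`+`, Map(function(tr, a)
    a * ifelse(predict(tr, newdata, type = "class") == "1", 1, -1),
    fit$trees, fit$alpha))
  sign(score)
}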
[Figure: test error vs. boosting iterations for boosted stumps on the nested-spheres problem, with reference levels for a Single Stump and a 400 Node Tree.]
Boosting Stumps
A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.
[Figure: training and test error as a function of the number of terms (0 to 600) for boosting on the nested-spheres example; the training error curve is marked.]
• Nested spheres in 10-Dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero.
• Further iterations continue to improve test error in many examples.
[Figure: training and test error as a function of the number of terms (0 to 600) for boosting stumps on a noisy problem, with the Bayes error marked.]
Noisy Problems
• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.
Stagewise Additive Modeling
Boosting builds an additive model

f(x) = ∑_{m=1}^M βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
• GAMs: f(x) = ∑_j fj(xj)
• Basis expansions: f(x) = ∑_{m=1}^M θm hm(x)
Traditionally the parameters fm, θm are fit jointly (i.e. least squares, maximum likelihood).
With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.
Additive Trees
• Simple example: stagewise least-squares?
• Fix the past M − 1 functions, and update the Mth using a tree:

min_{fM ∈ Tree(x)} E( Y − ∑_{m=1}^{M−1} fm(x) − fM(x) )^2

• If we define the current residuals to be

R = Y − ∑_{m=1}^{M−1} fm(x)

then at each stage we fit a tree to the residuals:

min_{fM ∈ Tree(x)} E( R − fM(x) )^2
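A small sketch of this stagewise least-squares idea in R: each step fits a small regression tree to the current residuals and adds it into the model. The data frame train with numeric response y is an assumed example, not from the slides.

library(rpart)

stagewise_trees <- function(train, M = 100) {
  r <- train$y                          # residuals of the initial model f0 = 0
  fits <- vector("list", M)
  for (m in seq_len(M)) {
    d <- train
    d$y <- r                            # current residuals R = Y minus the fitted trees so far
    fits[[m]] <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = 2))
    r <- r - predict(fits[[m]], train)  # subtract the newest tree's fit
  }
  fits
}

## The fitted model is the sum of the M trees
predict_stagewise <- function(fits, newdata)
  Reduce(`+`, lapply(fits, predict, newdata = newdata))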
Stagewise Least Squares
Suppose we have available a basis family b(x; γ) parametrized by γ.
• After m − 1 steps, suppose we have the model fm−1(x) = ∑_{j=1}^{m−1} βj b(x; γj).
• At the mth step we solve

min_{β,γ} ∑_{i=1}^N ( yi − fm−1(xi) − β b(xi; γ) )^2

• Denoting the residuals at the mth stage by rim = yi − fm−1(xi), the previous step amounts to

min_{β,γ} ∑_{i=1}^N ( rim − β b(xi; γ) )^2.

• Thus the term βm b(x; γm) that best fits the current residuals is added to the expansion at each step.
Adaboost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model

f(x) = log [ Pr(Y = 1|x) / Pr(Y = −1|x) ] = ∑_{m=1}^M αm Gm(x)

by stagewise fitting using the loss function

L(y, f(x)) = exp(−y f(x)).

• Given the current fm−1(x), our solution for (βm, Gm) is

arg min_{β,G} ∑_{i=1}^N exp[ −yi ( fm−1(xi) + β G(xi) ) ]

where Gm(x) ∈ {−1, 1} is a tree classifier and βm is a coefficient.
• With weights w_i^(m) = exp(−yi fm−1(xi)), this can be re-expressed as

arg min_{β,G} ∑_{i=1}^N w_i^(m) exp(−β yi G(xi))

• We can show that this leads to the Adaboost algorithm; see p. 305.
[Figure: loss functions plotted against the margin y · f: misclassification, exponential, binomial deviance, squared error, and support vector loss.]
Why Exponential Loss?
• e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• Has the logit transform as population minimizer:

f*(x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other more robust loss functions, like binomial deviance.
General Stagewise Algorithm
We can do the same for more general loss functions, not only least squares.
1. Initialize f0(x) = 0.
2. For m = 1 to M :
(a) Compute (βm, γm) = arg min_{β,γ} ∑_{i=1}^N L( yi, fm−1(xi) + β b(xi; γ) ).
(b) Set fm(x) = fm−1(x) + βmb(x; γm).
Sometimes we replace step (b) in item 2 by
(b*) Set fm(x) = fm−1(x) + ν βm b(x; γm).
Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.
Gradient Boosting
• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in the text, section 10.10.
[Figure: test error vs. number of terms (0 to 400) for boosted stumps, 10-node trees, 100-node trees, and Adaboost.]
Tree Size
The tree size J determines the interaction order of the model:

η(X) = ∑_j ηj(Xj) + ∑_{jk} ηjk(Xj, Xk) + ∑_{jkl} ηjkl(Xj, Xk, Xl) + · · ·
Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X1^2 + X2^2 + . . . + Xp^2 − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.
[Figure: Coordinate Functions for Additive Logistic Trees: the fitted coordinate functions f1(x1), f2(x2), . . . , f10(x10).]
Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model

f(x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for Random Forests, and 8.7% for CART.
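A hedged sketch of a comparable fit using the gbm package listed under Software, assuming the spam data frame (kernlab) and the train/test indices from the earlier tree sketch. In gbm, interaction.depth = 5 gives trees with roughly J = 6 terminal nodes; the shrinkage and number of trees below are illustrative choices, not the settings behind the quoted results.

library(gbm)

spam01 <- transform(spam, type = as.numeric(type == "spam"))  # code spam as 1, email as 0
fit <- gbm(type ~ ., data = spam01[train, ],
           distribution = "bernoulli",        # loss for the additive logit model
           n.trees = 1000, interaction.depth = 5,
           shrinkage = 0.05, cv.folds = 5)

best <- gbm.perf(fit, method = "cv")          # number of trees chosen by cross-validation
prob <- predict(fit, spam01[test, ], n.trees = best, type = "response")
mean((prob > 0.5) != spam01$type[test])       # test misclassification error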
Spam: Variable Importance
[Figure: relative importance (0 to 100) of the 57 predictors in the boosted model for the spam data. The most important variables include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, and our.]
Spam: Partial Dependence
[Figure: partial dependence of the fitted log-odds of spam on four important predictors: !, remove, edu, and hp.]
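Both the importance and partial-dependence displays can be reproduced from a gbm fit such as the hypothetical one sketched earlier (fit, best): summary() reports relative influence, and plot() computes the partial dependence for a chosen predictor. The column name charExclamation for the "!" frequency is how the kernlab spam data labels that predictor, and is an assumption here.

library(gbm)

## Relative influence of each predictor (cf. the importance figure above)
imp <- summary(fit, n.trees = best, plotit = FALSE)
head(imp, 10)

## Partial dependence of the fitted log-odds on single predictors
plot(fit, i.var = "charExclamation", n.trees = best, type = "link")
plot(fit, i.var = "remove",          n.trees = best, type = "link")
plot(fit, i.var = "edu",             n.trees = best, type = "link")
plot(fit, i.var = "hp",              n.trees = best, type = "link")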
Comparison of Learning Methods
Some characteristics of different learning methods.
[Table: each method is rated good, fair, or poor on each characteristic. Methods compared: Neural Nets, SVM, CART, GAM, KNN/Kernel, and Gradient Boosting. Characteristics: natural handling of data of “mixed” type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; predictive power.]
Software
• R: free GPL statistical computing environment available from CRAN, implements the S language. Includes:
– randomForest: implementation of Leo Breiman's algorithms.
– rpart: Terry Therneau's implementation of classification and regression trees.
– gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures.