© 1999 Elder & Ridgeway -- KDD99 Tutorial T5: Combining Estimators

Combining Estimators to Improve Performance
A survey of “model bundling” techniques -- from boosting and bagging to Bayesian model averaging -- creating a breakthrough in the practice of Data Mining.
John F. Elder IV, Ph.D., Elder Research, Charlottesville, Virginia
www.datamininglab.com
Greg Ridgeway, Ph.D., University of Washington, Dept. of Statistics
www.stat.washington.edu/greg
Outline
• Why combine? A motivating example
• Hidden dangers of model selection
• Reducing modeling uncertainty through Bayesian Model Averaging
• Stabilizing predictors through bagging
• Improving performance through boosting
• Emerging theory illuminates empirical success
• Bundling, in general
• Latest algorithms
• Closing Examples & Summary
Reasons to combine estimators
• Decreases variability in the predictions.
• Accounts for uncertainty in the model class.
⇒ Improved accuracy on new data.
A Motivating Example: Classifying a bat’s species from its chirp
• Goal: Use time-frequency features of echolocation signals to classify bats by species in the field (avoiding capture and physical inspection).
• U. Illinois biologists gathered data: 98 signals from 19 bats representing 6 species: Southeastern, Grey, Little Brown, Indiana, Pipistrelle, Big-Eared.
• ~35 data features (dimensions) calculated from signals, such as low frequency at the 3 dB level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.
• Turned out to have a nice level of difficulty for comparing methods: overlap in classes, but some separability.
Sample Projection
[Scatter plot projecting the bat data onto two signal features (axes labeled bw and t10/Var4).]
[Bar chart: Percent Classified Correctly (roughly 40%-88%) on the bat data for Trees (TX2step, CART), Polynomial Networks (ASPN), Neural Networks (35-, 17-, and 8-node), Nearest Neighbor variants, and bundled Advisor Perceptrons; the legend notes training times ranging from seconds to about a day.]
What is model uncertainty?
• Suppose we wish to predict y from predictors x.
• Given a dataset of observations, D, for a new observation with predictors x* we want to derive the predictive distribution of y* given x* and D:

P(y* | x*, D)
In practice…
• Although we want to use all the information in D to make the best estimate of y* for an individual with covariates x*…

P(y* | x*, D)

• In practice, however, we always use

P(y* | x*, M)

where M is a model constructed from D.
Selecting M
• The process of selecting a model usually involves:
  – Model class selection
    • linear regression, tree regression, neural network
  – Variable selection
    • variable exclusion, transformation, smoothing
  – Parameter estimation
• We tend to choose the one model that fits the data or performs best as the model.
What’s wrong with that?
• Two models may fit a dataset equally well (with respect to some loss) but have different predictions.
• Competing interpretable models with equivalent performance offer ambiguous conclusions.
• Model search dilutes the evidence. “Part of the evidence is spent specifying the model.”
Bayesian Model Averaging
Goal: Account for model uncertainty
Method: Use Bayes’ Theorem and average the models by their posterior probabilities
Properties:
• Improves predictive performance
• Theoretically elegant
• Computationally costly
Averaging the models
Consider a set containing the K candidate models — M1, …, MK. With a few probability manipulations we can make predictions using all of them:

P(y* | x*, D) = Σ_k P(y* | x*, Mk) P(Mk | D)

The probability mass for a particular prediction value of y is a weighted average of the probability mass that each model places on that value of y. The weight is based on the posterior probability of that model given the data.
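As a small illustration of this weighted average (our own sketch, not from the tutorial), suppose we already have each candidate model's predictive probabilities for a new case and the posterior model probabilities P(Mk | D); the numbers below are hypothetical, for three models and a binary y:

import numpy as np

# Hypothetical posterior model probabilities P(M_k | D) for K = 3 candidate models.
posterior = np.array([0.6, 0.3, 0.1])

# pred[k, c] = P(y* = c | x*, M_k) for a binary y (columns c = 0, 1) under model k.
pred = np.array([[0.2, 0.8],
                 [0.4, 0.6],
                 [0.7, 0.3]])

# P(y* | x*, D) = sum_k P(y* | x*, M_k) P(M_k | D)
bma_pred = posterior @ pred
print(bma_pred)   # [0.31, 0.69]: one averaged predictive distribution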
Bayes’ Theorem
• Mk - model
• D - data
• P(D|Mk) - integrated likelihood of Mk
• P(Mk) - prior model probability
P(Mk | D) = P(D | Mk) P(Mk) / Σ_{l=1}^{K} P(D | Ml) P(Ml)
Challenges
• The size of the model set may cause exhaustive summation to be impossible.
• The integrated likelihood of each model is usually complex.
• Specifying a prior distribution (even a non-informative one) across the space of models is non-trivial.
• Proposed solutions to these challenges often involve MCMC, the BIC approximation, the MLE approximation, Occam’s window, and Occam’s razor.
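For the BIC approximation mentioned above, the posterior weights are often taken as P(Mk | D) proportional to exp(-BICk / 2), assuming equal prior model probabilities. A minimal sketch of that computation (our own illustration, not code from the tutorial):

import numpy as np

def bma_weights_from_bic(bics):
    """Approximate posterior model probabilities from BIC values, assuming equal priors.

    Uses P(M_k | D) proportional to exp(-BIC_k / 2), normalized over the candidate set.
    """
    bics = np.asarray(bics, dtype=float)
    log_w = -0.5 * (bics - bics.min())    # subtract the minimum BIC for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

print(bma_weights_from_bic([100.2, 101.0, 105.7]))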
Performance
• Survival model: Primary biliary cirrhosis
  – BMA vs. stepwise regression — 2% improvement
  – BMA vs. expert-selected model — 10% improvement
• Linear regression: Body fat prediction
  – BMA provides the best 90% predictive coverage.
• Graphical models
  – BMA yields an improvement
BMA References
• Chris Volinsky’s BMA homepage: www.research.att.com/~volinsky/bma.html
• J. Hoeting, D. Madigan, A. Raftery, C. Volinsky (1999). “Bayesian Model Averaging: A Practical Tutorial” (to appear in Statistical Science), www.stat.colostate.edu/~jah/documents/bma2.ps
Unstable predictors

We can always assume
y = f(x) + ε,  where E(ε | x) = 0

Assume that we have a way of constructing a predictor, f̂_D(x), from a dataset D. We want to choose the estimator of f that minimizes J, squared loss for example:

J(f̂, D) = E_{y,x} [ y − f̂_D(x) ]²
Bias-variance decomposition

If we could average over all possible datasets, let the average prediction be
f̄(x) = E_D f̂_D(x)

The average prediction error over all datasets that we might see is decomposable:
E_D J(f̂, D) = E ε² + E_x [ f(x) − f̄(x) ]² + E_{x,D} [ f̂_D(x) − f̄(x) ]²
            = noise + bias² + variance
Bias-variance decomposition (cont.)
• The noise cannot be reduced.
• The squared-bias term might be reducible.
• The variance term is 0 if we use f̂_D(x) = f̄(x), but this requires having an infinite number of datasets.
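To make the decomposition concrete, here is a small Monte Carlo sketch (our own illustration, with an assumed true function and a cubic-polynomial fit as the estimator): it simulates many datasets, averages the fitted predictions, and confirms numerically that noise + bias² + variance reproduces the average squared prediction error.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                 # assumed true function
sigma = 0.3                                          # noise standard deviation
x_test = np.linspace(0.0, 1.0, 50)

n_datasets, n_obs, degree = 500, 30, 3
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0.0, 1.0, n_obs)
    y = f(x) + rng.normal(0.0, sigma, n_obs)
    coef = np.polyfit(x, y, degree)                  # one fitted model per simulated dataset
    preds[d] = np.polyval(coef, x_test)

f_bar = preds.mean(axis=0)                           # average prediction over datasets
noise = sigma ** 2
bias2 = np.mean((f(x_test) - f_bar) ** 2)
variance = np.mean(preds.var(axis=0))
total = noise + np.mean((preds - f(x_test)) ** 2)    # average squared prediction error
print(noise + bias2 + variance, total)               # the two quantities agree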
Bagging (Bootstrap Aggregating)
Goal: Variance reduction
Method: Create bootstrap replicates of the dataset and fit a model to each. Average the predictions of each model.
Properties:
• Stabilizes “unstable” methods
• Easy to implement, parallelizable
• Theory is not fully explained
Bagging algorithm
1. Create K bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average (or vote) the predictions of the K models.

Bootstrapping simulates the stream of infinite datasets in the bias-variance decomposition.
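A minimal sketch of the algorithm for regression, assuming scikit-learn decision trees as the unstable base learner (our illustration, not the authors' code):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, K=100, random_state=0):
    """Steps 1-2: fit one tree to each of K bootstrap replicates of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    models = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X_new):
    """Step 3: average the K models' predictions."""
    return np.mean([m.predict(X_new) for m in models], axis=0)

For classification, step 3 becomes a majority vote over the K trees' predicted classes.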
Bagging Example
[Scatter plot of the example data in features x1 and x2.]
CART decision boundary
[plot omitted]
100 bagged trees
[plot omitted]
Bagged tree decision boundary
[plot omitted]
Regression results (squared error loss)
[Bar chart of squared error for CART vs. bagged CART on Boston Housing, Ozone, and Friedman #1-#3.]
Classification results (misclassification rates)
[Bar chart of misclassification rates for CART vs. bagged CART on Diabetes, Breast, Ionosphere, Heart, Soybean, Glass, and Waveform.]
Bagging References
• Leo Breiman’s homepage: www.stat.berkeley.edu/users/breiman/
• Breiman, L. (1996). “Bagging Predictors,” Machine Learning, 24(2), 123-140.
• Friedman, J. and P. Hall (1999). “On Bagging and Nonlinear Estimation,” www.stat.stanford.edu/~jhf
Boosting
Goal: Improve misclassification rates
Method: Sequentially fit models, each more heavily weighting those observations poorly predicted by the previous model
Properties:
• Bias and variance reduction
• Easy to implement
• Theory is not fully (but almost) explained
Origin of Boosting
Classification problems
{y, x}i , i = 1,…,n
y ∈ {0, 1}
The task - construct a function,
F(x) : x → {0, 1}
so that F minimizes misclassification error.
Generic boosting algorithm
Equally weight the observations (y,x)i
For t in 1,…,T
Using the weights, fit a classifier ft(x) → y
Upweight the poorly predicted observations
Downweight the well-predicted observations
Merge f1,…,fT to form the boosted classifier
Real AdaBoost (Schapire & Singer 1998)

yi ∈ {-1, 1},  wi = 1/N
For t in 1, …, T do
  1. Estimate P_w(y = 1 | x).
  2. Set f_t(x) = ½ log [ P̂_w(y = 1 | x) / P̂_w(y = -1 | x) ]
  3. Set w_i ← w_i exp( -y_i f_t(x_i) ) and renormalize
Output the classifier F(x) = sign( Σ_t f_t(x) )
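A minimal sketch of these three steps, assuming scikit-learn decision trees (fit with sample weights) as the class-probability estimator P_w(y = 1 | x); this is our illustration of the algorithm above, not the authors' code:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, T=50, max_depth=2, eps=1e-6):
    """Real AdaBoost; y must be coded in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)
    trees = []
    for _ in range(T):
        # 1. Estimate P_w(y = 1 | x) with a weight-aware base learner.
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y, sample_weight=w)
        p = np.clip(tree.predict_proba(X)[:, list(tree.classes_).index(1)], eps, 1 - eps)
        # 2. f_t(x) = 1/2 log( P_w(y=1|x) / P_w(y=-1|x) )
        f = 0.5 * np.log(p / (1 - p))
        # 3. w_i <- w_i exp(-y_i f_t(x_i)), then renormalize.
        w *= np.exp(-y * f)
        w /= w.sum()
        trees.append(tree)
    return trees

def adaboost_predict(trees, X, eps=1e-6):
    """F(x) = sign( sum_t f_t(x) )."""
    F = np.zeros(len(X))
    for tree in trees:
        p = np.clip(tree.predict_proba(X)[:, list(tree.classes_).index(1)], eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return np.sign(F)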
AdaBoost’s Performance (Freund & Schapire [1996])
• Leo Breiman - AdaBoost with trees is the “best off-the-shelf classifier in the world.”
• Performs well with many base classifiers and in a variety of problem domains.
• AdaBoost is generally slow to overfit.
• Boosted naïve Bayes tied for first place in the 1997 KDD Cup (Elkan [1997]).
• Boosted naïve Bayes is a scalable, interpretable classifier (Ridgeway, et al. [1998]).
Boosting Example
[Scatter plot of the example data in features x1 and x2.]
After one iteration
CART splits; larger points have greater weight
[plot omitted]
After 3 iterations
[plot omitted]
After 20 iterations
[plot omitted]
Decision boundary after 100 iterations
[plot omitted]
Boosting as optimization
• Friedman, Hastie, Tibshirani [1998]: AdaBoost is an optimization method for finding a classifier.
• Let y ∈ {-1, 1}, F(x) ∈ (-∞, ∞)

J(F) = E( e^{−yF(x)} | x )
Criterion

• E( e^{−yF(x)} ) bounds the misclassification rate:
  I( yF(x) < 0 ) < e^{−yF(x)}
• The minimizer of E( e^{−yF(x)} ) coincides with the maximizer of the expected Bernoulli likelihood:
  E[ ℓ(y, p(x)) ] = −E[ log(1 + e^{−2yF(x)}) ]
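For completeness, a standard derivation sketch (our summary of the Friedman, Hastie & Tibshirani argument, not text from the slides) of why minimizing the exponential criterion pointwise yields the half log-odds that Real AdaBoost estimates at each step:

J(F | x) = E( e^{−yF(x)} | x ) = P(y=1 | x) e^{−F(x)} + P(y=−1 | x) e^{F(x)}

Setting ∂J/∂F = 0 gives

F*(x) = ½ log [ P(y=1 | x) / P(y=−1 | x) ]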
Optimization step
• Select f to minimize J…
J(F + f) = E( e^{−y(F(x) + f(x))} | x )

F^{(t+1)}(x) ← F^{(t)}(x) + ½ log [ E_w( I(y = 1) | x ) / E_w( I(y = −1) | x ) ],  with w(x, y) = e^{−y F^{(t)}(x)}
LogitBoost (Friedman, Hastie, Tibshirani [1998])

• Logistic regression:
  y = 1 with probability p(x), 0 with probability 1 − p(x),  where p(x) = 1 / (1 + e^{−F(x)})
• Expected log-likelihood of a regressor, F(x):
  E ℓ(F) = E( y F(x) − log(1 + e^{F(x)}) | x )
Newton steps

• Iterate to optimize the expected log-likelihood:
  J(F + f) = E( y (F(x) + f(x)) − log(1 + e^{F(x) + f(x)}) | x )
• Newton update:
  F^{(t+1)}(x) ← F^{(t)}(x) − [ ∂J(F^{(t)} + f)/∂f ] / [ ∂²J(F^{(t)} + f)/∂f² ],  with the derivatives evaluated at f = 0
LogitBoost, continued

• Newton steps for the Bernoulli likelihood:
  F(x) ← F(x) + E_w( (y − p(x)) / ( p(x)(1 − p(x)) ) | x ),  with weights w(x) = p(x)(1 − p(x))
• In practice the E_w(·|x) can be any regressor - trees, smoothers, etc.
• Trees are adaptive and work well for high-dimensional data.
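A minimal sketch of these Newton steps, using a scikit-learn regression tree for the weighted regressor E_w(·|x). It follows the update shown above, with p(x) = 1/(1 + e^{−F(x)}) and no extra step-size factor; the clipping of p is our own numerical safeguard (our illustration, not the authors' code):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y, T=50, max_depth=2, clip=1e-5):
    """Newton-step boosting of the Bernoulli likelihood for y in {0, 1}."""
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y))
    trees = []
    for _ in range(T):
        p = np.clip(1.0 / (1.0 + np.exp(-F)), clip, 1 - clip)
        z = (y - p) / (p * (1 - p))              # working response (y - p) / (p(1 - p))
        w = p * (1 - p)                          # Newton weights w(x) = p(1 - p)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, z, sample_weight=w)
        F += tree.predict(X)                     # F(x) <- F(x) + E_w(z | x)
        trees.append(tree)
    return trees

def logitboost_proba(trees, X_new):
    F = np.sum([t.predict(X_new) for t in trees], axis=0)
    return 1.0 / (1.0 + np.exp(-F))              # p(x) = 1 / (1 + exp(-F(x)))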
Misclassification rates (Friedman, Hastie, Tibshirani [1998])
[Bar chart of misclassification rates for CART, AdaBoost CART, and LogitBoost CART on Breast, Ionosphere, Glass, Sonar, and Waveform.]
Boosting References
• Rob Schapire’s homepage: www.research.att.com/~schapire
• Freund, Y. and R. Schapire (1996). “Experiments with a new boosting algorithm,” Machine Learning: Proceedings of the 13th International Conference, 148-156.
• Jerry Friedman’s homepage: www.stat.stanford.edu/~jhf
• Friedman, J., T. Hastie, R. Tibshirani (1998). “Additive Logistic Regression: a statistical view of boosting,” Technical report, Statistics Department, Stanford University.
In general, combining (“bundling”) estimators consists of two steps:
1) Constructing varied models, and
2) Combining their estimates

Generate component models by varying:
• Case Weights
• Data Values
• Guiding Parameters
• Variable Subsets

Combine estimates using:
• Estimator Weights
• Voting
• Advisor Perceptrons
• Partitions of Design Space, X
Other Bundling Techniques

We’ve Examined:
• Bayesian Model Averaging: sum estimates of possible models, weighted by posterior evidence
• Bagging (Breiman 96) (bootstrap aggregating) -- bootstrap data (to build trees mostly); take majority vote or average
• Boosting (Freund & Schapire 96) -- weight error cases by βt = (1-e(t))/e(t), iteratively re-model; average, weighting model t by ln(βt)

Additional Example Techniques:
• GMDH (Ivakhnenko 68) -- multiple layers of quadratic polynomials, using two inputs each, fit by Linear Regression
• Stacking (Wolpert 92) -- train a 2nd-level (LR) model using leave-1-out estimates of 1st-level (neural net) models
• ARCing (Breiman 96) (Adaptive Resampling and Combining) -- Bagging with reweighting of error cases; superset of boosting
• Bumping (Tibshirani 97) -- bootstrap, select single best
• Crumpling (Anderson & Elder 98) -- average cross-validations
• Born-Again (Breiman 98) -- invent new X data...
Group Method of Data Handling (GMDH)
• Try all pairs of variables (K choose 2) in quadratic polynomial nodes.
• Fit coefficients using regression.
• Keep best M nodes.
• Train model on one training data set, score on test data set. (Need a third dataset for independent confirmation of model.)
[Network diagram: inputs a, …, f feed pairwise into quadratic nodes such as A0 + A1·a + A2·b + A3·a² + A4·a·b + A5·b² (similarly with B coefficients for the pair (c, d) and C coefficients for (e, f)); the surviving node outputs z1, …, z5 feed Layer 2, and so on through Layer 3 to the final estimate yest.]
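A minimal sketch of a single GMDH layer under the description above (our illustration, fitting each quadratic node's coefficients by ordinary least squares and scoring on the separate test set):

import numpy as np
from itertools import combinations

def gmdh_layer(X_train, y_train, X_test, y_test, M=5):
    """Fit a quadratic node for every pair of inputs; keep the best M by test error."""
    nodes = []
    for i, j in combinations(range(X_train.shape[1]), 2):
        a, b = X_train[:, i], X_train[:, j]
        D = np.column_stack([np.ones_like(a), a, b, a**2, a*b, b**2])
        coef, *_ = np.linalg.lstsq(D, y_train, rcond=None)   # linear-regression fit
        at, bt = X_test[:, i], X_test[:, j]
        Dt = np.column_stack([np.ones_like(at), at, bt, at**2, at*bt, bt**2])
        err = np.mean((y_test - Dt @ coef) ** 2)             # score on the test set
        nodes.append((err, (i, j), coef, D @ coef))
    nodes.sort(key=lambda node: node[0])
    best = nodes[:M]
    Z_train = np.column_stack([node[3] for node in best])    # outputs feed the next layer
    return best, Z_train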
Polynomial Networks (ASPN)
[Network diagram: inputs a, …, k pass through normalizer nodes N1-N9 (Layer 0), then through Single, Double, Triple, and Multi-Linear polynomial nodes in Layers 1-3 (e.g., Single 14, Double 16, Triple 21, Multi-Linear 15), with unitizers U2 and U7 producing outputs Y1 and Y2. Example node equation: Z17 = 3.1 + 0.4a - 0.15b² + 0.9bc - 0.62abc + 0.5c³.]
When does Bundling work?

Hypotheses:
• Breiman (1996): when the prediction method is unstable (significantly different models are constructed)
• Ali & Pazzani (1996): when there is low noise, lots of irrelevant variables, and good individual predictors which make different errors
• when models are slightly overfit
• when models are from different families
Advanced techniques
• Stochastic gradient boosting
• Adaptive bagging
• Example regression and classification results
Stochastic Gradient Boosting
Goal: Non-parametric function estimation
Method: Cast the problem as optimization and use gradient ascent to obtain the predictor
Properties:
• Bias and variance reduction
• Widely applicable
• Can make use of existing algorithms
• Many tuning parameters
Improving boosting
• Boosting usually has the form
  F^{(t+1)}(x) ← F^{(t)}(x) + λ E_w( z(y, x) | x )

Improve by...
• Sub-sampling a fraction of the data at each step when computing the expectation.
• “Robustifying” the expectation.
• Trimming observations with small weights.
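A minimal sketch of this recipe for squared-error loss, where the gradient z(y, x) is simply the current residual; the subsample fraction and the shrinkage λ are the tuning parameters, and scikit-learn regression trees stand in for the base learner (our illustration, not Friedman's implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(X, y, T=200, lam=0.1, subsample=0.5,
                              max_depth=3, random_state=0):
    """Gradient boosting for squared-error loss with subsampling and shrinkage."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y, dtype=float)
    n = len(y)
    F = np.zeros(n)
    trees = []
    for _ in range(T):
        z = y - F                                    # gradient of squared loss = residual
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], z[idx])
        F += lam * tree.predict(X)                   # F <- F + lambda * E(z | x)
        trees.append(tree)
    return trees

def sgb_predict(trees, X_new, lam=0.1):
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)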
Stochastic gradient boosting offers...
• Application to likelihood-based models (GLM, Cox models)
• Bias reduction - non-linear fitting
• Massive datasets - bagging, trimming
• Variance reduction - bagging
• Interpretability - additive models
• High-dimensional regression - trees
• Robust regression
SGB References
• Friedman, J. (1999). “Greedy function approximation: a gradient boosting machine,” Technical report, Dept. of Statistics, Stanford University.
• Friedman, J. (1999). “Stochastic gradient boosting,” Technical report, Dept. of Statistics, Stanford University.
Adaptive Bagging
Goal: Bias and variance reduction
Method: Sequentially fit bagged models, where each fits the current residuals
Properties:
• Bias and variance reduction
• No tuning parameters
Adaptive bagging algorithm
1. Fit a bagged regressor to the dataset D.
2. Predict “out-of-bag” observations.
3. Fit a new bagged regressor to the bias (error) and repeat.
For a new observation, sum the predictions from each stage.
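A minimal sketch of these three steps for regression (our illustration, assuming scikit-learn trees and a fixed number of stages rather than an automatic stopping rule); each stage's out-of-bag predictions define the residuals the next stage fits:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_stage(X, r, K, rng):
    """Bag K trees on targets r; also return out-of-bag predictions of r."""
    n = len(r)
    trees, oob_sum, oob_cnt = [], np.zeros(n), np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, size=n)             # bootstrap replicate
        oob = np.setdiff1d(np.arange(n), idx)        # cases left out of this replicate
        tree = DecisionTreeRegressor().fit(X[idx], r[idx])
        trees.append(tree)
        if oob.size:
            oob_sum[oob] += tree.predict(X[oob])
            oob_cnt[oob] += 1
    return trees, oob_sum / np.maximum(oob_cnt, 1)

def adaptive_bagging(X, y, stages=3, K=50, random_state=0):
    rng = np.random.default_rng(random_state)
    r = np.asarray(y, dtype=float).copy()
    model = []
    for _ in range(stages):
        trees, oob_pred = fit_bagged_stage(X, r, K, rng)
        model.append(trees)
        r = r - oob_pred                             # next stage fits the current residuals
    return model

def adaptive_predict(model, X_new):
    # Sum the (bagged-average) predictions from each stage.
    return sum(np.mean([t.predict(X_new) for t in trees], axis=0) for trees in model)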
Regression results (squared error loss)
[Bar chart of squared error for bagging vs. adaptive bagging on Abalone, Robot arm, Peak20, Boston Housing, Ozone, Servo, and Friedman #1-#3.]
Classification results (misclassification rates)
[Bar chart of misclassification rates for AdaBoost vs. adaptive bagging on Breast, Ionosphere, Sonar, Heart, German credit, Votes, Liver, and Diabetes.]
Relative Performance Examples: 5 Algorithms on 6 Datasets
(John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)
[Bar chart: relative performance (scaled 0 to 1) on Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment for a Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and a Decision Tree.]
Essentially every Bundling method improves performance
[Companion bar chart (same six datasets and 0-to-1 scale) for four bundling methods: Advisor Perceptron, AP weighted average, Vote, and Average.]
Application Ex.: Direct Marketing (Elder Research 1996-1998)
• Model respondents to direct marketing as a binary variable: 0 (no response), 1 (response).
• Create models using several (here, 5) different algorithms, all employing the same candidate model inputs.
• Rank-order model responses:
  – Give the highest-probability response value a rank of 1, the second highest value 2, etc.
  – For bundling, combine model ranks (not estimates) into a new consensus estimate (which is again ranked).
• Report the number of response cases missed (in the top portion).
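A minimal sketch of the rank-combination step described above (our illustration; scores is a hypothetical array holding each model's estimated response probabilities for the same cases):

import numpy as np

def consensus_rank(scores):
    """Average each model's output ranks and re-rank the result.

    scores: array of shape (n_models, n_cases) of estimated response probabilities.
    Returns consensus ranks (1 = most likely responder).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=1)                       # per model: highest to lowest
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)    # rank 1 = highest probability
    avg_rank = ranks.mean(axis=0)                             # combine model ranks, not estimates
    consensus = np.empty(scores.shape[1], dtype=int)
    consensus[np.argsort(avg_rank)] = np.arange(1, scores.shape[1] + 1)
    return consensus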
Marketing Application Performance
[Scatter plot: #Cases Missed (y-axis, about 50-80) vs. #Models Combined, averaging output rank (x-axis, 0-5). Single models are Bundled Trees, Stepwise Regression, Polynomial Network, Neural Network, and MARS; labeled points (e.g., NT, SMP, SMPN) mark their 2-, 3-, and 4-way combinations.]
Median (and Mean) Error Reduced with each Stage of Combination
[Plot: cases missed (about 55-75) vs. number of models in combination (1-5). NOTE: Fewer misses is better.]
Why Bundling works
• (semi-) Independent Estimators
• Bayes Rule - weighing evidence
• Shrinking (ex.: stepwise LR)
• Smoothing (ex.: decision trees)
• Additive modeling and maximum likelihood (Friedman, Hastie, & Tibshirani 8/20/98)

… Open research area. Meanwhile, we recommend bundling competing candidate models both within, and between, model families.

...and in a multitude of counselors there is safety.
Proverbs 24:6b