On-line learning and Boosting
Overview of “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” by
Freund and Schapire (1997).
Tim Miller
University of Minnesota
Department of Computer Science and Engineering
Hedge - Motivation
Generalization of Weighted Majority Algorithm
Given a set of expert predictions, minimize mistakes over time
Slight emphasis in motivation on the possibility of treating the initial weight vector w^1 as a prior.
Hedge Algorithm
Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T
For t = 1..T:
1. Choose allocation p^t = w^t / Σ_i w^t_i (probability distribution formed from the weights)
2. Receive loss vector l^t ∈ [0, 1]^N
3. Suffer loss p^t · l^t
4. Set new weight vector: w^{t+1}_i = w^t_i · β^{l^t_i}
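A minimal sketch of this loop in Python (the toy loss matrix and the β value are illustrative, not from the paper):

```python
import numpy as np

def hedge(losses, beta=0.5):
    """Hedge(beta): losses is a T x N array with entries in [0, 1], one row per trial."""
    T, N = losses.shape
    w = np.ones(N) / N                 # uniform initial weight vector w^1
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                # 1. allocation p^t
        loss_t = losses[t]             # 2. receive loss vector l^t
        total_loss += p @ loss_t       # 3. suffer loss p^t . l^t
        w = w * beta ** loss_t         # 4. w^{t+1}_i = w^t_i * beta^{l^t_i}
    return total_loss

# toy usage: three strategies, five trials of random losses
rng = np.random.default_rng(0)
print(hedge(rng.random((5, 3))))
```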
Hedge Analysis
Does not perform "too much worse" than the best strategy: for any strategy i,
L_Hedge(β) ≤ (−ln(w^1_i) − L_i ln β) · Z, where Z = 1 / (1 − β)
Is it possible to do better?
Boosting
If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them?
Example: we have a collection of "rules of thumb" for predicting horse races; how should we weight them?
Definitions
Given labeled data ⟨x, c(x)⟩, where c is the target concept, c: X → {0, 1}
c ∈ C, the concept class
Strong PAC-learning algorithm: for parameters ε, δ, the hypothesis has error less than ε with probability at least 1 − δ
Weak learning algorithm: error at most 0.5 − γ for some γ > 0
AdaBoost Algorithm
Input:
Sequence of N labeled examples (x_1, y_1), …, (x_N, y_N)
Distribution D over the N examples
Weak learning algorithm (called WeakLearn)
Number of iterations T
AdaBoost contd.
Initialize: w^1_i = D(i)
For t = 1..T:
1. Form probability distribution p^t = w^t / Σ_i w^t_i from the weights
2. Call WeakLearn with distribution p^t; get back a hypothesis h_t: X → [0, 1]
3. Calculate error ε_t = Σ_{i=1..N} p^t_i |h_t(x_i) − y_i|
4. Set β_t = ε_t / (1 − ε_t)
5. Multiplicatively adjust weights: w^{t+1}_i = w^t_i · β_t^{1 − |h_t(x_i) − y_i|}
AdaBoost Output
Output 1 if Σ_{t=1..T} (log 1/β_t) h_t(x) ≥ ½ Σ_{t=1..T} log 1/β_t, 0 otherwise
Computes a weighted average of the weak hypotheses
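A minimal sketch of the full procedure for binary labels y ∈ {0, 1}; the decision-stump weak learner is an illustrative stand-in (the paper leaves WeakLearn abstract), and the clamp on ε_t is an added safeguard, not part of the paper:

```python
import numpy as np

def stump_learn(X, y, p):
    """Illustrative weak learner: weighted one-feature threshold stump returning 0/1."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = (sign * (X[:, j] - thr) >= 0).astype(float)
                err = np.sum(p * np.abs(pred - y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Z: (sign * (Z[:, j] - thr) >= 0).astype(float)

def adaboost(X, y, D=None, T=10):
    N = len(y)
    w = np.ones(N) / N if D is None else np.array(D, float)   # w^1 = D
    hyps, betas = [], []
    for t in range(T):
        p = w / w.sum()                                  # 1. distribution p^t
        h = stump_learn(X, y, p)                         # 2. call WeakLearn
        eps = np.sum(p * np.abs(h(X) - y))               # 3. error eps_t
        eps = min(max(eps, 1e-10), 1 - 1e-10)            # clamp (not in the paper)
        beta = eps / (1 - eps)                           # 4. beta_t
        w = w * beta ** (1 - np.abs(h(X) - y))           # 5. multiplicative weight update
        hyps.append(h); betas.append(beta)
    alphas = np.log(1.0 / np.array(betas))
    def predict(Z):
        votes = sum(a * h(Z) for a, h in zip(alphas, hyps))
        return (votes >= 0.5 * alphas.sum()).astype(int)  # output 1 iff weighted vote >= 1/2
    return predict

# toy usage: separable one-dimensional data
X = np.array([[0.1], [0.2], [0.8], [0.9]]); y = np.array([0.0, 0.0, 1.0, 1.0])
print(adaboost(X, y, T=5)(X))
```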
AdaBoost Analysis
Note the "dual" relationship with Hedge: strategies correspond to examples, and trials to weak hypotheses
Hedge increases weight for successful strategies, AdaBoost increases weight for difficult examples
AdaBoost has a dynamic β: β_t is recomputed at each iteration, whereas Hedge fixes β in advance
AdaBoost Bounds
ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t (1 − ε_t))
Previous bounds depended on maximum error of weakest hypothesis (weak link syndrome)
AdaBoost takes advantage of gains from best hypotheses
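Illustrative numbers (not from the paper): if every weak hypothesis achieves ε_t = 0.4, the bound 2^T ∏ sqrt(ε_t (1 − ε_t)) = (0.9798…)^T already shrinks geometrically with T:

```python
import math

# bound: 2^T * prod_{t=1..T} sqrt(eps_t (1 - eps_t)), here with every eps_t = 0.4
for T in (10, 50, 100):
    eps = 0.4
    bound = (2 * math.sqrt(eps * (1 - eps))) ** T
    print(T, round(bound, 3))   # roughly 0.815, 0.360, 0.130
```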
Multi-class Setting
k > 2 output labels, i.e. Y = {1, 2, …, k}
Error: probability of an incorrect prediction
Two algorithms:
AdaBoost.M1 – more direct
AdaBoost.M2 – somewhat complex constraints on weak learners
Could also just divide into “one vs. one” or “one vs. all” categories
AdaBoost.M1
Requires each classifier to have error less than 50% (a stronger requirement than in the binary case)
Similar to the regular AdaBoost algorithm except:
Error for example i is 1 if h_t(x_i) ≠ y_i, 0 otherwise
Can't use algorithms with error > 0.5
The algorithm outputs a vector of length k with values between 0 and 1
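A minimal sketch of the M1 modifications, assuming a hypothetical weak_learn(X, y, p) helper (not specified in the paper) that returns a predictor over the k integer labels:

```python
import numpy as np

def adaboost_m1(X, y, weak_learn, k, T=10):
    """y holds integer labels in {0, ..., k-1}; weak_learn(X, y, p) -> h with h(Z) -> labels."""
    N = len(y)
    w = np.ones(N) / N
    hyps, betas = [], []
    for t in range(T):
        p = w / w.sum()
        h = weak_learn(X, y, p)
        miss = (h(X) != y).astype(float)        # loss is 1 iff h_t(x_i) != y_i
        eps = np.sum(p * miss)
        if eps >= 0.5:                          # M1 requires error < 1/2; stop otherwise
            break
        beta = max(eps, 1e-10) / (1 - eps)      # clamp (not in the paper)
        w = w * beta ** (1 - miss)
        hyps.append(h); betas.append(beta)
    def predict(Z):
        votes = np.zeros((len(Z), k))           # weighted vote for each of the k labels
        for h, b in zip(hyps, betas):
            votes[np.arange(len(Z)), h(Z).astype(int)] += np.log(1.0 / b)
        return votes.argmax(axis=1)             # label with the largest total vote
    return predict
```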
AdaBoost.M1 Analysis
ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t (1 − ε_t))
Same as the bound for regular AdaBoost
Proof converts the multi-class problem to a binary setup
Can we improve this algorithm?
AdaBoost.M2
More expressive, more complex constraints on weak hypotheses
Defines the idea of "pseudo-loss"
Pseudo-loss of each weak hypothesis must be better than chance
Benefit: allows contributions from hypotheses with accuracy < 0.5
Pseudo-loss
Replaces the straightforward loss of AdaBoost.M1:
ploss_q(h, i) = ½ (1 − h(x_i, y_i) + Σ_{y ≠ y_i} q(i, y) h(x_i, y))
Intuition: for each incorrect label y, pit it against the known label y_i in a binary classification (the second term), then take a weighted average.
Makes use of the information in the entire hypothesis vector, not just the prediction
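A minimal sketch of this formula, assuming h(x_i, ·) is given as a length-k vector of values in [0, 1] and q(i, ·) as a length-k vector of mislabel weights (the entry at the correct label is unused):

```python
import numpy as np

def pseudo_loss(h_xi, y_i, q_i):
    """ploss_q(h, i) = 0.5 * (1 - h(x_i, y_i) + sum_{y != y_i} q(i, y) * h(x_i, y))."""
    wrong = sum(q_i[y] * h_xi[y] for y in range(len(h_xi)) if y != y_i)
    return 0.5 * (1.0 - h_xi[y_i] + wrong)

# toy usage: 3 labels, correct label 0; q spreads weight over the two incorrect labels
print(pseudo_loss(np.array([0.7, 0.2, 0.1]), 0, np.array([0.0, 0.5, 0.5])))   # 0.225
```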
AdaBoost.M2 Details
Extra init: w^1_{i,y} = D(i) / (k − 1) for each y ≠ y_i
For each iteration t = 1 to T:
W^t_i = Σ_{y ≠ y_i} w^t_{i,y}
q_t(i, y) = w^t_{i,y} / W^t_i
D_t(i) = W^t_i / Σ_{i=1..N} W^t_i
WeakLearn gets D_t as well as q_t
Calculate ε_t (the pseudo-loss) as shown above
β_t = ε_t / (1 − ε_t)
w^{t+1}_{i,y} = w^t_{i,y} · β_t^{½ (1 + h_t(x_i, y_i) − h_t(x_i, y))}
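A minimal sketch of one round of this bookkeeping, assuming the mislabel weights are kept in an N × k array (the entry at the correct label is ignored) and the weak hypothesis values h_t(x_i, y) arrive as an N × k array:

```python
import numpy as np

def m2_round(w, y, h_vals):
    """One AdaBoost.M2 round: w is N x k mislabel weights, y true labels, h_vals = h_t(x_i, y)."""
    N, k = w.shape
    mask = np.ones((N, k), dtype=bool)
    mask[np.arange(N), y] = False                    # only incorrect labels carry weight
    W = np.where(mask, w, 0.0).sum(axis=1)           # W^t_i = sum_{y != y_i} w^t_{i,y}
    q = np.where(mask, w / W[:, None], 0.0)          # q_t(i, y)
    D = W / W.sum()                                  # D_t(i)
    correct = h_vals[np.arange(N), y]                # h_t(x_i, y_i)
    eps = np.sum(D * 0.5 * (1 - correct + (q * h_vals).sum(axis=1)))   # pseudo-loss eps_t
    beta = eps / (1 - eps)                           # beta_t
    w_next = w * beta ** (0.5 * (1 + correct[:, None] - h_vals))       # weight update
    return w_next, D, q, eps, beta

# toy usage: N = 2 examples, k = 3 labels, uniform D, so w^1_{i,y} = (1/2) / (3 - 1) = 0.25
w1 = np.full((2, 3), 0.25)
y = np.array([0, 2])
h = np.array([[0.9, 0.1, 0.0], [0.2, 0.3, 0.5]])
print(m2_round(w1, y, h))
```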
Error Bounds
ε ≤ (k − 1) 2^T ∏_{t=1..T} sqrt(ε_t (1 − ε_t))
where ε is the traditional error and the ε_t are pseudo-losses
Regression Setting
Instead of picking from a discrete set of output labels, choose a continuous value
More formally, Y = [0, 1]
Minimize the mean squared error: E[(h(x) − y)^2]
Reduce to binary classification and use AdaBoost!
How it works (roughly)
For each example in the training set, create a continuum of associated instances x̃ = (x_i, y) where y ∈ [0, 1]
Label is 1 if y ≥ y_i, 0 otherwise
Mapping to an infinite training set – need to convert discrete distributions to density functions
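A minimal sketch of the reduction idea; the paper works with a continuum of instances and density weights, so the fixed grid over [0, 1] below is a simplification, not the paper's construction:

```python
import numpy as np

def expand_to_binary(X, y, grid_size=11):
    """Map each regression example (x_i, y_i) to binary examples ((x_i, z), [z >= y_i])."""
    grid = np.linspace(0.0, 1.0, grid_size)
    Xb, yb = [], []
    for xi, yi in zip(X, y):
        for z in grid:
            Xb.append(np.append(xi, z))         # augmented instance (x_i, z)
            yb.append(1 if z >= yi else 0)      # label is 1 iff z >= y_i
    return np.array(Xb), np.array(yb)

# toy usage: two one-dimensional regression examples with targets in [0, 1]
Xb, yb = expand_to_binary(np.array([[0.3], [0.8]]), np.array([0.25, 0.6]))
print(Xb.shape, yb.reshape(2, -1))
```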
AdaBoost.R Bounds
ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t (1 − ε_t))
Conclusions
Starting from an on-line learning perspective, it is possible to generalize to boosting
Boosting can take weak learners and convert them to strong learners
This paper presented several algorithms to do boosting, with proofs of error bounds