TRANSCRIPT
Online Learning: Wrap Up
&
Boosting: Interpretations, Extensions, Theory
Arindam Banerjee
Online: “Things we liked”
Adaptive (Natural)
Computationally efficient
Good guarantees [2]
Gradient descent based
No assumptions [2]
Online: “Things we didn’t like”
Assumes some good expert or oracle [6]
Not “robust” (noisy data)
Tracking of outliers
Intuition about convergence rates
Unsupervised versions
Stock market assumptions
Worst-case focus (conservative)
Learning rate (how to choose?)
Online: “General Questions”
Relations with Bayesian inference (e.g., Dirichlet)
Convergence proofs are good, but what about intuitive algorithms?
Is there a Batch to Online continuum?
Some Key Ideas
Hypothesis Spaces and Oracles
Approximation error and Estimation error
Complexity of Hypothesis Spaces (VC dimensions)
Regularization and “Being Conservative”
Sufficient conditions for “Learning”
Empirical Estimation Theory and Falsifiability
Let's get back to real life ... to Boosting ...
Weak Learning
A weak learner predicts “slightly better” than random
The PAC setting (basics):
Let X be an instance space, c : X → {0, 1} be a target concept, and H be a hypothesis space (h : X → {0, 1})
Let Q be a fixed (but unknown) distribution on X
An algorithm, after training on (xi, c(xi)), i = 1, …, m, selects h ∈ H such that

    Px∼Q[h(x) ≠ c(x)] ≤ 1/2 − γ

The algorithm is called a γ-weak learner
We assume the existence of such a learner
The Boosting Model
Boosting converts a weak learner to a strong learner
Boosting proceeds in rounds:
    Booster constructs Dt on X, the train set
    Weak learner produces a hypothesis ht ∈ H so that

        Px∼Dt[ht(x) ≠ c(x)] ≤ 1/2 − γt

After T rounds, the weak hypotheses ht, t = 1, …, T, are combined into a final hypothesis hfinal
We need procedures
    for obtaining Dt at each step
    for combining the weak hypotheses
Adaboost
Input: Training set (x1, y1), …, (xm, ym)
Algorithm: Initialize D1(i) = 1/m
For t = 1, …, T
    Train a weak learner using distribution Dt
    Get weak hypothesis ht with error εt = Prx∼Dt[ht(x) ≠ y]
    Choose αt = (1/2) ln((1 − εt)/εt)
    Update

        Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt

    where Zt is the normalization factor
Output: h(x) = sign( Σ_{t=1}^T αt ht(x) )
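The algorithm above can be sketched in a few lines of NumPy. This is an illustrative implementation, not from the slides: the decision-stump weak learner and the toy dataset are assumptions made for the example.

```python
import numpy as np

def stump_learner(X, y, D):
    """Weak learner: pick the threshold stump with lowest D-weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return (lambda Z: sign * np.where(Z[:, j] <= thr, 1, -1)), err

def adaboost(X, y, T=10):
    """Adaboost as on the slide; labels y in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h, eps = stump_learner(X, y, D)  # weak hypothesis, weighted error
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * h(X))
        D /= D.sum()                     # dividing by Z_t
        hs.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

# Toy run: no single stump fits this labeling, but a few rounds do
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, 1, -1])
H = adaboost(X, y, T=6)
```

Each round reweights the points the current stump gets wrong, so later stumps concentrate on them; the final sign of the weighted vote fits the whole pattern.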
Training Error, Margin Error
[Figure: training loss as a function of the margin yf(x) — the 0-1 loss, Adaboost loss, and Logitboost loss]
The 0-1 training set loss with convex upper bounds: exponential loss and logistic loss
More on the Training Error
The training error of the final classifier h(x) = sign(f(x)), with f(x) = Σ_{t=1}^T αt ht(x), is bounded:

    (1/m) |{i : h(xi) ≠ yi}| ≤ (1/m) Σ_{i=1}^m exp(−yi f(xi)) = Π_{t=1}^T Zt

Training error can be reduced most rapidly by choosing the αt that minimizes

    Zt = Σ_{i=1}^m Dt(i) exp(−αt yi ht(xi))

Adaboost chooses the optimal αt
Margin is (sort of) maximized
Other boosting algorithms minimize other upper bounds
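The bound can be checked numerically. In this sketch the labels and the three weak hypotheses' predictions are made-up toy values; the loop just replays the Adaboost update and accumulates Π Zt. The middle quantity (1/m) Σ exp(−yi f(xi)) equals Π Zt exactly, by the telescoping of the normalizations.

```python
import numpy as np

# Toy run: labels and three weak hypotheses' predictions on four points
y  = np.array([-1, 1, 1, -1])
hs = [np.array([-1,  1,  1,  1]),
      np.array([ 1,  1,  1, -1]),
      np.array([-1, -1, -1, -1])]

m = len(y)
D = np.full(m, 1.0 / m)
f = np.zeros(m)            # running sum of alpha_t h_t(x_i)
Z_prod = 1.0
for h in hs:
    eps = D[h != y].sum()                  # weighted error of h under D_t
    alpha = 0.5 * np.log((1 - eps) / eps)  # Adaboost's optimal alpha_t
    w = D * np.exp(-alpha * y * h)
    Z_prod *= w.sum()                      # Z_t
    D = w / w.sum()
    f += alpha * h

train_err = np.mean(np.sign(f) != y)       # 0-1 training error
exp_loss  = np.mean(np.exp(-y * f))        # equals Z_prod
```

Here the training error drops to zero while Π Zt stays strictly below one, as the bound requires.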
Boosting as Entropy Projection
For a general α we have

    Dαt+1(i) = Dt(i) exp(−α yi ht(xi)) / Zt(α)

where Zt(α) = Σ_{i=1}^m Dt(i) exp(−α yi ht(xi)). Then

    min_{D : ED[y ht(x)] = 0} KL(D ‖ Dt) = max_α (− ln Zt(α)) .

Further,

    argmin_{D : ED[y ht(x)] = 0} KL(D ‖ Dt) = Dαt+1 with α = αt ,

which is the update Adaboost uses.
But what if labels are noisy?
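The projection can be seen directly: at Adaboost's αt, the updated distribution lands exactly on the constraint set, i.e. ED[y ht(x)] = 0, so the new distribution is uncorrelated with the last weak hypothesis. A small check with made-up weights:

```python
import numpy as np

D  = np.array([0.4, 0.3, 0.2, 0.1])   # current D_t (toy values)
yh = np.array([1, 1, -1, 1])          # y_i h_t(x_i): +1 if h_t is correct on i
eps = D[yh == -1].sum()               # weighted error of h_t
alpha = 0.5 * np.log((1 - eps) / eps) # Adaboost's alpha_t

w = D * np.exp(-alpha * yh)           # unnormalized update
D_next = w / w.sum()                  # D_{t+1}
corr = np.dot(D_next, yh)             # E_{D_{t+1}}[y h_t(x)], should be 0
```

Correctly and incorrectly classified points end up carrying equal total weight under Dt+1, which is why the same weak hypothesis is useless on the next round.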
Additive Models
An additive model:

    FT(x) = Σ_{t=1}^T wt ft(x) .

Fit the t-th component to minimize the “residual” error
Error is measured by C(yf(x)), an upper bound on the 0-1 loss 1[yf(x) ≤ 0]
For Adaboost, C(yf(x)) = exp(−yf(x))
For Logitboost, C(yf(x)) = log(1 + exp(−yf(x)))
Upper Bounds on Training Errors
[Figure: training loss as a function of the margin yf(x) — the 0-1 loss, Adaboost loss, and Logitboost loss]
The 0-1 training set loss with convex upper bounds: exponential loss and logistic loss
Adaboost as Gradient Descent
Margin Cost function C(yh(x)) = exp(−yh(x))
Current classifier Ht−1(x), next weak learner ht(x)
Next classifier Ht(x) = Ht−1(x) + αtht(x)
The cost function to be minimized:

    Σ_{i=1}^m C(yi [Ht−1(xi) + αt ht(xi)])

The optimum solution:

    αt = (1/2) ln( Σ_{i: ht(xi)=yi} Dt(i) / Σ_{i: ht(xi)≠yi} Dt(i) ) = (1/2) ln((1 − εt)/εt)
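The closed form can be verified numerically under assumed toy values: build Dt from the current classifier Ht−1 (exponential weights), compute the closed-form αt, and compare it with a grid-search minimizer of the exponential cost.

```python
import numpy as np

y      = np.array([-1.0, 1.0, 1.0, -1.0])
H_prev = np.array([ 0.3, 0.8, -0.2, -0.5])  # current classifier H_{t-1}(x_i)
h      = np.array([-1.0, 1.0, 1.0,  1.0])   # next weak hypothesis h_t(x_i)

D = np.exp(-y * H_prev)
D /= D.sum()                                 # D_t induced by H_{t-1}
eps = D[h != y].sum()                        # epsilon_t under D_t
a_closed = 0.5 * np.log((1 - eps) / eps)     # the slide's optimum

cost = lambda a: np.exp(-y * (H_prev + a * h)).sum()
grid = np.linspace(-3.0, 3.0, 6001)
a_grid = grid[np.argmin([cost(a) for a in grid])]
```

The two agree to grid resolution: the exponential weights absorb Ht−1 into Dt, so the line search reduces to the one-round formula.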
Gradient Descent in Function Space
Choose any appropriate margin cost function C
Goal is to minimize margin cost on the training set
Current classifier Ht−1(x), next weak learner ht(x)
Next classifier Ht(x) = Ht−1(x) + αtht(x)
Choose αt to minimize
    Σ_{i=1}^m C(yi Ht(xi)) = Σ_{i=1}^m C(yi [Ht−1(xi) + αt ht(xi)]) .
“Anyboost” class of algorithms
Margin Bounds: Why Boosting works
Test set error decreases even after train set error is 0
No simple “bias-variance” explanation
Bounds on the error rate of the boosted classifier depend on
the number of training examples
the margin achieved on the train set
the complexity of the weak learners
It does not depend on the number of classifiers combined
Margin Bound for Boosting
H: Class of binary classifiers of VC-dimension dH
co(H) = { h : h(x) = Σ_i αi hi(x), αi ≥ 0, Σ_i αi = 1 }

With probability at least (1 − δ) over the draw of a train set S of size m, for all h ∈ co(H) and θ > 0 we have

    Pr[yh(x) ≤ 0] ≤ PrS[yh(x) ≤ θ] + O( (1/θ) √(dH/m) ) + O( √(log(1/δ)/m) )
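To connect with the PrS[yh(x) ≤ θ] term: rescaling the αt to sum to one puts the combined classifier in co(H), so the normalized margins yh(x) lie in [−1, 1] and the empirical margin term is just a fraction of the train set. A toy computation, with weak-hypothesis predictions and weights assumed for illustration:

```python
import numpy as np

y      = np.array([-1, 1, 1, -1])
preds  = np.array([[-1,  1,  1,  1],    # h_1(x_i)
                   [ 1,  1,  1, -1],    # h_2(x_i)
                   [-1, -1, -1, -1]])   # h_3(x_i)
alphas = np.array([0.549, 0.805, 0.693])

w = alphas / alphas.sum()   # convex combination: h in co(H)
margins = y * (w @ preds)   # normalized margins, in [-1, 1]
theta = 0.2
frac_below = np.mean(margins <= theta)   # Pr_S[y h(x) <= theta]
```

All margins being positive means zero training error; the bound says a large smallest margin (relative to √(dH/m)) is what controls the test error.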
Maximizing the Margin
Do boosting algorithms maximize the margin?
Minimize upper bound on training error
Often get large margin in the process
Do not explicitly maximize the margin
Explicit margin maximization
Classical Boost: Minimize bound on PrS [yh(x) ≤ 0]
Margin Boost: Minimize (bound on) PrS [yh(x) ≤ θ]
Boosting: Wrap Up (for now)
Strong learner from convex combinations of weak learners
Choice of loss function is critical
Aggressive loss functions for clean data
Robust loss functions for noisy data
Gradient descent in function space of additive models
Generalization by implicit margin maximization
Explicit maximum margin boosting algorithms possible
Is bound minimization a good idea?