Reducing Multiclass to Binary, 2000. cseweb.ucsd.edu/~elkan/254spring01/aldebaro1.pdf. Aldebaro Klautau
TRANSCRIPT
Aldebaro Klautau
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Slide 1: Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Erin Allwein, Robert Schapire and Yoram Singer
Journal of Machine Learning Research, 1:113-141, 2000

CSE 254: Seminar on Learning Algorithms
Professor: Charles Elkan
Student: Aldebaro Klautau
April 23, 2001
Slide 2: Outline

• Motivation
• Definitions
  – Margin, loss, output coding
• Paper contributions
  – Unified view of classifiers, "0" entries in coding matrix, bounds and simulations
• Algorithm
• Bound on training error
• Experimental results
  – Compare: a) Hamming vs. loss-based decoding; b) different coding matrices; c) simple binary problems vs. robust matrix
• Conclusions
Slide 3: Motivation

• Many applications require multiclass categorization
• Some algorithms handle the multiclass case directly, e.g. C4.5 and CART
• ☹ Others do not easily handle the multiclass case, such as AdaBoost and SVM
• ☺ Alternative: reduce the multiclass problem to multiple binary classifications
4
Reducing multiclass to multiple binary problems
z Paper proposes general framework that unifies several methods, namely:binary margin-based algorithms coupled through a coding matrix
Slide 5: Binary margin-based learning algorithm

• Input: a total of m binary labeled training examples (x1, y1), …, (xm, ym) such that xi ∈ X and yi ∈ {−1, +1}
• Output: a real-valued function (hypothesis) f : X → ℝ
• The binary margin of a training example (x, y) is defined as z = y f(x)
• z > 0 if and only if f classifies x correctly
Slide 6

• ☹ It is difficult to minimize classification error directly
• ☺ Instead, minimize some loss function L(z) = L(y f(x)) of the margin of each example (x, y)
• Algorithms that use a margin-based loss: support-vector machines (SVM), AdaBoost, regression, decision trees, etc.
• Example: neural networks and least-squares regression attempt to minimize the squared error. Since y ∈ {−1, +1}, y² = 1 and

  (y − f(x))² = y²(y − f(x))² = (y² − y f(x))² = (1 − y f(x))²

  so the squared error is the margin-based loss L(z) = (1 − z)²
• Classification error itself corresponds to the 0-1 step loss of the margin (1 when z ≤ 0, 0 otherwise)
Slide 7: [Plot] Loss functions of popular algorithms, L : ℝ → [0, ∞)

• AdaBoost: exp(−z)
• Regression: (1 − z)²
• Logistic: log(1 + exp(−2z))
• SVM: (1 − z)+
• Axes: margin z = y f(x) (horizontal) vs. loss L(z) (vertical)
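The four losses on this slide can be written down directly; a minimal sketch in Python (the dictionary keys simply mirror the slide's legend):

```python
import math

# Margin-based losses from slide 7, each mapping a margin z to a loss in [0, inf).
losses = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1 - z) ** 2,
    "logistic": lambda z: math.log(1 + math.exp(-2 * z)),
    "svm": lambda z: max(0.0, 1 - z),  # hinge loss (1 - z)+
}

# Each loss penalizes negative margins (mistakes) and shrinks as the
# margin grows, which is the common shape visible in the plot.
for name, L in losses.items():
    print(f"{name}: L(-1)={L(-1):.3f}  L(0)={L(0):.3f}  L(2)={L(2):.3f}")
```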
Slide 8

• Scenario: the training sequence has labels drawn from a set with cardinality k > 2, and the algorithm(s) assume k = 2
• Two popular alternatives: one-against-all and all-pairs
• For k = 10 classes:

  Approach          Number of classifiers
  one-against-all   10
  all-pairs         45
  complete          511
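The counts in the table follow from closed-form expressions (k columns for one-against-all, k-choose-2 for all-pairs, and 2^(k−1) − 1 distinct non-trivial binary splits for the complete code); a quick check:

```python
from math import comb

def n_classifiers(k):
    """Number of binary classifiers for k classes under each reduction."""
    return {
        "one-against-all": k,          # one column per class
        "all-pairs": comb(k, 2),       # one column per unordered pair of classes
        "complete": 2 ** (k - 1) - 1,  # all distinct non-trivial binary splits
    }

print(n_classifiers(10))  # {'one-against-all': 10, 'all-pairs': 45, 'complete': 511}
```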
Slide 9: Output coding for solving multiclass problems

Code word (each column corresponds to a binary classifier):

  Class | vertical line | horizontal line | diagonal line | closed curve | curve open to left | curve open to right
    0   |      –1       |       –1        |      –1       |      +1      |        –1          |        –1
    1   |      +1       |       –1        |      –1       |      –1      |        –1          |        –1
    2   |      –1       |       +1        |      +1       |      –1      |        +1          |        –1
    3   |      –1       |       –1        |      –1       |      –1      |        +1          |        –1
    …   |                                        …
    9   |      –1       |       –1        |      +1       |      +1      |        –1          |        –1

k = number of classes (rows) = 10; l = number of classifiers (columns) = 6
Slide 10: Error correcting output codes (ECOC, proposed by Dietterich and Bakiri, 1995)

• Associate each class r with a row of a "coding matrix" M
• Train one binary classifier per column; each example labeled y is mapped to M(y, s) when training classifier s
• Obtain l hypotheses f1 to fl
• Given a test example x, choose the row y of M that is "closest" to the binary predictions (f1(x), …, fl(x)) according to some distance (e.g. Hamming)

(The slide repeats the digit-recognition coding table from slide 9.)
Slide 11: Summary of paper contributions

• Unified view of margin classifiers
• Possibility of "0" entries in the ECOC matrix
• Decoding when the matrix has "0" entries
• Bound on training error (general)
• Bound on testing error (restricted to AdaBoost)
• Experimental results
Slide 12: Idea: allow "0" entries

• The coding matrix M is taken from the extended set {−1, 0, +1}^(k × l)
• The entry M(r, s) = 0 indicates we do not care how fs categorizes examples with label r

One-against-all → k by k (here k = 4):

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1

All-pairs → k by C(k, 2) (here 4 by 6):

  +1 +1 +1  0  0  0
  –1  0  0 +1 +1  0
   0 –1  0 –1  0 +1
   0  0 –1  0 –1 –1
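Both matrices can be generated programmatically; a small sketch (the function names are mine, chosen for illustration) that reproduces the 4-class examples above:

```python
from itertools import combinations

def one_against_all(k):
    """k-by-k coding matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if r == c else -1 for c in range(k)] for r in range(k)]

def all_pairs(k):
    """k-by-C(k,2) matrix: column (i, j) is +1 for class i, -1 for class j,
    and 0 (don't care) for every other class."""
    cols = list(combinations(range(k), 2))
    M = [[0] * len(cols) for _ in range(k)]
    for s, (i, j) in enumerate(cols):
        M[i][s] = 1
        M[j][s] = -1
    return M

print(all_pairs(4))  # matches the 4-by-6 matrix on slide 12
```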
Slide 13: Algorithm

• Associate each class r with a row of a coding matrix M ∈ {−1, 0, +1}^(k × l)
• Train one binary classifier for each column s = 1, …, l
• When training classifier s, an example labeled y is mapped to M(y, s); omit examples for which M(y, s) = 0
• Obtain l classifiers
• Given a test example x, choose the row y of M that is "closest" to the binary predictions (f1(x), …, fl(x)) according to some distance (e.g. modified Hamming)
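The relabeling step above can be sketched in a few lines; a minimal illustration (the names `relabel_for_column` and the toy data are mine, not from the paper):

```python
# Relabel a multiclass dataset for column s of a coding matrix M, where
# M[y][s] is in {-1, 0, +1}, dropping the "don't care" examples.
def relabel_for_column(X, y, M, s):
    """Return the binary training set for classifier s."""
    Xs, ys = [], []
    for xi, yi in zip(X, y):
        if M[yi][s] != 0:        # omit examples with a 0 entry
            Xs.append(xi)
            ys.append(M[yi][s])  # mapped binary label in {-1, +1}
    return Xs, ys

# Example with the 4-class all-pairs matrix: column 0 compares classes 0 vs 1,
# so examples of classes 2 and 3 are omitted from its training set.
M = [[1, 1, 1, 0, 0, 0],
     [-1, 0, 0, 1, 1, 0],
     [0, -1, 0, -1, 0, 1],
     [0, 0, -1, 0, -1, -1]]
X = ["a", "b", "c", "d"]
y = [0, 1, 2, 3]
print(relabel_for_column(X, y, M, 0))  # (['a', 'b'], [1, -1])
```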
Slide 14: Distance

Two intuitive options:

• a) "Quantize" the predictions and then use a generalized Hamming distance

  Δ(u, v) = Σ_{s=1}^{l} (1 − u_s v_s) / 2

  Identical ±1 entries contribute 0, opposite entries contribute 1, and any pair involving a 0 contributes 1/2:

  String A          1   −1    0    0   −1
  String B          1    1    1    0   −1
  Distance parcels  0    1   0.5  0.5   0

• b) Loss-based decoding: for each row, calculate the margin zs of each classifier s and sum their losses L(zs), adopting the same L used by the binary classifier. This sounds better because the magnitude of the predictions is an indication of a level of "confidence"
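Option a) is one line of code; a sketch that reproduces the String A / String B example:

```python
# Generalized Hamming distance between quantized vectors with entries in
# {-1, 0, +1}: delta(u, v) = sum_s (1 - u_s * v_s) / 2.
def hamming_distance(u, v):
    return sum((1 - us * vs) / 2 for us, vs in zip(u, v))

A = [1, -1, 0, 0, -1]
B = [1, 1, 1, 0, -1]
print(hamming_distance(A, B))  # parcels 0 + 1 + 0.5 + 0.5 + 0 = 2.0
```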
Slide 15: [Diagram] Hamming vs. loss-based decoding; Hamming decoding inserts a quantization step before computing the distance
Slide 16: Losses for classes 3 and 4 in fig. 2

  prediction = [0.5 -7 -1 -2 -10 -12 9]
  row_3      = [+1   0 -1 -1 -1  +1 +1]
  row_4      = [-1  -1 +1  0 -1  -1 +1]
  row_r_loss = exp(-prediction .* row_r)
  loss       = sum(row_r_loss)

[Plot: row_r_loss (log scale, 10^-6 to 10^6) per binary classifier #1-7; the single mismatched entry of row 3 has a big influence on its total loss]
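The slide's MATLAB-style snippet translates directly to Python; a sketch (the function name `loss_based_score` is mine) showing how one badly mismatched prediction dominates the exponential loss of row 3:

```python
import math

# Exponential loss of each binary prediction against rows 3 and 4 of the
# coding matrix, as in the slide's MATLAB-style computation.
prediction = [0.5, -7, -1, -2, -10, -12, 9]
row_3 = [+1, 0, -1, -1, -1, +1, +1]
row_4 = [-1, -1, +1, 0, -1, -1, +1]

def loss_based_score(prediction, row):
    # A 0 entry contributes exp(0) = 1, i.e. L(0).
    return sum(math.exp(-p * m) for p, m in zip(prediction, row))

print(loss_based_score(prediction, row_3))  # dominated by exp(12) from classifier 6
print(loss_based_score(prediction, row_4))  # much smaller, so row 4 wins the decoding
```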
Slide 17: Bound on training error

• Average binary loss:

  ε = (1 / (m l)) Σ_{i=1}^{m} Σ_{s=1}^{l} L( M(yi, s) fs(xi) )

• Minimum distance between any pair of rows:

  ρ = min { Δ(M(r1), M(r2)) : r1 ≠ r2 }

• Bound on the multiclass training error E:

  E ≤ l ε / (ρ L(0))

One-against-all → ρ = 2:

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1

All-pairs → ρ = 1 + (l − 1)/2:

  +1 +1 +1  0  0  0
  –1  0  0 +1 +1  0
   0 –1  0 –1  0 +1
   0  0 –1  0 –1 –1
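The two values of ρ can be verified mechanically with the slide-14 distance (helper names are mine):

```python
from itertools import combinations

# Minimum generalized-Hamming distance rho between rows of a coding matrix,
# with delta(u, v) = sum_s (1 - u_s * v_s) / 2 as defined on slide 14.
def rho(M):
    def delta(u, v):
        return sum((1 - us * vs) / 2 for us, vs in zip(u, v))
    return min(delta(M[r1], M[r2]) for r1, r2 in combinations(range(len(M)), 2))

one_vs_all = [[1, -1, -1, -1], [-1, 1, -1, -1], [-1, -1, 1, -1], [-1, -1, -1, 1]]
all_pairs = [[1, 1, 1, 0, 0, 0], [-1, 0, 0, 1, 1, 0],
             [0, -1, 0, -1, 0, 1], [0, 0, -1, 0, -1, -1]]

print(rho(one_vs_all))  # 2.0
print(rho(all_pairs))   # 3.5 = 1 + (l - 1)/2 with l = 6
```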
Slide 18: Bound on training error, E ≤ l ε / (ρ L(0))

Simulation: AdaBoost with L(0) = 1

• One-against-all (l = k, ρ = 2):  E ≤ k ε / 2
• Complete code (l = 2^(k−1) − 1, ρ = 2^(k−2)):  E ≤ (2^(k−1) − 1) ε / 2^(k−2) ≈ 2ε

[Two plots, complete code and one-against-all, error (%) vs. # classes (3 to 8): average binary loss, actual classification error, theoretical bound]
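Plugging l and ρ into the bound shows the contrast the plots illustrate: a quick sketch (the helper names are mine) of why the complete code's bound stays near 2ε while the one-against-all bound grows linearly in k:

```python
# Bound E <= l * eps / (rho * L(0)) evaluated for AdaBoost, where L(0) = 1.
def bound_one_vs_all(k, eps):
    return k * eps / 2                # l = k, rho = 2

def bound_complete(k, eps):
    l = 2 ** (k - 1) - 1
    rho = 2 ** (k - 2)
    return l * eps / rho              # approaches 2 * eps as k grows

for k in (3, 8):
    print(k, bound_one_vs_all(k, 0.1), bound_complete(k, 0.1))
```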
Slide 19: Bound on training error

Requirement on the loss:

  (L(z) + L(−z)) / 2 ≥ L(0) > 0

The loss does not need to be convex: e.g. sin(z) + 1 satisfies the requirement.

[Plot: margin z = y f(x) from −2.5 to 2.5 vs. loss L(z) from 0 to 2, showing sin(z) + 1]
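The requirement is easy to spot-check numerically for the losses from slide 7 and for the non-convex example; a sketch (the sampling grid and tolerance are my choices):

```python
import math

# Numerically check the requirement (L(z) + L(-z)) / 2 >= L(0) > 0
# for the margin losses on slide 7 plus the non-convex sin(z) + 1.
losses = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1 - z) ** 2,
    "logistic": lambda z: math.log(1 + math.exp(-2 * z)),
    "sin(z)+1": lambda z: math.sin(z) + 1,
}

def satisfies_requirement(L, zs, tol=1e-12):
    return L(0) > 0 and all((L(z) + L(-z)) / 2 >= L(0) - tol for z in zs)

zs = [x / 10 for x in range(-25, 26)]  # the plot's range, -2.5 to 2.5
for name, L in losses.items():
    print(name, satisfies_requirement(L, zs))  # all True
```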
Slide 20: Experimental results

• Tradeoff between simple binary problems and a robust coding matrix
• Hamming versus loss-based decoding (AdaBoost and SVM)
• Comparison among different output codes (AdaBoost and SVM)
Slide 21: Experiment 1 - synthetic data

• Two settings shown: 3 classes and 8 classes
• Set thresholds to have exactly 100 examples per class
• Label test examples using these thresholds
• Use AdaBoost
  – Weak hypotheses: a set of thresholds; a threshold t would label x as +1 if |x| < t and −1 otherwise
  – 10 rounds
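The weak hypothesis described above is trivial to write down; a sketch (the function name is mine):

```python
# The weak hypothesis of slide 21: a threshold t labels x as +1
# when |x| < t and -1 otherwise.
def threshold_hypothesis(t):
    return lambda x: 1 if abs(x) < t else -1

h = threshold_hypothesis(2.0)
print(h(1.5), h(-1.5), h(3.0))  # 1 1 -1
```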
Slide 22: Comparisons

• a) Hamming vs. loss-based decoding
• b) simple binary problems vs. robust matrix

[Two plots: complete code and one-against-all; AdaBoost used in both simulations]
Slide 23: Complete code vs. one-against-all (k = 4)

Complete code, k by 2^(k−1) − 1:

  +1 –1 –1 –1 +1 –1 –1
  –1 +1 –1 –1 +1 +1 +1
  –1 –1 +1 –1 –1 –1 +1
  –1 –1 –1 +1 –1 +1 –1

One-against-all, k by k:

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1
Slide 24: Comparison of different output codes

• Experiments with UCI databases
• SVM: polynomial kernels of degree 4
• AdaBoost: decision stumps for base hypotheses
• 5 output codes: one-against-all, complete, all-pairs, dense and sparse
  – Design of dense and sparse: try 10,000 random codes, pick a code with high ρ
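The dense/sparse design step can be sketched as a random search for a matrix with large ρ. Note the sampling distribution below (0 with probability 1/2, ±1 with probability 1/4 each) and the trial count are illustrative assumptions, not necessarily the paper's exact recipe:

```python
import random
from itertools import combinations

def delta(u, v):
    """Generalized Hamming distance from slide 14."""
    return sum((1 - us * vs) / 2 for us, vs in zip(u, v))

def rho(M):
    """Minimum distance between any pair of rows."""
    return min(delta(a, b) for a, b in combinations(M, 2))

def pick_sparse_code(k, l, trials=1000, seed=0):
    """Draw random sparse coding matrices and keep the one with highest rho."""
    rng = random.Random(seed)
    best, best_rho = None, -1.0
    for _ in range(trials):
        M = [[rng.choice([-1, 0, 0, 1]) for _ in range(l)] for _ in range(k)]
        r = rho(M)
        if r > best_rho:
            best, best_rho = M, r
    return best, best_rho

M, r = pick_sparse_code(k=5, l=10)
print(r)
```

A production version would also reject degenerate columns (all zeros, or lacking both a +1 and a −1), since such a column gives the binary learner nothing to train on.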
Slide 25: [Results figure, no text extracted]

Slide 26: [Bar chart] Error obtained with Hamming decoding minus error with loss-based decoding; a negative height indicates loss-based outperformed Hamming decoding
Slide 27: [Bar chart, SVM] Entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier; a positive height indicates classifier c outperformed classifier r
Slide 28: [Bar chart, AdaBoost] Entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier; a positive height indicates classifier c outperformed classifier r
Slide 29: Conclusions

• The bounds give insight about the tradeoffs but can be of limited use in practice
• Experiments show that in most cases:
  – Loss-based decoding is better than Hamming decoding
  – One-against-all is outperformed (SVM)
• Choosing / designing the coding matrix is an open problem, and the best approach is possibly task-dependent (Crammer & Singer, 2000)

Slide 30: [No text extracted]