Reducing Multiclass to Binary, 2000. cseweb.ucsd.edu/~elkan/254spring01/aldebaro1.pdf. Aldebaro Klautau
TRANSCRIPT
Aldebaro Klautau
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Slide 1: Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Erin Allwein, Robert Schapire and Yoram Singer
Journal of Machine Learning Research, 1:113-141, 2000

CSE 254: Seminar on Learning Algorithms
Professor: Charles Elkan
Student: Aldebaro Klautau
April 23, 2001
Slide 2: Outline

• Motivation
• Definitions
  – Margin, loss, output coding
• Paper contributions
  – Unified view of classifiers, "0" entries in coding matrix, bounds and simulations
• Algorithm
• Bound on training error
• Experimental results
  – Compare: a) Hamming vs. loss-based decoding; b) different coding matrices; c) simple binary problems vs. robust matrix
• Conclusions
Slide 3: Motivation

• Many applications require multiclass categorization
• Some algorithms handle the multiclass case directly, e.g. C4.5 and CART
• ☹ Others do not easily handle the multiclass case, such as AdaBoost and SVM
• ☺ Alternative: reduce the multiclass problem to multiple binary classifications
4
Reducing multiclass to multiple binary problems
z Paper proposes general framework that unifies several methods, namely:binary margin-based algorithms coupled through a coding matrix
Slide 5: Binary margin-based learning algorithm

• Input: a total of m binary labeled training examples (x1, y1), …, (xm, ym) such that xi ∈ X and yi ∈ {−1, +1}
• Output: a real-valued function (hypothesis) f : X → ℝ
• The binary margin of a training example (x, y) is defined as z = y f(x)
• z > 0 if and only if f classifies x correctly
Slide 6

• ☹ It is difficult to minimize classification error directly
• ☺ Instead, minimize some loss function L(z) = L(y f(x)) of the margin of each example (x, y)
• Algorithms that use a margin-based loss: support-vector machines (SVM), AdaBoost, regression, decision trees, etc.
• Example: neural networks and least-squares regression attempt to minimize the squared error. Since y ∈ {−1, +1}, y² = 1 and

  (y − f(x))² = y²(y − f(x))² = (y² − y f(x))² = (1 − y f(x))²

  so the squared error is the margin-based loss L(z) = (1 − z)²
• Classification error itself corresponds to the 0-1 step loss of the margin (1 when z ≤ 0, 0 otherwise)
Slide 7: [Plot] Loss functions of popular algorithms, L : ℝ → [0, ∞)

• AdaBoost: exp(−z)
• Regression: (1 − z)²
• Logistic: log(1 + exp(−2z))
• SVM: (1 − z)+
• Axes: margin z = y f(x) (horizontal) vs. loss L(z) (vertical)
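The four losses on this slide can be written down directly; a minimal sketch in Python (the dictionary keys simply mirror the slide's legend):

```python
import math

# Margin-based losses from slide 7, each mapping a margin z to a loss in [0, inf).
losses = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1 - z) ** 2,
    "logistic": lambda z: math.log(1 + math.exp(-2 * z)),
    "svm": lambda z: max(0.0, 1 - z),  # hinge loss (1 - z)+
}

# Each loss penalizes negative margins (mistakes) and shrinks as the
# margin grows, which is the common shape visible in the plot.
for name, L in losses.items():
    print(f"{name}: L(-1)={L(-1):.3f}  L(0)={L(0):.3f}  L(2)={L(2):.3f}")
```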
Slide 8

• Scenario: the training sequence has labels drawn from a set with cardinality k > 2, and the algorithm(s) assume k = 2
• Two popular alternatives: one-against-all and all-pairs
• For k = 10 classes:

  Approach          Number of classifiers
  one-against-all   10
  all-pairs         45
  complete          511
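The counts in the table follow from closed-form expressions (k columns for one-against-all, k-choose-2 for all-pairs, and 2^(k−1) − 1 distinct non-trivial binary splits for the complete code); a quick check:

```python
from math import comb

def n_classifiers(k):
    """Number of binary classifiers for k classes under each reduction."""
    return {
        "one-against-all": k,          # one column per class
        "all-pairs": comb(k, 2),       # one column per unordered pair of classes
        "complete": 2 ** (k - 1) - 1,  # all distinct non-trivial binary splits
    }

print(n_classifiers(10))  # {'one-against-all': 10, 'all-pairs': 45, 'complete': 511}
```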
Slide 9: Output coding for solving multiclass problems

Code word (each column corresponds to a binary classifier):

  Class | vertical line | horizontal line | diagonal line | closed curve | curve open to left | curve open to right
    0   |      –1       |       –1        |      –1       |      +1      |        –1          |        –1
    1   |      +1       |       –1        |      –1       |      –1      |        –1          |        –1
    2   |      –1       |       +1        |      +1       |      –1      |        +1          |        –1
    3   |      –1       |       –1        |      –1       |      –1      |        +1          |        –1
    …   |                                        …
    9   |      –1       |       –1        |      +1       |      +1      |        –1          |        –1

k = number of classes (rows) = 10; l = number of classifiers (columns) = 6
Slide 10: Error correcting output codes (ECOC, proposed by Dietterich and Bakiri, 1995)

• Associate each class r with a row of a "coding matrix" M
• Train one binary classifier per column; each example labeled y is mapped to M(y, s) when training classifier s
• Obtain l hypotheses f1 to fl
• Given a test example x, choose the row y of M that is "closest" to the binary predictions (f1(x), …, fl(x)) according to some distance (e.g. Hamming)

(The slide repeats the digit-recognition coding table from slide 9.)
Slide 11: Summary of paper contributions

• Unified view of margin classifiers
• Possibility of "0" entries in the ECOC matrix
• Decoding when the matrix has "0" entries
• Bound on training error (general)
• Bound on testing error (restricted to AdaBoost)
• Experimental results
Slide 12: Idea: allow "0" entries

• The coding matrix M is taken from the extended set {−1, 0, +1}^(k × l)
• The entry M(r, s) = 0 indicates we do not care how fs categorizes examples with label r

One-against-all → k by k (here k = 4):

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1

All-pairs → k by C(k, 2) (here 4 by 6):

  +1 +1 +1  0  0  0
  –1  0  0 +1 +1  0
   0 –1  0 –1  0 +1
   0  0 –1  0 –1 –1
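Both matrices can be generated programmatically; a small sketch (the function names are mine, chosen for illustration) that reproduces the 4-class examples above:

```python
from itertools import combinations

def one_against_all(k):
    """k-by-k coding matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if r == c else -1 for c in range(k)] for r in range(k)]

def all_pairs(k):
    """k-by-C(k,2) matrix: column (i, j) is +1 for class i, -1 for class j,
    and 0 (don't care) for every other class."""
    cols = list(combinations(range(k), 2))
    M = [[0] * len(cols) for _ in range(k)]
    for s, (i, j) in enumerate(cols):
        M[i][s] = 1
        M[j][s] = -1
    return M

print(all_pairs(4))  # matches the 4-by-6 matrix on slide 12
```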
Slide 13: Algorithm

• Associate each class r with a row of a coding matrix M ∈ {−1, 0, +1}^(k × l)
• Train one binary classifier for each column s = 1, …, l
• When training classifier s, an example labeled y is mapped to M(y, s); omit examples for which M(y, s) = 0
• Obtain l classifiers
• Given a test example x, choose the row y of M that is "closest" to the binary predictions (f1(x), …, fl(x)) according to some distance (e.g. modified Hamming)
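The relabeling step above can be sketched in a few lines; a minimal illustration (the names `relabel_for_column` and the toy data are mine, not from the paper):

```python
# Relabel a multiclass dataset for column s of a coding matrix M, where
# M[y][s] is in {-1, 0, +1}, dropping the "don't care" examples.
def relabel_for_column(X, y, M, s):
    """Return the binary training set for classifier s."""
    Xs, ys = [], []
    for xi, yi in zip(X, y):
        if M[yi][s] != 0:        # omit examples with a 0 entry
            Xs.append(xi)
            ys.append(M[yi][s])  # mapped binary label in {-1, +1}
    return Xs, ys

# Example with the 4-class all-pairs matrix: column 0 compares classes 0 vs 1,
# so examples of classes 2 and 3 are omitted from its training set.
M = [[1, 1, 1, 0, 0, 0],
     [-1, 0, 0, 1, 1, 0],
     [0, -1, 0, -1, 0, 1],
     [0, 0, -1, 0, -1, -1]]
X = ["a", "b", "c", "d"]
y = [0, 1, 2, 3]
print(relabel_for_column(X, y, M, 0))  # (['a', 'b'], [1, -1])
```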
Slide 14: Distance

Two intuitive options:

• a) "Quantize" the predictions and then use a generalized Hamming distance

  Δ(u, v) = Σ_{s=1}^{l} (1 − u_s v_s) / 2

  Identical ±1 entries contribute 0, opposite entries contribute 1, and any pair involving a 0 contributes 1/2:

  String A          1   −1    0    0   −1
  String B          1    1    1    0   −1
  Distance parcels  0    1   0.5  0.5   0

• b) Loss-based decoding: for each row, calculate the margin zs of each classifier s and sum their losses L(zs), adopting the same L used by the binary classifier. This sounds better because the magnitude of the predictions is an indication of a level of "confidence"
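Option a) is one line of code; a sketch that reproduces the String A / String B example:

```python
# Generalized Hamming distance between quantized vectors with entries in
# {-1, 0, +1}: delta(u, v) = sum_s (1 - u_s * v_s) / 2.
def hamming_distance(u, v):
    return sum((1 - us * vs) / 2 for us, vs in zip(u, v))

A = [1, -1, 0, 0, -1]
B = [1, 1, 1, 0, -1]
print(hamming_distance(A, B))  # parcels 0 + 1 + 0.5 + 0.5 + 0 = 2.0
```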
Slide 15: [Diagram] Hamming vs. loss-based decoding; Hamming decoding inserts a quantization step before computing the distance
Slide 16: Losses for classes 3 and 4 in fig. 2

  prediction = [0.5 -7 -1 -2 -10 -12 9]
  row_3      = [+1   0 -1 -1 -1  +1 +1]
  row_4      = [-1  -1 +1  0 -1  -1 +1]
  row_r_loss = exp(-prediction .* row_r)
  loss       = sum(row_r_loss)

[Plot: row_r_loss (log scale, 10^-6 to 10^6) per binary classifier #1-7; the single mismatched entry of row 3 has a big influence on its total loss]
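The slide's MATLAB-style snippet translates directly to Python; a sketch (the function name `loss_based_score` is mine) showing how one badly mismatched prediction dominates the exponential loss of row 3:

```python
import math

# Exponential loss of each binary prediction against rows 3 and 4 of the
# coding matrix, as in the slide's MATLAB-style computation.
prediction = [0.5, -7, -1, -2, -10, -12, 9]
row_3 = [+1, 0, -1, -1, -1, +1, +1]
row_4 = [-1, -1, +1, 0, -1, -1, +1]

def loss_based_score(prediction, row):
    # A 0 entry contributes exp(0) = 1, i.e. L(0).
    return sum(math.exp(-p * m) for p, m in zip(prediction, row))

print(loss_based_score(prediction, row_3))  # dominated by exp(12) from classifier 6
print(loss_based_score(prediction, row_4))  # much smaller, so row 4 wins the decoding
```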
Slide 17: Bound on training error

• Average binary loss:

  ε = (1 / (m l)) Σ_{i=1}^{m} Σ_{s=1}^{l} L( M(yi, s) fs(xi) )

• Minimum distance between any pair of rows:

  ρ = min { Δ(M(r1), M(r2)) : r1 ≠ r2 }

• Bound on the multiclass training error E:

  E ≤ l ε / (ρ L(0))

One-against-all → ρ = 2:

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1

All-pairs → ρ = 1 + (l − 1)/2:

  +1 +1 +1  0  0  0
  –1  0  0 +1 +1  0
   0 –1  0 –1  0 +1
   0  0 –1  0 –1 –1
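The two values of ρ can be verified mechanically with the slide-14 distance (helper names are mine):

```python
from itertools import combinations

# Minimum generalized-Hamming distance rho between rows of a coding matrix,
# with delta(u, v) = sum_s (1 - u_s * v_s) / 2 as defined on slide 14.
def rho(M):
    def delta(u, v):
        return sum((1 - us * vs) / 2 for us, vs in zip(u, v))
    return min(delta(M[r1], M[r2]) for r1, r2 in combinations(range(len(M)), 2))

one_vs_all = [[1, -1, -1, -1], [-1, 1, -1, -1], [-1, -1, 1, -1], [-1, -1, -1, 1]]
all_pairs = [[1, 1, 1, 0, 0, 0], [-1, 0, 0, 1, 1, 0],
             [0, -1, 0, -1, 0, 1], [0, 0, -1, 0, -1, -1]]

print(rho(one_vs_all))  # 2.0
print(rho(all_pairs))   # 3.5 = 1 + (l - 1)/2 with l = 6
```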
Slide 18: Bound on training error, E ≤ l ε / (ρ L(0))

Simulation: AdaBoost with L(0) = 1

• One-against-all (l = k, ρ = 2):  E ≤ k ε / 2
• Complete code (l = 2^(k−1) − 1, ρ = 2^(k−2)):  E ≤ (2^(k−1) − 1) ε / 2^(k−2) ≈ 2ε

[Two plots, complete code and one-against-all, error (%) vs. # classes (3 to 8): average binary loss, actual classification error, theoretical bound]
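Plugging l and ρ into the bound shows the contrast the plots illustrate: a quick sketch (the helper names are mine) of why the complete code's bound stays near 2ε while the one-against-all bound grows linearly in k:

```python
# Bound E <= l * eps / (rho * L(0)) evaluated for AdaBoost, where L(0) = 1.
def bound_one_vs_all(k, eps):
    return k * eps / 2                # l = k, rho = 2

def bound_complete(k, eps):
    l = 2 ** (k - 1) - 1
    rho = 2 ** (k - 2)
    return l * eps / rho              # approaches 2 * eps as k grows

for k in (3, 8):
    print(k, bound_one_vs_all(k, 0.1), bound_complete(k, 0.1))
```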
Slide 19: Bound on training error

Requirement on the loss:

  (L(z) + L(−z)) / 2 ≥ L(0) > 0

The loss does not need to be convex: e.g. sin(z) + 1 satisfies the requirement.

[Plot: margin z = y f(x) from −2.5 to 2.5 vs. loss L(z) from 0 to 2, showing sin(z) + 1]
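The requirement is easy to spot-check numerically for the losses from slide 7 and for the non-convex example; a sketch (the sampling grid and tolerance are my choices):

```python
import math

# Numerically check the requirement (L(z) + L(-z)) / 2 >= L(0) > 0
# for the margin losses on slide 7 plus the non-convex sin(z) + 1.
losses = {
    "adaboost": lambda z: math.exp(-z),
    "regression": lambda z: (1 - z) ** 2,
    "logistic": lambda z: math.log(1 + math.exp(-2 * z)),
    "sin(z)+1": lambda z: math.sin(z) + 1,
}

def satisfies_requirement(L, zs, tol=1e-12):
    return L(0) > 0 and all((L(z) + L(-z)) / 2 >= L(0) - tol for z in zs)

zs = [x / 10 for x in range(-25, 26)]  # the plot's range, -2.5 to 2.5
for name, L in losses.items():
    print(name, satisfies_requirement(L, zs))  # all True
```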
Slide 20: Experimental results

• Tradeoff between simple binary problems and a robust coding matrix
• Hamming versus loss-based decoding (AdaBoost and SVM)
• Comparison among different output codes (AdaBoost and SVM)
Slide 21: Experiment 1 - synthetic data

• Two settings shown: 3 classes and 8 classes
• Set thresholds to have exactly 100 examples per class
• Label test examples using these thresholds
• Use AdaBoost
  – Weak hypotheses: a set of thresholds; a threshold t would label x as +1 if |x| < t and −1 otherwise
  – 10 rounds
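The weak hypothesis described above is trivial to write down; a sketch (the function name is mine):

```python
# The weak hypothesis of slide 21: a threshold t labels x as +1
# when |x| < t and -1 otherwise.
def threshold_hypothesis(t):
    return lambda x: 1 if abs(x) < t else -1

h = threshold_hypothesis(2.0)
print(h(1.5), h(-1.5), h(3.0))  # 1 1 -1
```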
Slide 22: Comparisons

• a) Hamming vs. loss-based decoding
• b) simple binary problems vs. robust matrix

[Two plots: complete code and one-against-all; AdaBoost used in both simulations]
Slide 23: Complete code vs. one-against-all (k = 4)

Complete code, k by 2^(k−1) − 1:

  +1 –1 –1 –1 +1 –1 –1
  –1 +1 –1 –1 +1 +1 +1
  –1 –1 +1 –1 –1 –1 +1
  –1 –1 –1 +1 –1 +1 –1

One-against-all, k by k:

  +1 –1 –1 –1
  –1 +1 –1 –1
  –1 –1 +1 –1
  –1 –1 –1 +1
Slide 24: Comparison of different output codes

• Experiments with UCI databases
• SVM: polynomial kernels of degree 4
• AdaBoost: decision stumps for base hypotheses
• 5 output codes: one-against-all, complete, all-pairs, dense and sparse
  – Design of dense and sparse: try 10,000 random codes, pick a code with high ρ
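The dense/sparse design step can be sketched as a random search for a matrix with large ρ. Note the sampling distribution below (0 with probability 1/2, ±1 with probability 1/4 each) and the trial count are illustrative assumptions, not necessarily the paper's exact recipe:

```python
import random
from itertools import combinations

def delta(u, v):
    """Generalized Hamming distance from slide 14."""
    return sum((1 - us * vs) / 2 for us, vs in zip(u, v))

def rho(M):
    """Minimum distance between any pair of rows."""
    return min(delta(a, b) for a, b in combinations(M, 2))

def pick_sparse_code(k, l, trials=1000, seed=0):
    """Draw random sparse coding matrices and keep the one with highest rho."""
    rng = random.Random(seed)
    best, best_rho = None, -1.0
    for _ in range(trials):
        M = [[rng.choice([-1, 0, 0, 1]) for _ in range(l)] for _ in range(k)]
        r = rho(M)
        if r > best_rho:
            best, best_rho = M, r
    return best, best_rho

M, r = pick_sparse_code(k=5, l=10)
print(r)
```

A production version would also reject degenerate columns (all zeros, or lacking both a +1 and a −1), since such a column gives the binary learner nothing to train on.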
Slide 25: [Results figure, no text extracted]

Slide 26: [Bar chart] Error obtained with Hamming decoding minus error with loss-based decoding; a negative height indicates loss-based outperformed Hamming decoding
Slide 27: [Bar chart, SVM] Entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier; a positive height indicates classifier c outperformed classifier r
Slide 28: [Bar chart, AdaBoost] Entry (r, c) shows the error of the row-r classifier minus the error of the column-c classifier; a positive height indicates classifier c outperformed classifier r
Slide 29: Conclusions

• The bounds give insight about the tradeoffs but can be of limited use in practice
• Experiments show that in most cases:
  – Loss-based decoding is better than Hamming decoding
  – One-against-all is outperformed (SVM)
• Choosing / designing the coding matrix is an open problem, and the best approach is possibly task-dependent (Crammer & Singer, 2000)

Slide 30: [No text extracted]