
Journal of Machine Learning Research 5 (2004) 975-1005 Submitted 11/03; Revised 05/04; Published 8/04

Probability Estimates for Multi-class Classification by Pairwise Coupling

Ting-Fan Wu [email protected]

Chih-Jen Lin [email protected]

Department of Computer Science
National Taiwan University
Taipei 106, Taiwan

Ruby C. Weng [email protected]

Department of Statistics

National Chengchi University

Taipei 116, Taiwan

Editor: Yoram Singer

Abstract

Pairwise coupling is a popular multi-class classification method that combines all comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).

Keywords: Pairwise Coupling, Probability Estimates, Random Forest, Support Vector Machines

1. Introduction

The multi-class classification problem refers to assigning each of the observations into one of k classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class probability estimate by combining all pairwise comparisons.

A common way to combine pairwise comparisons is by voting (Knerr et al., 1990; Friedman, 1996). It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires just pairwise decisions, it only predicts a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors (Refregier and Vallet, 1991; Price et al., 1995; Hastie and Tibshirani, 1998) have proposed probability estimates by combining the pairwise class probabilities.

Given the observation x and the class label y, we assume that the estimated pairwise class probabilities r_ij of µ_ij = P(y = i | y = i or j, x) are available. From the ith and jth classes of a training set, we obtain a model which, for any new x, calculates r_ij as an approximation of µ_ij. Then, using all r_ij, the goal is to estimate p_i = P(y = i | x), i = 1, ..., k.


In this paper, we first propose a method for obtaining probability estimates via an approximation solution to an identity. The existence of the solution is guaranteed by theory in finite Markov Chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach given by Refregier and Vallet (1991). Both of the proposed methods can be reduced to solving linear systems and are simple in practical implementation. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method by Hastie and Tibshirani (1998).

We organize the paper as follows. In Section 2, we review several existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship between different methods through their corresponding optimization formulas. In Section 6, we compare these methods using simulated data. In Section 7, we conduct experiments using real data. The classifiers considered are support vector machines and random forest. A preliminary version of this paper was presented in (Wu et al., 2004).

2. Survey of Existing Methods

For methods surveyed in this section and those proposed later, each provides a vector of multi-class probability estimates. We denote it as p^* according to method *. Similarly, there is an associated rule arg max_i [p^*_i] for prediction and we denote the rule as δ^*.

2.1 Voting

Let r_ij be the estimates of µ_ij ≡ P(y = i | y = i or j, x) and assume r_ij + r_ji = 1. The voting rule (Knerr et al., 1990; Friedman, 1996) is

\[
\delta_V = \arg\max_i \Big[ \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}} \Big], \tag{1}
\]

where I is the indicator function: I_{x} = 1 if x is true, and 0 otherwise. A simple estimate of probabilities can be derived as

\[
p^v_i = \frac{2 \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}}}{k(k-1)}.
\]
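For illustration, a minimal Python sketch of the voting estimate p^v is given below. The layout of the pairwise matrix (r[i, j] approximating r_ij, with the diagonal unused) is our own convention, and this is not the authors' implementation.

```python
import numpy as np

def voting_estimate(r):
    """Voting-based probability estimate p^v of Section 2.1 (a sketch).

    r is a k x k array with r[i, j] approximating P(y=i | y=i or j, x)
    and r[i, j] + r[j, i] = 1 off the diagonal; the diagonal is ignored.
    """
    k = r.shape[0]
    wins = np.array([sum(r[i, j] > r[j, i] for j in range(k) if j != i)
                     for i in range(k)])
    # p^v_i = 2 * (number of pairwise wins of class i) / (k(k-1))
    return 2.0 * wins / (k * (k - 1))

# Hypothetical 3-class example
r = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.8],
              [0.4, 0.2, 0.0]])
p_v = voting_estimate(r)   # approx. [0.67, 0.33, 0.0]; delta_V picks class 1
```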

2.2 Method by Refregier and Vallet

With µ_ij = p_i/(p_i + p_j), Refregier and Vallet (1991) consider that

\[
\frac{r_{ij}}{r_{ji}} \approx \frac{\mu_{ij}}{\mu_{ji}} = \frac{p_i}{p_j}. \tag{2}
\]

Thus, making (2) an equality may be a way to solve p_i. However, the number of equations, k(k − 1)/2, is more than the number of unknowns k, so Refregier and Vallet (1991) propose to choose any k − 1 r_ij. Then, with the condition \sum_{i=1}^{k} p_i = 1, p^{RV} can be obtained by solving a linear system. However, as pointed out previously by Price et al. (1995), the results depend strongly on the selection of the k − 1 r_ij.

In Section 4, by considering (2) as well, we propose a method which remedies this problem.


2.3 Method by Price, Knerr, Personnaz, and Dreyfus

Price et al. (1995) consider that

\[
\sum_{j: j \neq i} P(y = i \text{ or } j \mid x) - (k-2) P(y = i \mid x) = \sum_{j=1}^{k} P(y = j \mid x) = 1.
\]

Using

\[
r_{ij} \approx \mu_{ij} = \frac{P(y = i \mid x)}{P(y = i \text{ or } j \mid x)},
\]

one obtains

\[
p^{PKPD}_i = \frac{1}{\sum_{j: j \neq i} \frac{1}{r_{ij}} - (k-2)}. \tag{3}
\]

As \sum_{i=1}^{k} p_i = 1 does not hold, we must normalize p^{PKPD}. This approach is very simple and easy to implement. In the rest of this paper, we refer to this method as PKPD.

2.4 Method by Hastie and Tibshirani

Hastie and Tibshirani (1998) propose to minimize the Kullback-Leibler (KL) distance between r_ij and µ_ij:

\[
l(\mathbf{p}) = \sum_{i \neq j} n_{ij} r_{ij} \log \frac{r_{ij}}{\mu_{ij}}
             = \sum_{i < j} n_{ij} \Big( r_{ij} \log \frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - \mu_{ij}} \Big), \tag{4}
\]

where µ_ij = p_i/(p_i + p_j), r_ji = 1 − r_ij, and n_ij is the number of training data in the ith and jth classes.

To minimize (4), they first calculate

\[
\frac{\partial l(\mathbf{p})}{\partial p_i} = \sum_{j: j \neq i} n_{ij} \Big( -\frac{r_{ij}}{p_i} + \frac{1}{p_i + p_j} \Big).
\]

Thus, letting ∂l(p)/∂p_i = 0, i = 1, ..., k, and multiplying p_i on each term, Hastie and Tibshirani (1998) propose finding a point that satisfies

\[
\sum_{j: j \neq i} n_{ij} \mu_{ij} = \sum_{j: j \neq i} n_{ij} r_{ij}, \qquad \sum_{i=1}^{k} p_i = 1, \quad \text{and } p_i > 0, \ i = 1, \ldots, k. \tag{5}
\]

Such a point is obtained by the following algorithm:

Algorithm 1
1. Start with some initial p_j > 0, ∀j, and corresponding µ_ij = p_i/(p_i + p_j).
2. Repeat (i = 1, ..., k, 1, ...)

\[
\alpha = \frac{\sum_{j: j \neq i} n_{ij} r_{ij}}{\sum_{j: j \neq i} n_{ij} \mu_{ij}} \tag{6}
\]
\[
\mu_{ij} \leftarrow \frac{\alpha \mu_{ij}}{\alpha \mu_{ij} + \mu_{ji}}, \quad \mu_{ji} \leftarrow 1 - \mu_{ij}, \quad \text{for all } j \neq i \tag{7}
\]
\[
p_i \leftarrow \alpha p_i \tag{8}
\]
\[
\text{normalize } \mathbf{p} \ \text{(optional)} \tag{9}
\]

   until k consecutive α are all close to one.
3. p ← p / \sum_{i=1}^{k} p_i

Equation (8) implies that in each iteration, only the ith component is updated and all others remain the same. There are several remarks about this algorithm. First, the initial p must be positive so that all later p are positive and α is well defined (i.e., no zero denominator in (6)). Second, (9) is an optional operation because whether we normalize p or not does not affect the values of µ_ij and α in (6) and (7).
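For concreteness, the following Python sketch re-implements Algorithm 1 with equal weights n_ij = 1 (weights can also be supplied); the stopping rule checks that all k values of α in a sweep are close to one. This is an illustration only, not the authors' code.

```python
import numpy as np

def hastie_tibshirani(r, n=None, tol=1e-3, max_sweeps=1000):
    """A sketch of Algorithm 1 (Hastie and Tibshirani, 1998)."""
    k = r.shape[0]
    n = np.ones((k, k)) if n is None else n
    p = np.full(k, 1.0 / k)                       # positive initial point
    mu = p[:, None] / (p[:, None] + p[None, :])   # mu_ij = p_i / (p_i + p_j)
    for _ in range(max_sweeps):
        all_close = True
        for i in range(k):
            num = sum(n[i, j] * r[i, j] for j in range(k) if j != i)
            den = sum(n[i, j] * mu[i, j] for j in range(k) if j != i)
            alpha = num / den                                     # (6)
            for j in range(k):
                if j != i:
                    mu[i, j] = alpha * mu[i, j] / (alpha * mu[i, j] + mu[j, i])  # (7)
                    mu[j, i] = 1.0 - mu[i, j]
            p[i] *= alpha                                         # (8)
            if abs(alpha - 1.0) > tol:
                all_close = False
        if all_close:                 # k consecutive alpha all close to one
            break
    return p / p.sum()                # step 3
```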

Hastie and Tibshirani (1998) prove that Algorithm 1 generates a sequence of points at which the KL distance is strictly decreasing. However, Hunter (2004) indicates that the strict decrease in l(p) does not guarantee that any limit point satisfies (5). Hunter (2004) discusses the convergence of algorithms for generalized Bradley-Terry models, where Algorithm 1 is a special case. It points out that Zermelo (1929) has proved that, if r_ij > 0, ∀ i ≠ j, then for any initial point, the whole sequence generated by Algorithm 1 converges to a point satisfying (5). Furthermore, this point is the unique global minimum of l(p) under the constraints \sum_{i=1}^{k} p_i = 1 and 0 ≤ p_i ≤ 1, i = 1, ..., k.

Let p^HT denote the global minimum of l(p). It is shown in Zermelo (1929) and Theorem 1 of Hastie and Tibshirani (1998) that if the weights n_ij in (4) are considered equal, then p^HT satisfies

\[
p^{HT}_i > p^{HT}_j \ \text{ if and only if } \ \tilde{p}^{HT}_i \equiv \frac{2 \sum_{s: s \neq i} r_{is}}{k(k-1)} > \tilde{p}^{HT}_j \equiv \frac{2 \sum_{s: s \neq j} r_{js}}{k(k-1)}. \tag{10}
\]

Therefore, \tilde{p}^{HT} is sufficient if one only requires the classification rule. In fact, p^HT can be derived as an approximation to the identity

\[
p_i = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \Big( \frac{p_i}{p_i + p_j} \Big) = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \mu_{ij} \tag{11}
\]

by replacing p_i + p_j with 2/k, and µ_ij with r_ij in (11). We refer to the decision rule as δHT, which is essentially

\[
\arg\max_i \ [p^{HT}_i]. \tag{12}
\]

In the next two sections, we propose two methods which are simpler in both practical implementation and algorithmic analysis.

If the multi-class data are balanced, it is reasonable to assume equal weighting (i.e., n_ij = 1) as above. In the rest of this paper, we restrict our discussion to this assumption.


3. Our First Approach

As δHT relies on p_i + p_j ≈ 2/k, in Section 6 we use two examples to illustrate possible problems with this rule. In this section, instead of replacing p_i + p_j by 2/k in (11), we propose to solve the system:

\[
p_i = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) r_{ij}, \ \forall i, \quad \text{subject to} \ \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \tag{13}
\]

Let p^1 denote the solution to (13). Then the resulting decision rule is

\[
\delta_1 = \arg\max_i \ [p^1_i].
\]

3.1 Solving (13)

To solve (13), we rewrite it as

\[
Q\mathbf{p} = \mathbf{p}, \quad \sum_{i=1}^{k} p_i = 1, \quad p_i \geq 0, \ \forall i, \quad \text{where} \
Q_{ij} \equiv
\begin{cases}
r_{ij}/(k-1) & \text{if } i \neq j, \\
\sum_{s: s \neq i} r_{is}/(k-1) & \text{if } i = j.
\end{cases} \tag{14}
\]

Observe that \sum_{i=1}^{k} Q_{ij} = 1 for j = 1, ..., k and 0 ≤ Q_{ij} ≤ 1 for i, j = 1, ..., k, so there exists a finite Markov Chain whose transition matrix is Q. Moreover, if r_ij > 0 for all i ≠ j, then Q_ij > 0, which implies that this Markov Chain is irreducible and aperiodic. From Theorem 4.3.3 of Ross (1996), these conditions guarantee the existence of a unique stationary probability and all states being positive recurrent. Hence, we have the following theorem:

Theorem 1 If r_ij > 0, ∀ i ≠ j, then (14) has a unique solution p with 0 < p_i < 1, ∀i.

Assume the solution from Theorem 1 is p^*. We claim that without the constraints p_i ≥ 0, ∀i, the linear system

\[
Q\mathbf{p} = \mathbf{p}, \quad \sum_{i=1}^{k} p_i = 1 \tag{15}
\]

still has the same unique solution p^*. Otherwise, there is another solution \bar{p}^* (≠ p^*). Then for any 0 ≤ λ ≤ 1, λp^* + (1 − λ)\bar{p}^* satisfies (15) as well. As p^*_i > 0, ∀i, when λ is sufficiently close to 1, λp^*_i + (1 − λ)\bar{p}^*_i > 0, i = 1, ..., k. This violates the uniqueness property in Theorem 1.

Therefore, unlike the method in Section 2.4 where a special iterative procedure has to be implemented, here we only solve a simple linear system. As (15) has k + 1 equalities but only k variables, in practice we remove any one equality from Qp = p and obtain a square system. Since the column sum of Q is the vector of all ones, the removed equality is a linear combination of all remaining equalities. Thus, any solution of the square system satisfies (15) and vice versa. Therefore, this square system has the same unique solution as (15) and hence can be solved by standard Gaussian elimination.

Instead of Gaussian elimination, as the stationary solution of a Markov Chain can be derived as the limit of the n-step transition probability matrix Q^n, we can also solve (13) by repeatedly multiplying Q with any initial probability vector.
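As an illustration, the square linear system described above can be set up and solved directly; the sketch below does so with NumPy (the layout r[i, j] ≈ r_ij is our assumption, and this is not the authors' implementation).

```python
import numpy as np

def first_approach(r):
    """Sketch of the first approach (Section 3.1).

    Builds Q as in (14) and solves the square system obtained by
    replacing one equality of Qp = p with sum_i p_i = 1.
    """
    k = r.shape[0]
    Q = r / (k - 1.0)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, Q.sum(axis=1))   # Q_ii = sum_{s != i} r_is / (k-1)

    A = Q - np.eye(k)                    # (Q - I)p = 0
    A[-1, :] = 1.0                       # replace the last equality by sum_i p_i = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

A power-iteration variant (repeatedly setting p ← Qp from any probability vector) reaches the same stationary solution, as noted above.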


3.2 Another Look at (13)

The following arguments show that the solution to (13) is a global minimum of a meaningful optimization problem. To begin, using the property that r_ij + r_ji = 1, ∀ i ≠ j, we re-express p_i = \sum_{j: j \neq i} ((p_i + p_j)/(k-1)) r_{ij} of (14) (i.e., Qp = p of (15)) as

\[
\sum_{j: j \neq i} r_{ji} p_i - \sum_{j: j \neq i} r_{ij} p_j = 0, \quad i = 1, \ldots, k.
\]

Therefore, a solution of (14) is in fact the unique global minimum of the following convex problem:

\[
\min_{\mathbf{p}} \ \sum_{i=1}^{k} \Big( \sum_{j: j \neq i} r_{ji} p_i - \sum_{j: j \neq i} r_{ij} p_j \Big)^2 \quad \text{subject to} \ \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ i = 1, \ldots, k. \tag{16}
\]

The reason is that the objective function is always nonnegative, and it attains zero under (14). Note that the constraints p_i ≥ 0, ∀i, are not redundant following the discussion around Equation (15).

4. Our Second Approach

Note that both approaches in Sections 2.4 and 3 involve solving optimization problems using relations like p_i/(p_i + p_j) ≈ r_ij or \sum_{j: j \neq i} r_{ji} p_i ≈ \sum_{j: j \neq i} r_{ij} p_j. Motivated by (16), we suggest another optimization formulation as follows:

\[
\min_{\mathbf{p}} \ \sum_{i=1}^{k} \sum_{j: j \neq i} ( r_{ji} p_i - r_{ij} p_j )^2 \quad \text{subject to} \ \sum_{i=1}^{k} p_i = 1, \ p_i \geq 0, \ \forall i. \tag{17}
\]

Note that the method (Refregier and Vallet, 1991) described in Section 2.2 considers a random selection of k − 1 equations of the form r_ji p_i = r_ij p_j. As (17) considers all r_ij p_j − r_ji p_i, not just k − 1 of them, it can be viewed as an improved version of the coupling approach by Refregier and Vallet (1991).

Let p^2 denote the corresponding solution. We then define the classification rule as

\[
\delta_2 = \arg\max_i \ [p^2_i].
\]

4.1 A Linear System from (17)

Since (16) has a unique solution, which can be obtained by solving a simple linear system, it is desirable to see whether the minimization problem (17) has these nice properties. In this subsection, we show that (17) has a unique solution and can be solved by a simple linear system.

First, the following theorem shows that the nonnegative constraints in (17) are redundant.


Theorem 2 Problem (17) is equivalent to

\[
\min_{\mathbf{p}} \ \sum_{i=1}^{k} \sum_{j: j \neq i} ( r_{ji} p_i - r_{ij} p_j )^2 \quad \text{subject to} \ \sum_{i=1}^{k} p_i = 1. \tag{18}
\]

The proof is in Appendix A. Note that we can rewrite the objective function of (18) as

\[
\min_{\mathbf{p}} \ 2\mathbf{p}^T Q \mathbf{p} \equiv \min_{\mathbf{p}} \ \frac{1}{2}\mathbf{p}^T Q \mathbf{p}, \tag{19}
\]

where

\[
Q_{ij} =
\begin{cases}
\sum_{s: s \neq i} r_{si}^2 & \text{if } i = j, \\
- r_{ji} r_{ij} & \text{if } i \neq j.
\end{cases} \tag{20}
\]

We divide the objective function by a positive factor of four so that its derivative is the simple form Qp. From (20), Q is positive semi-definite, as for any v ≠ 0,

\[
\mathbf{v}^T Q \mathbf{v} = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} ( r_{ji} v_i - r_{ij} v_j )^2 \geq 0.
\]

Therefore, without the constraints p_i ≥ 0, ∀i, (19) is a linear-equality-constrained convex quadratic programming problem. Consequently, a point p is a global minimum if and only if it satisfies the optimality condition: there is a scalar b such that

\[
\begin{bmatrix} Q & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix}
\begin{bmatrix} \mathbf{p} \\ b \end{bmatrix}
=
\begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix}. \tag{21}
\]

Here Qp is the derivative of (19), b is the Lagrangian multiplier of the equality constraint \sum_{i=1}^{k} p_i = 1, e is the k × 1 vector of all ones, and 0 is the k × 1 vector of all zeros. Thus, the solution to (17) can be obtained by solving the simple linear system (21).
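A direct solve of (21) is a few lines of NumPy; the sketch below forms Q from (20) and calls a generic dense solver rather than the Cholesky-based variants discussed next. The r layout is the hypothetical convention used earlier, and this is not the authors' implementation.

```python
import numpy as np

def second_approach(r):
    """Sketch of the second approach (Section 4.1): solve the system (21)."""
    k = r.shape[0]
    Q = -r.T * r                                             # Q_ij = -r_ji r_ij, i != j
    np.fill_diagonal(Q, (r**2).sum(axis=0) - np.diag(r)**2)  # Q_ii = sum_{s != i} r_si^2

    A = np.zeros((k + 1, k + 1))     # augmented matrix [[Q, e], [e^T, 0]]
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    b = np.zeros(k + 1)
    b[k] = 1.0
    sol = np.linalg.solve(A, b)
    return sol[:k]                   # p; sol[k] is the multiplier b
```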

4.2 Solving (21)

Equation (21) can be solved by some direct methods in numerical linear algebra. Theorem 3(i) below shows that the matrix in (21) is invertible; therefore, Gaussian elimination can be easily applied.

For symmetric positive definite systems, Cholesky factorization reduces the time for Gaussian elimination by half. Though (21) is symmetric but not positive definite, if Q is positive definite, Cholesky factorization can be used to obtain b = −1/(e^T Q^{-1} e) first and then p = −b Q^{-1} e. Theorem 3(ii) shows that Q is positive definite under quite general conditions. Moreover, even if Q is only positive semi-definite, Theorem 3(i) proves that Q + ∆ee^T is positive definite for any constant ∆ > 0. Along with the fact that (21) is equivalent to

\[
\begin{bmatrix} Q + \Delta \mathbf{e}\mathbf{e}^T & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix}
\begin{bmatrix} \mathbf{p} \\ b \end{bmatrix}
=
\begin{bmatrix} \Delta \mathbf{e} \\ 1 \end{bmatrix},
\]

we can do Cholesky factorization on Q + ∆ee^T and solve b and p similarly, regardless of whether Q is positive definite or not.

Theorem 3 If r_tu > 0, ∀ t ≠ u, we have

(i) For any ∆ > 0, Q + ∆ee^T is positive definite. In addition,
\[
\begin{bmatrix} Q & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix}
\]
is invertible, and hence (17) has a unique global minimum.

(ii) If for any i = 1, ..., k, there are s ≠ j for which s ≠ i, j ≠ i, and

\[
\frac{r_{si} r_{js}}{r_{is}} \neq \frac{r_{ji} r_{sj}}{r_{ij}}, \tag{22}
\]

then Q is positive definite.

We leave the proof in Appendix B.

In addition to direct methods, next we propose a simple iterative method for solving (21):

Algorithm 2
1. Start with some initial p_i ≥ 0, ∀i, and \sum_{i=1}^{k} p_i = 1.
2. Repeat (t = 1, ..., k, 1, ...)

\[
p_t \leftarrow \frac{1}{Q_{tt}} \Big[ -\sum_{j: j \neq t} Q_{tj} p_j + \mathbf{p}^T Q \mathbf{p} \Big] \tag{23}
\]
\[
\text{normalize } \mathbf{p} \tag{24}
\]

   until (21) is satisfied.

Equation (20) and the assumption r_ij > 0, ∀ i ≠ j, ensure that the right-hand side of (23) is always nonnegative. For (24) to be well defined, we must ensure that \sum_{i=1}^{k} p_i > 0 after the operation in (23). This property holds (see (40) for more explanation). With b = −p^T Qp obtained from (21), (23) is motivated from the tth equality in (21) with b replaced by −p^T Qp. The convergence of Algorithm 2 is established in the following theorem:

Theorem 4 If r_sj > 0, ∀ s ≠ j, then {p^i}_{i=1}^{∞}, the sequence generated by Algorithm 2, converges globally to the unique minimum of (17).

The proof is in Appendix C. Algorithm 2 is implemented in the software LIBSVM developed by Chang and Lin (2001) for multi-class probability estimates. We discuss some implementation issues of Algorithm 2 in Appendix D.
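The following Python sketch is a direct rendering of Algorithm 2 (each update recomputes Qp and p^T Q p, so it costs O(k^2) per step; Appendix D gives cheaper bookkeeping). It mirrors the stated algorithm, not the LIBSVM source.

```python
import numpy as np

def algorithm2(r, max_sweeps=100, tol=1e-8):
    """Sketch of Algorithm 2 for solving (21) iteratively."""
    k = r.shape[0]
    Q = -r.T * r
    np.fill_diagonal(Q, (r**2).sum(axis=0) - np.diag(r)**2)   # (20)

    p = np.full(k, 1.0 / k)
    for _ in range(max_sweeps):
        for t in range(k):
            Qp = Q @ p
            pQp = p @ Qp
            p[t] = (-(Qp[t] - Q[t, t] * p[t]) + pQp) / Q[t, t]   # (23)
            p /= p.sum()                                          # (24)
        Qp = Q @ p
        if np.max(np.abs(Qp - p @ Qp)) < tol:   # optimality (21) with b = -p^T Q p
            break
    return p
```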

5. Relations Between Different Methods

Among the methods discussed in this paper, the four decision rules δHT, δ1, δ2, and δV can be written as arg max_i [p_i], where p is derived by the following four optimization formulations


under the constraints \sum_{i=1}^{k} p_i = 1 and p_i ≥ 0, ∀i:

\[
\delta_{HT}: \quad \min_{\mathbf{p}} \ \sum_{i=1}^{k} \Big[ \sum_{j: j \neq i} \Big( r_{ij} \frac{1}{k} - \frac{1}{2} p_i \Big) \Big]^2, \tag{25}
\]
\[
\delta_1: \quad \min_{\mathbf{p}} \ \sum_{i=1}^{k} \Big[ \sum_{j: j \neq i} ( r_{ij} p_j - r_{ji} p_i ) \Big]^2, \tag{26}
\]
\[
\delta_2: \quad \min_{\mathbf{p}} \ \sum_{i=1}^{k} \sum_{j: j \neq i} ( r_{ij} p_j - r_{ji} p_i )^2, \tag{27}
\]
\[
\delta_V: \quad \min_{\mathbf{p}} \ \sum_{i=1}^{k} \sum_{j: j \neq i} ( I_{\{r_{ij} > r_{ji}\}} p_j - I_{\{r_{ji} > r_{ij}\}} p_i )^2. \tag{28}
\]

Note that (25) can be easily verified from (10), and that (26) and (27) have been explained in Sections 3 and 4. For (28), its solution is

\[
p_i = \frac{c}{\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}}}, \tag{29}
\]

where c is the normalizing constant; therefore, arg max_i [p_i] is the same as (1).^1 A detailed derivation of (29) is in Appendix E.

Clearly, (25) can be obtained from (26) by letting p_j = 1/k and r_ji = 1/2. Such approximations ignore the differences between the p_i. Next, (28) is from (27) with r_ij replaced by I_{r_ij > r_ji}, and hence (28) may enlarge the differences between the p_i. Moreover, compared with (27), (26) allows the difference between r_ij p_j and r_ji p_i to be canceled first, so (26) may tend to underestimate the differences between the p_i. In conclusion, conceptually, (25) and (28) are more extreme: the former tends to underestimate the differences between the p_i, while the latter overestimates them. These arguments will be supported by simulated and real data in the next two sections.

For the PKPD approach (3), the decision rule can be written as:

\[
\delta_{PKPD} = \arg\min_i \Big[ \sum_{j: j \neq i} \frac{1}{r_{ij}} \Big].
\]

This form looks similar to δHT = arg max_i [\sum_{j: j \neq i} r_{ij}], which can be obtained from (10) and (12). Notice that the differences among the \sum_{j: j \neq i} r_{ij} tend to be larger than those among the \sum_{j: j \neq i} 1/r_{ij}, because 1/r_{ij} > 1 > r_{ij}. More discussion on these two rules will be given in Section 6.

6. Experiments on Synthetic Data

In this section, we use synthetic data to compare the performance of the existing methods described in Section 2 as well as the two new approaches proposed in Sections 3 and 4. Here we do not include the method in Section 2.2 because its results depend strongly on the choice of the k − 1 r_ij and our second method is an improved version of it.

1. For I_{\{r_{ij} > r_{ji}\}} to be well defined, we consider r_ij ≠ r_ji, which is generally true. In addition, if there is an i for which \sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} = 0, an optimal solution of (28) is p_i = 1 and p_j = 0, ∀ j ≠ i. The resulting decision is the same as that of (1).

Hastie and Tibshirani (1998) design a simple experiment in which all p_i are fairly close and their method δHT outperforms the voting strategy δV. We conduct this experiment first to assess the performance of our proposed methods. Following their settings, we define class probabilities

(a) p_1 = 1.5/k, p_j = (1 − p_1)/(k − 1), j = 2, ..., k,

and then set

\[
r_{ij} = \frac{p_i}{p_i + p_j} + 0.1 z_{ij} \quad \text{if } i > j, \tag{30}
\]
\[
r_{ji} = \frac{p_j}{p_i + p_j} + 0.1 z_{ji} = 1 - r_{ij} \quad \text{if } j > i, \tag{31}
\]

where the z_ij are standard normal variates and z_ji = −z_ij. Since the r_ij are required to be within (0, 1), we truncate r_ij at ε below and 1 − ε above, with ε = 10^{-7}. In this example, class 1 has the highest probability and hence is the correct class.
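A small Python sketch of this data-generating scheme is given below (the truncation of the unused diagonal entries is harmless); it is our re-implementation for illustration, not the authors' code.

```python
import numpy as np

def simulate_r(p, noise=0.1, eps=1e-7, rng=None):
    """Generate a noisy pairwise matrix r as in (30)-(31); sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(p)
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(i):                       # i > j: equation (30)
            z = rng.standard_normal()
            r[i, j] = p[i] / (p[i] + p[j]) + noise * z
            r[j, i] = 1.0 - r[i, j]              # equation (31), since z_ji = -z_ij
    return np.clip(r, eps, 1.0 - eps)            # truncate into (eps, 1 - eps)

# Setting (a): p_1 = 1.5/k and the remaining classes equal
k = 8
p = np.empty(k)
p[0] = 1.5 / k
p[1:] = (1.0 - p[0]) / (k - 1)
r = simulate_r(p)
```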

Figure 2(a) shows accuracy rates for each of the five methods when k = 2^2, ⌈2^{2.5}⌉, 2^3, ..., 2^7, where ⌈x⌉ denotes the smallest integer not less than x. The accuracy rates are averaged over 1,000 replicates. Note that in this experiment all classes are quite competitive, so, when using δV, sometimes the highest vote occurs at two or more different classes. We handle this problem by randomly selecting one class from the ties. This partly explains the poor performance of δV. Another explanation is that the r_ij here are all close to 1/2, but (28) uses 1 or 0 instead, as stated in the previous section; therefore, the solution may be severely biased. Besides δV, the other four rules have good performance in this example.

Since δHT relies on the approximation p_i + p_j ≈ 2/k, this rule may suffer some losses if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(b) We let k_1 = k/2 if k is even, and (k + 1)/2 if k is odd; then we define p_1 = 0.95 × 1.5/k_1, p_i = (0.95 − p_1)/(k_1 − 1) for i = 2, ..., k_1, and p_i = 0.05/(k − k_1) for i = k_1 + 1, ..., k.

(c) We define p_1 = 0.95 × 1.5/2, p_2 = 0.95 − p_1, and p_i = 0.05/(k − 2), i = 3, ..., k.

An illustration of these three sets of class probabilities is in Figure 1.

Figure 1: Three sets of class probabilities (settings (a), (b), and (c)).


After setting the p_i, we define the pairwise comparisons r_ij as in (30)-(31). Both experiments are repeated 1,000 times. The accuracy rates are shown in Figures 2(b) and 2(c). In both scenarios, the p_i are not balanced. As expected, δHT is quite sensitive to the imbalance of the p_i. The situation is much worse in Figure 2(c) because the approximation p_i + p_j ≈ 2/k is more seriously violated, especially when k is large.

A further analysis of Figure 2(c) shows that when k is large,

\[
r_{12} = \tfrac{3}{4} + 0.1 z_{12}, \qquad r_{1j} \approx 1 + 0.1 z_{1j}, \ j \geq 3,
\]
\[
r_{21} = \tfrac{1}{4} + 0.1 z_{21}, \qquad r_{2j} \approx 1 + 0.1 z_{2j}, \ j \geq 3,
\]
\[
r_{ij} \approx 0 + 0.1 z_{ij}, \quad i \neq j, \ i \geq 3,
\]

where z_ji = −z_ij are standard normal variates. From (10), the decision rule δHT in this case mainly compares \sum_{j: j \neq 1} r_{1j} and \sum_{j: j \neq 2} r_{2j}. The difference between these two sums is \tfrac{1}{2} + 0.1 (\sum_{j: j \neq 1} z_{1j} - \sum_{j: j \neq 2} z_{2j}), where the second term has zero mean and, when k is large, high variance. Therefore, for large k, the decision depends strongly on these normal variates, and the probability of choosing the first class approaches one half. On the other hand, δPKPD relies on comparing \sum_{j: j \neq 1} 1/r_{1j} and \sum_{j: j \neq 2} 1/r_{2j}. As the difference between 1/r_{12} and 1/r_{21} is larger than that between r_{12} and r_{21}, though the accuracy rates decline when k increases, the situation is less serious.

We also analyze the mean square error (MSE) in Figure 3:

\[
\text{MSE} = \frac{1}{1000} \sum_{j=1}^{1000} \Big[ \frac{1}{k} \sum_{i=1}^{k} ( p^{j}_i - p_i )^2 \Big], \tag{32}
\]

where p^j is the probability estimate obtained in the jth of the 1,000 replicates. Overall, δHT and δV have higher MSE, confirming again that they are less stable. Note that Algorithm 1 and (10) give the same prediction for δHT, but their MSEs are different. Here we consider (10) as it is the one analyzed and compared in Section 5.
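A direct computation of (32) can be written as below; the same function also covers the Brier-score variant used in Section 7, where the true p_i is replaced by a 0/1 indicator of the actual class. Sketch only.

```python
import numpy as np

def mse_brier(p_hat, p_true=None, y_true=None):
    """Mean square error (32) of probability estimates.

    p_hat: (n, k) array of estimates. Supply p_true (true probabilities,
    synthetic case) or y_true (class labels, in which case the target is
    the 0/1 indicator vector, i.e., the Brier score of Section 7).
    """
    n, k = p_hat.shape
    if p_true is None:
        p_true = np.zeros((n, k))
        p_true[np.arange(n), y_true] = 1.0
    return float(np.mean(np.sum((p_hat - p_true) ** 2, axis=1) / k))
```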

In summary, δ1 and δ2 are less sensitive to the p_i, and their overall performance is fairly stable. All observations about δHT, δ1, δ2, and δV here agree with our analysis in Section 5. Despite some similarity to δHT, δPKPD outperforms δHT in general. Experiments in this study are conducted using MATLAB.

7. Experiments on Real Data

In this section we present experimental results on several multi-class problems: dna, satimage, segment, and letter from the Statlog collection (Michie et al., 1994), waveform from the UCI Machine Learning Repository (Blake and Merz, 1998), USPS (Hull, 1994), and MNIST (LeCun et al., 1998). The numbers of classes and features are reported in Table 1. Except for dna, which takes two possible values 0 and 1, each attribute of all other data is linearly scaled to [−1, 1]. For each scaled data set, we randomly select 300 training and 500 testing instances from thousands of data points. 20 such selections are generated and the testing error rates are averaged. Similarly, we do experiments on larger sets (800 training and 1,000 testing). All training and testing sets used are available at http://www.csie.ntu.edu.tw/∼cjlin/papers/svmprob/data and the code is available at http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/svmprob.


Figure 2: Accuracy of predicting the true class by the methods: δHT (solid line, cross marked), δV (dashed line, square marked), δ1 (dotted line, circle marked), δ2 (dashed line, asterisk marked), and δPKPD (dashdot line, diamond marked). Panels (a), (b), and (c) correspond to the three settings of class probabilities; the x-axis is log_2 k and the y-axis is the accuracy rate.

Figure 3: MSE by the methods: δHT via (10) (solid line, cross marked), δV (dashed line, square marked), δ1 (dotted line, circle marked), δ2 (dashed line, asterisk marked), and δPKPD (dashdot line, diamond marked). Panels (a), (b), and (c) correspond to the three settings of class probabilities; the x-axis is log_2 k.


For the implementation of the four probability estimates, δ1 and δ2 are obtained by solving linear systems. For δHT, we implement Algorithm 1 with the following stopping condition:

\[
\sum_{i=1}^{k} \Big| \frac{\sum_{j: j \neq i} r_{ij}}{\sum_{j: j \neq i} \mu_{ij}} - 1 \Big| \leq 10^{-3}.
\]


We observe that the performance of δHT may degrade if the stopping condition is too loose.

dataset      dna  waveform  satimage  segment  USPS  MNIST  letter
#class         3         3         6        7    10     10      26
#attribute   180        21        36       19   256    784      16

Table 1: Data set statistics

7.1 SVM as the Binary Classifier

We first consider support vector machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995) with the RBF kernel e^{-\gamma \|x_i - x_j\|^2} as the binary classifier. The regularization parameter C and the kernel parameter γ are selected by cross-validation (CV). To begin, for each training set, a five-fold cross-validation is conducted on the following points of (C, γ): [2^{-5}, 2^{-3}, ..., 2^{15}] × [2^{-5}, 2^{-3}, ..., 2^{15}]. This is done by modifying LIBSVM (Chang and Lin, 2001), a library for SVM. At each (C, γ), sequentially four folds are used as the training set while one fold is used as the validation set. The training of the four folds consists of k(k − 1)/2 binary SVMs. For the binary SVM of the ith and jth classes, we employ an improved implementation (Lin et al., 2003) of Platt's posterior probabilities (Platt, 2000) to estimate r_ij:

\[
r_{ij} = P(i \mid i \text{ or } j, x) = \frac{1}{1 + e^{A f + B}}, \tag{33}
\]

where A and B are estimated by minimizing the negative log-likelihood function, and f are the decision values of the training data. Platt (2000) and Zhang (2004) observe that SVM decision values are easily clustered at ±1, so the probability estimate (33) may be inaccurate. Thus, it is better to use CV decision values, as we then overfit the model less and the values are not so close to ±1. In our experiments here, this requires a further CV on the four-fold data (i.e., a second-level CV).
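As an illustration of (33), the sketch below fits A and B by minimizing the negative log-likelihood with a generic optimizer; it is a simple stand-in for the improved implementation of Lin et al. (2003), and the label encoding (y = 1 for class i, y = 0 for class j) is our assumption.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """Fit A, B of (33) on (CV) decision values f with labels y in {0, 1}."""
    def nll(ab):
        A, B = ab
        p = 1.0 / (1.0 + np.exp(A * f + B))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)             # numerical safety
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    res = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
    return res.x                                        # A, B

def platt_probability(f, A, B):
    """r_ij = 1 / (1 + exp(A f + B)) for new decision values f."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```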

Next, for each instance in the validation set, we apply the pairwise coupling methods to obtain classification decisions. The error of the five validation sets is thus the cross-validation error at (C, γ). From this, each rule obtains its best (C, γ).^2 Then, the decision values from the five-fold cross-validation at the best (C, γ) are employed in (33) to find the final A and B for future use. These two values and the model obtained by applying the best parameters on the whole training set are then used to predict the testing data. Figure 4 summarizes the procedure of obtaining the validation accuracy at each given (C, γ).

The averages of the 20 MSEs are presented in the left panel of Figure 5, where the solid line represents results of small sets (300 training/500 testing), and the dashed line of large sets (800 training/1,000 testing). The definition of MSE here is similar to (32), but as there is no correct p_i for these problems, we let p_i = 1 if the data point is in the ith class, and 0 otherwise. This measurement is called the Brier score (Brier, 1950), which is popular in meteorology. The figures show that for smaller k, δHT, δ1, δ2 and δPKPD have similar MSEs, but for larger k, δHT has the largest MSE. The MSEs of δV are much larger than those of all other methods, so they are not included in the figures.

2. If more than one parameter set returns the smallest cross-validation error, we simply choose the one with the smallest C.


Figure 4: Parameter selection when using SVM as the binary classifier (five-fold CV at each given (C, γ): four folds yield CV decision values and r_ij, the remaining fold gives the validation accuracy).

In summary, the two proposed approaches, δ1 and δ2, are fairly insensitive to the values of k, and all the above observations agree well with previous findings in Sections 5 and 6.

Next, the left panels of Figures 6 and 7 present the average of 20 test errors for problems with small size (300 training/500 testing) and large size (800 training/1,000 testing), respectively. The caption of each sub-figure also shows the average of 20 test errors of the multi-class implementation in LIBSVM. This rule is voting using merely pairwise SVM decision values, and is denoted as δDV for later discussion. The figures show that the errors of the five methods are fairly close for smaller k, but quite different for larger k. Notice that for smaller k (Figures 6 and 7 (a), (c), (e), and (g)) the differences of the averaged errors among the five methods are small, and there is no particular trend in these figures. However, for problems with larger k (Figures 6 and 7 (i), (k), and (m)), the differences are bigger and δHT is less competitive. In particular, for the letter problem (Figure 6 (m), k = 26), δ2 and δV outperform δHT by more than 4%. The test errors along with the MSE seem to indicate that, for problems with larger k, the posterior probabilities p_i are closer to the setting of Figure 2(c) rather than that of Figure 2(a). Another feature consistent with earlier findings is that when k is larger the results of δ2 are closer to those of δV, and δ1 closer to δHT, for both small and large training/testing sets. As for δPKPD, its overall performance is competitive, but we are not clear about its relationships to the other methods.

Finally, we consider another criterion for evaluating the probability estimates: the likelihood

\[
\prod_{j=1}^{l} p^{j}_{y_j}.
\]

In practice, we use its log likelihood and divide the value by a scaling factor l:

\[
\frac{1}{l} \sum_{j=1}^{l} \log p^{j}_{y_j}, \tag{34}
\]

where l is the number of test data, p^j is the probability estimate for the jth data point, and y_j is its actual class label.

A larger value implies a possibly better estimate. The left panel of Figure 8 presents the results of using SVM as the binary classifier. Clearly the trend is the same as for MSE and


accuracy. When k is larger, δ2 and δV have larger values and hence better performance. Similar to MSE, values of δV are not presented as they are too small.

7.2 Random Forest as the Binary Classifier

In this subsection we consider random forest (Breiman, 2001) as the binary classifier and conduct experiments on the same data sets. As random forest itself can provide multi-class probability estimates, we denote the corresponding rule as δRF and also compare it with the coupling methods.

For each two classes of data, we construct 500 trees as the random forest classifier. Using m_try randomly selected features, a bootstrap sample (around two thirds) of the training data is employed to generate a full tree without pruning. For each test instance, r_ij is simply the proportion out of the 500 trees in which class i wins over class j. As we fix the number of trees at 500, the only parameter left for tuning is m_try. Similar to (Sventnik et al., 2003), we select m_try from {1, √m, m/3, m/2, m} by five-fold cross-validation, where m is the number of attributes. The cross-validation procedure first sequentially uses four folds as the training set to construct k(k − 1)/2 pairwise random forests, next obtains the decision for each instance in the validation set by the pairwise coupling methods, and then calculates the cross-validation error at the given m_try from the error of the five validation sets. This is similar to the procedure in Section 7.1, but we do not need a second-level CV for obtaining accurate two-class probabilistic estimates (i.e., r_ij). Instead of CV, a more efficient "out of bag" validation can be used for random forest, but here we keep using CV for consistency. Experiments are conducted using an R interface (Liaw and Wiener, 2002) to the code from (Breiman, 2001).

The MSE presented in the right panel of Figure 5 shows that δ1 and δ2 yield more stable results than δHT and δV for both small and large sets. The right panels of Figures 6 and 7 give the average of 20 test errors. The caption of each sub-figure also shows the averaged error when using random forest as a multi-class classifier (δRF). Notice that random forest bears a resemblance to SVM: the errors are only slightly different among the five methods for smaller k, but δV and δ2 tend to outperform δHT and δ1 for larger k. The right panel of Figure 8 presents the log-likelihood value (34). The trend is again the same. In summary, the results obtained by using random forest as the binary classifier strongly support the previous findings regarding the four methods.

7.3 Miscellaneous Observations and Discussion

Recall that in Section 7.1 we consider δDV, which does not use Platt's posterior probabilities. Experimental results in Figure 6 show that δDV is quite competitive (in particular, 3% better for letter), but is about 2% worse than all probability-based methods for waveform. Similar observations on waveform are also reported in (Duan and Keerthi, 2003), where the comparison is between δDV and δHT. We now explain why the results of probability-based and decision-value-based methods can be so distinct. For some problems, the parameters selected by δDV are quite different from those selected by the other five rules. In waveform, at some parameters all probability-based methods give much higher cross-validation accuracy than δDV. We observe, for example, that the decision values of the validation sets are in [0.73, 0.97] and [0.93, 1.02] for data in two classes; hence, all data in the validation sets are classified as


in one class and the error is high. On the contrary, the probability-based methods fit the decision values by a sigmoid function, which can better separate the two classes by cutting at a decision value around 0.95. This observation sheds some light on the difference between probability-based and decision-value-based methods.

Though the main purpose of this section is to compare different probability estimates, here we check the accuracy of another multi-class classification method: exponential loss-based decoding by Allwein et al. (2001). In the pairwise setting, if f_ij ∈ R is the two-class hypothesis such that f_ij > 0 (< 0) predicts the data to be in the ith (jth) class, then

\[
\text{predicted label} = \arg\min_i \Big[ \sum_{j: j < i} e^{f_{ji}} + \sum_{j: j > i} e^{-f_{ij}} \Big]. \tag{35}
\]

For SVM, we can simply use decision values as f_ij. On the other hand, r_ij − 1/2 is another choice. Table 2 presents the errors on the seven problems using these two options. The results indicate that using decision values is worse than using r_ij − 1/2 when k is large (USPS, MNIST, and letter). This observation seems to indicate that large numerical ranges of f_ij may cause (35) to produce more erroneous results (r_ij − 1/2 is always in [−1/2, 1/2]). The results of using r_ij − 1/2 are competitive with those in Figures 6 and 7 when k is small. However, for larger k (e.g., letter), they are slightly worse than δ2 and δV. We think this result is due to the similarity between (35) and δHT. When f_ij is close to zero, e^{f_ij} ≈ 1 + f_ij, so (35) reduces to a "linear loss-based encoding." When r_ij − 1/2 is used, f_ji = r_ji − 1/2 = 1/2 − r_ij. Thus, the linear encoding is arg min_i [\sum_{j: j \neq i} -r_{ij}] ≡ arg max_i [\sum_{j: j \neq i} r_{ij}], exactly the same as (10) of δHT.

training/testing (f_ij)     dna   waveform  satimage  segment  USPS   MNIST  letter
300/500  (dec. values)    10.47      16.23     14.12     6.21  11.57  14.99   38.59
300/500  (r_ij − 1/2)     10.47      15.11     14.45     6.03  11.08  13.58   38.27
800/1000 (dec. values)     6.36      14.20     11.55     3.35   8.47   8.97   22.54
800/1000 (r_ij − 1/2)      6.22      13.45     11.6      3.19   7.71   7.95   20.29

Table 2: Average of 20 test errors using exponential loss-based decoding (in percentage)

Regarding the accuracy of pairwise (i.e., δDV) and non-pairwise (e.g., "one-against-the-rest") multi-class classification methods, there are already excellent comparisons. As δV and δ2 have accuracy similar to δDV, roughly speaking, how non-pairwise methods compare to δDV is the same as how they compare to δV and δ2.

The results of random forest as a multi-class classifier (i.e., δRF) are reported in the caption of each sub-figure in Figures 6 and 7. We observe from the figures that, when the number of classes is larger, using random forest as a multi-class classifier is better than coupling binary random forests. However, for dna (k = 3) the result is the other way around. This observation for random forest is left as a future research issue.

Acknowledgments


The authors thank S. Sathiya Keerthi for helpful comments. They also thank the associate editor and anonymous referees for useful comments. This work was supported in part by the National Science Council of Taiwan via the grants NSC 90-2213-E-002-111 and NSC 91-2118-M-004-003.

References

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2001. ISSN 1533-7928.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, Irvine, CA, 1998. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. URL citeseer.nj.nec.com/breiman01random.html.

G. W. Brier. Verification of forecasts expressed in probabilities. Monthly Weather Review, 78:1–3, 1950.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.

Corinna Cortes and Vladimir Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

Kaibo Duan and S. Sathiya Keerthi. Which is the best multiclass SVM method? An empirical study. Technical Report CD-03-12, Control Division, Department of Mechanical Engineering, National University of Singapore, 2003.

J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451–471, 1998.

J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, May 1994.

David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32:386–408, 2004.


S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2/3:18–22, December 2002. URL http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf.

Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2003. URL http://www.csie.ntu.edu.tw/∼cjlin/papers/plattprob.ps.

D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press. URL citeseer.nj.nec.com/platt99probabilistic.html.

David Price, Stefan Knerr, Leon Personnaz, and Gerard Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7, pages 1109–1116. The MIT Press, 1995.

Ph. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks. In Proceedings of International Conference on Artificial Networks, pages 1003–1007, 1991.

Sheldon Ross. Stochastic Processes. John Wiley & Sons, Inc., second edition, 1996.

Vladimir Sventnik, Andy Liaw, Christopher Tong, J. Christopher Culberson, Robert P. Sheridan, and Bradley P. Feuston. Random forest: a tool for classification and regression in compound classification and QSAR modeling. Journal of Chemical Information and Computer Science, 43(6):1947–1958, 2003.

Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. Probability estimates for multi-class classification by pairwise coupling. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436–460, 1929.


Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–134, 2004.

Appendix A. Proof of Theorem 2

It suffices to prove that any optimal solution p of (18) satisfies p_i ≥ 0, i = 1, ..., k. If this is not true, without loss of generality, we assume

\[
p_1 \leq 0, \ldots, p_r \leq 0, \qquad p_{r+1} > 0, \ldots, p_k > 0,
\]

where 1 ≤ r < k, and there is one 1 ≤ i ≤ r such that p_i < 0. We can then define a new feasible solution of (18):

\[
p'_1 = 0, \ldots, p'_r = 0, \qquad p'_{r+1} = p_{r+1}/\alpha, \ldots, p'_k = p_k/\alpha,
\]

where α = 1 − \sum_{i=1}^{r} p_i > 1. With r_ij > 0 and r_ji > 0, we obtain

\[
(r_{ji} p_i - r_{ij} p_j)^2 \geq 0 = (r_{ji} p'_i - r_{ij} p'_j)^2, \quad \text{if } 1 \leq i, j \leq r,
\]
\[
(r_{ji} p_i - r_{ij} p_j)^2 \geq (r_{ij} p_j)^2 > \frac{(r_{ij} p_j)^2}{\alpha^2} = (r_{ji} p'_i - r_{ij} p'_j)^2, \quad \text{if } 1 \leq i \leq r, \ r+1 \leq j \leq k,
\]
\[
(r_{ji} p_i - r_{ij} p_j)^2 \geq \frac{(r_{ji} p_i - r_{ij} p_j)^2}{\alpha^2} = (r_{ji} p'_i - r_{ij} p'_j)^2, \quad \text{if } r+1 \leq i, j \leq k.
\]

Therefore,

\[
\sum_{i=1}^{k} \sum_{j: j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 > \sum_{i=1}^{k} \sum_{j: j \neq i} (r_{ji} p'_i - r_{ij} p'_j)^2.
\]

This contradicts the assumption that p is an optimal solution of (18).

Appendix B. Proof of Theorem 3

(i) If Q + ∆ee^T is not positive definite, there is a vector v with v_i ≠ 0 such that

\[
\mathbf{v}^T (Q + \Delta \mathbf{e}\mathbf{e}^T) \mathbf{v} = \frac{1}{2} \sum_{t=1}^{k} \sum_{u: u \neq t} (r_{ut} v_t - r_{tu} v_u)^2 + \Delta \Big( \sum_{t=1}^{k} v_t \Big)^2 = 0. \tag{36}
\]

For all t ≠ i, r_it v_t − r_ti v_i = 0, so

\[
v_t = \frac{r_{ti}}{r_{it}} v_i \neq 0.
\]

Thus,

\[
\sum_{t=1}^{k} v_t = \Big( 1 + \sum_{t: t \neq i} \frac{r_{ti}}{r_{it}} \Big) v_i \neq 0,
\]

which contradicts (36).


The positive definiteness of Q + ∆ee^T implies that

\[
\begin{bmatrix} Q + \Delta \mathbf{e}\mathbf{e}^T & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix}
\]

is invertible. As this matrix is obtained from

\[
\begin{bmatrix} Q & \mathbf{e} \\ \mathbf{e}^T & 0 \end{bmatrix}
\]

by adding its last row to its first k rows (with a scaling factor ∆), the two matrices have the same rank. Thus, the latter matrix is invertible as well. Then (21) has a unique solution, and so does (17).

(ii) If Q is not positive definite, there is a vector v with v_i ≠ 0 such that

\[
\mathbf{v}^T Q \mathbf{v} = \frac{1}{2} \sum_{t=1}^{k} \sum_{u: u \neq t} (r_{ut} v_t - r_{tu} v_u)^2 = 0.
\]

Therefore,

\[
(r_{ut} v_t - r_{tu} v_u)^2 = 0, \quad \forall t \neq u.
\]

As r_tu > 0, ∀ t ≠ u, for any s ≠ j for which s ≠ i and j ≠ i, we have

\[
v_s = \frac{r_{si}}{r_{is}} v_i, \qquad v_j = \frac{r_{ji}}{r_{ij}} v_i, \qquad v_s = \frac{r_{sj}}{r_{js}} v_j. \tag{37}
\]

As v_i ≠ 0, (37) implies

\[
\frac{r_{si} r_{js}}{r_{is}} = \frac{r_{ji} r_{sj}}{r_{ij}},
\]

which contradicts (22).

Appendix C. Proof of Theorem 4

First we need a lemma to show the strict decrease of the objective function:

Lemma 5 If r_ij > 0, ∀ i ≠ j, p and p^n are from two consecutive iterations of Algorithm 2, and p^n ≠ p, then

\[
\frac{1}{2} (\mathbf{p}^n)^T Q \mathbf{p}^n < \frac{1}{2} \mathbf{p}^T Q \mathbf{p}. \tag{38}
\]

Proof. Assume that p_t is the component to be updated. Then, p^n is obtained through the following calculation:

\[
\bar{p}_i =
\begin{cases}
p_i & \text{if } i \neq t, \\
\frac{1}{Q_{tt}} \big( -\sum_{j: j \neq t} Q_{tj} p_j + \mathbf{p}^T Q \mathbf{p} \big) & \text{if } i = t,
\end{cases} \tag{39}
\]

and

\[
\mathbf{p}^n = \frac{\bar{\mathbf{p}}}{\sum_{i=1}^{k} \bar{p}_i}. \tag{40}
\]

For (40) to be a valid operation, \sum_{i=1}^{k} \bar{p}_i must be strictly positive. To show this, we first suppose that the current solution p satisfies p_i ≥ 0, i = 1, ..., k, but the next solution \bar{p} has \sum_{i=1}^{k} \bar{p}_i = 0. In Section 4.2, we have shown that \bar{p}_t ≥ 0, so with \bar{p}_i = p_i ≥ 0, ∀ i ≠ t, we must have \bar{p}_i = 0 for all i. Next, from (39), p_i = \bar{p}_i = 0 for i ≠ t, which, together with the equality \sum_{i=1}^{k} p_i = 1, implies that p_t = 1. However, if p_t = 1 and p_i = 0 for i ≠ t, then \bar{p}_t = 1 from (39). This contradicts the situation that \bar{p}_i = 0 for all i. Therefore, by induction, the only requirement is to have a nonnegative initial p.

To prove (38), we first rewrite the update rule (39) as

\[
\bar{p}_t = p_t + \frac{1}{Q_{tt}} \big( -(Q\mathbf{p})_t + \mathbf{p}^T Q \mathbf{p} \big) = p_t + \Delta. \tag{41}
\]

Since we keep \sum_{i=1}^{k} p_i = 1, we have \sum_{i=1}^{k} \bar{p}_i = 1 + \Delta. Then

\[
\begin{aligned}
\bar{\mathbf{p}}^T Q \bar{\mathbf{p}} - \Big( \sum_{i=1}^{k} \bar{p}_i \Big)^2 \mathbf{p}^T Q \mathbf{p}
&= \mathbf{p}^T Q \mathbf{p} + 2\Delta (Q\mathbf{p})_t + Q_{tt}\Delta^2 - (1+\Delta)^2 \mathbf{p}^T Q \mathbf{p} \\
&= 2\Delta (Q\mathbf{p})_t + Q_{tt}\Delta^2 - (2\Delta + \Delta^2)\mathbf{p}^T Q \mathbf{p} \\
&= \Delta \big( 2(Q\mathbf{p})_t - 2\mathbf{p}^T Q \mathbf{p} + Q_{tt}\Delta - \Delta \mathbf{p}^T Q \mathbf{p} \big) \\
&= \Delta \big( -Q_{tt}\Delta - \Delta \mathbf{p}^T Q \mathbf{p} \big) \qquad\qquad (42) \\
&= -\Delta^2 ( Q_{tt} + \mathbf{p}^T Q \mathbf{p} ) < 0. \qquad\qquad (43)
\end{aligned}
\]

(42) follows from the definition of ∆ in (41). For (43), we use Q_tt = \sum_{j: j \neq t} r_{jt}^2 > 0 and ∆ ≠ 0, which comes from the assumption p^n ≠ p. □

Now we are ready to prove the theorem. If this result does not hold, there is a convergent subsequence {p^i}_{i ∈ K} such that p^* = lim_{i ∈ K, i → ∞} p^i is not optimal for (17). Note that there is at least one index of {1, ..., k} which is considered in infinitely many iterations. Without loss of generality, we assume that for all p^i, i ∈ K, the component p^i_t is updated to generate the next iterate p^{i+1}. As p^* is not optimal for (17), starting from t, t+1, ..., k, 1, ..., t−1, there is a first component \bar{t} for which

\[
\sum_{j=1}^{k} Q_{\bar{t} j} p^{*}_j - (\mathbf{p}^*)^T Q \mathbf{p}^* \neq 0.
\]

By applying one iteration of Algorithm 2 on the \bar{t}th component of p^*, from an explanation similar to the proof of Lemma 5, we obtain p^{*,n} satisfying p^{*,n}_{\bar{t}} ≠ p^{*}_{\bar{t}}. Then by Lemma 5,

\[
\frac{1}{2} (\mathbf{p}^{*,n})^T Q \mathbf{p}^{*,n} < \frac{1}{2} (\mathbf{p}^{*})^T Q \mathbf{p}^{*}.
\]

Assume it takes \bar{i} steps from t to \bar{t}, and \bar{i} > 1. From

\[
\lim_{i \in K, i \to \infty} p^{i+1}_t
= \lim_{i \in K, i \to \infty}
  \frac{\frac{1}{Q_{tt}} \big( -\sum_{j: j \neq t} Q_{tj} p^i_j + (\mathbf{p}^i)^T Q \mathbf{p}^i \big)}
       {\frac{1}{Q_{tt}} \big( -\sum_{j: j \neq t} Q_{tj} p^i_j + (\mathbf{p}^i)^T Q \mathbf{p}^i \big) + \sum_{j: j \neq t} p^i_j}
= \frac{\frac{1}{Q_{tt}} \big( -\sum_{j: j \neq t} Q_{tj} p^*_j + (\mathbf{p}^*)^T Q \mathbf{p}^* \big)}
       {\frac{1}{Q_{tt}} \big( -\sum_{j: j \neq t} Q_{tj} p^*_j + (\mathbf{p}^*)^T Q \mathbf{p}^* \big) + \sum_{j: j \neq t} p^*_j}
= \frac{p^*_t}{\sum_{j=1}^{k} p^*_j} = p^*_t,
\]

we have

\[
\lim_{i \in K, i \to \infty} \mathbf{p}^i = \lim_{i \in K, i \to \infty} \mathbf{p}^{i+1} = \cdots = \lim_{i \in K, i \to \infty} \mathbf{p}^{i+\bar{i}-1} = \mathbf{p}^*.
\]

Moreover,

\[
\lim_{i \in K, i \to \infty} \mathbf{p}^{i+\bar{i}} = \mathbf{p}^{*,n}
\]

and

\[
\lim_{i \in K, i \to \infty} \frac{1}{2} (\mathbf{p}^{i+\bar{i}})^T Q \mathbf{p}^{i+\bar{i}}
= \frac{1}{2} (\mathbf{p}^{*,n})^T Q \mathbf{p}^{*,n}
< \frac{1}{2} (\mathbf{p}^{*})^T Q \mathbf{p}^{*}
= \lim_{i \in K, i \to \infty} \frac{1}{2} (\mathbf{p}^{i})^T Q \mathbf{p}^{i}.
\]

This contradicts the fact from Lemma 5 that

\[
\frac{1}{2} (\mathbf{p}^{1})^T Q \mathbf{p}^{1} \geq \frac{1}{2} (\mathbf{p}^{2})^T Q \mathbf{p}^{2} \geq \cdots \geq \frac{1}{2} (\mathbf{p}^{*})^T Q \mathbf{p}^{*}.
\]

Therefore, p∗ must be optimal for (17).

Appendix D. Implementation Notes of Algorithm 2

From Algorithm 2, the main operation of each iteration is calculating −\sum_{j: j \neq t} Q_{tj} p_j and p^T Q p, both O(k^2) procedures. In the following, we show how to easily reduce the cost per iteration to O(k).

Following the notation in Lemma 5 of Appendix C, we consider p the current solution. Assume p_t is the component to be updated; we generate \bar{p} according to (39) and normalize \bar{p} to the next iterate p^n. Note that \bar{p} is the same as p except for the tth component, and we consider the form (41). Since \sum_{i=1}^{k} p_i = 1, (40) is p^n = \bar{p}/(1 + \Delta). Throughout the iterations, we keep the current Qp and p^T Qp, so ∆ can be easily calculated. To obtain Qp^n and (p^n)^T Qp^n, we use

\[
(Q\mathbf{p}^n)_j = \frac{(Q\bar{\mathbf{p}})_j}{1 + \Delta} = \frac{(Q\mathbf{p})_j + Q_{jt}\Delta}{1 + \Delta}, \quad j = 1, \ldots, k, \tag{44}
\]

and

\[
(\mathbf{p}^n)^T Q \mathbf{p}^n = \frac{\bar{\mathbf{p}}^T Q \bar{\mathbf{p}}}{(1 + \Delta)^2} = \frac{\mathbf{p}^T Q \mathbf{p} + 2\Delta (Q\mathbf{p})_t + Q_{tt}\Delta^2}{(1 + \Delta)^2}. \tag{45}
\]

Both (and hence the whole iteration) take O(k) operations. Numerical inaccuracy may accumulate through the iterations, so gradually (44) and (45) may drift away from the values directly calculated using p. An easy prevention of this problem is to recalculate Qp and p^T Qp directly using p after several iterations (e.g., k iterations). Then, the average cost per iteration is still O(k) + O(k^2)/k = O(k).
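The bookkeeping above can be sketched in Python as follows; Qp and p^T Q p are updated via (44)-(45) and refreshed periodically, mirroring the structure (not the code) of the LIBSVM implementation.

```python
import numpy as np

def algorithm2_fast(r, max_sweeps=100, tol=1e-8):
    """Algorithm 2 with the O(k)-per-update bookkeeping of Appendix D (sketch)."""
    k = r.shape[0]
    Q = -r.T * r
    np.fill_diagonal(Q, (r**2).sum(axis=0) - np.diag(r)**2)

    p = np.full(k, 1.0 / k)
    Qp = Q @ p
    pQp = float(p @ Qp)
    for _ in range(max_sweeps):
        for t in range(k):
            delta = (-Qp[t] + pQp) / Q[t, t]     # cf. (41): p_t increases by delta
            Qpt_old = Qp[t]
            p[t] += delta
            p /= 1.0 + delta                      # normalization (40)
            pQp = (pQp + 2.0 * delta * Qpt_old + Q[t, t] * delta**2) / (1.0 + delta)**2  # (45)
            Qp = (Qp + delta * Q[:, t]) / (1.0 + delta)                                  # (44)
        # periodic refresh against round-off, then optimality check
        Qp = Q @ p
        pQp = float(p @ Qp)
        if np.max(np.abs(Qp - pQp)) < tol:        # (21) with b = -p^T Q p
            break
    return p
```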


Appendix E. Derivation of (29)

\[
\begin{aligned}
\sum_{i=1}^{k} \sum_{j: j \neq i} ( I_{\{r_{ij} > r_{ji}\}} p_j - I_{\{r_{ji} > r_{ij}\}} p_i )^2
&= \sum_{i=1}^{k} \sum_{j: j \neq i} \big( I_{\{r_{ij} > r_{ji}\}} p_j^2 + I_{\{r_{ji} > r_{ij}\}} p_i^2 \big) \\
&= 2 \sum_{i=1}^{k} \Big( \sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} \Big) p_i^2.
\end{aligned}
\]

If \sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} ≠ 0, ∀i, then, under the constraint \sum_{i=1}^{k} p_i = 1, the optimal solution satisfies

\[
\Big( \sum_{j: j \neq 1} I_{\{r_{j1} > r_{1j}\}} \Big) p_1 = \cdots = \Big( \sum_{j: j \neq k} I_{\{r_{jk} > r_{kj}\}} \Big) p_k.
\]

Thus, (29) is the optimal solution of (28).


Figure 5: MSE by using four probability estimates methods based on binary SVMs (left) and binary random forests (right). The MSE of δV is too large and is not presented. Solid line: 300 training/500 testing points; dotted line: 800 training/1,000 testing points. Panels (a)-(n) cover dna (k = 3), waveform (k = 3), satimage (k = 6), segment (k = 7), USPS (k = 10), MNIST (k = 10), and letter (k = 26), each by binary SVMs and by binary random forests; each panel compares δHT, δ1, δ2, and δPKPD.


Figure 6: Average of 20 test errors by five probability estimates methods (δHT, δ1, δ2, δV, δPKPD) based on binary SVMs (left) and binary random forests (right). Each of the 20 test errors is by 300 training/500 testing points. The caption of each sub-figure shows the averaged error by voting using pairwise SVM decision values (δDV) and by the multi-class random forest (δRF): dna (k = 3): δDV = 10.82%, δRF = 8.74%; waveform (k = 3): δDV = 16.47%, δRF = 17.39%; satimage (k = 6): δDV = 14.88%, δRF = 14.74%; segment (k = 7): δDV = 5.82%, δRF = 5.23%; USPS (k = 10): δDV = 10.71%, δRF = 14.28%; MNIST (k = 10): δDV = 13.02%, δRF = 16.18%; letter (k = 26): δDV = 32.29%, δRF = 32.55%.


Figure 7: Average of 20 test errors by five probability estimates methods (δHT, δ1, δ2, δV, δPKPD) based on binary SVMs (left) and binary random forests (right). Each of the 20 test errors is by 800 training/1,000 testing points. The caption of each sub-figure shows the averaged error by voting using pairwise SVM decision values (δDV) and by the multi-class random forest (δRF): dna (k = 3): δDV = 6.31%, δRF = 6.91%; waveform (k = 3): δDV = 14.33%, δRF = 15.66%; satimage (k = 6): δDV = 11.54%, δRF = 11.92%; segment (k = 7): δDV = 3.30%, δRF = 2.77%; USPS (k = 10): δDV = 7.54%, δRF = 10.23%; MNIST (k = 10): δDV = 7.77%, δRF = 10.10%; letter (k = 26): δDV = 18.61%, δRF = 20.25%.


Figure 8: Log likelihood (34) by using four probability estimates methods (δHT, δ1, δ2, δPKPD) based on binary SVMs (left) and binary random forests (right). Values of δV are too small and are not presented. Solid line: 300 training/500 testing points; dotted line: 800 training/1,000 testing points. Panels (a)-(n) cover the same data sets as Figure 5.
