
Query by Transduction

Shen-Shyang Ho, Member, IEEE, and Harry Wechsler, Fellow, IEEE

Abstract—There has recently been a growing interest in the use of transductive inference for learning. We expand here the scope of transductive inference to active learning in a stream-based setting. Toward that end, this paper proposes Query-by-Transduction (QBT) as a novel active learning algorithm. QBT queries the label of an example based on the p-values obtained using transduction. We show that QBT is closely related to Query-by-Committee (QBC) using relations between transduction, Bayesian statistical testing, Kullback-Leibler divergence, and Shannon information. The feasibility and utility of QBT are shown on both binary and multiclass classification tasks using a support vector machine (SVM) as the classifier of choice. Our experimental results show that QBT compares favorably, in terms of mean generalization, against random sampling, committee-based active learning, margin-based active learning, and QBC in the stream-based setting.

Index Terms—Active learning, hypothesis testing, transductive inference, Kolmogorov complexity, support vector machine.


1 INTRODUCTION

Unlabeled data is abundant, but labeling is expensive in many machine learning applications. With a budget constraint on the labeling task, active/query learning allows the learner to select a limited number of informative examples and to query their labels from an oracle to achieve good classification performance on future observations.

Recently, there has been a growing interest in the use of transductive inference [1] for learning [2], [3], [4], [5]. This paper expands the scope of transductive inference to active learning in a data streaming setting. Tong and Koller [6] have previously proposed using transduction for active learning when a pool of unlabeled examples is provided. However, to apply transduction to search through a (large) pool of examples for the most informative examples to query for their labels is computationally expensive. To apply transduction to active learning in a data streaming setting is much cheaper, especially when an incremental classifier is used. Moreover, Vovk et al. [7] noted that, in the data streaming setting, the “error probabilities [of transductive inference] guaranteed by the theory [of transductive inference] find their manifestation as [error] frequencies” on the previously seen data points.

The main contribution of this paper is a novel active learning algorithm, called Query-by-Transduction (QBT), based on p-values obtained from a transductive learning procedure in a stream-based setting where examples are observed sequentially. When a new example is observed, the algorithm follows two steps:

Step 1. Construct M classifiers using previously observed examples and the new example to derive statistical information considering all M possible labels for the new example.

Step 2. Decide on whether to select the new example based on the statistical information of the two most likely labels for the new example derived in Step 1.

Step 1 is justified based on the relationship between algorithmic randomness and statistical hypothesis testing [8] that is realized by a transductive learning procedure first introduced by Gammerman and Vovk [9].

Based on the facts that 1) the Kullback-Leibler divergence can be interpreted as the expected discrimination information between the null and alternative statistical hypotheses [10] and 2) the Kullback-Leibler divergence is connected to the Shannon information, the QBT selection criterion (Step 2) is related to the Query-by-Committee (QBC) [11] selection criterion. In fact, QBT can be viewed as a variant of the “committee-type” strategy with each committee member favoring a particular label.

The outline of the paper is as follows: In Section 2, we review the active learning problem. In Section 3, the concept of transductive learning is introduced. In Section 4, we describe in detail how Kolmogorov complexity, transductive inference, and hypothesis testing are related. In Section 5, the strangeness measure and p-values are introduced. In Section 6, the QBT selection criterion using p-values is described. In Section 7, the connection between p-values and Kullback-Leibler divergence is shown. In Section 8, we establish the relation between QBT and QBC. Experimental results on eight binary-class and two multiclass classification tasks, used to assess the feasibility and usefulness of our approach, are reported in Section 9.

2 ACTIVE LEARNING

The standard framework in machine learning in general and pattern classification in particular presents the learner with a randomly sampled data set. There has been, however, a growing interest in active learning where one

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 9, SEPTEMBER 2008 1557

. S.-S. Ho is with the NASA Jet Propulsion Laboratory, 4800 Oak Grove Ave, 300-123, Pasadena, CA 91109. E-mail: [email protected].
. H. Wechsler is with George Mason University, 4400 University Dr., MS 5A4, Fairfax, VA 22030. E-mail: [email protected].

Manuscript received 21 Dec. 2006; revised 4 June 2007; accepted 18 Sept. 2007; published online 11 Oct. 2007. Recommended for acceptance by S. Sclaroff. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0897-1206. Digital Object Identifier no. 10.1109/TPAMI.2007.70811.

0162-8828/08/$25.00 © 2008 IEEE Published by the IEEE Computer Society

has the flexibility to choose the data points that seem most relevant for the learning task and include them in the training set. One analogy for active learning, proposed by Tong and Koller [6], is that a standard passive learner is a student that sits and listens to a teacher, whereas an active learner is a student that asks the teacher questions, listens to the answers, and asks further questions based upon the teacher’s response. The active learner selects actions or makes queries that influence what data are added to the training set and their order [12]. Active learning, in general, is relevant to any activity involving choice and uncertainty. The goal of active learning is to reduce the number of examples that have to be manually annotated for training (due to limited resources) without compromising the performance of the classifier.

To achieve the goal of active learning, one needs to select examples that provide the most information to the learner. Zhang and Oles [13] analyzed the value of unlabeled data on the active learning task using the Fisher information criterion. They concluded that 1) one should choose to label data points with low confidence with respect to the parameters of the data model, and 2) one should choose to label data points that are not redundant [13]. In fact, this selection criterion corresponds to selecting data points that minimize the learner variance regarding a number of reference points for function approximation (regression) problems [12], which is statistically optimal.

Instead of describing the many existing active learning strategies, we provide the two objectives of these strategies:

O1. Maximizing the information gain.
O2. Minimizing the expected loss or error rate of the predictive model.

Objective O1 corresponds to selecting examples that have the most uncertainty [15], [16], least confidence [14], [17], or maximum disagreement among learners (QBC) [18]. Zhu et al. [19] showed that, if one is interested in classification performance, it is better to minimize the generalization error of the learner [20], [21], which corresponds to Objective O2.

There are two main settings for active learning: pool-based and stream-based. For pool-based active learning, the learner requests labels from a fixed pool of unlabeled data. The learner can request to label data from the pool in the future. On the other hand, for stream-based active learning, the learner has to decide on the fly whether to request labels for each unlabeled data point observed in sequence, without complete knowledge about the pool of data. Stream-based active learning is hence more difficult than pool-based active learning because a decision has to be made immediately after a new data point is observed, while the information about the data model at that particular time instance is limited.

Most previous work is related to pool-based active learning. The margin-based approach using the support vector machine (SVM) is extensively used in this setting. Schohn and Cohn [22] suggested using a simple form of divide and conquer by selecting training examples that lie on or close to the separating SVM hyperplane, with the expectation that the inclusion of these examples reduces the expected error. Campbell et al. [23] employed an SVM active learning strategy that chooses the sample point that is most likely to cause the margin to shrink most during each iteration. This sample point has the highest uncertainty about its true label in the data set. Another strategy involves querying instances that maximally reduce the version space of the SVM, which contains all the classifiers consistent with the training examples [6]. These strategies require searching through the pool of unlabeled examples at each iteration of selection and hence are computationally expensive. These strategies basically select sample points as close as possible to the dividing hyperplane in the feature space. Such data points display high uncertainty with respect to the discrimination boundary and are thus expected to be the most informative. Zhang and Oles [13] pointed out that SVM is suitable for active learning (in a pool-based setting) since an unlabeled data point within the margin that is near to the hyperplane is likely to cause a large change in parameter estimation once its label is known. Brinker [24] extended the SVM active learning strategy to select multiple examples for labeling at once to reduce computational effort. Similarly, Mitra et al. [25] proposed a probabilistic active learning strategy using SVM for multiple-example selection.
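The margin-based pool criterion described above, i.e., picking the unlabeled point closest to the current hyperplane, can be sketched as follows. This is our own illustration (the helper name and the use of scikit-learn's SVC are not from the cited works):

```python
import numpy as np
from sklearn.svm import SVC

def margin_query(clf, X_pool):
    """Pool-based margin criterion: return the index of the unlabeled
    point closest to the SVM hyperplane, i.e., the most uncertain one."""
    return int(np.argmin(np.abs(clf.decision_function(X_pool))))

# Toy illustration: the pool point nearest the decision boundary is chosen.
clf = SVC(kernel="linear", C=1.0).fit([[0.0], [1.0], [4.0], [5.0]], [-1, -1, 1, 1])
chosen = margin_query(clf, [[2.4], [-3.0], [10.0]])
```

Note that this requires scoring the whole pool at every iteration, which is the computational cost the stream-based setting avoids.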

The most representative stream-based active learning algorithm is QBC, which selects an example only when the members of a committee maximally disagree on its label assignment [11]. This algorithm is based on a theoretical result stating that, by bisecting the version space after each query, the generalization error decreases exponentially. QBC randomly samples the version space and induces an even number of classifiers. When there is a tie among the committee of classifiers on the label of the data point, its label is queried. A tie among the classifiers implies that there is maximum uncertainty on the label of the data point.

The main difficulty in the implementation of QBC is the great effort required to uniformly sample the high-dimensional version space to select random hypotheses that are consistent with the observed data. To address the problem, Abe and Mamitsuka [26] proposed two practical implementations of QBC: Query by Boosting and Query by Bagging. Both methods require resampling the training data to obtain hypotheses. A data point is queried for its true label when the weighted majority voting by the obtained hypotheses (via Boosting or Bagging) has the least margin.1 One notes that the experiments in Abe and Mamitsuka [26] were performed to “select a smaller [training] set of more effective data from a large [training] set” with complete information about the data set. Gilad-Bachrach et al. [30] proposed Kernel QBC (KQBC), which projects the high-dimensional version space into a low-dimensional version space to overcome the costly sampling step. For KQBC, sampling of the version space is based on an efficient random walk algorithm. The apparent drawback of KQBC is its sensitivity to the input parameters. Recently, Dasgupta et al. [31] proposed an active learning strategy based on the modification of the perceptron


1. The margin is defined to be the difference between the number of votes in the current committee for the most popular class label and that for the second most popular label.

update that achieved similar theoretical performance to QBC with less stringent assumptions. The simplest variant of QBC makes a query decision based on a committee of different classifiers [27], [28]. Melville and Mooney [29] proposed using a committee consisting of very diverse2 members to select data points to query.

For the QBC algorithm in [11], the disagreement is based on a vote by the committee members; other quantifications of committee disagreement have also been proposed, such as the entropy of the committee classification [27] and the Jensen-Shannon divergence, the average Kullback-Leibler divergence between each distribution in a set of distributions and the mean of the set [32]. Recently, the Jensen-Shannon divergence has been used to identify informative examples for class label probability estimation [33].
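The two disagreement measures just mentioned can be written down directly. The sketch below is our own (function names are ours); the Jensen-Shannon divergence is computed as the average KL divergence of each member's class distribution from the committee mean, assuming strictly positive probabilities:

```python
import numpy as np

def vote_entropy(votes, n_labels):
    """Entropy of the committee's hard-vote distribution [27]:
    zero when unanimous, maximal when the votes split evenly."""
    p = np.bincount(votes, minlength=n_labels) / len(votes)
    p = p[p > 0]                       # 0 * log(0) is taken to be 0
    return float(-(p * np.log2(p)).sum())

def jensen_shannon(dists):
    """Average KL divergence between each member's class distribution
    and the mean distribution of the committee [32].
    Assumes every probability entry is strictly positive."""
    dists = np.asarray(dists, dtype=float)
    mean = dists.mean(axis=0)
    return float((dists * np.log2(dists / mean)).sum(axis=1).mean())
```

Both measures are zero when the committee agrees and grow with disagreement, so either can drive a QBC-style query rule.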

Brinker [24] and Yan et al. [34] extended the use of active learning from binary-class tasks to multiclass tasks. Both their methods require the transformation of a multiclass task into a binary-class task. A more natural multiclass active learning approach is to utilize a committee-based method such that the committee consists of classifiers for multiclass tasks. We use this method for performance comparison with QBT on multiclass tasks in Section 9.4. Recently, Goh et al. [35] extended the standard active learning framework to enable adaptive adjustments to the active learning strategy and the sampling pool based on the problem complexity for the image retrieval problem.

Pool-based active learning using SVM has been successfully applied to text classification [6], the query refinement task in image retrieval [36], a recognition problem in marine biology [37], and the reduction of the number of iterations for biochemical testing in drug discovery [38]. Other pool-based active learning methods have been used to guide annotations for content-based information retrieval [17], to automatically label video data [34], to reduce the amount of data for outdoor robotic applications [39], and for video semantic feature selection [40]. Both pool-based and stream-based active learning methods have also been used to reduce the labeling effort in a statistical call classification system for customer care [28]. Active learning based on the Fisher information criterion [13] has been successfully applied to select multiple instances for labeling at each iteration for medical image classification [41] and text categorization [42].

Other reviews of various active learning methodologies can be found in Baram et al. [43] and Kothari and Jain [44]. In the next section, transductive inference is briefly reviewed.

3 TRANSDUCTIVE LEARNING

Transductive inference is a type of local inference that moves from particular to particular [1]. “In contrast to inductive inference where one uses a given empirical data to find the approximation of a functional dependency [the inductive step (that moves from particular to general)] and then uses the obtained approximation to evaluate the values of a function at the points of interest [the deductive step (that moves from general to particular)], one estimates [using transduction] the values of a function [only] at the points of interest in one step” [1]. Transductive inference incorporates the unlabeled (test) data in the decision-making process responsible for their eventual labeling. The simplest mathematical realization of transductive inference is the method of k-nearest neighbors (k-NN). A special case of transduction is local estimation, when the prediction is made at a single point [45].

The goal of inductive learning is to generalize for any future test set, whereas the goal of transductive inference is to make predictions for a specific working set. The error probability of inductive inference is not meaningful when it is applied to a data streaming setting, where a prediction rule is updated very quickly and the data points may not be independently and identically distributed (i.i.d.). Vovk et al. [7] noted that, in the data streaming setting, the “error probabilities [of transductive inference] guaranteed by the theory find their manifestation as [error] frequencies” on the previously seen data points. Moreover, Vapnik [46] pointed out that theorems on transductive inference are true even when the data points of interest and the training data are not i.i.d. Hence, in a data streaming setting, the predictive power of transductive inference can be estimated at any time instance in a data stream even when the data points, both future and previously observed, are not i.i.d. Transductive inference thus provides an attractive alternative to inductive inference for learning in a data streaming setting. Furthermore, transductive inference is suitable for the stream-based setting because, in such a setting, the only information available with respect to time is local.

Recently, Yu et al. [47] proposed using transductive experimental design for active learning of the regression model. Their algorithm selects data points that “contribute most to the predictions” on unlabeled test data in a pool-based setting. These selected data points are hard to predict and are “representative[s] to unexplored test data” [47]. Ho and Wechsler [48] used transduction and k-NN to define a selection criterion for active learning in a pool-based setting. We extend here their active learning criterion to the stream-based setting. The new criterion is based on the p-values computed from a transductive inference procedure and the SVM.

In the next section, we describe the relationship between Kolmogorov complexity and transductive inference. We also establish the connection between Kolmogorov complexity and hypothesis testing in terms of statistical p-values. Then, in Section 5, we describe how the p-values used for transductive inference are constructed.

4 KOLMOGOROV COMPLEXITY, TRANSDUCTIVE INFERENCE, AND HYPOTHESIS TESTING

The main motivation behind Kolmogorov’s algorithmic approach to complexity was his interest in formalizing the notion of a random sequence. The complexity of a finite string z, according to Kolmogorov, can be measured by the length of the shortest program for a universal Turing machine (encoded in binary bits) that outputs the string z.


2. Diversity of an ensemble is defined as the probability that a random committee member’s prediction on a random example will disagree with the prediction of the committee.

Two useful characteristics of Kolmogorov’s notion of randomness are that it 1) applies to finite sequences and 2) provides degrees of randomness [49].

There is a strong connection between transductive inference and Kolmogorov’s notion of randomness. Given a sequence of labeled data points and a new data point with an unknown label, transductive inference seeks to find its most probable labeling. The randomness of the data sequence with the data point assigned a particular label is estimated for each possible label. The label assigned to the data point which results in the largest randomness is the most confident (probable) prediction [9]. Intuitively, the new data point assigned the label with the largest randomness when included in the given sequence of labeled data points is not distinguishable from the data points in the sequence, that is, it does not stand out (as an outlier or noise) in the data sequence. It is highly probable that the new data point is randomly drawn from the same fixed distribution from which the sequence of data points came.

Let #(z) be the length of the binary string z and K(z) its

Kolmogorov complexity. Kolmogorov defines the randomness deficiency D(z) for string z as

D(z) = #(z) − K(z), (1)

where D(z) measures how random the binary string z is.
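K(z) is not computable, but the intuition behind (1) can be illustrated by substituting a real compressor for the shortest program. The proxy below is our own illustration, not part of the paper: it replaces K(z) with the bit-length of a zlib encoding, so the "deficiency" it reports is only a crude, compressor-dependent stand-in for D(z):

```python
import os
import zlib

def deficiency_proxy(z: bytes) -> int:
    """Approximate D(z) = #(z) - K(z) by replacing K(z) with the length
    of a zlib-compressed encoding (both measured in bits).  A highly
    regular string compresses well, giving a large 'deficiency'."""
    return 8 * len(z) - 8 * len(zlib.compress(z, 9))

regular = b"ab" * 500          # very regular: compresses to a few bytes
random_ = os.urandom(1000)     # incompressible with overwhelming probability
```

On these two inputs the proxy behaves as (1) predicts: the regular string has a large positive deficiency, the random one does not.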

This definition provides a connection between incompressibility and randomness. When K(z) is small (i.e., z is compressible), D(z) is high (i.e., z lacks randomness). This, in fact, corresponds to the Minimum Description Length (MDL) principle. Martin-Löf extended the randomness definition to show its connection with statistical tests. The extension to statistical tests allows one to construct a randomness test in practice. The Martin-Löf test for randomness is defined as follows:

Definition 1. Let Pn be a set of computable probability distributions in a sample space Xn containing elements made up of n data points. A function t : Xn → IN (the set IN = {0, 1, . . .}, including infinity ∞) is a Martin-Löf test for randomness if

. t is enumerable, and
. for all n ∈ IN and m ∈ IN and P ∈ Pn,

P{x ∈ Xn : t(x) ≥ m} ≤ 2^{-m}. (2)

Condition 1 means that the randomness test is computable. Condition 2 means that the amount of regularity is measured in bits, and every extra bit halves the number of sequences exhibiting the regularity. In fact, this randomness test is a universal version of the standard statistical notion of p-values.

From Definition 1, critical regions used in the theory of hypothesis testing can be constructed in the form C1 ⊇ C2 ⊇ · · ·, where Cm = {x : t(x) ≥ m} [8]. The critical function is P{x ∈ Cm}. For all n, at a fixed m with the significance level ε = 2^{-m}:

P{x ∈ Cm} ≤ 2^{-m}. (3)

A critical region is the set of all samples in a hypothesis test for which the statistical null hypothesis H0 is rejected. The critical region Cm is chosen so that P{x ∈ Cm} is sufficiently small. This probability is called the Type I error (a false positive) or the size (or significance level) of the test. Since (2) is equivalent to

P{x ∈ Xn : t(x) ∈ [m, ∞)} ≤ 2^{-m}, (4)

it can be transformed into

P{x ∈ Xn : t′(x) ∈ (0, 2^{-m}]} ≤ 2^{-m} (5)

using the transformation f(a) = 2^{-a}, with the (test) function t replaced by some (test) function t′. Hence, a function t′ : Xn → (0, 2^{-m}] is a Martin-Löf test for randomness if it satisfies

P{x ∈ Xn : t′(x) ≤ 2^{-m}} ≤ 2^{-m} (6)

for all n ∈ IN and m ∈ IN. The Martin-Löf test for randomness can be reformulated in a way that is practically equivalent to the statistical notion of p-value as follows:

Definition 2. Let Pn be a set of computable probability distributions in a sample space Xn containing elements made up of n data points. A function t : Xn → (0, 1] is a p-value function if, for all n ∈ IN, P ∈ Pn and r ∈ (0, 1],

P{x ∈ Xn : t(x) ≤ r} ≤ r. (7)

In statistical significance testing, the p-value of a hypothesis test based on a test statistic is defined to be the smallest significance level of the test for which a rejection of the null hypothesis H0 occurs based on the observed data points. Hence, the p-value is used to reject the null hypothesis H0 in favor of the alternative hypothesis H1 with a significance level ε when the p-value is less than or equal to ε. The p-value provides a measure of how well the data support or discredit the null hypothesis [50].
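Property (7) is easy to check empirically for rank-based p-values of the kind constructed in Section 5. The Monte-Carlo sketch below is our own illustration (names are ours):

```python
import numpy as np

def rank_p_value(scores):
    """p-value of the last score in an exchangeable sequence: the fraction
    of scores at least as large as it (cf. Eq. (10) in Section 5.3)."""
    return float(np.mean(scores >= scores[-1]))

# Under exchangeability, Definition 2 requires P{p-value <= r} <= r.
# Draw 5,000 exchangeable sequences and tabulate the empirical frequency.
rng = np.random.default_rng(0)
ps = np.array([rank_p_value(rng.standard_normal(20)) for _ in range(5000)])
```

Up to Monte-Carlo error, the empirical frequency of {p ≤ r} stays at or below r for every significance level r, which is exactly the validity property (7).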

5 STRANGENESS AND P-VALUES

To construct a valid p-value function satisfying (7), one ranks data points according to some measure. Toward that end, one defines a strangeness measure that scores how much a data point differs from the other data points. A classifier such as k-NN or SVM is used to provide the strangeness measure for the (labeled) data points. The Lagrange multipliers derived using the SVM, described in Section 5.2, are used as the strangeness measure in our active learning algorithm.

5.1 Strangeness Using k-NN

Given a sequence of proximities (distances) between the members of the given training set and an unknown instance, one quantifies to what extent any of the proposed putative classification decisions are probable. Toward that end, one defines the strangeness of the unknown instance i with putative label y in relation to the rest of the training examples as


α_i = ( Σ_{j=1}^{k} d_{ij}^{y} ) / ( Σ_{j=1}^{k} d_{ij}^{¬y} ), (8)

where d_{ij}^{y} stands for the jth shortest distance in the sorted (in ascending order) sequence of distances of example i to the other examples with the same class label y, and d_{ij}^{¬y} is the jth shortest distance from the sorted sequence of distances of example i from other examples with class labels different from y [51]. The strangeness is thus the ratio of the sum of the k nearest distances from the same class to the sum of the k nearest distances from the other classes. The strangeness of an example with putative label y increases when the distances from the examples of the same class become larger and when the distances from the other classes become smaller (see Fig. 1). The strangeness of a training example with a known label y can be computed similarly using (8).
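Equation (8) translates directly into code. A minimal sketch (function name ours), using Euclidean distances and assuming at least k examples both with and without the putative label:

```python
import numpy as np

def knn_strangeness(X, y, i, label, k=1):
    """Strangeness of example i under putative label `label`, Eq. (8):
    sum of its k nearest same-label distances divided by the sum of
    its k nearest other-label distances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                      # never compare the example to itself
    same = np.sort(d[y == label])[:k]
    other = np.sort(d[y != label])[:k]
    return float(same.sum() / other.sum())

# Two well-separated clusters: labeling a point with its own cluster's
# label gives low strangeness; the opposite label gives high strangeness.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 1, 1, 1]
```

On this toy data the strangeness of point 0 under its own label is well below 1, and well above 1 under the wrong label, matching the behavior described after (8).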

5.2 Strangeness Using SVM

Given the training set {(x1, y1), (x2, y2), . . . , (xn, yn)}, where yi ∈ {−1, 1}, the SVM seeks the separating hyperplane that yields a maximal margin for the separable case, i.e., the set of training examples is separated without error and the distance between the closest training example and the hyperplane is maximal [1]. For a nonseparable case, one attempts to maximize the margin with minimum misclassification loss.

When an unknown instance x_{n+1} is included with a putative label y_{n+1} = y* in the training set, one employs the Lagrange multipliers α_1, α_2, . . . , α_n, α_{n+1} associated with the examples in the training set and (x_{n+1}, y*) as the strangeness measure using the SVM. The Lagrange multipliers α_i, i = 1, . . . , n+1, are found by maximizing the dual formulation of a soft-margin SVM

Q(α) = −(1/2) Σ_{i=1}^{n+1} Σ_{j=1}^{n+1} α_i α_j y_i y_j K(x_i, x_j) + Σ_{i=1}^{n+1} α_i (9)

subject to the constraints Σ_{i=1}^{n+1} α_i y_i = 0 and 0 ≤ α_i ≤ C, i = 1, . . . , n+1, where K(·, ·) is a kernel function.

The connection between strangeness and Lagrange multipliers can be explained as follows: The examples outside the margin have zero Lagrange multipliers. For the examples on the margin, the values of the Lagrange multipliers are between 0 and C. All examples within the margin have the Lagrange multiplier value C. The strangest examples are the ones within the margin, while those outside the margin are the least strange [52].
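In practice the α_i need not be computed by hand. A sketch (helper name ours, binary labels in {−1, 1} assumed) using scikit-learn's SVC, which exposes y_i·α_i for the support vectors via its `dual_coef_` attribute:

```python
import numpy as np
from sklearn.svm import SVC

def svm_strangeness(X, y, C=1.0):
    """Recover the Lagrange multipliers alpha_i of the soft-margin dual (9):
    alpha_i = 0 outside the margin, 0 < alpha_i < C on the margin,
    alpha_i = C for margin violations -- larger alpha means stranger."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.zeros(len(X))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])
    return alpha

alpha = svm_strangeness([[0.0], [1.0], [4.0], [5.0]], [-1, -1, 1, 1], C=1.0)
```

The recovered multipliers respect the box constraints 0 ≤ α_i ≤ C and the equality constraint Σ α_i y_i = 0 from (9).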

5.3 P-Values Based on Strangeness

Assume that x_{n+1} is the unlabeled example whose p-value one tries to obtain and that α^{y}_{n+1} is its strangeness when assigned a putative label y*. t((x1, y1), (x2, y2), . . . , (x_{n+1}, y*)) is the p-value of x_{n+1} for that label, given the training set {(x1, y1), (x2, y2), . . . , (xn, yn)}. Formally, one defines the p-value function t : X^{n+1} → (0, 1] by

t((x1, y1), (x2, y2), . . . , (x_{n+1}, y*)) = #{i = 1, . . . , n+1 : α_i ≥ α^{y}_{n+1}} / (n + 1), (10)

which satisfies Definition 2 (see [53] for the formal proof). Equation (10) computes a p-value that is equivalent to the statistical p-value computed during statistical significance testing.

As an analogy to statistical significance testing, one tests the null hypothesis H_0: "x_{n+1} assigned the label y* is not strange," against the alternative hypothesis H_1: "x_{n+1} assigned the label y* is strange." During significance testing, the smaller the p-value, the greater the evidence against the null hypothesis. The larger the p-value, the less strange an example is. When the p-value is smaller than a given significance level, the null hypothesis H_0 is rejected.

Proedrou et al. [51] predict the class of a particular testing example as the one that yields the largest "credibility" p-value from all possible putative labels [52]. The confidence of classification is one minus the second highest p-value. The credibility value shows how plausible the prediction for the testing example is. The confidence value indicates how probable a proposed label is.
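The p-value (10) and the credibility/confidence rule of Proedrou et al. [51] can be sketched as follows (function names are our own):

```python
import numpy as np

def transductive_p_value(train_alphas, new_alpha):
    """P-value (10): fraction of training examples at least as
    strange as the new example under its putative label."""
    train_alphas = np.asarray(train_alphas, dtype=float)
    return float(np.sum(train_alphas >= new_alpha)) / len(train_alphas)

def predict_with_confidence(p_values):
    """Predicted label = argmax p-value (that p-value is the credibility);
    confidence = 1 - second highest p-value."""
    order = np.argsort(p_values)[::-1]
    credibility = p_values[order[0]]
    confidence = 1.0 - p_values[order[1]]
    return order[0], credibility, confidence
```

A high credibility with a low confidence signals that a second label is nearly as plausible, which is exactly the situation QBT exploits below.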

6 QUERY-BY-TRANSDUCTION ALGORITHM

Our selection criterion for QBT is based on transductive inference using the p-values described in Section 5.3. The p-values provide a measure of diversity and disagreement in opinion regarding the true label of an unlabeled example when it is assigned all the possible labels.

Let p_i be the p-values obtained for a particular example x_{n+1} using all possible labels i = 1, ..., M. Sort the sequence of p-values in descending order so that the first two p-values, say, p_j and p_k, are the two highest p-values with labels j and k, respectively. The label assigned to the unknown example is j with a p-value of p_j. This value defines the credibility of the classification. If p_j (credibility) is not high enough, the prediction is rejected. The difference between the two p-values can be used as a confidence value on the prediction. Note that, the smaller the confidence, the larger the ambiguity regarding the proposed label.

We consider three possible cases of p-values, p_j and p_k, assuming p_j > p_k:

. Case 1. pj high and pk low. Prediction “j” has high-credibility and high-confidence value.

. Case 2. pj high and pk high. Prediction “j” has high-credibility but low-confidence value.

HO AND WECHSLER: QUERY BY TRANSDUCTION 1561

Fig. 1. Strangeness of examples using k-NN. The examples from two different classes are denoted by the circle shape and the diamond shape.

. Case 3. pj low and pk low. Prediction “j” has low-credibility and low-confidence value.

High uncertainty in prediction occurs for both Case 2 and Case 3. Note that uncertainty of prediction occurs when p_j ≈ p_k. We formally define "closeness" as

I(x_{n+1}) = p_j - p_k   (11)

to indicate the quality of information possessed by the example. The closer I(x_{n+1}) is to 0, the more uncertain we are about classifying the example. The addition of this example with its true label to the training set provides new information about the structure of the data model.

When an SVM is used on a binary-class classification task and a new unlabeled example x_{n+1} has a Lagrange multiplier 0 (or C) for both of its putative class labels, then I(x_{n+1}) = 0. Therefore, x_{n+1} is labeled and included in the training set. To include an unlabeled example for which the corresponding Lagrange multipliers are close in value for its two putative labels, the threshold for I(x_{n+1}) can be relaxed, i.e.,

I(x_{n+1}) < λ   (12)

with 0 < λ ≤ 1. For a multiclass classification task, one considers only the two labels with the two highest p-values.

We note here that "closeness" is similar to the margin defined by Abe and Mamitsuka [26] in that both measures consider the difference between some statistics computed for the two most likely labels. The statistic used by Abe and Mamitsuka [26] is simply the number of votes favoring a particular label. Our statistic, however, is closely related to the statistical p-values for significance testing.

QBT (Algorithm 1) based on (12) iteratively selects informative examples to be queried for their true labels. The labeled examples are then used to train a new SVM. Similar to QBC, one stops the active learning procedure when no example is included in T after τ consecutive examples are observed, where τ is some positive integer. For a clearer performance comparison, the stopping criterion used in Section 9 is the budget constraint, i.e., the maximum number of examples that one can query for labeling due to limited resources.

Algorithm 1: QBT.

Initialize: Training set T = {(x_1, y_1), ..., (x_n, y_n)}, selection threshold λ, stopping threshold τ, and the number of classes M.

1. repeat
2. A new unlabeled example x_{n+1} is observed.
3. for i = 1 to M do
4. Assign label i to x_{n+1}.
5. Construct an SVM using T ∪ {(x_{n+1}, i)}.
6. Use the Lagrange multipliers {α_1, ..., α_n, α^i_{n+1}} to compute the p-value p_i using (10) for (x_{n+1}, i).
7. end for
8. Let p_j and p_k be the two highest p-values from {p_1, ..., p_M}.
9. if I(x_{n+1}) < λ then
10. T := T ∪ {(x_{n+1}, y_{n+1})}, where y_{n+1} is the true label of x_{n+1}.
11. n := n + 1.
12. end if
13. until no example is included in T after τ consecutive examples are observed.

For a binary-class classification task, an SVM is used to compute the strangeness for all the training data points. For a multiclass classification task, M one-against-the-rest SVMs are used. A practical implementation of Steps 3-7 (in Algorithm 1) based on the incremental/decremental SVM [54] is described in [55]. This implementation ensures the computational efficiency of our QBT algorithm, as we avoid constructing an SVM from scratch each time a new data point is observed.
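Steps 3-9 of Algorithm 1 can be sketched as follows for a binary task (a non-incremental sketch: we refit scikit-learn's SVC from scratch for each putative label instead of using the incremental/decremental SVM [54]; names and parameter choices are our own):

```python
import numpy as np
from sklearn.svm import SVC

def qbt_query(X_train, y_train, x_new, labels, threshold=0.4, C=100.0):
    """Return (query?, p-values): for each putative label, refit the SVM,
    take the new example's Lagrange multiplier as its strangeness,
    compute the p-value (10), and query the true label when the
    closeness (11) of the two highest p-values falls below the threshold."""
    p = []
    for lab in labels:
        Xa = np.vstack([X_train, x_new])
        ya = np.append(y_train, lab)
        clf = SVC(C=C, kernel="rbf", gamma=1.0).fit(Xa, ya)
        alpha = np.zeros(len(Xa))
        alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())
        # p-value (10): fraction of training examples at least as strange
        p.append(float(np.mean(alpha[:-1] >= alpha[-1])))
    p_sorted = sorted(p, reverse=True)
    closeness = p_sorted[0] - p_sorted[1]  # I(x_{n+1}) in (11)
    return closeness < threshold, p
```

An example that lies between (or far from) both classes yields two nearly equal p-values, so its true label is queried.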

7 FROM P-VALUES TO KULLBACK-LEIBLER DIVERGENCE

The use of p-values in statistics as a measurement of evidence provided by the data against the null [statistical] hypothesis H_0 has been challenged. Moreover, p-values are not posterior probabilities that the null [statistical] hypotheses are true. Such problems can be addressed by mapping the p-values into posterior probabilities [56].

To provide a sound theoretical justification for the QBT algorithm using (12), the p-values computed from (10) need to be mapped to posterior probabilities. Then, based on the facts that 1) the Kullback-Leibler divergence can be interpreted as the expected discrimination information between the null and alternative statistical hypotheses [10] and 2) the connection between Kullback-Leibler divergence and the Shannon information, QBT is shown to be a variant of the "committee-type" active learning strategy related to QBC.

For two densities f_0 and f_1 of a continuous random variable (vector) z, the Kullback-Leibler divergence from f_1 to f_0 is defined as

KL(f_1 || f_0) = ∫ f_1(z) log( f_1(z) / f_0(z) ) dz.   (13)

For two mass functions f_0 and f_1 over a discrete random variable z, it is defined as

KL(f_1 || f_0) = Σ_z f_1(z) log( f_1(z) / f_0(z) ).   (14)
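For instance, the discrete divergence (14) can be computed directly (a straightforward sketch; the function name is our own):

```python
import numpy as np

def kl_divergence(f1, f0):
    """KL(f1 || f0) for discrete mass functions, as in (14);
    terms with f1(z) = 0 contribute nothing to the sum."""
    f1 = np.asarray(f1, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    mask = f1 > 0
    return float(np.sum(f1[mask] * np.log(f1[mask] / f0[mask])))
```

The divergence is zero when the two distributions coincide and is asymmetric in its arguments, which is why the direction KL(f(Z|H_0) || f(Z|H_1)) matters below.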

Let Z = T ∪ {(x_{n+1}, y*)}, where T is the training set and x_{n+1} is the new data point with an assigned label y*. The Kullback-Leibler divergence KL(f(Z|H_0) || f(Z|H_1)) can be interpreted as the expected discrimination information for H_0 over H_1, i.e., the mean information per sample for discriminating in favor of a null hypothesis H_0 against an alternative hypothesis H_1 when the hypothesis H_0 is true. In other words, it is the average amount of information that supports H_0 when it is true [10].

After computing p-values p_l, l = 1, ..., M, for all possible labels, one only considers the two models that included the example assigned the labels with the two highest p-values. One notes that the second highest p-value is an upper bound for the p-values of the other less likely labels. One defines the null hypothesis H_0 as "the model M_0 with x_{n+1} assigned the label with the highest p-value (from transduction)" against the alternative hypothesis H_1 as "the model M_1 with x_{n+1} assigned the label with the second highest p-value (from transduction)."

1562 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 9, SEPTEMBER 2008

Using Bayes' Theorem, one defines the posterior probability

P(H_i | Z) = P(H_i) f(Z|H_i) / [ P(H_0) f(Z|H_0) + P(H_1) f(Z|H_1) ]   (15)

for i = 0 and 1 when one considers only two models. Substituting (15) into (13) results in

KL(f(Z|H_0) || f(Z|H_1)) = ∫ f(Z|H_0) log( P(H_0|Z) / P(H_1|Z) ) dz,   (16)

defined in the probability sample space of the new data point z = (x_{n+1}, y*) ∈ X × Y, where X and Y are the instance space and the label space, respectively, and one assumes priors P(H_0) = P(H_1). Assuming that the two (known) highest p-values have been transformed to posterior probabilities, one has

KL(f(Z|H_0) || f(Z|H_1)) = log( P(H_0|Z) / P(H_1|Z) ) ∫ f(Z|H_0) dz   (17)

≤ log P(H_0|Z) - log P(H_1|Z),   (18)

since 0 ≤ ∫ f(Z|H_0) dz ≤ 1. For a discrete sample space, ∫ f(Z|H_0) dz in (17) is replaced by Σ_{z ∈ X×Y} f(T ∪ z | H_0). Note that B(Z) = P(H_0|Z) / P(H_1|Z) is the Bayes factor when the null and alternative hypotheses are simple and the priors are equal. Equation (17) can then be interpreted as the expected instantaneous discrimination information for x_{n+1}. Here, B(Z) ≥ 1. When B(Z) approaches 1, KL(f(Z|H_0) || f(Z|H_1)) approaches 0, i.e., the two hypotheses cannot be differentiated.

By mapping the p-values (10) into posterior probabilities by some p-value transformation method (see [56]) and then performing a log-transformation on the two highest p-values, selection criterion (12) amounts to the upper bound of KL(f(Z|H_0) || f(Z|H_1)).
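A numerical sketch of the bound (18); here the simple normalization of the two p-values into posteriors is a crude stand-in for the calibration methods discussed in [56]:

```python
import numpy as np

def kl_upper_bound(p_j, p_k):
    """Upper bound (18) on KL(f(Z|H0) || f(Z|H1)), obtained by
    normalizing the two highest p-values into posteriors and taking
    the log of the Bayes factor B(Z) = P(H0|Z) / P(H1|Z)."""
    post0 = p_j / (p_j + p_k)  # stand-in for P(H0 | Z)
    post1 = p_k / (p_j + p_k)  # stand-in for P(H1 | Z)
    return float(np.log(post0) - np.log(post1))
```

When the two p-values coincide, B(Z) = 1 and the bound is 0: the two hypotheses cannot be differentiated, which is exactly the case in which QBT queries the label.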

8 RELATION BETWEEN QBT AND QBC

It was pointed out in Graepel et al. [57], [58] that the posterior probability (estimated using the two subvolumes of the version space) of the label of a new example x_{n+1} computed from transduction can be used to estimate the information content of the example. The existence of the version space requires the SVM (used in Algorithm 1) to linearly separate the training examples in the feature space. Tong and Koller [6] noted that, since the feature space is high dimensional, the data set, in many cases, is linearly separable. Even when it is not, the kernel used by the SVM can be modified so that the data in the new induced feature space is linearly separable. This observation is important as the SVMs constructed in Algorithm 1 are assumed to be drawn from the version space even when the new example is labeled differently. Based on this observation, one realizes that QBT constructs M possible SVMs (for M possible classes) that are drawn from the version space. However, only the two SVMs that resulted in predictions with the two highest p-values are considered in QBT. Hence, we have two members in the committee. QBT queries the label of an example when the two members disagree in their predictions according to the p-values.

One notes that, for QBC, the expected information gain of a data point x_i is defined using the Shannon information content I(b) of a binary random variable b [11]. The Shannon information for a new data point x_{n+1} can be expressed as

I(x_{n+1}) = log N - KL(f(Z|H_0) || f(Z|H_1)),   (19)

where N is the number of labels (assumed equally likely). Since we are only considering the two most likely models, N = 2. When I(x_{n+1}) is large, KL(f(Z|H_0) || f(Z|H_1)) is low, i.e., large expected information gain implies small expected (instantaneous) discrimination information. The threshold λ for the QBT algorithm is implicitly an upper bound for the expected instantaneous discrimination information. Hence, it is a lower bound for the expected (instantaneous) information gain required by QBC for good theoretical performance.
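Under the two-model restriction (N = 2), relation (19) can be checked numerically (a sketch that reuses the log-Bayes-factor bound as the KL term; the posterior inputs are assumed already calibrated):

```python
import numpy as np

def expected_info_gain(post0, post1):
    """Relation (19) with N = 2: expected information gain
    = log 2 - KL term, with the KL term taken as log(post0 / post1)."""
    kl_term = np.log(post0) - np.log(post1)
    return float(np.log(2.0) - kl_term)
```

Equal posteriors give the maximal gain log 2; a confident prediction (post0 much larger than post1) gives a large KL term and hence a small gain, matching the trade-off described above.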

QBT can then be transformed into a QBC framework consisting of a 2-member committee by the following three steps:

1. Map the two highest p-values to the posterior probabilities.

2. Log-transform the two posterior probabilities and then compute the upper bound of the Kullback-Leibler divergence.

3. QBT is transformed into the QBC framework with two committee members using (19).

In fact, one observes a close relationship between a "transduction-type" solution and a "committee-type" solution for the particular transductive inference procedure described in Section 6 (Algorithm 1: Steps 3-7).

Before we end this section, we point out that, to establish a theoretical connection between QBT and QBC, we assume linear separability of the data set (in the feature space). To avoid the implementation issue that the SVM solution does not converge when a data set is not linearly separable, a soft-margin SVM (see Section 5.2) is used in QBT. One notes that the soft-margin SVM is only used to derive the strangeness measure, an intermediate step for computing the p-value. The only apparent weakness of this strangeness measure is that, if the previously observed examples and the newly observed example fall within the margin, the p-value will have a higher value. This problem can be alleviated using a randomized p-value [7]. Empirical observations, however, show that the performance of QBT is not affected by this problem.

9 EXPERIMENTAL RESULTS

We report experimental results that show the feasibility and utility of QBT and compare its performance with four active learning strategies: random sampling, committee-based active learning [27], margin-based active learning [23], and KQBC [30]. In Section 9.1, the active learning strategies that we used for a performance comparison are briefly described. In Section 9.2, we describe the experimental procedure. The performance comparison of different active learning strategies is presented in Section 9.3 for eight binary-class classification tasks and in Section 9.4 for two multiclass classification tasks.

9.1 Active Learning Strategies for Performance Comparison

The four active learning strategies that we used for a performance comparison are briefly described below:

1. Random sampling. Unlabeled examples are randomly chosen and classification is done using the SVM.

2. Committee-based active learning. Similar to that by Tur et al. [28], a committee consisting of two different classifiers is used. In our experiment, we use the SVM and the k-NN with k = 1. When there is a disagreement between the two classifiers on the prediction for an unlabeled example, the true label of the example is queried.

The classification can be performed using 1) SVM (committee-based (SVM)), 2) k-NN (committee-based (k-NN)), or 3) a combination of both (committee-based). For the last classification approach, the predictions are based on the agreement between the SVM and the k-NN. When there is a disagreement, a prediction is chosen randomly from the predictions of the SVM and the k-NN.

3. Margin-based active learning (SVM-AL). Similar to Campbell et al. [23], an SVM is constructed using the set of labeled examples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} (see Section 5.2). The decision function value f(x) for an unlabeled example x is computed using

f(x) = Σ_{i ∈ I} y_i α_i K(x_i, x) + b,

where I is the set of indices for the labeled examples, K(·, ·) is a kernel function, and b ∈ R is the offset computed when the SVM is constructed. For Campbell et al. [23], one picks the example with the smallest |f(x)| from the set of unlabeled examples and queries the example when it is within the margin, i.e., |f(x)| < β with β = 1. The main problem of margin-based active learning in a stream-based setting is that one cannot pick the example with the smallest |f(x)| from the set of unlabeled examples since only one new example is observed at each iteration. Hence, we query the label of an example when |f(x)| < β. For stream-based active learning, one may sometimes need to consider a higher β to capture sufficient examples for a reasonable classification performance. In our experiment, initially, β = 1. After a sufficient number of iterations (say, 100) without any example being queried, β becomes the smallest |f(x)| value computed since the last query. The initial β value is selected through cross-validation using various values between 0.1 and 100 on the data sets not used in the reported experimental results. We note that a small β results in the active learning algorithm iterating without selecting examples to label.

Classification is based on the SVM constructed using the examples whose labels are queried.

4. KQBC. KQBC is similar to QBC. The only difference is that the sampling process for QBC is done in the high-dimensional version space, whereas KQBC samples from a low-dimensional projection of the original version space. For KQBC, the quality of the queried examples depends on the kernel parameter σ_KQBC and the number of random walks r. KQBC uses a kernelized linear classifier sampled from the version space for prediction. KQBC can also be used to query examples with prediction based on the SVM, a combination we call KQBC-SVM.

The best performance of KQBC-SVM using σ_KQBC = 10^i, i = -1, 0, 1, 2, 3, and r = 10^j, j = 2, 3, for each binary-class task is reported in Section 9.3.

5. QBT. The parameter λ = 0.4 for the binary-class classification tasks in Section 9.3 and λ = 0, i.e., I(x_{n+1}) = 0 (see Section 6), for the multiclass classification tasks in Section 9.4. These values are selected, through cross-validation, using data sets not used in the reported experimental results. λ = 0 for the multiclass classification task corresponds to the situation when an example can take either of the two most likely labels according to the p-values.
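The decision-function evaluation and the stream-based margin rule used by SVM-AL (strategy 3 above) can be sketched as follows (a minimal sketch; the RBF kernel and the names are our own choices):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def svm_decision(x, sv, y_sv, alpha_sv, b, kernel=rbf):
    """f(x) = sum_{i in I} y_i alpha_i K(x_i, x) + b over the labeled set."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha_sv, y_sv, sv)) + b

def margin_query(x, sv, y_sv, alpha_sv, b, beta=1.0):
    """Stream-based rule: query the label when |f(x)| < beta."""
    return abs(svm_decision(x, sv, y_sv, alpha_sv, b)) < beta
```

Because each example is seen only once, the rule fires on any example inside the margin rather than on the pool-wide minimizer of |f(x)|, which is precisely the weakness discussed in Section 9.3.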

9.2 Experimental Design

For a given classification task, the active learner is initially provided with one randomly chosen labeled example from each class to form a training set. For a binary-class task, the SVM and the k-NN in a committee-based active learner always agree in their predictions when there are only two training examples. Hence, in our experiments for binary-class tasks, another labeled example is randomly chosen to be included in the initial training set. Unlabeled examples are then observed one by one. The decision to query the label of an example is based on the given active learning strategy. The queried example is included in the training set to build a classifier. Examples not queried for labels are returned to the unlabeled data set.

For a performance comparison of active learning strategies, the SVM classifier is used for consistency. Similar to that by Baram et al. [43], since the main purpose of the SVM classifier is to compare different active learning strategies, the issue of model selection for the classifier is not the main concern of our work. Hence, the C parameter for the SVM used in our experiments is fixed at 100, and we use the Gaussian kernel with σ = 1. The SVM performs reasonably well on the various classification tasks using these parameters (see Tables 1 and 2).


TABLE 1. Information on the Binary-Class Classification Tasks Used for the Performance Comparison

The performance comparison of different active learning strategies for a classification task is based on the classification accuracy that is computed iteratively as the training set increases in size with the inclusion of more queried examples. A budget constraint, i.e., the number of label queries that can be made on a data set due to limited resources, is specified.

We illustrate next that using a similar active learning strategy but different classifiers results in a very different classification performance. Consider the nonseparable data set based on a subset of the normalized handwritten digits automatically scanned from envelopes by the US Postal Service [59]. The original scanned digits are binary images whose sizes and orientations vary. The images have been deslanted and size normalized, resulting in 16 × 16 gray-scale images. The linearly nonseparable binary classification problem consists of the sets of digit "3" and "8." There are 652 training examples of "3" and 542 training examples of "8." The training accuracy using all the examples is 100 percent. The test accuracy is 91.83 percent using a test set of 166 examples for each of the two digits.

The result reported in Fig. 2 is the mean classification accuracy over 20 runs of the binary classification problem, where the initial training set (containing three labeled examples) is fixed, and the training sequences are randomly permuted. Fig. 3 shows the standard deviation of the classification accuracy. All five active learning strategies described in Section 9.1 are used in this experiment. Figs. 2a and 3a show the learning curves and their standard deviations for the five active learning strategies with the SVM as the classifier. Figs. 2b and 3b show the learning curves and their standard deviations for KQBC, committee-based active learning using k-NN as the classifier, and classification based on majority vote. The performance of random sampling is used as a baseline for comparison in the right graphs. The budget constraint is 80.

One observes from Fig. 2b that classification using other learners such as the k-NN and the kernelized linear classifier (for KQBC) or classification based on a majority vote may not be competitive against the SVM. One also observes that KQBC (using the kernelized linear classifier) performs worse than random sampling (using SVM) in this particular binary-class task. Hence, in Sections 9.3 and 9.4, we use only the SVM for a fair performance comparison of active learning strategies, similar to Figs. 2a and 3a.

For the performance comparison for multiclass tasks in Section 9.4, we use only 1) random sampling, 2) QBT, and 3) committee-based active learning using SVM as the classifier (committee-based (SVM)). Random sampling is used as a baseline for comparison purposes, whereas committee-based active learning using SVM as the classifier is the most competitive active learning strategy shown in Figs. 2 and 3 and also in Section 9.3 for binary-class tasks. KQBC-SVM and SVM-AL cannot be easily extended to handle multiclass tasks. Hence, they are not used for performance comparisons.

TABLE 2. Information on the Two Multiclass Classification Tasks Used for the Performance Comparison

Fig. 2. USPS Digit "3" and "8" data set. The mean classification accuracy (20 runs) versus the number of training examples.

Fig. 3. USPS Digit "3" and "8" data set. The standard deviation of classification accuracy (20 runs) versus the number of training examples.

9.3 Experimental Comparison: Binary-Class Classification Tasks

We use eight binary-class classification tasks to compare the performance of QBT with the four active learning strategies described in Section 9.1. Each problem consists of 100 training/testing runs (except for the splice benchmark data set, which has 20). Some information on the benchmark data sets is shown in Table 1. The accuracy of a particular training/testing run is computed using an SVM with parameters C = 100 and σ = 1 on the whole training set. Further details about the benchmark data sets are found in [43] and [60]. We note that the benchmark suite [60] consists of 13 benchmark data sets, and only eight are used here. The other five data sets are used to select the threshold for QBT.

In the experiment, the budget constraint is 80. For the thyroid data set, due to its small size, we perform active learning to obtain only 70 examples for labeling.

From the learning curves in Figs. 4, 5, 6, and 7, one observes that QBT consistently outperforms the other four active learning strategies on the diabetes, heart, splice, and waveform classification tasks. Given the same number of training examples, the mean classification accuracy of QBT is almost always better than that of the other active learning strategies. One also observes in Figs. 4, 5, 6, and 7 that, after active learning is completed (at 83 examples), the mean classification accuracies of QBT for the four tasks are the best among the active learning strategies compared.

In Figs. 8 and 9, one observes that the performance of QBT is comparable to the committee-based active learning on the thyroid and German classification tasks. In Fig. 10, one observes that the performance of all five active learning strategies is comparable on the ringnorm classification task. However, QBT and the committee-based active learning have steeper learning curves during the early stage of active learning, i.e., both strategies achieve high classification performance with fewer training examples. In Fig. 11, one observes that committee-based active learning has the best performance on the breast cancer classification task, whereas QBT achieves the second-best performance. Overall, QBT demonstrates competitive performance against the other four active learning strategies on the eight binary-class classification tasks.

One interesting observation from the experimental results is that, for some classification tasks (diabetes, heart, and breast cancer), the performance of the SVM classifier based on the 80 examples queried by QBT (and the three initial examples) is better than using all the training examples (see Table 1 and Figs. 4, 5, and 11).

In Figs. 4, 5, 6, 7, 8, 9, 10, and 11, one observes that KQBC-SVM is sensitive to model selection and has an unstable performance. Sometimes KQBC-SVM performs much better than random sampling (see Fig. 4); sometimes it performs worse than random sampling (see Figs. 7 and 8). The performance of KQBC-SVM is sensitive to the number of random walks specified and to the kernel parameter.

From our experimental results, one observes that the performance of SVM-AL in a stream-based setting, in general, is not any better than random sampling. This may at first appear unreasonable, as margin-based active learning (SVM-AL) is among the state-of-the-art active learning strategies in a pool-based setting (see Section 2). However, one should be aware that, in a pool-based setting, decisions to query are based on complete information on the training set. On the other hand, in a stream-based setting, the decision to query has to be made once a new example is observed. Margin-based active learning may not be feasible in a stream-based setting, where a decision has to be made with partial information.

Fig. 4. Diabetes data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 5. Heart data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 6. Splice data set. (a) The mean classification accuracy (20 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (20 runs) versus the number of training examples.

Fig. 7. Waveform data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 8. Thyroid data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 9. German data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 10. Ringnorm data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

Fig. 11. Breast cancer data set. (a) The mean classification accuracy (100 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (100 runs) versus the number of training examples.

9.4 Experimental Comparison: Multiclass Classification Tasks

Experimental results that compare the performance of QBT with random sampling and the committee-based active learning on two multiclass classification tasks are reported in this section. The two tasks are the USPS handwritten digit problem [59] and the letter image problem [61]. Some information on the two multiclass data sets is shown in Table 2. The accuracy on a testing set is computed using an SVM with parameters C = 100 and σ = 1 on the whole training set.

The results reported in Figs. 12 and 13 are the means and standard deviations of the classification performance over 20 runs of the two multiclass tasks. The initial training set (containing one example from each class) is fixed, and the training sequences are randomly permuted. The budget constraints for the USPS handwritten digit problem and the letter image problem are 300 and 500, respectively. One observes in Fig. 12 that QBT performs better than the other two strategies. In Fig. 13, the learning curves of QBT and the committee-based active learning are comparable, and they are significantly better than random sampling.

Our results here, together with the experimental results reported in Section 9.3, show that QBT is feasible and performs competitively against the other four active learning strategies. One also notes that committee-based active learning using SVM for classification displayed learning performance similar to QBT for some classification tasks. This corresponds to our theoretical argument about the close connection between "transductive-type" active learning and "committee-type" active learning (when the same classifier is used for classification).

Fig. 12. USPS Digit data set. (a) The mean classification accuracy (20 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (20 runs) versus the number of training examples.

Fig. 13. Letter image data set. (a) The mean classification accuracy (20 runs) versus the number of training examples. (b) The standard deviation of classification accuracy (20 runs) versus the number of training examples.

10 CONCLUSIONS

In this paper, a novel stream-based active learning algorithm, called QBT, based on p-values obtained from transduction is proposed. Based on the facts that 1) the Kullback-Leibler divergence can be interpreted as the expected discrimination information between the null and alternative statistical hypotheses and 2) the connection between the Kullback-Leibler divergence and the Shannon information holds under some assumptions, the QBT selection criterion is closely related to the QBC selection criterion. In fact, QBT can be viewed as a variant of the "committee-type" strategy with each committee member favoring a particular label. The feasibility and utility of QBT is shown on both binary and multiclass classification tasks using the SVM as the choice classifier. Our experimental results show that QBT compares favorably in terms of mean generalization against random sampling, committee-based active learning, margin-based active learning, and QBC.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their helpful comments. The authors thank Ran Gilad-Bachrach for the KQBC Matlab code and clarification about the code.

REFERENCES

[1] V.N. Vapnik, The Nature of Statistical Learning Theory, second ed. Springer, 2000.

[2] T. Joachims, "Transductive Inference for Text Classification Using Support Vector Machines," Proc. 16th Int'l Conf. Machine Learning, I. Bratko and S. Dzeroski, eds., pp. 200-209, 1999.

[3] F. Li and H. Wechsler, "Open Set Face Recognition Using Transduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1686-1697, Nov. 2005.

[4] M. Okabe, K. Umemura, and S. Yamada, "Query Expansion with the Minimum User Feedback by Transductive Learning," Proc. Human Language Technology Conf. and Conf. Empirical Methods in Natural Language Processing (HLT/EMNLP '05), pp. 963-970, 2005.

[5] R. Craig and L. Liao, "Protein Classification Using Transductive Learning on Phylogenetic Profiles," Proc. ACM Symp. Applied Computing, pp. 161-166, 2006.

[6] S. Tong and D. Koller, "Support Vector Machine Active Learning with Applications to Text Classification," J. Machine Learning Research, vol. 2, pp. 45-66, 2001.

[7] V. Vovk, A. Gammerman, and G. Shafer, Algorithmic Learning in a Random World. Springer, 2005.

[8] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, second ed. Springer, 1997.

[9] A. Gammerman and V. Vovk, "Prediction Algorithms and Confidence Measures Based on Algorithmic Randomness Theory," Theoretical Computer Science, vol. 287, no. 1, pp. 209-217, 2002.

[10] S. Kullback, Information Theory and Statistics. John Wiley & Sons, 1959.

[11] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby, "Selective Sampling Using the Query by Committee Algorithm," Machine Learning, vol. 28, nos. 2-3, pp. 133-168, 1997.

[12] D.A. Cohn, Z. Ghahramani, and M.I. Jordan, "Active Learning with Statistical Models," J. Artificial Intelligence Research, vol. 4, pp. 129-145, 1996.

[13] T. Zhang and F. Oles, "A Probability Analysis on the Value of Unlabeled Data for Classification Problems," Proc. 17th Int'l Conf. Machine Learning, pp. 1191-1198, 2000.

[14] M. Li and I. Sethi, "Confidence-Based Active Learning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251-1261, Aug. 2006.

[15] D.D. Lewis and J. Catlett, "Heterogeneous Uncertainty Sampling for Supervised Learning," Proc. 11th Int'l Conf. Machine Learning, pp. 148-156, 1994.

[16] D. MacKay, "Information-Based Objective Functions for Active Data Selection," Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.

[17] C. Zhang and T. Chen, "An Active Learning Framework for Content-Based Information Retrieval," IEEE Trans. Multimedia, vol. 4, no. 2, pp. 260-268, 2002.

[18] H.S. Seung, M. Opper, and H. Sompolinsky, "Query by Committee," Proc. Fifth Ann. Conf. Learning Theory, pp. 287-294, 1992.

[19] X. Zhu, J. Lafferty, and Z. Ghahramani, "Combining Active Learning and Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions," Proc. ICML Workshop Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.

[20] N. Roy and A. McCallum, "Toward Optimal Active Learning through Sampling Estimation of Error Reduction," Proc. 18th Int'l Conf. Machine Learning, pp. 441-448, 2001.

[21] R. Yan, J. Yang, and A.G. Hauptmann, "Automatically Labeling Video Data Using Multi-Class Active Learning," Proc. Ninth Int'l Conf. Computer Vision, pp. 516-523, 2003.

[22] G. Schohn and D. Cohn, "Less Is More: Active Learning with Support Vector Machines," Proc. 17th Int'l Conf. Machine Learning, pp. 839-846, 2000.

[23] C. Campbell, N. Cristianini, and A.J. Smola, "Query Learning with Large Margin Classifiers," Proc. 17th Int'l Conf. Machine Learning, pp. 111-118, 2000.

[24] K. Brinker, "Active Learning with Kernel Machines," PhD dissertation, Univ. of Paderborn, 2004.

[25] P. Mitra, C.A. Murthy, and S.K. Pal, "A Probabilistic Active Support Vector Learning Algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 3, pp. 413-418, Mar. 2004.

[26] N. Abe and H. Mamitsuka, "Query Learning Strategies Using Boosting and Bagging," Proc. 15th Int'l Conf. Machine Learning, pp. 1-9, 1998.

[27] I. Dagan and S. Engelson, "Committee-Based Sampling for Training Probabilistic Classifiers," Proc. 12th Int'l Conf. Machine Learning, pp. 150-157, 1995.

[28] G. Tur, R.E. Schapire, and D. Hakkani-Tür, "Active Learning for Spoken Language Understanding," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, 2003.

[29] P. Melville and R. Mooney, "Diverse Ensembles for Active Learning," Proc. 21st Int'l Conf. Machine Learning, pp. 584-591, 2004.

[30] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Query by Committee Made Real," Proc. Ann. Conf. Advances in Neural Information Processing Systems (NIPS '05), 2005.

[31] S. Dasgupta, A. Kalai, and C. Monteleoni, "Analysis of Perceptron-Based Active Learning," Proc. 18th Ann. Conf. Learning Theory, 2005.

[32] A. McCallum and K. Nigam, "Employing EM and Pool-Based Active Learning for Text Classification," Proc. 15th Int'l Conf. Machine Learning, pp. 359-367, 1998.

[33] P. Melville, S. Yang, M. Saar-Tsechansky, and R. Mooney, "Active Learning for Probability Estimation Using Jensen-Shannon Divergence," Proc. European Conf. Machine Learning, pp. 268-279, 2005.

[34] R. Yan, J. Yang, and A. Hauptmann, "Automatically Labeling Video Data Using Multi-Class Active Learning," Proc. Ninth IEEE Int'l Conf. Computer Vision, pp. 516-523, 2003.

[35] K.-S. Goh, E.Y. Chang, and W.C. Lai, "Multimodal Concept-Dependent Active Learning for Image Retrieval," Proc. 12th ACM Int'l Conf. Multimedia, pp. 564-571, 2004.

[36] S. Tong and E.Y. Chang, "Support Vector Machine Active Learning for Image Retrieval," Proc. ACM Multimedia, pp. 107-118, 2001.

[37] T. Luo, K. Kramer, D.B. Goldgof, L.O. Hall, S. Samson, A. Remsen, and T. Hopkins, "Active Learning to Recognize Multiple Types of Plankton," J. Machine Learning Research, vol. 6, pp. 589-613, 2005.

[38] M.K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen, "Active Learning with Support Vector Machines in the Drug Discovery Process," J. Chemical Information and Computer Sciences, vol. 43, pp. 667-673, 2003.

[39] C. Dima and M. Hebert, "Active Learning for Outdoor Obstacle Detection," Robotics: Science and Systems, pp. 9-16, 2005.

[40] R. Yan and A. Hauptmann, "Multi-Class Active Learning for Video Semantic Feature Extraction," Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 67-72, 2004.

[41] S.C.H. Hoi, R. Jin, J. Zhu, and M.R. Lyu, "Batch Mode Active Learning and Its Application to Medical Image Classification," Proc. 23rd Int'l Conf. Machine Learning, pp. 417-424, 2006.

[42] S.C.H. Hoi, R. Jin, and M.R. Lyu, "Large-Scale Text Categorization by Batch Mode Active Learning," Proc. 15th Int'l Conf. World Wide Web, pp. 633-642, 2006.

[43] Y. Baram, R. El-Yaniv, and K. Luz, "Online Choice of Active Learning Algorithms," J. Machine Learning Research, pp. 255-291, 2004.

[44] R. Kothari and V. Jain, "Learning from Labeled and Unlabeled Data Using a Minimal Number of Queries," IEEE Trans. Neural Networks, vol. 14, no. 6, 2003.

[45] V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods. John Wiley & Sons, 1998.

1570 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 9, SEPTEMBER 2008

[46] Semi-Supervised Learning, O. Chapelle, B. Schölkopf, and A. Zien, eds. MIT Press, 2006.

[47] K. Yu, J. Bi, and V. Tresp, "Active Learning via Transductive Experimental Design," Proc. 23rd Int'l Conf. Machine Learning, pp. 1081-1088, 2006.

[48] S.-S. Ho and H. Wechsler, "Transductive Confidence Machines for Active Learning," Proc. Int'l Joint Conf. Neural Networks (IJCNN '03), 2003.

[49] V. Vovk, A. Gammerman, and C. Saunders, "Machine-Learning Applications of Algorithmic Randomness," Proc. 16th Int'l Conf. Machine Learning, I. Bratko and S. Dzeroski, eds., pp. 444-453, 1999.

[50] S. Weerahandi, Exact Statistical Methods for Data Analysis. Springer, 1994.

[51] K. Proedrou, I. Nouretdinov, V. Vovk, and A. Gammerman, "Transductive Confidence Machines for Pattern Recognition," Proc. 13th European Conf. Machine Learning, T. Elomaa, H. Mannila, and H. Toivonen, eds., pp. 381-390, 2002.

[52] C. Saunders, A. Gammerman, and V. Vovk, "Transduction with Confidence and Credibility," Proc. 16th Int'l Joint Conf. Artificial Intelligence, T. Dean, ed., pp. 722-726, 1999.

[53] T. Melluish, C. Saunders, I. Nouretdinov, and V. Vovk, "Comparing the Bayes and Typicalness Frameworks," Proc. 12th European Conf. Machine Learning, pp. 360-371, 2001.

[54] G. Cauwenberghs and T. Poggio, "Incremental Support Vector Machine Learning," Advances in Neural Information Processing Systems 13, pp. 409-415. MIT Press, 2000.

[55] S.-S. Ho and H. Wechsler, "Learning from Data Streams via Online Transduction," Proc. ICDM Workshop Temporal Data Mining: Algorithms, Theory and Applications (TDM '04), 2004.

[56] T. Sellke, M.J. Bayarri, and J.O. Berger, "Calibration of p-values for Testing Precise Null Hypotheses," The Am. Statistician, vol. 55, pp. 62-71, 2001.

[57] T. Graepel, R. Herbrich, and K. Obermayer, "Bayesian Transduction," Proc. Ann. Conf. Advances in Neural Information Processing Systems (NIPS '99), S.A. Solla, T.K. Leen, and K.-R. Müller, eds., pp. 456-462, 1999.

[58] T. Graepel and R. Herbrich, "The Kernel Gibbs Sampler," Proc. Ann. Conf. Advances in Neural Information Processing Systems (NIPS '00), T.K. Leen, T.G. Dietterich, and V. Tresp, eds., pp. 514-520, 2000.

[59] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, pp. 541-551, 1989.

[60] G. Rätsch, T. Onoda, and K.-R. Müller, "Soft Margins for AdaBoost," Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.

[61] P. Frey and D. Slate, "Letter Recognition Using Holland-Style Adaptive Classifiers," Machine Learning, vol. 6, pp. 161-182, 1991.

Shen-Shyang Ho received the BS degree in mathematics and computational science from the National University of Singapore in 1999 and the MS and PhD degrees in computer science from George Mason University in 2003 and 2007, respectively. He is currently a NASA Postdoctoral Program (NPP) Fellow with the NASA Jet Propulsion Laboratory, California Institute of Technology. His research activities include online learning from data streams, adaptive learning, pattern recognition, data mining, and optimization methods. His current research focuses on machine learning and pattern recognition techniques for the detection and tracking of cyclones and other events from remote sensing data streams. He is a member of the IEEE.

Harry Wechsler received the PhD degree in information and computer science from the University of California, Irvine. Currently, he is a professor of computer science and the director of the Center of Distributed and Intelligent Computation, George Mason University (GMU). His research interests include intelligent systems focusing on computational vision, image and signal processing, data mining, and machine learning and pattern recognition, with applications to biometrics/face recognition/gait analysis/performance evaluation, augmented cognition and HCI, change detection and link analysis, and video processing and surveillance. He has published more than 250 scientific papers. He serves on the editorial board of several major scientific publications. He is the author of Computational Vision (Academic Press, 1990) and Reliable Face Recognition Methods (Springer, 2006), which break new ground in applied modern pattern recognition and biometrics. As a leading researcher in face recognition, he organized and directed the seminal NATO Advanced Study Institute (ASI) on "Face Recognition: From Theory to Applications" in 1997, whose proceedings were published by Springer in 1998. He has directed at GMU the design and development of FERET, which has become the standard facial database for benchmark studies and experimentation. He was elected an IEEE fellow in 1992 for "contributions to spatial/spectral image representations and neural networks and their theoretical integration and application to human and machine perception" and an International Association of Pattern Recognition (IAPR) Fellow in 1998. He was granted (together with his former doctoral students) two patents by USPO in 2004 on fractal image compression using quad-q-learning (licensed in 2006) and feature-based classification (for face recognition). Two additional patents (together with his former doctoral students) on open set (face) recognition (and intrusion/outlier detection) and change detection using martingale are now pending with USPO. He is a fellow member of the IEEE.

