
Coaching the Exploration and Exploitation in Active Learning for Interactive Video Retrieval

Xiao-Yong Wei, Member, IEEE, Zhen-Qun Yang∗

Abstract—Conventional active learning approaches for interactive video/image retrieval usually assume the query distribution is unknown, because it is difficult to estimate with only a limited number of labeled instances available. It is thus easy to put the system in a dilemma whether to explore the feature space in uncertain areas for a better understanding of the query distribution or to harvest in certain areas for more relevant instances. In this paper, we propose a novel approach called coached active learning (CAL), which makes the query distribution predictable through training and thus avoids the risk of searching on a completely unknown space. The estimated distribution, which provides a more global view of the feature space, can be utilized to schedule not only the timing but also the step sizes of the exploration and the exploitation in a principled way. The results of the experiments on a large-scale dataset from TRECVID 2005–2009 validate the efficiency and effectiveness of our approach, which demonstrates encouraging performance when facing domain shift, outperforms eight conventional active learning methods, and shows superiority to six state-of-the-art interactive video retrieval systems.

Index Terms—interactive video retrieval; coached active learning; query distribution modeling.

I. INTRODUCTION

MOST video retrieval applications are implemented with a single-round searching scheme, where a user types a query (usually a short text phrase) and then waits aside in the hope that the system is able to figure out what she/he wants and returns the results on demand. However, the system can seldom do so, because it is often difficult for the user to describe her/his specific need with a short phrase. As a complementary technique to this single-round scheme, interactive retrieval models the search as an iterative process and allows the users to give relevance feedback so as to help the system "understand" their query intentions. In each iteration, once the system returns a set of resulting items, a user can label some of them as "relevant" or "irrelevant". These labeled instances are then used for further estimating the user's query intention, on the basis of which the system starts a new round of searching. As users cannot keep constant patience with the iterative labeling, how to learn users' query intentions effectively is critical in designing an interactive retrieval system, a problem called query learning [1].

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

The work described in this paper was supported in part by grants from the National Natural Science Foundation of China (No. 61001148 and No. 61272256) and the Open Projects Program of the National Laboratory of Pattern Recognition of China.

The authors are all with the College of Computer Science, Sichuan University, Chengdu 610065, China (e-mail: {cswei,jessicayang}@scu.edu.cn).

∗ Corresponding author.

Active learning [2] is a commonly adopted scheme for this purpose, which speeds up the process by reducing the number of instances that need to be labeled. In active learning, the search is considered a process of training a classifier which distinguishes relevant instances from irrelevant ones. The training is generally conducted on the labeled instances (with relevant instances as positive examples and irrelevant ones as negative) and is repeated every time freshly labeled instances are obtained through the users' labeling. To reduce the labeling effort, existing approaches usually choose the instances closest to the decision boundary to query the users, because those are the instances the machine is most uncertain of, and labeling them will potentially cause a significant change of the classifier in the next round so as to ensure faster convergence. This greedy strategy is termed uncertainty sampling [3–10].

Despite probably being the most popularly employed strategy, however, uncertainty sampling has been found not to be a "kindred soul" of interactive search. The spirit of uncertainty sampling is to reduce the labeling effort by avoiding querying users with instances that the machine is most confident about, e.g., those farthest from the decision boundary on the positive side. Nevertheless, in interactive search, those are exactly what the users are looking for, and delaying their labeling may frustrate the users at early stages, resulting in abortion of the learning. In other words, uncertainty sampling aims to achieve a high-quality classifier at the final stage and thus is not eager to "harvest" relevant instances returned by the precursor classifiers. In interactive search (especially at early stages), we concentrate more on the harvest and pay less attention to the quality of the final classifier, a conflict also observed by A. G. Hauptmann et al. [11]. The problem can be related to an open question known as the trade-off between exploitation and exploration [12, 13], where a learner always hesitates whether to exploit the knowledge at hand for obtaining immediate rewards or to explore the unknown area for improving future gains. This is a dilemma encountered by any learning process conducted on an unknown distribution, but it has not been addressed sufficiently by prior studies of active learning.

In terms of uncertainty sampling, the dilemma refers to the fact that in each run, a learner has to choose between the exploitation of harvesting the instances that the machine is confident about so as to fulfill the retrieval, and the exploration of querying the user with instances that the machine is uncertain about so as to gain a better understanding of the target space. The reason is that the exploration of uncertainty sampling relies too much on the posterior knowledge indicated by the decision boundary, which provides only local information about a small portion of the space, and, therefore, the exploitation and exploration cannot be balanced because of the lack of a global perspective. To address this problem, we have conducted a pilot study [14] attempting to utilize prior knowledge (learnt from a training dataset) to guide the active learning globally, so that the exploitation and exploration can be well scheduled and balanced. However, due to the expense of the human experiments, the nature of the method has not been fully revealed in [14]. Furthermore, we notice that the prior knowledge learnt from training will sometimes suffer from a cross-domain issue when the target domain is different from that of the training.

Fig. 1. Interactive retrieval framework consisting of three modules: initial search, query modeling, and coached active learning.

In this paper, we extend our prior work [14] and propose a more sophisticated approach for active learning, which balances the exploitation and exploration, and addresses the cross-domain issue within a unified framework. Through a much more comprehensive study on a larger dataset, we will demonstrate that the extension is not only able to offer further performance improvements, but also reveals the principal factors (and their interdependence) related to interactive retrieval, on the basis of which we can improve our future designs. In the sense that the learning is jointly guided by the prior and posterior knowledge, we name it coached active learning (CAL) to distinguish it from conventional approaches without the coaching process. Compared with existing methods in active learning, CAL offers the following advantages:

• Predictable Query Distribution: Conventional approaches of active learning usually assume the query distribution is unknown, because it is difficult to estimate with only a limited number of labeled instances. We argue that the query distribution is predictable in the scenario of interactive video search when enough training examples are available. To bypass the difficulty of estimating it on labeled instances directly, we use a statistical test to find training examples which are from the same distribution(s) as the labeled instances, and utilize the distribution(s) of those examples (much more abundant than the labeled instances) to predict that of the query. We will show that by analyzing the semantics carried by the training examples (found by the statistical test), we can even obtain the users' query intention semantically. With the estimated query distribution, we avoid the risk of blindly searching on a completely unknown distribution;

• Exploitation versus Exploration: The estimated query distribution, together with the distribution inferred from the decision boundary of the current classifier, will be used to evaluate the relevancy and uncertainty of the instances, on the basis of which we can balance the exploitation and exploration in a principled way by controlling the proportion of two types of instances to query next, namely those with high relevancy (for boosting the exploitation) and those with high uncertainty (for improving the exploration). The proportion is determined based on the harvest history, so that we encourage exploitation to keep the users' patience if the number of freshly labeled relevant instances maintains a growth momentum statistically, and, otherwise, we encourage exploration to push the decision boundary towards the uncertain area so as to increase the chance of locating more relevant instances in the long run;

• Domain Adaptivity: It often happens that the domain of the testing dataset is different from that of the training dataset, making the prior knowledge learnt from training less applicable to the target domain. In CAL, we narrow the domain gap by gradually mixing the predictor distributions (learnt from the training examples) with the distribution of the labeled relevant instances (representing knowledge from the target domain) in each round. This keeps the prior knowledge up-to-date and thus results in a more precise model of the query distribution.

With CAL, our interactive retrieval framework is shown in Figure 1. It consists of three modules: initial search, query distribution modeling, and coached active learning. We employ a concept-based search method [15] to perform the initial search, which returns an initial ranked list for labeling. In query modeling, every time the labeled set is updated by freshly labeled instances, we find training examples which are statistically from the same distribution(s) as the labeled relevant instances through a statistical test, and use their distributions as predictors to estimate a query distribution. During the coached active learning, the estimated query distribution together with the distribution of the current classifier outputs is used to coach the sampling process, which selects unlabeled instances to organize a new ranked list to query the user in the next round. At the end of each round, we can integrate the distribution of the labeled instances into the predictor distributions to address the cross-domain issue.


II. RELATED WORK

A. Active Learning

Active learning has been intensively studied over the last two decades and has been applied to solve a diverse range of problems. In this section, we mainly review active learning within the scenario of video/image retrieval. For more comprehensive surveys, the reader is referred to B. Settles et al. [16] for general-purpose active learning and to T. S. Huang et al. [17] for multimedia retrieval.

The term active learning in the literature is popularly referred to as pool-based active learning [18], where a classifier is trained iteratively on a small set of labeled instances and a large pool of unlabeled instances. In every iteration of learning, the major task is to selectively sample instances from the unlabeled pool to query users, so as to retrain the classifier with the updated labeled set in the next round. The sampling strategy, which is designed to minimize the number of instances that need to be labeled, is thus the core of active learning and a fundamental mark that distinguishes one existing method from others. Two types of strategies are popularly used: uncertainty sampling [3–10] and query-by-committee [18–20].

Uncertainty sampling is probably the most commonly employed scheme, where the idea is to select the instances whose labels the learner is most uncertain of. For a binary classification problem using a probabilistic model, this can be done simply by querying the instances whose posterior probability of being positive is nearest to 0.5 [3], or more generally (and possibly most popularly), by using entropy to measure the uncertainty of an instance based on its posterior probability predicted by the current classifier(s) [4, 5]. In video/image retrieval with active learning, the SVM classifier is widely adopted [6–10, 12], with which uncertainty sampling can be straightforwardly implemented to select the instances closest to the separating hyperplane, where the classifier is usually confused and works awkwardly.

Different from uncertainty sampling, in which inference is normally obtained through a single hypothesis (boundary), query-by-committee (QBC) organizes a committee of several classifiers to make decisions with the "wisdom of the crowds" [18–20]. Instances whose predicted labels are most inconsistent among the committee classifiers are usually the candidates to query next. Although claimed to reduce the classification error and variance more efficiently than a single expert (classifier), QBC introduces a computational burden in training more classifiers. Therefore, it is seldom used in large-scale video/image indexing and retrieval.

In addition to uncertainty sampling and QBC, there are other sampling strategies proposed with different premises, for example, maximizing the expected model change (e.g., [21]) or reducing the expected generalization error (e.g., [22]). The reader is referred to [16] for more details.

B. Exploitation-Exploration Dilemma

In the scenario of multimedia retrieval, uncertainty sampling with an SVM classifier is still the most prevalent scheme [3–10]. However, as we have introduced, the current way of using active learning to reduce labeling effort may delay the harvest of relevant instances, which is inconsistent with the ultimate goal of multimedia retrieval [11, 23]. The algorithms proposed by Mitra et al. [12] and Thomas Osugi et al. [13] are two examples of the few works we found that address this open question of the exploitation-exploration dilemma.

The algorithm in [12] adaptively estimates a confidence factor based on the population proportion of relevant instances to irrelevant ones along the current decision boundary, so as to encourage exploitation when the two groups of instances are well balanced (an indication that the current decision boundary is approaching the optimal position) and to encourage exploration otherwise. However, the search along the boundary in this method is expensive on a large-scale dataset. In [13], the authors propose an algorithm which randomly decides whether to exploit or explore according to a probability of exploring, which is dynamically determined and is proportional to the change of the decision boundary in two consecutive rounds. Nevertheless, defining an appropriate function to measure the change is not a trivial task. Furthermore, in both [12] and [13], without knowing the prior query distribution, the step size of exploitation or exploration is still difficult to decide, which may either cause unstable learning (if the step size is too large) or slow down the convergence (if it is too small).

Considering the way of incorporating a prior data distribution into active learning, our work is also related to [7] and its successor [24]. Although these works were not proposed for solving the exploitation-exploration dilemma, the use of prior knowledge for adjusting the learning gives them a "coached-like" nature, which can address the dilemma to some extent. In [7], prior to the learning, the authors divide the target dataset by clustering and learn the prior distributions of the clusters in advance. The knowledge about those prior distributions is then used to weight the uncertainty of an unlabeled instance with reference to the cluster it belongs to. This method is further combined with uncertainty sampling in [24] to balance between representativeness (indicated by the instances' prior distributions on clusters) and uncertainty. However, the parameters for clustering (e.g., the number of clusters) are difficult to determine adaptively in these two methods. More importantly, it is well known that instances in the same cluster of a feature space may carry diverse semantic information, because instances similar in a low-level feature space are not necessarily related semantically.

C. Three Principal Factors Related to Interactive Retrieval

To summarize the discussions above, there are indeed three principal factors related to interactive video retrieval, namely, exploitation vs. exploration, prior vs. posterior knowledge, and domain adaptation. Understanding the impacts of these factors can help not only to explain the advantages/disadvantages of existing methods, but also to improve our future designs of the sampling strategies.

[Diagram: Exploitation, Exploration, Prior (Global) Knowledge, Posterior (Local) Knowledge, and Domain Adaptation, linked by the relationships 1)–4) described below.]

As shown above, the three factors are indeed interdependent, in the way that, 1) as the final goal of retrieval, exploitation is usually conducted based on our posterior understanding of a local area of the space; 2) therefore, to avoid being stuck locally, we need to explore; 3) however, to be well scheduled, the exploration relies on the guidance of the prior (global) knowledge; and 4) this knowledge depends on the exploitation to keep up-to-date through domain adaptation. In this light, we can see that the intuitive strategy of always sampling the most relevant instances to query next easily becomes stuck locally because only relationship 1) is considered, and that most active learning approaches focus only on relationship 2) and thus have a goal inconsistent with video retrieval. The works by Mitra et al. [12] and Thomas Osugi et al. [13], which try to balance the exploitation and exploration, may have considered both relationships 1) and 2) but ignored 3), and therefore lack a global perspective, while the approaches [7] and [24] utilizing prior knowledge have considered 3) but put less emphasis on 1) and 2) because they are not designed for solving the exploitation-exploration dilemma. Our prior work [14] has considered 1), 2) and 3) but still ignored 4), making the "collaboration" between the prior and posterior knowledge not mutual (i.e., only the prior knowledge can influence the posterior knowledge, but not vice versa). In this paper, we consider all these relationships within a unified and more principled framework. We will demonstrate that, by further considering relationship 4), not only can the prior knowledge be kept up-to-date for more precise query modeling, but the three factors concerning interactive retrieval are also connected into a "virtuous circle", which brings further performance gains.

III. MODELING QUERY DISTRIBUTION

At each round, we obtain a set L+ consisting of instances which are labeled as "relevant". Since L+ is just an incomplete sampling of the query distribution with a limited number of instances, it is difficult to infer the query distribution from L+ directly. We therefore propose an indirect way to infer the query distribution: first, we organize training examples into nonexclusive groups according to the semantics they carry; then we employ a statistical test to find groups which are from the same distribution as L+; finally, the distributions of those semantic groups (which include far more abundant instances than L+) are used to approximate that of the query. At the end of each round, we use domain adaptation to update the distribution of each semantic group, aiming to address the domain shift between the target and the training domains.

A. Organizing Semantic Groups

In organizing training examples into semantic groups, we propose to maximize the semantic generalizability of the resulting groups so that they can semantically cover as many queries as possible. We use concept combination [25] to achieve this goal. The idea is to first organize examples into elementary groups of concepts, and then combine those groups to create new groups covering more complicated semantics. Given one of the abundantly available concept definition and annotation collections1 (e.g., the Large-Scale Concept Ontology for Multimedia (LSCOM) [27]), we denote by Gi the set of instances labeled with concept i. Intersection or union operators are then used to combine those sets into new groups (i.e., Gi∩j and Gi∪j). The combination can be done recursively among existing groups (Gi's, Gi∩j's and Gi∪j's) to achieve more complicated ones, as long as the number of training examples in every combined group is larger than a fixed threshold (say 30), so that the group is statistically meaningful. A sketch of this combination step is given below.
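The following is a minimal sketch of the recursive combination, assuming each elementary group Gi is simply a set of example IDs and the annotations have already been loaded into a dict; the function and variable names (combine_groups, MIN_SIZE, elementary_groups) are ours, not the paper's, and the naive pairwise enumeration is for illustration only.

```python
from itertools import combinations

MIN_SIZE = 30  # a combined group must keep at least this many training examples

def combine_groups(elementary_groups, max_order=2):
    """Build semantic groups by intersecting/uniting elementary concept groups.

    elementary_groups: dict mapping concept name -> set of example IDs.
    Returns a dict mapping group name (e.g. 'car∩road') -> set of example IDs.
    """
    groups = {name: set(ids) for name, ids in elementary_groups.items()}
    current = dict(groups)
    for _ in range(max_order - 1):              # combine up to max_order concepts
        fresh = {}
        for (na, ga), (nb, gb) in combinations(current.items(), 2):
            inter, union = ga & gb, ga | gb
            if len(inter) >= MIN_SIZE:
                fresh[f"{na}∩{nb}"] = inter
            if len(union) >= MIN_SIZE:
                fresh[f"{na}∪{nb}"] = union
        groups.update(fresh)
        current = fresh                          # recurse on the newly created groups
    return groups

# toy usage with hypothetical annotations
anno = {"car": set(range(0, 80)), "road": set(range(40, 150)), "bus": set(range(70, 120))}
semantic_groups = combine_groups(anno, max_order=2)
```

In the experiments reported later, only combinations of at most two concepts are used, which corresponds to max_order=2 in this sketch.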

B. Mapping between L+ and Semantic Groups

With L+, the set of instances labeled as "relevant", we need to find groups that are from the same distribution as L+. This is a two-sample test problem which has a solid foundation in statistics. In our case, we employ the widely-used two-sample Hotelling T-Square test [28], where the hypothesis is that the two samples are from the same multivariate distribution, and the significance level is 0.01. However, conducting the Hotelling test by comparing L+ with the candidate semantic groups (G's) one by one is time-consuming, because the number of G's can be on a scale of hundreds of thousands. We employ two techniques to address this problem. First, we represent each G with three parameters (|G|, µG, ΣG), namely the number of examples, the mean feature vector, and the covariance matrix of G respectively, which we can learn offline. As L+ can be represented in a similar way, the Hotelling test is thus performed between parameters instead of by direct comparison between the instances in L+ and the examples in G, effectively avoiding the intensive computation. Second, we reduce the number of candidate groups to compare by pruning groups not semantically related to the query. More specifically, at query time, we use query-to-concept mapping [15, 29, 30] to select a set of concepts (denoted as C) which are semantically related to the query. In the Hotelling test, we only test L+ against groups that concern a concept in C, while the semantic groups not including any concept in C are pruned. For example, if car is one of the concepts in C, we only investigate groups Gcar, Gcar∩road, Gcar∪bus and so on. In our experiment, this reduces the number of candidate groups to a scale of thousands, significantly accelerating the comparison.
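A sketch of the parameter-level two-sample test is shown below. It uses the standard pooled-covariance form of Hotelling's T-square with an F-distributed statistic; the use of numpy/scipy and the function name hotelling_same_distribution are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_same_distribution(n1, mu1, cov1, n2, mu2, cov2, alpha=0.01):
    """Two-sample Hotelling T-square test computed from summary statistics only.

    (n, mu, cov) are the sample size, mean vector and covariance matrix of each
    sample, so no raw instances are needed at query time.
    Returns True if the hypothesis of equal means is NOT rejected at level alpha.
    """
    p = mu1.shape[0]
    diff = mu1 - mu2
    # pooled covariance estimate of the two samples
    pooled = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.pinv(pooled) @ diff
    # convert T^2 to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = 1.0 - f_dist.cdf(f_stat, p, n1 + n2 - p - 1)
    return p_value >= alpha
```

Only the groups that survive the concept-based pruning described above are passed through this test, which keeps the number of calls per query in the thousands.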

C. Estimating Query Distribution

With L+ and the semantic groups selected in the last section, we use a Gaussian mixture model (GMM) to estimate the distribution of the current query Q. Instances in L+ are used as training examples in the GMM. Denoting the set of selected groups as P (predictor set), the probability density function (pdf) of Q under the GMM is:

P(x|Q) = \sum_{G \in P} \omega_G P(x|G)    (1)

where x is the feature vector of any instance, P(x|G) is the pdf of semantic group G, and ω_G is the weight of P(x|G), with the sum of all ω_G equal to one. Assuming each G is generated by a multivariate normal distribution, P(x|G) can be easily estimated using the three parameters (|G|, µ_G, Σ_G). To estimate the weights ω_G, we employ the most popular and well-established expectation-maximization (EM) algorithm. With the instances in L+ as training examples, EM seeks an optimal set of weights ω*_G (i.e., {ω*_1, ω*_2, ..., ω*_{|P|}}) which maximizes the GMM likelihood

\{\omega^*_1, \omega^*_2, \cdots, \omega^*_{|P|}\} = \arg\max_{\{\omega_1, \omega_2, \cdots, \omega_{|P|}\}} \prod_{i=1}^{|L^+|} P(x_i|Q).    (2)

1 These annotation collections were previously developed for training concept classifiers to fulfill the task of high-level concept indexing in TRECVID [26].

To accelerate the convergence of EM, we assign each predictor distribution an initial weight

\omega^0_i = \frac{sim(L^+, G_i)}{\sum_{G \in P} sim(L^+, G)}    (3)

where sim(·) is a function to compute the similarity of two distributions. We implement this function as

sim(L^+, G) = \exp\left\{-\frac{1}{2}\left(\sqrt{(\mu^+ - \mu_G)'(\Sigma^+)^{-1}(\mu^+ - \mu_G)} + \sqrt{(\mu_G - \mu^+)'(\Sigma_G)^{-1}(\mu_G - \mu^+)}\right)\right\}.    (4)

The similarity is indeed determined by the Mahalanobis distances between the mean vectors of the two distributions, with the intuition that the closer G's distribution is to L+'s, the larger the contribution of P(x|G) to P(x|Q). EM starts from the initial ω^0_i and finally converges to ω*_G. Note that there might be other metrics available for measuring the distribution distance in Eq. (4) (e.g., the Jensen-Shannon divergence). In our experiment, however, ω*_G basically converges to the same optimal point even when different metrics are used. This is an indication that searching for the GMM weights in Eq. (2) is a convex optimization problem, and thus using different metrics will not affect the performance of CAL. We employ Eq. (4) because it is simple and works seamlessly with our parameter-based representation of semantic groups.
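The weight estimation can be sketched as a standard EM loop in which the component densities P(x|G) are frozen Gaussians built from the stored parameters and only the mixture weights are re-estimated. The code below is our own illustration assuming scipy's multivariate_normal; for brevity it starts from uniform weights instead of the Mahalanobis-based initialisation of Eqs. (3)-(4).

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_gmm_weights(X_pos, predictors, n_iter=50):
    """EM over mixture weights only (component Gaussians stay fixed).

    X_pos:      |L+| x d matrix of labeled relevant instances.
    predictors: list of (mu_G, Sigma_G) tuples for the selected semantic groups.
    Returns the weight vector maximizing the likelihood of X_pos under Eq. (1).
    """
    K = len(predictors)
    # the densities P(x|G) never change, so precompute them once
    dens = np.column_stack([
        multivariate_normal(mean=mu, cov=cov, allow_singular=True).pdf(X_pos)
        for mu, cov in predictors
    ])                                      # shape: |L+| x K
    w = np.full(K, 1.0 / K)                 # uniform start (Eq. (3) would go here)
    for _ in range(n_iter):
        resp = dens * w                      # E-step: unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                # M-step: updated mixture weights
    return w
```

Because only the weights are free, each iteration is a weighted averaging step over |L+| instances and K predictors, which is cheap even when the EM loop is rerun at every retrieval round.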

D. Addressing the Cross-Domain Problem in CAL

It can happen in all learning problems that, during testing, the target domain is different from the development domain where the models are trained. In CAL, we address this issue with a domain adaptation process that updates the candidate semantic distributions (i.e., the pdfs of all semantic groups) at the end of each round. The idea is to merge the newly labeled relevant instances L+ of the current round into each candidate semantic group G and relearn the three parameters (|G|, µG, ΣG) (cf. Section III-B) of G using the mixed examples/instances. To be computationally effective, we will show this can also be implemented at the "parameter level" without recalling the training examples of each group. To distinguish the data in different rounds, we add a subscript i (indicating the i-th round) to these parameters. It is easy to prove that the parameters of group G at the (i+1)-th round can be calculated as

|G_{i+1}| = |G_i| + |L^+_i|    (5)

\mu^G_{i+1} = \frac{\mu^G_i |G_i| + \mu^+_i |L^+_i|}{|G_i| + |L^+_i|}    (6)

\Sigma^G_{i+1} = \frac{\Phi_i + \Phi^+_i - \mu^G_{i+1} (\mu^G_{i+1})' |G_{i+1}|}{|G_i| + |L^+_i| - 1}    (7)

where Φ_i and Φ^+_i are the inner product matrices of the feature vector matrices of G and L+ respectively. The proof can be found at our demo page2. To be practical, we in fact maintain Φ_i at each round instead of Σ^G_i, since the latter can be obtained easily through Eq. (7). At the end of each round, we update these parameters for the semantic groups that have been selected for modeling the query distribution in the current round. Note that, even though this domain adaptation is not proposed to solve the general problem of domain shift, it helps to keep the distributions of semantic groups up-to-date, on the basis of which we can obtain a more accurate estimation of the query distribution. More importantly, as we discussed in Section II-C, the domain adaptation connects the balance between exploitation and exploration and the collaboration between the prior and posterior knowledge into a "virtuous circle", which brings further performance improvements to CAL (as we will validate experimentally in Section VI).
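A sketch of this parameter-level update is given below. It keeps the raw inner-product matrix Φ for each group, as described above, so the covariance can be recovered without ever recalling the original training examples; the class and attribute names are ours.

```python
import numpy as np

class GroupStats:
    """Running (|G|, mu_G, Sigma_G) of a semantic group, updated once per round."""

    def __init__(self, X):                  # X: n x d matrix of training examples
        self.n = X.shape[0]
        self.mu = X.mean(axis=0)
        self.phi = X.T @ X                  # inner-product matrix, kept instead of Sigma

    def absorb(self, X_pos):
        """Merge this round's labeled relevant instances, following Eqs. (5)-(7)."""
        m = X_pos.shape[0]
        mu_pos = X_pos.mean(axis=0)
        self.mu = (self.mu * self.n + mu_pos * m) / (self.n + m)   # Eq. (6)
        self.phi = self.phi + X_pos.T @ X_pos                      # accumulate Phi
        self.n = self.n + m                                        # Eq. (5)

    @property
    def cov(self):
        """Unbiased covariance recovered from Phi and the current mean (Eq. (7))."""
        return (self.phi - self.n * np.outer(self.mu, self.mu)) / (self.n - 1)
```

Keeping Φ rather than Σ means the per-round cost is a single d x d accumulation per updated group, independent of the group's size.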

IV. COACHED ACTIVE LEARNING

In this section, we jointly utilize the prior knowledge (indicated by the estimated query distribution) and the posterior knowledge (indicated by the current decision boundary) to balance exploitation and exploration. This can be formulated as a function ρ(x) that gives the priority of selecting an instance to query next. That is,

ρ(x) = λ Harvest(x) + (1 − λ) Explore(x)    (8)

where Harvest(x) is a function computing how much the exploitation will be boosted if x is selected to query next, Explore(x) computes how much the exploration will be improved if x is selected to query next, and λ is a balancing factor controlling the exploitation-exploration trade-off. In our proposed method, the use of λ is mainly for determining the timing of exploitation and exploration, while the step size is controlled in Harvest(x) and Explore(x) respectively. We will introduce the implementations of Harvest(x) and Explore(x) first, and then turn to λ.

A. Exploitative Priority

To boost the exploitation, one would intuitively select instances with high posterior probabilities P(Q|x) (i.e., the most relevant instances estimated by the current classifier) to query next. However, as the current classifier is trained with L+ (a biased sampling of the relevant instances of Q), instances with high P(Q|x) are not necessarily lying in the densest area of the query distribution. To encourage the exploitation to shift towards the dense area where the instances have high prior probabilities3 P(x|Q), we implement Harvest(x) as

Harvest(x) = P(x|Q) P(Q|x).    (9)

The function indeed adjusts the posterior probability of x by further considering its prior probability. As illustrated in Figure 2(b), the function puts higher priority on harvesting the instances between the centers of the prior and posterior distributions, instead of those around the center of the posterior distribution. During the iterative harvesting, the two centers merge step by step. The step size is indirectly controlled with respect to the distance between the two centers.

Fig. 2. Illustration of Exploitative Priority and Explorative Priority: (a) an example of a query distribution in the target space, (b) the prior and posterior distributions and the distribution of the corresponding exploitative priority Harvest(x), (c) the distributions of the uncertainties determined on the prior and posterior distributions and the distribution of the corresponding explorative priority Explore(x), and (d) the effect of using λ to balance exploitation and exploration.

2 http://www.cs.cityu.edu.hk/~xiaoyong/CAL/
3 Here we have slightly abused the term "prior probability" for P(x|Q) to emphasize that the probability is learnt from the prior knowledge.

B. Explorative Priority

By considering both the prior and posterior probabilities, we implement Explore(x) as

Explore(x) = \{-P(x|Q)\log(P(x|Q))\} \times \{-P(Q|x)\log(P(Q|x))\},    (10)

which is intuitively a combination of two uncertainties inferred from the entropies of the prior and posterior distributions respectively. As in Figure 2(c), Explore(x) puts high priority on exploring the instances residing in the intersection of the uncertain areas of the prior and posterior distributions. There are three peaks for Explore(x). The leftmost one is the highest peak, corresponding to the area both the prior and posterior distributions are most uncertain of, and the area where the left side of the optimal boundary is probably located. The peak in the middle corresponds to the area where the prior and posterior distributions have a conflicting understanding. Referring to Figure 2(a), this area corresponds to the region where the current decision boundary passes through the central area of the prior distribution, and it is thus also worth exploring to clarify the misunderstanding (if the centers of the prior and posterior distributions were aligned exactly, the middle peak would disappear). The rightmost peak corresponds to the area far from the current decision boundary where the prior distribution "thinks" the right side of the optimal boundary should be located. Exploring this area helps to clarify the right side of the optimal boundary. If the query distribution is well estimated, exploring the areas under the three peaks has the effect of pulling the current boundary towards the optimal one. The step size of the exploration is indirectly controlled by the distance between the current decision boundary and the optimal one "imagined" by the prior distribution.
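The two priorities and their combination in Eq. (8) can be sketched as follows, assuming p_prior(x) = P(x|Q) comes from the estimated GMM of Section III and p_post(x) = P(Q|x) from the classifier's probability output; the small epsilon guarding the logarithms and the function names are our own additions.

```python
import numpy as np

EPS = 1e-12  # avoid log(0) for instances with (near-)zero probability

def harvest_priority(p_prior, p_post):
    """Eq. (9): shift exploitation towards the dense area of the prior distribution."""
    return p_prior * p_post

def explore_priority(p_prior, p_post):
    """Eq. (10): product of the prior-entropy and posterior-entropy terms."""
    prior_unc = -p_prior * np.log(p_prior + EPS)
    post_unc = -p_post * np.log(p_post + EPS)
    return prior_unc * post_unc

def coached_priority(p_prior, p_post, lam):
    """Eq. (8): lambda trades exploitation off against exploration."""
    return lam * harvest_priority(p_prior, p_post) + \
           (1.0 - lam) * explore_priority(p_prior, p_post)

# e.g. rank the unlabeled pool and query the top-k instances next:
# order = np.argsort(-coached_priority(p_prior, p_post, lam))[:k]
```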

C. Balancing Exploitation and Exploration

To balance the exploitation-exploration trade-off adaptively, every time L+ is updated by freshly labeled instances, we update λ based on the harvest history, i.e., the number of relevant instances obtained in each round. We encourage exploitation when there are potentially more relevant instances to label, and encourage exploration otherwise. To distinguish the data in different rounds, we add a subscript i (indicating the i-th round) to both L+ and λ. The balancing factor λ_{i+1} at the (i+1)-th round can then be updated as

\lambda_{i+1} = \lambda_i + \frac{|L^+_{i+1}| - R}{R}    (11)

where |L^+_{i+1}| is the number of relevant instances in L^+_{i+1}, and R, which we present later, is the expected number of relevant instances to be labeled in the (i+1)-th round. In Eq. (11), λ increases when the previous round harvested more relevant instances than expected, so that the learner will put more emphasis on exploitation; otherwise, λ decreases, resulting in more emphasis on exploration. When the update in Eq. (11) pushes λ out of the interval [0,1], we clip its value to the corresponding boundary. The expected number of relevant instances R in the (i+1)-th round is computed according to the harvest history as

R = \sum_{k=1}^{i} \frac{1}{2^{(i-k+1)}} |L^+_k|    (12)

which is a weighted sum of the numbers of instances harvested so far. Because the weight 1/2^{(i-k+1)} decreases dramatically from the current round towards the previous rounds, the result of Eq. (12) is mainly influenced by the harvest of the several most recent iterations4. The intuition is that we expect to harvest in the new round a little more than the number of relevant instances we have harvested recently. The effect of balancing with λ is illustrated in Figure 2(d), where the distribution of the priority function ρ(x) appears more similar to that of the exploitative priority Harvest(x) if λ is larger, and more similar to that of the explorative priority Explore(x) otherwise.

4 Estimating the expected harvest iteratively is a Nonstationary Problem, to which the Exponential, Recency-Weighted Average is a commonly adopted solution. Eq. (12) is in fact a special case of this solution. A mathematical justification can be found on our demonstration webpage2.
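A minimal sketch of the update rule of Eqs. (11)-(12) is given below, assuming harvest_history holds |L+_k| for the rounds completed so far; the guard against an empty history and the function names are our own assumptions.

```python
def expected_harvest(harvest_history):
    """Eq. (12): recency-weighted sum of the per-round harvests |L+_k|."""
    i = len(harvest_history)
    return sum(h / 2 ** (i - k + 1) for k, h in enumerate(harvest_history, start=1))

def update_lambda(lam, new_harvest, harvest_history):
    """Eq. (11): raise lambda after a better-than-expected harvest, lower it otherwise."""
    r = expected_harvest(harvest_history)
    if r > 0:                                # no update before any harvest is recorded
        lam = lam + (new_harvest - r) / r
    return min(1.0, max(0.0, lam))           # clip to [0, 1] as described in the text

# usage per round:
# lam = update_lambda(lam, len(L_plus_new), harvest_history)
# harvest_history.append(len(L_plus_new))
```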

D. On The Use of Irrelevant Instances

So far, we have only discussed the use of the labeled relevant instances (i.e., L+) in CAL. We should note that the use of the labeled irrelevant instances (denoted as the set L− hereafter) is also of great importance, as argued by Gosselin et al. [31] and Tong et al. [32]. The major concern is that the sizes of L+ and L− should be balanced when selected for training. Since in CAL the size of L− is often larger than that of L+, we simply address this issue by randomly sampling irrelevant instances from L− with the constraint that the number of negative samples is not greater than 3 times |L+|, avoiding a training set in which the negative samples overwhelm the positive ones. Even though it appears heuristic, this constraint has been reported with promising results in TRECVID evaluations in recent years. Since how to make good use of negative samples for training is still an open question in machine learning, we leave its practice in CAL for future investigation, as it is beyond the scope of this paper.
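The cap on negatives can be written in a few lines; the 3x ratio follows the text, while the function name and the use of Python's random module are our own illustration.

```python
import random

def sample_negatives(l_plus, l_minus, ratio=3):
    """Keep at most ratio * |L+| irrelevant instances for training the classifier."""
    cap = ratio * len(l_plus)
    if len(l_minus) <= cap:
        return list(l_minus)
    return random.sample(list(l_minus), cap)
```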

V. EXPERIMENT-I: STUDY OF THE ESSENTIAL CHARACTERISTICS OF CAL

Due to the large scale of the experiments, we divide the study of CAL into three parts. In this section, we investigate CAL in a real interactive retrieval environment to obtain the essential characteristics of CAL and statistics about the users' behavior; on this basis, in Section VI we pay particular attention to the coaching process of CAL, and finally we conduct a comprehensive comparison between CAL and the state of the art in Section VII.

A. Learning Candidate Semantic Distributions

With the method introduced in Section III-A, we first learn the candidate semantic distributions using LSCOM [27], the largest concept definition and annotation collection publicly available. LSCOM includes 2000+ concepts, of which 449 have been labeled on the development dataset of TRECVID 2005. To be practical, when learning semantic groups using concept combination, we limit our interest to semantic groups combining no more than 2 concepts. After filtering out those with fewer than 30 training examples, we obtain 79,514 non-exclusive semantic groups, with the three parameters (|G|, µG, ΣG) learnt for each group G. To further fulfill the normality assumption in Section III-C, we perform Mardia's test [28] (at significance level 0.05) on each distribution to check its similarity to the multivariate normal distribution and filter out those rejected by the test. This finally results in 23,046 candidate distributions which will be used at query time to estimate the query distribution. The results can be downloaded from our demo webpage2.

B. Experimental Setting

We conduct the experiments in this section using the TRECVID 2005 (TV05) test set, which is composed of 45,765 shots of news videos from multi-lingual sources including English, Chinese and Arabic. Twenty-four search topics, together with their ground-truth provided by TRECVID, are used in the experiments. We only consider text queries, imagining that most users perform search with a short sentence. We represent each shot with a concept vector composed of the detection scores on the shot given by a set of semantic concept detectors. In our case, we use VIREO-374 [33], which includes detectors for 374 LSCOM semantic concepts. The dimensionality of our concept vector is thus 374. To train the classifier for each round, we use the well-known LIBSVM [34] with a radial basis function (RBF) kernel, gamma and cost empirically set to 0.386 and 8 respectively, and the "-b" parameter enabled to output probability estimates (used as posterior probabilities in the experiments).
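For concreteness, an equivalent configuration expressed in scikit-learn (which wraps LIBSVM) might look as follows; this is our own illustrative sketch rather than the authors' code, and the helper name retrain_and_score is hypothetical.

```python
from sklearn.svm import SVC

# RBF kernel, gamma = 0.386, cost C = 8, probability outputs enabled
# (mirrors LIBSVM's "-t 2 -g 0.386 -c 8 -b 1"); input is the 374-d concept vector.
clf = SVC(kernel="rbf", gamma=0.386, C=8, probability=True)

def retrain_and_score(X_labeled, y_labeled, X_pool):
    """Retrain on the current labeled set and return P(Q|x) for the unlabeled pool.

    Assumes y_labeled uses the label 1 for relevant instances.
    """
    clf.fit(X_labeled, y_labeled)
    positive_col = list(clf.classes_).index(1)
    return clf.predict_proba(X_pool)[:, positive_col]
```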

To compare the performance of our method with conventional approaches of active learning, we conduct an experiment of interactive search involving 36 searchers (volunteers aged from 19 to 26, 8 females and 28 males, 29 undergraduates and 7 graduates) on 3 PCs with exactly the same configuration (Intel(R) Core(TM) i3 CPU M350 2.27GHz, 2GB memory). Note that, to save human effort, we reuse the data from 12 searchers obtained in our previous work [14], where no domain adaptation (cf. Section III-D) was employed in CAL. Therefore, for the remaining 24 searchers in this section, we temporarily remove the domain adaptation from CAL, making the querying strategies consistent over all searchers. The impact of integrating domain adaptation into CAL or not will be studied specifically in the next section. Again due to the expense of the human experiment, we only compare our coached active learning (CAL) with the most popularly employed uncertainty sampling (US), which chooses the instances closest to the SVM separating hyperplane to query next. In this case, each searcher has to conduct a total of 48 searches (24 TV05 queries for each method), resulting in 1,728 searches in total. To avoid the situation that users might be negatively affected when searching new queries after being frustrated by some "difficult" queries previously, the users are recommended to spread the task over several sessions within twenty weeks, with at most 6 queries taken in each session. Queries are assigned to a user in stochastic order and experimented in each session with a randomly selected active learning method (i.e., the order of using CAL or US is also stochastic). We restrict the interval between the searches with CAL and US on the same query to at least four weeks, to reduce the effect that a user's experience of searching a query with one method may facilitate the search with the other.

The two sampling strategies are experimented with within the same user interface, as shown at our demo webpage2, and the interactive search is conducted with the framework we present in Figure 1 (to search with US, we replace the active learning module with uncertainty sampling). For the initial search, we employ the work in [15], a concept-based search method which has been reported with encouraging results. All searches follow the TRECVID model of interactive search, where a search has to be finished within 15 minutes, but users can give up early if they feel there is no hope of finding new relevant shots. The final result (i.e., the retrieved list) is evaluated with average precision (AP), following the TRECVID standard. To aggregate the performance over multiple queries, we use the mean of their APs (known as MAP).

Fig. 3. Performance comparison of Coached Active Learning and Uncertainty Sampling at increasing depths of results, using the TRECVID 2005 test set. The MAP at each point is the average of the MAPs of 36 users. The balancing factor λ for CAL is also the average over 24 queries and across all users.
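For reference, a minimal sketch of the non-interpolated average precision used above and its mean over queries is given below; this follows the standard TRECVID-style definition and is not code from the paper.

```python
def average_precision(ranked_shot_ids, relevant_ids, depth=1000):
    """Non-interpolated AP over the top `depth` results of one query."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shot_ids[:depth], start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank     # precision at each relevant hit
    return precision_sum / max(1, len(relevant))

def mean_average_precision(runs):
    """runs: list of (ranked_shot_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)
```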

C. Comparison with Uncertainty Sampling

Figure 3 shows the performance comparison of CAL and US, where the MAP at each number of examined instances is the average MAP of 36 users. As a reference, we also plot the change of the balancing factor λ in CAL. It is easy to see that CAL outperforms US across the X-axis.

There are basically four stages for CAL, divided by the points 200, 1050, and 1800, and contributing 22.26%, 44.98%, 7.14%, and 25.62% of the total MAP growth, respectively. Overall, λ decreases gradually from left to right, indicating that, while more and more relevant instances are harvested, CAL can adaptively shift its focus from exploitation to exploration. This becomes clearer when looking at the second stage (from 200 to 1050), in which, even though the relevant instances returned by the initial search have usually been harvested out in the first stage, the estimated query distributions can successfully lead the classifiers to the dense areas, and the abundant relevant instances there result in a fruitful harvest and stimulate the MAP growth. Accordingly, the value of λ drops a little to encourage searching of the dense areas and remains as high as 0.90 to harvest in these areas.

Compared with that of CAL, the MAP of US increases more gently. The major disadvantage of US is that, by always encouraging exploration and ignoring harvesting, US may easily frustrate the users and cause an early give-up of the search. To validate whether this is true, we investigate CAL and US from three perspectives, namely the query-dependent, classifier-dependent, and user-dependent performance comparisons.

1) Query-Dependent Comparison: The query-dependent comparison of CAL and US is illustrated in Figure 4, where the final MAPs are obtained by evaluating the top-1000 shots in the ranked lists that have been submitted after the 15-minute search. One can easily find that the performance superiority of CAL over US is more apparent among those "difficult" queries (in which both approaches exhibit moderate MAPs) than among the rest. For example, CAL obtains performance gains over US on the queries "G. Bush entering or leaving a vehicle" and "Condoleeza Rice" of 1661% and 183.17% respectively. However, those on the queries "tanks or other military vehicles" and "a ship or boat" are just 16.85% and 15.09% respectively, even though the MAPs on these two queries are comparably high. The former two queries include named entities. It is well known that this type of query is too specific for an initial search without text-based search modules to collect enough instances. This is exactly what happened in our concept-based search, which thus returns insufficient relevant shots to satisfy the users at the beginning. In the purely exploration-oriented US, it becomes even worse and finally causes the user to give up early. On the contrary, CAL, with the help of the prior knowledge on the query distribution, has a better chance of locating the dense areas and consequently encourages the users to keep labeling in the succeeding iterations.

Fig. 4. Performance comparison of CAL and US over 24 TRECVID 2005 queries. The MAP for each query is the average of the APs of 36 users. The texts of queries have been manually modified to fit the figure.

Fig. 5. Quality comparison of classifiers trained in CAL and US. The harvest rate (measured by MAP) at each iteration is plotted for reference.

2) Classifier-Dependent Comparison: As introduced, in US the learning focuses more on the quality of the classifier but meanwhile ignores the harvest. This is confirmed in Figure 5, where we measure the quality of a classifier at each round by applying it to all the instances (labeled or not) for classification, then sorting the instances in descending order according to their relevancy, and finally evaluating the top-1000 instances in the sorted list with AP. Each point in Figure 5 is thus the average AP across 24 queries. For reference, the harvest rate at each iteration has also been evaluated by computing the AP on all the relevant instances labeled so far. It is easy to see that the harvest rate of US at early iterations climbs more slowly than that of CAL, even though the quality of its classifier improves faster. By contrast, the classifier of CAL improves moderately at early iterations, when it focuses on harvesting and its decision boundary is stuck locally. However, the situation starts to change after the 6-th iteration, where CAL reaches its limit on harvesting in the dense area and moves its emphasis to exploration. Iterations 5 to 10 of CAL roughly correspond to its third stage in Figure 3.

TABLE II
SELECTION OF CONCEPTS AND SEMANTIC DISTRIBUTIONS FOR TRECVID 2005 QUERIES. CONCEPTS ARE ORDERED BY THEIR SIMILARITIES TO CORRESPONDING QUERIES. SEMANTIC DISTRIBUTIONS ARE ORDERED BY GMM WEIGHTS.

Queries                              | Concept Selection          | Selection of Semantic Distributions
Condoleeza Rice                      | face, food, sky            | face, sky, food
Hu Jintao, president of China        | face, leader, individual   | individual, face, individual∩single person male
Tennis players on the court          | tennis, court, sport       | sport, tennis, tennis∩sport
G. Bush entering/leaving a vehicle   | face, vehicle, walking     | face∩road, vehicle, vehicle∩walking
A ship or boat                       | boat, ship, canoe          | boat, boat∩waterway, boat∩urban
Basketball players on the court      | basketball, court, tennis  | basketball, basketball∪face, basketball∩sport
An airplane take off                 | airplane, boat, canoe      | airplane, airplane∩urban, airplane∩building
A road with one or more cars         | road, car, sidewalk        | road∩car, road, sidewalk
Tanks or other military vehicles     | tank, military, vehicle    | vehicle∩military, military, military∩smoke
An office setting                    | office, sitting, store     | office, store, office∩talking

TABLE I
STATISTICS ABOUT USERS' PATIENCE ON LABELING, INCLUDING: THE NUMBER OF SHOTS EXAMINED PER QUERY (#EXAMINED SHOTS), THE NUMBER OF IRRELEVANT SHOTS CONTINUOUSLY ENCOUNTERED BEFORE SUBMITTING CURRENT LABELS IN EACH ROUND (#IRRELEVANT SHOTS), AND THE AVERAGE RATE AND TIME OF USERS' EARLY GIVE-UP (IN MINUTES) BEFORE THE 15-MINUTE LIMIT.

Method | #Examined Shots (Max / Avg. / Min) | #Irrelevant Shots (Max / Avg. / Min) | Early Give-Up (Rate / Time)
CAL    | 5381 / 2402 / 213                  | 213 / 62 / 14                        | 12.51% / 13.23
US     | 4705 / 2013 / 132                  | 132 / 31 / 13                        | 18.63% / 10.42

3) User-Dependent Comparison: It is worth mentioning that, compared with our previous study [14] in which only 12 searchers were involved, the AP and MAP performances of US in this section (with 36 searchers) generally decrease. This is due to the fact that the proportion of undergraduate students has increased from 66.7% to 80.5%. Since most of them are less experienced and less patient than the graduate students, it is easier for them to give up early under purely explorative strategies like US. The increase of undergraduate students therefore finally degrades the performance of US. By contrast, for CAL, we have not observed a significant drop in its performance, which may further confirm CAL's better ability to maintain users' patience.

D. Study of User Patience on Labeling

To further study the effects of different sampling strategies on user patience, in Table I we present some statistics about the users' patience recorded during the experiment. The statistics show that users are evidently more patient with CAL than with US, indicated by their willingness to examine more shots overall and their patience in waiting when irrelevant shots continuously appear. The maximum numbers of irrelevant shots continuously encountered before each submission are collected from the query "Mahmoud Abbas", where the initial search fails to find any relevant shots. In this extreme case, the users are still able to examine 213 shots with CAL before giving up, 81 more than with US. Moreover, the give-up rate of CAL is 6.12% lower than that of US, and the users are able to stick with the labeling for 2.81 minutes longer in CAL than in US before they give up early.

It is worth mentioning that, even though CAL introduces two additional processes for estimating the query distribution and adjusting the step size of exploration, the online time cost per round for CAL is only 0.547 seconds on average. According to our questionnaire, no searcher noticed the difference between the online time costs of CAL and US (0.113 seconds per round). In addition, regardless of which sampling strategy is used, we observe that the labels provided by the users contain errors: on average, an irrelevant instance is labeled as relevant in 2.04% of cases, and vice versa in 5.38% of cases.

E. Insights into Query Modeling

In this section, we are interested in which semantic distributions are selected for a certain query and whether they are able to simulate the query distribution. For each query, we sort the selected semantic distributions according to their GMM weights (i.e., ω in Eq. (1)). Table II shows 10 examples of TV05 queries (a more complete table for all 24 queries can be found in [14]), together with the corresponding concepts selected in the initial concept-based search (ordered by their semantic similarities to the query), and the top-3 semantic distributions selected at the end of the searches. Note that the results presented in the last column are randomly selected from one of the 36 users, because they are user dependent.
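As a small illustration of how the last column of Table II could be produced, the sketch below sorts the semantic groups selected for a query by their fitted GMM weights and keeps the top-3; the variable names and the example weights are illustrative only and not taken from the original implementation.

def top_semantic_distributions(gmm_weights, group_names, k=3):
    """Rank the semantic groups selected for a query by their GMM mixing
    weights (ω in Eq. (1)) and return the k largest."""
    ranked = sorted(zip(group_names, gmm_weights),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical weights for the query "Tanks or other military vehicles":
print(top_semantic_distributions([0.31, 0.27, 0.18, 0.06],
                                 ["military", "vehicle", "weapons", "tank"]))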

TABLE III
PERFORMANCE COMPARISON OF CAL VARIANTS. THE BEST RESULT FOR EACH YEAR IS BOLD.

TRECVID- | CAL   | CAL_0 | CAL_0.5 | CAL_1 | PriCAL | PosCAL | CAL^A | CAL^A_0 | CAL^A_0.5 | CAL^A_1 | PriCAL^A | PosCAL^A
2005     | 0.563 | 0.459 | 0.482   | 0.32  | 0.324  | 0.461  | 0.558 | 0.465   | 0.497     | 0.334   | 0.331    | 0.469
2006     | 0.422 | 0.316 | 0.397   | 0.283 | 0.275  | 0.318  | 0.432 | 0.318   | 0.403     | 0.296   | 0.294    | 0.327
2007     | 0.493 | 0.391 | 0.448   | 0.293 | 0.283  | 0.402  | 0.508 | 0.405   | 0.473     | 0.312   | 0.308    | 0.449
2008     | 0.331 | 0.208 | 0.266   | 0.169 | 0.164  | 0.213  | 0.352 | 0.215   | 0.291     | 0.170   | 0.182    | 0.225
2009     | 0.380 | 0.248 | 0.309   | 0.148 | 0.136  | 0.252  | 0.397 | 0.257   | 0.358     | 0.172   | 0.151    | 0.274

(The first six columns are the variants without domain adaptation and the last six their counterparts with domain adaptation; within each group, the columns correspond to the adaptive-λ, fixed-λ, and prior-vs.-posterior configurations.)

Basically, the top-3 concepts in the second column of Table II have also been selected in the third column, in similar orders, a phenomenon consistent with our first intuition that the concepts which are most semantically related to a query should have the largest contributions (i.e., the largest GMM weights). Nevertheless, there are several exceptions. For example, the concept tank, which ranks first in the second column for the query "Tanks or other military vehicles", does not even appear among the top-3 in the third column. The reason is that, although some concepts seem the best ones for explaining a

query semantically, their distributions may fail to explain the query distribution statistically, because their training examples might be insufficient and thus may provide biased statistical clues about the query distribution. In the example, the concept tank might be too specific for a query relating to "military vehicles". Its distribution is less representative than those of the concepts military and vehicle, which cover more relevant instances and have thus finally been selected. Another interesting observation is that some concepts falsely selected by query-to-concept mapping have been successfully eliminated by the statistical test. For instance, the concepts boat and canoe, which are selected for the query "An airplane take off" because they semantically share a common parent "vehicle" with "airplane", have been eliminated shortly after the statistical test. In addition, according to our statistics, on average 18.65±10.09 semantic groups are selected for each query. The ones contributing most to the query modeling are usually the top-5 groups, which obtain the largest GMM weights.

VI. EXPERIMENT-II: INVESTIGATION OF THE THREE FACTORS RELATED TO THE PERFORMANCE OF CAL

In this section, we study CAL's performance by varying its configurations on three factors related to the coaching process, namely exploitation vs. exploration, prior vs. posterior knowledge, and domain adaptation. The aim is to investigate questions such as: Is the balance between exploration and exploitation important, or is one of them more important than the other? What is the influence of the prior and the posterior knowledge on the performance? Will a joint consideration of both kinds of knowledge be better than utilizing them separately? Is domain adaptation able to boost the performance, and under what conditions will it or will it not work?

A. Experimental Setting

In the experiments, we further include shots from the TRECVID (TV) 2006-2009 test datasets, which consist of 79,484 shots from TV06, 18,142 shots from TV07, 67,452 shots from TV08, and 93,902 shots from TV09. This finally results in a set of 304,745 shots from various domains (news and documentary), various periods (black-and-white and colored), various cultures (English, Chinese, Arabic and Dutch), and various photographic styles. There are 120 queries in our experiments (24 per year). Following the standard of TRECVID, we use average precision (AP) on TV05 and inferred average precision (InfAP) [35] on TV06-09 to evaluate each result list.

Due to the expensiveness of human experiments, however, it is impractical to conduct a study at such a large scale using human searchers as in Experiment-I. To bypass the difficulty, we use machine searchers instead. A machine searcher is an agent program that "knows" the groundtruth and behaves in the interactive process according to the patience parameters we have learned in Table I. Random errors are added according to the two error probabilities learned in Section V-D: 2.04% for a machine searcher to label an irrelevant instance as relevant and 5.38% vice versa. A 0.08-second delay is empirically added for the machine searchers to "browse" each shot, imitating the behavior of real searchers.
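A minimal sketch of such a machine searcher is given below. It assumes a simple interface in which the agent is shown one shot at a time and either returns a (possibly noisy) label or gives up once its patience budget is exhausted; the class, the default patience values (the averages from Table I), and the stopping rules are our own illustrative choices, not the exact implementation used in the experiments.

import random
import time

class MachineSearcher:
    """A simulated user that labels shots with known groundtruth,
    bounded patience, label noise, and a per-shot browsing delay."""

    def __init__(self, groundtruth, max_examined=2402, max_irrelevant_run=62,
                 p_false_relevant=0.0204, p_false_irrelevant=0.0538,
                 browse_delay=0.08):
        self.groundtruth = groundtruth            # shot_id -> True/False
        self.max_examined = max_examined          # avg. #examined shots (Table I)
        self.max_irrelevant_run = max_irrelevant_run  # avg. #irrelevant shots (Table I)
        self.p_false_relevant = p_false_relevant      # irrelevant labeled as relevant
        self.p_false_irrelevant = p_false_irrelevant  # relevant labeled as irrelevant
        self.browse_delay = browse_delay
        self.examined = 0
        self.irrelevant_run = 0

    def label(self, shot_id):
        """Return 'relevant', 'irrelevant', or None once patience runs out."""
        if self.examined >= self.max_examined or \
           self.irrelevant_run >= self.max_irrelevant_run:
            return None                           # early give-up
        time.sleep(self.browse_delay)             # imitate browsing time
        self.examined += 1
        truth = self.groundtruth[shot_id]
        flip = random.random() < (self.p_false_irrelevant if truth
                                  else self.p_false_relevant)
        answer = truth ^ flip
        self.irrelevant_run = 0 if answer else self.irrelevant_run + 1
        return 'relevant' if answer else 'irrelevant'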

B. Insights into the Coaching Process

To fully study the nature of CAL, we investigate several variants of CAL obtained by fixing the balancing factor λ (cf. Section IV-C), by considering the prior or posterior knowledge separately or jointly, or by adding the domain adaptation (cf. Section III-D). In the following text, we represent all variants in a unified form of CAL^A_λ, where the superscript A appears if the distribution adaptation has been embedded in the coaching (no superscript otherwise), and the subscript λ is a value ranging from 0 (purely explorative) to 1 (purely exploitative) if the balancing factor is fixed (no subscript if λ is adaptively determined). In addition, we add the prefix Pri to these variants to indicate that only prior knowledge is considered (i.e., P(Q|x) is removed from both Eq. (9) and Eq. (10)), and the prefix Pos to indicate that only posterior knowledge is considered (i.e., P(x|Q) is removed from both Eq. (9) and Eq. (10)). No prefix is added if the two kinds of knowledge are considered jointly.
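The naming convention above can be made explicit with a small helper that builds a variant's name from its three configuration choices; this is purely illustrative bookkeeping and not part of the retrieval system itself.

def variant_name(adapted: bool, fixed_lambda=None, knowledge="joint") -> str:
    """Build a variant name following the CAL^A_λ convention: optional
    Pri/Pos prefix, optional superscript A (domain adaptation), optional
    fixed-λ subscript (no subscript when λ is adaptive)."""
    prefix = {"prior": "Pri", "posterior": "Pos", "joint": ""}[knowledge]
    name = prefix + "CAL"
    if adapted:
        name += "^A"
    if fixed_lambda is not None:
        name += f"_{fixed_lambda}"
    return name

# The twelve variants reported in Table III:
for adapted in (False, True):
    for knowledge in ("joint", "prior", "posterior"):
        print(variant_name(adapted, None, knowledge))   # adaptive λ
    for lam in (0, 0.5, 1):
        print(variant_name(adapted, lam, "joint"))       # fixed λ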

The performances of these variants on TV05-09 are shown in Table III. It is clear that CAL^A and CAL always outperform the six variants using a fixed balancing factor λ (i.e., the CAL_λ's and CAL^A_λ's), with improvements of MAP ranging from 6.27% to 156.76%, and also outperform the four variants considering prior and posterior knowledge separately (i.e., PriCAL, PosCAL, PriCAL^A and PosCAL^A), with improvements of MAP ranging from 13.14% to 179.41%. These superiorities confirm the importance of the adaptive exploitation-exploration balancing strategy and that of the joint consideration of prior and posterior knowledge. Comparing the two groups of variants with and without domain adaptation, the variants with domain adaptation demonstrate superior performance over their correspondences without the adaptation, which confirms the effectiveness of the domain adaptation. To further investigate the influence (and the interdependence) of the three factors (i.e., exploitation vs. exploration, prior vs. posterior knowledge, and domain adaptation), Figure 6 shows a detailed performance comparison of the variants under a more diverse range of configurations of these factors.

[Fig. 6 (two panels): Mean Average Precision (MAP) plotted against the balancing factor λ (from 0 to 1) for CAL_λ, CAL^A_λ, PriCAL_λ, PriCAL^A_λ, PosCAL_λ, and PosCAL^A_λ.]
Fig. 6. The detailed performance comparison of CAL variants under different configurations of λ on the datasets of (a) TV05 and (b) TV07. The advantage of employing domain adaptation becomes clearer in (b), where the problem of domain shift is much more serious on TV07 than on TV05.

1) Exploitation vs. Exploration: It is easy to observe from Figure 6 that, by varying the balance factor λ, basically all

variants approach their maxima when exploitation and exploration are well balanced at a certain value of λ. In addition, except for PriCAL_λ and PriCAL^A_λ, it is a consistent phenomenon that these variants perform better when more emphasis is put on the exploration (i.e., when λ is low) than on the exploitation. This confirms our analysis of the exploitation-exploration interdependence in Section II-C: it is intuitively better to explore the unknown areas to find new relevant instances than to purely harvest and be stuck locally. However, by considering the two exceptional runs PriCAL_λ and PriCAL^A_λ, we will see that this conclusion depends on the use of the prior and posterior knowledge.

2) Prior vs. Posterior Knowledge: In Figure 6, PosCAL_λ and PosCAL^A_λ demonstrate apparently superior performances over those of PriCAL_λ and PriCAL^A_λ. The reason is that, by considering only the prior knowledge, PriCAL_λ and PriCAL^A_λ can only harvest from either the dense or the uncertain area of the global distribution, which seldom changes, and will thus be stuck once the relevant instances from these two areas are all exploited. By contrast, by considering only the posterior knowledge, the active learner's understanding of the feature space, even though local, keeps expanding because the posterior knowledge is updated after each iteration, which gives PosCAL_λ and PosCAL^A_λ a better ability to locate novel instances. However, without the global guidance of the prior knowledge, the exploration is still conducted in a blind manner. Therefore, by jointly considering prior and posterior knowledge, CAL_λ and CAL^A_λ obtain superior performances over the variants that consider the two kinds of knowledge separately. This confirms our analysis of the prior-posterior knowledge interdependence in Section II-C. Furthermore, by combining this with the analysis in Section VI-B1, we can see that the interdependence of prior and posterior knowledge has caused the interdependence of exploitation and exploration, while the "collaborations" between the prior and posterior knowledge are also conducted through the exploitation and the exploration.

3) Addressing the Cross-Domain Issue: Since our training dataset (the TV05 development set) for learning the semantic groups is from the news domain and the testing datasets are from both the news (TV05-TV06 testing sets) and the documentary (TV07-TV09 testing sets) domains, we are able to study the performance of CAL when facing domain shift. As shown in Table III, the superiority of CAL^A over CAL increases when moving from the news to the documentary domain, indicated by the observation that CAL^A outperforms CAL by 0.74% (±2.3%) on TV05-TV06, but the superiority rises to 4.62% (±1.66%) on TV07-TV09. Therefore, by further considering domain adaptation, CAL^A has a better capacity for handling the cross-domain issue. This has also been validated in Figure 6 by the superiorities of the CAL^A_λ's over the CAL_λ's, the PriCAL^A_λ's over the PriCAL_λ's, and the PosCAL^A_λ's over the PosCAL_λ's. It is interesting to see that the CAL^A variants do not always outperform the CAL variants on TV05. There is one such exception in Table III and several others in Figure 6(a), where CAL^A's performances are just comparable to those of CAL. However, all those exceptions disappear on TV07-TV09 (see the consistent superiority of CAL^A over CAL in Table III and Figure 6(b)), indicating that domain adaptation gives its full play where the domain shift becomes serious, i.e., on TV07-TV09. The advantage of using domain adaptation also confirms our analysis in Section II-C on the interdependence of the three factors, in the sense that, in the variants with domain adaptation, the posterior knowledge obtains the chance to affect the prior knowledge and enables the "collaborations" between the two kinds of knowledge to be "interactive". However, in the variants without domain adaptation, the posterior does not have any impact on the prior knowledge, and thus the prior knowledge may not be up-to-date.

After the investigation in this section, we can see that the three factors contribute to interactive retrieval through different paths: exploitation and exploration manipulate the active learning directly; the prior and posterior knowledge have to exert their impact on retrieval indirectly by affecting the exploitation and exploration; and domain adaptation has the longest path, as it first updates the prior and posterior knowledge in the hope that its impact will be transferred to exploitation and exploration and later to retrieval. However, our experimental results show that the three factors do help each other and thus form the "virtuous circle" we discussed in Section II-C.

VII. EXPERIMENT-III: COMPARISONS WITH THE STATE-OF-THE-ART

In this section, we conduct a more comprehensive study of CAL's performance from two aspects: its effectiveness compared with eight querying strategies which are popularly employed in the literature or closely related to the proposed method, and a comparison with the best interactive search systems reported in TRECVID. The experiments have been conducted on TV05-09 with machine searchers.

A. Effectiveness of Querying Strategies

With the help of the machine searchers, we are able to compare CAL and CAL^A with a wide range of querying strategies as follows:

• Nearest Neighbors (NN): samples the instances to query next according to their proximities to the labeled relevant instance set L+; the simplest and most intuitive strategy, which is purely exploitative;

• Most Relevant (MR): samples instances according to their posterior probabilities under the current classifier; an exploitative strategy utilizing only local information (the current decision boundary);


TABLE IV
PERFORMANCE COMPARISON OF DIFFERENT QUERYING STRATEGIES AND THE RESULTS OF THE SIGNIFICANCE TEST (AT LEVEL 0.05; X ≫ Y INDICATES THAT X IS SIGNIFICANTLY BETTER THAN Y). THE BEST RESULTS ARE BOLD.

TRECVID- | CAL   | CAL^A | StatQ | Hybrid | PreCls | DUAL  | NN    | MR    | US    | QBC   | Results of Significance Test
2005     | 0.563 | 0.558 | 0.423 | 0.363  | 0.446  | 0.463 | 0.208 | 0.317 | 0.431 | 0.452 | CAL, CAL^A ≫ DUAL, QBC, PreCls ≫ US, StatQ ≫ Hybrid ≫ MR ≫ NN
2006     | 0.422 | 0.432 | 0.284 | 0.309  | 0.297  | 0.368 | 0.193 | 0.224 | 0.294 | 0.287 | CAL^A ≫ CAL ≫ DUAL ≫ Hybrid, PreCls, US ≫ QBC, StatQ ≫ MR ≫ NN
2007     | 0.493 | 0.508 | 0.388 | 0.397  | 0.403  | 0.442 | 0.237 | 0.286 | 0.374 | 0.382 | CAL^A ≫ CAL ≫ DUAL ≫ PreCls, Hybrid ≫ StatQ, QBC ≫ US ≫ MR ≫ NN
2008     | 0.331 | 0.352 | 0.169 | 0.203  | 0.198  | 0.249 | 0.113 | 0.158 | 0.186 | 0.179 | CAL^A ≫ CAL ≫ DUAL ≫ Hybrid, PreCls ≫ US, QBC ≫ StatQ, MR ≫ NN
2009     | 0.380 | 0.397 | 0.213 | 0.249  | 0.242  | 0.277 | 0.129 | 0.136 | 0.227 | 0.235 | CAL^A ≫ CAL ≫ DUAL ≫ Hybrid, PreCls ≫ QBC, US, StatQ ≫ MR ≫ NN

(CAL and CAL^A are the coached strategies; StatQ, Hybrid, PreCls and DUAL are coached-like; NN and MR are exploitative; US and QBC are explorative.)


• Uncertainty Sampling (US) [3-10]: samples the instances closest to the current decision boundary; a purely explorative strategy utilizing only local information (a minimal sketch of US and QBC is given after this list);

• Query by Committee (QBC) [18-20]: samples the instances whose predicted labels are most inconsistent among committee classifiers (3 SVMs here); an explorative strategy, but one utilizing more comprehensive information than US because of the employment of multiple classifiers;

• Statistical Queries SVM (StatQ) [12]: samples instances according to a confidence factor which indicates the quality of the current classification boundary, so as to encourage exploitation when the boundary is approaching the optimum and encourage exploration otherwise;

• Hybrid Learning (Hybrid) [13]: a hybrid method which selectively chooses either an exploitative or an explorative strategy to sample instances, according to a "probability of exploring" determined by the change of the decision boundary over two consecutive rounds;

• Pre-Clustering (PreCls) [7]: samples instances by considering both the uncertainty and the prior data distribution learnt from clustering, so as to give higher probabilities to instances which are both uncertain to the current classifier and representative of their clusters;

• Dual Strategy (DUAL) [24]: samples instances using PreCls until the derivative of the expected error approaches zero, and then combines PreCls with the conventional uncertainty-based strategy after that.
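For reference, the two purely explorative baselines above can be implemented in a few lines. The sketch below is our own generic rendering of the standard strategies (assuming a classifier with a decision_function/predict interface and 0/1 labels, e.g., an SVM), not the exact implementations evaluated in this paper.

import numpy as np

def uncertainty_sampling(clf, X_unlabeled, batch_size=10):
    """US: pick the unlabeled instances closest to the decision boundary."""
    margins = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(margins)[:batch_size]

def query_by_committee(committee, X_unlabeled, batch_size=10):
    """QBC: pick the instances whose predicted labels disagree most across
    the committee (vote entropy over binary 0/1 labels)."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee])  # (n_models, n_samples)
    p_pos = votes.mean(axis=0)
    eps = 1e-12
    vote_entropy = -(p_pos * np.log(p_pos + eps) +
                     (1 - p_pos) * np.log(1 - p_pos + eps))
    return np.argsort(-vote_entropy)[:batch_size]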

Table IV shows the performances of these strategies on TV05-09, with the results of the significance test (at level 0.05) for each year attached. It is clear that CAL^A and CAL always outperform the other eight conventional strategies, with improvements of MAP ranging from 11.54% to 211.5%.

1) Comparison with Purely Exploitative and Explorative Strategies: The advantage of CAL over the conventional exploitative strategies can be seen more clearly when we set CAL to be purely exploitative (i.e., λ = 1). Referring to Table IV, even CAL_1 and CAL^A_1 can outperform NN and MR by 41.86% (±15.07%) and 12.68% (±11.21%), respectively. The reason is that, as introduced in Section IV-A, by further utilizing the global information indicated by the estimated query distribution, the coaching process can iteratively guide the learner to the center of the estimated distribution, conducting the harvest in an orderly manner (i.e., from the places with densely located relevant instances to those with fewer). By contrast, the harvest processes of NN and MR are more "short-sighted", in the sense that they blindly search the neighboring areas around the relevant instances found in previous rounds, under the assumption that relevant instances are continuously distributed. However, this assumption is rarely true, because relevant instances are often mixed with irrelevant ones. It thus takes NN and MR a great effort to filter out the irrelevant instances before finding the true query distribution.

In cases where CAL is purely explorative (i.e., λ = 0), referring again to Table IV, CAL_0 and CAL^A_0 also outperform the conventional explorative strategies US and QBC, by 9.28% (±3.32%) and 8.49% (±6.10%), respectively. This is again attributable to the coaching of the estimated query distribution, which selectively samples instances at places where the prior and posterior distributions are both unsure or have conflicting understandings, making the exploration well targeted (cf. Section IV-B). However, US and QBC, which sample instances by referring only to local information, lack a global perspective and are easily stuck at local optima.

2) Comparison with Coached-Like Strategies: Compared to the four coached-like strategies, CAL^A and CAL outperform StatQ, Hybrid, PreCls and DUAL by 59.26% (±30.33%), 48.58% (±15.88%), 45.33% (±27.43%) and 24.98% (±16.25%), respectively. By investigating the confidence factor of StatQ [12], we find a dramatic fluctuation during retrieval which makes StatQ extremely unstable on the TRECVID dataset. The reason can be traced back to the basic assumption of StatQ that the decision boundary is approaching the optimum when the positive and negative examples along it are well balanced. This is generally true when the distributions of relevant and irrelevant instances are far apart from each other. However, on the TRECVID dataset, the majorities of the two distributions are often mixed, which frequently changes StatQ's confidence about whether the boundary is approaching the optimum. By contrast, we do not have this issue in CAL, because the prior distribution for coaching is learnt by excluding the influence of irrelevant instances, which thus leads the learner to the densest areas of the relevant instances, no matter how the relevant and irrelevant instances are mixed.

To measure the change of the decision boundary, Hybrid [13] concatenates the real-valued hypotheses (i.e., the probabilities of being relevant, in the case of SVM) of all samples in the dataset into a vector at each round and calculates the inner product of the two vectors from two consecutive rounds. This also meets problems on the TRECVID dataset, because the population of irrelevant instances is extremely overwhelming compared to that of the relevant ones, with the result that most entries of the hypothesis vector seldom change.
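A generic rendering of this boundary-change measure is sketched below: it scores the similarity of two consecutive rounds' hypothesis vectors with a normalized inner product, so a value near 1 indicates a boundary that has barely moved. The exact normalization and the mapping from this score to the "probability of exploring" used in [13] may differ; the sketch is only meant to make the description above concrete.

import numpy as np

def boundary_change(prev_probs, curr_probs):
    """Cosine-normalized inner product of the per-sample relevance
    probabilities from two consecutive rounds."""
    prev = np.asarray(prev_probs, dtype=float)
    curr = np.asarray(curr_probs, dtype=float)
    denom = np.linalg.norm(prev) * np.linalg.norm(curr)
    return float(prev @ curr / denom) if denom > 0 else 0.0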

Benefitting from the idea of integrating prior knowledge into active learning, PreCls [7] demonstrates better performance than all the conventional approaches using exploitative, explorative and coached-like strategies. However, there are still several issues when PreCls is applied to interactive video retrieval. First, the clusters in PreCls are obtained by performing clustering on the whole dataset using the similarity between feature vectors as the metric. This makes the instances in the same cluster visually similar to each other, but they do not necessarily share the same semantic meaning, and it is thus inherently difficult for PreCls to model the query distribution accurately. Second, the number of clusters in PreCls is a highly sensitive parameter. We have conducted a large number of experiments to search for the optimal number of clusters for PreCls and finally obtained diverse results across TV05-09 (i.e., 507, 323, 435, 713, and 649, respectively). The results reported in Table IV use these optimal numbers. However, this parameter tuning is impossible to conduct in real applications because the groundtruth is usually unavailable. Furthermore, the authors of [7] have suggested that performing a re-clustering after each round to update the prior knowledge can further improve the results. However, this is impractical in our case due to the large scale of the TRECVID dataset. It is easy to see that in CAL we address the first issue by using semantic groups to model the query distribution, and address the last issue by updating the prior knowledge with domain adaptation.

By combining PreCls with uncertainty sampling, DUAL [24] obtains the best performance among the eight conventional strategies. However, compared to PreCls, the improvement (16.02% (±8.58%)) is not as significant as reported in [24]. The reason is that, in our experiments, DUAL spends a much longer period on PreCls than on the combined strategy, making its performance rely mainly on that of PreCls. This is simply due to the large scale of the TRECVID dataset, on which, within the 15-minute limitation, it is not easy for PreCls to reach the switching point where the derivative of the expected error approaches zero. Therefore, the advantage of DUAL is limited in large-scale interactive video retrieval.

B. Comparisons with State-of-the-art

In this section, we compare the performance of CAL with six state-of-the-art interactive search systems, including: (1) MediaMill'05 [36], the best interactive search system reported in TRECVID 2005; (2) Extreme Video Retrieval (EVR) [11], which is reported with performance comparable to MediaMill'05 (for EVR, we can only compare to the MAP of the method called manual paging with variable pagesize (MPVP) in [11], because, although other methods proposed in that paper can beat MPVP, no exact MAPs are reported for them); and (3) CMU'06 [37], (4) IBM'07 [38], (5) MediaMill'08 [39] and (6) MediaMill'09 [40], the best interactive search systems reported in TRECVID 2006-2009, respectively.

TABLE V
PERFORMANCE COMPARISON ON TRECVID 2005-2009 BETWEEN CAL AND THE BEST INTERACTIVE SEARCH SYSTEMS (Best). THE RESULTS ARE EVALUATED WITH THE AVERAGE OF APS OVER TOPICS ON 2005, AND OF INFAPS ON THE REST OF THE YEARS. THE BEST RESULTS ARE BOLD.

Method | 2005  | 2006  | 2007  | 2008  | 2009
MPVP   | 0.414 | N/A   | N/A   | N/A   | N/A
Best   | 0.406 | 0.303 | 0.362 | 0.194 | 0.246
CAL    | 0.563 | 0.422 | 0.493 | 0.331 | 0.380
CAL^A  | 0.558 | 0.432 | 0.508 | 0.352 | 0.397

Table V shows the results. It is easy to see that CAL (CAL^A) outperforms the best interactive systems by 38.67% (37.44%), 39.27% (42.57%), 36.19% (40.33%), 70.62% (81.44%), and 54.47% (61.38%) on TRECVID 2005-2009, respectively, and outperforms MPVP by 35.99% (34.78%) on TRECVID 2005. A significance test also shows that both CAL and CAL^A outperform the five best interactive search systems of TRECVID 2005-2009 at level 0.05 (MPVP is not included in this significance test because the detailed APs for the 24 queries of TRECVID 2005 are not reported in [11]). This again confirms that both CAL and CAL^A achieve state-of-the-art performance.
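The relative improvements quoted above follow directly from Table V; the short script below reproduces them (values copied from the table, percentages rounded as in the text).

best = {2005: 0.406, 2006: 0.303, 2007: 0.362, 2008: 0.194, 2009: 0.246}
cal  = {2005: 0.563, 2006: 0.422, 2007: 0.493, 2008: 0.331, 2009: 0.380}
cala = {2005: 0.558, 2006: 0.432, 2007: 0.508, 2008: 0.352, 2009: 0.397}

for year in best:
    imp_cal = 100 * (cal[year] / best[year] - 1)     # CAL vs. best TRECVID system
    imp_cala = 100 * (cala[year] / best[year] - 1)   # CAL^A vs. best TRECVID system
    print(f"{year}: CAL +{imp_cal:.2f}%, CAL^A +{imp_cala:.2f}%")

# Against MPVP on TRECVID 2005 (MAP 0.414):
print(f"CAL +{100 * (0.563 / 0.414 - 1):.2f}%, CAL^A +{100 * (0.558 / 0.414 - 1):.2f}%")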

VIII. CONCLUSION

We have presented CAL for addressing the exploration-exploitation dilemma in interactive video retrieval. By using pre-learnt semantic groups, CAL makes the query distribution predictable, and thus avoids the risk of searching on a completely unknown feature space. With the coaching of the predicted query distribution and the posterior distribution indicated by the current decision boundary, CAL balances exploration and exploitation in a principled way, so as to maintain users' patience during searching as well as to find their query intentions effectively. The experiments on a large-scale dataset show that the proposed approach works satisfactorily when facing domain shift and outperforms eight popular querying strategies, as well as six state-of-the-art systems. Through the study of CAL, we have identified three principal factors related to interactive retrieval and their interdependence, which provides not only a higher perspective from which to review existing works along the same line, but also a reference for improving our future designs of sampling strategies.

REFERENCES

[1] C. Campbell, N. Cristianini, and A. Smola, "Query learning with large margin classifiers," in International Conference on Machine Learning, 2000, pp. 111–118.

[2] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 129–145, 1996.

[3] D. Lewis and W. Gale, "A sequential algorithm for training text classifiers," in ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3–12.

[4] M. Tang, X. Luo, and S. Roukos, "Active learning for statistical natural language parsing," in Annual Meeting of the Association for Computational Linguistics, 2002, pp. 120–127.

[5] C. Zhang and T. Chen, "An active learning framework for content-based information retrieval," IEEE Transactions on Multimedia, vol. 4, no. 2, pp. 260–268, 2002.

[6] M. R. Naphade and J. R. Smith, "Active learning for simultaneous annotation of multiple binary semantic concepts," in IEEE Intl. Conf. on Multimedia and Expo, 2004, pp. 77–80.



[7] H. T. Nguyen and A. Smeulders, "Active learning using pre-clustering," in International Conference on Machine Learning, 2004, pp. 79–86.

[8] R. Yan, J. Yang, and A. Hauptmann, "Automatically labeling video data using multi-class active learning," in IEEE Intl. Conf. on Computer Vision, 2003, pp. 516–524.

[9] L. Wang, K. L. Chan, and Z. Zhang, "Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval," in IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 629–634.

[10] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," in ACM International Conference on Multimedia, 2001, pp. 107–118.

[11] A. G. Hauptmann, W.-H. Lin, R. Yan, J. Yang, and M.-Y. Chen, "Extreme video retrieval: joint maximization of human and computer performance," in ACM International Conference on Multimedia, 2006, pp. 385–394.

[12] P. Mitra, C. A. Murthy, and S. K. Pal, "A probabilistic active support vector learning algorithm," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, pp. 413–418, 2004.

[13] T. Osugi, D. Kun, and S. Scott, "Balancing exploration and exploitation: A new algorithm for active machine learning," in IEEE Intl. Conf. on Data Mining, 2005, pp. 330–337.

[14] X.-Y. Wei and Z.-Q. Yang, "Coached active learning for interactive video search," in ACM International Conference on Multimedia, 2011, pp. 443–452.

[15] X.-Y. Wei, C.-W. Ngo, and Y.-G. Jiang, "Selection of concept detectors for video search by ontology-enriched semantic spaces," IEEE Transactions on Multimedia, vol. 10, no. 6, pp. 1085–1096, 2008.

[16] B. Settles, "Active learning literature survey," University of Wisconsin–Madison, CS Technical Report 1648, 2009.

[17] T. Huang, C. Dagli, et al., "Active learning for interactive multimedia retrieval," Proceedings of the IEEE, vol. 96, no. 4, pp. 648–667, 2008.

[18] A. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in The Fifteenth International Conference on Machine Learning, 1998, pp. 350–358.

[19] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," in Intl. Conference on Machine Learning, 2000, pp. 999–1006.

[20] V. S. Iyengar, C. Apte, and T. Zhang, "Active learning using adaptive resampling," in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2000, pp. 92–98.

[21] B. Settles, M. Craven, and S. Ray, "Multiple-instance active learning," Advances in Neural Information Processing Systems (NIPS), vol. 20, pp. 1289–1296, 2008.

[22] X. Zhu, J. Lafferty, and Z. Ghahramani, "Combining active learning and semi-supervised learning using gaussian fields and harmonic functions," in ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003, pp. 58–65.

[23] M.-Y. Chen, M. Christel, A. G. Hauptmann, and H. Wactlar, "Putting active learning into multimedia applications: dynamic definition and refinement of concept classifiers," in ACM Intl. Conf. on Multimedia, 2005, pp. 902–911.

[24] P. Donmez, J. G. Carbonell, and P. N. Bennett, "Dual strategy active learning," in The 18th European Conference on Machine Learning, 2007, pp. 116–127.

[25] X.-Y. Wei, Y.-G. Jiang, and C.-W. Ngo, "Concept-driven multi-modality fusion for video search," IEEE Trans. on Circuits and Systems for Video Technology, vol. 21, pp. 62–73, 2011.

[26] A. F. Smeaton, P. Over, and W. Kraaij, "Evaluation campaigns and TRECVid," in ACM SIGMM International Workshop on Multimedia Information Retrieval, 2006, pp. 321–330.

[27] M. R. Naphade, J. R. Smith, J. Tesic, et al., "Large-scale concept ontology for multimedia," IEEE Multimedia, vol. 13, no. 3, pp. 86–91, 2006.

[28] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. Academic Press, 1979.

[29] C. G. M. Snoek, B. Huurnink, L. Hollink, M. de Rijke, G. Schreiber, and M. Worring, "Adding semantics to detectors for video retrieval," IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 975–986, 2007.

[30] S.-Y. Neo, J. Zhao, M.-Y. Kan, and T.-S. Chua, "Video retrieval using high level features: Exploiting query matching and confidence-based weighting," in International Conference on Image and Video Retrieval, 2006, pp. 143–152.

[31] P.-H. Gosselin and M. Cord, "Active learning methods for interactive image retrieval," IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1200–1211, 2008.

[32] E. Chang, S. Tong, K. Goh, and C.-W. Chang, "Support vector machine concept-dependent active learning for image retrieval," IEEE Transactions on Multimedia, vol. 2, pp. 1–35, 2005.

[33] Y.-G. Jiang, C.-W. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," in ACM International Conference on Image and Video Retrieval, 2007, pp. 494–501.

[34] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, 2011.

[35] J. A. Aslam and E. Yilmaz, "Inferring document relevance via average precision," in ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 601–602.

[36] C. G. M. Snoek, J. C. van Gemert, et al., "The MediaMill TRECVID 2005 semantic video search engine," in NIST TRECVID Workshop, 2005.

[37] A. G. Hauptmann, M.-Y. Chen, et al., "Multi-lingual broadcast news retrieval," in NIST TRECVID Workshop, 2006.

[38] M. Campbell, A. Haubold, et al., "IBM research TRECVID-2007 video retrieval system," in NIST TRECVID Workshop, 2007.

[39] C. G. M. Snoek, O. de Rooij, et al., "The MediaMill TRECVID 2008 semantic video search engine," in NIST TRECVID Workshop, 2008.

[40] C. G. M. Snoek, O. de Rooij, B. Huurnink, et al., "The MediaMill TRECVID 2009 semantic video search engine," in NIST TRECVID Workshop, 2009.

Xiao-Yong Wei (M'10) is an Associate Professor with the College of Computer Science, Sichuan University, China. His research interests include multimedia retrieval, data mining, and machine learning. He received his Ph.D. degree in Computer Science from City University of Hong Kong in 2009. He is one of the founding members of the VIREO multimedia retrieval group, City University of Hong Kong. He was a Senior Research Associate in the Dept. of Computer Science and the Dept. of Chinese, Linguistics and Translation of City University of Hong Kong in 2009 and 2010, respectively. He worked as a Manager of the Software Dept. at Para Telecom Ltd., China, from 2000 to 2003.

Zhen-Qun Yang is a Ph.D. candidate in the College of Computer Science, Sichuan University, China. She was a Research Assistant with the Dept. of Computer Science, City University of Hong Kong from 2007 to 2008 and a Senior Research Assistant with the Dept. of Chinese, Linguistics and Translation of City University of Hong Kong in 2009. Her research interests include multimedia retrieval, image processing and pattern recognition, and neural networks.