

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

Selective Weakly Supervised Human Detection under Arbitrary Poses

Yawei Cai a,b, Xiaosong Tan a,b, Xiaoyang Tan a,b,⁎

a Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Yudao Street 29, Nanjing 210016, China
b Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210016, China

ARTICLE INFO

Keywords: Weakly supervised learning; Human detection; Selective Weakly Supervised Detection (SWSD); Multi-instance learning (MIL)

ABSTRACT

In this paper we study the problem of weakly supervised human detection under arbitrary poses within the framework of multi-instance learning (MIL). Our contributions are threefold: (1) we first show that in the context of weakly supervised learning, some commonly used bagging tools in MIL, such as the Noisy-OR model or the ISR model, tend to suffer from gradient magnitude reduction when the initial instance-level detector is weak and/or when there exists a large number of negative proposals, resulting in extremely inefficient use of training examples; we hence advocate the simpler and more robust max-pooling or average rules under such circumstances; (2) we propose a new Selective Weakly Supervised Detection (SWSD) algorithm, which is shown to outperform several previous state-of-the-art weakly supervised methods; (3) finally, we identify several crucial factors that may significantly influence the performance, such as the usefulness of a small amount of supervision information and the need for a relatively higher RoP (Ratio of Positive Instances) – these factors are shown to benefit the MIL-based weakly supervised detector but are less studied in the previous literature. We also annotate a new large-scale data set called LSP/MPII-MPHB (Multiple Poses Human Body), on which, together with another popular benchmark dataset, we demonstrate the superiority of the proposed method over several previous state-of-the-art methods.

1. Introduction

Dealing with people is important because people are among the most common visual objects in pictures and videos, and because doing so has wide applications ranging from law enforcement, surveillance, healthcare, and access control to entertainment. For these applications, detecting humans is usually the first step towards deeper, higher-level understanding of people. One of the most studied related topics is pedestrian detection [1–3], a crucial component of urban intelligent transportation systems. Despite its success in some cases, this problem mainly concerns humans in upright positions, whereas in general people exhibit arbitrary poses, such as bending, sitting, and lying (see Fig. 1 for an illustration), highlighting the need to detect humans under arbitrary poses.

However, this task presents several challenges, mainly due to appearance variations arising from shape deformation, pose variation, illumination changes, cluttered background, occlusion, etc. [5]. To deal with these, state-of-the-art object detectors, such as R-CNN [6] and its variants, exploit the powerful modeling capability of deep neural networks, which learn invariant feature sets from a large number of fully supervised samples. However, it is generally difficult and laborious to obtain or manually annotate so many labeled samples, and one would like to localize objects with minimal supervision so as to relax the burden of data collection and annotation. This motivates the so-called weakly supervised learning (WSL) for object detection, where the aim is to learn from samples with only a weak supervisory signal available (e.g., “this image contains people”), but without any further detailed supervision such as how many people there are or where they are. Note that this setting differs from that of semi-supervised algorithms, where we are allowed access to a few such detailed annotations while the remaining samples carry no annotations or labels at all.

To this end, multi-instance learning (MIL) [7] has emerged as a promising framework for weakly supervised learning, as it naturally relaxes the requirement for instance labels by backpropagating the bag-level weak supervisory signal to them. Recently, Cinbis et al. [8] used multi-instance learning for general object detection and achieved good results on the Pascal VOC 2007 [9] data set, showing the potential of this method. However, they did not focus on the problem of multi-pose human detection, and many practical details of MIL, such as sample selection and parameter setting, remain unclear. More importantly, due to the lack of strong supervision information, previous MIL methods have to collect a large number of instances to guarantee that at least one positive instance appears in a positive bag. Unfortunately,

http://dx.doi.org/10.1016/j.patcog.2016.12.025
Received 19 June 2016; Received in revised form 18 October 2016; Accepted 24 December 2016

⁎ Corresponding author at: Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Yudao Street 29, Nanjing 210016, China.
E-mail address: [email protected] (X. Tan).

Pattern Recognition 65 (2017) 223–237

Available online 28 December 2016
0031-3203/ © 2017 Elsevier Ltd. All rights reserved.



doing this not only significantly increases the computational cost, but also makes some of the traditional bagging models¹ of MIL, such as the Noisy-OR rule and the ISR (Integrated Segmentation and Recognition) [10] rule, inefficient in learning. We first observe that too many negative instances in a positive bag may hurt learning performance in the sense that they can significantly reduce the magnitude of the gradient, especially under a weak initial instance-level model. To address this issue, we advocate more robust bagging models such as the max-pooling rule or the average rule.

For the study of object detection, there are many public data sets available to researchers, such as Pascal VOC [11] and ILSVRC [12]. However, these do not focus on the human being among their many object categories; in fact, there currently exists no large-scale data set tailored to the task of human detection. Fortunately, many datasets have been built for pose estimation, such as LSP [13] and MPII Human Pose [4], based on which we annotate a new dataset named LSP/MPII-MPHB (Multiple Poses Human Body) tailored to the specific task of human detection under arbitrary poses, selecting over 26K challenging images from both the LSP and MPII datasets. We annotate human body bounding boxes for each of the selected images.

In summary, the contributions of this paper are threefold. Firstly, we identify an easily overlooked pitfall of the Noisy-OR model in MIL, which is closely related to the vanishing gradient problem in neural networks and can significantly reduce the training efficiency of a MIL algorithm. Secondly, we propose a new Selective Weakly Supervised Detection (SWSD) algorithm for human detection under arbitrary poses, which uses a novel constrained elites selection approach to select top-ranking proposals for the next round of detector training. This not only helps to prevent training from prematurely locking onto erroneous object locations, but also improves learning stability. Finally, we identify several crucial factors that may significantly influence the performance, such as the usefulness of a small amount of supervision information and the need for a relatively higher RoP (Ratio of Positive Instances) – these factors are shown to benefit the MIL-based weakly supervised detector but are less studied in the previous literature. On both the newly annotated MPHB dataset and the popular Pascal VOC dataset, we show that our approach significantly outperforms several previous weakly supervised object detection methods. A preliminary version of this work appeared in [14].

The remainder of this paper is organized as follows. After briefly introducing related work on weakly supervised object detection in Section 2, we review multi-instance learning in Section 3. We describe our SWSD algorithm in Section 4 and the LSP/MPII-MPHB data set in Section 5. Experiments on multi-pose human body detection are presented in Section 6, and we conclude the paper in Section 7.

2. Related work

There are relatively few works devoted to human detection under arbitrary poses. One of the most influential methods in general object detection is the R-CNN method proposed by Girshick et al. [6]. The authors used Selective Search to generate object proposals, extracted CNN features from each proposal, and then learned the object model of interest. The method effectively bypasses time-consuming sliding-window scanning and takes advantage of deep learning, achieving state-of-the-art results on many object detection tasks. However, a large number of supervised training samples are needed to prevent its complicated likelihood model from overfitting. Our method can be interpreted as a weakly supervised version of R-CNN, in the sense that we learn the object model with a modified version of MIL (SWSD) at the later stage of the R-CNN framework.

The Deformable Parts Model (DPM) is another fully supervised model popularly used for human detection [15]. To relieve the burden of object-level manual annotation, Pandey and Lazebnik [16] proposed to use the DPM model equipped with latent SVM training as a general framework for weakly supervised learning, applying it to weakly supervised object localization and scene recognition. Siva and Xiang [17] presented a weakly supervised learning framework that incorporates a new initial annotation model to start the iterative learning of a detector, while Russakovsky et al. [18] proposed an object-centric spatial pooling approach, in which the location of the object of interest is inferred from image-level labels.

Initialization is crucial to many object detection methods. Song et al. [19] proposed a graph-based initialization method, formulated as a discriminative submodular cover problem on the similarity graph of the candidate windows. They also extended the approach to automatically identify discriminative configurations of visual patterns [20], in which the initial candidate object windows are found by searching frequent part configurations and their bounding boxes. In contrast, in this work we adopt a supervised approach, using a few labeled samples to initialize our object detector.

Due to the unavailability of strongly supervised information, one common approach is to treat those factors that cannot be observed directly but have an important influence on the target as latent variables. Hence, many weakly supervised object detection methods are essentially iterative procedures that improve the model based on the current guess of these latent variables. For example, in Wang et al.'s [21] latent category learning approach, probabilistic Latent Semantic Analysis (pLSA) is first used to learn the latent categories, and a category selection method is then adopted to evaluate each category's discrimination.

During iteration, it is important to prevent training from prematurely locking onto erroneous object locations. Bilen et al. [22] proposed two heuristic rules for this. The first is to enforce the model to assign similar scores to positive training windows and their horizontally mirrored images, and the other is to penalize a model which assigns high classification scores to multiple classes for a

Fig. 1. Illustration of people under different poses. (Images from MPII Human Pose [4] with the bounding boxes annotated in this work.)

1 By “bagging” we mean the procedure of combining the individual instance-level conditional probability into a bag-level one.



single candidate window. This latter strategy somewhat agrees with the maximum entropy principle. In Cinbis et al. [8], a multi-fold multiple instance learning method is proposed, which integrates object detection and localization into one framework. The major differences between our method and theirs are that we use a newly proposed constrained elites selection method (detailed in Section 4) to drive the learning procedure and use a small amount of ground truth as regularization. Also, in learning we utilize all useful information around the most promising candidates even if they contain some background, instead of relying only on the top-ranking estimate [8].

3. Multi-instance learning for weakly supervised object detection

In this section, we first give a general idea of interpreting multi-instance learning as a framework for weakly supervised learning, and then point out some pitfalls when working within this framework.

3.1. Background

In the setting of multi-instance learning [23], we are given N bags of data, denoted by D = \{x_i, t_i\}_{i=1}^{N}, where x_i is the i-th bag and t_i is its corresponding label. A bag usually contains a set of J instances, i.e., x_i = (x_{ij})_{j=1}^{J}, and these are treated as a whole unit in MIL training, just as a single training sample in a usual fully supervised learning algorithm. The label of a bag is decided mainly by whether the bag contains a positive instance. That is, only if there is no positive instance in a bag can the bag be called negative (t_i = 0); otherwise it is a positive bag (t_i = 1). Under this definition, we are uncertain which instance is really positive in a positive bag, which may impose a significant challenge on the MIL algorithm, since to achieve good performance it has to be robust enough against the noisy data in a positive bag.

Multi-instance learning can be employed as a natural framework for weakly supervised object detection [24] as follows. Firstly, an image is treated as a bag, to which a label is assigned, but otherwise we do not have further annotations about the objects in that image. Secondly, we generate a set of proposals from the image in an unsupervised way. These proposals, understood as instances in the given bag, serve as our initial conjectures about the possible locations of the objects of interest. Note that we do not have explicit label information about these instances. Under these settings, our goal is to train an instance-level object detector that determines whether an input proposal is the object of interest or not. If we regard this as the conditional probability of being positive for a given proposal, it follows that a good detector should satisfy the condition that, when fusing these instance-level predictions into a bag-level one, the result should be consistent with the ground-truth bag label. This is exactly the training objective of a MIL algorithm. In particular, in this work we are interested in two instance-level MIL algorithms, i.e., multi-instance AdaBoost (MILBoost) [24] and multi-instance logistic regression (MILLR) [25].

With this instance-level detector in hand, at test time we simply use it to score the possibility of each proposal (instance) being positive and output the location of the instance with the highest positive score as our detection result.

One important component of a MIL algorithm is its bagging model, i.e., how it fuses instance-level conditional probabilities into a bag-level one. Formally, denote the probability of an instance x_{ij} being positive as p_{ij}. To estimate the conditional probability p_i at the bag level, one can combine the probabilities at the instance level using different strategies, typically Max-pooling, Average, ISR and Noisy-OR, as follows:


Noisy-OR:

p_i = 1 - \prod_{j=1}^{J} (1 - p_{ij})    (1)

ISR:

p_i = \frac{\sum_j \frac{p_{ij}}{1 - p_{ij}}}{1 + \sum_j \frac{p_{ij}}{1 - p_{ij}}}    (2)

Max-pooling:

p_i = \max_j \{p_{ij}\}    (3)

Average:

p_i = \frac{1}{J} \sum_{j=1}^{J} p_{ij}    (4)

Among them, the Noisy-OR rule may be the most commonly used, as it matches well with the concept of MIL in that a bag is determined to be positive if at least one of its instances is positive. The ISR rule is derived from a probabilistic point of view [10] and can be thought of as a form of evidence accumulation, similar to the average rule. Finally, the max-pooling rule is probably the simplest and most intuitive of all, and it naturally makes the bagging process invariant to translation, rotation and shifting. A natural question is then which type of bagging model is best suited for weakly supervised object detection; we deal with this issue in the next section.
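As an illustrative sketch (our own code, not the authors'), the four bagging rules follow directly from Eqs. (1)–(4); the function names are ours:

```python
def noisy_or(p):      # Eq. (1): bag is positive unless every instance is negative
    prod = 1.0
    for pij in p:
        prod *= (1.0 - pij)
    return 1.0 - prod

def isr(p):           # Eq. (2): accumulated odds-like evidence
    s = sum(pij / (1.0 - pij) for pij in p)
    return s / (1.0 + s)

def max_pooling(p):   # Eq. (3): only the strongest instance matters
    return max(p)

def average(p):       # Eq. (4): mean instance probability
    return sum(p) / len(p)

# With 10/100/1000 instances each scored p_ij = 0.1 (cf. Table 1),
# Noisy-OR and ISR saturate towards 1, while max-pooling and average
# stay near 0.1 regardless of bag size.
for J in (10, 100, 1000):
    p = [0.1] * J
    print(J, round(noisy_or(p), 4), round(isr(p), 4), max_pooling(p), round(average(p), 4))
```

Running the loop reproduces the qualitative behavior reported in Table 1: the bag-level probability under Noisy-OR and ISR is driven by the sheer number of instances, not by their individual scores.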

3.2. The pitfall of the ISR and Noisy-OR models

In this section, we detail our reasons for avoiding the Noisy-OR or the ISR rule under circumstances where there are too many negative instances in a positive bag. This is exactly the case usually encountered in weakly supervised object detection – due to the lack of strong supervision information, one has to collect a large number of instances from an image to justify the MIL assumption that at least one positive instance appears in a positive bag. However, most of these proposals are negative, as usually there are only a few objects of interest in a single image.

Specifically, suppose that we have 10, 100 and 1000 instances in a positive bag respectively, and each of them is assigned a low positive probability of 0.1 by a typical, not-so-accurate instance-level model. Note that since most of these instances are negative by assumption, this assignment is reasonable except for the unknown positive instances. Table 1 gives the p_i values calculated according to the aforementioned four bagging models. One can see that the ISR and Noisy-OR rules tend to overestimate the probability that a bag is positive once the number of instances exceeds 100, regardless of whether the bag is truly positive or not. In an object detection problem with weakly supervised information, we will usually consider a large number of

Table 1
The p_i values calculated with different bagging methods under a varying number of instances (see text for details).

Methods        10        100       1000
Max-Pooling    0.1       0.1       0.1
Average        0.1       0.1       0.1
ISR            0.5263    0.9174    0.9911
Noisy-OR       0.6513    1.0000    1.0000



patches as object proposals (easily over 1000).

To see the consequence of this, let us take the MIL logistic regression method (MILLR) as our example. It is a linear classifier with its instance-level classifier as y_{ij} = w^T x_{ij} + b, where w and b are parameters to be learned, and the probability p_{ij} of an instance being positive is modeled using a sigmoid function. To train the model, the negative log-likelihood is used as the loss function,

L = - \sum_i (t_i \ln p_i + (1 - t_i) \ln(1 - p_i))    (5)

with its gradient as,

\frac{\partial L}{\partial w} = \sum_{i,j} \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial p_{ij}} \frac{\partial p_{ij}}{\partial w}    (6)

Here, \frac{\partial L}{\partial p_i} = -\frac{t_i}{p_i} + \frac{1 - t_i}{1 - p_i} and \frac{\partial p_{ij}}{\partial w} = p_{ij}(1 - p_{ij}) x_{ij}.

Note that during parameter learning, the term \frac{\partial p_i}{\partial p_{ij}}, which evaluates how a small change of the instance-level probability p_{ij} influences the bag-level probability p_i, plays an important role. Under the Noisy-OR and the ISR models respectively, we have,

Noisy-OR:

\frac{\partial p_i}{\partial p_{ij}} = \prod_{k \neq j} (1 - p_{ik}) = \frac{\prod_k (1 - p_{ik})}{1 - p_{ij}} = \frac{1 - p_i}{1 - p_{ij}}    (7)

ISR:

\frac{\partial p_i}{\partial p_{ij}} = \frac{1}{\left(1 + \sum_k \frac{p_{ik}}{1 - p_{ik}}\right)^2 (1 - p_{ij})^2}    (8)

Table 2 gives the gradient values assigned to the parameter w according to the above equations, under the same setting as in the example of Table 1. One may notice that both the ISR and the Noisy-OR model assign a very small gradient to w when the number of instances exceeds 100. For the Noisy-OR model, the main reason lies in the fact that the negative instances are assumed to be independent of each other, making the model lean towards calling the bag positive (cf. Eq. (7)). The ISR model partially alleviates this problem by replacing the product rule with a summation, but it still treats each negative instance equally with positive ones (cf. Eq. (8)). Consequently, once the negative instances are overwhelming, it suffers from the same numeric problem as the Noisy-OR. This notorious vanishing gradient phenomenon, which we commonly encounter in training deep neural networks, can lead to extremely inefficient use of training examples.

The max-pooling and average strategies, on the other hand, behave much more stably and robustly than the other two (cf. Tables 1 and 2), in the sense that they exhibit less sensitivity to the number of negative instances in a bag. While the simple average rule may underestimate the value of p_i, max-pooling provides more salient and robust statistics² about the content of the bag through a simple nonlinear transformation over the set of proposals.

Last but not least, it is worth emphasizing that, despite the aforementioned pitfall, the Noisy-OR/ISR model is less problematic otherwise. For example, Viola et al. successfully used this model in training their multi-instance AdaBoost (MILBoost)-based object detector [24] – in their case most of the instances (proposals) are actually positive, as they are sampled from the known location of the object of interest.
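The vanishing-gradient effect can be checked numerically (our own sketch, not the paper's code): for a positive bag (t_i = 1) whose J instances all score p_{ij} = 0.1, the per-instance gradient factor |∂L/∂p_i · ∂p_i/∂p_{ij} · p_{ij}(1 − p_{ij})| from Eqs. (5)–(8) shrinks rapidly with J under Noisy-OR and ISR, but stays constant under max-pooling:

```python
def grad_factor(rule, J, p=0.1):
    """|dL/dp_i * dp_i/dp_ij * p_ij(1-p_ij)| for one instance of a
    positive bag (t_i = 1) whose J instances all score p."""
    dpdw = p * (1.0 - p)                              # p_ij(1 - p_ij) in Eq. (6)
    if rule == "noisy_or":
        pi = 1.0 - (1.0 - p) ** J                     # Eq. (1)
        dpi = (1.0 - pi) / (1.0 - p)                  # Eq. (7)
    elif rule == "isr":
        s = J * p / (1.0 - p)
        pi = s / (1.0 + s)                            # Eq. (2)
        dpi = 1.0 / ((1.0 + s) ** 2 * (1.0 - p) ** 2) # Eq. (8)
    else:                                             # max-pooling
        pi = p                                        # Eq. (3)
        dpi = 1.0                                     # winner takes the whole gradient
    dLdpi = -1.0 / pi                                 # Eq. (5) with t_i = 1
    return abs(dLdpi * dpi * dpdw)

for J in (10, 100, 1000):
    print(J, [grad_factor(r, J) for r in ("noisy_or", "isr", "max")])
```

For J = 10 this reproduces the 0.0535 (Noisy-OR), 0.0474 (ISR) and 0.9 (max-pooling) entries of Table 2; by J = 1000 the Noisy-OR factor has collapsed to essentially zero.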

4. Selective Weakly Supervised Detection

4.1. Overview of the proposed method

The overall architecture of the proposed Selective Weakly Supervised Detection algorithm is illustrated in Fig. 2. The key idea is that the object detector of interest can be understood as a proxy that propagates the bag-level supervisory signal backward to the instance level, and based on such information one can select some of the top-ranking proposals (also called elites) to train the (instance-level) object detector in turn.

To improve the accuracy of the initial instance-level classifiers, we take advantage of a small amount of supervised information usually available in practice.³ In addition, these initial bags with fully supervised instances are kept throughout the whole training procedure to play the role of regularization for the MIL model.

More formally, assume that we have M supervised instances (x_j, y_j)_{j=1,…,M}, and denote the instance-level classifier of interest as y(x_{ij}, θ), which predicts a label for each input instance x_{ij}. Our goal is twofold: (1) we would like the classifier to yield good predictions at the bag level; (2) it should also correctly predict our few available supervised instances. These requirements lead to the following objective:

L(θ) = \frac{1}{N} \sum_{i=1}^{N} \log(1 + \exp(-t_i \cdot p_i)) + \frac{λ_1}{M} \sum_{j=1}^{M} \log(1 + \exp(-y_j \cdot y(x_j, θ))) + λ_2 \|θ\|^2    (9)

where λ_1 and λ_2 are two user-defined parameters – the former controls the extent of the contribution made by the supervised instances to model training, and the latter is a normal regularization coefficient – and p_i is defined using the max-pooling rule mentioned above,

p_i = \max_{j \in i} σ(y(x_{ij}, θ))    (10)

where σ(·) is the standard sigmoid function (S-curve function).

The next issue is how to define the instance-level classifier. We adopt two strategies for this. The first is logistic regression (MILLR), as described in Section 3.1. The other is the multi-instance version of AdaBoost [24], in which the instance-level prediction model is approximated as a linear combination of the outputs of several weak classifiers f_t, i.e., y(x_{ij}, θ) = \sum_t λ_t f_t(x_{ij}). Unlike the MILLR model, the output label y_{ij} is a nonlinear mapping of the corresponding instance x_{ij}. After training, we combine MILLR and MILBoost. Specifically, for an instance x_{ij}, denote its scores output by the two algorithms as p^{Ada}_{ij} and p^{LR}_{ij} respectively; the final score of the instance x_{ij} is then defined to be,

Table 2
The gradient of a positive bag assigned to the model parameter w under different models.

J            10                        100                               1000
Max-Pooling  0.9 x_{ij*}               0.9 x_{ij*}                       0.9 x_{ij*}
Average      \sum_k^J 0.09 x_{ij}      \sum_k^J 0.009 x_{ij}             \sum_k^J 0.0009 x_{ij}
ISR          \sum_k^J 0.0474 x_{ij}    \sum_k^J 8.2566×10^{-4} x_{ij}    \sum_k^J 8.9199×10^{-6} x_{ij}
Noisy-OR     \sum_k^J 0.0535 x_{ij}    0                                 0

2 The gradient of max-pooling depends only on the winner instance x_{ij*}, no matter where it is and how many negative proposals exist in the same bag, as witnessed by \frac{\partial p_i}{\partial p_{ij*}} = 1, where p_{ij*} = \max_j \{p_{ij}\}.

3 This should not be considered too restrictive – in many real applications we can always annotate a few samples at affordable cost.



p_{ij} = \max\{p^{Ada}_{ij}, p^{LR}_{ij}\}    (11)
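Under the max-pooling rule, the SWSD objective of Eqs. (9) and (10) can be sketched as follows (our own illustrative code, assuming a linear instance model y(x, θ) = θ·x and labels t_i, y_j in {−1, +1}, which is consistent with the logistic losses in Eq. (9)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swsd_loss(theta, bags, t, Xs, ys, lam1, lam2):
    """Eq. (9) with max-pooling (Eq. (10)).
    bags: list of (J_i, d) arrays of proposal features; t: bag labels in {-1,+1}
    Xs: (M, d) supervised instances; ys: their labels in {-1,+1}."""
    # Bag term: p_i = max_j sigmoid(theta . x_ij)  (Eq. (10))
    p = np.array([sigmoid(b @ theta).max() for b in bags])
    bag_term = np.mean(np.log1p(np.exp(-t * p)))
    # Supervised term on the few labeled instances (regularizing role)
    sup_term = np.mean(np.log1p(np.exp(-ys * (Xs @ theta))))
    return bag_term + lam1 * sup_term + lam2 * np.dot(theta, theta)
```

The score fusion of Eq. (11) then amounts to an elementwise `np.maximum(p_ada, p_lr)` over the two detectors' instance scores.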

4.2. Constrained elites selection

Since the instance-level prediction model y(x_{ij}, θ) is trained in an iterative way, one potential problem, noted in [8], is that the training procedure may stop prematurely due to the lack of diversity in the repeatedly selected training samples; the authors of [8] propose to train the model on some portion of the bags while selecting samples from another portion. This multi-fold method essentially injects diversity at the bag level. Instead, we propose an instance-level method called constrained elites selection, which constrains the range of candidate proposals selected for the next round of training, so as to reduce training noise while stabilizing the learning process.

Particularly, at each iteration we refine the search region for thecandidate proposals through a simple dense subgraph discoveryalgorithm [26]. We treat a bag as a graph G with each instance as itsnode. The weight of edge is calculated as the pairwise distance djk inthe 2D image space weighted by the average score of the correspondinginstances (i.e., w p p= ( + )/2jk ij ik for the jth and the kth instance). Thedense subgraph discovery algorithm searches the densest subgraph S inthe graph G with its density given by,

$den(S) = \frac{\sum_{j,k \in S} w_{jk} \cdot d_{jk}}{|S|}$ (12)

This partitions the top-ranking proposals into g small groups (dense subgraphs) in a flexible way, and we adopt the minimum 2D convex closure spanned by the members of the densest subgraph as the search region for the next iteration. In addition, we set a shrinking rate η at each iteration to reduce the size of the search region, as our detection confidence is expected to increase when more information is exploited as the training procedure goes on.

After obtaining the search region, it works as a filter to remove those object proposals outside the region. Then, instead of directly selecting the n top-ranking proposals from the remaining ones, we first randomly sample an r proportion of instances/proposals from them and then select a k proportion of top-ranking instances among these samples, such that the number of finally selected instances is n = r × k × |l| (where |l| is the number of instances in the search region l of a bag). The parameter r controls how many instances should be dropped out in searching for the best candidates for learning, and is a built-in mechanism to address the exploration-versus-exploitation problem. We will investigate its influence on the performance in the experimental section, but overall this semi-random strategy helps the algorithm jump out of bad local minima (wrong locations but with high scores) and improves its robustness against instance-level model uncertainty.
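The semi-random selection step described above can be sketched as follows; the function name and default values are ours, not the paper's.

```python
import random

def select_elites(proposals, scores, r=0.95, k=0.3, rng=random):
    """Randomly keep an r proportion of the in-region proposals (exploration),
    then take the top-k proportion of those by detector score (exploitation),
    so that roughly n = r * k * |l| proposals survive."""
    idx = list(range(len(proposals)))
    sampled = rng.sample(idx, max(1, int(r * len(idx))))      # random r proportion
    sampled.sort(key=lambda i: scores[i], reverse=True)       # rank by score
    keep = max(1, int(k * len(sampled)))                      # top-k proportion
    return [proposals[i] for i in sampled[:keep]]
```

With r = 1 this degenerates to plain top-n selection; lowering r randomly drops candidates, which is the mechanism the text credits with escaping bad local minima.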

The whole training process is summarized in Algorithm 1. There are three major parameters in this algorithm, i.e., the number of iterations T, the proportion r of samples, and the number n of top-ranking proposals to be selected at each iteration. These parameters are set according to cross-validation in practice, and the sensitivity of the algorithm to them will be investigated in the experimental section.

Algorithm 1. Selective Weakly Supervised Detection (SWSD).

1. Initialize training set S0: fully supervised samples consisting of M/2 ground-truth and M/2 negative instances in positive bags, and N empty positive training bags; detection proposals from each image.

2. Train a MIL instance-level classifier using S0 and locate a relatively large search region l based on the detection scores of each proposal.

3. For iteration t = 1 to T:
   (a) Randomly select an r proportion of instances/proposals in the search region l from each positive bag.
   (b) Run the current MIL detector over those random instances/proposals and assign a score to each of them.
   (c) Discover the densest subgraph and locate the next search region l according to it.
   (d) Select the n top-ranking proposals from the densest subgraph and combine them with the S0 set to construct a new training set St.
   (e) Use St to train a new MIL detector.

4. Return the final detector.
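Algorithm 1 can be transcribed as a compact loop. The sketch below is ours, not the authors' code: the detector interface (`train`, `score`), the proposal representation, and the use of plain top-k shrinking in place of the densest-subgraph refinement of step (c) are all simplifying assumptions.

```python
import random

def swsd_train(bags, s0, train, score, T=6, r=0.95, k=0.3):
    """bags: bag id -> list of proposals; s0: small fully supervised seed set;
    train(samples) -> detector; score(detector, proposals) -> list of floats."""
    detector = train(s0)                                    # step 2
    regions = {b: list(props) for b, props in bags.items()}
    for _ in range(T):                                      # step 3
        elites = []
        for b, region in regions.items():
            n = max(1, int(r * len(region)))
            sampled = random.sample(region, n)              # (a) random r proportion
            s = score(detector, sampled)                    # (b) score them
            ranked = [p for _, p in sorted(zip(s, sampled),
                                           key=lambda t: t[0], reverse=True)]
            top = ranked[:max(1, int(k * len(ranked)))]
            regions[b] = top       # (c) shrink the search region (stand-in for
                                   #     the densest-subgraph refinement)
            elites += top          # (d) elites join the training set
        detector = train(list(s0) + elites)                 # (e) regularized retrain
    return detector                                         # step 4
```

Note how the seed set S0 is re-injected at every retraining step (e); this is the supervised regularizer whose contribution is ablated in Section 6.2.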

5. The LSP/MPII-MPHB dataset

In this section, we describe in detail a new Multiple Poses Human Body (MPHB) data set for human detection under arbitrary poses and its accompanying evaluation protocols.

5.1. Dataset

Images in this dataset are selected from the LSP dataset [13] and the MPII Human Pose dataset [4]. In LSP, human bodies are mainly in the center of the images and the background is less cluttered. So these images carry more useful information about human bodies, and it

Fig. 2. The overall architecture of the Selective Weakly Supervised Detection algorithm. In training, the object proposals are input into the instance-level detector to evaluate the scores of being positive. According to the score and location of each proposal, we use the proposed constrained elites selection algorithm to refine the search region and select the top-ranking proposals (elites) as the training examples to update the current detector. In testing, we use the learned detector to score the proposals and predict the bounding box.


is in favor of detector training, as seen in Fig. 3. On the other hand, images in the MPII Human Pose data set contain various pose variations and are possibly captured at various scales (cf. Fig. 4).

Both the LSP and MPII data sets were originally created for human pose estimation and hence ship with annotations of the positions of key human parts (e.g., feet, legs, etc.). But unfortunately, such annotations cannot be directly used for human detection, as bounding boxes of the whole human body inferred from them tend to be too noisy and unreliable for performance evaluation. To this end, we derive a new dataset named LSP/MPII-MPHB (Multiple Poses Human Body) for human detection based on the LSP and MPII data sets, by selecting over 26K challenging images from them and making the necessary annotations of human body bounding boxes on each of the selected images.

The resulting dataset, named LSP/MPII-MPHB,4 contains 26,675 images and 29,732 human bodies. There is at least one human per image, and some images contain multiple people. Among these images, 2,000 are from LSP and 24,675 are from MPII. We compute the size ratio of the ground-truth bounding box to the whole image and plot the frequency histogram, as shown in Fig. 5. One can see that almost 70% of the ground-truth boxes have a size ratio of less than 10%, indicating that it is challenging to detect the human beings in the MPHB data set.

Human bodies in LSP/MPII-MPHB are under various poses, which are roughly divided into six categories, i.e., bent, kneeling, lying, occlusion, sitting and upright. See Fig. 6 for an illustration; Table 3 gives more details. Note that the number of images in each category of poses is unbalanced – the predominant category is "bent" while the poses of "kneeling" and "lying" are relatively few.

5.2. Evaluation

We design our evaluation protocol by dividing these images into three partitions. Particularly, the numbers of images (human bodies) are respectively 8,385 (9,732), 8,110 (8,233) and 10,180 (11,767) for the training set, validation set and test set. The performance evaluation is based on the list of bounding boxes output by the detector and their associated confidence. Particularly, following the common practice in this field (cf. the Pascal VOC [11] and ILSVRC [12]), we calculate the IoU (Intersection-over-Union) between the predicted $B_p$ and the ground-truth $B_{gt}$ for each test image according to Eq. (13). A bounding box whose IoU with the ground-truth is more than 0.5 is considered as positive:

$IoU = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$ (13)

The final performance is summarized using the average precision (AP) metric. For a given task and class, the precision/recall curve is computed from a method's ranked output. The recall is defined as the proportion of all positive examples ranked above a given rank. The precision is the proportion of all examples above that rank which are from the positive class. The AP summarizes the shape of the precision/recall curve and is the area under the curve.
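The evaluation protocol can be sketched directly. Assumptions that are ours, not the paper's: boxes are [x1, y1, x2, y2] corner tuples, and the AP below is the non-interpolated area under the precision/recall curve rather than the older 11-point VOC variant.

```python
def iou(bp, bgt):
    """Eq. (13): intersection over union of two boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))   # overlap width
    iy = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))   # overlap height
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(bp) + area(bgt) - inter)

def average_precision(ranked_is_positive, n_positives):
    """Area under the precision/recall curve, given detections sorted by
    confidence and flagged True when IoU with a ground truth exceeds 0.5."""
    ap, tp = 0.0, 0
    for rank, hit in enumerate(ranked_is_positive, start=1):
        if hit:
            tp += 1
            ap += tp / rank          # precision at each recall increment
    return ap / n_positives
```

A detection is flagged positive when `iou(pred, gt) > 0.5`, matching the protocol above.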

6. Experiment

In this section, we first present our pilot experiments on several factors related to the performance of weakly supervised MIL detectors and then compare the proposed method with several state-of-the-art methods. Besides human detection, we also evaluate the performance of our method on the general object detection task under the weakly supervised learning setting.

In the implementation, we use the Selective Search method [27] to

Fig. 3. Illustration of the diversity of human poses in the LSP dataset.

Fig. 4. Illustration of humans in the MPII dataset. Note that there may be multiple subjects in one single image and some of them are in quite small scale.


Fig. 5. Distribution of the size ratio of human beings in the MPHB dataset.

4 Freely available at http://parnec.nuaa.edu.cn/xtan/data/MPHB.html


generate detection proposals and the VGG net [28] for feature representation. Unless otherwise noted, in all experiments we use the following default experimental settings: in training we initially use 100 fully supervised instances as priors, the algorithm parameter r is set to 95%, T = 6, and the regularization coefficients in Eq. (9) are set to $\lambda_1 = 1.0$ and $\lambda_2 = 0.001$ respectively. All the MIL algorithms adopt the max-pooling combination function except when stated explicitly.

6.1. Pilot experiments

We first conduct some pilot experiments to measure the impact of some interesting aspects of the model, including the effect of the number of supervised samples, the ratio of positive instances in a positive bag, and different strategies to combine the instance-level model into a bag-level model. This helps to quickly identify the parts of the weakly supervised MIL detector that are most important to focus on for further development. This series of experiments is conducted on the Pascal VOC 2007 dataset [9], which contains 2,095 positive samples and 2,916 negative samples in the training set; the size of the test set is 4,952 images.

6.1.1. Effects of the amount of supervised information

The first problem we want to investigate is how the amount of supervised information available influences the performance of a state-of-the-art object detector. For this, we choose the DPM (Deformable Parts Model) method, whose AP score is 48.7% on the Pascal VOC 2007


Fig. 6. Illustration of six types of typical human pose variations in our data set.

Table 3
Detailed information about the six types of typical pose variations in our dataset.

Pose type   Description                                                    Number
Bent        At least one body part is bent (e.g., stoop or horse step)     10,229
Kneeling    At least one of the two knees touching something                1,053
Lying       Sleeping or swimming, in a horizontal position or on the back   1,123
Occlusion   Only part of the body is visible                                5,739
Sitting     Buttocks touching something                                     4,040
Upright     Standing without any bend                                       4,492


data set [15] – although it is not as good as the deep learning based methods, it requires far fewer labeled samples and is efficient to train.

Fig. 7 gives the performance of DPM as a function of the number of positive training samples available. To obtain this, we repeat the same experiment several times by gradually decreasing the number of labeled positive samples by 200 each time (from 2,000 to 200) while keeping the same number of negative samples. The figure shows that the AP performance of DPM decreases as the number of supervised samples is reduced. Particularly, when the number of positive training samples is 2,000, the AP score is the highest (35.96%). Interestingly, one can observe from the figure that when the number of labeled positive samples is below 800, the performance of DPM starts to drop very quickly, and when the number is as small as 200, the AP score drops to only 22.91%.

The above experiments imply that there may exist a breaking point for the DPM object detector, below which its behavior tends to be unreliable. It is well known that the best generalization capability of a supervised model can only be achieved with a good tradeoff between model complexity and the data available, and when the number of supervised samples is not large enough, the performance of a complex model such as DPM is likely to deteriorate due to overfitting.

On the other hand, since in practice one can usually afford to label a few samples, it is natural to ask whether one can use these few strongly labeled data to improve the performance of a weakly supervised algorithm. This is exactly the focus of this work. But before verifying the effectiveness of the proposed method, let us first check some factors that are of importance to our algorithm.

6.1.2. Effects of the ratio of positive instances

We next investigate the impact of the purity of positive instances in a bag. Here the purity is measured by the ratio of positive instances (RoP), defined as the ratio of the number of positive instances to the total number of instances in a positive bag. In a standard MIL setting we are blind to the labels of the instances in a bag, but as a pilot experiment, we assume that we know such information but will not use it except in data generation. With this, we are able to freely change the ratio of positive instances, by fixing the positive instances used for learning in a bag and varying the number of negative instances in the same bag. The positive instances are object proposals with IoU values higher than 0.5, and negative instances are randomly sampled from the non-positive instances in that bag. At each ratio, we train three MIL algorithms (without using the label information of instances), i.e., MILBoost, MILLR and their ensemble (MILBoost+MILLR). For comparison, we also train three normal fully supervised learning (FSL) versions of them, denoted as FSLBoost, FSLLR, and FSLBoost+FSLLR respectively.

Fig. 8 (left) gives the results. There are several interesting observations that can be made from this figure:

• The loosely defined positive instances are useful. Compared to pure fully supervised learning, loosely defined positive instances such as ours can provide more information about the target object. The figure shows that for each of the MIL algorithms, given a high enough RoP, their AP scores can be higher than those of the fully supervised learning versions. For example, the performance of the weakly supervised learner "MILBoost+MILLR" is even better than that of the fully supervised "FSLBoost+FSLLR" if the RoP is higher than 70.0%.

• The performance of the MIL learner declines as the purity of positive instances in a bag decreases. The figure shows that the performance of all the MIL algorithms quickly drops below that of the corresponding fully supervised versions. This phenomenon is likely to result from the deterioration of the instance-level model, as explained in Section 3.2. In practice, the RoP is usually less than 30.0%, which highlights the need of choosing high-quality positive instances for the MIL algorithm.

One final observation is that the ensemble version of the MIL algorithms consistently works better than the individual ones. Hence in what follows, we use "MILBoost+MILLR" with Max-Pooling as our default algorithm.

6.1.3. Effects of the ratio of instances over proposals

Experiments in the previous section highlight the benefit of improving the RoP (ratio of positive instances in a bag) but assume that the knowledge about the instance-level labels is known, which is unrealistic in the case of weakly supervised learning. One way to address this issue is to use some initial instance-level detector to assign labels to each instance. This is essentially a kind of bootstrap strategy.

Alternatively, one can optimize the RoP score through the ratio of instances over proposals. As described in Section 4.2, the key idea of this method is to find the instances most likely to be positive among the proposals generated by the object proposal algorithm [27]. Particularly, we select the n top-ranking proposals from all the ones generated for each bag and treat them as the set of instances in that bag. In this way, the range of potential positives is effectively narrowed down, and the purity of positive instances in a bag is improved accordingly, provided that the initial object detector is acceptable. The ratio of the number of selected instances (the n top-ranking ones) over all the object proposals generated in a bag is hence called the IoP (Instances over Proposals). Note that IoP can be thought of as an operational metric in the case of weakly supervised learning, which may have a complex nonlinear relationship with the RoP measurements, as verified below.

To study the impact of varying IoP on the performance, we randomly select 100 positive instances and 100 negative instances with ground truth to train an initial instance-level object detector, and then use it to select the n top-ranking proposals from the generated object proposals in each bag to train the MIL detector. The experiments are repeated 10 times, and both the mean AP and its variance are calculated (the variance is relatively small, and we choose not to report it for clarity). For comparison, we also conduct the experiments by randomly selecting the same number of proposals as that selected by the initial detector to train the MIL detector.

Fig. 9 (left) gives the mean AP performance as a function of varying IoP values. One can see that the proposal selection method based on the top-ranking scores (i.e., SWSD) significantly outperforms the random proposal selection method, and the SWSD method works better than the fully supervised DPM method as well (in the case of lacking enough supervised information). Furthermore, the figure reveals that neither a large IoP value nor a small one is good for the performance. One possible explanation is that at a large IoP value,


Fig. 7. The AP of DPM as a function of the number of positive training samples on the Pascal VOC 2007 dataset.


we may incorporate too many noisy instances into the training, while a smaller one may lose too much useful information. A good choice of the IoP value has a close relationship to the performance of the underlying object detector, but our empirical results indicate that a range between 30.0% and 50.0% is helpful.

By comparing the figure with Fig. 8 (left), one can see that the relationship between IoP and RoP is not linear. Particularly, the mean AP of SWSD is maximized when the IoP value is between 30% and 50%, while Fig. 8 (left) reveals that the higher the RoP value, the better the performance. We then look into the details of the selected proposals at the IoP value of 40.0% (not shown), and it turns out that the corresponding RoP value is about 70.0%. This clearly indicates the consistency between these two indexes. A full illustration of the value of RoP as a function of the value of IoP is given in Fig. 9 (right).

6.1.4. Effect of different bagging strategies

In the experiments above, we used the Max-Pooling strategy (cf. Eq. (3)) as the default bagging function to estimate the conditional probability that a bag is positive given the instance-level model. Choosing this strategy effectively encourages instance competition in a bag. By contrast, the ISR and Noisy-OR models (cf. Eq. (1)) take all instances in a bag into account but assume that they are independent of each other. In this section we empirically compare the four typical bagging strategies used in MIL (i.e., Max-Pooling, Average, ISR and Noisy-OR), under the same experimental settings (i.e., the same data set and the same instances in each bag) as described in Section 6.1.2.
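For reference, the four bagging rules can be written out side by side. The ISR form below is the standard multi-instance formulation and is our reconstruction rather than a copy of the paper's Eq. (1).

```python
import numpy as np

def bag_prob(p, rule="max"):
    """Combine instance probabilities p_ij into a bag probability p_i
    (assumes every p_ij < 1 so the ISR odds are finite)."""
    p = np.asarray(p, dtype=float)
    if rule == "max":                      # max-pooling: winner takes all
        return p.max()
    if rule == "average":                  # uniform vote over instances
        return p.mean()
    if rule == "noisy-or":                 # 1 - prod_j(1 - p_ij)
        return 1.0 - np.prod(1.0 - p)
    if rule == "isr":                      # integrated segmentation & recognition
        odds = np.sum(p / (1.0 - p))
        return odds / (1.0 + odds)
    raise ValueError(rule)
```

Max-pooling and averaging keep their gradients well scaled regardless of bag size, which is the behavior the comparison below (Fig. 8, right) favors.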

Fig. 8 (right) gives the AP scores of the four methods as a function of the varying Ratio of Positive Instances (RoP). It can be seen that the performance of the Noisy-OR function is consistently inferior to that of Max-Pooling – actually, the best AP of Max-Pooling is 29.85%, about 7.8% higher than that of the Noisy-OR. The ISR model suffers from the same gradient vanishing problem as the Noisy-OR model, yielding a comparable performance with the latter. They both perform much worse than the Average method and the Max-Pooling method. These results are consistent with our analysis made in Section 3.2.

6.1.5. Discussion

In summary, the pilot experiments in this section leave us some useful lessons: (1) current state-of-the-art fully supervised object detection methods may fail when the supervised information is insufficient; (2) a sufficient number of good-quality positive instances is essential to the success of a weakly supervised MIL algorithm; and (3) the Max-Pooling combination function may be better than the widely used Noisy-OR function when combining the instance-level model into a bag-level one. These factors inspire us to design our SWSD algorithm for object detection with weakly supervised information, and we have roughly verified its effectiveness in Section 6.1.3. In the following, we continue to evaluate


Fig. 8. The sensitivity of the AP performance to varying RoP (ratio of positive instances in a bag). Left: comparison among multiple instance learning (MIL) versions of AdaBoost, LR, and their ensemble (AdaBoost+LR), with their normal fully supervised learning (FSL) versions serving as baselines; Right: comparison between four different combination strategies (Max-Pooling, Average, ISR and Noisy-OR) that combine instance-level models into bag-level ones under the MIL ensemble model of "MILBoost+MILLR".


Fig. 9. Left: comparison of different proposal selection methods with varying IoP (Instances over Proposals); Right: the value of RoP (Ratio of Positive Instances) as a function of the value of IoP. Both on the Pascal VOC 2007 dataset.


its performance on the large-scale LSP/MPII-MPHB dataset and investigate its capability to handle the general object detection problem in a weakly supervised way.

6.2. Experiments on LSP/MPII-MPHB

6.2.1. Overall performance evaluation

To verify the effectiveness of the proposed Selective Weakly Supervised Detection (SWSD) method, we compare our method with two state-of-the-art weakly supervised learning (WSL)-based object detection methods, i.e., Cinbis et al.'s Multi-fold MIL (MMIL [8]) and Bilen et al.'s Posterior Regularized Latent SVM method (PRLS [22]), on our newly annotated large-scale people dataset (i.e., LSP/MPII-MPHB). Both algorithms are implemented by us, and we have verified the correctness of our implementations by running them on the Pascal VOC 2007 and VOC 2010 datasets, finding that both achieve similar results to those reported in their original papers. To illustrate the contribution of the supervised regularizer, we also evaluate two variants of the proposed SWSD method: the first simply uses the initial detector trained with 100 fully supervised instances for testing, and the second is the SWSD algorithm without using any supervised information.

Table 4 gives the results. It can be seen that the proposed SWSD method performs the best among the compared ones. As shown in the table, the proposed method significantly outperforms the two previous state-of-the-art weakly supervised detectors by 18.8% and 24.6% respectively. This indicates that our method is much more robust than the compared methods when detecting highly deformable human bodies. Again, the table verifies that max-pooling is more suitable for weakly supervised object detection than the other compared bagging models, and it can be seen that the Max-Pooling strategy improves upon the Noisy-OR model by 8.5% in terms of AP. The table also reveals that our performance advantage is preserved even when no supervised regularizer is adopted – in this case, our algorithm achieves an AP of 21.26%, much higher than the compared ones (10.81% for the standard MILBoost, 10.97% for PRLS (Posterior Regularized Latent SVM [22]) and 16.61% for (Multi-fold

Table 4
Performance comparison of various weakly supervised person detection methods on the LSP/MPII-MPHB dataset.

Methods                                                    AP(%)
PRLS (Posterior Regularized Latent SVM method [22])        10.97
MMIL (Multi-fold MIL [8])                                  16.61
SWSD (Selective Weakly Supervised Detection, proposed)     35.37

Variants and baselines
(a) Standard MILBoost (Max-Pooling)                        19.38
(b) Standard MILBoost (Average)                            15.54
(c) Standard MILBoost (ISR)                                11.91
(d) Standard MILBoost (Noisy-OR)                           10.81
(e) Fully supervised detector (with 100 samples)           27.44
(f) SWSD(–) (SWSD without supervised regularizer)          21.26
(g) FMP (Flexible Mixtures-of-Parts [29])                  24.33

Table 5
Performance comparison (AP %) on different poses on the LSP/MPII-MPHB dataset.

Pose type   MILBoost [24]  PRLS [22]  MMIL [8]  SWSD(–)  SWSD (proposed)
Bent        10.90          11.61      14.41     17.43    20.68
Kneeling    14.58          15.33      15.91     16.42    22.52
Lying        6.97           7.30       9.53      9.36    10.68
Occlusion   14.06          13.94      15.61     22.28    47.20
Sitting     11.03          11.93      17.18     20.07    29.63
Upright     23.57          25.53      27.50     29.56    48.28

Fig. 10. Illustration of the human detection results of our method on the LSP/MPII-MPHB dataset. Correct localizations are shown in yellow, incorrect/missing (dashed-box) ones in pink. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


MIL [8])). Although this performance is lower than that of the fully supervised method with only 100 samples by about 6.2%, the table shows that our full algorithm improves upon the fully supervised method by about 8.0%, indicating that the proposed semi-random iterative MIL learning procedure is effective.

Note that in Table 4 we also compare our method with a state-of-the-art human parsing method – FMP (Flexible Mixtures-of-Parts [29]). Particularly, we first use the FMP method to estimate the human parts in a given image and then derive a bounding box based on such estimation to measure the performance. The result in Table 4 shows that the performance of the pose estimation method is lower than that of our method by about 10%, showing that human parsing methods may not be directly applicable to human detection – such methods either rely on the results of human detection for further part parsing, or cannot deal with the case when the scale of the human is very small.

Table 5 details the detection performance on various poses. It helps us understand more clearly where the performance advantage of the proposed method over the compared ones comes from. One can see that our


Fig. 11. Object re-localization using multi-fold MIL and the proposed SWSD from initialization (left) to the final localization (right) and two intermediate iterations. Dashed circles indicate the search region; correct localizations are shown in yellow, incorrect ones in pink. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Fig. 12. The ratio of search region intersection between two consecutive iterations in our SWSD.


method significantly outperforms the compared ones, with a 5.0–20.0% higher detection rate on all the pose types except lying. Despite this, the table reveals that non-upright poses such as lying, bent and kneeling are more challenging than others. Possible reasons are either the lack of training samples (for lying and kneeling) or large deformations (e.g., bent). Fig. 10 illustrates some of the correct/missing detection results of our method. Although the performance is expected to improve further on some particular pose types, such as lying, by collecting more training data, the figure shows that in general our method can reliably detect human bodies even under partial occlusions, various poses/lighting conditions and at small scales.

6.2.2. The behavior of object re-localization

To investigate in detail the object re-localization behavior of our method when handling a new test image, we compare it with Cinbis et al.'s Multi-fold MIL (MMIL [8]), which also works by refining the initial rough estimation iteratively. As discussed in Section 2, besides using different bagging functions (they use Noisy-OR whereas we use max-pooling), the major differences between MMIL and our method are summarized as follows: (1) in the initialization stage, MMIL adopts a conservative strategy by starting to search from almost the whole image, while our method starts from a set of object proposals verified by a weak object detector; (2) to inject diversity, MMIL randomly selects positive bags at each round, while we use the constrained elites selection strategy detailed in Section 4.2; and (3) MMIL mines hard negative instances at each round, while we use a few supervised positive instances as a regularizer each time.

These differences lead to different behaviors of the two algorithms, as shown in Fig. 11. The figure illustrates the object re-localization procedures of the two methods respectively. One can see that both algorithms successfully localize the object instance of interest (i.e., the child in the first row), and the bounding box yielded by our method looks slightly better than that of MMIL. In the man-in-a-tractor example, multi-fold MIL gets stuck on the windows of the whole tractor. By contrast, our SWSD is able to progressively localize down to the man, although the subject is partially occluded and lies in a very small region. Note that in this example our algorithm starts searching from some obviously not-so-good initial locations, but is still able to converge to a good location. In the last example of the man in a boat,


Fig. 13. The AP(%) score as a function of various r values and k values on LSP/MPII-MPHB (Left) and Pascal VOC 2007 (Right), where r is the proportion of proposals selected in the search region l, while k is the proportion of top-ranking instances.


Fig. 14. Influence of the individual stages of our SWSD method. On three benchmark datasets, we compare the average precision (AP) score of our full WSL method to the scores obtained when each of the three main stages of SWSD is removed in turn, leaving the remaining stages in place.

Table 6The AP(%) comparison of weakly supervised detection on the test set of Pascal VOC 2007. Results of [21]* are with inter-class context information.

Method aero bicy bird boa bot bus car cat cha cow dtab dog hors mbik pers plnt she sofa trai tv Av.

Song et al. [19] 27.6 41.9 19.7 9.1 10.4 35.8 39.1 33.6 0.6 20.9 10.0 27.7 29.4 39.2 9.1 19.3 20.5 17.1 35.6 7.1 22.7
Song et al. [20] 36.3 47.6 23.3 12.3 11.1 36.0 46.6 25.4 0.7 23.5 12.5 23.5 27.9 40.9 14.8 19.2 24.2 17.1 37.7 11.6 24.6
Bilen et al. [22] 42.2 43.9 23.1 9.2 12.5 44.9 45.1 24.9 8.3 24.0 13.9 18.6 31.6 43.6 7.6 20.9 26.6 20.6 35.9 29.6 26.4
Wang et al. [21] 48.8 41.0 23.6 12.1 11.1 42.7 40.9 35.5 11.1 36.6 18.4 35.3 34.8 51.3 17.2 17.4 26.8 32.8 35.1 45.6 30.9
Wang et al. [21]* 48.9 42.3 26.1 11.3 11.9 41.3 40.9 34.7 10.8 34.7 18.8 34.4 35.4 52.7 19.1 17.4 35.9 33.3 34.8 46.5 31.6
Cinbis et al. [8] 39.3 43.0 28.8 20.4 8.0 45.5 47.9 22.1 8.4 33.5 23.6 29.2 38.5 47.9 20.3 20.0 35.8 30.8 41.0 20.1 30.2
MILBoost [24] 32.8 38.1 20.5 10.3 9.1 36.4 40.9 17.6 8.1 21.1 13.3 17.8 28.9 41.2 13.6 16.4 24.8 20.4 34.1 19.8 23.2
SWSD (–) 39.5 45.3 22.5 20.6 11.2 42.7 49.3 30.6 12.0 31.3 22.6 26.6 31.9 46.1 20.5 22.9 28.3 24.1 37.4 32.0 29.9
SWSD (proposed) 41.1 47.8 23.7 23.5 11.4 46.6 54.1 35.6 15.9 36.8 26.1 32.2 32.1 47.1 26.3 24.1 29.3 25.6 38.3 38.4 32.8

Y. Cai et al. Pattern Recognition 65 (2017) 223–237


however, our algorithm prematurely locks onto an erroneous location. One possible reason for this is the effect of the supervised regularizer, which generates such a strong response at the location of the boat that it overwhelms the subsequent attempt to re-locate the human being of interest. This implies the potentially harmful effect of a weak detector in some specific cases; one way to address it is to introduce more diversity into the searching procedure, which will be the focus of our future study.

To study the behavior of our constrained elites selection method, Fig. 12 illustrates the changing ratio of search-region intersection between two consecutive iterations for a randomly selected image (shown in the first row of Fig. 11). The intersection between search regions is defined in terms of the number of common proposals falling into each individual search region in two consecutive iteration steps. One can see that between the 2nd and 5th iterations, the ratio of intersection stays at a relatively low level of about 0.5, indicating that the

Table 7
The AP(%) comparison of weakly supervised detection on the test set of Pascal VOC 2010. Results of [22]* are based on our implementation.

Method aero bicy bird boa bot bus car cat cha cow dtab dog hors mbik pers plnt she sofa trai tv Av.

Cinbis et al. [8] 44.6 42.3 25.5 14.1 11.0 44.1 36.3 23.2 12.2 26.1 14.0 29.2 36.0 54.3 20.7 12.4 26.5 20.3 31.2 23.7 27.4
Bilen et al. [22]* 38.7 44.5 18.4 10.4 12.9 39.3 42.8 25.4 8.0 23.9 11.0 22.9 29.9 46.2 9.9 22.6 24.9 18.5 32.1 25.5 25.4
MILBoost [24] 34.4 38.6 16.8 8.3 9.8 37.9 32.2 20.2 9.6 18.6 9.7 17.8 26.7 41.1 13.9 14.9 17.6 17.0 26.7 15.0 21.3
SWSD (–) 39.7 48.1 23.5 14.0 14.7 48.9 33.7 30.5 14.8 23.2 12.3 23.9 32.5 47.1 20.5 21.2 16.7 21.7 33.4 14.2 26.7
SWSD (proposed) 44.7 52.6 24.5 17.8 18.0 53.0 34.4 34.6 15.3 24.5 12.5 24.7 36.2 48.0 27.2 27.4 20.5 28.6 34.3 17.7 29.8

[Fig. 15 image grid: rows show the cat, bicycle, sofa, dog, and bottle classes; columns show Standard MILBoost, PRLS, MMIL, and SWSD.]

Fig. 15. Example detection results of various detectors on the test images of Pascal VOC 2007. Correct localizations are shown in yellow, incorrect ones in pink. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


search region is changing actively. After the 6th iteration, the ratio rises above 0.7 and grows quickly, revealing that the detector training procedure tends to become stable and converge.
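The intersection ratio plotted in Fig. 12 can be sketched as follows (a minimal Python illustration; the function name is ours, and we assume proposals are identified by index and that the ratio is normalized by the size of the current search region, which the paper leaves implicit):

```python
def region_intersection_ratio(prev_proposals, curr_proposals):
    """Fraction of the current search region's proposals that were
    also in the previous iteration's search region."""
    prev_set, curr_set = set(prev_proposals), set(curr_proposals)
    if not curr_set:
        return 0.0
    return len(prev_set & curr_set) / len(curr_set)

# Early iterations: the search region changes actively (low overlap).
print(region_intersection_ratio([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
# Later iterations: the region stabilizes (high overlap).
print(region_intersection_ratio([3, 4, 5, 6], [3, 4, 5, 7]))  # 0.75
```

A ratio near 0.5 thus corresponds to the active-search phase observed in iterations 2-5, while values above 0.7 signal convergence.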

Note that the object re-localization behavior of our method is influenced by two parameters at each round: (1) the proportion r of proposals selected in a searching region l, and (2) the proportion k of top-ranking proposals that are finally selected as instances in a bag. Hence the final number of selected instances per bag is n = r × k × |l| (|l| is the size of searching region l in the bag B). Fig. 13 gives the AP(%) score as a function of various r values and k values on both LSP/MPII-MPHB (left) and Pascal VOC 2007 (right). Both figures show similar behaviors under different settings. In particular, one can see that the system's performance drops as r decreases, indicating the importance of setting a large enough candidate set. On the other hand, the influence of the k value is more complicated; it has a close relationship with the IoP value (cf. Fig. 9). Overall, the optimal value of r is around 95.0–100.0%, and the optimal value of k is around 35.0–40.0% on both data sets evaluated.
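The instance-count formula n = r × k × |l| is simple arithmetic; a short sketch makes the interplay of the two proportions concrete (Python rather than the paper's Matlab; the function name is ours):

```python
def num_selected_instances(r, k, region_size):
    """Instances kept per bag: n = r * k * |l|, where r is the
    proportion of proposals selected in search region l and k is
    the proportion of top-ranking proposals kept from those."""
    return int(r * k * region_size)

# With the reported optima (r around 0.95-1.0, k around 0.35-0.40)
# and a search region of 1,500 proposals:
print(num_selected_instances(1.0, 0.40, 1500))   # 600
print(num_selected_instances(0.95, 0.35, 1500))  # 498
```

At the optimal settings, roughly a third of the candidate region therefore survives into each bag.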

6.2.3. Contributions of the individual stages

To illustrate the contributions of the individual stages of the proposed method (i.e., the supervised regularizer, the Max-Pooling-based combination strategy, and the semi-random positive-instance refinement), we conduct a series of experiments on three datasets by removing each of the three main stages in turn while leaving the remaining stages in place (the comparison is thus against our full method). Fig. 14 gives the results. In general, each stage is beneficial and (not shown) the gains are cumulative over the stages, but the supervised regularizer appears to contribute most to the performance improvement. Specifically, if we remove this stage, the performance drops significantly by over 14.0%, from 35.37% to 21.26%, on the LSP/MPII-MPHB dataset. Similarly, if only the semi-random positive-instance refinement stage is removed, we suffer a nearly 8.0% performance loss, while replacing the Noisy-OR combination function with the Max-Pooling function yields about a 5.0% performance improvement on the same data set. Although not as evident as on LSP/MPII-MPHB, similar observations can be made on the Pascal VOC 2007 and VOC 2010 data sets; see Fig. 14 for details.

6.2.4. Runtime performance evaluation

We give a brief runtime performance analysis of the proposed method. A given test image undergoes three stages before a result is obtained: proposal generation, feature extraction, and score evaluation. The sizes of input images range from 480×816×3 to 1080×5760×3. The selective search algorithm takes about 4.9 s to generate 1,500 proposals per image on our laptop, and it takes about another 0.4 s to extract CNN features per proposal. Finally, the instance-level classifier needs about 0.06 s to evaluate each of these proposals and determine whether there exist human beings in that image. One can see that most of the test time is spent on feature extraction, which may be replaced by more efficient feature sets (e.g., HoG or SIFT) if needed. Note that our implementation is based on the Matlab platform without any code optimization; the actual runtime efficiency could be further improved by using lower-level programming languages such as C and by applying code optimization techniques.

6.3. Experiments on Pascal VOC

In the previous sections, we have demonstrated the effectiveness of the proposed method in detecting humans under difficult conditions, but nothing prevents us from applying it to more general object detection tasks beyond human beings. For this we conduct a new series of experiments on Pascal VOC 2007 [9] and Pascal VOC 2010 [11], respectively. The Pascal VOC 2007 data set has been widely adopted to evaluate the performance of weakly supervised learning algorithms, although relatively few WSL methods have been tested on Pascal VOC 2010.

Tables 6 and 7 give the detection performance for all 20

classes on the test sets of the two datasets, respectively. One can see that our method achieves the best overall performance consistently on both datasets: 32.8% mAP (mean AP over all classes) on Pascal VOC 2007 and 29.8% mAP on Pascal VOC 2010. The AP scores of our method on the person class are 26.3% and 27.2% on the two Pascal VOC datasets, respectively, which are the best among the compared methods and are consistent with the results on LSP/MPII-MPHB. Besides the person class, our method also yields the best performance on several other classes, such as cat, plant, chair, and bus. Fig. 15 shows some detection results of four WSL detectors on Pascal VOC 2007. One can see that the standard MILBoost [24] misses the targets, especially on objects with cluttered backgrounds, such as those in the bottle class, or with fewer textures, such as those in the sofa class. By contrast, the multi-fold MIL method (MMIL [8]) and our SWSD method work much better than MILBoost, but as shown in the dog and bottle examples, the MMIL method can somewhat over-detect the objects of interest. As pointed out in [8], this is probably attributable to the fact that most WSL methods tend to find the most repeatable structure in the positive training images, while our method resolves this issue to some extent by using a small amount of supervised information to regularize the behavior of the learned model.

7. Conclusion

In this paper, we proposed a novel Selective Weakly Supervised Detection (SWSD) method for human detection under arbitrary poses. It is a modified MIL method in which the classical Noisy-OR bagging model is replaced with the more robust Max-Pooling function. In addition, a few fully supervised samples are incorporated into the model, serving for both initialization and regularization. To prevent the training procedure from prematurely locking onto erroneous object locations and to improve the robustness of the resulting detector, we proposed a constrained elites selection method that injects diversity at the instance level in the early stage of training while helping the detector focus on the most suitable instances at later stages. Additionally, we annotated a new large-scale data set called LSP/MPII-MPHB (Multiple Poses Human Body) for human detection under arbitrary poses.

We evaluated our approach and compared it to state-of-the-art methods using both the new MPHB data set and the VOC 2007/2010 datasets, showing that our method is able to progressively localize down to the object of interest even when the subject is partially occluded and very small. We also gave an in-depth empirical study of some aspects of the general behavior of WSL algorithms, through which we identified several crucial factors that may significantly influence the performance of a WSL detector, such as the usefulness of a small amount of supervision information and the need for a relatively high RoP (Ratio of Positive Instances). In our opinion, it is important to take these factors into consideration when designing any practical application, although they are less studied in the previous literature. Currently, we are working on alternative methods to improve the robustness of the instance-level model and to incorporate various side information into the model.

Acknowledgments

This work is partially supported by the National Science Foundation of China (61373060, 61672280) and the Qing Lan Project.

References

[1] M. Enzweiler, D.M. Gavrila, Monocular pedestrian detection: survey and experiments, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2009) 2179–2195.

[2] P. Peng, Y. Tian, Y. Wang, J. Li, T. Huang, Robust multiple cameras pedestrian detection with multi-view Bayesian network, Pattern Recognit. 48 (5) (2015) 1760–1772.

[3] Z. Zhang, W. Tao, K. Sun, W. Hu, L. Yao, Pedestrian detection aided by fusion of binocular information, Pattern Recognit. 60 (2016) 227–238.

[4] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2D human pose estimation: new benchmark and state of the art analysis, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Columbus, Ohio, 2014, pp. 3686–3693.

[5] Y. Zhou, X. Bai, W. Liu, L.J. Latecki, Similarity fusion for visual tracking, Int. J. Comput. Vis. 118 (3) (2016) 1–27.

[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 142–158.

[7] T.G. Dietterich, R.H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artif. Intell. 89 (1) (1997) 31–71.

[8] R.G. Cinbis, J. Verbeek, C. Schmid, Weakly supervised object localization with multi-fold multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2015) 1.

[9] M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, The Pascal Visual Object Classes Challenge 2007 (VOC 2007) Results, 2008.

[10] J.D. Keeler, D.E. Rumelhart, W.K. Leow, Integrated segmentation and recognition of hand-printed numerals, in: Conference on Advances in Neural Information Processing Systems, Denver, Colorado, USA, 1990, pp. 557–563.

[11] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.

[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.

[13] S. Johnson, M. Everingham, Clustered pose and nonlinear appearance models for human pose estimation, in: BMVC, Aberystwyth, vol. 2, no. 4, 2010, p. 5.

[14] Y. Cai, X. Tan, Weakly supervised human body detection under arbitrary poses, in: International Conference on Image Processing, IEEE, Phoenix, Arizona, USA, 2016.

[15] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.

[16] M. Pandey, S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, Barcelona, Spain, 2011, pp. 1307–1314.

[17] P. Siva, T. Xiang, Weakly supervised object detector learning with model drift detection, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, Barcelona, Spain, 2011, pp. 343–350.

[18] O. Russakovsky, Y. Lin, K. Yu, L. Fei-Fei, Object-centric spatial pooling for image classification, in: Computer Vision—ECCV, Springer, Florence, 2012, pp. 1–15.

[19] H.O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, T. Darrell, On learning to localize objects with minimal supervision, in: Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 2014, pp. 1611–1619.

[20] H.O. Song, Y.J. Lee, S. Jegelka, T. Darrell, Weakly-supervised discovery of visual pattern configurations, in: Advances in Neural Information Processing Systems, Montreal, Quebec, Canada, 2014, pp. 1637–1645.

[21] C. Wang, W. Ren, K. Huang, T. Tan, Weakly supervised object localization with latent category learning, in: Computer Vision—ECCV 2014, Springer, Zurich, 2014, pp. 431–445.

[22] H. Bilen, M. Pedersoli, T. Tuytelaars, Weakly supervised object detection with posterior regularization, in: British Machine Vision Conference, Nottingham, 2014.

[23] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632.

[24] P.A. Viola, J.C. Platt, C. Zhang, Multiple instance boosting for object detection, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2005, pp. 1417–1424.

[25] S. Ray, M. Craven, Supervised versus multiple instance learning: an empirical comparison, in: Proceedings of the 22nd International Conference on Machine Learning, ACM, Bonn, Germany, 2005, pp. 697–704.

[26] V.E. Lee, N. Ruan, R. Jin, C. Aggarwal, A survey of algorithms for dense subgraph discovery, in: Managing and Mining Graph Data, Springer, 2010, pp. 303–336.

[27] J.R. Uijlings, K.E. van de Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.

[28] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, San Diego, 2015.

[29] Y. Yang, D. Ramanan, Articulated human detection with flexible mixtures of parts, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2878–2890.

Yawei Cai is an M.S. candidate at Nanjing University of Aeronautics and Astronautics. His research interests are machine learning, pattern recognition, and computer vision.

Xiaosong Tan is a Ph.D. student at Nanjing University of Aeronautics and Astronautics. His research interests are machine learning, pattern recognition, and computer vision.

Xiaoyang Tan received a Ph.D. degree from the Department of Computer Science and Technology of Nanjing University, China, in 2005. From 2006 to 2007, he worked as a postdoctoral researcher in the LEAR team at INRIA, France. His research interests are in face recognition, machine learning, pattern recognition, and computer vision.
