
Recognizing Context for Human-Robot Interaction
Final Report for Project 4-21-71∗

Hatem Alismail
Computer Science Department

Carnegie Mellon University
[email protected]

Advisors: Dr. Brett Browning and Dr. Majd Sakr

Computer Science Department
Carnegie Mellon University
{brettb,msakr}@cs.cmu.edu

Abstract

Robots and intelligent agents are becoming more important every day, and Human-Robot Interaction (HRI) plays an important role in developing algorithms that put intelligent agents to work for us. One such example is the Roboceptionist project [1] developed at Carnegie Mellon University: a robot receptionist that performs a receptionist's job in a socially appropriate manner. Such socially aware robots need the ability to detect humans in the surrounding environment so that their behavior can be adjusted accordingly. In this paper, we describe a general object detection system that is tested on people detection, face detection, and general object detection tasks. The system is based on the Bag of Features (BoF) approach, in which images are encoded as normalized histograms of feature occurrences extracted using a dictionary of visual words: the count in bucket k of the histogram corresponds to the number of features assigned to cluster k of the dictionary. The dictionary is built using a clustering algorithm that groups similar image features into visual words. Among the experiments reported in this paper, we find that a large dictionary (10K words), which generates sparse histograms, has higher discriminative potential than smaller dictionary sizes (e.g., 500). We also evaluate SIFT (Scale Invariant Feature Transform) features detected by the Difference of Gaussian (DoG) detector against features detected by Maximally Stable Extremal Regions (MSER) on people detection, and show that MSER detects more stable features than the DoG detector, especially in low-resolution images. Finally, we discuss future work on real-time implementation from video frames and on increasing the system's robustness using different detectors or combinations of detectors.

∗Best viewed in color


1 Introduction

A robust people detection algorithm is a key component of effective Human-Robot Interaction (HRI). One interesting application for a people detection algorithm is the Robot Receptionist, or Roboceptionist [1]. The Roboceptionist project was developed at Carnegie Mellon University to be a useful helpmate that interacts with visitors in a compelling way; it gives information about Carnegie Mellon as well as directions around campus and to faculty and staff offices. Beyond the complex social interactions the robot has to handle, it needs to be able to engage people in its surroundings, and a robust people detection algorithm is certainly helpful in such a domain. However, many challenges make people detection harder than detecting other objects. The essence of the problem is that humans are non-rigid objects: they come in various shapes and sizes and wear different clothing. The high intra-class variability incurred by these factors complicates the problem of people detection. Another challenge is the lack of standard datasets of full-body shots, which makes the design of people detection algorithms harder, as they must cope with scarce training data.

In this work, we describe a trainable people detector using the Bag of Features (BoF) approach. BoF is a well-established technique in the fields of Information Retrieval (IR) and document classification, where it is known as the Bag of Words (BoW) model [2]. The technique has recently been employed in computer vision applications and is showing promising results in object detection and classification [3]. The main idea of the BoF approach is to transform images into frequency counts of "visual word" occurrences, exploiting the discriminative visual characteristics located on the object. The process starts by building a bag, or dictionary, of image features found in training images, forming clusters that define the visual words. Once the dictionary is established, we extract histograms from the positive training images and label them as positive to indicate the presence of the object of interest; likewise, we extract histograms from negative training images and label them as negative to indicate its absence. Finally, we use a binary classifier to learn a model that discriminates between positive and negative histograms on novel testing data. However, differences between documents and images must be taken into account for a successful system, as described next.

The continuous nature of visual words, as opposed to language words, requires modifying the approach to fit computer vision applications. For example, there is no ambiguity in finding matches for the word "car", since the exact match is easy to find. The case is different in computer vision: although many objects look the same, or belong to the same object class when judged by a human eye, the underlying computer representations are often significantly different. There is no uniform look for a car; cars come in different shapes and models. Further, a photo of the same car might be represented differently under changes in illumination, viewpoint, scale, image blur, and/or occlusion. Hence, the conditions for constructing and matching visual words need to be relaxed.

Since visual words, and images for that matter, are represented as vectors of integer or real numbers, an intuitive criterion for deciding whether two visual words are the same is a distance metric (e.g., Euclidean distance, Mahalanobis distance). In this work, the Euclidean distance metric is used to group similar visual words together.


Unfortunately, the process gets harder when deciding on an acceptance threshold, i.e., the tolerance within which two visual words are considered the same. Finding a good threshold is not easy: a very high tolerance might treat different visual words as identical, while a very low tolerance might fail to capture similar patterns. The problem is further complicated by the fact that the vectors representing visual words are high-dimensional, so a naive comparison against all possible vectors in the dataset is computationally very expensive.

Fortunately, the BoF approach provides an efficient framework for grouping similar features based on a distance metric. The bag, or dictionary, is built with a specified number of clusters from the training data using a clustering algorithm (e.g., K-means). The size of the dictionary corresponds to the number of clusters specified by the user, which also serves as an indirect acceptance threshold: the larger the dictionary, the finer the threshold. The only computationally expensive part of the approach is clustering the data to create the dictionary, as current clustering algorithms are not known to be fast. Once the dictionary is created, extracting histograms from the visual words found on the object is simply a matter of comparing each visual word to the clusters in the dictionary and associating it with the closest one. A more detailed description of the approach is given in Section 3.
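As a concrete illustration, the following is a minimal sketch of this nearest-cluster assignment in Python with NumPy/SciPy. The function name, array shapes, and random stand-in data are assumptions made for the sake of a self-contained example, not code from this project.

import numpy as np
from scipy.spatial.distance import cdist

def bof_histogram(descriptors, centers):
    """Assign each descriptor to its nearest dictionary cluster (Euclidean
    distance) and return an L2-normalized histogram of cluster counts."""
    nearest = cdist(descriptors, centers).argmin(axis=1)  # closest visual word
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    norm = np.linalg.norm(hist)                           # L2 normalization
    return hist / norm if norm > 0 else hist

# Usage with stand-in data: 300 SIFT descriptors against a 500-word dictionary.
rng = np.random.default_rng(0)
centers = rng.random((500, 128))         # would come from k-means in practice
descriptors = rng.random((300, 128))     # would come from a feature detector
h = bof_histogram(descriptors, centers)  # shape (500,), unit L2 norm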

The outline of this paper is as follows. Section 2 (Background) discusses background information and provides more insight into the BoF approach. Section 3 (Our Approach) describes the system in detail. Section 4 (Experiments and Results) explains our experimental methodology and real-life applications, and reports the results. Finally, Section 5 (Conclusions and Future Work) concludes the paper and points out possible future work and improvements.

2 Background

This section discusses the relevant aspects of an effective implementation of the BoF. We start by describing feature (keypoint) detectors and descriptors, then the dictionary of visual features and the binary classifier.

2.1 Keypoint Detectors and Descriptors

The main concern in image processing and recognition tasks is extracting useful information from images. The problem is especially hard since there is no clear definition of what counts as useful information. Further, photometric and geometric changes impose additional challenges for any computer vision algorithm and complicate the definition even more. Hence, many algorithms have been developed to extract useful information from images. Some extract very basic information, such as edges or blobs in a given image; others are more elaborate and designed to detect more complex shapes. Such information is better described as image features, or keypoints. An image feature is a pixel or group of pixels considered to be of particular importance. The algorithms used to detect regions containing potential keypoints are called "keypoint detectors". Once a keypoint detector finds an interesting region, an encoding is needed to describe the keypoint; the algorithms that describe keypoints are called "keypoint descriptors".


Several keypoint detectors and descriptors have been developed, each focused on a certain type of invariance. Popular detectors include scale-space extrema found with the Difference of Gaussian (DoG), the Laplacian of Gaussian (LoG), Harris-Laplace, Harris-Affine [4], Hessian-Laplace, Hessian-Affine [4], MSER (Maximally Stable Extremal Regions) [5], IBR (Intensity extrema-Based Region detector), and EBR (Edge-Based Region detector) [6]. Mikolajczyk et al. provide a thorough performance evaluation of keypoint detectors [7], ranking MSER as the best affine-invariant detector; among scale-invariant detectors, the best results are reported for Harris-Laplace, Hessian-Laplace, LoG, and DoG. Similarly for keypoint descriptors, Mikolajczyk et al. provide a thorough performance evaluation of image descriptors [8]. They compare distribution-based descriptors, such as SIFT, PCA-SIFT [9], and GLOH (Gradient Location Orientation Histogram) [8]; differential descriptors, such as steerable filters [10] and complex filters [11]; and other techniques, such as moment invariants [12]. They conclude that SIFT and SIFT-based descriptors, such as PCA-SIFT and GLOH, perform best. In our approach we use the SIFT descriptor to describe input images on regions detected by DoG and MSER. Next, we describe the four main steps of generating the SIFT descriptor from an image using a DoG detector [13, 14], as well as the intuition behind MSER regions.

2.1.1 SIFT features

In this section, we summarize the four main steps involved in extracting SIFT features from an input image.

• Scale-space extrema detection: The first step in computing the SIFT descriptor is to find scale-invariant features. This is done by applying the Difference of Gaussian over a pyramid of scales according to the following equation (a small code sketch follows this list):

D(x, y, σ) = L(x, y, kσ)− L(x, y, σ)

where L(x, y, σ) = G(x, y, σ) ∗ I(x, y) is the convolution of a zero-mean Gaussian kernel of variance σ² with the image at pixel (x, y).

• Keypoint localization: In this step, the previously generated keypoints are filtered by rejecting low-contrast keypoints and edge-like (non-corner) responses.

• Orientation assignment: Each keypoint is assigned an orientation, which achieves rotation invariance. This is done by computing the gradient magnitude and orientation on the Gaussian-smoothed image L according to the equations in Figure 1.

After precomputing the m and θ values around each keypoint, a histogram of gradient orientations weighted by gradient magnitude is built to select the keypoint's orientation. Peaks within 80% of the highest peak are retained for the final step, and each selected orientation is refined by fitting a parabola to the three closest histogram values.

• Keypoint descriptor: Finally, the histogram generated in the previous step, rotated according to the assigned orientation and smoothed by a Gaussian of scale 1.5, is used to describe the keypoint in an invariant manner.
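The sketch below illustrates the scale-space step above in Python with SciPy. The values σ = 1.6 and k = √2 are commonly used defaults for DoG pyramids, not parameters specified in this report, and the synthetic test image is a stand-in.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma=1.6, k=np.sqrt(2), levels=4):
    """One octave of the Difference-of-Gaussian scale space:
    D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    L = [gaussian_filter(image.astype(float), sigma * k ** i)
         for i in range(levels + 1)]
    return [L[i + 1] - L[i] for i in range(levels)]

# Usage on a synthetic image; candidate keypoints are the pixels that are
# extrema among their 26 neighbors across adjacent DoG levels (not shown).
img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0
D = dog_stack(img)  # four DoG levels of one octave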


∇I = (∂I/∂x, ∂I/∂y) ≈ g = ( L(x+1, y) − L(x−1, y),  L(x, y+1) − L(x, y−1) )

m = |∇I| ≈ √(gx² + gy²)

θ = tan⁻¹(gy/gx)

Figure 1: Orientation assignment for the SIFT descriptor
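A minimal sketch of these equations in Python/NumPy follows. The patch radius and the 36-bin quantization are illustrative assumptions (36 bins is the value commonly attributed to Lowe's method), not settings taken from this report, and the keypoint is assumed to lie in the image interior.

import numpy as np

def orientation_histogram(L, x, y, radius=8, bins=36):
    """Accumulate gradient orientations around (x, y) into a histogram,
    weighting each sample by its gradient magnitude."""
    p = L[y - radius:y + radius + 2, x - radius:x + radius + 2].astype(float)
    gx = p[1:-1, 2:] - p[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    gy = p[2:, 1:-1] - p[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(gx ** 2 + gy ** 2)    # gradient magnitude
    theta = np.arctan2(gy, gx)        # gradient orientation in (-pi, pi]
    idx = ((theta + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    return np.bincount(idx.ravel(), weights=m.ravel(), minlength=bins)

# Usage on stand-in data: the dominant bin gives the keypoint orientation.
rng = np.random.default_rng(0)
hist = orientation_histogram(rng.random((64, 64)), 32, 32)
dominant_deg = hist.argmax() * (360 // 36)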

2.1.2 MSER regions

Maximally Stable Extremal Regions (MSER) [5] is an algorithm originally developed to establish correspondences between a pair of images taken from different viewpoints, a problem known as wide-baseline stereo matching. The set of MSER regions found in an image has two important properties. First, the set is closed under continuous and projective image transformations: regions detected by MSER are affine invariant, so warping or skewing the image has no effect on the regions. Second, the set is closed under monotonic image transformations: photometric changes, such as indoor versus outdoor lighting, do not affect the regions. MSER regions have been used in many stereo matching applications and have been modified in many ways while achieving good results; modifications and applications of the MSER detector can be found in [15] and [16].

According to [5], an image I is a mapping I : D ⊂ Z² → S. Extremal regions are well defined on an image if:

1. S is totally ordered, i.e., a reflexive, antisymmetric, and transitive binary relation ≤ exists on S. For grayscale images, S = {0, 1, . . . , 255}. (There are extensions of MSER to color images that replace the intensity thresholding described next with agglomerative clustering based on the color gradient [17].)

2. A neighborhood relation A ⊂ D × D is defined, typically 4-connected neighborhoods.

A region Q is defined as a contiguous subset of D, and the region boundary δQ is the set of pixels adjacent to Q but not belonging to it. An Extremal Region (ER) is a region such that for all p ∈ Q and all q ∈ δQ, either I(p) > I(q) (a maximum-intensity region) or I(p) < I(q) (a minimum-intensity region), where I is the intensity function. This leads to the definition of a Maximally Stable Extremal Region (MSER): a region whose relative area change, as the intensity threshold varies, attains a local minimum within a sequence of nested extremal regions.

Another advantage of the MSER algorithm is that the regions are efficient to compute, with a running time of O(n log log n), which is almost linear. The first step in detecting MSER regions is sorting pixels by intensity, which can be done in O(n) using BINSORT. After that, contiguous regions are found efficiently using suitable data structures and a UNION-FIND algorithm. For a more detailed explanation, see [5].
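As a hedged illustration (the experiments in this report use the VGG binaries, not this code), the sketch below detects MSER regions with OpenCV's implementation; the file name is a placeholder, and the parameter values shown are OpenCV's defaults rather than the settings used later in Section 4.6.

import cv2

# Positional arguments: delta, min_area, max_area (OpenCV's defaults).
mser = cv2.MSER_create(5, 60, 14400)
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical image
regions, bboxes = mser.detectRegions(gray)
# Each region is the pixel list of one maximally stable extremal region;
# fitting an ellipse to it yields a frame for computing a SIFT descriptor.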


The next section provides insight into the dictionary of features and the importance of the dictionary's size.

2.2 Dictionary of Visual Features

The main ingredient of an effective BoF is the size of the dictionary with respect to the problem domain. Due to the continuous domain of visual words, and the differences in the density and distribution of features across different types of objects, determining the dictionary size is not easy. The dictionary is built by clustering a database of image features so that image feature histograms can be created: bucket k in the histogram contains the number of features in the image that map to cluster k in the dictionary. The size of the dictionary must therefore be chosen carefully. A small dictionary loses discriminative power because several distinguishable features may be merged into one class; conversely, a very large dictionary may overfit the data, as each keypoint may map to a unique cluster. In both cases, the dictionary loses its discriminative power, and there is no consensus on an optimal size for general object detection and classification. According to [3], Lazebnik et al. [18] used 199-400 visual words, Zhang et al. [19] used 1000, and Sivic et al. [16] used 6000-10000 visual words to build their respective dictionaries.

In our experiments, we evaluate the BoF accuracy using two dictionary sizes: one of size 10K, built with vocabulary trees (hierarchical K-means), and one of size 500, built with flat K-means. The next subsection completes the background by discussing the choice of binary classifier for the BoF approach.
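A minimal sketch of the flat 500-word case using scikit-learn's mini-batch k-means is given below; the random stand-in descriptors and parameter choices are assumptions. A 10K vocabulary tree would instead apply k-means recursively with a small branching factor [26].

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
descriptors = rng.random((50_000, 128))  # stand-in for pooled SIFT descriptors

# Flat k-means with 500 clusters; the cluster centers are the visual words.
kmeans = MiniBatchKMeans(n_clusters=500, n_init=3, random_state=0)
kmeans.fit(descriptors)
dictionary = kmeans.cluster_centers_     # shape (500, 128)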

2.3 Binary Classifier

The third important factor in developing a robust BoF is the choice of classifier. In our work, we use the SVM binary classifier [20], as it has shown good performance and is cited in much of the classification and pattern recognition literature [21]. The basic principle of SVM learning is that, given a training set of pairs (xᵢ, yᵢ), i = 1, . . . , L, with xᵢ ∈ Rⁿ and yᵢ ∈ {−1, 1}, the SVM solves the following optimization problem [22]:

min_{w, b, ξ}  ½ wᵀw + C ∑_{i=1}^{L} ξᵢ        (1)

subject to  yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0.

The function φ maps the training vectors xᵢ into a higher-dimensional (possibly infinite-dimensional) space, where the SVM seeks a separating hyperplane with maximal margin. Further, the kernel function K(xᵢ, xⱼ) ≡ φ(xᵢ)ᵀφ(xⱼ) is an important factor in achieving the best possible separating hyperplane on the given data, and the choice of kernel depends highly on the characteristics of the problem. The most commonly used kernels are:

• Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ.

• Polynomial: K(xᵢ, xⱼ) = (γ xᵢᵀxⱼ + r)ᵈ, γ > 0.

• Radial Basis Function (RBF): K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²), γ > 0.

• Sigmoid: K(xᵢ, xⱼ) = tanh(γ xᵢᵀxⱼ + r).

where γ, r, and d are kernel parameters.

The RBF kernel can handle non-linear relations between class labels and attributes, as it maps samples non-linearly into a high-dimensional space [22]. The linear kernel, however, is handy when the number of training instances is significantly smaller or larger than the number of features in the training vectors, or when the training dataset is very large [22].

In our experiments, we try both the RBF and linear kernels; the results in Section 4, however, pertain to the linear kernel due to its better robustness to overfitting.
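A minimal sketch of this classifier stage follows, using scikit-learn rather than libsvm (which the experiments actually use); the stand-in histogram data and parameter values are assumptions. Scaling before fitting mirrors the libsvm guide's recommendation cited above.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((160, 500))          # stand-in BoF histograms
y_train = rng.choice([-1, 1], size=160)   # +1 object present, -1 absent

# Scale to [-1, 1] first, then fit a linear-kernel SVM; swap
# kernel="rbf" (with a gamma value) to compare against the RBF kernel.
clf = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),
                    SVC(kernel="linear", C=1.0))
clf.fit(X_train, y_train)
print(clf.predict(rng.random((5, 500))))  # -1/+1 decisions on novel data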

3 Our Approach

Our approach consists of two main parts, with four basic steps in each: building and training the system, and testing it (see Figure 2). The first step in our approach, as in almost all classification problems, is collecting a database of images. We split the database into two sets, one for training and building the system and the other for use as novel testing data, and each of these is further split into positive and negative images. A positive (positively labeled) image contains the object of interest, while a negative image lacks it. The data used to build the dictionary might differ from the data used to train the classifier, so we use the term "dictionary data" for the set of images used to build the dictionary. From the dictionary data we extract SIFT descriptors¹, which are then clustered by a clustering algorithm to create the dictionary clusters. Although we use SIFT features in this work, other choices of image features are also viable; moreover, we use different region detectors throughout the experiments depending on the problem domain.

The next step is to extract normalized histograms of feature counts from the training images. Unlike the dictionary data, here we need to extract the features located on the object rather than all features in the image; this is accomplished by manually labeling the object of interest in each positive training image. Once features from the training dataset are extracted, we use the dictionary to generate histograms describing them, as shown in Algorithm 1: the count of bucket k in the histogram is the number of features in the image mapped to cluster k in the dictionary. We normalize the histograms using an L2 norm. Once we obtain histograms from the positive and negative training files, we use them as inputs to a binary classifier. We use the SVM binary classifier [20] for its impressive results and discriminative power in classification problems; specifically, we use Chang's and Lin's implementation of SVM (libsvm [23]) and follow their recommendations for building the model, which include scaling the training and testing data beforehand [22].

Once the SVM model is built, we test its performance by extracting normalized histograms from the unseen testing images to build SVM files analogous to the ones built for training.

¹We use the SIFT descriptor implementation by Rob Hess, online at http://web.engr.oregonstate.edu/~hess/index.html


Figure 2: An overview of our approach

Building the system:
1: Collect a dataset of images
2: Generate SIFT features
3: Build a dictionary of features
4: Extract SVM training and testing files

Testing the system:
5: Generate SIFT features from an unseen image
6: Generate a normalized histogram from the dictionary
7: Scale according to SVM
8: Binary output: {1} for object found, {−1} otherwise

There are two main tests to evaluate the system's accuracy. The first is the SVM cross-validation accuracy on the training data, which is often a good indication of the classifier's future success. The second, which is our main interest, tests the system on novel data. It is worth mentioning that the BoF approach does not localize the object; it is only capable of declaring the existence or absence of the object in an input image. Nevertheless, we implement object localization using a sliding window whenever possible: a variable-sized window is slid over all regions of the image, and the image patch covered by the window is tested for the existence or absence of the object. A window that covers more than 50% of the object is counted as a true detection/localization.
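A minimal sketch of this sliding window test follows; it reuses bof_histogram from the sketch in the introduction, and the window size, stride, and minimum feature count are illustrative assumptions (the feature count filter anticipates the threshold test of Section 4.4).

import numpy as np

def sliding_window_detect(features, centers, classify, img_w, img_h,
                          win=(96, 192), stride=16, min_features=30):
    """Slide a window over the image and classify the features inside it.
    `features` is a list of (x, y, descriptor) tuples; `classify` maps an
    L2-normalized histogram to +1 (object present) or -1 (absent)."""
    w, h = win
    detections = []
    for x0 in range(0, img_w - w + 1, stride):
        for y0 in range(0, img_h - h + 1, stride):
            inside = [d for (x, y, d) in features
                      if x0 <= x < x0 + w and y0 <= y < y0 + h]
            if len(inside) < min_features:   # skip feature-sparse patches
                continue
            hist = bof_histogram(np.array(inside), centers)
            if classify(hist) == 1:
                detections.append((x0, y0, w, h))
    return detections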

In the next section (Experiments and Results), we describe three experiments using the approach. The first, on general object detection, examines the effect of training data and dictionary data selection on classification accuracy. The second applies the algorithm to a face detection system using a standard dataset. Finally, the third applies the approach to a full-body people detection system.


Algorithm 1 Generating normalized histograms from object features using the dictionary
Require: a vector of features found in the image
Ensure: an L2-normalized histogram
1: histogram := new vector[numClasses]
2: for all f in the input feature vector do
3:   class := findClassFromDict(f)
4:   histogram[class]++
5: end for
6: normalize(histogram)

Algorithm 2 An overview of the approach
1: Collect data for positive and negative training and testing datasets
2: Manually label the object of interest in the positive files
3: Extract SIFT features using default parameters
4: Build the dictionary from the extracted features
5: Extract normalized histograms as shown in Algorithm 1
6: Build an SVM model to classify the extracted histograms
7: (Optional) Localize the object by sliding a variable-sized window over the image and testing the areas under the window for the object

4 Experiments & Results

In this section, we discuss the conducted experiments and applications of object detection. In the first experiment, "Dictionary Building", we determine the best data selection method for building a BoF and the best approaches to training an SVM for histogram classification. The second experiment, "Face Detection System", deploys these results in a face detection system on a standard dataset; here we also examine the effect of dictionary size on classification accuracy. Finally, in the last experiment, "People Detection System", we demonstrate the accuracy of the approach on the harder problem of people detection.

4.1 Dictionary Building

We apply the approach described in Section 3 to general object detection to test the effect of training data selection on classification accuracy. The object of interest is a photograph of a "butterfly", shown in Figure 3. Table 1 lists the number of files in the dataset, and Figure 4 shows a snapshot of it. The dataset was collected manually using a commercial digital camera, and all images are of size 640x480. The dataset is labeled manually as shown in Figure 5: the red regions are the positively labeled regions, corresponding to the object of interest, while the blue regions are labeled negative and used as negative training examples. We observe that using the background of the positive images as negative training examples yields a significant accuracy improvement in SVM classification. In this experiment we train an SVM using a Gaussian RBF kernel.

Given the textured nature of the object of interest, in this case the "butterfly", we use the DoG detector.


Figure 3: The butterfly image used in our dataset, obtained from: http://etc.usf.edu/clipart/3200/3236/butterfly_3.htm

Table 1: A summary of the dataset used in the experiment

                 Training   Testing
Positive label      109        51
Negative label       51        38

However, the choice of region detector is problem dependent and should follow the object of interest. For the sake of completeness, we briefly describe all the experiments carried out to find the best way of building the dictionary, as they affect decisions made in later experiments. The results of all these experiments are given in Table 2.

4.2 Butterfly Experiment 1 (BExp 1)

In the first experiment we use a non-rich dictionary built from the training data alone, without the addition of extra features. The SVM classifier correctly classified all images in the dataset; however, the sliding window test frequently failed to locate the object.

Table 2: Results for the butterfly experiments (reporting the accuracy of object localization)

                  BExp 1         BExp 2         BExp 3
                  +      −       +      −       +      −
Real positive   100%     0%    100%     0%    100%     0%
Real negative    50%    50%     24%    76%      0%   100%


(a) Positive training set sample (b) Negative training set sample

Figure 4: A sample from the “butterfly” dataset

Figure 5: The labeling process. Red highlights the positive region (the object), while blue highlights the negative region.

4.3 Butterfly Experiment 2 (BExp 2)

In the second experiment we change the negative training dataset for the classifier while keeping the same dictionary data. The new labeling process is illustrated in Figure 5. Using the background as negative training samples for the SVM significantly improved the localization accuracy: approximately half of the false positive detections were eliminated.

4.4 Butterfly Experiment 3 (BExp 3)

In the final experiment, we completely eliminate false positive detections and localizations by applying a threshold-based filter on the number of features tested by the window. We noticed that the SVM fails to classify histograms computed on very feature-sparse areas: the average number of features on the butterfly was 485, while the average number on the false positive areas was 31. A threshold of 227, found experimentally, achieved 100% accuracy, as shown in Table 2.


Figure 6: A snapshot of the training database for the face detection system. Top: positive training files. Bottom: negative training files. The red lines show the contours of the manual labeling.

4.5 Face Detection System

In this experiment, we apply the same approach to a harder problem, face detection and localization, and examine the effect of dictionary size on classification accuracy. Larger dictionaries generate sparser histograms of image features, while smaller ones generate denser histograms. The face detection system is trained on a set of images from the Caltech face database [24] and the PASCAL VOC 2007 database [25]. Table 3 gives an overview of the data used in the experiment, and Figure 6 shows a snapshot of the database. We conduct two experiments on the same dataset, one with a dictionary of size 500 and the other with a dictionary of size 10K. In both, we follow the approach outlined in Algorithm 2 to build the dictionary and train the system.

Table 3: A summary of the dataset used for the face detection experiment

                 Training   Testing
Positive label      216       100
Negative label      151       118

More specifically, we used all images in the Caltech face database, with images 0001-0199 and 0321-0399 as training files and the rest as positive testing files. The negative training files were obtained from various categories of PASCAL VOC 2007, excluding the 'people' category. For the dictionary, we used 596 images collected from the various PASCAL categories, with 65% of the images coming from the people category.

Unlike the experiments on the "butterfly" dataset, the size of the dictionary had a direct impact on classification accuracy. Results from the experiments are presented next.


Figure 7: SVM testing accuracy in the face detection experiment using a dictionary of size 500

4.5.1 Face Detection Using a Dictionary of 500 Classes

A ROC curve of the SVM classification accuracy on the testing data using a dictionary of size 500 is shown in Figure 7, indicating high discriminative potential on labeled data. A sliding window test was also conducted, with results presented in Table 4. Although the true detection results looked promising, the miss detection rate of the sliding window (21%) is quite high.

Table 4: Face detection results using a dictionary of size 500

                  Predicted
             Positive   Negative
Positive        92%        8%
Negative        21%       77%

4.5.2 Face Detection Using a Vocabulary Tree of 10K Classes

Using vocabulary trees as the basis for extracting feature histograms has been shown to be very effective in image retrieval. Although using a vocabulary tree instead of flat K-means showed no statistically significant gain in SVM cross-validation accuracy, the sliding window performance improved dramatically, as shown in Table 5.

We observe that using a larger dictionary to generate sparse histograms yields more accurate results than using smaller dictionaries. The next section presents a harder problem, to which we apply the approach:


Table 5: Face detection results using a vocabulary tree of 10K clusters

                  Predicted
             Positive   Negative
Positive        97%        3%
Negative         5%       95%

people detection. Full-body detection in low-resolution images is significantly harder than face detection, because a low-resolution full-body shot of a human shows a larger degree of variation due to differences in clothing, shape, size, and occlusion. Further, the features common to all people are located on the face, and low resolution makes extracting stable features around the face harder.

4.6 People Detection System

In this experiment, we test our system on the hard problem of detecting full-body humans in low-resolution images (640x480). To overcome some of the difficulties, we use the MSER region detector, which is better suited to detecting blobs in images. The object of interest is a full-body shot of a human; Figure 8 shows a sample of the positive training images. We use the MSER region detector and the SIFT algorithm to generate 128-dimensional vectors as described in Section 2.1. As before, the 128-dimensional SIFT descriptors are clustered to form the bag of features; here we use the hierarchical K-means algorithm [26] to create a dictionary of size 10K.

The extracted histograms are then used to train a binary classifier, in this work a Support Vector Machine (SVM). We tried two well-known kernels, RBF and linear; the former overfit the dataset, so we chose the latter. The SVM is trained using 10-fold cross-validation, and the system is finally tested with unseen images from both the positive and negative classes. We conducted seven experiments, differing in the data used to build the dictionary and in the region detector used; the SIFT descriptor is used throughout.
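A minimal sketch of this model selection step with scikit-learn follows, reusing the stand-in X_train and y_train from the classifier sketch in Section 2.3. With real data, the overfitting of the RBF kernel would show up as a gap between its cross-validation accuracy and its accuracy on novel data.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold cross-validation accuracy for both candidate kernels.
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X_train, y_train, cv=10)
    print(f"{kernel}: {scores.mean():.3f} +/- {scores.std():.3f}")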

The first four experiments use the MSER region detector, while the rest use the Difference of Gaussian (DoG) detector [27]. Throughout this paper, experiment names beginning with "M" denote experiments using the MSER region detector, while "D" denotes ones using the DoG detector. The positive dataset was collected by taking full-body shots of students and staff at Carnegie Mellon University's Qatar campus; the negative dataset was collected partly from the PASCAL dataset and partly from images of the campus interiors. Images are of size 640x480 and were taken with a commercial digital camera. Table 6 summarizes the setup of each experiment. The dictionary size for all experiments is set to 10K clusters, in accordance with the previous experiments. The SIFT feature extraction code is from Vedaldi [28] and was run with image doubling. The MSER region detector was obtained from the Visual Geometry Group (VGG) at the University of Oxford². The MSER parameters were tuned empirically, adjusting them to cover as much as possible of the face in the training images.

²http://www.robots.ox.ac.uk/~vgg/research/affine/


Figure 8: Left: a sample from the positive training files. Right: SIFT descriptors.


Figure 9: Dense MSER regions around the face area

The MSER parameters are as follows: the ellipse scale is 1.00, the maximum relative area is 0.01, the minimum output region size is 10, and the minimum margin is 5. Figure 9 shows a sample of MSER regions with the selected parameters. As for the binary classifier, we use libsvm [23] to train an SVM with a linear kernel.

Results for each experiment are presented in Table 7. The testing files are the same for all experiments and were collected in the same way as the training images.

The best results were obtained in experiment M2, in which the BoF was built from the positive and negative training files plus extra images belonging to neither the training nor the testing files. In general, the experiments using MSER regions produced more accurate results than those using the DoG detector; the blob-like shape of full-body shots of people is likely the reason MSER detects more stable regions.


Table 6: Description of experiments

Experiment   Description                                              # +ve   # -ve
M1           BoF from the positive and negative images in the          184     169
             training data
M2           Like M1, with extra negative images not from the          184     358
             training dataset
M3           Same as M2, but the negative testing data were            184     184
             extracted from the unlabeled background regions of
             the positive training set
M4           Same as M1; negative training files are the same          184     184
             as in M3
D1           Same as M1 (i.e., from the training data)                 184     169
D2           BoF built like M2 (i.e., with extra negative images)      184     358
D3           BoF built like M3                                         184     184

Table 7: Results (CV is the SVM cross-validation accuracy; Acc is the testing accuracy on novel data; both in %)

         M1     M2     M3    M4    D1     D2    D3
CV       89    91.5    91    92    89     92    93
Acc     97.5   100     95    95   92.7    78    70

5 Conclusions & Future Work

In this paper, we presented a trainable people detection system based on the BoF approach. We observe that the choice of region detector plays a key role in the success of the algorithm: different objects have different characteristics and visual identities, so choosing the best region detector is not easy. For people detection in low-resolution images, the MSER region detector outperforms the DoG detector. Further, the size of the dictionary plays an important role, as shown in the face detection experiment (Section 4.5): in general, and given that the object of interest has enough stable features, large dictionaries (e.g., 10K) provide more accurate classification results than smaller ones (e.g., 500).


There are many possibilities for future work worth investigating, such as real-time people detection and localization from video frames. Candidates for a real-time implementation include the use of stereo vision and depth maps, which can greatly reduce the search time by examining only objects within a certain depth from the camera. In addition, color information (e.g., skin color detection [29]) could be used as a filtering step to reduce the search space in the image. Moreover, since the main focus is people detection, a region detector could be designed specifically for this problem, engineered to be more efficient, reducing run time, and/or more robust. Future research will focus on real-time people detection and localization, as well as integrating the system into a version of the Roboceptionist.

Acknowledgments

This report was made possible by an Undergraduate Research Experiences Program (UREP) grant from the Qatar National Research Fund (a member of The Qatar Foundation). The statements made herein are solely the responsibility of the authors.

References

[1] R. Kirby, A. Bruce, J. Forlizzi, M. Michalowski, A. Mundell, S. Rosenthal, B. Sellner, R. Simmons, K. Snipes, A. Schultz, and J. Wang, "Designing robots for long-term social interaction," in IEEE International Conference on Intelligent Robots and Systems, August 2005.

[2] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in ECML '98: Proceedings of the 10th European Conference on Machine Learning. London, UK: Springer-Verlag, 1998, pp. 4–15.

[3] Y.-G. Jiang, C.-W. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," in CIVR, Amsterdam, The Netherlands, 2007.

[4] K. Mikolajczyk and C. Schmid, "Scale and affine invariant interest point detectors," IJCV, vol. 60, no. 1, pp. 63–86, 2004.

[5] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in BMVC, 2002, pp. 384–393.

[6] T. Tuytelaars and L. Van Gool, "Matching widely separated views based on affine invariant regions," IJCV, vol. 59, no. 1, pp. 61–85, 2004.

[7] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A comparison of affine region detectors," IJCV, vol. 65, no. 1/2, pp. 43–72, 2005.

[8] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," PAMI, vol. 27, no. 10, pp. 1615–1630, 2005.


[9] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in Proc. Conf. Computer Vision and Pattern Recognition, 2004, pp. 511–517.

[10] W. T. Freeman and E. H. Adelson, "The design and use of steerable filters," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 9, pp. 891–906, Sept. 1991.

[11] F. Schaffalitzky and A. Zisserman, "Multi-view matching for unordered sets," in Seventh European Conf. Computer Vision, 2002, pp. 414–431.

[12] L. J. Van Gool, T. Moons, and D. Ungureanu, "Affine/photometric invariants for planar intensity patterns," in Fourth European Conf. Computer Vision, 1996, pp. 642–651.

[13] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Seventh Int'l Conf. Computer Vision, 1999, pp. 1150–1157.

[14] D. Lowe, "SIFT image features," in CVonline: On-Line Compendium of Computer Vision, R. Fisher (ed.). Available: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/MURRAY/SIFT.html

[15] P.-E. Forssen and D. Lowe, "Shape descriptors for maximally stable extremal regions," in IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007.

[16] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV, 2003.

[17] P.-E. Forssen, "Maximally stable colour regions for recognition and matching," in IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, USA, June 2007.

[18] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE CVPR, 2006.

[19] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: An in-depth study," INRIA Technical Report RR-5757, 2005.

[20] V. N. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, Inc., 1998.

[21] D. Meyer, F. Leisch, and K. Hornik, "The support vector machine under test," Neurocomputing, vol. 55, no. 1–2, pp. 169–186, 2003.

[22] C.-C. Chang and C.-J. Lin, A Practical Guide to Support Vector Classification, 2007. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[23] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[24] Caltech, “Face database,” http://www.vision.caltech.edu/html-files/archive.html.


[25] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html

[26] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2161–2168.

[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[28] A. Vedaldi, An open implementation of SIFT. Software available at http://vision.ucla.edu/~vedaldi/code/sift/sift.html

[29] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A survey on pixel-based skin color detection techniques," in Graphicon-2003, Moscow, Russia, September 2003, pp. 85–92.
