object classi cation and localization using surf...

Object Classification and Localization Using SURF Descriptors

Drew Schmitt, Nicholas McCoy

December 13, 2011

This paper presents a method for identifying and match-ing objects within an image scene. Recognition of thistype is becoming a promising field within computer visionwith applications in robotics, photography, and security.This technique works by extracting salient features, andmatching these to a database of pre-extracted features toperform a classification. Localization of the classified ob-ject is performed using a hierarchical pyramid structure.The proposed method performs with high accuracy on theCaltech-101 image database, and shows potential to per-form as well as other leading methods.

1 Introduction

There are numerous applications for object recognitionand classification in images. The leading uses of objectclassification are in the fields of robotics, photography,and security. Robots commonly take advantage of objectclassification and localization in order to recognize certainobjects within a scene. Photography and security bothstand to benefit from advancements in facial recognitiontechniques, a subset of object recognition.

Our method first obtains salient features from an inputimage using a robust local feature extractor. The leadingtechniques for such a purpose include the Scale Invari-ant Feature Transform (SIFT) and Speeded Up RobustFeatures (SURF).

After extracting all keypoints and descriptors from theset of training images, our method clusters these descrip-tors into N centroids. This operation is performed usingthe standard K-means unsupervised learning algorithm.The key assumption in this paper is that the extracteddescriptors are independent and hence can be treated as a“bag of words” (BoW) in the image. This BoW nomen-clature is derived from text classification algorithms inclassical machine learning.

For a query image, descriptors are extracted using thesame robust local feature extractor. Each descriptor is

mapped to its visual word equivalent by finding the near-est cluster centroid in the dictionary. An ensuing count ofwords for each image is passed into a learning algorithmto classify the image.

A hierarchical pyramid scheme is incorporated into thisstructure to allow for localization of classifications withinthe image.

In Section 2, the local robust feature extractor usedin this paper is further discussed. Section 3 elaborateson the K-means clustering technique. The learning algo-rithm framework is detailed in Section 4. A hierarchicalpyramid scheme is presented in Section 5. Experimentalresults and closing remarks are provided in Section 6.

2 SURF

Our method extracts salient features and descriptors fromimages using SURF. This extractor is preferred over SIFTdue to its concise descriptor length. Whereas the stan-dard SIFT implementation uses a descriptor consisting of128 floating point values, SURF condenses this descriptorlength to 64 floating point values.

Modern feature extractors select prominent features byfirst searching for pixels that demonstrate rapid changesin intensity values in both the horizontal and vertical di-rections. Such pixels yield high Harris corner detectionscores and are referred to as keypoints. Keypoints aresearched for over a subspace of {x, y, σ} ∈ R3. The vari-able σ represents the Gaussian scale space at which thekeypoint exists. In SURF, a descriptor vector of length64 is constructed using a histogram of gradient orienta-tions in the local neighborhood around each keypoint.Figure 1 shows the manner in which a SURF descriptorvector is constructed. David Lowe provides the inclinedreader with further information on local robust featureextractors [1].

The implementation of SURF used in this paper is pro-vided by the library OpenSURF [2]. OpenSURF is an

1

Figure 1: Demonstration of how SURF feature vector isbuilt from image gradients.

open-source, MATLAB-optimized keypoint and descrip-tor extractor.

3 K-means

A key development in image classification using keypointsand descriptors is to represent these descriptors using aBoW model. Although spatial and geometric relation-ship information between descriptors is lost using this as-sumption, the inherent simplification gains make it highlyadvantageous.

The descriptors extracted from the training images aregrouped into N clusters of visual words using K-means.A descriptor is categorized into its cluster centroid usinga Euclidean distance metric. For our purposes, we choosea value of N = 500. This parameter provides our modelwith a balance between high bias (underfitting) and highvariance (overfitting).

For a query image, each extracted descriptor is mappedinto its nearest cluster centroid. A histogram of counts isconstructed by incrementing a cluster centroid’s numberof occupants each time a descriptor is placed into it. Theresult is that each image is represented by a histrogramvector of length N . It is necessary to normalize each his-togram by its L2-norm to make this procedure invariantto the number of descriptors used. Applying Laplaciansmoothing to the histogram appears to improve classifi-cation results.

K-means clustering is selected over Expectation Max-imization (EM) to group the descriptors into N visual

words. Experimental methods verify the computationalefficiency of K-means as opposed to EM. Our specific ap-plication necessitates rapid training and image classifica-tion, which precludes the use of the slower EM algorithm.

4 Learning Algorithms

Naive Bayes and Support Vector Machine (SVM) super-vised learning algorithms are investigated in this paper.The learning algorithms are used to classify an imageusing the histogram vector constructed in the K-meansstep.

4.1 Naive Bayes

A Naive Bayes classifier is applied to this BoW approachto obtain a baseline classification system. The probabilityφy=c that an image fits into a classification c is given by

φy=c =1

m

m∑i=1

1{y(i) = c}. (1)

Additionally, the probability φk|y=c, that a certain clustercentroid, k, will contain a word count, xk, given that itis in classification c, is defined to be

φk|y=c =

(m∑i=1

1{y(i) = c}x(i)k

)+ 1(

m∑i=1

1{y(i) = c}ni)

+N

. (2)

Laplacian smoothing accounts for the null probabilitiesof “words” not yet encountered. Using Equation 1 withEquation 2, the classification of a query image is givenby

arg maxc

(φy=c

n∏i=1

φi|y=c

). (3)

4.2 SVM

A natural extension to this baseline framework is to in-troduce an SVM to classify the image based on its BoW.Our investigation starts by considering an SVM with alinear kernel

K(x, y) = xT y + c, (4)

due to its simplicity and computational efficiency in train-ing and classification. An intrinsic flaw of linear kernels

2

is that they are unable to capture subtle correlations be-tween separate words in the visual dictionary of lengthN .

To improve on the linear kernel’s performance, non-linear kernels are considered in spite of their increasedcomplexity and computation time. More specifically theχ2 kernel given by

K(x, y) = 1− 2

n∑i=1

(xi − yi)2

xi + yi, (5)

is implemented.Given that an SVM is a binary classifier, and it is often

desirable to classify an image into more than two distinctgroups, multiple SVM’s must be used in conjunction toproduce a multiclass classification.

A one-vs-one scheme can be used in which a differentSVM is trained for each combination of individual classes.An incoming image must be classified using each of thesedifferent SVM’s. The resulting classification of the imageis the class that tallies the most “wins”. The one-vs-onescheme involves making

(N2

)different classifications for

N classes, which grows factorially with the number ofclasses. This scheme also suffers from false positives ifan image is queried that does not belong to any of theclasses.

A more robust scheme is the one-vs-all classificationsystem in which an SVM is trained to classify an imageas either belonging to class c, or belonging to class ¬c. Forthe training data {(xi, yi)}mi=1, yi ∈ 1, ..., N , a multiclassSVM aims to train N separate SVM’s that optimize thedual optimization problem

maxa

W (α) =

m∑i=1

αi −1

2

m∑i,j=1

y(i)y(j)αiαjK(x(i), x(j)),

(6)using John Platt’s SMO algorithm [3]. In Equation 6,K(x, z) corresponds to one of the Kernel functions dis-cussed above.

A query image is then classified using

sgn

{m∑i=1

αiy(i)K(x(i), z)

}, (7)

where sgn(x) is an operator that returns the sign of itsargument and z is the query vector of BoW counts.

Figure 2 represents this concept visually. When thequery image is of class A, the A-vs-all SVM will classify

Figure 2: Portrayal of one-vs-all SVM. When query imageis of type A, the A-vs-all SVM will correctly classify it.When the query image is not of class A, B, or C, it willlikely not be classified into any.

the image correctly, and thus the overall output will placethe image into class A. When the query image is of adifferent class, D, which is not already existent in theclass structure, the query will always fall into the “all”class on the individual SVM’s. Hence, the query will notbe falsely categorized into any class.

It is important to reiterate that each multiclass SVMonly distinguishes between classes c and ¬c. A differ-ent SVM is trained in this manner for each class. Thus,the number of SVM’s needed in a one-vs-all scheme onlygrows linearly with the number of classes, N . This systemalso does not suffer from as many false positives becauseimages that do not belong to any of the classes are usuallyclassified as such in each individual SVM.

The specific multiclass SVM implementation used inthis paper was MATLAB’s built-in version as describedby Kecman [4].

5 Object Localization

The methods described thus far are sufficient for the roleof classifying an image into a class when an object isprominently displayed in the forefront of the image. How-ever, in the case when the desired object is a small subsetof the overall image, this object classification algorithmwill fail to classify it correctly. Additionally, there is mo-

3

Figure 3: Visual representation of partitioning an imageinto sub-images and constructing the histograms.

tivation to localize an object in a scene using classificationtechniques. The solution to these shortcomings is objectlocalization using a hierarchical pyramid scheme. Figure3 illustrates the general idea behind extracting descrip-tors using a pyramid scheme.

First, the set of image descriptors, D, are extractedfrom the image using SURF. Next, the image is seg-mented into L pyramid levels, where L is a user-selectedparameter that controls the granularity of the localiza-tion search. Each level l, has subsections 0 ≤ i ≤ 4(l−1),where 0 ≤ l ≤ (L − 1). At each level l, the entire setof image descriptors, D, are segmented into a subgroupd ∈ D for section i which can be found as

i =

[idiv

(p.col − 1

C2l

)+ idiv

(p.row − 1

R2l

)]2l + 1, (8)

for a given pixel p. The notation idiv(x) represents aninteger division operator. The symbols R and C are themaximum number of rows and columns, respectively, inthe original image. Then, for pixel p the votes at eachlevel of the pyramid can be tallied into an Nx1 map com-puted using

voteMap(c) =

L−1∑l=0

2l−11{labelpyr(l, i) = c}. (9)

Figure 4: Results showing both image classification andlocalization.

The resulting effect is that pixel p is most highly influ-enced by the label of its lowest-level containing subsec-tion, in l = (L− 1), and less influenced by the label of itshighest-level containing subsection, in l = 0. The result-ing label given to pixel p can then be calculated as

labelpix(p) = arg maxc

voteMap(c). (10)

6 Results and Future Work

Figure 4 shows the classification and localization resultsof our proposed algorithm on a generic image consistingof multiple classes of objects.

A more rigorous test of our method was done usinga subset of the CalTech-101 database [5]. Images fallinginto the four categories of airplanes, cars, motorbikes, andfaces were trained and tested using our method. Figure 5shows the improvement in percent correct classificationsin classification of Naive Bayes, linear SVM, and nonlin-ear SVM as the training set size increases.

The f -score, computed using the precision, P , and re-call, R, of the algorithm by

f =2PR

P +R, (11)

is perhaps a better indicator of performance because itis a statistical measure of a test’s accuracy. Figure 6

4

100 200 300 40070

75

80

85

90Percent Correct vs. Training Set Size

Per

cent

Corr

ect

Training Set Size

Naive Bayes

Linear SVM

Non Linear SVM

Figure 5: Percent correct classifications of supervisedlearning classifiers.

shows a visible improvement in the f -score for all threeclassification algorithms as the training set size increases.The nonlinear SVM maintains the largest f -score overall training set sizes, which aligns with our hypothesizedresult.

Future work for this research should focus on replacingK-means with a more robust clustering algorithm. Oneoption is Linearly Localized Codes (LLC) [6]. The LLCmethod performs sparse coding on extracted descriptorsto make soft assignments that are more robust to localspatial translations [7]. Furthermore, there is still open-ended work to be done on the reconstruction of objectsusing the individually labeled pixels from the pyramid lo-calization scheme. Hrytsyk and Vlakh present a methodof conglomerating pixels into their neighboring groups inan optimal fashion [8].

References

[1] D. Lowe. Towards a computational model for objectrecognition in IT cortex. Proc. Biologically MotivatedComputer Vision, pages 2031, 2000.

[2] C. Evans. Opensurf.http://www.chrisevansdev.com/computer-vision-

100 200 300 4000.8

0.85

0.9

0.95

1F Score vs. Training Set Size

F S

core

Training Set Size

Naive Bayes

Linear SVM

Non Linear SVM

Figure 6: f -score of classifications of supervised learningclassifiers.

opensurf.html, retrieved 11/04/2011.

[3] J. Platt. Sequential Minimal Optimization: A FastAlgorithm for Training Support Vector Machines,1998.

[4] V. Kecman. Learning and Soft Computing, MITPress, Cambridge, MA. 2001.

[5] L. Fei-Fei, R. Fergus and P. Perona. Learning genera-tive visual models from few training examples. CVPR,2004.

[6] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spa-tial pyramid matching using sparse coding for imageclassification. CVPR, 2009.

[7] T. Serre, L. Wolf, and T.Poggio. Object recognitionwith features inspired by visual cortex. CVPR, 2005.

[8] N. Hrytsyk, V. Vlakh. Method of conglomeratesrecognition and their separation into parts. Methodsand Instruments of AI, 2009.

5

object classi cation and localization using surf...

Documents