
Image and Vision Computing 26 (2008) 259–274

Relative scale method to locate an object in cluttered environment

Md. Saiful Islam a,*, Andrzej Sluzek a,b

a Center for Computational Intelligence, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore
b SWPS, ul. Chodakowska 19/31, Warszawa, Poland

Received 13 December 2005; received in revised form 11 March 2007; accepted 1 June 2007

Abstract

This paper proposes an efficient method to locate a three-dimensional object in a cluttered environment. The model of the object is represented in a reference scale by the local features extracted from several reference images. A PCA-based hashing technique is introduced for accessing the database of reference features efficiently. Localization is performed in an estimated relative scale. Firstly, a pair of stereo images is captured simultaneously by calibrated cameras. Then the object is identified in both images by extracting features and matching them with reference features, clustering the matched features with the generalized Hough transformation, and verifying clusters with spatial relations between the features. After the identification process, knowledge-based correspondences of features belonging to the object present in the stereo images are used for the estimation of the 3D position. The localization method is robust to different kinds of geometric and photometric transformations in addition to cluttering, partial occlusions and background changes. As both the model representation and localization are single-scale processes, the method is efficient in memory usage and computing time. The proposed relative scale method has been implemented and experiments have been carried out on a set of objects. The method yields very good accuracy and takes only a few seconds for object localization with our preliminary implementation. An application of the relative scale method for exploration of an object in a cluttered environment is demonstrated. The proposed method could be useful for many other practical applications.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Relative scale; Object localization; Multidimensional hashing

1. Introduction

Locating an object in a cluttered three-dimensional environment is a challenging problem in computer vision. Many applications such as object manipulation, visual inspection, landmark localization for the navigation of mobile robots, and object tracking need to locate objects in cluttered environments. In order to locate an object, we first need to identify it in the scene image and then to determine its orientation and position with respect to a reference coordinate system. This process is also known as object localization.


The commonly accepted solution for such a situation is a local-feature based approach [1–8], because of its flexibility to localize a partially occluded object in a cluttered environment. Moreover, the amount of information required to represent the model is significantly reduced. However, in most of the model-based methods [1–4,7,9], features from the reference images are extracted together with their 3D locations with respect to a given reference frame, and the model is represented by these three-dimensional features. In order to localize the object, two-dimensional features are extracted from a single image and some iterative method (e.g. Newton's method) is used. Localization is performed with respect to the same reference frame. As a result, these methods seem suitable for environment-specific applications. However, all these methods seem contrary to the human vision system, which is the most successful vision system so far for locating an object in a cluttered environment. In the human vision system, we do not need to memorize the depth information of image features. Instead, we only memorize 2D features (e.g. shape) of the object and localize it by stereovision using correspondences of local features. Only a few works [6] can be found in this approach.

Fig. 1. Modular architecture for object localization in relative scale: (a) model representation, (b) identification and localization.

In order to locate the object in a cluttered environment, we need to consider some important issues such as different kinds of geometric transformations of the object in the image plane, variation of the light intensity, etc. In particular, the scale change of the object is one of the most important issues for localization systems. In order to cope with the scale change, multi-scale or scale-space methods have evolved [15,27,28,38]. In these multi-scale methods, an image is analyzed at multiple scale levels in both the model representation and recognition phases. As a direct consequence, the model representation requires a large amount of memory; on the other hand, the matching process becomes computationally expensive. However, such multi-scale methods may not always be necessary for localization of objects in many practical applications.

In model-based object recognition, efficiency could be improved by minimizing the number of scale levels, preferably to a single scale. As model-based object recognition requires some reference images to represent the model of an object and recognition is carried out on a scene image (test image), a relation between the scale of the object in the reference image and in the scene image can be established. In some applications, the distance of the object from the camera can be measured and a relation between scale and distance can be obtained. For example, in an intruder detection system [42] the distance of the intruder could be measured by a proximity sensor. From this distance measurement, the relative scale of the object in the image plane (with respect to the object in the reference image) could be established through calibration or analytic approaches. Similarly, for visual inspection of an object on a conveyor belt, when an object reaches a specific position of the belt the relative scale could be estimated from the known distance. For an object-following application, the relative scale could be updated based on the motion of the object. Again, exploration of an object can be carried out at a constant relative scale (Section 5). In these ways, the relative scale of an object of interest in the scene image can be predicted, estimated, or pre-assigned for many applications.
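As a simple illustration (our own sketch, assuming an ideal pinhole camera, which the paper does not spell out), the apparent size of an object in the image is roughly inversely proportional to its distance from the camera, so a measured distance converts directly into a scaling factor:

$$ s \;\approx\; \frac{Z_R}{Z_T}, $$

where Z_R is the object-to-camera distance used when capturing the reference images and Z_T is the distance measured (e.g. by a proximity sensor) in the scene.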

In this paper, we propose a relative scale method to locate a 3D object in a cluttered environment. We assign an arbitrary reference scale rR to the given reference object. The model of the object is represented by the local features extracted at the reference scale; as a result, the model representation needs a relatively small amount of memory. Localization of an object is performed at the relative scale, which is estimated or assigned a priori. As a result, the process becomes efficient.

Geometric transformations between a point in a reference image and the corresponding point in the scene image are adequately approximated by a planar projective transformation. For a rigid object having free-form surfaces, a small surface patch (except patches which include edge and corner regions) can be considered as a planar surface. When the camera is relatively far from the viewed object, the planar projective transformation for such a surface patch can be further approximated by an affine transformation [24,29,30]. If we assume a small viewpoint change, this projective/affine deformation may be negligible. Hence, a point p' = (x', y')^T on the object in the scene image IT is related to a point p = (x, y)^T in the reference image IR by the following transformations:

$$ I_T(p') \approx c\, I_R(p), \qquad (1) $$

$$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} a \\ b \end{bmatrix}, \qquad (2) $$

where c > 0 is an arbitrary contrast factor, s > 0 is an arbitrary scaling factor, 0° ≤ θ < 360° is an arbitrary rotation and (a, b) are arbitrary translations.

The relative scale rI of the object in the scene image IT, with respect to the scale of the object in the reference image IR, is a linear function of the scaling factor s, i.e.

$$ r_I = s\, r_R, \qquad (3) $$

where the reference scale rR is constant during the recognition process. Throughout this work, we have used the value of reference scale rR = 2.0. This value is arbitrarily selected and other values could be used as well.

Theoretically, the relative scale of the object could vary from 0 to ∞. In fact, a very large scale change of the object may suppress the visual information significantly, making the identification and localization of the object difficult. Therefore, the valid range of scale change for the matching process is finite and depends on different parameters such as the focal length of the camera, image resolution, etc.

A modular architecture of the relative scale localization method is illustrated in Fig. 1. The localization process consists of two different phases: off-line model representation and on-line identification and localization. However, the first step of both phases is the detection of suitable local features on the object of interest. Features are extracted by detecting some interest points and then computing an invariant descriptor for each of them. The method of feature extraction is discussed in Section 2.

In the off-line model representation phase, local features are extracted at the reference scale from reference images captured from significantly different viewpoints with uniform backgrounds. A PCA-based technique is used for efficient access of model features. The hashing technique is described in Section 3.

Section 4 describes the on-line localization phase. First, a pair of stereo images is captured by calibrated cameras. Then, the object is identified in both of the images by extracting features and matching them with reference features, clustering the matched features using the generalized Hough transformation, and verifying clusters using spatial relations between the features. The position of the object is estimated by a 3D reconstruction method using the corresponding features of the object in both of the stereo images. Some experimental results are also demonstrated. Section 5 describes an application of the relative scale method.

2. Relative scale local features

In this section, a brief overview of the relative scale method to extract local features of an object [10–12] is given. Local features are extracted by detecting interest points and then computing an invariant descriptor for each of them from a small image patch around the interest point. In this method, both steps are performed at the estimated relative scale of the object.

First, we detect interest points of the object. The classic work on interest point detection is the Harris corner detector [13], which was later improved by Schmid et al. [14]. In this method, the image is first differentiated in two perpendicular directions with a 1-D Gaussian kernel and then integrated by a circular Gaussian window. In the relative scale interest point detection method [11], we use the relative scale of the object rI, as in Eq. (3), as the standard deviation of the Gaussian window for Harris integration. The standard deviation of the Gaussian kernel for Harris differentiation is rD = k·rI, where k is a constant; the value of k is generally chosen from the range [0.5–0.75]. The scale-normalized auto-correlation matrix of the Harris detector [15] at a pixel p = (x, y)^T of an image I is given by

$$ N(p, r_I) = r_D^2\, G(p, r_I) * \begin{bmatrix} I_x^2(p, r_D) & I_x I_y(p, r_D) \\ I_x I_y(p, r_D) & I_y^2(p, r_D) \end{bmatrix}, \qquad (4) $$

where Ix(p, rD) and Iy(p, rD) are the partial derivatives of the given image in the X and Y directions, respectively, and G(p, rI) is the circular Gaussian integration window at the scale rI.

The measure of corner response at the point p and scale rI is

$$ u(p, r_I) = \det\big(N(p, r_I)\big) - k\,\operatorname{tr}^2\big(N(p, r_I)\big), \qquad (5) $$

where k is a constant. u(p, rI) is positive in corner regions, and a point is selected as a corner point if it is the local maximum of the measure of corner response [13]. Such a corner point is then selected as an interest point if

$$ u(p, r_I) \geq t_v, \qquad (6) $$

where tv is a threshold. The threshold is selected as tv = k·mean(u)/ln|D|, where k is a constant and |D| is the number of corner points. An example of relative scale interest point detection is shown in Fig. 2. In this experiment, the relative scale was estimated using the distance of the object from the camera.
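The detection step above can be sketched in a few lines of NumPy/SciPy (our own illustrative implementation, not the authors' MATLAB code; the constants k_d, kappa and k_t are placeholders for the unspecified constants k of Eqs. (4)–(6)):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def relative_scale_interest_points(img, r_I, k_d=0.7, kappa=0.04, k_t=1.0):
    """Harris-style interest points computed at a single (relative) scale r_I.

    Sketch of Eqs. (4)-(6): differentiation scale r_D = k_d * r_I,
    integration scale r_I.  k_d, kappa and k_t are illustrative values.
    """
    img = img.astype(np.float64)
    r_D = k_d * r_I

    # Gaussian derivatives in x and y (ingredients of Eq. 4).
    Ix = gaussian_filter(img, r_D, order=(0, 1))
    Iy = gaussian_filter(img, r_D, order=(1, 0))

    # Scale-normalised second moment matrix, smoothed by a circular Gaussian.
    Nxx = r_D**2 * gaussian_filter(Ix * Ix, r_I)
    Nyy = r_D**2 * gaussian_filter(Iy * Iy, r_I)
    Nxy = r_D**2 * gaussian_filter(Ix * Iy, r_I)

    # Corner response u = det(N) - kappa * tr(N)^2  (Eq. 5).
    u = (Nxx * Nyy - Nxy**2) - kappa * (Nxx + Nyy) ** 2

    # Corner points: positive local maxima of the response.
    is_corner = (u == maximum_filter(u, size=3)) & (u > 0)
    n_corners = int(is_corner.sum())
    if n_corners < 2:
        return np.empty((0, 2), dtype=int)

    # Threshold t_v = k * mean(u) / ln|D| over the corner points (Eq. 6).
    t_v = k_t * u[is_corner].mean() / np.log(n_corners)
    ys, xs = np.nonzero(is_corner & (u >= t_v))
    return np.stack([xs, ys], axis=1)        # (x, y) pixel coordinates
```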

Each of the detected interest points is described by invariant moments of n concentric and small circular image patches of slowly increasing radii surrounding the point [11,12]. This descriptor is called a local invariant descriptor (shortly, LID). Maitra [16] and Abo-Zaid et al. [17] suggested six invariant moments, which are invariant to scale, translation, rotation, and contrast changes. One of them, for a circular image patch wi, is given by

$$ \beta_1^{(w_i)} = \frac{\sqrt{\phi_2^{(w_i)}}}{\phi_1^{(w_i)}}, \qquad (7) $$

where $\phi_1^{(w_i)}$ and $\phi_2^{(w_i)}$ are the first two of Hu's [18] invariant moments. Thus, for n concentric circular image patches, the LID can be expressed as

$$ d(p, r_I) = \begin{bmatrix} \beta_1^{(w_1)} & \beta_1^{(w_2)} & \cdots & \beta_1^{(w_n)} \end{bmatrix}^T. \qquad (8) $$

Although invariant moments of other orders (e.g. $\beta_2^{(w_i)}$, $\beta_3^{(w_i)}$, etc.) could also be used, experimental results show that $\beta_1^{(w_i)}$ gives better information content [12] and at the same time is less sensitive to noise [19].
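A rough sketch of the LID computation (Eqs. (7) and (8), with the patch radii scaled by s = rI/rR as in Eq. (15) of Section 4) might look as follows; the set of reference-scale radii and the border handling are our own assumptions, not values given in the paper:

```python
import numpy as np

def local_invariant_descriptor(img, x, y, r_I, r_R=2.0, radii_ref=(4, 6, 8, 10, 12)):
    """LID sketch: beta_1 = sqrt(phi_2) / phi_1 on n concentric circular patches.

    radii_ref are illustrative reference-scale radii; the interest point is
    assumed to lie far enough from the image border for full patches.
    """
    s = r_I / r_R
    descriptor = []
    for rho in radii_ref:
        r = s * rho                                   # radius at relative scale
        r_int = int(np.ceil(r))
        patch = img[y - r_int: y + r_int + 1,
                    x - r_int: x + r_int + 1].astype(np.float64)
        # Circular mask centred on the interest point.
        yy, xx = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        cy, cx = (patch.shape[0] - 1) / 2.0, (patch.shape[1] - 1) / 2.0
        w = patch * (((xx - cx) ** 2 + (yy - cy) ** 2) <= r ** 2)

        # Normalised central moments of the masked patch.
        m00 = w.sum()
        if m00 <= 0:
            descriptor.append(0.0)
            continue
        xbar, ybar = (w * xx).sum() / m00, (w * yy).sum() / m00
        mu = lambda p, q: (w * (xx - xbar) ** p * (yy - ybar) ** q).sum()
        eta = lambda p, q: mu(p, q) / m00 ** (1 + (p + q) / 2.0)

        # Hu's first two invariants and Eq. (7).
        phi1 = eta(2, 0) + eta(0, 2)
        phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
        descriptor.append(np.sqrt(phi2) / phi1 if phi1 > 0 else 0.0)
    return np.array(descriptor)                       # Eq. (8)
```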

3. Hashing of local features

To represent the model of a 3D object we need several reference images (views) from different viewpoints. The number of required reference images depends on the object itself. Kovacic et al. [20] proposed a method for planning the sequence of views required to represent a model. For simplicity, in this work we consider views taken at a constant interval of pan–tilt angle (κ). As the feature detection and description are invariant to rotation of the object in the image plane, we do not need to consider that angle (θ) for feature matching. Suppose that we are dealing with u objects and v is the number of views for each of them. Each of the reference images Ii,j, where i = 1 . . . u and j = 1 . . . v, is captured with a uniform background and at the same distance from the camera.

The objects in the reference images are assigned a reference scale rR and interest points are detected on each of the reference images Ii,j at that scale. Suppose that ni,j is the number of interest points detected on a reference image Ii,j. We then compute a local invariant descriptor (LID) for each of the detected interest points at the reference scale rR. The number of the object (i), the orientation of the reference image (j) and a LID number are attached to each of the LIDs. As the total number of reference features $m_t = \sum_{i,j} n_{i,j}$ could be large (e.g. of the order of a thousand), we create a hash table to access them efficiently during the matching process.

Fig. 2. Interest points detected by the relative scale method. (a) Interest points on a reference image with rR = 2.0. (b) Interest points on the same object in a scene image having geometric and photometric transformations in a cluttered environment with partial occlusions. Here the estimated relative scale of the object is rI = 1.0.

In general, for hashing multidimensional data, the values of the few dimensions having the highest variance are directly used as the hash-key [21,22]. We call such a method naïve-hashing; it seldom reorganizes the data space to make it uniform over the indexing space. The PCA-hashing technique [39] uses principal component analysis to obtain the direction of the most uniform distribution of data and to select a proper hash-key. This technique is used to make the distribution of data uniform over the indexing space after transforming the LIDs. Obviously, a uniform distribution is necessary for improved efficiency of hashing. This method also helps to select the necessary dimension of the indexing space. During the off-line process of model representation we accumulate all the mt local features obtained from the reference images and select a proper hash-key to create a hash table.

Suppose that the dimension of a LID d(pRk, rR) is n. The covariance matrix C of the LIDs is defined as

$$ C = \frac{1}{m_t} \sum_{k=1}^{m_t} \big(d(p_{Rk}, r_R) - \bar{d}\,\big)\big(d(p_{Rk}, r_R) - \bar{d}\,\big)^T, \qquad (9) $$

where $\bar{d}$ is the mean vector of the LIDs, i.e.

$$ \bar{d} = \frac{1}{m_t} \sum_{k=1}^{m_t} d(p_{Rk}, r_R). \qquad (10) $$

The eigenvalue decomposition of the covariance matrix can be written as

$$ C = U B U^T, \qquad (11) $$

where B is a diagonal matrix (i.e. B = diag(λ1, λ2, . . . , λn) such that λ1 > λ2 > . . . > λn) and U consists of the eigenvectors.

As the covariance matrix C is symmetric and positive definite, the eigenvalues are all real and the eigenvectors are orthonormal. The set of vectors U represents another orthonormal basis for the n-dimensional vector space. Each of the LIDs is transferred to the new orthogonal basis as follows:

$$ \tilde{d}(p_R, r_R) = U^T d(p_R, r_R). \qquad (12) $$

We select the values of the first d dimensions (d ≤ n, with d chosen as described below) of the transformed features $\tilde{d}(p_R, r_R)$ as the hash-key. It can be observed that

• the joint probability distribution is as uniform as possible;
• $\| d(p_T, r_I) - d(p_R, r_R) \| = \| \tilde{d}(p_T, r_I) - \tilde{d}(p_R, r_R) \|$, where d(pT, rI) is a scene feature and $\tilde{d}(p_T, r_I) = U^T d(p_T, r_I)$.

Now, the indexing space is divided into bins (or buckets), where each bin is a d-dimensional hyper-cube with constant volume. We select the side length of a bin to be less than a threshold td, which will also be used as the threshold in Eq. (19). The number of features in each bin depends on the total number of local features mt and the number of bins. As the volume of a bin and the total number of reference features are constant, we can increase the number of bins by increasing the dimension of the index space. The dimension of the index space d is selected such that the number of features in a bin is less than a threshold tb, i.e.

$$ m_t / f^d \leq t_b, \qquad (13) $$

where f is the average number of bins along each dimension. It is obvious that the best performance could be obtained if tb = 1.
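A compact NumPy sketch of the hash-key construction (Eqs. (9)–(13)) is given below; the function names and the default values of t_b and the per-dimension bin count f are our own illustrative choices:

```python
import numpy as np

def pca_hash_basis(lids, t_b=4, bins_per_dim=16):
    """Sketch of the PCA-hashing key selection (Eqs. 9-13).

    lids: (m_t, n) array of reference LIDs.  Returns the mean vector, the
    orthonormal basis U (eigenvectors of C sorted by decreasing eigenvalue)
    and the key dimension d chosen so that m_t / bins_per_dim**d <= t_b,
    where bins_per_dim plays the role of f.
    """
    m_t, n = lids.shape
    d_bar = lids.mean(axis=0)                           # Eq. (10)
    C = (lids - d_bar).T @ (lids - d_bar) / m_t         # Eq. (9)
    eigvals, U = np.linalg.eigh(C)                      # Eq. (11), ascending
    U = U[:, np.argsort(eigvals)[::-1]]                 # reorder: descending

    d = 1                                               # Eq. (13)
    while m_t / bins_per_dim**d > t_b and d < n:
        d += 1
    return d_bar, U, d

def pca_hash_key(lid, U, d):
    """Transform a LID into the PCA basis (Eq. 12) and keep the first d values."""
    return (U.T @ lid)[:d]
```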

In this hashing method, first, all mt n-dimensional LIDs $\tilde{d}(p_R, r_R)$ obtained from the reference images are accumulated in an mt × n two-dimensional array. They are sorted according to the value of the first dimension of the features. We divide the mt LIDs into equally spaced bins. Subsequently, a hash table is created where each field of the table indexes the starting and ending LIDs belonging to a bin. Next, all LIDs in each of the bins are sorted according to the value of the second dimension. Each of the bins is again divided into equally spaced sub-bins and the hash table is expanded to 2D. This procedure is repeated up to d dimensions. Thus a d-dimensional hash table is created.
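A simplified stand-in for the table itself is sketched next: instead of the nested sorted arrays described above, each transformed LID is dropped into a d-dimensional cell keyed by its integer cell coordinates, which gives the same constant-time landing-bin lookup (the dictionary layout and the cell_size parameter are our own choices):

```python
from collections import defaultdict
import numpy as np

def build_hash_table(lids_pca, d, cell_size):
    """Map each reference LID (already in the PCA basis) to a d-dimensional
    cell of side length cell_size (chosen <= t_d).  The table maps
    cell coordinates -> list of reference feature indices.
    """
    table = defaultdict(list)
    for idx, lid in enumerate(lids_pca):
        cell = tuple(np.floor(lid[:d] / cell_size).astype(int))
        table[cell].append(idx)
    return table
```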

In order to analyze the performance of the hashing technique, we extracted LIDs from a well-known object image library known as COIL [23]. As mentioned before, the joint probability distribution over this indexing space is expected to be as uniform as the data space allows. The joint probability distribution over the bins can be expressed as

$$ P_r(f_1, \ldots, f_d) = k(f_1, \ldots, f_d)/m_t, \qquad (14) $$

where k(f1, . . . , fd) is the number of reference features in the bin located at (f1, . . . , fd). Fig. 3 shows the distribution (joint probability distributions are scaled up to improve visualization) of local shape features for a d = 2 dimensional indexing space. It can be observed that in PCA-hashing the distribution is more uniform.

Oversaturation, which is one of the main drawbacks of hashing techniques, contributes to poor performance of feature matching. The next experiment shows the percentage of oversaturated bins as the total number of reference features increases. As mentioned before, the best performance can be obtained if each bin contains exactly one reference LID. In practice, it is not possible for real data to obtain a completely uniform distribution. Thus, we consider a bin as saturated if the number of entries in the bin is greater than a small constant number tb. Fig. 4 shows the percentage of oversaturated bins for naïve-hashing and PCA-hashing. In the naïve-hashing, we use the dimensions having the highest variances as the hash-key.

In order to store the reference features, the memory requirement is O(mt) ≈ O(m1 · n), where m1 = u · v is the number of reference images and n is the average number of features extracted from a reference image. This requirement is significantly lower compared with that of the geometric hashing method [36,37], which needs O(m1 · n^4) memory space. Scale-space methods [15,38] need O(w · m1 · n) memory space, where w is the number of scale levels under consideration for a multi-scale method.

4. Object localization

The object to be located is usually situated in a cluttered environment with an arbitrary position and orientation. A pair of images of the environment is captured simultaneously by two calibrated stereo cameras. The baseline distance of the cameras is small with respect to the distance of the object. It is obvious that the object present in the images can be geometrically and photometrically transformed with respect to the same object in the corresponding reference image. There could also be small viewpoint changes with respect to the closest reference image. Moreover, the object of interest may be partially occluded in the cluttered environment.

Fig. 3. Distribution of reference LIDs over the 2D index space for PCA-hashing and naïve-hashing (joint probability distributions are scaled up to improve visualization).

As indicated in Fig. 1, localization consists of the identification of the object with its orientation, followed by the estimation of the 3D position. First, the identification of the object in both of the images is performed by the shape matching method at the estimated relative scale rI. Although we prefer to perform feature matching at a single scale (i.e. at the relative scale), due to uncertainty in the estimation of the relative scale, matching may need to be performed at a few consecutive scale levels around the estimated relative scale. We expect the range of scale levels [rmin–rmax] to be searched to be rather narrow. For the matching process, this continuous interval [rmin–rmax] is quantized into w discrete levels, i.e. rI can assume any of w values. The difference between two consecutive scale levels should be small enough to make the matching process insensitive to scale change. Then the 3D position is estimated by a 3D reconstruction from the corresponding features of the object in the stereo images. The steps of the localization process are described below:

1. Interest points of each of the stereo images are detected at the relative scale rI as described in Section 2. The relative scale method gives good repeatability of interest points on the object in a cluttered environment for different kinds of geometric and photometric transformations. An example is shown in Fig. 2.

2. The object in the stereo images may have small viewpoint changes with respect to the same object in the closest reference image. As mentioned in Section 1, the deformation of a small neighborhood image region around an interest point may be considered as an affine transformation of the equivalent region of the reference image. In the second step, the image patches are compensated for this affine deformation. The compensation method is described in Section 4.1.

Fig. 4. Percentage of oversaturated bins for naïve-hashing and PCA-hashing for different numbers of reference LIDs in a 2D indexing space (panels for tb = 4 and tb = 5; horizontal axis: number of model features mt; vertical axis: oversaturated bins (%)).

3. The LIDs are computed at the relative scale rI for the compensated image patches. The method of computing a LID was described in Section 2. The radius of the neighborhood region over which a LID is computed changes with the relative scale according to the following equation:

$$ r_i(r_I) = \frac{r_I}{r_R}\,\rho_i = s\,\rho_i, \quad i = 1 \ldots n, \qquad (15) $$

where ρi is the radius of the ith circular image patch at the reference scale rR and ri is the radius of the corresponding circular image patch at the relative scale rI.

4. Each of the local features obtained from the scene image is used to find potential matches by indexing the hash table. Then a generalized Hough transformation (GHT) is used for clustering these matches to hypothesize about the object (i) together with its orientation (j). The clustering technique is described in Section 4.2. A cluster of matched features is verified with a spatial constraint; Section 4.3 describes this verification method. At this stage, the identity of the object (i) together with the orientation (j) and the relative scale (rI) are obtained.

5. In the final step, we compute the position of the object with respect to a reference frame. This is done by applying a stereo algorithm on the corresponding features of the stereo images belonging to the object. Section 4.4 describes the reconstruction technique from known correspondences.

4.1. Compensation for affine deformation

With a sufficiently large number of reference images, we can assume that most features found in scene images are only insignificantly distorted with respect to the corresponding reference features. However, the compensation can be recommended if the number of reference images is limited.

In order to compensate for the affine deformation caused by a small viewpoint change, we need to know the amount and direction of the deformation. The use of the second moment matrix is well known [24–26] for estimating the anisotropic shape of a local image structure. The eigenvalues and eigenvectors of the second moment matrix reveal the amount and direction of deformation caused by the affine transformation. As the use of an affine Gaussian kernel to estimate the deformation is expensive (three parameters are involved), Mikolajczyk and Schmid [26] used a circular Gaussian kernel instead and described an iterative process to normalize the affine deformation. During each of the iterations, the image patch was enlarged in the direction of the smaller eigenvector of the inverse of the second moment matrix, i.e. in the direction of maximum deformation.

Here, instead of the normalization we use a compensation method. We use the second moment matrix N(p, rI), which was computed using circular Gaussian kernels during the interest point detection process as in Eq. (4). The eigenvectors of this matrix show the directions of major and minor deformations. Fig. 5 shows the estimated affinity of the neighborhood regions around a few interest points on an object having a viewpoint change. The eigenvalue decomposition of the matrix N^{-1/2} is given by

$$ N^{-1/2} = W B W^T, \qquad (16) $$

where W consists of the eigenvectors and B is a diagonal matrix, with

$$ B = \begin{bmatrix} \lambda_{\max} & 0 \\ 0 & \lambda_{\min} \end{bmatrix}. $$

The compensation matrix is obtained by replacing B in Eq. (16) with B', where

$$ B' = \begin{bmatrix} 1 & 0 \\ 0 & 1 - \delta \end{bmatrix}, \quad \delta \to 0. $$

Thus the compensation matrix

$$ Q = W B' W^T \qquad (17) $$

is used to transform image patches around interest points of the stereo images. The circular regions around interest points in Fig. 5 show such patches in the reference scale to be transformed. It should be noted that no compensation is needed for the reference images.

In this method, we transform a patch of the scene image by the compensation matrix Q using +δ and −δ in the matrix B'. We compute the local invariant descriptor of an image patch three times: once without compensation and twice after the compensations. The value of δ is selected from the range [0.0–0.08]. We use a small value because a larger value could cause an unnecessary reverse deformation of the image patch, resulting in false matches. Although this compensation method does not use an accurate estimation of the deformation, it still improves the performance of interest point matching (Section 4.2) by nearly 20% on average in our feature matching experiments. The compensation method is faster than normalization methods because it does not use any iterative algorithm.

Fig. 5. A few corresponding interest points on two views of an object with significant viewpoint change. The object is taken from the COIL [23] image-set. The circle around an interest point shows the neighborhood region over which a LID is computed. Ellipses show the maximum and minimum eigenvectors of the second moment matrix. It can be observed that only the eigenvalues change (not the directions of the eigenvectors).
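A minimal sketch of the compensation matrices of Eqs. (16) and (17), assuming the 2 × 2 second moment matrix N of the interest point is already available from the detection step (the value of δ, the function name, and the return of both the +δ and −δ versions are our own choices):

```python
import numpy as np

def compensation_matrices(N, delta=0.05):
    """Build the compensation matrices Q(+delta) and Q(-delta) from the
    second moment matrix N (Eqs. 16-17).  delta = 0.05 is one value from the
    suggested range [0.0-0.08]; applying Q to warp the patch is omitted here.
    """
    # N is symmetric positive definite, so N^{-1/2} shares N's eigenvectors,
    # with eigenvalues lam**-0.5 (Eq. 16).
    lam, V = np.linalg.eigh(N)                     # ascending eigenvalues of N
    W = V[:, np.argsort(lam ** -0.5)[::-1]]        # sort by eigenvalues of N^{-1/2}
    # Replace B by B' = diag(1, 1 -/+ delta) (Eq. 17).
    Q_plus = W @ np.diag([1.0, 1.0 - delta]) @ W.T
    Q_minus = W @ np.diag([1.0, 1.0 + delta]) @ W.T
    return Q_plus, Q_minus
```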

4.2. Feature clustering

The features (LIDs) extracted from a scene image are matched with reference features. The matched features are clustered using the generalized Hough transformation (GHT) to hypothesize about the object (i) and its orientation (j) in the scene. An initialized 3D accumulator A of dimension u × v × w is created, where u is the number of objects, v is the number of reference images for each object and w is the number of discrete scale levels for the range [rmin–rmax]. Now, for each of the stereo images, we extract scene features d(pT, rI) at the relative scale rI in the same way as described in Section 2. Next, the scene features are transformed by the orthonormal basis U (see Eq. (11)), i.e.

$$ \tilde{d}(p_T, r_I) = U^T d(p_T, r_I). \qquad (18) $$

Now, a transformed feature $\tilde{d}(p_T, r_I)$ is considered as a potential match for a reference feature $\tilde{d}(p_R, r_R)$ obtained from a reference image if

$$ \big\| \tilde{d}(p_T, r_I) - \tilde{d}(p_R, r_R) \big\| \leq t_d, \qquad (19) $$

where td is a threshold. The selection of a suitable threshold td is crucial: although a large threshold increases the number of correct matches, it also increases the probability of false positives. We take k median absolute distances (med) of the reference features as the threshold, i.e. $t_d = k\,\mathrm{med}_i\big(\big|\tilde{d}(p_{Ri}, r_R) - \mathrm{med}_j\big(\tilde{d}(p_{Rj}, r_R)\big)\big|\big)$, 1 ≤ i, j ≤ mt.

The values of the first d dimensions of $\tilde{d}(p_T, r_I)$ are used as the key for indexing the hash table. The landing bin can be found with a searching complexity of O(1). It is easy to see that all the potential matches for the feature $\tilde{d}(p_T, r_I)$ lie within a hyper-ball of radius td around the landing bin. Hence, a super-bin is obtained, consisting of the consecutive bins of the hash table within the radius td in the d-dimensional indexing space, centered at the landing bin. Each of the reference features of the super-bin is compared, and a feature is accepted as a potential match if it satisfies Eq. (19). For each matched reference feature $\tilde{d}(p_R, r_R)$, we obtain the associated object label (i), the orientation (j) and the relative scale (rI). We insert the matched pair $\{\tilde{d}(p_T, r_I), \tilde{d}(p_R, r_R)\}$ into the accumulator cell A(i, j, rI).
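The lookup-and-vote step can be sketched as follows (our own illustration; the super-bin is enumerated by scanning the neighbouring cells within radius t_d, and the accumulator is kept as a dictionary rather than a dense u × v × w array):

```python
import numpy as np
from itertools import product
from collections import defaultdict

def vote_matches(scene_lids_pca, table, ref_lids_pca, ref_labels,
                 d, cell_size, t_d, r_I):
    """For every scene LID (already in the PCA basis), visit the super-bin of
    cells within radius t_d, accept reference features satisfying Eq. (19),
    and vote into a Hough accumulator indexed by (object i, orientation j,
    relative scale r_I).  ref_labels[k] = (i, j) for reference feature k.
    """
    accumulator = defaultdict(list)                 # (i, j, r_I) -> matched pairs
    reach = int(np.ceil(t_d / cell_size))           # cells to scan per dimension
    for s_idx, lid in enumerate(scene_lids_pca):
        centre = np.floor(lid[:d] / cell_size).astype(int)
        for offset in product(range(-reach, reach + 1), repeat=d):
            for r_idx in table.get(tuple(centre + np.array(offset)), []):
                if np.linalg.norm(lid - ref_lids_pca[r_idx]) <= t_d:  # Eq. (19)
                    i, j = ref_labels[r_idx]
                    accumulator[(i, j, r_I)].append((s_idx, r_idx))
    return accumulator
```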

The time required to recognize an object directly depends on the efficiency of feature matching. The average number of comparisons required to find the potential matches from the database of reference features for a scene feature $\tilde{d}(p_T, r_I)$ using a 2D hash table is shown in Fig. 6. It can be observed that feature matching using PCA-hashing is more than five times faster than with naïve-hashing.

4.3. Verification of a cluster

It is expected that the interest points on the object in the scene image match only with the corresponding points of the reference images. In fact, due to different kinds of errors (e.g. quantization error, image noise, etc.), there could be some false matches. For similar reasons, outliers due to cluttering in the scene image could falsely match with some reference features. That is why each of the clusters should be verified, so that false matches and outliers can be discarded. In fact, for recognition of an occluded object, there should be a minimum number of correctly matched features in the cluster sufficient to ensure the identity of the object in the scene image. An iterative algorithm based on spatial constraints is used for this robust verification of a cluster A(i, j, rI) having more than a minimum number of correct matches.

Fig. 6. Number of comparisons required by the matching process for each feature of the scene images, for naïve-hashing and PCA-hashing with different numbers of reference features, using a 2D hash table (panels for td = 1.5 and td = 2.0).

Fig. 7. A connected graph is used to represent the spatial relations between features of a reference image or a scene image.

The minimum number of correct matches k in a cluster required for confirmation of the identity of the object depends on the set of objects under consideration. For example, highly textured objects may need more features than less textured objects. Different approaches have reported the use of different minimum numbers of points: the fundamental matrix solution [40] requires at least seven correct matches, while the affine solution [38] needs only three correct matches. In this work, a cluster is verified if it contains at least five matches. This value has been decided from many experiments on COIL and other image-sets.

Assume that cluster A(i, j, rI) has s matched pairs, where s ≥ k. First, we build two graphs from the cluster: one for the matched features of the reference image Ii,j, called the reference graph $G_{i,j} = \{\tilde{d}(p_{Rl}, r_R), E_R\}$, l = 1, . . . , s, and another one for the matched features in the scene image, called the scene graph $S_{i,j} = \{\tilde{d}(p_{Tl}, r_I), E_T\}$, l = 1, . . . , s. For the reference graph, a weight $\| p_{Rq} - p_{Rr} \|$ is assigned to the edge ER(q, r) between nodes $\tilde{d}(p_{Rq}, r_R)$ and $\tilde{d}(p_{Rr}, r_R)$, 1 ≤ q, r ≤ s. Similarly, for the scene graph a weight $\| p_{Tq} - p_{Tr} \|$ is assigned to the edge ET(q, r) between nodes $\tilde{d}(p_{Tq}, r_I)$ and $\tilde{d}(p_{Tr}, r_I)$, 1 ≤ q, r ≤ s.

A simple graph is illustrated in Fig. 7. If the viewpoint change is small, the distance between two image points changes with the scaling factor s = rI/rR when the scale of the object changes, and remains invariant under other transformations (e.g. translation, rotation, contrast change) of the image.

The following iterative algorithm is used for the verification of a cluster. Each iteration consists of two steps: pruning of loosely connected nodes and verification of the remaining features with a spatial constraint.

Algorithm 4.1 (Verification of a cluster). Set k' = k − 1.

Step 1 (pruning): From the scene graph Si,j we remove an edge ET(q, r) if |ET(q, r) − s · ER(q, r)| ≥ ts, where ts is a spatial threshold. Then we retain a sub-graph SSi,j ⊆ Si,j in which each node $\tilde{d}(p_{Tl}, r_I)$ has a degree (total edges of the node) of at least k'. Let k'' be the number of remaining nodes. If k'' < k', then abandon the cluster and exit; otherwise go to Step 2.

Step 2 (verification): Let $\tilde{d}(p_T, r_I)$ be a node of the sub-graph SSi,j and $\tilde{d}(p_R, r_R)$ be the corresponding node of the reference graph Gi,j. We assume that the interest point pT = (xT, yT) is an affine transformation of the interest point pR = (xR, yR) satisfying the following equation:

$$ \begin{bmatrix} x_T \\ y_T \\ 1 \end{bmatrix} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_R \\ y_R \\ 1 \end{bmatrix}. \qquad (20) $$

We can rewrite the above equation as follows:

$$ \begin{bmatrix} x_R & y_R & 0 & 0 & 1 & 0 \\ 0 & 0 & x_R & y_R & 0 & 1 \end{bmatrix} \begin{bmatrix} b_{11} \\ b_{12} \\ b_{21} \\ b_{22} \\ b_{13} \\ b_{23} \end{bmatrix} = \begin{bmatrix} x_T \\ y_T \end{bmatrix}. \qquad (21) $$

Thus for each pair of matched interest points pT and pR we get two equations; therefore, for k'' pairs, 2k'' equations are obtained. This system of equations can be solved using the least squares method to obtain the parameters of the affine transformation. If all the matched pairs are correct, any three matched pairs of the scene and reference features should agree with the affine parameters. If such an agreement is found for several random tests, all the matches of the cluster are accepted. Hence, i is the identity, j is the orientation, and rI is the relative scale of the object. The nodes of SSi,j are the correctly matched features. Otherwise, set k' = k' + 1, Si,j = SSi,j and go to Step 1.

Fig. 8. Location estimation using correspondences of features in stereo images.
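The least-squares solve of Eqs. (20)–(21) and the random agreement test of Step 2 above can be sketched as below; the pixel tolerance and the number of random tests are our own illustrative choices, not values from the paper:

```python
import numpy as np

def fit_affine(ref_pts, scene_pts):
    """Least-squares solution of Eqs. (20)-(21) for the affine parameters
    mapping reference interest points onto scene interest points.
    ref_pts, scene_pts: (k, 2) arrays of corresponding (x, y) coordinates.
    """
    k = ref_pts.shape[0]
    A = np.zeros((2 * k, 6))
    b = np.zeros(2 * k)
    for idx, ((xR, yR), (xT, yT)) in enumerate(zip(ref_pts, scene_pts)):
        A[2 * idx]     = [xR, yR, 0, 0, 1, 0]
        A[2 * idx + 1] = [0, 0, xR, yR, 0, 1]
        b[2 * idx], b[2 * idx + 1] = xT, yT
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params                                 # [b11, b12, b21, b22, b13, b23]

def consensus_ok(ref_pts, scene_pts, n_tests=10, tol=3.0, rng=None):
    """Random agreement test: an affine model fitted from three random matched
    pairs should agree (within tol pixels) with all matches of the cluster."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_tests):
        idx = rng.choice(len(ref_pts), size=3, replace=False)
        trial = fit_affine(ref_pts[idx], scene_pts[idx])
        pred = np.column_stack([
            trial[0] * ref_pts[:, 0] + trial[1] * ref_pts[:, 1] + trial[4],
            trial[2] * ref_pts[:, 0] + trial[3] * ref_pts[:, 1] + trial[5]])
        if np.abs(pred - scene_pts).max() > tol:
            return False
    return True
```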

This iterative algorithm converges very fast and in most cases takes only a few iterations. It should be noted that an object in the scene image could match with more than one consecutive reference image having small viewpoint changes. In such cases, we take all these matches as correct matches. The exact orientation j can be estimated by interpolation using all these matched orientations. The rotation angle θ of the object in the image plane can be easily estimated using the correctly matched point pairs. It is also possible that an object in the scene image matches the correct reference images at two (or more) consecutive scale levels.

4.3.1. Efficiency of recognition

The time complexity of clustering is O(n) for each of the stereo images, where n is the average number of interest points detected in an image. Here, it is assumed that feature matching by the hashing technique takes constant time and that the time needed for cluster verification is negligible. This complexity is considerably lower than that of other methods: the geometric hashing method [36,37] has a time complexity of O(n^4) for affine invariant feature matching, and scale-space methods [15,38] need to match features at w scale levels for the scene image, with a time complexity of O(n · w).

4.4. Estimation of 3D location

Let I1 and I2 be the images captured by the left and right cameras of the stereo vision system, respectively, and let c1 and c2 be the optical centers of the two cameras, as shown in Fig. 8. The perspective projection matrices Q and R, obtained from the calibration [31] of the two cameras individually, can be written as follows:

$$ Q = \begin{bmatrix} q_1^T & q_{14} \\ q_2^T & q_{24} \\ q_3^T & q_{34} \end{bmatrix}, \quad \text{where } q_i = \begin{bmatrix} q_{i1} \\ q_{i2} \\ q_{i3} \end{bmatrix}, \ i = 1, 2, 3. \qquad (22) $$

Similarly,

$$ R = \begin{bmatrix} r_1^T & r_{14} \\ r_2^T & r_{24} \\ r_3^T & r_{34} \end{bmatrix}, \quad \text{where } r_i = \begin{bmatrix} r_{i1} \\ r_{i2} \\ r_{i3} \end{bmatrix}, \ i = 1, 2, 3. \qquad (23) $$

In order to compute the displacement of the object we need to find corresponding features in the stereo images. Although intensity-based correlation techniques [2,31] and feature-based methods [31,41] make it possible to find correlated points, it remains unknown which points belong to the object of interest. Hence, a knowledge-based method is required to solve this problem. The local-feature based method is helpful because, through the feature matching, clustering and verification processes, it is already known which features belong to the object of interest. Therefore, the matched local features (LIDs) are utilized to solve the correspondence problem. Consider IR ∈ {Ii,j}, the matched reference image for which the correct matches were found to confirm the identity of the object in both the I1 and I2 images (Fig. 8). Assume that {p1j} is the set of points in I1 and {p2j} is the set of points in I2 such that points p1j and p2j match with the same point pRj of IR. It is now obvious that a point p1j in the left image I1 is the corresponding point for p2j in the right image I2.

For the estimation of the 3D position of the object, it is essential that there be at least one corresponding pair of features (points). The location n = (x, y, z)^T of the 3D object with respect to a reference frame centered at O can be computed from such corresponding image point pairs.

Let [U1j, V1j, S1j]^T and [U2j, V2j, S2j]^T be the projective coordinates of image points p1j and p2j, respectively. Then

$$ \begin{bmatrix} U_{1j} \\ V_{1j} \\ S_{1j} \end{bmatrix} = Q \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad (24) $$

$$ p_{1j} = \begin{bmatrix} u_{1j} \\ v_{1j} \end{bmatrix} = \begin{bmatrix} U_{1j}/S_{1j} \\ V_{1j}/S_{1j} \end{bmatrix} \quad \text{if } S_{1j} \neq 0. \qquad (25) $$

Similarly,

$$ \begin{bmatrix} U_{2j} \\ V_{2j} \\ S_{2j} \end{bmatrix} = R \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad (26) $$

$$ p_{2j} = \begin{bmatrix} u_{2j} \\ v_{2j} \end{bmatrix} = \begin{bmatrix} U_{2j}/S_{2j} \\ V_{2j}/S_{2j} \end{bmatrix} \quad \text{if } S_{2j} \neq 0. \qquad (27) $$

The correspondence for each pair of points p1j and p2j can be further verified by the epipolar constraint for stereo images: the point p2j necessarily belongs to the epipolar line of image I2 determined by p1j. The epipolar line associated with p1j is given by [31]

$$ e_2 \times \Big( [r_1, r_2, r_3]^T \big([q_1, q_2, q_3]^T\big)^{-1} [U_{1j}, V_{1j}, S_{1j}]^T \Big), \qquad (28) $$

where e2 is the projective coordinate of the epipole of image I2, given by

$$ e_2 = R \begin{bmatrix} -\big([q_1, q_2, q_3]^T\big)^{-1} [q_{14}, q_{24}, q_{34}]^T \\ 1 \end{bmatrix}. $$

For each pair of corresponding points p1j and p2j, we get four linear equations. From Eqs. (22), (24) and (25) we get

$$ (q_1 - u_{1j} q_3)^T n + q_{14} - u_{1j} q_{34} = 0, \qquad (29) $$

$$ (q_2 - v_{1j} q_3)^T n + q_{24} - v_{1j} q_{34} = 0. \qquad (30) $$

Similarly, using Eqs. (23), (26) and (27) we get

$$ (r_1 - u_{2j} r_3)^T n + r_{14} - u_{2j} r_{34} = 0, \qquad (31) $$

$$ (r_2 - v_{2j} r_3)^T n + r_{24} - v_{2j} r_{34} = 0. \qquad (32) $$

Eqs. (29)–(32) form a system of linear equations, which can be written as

$$ K n = b, \qquad (33) $$

where

$$ K = \begin{bmatrix} (q_1 - u_{1j} q_3)^T \\ (q_2 - v_{1j} q_3)^T \\ (r_1 - u_{2j} r_3)^T \\ (r_2 - v_{2j} r_3)^T \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} u_{1j} q_{34} - q_{14} \\ v_{1j} q_{34} - q_{24} \\ u_{2j} r_{34} - r_{14} \\ v_{2j} r_{34} - r_{24} \end{bmatrix}. $$

Eq. (33) can be solved using the least squares method. If rank(K) = 3 (this condition is satisfied if the object is situated in front of the focal plane), then the 3D location n of the object is

$$ n = (K^T K)^{-1} K^T b. \qquad (34) $$

As the matched interest points of a 3D object have different depths, we take the average of the measurements as the estimated location of the object.
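The reconstruction of Eqs. (29)–(34), including the final averaging over all corresponding pairs, can be sketched as follows (assuming the 3 × 4 projection matrices Q and R from calibration; function names are our own):

```python
import numpy as np

def triangulate(Q, R, p1, p2):
    """Least-squares 3D reconstruction (Eqs. 29-34) from one pair of
    corresponding image points.  Q, R: 3x4 projection matrices of the left
    and right cameras; p1 = (u1, v1), p2 = (u2, v2) in pixels.
    """
    u1, v1 = p1
    u2, v2 = p2
    K = np.stack([Q[0, :3] - u1 * Q[2, :3],
                  Q[1, :3] - v1 * Q[2, :3],
                  R[0, :3] - u2 * R[2, :3],
                  R[1, :3] - v2 * R[2, :3]])
    b = np.array([u1 * Q[2, 3] - Q[0, 3],
                  v1 * Q[2, 3] - Q[1, 3],
                  u2 * R[2, 3] - R[0, 3],
                  v2 * R[2, 3] - R[1, 3]])
    return np.linalg.solve(K.T @ K, K.T @ b)        # Eq. (34)

def object_location(Q, R, pts1, pts2):
    """Average the per-correspondence estimates, as done for the object centre."""
    return np.mean([triangulate(Q, R, a, b) for a, b in zip(pts1, pts2)], axis=0)
```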

4.5. Experimental results

The first step of localization is the identification of the object in both of the stereo images, so we should verify the accuracy of the recognition method. The performance of the described method was tested on five objects from the COIL-unprocessed [23] image-set. Each of the objects has 72 reference images with 5° viewpoint changes. In order to represent the models of the objects we used 12 of them, with 30° viewpoint change, as reference images. The value of the reference scale rR = 2.0 was assigned. The performance was evaluated by three criteria: recognition rate, failure rate, and false-alarm rate, where recognition rate + failure rate + false-alarm rate ≥ 100%. The method was coded in MATLAB; the timing results reported are for this preliminary implementation.

First, we tested the method for similarity transformations with simultaneous contrast change, as described by Eqs. (1) and (2). We applied these transformations (with factors selected randomly: scale change from the continuous range s = 0.5–2.0, rotation θ = 0°–360° and contrast change c = 0.3–2.0) 25 times to each of the reference images. The transformed images were used as test (scene) images. We performed matching at five discrete scale levels with an interval of 0.1 around the relative scale (which is known in these particular experiments, as each test image was created by digitally transforming a reference image). Fig. 9 shows an example of object recognition under similarity transformation with simultaneous contrast change. Table 1 shows the performance for all five objects.

In order to investigate the performance of the recognition method for different viewing angles, we used the remaining 60 images with three different viewpoint changes (5°, 10°, 15°). Fig. 10 shows an example of matching with viewpoint change. Fig. 11 shows the performance for the three different viewpoint changes.

The next experiment was to test the performance of the method for cluttering and partial occlusions. The test images were taken from the Drexel Object Occlusion Repository [43]. The images were constructed by overlapping objects from the COIL image-set and occluding them by various amounts. In this experiment, we computed the recognition rate for different amounts of occlusion irrespective of the degree of cluttering. Fig. 12 shows an example of recognition of an object of interest with occlusions and cluttering. Fig. 13 shows the average recognition rate for all five objects for different amounts of occlusion. The average time for shape matching in this experiment is 2.99 seconds.

When we capture an image of an object from an arbitrary location in a real environment, all these transformations and conditions could occur simultaneously for the object of interest in the scene image. In the following experiment, we captured 17 reference images for each of several household and laboratory objects. The reference images were captured with a uniform background and at the same distance from the camera. During the recognition phase, the object was placed at arbitrary locations and orientations in a cluttered environment and 48 scene images were captured. Fig. 14 shows two examples of object recognition in real conditions. Table 2 gives the performance of the method for these two objects.

Fig. 10. Object recognition with viewpoint change. The right image is the test image and the left image is the matched reference image.


Fig. 11. Average recognition rate for different viewpoint changes.

Fig. 9. Object recognition with similarity transformations and contrast change. The right image is the test image and the left image is the matched reference image. Matched features are shown.

Table 1
Performance of the method on different objects of the COIL image-set for similarity transformations with contrast change

Object            Recognition rate (%)   Failure rate (%)   False-alarm rate (%)   Average time/test (s)
'Pig-bank'        94.44                  5.56               0                      2.06
'Anacin-packet'   98.22                  1.18               0.22                   3.61
'Car'             81.56                  18.44              0                      1.62
'Vaseline'        83.33                  16.67              11.11                  2.63
'Wooden-tower'    95.56                  4.44               0                      4.58


For the localization experiments, the models of two objects of interest ('toy_car' and 'fire extinguisher') were represented first. An object was then placed in an arbitrary position in the laboratory environment and a pair of images was captured by a stereo vision system. Fig. 15 shows the reference image (left column) matching with the stereo images (right column), where both the left and right images matched with the same reference image. Another example, involving the 'fire extinguisher', is shown in Fig. 16. For these experiments, the relative scales of an object in the stereo images were estimated from manually measured distances. The optical center of the right camera was used as the origin of the reference coordinate system.

The average error of the position estimation is less than 2% for the range of relative scales 0.5–2.0. Errors of this magnitude are expected, as the correspondence in stereo images is only up to pixel accuracy. The accuracy could be further improved by sub-pixel correspondences using interpolation techniques. Localization experiments were carried out with other objects as well. The initial MATLAB implementation in a Windows environment takes around 1 min to complete the localization procedure. Real-time response should be possible with a careful implementation on faster hardware and software.

5. Relative scale exploration

The proposed relative scale localization method may be applied to an exploration task where a robot needs to search a cluttered environment for an object of interest. For example, a robot may be employed to find a fire extinguisher situated somewhere in a hallway environment. Obviously, there are two related problems here: navigating through the hallway and locating the object of interest. Several techniques for vision-controlled hallway navigation [32–35] have been reported in the literature. Any of these methods could be combined with our relative scale exploration.

Fig. 12. Object recognition with partial occlusion in a cluttered environment. The right image is the test image and the left image is the matched reference image.


Fig. 13. Average recognition rate for different amounts of occlusion.

Fig. 14. Examples of object recognition in real conditions. The images in the right column are the scene images and the images in the left column are the matched reference images.

Table 2
Performance of the method on two objects in real conditions

Object         Recognition rate (%)   Failure rate (%)   False-alarm rate (%)
'Dictionary'   95.83                  2.08               58.33
'Harpic'       93.75                  4.17               10.41


As the robot makes progress according to a planned tour, it looks around to locate the object of interest. For this purpose, a relative scale rI is arbitrarily assigned within the valid range of relative scales. The robot captures images of the environment at small intervals and performs the matching at the relative scale to locate the object. As the robot approaches the object, the pre-assigned relative scale rI should coincide with the true relative scale at a certain instant, and the robot should then be able to locate the object. As the recognition process is performed at a single scale only, it should be efficient enough for this real-time application.

Fig. 15. Localization of the 'toy car'. The first column shows the same matched reference image. The second column shows the two images taken simultaneously by the stereo cameras. Matched points for both of them are shown for the same reference image.

Fig. 16. Localization of the 'fire extinguisher'. The first column shows the matched reference image. The second column shows the two images taken simultaneously by the stereo cameras. Matched points for both of them are shown for the same reference image.


Fig. 17. Top view of a hallway environment. The dotted line shows the robot's path during the exploration exercise. The location of the target object of interest is shown by the black dot.

Fig. 18. Scenes of the hallway environment at five different positions (as marked in Fig. 17) on the path of the robot during relative scale exploration. The object of interest was identified at position 5.


Fig. 17 shows a top view of the hallway environment where we carried out the navigation and exploration experiments. In this experiment, we place the robot at the initial position 1. From this position, the robot starts to search for the target object (a fire extinguisher). The location of the target object is shown by the black circle. The model of the object was represented at the reference scale rR = 2.0. The relative scale for the exploration was rI = 1.2. This value was arbitrarily selected; any value of relative scale inside the valid range should be suitable for the exploration exercise. It is also possible to update the relative scale, if necessary, during the navigation.

In the experiment, instead of a real robot, we installed the computer and the stereo vision system on a trolley and pushed the trolley along the track shown by the dotted line in Fig. 17. During this navigation, the robot keeps searching for the target object at the relative scale for every small change in position. Fig. 18 shows scenes of the hallway environment at positions 1, 2, 3, 4 and 5 of Fig. 17. At position 5, the robot is able to locate the target object at the assigned relative scale of 1.2. It then estimates the 3D position of the object as discussed in Section 4.4.

6. Discussion

In this paper, we have described a novel method to locate 3D objects at a relative scale. We have also introduced a PCA-based hashing technique and a knowledge-based stereo-correspondence method; in fact, we have described all the essential components of a relative scale localization procedure. The method is able to handle different kinds of situations such as cluttering, partial occlusions, change of background, and geometric and photometric transformations of the object. We have implemented the method and tested it on different objects. The accuracy and efficiency of the localization suggest the high potential of the proposed method. An exemplary application of the method to localize an object during a hallway exploration has also been presented.

The main limitation of this method is that it would fail to represent models of objects having smooth surfaces (without any texture or internal contours) due to the lack of interest points; consequently, the localization of such objects would fail. Although we have proposed a method of object localization in a relative scale, we have not analyzed the scale properties sufficiently, such as selecting and assigning a suitable reference scale, the allowable ranges, and the quantization of the relative scale. All these should be investigated in the future. The model of the object is represented by reference images taken at a constant interval of viewpoints. This constant interval is insufficient for some objects and unnecessary for others. As future work, a method of automatic selection of the reference images necessary for representing the model of the object should be investigated. The invariant moment based local features are only invariant under similarity transformations. Moreover, for the spatial relationship between interest points, we have considered a feature which is invariant to similarity transformations as well. However, due to viewpoint change, the transformation of the object in the image plane is in fact projective. In order to improve the robustness of the 3D object recognition and localization, further investigation of suitable projective invariant features is required.

References

[1] G. Hausler, D. Ritter, Feature-based object recognition and localization in 3D-space using a single video image, Computer Vision and Image Understanding 73 (1999) 64–81.
[2] E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, Prentice Hall, 1998.
[3] A.K.C. Wong, L. Rong, X. Liang, Robotic vision: 3D object recognition and pose determination, in: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, Victoria, BC, Canada, 1998, pp. 1202–1209.
[4] J.L. Chen, G.C. Stockman, Determining pose of 3D objects with curved surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (1996) 52–57.
[5] P.L. Rosin, Robust pose estimation, IEEE Transactions on Systems, Man, and Cybernetics – Part B 29 (1999) 297–303.
[6] S.H. Joseph, Optimal pose estimation in two and three dimensions, Computer Vision and Image Understanding 73 (1999) 215–231.
[7] J.R. Cozar, N. Guil, E.L. Zapata, Detection of arbitrary planar shapes with 3D pose, Image and Vision Computing 19 (2001) 1057–1070.
[8] M. Boshra, H. Zhang, An indexing scheme for efficient data-driven verification of 3D pose hypotheses, Image and Vision Computing 20 (2002) 469–481.
[9] D.D. Sheu, A generalized method for 3D object location from single 2D images, Pattern Recognition 25 (1992) 771–786.
[10] M.S. Islam, A. Sluzek, L. Zhu, Representing and matching the local shape of an object, in: Proceedings of Mirage 2005 (Computer Vision/Computer Graphics Collaboration Techniques and Applications), France, 2005, pp. 9–16.
[11] M.S. Islam, A. Sluzek, L. Zhu, Detecting and matching interest points in relative scale, Machine Graphics & Vision 14 (2005) 259–283.
[12] M.S. Islam, L. Zhu, Matching interest points of an object, in: Proceedings of IEEE International Conference on Image Processing, Italy, 2005.
[13] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of 4th Alvey Vision Conference, Manchester, UK, 1988.
[14] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, International Journal of Computer Vision 37 (2000) 151–172.
[15] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, International Journal of Computer Vision 60 (2004) 63–86.
[16] S. Maitra, Moment invariants, Proceedings of the IEEE 67 (1979) 697–699.
[17] A. Abo-Zaid, O. Hinton, E. Horne, About moment normalization and complex moment descriptors, in: Proceedings of 4th International Conference on Pattern Recognition, 1988, pp. 399–407.
[18] M.-K. Hu, Pattern recognition by moment invariants, IRE Transactions on Information Theory IT-8 (1962).
[19] C.-H. Teh, R.T. Chin, On image analysis by the method of moments, IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (1988) 496–513.
[20] S. Kovacic, A. Leonardis, F. Pernus, Planning sequences of views for 3-D object recognition and pose determination, Pattern Recognition 31 (1998) 1407–1417.
[21] F. Stein, G. Medioni, Structural indexing: efficient 3-D object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (1992) 125–145.


[22] A. Califano, R. Mohan, Multidimensional indexing for recognizing visual shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1994) 373–392.
[23] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, 1996.
[24] A. Baumberg, Reliable feature matching across widely separated views, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, USA, 2000, pp. 774–781.
[25] T. Lindeberg, J. Garding, Shape-adapted smoothing in estimation of 3-D shape cues from affine deformations of local 2-D brightness structure, Image and Vision Computing 15 (1997) 415–434.
[26] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in: Proceedings of European Conference on Computer Vision (ECCV'2002), Copenhagen, Denmark, 2002.
[27] T. Kadir, M. Brady, Saliency, scale and image description, International Journal of Computer Vision 45 (2001) 83–105.
[28] F. Jurie, C. Schmid, Scale-invariant shape features for recognition of object categories, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. II-90–II-96.
[29] T.H. Reiss, Recognizing Planar Objects Using Invariant Image Features, Springer-Verlag, 1993.
[30] F. Mindru, T. Tuytelaars, L. Van Gool, T. Moons, Moment invariants for recognition under changing viewpoint and illumination, Computer Vision and Image Understanding 94 (2004) 3–27.
[31] O. Faugeras, Three-Dimensional Computer Vision, The MIT Press, 1993.
[32] A. Kosaka, A.C. Kak, Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties, Computer Vision, Graphics and Image Processing: Image Understanding 56 (1992) 271–329.
[33] M. Meng, Vision-guided mobile robot navigation using neural networks and topological models of the environment, Ph.D. Thesis, Purdue University, 1998.
[34] Y. Ma, A differential geometric approach to computer vision and its applications in control, Ph.D. Thesis, University of California at Berkeley, 2000.
[35] M.S. Islam, A. Sluzek, Vision guided navigation for mobile robots without model of the environment, in: Proceedings of the Second International Conference on Computational Intelligence, Robotics and Autonomous Systems, Singapore, 2003.
[36] Y. Lamdan, J.T. Schwartz, H.J. Wolfson, Affine invariant model-based object recognition, IEEE Transactions on Robotics and Automation 6 (1990) 578–589.
[37] H.J. Wolfson, Geometric hashing: an overview, IEEE Computational Science and Engineering 4 (1997) 10–21.
[38] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004) 91–110.
[39] M.S. Islam, A. Sluzek, Hashing technique for multi-dimensional local shape features of 3D objects, in: Proceedings of International Conference on Intelligent Systems (ICIS-2005), Malaysia, 2005.
[40] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, UK, 2000.
[41] N. Ayache, P.T. Sander, Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception, The MIT Press, 1991.
[42] A. Sluzek, P. Annamalai, Development of a reconfigurable sensor network for intrusion detection, in: Proceedings of 8th Military and Aerospace Programmable Logic Devices (MAPLD) International Conference, Washington DC, 2005.
[43] T. Denton, J. Novatnack, A. Shokoufandeh, Drexel Object Occlusion Repository (DOOR), Technical Report DU-CS-05-08, Department of Computer Science, Drexel University, Philadelphia, PA 19104, 2005.