


Place Recognition and Self-Localization in Interior Hallways by Indoor Mobile Robots: A Signature-Based Cascaded Filtering Framework

Khalil M. Ahmad Yousef1, Johnny Park2 and Avinash C. Kak3

Abstract— We present a robot self-localization approach that is based on using a cascade of filters that increasingly refine a robot's guess regarding where it is in a hallway system. The location refinement carried out by each stage of the cascade compares a signature extracted from a stereo pair of camera images taken at the current location of the robot with a database of such signatures collected previously during a training phase. A central question in this approach to robot localization is what signatures to use for each stage of the cascade. An answer to this question must recognize the special importance of the first stage of the cascade, which we refer to as the prefiltering stage. The signature used for prefiltering must be significantly viewpoint invariant, while possessing sufficient locale uniqueness to yield a set of possible locations for the robot that includes the true location with a high probability. The signature(s) used for downstream filtering in the cascade must then prune away the inapplicable locales from the list yielded by the prefilter, which implies that the downstream filters must be increasingly viewpoint variant and locale specific. Although the framework we propose allows for an arbitrary number of filters to follow the prefiltering stage, the results we present in this paper are for a two-stage cascade consisting of a prefilter followed by one additional filter. The signatures we use in our experiments are based on 3D-JUDOCA features that can be extracted from stereo pairs of images. The proposed framework for choosing the best signatures for the prefiltering stage and the filtering stage that follows was tested in a large indoor hallway system with a total linear length of 1539 m. The validation results we show are based on a dataset of 6209 stereo images collected by a robot from the hallways during its training phase. The performance evaluation presented in this paper demonstrates that our framework can lead to high localization accuracy with good time performance by a robot.

I. INTRODUCTION

The fundamental problem that is of interest to us in this paper is that of Place Recognition and Self-Localization (PRSL) by a robot in interior hallways. Many challenges exist in interior hallways, and in indoor environments in general, that make solving the PRSL problem a hard task. These include space layout variability, viewpoint changes, moving objects, and variations in illumination conditions. Obviously, these challenges also arise for outdoor mobile robots. However, the computer vision, the data modeling, and the scene modeling issues for the indoor and the outdoor cases are very different, and here we focus on just the former. Fig. 1 demonstrates examples of some of the aforementioned

1Khalil Ahmad Yousef is with the Computer Engineering Department, the Hashemite University, Zarqa 13115, Jordan (e-mail: [email protected]). During the course of the work reported in this paper, Khalil Ahmad Yousef was with the School of Electrical and Computer Engineering, Purdue University. 2,3Johnny Park and Avinash Kak are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA. https://engineering.purdue.edu/RVL/

challenges in the context of indoor mobile robotics. As the reader can see, we can expect an indoor mobile robot to encounter a high degree of similarity between different sections of a hallway system and many sections that are devoid of visual features. Such challenges have attracted several researchers to solving the PRSL problem in recent years [4], [2], [5], [6], [7], [1]. Building on the research progress made by others, our signature-based cascaded filtering framework presented in this paper is an attempt to provide a solution to PRSL that is effective and fast.

For a robot to localize itself successfully in a system of indoor hallways, it first needs to recognize which section of a hallway it is currently traversing, and then it needs to compute its location in the frame of reference in which the hallways are modeled. Neither the place recognition nor the eventual self-localization can be carried out unless the robot has stored in its memory a model of the hallways that is rich in both the geometry of the space involved and the features likely to be encountered while traversing the hallways.

Several approaches have been proposed in the literature for creating such models. Generally speaking, the modeling approaches that have been proposed over the years depend on the type of sensors on the robot. The sensors themselves run the gamut from beacons to ultrasonic sensors, from single-camera vision to multi-camera vision, and from laser-based distance-to-a-point measurements to full-scale ladar imaging. The approaches based on beacons, such as WiFi or radar [3], first construct a database of measured signal strengths at the different locations in an indoor environment and then carry out PRSL by comparing the recorded signals with the signals stored in the database. For the approaches that are based on visual images, the goal is to first construct a point-cloud representation of the interest-point descriptors extracted from the images [10], [19], [9] and then to carry out PRSL by comparing the point descriptors in a query image (which is the image recorded at the current location of the robot) with the descriptors stored in the database. The most popular interest-point descriptors that have been used in the past are SIFT [12] and SURF [2]. More recent work has used descriptors derived from 3D-JUDOCA features [1] since they possess the highest viewpoint invariance for the sort of visual features likely to dominate hallway scenes: right angles formed by the corners of doors, windows, bulletin boards, picture frames, etc.

Given a feature-rich geometric model of a hallway system, there is still the issue of how to best use the features when comparing those that are extracted from the sensory

2014 IEEE/RSJ International Conference on Intelligent Robots and Systems


Fig. 1: Some of the challenges in interior hallways

information collected at a given position of the robot to those that populate the model [16]. The work of [13] presents a global appearance based approach using invariant signatures that are extracted from omni-directional images through group averaging based on the classical invariance theory. The authors compute Haar attributes from the images and then use a histogram intersection based similarity measure for place recognition. In another contribution [14], a histogram-based comparison of the features extracted from color imagery (simulated imagery for model construction and real color images for self-localization) is used for solving the problem of PRSL. The work of [18] uses the notion of fingerprints of places for incremental and automatic topological mapping. The fingerprints consist of color patches and vertical edges extracted from visual information, together with corners extracted from a laser scanner, all derived from the surroundings of the robot. A recent contribution in [1] uses the notion of signatures to organize the visual features for the purpose of comparing the sensory information collected at the current position of the robot with the model. The work in [1] was a preliminary step in the direction of using signatures to solve the PRSL problem. That brings us to the notion of signatures.

We consider a signature to be a summarizer of the features that can be used to characterize a locale in the hallways. One can obviously think of multiple ways to construct such signatures. We are therefore faced with the following question: If it is possible to construct multiple signatures from the sensory information that a robot is capable of collecting, how does the robot decide what type of signature (or which subset of the different types of signatures) to use for solving the PRSL problem?

The goal of our research is to pose the question stated above within a cascaded filtering approach to the solution of PRSL. The cascaded filtering approach consists of a prefilter followed by an arbitrary number of filters. The goal of the prefiltering stage is to yield a set of candidate locations for the robot that includes, with a high probability, the robot's true location. The goal of each subsequent filtering stage is to prune away the inapplicable locations from the list of locations returned by the prefilter. This logic places different requirements on the signatures used for prefiltering and for filtering, as the reader will see in the rest of this paper.
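The two-stage logic just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the distance functions, the database representation, and the candidate-list size k are placeholders.

```python
def localize(query, database, prefilter_dist, filter_dist, k=20):
    """Two-stage cascade: a cheap prefilter proposes candidate locales,
    then a more discriminative filter prunes them down to one."""
    # Stage 1: scan the entire database with the fast prefilter signature
    # and keep the k best-matching candidate locales.
    candidates = sorted(database, key=lambda loc: prefilter_dist(query, loc))[:k]
    # Stage 2: re-rank only the surviving candidates with the more
    # viewpoint-variant, locale-specific filter signature.
    return min(candidates, key=lambda loc: filter_dist(query, loc))
```

The point of the structure is that the expensive filter distance is evaluated only k times per query rather than once per database entry.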

Whereas the proposed overall design allows for an arbitrary number of filtering stages to follow the prefilter in the cascade, in the present paper we will only show results with a two-stage cascade in which the prefilter is followed by a single filter. Our work will show how to select the best signature type for the prefilter and how to do the same for the filtering stage that follows.

II. SIGNATURES AND THEIR DESIRABLE PROPERTIES FOR PREFILTERING AND FILTERING

Given a feature-rich geometric model of a hallway system in a large institutional building, the PRSL problem boils down to matching the features extracted from sensory data at the current location of the robot with all such features populating the model. Strictly speaking, this is an exponentially complex problem, even with constraints on where the robot may expect itself to be located at the current moment. One way to mitigate this complexity is through the use of a signature-based cascade of filters with different properties.

The cascade must obviously include a prefiltering stage that compares the signature recorded at the current location of the robot with the signatures for possibly the entire hallway system and returns a set of candidate locations that includes the true location with a very high probability. Such a prefilter must obviously be very fast, since it may need to scan through a very large database of signatures. In addition to its time performance, the prefilter must also ensure that the number of candidate locales returned is sufficiently small so as not to burden the downstream filtering stages excessively. On the other hand, the downstream filters must be sufficiently discriminative so that they reject with a high probability all non-applicable locale candidates returned by the prefiltering step. Therefore, a key issue in this approach is to figure out what signature to use for prefiltering and what to use for the


filtering stages that follow the prefilter. As mentioned in the Introduction, the work we present in this paper involves just a two-stage cascade. So our challenge is to choose the best signature for the prefilter and the best signature for the filter that follows the prefilter.

In order to address the challenge mentioned above, the first column of Table I lists a set of criteria that can be used to choose the best signature for the prefiltering stage and the best signature for the filtering stage. To elaborate, these four criteria are: (1) the time it takes to match a "query" signature with a signature stored in the database; (2) the degree of invariance to viewpoint changes; (3) locale uniqueness, which is the extent to which other locales appear dissimilar to any specific locale with regard to a signature; and, related to the previous criterion, (4) the frequency with which a signature returns the true locale at rank 1 (i.e., at the first position) vis-a-vis all other locales. We refer to the last criterion as the Rank-1 Precision criterion. Table I shows how the prefilter signature(s) and the ensuing filter signature(s) should compare vis-a-vis these four criteria. The parameter N in the time-complexity function O(N) is a rough estimate of the number of visually distinct locales in a hallway system. The parameter p is the number of features that are used in the filter signature. The notation 'O(p^n) for n ≈ 1' is meant to convey the idea that we want the time complexity of the filter signature to be a low-order polynomial for some n. Note that the localization calculations in the filtering stage are likely to be more complex, with a time effort that will depend, in general, on the number of features used in the signature. The entries in the second row express the fact that a signature used for prefiltering must possess high viewpoint invariance, whereas the signature used for filtering can get away with possessing lower such invariance. The characterizations "Moderate" and "High" in the last two rows are meant to merely convey the relative sense in which those two criteria apply to the prefiltering and the filtering signatures.

TABLE I: Desirable characteristics for prefiltering and filtering signatures

Criteria             | Prefiltering | Filtering
Matching time        | < O(N)       | < O(p^n) for n ≈ 1
Viewpoint invariance | > ±θ         | < ±θ
Locale Uniqueness    | Moderate     | High
Rank-1 Precision     | Moderate     | High

Whereas Table I gives us the criteria for assessing the quality of the signatures to be used for the prefiltering and the filtering stages, it leaves open the question of how to use the criteria in a quantitative framework for choosing the best signatures for the two stages. We believe that the question regarding how to use these criteria can only be answered in an empirical setting, and even then only with regard to a given class of features extracted from the hallway images. Therefore, in the section that follows, we first talk about 3D-JUDOCA features and then we answer the question about how to use the criteria of Table I for choosing the best signatures.

III. SELECTION OF SIGNATURES

A. Signatures Derived from 3D-JUDOCA Features

The 3D-JUDOCA features are 3D junction features based on the 2D JUDOCA junctions [8] extracted from pairs of stereo images. As demonstrated in [1], the viewpoint invariance of 3D-JUDOCA features and the robustness with which they can be extracted from stereo pairs of images make them ideal for characterizing locales in hallway images. A large majority of the visually interesting points in typical hallways are made up of the edge junctions at the corners of doors, windows, wall hangings, bulletin boards, display windows, etc. The regular 2D JUDOCA operator extracts these junctions in a viewpoint invariant manner from the individual images. Subsequently, even greater viewpoint invariance can be achieved when the corresponding 2D JUDOCA features from the two images of a stereo pair are combined into 3D-JUDOCA features.

For the purpose of this investigation, we will consider the following seven signatures derived from the 3D-JUDOCA features: (i) Vertical Signature (VS); (ii) Horizontal Signature (HS); (iii) Radial Signature (RS); (iv) VS||HS; (v) VS||RS; (vi) HS||RS; and (vii) VS||HS||RS, where || stands for concatenation. VS is a height-based histogram of the absolute differences in the 3D-JUDOCA feature counts collected from the left side and from the right side up to a certain threshold distance beyond the current position of the robot. HS is similar in spirit to VS, except that it is now constructed from two histograms whose cells are along the horizontal distance straight down the hallway, with one histogram for the left side and the other for the right side. RS is a radial histogram of the 3D-JUDOCA features at the current location of the robot. It is important to mention that the robot tries to center itself the best it can between the hallway walls and orients itself so that it is looking straight down a hallway before constructing these signatures. The construction of HS, VS, and RS is illustrated in Figs. 2b, 2c, and 2d, respectively, for the locale depicted in Fig. 2a. At training time, the robot associates with each signature its spatial location with respect to the world frame based at the position and the orientation of the robot at the beginning of the training phase.
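The VS and RS constructions can be sketched as below. The coordinate convention (x across the hallway with negative values on the left, y down the hallway, z the height) and the field-of-view value are our assumptions for illustration; the bin counts echo the dimensions given later in Section IV (27 height bins spanning 2.7 m, 60 one-degree radial zones). HS would be built analogously to VS, binning along y instead of z.

```python
import math

def vs_signature(features, bins=27, bin_size=0.1):
    """VS: height histogram of |left-side count - right-side count| of the
    3D features.  Assumed coordinates: x across the hallway (negative =
    left), y straight down the hallway, z = height, all in meters."""
    left, right = [0] * bins, [0] * bins
    for x, y, z in features:
        idx = int(z / bin_size)
        if 0 <= idx < bins:
            (left if x < 0 else right)[idx] += 1
    return [abs(l - r) for l, r in zip(left, right)]

def rs_signature(features, zones=60, fov_deg=60.0):
    """RS: radial histogram of feature bearings about the robot, one bin
    per angular zone across the camera's (assumed) field of view."""
    hist = [0] * zones
    for x, y, _ in features:
        bearing = math.degrees(math.atan2(x, y)) + fov_deg / 2.0
        idx = int(bearing / (fov_deg / zones))
        if 0 <= idx < zones:
            hist[idx] += 1
    return hist
```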

Given the set of seven signatures as constructed above, we need to answer the question: How exactly should the criteria of Table I be used in order to find the best signatures for the prefilter and for the filter? A simpleminded approach would be to rank each signature with respect to a weighted sum of the criteria listed in Table I, with the weighting being different for the prefilter and the filter in order to express the different importance of the listed criteria to prefiltering and filtering. Since a weighted sum of the criteria would constrain the final choice of the signatures to be linearly dependent on the criteria, a more general approach is to construct an ordering criterion as a weighted sum of functions of the Table I criteria. This can obviously be done off-line during the training phase of the robot. After the signatures are ordered in this manner, we can pick the top-ranked signatures for the prefilter and for the filter. And, if our goal is to select


[Fig. 2 panels: (a) a locale in a hallway; (b) HS = |Histogram 1 - Histogram 2|, with the two histograms binned along the horizontal distance (in mm) down the hallway from the robot position; (c) VS = |Histogram 1 - Histogram 2|, with the two histograms binned over height at the robot position; (d) RS: a radial histogram over zones R1, R2, R3, ... centered at the robot location, where the number of zones is a user-defined parameter and each bin counts the features found in its zone]

Fig. 2: Illustration of how the HS, VS, and RS are computed

TABLE II: The performance evaluation of the 7 signatures using the Locale Uniqueness criterion

Signature Type   | VS    | HS    | RS    | VS‖HS | VS‖RS | HS‖RS | VS‖HS‖RS
Actual Signature | 75.94 | 79.47 | 93.94 | 93.73 | 97.06 | 96.59 | 97.87
Binary Version   | 55.58 | 69.69 | 93.88 | 93.41 | 97.05 | 96.57 | 97.87

TABLE III: The weighted sum of the performance results of the 7 signatures

The weighted sum percentage for each signature type:

Matching stage | VS      | HS      | RS      | VS‖HS   | VS‖RS   | HS‖RS   | VS‖HS‖RS
prefilter      | 72.2638 | 63.9275 | 59.7382 | 72.3107 | 59.0074 | 55.5803 | 52.4081
filter         | 57.8547 | 54.0521 | 72.3357 | 57.4494 | 64.6030 | 67.2921 | 58.0977

the best subsets of signatures to use, we can rely on the fact that the number of signatures can be expected to be small. Therefore, one can construct the same sort of an ordered list for every element of the power set of the set of signatures. Subsequently, one can select the top-ranked subsets for the prefilter and the filter.
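The ordering procedure described above can be sketched as follows. The weights and criterion transforms here are illustrative placeholders, not the values used in the paper; `scores[name]` is assumed to hold the four criterion values in the order of Table I.

```python
from itertools import combinations

def rank_candidates(scores, weights, transforms=None):
    """Order candidate signatures (or signature subsets) by a weighted sum
    of (optionally transformed) criterion values."""
    transforms = transforms or [lambda v: v] * len(weights)
    def total(name):
        return sum(w * f(v) for w, f, v in zip(weights, transforms, scores[name]))
    return sorted(scores, key=total, reverse=True)

def power_set(names):
    """All non-empty subsets of the signature types, for subset ranking."""
    return [subset for r in range(1, len(names) + 1)
            for subset in combinations(names, r)]
```

Passing a nonlinear function in `transforms` (e.g., converting viewpoint invariance into viewpoint dependence) is what lifts the ranking beyond a purely linear weighted sum.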

Using the dataset described in Section IV, we evaluated the performance of the seven signatures with respect to the criteria mentioned in Table I. For the first, the third, and the fourth criteria in Table I, we used 20% of the dataset, selected randomly, for testing and the remaining 80% for training. For the second criterion, the viewpoint invariance criterion, we used all of the dataset for training. When a

test signature histogram is compared with each of the training signature histograms, the comparison is carried out using the Earth Mover's Distance (EMD) metric [15] (see Section IV for a comparison of different distance metrics for comparing histograms).
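For the 1-D histograms used here, the EMD has a simple closed form: after normalizing both histograms to unit mass, it equals the L1 distance between their cumulative distributions. A minimal sketch (not the paper's implementation, which cites [15]):

```python
def emd_1d(h1, h2):
    """Earth Mover's Distance between two 1-D histograms.  After
    normalizing both to unit mass, the 1-D EMD reduces to the L1
    distance between the cumulative distributions."""
    s1, s2 = float(sum(h1)), float(sum(h2))
    p = [v / s1 for v in h1]
    q = [v / s2 for v in h2]
    dist = cdf_p = cdf_q = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        dist += abs(cdf_p - cdf_q)
    return dist
```

Unlike a bin-wise distance, EMD rewards histograms whose mass is merely shifted by a bin or two, which is exactly the perturbation a small viewpoint change induces.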

Fig. 3 and Table II show the performance evaluation results for all of the seven signatures. Fig. 3a shows the performance with regard to the average matching time (first criterion in Table I). Fig. 3b shows the Rank-1 precision rate (fourth criterion in Table I). Fig. 3c shows a similar comparison with respect to viewpoint invariance (second criterion in Table I). Lastly, Table II shows the Locale Uniqueness rate (third criterion in Table I), taking into account the noise in signature values.

In Table III, we show a rank ordering of the seven


[Fig. 3 panels (EMD distance): (a) average matching time per signature type: VS 0.5 sec, HS 0.47 sec, RS 1.2 sec, VS+HS 1.4 sec, VS+RS 2.3 sec, HS+RS 2.2 sec, VS+HS+RS 3.7 sec; (b) Rank-1 recognition rate for up to 1 m localization error: VS 49.9%, HS 23.7%, RS 74.3%, VS+HS 69.2%, VS+RS 80.3%, HS+RS 77.4%, VS+HS+RS 83.9%; (c) viewpoint invariance rate as a function of the cut-off rank for each of the seven signatures]

Fig. 3: The performance evaluation of the signatures using three of the criteria listed in Table I

signatures for prefiltering and filtering. The weighting given to the Table I criteria for prefiltering was set empirically at (0.3, 0.4, 0.15, 0.15) and for filtering at (0.15, 0.4, 0.15, 0.3). For the first, the third, and the fourth criteria listed in Table I, we used the criteria values directly. For the second criterion, which deals with viewpoint invariance, to express the more nonlinear relationship between the filtering signatures and how they depend on viewpoint invariance, we first translated viewpoint invariance into what may be referred to as "viewpoint dependence" via (100% - viewpoint invariance rate). It is this value that enters the weighted sum for filtering. Based on this table, we chose the VS||HS signature for prefiltering and the RS signature for filtering.1

IV. EXPERIMENTS WITH A TWO-STAGE FILTER CASCADE

Now that the reader knows how we select the signature types to use for the prefiltering stage and for the filtering stage, in this section we will present experimental support for a signature-based two-stage filter cascade for solving the PRSL problem.

In the training phase of the robot, the VS||HS and RS signatures are constructed from the 3D-JUDOCA features extracted from the stereo pairs of images recorded by the robot at known locations in the indoor environment of interest. After the training phase is over, the 2-stage filtering cascade that we present in this section works in real time to help the robot localize itself when taken to a random locale in the space that was used in the training phase.

For the sake of validating our 2-stage approach in a reproducible manner, the results we present in this section will be based on a large indoor database of 6209 pairs of stereo images recorded by our robot with a sampling interval of 25 cm along the hallways. This data was collected in the hallways of three buildings at Purdue University.2 The total length of the hallways used for the collection of this data is 1539 m. The camera images are recorded at a resolution of 640 × 480 pixels. To help the reader visualize these hallways, the upper half of Fig. 4 shows a conjoined depiction of the four hallways in the three separate buildings. This depiction was constructed from the data produced by a laser range sensor (mounted near the bottom of the robot) that returns range points at a fixed height above the floor. The lines fitted to the range points typically fall on the walls of the hallways. The bottom half of Fig. 4 shows a full textured 3D reconstruction for two of the four hallways. These were produced by the framework presented in [11].

The VS||HS and RS signatures are computed for all of the 6209 locales based on the 3D-JUDOCA features. The average computation time of the 3D-JUDOCA features per locale was around one second, and the average computation

1The reader may argue that the empirically set weights for the different criteria may be suboptimal. Choosing these weights by analyzing the distribution of the values for the different criteria in Table I is a future goal of our research.

2This dataset is publicly available at https://engineering.purdue.edu/RVL/Research/PlaceRecognition/index.html.


Fig. 4: The top frame shows a 2D range line map of the system of interior hallways that was constructed from the 6209 laser scans. The 6209 stereo pairs of images in this system of interior hallways are used in the evaluation of the signature-based framework for the solution of PRSL. The lower frame shows our PowerBot robot used in acquiring all of the data and for testing. Also shown are two samples of 3D map reconstructions of some sections of the hallways.

time for constructing the VS||HS and RS signatures perlocale was around 0.45 sec. This is using a 2.67 GHz PCclass machine in which the computation of the signatures iscarried out using Matlab and feature computations with C++based code. The length of each VS||HS signature was 55 bins— 27 bins for the height-based histogram (VS) spanning 2.7m, which represents ceiling to floor height, 27 bins for thewidth-based histogram (HS) spanning 5.4 m of the wallsvisible to the robot in the stereo image and 1 bin to keeptrack of the locale ID. For the RS, 64 bins were used — 60bins that span the field-of-view of the stereo camera in 1◦

sampling interval, 1 bin to keep track of the locale ID, and 3 bins to store the position (x, y) and heading φ, in real units, of the robot specific to the locale.
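For concreteness, the bin layouts described above can be sketched in code. The following is an illustrative sketch only; the function names, coordinate conventions, and exact bin ranges (e.g., heights in 0 to 2.7 m, lateral extent of ±2.7 m, and a ±30° field of view) are our assumptions for illustration.

```python
import numpy as np

def vs_hs_signature(points_3d, locale_id):
    """Illustrative sketch of the 55-bin VS||HS signature layout.

    points_3d: (N, 3) array of stereo-triangulated feature points,
    where column 0 is the lateral position along the wall (assumed)
    and column 2 is the height above the floor (assumed).
    """
    pts = np.asarray(points_3d, dtype=float)
    # VS: 27-bin height histogram spanning 2.7 m (floor to ceiling).
    vs, _ = np.histogram(pts[:, 2], bins=27, range=(0.0, 2.7))
    # HS: 27-bin width histogram spanning 5.4 m of visible wall.
    hs, _ = np.histogram(pts[:, 0], bins=27, range=(-2.7, 2.7))
    # 27 + 27 + 1 locale-ID bin = 55 bins total.
    return np.concatenate([vs, hs, [locale_id]])

def rs_signature(bearings_deg, locale_id, pose_xy_phi):
    """Illustrative sketch of the 64-bin RS signature layout."""
    # 60 bins spanning the stereo camera's field of view at 1-degree steps.
    rs, _ = np.histogram(np.asarray(bearings_deg), bins=60,
                         range=(-30.0, 30.0))
    # 60 + 1 locale-ID bin + 3 pose bins (x, y, heading) = 64 bins.
    return np.concatenate([rs, [locale_id], list(pose_xy_phi)])
```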

For training and testing of our proposed framework, we randomly selected 20% of the dataset for testing (1241 locales) and the remaining 80% for training (4968 locales).3

All of our reported results are based on the above training/testing percentages, selected randomly and averaged over 50 random runs. The localization error of the robot, the recognition rate, and the average recognition time are the performance metrics used to evaluate our results. Specifically, for the localization error, we categorized the position error into 20 categories, each corresponding to an increment of 1 m of distance up to 20 m, and for each category we computed the recognition rate.
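The per-cutoff recognition-rate metric just described can be sketched as follows (an illustrative implementation; the function name and array conventions are assumptions):

```python
import numpy as np

def recognition_rate_by_cutoff(pred_xy, true_xy, max_m=20):
    """Rank-1 recognition rate for cutoff distances of 1..max_m meters.

    pred_xy, true_xy: (N, 2) arrays of predicted and ground-truth
    robot positions for N test locales.
    Returns rates[k] = fraction of queries localized within (k+1) m.
    """
    err = np.linalg.norm(np.asarray(pred_xy) - np.asarray(true_xy), axis=1)
    return np.array([(err <= c).mean() for c in range(1, max_m + 1)])
```

Averaging this curve over the 50 random train/test splits yields the plots shown in Figs. 5 and 6.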

Since the performance of the selected VS||HS and RS signatures is obviously predicated on the choice of features used in constructing them, Fig. 5 shows the performance of

3Dividing the dataset in this manner may seem strange in the context of validating a robot localization framework. An “ideal” method for such validation would have the robot wander around in the hallways and attempt self-localization at random locations. Unfortunately, such an experimental validation approach would not be reproducible; hence the approach outlined in this section, in which a portion of the recorded data is used for training and the rest for testing. From an experimental viewpoint, what makes our approach work is that we have chosen a very small interval (25 cm) between the successive locales for recording the training data and a relatively large interval (1 m) for declaring a test locale to be the same as a training locale.

our proposed framework, based on the EMD distance metric, when four different stereo-triangulated feature types are used for constructing the signatures: 3D-JUDOCA, SIFT, BRISK, and SURF. As is clear, when the 3D-JUDOCA features were used, a recognition rate of about 84% is obtained for up to 1 m of localization error, which is the highest rate achieved among all the features. The computed average recognition time per test locale was around 1.5 sec for the different feature types. This rate goes up significantly when we also use RANSAC for rejecting the outlier signatures in both the queries and the database.

Fig. 6 demonstrates the performance of our signature-based framework using three different distance metrics: EMD (Earth Mover's Distance), Bhattacharyya, and Hamming. EMD gives us the best performance, but with the slowest average recognition time (1.5 sec) due to the complexity of the algorithm. Hamming, on the other hand, yields the lowest recognition performance but with an extremely fast average recognition time (0.02 sec), because only XOR comparisons need to be carried out to compute the distance. The Hamming metric is applied to binary versions of the absolute values of the histogram differences. That is, if a bin has a non-zero value, positive or negative, it gets a 1 in the binary representation; otherwise it gets a 0. Lastly, for the Bhattacharyya distance, the average recognition time was 0.45 sec.
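The Bhattacharyya and Hamming comparisons described above can be sketched as follows (an illustrative version only; a Hellinger-style form of the Bhattacharyya distance on normalized histograms is assumed):

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya-based distance between two histogram signatures."""
    # Normalize to probability distributions, then compute the
    # Bhattacharyya coefficient (sum of sqrt of bin-wise products).
    p = np.asarray(h1, float) / np.sum(h1)
    q = np.asarray(h2, float) / np.sum(h2)
    bc = np.sum(np.sqrt(p * q))
    # Hellinger-style distance in [0, 1]; 0 for identical histograms.
    return np.sqrt(max(0.0, 1.0 - bc))

def hamming_distance(h1, h2):
    """Hamming distance on the binarized histogram difference."""
    # Binarize the absolute difference: a non-zero bin (positive or
    # negative) becomes 1, a zero bin becomes 0; then count the ones,
    # which is equivalent to an XOR popcount on binary signatures.
    return int(np.count_nonzero(np.asarray(h1) != np.asarray(h2)))
```

The Hamming version touches each bin once with a single comparison, which is why it runs orders of magnitude faster than EMD.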

Now we want to compare the signature-based approach proposed in this paper with purely feature-based approaches that compare a query pair of stereo images with locale-specific pairs of stereo images on the basis of 3D features (3D-SIFT, 3D-SURF, 3D-BRISK, and 3D-JUDOCA) extracted from the images. Correspondences between the descriptor vectors in the query pair of stereo images and the database pair of stereo images are established through the minimization of the Euclidean distances between the descriptor vectors, while relying on RANSAC for outlier rejection. In what follows, we will refer to this feature-based approach as the “least-squares


Fig. 5: The performance of the signature-based framework using different feature types (3D-JUDOCA, SIFT, SURF, and BRISK) in the matching algorithm; rank-1 recognition rate vs. cutoff localization error distance (meters).

Fig. 6: The performance of the signature-based framework using different distance metrics (EMD, Bhattacharyya, and Hamming); rank-1 recognition rate vs. cutoff localization error distance (meters).

with RANSAC” method. The descriptors associated with the 3D features (meaning, 3D-SIFT, 3D-SURF, 3D-BRISK, and 3D-JUDOCA) were constructed from their popular 2D versions, except for the case of 3D-JUDOCA, for which the descriptors were computed as described in [1]. We only recorded the recognition rate for up to 1 m positional error and 5° heading error bounds. We used the same experimental setup as mentioned earlier in this section, i.e., 20% of the dataset randomly selected for testing and the remaining 80% for training. The results are summarized in Table IV. As shown in Table IV, both 3D-JUDOCA and 3D-SURF provide the best results for purely feature-based localization, with recognition rates of 96.393% and 97.07%, respectively. For feature-based localization, the average recognition time (not including the time for computing the descriptors) per query image turns out to be a function of the total number and the size of the descriptors. As is clear from Table IV, the performance of the purely feature-based approaches is, more or less, comparable to that of the signature-based approach that incorporates least-squares with RANSAC for final selection, as we explain in the next paragraph. However, note how much faster the signature-based localization is compared to the feature-based localization. Feature-based approaches also need to organize and index a database composed of a large number of descriptors [17], [19].
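The nearest-neighbor descriptor matching at the core of this least-squares approach can be sketched as follows (an illustrative brute-force version; the subsequent RANSAC step, which rejects correspondences inconsistent with a rigid-transform model, is omitted here):

```python
import numpy as np

def match_descriptors(query_desc, db_desc):
    """Greedy nearest-neighbor descriptor matching by Euclidean distance.

    query_desc: (Nq, D) descriptor vectors from the query stereo pair.
    db_desc:    (Nd, D) descriptor vectors from a database stereo pair.
    Returns a list of (query_index, db_index) correspondences, one per
    query descriptor, chosen by minimizing the Euclidean distance.
    """
    matches = []
    for i, d in enumerate(np.asarray(query_desc, dtype=float)):
        dists = np.linalg.norm(np.asarray(db_desc, dtype=float) - d, axis=1)
        matches.append((i, int(np.argmin(dists))))
    return matches
```

In practice, the locale whose database images yield the most RANSAC-consistent correspondences with the query is declared the match; the per-query cost grows with the number and dimensionality of the stored descriptors, which is the source of the long recognition times in Table IV.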

To elaborate on the rows in Table IV where we mention 'signature-based' along with RANSAC: the idea is to choose a small number of top-ranked locales returned by the filter cascade and then apply the feature-based matching approach described in the previous paragraph to find the best match between the query stereo images and the stereo images for those locales. This match can be subject to RANSAC-type outlier rejection, as mentioned previously. Let nl represent the number of locales retained from the list returned by signature-based matching because their match scores are above a user-set threshold. From this list, we choose nm locales for feature-based matching. We found experimentally that if nm = 40 = 0.8% × nl, the recognition rate can be improved to 91% with a total average recognition time of 2.31 sec. This result is demonstrated in Fig. 7 and is achieved by using the EMD distance metric for the initial signature-based match and 3D-SURF as the base feature for the final feature-based match (with RANSAC for outlier rejection, as explained previously). We carried out several other experiments, as shown in Fig. 7, to see whether the recognition rate could be improved further by increasing nm. We found that if nm = 150 = 3% × nl, the recognition rate improves to 95.727% with a recognition time of 2.5772 sec. A more interesting finding is that if we apply the Hamming distance metric to the binarized signatures instead of the EMD metric, as shown in Fig. 8, set nm = 3% × nl, and at the same time use the least-squares feature-based method incorporating RANSAC for final selection, the recognition rate can be improved from 74% (Fig. 6) to 93.79% with a recognition time of 0.8994 sec. This recognition time is extremely fast compared to the 2.5772 sec when using EMD, as mentioned above, and compared to the purely feature-based approaches listed in Table IV with comparable recognition rates. Therefore, in light of these results, the signature-based two-stage filter cascade framework appears promising in terms of scalability and robustness. Our signature-based framework using the Hamming distance metric applied to binarized signatures to create a pool of top-ranked locale candidates, followed by feature-based matching for the final selection, has the best time performance; however, it ranks fifth in terms of recognition rate.
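The selection of the nm candidates from the nl thresholded locales can be sketched as follows (an illustrative version; the function name and the higher-is-better score convention are assumptions):

```python
def select_candidates(scores, threshold, fraction=0.03):
    """Choose n_m top candidates from the n_l thresholded locales.

    scores: dict mapping locale_id -> signature match score, where a
    higher score means a better match (assumed convention).
    Locales scoring above `threshold` form the n_l pool; the top
    fraction * n_l of that pool (n_m locales, e.g. 3% as in the text)
    are passed on to feature-based matching with RANSAC.
    """
    pool = [(lid, s) for lid, s in scores.items() if s > threshold]  # n_l
    pool.sort(key=lambda t: t[1], reverse=True)
    n_m = max(1, round(fraction * len(pool)))
    return [lid for lid, _ in pool[:n_m]]
```

With fraction = 0.008 this reproduces the nm = 0.8% × nl setting, and with fraction = 0.03 the nm = 3% × nl setting discussed above.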

V. CONCLUSIONS

This paper presented a signature-based cascaded filtering framework for fast PRSL in interior hallways. We presented a set of criteria for assessing the suitability of signatures


TABLE IV: The recognition rate and average recognition time for the randomly selected test images

Method                                                 | Recognition rate for up to 1 m localization error distance | Average recognition time (sec) | Signature used in prefiltering | Signature used in filtering
Signature-Based Cascaded Filters (EMD)                 | 84%                                                         | 1.5                            | VS||HS                         | RS
Signature-Based Cascaded Filters (EMD) with RANSAC     | 95.727%                                                     | 2.5772                         | VS||HS                         | RS
Signature-Based Cascaded Filters (Hamming) with RANSAC | 93.79%                                                      | 0.8994                         | VS||HS                         | RS
3D-JUDOCA with RANSAC                                  | 96.393%                                                     | 10.1527                        | -                              | -
3D-SIFT with RANSAC                                    | 91.8449%                                                    | 30.37                          | -                              | -
3D-SURF with RANSAC                                    | 97.07%                                                      | 10.2457                        | -                              | -
3D-BRISK with RANSAC                                   | 94.93%                                                      | 8.5261                         | -                              | -

Fig. 7: The performance of the signature-based framework with RANSAC using the EMD distance metric as a function of nm (for nm at 0.5%, 0.8%, 1%, 2%, 3%, 4%, and 5% of nl); recognition rate for up to 1 m localization error vs. matching time (seconds).

Fig. 8: The performance of the signature-based framework with RANSAC using the Hamming distance metric as a function of nm (for nm at 0.5%, 0.8%, 1%, 2%, 3%, 4%, and 5% of nl); recognition rate for up to 1 m localization error vs. matching time (seconds).

(or subsets thereof) for prefiltering and for filtering. In our experimental work, we used 3D-JUDOCA features and showed that the best signatures derived from these features give the robot good self-localization performance.

REFERENCES

[1] K. M. Ahmad Yousef, J. Park, and A. C. Kak. An Approach-Path Independent Framework for Place Recognition and Mobile Robot Localization in Interior Hallways. In 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2013.

[2] H. Andreasson, A. Treptow, and T. Duckett. Localization for mobile robots using panoramic vision, local features and particle filter. In ICRA 2005, pages 3348–3353. IEEE, 2005.

[3] P. Bahl and V. Padmanabhan. RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, volume 2, pages 775–784. IEEE, 2000.

[4] K. Briechle and U. D. Hanebeck. Localization of a mobile robot using relative bearing measurements. IEEE Transactions on Robotics and Automation, 20(1):36–44, Feb. 2004.

[5] C. Detweiler, J. Leonard, D. Rus, and S. Teller. Passive mobile robot localization within a fixed beacon field. In 2006 Workshop on the Algorithmic Foundations of Robotics, Jul. 2006.

[6] R. Elias and A. Elnahas. An accurate indoor localization technique using image matching. In Intelligent Environments, 2007. IE 07. 3rd IET International Conference on, pages 376–382. IET, 2007.

[7] R. Elias and A. Elnahas. Fast localization in indoor environments. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, pages 1–6. IEEE, 2009.

[8] R. Elias and R. Laganière. JUDOCA: Junction detection operator based on circumferential anchors. IEEE Transactions on Image Processing, 21(4):2109–2118, 2012.

[9] F. Fraundorfer, C. Wu, J. Frahm, and M. Pollefeys. Visual word based location recognition in 3D models using distance augmented weighting. In Fourth International Symposium on 3D Data Processing, Visualization and Transmission, volume 2, 2008.

[10] A. Irschara, C. Zach, J. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR 2009, pages 2599–2606. IEEE, 2009.

[11] H. Kwon, K. M. Ahmad Yousef, and A. C. Kak. Building 3D Visual Maps of Interior Space with a New Hierarchical Sensor-Fusion Architecture. Robotics and Autonomous Systems, 2013.

[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[13] R. Marie, O. Labbani-Igbida, and E. M. Mouaddib. Invariant signatures for omnidirectional visual place recognition and robot localization in unknown environments. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2537–2540. IEEE, 2012.

[14] T. Nagasaka, K. Tanaka, T. Ishimaru, and I. Uesaka. Robot self-localization using simulated experience. In SICE Annual Conference 2010, Proceedings of, pages 1993–1998. IEEE, 2010.

[15] O. Pele and M. Werman. Fast and robust earth mover's distances. In ICCV, 2009.

[16] A. Pronobis, B. Caputo, P. Jensfelt, and H. Christensen. A realistic benchmark for visual indoor place recognition. Robotics and Autonomous Systems, 58(1):81–96, 2010.

[17] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt. Image Retrieval for Image-Based Localization Revisited. In Proceedings of the British Machine Vision Conference, pages 76.1–76.12. BMVA Press, 2012.

[18] A. Tapus and R. Siegwart. Incremental robot mapping with fingerprints of places. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005), pages 2429–2434, 2005.

[19] C. Wu, F. Fraundorfer, J. Frahm, J. Snoeyink, and M. Pollefeys. Image localization in satellite imagery with feature-based indexing. ISPRS, 2008.