
IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 7, JULY 2013 725

Multi-Scale Descriptor for Robust and Fast Camera Motion Estimation

Changhun Sung and Myung Jin Chung

Abstract—This letter presents a robust and fast 6D camera motion estimation system that combines a FAST detector with a Multi-scale Descriptor (MSD). For a robust and fast camera motion estimation system, the descriptor is an especially important component, as it is the most time-consuming step in motion estimation. Furthermore, the properties of the descriptor significantly affect the precision of the estimated motion. In this letter, we present a descriptor that requires a low computational burden and offers high precision for robust and fast motion estimation. The descriptor provides high precision through the use of multi-scale gradient information and computational efficiency by employing four integral images and an intensity gradient. We conclude this letter with an evaluation of the proposed motion estimation in comparison with different types of descriptors in practical situations. The results show that the proposed motion estimation provides good performance over previous methods in terms of processing time and precision of estimated camera motion.

Index Terms—Descriptor, motion estimation (ME).

I. INTRODUCTION

MOTION ESTIMATION (ME) algorithms calculate the motion of a camera through its environment by evaluating captured images. ME is essential for many applications, e.g., Structure from Motion (SFM) and Simultaneous Localization and Mapping (SLAM), as it can estimate motion reliably in adverse conditions or uneven terrain.

Interest point detection and feature description form the basis of feature-based motion estimation, and a variety of algorithms for these tasks have been proposed [1], [2], [4]. Feature-based ME generally consists of four steps: interest point detection, descriptor generation, feature matching, and motion estimation. For the descriptor generation and feature matching steps, there are two main approaches. The first is to find features in one image and track them in the following images. The second is to detect features independently in all the images and match them based on a similarity metric between their descriptors. Current research in ME has largely opted for the latter approach, as it is more suitable when a large motion or viewpoint change is expected.

In feature-based ME for large-scale environments, the descriptor is an especially important component, because descriptor-based matching results significantly affect the precision of the estimated motion, and the computation of the descriptor is the most time-consuming part of the overall visual ME process. The most widely used descriptor is SIFT (Scale-Invariant Feature Transform) [3], owing to its good performance [5]. However, SIFT is computationally expensive. To reduce the computational time, Bay et al. proposed SURF (Speeded Up Robust Features) [6]. Although the SURF descriptor provides fast computation, it proved to be less distinctive than SIFT in a comparative study [4].

For the purpose of realizing robust and fast ME, a different type of descriptor that is both fast and robust is needed. In this letter, we describe the development of a novel descriptor that can be utilized for robust and fast ME.

Manuscript received December 17, 2012; revised April 09, 2013; accepted May 14, 2013. Date of publication May 22, 2013; date of current version June 04, 2013. This work was supported by the MKE (Ministry of Knowledge Economy), Korea, under the Human Resources Development Program for Convergence Robot Specialists support program supervised by the NIPA (National IT Industry Promotion Agency). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Dimitrios Androutsos.

C. Sung is with The Robotics Program, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: [email protected]).

M. J. Chung is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LSP.2013.2264672

Fig. 1. Overview of the proposed motion estimation system from frame t-1 to t.

II. PROPOSED VISUAL MOTION ESTIMATION USING MSD

An overview of the proposed ME scheme is illustrated in Fig. 1. Our approach for ME consists of four main steps. First, we extract corners in the input image using the FAST detector [8], which is known to be one of the fastest corner detectors. We then compute the robust descriptor, called MSD, for each selected corner by combining predefined multi-scales. After computing the feature descriptors, we find a set of putative matches between four images, namely, the left and right images of two consecutive frames, by using a kd-tree. Finally, we estimate the relative camera motion by minimizing the 3D-to-2D reprojection error using standard Gauss-Newton optimization within a Random Sample Consensus (RANSAC) scheme.
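For concreteness, the corner-extraction step can be sketched with OpenCV's FAST implementation as below. This is an illustrative sketch only; the detector threshold is an assumed value, not one reported in this letter.

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Corner extraction with the FAST detector [8] (illustrative sketch;
// the threshold value is an assumption, not taken from the letter).
std::vector<cv::KeyPoint> extractCorners(const cv::Mat& grayImage)
{
    std::vector<cv::KeyPoint> corners;
    cv::FAST(grayImage, corners, /*threshold=*/20, /*nonmaxSuppression=*/true);
    return corners;
}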

A. Introduction of Multi-Scale Descriptor

A descriptor represents the neighborhood of every corner by a feature vector so as to find correspondence points. The most important properties of a descriptor are its distinctiveness and computational speed. The SIFT and SURF descriptors are widely used for detector-descriptor-based motion estimation. The SURF descriptor is similar to SIFT in concept, in that both focus on the spatial distribution of gradient information. Nevertheless, SIFT is more distinctive than SURF because it represents sub-regions by a gradient histogram instead of a gradient sum. However, SURF, which is based on an integral image, is computationally faster than SIFT, since computing the gradient histogram requires more effort. Hence, neither descriptor is sufficient for fast and robust motion estimation.

In order to improve the distinctiveness of the descriptor, we extend the scale space of the descriptor from a fixed single scale, as in the cases of SIFT and SURF, to multiple scales. The proposed descriptor, the Multi-scale Descriptor (MSD), is calculated based on the sum of the intensity gradient over multiple scales so as to represent different scales of the same corner. This makes the descriptor more distinctive, because correspondence points are found by comparing three descriptors at different scales instead of one fixed-scale descriptor.

Fig. 2. Computation of MSD using four integral images with the three different scales ($s_1$, $s_2$, and $s_3$). Each patch represents a different scale. $k$ is the scale factor.

As shown in Fig. 2, the MSD consists of three predefined scale descriptors at each corner. Each patch is divided regularly into $3 \times 3$ square sub-regions around the selected corner. The size of each patch is defined by the scale factor $k$: $s_n = k \, s_{n+1}$ ($0 < k < 1$), where $s_n$ is the $n$th scale. Each sub-region has a 4D descriptor vector $v = (\sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y|)$, similar to the SURF descriptor. Concatenating this for all $3 \times 3$ sub-regions of each scale, a descriptor vector of length 108 is obtained. In order to make it invariant to contrast, the MSD is transformed into a unit vector.
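As a quick consistency check on the descriptor length (the same arithmetic underlies the MSD-72/108/144 naming used in Section III-A):

$\dim(\mathrm{MSD}) = \underbrace{3}_{\text{scales}} \times \underbrace{(3 \times 3)}_{\text{sub-regions}} \times \underbrace{4}_{(\sum d_x,\ \sum d_y,\ \sum|d_x|,\ \sum|d_y|)} = 108,$

and, analogously, $2 \times 9 \times 4 = 72$ and $4 \times 9 \times 4 = 144$ for the two- and four-scale variants.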

B. Creating MSD Using Integral Images

MSD consists of multiple descriptors for each corner. Hence, computation of MSD is a time-consuming step. In order to significantly improve the computational speed, we construct four integral images [9] for each input image (left and right). We then calculate the descriptor at each scale using the four constructed integral images.

Each integral image is defined as follows:

$II_g(x, y) = \sum_{i \le x} \sum_{j \le y} g(i, j), \qquad g \in \{ d_x,\ d_y,\ |d_x|,\ |d_y| \} \qquad (1)$

The entry of an integral image $II_g$ at a location $(x, y)$ represents the sum of the intensity gradient, or absolute intensity gradient, within the rectangular region formed by the origin and $(x, y)$. $d_x(x, y)$ and $d_y(x, y)$ represent the intensity gradients in the horizontal and vertical directions at $(x, y)$, each computed as a single pixel difference of the image intensity. Here, $I(x, y)$ represents the image intensity at $(x, y)$.

Note that the use of integral images provides computational efficiency. Once the four integral images are calculated, only three additions are required to calculate the sum of the intensity gradient over a sub-region of any size. Therefore, we can compute a descriptor at any scale at low computational cost.

We compute the MSD from the 9 sub-regions at the three different scales using the four computed integral images as follows:

$\sum_{(x, y) \in R_m^{s_n}} g(x, y) = II_g(x_2, y_2) - II_g(x_1, y_2) - II_g(x_2, y_1) + II_g(x_1, y_1) \qquad (2)$

where $g \in \{ d_x,\ d_y,\ |d_x|,\ |d_y| \}$, $m \in \{1, \ldots, 9\}$, and $n \in \{1, 2, 3\}$. $D_{s_n} = (v_1, \ldots, v_9)$ represents the descriptor of the scale $s_n$. Here, $(x_1, y_1)$ and $(x_2, y_2)$ represent the top-left and bottom-right coordinates of the sub-region $R_m^{s_n}$, respectively. Therefore, each descriptor vector $v_m = (\sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y|)$ is calculated from the four integral images $II_{d_x}$, $II_{d_y}$, $II_{|d_x|}$, and $II_{|d_y|}$, respectively. In particular, $\sum |d_x|$ of the sub-region $R_m^{s_n}$ at scale $s_n$ is calculated by $II_{|d_x|}(x_2, y_2) - II_{|d_x|}(x_1, y_2) - II_{|d_x|}(x_2, y_1) + II_{|d_x|}(x_1, y_1)$.
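The following C++/OpenCV sketch illustrates (1) and (2) end to end. It is our reconstruction under stated assumptions, not the authors' reference code: the names are ours, the gradient is an assumed single forward difference, and border handling is omitted. The patch sizes 9, 15, and 25 follow Section III.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Sketch of MSD computation, reconstructed from (1)-(2); not the
// authors' reference implementation. Border handling is omitted.
struct MsdExtractor {
    cv::Mat ii[4];  // integral images of dx, dy, |dx|, |dy|, cf. (1)

    void precompute(const cv::Mat& gray8u) {
        cv::Mat gray, dx, dy;
        gray8u.convertTo(gray, CV_32F);
        // Assumed single-difference gradients: one subtraction per pixel,
        // the cost advantage argued over Haar wavelet responses.
        cv::Mat kx = (cv::Mat_<float>(1, 2) << -1.f, 1.f);
        cv::Mat ky = (cv::Mat_<float>(2, 1) << -1.f, 1.f);
        cv::filter2D(gray, dx, CV_32F, kx);
        cv::filter2D(gray, dy, CV_32F, ky);
        cv::integral(dx, ii[0], CV_64F);           // II_dx
        cv::integral(dy, ii[1], CV_64F);           // II_dy
        cv::integral(cv::abs(dx), ii[2], CV_64F);  // II_|dx|
        cv::integral(cv::abs(dy), ii[3], CV_64F);  // II_|dy|
    }

    // Gradient sum over [x1,x2) x [y1,y2): three additions, as in (2).
    double boxSum(int c, int x1, int y1, int x2, int y2) const {
        const cv::Mat& S = ii[c];
        return S.at<double>(y2, x2) - S.at<double>(y1, x2)
             - S.at<double>(y2, x1) + S.at<double>(y1, x1);
    }

    // 108-D MSD at a corner: 3 scales x (3x3 sub-regions) x 4 sums,
    // normalized to unit length for contrast invariance.
    std::vector<float> describe(int cx, int cy) const {
        static const int patch[3] = {9, 15, 25};  // s1, s2, s3 (k = 0.6)
        std::vector<float> d;
        d.reserve(108);
        for (int s = 0; s < 3; ++s) {
            double cell = patch[s] / 3.0;
            int x0 = cx - patch[s] / 2, y0 = cy - patch[s] / 2;
            for (int ry = 0; ry < 3; ++ry)
                for (int rx = 0; rx < 3; ++rx) {
                    int x1 = x0 + int(rx * cell), x2 = x0 + int((rx + 1) * cell);
                    int y1 = y0 + int(ry * cell), y2 = y0 + int((ry + 1) * cell);
                    for (int c = 0; c < 4; ++c)
                        d.push_back(float(boxSum(c, x1, y1, x2, y2)));
                }
        }
        double n = cv::norm(d, cv::NORM_L2);
        if (n > 0) for (float& v : d) v = float(v / n);
        return d;
    }
};

In the pipeline of Fig. 1, precompute() runs once per left/right image and describe() once per FAST corner, so the per-corner cost is dominated by a handful of additions.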

SURF uses Haar wavelet responses to calculate the image gradient, which can be computed efficiently at the scale of the feature. However, Haar wavelet responses are redundant in corner-based motion estimation: because corners are not scale invariant, the Haar wavelet filter no longer needs to account for the keypoint scale. MSD hence uses an intensity gradient instead of Haar wavelet responses to improve computational efficiency without sacrificing performance, since Haar wavelet responses need seven operations, whereas only one operation is needed to compute the intensity gradient in the $x$ or $y$ direction. We verify this in Section III-B by comparing a SURF descriptor based on an intensity gradient sum, called G-SURF, with the original SURF based on Haar wavelet responses.
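The stated operation counts can be made explicit (our accounting; the letter gives only the totals). With an integral image, each Haar box sum costs three additions, and the response is the difference of two box sums, whereas the intensity gradient is a single pixel difference:

$h_x = \Box_{\mathrm{right}} - \Box_{\mathrm{left}}: \ 3 + 3 + 1 = 7 \ \text{operations}; \qquad d_x = I(x+1, y) - I(x, y): \ 1 \ \text{operation}.$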

Note that both MSD and SURF use a similar concept, the gradient sum, to describe the gradient distribution. Nevertheless, MSD is more distinctive and faster than SURF in practical motion estimation, as shown in Section III. This is because MSD represents a larger scale space of the descriptor and drastically reduces computation time by using four integral images and the intensity gradient instead of Haar wavelet responses.

C. Motion Estimation Based on MSD

This subsection discusses the 6D motion estimation. In order to estimate the motion parameters between two consecutive frames, we first find correspondence points between four images, namely, the left and right images of two consecutive frames ($t-1$ and $t$). As finding correspondence points between four images is a time-consuming step, various methods have been proposed [7]. We use a kd-tree based matching method because of its computational efficiency.

We first estimate correspondence points between the current left and current right images using a kd-tree constructed on the current left image. We then find matches between the previous (frame $t-1$) left and current (frame $t$) left images using the same kd-tree. Note that the proposed feature matching method provides not only computational efficiency but also high matching precision. Once the kd-tree is constructed on the current image, it can be used to find correspondence points between the left and right images of two consecutive frames efficiently, without any prediction of feature locations even under possibly large camera movement. This is possible because correspondence points are found solely by comparing distances between descriptor vectors.
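A minimal sketch of this matching scheme, assuming OpenCV's FLANN kd-tree: one index is built on the current-left descriptors and queried twice. The tree/check parameters and the plain 1NN acceptance rule are our assumptions (the letter reports 1NN precision but no further matching parameters).

#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>
#include <vector>

// One kd-tree over the current-left descriptors serves both queries:
// current-right (stereo) and previous-left (temporal) descriptors.
struct FourImageMatcher {
    cv::flann::Index tree;  // kd-tree built once per current left image

    explicit FourImageMatcher(const cv::Mat& curLeftDesc)  // NxD, CV_32F rows
        : tree(curLeftDesc, cv::flann::KDTreeIndexParams(4)) {}

    // For each query descriptor, return the index of its nearest
    // neighbour among the current-left descriptors (1NN matching).
    std::vector<int> match(const cv::Mat& queryDesc) {
        cv::Mat indices, dists;
        tree.knnSearch(queryDesc, indices, dists, 1, cv::flann::SearchParams(32));
        std::vector<int> nn(queryDesc.rows);
        for (int i = 0; i < queryDesc.rows; ++i) nn[i] = indices.at<int>(i, 0);
        return nn;
    }
};

// Usage: FourImageMatcher m(curLeftDesc);
//        auto stereo   = m.match(curRightDesc);   // current left <-> right
//        auto temporal = m.match(prevLeftDesc);   // previous <-> current left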

After the matching procedure, the image is broken up into square blocks and a limited number of features is selected in each block. This forces the features to span the image, which improves motion estimation accuracy.
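A possible realization of this bucketing step (block size and per-block cap are illustrative values; the letter does not report them):

#include <map>
#include <utility>
#include <vector>
#include <opencv2/core.hpp>

// Keep at most maxPerBlock features per blockSize x blockSize cell so
// that the retained features span the whole image.
std::vector<cv::Point2f> bucketFeatures(const std::vector<cv::Point2f>& pts,
                                        int blockSize = 50, int maxPerBlock = 4)
{
    std::map<std::pair<int, int>, int> perCell;
    std::vector<cv::Point2f> kept;
    for (const cv::Point2f& p : pts) {
        std::pair<int, int> cell(int(p.x) / blockSize, int(p.y) / blockSize);
        if (perCell[cell]++ < maxPerBlock)  // first-come selection; the
            kept.push_back(p);              // selection rule is unspecified
    }
    return kept;
}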

We reconstruct feature points from the previous frame in 3D via triangulation. Each estimated 3D point $X$ is then reprojected into the current images using the calibration parameters of the stereo camera rig. We then estimate the motion by iteratively minimizing the reprojection error $E$ with Gauss-Newton optimization:

$E = \sum_{i=1}^{N} \left[ d\left(x_l^{(i)}, P_l X^{(i)}\right)^2 + d\left(x_r^{(i)}, P_r X^{(i)}\right)^2 \right] \qquad (3)$

Here, $N$ denotes the number of points matched across the four images, and $d(\cdot, \cdot)$ represents the Euclidean distance between two image points. $P_l$ denotes the projection matrix that maps a 3D point $X$ to a pixel on the left image plane; similarly, $P_r$ is the projection onto the right image plane. $x_l$ and $x_r$ are the corner locations in the current left and right images, respectively. To provide robustness against outliers, we filter outliers in each iteration using the RANSAC approach.
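The stereo Gauss-Newton/RANSAC minimization of (3) is not released as code; as a rough, runnable stand-in, OpenCV's solvePnPRansac minimizes a monocular 3D-to-2D reprojection error under RANSAC. Note that it uses only the left-image term of (3), so it is an approximation of the two-view cost, and the RANSAC parameters below are assumed values.

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Stand-in pose step: RANSAC minimization of the left-image reprojection
// error. The right-image term of (3) is omitted in this approximation.
bool estimatePose(const std::vector<cv::Point3f>& X,   // points triangulated at t-1
                  const std::vector<cv::Point2f>& xl,  // matched corners at t (left)
                  const cv::Mat& K,                    // left-camera intrinsics
                  cv::Mat& rvec, cv::Mat& tvec,
                  std::vector<int>& inliers)
{
    return cv::solvePnPRansac(X, xl, K, cv::noArray(), rvec, tvec,
                              /*useExtrinsicGuess=*/false,
                              /*iterationsCount=*/100,
                              /*reprojectionError=*/2.0f,
                              /*confidence=*/0.99, inliers);
}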

III. EVALUATION

In order to evaluate the proposed motion estimation system in a more practical situation, we used three different large-scale environment datasets published in [10] ("Sequence_00", "Sequence_02", and "Sequence_06"), which provide ground-truth GPS/IMU data as well as stereo sequences at a resolution of 1240 x 376 pixels and 10 fps.

In order to evaluate the performance of the descriptors, we tested the proposed ME with different types of descriptors while keeping the other parts of the proposed ME, such as corner detection, feature matching, and pose estimation, unchanged.

Fig. 3. Precision of MSD with varying scale number and scale factor. 1NN precision is defined as the number of correct matches divided by the number of matches.

The descriptors that were evaluated were SIFT, SURF, and MSD. In order to assess the performance of a SURF descriptor based on an intensity gradient, we also evaluated G-SURF. We used the same patch size of 25 x 25 for a fair comparison when calculating the MSD $s_3$, SIFT, SURF, and G-SURF descriptors. Accordingly, the patch sizes of MSD $s_2$ and $s_1$ are 15 x 15 and 9 x 9, respectively.

A. Effect of Patch Number and Scale Factor

In this subsection, we evaluate the effects of some parameter settings. MSD describes a selected corner at multiple scales; hence, the number of scales and the scale factor have significant impacts on descriptor performance. In order to investigate these effects, we tested MSD with varying scale numbers and scale factors (Fig. 3). MSD-72, MSD-108, and MSD-144 were calculated using 2, 3, and 4 scales, respectively.

MSD-108 achieves similar performance to MSD-144. However, the dimension of the descriptor has a direct impact on the matching speed. Therefore, MSD-108 is optimal for fast matching with high precision. The scale factor $k = 0.6$ shows maximum precision for MSD-108. As a result, three different scales and a scale factor of $k = 0.6$ were used.

B. Performance Comparison

In Fig. 4, we compare our estimated motion trajectories with those of different descriptor-based ME schemes and with the 'ground truth' output of an OXTS RT 3003 GPS/IMU system on three challenging Karlsruhe datasets [10].

The MSD-based ME (MSD ME) outperformed the others with respect to processing time and estimated motion error for all sequences. The precision of the MSD descriptor is even higher than that of the SIFT descriptor, and the MSD ME is faster than both the SURF-based ME (SURF ME) and the G-SURF-based ME (G-SURF ME). This is due to the use of integral images and the sum of the intensity gradient. As expected, the SIFT descriptor shows higher precision than the SURF descriptor, while the SURF ME is faster than the SIFT-based ME (SIFT ME).

In order to analyze the accuracy of each ME, we computed translational and rotational errors using the evaluation code provided by KITTI [10]. The MSD ME showed good performance over all test sequences compared to the other descriptor-based ME schemes in terms of both rotation and translation error. This is due to the high precision of MSD compared to the other descriptors.


Fig. 4. Results of the MSD ME compared with different types of descriptor-based motion estimation and 'ground truth'. (a) Sequence_00 (4541 frames, SLRC (sum of the left and right image average corner count): 2640). (b) Sequence_02 (4661 frames, SLRC: 3572). (c) Sequence_06 (1101 frames, SLRC: 1789).

TABLE I
PERFORMANCE COMPARISON

AP: Average 1NN Precision (number of correct matches / number of matches)
APT: Average Processing Time [ms]
ATE: Average Translation Error [%]
ARE: Average Rotation Error [deg/m]

TABLE II
COMPUTATION TIME (TIME PER FRAME [ms])

Note that the precision of the descriptor affects the accuracy of pose estimation: within the same number of RANSAC iterations, high descriptor precision gives a better chance of estimating an accurate pose than low descriptor precision.

C. Computation Time

To measure computation time, we used "Sequence_02". In Table II, the computation time of the corner detection and of the descriptor is reported as the sum over the left and right images.

Algorithms were implemented in C++ using OpenCV. We used the openSURF library provided by Evans;^1 SIFT was implemented using the Integrating Vision Toolkit (IVT).^2 All evaluations were run on a PC with a 2.8 GHz CPU and 2 GB of RAM.

As expected, Table II shows that the computation of the descriptor is the most time-consuming part of the overall visual ME. The computation of MSD was more than 3, 8, and 13 times faster than that of the G-SURF, SURF, and SIFT descriptors, respectively. The total processing time of the MSD ME was, on average, 1.7, 3.4, and 5.5 times lower than that of the G-SURF, SURF, and SIFT based ME, respectively, as seen in Table II.

IV. CONCLUSION

We have presented a robust and real-time camera motion estimation system that combines the FAST detector with the proposed descriptor, MSD, which provides fast computation and high precision. The obtained results indicate that ME based on MSD is suitable for applications where both the accuracy of the motion estimate and the computation time are vital.

REFERENCES

[1] E. Tola, V. Lepetit, and P. Fua, "DAISY: An efficient dense descriptor applied to wide-baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815-830, 2010.

[2] M. Heikkilä, M. Pietikäinen, and C. Schmid, "Description of interest regions with local binary patterns," Pattern Recognit., 2009.

[3] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Computer Vision, Kerkyra, Greece, 1999.

[4] S. Gauglitz, T. Höllerer, and M. Turk, "Evaluation of interest point detectors and feature descriptors for visual tracking," Int. J. Comput. Vis., 2011.

[5] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003, vol. 2.

[6] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Computer Vision, 2006.

[7] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, "LDAHash: Improved matching with smaller descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 66-78, 2012.

[8] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Computer Vision, 2006.

[9] P. A. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.

[10] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.

^1 http://www.code.google.com/p/opensurf1/
^2 http://www.ivt.sourceforge.net