
More vision for SLAM

Simon Lacroix, Thomas Lemaire and Cyrille Berger

LAAS-CNRS, Toulouse, France

{firstname.name}@laas.fr

Abstract— Much progress has been made on SLAM so far, and various visual SLAM approaches have proven their effectiveness in realistic scenarios. However, many other improvements can be brought to SLAM thanks to vision, essentially for the mapping and data association functionalities. This paper sketches some of these improvements, on the basis of recent work in the literature and of on-going work.

I. INTRODUCTION

SLAM has been identified as a key problem in mobile robotics for over 20 years [1], [2], and has received much attention since, especially over the last 10 years – an overview of the problem and of the main proposed solutions can be found in [3]. Dozens of robots now use on-board SLAM solutions on an everyday basis in laboratories.

The first SLAM solutions concerned robots evolving on a 2D plane that perceive the environment with a laser range finder. It is only quite recently that solutions to SLAM using vision have been proposed: first using stereovision [4], [5], and then with monocular cameras. A large number of contributions to the latter problem have rapidly been proposed since the pioneering work of [6] (see e.g. [7], [8], [9], [10]), and a commercial software package has been available since 2005 [11] – though it is only applicable to robots evolving on a 2D plane.

There are many reasons to use vision for SLAM. Besides the advantages of using a small, low-cost and lightweight sensor, vision offers the benefit of perceiving the environment in a 3D volume, up to infinite distances, and of providing plenty of information relevant to analyzing the perceived scenes. Last – but certainly not least – the computer vision community has provided numerous formalisms and algorithmic solutions to a collection of essential problems (feature detection and matching, structure from motion, image segmentation and classification, object recognition and scene interpretation, image indexing...).

Still, vision-based SLAM solutions do not exploit all the possibilities brought by vision. The goal of this paper is to explore these possibilities, on the basis of on-going developments made in our lab. The next section briefly reviews the SLAM problem and presents the required functionalities, analyzing the ones that can benefit from vision. Relations between filtering solutions applied to SLAM and estimation techniques applied to the structure from motion problem are now clearly established, and will not be detailed here. Section III is the main part of the paper: it describes how various vision algorithms can benefit SLAM in robotics, mainly for the essential problems of environment modeling and loop closing.


Fig. 1. Main steps of a SLAM process. With the first observations, the robot builds a map of 3 points (A), then it moves and computes an estimate of its position (B), 3 new observations (blue) are matched with the current map (C), and are fused to update the map and robot pose (D).

The paper is concluded by some short-term perspectives for vision-based SLAM.

II. FUNCTIONALITIES REQUIRED BY SLAM

A typical SLAM process is chronologically depicted in figure 1, in the case of a robot evolving in a 2D plane, where the landmarks are corners. The various functionalities involved in this process can be summarized as the following four:

• Environment feature selection. It consists in detecting landmarks in the perceived data, i.e. features of the environment that are salient, easily observable and whose relative position to the robot can be measured.

• Relative measures estimation. Two processes are involved here:

– Estimation of the landmark location relative to the robot pose from which it is observed: this is the observation.

– Estimation of the robot motion between two landmark observations: this is the prediction. This estimate can be provided by sensors, by a dynamic model of robot evolution fed with the motion control inputs, or thanks to simple assumptions, such as a constant velocity model.

• Data association. The observations of landmarks are useful to compute robot position estimates only if they are perceived from different positions: they must imperatively be properly associated (or matched), otherwise the robot position can become totally inconsistent.

• Estimation. This is the core of the solution to SLAM: it consists in integrating the various predictions and observations to estimate the robot and landmark positions in a common global reference frame (a minimal sketch of how these four functionalities chain together is given below).
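To make the interplay between these four functionalities concrete, here is a minimal EKF-style Python sketch of one SLAM iteration. The function interfaces (motion and observation models, their Jacobians) and the nearest-neighbour gating used for association are illustrative assumptions, not a prescription taken from the paper.

```python
import numpy as np

def ekf_slam_step(x, P, u, Q, R, f, F_jac, h, H_jac, observations, n_landmarks):
    """One illustrative EKF-SLAM iteration (all interfaces are hypothetical).
    x, P : state vector (robot pose + landmark positions) and its covariance
    u, Q : motion input and motion noise covariance        -> prediction
    f, F_jac : motion model and its Jacobian
    h, H_jac : observation model for landmark j and its Jacobian
    observations : relative landmark measurements, R their noise covariance"""
    # 1) Prediction: propagate the state with the motion estimate
    F = F_jac(x, u)
    x = f(x, u)
    P = F @ P @ F.T + Q

    # 2) Data association (simple Mahalanobis nearest-neighbour gate here;
    #    the paper argues for perception-based association instead)
    for z in observations:
        best_j, best_d2 = None, 9.21          # ~chi-square gate, 2 dof
        for j in range(n_landmarks):
            H = H_jac(x, j)
            S = H @ P @ H.T + R
            nu = z - h(x, j)
            d2 = float(nu @ np.linalg.solve(S, nu))
            if d2 < best_d2:
                best_j, best_d2 = j, d2

        # 3) Estimation: fuse the associated observation (Kalman update)
        if best_j is not None:
            H = H_jac(x, best_j)
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (z - h(x, best_j))
            P = (np.eye(len(x)) - K @ H) @ P
        # an unmatched observation would initialize a new landmark here
    return x, P
```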

In the robotics community, the main effort has been put into the estimation functionality. Various stochastic estimation frameworks have been successfully applied [12], [13], [14],


and important contributions deal with the definition of landmark map structures that lower the computational complexity of the estimation and allow the difficulties raised by the non-linearities of the problem to be overcome [15], [16], [17].

However, most of the functionalities involved in SLAM are perception processes. This is obvious for the landmark detection functionality, which represents the landmark with a specific data structure, depending on the considered environment and on the sensors the robot is equipped with. Also, the relative position of the landmarks in the sensor frame (the observation) is estimated by processing the perceived data. As for the data association process, it can be achieved considering only the current estimated states of the world and the robot (i.e. positions and associated variances). But it can be solved much more robustly and easily when tackled as a perception process, especially to establish loop closures, where perception can provide correct data associations regardless of the current estimated states.

A. Vision and SLAM

For all these perception functionalities, vision can obviously provide powerful solutions. Images indeed carry a vast amount of information on the perceived environment, and many algorithms that process this information can be very effective for SLAM:

• Detection and modeling of landmarks from images can be performed thanks to image feature extraction processes for instance, as has mainly been done up to now in vision-based SLAM approaches. But some visual segmentation, classification or tracking processes can also be very useful for that purpose – not to mention the numerous higher-level approaches to visual object recognition.

• Relative 3D coordinates of the detected landmarks are readily observable with multi-camera systems. For single cameras, the fact that the projective geometry provides only angular observations has recently led to the development of various partially observable SLAM solutions (“bearings-only”). Naturally, solutions to the classic structure from motion problem (SFM) developed in the vision community have recently been successfully applied to vision-based SLAM [7], [18].

• Finally, the data association problem in SLAM is often ill-posed when only considered within the estimation framework. Although well-founded approaches have been proposed (e.g. [19], [20]), they are hardly efficient when the estimated states become inconsistent, and they are still challenged by partially observable SLAM problems – namely by monocular vision SLAM. On the contrary, the vision literature provides plenty of approaches that can robustly solve the data association problem, such as feature matching algorithms, object recognition or image indexing approaches.

It is important to note here that the data association problem is very different when it comes to associating landmarks from two consecutive positions than when it comes to associating landmarks perceived from very different viewpoints, as happens when closing loops for instance. In the first case, the problem is easily solved by feature tracking algorithms, whereas the second case calls for more complex feature matching algorithms (though using matching algorithms for the first case is often overkill).

III. VISION AND THE M OF SLAM

A. Visual landmarks for SLAM

In a SLAM context, landmarks must satisfy the following two properties:

• They must be detectable in the perceived data,
• and some parameters of their position must be observable, so as to feed an estimation technique.

But to solve the data association problem independently of the estimated robot and landmark positions, they must also be represented by a model that contains the information required for the data association processes. A landmark model is defined by the specification of the actual landmark nature in the physical world, by the considered sensor model, and by the specification of the detection and association algorithms.

The key to a successful vision-based SLAM approach then lies in the choice of the landmark representation (model) and of the corresponding algorithms that allow landmarks to be detected, observed and associated.

1) Point features: Up to now, most vision-based SLAM solutions derive landmarks from point features detected in the images, be it for stereovision-based approaches [5], [21] or for monocular approaches [8], [9]. This is mainly due to the facts that various algorithms exist that extract stable feature points, and that point landmarks are the simplest geometric objects to handle in a SLAM estimation framework.

a) Detection: Harris points are often used, because they have good invariance properties with respect to image rotations and small scale changes [22]. More recently, SIFT features have become very popular: their detection is more scale independent, and the information associated to them (the local descriptor [23], a vector of scalar values) allows them to be successfully matched.

b) Geometric representation: the geometric representation of point landmarks is straightforward, their state being fully represented by the 3 Cartesian coordinates X = (x, y, z). With stereovision, this state is fully observable: (x, y, z) = h(θ, φ, d), where (θ, φ) are the angles at which the point is perceived, and d is the computed disparity – in the case of a rectified image pair. With monocular vision, the state is partially observable, and the observation function h is:

(θ, φ) = ( arctan(y/x), −arctan(z/√(x² + y²)) )

The observation error model is not straightforward to derive from the feature detection process. Most of the authors use a fixed estimate of the error on the observed measures (e.g. a standard deviation of 0.5 pixel for Harris points [22]), while some investigated the definition of more precise error models (see e.g. [24] for an error model of the disparity estimate



in stereovision). There is probably more work to be done regarding this, as good (accurate) error models are a prerequisite for any SLAM implementation.
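As a small illustration of these observation models, the Python sketch below implements the monocular bearing function given above and a possible inverse of the stereovision function h(θ, φ, d). The frame convention (x along the optical axis) and the rectified-pair disparity model d = baseline·focal/x are assumptions of the example, not taken from the paper.

```python
import numpy as np

def observe_monocular(p):
    """Bearing-only observation (theta, phi) of a point p = (x, y, z)
    expressed in the sensor frame, following the formula in the text."""
    x, y, z = p
    theta = np.arctan2(y, x)
    phi = -np.arctan2(z, np.sqrt(x**2 + y**2))
    return np.array([theta, phi])

def point_from_stereo(theta, phi, d, baseline, focal):
    """Possible form of (x, y, z) = h(theta, phi, d) for a rectified pair,
    assuming x is the depth along the optical axis and d = baseline*focal/x."""
    x = baseline * focal / d                      # depth from the disparity
    y = x * np.tan(theta)                         # inverts theta = arctan(y/x)
    z = -np.sqrt(x**2 + y**2) * np.tan(phi)       # inverts phi = -arctan(z/rho)
    return np.array([x, y, z])
```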

c) Data association: The algorithms used to associate point features vary a lot, depending on whether the viewpoints are close or not. In the second case, the choice and representation of the point features is essential, as their model must carry enough information to match them, whereas in the first case the problem is easily solved by frame-to-frame tracking algorithms, e.g. using simple correlation measures.

Various algorithms that match Harris points are available in the literature, some of them being able to deal with large scale changes. In [25], the Harris points are modeled by a vector of local characteristics computed from the “local jet”, a set of image derivatives. Matches are determined thanks to the computation of the Mahalanobis distance between their characteristics, and the geometric configuration between points is exploited to ensure the elimination of false matches. Extensions to large scale variations have been presented in [26]. The approach presented in [4] uses a combination of the points’ signal information computed during their extraction and of the geometric constraints between detected points. Matches are established for groups of points, which allows the estimation of a local affine transformation between the images: this transformation is exploited to compute correlation scores between points, and to guide the search for matches (figure 2).

Fig. 2. Harris points matched with a 1.5 scale change by the algorithm presented in [4] (red crosses show all the detected points, and green squares indicate successful matches).

SIFT features are intrinsically scale independent, and they can be modeled by a large vector of characteristics computed during their extraction, from which matches can be found. They are therefore very well suited for a visual SLAM implementation [27], [11].

Nevertheless, these matching algorithms can be challenged by large variations of illumination conditions, and can be quite time consuming when the memorized landmarks become numerous, in particular if the robot position estimate is too coarse to focus the match search. Section III-B explains how place recognition algorithms can overcome this latter difficulty.

d) Map management and representation: In the literature, the maps resulting from a visual point-based SLAM approach are a set of localized 3D point landmarks with the associated variances (figure 3), to each of which is associated the landmark representation (e.g. local image characteristics).

Fig. 3. 2D projection of the map resulting from a bearing-only SLAM algorithm (the robot moved along two loops).

Other useful information could however advantageously be memorized: for instance, the orientations from which points have been perceived during the map building process could help to narrow down the set of landmarks to search within when closing loops.

One of the interesting issues to deal with in the case of point-based visual SLAM is the landmark selection issue. Indeed, feature points are so numerous in images that integrating all of them in the SLAM map rapidly yields huge maps, hardly tractable by the estimation and matching algorithms. Various simple criteria can be applied to select the points to memorize as landmarks. For instance, since a good landmark should easily be observable (matched), and landmarks should be regularly dispatched in the environment, the following strategy can be applied: each acquired image is regularly sampled in cells. If there is at least one mapped landmark in a cell, no new landmark is selected; if not, the most salient feature point (e.g. the one whose lower Harris-matrix eigenvalue is the highest) is selected as a landmark. This ensures a quite good regularity in the observation space (the image plane – figure 4). Furthermore, a simple selection in the 3D space, such as maintaining a maximum volumetric density of landmarks, can also help to get rid of useless landmarks.
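A minimal sketch of this cell-based selection strategy, assuming the saliency scores and the image-plane projections of the already-mapped landmarks are given (the names and cell count are illustrative):

```python
def select_new_landmarks(features, mapped_points, img_w, img_h, n_cells=8):
    """features: list of (u, v, saliency); mapped_points: list of (u, v)
    projections of the landmarks already in the map."""
    cell_w, cell_h = img_w / n_cells, img_h / n_cells
    cell = lambda u, v: (int(u // cell_w), int(v // cell_h))

    occupied = {cell(u, v) for (u, v) in mapped_points}
    best = {}                                  # cell -> most salient candidate
    for (u, v, s) in features:
        c = cell(u, v)
        if c in occupied:
            continue                           # the cell already holds a landmark
        if c not in best or s > best[c][2]:
            best[c] = (u, v, s)                # keep the most salient feature
    return list(best.values())                 # one new landmark per empty cell
```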

2) Line features: If point features do yield successful SLAM solutions, the resulting map is however very poor, and actually only useful for the SLAM process itself. Various other higher-level environment representations are required for autonomous mobility (by the trajectory planning processes for instance). Such representations can be built using dedicated data processing algorithms, their spatial consistency being ensured by the robot and landmark localization estimates provided by SLAM [28]. But this would be made simpler if higher-level maps could be handled by SLAM: this is possible using higher-level visual features.

Monocular visual SLAM using line features has only been tackled very recently. In [29], edges are defined by their two end-points, and the authors use the inverse depth parametrization [10]. A more convincing approach has been introduced in [30], and we also recently investigated the problem [31]. Note that other approaches use edge information as landmarks, but do not maintain a 3D estimate of their position. For instance, in [32], vertical edges are used as 2D bearing measures.


Fig. 4. Selection of the points that will be kept as landmarks (green squares). Some cells here contain more than one landmark: indeed, when a landmark leaves a cell, it can move to a cell where there are already landmarks (a new landmark is then generated in the old cell).


Surprisingly, since the pioneering work made in the 80’s on SLAM with 3D lines extracted from stereovision [33], the problem has not received much attention from the community (except in [34], where results are only provided in simulation).

a) Detection: the vision literature provides many approaches to extract line segments in images (Hough transform, contour image segmentation...). But all these algorithms are very sensitive to noise and illumination: a precise estimate of the segment extremities is hard to obtain, a single line in the environment is often described as several line features, and two collinear distinct lines can be detected as a single line in the image (figure 5). Of course, many of these phenomena are caused by occlusions and some particular viewpoint conditions.

Fig. 5. A typical result of a line segment extraction algorithm: some segments are artifacts, and others are longer or shorter than the actual line in the environment.

b) Geometric representation: because of the difficulty of perceiving the segment extremities, it is much more reasonable to consider the parameters of the supporting lines as the state to be estimated in the SLAM process.

Several sets of parameters can be used to represent a 3D line L in Euclidean space. The minimal representation consists of 4 scalars: such a minimal representation is (P1, P2), where P1 = (x1, y1, 0)t is the intersection of L with the plane Π1 (z = 0) and P2 = (x2, y2, 1)t is the intersection of L with the plane Π2 (z = 1). Several conventions for (Π1, Π2) must be considered, so as to represent all possible lines with a satisfactory numerical precision (lines parallel to the planes (Π1, Π2) can indeed not be represented on the basis of these planes). A more intuitive but non-minimal representation of L is (A, u), where A is any point of L and u is a direction vector of L. In this representation, the choice of A is arbitrary and A is not observable, since it cannot be distinguished on the line.

Another representation, often used in the vision community, is the Plücker coordinates [35]. The Euclidean Plücker coordinates are represented by the following 6-vector:

L(6×1) = ( n = h · n̄ , u )t    (1)

where n is the normal to the plane containing the line and the origin O of the reference frame (n̄ being the corresponding unit vector), h is the distance between O and the line, and u is a unit vector which represents the direction of the line. The Plücker constraint has to be satisfied:

n · u = 0

This ensures that the representation is geometrically consistent. Any point P on the line satisfies the relation:

P ∧ u = n    (2)

The advantage of the Plücker representation is that the projection of a 3D line L in an image is a 2D line l which is defined by the intersection of the image plane and the plane defined by n: the canonical representation of l (ax + by + c = 0) is exactly n expressed in image coordinates.
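As an illustration of this representation, here is a small Python sketch that builds the Plücker coordinates of a line from two 3D points and projects it into an image, under the simplifying assumption of a camera at the origin with identity rotation and intrinsic matrix K (the function names are illustrative):

```python
import numpy as np

def plucker_from_points(p1, p2):
    """Euclidean Plücker coordinates (n, u) of the line through p1 and p2:
    u is the unit direction, n = p ^ u is normal to the plane (line, origin)
    and its norm is the distance h from the origin to the line."""
    u = (p2 - p1) / np.linalg.norm(p2 - p1)
    n = np.cross(p1, u)
    assert abs(np.dot(n, u)) < 1e-9            # Plücker constraint n · u = 0
    return n, u

def project_line(n, K):
    """2D image line l = (a, b, c): for a camera at the origin with identity
    rotation, l is n expressed in image coordinates, i.e. mapped through the
    inverse-transpose of the intrinsic matrix K."""
    l = np.linalg.inv(K).T @ n
    return l / np.linalg.norm(l[:2])           # normalize so (a, b) is unit

# usage: the distance h from the origin to the line is the norm of n
n, u = plucker_from_points(np.array([1., 0., 2.]), np.array([1., 3., 2.]))
h = np.linalg.norm(n)
```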

c) Data association: although numerous line segment matching and tracking algorithms can be found in the vision literature, the problem of finding outlier-free sets of matches remains a difficult one. Texture information and epipolar geometry allow most outliers to be discarded, but in SLAM one can also rely on the estimated states to filter out the remaining ones, by analyzing the difference between the predicted and observed states. Also, 3D model-based object recognition algorithms can be exploited to deal with loop closing.

d) Map management and representation: The map resulting from a 3D line segment SLAM approach that estimates the supporting line parameters is still far from a 3D wire model of the environment. In order to get a more precise description of the scene structure, the coordinates of the segment endpoints must be estimated. These coordinates can hardly be part of the estimated numerical state, as their observation is affected by noise, segmentation errors and occlusions, and these errors are far from Gaussian. A dedicated process must therefore be defined – for instance, a simple heuristic can consist in updating their linear abscissas by considering the observations that yield the longest segments.


Fig. 6. Top: line segments extracted from 2 aerial views. Bottom: matched segments.

Fig. 7. Example of a 3D line segment model built from a sequence of monocular images (from [31]).


3) Planar features: If a wire 3D model of the environment contains more structural information than a 3D point model, it can still hardly be exploited by functionalities other than localisation. A natural extension would be to add 3D planar patches and areas to the model. Again, numerous contributions from vision can be exploited for that purpose in SLAM, with the very interesting fact that planes carry more geometric information to be estimated. Indeed, if one is able to measure the normal of a planar patch (2 parameters) plus an orientation around this normal, a planar patch is described by 6 independent parameters: the observation of a single planar landmark then provides enough information to estimate the 6 parameters of the robot position.

a) Detection: Of course, at least two images taken from different viewpoints are required to extract planar areas in the perceived scene. The most efficient way to detect such areas is to determine whether or not there exists a homography H that transforms their projection from one image to the other. Considering a plane P and two images I1 and I2 taken from different viewpoints, for all points of P the coordinates of the corresponding pixels in I1 and I2 are linked by a homography H. Two areas I1^p from I1 and I2^p from I2 correspond to a planar feature if there is a matrix H such that:

H ∗ I1^p = I2^p    (3)

Moreover, the homography estimate can provide a measure of the planar patch surface, since it is linked with the 3D transformation (R, t) between the two viewpoints by the following relation:

H ≈ R + t nT / d    (4)

where n is the normal vector of the plane and d its distance to the camera.

Homographies can be retrieved from two close viewpoints thanks to image alignment techniques (a review of image alignment algorithms can be found in [36], and a very efficient method to track large image patches has been proposed in [37]).

Such algorithms can be used for SLAM in the monocular case: in [38], a nice approach that also estimates the planar patch normals is presented (figure 8). However, here the normal estimate is not precise enough to be part of the landmark states for SLAM: indeed, image alignment techniques require rather large patches to provide an accurate estimate of the plane normal. Nevertheless, in [38] the normal estimate is used to predict how the image patches should be warped, to ease the point matching process in case of large viewpoint changes.

Fig. 8. Planar patches estimated during a monocular SLAM process (from [38]).

Planar patches can of course be more easily detected from stereovision images. A first natural idea is to use dense pixel correspondences to detect them. But fast stereovision algorithms are quite noisy, and the normal vector estimates provided by plane fitting algorithms applied to sets of neighbouring 3D points are not reliable (various sophisticated dense stereovision algorithms that provide more precise results exist in the literature – but they are computationally much more expensive than the fast algorithms used in robotics). Finding the homography estimates using an image alignment algorithm, which is made easier thanks to the knowledge of the epipolar geometry, yields much better results. Nevertheless, their application on small areas (e.g. 20 × 20 pixels) can sometimes provide totally erroneous normal estimates.

Figure 9 shows the local planar patches (“facets”) detected from a pair of stereovision images, centred on Harris points in the image pair. To eliminate facets whose normal estimate is erroneous, we are investigating two possible solutions. The first one consists in exploiting texture attributes to determine whether the homography estimate is good or not, and the second is to rely on the behaviour of the data association process (see below) to discard wrong facets.



Fig. 9. Facets extracted from a single pair of stereovision images. Right: left camera image, with the matched Harris points on which the facets are centred.


b) Geometric representation: Facets extracted on matched Harris points are naturally described by the point's local characteristics, but their description is extended by the estimate of their normal, which gives 2 additional positioning parameters (orientation). This description can be completed by a third orientation parameter, which can for instance be defined by local gradients computed on the facet pixels (figure 10). This additional orientation provides a full description of the facet position in 3D, which can be advantageously used for the data association and SLAM estimation processes.
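As an illustration, a facet's full 3D pose can be assembled from its estimated normal and a gradient-derived in-plane direction. The sketch below is a hypothetical construction of such a frame, not the paper's implementation:

```python
import numpy as np

def facet_frame(center, normal, gradient_dir):
    """Build a full 3D frame (rotation matrix + origin) for a planar facet:
    the z axis is the estimated normal, and the x axis is the dominant image
    gradient direction projected onto the facet plane (third orientation)."""
    z = normal / np.linalg.norm(normal)
    x = gradient_dir - np.dot(gradient_dir, z) * z   # project onto the plane
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)                               # completes a right-handed frame
    R = np.column_stack((x, y, z))                   # facet-to-world rotation
    return R, np.asarray(center, dtype=float)
```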

Fig. 10. Close view of an extracted facet, with its normal (red) and third orientation (green) estimates.

c) Data association: We are currently experimenting with an algorithm similar to the Harris point matching algorithm described in [4]: the principle is to generate match hypotheses on the basis of local information, and to confirm these hypotheses using geometric constraints between neighbouring facets. The geometric constraints being here expressed in 3D, they are very discriminant. Figure 11 shows a map of facets built from several positions.
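The hypothesize-and-confirm principle can be sketched as follows; the similarity function, the thresholds and the pairwise-distance rigidity test are placeholders standing in for the actual descriptors and constraints:

```python
import numpy as np

def match_facets(facets_a, facets_b, similarity, sim_min=0.8, tol=0.05):
    """Hypothesize-and-confirm matching sketch.
    facets_*: lists of dicts with a 'center' (3D numpy vector) and a
    'descriptor'; similarity: function scoring two descriptors in [0, 1]."""
    # 1) hypotheses from local (signal) information only
    hypotheses = [(i, j)
                  for i, fa in enumerate(facets_a)
                  for j, fb in enumerate(facets_b)
                  if similarity(fa['descriptor'], fb['descriptor']) > sim_min]

    # 2) confirmation: pairwise 3D distances between facet centers must be
    #    preserved by the candidate matches (rigidity constraint)
    def consistent(h1, h2):
        (i1, j1), (i2, j2) = h1, h2
        da = np.linalg.norm(facets_a[i1]['center'] - facets_a[i2]['center'])
        db = np.linalg.norm(facets_b[j1]['center'] - facets_b[j2]['center'])
        return abs(da - db) < tol

    return [h for h in hypotheses
            if sum(consistent(h, other) for other in hypotheses if other != h) >= 2]
```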

4) Higher-level landmarks: Facets are only the first step toward a higher-level representation of the environment. In structured environments (i.e. urban-like and indoor), many objects can be described thanks to first-order geometric primitives. Algorithms that extract large planar areas from monocular image sequences are now becoming efficient and robust (see e.g. [39] – figure 12): combined with facet and line segment representations, they can yield the building of high-level maps in a SLAM context. Other approaches that model and detect non-structured objects can also be very helpful for SLAM (see e.g. a method for detecting tree trunks in [40]).

Fig. 11. Planar patches extracted from a set of stereoscopic pairs.

Fig. 12. Planar patches extracted from a pair of monocular images (from [39]).


B. Loop closing

As the estimation framework can hardly solve the loop closing problem when the robot position estimate is poorly known, we rather consider the loop closing process as a perception process. This implies a re-definition of loop closing in SLAM: instead of defining it as a topological event corresponding to a loop trajectory, we rather consider that a loop closure occurs when a mapped landmark that is currently not being tracked is re-observed – and associated thanks to a landmark matching algorithm.

We have seen that with good landmark visual representations, various matching algorithms could be used for the purpose of loop closing. However, relying only on landmark matching processes to detect loop closures can be an issue: with maps containing a large number of landmarks, the matching algorithms are challenged and can be quite time consuming. Image indexing techniques can be of great help here, and have already been successfully applied in various SLAM approaches [41] – and naturally, the problem is made much easier when using panoramic cameras [42], [43].


The literature on “view-based navigation” or “appearance-based localization” is already abundant in the robotics community, and progress in this domain in computer vision is definitely worth considering in visual SLAM approaches. Thanks to these approaches, topological loop closures can be efficiently and robustly detected, which allows landmark matching algorithms to be focused.

Using such techniques leads to the definition of a new kind of environment model, dedicated to loop closure detection, consisting in a database of image indexes and signatures.
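A minimal sketch of such a signature database, here with plain descriptor-histogram signatures and a cosine-similarity query (both are illustrative choices, not the indexing techniques cited above):

```python
import numpy as np

class SignatureDatabase:
    """Loop-closure detection model: one signature vector per key image."""

    def __init__(self):
        self.signatures = []                  # list of (image_id, unit vector)

    def add(self, image_id, signature):
        v = np.asarray(signature, dtype=float)
        self.signatures.append((image_id, v / np.linalg.norm(v)))

    def query(self, signature, threshold=0.9):
        """Return the ids of stored images similar enough to the query; these
        candidates are then passed to the landmark matching algorithms."""
        q = np.asarray(signature, dtype=float)
        q = q / np.linalg.norm(q)
        return [img_id for img_id, v in self.signatures
                if float(np.dot(q, v)) > threshold]
```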

IV. CONCLUSIONS

The next steps in vision-based SLAM approaches are certainly to focus on the development of rich maps that exhibit the environment's 3D structure and semantic information. We have seen that many vision tools are available for that purpose, and that some have already led to interesting results. Much work however remains to be done to integrate those tools in SLAM solutions. One of the interesting issues is of course to focus on the synergies between these tools and the SLAM estimation process. Such developments also appear promising to tackle two difficult challenges for SLAM, namely multi-robot SLAM and the integration of SLAM approaches within Geographic Information Systems (GIS).

Multi-robot SLAM

From the estimation point of view, various contributions solve the multi-robot SLAM problem, in which robots can observe the position of other robots and of landmarks mapped by other robots (see e.g. [44], [45], [46]). For this problem, data association between landmarks perceived by the different robots would of course greatly benefit from the building of high-level landmark map representations – all the more when considering heterogeneous robots (figure 13).

SLAM, GIS and vision

There is currently a tremendous development of the building and exploitation of Geographic Information Systems, which partly inherits from progress in computer vision. Any operational robotic system, be it aerial, terrestrial or even maritime, should not ignore such initial information on the environment, as it can be of great help to perform SLAM. There are obvious similarities between GIS and robotic mapping, as the resulting environment models are organized in layers containing information relevant for different processes. Considering the various environment models previously sketched (planar regions, segment-based object descriptions, dense models), we end up with an environment model that has the same layered structure as a usual GIS. The bottom layer is made of the set of landmarks which are consistently estimated by SLAM. The upper layers are the maps containing dense data, or possibly other sparse information relevant for the robots or the mission (see figure 14). The only difference is that the layers of a GIS are defined in a single Earth-centered reference frame, whereas the layers of the SLAM maps are made of local maps anchored in the bottom stochastic layer.

Fig. 13. Air/ground multi-robot cooperation. What landmarks can be used to build a consistent environment model from the data perceived by the two kinds of robots?

Fig. 14. The various environment models built by an autonomous robot have a layered structure akin to the one of a GIS.

Note also that some visual SLAM approaches have been able to build environment models from aerial data (e.g. [21]) – a problem that had been exclusively considered in the GIS community so far.

Again, the problems to be solved to integrate SLAM-built maps and GIS models lie essentially on the data association side. We believe that the development of higher-level environment models such as planar regions or segment-based descriptions on the basis of visual information is a promising way to tackle them.

REFERENCES

[1] R. Chatila and J.-P. Laumond, “Position referencing and consistent world modeling for mobile robots,” in IEEE International Conference on Robotics and Automation, St Louis (USA), 1985, pp. 138–145.

[2] R. Smith, M. Self, and P. Cheeseman, “A stochastic map for uncertain spatial relationships,” in Robotics Research: The Fourth International Symposium, Santa Cruz (USA), 1987, pp. 468–474.

[3] S. Thrun, “Robotic mapping: A survey,” in Exploring Artificial Intelligence in the New Millenium, G. Lakemeyer and B. Nebel, Eds. Morgan Kaufmann, 2002.

[4] I.-K. Jung and S. Lacroix, “A robust interest point matching algorithm,” in 8th International Conference on Computer Vision, Vancouver (Canada), July 2001.


[5] S. Se, D. Lowe, and J. Little, “Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks,” International Journal of Robotics Research, vol. 21, no. 8, pp. 735–758, 2002.

[6] A. Davison and N. Kita, “Sequential localisation and map-building for real-time computer vision and robotics,” Robotics and Autonomous Systems, vol. 36, pp. 171–183, 2001.

[7] D. Nister, “An efficient solution to the five-point relative pose problem,” in IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wi. (USA), June 2003.

[8] A. Davison, “Real-time simultaneous localisation and mapping with a single camera,” in IEEE International Conference on Computer Vision, Nice (France), Oct. 2003, pp. 1403–1410.

[9] T. Lemaire, S. Lacroix, and J. Sola, “A practical bearing-only SLAM algorithm,” in IEEE International Conference on Intelligent Robots and Systems, Edmonton (Canada), Aug. 2005.

[10] E. Eade and T. Drummond, “Scalable monocular SLAM,” in Conference on Computer Vision and Pattern Recognition, New York (USA), June 2006, pp. 469–476.

[11] M. Munich, P. Pirjanian, E. D. Bernardo, L. Goncalves, N. Karlsson, and D. Lowe, “SIFT-ing through features with ViPR,” IEEE Robotics and Automation Magazine, vol. 13, no. 3, pp. 72–77, Sept. 2006.

[12] G. Dissanayake, P. M. Newman, H.-F. Durrant-Whyte, S. Clark, and M. Csorba, “A solution to the simultaneous localization and map building (SLAM) problem,” IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229–241, May 2001.

[13] S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, and H. Durrant-Whyte, “Simultaneous localization and mapping with sparse extended information filters,” International Journal of Robotics Research, vol. 23, no. 7-8, pp. 693–716, 2004.

[14] M. Montemerlo, S. Thrun, and B. Wegbreit, “FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges,” in International Joint Conference on Artificial Intelligence (IJCAI), 2003. [Online]. Available: http://www-2.cs.cmu.edu/ mmde/mmdeijcai2003.pdf

[15] P. Newman, “On the structure and solution of the simultaneous localisation and map building problem,” Ph.D. dissertation, Australian Centre for Field Robotics, The University of Sydney, March 1999. [Online]. Available: http://oceanai.mit.edu/pnewman/papers/pmnthesis.pdf

[16] P. M. Newman and J. J. Leonard, “Consistent convergent constant time SLAM,” in International Joint Conference on Artificial Intelligence, Acapulco (Mexico), Aug. 2003. [Online]. Available: http://www.robots.ox.ac.uk/ pnewman/papers/IJCAI2003.pdf

[17] C. Estrada, J. Neira, and J. Tardos, “Hierarchical SLAM: real-time accurate mapping of large environments,” IEEE Transactions on Robotics, vol. 21, no. 4, pp. 588–596, Aug. 2005. [Online]. Available: http://webdiis.unizar.es/ jdtardos/papers/Estrada TRO 2005.pdf

[18] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd, “Monocular vision based SLAM for mobile robots,” in 18th International Conference on Pattern Recognition, Aug. 2006.

[19] J. Neira and J. Tardos, “Data association in stochastic mapping using the joint compatibility test,” IEEE Transactions on Robotics, vol. 17, no. 6, pp. 890–897, Dec. 2001.

[20] W. S. Wijesoma, L. Perera, and M. Adams, “Toward multidimensional assignment data association in robot localization and mapping,” IEEE Transactions on Robotics, vol. 22, no. 2, pp. 350–365, April 2006.

[21] I.-K. Jung and S. Lacroix, “High resolution terrain mapping using low altitude aerial stereo imagery,” in International Conference on Computer Vision, Nice (France), Oct. 2003.

[22] C. Schmid, R. Mohr, and C. Bauckhage, “Comparing and evaluating interest points,” in International Conference on Computer Vision, Jan. 1998.

[23] D. Lowe, “Distinctive features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[24] L. Matthies, “Toward stochastic modeling of obstacle detectability in passive stereo range imagery,” in IEEE International Conference on Computer Vision and Pattern Recognition, Champaign, Illinois (USA), 1992, pp. 765–768.

[25] C. Schmid and R. Mohr, “Local greyvalue invariants for image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, May 1997.

[26] Y. Dufournaud, C. Schmid, and R. Horaud, “Matching images with different resolutions,” in International Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC (USA), June 2000, pp. 612–618.

[27] S. Se, D. Lowe, and J. Little, “Vision based global localization and mapping for mobile robots,” IEEE Transactions on Robotics, vol. 21, no. 3, pp. 364–375, June 2005.

[28] J. Nieto, J. Guivant, and E. Nebot, “DenseSLAM: Simultaneous localisation and dense mapping,” International Journal of Robotics Research, vol. 25, no. 8, pp. 711–744, Aug. 2006.

[29] P. Smith, I. Reid, and A. Davison, “Real-time monocular SLAM with straight lines,” in British Machine Vision Conference, Edinburgh (UK), Sep. 2006.

[30] E. Eade and T. Drummond, “Edge landmarks in monocular SLAM,” in British Machine Vision Conference, Edinburgh (UK), Sep. 2006.

[31] T. Lemaire and S. Lacroix, “Monocular-vision based SLAM using line segments,” in IEEE International Conference on Robotics and Automation, Roma (Italy), April 2007.

[32] N. Kwok and G. Dissanayake, “Bearing-only SLAM in indoor environments using a modified particle filter,” in Australasian Conference on Robotics and Automation, Brisbane (Australia), Dec. 2003.

[33] N. Ayache and O. Faugeras, “Building a consistent 3D representation of a mobile robot environment by combining multiple stereo views,” in 10th International Joint Conference on Artificial Intelligence, Milan (Italy), Aug. 1987, pp. 808–810.

[34] M. Dailey and M. Parnichkun, “Simultaneous localization and mapping with stereo vision,” in Proceedings of the International Conference on Automation, Robotics, and Computer Vision, Singapore, Dec. 2006.

[35] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.

[36] S. Baker and I. Matthews, “Equivalence and efficiency of image alignment algorithms,” in Proceedings of the 2001 IEEE Conference on Computer Vision and Pattern Recognition, December 2001.

[37] E. Malis, “Improving vision-based control using efficient second-order minimization techniques,” in Proceedings of the 2004 IEEE International Conference on Robotics and Automation, April 2004.

[38] N. Molton, A. Davison, and I. Reid, “Locally planar patch features for real-time structure from motion,” in British Machine Vision Conference, London (UK), Sept. 2004.

[39] G. Silveira, E. Malis, and P. Rives, “Real-time robust detection of planar regions in a pair of images,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing (China), Oct. 2006.

[40] D. C. Asmar, J. S. Zelek, and S. M. Abdallah, “Tree trunks as landmarks for outdoor vision SLAM,” in Conference on Computer Vision and Pattern Recognition Workshops, 2006, p. 196.

[41] I. Posner, D. Schroeter, and P. Newman, “Using scene similarity for place labeling,” in International Symposium on Experimental Robotics, Rio de Janeiro (Brazil), July 2006.

[42] T. Lemaire and S. Lacroix, “Long term SLAM with panoramic vision,” to appear in Journal of Field Robotics, 2007.

[43] A. Tapus and R. Siegwart, Bayesian Sensory-Motor Models and Programs. Springer-Verlag, 2007, ch. Topological SLAM using Fingerprints of Places.

[44] E. Nettleton, P. Gibbens, and H. Durrant-Whyte, “Closed form solutions to the multiple platform simultaneous localisation and map building (SLAM) problem,” in SPIE AeroSense Conference, Orlando, Fl (USA), April 2000.

[45] S. Thrun and Y. Liu, “Multi-robot SLAM with sparse extended information filters,” in International Symposium of Robotics Research, Siena (Italy), Oct. 2003.

[46] E. Nettleton, S. Thrun, and H. Durrant-Whyte, “Decentralised SLAM with low-bandwidth communication for teams of airborne vehicles,” in International Conference on Field and Service Robotics, Lake Yamanaka (Japan), 2003.