

Trajectory Optimization using Reinforcement Learning for Map Exploration

Thomas Kollar and Nicholas Roy

Abstract—Automatically building maps from sensor data is a necessary and fundamental skill for mobile robots; as a result, considerable research attention has focused on the technical challenges inherent in the mapping problem. While statistical inference techniques have led to computationally efficient mapping algorithms, the next major challenge in robotic mapping is to automate the data collection process.

In this paper, we address the problem of how a robot should plan to explore an unknown environment and collect data in order to maximize the accuracy of the resulting map. We formulate exploration as a constrained optimization problem and use reinforcement learning to find trajectories that lead to accurate maps. We demonstrate this process in simulation and show that the learned policy not only results in improved map-building, but that the learned policy also transfers successfully to a real robot exploring on MIT campus.

1. INTRODUCTION

BUILDING maps has become a fundamental skill that most autonomous mobile robots must have. In order to properly interact with their environment, robots need to track their position over time and plan deliberate motions through the world. A robot that is able to build and update its own maps autonomously will have a tremendous advantage over those robots that rely on external positioning systems and static, pre-made maps. As a result, considerable research attention has been focused on the technical challenges inherent in the mapping problem.

Thomas Kollar is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: [email protected].

Nicholas Roy is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: [email protected].

Early mapping algorithms addressed the problem of using high-quality range data taken at known positions to build floor plans of indoor environments (Moravec and Martin 1994; Thrun 1998). More recent mapping research has focused on building maps with robots that do not have access to global position information (Smith et al. 1990; Lu and Milios 1997; Hähnel et al. 2003; Dellaert and Kaess 2006; Eustice et al. 2005; Thrun et al. 2005; Olson et al. 2006). Without a global position sensor, the sensor data must be used to simultaneously determine the robot position and update the map as the robot moves around. This inference problem, known as the Simultaneous Localization and Mapping (SLAM) problem, may initially appear computationally intractable, but modern statistical inference and machine learning techniques have led to computationally efficient mechanisms for accurate mapping without global position information. Further, these innovations have led to robots that can map large-scale, outdoor environments with a variety of different sensors, and to the creation of a commodity open-source community around mapping software (Hähnel et al. 2003; Thrun and Montemerlo 2005; Martinez-Cantin and Castellanos 2005; Martinelli et al. 2007; Pfaff et al. 2007; Davison et al. 2007).

The next major challenge in robotic mapping is to automate the data collection process in order to build the highest-quality map with the least time and cost. In almost all operational map-building systems, the robot records the data that will be used to build the map, but the collection process has limited autonomy. Path-planning and control are sometimes employed, but human supervision is often required to ensure that the collected data is complete (that is, the robot has observed all relevant parts of the world), and that the robot has collected the most useful data for maximizing the accuracy of the map. Once the data is collected, a map is built during an off-line phase; this map can then be used for localization and planning.

There are clearly multiple possible trajectories that constitute a complete exploration of the environment, but some trajectories are preferable to others. As a simple example, point turns for a holonomic robot are a canonically "bad" motion, as most mapping algorithms are sensitive to errors in rotational measurements. The responsibility for avoiding such trajectories almost always falls to the human driver; data collected by a bad (or novice) robot driver will lead to the creation of erroneous maps. Avoiding such problems will require robot control algorithms that can reliably explore their environments and generate accurate maps without human intervention. This will enable exploration in distant environments where closed-loop control by a remote human is impractical.

The problem of how to choose sensing trajectories to obtain the best map will be referred to as the exploration problem. A good robot exploration strategy must both cover the space sufficiently and collect data that leads to an accurate map. These constraints are to some extent in opposition. A planner that maximizes the environmental coverage will spend relatively little time revisiting previous locations, whereas a controller that minimizes the map error will revisit previous locations regularly to eliminate inconsistencies between different parts of the map. The optimal exploration controller will balance these constraints.

The map-building process for a robot equipped with a laser range scanner shown in Figures 1(a-d) illustrates the need for good exploration. The pose of the robot during each range scan is calculated from the (noisy) odometry measurements without using a SLAM algorithm, and the map is constructed assuming these pose estimates are correct (Figure 1(a)). The red dots in Figures 1(a-c) correspond to measured obstacle positions, and the resulting map is completely unintelligible. If a SLAM procedure is used to estimate the correct robot poses from the sensor data, then a better estimate of the map can be inferred, as shown in Figure 1(b). However, this inferred map contains errors; in particular, the two black arrows point to robot poses in the data where large-scale mapping errors occurred, resulting in map doubling. Notice that outside of these two specific misalignments, most of the map is relatively correct; straight hallways and corners can be clearly seen. For reference, Figure 1(c) shows the corrected map where a human user manually edited the pose information so that the scans aligned, and Figure 1(d) shows the final occupancy grid map that can be used for planning.

A formal model of autonomous exploration is unfortunately not easy to solve. From a decision-theoretic standpoint, the sequential decision-making problem of exploration is an instance of a continuous-state partially observable Markov decision process (POMDP). Although there has been work in solving continuous POMDPs (Porta et al. 2006; Thrun 2000), the difficulty in scaling these algorithms makes them ineffective in addressing the exploration problem. Even expressing the model parameters of the exploration problem as a continuous POMDP is impractical in real-world domains, never mind solving for a policy.

Without a closed-form solution to the objective function, we cannot write down an analytic control law. One alternative strategy would be to use numerical optimization such as model-predictive control, but the computational cost of map prediction eliminates the use of predictive control optimization. This idea has been pursued with some success (Sim et al. 2004; Stachniss et al. 2005), but even a one-step optimization is slow to compute.

Reinforcement learning has been used successfully to solve large sequential decision making problems in both fully observable and partially observable domains. Many reinforcement learning algorithms rely on the idea that even when the optimal policy cannot be solved for analytically, using knowledge of where good policies lie allows the learner to optimize its own performance from experience. It is precisely this idea that we use to find good exploration controllers; although we may not be able to directly optimize the robot trajectory, we can learn from trajectories that have minimized the map error in the past. The advantage of the reinforcement learning approach is that we can produce control policies for an otherwise intractable optimization; additionally, by learning a true policy as opposed to computing a plan, we can amortize the cost of learning over the lifetime of the agent. Assuming the agent dynamics do not change substantially, the policy, once learned, can be used to choose control actions in real time.

[Figure 1 panels: (a) Raw Measurements, (b) Mapping Errors, (c) Corrected Measurements, (d) Occupancy Grid]

Fig. 1. Simultaneous Localization and Mapping. In (a)-(c), the red dots are laser range points projected from the estimated or actual robot pose. In (a), the robot odometry is used to calculate the robot pose. The errors in the pose estimate lead to a completely useless map. In (b), a SLAM algorithm is used to estimate the robot trajectory, but the robot trajectory at two locations (shown by the black arrows) puts the true robot pose outside the bounded location assumed by the inference algorithm, causing the mapping to fail. The map is correct on either side of that motion, but the completed map is inconsistent. In (c) and (d), the robot pose and measurements were corrected by hand to generate the globally consistent set of range scans (c) and the corresponding occupancy grid (d). In (d), the black pixels are obstacles, white pixels are free space and blue pixels are unknown.

In this paper, we frame the exploration problem as a constrained optimization, where the goal is to choose a trajectory that maximizes the map accuracy within the constraints of planned map coverage. Firstly, we describe the mapping problem and choose an appropriate representation of the inference problem. We then formulate the exploration problem as a sequential decision making problem and describe the control architecture. We demonstrate how the Policy Search by Dynamic Programming algorithm (Bagnell et al. 2003) can be used to learn good policies by decomposing the reinforcement learning problem into a sequence of one-step learning problems. Finally, we give experimental results for both simulation and real-world exploration problems.

2. MAP BUILDING

The mapping problem can be described formally as inference of a map m. The environment is assumed to be stationary, such that m does not change with time, and that the map consists of a set of features {m_0, . . . , m_{|m|}}, such as grid cells or landmark positions. We wish to infer the map from a sequence of observations, z_{0:T} = {z_0, . . . , z_T}, where z_t is the observation at time t. These observations are generally measurements of the features m_i, such as range and bearing to the feature in the robot's frame of reference. The subscript 0:T denotes a sequence of measurements taken at discrete intervals from time t = 0 to some end time t = T. Since sensing is a noisy and potentially ambiguous process, each observation z_t is modeled as a stochastic function p(z_t | x_t, m) of both the map and the vehicle pose at time t, denoted as x_t = (x_t, y_t, θ_t), and we refer to this distribution as the observation function or measurement model. A distribution over maps, p(m | z_{0:T}, x_{0:T}), can be inferred from observations and known robot poses as

p(m | z_{0:T}, x_{0:T}) = α ∏_{t=0}^{T} p(z_t | m, x_t) p(m),    (1)

where α is a normalizer. Given the distribution over m, a single map estimate m̂ can be extracted by using a statistic of the distribution, such as the mean or the maximum likelihood estimate.
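As a concrete illustration of equation (1) (this is not code from the paper), the following minimal sketch accumulates the unnormalized log-posterior of a candidate feature map given known poses, assuming a Gaussian range-bearing measurement model and a flat map prior; all names and the noise parameter sigma are illustrative.

```python
import numpy as np

def log_map_posterior(candidate_map, poses, observations, sigma=0.1):
    """Unnormalized log of equation (1): sum_t log p(z_t | m, x_t), flat prior p(m)."""
    log_p = 0.0
    for (x, y, theta), z_t in zip(poses, observations):
        # z_t is a list of (feature_index, measured_range, measured_bearing).
        for i, r_meas, b_meas in z_t:
            dx, dy = candidate_map[i] - np.array([x, y])
            r_pred = np.hypot(dx, dy)                # predicted range
            b_pred = np.arctan2(dy, dx) - theta      # predicted bearing
            b_err = np.arctan2(np.sin(b_meas - b_pred), np.cos(b_meas - b_pred))
            innov = np.array([r_meas - r_pred, b_err])
            log_p += -0.5 * float(innov @ innov) / sigma**2   # Gaussian measurement model
    return log_p
```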

2.1. Mapping and Localization

It is rarely the case that mapping is carried out in an environment where the robot pose is known.

Without knowledge of the robot pose, the joint distribution of the map and robot poses must be represented, a formulation known as the SLAM problem (Smith et al. 1990; Moutarlier and Chatila 1989; Leonard and Durrant-Whyte 1991). If the robot poses are correlated by the control actions u_t taken by the robot using the transition function p(x_{t+1} | x_t, u_t), then the joint distribution over the map and robot poses is p(m, x_{0:T} | z_{0:T}, u_{0:T}).

If many features m_i in the map m are correlated by observations taken from a single pose, then inferring the true posterior can become computationally demanding. A common approximation is to filter the observations, such that the inference is over the map and just the most recent robot pose, x_T, rather than the complete robot trajectory, x_{0:T}. Using some statistical identities, we can compute the joint posterior as

p(m, x_T | z_{0:T}, u_{0:T}) = α p(z_T | x_T, m) ∫ p(x_T | u_T, x_{T−1}) p(m, x_{T−1} | z_{0:T−1}, u_{0:T−1}) dx_{T−1}.    (2)

Equation (2) is known as the Bayes filter, and is a popular form of map inference for its recursive formulation, allowing additional observations to be incorporated into the posterior efficiently. Implementing the Bayes filter requires committing to a specific representation of the distribution p(m, x_T), and this choice of representation will have consequences for the representation of the transition p(x_t | x_{t−1}, u_t) and observation p(z_t | m, x_t) functions.

2.2. Extended Kalman Filter SLAM

One of the most popular representations for mapping is the Kalman filter (Kalman 1960), in which the state distribution is assumed to be Gaussian, such that

p(ξ) = N(µ, Σ),    (3)

where ξ = (x, m) is the joint state space of the map and robot pose, µ is the mean estimate of the state, and Σ is the covariance. The original Kalman filter formulation of the general tracking problem assumes that the transition and observation functions are linear functions of the state with Gaussian noise. If the transition and observation functions are non-linear, as in the SLAM problem, the extended Kalman filter (EKF) (Smith et al. 1990) approximates the transition and observation functions by linearizing them around the mean state at each time step.

Formally, the EKF models the transition and observation functions as general non-linear functions g and h, such that

ξ_t = g(ξ_{t−1}, u_t, w_t),    w_t ∼ N(0, W_t),    (4)
z_t = h(ξ_t, q_t),    q_t ∼ N(0, Q_t),    (5)

where w_t and q_t are random, unobservable noise variables. In the presence of this unobserved noise, the EKF estimates the state at time T from the estimate at time T−1 in two separate steps: a process step based only on the control input u_T, leading to an estimate p(ξ̄_T) = N(µ̄_T, Σ̄_T), and a measurement step to complete the estimate of p(ξ_T) = p(m, x_T)^1. The process step is given by

µ̄_T = g(µ_{T−1}, u_T, 0),    (6)
Σ̄_T = G_T Σ_{T−1} G_T^⊤ + V_T W_T V_T^⊤,    (7)

where G_T is the Jacobian of g with respect to ξ evaluated at µ_{T−1}, and V_T is the Jacobian of g with respect to w. For convenience, we use R ≜ V W V^⊤. Similarly, the measurement step follows as

µ_T = µ̄_T + K_T (z_T − h(µ̄_T, 0)).    (8)

K_T is the Kalman gain and is given as

K_T = Σ̄_T H_T^⊤ (H_T Σ̄_T H_T^⊤ + Q_T)^{−1},    (9)

where H_T is the Jacobian of h with respect to ξ evaluated at µ̄_T. Finally, the covariance follows as

Σ_T = (I − K_T H_T) Σ̄_T.    (10)

The complete EKF update algorithm is given in Algorithm 1.

Algorithm 1 The Extended Kalman Filter
Require: Prior mean and covariance µ_{t−1}, Σ_{t−1}, control u_t and observation z_t.
1: Compute process mean, µ̄_t = g(µ_{t−1}, u_t, 0).
2: Compute process Jacobian G_t.
3: Compute process covariance, Σ̄_t = G_t Σ_{t−1} G_t^⊤ + R_t.
4: Compute measurement Jacobian H_t.
5: Compute Kalman gain, K_t = Σ̄_t H_t^⊤ (H_t Σ̄_t H_t^⊤ + Q_t)^{−1}.
6: Compute new mean, µ_t = µ̄_t + K_t (z_t − h(µ̄_t, 0)).
7: Compute new covariance, Σ_t = (I − K_t H_t) Σ̄_t.
8: return (µ_t, Σ_t).

^1 We denote ξ̄, m̄, x̄_t, µ̄_t and Σ̄_t as the variables after the control u_t but before the measurement z_t has been integrated. For brevity we will also omit conditioning variables, such that p(ξ_t) = p(m, x_t) = p(m, x_t | u_{0:t}, z_{0:t}).

The EKF approach to SLAM and mapping is attractive for a number of reasons. Firstly, when the state vector µ is not too large, the inference process is efficient; the most computationally demanding step is the matrix inversion required to calculate the Kalman gain. When the state vector does become large, this matrix inversion can become unwieldy, but there exist a number of approximation algorithms for factoring the map model to ensure constant upper bounds on the map inference process (Thrun et al. 2004; Bosse et al. 2005; Walter et al. 2005) while minimizing the error induced in the approximation. Secondly, when the linearization of the transition and observation functions is a good approximation, the EKF mean has been shown to be an unbiased estimate that minimizes the expected squared error (Newman and Leonard 2003). Finally, by representing the distribution over the map and robot pose parametrically, statistics of the map estimate can be used in planning exploration.
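A minimal numerical sketch of one pass through Algorithm 1 follows (not the authors' implementation): the caller is assumed to supply the noiseless transition and observation functions g and h, routines returning their Jacobians, and the noise covariances R and Q.

```python
import numpy as np

def ekf_update(mu_prev, Sigma_prev, u, z, g, h, G_jac, H_jac, R, Q):
    """One EKF step following Algorithm 1 (process step, then measurement step)."""
    # Process step: propagate mean and covariance through the motion model.
    mu_bar = g(mu_prev, u)
    G = G_jac(mu_prev, u)
    Sigma_bar = G @ Sigma_prev @ G.T + R

    # Measurement step: correct the prediction with the observation z.
    H = H_jac(mu_bar)
    S = H @ Sigma_bar @ H.T + Q                     # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)          # Kalman gain (equation 9)
    mu = mu_bar + K @ (z - h(mu_bar))               # mean update (equation 8)
    Sigma = (np.eye(len(mu)) - K @ H) @ Sigma_bar   # covariance update (equation 10)
    return mu, Sigma
```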

2.3. Map Representations

[Figure 2 panels: (a) Partial map estimate, (b) Selected coverage constraints, (c) Constrained exploration trajectory]

Fig. 2. The constrained optimization of the exploration process. (a) The current partial map estimate. (b) The sensing constraints chosen by the map coverage process. (c) The resulting trajectory that maximizes map accuracy while satisfying the coverage constraints.

Throughout the remainder of this paper, we will assume without loss of generality that the map m is a set of features with specific locations, m_i ∈ R^2. While our treatment of mapping and exploration will be independent of the map representation, we will rely on an accurate model of the joint posterior, including statistics of the posterior such as the covariance of the map and robot trajectory, in order to minimize the expected error. If range sensor data is available to capture the boundary between free space and obstacles, a common approach is to first use a feature-based map to infer the mean estimate of the vehicle trajectory x_{0:T} and then use equation (1) to build a grid map from the range measurements and the estimated mean poses. In practice, the occupancy grid contains a strong independence assumption between observations of map features (known as the "Naive Bayes" assumption) that is not consistent with the true generative model. Using Naive Bayesian independence assumptions can often allow computationally efficient and accurate inference of the mean of a distribution, but the higher order moments are generally biased and over-confident, leading to problems in active learning and exploration (Roy and McCallum 2001).
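To make the independence assumption concrete (an illustration, not the paper's mapping code), a grid map built from estimated mean poses typically accumulates per-cell log-odds independently; the increments l_occ and l_free below stand in for an inverse sensor model and are assumed values.

```python
import numpy as np

def update_occupancy(log_odds, hit_cells, miss_cells, l_occ=0.85, l_free=-0.4):
    """Per-cell log-odds update under the Naive Bayes independence assumption.

    log_odds   : 2-D array of cell log-odds, initialized to 0 (p = 0.5).
    hit_cells  : iterable of (row, col) cells where a beam endpoint landed.
    miss_cells : iterable of (row, col) cells the beam passed through.
    """
    for r, c in hit_cells:
        log_odds[r, c] += l_occ
    for r, c in miss_cells:
        log_odds[r, c] += l_free
    return log_odds

def occupancy_probability(log_odds):
    """Recover p(cell occupied) from the accumulated log-odds."""
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))
```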

3. EXPLORATION AND ACTIVE MAP BUILDING

The SLAM problem in the previous sections describes how a robot can build a map: by collecting a series of measurements, inferring a distribution over the map and using the mean of the distribution as an unbiased estimate. If the extended Kalman filter is used to compute this distribution, then the posterior mean will minimize the error of the distribution and, given enough data, the mean will converge to the true map. Unfortunately, few if any guarantees are given for the rate at which the map converges. The mapping process is highly dependent on the Gaussian assumption, the robot trajectory, the acquired sensor measurements, and the distinctiveness of map features (e.g., for performing data association). The goal of the algorithm in this paper is to choose control actions that lead to sensor data that results in the best map. There are, however, competing concerns in determining the "best map" and how a robot should explore its environment. Sensor measurements are expensive with respect to time, power and certainty, and map inference is only possible for the sensed parts of the environment; if the data only covers part of the environment, then the mapping process will be incomplete and the map m will not contain all m_i features. Therefore, exploration planning should be designed to minimize the number of sensor measurements required to sense every feature of the map m. Conversely, complete but erroneous maps, as in Figures 1(a) and 1(b), are not useful for planning; therefore exploration should be designed to minimize the error of the map.

There are different ways to reconcile these objectives of coverage and accuracy. The most common approach is to optimize the robot trajectory with respect to an overall objective that is a weighted sum of individual terms (Bourgault et al. 2002), but this raises the question of how to choose the individual weights. Our approach is instead to define the coverage problem as choosing the n best sensing locations that maximize the likelihood the complete environment will be observed. If these n sensing locations serve as constraints on a trajectory planner which maximizes the map accuracy subject to these constraints, then we have a principled way to ensure that every map feature is contained in the map estimate m̂, and that the map itself is accurate. Figure 2 shows how this process can be decomposed and both objectives optimized separately. The trajectory constraints based on coverage can be calculated first, and then the constrained trajectory can be calculated to maximize map accuracy; the following two sections describe how to optimize each separately.

3.1. Map Coverage

Let us first consider the problem of coverage. We define a "visibility" function δ(m_i, z_{0:T}),

δ(m_i, z_{0:T}) = 1 if map feature m_i appears at least once in the observations z_{0:T}, and 0 otherwise.    (11)

The goal of a coverage planner can then be given as a search for the trajectory that maximizes the number of sensed features,

argmax_{z_{0:T}} ∑_{i=1}^{|m|} δ(m_i, z_{0:T}).    (12)

We cannot control or predict a priori what observations z_t will be received. We can control the robot poses x_{0:T}, but the observations are generated stochastically according to the map and the robot pose, p(z_{0:T} | x_{0:T}, m). We do not know the map m, so equation (12) must be computed in expectation over the map prior distribution p(m) and observation likelihoods, such that

argmax_{x_{0:T}} E_{p(m)} E_{p(z_{0:T} | x_{0:T}, m)} [ ∑_{i=1}^{|m|} δ(m_i, z_{0:T}) ].    (13)

This raises two problems. Firstly, computing the double expectation over both map prior and observations is computationally infeasible. Secondly, the utility of the plan depends strongly on how informative the map prior is. At the beginning of the mapping process, the map prior will be very uncertain, which will cause all trajectories to have approximately the same expected performance, as all maps are equally likely. As more observations are made, the map distribution will begin to converge to the true map, and the mean map estimate m̂ will become an increasingly good approximation to the complete distribution. In either case, we can approximate equation (13) as

argmax_{x_{0:T}} E_{p(z_{0:T} | x_{0:T}, m̂)} [ ∑_{i=1}^{|m|} δ(m_i, z_{0:T}) ].    (14)

Therefore, a robot coverage algorithm that chooses a set of destination points to maximize the likelihood of completing the map based on the current estimate, and then follows a tour through these points, will have performance comparable to any other sequential decision making algorithm. Solving this optimization problem is outside the scope of this paper, but there are a number of geometric coverage planners in the literature, including art-gallery-type results (Bourque and Dudek 2000; Ganguli et al. 2006). In the remainder of this paper we will assume the availability of such an algorithm for determining sensor locations for coverage.
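The art-gallery-style planners cited above are not reproduced here; as a hedged stand-in for such a coverage algorithm, the sketch below greedily selects n sensing positions that maximize the number of newly covered unknown grid cells under an assumed circular, occlusion-free sensor footprint. Every name and parameter is illustrative.

```python
import numpy as np

def greedy_sensing_locations(unknown_mask, candidates, n, sensor_radius):
    """Greedily choose n candidate cells that maximize newly covered unknown cells."""
    rows, cols = np.indices(unknown_mask.shape)
    covered = np.zeros_like(unknown_mask, dtype=bool)
    chosen = []
    for _ in range(n):
        best, best_gain = None, -1
        for cand in candidates:
            visible = (rows - cand[0]) ** 2 + (cols - cand[1]) ** 2 <= sensor_radius ** 2
            gain = int(np.count_nonzero(visible & unknown_mask & ~covered))
            if gain > best_gain:
                best, best_gain = cand, gain
        chosen.append((int(best[0]), int(best[1])))
        covered |= (rows - best[0]) ** 2 + (cols - best[1]) ** 2 <= sensor_radius ** 2
    return chosen
```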

3.2. Map Accuracy

The coverage exploration optimization does not depend on the robot motion, but only on the robot perception at discrete points in the environment; the optimization is independent of the motion between these discrete sensing locations. However, the trajectory between sensor measurements can have a significant effect on the accuracy of the map. When the robot motion is noisy, substantial uncertainty is introduced into the robot pose; when few or no observations are expected (e.g., because of sensor range limits or environmental sparsity), this uncertainty is propagated throughout the SLAM filter into future measurements, increasing the expected error of the posterior. By avoiding trajectories with large motion noise and trajectories with limited observations of map features, the robot can minimize the error introduced by its own motion. Similarly, many robots have limited visibility and can see only a 180-degree field of view. In this case, the robot orientation at each point along the path is important for controlling the number of measurements or choosing trajectories that maximize the number of observations of specific landmarks, even when traveling through already-mapped areas of the environment.

[Figure 3 panels: (a) Worst Trajectory, (b) Best Trajectory]

Fig. 3. Example trajectories. (a) An example bad trajectory, where the long looping trajectory induces considerable positional error. (b) The optimal trajectory between the start and goal location. The short motion and large radius of curvature minimize the overall error.

In Figure 3, two trajectories are shown. The robot does not have access to the complete map in either case, although the map is drawn in for reference. In Figure 3(a) the robot trajectory is long and contains a tight loop. This would clearly be a very suboptimal way of arriving at the destination point. In contrast, Figure 3(b) shows a much more reasonable trajectory, both in terms of length and in terms of the kind of path.

Assuming a standard error function such as sum of squared errors, an exploration planner that is to generate the most accurate map must optimize its trajectory x_{0:T} as follows:

argmin_{x_{0:T}} E_{p(m | x_{0:T})} ||m − m̂||^2,    (15)

where again m̂ is the mean of the map marginal distribution. In the SLAM problem, we do not have direct control of the robot pose but can only issue controls u_t, so we change the objective function to

argmin_{u_{0:T}} E_{p(m | u_{0:T})} ||m − m̂||^2,    (16)

implicitly marginalizing over states and observations. The constraints of the sensing locations are not present in the objective of equation (16); they are constraints on the solution that the optimizer must satisfy. Just as in equation (13), this expectation of equation (16) is computationally intractable without a fairly accurate map prior. Furthermore, since the expectation is taken with respect to the map at the end of the trajectory, the same approximation as in equation (14) cannot be used. Without the ability to optimize the trajectory exactly or find a good approximation, we turn to reinforcement learning in the next section to find trajectories to minimize equation (16) subject to the constraints of equation (14).

4. REINFORCEMENT LEARNING FOR EXPLORATION

Solving for the policy in equation (16) requires a sequential decision-making algorithm, where the choice of action u_t depends on the current posterior distribution p(x_t, m). The most general framework for choosing actions based on the current distribution is the partially observable Markov decision process (POMDP) (Sondik 1971). A POMDP is described by the following:

• a set of states S,
• a set of actions A,
• a set of observations Z,
• a transition function T : S × A × S → R such that T(s_i, a_j, s_k) = p(s_i | a_j, s_k),
• an emission function O : S × A × Z → R such that O(s_i, a_j, z_k) = p(z_k | a_j, s_i),
• a reward function r : S × A × S → R,
• an initial state distribution p_0(S), and
• a discount factor γ ∈ [0, 1].

The state at each time t is assumed to be unobservable. Assuming a known action a, a received observation z and the corresponding transition and observation functions, the Bayes filter of equation (2) can be used to estimate the distribution p(s). The standard optimal policy for a POMDP model is one that chooses an action a based on the current distribution p(s) to maximize the expected discounted reward over the life of the agent. Given appropriate choices of the state, action and observation sets and transition, observation and reward functions, the exploration problem can be modeled as a POMDP.

The state space would trivially seem to be the current robot pose x and the map m. Almost all algorithms for solving POMDPs optimally require a discrete state space, but rather than solving for the POMDP policy, the use of an appropriate reinforcement learning algorithm will allow us to maintain a continuous representation of the state. However, recall that the trajectory must satisfy a set of constraints in terms of visiting the sensing locations selected by the coverage planner. We therefore augment the state with the set of remaining locations {L_1, . . . , L_n} that are to be visited, such that

S = X × M × {L_i},    (17)

where the last term is the set of sensing locations that the policy must visit, and n will decrease as the trajectory visits each location in turn. By putting the active constraints directly into the state space, there is a danger of computational intractability, but we will again rely on the reinforcement learning algorithm to avoid this problem. It should be noted that this is a hybrid state space, where the first two components (robot pose, x, and map, m) are unobserved, but the sensing locations {L_i} are chosen by the exploration planner and are therefore fully observable.
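One way to hold this hybrid state in code (purely illustrative; the class and field names are not from the paper) is to pair the filter belief over the unobserved pose and map with the fully observable list of remaining sensing locations:

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class ExplorationState:
    """Hybrid state of equation (17): an unobserved pose/map belief plus the
    observable list of remaining sensing locations."""
    mean: np.ndarray          # EKF mean over robot pose and map features
    covariance: np.ndarray    # EKF covariance of the same joint state
    remaining: List[Tuple[float, float]] = field(default_factory=list)

    def visit_next(self) -> "ExplorationState":
        """Deterministically drop the sensing location that was just reached."""
        return ExplorationState(self.mean, self.covariance, self.remaining[1:])
```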

The observations are exactly the observations z described in Section 2. The observation emission function is given by p(z | s) = p(z | x(s), m(s)), where x(s) is the robot pose component of state s, m(s) is the map component of state s and p(z | x(s), m(s)) is the observation function used for mapping in Section 2.

4.1. The Action Space

The action space may initially appear to be the parameterized control space of the robot, which for most ground robots is a bearing and velocity, u = (θ, v). This action space has the unfortunate effect of requiring extremely long plans in terms of the number of actions required; solving for a POMDP policy has complexity that is roughly doubly exponential in the horizon length (a loose bound on the complexity is O(|A|^{|Z|^t}) for discrete and finite state, action and observation spaces), and learning a policy will also scale in complexity with the horizon length. In order to have any hope of finding good exploration policies, choosing a different policy class (and action parameterization) that can express the optimal policy more compactly is essential.

In choosing a more appropriate policy class, we can take advantage of structure in our problem. Firstly, we can use the policy class to impose a smoothness on the trajectories. Instead of requiring the exploration planner to select fresh control parameters at each time t, we can induce a bias in the controls by representing the action as a smooth parametric curve. Secondly, if the policy class is a set of trajectories, we can also use this policy class to enforce the constraints imposed by the coverage planner, which simplifies the optimization process. As a result, we define the action for state s to be a parameterized trajectory from the robot's current pose (position and orientation) to the next sensing location in the state.

The simplest polynomial that both interpolates the robot pose and the next sensing location and allows us to constrain the start orientation is a cubic polynomial, that is, the curve parameterized as

x(t) = a_x t^3 + b_x t^2 + c_x t + d_x
y(t) = a_y t^3 + b_y t^2 + c_y t + d_y

with t ∈ [0, 1]. The curve is fixed such that the derivative at the start matches the orientation of the robot at its current position, such that

atan2( ∂y/∂t |_{t=0}, ∂x/∂t |_{t=0} ) = θ.

Using this parameterization and providing values for the x and y positions (x, y)_{t=0}, (x, y)_{t=1} and derivatives (∂x/∂t, ∂y/∂t)_{t=0} and (∂x/∂t, ∂y/∂t)_{t=1} at t = 0 and t = 1, one can solve for the specific constants. Although there are nominally eight total parameters, five of these parameters are fixed: (x, y, θ)_{t=0} is the start pose of the robot and (x, y)_{t=1} is the destination position (x_g, y_g). Thus, we have one free parameter at the start point (the magnitude of the derivative at t = 0) and two free parameters at the destination (the orientation of arrival and the magnitude of the derivative at t = 1), to give a continuous action space in R^3. Figure 4 shows a set of sample action parameterizations, with a destination sensing point that is 3 meters in front of the robot.

[Figure 4: sample trajectories from the robot start location to a destination sensing location]

Fig. 4. A sample set of possible actions for a destination sensing point that is 3 meters in front of the robot. Some of the actions are clearly sub-optimal for this start and goal, but may be useful for achieving a particular orientation at the destination. The reinforcement learning problem is to identify which parameterization should be chosen given the current robot state, map and sensing points.
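The coefficients of one such action can be solved in closed form from the endpoint constraints. The sketch below is illustrative only (parameter names are assumptions): it takes the three free parameters of the R^3 action space — the derivative magnitude at the start, and the arrival orientation and derivative magnitude at the destination — and returns sampled waypoints.

```python
import numpy as np

def cubic_action(start_pose, goal_xy, s0, theta1, s1, samples=50):
    """Build one action: a cubic from the robot pose to a sensing location.

    start_pose = (x0, y0, theta0); goal_xy = (xg, yg); s0 is the derivative
    magnitude at t = 0, theta1 the arrival orientation, s1 the derivative
    magnitude at t = 1. Returns sampled (x, y) waypoints for t in [0, 1].
    """
    x0, y0, theta0 = start_pose
    xg, yg = goal_xy
    m0 = (s0 * np.cos(theta0), s0 * np.sin(theta0))   # start derivative vector
    m1 = (s1 * np.cos(theta1), s1 * np.sin(theta1))   # arrival derivative vector

    def hermite_coeffs(p0, p1, d0, d1):
        # Solve a t^3 + b t^2 + c t + d for the endpoint/derivative constraints.
        a = 2 * p0 - 2 * p1 + d0 + d1
        b = -3 * p0 + 3 * p1 - 2 * d0 - d1
        return a, b, d0, p0

    ax, bx, cx, dx = hermite_coeffs(x0, xg, m0[0], m1[0])
    ay, by, cy, dy = hermite_coeffs(y0, yg, m0[1], m1[1])
    t = np.linspace(0.0, 1.0, samples)
    return np.stack([ax * t**3 + bx * t**2 + cx * t + dx,
                     ay * t**3 + by * t**2 + cy * t + dy], axis=1)
```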

There are three potential issues in this policy class that could lead to sub-optimality in the performance. Firstly, the set of motion strategies that can be expressed by the exploration planner has been considerably reduced, and the true optimal exploration policy may no longer be feasible with this action parameterization. However, this bias allows us to dramatically increase the computational tractability of the problem; in preliminary experiments using richer policy classes, we experienced difficulties in scalability and learnability, and could not demonstrate satisfactory convergence for learning exploration policies outside of toy examples. The choice of policy class is one of the key enablers of our optimization, and we will see experimental results that suggest that this policy class results in good (if not guaranteed optimal) exploration performance.

A second issue is that the curves are not constrained to a fixed length and will take varying amounts of time to execute. The assumption of a standard discounted POMDP is that the actions are all of some fixed length. If this assumption is violated for problems with a discount factor γ < 1, then the planner may not be able to correctly compute the expected discounted reward. A complete plan comprised of fewer but longer actions will appear to be less valuable compared to a plan of the same overall performance but comprised of more, shorter actions. The correct planning formalism would be the semi-Markov decision process, which adds an explicit representation of the duration of each action. Experimentally, we found that the variability in the length of the actions was not sufficient to justify the additional computational overhead of a semi-Markov model.

It should be noted that the actions are kinematic trajectories and do not directly provide motor control commands. To convert each action into executable motor controls u_{0:T}, we use pure-pursuit control (Ollero and Heredia 1995) on the current robot position estimate and a point on the curve a distance d in front of the robot. This pure-pursuit controller also gives rise to the transition function corresponding to the action a. Each action a gives rise to a distribution of possible control sequences required to follow the curve a. The probability of arriving at a particular state s_i (for s_i defined by equation (17)) is

p(x(s_i) | x(s_j), a) = p(x(s_i) | x(s_j), u_{0:T}).    (18)

This probability is the result of a series of EKF updates, and cannot easily be evaluated in closed form, but can be approximated using Monte Carlo integration. The map component m(s) remains constant. Finally, the sensing location sequence component of the state changes deterministically, in that once a sensing location is visited at the end of executing action a, that sensing location is removed from the coverage plan.
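A minimal pure-pursuit step of the kind described above might look as follows; this is a sketch, not the controller of Ollero and Heredia (1995), and the lookahead distance and forward speed are assumed values.

```python
import numpy as np

def pure_pursuit_control(pose, path, lookahead=0.5, v=0.3):
    """One pure-pursuit step: steer toward the point on the path a fixed
    distance ahead of the current position estimate.

    pose = (x, y, theta); path is an (N, 2) array of waypoints along the
    selected cubic. Returns a (forward velocity, turn rate) command.
    """
    x, y, theta = pose
    dists = np.linalg.norm(path - np.array([x, y]), axis=1)
    # First waypoint at least `lookahead` away; fall back to the final waypoint.
    ahead = np.nonzero(dists >= lookahead)[0]
    target = path[ahead[0]] if len(ahead) else path[-1]
    # Bearing error to the target point, wrapped to [-pi, pi].
    bearing = np.arctan2(target[1] - y, target[0] - x) - theta
    bearing = np.arctan2(np.sin(bearing), np.cos(bearing))
    curvature = 2.0 * np.sin(bearing) / max(lookahead, 1e-6)
    return v, v * curvature   # translational and rotational velocity
```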

4.2. The Reward Function

Given access to training data with ground truth information of the map (m), the reward function is the squared error of the map, as in equation (15). For all actions, the reward function is

r(·, a, ·) = ||m(s) − m||^2.    (19)

The squared error reward function is our fundamental quantity of interest. This reward function is not a function of the specific SLAM algorithm, and allows the controller to predict the effects of unmodeled SLAM filter failures. Over-confidence, bad data associations, and filter divergence account for a number of problems that may occur and are not modeled explicitly by an extended Kalman filter mapper; these nevertheless will affect the squared error.
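Computed on simulated training data, the reward of equation (19) reduces to a few lines; the sketch below assumes the estimated features are index-aligned with the ground-truth features (i.e., data association is given by the simulator).

```python
import numpy as np

def map_squared_error(estimated_features, true_features):
    """Squared map error ||m(s) - m||^2 between estimate and ground truth."""
    diff = np.asarray(estimated_features) - np.asarray(true_features)
    return float(np.sum(diff ** 2))
```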

In circumstances where the squared error cannot be computed due to lack of ground-truth data, an alternate reward function based on the uncertainty of the distribution can be considered. There is an abundance of related work using information gain for choosing actions, among which the most closely related comes from (Makarenko et al. 2002; Stachniss 2006; Feder et al. 1999; Sim 2006). Information-theoretic reward functions have also been used to choose the next best measurement in sensor networks. If the estimator is unbiased, then as the entropy of the distribution converges to 0, the mean of the estimator will converge to the true value. The literature on optimal experimentation (Fedorov 1972) defines two principal information-theoretic objective functions in terms of information:

• Given a Gaussian distribution with covariance Σ and mean µ, the D-optimal objective function is defined as I(x) = det(Σ_V) − det(Σ_{V|A}).
• Given a Gaussian distribution with covariance Σ and mean µ, the A-optimal objective function is defined as I(x) = tr(Σ_V) − tr(Σ_{V|A}).

For information gain, Σ_{V|A} is the covariance after a set of observations A have been made of variables V, and Σ_V is the covariance before they were made.

There is some intuition that underlies these metrics. The determinant of a square, symmetric matrix is the product of the eigenvalues, and so the determinant of the covariance roughly corresponds to the volume of the uncertainty ellipse at a contour of constant likelihood. The determinant is monotonically related to the entropy of a Gaussian. Reducing one of the dimensions of uncertainty to zero corresponds to driving the determinant of the covariance matrix to zero. Similarly, the trace has an interpretation as the sum of the eigenvalues of the covariance. The trace therefore roughly corresponds to the circumference of the uncertainty ellipse. We will compare the policies for these reward functions in the next section.
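Both objectives are direct functions of the filter covariance before and after a candidate set of observations, as in the minimal sketch below (in practice the log-determinant is often preferred for numerical stability, which is an implementation choice, not the paper's).

```python
import numpy as np

def d_optimal_gain(Sigma_prior, Sigma_post):
    """D-optimal style gain: reduction of the covariance determinant."""
    return float(np.linalg.det(Sigma_prior) - np.linalg.det(Sigma_post))

def a_optimal_gain(Sigma_prior, Sigma_post):
    """A-optimal style gain: reduction of the covariance trace (sum of eigenvalues)."""
    return float(np.trace(Sigma_prior) - np.trace(Sigma_post))
```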

5. POLICY SEARCH DYNAMIC PROGRAMMING

Given the model parameters described in the previous section, it remains to find the optimal policy mapping distributions over states p(s) to actions a. The POMDP model of the exploration problem described above includes large and continuous state, action and observation spaces. There are no known algorithms for finding the exact optimal policy for continuous POMDPs, only approximation algorithms (Porta et al. 2006), and these algorithms are extremely unlikely to scale, either by approximating the value function or by approximating the continuous parameters with discretizations. Finally, the model is sufficiently complex that even generating an explicit representation of the model parameters may be computationally intractable.

Although an exact computational solution is infeasible, we do have access to high-quality simulators and real-world data that can be used to characterize good solutions. By learning good policies, we avoid the need to represent the model and solve it.

Algorithm 2 Policy Search by Dynamic Programming
Require: Simulator with state sampling function P
1: ξ = { p^{(1)}(s), . . . , p^{(m_1)}(s) | p(s) ∼ P }
2: Where v_t^{a_j, π_{t+1}, . . . , π_N}(s) is a trace through the MDP sampled m_2 times using the previously computed policies:

v_t^{A, π_{t+1}, . . . , π_N}(ξ) =
[ v_t^{a_1, π_{t+1}, . . . , π_N}(p^{(1)}(s))    · · ·    v_t^{a_{|A|}, π_{t+1}, . . . , π_N}(p^{(1)}(s))
  ⋮                                                       ⋮
  v_t^{a_1, π_{t+1}, . . . , π_N}(p^{(m_1)}(s))  · · ·    v_t^{a_{|A|}, π_{t+1}, . . . , π_N}(p^{(m_1)}(s)) ]

3: π_t(ξ) = argmax_{a ∈ A} [ v_t^{A, π_{t+1}, . . . , π_N}(ξ) ], where the argmax is taken with respect to the rows of the matrix.
4: Train a classifier with input ξ and output π_t(ξ) and return the classifier as the policy: π_t(p(s)) : P(s) → A
5: Π = {π_t, . . . , π_N}; t = t − 1; goto step 1 until convergence.

We instead use reinforcement learning to find good policies from data.

The decision to use reinforcement learning does bring with it the question of which algorithm to use in learning. There are a number of different learning algorithms, with different representational and learning power. The process of gathering data during exploration learning can be expensive, and the input data (the state distributions) are high-dimensional. Policy Search by Dynamic Programming (Bagnell et al. 2003) is a form of reinforcement learning that decomposes the learning process into a series of one-step learning algorithms, essentially turning the learning of sequential decision making into a form of dynamic programming. The Policy Search by Dynamic Programming (PSDP) algorithm can be seen in Algorithm 2.

The algorithm operates by first learning a one-step policy at time T (i.e., the end time) by generating a sample distribution p^{(i)}(s) and action a and propagating each sampled distribution p^{(i)}(s) forward. Each distribution-action pair is labeled with its immediate reward. The best action for each distribution is kept as a training instance for a supervised learning problem at time t. The result is a classifier that provides an optimal action a for any distribution p(s). A new one-step policy for the previous time step T−1 is then learned by sampling distributions and actions as before. Using the policy previously obtained at time T, a two-step value is obtained for each distribution and action at T−1: the value associated with each distribution and action is accumulated from the immediate reward at time T−1 and the reward received by running the learned policy for time T, and a new policy is learned for time T−1. The learner then iterates at each time t through to time T using the policies π_i for i ∈ {t, . . . , T}. After sampling all of the actions for distribution p(s) at time t, the best action a′ is selected.
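The backward pass described above can be sketched as follows. This is not the authors' code: sample_beliefs, simulate and actions are assumed hooks into a simulator, and mapping the paper's (C = 4, σ = 0.125) to scikit-learn's gamma = 1/(2σ²) is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def psdp(sample_beliefs, simulate, actions, horizon, n_beliefs=1000):
    """Learn one classifier per time step, backwards from the final step.

    sample_beliefs(n) -> n belief feature vectors; simulate(b, a, later_policies)
    rolls the simulator forward from belief b, executing action index a and then
    the already-learned later policies, and returns the accumulated reward.
    """
    policies = []                                   # policies for t+1, ..., N
    for t in reversed(range(horizon)):
        X, y = [], []
        for b in sample_beliefs(n_beliefs):
            returns = [simulate(b, a, policies) for a in range(len(actions))]
            X.append(b)
            y.append(int(np.argmax(returns)))       # best action becomes the label
        clf = SVC(kernel="rbf", C=4.0, gamma=1.0 / (2 * 0.125 ** 2))
        clf.fit(np.asarray(X), np.asarray(y))
        policies.insert(0, clf)                     # this classifier is pi_t
    return policies
```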

5.1. Support Vector Machine Policy Learning

The primary advantage of PSDP for our purposes is that each one-step policy can be found using any learning algorithm. This allows us to take advantage of machine learning techniques that are effective in generalizing from high-dimensional data. There are a number of classifiers that could be chosen for use in PSDP, such as neural networks, Gaussian processes, and support vector machines (SVMs). We have chosen to leverage a classifier known to generalize well in high-dimensional spaces, that has few parameters to tune and is known to perform well on a variety of problems: the Support Vector Machine (Burges 1998; Duan and Keerthi 2005).


Support vector machines start from the principle that the decision boundary which maximizes the margin between classes is likely to be the best one. In its simplest form, the SVM has a linear decision boundary. However, by using the "kernel trick", it is possible to extend support vector machines to non-linear decision boundaries (Haykin 1994). The optimization performed to find the best decision boundary can be formulated as a quadratic program that is relaxed to penalize errors. The parameters left free include a term to account for the number of errors that will be allowed, as well as the kernel parameters.

To learn a policy, we take training data consisting of labeled distribution-action pairs and learn to predict the best action for each distribution. Although we can maintain a continuous input space, an SVM requires a discrete action space. The basic support vector machine performs binary classification and must be extended to the n-ary classification problem of choosing actions. A set of classifiers is typically trained in a one-versus-all or one-versus-one manner (Duan and Keerthi 2005; Hsu and Lin 2001). In the one-versus-all framework, each binary classifier is trained using the data from one class against the data from all of the other classes, generating n binary classifiers for the n classes. At prediction time, each of the n binary classifiers returns either a class or other, referring to the fact that it could be any of the other classes. When there are multiple candidate classes, the one with the highest value is selected as the correct class. In the one-versus-one framework, each binary classifier is trained using data from only two classes, generating O(n^2) classifiers. At prediction time, each pairwise binary classifier returns the more likely of the two classes and casts a vote for this class. The class with the most votes wins the n-ary classification and is taken to be the true class. In (Hsu and Lin 2001), it was shown that the one-versus-one framework is most appropriate for n-ary classification problems such as ours. In our optimization, we will also want to rule out trajectories that collide with obstacles. We simply ignore votes for infeasible trajectories to get the next optimal trajectory, avoiding the need to retrain for particular environmental constraints and allowing the learner to generate policies that generalize across environments.
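A sketch of this one-versus-one voting scheme with infeasible actions masked out is shown below. It is not the authors' implementation; the RBF gamma default corresponds to the assumed conversion gamma = 1/(2σ²) with σ = 0.125.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

class OneVsOneActionPolicy:
    """One-versus-one SVM policy with vote masking for infeasible actions."""

    def fit(self, X, y, C=4.0, gamma=32.0):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.pairs_ = {}
        for i, j in combinations(self.classes_, 2):
            mask = (y == i) | (y == j)              # train on data from two classes only
            clf = SVC(kernel="rbf", C=C, gamma=gamma)
            clf.fit(X[mask], y[mask])
            self.pairs_[(int(i), int(j))] = clf
        return self

    def predict(self, x, infeasible=()):
        """Majority vote over pairwise classifiers, ignoring votes for actions
        whose trajectories would collide with obstacles."""
        blocked = set(int(c) for c in infeasible)
        votes = {int(c): 0 for c in self.classes_ if int(c) not in blocked}
        for clf in self.pairs_.values():
            winner = int(clf.predict(np.asarray(x).reshape(1, -1))[0])
            if winner in votes:
                votes[winner] += 1
        return max(votes, key=votes.get)
```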

[Figure 5: x axis Angle (−4 to 4), y axis Class (0 to 100)]

Fig. 5. A slice through the training data along the destination angle, for destinations at a fixed distance. The x axis is the angle to the destination for training instances whose next sensing point is at a fixed distance from the start robot pose. The y axis is the training label of each instance (i.e., the optimal action). It is impossible to draw a definite conclusion, but the data does not appear linearly separable.

In order to ensure that the SVM can learn well, an appropriate kernel must be chosen. Figure 5 shows a slice through a training data set along the destination bearing direction at a fixed distance to the next destination sensing point. The training data has some clear structure, but does not appear linearly separable, suggesting the need for a non-linear kernel.

Cross-validation was performed with the learned classifier on the test set to determine how well the trajectory exploration function was learned for a linear kernel, a polynomial kernel and a Gaussian kernel. A parameter search over the misclassification penalty C and covariance width σ, for C, σ ∈ {2^i | −5 ≤ i ≤ 7, i ∈ Z}, was additionally performed to find the best values. A graph of the parameter search is shown in Figure 6(a)-(c). The y-axis is the classification error rate on the test dataset, the x-axis is log(C), and each curve corresponds to a different value of σ (or polynomial order). The best values over a range of datasets were C = 4, σ = 0.125.
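The parameter search itself is easy to reproduce in outline; the sketch below uses a k-fold variant rather than the paper's held-out test set, and the gamma = 1/(2σ²) conversion is again an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def kernel_parameter_search(X, y, folds=5):
    """Grid search over C and the Gaussian width sigma on the 2^i grid above."""
    grid = [2.0 ** i for i in range(-5, 8)]
    best, best_err = None, np.inf
    for C in grid:
        for sigma in grid:
            clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
            err = 1.0 - cross_val_score(clf, X, y, cv=folds).mean()
            if err < best_err:
                best, best_err = (C, sigma), err
    return best, best_err
```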

[Figure 6 panels: (a) Linear Kernel, (b) Polynomial Kernel, (c) Gaussian Kernel; x axis: log(C), y axis: error (percentage of test set classified incorrectly)]

Fig. 6. The effect of the search for good kernel parameters: (a) a linear kernel, (b) a polynomial kernel, and (c) a Gaussian kernel. Graph (b) contains data for multiple polynomial kernels, but all appeared to have identical performance. The graphs confirm the suggestion of Figure 5 that the training data is not linearly separable, in that the fastest graph to converge is the Gaussian kernel. This is the kernel that was used in this paper.

We can visualize the learned trajectories to the destination to confirm that the data matches our expectations. Figure 7(a) shows a set of training data consisting of a fixed robot start pose and different destination sensing locations sampled around a 3-meter-radius circle centered on the robot. The training label on each sample was found by simulating each trajectory and retaining the maximum-reward action, a process that is subject to noise. Therefore, some of the training examples appear to have incorrect labels. Figure 7(b) shows the policy learned from this data. The learner has generalized enough to avoid many of the spurious noisy labels, but classification error does persist. The classification error was in general 20% on this 100-class problem.

As a comparison, we examined the learner on the reward functions trace(Σ) and the entropy of the distribution, ent(Σ), which would have the effect of optimizing the policy for different norms on the predicted filter covariance, rather than the experienced filter error. The trace controller led to trajectories that were more reflective of the desired motion, with much lower curvature. The controller trained on the ent(Σ) reward led to a policy that was much closer to a shortest-path controller, suggesting that the ent(Σ) reward function is not a good substitute for squared error.

6. EXPERIMENTAL RESULTS AND ANALYSIS

Having demonstrated that the classifier was capable of learning the target concept, we trained a control policy and evaluated the performance in both simulated and real experiments. In both cases, the resulting map was significantly improved with the learned trajectories.

6.1. Policy Training

Collecting training data consisted of four phases. Firstly, a model of the robot motion and sensing was built for generating simulated training data. The simulator is part of the CARMEN robot control package, and assumes a quasi-holonomic robot with three degrees of freedom, such that x = (x, y, θ). The robot was also equipped with a 180° field-of-view laser range sensor. The motion model followed (Eliazar and Parr 2004), such that the robot control was in terms of a forward translation and a rotation. Each control consisted of a rotation dθ_1, a translation dr along the direction θ, and a final rotation dθ_2. The motion error had a translation (down-range), rotation and slip (cross-range) component, where slip corresponds to a translation in the direction perpendicular to the expected direction of motion. The specific parameters of the motion model were generated by collecting example controls and sensor data from the robot. The Expectation-Maximization model learning algorithm of (Eliazar and Parr 2004) was used to fit the motion model parameters to the data. The complete equations of motion and the fitted model parameters are given in Appendix A. The sensing model was a laser range finder, with a feature extraction algorithm that follows the algorithm in (Walter and Leonard 2004).

14

Page 15: Trajectory Optimization using · 2010-05-10 · Trajectory Optimization using 1 Reinforcement Learning for Map Exploration Thomas Kollar and Nicholas Roy, Abstract—Automatically

corresponding to different destinations

Training instances

Sample Destinations

Robot StartPose

(a) Training Data (b) Classifier Output

(c) Learned Policy Usingtrace(Σ) (d) Learned Policy Usingent(Σ)

Fig. 7. (a) Example training data. The robot start pose is fixed, and plotted are the optimal actions for different destinationssampled on the circumference of a circle 3 meters in radius centered at the robot pose. (b) Example learned labels for a test set.The robot start pose is again fixed and the samples are again drawn from a circle of radius 3 meters around the robot. (c) and (d)Outcomes of the controller trained ontrace(Σ) andent(Σ).

The sensing model was a laser range finder, with a feature extraction algorithm that follows the algorithm in (Walter and Leonard 2004).

Second, in order to train each one-step policy, 1,000 pairs of start pose and destination sensing location (x, Li) were sampled randomly in a simulated map. These poses were sampled from an empty map containing only pre-constructed landmarks. For each action a ∈ A, the robot was placed at the start pose, the action was executed, and an EKF was updated. The action space consisted of 100 different parameterizations of a cubic spline interpolating x and Li, and the robot followed each spline using a pure-pursuit controller. The reward (trace or entropy) was computed and each trajectory was labeled with its reward; for each (x, Li) pair, the maximum-reward action was retained as the training label and included in the training data.
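A minimal sketch of this data-generation phase is given below. The callables sample_pose, simulate_action, and reward, the map extent, and the robot-frame state encoding are hypothetical stand-ins for the CARMEN-based simulator components described above, not the authors' actual code; the 3 m destination radius follows Figure 7.

import numpy as np

N_PAIRS = 1000        # (start pose, destination) training pairs
N_ACTIONS = 100       # cubic-spline parameterizations per pair
RADIUS = 3.0          # destination offset used in Figure 7 (assumed here for all pairs)

def relative_state(x, dest):
    """Destination expressed in the robot frame; the exact state encoding is an assumption."""
    dx, dy = dest - x[:2]
    c, s = np.cos(-x[2]), np.sin(-x[2])
    return np.array([c * dx - s * dy, s * dx + c * dy])

def generate_training_set(sample_pose, simulate_action, reward,
                          rng=np.random.default_rng(0)):
    """Label each (x, L_i) pair with the index of its maximum-reward spline action.

    sample_pose()             -> random start pose (x, y, theta) in the landmark map
    simulate_action(x, L, a)  -> final EKF covariance after executing spline action a
    reward(Sigma)             -> scalar reward, e.g. -np.trace(Sigma)
    """
    states, labels = [], []
    for _ in range(N_PAIRS):
        x = sample_pose()
        phi = rng.uniform(-np.pi, np.pi)
        dest = x[:2] + RADIUS * np.array([np.cos(phi), np.sin(phi)])
        rewards = [reward(simulate_action(x, dest, a)) for a in range(N_ACTIONS)]
        states.append(relative_state(x, dest))
        labels.append(int(np.argmax(rewards)))   # labels are noisy, as noted in Section 5
    return np.array(states), np.array(labels)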

Third, a support vector machine classifier was trained using the location of the destination(s) as the state and an integer action number as the class. A Gaussian kernel was used, as discussed in Section 5.1. The kernel parameters used to train this classifier were C = 4 and σ = 0.125.
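The third phase then reduces to fitting a multiclass SVM on the arrays produced above. The sketch uses scikit-learn; mapping the reported kernel width σ to scikit-learn's gamma as 1/(2σ²), and the one-vs-one multiclass scheme, are assumptions about the Gaussian-kernel convention rather than details taken from the text.

from sklearn.svm import SVC

C, sigma = 4.0, 0.125
policy = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2),
             decision_function_shape="ovo")   # assumed multiclass scheme
policy.fit(states, labels)                     # arrays from the data-generation phase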

Finally, the support vector machine was loaded at runtime and novel destinations were given to the robot. In real time, the classifier classified the current state


to generate a trajectory, which the robot followed from its current pose to the destination.
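At runtime the policy is therefore a single classification followed by spline generation and tracking. In the sketch below, spline_for_action and the pure-pursuit tracker are hypothetical helpers standing in for the spline parameterization and controller described above, and relative_state is the assumed state encoding from the earlier sketch.

def choose_trajectory(policy, robot_pose, destination, spline_for_action):
    """Classify the relative destination and return the selected cubic-spline trajectory."""
    state = relative_state(robot_pose, destination).reshape(1, -1)
    action = int(policy.predict(state)[0])
    return spline_for_action(action, robot_pose, destination)

# The robot then tracks the returned spline with a pure-pursuit controller, e.g.:
#   trajectory = choose_trajectory(policy, pose, next_sensing_location, spline_for_action)
#   pure_pursuit.follow(trajectory)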

6.2. Artificial Simulated Environment

To understand the significance of the difference between the learned controller and a conventional Shortest-Path coverage controller touring the destination sensing locations, a set of simulated experiments was analyzed. The first of these was in an artificial world that was created to be difficult to map; in particular, when using a scan-matching based approach, matching the right spokes to one another can be difficult, since many different alignments may look the same.

Fig. 8. An artificial simulated map used for initial evaluation in simulation. The white pixels are free space and the black pixels are obstacles.

Figure 8 shows the artificial test map that was used for an initial evaluation of the learned exploration policy. The robot was placed initially in the center of the (unknown) map and given a fixed set of destination sensing locations. The free space of the map is a star polygon with a kernel in the center of the map, which meant that a coverage trajectory was a short loop around the inner circle. Figure 9(a) shows a Shortest-Path trajectory computed using a standard shortest-path motion planner (Barraquand et al. 1992), and Figure 9(d) shows the trajectory of the Learned controller.

Figure 9(b) and Figure 9(e) show maps created by using the recorded (simulated) odometry to directly compute the environmental model from the (simulated) laser range data. In both cases the odometry is noisy, and the maps contain substantial errors and misalignments. However, even after using a robust mapping algorithm to correct the maps, the trajectory of the Learned controller led to a substantially better map than that of the Shortest-Path controller.

For the Shortest-Path controller, we can clearly see a number of mapping errors. Referring to Figure 8, we can see an opening on one end of the map; this opening is not present in the map from the Shortest-Path controller. Further, a number of the passageways are blocked off, whereas in the map acquired by the Learned controller this is not the case. Finally, the environment was square; examining the bottom right of the Shortest-Path map, we can see that some of the scans have been misaligned with respect to the square.

Since we performed these experiments in simulation, the true position of the robot was available. After recovering the estimated pose of the robot over the entire trajectory using a SLAM algorithm, the distance of the estimated trajectory from the true trajectory was computed; this error is plotted in Figures 10(a) and 10(b). The robot pose error was substantially reduced by the Learned controller, and moreover, the error of the Shortest-Path controller appears to be growing, whereas in repeated trials the error resulting from the Learned controller remained small and bounded.
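The per-timestep error plotted in Figure 10 is simply the squared distance between the estimated and true positions. Assuming both trajectories are stored as N×3 arrays of (x, y, θ) with matching timestamps, it could be computed as in this sketch:

import numpy as np

def squared_position_error(estimated, ground_truth):
    """Squared (x, y) distance at each timestep between two N x 3 pose arrays."""
    diff = estimated[:, :2] - ground_truth[:, :2]
    return np.sum(diff ** 2, axis=1)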

Overall, in this environment, we can conclude that the Learned controller provided sensor data that was much easier to integrate and of much higher quality than that of the Shortest-Path controller. Map errors were reduced, and the true position of the robot could be estimated much more accurately. The use of a robust mapping algorithm for these experiments and the repeatability of the mapping errors lead us to believe that these results will generalize across mapping algorithms.

6.3. Realistic Simulated Environments

Following these results, further tests were run in simulations of real-world environments. We evaluated the performance in exploring and mapping eight different environmental models using maps


(a) True/Estimated trajectories (Shortest-Path) (b) Raw Measurements (Shortest-Path) (c) Corrected Measurements (Shortest-Path) (d) True/Estimated trajectories (Learned) (e) Raw Measurements (Learned) (f) Corrected Measurements (Learned)

Fig. 9. Comparison of the Shortest-Path exploration controller vs. the Learned controller. (a) Exploration trajectory using the Shortest-Path controller. (b) Raw data plotted using odometry estimates for the Shortest-Path trajectory; the red dots are laser endpoints calculated from the robot pose based on dead reckoning (odometry) and the laser range data. (c) Final map inferred by a SLAM algorithm from data collected by the Shortest-Path controller. (d) Exploration trajectory using the Learned controller. (e) Raw data plotted using odometry estimates for the Learned trajectory. (f) Final map inferred by the SLAM algorithm from data collected by the Learned controller.

available from the Radish data set (Howard and Roy 2003). Figures 11(a) and 11(c) show example trajectories. Figures 11(b) and 11(d) show that the Shortest-Path controller often has large deviations from the true position, but the Learned controller matches the true trajectory very closely. The error of the Learned controller was only briefly above 0.1 m, whereas the error of the Shortest-Path controller grew to 0.3 m. When the robot revisited these locations to remap the area, errors of 0.3 m could have been catastrophic.

Finally, the average error between the estimated and ground-truth robot trajectories in each map was computed. These results are shown in Figure 12. The difference between controllers varies across maps; for maps 1, 3, 4, 5, and 7, the Learned controller provided a significant improvement. Note that the error bars are plotted at 1,000 standard errors² in order to be visible, demonstrating substantial statistical significance in the differences between the mean values.

In map 0, a comparatively short trajectory was followed, so it is reasonable that the difference was due to the uncertainty in the mapping process. In map 6, the point-turn controller performed better, largely because the environment was very tight. Further, even though the average map error was reduced by the Shortest-Path controller, its mapping process diverged at the end and, had the data collection not been terminated, the error induced by the Learned

²Standard error is defined as σ/√N.


(a) Error in estimated position (Shortest-Path) (b) Error in estimated position (Learned)

Fig. 10. (a) The squared error between the estimated robot position and the ground-truth position for the trajectory of the Shortest-Path controller. (b) The squared error between the estimated robot position and the ground-truth position for the trajectory of the Learned controller.

Fig. 12. The average mean squared error between the estimated and true robot pose over a variety of trajectories. The error bars are at 1,000 standard errors, which demonstrates that there is a substantial statistically significant difference between the control methods.

controller appeared to be improving compared to the Shortest-Path controller. In conclusion, in five of the eight trials the Learned controller was considerably better than the Shortest-Path controller in terms of squared error, and in the others the two were equivalent.

As a final analysis, Figure 13 plots an estimate of the error in the occupancy grid maps created by the mapper. The difference between the estimated and true robot pose shown in Figure 12 was

Fig. 13. The KL divergence between the occupancy grid computed from the corrected scans and the ground-truth occupancy grid.

the most reliable estimate of the accuracy of the mapping process, but as an additional verification, we compared the Kullback-Leibler divergence between the occupancy grid computed by the SLAM algorithm and the ground-truth occupancy grid computed from the true position of the robot in the simulation (Cover and Thomas 1991). Computing the KL divergence required searching for the minimum-error alignment between occupancy grids, as the mapping process can result in a rigid transformation between the inferred map and the


(a) Trajectory (Shortest-Path) (b) Pose Error (Shortest-Path) (c) Trajectory (Learned) (d) Pose Error (Learned)

Fig. 11. Learned trajectories for a simulated experiment using the Longwood map from the Radish data set. (a) The trajectory of the Shortest-Path controller. (b) The robot pose error for the Shortest-Path controller. (c) The trajectory of the Learned controller. (d) The robot pose error for the Learned controller.


corrected map, and this alignment process is subject to local minima. Additionally, neither the mapping process nor the exploration process is designed explicitly to minimize this error measure. However, the error in the maps as measured by the KL divergence (Figure 13) was consistent with the analysis of the error in robot pose (Figure 12), which confirmed the performance of the Learned controller.
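A per-cell version of this comparison, treating each grid cell as an independent Bernoulli occupancy probability, might be computed as in the sketch below. The alignment search over rigid transforms mentioned above is omitted and would wrap this function; the direction of the divergence (ground truth relative to estimate) and the clipping constant used to avoid log(0) are assumptions.

import numpy as np

def occupancy_kl(p_est, p_true, eps=1e-6):
    """KL divergence D(p_true || p_est) summed over cells of two aligned occupancy grids.

    p_est, p_true -- arrays of equal shape holding occupancy probabilities in [0, 1];
    each cell is treated as an independent Bernoulli variable.
    """
    p = np.clip(p_true, eps, 1.0 - eps)
    q = np.clip(p_est, eps, 1.0 - eps)
    per_cell = p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    return float(np.sum(per_cell))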

7. REAL-WORLD EXPLORATION

As a final validation, we evaluated the learned controller in a real exploration task. The learned controller was trained using simulated data as before, but the simulator model was itself learned from real data. There are numerous instances of reinforcement learning algorithms transferring well from simulation to real-world experiments when the simulator is a high-fidelity physical model of the world (Ng et al. 2004). As a result, we had high confidence that our learned policy would perform well in real exploration tasks.

Fig. 14. The robot used both for the simulator model and to test the learned controller.

The robot used for the experiments is shown in Figure 14. It was equipped with a single SICK laser range finder on the front and odometry encoders in each wheel. This robot was particularly interesting because its construction and differential-drive mechanism generate noisy odometry.

The experiment was performed on the first floor of the Stata Center at the Massachusetts Institute of Technology. No completely accurate "ground truth" map of a real-world environment exists due to non-permanent objects, and architectural drawings of the Stata Center are also inaccurate due to permanent changes to the environment's structure. Thus, before running experiments, a map of the area was pre-generated by hand, and a process running Monte Carlo Localization onboard provided an estimate of ground truth. During the experiment, the robot chose actions in real time and drove around a large loop that had been identified during simulation as difficult to map.

Figures 15(a) and 15(c) show the estimated trajectories for the Shortest-Path and Learned controllers. It is clear that the Learned controller matches the estimated path quite well. As was seen in simulation, the Shortest-Path controller diverges as it comes around the loop.

Figures 15(b) and 15(d) show the error between the estimated robot pose during mapping and the estimated true pose. These plots confirm that data collected by the Shortest-Path controller resulted in a much more inaccurate map: by the end of the trajectory it was nearly 1.5 meters off course, while the Learned controller was never more than half a meter away from the true trajectory at any given time. These results indicate not only that the trajectories can have a large influence on map creation, but also that learning the best trajectories from simulation helps immensely during the mapping process.

8. RELATED WORK

This is not the first work to consider the problem of active exploration for building the most accurate, lowest-uncertainty map. One popular approach in the literature, described in several places (Whaite and Ferrie 1997; Feder et al. 1999; Bourgault et al. 2002), is gradient ascent in information gain. This approach has substantial computational efficiency, in that the information gain need only be computed for the 8-connected neighboring states. The major disadvantage of the gradient ascent approach is that it is subject to local minima. While it has been argued that repeated observations taken from a local minimum will eventually flatten out the information gain at that pose, this process may take a considerable amount of time, as is evidenced by the slow convergence of this approach in the experimental results. The gradient ascent approach also has no


(a) Trajectory (Shortest-Path) (b) Pose Error (Shortest-Path) (c) Trajectory (Learned) (d) Pose Error (Learned)

Fig. 15. Learned trajectories for a real experiment on the first floor of the Stata Center at MIT. (a) The trajectory of the Shortest-Path controller. (b) The robot pose error for the Shortest-Path controller. (c) The trajectory of the Learned controller. (d) The robot pose error for the Learned controller.

notion of "closing the loop" and cannot model long paths with large payoffs in map certainty at the end of the path.

Stachniss et al. (Stachniss and Burgard 2003) describe one of the few global exploration algorithms, in which they explicitly compute the information available at all locations in the environment. Their algorithm is closest in spirit to ours, although they integrate an additional notion of travel costs to gather data at different locations in the environment. The disadvantage of their approach is that computational costs limit them to searching for the maximally informative single location, rather than a maximally informative trajectory through a series of locations in the environment.

Some attention has already been paid to the problem of completeness during automatic data collection, or coverage (Choset 2000; Acar et al. 2003), especially in the sensor network and computational geometry communities. Coverage exploration can often be defined in terms of sensor placement, that is, determining point locations to minimize the number of sensor measurements required to sense the entire space. With the exception of some


work in coverage behaviors (Choset 2001), these exploration strategies generally ignore the effect of motion between measurement points. However, data from a poorly chosen motion and sensing trajectory between measurement points can substantially interfere with the overall inference.

The problem of active exploration is related to a number of other sequential decision-making problems. The active localization problem is one in which the robot's position is unknown but the map is known. Fox et al. (Fox et al. 1998) use the same criterion of entropy minimization to find a trajectory that localizes the robot as quickly as possible. The authors choose a purely local gradient ascent approach, although they augment the information gain with a notion of the relative costs of different actions.

Active exploration and localization are both specific instances of a more general planning framework known as the Partially Observable Markov Decision Process (Sondik 1971), in which the controller makes decisions based on the complete state distribution rather than the mean estimate. Exact solutions for the optimal POMDP policy are computationally intractable, but good approximation algorithms have emerged recently. The Augmented MDP (AMDP, also known as "Coastal Navigation") algorithm (Roy and Thrun 1999) in particular assumes that the state distribution is Gaussian, and uses this assumption to find efficient approximations to the POMDP policy. The AMDP approach could be applied to the active exploration problem if the information gain were represented by an "information volume"; each voxel in the information volume would represent the current robot pose and map entropy. Given a transition function relating a voxel and an action to some posterior voxel, the planning problem would simply be a shortest-path problem to the "lowest" reachable voxel in the information volume. The disadvantage of this approach is that computing the expected transitions in the information volume is infeasible for reasonably-sized maps.

More recently, non-myopic exploration algorithms have been developed by the sensor network community. The sub-modularity properties of information gain have been used to demonstrate efficient

algorithms for planning in single-robot (Meliou et al. 2007) and multi-robot domains (Singh et al. 2007). In contrast to this work, however, these algorithms do not model the information loss due to motion and are essentially forms of coverage. The closest algorithm in spirit to our reinforcement learning algorithm uses Gaussian processes to model the predicted reward of an exploring robot (Martinez-Cantin et al. 2007). The Gaussian process learner is able to learn multi-step trajectories and incorporate both predicted information gain from observations and information loss due to motion. However, the learning is performed on a per-step basis: given a new destination for the robot, a new GP value function must be learned. Ours is the only work we are aware of that learns a general-purpose policy for exploration that is independent of the environment.

Finally, active learning and statistical experiment design solve very related problems, addressing the question of, given a space of possible data (or queries, or experiments), what is the one additional measurement that would maximally improve the existing model (Cohn et al. 1996; Freund et al. 1997). The active learning approaches typically maximize the information gain criterion as we do in this work, but the disadvantage of these approaches is that they are not physically motivated; that is, there is no notion of "traveling" to acquire more data, no notion of a sequential decision-making process that gathers a stream of data, and no notion of the relative costs of different queries.

9. CONCLUSION

We have demonstrated in this paper a novel approach for trajectory selection to explore unknown environments. We have shown that reinforcement learning is an effective way to identify control policies that can explore an unknown environment and collect data to minimize map error. The technical contributions are a problem formulation that allows exploration for coverage to be balanced with exploration for map accuracy using a constrained optimization representation, and a formulation of the exploration problem as a tractable partially observable Markov decision process. In particular, we demonstrate that the use of a policy class based


on smooth trajectories interpolating way-point constraints leads to efficient learning. By learning with this policy class of smooth trajectories, the robot is biased towards controllers that minimize the error induced by its own motion. Additionally, by using a policy class where each action has considerable duration, the effective horizon length of the problem is dramatically reduced.

Our method relies on a physically accurate simulation of robot kinematics, but the parameters of the simulation were fitted to real data, allowing us to transfer policies learned in simulation seamlessly onto a real robot. We demonstrated that our algorithm was able to generate trajectories for exploration in large-scale environments, resulting in much more accurate maps than those produced by a conventional shortest-path controller. The learned exploration controller not only minimized map error but also prevented unbounded growth in error over the course of the exploration. While other forms of reinforcement learning are often environment-specific, our algorithm generalizes efficiently to many real-world environments. As a result, once trained, our learned policy could be executed in real time. We believe this is the only current real-time exploration algorithm capable of maximizing map accuracy.

A number of interesting questions are still to be addressed in the exploration literature. We relied on a conventional coverage planner to generate destination sensing locations, focusing only on the problem of map accuracy; a globally optimal solution would involve a coverage planner that can use the learned policy to make decisions about where to move. Secondly, our current formulation used the EKF formulation of mapping; as a result, our exploration algorithm largely focused on trajectories that minimized the Gaussian noise captured by the EKF model. We would like to extend our model to allow the learner to better predict trajectories that may violate the assumptions of the mapper, such as linearity or accurate data association.

REFERENCES

Acar, E. U., Choset, H., Zhang, Y., and Schervish, M. (2003). Path planning for robotic demining: Robust sensor-based coverage of unstructured environments and probabilistic methods. International Journal of Robotics Research 22(7-8): 441–466.

Bagnell, J. D., Kakade, S., Ng, A., and Schneider, J. (2003). Policy search by dynamic programming. In Neural Information Processing Systems, volume 16. MIT Press.

Barraquand, J., Langlois, B., and Latombe, J. (1992). Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man and Cybernetics 22(2): 224–241.

Bosse, M., Newman, P. M., Leonard, J. J., and Teller, S. (2005). SLAM in large-scale cyclic environments using the Atlas framework. International Journal of Robotics Research 23(12): 1113–1139.

Bourgault, F., Makarenko, A., Williams, S., Grocholsky, B., and Durrant-Whyte, H. (2002). Information based adaptive robotic exploration. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). EPFL, Lausanne.

Bourque, E. and Dudek, G. (2000). On-line construction of iconic maps. In Proc. IEEE International Conference on Robotics and Automation. San Francisco, CA.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 121–167.

Choset, H. (2000). Coverage of known spaces: The boustrophedon cellular decomposition. Autonomous Robots 9: 247–253.

Choset, H. (2001). Coverage for robotics – A survey of recent results. Annals of Mathematics and Artificial Intelligence 31(1): 113–126.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research 4: 129–145.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York, NY, USA. ISBN 0471062596.

Davison, A. J., Reid, I. D., Molton, N. D., and Stasse, O. (June 2007). MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6): 1052–1067.

Dellaert, F. and Kaess, M. (2006). Square root SAM: Simultaneous localization and mapping via square root information smoothing. International Journal of Robotics Research 25(12): 1181–1203. ISSN 0278-3649.

Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. In Multiple Classifier Systems, volume 3541, 278–285.

Eliazar, A. and Parr, R. (2004). Learning probabilistic motion models for mobile robots. In Proceedings of the International Conference on Machine Learning, volume 69, 32.

Eustice, R., Walter, M., and Leonard, J. (2005). Sparse extended information filters: Insights into sparsification. In Proceedings of the International Conference on Intelligent Robots and Systems, 3281–3288.

Feder, H. J. S., Leonard, J. J., and Smith, C. M. (1999). Adaptive mobile robot navigation and mapping. International Journal of Robotics Research, Special Issue on Field and Service Robotics 18(7): 650–668.

Fedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York.

Fox, D., Burgard, W., and Thrun, S. (1998). Active Markov localization for mobile robots. Robotics and Autonomous Systems 25(3-4): 195–207.

Freund, Y., Seung, H., Shamir, E., and Tishby, N. (1997). Selective sampling using the Query By Committee algorithm. Machine Learning 28: 133–168.

Ganguli, A., Cortes, J., and Bullo, F. (2006). Distributed deployment of asynchronous guards in art galleries. In American Control Conference, 1416–1421. Minneapolis, MN.

Haehnel, D., Burgard, W., and Thrun, S. (2003). Learning compact 3D models of indoor and outdoor environments with a mobile robot. Robotics and Autonomous Systems 44(1): 15–27.

Hähnel, D., Burgard, W., Fox, D., and Thrun, S. (2003). A highly efficient FastSLAM algorithm for generating cyclic maps of large-scale environments from raw laser range measurements. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. MacMillan, Toronto, Canada.

Howard, A. and Roy, N. (2003). The robotics data set repository (Radish). URL http://radish.sourceforge.net/.

Hsu, C. and Lin, C. (2001). A comparison of methods for multi-class support vector machines. Technical report, National Taiwan University, Taipei, Taiwan.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82(Series D): 35–45.

Leonard, J. and Durrant-Whyte, H. F. (1991). Simultaneous map building and localization for an autonomous mobile robot. In Proceedings of the IEEE International Workshop on Intelligent Robots and Systems, 1442–1447. Osaka, Japan.

Lu, F. and Milios, E. (1997). Globally consistent range scan alignment for environment mapping. In Autonomous Robots, volume 4, 333–349.

Makarenko, A., Williams, S., Bourgault, F., and Durrant-Whyte, H. (2002). An experiment in integrated exploration.

Martinelli, A., Nguyen, V., Tomatis, N., and Siegwart, R. (2007). A relative map approach to SLAM based on shift and rotation invariants. Robotics and Autonomous Systems 55(1): 50–61.

Martinez-Cantin, R. and Castellanos, J. (2005). Unscented SLAM for large-scale outdoor environments. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, 328–333.

Martinez-Cantin, R., de Freitas, N., Doucet, A., and Castellanos, J. (2007). Active policy learning for robot planning and exploration under uncertainty. In Proc. Robotics: Science and Systems (RSS). Atlanta, GA, USA.

Meliou, A., Krause, A., Guestrin, C., and Hellerstein, J. (2007). Nonmyopic informative path planning in spatio-temporal models. In Proc. of the 22nd Conference on Artificial Intelligence (AAAI), 602–607.

Moravec, H. and Martin, M. (1994). Robot navigation by 3D spatial evidence grids. Technical report, Mobile Robot Laboratory, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Moutarlier, P. and Chatila, R. (1989). An experimental system for incremental environment modeling by an autonomous mobile robot. In Proc. 5th International Symposium on Robotics Research. Tokyo.

Newman, P. M. and Leonard, J. J. (2003). Consistent convergent constant time SLAM. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), 1143–1150. Acapulco, Mexico.

Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. (2004). Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics.

Ollero, A. and Heredia, G. (1995). Stability analysis of mobile robot path tracking. In Proceedings of the 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, 461–466.

Olson, E., Leonard, J., and Teller, S. (2006). Fast iterative optimization of pose graphs with poor initial estimates. In International Conference on Robotics and Automation, 2262–2269.

Pfaff, P., Triebel, R., Stachniss, C., Lamon, P., Burgard, W., and Siegwart, R. (2007). Towards mapping of cities. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA). Rome, Italy.

Porta, J., Spaan, M., Vlassis, N., and Poupart, P. (2006). Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7: 2329–2367.

Roy, N. and McCallum, A. (2001). Toward optimal active learning through Monte Carlo estimation of error reduction. In Proceedings of the International Conference on Machine Learning (ICML 2001). Williamstown, MA.

Roy, N. and Thrun, S. (1999). Coastal navigation with mobile robots. In Advances in Neural Information Processing Systems 12, volume 12, 1043–1049.

Sim, R. (2006). On Visual Maps and their Automatic Construction. Ph.D. thesis, McGill University.

Sim, R., Dudek, G., and Roy, N. (2004). Online control policy optimization for minimizing map uncertainty during exploration. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2004), 1758–1763. New Orleans, LA.

Singh, A., Krause, A., Guestrin, C., Kaiser, W., and Batalin, M. (2007). Efficient planning of informative paths for multiple robots. In Proc. International Joint Conference on Artificial Intelligence (IJCAI).

Smith, R., Self, M., and Cheeseman, P. (1990). Autonomous Robot Vehicles, chapter Estimating Uncertain Spatial Relationships in Robotics, 167–193. Springer-Verlag, Orlando, FL. ISBN 0-387-97240-4.

Sondik, E. (1971). The Optimal Control of Partially Observable Markov Decision Processes. Ph.D. thesis, Stanford University, Stanford, California.

Stachniss, C. (2006). Exploration and Mapping with Mobile Robots. Ph.D. thesis, University of Freiburg, Department of Computer Science.

Stachniss, C. and Burgard, W. (2003). Exploring unknown environments with mobile robots using coverage maps. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico.

Stachniss, C., Grisetti, G., and Burgard, W. (2005). Information gain-based exploration using Rao-Blackwellized particle filters. In Proc. of Robotics: Science and Systems (RSS), 65–72. Cambridge, MA, USA.

Thrun, S. (1998). Bayesian landmark learning for mobile robot localization. Machine Learning 33.

Thrun, S. (2000). Monte Carlo POMDPs. In Advances in Neural Information Processing Systems, volume 12, 1064–1070. MIT Press.

Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics. The MIT Press, Cambridge, MA.

Thrun, S., Liu, Y., Koller, D., Ng, A., Ghahramani, Z., and Durrant-Whyte, H. (2004). Simultaneous localization and mapping with sparse extended information filters. International Journal of Robotics Research 23(7-8).

Thrun, S. and Montemerlo, M. (2005). The GraphSLAM algorithm with applications to large-scale mapping of urban structures. International Journal of Robotics Research 25(5/6): 403–430.

Walter, M., Eustice, R., and Leonard, J. (2005). A provably consistent method for imposing exact sparsity in feature-based SLAM information filters. In Proceedings of the 12th International Symposium of Robotics Research (ISRR). San Francisco, CA.


Walter, M. and Leonard, J. (2004). An experimental investigation of cooperative SLAM. In Proceedings of the Fifth IFAC Symposium on Intelligent Autonomous Vehicles. Lisbon, Portugal.

Whaite, P. and Ferrie, F. (1997). Autonomous exploration: Driven by uncertainty. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3): 193–205.

APPENDIX

The motion model with slip is given as g(ξt, ut, wt), where the control consists of an initial rotation dθ1, a forward translation dr in the current heading direction θ of the robot, and a final rotation dθ2, such that ut = (dθ1, dr, dθ2). An effective control $\tilde{u}_t = (\tilde{d\theta}_1, \tilde{dr}, \tilde{d\theta}_2, ds)$ is generated, where each effective control component is a random variable depending on both the control and the noise process w = (wdθ1, wdr, wdθ2, wds). An additional component ds is included in the effective control to represent the slip motion, that is, cross-range error. The effective control variables are calculated as

$$
\begin{aligned}
\tilde{d\theta}_1 &= \mu_{\theta_1} + w_{d\theta_1}\,\sigma_{\theta_1}, & w_{d\theta_1} &\sim \mathcal{N}(0,1),\\
\mu_{\theta_1} &= \alpha_{\theta,\theta}\,d\theta_1 + \alpha_{\theta,r}\,dr, & \sigma^2_{\theta_1} &= \beta^2_{\theta,\theta}\,\|d\theta_1\|^2 + \beta^2_{\theta,r}\,\|dr\|^2,\\[4pt]
\tilde{dr} &= \mu_{r} + w_{dr}\,\sigma_{r}, & w_{dr} &\sim \mathcal{N}(0,1),\\
\mu_{r} &= \alpha_{r,r}\,dr + \alpha_{r,\theta}\,d\theta, & \sigma^2_{r} &= \beta^2_{r,r}\,\|dr\|^2 + \beta^2_{r,\theta}\,\|d\theta\|^2,\\[4pt]
\tilde{d\theta}_2 &= \mu_{\theta_2} + w_{d\theta_2}\,\sigma_{\theta_2}, & w_{d\theta_2} &\sim \mathcal{N}(0,1),\\
\mu_{\theta_2} &= \alpha_{\theta,\theta}\,d\theta_2 + \alpha_{\theta,r}\,dr, & \sigma^2_{\theta_2} &= \beta^2_{\theta,\theta}\,\|d\theta_2\|^2 + \beta^2_{\theta,r}\,\|dr\|^2,\\[4pt]
ds &= \mu_{s} + w_{ds}\,\sigma_{s}, & w_{ds} &\sim \mathcal{N}(0,1),\\
\mu_{s} &= \alpha_{s,r}\,dr + \alpha_{s,\theta}\,d\theta, & \sigma^2_{s} &= \beta^2_{s,r}\,\|dr\|^2 + \beta^2_{s,\theta}\,\|d\theta\|^2,
\end{aligned}
$$

where dθ = dθ1 + dθ2. The means α_{i,j} and covariances β_{i,j} describe the contribution of each control j to the effective control i; e.g., α_{θ,r} is the mean contribution of the translation to the rotation. The controls dθ1 and dθ2 share model parameters, such that α_{θ1,r} = α_{θ2,r} = α_{θ,r}, so the subscripts 1 and 2 are omitted from the model parameters.

Each effective control has a dependence on the commanded translation and rotation, and each dependence consists of a bias and a random variance, leading to a total of 16 parameters that must be estimated. An Expectation-Maximization procedure (Eliazar and Parr 2004) was used to estimate the parameters for the robot in Figure 14, and the parameters are given in Table I.

A translational and rotational displacement ζt is then calculated from the effective controls as

$$
\zeta_t =
\begin{bmatrix}
\tilde{dr}\,\cos(\theta_{t-1} + \tilde{d\theta}_1) + ds\,\cos\!\big(\theta_{t-1} + \tilde{d\theta}_1 + \tfrac{\pi}{2}\big)\\[2pt]
\tilde{dr}\,\sin(\theta_{t-1} + \tilde{d\theta}_1) + ds\,\sin\!\big(\theta_{t-1} + \tilde{d\theta}_1 + \tfrac{\pi}{2}\big)\\[2pt]
\tilde{d\theta}_1 + \tilde{d\theta}_2
\end{bmatrix}.
$$

The complete motion can then be calculated as

$$
x_t = x_{t-1} + \zeta_t. \qquad (20)
$$

α_{r,r} = 1.0065      β²_{r,r} = 0.0932
α_{r,θ} = −0.0072     β²_{r,θ} = 0
α_{θ,r} = 0.0144      β²_{θ,r} = 0
α_{θ,θ} = 0.8996      β²_{θ,θ} = 0.3699
α_{s,r} = −0.0182     β²_{s,r} = 0
α_{s,θ} = −0.105      β²_{s,θ} = 0.0612

TABLE I
THE SIXTEEN PARAMETERS FOR THE STOCHASTIC MOTION MODEL OF THE ROBOT.
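To make the model concrete, the sketch below samples one effective control and applies the displacement of Equation 20 using the fitted parameters of Table I; it is an illustrative reimplementation under the equations above, not the CARMEN simulator code, and the sharing of rotation parameters between dθ1 and dθ2 follows the text.

import numpy as np

# Fitted parameters from Table I; rotation parameters are shared by dtheta1 and dtheta2.
A_RR, A_RT = 1.0065, -0.0072      # alpha_{r,r},      alpha_{r,theta}
A_TR, A_TT = 0.0144,  0.8996      # alpha_{theta,r},  alpha_{theta,theta}
A_SR, A_ST = -0.0182, -0.105      # alpha_{s,r},      alpha_{s,theta}
B_RR, B_RT = 0.0932,  0.0         # beta^2_{r,r},     beta^2_{r,theta}
B_TR, B_TT = 0.0,     0.3699      # beta^2_{theta,r}, beta^2_{theta,theta}
B_SR, B_ST = 0.0,     0.0612      # beta^2_{s,r},     beta^2_{s,theta}

def sample_motion(pose, control, rng=np.random.default_rng()):
    """Sample the next pose given u = (dtheta1, dr, dtheta2), following Appendix A."""
    x, y, theta = pose
    dth1, dr, dth2 = control
    dth = dth1 + dth2

    def gauss(mu, var):
        return mu + rng.standard_normal() * np.sqrt(var)

    # Effective (noisy) controls.
    dth1_e = gauss(A_TT * dth1 + A_TR * dr, B_TT * dth1**2 + B_TR * dr**2)
    dth2_e = gauss(A_TT * dth2 + A_TR * dr, B_TT * dth2**2 + B_TR * dr**2)
    dr_e   = gauss(A_RR * dr  + A_RT * dth, B_RR * dr**2  + B_RT * dth**2)
    ds_e   = gauss(A_SR * dr  + A_ST * dth, B_SR * dr**2  + B_ST * dth**2)

    # Down-range motion along theta + dth1_e, slip perpendicular to it (Equation 20).
    heading = theta + dth1_e
    x_new = x + dr_e * np.cos(heading) + ds_e * np.cos(heading + np.pi / 2)
    y_new = y + dr_e * np.sin(heading) + ds_e * np.sin(heading + np.pi / 2)
    return np.array([x_new, y_new, theta + dth1_e + dth2_e])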
