Multirobot Coordination for Space Exploration

Logan Yliniemi, Adrian K. Agogino, Kagan Tumer

AI Magazine, Winter 2014. Copyright © 2014, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

Teams of artificially intelligent planetary rovers have tremendous potential for space exploration, allowing for reduced cost, increased flexibility, and increased reliability. However, having these multiple autonomous devices acting simultaneously leads to a problem of coordination: to achieve the best results, they should work together. This is not a simple task. Due to the large distances and harsh environments, a rover must be able to perform a wide variety of tasks with a wide variety of potential teammates in uncertain and unsafe environments. Directly coding all the necessary rules that can reliably handle all of this coordination and uncertainty is problematic. Instead, this article examines tackling this problem through the use of coordinated reinforcement learning: rather than being programmed what to do, the rovers iteratively learn through trial and error to take actions that lead to high overall system return. To allow for coordination, yet allow each agent to learn and act independently, we employ state-of-the-art reward-shaping techniques. This article uses visualization techniques to break down complex performance indicators into an accessible form and identifies key future research directions.
Imagine for a moment that you're tasked with teleoperating (controlling with a joystick) a Mars rover as it navigates across the surface. You watch the feed from the on-board camera as the rover rolls along the surface, when you notice the terrain changing ahead, so you instruct the rover to turn. The problem? You're 6 minutes too late. Due to the speed-of-light delay in communication between yourself and the rover, your monolithic multimillion dollar project is in pieces at the bottom of a Martian canyon, and the nearest repairman is 65 million miles away.

There are, of course, solutions to this type of problem. You can instruct the rover to travel a very small distance, then reevaluate its situation before the next round of travel, but this leads to painfully slow processes that take orders of magnitude longer than they would on Earth. The speed of light is slow enough that it hinders any attempt at interacting regularly with a rover on another planet.

But what if, instead of attempting to control every aspect of the rover's operation, we were able to take a step back and simply tell the rover what we're trying to find, and have it report back when it finds something we think is interesting? Giving the rover this type of autonomy removes the need for constant interaction and makes the speed of light a moot point.

Hard-coding a procedure for handling all of the cases a rover could encounter while navigating — and the thousands of other tasks that a rover might have to undertake — is not a good option in these cases. The need for flexibility is key, and the on-board storage space is typically quite limited. Due to the large distances, communication lag, and changing mission parameters, any efforts in space exploration need to be extremely robust to a wide array of possible disturbances and capable of a wide array of tasks. In short, as the human race expands its efforts to explore the solar system, artificial intelligence will play a key role in many high-level control decisions.

Figure 1. The Speed-of-Light Communication Delay Makes Artificial Intelligence a Necessity for Space Exploration.

However, giving a rover that cost many person-years of labor and a multimillion dollar budget complete autonomy over its actions on another planet might be a bit unnerving. Space is a harsh and dangerous place; what if it isn't able to achieve the tasks it needs to? Worse, what if the rover finds an unpredicted and creative way to fail? These are legitimate concerns, worth addressing seriously.

One way to mitigate these concerns is to take the concept of a single traditional monolithic rover and break it up into many pieces, creating a team of rovers, with one rover to embody each of these pieces. Each would be simple and perform just a few functions. Though each of the pieces is less effective individually than the monolithic rover, the sum of the pieces is greater than the whole in many ways.

First, any of the members of the team is significantly more expendable than the whole monolithic rover. This alleviates a large number of concerns and opens many opportunities. If one rover does find a way to fail creatively, the remainder of the team is still completely operational. By the same token, the team of rovers can undertake more dangerous missions than the monolithic rover; if the dangerous conditions lead to the failure of one rover, the rest can complete the mission. Additionally, redundancy can be designed into the team for particularly dangerous or critical roles.

Beyond the disposability of the individual team members, there are other benefits to this team-based approach. Savings can be realized in construction, as each rover can be designed with parts from a lower-cost portion of the reliability curve. Similar savings are available in the design process, as a new team can be formed with some members that have been previously designed.

In addition, a team of rovers can have capabilities that a single monolithic rover cannot, like having a presence in multiple locations at once, which is incredibly useful for planetary exploration. Ephemeral events can be simultaneously observed from separate locations (Estlin et al. 2010), even from the ground and from orbit simultaneously (Chien et al. 2011), which can make interpreting the situation significantly easier. Construction tasks that might be impossible for a single rover with limited degrees of freedom become much easier. Teams can survey areas separated by impassable terrain and share long-range communication resources (Chien et al. 2000).

However, the concerns that we must address expand rapidly once we start to consider the possibilities that arise with multiple rovers acting in the same area simultaneously. How do the rovers coordinate so that their efforts lead to the maximum number of interesting discoveries? How does a rover decide between achieving a task on its own versus helping another rover that has become stuck? How does it decide between covering an area that's been deemed interesting or exploring an area that hasn't received much attention? These are all issues that fall under the larger umbrella of multiagent artificial intelligence (or multiagent systems), which is a ripe area of modern research (Wooldridge 2008).

One technique that has proven useful within the multiagent systems community is that of reward shaping used in conjunction with reinforcement learning. In this paradigm, instead of the rovers being told what to do, they each individually learn what to do through an iterative process of trial and error. In this process, each rover learns to maximize a reward function measuring its performance. By carefully shaping the rewards that the rovers receive, we can promote coordination and improve the robustness of the learning process (Mataric 1994; Taylor and Stone 2009). Our goals in reward shaping are to balance two fundamental tensions in learning: (1) the rewards that the rovers are maximizing should be informative enough that they can promote coordination of the entire system, and (2) they should be simple enough that the rovers can easily determine the best actions to take to maximize their rewards. There are a number of obstacles that can make achieving this goal more difficult.

Multiagent Coordination Is Hard

Being able to automatically learn intelligent control policies for autonomous systems is an exciting prospect for space exploration. Especially within the context of a coordinated set of autonomous systems, we have the possibility of achieving increased capabilities while maintaining an adaptive and robust system. However, these multiagent systems are fundamentally different from other types of artificial intelligence in two ways. First, we have to promote coordination in a multiagent system (see figure 2), since agents learning by themselves may work at cross-purposes. Second, we have to overcome increased learning complexity, as the actions taken by other agents increase the difficulty that any particular agent has in determining the value of its actions with respect to a coordinated goal.

In space applications, this coordination will involve many issues like optimizing communication networks, maximizing scientific information returned from a set of sensors, and coordinating power usage through shared power resources. As a guiding example, consider a group of autonomous rover agents set to explore an area of an extraterrestrial body. Their goal is to observe a series of points of interest, and gain as much knowledge about these points as possible on a teamwide level. This means that ideally each agent within the multiagent system will cooperate toward the common good, but how to do this is not immediately obvious. For example, it may not be readily apparent in practice that a rover is actively observing a point that has been well studied at an earlier point in time. The rover's action of observing that point may appear to be a very good choice, except that the other agents acting in the environment had already gleaned the necessary information from the point, making the action redundant.

Complex communication protocols or teamwork frameworks may offer a solution to this problem, but it might not be a practical one for space travel. Communication availability is limited, and failures of existing rovers or introduction of new rovers that weren't originally planned into the team are a realistic expectation for space exploration (Stone et al. 2013). Because of the large travel times and distances, and unpredictable and harsh environments, flexibility in implementation is key, and the solution must be robust to all sorts of disturbances.

This flexibility can be developed through the use of adaptive agent policies, which change over time to fit the situation the rover encounters. This creates a learning multiagent system, which allows the team to effectively deal with changing environments or mission parameters. A key issue in a learning multiagent system is the choice of the reward function that the agents use.

How to Judge a Reward Function

A multiagent learning system depends on a way to measure the value of each agent's behavior. For instance, did a particular sensor reading give additional scientific value? Did a particular message sent efficiently use the communications channel? Did a particular rover movement put the rover in a good location and not interfere with the actions of another rover? This measurement is called a reward function, and changing what form the reward function takes is the science of reward shaping (Chalkiadakis and Boutilier 2003; Guestrin, Lagoudakis, and Parr 2002; Hu and Wellman 1998; Mataric 1998; Stone and Veloso 2000; Tumer, Agogino, and Wolpert 2002; Wolpert and Tumer 2001). An agent will seek solely to increase its reward function. Thus the reward should have two specific properties: sensitivity and alignment.

First, the reward function must be sensitive to the actions of the agent (Wolpert and Tumer 2001). An agent taking good actions should receive a high reward, and an agent taking poor actions should receive a lower reward. In an unpredictable, stochastic, or multiagent environment, there are other factors affecting the reward that the agent will receive. An ill-developed reward function will allow these random factors to insert a large amount of noise into the signal offered by the reward function, and as the signal-to-noise ratio decreases, so does the agent's performance.

Second, the reward function must be aligned with the overall mission that the agent team must achieve (Wolpert and Tumer 2001). That is, an agent that increases its own reward should simultaneously be increasing the system performance. A lack of alignment can lead to situations such as the tragedy of the commons (Hardin 1968; Crowe 1969), wherein a group of rationally self-concerned agents causes a drop in system performance by working at cross-purposes. That is, agent A does what it perceives in its own best interest, as does agent B; in some way, their actions deplete their shared environment and lead to both agents being worse off than they would have been had they cooperated for the communal good.

Both of these properties — sensitivity and alignment — are critical to multiagent systems. An agent must be able to clearly discern what it has done to earn a high reward, and continuing to earn that high reward must be in the best interest of the system as a whole. This is especially the case in space applications, because the large distances and the communication restrictions introduced by limited bandwidth, limited power, or line-of-sight requirements lead to delays that prevent outside intervention if the system performance were to go awry. In fact, even identifying that a problem exists within the system is challenging: space and extraplanetary exploration is a complex and difficult problem, and it might not be easy to immediately diagnose when agents aren't achieving their full potential.

In this article, we show one approach to diagnosing potential system performance issues through visualizing the sensitivity and alignment of various reward structures in a simple and straightforward manner.


Figure 2. Effective Single-Agent Learning May Lead to Incompatible Interactions in a Multiagent Setting.

Repetitive exploration and congestion are common problems.

Classic Approaches to Coordination Through Reward Shaping

There are three classic approaches to solving complex multiagent systems: robot totalitarianism, robot socialism, and robot capitalism. Each has specific advantages and drawbacks.

Robot Totalitarianism (Centralized Control)

First, consider a centralized system in which one agent is making all necessary decisions for the entire system as a whole, and all other agents are merely following orders. The advantages here are that perfect coordination is possible and the pieces of the system as a whole will cooperate to increase system performance. This typically works well for small systems consisting of just a few agents (Sutton and Barto 1998). However, such a centralized system can fall prey to complexities such as communication restrictions, component failures — especially where a single point of failure can stop the entire system — and simply the difficulty of solving a problem for hundreds or thousands of agents simultaneously. In most realistic situations, this is simply not an option.

Robot Socialism (Global or Team Reward)

Next, consider a system in which each agent is allowed to act autonomously in the way that it sees fit, and every agent is given the same global reward, which represents the system performance as a whole. They will single-mindedly pursue improvements on this reward, which means that their efforts are directed toward improving system performance, due to this reward having perfect alignment. However, because there may be hundreds or thousands of agents acting simultaneously in the shared environment, it may not be clear what led to the reward. In a completely linear system of n agents, each agent is only responsible for 1/n of the reward that they all receive, which can be entirely drowned out by the (n – 1)/n portion for which that agent is not responsible. In a system with 100 agents, that means an agent might only have dominion over 1 percent of the reward it receives! This could lead to situations in which an agent chooses to do nothing, but the system reward increases, because other agents found good actions to take. This would encourage that agent to continue doing nothing, even though this hurts the system, due to a lack of sensitivity of the reward.

Robot Capitalism (Local or Perfectly Learnable Reward)

Finally, consider a system in which each agent has a local reward function related to how productive it is. For example, a planetary rover could be evaluated on how many photographs it captures of interesting rocks. This means that its reward is dependent only on itself, creating high sensitivity. However, the team of rovers obtaining hundreds of photographs of the same rock is not as interesting as obtaining hundreds of photographs of different rocks, though these would be evaluated the same with a local scheme. This means that the local reward is not aligned with the system-level reward.

Summary

Each of these reward functions has benefits and drawbacks that are closely mirrored in human systems. However, we are not limited to just these reward functions; as we mentioned before, an agent will single-mindedly seek to increase its reward, no matter what it is, whether or not this is in the best interest of the system at large. Is there, perhaps, a method that could be as aligned as the global reward and as sensitive as the local reward, while still avoiding the pitfalls of the centralized approach?

Difference Rewards

An ideal solution would be to create a reward that is aligned with the system reward while removing the noise associated with other agents acting in the system. This would lead agents toward doing everything they can to improve the system's performance. Such a reward in a multirover system would reward a rover for taking a good action that coordinates well with rovers that are close to it, and would ignore the effects of distant rovers that are irrelevant.

A way to represent this analytically is to take the global reward G(z) of the world z, and subtract off everything that doesn't have to do with the agent we're evaluating, revealing how much of a difference the agent made to the overall system. This takes the form

Di(z) = G(z) – G(z–i) (1)

where G(z–i) is the global reward of the world without the contributions of agent i, and Di(z) is the difference reward.
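To make the counterfactual concrete, here is a minimal Python sketch of equation 1, under the assumption that the global reward can simply be re-evaluated with agent i's contribution removed; the toy coverage reward and all function names are illustrative, not taken from the article's rover system.

```python
# Minimal sketch of equation 1: D_i(z) = G(z) - G(z_-i).
# Assumes the global reward can be re-evaluated on a counterfactual joint
# state with agent i removed; names are illustrative, not from the article.

from typing import Any, Callable, Sequence

def difference_reward(global_reward: Callable[[Sequence[Any]], float],
                      joint_state: Sequence[Any],
                      i: int) -> float:
    """D_i(z): agent i's marginal contribution to the system reward."""
    g_full = global_reward(joint_state)
    # Counterfactual world z_-i: the same system without agent i's contribution.
    z_minus_i = list(joint_state[:i]) + list(joint_state[i + 1:])
    return g_full - global_reward(z_minus_i)

# Toy example: G counts the distinct sites covered by the team.
coverage = lambda z: float(len(set(z)))
team = ["site1", "site1", "site2"]           # agents 0 and 1 cover the same site
print(difference_reward(coverage, team, 0))  # 0.0 -> redundant with agent 1
print(difference_reward(coverage, team, 2))  # 1.0 -> sole observer of site2
```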

Let us first consider the alignment of this reward. G(z) is perfectly aligned with the system reward. G(z–i) may or may not be aligned, but in this case, it doesn't matter, because agent i (whom we are evaluating) has no impact on G(z–i), by definition. This means that Di(z) is perfectly aligned, because all parts that agent i affects are aligned: agent i taking action to improve Di(z) will simultaneously improve G(z).

Now, let us consider the sensitivity of this reward. G(z) is as sensitive as the system reward, because it is identical. However, we remove G(z–i) from the equation; that is, a large portion of the system — the portion on which agent i has no impact — does not affect Di(z). This means that Di(z) is very sensitive to the actions of agent i and includes little noise from the actions of other agents.

Difference rewards are not a miracle cure. They do require additional computation to determine which portions of the system reward are caused by each agent. However, it is important to note that it is not necessary to analytically compute these contributions. In many cases, a simple approximation that serves to remove a large portion of the noise caused by using the system-level reward gains significant performance increases over using the system reward alone.

Although in this article we focus on the continuous rover domain, both the difference reward and the visualization approach have broad applicability. The difference reward used in this article has been applied to many domains, including data routing over a telecommunication network (Tumer and Wolpert 2000), multiagent gridworld (Tumer, Agogino, and Wolpert 2002), congestion games such as traffic toll lanes (Tumer and Wolpert 2004a, 2004b; Wolpert and Tumer 2001), and optimization problems such as bin packing (Wolpert, Tumer, and Bandari 2004) and faulty device selection (Tumer 2005).

Continuous Rover Domain

To examine the properties of the difference reward in a more practical way, let us return to our example of a team of rovers on a mission to explore an extraterrestrial body, like the moon or Mars (figure 3). We allow each rover to take continuous actions to move in the space, while receiving noisy sensor data at discrete time steps (Agogino and Tumer 2004).

Points of Interest

Certain points in the team's area of operation have been identified as points of interest (POIs), which we represent as green dots. Figure 4 offers one of the layouts of POIs that we studied, with a series of lower-valued POIs located on the left of the rectangular world, and a single high-valued POI located on the right half. Because multiple simultaneous observations of the same POI are not valued higher than a single observation in this domain, the best policy for the team is to spread out: one agent will closely study the large POI, while the remainder of the team will cover the smaller POIs on the other side.

Sensor Model

We assume that the rovers have the ability to sense the whole domain (except in the results we present later marked with PO, for partial observability), but even so, using state variables to represent each of the rovers and POIs individually results in an intractable learning problem: there are simply too many parameters. This is also why a centralized controller does not function well in this case. We reduce the state space by providing eight inputs through the process illustrated in figure 5. For each quadrant, which rotates to remain aligned with the rover as it moves through the space, the rover has a rover sensor and a POI sensor. The rover sensor calculates the relative density and proximity of rovers within that quadrant and condenses this to a single value. The POI sensor does the same for all POIs within the quadrant.

Figure 3. A Team of Rovers Exploring Various Points of Interest on the Martian Surface.

Artist's rendition.

Figure 5. Rover Sensing Diagram.

Each rover has eight sensors: four rover sensors and four POI sensors that detect the relative congestion of each in each of the four quadrants that rotate with the rover as it moves.
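As a rough illustration of how such an eight-valued state could be computed, consider the sketch below. The inverse-squared-distance weighting and the handling of the quadrant boundaries are our assumptions; the article does not spell out the exact density formula.

```python
# Sketch of the eight-value state described above: for each of four quadrants
# (rotated into the rover's frame), one rover-density reading and one
# POI-density reading. The inverse-squared-distance weighting is an assumed
# stand-in for the article's density calculation, not a quote of it.

import math

def quadrant_sensors(rover_xy, rover_heading, other_rovers_xy, pois):
    """Return the eight-element state: four rover readings, then four POI readings.

    other_rovers_xy: list of (x, y) positions of the other rovers.
    pois:            list of (x, y, value) tuples for the points of interest.
    """
    def readings(items):
        sums = [0.0, 0.0, 0.0, 0.0]
        for x, y, value in items:
            dx, dy = x - rover_xy[0], y - rover_xy[1]
            # Express the bearing in the rover's frame so quadrants rotate with it.
            bearing = (math.atan2(dy, dx) - rover_heading) % (2.0 * math.pi)
            quadrant = min(int(bearing // (math.pi / 2.0)), 3)
            dist_sq = max(dx * dx + dy * dy, 1e-3)   # guard against division by zero
            sums[quadrant] += value / dist_sq        # closer and denser -> larger reading
        return sums

    rover_items = [(x, y, 1.0) for x, y in other_rovers_xy]
    return readings(rover_items) + readings(pois)
```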

Motion Model

We model the continuous motion of the rovers at each finite time step as shown in figure 6. We maintain the current heading of each rover, and at each time step the rovers select a value for dy and dx, where the value of dy represents how far forward the rover will move, and dx represents how much the rover will turn at that time step. The rover's heading for the next time step is represented as the direction of the resultant vector (dx + dy), shown as the solid line in figure 6.

Figure 6. Rover Motion Model.

At each time step, each rover determines a continuous dy value to represent how far it moves in the direction it is facing, and a dx value determining how far it turns. Its heading at the next time step is the same as the vector (dx + dy).
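The sketch below shows one way to implement this update. It reflects our reading of the motion model (move dy forward, then turn by the angle of the resultant vector), so treat it as an interpretation rather than the authors' exact implementation.

```python
# Sketch of the motion update described above: the rover moves dy forward along
# its current heading, and its next heading follows the resultant of the forward
# component dy and the lateral turn component dx in the rover's own frame.

import math

def motion_step(x, y, heading, dx, dy):
    """One time step of rover motion; returns the new (x, y, heading)."""
    # Move dy forward along the current heading.
    new_x = x + dy * math.cos(heading)
    new_y = y + dy * math.sin(heading)
    # The next heading is rotated by the angle of the resultant vector (dx + dy).
    new_heading = heading + math.atan2(dx, dy)
    return new_x, new_y, new_heading
```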

Policy Search

The rovers use multilayer perceptrons (MLPs) with sigmoid activation functions to map the eight inputs provided by the four POI sensors and four rover sensors through 10 hidden units to two outputs, dx and dy, which govern the motion of the rover. The weights associated with the MLP are established through an online simulated annealing algorithm that changes the weights with preset probabilities (Kirkpatrick, Gelatt, and Vecchi 1983). This is a form of direct policy search, where the MLPs are the policies.
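A compact sketch of this direct policy search might look as follows. The network sizes (8 inputs, 10 hidden units, 2 outputs) match the text; the perturbation probability, noise scale, annealing schedule, output scaling, and the episode-evaluation callback evaluate_reward are placeholders we invented for illustration.

```python
# Sketch of direct policy search: a small sigmoid MLP maps the eight sensor
# inputs to (dx, dy), and its weights are perturbed by a simulated-annealing
# loop. Hyperparameters and the evaluation callback are assumed, not quoted.

import math
import random

class MLPPolicy:
    """Eight sensor inputs -> 10 sigmoid hidden units -> (dx, dy) outputs."""

    def __init__(self, n_in=8, n_hidden=10, n_out=2):
        rnd = lambda: random.uniform(-1.0, 1.0)
        self.w1 = [[rnd() for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rnd() for _ in range(n_hidden)] for _ in range(n_out)]

    @staticmethod
    def _sigmoid(v):
        return 1.0 / (1.0 + math.exp(-max(min(v, 60.0), -60.0)))

    def act(self, sensors):
        hidden = [self._sigmoid(sum(w * s for w, s in zip(row, sensors)))
                  for row in self.w1]
        dx, dy = (self._sigmoid(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w2)
        return dx, dy

def anneal(policy, evaluate_reward, iterations=1000, temp=1.0, cooling=0.995):
    """Perturb the MLP weights, keeping worse ones with temperature-dependent probability."""
    current = evaluate_reward(policy)          # e.g., an episode rollout scored by P, T, or D
    for _ in range(iterations):
        old_w1 = [row[:] for row in policy.w1]
        old_w2 = [row[:] for row in policy.w2]
        for layer in (policy.w1, policy.w2):   # perturb each weight with a preset probability
            for row in layer:
                for j in range(len(row)):
                    if random.random() < 0.1:
                        row[j] += random.gauss(0.0, 0.1)
        score = evaluate_reward(policy)
        if score >= current or random.random() < math.exp((score - current) / temp):
            current = score                    # keep the perturbed weights
        else:
            policy.w1, policy.w2 = old_w1, old_w2   # revert
        temp *= cooling
    return policy
```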

Reward Structures

We present the visualizations for alignment and sensitivity of four reward structures in this work. The perfectly learnable local reward, Pi, is calculated by considering the value of observations of all POIs made by agent i throughout the course of the simulation, ignoring the contributions that any other agents had to the system.

The global team reward, Ti, is calculated by considering the best observation the team as a whole made during the course of the simulation.

The difference reward, Di, is calculated similarly to the perfectly learnable reward Pi, with the exception that if a second agent j also observed the POI, agent i is only rewarded with the difference between the quality of observations. Thus, if two agents observe a POI equally well, it adds to neither of their rewards, because the team would have observed it anyway. If an agent is the sole observer of a POI, it gains the full value of the POI observation.

Figure 4. A Team of Rovers Observing a Set of Points of Interest.

Each POI has a value, represented by its size here. The team will ideally send one rover to observe the large POI on the right closely, while the rest spread out in the left region to observe as many small POIs as possible.

The difference reward under partial observability, Di(PO), is calculated in the same manner as Di, but with restrictions on what agent i can observe. Each rover evaluates itself in the same way as with Di, but because of the partial observability, it is possible that two rovers will be observing the same POI from opposite sides, and neither will realize that the POI is doubly observed (which does not increase the system performance), so both will credit themselves. Likewise, each rover cannot sense POIs located outside of its observation radius. This is represented in figure 7.

Figure 7. Rovers Under Partial Observability of Range Denoted by the Dotted Line.

Both rover A and rover B can sense and observe POI P, but cannot sense each other. In the Di(PO) formulation, they both would calculate that theirs was the only observation. Additionally, neither rover has any knowledge of POI Q.
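The sketch below contrasts the three fully observable reward structures for a single episode, summarized as each agent's best observation quality per POI; the data layout and the credit-the-best-observation scoring follow our reading of the descriptions above rather than the authors' code.

```python
# Contrast of the three fully observable reward structures described above,
# for one episode summarized as obs[agent][poi] = best observation quality
# that agent achieved of that POI. The data layout is our assumption.

def team_reward(obs):
    """T_i: the same system-level score for every agent."""
    pois = {poi for agent_obs in obs for poi in agent_obs}
    return sum(max(agent_obs.get(poi, 0.0) for agent_obs in obs) for poi in pois)

def local_reward(obs, i):
    """P_i: credit agent i for its own observations, ignoring everyone else."""
    return sum(obs[i].values())

def difference_reward(obs, i):
    """D_i: credit agent i only for what the team would have lost without it."""
    return team_reward(obs) - team_reward(obs[:i] + obs[i + 1:])

obs = [{"big_poi": 10.0}, {"small_1": 3.0, "small_2": 3.0}, {"small_1": 3.0}]
print([local_reward(obs, i) for i in range(3)])       # [10.0, 6.0, 3.0]
print([difference_reward(obs, i) for i in range(3)])  # [10.0, 3.0, 0.0]
# Agent 2 duplicates agent 1's observation of small_1, so D gives it no credit,
# while P still rewards it; T gives every agent the same 16.0 regardless.
```

Under partial observability, each rover would evaluate its own difference reward using only the observations within its sensing radius, which is exactly what allows the double-crediting of POI P described in figure 7.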

Visualization of Reward Structures

Visualization is an important part of understanding the inner workings of many systems, but particularly those of learning systems (Agogino, Martin, and Ghosh 1999; Bishof, Pinz, and Kropatsch 1992; Gallagher and Downs 1997; Hinton 1989; Hoen et al. 2004; Wejchert and Tesauro 1991). Especially in costly space systems we need additional validation that our learning systems are likely to work. Performance simulations can give us good performance bounds in scenarios that we can anticipate ahead of time. However, these simulations may not uniformly test the rovers in all situations that they may encounter. Learning and adaptation can allow rovers to adapt to unanticipated scenarios, but their reward functions still have to have high sensitivity and alignment to work. The visualization presented here can give us greater insight into the behavior of our reward functions. Our visualizations can answer important questions, such as how often we think our reward will be aligned with our overall goals and how sensitive our rewards are to a rover's actions.

Through visual inspection we can see if there are important gaps in our coverage, and we can increase our confidence that a given reward system will work reliably.

The majority of the results presented in this work show the relative sensitivity and alignment of each of the reward structures. We have developed a unique method for visualizing these, which is illustrated in figure 8. We use the sensor information from the rover (left) to determine which of the spaces we will update (right). The alignment or sensitivity calculation (Agogino and Tumer 2008) is then represented by a symbol that takes the form of a "+" or "–" sign; the brighter the shade of the spot, the further from the average. A bright "+," then, represents a very aligned or very sensitive reward, and a bright "–" represents an antialigned or very nonsensitive reward for a given POI and rover density, in the case of figure 9. We also present these calculations projected onto a specific case of the actual space that the rovers move through in figure 10. A more general version of this technique projects onto the principal components of the state space, which is more thoroughly explored in other work (Agogino and Tumer 2008).

Figure 8. Illustration of the Visualization Calculation Process.

We use sensor data to determine which spot in the state space a circumstance represents, and place a marker in that location that represents whether the reward scores highly (bright +), near random (blank), or lowly (bright –).
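As a rough sketch of this binning process, the snippet below drops sampled circumstances into a grid indexed by rover density and POI density and prints a "+" or "-" per cell; the bin boundaries and the text rendering are our own simplifications (brightness, for instance, is omitted).

```python
# Sketch of the visualization binning described above: each sampled circumstance
# is placed in a (rover density, POI density) cell, the alignment or sensitivity
# score for that circumstance is accumulated there, and cells are marked "+" or
# "-" depending on whether they score above or below the overall average.

def render_grid(samples, bins=8):
    """samples: list of (rover_density, poi_density, score) triples, densities in [0, 1]."""
    totals = [[0.0] * bins for _ in range(bins)]
    counts = [[0] * bins for _ in range(bins)]
    for rover_d, poi_d, score in samples:
        r = min(int(rover_d * bins), bins - 1)
        p = min(int(poi_d * bins), bins - 1)
        totals[r][p] += score
        counts[r][p] += 1
    average = sum(map(sum, totals)) / max(sum(map(sum, counts)), 1)
    rows = []
    for r in range(bins):
        row = []
        for p in range(bins):
            if counts[r][p] == 0:
                row.append(" ")                      # no sampled circumstance here
            else:
                cell = totals[r][p] / counts[r][p]
                row.append("+" if cell >= average else "-")
        rows.append(" ".join(row))
    return "\n".join(rows)

# Example usage with made-up alignment scores:
print(render_grid([(0.2, 0.1, 0.9), (0.2, 0.15, 0.7), (0.8, 0.9, 0.1)]))
```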

Sensitivity and Alignment Analysis

A reward with simultaneously high alignment and sensitivity will be the easiest for agents to use to establish high-performing policies. Figure 9 presents the visualization for each of the reward structures. Notice that the perfectly learnable reward Pi does indeed have high sensitivity across the space, but has low alignment with the global reward in most of the center areas, which correspond to a moderate concentration of rovers and POIs. This area near the center of the visualization represents circumstances that the rovers find themselves in most often (Agogino and Tumer 2008).

The team reward Ti, by contrast, is very aligned throughout the search space, but is extremely lacking in sensitivity (denoted by the many "–" signs throughout the space).

The difference reward Di is both highly aligned and highly sensitive throughout the search space. When we reduce the radius at which Di can sense other rovers and POIs, the visualization from the Di(PO) row indicates that the sensitivity remains strong everywhere, but there is a slight drop in alignment throughout the space.

So, it would appear that difference rewards (Di) offer benefits over other rewards, even with partial observability (Di(PO)), but what does this mean in a more practical sense? To address this, we created figure 10, which projects the same type of alignment into the actual plane in which the rovers are operating.

Figure 9. Alignment and Sensitivity Visualization for the Four Reward Types, Projected Onto a Two-Dimensional Space Representative of the State Space.

Note that the perfectly learnable reward Pi has low alignment through most of the space, and the team reward Ti is extremely nonsensitive through most of the space, while both instances of the difference reward maintain high performance by both metrics.

Figure 10. Alignment Visualization for the Perfectly Learnable Reward Pi, and the Difference Reward Under Partial Observability, Di(PO).

Projected onto the actual plane the rovers operate within.

Figure 11. Final Performance Attained Versus Communication Radius for the Different Reward Structures.

(Axes: final global reward achieved versus communications radius, with one curve each for the D, P, and T rewards.) Difference rewards maintain robust performance, but team rewards lose significant performance under restricted communication.

The left panel of figure 10 presents the alignment for the perfectly learnable reward Pi, and the indicated region is antialigned with the system-level reward. That is, even though traveling across this region would be beneficial to the team (because traveling across this region is required to reach the large POI on the right), the rovers that find themselves in this area of the space are actively penalized.

The right panel presents the alignment for the difference reward under observation restrictions, Di(PO), which is qualitatively different within the highlighted regions: Di(PO) builds two aligned bridges, which allow the rovers to pass through the highlighted region without being penalized while they travel to the large POI on the right. Furthermore, the other parts of the highlighted region are not antialigned with the system reward, meaning that the rovers are not penalized for traveling through this space; they merely do not increase their reward while there.

System Performance

We present system-level performance in figure 11, which shows the final system reward after training (y-axis) for teams of rovers trained on various rewards (line type), within different experiments with varying restrictions on observation radius (x-axis). Points to the left represent performance under extreme observation restrictions, and points to the right represent near-full observability. The visualizations performed in figures 9 and 10 correspond to full observability for all rewards except Di(PO), which corresponds to the Di reward at a communication radius of 10 units in figure 11.

The benefits in sensitivity and alignment offered by the difference reward Di do result in increased system performance, as shown by the rightmost portion of figure 11. This reward leads to high-performing systems of rover agents with very successful policies. The global shared team reward Ti is capable of making some increases over a local policy under full observability, but still falls short of the difference reward.

The remainder of figure 11 presents a result based on the final system performance attained by agent teams operating with different rewards under restricted communications. Agents trained on the difference reward Di are robust to a reduced communication radius, which could easily happen in cases of a dust storm, craggy landscape, or partial sensor failures. Agents using the perfectly learnable reward Pi are not affected by these restrictions, as the actions of other agents don't affect their policies.

Agents trained on the team or global reward Ti show an interesting phenomenon, however. Agents operating with a large communication radius are able to perform well as a team, and as this communication radius is reduced, so is the quality of the discovered policies — this much is expected. However, as the observation radius is decreased further, experimental runs with very low observation radii actually perform slightly better than those with moderate observation powers. This suggests that a little bit of knowledge about the location of other rovers is actually a bad thing. This can be explained: as the observation radius is reduced, agents trained on the team reward will behave more selfishly, like rovers using Pi, simply because they cannot sense the other rovers in the area; thus the gap between their performance decreases as the restrictions mirror this case.

Conclusions

Space exploration creates a unique set of challenges that must be addressed as we continue expanding our reach in the solar system. One approach for dealing with these challenges is through the use of reinforcement learning with reward shaping. Care must be taken in any use of reward shaping: a solution that works with a small number of agents will not necessarily scale up in an expected fashion and might lead to catastrophic system-level results. The readily obvious team reward and perfectly learnable reward both lead to poor results due to their low sensitivity and alignment, respectively. There is a need for local-level rewards that can be computed quickly and efficiently and that will scale into favorable results at the broader system level.

Difference rewards are an effective tool for this: they encourage multiagent coordination through their guaranteed alignment with the system objective and their high sensitivity to local actions. They maintain high learnability throughout the state space, while offering perfect alignment with the system-level reward. This results in benefits that can be readily visualized within the space in which a team of rovers works, creating bridges of high reward that rovers can cross in between sparse POIs, and increasing overall system performance over a personal or team-based reward.

These properties, in tandem with robustness to various types of change within the environment, make difference rewards an ideal fit for space exploration applications. The capability of using a difference reward to encourage agents to do their best to help the team at whatever task is assigned allows for a team that can quickly and deftly adjust when mission parameters change. This can be as mundane as a sensor failing, or as dramatic as a complete mission reassignment.

While developing more sophisticated technologies for sensing more about the environment in a more efficient manner is a useful step forward, the key problem for multiagent space exploration remains: what should the agents do to work together? This persists as a fertile motivating question for future research.

References

Agogino, A., and Tumer, K. 2004. Efficient Evaluation Functions for Multi-Rover Systems. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2004), Lecture Notes in Computer Science volume 3103, 1–12. Berlin: Springer.

Agogino, A.; Martin, C.; and Ghosh, J. 1999. Visualization of Radial Basis Function Networks. In Proceedings of the International Joint Conference on Neural Networks. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Agogino, A. K., and Tumer, K. 2008. Analyzing and Visualizing Multiagent Rewards in Dynamic and Stochastic Environments. Journal of Autonomous Agents and Multi-Agent Systems 17(2): 320–338. dx.doi.org/10.1007/s10458-008-9046-9

Bishof, H.; Pinz, A.; and Kropatsch, W. G. 1992. Visualization Methods for Neural Networks. In Proceedings of the 11th International Conference on Pattern Recognition, 581–585. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Chalkiadakis, G., and Boutilier, C. 2003. Coordination in Multiagent Reinforcement Learning: A Bayesian Approach. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03). New York: Association for Computing Machinery.

Chien, S.; Barrett, A.; Estlin, T.; and Rabideau, G. 2000. A Comparison of Coordinated Planning Methods for Cooperating Rovers. In Proceedings of the Fourth International Conference on Autonomous Agents (Agents '00). New York: Association for Computing Machinery.

Chien, S.; Doubleday, J.; Mclaren, D.; Tran, D.; Tanpipat, V.; Chitradon, R.; Boonya-aroonnef, S.; Thanapakpawin, P.; Khunboa, C.; Leelapatra, W.; Plermkamon, V.; Raghavendra, C.; and Mandl, D. 2011. Combining Space-Based and In-Situ Measurements to Track Flooding in Thailand. In Proceedings of the 2011 IEEE International Symposium on Geoscience and Remote Sensing, 3935–3938. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Crowe, B. L. 1969. The Tragedy of the Commons Revisited. Science 166(3909) (November 28): 1103–1107.

Estlin, T.; Chien, S.; Castano, R.; Doubleday, J.; Gaines, D.; Anderson, R. C.; de Granville, C.; Knight, R.; Rabideau, G.; and Tang, B. 2010. Coordinating Multiple Spacecraft in Joint Science Campaigns. Paper presented at the 10th International Symposium on Artificial Intelligence, Robotics, and Automation for Space (i-SAIRAS 2010), Sapporo, Japan, August 29–September 1.

Gallagher, M., and Downs, T. 1997. Visualization of Learning in Neural Networks Using Principal Component Analysis. Paper presented at the International Conference on Computational Intelligence and Multimedia Applications, Griffith University, Gold Coast, Australia, 10–12 February.

Guestrin, C.; Lagoudakis, M.; and Parr, R. 2002. Coordinated Reinforcement Learning. In Proceedings of the 19th International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers.

Hardin, G. 1968. The Tragedy of the Commons. Science 162(3859) (December 13): 1243–1248.

Hinton, G. 1989. Connectionist Learning Procedures. Artificial Intelligence 40(1–3): 185–234. dx.doi.org/10.1016/0004-3702(89)90049

Hoen, P.; Redekar, G.; Robu, V.; and La Poutre, H. 2004. Simulation and Visualization of a Market-Based Model for Logistics Management in Transportation. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 1218–1219. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Hu, J., and Wellman, M. P. 1998. Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, 242–250. San Francisco: Morgan Kaufmann, Inc.

Kirkpatrick, S.; Gelatt, C. D. J.; and Vecchi, M. P. 1983. Optimization by Simulated Annealing. Science 220(4598) (May 13): 671–680.

Mataric, M. J. 1998. New Directions: Robotics: Coordination and Learning in Multi-Robot Systems. IEEE Intelligent Systems 13(2): 6–8. dx.doi.org/10.1109/5254.671083

Mataric, M. J. 1994. Reward Functions for Accelerated Learning. In Proceedings of the Eleventh International Conference on Machine Learning, 181–189. San Francisco: Morgan Kaufmann, Inc.

Stone, P., and Veloso, M. 2000. Multiagent Systems: A Survey from a Machine Learning Perspective. Autonomous Robots 8(3): 345–383. dx.doi.org/10.1023/A:1008942012299

Stone, P.; Kaminka, G. A.; Kraus, S.; Rosenschein, J. R.; and Agmon, N. 2013. Teaching and Leading an Ad Hoc Teammate: Collaboration Without Pre-Coordination. Artificial Intelligence 203 (October): 35–65. dx.doi.org/10.1016/j.artint.2013.07.003

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.

Taylor, M. E., and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10: 1633–1685.

Tumer, K. 2005. Designing Agent Utilities for Coordinated, Scalable, and Robust Multi-Agent Systems. In Challenges in the Coordination of Large Scale Multiagent Systems, ed. P. Scerri, R. Mailler, and R. Vincent. Berlin: Springer.

Tumer, K., and Wolpert, D. H. 2000. Collective Intelligence and Braess' Paradox. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, 104–109. Menlo Park, CA: AAAI Press.

Tumer, K., and Wolpert, D., eds. 2004a. Collectives and the Design of Complex Systems. Berlin: Springer. dx.doi.org/10.1007/978-1-4419-8909-3

Tumer, K., and Wolpert, D. 2004b. A Survey of Collectives. In Collectives and the Design of Complex Systems, ed. K. Tumer and D. Wolpert, 1–42. Berlin: Springer. dx.doi.org/10.1007/978-1-4419-8909-3_1

Tumer, K.; Agogino, A.; and Wolpert, D. 2002. Learning Sequences of Actions in Collectives of Autonomous Agents. In Proceedings of the First International Joint Conference on Autonomous Agents and Multi-Agent Systems, 378–385. New York: Association for Computing Machinery. dx.doi.org/10.1145/544741.544832

Wejchert, J., and Tesauro, G. 1991. Visualizing Processes in Neural Networks. IBM Journal of Research and Development 35(1–2): 244–253. dx.doi.org/10.1147/rd.351.0244

Wolpert, D. H., and Tumer, K. 2001. Optimal Payoff Functions for Members of Collectives. Advances in Complex Systems 4(2/3): 265–279. dx.doi.org/10.1142/S0219525901000188

Wolpert, D. H.; Tumer, K.; and Bandari, E. 2004. Improving Search Algorithms by Using Intelligent Coordinates. Physical Review E 69: 017701. dx.doi.org/10.1103/PhysRevE.69.017701

Wooldridge, M. 2008. An Introduction to Multiagent Systems. New York: Wiley.

Logan Yliniemi is a graduate research assistant at Oregon State University. He is pursuing his Ph.D. in robotics. His research focuses on credit assignment in multiagent systems and the interactions between multiagent systems and multiobjective problems.

Adrian Agogino is a researcher with the Robust Software Engineering Group in the Intelligent Systems Division at NASA Ames Research Center, employed through the University of California, Santa Cruz. His research interests are in the fields of machine learning, complex learning systems, and multiagent control.

Kagan Tumer is a professor of robotics and control at Oregon State University. His research interests are control and optimization in large autonomous systems with a particular emphasis on multiagent coordination. Applications of his work include coordinating multiple robots, optimizing large sensor networks, controlling unmanned aerial vehicles, reducing traffic congestion, and managing air traffic.

