tractable planning in large teams distributed pomdps with … · 2011-08-23 · this research has...




In practice, we run into three common issues faced by concurrent optimization algorithms. We alter our model-shaping to mitigate these by reasoning about the types of interactions we have:
–  Slow convergence → Prioritization
–  Oscillation → Probabilistic shaping
–  Local optima → Optimistic policy initialization

Distributed POMDPs with Coordination Locales (DPCLs)

This work uses the DPCL problem model [2]. DPCLs are similar to Dec-POMDPs in representing problems as sets of states, actions, and observations with joint transition, reward, and observation functions. However, DPCLs differ in that they factor the state space into global and per-agent local components, and interactions among agents are limited to coordination locales.

Evaluation in a Heterogeneous Rescue Robot Domain

Consider the problem of a team of robots planning to search a disaster area. Some robots can assist victims, while others can clear otherwise impassable debris. Robot observations and movements are subject to uncertainty. We evaluate D-TREMOR's performance on a number of these planning problems, in teams of up to 100 agents.

Acknowledgements

This research has been funded in part by the AFOSR MURI grant FA9550-08-1-0356. This material is based upon work supported under a National Science Foundation Graduate Research Fellowship.

Tractable Planning in Large Teams

Emerging team applications require the cooperation of thousands of members (humans, robots, agents). Team members must complete complex, collaborative tasks in dynamic and uncertain environments. How can we effectively and tractably plan in these domains?

Scaling up from TREMOR [2] to D-TREMOR

Conclusions and Future Work

We introduce D-TREMOR, an approach that scales distributed planning under uncertainty to hundreds of agents using information exchange and model-shaping. Results suggest competitive performance while improving scalability and reducing computational cost. We are working to further improve performance through better modeling of interaction dynamics and intelligent information dissemination between agents.

References

[1] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 2002.

[2] P. Varakantham, J. Kwak, M. Taylor, J. Marecki, P. Scerri, and M. Tambe. Exploiting Coordination Locales in Distributed POMDPs via Social Model Shaping. Proc. of ICAPS, 2009.

[3] P. Varakantham, R. T. Maheswaran, T. Gupta, and M. Tambe. Towards Efficient Computation of Error Bounded Solutions in POMDPs: Expected Value Approximation and Dynamic Disjunctive Beliefs. Proc. of IJCAI, 2007.

TREMOR and D-TREMOR compared by planning stage:

•  Role Allocation
–  TREMOR: Branch & Bound MDP
–  D-TREMOR: Decentralized auction

•  Policy Solution
–  TREMOR and D-TREMOR: Independent EVA [3] solvers

•  Interaction Detection
–  TREMOR: Joint policy evaluation
–  D-TREMOR: Sampling & message passing

•  Coordination
–  TREMOR: Reward shaping of local models
–  D-TREMOR: Reward shaping of local models with convergence heuristics

Example Map: Rescue Domain

[Figure: example map. Legend: Rescue Agent, Cleaner Agent, Narrow Corridor, Victim, Unsafe Cell, Clearable Debris]

Objective function: Get rescue agents to as many victims as possible within a fixed time horizon while minimizing collisions.

Agents can collide in narrow corridors (only one agent can fit at a time) and with clearable debris (blocks rescue agents, but can be cleared by cleaner agents).

•  D-TREMOR policies
–  Max-joint-value
–  Last iteration

•  Comparison policies
–  Independent
–  Optimistic
–  Do-nothing
–  Random

•  Scaling dataset
–  10 to 100 agents
–  Random maps

•  Density dataset
–  100 agents
–  Concentric ring maps

•  3 problems/condition
•  20 planning iterations
•  7 time step horizon
•  1 CPU per agent

D-TREMOR: Distributed Team REshaping of Models for Rapid-execution

We extend the TREMOR [2] algorithm for solving DPCLs to produce D-TREMOR, a fully-distributed solver that scales to problems with hundreds of agents. It approximates DPCLs as a set of single-agent POMDPs which are solved in parallel, then iteratively reshaped using messages that describe CL interactions between agent policies.
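The solve / exchange / reshape cycle described above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: the two-route problem, the reward numbers, and the names `Agent`, `solve`, `cl_messages`, `reshape`, and `plan` are all hypothetical, and an argmax over two routes stands in for a real single-agent POMDP solver.

```python
from dataclasses import dataclass, field

CORRIDOR, DETOUR = "corridor", "detour"
BASE_REWARD = {CORRIDOR: 10.0, DETOUR: 6.0}  # hypothetical independent local values
COLLISION_VALUE = -7.0                       # hypothetical value of the corridor CL


@dataclass
class Agent:
    name: str
    shaped: dict = field(default_factory=lambda: dict(BASE_REWARD))
    policy: str = ""

    def solve(self):
        # Stand-in for a single-agent POMDP solver: pick the best shaped route.
        self.policy = max(self.shaped, key=self.shaped.get)

    def cl_messages(self):
        # Report the CL only if the current policy enters the corridor.
        if self.policy == CORRIDOR:
            return [(self.name, 1.0, COLLISION_VALUE)]
        return []

    def reshape(self, inbox):
        # Restart from the independent model, then fold in remote interactions.
        self.shaped = dict(BASE_REWARD)
        for sender, prob, value in inbox:
            if sender != self.name:
                self.shaped[CORRIDOR] += prob * value


def plan(agents, iterations):
    history = []
    for _ in range(iterations):
        for agent in agents:          # run sequentially here; one CPU per agent on the poster
            agent.solve()
        inbox = [msg for agent in agents for msg in agent.cl_messages()]
        for agent in agents:
            agent.reshape(inbox)
        history.append(tuple(agent.policy for agent in agents))
    return history


for step, policies in enumerate(plan([Agent("r1"), Agent("r2")], 4), start=1):
    print(step, policies)
```

In this toy, both agents react to the same messages in the same way, so their policies flip together from one iteration to the next. That is the kind of oscillation that the convergence heuristics listed on the poster are meant to address.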

[Figure: Average # of Collisions vs. Number of Rings (1 to 3); legend not recoverable]

[Figure: Average # of Victims Rescued vs. Number of Rings (1 to 3); legend not recoverable]

D-TREMOR rescues many more victims.

D-TREMOR resolves many, but not all collisions.

Task Allocation

Local Planning

Interaction Exchange

Model Shaping

Finding the probability of a CL [1]:
•  Evaluate local policy
•  Compute frequency of associated s_i, a_i
Example: entered corridor in 95 of 100 runs, so Pr_CLi = 0.95

Finding the value of a CL [1]:
•  Sample local policy value with/without interactions
–  Test interactions independently
•  Compute change in value if interaction occurred
Example: no collision vs. collision, Val_CLi = -7
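The two sampling estimates above can be sketched as follows. This is a minimal sketch, assuming each agent can execute its own local policy in simulation. The function names and the toy stand-ins `enters_corridor` and `corridor_value` (including their numbers) are hypothetical and not taken from the poster.

```python
import random


def estimate_cl_probability(enters_cl, n_samples, rng):
    """Fraction of sampled executions of the local policy that reach the CL."""
    hits = sum(1 for _ in range(n_samples) if enters_cl(rng))
    return hits / n_samples


def estimate_cl_value(sample_value, n_samples, rng):
    """Mean local value with the interaction forced on, minus with it forced off."""
    with_cl = sum(sample_value(rng, True) for _ in range(n_samples)) / n_samples
    without_cl = sum(sample_value(rng, False) for _ in range(n_samples)) / n_samples
    return with_cl - without_cl


# Hypothetical stand-ins for executing one agent's local policy once.
def enters_corridor(rng):
    return rng.random() < 0.95


def corridor_value(rng, collided):
    noise = rng.uniform(-0.5, 0.5)
    return (3.0 if collided else 10.0) + noise


rng = random.Random(0)
pr_cl = estimate_cl_probability(enters_corridor, 100, rng)
val_cl = estimate_cl_value(corridor_value, 100, rng)
print(f"Pr_CL ~ {pr_cl:.2f}, Val_CL ~ {val_cl:.2f}")
```

Each interaction is evaluated on its own, matching the "test interactions independently" step, so the cost grows with the number of CLs an agent touches rather than with team size.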

•  Send CL messages to teammates
•  Sparsity → relatively small # of messages
•  Shape local model rewards/transitions based on remote interactions
–  Terms of the shaped model: probability of interaction, interaction model functions, independent model functions
•  Re-solve shaped local models to get new policies
•  Result: new locally-optimal policies → new interactions
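The shaping equation itself did not survive in this transcript. Only its three labels remain: probability of interaction, interaction model functions, and independent model functions. One plausible reading, shown here as an assumption rather than the poster's formula, is a probability-weighted blend of the interaction and independent functions. The helper names and the numbers are hypothetical.

```python
def shape_scalar(prob_cl, interaction, independent):
    """Blend one reward entry: weight the interaction case by its probability."""
    return prob_cl * interaction + (1.0 - prob_cl) * independent


def shape_distribution(prob_cl, interaction, independent):
    """Blend two next-state distributions given as {state: probability} dicts."""
    states = set(interaction) | set(independent)
    return {
        s: shape_scalar(prob_cl, interaction.get(s, 0.0), independent.get(s, 0.0))
        for s in states
    }


# A teammate reports that it occupies the corridor with probability 0.95.
prob_cl = 0.95

# Reward for entering the corridor: 0 when alone, -7 on collision (toy numbers).
shaped_reward = shape_scalar(prob_cl, interaction=-7.0, independent=0.0)

# Transition for entering the corridor: blocked on collision, otherwise inside.
shaped_transition = shape_distribution(
    prob_cl,
    interaction={"blocked": 1.0},
    independent={"in_corridor": 1.0},
)

print(shaped_reward)
print(shaped_transition)
```

Because both inputs are valid distributions and the weights sum to one, the blended transition is still a valid distribution, so the shaped local model can be handed back to the same single-agent solver.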

Distributed Interaction Detection using Sampling and Message Exchange

Improved model shaping of local agent models with convergence heuristics

[Figure: Time Per Iteration (min) vs. Number of Agents (0 to 100)]

[Figure: # of CLs Active (per agent) vs. Number of Agents (0 to 100)]

Increases in time are related to # of CLs, not # of agents.

Results of Scaling Dataset

Results of Density Dataset

[Figure: Normalized Joint Value vs. Number of Agents (0 to 100); series: Naïve Policies, D-TREMOR Policies]

Number of agents and map size are varied as density of debris, corridors, and unsafe cells is held constant.

Concentric rings of narrow corridors are added from outside in on a map where victims are at the center.

[Figure: Average Joint Value vs. Number of Rings (1 to 3); legend not recoverable]

At first glance, do-nothing seems to do best.

Ignoring interactions = poor performance (Independent & Optimistic)

Agents not interacting, use independent functions

Agents are interacting, use joint CL functions

CL = Explicit time constraint, relevant region of joint state-action space

Coordination Locales define regions of state-action space where joint transition/reward functions are needed

DPCL model components:
–  Joint Transition
–  Joint Reward
–  Joint Observation
–  Set of States
–  Set of Actions
–  Set of Observations
–  Initial Belief State
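The switch between independent and joint CL functions can be illustrated with a small reward example. This is a sketch under assumptions: the `CoordinationLocale` fields, the rule that a locale is active only at one explicit time step, and all states, actions, and reward numbers are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CoordinationLocale:
    time: int            # explicit time step at which the locale applies (assumed)
    members: dict        # agent id -> (local_state, action) that triggers it
    joint_reward: float  # joint reward used in place of the members' local rewards


def is_active(cl, t, joint):
    """True when every member is in its triggering state-action pair at time cl.time."""
    return t == cl.time and all(joint.get(a) == sa for a, sa in cl.members.items())


def team_reward(t, joint, local_reward, locales):
    """Joint CL reward for interacting agents, independent rewards for the rest."""
    covered = set()
    total = 0.0
    for cl in locales:
        if is_active(cl, t, joint):
            total += cl.joint_reward
            covered.update(cl.members)
    for agent, (state, action) in joint.items():
        if agent not in covered:
            total += local_reward(agent, state, action)
    return total


corridor = CoordinationLocale(
    time=3,
    members={"r1": ("door", "enter"), "r2": ("door", "enter")},
    joint_reward=-14.0,
)
joint = {"r1": ("door", "enter"), "r2": ("door", "enter"), "c1": ("yard", "clear")}


def unit_reward(agent, state, action):
    return 1.0  # toy independent reward


print(team_reward(2, joint, unit_reward, [corridor]))  # locale inactive at t=2
print(team_reward(3, joint, unit_reward, [corridor]))  # locale active at t=3
```

Outside an active locale the reward is just a sum of per-agent terms, which is what lets each agent plan on its own local model most of the time.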