Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes

Page 1:

Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes

Anthony Cassandra
Computer Science Dept., Brown University
Providence, RI 02912
[email protected]

Michael L. Littman
Dept. of Computer Science, Duke University
Durham, NC 27708-0129
[email protected]

Nevin L. Zhang
Computer Science Dept., The Hong Kong U. of Sci. & Tech.
Clear Water Bay, Kowloon, HK
[email protected]

Presented by Costas Djouvas

Page 2:

POMDPs: Who Needs Them?

Tony Cassandra
St. Edwards University, Austin, TX
http://www.cassandra.org/pomdp/talks/who-needs-pomdps/index.shtml

Page 3:

Markov Decision Processes (MDP)

A discrete model for decision making under uncertainty.

The four components of the MDP model:
• States: The world is divided into states.
• Actions: Each state has a finite number of actions to choose from.
• Transition Function: The probabilistic relationship between states and the available actions for each state.
• Reward Function: The expected reward of taking action a in state s.

Page 4:

MDP More Formally

• S = a set of possible world states.
• A = a set of possible actions.
• Transition Function: a real-valued function T(s, a, s') = Pr(s'|s, a).
• Reward Function: a real-valued function R(s, a).
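As a rough data-structure sketch of this formal definition (the class and field names are illustrative, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    states: Sequence[str]                # S: possible world states
    actions: Sequence[str]               # A: possible actions
    T: Callable[[str, str, str], float]  # T(s, a, s') = Pr(s' | s, a)
    R: Callable[[str, str], float]       # R(s, a): expected immediate reward
```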

Page 5:

MDP Example (1/2)

S = {OK, DOWN}.
A = {NO-OP, ACTIVE-QUERY, RELOCATE}.

Reward Function R(a, s):

                 s
a                OK     DOWN
NO-OP            +1     -10
A-Q              -5     -5
RELOCATE         -22    -20

Page 6:

MDP Example (2/2)

Transition Functions:

T(s, RELOCATE, s'):

           s'
s          OK     DOWN
OK         1.00   0.00
DOWN       1.00   0.00

T(s, A-Q, s'):

           s'
s          OK     DOWN
OK         0.98   0.02
DOWN       0.00   1.00

T(s, NO-OP, s'):

           s'
s          OK     DOWN
OK         0.98   0.02
DOWN       0.00   1.00
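The tables above can be written out directly; a hedged encoding in Python using plain dictionaries (the variable names are ours):

```python
S = ["OK", "DOWN"]
A = ["NO-OP", "A-Q", "RELOCATE"]     # A-Q abbreviates ACTIVE-QUERY

# R[(s, a)]: expected reward of taking action a in state s (from R(a, s) above)
R = {("OK", "NO-OP"): 1,      ("DOWN", "NO-OP"): -10,
     ("OK", "A-Q"): -5,       ("DOWN", "A-Q"): -5,
     ("OK", "RELOCATE"): -22, ("DOWN", "RELOCATE"): -20}

# T[(s, a, s2)] = Pr(s2 | s, a), copied from the three tables above
T = {("OK", "NO-OP", "OK"): 0.98,      ("OK", "NO-OP", "DOWN"): 0.02,
     ("DOWN", "NO-OP", "OK"): 0.00,    ("DOWN", "NO-OP", "DOWN"): 1.00,
     ("OK", "A-Q", "OK"): 0.98,        ("OK", "A-Q", "DOWN"): 0.02,
     ("DOWN", "A-Q", "OK"): 0.00,      ("DOWN", "A-Q", "DOWN"): 1.00,
     ("OK", "RELOCATE", "OK"): 1.00,   ("OK", "RELOCATE", "DOWN"): 0.00,
     ("DOWN", "RELOCATE", "OK"): 1.00, ("DOWN", "RELOCATE", "DOWN"): 0.00}
```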

Page 7:

Best Strategy

Value Iteration Algorithm:
• Input: actions, states, reward function, probabilistic transition function.
• Derives a mapping from states to "best" actions for a given horizon of time.
• Starts with horizon length 1 and iteratively finds the value function for the desired horizon (sketched in code below).

Optimal Policy:
• Maps states to actions (S → A).
• It depends only on the current state (Markov property).
• To apply this we must know the agent's state.
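A minimal value-iteration sketch over the dictionaries defined earlier (the discount factor gamma is an assumption; the slides do not specify one):

```python
def value_iteration(S, A, T, R, horizon, gamma=0.95):
    """Return the horizon-step value function and a greedy policy."""
    def q(s, a, V):
        # Expected value of taking action a in state s, then following V
        return R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in S)

    V = {s: 0.0 for s in S}                        # horizon-0 values
    for _ in range(horizon):
        V = {s: max(q(s, a, V) for a in A) for s in S}
    # Map each state to a "best" action for this horizon
    policy = {s: max(A, key=lambda a, s=s: q(s, a, V)) for s in S}
    return V, policy
```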

Page 8:

Partially Observable Markov Decision Processes

• Domains with partial information available about the current state (we can't observe the current state directly).
• The observation can be probabilistic, so we need an observation function.
• Uncertainty about the current state.
• Without state information the process is non-Markovian: it would require keeping track of the entire history.

Page 9:

Partially Observable Markov Decision Processes

In addition to the MDP model we have:
• Observations: a set of observations of the state. Z = the set of possible observations.
• Observation Function: the relation between the state and the observation, O(s, a, z) = Pr(z|s, a).

Page 10:

POMDP Example

In addition to the definitions of the MDP example, we must define the observation set and the observation probability function.

Z = {ping-ok, ping-timeout, active-ok, active-down}.

O(s, ACTIVE-QUERY, z):

        ping-ok   ping-timeout   active-ok   active-down
OK      0.000     0.000          0.999       0.001
DOWN    0.000     0.000          0.010       0.990

O(s, NO-OP, z):

        ping-ok   ping-timeout   active-ok   active-down
OK      0.970     0.030          0.000       0.000
DOWN    0.025     0.975          0.000       0.000

O(s, RELOCATE, z):

        ping-ok   ping-timeout   active-ok   active-down
OK      0.250     0.250          0.250       0.250
DOWN    0.250     0.250          0.250       0.250
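In the same dictionary style as T and R earlier, the observation model can be written out as follows (an illustrative encoding; O[(s, a, z)] = Pr(z | s, a)):

```python
Z = ["ping-ok", "ping-timeout", "active-ok", "active-down"]

O = {}
for z in Z:                                   # RELOCATE is uninformative
    O[("OK", "RELOCATE", z)] = 0.25
    O[("DOWN", "RELOCATE", z)] = 0.25
O.update({
    ("OK", "NO-OP", "ping-ok"): 0.970,   ("OK", "NO-OP", "ping-timeout"): 0.030,
    ("OK", "NO-OP", "active-ok"): 0.000, ("OK", "NO-OP", "active-down"): 0.000,
    ("DOWN", "NO-OP", "ping-ok"): 0.025, ("DOWN", "NO-OP", "ping-timeout"): 0.975,
    ("DOWN", "NO-OP", "active-ok"): 0.000, ("DOWN", "NO-OP", "active-down"): 0.000,
    ("OK", "A-Q", "ping-ok"): 0.000,     ("OK", "A-Q", "ping-timeout"): 0.000,
    ("OK", "A-Q", "active-ok"): 0.999,   ("OK", "A-Q", "active-down"): 0.001,
    ("DOWN", "A-Q", "ping-ok"): 0.000,   ("DOWN", "A-Q", "ping-timeout"): 0.000,
    ("DOWN", "A-Q", "active-ok"): 0.010, ("DOWN", "A-Q", "active-down"): 0.990})
```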

Page 11:

Background on Solving POMDPs

• We have to find a mapping from probability distributions over states to actions.
• Belief State: a probability distribution over states.
• Belief Space: the entire probability space.
• Assuming a finite number of possible actions and observations, there is a finite number of possible next belief states.
• The next belief state is fully determined and depends only on the current belief state, action, and observation (Markov property).

Page 12:

Background on Solving POMDPs

Next belief state:
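In the paper's notation, the update shown here is the standard Bayesian belief update:

  x_a^z(s') = Pr(z|s', a) Σ_s Pr(s'|s, a) x(s) / Pr(z|a, x)

A minimal Python sketch over the example's dictionaries (the observation probability is looked up at the resulting state, following the paper's Pr(z|s', a) convention):

```python
def update_belief(b, a, z, S, T, O):
    """Bayesian belief update: b'(s2) is proportional to
    O[(s2, a, z)] * sum_s T[(s, a, s2)] * b[s]."""
    unnormalized = {s2: O[(s2, a, z)] * sum(T[(s, a, s2)] * b[s] for s in S)
                    for s2 in S}
    pr_z = sum(unnormalized.values())         # Pr(z | a, b)
    if pr_z == 0.0:
        raise ValueError("observation impossible under this belief")
    return {s2: p / pr_z for s2, p in unnormalized.items()}
```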

Page 13:

Background on Solving POMDPs

• Start from belief state b (the yellow dot in the slide's figure).
• Two states: s1, s2.
• Two actions: a1, a2.
• Three observations: z1, z2, z3.

[Figure: the resulting belief states as points in the belief space.]

Page 14:

Policies for POMDPs

An optimal POMDP policy maps belief states to actions.

The way in which one would use a computed policy is to start with some a priori belief about where you are in the world, and then continually (see the sketch after this list):
1. use the policy to select an action for the current belief state;
2. execute the action;
3. receive an observation;
4. update the belief state using the current belief, action and observation;
5. repeat.
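A sketch of that loop; `environment_step` is a hypothetical stand-in for acting in the world and receiving the resulting observation:

```python
def run_policy(policy, b, environment_step, S, T, O, steps=100):
    for _ in range(steps):
        a = policy(b)                        # 1. select action for the belief
        z = environment_step(a)              # 2.-3. execute, get an observation
        b = update_belief(b, a, z, S, T, O)  # 4. update the belief state
    return b                                 # 5. repeat until done
```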

Page 15:

Example of an Optimal Policy

Pr(OK)          Action
0.000 – 0.237   RELOCATE
0.237 – 0.485   ACTIVE
0.485 – 0.493   ACTIVE
0.493 – 0.713   NO-OP
0.713 – 0.928   NO-OP
0.928 – 0.989   NO-OP
0.989 – 1.000   NO-OP

[Figure: the piecewise-linear value function over the belief space from 0 to 1, partitioned into regions labeled RELOCATE, ACTIVE, and NO-OP, matching the table above.]
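Read as code, this policy is just a threshold lookup on Pr(OK); a small sketch using the interval boundaries from the table (adjacent intervals with the same action merged):

```python
# Upper bound of each merged interval, paired with its action
POLICY = [(0.237, "RELOCATE"), (0.493, "ACTIVE"), (1.000, "NO-OP")]

def action_for(p_ok):
    for upper_bound, action in POLICY:
        if p_ok <= upper_bound:
            return action
    raise ValueError("Pr(OK) must lie in [0, 1]")
```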

Page 16:

Policy Graph

Page 17:

Value Function

• The optimal policy computation is based on value iteration.
• The main problem in using value iteration is that the space of all belief states is continuous.

Page 18:

Value Function

• For each belief state we get a single expected value.
• Find the expected value of all belief states.
• This yields a value function defined over the entire belief space.

Page 19:

Value Iteration Example

• Two states, two actions, three observations.
• We will use a figure to represent the belief space and the transformed value function.
• We will use the S(a, z) construction to transform the value function over the continuous belief space.

[Figure: the belief space on the horizontal axis and the transformed value on the vertical axis; a vector's value at a belief state is the dot product.]

Page 20:

Value Iteration Example

• Start from belief state b.
• One available action, a1, for the first decision, and then two, a1 and a2.
• Three possible observations: z1, z2, z3.

Page 21:

Value Iteration Example

For each of the three new belief states, compute the new value function for all actions.

[Figures: the transformed value functions for all observations; the partition for action a1.]

Page 22:

Value Iteration Example

[Figures: the value function and partition for action a1; the value function and partition for action a2; the combined a1 and a2 value functions; the value functions for horizon 2.]

Page 23:

Transformed Value Example

[Figure: the transformed value functions for the MDP example.]

Page 24:

Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes

• The agent is not aware of its current state.
• It only knows its information (belief) state x, a probability distribution over the possible states.
• Taking action a from information state x and observing z yields the new information state x_a^z, where

  x_a^z(s') = Pr(z|s', a) Σ_s Pr(s'|s, a) x(s) / Pr(z|a, x)

Notation:
• S: a finite set of states
• A: a finite set of possible actions
• Z: a finite set of possible observations
• a ∈ A, s ∈ S, z ∈ Z
• Reward function: r_a(s) ∈ ℝ
• Transition function: Pr(s'|s, a) ∈ [0, 1]
• Observation function: Pr(z|s', a) ∈ [0, 1]

Page 25:

Introduction

• Algorithms for POMDPs use a form of dynamic programming, called dynamic programming updates.
• One value function is transformed into another.
• Some of the algorithms using DP updates:
  - One pass (Sondik, 1971)
  - Exhaustive (Monahan, 1982)
  - Linear support (Cheng, 1988)
  - Witness (Littman, Cassandra & Kaelbling, 1996)
  - Dynamic pruning (Zhang & Liu, 1996)

Page 26:

Dynamic Programming Updates

• Idea: define a new value function V' in terms of a given value function V.
• Using value iteration, in the infinite-horizon setting V' represents an approximation that is very close to the optimal value function.
• V' is defined by:

  V'(x) = max_{a ∈ A} Σ_{z ∈ Z} [ x·r_a / |Z| + γ Pr(z|a, x) V(x_a^z) ]

• Writing V_{a,z}(x) = x·r_a / |Z| + γ Pr(z|a, x) V(x_a^z) and V_a(x) = Σ_z V_{a,z}(x), the functions V', V_a, and V_{a,z} can each be expressed as the upper surface of a finite set of |S|-vectors: S', S_a, S_{a,z}.
• These transformations preserve piecewise linearity and convexity (Smallwood & Sondik, 1973).

Page 27:

Dynamic Programming Updates

Some more notation:
• Vector comparison: α1 > α2 if and only if α1(s) > α2(s) for all s ∈ S.
• Vector dot product: α·β = Σ_s α(s)β(s).
• Cross sum: A ⊕ B = {α + β | α ∈ A, β ∈ B} (sketched in code below).
• Set subtraction: A\B = {α ∈ A | α ∉ B}.
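The cross sum is the one non-standard operation here; a small sketch with vectors represented as tuples of floats:

```python
from itertools import product

def cross_sum(A, B):
    """A ⊕ B = {alpha + beta | alpha in A, beta in B}, element-wise sums."""
    return [tuple(x + y for x, y in zip(alpha, beta))
            for alpha, beta in product(A, B)]
```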

Page 28:

Dynamic Programming Updates

Using this notation, we can characterize the "S" sets described earlier as:

  S_{a,z} = purge({τ(α, a, z) | α ∈ S_V})   (S_V is the vector set representing V)
  S_a = purge(S_{a,z1} ⊕ S_{a,z2} ⊕ ... )
  S' = purge(∪_{a ∈ A} S_a)

where τ(α, a, z) is the |S|-vector defined by

  τ(α, a, z)(s) = r_a(s)/|Z| + γ Σ_{s'} α(s') Pr(z|s', a) Pr(s'|s, a)

purge(.) takes a set of vectors and reduces it to its unique minimum form.
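A sketch of the τ transformation with numpy (the array layouts for r, T, and the observation matrix Omega are assumptions, not from the slides):

```python
import numpy as np

def tau(alpha, a, z, r, T, Omega, gamma, n_obs):
    """tau(alpha,a,z)(s) = r_a(s)/|Z| + gamma * sum_{s'} alpha(s') Pr(z|s',a) Pr(s'|s,a)

    r[a]: |S| reward vector; T[a][s, s2] = Pr(s2|s,a);
    Omega[a][s2, z] = Pr(z|s2,a); n_obs = |Z|.
    """
    return r[a] / n_obs + gamma * T[a] @ (Omega[a][:, z] * alpha)
```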

Page 29:

Pruning Sets of Vectors

Given a set of |S|-vectors A and a vector α, define:

  R(α, A) = {x | x·α > x·α' for all α' ∈ A\{α}}

which is called the "witness region" of α: the set of information states for which vector α is the clear "winner" (has the largest dot product) compared to all the other vectors of A.

Using the definition of R, we can define:

  purge(A) = {α ∈ A | R(α, A) ≠ ∅}

which is the set of vectors in A that have non-empty witness regions, and is precisely the minimum-size set representing the same function.

Page 30:

Pruning Sets of Vectors

Implementation of purge(F):
• FILTER(F) returns the vectors in F with non-empty witness regions.
• DOMINATE(α, A) returns an information state x for which α gives a larger dot product than any vector in A, if one exists (both are sketched below).
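A compact sketch of DOMINATE and FILTER using scipy's linear-programming solver; this is a simplification of the paper's pseudocode (in particular, the tie-breaking rule for choosing which winner to keep is omitted):

```python
import numpy as np
from scipy.optimize import linprog

def dominate(alpha, A):
    """Return a belief x where alpha strictly beats every vector in A, else None.

    LP: maximize delta subject to x.alpha >= x.a2 + delta for all a2 in A,
        sum(x) = 1, x >= 0.  Decision variables are (x, delta).
    """
    alpha = np.asarray(alpha, dtype=float)
    n = alpha.size
    if not A:
        return np.full(n, 1.0 / n)                 # wins everywhere trivially
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # linprog minimizes -delta
    A_ub = np.array([np.append(np.asarray(a2) - alpha, 1.0) for a2 in A])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(A)),
                  A_eq=np.append(np.ones(n), 0.0).reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    if res.status == 0 and -res.fun > 1e-9:        # strictly positive delta
        return res.x[:n]
    return None

def purge(F):
    """FILTER: keep only the vectors of F with non-empty witness regions."""
    F = [tuple(v) for v in F]
    kept = []
    while F:
        phi = F.pop()
        if dominate(phi, kept + F) is not None:    # phi wins somewhere
            kept.append(phi)
    return kept
```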

Page 31:

Incremental Pruning

Computes S_a efficiently:
• Conceptually easier than witness.
• Superior performance and asymptotic complexity.
• A = purge(A), B = purge(B).
• W = purge(A ⊕ B).
• |W| ≥ max(|A|, |B|).
• The intermediate sets never grow explosively compared to their final size.

Page 32:

Incremental Pruning

• We first construct all of the S(a, z) sets.
• We form all combinations of the S(a, z1) and S(a, z2) vectors.

Page 33:

Incremental Pruning

• This yields the new value function.
• We then eliminate all useless (light blue) vectors.

Page 34:

Incremental Pruning

• We are left with just three vectors.
• We then combine these three with the vectors in S(a, z3).
• This is repeated for the other action (see the code sketch below).
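Putting the pieces together, the update for one action folds cross_sum and purge over the observations, exactly as the pictures describe (a sketch built on the helpers above):

```python
def incremental_pruning_step(S_az):
    """S_a = purge(... purge(purge(S_az[0] + S_az[1]) + S_az[2]) ...).

    S_az: one already-purged vector set per observation.
    """
    W = S_az[0]
    for S_next in S_az[1:]:
        W = purge(cross_sum(W, S_next))   # prune after every cross sum
    return W
```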

Page 35:

Generalizing Incremental Pruning

• Modification of FILTER to take advantage of the fact that the set of vectors has a great deal of regularity.
• Replace x ← DOMINATE(Φ, W) with x ← DOMINATE(Φ, D\{Φ}).
• Recall:
  - A ⊕ B: the set of vectors being filtered.
  - W: the set of winning vectors.
  - Φ: the current "winner" vector of W.
  - D ⊆ A ⊕ B.

Page 36:

Generalizing Incremental Pruning

• D must satisfy one of the following properties, given as equations (1) through (5):

  [Equations (1)-(5): the paper's admissible choices of the set D.]

• Different choices of D result in different incremental pruning algorithms.
• The smaller the D set, the more efficient the algorithm.

Page 37:

Generalizing Incremental Pruning

• The IP algorithm uses equation (1).
• A variation of the incremental pruning method using a combination of (4) and (5) is referred to as the restricted region (RR) algorithm.
• The asymptotic total number of linear programs does not change; RR actually requires slightly more linear programs than IP in the worst case.
• However, empirically it appears that the savings in the total number of constraints usually saves more time than the extra linear programs cost.

Page 38:

Generalizing Incremental Pruning

[Figure: the complete RR algorithm.]

Page 39:

Empirical Results

[Figures: total execution time; total time spent constructing the S_a sets.]

Page 40:

Conclusions

• We examined the incremental pruning method for performing dynamic programming updates in partially observable Markov decision processes.
• It compares favorably in terms of ease of implementation to the simplest of the previous algorithms.
• It has asymptotic performance as good as or better than the most efficient of the previous algorithms, and it is empirically the fastest algorithm of its kind.

Page 41:

Conclusion

• In any event, even the slowest variation of the incremental pruning method that we studied is a consistent improvement over earlier algorithms.
• This algorithm will make it possible to greatly expand the set of POMDP problems that can be solved efficiently.
• Issues to be explored:
  - All the algorithms studied have a precision parameter ε, which differs from algorithm to algorithm.
  - Develop better best-case and worst-case analyses for RR.