
Page 1: Optimal Policies for POMDP

Optimal Policies for POMDP

Presented by Alp Sardağ

Page 2: Optimal Policies for POMDP

As Much Reward As Possible?

Greedy Agent

Page 3: Optimal Policies for POMDP

For how long does the agent make decisions?

Finite horizon
Infinite horizon (with a discount factor): values will converge; a good model when the number of decision steps is not given.

Page 4: Optimal Policies for POMDP

Policy: a general plan
Deterministic: one action for each state
Stochastic: a probability distribution over the set of actions
Stationary: can be applied at any time
Non-stationary: dependent on time
Memoryless: no history

Page 5: Optimal Policies for POMDP

Finite Horizon: the agent has to make k decisions; the optimal policy is non-stationary.

Page 6: Optimal Policies for POMDP

Infinite Horizon: we do not need a different policy for each time step.

Discount factor: 0 < γ < 1

Infiniteness helps us find a stationary policy: π = {π_0, π_1, ..., π_t} = {π_i, π_i, ..., π_i}.

Page 7: Optimal Policies for POMDP

MDP:
Finite horizon: solved with dynamic programming (see the sketch below).
Infinite horizon: |S| equations in |S| unknowns, solvable by LP.
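A minimal sketch of the finite-horizon dynamic-programming recursion for an MDP; the array layouts T[a, s, s'] and R[a, s] are assumptions for illustration, not taken from the slides.

import numpy as np

def finite_horizon_values(T, R, horizon):
    # T[a, s, s'] = transition probability, R[a, s] = immediate reward.
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)           # V_0: no decisions left
    policies = []                    # one policy per step (non-stationary)
    for _ in range(horizon):
        Q = R + T @ V                # Q[a, s] = R[a, s] + sum_s' T[a, s, s'] V[s']
        policies.append(Q.argmax(axis=0))
        V = Q.max(axis=0)
    return V, policies[::-1]         # policies[0] is the first decision to make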

Page 8: Optimal Policies for POMDP

MDP: actions may be stochastic, but do you know which state you end up in? POMDPs must deal with uncertainty in the observations.

Page 9: Optimal Policies for POMDP

POMDP Model:
Finite set of states
Finite set of actions
Transition probabilities (as in MDP)
Observation model
Reinforcement
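A minimal sketch of these components as a data structure; the array layouts and names are illustrative assumptions, not from the slides.

import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    T: np.ndarray   # T[a, s, s']  transition probabilities (as in an MDP)
    O: np.ndarray   # O[a, s', o]  probability of observation o after action a lands in s'
    R: np.ndarray   # R[a, s]      immediate reward for action a in state s
    gamma: float    # discount factor

    @property
    def n_states(self):  return self.T.shape[1]
    @property
    def n_actions(self): return self.T.shape[0]
    @property
    def n_obs(self):     return self.O.shape[2]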

Page 10: Optimal Policies for POMDP

POMDP Model: immediate reward for performing action a in state i.

Page 11: Optimal Policies for POMDP

POMDP Model: belief state, a probability distribution over states,
b = (b_0, b_1, ..., b_|S|).
Drawback: to compute the next belief state, a world model is needed. From Bayes' rule:
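A sketch of this update under the array conventions assumed above (T[a, s, s'] and O[a, s', o] are illustrative layouts): predict the next state, weight by the observation likelihood, then normalize.

import numpy as np

def update_belief(b, a, o, T, O):
    predicted = b @ T[a]                      # P(s' | a, b) = sum_s b(s) T[a, s, s']
    unnormalized = predicted * O[a, :, o]     # multiply by P(o | a, s')
    return unnormalized / unnormalized.sum()  # divide by P(o | a, b)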

Page 12: Optimal Policies for POMDP

POMDP ModelControl dynamics for a POMDP

Page 13: Optimal Policies for POMDP

Policies for POMDP: the belief space is infinite, so representing value functions as tables is infeasible. For horizon length 1:

There is no control over the observations (a feature not found in MDPs), so we must weight over all observations.
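For horizon length 1 the value of a belief reduces to the best expected immediate reward; a minimal sketch using the R[a, s] layout assumed earlier.

import numpy as np

def horizon_one_value(b, R):
    # V_1(b) = max_a sum_s b(s) R[a, s]
    return float((R @ b).max())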

Page 14: Optimal Policies for POMDP

Value functions for POMDPs

The formula is complex; however, if the value function is piecewise linear (a way of representing a value function over a continuous space), it can be written as:
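A minimal sketch of evaluating such a piecewise-linear representation: the value at a belief is the maximum of dot products with a finite set of alpha-vectors.

import numpy as np

def value(b, alpha_vectors):
    return max(np.dot(b, alpha) for alpha in alpha_vectors)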

Page 15: Optimal Policies for POMDP

Value functions for POMDPs

Page 16: Optimal Policies for POMDP

Value Functions for POMDPs

Given V_{t-1}, V_t can be calculated.
Keep the action which gives rise to each specific vector.
To find the optimal policy at a belief state, just perform the maximization over all vectors and take the associated action.
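A sketch of that maximization, assuming each alpha-vector is stored alongside the action that generated it; the parallel list `actions` is an illustrative convention.

import numpy as np

def best_action(b, alpha_vectors, actions):
    best = max(range(len(alpha_vectors)), key=lambda i: np.dot(b, alpha_vectors[i]))
    return actions[best]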

Page 17: Optimal Policies for POMDP

Geometric Interpretation of VF

Belief simplex:

2 dimensional case:

Page 18: Optimal Policies for POMDP

Geometric Interpretation of VF

3 dimensional case:

Page 19: Optimal Policies for POMDP

Alternate VF Interpretation: a decision tree can enumerate each possible policy for a k-horizon problem, if the initial belief state is given.

Page 20: Optimal Policies for POMDP

Alternate VF Interpretation: the number of nodes in each policy tree (see the counts sketched below):

The number of possible trees (|A| possible actions at each node):

If we could somehow generate only the useful trees, the complexity would be greatly reduced. Previously, to create the entire VF we had to generate vectors for all belief states, which is too many for the algorithm to work.
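Under the usual policy-tree construction (an assumption about the exact formulas on this slide), each node branches on the |O| observations, which gives the following counts.

def tree_counts(n_actions, n_obs, k):
    nodes = sum(n_obs ** i for i in range(k))   # (|O|^k - 1) / (|O| - 1) nodes per tree
    trees = n_actions ** nodes                  # an action choice at every node
    return nodes, trees

# Example: 3 actions, 2 observations, horizon 4:
# nodes = 1 + 2 + 4 + 8 = 15, trees = 3**15 = 14,348,907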

Page 21: Optimal Policies for POMDP

POMDP Solutions. For the finite horizon:

Iterate over time steps: given V_{t-1}, compute V_t (a loop skeleton follows below).

Retain all intermediate solutions.
For finitely transient policies, the same idea applies to finding the infinite-horizon solution: iterate until the optimal value functions are the same for two consecutive time steps. Once the infinite-horizon solution is found, discard all intermediate results.
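A sketch of that loop; `dp_backup` is a hypothetical backup routine standing in for the value-function update from V_{t-1} to V_t.

import numpy as np

def same_vector_sets(A, B, tol=1e-9):
    # Two vector sets match if every vector in one has a close counterpart in the other.
    return len(A) == len(B) and all(
        any(np.allclose(a, b, atol=tol) for b in B) for a in A)

def solve_horizon(dp_backup, V0, max_steps):
    history = [V0]
    for _ in range(max_steps):
        history.append(dp_backup(history[-1]))
        if same_vector_sets(history[-1], history[-2]):
            return history[-1]   # infinite horizon: keep only the last value function
    return history               # finite horizon: retain all intermediate solutions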

Page 22: Optimal Policies for POMDP

POMDP Solutions: given V_{t-1}, V_t can be calculated for a single belief point from the previous formula, but with no knowledge of the region over which that vector is optimal (Sondik). There are too many belief points to construct the VF this way; one possible solution:

Choose random points. If the number of points is large, one hopefully will not miss any of the true vectors. How many points to choose? There is no guarantee.

Instead, find optimal policies by developing a systematic algorithm to explore the entire continuous space of beliefs.

Page 23: Optimal Policies for POMDP

Tiger Problem:
Actions: open the left door, open the right door, listen.
Listening is not accurate.
s0: tiger on the left; s1: tiger on the right.
Rewards: +10 for opening the right door, -100 for the wrong door, -1 for listening.
Initially: b = (0.5, 0.5).
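A concrete encoding of this model as a sketch; the 0.85 listening accuracy is the value commonly used for this problem in the literature and is an assumption here, not stated on the slide.

import numpy as np

# States: 0 = tiger-left, 1 = tiger-right.  Actions: 0 = open-left, 1 = open-right, 2 = listen.
T = np.array([np.full((2, 2), 0.5),        # open-left: problem resets
              np.full((2, 2), 0.5),        # open-right: problem resets
              np.eye(2)])                  # listen: state unchanged
O = np.array([np.full((2, 2), 0.5),        # observations uninformative after opening a door
              np.full((2, 2), 0.5),
              [[0.85, 0.15],               # listen, tiger-left: hear-left with prob 0.85
               [0.15, 0.85]]])             # listen, tiger-right: hear-right with prob 0.85
R = np.array([[-100.0, 10.0],              # open-left: -100 if tiger-left, +10 otherwise
              [10.0, -100.0],              # open-right: +10 if tiger-left, -100 otherwise
              [-1.0, -1.0]])               # listen
b0 = np.array([0.5, 0.5])                  # initial belief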

Page 24: Optimal Policies for POMDP

Tiger Problem

Page 25: Optimal Policies for POMDP

Tiger Problem. The first action, intuitively:

Opening a door: 0.5·(-100) + 0.5·(+10) = -45, versus -1 for listening.
For horizon length 1:

Page 26: Optimal Policies for POMDP

Tiger Problem. For horizon length 2:

Page 27: Optimal Policies for POMDP

Tiger Problem. For horizon length 4, some nice features appear:

Belief states reached by the same action and observation are transformed to a single belief state.
The observations made precisely define the nodes of the graph that will be traversed.

Page 28: Optimal Policies for POMDP

Infinite Horizon: the finite-horizon solution is cumbersome, with a different policy for the same belief point at each time step and a different set of vectors for each time step. Adding a discount factor to the tiger problem, after step 56 the underlying vectors are only slightly different:

Page 29: Optimal Policies for POMDP

Infinite Horizon for Tiger Problem

In this way the finite-horizon algorithms can be used for infinite-horizon problems.
An advantage of the infinite horizon: only the last policy needs to be kept.

Page 30: Optimal Policies for POMDP

Policy Graphs: a way to encode the policy without keeping the vectors and without computing dot products.

Beginning state → end state
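A minimal sketch of a policy-graph representation; `PolicyNode` and `env_step` are illustrative names, not from the slides. Each node stores an action and, for every observation, the node to move to next, so executing the policy needs no vectors and no dot products.

class PolicyNode:
    def __init__(self, action, successors):
        self.action = action            # action to take at this node
        self.successors = successors    # dict: observation -> next PolicyNode

def run_policy(start_node, env_step, horizon):
    node = start_node
    for _ in range(horizon):
        observation = env_step(node.action)   # hypothetical environment interface
        node = node.successors[observation]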

Page 31: Optimal Policies for POMDP

Finite Transience: all the belief states within a particular partition element are transformed into the same other element for a particular action and observation.
For policies that are not finitely transient, exactly optimal policy graphs cannot be constructed.

Page 32: Optimal Policies for POMDP

Overview of Algorithms:
All work iteratively.
All try to find the set of vectors that defines both the value function and the optimal policy at each time step.
Two separate classes:

Given V_{t-1}, generate a superset of V_t and reduce that set until the optimal V_t is found (Monahan and Eagle).
Given V_{t-1}, construct subsets of the optimal V_t; these subsets grow larger until the optimal V_t is found.

Page 33: Optimal Policies for POMDP

Monahan Algorithm:
Easy to implement.
Do not expect it to solve anything but the smallest of problems.
Provides background for understanding the other algorithms.

Page 34: Optimal Policies for POMDP

Monahan Enumeration Phase

Generate all vectors. Number of generated vectors = |A| · M^|Ω|,

where M is the number of vectors from the previous step and Ω is the set of observations.
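A sketch of the enumeration under the array conventions assumed earlier: for each action and each way of assigning a previous-step vector to every observation, build one candidate vector, which yields |A| · M^|Ω| vectors in total.

import numpy as np
from itertools import product

def enumerate_vectors(prev_vectors, T, O, R, gamma):
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    new_vectors = []
    for a in range(n_actions):
        for choice in product(prev_vectors, repeat=n_obs):
            alpha = R[a].astype(float).copy()
            for o, alpha_next in enumerate(choice):
                # gamma * sum_{s'} T[a, s, s'] O[a, s', o] alpha_next(s')
                alpha += gamma * T[a] @ (O[a, :, o] * alpha_next)
            new_vectors.append(alpha)
    return new_vectors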

Page 35: Optimal Policies for POMDP

Monahan Reduction Phase. All vectors could be kept:

at each belief, maximize over all vectors. But this carries a lot of excess baggage, and the number of vectors at the next step will be even larger.

An LP is used to trim away the useless vectors.

Page 36: Optimal Policies for POMDP

Monahan Reduction Phase: for a vector to be useful, there must be at least one belief point at which it gives a larger value than all the others:
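A sketch of such a usefulness test with scipy.optimize.linprog, assuming the standard formulation: maximize a margin delta over beliefs b subject to b·alpha ≥ b·alpha' + delta for every other vector alpha'.

import numpy as np
from scipy.optimize import linprog

def is_useful(alpha, others, eps=1e-9):
    n = len(alpha)
    if not others:
        return True
    # Variables: b(1..n) and delta.  Maximize delta  <=>  minimize -delta.
    c = np.zeros(n + 1); c[-1] = -1.0
    # For every other vector a': b·(a' - alpha) + delta <= 0
    A_ub = np.array([np.append(other - alpha, 1.0) for other in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n), 0.0)])   # beliefs sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.success and -res.fun > eps           # useful iff the best margin is positive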

Page 37: Optimal Policies for POMDP

Monahan Algorithm

Page 38: Optimal Policies for POMDP

Monahan’s LP Complication

Page 39: Optimal Policies for POMDP

Future Work:
Eagle’s Variant of Monahan’s Algorithm.
Sondik’s One-Pass Algorithm.
Cheng’s Relaxed Region Algorithm.
Cheng’s Linear Support Algorithm.