Learning and Planning for POMDPs
Eyal Even-Dar, Tel-Aviv University
Sham Kakade, University of Pennsylvania
Yishay Mansour, Tel-Aviv University
Talk Outline
• Bounded Rationality and Partially Observable MDPs
• Mathematical Model of POMDPs
• Learning in POMDPs
– Planning in POMDPs
– Tracking in POMDPs
Bounded Rationality
• Rationality:
– Players with unlimited computational power
• Bounded Rationality:
– Computational limitations
– Modeled by finite automata
• Challenge: play optimally against a finite automaton
– Size of the automaton is unknown
Bounded Rationality and RL
• Model:
– Perform an action
– See an observation
– Receive either immediate or delayed reward
• This is a POMDP
– The unknown size is a serious challenge
Classical Reinforcement Learning
Agent – Environment Interaction
[Figure: the agent–environment loop – the agent sends an action; the environment returns the next state and a reward]
Reinforcement Learning - Goal
• Maximize the return.
• Discounted return: ∑_{t=1}^∞ γ^t r_t, where 0 < γ < 1
• Undiscounted return: (1/T) ∑_{t=1}^T r_t
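For concreteness, a minimal Python sketch of both return criteria over a finite reward sequence (the reward values here are made up for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    # Discounted return: sum_t gamma^t * r_t, with 0 < gamma < 1
    return sum(gamma**t * r for t, r in enumerate(rewards))

def average_return(rewards):
    # Undiscounted (average-reward) return: (1/T) * sum_{t=1}^T r_t
    return float(np.mean(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical reward sequence
print(discounted_return(rewards), average_return(rewards))
```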
Reinforcement Learning Model – Policy
• Policy Π:
– A mapping from states to distributions over actions
• Optimal policy Π*:
– Attains the optimal return from any start state
• Theorem: there exists a stationary deterministic optimal policy
Planning and Learning in MDPs
• Planning:
– Input: a complete model
– Output: an optimal policy Π*
• Learning:
– Interaction with the environment
– Achieve a near-optimal return
• For MDPs, both planning and learning can be done efficiently (see the sketch below)
– Polynomial in the number of states
– Representation in tabular form
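To illustrate why the tabular case is tractable, here is a minimal value-iteration sketch for MDP planning; the arrays P and R are hypothetical inputs, not anything from the talk:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P[a, s, s2] = Pr(s2 | s, a); R[s, a] = expected reward.
    Returns optimal values and a deterministic greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s2} P[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```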
Partially Observable Agent – Environment Interaction
[Figure: the agent–environment loop under partial observability – the agent sends an action; the environment returns a reward and a signal correlated with the hidden state]
Partially Observable Markov Decision Process
• S – the states
• A – actions
• Psa(·) – next-state distribution
• R(s,a) – reward distribution, e.g., E[R(s3,a)] = 10
• O – observations
• O(s,a) – observation distribution
[Figure: a three-state example (s1, s2, s3) with transition probabilities 0.7/0.3; each state emits one of the observations o1, o2, o3 with probability 0.8 and the other two with probability 0.1 each]
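A minimal container for these components (one possible encoding, assumed here for the later sketches; it is not the authors' notation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    P: np.ndarray  # P[a, s, s2] = Pr(s2 | s, a)  - next-state distribution
    R: np.ndarray  # R[s, a] = E[reward]          - expected rewards
    O: np.ndarray  # O[a, s2, o] = Pr(o | s2, a)  - observation distribution
```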
Partial Observables – Problems in Planning
• The optimal policy is not stationary; furthermore, it is history dependent
• Example:
Partial Observables – Complexity Hardness Results

Policy            | Horizon    | Approximation | Complexity
------------------|------------|---------------|----------------
stationary        | finite     | ε-additive    | NP-complete
history dependent | finite     | ε-additive    | PSPACE-complete
stationary        | discounted | ε-additive    | NP-complete

[LGM01, L95]
Learning in POMDPs – Difficulties
• Suppose an agent knows its state initially; can it keep track of its state?
– Easy given a completely accurate model
– Inaccurate model: our new tracking result
• How can the agent return to the same state?
• What is the meaning of very long histories?
– Do we really need to keep all the history?!
Planning in POMDPs – Belief State Algorithm
• A Bayesian setting
• Prior over the initial state
• Each action and observation defines a posterior
– Belief state: a distribution over states
• View the possible belief states as “states”
– Infinite number of states
• Also assumes a “perfect model”
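A sketch of the Bayesian posterior (belief-state) update under a perfect model, using the POMDP encoding assumed earlier:

```python
import numpy as np

def belief_update(b, a, o, model):
    """Posterior over states after taking action a and observing o.
    b is the current belief, shape (n_states,)."""
    b_pred = b @ model.P[a]             # predict: push the belief through the dynamics
    b_post = b_pred * model.O[a][:, o]  # correct: weight by the observation likelihood
    return b_post / b_post.sum()        # renormalize to a distribution
```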
Learning in POMDPs – Popular methods
• Policy gradient methods:
– Find a locally optimal policy in a restricted class of policies (parameterized policies)
– Need to assume a reset to the start state!
– Cannot guarantee asymptotic results
– [Peshkin et al., Baxter & Bartlett, …]
Learning in POMDPs
• Trajectory trees [KMN]:
– Assume a generative model
• A strong RESET procedure
– Find a “near best” policy in a restricted class of policies
• Finite-horizon policies
• Parameterized policies
Trajectory tree [KMN]
[Figure: a trajectory tree rooted at s0 – each node branches on the actions a1, a2, and each edge is labeled by a sampled observation (o1, …, o4) leading to the next node]
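A sketch of building one trajectory tree from a generative model, in the spirit of [KMN]; sample(s, a) -> (next_state, observation, reward) is an assumed interface:

```python
def build_tree(s, sample, actions, depth):
    """One trajectory tree: maps each action to (reward, observation, subtree).
    Every finite-horizon policy in the class can be evaluated on the same tree."""
    if depth == 0:
        return None
    tree = {}
    for a in actions:
        s2, o, r = sample(s, a)  # one generative-model call per action edge
        tree[a] = (r, o, build_tree(s2, sample, actions, depth - 1))
    return tree
```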
Our setting
• Return: average-reward criterion
• One long trajectory
– No RESET
– Connected environment (unichain POMDP)
• Goal: achieve the optimal return (average reward) with probability 1
Homing strategies - POMDPs
• A homing strategy is a strategy that identifies the state
– Knows how to return “home”
• Enables an “approximate reset” during a long trajectory
Homing strategies
• Learning finite automata [Rivest & Schapire]
– Use a homing sequence to identify the state
• The homing sequence is exact
• It can lead to many states
– Use the finite-automata learning of [Angluin 87]
• Diversity-based learning [Rivest & Schapire]
– Similar to our setting
• Major difference: deterministic transitions
Homing strategies - POMDPs
Definition: H is an (ε,K)-homing strategy if for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within distance ε.
Homing strategies – Random Walk
• If the POMDP is strongly connected, then the Markov chain induced by the random walk is irreducible
• Following the random walk therefore ensures convergence to the steady-state distribution
Homing strategies – Random Walk
• What if the Markov chain is periodic?
– e.g., a cycle
• Use a “stay” action to overcome periodicity problems (see the sketch below)
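A sketch of random-walk homing with a lazy step; env.step is an assumed interface, and stay stands in for whatever action leaves the state unchanged:

```python
import random

def random_walk_home(env, actions, k, stay, stay_prob=0.5):
    """Take k steps of a lazy random walk: mixing in the stay action makes
    the induced Markov chain aperiodic, so the belief converges to the
    steady state regardless of where we started."""
    for _ in range(k):
        a = stay if random.random() < stay_prob else random.choice(actions)
        env.step(a)
```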
Homing strategies – Amplifying
Claim: if H is an (ε,K)-homing sequence, then repeating H T times is an (ε^T, KT)-homing sequence; each repetition contracts the distance between the expected belief states by another factor of ε.
Reinforcement learning with homing
• Usually, algorithms must balance exploration and exploitation
• Here they must balance exploration, exploitation, and homing
• Homing is performed during both exploration and exploitation
Policy testing algorithm
Theorem: for any connected POMDP, the policy testing algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log log T.
Policy testing
• Enumerate the policies
– Gradually increase the horizon
• Run in phases:
– Test policy πk
• Average over several runs, resetting between runs
– Run the best policy so far
• Ensures a good average return
• Again, reset between runs
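A schematic of the policy-testing loop; policies_by_horizon, test, and exploit are hypothetical helpers standing in for the steps above (homing between runs is assumed inside both):

```python
def policy_testing(policies_by_horizon, test, exploit):
    """Alternate between testing the next candidate and re-running the
    best policy found so far, so the average return stays good."""
    best, best_value = None, float('-inf')
    for policy in policies_by_horizon():  # enumeration with growing horizon
        value = test(policy)              # average several runs of the candidate
        if value > best_value:
            best, best_value = policy, value
        exploit(best)                     # run the incumbent between tests
```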
Model based algorithm
Theorem: for any connected POMDP, the model based algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log T.
Model based algorithm
• For t = 1 to ∞
– For K1(t) times do (Exploration)
• Run a random policy for t steps and build an empirical model
• Use the homing sequence to approximate a reset
– Compute the optimal policy on the empirical model
– For K2(t) times do (Exploitation)
• Run the empirical optimal policy for t steps
• Use the homing sequence to approximate a reset
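A schematic of this loop; explore, update_model, solve, run, and home are hypothetical helpers standing in for the slide's subroutines:

```python
def model_based(env, home, K1, K2, explore, update_model, solve, run):
    t = 1
    while True:
        for _ in range(K1(t)):             # exploration phase
            update_model(explore(env, t))  # random t-step trajectory -> empirical model
            home(env)                      # homing sequence = approximate reset
        policy = solve(t)                  # optimal policy of the empirical model
        for _ in range(K2(t)):             # exploitation phase
            run(env, policy, t)
            home(env)
        t += 1
```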
Model based algorithm
[Figure: the empirical model built during exploration – a tree rooted at the (approximate) reset state s0, with edges for the actions a1, a2 and the observations o1, o2]
Model based algorithm – Computing the optimal policy
• Bounding the error in the model
– Significant nodes
• Sampling
• Approximate reset
– Insignificant nodes
• Compute an ε-optimal t-horizon policy at each step
Model based algorithm – Convergence w.p. 1 proof
• Proof idea:
• At any stage, K1(t) is large enough so that we compute an εt-optimal t-horizon policy
• K2(t) is large enough that the influence of all previous phases is bounded by εt
• For a large enough horizon, the influence of the homing sequence is also bounded
Model based algorithm – Convergence rate
• The model based algorithm produces an ε-optimal policy with probability 1 − δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy
• Note: the algorithm does not depend on |S|
Planning in POMDPs
• Unfortunately, not today …
• Basic results:
– Tight connections with Multiplicity Automata
• Well-established theory starting in the 60's
– Rank of the Hankel matrix
• Similar to PSRs
• Always at most the number of states
– Planning algorithm:
• Exponential in the rank of the Hankel matrix
Tracking in POMDPs
• Belief states algorithm
– Assumes perfect tracking
• Perfect model
• With an imperfect model, tracking can be impossible
– For example: no observations
• New results:
– “Informative observations” imply efficient tracking
• Towards a spectrum of “partial” observability …