Learning and Planning for POMDPs
Eyal Even-Dar, Tel-Aviv University
Sham Kakade, University of Pennsylvania
Yishay Mansour, Tel-Aviv University
Talk Outline
• Bounded Rationality and Partially Observable MDPs
• Mathematical Model of POMDPs
• Learning in POMDPs
– Planning in POMDPs
– Tracking in POMDPs
Bounded Rationality
• Rationality:
– Players with unlimited computational power
• Bounded Rationality:
– Computational limitations
– Modeled by finite automata
• Challenge: play optimally against a finite automaton
– Size of the automaton is unknown
Bounded Rationality and RL
• Model:
– Perform an action
– See an observation
– Receive either immediate or delayed reward
• This is a POMDP
– The unknown size is a serious challenge
Classical Reinforcement Learning
Agent – Environment Interaction
[Figure: the agent–environment loop – the agent sends an action; the environment returns the next state and a reward]
Reinforcement Learning - Goal
• Maximize the return.
• Discounted return: ∑_{t=1}^∞ γ^t r_t, where 0 < γ < 1
• Undiscounted return: (1/T) ∑_{t=1}^T r_t
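For concreteness, a minimal Python sketch of both return criteria over a finite reward sequence (the reward values here are made up for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    # Discounted return: sum_t gamma^t * r_t, with 0 < gamma < 1
    return sum(gamma**t * r for t, r in enumerate(rewards))

def average_return(rewards):
    # Undiscounted (average-reward) return: (1/T) * sum_{t=1}^T r_t
    return float(np.mean(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical reward sequence
print(discounted_return(rewards), average_return(rewards))
```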
Reinforcement Learning Model – Policy
• Policy Π:
– A mapping from states to distributions over actions
• Optimal policy Π*:
– Attains the optimal return from any start state
• Theorem: there exists a stationary deterministic optimal policy
Planning and Learning in MDPs
• Planning:
– Input: a complete model
– Output: an optimal policy Π*
• Learning:
– Interaction with the environment
– Achieve a near-optimal return
• For MDPs, both planning and learning can be done efficiently (see the sketch below)
– Polynomial in the number of states
– Representation in tabular form
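To illustrate why the tabular case is tractable, here is a minimal value-iteration sketch for MDP planning; the arrays P and R are hypothetical inputs, not anything from the talk:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P[a, s, s2] = Pr(s2 | s, a); R[s, a] = expected reward.
    Returns optimal values and a deterministic greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s2} P[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```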
Partially Observable Agent – Environment Interaction
[Figure: the agent–environment loop under partial observability – the agent sends an action; the environment returns a reward and a signal correlated with the hidden state]
Partially Observable Markov Decision Process
• S – the states
• A – actions
• Psa(·) – next-state distribution
• R(s,a) – reward distribution, e.g., E[R(s3,a)] = 10
• O – observations
• O(s,a) – observation distribution
[Figure: a three-state example (s1, s2, s3) with transition probabilities 0.7/0.3; each state emits one of the observations o1, o2, o3 with probability 0.8 and the other two with probability 0.1 each]
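A minimal container for these components (one possible encoding, assumed here for the later sketches; it is not the authors' notation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    P: np.ndarray  # P[a, s, s2] = Pr(s2 | s, a)  - next-state distribution
    R: np.ndarray  # R[s, a] = E[reward]          - expected rewards
    O: np.ndarray  # O[a, s2, o] = Pr(o | s2, a)  - observation distribution
```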
Partial Observables – Problems in Planning
• The optimal policy is not stationary; furthermore, it is history dependent
• Example:
Partial Observables – Complexity Hardness Results

Policy            | Horizon    | Approximation | Complexity
------------------|------------|---------------|----------------
stationary        | finite     | ε-additive    | NP-complete
history dependent | finite     | ε-additive    | PSPACE-complete
stationary        | discounted | ε-additive    | NP-complete

[LGM01, L95]
Learning in POMDPs – Difficulties
• Suppose an agent knows its state initially; can it keep track of its state?
– Easy given a completely accurate model
– Inaccurate model: our new tracking result
• How can the agent return to the same state?
• What is the meaning of very long histories?
– Do we really need to keep all the history?!
Planning in POMDPs – Belief State Algorithm
• A Bayesian setting
• Prior over the initial state
• Each action and observation defines a posterior
– Belief state: a distribution over states
• View the possible belief states as “states”
– Infinite number of states
• Also assumes a “perfect model”
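A sketch of the Bayesian posterior (belief-state) update under a perfect model, using the POMDP encoding assumed earlier:

```python
import numpy as np

def belief_update(b, a, o, model):
    """Posterior over states after taking action a and observing o.
    b is the current belief, shape (n_states,)."""
    b_pred = b @ model.P[a]             # predict: push the belief through the dynamics
    b_post = b_pred * model.O[a][:, o]  # correct: weight by the observation likelihood
    return b_post / b_post.sum()        # renormalize to a distribution
```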
Learning in POMDPs – Popular methods
• Policy gradient methods:
– Find a locally optimal policy in a restricted class of policies (parameterized policies)
– Need to assume a reset to the start state!
– Cannot guarantee asymptotic results
– [Peshkin et al., Baxter & Bartlett, …]
Learning in POMDPs
• Trajectory trees [KMN]:
– Assume a generative model
• A strong RESET procedure
– Find a “near best” policy in a restricted class of policies
• Finite-horizon policies
• Parameterized policies
Trajectory tree [KMN]
[Figure: a trajectory tree rooted at s0 – each node branches on the actions a1, a2, and each edge is labeled by a sampled observation (o1, …, o4) leading to the next node]
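A sketch of building one trajectory tree from a generative model, in the spirit of [KMN]; sample(s, a) -> (next_state, observation, reward) is an assumed interface:

```python
def build_tree(s, sample, actions, depth):
    """One trajectory tree: maps each action to (reward, observation, subtree).
    Every finite-horizon policy in the class can be evaluated on the same tree."""
    if depth == 0:
        return None
    tree = {}
    for a in actions:
        s2, o, r = sample(s, a)  # one generative-model call per action edge
        tree[a] = (r, o, build_tree(s2, sample, actions, depth - 1))
    return tree
```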
Our setting
• Return: average-reward criterion
• One long trajectory
– No RESET
– Connected environment (unichain POMDP)
• Goal: achieve the optimal return (average reward) with probability 1
Homing strategies - POMDPs
• A homing strategy is a strategy that identifies the state
– Knows how to return “home”
• Enables an “approximate reset” during a long trajectory
Homing strategies
• Learning finite automata [Rivest & Schapire]
– Use a homing sequence to identify the state
• The homing sequence is exact
• It can lead to many states
– Use the finite-automata learning of [Angluin 87]
• Diversity-based learning [Rivest & Schapire]
– Similar to our setting
• Major difference: deterministic transitions
Homing strategies - POMDPs
Definition: H is an (ε,K)-homing strategy if for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within distance ε.
Homing strategies – Random Walk
• If the POMDP is strongly connected, then the Markov chain induced by the random walk is irreducible
• Following the random walk therefore ensures convergence to the steady-state distribution
Homing strategies – Random Walk
• What if the Markov chain is periodic?
– e.g., a cycle
• Use a “stay” action to overcome periodicity problems (see the sketch below)
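A sketch of random-walk homing with a lazy step; env.step is an assumed interface, and stay stands in for whatever action leaves the state unchanged:

```python
import random

def random_walk_home(env, actions, k, stay, stay_prob=0.5):
    """Take k steps of a lazy random walk: mixing in the stay action makes
    the induced Markov chain aperiodic, so the belief converges to the
    steady state regardless of where we started."""
    for _ in range(k):
        a = stay if random.random() < stay_prob else random.choice(actions)
        env.step(a)
```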
Homing strategies – Amplifying
Claim: if H is an (ε,K)-homing sequence, then repeating H T times is an (ε^T, KT)-homing sequence; each repetition contracts the distance between the expected belief states by another factor of ε.
Reinforcement learning with homing
• Usually, algorithms must balance exploration and exploitation
• Here they must balance exploration, exploitation, and homing
• Homing is performed during both exploration and exploitation
Policy testing algorithm
Theorem: for any connected POMDP, the policy testing algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log log T.
Policy testing
• Enumerate the policies
– Gradually increase the horizon
• Run in phases:
– Test policy πk
• Average over several runs, resetting between runs
– Run the best policy so far
• Ensures a good average return
• Again, reset between runs
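A schematic of the policy-testing loop; policies_by_horizon, test, and exploit are hypothetical helpers standing in for the steps above (homing between runs is assumed inside both):

```python
def policy_testing(policies_by_horizon, test, exploit):
    """Alternate between testing the next candidate and re-running the
    best policy found so far, so the average return stays good."""
    best, best_value = None, float('-inf')
    for policy in policies_by_horizon():  # enumeration with growing horizon
        value = test(policy)              # average several runs of the candidate
        if value > best_value:
            best, best_value = policy, value
        exploit(best)                     # run the incumbent between tests
```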
Model based algorithm
Theorem: for any connected POMDP, the model based algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log T.
Model based algorithm
• For t = 1 to ∞
– For K1(t) times do (Exploration)
• Run a random policy for t steps and build an empirical model
• Use the homing sequence to approximate a reset
– Compute the optimal policy on the empirical model
– For K2(t) times do (Exploitation)
• Run the empirical optimal policy for t steps
• Use the homing sequence to approximate a reset
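A schematic of this loop; explore, update_model, solve, run, and home are hypothetical helpers standing in for the slide's subroutines:

```python
def model_based(env, home, K1, K2, explore, update_model, solve, run):
    t = 1
    while True:
        for _ in range(K1(t)):             # exploration phase
            update_model(explore(env, t))  # random t-step trajectory -> empirical model
            home(env)                      # homing sequence = approximate reset
        policy = solve(t)                  # optimal policy of the empirical model
        for _ in range(K2(t)):             # exploitation phase
            run(env, policy, t)
            home(env)
        t += 1
```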
Model based algorithm
[Figure: the empirical model built during exploration – a tree rooted at the (approximate) reset state s0, with edges for the actions a1, a2 and the observations o1, o2]
Model based algorithm – Computing the optimal policy
• Bounding the error in the model
– Significant nodes
• Sampling
• Approximate reset
– Insignificant nodes
• Compute an ε-optimal t-horizon policy at each step
Model based algorithm – Convergence w.p. 1 proof
• Proof idea:
• At any stage, K1(t) is large enough so that we compute an εt-optimal t-horizon policy
• K2(t) is large enough that the influence of all previous phases is bounded by εt
• For a large enough horizon, the influence of the homing sequence is also bounded
Model based algorithm – Convergence rate
• The model based algorithm produces an ε-optimal policy with probability 1 − δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy
• Note: the algorithm does not depend on |S|
Planning in POMDPs
• Unfortunately, not today …
• Basic results:
– Tight connections with Multiplicity Automata
• Well-established theory starting in the 60's
– Rank of the Hankel matrix
• Similar to PSRs
• Always at most the number of states
– Planning algorithm:
• Exponential in the rank of the Hankel matrix
Tracking in POMDPs
• Belief states algorithm
– Assumes perfect tracking
• Perfect model
• With an imperfect model, tracking can be impossible
– For example: no observations
• New results:
– “Informative observations” imply efficient tracking
• Towards a spectrum of “partial” observability …