PEGASUS: A policy search method for large MDP’s and POMDP’s
Andrew Ng, Michael Jordan
Presented by: Geoff Levine

Page 1: PEGASUS:  A policy search method for large MDP’s and POMDP’s

PEGASUS: A policy search method for large MDP’s and POMDP’s

Andrew Ng, Michael Jordan

Presented by: Geoff Levine

Page 2: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Motivation

• For large, complicated domains, estimation of value functions/Q-functions can take a long time.

• However, there often exist policies far simpler than the optimal one that perform nearly as well.
– We can instead search directly through a policy space

Page 3: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Preliminaries

• MDP – M = (S, D, A, {P_sa(·)}, γ, R)
– S – set of states
– D – initial-state distribution
– A – set of actions
– P_sa(·) : S -> [0,1] – transition probabilities
– γ – discount factor
– R – deterministic reward (a function of the state)

Page 4: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Policies

• Policy π : S -> A

• Value function V^π : S -> Reals

V^π(s) = R(s) + γ E_{s'~P_{s,π(s)}}[V^π(s')]

• For convenience, also define:

V(π) = E_{s0~D}[V^π(s0)]

Page 5: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Application Domain

• Helicopter Flight (Hovering in Place)
– 12-d continuous state space ([0,1]^12)
  • (x, y, z, pitch, roll, yaw, x', y', z', pitch', roll', yaw')
– 4-d continuous action space ([0,1]^4)
  • (front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control)
– Timesteps correspond to 1/50th of a second
– γ = 0.9995
– R(s) = -(a(x - x*)^2 + b(y - y*)^2 + c(z - z*)^2 + (yaw - yaw*)^2)
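As a quick illustration of the quadratic hover penalty above, here is a minimal Python sketch; the weights a, b, c and the target hover point are not specified on the slide, so the defaults below are placeholders only.

```python
# Minimal sketch of the quadratic hover penalty; the weights a, b, c and the
# target hover point are placeholders (not given on the slide).
def hover_reward(s, target, a=1.0, b=1.0, c=1.0):
    """s and target each hold (x, y, z, yaw); larger deviations mean lower reward."""
    x, y, z, yaw = s
    xt, yt, zt, yawt = target
    return -(a * (x - xt) ** 2 + b * (y - yt) ** 2
             + c * (z - zt) ** 2 + (yaw - yawt) ** 2)
```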

Page 6: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Helicopter

Page 7: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Transformation of MDP’s

• Given M = (S, D, A, {P_sa(·)}, γ, R) we construct M' = (S', D', A, {P'_sa(·)}, γ, R'), an MDP with deterministic state transitions

• Intuition: Instead of rolling the dice when we move from state to state, we will roll all the dice we need ahead of time, and store their results as part of our state.

Page 8: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Parcheesi


Page 9: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Deterministic Simulative Model

• Assume we have a deterministic functional representation of our MDP transitions
– g : S x A x [0,1]^dp -> S
such that if p is distributed uniformly in [0,1]^dp, then Pr_p[g(s, a, p) = s'] = P_sa(s').

– More powerful than a generative model.
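To make the deterministic simulative model concrete, here is a minimal sketch for a toy discrete MDP (my own example, not from the paper): g maps (s, a, p) to the next state via the inverse CDF of P_sa, so a uniform p reproduces the original transition probabilities.

```python
import numpy as np

# Toy transition table: P[s][a] is a vector of next-state probabilities.
P = {
    0: {0: np.array([0.7, 0.3]), 1: np.array([0.1, 0.9])},
    1: {0: np.array([0.5, 0.5]), 1: np.array([0.2, 0.8])},
}

def g(s, a, p):
    """Deterministic simulative model: with p ~ Uniform[0,1],
    Pr_p[g(s, a, p) = s'] = P_sa(s')."""
    cdf = np.cumsum(P[s][a])
    return int(min(np.searchsorted(cdf, p), len(cdf) - 1))
```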

Page 10: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Transformations of MDP’s

– S' = S x [0,1]^∞ (taking d_p = 1, so each transition consumes one pre-drawn random number)

– D' – draws t = (s, p1, p2, p3, …) with s ~ D and the p_i's iid from Uniform[0,1]

– P'_ta(t') = 1 if g(s, a, p1) = s', and 0 otherwise,
  where t = (s, p1, p2, p3, …) and t' = (s', p2, p3, …)

– R'(t) = R(s)
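Continuing the toy sketch above (names are my own, not the paper's code): in the transformed MDP a state carries its future "dice rolls", and a transition simply consumes the first one, so the dynamics become deterministic.

```python
def transformed_step(t, a):
    """One deterministic step in M': t = (s, (p1, p2, ...)) -> t' = (s', (p2, ...))."""
    s, ps = t
    s_next = g(s, a, ps[0])   # uses the pre-drawn random number p1
    return (s_next, ps[1:])
```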

Page 11: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Policies

• Given a policy space Π for S, consider a corresponding policy space Π' for S' such that for every π in Π there is a π' in Π' with, for all s in S and p1, p2, …:

π'((s, p1, p2, p3, …)) = π(s)

• As the transition probabilities and rewards are equivalent in the transformed MDP:

V_M^π(s) = E_{p~Uniform[0,1]^∞}[V_{M'}^{π'}((s, p))]

V_M(π) = V_{M'}(π')

Page 12: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Policy Search

• V_M^π(s0) = R(s0) + γ E_{s'~P_{s0,π(s0)}}[V^π(s')]

• V_{M'}^{π'}((s0, p1, p2, …)) = R(s0) + γ R(s1) + γ^2 R(s2) + …

– where s1 = g(s0, π(s0), p1), s2 = g(s1, π(s1), p2), …

• Since V_M(π) = V_{M'}(π'), we can estimate

V_M(π) = E_{t0~D'}[V_{M'}^{π'}(t0)]
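A hedged sketch of the deterministic rollout value V_{M'}^{π'}(t0), reusing the toy transformed_step above; the infinite sum is cut off at a finite horizon (the choice of horizon is justified two slides below).

```python
def rollout_value(t0, policy, R, gamma, horizon):
    """Unroll the transformed MDP from t0 = (s0, (p1, p2, ...)) and return
    the truncated discounted return R(s0) + gamma*R(s1) + ... (a sketch)."""
    t, total, discount = t0, 0.0, 1.0
    for _ in range(horizon):
        s, _ = t
        total += discount * R(s)
        t = transformed_step(t, policy(s))   # pi'((s, p1, p2, ...)) = pi(s)
        discount *= gamma
    return total
```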

Page 13: PEGASUS:  A policy search method for large MDP’s and POMDP’s

PEGASUS

Policy Evaluation-of-Goodness and Search Using Scenarios

• Draw a sample of m initial states (scenarios) {s0^(1), s0^(2), s0^(3), …, s0^(m)} iid from D'

• Estimate

V̂(π) = (1/m) Σ_{i=1..m} V_{M'}^{π'}(s0^(i))

Page 14: PEGASUS:  A policy search method for large MDP’s and POMDP’s

PEGASUS

• Given the scenarios {s0^(1), s0^(2), s0^(3), …, s0^(m)},

V̂(π) = (1/m) Σ_{i=1..m} V_{M'}^{π'}(s0^(i))

is a deterministic function of the policy.

• The sum is infinite, but we can truncate it after H_ε = log_γ(ε(1-γ)/(2 R_max)) steps, introducing at most ε/2 error. This also allows us to store our “dice rolls” in finite space.
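Putting the pieces together, here is a minimal sketch of the PEGASUS estimate (function names are my own): the scenarios are drawn once and reused for every candidate policy, so the estimate is an ordinary deterministic function of the policy, and each rollout is truncated at H_ε.

```python
import math
import numpy as np

def truncation_horizon(eps, gamma, R_max):
    """H_eps = log_gamma(eps * (1 - gamma) / (2 * R_max)), as on the slide."""
    return int(math.ceil(math.log(eps * (1 - gamma) / (2 * R_max), gamma)))

def draw_scenarios(m, H, sample_initial_state, rng):
    """Each scenario is an initial state s0 ~ D plus H pre-drawn 'dice rolls'."""
    return [(sample_initial_state(rng), rng.uniform(size=H)) for _ in range(m)]

def pegasus_estimate(policy, scenarios, R, gamma, H):
    """(1/m) * sum over scenarios of the truncated rollout value (see rollout_value above)."""
    return float(np.mean([rollout_value(t0, policy, R, gamma, H)
                          for t0 in scenarios]))
```

Because the same scenarios are reused, two candidate policies can be compared without Monte Carlo noise.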

Page 15: PEGASUS:  A policy search method for large MDP’s and POMDP’s

PEGASUS

• Given the deterministic function V̂(π), we can use an optimization technique to find argmax_π V̂(π).

– If working in a continuous, smooth, differentiable domain, we can use gradient ascent.

– If R is discontinuous, we may need to use “continuation” methods to smooth it out.
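As a sketch of the search step (illustrative only; the slides specify nothing beyond "gradient ascent or random-walk search"), the scenario-based estimate can be treated as a plain deterministic objective and hill-climbed, e.g. with finite differences:

```python
import numpy as np

def search_policy(theta0, make_policy, scenarios, R, gamma, H,
                  step=1e-2, fd=1e-4, iters=100):
    """Finite-difference ascent on theta -> pegasus_estimate(make_policy(theta), ...).
    A stand-in for the gradient-ascent / random-walk searches mentioned above."""
    theta = np.array(theta0, dtype=float)
    objective = lambda th: pegasus_estimate(make_policy(th), scenarios, R, gamma, H)
    for _ in range(iters):
        base = objective(theta)
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            bumped = theta.copy()
            bumped[i] += fd
            grad[i] = (objective(bumped) - base) / fd
        theta += step * grad
    return theta
```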

Page 16: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Results

• On 5x5 Gridworld POMDP, discovers near optimal policy in very few scenarios (~5)

• On a continuous state/action bicycle-riding problem, results were near optimal and far better than earlier reward-shaping methods.

Page 17: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Helicopter Hovering

• Policy represented by a hand-crafted neural network.

• PEGASUS used to search through the set of possible ANN weights.
– Tried both gradient-ascent and random-walk searches

Page 18: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Neural Network Structure

(x, y, z) = (forward, sideways, down)

a1 = front/back cyclic pitch control
a2 = left/right cyclic pitch control
a3 = main rotor pitch control
a4 = tail rotor pitch control

(The slide's network diagram is not reproduced in this transcript.)
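Purely as an illustration of the shape of the policy class (a 12-d state in, a 4-d action in [0,1] out), here is a generic one-hidden-layer stand-in. The paper's actual network was hand-crafted with domain-specific structure, which this sketch does not capture.

```python
import numpy as np

def make_policy(theta, n_hidden=8, n_state=12, n_action=4):
    """Generic stand-in policy: theta must hold n_state*n_hidden + n_hidden*n_action
    entries (128 with the defaults). Not the paper's hand-crafted architecture."""
    split = n_state * n_hidden
    W1 = theta[:split].reshape(n_hidden, n_state)
    W2 = theta[split:split + n_hidden * n_action].reshape(n_action, n_hidden)
    def policy(s):
        h = np.tanh(W1 @ np.asarray(s, dtype=float))
        return 1.0 / (1.0 + np.exp(-(W2 @ h)))   # squash each control into [0,1]
    return policy
```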

Page 20: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Pseudo-Dimension

• H is a set of functions X -> Reals

• H shatters x1, x2, …, xd ∈ X if there exists a sequence of real numbers t1, t2, …, td such that

{(h(x1) - t1, h(x2) - t2, …, h(xd) - td) | h ∈ H}

intersects all 2^d orthants of R^d

• The pseudo-dimension of H, dim_P(H), is the size of the largest set shattered by H
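A standard worked example (mine, not from the slides): let H be the affine functions on the reals, h(x) = wx + b. Any two distinct points x1, x2 are shattered, e.g. with t1 = t2 = 0, since a line can be placed above or below zero at x1 and x2 independently. No three points can be shattered, because the vectors (h(x1) - t1, h(x2) - t2, h(x3) - t3) sweep out only a two-dimensional affine plane in R^3, which cannot meet all 2^3 = 8 orthants. Hence dim_P(H) = 2.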

Page 21: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Lipschitz Continuity

• A function f is Lipschitz continuous with Lipschitz bound B if

||f(x) – f(y)|| <= B||x – y||

(with respect to Euclidean norm on range and domain)

Page 22: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Realizable Dynamics in an MDP

• Let S = [0,1]^ds and g : S x A x [0,1]^dp -> S be given.

• For each coordinate i, define F_i as the set of functions

{ F_i^a : S x [0,1]^dp -> [0,1],  F_i^a(s, p1, …, p_dp) = I_i(g(s, a, p1, …, p_dp)) | a in A }

where I_i(x) returns the ith coordinate of x.

Page 23: PEGASUS:  A policy search method for large MDP’s and POMDP’s

PEGASUS Theoretical Result

• Let S = [0,1]^ds, a policy class Π, and a model g : S x A x [0,1]^dp -> S be given.

• F is the family of realizable dynamics in the MDP and F_i the resulting family of coordinate functions. For all i, let dim_P(F_i) <= d, and let F_i be uniformly Lipschitz continuous with bound B.

• The reward function R is Lipschitz continuous with bound B_R.

• Then, if the number of scenarios m exceeds the bound given on the slide (the formula is not reproduced in this transcript), with probability at least 1 - δ the PEGASUS estimate V̂(π) will be uniformly close to the actual value: |V̂(π) - V(π)| <= ε.

Page 24: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Proof (1)

• Think of the reward at step i as a random variable:

V^π(s0^(1)) = R(s0^(1)) + γ R(s1^(1)) + γ^2 R(s2^(1)) + …

V^π(s0^(2)) = R(s0^(2)) + γ R(s1^(2)) + γ^2 R(s2^(2)) + …

V^π(s0^(3)) = R(s0^(3)) + γ R(s1^(3)) + γ^2 R(s2^(3)) + …

• By bounding properties of each R(si^(j)), we can prove uniform convergence for V̂(π)

Page 25: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Proof (2)

• Calling on work by Haussler, we show that if the pseudo-dimension of each F_i satisfies dim_P(F_i) <= d, we can “nearly” represent our world-dynamics functions F_i^a by a smaller set of functions whose size is bounded (the slide's formula is not reproduced in this transcript).

Page 26: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Proof (3)

• Similarly, if each F_i uniformly has Lipschitz bound B and the reward function R has Lipschitz bound B_R, we can “nearly” represent the function mapping scenarios to ith-step rewards by a set of bounded size (the slide's formula is not reproduced in this transcript).

Page 27: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Proof (4)

• A result by Haussler then shows that, with probability 1 - δ, our ith-step reward will be ε-close to its mean if we select a number of scenarios satisfying the bound on the slide (formula not reproduced in this transcript).

Page 28: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Proof (5)

• Strengthening the bound to account for all H_ε rewards and applying the union bound, we find that a number of scenarios satisfying the bound on the slide (formula not reproduced in this transcript) is sufficient.

Page 29: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Critique

• Success limited to a very small, fairly linear control problem with a high-frequency controller

• Lots of human bias incorporated into the system
– Restrictions/linear regression for model identification
– Structure of the neural net for each of the tasks

• PAC-learning guarantees still out of reach
• No theoretical bounds on the final policy

Page 30: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Bibliography

1. Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, Vol. 100, September 1992, pp. 78-150. (Also available as a chapter on the PAC learning model and decision-theoretic generalizations, with applications to neural nets, in Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995.)

2. Ng, A. Y., and Jordan, M. I. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000.

3. Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16, 2004.

4. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

Page 31: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Application – Helicopter Flight

• PEGASUS has been used to derive policies for hovering in place.

• Later generalized to handle slow motion maneuvers and upside down hovering.

• A GPS system relays state information (position and velocity) to an off-board computer, which calculates a 4-dimensional action

Page 32: PEGASUS:  A policy search method for large MDP’s and POMDP’s

Model Identification

• Construction of an MDP representation of the world dynamics

• Transition dynamics learned from several minutes of data from human-piloted flight
– Fit using linear regression
– Forced to respect innate properties of the domain (gravity, symmetry)
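A hedged sketch of the model-identification step: fit linear dynamics s_{t+1} ≈ A s_t + B a_t + c to logged flight data by least squares. The paper additionally constrains the fit to respect known structure (gravity, symmetry), which this sketch omits; all names are illustrative.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t + c from logged flight data.
    states: (N, ds), actions: (N, da), next_states: (N, ds)."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])  # rows [s_t, a_t, 1]
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A = W[:states.shape[1]].T     # state-to-state map
    B = W[states.shape[1]:-1].T   # action-to-state map
    c = W[-1]                     # constant offset
    return A, B, c
```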