
Apprenticeship Learning for the Dynamics Model

Overview

Challenges in reinforcement learning for complex physical systems such as helicopters:

• Data collection: aggressive exploration is dangerous.
• It is difficult to specify the proper reward function for a given task.

We present apprenticeship learning algorithms that use an expert demonstration and that:

• Do not require explicit exploration.
• Do not require an explicit reward function specification.

Experimental results:

• Demonstrate the effectiveness of the algorithms on a highly challenging control problem.
• Significantly extend the state of the art in autonomous helicopter flight. In particular, first completion of autonomous stationary forward flips, stationary sideways rolls, nose-in funnels and tail-in funnels.

Complex tasks: hard to specify the reward function.


An Application of Reinforcement Learning to Autonomous Helicopter Flight

Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng

• Key question: How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?

• State of the art: the E³ algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)

• Can we avoid explicit exploration?

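[Flowchart: Have good model of dynamics? NO: "Explore". YES: "Exploit".]

[Flowchart: Expert human pilot flight → (a1, s1, a2, s2, a3, s3, ….) → Learn P_sa → Autonomous flight → (a1, s1, a2, s2, a3, s3, ….) → Learn P_sa]

[Diagram: Dynamics Model P_sa and Reward Function R → Reinforcement Learning, $\max_{\pi} \mathrm{E}[R(s_0) + \dots + R(s_T)]$ → Control policy $\pi$]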

Take away message:

In the apprenticeship learning setting, i.e., when we have an expert demonstration, we do not need explicit exploration to perform as well as the expert.

Theorem. Assuming we have a polynomial number of teacher demonstrations, the apprenticeship learning algorithm will return a policy that performs as well as the teacher within a polynomial number of iterations.

[See Abbeel & Ng, 2005 for more details.]
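The loop in the flowchart (expert flight, learn P_sa, autonomous flight, re-learn P_sa) can be illustrated on a toy problem. The sketch below is not the helicopter model or the algorithm from the paper. It assumes a hypothetical scalar linear system s' = a·s + b·u + noise, a noisy hypothetical "expert" controller, and a one-step model-based controller standing in for the reinforcement learning step. All names and constants are illustrative.

```python
import random

TRUE_A, TRUE_B = 0.9, 0.5  # hypothetical "real" system, unknown to the learner


def rollout(policy, rng, steps=50):
    """Fly one trajectory on the real system and log (s, u, s_next) triples."""
    s, data = 1.0, []
    for _ in range(steps):
        u = policy(s)
        s_next = TRUE_A * s + TRUE_B * u + rng.gauss(0.0, 0.01)
        data.append((s, u, s_next))
        s = s_next
    return data


def fit_model(data):
    """Least-squares fit of s_next ~ a*s + b*u via the 2x2 normal equations."""
    sss = sum(s * s for s, _, _ in data)
    ssu = sum(s * u for s, u, _ in data)
    suu = sum(u * u for _, u, _ in data)
    ssn = sum(s * n for s, _, n in data)
    sun = sum(u * n for _, u, n in data)
    det = sss * suu - ssu * ssu
    a = (ssn * suu - sun * ssu) / det
    b = (sun * sss - ssn * ssu) / det
    return a, b


def learn_dynamics(seed=0, autonomous_flights=3):
    rng = random.Random(seed)
    # Assumed expert: a stabilizing controller with some human-like jitter.
    expert = lambda s: -0.8 * s + rng.gauss(0.0, 0.1)
    data = rollout(expert, rng)              # expert demonstration only
    for _ in range(autonomous_flights):
        a, b = fit_model(data)               # learn P_sa from all data so far
        # Stand-in for the RL step: drive the modeled state to zero in one step.
        controller = lambda s, k=a / b: -k * s
        data += rollout(controller, rng)     # autonomous flight adds new data
    return fit_model(data)


a_hat, b_hat = learn_dynamics()
```

No exploration policy appears anywhere: the only data come from the expert's flight and from flying the current best controller.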


Apprenticeship Learning for the Reward Function

Reward function can be very difficult to specify. E.g., for our helicopter control problem we have:

$R(s) = c_1 \cdot (\text{position error})^2 + c_2 \cdot (\text{orientation error})^2 + c_3 \cdot (\text{velocity error})^2 + c_4 \cdot (\text{angular rate error})^2 + \dots + c_{25} \cdot (\text{inputs})^2.$

It is difficult to specify the proper reward function for a given task.
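A weighted sum of squared errors of this form is easy to write down; the hard part is choosing the coefficients. The sketch below is only an illustration of the functional form: the term names, the example values and the (negative) weights are hypothetical, not the 25 terms used for the helicopter.

```python
def quadratic_reward(errors, weights):
    """R(s) = sum_k c_k * ||e_k||^2 for named error vectors e_k.

    errors:  dict mapping a term name to a list of error components
    weights: dict mapping the same names to coefficients c_k
    """
    total = 0.0
    for name, components in errors.items():
        squared_norm = sum(x * x for x in components)
        total += weights[name] * squared_norm
    return total


# Hypothetical example with two terms; negative weights turn errors into penalties.
example_errors = {"position": [1.0, 2.0, 2.0], "velocity": [0.5, 0.0]}
example_weights = {"position": -1.0, "velocity": -2.0}
reward = quadratic_reward(example_errors, example_weights)  # -9.0 - 0.5 = -9.5
```

Every additional term adds another coefficient whose relative scale changes the resulting behavior, which is what makes hand-tuning difficult.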

Can we avoid the need to specify the reward function?

Our approach [Abbeel & Ng, 2004] is based on inverse reinforcement learning [Ng & Russell, 2000]. It returns a policy whose performance is as good as the expert's, as measured by the expert's unknown reward function, in a polynomial number of iterations.

Inverse RL algorithm: For t = 1, 2, …

Inverse RL step: Estimate the expert's reward function $R(s) = w^T \phi(s)$ such that under $R(s)$ the expert performs better than all previously found policies $\{\pi_i\}$.

RL step: Compute the optimal policy $\pi_t$ for the estimated reward $w$.
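The alternation between the two steps can be sketched on a toy problem. The code below is an illustration under several assumptions: a hypothetical 5-state chain MDP with one-hot features, value iteration as the RL step, and the simpler projection variant described in Abbeel & Ng (2004) swapped in for the max-margin estimate of w. It returns the last policy found, which is a simplification of how the paper selects its final policy.

```python
GAMMA = 0.9
N_STATES = 5
ACTIONS = (-1, +1)  # index 0: move left, index 1: move right
START = 2


def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(s + a, 0), N_STATES - 1)


def phi(s):
    """One-hot state features (an assumption made for this toy problem)."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def feature_expectations(policy, horizon=200):
    """mu(pi) = sum_t gamma^t * phi(s_t) along the policy's trajectory."""
    mu, s, discount = [0.0] * N_STATES, START, 1.0
    for _ in range(horizon):
        mu = [m + discount * f for m, f in zip(mu, phi(s))]
        s = step(s, ACTIONS[policy[s]])
        discount *= GAMMA
    return mu


def optimal_policy(w, sweeps=200):
    """RL step: value iteration for the reward R(s) = w . phi(s)."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [dot(w, phi(s)) + GAMMA * max(v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(range(len(ACTIONS)), key=lambda i: v[step(s, ACTIONS[i])])
            for s in range(N_STATES)]


def apprenticeship_irl(mu_expert, eps=1e-6, max_iter=20):
    policy = [0] * N_STATES                  # arbitrary start: always move left
    mu_bar = feature_expectations(policy)
    reward_w = [0.0] * N_STATES
    for _ in range(max_iter):
        # Inverse RL step (projection variant): reward direction under which
        # the expert looks better than what has been found so far.
        w = [e - m for e, m in zip(mu_expert, mu_bar)]
        if dot(w, w) ** 0.5 <= eps:
            break                            # feature expectations are matched
        reward_w = w
        policy = optimal_policy(w)           # RL step
        d = [m - b for m, b in zip(feature_expectations(policy), mu_bar)]
        dd = dot(d, d)
        if dd == 0.0:
            break                            # no progress possible
        alpha = dot(d, w) / dd               # project mu_expert onto the segment
        mu_bar = [b + alpha * x for b, x in zip(mu_bar, d)]
    return policy, reward_w


expert_policy = [1] * N_STATES               # assumed expert: always move right
learned_policy, learned_w = apprenticeship_irl(feature_expectations(expert_policy))
```

On this toy chain the recovered weights favor the right end of the chain and the learned policy moves right in every state, matching the assumed expert without the reward ever being specified by hand.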

Related work:

• Imitation learning: learn to predict the expert's actions as a function of states. Usually lacks strong performance guarantees. [E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …]
• Max margin planning, Ratliff et al., 2006.

Reinforcement Learning and Apprenticeship Learning

Data collection: aggressive exploration is dangerous

Stationary Rolls

Stationary Flips


Tail-In Funnels

Nose-In Funnels

Learning to Autonomous Flight

Morgan Quigley and Andrew Y. Ng

Experimental Results

Video available.

Conclusion

Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.

The procedure based on apprenticeship learning (inverse RL) for the reward function gives performance similar to human pilots.

Our results significantly extend the state of the art in autonomous helicopter flight: first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.


RL/Optimal Control

We use differential dynamic programming.

We penalize high-frequency controls.

We include integrated orientation errors in the cost.

(See paper for more details.)
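The two cost modifications can be illustrated with a scalar trajectory cost. This is a sketch under assumptions, not the cost used in the paper: it penalizes high-frequency controls through the squared change between consecutive inputs, and it adds the squared running integral of a single orientation error. The weight names q, q_int, r and r_delta are hypothetical, and the differential dynamic programming solver itself is not shown.

```python
def trajectory_cost(orientation_errors, controls, dt,
                    q=1.0, q_int=1.0, r=1.0, r_delta=1.0):
    """Sum of per-step costs along a trajectory.

    q * e^2            penalizes the orientation error
    q_int * (int e)^2  penalizes the integrated orientation error
    r * u^2            penalizes control magnitude
    r_delta * (du)^2   penalizes rapid control changes
    """
    total, integral, prev_u = 0.0, 0.0, controls[0]
    for e, u in zip(orientation_errors, controls):
        integral += e * dt
        total += (q * e * e + q_int * integral ** 2
                  + r * u * u + r_delta * (u - prev_u) ** 2)
        prev_u = u
    return total


# Same control magnitude, but the chattering sequence costs more.
smooth = trajectory_cost([0.0] * 4, [1.0, 1.0, 1.0, 1.0], dt=0.5)       # 4.0
chattering = trajectory_cost([0.0] * 4, [1.0, -1.0, 1.0, -1.0], dt=0.5)  # 16.0
```

The integral term keeps growing while a small steady error persists, so a controller minimizing this cost is pushed to remove constant offsets rather than tolerate them.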