Apprenticeship Learning for the Dynamics Model
Overview
Challenges in reinforcement learning for complex physical systems such as helicopters:
• Data collection: aggressive exploration is dangerous.
• It is difficult to specify the proper reward function for a given task.
We present apprenticeship learning algorithms, which use an expert demonstration and:
• Do not require explicit exploration.
• Do not require an explicit reward function specification.
Experimental results:
• Demonstrate the effectiveness of our algorithms on a highly challenging control problem.
• Significantly extend the state of the art in autonomous helicopter flight. In particular, the first completion of autonomous stationary forward flips, stationary sideways rolls, nose-in funnels and tail-in funnels.
Complex tasks: hard to specify the reward function.
STANFORD
An Application of Reinforcement Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng
• Key question: how do we fly the helicopter for data collection? How do we ensure that the entire flight envelope is covered by the data collection process?
• State of the art: the E3 algorithm, Kearns and Singh (2002), and its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.
• Can we avoid explicit exploration?
Have a good model of the dynamics?
• NO → “Explore”
• YES → “Exploit”
Expert human pilot flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa → Autonomous flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa → …
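One round of this loop can be sketched in code. Everything below is a hypothetical stand-in for the helicopter: a 2-state linear system plays the role of the true dynamics, a noisy feedback law plays the role of the human pilot, and the model P_sa is fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true dynamics s' = A s + B a + noise (unknown to the learner).
A_true = np.array([[0.98, 0.10],
                   [0.00, 0.95]])
B_true = np.array([[0.00],
                   [0.12]])

def fly(policy, s0, T):
    """Roll out one trajectory (s_1, a_1, s_2, a_2, ...) under the true dynamics."""
    S, U = [s0], []
    for _ in range(T):
        a = policy(S[-1])
        U.append(a)
        S.append(A_true @ S[-1] + B_true @ a + 0.01 * rng.standard_normal(2))
    return np.array(S), np.array(U)

def fit_dynamics(trajectories):
    """Least-squares estimate of [A B] from all observed (s, a, s') transitions."""
    X, Y = [], []
    for S, U in trajectories:
        for t in range(len(U)):
            X.append(np.concatenate([S[t], U[t]]))
            Y.append(S[t + 1])
    Theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return Theta.T[:, :2], Theta.T[:, 2:]  # (A_hat, B_hat)

# Round 1: learn the model from the expert demonstration alone; later rounds
# would append autonomous flights flown with the current model and refit.
expert = lambda s: np.array([-0.5 * s[1] + 0.3 * rng.standard_normal()])
trajectories = [fly(expert, rng.standard_normal(2), 300)]
A_hat, B_hat = fit_dynamics(trajectories)
```

The point of the sketch: the expert's flight already excites the system enough for the fit, so no explicit exploration policy is needed.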
Dynamics Model P_sa + Reward Function R → Reinforcement Learning: max_π E[R(s_0) + … + R(s_T)] → Control policy π
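On a small discrete problem this pipeline can be run end-to-end. The 3-state dynamics model P and reward R below are illustrative stand-ins, not the helicopter's; the RL box is finite-horizon dynamic programming:

```python
import numpy as np

n_s, n_a, T = 3, 2, 20
P = np.zeros((n_s, n_a, n_s))        # dynamics model: P[s, a, s']
P[0, 0] = [0.9, 0.1, 0.0]; P[0, 1] = [0.1, 0.8, 0.1]
P[1, 0] = [0.8, 0.2, 0.0]; P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.1, 0.9]; P[2, 1] = [0.0, 0.1, 0.9]
R = np.array([0.0, 0.1, 1.0])        # reward function R(s)

# RL step: backward dynamic programming for max_pi E[R(s_0) + ... + R(s_T)].
V = R.copy()                          # value at the final step T
policy = np.zeros((T, n_s), dtype=int)
for t in reversed(range(T)):
    Q = R[:, None] + P @ V            # Q[s, a] = R(s) + sum_s' P[s,a,s'] V(s')
    policy[t] = Q.argmax(axis=1)      # control policy pi_t(s)
    V = Q.max(axis=1)
```

Here the optimal policy steers toward the high-reward state 2, illustrating how the control policy falls out once both P_sa and R are in hand.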
Take away message:
In the apprenticeship learning setting, i.e., when we have an expert demonstration, we do not need explicit exploration to perform as well as the expert.
Theorem. Given a polynomial number of teacher demonstrations, the apprenticeship learning algorithm returns a policy that performs as well as the teacher within a polynomial number of iterations.
[See Abbeel & Ng, 2005 for details.]
Apprenticeship Learning for the Reward Function
The reward function can be very difficult to specify. E.g., for our helicopter control problem we have:
R(s) = c1·(position error)² + c2·(orientation error)² + c3·(velocity error)² + c4·(angular rate error)² + … + c25·(inputs)².
With 25 coefficients to choose, it is difficult to specify the proper reward function for a given task.
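In code, such a reward is just a weighted sum of squared error terms (written here as a negative cost, so larger errors mean lower reward). The five weights and term groupings are illustrative stand-ins for the paper's 25 hand-tuned coefficients:

```python
import numpy as np

# Illustrative weights; the real c1 ... c25 must be hand-tuned per task.
c = {"pos": 1.0, "orient": 0.5, "vel": 0.1, "rate": 0.1, "inputs": 0.01}

def reward(pos_err, orient_err, vel_err, rate_err, inputs):
    """Quadratic reward: zero for a perfect hover, negative otherwise."""
    return -(c["pos"]    * pos_err @ pos_err
             + c["orient"] * orient_err @ orient_err
             + c["vel"]    * vel_err @ vel_err
             + c["rate"]   * rate_err @ rate_err
             + c["inputs"] * inputs @ inputs)
```

Every extra error term adds another weight to tune, which is exactly what makes specifying R by hand so laborious.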
Can we avoid the need to specify the reward function?
Our approach [Abbeel & Ng, 2004] is based on inverse reinforcement learning [Ng & Russell, 2000]. It returns a policy that performs as well as the expert, as measured by the expert’s unknown reward function, within a polynomial number of iterations.
Inverse RL Algorithm: for t = 1, 2, …
• Inverse RL step: estimate the expert’s reward function R(s) = wᵀφ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
• RL step: compute the optimal policy π_t for the estimated reward w.
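A minimal sketch of this loop, following the projection variant of apprenticeship learning from Abbeel & Ng (2004); the 5-state chain MDP, one-hot features φ, and the always-right "expert" are hypothetical stand-ins for the helicopter setting:

```python
import numpy as np

n_states, gamma = 5, 0.9

def step(s, a):                        # deterministic chain: 0 = left, 1 = right
    return max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)

def feature_expectations(policy, s0=0, horizon=200):
    """mu(pi) = sum_t gamma^t phi(s_t), with one-hot state features phi."""
    mu, s = np.zeros(n_states), s0
    for t in range(horizon):
        mu[s] += gamma ** t
        s = step(s, policy[s])
    return mu

def rl_step(w, iters=200):
    """Value iteration for the estimated reward R(s) = w[s]; returns greedy policy."""
    V = np.zeros(n_states)
    for _ in range(iters):
        V = np.array([w[s] + gamma * max(V[step(s, 0)], V[step(s, 1)])
                      for s in range(n_states)])
    return np.array([int(V[step(s, 1)] >= V[step(s, 0)]) for s in range(n_states)])

def apprenticeship(mu_E, eps=1e-6, max_iters=50):
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    mu_bar = feature_expectations(policy)
    for _ in range(max_iters):
        w = mu_E - mu_bar                        # inverse RL step: margin direction
        if np.linalg.norm(w) < eps:
            break
        policy = rl_step(w)                      # RL step for the estimated reward
        mu = feature_expectations(policy)
        d = mu - mu_bar
        if d @ d < eps:
            break
        mu_bar = mu_bar + (d @ w) / (d @ d) * d  # project mu_bar toward mu_E
    return policy

expert = np.ones(n_states, dtype=int)            # demonstration: always go right
mu_E = feature_expectations(expert)
learned = apprenticeship(mu_E)
```

The returned policy matches the expert's feature expectations, and hence the expert's performance under any reward that is linear in the features, without the true reward ever being specified.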
Related work:
• Imitation learning: learn to predict the expert’s actions as a function of states. Usually lacks strong performance guarantees. [E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …]
• Max margin planning, Ratliff et al., 2006.
Reinforcement Learning and Apprenticeship Learning
Data collection: aggressive exploration is dangerous.
Stationary Rolls
Stationary Flips
Tail-In Funnels
Nose-In Funnels
Morgan Quigley and Andrew Y. Ng
Experimental Results
Video available.
Conclusion
Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
Our procedure based on apprenticeship learning (inverse RL) for the reward function gives performance similar to human pilots.
Our results significantly extend the state of the art in autonomous helicopter flight: the first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.
ADDITIONAL REFERENCES (SPECIFIC TO AUTONOMOUS HELICOPTER FLIGHT)
[1] J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation. IEEE, 2001.
[2] V. Gavrilets, I. Martinos, B. Mettler, and E. Feron. Control logic for automated aerobatic flight of miniature helicopter. In AIAA Guidance, Navigation and Control Conference, 2002.
[3] M. La Civita. Integrated Modeling and Robust Control for Full-Envelope Flight of Robotic Helicopters . PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2003.
[4] M. La Civita, G. Papageorgiou, W. C. Messner, and T. Kanade. Design and flight testing of a high-bandwidth H-infinity loop shaping controller for a robotic helicopter. Journal of Guidance, Control, and Dynamics, 29(2):485-494, March-April 2006.
[5] B. Mettler, M. Tischler, and T. Kanade. System identification of small-size unmanned helicopter dynamics. In American Helicopter Society, 55th Forum, 1999.
[6] Jonathan M. Roberts, Peter I. Corke, and Gregg Buskey. Low-cost flight control system for a small autonomous helicopter. In IEEE Int’l Conf. On Robotics and Automation, 2003.
[7] S. Saripalli, J. Montgomery, and G. Sukhatme. Visually-guided landing of an unmanned aerial vehicle, 2003.
RL/Optimal Control
We use differential dynamic programming.
We penalize high-frequency controls.
We include integrated orientation errors in the cost.
(See the paper for more details.)
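Both cost-shaping ideas can be illustrated with an LQR stand-in for DDP (assumptions: a toy double integrator rather than the helicopter, and LQR rather than the paper's DDP). The state is augmented with the previous control, so penalizing the control increment v_t = u_t − u_{t−1} suppresses high-frequency controls, and with an integrated position error so accumulated drift is penalized:

```python
import numpy as np

dt = 0.1
# Augmented state z = [position, velocity, previous control, integrated position error].
# New control v is the increment u_t - u_{t-1}; penalizing v damps high-frequency
# control, and weighting the integrator state penalizes accumulated drift.
Aa = np.array([[1.0, dt,  0.0, 0.0],
               [0.0, 1.0, dt,  0.0],
               [0.0, 0.0, 1.0, 0.0],
               [1.0, 0.0, 0.0, 1.0]])
Ba = np.array([[0.0], [dt], [1.0], [0.0]])

Q = np.diag([1.0, 0.1, 0.01, 0.5])   # position, velocity, control-magnitude, integral
R = np.array([[1.0]])                # weight on control increments (high frequency)

# Iterate the discrete-time Riccati recursion to a stationary feedback gain K.
P = Q.copy()
for _ in range(500):
    K = np.linalg.solve(R + Ba.T @ P @ Ba, Ba.T @ P @ Aa)
    P = Q + Aa.T @ P @ (Aa - Ba @ K)

# Closed-loop rollout from a unit position offset.
z = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(2000):
    z = Aa @ z + Ba @ (-K @ z)
```

The same two penalties carry over to DDP unchanged: they only alter the cost function, not the optimization machinery.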