Apprenticeship Learning for Robotics, with Application to Autonomous Helicopter Flight
Pieter Abbeel, Stanford University
Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
Outline
Preliminaries: reinforcement learning.
Apprenticeship learning algorithms.
Experimental results on various robotic platforms.
Reinforcement learning (RL)
[Diagram: starting from state s0, action a0 leads to state s1 (drawn from the system dynamics Psa), then a1 leads to s2, and so on until aT-1 leads to the final state sT.]
Accumulated reward: R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT)
Example reward function: R(s) = - || s – s* ||
Goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)]
Solution: a policy π, which specifies an action for each possible state, for all times t = 0, 1, …, T.
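The setup above can be sketched as a tiny finite-horizon MDP solved by backward dynamic programming. The state/action counts, random transition matrices, and horizon below are illustrative assumptions, not the helicopter model from the talk.

```python
import numpy as np

# Toy finite-horizon MDP with reward R(s) = -|s - s*| (illustrative numbers).
n_states, n_actions, T = 5, 2, 10
rng = np.random.default_rng(0)

# P[a][s, s'] = probability of landing in s' after taking action a in state s.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

s_star = 2
R = -np.abs(np.arange(n_states) - s_star).astype(float)   # R(s) = -|s - s*|

# Backward recursion: V_t(s) = R(s) + max_a sum_s' P_sa(s') V_{t+1}(s')
V = R.copy()                                  # V_T(s) = R(s)
policy = np.zeros((T, n_states), dtype=int)   # action for each state and time
for t in reversed(range(T)):
    Q = R[:, None] + np.einsum('asn,n->sa', P, V)   # Q_t(s, a)
    policy[t] = Q.argmax(axis=1)
    V = Q.max(axis=1)

print(V[0])   # optimal expected sum of rewards starting from state 0
```

The output `policy` is exactly the object defined above: a table giving an action for every state at every time step.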
Model-based reinforcement learning
Run the RL algorithm in a simulator (the dynamics model) to obtain the control policy.
Apprenticeship learning algorithms use a demonstration to help us find
a good reward function,
a good dynamics model,
a good control policy.
Reinforcement learning (RL)
[Diagram: the reward function R and the dynamics model Psa are input to the RL algorithm, which outputs a control policy π.]
maxπ E[R(s0) + … + R(sT) | π]
Apprenticeship learning: reward
[Diagram: the reward function R is the component learned from the demonstration; together with the dynamics model Psa, RL yields the control policy π via maxπ E[R(s0) + … + R(sT) | π].]
Many reward functions: complex trade-off
The reward function trades off:
Height differential of terrain.
Gradient of terrain around each foot.
Height differential between feet.
… (25 features total for our setup)
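With a reward that is linear in such features, R(s) = w·φ(s), inverse RL reduces to choosing weights w that make the learned policy's feature expectations match the expert's. Below is a minimal sketch of one projection-style update in that spirit; it uses the nearest single policy rather than the full convex combination, and the 3-dimensional feature vectors are made-up stand-ins for the 25 terrain features.

```python
import numpy as np

def projection_step(mu_expert, mu_policies):
    """One inverse-RL iteration: point the reward weights w from the closest
    policy's feature expectations toward the expert's feature expectations.
    (Simplification: nearest single policy instead of the convex hull.)"""
    dists = [np.linalg.norm(mu_expert - mu) for mu in mu_policies]
    mu_bar = mu_policies[int(np.argmin(dists))]
    w = mu_expert - mu_bar
    return w / (np.linalg.norm(w) + 1e-12), np.linalg.norm(w)

mu_E = np.array([0.8, 0.1, 0.5])    # expert feature expectations (assumed)
mus = [np.array([0.2, 0.7, 0.4]),   # feature expectations of policies found so far
       np.array([0.6, 0.3, 0.5])]
w, gap = projection_step(mu_E, mus)
print(w, gap)   # reward weights to hand to the RL solver, and the remaining gap
```

In the full algorithm this alternates with an RL step: solve for the optimal policy under w·φ, add its feature expectations to the pool, and repeat until the gap is small.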
Example result
[ICML 2004, NIPS 2008]
Reward function for aerobatics?
Compact description: reward function ~ trajectory (rather than a trade-off).
Reward: Intended trajectory
Perfect demonstrations are extremely hard to obtain.
Multiple trajectory demonstrations: every demonstration is a noisy instantiation of the intended trajectory. The noise model captures (among others):
Position drift.
Time warping.
If different demonstrations are suboptimal in different ways, they can capture the “intended” trajectory implicitly.
[Related work: Atkeson & Schaal, 1997.]
Example: airshow demos
Probabilistic graphical model for multiple demonstrations
Learning algorithm
Step 1: find the time-warping and the distributional parameters. We use EM, with dynamic time warping, to alternately optimize over the different parameters.
Step 2: find the intended trajectory.
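The time-warping subroutine inside that loop can be sketched with classic dynamic time warping. Real demonstrations are multi-dimensional state trajectories; 1-D signals keep the sketch short, and the quadratic cost is an illustrative choice.

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between two 1-D trajectories."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of: match both, advance x only, advance y only
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, np.pi, 50))
b = np.sin(np.linspace(0, np.pi, 70))   # same shape, different timing
print(dtw(a, b))    # small: the warp absorbs the timing difference
print(dtw(a, -b))   # large: the shapes genuinely differ
```

In the EM loop above, the alignment found this way is held fixed while the distributional parameters are re-estimated, and vice versa.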
After time-alignment
Apprenticeship learning for the dynamics model
[Diagram: the dynamics model Psa is the component learned from the demonstration; together with the reward function R, RL yields the control policy π via maxπ E[R(s0) + … + R(sT) | π].]
Apprenticeship learning for the dynamics model
Algorithms such as E3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems.
Our algorithm:
Initializes the model from a demonstration.
Repeatedly executes “exploitation policies” that try to maximize rewards.
Provably achieves near-optimal performance (compared to the teacher).
Machine learning theory: the sample-generating process is complicated and non-IID, so standard learning-theory bounds do not apply; the proof uses a martingale construction over relative losses.
[ICML 2005]
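The exploration-free loop above can be sketched as follows: fit a model to the demonstration data, then alternate between optimizing a policy in the model (pure exploitation) and refitting the model on the newly collected real data. All function names here are placeholders, not an API from the talk.

```python
def apprenticeship_model_learning(demo_data, fit_model, optimize_policy,
                                  run_on_real_system, n_iters=5):
    """Model learning without explicit exploration: the greedy policy itself
    generates the data needed to correct the model where it matters."""
    data = list(demo_data)                  # start from the teacher's data
    model = fit_model(data)
    for _ in range(n_iters):
        policy = optimize_policy(model)     # exploit: no exploration bonus
        trajectory = run_on_real_system(policy)
        data.extend(trajectory)             # real data from the greedy policy
        model = fit_model(data)             # refit where the policy visits
    return policy, model
```

The intuition behind the guarantee: either the exploitation policy already performs near-optimally, or it visits states where the model is wrong, and those very visits supply the data that fixes the model.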
Learning the dynamics model
Details of the algorithm for learning dynamics from data:
Exploiting structure from physics.
Lagged learning criterion.
[NIPS 2005, 2006]
Related work: Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Autonomous nose-in funnel
Accuracy
Non-stationary maneuvers
Modeling is extremely complex. Our dynamics model's state: position, orientation, velocity, angular rate.
True state: air (!), head speed, servos, deformation, etc.
Key observation: in the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.
Example: z-acceleration
Local model learning algorithm
1. Time-align the trajectories.
2. Learn locally weighted models in the vicinity of the trajectory, weighting samples by W(t') = exp(-(t - t')²/2).
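The second step can be sketched as locally weighted linear regression with a Gaussian kernel in time. The data, the bandwidth `sigma`, and the scalar dynamics are illustrative assumptions; the real models are fit to the aligned multi-dimensional helicopter data.

```python
import numpy as np

def local_model(t_query, ts, X, Y, sigma=5.0):
    """Fit y ~ [x, 1] with samples weighted by temporal distance to t_query,
    i.e. W(t') = exp(-(t - t')^2 / (2 sigma^2)); returns the local linear
    map (A, b) valid near that point of the trajectory."""
    w = np.exp(-(ts - t_query) ** 2 / (2 * sigma ** 2))
    Xa = np.hstack([X, np.ones((len(X), 1))])       # append bias column
    W = np.diag(w)
    # Weighted least squares: theta = (X' W X)^-1 X' W Y
    theta = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ Y)
    return theta[:-1], theta[-1]                    # A (slope part), b (offset)

ts = np.arange(100, dtype=float)
X = ts[:, None] / 100.0
Y = np.sin(X * 3.0)                                 # nonlinear "dynamics"
A, b = local_model(50.0, ts, X, Y)
print(A, b)   # local linear approximation of the dynamics near t = 50
```

Because the weights depend on time rather than state, the unknown variables (airflow, head speed, etc.) that take similar values near the same trajectory point are implicitly folded into each local model.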
Autonomous flips
Apprenticeship learning: RL algorithm
[Diagram: the reward function R comes from a (sloppy) demonstration or initial trial (none of the demos is exactly equal to the intended trajectory); the dynamics model Psa is a (crude) model; RL solves maxπ E[R(s0) + … + R(sT) | π] using only a small number of real-life trials.]
Algorithm Idea
Input to the algorithm: an approximate model. Start by computing the optimal policy according to the model.
[Figure: target trajectory vs. the real-life trajectory it produces.]
The policy is optimal according to the model, so no improvement is possible based on the model.
Algorithm Idea (2)
Update the model such that it becomes exact for the current policy.
Algorithm Idea (2)
The updated model perfectly predicts the state sequence obtained under the current policy.
We can use the updated model to find an improved policy.
Algorithm
1. Find the (locally) optimal policy for the model.
2. Execute the current policy and record the state trajectory.
3. Update the model such that the new model is exact for the current policy.
4. Use the new model to compute the policy gradient g and update the policy: θ := θ + α g.
5. Go back to Step 2.
Notes:
The step-size parameter α is determined by a line search.
Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
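The five steps above can be sketched as follows. An additive, time-indexed correction makes the approximate model exact along the observed trajectory (step 3), and a finite-difference gradient with a fixed step size stands in for the policy gradient / DDP improvement and the line search. All dynamics and names here are toy assumptions.

```python
import numpy as np

def improve_with_inaccurate_model(theta, model_step, real_rollout, returns,
                                  n_iters=10, alpha=0.1, eps=1e-4):
    """theta: policy parameters; model_step(s, t, th): approximate model;
    real_rollout(th): states from the real system; returns(step, th): sum of
    rewards when simulating policy th under the given step function."""
    for _ in range(n_iters):
        real_traj = real_rollout(theta)                  # step 2: real system
        def corrected_step(s, t, th):
            # step 3: bias term makes the model exact along real_traj
            bias = real_traj[t + 1] - model_step(real_traj[t], t, theta)
            return model_step(s, t, th) + bias
        # step 4: finite-difference gradient under the corrected model
        grad = np.zeros_like(theta)
        base = returns(corrected_step, theta)
        for i in range(len(theta)):
            th = theta.copy()
            th[i] += eps
            grad[i] = (returns(corrected_step, th) - base) / eps
        theta = theta + alpha * grad                     # fixed alpha, no line search
    return theta
```

Even though the model is wrong globally, the corrected model is exact for the current policy, so the gradient it yields points in a genuine improvement direction for the real system.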
Performance Guarantees
Let the local policy improvement algorithm be the policy gradient.
Notes:
These assumptions are insufficient to give the same performance guarantees for model-based RL.
The constant K depends only on the dimensionality of the state, action, and policy, the horizon H, and an upper bound on the first and second derivatives of the transition model, the policy, and the reward function.
Experimental Setup
Our expert pilot provides 5-10 demonstrations.
Our algorithm aligns the trajectories, extracts the intended trajectory as the target, and learns local models.
We repeatedly run the controller and collect model errors until satisfactory performance is obtained.
We use receding-horizon differential dynamic programming (DDP) to find the controller.
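Receding-horizon control can be sketched with a finite-horizon LQR solved by Riccati recursion as a linear-quadratic stand-in for DDP: at every step, re-solve over a short horizon around the target, apply only the first feedback gain, and repeat. The double-integrator dynamics, costs, and horizon below are illustrative assumptions.

```python
import numpy as np

def lqr_gains(A, B, Q, R, H):
    """Finite-horizon Riccati recursion; returns feedback gains K_0..K_{H-1}."""
    P, Ks = Q, []
    for _ in range(H):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        Ks.append(K)
    return Ks[::-1]

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # double integrator, dt = 0.1
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[0.1]])

s, target = np.array([1.0, 0.0]), np.zeros(2)
for _ in range(100):                        # receding-horizon loop
    K0 = lqr_gains(A, B, Q, R, H=20)[0]     # re-solve, keep only the first gain
    u = -K0 @ (s - target)
    s = A @ s + B @ u                       # "real" system step
print(s)   # driven toward the target
```

DDP plays the same role for the nonlinear helicopter: it linearizes the learned local models around the target trajectory and produces time-varying feedback gains of exactly this form.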
[Switch to Quicktime for HD airshow.]
Airshow
Airshow accuracy
Tic-toc
[Switch to Quicktime for HD chaos.]
Chaos
Conclusion
Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
Algorithmic instantiations:
Inverse reinforcement learning
Learn trade-offs in reward.
Learn “intended” trajectory.
Model learning
No explicit exploration.
Local models.
Control with crude model + small number of trials.
Current and future work
Automate more general advice taking.
Guaranteed safe exploration: safely learning to outperform the teacher.
Autonomous helicopters:
Assist in wildland fire fighting.
Auto-rotation landings.
Fixed-wing formation flight, with potential savings of 20% for even a three-aircraft formation.
Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.