Lifelong Learning for Disturbance Rejection on Mobile Robots
GRASP LABORATORY
David Isele, José Marcio Luna, Eric Eaton, Gabriel V. de la Cruz, James Irwin,
Brandon Kallaher, Matthew E. Taylor
Motivation
Problem 1: Without prior knowledge, RL in a new task is slow
Idea: Reuse knowledge from previously learned tasks
[Figure: standard "tabula rasa" initialization vs. initialization via transfer]
We focus on the lifelong learning case:
• Agent learns multiple tasks consecutively
• Want stability guarantees as the number of tasks grows large
[Figure: timeline of tasks over time, with the current task marked]
Background
Background: Policy Gradient Methods for Control
• Agent interacts with environment, taking consecutive actions
• Formalized as a Markov Decision Process (MDP)
• PG methods support continuous state and action spaces
– Have shown recent success in applications to robotic control [Kober & Peters 2011; Peters & Schaal 2008; Sutton et al. 2000]
[Figure: agent makes sequential decisions in an environment with a reward function and probabilistic transitions]
[Figure: n trajectories → Policy Gradient Learner → Policy]
Goal: find policy that minimizes [Equation: objective annotated with "probability of trajectory" and "reward function"]
Background: Finite Difference Policy Gradients
• Approximate the change in reward with sampled disturbances
• Use the pseudo-inverse to find the gradient
• Update the current policy
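The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_reward`, `finite_difference_gradient`, the perturbation scale, and the step size are all hypothetical, and the update ascends a reward (for a cost to be minimized, the sign of the step would flip).

```python
import numpy as np

def finite_difference_gradient(reward_fn, theta, rng, n_perturbations=20, scale=0.05):
    """Estimate the reward gradient at theta from random parameter disturbances."""
    # Step 1: sample small disturbances and measure the change in reward.
    delta_theta = scale * rng.standard_normal((n_perturbations, theta.size))
    baseline = reward_fn(theta)
    delta_j = np.array([reward_fn(theta + d) - baseline for d in delta_theta])
    # Step 2: least-squares solve delta_theta @ g ~= delta_j via the pseudo-inverse.
    return np.linalg.pinv(delta_theta) @ delta_j

def toy_reward(theta):
    """Hypothetical stand-in for a rollout return: peaks at theta = (1, -2)."""
    target = np.array([1.0, -2.0])
    return -float(np.sum((theta - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(200):
    grad = finite_difference_gradient(toy_reward, theta, rng)
    # Step 3: update the current policy parameters along the estimated gradient.
    theta = theta + 0.1 * grad

print(theta)  # should end up close to the toy optimum (1, -2)
```

On a real robot each call to `reward_fn` would be one or more rollouts, so the number of disturbances per update trades gradient accuracy against experiment time.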
Lifelong PG Learning
Lifelong Machine Learning
1.) Tasks are received consecutively
2.) Knowledge is transferred from previously learned tasks
3.) New knowledge is stored for future use
4.) Existing knowledge is refined
[Figure: Lifelong Learning System. A timeline (... t-3, t-2, t-1, t, t+1, t+2, t+3 ...) separates previously learned tasks from future learning tasks. Trajectories for the current task t enter the system, which draws on previously learned knowledge and outputs a learned policy.]
PG-ELLA Objective
Issue: the objective is dependent on all trajectories
[Equation: PG-ELLA objective, with one term annotated "Hessian"]
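The equations on this slide did not survive extraction, so the following is only a sketch of one way to remove the dependence on all trajectories, assuming the commonly described PG-ELLA form: each task's policy is written as a shared basis times a sparse code, and each task is summarized by its single-task policy and a Hessian instead of its raw trajectories. The names `sparse_code` and `update_basis`, the penalties `mu` and `lam`, the identity Hessians, and the random single-task policies are all assumptions made for illustration.

```python
import numpy as np

def sparse_code(alpha, gamma, basis, mu=0.01, n_iters=500):
    """ISTA for min_s (alpha - basis s)^T gamma (alpha - basis s) + mu * ||s||_1."""
    quad = basis.T @ gamma @ basis
    # Step size from the Lipschitz constant of the smooth (quadratic) part.
    step = 1.0 / max(2.0 * np.linalg.eigvalsh(quad).max(), 1e-12)
    s = np.zeros(basis.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * basis.T @ gamma @ (alpha - basis @ s)
        z = s - step * grad
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)  # soft threshold
    return s

def update_basis(alphas, gammas, codes, lam=1e-3):
    """Closed-form ridge update of the shared basis from per-task summaries only."""
    d, k, n = alphas[0].size, codes[0].size, len(alphas)
    A = lam * np.eye(d * k)
    b = np.zeros(d * k)
    for alpha, gamma, s in zip(alphas, gammas, codes):
        A += np.kron(np.outer(s, s), gamma) / n
        b += np.kron(s, gamma @ alpha) / n
    # Undo the column-major vectorization of the basis.
    return np.linalg.solve(A, b).reshape((d, k), order="F")

rng = np.random.default_rng(0)
d, k, n_tasks = 4, 2, 6
hidden_basis = rng.standard_normal((d, k))
# Placeholder single-task policies and Hessians (one summary per task).
alphas = [hidden_basis @ rng.standard_normal(k) for _ in range(n_tasks)]
gammas = [np.eye(d) for _ in range(n_tasks)]

basis = rng.standard_normal((d, k))
for _ in range(20):
    codes = [sparse_code(a, g, basis) for a, g in zip(alphas, gammas)]
    basis = update_basis(alphas, gammas, codes)

thetas = [basis @ s for s in codes]  # transferred policies, one per task
```

Only `alphas`, `gammas`, and `codes` are kept per task here, which is what makes the update cost independent of how many trajectories each task generated.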
Verification on Robots
Experiments
Results for Robot Go-to-Goal Task
• Run RL on a new robot (goal and disturbance) for a small number of iterations
• Use PG-ELLA to adjust policy according to known solutions
• Continue training
PG-ELLA improves learning
Better Results Incorporating Prior
• Initializing with the average policy of the other robots increases the benefit
PG-ELLA improves learning
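The effect of initializing from the average policy can be illustrated with a toy example. It does not reproduce the robot experiments: the policy parameters, the new robot's optimum, the quadratic `toy_reward`, and the `train` helper are all hypothetical, and exact gradient steps stand in for policy gradient updates.

```python
import numpy as np

# Hypothetical policy parameters already learned on three other robots.
known_policies = np.array([[1.0, -2.0], [1.2, -1.8], [0.9, -2.1]])

# Hypothetical optimum for the new robot; unknown in practice, used only for scoring.
new_robot_optimum = np.array([1.1, -1.9])

def toy_reward(theta):
    """Stand-in for a rollout return: higher is better, peak at the optimum."""
    return -float(np.sum((theta - new_robot_optimum) ** 2))

def train(theta, n_steps=5, step_size=0.1):
    """A few exact gradient-ascent steps on the toy reward (stand-in for PG)."""
    for _ in range(n_steps):
        theta = theta + step_size * (-2.0 * (theta - new_robot_optimum))
    return theta

# Same small training budget, two different starting points.
tabula_rasa = train(np.zeros(2))
from_prior = train(known_policies.mean(axis=0))

print(toy_reward(tabula_rasa), toy_reward(from_prior))
```

With the same small training budget, the run started from the averaged policy ends closer to the toy optimum, because the other robots' solutions already lie near it. That only holds when the tasks are related, which is the premise of the transfer setting.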
Thank you!
Questions?
This research was supported by ONR N00014-11-1-0139, AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.