CSC2621: Imitation Learning for Robotics
Florian Shkurti
Week 10: Inverse Reinforcement Learning (Part II)
Today’s agenda
• Guided Cost Learning by Finn, Levine, Abbeel
• Inverse KKT by Englert, Vien, Toussaint
• Bayesian Inverse RL by Ramachandran and Amir
• Max Margin Planning by Ratliff, Zinkevich, and Bagnell
Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Assumptions:
- Trajectories (state and action sequences) are discrete
- Known and deterministic dynamics
- Linear reward function over hand-engineered features

Objective: maximize the log-likelihood of the observed dataset D of trajectories.

Serious problem: we need to compute the partition function Z(theta) every time we compute the gradient.
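A minimal numerical sketch of why Z(theta) is the bottleneck, assuming a linear reward r(tau) = theta . f(tau) and a trajectory set small enough to enumerate (the enumeration is exactly what becomes intractable at scale; the toy feature vectors are made up):

```python
import numpy as np

def maxent_grad(theta, demo_feats, all_traj_feats):
    """Gradient of the MaxEnt IRL log-likelihood for a linear reward
    r(tau) = theta . f(tau). The partition function is computed by
    brute-force enumeration of all trajectories -- the 'serious
    problem' the slides refer to."""
    demo_feats = np.asarray(demo_feats, float)   # (n_demos, d)
    F = np.asarray(all_traj_feats, float)        # (n_traj, d)
    logits = F @ theta
    w = np.exp(logits - logits.max())            # stable exponentials
    p = w / w.sum()                              # p(tau) = exp(theta.f)/Z
    expected_f = p @ F                           # E_p[f(tau)]
    # grad log L = sum_demos f(tau_d) - n_demos * E_p[f(tau)]
    return demo_feats.sum(axis=0) - len(demo_feats) * expected_f

# Toy example: two possible trajectories, one demonstration.
g = maxent_grad(np.zeros(2), [[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

At theta = 0 the model distribution is uniform, so the gradient is the demo feature count minus the uniform expectation.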
Guided Cost Learning [Finn, Levine, Abbeel 2016]

- Nonlinear reward function with learned features
- True dynamics are stochastic and unknown

Objective: maximize the log-likelihood of the observed dataset D of trajectories.

The serious problem remains: the partition function must still be estimated.
Approximating the gradient of the log-likelihood

Nonlinear reward function, learned features.

How do you approximate this expectation (over trajectories under the current reward)?

Idea #1: sample trajectories from the model distribution. Problem: we can’t, because we don’t know the dynamics.

Idea #2: importance sampling: sample from an easier distribution q that approximates the model distribution (see Relative Entropy Inverse RL by Boularias, Kober, Peters).
Importance Sampling

How to estimate properties/statistics of one distribution (p) given samples from another distribution (q).

Weights = likelihood ratios p(x)/q(x), i.e. how to reweigh samples of q to obtain statistics of p.
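A small sketch of the basic estimator, assuming Python with NumPy; the particular Gaussians p = N(1, 1) and q = N(0, 2) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling_mean(f, log_p, sample_q, log_q, n=100_000):
    """Estimate E_p[f(x)] from samples of q, reweighted by the
    likelihood ratios w(x) = p(x)/q(x)."""
    x = sample_q(n)
    w = np.exp(log_p(x) - log_q(x))   # importance weights
    return np.mean(w * f(x))

# p = N(1, 1), q = N(0, 2); estimate E_p[x] = 1 using samples of q.
log_p = lambda x: -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)
sample_q = lambda n: rng.normal(0.0, 2.0, size=n)
est = importance_sampling_mean(lambda x: x, log_p, sample_q, log_q)
```

Note that q is deliberately wider than p here; the pitfalls on the next slides are about what happens when it is not.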
Importance Sampling: Pitfalls and Drawbacks

What can go wrong?

Problem #1: if q(x) = 0 but f(x)p(x) > 0 for x in a set of nonzero measure, then the estimator is biased.

Problem #2: the weights measure the mismatch between q(x) and p(x). If the mismatch is large, then some weights will dominate; if x lives in high dimensions, a single weight may dominate.

Problem #3: the variance of the estimator is high if the mismatch between q and f·p is large.

For more info see:
#1, #3: Monte Carlo theory, methods, and examples, Art Owen, chapter 9
#2: Bayesian Reasoning and Machine Learning, David Barber, chapter 27.6 on importance sampling
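Problem #2 can be demonstrated numerically with the effective sample size (ESS) of the weights, a standard diagnostic; the Gaussians and dimensions below are made up to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def ess(log_w):
    """Effective sample size of importance weights:
    ESS = (sum w)^2 / sum w^2. Close to n: good overlap between p
    and q; close to 1: a single weight dominates."""
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (w ** 2).sum()

# p = N(0.5, I_d), q = N(0, I_d): the same mild per-dimension
# mismatch, but the weights degenerate as the dimension d grows.
n = 10_000
ess_by_dim = {}
for d in (2, 50):
    x = rng.normal(size=(n, d))   # samples from q
    log_w = -0.5 * ((x - 0.5) ** 2).sum(axis=1) + 0.5 * (x ** 2).sum(axis=1)
    ess_by_dim[d] = ess(log_w)
```

With d = 2 most of the 10,000 samples contribute; with d = 50 almost none do, even though the per-dimension mismatch is unchanged.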
Importance Sampling

What is the best approximating distribution q? For estimating E_p[f(x)], the variance-minimizing proposal is q(x) ∝ |f(x)| p(x).

How does this connect back to partition function estimation? The cost function estimate changes at each gradient step, so the best approximating distribution should change as well.
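The connection concretely: the partition function Z = ∫ exp(-c(x)) dx can be written as E_q[exp(-c(x)) / q(x)] for any proposal q that covers the support. A sketch, assuming a one-dimensional quadratic cost so that the true Z is known in closed form (the cost and proposal are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_log_Z(cost, sample_q, log_q, n=200_000):
    """Importance-sampling estimate of log Z, Z = integral exp(-c(x)) dx,
    via Z = E_q[ exp(-c(x)) / q(x) ], computed stably in log space."""
    x = sample_q(n)
    log_terms = -cost(x) - log_q(x)
    m = log_terms.max()
    return m + np.log(np.mean(np.exp(log_terms - m)))  # log-mean-exp

# c(x) = x^2 / 2, so Z = sqrt(2*pi). Proposal q = N(0, 1.5^2).
cost = lambda x: 0.5 * x ** 2
sig = 1.5
log_q = lambda x: -0.5 * (x / sig) ** 2 - np.log(sig) - 0.5 * np.log(2 * np.pi)
logZ = estimate_log_Z(cost, lambda n: rng.normal(0.0, sig, n), log_q)
```

As the cost c changes during learning, the good proposal q for this estimate changes with it, which is what motivates adapting q.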
Approximating the gradient of the log-likelihood (continued)

Idea #2 again: importance sampling from an easier distribution q that approximates the model distribution.

Fixed q: previous papers used a fixed sampling distribution (see Relative Entropy Inverse RL by Boularias, Kober, Peters).

Adaptive q: Guided Cost Learning (Finn, Levine, Abbeel) adapts the sampling distribution as the cost changes (adaptive importance sampling).
Guided Cost Learning: the punchline

How do you select q? How do you adapt it as the cost c changes?

Given a fixed cost function c, the distribution of trajectories that Guided Policy Search computes is close to p(tau) ∝ exp(−c(tau)), i.e. it is good for importance sampling of the partition function Z.
Recall: Finite-Horizon LQR

for n = 1…N    // n is the number of steps left
    Optimal control for time t = N − n is linear in the state, with a quadratic cost-to-go,
    where the states are predicted forward in time according to the linear dynamics.
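The backward recursion above can be sketched as follows (a standard Riccati recursion, assuming dynamics x' = Ax + Bu and stage cost xᵀQx + uᵀRu; the double-integrator matrices in the usage are made up):

```python
import numpy as np

def lqr_backward(A, B, Q, R, N):
    """Finite-horizon LQR backward recursion for dynamics x' = A x + B u
    and stage cost x^T Q x + u^T R u. Returns gains K[t] such that the
    optimal control at time t is u_t = -K[t] @ x_t."""
    P = Q.copy()                        # cost-to-go matrix, 0 steps left
    Ks = [None] * N
    for n in range(1, N + 1):           # n = number of steps left
        t = N - n
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        Ks[t] = K
        P = Q + A.T @ P @ (A - B @ K)   # Riccati update: n steps left
    return Ks

# Usage: stabilize a double integrator from x0 = [5, 0].
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Ks = lqr_backward(A, B, np.eye(2), np.eye(1), 50)
x = np.array([5.0, 0.0])
for t in range(50):
    x = A @ x + B @ (-Ks[t] @ x)        # closed-loop rollout
```

The closed-loop rollout drives the state toward the origin, as expected for a stabilizable pair (A, B).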
Recall: LQG = LQR with stochastic dynamics

Linear Quadratic Gaussian (LQG): assume linear dynamics with additive zero-mean Gaussian noise.

Then the form of the optimal policy is the same as in LQR. No need to change the algorithm, as long as you observe (an estimate of) the state at each step (closed-loop policy).
Deterministic Nonlinear Cost & Deterministic Nonlinear Dynamics
Arbitrary differentiable functions c, f.
iLQR: iteratively approximate the solution by solving linearized versions of the problem via LQR.

Deterministic Nonlinear Cost & Stochastic Nonlinear Dynamics
Arbitrary differentiable functions c, f.
iLQG: iteratively approximate the solution by solving linearized versions of the problem via LQG.
Recall from Guided Policy Search

Loop:
- Learn linear Gaussian dynamics from the collected trajectories
- Compute a linear Gaussian controller under those dynamics
- Run the controller on the robot and collect trajectories

Given a fixed cost function c, the linear Gaussian controllers that GPS computes induce a distribution of trajectories close to p(tau) ∝ exp(−c(tau)), i.e. good for importance sampling of the partition function Z.
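The "learn linear Gaussian dynamics" step is, at its core, a regularized least-squares fit of x' ≈ Ax + Bu + c per time step. A minimal sketch of the fit (assuming plain least squares; GPS itself uses priors across time steps, which are omitted here, and the true dynamics in the usage are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_gaussian_dynamics(X, U, Xn):
    """Fit x' ~ N(A x + B u + c, Sigma) by least squares from
    transitions (X[i], U[i]) -> Xn[i]. Returns (A, B, c, Sigma)."""
    Z = np.hstack([X, U, np.ones((len(X), 1))])    # regressors [x, u, 1]
    W, *_ = np.linalg.lstsq(Z, Xn, rcond=None)     # (dx+du+1, dx)
    dx, du = X.shape[1], U.shape[1]
    A, B, c = W[:dx].T, W[dx:dx + du].T, W[-1]
    resid = Xn - Z @ W
    Sigma = resid.T @ resid / max(len(X) - 1, 1)   # noise covariance
    return A, B, c, Sigma

# Usage: recover made-up true dynamics from noisy transitions.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
X = rng.normal(size=(2000, 2))
U = rng.normal(size=(2000, 1))
Xn = X @ A_true.T + U @ B_true.T + 0.3 + 0.01 * rng.normal(size=(2000, 2))
A, B, c, Sigma = fit_linear_gaussian_dynamics(X, U, Xn)
```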
Guided Cost Learning [rough sketch]

1. Collect demonstration trajectories D
2. Initialize cost parameters
3. Repeat:
   - Do forward optimization using Guided Policy Search for the current cost function, and compute the linear Gaussian distribution of trajectories q
   - Importance sample trajectories from q to estimate the partition function, and take a gradient step on the cost parameters
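The alternation above can be illustrated on a deliberately tiny problem: treat a "trajectory" as a single scalar, use the quadratic cost c_theta(tau) = (tau − theta)²/2, and stand in for the GPS forward optimizer with a Gaussian q near exp(−c)/Z. Everything in this toy (cost family, demos, step size) is made up to show the loop structure, not the real method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Guided-Cost-Learning-style loop with 1-D "trajectories".
# Here p(tau) ~ exp(-c_theta(tau)) is N(theta, 1), so theta should
# converge toward the mean of the demonstrations.
demos = rng.normal(2.0, 1.0, size=200)    # expert trajectories
theta = 0.0                               # cost parameter
for _ in range(100):
    # "Forward optimization": a Gaussian q close to exp(-c_theta)/Z
    # (deliberately imperfect, as the GPS controller would be).
    q_mean, q_std = theta, 1.2
    taus = rng.normal(q_mean, q_std, size=500)
    # Self-normalized importance weights w ~ exp(-c)/q.
    log_w = (-0.5 * (taus - theta) ** 2
             + 0.5 * ((taus - q_mean) / q_std) ** 2 + np.log(q_std))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # grad log L = -sum_d dc/dtheta(tau_d) + n * E_p[dc/dtheta]
    dc_demo = -(demos - theta)            # dc/dtheta at the demos
    dc_samp = -(taus - theta)             # dc/dtheta at the samples
    grad = -dc_demo.sum() + len(demos) * (w @ dc_samp)
    theta += 0.001 * grad                 # gradient ascent step
```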
Regularization of learned cost functions
Today’s agenda
• Guided Cost Learning by Finn, Levine, Abbeel
• Inverse KKT by Englert, Vien, Toussaint
• Bayesian Inverse RL by Ramachandran and Amir
• Max Margin Planning by Ratliff, Zinkevich, and Bagnell
Setting up trajectory optimization problems [e.g. for manipulation]

- Non-learned features of the current state, e.g. distance to object
- Features and their weights are time-dependent
- Inequality constraints such as “stay away from an obstacle” or “acceleration should be bounded”
- Equality constraints such as “always touch the door handle”
Solving constrained optimization problems

Write down the Lagrangian function for this problem: the cost plus the constraints weighted by Lagrange multipliers.

KKT conditions for trajectory optimization

One of the necessary conditions for optimal motion is stationarity: the gradient of the Lagrangian with respect to the trajectory vanishes at the optimum.

Inverse KKT conditions: optimality of cost

What are the conditions on the feature weights to ensure optimality of the demonstrated motion? Minimize the violation of the stationarity condition at the demonstrations, subject to a normalization constraint on the weights (to rule out the trivial all-zero solution).

This is a quadratic program. Efficient solvers exist (CPLEX, CVXGEN, Gurobi).
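A stripped-down sketch of the inverse-KKT idea: given feature gradients and active-constraint gradients at a demonstration, find weights minimizing the squared stationarity residual under a sum-to-one normalization. This solves only the equality-constrained core as a linear KKT system (nonnegativity constraints on weights and multipliers, which the full method enforces with a QP solver, are omitted), and the toy gradients in the usage are made up:

```python
import numpy as np

def inverse_kkt_weights(Jf, Jg, lam_reg=1e-6):
    """Find feature weights w and multipliers mu minimizing the
    stationarity residual || Jf^T w + Jg^T mu ||^2 subject to
    sum(w) = 1. Jf: (n_features, n_x) feature gradients at the demo;
    Jg: (n_constraints, n_x) active-constraint gradients.
    Nonnegativity of w and mu is NOT enforced in this sketch."""
    nf, ng = Jf.shape[0], Jg.shape[0]
    A = np.vstack([Jf, Jg]).T              # residual = A @ [w; mu]
    H = A.T @ A + lam_reg * np.eye(nf + ng)
    e = np.concatenate([np.ones(nf), np.zeros(ng)])
    # KKT system of the equality-constrained QP:
    #   [H  e] [z ]   [0]
    #   [e' 0] [nu] = [1]
    M = np.block([[H, e[:, None]], [e[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([np.zeros(nf + ng), [1.0]])
    sol = np.linalg.solve(M, rhs)
    return sol[:nf], sol[nf:nf + ng]

# Toy: two features with gradients e1, e2 and one constraint with
# gradient (-1, -1); stationarity forces w1 = w2 = mu.
w, mu = inverse_kkt_weights(np.eye(2), np.array([[-1.0, -1.0]]))
```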
Features

Feature weights over time
Today’s agenda
• Guided Cost Learning by Finn, Levine, Abbeel
• Inverse KKT by Englert, Vien, Toussaint
• Bayesian Inverse RL by Ramachandran and Amir
• Max Margin Planning by Ratliff, Zinkevich, and Bagnell
Bayesian updates of deterministic rewards

Given a demonstration trajectory and reward parameters, how does our belief in the reward change after a demonstration?

In this paper the likelihood of a demonstrated state-action pair is assumed to be proportional to the exponentiated optimal Q-value under the candidate reward.
MCMC sampling of the posterior

- The next candidate reward vector is picked randomly from a neighborhood of the current one
- If the optimal policy has changed, then do policy iteration starting from the old policy
- The paper has results on mixing times for the MCMC walk
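A random-walk Metropolis-Hastings sketch of the sampler on a toy one-state MDP, where Q(a; R) = R[a] and no policy iteration is needed (the real PolicyWalk algorithm re-solves the MDP when the optimal policy changes; the MDP, alpha, and demos below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state, two-action MDP: Q(a; R) = R[a]. Likelihood of a
# demonstrated action is softmax(alpha * Q); prior is uniform on [0,1]^2.
demos = [0, 0, 0, 1]            # the expert mostly picks action 0
alpha = 5.0

def log_posterior(R):
    if np.any(R < 0.0) or np.any(R > 1.0):
        return -np.inf          # outside the uniform prior's support
    logits = alpha * R
    log_probs = logits - np.log(np.exp(logits).sum())
    return sum(log_probs[a] for a in demos)

R = np.array([0.5, 0.5])
samples = []
for _ in range(5000):
    R_new = R + rng.normal(0.0, 0.1, size=2)     # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(R_new) - log_posterior(R):
        R = R_new                                 # accept
    samples.append(R.copy())
post_mean = np.mean(samples[1000:], axis=0)      # discard burn-in
```

Since the expert mostly picks action 0, the posterior mean should favor a higher reward for action 0.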
Interesting result
Today’s agenda
• Guided Cost Learning by Finn, Levine, Abbeel
• Inverse KKT by Englert, Vien, Toussaint
• Bayesian Inverse RL by Ramachandran and Amir
• Max Margin Planning by Ratliff, Zinkevich, and Bagnell
![Page 64: CSC2621 Imitation Learning for Roboticsflorian/courses/imitation_learning/lectures/... · CSC2621 Imitation Learning for Robotics Florian Shkurti Week 10: Inverse Reinforcement Learning](https://reader034.vdocuments.mx/reader034/viewer/2022042113/5e8eb2f9597ef6335a03b0ec/html5/thumbnails/64.jpg)
Max Margin Planning [Ratliff, Zinkevitch, and Bagnell]
Assumptions:
- Linear rewards with respect to handcrafted features
- Discrete states and actions
Main idea: (reward weights should be such that) demonstrated trajectories collect more reward than any other trajectory, by a large margin
How can we formulate this mathematically?
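One way to write the idea down, as a sketch in standard max-margin notation (the margin $m$ and the trajectory feature counts $f(\xi)$ are placeholders here; the following slides refine both):

```latex
w^{\top} f(\xi_i) \;\ge\; w^{\top} f(\xi) + m
\qquad \text{for every trajectory } \xi \neq \xi_i,\ m > 0,
```

where $f(\xi) = \sum_t f(s_t, a_t)$ are cumulative feature counts along $\xi$ and $w$ the reward weights. Note there is one constraint per alternative trajectory — exponentially many — which is exactly what the rest of the construction has to tame.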
Detour: solving MDPs via linear programming
Two key objects: the discounted state–action counts (the occupancy measure) of a policy, and the optimal policy, which can be read off from them.
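In standard notation (the equations themselves appear only in the slide images), these two objects are:

```latex
\mu^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,\mathbf{1}\{s_t = s,\ a_t = a\}\Big],
\qquad
\pi(a \mid s) \;=\; \frac{\mu^{\pi}(s,a)}{\sum_{a'} \mu^{\pi}(s,a')}.
```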
Primal LP
Dual LP
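The standard primal/dual pair, reconstructed in the usual notation (start distribution $\alpha$, transitions $P$, reward $r$):

```latex
\text{Primal:}\quad
\min_{v}\ \sum_{s} \alpha(s)\, v(s)
\quad \text{s.t.}\quad
v(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \quad \forall s,a.

\text{Dual:}\quad
\max_{\mu \ge 0}\ \sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.}\quad
\sum_{a} \mu(s,a) \;=\; \alpha(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, \mu(s',a') \quad \forall s.
```

The dual variables $\mu$ are exactly the occupancy measures: the equality constraints are flow-conservation conditions.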
Is searching over visitation frequencies the same as searching over policies?
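Yes, for valid occupancy measures: any $\mu$ satisfying the dual flow constraints is the occupancy of the stochastic policy $\pi(a \mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')$, so the two searches are equivalent. A small numpy check illustrates the round trip policy → occupancy → policy; the MDP sizes and numbers below are invented for illustration.

```python
import numpy as np

# Toy MDP (invented): S states, A actions, T[a, s, s'] transitions,
# uniform start distribution alpha, discount gamma.
rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
T = rng.dirichlet(np.ones(S), size=(A, S))   # each T[a, s] is a distribution over s'
alpha = np.full(S, 1.0 / S)

def occupancy(pi):
    # Discounted state-action occupancy of a stochastic policy pi[s, a].
    P_pi = np.einsum('sa,asz->sz', pi, T)                    # state transitions under pi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)   # discounted state occupancy
    return d[:, None] * pi                                   # mu[s, a] = d(s) * pi(a|s)

# Round trip: policy -> occupancy measure -> policy.
pi = rng.dirichlet(np.ones(A), size=S)       # random stochastic policy
mu = occupancy(pi)
pi_recovered = mu / mu.sum(axis=1, keepdims=True)
print(np.allclose(pi, pi_recovered))         # True
```

The recovered policy matches exactly because the state occupancy $d(s)$ is strictly positive whenever the start distribution is, so the normalization never divides by zero.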
Feature frequencies — defined both for the i-th demonstrated trajectory and for any trajectory or state–action occupancy.
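In matrix form (a reconstruction; the slide's equations are in the image), with $F_i$ the matrix whose $(s,a)$-th column is the feature vector $f(s,a)$:

```latex
f_i \;=\; F_i\, \mu_i
\qquad \text{(expected feature counts of the $i$-th demonstration)},
```

and $F_i \mu$ gives the expected feature counts of any occupancy measure $\mu$.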
Impose a large margin that depends on state–action pairs
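With a per-demonstration loss vector $l_i$ — for instance $l_i(s,a) = 1$ for state–action pairs the demonstration never visits and $0$ otherwise (an illustrative choice, not necessarily the paper's) — the margin constraint becomes structured:

```latex
w^{\top} F_i\, \mu_i \;\ge\; w^{\top} F_i\, \mu + l_i^{\top} \mu
\qquad \forall \mu \in \mathcal{G}_i,
```

where $\mathcal{G}_i$ is the set of valid occupancy measures. Trajectories that differ more from the demonstration must be beaten by a larger margin.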
Don’t allow too much slack
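Adding slack variables $\zeta_i$ and a regularizer gives the following program (a simplified sketch; the paper's version also carries per-example weights and an exponent on the slack terms):

```latex
\min_{w,\,\zeta \ge 0}\ \ \tfrac{1}{2}\lVert w \rVert^{2} + \frac{C}{n} \sum_{i=1}^{n} \zeta_i
\quad \text{s.t.}\quad
w^{\top} F_i\, \mu_i + \zeta_i \;\ge\; \max_{\mu \in \mathcal{G}_i}\big( w^{\top} F_i\, \mu + l_i^{\top} \mu \big)
\quad \forall i.
```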
Is this a proper formulation of a quadratic program? NO — the margin constraint contains an embedded maximization over all valid occupancy measures, i.e. exponentially many implicit constraints, so it cannot be handed to a QP solver as written.
But the inner maximization is a linear program — a linear objective over linear (flow) constraints — so we can replace it with its dual via LP duality.
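A sketch of the substitution, with notation reconstructed from the surrounding slides ($s_i$ is the start-state distribution for the $i$-th example): the inner maximization over $\mathcal{G}_i$ equals its dual minimum,

```latex
\max_{\mu \in \mathcal{G}_i} \big(F_i^{\top} w + l_i\big)^{\top} \mu
\;=\;
\min_{v_i}\ s_i^{\top} v_i
\quad \text{s.t.}\quad
v_i(s) \;\ge\; w^{\top} f_i(s,a) + l_i(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_i(s')
\quad \forall s,a,
```

so the margin constraint becomes $w^{\top} F_i \mu_i + \zeta_i \ge s_i^{\top} v_i$, and the whole problem collapses into a single quadratic program over $(w, \zeta, v)$.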
Is this sufficient to make v_i the optimal value function? Not by itself: the Bellman-style inequalities only force v_i to upper-bound the optimal (loss-augmented) values. But because the slack is minimized, the optimizer drives the start-state value of v_i down until the bound is tight, so at the optimum v_i does recover the optimal value function.
Results