![Page 1: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/1.jpg)
Introduction to Imitation Learning
Matt Barnes
*Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le
TAs: Matthew Rockett, Gilwoo Lee, Matt Schmittle
![Page 2: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/2.jpg)
Recap
• Markov Decision Processes are a very general class of models, which encompass planning and reinforcement learning.
[Figure: MDP diagram with states x_t, actions u_t, and costs c_t]
![Page 3: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/3.jpg)
Recap
“Markov” means that _____ captures all information about the history x1, x2, …, xt
The most recent state xt
![Page 4: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/4.jpg)
Recap
• The difference between planning and reinforcement learning is whether the __________ are known
transition model / dynamics / environment
![Page 5: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/5.jpg)
Recap
• The three general methods for reinforcement learning are…
(1) Model-based
(2) Approximate dynamic programming
(3) Policy gradient
• (2) and (3) are both _______ methods
Model-free
![Page 6: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/6.jpg)
Recap
The multi-armed bandit problem is reinforcement learning with __ state(s)
1
![Page 7: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/7.jpg)
Recap
What is the fundamental trade-off in bandit (and reinforcement learning) problems?
Exploration vs. exploitation
![Page 8: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/8.jpg)
Recap
Reward is equivalent to ________
Negative cost
![Page 9: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/9.jpg)
Recap
The ε-greedy algorithm randomly explores with probability
ε
![Page 10: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/10.jpg)
Recap
The UCB algorithm chooses actions according to the estimated reward, plus a bonus (i.e. confidence interval) which decreases with respect to _____
The number of times we try that action.
![Page 11: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/11.jpg)
Today’s lecture
![Page 12: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/12.jpg)
Today’s lecture
• Robots do not operate in a vacuum. They do not need to learn everything from scratch.
• Humans need to easily interact with robots and share our expertise with them.
• Robots need to learn from the behavior and experience of others, not just their own.
*Based off Florian Shkurti’s lectures
![Page 13: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/13.jpg)
Today’s lecture
• Part 1: How can robots easily understand our objectives from demonstrations?
Inverse reinforcement learning (IRL)
• Part 2: How can robots incorporate others’ decisions into their own?
Behavior cloning (BC), interactive imitation learning (IL)
![Page 14: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/14.jpg)
Today’s lecture

| Method | Policy Learning (output) | Reward Learning (output) | Access to Environment (input) | Interactive Demonstrator (input) | Pre-collected Demonstrations (input) |
|---|---|---|---|---|---|
| Part 1: Inverse RL | No | Yes | Yes | No | Yes |
| Part 2: Behavior Cloning | Yes (direct) | No | No | No | Yes |
| Part 2: Interactive IL | Yes (direct) | No | Yes | Yes | (Optional) |
| Part 2: GAIL | Yes (indirect) | No | Yes | No | Yes |

*Partially based off slides from Yisong Yue and Hoang M. Le
![Page 15: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/15.jpg)
Part 1: Inverse Reinforcement Learning (IRL)
• Setting: No reward function. For complex tasks, these can be hard to specify!
• Fortunately, we are given a set of $m$ demonstration trajectories, drawn from the state-action distribution of policy $\pi^*$:
$$D = \{\tau_1, \ldots, \tau_m\} = \{(x^i_0, u^i_0, x^i_1, u^i_1, \ldots)\} \sim \rho_{\pi^*}$$
• Goal: Learn a reward function $r^*$ such that
$$\pi^* = \arg\max_\pi \mathbb{E}_\pi [r^*(x, u)]$$
![Page 16: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/16.jpg)
The high-level recipe
• Step 1: Learn a policy for the current reward function r
• Step 2: Update the reward function
Repeat until policy trajectories are similar to demonstrations
[Figure: loop over three stages: Run RL, Update reward, Compare]
![Page 17: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/17.jpg)
Learning a reward function is an underdefined problem
• Many reward functions correspond to the same policy
![Page 18: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/18.jpg)
Learning a reward function is an underdefined problem
• Let $r^*$ be one solution, i.e.
$$\pi^* = \arg\max_\pi \mathbb{E}_\pi [r^*(x, u)]$$
• Note that $a r^*$ for any constant $a > 0$ is also a solution, since
$$\arg\max_\pi \mathbb{E}_\pi [a r^*(x, u)] = \arg\max_\pi \mathbb{E}_\pi [r^*(x, u)]$$
• In fact, $r^* = 0$ is always a solution, since every policy is optimal under a zero reward
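The scaling argument rests on linearity of expectation. Spelled out as a short derivation (the restriction to a positive constant $a$ is made explicit here, since a negative $a$ would turn the maximization into a minimization):

$$\arg\max_\pi \mathbb{E}_\pi [a\, r^*(x, u)] = \arg\max_\pi \; a\, \mathbb{E}_\pi [r^*(x, u)] = \arg\max_\pi \mathbb{E}_\pi [r^*(x, u)], \qquad a > 0$$

For $a = 0$ the objective is identically zero, so every policy attains the maximum.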
![Page 19: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/19.jpg)
Many different IRL approaches
• The reward function is linear [Abbeel & Ng 2004]
• Maximize the trajectory entropy, subject to a feature matching constraint [Ziebart et al., 2008]
• Maximum Margin Planning [Ratliff et al., 2006]
![Page 20: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/20.jpg)
The linear reward function assumption reduces to feature expectation matching
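• Assume $r(x) = \theta \cdot \phi(x)$, where $\phi(x)$ are the features of state $x$
• Then the value of a policy is linear in the expected features:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{H} \theta \cdot \phi(x_t)\right] = \theta \cdot \mathbb{E}\left[\sum_{t=0}^{H} \phi(x_t)\right] = \theta \cdot \mu(\pi)$$
where $\mu(\pi)$ denotes the expected features of the policy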
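As a numerical illustration of this identity, the sketch below estimates both sides from sampled trajectories. The feature map `features`, the random "trajectories", and the weights `theta` are all made up for illustration; they are not from the lecture.

```python
import numpy as np

def features(x):
    # Hypothetical feature map phi(x) for a scalar state x.
    return np.array([1.0, x, x ** 2])

def feature_expectation(trajectories):
    # Monte Carlo estimate of mu(pi) = E[sum_t phi(x_t)].
    per_traj = [sum(features(x) for x in traj) for traj in trajectories]
    return np.mean(per_traj, axis=0)

def value(trajectories, theta):
    # Monte Carlo estimate of J(pi) = E[sum_t theta . phi(x_t)].
    per_traj = [sum(theta @ features(x) for x in traj) for traj in trajectories]
    return float(np.mean(per_traj))

rng = np.random.default_rng(0)
trajectories = rng.normal(size=(50, 11))  # 50 stand-in rollouts, t = 0..H with H = 10
theta = np.array([0.5, -1.0, 2.0])        # made-up reward weights

mu = feature_expectation(trajectories)
print(value(trajectories, theta), theta @ mu)  # the two estimates agree
```

Because the sum and the dot product commute, the value only depends on the policy through `mu`, which is what makes feature matching a sensible objective.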
![Page 21: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/21.jpg)
New objective, matching the feature expectations
• Step 1: Learn a policy $\pi$ for the current reward function and compute its feature expectation $\mu(\pi)$
• Step 2: Update the reward function
$$\max_{\|\theta\| \le 1} \theta^T (\mu(\pi) - \mu(\pi^*))$$
Repeat until feature expectations are close, i.e. until $\|\mu(\pi) - \mu(\pi^*)\|$ is small
[Figure: loop over three stages: Run RL, Update reward, Compare]
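The two expressions on this slide are linked: over the unit ball, the largest value of $\theta^T d$ for a fixed difference vector $d$ is the Euclidean norm $\|d\|$, attained at $\theta = d / \|d\|$ (Cauchy-Schwarz). A small check of that fact, using made-up feature expectations rather than anything from the lecture:

```python
import numpy as np

# Made-up feature expectations for the learner and the expert.
mu_pi = np.array([1.0, 4.0, 2.0])
mu_expert = np.array([1.0, 1.0, -2.0])

d = mu_pi - mu_expert                # difference in feature expectations
theta_best = d / np.linalg.norm(d)   # maximizer of theta . d over ||theta|| <= 1
gap = float(theta_best @ d)          # equals ||d||

# No other unit-norm theta does better.
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
print(gap, float(np.max(samples @ d)))
```

So driving the maximized objective toward zero is the same as driving the feature expectations together.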
![Page 22: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/22.jpg)
[Kitani et al., 2012]
Using IRL to predict pedestrian intention
![Page 23: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/23.jpg)
Part 2: Directly learning a policy
![Page 24: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/24.jpg)
Behavior cloning
• Observe pre-collected expert demonstrations, drawn from the state-action distribution of policy $\pi^*$:
$$D = \{\tau_1, \ldots, \tau_m\} = \{(x^i_0, u^i_0, x^i_1, u^i_1, \ldots)\} \sim \rho_{\pi^*}$$
• Learn a policy
$$\hat{\pi} = \arg\min_\pi \sum_{i=0}^{n} \ell(u_i, \pi(x_i))$$
where $\ell$ is some loss function
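A minimal sketch of behavior cloning as plain supervised regression, assuming a linear policy class and a squared loss. The expert gain `K_expert` and the randomly sampled states are hypothetical stand-ins for real demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)
K_expert = np.array([[-1.0, 0.5]])   # hypothetical expert: u = K_expert x

# Pre-collected demonstrations: states x_i and the expert's actions u_i.
X = rng.normal(size=(200, 2))
U = X @ K_expert.T

# Minimize sum_i ||u_i - K x_i||^2 over linear policies via least squares.
W, *_ = np.linalg.lstsq(X, U, rcond=None)
K_cloned = W.T

def cloned_policy(x):
    # The learned policy: a linear map from state to action.
    return K_cloned @ x

print(K_cloned)
```

Nothing in this fit looks at the states the cloned policy itself would visit, which is the gap the following slides address.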
![Page 25: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/25.jpg)
Behavior cloning suffers from compounding errors
• The good news: $\hat{\pi}$ will perform well on samples from $\rho_{\pi^*}$
• The really bad news: When we roll out policy $\hat{\pi}$, it will inevitably make some mistakes compared to $\pi^*$, and these errors could compound, resulting in drastically different state-action distributions $\rho_{\pi^*}$ and $\rho_{\hat{\pi}}$
![Page 26: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/26.jpg)
![Page 27: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/27.jpg)
Behavior cloning suffers from compounding errors
Theorem (simplified) [Ross et al., 2011]. Let ε be the supervised learning error rate of $\hat{\pi}$. Then the cumulative cost of this policy is bounded by:
$$J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon$$
Errors compound with respect to the time horizon T
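To get a feel for the quadratic growth, here is a toy calculation under a deliberately pessimistic assumption that is not part of the theorem's proof: the policy errs with probability ε at each step while on track, and after its first error it never recovers, paying a cost of 1 for every remaining step. The function names are illustrative.

```python
def compounding_cost(T, eps):
    # Expected number of off-track steps when a single mistake is never
    # recovered from: the policy is off track at step t with probability
    # 1 - (1 - eps)^t.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))

def independent_cost(T, eps):
    # Expected number of mistakes when each step errs independently
    # and mistakes do not affect later steps.
    return T * eps

eps = 1e-4
for T in (100, 200, 400):
    print(T, compounding_cost(T, eps), independent_cost(T, eps), T ** 2 * eps)
```

In this model, doubling the horizon roughly quadruples the compounding cost while only doubling the independent one, and the compounding cost stays below $T^2 \epsilon$.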
![Page 28: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/28.jpg)
A history of covariate shift in imitation learning
Navlab 1 (1986-1989) and Navlab 2 + ALVINN
30 x 32 pixels, 3-layer network, outputs steering command from approximately 5 minutes of training data per road type [Pomerleau 1992]
![Page 29: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/29.jpg)
A history of covariate shift in imitation learning
[Pomerleau 1992]
![Page 30: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/30.jpg)
Imitation Learning is not supervised learning
• Policy’s actions affect future observations/data
• This is not the case in supervised learning
Imitation Learning
• Train/test data are not i.i.d.
• If expected hold-out error is ε, then expected test error after T decisions is up to $T^2 \epsilon$
• Errors compound

Supervised Learning
• Train/test data are i.i.d.
• If expected hold-out error is ε, then expected test error after T decisions is order $T \epsilon$
• Errors are independent

*From Florian Shkurti
![Page 31: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/31.jpg)
Interactive imitation learning
![Page 32: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/32.jpg)
Interactive feedback
[Figure: expert feedback $\pi^*(x_t)$]
• Roll out any policy, and the expert provides feedback for the current state
• In today’s lecture, we’ll consider the simple setting where this feedback takes the form of the expert’s action $\pi^*(x_t)$
![Page 33: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/33.jpg)
The high-level recipe
• Step 1: Roll out the current policy, collect expert feedback on the states it visits
• Step 2: Update the dataset and retrain the policy
Repeat
![Page 34: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/34.jpg)
The Forward Training algorithm
Initialize T policies, $\pi_1, \ldots, \pi_T$
• Step 1: Roll out the policies
$$u_1 \sim \pi_1(x_1), \; u_2 \sim \pi_2(x_2), \; \ldots$$
and collect expert feedback
$$u^*_1 = \pi^*(x_1), \; u^*_2 = \pi^*(x_2), \; \ldots$$
• Step 2: Update policies
Intuitively, each policy will have to learn to correct for the mistakes of earlier policies
![Page 35: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/35.jpg)
The Forward Training algorithm
• Theorem (simplified) [Ross et al., 2011]. Let ε be the supervised learning error rate of $\hat{\pi}$. Then the cumulative cost of this policy is bounded by:
$$J(\hat{\pi}) \le J(\pi^*) + u T \epsilon$$
where $u$ is a constant
Errors increase linearly with respect to the time horizon T, same as supervised learning!
The downside: We have to learn T separate policies
![Page 36: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/36.jpg)
DAgger: Dataset Aggregation
Aggregate the data
• Learn only a single policy
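The loop of rolling out, querying the expert, aggregating, and retraining a single policy can be sketched as below. This is a simplified toy version, not the full algorithm from Ross et al.: the scalar dynamics, the linear expert `expert_action`, and the one-parameter policy are all assumptions chosen to keep the example short.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: steer the scalar state toward zero.
    return -0.5 * x

def rollout(gain, horizon, rng):
    # Roll out the current linear policy u = gain * x on toy dynamics
    # x_{t+1} = x_t + u_t + noise, and return the states it visits.
    x = rng.normal()
    states = []
    for _ in range(horizon):
        states.append(x)
        x = x + gain * x + 0.1 * rng.normal()
    return np.array(states)

def fit_policy(states, actions):
    # Least-squares fit of the scalar gain k minimizing sum (u - k x)^2.
    return float(states @ actions / (states @ states))

def dagger(n_iters=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    data_x, data_u = np.empty(0), np.empty(0)
    gain = 0.0  # initial policy: do nothing
    for _ in range(n_iters):
        xs = rollout(gain, horizon, rng)        # Step 1: roll out current policy
        us = expert_action(xs)                  # expert labels the visited states
        data_x = np.concatenate([data_x, xs])   # Step 2: aggregate the data
        data_u = np.concatenate([data_u, us])
        gain = fit_policy(data_x, data_u)       # retrain the single policy
    return gain

print(dagger())
```

The key design point is that the training states come from the learner's own rollouts, while the labels always come from the expert.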
![Page 37: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/37.jpg)
DAgger: Dataset Aggregation
[Ross et al., 2011]
![Page 38: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/38.jpg)
![Page 39: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/39.jpg)
GAIL: Generative Adversarial Imitation Learning
![Page 40: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/40.jpg)
GAIL: Generative Adversarial Imitation Learning
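• Setting: Pre-collected expert demonstrations
$$D = \{\tau_1, \ldots, \tau_m\} = \{(x^i_0, u^i_0, x^i_1, u^i_1, \ldots)\} \sim \rho_{\pi^*}$$
• Goal: Minimize the divergence between $\rho_{\pi^*}$ and $\rho_{\hat{\pi}}$
$$\hat{\pi} = \arg\min_\pi D(\rho_\pi \,\|\, \rho_{\pi^*})$$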
![Page 41: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/41.jpg)
What’s a divergence?
• Tells us how far apart two distributions are
• Given samples from $\rho_{\pi^*}$ and $\rho_{\hat{\pi}}$, we can estimate the divergence (e.g. using [Nguyen et al., 2008])
[Figure: two distributions P(x) plotted over x]
![Page 42: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/42.jpg)
GAIL: Generative Adversarial Imitation Learning
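• The GAIL objective
$$\min_\pi \max_f \; \mathbb{E}_{x,u \sim \rho_{\pi^*}} \log(f(x, u)) + \mathbb{E}_{x,u \sim \rho_\pi} \log(1 - f(x, u))$$
Here $f$ is the “discriminator”. The inner maximization over $f$ is the divergence estimator, and the outer minimization over $\pi$ is an RL problem.
• Step 1: For the current π, maximize over the discriminator $f$ to estimate the divergence
• Step 2: Update the policy using reinforcement learning.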
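To see why the inner maximization acts as a divergence estimator, consider two small discrete distributions standing in for $\rho_{\pi^*}$ and $\rho_\pi$ (the numbers are made up). For this objective the best discriminator is $f = p / (p + q)$, and the resulting value equals twice the Jensen-Shannon divergence minus $\log 4$, the standard result from the GAN literature.

```python
import numpy as np

def discriminator_objective(p, q, f):
    # E_{p}[log f] + E_{q}[log(1 - f)] for discrete distributions p and q.
    return float(np.sum(p * np.log(f)) + np.sum(q * np.log(1.0 - f)))

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    # Jensen-Shannon divergence.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])   # stand-in for the expert distribution
q = np.array([0.2, 0.3, 0.5])   # stand-in for the learner distribution

f_star = p / (p + q)            # optimal discriminator for this objective
best = discriminator_objective(p, q, f_star)
uninformed = discriminator_objective(p, q, np.full(3, 0.5))

print(best, 2.0 * js(p, q) - np.log(4.0), uninformed)
```

When the two distributions coincide, the optimal discriminator is 0.5 everywhere and the value drops to $-\log 4$, which is what the policy update in Step 2 pushes toward.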
![Page 43: Introduction to Imitation Learning · 2019. 7. 24. · Introduction to Imitation Learning Matt Barnes *Some content borrowed from Florian Shkurti, Yisong Yue and Hoang M. Le TAs:](https://reader033.vdocuments.mx/reader033/viewer/2022060516/60494e3df7b91c6dea47b325/html5/thumbnails/43.jpg)
Takeaways
• Expert demonstrations, and in particular expert feedback, can dramatically speed up policy training.
• Behavior Cloning suffers from compounding errors. This is not true for Forward Training or DAgger, which use interactive expert feedback.
• Inverse reinforcement learning allows us to learn an expert’s reward function.