Imitation Learning for Autonomous Driving in TORCS

Imitation Learning for Autonomous Driving in TORCS: Final Report. Yasunori Kudo, Mitsuru Kusumoto, Yasuhiro Fujita (SP Team)

Upload: preferred-infrastructure-preferred-networks

Posted on 06-Jan-2017


TRANSCRIPT

Page 1: Imitation Learning for Autonomous Driving in TORCS

Imitation Learning for Autonomous Driving in TORCS

Final Report

Yasunori Kudo, Mitsuru Kusumoto, Yasuhiro Fujita

SP Team

Page 2: Imitation Learning for Autonomous Driving in TORCS

Imitation Learning

Imitation Learning is an approach to the sequential prediction problem, where expert demonstrations of good behavior are used to learn a controller.

In standard reinforcement learning, agents need to explore the environment many times to obtain a good policy. However, sample efficiency is crucial in real environments, and expert demonstrations can help with this issue.

Examples:
• Legged locomotion [Ratliff 2006]
• Outdoor navigation [Silver 2008]
• Car driving [Pomerleau 1989]
• Helicopter flight [Abbeel 2007]

Where we’ll go

Page 3: Imitation Learning for Autonomous Driving in TORCS

DAgger: Dataset Aggregation

(Excerpted figures and text from Section 3.6, "Dataset Aggregation: Iterative Interactive Learning Approach"; see the reference below.)

Figure 3.5: Depiction of the DAGGER procedure for imitation learning in a driving scenario. (Flow: execute the current policy and query the expert for steering, aggregate the new data with all previous data, and train a new policy by supervised learning.)

Figure 3.6: Diagram of the DAGGER algorithm with a general online learner for imitation learning. (Flow: collect data at test execution, query the expert, feed the aggregated data to a no-regret online learner, e.g. gradient descent, to obtain a new learned policy π̂_i; repeat until done and return the best policy.)

"… policies, with relatively few data points, may make many more mistakes and visit states that are irrelevant as the policy improves. We will typically use β_1 = 1 so that we do not have to specify an initial policy π̂_1 before getting data from the expert's behavior. Then we could choose β_i = p^{i−1} to have a probability of using the expert that decays exponentially as in SMILe and SEARN. The only requirement is that {β_i} be a sequence such that β̄_N = (1/N) Σ_{i=1}^N β_i → 0 as N → ∞."

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Page 4: Imitation Learning for Autonomous Driving in TORCS

DAgger: Dataset Aggregation

Algorithm 3.6.1 (DAGGER):
  Initialize D ← ∅.
  Initialize π̂_1 to any policy in Π.
  for i = 1 to N do
    Let π_i = β_i π* + (1 − β_i) π̂_i.
    Sample T-step trajectories using π_i.
    Get dataset D_i = {(s, π*(s))} of states visited by π_i and actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train classifier π̂_{i+1} on D (or use the online learner to get π̂_{i+1} given the new data D_i).
  end for
  Return best π̂_i on validation.

The simple, parameter-free version of the algorithm is the special case β_i = I(i = 1), for I the indicator function, which often performs best in practice. The general DAGGER algorithm is detailed in Algorithm 3.6.1 above.

Analysis

This analysis of the DAGGER procedure shows how the strong no-regret property of online learning procedures can be leveraged in this interactive learning procedure to obtain good performance guarantees. As with previously analyzed methods, it seeks to answer the following question: if we can find policies that mimic the expert well on the aggregate dataset collected during training, how well will the learned policy perform the task?

The theoretical analysis of DAGGER relies primarily on viewing the iterative learning in this algorithm as an online learning problem and on the no-regret property of the underlying Follow-The-Leader algorithm on strongly convex losses (Kakade and Tewari, 2009), which picks the sequence of policies π̂_{1:N}. Hence, the presented results also hold more generally for any other no-regret online learning algorithm applied to this imitation learning setting. In particular, the results can be seen as a reduction of imitation learning to no-regret online learning, where mini-batches of trajectories under a single policy are treated as a single online-learning example. In addition, the data aggregation procedure works generally whenever the supervised learner applied to the aggregate dataset has sufficient stability properties.

DAgger: Dataset Aggregation

Iteration 1:
• Collect new trajectories with π̂_1
• New dataset D_1' = {(s, π*(s))}, with steering labels from the expert
• Aggregate datasets: D_1 = D_0 ∪ D_1'
• Train π̂_2 on D_1

Iteration 2:
• Collect new trajectories with π̂_2
• New dataset D_2' = {(s, π*(s))}, with steering labels from the expert
• Aggregate datasets: D_2 = D_1 ∪ D_2'
• Train π̂_3 on D_2

Because the learned (predicted) policy is rolled out and the expert is only queried for labels, DAgger avoids collecting only states induced by the expert policy. (A minimal code sketch of this loop follows.)
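As a rough illustration of the loop above, here is a minimal Python sketch of DAgger with the parameter-free choice β_i = I(i = 1): the expert is executed only on the first iteration, but every visited state is labeled with the expert's action. The environment is assumed to follow the pre-0.26 OpenAI Gym interface, and `expert_policy` / `train_classifier` are placeholders rather than code from this project.

```python
import numpy as np

def dagger(env, expert_policy, train_classifier,
           n_iterations=10, episodes_per_iter=5, horizon=1000):
    """Minimal DAgger loop (cf. Algorithm 3.6.1) with beta_i = I(i == 1)."""
    states, labels = [], []          # aggregate dataset D
    policy = None                    # pi_hat_1 is never executed (beta_1 = 1)

    for i in range(n_iterations):
        for _ in range(episodes_per_iter):
            s = env.reset()
            for _ in range(horizon):
                # Roll out the expert on iteration 1, the learned policy afterwards.
                a = expert_policy(s) if policy is None else policy(s)
                # Always record the expert's label for the visited state.
                states.append(s)
                labels.append(expert_policy(s))
                s, _, done, _ = env.step(a)
                if done:
                    break
        # Supervised learning on the aggregate dataset D_1 ∪ ... ∪ D_i.
        policy = train_classifier(np.array(states), np.array(labels))
    return policy
```

Returning the policy that does best on a validation set, as in Algorithm 3.6.1, is omitted for brevity.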

Page 5: Imitation Learning for Autonomous Driving in TORCS

Experiments

• Pendulum and Pong in OpenAI Gym
• We compared the performance of DAgger with a standard RL algorithm (REINFORCE).

Toy Problem: Pendulum Swing-up
• Classical RL benchmark task (nonlinear control)
• Action: torque
• State: (θ, θ̇)
• Reward: cos θ
(From "Reinforcement Learning in Continuous Time and Space", Kenji Doya, 2000)

Pong
• State: 80×80 binary image
• Reward: win +1, lose -1
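For reference, a short sketch of how the pendulum task above can be set up in OpenAI Gym (2017-era API assumed). Gym's observation is already (cos θ, sin θ, θ̇); using cos θ as the reward, as on the slide, is shown here as a custom shaping on top of the environment's own reward.

```python
import gym
import numpy as np

env = gym.make("Pendulum-v0")
obs = env.reset()                      # obs = (cos(theta), sin(theta), theta_dot)
for _ in range(200):
    action = np.array([2.0])           # torque, a continuous action in [-2, 2]
    obs, reward, done, info = env.step(action)
    theta = np.arctan2(obs[1], obs[0])
    shaped_reward = np.cos(theta)      # reward = cos(theta), as in the slide
```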

Page 6: Imitation Learning for Autonomous Driving in TORCS

Experiments - REINFORCE

Reward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

• Estimate the gradient (see the sketch below):
∇_θ J(θ) = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} ( Σ_{t'=t}^{T} γ^{t'−t} r_{n,t'} − b ) ∇_θ log π(a_{n,t} | s_{n,t}; θ)

• Update the model parameters (α: learning rate):
θ_{k+1} = θ_k + α ∇_θ J(θ)

Notation: θ: model parameters; N: number of episodes; T: number of steps; γ: decay of reward; r: reward; b: baseline; π: policy; a: action; s: state.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
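A minimal Python sketch of the gradient estimate and update above. `grad_log_prob(policy_params, s, a)` stands in for ∇_θ log π(a|s; θ) and is a placeholder, not code from the project; a constant baseline b is assumed.

```python
import numpy as np

def reinforce_gradient(episodes, policy_params, grad_log_prob, gamma=0.99, b=0.0):
    """Monte-Carlo REINFORCE gradient:
    (1/N) * sum_n sum_t (G_{n,t} - b) * grad log pi(a_{n,t} | s_{n,t}; theta),
    where G_{n,t} is the discounted return-to-go and `episodes` is a list of
    trajectories [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    total = 0.0
    for trajectory in episodes:
        rewards = [r for _, _, r in trajectory]
        returns, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
            returns[t] = running
        for t, (s, a, _) in enumerate(trajectory):
            total = total + (returns[t] - b) * grad_log_prob(policy_params, s, a)
    return total / len(episodes)

def reinforce_update(theta, grad, alpha=1e-3):
    """Gradient ascent step: theta_{k+1} = theta_k + alpha * grad J(theta)."""
    return theta + alpha * grad
```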

Page 7: Imitation Learning for Autonomous Driving in TORCS

Experiments ‒ Multi Agent

[Diagram: three agents (http://192.168.0.1/8080, http://192.168.0.2/8080, http://192.168.0.3/8080), each collecting experience in its own environment and computing gradients; the gradients update shared model parameters, which are sent back to each agent.]

With 3 agents, training speed is about 3 times faster than with a single agent.
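A sketch of one synchronous update in the multi-agent setup shown in the diagram. The slides do not say whether the real system updates synchronously or asynchronously over HTTP; here a thread pool stands in for the three remote agents, and `collect_and_compute_gradient` is a placeholder for the per-agent rollout plus REINFORCE gradient computation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def distributed_update(theta, workers, collect_and_compute_gradient, alpha=1e-3):
    """Every worker (an agent with its own environment) gathers experience and
    computes a gradient in parallel; the averaged gradient updates the shared
    model parameters, which are then broadcast back to the workers."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        grads = list(pool.map(lambda w: collect_and_compute_gradient(w, theta),
                              workers))
    return theta + alpha * np.mean(grads, axis=0)
```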

Page 8: Imitation Learning for Autonomous Driving in TORCS

Results - Pendulum

[Plots: learning curves for REINFORCE and DAgger.]

Policy network: 3-layer perceptron with layer sizes 3-200-2
Input: (cos θ, sin θ, θ̇)
Output: one of two torque actions

Fewer episodes until convergence!
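The slides do not name a deep-learning framework, so here is a plain NumPy sketch of a perceptron with the layer sizes read off the slide (3 inputs, one hidden layer of 200 units, 2 output actions); the tanh nonlinearity and the initialisation scale are assumptions.

```python
import numpy as np

def init_policy(sizes=(3, 200, 2), seed=0):
    """Perceptron matching the pendulum policy above: input (cos, sin, d_theta),
    one hidden layer of 200 units, two output actions (two torque values)."""
    rng = np.random.RandomState(seed)
    return [(rng.randn(n_in, n_out) * 0.01, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def action_probabilities(params, obs):
    """Forward pass: tanh hidden layer, softmax over the two actions."""
    h = obs
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    e = np.exp(h - np.max(h))
    return e / e.sum()

# Example: probabilities of the two torque actions for one observation.
obs = np.array([np.cos(0.3), np.sin(0.3), 0.0])
probs = action_probabilities(init_policy(), obs)
```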

Page 9: Imitation Learning for Autonomous Driving in TORCS

Results - Pong

[Plots: learning curves for REINFORCE and DAgger.]

Policy network: 3-layer perceptron with layer sizes 6400-200-2
Input: 6400-dimensional (80×80) vector, computed as the frame difference S_{t+1} − S_t
Output: Up or Down
Validation accuracy: 97.04%

Fewer episodes until convergence!
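A small sketch of the input construction described above: the policy sees the difference of two consecutive 80×80 binary frames, flattened into a 6400-dimensional vector (the same perceptron shape as the pendulum sketch, with a 6400-unit input layer, can then score Up vs. Down).

```python
import numpy as np

def pong_input(prev_frame, frame):
    """Flattened frame difference S_{t+1} - S_t: a pair of 80x80 binary images
    turned into a 6400-dimensional float vector."""
    assert frame.shape == (80, 80) and prev_frame.shape == (80, 80)
    return (frame.astype(np.float32) - prev_frame.astype(np.float32)).ravel()
```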

Page 10: Imitation Learning for Autonomous Driving in TORCS

Application to TORCS

• Car driving simulator game
• Try to improve Yoshida-san's projects
• Train the policy only from the vision sensor

7 training tracks: track4, track7, track18, …
3 test tracks: track8, track12, track16

Imitation Learning (expert: hand-crafted AI)
Reinforcement Learning

Result: reasonable behavior

Network input: two consecutive frames x_{t−1}, x_t (each 3×64×64)

Discrete actions:
• Steering wheel: {-1, 0, 1}
• Whether to brake: {0, 1}

or

Continuous actions:
• Steering wheel: -1 to 1
• Accel: 0 to 1

Transfer Learning
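A sketch of a vision-based policy for the TORCS setup above, with either the discrete or the continuous action head. The slides only give the input shape (two consecutive 3×64×64 frames) and the action spaces; the convolutional trunk, layer sizes, and the use of PyTorch are illustrative assumptions, not the architecture used in the project.

```python
import torch
import torch.nn as nn

class TorcsPolicy(nn.Module):
    """Vision-based policy: two RGB frames x_{t-1}, x_t stacked to a 6x64x64 input."""

    def __init__(self, discrete=True):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),   # 6x64x64 -> 32x15x15
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x6x6
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 256), nn.ReLU(),
        )
        self.discrete = discrete
        if discrete:
            self.steer = nn.Linear(256, 3)   # logits over steering in {-1, 0, 1}
            self.brake = nn.Linear(256, 2)   # logits over brake in {0, 1}
        else:
            self.steer = nn.Linear(256, 1)   # tanh -> steering in [-1, 1]
            self.accel = nn.Linear(256, 1)   # sigmoid -> accel in [0, 1]

    def forward(self, frames):               # frames: (batch, 6, 64, 64)
        h = self.trunk(frames)
        if self.discrete:
            return self.steer(h), self.brake(h)
        return torch.tanh(self.steer(h)), torch.sigmoid(self.accel(h))
```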

Page 11: Imitation Learning for Autonomous Driving in TORCS

Results ‒ DAgger in TORCS (discrete actions vs. continuous actions)

• DAgger works well in different environments (no overfitting!).
• The agent cannot surpass the performance of the expert: most places where the agent fails are places where the expert fails.
• The expert cannot reach the goal on all test tracks.
• An agent with continuous actions gradually becomes worse...

[Plot annotation: expert can reach.]

Page 12: Imitation Learning for Autonomous Driving in TORCS

Experiments ‒ Transfer Learning

• Experiment 1 (single-play): RL for faster and safer driving
  Environments: tracks 0 and 16
  Rewards (see the code sketch below):
  - Out of the track ⇒ -1
  - Every 400 steps (track 0) or 200 steps (track 8) ⇒ mean speed (≈ 0 to 2.2)

• Experiment 2 (self-play): RL for racing battle
  Environment: track 0
  Rewards:
  - Out of the track ⇒ -1
  - Overtaken by the opponent ⇒ -1
  - Overtake the opponent ⇒ mean speed (≈ 0 to 2.2)
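A sketch of the reward rules listed above. The exact speed normalisation (the ≈0 to 2.2 range) and the event detection are not specified in the slides; the function signatures here are illustrative.

```python
def single_play_reward(off_track, step, mean_speed, reward_interval=400):
    """Experiment 1 (single-play): -1 when the car leaves the track, and the
    mean speed (roughly 0 to 2.2) every `reward_interval` steps
    (400 on track 0, 200 on track 8)."""
    if off_track:
        return -1.0
    if step > 0 and step % reward_interval == 0:
        return float(mean_speed)
    return 0.0

def self_play_reward(off_track, overtaken, overtook, mean_speed):
    """Experiment 2 (self-play racing battle): -1 for leaving the track or
    being overtaken, and the mean speed for overtaking the opponent."""
    if off_track or overtaken:
        return -1.0
    if overtook:
        return float(mean_speed)
    return 0.0
```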

Page 13: Imitation Learning for Autonomous Driving in TORCS

Results - Experiment 1 (single-play)

[Plots: Track 0 (goal: 400 steps) and Track 16 (goal: 1600 steps); expert performance and moving average shown.]

• Transfer learning works well with the REINFORCE algorithm.
• Better driving than the expert in terms of both speed and safety.
• A well-trained agent seems to control its speed by steering alone (no braking).

Page 14: Imitation Learning for Autonomous Driving in TORCS

Results - Experiment 2 (self-play)

[Plots: moving average of returns for agent and opponent in three settings: vs. expert (opponent = expert, RL not to be overtaken), self-play 1 (RL to overtake), self-play 2 (RL not to be overtaken).]

Page 15: Imitation Learning for Autonomous Driving in TORCS

Conclusion and Future Works

Conclusion
• DAgger works well in various environments such as TORCS.
• DAgger is very effective as pre-training before RL.

Future Works
• Does Imitation Learning as pre-training cause the agent to get stuck in local minima?
• Could multi-task learning (e.g. simultaneously predicting the existence of another car to the left or right) help to train autonomous driving?

Page 16: Imitation Learning for Autonomous Driving in TORCS

Appendix

• Comparison of baselines: with baselines vs. without baselines
• Comparison of pre-training by DAgger: with pre-training vs. without pre-training