Imitation Learning for Autonomous Driving in TORCS

Imitation Learning for Autonomous Driving in TORCS: Final Report. Yasunori Kudo, Mitsuru Kusumoto, Yasuhiro Fujita (SP Team)

Upload: preferred-infrastructure-preferred-networks

Posted on 06-Jan-2017


TRANSCRIPT

Page 1: Imitation Learning for Autonomous Driving in TORCS

Imitation Learning for Autonomous Driving in TORCS

Final Report

Yasunori Kudo, Mitsuru Kusumoto, Yasuhiro Fujita

SP Team

Page 2: Imitation Learning for Autonomous Driving in TORCS

Imitation Learning

Imitation Learning is an approach to the sequential prediction problem, where expert demonstrations of good behavior are used to learn a controller.

In standard reinforcement learning, agents need to explore the environment many times to obtain a good policy. However, sample efficiency is crucial in real environments, and expert demonstrations can help with this issue.

Examples:
• Legged locomotion [Ratliff 2006]
• Outdoor navigation [Silver 2008]
• Car driving [Pomerleau 1989]
• Helicopter flight [Abbeel 2007]

Where we’ll go

Page 3: Imitation Learning for Autonomous Driving in TORCS

DAgger: Dataset Aggregation

(Excerpted figures and text from Section 3.6, "Dataset Aggregation: Iterative Interactive Learning Approach"; see the reference below.)

Figure 3.5: Depiction of the DAGGER procedure for imitation learning in a driving scenario. (Flow: execute the current policy and query the expert for steering, aggregate the new data with all previous data, and train a new policy by supervised learning.)

Figure 3.6: Diagram of the DAGGER algorithm with a general online learner for imitation learning. (Flow: collect data at test execution, query the expert, feed the aggregated data to a no-regret online learner, e.g. gradient descent, to obtain a new learned policy π̂_i; repeat until done and return the best policy.)

"… policies, with relatively few data points, may make many more mistakes and visit states that are irrelevant as the policy improves. We will typically use β_1 = 1 so that we do not have to specify an initial policy π̂_1 before getting data from the expert's behavior. Then we could choose β_i = p^{i−1} to have a probability of using the expert that decays exponentially as in SMILe and SEARN. The only requirement is that {β_i} be a sequence such that β̄_N = (1/N) Σ_{i=1}^N β_i → 0 as N → ∞."

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Page 4: Imitation Learning for Autonomous Driving in TORCS

DAgger: Dataset Aggregation

Algorithm 3.6.1 (DAGGER):
  Initialize D ← ∅.
  Initialize π̂_1 to any policy in Π.
  for i = 1 to N do
    Let π_i = β_i π* + (1 − β_i) π̂_i.
    Sample T-step trajectories using π_i.
    Get dataset D_i = {(s, π*(s))} of states visited by π_i and actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train classifier π̂_{i+1} on D (or use the online learner to get π̂_{i+1} given the new data D_i).
  end for
  Return best π̂_i on validation.

The simple, parameter-free version of the algorithm is the special case β_i = I(i = 1), for I the indicator function, which often performs best in practice. The general DAGGER algorithm is detailed in Algorithm 3.6.1 above.

Analysis

This analysis of the DAGGER procedure shows how the strong no-regret property of online learning procedures can be leveraged in this interactive learning procedure to obtain good performance guarantees. As with previously analyzed methods, it seeks to answer the following question: if we can find policies that mimic the expert well on the aggregate dataset collected during training, how well will the learned policy perform the task?

The theoretical analysis of DAGGER relies primarily on viewing the iterative learning in this algorithm as an online learning problem and on the no-regret property of the underlying Follow-The-Leader algorithm on strongly convex losses (Kakade and Tewari, 2009), which picks the sequence of policies π̂_{1:N}. Hence, the presented results also hold more generally for any other no-regret online learning algorithm applied to this imitation learning setting. In particular, the results can be seen as a reduction of imitation learning to no-regret online learning, where mini-batches of trajectories under a single policy are treated as a single online-learning example. In addition, the data aggregation procedure works generally whenever the supervised learner applied to the aggregate dataset has sufficient stability properties.

DAgger: Dataset Aggregation

Iteration 1:
• Collect new trajectories with π̂_1
• New dataset D_1' = {(s, π*(s))}, with steering labels from the expert
• Aggregate datasets: D_1 = D_0 ∪ D_1'
• Train π̂_2 on D_1

Iteration 2:
• Collect new trajectories with π̂_2
• New dataset D_2' = {(s, π*(s))}, with steering labels from the expert
• Aggregate datasets: D_2 = D_1 ∪ D_2'
• Train π̂_3 on D_2

Because the learned (predicted) policy is rolled out and the expert is only queried for labels, DAgger avoids collecting only states induced by the expert policy. (A minimal code sketch of this loop follows.)
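As a rough illustration of the loop above, here is a minimal Python sketch of DAgger with the parameter-free choice β_i = I(i = 1): the expert is executed only on the first iteration, but every visited state is labeled with the expert's action. The environment is assumed to follow the pre-0.26 OpenAI Gym interface, and `expert_policy` / `train_classifier` are placeholders rather than code from this project.

```python
import numpy as np

def dagger(env, expert_policy, train_classifier,
           n_iterations=10, episodes_per_iter=5, horizon=1000):
    """Minimal DAgger loop (cf. Algorithm 3.6.1) with beta_i = I(i == 1)."""
    states, labels = [], []          # aggregate dataset D
    policy = None                    # pi_hat_1 is never executed (beta_1 = 1)

    for i in range(n_iterations):
        for _ in range(episodes_per_iter):
            s = env.reset()
            for _ in range(horizon):
                # Roll out the expert on iteration 1, the learned policy afterwards.
                a = expert_policy(s) if policy is None else policy(s)
                # Always record the expert's label for the visited state.
                states.append(s)
                labels.append(expert_policy(s))
                s, _, done, _ = env.step(a)
                if done:
                    break
        # Supervised learning on the aggregate dataset D_1 ∪ ... ∪ D_i.
        policy = train_classifier(np.array(states), np.array(labels))
    return policy
```

Returning the policy that does best on a validation set, as in Algorithm 3.6.1, is omitted for brevity.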

Page 5: Imitation Learning for Autonomous Driving in TORCS

Experiments

• Pendulum and Pong in OpenAI Gym
• We compared the performance of DAgger with a standard RL algorithm (REINFORCE).

Toy Problem: Pendulum Swing-up
• Classical RL benchmark task (nonlinear control)
• Action: torque
• State: (θ, θ̇)
• Reward: cos θ
(From "Reinforcement Learning in Continuous Time and Space", Kenji Doya, 2000)

Pong
• State: 80×80 binary image
• Reward: win +1, lose -1
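For reference, a short sketch of how the pendulum task above can be set up in OpenAI Gym (2017-era API assumed). Gym's observation is already (cos θ, sin θ, θ̇); using cos θ as the reward, as on the slide, is shown here as a custom shaping on top of the environment's own reward.

```python
import gym
import numpy as np

env = gym.make("Pendulum-v0")
obs = env.reset()                      # obs = (cos(theta), sin(theta), theta_dot)
for _ in range(200):
    action = np.array([2.0])           # torque, a continuous action in [-2, 2]
    obs, reward, done, info = env.step(action)
    theta = np.arctan2(obs[1], obs[0])
    shaped_reward = np.cos(theta)      # reward = cos(theta), as in the slide
```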

Page 6: Imitation Learning for Autonomous Driving in TORCS

Experiments - REINFORCE

Reward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

• Estimate the gradient (see the sketch below):
∇_θ J(θ) = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} ( Σ_{t'=t}^{T} γ^{t'−t} r_{n,t'} − b ) ∇_θ log π(a_{n,t} | s_{n,t}; θ)

• Update the model parameters (α: learning rate):
θ_{k+1} = θ_k + α ∇_θ J(θ)

Notation: θ: model parameters; N: number of episodes; T: number of steps; γ: decay of reward; r: reward; b: baseline; π: policy; a: action; s: state.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
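A minimal Python sketch of the gradient estimate and update above. `grad_log_prob(policy_params, s, a)` stands in for ∇_θ log π(a|s; θ) and is a placeholder, not code from the project; a constant baseline b is assumed.

```python
import numpy as np

def reinforce_gradient(episodes, policy_params, grad_log_prob, gamma=0.99, b=0.0):
    """Monte-Carlo REINFORCE gradient:
    (1/N) * sum_n sum_t (G_{n,t} - b) * grad log pi(a_{n,t} | s_{n,t}; theta),
    where G_{n,t} is the discounted return-to-go and `episodes` is a list of
    trajectories [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    total = 0.0
    for trajectory in episodes:
        rewards = [r for _, _, r in trajectory]
        returns, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
            returns[t] = running
        for t, (s, a, _) in enumerate(trajectory):
            total = total + (returns[t] - b) * grad_log_prob(policy_params, s, a)
    return total / len(episodes)

def reinforce_update(theta, grad, alpha=1e-3):
    """Gradient ascent step: theta_{k+1} = theta_k + alpha * grad J(theta)."""
    return theta + alpha * grad
```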

Page 7: Imitation Learning for Autonomous Driving in TORCS

Experiments ‒ Multi Agent

[Diagram: three agents (http://192.168.0.1/8080, http://192.168.0.2/8080, http://192.168.0.3/8080), each collecting experience in its own environment and computing gradients; the gradients update shared model parameters, which are sent back to each agent.]

With 3 agents, training speed is about 3 times faster than with a single agent.
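A sketch of one synchronous update in the multi-agent setup shown in the diagram. The slides do not say whether the real system updates synchronously or asynchronously over HTTP; here a thread pool stands in for the three remote agents, and `collect_and_compute_gradient` is a placeholder for the per-agent rollout plus REINFORCE gradient computation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def distributed_update(theta, workers, collect_and_compute_gradient, alpha=1e-3):
    """Every worker (an agent with its own environment) gathers experience and
    computes a gradient in parallel; the averaged gradient updates the shared
    model parameters, which are then broadcast back to the workers."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        grads = list(pool.map(lambda w: collect_and_compute_gradient(w, theta),
                              workers))
    return theta + alpha * np.mean(grads, axis=0)
```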

Page 8: Imitation Learning for Autonomous Driving in TORCS

Results - Pendulum

[Plots: learning curves for REINFORCE and DAgger.]

Policy network: 3-layer perceptron with layer sizes 3-200-2
Input: (cos θ, sin θ, θ̇)
Output: one of two torque actions

Fewer episodes until convergence!
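The slides do not name a deep-learning framework, so here is a plain NumPy sketch of a perceptron with the layer sizes read off the slide (3 inputs, one hidden layer of 200 units, 2 output actions); the tanh nonlinearity and the initialisation scale are assumptions.

```python
import numpy as np

def init_policy(sizes=(3, 200, 2), seed=0):
    """Perceptron matching the pendulum policy above: input (cos, sin, d_theta),
    one hidden layer of 200 units, two output actions (two torque values)."""
    rng = np.random.RandomState(seed)
    return [(rng.randn(n_in, n_out) * 0.01, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def action_probabilities(params, obs):
    """Forward pass: tanh hidden layer, softmax over the two actions."""
    h = obs
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    e = np.exp(h - np.max(h))
    return e / e.sum()

# Example: probabilities of the two torque actions for one observation.
obs = np.array([np.cos(0.3), np.sin(0.3), 0.0])
probs = action_probabilities(init_policy(), obs)
```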

Page 9: Imitation Learning for Autonomous Driving in TORCS

Results - Pong

[Plots: learning curves for REINFORCE and DAgger.]

Policy network: 3-layer perceptron with layer sizes 6400-200-2
Input: 6400-dimensional (80×80) vector, computed as the frame difference S_{t+1} − S_t
Output: Up or Down
Validation accuracy: 97.04%

Fewer episodes until convergence!
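A small sketch of the input construction described above: the policy sees the difference of two consecutive 80×80 binary frames, flattened into a 6400-dimensional vector (the same perceptron shape as the pendulum sketch, with a 6400-unit input layer, can then score Up vs. Down).

```python
import numpy as np

def pong_input(prev_frame, frame):
    """Flattened frame difference S_{t+1} - S_t: a pair of 80x80 binary images
    turned into a 6400-dimensional float vector."""
    assert frame.shape == (80, 80) and prev_frame.shape == (80, 80)
    return (frame.astype(np.float32) - prev_frame.astype(np.float32)).ravel()
```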

Page 10: Imitation Learning for Autonomous Driving in TORCS

Application to TORCS

• Car driving simulator game
• Try to improve Yoshida-san's projects
• Train the policy only from the vision sensor

7 training tracks: track4, track7, track18, …
3 test tracks: track8, track12, track16

Imitation Learning (expert: hand-crafted AI)
Reinforcement Learning

Result: reasonable behavior

Network input: two consecutive frames x_{t−1}, x_t (each 3×64×64)

Discrete actions:
• Steering wheel: {-1, 0, 1}
• Whether to brake: {0, 1}

or

Continuous actions:
• Steering wheel: -1 to 1
• Accel: 0 to 1

Transfer Learning
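A sketch of a vision-based policy for the TORCS setup above, with either the discrete or the continuous action head. The slides only give the input shape (two consecutive 3×64×64 frames) and the action spaces; the convolutional trunk, layer sizes, and the use of PyTorch are illustrative assumptions, not the architecture used in the project.

```python
import torch
import torch.nn as nn

class TorcsPolicy(nn.Module):
    """Vision-based policy: two RGB frames x_{t-1}, x_t stacked to a 6x64x64 input."""

    def __init__(self, discrete=True):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),   # 6x64x64 -> 32x15x15
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x6x6
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 256), nn.ReLU(),
        )
        self.discrete = discrete
        if discrete:
            self.steer = nn.Linear(256, 3)   # logits over steering in {-1, 0, 1}
            self.brake = nn.Linear(256, 2)   # logits over brake in {0, 1}
        else:
            self.steer = nn.Linear(256, 1)   # tanh -> steering in [-1, 1]
            self.accel = nn.Linear(256, 1)   # sigmoid -> accel in [0, 1]

    def forward(self, frames):               # frames: (batch, 6, 64, 64)
        h = self.trunk(frames)
        if self.discrete:
            return self.steer(h), self.brake(h)
        return torch.tanh(self.steer(h)), torch.sigmoid(self.accel(h))
```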

Page 11: Imitation Learning for Autonomous Driving in TORCS

Results ‒ DAgger in TORCS (discrete actions vs. continuous actions)

• DAgger works well in different environments (no overfitting!).
• The agent cannot surpass the performance of the expert: most places where the agent fails are places where the expert fails.
• The expert cannot reach the goal on all test tracks.
• An agent with continuous actions gradually becomes worse...

[Plot annotation: expert can reach.]

Page 12: Imitation Learning for Autonomous Driving in TORCS

Experiments ‒ Transfer Learning

• Experiment 1 (single-play): RL for faster and safer driving
  Environments: tracks 0 and 16
  Rewards (see the code sketch below):
  - Out of the track ⇒ -1
  - Every 400 steps (track 0) or 200 steps (track 8) ⇒ mean speed (≈ 0 to 2.2)

• Experiment 2 (self-play): RL for racing battle
  Environment: track 0
  Rewards:
  - Out of the track ⇒ -1
  - Overtaken by the opponent ⇒ -1
  - Overtake the opponent ⇒ mean speed (≈ 0 to 2.2)
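A sketch of the reward rules listed above. The exact speed normalisation (the ≈0 to 2.2 range) and the event detection are not specified in the slides; the function signatures here are illustrative.

```python
def single_play_reward(off_track, step, mean_speed, reward_interval=400):
    """Experiment 1 (single-play): -1 when the car leaves the track, and the
    mean speed (roughly 0 to 2.2) every `reward_interval` steps
    (400 on track 0, 200 on track 8)."""
    if off_track:
        return -1.0
    if step > 0 and step % reward_interval == 0:
        return float(mean_speed)
    return 0.0

def self_play_reward(off_track, overtaken, overtook, mean_speed):
    """Experiment 2 (self-play racing battle): -1 for leaving the track or
    being overtaken, and the mean speed for overtaking the opponent."""
    if off_track or overtaken:
        return -1.0
    if overtook:
        return float(mean_speed)
    return 0.0
```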

Page 13: Imitation Learning for Autonomous Driving in TORCS

Results - Experiment 1 (single-play)

[Plots: Track 0 (goal: 400 steps) and Track 16 (goal: 1600 steps); expert performance and moving average shown.]

• Transfer learning works well with the REINFORCE algorithm.
• Better driving than the expert in terms of both speed and safety.
• A well-trained agent seems to control its speed by steering alone (no braking).

Page 14: Imitation Learning for Autonomous Driving in TORCS

Results - Experiment 2 (self-play)

[Plots: moving average of returns for agent and opponent in three settings: vs. expert (opponent = expert, RL not to be overtaken), self-play 1 (RL to overtake), self-play 2 (RL not to be overtaken).]

Page 15: Imitation Learning for Autonomous Driving in TORCS

Conclusion and Future Works

Conclusion
• DAgger works well in various environments such as TORCS.
• DAgger is very effective as pre-training before RL.

Future Works
• Does Imitation Learning as pre-training cause the agent to get stuck in local minima?
• Could multi-task learning (e.g. simultaneously predicting the existence of another car to the left or right) help to train autonomous driving?

Page 16: Imitation Learning for Autonomous Driving in TORCS

Appendix

• Comparison of baselines: with baselines vs. without baselines
• Comparison of pre-training by DAgger: with pre-training vs. without pre-training