Deep parking
Posted on 07-Apr-2017
TRANSCRIPT
1
Deep parking: an implementation of automatic parking with deep reinforcement learning
Shintaro Shiba, Feb. 2016 - Dec. 2016, Engineer Internship at Preferred Networks
Mentor: Abe-san, Fujita-san
2
About me
Shintaro Shiba
• Graduate student at the University of Tokyo
  – Major in neuroscience and animal behavior
• Part-time engineer (internship) at Preferred Networks, Inc.
  – Blog post URL: https://research.preferred.jp/2017/03/deep-parking/
3
Contents
• Original Idea
• Background: DQN and Double DQN
• Task definition
  – Environment: car simulator
  – Agents
    1. Coordinate
    2. Bird’s-eye view
    3. Subjective view
• Discussion
• Summary
4
Achievement
[Figure: trajectory of the car agent and the subjective view (input for DQN), from cameras at 0 deg, -120 deg, and +120 deg]
5
Original Idea: DQN for parking
https://research.preferred.jp/2016/01/ces2016/
https://research.preferred.jp/2015/06/distributed-deep-reinforcement-learning/
A previous project succeeded in driving smoothly with DQN.
Input: 32 virtual sensors, 3 previous actions + current speed and steering
Output: 9 actions
Is it possible for a car agent to learn to park itself, with camera images as input?
6
Reinforcement learning
[Diagram: the agent sends an action to the environment and receives the resulting state and reward, which feed the learning algorithm]
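To make the loop concrete, here is a minimal runnable sketch of the agent-environment interaction in Python; `StubEnv` and `StubAgent` are hypothetical placeholders for the car simulator and the DQN agent, not the project's actual classes.

```python
import random

class StubEnv:
    """Hypothetical stand-in for the car simulator environment."""
    def reset(self):
        self.t = 0
        return 0.0                        # state: placeholder sensor reading

    def step(self, action):
        self.t += 1
        reward = random.random() - 0.5    # placeholder reward signal
        done = self.t >= 500              # "time up" after 500 actions
        return 0.0, reward, done

class StubAgent:
    """Hypothetical stand-in for the learning agent."""
    def act(self, state):
        return random.randrange(9)        # pick one of the 9 actions

    def observe(self, *transition):
        pass                              # a real agent would store and learn here

env, agent = StubEnv(), StubAgent()
for episode in range(3):
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        state = next_state
```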
7
DQN: Deep Q-Network
Volodymyr Mnih et al. 2015
[Pseudocode: for each episode >> for each action >> update the Q function]
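The update at the heart of that loop is the TD target from Mnih et al. 2015: y = r + γ max_a Q_target(s', a). A NumPy sketch with this deck's γ = 0.97 and 9 actions; `q_target` stands in for the target network, and its interface is an assumption rather than the project's code.

```python
import numpy as np

def dqn_targets(q_target, rewards, next_states, dones, gamma=0.97):
    """TD targets y = r + gamma * max_a Q_target(s', a).

    q_target maps a batch of states to an (N, 9) array of action values;
    terminal transitions (done = 1) use the reward alone.
    """
    next_q = q_target(next_states)              # (N, 9)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

# Toy usage with a random stand-in "target network":
rng = np.random.default_rng(0)
q_target = lambda s: rng.standard_normal((len(s), 9))
print(dqn_targets(q_target, np.array([0.01, -1.0]), np.zeros((2, 4)), np.array([0.0, 1.0])))
```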
8
Double DQN
Preventing overestimation of Q values
Hado van Hasselt et al. 2015
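Double DQN (van Hasselt et al. 2015) changes only the target: the online network selects the next action and the target network evaluates it, which curbs the upward bias of the plain max. A sketch under the same assumed interfaces as above:

```python
import numpy as np

def double_dqn_targets(q_online, q_target, rewards, next_states, dones, gamma=0.97):
    """Double DQN targets: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    The online network *selects* the next action and the target network
    *evaluates* it, reducing the overestimation of plain DQN.
    """
    best = q_online(next_states).argmax(axis=1)                    # selection
    evaluated = q_target(next_states)[np.arange(len(best)), best]  # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated
```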
9
Reinforcement learning in this project
[Diagram: the same loop, with Environment = car simulator and Agent = different sensors + different neural networks; the agent outputs an action and receives state (= sensor input) and reward]
10
Environment: Car simulator
Forces modeled:
• Traction
• Air resistance
• Rolling resistance
• Centrifugal force
• Brake
• Cornering force
F = F_traction + F_aero + F_rr + F_c + F_brake + F_cf
(An illustrative code sketch follows below.)
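As an illustration of how such a force sum might look in code, here is a sketch using generic textbook coefficients (the drag and rolling-resistance constants are common tutorial values, not the simulator's actual ones); centrifugal and cornering forces are omitted for brevity.

```python
import numpy as np

def total_force(v, f_traction, brake_mag, c_drag=0.4257, c_rr=12.8):
    """Net longitudinal force on the car (illustrative coefficients).

    v          : 2D velocity vector (m/s)
    f_traction : traction force vector from the engine (N)
    brake_mag  : brake force magnitude (N), applied against motion
    Centrifugal and cornering forces are omitted for brevity.
    """
    speed = np.linalg.norm(v)
    f_aero = -c_drag * speed * v                 # air resistance ~ |v| * v
    f_rr = -c_rr * v                             # rolling resistance ~ v
    f_brake = -brake_mag * v / speed if speed > 1e-6 else np.zeros_like(v)
    return f_traction + f_aero + f_rr + f_brake

print(total_force(np.array([10.0, 0.0]), np.array([2000.0, 0.0]), 0.0))
```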
11
Common specifications: state, action, reward
Input (states)
– Features specific to each agent + car speed, car steering
Output (actions)
– 9: accelerate, decelerate, steer right, steer left, throw (do nothing), accelerate + steer right, accelerate + steer left, decelerate + steer right, decelerate + steer left
Reward
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise (changed afterward; see the sketch after this list)
Goal
– Car inside the goal region; no other conditions such as car direction
Terminate
– Time up: 500 actions (changed to 450 afterward)
– Field out: out of the field
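A minimal sketch of that reward and termination logic, with the geometry simplified to a circular goal region inside a square field (an assumption; the actual field and goal shapes are not given in the slides):

```python
import math

def reward_and_done(pos, goal_pos, goal_radius, field_size, step, max_steps=500):
    """Reward per the original spec (geometry simplified to a circular
    goal region inside a square [0, field_size] x [0, field_size] field)."""
    dist = math.hypot(pos[0] - goal_pos[0], pos[1] - goal_pos[1])
    if dist <= goal_radius:                        # car inside the goal region
        return 1.0, True
    if not (0 <= pos[0] <= field_size and 0 <= pos[1] <= field_size):
        return -1.0, True                          # car out of the field
    return 0.01 - 0.01 * dist, step >= max_steps   # shaping term; time up at 500
```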
12
Common specifications: hyperparameters
Maximum episodes: 50,000
Gamma: 0.97
Optimizer: RMSpropGraves
– lr=0.00015, alpha=0.95, momentum=0.95, eps=0.01
– changed afterward: lr=0.00015, alpha=0.95, momentum=0, eps=0.01
Batch size: 50 or 64
Epsilon: linearly decreased from 1.0 at first to 0.1 at last (a schedule sketch follows below)
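For illustration, the linear epsilon schedule might look like this; the decay horizon is an assumption, since the slides give only the endpoints.

```python
def epsilon(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linear anneal from `start` to `end`, then flat (horizon assumed)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(epsilon(0), epsilon(500_000), epsilon(2_000_000))  # 1.0, 0.55, 0.1
```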
13
Agents
1. Coordinate
2. Bird’s-eye view
3. Subjective view
– Three cameras
– Four cameras
14
Coordinate agent: Input features
– Relative coordinate value from the car to the goal, e.g. (80, 300)
– Input shape: (2,), normalized
[Diagram: car and goal positions on the field]
15
Coordinate agent: Neural network
– Only fully-connected layers (3)
– Coordinates (2) + car parameters (2) → 64 → 64 → n of actions (9)
(A sketch follows below.)
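A PyTorch sketch of a three-layer fully-connected network matching these sizes (the original implementation at Preferred Networks was presumably in Chainer; this is an equivalent reconstruction, not the original code):

```python
import torch
import torch.nn as nn

class CoordinateQNet(nn.Module):
    """Q network for the coordinate agent: (2 coords + 2 car params) -> 9 Q values."""
    def __init__(self, n_actions=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 2, 64), nn.ReLU(),   # coordinates + (speed, steering)
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),          # one Q value per action
        )

    def forward(self, x):
        return self.net(x)

q = CoordinateQNet()
print(q(torch.randn(1, 4)))   # toy input: a batch of one state
```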
16
Coordinate agent: Result
17
Bird’s-eye view agent: Input features
– Bird’s-eye image of the whole field
– Input size: 80 x 80, normalized
18
Bird’s-eye view agent: Neural network
[Network diagram: 80 x 80 image → Conv layers → fully-connected (400) → n of actions, with the car parameters (2) joined before the output; feature sizes 64, 128, and 192 appear in the diagram. A sketch follows below.]
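The diagram does not pin down kernel sizes or strides, so the following PyTorch sketch only mirrors the overall shape: an 80x80 image through a conv stack with 64/128/192 channels, a 400-unit dense layer, and the two car parameters concatenated before the output head. All unstated hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class BirdsEyeQNet(nn.Module):
    """Sketch of the bird's-eye Q network; kernel/stride choices are assumed."""
    def __init__(self, n_actions=9, n_car_params=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 192, kernel_size=3, stride=1), nn.ReLU(),
        )
        with torch.no_grad():   # infer the flattened conv output size
            n_flat = self.conv(torch.zeros(1, 1, 80, 80)).numel()
        self.fc = nn.Sequential(nn.Linear(n_flat, 400), nn.ReLU())
        self.head = nn.Linear(400 + n_car_params, n_actions)

    def forward(self, image, car_params):
        h = self.fc(self.conv(image).flatten(1))
        return self.head(torch.cat([h, car_params], dim=1))

q = BirdsEyeQNet()
print(q(torch.zeros(1, 1, 80, 80), torch.zeros(1, 2)).shape)  # -> (1, 9)
```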
20
Bird’s-eye view agent: Result at 18k episodes
21
Bird’s-eye view agent: Result after 18k episodes?
But we had already spent about six months on this agent, so we moved on to the next…
22
Subjective view agent: Input features
– N_of_camera images of subjective view from the car
– Number of cameras: three or four
– FoV = 120 deg
[Example: input images for the four-camera agent: front +0 deg, right +90 deg, back +180 deg, left +270 deg]
23
Subjective view agent: Neural network
[Network diagram: 80 x 80 camera images → Conv layers → features (200 x 3) → fully-connected (400 → 256) → n of actions, with the car parameters (2) joined before the output; a width of 64 also appears in the diagram]
25
Subjective view agent: Problems
– Calculation time (GeForce GTX TITAN X)
  • At first: 3 [min/ep] x 50k [ep] = 100 days
  • After review by Abe-san: 1.6 [min/ep] x 50k [ep] = 55 days
  – The cost came from copies and synchronization between GPU and CPU
  – Learning was interrupted as soon as the DNN output diverged
  – (Fortunately) the agent “learned” the goal by ~10k episodes in some trials
– Memory usage
  • In DQN we need to store 1M previous inputs: 1M x (80 x 80 x 3 ch x 4 cameras)
  • So we save the images to disk and read them back each time (a sketch follows below)
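A minimal sketch of such a disk-backed replay memory; the file layout and API are assumptions, not the project's actual code. At uint8 precision, 1M observations of 80 x 80 x 3 x 4 cameras come to roughly 77 GB, which is why they cannot stay in RAM.

```python
import numpy as np
from pathlib import Path

class DiskReplayBuffer:
    """Replay memory that keeps observations on disk instead of RAM (a sketch)."""
    def __init__(self, root, capacity=1_000_000):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.capacity, self.size, self.next = capacity, 0, 0
        self.meta = {}                    # index -> (action, reward, done)

    def add(self, obs, action, reward, done):
        # Each observation is written as one .npy file, indexed circularly.
        np.save(self.root / f"{self.next}.npy", obs.astype(np.uint8))
        self.meta[self.next] = (action, reward, done)
        self.next = (self.next + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Re-read the sampled frames from disk every time.
        idx = np.random.randint(0, self.size, batch_size)
        obs = np.stack([np.load(self.root / f"{i}.npy") for i in idx])
        return obs, [self.meta[i] for i in idx]
```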
26
Subjective view agent: Result with three cameras, 6k episodes
[Figure: trajectory of the car agent and the subjective view (input for DQN), from cameras at 0 deg, -120 deg, and +120 deg]
27
Subjective view agent: Result with three cameras, 50k episodes
The policy seems to be “move anyway” >> revisit the reward setting
The agent does not seem able to reach the goal every time; only “easy” goals are achieved >> make the task difficulty variable (curriculum)
[Figure annotation: frequent goals in one region of the field]
28
Subjective view agent: Four cameras at 30k episodes
29
Modify reward
Previous
– +1 when the car is in the goal
– -1 when the car is out of the field
– 0.01 - 0.01 * distance_to_goal otherwise
New
– +1 - speed when the car is in the goal
  • in order to stop the car
– -1 when the car is out of the field
– -0.005 otherwise
(An updated sketch follows below.)
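Relative to the earlier reward sketch, only two branches change; an updated version under the same simplified geometry assumptions (note the episode limit also changed to 450 actions):

```python
import math

def reward_and_done_v2(pos, speed, goal_pos, goal_radius, field_size,
                       step, max_steps=450):
    """Modified reward: the goal bonus shrinks with speed (to make the car
    stop), and the per-step shaping term becomes a small constant penalty."""
    dist = math.hypot(pos[0] - goal_pos[0], pos[1] - goal_pos[1])
    if dist <= goal_radius:
        return 1.0 - speed, True           # reward stopping inside the goal
    if not (0 <= pos[0] <= field_size and 0 <= pos[1] <= field_size):
        return -1.0, True                  # car out of the field
    return -0.005, step >= max_steps       # constant step penalty; 450-step limit
```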
30
Modify difficulty
Difficulty: initial car direction & position
– Constraint
  • The car always starts near the middle of the field
  • The car always starts facing the center
– Curriculum
  • Car direction: sampled from a range that widens with n, where n = curriculum level
  • Criterion for advancing: mean reward of 0.6 over 100 episodes
[Diagram: goal and initial directions for n = 1 and n = 2]
(A curriculum sketch follows below.)
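A sketch of such a curriculum loop. The advancement criterion (mean reward of 0.6 over 100 episodes) is from the slides; the widening-angle formula is an assumption, since the slide's exact expression did not survive extraction.

```python
import random
from collections import deque

def run_curriculum(run_episode, max_level=8, window=100, threshold=0.6):
    """Advance the curriculum when the mean reward over the last `window`
    episodes reaches `threshold` (the 0.6-over-100 criterion from the
    slides); the widening angle range per level is an assumed formula."""
    level, recent = 1, deque(maxlen=window)
    while level <= max_level:
        max_offset = 15.0 * level                 # assumed widening (degrees)
        start_dir = random.uniform(-max_offset, max_offset)
        recent.append(run_episode(start_dir))     # returns the episode reward
        if len(recent) == window and sum(recent) / window >= threshold:
            level += 1
            recent.clear()

run_curriculum(lambda start_dir: 1.0)   # toy runner that always "succeeds"
```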
31
Subjective view agent: modifications

N cameras | Reward   | Difficulty | Learning result
3         | Default  | Default    | about 6k: o, 50k: x
3         | Modified | Default    | about 16k: o
3         | Modified | Constraint | ? (still learning)
3         | Modified | Curriculum | o (though only curriculum 1 yet)
4         | Default  | Default    | x
4         | Modified | Curriculum | △ (not bad, but not successful yet at 6k)
32
Subjective view agent: modifications
Curriculum + three cameras, at curriculum level 1. The advancement criterion needs to be modified.
[Plots: mean reward (0.0 to 1.0) and reward sum (0 to 500) vs. episode number (0 to 20k)]
33
Discussion
1. The initial settings included situations where the car cannot reach the goal
– e.g. starting toward the edge of the field
– This made learning unstable
2. Why was the coordinate agent nevertheless successful, even though such situations could also occur for it?
34
Discussion
3. Comparison between three and four cameras
– Considering success rate and execution time, three cameras are better
– Why was the four-camera agent not successful? Would it need several more trials?
4. DQN often diverged
– in roughly one run out of three, in my experience
  • slightly more often with four cameras
– This underlines the importance of the dataset for learning
  • memory size, batch size
35
Discussion
5. Curriculum
– Ideally it would be better to quantify the “difficulty of the task”
  • In this case it may be roughly represented by the “bias of the distribution” of the selected actions over the nine choices (accelerate, decelerate, throw (do nothing), steer right, steer left, and the accelerate/decelerate + steer combinations)
  • Each action selected equally often >> go straight; a biased distribution of selected actions >> go right/left
(A sketch of such a bias measure follows below.)
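One simple way to quantify that bias (an assumption; the slides do not define a formula) is the entropy deficit of the empirical action distribution relative to uniform, i.e. KL(p || uniform):

```python
import numpy as np

def action_bias(action_counts):
    """Bias of the action distribution, as entropy deficit from uniform.

    0 when all 9 actions are picked equally often (go straight);
    larger when a few actions dominate (e.g. always steering one way).
    """
    p = np.asarray(action_counts, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    entropy = -(nz * np.log(nz)).sum()
    return np.log(len(p)) - entropy      # = KL(p || uniform)

print(action_bias([10] * 9))                       # balanced  -> 0.0
print(action_bias([50, 1, 1, 1, 1, 1, 1, 1, 1]))   # biased    -> ~1.5
```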
36
Summary
• The car agent can park itself using subjective camera views, though learning is not always stable
• There is a trade-off between reward design and learning difficulty
– Simple reward: difficult to learn
  • Try other algorithms such as A3C
– Complex reward: difficult to set
  • Try other settings for distance_to_goal