The Predictron: End-to-end Learning and Planning
Yoonho Lee
Department of Computer Science and Engineering, Pohang University of Science and Technology
December 27, 2016
Hierarchical RL: Motivation
Hierarchical RL: Reward augmentation¹
¹ Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation, NIPS 2016
Hierarchical RL: Hierarchical actions²
² Florensa et al., Stochastic Neural Networks for Hierarchical Reinforcement Learning, under review for ICLR 2017
Hierarchical RL: Model-based video prediction³
³ Oh et al., Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS 2015
Value Iteration Networks
Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel
Best paper at NIPS 2016
Value Iteration Networks: Motivation
Value Iteration Networks: Network diagram
Value Iteration Networks: Bellman Operator as a CNN forward pass
Bellman Operator

Repeated application of the Bellman Operator $T$ drives any $V$ to $V^*$, where $T$ is defined as:

$$(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big\} \quad (1)$$

$$V^* = \lim_{n \to \infty} T^n V \quad \forall V \quad (2)$$
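The convergence claim in (2) can be checked numerically. The sketch below runs value iteration on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards are made-up illustrative numbers, not from the slides) and shows that repeated application of $T$ reaches a fixed point independent of the starting $V$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; the numbers are illustrative only.
# P[a, s, t] = P(t | s, a), R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_operator(V):
    # (TV)(s) = max_a { R(s, a) + gamma * sum_s' P(s'|s, a) V(s') }
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

# Repeated application converges to V*, a fixed point of T.
V = np.zeros(2)
for _ in range(500):
    V = bellman_operator(V)
```

Since $T$ is a $\gamma$-contraction, 500 iterations are far more than enough for the iterates to be numerically indistinguishable from $V^*$.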
The Bellman Operator can be viewed as a CNN forward pass:

$$(TV)(s) = \max_{a \in A} \big\{ R(s, a) + \gamma\, \mathrm{conv}_{P(a)}(V)(s) \big\}$$
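The identity above holds when the transition dynamics are translation-invariant over a grid of states, so that $\sum_{s'} P(s' \mid s, a)\, V(s')$ becomes a 2-D convolution of the value map with a per-action kernel. A minimal sketch (the 5×5 grid, kernels, and rewards are made up for illustration):

```python
import numpy as np

def conv2d_3x3(V, K):
    """'same' cross-correlation of value map V with a 3x3 kernel (as in CNN libraries)."""
    Vp = np.pad(V, 1)
    out = np.zeros_like(V)
    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            out[i, j] = (Vp[i:i + 3, j:j + 3] * K).sum()
    return out

rng = np.random.default_rng(0)
V = rng.random((5, 5))   # current value estimate on a 5x5 gridworld
R = rng.random((5, 5))   # state reward (shared across actions here)
gamma = 0.9

# Per-action transition kernels: 0.8 mass on the intended neighbor, 0.2 on staying.
K_right = np.zeros((3, 3)); K_right[1, 2] = 0.8; K_right[1, 1] = 0.2
K_down = np.zeros((3, 3)); K_down[2, 1] = 0.8; K_down[1, 1] = 0.2

# (TV)(s) = max_a { R(s, a) + gamma * conv_{P(a)}(V)(s) }, computed for the whole grid at once
TV = np.maximum(R + gamma * conv2d_3x3(V, K_right),
                R + gamma * conv2d_3x3(V, K_down))
```

One conv channel per action plus a max over the channel axis: exactly the shape of a CNN layer.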
Value Iteration Networks: VI Module
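The VI module stacks a fixed number of the convolutional Bellman updates described above, so that one forward pass performs several steps of value iteration. A sketch on a made-up 5×5 gridworld with deterministic right/down actions and a single goal reward (not the paper's domains):

```python
import numpy as np

def conv2d_3x3(V, K):
    """'same' cross-correlation of value map V with a 3x3 kernel."""
    Vp = np.pad(V, 1)
    out = np.zeros_like(V)
    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            out[i, j] = (Vp[i:i + 3, j:j + 3] * K).sum()
    return out

def vi_module(R, kernels, gamma=0.9, iters=20):
    """Apply V <- max_a { R + gamma * conv_{P(a)}(V) } for a fixed number of layers."""
    V = np.zeros_like(R)
    for _ in range(iters):
        V = np.max([R + gamma * conv2d_3x3(V, Ka) for Ka in kernels], axis=0)
    return V

# Reward 1 at the bottom-right goal; actions move deterministically right or down.
R = np.zeros((5, 5)); R[4, 4] = 1.0
K_right = np.zeros((3, 3)); K_right[1, 2] = 1.0
K_down = np.zeros((3, 3)); K_down[2, 1] = 1.0
V = vi_module(R, [K_right, K_down])
```

After 20 layers the value map has converged: each cell holds $\gamma^d$ for its distance $d$ to the goal. In the actual VIN, the kernels and reward map are learned end-to-end rather than hand-specified as here.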
Value Iteration Networks: Results
Value Iteration Networks: Summary
▶ Neural network architecture that plans using value iteration
▶ Assumes that the state is a sufficient statistic for the reward function
▶ The small MDP must have a finite state space
▶ Uses prior knowledge about the environment's structure
The Predictron: End-to-End Learning and Planning
David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris
Under review for ICLR 2017
Predictron
Predictron: Summary
▶ End-to-end simulation of an MRP
▶ Works with an arbitrary state space for the small MRP
▶ The small MRP has a non-interpretable state space
▶ Designed for RL, but does not take actions into account
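The end-to-end simulation above can be sketched as a rollout of an internal "core" that emits a predicted reward and discount at each abstract step, accumulated into a k-step preturn that bootstraps from a value estimate at the final abstract state. Everything below (the linear/tanh core, the dimensions, the random weights) is a made-up stand-in for the learned components:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4

# Random placeholder parameters standing in for learned weights.
W = rng.normal(size=(dim, dim)) * 0.5
w_r, w_g, w_v = rng.normal(size=(3, dim))

def core(s):
    """One abstract transition: next state, predicted reward, predicted discount."""
    s_next = np.tanh(W @ s)
    r = w_r @ s                            # predicted per-step reward
    g = 1.0 / (1.0 + np.exp(-(w_g @ s)))   # predicted discount in (0, 1)
    return s_next, r, g

def value(s):
    """Value estimate of an abstract state."""
    return w_v @ s

def k_step_preturn(s0, k):
    """g^k = r_1 + gamma_1 (r_2 + gamma_2 (... + v_k)), via a backward pass."""
    rewards, discounts, s = [], [], s0
    for _ in range(k):
        s, r, g = core(s)
        rewards.append(r)
        discounts.append(g)
    ret = value(s)   # bootstrap from the final abstract state
    for r, g in zip(reversed(rewards), reversed(discounts)):
        ret = r + g * ret
    return ret

s0 = rng.normal(size=dim)
```

With k = 0 this reduces to the plain value estimate; larger k trades more internal simulation against model error, and the predictron mixes the different k-step preturns with learned weights.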
Summary
▶ Hierarchical Reinforcement Learning attempts to identify crucial decisions
▶ Agents can now use NN-based planning for better decision making
▶ Research directions:
  ▶ Theoretical bounds for the optimal policy of a smaller MDP
  ▶ Learning a smaller MDP with abstract actions
  ▶ End-to-end planning-based Q network or policy network
Thank You