Source: mlg.postech.ac.kr/~readinglist/slides/20161227.pdf

The Predictron: End-to-end Learning and Planning

Yoonho Lee

Department of Computer Science and Engineering
Pohang University of Science and Technology

December 27, 2016

Hierarchical RL: Motivation

Reward augmentation [1]

[1] Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation, NIPS 2016

Hierarchical actions [2]

[2] Florensa et al., Stochastic Neural Networks for Hierarchical Reinforcement Learning, under review for ICLR 2017

Model-based video prediction [3]

[3] Oh et al., Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS 2015

Value Iteration Networks

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel
Best paper at NIPS 2016

Value Iteration Networks: Motivation

Value Iteration Networks: Network diagram

Value Iteration Networks: Bellman Operator as a CNN forward pass

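Bellman Operator

Every $V$ converges to $V^*$ under the Bellman operator $T$, defined as:

\[
(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big\} \tag{1}
\]

\[
V^* = \lim_{n \to \infty} T^n V \quad \forall V \tag{2}
\]

The Bellman operator can be viewed as a CNN forward pass:

\[
(T^* V)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \, \mathrm{conv}_{P(a)}(V(s)) \Big\}
\]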
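The convolutional form of the Bellman backup above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that are not in the slides: a 4x4 grid world, four deterministic moves, a reward that depends only on the state, and hand-written 3x3 transition kernels (a VIN would learn its kernels instead). All function names are hypothetical.

```python
import numpy as np

def one_hot_kernel(di, dj):
    """3x3 kernel putting all transition probability on the neighbour at offset (di, dj)."""
    k = np.zeros((3, 3))
    k[di + 1, dj + 1] = 1.0
    return k

# Assumed action set: up, down, left, right (one kernel per action).
KERNELS = [one_hot_kernel(-1, 0), one_hot_kernel(1, 0),
           one_hot_kernel(0, -1), one_hot_kernel(0, 1)]

def local_expectation(V, kernel):
    """Cross-correlate V with a 3x3 kernel, i.e. sum over neighbours of P(s'|s,a) V(s').

    Edge padding copies border values outward, so a move off the grid
    behaves like staying in the current border cell.
    """
    H, W = V.shape
    padded = np.pad(V, 1, mode="edge")
    out = np.zeros_like(V)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + H, dj:dj + W]
    return out

def bellman_backup(V, R, kernels, gamma):
    """One application of T: stack one Q-map per action, then max over actions."""
    Q = np.stack([R + gamma * local_expectation(V, k) for k in kernels])
    return Q.max(axis=0)

def value_iteration(R, kernels, gamma, n_iters):
    """Apply the backup n_iters times, starting from V = 0."""
    V = np.zeros_like(R)
    for _ in range(n_iters):
        V = bellman_backup(V, R, kernels, gamma)
    return V

# Toy problem: reward 1 for occupying the top-left cell, 0 elsewhere.
R = np.zeros((4, 4))
R[0, 0] = 1.0
V = value_iteration(R, KERNELS, gamma=0.9, n_iters=200)
print(np.round(V, 2))
```

In this toy setting the agent can remain in the rewarding corner by moving into the wall, so the converged values decay geometrically with the Manhattan distance to that corner.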

Value Iteration Networks: VI Module

Value Iteration Networks: Results

Value Iteration Networks: Summary

- Neural network architecture that plans using value iteration
- Assumes that the state is a sufficient statistic for the reward function
- The small MDP must have a finite state space
- Uses prior knowledge about the environment's structure

The Predictron: End-to-End Learning and Planning

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris
Under review for ICLR 2017

Predictron

Predictron: Summary

- End-to-end simulation of an MRP (Markov reward process)
- Works with an arbitrary state space for the small MRP
- The small MRP has a non-interpretable state space
- Designed for RL, but does not take actions into account
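The first point above, end-to-end simulation of an MRP, can be illustrated with a short Python sketch. It rolls a toy abstract MRP forward for K steps and accumulates the k-step predictions ("preturns") described in the Predictron paper, g^k = r^1 + γ^1 (r^2 + γ^2 (... + γ^k v^k)). The slides do not spell out this formula, and the `toy_core` and `toy_value` functions are hypothetical hand-written stand-ins for the learned networks; the numbers are illustrative only.

```python
def rollout(s0, core, value, K):
    """Simulate K steps of an abstract MRP with no actions involved.

    core(s)  -> (next_state, reward, discount)   (stand-in for the learned model)
    value(s) -> scalar value estimate            (stand-in for the learned value net)
    Returns rewards r^1..r^K, discounts gamma^1..gamma^K and values v^0..v^K.
    """
    rewards, discounts, values = [], [], [value(s0)]
    s = s0
    for _ in range(K):
        s, r, d = core(s)
        rewards.append(r)
        discounts.append(d)
        values.append(value(s))
    return rewards, discounts, values

def preturns(rewards, discounts, values):
    """Return [g^0, ..., g^K], where g^0 = v^0 and
    g^k = r^1 + gamma^1 * (r^2 + gamma^2 * (... (r^k + gamma^k * v^k)))."""
    out = []
    for k in range(len(rewards) + 1):
        g = values[k]                      # bootstrap from the value at depth k
        for j in reversed(range(k)):       # fold rewards back towards step 1
            g = rewards[j] + discounts[j] * g
        out.append(g)
    return out

# Toy abstract MRP on a scalar state (assumed purely for illustration).
def toy_core(s):
    return 0.5 * s, s, 0.9                 # next state, reward, discount

def toy_value(s):
    return 2.0 * s

rewards, discounts, values = rollout(1.0, toy_core, toy_value, K=2)
g = preturns(rewards, discounts, values)
print(g)
```

Each entry of `g` is a prediction of the same quantity obtained by planning to a different depth; in the paper these are trained jointly against observed targets.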

Summary

- Hierarchical Reinforcement Learning attempts to identify crucial decisions
- Agents can now use NN-based planning for better decision making
- Research directions
  - Theoretical bounds for the optimal policy of a smaller MDP
  - Learning a smaller MDP with abstract actions
  - End-to-end planning-based Q-network or policy network

Thank You
