The Predictron: End-to-end Learning and Planning Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology December 27, 2016

Page 1: The Predictron: End-to-end Learning and Planning · mlg.postech.ac.kr/~readinglist/slides/20161227.pdf

The Predictron: End-to-end Learning and Planning

Yoonho Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology

December 27, 2016

Hierarchical RL: Motivation

Hierarchical RL

Reward augmentation¹

¹ Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation, NIPS 2016

Hierarchical actions²

² Florensa et al., Stochastic Neural Networks for Hierarchical Reinforcement Learning, under review for ICLR 2017

Model-based video prediction³

³ Oh et al., Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS 2015

Value Iteration Networks

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel

Best paper at NIPS 2016

Value Iteration Networks: Motivation

Value Iteration Networks: Network diagram

Value Iteration Networks: Bellman Operator as a CNN forward pass

Bellman Operator

Every $V$ converges to $V^*$ under the Bellman operator $T$, defined as:

$$(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big\} \tag{1}$$

$$V^* = \lim_{n \to \infty} T^n V \quad \forall V \tag{2}$$
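Equations (1) and (2) can be checked numerically on a toy problem. The following is a minimal sketch, not taken from the slides: the 3-state, 2-action MDP is randomly generated, and the names `bellman_operator` and `value_iteration` are illustrative. It applies T repeatedly from two different starting points and compares where the two runs end up.

```python
import numpy as np


def bellman_operator(V, R, P, gamma):
    """Apply (TV)(s) = max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.

    V: shape (S,); R: shape (S, A); P: shape (A, S, S) with P[a, s, n] = P(n | s, a).
    """
    # Q[s, a] = R[s, a] + gamma * sum_n P[a, s, n] * V[n]
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    return Q.max(axis=1)


def value_iteration(V0, R, P, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V <- TV until successive iterates differ by less than tol."""
    V = V0.copy()
    for _ in range(max_iters):
        V_next = bellman_operator(V, R, P, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical toy MDP (random, for illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)  # make each row a probability distribution
R = rng.random((S, A))

# Two very different initial value functions.
V_a = value_iteration(np.zeros(S), R, P, gamma)
V_b = value_iteration(100.0 * rng.standard_normal(S), R, P, gamma)

print("max difference between the two limits:", np.max(np.abs(V_a - V_b)))
```

Because T is a γ-contraction in the max norm, both runs approach the same fixed point, which is what equation (2) states.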

The Bellman operator can be viewed as a CNN forward pass:

$$(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma\, \mathrm{conv}_{P(a)}(V(s)) \Big\}$$
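When transitions on a grid are local and translation-invariant, the sum over s′ becomes a small convolution over the value map, one kernel per action, followed by a max over the action channel. Below is a minimal NumPy sketch under assumptions that are not in the slides: a 5×5 grid, five deterministic actions (stay, up, down, left, right), one hand-set 3×3 kernel per action, and a reward that depends only on the state. In VIN the reward map and kernels are learned; here they are fixed by hand so the result can be checked.

```python
import numpy as np


def correlate3x3(V, kernel):
    """Zero-padded 3x3 cross-correlation (what deep-learning 'conv' layers compute)."""
    H, W = V.shape
    padded = np.pad(V, 1)  # zeros outside the grid
    out = np.zeros_like(V)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + H, dj:dj + W]
    return out


def shift_kernel(di, dj):
    """Kernel for a deterministic move by (di, dj): all mass on one neighbor."""
    k = np.zeros((3, 3))
    k[1 + di, 1 + dj] = 1.0
    return k


# Hypothetical action set: stay, up, down, left, right.
ACTIONS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
kernels = np.stack([shift_kernel(di, dj) for di, dj in ACTIONS])


def bellman_conv(V, reward, kernels, gamma):
    """One Bellman backup: a conv per action, then max over the action channel."""
    Q = np.stack([reward + gamma * correlate3x3(V, k) for k in kernels])
    return Q.max(axis=0)


H, W, gamma = 5, 5, 0.9
goal = (2, 3)
reward = np.zeros((H, W))
reward[goal] = 1.0  # reward 1 for every step spent in the goal cell

V = np.zeros((H, W))
for _ in range(500):  # K stacked backups; large K approximates V*
    V = bellman_conv(V, reward, kernels, gamma)

print(np.round(V, 3))
```

With these choices the fixed point has a closed form: a cell at Manhattan distance d from the goal has value γ^d / (1 − γ), so the printed map peaks at the goal and decays geometrically away from it.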

Value Iteration Networks: VI Module

Value Iteration Networks: Results

Value Iteration Networks: Summary

• Neural network architecture that plans using value iteration
• Assumes that the state is a sufficient statistic for the reward function
• The small MDP must have a finite state space
• Uses prior knowledge about the environment's structure

The Predictron: End-to-End Learning and Planning

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

Under review for ICLR 2017

Predictron

Predictron: Summary

• End-to-end simulation of an MRP
• Works with arbitrary state space for small MRP
• Small MRP has a non-interpretable state space
• Designed for RL, but does not take actions into account
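The points above can be illustrated with a small rollout of an abstract, action-free MRP. This is a minimal sketch, assuming the accumulation scheme described in the Predictron paper (k-step preturns g^k mixed into a λ-preturn); it is not code from the slides or the authors. All weights are random, untrained placeholders, and every name and size (`predictron_rollout`, `OBS_DIM`, `HID_DIM`, `K`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HID_DIM, K = 8, 4, 3  # hypothetical sizes; K is the rollout depth

# Untrained placeholder weights; a real predictron learns these end-to-end.
W_f = 0.1 * rng.standard_normal((HID_DIM, OBS_DIM))  # observation -> abstract state
W_s = 0.1 * rng.standard_normal((HID_DIM, HID_DIM))  # abstract state transition
w_r = 0.1 * rng.standard_normal(HID_DIM)             # reward head
w_g = 0.1 * rng.standard_normal(HID_DIM)             # discount head
w_l = 0.1 * rng.standard_normal(HID_DIM)             # lambda head
w_v = 0.1 * rng.standard_normal(HID_DIM)             # value head


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predictron_rollout(obs, depth=K):
    """Roll the abstract MRP forward; note that no actions are involved."""
    s = np.tanh(W_f @ obs)  # abstract state, not interpretable
    values = [w_v @ s]      # v^0
    rewards, discounts, lambdas = [], [], []
    for _ in range(depth):
        rewards.append(w_r @ s)             # r^{k+1}
        discounts.append(sigmoid(w_g @ s))  # gamma^{k+1} in (0, 1)
        lambdas.append(sigmoid(w_l @ s))    # lambda^k in (0, 1)
        s = np.tanh(W_s @ s)                # s^{k+1}
        values.append(w_v @ s)              # v^{k+1}
    return rewards, discounts, lambdas, values


def k_step_preturns(rewards, discounts, values):
    """g^k = r^1 + gamma^1 (r^2 + ... + gamma^{k-1} (r^k + gamma^k v^k))."""
    preturns = []
    for k in range(len(values)):
        g = values[k]
        for r, gam in zip(reversed(rewards[:k]), reversed(discounts[:k])):
            g = r + gam * g
        preturns.append(g)
    return preturns


def lambda_preturn(preturns, lambdas):
    """Mix g^0..g^K with weights w^k = (1 - lambda^k) * prod_{j<k} lambda^j."""
    weights, carry = [], 1.0
    for lam in lambdas:
        weights.append((1.0 - lam) * carry)
        carry *= lam
    weights.append(carry)  # the deepest preturn gets the remaining mass
    return float(np.dot(weights, preturns)), weights


obs = rng.standard_normal(OBS_DIM)
rewards, discounts, lambdas, values = predictron_rollout(obs)
preturns = k_step_preturns(rewards, discounts, values)
g_lambda, weights = lambda_preturn(preturns, lambdas)
print("k-step preturns:", preturns)
print("lambda-preturn:", g_lambda)
```

Training would regress the preturns toward observed returns, which is what makes the internal MRP end-to-end: its states, rewards, and discounts only need to produce accurate values, not match the real environment step by step.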

Summary

• Hierarchical Reinforcement Learning attempts to identify crucial decisions
• Agents can now use NN-based planning for better decision making
• Research directions
  • Theoretical bounds for optimal policy of a smaller MDP
  • Learning a smaller MDP with abstract actions
  • End-to-end planning based Q network or policy network

Thank You
