
Report from Dagstuhl Seminar 13321

Reinforcement Learning
Edited by Peter Auer¹, Marcus Hutter², and Laurent Orseau³

1 Montanuniversität Leoben, AT, auer@unileoben.ac.at
2 Australian National University – Canberra, AU, marcus.hutter@anu.edu.au
3 AgroParisTech – Paris, FR, laurent.orseau@agroparistech.fr

Abstract
This Dagstuhl Seminar also stood as the 11th European Workshop on Reinforcement Learning (EWRL11). Reinforcement learning gains more and more attention each year, as can be seen at the various conferences (ECML, ICML, IJCAI, ...). EWRL, and in particular this Dagstuhl Seminar, aimed at gathering people interested in reinforcement learning from all around the globe. This unusual format for EWRL helped viewing the field and discussing topics differently.

Seminar 04.–09. August, 2013 – www.dagstuhl.de/13321
1998 ACM Subject Classification I.2.6 Learning, I.2.8 Problem Solving, Control Methods, and Search, I.2.9 Robotics, I.2.11 Distributed Artificial Intelligence, G.3 Probability and Statistics (Markov processes)

Keywords and phrases Machine Learning, Reinforcement Learning, Markov Decision Processes, Planning

Digital Object Identifier 10.4230/DagRep.3.8.1

1 Executive Summary

Peter Auer, Marcus Hutter, and Laurent Orseau

License Creative Commons BY 3.0 Unported license
© Peter Auer, Marcus Hutter, and Laurent Orseau

Reinforcement Learning (RL) is becoming a very active field of machine learning, and this Dagstuhl Seminar aimed at helping researchers have a broad view of the current state of this field, exchange cross-topic ideas, and present and discuss new trends in RL. It gathered 38 researchers together. Each day was more or less dedicated to one or a few topics, including in particular: the exploration/exploitation dilemma, function approximation and policy search, universal RL, partially observable Markov decision processes (POMDP), inverse RL, and multi-objective RL. This year, by contrast to previous EWRL events, several small tutorials and overviews were presented. It appeared that researchers are nowadays interested in bringing RL to more general and more realistic settings, in particular by alleviating the Markovian assumption, for example so as to be applicable to robots and to a broader class of industrial applications. This trend is consistent with the observed growth of interest in policy search and universal RL. It may also explain why the traditional treatment of the exploration/exploitation dilemma received less attention than expected.

Except where otherwise noted, content of this report is licensed under a Creative Commons BY 3.0 Unported license.

Reinforcement Learning, Dagstuhl Reports, Vol. 3, Issue 8, pp. 1–26
Editors: Peter Auer, Marcus Hutter, and Laurent Orseau

Dagstuhl Reports
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany


2 Table of Contents

Executive Summary
Peter Auer, Marcus Hutter, and Laurent Orseau 1

Overview of Talks

Preference-based Evolutionary Direct Policy Search
Robert Busa-Fekete 4

Solving Simulator-Defined MDPs for Natural Resource Management
Thomas G. Dietterich 4

ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis 6

Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation
Lutz Frommberger 6

Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh 8

Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh 8

Universal Reinforcement Learning
Marcus Hutter 8

Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski 9

State Representation Learning in Robotics
Rico Jonschkowski 10

Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev 10

Theoretical Analysis of Planning with Options
Timothy Mann 11

Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen 11

Online learning in Markov decision processes
Gergely Neu 12

Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann 13

Multi-Objective Reinforcement Learning
Ann Nowé 13

Bayesian Reinforcement Learning + Exploration
Tor Lattimore 14

Knowledge-Seeking Agents
Laurent Orseau 14


Toward a more realistic framework for general reinforcement learning
Laurent Orseau 15

Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning
Ronald Ortner 16

Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau 16

A POMDP Tutorial
Joëlle Pineau 17

Methods for Bellman Error Basis Function construction
Doina Precup 17

Continual Learning
Mark B. Ring 17

Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel 18

Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner 19

Deterministic Policy Gradients
David Silver 20

Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez 20

Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag 21

The Quest for the Ultimate TD(λ)
Richard S. Sutton 21

Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo 22

Universal RL applications and approximations
Joel Veness 23

Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt 24

Schedule 24

Participants 26


3 Overview of Talks

3.1 Preference-based Evolutionary Direct Policy Search
Robert Busa-Fekete (Universität Marburg, DE)

License Creative Commons BY 3.0 Unported license
© Robert Busa-Fekete

We introduce a preference-based extension of evolutionary direct policy search (EDPS) as proposed by Heidrich-Meisner and Igel [2]. EDPS casts policy learning as a search problem in a parametric policy space, where the function to be optimized is a performance measure like expected total reward, and evolution strategies (ES) such as CMA-ES [1] are used as optimizers. Moreover, since the evaluation of a policy can only be done approximately, namely in terms of a finite number of rollouts, the authors make use of racing algorithms [3] to control this number in an adaptive manner. These algorithms return a sufficiently reliable ranking over the current set of policies (candidate solutions), which is then used by the ES for updating its parameters and population. A key idea of our approach is to extend EDPS by replacing the value-based racing algorithm with a preference-based one that operates on a suitable ordinal preference structure and only uses pairwise comparisons between sample rollouts of the policies.

References
1 N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature – PPSN VIII, pages 282–291, 2004.
2 V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, pages 401–408, 2009.
3 O. Maron and A. W. Moore. Hoeffding races: accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.
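To make the preference-based racing step concrete, here is a minimal sketch (assuming hypothetical `rollout` and `prefer` callables) of a race that ranks candidate policies using only pairwise comparisons of sample rollouts with a Hoeffding-style elimination rule; the budget and threshold are illustrative, not the construction analysed in the talk.

```python
import math
import random

def preference_race(policies, rollout, prefer, budget=200, delta=0.05):
    """Rank candidate policies using only pairwise preference queries between
    sample rollouts (value-free racing; illustrative sketch, not EDPS itself)."""
    n = len(policies)
    wins = [[0] * n for _ in range(n)]     # wins[i][j]: rollouts of i preferred over j
    trials = [[0] * n for _ in range(n)]
    active = set(range(n))
    for _ in range(budget):
        if len(active) < 2:
            break
        i, j = random.sample(sorted(active), 2)
        if prefer(rollout(policies[i]), rollout(policies[j])):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
        trials[i][j] += 1
        trials[j][i] += 1
        # Hoeffding-style elimination: drop a candidate that some rival beats
        # significantly more often than chance.
        eliminated = set()
        for a in active:
            for b in active:
                t = trials[a][b]
                if a == b or t == 0:
                    continue
                p_hat = wins[b][a] / t
                bound = math.sqrt(math.log(2.0 / delta) / (2.0 * t))
                if p_hat - bound > 0.5:
                    eliminated.add(a)
                    break
        if eliminated != active:           # never eliminate everything
            active -= eliminated
    # Surviving candidates, ordered by empirical preference wins.
    return sorted(active, key=lambda a: -sum(wins[a]))
```

The value-based race of EDPS would instead compare empirical mean returns; the point of the preference-based variant is that only the ordinal outcome of each pairwise comparison is needed.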

3.2 Solving Simulator-Defined MDPs for Natural Resource Management
Thomas G. Dietterich (Oregon State University, US)

License Creative Commons BY 3.0 Unported license
© Thomas G. Dietterich
Joint work of Dietterich, Thomas G.; Taleghan, Majid Alkaee; Crowley, Mark
Main reference T. G. Dietterich, M. Alkaee Taleghan, M. Crowley, "PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs", in Proc. of the AAAI Conference on Artificial Intelligence (AAAI'13), AAAI Press, 2013.
URL http://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/view/6478
URL http://web.engr.oregonstate.edu/~tgd/publications/dietterich-taleghan-crowley-pac-optimal-planning-for-invasive-species-management-etc-aaai2013.pdf

Natural resource management problems, such as forestry, fisheries, and water resources, can be formulated as Markov decision processes. However, solving them is difficult for two reasons. First, the dynamics of the system are typically available only in the form of a complex and expensive simulator. This means that MDP planning algorithms are needed that minimize the number of calls to the simulator. Second, the systems are spatial. A natural way to formulate the MDP is to divide up the region into cells, where each cell is modeled with a small number of state variables. Actions typically operate at the level of individual cells, but the spatial dynamics couple the states of spatially-adjacent cells. The resulting state and action spaces of these MDPs are immense. We have been working on two natural resource MDPs. The first involves the spread of tamarisk in river networks. A native of the Middle East, tamarisk has become an invasive plant in the dryland rivers and streams of the western US. Given a tamarisk invasion, a land manager must decide how and where to fight the invasion (e.g., eradicate tamarisk plants, plant native plants, upstream, downstream). Although approximate or heuristic solutions to this problem would be useful, our collaborating economists tell us that our policy recommendations will carry more weight if they are provably optimal with high probability. A large problem instance involves 4.7 × 10^6 states with 2187 actions in each state. On a modern 64-bit machine, the action-value function for this problem can fit into main memory. However, computing the full transition function to sufficient accuracy to support standard value iteration requires on the order of 3 × 10^20 simulator calls. The second problem concerns the management of wildfire in Eastern Oregon. In this region, prior to European settlement, the native ponderosa pine forests were adapted to frequent, low-intensity fires. These fires allow the ponderosa pine trees (which are well-adapted to survive fire) to grow very tall while preventing the accumulation of fuel at ground level. These trees provide habitat for many animal species and are also very valuable for timber. However, beginning in the early 1900s, all fires were suppressed in this landscape, which has led to the build-up of huge amounts of fuel. The result has been large, catastrophic fires that kill even the ponderosa trees and that are exceptionally expensive to control. The goal of fire management is to return the landscape to a state where frequent, low-intensity fires are again the normal behavior. There are two concrete fire management problems: Let Burn (decide which fires to suppress) and Fuel Treatment (decide in which cells to perform fuel reduction treatments). Note that in these problems, the system begins in an unusual non-equilibrium state, and the goal is to return the system to a desired steady-state distribution. Hence, these problems are not problems of reinforcement learning but rather problems of MDP planning for a specific start state. Many of the assumptions in RL papers, such as ergodicity of all policies, are not appropriate for this setting. Note also that it is highly desirable to produce a concrete policy (as opposed to just producing near-optimal behavior via receding horizon control). A concrete policy can be inspected by stakeholders to identify missing constraints, state variables, and components of the reward function. To solve these problems, we are exploring two lines of research. For tamarisk, we have been building on recent work in PAC-RL algorithms (e.g., MBIE, UCRL, UCRL2, FRTDP, OP) to develop PAC-MDP planning algorithms. We are pursuing two innovations. First, we have developed an exploration heuristic based on an upper bound on the discounted state occupancy probability. Second, we are developing tighter confidence intervals in order to terminate the search earlier. These are based on combining Good-Turing estimates of missing mass (i.e., for unseen outcomes) with sequential confidence intervals for multinomial distributions. These reduce the degree to which we must rely on the union bound and hence give us tighter convergence. For the Let Burn wildfire problem, we are exploring approximate policy iteration methods. For Fuel Treatment, we are extending Crowley's Equilibrium Policy Gradient methods. These define a local policy function that stochastically chooses the action for cell i based on the actions already chosen for the cells in the surrounding neighborhood. A Gibbs-sampling-style MCMC method repeatedly samples from these local policies until a global equilibrium is reached. This equilibrium defines the global policy. At equilibrium, gradient estimates can be computed and applied to improve the policy.
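To make the Good-Turing ingredient concrete, the sketch below estimates the missing mass (the probability of transition outcomes never observed for one state-action pair) and adds an illustrative confidence slack term; the sequential confidence intervals used in this work are more refined, so treat the slack as a placeholder.

```python
import math
from collections import Counter

def good_turing_missing_mass(outcome_counts, delta=0.05):
    """For one state-action pair, estimate the probability mass of next states
    that have never been observed, plus an illustrative upper confidence bound.
    outcome_counts: Counter mapping next state -> number of times observed."""
    n = sum(outcome_counts.values())                         # simulator calls so far
    if n == 0:
        return 1.0, 1.0                                      # nothing observed yet
    n1 = sum(1 for c in outcome_counts.values() if c == 1)   # outcomes seen exactly once
    gt_estimate = n1 / n                                     # Good-Turing estimate
    slack = math.sqrt(math.log(1.0 / delta) / n)             # illustrative O(sqrt(log(1/delta)/n)) term
    return gt_estimate, min(1.0, gt_estimate + slack)

# Usage sketch: 100 simulated transitions from one (state, action) pair.
counts = Counter({"cell_A": 60, "cell_B": 30, "cell_C": 9, "cell_D": 1})
estimate, upper = good_turing_missing_mass(counts)
print(f"missing mass ~ {estimate:.3f}, upper bound {upper:.3f}")
```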


3.3 ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis (EPFL – Lausanne, CH)

License Creative Commons BY 3.0 Unported license
© Christos Dimitrakakis
Joint work of Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning", arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.
URL http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on methods for Bayesian reinforcement learning using Thompson sampling, but differing significantly in their prior. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.

References
1 N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809.
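A minimal sketch of the ABC idea described above: draw simulator parameters from the prior, keep those whose simulated summary statistics fall within a tolerance of the observed history, and act by planning in a sampled accepted model (Thompson-sampling style). All of `prior_sample`, `simulate`, `summary`, `distance`, and `plan` are hypothetical placeholders supplied by the user.

```python
def abc_posterior_sample(history, prior_sample, simulate, summary, distance,
                         epsilon=0.5, max_tries=10_000):
    """Likelihood-free (ABC) rejection sampling: return a simulator parameter
    whose simulated data is close to the observed history in summary statistics."""
    target = summary(history)
    for _ in range(max_tries):
        theta = prior_sample()                      # candidate simulator from the prior
        synthetic = simulate(theta, len(history))   # roll it out for |history| steps
        if distance(summary(synthetic), target) <= epsilon:
            return theta                            # accepted: approximate posterior draw
    raise RuntimeError("no accepted sample; increase epsilon or max_tries")

def abc_rl_step(history, prior_sample, simulate, summary, distance, plan):
    """One Thompson-sampling-style decision: plan in a model drawn from the
    ABC approximation to the posterior over simulators."""
    theta = abc_posterior_sample(history, prior_sample, simulate, summary, distance)
    return plan(theta)    # e.g. run value iteration or LSPI inside the sampled simulator
```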

3.4 Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation
Lutz Frommberger (Universität Bremen, DE)

License Creative Commons BY 3.0 Unported license
© Lutz Frommberger
Main reference L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning", Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.
URL http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success of prior learning tasks. This is something that a learning organism does implicitly from birth on; no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong-learning-type setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is most likely given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In very common words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, may it be in-task (i.e., generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, for example, are defined a-priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example, general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow to reuse knowledge of the actual learning task if certain state variables are identical. This works if structural information is made part of the state space, and it does not require a mapping function. However, it needs a-priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g., [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representation, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248, Springer, 2012.
4 Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License Creative Commons BY 3.0 Unported license
© Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
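A minimal sketch of a variance-penalized policy-gradient update in the spirit of the criteria above, using plain Monte-Carlo (likelihood-ratio) estimates rather than the actor-critic estimators from the talk; `sample_episode` and `grad_log_policy` are hypothetical placeholders supplied by the user. It relies on Var(R) = E[R²] − (E[R])², so ∇Var(R) = ∇E[R²] − 2 E[R] ∇E[R], with both expectations estimated by REINFORCE-style samples.

```python
import numpy as np

def variance_penalized_pg_update(theta, sample_episode, grad_log_policy,
                                 lam=0.5, step=0.01, n_episodes=50):
    """One gradient-ascent step on J(theta) - lam * Var(R), where R is the
    episode return (illustrative Monte-Carlo sketch, not the actor-critic method)."""
    returns, score_sums = [], []
    for _ in range(n_episodes):
        trajectory, R = sample_episode(theta)        # list of (s, a) pairs, total return
        score = sum(grad_log_policy(theta, s, a) for s, a in trajectory)
        returns.append(R)
        score_sums.append(score)
    returns = np.array(returns)
    scores = np.array(score_sums)                    # shape (n_episodes, dim(theta))
    J = returns.mean()
    grad_J = (returns[:, None] * scores).mean(axis=0)           # grad of E[R]
    grad_M = ((returns ** 2)[:, None] * scores).mean(axis=0)    # grad of E[R^2]
    grad_var = grad_M - 2.0 * J * grad_J                        # grad of Var(R)
    return theta + step * (grad_J - lam * grad_var)
```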

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License Creative Commons BY 3.0 Unported license
© Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e., the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning
Marcus Hutter (Australian National University, AU)

License Creative Commons BY 3.0 Unported license
© Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical theory of machine super-intelligence. More precisely, AIXI is an elegant parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations, such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.
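For reference, a hedged sketch of the AIXI action-selection rule from [1] in its finite-horizon form (a: actions, o: observations, r: rewards, q: programs for a universal monotone Turing machine U, ℓ(q): program length, m: horizon); exact horizon and normalisation details vary across presentations.

```latex
a_t \doteq \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
\big[ r_t + \cdots + r_m \big]
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```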

3.8 Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski (TU Berlin, DE)

License Creative Commons BY 3.0 Unported license
© Rico Jonschkowski
Joint work of Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that from such trajectory data we may learn temporally marginalized transition probabilities – which we call proximity statistics – that model possible transitions on larger time scales, rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration, while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired from recent work in transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation and Approximation, pp. 212–223, 2002.
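A minimal sketch of how proximity statistics could be estimated from trajectories, sparsified by thresholding, and then used greedily as state values, under the simplifying assumptions of a discrete state space and deterministic successors; this is an illustration of the idea, not the algorithm evaluated in the talk.

```python
from collections import defaultdict

def estimate_proximity(trajectories, horizon=10, gamma=0.9):
    """Temporally marginalized transition statistics: prox[s][s2] aggregates
    discounted counts of reaching s2 within `horizon` steps of visiting s."""
    prox = defaultdict(lambda: defaultdict(float))
    visits = defaultdict(int)
    for traj in trajectories:                      # traj: list of states
        for i, s in enumerate(traj):
            visits[s] += 1
            for k, s_next in enumerate(traj[i + 1:i + 1 + horizon], start=1):
                prox[s][s_next] += gamma ** k
    return {s: {t: v / visits[s] for t, v in row.items()} for s, row in prox.items()}

def sparsify(prox, threshold=0.05):
    """Drop small entries to keep the statistics compact (naive sparsification)."""
    return {s: {t: v for t, v in row.items() if v >= threshold}
            for s, row in prox.items()}

def greedy_action(state, goal, actions, successor, prox):
    """Pick the action whose (deterministic, for simplicity) successor has the
    highest proximity to the goal: proximity statistics used as state values."""
    return max(actions, key=lambda a: prox.get(successor(state, a), {}).get(goal, 0.0))
```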


3.9 State Representation Learning in Robotics
Rico Jonschkowski (TU Berlin, DE)

License Creative Commons BY 3.0 Unported license
© Rico Jonschkowski
Joint work of Jonschkowski, Rico; Brock, Oliver
Main reference R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.
URL http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation – a mapping from high-dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers – thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations, following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
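A minimal sketch of the two learning objectives named above (slowness and predictability) for a linear observation-to-state mapping, written with NumPy; the talk's approach implements them in a neural network and may weight and formalize the terms differently, so the exact losses below are assumptions.

```python
import numpy as np

def slowness_loss(phi, observations):
    """Slowness: consecutive observations should map to nearby states."""
    states = observations @ phi                       # (T, d_state) linear mapping
    return np.mean(np.sum((states[1:] - states[:-1]) ** 2, axis=1))

def predictability_loss(phi, A, B, observations, actions):
    """Predictability: the next state should be a (here: linear) function of
    the current state and action, s_{t+1} ~ A s_t + B a_t."""
    states = observations @ phi
    predicted = states[:-1] @ A.T + actions[:-1] @ B.T
    return np.mean(np.sum((states[1:] - predicted) ** 2, axis=1))

def total_loss(phi, A, B, observations, actions, w_slow=1.0, w_pred=1.0):
    return (w_slow * slowness_loss(phi, observations)
            + w_pred * predictability_loss(phi, A, B, observations, actions))

# Usage sketch with random data: 100 steps of 64-dim observations, 2-dim actions.
rng = np.random.default_rng(0)
obs, acts = rng.normal(size=(100, 64)), rng.normal(size=(100, 2))
phi = rng.normal(size=(64, 3)) * 0.1      # observation -> 3-dim state
A, B = rng.normal(size=(3, 3)) * 0.1, rng.normal(size=(3, 2)) * 0.1
print(total_loss(phi, A, B, obs, acts))
```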

3.10 Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License Creative Commons BY 3.0 Unported license
© Petar Kormushev
Joint work of Kormushev, Petar; Caldwell, Darwin G.
Main reference P. Kormushev, D. G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.
URL http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task, and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e., conduct trials and evaluate the return) for each of them, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options
Timothy Mann (Technion – Haifa, IL)

License Creative Commons BY 3.0 Unported license
© Timothy Mann
Joint work of Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.

3.12 Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen (Universität Bremen, DE)

License Creative Commons BY 3.0 Unported license
© Jan Hendrik Metzen
Joint work of Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector, and likewise a skill is considered as a parameterized policy. We propose skill templates, which allow to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2] and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires to generalize to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.

References
1 B. C. da Silva, G. Konidaris, and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C. E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
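A minimal sketch of the skill-template idea of mapping task parameters to policy parameters with an uncertainty estimate, using scikit-learn's Gaussian process regression; the task parameters and "DMP weights" below are random placeholders, not data from the PA10 experiment.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder training data: task parameters (e.g. target positions) ->
# policy parameters (e.g. DMP weights) learned by RL on a few training tasks.
rng = np.random.default_rng(0)
task_params = rng.uniform(-1.0, 1.0, size=(8, 2))        # 8 training tasks, 2-D targets
policy_params = np.hstack([np.sin(task_params), task_params ** 2])  # fake 4-D DMP weights

# One independent GP per policy-parameter dimension.
kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
gps = [GaussianProcessRegressor(kernel=kernel, normalize_y=True)
           .fit(task_params, policy_params[:, d])
       for d in range(policy_params.shape[1])]

# Generalize to a new task: the predicted mean initializes the policy, while the
# predictive std can widen the search distribution of the subsequent RL adaptation.
new_task = np.array([[0.3, -0.7]])
means_stds = [gp.predict(new_task, return_std=True) for gp in gps]
init_policy = np.array([m[0] for m, _ in means_stds])
uncertainty = np.array([s[0] for _, s in means_stds])
print("initial policy parameters:", init_policy)
print("per-dimension uncertainty:", uncertainty)
```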

3.13 Online learning in Markov decision processes
Gergely Neu (Budapest University of Technology & Economics, HU)

License Creative Commons BY 3.0 Unported license
© Gergely Neu
Joint work of Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS, or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of O(√(L |X| |A| T log(|X| |A|/L))) in the bandit setting and O(L √(T log(|X| |A|/L))) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License Creative Commons BY 3.0 Unported license
© Gerhard Neumann
Joint work of Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distributions, where the relative entropy is used as 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
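A hedged sketch of the information-theoretic (REPS-style) constraint underlying the approach: maximize expected reward over trajectory distributions π while bounding the relative entropy to the previously used distribution q by ε (the full formulation in the talk's work adds further constraints, e.g. for option selection and situation adaptation).

```latex
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big]
\quad \text{s.t.} \quad
\mathrm{KL}(\pi \,\|\, q) = \sum_{\tau} \pi(\tau) \log \frac{\pi(\tau)}{q(\tau)} \le \epsilon,
\qquad \sum_{\tau} \pi(\tau) = 1 .
```

The closed-form solution mentioned in the abstract is the exponential reweighting π(τ) ∝ q(τ) exp(R(τ)/η), where η is the Lagrange multiplier of the KL constraint.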

3.15 Multi-Objective Reinforcement Learning
Ann Nowé (Free University of Brussels, BE)

License Creative Commons BY 3.0 Unported license
© Ann Nowé
Joint work of Nowé, Ann; Van Moffaert, Kristof; Drugan, Madalina M.
Main reference K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS, Vol. 7811, pp. 352–366, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary, or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e., either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually; therefore a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto dominating future discounted reward vector separately. Hence, these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).
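A minimal sketch of the scalarization stream mentioned above: a Q-table holding one value per objective, combined at action-selection time by a linear or Chebyshev scalarization function. The weight vector w and utopian point z* are user-chosen parameters, and the componentwise update is an illustrative simplification of the methods discussed in the talk.

```python
import numpy as np

def linear_scalarize(q_vec, w):
    return float(np.dot(w, q_vec))

def chebyshev_scalarize(q_vec, w, z_star):
    # Chebyshev: minimize the weighted distance to an utopian point z*
    # (negated so that "larger is better", like a value).
    return -float(np.max(w * np.abs(np.asarray(z_star) - np.asarray(q_vec))))

def select_action(Q, state, w, z_star=None, scalarization="linear"):
    """Q[state][action] is a reward vector (one entry per objective)."""
    def score(a):
        if scalarization == "linear":
            return linear_scalarize(Q[state][a], w)
        return chebyshev_scalarize(Q[state][a], w, z_star)
    return max(Q[state], key=score)

def update(Q, s, a, reward_vec, s_next, w, alpha=0.1, gamma=0.95):
    """Q-learning update applied componentwise to the Q-vectors, with the greedy
    next action chosen through the (linear, by default) scalarization."""
    a_next = select_action(Q, s_next, w)
    target = np.asarray(reward_vec) + gamma * np.asarray(Q[s_next][a_next])
    Q[s][a] = (1 - alpha) * np.asarray(Q[s][a]) + alpha * target
```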

3.16 Bayesian Reinforcement Learning + Exploration
Tor Lattimore (Australian National University – Canberra, AU)

License Creative Commons BY 3.0 Unported license
© Tor Lattimore
Joint work of Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.

3.17 Knowledge-Seeking Agents
Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license
© Laurent Orseau
Joint work of Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment, in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning
Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license
© Laurent Orseau
Joint work of Orseau, Laurent; Ring, Mark
Main reference L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or can merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer Berlin Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning
Ronald Ortner (Montanuniversität Leoben, AT)

License Creative Commons BY 3.0 Unported license
© Ronald Ortner
Joint work of Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs, which allows to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license
© Joëlle Pineau
Joint work of Barreto, André M. S.; Precup, Doina; Pineau, Joëlle
Main reference A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work, we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction; it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
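A minimal sketch of why a stochastic factorization helps: if the n×n transition model factors as P = DK with row-stochastic D (n×m) and K (m×n), policy evaluation can be iterated in the m-dimensional space through the reduced matrix KD and then mapped back, so the size of the model used in the iteration stays fixed at m. The matrices below are random placeholders rather than the kernel-based construction of KBSF.

```python
import numpy as np

def random_stochastic(rows, cols, rng):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)      # rows sum to 1

rng = np.random.default_rng(0)
n, m = 1000, 30                                  # n sampled states, m representative states
D = random_stochastic(n, m, rng)                 # n x m, row-stochastic
K = random_stochastic(m, n, rng)                 # m x n, row-stochastic
r = rng.random(n)                                # per-state reward (single action, for brevity)
gamma = 0.95

# Evaluating v = r + gamma * (D K) v directly costs O(n^2) per sweep.
# With the factorization we iterate in the m-dimensional space instead:
# u = K r + gamma * (K D) u, then map back via v = r + gamma * D u.
P_small = K @ D                                  # m x m reduced "transition" matrix
u = np.zeros(m)
for _ in range(2000):
    u = K @ r + gamma * P_small @ u
v = r + gamma * D @ u                            # value function on the original n states
print(v[:5])
```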

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license
© Joëlle Pineau
URL http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license
© Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
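A minimal sketch of the core BEBF idea: fit the Bellman residual of the current value estimate as a function of the raw state and add it as a new basis function. The regression model and the crude least-squares value fit below are illustrative simplifications; the talk's methods use random projections and orthogonal matching pursuit instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def add_bebf(states, basis, rewards, next_states, gamma=0.99):
    """One BEBF step: fit the Bellman residual of the current linear value
    estimate as a function of the raw state, and return the enlarged basis.
    states, next_states: (T, d) raw observations; basis: list of callables
    mapping a (T, d) array to a (T,) feature column."""
    phi = np.column_stack([f(states) for f in basis])          # (T, k)
    phi_next = np.column_stack([f(next_states) for f in basis])
    w = np.linalg.lstsq(phi, rewards, rcond=None)[0]           # crude value fit
    residual = rewards + gamma * phi_next @ w - phi @ w        # Bellman error at the samples
    new_feature = DecisionTreeRegressor(max_depth=4).fit(states, residual)
    basis.append(lambda s, m=new_feature: m.predict(np.atleast_2d(s)))
    return basis
```

Starting, say, from a constant feature, repeatedly calling `add_bebf` grows the basis; each added feature approximates what the current approximation still gets wrong.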

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 3.0 Unported license
© Mark B. Ring
Joint work of Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License Creative Commons BY 3.0 Unported license
© Manuela Ruiz-Montiel
Joint work of Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo" [PQ-Learning: Multi-Objective Reinforcement Learning], in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities), and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a): S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i: v_i > w_i and ∄j: v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(⋃_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011
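To make the set-based update concrete, here is a minimal illustrative sketch in Python (not the authors' implementation) of Pareto dominance and the ND operator that replaces the scalar max in PQ-learning; the example Q-sets below are invented.

    def dominates(v, w):
        # v Pareto-dominates w: no component is worse and at least one is strictly better
        return all(vi >= wi for vi, wi in zip(v, w)) and any(vi > wi for vi, wi in zip(v, w))

    def nd(vectors):
        # keep only the non-dominated reward vectors (the Pareto front)
        vs = [tuple(v) for v in vectors]
        return {v for v in vs if not any(dominates(w, v) for w in vs if w != v)}

    # Toy vector-valued Q-sets for two successor actions (invented numbers):
    q_next = {'left': [(1.0, 0.2)], 'right': [(0.4, 0.9), (0.3, 0.1)]}
    union = [v for vectors in q_next.values() for v in vectors]
    print(nd(union))   # {(1.0, 0.2), (0.4, 0.9)}; (0.3, 0.1) is dominated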

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner
Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided methods for performing exact dynamic programming backups in such problems. This overview talk surveys a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver
Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
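As a purely illustrative sketch of the idea (a toy problem invented here, not the algorithm evaluated in the talk), the following snippet follows the deterministic policy gradient, i.e. the gradient of a known action-value function chained through a deterministic policy:

    # Toy setup: 1-D state, 1-D action, linear policy mu_theta(s) = theta * s,
    # and a known quadratic critic Q(s, a) = -(a - 2*s)**2, so the optimum is theta = 2.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.0          # policy parameter
    alpha = 0.1          # step size

    def dQ_da(s, a):     # gradient of the critic with respect to the action
        return -2.0 * (a - 2.0 * s)

    for step in range(200):
        s = rng.uniform(-1.0, 1.0)    # state sampled from the behaviour distribution
        a = theta * s                 # deterministic policy
        # DPG update: dmu/dtheta * dQ/da evaluated at a = mu_theta(s)
        grad = s * dQ_da(s, a)
        theta += alpha * grad

    print(round(theta, 2))   # converges near 2.0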

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez
Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step, we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCSs) [4]. However, they are more appropriate for online settings due to their estimate-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].

References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C.E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C.E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag
Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG]; also in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
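For readers unfamiliar with the optimism principle mentioned above, the following sketch shows its most standard incarnation, the UCB1 index for multi-armed bandits (a textbook illustration, not material from the talk itself); the Bernoulli arms are invented.

    import math, random

    def ucb1(pull, n_arms, horizon):
        counts = [0] * n_arms
        means = [0.0] * n_arms
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= n_arms:                      # play each arm once to initialise
                arm = t - 1
            else:                                # optimistic index: mean + exploration bonus
                arm = max(range(n_arms),
                          key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            r = pull(arm)
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]
            total += r
        return total

    # Toy Bernoulli bandit with success probabilities 0.2, 0.5, 0.7 (made up for the demo).
    probs = [0.2, 0.5, 0.7]
    random.seed(1)
    total = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, len(probs), 5000)
    print(total / 5000)   # empirical mean reward approaches the best arm's 0.7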

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton
Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 6 pp., 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.
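As a concrete reference point for the discussion, here is a minimal sketch of tabular TD(λ) with accumulating eligibility traces on an invented five-state random-walk prediction task (the standard textbook form, not the generalized algorithm the talk aims at):

    import random

    n_states, gamma, lam, alpha = 5, 1.0, 0.9, 0.1
    V = [0.0] * n_states

    def episode():
        z = [0.0] * n_states          # eligibility traces
        s = n_states // 2             # start in the middle
        while True:
            s2 = s + random.choice((-1, 1))
            r = 1.0 if s2 == n_states else 0.0        # reward only for exiting right
            done = s2 < 0 or s2 == n_states
            target = r + (0.0 if done else gamma * V[s2])
            delta = target - V[s]
            z[s] += 1.0               # accumulate trace for the visited state
            for i in range(n_states): # propagate the TD error backwards along the traces
                V[i] += alpha * delta * z[i]
                z[i] *= gamma * lam
            if done:
                return
            s = s2

    random.seed(0)
    for _ in range(2000):
        episode()
    print([round(v, 2) for v in V])   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]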

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions (for example, dealing with scarce human feedback), new types of experimentation (for example, incorporating visual feedback and physical manipulation), and new types of abstraction levels, such as probabilistic programming languages. I will present first steps towards a more tight integration of relational vision and relational action for interactive learning settings (see also the IJCAI Workshop on Machine Learning for Interactive Systems, http://mlis-workshop.org/2013). In addition, I present several new problem domains.

References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL: applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness
Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL: applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional, continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning:
- Self-introduction of participants
- Peter Sunehag: Tutorial on exploration/exploitation
Afternoon:
- Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
- Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
- Ronald Ortner: Continuous RL and restless bandits
- Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning:
- M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
- Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
- Doina Precup: Methods for Bellman Error Basis Function construction
- Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon:
- Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
- Mark B. Ring: Continual learning
- Rich Sutton: The quest for the ultimate TD algorithm
- Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
- Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
- Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning:
- Marcus Hutter: Tutorial on Universal RL
- Joel Veness: Universal RL – Applications and Approximations
- Laurent Orseau: Optimal Universal Explorative Agents
- Tor Lattimore: Bayesian Reinforcement Learning + Exploration
- Laurent Orseau: More realistic assumptions for RL
Afternoon:
- Scott Sanner: Tutorial on Symbolic Dynamic Programming
- Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
- Timothy Mann: Theoretical Analysis of Planning with Options
- Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
- David Silver: Deterministic Policy Gradients

Thursday
Morning:
- Joëlle Pineau: Tutorial on POMDPs
- Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
- Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
- Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
- Will Uther: Tree-based MDP – PSR
- Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning:
- Christos Dimitrakakis: ABC and cover-tree reinforcement learning
- Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
- Ann Nowé: Multi-Objective Reinforcement Learning
- Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
- Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montanuniversität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National University, AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National University, AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montanuniversität Leoben, AT
Joëlle Pineau, McGill University – Montreal, CA
Doina Precup, McGill University – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi University – Istanbul, TR
Peter Sunehag, Australian National University, AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud University Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

2 Table of Contents

Executive Summary (Peter Auer, Marcus Hutter, and Laurent Orseau) 1

Overview of Talks
Preference-based Evolutionary Direct Policy Search (Robert Busa-Fekete) 4
Solving Simulator-Defined MDPs for Natural Resource Management (Thomas G. Dietterich) 4
ABC and Cover Tree Reinforcement Learning (Christos Dimitrakakis) 6
Some thoughts on Transfer Learning in Reinforcement Learning: on States and Representation (Lutz Frommberger) 6
Actor-Critic Algorithms for Risk-Sensitive MDPs (Mohammad Ghavamzadeh) 8
Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming (Mohammad Ghavamzadeh) 8
Universal Reinforcement Learning (Marcus Hutter) 8
Temporal Abstraction by Sparsifying Proximity Statistics (Rico Jonschkowski) 9
State Representation Learning in Robotics (Rico Jonschkowski) 10
Reinforcement Learning with Heterogeneous Policy Representations (Petar Kormushev) 10
Theoretical Analysis of Planning with Options (Timothy Mann) 11
Learning Skill Templates for Parameterized Tasks (Jan Hendrik Metzen) 11
Online learning in Markov decision processes (Gergely Neu) 12
Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search (Gerhard Neumann) 13
Multi-Objective Reinforcement Learning (Ann Nowé) 13
Bayesian Reinforcement Learning + Exploration (Tor Lattimore) 14
Knowledge-Seeking Agents (Laurent Orseau) 14
Toward a more realistic framework for general reinforcement learning (Laurent Orseau) 15
Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning (Ronald Ortner) 16
Reinforcement Learning using Kernel-Based Stochastic Factorization (Joëlle Pineau) 16
A POMDP Tutorial (Joëlle Pineau) 17
Methods for Bellman Error Basis Function construction (Doina Precup) 17
Continual Learning (Mark B. Ring) 17
Multi-objective Reinforcement Learning (Manuela Ruiz-Montiel) 18
Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs (Scott Sanner) 19
Deterministic Policy Gradients (David Silver) 20
Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration (Orhan Sönmez) 20
Exploration versus Exploitation in Reinforcement Learning (Peter Sunehag) 21
The Quest for the Ultimate TD(λ) (Richard S. Sutton) 21
Relations between Reinforcement Learning, Visual Input, Perception and Action (Martijn van Otterlo) 22
Universal RL: applications and approximations (Joel Veness) 23
Learning and Reasoning with POMDPs in Robots (Jeremy Wyatt) 24

Schedule 24

Participants 26


3 Overview of Talks

3.1 Preference-based Evolutionary Direct Policy Search
Robert Busa-Fekete (Universität Marburg, DE)

License: Creative Commons BY 3.0 Unported license © Robert Busa-Fekete

We introduce a preference-based extension of evolutionary direct policy search (EDPS) as proposed by Heidrich-Meisner and Igel [2]. EDPS casts policy learning as a search problem in a parametric policy space, where the function to be optimized is a performance measure like expected total reward, and evolution strategies (ES) such as CMA-ES [1] are used as optimizers. Moreover, since the evaluation of a policy can only be done approximately, namely in terms of a finite number of rollouts, the authors make use of racing algorithms [3] to control this number in an adaptive manner. These algorithms return a sufficiently reliable ranking over the current set of policies (candidate solutions), which is then used by the ES for updating its parameters and population. A key idea of our approach is to extend EDPS by replacing the value-based racing algorithm with a preference-based one that operates on a suitable ordinal preference structure and only uses pairwise comparisons between sample rollouts of the policies.

References
1 N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature – PPSN VIII, pages 282–291, 2004.
2 V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, pages 401–408, 2009.
3 O. Maron and A.W. Moore. Hoeffding races: accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.

3.2 Solving Simulator-Defined MDPs for Natural Resource Management
Thomas G. Dietterich (Oregon State University, US)

License: Creative Commons BY 3.0 Unported license © Thomas G. Dietterich
Joint work of: Dietterich, Thomas G.; Alkaee Taleghan, Majid; Crowley, Mark
Main reference: T.G. Dietterich, M. Alkaee Taleghan, M. Crowley, "PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs", in Proc. of the AAAI Conference on Artificial Intelligence (AAAI'13), AAAI Press, 2013.
URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/view/6478
URL: http://web.engr.oregonstate.edu/~tgd/publications/dietterich-taleghan-crowley-pac-optimal-planning-for-invasive-species-management-etc-aaai2013.pdf

Natural resource management problems, such as forestry, fisheries, and water resources, can be formulated as Markov decision processes. However, solving them is difficult for two reasons. First, the dynamics of the system are typically available only in the form of a complex and expensive simulator. This means that MDP planning algorithms are needed that minimize the number of calls to the simulator. Second, the systems are spatial. A natural way to formulate the MDP is to divide up the region into cells, where each cell is modeled with a small number of state variables. Actions typically operate at the level of individual cells, but the spatial dynamics couple the states of spatially-adjacent cells. The resulting state and action spaces of these MDPs are immense. We have been working on two natural resource MDPs. The first involves the spread of tamarisk in river networks. A native of the Middle East, tamarisk has become an invasive plant in the dryland rivers and streams of the western US. Given a tamarisk invasion, a land manager must decide how and where to fight the invasion (e.g., eradicate tamarisk plants, plant native plants upstream or downstream). Although approximate or heuristic solutions to this problem would be useful, our collaborating economists tell us that our policy recommendations will carry more weight if they are provably optimal with high probability. A large problem instance involves 4.7 × 10^6 states with 2187 actions in each state. On a modern 64-bit machine, the action-value function for this problem can fit into main memory. However, computing the full transition function to sufficient accuracy to support standard value iteration requires on the order of 3 × 10^20 simulator calls. The second problem concerns the management of wildfire in Eastern Oregon. In this region, prior to European settlement, the native ponderosa pine forests were adapted to frequent low-intensity fires. These fires allow the ponderosa pine trees (which are well-adapted to survive fire) to grow very tall, while preventing the accumulation of fuel at ground level. These trees provide habitat for many animal species and are also very valuable for timber. However, beginning in the early 1900s, all fires were suppressed in this landscape, which has led to the build up of huge amounts of fuel. The result has been large catastrophic fires that kill even the ponderosa trees and that are exceptionally expensive to control. The goal of fire management is to return the landscape to a state where frequent low-intensity fires are again the normal behavior. There are two concrete fire management problems: Let Burn (decide which fires to suppress) and Fuel Treatment (decide in which cells to perform fuel reduction treatments). Note that in these problems, the system begins in an unusual non-equilibrium state, and the goal is to return the system to a desired steady state distribution. Hence, these problems are not problems of reinforcement learning but rather problems of MDP planning for a specific start state. Many of the assumptions in RL papers, such as ergodicity of all policies, are not appropriate for this setting. Note also that it is highly desirable to produce a concrete policy (as opposed to just producing near-optimal behavior via receding horizon control). A concrete policy can be inspected by stakeholders to identify missing constraints, state variables, and components of the reward function. To solve these problems, we are exploring two lines of research. For tamarisk, we have been building on recent work in PAC-RL algorithms (e.g., MBIE, UCRL, UCRL2, FRTDP, OP) to develop PAC-MDP planning algorithms. We are pursuing two innovations. First, we have developed an exploration heuristic based on an upper bound on the discounted state occupancy probability. Second, we are developing tighter confidence intervals in order to terminate the search earlier. These are based on combining Good-Turing estimates of missing mass (i.e., for unseen outcomes) with sequential confidence intervals for multinomial distributions. These reduce the degree to which we must rely on the union bound and hence give us tighter convergence. For the Let Burn wildfire problem, we are exploring approximate policy iteration methods. For Fuel Treatment, we are extending Crowley's Equilibrium Policy Gradient methods. These define a local policy function that stochastically chooses the action for cell i based on the actions already chosen for the cells in the surrounding neighborhood. A Gibbs-sampling-style MCMC method repeatedly samples from these local policies until a global equilibrium is reached. This equilibrium defines the global policy. At equilibrium, gradient estimates can be computed and applied to improve the policy.
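An illustrative sketch (not the authors' algorithm) of the Good-Turing missing-mass estimate mentioned above: from n simulator draws of the next state, estimate how much probability mass belongs to outcomes never seen, which can then be used to tighten confidence intervals over the transition function.

    from collections import Counter
    import random

    def good_turing_missing_mass(samples):
        # fraction of draws whose outcome appeared exactly once
        counts = Counter(samples)
        singletons = sum(1 for c in counts.values() if c == 1)
        return singletons / len(samples)

    # Toy successor-state distribution (made up): two common and many rare outcomes.
    random.seed(0)
    true_next = ['a'] * 50 + ['b'] * 30 + [f'rare{i}' for i in range(20)]
    draws = [random.choice(true_next) for _ in range(200)]
    print(good_turing_missing_mass(draws))   # estimate of P(unseen successor)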


3.3 ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis (EPFL – Lausanne, CH)

License: Creative Commons BY 3.0 Unported license © Christos Dimitrakakis
Joint work of: Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference: C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning", arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.
URL: http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on methods for Bayesian reinforcement learning using Thompson sampling, but differing significantly in their prior. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.

References
1 N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809.
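A minimal sketch of the Approximate Bayesian Computation idea behind ABC-RL: sample a simulator parameter from the prior, simulate data, and keep the parameter only if a summary statistic of the simulation is close to that of the observed history. The environment and statistic below are invented for illustration; this is not the paper's algorithm.

    import random

    def simulate(theta, n=50):
        # toy "simulator": rewards are Gaussian with unknown mean theta
        return [random.gauss(theta, 1.0) for _ in range(n)]

    def statistic(trajectory):
        return sum(trajectory) / len(trajectory)

    random.seed(0)
    observed = simulate(theta=0.7)                # pretend this came from the real environment
    obs_stat, eps = statistic(observed), 0.1

    posterior = []
    while len(posterior) < 200:
        theta = random.uniform(-2.0, 2.0)         # prior over simulator parameters
        if abs(statistic(simulate(theta)) - obs_stat) < eps:
            posterior.append(theta)               # accepted: behaves like the data

    print(sum(posterior) / len(posterior))        # ABC posterior mean, roughly recovers 0.7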

3.4 Some thoughts on Transfer Learning in Reinforcement Learning: on States and Representation
Lutz Frommberger (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license © Lutz Frommberger
Main reference: L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning", Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.
URL: http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success of prior learning tasks. This is something that a learning organism does implicitly from birth on: no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong learning type setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is most likely given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In very common words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, may it be in-task (i.e., generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, for example, are defined a-priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow to reuse knowledge of the actual learning task if certain state variables are identical. This works if structural information is made part of the state space, and it does not require a mapping function. However, it needs a-priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g., [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representation, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248. Springer, 2012.
4 Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this work, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e., the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and it has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning
Marcus Hutter (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical theory of machine super-intelligence. More precisely, AIXI is an elegant parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations, such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.

3.8 Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski
Joint work of: Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that, from such trajectory data, we may learn temporally marginalized transition probabilities, which we call proximity statistics, that model possible transitions on larger time scales, rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration, while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired by recent work on transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, pp. 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation and Approximation, pp. 212–223, 2002.


3.9 State Representation Learning in Robotics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski
Joint work of: Jonschkowski, Rico; Brock, Oliver
Main reference: R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.
URL: http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation: a mapping from high dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers, thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations, following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
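A rough way to make the two criteria tangible is sketched below with a linear stand-in invented here (not the neural-network approach of the work above): score a candidate state mapping by how slowly the states change and by how well a simple forward model predicts the next state.

    import numpy as np

    def representation_score(W, obs, actions, ridge=1e-3):
        # map observations to candidate states s_t = W o_t
        S = obs @ W.T
        ds = S[1:] - S[:-1]
        slowness = np.mean(np.sum(ds ** 2, axis=1))              # lower = slower
        # fit a linear forward model [s_t, a_t] -> s_{t+1} and measure its error
        X = np.hstack([S[:-1], actions[:-1]])
        A = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ S[1:])
        pred_err = np.mean(np.sum((X @ A - S[1:]) ** 2, axis=1)) # lower = more predictable
        return slowness, pred_err

    # Invented data: 100 steps of 6-D observations and 2-D actions, 3-D candidate state.
    obs = np.random.default_rng(0).normal(size=(100, 6))
    actions = np.random.default_rng(1).normal(size=(100, 2))
    W = np.eye(3, 6)
    print(representation_score(W, obs, actions))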

3.10 Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License: Creative Commons BY 3.0 Unported license © Petar Kormushev
Joint work of: Kormushev, Petar; Caldwell, Darwin G.
Main reference: P. Kormushev, D.G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task, and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e., conduct trials and evaluate the return) for each one, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options
Timothy Mann (Technion – Haifa, IL)

License: Creative Commons BY 3.0 Unported license © Timothy Mann
Joint work of: Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.
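To illustrate the kind of effect analysed above, here is a toy value-iteration experiment (invented, and far simpler than the fitted setting of the paper) in which adding one temporally extended "run to the goal" option makes the backups converge in far fewer sweeps than primitive actions alone:

    import numpy as np

    n, gamma = 20, 0.95
    goal = n - 1

    def backup(V, use_option):
        V_new = np.zeros_like(V)
        for s in range(n):
            if s == goal:
                continue                               # terminal state keeps value 0
            candidates = []
            for step in (-1, 1):                       # primitive actions: left, right
                s2 = min(max(s + step, 0), n - 1)
                r = 1.0 if s2 == goal else 0.0
                candidates.append(r + gamma * V[s2])
            if use_option:                             # option model: run right to the goal
                d = goal - s                           # option duration from s
                candidates.append(gamma ** (d - 1) * 1.0)
            V_new[s] = max(candidates)
        return V_new

    for use_option in (False, True):
        V, sweeps = np.zeros(n), 0
        while True:
            V_new = backup(V, use_option)
            sweeps += 1
            if np.max(np.abs(V_new - V)) < 1e-6:
                break
            V = V_new
        print(use_option, sweeps)    # far fewer sweeps when the option is available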

3.12 Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license © Jan Hendrik Metzen
Joint work of: Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector and, likewise, a skill is considered as a parameterized policy. We propose skill templates, which allow to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space, and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2] and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires to generalize to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.

References
1 B.C. da Silva, G. Konidaris, and A.G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C.E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
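A minimal sketch of the regression step described above, using the standard Gaussian process equations (plain numpy, invented toy data, a single policy parameter): the predictive variance is what a skill template adds over a plain parameterized skill.

    import numpy as np

    def rbf(A, B, length=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)

    def gp_predict(X, y, X_star, noise=1e-2):
        K = rbf(X, X) + noise * np.eye(len(X))
        K_s = rbf(X_star, X)
        alpha = np.linalg.solve(K, y)
        mean = K_s @ alpha
        v = np.linalg.solve(K, K_s.T)
        var = rbf(X_star, X_star).diagonal() - np.einsum('ij,ji->i', K_s, v)
        return mean, var

    # Toy data: task parameter = target distance, policy parameter = throw "strength".
    X = np.array([[1.0], [1.5], [2.0], [3.0]])        # task parameters seen so far
    y = np.array([0.8, 1.1, 1.5, 2.4])                # learned policy parameters
    X_star = np.array([[1.7], [5.0]])                 # a nearby task and a far-away one
    mean, var = gp_predict(X, y, X_star)
    print(mean.round(2), var.round(3))                # the far-away task gets high variance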

3.13 Online learning in Markov decision processes
Gergely Neu (Budapest University of Technology & Economics, HU)

License: Creative Commons BY 3.0 Unported license © Gergely Neu
Joint work of: Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference: A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and that the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS, or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques, we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of O(√(L |X| |A| T log(|X| |A| L))) in the bandit setting and O(L √(T log(|X| |A| L))) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions, and cannot be significantly improved under general assumptions.

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.
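Stripped of the MDP structure, the algorithmic core used above is mirror descent with an entropic regularizer, i.e. exponential weights. The following single-state sketch (invented losses, not O-REPS itself) shows the multiplicative update and the √T-type regret it achieves:

    import math, random

    def hedge(loss_sequences, eta):
        n = len(loss_sequences[0])
        w = [1.0] * n
        total = 0.0
        for losses in loss_sequences:                 # one round per episode
            z = sum(w)
            p = [wi / z for wi in w]                  # current distribution over actions
            total += sum(pi * li for pi, li in zip(p, losses))
            w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        return total

    random.seed(0)
    T, n = 1000, 3
    seq = [[random.random() for _ in range(n)] for _ in range(T)]
    seq = [[l[0] * 0.5] + l[1:] for l in seq]         # action 0 is better on average
    best = min(range(n), key=lambda a: sum(l[a] for l in seq))
    regret = hedge(seq, eta=math.sqrt(8 * math.log(n) / T)) - sum(l[best] for l in seq)
    print(round(regret, 1))                           # grows only like sqrt(T log n)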


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann (TU Darmstadt, DE)

License: Creative Commons BY 3.0 Unported license © Gerhard Neumann
Joint work of: Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference: G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distributions, where the relative entropy is used as the 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
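A minimal sketch of the episodic relative-entropy bound at the heart of this family of methods: given returns of sampled episodes, find the temperature whose exponential reweighting stays within a KL budget epsilon by minimizing the standard REPS dual. The returns below are invented, and this is a simplified illustration rather than the authors' full hierarchical algorithm.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def reps_weights(returns, epsilon=0.5):
        R = np.asarray(returns, dtype=float)

        def dual(eta):
            # g(eta) = eta*epsilon + eta*log mean(exp(R/eta)), written stably
            z = (R - R.max()) / eta
            return eta * epsilon + eta * np.log(np.mean(np.exp(z))) + R.max()

        eta = minimize_scalar(dual, bounds=(1e-4, 1e4), method='bounded').x
        w = np.exp((R - R.max()) / eta)
        return w / w.sum(), eta

    returns = np.array([1.0, 2.0, 0.5, 3.0, 2.5])
    weights, eta = reps_weights(returns)
    print(weights.round(3), round(eta, 3))   # higher-return episodes get larger weight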

3.15 Multi-Objective Reinforcement Learning

Ann Nowé (Free University of Brussels, BE)

License Creative Commons BY 3.0 Unported license © Ann Nowé

Joint work of Nowé, Ann; Van Moffaert, Kristof; Drugan, M. Madalina
Main reference K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning," in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS, Vol. 7811, pp. 352–366, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems where the value functions are not assumed to be convex. In these cases the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e. either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective and Q-values have to be learnt for each objective individually, so a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto dominating future discounted reward vector separately. Hence these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.
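As a minimal illustration of the set-based bootstrapping described above, the following Python sketch implements a Pareto-dominance filter (the role played by the set of non-dominated vectors) and the vector-sum combination of the stored average immediate reward with the pooled future discounted Q-vectors. Function names and the data layout are assumptions for illustration, not the authors' implementation.

import numpy as np

def nd(vectors):
    """Keep only the Pareto non-dominated reward vectors (maximization)."""
    front = []
    for i, v in enumerate(vectors):
        dominated = any((w >= v).all() and (w > v).any()
                        for j, w in enumerate(vectors) if j != i)
        if not dominated:
            front.append(v)
    return front

def combined_q_set(avg_reward, successor_q_sets, gamma=0.95):
    """Vector-sum of the stored average immediate reward with the Pareto front
    of the pooled discounted future Q-vectors of the successor state."""
    pooled = [np.asarray(q) for q_set in successor_q_sets for q in q_set]
    return [np.asarray(avg_reward) + gamma * q for q in nd(pooled)]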

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration

Tor Lattimore (Australian National University – Canberra, AU)

License Creative Commons BY 3.0 Unported license © Tor Lattimore

Joint work of Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.

3.17 Knowledge-Seeking Agents

Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments," in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does explore in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning

Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of Orseau, Laurent; Ring, Mark
Main reference L. Orseau, M. Ring, "Space-Time Embedded Intelligence," in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3], we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer Berlin Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning

Ronald Ortner (Montan-Universität Leoben, AT)

License Creative Commons BY 3.0 Unported license © Ronald Ortner

Joint work of Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits," in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs, which allows structural information to be added to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.
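As a small, purely illustrative sketch of the kind of bookkeeping such a coloring enables (hypothetical helper names, not the UCRL adaptation itself): observations from state-action pairs sharing a color are pooled, so the number of estimates, and hence the confidence intervals, scales with the number of colors rather than with the size of the state-action space.

from collections import defaultdict

def pooled_reward_estimates(transitions, color_of):
    """Pool observed rewards by color. `transitions` yields (s, a, r, s_next)
    tuples and `color_of(s, a)` is the given coloring (assumed known here)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for s, a, r, _s_next in transitions:
        c = color_of(s, a)
        totals[c] += r
        counts[c] += 1
    return {c: totals[c] / counts[c] for c in counts}, dict(counts)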

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

Joint work of Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization," in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
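A schematic sketch of the stochastic-factorization step described above, written for a single action (the Gaussian kernel, bandwidth, and helper names are assumptions, not the authors' implementation): two row-stochastic kernel matrices are built from the n sampled transitions and m representative states, and reversing the order of their product yields an m-state model of fixed size.

import numpy as np

def row_stochastic_kernel(X, Y, bandwidth=1.0):
    """Row-normalized Gaussian kernel matrix between point sets X (n x d) and Y (m x d)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

def kbsf_reduced_model(S, S_next, R, reps, bandwidth=1.0):
    """Compress n sampled transitions (S -> S_next with rewards R) onto m
    representative states `reps`; K @ D is the m x m reduced transition model."""
    D = row_stochastic_kernel(S_next, reps, bandwidth)   # n x m, row-stochastic
    K = row_stochastic_kernel(reps, S, bandwidth)        # m x n, row-stochastic
    P_hat = K @ D                                        # m x m, row-stochastic
    r_hat = K @ R                                        # m-dimensional reward vector
    return P_hat, r_hat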

3.21 A POMDP Tutorial

Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction

Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
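As a rough sketch of the basic BEBF idea (not the random-projection or matching-pursuit variants mentioned above, and with an assumed regression helper): each new basis function is a regression fit to the Bellman residual of the current value estimate.

import numpy as np

def add_bebf(X, Phi, w, samples, gamma, fit_regressor):
    """Append one Bellman-error basis function to the feature matrix Phi.

    X:    raw state descriptions (one row per state), used as regression inputs
    Phi:  current features (same row order), so the value estimate is V = Phi @ w
    samples: iterable of (i, r, j) transitions between state indices with reward r
    fit_regressor(X_train, y, X_all): fits the residual targets on the sampled
        states and returns predictions for every row of X_all (an assumed helper)
    """
    V = Phi @ w
    train_idx = np.array([i for i, _, _ in samples])
    targets = np.array([r + gamma * V[j] - V[i] for i, r, j in samples])
    new_feature = fit_regressor(X[train_idx], targets, X)
    return np.column_stack([Phi, new_feature])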

3.23 Continual Learning

Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods," in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning

Manuela Ruiz-Montiel (University of Malaga, ES)

License Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo" [PQ-Learning: Multi-Objective Reinforcement Learning], in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π : S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a) : S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i : v_i > w_i and ∄j : v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets, and the algorithm becomes impractical as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]: it approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.
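A toy Python sketch of the set-valued backup described above (only the naive pairwise variant is shown; the controlled set addition with the two bookkeeping structures, which is the actual contribution, is not reproduced here). Names and data layout are illustrative.

import numpy as np

def pareto_front(vectors):
    """Non-dominated subset of a list of reward vectors (maximization)."""
    return [v for v in vectors
            if not any((w >= v).all() and (w > v).any() for w in vectors if w is not v)]

def pq_targets(q_sets_next_state, r, gamma=0.95):
    """Targets r + gamma * q for each q in ND(union over a' of Q(s', a'))."""
    pooled = [np.asarray(q) for q_set in q_sets_next_state.values() for q in q_set]
    return [np.asarray(r) + gamma * q for q in pareto_front(pooled)]

def naive_pq_update(q_set, targets, alpha=0.1):
    """Naive pairwise update of a stored Q-vector set towards the targets;
    PQ-learning replaces this by a controlled addition over matching action sequences."""
    shifted = [(1 - alpha) * np.asarray(q) + alpha * t for q in q_set for t in targets]
    return pareto_front(shifted if shifted else targets)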

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1–2), pp. 51–80, 2011.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License Creative Commons BY 3.0 Unported license © Scott Sanner

Joint work of Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients

David Silver (University College London, GB)

License Creative Commons BY 3.0 Unported license © David Silver

Joint work of Silver, David

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
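Schematically, the "expected gradient of the action-value function" form mentioned above can be written as follows (a sketch in generic notation, where μ_θ is the deterministic policy, Q^μ its action-value function, and ρ^μ the discounted state distribution under μ_θ):

\[
\nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \right].
\]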

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License Creative Commons BY 3.0 Unported license © Orhan Sönmez

Joint work of Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete-time, continuous-state-space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning

Peter Sunehag (Australian National University, AU)

License Creative Commons BY 3.0 Unported license © Peter Sunehag

Main reference T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning," arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL http://arxiv.org/abs/1308.4828v1
URL http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample-complexity bounds can be proven in much more general settings than regret bounds: regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.

3.29 The Quest for the Ultimate TD(λ)

Richard S. Sutton (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of Maei, H. R.; Sutton, R. S.
Main reference H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces," in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 2010.
URL http://dx.doi.org/10.2991/agi.2010.22
URL http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
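For reference, a minimal tabular TD(λ) prediction routine with accumulating eligibility traces (the classical algorithm the talk takes as its starting point, not the generalized off-policy variants it calls for):

import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation with accumulating traces.
    `episodes` is an iterable of trajectories, each a list of (s, r, s_next, done)."""
    V = np.zeros(n_states)
    for episode in episodes:
        z = np.zeros(n_states)                 # eligibility traces
        for s, r, s_next, done in episode:
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            z *= gamma * lam                   # decay all traces
            z[s] += 1.0                        # accumulate trace for the visited state
            V += alpha * delta * z
    return V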

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a more tight integration of relational vision and relational action for interactive² learning settings. In addition, I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL: applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of Veness, Joel; Hutter, Marcus
Main reference J. Veness, "Universal RL: applications and approximations," 2013.
URL http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
Self-introduction of participants
Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
Csaba Szepesvari, Gergely Neu: Online learning in Markov decision processes
Ronald Ortner: Continuous RL and restless bandits
Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Doina Precup: Methods for Bellman Error Basis Function construction
Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
Mark B. Ring: Continual learning
Rich Sutton: The quest for the ultimate TD algorithm
Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
Marcus Hutter: Tutorial on Universal RL
Joel Veness: Universal RL – Applications and Approximations
Laurent Orseau: Optimal Universal Explorative Agents
Tor Lattimore: Bayesian Reinforcement Learning + Exploration
Laurent Orseau: More realistic assumptions for RL
Afternoon
Scott Sanner: Tutorial on Symbolic Dynamic Programming
Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
Timothy Mann: Theoretical Analysis of Planning with Options
Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
David Silver: Deterministic Policy Gradients

Thursday
Morning
Joëlle Pineau: Tutorial on POMDPs
Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
Will Uther: Tree-based MDP – PSR
Nils Siebel: Neuro-Evolution
Afternoon
Hiking

Friday
Morning
Christos Dimitrakakis: ABC and cover-tree reinforcement learning
Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
Ann Nowé: Multi-Objective Reinforcement Learning
Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
Christos Dimitrakakis: RL competition


Participants

Peter Auer – Montan-Universität Leoben, AT
Manuel Blum – Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete – Universität Marburg, DE
Yann Chevaleyre – University of Paris North, FR
Marc Deisenroth – TU Darmstadt, DE
Thomas G. Dietterich – Oregon State University, US
Christos Dimitrakakis – EPFL – Lausanne, CH
Lutz Frommberger – Universität Bremen, DE
Jens Garstka – FernUniversität in Hagen, DE
Mohammad Ghavamzadeh – INRIA Nord Europe – Lille, FR
Marcus Hutter – Australian National Univ., AU
Rico Jonschkowski – TU Berlin, DE
Petar Kormushev – Italian Institute of Technology – Genova, IT
Tor Lattimore – Australian National Univ., AU
Alessandro Lazaric – INRIA Nord Europe – Lille, FR
Timothy Mann – Technion – Haifa, IL
Jan Hendrik Metzen – Universität Bremen, DE
Gergely Neu – Budapest University of Technology & Economics, HU
Gerhard Neumann – TU Darmstadt, DE
Ann Nowé – Free University of Brussels, BE
Laurent Orseau – AgroParisTech – Paris, FR
Ronald Ortner – Montan-Universität Leoben, AT
Joëlle Pineau – McGill Univ. – Montreal, CA
Doina Precup – McGill Univ. – Montreal, CA
Mark B. Ring – Anaheim Hills, US
Manuela Ruiz-Montiel – University of Malaga, ES
Scott Sanner – NICTA – Canberra, AU
Nils T. Siebel – Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver – University College London, GB
Orhan Sönmez – Boğaziçi Univ. – Istanbul, TR
Peter Sunehag – Australian National Univ., AU
Richard S. Sutton – University of Alberta, CA
Csaba Szepesvári – University of Alberta, CA
William Uther – Google – Sydney, AU
Martijn van Otterlo – Radboud Univ. Nijmegen, NL
Joel Veness – University of Alberta, CA
Jeremy L. Wyatt – University of Birmingham, GB



3 Overview of Talks

3.1 Preference-based Evolutionary Direct Policy Search

Robert Busa-Fekete (Universität Marburg, DE)

License Creative Commons BY 3.0 Unported license © Robert Busa-Fekete

We introduce a preference-based extension of evolutionary direct policy search (EDPS) as proposed by Heidrich-Meisner and Igel [2]. EDPS casts policy learning as a search problem in a parametric policy space, where the function to be optimized is a performance measure like expected total reward, and evolution strategies (ES) such as CMA-ES [1] are used as optimizers. Moreover, since the evaluation of a policy can only be done approximately, namely in terms of a finite number of rollouts, the authors make use of racing algorithms [3] to control this number in an adaptive manner. These algorithms return a sufficiently reliable ranking over the current set of policies (candidate solutions), which is then used by the ES for updating its parameters and population. A key idea of our approach is to extend EDPS by replacing the value-based racing algorithm with a preference-based one that operates on a suitable ordinal preference structure and only uses pairwise comparisons between sample rollouts of the policies.

References
1 N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature – PPSN VIII, pages 282–291, 2004.
2 V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, pages 401–408, 2009.
3 O. Maron and A. W. Moore. Hoeffding races: accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.

3.2 Solving Simulator-Defined MDPs for Natural Resource Management

Thomas G. Dietterich (Oregon State University, US)

License Creative Commons BY 3.0 Unported license © Thomas G. Dietterich

Joint work of Dietterich, Thomas G.; Alkaee Taleghan, Majid; Crowley, Mark
Main reference T. G. Dietterich, M. Alkaee Taleghan, M. Crowley, "PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs," in Proc. of the AAAI Conference on Artificial Intelligence (AAAI'13), AAAI Press, 2013.
URL http://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/view/6478
URL http://web.engr.oregonstate.edu/~tgd/publications/dietterich-taleghan-crowley-pac-optimal-planning-for-invasive-species-management-etc-aaai2013.pdf

Natural resource management problems, such as forestry, fisheries, and water resources, can be formulated as Markov decision processes. However, solving them is difficult for two reasons. First, the dynamics of the system are typically available only in the form of a complex and expensive simulator. This means that MDP planning algorithms are needed that minimize the number of calls to the simulator. Second, the systems are spatial. A natural way to formulate the MDP is to divide up the region into cells, where each cell is modeled with a small number of state variables. Actions typically operate at the level of individual cells, but the spatial dynamics couple the states of spatially-adjacent cells. The resulting state and action spaces of these MDPs are immense. We have been working on two natural resource MDPs. The first involves the spread of tamarisk in river networks. A native of the Middle East, tamarisk has become an invasive plant in the dryland rivers and streams of the western US. Given a tamarisk invasion, a land manager must decide how and where to fight the invasion (e.g. eradicate tamarisk plants, plant native plants upstream, downstream). Although approximate or heuristic solutions to this problem would be useful, our collaborating economists tell us that our policy recommendations will carry more weight if they are provably optimal with high probability. A large problem instance involves 4.7 × 10^6 states with 2187 actions in each state. On a modern 64-bit machine, the action-value function for this problem can fit into main memory. However, computing the full transition function to sufficient accuracy to support standard value iteration requires on the order of 3 × 10^20 simulator calls. The second problem concerns the management of wildfire in Eastern Oregon. In this region, prior to European settlement, the native ponderosa pine forests were adapted to frequent low-intensity fires. These fires allow the ponderosa pine trees (which are well-adapted to survive fire) to grow very tall while preventing the accumulation of fuel at ground level. These trees provide habitat for many animal species and are also very valuable for timber. However, beginning in the early 1900s, all fires were suppressed in this landscape, which has led to the build-up of huge amounts of fuel. The result has been large catastrophic fires that kill even the ponderosa trees and that are exceptionally expensive to control. The goal of fire management is to return the landscape to a state where frequent low-intensity fires are again the normal behavior. There are two concrete fire management problems: Let Burn (decide which fires to suppress) and Fuel Treatment (decide in which cells to perform fuel reduction treatments). Note that in these problems the system begins in an unusual non-equilibrium state, and the goal is to return the system to a desired steady-state distribution. Hence, these problems are not problems of reinforcement learning but rather problems of MDP planning for a specific start state. Many of the assumptions in RL papers, such as ergodicity of all policies, are not appropriate for this setting. Note also that it is highly desirable to produce a concrete policy (as opposed to just producing near-optimal behavior via receding horizon control). A concrete policy can be inspected by stakeholders to identify missing constraints, state variables, and components of the reward function. To solve these problems, we are exploring two lines of research. For tamarisk, we have been building on recent work in PAC-RL algorithms (e.g. MBIE, UCRL, UCRL2, FRTDP, OP) to develop PAC-MDP planning algorithms. We are pursuing two innovations. First, we have developed an exploration heuristic based on an upper bound on the discounted state occupancy probability. Second, we are developing tighter confidence intervals in order to terminate the search earlier. These are based on combining Good-Turing estimates of missing mass (i.e. for unseen outcomes) with sequential confidence intervals for multinomial distributions. These reduce the degree to which we must rely on the union bound and hence give us tighter convergence. For the Let Burn wildfire problem, we are exploring approximate policy iteration methods. For Fuel Treatment, we are extending Crowley's Equilibrium Policy Gradient methods. These define a local policy function that stochastically chooses the action for cell i based on the actions already chosen for the cells in the surrounding neighborhood. A Gibbs-sampling-style MCMC method repeatedly samples from these local policies until a global equilibrium is reached. This equilibrium defines the global policy. At equilibrium, gradient estimates can be computed and applied to improve the policy.
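Since the confidence-interval construction above leans on Good-Turing estimates of missing mass, here is the textbook estimator as a short sketch (illustrative only; the sequential confidence intervals it is combined with are not shown):

from collections import Counter

def good_turing_missing_mass(outcomes):
    """Good-Turing estimate of the total probability of unseen outcomes: N1 / n,
    where N1 is the number of outcomes observed exactly once and n the sample size."""
    counts = Counter(outcomes)
    n = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n if n else 1.0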


3.3 ABC and Cover Tree Reinforcement Learning

Christos Dimitrakakis (EPFL – Lausanne, CH)

License Creative Commons BY 3.0 Unported license © Christos Dimitrakakis

Joint work of Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning," arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.
URL http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on methods for Bayesian reinforcement learning using Thompson sampling, but differing significantly in their priors. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high-dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.
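A schematic sketch of the ABC step behind the likelihood-free approach described above (the rejection-style acceptance rule, helper names, and the Euclidean statistic distance are assumptions for illustration; the paper's actual sampler and choice of statistics may differ):

import numpy as np

def abc_sample_simulator(prior_sample, simulate, statistic, observed_history,
                         epsilon=0.1, max_tries=10000):
    """Likelihood-free (ABC) draw of a simulator parameter.

    prior_sample():      draws a candidate parameter theta from the prior
    simulate(theta):     rolls out the simulator, returning a synthetic history
    statistic(history):  summary statistic used to compare histories
    """
    target = np.asarray(statistic(observed_history), dtype=float)
    for _ in range(max_tries):
        theta = prior_sample()
        synthetic = np.asarray(statistic(simulate(theta)), dtype=float)
        if np.linalg.norm(synthetic - target) <= epsilon:
            return theta            # accepted: an approximate posterior sample
    raise RuntimeError("no candidate accepted; increase epsilon or max_tries")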

References
1 N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809.

3.4 Some thoughts on Transfer Learning in Reinforcement Learning – on States and Representation

Lutz Frommberger (Universität Bremen, DE)

License Creative Commons BY 3.0 Unported license © Lutz Frommberger

Main reference L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning," Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.
URL http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success from prior learning tasks. This is something that a learning organism does implicitly from birth on: no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong-learning-type setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is mostly given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In very common words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, be it in-task (i.e. generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, e.g., are defined a priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow to reuse knowledge of the actual learning task if certain state variables are identical. This works if structural information is made part of the state space and does not require a mapping function. However, it needs a-priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g., [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representation, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248. Springer, 2012.
4 Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs

Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizingsome measure of variability in rewards in addition to maximizing a standard criterionVariance related risk measures are among the most common risk-sensitive criteria in financeand operations research However optimizing many such criteria is known to be a hardproblem In this paper we consider both discounted and average reward Markov decisionprocesses For each formulation we first define a measure of variability for a policy whichin turn gives us a set of risk-sensitive criteria to optimize For each of these criteria wederive a formula for computing its gradient We then devise actor-critic algorithms forestimating the gradient and updating the policy parameters in the ascent direction Weestablish the convergence of our algorithms to locally risk-sensitive optimal policies Finallywe demonstrate the usefulness of our algorithms in a traffic signal control application
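One common instance of such a variance-related criterion, shown here only as an illustration (the paper treats several criteria for both the discounted and the average-reward formulation), is the variance-penalized objective

\[
\max_\theta \ \mathbb{E}\big[R^{\theta}\big] - \lambda\, \mathrm{Var}\big[R^{\theta}\big],
\]

where R^θ is the return under the policy with parameter θ and λ ≥ 0 trades off expected return against variability; the actor-critic algorithms described above estimate the gradient of such a criterion and update the policy parameters in the ascent direction.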

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming

Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e., the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning

Marcus Hutter (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical theory of machine super-intelligence. More precisely, AIXI is an elegant parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.
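
As a reference point for the discussion above, the AIXI action selection can be written, in one standard formulation (shown here for orientation rather than as the exact notation used in the talk), as

\[
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
\bigl[r_k + \cdots + r_m\bigr]
\sum_{q \,:\, U(q,\,a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)},
\]

where U is a universal (monotone) Turing machine, q ranges over its programs, m is the horizon, and 2^{-ℓ(q)} is the Solomonoff-style prior weight of a program of length ℓ(q).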

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.

3.8 Temporal Abstraction by Sparsifying Proximity Statistics

Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski

Joint work of: Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that from such trajectory data we may learn temporally marginalized transition probabilities, which we call proximity statistics, that model possible transitions on larger time scales, rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired by recent work on transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, pp. 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation and Approximation, pp. 212–223, 2002.


3.9 State Representation Learning in Robotics

Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski

Joint work of: Jonschkowski, Rico; Brock, Oliver
Main reference: R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.

URL: http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation – a mapping from high-dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers – thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations, following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
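
As an illustration of what such an objective can look like (a generic sketch, assuming a mapping φ from observations o_t to states s_t = φ(o_t); the exact losses used in the referenced work may differ), slowness and predictability can be written as

\[
L_{\text{slow}} \;=\; \mathbb{E}_t\Bigl[\bigl\|\varphi(o_{t+1}) - \varphi(o_t)\bigr\|^2\Bigr],
\qquad
L_{\text{pred}} \;=\; \mathbb{E}_t\Bigl[\bigl\|\varphi(o_{t+1}) - f\bigl(\varphi(o_t), a_t\bigr)\bigr\|^2\Bigr],
\]

where f is a learned forward model; both losses are minimized jointly over the parameters of φ and f, with some additional constraint or regularizer to rule out the trivial constant representation.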

3.10 Reinforcement Learning with Heterogeneous Policy Representations

Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License: Creative Commons BY 3.0 Unported license © Petar Kormushev

Joint work of: Kormushev, Petar; Caldwell, Darwin G.
Main reference: P. Kormushev, D. G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task, and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e., conduct trials and evaluate the return) for each one, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options

Timothy Mann (Technion – Haifa, IL)

License: Creative Commons BY 3.0 Unported license © Timothy Mann

Joint work of: Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduce a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.
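
For orientation, a Bellman backup over a set of options O (written generically, in the spirit of semi-MDP theory [4], not necessarily the exact operator analyzed in the talk) replaces the one-step backup used by FVI with

\[
V_{k+1}(s) \;=\; \max_{o \in O(s)} \; \mathbb{E}\Bigl[\, r_1 + \gamma r_2 + \cdots + \gamma^{\tau-1} r_\tau + \gamma^{\tau}\, V_k(s') \,\Bigm|\, s,\, o \Bigr],
\]

where τ is the (random) duration of option o and s' the state in which it terminates; when O contains only primitive actions this reduces to the standard one-step backup with τ = 1.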

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.

3.12 Learning Skill Templates for Parameterized Tasks

Jan Hendrik Metzen (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license © Jan Hendrik Metzen

Joint work of: Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector, and likewise a skill is considered as a parameterized policy. We propose skill templates, which allow to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space, and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2] and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires to generalize to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.
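
The generalization step described above can be pictured with a small Gaussian-process regression sketch: task parameters (here, hypothetically, a 2-D target position) are mapped to policy parameters (e.g., DMP weights), and the predictive standard deviation plays the role of the uncertainty estimate mentioned in the abstract. This is an illustrative sketch using scikit-learn, not the authors' implementation; all shapes, values, and names are made up.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # Hypothetical training data: each row of `tasks` is a task parameter vector
    # (target position); each row of `policies` holds the policy parameters
    # learned by RL for that task (e.g. 6 DMP weights).
    tasks = np.array([[0.3, 0.1], [0.5, 0.2], [0.8, 0.4], [1.0, 0.1]])
    policies = np.random.randn(4, 6)

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3) + WhiteKernel(1e-3),
                                  normalize_y=True)
    gp.fit(tasks, policies)

    # Generalize to a new task; the predictive std can be used to decide how much
    # subsequent RL fine-tuning of the generalized skill is still needed.
    new_task = np.array([[0.65, 0.25]])
    mean_policy, std_policy = gp.predict(new_task, return_std=True)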

References
1 B. C. da Silva, G. Konidaris, and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C. E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

3.13 Online learning in Markov decision processes

Gergely Neu (Budapest University of Technology & Economics, HU)

License: Creative Commons BY 3.0 Unported license © Gergely Neu

Joint work of: Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference: A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and that the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of \(O\bigl(\sqrt{L\,|X|\,|A|\,T\,\log(|X|\,|A|\,L)}\bigr)\) in the bandit setting and \(O\bigl(L\sqrt{T\,\log(|X|\,|A|\,L)}\bigr)\) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
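
As a rough sketch of the kind of update underlying O-REPS (generic Mirror Descent over occupancy measures with an entropic regularizer, given for intuition; the exact update in the referenced paper may differ in details), the learner maintains an occupancy measure q_t over state-action pairs and, after observing the loss ℓ_t, solves

\[
q_{t+1} \;=\; \arg\min_{q \in \Delta} \;\; \eta\,\langle q, \ell_t\rangle \;+\; D\bigl(q \,\|\, q_t\bigr),
\qquad
D(q\|q') \;=\; \sum_{x,a} q(x,a)\,\log\frac{q(x,a)}{q'(x,a)},
\]

where Δ is the set of occupancy measures consistent with the transition dynamics of the layered MDP and η is a learning-rate parameter; the policy played in episode t+1 is obtained by conditioning, π_{t+1}(a|x) ∝ q_{t+1}(x,a).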

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search

Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License: Creative Commons BY 3.0 Unported license © Gerhard Neumann

Joint work of: Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference: G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distribution, where the relative entropy is used as 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
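
For readers unfamiliar with the bound mentioned above, the core optimization in episodic relative entropy policy search can be sketched as follows (a generic formulation for illustration, not necessarily the exact one used in this work):

\[
\max_{\pi} \;\; \mathbb{E}_{\theta \sim \pi}\bigl[R(\theta)\bigr]
\quad \text{s.t.} \quad
\mathrm{KL}\bigl(\pi \,\|\, \pi_{\text{old}}\bigr) \le \epsilon
\qquad\Longrightarrow\qquad
\pi(\theta) \;\propto\; \pi_{\text{old}}(\theta)\,\exp\bigl(R(\theta)/\eta\bigr),
\]

where θ are the parameters of a sampled option or policy, ε is the relative-entropy bound, and the temperature η is the Lagrange multiplier obtained from the corresponding dual problem.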

3.15 Multi-Objective Reinforcement Learning

Ann Nowé (Free University of Brussels, BE)

License: Creative Commons BY 3.0 Unported license © Ann Nowé

Joint work of: Nowé, Ann; Van Moffaert, Kristof; Drugan, M. Madalina
Main reference: K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS Vol. 7811, pp. 352–366, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary, or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e., either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually; therefore a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto-dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto-dominating future discounted reward vector separately. Hence, these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.
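
To make the two scalarization examples concrete (written in a generic form; the weight vector w and reference point z* are illustrative, and the exact definitions used by the authors may differ slightly), the linear and Chebyshev scalarized values of a Q-vector Q(s,a) = (Q_1(s,a), …, Q_m(s,a)) are

\[
SQ_{\text{lin}}(s,a) \;=\; \sum_{o=1}^{m} w_o\, Q_o(s,a),
\qquad
SQ_{\text{Cheb}}(s,a) \;=\; \max_{o=1,\dots,m} \; w_o\,\bigl|Q_o(s,a) - z^{*}_o\bigr|,
\]

where the linear form is maximized while the Chebyshev form, a weighted distance to a utopian reference point z*, is minimized; only the non-linear variant can reach policies lying in non-convex regions of the Pareto front.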

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration

Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Tor Lattimore

Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.
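
For background, the standard Bayesian construction over a countable class (given here for orientation; the algorithm discussed in the talk adds an exploration component on top of it) mixes the environments in M with a prior w:

\[
\xi\bigl(o_{1:t} \mid a_{1:t}\bigr) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \,\nu\bigl(o_{1:t} \mid a_{1:t}\bigr),
\qquad
\sum_{\nu \in \mathcal{M}} w_\nu = 1,\;\; w_\nu > 0 .
\]

The Bayes-optimal policy maximises expected (discounted) return under the mixture ξ; the known difficulty is that this policy need not explore enough to become optimal in every ν ∈ M, which is what motivates combining Bayesian decision making with explicit exploration.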

3.17 Knowledge-Seeking Agents

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS Vol. 8139, pp. 146–160, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS Vol. 7716, pp. 209–218, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or can merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3], we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer Berlin Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning

Ronald Ortner (Montan-Universität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner

Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau

Joint work of: Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
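
To give a concrete picture of the decomposition mentioned above (stated generically; the dimensions and symbols are illustrative), a stochastic factorization approximates an n × n transition matrix P by a product of two stochastic matrices,

\[
P \;\approx\; D\,K, \qquad D \in \mathbb{R}^{n \times m},\;\; K \in \mathbb{R}^{m \times n},\;\; m \ll n,
\]

and the useful property is that swapping the factors yields an m × m stochastic matrix \(\bar{P} = K D\) whose induced model can be solved at a cost governed by m rather than by the number of sample transitions n, while the resulting solution can be mapped back to the original states through D.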

3.21 A POMDP Tutorial

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction

Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss ongoing developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
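
As a reminder of the basic construction (written generically; the cited papers build on it with random projections and orthogonal matching pursuit), given a current linear value estimate V_k, the next basis function is chosen to approximate the Bellman error of that estimate:

\[
\phi_{k+1}(s) \;\approx\; \bigl(T^{\pi} V_k\bigr)(s) - V_k(s)
\;=\; \mathbb{E}\bigl[\, r_t + \gamma V_k(s_{t+1}) \mid s_t = s \,\bigr] - V_k(s),
\]

so that adding φ_{k+1} to the feature set targets exactly the part of the value function the current features cannot represent; in practice the Bellman error is estimated by regression from sampled transitions.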

3.23 Continual Learning

Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to those similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning

Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel

Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo" (PQ-Learning: Multi-objective Reinforcement Learning), in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors, and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function \(P_{sa}(s')\) (the transition probabilities), and a function \(R_{sa}(s')\) (the obtained scalar rewards). The problem is to determine a policy \(\pi: S \rightarrow A\) that maximizes the discounted accumulated reward \(R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\). E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values \(Q(s,a): S \times A \rightarrow \mathbb{R}\) that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression \(\arg\max_a Q(s,a)\). In the multi-objective case the rewards are vectors \(\vec{r} \in \mathbb{R}^n\), so different accumulated rewards cannot be totally ordered: \(\vec{v}\) dominates \(\vec{w}\) when \(\exists i: v_i > w_i\) and \(\nexists j: v_j < w_j\). Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of \(P_{sa}(s')\) and \(R_{sa}(s')\). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: \(Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a'))\). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by \(ND(\bigcup_{a'} Q(s',a'))\), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets, and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector \(\vec{q}\) with two data structures with information about the vectors that (1) have been updated by \(\vec{q}\) and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹
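
The ND operator above is simply Pareto-front filtering of a set of Q-vectors; a minimal sketch (illustrative only, assuming maximization in every objective) could look as follows.

    def dominates(v, w):
        """True if vector v Pareto-dominates vector w (maximization)."""
        return all(vi >= wi for vi, wi in zip(v, w)) and \
               any(vi > wi for vi, wi in zip(v, w))

    def nd(vectors):
        """Return the non-dominated subset (Pareto front) of a set of Q-vectors."""
        vectors = list(vectors)
        return [v for v in vectors
                if not any(dominates(w, v) for w in vectors if w is not v)]

    # Example with two objectives: (0.5, 0.5) is dominated and filtered out.
    print(nd([(1.0, 2.0), (2.0, 1.0), (0.5, 0.5)]))   # -> [(1.0, 2.0), (2.0, 1.0)]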

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. In Machine Learning, 84(1-2), pp. 51–80, 2011.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner

Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients

David Silver (University College – London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver

Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
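
The "particularly appealing form" referred to above is usually written as follows (as stated in the published work on deterministic policy gradients; μ_θ is the deterministic policy and ρ^μ its discounted state distribution):

\[
\nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\Bigl[\, \nabla_\theta\, \mu_\theta(s)\; \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \Bigr],
\]

i.e., the policy parameters are moved in the direction of the action-value gradient, averaged over the states the policy visits; the off-policy actor-critic version replaces Q^μ with a learned critic and ρ^μ with the state distribution of the behaviour policy.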

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez

Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete-time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step, we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning

Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag

Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.

3.29 The Quest for the Ultimate TD(λ)

Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm, something inspired by TD(λ) with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
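
For reference, the core algorithm being generalized here is, in its standard linear on-policy form (textbook notation, with feature vector φ(s), step size α, and accumulating eligibility traces):

\[
\delta_t = r_{t+1} + \gamma\,\theta_t^{\top}\phi(s_{t+1}) - \theta_t^{\top}\phi(s_t),
\qquad
e_t = \gamma\lambda\, e_{t-1} + \phi(s_t),
\qquad
\theta_{t+1} = \theta_t + \alpha\,\delta_t\, e_t .
\]

The generalizations discussed in the talk (GQ(λ) and its relatives) modify this update with importance-sampling corrections and gradient-TD terms so that it remains sound under off-policy training and more general parameterizations.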

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions, and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g., raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g., see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g., parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example, dealing with scarce human feedback; new types of experimentation – for example, incorporating visual feedback and physical manipulation; and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive² learning settings. In addition, I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well-understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA-based Actor-Critic Algorithm for Risk-Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon
  Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 4: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

4 13321 ndash Reinforcement Learning

3 Overview of Talks

31 Preference-based Evolutionary Direct Policy SearchRobert Busa-Fekete (Universitaumlt Marburg DE)

License Creative Commons BY 30 Unported licensecopy Robert Busa-Fekete

We introduce a preference-based extension of evolutionary direct policy search (EDPS) asproposed by Heidrich-Meisner and Igel [2] EDPS casts policy learning as a search problemin a parametric policy space where the function to be optimized is a performance measurelike expected total reward andevolution strategies (ES) such as CMA-ES [1] are used asoptimizers Moreover since the evaluation of a policy can only be done approximatelynamely in terms of a finite number of rollouts the authors make use of racing algorithms [3]to control this number in an adaptive manner These algorithms return a sufficiently reliableranking over the current set of policies (candidate solutions) which is then used by the ESfor updating its parameters and population A key idea of our approach is to extend EDPSby replacing the value-based racing algorithm with a preference-based one that operates on asuitable ordinal preference structure and only uses pairwise comparisons between samplerollouts of the policies

References
1 N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature – PPSN VIII, pages 282–291, 2004.
2 V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, pages 401–408, 2009.
3 O. Maron and A. W. Moore. Hoeffding races: accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.

3.2 Solving Simulator-Defined MDPs for Natural Resource Management
Thomas G. Dietterich (Oregon State University, US)

License: Creative Commons BY 3.0 Unported license © Thomas G. Dietterich
Joint work of: Dietterich, Thomas G.; Taleghan Alkaee, Majid; Crowley, Mark
Main reference: T. G. Dietterich, M. Alkaee Taleghan, M. Crowley, "PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs", in Proc. of the AAAI Conference on Artificial Intelligence (AAAI'13), AAAI Press, 2013.
URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/view/6478
URL: http://web.engr.oregonstate.edu/~tgd/publications/dietterich-taleghan-crowley-pac-optimal-planning-for-invasive-species-management-etc-aaai2013.pdf

Natural resource management problems, such as forestry, fisheries, and water resources, can be formulated as Markov decision processes. However, solving them is difficult for two reasons. First, the dynamics of the system are typically available only in the form of a complex and expensive simulator. This means that MDP planning algorithms are needed that minimize the number of calls to the simulator. Second, the systems are spatial. A natural way to formulate the MDP is to divide up the region into cells, where each cell is modeled with a small number of state variables. Actions typically operate at the level of individual cells, but the spatial dynamics couple the states of spatially-adjacent cells. The resulting state and action spaces of these MDPs are immense.

We have been working on two natural resource MDPs. The first involves the spread of tamarisk in river networks. A native of the Middle East, tamarisk has become an invasive plant in the dryland rivers and streams of the western US. Given a tamarisk invasion, a land manager must decide how and where to fight the invasion (e.g. eradicate tamarisk plants, plant native plants upstream or downstream). Although approximate or heuristic solutions to this problem would be useful, our collaborating economists tell us that our policy recommendations will carry more weight if they are provably optimal with high probability. A large problem instance involves 4.7 × 10^6 states with 2187 actions in each state. On a modern 64-bit machine, the action-value function for this problem can fit into main memory. However, computing the full transition function to sufficient accuracy to support standard value iteration requires on the order of 3 × 10^20 simulator calls. The second problem concerns the management of wildfire in Eastern Oregon. In this region, prior to European settlement, the native ponderosa pine forests were adapted to frequent low-intensity fires. These fires allow the ponderosa pine trees (which are well-adapted to survive fire) to grow very tall while preventing the accumulation of fuel at ground level. These trees provide habitat for many animal species and are also very valuable for timber. However, beginning in the early 1900s, all fires were suppressed in this landscape, which has led to the build-up of huge amounts of fuel. The result has been large catastrophic fires that kill even the ponderosa trees and that are exceptionally expensive to control. The goal of fire management is to return the landscape to a state where frequent low-intensity fires are again the normal behavior. There are two concrete fire management problems: Let Burn (decide which fires to suppress) and Fuel Treatment (decide in which cells to perform fuel reduction treatments). Note that in these problems the system begins in an unusual non-equilibrium state, and the goal is to return the system to a desired steady-state distribution. Hence these problems are not problems of reinforcement learning but rather problems of MDP planning for a specific start state. Many of the assumptions in RL papers, such as ergodicity of all policies, are not appropriate for this setting. Note also that it is highly desirable to produce a concrete policy (as opposed to just producing near-optimal behavior via receding horizon control). A concrete policy can be inspected by stakeholders to identify missing constraints, state variables, and components of the reward function.

To solve these problems, we are exploring two lines of research. For tamarisk, we have been building on recent work in PAC-RL algorithms (e.g. MBIE, UCRL, UCRL2, FRTDP, OP) to develop PAC-MDP planning algorithms. We are pursuing two innovations. First, we have developed an exploration heuristic based on an upper bound on the discounted state occupancy probability. Second, we are developing tighter confidence intervals in order to terminate the search earlier. These are based on combining Good-Turing estimates of missing mass (i.e. for unseen outcomes) with sequential confidence intervals for multinomial distributions. These reduce the degree to which we must rely on the union bound and hence give us tighter convergence. For the Let Burn wildfire problem, we are exploring approximate policy iteration methods. For Fuel Treatment, we are extending Crowley's Equilibrium Policy Gradient methods. These define a local policy function that stochastically chooses the action for cell i based on the actions already chosen for the cells in the surrounding neighborhood. A Gibbs-sampling-style MCMC method repeatedly samples from these local policies until a global equilibrium is reached. This equilibrium defines the global policy. At equilibrium, gradient estimates can be computed and applied to improve the policy.
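
To make the missing-mass idea concrete, here is a small sketch (under our own simplifying assumptions, not the authors' code) of a Good-Turing estimate of the probability mass of unseen simulator outcomes, together with a McAllester-Schapire-style high-probability upper bound. Names are illustrative.

```python
import math
from collections import Counter

def good_turing_missing_mass(outcomes):
    """Good-Turing estimate of the mass of outcomes never observed:
    (number of outcomes seen exactly once) / (total number of samples)."""
    counts = Counter(outcomes)
    n = len(outcomes)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n if n else 1.0

def missing_mass_upper_bound(outcomes, delta=0.05):
    """High-probability upper bound: Good-Turing estimate plus an
    O(sqrt(log(1/delta)/n)) deviation term (constants follow the
    McAllester-Schapire style bound; treat them as an assumption here)."""
    n = len(outcomes)
    if n == 0:
        return 1.0
    deviation = (2.0 * math.sqrt(2.0) + math.sqrt(3.0)) * math.sqrt(math.log(3.0 / delta) / n)
    return good_turing_missing_mass(outcomes) + deviation

# Outcomes of repeated simulator calls for one state-action pair
samples = ["s1", "s2", "s1", "s3", "s1", "s4"]
print(good_turing_missing_mass(samples))   # 3 singletons / 6 samples = 0.5
print(missing_mass_upper_bound(samples))
```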


3.3 ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis (EPFL – Lausanne, CH)

License: Creative Commons BY 3.0 Unported license © Christos Dimitrakakis
Joint work of: Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference: C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning", arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.
URL: http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on two methods for Bayesian reinforcement learning, both using Thompson sampling but differing significantly in their prior. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.
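
A minimal sketch of the ABC idea underlying ABC-RL (our own illustrative rendering, not the authors' implementation): draw simulator parameters from the prior, generate trajectories, and keep those parameters whose summary statistics are close to the statistics of the observed history. All function names are assumptions.

```python
import numpy as np

def abc_posterior_sample(prior_sample, simulate, summary, observed_history,
                         epsilon, n_accept, max_tries=100000):
    """Likelihood-free (rejection) ABC over a class of simulators.

    prior_sample()      -> one parameter vector theta drawn from the prior
    simulate(theta)     -> a trajectory generated by the simulator with parameter theta
    summary(trajectory) -> low-dimensional summary statistics (numpy array)
    observed_history    -> the trajectory actually observed by the agent
    """
    target = summary(observed_history)
    accepted = []
    for _ in range(max_tries):
        theta = prior_sample()
        stats = summary(simulate(theta))
        if np.linalg.norm(stats - target) <= epsilon:   # accept if statistics match
            accepted.append(theta)
            if len(accepted) == n_accept:
                break
    return accepted  # approximate posterior sample; feed into Thompson sampling
```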

References
1 N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809.

3.4 Some thoughts on Transfer Learning in Reinforcement Learning: on States and Representation
Lutz Frommberger (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license © Lutz Frommberger
Main reference: L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning", Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.
URL: http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success of prior learning tasks. This is something that a learning organism does implicitly from birth on: no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong learning type of setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is most likely given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In very common words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, be it in-task (i.e. generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, for example, are defined a priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow knowledge of the actual learning task to be reused if certain state variables are identical. This works if structural information is made part of the state space, and it does not require a mapping function. However, it needs a priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g. [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representations, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248, Springer, 2012.
4 Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
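
As a hedged illustration of a variance-penalized objective (a plain likelihood-ratio sketch under our own assumptions, not the actor-critic algorithms of the talk): maximize E[R] − λ Var[R], whose gradient can be estimated from sampled returns and score functions using ∇Var[R] = ∇E[R²] − 2 E[R] ∇E[R].

```python
import numpy as np

def mean_variance_policy_gradient(returns, score_grads, lam=0.1):
    """Monte Carlo estimate of the gradient of E[R] - lam * Var[R].

    returns     : array of shape (N,), total return of each sampled trajectory
    score_grads : array of shape (N, d), grad_theta log p_theta(trajectory) per trajectory
    """
    R = np.asarray(returns, dtype=float)
    G = np.asarray(score_grads, dtype=float)
    mean_R = R.mean()
    grad_mean = (R[:, None] * G).mean(axis=0)           # gradient of E[R]
    grad_second = ((R ** 2)[:, None] * G).mean(axis=0)   # gradient of E[R^2]
    grad_var = grad_second - 2.0 * mean_R * grad_mean    # gradient of Var[R]
    return grad_mean - lam * grad_var

# Usage sketch: theta += step_size * mean_variance_policy_gradient(returns, score_grads)
```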

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e. the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and it has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires one to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning
Marcus Hutter (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical theory of machine super-intelligence. More precisely, AIXI is an elegant, parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations, such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.
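
For reference, the AIXI agent mentioned above is usually written as the following expectimax over programs q on a universal Turing machine U with horizon m (a standard formulation from the Universal AI literature; the notation may differ slightly from the talk):

\[
a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
\big[ r_t + \cdots + r_m \big]
\sum_{q \,:\, U(q,\,a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)},
\]

where \(\ell(q)\) denotes the length of program q.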

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.

3.8 Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski
Joint work of: Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that from such trajectory data we may learn temporally marginalized transition probabilities (which we call proximity statistics) that model possible transitions on larger time scales, rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration, while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired by recent work in transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation, and Approximation, pp. 212–223, 2002.


3.9 State Representation Learning in Robotics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license © Rico Jonschkowski
Joint work of: Jonschkowski, Rico; Brock, Oliver
Main reference: R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.
URL: http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation – a mapping from high-dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers – thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
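
A toy sketch of the two training signals mentioned above (our own simplified rendering with a linear observation-to-state map; the linear form and all names are assumptions, not the authors' network):

```python
import numpy as np

def representation_loss(W, A, obs, actions_onehot, alpha=1.0):
    """Loss for a linear state representation s_t = W @ o_t.

    slowness      : consecutive states should change slowly   -> ||s_{t+1} - s_t||^2
    predictability: a linear action model A should predict the change
                    -> ||s_{t+1} - s_t - A @ u_t||^2
    obs            : array (T, d_obs) of observations
    actions_onehot : array (T, d_act) of executed actions (one-hot)
    """
    states = obs @ W.T                        # (T, d_state)
    delta = states[1:] - states[:-1]          # state change between consecutive steps
    predicted = actions_onehot[:-1] @ A.T     # change predicted from the action alone
    slowness = np.sum(delta ** 2)
    predictability = np.sum((delta - predicted) ** 2)
    return slowness + alpha * predictability
```

Minimizing such a loss over W (and A) with any gradient-based optimizer yields a representation in which state changes are both small and explainable by the robot's actions.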

3.10 Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License: Creative Commons BY 3.0 Unported license © Petar Kormushev
Joint work of: Kormushev, Petar; Caldwell, Darwin G.
Main reference: P. Kormushev, D. G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e. conduct trials and evaluate the return) for each of them, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options
Timothy Mann (Technion – Haifa, IL)

License: Creative Commons BY 3.0 Unported license © Timothy Mann
Joint work of: Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduce a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.
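
A condensed sketch of a fitted value iteration backup that uses option samples (an illustrative rendering under our own assumptions; the analysis in the talk concerns a more general algorithm). Each option outcome carries its duration k, so the backup discounts by gamma**k, SMDP-style.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fitted_vi_with_options(samples, features, gamma=0.99, iterations=50):
    """Fitted value iteration from option samples (illustrative sketch).

    samples  : list of (state, outcomes) pairs, where outcomes is a list of
               (r_cum, k, next_state) tuples, one per option tried in that state;
               r_cum is the discounted reward accumulated while the option ran k steps.
    features : function mapping a state to a 1-D numpy feature vector.
    """
    X = np.array([features(s) for s, _ in samples])
    w = np.zeros(X.shape[1])
    model = Ridge(alpha=1e-3, fit_intercept=False)
    for _ in range(iterations):
        targets = []
        for _, outcomes in samples:
            # SMDP-style Bellman backup: best option, discounted by its duration k
            targets.append(max(r + (gamma ** k) * float(features(sn) @ w)
                               for r, k, sn in outcomes))
        model.fit(X, np.array(targets))
        w = model.coef_
    return w
```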

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.

3.12 Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license © Jan Hendrik Metzen
Joint work of: Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector, and likewise a skill is considered as a parameterized policy. We propose skill templates, which make it possible to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2] and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires generalizing to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.
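
A compact sketch of the generalization step described above, using Gaussian process regression from task parameters to policy parameters, each policy dimension modeled independently (the kernel and library choice are our assumptions, not necessarily those of the authors):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_skill_template(task_params, policy_params):
    """task_params   : array (N, d_task)   -- e.g. target positions of training throws
       policy_params : array (N, d_policy) -- learned DMP parameters per training task"""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gps = []
    for dim in range(policy_params.shape[1]):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(task_params, policy_params[:, dim])
        gps.append(gp)
    return gps

def generalize_to_task(gps, new_task_param):
    """Predict policy parameters and an uncertainty estimate for an unseen task."""
    x = np.atleast_2d(new_task_param)
    means, stds = zip(*(gp.predict(x, return_std=True) for gp in gps))
    return np.array(means).ravel(), np.array(stds).ravel()
```

The predictive standard deviation is the uncertainty measure that can then guide how much further adaptation by reinforcement learning the generalized skill needs.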

References
1 B. C. da Silva, G. Konidaris, and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

3.13 Online learning in Markov decision processes
Gergely Neu (Budapest University of Technology & Economics, HU)

License: Creative Commons BY 3.0 Unported license © Gergely Neu
Joint work of: Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference: A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs), where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and that the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS, or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of \(O\big(\sqrt{L\,|X|\,|A|\,T\,\log(|X||A|/L)}\big)\) in the bandit setting and \(O\big(L\sqrt{T\,\log(|X||A|/L)}\big)\) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License: Creative Commons BY 3.0 Unported license © Gerhard Neumann
Joint work of: Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference: G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distributions, where the relative entropy is used as the 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
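
The relative-entropy bound referred to above is, in its standard (non-hierarchical) REPS form, the constrained optimization problem

\[
\max_{\pi}\; \mathbb{E}_{p_\pi(s,a)}\big[R(s,a)\big]
\quad\text{s.t.}\quad
\mathrm{KL}\big(p_\pi(s,a)\,\Vert\,q(s,a)\big) \le \epsilon ,
\]

where q is the previously observed state-action distribution and ε bounds the change between successive distributions. This is quoted from the policy search literature for orientation only; the hierarchical extensions discussed in the talk add option selection, adaptation, and sequencing on top of it.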

3.15 Multi-Objective Reinforcement Learning
Ann Nowé (Free University of Brussels, BE)

License: Creative Commons BY 3.0 Unported license © Ann Nowé
Joint work of: Nowé, Ann; Van Moffaert, Kristof; Drugan, Madalina M.
Main reference: K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS, Vol. 7811, pp. 352–366, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary, or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e., either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually, so that a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto dominating future discounted reward vector separately. Hence, these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.
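
A small sketch of the non-dominated filtering operation that underlies set-based multi-objective bootstrapping (illustrative code, assuming reward vectors are to be maximized):

```python
import numpy as np

def dominates(v, w):
    """v Pareto-dominates w: at least as good in every objective, strictly better in one."""
    v, w = np.asarray(v), np.asarray(w)
    return bool(np.all(v >= w) and np.any(v > w))

def nd_filter(vectors):
    """Return the Pareto front (non-dominated subset) of a list of reward vectors."""
    front = []
    for v in vectors:
        if not any(dominates(w, v) for w in vectors if w is not v):
            front.append(v)
    return front

# Example: candidate Q-vectors for the actions available in one state (two objectives)
qs = [np.array([1.0, 0.2]), np.array([0.5, 0.9]), np.array([0.4, 0.1])]
print(nd_filter(qs))   # keeps the first two, drops the dominated third vector
```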

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration
Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Tor Lattimore
Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.

3.17 Knowledge-Seeking Agents
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or can merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer, Berlin, Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer, Berlin, Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning
Ronald Ortner (Montanuniversität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner
Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs, which allows structural information to be added to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
Joint work of: Barreto, André M. S.; Precup, Doina; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
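
A toy numpy sketch of the stochastic-factorization idea (our own illustration, not KBSF itself): if a stochastic transition matrix P is approximated as D K with stochastic factors D (n × m) and K (m × n), then the m × m matrix K D defines a compressed model whose size is independent of the number of sample transitions.

```python
import numpy as np

def compressed_model(D, K, r):
    """Build the reduced model induced by a stochastic factorization P ~= D @ K.

    D : (n, m) stochastic matrix (rows sum to 1), original states -> m anchor states
    K : (m, n) stochastic matrix, anchor states -> original states
    r : (n,) expected one-step rewards in the original model
    Returns the (m, m) transition matrix and (m,) rewards of the compressed model.
    """
    P_bar = K @ D          # transitions between the m anchor (representative) states
    r_bar = K @ r          # rewards as seen from the anchor states
    return P_bar, r_bar

def evaluate_compressed(P_bar, r_bar, gamma=0.9):
    """Solve the compressed evaluation equations v = r_bar + gamma * P_bar v."""
    m = P_bar.shape[0]
    return np.linalg.solve(np.eye(m) - gamma * P_bar, r_bar)

# Tiny example with n = 3 original states and m = 2 anchor states
D = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
K = np.array([[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]])
r = np.array([0.0, 1.0, 0.0])
print(evaluate_compressed(*compressed_model(D, K, r)))
```

Roughly speaking, values computed in the compressed model can be mapped back to the original states through D, up to the factorization error; KBSF builds such factors from the kernel weights of KBRL.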

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) approaches are appealing due to their theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
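
A schematic sketch of one BEBF iteration (a generic rendering for illustration, not the random-projection or matching-pursuit variants discussed in the talk): fit a regressor to the current Bellman residual and append its predictions as a new feature.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

def add_bellman_error_feature(Phi, Phi_next, rewards, gamma=0.99):
    """One round of Bellman error basis function construction.

    Phi, Phi_next : feature matrices (N, k) at sampled states s and successors s'
    rewards       : (N,) observed rewards
    Returns enlarged feature matrices (N, k+1) and the fitted residual model.
    """
    # 1. Fit a linear value estimate on the existing features
    #    (simple Bellman-residual least squares, chosen here for brevity)
    lsq = Ridge(alpha=1e-3, fit_intercept=False).fit(Phi - gamma * Phi_next, rewards)
    v, v_next = Phi @ lsq.coef_, Phi_next @ lsq.coef_
    # 2. Bellman residual of that estimate
    residual = rewards + gamma * v_next - v
    # 3. Fit a new basis function to the residual and append it as a feature
    bebf = DecisionTreeRegressor(max_depth=4).fit(Phi, residual)
    new_col = bebf.predict(Phi).reshape(-1, 1)
    new_col_next = bebf.predict(Phi_next).reshape(-1, 1)
    return np.hstack([Phi, new_col]), np.hstack([Phi_next, new_col_next]), bebf
```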

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring
Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel
Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_{sa}(s′) (the transition probabilities), and a function R_{sa}(s′) (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s,a): S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s,a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i: v_i > w_i and there is no j with v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_{sa}(s′) and R_{sa}(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s,a) ← (1−α)Q(s,a) + α(r + γ max_{a′} Q(s′,a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a′} Q(s′,a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets, and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1-2), pp. 51–80, 2011.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner
Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver
Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
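
The "particularly appealing form" mentioned above can be written as follows (standard statement of the deterministic policy gradient, reproduced here for reference): for a deterministic policy μ_θ and its discounted state distribution ρ^μ,

\[
\nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s\sim\rho^{\mu}}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\Big],
\]

i.e. the expected gradient of the action-value function, which avoids integrating over the action space when estimating the gradient.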

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez
Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete-time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag
Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG]; and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample-complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton
Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series 6, 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of


the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
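For readers unfamiliar with the algorithm being generalized, here is a minimal sketch of tabular TD(λ) with accumulating eligibility traces in its textbook form; it is only the starting point of the quest described above, not the generalized algorithm itself.

```python
# Textbook tabular TD(lambda) prediction with accumulating eligibility traces.
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """episodes: iterable of trajectories [(s, r, s_next, done), ...]."""
    v = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                 # eligibility traces
        for s, r, s_next, done in episode:
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            e[s] += 1.0                        # accumulating trace
            v += alpha * delta * e             # update all states by their trace
            e *= gamma * lam                   # decay traces
    return v

# usage: a single 3-state chain episode ending in a terminal reward
print(td_lambda([[(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]], 3))
```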

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g., raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g., see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g., parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions (for example, dealing with scarce human feedback), new types of experimentation (for example, incorporating visual feedback and physical manipulation), and new types of abstraction levels, such as probabilistic programming languages. I will present first steps towards a more tight integration of relational vision and relational action for interactive learning settings.² In addition, I present several new problem domains.

2 See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL: applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of Veness, Joel; Hutter, Marcus
Main reference J. Veness, "Universal RL: applications and approximations", 2013.
URL http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
Self-introduction of participants
Peter Sunehag: Tutorial on exploration/exploitation

Afternoon
Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
Ronald Ortner: Continuous RL and restless bandits
Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Doina Precup: Methods for Bellman Error Basis Function construction
Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization

Afternoon
Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
Mark B. Ring: Continual learning
Rich Sutton: The quest for the ultimate TD algorithm
Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
Marcus Hutter: Tutorial on Universal RL
Joel Veness: Universal RL – Applications and Approximations
Laurent Orseau: Optimal Universal Explorative Agents
Tor Lattimore: Bayesian Reinforcement Learning + Exploration
Laurent Orseau: More realistic assumptions for RL


Afternoon
Scott Sanner: Tutorial on Symbolic Dynamic Programming
Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
Timothy Mann: Theoretical Analysis of Planning with Options
Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
David Silver: Deterministic Policy Gradients

Thursday
Morning
Joëlle Pineau: Tutorial on POMDPs
Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
Will Uther: Tree-based MDP – PSR
Nils Siebel: Neuro-Evolution

Afternoon: Hiking

Friday
Morning
Christos Dimitrakakis: ABC and cover-tree reinforcement learning
Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
Ann Nowé: Multi-Objective Reinforcement Learning
Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montanuniversität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montanuniversität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB


modeled with a small number of state variables. Actions typically operate at the level of individual cells, but the spatial dynamics couple the states of spatially-adjacent cells. The resulting state and action spaces of these MDPs are immense. We have been working on two natural resource MDPs. The first involves the spread of tamarisk in river networks. A native of the Middle East, tamarisk has become an invasive plant in the dryland rivers and streams of the western US. Given a tamarisk invasion, a land manager must decide how and where to fight the invasion (e.g., eradicate tamarisk plants, plant native plants upstream, downstream). Although approximate or heuristic solutions to this problem would be useful, our collaborating economists tell us that our policy recommendations will carry more weight if they are provably optimal with high probability. A large problem instance involves 4.7 × 10^6 states with 2187 actions in each state. On a modern 64-bit machine, the action-value function for this problem can fit into main memory. However, computing the full transition function to sufficient accuracy to support standard value iteration requires on the order of 3 × 10^20 simulator calls. The second problem concerns the management of wildfire in Eastern Oregon. In this region, prior to European settlement, the native ponderosa pine forests were adapted to frequent low-intensity fires. These fires allow the ponderosa pine trees (which are well-adapted to survive fire) to grow very tall while preventing the accumulation of fuel at ground level. These trees provide habitat for many animal species and are also very valuable for timber. However, beginning in the early 1900s, all fires were suppressed in this landscape, which has led to the build-up of huge amounts of fuel. The result has been large catastrophic fires that kill even the ponderosa trees and that are exceptionally expensive to control. The goal of fire management is to return the landscape to a state where frequent low-intensity fires are again the normal behavior. There are two concrete fire management problems: Let Burn (decide which fires to suppress) and Fuel Treatment (decide in which cells to perform fuel reduction treatments). Note that in these problems the system begins in an unusual non-equilibrium state, and the goal is to return the system to a desired steady state distribution. Hence, these problems are not problems of reinforcement learning but rather problems of MDP planning for a specific start state. Many of the assumptions in RL papers, such as ergodicity of all policies, are not appropriate for this setting. Note also that it is highly desirable to produce a concrete policy (as opposed to just producing near-optimal behavior via receding horizon control). A concrete policy can be inspected by stakeholders to identify missing constraints, state variables, and components of the reward function. To solve these problems, we are exploring two lines of research. For tamarisk, we have been building on recent work in PAC-RL algorithms (e.g., MBIE, UCRL, UCRL2, FRTDP, OP) to develop PAC-MDP planning algorithms. We are pursuing two innovations. First, we have developed an exploration heuristic based on an upper bound on the discounted state occupancy probability. Second, we are developing tighter confidence intervals in order to terminate the search earlier. These are based on combining Good-Turing estimates of missing mass (i.e., for unseen outcomes) with sequential confidence intervals for multinomial distributions. These reduce the degree to which we must rely on the union bound and hence give us tighter convergence. For the Let Burn wildfire problem, we are exploring approximate policy iteration methods. For Fuel Treatment, we are extending Crowley's Equilibrium Policy Gradient methods. These define a local policy function that stochastically chooses the action for cell i based on the actions already chosen for the cells in the surrounding neighborhood. A Gibbs-sampling-style MCMC method repeatedly samples from these local policies until a global equilibrium is reached. This equilibrium defines the global policy. At equilibrium, gradient estimates can be computed and applied to improve the policy.
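The Good-Turing idea mentioned above can be sketched as follows; the deviation constant in the upper bound is an assumption made for illustration and not necessarily the one used in this work.

```python
# Sketch: Good-Turing estimate of the probability mass of unseen simulator
# outcomes, with a crude high-probability upper confidence bound.
import math
from collections import Counter

def good_turing_missing_mass(samples, delta=0.05):
    n = len(samples)
    counts = Counter(samples)
    n1 = sum(1 for c in counts.values() if c == 1)   # number of singletons
    estimate = n1 / n                                 # Good-Turing estimate
    # deviation term of McAllester/Ortiz type; the constant here is an assumption
    bound = estimate + (1.0 + math.sqrt(2.0)) * math.sqrt(math.log(1.0 / delta) / n)
    return estimate, min(1.0, bound)

# usage: outcomes of 1000 simulator calls for one state-action pair (toy data)
outcomes = ["s%d" % (i % 37) for i in range(1000)]
print(good_turing_missing_mass(outcomes))
```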


3.3 ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis (EPFL – Lausanne, CH)

License Creative Commons BY 3.0 Unported license © Christos Dimitrakakis

Joint work of Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning", arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.
URL http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on methods for Bayesian reinforcement learning using Thompson sampling, but differing significantly in their prior. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.
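A minimal sketch of the ABC idea (my own illustration with a toy summary statistic, not the authors' code): a simulator parameter is accepted as an approximate posterior sample when the statistics of its simulated trajectory fall within ε of the observed ones.

```python
# Likelihood-free posterior sampling over parameterized simulators (sketch).
import random

def summary_stats(trajectory):
    # assumption: mean reward is a sufficient-enough statistic for illustration
    return sum(r for (_s, _a, r) in trajectory) / max(1, len(trajectory))

def abc_posterior_sample(prior_sample, simulate, history, epsilon, max_tries=1000):
    target = summary_stats(history)
    for _ in range(max_tries):
        theta = prior_sample()                    # draw a candidate simulator
        sim_traj = simulate(theta, len(history))  # roll it out
        if abs(summary_stats(sim_traj) - target) < epsilon:
            return theta                          # accepted approximate posterior draw
    return theta                                  # fallback: last candidate

# usage sketch with a toy one-parameter simulator
history = [(0, 0, 0.6)] * 20
simulate = lambda th, n: [(0, 0, random.gauss(th, 0.1)) for _ in range(n)]
theta = abc_posterior_sample(lambda: random.uniform(0, 1), simulate, history, 0.05)
print(theta)   # this sample would then drive Thompson-sampling-style planning
```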

References
1 N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809.

3.4 Some thoughts on Transfer Learning in Reinforcement Learning: on States and Representation
Lutz Frommberger (Universität Bremen, DE)

License Creative Commons BY 3.0 Unported license © Lutz Frommberger

Main reference L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning", Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.
URL http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success of prior learning tasks. This is something that a


learning organism does implicitly from birth on; no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong learning type setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is most likely given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In very common words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, be it in-task (i.e., generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, e.g., are defined a-priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow to reuse knowledge of the actual learning task if certain state variables are identical. This works if structural information is made part of the state space, and it does not require a mapping function. However, it needs a-priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g., [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representation, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248. Springer, 2012.
4 Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
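To make the kind of criterion involved concrete, here is one common mean-variance formulation written as a sketch; the exact variability measure, penalty, and constraint handling in the paper may differ, and λ is a hypothetical trade-off weight.

```latex
% Illustrative mean-variance criterion (assumed form, not necessarily the paper's):
% J = expected discounted return, Lambda = its variance, lambda = trade-off weight.
\begin{align*}
  J(\theta) &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} r_t \;\Big|\; \pi_\theta\Big], \qquad
  \Lambda(\theta) = \mathbb{E}\Big[\big(\textstyle\sum_{t=0}^{\infty}\gamma^{t} r_t\big)^{2} \;\Big|\; \pi_\theta\Big] - J(\theta)^{2},\\
  \theta^{*} &= \arg\max_{\theta}\; \big( J(\theta) - \lambda\,\Lambda(\theta) \big),
  \quad \text{with the actor ascending } \nabla_\theta J - \lambda \nabla_\theta \Lambda
  \text{ as estimated by the critic.}
\end{align*}
```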

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License Creative Commons BY 3.0 Unported license © Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e., the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning
Marcus Hutter (Australian National University, AU)

License Creative Commons BY 3.0 Unported license © Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical


theory of machine super-intelligence. More precisely, AIXI is an elegant, parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations, such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.
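For reference, a compact rendering of the AIXI action-selection rule in its program-based form, following the standard formulation in [1] (m is the horizon, U a universal Turing machine, and ℓ(q) the length of program q):

```latex
% AIXI action selection (standard form from the Universal AI literature):
\begin{equation*}
  a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
  \big[ r_t + \cdots + r_m \big]
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\end{equation*}
```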

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.

3.8 Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski (TU Berlin, DE)

License Creative Commons BY 3.0 Unported license © Rico Jonschkowski

Joint work of Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that from such trajectory data we may learn temporally marginalized transition probabilities, which we call proximity statistics, that model possible transitions on larger time scales rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration, while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired by recent work on transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.
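A minimal sketch of how such proximity statistics could be estimated and sparsified from trajectory data (my reading of the idea; the discounting scheme and the top-k sparsification are assumptions for illustration):

```python
# Estimate discounted "proximity" of each state to later states in trajectories,
# then keep only the strongest entries per state (the sparsification step).
from collections import defaultdict

def estimate_proximity(trajectories, gamma=0.9):
    prox, counts = defaultdict(lambda: defaultdict(float)), defaultdict(int)
    for traj in trajectories:                 # traj: list of visited states
        for i, s in enumerate(traj):
            counts[s] += 1
            for j in range(i, len(traj)):     # every later state is "reachable"
                prox[s][traj[j]] += gamma ** (j - i)
    # average over the number of visits of each state
    return {s: {g: v / counts[s] for g, v in gs.items()} for s, gs in prox.items()}

def sparsify(prox, k=3):
    return {s: dict(sorted(gs.items(), key=lambda kv: -kv[1])[:k])
            for s, gs in prox.items()}

# usage: two toy trajectories over integer states
prox = sparsify(estimate_proximity([[0, 1, 2, 3], [0, 2, 3]]))
print(prox[0])   # proximity of state 0 to the states it can reach
```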

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation and Approximation, pp. 212–223, 2002.


3.9 State Representation Learning in Robotics
Rico Jonschkowski (TU Berlin, DE)

License Creative Commons BY 3.0 Unported license © Rico Jonschkowski

Joint work of Jonschkowski, Rico; Brock, Oliver
Main reference R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.
URL http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation: a mapping from high dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers, thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations, following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
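A minimal sketch of the two objectives under an assumed linear representation (the authors use a neural network; the loss form, regularizer, and optimizer here are illustrative assumptions):

```python
# Learn a linear state representation s = W o whose consecutive states change
# slowly (slowness) and whose changes are explained by a linear model A of the
# previous state and action (predictability).
import numpy as np
from scipy.optimize import minimize

def fit_representation(obs, acts, dim=2, slow_w=1.0, pred_w=1.0):
    d_o, d_a = obs.shape[1], acts.shape[1]
    n_W = dim * d_o

    def loss(params):
        W = params[:n_W].reshape(dim, d_o)
        A = params[n_W:].reshape(dim, dim + d_a)
        s, s_next = obs[:-1] @ W.T, obs[1:] @ W.T
        slowness = np.mean(np.sum((s_next - s) ** 2, axis=1))
        pred = s + np.hstack([s, acts[:-1]]) @ A.T          # predicted next state
        predictability = np.mean(np.sum((s_next - pred) ** 2, axis=1))
        # small penalty keeps W away from the trivial all-zero representation
        return slow_w * slowness + pred_w * predictability + (1.0 - np.sum(W ** 2)) ** 2

    x0 = np.random.default_rng(0).normal(scale=0.1, size=n_W + dim * (dim + d_a))
    res = minimize(loss, x0, method="L-BFGS-B")
    return res.x[:n_W].reshape(dim, d_o)

# usage: random observations/actions just to show the call signature
obs, acts = np.random.rand(100, 5), np.random.rand(100, 1)
print(fit_representation(obs, acts).shape)   # (2, 5): observation -> 2-D state
```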

3.10 Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License Creative Commons BY 3.0 Unported license © Petar Kormushev

Joint work of Kormushev, Petar; Caldwell, Darwin G.
Main reference P. Kormushev, D. G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.
URL http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task, and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e., conduct trials and evaluate the return) for each one, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options
Timothy Mann (Technion – Haifa, IL)

License Creative Commons BY 3.0 Unported license © Timothy Mann

Joint work of Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.
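The following sketch illustrates the analyzed setting, a fitted-value-iteration backup over option samples; it is my own toy rendering rather than the paper's algorithm, and the regressor and sample format are assumptions.

```python
# Toy sketch: fitted value iteration where each backup uses option samples
# (R, g, s') with R the discounted reward collected while the option ran and
# g = gamma^duration the accumulated discount. Ridge regression is an assumed choice.
import numpy as np
from sklearn.linear_model import Ridge

def fvi_with_options(samples_per_state, states, n_iters=20):
    """states: (N, d) array; samples_per_state[i]: list of option outcomes
    (R, g, s_next) sampled from states[i], one entry per available option."""
    model, v = None, (lambda S: np.zeros(len(S)))
    for _ in range(n_iters):
        # regression target: best sampled option backup at each state
        y = np.array([max(R + g * v(np.atleast_2d(sn))[0] for (R, g, sn) in outcomes)
                      for outcomes in samples_per_state])
        model = Ridge(alpha=1e-3).fit(states, y)
        v = model.predict
    return model

# usage: two states, each with two option outcomes (R, gamma^duration, s')
states = np.array([[0.0], [1.0]])
samples_per_state = [
    [(1.0, 0.5, np.array([1.0])), (0.2, 0.9, np.array([0.5]))],
    [(2.0, 0.5, np.array([1.0])), (0.5, 0.9, np.array([0.0]))],
]
v_hat = fvi_with_options(samples_per_state, states)
print(v_hat.predict([[0.0], [0.5], [1.0]]))
```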

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.

3.12 Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen (Universität Bremen, DE)

License Creative Commons BY 3.0 Unported license © Jan Hendrik Metzen

Joint work of Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector, and likewise a skill is considered as a parameterized policy. We propose skill templates, which allow to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space, and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2]


and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires generalizing to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.
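A minimal sketch of the skill-template idea under assumed interfaces (not the authors' implementation): a GP maps task parameters to policy parameters and reports its predictive uncertainty, which can seed the subsequent RL adaptation.

```python
# Skill template as GP regression from task parameters to policy parameters.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class SkillTemplate:
    def __init__(self):
        self.gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-4))

    def fit(self, task_params, policy_params):
        # task_params: (N, d_task); policy_params: (N, d_policy) learned by RL
        self.gp.fit(task_params, policy_params)
        return self

    def generalize(self, task_param):
        # predictive mean = initial policy for the new task,
        # predictive std = how much further RL adaptation to expect
        mean, std = self.gp.predict(np.atleast_2d(task_param), return_std=True)
        return mean[0], std[0]

# usage: 5 training tasks (1-D target position), 3 policy parameters each
template = SkillTemplate().fit(np.linspace(0, 1, 5)[:, None], np.random.rand(5, 3))
mean, std = template.generalize([0.42])
print(mean, std)
```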

References
1 B. C. da Silva, G. Konidaris, and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C. E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

3.13 Online learning in Markov decision processes
Gergely Neu (Budapest University of Technology & Economics, HU)

License Creative Commons BY 3.0 Unported license © Gergely Neu

Joint work of Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of O(√(L|X||A|T log(|X||A|/L))) in the bandit setting and O(L√(T log(|X||A|/L))) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License Creative Commons BY 3.0 Unported license © Gerhard Neumann

Joint work of Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distribution, where the relative entropy is used as 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
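The generic episodic REPS weighting step behind this line of work can be sketched as follows (a simplified, non-hierarchical version in its textbook form, not the paper's exact algorithm):

```python
# Turn sampled episode returns into weights whose implied distribution stays
# within a KL bound epsilon of the sampling distribution; a weighted ML fit of
# the policy then follows.
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(returns, epsilon=0.5):
    R = np.asarray(returns, dtype=float)
    R = R - R.max()                                    # numerical stability

    def dual(eta):                                     # episodic REPS dual function
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x
    w = np.exp(R / eta)
    return w / w.sum()

# usage: weights for 6 sampled episodes
print(reps_weights([1.0, 2.0, 1.5, 3.0, 0.5, 2.5], epsilon=0.5))
```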

3.15 Multi-Objective Reinforcement Learning
Ann Nowé (Free University of Brussels, BE)

License Creative Commons BY 3.0 Unported license © Ann Nowé

Joint work of Nowé, Ann; Van Moffaert, Kristof; Drugan, M. Madalina
Main reference K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS, Vol. 7811, pp. 352–366, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment (either single-state or multi-state) provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e., either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually; therefore a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial order Pareto dominance


relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto dominating future discounted reward vector separately. Hence, these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.
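As a small illustration of the scalarization route mentioned above (with an assumed utopian reference point), Chebyshev scalarization of vector-valued Q-estimates looks like this:

```python
# Non-linear Chebyshev scalarization for action selection over Q-vectors.
import numpy as np

def chebyshev_action(q_vectors, weights, utopian):
    """q_vectors: (n_actions, n_objectives); smaller Chebyshev distance is better."""
    dist = np.max(weights * np.abs(np.asarray(q_vectors) - utopian), axis=1)
    return int(np.argmin(dist))

# usage: two objectives, three actions, equal weights
q = [[1.0, 5.0], [3.0, 3.0], [5.0, 1.0]]
print(chebyshev_action(q, weights=np.array([0.5, 0.5]), utopian=np.array([6.0, 6.0])))
```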

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration
Tor Lattimore (Australian National University – Canberra, AU)

License Creative Commons BY 3.0 Unported license © Tor Lattimore

Joint work of Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesaro average.

3.17 Knowledge-Seeking Agents
Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic


computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment, in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning
Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of Orseau, Laurent; Ring, Mark
Main reference L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance, but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more into its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer, Berlin, Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer, Berlin, Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning
Ronald Ortner (Montanuniversität Leoben, AT)

License Creative Commons BY 3.0 Unported license © Ronald Ortner

Joint work of Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

Joint work of Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work, we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction; it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space


complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
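The core factorization trick can be sketched as follows (my own reading, with made-up Gaussian kernels and toy data; not the KBSF implementation):

```python
# Stochastic factorization sketch: build row-stochastic factors from n sample
# transitions and m representative states; their product gives a reduced
# m-state transition model that can be solved exactly.
import numpy as np

def gaussian_kernel_matrix(X, Y, bw=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bw ** 2))
    return K / K.sum(axis=1, keepdims=True)          # rows sum to 1 (stochastic)

def reduced_model(s, s_next, rewards, rep_states, bw=0.5):
    D = gaussian_kernel_matrix(rep_states, s, bw)        # m x n
    K = gaussian_kernel_matrix(s_next, rep_states, bw)   # n x m
    P_small = D @ K                                      # m x m reduced transitions
    r_small = D @ rewards                                # m reduced rewards
    return P_small, r_small

# usage: 200 one-dimensional sample transitions compressed to 10 representative states
s = np.random.rand(200, 1); s_next = np.clip(s + 0.1 * np.random.randn(200, 1), 0, 1)
P_small, r_small = reduced_model(s, s_next, s.ravel(), np.linspace(0, 1, 10)[:, None])
print(P_small.shape, np.allclose(P_small.sum(axis=1), 1.0))
```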

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on


what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license
© Manuela Ruiz-Montiel

Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.

URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors, and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π : S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a) : S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: a vector v dominates w when ∃i : v_i > w_i and there is no j with v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(⋃_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1-2), pp. 51–80, 2011.
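A minimal sketch of the set-valued bookkeeping that the abstract describes: the ND operator keeps the Pareto front of a set of reward vectors, and a set-valued backup replaces the scalar max over successor actions. The controlled set addition and the two per-vector data structures of PQ-learning are omitted here; the toy vectors and discount factor are assumptions for illustration.

```python
import numpy as np

def dominates(v, w):
    """Pareto dominance: v is at least as good everywhere and strictly better somewhere."""
    v, w = np.asarray(v), np.asarray(w)
    return bool(np.all(v >= w) and np.any(v > w))

def nd(vectors):
    """ND operator: keep only the non-dominated vectors (the Pareto front)."""
    vectors = [np.asarray(v) for v in vectors]
    return [v for i, v in enumerate(vectors)
            if not any(dominates(w, v) for j, w in enumerate(vectors) if j != i)]

def set_backup(reward_vec, successor_q_sets, gamma=0.9):
    """Set-valued analogue of r + gamma * max_a' Q(s', a'): take the union of the
    successor Q-sets, keep the Pareto front, and shift every surviving vector by
    the immediate reward vector."""
    union = [q for q_set in successor_q_sets for q in q_set]
    return [np.asarray(reward_vec) + gamma * np.asarray(q) for q in nd(union)]

# toy example with two objectives and two successor actions
q_a1 = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
q_a2 = [np.array([0.0, 1.0]), np.array([0.4, 0.4])]   # (0.4, 0.4) is dominated by (0.5, 0.5)
print(set_backup(reward_vec=[0.1, 0.1], successor_q_sets=[q_a1, q_a2]))
```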

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license
© Scott Sanner

Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.


3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College London, GB)

License: Creative Commons BY 3.0 Unported license
© David Silver

Joint work of: Silver, David

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
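The appealing form mentioned above is ∇_θ J ≈ E[∇_θ μ_θ(s) ∇_a Q(s, a)|_{a = μ_θ(s)}]. Below is a minimal sketch of that actor update for a linear deterministic policy with a critic assumed to be known in closed form; the parameterization, step size, and toy objective are illustrative assumptions, not the off-policy actor-critic algorithm of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, lr_actor = 3, 1e-2
theta = np.zeros(state_dim)            # linear deterministic policy: mu(s) = theta . s

def mu(s):
    return float(theta @ s)

def critic_grad_a(s, a):
    """dQ/da for a critic assumed known here, Q(s, a) = -(a - s.sum())**2.
    In the actual algorithm the critic would be learned off-policy from
    exploratory behaviour data."""
    return -2.0 * (a - s.sum())

for step in range(2000):
    s = rng.normal(size=state_dim)     # a state visited by the behaviour policy
    grad_theta_mu = s                  # for the linear policy, d mu / d theta = s
    # deterministic policy gradient step: grad_theta mu(s) * dQ/da at a = mu(s)
    theta += lr_actor * grad_theta_mu * critic_grad_a(s, mu(s))

print(theta)                           # approaches [1, 1, 1], i.e. mu(s) -> s.sum()
```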

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license
© Orhan Sönmez

Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step, we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license
© Peter Sunehag

Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.

URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample-complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantee, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
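To make the two strategy families named above concrete, here is a small illustrative sketch contrasting optimism (UCB1-style confidence bonuses) with Bayesian posterior (Thompson) sampling on a Bernoulli bandit; the arm means, horizon, and priors are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])     # assumed Bernoulli arm means
T = 5000

def ucb1(means, horizon):
    n = np.ones(len(means))                                  # pull counts (one initial pull each)
    s = (rng.random(len(means)) < means).astype(float)       # reward sums
    total = s.sum()
    for t in range(len(means), horizon):
        bonus = np.sqrt(2 * np.log(t + 1) / n)               # optimism in the face of uncertainty
        a = np.argmax(s / n + bonus)
        r = float(rng.random() < means[a])
        n[a] += 1; s[a] += r; total += r
    return total

def thompson(means, horizon):
    alpha, beta = np.ones(len(means)), np.ones(len(means))   # Beta(1,1) prior per arm
    total = 0.0
    for t in range(horizon):
        a = np.argmax(rng.beta(alpha, beta))                 # sample a mean from each posterior
        r = float(rng.random() < means[a])
        alpha[a] += r; beta[a] += 1 - r; total += r
    return total

print("UCB1 reward:    ", ucb1(true_means, T))
print("Thompson reward:", thompson(true_means, T))
```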

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license
© Richard S. Sutton

Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 2010.

URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.
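For reference, a minimal sketch of the base algorithm that the talk sets out to generalize: tabular, on-policy TD(λ) with accumulating eligibility traces (a textbook form, not the gradient or off-policy variants under discussion). The random-walk task and step sizes are assumptions for illustration.

```python
import numpy as np

def td_lambda(n_states=7, episodes=500, alpha=0.1, gamma=1.0, lam=0.8, seed=0):
    """Tabular TD(lambda) with accumulating traces on a simple random walk.
    States 0 and n_states-1 are terminal; reward 1 for reaching the right end."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                   # eligibility traces
        s = n_states // 2                        # start in the middle
        while s not in (0, n_states - 1):
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == n_states - 1 else 0.0
            v_next = 0.0 if s_next in (0, n_states - 1) else v[s_next]
            delta = r + gamma * v_next - v[s]    # TD error
            e[s] += 1.0                          # accumulate trace for the visited state
            v += alpha * delta * e               # credit all recently visited states
            e *= gamma * lam                     # decay traces
            s = s_next
    return v

print(np.round(td_lambda(), 2))   # interior values approach 1/6 ... 5/6
```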

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license
© Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that people in the computer vision community also wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback; new types of experimentation – for example incorporating visual feedback and physical manipulation; and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive² learning settings. In addition, I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL: applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license
© Joel Veness

Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL: applications and approximations", 2013.

URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license
© Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning:
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon:
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning:
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon:
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning:
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon:
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning:
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning:
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB


3.3 ABC and Cover Tree Reinforcement Learning
Christos Dimitrakakis (EPFL – Lausanne, CH)

License: Creative Commons BY 3.0 Unported license
© Christos Dimitrakakis

Joint work of: Dimitrakakis, Christos; Tziortziotis, Nikolaos
Main reference: C. Dimitrakakis, N. Tziortziotis, "ABC Reinforcement Learning", arXiv:1303.6977v4 [stat.ML], 2013; to appear in Proc. of ICML'13.

URL: http://arxiv.org/abs/1303.6977v4

In this talk I presented our recent results on methods for Bayesian reinforcement learning using Thompson sampling, but differing significantly in their prior. The first, Approximate Bayesian Computation Reinforcement Learning (ABC-RL), employs an arbitrary prior over a set of simulators and is most suitable in cases where an uncertain simulation model is available. The second, Cover Tree Bayesian Reinforcement Learning (CTBRL), performs closed-form online Bayesian inference on a cover tree and is suitable for arbitrary reinforcement learning problems when little is known about the environment and fast inference is essential. ABC-RL introduces a simple, general framework for likelihood-free Bayesian reinforcement learning through Approximate Bayesian Computation (ABC). The advantage is that we only require a prior distribution on a class of simulators. This is useful when a probabilistic model of the underlying process is too complex to formulate, but where detailed simulation models are available. ABC-RL allows the use of any Bayesian reinforcement learning technique in this case. It can be seen as an extension of simulation methods to both planning and inference. We experimentally demonstrate the potential of this approach in a comparison with LSPI. Finally, we introduce a theorem showing that ABC is sound, in the sense that the KL divergence between the incomputable true posterior and the ABC approximation is bounded by an appropriate choice of statistics. CTBRL proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.

References
1 N. Tziortziotis, C. Dimitrakakis and K. Blekas. Cover Tree Bayesian Reinforcement Learning. arXiv:1305.1809, 2013.
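A rough sketch of the likelihood-free step behind ABC-RL (an illustration under strong simplifications, not the authors' algorithm): parameters are drawn from a prior over simulators, trajectories are simulated, and only parameters whose summary statistics land close to the observed statistics are kept as approximate posterior samples; a sampled simulator can then be used for planning, Thompson-sampling style. The simulator, summary statistic, and tolerance below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n_steps=200):
    """Assumed 1-d toy simulator: step increments with unknown drift theta."""
    return theta + rng.normal(0, 1, n_steps)

def summary(trajectory):
    """Summary statistic used to compare real and simulated behaviour."""
    return trajectory.mean()

def abc_posterior_samples(observed, prior_sampler, n_draws=5000, tol=0.05):
    """Keep the prior draws whose simulated summary lies within `tol` of the observed one."""
    obs_stat = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler()
        if abs(summary(simulate(theta)) - obs_stat) < tol:
            accepted.append(theta)
    return np.array(accepted)

# the "real" environment has drift 0.7; the prior over simulators is uniform on [0, 1]
observed = simulate(0.7)
posterior = abc_posterior_samples(observed, prior_sampler=lambda: rng.uniform(0, 1))
theta_sample = rng.choice(posterior)     # Thompson-style: plan as if this simulator were true
print(len(posterior), posterior.mean(), theta_sample)
```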

3.4 Some thoughts on Transfer Learning in Reinforcement Learning – on States and Representation
Lutz Frommberger (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license
© Lutz Frommberger

Main reference: L. Frommberger, "Qualitative Spatial Abstraction in Reinforcement Learning", Cognitive Technologies Series, ISBN 978-3-642-16590-0, Springer, 2010.

URL: http://dx.doi.org/10.1007/978-3-642-16590-0

The term "transfer learning" is a fairly sophisticated term for something that can be considered a core component of any learning effort of a human or animal: to base the solution to a new problem on experience and learning success of prior learning tasks. This is something that a learning organism does implicitly from birth on: no task is ever isolated, but embedded in a common surrounding or history. In contrast to this lifelong-learning-type setting, transfer learning in RL [5] assumes two different MDPs M and M′ that have something "in common". This commonality is most likely given in a task mapping function that maps states and actions from M to M′ as a basis for reusing learned policies. Task mappings can be given by human supervisors or learned, but mostly there is some instance telling the learning agent what to do to benefit from its experience. In plain words: here is task M, there is task M′, and this is how you can bridge between them. This is a fairly narrow view on information reuse, and more organic and autonomous variants of knowledge transfer are desirable. Knowledge transfer, may it be in-task (i.e. generalization) or cross-task, exploits similarity between tasks. By task mapping functions, information on similarity is brought into the learning process from outside. This also holds for approaches that do not require an explicit state mapping [2, 4], where relations or agent spaces, for example, are defined a-priori. What is mostly lacking so far is the agent's ability to recognize similarities on its own and/or seamlessly benefit from prior experiences as an integral part of the new learning effort. An intelligent learning agent should easily notice if certain parts of the current task are identical or similar to an earlier learning task, for example general movement skills that remain constant over many specialized learning tasks. In prior work I proposed generalization approaches such as task space tile coding [1] that allow to reuse knowledge of the actual learning task if certain state variables are identical. This works if structural information is made part of the state space, and it does not require a mapping function. However, it needs a-priori knowledge of which state variables are critical for action selection in a structural way. Recent approaches foster the hope that such knowledge can be retrieved by the agent itself; e.g., [3] allows for identification of state variables that have a generally high impact on action selection over one or several tasks. But even if we can identify and exploit certain state variables that encode structural information and have this generalizing impact, these features must at least exist. If they do not exist in the state representation, such approaches fail. For example, if the distance to the next obstacle in front of an agent is this critical value, it does not help if the agent's position and the position of the obstacle are in the state representation, as this implicitly hides the critical information. Thus, again, the question of state space design becomes evident: How can we ensure that relevant information is encoded on the level of features? Or how can we exploit information that is only implicitly given in the state representation? Answering these questions will be necessary to take the next step into autonomous knowledge reuse for RL agents.

References
1 Lutz Frommberger. Task space tile coding: In-task and cross-task generalization in reinforcement learning. In 9th European Workshop on Reinforcement Learning (EWRL9), Athens, Greece, September 2011.
2 George D. Konidaris and Andrew G. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), pp. 489–496, Pittsburgh, PA, June 2006.
3 Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: shaping and feature selection. In Recent Advances in Reinforcement Learning, pp. 237–248, Springer, 2012.
4 Prasad Tadepalli, Robert Givan and Kurt Driessens. Relational reinforcement learning: An overview. In Proc. of the ICML-2004 Workshop on Relational Reinforcement Learning, pp. 1–9, 2004.
5 Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.


3.5 Actor-Critic Algorithms for Risk-Sensitive MDPs
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license
© Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this work, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
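As a simplified illustration of a variance-penalized objective (a Monte-Carlo sketch, not the actor-critic algorithms with convergence guarantees described above): the gradient of J(θ) = E[R] − λ Var[R] is estimated with the score-function trick on a two-armed toy problem where the higher-mean arm is also much riskier. The penalty weight, batch size, and arm distributions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # softmax preferences over two actions
lam, lr, batch = 1.0, 0.1, 500      # variance-penalty weight, step size, batch size

# arm 0: modest mean, low variance; arm 1: higher mean, much higher variance
means, stds = np.array([0.5, 0.6]), np.array([0.1, 1.0])

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

for it in range(300):
    p = policy(theta)
    a = rng.choice(2, size=batch, p=p)
    r = rng.normal(means[a], stds[a])
    # score function for the softmax policy: d log pi(a) / d theta = onehot(a) - p
    score = np.eye(2)[a] - p
    grad_mean = (r[:, None] * score).mean(0)                 # grad of E[R]
    grad_second = ((r ** 2)[:, None] * score).mean(0)        # grad of E[R^2]
    grad_var = grad_second - 2 * r.mean() * grad_mean        # grad of Var[R]
    theta += lr * (grad_mean - lam * grad_var)               # ascend E[R] - lam * Var[R]

print(policy(theta))   # puts most probability on the low-variance arm
```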

3.6 Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)

License: Creative Commons BY 3.0 Unported license
© Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms are used to solve sequential decision-making tasks where the environment (i.e. the dynamics and the rewards) is not completely known and/or the size of the state and action spaces is too large. In these scenarios, the convergence and performance guarantees of the standard DP algorithms are no longer valid, and a different theoretical analysis has to be developed. Statistical learning theory (SLT) has been a fundamental theoretical tool to explain the interaction between the process generating the samples and the hypothesis space used by learning algorithms, and has shown when and how well classification and regression problems can be solved. In recent years, SLT tools have been used to study the performance of batch versions of RL and ADP algorithms, with the objective of deriving finite-sample bounds on the performance loss (w.r.t. the optimal policy) of the policy learned by these methods. Such an objective requires to effectively combine SLT tools with the ADP algorithms and to show how the error is propagated through the iterations of these iterative algorithms.

3.7 Universal Reinforcement Learning
Marcus Hutter (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license
© Marcus Hutter

There is great interest in understanding and constructing generally intelligent systems approaching and ultimately exceeding human intelligence. Universal AI is such a mathematical theory of machine super-intelligence. More precisely, AIXI is an elegant, parameter-free theory of an optimal reinforcement learning agent embedded in an arbitrary unknown environment that possesses essentially all aspects of rational intelligence. The theory reduces all conceptual AI problems to pure computational questions. After a brief discussion of its philosophical, mathematical, and computational ingredients, I will give a formal definition and measure of intelligence, which is maximized by AIXI. AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker, for which I will present general optimality results. This also motivates some variations, such as knowledge-seeking and optimistic agents, and feature reinforcement learning. Finally, I present some recent approximations, implementations, and applications of this modern top-down approach to AI.

References
1 M. Hutter. Universal Artificial Intelligence. Springer, Berlin, 2005.
2 J. Veness, K. S. Ng, M. Hutter, W. Uther and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
3 S. Legg. Machine Super Intelligence. PhD thesis, IDSIA, Lugano, Switzerland, 2008.
4 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
5 T. Lattimore. Theory of General Reinforcement Learning. PhD thesis, Research School of Computer Science, Australian National University, 2014.

3.8 Temporal Abstraction by Sparsifying Proximity Statistics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license
© Rico Jonschkowski

Joint work of: Jonschkowski, Rico; Toussaint, Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcement learning. In previous work, such abstractions were found through the analysis of a set of experienced or demonstrated trajectories [2]. We propose that from such trajectory data we may learn temporally marginalized transition probabilities – which we call proximity statistics – that model possible transitions on larger time scales, rather than learning 1-step transition probabilities. Viewing the proximity statistics as state values allows the agent to generate greedy policies from them. Making the statistics sparse and combining proximity estimates by proximity propagation can substantially accelerate planning compared to value iteration, while keeping the size of the statistics manageable. The concept of proximity statistics and its sparsification approach is inspired by recent work on transit-node routing in large road networks [1]. We show that sparsification of these proximity statistics implies an approach to the discovery of temporal abstractions. Options defined by subgoals are shown to be a special case of the sparsification of proximity statistics. We demonstrate the approach and compare various sparsification schemes in a stochastic grid world.

References
1 Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, Dominik Schultes. In transit to constant time shortest-path queries in road networks. Workshop on Algorithm Engineering and Experiments, 46–59, 2007.
2 Martin Stolle, Doina Precup. Learning options in reinforcement learning. Proc. of the 5th Int'l Symp. on Abstraction, Reformulation and Approximation, pp. 212–223, 2002.
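A rough sketch of what such a proximity statistic could look like under simplifying assumptions (illustrative only, not the authors' method): discounted multi-step transition probabilities are accumulated from a 1-step model, small entries are pruned to keep the statistics sparse, and the proximities to a goal state are then read as values for greedy action selection.

```python
import numpy as np

def proximity_statistics(P, horizon=30, gamma=0.9, threshold=1e-3):
    """Discounted multi-step proximities accumulated from a 1-step model P,
    with small entries pruned away to keep the statistics sparse."""
    prox = np.zeros_like(P)
    P_k = np.eye(P.shape[0])
    for k in range(1, horizon + 1):
        P_k = P_k @ P                      # k-step transition probabilities
        prox += gamma ** k * P_k
    prox[prox < threshold] = 0.0           # sparsification step
    return prox

# toy 5-state chain with a sticky random walk
n = 5
P = np.zeros((n, n))
for s in range(n):
    P[s, s] += 0.2
    P[s, max(s - 1, 0)] += 0.4
    P[s, min(s + 1, n - 1)] += 0.4

prox = proximity_statistics(P)
goal = n - 1
values = prox[:, goal]                     # proximity to the goal, read like a value function
print(np.round(values, 3))
print("greedy move from state 0:", "right" if values[1] > values[0] else "stay/left")
```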

13321

10 13321 ndash Reinforcement Learning

3.9 State Representation Learning in Robotics
Rico Jonschkowski (TU Berlin, DE)

License: Creative Commons BY 3.0 Unported license
© Rico Jonschkowski

Joint work of: Jonschkowski, Rico; Brock, Oliver
Main reference: R. Jonschkowski, O. Brock, "Learning Task-Specific State Representations by Maximizing Slowness and Predictability", in Proc. of the 6th Int'l Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems (ERLARS'13), 2013.

URL: http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-13-ERLARS-final.pdf

The success of reinforcement learning in robotic tasks is highly dependent on the state representation – a mapping from high dimensional sensory observations of the robot to states that can be used for reinforcement learning. Currently, this representation is defined by human engineers – thereby solving an essential part of the robotic learning task. However, this approach does not scale, because it restricts the robot to tasks for which representations have been predefined by humans. For robots that are able to learn new tasks, we need state representation learning. We sketch how this problem can be approached by iteratively learning a hierarchy of task-specific state representations, following a curriculum. We then focus on a single step in this iterative procedure: learning a state representation that allows the robot to solve a single task. To find this representation, we optimize two characteristics of good state representations: predictability and slowness. We implement these characteristics in a neural network and show that this approach can find good state representations from visual input in simulated robotic tasks.
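A minimal sketch of the two training signals named above, evaluated for a linear observation-to-state mapping (the talk's implementation uses a neural network; the linear map, loss weights, and synthetic data here are assumptions): slowness penalizes large state changes between consecutive time steps, and predictability penalizes state changes that a simple action-conditioned linear model cannot explain.

```python
import numpy as np

rng = np.random.default_rng(0)

def slowness_loss(states):
    """Consecutive states should change slowly."""
    return np.mean(np.sum(np.diff(states, axis=0) ** 2, axis=1))

def predictability_loss(states, actions):
    """State changes should be explainable from the action alone: fit a linear
    model delta_s ~ A * a by least squares and measure the residual error."""
    deltas = np.diff(states, axis=0)
    A, *_ = np.linalg.lstsq(actions[:-1], deltas, rcond=None)
    residual = deltas - actions[:-1] @ A
    return np.mean(np.sum(residual ** 2, axis=1))

def representation_loss(W, observations, actions, w_slow=1.0, w_pred=1.0):
    """Total loss for a linear representation s_t = W^T o_t."""
    states = observations @ W
    return w_slow * slowness_loss(states) + w_pred * predictability_loss(states, actions)

# synthetic data: a 2-d latent position driven by actions, observed through a random 10-d mapping
T, obs_dim, latent_dim = 300, 10, 2
actions = rng.normal(size=(T, latent_dim))
latent = np.cumsum(0.1 * actions, axis=0)
observations = latent @ rng.normal(size=(latent_dim, obs_dim)) + 0.01 * rng.normal(size=(T, obs_dim))

W_random = rng.normal(size=(obs_dim, latent_dim))
print("loss of a random linear representation:", representation_loss(W_random, observations, actions))
```

Minimizing this loss over W (e.g. by gradient descent) would prefer representations in which the latent position is recovered, since that makes transitions both slow and action-predictable.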

3.10 Reinforcement Learning with Heterogeneous Policy Representations
Petar Kormushev (Istituto Italiano di Tecnologia – Genova, IT)

License: Creative Commons BY 3.0 Unported license
© Petar Kormushev

Joint work of: Kormushev, Petar; Caldwell, Darwin G.
Main reference: P. Kormushev, D. G. Caldwell, "Reinforcement Learning with Heterogeneous Policy Representations", 2013.

URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_20.pdf

We propose a novel reinforcement learning approach for direct policy search that can simultaneously (i) determine the most suitable policy representation for a given task, and (ii) optimize the policy parameters of this representation in order to maximize the reward and thus achieve the task. The approach assumes that there is a heterogeneous set of policy representations available to choose from. A naïve approach to solving this problem would be to take the available policy representations one by one, run a separate RL optimization process (i.e. conduct trials and evaluate the return) for each one, and at the very end pick the representation that achieved the highest reward. Such an approach, while theoretically possible, would not be efficient enough in practice. Instead, our proposed approach is to conduct one single RL optimization process while interleaving simultaneously all available policy representations. This can be achieved by leveraging our previous work in the area of RL based on Particle Filtering (RLPF).


3.11 Theoretical Analysis of Planning with Options
Timothy Mann (Technion – Haifa, IL)

License: Creative Commons BY 3.0 Unported license
© Timothy Mann

Joint work of: Mann, Timothy; Mannor, Shie

We introduce theoretical analysis suggesting how planning can benefit from using options. Experimental results have shown that options often induce faster convergence [3, 4], and previous theoretical analysis has shown that options are well-behaved in dynamic programming [2, 3]. We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that incorporates samples generated by options. Our analysis reveals that when the given set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions. When only temporally extended actions are used for planning, convergence can be significantly faster than planning with only primitives, but this method may converge toward a suboptimal policy. We also developed precise conditions where our generalized FVI algorithm converges faster with a combination of primitive and temporally extended actions than with only primitive actions. These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.

References
1 Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.
2 Doina Precup, Richard S. Sutton and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. Machine Learning: ECML-98, Springer, 382–393, 1998.
3 David Silver and Kamil Ciosek. Compositional Planning Using Optimal Option Models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.
4 Richard S. Sutton, Doina Precup and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.

3.12 Learning Skill Templates for Parameterized Tasks
Jan Hendrik Metzen (Universität Bremen, DE)

License: Creative Commons BY 3.0 Unported license
© Jan Hendrik Metzen

Joint work of: Metzen, Jan Hendrik; Fabisch, Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problem class. That is, we assume that a task is defined by a task parameter vector, and likewise a skill is considered as a parameterized policy. We propose skill templates, which allow to generalize skills that have been learned using reinforcement learning to similar tasks. In contrast to the recently proposed parameterized skills [1], skill templates also provide a measure of uncertainty for this generalization, which is useful for subsequent adaptation of the skill by means of reinforcement learning. In order to infer a generalized mapping from task parameter space to policy parameter space, and an estimate of its uncertainty, we use Gaussian process regression [3]. We represent skills by dynamical movement primitives [2], and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires to generalize to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.

References
1 B. C. da Silva, G. Konidaris and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, 2013.
3 C. E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
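A small sketch of the generalization step described above, under assumptions (RBF kernel, hand-picked hyperparameters, synthetic training pairs, a single scalar policy parameter): a Gaussian process maps task parameters to policy parameters and returns an uncertainty estimate that indicates where subsequent RL adaptation is most needed.

```python
import numpy as np

def rbf(a, b, length=0.3, sigma_f=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(task_train, policy_train, task_query, noise=1e-2):
    """GP posterior mean and std of a (scalar) policy parameter at new task parameters."""
    K = rbf(task_train, task_train) + noise ** 2 * np.eye(len(task_train))
    K_star = rbf(task_query, task_train)
    K_inv = np.linalg.inv(K)
    mean = K_star @ K_inv @ policy_train
    var = rbf(task_query, task_query).diagonal() - np.einsum('ij,jk,ik->i', K_star, K_inv, K_star)
    return mean, np.sqrt(np.maximum(var, 0.0))

# assumed training data: task parameter = target distance, policy parameter = one DMP weight
task_train = np.array([0.4, 0.6, 0.8, 1.0])
policy_train = np.array([0.9, 1.4, 2.1, 3.0])       # e.g. obtained by RL on each training task

task_query = np.array([0.5, 0.9, 1.3])
mean, std = gp_predict(task_train, policy_train, task_query)
for t, m, s in zip(task_query, mean, std):
    print(f"task {t:.1f}: predicted policy parameter {m:.2f} +/- {s:.2f}")
# a large std (e.g. at task 1.3, outside the training range) signals where further RL adaptation is needed
```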

3.13 Online learning in Markov decision processes
Gergely Neu (Budapest University of Technology & Economics, HU)

License: Creative Commons BY 3.0 Unported license
© Gergely Neu

Joint work of: Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference: A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of O(√(L|X||A|T log(|X||A|/L))) in the bandit setting and O(L√(T log(|X||A|/L))) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License: Creative Commons BY 3.0 Unported license
© Gerhard Neumann

Joint work of: Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference: G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.

URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distribution, where the relative entropy is used as the 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.

3.15 Multi-Objective Reinforcement Learning
Ann Nowé (Free University of Brussels, BE)

License: Creative Commons BY 3.0 Unported license
© Ann Nowé

Joint work of: Nowé, Ann; Van Moffaert, Kristof; Drugan, M. Madalina
Main reference: K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS Vol. 7811, pp. 352–366, Springer, 2013.

URL: http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e. either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually, so that a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto-dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto-dominating future discounted reward vectors separately. Hence, these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration
Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license
© Tor Lattimore

Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment µ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment µ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most µ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.

3.17 Knowledge-Seeking Agents
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license
© Laurent Orseau

Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS Vol. 8139, pp. 146–160, Springer, 2013.

URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does explore in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment, in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license
© Laurent Orseau

Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS Vol. 7716, pp. 209–218, Springer, 2012.

URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance, but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or can merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control, in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more into its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning
Ronald Ortner (Montan-Universität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner
Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows one to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.
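
A small, hypothetical sketch of the pooling idea (assumed for illustration, not taken from the paper): reward statistics, and hence the confidence widths a UCRL-style algorithm would build from them, are aggregated per color rather than per state-action pair, so the number of estimates scales with the number of colors.

```python
import numpy as np
from collections import defaultdict

# State-action pairs sharing a colour pool their samples, so empirical means and
# Hoeffding-style confidence widths depend on the number of colours only.

def pooled_estimates(transitions, color_of, confidence=100.0):
    """transitions: list of (s, a, r, s_next); color_of: dict (s, a) -> colour."""
    counts = defaultdict(int)
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        c = color_of[(s, a)]
        counts[c] += 1
        reward_sum[c] += r
    estimates = {}
    for c in counts:
        mean_r = reward_sum[c] / counts[c]
        # the log term stands in for the usual confidence/horizon-dependent
        # constant of the real regret analysis
        width = np.sqrt(np.log(confidence) / (2 * counts[c]))
        estimates[c] = (round(mean_r, 3), round(width, 3))
    return estimates

color_of = {(0, 0): "red", (1, 0): "red", (0, 1): "blue", (1, 1): "blue"}
samples = [(0, 0, 1.0, 1), (1, 0, 0.8, 0), (0, 1, 0.1, 0), (1, 0, 0.9, 1)]
print(pooled_estimates(samples, color_of))
```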

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
Joint work of: Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
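
The following is an illustrative sketch of the stochastic-factorization trick described above, not the KBSF algorithm itself: an n × n stochastic matrix is approximated by the product of an n × m and an m × n stochastic matrix, and swapping the factors yields an m × m model whose size is independent of the number of sample transitions.

```python
import numpy as np

# Stochastic factorization sketch: P ≈ D @ K with D (n x m) and K (m x n) both
# row-stochastic. Swapping the factors gives the m x m matrix K @ D, which is
# again stochastic and can be used for dynamic programming in the compressed space.

rng = np.random.default_rng(1)

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

n, m = 1000, 20                      # n sample-based states, m representative states
D = random_stochastic(n, m)
K = random_stochastic(m, n)

P_big = D @ K                        # n x n model implied by the factorization
P_small = K @ D                      # m x m model actually used for planning

assert np.allclose(P_small.sum(axis=1), 1.0)   # still a stochastic matrix
print(P_big.shape, P_small.shape)    # (1000, 1000) vs (20, 20)
```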

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).
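
As a reminder of the central object in POMDP planning (standard textbook material, not specific to the talk), here is a minimal belief-update sketch with arbitrary toy numbers.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Standard POMDP belief update: b'(s') ∝ Z[a, s', o] * sum_s T[s, a, s'] * b(s)."""
    predicted = b @ T[:, a, :]              # predict the next-state distribution
    unnormalized = Z[a, :, o] * predicted   # weight by the observation likelihood
    return unnormalized / unnormalized.sum()

# tiny 2-state, 2-action, 2-observation example (numbers are arbitrary)
T = np.zeros((2, 2, 2)); Z = np.zeros((2, 2, 2))
T[:, 0, :] = [[0.9, 0.1], [0.2, 0.8]]
T[:, 1, :] = [[0.5, 0.5], [0.5, 0.5]]
Z[0, :, :] = [[0.8, 0.2], [0.3, 0.7]]       # Z[a, s', o]
Z[1, :, :] = [[0.6, 0.4], [0.4, 0.6]]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, Z=Z))
```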

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
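
A hedged sketch of the basic BEBF idea (not the random-projection or matching-pursuit variants discussed in the talk): the next basis function is, up to approximation, the Bellman residual of the current value estimate.

```python
import numpy as np

# Given current features Phi and value estimate V = Phi @ w, the next basis
# function is (an approximation of) the Bellman residual R + gamma * P V - V.

def next_bebf(Phi, w, R, P, gamma):
    V = Phi @ w
    residual = R + gamma * P @ V - V     # Bellman error of the current approximation
    return residual / (np.linalg.norm(residual) + 1e-12)

# toy 3-state chain with a single constant feature to start from
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
Phi = np.ones((3, 1))
w = np.linalg.lstsq(Phi, R, rcond=None)[0]   # crude initial fit
print(next_bebf(Phi, w, R, P, gamma=0.9))    # candidate second feature
```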

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring
Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel
Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function Psa(s') (the transition probabilities) and a function Rsa(s') (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a): S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r⃗ ∈ R^n, so different accumulated rewards cannot be totally ordered: v⃗ dominates w⃗ when ∃i: v_i > w_i and ∄j: v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of Psa(s') and Rsa(s'). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} Q(s', a')). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a'} Q(s', a')), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q⃗ with two data structures with information about the vectors that (1) have been updated by q⃗ and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹
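
A small sketch of the ND operator mentioned above, i.e. extracting the Pareto non-dominated set from a collection of Q-vectors; the vectors are arbitrary toy values.

```python
import numpy as np

def dominates(v, w):
    """Pareto dominance for maximization: v dominates w."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    return bool(np.all(v >= w) and np.any(v > w))

def nd(vectors):
    """ND operator: keep only the Pareto non-dominated reward vectors."""
    vectors = [np.asarray(v, float) for v in vectors]
    front = []
    for v in vectors:
        if not any(dominates(u, v) for u in vectors):
            if not any(np.array_equal(v, u) for u in front):   # drop duplicates
                front.append(v)
    return front

qs = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.4, 0.4)]
print(nd(qs))   # (0.4, 0.4) is dominated by (0.5, 0.5); the rest form the front
```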

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning 84(1-2), pp. 51–80, 2011.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner
Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College – London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver
Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
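
A hedged sketch of the deterministic policy gradient form, grad_theta J ≈ E_s[grad_theta mu_theta(s) · grad_a Q(s, a) at a = mu_theta(s)], on a one-dimensional linear-quadratic toy problem with a known critic; a real actor-critic would learn Q off-policy from an exploratory behaviour policy, as described above.

```python
import numpy as np

# Toy deterministic policy gradient ascent: mu_theta(s) = theta * s is a linear
# policy and Q is a known quadratic critic, so both gradients are in closed form.

rng = np.random.default_rng(0)

def q_value(s, a):
    return -(a - 2.0 * s) ** 2          # optimal action is a* = 2 s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)

theta, lr = 0.0, 0.05
for step in range(500):
    states = rng.normal(size=32)        # states sampled from some behaviour policy
    actions = theta * states            # deterministic actions a = mu_theta(s)
    # chain rule: d mu / d theta = s, multiplied by dQ/da evaluated at a = mu_theta(s)
    grad = np.mean(states * dq_da(states, actions))
    theta += lr * grad
print(theta)                            # converges near the optimal gain 2.0
```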

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez
Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCSs) [4]. However, they are more appropriate for online settings due to their estimate-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model based approach, the dynamics of the model would be approximated with Gaussian processes [5].
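
A minimal sketch of one ingredient mentioned above, assuming nothing beyond the abstract: fitting a Gaussian-process mean regressor (here implemented directly with an RBF kernel, i.e. essentially kernel ridge regression) to hypothetical state-action support points in order to represent a continuous-state policy. The SIMCMC E-step itself is not shown.

```python
import numpy as np

def rbf(X, Y, ell=0.5):
    """Squared-exponential kernel between two 1-d state arrays."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def fit_gp_policy(states, actions, noise=1e-2):
    """GP posterior mean as a deterministic policy over continuous states."""
    K = rbf(states, states) + noise * np.eye(len(states))
    alpha = np.linalg.solve(K, actions)
    return lambda s_query: rbf(np.atleast_1d(s_query), states) @ alpha

# support points: state-action pairs that (hypothetically) came out of the M-step
support_states = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
support_actions = np.tanh(support_states)           # stand-in for selected actions
policy = fit_gp_policy(support_states, support_actions)
print(policy(np.array([-1.5, 0.5, 1.5])))           # actions at unseen states
```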

References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag
Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
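
A minimal illustration of the optimism strategy from the tutorial's bandit starting point: a UCB1-style index policy on Bernoulli arms with arbitrarily chosen toy means. (Posterior sampling would instead draw one sample per arm from a posterior and play the argmax.)

```python
import numpy as np

# "Optimism in the face of uncertainty": play the arm with the highest
# empirical mean plus confidence bonus.

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
K, T = len(true_means), 2000
counts, sums = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                   # play each arm once
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))                     # optimistic arm choice
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts)                       # most pulls go to the best arm
```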

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton
Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 6 pp., 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
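
For reference, a classic tabular TD(λ) prediction loop with accumulating eligibility traces on a toy random-walk chain; the talk is about generalizing this core algorithm (off-policy learning, general parameterization, off-policy traces), not about this basic version.

```python
import numpy as np

# Tabular TD(lambda) with accumulating traces on a small right-biased random walk.

rng = np.random.default_rng(0)
n_states, gamma, lam, alpha = 5, 0.9, 0.8, 0.1
V = np.zeros(n_states)

def step(s):
    """Move right with prob 0.7, otherwise left; reward 1 on reaching the end."""
    s_next = min(s + 1, n_states - 1) if rng.random() < 0.7 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r, s_next == n_states - 1

for episode in range(500):
    s, e = 0, np.zeros(n_states)            # reset state and eligibility traces
    done = False
    while not done:
        s_next, r, done = step(s)
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]               # TD error
        e[s] += 1.0                         # accumulating trace
        V += alpha * delta * e              # update all states via their traces
        e *= gamma * lam                    # decay traces
        s = s_next

print(np.round(V, 2))                       # predicted discounted return per state
```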

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings.² In addition, I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness
Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer (Montan-Universität Leoben, AT)
Manuel Blum (Albert-Ludwigs-Universität Freiburg, DE)
Robert Busa-Fekete (Universität Marburg, DE)
Yann Chevaleyre (University of Paris North, FR)
Marc Deisenroth (TU Darmstadt, DE)
Thomas G. Dietterich (Oregon State University, US)
Christos Dimitrakakis (EPFL – Lausanne, CH)
Lutz Frommberger (Universität Bremen, DE)
Jens Garstka (FernUniversität in Hagen, DE)
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)
Marcus Hutter (Australian National Univ., AU)
Rico Jonschkowski (TU Berlin, DE)
Petar Kormushev (Italian Institute of Technology – Genova, IT)
Tor Lattimore (Australian National Univ., AU)
Alessandro Lazaric (INRIA Nord Europe – Lille, FR)
Timothy Mann (Technion – Haifa, IL)
Jan Hendrik Metzen (Universität Bremen, DE)
Gergely Neu (Budapest University of Technology & Economics, HU)
Gerhard Neumann (TU Darmstadt, DE)
Ann Nowé (Free University of Brussels, BE)
Laurent Orseau (AgroParisTech – Paris, FR)
Ronald Ortner (Montan-Universität Leoben, AT)
Joëlle Pineau (McGill Univ. – Montreal, CA)
Doina Precup (McGill Univ. – Montreal, CA)
Mark B. Ring (Anaheim Hills, US)
Manuela Ruiz-Montiel (University of Malaga, ES)
Scott Sanner (NICTA – Canberra, AU)
Nils T. Siebel (Hochschule für Technik und Wirtschaft – Berlin, DE)
David Silver (University College London, GB)
Orhan Sönmez (Boğaziçi Univ. – Istanbul, TR)
Peter Sunehag (Australian National Univ., AU)
Richard S. Sutton (University of Alberta, CA)
Csaba Szepesvári (University of Alberta, CA)
William Uther (Google – Sydney, AU)
Martijn van Otterlo (Radboud Univ. Nijmegen, NL)
Joel Veness (University of Alberta, CA)
Jeremy L. Wyatt (University of Birmingham, GB)

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joeumllle Pineau
    • A POMDP Tutorial Joeumllle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo BasedPolicy Iteration Orhan Soumlnmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD(λ) Richard S Sutton
    • Relations between Reinforcement Learning Visual Input Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants

learning organism does implicitly from birth on no task is ever isolated but embedded in acommon surrounding or history In contrary to this lifelong learning type setting transferlearning in RL[5] assumes two different MDPsM andMprime that have something ldquoin commonrdquoThis commonality is mostlikely given in a task mapping function that maps states and actionsfromM toMprime as a basis for reusing learned policies Task mappings can be given by humansupervisors or learned but mostly there is some instance telling the learning agent what to doto benefit from its experience In very common words Here is taskM there is taskMprime andthis is how you can bridge between them This is a fairly narrow view on information reuseand more organic and autonomous variants of knowledge transfer are desirable Knowledgetransfer may it be in-task (ie generalization) or cross-task exploits similarity betweentasks By task mapping functions information on similarity is brought into the learningprocess from outside This also holds for approaches that do not require an explicit statemapping [2 4] where relations or agent spaces eg are defined a-priori What is mostlylacking so far is the agentrsquos ability to recognize similarities on its own andor seamlesslybenefit from prior experiences as an integral part of the new learning effort An intelligentlearning agent should easily notice if certain parts of the current task are identical or similarto an earlier learning task for example general movement skills that remain constant overmany specialized learning tasks In prior work I proposed generalization approaches such astask space tilecoding [1] that allow to reuse knowledge of the actual learning task if certainstate variables are identical This works if structural information is made part of the statespace and does not require a mapping function However it needs a-priori knowledge ofwhich state variables are critical for action selection in a structural way Recent approachesfoster thehope that such knowledge can be retrieved by the agent itself eg[3] allows foridentification of state variables that have a generally high impact on action selection overone or several tasks But even if we can identify and exploit certain state variables thatencode structural information and have this generalizing impact these features must at leastexist If they do not exist in the state representations such approaches fail For exampleif the distance to the next obstacle in front of an agent is this critical value it does nothelp if the agentrsquos position and the position of the obstacle are in the state representationas this implicitly hides the critical information Thus again the question of state spacedesign becomes evident How can we ensure that relevant information is encoded on thelevel of features Or how can we exploit information that is only implicitly given in thestate representation Answering these questions will be necessary to take the next step intoautonomous knowledge reuse for RL agents

References1 Lutz Frommberger Task space tile coding In-task and cross-task generalization in re-

inforcement learning In 9th European Workshop on Reinforcement Learning (EWRL9)Athens Greece September 2011

2 George D Konidaris and Andrew G Barto Autonomous shaping Knowledge transfer inreinforcement learning In Proceedings of the Twenty Third International Conference onMachine Learning (ICML 2006) pp 489ndash49 Pittsburgh PA June 2006

3 Matthijs Snel and Shimon Whiteson Multi-task reinforcement learning shaping and fea-ture selection In Recent Advances in Reinforcement Learning pp 237ndash248 Springer 2012

4 Prasad Tadepalli Robert Givan and Kurt Driessens Relational reinforcement learningAn overview In Proc of the ICML-2004 Workshop on Relational Reinforcement Learningpp 1ndash9 2004

5 Matthew E Taylor and Peter Stone Transfer learning for reinforcement learning domainsA survey The Journal of Machine Learning Research 101633ndash16852009

13321

8 13321 ndash Reinforcement Learning

35 Actor-Critic Algorithms for Risk-Sensitive MDPsMohammad Ghavamzadeh (INRIA Nord Europe ndash Lille FR)

License Creative Commons BY 30 Unported licensecopy Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizingsome measure of variability in rewards in addition to maximizing a standard criterionVariance related risk measures are among the most common risk-sensitive criteria in financeand operations research However optimizing many such criteria is known to be a hardproblem In this paper we consider both discounted and average reward Markov decisionprocesses For each formulation we first define a measure of variability for a policy whichin turn gives us a set of risk-sensitive criteria to optimize For each of these criteria wederive a formula for computing its gradient We then devise actor-critic algorithms forestimating the gradient and updating the policy parameters in the ascent direction Weestablish the convergence of our algorithms to locally risk-sensitive optimal policies Finallywe demonstrate the usefulness of our algorithms in a traffic signal control application

36 Statistical Learning Theory in Reinforcement Learning andApproximate Dynamic Programming

Mohammad Ghavamzadeh (INRIA Nord Europe ndash Lille FR)

License Creative Commons BY 30 Unported licensecopy Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms areused to solve sequential decision-making tasks where the environment (ie the dynamicsand the rewards) is not completely known andor the size of thestate and action spaces istoo large In these scenarios the convergence and performance guarantees of the standardDP algorithms are no longer valid and adifferent theoretical analysis have to be developedStatistical learning theory (SLT) has been a fundamental theoretical tool to explain theinteraction between the process generating the samples and the hypothesis space used bylearning algorithms and shown when and how-well classification and regression problemscan be solved In recent years SLT tools have been used to study the performance of batchversions of RL and ADP algorithms with the objective of deriving finite-sample bounds onthe performance loss (wrt the optimal policy) of the policy learned by these methods Suchan objective requires to effectively combine SLT tools with the ADP algorithms and to showhow the error is propagated through the iterations of these iterative algorithms

37 Universal Reinforcement LearningMarcus Hutter (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Marcus Hutter

There is great interest in understanding and constructing generally intelligent systemsapproaching and ultimately exceeding human intelligence Universal AI is such a mathematical

Peter Auer Marcus Hutter and Laurent Orseau 9

theory of machinesuper-intelligence More precisely AIXI is an elegant parameter-free theoryof an optimal reinforcement learning agent embedded in an arbitrary unknown environmentthat possesses essentially all aspects of rational intelligence The theory reduces all conceptualAI problems to pure computational questions After a brief discussion of its philosophicalmathematical and computational ingredients I will give a formal definition and measure ofintelligence which is maximized by AIXI AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker for which I will present general optimality results Thisalso motivates some variations such as knowledge-seeking and optimistic agents and featurereinforcement learning Finally I present some recent approximations implementations andapplications of this modern top-down approach to AI

References1 M HutterUniversal Artificial IntelligenceSpringer Berlin 20052 J Veness K S Ng M Hutter W Uther and D SilverA Monte Carlo AIXI approximation

Journal of Artificial Intelligence Research 4095ndash142 20113 S LeggMachine Super Intelligence PhD thesis IDSIA Lugano Switzerland 20084 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20125 T LattimoreTheory of General Reinforcement Learning PhD thesis Research School of

Computer Science Australian National University 2014

38 Temporal Abstraction by Sparsifying Proximity StatisticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Toussaint Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcementlearning In previous work such abstractions were found through the analysis of a set ofexperienced or demonstrated trajectories [2] We propose that from such trajectory datawe may learn temporally marginalized transition probabilitiesmdashwhich we call proximitystatisticsmdashthat model possible transitions on larger time scales rather than learning 1-steptransition probabilities Viewing the proximity statistics as state values allows the agent togenerate greedy policies from them Making the statistics sparse and combining proximityestimates by proximity propagation can substantially accelerate planning compared to valueiteration while keeping the size of the statistics manageable The concept of proximitystatistics and its sparsification approach is inspired from recent work in transit-node routingin large road networks [1] We show that sparsification of these proximity statistics impliesan approach to the discovery of temporal abstractions Options defined by subgoals areshown to be a special case of the sparsification of proximity statistics We demonstrate theapproach and compare various sparsification schemes in a stochastic grid world

References1 Holger Bast Setfan Funke Domagoj Matijevic Peter Sanders Dominik Schultes In transit

to constant time shortest-path queries in road networks Workshop on Algorithm Engineer-ing and Experiments 46ndash59 2007

2 Martin Stolle Doina Precup Learning options in reinforcement learning Proc of the 5thIntrsquol Symp on Abstraction Reformulation and Approximation pp 212ndash223 2002

13321

10 13321 ndash Reinforcement Learning

39 State Representation Learning in RoboticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Brock OliverMain reference R Jonschkowski O Brock ldquoLearning Task-Specific State Representations by Maximizing Slowness

and Predictabilityrdquo in Proc of the 6th Intrsquol Workshop on Evolutionary and ReinforcementLearning for Autonomous Robot Systems (ERLARSrsquo13) 2013

URL httpwwwroboticstu-berlindefileadminfg170Publikationen_pdfJonschkowski-13-ERLARS-finalpdf

The success of reinforcement learning in robotic tasks is highly dependent onthe staterepresentation ndash a mapping from high dimensional sensory observations of the robot to statesthat can be used for reinforcement learning Currently this representation is defined byhuman engineers ndash thereby solving an essential part of the robotic learning task Howeverthis approach does not scale because it restricts the robot to tasks for which representationshave been predefined by humans For robots that are able to learn new tasks we needstate representation learning We sketch how this problem can be approached by iterativelylearning a hierarchy of task-specific state representations following a curriculum We thenfocus on a single step in this iterative procedure learning a state representation that allowsthe robot to solve asingle task To find this representation we optimize two characteristics ofgood state representations predictability and slowness We implement these characteristicsin a neural network and show that this approach can find good state representations fromvisual input in simulated robotic tasks

310 Reinforcement Learning with Heterogeneous PolicyRepresentations

Petar Kormushev (Istituto Italiano di Tecnologia ndash Genova IT)

License Creative Commons BY 30 Unported licensecopy Petar Kormushev

Joint work of Kormushev Petar Caldwell Darwin GMain reference P Kormushev DG Caldwell ldquoReinforcement Learning with Heterogeneous Policy

Representationsrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_20pdf

We propose a novel reinforcement learning approach for direct policy search that cansimultaneously (i) determine the most suitable policy representation for a given task and(ii) optimize the policy parameters of this representation in order to maximize the rewardand thus achieve the task The approach assumes that there is a heterogeneous set of policyrepresentations available to choose from A naiumlve approach to solving this problem wouldbe to take the available policy representations one by one run a separate RL optimizationprocess (ie conduct trials and evaluate the return) for each once and at the very end pickthe representation that achieved the highest reward Such an approach while theoreticallypossible would not be efficient enough in practice Instead our proposed approach is toconduct one single RL optimization process while interleaving simultaneously all availablepolicy representations This can be achieved by leveraging our previous work in the area ofRL based on Particle Filtering (RLPF)

Peter Auer Marcus Hutter and Laurent Orseau 11

311 Theoretical Analysis of Planning with OptionsTimothy Mann (Technion ndash Haifa IL)

License Creative Commons BY 30 Unported licensecopy Timothy Mann

Joint work of Mann Timothy Mannor Shie

We introduce theoretical analysis suggesting how planning can benefit from using options Ex-perimental results have shown that options often induce faster convergence [3 4] and previoustheoretical analysis has shown that options are well-behaved in dynamic programming[2 3]We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that in-corporates samples generated by options Our analysis reveals that when the given set ofoptions contains the primitive actions our generalized algorithm converges approximately asfast as FVI with only primitive actions When only temporally extended actions are usedfor planning convergence can be significantly faster than planning with only primitives butthis method may converge toward a suboptimal policy We also developed precise conditionswhere our generalized FVI algorithm converges faster with a combination of primitive andtemporally extended actions than with only primitive actions These conditions turn out todepend critically on whether the iterates produced by FVI underestimate the optimal valuefunction Our analysis of FVI suggests that options can play an important role in planningby inducing fast convergence

References1 Reacutemi Munos and Csaba SzepesvaacuteriFinite-time bounds for fitted value iterationJournal of

Machine Learning Research 9815ndash857 20082 Doina Precup Richard S Sutton and Satinder Singh Theoretical results on reinforcement

learning with temporallyabstract options Machine Learning ECML-98 Springer 382ndash3931998

3 David Silver and Kamil Ciosek Compositional Planning Using Optimal Option ModelsInProceedings of the 29th International Conference on Machine Learning Edinburgh 2012

4 Richard S Sutton Doina Precup and Satinder Singh Between MDPs and semi-MDPs Aframework for temporal abstraction in reinforcement learning Artificial Intelligence 112(1)181ndash211 August 1999

312 Learning Skill Templates for Parameterized TasksJan Hendrik Metzen (Universitaumlt Bremen DE)

License Creative Commons BY 30 Unported licensecopy Jan Hendrik Metzen

Joint work of Metzen Jan Hendrik Fabisch Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problemclass That is we assume that a task is defined by a task parameter vector and likewisea skill is considered as a parameterized policy We propose skill templates which allowto generalize skills that have been learned using reinforcement learning to similar tasksIn contrast to the recently proposed parameterized skills[1] skill templates also provide ameasure of uncertainty for this generalization which is useful for subsequent adaptation ofthe skill by means of reinforcement learning In order to infer a generalized mapping fromtask parameter space to policy parameter space and an estimate of its uncertainty we useGaussian process regression [3] We represent skills by dynamical movement primitives [2]

13321

12 13321 ndash Reinforcement Learning

and evaluate the approach on a simulated Mitsubishi PA10 arm where learning a single skillcorresponds to throwing a ball to a fixed target position while learning the skill templaterequires to generalize to new target positions We show that learning skill templates requiresonly a small amount of training data and improves learning in the target task considerably

References1 BC da Silva G Konidaris and AG Barto Learning Parameterized Skills 29th Inter-

national Conference on Machine Learning 20122 A J Ijspeert J Nakanishi H Hoffmann P Pastor and S Schaal Dynamical Movement

Primitives Learning Attractor Models for Motor Behaviors Neural Computation 252328ndash373 2013

3 CERasmussen CWilliams Gaussian Processes for Machine Learning MIT Press 2006

313 Online learning in Markov decision processesGergely Neu (Budapest University of Technology amp Economics HU)

License Creative Commons BY 30 Unported licensecopy Gergely Neu

Joint work of Neu Gergely Szepesvari Csaba Zimin Alexander Gyorgy Andras Dick TravisMain reference A Zimin G Neu ldquoOnline learning in Markov decision processes by Relative Entropy Policy

Searchrdquo to appear in Advances in Neural Information Processing 26

We study the problem of online learning in finite episodic Markov decision processes (MDPs)where the loss function is allowed to change between episodes The natural performancemeasure in this learning problem is the regret defined as the difference between the totalloss of the best stationary policy and the total loss suffered by the learner We assume thatthe learner is given access to a finite action space A and the state space X has a layeredstructure with L layers so that state transitions are only possible between consecutive layersWe propose several learning algorithms based on applying the well-known Mirror Descentalgorithm to the problem described above For deriving our first method we observe thatMirror Descent can be regarded as a variant of the recently proposed Relative Entropy PolicySearch (REPS) algorithm of [1] Our corresponding algorithm is called Online REPS orO-REPS Second we show how to approximately solve the projection operations required byMirror Descent without taking advantage of the connections to REPS Finally we propose alearning method based on using a modified version of the algorithm of[2] to implement theContinuous Exponential Weights algorithm for the online MDP problem For these last twotechniques we provide rigorous complexity analyses More importantly we show that all ofthe above algorithms satisfy regret bounds of O(

radicL |X | |A|T log(|X | |A| L)) in the bandit

setting and O(LradicT log(|X | |A| L)) in thefull information setting (both after T episodes)

These guarantees largely improve previously known results under much milder assumptionsand cannot be significantly improved under general assumptions

References1 Peters J Muumllling K and Altuumln Y (2010)Relative entropy policy searchIn Proc of the

24th AAAI Conf on Artificial Intelligence (AAAI 2010) pp 1607ndash16122 Narayanan H and Rakhlin A (2010)Random walk approach to regret minimization In

Advances in Neural Information Processing Systems 23 pp 1777ndash1785

Peter Auer Marcus Hutter and Laurent Orseau 13

314 Hierarchical Learning of Motor Skills with Information-TheoreticPolicy Search

Gerhard Neumann (TU Darmstadt ndash Darmstadt DE)

License Creative Commons BY 30 Unported licensecopy Gerhard Neumann

Joint work of Neumann Gerhard Daniel Christian Kupcsik Andras Deisenroth Marc Peters JanURL G Neumann C Daniel A Kupcsik M Deisenroth J Peters ldquoHierarchical Learning of Motor

Skills with Information-Theoretic Policy Searchrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_1pdf

The key idea behind information-theoretic policy search is to bound the lsquodistancersquo between thenew and old trajectory distribution where the relative entropy is used as lsquodistance measurersquoThe relative entropy bound exhibits many beneficial properties such as a smooth and fastlearning process and a closed-form solution for the resulting policy We summarize our workon information theoretic policy search for motor skill learning where we put particular focuson extending the original algorithm to learn several options for a motor task select an optionfor the current situation adapt the option to the situation and sequence options to solve anoverall task Finally we illustrate the performance of our algorithm with experiments onreal robots

315 Multi-Objective Reinforcement LearningAnn Nowe (Free University of Brussels BE)

License Creative Commons BY 30 Unported licensecopy Ann Nowe

Joint work of Nowe Ann Van Moffaert Kristof Drugan M MadalinaMain reference K Van Moffaert MM Drugan A Noweacute ldquoHypervolume-based Multi-Objective Reinforcement

Learningrdquo in Proc of the 7th Intrsquol Conf on Evolutionary Multi-Criterion Optimization (EMOrsquo13)LNCS Vol 7811 pp 352ndash366 Springer 2013

URL httpdxdoiorg101007978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems wherethe value functions are not assumed to be convex In these cases the environment ndash eithersingle-state or multi-state ndash provides the agent multiple feedback signals upon performingan action These signals can be independent complementary or conflicting Hence multi-objective reinforcement learning (MORL) is the process of learning policies that optimizemultiple criteria simultaneously In our talk we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments In general we highlight two main streams in MORL ie either thescalarization or the direct Pareto approach The simplest method to solve a multi-objectivereinforcement learning problem is to use a scalarization function to reduce the dimensionalityof the objective space to a single-objective problem Examples are the linear scalarizationfunction and the non-linear Chebyshev scalarization function In our talk we highlight thatscalarization functions can be easily applied in general but their expressive power dependsheavily on on the fact whether a linear or non-linear transformation to a single dimensionis performed [2] Additionally they suffer from additional parameters that heavily bias thesearch process Without scalarization functions the problem remains truly multi-objectiveandQ-values have to be learnt for each objective individually and therefore a state-action ismapped to a Q-vector However a problem arises in the boots trapping process as multipleactions be can considered equally good in terms of the partial order Pareto dominance

13321

14 13321 ndash Reinforcement Learning

relation Therefore we extend the RL boots trapping principle to propagating sets of Paretodominating Q-vectors in multi-objective environments In [1] we propose to store the averageimmediate reward and the Pareto dominating future discounted reward vector separatelyHence these two entities can converge separately but can also easily be combined with avector-sum operator when the actual Q-vectors are requested Subsequently the separationis also a crucial aspect to determine the actual action sequence to follow a converged policyin the Pareto set

References
1  K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2  M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration
Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Tor Lattimore
Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment µ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment µ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most µ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesaro average.
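The Bayesian ingredient underlying such agents can be sketched generically (this is not the algorithm from the talk, only the posterior bookkeeping any Bayesian-inspired approach over a class M builds on); the likelihood interface used below is an assumption for the example.

    def update_posterior(posterior, action, observation, reward, models):
        # posterior: dict model -> weight; models: dict model -> likelihood function that
        # returns P(observation, reward | history, action) under that candidate environment.
        new_post = {m: w * models[m](action, observation, reward) for m, w in posterior.items()}
        total = sum(new_post.values())
        if total == 0.0:
            raise ValueError("all models assign zero probability to the data")
        return {m: w / total for m, w in new_post.items()}

    # Two toy candidate environments with hand-made likelihoods (assumptions).
    models = {"optimistic": lambda a, o, r: 0.9 if r > 0 else 0.1,
              "pessimistic": lambda a, o, r: 0.2 if r > 0 else 0.8}
    posterior = {"optimistic": 0.5, "pessimistic": 0.5}
    posterior = update_posterior(posterior, action=0, observation=None, reward=1, models=models)
    print(posterior)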

3.17 Knowledge-Seeking Agents
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.
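One standard way to make "knowledge seeking" concrete – shown here only as an illustration; the precise reward definition of KL-KSA is in the main reference – is to score an action by its expected information gain, i.e. the expected KL divergence from the prior to the posterior. The toy hypotheses and likelihoods below are assumptions for the example; note that pure 50/50 noise under all hypotheses would yield zero gain, matching the "noise is not information" property.

    import math

    def expected_information_gain(prior, likelihoods):
        # prior: dict hypothesis -> weight; likelihoods: dict hypothesis -> {outcome: prob}.
        # Returns E_outcome[ KL(posterior || prior) ] under the Bayes-mixture predictive distribution.
        outcomes = next(iter(likelihoods.values())).keys()
        gain = 0.0
        for o in outcomes:
            p_o = sum(prior[h] * likelihoods[h][o] for h in prior)   # mixture probability of o
            if p_o == 0.0:
                continue
            for h in prior:
                post = prior[h] * likelihoods[h][o] / p_o            # Bayes update
                if post > 0.0:
                    gain += p_o * post * math.log(post / prior[h])
        return gain

    # A deterministic coin vs. a fair coin: observing one flip is genuinely informative.
    prior = {"always_heads": 0.5, "fair": 0.5}
    likelihoods = {"always_heads": {"H": 1.0, "T": 0.0}, "fair": {"H": 0.5, "T": 0.5}}
    print(expected_information_gain(prior, likelihoods))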

3.18 Toward a more realistic framework for general reinforcement learning
Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance, but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or merely have human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1  Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2  Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3  Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer Berlin Heidelberg.
4  Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits and Continuous State Reinforcement Learning
Ronald Ortner (Montan-Universität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner
Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows one to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.
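A minimal sketch of the aggregation idea (illustrative only; the actual UCRL adaptation and its confidence intervals are in the referenced paper): state-action pairs that share a color pool their observations, so estimates and confidence widths are computed per color rather than per state-action pair. The coloring function and the Hoeffding-style bonus constant below are assumptions chosen for the example.

    import math
    from collections import defaultdict

    class ColoredEstimator:
        def __init__(self, color_of):
            self.color_of = color_of               # maps (state, action) -> color
            self.counts = defaultdict(int)         # observations pooled per color
            self.reward_sums = defaultdict(float)

        def observe(self, state, action, reward):
            c = self.color_of(state, action)
            self.counts[c] += 1
            self.reward_sums[c] += reward

        def optimistic_reward(self, state, action, t, delta=0.05):
            # Mean estimate plus a confidence bonus, both per color, so the width
            # shrinks with the number of samples of the color, not of the pair.
            c = self.color_of(state, action)
            n = max(1, self.counts[c])
            mean = self.reward_sums[c] / n
            bonus = math.sqrt(math.log(2 * t / delta) / (2 * n))
            return mean + bonus

    # Example: all "right" actions of a chain share one color, so they share statistics.
    est = ColoredEstimator(color_of=lambda s, a: a)
    est.observe(3, "right", 0.7)
    est.observe(8, "right", 0.9)
    print(est.optimistic_reward(5, "right", t=2))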

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
Joint work of: Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
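A rough numpy sketch of the flavour of this construction, under simplifying assumptions (Gaussian kernel, one action, simple row normalization): the sampled transitions are projected onto a fixed set of representative states through two stochastic matrices, and their product yields a reduced model whose size no longer depends on the number of samples. The precise kernels, normalizations and reward aggregation of KBSF are those of the referenced paper, not this sketch.

    import numpy as np

    def gaussian_kernel(x, y, tau=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * tau ** 2))

    def reduced_model(starts, nexts, rewards, rep_states, tau=1.0):
        # starts[i] -> nexts[i] with rewards[i]: sampled transitions (for one action).
        # rep_states: m representative states that fix the size of the model.
        n, m = len(starts), len(rep_states)
        D = np.array([[gaussian_kernel(nexts[i], rep_states[j], tau) for j in range(m)]
                      for i in range(n)])
        K = np.array([[gaussian_kernel(rep_states[j], starts[i], tau) for i in range(n)]
                      for j in range(m)])
        D /= D.sum(axis=1, keepdims=True)     # make rows stochastic
        K /= K.sum(axis=1, keepdims=True)
        P_small = K @ D                        # m x m transition matrix over representative states
        r_small = K @ np.asarray(rewards)      # kernel-weighted rewards for representative states
        return P_small, r_small

    # Toy 1-D example: 4 sampled transitions compressed onto 2 representative states.
    starts = [np.array([0.1]), np.array([0.4]), np.array([0.6]), np.array([0.9])]
    nexts = [np.array([0.2]), np.array([0.5]), np.array([0.7]), np.array([1.0])]
    P, r = reduced_model(starts, nexts, [0.0, 0.0, 1.0, 1.0],
                         [np.array([0.25]), np.array([0.75])])
    print(P.shape, r.shape)   # (2, 2) (2,)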

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) methods are appealing due to their theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
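The core loop behind BEBF methods can be sketched in a few lines (a simplification under stated assumptions, not the talk's algorithms): fit the value function with the current features, compute the Bellman residual on the sampled transitions, and add a regressor of that residual as the next feature. How that regressor is fit – random projections, orthogonal matching pursuit, etc. – is exactly the design question the talk addresses; the ridge placeholder below is only an assumption for the example.

    import numpy as np

    def bebf_iteration(states, next_states, rewards, gamma, num_features, fit_regressor,
                       eval_sweeps=10):
        # fit_regressor(X, y) must return a callable f with f(X) -> predictions.
        n = len(states)
        phi = [np.ones((n, 1))]            # start from a constant feature
        phi_next = [np.ones((n, 1))]
        basis_fns = []
        for _ in range(num_features):
            X, Xn = np.hstack(phi), np.hstack(phi_next)
            w = np.zeros(X.shape[1])
            for _ in range(eval_sweeps):   # crude fitted policy evaluation with current features
                targets = rewards + gamma * (Xn @ w)
                w, *_ = np.linalg.lstsq(X, targets, rcond=None)
            residual = rewards + gamma * (Xn @ w) - X @ w   # Bellman error of the current fit
            g = fit_regressor(states, residual)             # new basis function ~ Bellman error
            basis_fns.append(g)
            phi.append(g(states).reshape(n, 1))
            phi_next.append(g(next_states).reshape(n, 1))
        return basis_fns

    def ridge(X, y, lam=1e-3):             # placeholder regressor (an assumption)
        A = X.T @ X + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X.T @ y)
        return lambda Z: Z @ w

    S = np.random.rand(50, 2)
    Sn = np.clip(S + 0.1 * np.random.randn(50, 2), 0.0, 1.0)
    r = S[:, 0]
    print(len(bebf_iteration(S, Sn, r, gamma=0.9, num_features=3, fit_regressor=ridge)))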

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring
Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel
Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0..∞} γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a): S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i: v_i > w_i and ∄j: v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1−α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.
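A minimal Python sketch of two of the ingredients described above – the Pareto non-dominated (ND) operator and a set-valued counterpart of the scalar Q-learning target. It only makes the set arithmetic concrete; PQ-learning's controlled set addition, which tracks which vectors belong to the same action sequence, is described in the main reference and is not reproduced here. The toy Q-vector sets are assumptions for the example.

    import numpy as np

    def dominates(v, w):
        # v dominates w: at least one component strictly better, none strictly worse.
        v, w = np.asarray(v), np.asarray(w)
        return bool(np.any(v > w) and not np.any(v < w))

    def nd(vectors):
        # Keep only the Pareto non-dominated vectors.
        return [v for v in vectors if not any(dominates(u, v) for u in vectors if u is not v)]

    def naive_target_set(reward_vec, gamma, next_q_sets):
        # Naive (uncontrolled) update target: ND over the union of successor Q-vector sets,
        # each discounted and shifted by the immediate reward vector.
        union = [v for q_set in next_q_sets.values() for v in q_set]
        return nd([np.asarray(reward_vec) + gamma * np.asarray(v) for v in nd(union)])

    # Two objectives, two successor actions; [0.5, 0.0] is pruned because [1.0, 0.0] dominates it.
    next_q_sets = {"a1": [np.array([1.0, 0.0]), np.array([0.5, 0.0])],
                   "a2": [np.array([0.0, 1.0])]}
    print(naive_target_set([0.1, 0.1], 0.9, next_q_sets))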

References
1  C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2  P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning 84(1-2), pp. 51–80, 2011.

This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner
Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1  S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2  Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3  Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4  Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College – London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver
Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
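For reference, the form alluded to above is usually written as follows in the policy-gradient literature (stated here as the standard formulation rather than a quote from the talk), with deterministic policy μ_θ, discounted state distribution ρ^μ and action-value function Q^μ:

    \nabla_\theta J(\mu_\theta)
      = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
          \nabla_\theta \mu_\theta(s)\,
          \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
        \right]

That is, the gradient only requires the gradient of Q with respect to the action at the action the policy would take, rather than an integral over the whole action space.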

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration
Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez
Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1  Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3870–3411, December 2010.
2  Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3  Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4  Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5  Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag
Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG]; also in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
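To make the "optimism" strategy mentioned above concrete, here is a minimal UCB-style action selection for the bandit case (the textbook UCB1 rule, shown only as an illustration of optimism in the face of uncertainty; the constant in the bonus is the classic one, not specific to the talk):

    import math

    def ucb1_action(counts, means, t):
        # counts[a]: pulls of arm a; means[a]: empirical mean reward; t: total pulls so far.
        # Unpulled arms get priority; otherwise pick the arm with the largest optimistic index.
        for a, n in enumerate(counts):
            if n == 0:
                return a
        indices = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a])
                   for a in range(len(counts))]
        return max(range(len(counts)), key=lambda a: indices[a])

    print(ucb1_action(counts=[10, 3, 7], means=[0.4, 0.5, 0.45], t=20))

The under-explored arm (index 1) wins here despite a similar empirical mean, which is precisely the exploration bonus at work.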

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton
Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series 6, 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
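For concreteness, the classic linear TD(λ) update with accumulating eligibility traces looks as follows (a textbook sketch; GQ(λ) and the generalizations discussed in the talk extend exactly this pattern to off-policy learning and general parameterizations). The step-size and trace parameters are example assumptions.

    import numpy as np

    def td_lambda_episode(transitions, theta, alpha=0.1, gamma=0.99, lam=0.9):
        # transitions: list of (features, reward, next_features); next_features is None at termination.
        # theta: weights of the linear value function V(s) = theta . phi(s).
        e = np.zeros_like(theta)                      # eligibility trace
        for phi, r, phi_next in transitions:
            v = theta @ phi
            v_next = 0.0 if phi_next is None else theta @ phi_next
            delta = r + gamma * v_next - v            # TD error
            e = gamma * lam * e + phi                 # accumulating trace
            theta = theta + alpha * delta * e
        return theta

    theta = np.zeros(3)
    episode = [(np.array([1.0, 0.0, 0.0]), 0.0, np.array([0.0, 1.0, 0.0])),
               (np.array([0.0, 1.0, 0.0]), 1.0, None)]
    print(td_lambda_episode(episode, theta))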

References
1  H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a more tight integration of relational vision and relational action for interactive learning settings. In addition, I present several new problem domains.

See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1  L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2  B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3  B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4  M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5  M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6  M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7  M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness
Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1  M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2  J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3  J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
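As a pointer to what "belief state planning" manipulates, the discrete Bayes-filter belief update can be written in a few lines (standard POMDP bookkeeping, not a method specific to this talk); the transition model T and observation model O below are assumptions chosen for the toy example.

    def belief_update(belief, action, observation, T, O, states):
        # belief: dict state -> probability; T(s, a, s2) = P(s2 | s, a); O(s2, a, o) = P(o | s2, a).
        new_belief = {}
        for s2 in states:
            pred = sum(T(s, action, s2) * belief[s] for s in states)   # prediction step
            new_belief[s2] = O(s2, action, observation) * pred          # correction step
        norm = sum(new_belief.values())
        if norm == 0.0:
            raise ValueError("observation has zero probability under the model")
        return {s: p / norm for s, p in new_belief.items()}

    states = ["left", "right"]
    T = lambda s, a, s2: 0.8 if s2 == s else 0.2    # mostly stay put (an assumption)
    O = lambda s2, a, o: 0.9 if o == s2 else 0.1    # noisy location sensor (an assumption)
    print(belief_update({"left": 0.5, "right": 0.5}, "look", "left", T, O, states))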

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer (Montan-Universität Leoben, AT)
Manuel Blum (Albert-Ludwigs-Universität Freiburg, DE)
Robert Busa-Fekete (Universität Marburg, DE)
Yann Chevaleyre (University of Paris North, FR)
Marc Deisenroth (TU Darmstadt, DE)
Thomas G. Dietterich (Oregon State University, US)
Christos Dimitrakakis (EPFL – Lausanne, CH)
Lutz Frommberger (Universität Bremen, DE)
Jens Garstka (FernUniversität in Hagen, DE)
Mohammad Ghavamzadeh (INRIA Nord Europe – Lille, FR)
Marcus Hutter (Australian National Univ., AU)
Rico Jonschkowski (TU Berlin, DE)
Petar Kormushev (Italian Institute of Technology – Genova, IT)
Tor Lattimore (Australian National Univ., AU)
Alessandro Lazaric (INRIA Nord Europe – Lille, FR)
Timothy Mann (Technion – Haifa, IL)
Jan Hendrik Metzen (Universität Bremen, DE)
Gergely Neu (Budapest University of Technology & Economics, HU)
Gerhard Neumann (TU Darmstadt, DE)
Ann Nowé (Free University of Brussels, BE)
Laurent Orseau (AgroParisTech – Paris, FR)
Ronald Ortner (Montan-Universität Leoben, AT)
Joëlle Pineau (McGill Univ. – Montreal, CA)
Doina Precup (McGill Univ. – Montreal, CA)
Mark B. Ring (Anaheim Hills, US)
Manuela Ruiz-Montiel (University of Malaga, ES)
Scott Sanner (NICTA – Canberra, AU)
Nils T. Siebel (Hochschule für Technik und Wirtschaft – Berlin, DE)
David Silver (University College London, GB)
Orhan Sönmez (Boğaziçi Univ. – Istanbul, TR)
Peter Sunehag (Australian National Univ., AU)
Richard S. Sutton (University of Alberta, CA)
Csaba Szepesvári (University of Alberta, CA)
William Uther (Google – Sydney, AU)
Martijn van Otterlo (Radboud Univ. Nijmegen, NL)
Joel Veness (University of Alberta, CA)
Jeremy L. Wyatt (University of Birmingham, GB)

Page 8: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

8 13321 ndash Reinforcement Learning

35 Actor-Critic Algorithms for Risk-Sensitive MDPsMohammad Ghavamzadeh (INRIA Nord Europe ndash Lille FR)

License Creative Commons BY 30 Unported licensecopy Mohammad Ghavamzadeh

In many sequential decision-making problems we may want to manage risk by minimizingsome measure of variability in rewards in addition to maximizing a standard criterionVariance related risk measures are among the most common risk-sensitive criteria in financeand operations research However optimizing many such criteria is known to be a hardproblem In this paper we consider both discounted and average reward Markov decisionprocesses For each formulation we first define a measure of variability for a policy whichin turn gives us a set of risk-sensitive criteria to optimize For each of these criteria wederive a formula for computing its gradient We then devise actor-critic algorithms forestimating the gradient and updating the policy parameters in the ascent direction Weestablish the convergence of our algorithms to locally risk-sensitive optimal policies Finallywe demonstrate the usefulness of our algorithms in a traffic signal control application

36 Statistical Learning Theory in Reinforcement Learning andApproximate Dynamic Programming

Mohammad Ghavamzadeh (INRIA Nord Europe ndash Lille FR)

License Creative Commons BY 30 Unported licensecopy Mohammad Ghavamzadeh

Approximate dynamic programming (ADP) and reinforcement learning (RL) algorithms areused to solve sequential decision-making tasks where the environment (ie the dynamicsand the rewards) is not completely known andor the size of thestate and action spaces istoo large In these scenarios the convergence and performance guarantees of the standardDP algorithms are no longer valid and adifferent theoretical analysis have to be developedStatistical learning theory (SLT) has been a fundamental theoretical tool to explain theinteraction between the process generating the samples and the hypothesis space used bylearning algorithms and shown when and how-well classification and regression problemscan be solved In recent years SLT tools have been used to study the performance of batchversions of RL and ADP algorithms with the objective of deriving finite-sample bounds onthe performance loss (wrt the optimal policy) of the policy learned by these methods Suchan objective requires to effectively combine SLT tools with the ADP algorithms and to showhow the error is propagated through the iterations of these iterative algorithms

37 Universal Reinforcement LearningMarcus Hutter (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Marcus Hutter

There is great interest in understanding and constructing generally intelligent systemsapproaching and ultimately exceeding human intelligence Universal AI is such a mathematical

Peter Auer Marcus Hutter and Laurent Orseau 9

theory of machinesuper-intelligence More precisely AIXI is an elegant parameter-free theoryof an optimal reinforcement learning agent embedded in an arbitrary unknown environmentthat possesses essentially all aspects of rational intelligence The theory reduces all conceptualAI problems to pure computational questions After a brief discussion of its philosophicalmathematical and computational ingredients I will give a formal definition and measure ofintelligence which is maximized by AIXI AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker for which I will present general optimality results Thisalso motivates some variations such as knowledge-seeking and optimistic agents and featurereinforcement learning Finally I present some recent approximations implementations andapplications of this modern top-down approach to AI

References1 M HutterUniversal Artificial IntelligenceSpringer Berlin 20052 J Veness K S Ng M Hutter W Uther and D SilverA Monte Carlo AIXI approximation

Journal of Artificial Intelligence Research 4095ndash142 20113 S LeggMachine Super Intelligence PhD thesis IDSIA Lugano Switzerland 20084 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20125 T LattimoreTheory of General Reinforcement Learning PhD thesis Research School of

Computer Science Australian National University 2014

38 Temporal Abstraction by Sparsifying Proximity StatisticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Toussaint Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcementlearning In previous work such abstractions were found through the analysis of a set ofexperienced or demonstrated trajectories [2] We propose that from such trajectory datawe may learn temporally marginalized transition probabilitiesmdashwhich we call proximitystatisticsmdashthat model possible transitions on larger time scales rather than learning 1-steptransition probabilities Viewing the proximity statistics as state values allows the agent togenerate greedy policies from them Making the statistics sparse and combining proximityestimates by proximity propagation can substantially accelerate planning compared to valueiteration while keeping the size of the statistics manageable The concept of proximitystatistics and its sparsification approach is inspired from recent work in transit-node routingin large road networks [1] We show that sparsification of these proximity statistics impliesan approach to the discovery of temporal abstractions Options defined by subgoals areshown to be a special case of the sparsification of proximity statistics We demonstrate theapproach and compare various sparsification schemes in a stochastic grid world

References1 Holger Bast Setfan Funke Domagoj Matijevic Peter Sanders Dominik Schultes In transit

to constant time shortest-path queries in road networks Workshop on Algorithm Engineer-ing and Experiments 46ndash59 2007

2 Martin Stolle Doina Precup Learning options in reinforcement learning Proc of the 5thIntrsquol Symp on Abstraction Reformulation and Approximation pp 212ndash223 2002

13321

10 13321 ndash Reinforcement Learning

39 State Representation Learning in RoboticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Brock OliverMain reference R Jonschkowski O Brock ldquoLearning Task-Specific State Representations by Maximizing Slowness

and Predictabilityrdquo in Proc of the 6th Intrsquol Workshop on Evolutionary and ReinforcementLearning for Autonomous Robot Systems (ERLARSrsquo13) 2013

URL httpwwwroboticstu-berlindefileadminfg170Publikationen_pdfJonschkowski-13-ERLARS-finalpdf

The success of reinforcement learning in robotic tasks is highly dependent onthe staterepresentation ndash a mapping from high dimensional sensory observations of the robot to statesthat can be used for reinforcement learning Currently this representation is defined byhuman engineers ndash thereby solving an essential part of the robotic learning task Howeverthis approach does not scale because it restricts the robot to tasks for which representationshave been predefined by humans For robots that are able to learn new tasks we needstate representation learning We sketch how this problem can be approached by iterativelylearning a hierarchy of task-specific state representations following a curriculum We thenfocus on a single step in this iterative procedure learning a state representation that allowsthe robot to solve asingle task To find this representation we optimize two characteristics ofgood state representations predictability and slowness We implement these characteristicsin a neural network and show that this approach can find good state representations fromvisual input in simulated robotic tasks

310 Reinforcement Learning with Heterogeneous PolicyRepresentations

Petar Kormushev (Istituto Italiano di Tecnologia ndash Genova IT)

License Creative Commons BY 30 Unported licensecopy Petar Kormushev

Joint work of Kormushev Petar Caldwell Darwin GMain reference P Kormushev DG Caldwell ldquoReinforcement Learning with Heterogeneous Policy

Representationsrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_20pdf

We propose a novel reinforcement learning approach for direct policy search that cansimultaneously (i) determine the most suitable policy representation for a given task and(ii) optimize the policy parameters of this representation in order to maximize the rewardand thus achieve the task The approach assumes that there is a heterogeneous set of policyrepresentations available to choose from A naiumlve approach to solving this problem wouldbe to take the available policy representations one by one run a separate RL optimizationprocess (ie conduct trials and evaluate the return) for each once and at the very end pickthe representation that achieved the highest reward Such an approach while theoreticallypossible would not be efficient enough in practice Instead our proposed approach is toconduct one single RL optimization process while interleaving simultaneously all availablepolicy representations This can be achieved by leveraging our previous work in the area ofRL based on Particle Filtering (RLPF)

Peter Auer Marcus Hutter and Laurent Orseau 11

311 Theoretical Analysis of Planning with OptionsTimothy Mann (Technion ndash Haifa IL)

License Creative Commons BY 30 Unported licensecopy Timothy Mann

Joint work of Mann Timothy Mannor Shie

We introduce theoretical analysis suggesting how planning can benefit from using options Ex-perimental results have shown that options often induce faster convergence [3 4] and previoustheoretical analysis has shown that options are well-behaved in dynamic programming[2 3]We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that in-corporates samples generated by options Our analysis reveals that when the given set ofoptions contains the primitive actions our generalized algorithm converges approximately asfast as FVI with only primitive actions When only temporally extended actions are usedfor planning convergence can be significantly faster than planning with only primitives butthis method may converge toward a suboptimal policy We also developed precise conditionswhere our generalized FVI algorithm converges faster with a combination of primitive andtemporally extended actions than with only primitive actions These conditions turn out todepend critically on whether the iterates produced by FVI underestimate the optimal valuefunction Our analysis of FVI suggests that options can play an important role in planningby inducing fast convergence

References1 Reacutemi Munos and Csaba SzepesvaacuteriFinite-time bounds for fitted value iterationJournal of

Machine Learning Research 9815ndash857 20082 Doina Precup Richard S Sutton and Satinder Singh Theoretical results on reinforcement

learning with temporallyabstract options Machine Learning ECML-98 Springer 382ndash3931998

3 David Silver and Kamil Ciosek Compositional Planning Using Optimal Option ModelsInProceedings of the 29th International Conference on Machine Learning Edinburgh 2012

4 Richard S Sutton Doina Precup and Satinder Singh Between MDPs and semi-MDPs Aframework for temporal abstraction in reinforcement learning Artificial Intelligence 112(1)181ndash211 August 1999

312 Learning Skill Templates for Parameterized TasksJan Hendrik Metzen (Universitaumlt Bremen DE)

License Creative Commons BY 30 Unported licensecopy Jan Hendrik Metzen

Joint work of Metzen Jan Hendrik Fabisch Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problemclass That is we assume that a task is defined by a task parameter vector and likewisea skill is considered as a parameterized policy We propose skill templates which allowto generalize skills that have been learned using reinforcement learning to similar tasksIn contrast to the recently proposed parameterized skills[1] skill templates also provide ameasure of uncertainty for this generalization which is useful for subsequent adaptation ofthe skill by means of reinforcement learning In order to infer a generalized mapping fromtask parameter space to policy parameter space and an estimate of its uncertainty we useGaussian process regression [3] We represent skills by dynamical movement primitives [2]

13321

12 13321 ndash Reinforcement Learning

and evaluate the approach on a simulated Mitsubishi PA10 arm where learning a single skillcorresponds to throwing a ball to a fixed target position while learning the skill templaterequires to generalize to new target positions We show that learning skill templates requiresonly a small amount of training data and improves learning in the target task considerably

References1 BC da Silva G Konidaris and AG Barto Learning Parameterized Skills 29th Inter-

national Conference on Machine Learning 20122 A J Ijspeert J Nakanishi H Hoffmann P Pastor and S Schaal Dynamical Movement

Primitives Learning Attractor Models for Motor Behaviors Neural Computation 252328ndash373 2013

3 CERasmussen CWilliams Gaussian Processes for Machine Learning MIT Press 2006

313 Online learning in Markov decision processesGergely Neu (Budapest University of Technology amp Economics HU)

License Creative Commons BY 30 Unported licensecopy Gergely Neu

Joint work of Neu Gergely Szepesvari Csaba Zimin Alexander Gyorgy Andras Dick TravisMain reference A Zimin G Neu ldquoOnline learning in Markov decision processes by Relative Entropy Policy

Searchrdquo to appear in Advances in Neural Information Processing 26

We study the problem of online learning in finite episodic Markov decision processes (MDPs)where the loss function is allowed to change between episodes The natural performancemeasure in this learning problem is the regret defined as the difference between the totalloss of the best stationary policy and the total loss suffered by the learner We assume thatthe learner is given access to a finite action space A and the state space X has a layeredstructure with L layers so that state transitions are only possible between consecutive layersWe propose several learning algorithms based on applying the well-known Mirror Descentalgorithm to the problem described above For deriving our first method we observe thatMirror Descent can be regarded as a variant of the recently proposed Relative Entropy PolicySearch (REPS) algorithm of [1] Our corresponding algorithm is called Online REPS orO-REPS Second we show how to approximately solve the projection operations required byMirror Descent without taking advantage of the connections to REPS Finally we propose alearning method based on using a modified version of the algorithm of[2] to implement theContinuous Exponential Weights algorithm for the online MDP problem For these last twotechniques we provide rigorous complexity analyses More importantly we show that all ofthe above algorithms satisfy regret bounds of O(

radicL |X | |A|T log(|X | |A| L)) in the bandit

setting and O(LradicT log(|X | |A| L)) in thefull information setting (both after T episodes)

These guarantees largely improve previously known results under much milder assumptionsand cannot be significantly improved under general assumptions

References1 Peters J Muumllling K and Altuumln Y (2010)Relative entropy policy searchIn Proc of the

24th AAAI Conf on Artificial Intelligence (AAAI 2010) pp 1607ndash16122 Narayanan H and Rakhlin A (2010)Random walk approach to regret minimization In

Advances in Neural Information Processing Systems 23 pp 1777ndash1785

Peter Auer Marcus Hutter and Laurent Orseau 13

314 Hierarchical Learning of Motor Skills with Information-TheoreticPolicy Search

Gerhard Neumann (TU Darmstadt ndash Darmstadt DE)

License Creative Commons BY 30 Unported licensecopy Gerhard Neumann

Joint work of Neumann Gerhard Daniel Christian Kupcsik Andras Deisenroth Marc Peters JanURL G Neumann C Daniel A Kupcsik M Deisenroth J Peters ldquoHierarchical Learning of Motor

Skills with Information-Theoretic Policy Searchrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_1pdf

The key idea behind information-theoretic policy search is to bound the lsquodistancersquo between thenew and old trajectory distribution where the relative entropy is used as lsquodistance measurersquoThe relative entropy bound exhibits many beneficial properties such as a smooth and fastlearning process and a closed-form solution for the resulting policy We summarize our workon information theoretic policy search for motor skill learning where we put particular focuson extending the original algorithm to learn several options for a motor task select an optionfor the current situation adapt the option to the situation and sequence options to solve anoverall task Finally we illustrate the performance of our algorithm with experiments onreal robots

315 Multi-Objective Reinforcement LearningAnn Nowe (Free University of Brussels BE)

License Creative Commons BY 30 Unported licensecopy Ann Nowe

Joint work of Nowe Ann Van Moffaert Kristof Drugan M MadalinaMain reference K Van Moffaert MM Drugan A Noweacute ldquoHypervolume-based Multi-Objective Reinforcement

Learningrdquo in Proc of the 7th Intrsquol Conf on Evolutionary Multi-Criterion Optimization (EMOrsquo13)LNCS Vol 7811 pp 352ndash366 Springer 2013

URL httpdxdoiorg101007978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems where the value functions are not assumed to be convex. In these cases the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandit and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e. either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective and Q-values have to be learnt for each objective individually, so that a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto-dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto-dominating future discounted reward vector separately. Hence these two entities can converge separately, but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis, in Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).
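To illustrate the set-based bootstrapping described in this talk, the following Python sketch keeps a set of Pareto-dominating Q-vectors per state-action pair (illustrative code under our own naming, e.g. nd and pareto_backup; it is not the authors' implementation and it simplifies the separate bookkeeping of immediate rewards discussed above):

import numpy as np

def nd(vectors):
    # Keep only the Pareto non-dominated reward vectors (maximization).
    out = []
    for v in vectors:
        dominated = any(np.all(w >= v) and np.any(w > v) for w in vectors)
        if not dominated:
            out.append(v)
    return out

def pareto_backup(q_sets, r_avg, s, a, s_next, gamma):
    # Set-based backup: combine the average immediate reward of (s, a) with
    # every non-dominated future reward vector reachable from s_next.
    future = [v for a_next in q_sets[s_next] for v in q_sets[s_next][a_next]]
    q_sets[s][a] = [r_avg[(s, a)] + gamma * np.asarray(v) for v in nd(future)]

Here q_sets maps states to dictionaries from actions to lists of reward vectors, and r_avg holds the estimated average immediate reward vector of each state-action pair.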

3.16 Bayesian Reinforcement Learning + Exploration

Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Tor Lattimore

Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected/discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well in some sense for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.
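For concreteness, the Bayesian ingredient of such an algorithm is typically the mixture environment over the countable class M (a standard construction, stated here as background rather than as the exact algorithm of the talk):

\xi(e_{1:t} \mid a_{1:t}) \;=\; \sum_{\nu \in \mathcal{M}} w_{\nu}\, \nu(e_{1:t} \mid a_{1:t}), \qquad w_{\nu} > 0,\quad \sum_{\nu \in \mathcal{M}} w_{\nu} \le 1,

where e_{1:t} is the sequence of observation-reward pairs produced in response to actions a_{1:t}; acting (nearly) optimally with respect to ξ is what makes the resulting policy Bayesian-inspired.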

3.17 Knowledge-Seeking Agents

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS, Vol. 8139, pp. 146–160, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.
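One way to make 'knowledge-seeking' precise, sketched here in the spirit of KL-KSA rather than quoted from the paper, is to replace external rewards by the information gained about the environment class, e.g. the divergence from prior to posterior:

g(h_t) \;=\; \mathrm{KL}\big(w(\cdot \mid h_t)\,\|\,w(\cdot)\big) \;=\; \sum_{\nu \in \mathcal{M}} w(\nu \mid h_t)\,\log\frac{w(\nu \mid h_t)}{w(\nu)},

where w(· | h_t) is the posterior over environments after history h_t. Maximizing expected information gain drives the agent towards experiments whose outcomes it cannot yet predict, while pure noise is not rewarded because it yields no gain in expectation.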

3.18 Toward a more realistic framework for general reinforcement learning

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau

Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS, Vol. 7716, pp. 209–218, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance, but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or can merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer, Berlin, Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer, Berlin, Heidelberg.


3.19 Colored MDPs, Restless Bandits, and Continuous State Reinforcement Learning

Ronald Ortner (Montan-Universität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner

Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS, Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs, which allows structural information to be added to ordinary MDPs: state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.
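A minimal sketch of how the color structure can be exploited (illustrative Python under our own naming, such as color_of; not the algorithm of the paper): all state-action pairs sharing a color pool their statistics, so the number of quantities to estimate, and hence the confidence intervals driving exploration, scales with the number of colors rather than with the size of the state-action space.

from collections import defaultdict

class ColoredStatistics:
    # Pools observations of all state-action pairs that share a color.
    def __init__(self, color_of):
        self.color_of = color_of          # maps (state, action) -> color
        self.visits = defaultdict(int)
        self.reward_sum = defaultdict(float)

    def update(self, s, a, r):
        c = self.color_of[(s, a)]
        self.visits[c] += 1
        self.reward_sum[c] += r

    def mean_reward(self, s, a):
        c = self.color_of[(s, a)]
        return self.reward_sum[c] / max(1, self.visits[c])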

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau

Joint work of: Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
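The decomposition at the heart of KBSF can be illustrated with a small NumPy fragment (our own sketch, not the authors' code): if the n-state transition matrix factors as P ≈ D K, with row-stochastic D (n × m) and K (m × n), then the m-state model with transitions K D and rewards K r can be solved instead, at a cost independent of n, and its values can be mapped back to the original states through D.

import numpy as np

def compress_model(D, K, r):
    # D: (n, m) and K: (m, n) row-stochastic factors with P ≈ D @ K.
    # The compressed m-state model uses K @ D as transitions and K @ r as rewards.
    return K @ D, K @ r

def evaluate(P, r, gamma=0.95, iters=500):
    # Plain value iteration for a fixed policy on the (compressed) model;
    # KBRL/KBSF maintain one such model per action inside the full backup.
    v = np.zeros(len(r))
    for _ in range(iters):
        v = r + gamma * P @ v
    return v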

3.21 A POMDP Tutorial

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).
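The central recursion underlying the concepts covered in the tutorial is the belief update (standard POMDP notation, given here only as background): after taking action a in belief b and observing o,

b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},

and planning amounts to computing a policy over such beliefs rather than over the hidden states themselves.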

3.22 Methods for Bellman Error Basis Function construction

Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
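A generic version of the BEBF loop can be sketched as follows (illustrative Python; the random-projection and orthogonal-matching-pursuit methods mentioned above differ in how the new feature is fit, and fit_regressor is a placeholder for that choice): evaluate the current features, compute Bellman residuals on the sampled transitions, and add a regressor trained on those residuals as the next basis function.

import numpy as np

def lstd_weights(Phi, Phi_next, rewards, gamma, reg=1e-6):
    # LSTD solution for the value function under the current feature set.
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(Phi.shape[1])
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)

def add_bebf_feature(X, Phi, Phi_next, rewards, gamma, fit_regressor):
    # X holds the raw state inputs of the sampled transitions.
    w = lstd_weights(Phi, Phi_next, rewards, gamma)
    residual = rewards + gamma * Phi_next @ w - Phi @ w   # Bellman error
    f = fit_regressor(X, residual)        # returns a callable feature function
    new_column = f(X).reshape(-1, 1)
    return np.hstack([Phi, new_column]), f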

3.23 Continual Learning

Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning

Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel

Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s') (the transition probabilities) and a function R_sa(s') (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g. Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a): S × A → ℝ that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ ℝ^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i: v_i > w_i and ∄j: v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s') and R_sa(s'). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a'} Q(s', a')). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a'} Q(s', a')), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning, in Machine Learning 84(1-2), pp. 51–80, 2011.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Scott Sanner

Joint work of: Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action and observation spaces, yet little work has provided exact methods for performing dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.

1 This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.


3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients

David Silver (University College – London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver

Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
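The 'particularly appealing form' referred to above can be written, in standard notation and stated here only as background, as follows: for a deterministic policy μ_θ with discounted state distribution ρ^μ,

\nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\Big[\, \nabla_\theta \mu_\theta(s)\; \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \Big],

so the gradient only needs the critic's action-gradient at the action the policy actually takes, which is what makes it cheaper to estimate than the stochastic policy gradient in high-dimensional action spaces.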

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez

Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimate-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning

Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag

Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG]; and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
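For reference, the two performance measures contrasted in the talk are commonly defined along the following lines (standard textbook forms, not quoted from the talk): the regret of a policy π after T steps in environment μ is

\mathrm{Regret}_T(\pi,\mu) \;=\; \max_{\pi^*} \mathbb{E}^{\pi^*}_{\mu}\Big[\sum_{t=1}^{T} r_t\Big] \;-\; \mathbb{E}^{\pi}_{\mu}\Big[\sum_{t=1}^{T} r_t\Big],

while the sample complexity of π counts the time steps at which its value falls more than ε short of optimal, i.e. |\{t : V^*_\mu(h_t) - V^\pi_\mu(h_t) > \epsilon\}|, which PAC-style bounds require to be small with probability at least 1 − δ.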

3.29 The Quest for the Ultimate TD(λ)

Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 6 pp., 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm – something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.
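As a reference point for the algorithm being generalized in this talk, a minimal linear TD(λ) update with accumulating eligibility traces looks roughly as follows (a textbook sketch in Python, not the generalized algorithm sought in the talk):

import numpy as np

def td_lambda_episode(features, rewards, w, alpha=0.1, gamma=0.99, lam=0.9):
    # features: list of feature vectors x_0 .. x_T for the visited states
    # rewards:  list of rewards r_1 .. r_T received on the transitions
    z = np.zeros_like(w)                  # eligibility trace
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * np.dot(w, x_next) - np.dot(w, x)
        z = gamma * lam * z + x           # accumulate the trace
        w = w + alpha * delta * z         # TD(lambda) weight update
    return w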

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings.² In addition, I present several new problem domains.

2 See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming. A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
Self-introduction of participants
Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
Ronald Ortner: Continuous RL and restless bandits
Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Doina Precup: Methods for Bellman Error Basis Function construction
Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
Mark B. Ring: Continual learning
Rich Sutton: The quest for the ultimate TD algorithm
Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
Marcus Hutter: Tutorial on Universal RL
Joel Veness: Universal RL – Applications and Approximations
Laurent Orseau: Optimal Universal Explorative Agents
Tor Lattimore: Bayesian Reinforcement Learning + Exploration
Laurent Orseau: More realistic assumptions for RL
Afternoon
Scott Sanner: Tutorial on Symbolic Dynamic Programming
Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
Timothy Mann: Theoretical Analysis of Planning with Options
Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
David Silver: Deterministic Policy Gradients

Thursday
Morning
Joëlle Pineau: Tutorial on POMDPs
Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
Will Uther: Tree-based MDP – PSR
Nils Siebel: Neuro-Evolution
Afternoon
Hiking

Friday
Morning
Christos Dimitrakakis: ABC and cover-tree reinforcement learning
Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
Ann Nowé: Multi-Objective Reinforcement Learning
Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 9: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

Peter Auer Marcus Hutter and Laurent Orseau 9

theory of machinesuper-intelligence More precisely AIXI is an elegant parameter-free theoryof an optimal reinforcement learning agent embedded in an arbitrary unknown environmentthat possesses essentially all aspects of rational intelligence The theory reduces all conceptualAI problems to pure computational questions After a brief discussion of its philosophicalmathematical and computational ingredients I will give a formal definition and measure ofintelligence which is maximized by AIXI AIXI can be viewed as the most powerful Bayes-optimal sequential decision maker for which I will present general optimality results Thisalso motivates some variations such as knowledge-seeking and optimistic agents and featurereinforcement learning Finally I present some recent approximations implementations andapplications of this modern top-down approach to AI

References1 M HutterUniversal Artificial IntelligenceSpringer Berlin 20052 J Veness K S Ng M Hutter W Uther and D SilverA Monte Carlo AIXI approximation

Journal of Artificial Intelligence Research 4095ndash142 20113 S LeggMachine Super Intelligence PhD thesis IDSIA Lugano Switzerland 20084 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20125 T LattimoreTheory of General Reinforcement Learning PhD thesis Research School of

Computer Science Australian National University 2014

38 Temporal Abstraction by Sparsifying Proximity StatisticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Toussaint Marc

Automatic discovery of temporal abstractions is a key problem in hierarchical reinforcementlearning In previous work such abstractions were found through the analysis of a set ofexperienced or demonstrated trajectories [2] We propose that from such trajectory datawe may learn temporally marginalized transition probabilitiesmdashwhich we call proximitystatisticsmdashthat model possible transitions on larger time scales rather than learning 1-steptransition probabilities Viewing the proximity statistics as state values allows the agent togenerate greedy policies from them Making the statistics sparse and combining proximityestimates by proximity propagation can substantially accelerate planning compared to valueiteration while keeping the size of the statistics manageable The concept of proximitystatistics and its sparsification approach is inspired from recent work in transit-node routingin large road networks [1] We show that sparsification of these proximity statistics impliesan approach to the discovery of temporal abstractions Options defined by subgoals areshown to be a special case of the sparsification of proximity statistics We demonstrate theapproach and compare various sparsification schemes in a stochastic grid world

References1 Holger Bast Setfan Funke Domagoj Matijevic Peter Sanders Dominik Schultes In transit

to constant time shortest-path queries in road networks Workshop on Algorithm Engineer-ing and Experiments 46ndash59 2007

2 Martin Stolle Doina Precup Learning options in reinforcement learning Proc of the 5thIntrsquol Symp on Abstraction Reformulation and Approximation pp 212ndash223 2002

13321

10 13321 ndash Reinforcement Learning

39 State Representation Learning in RoboticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Brock OliverMain reference R Jonschkowski O Brock ldquoLearning Task-Specific State Representations by Maximizing Slowness

and Predictabilityrdquo in Proc of the 6th Intrsquol Workshop on Evolutionary and ReinforcementLearning for Autonomous Robot Systems (ERLARSrsquo13) 2013

URL httpwwwroboticstu-berlindefileadminfg170Publikationen_pdfJonschkowski-13-ERLARS-finalpdf

The success of reinforcement learning in robotic tasks is highly dependent onthe staterepresentation ndash a mapping from high dimensional sensory observations of the robot to statesthat can be used for reinforcement learning Currently this representation is defined byhuman engineers ndash thereby solving an essential part of the robotic learning task Howeverthis approach does not scale because it restricts the robot to tasks for which representationshave been predefined by humans For robots that are able to learn new tasks we needstate representation learning We sketch how this problem can be approached by iterativelylearning a hierarchy of task-specific state representations following a curriculum We thenfocus on a single step in this iterative procedure learning a state representation that allowsthe robot to solve asingle task To find this representation we optimize two characteristics ofgood state representations predictability and slowness We implement these characteristicsin a neural network and show that this approach can find good state representations fromvisual input in simulated robotic tasks

310 Reinforcement Learning with Heterogeneous PolicyRepresentations

Petar Kormushev (Istituto Italiano di Tecnologia ndash Genova IT)

License Creative Commons BY 30 Unported licensecopy Petar Kormushev

Joint work of Kormushev Petar Caldwell Darwin GMain reference P Kormushev DG Caldwell ldquoReinforcement Learning with Heterogeneous Policy

Representationsrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_20pdf

We propose a novel reinforcement learning approach for direct policy search that cansimultaneously (i) determine the most suitable policy representation for a given task and(ii) optimize the policy parameters of this representation in order to maximize the rewardand thus achieve the task The approach assumes that there is a heterogeneous set of policyrepresentations available to choose from A naiumlve approach to solving this problem wouldbe to take the available policy representations one by one run a separate RL optimizationprocess (ie conduct trials and evaluate the return) for each once and at the very end pickthe representation that achieved the highest reward Such an approach while theoreticallypossible would not be efficient enough in practice Instead our proposed approach is toconduct one single RL optimization process while interleaving simultaneously all availablepolicy representations This can be achieved by leveraging our previous work in the area ofRL based on Particle Filtering (RLPF)

Peter Auer Marcus Hutter and Laurent Orseau 11

311 Theoretical Analysis of Planning with OptionsTimothy Mann (Technion ndash Haifa IL)

License Creative Commons BY 30 Unported licensecopy Timothy Mann

Joint work of Mann Timothy Mannor Shie

We introduce theoretical analysis suggesting how planning can benefit from using options Ex-perimental results have shown that options often induce faster convergence [3 4] and previoustheoretical analysis has shown that options are well-behaved in dynamic programming[2 3]We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that in-corporates samples generated by options Our analysis reveals that when the given set ofoptions contains the primitive actions our generalized algorithm converges approximately asfast as FVI with only primitive actions When only temporally extended actions are usedfor planning convergence can be significantly faster than planning with only primitives butthis method may converge toward a suboptimal policy We also developed precise conditionswhere our generalized FVI algorithm converges faster with a combination of primitive andtemporally extended actions than with only primitive actions These conditions turn out todepend critically on whether the iterates produced by FVI underestimate the optimal valuefunction Our analysis of FVI suggests that options can play an important role in planningby inducing fast convergence

References1 Reacutemi Munos and Csaba SzepesvaacuteriFinite-time bounds for fitted value iterationJournal of

Machine Learning Research 9815ndash857 20082 Doina Precup Richard S Sutton and Satinder Singh Theoretical results on reinforcement

learning with temporallyabstract options Machine Learning ECML-98 Springer 382ndash3931998

3 David Silver and Kamil Ciosek Compositional Planning Using Optimal Option ModelsInProceedings of the 29th International Conference on Machine Learning Edinburgh 2012

4 Richard S Sutton Doina Precup and Satinder Singh Between MDPs and semi-MDPs Aframework for temporal abstraction in reinforcement learning Artificial Intelligence 112(1)181ndash211 August 1999

312 Learning Skill Templates for Parameterized TasksJan Hendrik Metzen (Universitaumlt Bremen DE)

License Creative Commons BY 30 Unported licensecopy Jan Hendrik Metzen

Joint work of Metzen Jan Hendrik Fabisch Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problemclass That is we assume that a task is defined by a task parameter vector and likewisea skill is considered as a parameterized policy We propose skill templates which allowto generalize skills that have been learned using reinforcement learning to similar tasksIn contrast to the recently proposed parameterized skills[1] skill templates also provide ameasure of uncertainty for this generalization which is useful for subsequent adaptation ofthe skill by means of reinforcement learning In order to infer a generalized mapping fromtask parameter space to policy parameter space and an estimate of its uncertainty we useGaussian process regression [3] We represent skills by dynamical movement primitives [2]

13321

12 13321 ndash Reinforcement Learning

and evaluate the approach on a simulated Mitsubishi PA10 arm where learning a single skillcorresponds to throwing a ball to a fixed target position while learning the skill templaterequires to generalize to new target positions We show that learning skill templates requiresonly a small amount of training data and improves learning in the target task considerably

References1 BC da Silva G Konidaris and AG Barto Learning Parameterized Skills 29th Inter-

national Conference on Machine Learning 20122 A J Ijspeert J Nakanishi H Hoffmann P Pastor and S Schaal Dynamical Movement

Primitives Learning Attractor Models for Motor Behaviors Neural Computation 252328ndash373 2013

3 CERasmussen CWilliams Gaussian Processes for Machine Learning MIT Press 2006

313 Online learning in Markov decision processesGergely Neu (Budapest University of Technology amp Economics HU)

License Creative Commons BY 30 Unported licensecopy Gergely Neu

Joint work of Neu Gergely Szepesvari Csaba Zimin Alexander Gyorgy Andras Dick TravisMain reference A Zimin G Neu ldquoOnline learning in Markov decision processes by Relative Entropy Policy

Searchrdquo to appear in Advances in Neural Information Processing 26

We study the problem of online learning in finite episodic Markov decision processes (MDPs)where the loss function is allowed to change between episodes The natural performancemeasure in this learning problem is the regret defined as the difference between the totalloss of the best stationary policy and the total loss suffered by the learner We assume thatthe learner is given access to a finite action space A and the state space X has a layeredstructure with L layers so that state transitions are only possible between consecutive layersWe propose several learning algorithms based on applying the well-known Mirror Descentalgorithm to the problem described above For deriving our first method we observe thatMirror Descent can be regarded as a variant of the recently proposed Relative Entropy PolicySearch (REPS) algorithm of [1] Our corresponding algorithm is called Online REPS orO-REPS Second we show how to approximately solve the projection operations required byMirror Descent without taking advantage of the connections to REPS Finally we propose alearning method based on using a modified version of the algorithm of[2] to implement theContinuous Exponential Weights algorithm for the online MDP problem For these last twotechniques we provide rigorous complexity analyses More importantly we show that all ofthe above algorithms satisfy regret bounds of O(

radicL |X | |A|T log(|X | |A| L)) in the bandit

setting and O(LradicT log(|X | |A| L)) in thefull information setting (both after T episodes)

These guarantees largely improve previously known results under much milder assumptionsand cannot be significantly improved under general assumptions

References1 Peters J Muumllling K and Altuumln Y (2010)Relative entropy policy searchIn Proc of the

24th AAAI Conf on Artificial Intelligence (AAAI 2010) pp 1607ndash16122 Narayanan H and Rakhlin A (2010)Random walk approach to regret minimization In

Advances in Neural Information Processing Systems 23 pp 1777ndash1785

Peter Auer Marcus Hutter and Laurent Orseau 13

314 Hierarchical Learning of Motor Skills with Information-TheoreticPolicy Search

Gerhard Neumann (TU Darmstadt ndash Darmstadt DE)

License Creative Commons BY 30 Unported licensecopy Gerhard Neumann

Joint work of Neumann Gerhard Daniel Christian Kupcsik Andras Deisenroth Marc Peters JanURL G Neumann C Daniel A Kupcsik M Deisenroth J Peters ldquoHierarchical Learning of Motor

Skills with Information-Theoretic Policy Searchrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_1pdf

The key idea behind information-theoretic policy search is to bound the lsquodistancersquo between thenew and old trajectory distribution where the relative entropy is used as lsquodistance measurersquoThe relative entropy bound exhibits many beneficial properties such as a smooth and fastlearning process and a closed-form solution for the resulting policy We summarize our workon information theoretic policy search for motor skill learning where we put particular focuson extending the original algorithm to learn several options for a motor task select an optionfor the current situation adapt the option to the situation and sequence options to solve anoverall task Finally we illustrate the performance of our algorithm with experiments onreal robots

315 Multi-Objective Reinforcement LearningAnn Nowe (Free University of Brussels BE)

License Creative Commons BY 30 Unported licensecopy Ann Nowe

Joint work of Nowe Ann Van Moffaert Kristof Drugan M MadalinaMain reference K Van Moffaert MM Drugan A Noweacute ldquoHypervolume-based Multi-Objective Reinforcement

Learningrdquo in Proc of the 7th Intrsquol Conf on Evolutionary Multi-Criterion Optimization (EMOrsquo13)LNCS Vol 7811 pp 352ndash366 Springer 2013

URL httpdxdoiorg101007978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems wherethe value functions are not assumed to be convex In these cases the environment ndash eithersingle-state or multi-state ndash provides the agent multiple feedback signals upon performingan action These signals can be independent complementary or conflicting Hence multi-objective reinforcement learning (MORL) is the process of learning policies that optimizemultiple criteria simultaneously In our talk we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments In general we highlight two main streams in MORL ie either thescalarization or the direct Pareto approach The simplest method to solve a multi-objectivereinforcement learning problem is to use a scalarization function to reduce the dimensionalityof the objective space to a single-objective problem Examples are the linear scalarizationfunction and the non-linear Chebyshev scalarization function In our talk we highlight thatscalarization functions can be easily applied in general but their expressive power dependsheavily on on the fact whether a linear or non-linear transformation to a single dimensionis performed [2] Additionally they suffer from additional parameters that heavily bias thesearch process Without scalarization functions the problem remains truly multi-objectiveandQ-values have to be learnt for each objective individually and therefore a state-action ismapped to a Q-vector However a problem arises in the boots trapping process as multipleactions be can considered equally good in terms of the partial order Pareto dominance

13321

14 13321 ndash Reinforcement Learning

relation Therefore we extend the RL boots trapping principle to propagating sets of Paretodominating Q-vectors in multi-objective environments In [1] we propose to store the averageimmediate reward and the Pareto dominating future discounted reward vector separatelyHence these two entities can converge separately but can also easily be combined with avector-sum operator when the actual Q-vectors are requested Subsequently the separationis also a crucial aspect to determine the actual action sequence to follow a converged policyin the Pareto set

References1 K Van Moffaert M M Drugan A Noweacute Multi-Objective Reinforcement Learning using

Sets of Pareto Dominating PoliciesConference on Multiple Criteria Decision Making 20132 M M Drugan A Noweacute Designing Multi-Objective Multi-Armed Bandits An Analysis

in Proc of International Joint Conference of Neural Networks (IJCNN 2013)

316 Bayesian Reinforcement Learning + ExplorationTor Lattimore (Australian National University ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Tor Lattimore

Joint work of Lattimore Tor Hutter Marcus

A reinforcement learning policy π interacts sequentially with an environment micro In eachtime-step the policy π takes action a isin A before receiving observation o isin O and rewardr isin R The goal of an agentpolicy is to maximise some version of the (expecteddiscounted)cumulative reward Since we are interested in the reinforcement learning problem we willassume that thetrue environment micro is unknown but resides in some known set M Theobjective is to construct a single policy that performs well in some sense for allmost micro isinMThis challenge has been tackled for many specificM including bandits and factoredpartiallyobservableregular MDPs but comparatively few researchers have considered more generalhistory-based environments Here we consider arbitrary countable M and construct aprincipled Bayesian inspired algorithm that competes with the optimal policy in Cesaroaverage

317 Knowledge-Seeking AgentsLaurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Lattimore Tor Hutter MarcusMain reference L Orseau T Lattimore M Hutter ldquoUniversal Knowledge-Seeking Agents for Stochastic

Environmentsrdquo in Proc of the 24th Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo13) LNCSVol 8139 pp 146ndash160 Springer 2013

URL httpdxdoiorg101007978-3-642-40935-6_12URL httpwwwhutter1netpublksaprobpdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so, in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.
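The core quantity behind such knowledge-seeking behaviour can be illustrated with a small sketch, here over a finite hypothesis class and a one-step horizon rather than the countable classes and general histories treated in the paper: each action is scored by the expected KL divergence between the posterior after the hypothetical observation and the current belief. The likelihood table and uniform prior are assumptions of the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def expected_info_gain(weights, like):
    """like[m][a][o] = probability of observation o under hypothesis m after action a.
    Returns, for each action, E_o[ KL(posterior || prior) ] under the Bayes mixture."""
    weights = np.asarray(weights)
    n_actions, n_obs = like.shape[1], like.shape[2]
    gains = np.zeros(n_actions)
    for a in range(n_actions):
        for o in range(n_obs):
            p_o = float(np.dot(weights, like[:, a, o]))      # mixture probability of o
            if p_o == 0.0:
                continue
            posterior = weights * like[:, a, o] / p_o
            gains[a] += p_o * kl(posterior, weights)
    return gains

# Toy class of 2 hypotheses, 2 actions, 2 observations: action 1 is informative, action 0 is not.
like = np.array([[[0.5, 0.5], [0.9, 0.1]],
                 [[0.5, 0.5], [0.1, 0.9]]])
prior = np.array([0.5, 0.5])
gains = expected_info_gain(prior, like)
print(gains, "-> knowledge-seeking action:", int(np.argmax(gains)))
```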

3.18 Toward a more realistic framework for general reinforcement learning

Laurent Orseau (AgroParisTech – Paris, FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Ring MarkMain reference L Orseau M Ring ldquoSpace-Time Embedded Intelligencerdquo in Proc of the 5th Intrsquol Conf on

Artificial General Intelligence (AGIrsquo12) LNCS Vol 7716 pp 209ndash218 Springer 2012URL httpdxdoiorg101007978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance, but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or merely have human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3] we studied various consequences of integrating the agent more and more in its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI) (pp. 11–20). Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI) (pp. 1–10). Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence (pp. 209–218). Oxford, UK: Springer Berlin Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence (pp. 219–231). Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits and Continuous State Reinforcement Learning

Ronald Ortner (Montanuniversität Leoben, AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows us to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.
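A hedged sketch of how such color information might be exploited: state-action pairs that share a color pool their samples, so empirical means and optimism bonuses are computed per color rather than per pair. The Hoeffding-style bonus and the class below are illustrative assumptions, not the UCRL adaptation itself.

```python
from collections import defaultdict
import math

class ColoredEstimates:
    """Pool reward samples of all state-action pairs that share a color."""
    def __init__(self, color_of):
        self.color_of = color_of                 # dict: (state, action) -> color
        self.n = defaultdict(int)                # samples per color
        self.reward_sum = defaultdict(float)

    def update(self, state, action, reward):
        c = self.color_of[(state, action)]
        self.n[c] += 1
        self.reward_sum[c] += reward

    def optimistic_reward(self, state, action, delta=0.05):
        """Empirical mean plus a Hoeffding-style bonus computed from the pooled color count."""
        c = self.color_of[(state, action)]
        if self.n[c] == 0:
            return 1.0                           # optimistic default for rewards in [0, 1]
        mean = self.reward_sum[c] / self.n[c]
        bonus = math.sqrt(math.log(1.0 / delta) / (2.0 * self.n[c]))
        return min(1.0, mean + bonus)

# Two state-action pairs share a color, so each benefits from the other's samples.
est = ColoredEstimates({(0, 0): "red", (1, 1): "red", (0, 1): "blue"})
est.update(0, 0, 0.8)
est.update(1, 1, 0.6)
print(est.n["red"], est.optimistic_reward(0, 0))   # count 2: both red pairs share their samples
```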

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

Joint work of Barreto Andreacute MS Precup D Pineau JoeumllleMain reference AM S Barreto D Precup J Pineau ldquoReinforcement Learning using Kernel-Based Stochastic

Factorizationrdquo in Proc of the 25th Annual Conf on Neural Information Processing Systems(NIPSrsquo11) pp 720ndash728 2011

URL httpbooksnipsccpapersfilesnips24NIPS2011_0496pdfURL httpwwwcsmcgillca~jpineaufilesbarreto-nips11pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction; it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
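A compact sketch of the kind of compression described above, under stated assumptions: from sample transitions, build a reduced m-state model per action using row-normalized Gaussian kernels between sampled states and m representative states, then run value iteration on the reduced model. Kernel widths, the choice of representative states and the normalization are illustrative; the exact KBSF construction is in the paper.

```python
import numpy as np

def normalized_kernel(X, Y, width=1.0):
    """Row-normalized Gaussian kernel matrix between point sets X (p x d) and Y (q x d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * width ** 2))
    return K / K.sum(axis=1, keepdims=True)

def reduced_model(transitions, rep_states, width=1.0):
    """transitions[a] = (S, R, S_next) arrays of the sampled transitions for action a.
    Returns per-action m x m transition matrices and m-vector rewards over the representative states."""
    P_bar, r_bar = {}, {}
    for a, (S, R, S_next) in transitions.items():
        K = normalized_kernel(rep_states, S, width)       # m x n : representative -> sampled start states
        D = normalized_kernel(S_next, rep_states, width)  # n x m : sampled successors -> representative
        P_bar[a] = K @ D                                  # m x m reduced (stochastic) transition model
        r_bar[a] = K @ R                                  # m reduced expected rewards
    return P_bar, r_bar

def value_iteration(P_bar, r_bar, gamma=0.9, iters=200):
    m = next(iter(r_bar.values())).shape[0]
    V = np.zeros(m)
    for _ in range(iters):
        V = np.max([r_bar[a] + gamma * P_bar[a] @ V for a in P_bar], axis=0)
    return V

# Toy 1-D problem: two actions drift the state left or right; reward is the next state's position.
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=(50, 1))
transitions = {
    0: (S, S[:, 0] - 0.1, np.clip(S - 0.1, 0, 1)),
    1: (S, S[:, 0] + 0.1, np.clip(S + 0.1, 0, 1)),
}
rep_states = np.linspace(0, 1, 10)[:, None]               # m = 10 representative states (an assumption)
P_bar, r_bar = reduced_model(transitions, rep_states, width=0.2)
print(value_iteration(P_bar, r_bar).round(2))
```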

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

URL httpwwwcsmcgillca~jpineautalksjpineau-dagstuhl13pdf

This talk presented key concepts, algorithms, theory and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 30 Unported licensecopy Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
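A minimal sketch of the BEBF idea under simplifying assumptions: fit a regressor to the Bellman residual of the current linear value estimate and append its predictions as a new feature, then refit the value function on the enlarged basis. The decision-tree regressor and the on-policy toy chain are assumptions of the example; the talk's methods use random projections and orthogonal matching pursuit instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_values(Phi, R, Phi_next, gamma=0.95, sweeps=200):
    """Crude linear value evaluation: repeated least squares on bootstrapped targets."""
    w = np.zeros(Phi.shape[1])
    for _ in range(sweeps):
        targets = R + gamma * Phi_next @ w
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w

def add_bebf(X, X_next, Phi, Phi_next, R, gamma=0.95):
    """Fit a new basis function to the Bellman residual of the current value estimate."""
    w = fit_values(Phi, R, Phi_next, gamma)
    residual = R + gamma * Phi_next @ w - Phi @ w
    f = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    new, new_next = f.predict(X)[:, None], f.predict(X_next)[:, None]
    return np.hstack([Phi, new]), np.hstack([Phi_next, new_next])

# Toy chain: states 0..9, deterministic step to the right, reward 1 when at the last state.
X = np.arange(10)[:, None].astype(float)
X_next = np.clip(X + 1, 0, 9)
R = (X[:, 0] == 9).astype(float)
Phi, Phi_next = np.ones((10, 1)), np.ones((10, 1))         # start from a constant feature
for _ in range(3):
    Phi, Phi_next = add_bebf(X, X_next, Phi, Phi_next, R)
print(np.round(Phi @ fit_values(Phi, R, Phi_next), 2))
```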

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 30 Unported licensecopy Mark B Ring

Joint work of Ring Mark B Schaul Tom Schmidhuber JuumlrgenMain reference MB Ring T Schaul ldquoThe organization of behavior into temporal and spatial neighborhoodsrdquo in

Proc of the 2012 IEEE Intrsquol Conf on Development and Learning and Epigenetic Robotics(ICDL-EPIROBrsquo12) pp 1ndash6 IEEE 2012

URL httpdxdoiorg101109DevLrn20126400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π : S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a) : S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: v dominates w when ∃i : v_i > w_i and ∄j : v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹
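To make the set-based backup concrete, here is a small sketch of the ND operator and of replacing max_a′ by ND(∪_a′ Q(s′, a′)); it deliberately uses the naive set addition discussed above, not PQ-learning's controlled addition, and the toy Q-table is an assumption.

```python
def dominates(v, w):
    """Pareto dominance: v is at least as good in every objective and strictly better in one."""
    return all(vi >= wi for vi, wi in zip(v, w)) and any(vi > wi for vi, wi in zip(v, w))

def nd(vectors):
    """ND operator: keep only the Pareto non-dominated vectors (the Pareto front of the set)."""
    vs = [tuple(v) for v in vectors]
    return {v for v in vs if not any(dominates(u, v) for u in vs if u != v)}

def backup_targets(Q, s_next, reward, gamma=0.95):
    """Replace max_a' Q(s',a') by ND(union over a' of Q(s',a')) and shift each vector by the reward
    (naive set addition; PQ-learning's controlled addition prunes this further)."""
    union = set().union(*Q[s_next].values()) if Q[s_next] else {tuple(0.0 for _ in reward)}
    frontier = nd(union)
    return nd({tuple(r + gamma * q for r, q in zip(reward, v)) for v in frontier})

# Toy: two next-state actions with conflicting two-objective Q-vectors; (0.8, 0.0) gets pruned.
Q = {1: {0: {(1.0, 0.0), (0.8, 0.0)}, 1: {(0.0, 1.0)}}}
print(backup_targets(Q, s_next=1, reward=(0.1, 0.1)))
```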

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1-2), pp. 51–80, 2011.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.


3 Z. Zamani, S. Sanner and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College – London, GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
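A minimal sketch of a single deterministic policy gradient step with linear approximators and a 1-D action, assuming a critic that is linear in the action and a semi-gradient TD critic update; step sizes and feature maps are assumptions, and this is not the full off-policy actor-critic of the talk.

```python
import numpy as np

def features(s):
    """Simple state feature map (assumption): raw state plus a bias term."""
    return np.array([s, 1.0])

def q_value(w, s, a):
    """Critic linear in features and in the (1-D) action: Q(s,a) = w_s.phi(s) + a * w_a.phi(s)."""
    phi = features(s)
    return w["s"] @ phi + a * (w["a"] @ phi)

def dpg_step(theta, w, s, a, r, s_next, gamma=0.99, lr_actor=1e-3, lr_critic=1e-2):
    phi, phi_next = features(s), features(s_next)
    a_next = theta @ phi_next                              # deterministic target action mu(s')
    td_error = r + gamma * q_value(w, s_next, a_next) - q_value(w, s, a)
    # Critic: semi-gradient TD update on both weight blocks.
    w["s"] += lr_critic * td_error * phi
    w["a"] += lr_critic * td_error * a * phi
    # Actor: follow grad_theta mu(s) times grad_a Q(s, a) evaluated at a = mu(s).
    grad_a_q = w["a"] @ phi
    theta += lr_actor * grad_a_q * phi
    return theta, w

theta = np.zeros(2)
w = {"s": np.zeros(2), "a": np.zeros(2)}
# One transition from an (assumed) exploratory behaviour policy.
theta, w = dpg_step(theta, w, s=0.5, a=0.3, r=1.0, s_next=0.6)
print(theta, w)
```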

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3870–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C.E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C.E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
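The two families of strategies mentioned above can be made concrete with a short sketch (not taken from the talk): an optimistic index policy (UCB1) and a posterior-sampling policy (Thompson sampling with Beta posteriors) on a toy Bernoulli bandit whose arm probabilities are assumptions of the example.

```python
import math, random

def ucb1_action(counts, sums, t):
    """Optimism: pick the arm with the highest upper confidence bound."""
    for a, n in enumerate(counts):          # play each arm once first
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson_action(alpha, beta):
    """Posterior sampling: draw a mean from each Beta posterior and act greedily on the draw."""
    return max(range(len(alpha)), key=lambda a: random.betavariate(alpha[a], beta[a]))

# Toy Bernoulli bandit with two arms.
probs = [0.4, 0.6]
counts, sums = [0, 0], [0.0, 0.0]
alpha, beta = [1, 1], [1, 1]
for t in range(1, 1001):
    a = ucb1_action(counts, sums, t)
    r = 1.0 if random.random() < probs[a] else 0.0
    counts[a] += 1; sums[a] += r
    b = thompson_action(alpha, beta)
    rb = 1 if random.random() < probs[b] else 0
    alpha[b] += rb; beta[b] += 1 - rb
print("UCB pulls:", counts, " Thompson posterior:", list(zip(alpha, beta)))
```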

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
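For reference, a minimal sketch of the core algorithm the talk sets out to generalize: on-policy TD(λ) with accumulating eligibility traces and linear function approximation, run on a toy random walk. The feature map and step size are assumptions; the off-policy, gradient-based generalizations such as GQ(λ) are not shown.

```python
import numpy as np

def td_lambda(episodes, n_features, phi, alpha=0.1, gamma=0.99, lam=0.9):
    """On-policy TD(lambda) with accumulating traces and a linear value function v(s) = w.phi(s).
    `episodes` is a list of trajectories [(s, r, s_next, done), ...] collected under the policy."""
    w = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)                       # eligibility trace
        for s, r, s_next, done in episode:
            v, v_next = w @ phi(s), 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v             # TD error
            z = gamma * lam * z + phi(s)               # accumulate trace
            w += alpha * delta * z
    return w

# Toy 5-state random walk with a one-hot feature map; reward 1 on reaching the right end.
def phi(s, n=5):
    x = np.zeros(n); x[s] = 1.0
    return x

rng = np.random.default_rng(0)
episodes = []
for _ in range(200):
    s, traj = 2, []
    while True:
        s_next = s + rng.choice([-1, 1])
        done = s_next in (-1, 5)
        traj.append((s, 1.0 if s_next == 5 else 0.0, max(min(s_next, 4), 0), done))
        if done:
            break
        s = s_next
    episodes.append(traj)
print(td_lambda(episodes, 5, phi).round(2))
```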

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings². In addition I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013)


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon
  Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 10: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

10 13321 ndash Reinforcement Learning

39 State Representation Learning in RoboticsRico Jonschkowski (TU Berlin DE)

License Creative Commons BY 30 Unported licensecopy Rico Jonschkowski

Joint work of Jonschkowski Rico Brock OliverMain reference R Jonschkowski O Brock ldquoLearning Task-Specific State Representations by Maximizing Slowness

and Predictabilityrdquo in Proc of the 6th Intrsquol Workshop on Evolutionary and ReinforcementLearning for Autonomous Robot Systems (ERLARSrsquo13) 2013

URL httpwwwroboticstu-berlindefileadminfg170Publikationen_pdfJonschkowski-13-ERLARS-finalpdf

The success of reinforcement learning in robotic tasks is highly dependent onthe staterepresentation ndash a mapping from high dimensional sensory observations of the robot to statesthat can be used for reinforcement learning Currently this representation is defined byhuman engineers ndash thereby solving an essential part of the robotic learning task Howeverthis approach does not scale because it restricts the robot to tasks for which representationshave been predefined by humans For robots that are able to learn new tasks we needstate representation learning We sketch how this problem can be approached by iterativelylearning a hierarchy of task-specific state representations following a curriculum We thenfocus on a single step in this iterative procedure learning a state representation that allowsthe robot to solve asingle task To find this representation we optimize two characteristics ofgood state representations predictability and slowness We implement these characteristicsin a neural network and show that this approach can find good state representations fromvisual input in simulated robotic tasks

310 Reinforcement Learning with Heterogeneous PolicyRepresentations

Petar Kormushev (Istituto Italiano di Tecnologia ndash Genova IT)

License Creative Commons BY 30 Unported licensecopy Petar Kormushev

Joint work of Kormushev Petar Caldwell Darwin GMain reference P Kormushev DG Caldwell ldquoReinforcement Learning with Heterogeneous Policy

Representationsrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_20pdf

We propose a novel reinforcement learning approach for direct policy search that cansimultaneously (i) determine the most suitable policy representation for a given task and(ii) optimize the policy parameters of this representation in order to maximize the rewardand thus achieve the task The approach assumes that there is a heterogeneous set of policyrepresentations available to choose from A naiumlve approach to solving this problem wouldbe to take the available policy representations one by one run a separate RL optimizationprocess (ie conduct trials and evaluate the return) for each once and at the very end pickthe representation that achieved the highest reward Such an approach while theoreticallypossible would not be efficient enough in practice Instead our proposed approach is toconduct one single RL optimization process while interleaving simultaneously all availablepolicy representations This can be achieved by leveraging our previous work in the area ofRL based on Particle Filtering (RLPF)

Peter Auer Marcus Hutter and Laurent Orseau 11

311 Theoretical Analysis of Planning with OptionsTimothy Mann (Technion ndash Haifa IL)

License Creative Commons BY 30 Unported licensecopy Timothy Mann

Joint work of Mann Timothy Mannor Shie

We introduce theoretical analysis suggesting how planning can benefit from using options Ex-perimental results have shown that options often induce faster convergence [3 4] and previoustheoretical analysis has shown that options are well-behaved in dynamic programming[2 3]We introduced a generalization of the Fitted Value Iteration (FVI) algorithm [1] that in-corporates samples generated by options Our analysis reveals that when the given set ofoptions contains the primitive actions our generalized algorithm converges approximately asfast as FVI with only primitive actions When only temporally extended actions are usedfor planning convergence can be significantly faster than planning with only primitives butthis method may converge toward a suboptimal policy We also developed precise conditionswhere our generalized FVI algorithm converges faster with a combination of primitive andtemporally extended actions than with only primitive actions These conditions turn out todepend critically on whether the iterates produced by FVI underestimate the optimal valuefunction Our analysis of FVI suggests that options can play an important role in planningby inducing fast convergence

References1 Reacutemi Munos and Csaba SzepesvaacuteriFinite-time bounds for fitted value iterationJournal of

Machine Learning Research 9815ndash857 20082 Doina Precup Richard S Sutton and Satinder Singh Theoretical results on reinforcement

learning with temporallyabstract options Machine Learning ECML-98 Springer 382ndash3931998

3 David Silver and Kamil Ciosek Compositional Planning Using Optimal Option ModelsInProceedings of the 29th International Conference on Machine Learning Edinburgh 2012

4 Richard S Sutton Doina Precup and Satinder Singh Between MDPs and semi-MDPs Aframework for temporal abstraction in reinforcement learning Artificial Intelligence 112(1)181ndash211 August 1999

312 Learning Skill Templates for Parameterized TasksJan Hendrik Metzen (Universitaumlt Bremen DE)

License Creative Commons BY 30 Unported licensecopy Jan Hendrik Metzen

Joint work of Metzen Jan Hendrik Fabisch Alexander

We consider the problem of learning skills for a parameterized reinforcement learning problemclass That is we assume that a task is defined by a task parameter vector and likewisea skill is considered as a parameterized policy We propose skill templates which allowto generalize skills that have been learned using reinforcement learning to similar tasksIn contrast to the recently proposed parameterized skills[1] skill templates also provide ameasure of uncertainty for this generalization which is useful for subsequent adaptation ofthe skill by means of reinforcement learning In order to infer a generalized mapping fromtask parameter space to policy parameter space and an estimate of its uncertainty we useGaussian process regression [3] We represent skills by dynamical movement primitives [2]

13321

12 13321 ndash Reinforcement Learning

and evaluate the approach on a simulated Mitsubishi PA10 arm where learning a single skillcorresponds to throwing a ball to a fixed target position while learning the skill templaterequires to generalize to new target positions We show that learning skill templates requiresonly a small amount of training data and improves learning in the target task considerably

References1 BC da Silva G Konidaris and AG Barto Learning Parameterized Skills 29th Inter-

national Conference on Machine Learning 20122 A J Ijspeert J Nakanishi H Hoffmann P Pastor and S Schaal Dynamical Movement

Primitives Learning Attractor Models for Motor Behaviors Neural Computation 252328ndash373 2013

3 CERasmussen CWilliams Gaussian Processes for Machine Learning MIT Press 2006

313 Online learning in Markov decision processesGergely Neu (Budapest University of Technology amp Economics HU)

License Creative Commons BY 30 Unported licensecopy Gergely Neu

Joint work of Neu Gergely Szepesvari Csaba Zimin Alexander Gyorgy Andras Dick TravisMain reference A Zimin G Neu ldquoOnline learning in Markov decision processes by Relative Entropy Policy

Searchrdquo to appear in Advances in Neural Information Processing 26

We study the problem of online learning in finite episodic Markov decision processes (MDPs)where the loss function is allowed to change between episodes The natural performancemeasure in this learning problem is the regret defined as the difference between the totalloss of the best stationary policy and the total loss suffered by the learner We assume thatthe learner is given access to a finite action space A and the state space X has a layeredstructure with L layers so that state transitions are only possible between consecutive layersWe propose several learning algorithms based on applying the well-known Mirror Descentalgorithm to the problem described above For deriving our first method we observe thatMirror Descent can be regarded as a variant of the recently proposed Relative Entropy PolicySearch (REPS) algorithm of [1] Our corresponding algorithm is called Online REPS orO-REPS Second we show how to approximately solve the projection operations required byMirror Descent without taking advantage of the connections to REPS Finally we propose alearning method based on using a modified version of the algorithm of[2] to implement theContinuous Exponential Weights algorithm for the online MDP problem For these last twotechniques we provide rigorous complexity analyses More importantly we show that all ofthe above algorithms satisfy regret bounds of O(

radicL |X | |A|T log(|X | |A| L)) in the bandit

setting and O(LradicT log(|X | |A| L)) in thefull information setting (both after T episodes)

These guarantees largely improve previously known results under much milder assumptionsand cannot be significantly improved under general assumptions

References1 Peters J Muumllling K and Altuumln Y (2010)Relative entropy policy searchIn Proc of the

24th AAAI Conf on Artificial Intelligence (AAAI 2010) pp 1607ndash16122 Narayanan H and Rakhlin A (2010)Random walk approach to regret minimization In

Advances in Neural Information Processing Systems 23 pp 1777ndash1785

Peter Auer Marcus Hutter and Laurent Orseau 13

314 Hierarchical Learning of Motor Skills with Information-TheoreticPolicy Search

Gerhard Neumann (TU Darmstadt ndash Darmstadt DE)

License Creative Commons BY 30 Unported licensecopy Gerhard Neumann

Joint work of Neumann Gerhard Daniel Christian Kupcsik Andras Deisenroth Marc Peters JanURL G Neumann C Daniel A Kupcsik M Deisenroth J Peters ldquoHierarchical Learning of Motor

Skills with Information-Theoretic Policy Searchrdquo 2013URL httpewrlfileswordpresscom201306ewrl11_submission_1pdf

The key idea behind information-theoretic policy search is to bound the lsquodistancersquo between thenew and old trajectory distribution where the relative entropy is used as lsquodistance measurersquoThe relative entropy bound exhibits many beneficial properties such as a smooth and fastlearning process and a closed-form solution for the resulting policy We summarize our workon information theoretic policy search for motor skill learning where we put particular focuson extending the original algorithm to learn several options for a motor task select an optionfor the current situation adapt the option to the situation and sequence options to solve anoverall task Finally we illustrate the performance of our algorithm with experiments onreal robots

315 Multi-Objective Reinforcement LearningAnn Nowe (Free University of Brussels BE)

License Creative Commons BY 30 Unported licensecopy Ann Nowe

Joint work of Nowe Ann Van Moffaert Kristof Drugan M MadalinaMain reference K Van Moffaert MM Drugan A Noweacute ldquoHypervolume-based Multi-Objective Reinforcement

Learningrdquo in Proc of the 7th Intrsquol Conf on Evolutionary Multi-Criterion Optimization (EMOrsquo13)LNCS Vol 7811 pp 352ndash366 Springer 2013

URL httpdxdoiorg101007978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems wherethe value functions are not assumed to be convex In these cases the environment ndash eithersingle-state or multi-state ndash provides the agent multiple feedback signals upon performingan action These signals can be independent complementary or conflicting Hence multi-objective reinforcement learning (MORL) is the process of learning policies that optimizemultiple criteria simultaneously In our talk we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments In general we highlight two main streams in MORL ie either thescalarization or the direct Pareto approach The simplest method to solve a multi-objectivereinforcement learning problem is to use a scalarization function to reduce the dimensionalityof the objective space to a single-objective problem Examples are the linear scalarizationfunction and the non-linear Chebyshev scalarization function In our talk we highlight thatscalarization functions can be easily applied in general but their expressive power dependsheavily on on the fact whether a linear or non-linear transformation to a single dimensionis performed [2] Additionally they suffer from additional parameters that heavily bias thesearch process Without scalarization functions the problem remains truly multi-objectiveandQ-values have to be learnt for each objective individually and therefore a state-action ismapped to a Q-vector However a problem arises in the boots trapping process as multipleactions be can considered equally good in terms of the partial order Pareto dominance

13321

14 13321 ndash Reinforcement Learning

relation Therefore we extend the RL boots trapping principle to propagating sets of Paretodominating Q-vectors in multi-objective environments In [1] we propose to store the averageimmediate reward and the Pareto dominating future discounted reward vector separatelyHence these two entities can converge separately but can also easily be combined with avector-sum operator when the actual Q-vectors are requested Subsequently the separationis also a crucial aspect to determine the actual action sequence to follow a converged policyin the Pareto set

References1 K Van Moffaert M M Drugan A Noweacute Multi-Objective Reinforcement Learning using

Sets of Pareto Dominating PoliciesConference on Multiple Criteria Decision Making 20132 M M Drugan A Noweacute Designing Multi-Objective Multi-Armed Bandits An Analysis

in Proc of International Joint Conference of Neural Networks (IJCNN 2013)

316 Bayesian Reinforcement Learning + ExplorationTor Lattimore (Australian National University ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Tor Lattimore

Joint work of Lattimore Tor Hutter Marcus

A reinforcement learning policy π interacts sequentially with an environment micro In eachtime-step the policy π takes action a isin A before receiving observation o isin O and rewardr isin R The goal of an agentpolicy is to maximise some version of the (expecteddiscounted)cumulative reward Since we are interested in the reinforcement learning problem we willassume that thetrue environment micro is unknown but resides in some known set M Theobjective is to construct a single policy that performs well in some sense for allmost micro isinMThis challenge has been tackled for many specificM including bandits and factoredpartiallyobservableregular MDPs but comparatively few researchers have considered more generalhistory-based environments Here we consider arbitrary countable M and construct aprincipled Bayesian inspired algorithm that competes with the optimal policy in Cesaroaverage

317 Knowledge-Seeking AgentsLaurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Lattimore Tor Hutter MarcusMain reference L Orseau T Lattimore M Hutter ldquoUniversal Knowledge-Seeking Agents for Stochastic

Environmentsrdquo in Proc of the 24th Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo13) LNCSVol 8139 pp 146ndash160 Springer 2013

URL httpdxdoiorg101007978-3-642-40935-6_12URL httpwwwhutter1netpublksaprobpdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environmententirely led us to give a seek a definition of an optimal Bayesian that does so in an optimalway We recently defined such a knowledge-seeking agent KL-KSA designed for countablehypothesis classes of stochastic environments Although this agent works for arbitrarycountable classes and priors we focus on the especially interesting case where all stochastic

Peter Auer Marcus Hutter and Laurent Orseau 15

computable environments are considered and the prior is based on Solomonoffrsquos universalprior Among other properties we show that KL-KSA learns the true environment in thesense that it learns to predict the consequences of actions it does not take We show that itdoes not consider noise to be information and avoids taking actions leading to inescapabletraps We also present a variety of toy experiments demonstrating that KL-KSA behavesaccording to expectation

318 Toward a more realistic framework for general reinforcementlearning

Laurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Ring MarkMain reference L Orseau M Ring ldquoSpace-Time Embedded Intelligencerdquo in Proc of the 5th Intrsquol Conf on

Artificial General Intelligence (AGIrsquo12) LNCS Vol 7716 pp 209ndash218 Springer 2012URL httpdxdoiorg101007978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL)and elsewhereis particularly convenient to formally deal with agents interacting with their environmentHowever this framework has a number of issues that are usually of minor importance butcan become severe when dealing with general RL and artificial general intelligence whereone studies agents that are optimally rational or can merely have a human-level intelligenceAs a simple example an intelligent robot that is controlled by rewards and punishmentsthrough a remote control should by all means try to get hold of this remote control in orderto give itself as many rewards as possible In a series of paper [1 2 4 3] we studied variousconsequences of integrating the agent more and more in its environment leading to a newdefinition of artificial general intelligence where the agent is fully embedded in the world tothe point where it is even computed by it [3]

References1 Ring M amp Orseau L (2011) Delusion Survival and Intelligent Agents Artificial General

Intelligence (AGI) (pp 11ndash20) Berlin Heidelberg Springer2 Orseau L amp Ring M (2011) Self-Modification and Mortality in Artificial Agents Artifi-

cial General Intelligence (AGI) (pp 1ndash10) Springer3 Orseau L amp Ring M (2012) Space-Time Embedded Intelligence Artificial General In-

telligence (pp 209ndash218) Oxford UK Springer Berlin Heidelberg4 Orseau L amp Ring M (2012) Memory Issues of Intelligent Agents Artificial General

Intelligence (pp 219ndash231) Oxford UK Springer Berlin Heidelberg

13321

16 13321 ndash Reinforcement Learning

319 Colored MDPs Restless Bandits and Continuous StateReinforcement Learning

Ronald Ortner (Montan-Universitaumlt Leoben AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information toordinary MDPs Thus state-action pairs are assigned the same color when they exhibitsimilar rewards and transition probabilities This extra information can be exploited by anadaptation of the UCRL algorithm leading to regret bounds that depend on the number ofcolors instead of the size of the state-action space As applications we are able to deriveregret bounds for the restless bandit problem as well as for continuous state reinforcementlearning

320 Reinforcement Learning using Kernel-Based StochasticFactorization

Joeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

Joint work of Barreto Andreacute MS Precup D Pineau JoeumllleMain reference AM S Barreto D Precup J Pineau ldquoReinforcement Learning using Kernel-Based Stochastic

Factorizationrdquo in Proc of the 25th Annual Conf on Neural Information Processing Systems(NIPSrsquo11) pp 720ndash728 2011

URL httpbooksnipsccpapersfilesnips24NIPS2011_0496pdfURL httpwwwcsmcgillca~jpineaufilesbarreto-nips11pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
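
The computational core can be illustrated with a small numpy sketch of the factor-swapping idea: if an n-state transition model factors approximately as P ≈ DK, with row-stochastic D (n×m) and K (m×n), then the swapped product KD is an m×m stochastic matrix on which backups are cheap, and D lifts the compressed values back to the original states. This is a toy sketch under assumed inputs (a single action, given factors and rewards), not the KBSF construction itself, which derives D, K and the rewards from kernels over the sampled transitions.

```python
import numpy as np

def solve_compressed(D, K, r_small, gamma=0.9, iters=500):
    """Evaluate values on the swapped m x m model, then lift them with D.

    D: (n, m) row-stochastic, K: (m, n) row-stochastic, so P_hat = D @ K is (n, n).
    r_small: (m,) rewards attached to the m representative states (assumed given).
    """
    P_small = K @ D                      # (m, m), still row-stochastic
    v_small = np.zeros(len(r_small))
    for _ in range(iters):               # simple fixed-point backup (no actions, for brevity)
        v_small = r_small + gamma * P_small @ v_small
    return D @ v_small                   # approximate values for the original n states

rng = np.random.default_rng(0)
n, m = 6, 3
D = rng.random((n, m)); D /= D.sum(axis=1, keepdims=True)
K = rng.random((m, n)); K /= K.sum(axis=1, keepdims=True)
print(solve_compressed(D, K, r_small=rng.random(m)))
```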

3.21 A POMDP Tutorial
Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction
Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the “right” set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
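
As a rough illustration of the basic BEBF recipe (not the random-projection or matching-pursuit variants discussed in the talk), the sketch below fits the temporal-difference (Bellman) error of the current linear value estimate with a simple regressor over the raw state and appends the fitted function as a new feature. The random-feature regressor and the data layout are assumptions made for the sketch.

```python
import numpy as np

def add_bebf(states, next_states, rewards, Phi, Phi_next, weights,
             gamma=0.95, n_proj=50, seed=0):
    """Append one Bellman-error basis function to a linear value approximator.

    states, next_states: (N, d) raw states of N sampled transitions.
    Phi, Phi_next: (N, k) current features of s and s'.
    weights: (k,) current value-function weights.
    """
    td_error = rewards + gamma * Phi_next @ weights - Phi @ weights   # Bellman residual

    # Regress the residual on random features of the raw state, so the new basis
    # function is not just a linear combination of the existing features.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(states.shape[1], n_proj))
    b = rng.uniform(0, 2 * np.pi, size=n_proj)
    z = lambda X: np.cos(X @ W + b)                   # random Fourier-style features
    coef, *_ = np.linalg.lstsq(z(states), td_error, rcond=None)

    new = z(states) @ coef
    new_next = z(next_states) @ coef
    return np.column_stack([Phi, new]), np.column_stack([Phi_next, new_next])
```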

3.23 Continual Learning
Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference M. B. Ring, T. Schaul, “The organization of behavior into temporal and spatial neighborhoods,” in Proc. of the 2012 IEEE Int’l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB’12), pp. 1–6, IEEE, 2012.
URL http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to those similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning
Manuela Ruiz-Montiel (University of Malaga, ES)

License Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, “PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo,” in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA’13), pp. 139–148, 2013.
URL http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors, and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π: S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0..∞} γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a): S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r ∈ R^n, so different accumulated rewards cannot be totally ordered: a vector v dominates w when ∃i: v_i > w_i and ∄j: v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(∪_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets, and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q with two data structures with information about the vectors that (1) have been updated by q and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.

(This work is partially funded by grant TIN2009-14179 (Spanish Government Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.)

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1–2), pp. 51–80, 2011.
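
A small Python sketch of the ND operator and of a set-valued backup in the spirit of the update described above; it uses the naive pairwise set addition for brevity, which is precisely the part that PQ-learning replaces with the controlled set addition and its bookkeeping structures.

```python
import numpy as np

def nd(vectors):
    """Return the Pareto front (non-dominated vectors) of a list of reward vectors."""
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    front = []
    for i, v in enumerate(vecs):
        dominated = any(
            np.all(w >= v) and np.any(w > v)      # w dominates v
            for j, w in enumerate(vecs) if j != i
        )
        if not dominated:
            front.append(v)
    return front

def backup(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Naive set-valued backup for one transition (s, a, r, s').

    Q[s][a] is a list of reward vectors; max over a' becomes ND over the union.
    """
    zero = np.zeros_like(np.asarray(r, dtype=float))
    candidates = nd([q for qs in Q[s_next].values() for q in qs]) or [zero]
    updated = [(1 - alpha) * q + alpha * (np.asarray(r, dtype=float) + gamma * c)
               for q in (Q[s][a] or [zero])
               for c in candidates]
    Q[s][a] = nd(updated)
```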

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License Creative Commons BY 3.0 Unported license © Scott Sanner

Joint work of Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients
David Silver (University College – London, GB)

License Creative Commons BY 3.0 Unported license © David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
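
For reference, the “expected gradient of the action-value function” form mentioned above is commonly written as the display below, where μ_θ is the deterministic policy, ρ^μ its discounted state distribution and Q^μ its action-value function; this is a sketch of the standard statement rather than a quotation from the talk.

```latex
\nabla_\theta J(\mu_\theta)
  \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
        \nabla_\theta \mu_\theta(s)\,
        \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
        \right]
```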

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License Creative Commons BY 3.0 Unported license © Orhan Sönmez

Joint work of Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete-time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step, we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimate-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A. Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C. E., Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A. Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.
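
To illustrate the policy representation used in the maximization step (state-action pairs from the SIMCMC-sampled trajectories acting as the support of a Gaussian process), here is a minimal scikit-learn sketch; the kernel, the scalar action and the placeholder data are assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Support points: states and actions selected from sampled trajectories
# (here just random placeholders with a 2-D state and a scalar action).
rng = np.random.default_rng(0)
support_states = rng.uniform(-1.0, 1.0, size=(30, 2))
support_actions = np.tanh(support_states @ np.array([0.7, -0.3]))  # placeholder "selected" actions

# The policy is a GP mapping states to actions, anchored on the support points.
policy = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-4)
policy.fit(support_states, support_actions)

# Acting: predict the action for a new state (the predictive std could drive exploration).
state = np.array([[0.2, -0.4]])
action, std = policy.predict(state, return_std=True)
print(action, std)
```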

3.28 Exploration versus Exploitation in Reinforcement Learning
Peter Sunehag (Australian National University, AU)

License Creative Commons BY 3.0 Unported license © Peter Sunehag

Main reference T. Lattimore, M. Hutter, P. Sunehag, “The Sample-Complexity of General Reinforcement Learning,” arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int’l Conf. on Machine Learning (ICML’13), JMLR W&CP 28(3):28–36, 2013.
URL http://arxiv.org/abs/1308.4828v1
URL http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample-complexity bounds can be proven in much more general settings than regret bounds: regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
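
As a concrete instance of “optimism in a frequentist setting” in the simplest setting the tutorial starts from (multi-armed bandits), the following generic UCB1-style rule picks the arm with the highest optimistic estimate; it is a textbook illustration rather than material from the talk.

```python
import math
import random

def ucb1(pulls, reward_sums, t):
    """Pick the arm with the highest optimistic estimate (UCB1)."""
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm                     # try every arm once first
    scores = [reward_sums[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a])
              for a in range(len(pulls))]
    return max(range(len(pulls)), key=scores.__getitem__)

# Tiny simulation with two Bernoulli arms.
means = [0.4, 0.6]
pulls, sums = [0, 0], [0.0, 0.0]
for t in range(1, 1001):
    a = ucb1(pulls, sums, t)
    r = 1.0 if random.random() < means[a] else 0.0
    pulls[a] += 1
    sums[a] += r
print(pulls)  # the better arm should be pulled far more often
```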

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of Maei, H. R.; Sutton, R. S.
Main reference H. R. Maei, R. S. Sutton, “GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces,” in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI’10), Advances in Intelligent Systems Research Series, 6 pp., 2010.
URL http://dx.doi.org/10.2991/agi.2010.22
URL http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.
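
For readers who want the baseline that the quest starts from, here is a minimal tabular TD(λ) prediction update with accumulating eligibility traces, i.e. the classical on-policy algorithm rather than the generalized off-policy gradient variants the talk argues for; the small random-walk task is an illustrative assumption.

```python
import random

def td_lambda(num_states=5, episodes=200, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) prediction on a random walk with terminal rewards 0 and 1."""
    V = [0.0] * (num_states + 2)                 # states 0 and num_states+1 are terminal
    for _ in range(episodes):
        e = [0.0] * (num_states + 2)             # eligibility traces
        s = (num_states + 1) // 2                # start in the middle
        while 0 < s < num_states + 1:
            s_next = s + random.choice((-1, 1))
            r = 1.0 if s_next == num_states + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]  # TD error
            e[s] += 1.0                           # accumulating trace
            for i in range(len(V)):
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam
            s = s_next
    return V[1:num_states + 1]

print(td_lambda())  # values should increase roughly linearly toward the right terminal
```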

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (with respect to a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g., raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g., see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for the relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g., parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions (for example, dealing with scarce human feedback), new types of experimentation (for example, incorporating visual feedback and physical manipulation), and new types of abstraction levels, such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings. In addition, I present several new problem domains.

(See also the IJCAI Workshop on Machine Learning for Interactive Systems: http://mlis-workshop.org/2013.)


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of Veness, Joel; Hutter, Marcus
Main reference J. Veness, “Universal RL applications and approximations,” 2013.
URL http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA-based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon
  Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

319 Colored MDPs Restless Bandits and Continuous StateReinforcement Learning

Ronald Ortner (Montan-Universitaumlt Leoben AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information toordinary MDPs Thus state-action pairs are assigned the same color when they exhibitsimilar rewards and transition probabilities This extra information can be exploited by anadaptation of the UCRL algorithm leading to regret bounds that depend on the number ofcolors instead of the size of the state-action space As applications we are able to deriveregret bounds for the restless bandit problem as well as for continuous state reinforcementlearning

320 Reinforcement Learning using Kernel-Based StochasticFactorization

Joeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

Joint work of Barreto Andreacute MS Precup D Pineau JoeumllleMain reference AM S Barreto D Precup J Pineau ldquoReinforcement Learning using Kernel-Based Stochastic

Factorizationrdquo in Proc of the 25th Annual Conf on Neural Information Processing Systems(NIPSrsquo11) pp 720ndash728 2011

URL httpbooksnipsccpapersfilesnips24NIPS2011_0496pdfURL httpwwwcsmcgillca~jpineaufilesbarreto-nips11pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques thatmake it possible to learn a decision policy from a batch of sample transitions Among themkernel-based reinforcement learning (KBRL)stands out for two reasons First unlike otherapproximation schemes KBRL always converges to a unique solution Second KBRL isconsistent in the statistical sense meaning that adding more data improves the quality ofthe resulting policy and eventually leads to optimal performance Despite its nice theoreticalproperties KBRL has not been widely adopted by the reinforcement learning communityOne possible explanation for this is that the size of the KBRL approximator grows with thenumber of sample transitions which makes the approach impractical for large problems Inthis work we introduce a novel algorithm to improve the scalability of KBRL We use aspecial decomposition of a transition matrix called stochastic factorization which allows usto fix the size of the approximator while at the same time incorporating all the informationcontained in the data We apply this technique to compress the size of KBRL-derived modelsto a fixed dimension This approach is not only advantageous because of the model-sizereduction it also allows a better bias-variance trade-off by incorporating more samples inthemodel estimate The resulting algorithm kernel-based stochastic factorization (KBSF) ismuch faster than KBRL yet still converges to a unique solution We derive a theoreticalbound on the distance between KBRLrsquos solutionand KBSFrsquos solution We show that it is alsopossible to construct the KBSF solution in a fully incremental way thus freeing the space

Peter Auer Marcus Hutter and Laurent Orseau 17

complexity of the approach from its dependence on the number of sample transitions Theincremental version of KBSF (iKBSF) is able to process an arbitrary amount of data whichresults in a model-based reinforcement learning algorithm that canbe used to solve largecontinuous MDPs in on-line regimes We present experiments on a variety of challenging RLdomains including thedouble and triple pole-balancing tasks the Helicopter domain thepenthatlon event featured in the Reinforcement Learning Competition 2013 and a model ofepileptic rat brains in which the goal is to learn a neurostimulation policy to suppress theoccurrence of seizures

321 A POMDP TutorialJoeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

URL httpwwwcsmcgillca~jpineautalksjpineau-dagstuhl13pdf

This talk presented key concepts algorithms theory and empirical results pertaining tolearning and planning in Partially Observable Markov Decision Processes (POMDPs)

322 Methods for Bellman Error Basis Function constructionDoina Precup (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learningtasks but the problem of devising a good function approximator is difficult and often solved inpractice by hand-crafting theldquorightrdquo set of features In the last decade a considerable amountof effort has been devoted to methods that can construct value function approximatorsautomatically from data Among these methods Bellman error basis functionc onstruction(BEBF) are appealing due to their theoretical guarantees and good empirical performancein difficult tasks In this talk we discuss on-going developments of methods for BEBFconstruction based on random projections (Fard Grinberg Pineau and Precup NIPS 2013)and orthogonal matching pursuit (Farahmand and Precup NIPS 2012)

323 Continual LearningMark B Ring (Anaheim Hills US)

License Creative Commons BY 30 Unported licensecopy Mark B Ring

Joint work of Ring Mark B Schaul Tom Schmidhuber JuumlrgenMain reference MB Ring T Schaul ldquoThe organization of behavior into temporal and spatial neighborhoodsrdquo in

Proc of the 2012 IEEE Intrsquol Conf on Development and Learning and Epigenetic Robotics(ICDL-EPIROBrsquo12) pp 1ndash6 IEEE 2012

URL httpdxdoiorg101109DevLrn20126400883

A continual-learning agent is one that begins with relatively little knowledge and few skillsbut then incrementally and continually builds up new skills and new knowledge based on

13321

18 13321 ndash Reinforcement Learning

what it has already learned But what should such an agent do when it inevitably runs outof resources One possible solution is to prune away less useful skills and knowledge whichis difficult if these are closely connected to each other in a network of complex dependenciesThe approach I advocate in this talk is to give the agent at the outset all the computationalresources it will ever have such that continual learning becomes the process of continuallyreallocating those fixed resources I will describe how an agentrsquos policy can be broken intomany pieces and spread out among many computational units that compete to representdifferent parts of the agentrsquos policy space These units can then be arranged across alower-dimensional manifold according to those similarities which results in many advantagesfor the agent Among these advantages are improved robustness dimensionality reductionand an organization that encourages intelligent reallocation of resources when learning newskills

324 Multi-objective Reinforcement LearningManuela Ruiz-Montiel (University of Malaga ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning a new Reinforcement Learning (RL)algorithm thatdetermines the rational behaviours of an agent in multi-objective domains Most RLtechniques focus on environments with scalar rewards However many real scenarios arebest formulated in multi-objective terms rewards are vectors and each component standsfor an objective to maximize In scalar RL the environment is formalized as a MarkovDecision Problem defined by a set S of states a set A of actions a function Psa(sprime) (thetransition probabilities) and a function Rsa(sprime) (the obtained scalar rewards) The problemis to determine apolicy π S rarr A that maximizes the discounted accumulated rewardRt = Σinfin

k=0γkrt+k+1Eg Q-learning [1] is an algorithm that learns such policy It learns

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.


3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients

David Silver (University College London, GB)

License: Creative Commons BY 3.0 Unported license © David Silver
Joint work of: David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
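
In symbols, the appealing form mentioned above can be sketched as follows (our notation, following the form in which the deterministic policy gradient theorem was later published; μ_θ denotes the deterministic policy and ρ^μ its discounted state distribution):

    \nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\,\big|_{a=\mu_\theta(s)} \big]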

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License: Creative Commons BY 3.0 Unported license © Orhan Sönmez
Joint work of: Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research, we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCS) [4]. However, they are more appropriate for online settings due to their estimation-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A.: Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A.: Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C.E., Peters, J.: Gaussian Process Dynamic Programming. Neurocomputing 72(7–9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A.: Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C.E.: PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning

Peter Sunehag (Australian National University, AU)

License: Creative Commons BY 3.0 Unported license © Peter Sunehag
Main reference: T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL: http://arxiv.org/abs/1308.4828v1
URL: http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
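
As a concrete instance of the optimism strategy mentioned in the tutorial, here is a minimal UCB1-style sketch for the multi-armed bandit setting (standard textbook algorithm, not material from the talk itself):

import math, random

def ucb1(pull, n_arms, horizon):
    # Play each arm once, then always pull the arm with the highest upper confidence bound.
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initial round-robin
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean
    return means, counts

# Toy Bernoulli bandit with unknown success probabilities.
probs = [0.2, 0.5, 0.7]
means, counts = ucb1(lambda a: float(random.random() < probs[a]), len(probs), 10000)
print(counts)   # most pulls should concentrate on the best arm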

3.29 The Quest for the Ultimate TD(λ)

Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton
Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series 6, 2010.
URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done, to maximally general parameterization, as has also been partially done, and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm – something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
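
For reference, a minimal tabular TD(λ) policy-evaluation sketch with accumulating eligibility traces (the classical algorithm the talk takes as its starting point, not the generalized off-policy variants it argues for):

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    # episodes: iterable of trajectories [(s, r, s_next, done), ...] under a fixed policy.
    V = [0.0] * n_states
    for episode in episodes:
        z = [0.0] * n_states                         # eligibility traces
        for s, r, s_next, done in episode:
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            z[s] += 1.0                              # accumulating trace
            for i in range(n_states):
                V[i] += alpha * delta * z[i]
                z[i] *= gamma * lam                  # decay all traces
    return V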

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings.² In addition, I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL: applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness
Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL: applications and approximations", 2013.
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
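
A minimal sketch of the discrete belief update that underlies such belief-state planning (standard POMDP filtering; the transition and observation arrays below are hypothetical toy numbers):

import numpy as np

def belief_update(b, a, o, T, O):
    # b[s]: current belief; T[a][s, s']: transition probabilities;
    # O[a][s', o]: probability of observing o after action a lands in s'.
    predicted = b @ T[a]                 # prediction step
    new_b = predicted * O[a][:, o]       # correction with the observation likelihood
    return new_b / new_b.sum()           # normalisation

T = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3], [0.1, 0.9]])}
print(belief_update(np.array([0.5, 0.5]), 0, 1, T, O))   # -> [0.289..., 0.710...]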

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer – Montanuniversität Leoben, AT
Manuel Blum – Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete – Universität Marburg, DE
Yann Chevaleyre – University of Paris North, FR
Marc Deisenroth – TU Darmstadt, DE
Thomas G. Dietterich – Oregon State University, US
Christos Dimitrakakis – EPFL Lausanne, CH
Lutz Frommberger – Universität Bremen, DE
Jens Garstka – FernUniversität in Hagen, DE
Mohammad Ghavamzadeh – INRIA Nord Europe, Lille, FR
Marcus Hutter – Australian National University, AU
Rico Jonschkowski – TU Berlin, DE
Petar Kormushev – Italian Institute of Technology, Genova, IT
Tor Lattimore – Australian National University, AU
Alessandro Lazaric – INRIA Nord Europe, Lille, FR
Timothy Mann – Technion, Haifa, IL
Jan Hendrik Metzen – Universität Bremen, DE
Gergely Neu – Budapest University of Technology & Economics, HU
Gerhard Neumann – TU Darmstadt, DE
Ann Nowé – Free University of Brussels, BE
Laurent Orseau – AgroParisTech, Paris, FR
Ronald Ortner – Montanuniversität Leoben, AT
Joëlle Pineau – McGill University, Montreal, CA
Doina Precup – McGill University, Montreal, CA
Mark B. Ring – Anaheim Hills, US
Manuela Ruiz-Montiel – University of Malaga, ES
Scott Sanner – NICTA, Canberra, AU
Nils T. Siebel – Hochschule für Technik und Wirtschaft Berlin, DE
David Silver – University College London, GB
Orhan Sönmez – Boğaziçi University, Istanbul, TR
Peter Sunehag – Australian National University, AU
Richard S. Sutton – University of Alberta, CA
Csaba Szepesvári – University of Alberta, CA
William Uther – Google, Sydney, AU
Martijn van Otterlo – Radboud University Nijmegen, NL
Joel Veness – University of Alberta, CA
Jeremy L. Wyatt – University of Birmingham, GB

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs, Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joëlle Pineau
    • A POMDP Tutorial Joëlle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration Orhan Sönmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD(λ) Richard S. Sutton
    • Relations between Reinforcement Learning, Visual Input, Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants

and evaluate the approach on a simulated Mitsubishi PA10 arm, where learning a single skill corresponds to throwing a ball to a fixed target position, while learning the skill template requires to generalize to new target positions. We show that learning skill templates requires only a small amount of training data and improves learning in the target task considerably.

References
1 B. C. da Silva, G. Konidaris, and A. G. Barto. Learning Parameterized Skills. 29th International Conference on Machine Learning, 2012.
2 A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation 25(2):328–373, 2013.
3 C. E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

3.13 Online learning in Markov decision processes

Gergely Neu (Budapest University of Technology & Economics, HU)

License: Creative Commons BY 3.0 Unported license © Gergely Neu
Joint work of: Neu, Gergely; Szepesvári, Csaba; Zimin, Alexander; György, András; Dick, Travis
Main reference: A. Zimin, G. Neu, "Online learning in Markov decision processes by Relative Entropy Policy Search", to appear in Advances in Neural Information Processing Systems 26.

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A, and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We propose several learning algorithms based on applying the well-known Mirror Descent algorithm to the problem described above. For deriving our first method, we observe that Mirror Descent can be regarded as a variant of the recently proposed Relative Entropy Policy Search (REPS) algorithm of [1]. Our corresponding algorithm is called Online REPS, or O-REPS. Second, we show how to approximately solve the projection operations required by Mirror Descent without taking advantage of the connections to REPS. Finally, we propose a learning method based on using a modified version of the algorithm of [2] to implement the Continuous Exponential Weights algorithm for the online MDP problem. For these last two techniques we provide rigorous complexity analyses. More importantly, we show that all of the above algorithms satisfy regret bounds of O(√(L |X| |A| T log(|X| |A| L))) in the bandit setting and O(L √(T log(|X| |A| L))) in the full information setting (both after T episodes). These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
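
For intuition only: in the simplest full-information case with a single state, the entropy-regularized Mirror Descent machinery used here reduces to the familiar exponentially weighted average forecaster (this toy sketch is not the O-REPS algorithm, which runs the corresponding update over occupancy measures):

import numpy as np

def exp_weights(losses, eta=0.1):
    # Mirror descent with the entropy regularizer on the simplex =
    # multiplicative weights: w <- w * exp(-eta * loss), renormalised each round.
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)
    total = 0.0
    for loss in losses:
        total += w @ loss                # expected loss of the current distribution
        w = w * np.exp(-eta * loss)
        w /= w.sum()
    return w, total

losses = np.random.rand(1000, 5)         # 1000 rounds, 5 actions
print(exp_weights(losses)[0])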

References
1 Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI 2010), pp. 1607–1612.
2 Narayanan, H. and Rakhlin, A. (2010). Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, pp. 1777–1785.


3.14 Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search

Gerhard Neumann (TU Darmstadt – Darmstadt, DE)

License: Creative Commons BY 3.0 Unported license © Gerhard Neumann
Joint work of: Neumann, Gerhard; Daniel, Christian; Kupcsik, Andras; Deisenroth, Marc; Peters, Jan
Main reference: G. Neumann, C. Daniel, A. Kupcsik, M. Deisenroth, J. Peters, "Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search", 2013.
URL: http://ewrl.files.wordpress.com/2013/06/ewrl11_submission_1.pdf

The key idea behind information-theoretic policy search is to bound the 'distance' between the new and old trajectory distributions, where the relative entropy is used as the 'distance measure'. The relative entropy bound exhibits many beneficial properties, such as a smooth and fast learning process and a closed-form solution for the resulting policy. We summarize our work on information-theoretic policy search for motor skill learning, where we put particular focus on extending the original algorithm to learn several options for a motor task, select an option for the current situation, adapt the option to the situation, and sequence options to solve an overall task. Finally, we illustrate the performance of our algorithm with experiments on real robots.
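
In symbols, the bound referred to above can be sketched as a KL-constrained optimization in the spirit of relative entropy policy search (a simplified form; the full REPS formulation adds further constraints, e.g. on the state distribution). Here p is the new and q the old trajectory distribution, and ε the step-size bound:

    \max_{p}\ \mathbb{E}_{\tau \sim p}\,[R(\tau)] \quad \text{s.t.} \quad \mathrm{KL}(p \,\|\, q) \le \epsilon, \qquad \textstyle\sum_{\tau} p(\tau) = 1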

3.15 Multi-Objective Reinforcement Learning

Ann Nowé (Free University of Brussels, BE)

License: Creative Commons BY 3.0 Unported license © Ann Nowé
Joint work of: Nowé, Ann; Van Moffaert, Kristof; Drugan, M. Madalina
Main reference: K. Van Moffaert, M. M. Drugan, A. Nowé, "Hypervolume-based Multi-Objective Reinforcement Learning", in Proc. of the 7th Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO'13), LNCS Vol. 7811, pp. 352–366, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-37140-0_28

We focus on extending reinforcement learning algorithms to multi-objective problems, where the value functions are not assumed to be convex. In these cases, the environment – either single-state or multi-state – provides the agent multiple feedback signals upon performing an action. These signals can be independent, complementary or conflicting. Hence, multi-objective reinforcement learning (MORL) is the process of learning policies that optimize multiple criteria simultaneously. In our talk, we briefly describe our extensions to multi-armed bandits and reinforcement learning algorithms to make them applicable in multi-objective environments. In general, we highlight two main streams in MORL, i.e. either the scalarization or the direct Pareto approach. The simplest method to solve a multi-objective reinforcement learning problem is to use a scalarization function to reduce the dimensionality of the objective space to a single-objective problem. Examples are the linear scalarization function and the non-linear Chebyshev scalarization function. In our talk, we highlight that scalarization functions can be easily applied in general, but their expressive power depends heavily on whether a linear or non-linear transformation to a single dimension is performed [2]. Additionally, they suffer from additional parameters that heavily bias the search process. Without scalarization functions, the problem remains truly multi-objective, and Q-values have to be learnt for each objective individually, so that a state-action pair is mapped to a Q-vector. However, a problem arises in the bootstrapping process, as multiple actions can be considered equally good in terms of the partial-order Pareto dominance relation. Therefore, we extend the RL bootstrapping principle to propagating sets of Pareto dominating Q-vectors in multi-objective environments. In [1] we propose to store the average immediate reward and the Pareto dominating future discounted reward vector separately. Hence, these two entities can converge separately but can also easily be combined with a vector-sum operator when the actual Q-vectors are requested. Subsequently, the separation is also a crucial aspect to determine the actual action sequence to follow a converged policy in the Pareto set.
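
To make the two scalarization families mentioned above concrete, a small sketch (illustrative only; the weight vector w and the utopian reference point z* are hypothetical parameters that must be chosen per problem):

def linear_scalarization(q_vec, w):
    # Convex combination of the objectives; only reaches convex parts of the front.
    return sum(wi * qi for wi, qi in zip(w, q_vec))

def chebyshev_scalarization(q_vec, w, z_star):
    # Non-linear (Chebyshev) scalarization w.r.t. a utopian point z*; to be minimised.
    return max(wi * abs(qi - zi) for wi, qi, zi in zip(w, q_vec, z_star))

q = (0.4, 0.9)
print(linear_scalarization(q, (0.5, 0.5)))                  # 0.65
print(chebyshev_scalarization(q, (0.5, 0.5), (1.0, 1.0)))   # 0.3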

References
1 K. Van Moffaert, M. M. Drugan, A. Nowé. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Conference on Multiple Criteria Decision Making, 2013.
2 M. M. Drugan, A. Nowé. Designing Multi-Objective Multi-Armed Bandits: An Analysis. In Proc. of the International Joint Conference on Neural Networks (IJCNN 2013).

3.16 Bayesian Reinforcement Learning + Exploration

Tor Lattimore (Australian National University – Canberra, AU)

License: Creative Commons BY 3.0 Unported license © Tor Lattimore
Joint work of: Lattimore, Tor; Hutter, Marcus

A reinforcement learning policy π interacts sequentially with an environment μ. In each time-step the policy π takes action a ∈ A before receiving observation o ∈ O and reward r ∈ R. The goal of an agent/policy is to maximise some version of the (expected, discounted) cumulative reward. Since we are interested in the reinforcement learning problem, we will assume that the true environment μ is unknown, but resides in some known set M. The objective is to construct a single policy that performs well, in some sense, for all/most μ ∈ M. This challenge has been tackled for many specific M, including bandits and factored/partially observable/regular MDPs, but comparatively few researchers have considered more general history-based environments. Here we consider arbitrary countable M and construct a principled Bayesian-inspired algorithm that competes with the optimal policy in Cesàro average.

3.17 Knowledge-Seeking Agents

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Lattimore, Tor; Hutter, Marcus
Main reference: L. Orseau, T. Lattimore, M. Hutter, "Universal Knowledge-Seeking Agents for Stochastic Environments", in Proc. of the 24th Int'l Conf. on Algorithmic Learning Theory (ALT'13), LNCS Vol. 8139, pp. 146–160, Springer, 2013.
URL: http://dx.doi.org/10.1007/978-3-642-40935-6_12
URL: http://www.hutter1.net/publ/ksaprob.pdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environment entirely led us to seek a definition of an optimal Bayesian agent that does so in an optimal way. We recently defined such a knowledge-seeking agent, KL-KSA, designed for countable hypothesis classes of stochastic environments. Although this agent works for arbitrary countable classes and priors, we focus on the especially interesting case where all stochastic computable environments are considered and the prior is based on Solomonoff's universal prior. Among other properties, we show that KL-KSA learns the true environment, in the sense that it learns to predict the consequences of actions it does not take. We show that it does not consider noise to be information and avoids taking actions leading to inescapable traps. We also present a variety of toy experiments demonstrating that KL-KSA behaves according to expectation.

3.18 Toward a more realistic framework for general reinforcement learning

Laurent Orseau (AgroParisTech – Paris, FR)

License: Creative Commons BY 3.0 Unported license © Laurent Orseau
Joint work of: Orseau, Laurent; Ring, Mark
Main reference: L. Orseau, M. Ring, "Space-Time Embedded Intelligence", in Proc. of the 5th Int'l Conf. on Artificial General Intelligence (AGI'12), LNCS Vol. 7716, pp. 209–218, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL) and elsewhere is particularly convenient to formally deal with agents interacting with their environment. However, this framework has a number of issues that are usually of minor importance but can become severe when dealing with general RL and artificial general intelligence, where one studies agents that are optimally rational or merely have a human-level intelligence. As a simple example, an intelligent robot that is controlled by rewards and punishments through a remote control should by all means try to get hold of this remote control in order to give itself as many rewards as possible. In a series of papers [1, 2, 4, 3], we studied various consequences of integrating the agent more and more into its environment, leading to a new definition of artificial general intelligence where the agent is fully embedded in the world, to the point where it is even computed by it [3].

References
1 Ring, M. & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence (AGI), pp. 11–20, Berlin, Heidelberg: Springer.
2 Orseau, L. & Ring, M. (2011). Self-Modification and Mortality in Artificial Agents. Artificial General Intelligence (AGI), pp. 1–10, Springer.
3 Orseau, L. & Ring, M. (2012). Space-Time Embedded Intelligence. Artificial General Intelligence, pp. 209–218, Oxford, UK: Springer Berlin Heidelberg.
4 Orseau, L. & Ring, M. (2012). Memory Issues of Intelligent Agents. Artificial General Intelligence, pp. 219–231, Oxford, UK: Springer Berlin Heidelberg.


3.19 Colored MDPs, Restless Bandits and Continuous State Reinforcement Learning

Ronald Ortner (Montanuniversität Leoben, AT)

License: Creative Commons BY 3.0 Unported license © Ronald Ortner
Joint work of: Ortner, Ronald; Ryabko, Daniil; Auer, Peter; Munos, Rémi
Main reference: R. Ortner, D. Ryabko, P. Auer, R. Munos, "Regret Bounds for Restless Markov Bandits", in Proc. of the 23rd Int'l Conf. on Algorithmic Learning Theory (ALT'12), LNCS Vol. 7568, pp. 214–228, Springer, 2012.
URL: http://dx.doi.org/10.1007/978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information to ordinary MDPs. Thus, state-action pairs are assigned the same color when they exhibit similar rewards and transition probabilities. This extra information can be exploited by an adaptation of the UCRL algorithm, leading to regret bounds that depend on the number of colors instead of the size of the state-action space. As applications, we are able to derive regret bounds for the restless bandit problem as well as for continuous state reinforcement learning.

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
Joint work of: Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference: A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL: http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL: http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work, we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction; it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
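
A rough numerical sketch of the stochastic-factorization trick that KBSF builds on (illustrative only; in KBSF the matrices D and K are derived from kernels over sampled and representative states, which is not shown here):

import numpy as np

# If the n-state transition matrix factors as P = D K with D (n x m) and K (m x n)
# row-stochastic, policy evaluation can be carried out in the small m-state model K D
# and lifted back, instead of solving the full n-state model.
gamma = 0.9
rng = np.random.default_rng(0)
n, m = 200, 10
D = rng.random((n, m)); D /= D.sum(axis=1, keepdims=True)
K = rng.random((m, n)); K /= K.sum(axis=1, keepdims=True)
r = rng.random(n)

v_small = np.linalg.solve(np.eye(m) - gamma * (K @ D), K @ r)   # m x m solve
v_lifted = r + gamma * (D @ v_small)                            # lift to n states

v_direct = np.linalg.solve(np.eye(n) - gamma * (D @ K), r)      # n x n solve
print(np.max(np.abs(v_lifted - v_direct)))                      # ~1e-12: same values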

3.21 A POMDP Tutorial

Joëlle Pineau (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Joëlle Pineau
URL: http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction

Doina Precup (McGill University – Montreal, CA)

License: Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk, we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
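
As a reminder of the basic BEBF idea behind these methods, a simplified model-based sketch (the cited papers work from samples, using random projections or matching pursuit rather than the exact quantities below):

import numpy as np

def bebf(P, r, gamma=0.95, k=5):
    # Repeatedly append the Bellman error of the current linear fit as a new feature.
    Phi = np.ones((len(r), 1))                    # start from a constant feature
    for _ in range(k):
        A = Phi.T @ (Phi - gamma * P @ Phi)       # linear fixed-point (LSTD-style) solution
        w = np.linalg.solve(A, Phi.T @ r)
        v = Phi @ w
        e = r + gamma * P @ v - v                 # Bellman error of the current fit
        Phi = np.column_stack([Phi, e / np.linalg.norm(e)])
    return Phi

rng = np.random.default_rng(1)
P = rng.random((20, 20)); P /= P.sum(axis=1, keepdims=True)   # toy Markov chain
r = rng.random(20)
print(bebf(P, r).shape)                                       # (20, 6)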

3.23 Continual Learning

Mark B. Ring (Anaheim Hills, US)

License: Creative Commons BY 3.0 Unported license © Mark B. Ring
Joint work of: Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference: M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL: http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to those similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning

Manuela Ruiz-Montiel (University of Malaga, ES)

License: Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel
Joint work of: Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference: M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo" (PQ-Learning: Multi-objective Reinforcement Learning), in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL: http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors, and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π : S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy.

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
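As background for the belief-state methods mentioned above, the following generic sketch (textbook material, not code from the talk; the transition model T and observation model O are assumed to be given as callables) shows the exact discrete belief update that approximate POMDP planners repeatedly approximate or sample from:

```python
def belief_update(belief, action, observation, T, O):
    """One exact Bayes-filter step for a discrete POMDP.

    belief: dict assigning a (possibly zero) probability to every state
    T(s, a, s2): transition probability P(s2 | s, a)
    O(s2, a, o): observation probability P(o | s2, a)
    Returns the normalized posterior belief after taking `action`
    and receiving `observation`.
    """
    new_belief = {}
    for s2 in belief:
        pred = sum(T(s, action, s2) * p for s, p in belief.items())  # prediction step
        new_belief[s2] = O(s2, action, observation) * pred           # correction step
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / norm for s, p in new_belief.items()}
```

Scaling this step to continuous, high-dimensional robot state spaces is exactly where approximate representations (for example sampling-based or factored beliefs) become necessary.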

4 Schedule

Monday
Morning:
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon:
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvári / Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning:
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon:
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning:
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL


Afternoon:
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning:
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning:
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer – Montan-Universität Leoben, AT
Manuel Blum – Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete – Universität Marburg, DE
Yann Chevaleyre – University of Paris North, FR
Marc Deisenroth – TU Darmstadt, DE
Thomas G. Dietterich – Oregon State University, US
Christos Dimitrakakis – EPFL – Lausanne, CH
Lutz Frommberger – Universität Bremen, DE
Jens Garstka – FernUniversität in Hagen, DE
Mohammad Ghavamzadeh – INRIA Nord Europe – Lille, FR
Marcus Hutter – Australian National Univ., AU
Rico Jonschkowski – TU Berlin, DE
Petar Kormushev – Italian Institute of Technology – Genova, IT
Tor Lattimore – Australian National Univ., AU
Alessandro Lazaric – INRIA Nord Europe – Lille, FR
Timothy Mann – Technion – Haifa, IL
Jan Hendrik Metzen – Universität Bremen, DE
Gergely Neu – Budapest University of Technology & Economics, HU
Gerhard Neumann – TU Darmstadt, DE
Ann Nowé – Free University of Brussels, BE
Laurent Orseau – AgroParisTech – Paris, FR
Ronald Ortner – Montan-Universität Leoben, AT
Joëlle Pineau – McGill Univ. – Montreal, CA
Doina Precup – McGill Univ. – Montreal, CA
Mark B. Ring – Anaheim Hills, US
Manuela Ruiz-Montiel – University of Malaga, ES
Scott Sanner – NICTA – Canberra, AU
Nils T. Siebel – Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver – University College London, GB
Orhan Sönmez – Boğaziçi Univ. – Istanbul, TR
Peter Sunehag – Australian National Univ., AU
Richard S. Sutton – University of Alberta, CA
Csaba Szepesvári – University of Alberta, CA
William Uther – Google – Sydney, AU
Martijn van Otterlo – Radboud Univ. Nijmegen, NL
Joel Veness – University of Alberta, CA
Jeremy L. Wyatt – University of Birmingham, GB

Page 14: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

14 13321 ndash Reinforcement Learning

relation Therefore we extend the RL boots trapping principle to propagating sets of Paretodominating Q-vectors in multi-objective environments In [1] we propose to store the averageimmediate reward and the Pareto dominating future discounted reward vector separatelyHence these two entities can converge separately but can also easily be combined with avector-sum operator when the actual Q-vectors are requested Subsequently the separationis also a crucial aspect to determine the actual action sequence to follow a converged policyin the Pareto set

References1 K Van Moffaert M M Drugan A Noweacute Multi-Objective Reinforcement Learning using

Sets of Pareto Dominating PoliciesConference on Multiple Criteria Decision Making 20132 M M Drugan A Noweacute Designing Multi-Objective Multi-Armed Bandits An Analysis

in Proc of International Joint Conference of Neural Networks (IJCNN 2013)

316 Bayesian Reinforcement Learning + ExplorationTor Lattimore (Australian National University ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Tor Lattimore

Joint work of Lattimore Tor Hutter Marcus

A reinforcement learning policy π interacts sequentially with an environment micro In eachtime-step the policy π takes action a isin A before receiving observation o isin O and rewardr isin R The goal of an agentpolicy is to maximise some version of the (expecteddiscounted)cumulative reward Since we are interested in the reinforcement learning problem we willassume that thetrue environment micro is unknown but resides in some known set M Theobjective is to construct a single policy that performs well in some sense for allmost micro isinMThis challenge has been tackled for many specificM including bandits and factoredpartiallyobservableregular MDPs but comparatively few researchers have considered more generalhistory-based environments Here we consider arbitrary countable M and construct aprincipled Bayesian inspired algorithm that competes with the optimal policy in Cesaroaverage

317 Knowledge-Seeking AgentsLaurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Lattimore Tor Hutter MarcusMain reference L Orseau T Lattimore M Hutter ldquoUniversal Knowledge-Seeking Agents for Stochastic

Environmentsrdquo in Proc of the 24th Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo13) LNCSVol 8139 pp 146ndash160 Springer 2013

URL httpdxdoiorg101007978-3-642-40935-6_12URL httpwwwhutter1netpublksaprobpdf

Observing that the optimal Bayesian rational agent AIXI does not explore its environmententirely led us to give a seek a definition of an optimal Bayesian that does so in an optimalway We recently defined such a knowledge-seeking agent KL-KSA designed for countablehypothesis classes of stochastic environments Although this agent works for arbitrarycountable classes and priors we focus on the especially interesting case where all stochastic

Peter Auer Marcus Hutter and Laurent Orseau 15

computable environments are considered and the prior is based on Solomonoffrsquos universalprior Among other properties we show that KL-KSA learns the true environment in thesense that it learns to predict the consequences of actions it does not take We show that itdoes not consider noise to be information and avoids taking actions leading to inescapabletraps We also present a variety of toy experiments demonstrating that KL-KSA behavesaccording to expectation

318 Toward a more realistic framework for general reinforcementlearning

Laurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Ring MarkMain reference L Orseau M Ring ldquoSpace-Time Embedded Intelligencerdquo in Proc of the 5th Intrsquol Conf on

Artificial General Intelligence (AGIrsquo12) LNCS Vol 7716 pp 209ndash218 Springer 2012URL httpdxdoiorg101007978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL)and elsewhereis particularly convenient to formally deal with agents interacting with their environmentHowever this framework has a number of issues that are usually of minor importance butcan become severe when dealing with general RL and artificial general intelligence whereone studies agents that are optimally rational or can merely have a human-level intelligenceAs a simple example an intelligent robot that is controlled by rewards and punishmentsthrough a remote control should by all means try to get hold of this remote control in orderto give itself as many rewards as possible In a series of paper [1 2 4 3] we studied variousconsequences of integrating the agent more and more in its environment leading to a newdefinition of artificial general intelligence where the agent is fully embedded in the world tothe point where it is even computed by it [3]

References1 Ring M amp Orseau L (2011) Delusion Survival and Intelligent Agents Artificial General

Intelligence (AGI) (pp 11ndash20) Berlin Heidelberg Springer2 Orseau L amp Ring M (2011) Self-Modification and Mortality in Artificial Agents Artifi-

cial General Intelligence (AGI) (pp 1ndash10) Springer3 Orseau L amp Ring M (2012) Space-Time Embedded Intelligence Artificial General In-

telligence (pp 209ndash218) Oxford UK Springer Berlin Heidelberg4 Orseau L amp Ring M (2012) Memory Issues of Intelligent Agents Artificial General

Intelligence (pp 219ndash231) Oxford UK Springer Berlin Heidelberg

13321

16 13321 ndash Reinforcement Learning

319 Colored MDPs Restless Bandits and Continuous StateReinforcement Learning

Ronald Ortner (Montan-Universitaumlt Leoben AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information toordinary MDPs Thus state-action pairs are assigned the same color when they exhibitsimilar rewards and transition probabilities This extra information can be exploited by anadaptation of the UCRL algorithm leading to regret bounds that depend on the number ofcolors instead of the size of the state-action space As applications we are able to deriveregret bounds for the restless bandit problem as well as for continuous state reinforcementlearning

3.20 Reinforcement Learning using Kernel-Based Stochastic Factorization

Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

Joint work of Barreto, André M. S.; Precup, D.; Pineau, Joëlle
Main reference A. M. S. Barreto, D. Precup, J. Pineau, "Reinforcement Learning using Kernel-Based Stochastic Factorization", in Proc. of the 25th Annual Conf. on Neural Information Processing Systems (NIPS'11), pp. 720–728, 2011.
URL http://books.nips.cc/papers/files/nips24/NIPS2011_0496.pdf
URL http://www.cs.mcgill.ca/~jpineau/files/barreto-nips11.pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. Among them, kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is that the size of the KBRL approximator grows with the number of sample transitions, which makes the approach impractical for large problems. In this work we introduce a novel algorithm to improve the scalability of KBRL. We use a special decomposition of a transition matrix, called stochastic factorization, which allows us to fix the size of the approximator while at the same time incorporating all the information contained in the data. We apply this technique to compress the size of KBRL-derived models to a fixed dimension. This approach is not only advantageous because of the model-size reduction: it also allows a better bias-variance trade-off by incorporating more samples in the model estimate. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL, yet still converges to a unique solution. We derive a theoretical bound on the distance between KBRL's solution and KBSF's solution. We show that it is also possible to construct the KBSF solution in a fully incremental way, thus freeing the space complexity of the approach from its dependence on the number of sample transitions. The incremental version of KBSF (iKBSF) is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve large continuous MDPs in on-line regimes. We present experiments on a variety of challenging RL domains, including the double and triple pole-balancing tasks, the Helicopter domain, the pentathlon event featured in the Reinforcement Learning Competition 2013, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures.
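To see why the factorization fixes the size of the approximator, here is a minimal single-action sketch. The shapes, the value-iteration step, and the lifting of values back through D are simplifying assumptions of this illustration; the paper's full construction builds D and K from kernels over sampled and representative states and handles each action separately.

```python
import numpy as np

def kbsf_style_reduction(D, K, r, gamma=0.9, iters=500):
    """Sketch of the stochastic-factorization trick behind KBSF.

    D : (n, m) stochastic matrix, K : (m, n) stochastic matrix, r : (n,) rewards,
    so the large n x n model P_hat = D @ K never has to be formed explicitly.
    """
    P_small = K @ D                 # (m, m): swapping the factors keeps the
    r_small = K @ r                 # relevant structure but fixes the size at m
    v = np.zeros(P_small.shape[0])
    for _ in range(iters):          # plain value iteration on the reduced model
        v = r_small + gamma * P_small @ v
    return D @ v                    # lift the m reduced values back to n states
```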

3.21 A POMDP Tutorial

Joëlle Pineau (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Joëlle Pineau

URL http://www.cs.mcgill.ca/~jpineau/talks/jpineau-dagstuhl13.pdf

This talk presented key concepts, algorithms, theory, and empirical results pertaining to learning and planning in Partially Observable Markov Decision Processes (POMDPs).

3.22 Methods for Bellman Error Basis Function construction

Doina Precup (McGill University – Montreal, CA)

License Creative Commons BY 3.0 Unported license © Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learning tasks, but the problem of devising a good function approximator is difficult and often solved in practice by hand-crafting the "right" set of features. In the last decade, a considerable amount of effort has been devoted to methods that can construct value function approximators automatically from data. Among these methods, Bellman error basis function construction (BEBF) is appealing due to its theoretical guarantees and good empirical performance in difficult tasks. In this talk we discuss on-going developments of methods for BEBF construction based on random projections (Fard, Grinberg, Pineau and Precup, NIPS 2013) and orthogonal matching pursuit (Farahmand and Precup, NIPS 2012).
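For intuition, here is a minimal sketch of the basic BEBF loop, in a toy setting where the tabular model (P, r) of a fixed policy is known and exact least squares is used; the methods discussed in the talk instead build the features from data, via random projections or orthogonal matching pursuit. The idea is that each new feature is the Bellman error of the best value estimate representable in the current basis.

```python
import numpy as np

def bebf_features(P, r, gamma, num_features):
    """Toy Bellman Error Basis Function construction for a known tabular MDP
    under a fixed policy (all names and the exact fit are illustrative)."""
    features = [r.astype(float)]                 # start with the reward itself
    for _ in range(num_features - 1):
        Phi = np.column_stack(features)
        # Bellman-residual least-squares fit of the value function in the basis.
        w, *_ = np.linalg.lstsq(Phi - gamma * P @ Phi, r, rcond=None)
        v = Phi @ w
        features.append(r + gamma * P @ v - v)   # new feature = Bellman error of v
    return np.column_stack(features)
```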

3.23 Continual Learning

Mark B. Ring (Anaheim Hills, US)

License Creative Commons BY 3.0 Unported license © Mark B. Ring

Joint work of Ring, Mark B.; Schaul, Tom; Schmidhuber, Jürgen
Main reference M. B. Ring, T. Schaul, "The organization of behavior into temporal and spatial neighborhoods", in Proc. of the 2012 IEEE Int'l Conf. on Development and Learning and Epigenetic Robotics (ICDL-EPIROB'12), pp. 1–6, IEEE, 2012.
URL http://dx.doi.org/10.1109/DevLrn.2012.6400883

A continual-learning agent is one that begins with relatively little knowledge and few skills, but then incrementally and continually builds up new skills and new knowledge based on what it has already learned. But what should such an agent do when it inevitably runs out of resources? One possible solution is to prune away less useful skills and knowledge, which is difficult if these are closely connected to each other in a network of complex dependencies. The approach I advocate in this talk is to give the agent at the outset all the computational resources it will ever have, such that continual learning becomes the process of continually reallocating those fixed resources. I will describe how an agent's policy can be broken into many pieces and spread out among many computational units that compete to represent different parts of the agent's policy space. These units can then be arranged across a lower-dimensional manifold according to their similarities, which results in many advantages for the agent. Among these advantages are improved robustness, dimensionality reduction, and an organization that encourages intelligent reallocation of resources when learning new skills.

3.24 Multi-objective Reinforcement Learning

Manuela Ruiz-Montiel (University of Malaga, ES)

License Creative Commons BY 3.0 Unported license © Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel, Manuela; Mandow, Lawrence; Pérez-de-la-Cruz, José-Luis
Main reference M. Ruiz-Montiel, L. Mandow, J. L. Pérez-de-la-Cruz, "PQ-Learning: Aprendizaje por Refuerzo Multiobjetivo", in Proc. of the XV Conference of the Spanish Association for Artificial Intelligence (CAEPIA'13), pp. 139–148, 2013.
URL http://www.congresocedi.es/images/site/actas/ActasCAEPIA.pdf

In this talk we present PQ-learning, a new Reinforcement Learning (RL) algorithm that determines the rational behaviours of an agent in multi-objective domains. Most RL techniques focus on environments with scalar rewards. However, many real scenarios are best formulated in multi-objective terms: rewards are vectors and each component stands for an objective to maximize. In scalar RL, the environment is formalized as a Markov Decision Problem, defined by a set S of states, a set A of actions, a function P_sa(s′) (the transition probabilities) and a function R_sa(s′) (the obtained scalar rewards). The problem is to determine a policy π : S → A that maximizes the discounted accumulated reward R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}. E.g., Q-learning [1] is an algorithm that learns such a policy. It learns the scalar values Q(s, a) : S × A → R that represent the expected accumulated reward when following a given policy after taking a in s. The selected action a in each state is given by the expression argmax_a Q(s, a). In the multi-objective case the rewards are vectors r⃗ ∈ R^n, so different accumulated rewards cannot be totally ordered: v⃗ dominates w⃗ when ∃i : v_i > w_i and ∄j : v_j < w_j. Given a set of vectors, those that are not dominated by any other vector are said to lie in the Pareto front. We seek the set of policies that yield non-dominated accumulated reward vectors. The literature on multi-objective RL (MORL) is relatively scarce (see Vamplew et al. [2]). Most methods use preferences (lexicographic ordering or scalarization), allowing a total ordering of the value vectors, and approximate the front by running a scalar RL method several times with different preferences. When dealing with non-convex fronts, only a subset of the solutions is approximated. Some multi-objective dynamic programming (MODP) methods calculate all the policies at once, assuming a perfect knowledge of P_sa(s′) and R_sa(s′). We deal with the problem of efficiently approximating all the optimal policies at once, without sacrificing solutions nor assuming a perfect knowledge of the model. As far as we know, our algorithm is the first to bring these features together. As we aim to learn a set of policies at once, Q-learning is a promising starting point, since the policy used to interact with the environment is not the same that is learned. At each step, Q-learning shifts the previous estimated Q-value towards its new estimation: Q(s, a) ← (1−α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′)). In PQ-learning, Q-values are sets of vectors, so the max operator is replaced by ND(⋃_{a′} Q(s′, a′)), where ND calculates the Pareto front. A naive approach to perform the involved set addition is a pairwise summation (imported from MODP methods), but it leads to an uncontrolled growth of the sets and the algorithm becomes impractical, as it sums vectors that correspond to different action sequences. The results of these mixed sums are useless when learning deterministic policies, because two sequences cannot be followed at once. We propose a controlled set addition that only sums those pairs of vectors that correspond to useful action sequences. This is done by associating each vector q⃗ with two data structures with information about the vectors that (1) have been updated by q⃗ and (2) have contributed to its value. In this talk we describe in detail the application of PQ-learning to a simple example and the results that the algorithm yields when applied to two problems of a benchmark [2]. It approximates all the policies in the true Pareto front, as opposed to the naive approach that produces huge fronts with useless values that dramatically slow down the process.¹

¹ This work is partially funded by grant TIN2009-14179 (Spanish Government, Plan Nacional de I+D+i) and Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech. Manuela Ruiz-Montiel is funded by the Spanish Ministry of Education through the National FPU Program.
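To make the set-valued target concrete, the following minimal sketch (illustrative names; the controlled set addition that distinguishes PQ-learning from the naive approach is not shown) implements the ND operator and the ND-of-union target that replaces max_{a′} Q(s′, a′).

```python
import numpy as np

def pareto_front(vectors):
    """ND operator: keep only vectors not dominated by any other vector."""
    front = []
    for v in vectors:
        dominated = any(all(w >= v) and any(w > v) for w in vectors)
        if not dominated:
            front.append(v)
    return front

def nd_union_target(q_sets_next_state):
    """ND( union over a' of Q(s', a') ): set-valued analogue of max_{a'} Q(s', a')."""
    pooled = [q for q_set in q_sets_next_state.values() for q in q_set]
    return pareto_front(pooled)

# Tiny usage example with 2-objective reward vectors.
q_sets = {"left":  [np.array([1.0, 0.0]), np.array([0.2, 0.2])],
          "right": [np.array([0.0, 1.0]), np.array([0.5, 0.5])]}
print(nd_union_target(q_sets))   # [array([1., 0.]), array([0., 1.]), array([0.5, 0.5])]
```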

References
1 C. J. Watkins. Learning From Delayed Rewards. PhD Thesis, University of Cambridge, 1989.
2 P. Vamplew et al. Empirical Evaluation Methods For Multiobjective Reinforcement Learning. Machine Learning, 84(1-2), pp. 51–80, 2011.

3.25 Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

Scott Sanner (NICTA – Canberra, AU)

License Creative Commons BY 3.0 Unported license © Scott Sanner

Joint work of Sanner, Scott; Zamani, Zahra

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP), as covered in recent work by the authors [1, 2, 3, 4].

References
1 S. Sanner, K. V. Delgado, and L. Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proc. of the 27th Conf. on Uncertainty in Artificial Intelligence (UAI-11), Barcelona, Spain, 2011.
2 Z. Zamani, S. Sanner, K. V. Delgado, and L. Nunes de Barros. Robust optimization for hybrid MDPs with state-dependent noise. In Proc. of the 23rd International Joint Conf. on Artificial Intelligence (IJCAI-13), Beijing, China, 2013.
3 Z. Zamani, S. Sanner, and C. Fang. Symbolic dynamic programming for continuous state and action MDPs. In Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.
4 Z. Zamani, S. Sanner, P. Poupart, and K. Kersting. Symbolic dynamic programming for continuous state and observation POMDPs. In Proc. of the 26th Annual Conf. on Advances in Neural Information Processing Systems (NIPS-12), Lake Tahoe, Nevada, 2012.

3.26 Deterministic Policy Gradients

David Silver (University College London, GB)

License Creative Commons BY 3.0 Unported license © David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
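The appealing form mentioned above is ∇_θ J(μ_θ) = E_s[∇_θ μ_θ(s) ∇_a Q(s, a)|_{a=μ_θ(s)}]; a gradient over parameters only, with no integral over actions. A minimal numpy sketch of the corresponding ascent step follows (my own illustration with a linear policy and a given critic gradient, not the full off-policy actor-critic of the talk; all names are assumptions).

```python
import numpy as np

def dpg_update(theta, states, grad_a_Q, policy_jacobian, lr=1e-3):
    """One deterministic policy gradient ascent step (illustrative sketch).

    theta              : policy parameters, a = mu_theta(s) = policy_jacobian(s) @ theta
    states             : batch of states sampled from any behaviour policy
    grad_a_Q(s, a)     : gradient of a critic Q(s, a) with respect to the action
    policy_jacobian(s) : d mu_theta(s) / d theta, shape (action_dim, len(theta))
    """
    grad = np.zeros_like(theta)
    for s in states:
        J = policy_jacobian(s)
        a = J @ theta                    # deterministic (here: linear) policy
        grad += J.T @ grad_a_Q(s, a)     # chain rule: no expectation over actions
    return theta + lr * grad / len(states)
```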

3.27 Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Orhan Sönmez (Boğaziçi University – Istanbul, TR)

License Creative Commons BY 3.0 Unported license © Orhan Sönmez

Joint work of Sönmez, Orhan; Cemgil, A. Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluated using sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning in discrete-time, continuous state space Markov decision processes (MDPs). In order to do so, we utilize the expectation-maximization algorithm derived for solving MDPs [2] and employ a SIMCMC sampling scheme in its intractable expectation step. Fortunately, the maximization step has a closed-form solution due to Markov properties. Meanwhile, we approximate the policy as a function over the continuous state space using Gaussian processes [3]. Hence, in the maximization step we simply select the state-action pairs of the trajectories sampled by SIMCMC as the support of the Gaussian process approximator. We are aware that SIMCMC methods are not the best choice with respect to sample efficiency compared to sequential Monte Carlo samplers (SMCSs) [4]. However, they are more appropriate for online settings due to their estimate-at-any-time property, which SMCSs lack. As future work, we are investigating different approaches to develop an online reinforcement learning algorithm based on SIMCMC policy evaluation. As a model-based approach, the dynamics of the model would be approximated with Gaussian processes [5].


References
1 Brockwell, A., Del Moral, P., Doucet, A.: Sequentially Interacting Markov Chain Monte Carlo Methods. The Annals of Statistics, 38(6):3387–3411, December 2010.
2 Toussaint, M., Storkey, A.: Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In Proc. of the 23rd Int'l Conf. on Machine Learning, pp. 945–952, 2006.
3 Deisenroth, M., Rasmussen, C.E., Peters, J.: Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508–1524, March 2009.
4 Del Moral, P., Doucet, A.: Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society – Series B: Statistical Methodology, 68(3):411–436, June 2006.
5 Deisenroth, M., Rasmussen, C.E.: PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proc. of the 28th Int'l Conf. on Machine Learning, pp. 465–472, 2011.

3.28 Exploration versus Exploitation in Reinforcement Learning

Peter Sunehag (Australian National University, AU)

License Creative Commons BY 3.0 Unported license © Peter Sunehag

Main reference T. Lattimore, M. Hutter, P. Sunehag, "The Sample-Complexity of General Reinforcement Learning", arXiv:1308.4828v1 [cs.LG], and in Proc. of the 30th Int'l Conf. on Machine Learning (ICML'13), JMLR W&CP 28(3):28–36, 2013.
URL http://arxiv.org/abs/1308.4828v1
URL http://jmlr.org/proceedings/papers/v28/lattimore13.pdf

My presentation was a tutorial overview of the exploration vs. exploitation dilemma in reinforcement learning. I began in the multi-armed bandit setting and went through Markov Decision Processes to the general reinforcement learning setting that has only recently been studied. The talk discussed the various strategies for dealing with the dilemma, like optimism in a frequentist setting or posterior sampling in the Bayesian setting, as well as the performance measures, like sample complexity in discounted reinforcement learning or regret bounds for the undiscounted setting. It was concluded that sample complexity bounds can be proven in much more general settings than regret bounds. Regret bounds need some sort of recoverability guarantees, while unfortunately sample complexity says less about how much reward the agent will achieve. The speaker's recommendation is to try to achieve optimal sample complexity, but only within the class of rational agents described by an axiomatic system developed from classical rational choice decision theory.
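As a concrete instance of the optimism principle from the bandit part of the tutorial, here is a minimal UCB1 sketch (a standard textbook algorithm included for illustration, not material from the talk itself; the function names and horizon are assumptions).

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1: act optimistically via an upper confidence bound per arm.

    pull(a) returns a reward in [0, 1]; the bonus sqrt(2 ln t / n_a) implements
    'optimism in the face of uncertainty'.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                       # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return sums, counts

# Example: two Bernoulli arms with success probabilities 0.4 and 0.6.
probs = [0.4, 0.6]
sums, counts = ucb1(lambda a: float(random.random() < probs[a]), 2, 10_000)
print(counts)   # the better arm should receive most of the pulls
```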

3.29 The Quest for the Ultimate TD(λ)

Richard S. Sutton (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of Maei, H. R.; Sutton, R. S.
Main reference H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 6 pp., 2010.
URL http://dx.doi.org/10.2991/agi.2010.22
URL http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
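For reference, a sketch of the starting point of this quest, on-policy linear TD(λ) with accumulating eligibility traces, can be written in a few lines (the data format and parameter values are assumptions of this illustration, not part of the talk).

```python
import numpy as np

def td_lambda(phi, episodes, alpha=0.1, gamma=0.99, lam=0.9, n_features=8):
    """On-policy TD(lambda) prediction with linear function approximation
    and accumulating eligibility traces.

    phi(s)   : feature vector of length n_features for state s
    episodes : iterable of trajectories [(s, r, s_next, done), ...]
    """
    w = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)                  # eligibility trace
        for s, r, s_next, done in episode:
            v = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v        # TD error
            z = gamma * lam * z + phi(s)          # accumulate traces
            w += alpha * delta * z
    return w
```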

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action

Martijn van Otterlo (Radboud University Nijmegen, NL)

License Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action, ... sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for the relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings.² In addition I present several new problem domains.

² See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013)


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8, Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of Veness, Joel; Hutter, Marcus
Main reference J. Veness, "Universal RL applications and approximations", 2013.
URL http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88, Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.



3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.

4 Schedule

Monday
Morning
Self-introduction of participants
Peter Sunehag: Tutorial on exploration/exploitation

Afternoon
Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
Csaba Szepesvári, Gergely Neu: Online learning in Markov decision processes
Ronald Ortner: Continuous RL and restless bandits
Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
Doina Precup: Methods for Bellman Error Basis Function construction
Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization

Afternoon
Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
Mark B. Ring: Continual learning
Rich Sutton: The quest for the ultimate TD algorithm
Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
Marcus Hutter: Tutorial on Universal RL
Joel Veness: Universal RL – Applications and Approximations
Laurent Orseau: Optimal Universal Explorative Agents
Tor Lattimore: Bayesian Reinforcement Learning + Exploration
Laurent Orseau: More realistic assumptions for RL

Afternoon
Scott Sanner: Tutorial on Symbolic Dynamic Programming
Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
Timothy Mann: Theoretical Analysis of Planning with Options
Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
David Silver: Deterministic Policy Gradients

Thursday
Morning
Joëlle Pineau: Tutorial on POMDPs
Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
Will Uther: Tree-based MDP – PSR
Nils Siebel: Neuro-Evolution

Afternoon
Hiking

Friday
Morning
Christos Dimitrakakis: ABC and cover-tree reinforcement learning
Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
Ann Nowé: Multi-Objective Reinforcement Learning
Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
Christos Dimitrakakis: RL competition



Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 15: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

Peter Auer Marcus Hutter and Laurent Orseau 15

computable environments are considered and the prior is based on Solomonoffrsquos universalprior Among other properties we show that KL-KSA learns the true environment in thesense that it learns to predict the consequences of actions it does not take We show that itdoes not consider noise to be information and avoids taking actions leading to inescapabletraps We also present a variety of toy experiments demonstrating that KL-KSA behavesaccording to expectation

318 Toward a more realistic framework for general reinforcementlearning

Laurent Orseau (AgroParisTech ndash Paris FR)

License Creative Commons BY 30 Unported licensecopy Laurent Orseau

Joint work of Orseau Laurent Ring MarkMain reference L Orseau M Ring ldquoSpace-Time Embedded Intelligencerdquo in Proc of the 5th Intrsquol Conf on

Artificial General Intelligence (AGIrsquo12) LNCS Vol 7716 pp 209ndash218 Springer 2012URL httpdxdoiorg101007978-3-642-35506-6_22

The traditional agent framework commonly used in reinforcement learning (RL)and elsewhereis particularly convenient to formally deal with agents interacting with their environmentHowever this framework has a number of issues that are usually of minor importance butcan become severe when dealing with general RL and artificial general intelligence whereone studies agents that are optimally rational or can merely have a human-level intelligenceAs a simple example an intelligent robot that is controlled by rewards and punishmentsthrough a remote control should by all means try to get hold of this remote control in orderto give itself as many rewards as possible In a series of paper [1 2 4 3] we studied variousconsequences of integrating the agent more and more in its environment leading to a newdefinition of artificial general intelligence where the agent is fully embedded in the world tothe point where it is even computed by it [3]

References1 Ring M amp Orseau L (2011) Delusion Survival and Intelligent Agents Artificial General

Intelligence (AGI) (pp 11ndash20) Berlin Heidelberg Springer2 Orseau L amp Ring M (2011) Self-Modification and Mortality in Artificial Agents Artifi-

cial General Intelligence (AGI) (pp 1ndash10) Springer3 Orseau L amp Ring M (2012) Space-Time Embedded Intelligence Artificial General In-

telligence (pp 209ndash218) Oxford UK Springer Berlin Heidelberg4 Orseau L amp Ring M (2012) Memory Issues of Intelligent Agents Artificial General

Intelligence (pp 219ndash231) Oxford UK Springer Berlin Heidelberg

13321

16 13321 ndash Reinforcement Learning

319 Colored MDPs Restless Bandits and Continuous StateReinforcement Learning

Ronald Ortner (Montan-Universitaumlt Leoben AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information toordinary MDPs Thus state-action pairs are assigned the same color when they exhibitsimilar rewards and transition probabilities This extra information can be exploited by anadaptation of the UCRL algorithm leading to regret bounds that depend on the number ofcolors instead of the size of the state-action space As applications we are able to deriveregret bounds for the restless bandit problem as well as for continuous state reinforcementlearning

320 Reinforcement Learning using Kernel-Based StochasticFactorization

Joeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

Joint work of Barreto Andreacute MS Precup D Pineau JoeumllleMain reference AM S Barreto D Precup J Pineau ldquoReinforcement Learning using Kernel-Based Stochastic

Factorizationrdquo in Proc of the 25th Annual Conf on Neural Information Processing Systems(NIPSrsquo11) pp 720ndash728 2011

URL httpbooksnipsccpapersfilesnips24NIPS2011_0496pdfURL httpwwwcsmcgillca~jpineaufilesbarreto-nips11pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques thatmake it possible to learn a decision policy from a batch of sample transitions Among themkernel-based reinforcement learning (KBRL)stands out for two reasons First unlike otherapproximation schemes KBRL always converges to a unique solution Second KBRL isconsistent in the statistical sense meaning that adding more data improves the quality ofthe resulting policy and eventually leads to optimal performance Despite its nice theoreticalproperties KBRL has not been widely adopted by the reinforcement learning communityOne possible explanation for this is that the size of the KBRL approximator grows with thenumber of sample transitions which makes the approach impractical for large problems Inthis work we introduce a novel algorithm to improve the scalability of KBRL We use aspecial decomposition of a transition matrix called stochastic factorization which allows usto fix the size of the approximator while at the same time incorporating all the informationcontained in the data We apply this technique to compress the size of KBRL-derived modelsto a fixed dimension This approach is not only advantageous because of the model-sizereduction it also allows a better bias-variance trade-off by incorporating more samples inthemodel estimate The resulting algorithm kernel-based stochastic factorization (KBSF) ismuch faster than KBRL yet still converges to a unique solution We derive a theoreticalbound on the distance between KBRLrsquos solutionand KBSFrsquos solution We show that it is alsopossible to construct the KBSF solution in a fully incremental way thus freeing the space

Peter Auer Marcus Hutter and Laurent Orseau 17

complexity of the approach from its dependence on the number of sample transitions Theincremental version of KBSF (iKBSF) is able to process an arbitrary amount of data whichresults in a model-based reinforcement learning algorithm that canbe used to solve largecontinuous MDPs in on-line regimes We present experiments on a variety of challenging RLdomains including thedouble and triple pole-balancing tasks the Helicopter domain thepenthatlon event featured in the Reinforcement Learning Competition 2013 and a model ofepileptic rat brains in which the goal is to learn a neurostimulation policy to suppress theoccurrence of seizures

321 A POMDP TutorialJoeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

URL httpwwwcsmcgillca~jpineautalksjpineau-dagstuhl13pdf

This talk presented key concepts algorithms theory and empirical results pertaining tolearning and planning in Partially Observable Markov Decision Processes (POMDPs)

322 Methods for Bellman Error Basis Function constructionDoina Precup (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learningtasks but the problem of devising a good function approximator is difficult and often solved inpractice by hand-crafting theldquorightrdquo set of features In the last decade a considerable amountof effort has been devoted to methods that can construct value function approximatorsautomatically from data Among these methods Bellman error basis functionc onstruction(BEBF) are appealing due to their theoretical guarantees and good empirical performancein difficult tasks In this talk we discuss on-going developments of methods for BEBFconstruction based on random projections (Fard Grinberg Pineau and Precup NIPS 2013)and orthogonal matching pursuit (Farahmand and Precup NIPS 2012)

323 Continual LearningMark B Ring (Anaheim Hills US)

License Creative Commons BY 30 Unported licensecopy Mark B Ring

Joint work of Ring Mark B Schaul Tom Schmidhuber JuumlrgenMain reference MB Ring T Schaul ldquoThe organization of behavior into temporal and spatial neighborhoodsrdquo in

Proc of the 2012 IEEE Intrsquol Conf on Development and Learning and Epigenetic Robotics(ICDL-EPIROBrsquo12) pp 1ndash6 IEEE 2012

URL httpdxdoiorg101109DevLrn20126400883

A continual-learning agent is one that begins with relatively little knowledge and few skillsbut then incrementally and continually builds up new skills and new knowledge based on

13321

18 13321 ndash Reinforcement Learning

what it has already learned But what should such an agent do when it inevitably runs outof resources One possible solution is to prune away less useful skills and knowledge whichis difficult if these are closely connected to each other in a network of complex dependenciesThe approach I advocate in this talk is to give the agent at the outset all the computationalresources it will ever have such that continual learning becomes the process of continuallyreallocating those fixed resources I will describe how an agentrsquos policy can be broken intomany pieces and spread out among many computational units that compete to representdifferent parts of the agentrsquos policy space These units can then be arranged across alower-dimensional manifold according to those similarities which results in many advantagesfor the agent Among these advantages are improved robustness dimensionality reductionand an organization that encourages intelligent reallocation of resources when learning newskills

324 Multi-objective Reinforcement LearningManuela Ruiz-Montiel (University of Malaga ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning a new Reinforcement Learning (RL)algorithm thatdetermines the rational behaviours of an agent in multi-objective domains Most RLtechniques focus on environments with scalar rewards However many real scenarios arebest formulated in multi-objective terms rewards are vectors and each component standsfor an objective to maximize In scalar RL the environment is formalized as a MarkovDecision Problem defined by a set S of states a set A of actions a function Psa(sprime) (thetransition probabilities) and a function Rsa(sprime) (the obtained scalar rewards) The problemis to determine apolicy π S rarr A that maximizes the discounted accumulated rewardRt = Σinfin

k=0γkrt+k+1Eg Q-learning [1] is an algorithm that learns such policy It learns

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press (http://dx.doi.org/10.1016/j.neucom.2012.10.037).
2 B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.
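
As a small, hedged illustration of the claim that relational states and actions can share objects, here is a toy sketch (the atom encoding, the AbstractAction class and the single-variable grounding routine are simplifications for exposition, not the representation used in the work cited above). The point it makes is only that the objects mentioned in a relational state directly parameterize the applicable actions, which is what enables generalization across states that contain different objects.

```python
from dataclasses import dataclass

# A relational state is a set of ground atoms, e.g. on(cup, table), red(cup).
Atom = tuple  # e.g. ("on", "cup", "table")

@dataclass(frozen=True)
class AbstractAction:
    """A parameterized action such as grasp(X); variables are bound to state objects."""
    name: str
    variables: tuple

def groundings(action, state):
    """Enumerate single-variable bindings using the objects occurring in the state."""
    objects = {arg for atom in state for arg in atom[1:]}
    return [(action.name, obj) for obj in sorted(objects)]

if __name__ == "__main__":
    state = {("on", "cup", "table"), ("red", "cup"), ("clear", "cup")}
    grasp = AbstractAction("grasp", ("X",))
    print(groundings(grasp, state))  # [('grasp', 'cup'), ('grasp', 'table')]
```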

3.31 Universal RL applications and approximations

Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license
© Joel Veness

Joint work of Veness, Joel; Hutter, Marcus
Main reference J. Veness, "Universal RL applications and approximations", 2013.
URL http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.
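
As background for the term "Universal RL" (an addition for context, not part of the talk): the central construction is a Bayesian mixture over a class of environment models, whose posterior concentrates on the models that predict the agent's history well; practical systems such as the Monte-Carlo AIXI approximation [2, 3] replace the incomputable model class with a tractable one. The sketch below shows only the generic mixture update, with a toy Bernoulli model class standing in for the real thing.

```python
import numpy as np

def mixture_update(weights, model_probs):
    """One step of a Bayesian mixture: mix, observe, reweight.

    weights: prior/posterior over the model class (non-negative, sums to 1)
    model_probs: probability each model assigned to the symbol just observed
    Returns (mixture probability of that symbol, updated posterior).
    """
    p_mix = float(np.dot(weights, model_probs))
    return p_mix, weights * model_probs / p_mix

if __name__ == "__main__":
    # Toy model class: three Bernoulli(theta) models of a binary observation stream.
    thetas = np.array([0.2, 0.5, 0.8])
    weights = np.ones(3) / 3                         # uniform prior
    rng = np.random.default_rng(0)
    stream = (rng.random(200) < 0.8).astype(int)     # data generated with theta = 0.8
    for x in stream:
        probs = np.where(x == 1, thetas, 1.0 - thetas)
        _, weights = mixture_update(weights, probs)
    print(weights)  # posterior concentrates on the theta = 0.8 model
```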


3.32 Learning and Reasoning with POMDPs in Robots

Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license
© Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
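
To make "belief state planning" concrete, here is a minimal sketch of the discrete Bayes filter that maintains the belief a POMDP planner reasons over (the array layout and toy numbers are illustrative; the methods in the talk are approximate and operate in continuous, high-dimensional spaces, whereas this shows only the exact discrete update).

```python
import numpy as np

def belief_update(belief, T, Z, action, observation):
    """Discrete Bayes filter: the belief-state update at the heart of POMDP planning.

    belief: current distribution over hidden states, shape (S,)
    T: transition model, T[a, s, s'] = P(s' | s, a)
    Z: observation model, Z[a, s', o] = P(o | s', a)
    """
    predicted = belief @ T[action]                      # prediction step
    updated = predicted * Z[action, :, observation]     # correction step
    return updated / updated.sum()

if __name__ == "__main__":
    # Two hidden states, one "listen" action, observations that are 85% reliable.
    T = np.array([np.eye(2)])                           # listening does not change the state
    Z = np.array([[[0.85, 0.15], [0.15, 0.85]]])        # Z[0, s', o]
    b = np.array([0.5, 0.5])
    for o in (0, 0, 0):                                 # three observations pointing at state 0
        b = belief_update(b, T, Z, action=0, observation=o)
    print(b)  # belief concentrates on hidden state 0
```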

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvari, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL


Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon
  Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowe: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 16: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

16 13321 ndash Reinforcement Learning

319 Colored MDPs Restless Bandits and Continuous StateReinforcement Learning

Ronald Ortner (Montan-Universitaumlt Leoben AT)

License Creative Commons BY 30 Unported licensecopy Ronald Ortner

Joint work of Ortner Ronald Ryabko Daniil Auer Peter Munos ReacutemiMain reference R Ortner D Ryabko P Auer R Munos ldquoRegret Bounds for Restless Markov Banditsrdquo in Proc

of the 23rd Intrsquol Conf on Algorithmic Learning Theory (ALTrsquo12) LNCS Vol 7568 pp 214ndash228Springer 2012

URL httpdxdoiorg101007978-3-642-34106-9_19

We introduce the notion of colored MDPs that allows to add structural information toordinary MDPs Thus state-action pairs are assigned the same color when they exhibitsimilar rewards and transition probabilities This extra information can be exploited by anadaptation of the UCRL algorithm leading to regret bounds that depend on the number ofcolors instead of the size of the state-action space As applications we are able to deriveregret bounds for the restless bandit problem as well as for continuous state reinforcementlearning

320 Reinforcement Learning using Kernel-Based StochasticFactorization

Joeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

Joint work of Barreto Andreacute MS Precup D Pineau JoeumllleMain reference AM S Barreto D Precup J Pineau ldquoReinforcement Learning using Kernel-Based Stochastic

Factorizationrdquo in Proc of the 25th Annual Conf on Neural Information Processing Systems(NIPSrsquo11) pp 720ndash728 2011

URL httpbooksnipsccpapersfilesnips24NIPS2011_0496pdfURL httpwwwcsmcgillca~jpineaufilesbarreto-nips11pdf

Recent years have witnessed the emergence of several reinforcement-learning techniques thatmake it possible to learn a decision policy from a batch of sample transitions Among themkernel-based reinforcement learning (KBRL)stands out for two reasons First unlike otherapproximation schemes KBRL always converges to a unique solution Second KBRL isconsistent in the statistical sense meaning that adding more data improves the quality ofthe resulting policy and eventually leads to optimal performance Despite its nice theoreticalproperties KBRL has not been widely adopted by the reinforcement learning communityOne possible explanation for this is that the size of the KBRL approximator grows with thenumber of sample transitions which makes the approach impractical for large problems Inthis work we introduce a novel algorithm to improve the scalability of KBRL We use aspecial decomposition of a transition matrix called stochastic factorization which allows usto fix the size of the approximator while at the same time incorporating all the informationcontained in the data We apply this technique to compress the size of KBRL-derived modelsto a fixed dimension This approach is not only advantageous because of the model-sizereduction it also allows a better bias-variance trade-off by incorporating more samples inthemodel estimate The resulting algorithm kernel-based stochastic factorization (KBSF) ismuch faster than KBRL yet still converges to a unique solution We derive a theoreticalbound on the distance between KBRLrsquos solutionand KBSFrsquos solution We show that it is alsopossible to construct the KBSF solution in a fully incremental way thus freeing the space

Peter Auer Marcus Hutter and Laurent Orseau 17

complexity of the approach from its dependence on the number of sample transitions Theincremental version of KBSF (iKBSF) is able to process an arbitrary amount of data whichresults in a model-based reinforcement learning algorithm that canbe used to solve largecontinuous MDPs in on-line regimes We present experiments on a variety of challenging RLdomains including thedouble and triple pole-balancing tasks the Helicopter domain thepenthatlon event featured in the Reinforcement Learning Competition 2013 and a model ofepileptic rat brains in which the goal is to learn a neurostimulation policy to suppress theoccurrence of seizures

321 A POMDP TutorialJoeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

URL httpwwwcsmcgillca~jpineautalksjpineau-dagstuhl13pdf

This talk presented key concepts algorithms theory and empirical results pertaining tolearning and planning in Partially Observable Markov Decision Processes (POMDPs)

322 Methods for Bellman Error Basis Function constructionDoina Precup (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learningtasks but the problem of devising a good function approximator is difficult and often solved inpractice by hand-crafting theldquorightrdquo set of features In the last decade a considerable amountof effort has been devoted to methods that can construct value function approximatorsautomatically from data Among these methods Bellman error basis functionc onstruction(BEBF) are appealing due to their theoretical guarantees and good empirical performancein difficult tasks In this talk we discuss on-going developments of methods for BEBFconstruction based on random projections (Fard Grinberg Pineau and Precup NIPS 2013)and orthogonal matching pursuit (Farahmand and Precup NIPS 2012)

323 Continual LearningMark B Ring (Anaheim Hills US)

License Creative Commons BY 30 Unported licensecopy Mark B Ring

Joint work of Ring Mark B Schaul Tom Schmidhuber JuumlrgenMain reference MB Ring T Schaul ldquoThe organization of behavior into temporal and spatial neighborhoodsrdquo in

Proc of the 2012 IEEE Intrsquol Conf on Development and Learning and Epigenetic Robotics(ICDL-EPIROBrsquo12) pp 1ndash6 IEEE 2012

URL httpdxdoiorg101109DevLrn20126400883

A continual-learning agent is one that begins with relatively little knowledge and few skillsbut then incrementally and continually builds up new skills and new knowledge based on

13321

18 13321 ndash Reinforcement Learning

what it has already learned But what should such an agent do when it inevitably runs outof resources One possible solution is to prune away less useful skills and knowledge whichis difficult if these are closely connected to each other in a network of complex dependenciesThe approach I advocate in this talk is to give the agent at the outset all the computationalresources it will ever have such that continual learning becomes the process of continuallyreallocating those fixed resources I will describe how an agentrsquos policy can be broken intomany pieces and spread out among many computational units that compete to representdifferent parts of the agentrsquos policy space These units can then be arranged across alower-dimensional manifold according to those similarities which results in many advantagesfor the agent Among these advantages are improved robustness dimensionality reductionand an organization that encourages intelligent reallocation of resources when learning newskills

324 Multi-objective Reinforcement LearningManuela Ruiz-Montiel (University of Malaga ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning a new Reinforcement Learning (RL)algorithm thatdetermines the rational behaviours of an agent in multi-objective domains Most RLtechniques focus on environments with scalar rewards However many real scenarios arebest formulated in multi-objective terms rewards are vectors and each component standsfor an objective to maximize In scalar RL the environment is formalized as a MarkovDecision Problem defined by a set S of states a set A of actions a function Psa(sprime) (thetransition probabilities) and a function Rsa(sprime) (the obtained scalar rewards) The problemis to determine apolicy π S rarr A that maximizes the discounted accumulated rewardRt = Σinfin

k=0γkrt+k+1Eg Q-learning [1] is an algorithm that learns such policy It learns

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

MondayMorning

Self-introduction of participantsPeter Sunehag Tutorial on explorationexploitation

AfternoonTom Dietterich Solving Simulator-Defined MDPs for Natural Resource ManagementCsaba Szepesvari Gergely Neu Online learning in Markov decision processesRonald Ortner Continuous RL and restless banditsMartijn van Otterlo Relations Between Reinforcement Learning Visual InputPerception and Action

TuesdayMorning

M Ghavamzadeh and A Lazaric Tutorial on Statistical Learning Theory in RLand Approximate Dynamic ProgrammingGerard Neumann Hierarchical Learning of Motor Skills with Information-TheoreticPolicy SearchDoina Precup Methods for Bellman Error Basis Function constructionJoeumllle Pineau Reinforcement Learning using Kernel-Based Stochastic Factorization

AfternoonPetar Kormushev Reinforcement Learning with Heterogeneous Policy Representa-tionsMark B Ring Continual learningRich Sutton The quest for the ultimate TD algorithmJan H Metzen Learning Skill Templates for Parameterized TasksRobert Busa-Fekete Preference-based Evolutionary Direct Policy SearchOrhan Soumlnmez Sequentially Interacting Markov Chain Monte Carlo Based PolicyIteration

WednesdayMorning

Marcus Hutter Tutorial on Universal RLJoel Veness Universal RL ndash Applications and ApproximationsLaurent Orseau Optimal Universal Explorative AgentsTor Lattimore Bayesian Reinforcement Learning + ExplorationLaurent Orseau More realistic assumptions for RL

Peter Auer Marcus Hutter and Laurent Orseau 25

AfternoonScott Sanner Tutorial on Symbolic Dynamic ProgrammingRico Jonschkowski Representation Learning for Reinforcement Learning in Robot-icsTimothy Mann Theoretical Analysis of Planning with OptionsMohammad Ghavamzadeh SPSA based Actor-Critic Algorithm for Risk SensitiveControlDavid Silver Deterministic Policy Gradients

ThursdayMorning

Joeumllle Pineau Tutorial on POMDPsJeremy Wyatt Learning and Reasoning with POMDPs in RobotsScott Sanner Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPsLutz Frommberger Some thoughts on Transfer Learning in RL ndash On States andRepresentationWill Uther Tree-based MDP ndash PSRNils Siebel Neuro-Evolution

Afternoon HikingFridayMorning

Christos Dimitrakakis ABC and cover-tree reinforcement learningManuela Ruiz-Montiel Multi-objective Reinforcement LearningAnn Nowe Multi-Objective Reinforcement LearningRico Jonschkowski Temporal Abstraction in Reinforcement Learning with Proxim-ity StatisticsChristos Dimitrakakis RL competition

13321

26 13321 ndash Reinforcement Learning

Participants

Peter AuerMontan-Universitaumlt Leoben AT

Manuel BlumAlbert-Ludwigs-UniversitaumltFreiburg DE

Robert Busa-FeketeUniversitaumlt Marburg DE

Yann ChevaleyreUniversity of Paris North FR

Marc DeisenrothTU Darmstadt DE

Thomas G DietterichOregon State University US

Christos DimitrakakisEPFL ndash Lausanne CH

Lutz FrommbergerUniversitaumlt Bremen DE

Jens GarstkaFernUniversitaumlt in Hagen DE

Mohammad GhavamzadehINRIA Nord Europe ndash Lille FR

Marcus HutterAustralian National Univ AU

Rico JonschkowskiTU Berlin DE

Petar KormushevItalian Institute of Technology ndashGenova IT

Tor LattimoreAustralian National Univ AU

Alessandro LazaricINRIA Nord Europe ndash Lille FR

Timothy MannTechnion ndash Haifa IL

Jan Hendrik MetzenUniversitaumlt Bremen DE

Gergely NeuBudapest University ofTechnology amp Economics HU

Gerhard NeumannTU Darmstadt DE

Ann NoweacuteFree University of Brussels BE

Laurent OrseauAgroParisTech ndash Paris FR

Ronald OrtnerMontan-Universitaumlt Leoben AT

Joeumllle PineauMcGill Univ ndash Montreal CA

Doina PrecupMcGill Univ ndash Montreal CA

Mark B RingAnaheim Hills US

Manuela Ruiz-MontielUniversity of Malaga ES

Scott SannerNICTA ndash Canberra AU

Nils T SiebelHochschule fuumlr Technik undWirtschaft ndash Berlin DE

David SilverUniversity College London GB

Orhan SoumlnmezBogaziccedili Univ ndash Istanbul TR

Peter SunehagAustralian National Univ AU

Richard S SuttonUniversity of Alberta CA

Csaba SzepesvaacuteriUniversity of Alberta CA

William UtherGoogle ndash Sydney AU

Martijn van OtterloRadboud Univ Nijmegen NL

Joel VenessUniversity of Alberta CA

Jeremy L WyattUniversity of Birmingham GB

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joeumllle Pineau
    • A POMDP Tutorial Joeumllle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo BasedPolicy Iteration Orhan Soumlnmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD() Richard S Sutton
    • Relations between Reinforcement Learning Visual Input Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants
Page 17: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

Peter Auer Marcus Hutter and Laurent Orseau 17

complexity of the approach from its dependence on the number of sample transitions Theincremental version of KBSF (iKBSF) is able to process an arbitrary amount of data whichresults in a model-based reinforcement learning algorithm that canbe used to solve largecontinuous MDPs in on-line regimes We present experiments on a variety of challenging RLdomains including thedouble and triple pole-balancing tasks the Helicopter domain thepenthatlon event featured in the Reinforcement Learning Competition 2013 and a model ofepileptic rat brains in which the goal is to learn a neurostimulation policy to suppress theoccurrence of seizures

321 A POMDP TutorialJoeumllle Pineau (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Joeumllle Pineau

URL httpwwwcsmcgillca~jpineautalksjpineau-dagstuhl13pdf

This talk presented key concepts algorithms theory and empirical results pertaining tolearning and planning in Partially Observable Markov Decision Processes (POMDPs)

322 Methods for Bellman Error Basis Function constructionDoina Precup (McGill University ndash Montreal CA)

License Creative Commons BY 30 Unported licensecopy Doina Precup

Function approximation is crucial for obtaining good results in large reinforcement learningtasks but the problem of devising a good function approximator is difficult and often solved inpractice by hand-crafting theldquorightrdquo set of features In the last decade a considerable amountof effort has been devoted to methods that can construct value function approximatorsautomatically from data Among these methods Bellman error basis functionc onstruction(BEBF) are appealing due to their theoretical guarantees and good empirical performancein difficult tasks In this talk we discuss on-going developments of methods for BEBFconstruction based on random projections (Fard Grinberg Pineau and Precup NIPS 2013)and orthogonal matching pursuit (Farahmand and Precup NIPS 2012)

323 Continual LearningMark B Ring (Anaheim Hills US)

License Creative Commons BY 30 Unported licensecopy Mark B Ring

Joint work of Ring Mark B Schaul Tom Schmidhuber JuumlrgenMain reference MB Ring T Schaul ldquoThe organization of behavior into temporal and spatial neighborhoodsrdquo in

Proc of the 2012 IEEE Intrsquol Conf on Development and Learning and Epigenetic Robotics(ICDL-EPIROBrsquo12) pp 1ndash6 IEEE 2012

URL httpdxdoiorg101109DevLrn20126400883

A continual-learning agent is one that begins with relatively little knowledge and few skillsbut then incrementally and continually builds up new skills and new knowledge based on

13321

18 13321 ndash Reinforcement Learning

what it has already learned But what should such an agent do when it inevitably runs outof resources One possible solution is to prune away less useful skills and knowledge whichis difficult if these are closely connected to each other in a network of complex dependenciesThe approach I advocate in this talk is to give the agent at the outset all the computationalresources it will ever have such that continual learning becomes the process of continuallyreallocating those fixed resources I will describe how an agentrsquos policy can be broken intomany pieces and spread out among many computational units that compete to representdifferent parts of the agentrsquos policy space These units can then be arranged across alower-dimensional manifold according to those similarities which results in many advantagesfor the agent Among these advantages are improved robustness dimensionality reductionand an organization that encourages intelligent reallocation of resources when learning newskills

324 Multi-objective Reinforcement LearningManuela Ruiz-Montiel (University of Malaga ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning a new Reinforcement Learning (RL)algorithm thatdetermines the rational behaviours of an agent in multi-objective domains Most RLtechniques focus on environments with scalar rewards However many real scenarios arebest formulated in multi-objective terms rewards are vectors and each component standsfor an objective to maximize In scalar RL the environment is formalized as a MarkovDecision Problem defined by a set S of states a set A of actions a function Psa(sprime) (thetransition probabilities) and a function Rsa(sprime) (the obtained scalar rewards) The problemis to determine apolicy π S rarr A that maximizes the discounted accumulated rewardRt = Σinfin

k=0γkrt+k+1Eg Q-learning [1] is an algorithm that learns such policy It learns

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

3.29 The Quest for the Ultimate TD(λ)
Richard S. Sutton (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Richard S. Sutton

Joint work of: Maei, H. R.; Sutton, R. S.
Main reference: H. R. Maei, R. S. Sutton, "GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces", in Proc. of the 3rd Conf. on Artificial General Intelligence (AGI'10), Advances in Intelligent Systems Research Series, 6 pp., 2010.

URL: http://dx.doi.org/10.2991/agi.2010.22
URL: http://webdocs.cs.ualberta.ca/~sutton/papers/maei-sutton-10.pdf

TD(λ) is a computationally simple, model-free algorithm for learning to predict long-term consequences. It has been used to learn value functions, to form multi-scale models of the world, and to compile planning into policies for immediate action. It is a natural core algorithm for artificial intelligence based on reinforcement learning. Before realizing its full potential, however, TD(λ) needs to be generalized in several ways: to off-policy learning, as has already been partially done; to maximally general parameterization, as has also been partially done; and to off-policy eligibility traces, which was previously thought impossible, but now perhaps we can see how this too can be done. In all these ways we see a glimmer of a perfected and complete algorithm: something inspired by TD(λ), with all its positive computational and algorithmic features, and yet more general, flexible, and powerful. Seeking this perfected algorithm is the quest for the ultimate TD(λ); this talk is a status report on the goals for it and the prospects for achieving them.
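For readers unfamiliar with the base algorithm, the following is a minimal sketch of tabular, on-policy TD(λ) prediction with accumulating eligibility traces. It is only the starting point, not the generalized off-policy, gradient-based algorithms (such as GQ(λ)) discussed in the talk; the toy chain domain is an illustrative assumption.

import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.95, lam=0.9):
    # episodes: iterable of [(state, reward, next_state or None), ...] transitions.
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                    # eligibility traces
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V[s_next]
            delta = r + gamma * v_next - V[s]     # TD error
            e[s] += 1.0                           # accumulating trace
            V += alpha * delta * e                # update all eligible states
            e *= gamma * lam                      # decay traces
    return V

# Tiny example: a 3-state chain 0 -> 1 -> 2 (terminal), reward 1 on the last step.
episodes = [[(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]] * 200
print(td_lambda(episodes, n_states=3))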

References
1 H. R. Maei, R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, Lugano, Switzerland, 2010.

3.30 Relations between Reinforcement Learning, Visual Input, Perception and Action
Martijn van Otterlo (Radboud University Nijmegen, NL)

License: Creative Commons BY 3.0 Unported license © Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state, action, state, action sequences in order to optimize action selection for each state (w.r.t. a reward criterion). The field of RL has come up with many algorithms based on abstraction and generalization [7]. In my view, general RL amounts to feedback-based interactive experimentation with particular abstraction levels for states, actions and tasks. However, despite all previous efforts, direct couplings of RL with complex visual input (e.g. raw images) are still rare. In roughly the recent decade, RL has been combined with so-called relational knowledge representation for states and languages [5, 6]. Also, many forms of decision-theoretic planning using abstract or relational versions of Bellman equations can employ powerful knowledge representation schemes [4]. An interesting development is that also in the computer vision community people wish to employ similar relational generalization over visual input, due to advances in (probabilistic) logical learning (e.g. see our recent work on the interpretation of houses from images [1] and robotics [3, 2]). My talk is about possibilities for relational integration of both relational action and vision. The real potential of relational representations is that states can share information with actions (e.g. parameters, or more specifically objects). Possibilities exist to define novel languages for interactive experimentation with relational abstraction levels in the context of both complex visual input and complex behavioral output. This includes new types of interactions – for example dealing with scarce human feedback, new types of experimentation – for example incorporating visual feedback and physical manipulation, and new types of abstraction levels – such as probabilistic programming languages. I will present first steps towards a tighter integration of relational vision and relational action for interactive learning settings.² In addition, I present several new problem domains.

2 See also the IJCAI Workshop on Machine Learning for Interactive Systems (http://mlis-workshop.org/2013).


References
1 L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2 B. Moldovan, P. Moreno and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3 B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4 M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5 M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6 M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7 M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness

Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013.

URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger scale applications.

References
1 M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2 J. Veness, K. S. Ng, M. Hutter and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3 J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.


3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood, if intractable, problem. In this talk I show various ways that approximate methods for belief-state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
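Belief-state planning of the kind referred to above builds on the Bayesian belief update. A minimal discrete Bayes-filter sketch on a toy two-state problem is given below; it is illustrative only, with made-up transition and observation models, and is not code from the talk.

import numpy as np

# T[a][s, s'] : transition model, Z[s', o] : observation model (toy numbers)
T = {"stay": np.array([[0.9, 0.1], [0.1, 0.9]])}
Z = np.array([[0.8, 0.2],      # P(o | s' = 0)
              [0.3, 0.7]])     # P(o | s' = 1)

def belief_update(b, a, o):
    # b(s') proportional to P(o | s') * sum_s P(s' | s, a) b(s)
    predicted = T[a].T @ b          # prediction step
    updated = Z[:, o] * predicted   # correction step
    return updated / updated.sum()

b = np.array([0.5, 0.5])
for obs in [0, 0, 1]:
    b = belief_update(b, "stay", obs)
    print(b)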

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvari, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL


Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon: Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB

Page 18: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

18 13321 ndash Reinforcement Learning

what it has already learned But what should such an agent do when it inevitably runs outof resources One possible solution is to prune away less useful skills and knowledge whichis difficult if these are closely connected to each other in a network of complex dependenciesThe approach I advocate in this talk is to give the agent at the outset all the computationalresources it will ever have such that continual learning becomes the process of continuallyreallocating those fixed resources I will describe how an agentrsquos policy can be broken intomany pieces and spread out among many computational units that compete to representdifferent parts of the agentrsquos policy space These units can then be arranged across alower-dimensional manifold according to those similarities which results in many advantagesfor the agent Among these advantages are improved robustness dimensionality reductionand an organization that encourages intelligent reallocation of resources when learning newskills

324 Multi-objective Reinforcement LearningManuela Ruiz-Montiel (University of Malaga ES)

License Creative Commons BY 30 Unported licensecopy Manuela Ruiz-Montiel

Joint work of Ruiz-Montiel Manuela Mandow Lawrence Peacuterez-de-la-Cruz Joseacute-LuisMain reference M Riuz-Montiel L Mandow J L Peacuterez-de-la-Cruz ldquoPQ-Learning Aprendizaje por Refuerzo

Multiobjetivordquo in Proc of the XV Conference of the Spanish Association for Artificial Intelligence(CAEPIArsquo13) pp 139ndash148 2013

URL httpwwwcongresocediesimagessiteactasActasCAEPIApdf

In this talk we present PQ-learning a new Reinforcement Learning (RL)algorithm thatdetermines the rational behaviours of an agent in multi-objective domains Most RLtechniques focus on environments with scalar rewards However many real scenarios arebest formulated in multi-objective terms rewards are vectors and each component standsfor an objective to maximize In scalar RL the environment is formalized as a MarkovDecision Problem defined by a set S of states a set A of actions a function Psa(sprime) (thetransition probabilities) and a function Rsa(sprime) (the obtained scalar rewards) The problemis to determine apolicy π S rarr A that maximizes the discounted accumulated rewardRt = Σinfin

k=0γkrt+k+1Eg Q-learning [1] is an algorithm that learns such policy It learns

the scalar values Q(s a) S times A rarr R that represent the expected accumulated rewardwhen following a given policy after taking a in s The selected action a in each state isgiven by the expression argmaxaQ(s a) In the multi-objective case the rewards are vectorsminusrarrr isin Rn so different accumulated rewards cannot be totally ordered minusrarrv dominates minusrarrwwhen existi vi gt wi and j vj lt wj Given a set of vectors those that are not dominated byany other vector are said to lie in the Pareto front We seek the set of policies that yieldnon-dominated accumulated reward vectors The literature on multi-objective RL (MORL)is relatively scarce (see Vamplew et al [2]) Most methods use preferences (lexicographicordering or scalarization) allowing a total ordering of the value vectors and approximatethe front by running a scalar RL method several times with different preferences Whendealing with non-convex fronts only a subset of the solutions is approximated Somemulti-objective dynamic programming (MODP) methods calculate all the policies at onceassuming a perfect knowledge of Psa(sprime) and Rsa(sprime) We deal with the problem of efficientlyapproximating all the optimal policies at once without sacrificing solutions nor assuminga perfect knowledge of the model As far as we know our algorithm is the first to bringthese featurestogether As we aim to learn a set of policies at once Q-learning is apromising

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

MondayMorning

Self-introduction of participantsPeter Sunehag Tutorial on explorationexploitation

AfternoonTom Dietterich Solving Simulator-Defined MDPs for Natural Resource ManagementCsaba Szepesvari Gergely Neu Online learning in Markov decision processesRonald Ortner Continuous RL and restless banditsMartijn van Otterlo Relations Between Reinforcement Learning Visual InputPerception and Action

TuesdayMorning

M Ghavamzadeh and A Lazaric Tutorial on Statistical Learning Theory in RLand Approximate Dynamic ProgrammingGerard Neumann Hierarchical Learning of Motor Skills with Information-TheoreticPolicy SearchDoina Precup Methods for Bellman Error Basis Function constructionJoeumllle Pineau Reinforcement Learning using Kernel-Based Stochastic Factorization

AfternoonPetar Kormushev Reinforcement Learning with Heterogeneous Policy Representa-tionsMark B Ring Continual learningRich Sutton The quest for the ultimate TD algorithmJan H Metzen Learning Skill Templates for Parameterized TasksRobert Busa-Fekete Preference-based Evolutionary Direct Policy SearchOrhan Soumlnmez Sequentially Interacting Markov Chain Monte Carlo Based PolicyIteration

WednesdayMorning

Marcus Hutter Tutorial on Universal RLJoel Veness Universal RL ndash Applications and ApproximationsLaurent Orseau Optimal Universal Explorative AgentsTor Lattimore Bayesian Reinforcement Learning + ExplorationLaurent Orseau More realistic assumptions for RL

Peter Auer Marcus Hutter and Laurent Orseau 25

AfternoonScott Sanner Tutorial on Symbolic Dynamic ProgrammingRico Jonschkowski Representation Learning for Reinforcement Learning in Robot-icsTimothy Mann Theoretical Analysis of Planning with OptionsMohammad Ghavamzadeh SPSA based Actor-Critic Algorithm for Risk SensitiveControlDavid Silver Deterministic Policy Gradients

ThursdayMorning

Joeumllle Pineau Tutorial on POMDPsJeremy Wyatt Learning and Reasoning with POMDPs in RobotsScott Sanner Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPsLutz Frommberger Some thoughts on Transfer Learning in RL ndash On States andRepresentationWill Uther Tree-based MDP ndash PSRNils Siebel Neuro-Evolution

Afternoon HikingFridayMorning

Christos Dimitrakakis ABC and cover-tree reinforcement learningManuela Ruiz-Montiel Multi-objective Reinforcement LearningAnn Nowe Multi-Objective Reinforcement LearningRico Jonschkowski Temporal Abstraction in Reinforcement Learning with Proxim-ity StatisticsChristos Dimitrakakis RL competition

13321

26 13321 ndash Reinforcement Learning

Participants

Peter AuerMontan-Universitaumlt Leoben AT

Manuel BlumAlbert-Ludwigs-UniversitaumltFreiburg DE

Robert Busa-FeketeUniversitaumlt Marburg DE

Yann ChevaleyreUniversity of Paris North FR

Marc DeisenrothTU Darmstadt DE

Thomas G DietterichOregon State University US

Christos DimitrakakisEPFL ndash Lausanne CH

Lutz FrommbergerUniversitaumlt Bremen DE

Jens GarstkaFernUniversitaumlt in Hagen DE

Mohammad GhavamzadehINRIA Nord Europe ndash Lille FR

Marcus HutterAustralian National Univ AU

Rico JonschkowskiTU Berlin DE

Petar KormushevItalian Institute of Technology ndashGenova IT

Tor LattimoreAustralian National Univ AU

Alessandro LazaricINRIA Nord Europe ndash Lille FR

Timothy MannTechnion ndash Haifa IL

Jan Hendrik MetzenUniversitaumlt Bremen DE

Gergely NeuBudapest University ofTechnology amp Economics HU

Gerhard NeumannTU Darmstadt DE

Ann NoweacuteFree University of Brussels BE

Laurent OrseauAgroParisTech ndash Paris FR

Ronald OrtnerMontan-Universitaumlt Leoben AT

Joeumllle PineauMcGill Univ ndash Montreal CA

Doina PrecupMcGill Univ ndash Montreal CA

Mark B RingAnaheim Hills US

Manuela Ruiz-MontielUniversity of Malaga ES

Scott SannerNICTA ndash Canberra AU

Nils T SiebelHochschule fuumlr Technik undWirtschaft ndash Berlin DE

David SilverUniversity College London GB

Orhan SoumlnmezBogaziccedili Univ ndash Istanbul TR

Peter SunehagAustralian National Univ AU

Richard S SuttonUniversity of Alberta CA

Csaba SzepesvaacuteriUniversity of Alberta CA

William UtherGoogle ndash Sydney AU

Martijn van OtterloRadboud Univ Nijmegen NL

Joel VenessUniversity of Alberta CA

Jeremy L WyattUniversity of Birmingham GB

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joeumllle Pineau
    • A POMDP Tutorial Joeumllle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo BasedPolicy Iteration Orhan Soumlnmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD() Richard S Sutton
    • Relations between Reinforcement Learning Visual Input Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants
Page 19: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

Peter Auer Marcus Hutter and Laurent Orseau 19

starting point since the policy used to interact with theenvironment is not the same thatis learned At each step Q-learning shifts the previous estimated Q-value towards its newestimation Q(s a)larr (1minusα)Q(s a) +α(r+ γmaxaprimeQ(sprime aprime)) In PQ-learning Q-values aresets of vectors so the max operator is replaced by ND(

⋃aprime Q(sprime aprime))) where ND calculates

the Pareto front A naive approach to perform the involved set addition is a pairwisesummation (imported from MODP methods) but it leads to an uncontrolled growth of thesets and the algorithm becomes impractical as it sums vectors that correspond to differentaction sequences The results of these mixed sums are useless when learning deterministicpolicies because two sequences cannot be followed at once We propose a controlled setaddition that only sums those pairs of vectors that correspond to useful action sequencesThis is done by associating each vector minusrarrq with two data structures with information aboutthe vectors that (1) have been updated by minusrarrq and (2) have contributed to its value In thistalk we describe in detail the application ofPQ-learning to a simple example and the resultsthat the algorithm yields when applied to two problems of a benchmark [2] It approximatesall the policies in the true Pareto front as opposed to the naive approach that produceshuge fronts with useless values that dramatically slow down the process1

References1 CJ Watkins Learning From DelayedRewards PhD Thesis University of Cambridge 19892 P Vamplew et al Empirical EvaluationMethods For Multiobjective Reinforcement Learn-

ing in Machine Learning84(1-2) pp 51-80 2011

325 Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPs

Scott Sanner (NICTA ndash Canberra AU)

License Creative Commons BY 30 Unported licensecopy Scott Sanner

Joint work of Sanner Scott Zamani Zahra

Many real-world decision-theoretic planning problems are naturallymodeled using mixeddiscrete and continuous state action and observation spaces yet little work has providedexact methodsfor performing exact dynamic programming backups in such problems Thisoverview talk will survey a number of recent developments in the exact and approximatesolution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique ofsymbolic dynamic programming (SDP) as covered in recent work by the authors [1 2 3 4]

References1 S Sanner K V Delgado and L Nunes de Barros Symbolic dynamic programming

for discrete and continuous state MDPs In In Proc of the 27th Conf on Uncertainty inArtificial Intelligence (UAI-11) Barcelona Spain 2011

2 Z Zamani S Sanner K V Delgado and L Nunes de Barros Robust optimization forhybrid mdps with state-dependent noise In Proc of the 23rd International Joint Conf onArtificial Intelligence (IJCAI-13) Beijing China 2013

1 This work is partially funded by grant TIN2009-14179 (SpanishGovernment Plan Nacional de I+D+i)and Universidad de Maacutelaga Campusde Excelencia Internacional Andaluciacutea Tech Manuela Ruiz-Montielis funded by the Spanish Ministry of Education through the National FPU Program

13321

20 13321 ndash Reinforcement Learning

3 Z Zamani S Sanner and C Fang Symbolic dynamic programming for continuous stateand action mdpsIn In Proc of the 26th AAAI Conf on Artificial Intelligence (AAAI-12)Toronto Canada 2012

4 Z Zamani S Sanner P Poupart and K Kersting Symbolic dynamic programming forcontinuous state and observation pomdpsIn In Proc of the 26th Annual Conf on Advancesin Neural Information Processing Systems (NIPS-12) Lake Tahoe Nevada 2012

326 Deterministic Policy GradientsDavid Silver (University College ndash London GB)

License Creative Commons BY 30 Unported licensecopy David Silver

Joint work of David Silver

In this talk we consider deterministic policy gradient algorithms for reinforcement learningwith continuous actions The deterministic policy gradient has a particularly appealingform it is the expected gradient of the action-value function This simple form means thatthe deterministic policy gradient can be estimated much more efficiently than the usualstochastic policy gradient To ensure adequate exploration we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policyWe demonstrate that deterministic policy gradient algorithms can significantly outperformtheir stochastic counterparts in high-dimensional action spaces

327 Sequentially Interacting Markov Chain Monte Carlo BasedPolicyIteration

Orhan Soumlnmez (Boğaziccedili University ndash Istanbul TR)

License Creative Commons BY 30 Unported licensecopy Orhan Soumlnmez

Joint work of Soumlnmez Orhan Cemgil A Taylan

In this ongoing research we introduce a policy iteration method where policies are evaluatedusing sequentially interacting Markov chain Monte Carlo (SIMCMC) [1] for planning indiscrete time continuous state space Markov decision processes (MDPs)In order to do so weutilize the expectation-maximization algorithm derived for solving MDPs [2] and employ aSIMCMC sampling scheme in its intractable expectation step Fortunately the maximizationstep has a closed form solution due to Markov properties Meanwhile we approximate thepolicy as a function over the continuous state space using Gaussian processes [3] Hence inthe maximization step we simply select the state-action pairs of the trajectories sampled bySIMCMC as the support of the Gaussian process approximator We are aware that SIMCMCmethods are not the best choice with respect to sample efficiency compared to sequentialMonte Carlo samplers (SMCS)[4] However they are more appropriate for online settingsdue to their estimation at any time property which SMCSs lack As a future work we areinvestigating different approaches to develop an online reinforcement learning algorithmbased on SIMCMC policy evaluation As a model based approach the dynamics of the modelwould be approximated with Gaussian processes [5]

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

Monday
Morning
  Self-introduction of participants
  Peter Sunehag: Tutorial on exploration/exploitation
Afternoon
  Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
  Csaba Szepesvari, Gergely Neu: Online learning in Markov decision processes
  Ronald Ortner: Continuous RL and restless bandits
  Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
Morning
  M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
  Gerard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
  Doina Precup: Methods for Bellman Error Basis Function construction
  Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
Afternoon
  Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
  Mark B. Ring: Continual learning
  Rich Sutton: The quest for the ultimate TD algorithm
  Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
  Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
  Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
Morning
  Marcus Hutter: Tutorial on Universal RL
  Joel Veness: Universal RL – Applications and Approximations
  Laurent Orseau: Optimal Universal Explorative Agents
  Tor Lattimore: Bayesian Reinforcement Learning + Exploration
  Laurent Orseau: More realistic assumptions for RL
Afternoon
  Scott Sanner: Tutorial on Symbolic Dynamic Programming
  Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
  Timothy Mann: Theoretical Analysis of Planning with Options
  Mohammad Ghavamzadeh: SPSA based Actor-Critic Algorithm for Risk Sensitive Control
  David Silver: Deterministic Policy Gradients

Thursday
Morning
  Joëlle Pineau: Tutorial on POMDPs
  Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
  Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
  Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
  Will Uther: Tree-based MDP – PSR
  Nils Siebel: Neuro-Evolution
Afternoon
  Hiking

Friday
Morning
  Christos Dimitrakakis: ABC and cover-tree reinforcement learning
  Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
  Ann Nowé: Multi-Objective Reinforcement Learning
  Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
  Christos Dimitrakakis: RL competition


Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB


Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

MondayMorning

Self-introduction of participantsPeter Sunehag Tutorial on explorationexploitation

AfternoonTom Dietterich Solving Simulator-Defined MDPs for Natural Resource ManagementCsaba Szepesvari Gergely Neu Online learning in Markov decision processesRonald Ortner Continuous RL and restless banditsMartijn van Otterlo Relations Between Reinforcement Learning Visual InputPerception and Action

TuesdayMorning

M Ghavamzadeh and A Lazaric Tutorial on Statistical Learning Theory in RLand Approximate Dynamic ProgrammingGerard Neumann Hierarchical Learning of Motor Skills with Information-TheoreticPolicy SearchDoina Precup Methods for Bellman Error Basis Function constructionJoeumllle Pineau Reinforcement Learning using Kernel-Based Stochastic Factorization

AfternoonPetar Kormushev Reinforcement Learning with Heterogeneous Policy Representa-tionsMark B Ring Continual learningRich Sutton The quest for the ultimate TD algorithmJan H Metzen Learning Skill Templates for Parameterized TasksRobert Busa-Fekete Preference-based Evolutionary Direct Policy SearchOrhan Soumlnmez Sequentially Interacting Markov Chain Monte Carlo Based PolicyIteration

WednesdayMorning

Marcus Hutter Tutorial on Universal RLJoel Veness Universal RL ndash Applications and ApproximationsLaurent Orseau Optimal Universal Explorative AgentsTor Lattimore Bayesian Reinforcement Learning + ExplorationLaurent Orseau More realistic assumptions for RL

Peter Auer Marcus Hutter and Laurent Orseau 25

AfternoonScott Sanner Tutorial on Symbolic Dynamic ProgrammingRico Jonschkowski Representation Learning for Reinforcement Learning in Robot-icsTimothy Mann Theoretical Analysis of Planning with OptionsMohammad Ghavamzadeh SPSA based Actor-Critic Algorithm for Risk SensitiveControlDavid Silver Deterministic Policy Gradients

ThursdayMorning

Joeumllle Pineau Tutorial on POMDPsJeremy Wyatt Learning and Reasoning with POMDPs in RobotsScott Sanner Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPsLutz Frommberger Some thoughts on Transfer Learning in RL ndash On States andRepresentationWill Uther Tree-based MDP ndash PSRNils Siebel Neuro-Evolution

Afternoon HikingFridayMorning

Christos Dimitrakakis ABC and cover-tree reinforcement learningManuela Ruiz-Montiel Multi-objective Reinforcement LearningAnn Nowe Multi-Objective Reinforcement LearningRico Jonschkowski Temporal Abstraction in Reinforcement Learning with Proxim-ity StatisticsChristos Dimitrakakis RL competition

13321

26 13321 ndash Reinforcement Learning

Participants

Peter AuerMontan-Universitaumlt Leoben AT

Manuel BlumAlbert-Ludwigs-UniversitaumltFreiburg DE

Robert Busa-FeketeUniversitaumlt Marburg DE

Yann ChevaleyreUniversity of Paris North FR

Marc DeisenrothTU Darmstadt DE

Thomas G DietterichOregon State University US

Christos DimitrakakisEPFL ndash Lausanne CH

Lutz FrommbergerUniversitaumlt Bremen DE

Jens GarstkaFernUniversitaumlt in Hagen DE

Mohammad GhavamzadehINRIA Nord Europe ndash Lille FR

Marcus HutterAustralian National Univ AU

Rico JonschkowskiTU Berlin DE

Petar KormushevItalian Institute of Technology ndashGenova IT

Tor LattimoreAustralian National Univ AU

Alessandro LazaricINRIA Nord Europe ndash Lille FR

Timothy MannTechnion ndash Haifa IL

Jan Hendrik MetzenUniversitaumlt Bremen DE

Gergely NeuBudapest University ofTechnology amp Economics HU

Gerhard NeumannTU Darmstadt DE

Ann NoweacuteFree University of Brussels BE

Laurent OrseauAgroParisTech ndash Paris FR

Ronald OrtnerMontan-Universitaumlt Leoben AT

Joeumllle PineauMcGill Univ ndash Montreal CA

Doina PrecupMcGill Univ ndash Montreal CA

Mark B RingAnaheim Hills US

Manuela Ruiz-MontielUniversity of Malaga ES

Scott SannerNICTA ndash Canberra AU

Nils T SiebelHochschule fuumlr Technik undWirtschaft ndash Berlin DE

David SilverUniversity College London GB

Orhan SoumlnmezBogaziccedili Univ ndash Istanbul TR

Peter SunehagAustralian National Univ AU

Richard S SuttonUniversity of Alberta CA

Csaba SzepesvaacuteriUniversity of Alberta CA

William UtherGoogle ndash Sydney AU

Martijn van OtterloRadboud Univ Nijmegen NL

Joel VenessUniversity of Alberta CA

Jeremy L WyattUniversity of Birmingham GB

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joeumllle Pineau
    • A POMDP Tutorial Joeumllle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo BasedPolicy Iteration Orhan Soumlnmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD() Richard S Sutton
    • Relations between Reinforcement Learning Visual Input Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants
Page 21: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

Peter Auer Marcus Hutter and Laurent Orseau 21

References1 Brockwell A Del Moral P Doucet A Sequentially Interacting Markov Chain Monte

Carlo Methods emphThe Annals of Statistics 38(6)3870ndash3411 December 20102 Toussaint M Storkey A Probabilistic Inference for Solving Discrete and Continuous

State Markov Decision Processes In Proc of the 23rd Intrsquol Conf on MachineLearningpp 945ndash952 2006

3 Deisenroth M Rasmussen CE Peters J Gaussian Process Dynamic ProgrammingNeurocomputing 72(7-9)1508ndash1524March 2009

4 Del Moral P Doucet A Sequential Monte Carlo SamplersJournal of the Royal StatisticalSociety ndash Series B StatisticalMethodology 68(3)411ndash436 June 2006

5 Deisenroth M Rasmussen CEPILCO A Model-Based and Data-Efficient Approach toPolicy Search In Proc of the 28th Intrsquol Conf on Machine Learning pp 465ndash472 2006

328 Exploration versus Exploitation in Reinforcement LearningPeter Sunehag (Australian National University AU)

License Creative Commons BY 30 Unported licensecopy Peter Sunehag

Main reference T Lattimore M Hutter P Sunehag ldquoThe Sample-Complexity of General ReinforcementLearningrdquo arXiv13084828v1 [csLG] and in Proc of the 30th Intrsquol Conf on Machine Learning(ICMLrsquo13) JMLR WampCP 28(3)28ndash36 2013

URL httparxivorgabs13084828v1URL httpjmlrorgproceedingspapersv28lattimore13pdf

My presentation was a tutorial overview of the exploration vs exploitation dilemma inreinforcement learning I began in the multi-armed bandit setting and went through MarkovDecision Processes to the general reinforcement learning setting that has only recentlybeen studied The talk discussed the various strategies for dealing with the dilemma likeoptimismin a frequentist setting or posterior sampling in the Bayesian setting as well as theperformance measures like sample complexity in discounted reinforcement learning or regretbounds for undiscounted the setting Itwas concluded that sample complexity bounds can beproven in much more general settings than regret bounds Regret bounds need some sort ofrecoverability guarantees while unfortunately sample complexity says less about how muchreward the agent will achieve The speakerrsquos recommendation is to try to achieve optimalsample complexity but only within the class of rational agents described by an axiomaticsystem developed from classical rational choice decision theory

329 The Quest for the Ultimate TD(λ)Richard S Sutton (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Richard S Sutton

Joint work of Maei H R Sutton R SMain reference HR Maei R S Sutton ldquoGQ(λ) A general gradient algorithm for temporal-difference prediction

learning with eligibility tracesrdquo in Proc of the 3rd Conf on Artificial General Intelligence(AGIrsquo10) Advances in Intelligent Systems Research Series 6 pp 2010

URL httpdxdoiorg102991agi201022URL httpwebdocscsualbertaca suttonpapersmaei-sutton-10pdf

TD(λ) is a computationally simple model-free algorithm for learning to predict long termconsequences It has been used to learn value functions to form multi-scale models of

13321

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

MondayMorning

Self-introduction of participantsPeter Sunehag Tutorial on explorationexploitation

AfternoonTom Dietterich Solving Simulator-Defined MDPs for Natural Resource ManagementCsaba Szepesvari Gergely Neu Online learning in Markov decision processesRonald Ortner Continuous RL and restless banditsMartijn van Otterlo Relations Between Reinforcement Learning Visual InputPerception and Action

TuesdayMorning

M Ghavamzadeh and A Lazaric Tutorial on Statistical Learning Theory in RLand Approximate Dynamic ProgrammingGerard Neumann Hierarchical Learning of Motor Skills with Information-TheoreticPolicy SearchDoina Precup Methods for Bellman Error Basis Function constructionJoeumllle Pineau Reinforcement Learning using Kernel-Based Stochastic Factorization

AfternoonPetar Kormushev Reinforcement Learning with Heterogeneous Policy Representa-tionsMark B Ring Continual learningRich Sutton The quest for the ultimate TD algorithmJan H Metzen Learning Skill Templates for Parameterized TasksRobert Busa-Fekete Preference-based Evolutionary Direct Policy SearchOrhan Soumlnmez Sequentially Interacting Markov Chain Monte Carlo Based PolicyIteration

WednesdayMorning

Marcus Hutter Tutorial on Universal RLJoel Veness Universal RL ndash Applications and ApproximationsLaurent Orseau Optimal Universal Explorative AgentsTor Lattimore Bayesian Reinforcement Learning + ExplorationLaurent Orseau More realistic assumptions for RL

Peter Auer Marcus Hutter and Laurent Orseau 25

AfternoonScott Sanner Tutorial on Symbolic Dynamic ProgrammingRico Jonschkowski Representation Learning for Reinforcement Learning in Robot-icsTimothy Mann Theoretical Analysis of Planning with OptionsMohammad Ghavamzadeh SPSA based Actor-Critic Algorithm for Risk SensitiveControlDavid Silver Deterministic Policy Gradients

ThursdayMorning

Joeumllle Pineau Tutorial on POMDPsJeremy Wyatt Learning and Reasoning with POMDPs in RobotsScott Sanner Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPsLutz Frommberger Some thoughts on Transfer Learning in RL ndash On States andRepresentationWill Uther Tree-based MDP ndash PSRNils Siebel Neuro-Evolution

Afternoon HikingFridayMorning

Christos Dimitrakakis ABC and cover-tree reinforcement learningManuela Ruiz-Montiel Multi-objective Reinforcement LearningAnn Nowe Multi-Objective Reinforcement LearningRico Jonschkowski Temporal Abstraction in Reinforcement Learning with Proxim-ity StatisticsChristos Dimitrakakis RL competition

13321

26 13321 ndash Reinforcement Learning

Participants

Peter AuerMontan-Universitaumlt Leoben AT

Manuel BlumAlbert-Ludwigs-UniversitaumltFreiburg DE

Robert Busa-FeketeUniversitaumlt Marburg DE

Yann ChevaleyreUniversity of Paris North FR

Marc DeisenrothTU Darmstadt DE

Thomas G DietterichOregon State University US

Christos DimitrakakisEPFL ndash Lausanne CH

Lutz FrommbergerUniversitaumlt Bremen DE

Jens GarstkaFernUniversitaumlt in Hagen DE

Mohammad GhavamzadehINRIA Nord Europe ndash Lille FR

Marcus HutterAustralian National Univ AU

Rico JonschkowskiTU Berlin DE

Petar KormushevItalian Institute of Technology ndashGenova IT

Tor LattimoreAustralian National Univ AU

Alessandro LazaricINRIA Nord Europe ndash Lille FR

Timothy MannTechnion ndash Haifa IL

Jan Hendrik MetzenUniversitaumlt Bremen DE

Gergely NeuBudapest University ofTechnology amp Economics HU

Gerhard NeumannTU Darmstadt DE

Ann NoweacuteFree University of Brussels BE

Laurent OrseauAgroParisTech ndash Paris FR

Ronald OrtnerMontan-Universitaumlt Leoben AT

Joeumllle PineauMcGill Univ ndash Montreal CA

Doina PrecupMcGill Univ ndash Montreal CA

Mark B RingAnaheim Hills US

Manuela Ruiz-MontielUniversity of Malaga ES

Scott SannerNICTA ndash Canberra AU

Nils T SiebelHochschule fuumlr Technik undWirtschaft ndash Berlin DE

David SilverUniversity College London GB

Orhan SoumlnmezBogaziccedili Univ ndash Istanbul TR

Peter SunehagAustralian National Univ AU

Richard S SuttonUniversity of Alberta CA

Csaba SzepesvaacuteriUniversity of Alberta CA

William UtherGoogle ndash Sydney AU

Martijn van OtterloRadboud Univ Nijmegen NL

Joel VenessUniversity of Alberta CA

Jeremy L WyattUniversity of Birmingham GB

  • Executive Summary Peter Auer Marcus Hutter and Laurent Orseau
  • Table of Contents
  • Overview of Talks
    • Preference-based Evolutionary Direct Policy Search Robert Busa-Fekete
    • Solving Simulator-Defined MDPs for Natural Resource Management Thomas G Dietterich
    • ABC and Cover Tree Reinforcement Learning Christos Dimitrakakis
    • Some thoughts on Transfer Learning in Reinforcement Learning on States and Representation Lutz Frommberger
    • Actor-Critic Algorithms for Risk-Sensitive MDPs Mohammad Ghavamzadeh
    • Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming Mohammad Ghavamzadeh
    • Universal Reinforcement Learning Marcus Hutter
    • Temporal Abstraction by Sparsifying Proximity Statistics Rico Jonschkowski
    • State Representation Learning in Robotics Rico Jonschkowski
    • Reinforcement Learning with Heterogeneous Policy Representations Petar Kormushev
    • Theoretical Analysis of Planning with Options Timothy Mann
    • Learning Skill Templates for Parameterized Tasks Jan Hendrik Metzen
    • Online learning in Markov decision processes Gergely Neu
    • Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search Gerhard Neumann
    • Multi-Objective Reinforcement Learning Ann Nowe
    • Bayesian Reinforcement Learning + Exploration Tor Lattimore
    • Knowledge-Seeking Agents Laurent Orseau
    • Toward a more realistic framework for general reinforcement learning Laurent Orseau
    • Colored MDPs Restless Bandits and Continuous State Reinforcement Learning Ronald Ortner
    • Reinforcement Learning using Kernel-Based Stochastic Factorization Joeumllle Pineau
    • A POMDP Tutorial Joeumllle Pineau
    • Methods for Bellman Error Basis Function construction Doina Precup
    • Continual Learning Mark B Ring
    • Multi-objective Reinforcement Learning Manuela Ruiz-Montiel
    • Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs Scott Sanner
    • Deterministic Policy Gradients David Silver
    • Sequentially Interacting Markov Chain Monte Carlo BasedPolicy Iteration Orhan Soumlnmez
    • Exploration versus Exploitation in Reinforcement Learning Peter Sunehag
    • The Quest for the Ultimate TD() Richard S Sutton
    • Relations between Reinforcement Learning Visual Input Perception and Action Martijn van Otterlo
    • Universal RL applications and approximations Joel Veness
    • Learning and Reasoning with POMDPs in Robots Jeremy Wyatt
      • Schedule
      • Participants
Page 22: Report from Dagstuhl Seminar 13321 Reinforcement Learning€¦ · Report from Dagstuhl Seminar 13321 Reinforcement Learning Editedby Peter Auer1, Marcus Hutter2, and Laurent Orseau3

22 13321 ndash Reinforcement Learning

the world and to compile planning into policies for immediate action It is a natural corealgorithm for artificial intelligence based on reinforcement learning Before realizing its fullpotential however TD(λ) needs to be generalized in several ways to off-policy learning ashas already been partially done to maximally general parameterization as has also beenpartially done and to off-policy eligibility traceswhich was previously thought impossiblebut now perhaps we can see how this too can be done In all these ways we see a glimmerof a perfected and complete algorithmmdashsomething inspired by TD(λ) with all its positivecomputational and algorithmic features and yet more general flexible and powerful Seekingthis perfected algorithm is the quest for the ultimate TD (λ) this talk is a status report onthe goals for it and the prospects for achieving them

References1 H R Maei R S Sutton GQ(λ) A general gradient algorithm for temporal-difference pre-

diction learning with eligibility tracesIn Proceedings of the Third Conference on ArtificialGeneral Intelligence Lugano Switzerland 2010

330 Relations between Reinforcement Learning Visual InputPerception and Action

Martijn van Otterlo (Radboud University Nijmegen NL)

License Creative Commons BY 30 Unported licensecopy Martijn van Otterlo

Typical reinforcement learning (RL) algorithms learn from traces of state action state action sequences in order to optimize action selection for each state (wrt a reward criterium)The field of RL has come up with many algorithms based on abstraction and generalization[7]In my view general RL amounts to feedback-based interactive experimentation with particularabstraction levels forstates actions and tasks However despite all previous efforts directcouplings of RL with complex visual input (eg raw images) are still rare In roughly therecent decade RL has been combined with so-called relational knowledge representation forstates and languages[5 6] Also many forms of decision-theoretic planning using abstractor relational version of Bellman equations can employ powerful knowledge representationschemes[4] An interesting development is that also in the computervision communitypeople wish to employ similar relational generalization overvisual input due to advancesin (probabilistic) logical learning (eg see our recent work on the interpretation of housesfrom images [1] and robotics [3 2]) My talk is about a possibilities for relational integrationof both relational action and vision The real potential of relational representations is thatstates can share information with actions (eg parameters or more specifically objects)Possibilities exist to define novel languages for interactive experimentation with relationalabstraction levels in the context of both complex visual input and complex behavioral outputThis includes new types of interactions ndash for example dealing with scarce human feedbacknew types of experimentation ndash for example incorporating visual feedback and physicalmanipulation and new types of abstraction levels ndash such as probabilistic programminglanguages I will present first steps towards amore tight integration of relational vision andrelational action for interactive 2 learning settings In addition I present several new problemdomains

2 See also the IJCAI Workshop on Machine Learning forInteractive Systems (httpmlis-workshoporg2013)

Peter Auer Marcus Hutter and Laurent Orseau 23

References1 L Antanas M van Otterlo J Oramas T Tuytelaars and L De RaedtThere are plenty

of places like home Using relational representations in hierarchy for distance-based imageunderstanding Neurocomputing 2013 in press (httpdxdoiorg101016jneucom201210037)

2 B Moldovan P Moreno and M van OtterloOn the use of probabilistic relational afford-ance models for sequential manipulation tasks in roboticsIn IEEE International Conferenceon Robotics and Automation (ICRA) 2013

3 B Moldovan P Moreno M van Otterlo J Santos-Victor and L De Raedt Learningrelational affordance models for robots in multi-object manipulation tasks In IEEE Inter-national Conference on Robotics and Automation (ICRA) pages 4373ndash4378 2012

4 M van OtterloIntensional dynamic programming A Rosetta stone for structured dynamicprogrammingJournal of Algorithms 64169ndash191 2009

5 M van OtterloThe Logic of Adaptive Behavior IOS Press Amsterdam The Netherlands2009

6 M van Otterlo Solving relational and first-order logical Markov decision processes Asurvey In M Wiering and M van Otterlo editors Reinforcement Learning State-of-the-Art chapter 8 Springer 2012

7 M Wiering and M van OtterloReinforcement Learning State-of-the-ArtSpringer 2012

331 Universal RL applications and approximationsJoel Veness (University of Alberta ndash Edmonton CA)

License Creative Commons BY 30 Unported licensecopy Joel Veness

Joint work of Veness Joel Hutter MarcusMain reference J Veness ldquoUniversal RL applications and approximationsrdquo 2013

URL httpewrlfileswordpresscom201210ewrl11_submission_27pdf

While the main ideas underlying Universal RL have existed for over a decadenow (see [1] forhistorical context) practical applications are only juststarting to emerge In particular thedirect approximation introduced by Venesset al [2 3] was shown empirically to comparefavorably to a number of other model-based RL techniques on small partially observableenvironments with initially unknown stochastic dynamics Since then a variety of additionaltechniques have been introduced that allow for the construction of far more sophisticatedapproximations We present and review some of the mainideas that have the potential tolead to larger scale applications

References1 M Hutter One decade of universal artificial intelligence In Theoretical Foundations of

Artificial General Intelligence pages 67ndash88 Atlantis Press 20122 J Veness K S Ng M Hutter and D Silver Reinforcement Learning via AIXI Approx-

imation In AAAI 20103 J Veness K S Ng M Hutter W T B Uther and D Silver A Monte-Carlo AIXI

Approximation Journal of Artificial Intelligence Research (JAIR) 4095ndash142 2011

13321

24 13321 ndash Reinforcement Learning

332 Learning and Reasoning with POMDPs in RobotsJeremy Wyatt (University of Birmingham ndash Birmingham GB)

License Creative Commons BY 30 Unported licensecopy Jeremy Wyatt

Sequential decision making under state uncertainty is a well understood if intractableproblemIn this talk I show various ways that approximate methods for belief state planningand learningcan be developed to solve practical robot problems including those that scale tohigh dimensional continuous state and action spaces problems with incomplete informationand problems requiring real-time decision making

4 Schedule

MondayMorning

Self-introduction of participantsPeter Sunehag Tutorial on explorationexploitation

AfternoonTom Dietterich Solving Simulator-Defined MDPs for Natural Resource ManagementCsaba Szepesvari Gergely Neu Online learning in Markov decision processesRonald Ortner Continuous RL and restless banditsMartijn van Otterlo Relations Between Reinforcement Learning Visual InputPerception and Action

TuesdayMorning

M Ghavamzadeh and A Lazaric Tutorial on Statistical Learning Theory in RLand Approximate Dynamic ProgrammingGerard Neumann Hierarchical Learning of Motor Skills with Information-TheoreticPolicy SearchDoina Precup Methods for Bellman Error Basis Function constructionJoeumllle Pineau Reinforcement Learning using Kernel-Based Stochastic Factorization

AfternoonPetar Kormushev Reinforcement Learning with Heterogeneous Policy Representa-tionsMark B Ring Continual learningRich Sutton The quest for the ultimate TD algorithmJan H Metzen Learning Skill Templates for Parameterized TasksRobert Busa-Fekete Preference-based Evolutionary Direct Policy SearchOrhan Soumlnmez Sequentially Interacting Markov Chain Monte Carlo Based PolicyIteration

WednesdayMorning

Marcus Hutter Tutorial on Universal RLJoel Veness Universal RL ndash Applications and ApproximationsLaurent Orseau Optimal Universal Explorative AgentsTor Lattimore Bayesian Reinforcement Learning + ExplorationLaurent Orseau More realistic assumptions for RL

Peter Auer Marcus Hutter and Laurent Orseau 25

AfternoonScott Sanner Tutorial on Symbolic Dynamic ProgrammingRico Jonschkowski Representation Learning for Reinforcement Learning in Robot-icsTimothy Mann Theoretical Analysis of Planning with OptionsMohammad Ghavamzadeh SPSA based Actor-Critic Algorithm for Risk SensitiveControlDavid Silver Deterministic Policy Gradients

ThursdayMorning

Joeumllle Pineau Tutorial on POMDPsJeremy Wyatt Learning and Reasoning with POMDPs in RobotsScott Sanner Recent Advances in Symbolic Dynamic Programming for HybridMDPs and POMDPsLutz Frommberger Some thoughts on Transfer Learning in RL ndash On States andRepresentationWill Uther Tree-based MDP ndash PSRNils Siebel Neuro-Evolution

Afternoon HikingFridayMorning

Christos Dimitrakakis ABC and cover-tree reinforcement learningManuela Ruiz-Montiel Multi-objective Reinforcement LearningAnn Nowe Multi-Objective Reinforcement LearningRico Jonschkowski Temporal Abstraction in Reinforcement Learning with Proxim-ity StatisticsChristos Dimitrakakis RL competition

13321

26 13321 ndash Reinforcement Learning

Participants

Peter AuerMontan-Universitaumlt Leoben AT

Manuel BlumAlbert-Ludwigs-UniversitaumltFreiburg DE

Robert Busa-FeketeUniversitaumlt Marburg DE

Yann ChevaleyreUniversity of Paris North FR

Marc DeisenrothTU Darmstadt DE

Thomas G DietterichOregon State University US

Christos DimitrakakisEPFL ndash Lausanne CH

Lutz FrommbergerUniversitaumlt Bremen DE

Jens GarstkaFernUniversitaumlt in Hagen DE

Mohammad GhavamzadehINRIA Nord Europe ndash Lille FR

Marcus HutterAustralian National Univ AU

Rico JonschkowskiTU Berlin DE

Petar KormushevItalian Institute of Technology ndashGenova IT

Tor LattimoreAustralian National Univ AU

Alessandro LazaricINRIA Nord Europe ndash Lille FR

Timothy MannTechnion ndash Haifa IL

Jan Hendrik MetzenUniversitaumlt Bremen DE

Gergely NeuBudapest University ofTechnology amp Economics HU

Gerhard NeumannTU Darmstadt DE

Ann NoweacuteFree University of Brussels BE

Laurent OrseauAgroParisTech ndash Paris FR

Ronald OrtnerMontan-Universitaumlt Leoben AT

Joeumllle PineauMcGill Univ ndash Montreal CA

Doina PrecupMcGill Univ ndash Montreal CA

Mark B RingAnaheim Hills US

Manuela Ruiz-MontielUniversity of Malaga ES

Scott SannerNICTA ndash Canberra AU

Nils T SiebelHochschule fuumlr Technik undWirtschaft ndash Berlin DE

David SilverUniversity College London GB

Orhan SoumlnmezBogaziccedili Univ ndash Istanbul TR

Peter SunehagAustralian National Univ AU

Richard S SuttonUniversity of Alberta CA

Csaba SzepesvaacuteriUniversity of Alberta CA

William UtherGoogle ndash Sydney AU

Martijn van OtterloRadboud Univ Nijmegen NL

Joel VenessUniversity of Alberta CA

Jeremy L WyattUniversity of Birmingham GB

References
1. L. Antanas, M. van Otterlo, J. Oramas, T. Tuytelaars, and L. De Raedt. There are plenty of places like home: Using relational representations in hierarchy for distance-based image understanding. Neurocomputing, 2013, in press. (http://dx.doi.org/10.1016/j.neucom.2012.10.037)
2. B. Moldovan, P. Moreno, and M. van Otterlo. On the use of probabilistic relational affordance models for sequential manipulation tasks in robotics. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
3. B. Moldovan, P. Moreno, M. van Otterlo, J. Santos-Victor, and L. De Raedt. Learning relational affordance models for robots in multi-object manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), pages 4373–4378, 2012.
4. M. van Otterlo. Intensional dynamic programming: A Rosetta stone for structured dynamic programming. Journal of Algorithms, 64:169–191, 2009.
5. M. van Otterlo. The Logic of Adaptive Behavior. IOS Press, Amsterdam, The Netherlands, 2009.
6. M. van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State-of-the-Art, chapter 8. Springer, 2012.
7. M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Springer, 2012.

3.31 Universal RL applications and approximations
Joel Veness (University of Alberta – Edmonton, CA)

License: Creative Commons BY 3.0 Unported license © Joel Veness
Joint work of: Veness, Joel; Hutter, Marcus
Main reference: J. Veness, "Universal RL applications and approximations", 2013
URL: http://ewrl.files.wordpress.com/2012/10/ewrl11_submission_27.pdf

While the main ideas underlying Universal RL have existed for over a decade now (see [1] for historical context), practical applications are only just starting to emerge. In particular, the direct approximation introduced by Veness et al. [2, 3] was shown empirically to compare favorably to a number of other model-based RL techniques on small partially observable environments with initially unknown stochastic dynamics. Since then, a variety of additional techniques have been introduced that allow for the construction of far more sophisticated approximations. We present and review some of the main ideas that have the potential to lead to larger-scale applications.
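For readers new to the framework, the following display is a schematic rendering of the expectimax expression defining the (incomputable) AIXI agent that the above approximations target; the notation (horizon m, universal Turing machine U, program length ℓ(q)) follows Hutter's formulation and is supplied here for illustration rather than taken from the abstract:

\[
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q\,:\,U(q,\,a_{1:m}) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]

Roughly speaking, the direct approximation of [2, 3] replaces the mixture over programs with a Bayesian mixture over a restricted model class (context trees) and the full expectimax with Monte-Carlo tree search.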

References
1. M. Hutter. One decade of universal artificial intelligence. In Theoretical Foundations of Artificial General Intelligence, pages 67–88. Atlantis Press, 2012.
2. J. Veness, K. S. Ng, M. Hutter, and D. Silver. Reinforcement Learning via AIXI Approximation. In AAAI, 2010.
3. J. Veness, K. S. Ng, M. Hutter, W. T. B. Uther, and D. Silver. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.

3.32 Learning and Reasoning with POMDPs in Robots
Jeremy Wyatt (University of Birmingham – Birmingham, GB)

License: Creative Commons BY 3.0 Unported license © Jeremy Wyatt

Sequential decision making under state uncertainty is a well-understood, if intractable, problem. In this talk I show various ways that approximate methods for belief state planning and learning can be developed to solve practical robot problems, including those that scale to high-dimensional continuous state and action spaces, problems with incomplete information, and problems requiring real-time decision making.
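As a point of reference for the planning methods mentioned above, the belief state is the posterior over hidden states maintained by the standard Bayes filter; the notation below (transition model T, observation model O, normalising constant η) is generic POMDP notation given here for illustration, not taken from the talk:

\[
b'(s') \;=\; \eta\, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),
\qquad
\eta^{-1} \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
\]

Approximate belief-state planners typically trade exactness in this update, or in the value function defined over beliefs, for tractability in large or continuous spaces.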

4 Schedule

Monday
  Morning
    Self-introduction of participants
    Peter Sunehag: Tutorial on exploration/exploitation
  Afternoon
    Tom Dietterich: Solving Simulator-Defined MDPs for Natural Resource Management
    Csaba Szepesvári / Gergely Neu: Online learning in Markov decision processes
    Ronald Ortner: Continuous RL and restless bandits
    Martijn van Otterlo: Relations Between Reinforcement Learning, Visual Input, Perception and Action

Tuesday
  Morning
    M. Ghavamzadeh and A. Lazaric: Tutorial on Statistical Learning Theory in RL and Approximate Dynamic Programming
    Gerhard Neumann: Hierarchical Learning of Motor Skills with Information-Theoretic Policy Search
    Doina Precup: Methods for Bellman Error Basis Function construction
    Joëlle Pineau: Reinforcement Learning using Kernel-Based Stochastic Factorization
  Afternoon
    Petar Kormushev: Reinforcement Learning with Heterogeneous Policy Representations
    Mark B. Ring: Continual learning
    Rich Sutton: The quest for the ultimate TD algorithm
    Jan H. Metzen: Learning Skill Templates for Parameterized Tasks
    Robert Busa-Fekete: Preference-based Evolutionary Direct Policy Search
    Orhan Sönmez: Sequentially Interacting Markov Chain Monte Carlo Based Policy Iteration

Wednesday
  Morning
    Marcus Hutter: Tutorial on Universal RL
    Joel Veness: Universal RL – Applications and Approximations
    Laurent Orseau: Optimal Universal Explorative Agents
    Tor Lattimore: Bayesian Reinforcement Learning + Exploration
    Laurent Orseau: More realistic assumptions for RL
  Afternoon
    Scott Sanner: Tutorial on Symbolic Dynamic Programming
    Rico Jonschkowski: Representation Learning for Reinforcement Learning in Robotics
    Timothy Mann: Theoretical Analysis of Planning with Options
    Mohammad Ghavamzadeh: SPSA-based Actor-Critic Algorithm for Risk-Sensitive Control
    David Silver: Deterministic Policy Gradients

Thursday
  Morning
    Joëlle Pineau: Tutorial on POMDPs
    Jeremy Wyatt: Learning and Reasoning with POMDPs in Robots
    Scott Sanner: Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs
    Lutz Frommberger: Some thoughts on Transfer Learning in RL – On States and Representation
    Will Uther: Tree-based MDP – PSR
    Nils Siebel: Neuro-Evolution
  Afternoon
    Hiking

Friday
  Morning
    Christos Dimitrakakis: ABC and cover-tree reinforcement learning
    Manuela Ruiz-Montiel: Multi-objective Reinforcement Learning
    Ann Nowé: Multi-Objective Reinforcement Learning
    Rico Jonschkowski: Temporal Abstraction in Reinforcement Learning with Proximity Statistics
    Christos Dimitrakakis: RL competition

Participants

Peter Auer, Montan-Universität Leoben, AT
Manuel Blum, Albert-Ludwigs-Universität Freiburg, DE
Robert Busa-Fekete, Universität Marburg, DE
Yann Chevaleyre, University of Paris North, FR
Marc Deisenroth, TU Darmstadt, DE
Thomas G. Dietterich, Oregon State University, US
Christos Dimitrakakis, EPFL – Lausanne, CH
Lutz Frommberger, Universität Bremen, DE
Jens Garstka, FernUniversität in Hagen, DE
Mohammad Ghavamzadeh, INRIA Nord Europe – Lille, FR
Marcus Hutter, Australian National Univ., AU
Rico Jonschkowski, TU Berlin, DE
Petar Kormushev, Italian Institute of Technology – Genova, IT
Tor Lattimore, Australian National Univ., AU
Alessandro Lazaric, INRIA Nord Europe – Lille, FR
Timothy Mann, Technion – Haifa, IL
Jan Hendrik Metzen, Universität Bremen, DE
Gergely Neu, Budapest University of Technology & Economics, HU
Gerhard Neumann, TU Darmstadt, DE
Ann Nowé, Free University of Brussels, BE
Laurent Orseau, AgroParisTech – Paris, FR
Ronald Ortner, Montan-Universität Leoben, AT
Joëlle Pineau, McGill Univ. – Montreal, CA
Doina Precup, McGill Univ. – Montreal, CA
Mark B. Ring, Anaheim Hills, US
Manuela Ruiz-Montiel, University of Malaga, ES
Scott Sanner, NICTA – Canberra, AU
Nils T. Siebel, Hochschule für Technik und Wirtschaft – Berlin, DE
David Silver, University College London, GB
Orhan Sönmez, Boğaziçi Univ. – Istanbul, TR
Peter Sunehag, Australian National Univ., AU
Richard S. Sutton, University of Alberta, CA
Csaba Szepesvári, University of Alberta, CA
William Uther, Google – Sydney, AU
Martijn van Otterlo, Radboud Univ. Nijmegen, NL
Joel Veness, University of Alberta, CA
Jeremy L. Wyatt, University of Birmingham, GB
