
Reinforcement Learning in Robotics: A Survey

Jens Kober∗† J. Andrew Bagnell‡ Jan Peters§¶

email: [email protected], [email protected], [email protected]

Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide both inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value function-based and policy search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.

keywords: reinforcement learning, learning control, robot, survey

∗ Bielefeld University, CoR-Lab Research Institute for Cognition and Robotics, Universitätsstr. 25, 33615 Bielefeld, Germany

† Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany

‡ Carnegie Mellon University, Robotics Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

§ Max Planck Institute for Intelligent Systems, Department of Empirical Inference, Spemannstr. 38, 72076 Tübingen, Germany

¶ Technische Universität Darmstadt, FB Informatik, FG Intelligent Autonomous Systems, Hochschulstr. 10, 64289 Darmstadt, Germany

1 Introduction

A remarkable variety of problems in robotics may be naturally phrased as ones of reinforcement learning. Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot. Figure 1 illustrates the diverse set of robots that have learned tasks using reinforcement learning.

Consider, for example, attempting to train a robot to return a table tennis ball over the net (Muelling et al., 2012). In this case, the robot might make observations of dynamic variables specifying ball position and velocity and the internal dynamics of the joint position and velocity. This might in fact capture well the state s of the system, providing a complete statistic for predicting future observations. The actions a available to the robot might be the torque sent to motors or the desired accelerations sent to an inverse dynamics control system. A function π that generates the motor commands (i.e., the actions) based on the incoming ball and current internal arm observations (i.e., the state) would be called the policy. A reinforcement learning problem is to find a policy that optimizes the long-term sum of rewards R(s, a); a reinforcement learning algorithm is one designed to find such a (near-)optimal policy. The reward function in this example could be based on the success of the hits as well as secondary criteria such as energy consumption.

1.1 Reinforcement Learning in the Context of Machine Learning

In the problem of reinforcement learning, an agent explores the space of possible strategies and receives feedback on the outcome of the choices made. From this information, a “good” – or ideally optimal – policy (i.e., strategy or controller) must be deduced.

Reinforcement learning may be understood by contrasting the problem with other areas of study in machine learning. In supervised learning (Langford and Zadrozny, 2005), an agent is directly presented a sequence of independent examples of correct predictions to make in different circumstances. In imitation learning, an agent is provided demonstrations of actions of a good strategy to follow in given situations (Argall et al., 2009; Schaal, 1999).

To aid in understanding the RL problem and its relation with techniques widely used within robotics, Figure 2 provides a schematic illustration of two axes of problem variability: the complexity of sequential interaction and the complexity of reward structure.


(a) OBELIX robot (b) Zebra Zero robot

(c) Autonomous helicopter (d) Sarcos humanoid DB

Figure 1: This figure illustrates a small sample of robots with behaviors that were reinforcement learned. These cover the whole range of aerial vehicles, robotic arms, autonomous vehicles, and humanoid robots. (a) The OBELIX robot is a wheeled mobile robot that learned to push boxes (Mahadevan and Connell, 1992) with a value function-based approach (picture reprinted with permission of Sridhar Mahadevan). (b) A Zebra Zero robot arm learned a peg-in-hole insertion task (Gullapalli et al., 1994) with a model-free policy gradient approach (picture reprinted with permission of Rod Grupen). (c) Carnegie Mellon's autonomous helicopter leveraged a model-based policy search approach to learn a robust flight controller (Bagnell and Schneider, 2001). (d) The Sarcos humanoid DB learned a pole-balancing task (Schaal, 1996) using forward models (picture reprinted with permission of Stefan Schaal).

This hierarchy of problems, and the relations between them, is a complex one, varying in manifold attributes and difficult to condense to something like a simple linear ordering on problems. Much recent work in the machine learning community has focused on understanding the diversity and the inter-relations between problem classes. The figure should be understood in this light as providing a crude picture of the relationship between areas of machine learning research important for robotics.

Each problem subsumes those that are both below and to the left in the sense that one may always frame the simpler problem in terms of the more complex one; note that some problems are not linearly ordered. In this sense, reinforcement learning subsumes much of the scope of classical machine learning as well as contextual bandit and imitation learning problems. Reduction algorithms (Langford and Zadrozny, 2005) are used to convert effective solutions for one class of problems into effective solutions for others, and have proven to be a key technique in machine learning.

At lower left, we find the paradigmatic problem of supervised learning, which plays a crucial role in applications as diverse as face detection and spam filtering. In these problems (including binary classification and regression), a learner's goal is to map observations (typically known as features or covariates) to actions, which are usually a discrete set of classes or a real value. These problems possess no interactive component: the design and analysis of algorithms to address these problems rely on training and testing instances as independent and identically distributed random variables. This rules out any notion that a decision made by the learner will impact future observations: supervised learning algorithms are built to operate in a world in which every decision has no effect on the future examples considered. Further, within supervised learning scenarios, during a training phase the “correct” or preferred answer is provided to the learner, so there is no ambiguity about action choices.

Figure 2: An illustration of the inter-relations between well-studied learning problems in the literature along axes that attempt to capture both the information and complexity available in reward signals and the complexity of sequential interaction between learner and environment. The vertical axis denotes reward structure complexity and the horizontal axis interactive/sequential complexity; the problems shown range from binary classification, supervised learning, and cost-sensitive learning through structured prediction, imitation learning, and contextual bandits to baseline distribution RL and full reinforcement learning. Each problem subsumes those to the left and below; reduction techniques provide methods whereby harder problems (above and right) may be addressed using repeated application of algorithms built for simpler problems. (Langford and Zadrozny, 2005)

More complex reward structures are also often studied: one such is known as cost-sensitive learning, where each training example and each action or prediction is annotated with a cost for making such a prediction. Learning techniques exist that reduce such problems to the simpler classification problem, and active research directly addresses such problems as they are crucial in practical learning applications.

Contextual bandit or associative reinforcement learning problems begin to address the fundamental problem of exploration-vs-exploitation, as information is provided only about a chosen action and not about what-might-have-been. These problems find widespread application in areas ranging from pharmaceutical drug discovery to ad placement on the web, and are one of the most active research areas in the field.

Problems of imitation learning and structured prediction may be seen to vary from supervised learning on the alternate dimension of sequential interaction. Structured prediction, a key technique used within computer vision and robotics, where many predictions are made in concert by leveraging inter-relations between them, may be seen as a simplified variant of imitation learning (Daumé III et al., 2009; Ross et al., 2011a). In imitation learning, we assume that an expert (for example, a human pilot) that we wish to mimic provides demonstrations of a task. While “correct answers” are provided to the learner, complexity arises because any mistake by the learner modifies the future observations from what would have been seen had the expert chosen the controls. Such problems provably lead to compounding errors and violate the basic assumption of independent examples required for successful supervised learning. In fact, in sharp contrast with supervised learning problems where only a single data-set needs to be collected, repeated interaction between learner and teacher appears to be both necessary and sufficient (Ross et al., 2011b) to provide performance guarantees in both theory and practice in imitation learning problems.

Reinforcement learning embraces the full complexity of these problems by requiring both interactive, sequential prediction as in imitation learning as well as complex reward structures with only “bandit” style feedback on the actions actually chosen. It is this combination that enables so many problems of relevance to robotics to be framed in these terms; it is this same combination that makes the problem both information-theoretically and computationally hard.

We note here briefly the problem termed “Baseline Distribution RL”: this is the standard RL problem with the additional benefit for the learner that it may draw initial states from a distribution provided by an expert instead of simply an initial state chosen by the problem. As we describe further in Section 5.1, this additional information of which states matter dramatically affects the complexity of learning.

1.2 Reinforcement Learning in the Context of Optimal Control

Reinforcement Learning (RL) is very closely related to the theory of classical optimal control, as well as dynamic programming, stochastic programming, simulation-optimization, stochastic search, and optimal stopping (Powell, 2012). Both RL and optimal control address the problem of finding an optimal policy (often also called the controller or control policy) that optimizes an objective function (i.e., the accumulated cost or reward), and both rely on the notion of a system being described by an underlying set of states, controls, and a plant or model that describes transitions between states. However, optimal control assumes perfect knowledge of the system's description in the form of a model (i.e., a function T that describes what the next state of the robot will be given the current state and action). For such models, optimal control ensures strong guarantees which, nevertheless, often break down due to model and computational approximations. In contrast, reinforcement learning operates directly on measured data and rewards from interaction with the environment. Reinforcement learning research has placed great focus on addressing cases which are analytically intractable using approximations and data-driven techniques. One of the most important approaches to reinforcement learning within robotics centers on the use of classical optimal control techniques (e.g., Linear-Quadratic Regulation and Differential Dynamic Programming) applied to system models learned via repeated interaction with the environment (Atkeson, 1998; Bagnell and Schneider, 2001; Coates et al., 2009). A concise discussion of viewing reinforcement learning as “adaptive optimal control” is presented in (Sutton et al., 1991).

1.3 Reinforcement Learning in the Context of Robotics

Robotics as a reinforcement learning domain differs considerably from most well-studied reinforcement learning benchmark problems. In this article, we highlight the challenges faced in tackling these problems. Problems in robotics are often best represented with high-dimensional, continuous states and actions (note that the 10-30 dimensional continuous actions common in robot reinforcement learning are considered large (Powell, 2012)). In robotics, it is often unrealistic to assume that the true state is completely observable and noise-free. The learning system will not be able to know precisely in which state it is, and even vastly different states might look very similar. Thus, robot reinforcement learning problems are often modeled as partially observed, a point we take up in detail in our formal model description below. The learning system must hence use filters to estimate the true state. It is often essential to maintain the information state of the environment that not only contains the raw observations but also a notion of uncertainty on its estimates (e.g., both the mean and the variance of a Kalman filter tracking the ball in the robot table tennis example).

Experience on a real physical system is tedious to obtain, expensive, and often hard to reproduce. Even getting to the same initial state is impossible for the robot table tennis system. Every single trial run, also called a roll-out, is costly and, as a result, such applications force us to focus on difficulties that do not arise as frequently in classical reinforcement learning benchmark examples. In order to learn within a reasonable time frame, suitable approximations of state, policy, value function, and/or system dynamics need to be introduced. However, while real-world experience is costly, it usually cannot be replaced by learning in simulations alone. In analytical or learned models of the system even small modeling errors can accumulate to a substantially different behavior, at least for highly dynamic tasks. Hence, algorithms need to be robust with respect to models that do not capture all the details of the real system, also referred to as under-modeling, and to model uncertainty. Another challenge commonly faced in robot reinforcement learning is the generation of appropriate reward functions. Rewards that guide the learning system quickly to success are needed to cope with the cost of real-world experience. This problem is called reward shaping (Laud, 2004) and represents a substantial manual contribution. Specifying good reward functions in robotics requires a fair amount of domain knowledge and may often be hard in practice.

Not every reinforcement learning method is equally suitable for the robotics domain. In fact, many of the methods thus far demonstrated on difficult problems have been model-based (Atkeson et al., 1997; Abbeel et al., 2007; Deisenroth and Rasmussen, 2011), and robot learning systems often employ policy search methods rather than value function-based approaches (Gullapalli et al., 1994; Miyamoto et al., 1996; Bagnell and Schneider, 2001; Kohl and Stone, 2004; Tedrake et al., 2005; Peters and Schaal, 2008a,b; Kober and Peters, 2009; Deisenroth et al., 2011). Such design choices stand in contrast to possibly the bulk of the early research in the machine learning community (Kaelbling et al., 1996; Sutton and Barto, 1998). We attempt to give a fairly complete overview on real robot reinforcement learning, citing most original papers while grouping them based on the key insights employed to make the robot reinforcement learning problem tractable. We isolate key insights such as choosing an appropriate representation for a value function or policy, incorporating prior knowledge, and transferring knowledge from simulations.

This paper surveys a wide variety of tasks where reinforcement learning has been successfully applied to robotics. If a task can be phrased as an optimization problem and exhibits temporal structure, reinforcement learning can often be profitably applied to both phrase and solve that problem. The goal of this paper is twofold. On the one hand, we hope that this paper can provide indications for the robotics community as to which types of problems can be tackled by reinforcement learning and provide pointers to approaches that are promising. On the other hand, for the reinforcement learning community, this paper can point out novel real-world test beds and remarkable opportunities for research on open questions. We focus mainly on results that were obtained on physical robots with tasks going beyond typical reinforcement learning benchmarks.

We concisely present reinforcement learning techniques in the context of robotics in Section 2. The challenges in applying reinforcement learning in robotics are discussed in Section 3. Different approaches to making reinforcement learning tractable are treated in Sections 4 to 6. In Section 7, the example of ball-in-a-cup is employed to highlight which of the various approaches discussed in the paper have been particularly helpful to make such a complex task tractable. Finally, in Section 8, we summarize the specific problems and benefits of reinforcement learning in robotics and provide concluding thoughts on the problems and promise of reinforcement learning in robotics.

2 A Concise Introduction to Reinforcement Learning

In reinforcement learning, an agent tries to maximize the accumulated reward over its life-time. In an episodic setting, where the task is restarted after each end of an episode, the objective is to maximize the total reward per episode. If the task is on-going without a clear beginning and end, either the average reward over the whole life-time or a discounted return (i.e., a weighted average where distant rewards have less influence) can be optimized. In such reinforcement learning problems, the agent and its environment may be modeled as being in a state s ∈ S, and the agent can perform actions a ∈ A, each of which may be members of either discrete or continuous sets and can be multi-dimensional. A state s contains all relevant information about the current situation needed to predict future states (or observables); an example would be the current position of a robot in a navigation task¹. An action a is used to control (or change) the state of the system. For example, in the navigation task we could have actions corresponding to torques applied to the wheels. For every step, the agent also gets a reward R, which is a scalar value and assumed to be a function of the state and observation. (It may equally be modeled as a random variable that depends on only these variables.) In the navigation task, a possible reward could be designed based on the energy costs for taken actions and rewards for reaching targets. The goal of reinforcement learning is to find a mapping from states to actions, called policy π, that picks actions a in given states s maximizing the cumulative expected reward. The policy π is either deterministic or probabilistic. The former always uses the exact same action for a given state in the form a = π(s); the latter draws a sample from a distribution over actions when it encounters a state, i.e., a ∼ π(s, a) = P(a|s). The reinforcement learning agent needs to discover the relations between states, actions, and rewards. Hence, exploration is required, which can either be directly embedded in the policy or performed separately and only as part of the learning process.

Classical reinforcement learning approaches are based on the assumption that we have a Markov Decision Process (MDP) consisting of the set of states S, the set of actions A, the rewards R, and transition probabilities T that capture the dynamics of a system. Transition probabilities (or densities in the continuous state case) T(s′, a, s) = P(s′|s, a) describe the effects of the actions on the state. Transition probabilities generalize the notion of deterministic dynamics to allow for modeling outcomes that are uncertain even given the full state. The Markov property requires that the next state s′ and the reward only depend on the previous state s and action a (Sutton and Barto, 1998), and not on additional information about the past states or actions. In a sense, the Markov property recapitulates the idea of state – a state is a sufficient statistic for predicting the future, rendering previous observations irrelevant. In general in robotics, we may only be able to find some approximate notion of state.

Different types of reward functions are commonly used, including rewards depending only on the current state R = R(s), rewards depending on the current state and action R = R(s, a), and rewards including the transitions R = R(s′, a, s). Most of the theoretical guarantees only hold if the problem adheres to a Markov structure; however, in practice, many approaches work very well for many problems that do not fulfill this requirement.
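To make these definitions concrete, the following minimal sketch (our illustration, not part of the original survey) encodes a small tabular MDP and rolls out a stochastic policy; the number of states and actions, the transition probabilities, the rewards, and the policy itself are arbitrary choices made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# A toy 3-state, 2-action MDP: T[s, a, s'] = P(s' | s, a) and R[s, a] is the expected reward.
n_states, n_actions = 3, 2
T = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.8, 0.2, 0.0], [0.0, 0.2, 0.8]],
              [[0.0, 0.1, 0.9], [0.1, 0.0, 0.9]]])
R = np.array([[0.0, 0.1],
              [0.0, 0.5],
              [1.0, 0.0]])

# A stochastic policy pi(s, a) = P(a | s), chosen arbitrarily here.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8],
               [0.9, 0.1]])

def rollout(s, horizon=20):
    """Sample one episode: draw a ~ pi(s, .), collect R(s, a), sample s' ~ T(s, a, .)."""
    rewards = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        rewards.append(R[s, a])
        s = rng.choice(n_states, p=T[s, a])
    return rewards

print(sum(rollout(s=0)))  # accumulated reward of a single episode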

¹ When only observations but not the complete state is available, the sufficient statistics of the filter can alternatively serve as state s. Such a state is often called information or belief state.


2.1 Goals of Reinforcement Learning

The goal of reinforcement learning is to discover an optimal policy π∗ that maps states (or observations) to actions so as to maximize the expected return J, which corresponds to the cumulative expected reward. There are different models of optimal behavior (Kaelbling et al., 1996) which result in different definitions of the expected return. A finite-horizon model only attempts to maximize the expected reward for the horizon H, i.e., the next H (time-)steps h:

J = E\left\{ \sum_{h=0}^{H} R_h \right\}.

This setting can also be applied to model problems where it is known how many steps are remaining.

Alternatively, future rewards can be discounted by a discount factor γ (with 0 ≤ γ < 1):

J = E\left\{ \sum_{h=0}^{\infty} \gamma^h R_h \right\}.

This is the setting most frequently discussed in classical reinforcement learning texts. The parameter γ affects how much the future is taken into account and needs to be tuned manually. As illustrated in (Kaelbling et al., 1996), this parameter often qualitatively changes the form of the optimal solution. Policies designed by optimizing with small γ are myopic and greedy, and may lead to poor performance if we actually care about longer term rewards. It is straightforward to show that the optimal control law can be unstable if the discount factor is too low (e.g., it is not difficult to show this destabilization even for discounted linear quadratic regulation problems). Hence, discounted formulations are frequently inadmissible in robot control.

In the limit when γ approaches 1, the metric approaches what is known as the average-reward criterion (Bertsekas, 1995),

J = \lim_{H \to \infty} E\left\{ \frac{1}{H} \sum_{h=0}^{H} R_h \right\}.

This setting has the problem that it cannot distinguish between policies that initially gain a transient of large rewards and those that do not. This transient phase, also called prefix, is dominated by the rewards obtained in the long run. If a policy accomplishes both an optimal prefix as well as an optimal long-term behavior, it is called bias optimal (Lewis and Puterman, 2001). An example in robotics would be the transient phase during the start of a rhythmic movement, where many policies will accomplish the same long-term reward but differ substantially in the transient (e.g., there are many ways of starting the same gait in dynamic legged locomotion), allowing room for improvement in practical application.

In real-world domains, the shortcomings of the discounted formulation are often more critical than those of the average reward setting, as stable behavior is often more important than a good transient (Peters et al., 2004). We also often encounter an episodic control task, where the task runs only for H time-steps and is then reset (potentially by human intervention) and started over. This horizon, H, may be arbitrarily large, as long as the expected reward over the episode can be guaranteed to converge. As such episodic tasks are probably the most frequent ones, finite-horizon models are often the most relevant.
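As a concrete comparison of these three return models, the short sketch below (our illustration, not from the original text) evaluates the finite-horizon, discounted, and average-reward objectives for a single sampled reward sequence; the reward values and the discount factor γ = 0.9 are arbitrary assumptions.

import numpy as np

rewards = np.array([0.0, 0.2, 0.5, 1.0, 1.0, 1.0])        # hypothetical per-step rewards R_h
gamma, H = 0.9, len(rewards)

finite_horizon = rewards.sum()                             # J = E[ sum_{h=0}^{H} R_h ]
discounted = np.sum(gamma ** np.arange(H) * rewards)       # J = E[ sum_h gamma^h R_h ]
average = rewards.mean()                                   # J = lim_H E[ (1/H) sum_h R_h ]

print(finite_horizon, discounted, average)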

Two natural goals arise for the learner. In the first, we attempt to find an optimal strategy at the end of a phase of training or interaction. In the second, the goal is to maximize the reward over the whole time the robot is interacting with the world.

In contrast to supervised learning, the learner must first discover its environment and is not told the optimal action it needs to take. To gain information about the rewards and the behavior of the system, the agent needs to explore by considering previously unused actions or actions it is uncertain about. It needs to decide whether to play it safe and stick to well-known actions with (moderately) high rewards or to dare trying new things in order to discover new strategies with an even higher reward. This problem is commonly known as the exploration-exploitation trade-off.

In principle, reinforcement learning algorithms for Markov Decision Processes with performance guarantees are known (Kakade, 2003; Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), with polynomial scaling in the size of the state and action spaces, an additive error term, as well as in the horizon length (or a suitable substitute including the discount factor or “mixing time” (Kearns and Singh, 2002)). However, state spaces in robotics problems are often tremendously large as they scale exponentially in the number of state variables and often are continuous. This challenge of exponential growth is often referred to as the curse of dimensionality (Bellman, 1957) (also discussed in Section 3.1).

Off-policy methods learn independently of the employed policy, i.e., an explorative strategy that is different from the desired final policy can be employed during the learning process. On-policy methods collect sample information about the environment using the current policy. As a result, exploration must be built into the policy and determines the speed of the policy improvements. Such exploration and the performance of the policy can result in an exploration-exploitation trade-off between long- and short-term improvement of the policy. Modeling exploration with probability distributions has surprising implications, e.g., stochastic policies have been shown to be the optimal stationary policies for selected problems (Sutton et al., 1999; Jaakkola et al., 1993) and can even break the curse of dimensionality (Rust, 1997). Furthermore, stochastic policies often allow the derivation of new policy update steps with surprising ease.

The agent needs to determine a correlation between actions and reward signals. An action taken does not have to have an immediate effect on the reward but can also influence a reward in the distant future. The difficulty in assigning credit for rewards is directly related to the horizon or mixing time of the problem. It also increases with the dimensionality of the actions as not all parts of the action may contribute equally.

The classical reinforcement learning setup is an MDP where, in addition to the states S, actions A, and rewards R, we also have transition probabilities T(s′, a, s). Here, the reward is modeled as a reward function R(s, a). If both the transition probabilities and reward function are known, this can be seen as an optimal control problem (Powell, 2012).

2.2 Reinforcement Learning in the Average Reward Setting

We focus on the average-reward model in this section. Similar derivations exist for the finite horizon and discounted reward cases. In many instances, the average-reward case is more suitable in a robotic setting as we do not have to choose a discount factor and we do not have to explicitly consider time in the derivation.

To make a policy able to be optimized by continuous optimization techniques, we write a policy as a conditional probability distribution π(s, a) = P(a|s). Below, we consider restricted policies that are parametrized by a vector θ. In reinforcement learning, the policy is usually considered to be stationary and memoryless. Reinforcement learning and optimal control aim at finding the optimal policy π∗ or equivalent policy parameters θ∗ which maximize the average return J(π) = \sum_{s,a} \mu^\pi(s) \pi(s,a) R(s,a), where µπ is the stationary state distribution generated by policy π acting in the environment, i.e., the MDP. It can be shown (Puterman, 1994) that such policies that map states (even deterministically) to actions are sufficient to ensure optimality in this setting – a policy need not remember previous states visited, actions taken, or the particular time step. For simplicity and to ease exposition, we assume that this distribution is unique. Markov Decision Processes where this fails (i.e., non-ergodic processes) require more care in analysis, but similar results exist (Puterman, 1994). The transitions between states s caused by actions a are modeled as T(s, a, s′) = P(s′|s, a). We can then frame the control problem as an optimization of

\max_\pi J(\pi) = \sum_{s,a} \mu^\pi(s) \pi(s,a) R(s,a),   (1)
\text{s.t. } \mu^\pi(s') = \sum_{s,a} \mu^\pi(s) \pi(s,a) T(s,a,s'), \; \forall s' \in S,   (2)
1 = \sum_{s,a} \mu^\pi(s) \pi(s,a),   (3)
\pi(s,a) \geq 0, \; \forall s \in S, a \in A.

Here, Equation (2) defines stationarity of the state distributions µπ (i.e., it ensures that it is well defined) and Equation (3) ensures a proper state-action probability distribution. This optimization problem can be tackled in two substantially different ways (Bellman, 1967, 1971). We can search for the optimal solution directly in this original, primal problem or we can optimize in the Lagrange dual formulation. Optimizing in the primal formulation is known as policy search in reinforcement learning while searching in the dual formulation is known as a value function-based approach.

2.2.1 Value Function Approaches

Much of the reinforcement learning literature has focused on solving the optimization problem in Equations (1-3) in its dual form (Gordon, 1999; Puterman, 1994)². Using Lagrange multipliers V^π(s′) and R̄, we can express the Lagrangian of the problem by

L = \sum_{s,a} \mu^\pi(s) \pi(s,a) R(s,a)
    + \sum_{s'} V^\pi(s') \left[ \sum_{s,a} \mu^\pi(s) \pi(s,a) T(s,a,s') - \mu^\pi(s') \right]
    + \bar{R} \left[ 1 - \sum_{s,a} \mu^\pi(s) \pi(s,a) \right]
  = \sum_{s,a} \mu^\pi(s) \pi(s,a) \left[ R(s,a) + \sum_{s'} V^\pi(s') T(s,a,s') - \bar{R} \right]
    - \sum_{s'} V^\pi(s') \mu^\pi(s') \underbrace{\sum_{a'} \pi(s',a')}_{=1} + \bar{R}.

Using the property \sum_{s',a'} V(s') \mu^\pi(s') \pi(s',a') = \sum_{s,a} V(s) \mu^\pi(s) \pi(s,a), we can obtain the Karush-Kuhn-Tucker conditions (Kuhn and Tucker, 1950) by differentiating with respect to \mu^\pi(s) \pi(s,a), which yields extrema at

\partial_{\mu^\pi \pi} L = R(s,a) + \sum_{s'} V^\pi(s') T(s,a,s') - \bar{R} - V^\pi(s) = 0.

This statement implies that there are as many equations as the number of states multiplied by the number of actions. For each state there can be one or several optimal actions a∗ that result in the same maximal value and, hence, the condition can be written in terms of the optimal action a∗ as V^{π∗}(s) = R(s, a∗) − R̄ + \sum_{s'} V^{π∗}(s') T(s, a∗, s'). As a∗ is generated by the same optimal policy π∗, we know the condition for the multipliers at optimality is

V^*(s) = \max_{a^*} \left[ R(s,a^*) - \bar{R} + \sum_{s'} V^*(s') T(s,a^*,s') \right],   (4)

where V∗(s) is a shorthand notation for V^{π∗}(s). This statement is equivalent to the Bellman Principle of Optimality (Bellman, 1957)³ that states “An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” Thus, we have to perform an optimal action a∗ and, subsequently, follow the optimal policy π∗ in order to achieve a global optimum.

² For historical reasons, what we call the dual is often referred to in the literature as the primal. We argue that the problem of optimizing expected reward is the fundamental problem, and values are an auxiliary concept.

³ This optimality principle was originally formulated for a setting with discrete time steps and continuous states and actions but is also applicable for discrete states and actions.


When evaluating Equation (4), we realize that the optimal value function V∗(s) corresponds to the long-term additional reward, beyond the average reward R̄, gained by starting in state s while taking optimal actions a∗ (according to the optimal policy π∗). This principle of optimality has also been crucial in enabling the field of optimal control (Kirk, 1970).

Hence, we have a dual formulation of the original problem that serves as a condition for optimality. Many traditional reinforcement learning approaches are based on identifying (possibly approximate) solutions to this equation, and are known as value function methods. Instead of directly learning a policy, they first approximate the Lagrange multipliers V∗(s), also called the value function, and use it to reconstruct the optimal policy. The value function V^π(s) is defined equivalently; however, instead of always taking the optimal action a∗, the action a is picked according to a policy π:

V^\pi(s) = \sum_a \pi(s,a) \left( R(s,a) - \bar{R} + \sum_{s'} V^\pi(s') T(s,a,s') \right).

Instead of the value function V^π(s), many algorithms rely on the state-action value function Q^π(s, a), which has advantages for determining the optimal policy as shown below. This function is defined as

Q^\pi(s,a) = R(s,a) - \bar{R} + \sum_{s'} V^\pi(s') T(s,a,s').

In contrast to the value function V^π(s), the state-action value function Q^π(s, a) explicitly contains the information about the effects of a particular action. The optimal state-action value function is

Q^*(s,a) = R(s,a) - \bar{R} + \sum_{s'} V^*(s') T(s,a,s')
         = R(s,a) - \bar{R} + \sum_{s'} \left( \max_{a'} Q^*(s',a') \right) T(s,a,s').

It can be shown that an optimal, deterministic policy π∗(s) can be reconstructed by always picking, in the current state, the action a∗ that leads to the successor states s′ with the highest value V∗(s′):

\pi^*(s) = \arg\max_a \left( R(s,a) - \bar{R} + \sum_{s'} V^*(s') T(s,a,s') \right).

If the optimal value function V∗(s′) and the transition probabilities T(s, a, s′) for the following states are known, determining the optimal policy is straightforward in a setting with discrete actions, as an exhaustive search is possible. For continuous spaces, determining the optimal action a∗ is an optimization problem in itself. If both states and actions are discrete, the value function and the policy may, in principle, be represented by tables and picking the appropriate action is reduced to a look-up. For large or continuous spaces, representing the value function as a table becomes intractable. Function approximation is employed to find a lower dimensional representation that matches the real value function as closely as possible, as discussed in Section 2.4. Using the state-action value function Q∗(s, a) instead of the value function V∗(s),

\pi^*(s) = \arg\max_a \left( Q^*(s,a) \right),

avoids having to calculate the weighted sum over the successor states, and hence no knowledge of the transition function is required.

A wide variety of value function based reinforcement learning algorithms that attempt to estimate V∗(s) or Q∗(s, a) have been developed; they can be split mainly into three classes: (i) dynamic programming-based optimal control approaches such as policy iteration or value iteration, (ii) rollout-based Monte Carlo methods, and (iii) temporal difference methods such as TD(λ) (Temporal Difference learning), Q-learning, and SARSA (State-Action-Reward-State-Action).

Dynamic Programming-Based Methods require a model of the transition probabilities T(s′, a, s) and the reward function R(s, a) to calculate the value function. The model does not necessarily need to be predetermined but can also be learned from data, potentially incrementally. Such methods are called model-based. Typical methods include policy iteration and value iteration.

Policy iteration alternates between the two phases of policy evaluation and policy improvement. The approach is initialized with an arbitrary policy. Policy evaluation determines the value function for the current policy. Each state is visited and its value is updated based on the current value estimates of its successor states, the associated transition probabilities, as well as the policy. This procedure is repeated until the value function converges to a fixed point, which corresponds to the true value function. Policy improvement greedily selects the best action in every state according to the value function as shown above. The two steps of policy evaluation and policy improvement are iterated until the policy does not change any longer.
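A minimal tabular sketch of this scheme is given below (our illustration, not from the original survey). For simplicity it uses the discounted formulation with an assumed discount factor γ, and it expects the transition probabilities and expected rewards as NumPy arrays T[s, a, s'] and R[s, a].

import numpy as np

def policy_iteration(T, R, gamma=0.95, tol=1e-8, max_iter=1000):
    """Tabular policy iteration; T[s, a, s'] are transition probabilities, R[s, a] expected rewards."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)              # start from an arbitrary policy
    for _ in range(max_iter):
        # Policy evaluation: sweep the states until the value estimates stop changing.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * T[s, policy[s]] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to the evaluated value function.
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):           # policy no longer changes: done
            return policy, V
        policy = new_policy
    return policy, V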

Policy iteration only updates the policy once the policy evaluation step has converged. In contrast, value iteration combines the steps of policy evaluation and policy improvement by directly updating the value function based on Equation (4) every time a state is updated.
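For the average-reward setting of Equation (4), a corresponding sketch is relative value iteration (again our illustration, assuming a unichain, aperiodic MDP given by the same arrays T and R, and an arbitrary choice of reference state s0 = 0):

import numpy as np

def relative_value_iteration(T, R, tol=1e-8, max_iter=100_000):
    """Average-reward value iteration based on Equation (4):
    W(s) = max_a [ R(s,a) + sum_s' T(s,a,s') V(s') ], with V kept normalized so V(s0) = 0."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    R_bar = 0.0
    for _ in range(max_iter):
        Q = R + np.einsum('sap,p->sa', T, V)   # Q(s,a) = R(s,a) + expected successor value
        W = Q.max(axis=1)
        R_bar = W[0]                           # gain estimate at the reference state s0 = 0
        V_new = W - R_bar                      # renormalize so that V_new[s0] = 0
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)                  # greedy policy with respect to the final Q
    return V, R_bar, policy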

Monte Carlo Methods use sampling in order to estimate the value function. This procedure can be used to replace the policy evaluation step of the dynamic programming-based methods above. Monte Carlo methods are model-free, i.e., they do not need an explicit transition function. They perform roll-outs by executing the current policy on the system, hence operating on-policy. The frequencies of transitions and rewards are kept track of and are used to form estimates of the value function. For example, in an episodic setting the state-action value of a given state-action pair can be estimated by averaging all the returns that were received when starting from them.
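The following sketch (ours, under the assumption that episodes are available as lists of (state, action, reward) tuples collected by running the current policy) estimates such state-action values by first-visit averaging of the observed returns:

from collections import defaultdict

def mc_q_estimate(episodes):
    """First-visit Monte Carlo estimate of Q(s, a) from complete episodes.

    episodes: iterable of episodes, each a list of (state, action, reward) tuples.
    Returns a dict mapping (state, action) -> average return observed after its first visit.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        visited = set()
        rewards = [r for (_, _, r) in episode]   # undiscounted return-to-go in the episodic setting
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in visited:
                continue
            visited.add((s, a))
            returns_sum[(s, a)] += sum(rewards[t:])
            returns_cnt[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}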

Temporal Difference Methods, unlike Monte Carlo methods, do not have to wait until an estimate of the return is available (i.e., at the end of an episode) to update the value function. Rather, they use temporal errors and only have to wait until the next time step. The temporal error is the difference between the old estimate and a new estimate of the value function, taking into account the reward received in the current sample. These updates are done iteratively and, in contrast to dynamic programming methods, only take into account the sampled successor states rather than the complete distributions over successor states. Like the Monte Carlo methods, these methods are model-free, as they do not use a model of the transition function to determine the value function. In this setting, the value function cannot be calculated analytically but has to be estimated from sampled transitions in the MDP. For example, the value function could be updated iteratively by

V'(s) = V(s) + \alpha \left( R(s,a) - \bar{R} + V(s') - V(s) \right),

where V(s) is the old estimate of the value function, V′(s) the updated one, and α is a learning rate. This update step is called the TD(0)-algorithm in the discounted reward case. In order to perform action selection, a model of the transition function is still required.

The equivalent temporal difference learning algorithm for state-action value functions is the average reward case version of SARSA with

Q'(s,a) = Q(s,a) + \alpha \left( R(s,a) - \bar{R} + Q(s',a') - Q(s,a) \right),

where Q(s, a) is the old estimate of the state-action value function and Q′(s, a) the updated one. This algorithm is on-policy as both the current action a as well as the subsequent action a′ are chosen according to the current policy π. The off-policy variant is called R-learning (Schwartz, 1993), which is closely related to Q-learning, with the updates

Q'(s,a) = Q(s,a) + \alpha \left( R(s,a) - \bar{R} + \max_{a'} Q(s',a') - Q(s,a) \right).

These methods do not require a model of the transition function for determining the deterministic optimal policy π∗(s). H-learning (Tadepalli and Ok, 1994) is a related method that estimates a model of the transition probabilities and the reward function in order to perform updates that are reminiscent of value iteration.
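In code, the two update rules above reduce to single lines. The sketch below (ours) assumes Q is a NumPy array of shape (n_states, n_actions), an arbitrary learning rate α of 0.1, and an average-reward estimate R_bar that is maintained elsewhere, e.g., as a running average of the observed rewards.

import numpy as np

def sarsa_update(Q, R_bar, s, a, r, s_next, a_next, alpha=0.1):
    # On-policy average-reward SARSA: a_next is drawn from the current policy.
    Q[s, a] += alpha * (r - R_bar + Q[s_next, a_next] - Q[s, a])

def r_learning_update(Q, R_bar, s, a, r, s_next, alpha=0.1):
    # Off-policy R-learning: bootstrap with the greedy value max_a' Q(s', a').
    Q[s, a] += alpha * (r - R_bar + Q[s_next].max() - Q[s, a])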

An overview of publications using value function based methods is presented in Table 1. Here, model-based methods refer to all methods that employ a predetermined or a learned model of the system dynamics.

2.2.2 Policy Search

The primal formulation of the problem in terms of policy rather than value offers many features relevant to robotics. It allows for a natural integration of expert knowledge, e.g., through both structure and initializations of the policy. It allows domain-appropriate pre-structuring of the policy in an approximate form without changing the original problem. Optimal policies often have many fewer parameters than optimal value functions. For example, in linear quadratic control, the value function has quadratically many parameters in the dimensionality of the state variables while the policy requires only linearly many parameters. Local search in policy space can directly lead to good results as exhibited by early hill-climbing approaches (Kirk, 1970), as well as more recent successes (see Table 2). Additional constraints can be incorporated naturally, e.g., regularizing the change in the path distribution. As a result, policy search often appears more natural to robotics.

Nevertheless, policy search has been considered the harder problem for a long time as the optimal solution cannot directly be determined from Equations (1-3) while the solution of the dual problem leveraging the Bellman Principle of Optimality (Bellman, 1957) enables dynamic programming based solutions.

Notwithstanding this, in robotics, policy search has recently become an important alternative to value function based methods due to better scalability as well as the convergence problems of approximate value function methods (see Sections 2.3 and 4.2). Most policy search methods optimize locally around existing policies π, parametrized by a set of policy parameters θi, by computing changes in the policy parameters ∆θi that will increase the expected return, which results in iterative updates of the form

\theta_{i+1} = \theta_i + \Delta\theta_i.

The computation of the policy update is the key step here and a variety of updates have been proposed, ranging from pairwise comparisons (Strens and Moore, 2001; Ng et al., 2004a) over gradient estimation using finite policy differences (Geng et al., 2006; Kohl and Stone, 2004; Mitsunaga et al., 2005; Roberts et al., 2010; Sato et al., 2002; Tedrake et al., 2005), and general stochastic optimization methods (such as Nelder-Mead (Bagnell and Schneider, 2001), cross entropy (Rubinstein and Kroese, 2004) and population-based methods (Goldberg, 1989)) to approaches coming from optimal control such as differential dynamic programming (DDP) (Atkeson, 1998) and multiple shooting approaches (Betts, 2001). We may broadly break down policy-search methods into “black box” and “white box” methods. Black box methods are general stochastic optimization algorithms (Spall, 2003) using only the expected return of policies, estimated by sampling, and do not leverage any of the internal structure of the RL problem. These may be very sophisticated techniques (Tesch et al., 2011) that use response surface estimates and bandit-like strategies to achieve good performance. White box methods take advantage of some of the additional structure within the reinforcement learning domain, including, for instance, the (approximate) Markov structure of problems, developing approximate models, value-function estimates when available (Peters and Schaal, 2008c), or even simply the causal ordering of actions and rewards.


Value Function Approaches

Approach: Employed by...

Model-Based: Bakker et al. (2006); Hester et al. (2010, 2012); Kalmár et al. (1998); Martínez-Marín and Duckett (2005); Schaal (1996); Touzet (1997)

Model-Free: Asada et al. (1996); Bakker et al. (2003); Benbrahim et al. (1992); Benbrahim and Franklin (1997); Birdwell and Livingston (2007); Bitzer et al. (2010); Conn and Peters II (2007); Duan et al. (2007, 2008); Fagg et al. (1998); Gaskett et al. (2000); Gräve et al. (2010); Hafner and Riedmiller (2007); Huang and Weng (2002); Huber and Grupen (1997); Ilg et al. (1999); Katz et al. (2008); Kimura et al. (2001); Kirchner (1997); Konidaris et al. (2011a, 2012); Kroemer et al. (2009, 2010); Kwok and Fox (2004); Latzke et al. (2007); Mahadevan and Connell (1992); Matarić (1997); Morimoto and Doya (2001); Nemec et al. (2009, 2010); Oßwald et al. (2010); Paletta et al. (2007); Pendrith (1999); Platt et al. (2006); Riedmiller et al. (2009); Rottmann et al. (2007); Smart and Kaelbling (1998, 2002); Soni and Singh (2006); Tamošiunaite et al. (2011); Thrun (1995); Tokic et al. (2009); Touzet (1997); Uchibe et al. (1998); Wang et al. (2006); Willgoss and Iqbal (1999)

Table 1: This table illustrates different value function based reinforcement learning methods employed for robotic tasks (both average and discounted reward cases) and associated publications.

A major open issue within the field is the relative merits of these two approaches: in principle, white box methods leverage more information, but with the exception of models (which have been demonstrated repeatedly to often make tremendous performance improvements, see Section 6), the performance gains are traded off against additional assumptions that may be violated and less mature optimization algorithms. Some recent work, including (Stulp and Sigaud, 2012; Tesch et al., 2011), suggests that much of the benefit of policy search is achieved by black-box methods.

Some of the most popular white-box general reinforcement learning techniques that have translated particularly well into the domain of robotics include: (i) policy gradient approaches based on likelihood-ratio estimation (Sutton et al., 1999), (ii) policy updates inspired by expectation-maximization (Toussaint et al., 2010), and (iii) the path integral methods (Kappen, 2005).

Let us briefly take a closer look at gradient-based approaches first. The updates of the policy parameters are based on a hill-climbing approach, that is, following the gradient of the expected return J for a defined step-size α:

\theta_{i+1} = \theta_i + \alpha \nabla_\theta J.

Different methods exist for estimating the gradient ∇θJ and many algorithms require tuning of the step-size α.

In finite difference gradients, P perturbed policy parameters are evaluated to obtain an estimate of the gradient. Here we have ∆Jp ≈ J(θi + ∆θp) − Jref, where p = [1..P] are the individual perturbations, ∆Jp the estimate of their influence on the return, and Jref is a reference return, e.g., the return of the unperturbed parameters. The gradient can now be estimated by linear regression:

\nabla_\theta J \approx \left( \Delta\Theta^T \Delta\Theta \right)^{-1} \Delta\Theta^T \Delta J,

where the matrix ∆Θ contains all the stacked samples of the perturbations ∆θp and ∆J contains the corresponding ∆Jp. In order to estimate the gradient, the number of perturbations needs to be at least as large as the number of parameters. The approach is very straightforward and even applicable to policies that are not differentiable. However, it is usually considered to be very noisy and inefficient. For the finite difference approach, tuning the step-size α for the update, the number of perturbations P, and the type and magnitude of perturbations are all critical tuning factors.
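A compact sketch of this estimator is given below (our illustration). It assumes access to a function J(theta) that performs roll-outs for the given parameters and returns an estimate of the expected return; the perturbation magnitude and the number of perturbations are arbitrary choices.

import numpy as np

def finite_difference_gradient(J, theta, n_perturbations=None, eps=0.1, rng=None):
    """Estimate grad_theta J by regressing return differences on random parameter perturbations,
    i.e., grad approx. (dTheta^T dTheta)^-1 dTheta^T dJ as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    P = n_perturbations or 2 * len(theta)        # at least as many perturbations as parameters
    J_ref = J(theta)                             # reference return of the unperturbed policy
    d_theta = eps * rng.standard_normal((P, len(theta)))
    d_J = np.array([J(theta + d) for d in d_theta]) - J_ref
    grad, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return grad

# One gradient-ascent step with step-size alpha would then be: theta += alpha * grad.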

Likelihood ratio methods rely on the insight that, in an episodic setting where the episodes τ are generated according to the distribution P^θ(τ) = P(τ|θ) with the return of an episode J_τ = \sum_{h=1}^{H} R_h and number of steps H, the expected return for a set of policy parameters θ can be expressed as

J^\theta = \sum_\tau P^\theta(\tau) J_\tau.   (5)

The gradient of the episode distribution can be written as⁴

\nabla_\theta P^\theta(\tau) = P^\theta(\tau) \nabla_\theta \log P^\theta(\tau),   (6)

which is commonly known as the likelihood ratio or REINFORCE (Williams, 1992) trick. Combining Equations (5) and (6), we get the gradient of the expected return in the form

\nabla_\theta J^\theta = \sum_\tau \nabla_\theta P^\theta(\tau) J_\tau = \sum_\tau P^\theta(\tau) \nabla_\theta \log P^\theta(\tau) J_\tau = E\left\{ \nabla_\theta \log P^\theta(\tau) J_\tau \right\}.

If we have a stochastic policy π_θ(s, a) that generates the episodes τ, we do not need to keep track of the probabilities of the episodes but can directly express the gradient in terms of the policy as \nabla_\theta \log P^\theta(\tau) = \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta(s_h, a_h). Finally, the gradient of the expected return with respect to the policy parameters can be estimated as

\nabla_\theta J^\theta = E\left\{ \left( \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta(s_h, a_h) \right) J_\tau \right\}.

⁴ From multi-variate calculus we have \nabla_\theta \log P^\theta(\tau) = \nabla_\theta P^\theta(\tau) / P^\theta(\tau).


If we now take into account that rewards at the beginning of an episode cannot be caused by actions taken at the end of an episode, we can replace the return of the episode J_τ by the state-action value function Q^π(s, a) and get (Peters and Schaal, 2008c)

\nabla_\theta J^\theta = E\left\{ \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta(s_h, a_h) \, Q^\pi(s_h, a_h) \right\},

which is equivalent to the policy gradient theorem (Sutton et al., 1999). In practice, it is often advisable to subtract a reference Jref, also called a baseline, from the return of the episode Jτ or the state-action value function Qπ(s, a), respectively, to get better estimates, similar to the finite difference approach. In these settings, the exploration is automatically taken care of by the stochastic policy.
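The sketch below (ours) turns the episode-return form of this estimator, with a baseline, into code. It assumes that episodes are lists of (state, action, reward) tuples sampled from the current stochastic policy, and that a function grad_log_pi(s, a) returning ∇θ log πθ(a|s) as a NumPy vector is available; both names are hypothetical.

import numpy as np

def reinforce_gradient(episodes, grad_log_pi, baseline=None):
    """Likelihood-ratio (REINFORCE) estimate of grad_theta J from sampled episodes.

    episodes: list of episodes, each a list of (state, action, reward) tuples
              generated by the current stochastic policy pi_theta.
    grad_log_pi: function (state, action) -> grad_theta log pi_theta(a | s) as a NumPy vector.
    baseline: optional reference return J_ref subtracted for variance reduction.
    """
    returns = np.array([sum(r for (_, _, r) in ep) for ep in episodes])   # J_tau per episode
    if baseline is None:
        baseline = returns.mean()                 # a simple choice of baseline
    grads = []
    for ep, J_tau in zip(episodes, returns):
        score = sum(grad_log_pi(s, a) for (s, a, _) in ep)   # sum_h grad log pi(a_h | s_h)
        grads.append(score * (J_tau - baseline))
    return np.mean(grads, axis=0)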

Initial gradient-based approaches such as finite difference gradients or REINFORCE (REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility) (Williams, 1992) have been rather slow. The weight perturbation algorithm is related to REINFORCE but can deal with non-Gaussian distributions which significantly improves the signal to noise ratio of the gradient (Roberts et al., 2010). Recent natural policy gradient approaches (Peters and Schaal, 2008c,b) have allowed for faster convergence which may be advantageous for robotics as it reduces the learning time and required real-world interactions.

A different class of safe and fast policy search methods, inspired by expectation-maximization, can be derived when the reward is treated as an improper probability distribution (Dayan and Hinton, 1997). Some of these approaches have proven successful in robotics, e.g., reward-weighted regression (Peters and Schaal, 2008a), Policy Learning by Weighting Exploration with the Returns (Kober and Peters, 2009), Monte Carlo Expectation-Maximization (Vlassis et al., 2009), and Cost-regularized Kernel Regression (Kober et al., 2010). Algorithms with closely related update rules can also be derived from different perspectives, including Policy Improvements with Path Integrals (Theodorou et al., 2010) and Relative Entropy Policy Search (Peters et al., 2010a).

Finally, the Policy Search by Dynamic Programming (Bagnell et al., 2003) method is a general strategy that combines policy search with the principle of optimality. The approach learns a non-stationary policy backward in time like dynamic programming methods, but does not attempt to enforce the Bellman equation and the resulting approximation instabilities (see Section 2.4). The resulting approach provides some of the strongest guarantees that are currently known under function approximation and limited observability. It has been demonstrated in learning walking controllers and in finding near-optimal trajectories for map exploration (Kollar and Roy, 2008). The resulting method is more expensive than the value function methods because it scales quadratically in the effective time horizon of the problem. Like DDP methods (Atkeson, 1998), it is tied to a non-stationary (time-varying) policy.

An overview of publications using policy search methods is presented in Table 2.

One of the key open issues in the field is determining when it is appropriate to use each of these methods. Some approaches leverage significant structure specific to the RL problem (e.g., Theodorou et al. (2010)), including reward structure, Markovanity, causality of reward signals (Williams, 1992), and value-function estimates when available (Peters and Schaal, 2008c). Others embed policy search as a generic, black-box problem of stochastic optimization (Bagnell and Schneider, 2001; Lizotte et al., 2007; Kuindersma et al., 2011; Tesch et al., 2011). Significant open questions remain regarding which methods are best in which circumstances and, at an even more basic level, how effective leveraging the kinds of problem structure mentioned above is in practice.

2.3 Value Function Approaches versus Policy Search

Some methods attempt to find a value function or policy which eventually can be employed without significant further computation, whereas others (e.g., the roll-out methods) perform the same amount of computation each time.

If a complete optimal value function is known, a globally optimal solution follows simply by greedily choosing actions to optimize it. However, value-function based approaches have thus far been difficult to translate into high-dimensional robotics as they require function approximation for the value function. Most theoretical guarantees no longer hold for this approximation, and even finding the optimal action can be a hard problem due to the brittleness of the approximation and the cost of optimization. For high-dimensional actions, computing an improved policy for all states in policy search can be as hard as finding a single optimal action on-policy for one state by searching the state-action value function.

In principle, a value function requires total coverage of the state space, and the largest local error determines the quality of the resulting policy. A particularly significant problem is the error propagation in value functions: a small change in the policy may cause a large change in the value function, which again causes a large change in the policy. While this may lead more quickly to good, possibly globally optimal solutions, such learning processes often prove unstable under function approximation (Boyan and Moore, 1995; Kakade and Langford, 2002; Bagnell et al., 2003) and are considerably more dangerous when applied to real systems, where overly large policy deviations may lead to dangerous decisions.

In contrast, policy search methods usually only consider the current policy and its neighborhood in order to gradually improve performance. The result is that usually only local optima, and not the global one, can be found. However, these methods work well in conjunction with continuous features. Local coverage and local errors result in improved scalability in robotics.

Policy search methods are sometimes called actor-only methods; value function methods are sometimes called critic-only methods.


Policy Search

Gradient: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Endo et al. (2008); Fidelman and Stone (2004); Geng et al. (2006); Guenter et al. (2007); Gullapalli et al. (1994); Hailu and Sommer (1998); Ko et al. (2007); Kohl and Stone (2004); Kolter and Ng (2009a); Michels et al. (2005); Mitsunaga et al. (2005); Miyamoto et al. (1996); Ng et al. (2004a,b); Peters and Schaal (2008c,b); Roberts et al. (2010); Rosenstein and Barto (2004); Tamei and Shibata (2009); Tedrake (2004); Tedrake et al. (2005)

Other: Abbeel et al. (2006, 2007); Atkeson and Schaal (1997); Atkeson (1998); Bagnell and Schneider (2001); Bagnell (2004); Buchli et al. (2011); Coates et al. (2009); Daniel et al. (2012); Donnart and Meyer (1996); Dorigo and Colombetti (1993); Erden and Leblebicioğlu (2008); Kalakrishnan et al. (2011); Kober and Peters (2009); Kober et al. (2010); Kolter et al. (2008); Kuindersma et al. (2011); Lizotte et al. (2007); Matarić (1994); Pastor et al. (2011); Peters and Schaal (2008a); Peters et al. (2010a); Schaal and Atkeson (1994); Stulp et al. (2011); Svinin et al. (2001); Tamošiunaite et al. (2011); Yasuda and Ohkura (2008); Youssef (2005)

Table 2: This table illustrates different policy search reinforcement learning methods employed for robotic tasks and associated publications.

The idea of a critic is to first observe and estimate the performance of choosing controls on the system (i.e., the value function), and then to derive a policy based on the gained knowledge. In contrast, the actor directly tries to deduce the optimal policy. A set of algorithms called actor-critic methods attempt to incorporate the advantages of each: a policy is explicitly maintained, as is a value function for the current policy. The value function (i.e., the critic) is not employed for action selection. Instead, it observes the performance of the actor and decides when the policy needs to be updated and which action should be preferred. The resulting update step features the local convergence properties of policy gradient algorithms while reducing update variance (Greensmith et al., 2004). There is a trade-off between the benefit of reducing the variance of the updates and having to learn a value function, as the samples required to estimate the value function could also be employed to obtain better gradient estimates for the update step. Rosenstein and Barto (2004) propose an actor-critic method that additionally features a supervisor in the form of a stable policy.
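A minimal tabular actor-critic update illustrating this division of labor is sketched below: the critic maintains a value estimate via temporal-difference learning, and the actor's action preferences are nudged by the TD error. The tabular representation, softmax actor, and step sizes are placeholder assumptions for illustration only.

```python
import numpy as np

def actor_critic_step(V, prefs, s, a, r, s_next,
                      alpha_critic=0.1, alpha_actor=0.01, gamma=0.99):
    """One tabular actor-critic update; V is the critic, prefs the actor's preferences."""
    td_error = r + gamma * V[s_next] - V[s]          # critic evaluates the actor's behavior
    V[s] += alpha_critic * td_error                  # critic update (policy evaluation)

    # Actor update: increase the preference of the taken action if it outperformed expectation.
    pi = np.exp(prefs[s]) / np.exp(prefs[s]).sum()   # softmax policy over discrete actions
    grad_log = -pi
    grad_log[a] += 1.0                               # gradient of log pi(a|s) w.r.t. preferences
    prefs[s] += alpha_actor * td_error * grad_log
    return td_error
```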

2.4 Function Approximation

Function approximation (Rivlin, 1969) is a family of mathematical and statistical techniques used to represent a function of interest when it is computationally or information-theoretically intractable to represent the function exactly or explicitly (e.g., in tabular form). Typically, in reinforcement learning the function approximation is based on sample data collected during interaction with the environment. Function approximation is critical in nearly every RL problem, and becomes inevitable in continuous-state ones. In large discrete spaces it is also often impractical to visit or even represent all states and actions, and function approximation in this setting can be used as a means to generalize to neighboring states and actions.

Function approximation can be employed to represent policies, value functions, and forward models. Broadly speaking, there are two kinds of function approximation methods: parametric and non-parametric. A parametric function approximator uses a finite set of parameters or arguments, and the goal is to find parameters that make this approximation fit the observed data as closely as possible. Examples include linear basis functions and neural networks. In contrast, non-parametric methods expand representational power in relation to the collected data and hence are not limited by the representation power of a chosen parametrization (Bishop, 2006). A prominent example that has found much use within reinforcement learning is Gaussian process regression (Rasmussen and Williams, 2006). A fundamental problem with using supervised learning methods developed in the literature for function approximation is that most such methods are designed for independently and identically distributed sample data. However, the data generated by the reinforcement learning process is usually neither independent nor identically distributed. Usually, the function approximator itself plays some role in the data collection process (for instance, by serving to define a policy that we execute on a robot).

Linear basis function approximators form one of the most widely used approximate value function techniques in continuous (and discrete) state spaces. This is largely due to the simplicity of their representation as well as a convergence theory, albeit limited, for the approximation of value functions based on samples (Tsitsiklis and Van Roy, 1997). Let us briefly take a closer look at a radial basis function network to illustrate this approach. The value function maps states to a scalar value. The state space can be covered by a grid of points, each of which corresponds to the center of a Gaussian-shaped basis function. The value of the approximated function is the weighted sum of the values of all basis functions at the query point. As the influence of the Gaussian basis functions drops rapidly, the value at a query point is predominantly influenced by the neighboring basis functions. The weights are set so as to minimize the error between the observed samples and the reconstruction. For the mean squared error, these weights can be determined by linear regression. Kolter and Ng (2009b) discuss the benefits of regularization of such linear function approximators to avoid over-fitting.
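As an illustration of the radial basis function network just described, the following sketch places Gaussian basis functions on a grid over a one-dimensional state space and fits the weights to sampled values by (regularized) linear regression. The grid, bandwidth, ridge term, and synthetic targets are arbitrary choices for the example.

```python
import numpy as np

def rbf_features(states, centers, bandwidth=0.5):
    """Evaluate Gaussian basis functions centered on a grid at the query states."""
    dist2 = (states[:, None] - centers[None, :]) ** 2
    return np.exp(-dist2 / (2.0 * bandwidth**2))

centers = np.linspace(-1.0, 1.0, 11)            # grid of basis-function centers
states = np.random.uniform(-1.0, 1.0, 200)      # sampled states
targets = np.sin(3.0 * states)                  # stand-in for observed value samples

Phi = rbf_features(states, centers)
ridge = 1e-3 * np.eye(len(centers))             # regularization, cf. Kolter and Ng (2009b)
weights = np.linalg.solve(Phi.T @ Phi + ridge, Phi.T @ targets)

# Approximate value at a query state: weighted sum of the basis functions.
value_estimate = rbf_features(np.array([0.2]), centers) @ weights
```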

Other possible function approximators for value functions include wire fitting, which Baird and Klopf (1993) suggested as an approach that makes continuous action selection feasible. The Fourier basis has been suggested by Konidaris et al. (2011b). Even discretizing the state space can be seen as a form of function approximation where coarse values serve as estimates for a smooth continuous function. One example is tile coding (Sutton and Barto, 1998), where the space is subdivided into (potentially irregularly shaped) regions, called tilings. The number of different tilings determines the resolution of the final approximation. For more examples, please refer to Sections 4.1 and 4.2.

Policy search also benefits from a compact representation of the policy, as discussed in Section 4.3.

Models of the system dynamics can be represented using a wide variety of techniques. In this case, it is often important to model the uncertainty in the model (e.g., by a stochastic model or Bayesian estimates of model parameters) to ensure that the learning algorithm does not exploit model inaccuracies. See Section 6 for a more detailed discussion.

3 Challenges in Robot Reinforcement Learning

Reinforcement learning is generally a hard problem and many of its challenges are particularly apparent in the robotics setting. As the states and actions of most robots are inherently continuous, we are forced to consider the resolution at which they are represented. We must decide how fine-grained the control is that we require over the robot, whether we employ discretization or function approximation, and what time step we establish. Additionally, as the dimensionality of both states and actions can be high, we face the “Curse of Dimensionality” (Bellman, 1957) as discussed in Section 3.1. As robotics deals with complex physical systems, samples can be expensive due to the long execution time of complete tasks, required manual interventions, and the need for maintenance and repair. In these real-world measurements, we must cope with the uncertainty inherent in complex physical systems. A robot requires that the algorithm runs in real-time. The algorithm must be capable of dealing with delays in sensing and execution that are inherent in physical systems (see Section 3.2). A simulation might alleviate many problems but these approaches need to be robust with respect to model errors, as discussed in Section 3.3. An often underestimated problem is the goal specification, which is achieved by designing a good reward function. As noted in Section 3.4, this choice can make the difference between feasibility and an unreasonable amount of exploration.

Figure 3: This figure illustrates the state space used in the modeling of a robot reinforcement learning task of paddling a ball.

3.1 Curse of Dimensionality

When Bellman (1957) explored optimal control in discrete high-dimensional spaces, he faced an exponential explosion of states and actions for which he coined the term “Curse of Dimensionality”. As the number of dimensions grows, exponentially more data and computation are needed to cover the complete state-action space. For example, if we assume that each dimension of a state space is discretized into ten levels, we have 10 states for a one-dimensional state space, 10^3 = 1000 unique states for a three-dimensional state space, and 10^n possible states for an n-dimensional state space. Evaluating every state quickly becomes infeasible with growing dimensionality, even for discrete states. Bellman originally coined the term in the context of optimization, but it also applies to function approximation and numerical integration (Donoho, 2000). While supervised learning methods have tamed this exponential growth by considering only competitive optimality with respect to a limited class of function approximators, such results are much more difficult in reinforcement learning, where data must be collected throughout the state space to ensure global optimality.

Robotic systems often have to deal with such high-dimensional states and actions due to the many degrees of freedom of modern anthropomorphic robots. For example, in the ball-paddling task shown in Figure 3, a proper representation of a robot’s state would consist of its joint angles and velocities for each of its seven degrees of freedom as well as the Cartesian position and velocity of the ball. The robot’s actions would be the generated motor commands, which often are torques or accelerations. In this example, we have 2 × (7 + 3) = 20 state dimensions and 7-dimensional continuous actions. Obviously, other tasks may require even more dimensions. For example, human-like actuation often follows the antagonistic principle (Yamaguchi and Takanishi, 1997), which additionally enables control of stiffness. Such dimensionality is a major challenge for both the robotics and the reinforcement learning communities.

In robotics, such tasks are often rendered tractable to the robot engineer by a hierarchical task decomposition that shifts some complexity to a lower layer of functionality. Classical reinforcement learning approaches often consider a grid-based representation with discrete states and actions, often referred to as a grid-world. A navigational task for mobile robots could be projected into this representation by employing a number of actions like “move to the cell to the left” that use a lower-level controller that takes care of accelerating, moving, and stopping while ensuring precision. In the ball-paddling example, we may simplify by controlling the robot in racket space (which is lower-dimensional as the racket is orientation-invariant around the string’s mounting point) with an operational space control law (Nakanishi et al., 2008). Many commercial robot systems also encapsulate some of the state and action components in an embedded control system (e.g., trajectory fragments are frequently used as actions for industrial robots). However, this form of state dimensionality reduction severely limits the dynamic capabilities of the robot according to our experience (Schaal et al., 2002; Peters et al., 2010b).

The reinforcement learning community has a long history of dealing with dimensionality using computational abstractions. It offers a larger set of applicable tools, ranging from adaptive discretizations (Buşoniu et al., 2010) and function approximation approaches (Sutton and Barto, 1998) to macro-actions or options (Barto and Mahadevan, 2003; Hart and Grupen, 2011). Options allow a task to be decomposed into elementary components and quite naturally translate to robotics. Such options can autonomously achieve a sub-task, such as opening a door, which reduces the planning horizon (Barto and Mahadevan, 2003). The automatic generation of such sets of options is a key issue in order to enable such approaches. We will discuss approaches that have been successful in robot reinforcement learning in Section 4.

3.2 Curse of Real-World Samples

Robots inherently interact with the physical world. Hence, robot reinforcement learning suffers from most of the resulting real-world problems. For example, robot hardware is usually expensive, suffers from wear and tear, and requires careful maintenance. Repairing a robot system is a non-negligible effort associated with cost, physical labor and long waiting periods. To apply reinforcement learning in robotics, safe exploration becomes a key issue of the learning process (Schneider, 1996; Bagnell, 2004; Deisenroth and Rasmussen, 2011; Moldovan and Abbeel, 2012), a problem often neglected in the general reinforcement learning community. Perkins and Barto (2002) have come up with a method for constructing reinforcement learning agents based on Lyapunov functions. Switching between the underlying controllers is always safe and offers basic performance guarantees.

However, several more aspects of the real world make robotics a challenging domain. As the dynamics of a robot can change due to many external factors ranging from temperature to wear, the learning process may never fully converge, i.e., it needs a “tracking solution” (Sutton et al., 2007). Frequently, the environment settings during an earlier learning period cannot be reproduced. External factors are not always clear – for example, how light conditions affect the performance of the vision system and, as a result, the task’s performance. This problem makes comparing algorithms particularly hard. Furthermore, the approaches often have to deal with uncertainty due to inherent measurement noise and the inability to observe all states directly with sensors.

Most real robot learning tasks require some form of human supervision, e.g., putting the pole back on the robot’s end-effector during pole balancing (see Figure 1d) after a failure. Even when an automatic reset exists (e.g., by having a smart mechanism that resets the pole), learning speed becomes essential as a task on a real robot cannot be sped up. In some tasks, such as a slowly rolling robot, the dynamics can be ignored; in others, such as a flying robot, they cannot. Especially in the latter case, often the whole episode needs to be completed as it is not possible to start from arbitrary states.

For such reasons, real-world samples are expensive in terms of time, labor and, potentially, finances. In robotic reinforcement learning, it is often considered more important to limit the real-world interaction time than to limit memory consumption or computational complexity. Thus, sample-efficient algorithms that are able to learn from a small number of trials are essential. In Section 6 we will point out several approaches that allow the amount of required real-world interactions to be reduced.

Since the robot is a physical system, there are strict constraints on the interaction between the learning algorithm and the robot setup. For dynamic tasks, the movement cannot be paused and actions must be selected within a time-budget without the opportunity to pause to think, learn or plan between actions. These constraints are less severe in an episodic setting where the time-intensive part of the learning can be postponed to the period between episodes. Hester et al. (2012) have proposed a real-time architecture for model-based value function reinforcement learning methods taking these challenges into account.

As reinforcement learning algorithms are inherently implemented on a digital computer, the discretization of time is unavoidable even though physical systems are inherently continuous-time systems. Time-discretization of the actuation can generate undesirable artifacts (e.g., the distortion of distance between states) even for idealized physical systems, which cannot be avoided. As most robots are controlled at fixed sampling frequencies (in the range between 500 Hz and 3 kHz) determined by the manufacturer of the robot, the upper bound on the rate of temporal discretization is usually pre-determined. The lower bound depends on the horizon of the problem, the achievable speed of changes in the state, as well as delays in sensing and actuation.

All physical systems exhibit such delays in sensing and actuation. The state of the setup (represented by the filtered sensor signals) may frequently lag behind the real state due to processing and communication delays. More critically, there are also communication delays in actuation as well as delays due to the fact that neither motors, gear boxes nor the body’s movement can change instantly. Due to these delays, actions may not have instantaneous effects but are observable only several time steps later. In contrast, in most general reinforcement learning algorithms, the actions are assumed to take effect instantaneously, as such delays would violate the usual Markov assumption. This effect can be addressed by putting some number of recent actions into the state, as sketched below. However, this significantly increases the dimensionality of the problem.
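A minimal wrapper that augments the observation with the most recent actions, as a way to restore approximate Markovanity under such delays, might look as follows. The interface and the history length are illustrative assumptions, not a prescribed design.

```python
from collections import deque
import numpy as np

class DelayAugmentedState:
    """Appends the k most recent actions to the observation to account for delays."""

    def __init__(self, action_dim, history_length=3):
        self.history = deque([np.zeros(action_dim)] * history_length,
                             maxlen=history_length)

    def augment(self, observation, last_action=None):
        if last_action is not None:
            self.history.append(np.asarray(last_action))
        # The augmented state grows by history_length * action_dim dimensions,
        # which is exactly the dimensionality increase noted above.
        return np.concatenate([observation, *self.history])
```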

The problems related to time-budgets and delays can also be avoided by increasing the duration of the time steps. One downside of this approach is that the robot cannot be controlled as precisely; another is that it may complicate the description of the system dynamics.

3.3 Curse of Under-Modeling and Model Uncertainty

One way to offset the cost of real-world interaction is to use accurate models as simulators. In an ideal setting, this approach would render it possible to learn the behavior in simulation and subsequently transfer it to the real robot. Unfortunately, creating a sufficiently accurate model of the robot and its environment is challenging and often requires very many data samples. As small model errors due to this under-modeling accumulate, the simulated robot can quickly diverge from the real-world system. When a policy is trained using an imprecise forward model as simulator, the behavior will not transfer without significant modifications, as experienced by Atkeson (1994) when learning the underactuated pendulum swing-up. The authors have achieved a direct transfer in only a limited number of experiments; see Section 6.1 for examples.

For tasks where the system is self-stabilizing (that is, where the robot does not require active control to remain in a safe state or return to it), transferring policies often works well. Such tasks often feature some type of damping that absorbs the energy introduced by perturbations or control inaccuracies. If the task is inherently stable, it is safer to assume that approaches that were applied in simulation work similarly in the real world (Kober and Peters, 2010). Nevertheless, tasks can often be learned better in the real world than in simulation due to complex mechanical interactions (including contacts and friction) that have proven difficult to model accurately. For example, in the ball-paddling task (Figure 3) the elastic string that attaches the ball to the racket always pulls the ball back towards the racket, even when hit very hard. Initial simulations (including friction models, restitution models, damping models, models for the elastic string, and air drag) of the ball-racket contacts indicated that these factors would be very hard to control. In a real experiment, however, the reflections of the ball on the racket proved to be less critical than in simulation, and the stabilizing forces due to the elastic string were sufficient to render the whole system self-stabilizing.

In contrast, in unstable tasks small variations havedrastic consequences. For example, in a pole balanc-ing task, the equilibrium of the upright pole is verybrittle and constant control is required to stabilize thesystem. Transferred policies often perform poorly inthis setting. Nevertheless, approximate models servea number of key roles which we discuss in Section 6,including verifying and testing the algorithms in simu-lation, establishing proximity to theoretically optimalsolutions, calculating approximate gradients for localpolicy improvement, identifing strategies for collectingmore data, and performing “mental rehearsal”.

3.4 Curse of Goal Specification

In reinforcement learning, the desired behavior is implicitly specified by the reward function. The goal of reinforcement learning algorithms then is to maximize the accumulated long-term reward. While often dramatically simpler than specifying the behavior itself, in practice it can be surprisingly difficult to define a good reward function in robot reinforcement learning. The learner must observe variance in the reward signal in order to be able to improve a policy: if the same return is always received, there is no way to determine which policy is better or closer to the optimum.

In many domains, it seems natural to provide rewards only upon task achievement – for example, when a table tennis robot wins a match. This view results in an apparently simple, binary reward specification. However, a robot may receive such a reward so rarely that it is unlikely to ever succeed in the lifetime of a real-world system. Instead of relying on simpler binary rewards, we frequently need to include intermediate rewards in the scalar reward function to guide the learning process to a reasonable solution, a process known as reward shaping (Laud, 2004).

Beyond the need to shorten the effective problem horizon by providing intermediate rewards, the trade-off between different factors may be essential. For instance, hitting a table tennis ball very hard may result in a high score but is likely to damage a robot or shorten its life span. Similarly, changes in actions may be penalized to avoid high-frequency controls that are likely to be very poorly captured with tractable low-dimensional state-space or rigid-body models. Reinforcement learning algorithms are also notorious for exploiting the reward function in ways that are not anticipated by the designer. For example, if the distance between the ball and the desired highest point is part of the reward in ball paddling (see Figure 3), many locally optimal solutions would attempt to simply move the racket upwards and keep the ball on it. Reward shaping gives the system a notion of closeness to the desired behavior instead of relying on a reward that only encodes success or failure (Ng et al., 1999).
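As a purely hypothetical illustration of such a shaped reward (not the reward actually used in the paddling experiments), one could combine the ball's distance to a desired height with a penalty on large changes in the commanded action; the weights and target height below are arbitrary and would need tuning, and a poor choice can still produce the undesired local optima mentioned above.

```python
import numpy as np

def shaped_paddling_reward(ball_height, desired_height, action, previous_action,
                           w_height=1.0, w_smooth=0.01):
    """Intermediate reward: closeness to a target height minus an action-smoothness penalty."""
    height_term = -w_height * abs(desired_height - ball_height)
    smoothness_term = -w_smooth * np.sum(
        (np.asarray(action) - np.asarray(previous_action)) ** 2)
    return height_term + smoothness_term
```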

Often the desired behavior can be most naturally represented with a reward function in a particular state and action space. However, this representation does not necessarily correspond to the space where the actual learning needs to be performed due to both computational and statistical limitations. Employing methods to render the learning problem tractable often results in different, more abstract state and action spaces which might not allow accurate representation of the original reward function. In such cases, a reward artfully specified in terms of the features of the space in which the learning algorithm operates can prove remarkably effective. There is also a trade-off between the complexity of the reward function and the complexity of the learning problem. For example, in the ball-in-a-cup task (Section 7) the most natural reward would be a binary value depending on whether the ball is in the cup or not. To render the learning problem tractable, a less intuitive reward needed to be devised in terms of a Cartesian distance with additional directional information (see Section 7.1 for details). Another example is Crusher (Ratliff et al., 2006a), an outdoor robot, where the human designer was interested in a combination of minimizing time and risk to the robot. However, the robot reasons about the world on the long time horizon scale as if it were a very simple, deterministic, holonomic robot operating on a fine grid of continuous costs. Hence, the desired behavior cannot be represented straightforwardly in this state space. Nevertheless, a remarkably human-like behavior that seems to respect time and risk priorities can be achieved by carefully mapping features describing each state (discrete grid location with features computed by an on-board perception system) to cost.

Inverse optimal control, also known as inverse reinforcement learning (Russell, 1998), is a promising alternative to specifying the reward function manually. It assumes that a reward function can be reconstructed from a set of expert demonstrations. This reward function does not necessarily correspond to the true reward function, but provides guarantees on the resulting performance of learned behaviors (Abbeel and Ng, 2004; Ratliff et al., 2006b). Inverse optimal control was initially studied in the control community (Kalman, 1964) and in the field of economics (Keeney and Raiffa, 1976). The initial results were only applicable to limited domains (linear quadratic regulator problems) and required closed-form access to plant and controller; hence samples from human demonstrations could not be used. Russell (1998) brought the field to the attention of the machine learning community. Abbeel and Ng (2004) defined an important constraint on the solution to the inverse RL problem when reward functions are linear in a set of features: a policy that is extracted by observing demonstrations has to earn the same reward as the policy that is being demonstrated. Ratliff et al. (2006b) demonstrated that inverse optimal control can be understood as a generalization of ideas in machine learning of structured prediction and introduced efficient sub-gradient based algorithms with regret bounds that enabled large-scale application of the technique within robotics. Ziebart et al. (2008) extended the technique developed by Abbeel and Ng (2004) by rendering the idea robust and probabilistic, enabling its effective use for both learning policies and predicting the behavior of sub-optimal agents. These techniques, and many variants, have been recently successfully applied to outdoor robot navigation (Ratliff et al., 2006a; Silver et al., 2008, 2010), manipulation (Ratliff et al., 2007), and quadruped locomotion (Ratliff et al., 2006a, 2007; Kolter et al., 2007).

More recently, the notion that complex policies can be built on top of simple, easily solved optimal control problems by exploiting rich, parametrized reward functions has been exploited within reinforcement learning more directly. In (Sorg et al., 2010; Zucker and Bagnell, 2012), complex policies are derived by adapting a reward function for simple optimal control problems using policy search techniques. Zucker and Bagnell (2012) demonstrate that this technique can enable efficient solutions to robotic marble-maze problems that effectively transfer between mazes of varying design and complexity. These works highlight the natural trade-off between the complexity of the reward function and the complexity of the underlying reinforcement learning problem for achieving a desired behavior.

4 Tractability Through Representation

As discussed above, reinforcement learning provides a framework for a remarkable variety of problems of significance to both robotics and machine learning. However, the computational and information-theoretic consequences that we outlined above accompany this power and generality. As a result, naive application of reinforcement learning techniques in robotics is likely to be doomed to failure. The remarkable successes that we reference in this article have been achieved by leveraging a few key principles – effective representations, approximate models, and prior knowledge or information. In the following three sections, we review these principles and summarize how each has been made effective in practice. We hope that understanding these broad approaches will lead to new successes in robotic reinforcement learning by combining successful methods and encourage research on novel techniques that embody each of these principles.

Much of the success of reinforcement learning methods has been due to the clever use of approximate representations. The need for such approximations is particularly pronounced in robotics, where table-based representations (as discussed in Section 2.2.1) are rarely scalable. The different ways of making reinforcement learning methods tractable in robotics are tightly coupled to the underlying optimization framework. Reducing the dimensionality of states or actions by smart state-action discretization is a representational simplification that may enhance both policy search and value function-based methods (see Section 4.1). A value function-based approach requires an accurate and robust but general function approximator that can capture the value function with sufficient precision (see Section 4.2) while maintaining stability during learning. Policy search methods require a choice of policy representation that controls the complexity of representable policies to enhance learning speed (see Section 4.3). An overview of publications that make particular use of efficient representations to render the learning problem tractable is presented in Table 3.

4.1 Smart State-Action Discretization

Decreasing the dimensionality of the state or action space eases most reinforcement learning problems significantly, particularly in the context of robotics. Here, we give a short overview of different attempts to achieve this goal with smart discretization.

Hand Crafted Discretization. A variety of authors have manually developed discretizations so that basic tasks can be learned on real robots. For low-dimensional tasks, we can generate discretizations straightforwardly by splitting each dimension into a number of regions. The main challenge is to find the right number of regions for each dimension that allows the system to achieve a good final performance while still learning quickly. Example applications include balancing a ball on a beam (Benbrahim et al., 1992), one degree of freedom ball-in-a-cup (Nemec et al., 2010), two degree of freedom crawling motions (Tokic et al., 2009), and gait patterns for four-legged walking (Kimura et al., 2001). Much more human experience is needed for more complex tasks. For example, in a basic navigation task with noisy sensors (Willgoss and Iqbal, 1999), only some combinations of binary state or action indicators are useful (e.g., you can drive left and forward at the same time, but not backward and forward). The state space can also be based on vastly different features, such as positions, shapes, and colors, when learning object affordances (Paletta et al., 2007), where both the discrete sets and the mapping from sensor values to the discrete values need to be crafted. Kwok and Fox (2004) use a mixed discrete and continuous representation of the state space to learn active sensing strategies in a RoboCup scenario. They first discretize the state space along the dimension with the strongest non-linear influence on the value function and subsequently employ a linear value function approximation (Section 4.2) for each of the regions.
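A minimal sketch of such a hand-crafted grid discretization is shown below: each state dimension is split into a fixed number of regions and the resulting cell index serves as the discrete state. The bounds and resolution are the critical, task-specific choices and are placeholders here.

```python
import numpy as np

def discretize(state, lower, upper, bins_per_dim=10):
    """Map a continuous state to a tuple of bin indices (one per dimension)."""
    state = np.asarray(state, dtype=float)
    ratios = (state - lower) / (upper - lower)                    # normalize to [0, 1]
    indices = np.clip((ratios * bins_per_dim).astype(int), 0, bins_per_dim - 1)
    return tuple(indices)                                         # usable as a Q-table key

# Example: a 2-D state (angle, angular velocity) mapped to one of 10 x 10 cells.
cell = discretize([0.3, -1.2],
                  lower=np.array([-np.pi, -8.0]),
                  upper=np.array([np.pi, 8.0]))
```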

Learned from Data. Instead of specifying the discretizations by hand, they can also be built adaptively during the learning process. For example, a rule-based reinforcement learning approach automatically segmented the state space to learn a cooperative task with mobile robots (Yasuda and Ohkura, 2008). Each rule is responsible for a local region of the state space. The importance of the rules is updated based on the rewards, and irrelevant rules are discarded. If the state is not yet covered by a rule, a new one is added. In the related field of computer vision, Piater et al. (2011) propose an approach that adaptively and incrementally discretizes a perceptual space into discrete states, training an image classifier based on the experience of the RL agent to distinguish visual classes, which correspond to the states.

Meta-Actions. Automatic construction of meta-actions (and the closely related concept of options) has fascinated reinforcement learning researchers and there are various examples in the literature. The idea is to have more intelligent actions that are composed of a sequence of movements and that in themselves achieve a simple task. A simple example would be to have a meta-action “move forward 5 m.” A lower-level system takes care of accelerating, stopping, and correcting errors. For example, in (Asada et al., 1996), the state and action sets are constructed in a way that repeated action primitives lead to a change in the state to overcome problems associated with the discretization. Q-learning and dynamic programming based approaches have been compared in a pick-n-place task (Kalmár et al., 1998) using modules. Huber and Grupen (1997) use a set of controllers with associated predicate states as a basis for learning turning gates with a quadruped. Fidelman and Stone (2004) use a policy search approach to learn a small set of parameters that controls the transition between a walking and a capturing meta-action in a RoboCup scenario. A task of transporting a ball with a dog robot (Soni and Singh, 2006) can be learned with semi-automatically discovered options. Using only the sub-goals of primitive motions, a humanoid robot can learn a pouring task (Nemec et al., 2009). Other examples include foraging (Matarić, 1997) and cooperative tasks (Matarić, 1994) with multiple robots, grasping with restricted search spaces (Platt et al., 2006), and mobile robot navigation (Dorigo and Colombetti, 1993). If the meta-actions are not fixed in advance, but rather learned at the same time, these approaches are hierarchical reinforcement learning approaches, as discussed in Section 5.2. Konidaris et al. (2011a, 2012) propose an approach that constructs a skill tree from human demonstrations. Here, the skills correspond to options and are chained to learn a mobile manipulation skill.

Relational Representations. In a relational representation, the states, actions, and transitions are not represented individually. Entities of the same predefined type are grouped and their relationships are considered. This representation may be preferable for highly geometric tasks (which frequently appear in robotics) and has been employed to learn to navigate buildings with a real robot in a supervised setting (Cocora et al., 2006) and to manipulate articulated objects in simulation (Katz et al., 2008).

4.2 Value Function Approximation

Function approximation has always been the key component that has allowed value function methods to scale into interesting domains. In robot reinforcement learning, the following function approximation schemes have been popular and successful.


Smart State-Action Discretization

Hand crafted: Benbrahim et al. (1992); Kimura et al. (2001); Kwok and Fox (2004); Nemec et al. (2010); Paletta et al. (2007); Tokic et al. (2009); Willgoss and Iqbal (1999)

Learned: Piater et al. (2011); Yasuda and Ohkura (2008)

Meta-actions: Asada et al. (1996); Dorigo and Colombetti (1993); Fidelman and Stone (2004); Huber and Grupen (1997); Kalmár et al. (1998); Konidaris et al. (2011a, 2012); Matarić (1994, 1997); Platt et al. (2006); Soni and Singh (2006); Nemec et al. (2009)

Relational Representation: Cocora et al. (2006); Katz et al. (2008)

Value Function Approximation

Physics-inspired Features: An et al. (1988); Schaal (1996)

Neural Networks: Benbrahim and Franklin (1997); Duan et al. (2008); Gaskett et al. (2000); Hafner and Riedmiller (2003); Riedmiller et al. (2009); Thrun (1995)

Neighbors: Hester et al. (2010); Mahadevan and Connell (1992); Touzet (1997)

Local Models: Bentivegna (2004); Schaal (1996); Smart and Kaelbling (1998)

GPR: Gräve et al. (2010); Kroemer et al. (2009, 2010); Rottmann et al. (2007)

Pre-structured Policies

Via Points & Splines: Kuindersma et al. (2011); Miyamoto et al. (1996); Roberts et al. (2010)

Linear Models: Tamei and Shibata (2009)

Motor Primitives: Kohl and Stone (2004); Kober and Peters (2009); Peters and Schaal (2008c,b); Stulp et al. (2011); Tamošiunaite et al. (2011); Theodorou et al. (2010)

GMM & LLM: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Guenter et al. (2007); Lin and Lai (2012); Peters and Schaal (2008a)

Neural Networks: Endo et al. (2008); Geng et al. (2006); Gullapalli et al. (1994); Hailu and Sommer (1998); Bagnell and Schneider (2001)

Controllers: Bagnell and Schneider (2001); Kolter and Ng (2009a); Tedrake (2004); Tedrake et al. (2005); Vlassis et al. (2009); Zucker and Bagnell (2012)

Non-parametric: Kober et al. (2010); Mitsunaga et al. (2005); Peters et al. (2010a)

Table 3: This table illustrates different methods of making robot reinforcement learning tractable by employing a suitable representation.

Using function approximation for the value function can be combined with using function approximation for learning a model of the system (as discussed in Section 6) in the case of model-based reinforcement learning approaches.

Unfortunately, the max-operator used within the Bellman equation and temporal-difference updates can theoretically make most linear or non-linear approximation schemes unstable for either value iteration or policy iteration. Quite frequently such unstable behavior is also exhibited in practice. Linear function approximators are stable for policy evaluation, while non-linear function approximation (e.g., neural networks) can even diverge if just used for policy evaluation (Tsitsiklis and Van Roy, 1997).

Physics-inspired Features. If good hand-crafted features are known, value function approximation can be accomplished using a linear combination of features. However, good features are well known in robotics only for a few problems, such as features for local stabilization (Schaal, 1996) and features describing rigid body dynamics (An et al., 1988). Stabilizing a system at an unstable equilibrium point is the most well-known example, where a second-order Taylor expansion of the state together with a linear value function approximator often suffices as features in the proximity of the equilibrium point. For example, Schaal (1996) showed that such features suffice for learning how to stabilize a pole on the end-effector of a robot when within ±15–30 degrees of the equilibrium angle. For sufficient features, linear function approximation is likely to yield good results in an on-policy setting. Nevertheless, it is straightforward to show that impoverished value function representations (e.g., omitting the cross-terms in the quadratic expansion in Schaal’s set-up) will make it impossible for the robot to learn this behavior. Similarly, it is well known that linear value function approximation is unstable in the off-policy case (Tsitsiklis and Van Roy, 1997; Gordon, 1999; Sutton and Barto, 1998).
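For the stabilization example just mentioned, physics-inspired features might be the monomials of a second-order expansion of the state around the equilibrium, combined linearly to form the value estimate. The sketch below is a generic construction of such features, not Schaal's (1996) exact feature set.

```python
import numpy as np

def quadratic_features(state):
    """Second-order expansion: constant, linear terms, squares, and cross-terms of the state."""
    s = np.asarray(state, dtype=float)
    cross = np.outer(s, s)[np.triu_indices(len(s))]   # includes squares and cross-terms
    return np.concatenate(([1.0], s, cross))

# Value estimate near the equilibrium as a linear combination of these features:
# V(s) ~ w @ quadratic_features(s), with w fit by regression on observed returns.
# Omitting the cross-terms corresponds to the impoverished representation noted above.
```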

Neural Networks. As good hand-crafted features are rarely available, various groups have employed neural networks as global, non-linear value function approximators. Many different flavors of neural networks have been applied in robotic reinforcement learning. For example, multi-layer perceptrons were used to learn a wandering behavior and visual servoing (Gaskett et al., 2000). Fuzzy neural networks (Duan et al., 2008) and explanation-based neural networks (Thrun, 1995) have allowed robots to learn basic navigation. CMAC neural networks have been used for biped locomotion (Benbrahim and Franklin, 1997).

Figure 4: The Brainstormers Tribots won the RoboCup 2006 MidSize League (Riedmiller et al., 2009) (picture reprinted with permission of Martin Riedmiller).

The Brainstormers RoboCup soccer team is a particularly impressive application of value function approximation (see Figure 4). It used multi-layer perceptrons to learn various sub-tasks such as learning defenses, interception, position control, kicking, motor speed control, dribbling and penalty shots (Hafner and Riedmiller, 2003; Riedmiller et al., 2009). The resulting components contributed substantially to winning the world cup several times in the simulation and the mid-size real robot leagues. As neural networks are global function approximators, overestimating the value function at a frequently occurring state will increase the values predicted by the neural network for all other states, causing fast divergence (Boyan and Moore, 1995; Gordon, 1999). Riedmiller et al. (2009) solved this problem by always defining an absorbing state where they set the value predicted by their neural network to zero, which “clamps the neural network down” and thereby prevents divergence. It also allows re-iterating on the data, which results in an improved value function quality. The combination of iteration on data with the clamping technique appears to be the key to achieving good performance with value function approximation in practice.

Generalize to Neighboring Cells. As neural networks are globally affected by local errors, much work has focused on simply generalizing from neighboring cells. One of the earliest papers in robot reinforcement learning (Mahadevan and Connell, 1992) introduced this idea by statistically clustering states to speed up a box-pushing task with a mobile robot, see Figure 1a. This approach was also used for a navigation and obstacle avoidance task with a mobile robot (Touzet, 1997). Similarly, decision trees have been used to generalize states and actions to unseen ones, e.g., to learn a penalty kick on a humanoid robot (Hester et al., 2010). The core problem of these methods is the lack of scalability to high-dimensional state and action spaces.

Local Models. Local models can be seen as an extension of generalization among neighboring cells to generalization among neighboring data points. Locally weighted regression creates particularly efficient function approximation in the context of robotics, both in supervised and reinforcement learning. Here, regression errors are weighted down by proximity to the query point to train local models. The predictions of these local models are combined using the same weighting functions. Using local models for value function approximation has allowed learning a navigation task with obstacle avoidance (Smart and Kaelbling, 1998), a pole swing-up task (Schaal, 1996), and an air hockey task (Bentivegna, 2004).
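A compact sketch of locally weighted regression as used for such local models is given below: training points near the query receive higher weight, and a weighted least-squares fit at the query point yields the prediction. The Gaussian kernel, bandwidth, and ridge term are assumptions made for the example.

```python
import numpy as np

def lwr_predict(query, X, y, bandwidth=0.5):
    """Locally weighted linear regression prediction at a single query point."""
    # Gaussian weights: regression errors are weighted down by distance to the query.
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2.0 * bandwidth**2))
    Xb = np.hstack([X, np.ones((len(X), 1))])          # add bias term
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-6 * np.eye(Xb.shape[1]),
                           Xb.T @ W @ y)               # weighted least squares
    return np.append(query, 1.0) @ beta
```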

Gaussian Process Regression. Parametrized global or local models need to be pre-specified, which requires a trade-off between representational accuracy and the number of parameters. A non-parametric function approximator like Gaussian Process Regression (GPR) could be employed instead, but potentially at the cost of a higher computational complexity. GPR has the added advantage of providing a notion of uncertainty about the approximation quality for a query point. Hovering with an autonomous blimp (Rottmann et al., 2007) has been achieved by approximating the state-action value function with a GPR. Similarly, another paper shows that grasping can be learned using Gaussian process regression (Gräve et al., 2010) by additionally taking into account the uncertainty to guide the exploration. Grasping locations can be learned by approximating the rewards with a GPR, and trying candidates with predicted high rewards (Kroemer et al., 2009), resulting in an active learning approach. High reward uncertainty allows intelligent exploration in reward-based grasping (Kroemer et al., 2010) in a bandit setting.
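Using an off-the-shelf Gaussian process regressor to approximate a value or reward function and to expose predictive uncertainty for exploration might look as follows; scikit-learn is one possible library choice, the random data is synthetic, and the upper-confidence selection rule is an illustrative stand-in, not the method of any of the cited systems.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training data: state-action pairs and observed returns.
X = np.random.uniform(-1.0, 1.0, size=(50, 3))    # e.g., 2-D state + 1-D action
y = np.random.uniform(0.0, 1.0, size=50)          # observed rewards/returns

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-3))
gp.fit(X, y)

candidates = np.random.uniform(-1.0, 1.0, size=(100, 3))
mean, std = gp.predict(candidates, return_std=True)
# Uncertainty-guided selection, e.g., an upper-confidence-bound rule:
best = candidates[np.argmax(mean + 1.0 * std)]
```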

4.3 Pre-structured Policies

Policy search methods greatly benefit from employing an appropriate function approximation of the policy. For example, when employing gradient-based approaches, the trade-off between the representational power of the policy (in the form of many policy parameters) and the learning speed (related to the number of samples required to estimate the gradient) needs to be considered. To make policy search approaches tractable, the policy needs to be represented with a function approximation that takes into account domain knowledge, such as task-relevant parameters or generalization properties. As the next action picked by a policy depends on the current state and action, a policy can be seen as a closed-loop controller. Roberts et al. (2011) demonstrate that care needs to be taken when selecting closed-loop parameterizations for weakly-stable systems, and suggest forms that are particularly robust during learning. However, especially for episodic RL tasks, sometimes open-loop policies (i.e., policies where the actions depend only on the time) can also be employed.

Figure 5: Boston Dynamics LittleDog jumping (Kolter and Ng, 2009a) (picture reprinted with permission of Zico Kolter).

Via Points & Splines. An open-loop policy may often be naturally represented as a trajectory, either in the space of states or targets or directly as a set of controls. Here, the actions are only a function of time, which can be considered as a component of the state. Such spline-based policies are very suitable for compressing complex trajectories into few parameters. Typically the desired joint or Cartesian positions, velocities, and/or accelerations are used as actions. To minimize the required number of parameters, not every point is stored. Instead, only important via-points are considered and other points are interpolated. Miyamoto et al. (1996) optimized the position and timing of such via-points in order to learn a kendama task (a traditional Japanese toy similar to ball-in-a-cup). A well-known type of via-point representation are splines, which rely on piecewise-defined smooth polynomial functions for interpolation. For example, Roberts et al. (2010) used a periodic cubic spline as a policy parametrization for a flapping system and Kuindersma et al. (2011) used a cubic spline to represent arm movements in an impact recovery task.
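A via-point policy of the kind described above can be parameterized by the values at a few via-points and interpolated with a cubic spline; the learner then adjusts only the via-point values (and possibly their timing). The time grid and values below are placeholders for a one-dimensional trajectory.

```python
import numpy as np
from scipy.interpolate import CubicSpline

via_times = np.array([0.0, 0.5, 1.0, 1.5, 2.0])    # fixed via-point timing (s)
via_values = np.array([0.0, 0.4, 1.0, 0.6, 0.0])   # policy parameters to be learned

policy = CubicSpline(via_times, via_values)         # open-loop policy: action = f(time)

t = np.linspace(0.0, 2.0, 200)
desired_positions = policy(t)                        # trajectory sent to the low-level controller
```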

Linear Models. If model knowledge of the system is available, it can be used to create features for linear closed-loop policy representations. For example, Tamei and Shibata (2009) used policy-gradient reinforcement learning to adjust a model that maps from human EMG signals to forces, which in turn is used in a cooperative holding task.

Motor Primitives. Motor primitives combine linear models describing dynamics with parsimonious movement parametrizations. While originally biologically inspired, they have had a lot of success in representing basic movements in robotics such as a reaching movement or basic locomotion. These basic movements can subsequently be sequenced and/or combined to achieve more complex movements. For both goal-oriented and rhythmic movement, different technical representations have been proposed in the robotics community. Dynamical system motor primitives (Ijspeert et al., 2003; Schaal et al., 2007) have become a popular representation for reinforcement learning of discrete movements. The dynamical system motor primitives always have a strong dependence on the phase of the movement, which corresponds to time. They can be employed as an open-loop trajectory representation. Nevertheless, they can also be employed as a closed-loop policy to a limited extent. In our experience, they offer a number of advantages over via-point or spline based policy representations (see Section 7.2). The dynamical system motor primitives have been trained with reinforcement learning for a T-ball batting task (Peters and Schaal, 2008c,b), an underactuated pendulum swing-up and a ball-in-a-cup task (Kober and Peters, 2009), flipping a light switch (Buchli et al., 2011), pouring water (Tamošiunaite et al., 2011), and playing pool and manipulating a box (Pastor et al., 2011). For rhythmic behaviors, a representation based on the same biological motivation but with a fairly different technical implementation (based on half-elliptical locuses) has been used to acquire the gait patterns for an Aibo robot dog locomotion (Kohl and Stone, 2004).
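The sketch below integrates a single discrete dynamical-system motor primitive in its common textbook form, loosely following Ijspeert et al. (2003): a spring-damper transformation system driven by a phase variable and a learned forcing term. The gains, basis-function placement, and weights are illustrative assumptions; real implementations differ in many details.

```python
import numpy as np

def run_dmp(w, y0, goal, duration=1.0, dt=0.001,
            alpha_z=25.0, beta_z=6.25, alpha_x=8.0):
    """Integrate a discrete dynamical-system motor primitive with forcing weights w."""
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, len(w)))   # basis centers in phase space
    widths = 1.0 / np.diff(centers, append=centers[-1] / 2.0) ** 2

    y, z, x = y0, 0.0, 1.0
    trajectory = []
    for _ in range(int(duration / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)               # Gaussian basis activations
        forcing = (psi @ w) / (psi.sum() + 1e-10) * x * (goal - y0)
        z += dt * (alpha_z * (beta_z * (goal - y) - z) + forcing) / duration
        y += dt * z / duration                                   # transformation system
        x += dt * (-alpha_x * x) / duration                      # canonical (phase) system
        trajectory.append(y)
    return np.array(trajectory)
```

Reinforcement learning in this representation then amounts to adjusting the weight vector w (and possibly the goal), which keeps the number of policy parameters small.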

Gaussian Mixture Models and Radial Basis Function Models. When more general policies with a strong state-dependence are needed, general function approximators based on radial basis functions, also called Gaussian kernels, become reasonable choices. While learning with fixed basis function centers and widths often works well in practice, estimating them is challenging. These centers and widths can also be estimated from data prior to the reinforcement learning process. This approach has been used to generalize an open-loop reaching movement (Guenter et al., 2007; Lin and Lai, 2012) and to learn the closed-loop cart-pole swing-up task (Deisenroth and Rasmussen, 2011). Globally linear models were employed in a closed-loop block stacking task (Deisenroth et al., 2011).


Neural Networks are another general function approximator used to represent policies. Neural oscillators with sensor feedback have been used to learn rhythmic movements where open- and closed-loop information were combined, such as gaits for a two-legged robot (Geng et al., 2006; Endo et al., 2008). Similarly, a peg-in-hole task (see Figure 1b), a ball-balancing task (Gullapalli et al., 1994), and a navigation task (Hailu and Sommer, 1998) have been learned with closed-loop neural networks as policy function approximators.

Locally Linear Controllers. As local linearity is highly desirable in robot movement generation to avoid actuation difficulties, learning the parameters of a locally linear controller can be a better choice than using a neural network or radial basis function representation. Several of these controllers can be combined to form a global, inherently closed-loop policy. This type of policy has allowed for many applications, including learning helicopter flight (Bagnell and Schneider, 2001), learning biped walk patterns (Tedrake, 2004; Tedrake et al., 2005), driving a radio-controlled (RC) car, learning a jumping behavior for a robot dog (Kolter and Ng, 2009a) (illustrated in Figure 5), and balancing a two-wheeled robot (Vlassis et al., 2009). Operational space control was also learned by Peters and Schaal (2008a) using locally linear controller models. In a marble maze task, Zucker and Bagnell (2012) used such a controller as a policy that expressed the desired velocity of the ball in terms of the directional gradient of a value function.

Non-parametric Policies. Policies based on non-parametric regression approaches often allow a more data-driven learning process. This approach is often preferable over the purely parametric policies listed above because the policy structure can evolve during the learning process. Such approaches are especially useful when a policy is learned to adjust the existing behaviors of a lower-level controller, such as when choosing among different robot-human interaction possibilities (Mitsunaga et al., 2005), selecting among different striking movements in a table tennis task (Peters et al., 2010a), and setting the meta-actions for dart throwing and table tennis hitting tasks (Kober et al., 2010).

5 Tractability Through Prior Knowledge

Prior knowledge can dramatically help guide the learning process. It can be included in the form of initial policies, demonstrations, initial models, a predefined task structure, or constraints on the policy such as torque limits or ordering constraints of the policy parameters. These approaches significantly reduce the search space and, thus, speed up the learning process. Providing a (partially) successful initial policy allows a reinforcement learning method to focus on promising regions in the value function or in policy space, see Section 5.1. Pre-structuring a complex task such that it can be broken down into several more tractable ones can significantly reduce the complexity of the learning task, see Section 5.2. An overview of publications using prior knowledge to render the learning problem tractable is presented in Table 4. Constraints may also limit the search space, but often pose new, additional problems for the learning methods. For example, policy search methods often do not handle hard limits on the policy well. Relaxing such constraints (a trick often applied in machine learning) is not feasible if they were introduced to protect the robot in the first place.

5.1 Prior Knowledge Through Demonstration

People and other animals frequently learn using a combination of imitation and trial and error. When learning to play tennis, for instance, an instructor will repeatedly demonstrate the sequence of motions that form an orthodox forehand stroke. Students subsequently imitate this behavior, but still need hours of practice to successfully return balls to a precise location on the opponent's court. Input from a teacher need not be limited to initial instruction. The instructor may provide additional demonstrations in later learning stages (Latzke et al., 2007; Ross et al., 2011a), which can also be used as differential feedback (Argall et al., 2008).

This combination of imitation learning with reinforcement learning is sometimes termed apprenticeship learning (Abbeel and Ng, 2004) to emphasize the need for learning both from a teacher and by practice. The term “apprenticeship learning” is often employed to refer to “inverse reinforcement learning” or “inverse optimal control”, but is intended here to be employed in this original, broader meaning. For a recent survey detailing the state of the art in imitation learning for robotics, see (Argall et al., 2009).

Using demonstrations to initialize reinforcement learning provides multiple benefits. Perhaps the most obvious benefit is that it provides supervised training data of what actions to perform in states that are encountered. Such data may be helpful when used to bias policy action selection.

The most dramatic benefit, however, is that demonstration – or a hand-crafted initial policy – removes the need for global exploration of the policy or state-space of the RL problem. The student can improve by locally optimizing a policy knowing what states are important, making local optimization methods feasible. Intuitively, we expect that removing the demands of global exploration makes learning easier. However, we can only find local optima close to the demonstration, that is, we rely on the demonstration to provide a good starting point. Perhaps the textbook example of such learning in humans is the rise of the “Fosbury Flop” (Wikipedia, 2013) method of high-jump (see Figure 6). This motion is very different from a classical high-jump and took generations of Olympians to discover. But after it was first demonstrated, it was soon mastered by virtually all athletes participating in the sport. On the other hand, this example also illustrates nicely that such local optimization around an initial demonstration can only find local optima.

Figure 6: This figure illustrates the “Fosbury Flop” (public domain picture from Wikimedia Commons).

In practice, both approximate value function based approaches and policy search methods work best for real system applications when they are constrained to make modest changes to the distribution over states while learning. Policy search approaches implicitly maintain the state distribution by limiting the changes to the policy. On the other hand, for value function methods, an unstable estimate of the value function can lead to drastic changes in the policy. Multiple policy search methods used in robotics are based on this intuition (Bagnell and Schneider, 2003; Peters and Schaal, 2008b; Peters et al., 2010a; Kober and Peters, 2010).

The intuitive idea and empirical evidence that demonstration makes the reinforcement learning problem simpler can be understood rigorously. In fact, Kakade and Langford (2002) and Bagnell et al. (2003) demonstrate that knowing approximately the state-distribution of a good policy (that is, a probability distribution over states that will be encountered when following a good policy) transforms the problem of reinforcement learning from one that is provably intractable in both information and computational complexity to a tractable one with only polynomial sample and computational complexity, even under function approximation and partial observability. This type of approach can be understood as a reduction from reinforcement learning to supervised learning. Both algorithms are policy search variants of approximate policy iteration that constrain policy updates. Kollar and Roy (2008) demonstrate the benefit of this RL approach for developing state-of-the-art map exploration policies and Kolter et al. (2008) employed a space-indexed variant to learn trajectory following tasks with an autonomous vehicle and an RC car.

Demonstrations by a Teacher. Demonstrations by a teacher can be obtained in two different scenarios. In the first, the teacher demonstrates the task using his or her own body; in the second, the teacher controls the robot to do the task. The first scenario is limited by the embodiment issue, as the movement of a human teacher usually cannot be mapped directly to the robot due to different physical constraints and capabilities. For example, joint angles of a human demonstrator need to be adapted to account for the kinematic differences between the teacher and the robot. Often it is more advisable to only consider task-relevant information, such as the Cartesian positions and velocities of the end-effector and the object. Demonstrations obtained by motion-capture have been used to learn a pendulum swing-up (Atkeson and Schaal, 1997), ball-in-a-cup (Kober et al., 2008) and grasping (Gräve et al., 2010).

The second scenario obtains demonstrations by a human teacher directly controlling the robot. Here the human teacher first has to learn how to achieve a task with the particular robot's hardware, adding valuable prior knowledge. For example, remotely controlling the robot initialized a Q-table for a navigation task (Conn and Peters II, 2007). If the robot is back-drivable, kinesthetic teach-in (i.e., by taking it by the hand and moving it) can be employed, which enables the teacher to interact more closely with the robot. This method has resulted in applications including T-ball batting (Peters and Schaal, 2008c,b), reaching tasks (Guenter et al., 2007; Bitzer et al., 2010), ball-in-a-cup (Kober and Peters, 2009), flipping a light switch (Buchli et al., 2011), playing pool and manipulating a box (Pastor et al., 2011), and opening a door and picking up objects (Kalakrishnan et al., 2011). A marble maze task can be learned using demonstrations by a human player (Bentivegna et al., 2004).

One of the more stunning demonstrations of the benefit of learning from a teacher are the helicopter airshows of Coates et al. (2009). This approach combines initial human demonstration of trajectories, machine learning to extract approximate models from multiple trajectories, and classical locally-optimal control methods (Jacobson and Mayne, 1970) to achieve state-of-the-art acrobatic flight.

Hand-Crafted Policies. When human interaction with the system is not straightforward due to technical reasons or human limitations, a pre-programmed policy can provide alternative demonstrations. For example, a vision-based mobile robot docking task can be learned faster with such a basic behavior than using Q-learning alone, as demonstrated in (Martínez-Marín and Duckett, 2005). Providing hand-coded, stable initial gaits can significantly help in learning robot locomotion, as shown on a six-legged robot (Erden and Leblebicioğlu, 2008) as well as on a biped (Tedrake, 2004; Tedrake et al., 2005). Alternatively, hand-crafted policies can yield important corrective actions as prior knowledge that prevent the robot from deviating significantly from the desired behavior and endangering itself. This approach has been applied to adapt the walking patterns of a robot dog to new surfaces (Birdwell and Livingston, 2007) by Q-learning. Rosenstein and Barto (2004) employed a stable controller to teach the robot about favorable actions and avoid risky behavior while learning to move from a start to a goal position.


5.2 Prior Knowledge Through Task Structuring

Often a task can be decomposed hierarchically into basic components or into a sequence of increasingly difficult tasks. In both cases the complexity of the learning task is significantly reduced.

Hierarchical Reinforcement Learning. A task can often be decomposed into different levels. For example, when using meta-actions (Section 4.1), these meta-actions correspond to a lower level dealing with the execution of sub-tasks which are coordinated by a strategy level. Hierarchical reinforcement learning does not assume that all but one level are fixed but rather learns all of them simultaneously. For example, hierarchical Q-learning has been used to learn different behavioral levels for a six-legged robot: moving single legs, locally moving the complete body, and globally moving the robot towards a goal (Kirchner, 1997). A stand-up behavior considered as a hierarchical reinforcement learning task has been learned using Q-learning in the upper level and a continuous actor-critic method in the lower level (Morimoto and Doya, 2001). Navigation in a maze can be learned using an actor-critic architecture by tuning the influence of different control modules and learning these modules (Donnart and Meyer, 1996). Huber and Grupen (1997) combine discrete event system and reinforcement learning techniques to learn turning gates for a quadruped. Hart and Grupen (2011) learn bi-manual manipulation tasks by assembling policies hierarchically. Daniel et al. (2012) learn options in a tetherball scenario and Muelling et al. (2012) learn different strokes in a table tennis scenario. Whitman and Atkeson (2010) show that the optimal policy for some global systems (like a walking controller) can be constructed by finding the optimal controllers for simpler subsystems and coordinating these.

Progressive Tasks. Often complicated tasks are easier to learn if simpler tasks can already be performed. This progressive task development is inspired by how biological systems learn. For example, a baby first learns how to roll, then how to crawl, then how to walk. A sequence of increasingly difficult missions has been employed to learn a goal shooting task in (Asada et al., 1996) using Q-learning. Randløv and Alstrøm (1998) discuss shaping the reward function to include both a balancing and a goal-oriented term for a simulated bicycle riding task. The reward is constructed in such a way that the balancing term dominates the other term and, hence, this more fundamental behavior is learned first.

5.3 Directing Exploration with Prior Knowledge

As discussed in Section 2.1, balancing exploration and exploitation is an important consideration. Task knowledge can be employed to guide the robot's curiosity to focus on regions that are novel and promising at the same time. For example, a mobile robot learns to direct attention by employing a modified Q-learning approach using novelty (Huang and Weng, 2002). Using “corrected truncated returns” and taking into account the estimator variance, a six-legged robot equipped with stepping reflexes can learn to walk (Pendrith, 1999). Offline search can be used to guide Q-learning during a grasping task (Wang et al., 2006). Using upper confidence bounds (Kaelbling, 1990) to direct exploration into regions with potentially high rewards, grasping can be learned efficiently (Kroemer et al., 2010).
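As a simple illustration of exploration directed by optimism, the sketch below (a generic upper-confidence-bound action selection, hypothetical and not the specific implementation of the cited works) adds a visit-count-based bonus to the value estimates so that rarely tried, potentially rewarding actions are preferred.

```python
import numpy as np

def ucb_action(q_values, counts, exploration_weight=1.0):
    """Pick the action maximizing Q plus an upper-confidence exploration bonus."""
    total = np.sum(counts) + 1
    bonus = exploration_weight * np.sqrt(np.log(total) / (counts + 1e-6))
    return int(np.argmax(q_values + bonus))

# Hypothetical usage inside a tabular Q-learning loop, for one state:
q_values = np.array([0.2, 0.5, 0.1])   # current value estimates
counts   = np.array([10,  2,   0])     # how often each action was tried
a = ucb_action(q_values, counts)       # favors the untried action unless Q differs a lot
```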

6 Tractability Through Models

In Section 2, we discussed robot reinforcement learning from a model-free perspective where the system simply served as a data generating process. Such model-free reinforcement algorithms try to directly learn the value function or the policy without any explicit modeling of the transition dynamics. In contrast, many robot reinforcement learning problems can be made tractable by learning forward models, i.e., approximations of the transition dynamics based on data. Such model-based reinforcement learning approaches jointly learn a model of the system with the value function or the policy and often allow for training with less interaction with the real environment. Reduced learning on the real robot is highly desirable as simulations are frequently faster than real-time while safer for both the robot and its environment. The idea of combining learning in simulation and in the real environment was popularized by the Dyna-architecture (Sutton, 1990), prioritized sweeping (Moore and Atkeson, 1993), and incremental multi-step Q-learning (Peng and Williams, 1996) in reinforcement learning. In robot reinforcement learning, the learning step on the simulated system is often called “mental rehearsal”. We first discuss the core issues and techniques in mental rehearsal for robotics (Section 6.1), and, subsequently, we discuss learning methods that have been used in conjunction with learning with forward models (Section 6.2). An overview of publications using simulations to render the learning problem tractable is presented in Table 5.

6.1 Core Issues and General Techniques in Mental Rehearsal

Experience collected in the real world can be used to learn a forward model (Åström and Wittenmark, 1989) from data. Such forward models allow training by interacting with a simulated environment. Only the resulting policy is subsequently transferred to the real environment. Model-based methods can make the learning process substantially more sample efficient. However, depending on the type of model, these methods may require a great deal of memory. In the following paragraphs, we deal with the core issues of mental rehearsal: simulation biases, stochasticity of the real world, and efficient optimization when sampling from a simulator.

Prior Knowledge Through Demonstration
  Teacher: Atkeson and Schaal (1997); Bentivegna et al. (2004); Bitzer et al. (2010); Conn and Peters II (2007); Gräve et al. (2010); Kober et al. (2008); Kober and Peters (2009); Latzke et al. (2007); Peters and Schaal (2008c,b)
  Policy: Birdwell and Livingston (2007); Erden and Leblebicioğlu (2008); Martínez-Marín and Duckett (2005); Rosenstein and Barto (2004); Smart and Kaelbling (1998); Tedrake (2004); Tedrake et al. (2005); Wang et al. (2006)

Prior Knowledge Through Task Structuring
  Hierarchical: Daniel et al. (2012); Donnart and Meyer (1996); Hart and Grupen (2011); Huber and Grupen (1997); Kirchner (1997); Morimoto and Doya (2001); Muelling et al. (2012); Whitman and Atkeson (2010)
  Progressive Tasks: Asada et al. (1996); Randløv and Alstrøm (1998)

Directed Exploration with Prior Knowledge
  Directed Exploration: Huang and Weng (2002); Kroemer et al. (2010); Pendrith (1999)

Table 4: This table illustrates different methods of making robot reinforcement learning tractable by incorporating prior knowledge.

Dealing with Simulation Biases. It is impossible to obtain a forward model that is accurate enough to simulate a complex real-world robot system without error. If the learning methods require predicting the future or using derivatives, even small inaccuracies can quickly accumulate, significantly amplifying noise and errors (An et al., 1988). Reinforcement learning approaches exploit such model inaccuracies if they are beneficial for the reward received in simulation (Atkeson and Schaal, 1997). The resulting policies may work well with the forward model (i.e., the simulator) but poorly on the real system. This is known as simulation bias. It is analogous to over-fitting in supervised learning – that is, the algorithm is doing its job well on the model and the training data, respectively, but does not generalize well to the real system or novel data. Simulation bias often leads to biased, potentially physically non-feasible solutions, and even iterating between model learning and policy learning may converge slowly. Averaging over the model uncertainty in probabilistic models can be used to reduce the bias; see the next paragraph for examples. Another result of these simulation biases is that relatively few researchers have successfully demonstrated that a policy learned in simulation can directly be transferred to a real robot while maintaining a high level of performance. The few examples include maze navigation tasks (Bakker et al., 2003; Oßwald et al., 2010; Youssef, 2005), obstacle avoidance (Fagg et al., 1998) for a mobile robot, very basic robot soccer (Duan et al., 2007) and multi-legged robot locomotion (Ilg et al., 1999; Svinin et al., 2001). Nevertheless, simulation biases can be addressed by introducing stochastic models or distributions over models even if the system is very close to deterministic. Artificially adding a little noise will smooth model errors and avoid policy over-fitting (Jakobi et al., 1995; Atkeson, 1998). On the downside, potentially very precise policies may be eliminated due to their fragility in the presence of noise. This technique can be beneficial in all of the approaches described in this section. Nevertheless, in recent work, Ross and Bagnell (2012) presented an approach with strong guarantees for learning the model and policy in an iterative fashion even if the true system is not in the model class, indicating that it may be possible to deal with simulation bias.
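A minimal sketch of this noise-injection idea (hypothetical, not tied to a specific cited implementation): perturbing the learned model's predictions with small Gaussian noise during simulated roll-outs so that the policy cannot exploit precise but wrong model details.

```python
import numpy as np

def noisy_rollout(policy, model, s0, horizon, noise_std=0.01, rng=None):
    """Simulate a trajectory with a learned model whose predictions are perturbed by noise."""
    rng = rng or np.random.default_rng()
    s, states = s0, [s0]
    for _ in range(horizon):
        a = policy(s)
        s = model(s, a) + noise_std * rng.standard_normal(s.shape)  # smooths model errors
        states.append(s)
    return states

# Hypothetical usage with a toy linear model and linear policy.
model  = lambda s, a: 0.95 * s + 0.1 * a
policy = lambda s: -0.5 * s
trajectory = noisy_rollout(policy, model, np.array([1.0]), horizon=50)
```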

Distributions over Models for Simulation. Model learning methods that maintain probabilistic uncertainty about true system dynamics allow the RL algorithm to generate distributions over the performance of a policy. Such methods explicitly model the uncertainty associated with the dynamics at each state and action. For example, when using a Gaussian process model of the transition dynamics, a policy can be evaluated by propagating the state and associated uncertainty forward in time. Such evaluations in the model can be used by a policy search approach to identify where to collect more data to improve a policy, and may be exploited to ensure that control is safe and robust to model uncertainty (Schneider, 1996; Bagnell and Schneider, 2001). When the new policy is evaluated on the real system, the novel observations can subsequently be incorporated into the forward model. Bagnell and Schneider (2001) showed that maintaining model uncertainty and using it in the inner loop of a policy search method enabled effective flight control using only minutes of collected data, while performance was compromised by considering a best-fit model. This approach uses explicit Monte-Carlo simulation in the sample estimates.
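The sketch below (hypothetical, for illustration) conveys the Monte-Carlo flavor of such approaches: a policy is evaluated by averaging returns over several models sampled from the current model distribution, so that policies which only succeed on a single best-fit model are penalized.

```python
import numpy as np

def evaluate_policy_over_models(policy, sampled_models, s0, horizon, reward_fn):
    """Average the return of a policy over models drawn from a model distribution."""
    returns = []
    for model in sampled_models:
        s, ret = s0, 0.0
        for _ in range(horizon):
            a = policy(s)
            s = model(s, a)
            ret += reward_fn(s, a)
        returns.append(ret)
    return np.mean(returns), np.std(returns)   # mean performance and its spread

# Hypothetical usage: models differ in an uncertain dynamics parameter.
sampled_models = [lambda s, a, k=k: k * s + 0.1 * a for k in (0.9, 0.95, 1.0)]
mean_ret, spread = evaluate_policy_over_models(lambda s: -0.5 * s, sampled_models,
                                               np.array([1.0]), 20,
                                               lambda s, a: -float(s @ s))
```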

By treating model uncertainty as if it were noise (Schneider, 1996) as well as employing analytic approximations of forward simulation, a cart-pole task can be solved with less than 20 seconds of interaction with the physical system (Deisenroth and Rasmussen, 2011); a visually driven block-stacking task has also been learned data-efficiently (Deisenroth et al., 2011). Similarly, solving a linearized control problem with multiple probabilistic models and combining the resulting closed-loop control with open-loop control has resulted in autonomous sideways sliding into a parking spot (Kolter et al., 2010). Instead of learning a model of the system dynamics, Lizotte et al. (2007) directly learned the expected return as a function of the policy parameters using Gaussian process regression in a black-box fashion, and, subsequently, searched for promising parameters in this model. The method has been applied to optimize the gait of an Aibo robot.

Figure 7: Autonomous inverted helicopter flight (Ng et al., 2004b) (picture reprinted with permission of Andrew Ng).

Sampling by Re-Using Random Numbers. A forward model can be used as a simulator to create roll-outs for training by sampling. When comparing results of different simulation runs, it is often hard to tell from a small number of samples whether a policy really worked better or whether the results are an effect of the simulated stochasticity. Using a large number of samples to obtain proper estimates of the expectations becomes prohibitively expensive if a large number of such comparisons need to be performed (e.g., for gradient estimation within an algorithm). A common technique in the statistics and simulation community (Glynn, 1987) to address this problem is to re-use the series of random numbers in fixed models, hence mitigating the noise contribution. Ng et al. (2004a,b) extended this approach for learned simulators. The resulting approach, PEGASUS, found various applications in the learning of maneuvers for autonomous helicopters (Bagnell and Schneider, 2001; Bagnell, 2004; Ng et al., 2004a,b), as illustrated in Figure 7. It has been used to learn control parameters for an RC car (Michels et al., 2005) and an autonomous blimp (Ko et al., 2007).
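The core trick can be sketched as follows (a simplified, hypothetical illustration of the idea behind re-using random numbers, not the original PEGASUS implementation): the same fixed sequence of random numbers is replayed for every policy that is compared, so that differences in the estimated returns reflect the policies rather than the sampling noise.

```python
import numpy as np

def rollout_with_fixed_noise(policy, model, s0, noise_sequence, reward_fn):
    """Evaluate a policy in a stochastic simulator driven by a pre-drawn noise sequence."""
    s, ret = s0, 0.0
    for eps in noise_sequence:            # identical noise for every policy compared
        a = policy(s)
        s = model(s, a, eps)
        ret += reward_fn(s, a)
    return ret

# Hypothetical usage: comparing two controller gains under the same "scenarios".
rng = np.random.default_rng(42)
noise = rng.standard_normal(100)
model = lambda s, a, eps: 0.9 * s + 0.1 * a + 0.05 * eps
reward = lambda s, a: -s ** 2
better_gain = max((-0.3, -0.7), key=lambda gain: rollout_with_fixed_noise(
    lambda s: gain * s, model, 1.0, noise, reward))
```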

While mental rehearsal has a long history in robotics, it is currently again becoming a hot topic, especially due to the work on probabilistic virtual simulation.

6.2 Successful Learning Approaches with Forward Models

Model-based approaches rely on a method that finds good policies using the learned model. In this section, we discuss methods that obtain a new policy candidate directly from a forward model. Some of these methods have a long history in optimal control and only work in conjunction with a forward model.

Iterative Learning Control. A powerful idea that has been developed in multiple forms in both the reinforcement learning and control communities is the use of crude, approximate models to determine gradients, e.g., for an update step. The resulting new policy is then evaluated in the real world and the model is updated. This approach is known as iterative learning control (Arimoto et al., 1984). A similar preceding idea was employed to minimize trajectory tracking errors (An et al., 1988) and is loosely related to feedback error learning (Kawato, 1990). More recently, variations on iterative learning control have been employed to learn robot control (Norrlöf, 2002; Bukkems et al., 2005), steering an RC car with a general analysis of approximate models for policy search in (Abbeel et al., 2006), a pick and place task (Freeman et al., 2010), and an impressive application of tying knots with a surgical robot at superhuman speeds (van den Berg et al., 2010).
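The basic update can be sketched as follows (a generic iterative learning control step, hypothetical and not taken from the cited implementations): the feed-forward command for the next trial is corrected by the tracking error observed in the current trial.

```python
import numpy as np

def ilc_update(u, y_desired, y_measured, learning_gain=0.5):
    """One iterative learning control step: correct the command with the tracking error."""
    error = y_desired - y_measured
    return u + learning_gain * error          # next trial's feed-forward command

# Hypothetical usage on a crude first-order plant, repeated over several trials.
def simulate(u):
    y, out = 0.0, []
    for ut in u:
        y = 0.8 * y + 0.2 * ut
        out.append(y)
    return np.array(out)

y_desired = np.sin(np.linspace(0, np.pi, 50))
u = np.zeros(50)
for _ in range(20):                           # tracking error shrinks over the trials
    u = ilc_update(u, y_desired, simulate(u))
```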

Locally Linear Quadratic Regulators. Instead of sampling from a forward model-based simulator, such learned models can be directly used to compute optimal control policies. This approach has resulted in a variety of robot reinforcement learning applications that include pendulum swing-up tasks learned with DDP (Atkeson and Schaal, 1997; Atkeson, 1998), devil-sticking (a form of gyroscopic juggling) obtained with local LQR solutions (Schaal and Atkeson, 1994), trajectory following with space-indexed controllers trained with DDP for an autonomous RC car (Kolter et al., 2008), and the aerobatic helicopter flight trained with DDP discussed above (Coates et al., 2009).

Value Function Methods with Learned Models. Obviously, mental rehearsal can be used straightforwardly with value function methods by simply pretending that the simulated roll-outs were generated from the real system. Learning in simulation while the computer is idle and employing directed exploration allows Q-learning to learn a navigation task from scratch in 20 minutes (Bakker et al., 2006). Two robots taking turns in learning a simplified soccer task were also able to profit from mental rehearsal (Uchibe et al., 1998). Nemec et al. (2010) used a value function learned in simulation to initialize the real robot learning. However, it is clear that model-based methods that use the model for creating more direct experience should potentially perform better.

Policy Search with Learned Models. Similarly as for value function methods, all model-free policy search methods can be used in conjunction with learned simulators. For example, both pairwise comparisons of policies and policy gradient approaches have been used with learned simulators (Bagnell and Schneider, 2001; Bagnell, 2004; Ng et al., 2004a,b). Transferring EM-like policy search (Kober and Peters, 2010) and Relative Entropy Policy Search (Peters et al., 2010a) appears to be a natural next step. Nevertheless, as mentioned in Section 6.1, a series of policy update methods has been suggested that were tailored for probabilistic simulation (Deisenroth et al., 2011; Deisenroth and Rasmussen, 2011; Lizotte et al., 2007).

Core Issues and General Techniques in Mental Rehearsal
  Dealing with Simulation Biases: An et al. (1988); Atkeson and Schaal (1997); Atkeson (1998); Bakker et al. (2003); Duan et al. (2007); Fagg et al. (1998); Ilg et al. (1999); Jakobi et al. (1995); Oßwald et al. (2010); Ross and Bagnell (2012); Svinin et al. (2001); Youssef (2005)
  Distributions over Models for Simulation: Bagnell and Schneider (2001); Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Kolter et al. (2010); Lizotte et al. (2007)
  Sampling by Re-Using Random Numbers: Bagnell and Schneider (2001); Bagnell (2004); Ko et al. (2007); Michels et al. (2005); Ng et al. (2004a,b)

Successful Learning Approaches with Forward Models
  Iterative Learning Control: Abbeel et al. (2006); An et al. (1988); van den Berg et al. (2010); Bukkems et al. (2005); Freeman et al. (2010); Norrlöf (2002)
  Locally Linear Quadratic Regulators: Atkeson and Schaal (1997); Atkeson (1998); Coates et al. (2009); Kolter et al. (2008); Schaal and Atkeson (1994); Tedrake et al. (2010)
  Value Function Methods with Learned Models: Bakker et al. (2006); Nemec et al. (2010); Uchibe et al. (1998)
  Policy Search with Learned Models: Bagnell and Schneider (2001); Bagnell (2004); Deisenroth et al. (2011); Deisenroth and Rasmussen (2011); Kober and Peters (2010); Ng et al. (2004a,b); Peters et al. (2010a)

Table 5: This table illustrates different methods of making robot reinforcement learning tractable using models.

There still appears to be the need for new methods that make better use of the model knowledge, both in policy search and for value function methods.

7 A Case Study: Ball-in-a-Cup

Up to this point in this paper, we have reviewed a large variety of problems and associated solutions within robot reinforcement learning. In this section, we will take a complementary approach and discuss one task in detail that has previously been studied.

This ball-in-a-cup task, due to its relative simplicity, can serve as an example to highlight some of the challenges and methods that were discussed earlier. We do not claim that the method presented is the best or only way to address the presented problem; instead, our goal is to provide a case study that shows design decisions which can lead to successful robotic reinforcement learning.

In Section 7.1, the experimental setting is described with a focus on the task and the reward. Section 7.2 discusses a type of pre-structured policies that has been particularly useful in robotics. Inclusion of prior knowledge is presented in Section 7.3. The advantages of the employed policy search algorithm are explained in Section 7.4. The use of simulations in this task is discussed in Section 7.5 and results on the real robot are described in Section 7.6. Finally, an alternative reinforcement learning approach is briefly explored in Section 7.7.

7.1 Experimental Setting: Task and Reward

The children's game ball-in-a-cup, also known as balero and bilboquet, is challenging even for adults. The toy consists of a small cup held in one hand (in this case, it is attached to the end-effector of the robot) and a small ball hanging on a string attached to the cup's bottom (for the employed toy, the string is 40 cm long). Initially, the ball is at rest, hanging down vertically. The player needs to move quickly to induce motion in the ball through the string, toss the ball in the air, and catch it with the cup. A possible movement is illustrated in Figure 8a.

The state of the system can be described by joint angles and joint velocities of the robot as well as the Cartesian coordinates and velocities of the ball (neglecting states that cannot be observed straightforwardly like the state of the string or global room air movement). The actions are the joint space accelerations, which are translated into torques by a fixed inverse dynamics controller. Thus, the reinforcement learning approach has to deal with twenty state and seven action dimensions, making discretization infeasible.

(a) Schematic drawings of the ball-in-a-cup motion

(b) Kinesthetic teach-in

(c) Final learned robot motion

Figure 8: This figure shows schematic drawings of the ball-in-a-cup motion (a), the final learned robot motion (c), as well as a kinesthetic teach-in (b). The green arrows show the directions of the current movements in that frame. The human cup motion was taught to the robot by imitation learning. The robot manages to reproduce the imitated motion quite accurately, but the ball misses the cup by several centimeters. After approximately 75 iterations of the Policy learning by Weighting Exploration with the Returns (PoWER) algorithm the robot has improved its motion so that the ball regularly goes into the cup.

An obvious reward function would be a binary return for the whole episode, depending on whether the ball was caught in the cup or not. In order to give the reinforcement learning algorithm a notion of closeness, Kober and Peters (2010) initially used a reward function based solely on the minimal distance between the ball and the cup. However, the algorithm exploited rewards resulting from hitting the cup with the ball from below or from the side, as such behaviors are easier to achieve and yield comparatively high rewards. To avoid such local optima, it was essential to find a good reward function that contains the additional prior knowledge that getting the ball into the cup is only possible from one direction. Kober and Peters (2010) expressed this constraint by computing the reward as

r(t_c) = exp(−α(x_c − x_b)² − α(y_c − y_b)²),

while r(t) = 0 for all t ≠ t_c. Here, t_c corresponds to the time step when the ball passes the rim of the cup in a downward direction, the cup position is denoted by [x_c, y_c, z_c] ∈ ℝ³, the ball position by [x_b, y_b, z_b] ∈ ℝ³, and the scaling parameter is α = 100. This reward function does not capture the case where the ball does not pass above the rim of the cup, as the reward will then always be zero. The employed approach performs a local policy search, and hence, an initial policy that brings the ball above the rim of the cup was required.
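A minimal sketch of this reward computation (with hypothetical variable names, following the formula above):

```python
import numpy as np

def ball_in_cup_reward(t, t_c, cup_xy, ball_xy, alpha=100.0):
    """Reward is non-zero only at the time step when the ball crosses the cup's rim downwards."""
    if t != t_c:
        return 0.0
    dx, dy = cup_xy[0] - ball_xy[0], cup_xy[1] - ball_xy[1]
    return float(np.exp(-alpha * dx ** 2 - alpha * dy ** 2))

# Hypothetical usage: ball 3 cm off in x at the crossing time step.
r = ball_in_cup_reward(t=120, t_c=120, cup_xy=(0.0, 0.0), ball_xy=(0.03, 0.0))
```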

The task exhibits some surprising complexity as the reward is not only affected by the cup's movements but foremost by the ball's movements. As the ball's movements are very sensitive to small perturbations, the initial conditions or small changes in the arm movement will drastically affect the outcome. Creating an accurate simulation is hard due to the nonlinear, unobservable dynamics of the string and its non-negligible weight.

7.2 Appropriate Policy Representation

The policy is represented by dynamical system motor primitives (Ijspeert et al., 2003; Schaal et al., 2007). The global movement is encoded as a point attractor linear dynamical system with an additional local transformation that allows a very parsimonious representation of the policy. This framework ensures the stability of the movement and allows the representation of arbitrarily shaped movements through the primitive's policy parameters. These parameters can be estimated straightforwardly by locally weighted regression. Note that the dynamical systems motor primitives ensure the stability of the movement generation but cannot guarantee the stability of the movement execution. These primitives can be modified through their meta-parameters in order to adapt to the final goal position, the movement amplitude, or the duration of the movement. The resulting movement can start from arbitrary positions and velocities and go to arbitrary final positions while maintaining the overall shape of the trajectory.

While the original formulation in (Ijspeert et al., 2003) for discrete dynamical systems motor primitives used a second-order system to represent the phase z of the movement, this formulation has proven to be unnecessarily complicated in practice. Since then, it has been simplified and, in (Schaal et al., 2007), it was shown that a single first-order system suffices

ż = −τ α_z z.    (7)

This canonical system has the time constant τ = 1/T, where T is the duration of the motor primitive, and a parameter α_z which is chosen such that z ≈ 0 at T to ensure that the influence of the transformation function, shown in Equation (9), vanishes. Subsequently, the internal state x of a second system is chosen such that the positions q of all degrees of freedom are given by q = x_1, the velocities q̇ by q̇ = τ x_2 = ẋ_1 and the accelerations q̈ by q̈ = τ ẋ_2. Under these assumptions, the learned dynamics of Ijspeert motor primitives can be expressed in the following form

ẋ_2 = τ α_x (β_x (g − x_1) − x_2) + τ A f(z),    (8)
ẋ_1 = τ x_2.

This set of differential equations has the same time constant τ as the canonical system, parameters α_x, β_x set such that the system is critically damped, a goal parameter g, a transformation function f and an amplitude matrix A = diag(a_1, a_2, ..., a_n), with the amplitude modifier a = [a_1, a_2, ..., a_n]. In (Schaal et al., 2007), they use a = g − x_1^0 with the initial position x_1^0, which ensures linear scaling. Alternative choices are possibly better suited for specific tasks, see e.g., (Park et al., 2008). The transformation function f(z) alters the output of the first system, in Equation (7), so that the second system, in Equation (8), can represent complex nonlinear patterns and it is given by

f(z) = Σ_{i=1}^N ψ_i(z) w_i z.    (9)

Here, w_i contains the i-th adjustable parameter of all degrees of freedom, N is the number of parameters per degree of freedom, and ψ_i(z) are the corresponding weighting functions (Schaal et al., 2007). Normalized Gaussian kernels are used as weighting functions given by

ψ_i(z) = exp(−h_i (z − c_i)²) / Σ_{j=1}^N exp(−h_j (z − c_j)²).    (10)

These weighting functions localize the interaction in phase space using the centers c_i and widths h_i. Note that the degrees of freedom (DoF) are usually all modeled as independent in Equation (8). All DoFs are synchronous as the dynamical systems for all DoFs start at the same time, have the same duration, and the shape of the movement is generated using the transformation f(z) in Equation (9). This transformation function is learned as a function of the shared canonical system in Equation (7).
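To make the structure of Equations (7)–(10) concrete, the following sketch (a simplified, hypothetical single-DoF implementation with Euler integration and an assumed placement of centers and widths, not the original code) rolls out one dynamical system motor primitive for given weights w_i.

```python
import numpy as np

def dmp_rollout(w, g, x0, T=1.0, alpha_x=25.0, beta_x=6.25, alpha_z=8.0, dt=0.002):
    """Integrate a single-DoF discrete dynamical system motor primitive (cf. Eqs. 7-10)."""
    n = len(w)
    centers = np.exp(-alpha_z * np.linspace(0, 1, n))        # spread centers along the phase
    widths = 1.0 / (np.diff(centers, append=centers[-1] * 0.5) ** 2 + 1e-6)
    tau, amplitude = 1.0 / T, g - x0
    z, x1, x2, traj = 1.0, x0, 0.0, []
    for _ in range(int(T / dt)):
        psi = np.exp(-widths * (z - centers) ** 2)
        f = np.sum(psi * w * z) / (np.sum(psi) + 1e-10)       # transformation function, Eqs. (9)-(10)
        x2_dot = tau * (alpha_x * (beta_x * (g - x1) - x2) + amplitude * f)  # Eq. (8)
        x1_dot = tau * x2
        z += -tau * alpha_z * z * dt                          # canonical system, Eq. (7)
        x2 += x2_dot * dt
        x1 += x1_dot * dt
        traj.append(x1)
    return np.array(traj)

# Hypothetical usage: 10 weights shape the movement from x0 = 0 to the goal g = 1.
positions = dmp_rollout(w=np.zeros(10), g=1.0, x0=0.0)
```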

This policy can be seen as a parameterization of a mean policy in the form a = θ^T µ(s, t), which is linear in parameters. Thus, it is straightforward to include prior knowledge from a demonstration using supervised learning by locally weighted regression.

This policy is augmented by an additive exploration noise term ε(s, t) to make policy search methods possible. As a result, the explorative policy can be given in the form a = θ^T µ(s, t) + ε(µ(s, t)). Some policy search approaches have previously used state-independent, white Gaussian exploration, i.e., ε(µ(s, t)) ∼ N(0, Σ). However, such unstructured exploration at every step has several disadvantages, notably: (i) it causes a large variance which grows with the number of time-steps, (ii) it perturbs actions too frequently, thus “washing out” their effects, and (iii) it can damage the system that is executing the trajectory.

Alternatively, one could generate a form of structured, state-dependent exploration (Rückstieß et al., 2008) ε(µ(s, t)) = ε_t^T µ(s, t) with [ε_t]_ij ∼ N(0, σ_ij²), where the σ_ij² are meta-parameters of the exploration that can also be optimized. This argument results in the policy a = (θ + ε_t)^T µ(s, t), corresponding to the distribution a ∼ π(a_t|s_t, t) = N(a | θ^T µ(s, t), µ(s, t)^T Σ µ(s, t)). Instead of directly exploring in action space, this type of policy explores in parameter space.

7.3 Generating a Teacher Demonstration

Children usually learn this task by first observing another person presenting a demonstration. They then try to duplicate the task through trial-and-error-based learning. To mimic this process, the motor primitives were first initialized by imitation. Subsequently, they were improved by reinforcement learning.

A demonstration for imitation was obtained by recording the motions of a human player performing kinesthetic teach-in as shown in Figure 8b. Kinesthetic teach-in means “taking the robot by the hand”, performing the task by moving the robot while it is in gravity-compensation mode, and recording the joint angles, velocities and accelerations. It requires a back-drivable robot system that is similar enough to a human arm to not cause embodiment issues. Even with demonstration, the resulting robot policy fails to catch the ball with the cup, leading to the need for self-improvement by reinforcement learning. As discussed in Section 7.1, the initial demonstration was needed to ensure that the ball goes above the rim of the cup.

7.4 Reinforcement Learning by Policy Search

Policy search methods are better suited for a scenario like this, where the task is episodic, local optimization is sufficient (thanks to the initial demonstration), and high dimensional, continuous states and actions need to be taken into account. A single update step in a gradient based method usually requires as many episodes as parameters to be optimized. Since the expected number of parameters was in the hundreds, a different approach had to be taken because gradient based methods are impractical in this scenario. Furthermore, the step-size parameter for gradient based methods often is a crucial parameter that needs to be tuned to achieve good performance. Instead, an expectation-maximization inspired algorithm was employed that requires significantly fewer samples and has no learning rate.

Kober and Peters (2009) have derived a framework of reward weighted imitation. Based on (Dayan and Hinton, 1997), they consider the return of an episode as an improper probability distribution. A lower bound of the logarithm of the expected return is maximized. Depending on the strategy of optimizing this lower bound and the exploration strategy, the framework yields several well-known policy search algorithms as well as the novel Policy learning by Weighting Exploration with the Returns (PoWER) algorithm. PoWER is an expectation-maximization inspired algorithm that employs state-dependent exploration (as discussed in Section 7.2). The update rule is given by

θ' = θ + E{ Σ_{t=1}^T W(s_t, t) Q^π(s_t, a_t, t) }^{-1} E{ Σ_{t=1}^T W(s_t, t) ε_t Q^π(s_t, a_t, t) },

where W(s_t, t) = µ(s, t) µ(s, t)^T (µ(s, t)^T Σ µ(s, t))^{-1}. Intuitively, this update can be seen as a reward-weighted imitation (or recombination) of previously seen episodes. Depending on its effect on the state-action value function, the exploration of each episode is incorporated more or less strongly into the updated policy. To reduce the number of trials in this on-policy scenario, the trials are reused through importance sampling (Sutton and Barto, 1998). To avoid the fragility that sometimes results from importance sampling in reinforcement learning, samples with very small importance weights were discarded. In essence, this algorithm performs a local search around the policy learned from demonstration and prior knowledge.

7.5 Use of Simulations in Robot Reinforcement Learning

The robot is simulated by rigid body dynamics with parameters estimated from data. The toy is simulated as a pendulum with an elastic string that switches to a ballistic point mass when the ball is closer to the cup than the string is long. The spring, damper and restitution constants were tuned to match data recorded on a VICON system. The SL framework (Schaal, 2009) allowed us to switch between the simulated robot and the real one with a simple recompile. Even though the simulation matches recorded data very well, policies that get the ball in the cup in simulation usually missed the cup by several centimeters on the real system and vice-versa. One conceivable approach could be to first improve a demonstrated policy in simulation and only perform the fine-tuning on the real robot.

However, this simulation was very helpful to develop and tune the algorithm as it runs faster in simulation than real-time and does not require human supervision or intervention. The algorithm was initially confirmed and tweaked with unrelated, simulated benchmark tasks (shown in (Kober and Peters, 2010)). The use of an importance sampler was essential to achieve good performance and required many tests over the course of two weeks. A very crude importance sampler that considers only the n best previous episodes worked sufficiently well in practice. Depending on the number n, the algorithm exhibits a more or less pronounced greedy behavior. Additionally, there are a number of possible simplifications for the learning algorithm, some of which work very well in practice even if the underlying assumptions do not strictly hold in reality. The finally employed variant

θ' = θ + ⟨ Σ_{t=1}^T ε_t Q^π(s_t, a_t, t) ⟩_{ω(τ)} / ⟨ Σ_{t=1}^T Q^π(s_t, a_t, t) ⟩_{ω(τ)}

assumes that only a single basis function is active at a given time, while there is actually some overlap for the motor primitive basis functions. The importance sampler is denoted by ⟨·⟩_{ω(τ)}. The implementation is further simplified as the reward is zero for all but one time-step per episode.
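The following sketch (hypothetical, simplified to a single parameter vector and pre-computed per-episode quantities) illustrates this final update rule together with the crude importance sampler that keeps only the n best previous episodes.

```python
import numpy as np

def power_update(theta, episodes, n_best=10):
    """Simplified PoWER-style update using the n best previous episodes as importance sampler.

    Each episode is a dict with hypothetical fields:
      'eps_q' : sum over time steps of (exploration eps_t * Q values), same shape as theta
      'q'     : sum over time steps of the Q values (a scalar)
      'return': the episode return used to rank episodes
    """
    best = sorted(episodes, key=lambda e: e['return'], reverse=True)[:n_best]
    numerator = sum(e['eps_q'] for e in best)
    denominator = sum(e['q'] for e in best) + 1e-10
    return theta + numerator / denominator

# Hypothetical usage with fabricated episode statistics for a 3-parameter policy.
rng = np.random.default_rng(0)
episodes = [{'eps_q': rng.normal(size=3), 'q': abs(rng.normal()) + 1.0,
             'return': rng.uniform()} for _ in range(20)]
theta = power_update(np.zeros(3), episodes)
```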

To adapt the algorithm to this particular task, the most important parameters to tune were the “greediness” of the importance sampling, the initial magnitude of the exploration, and the number of parameters for the policy. These parameters were identified by a coarse grid search in simulation with various initial demonstrations and seeds for the random number generator. Once the simulation and the grid search were coded, this process only took a few minutes. The exploration parameter is fairly robust if it is in the same order of magnitude as the policy parameters. For the importance sampler, using the 10 best previous episodes was a good compromise. The number of policy parameters needs to be high enough to capture enough details to get the ball above the rim of the cup for the initial demonstration. On the other hand, having more policy parameters will potentially slow down the learning process. The number of needed policy parameters for various demonstrations was in the order of 30 parameters per DoF. The demonstration employed for the results shown in more detail in this paper used 31 parameters per DoF for an approximately 3 second long movement, hence 217 policy parameters in total. Having three times as many policy parameters slowed down the convergence only slightly.

7.6 Results on the Real Robot

The first run on the real robot used the demonstration shown in Figure 8 and directly worked without any further parameter tuning. For the five runs with this demonstration, which took approximately one hour each, the robot got the ball into the cup for the first time after 42-45 episodes and regularly succeeded at bringing the ball into the cup after 70-80 episodes. The policy always converged to the maximum after 100 episodes. Running the real robot experiment was tedious as the ball was tracked by a stereo vision system, which sometimes failed and required a manual correction of the reward. As the string frequently entangles during failures and the robot cannot unravel it, human intervention is required. Hence, the ball had to be manually reset after each episode.

If the constraint of getting the ball above the rim of the cup for the initial policy is fulfilled, the presented approach works well for a wide variety of initial demonstrations including various teachers and two different movement strategies (swinging the ball or pulling the ball straight up). Convergence took between 60 and 130 episodes, which largely depends on the initial distance to the cup but also on the robustness of the demonstrated policy.

7.7 Alternative Approach with Value Function Methods

Nemec et al. (2010) employ an alternative reinforcement learning approach to achieve the ball-in-a-cup task with a Mitsubishi PA10 robot. They decomposed the task into two sub-tasks, the swing-up phase and the catching phase. In the swing-up phase, the ball is moved above the cup. In the catching phase, the ball is caught with the cup using an analytic prediction of the ball trajectory based on the movement of a flying point mass. The catching behavior is fixed; only the swing-up behavior is learned. The paper proposes to use SARSA to learn the swing-up movement. The states consist of the cup positions and velocities as well as the angular positions and velocities of the ball. The actions are the accelerations of the cup in a single Cartesian direction. Tractability is achieved by discretizing both the states (324 values) and the actions (5 values) and initialization by simulation. The behavior was first learned in simulation requiring 220 to 300 episodes. The state-action value function learned in simulation was used to initialize the learning on the real robot. The robot required an additional 40 to 90 episodes to adapt the behavior learned in simulation to the real environment.
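As a reminder of the underlying algorithm, a generic tabular SARSA update (a textbook sketch, not the implementation of Nemec et al. (2010)) looks as follows for such a discretized state-action space:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy SARSA update for a discretized state-action table."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Hypothetical usage with 324 discretized states and 5 discretized actions,
# matching the coarse discretization described above.
Q = np.zeros((324, 5))
Q = sarsa_update(Q, s=10, a=2, r=0.0, s_next=11, a_next=3)
```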

8 Discussion

We have surveyed the state of the art in robot reinforcement learning for both general reinforcement learning audiences and robotics researchers to provide possibly valuable insight into successful techniques and approaches.

From this overview, it is clear that using reinforcement learning in the domain of robotics is not yet a straightforward undertaking but rather requires a certain amount of skill. Hence, in this section, we highlight several open questions faced by the robotic reinforcement learning community in order to make progress towards “off-the-shelf” approaches as well as a few current practical challenges. Finally, we try to summarize several key lessons from robotic reinforcement learning for the general reinforcement learning community.

8.1 Open Questions in Robotic Reinforcement Learning

Reinforcement learning is clearly not applicable to robotics “out of the box” yet, in contrast to supervised learning where considerable progress has been made in large-scale, easy deployment over the last decade. As this paper illustrates, reinforcement learning can be employed for a wide variety of physical systems and control tasks in robotics. It is unlikely that single answers exist for such a heterogeneous field, but even for very closely related tasks, appropriate methods currently need to be carefully selected. The user has to decide when sufficient prior knowledge is given and learning can take over. All methods require hand-tuning for choosing appropriate representations, reward functions, and the required prior knowledge. Even the correct use of models in robot reinforcement learning requires substantial future research. Clearly, a key step in robotic reinforcement learning is the automated choice of these elements and having robust algorithms that limit the required tuning to the point where even a naive user would be capable of using robotic RL.

How to choose representations automatically? The automated selection of appropriate representations remains a difficult problem as the action space in robotics often is inherently continuous and multi-dimensional. While there are good reasons for using the methods presented in Section 4 in their respective domains, the question whether to approximate states, value functions, policies – or a mix of the three – remains open. The correct methods to handle the inevitability of function approximation remain under intense study, as does the theory to back it up.

How to generate reward functions from data? Good reward functions are always essential to the success of a robot reinforcement learning approach (see Section 3.4). While inverse reinforcement learning can be used as an alternative to manually designing the reward function, it relies on the design of features that capture the important aspects of the problem space instead. Finding feature candidates may require insights not altogether different from the ones needed to design the actual reward function.

How much can prior knowledge help? How much is needed? Incorporating prior knowledge is one of the main tools to make robotic reinforcement learning tractable (see Section 5). However, it is often hard to tell in advance how much prior knowledge is required to enable a reinforcement learning algorithm to succeed in a reasonable number of episodes. For such cases, a loop of imitation and reinforcement learning may be a desirable alternative. Nevertheless, sometimes, prior knowledge may not help at all. For example, obtaining initial policies from human demonstrations can be virtually impossible if the morphology of the robot is too different from a human's (which is known as the correspondence problem (Argall et al., 2009)). Whether alternative forms of prior knowledge can help here may be a key question to answer.

How to integrate more tightly with perception? Much current work on robotic reinforcement learning relies on subsystems that abstract away perceptual information, limiting the techniques to simple perception systems and heavily pre-processed data. This abstraction is in part due to limitations of existing reinforcement learning approaches at handling inevitably incomplete, ambiguous and noisy sensor data. Learning active perception jointly with the robot's movement and semantic perception are open problems that present tremendous opportunities for simplifying as well as improving robot behavior programming.

How to reduce parameter sensitivity? Many algorithms work fantastically well for a very narrow range of conditions or even tuning parameters. A simple example would be the step size in a gradient based method. However, searching anew for the best parameters for each situation is often prohibitive with physical systems. Instead, algorithms that work fairly well for a large range of situations and parameters are potentially much more interesting for practical robotic applications.

How to deal with model errors and under-modeling? Model based approaches can significantly reduce the need for real-world interactions. Methods that are based on approximate models and use local optimization often work well (see Section 6). As real-world samples are usually more expensive than comparatively cheap calculations, this may be a significant advantage. However, for most robot systems, there will always be under-modeling and resulting model errors. Hence, the policies learned only in simulation frequently cannot be transferred directly to the robot.

This problem may be inevitable due to uncertainty about the true system dynamics, the non-stationarity of system dynamics (both the environment and the robot itself change dynamically; for example, vision systems depend on the lighting condition and robot dynamics change with wear and the temperature of the grease), and the inability of any model in our class to be perfectly accurate in its description of those dynamics, which have led to robust control theory (Zhou and Doyle, 1997). Reinforcement learning approaches mostly require the behavior designer to deal with this problem by incorporating model uncertainty with artificial noise or carefully choosing reward functions to discourage controllers that generate frequencies that might excite unmodeled dynamics.

Tremendous theoretical and algorithmic challenges arise from developing algorithms that are robust to both model uncertainty and under-modeling while ensuring fast learning and performance. A possible approach may be the full Bayesian treatment of the impact of possible model errors onto the policy; however, this approach runs the risk of generating overly conservative policies, as also happened in robust optimal control.

This list of questions is by no means exhaustive; however, it provides a fair impression of the critical issues for basic research in this area.

8.2 Practical Challenges for Robotic Reinforcement Learning

More applied problems for future research result from practical challenges in robot reinforcement learning:

Exploit Data Sets Better. Humans who learn a new task build upon previously learned skills. For example, after a human learns how to throw balls, learning how to throw darts becomes significantly easier. Being able to transfer previously-learned skills to other tasks and potentially to robots of a different type is crucial. For complex tasks, learning cannot be achieved globally. It is essential to reuse other locally learned information from past data sets. While such transfer learning has been studied more extensively in other parts of machine learning, tremendous opportunities for leveraging such data exist within robot reinforcement learning. Making such data sets with many skills publicly available would be a great service for robotic reinforcement learning research.

Comparable Experiments and Consistent Evaluation. The difficulty of performing, and reproducing, large scale experiments due to the expense, fragility, and differences between hardware remains a key limitation in robotic reinforcement learning. The current movement towards more standardization within robotics may aid these efforts significantly, e.g., by possibly providing a standard robotic reinforcement learning setup to a larger community – both real and simulated. These questions need to be addressed in order for research in self-improving robots to progress.

8.3 Robotics Lessons for Reinforcement Learning

Most of this article has been aimed equally at reinforcement learning and robotics researchers as well as practitioners. However, this section attempts to convey a few important, possibly provocative, take-home messages for the classical reinforcement learning community.

Focus on High-Dimensional Continuous Actions and Constant Adaptation. Robotic problems have clearly driven theoretical reinforcement learning research, particularly in policy search, inverse optimal control approaches, and model robustness. However, the practical impact of robotic reinforcement learning problems (e.g., multi-dimensional continuous action spaces, continuously drifting noise, frequent changes in the hardware and the environment, and the inevitability of under-modeling) may not yet have been sufficiently appreciated by the theoretical reinforcement learning community. These problems have often caused robotic reinforcement learning to take significantly different approaches than would be dictated by theory. Perhaps as a result, robotic reinforcement learning approaches are often closer to classical optimal control solutions than to the methods typically studied in the machine learning literature.

Exploit Domain Structure for Scalability. The grounding of RL in robotics alleviates the general problem of scaling reinforcement learning into high-dimensional domains by exploiting the structure of the physical world. Prior knowledge of the behavior of the system and the task is often available. Incorporating even crude models and domain structure into the learning approach (e.g., to approximate gradients) has yielded impressive results (Kolter and Ng, 2009a).
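To make the idea of using a crude model to approximate gradients more tangible, here is a small sketch loosely inspired by the signed-derivative idea of Kolter and Ng (2009a): only the assumed sign of the model's sensitivities is used to update controller parameters. The parameters, targets, and error function are illustrative assumptions, not the original algorithm.

```python
# Sketch inspired by the signed-derivative idea: use only the *sign* of a
# crude model's input-output sensitivity as the update direction.
# All numerical values below are illustrative assumptions.
import numpy as np

# Crude prior knowledge: increasing each parameter increases the
# corresponding tracked output (signs only, no magnitudes).
sign_jacobian = np.array([+1.0, +1.0])

def tracking_error(params):
    """Hypothetical measured error for the current parameters
    (stands in for a rollout on the real robot)."""
    target = np.array([0.8, 0.3])
    return target - params

params = np.zeros(2)
for _ in range(100):
    err = tracking_error(params)
    params += 0.05 * sign_jacobian * err  # step along the signed direction
print(params)  # approaches the target despite the extremely crude model
```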

Local Optimality and Controlled State Distributions. Much research in classical reinforcement learning aims at finding value functions that are optimal over the entire state space, which is most likely intractable. In contrast, the success of policy search approaches in robotics relies on their implicit maintenance and controlled change of a state distribution under which the policy can be optimized. Focusing on slowly changing state distributions may also benefit value function methods.

Reward Design. It has been repeatedly demonstrated that reinforcement learning approaches benefit significantly from reward shaping (Ng et al., 1999), and particularly from rewards that convey a notion of closeness rather than only binary success or failure. A learning problem is potentially difficult if the reward is sparse, if there are significant delays between an action and the associated significant reward, or if the reward is not smooth (i.e., very small changes to the policy lead to a drastically different outcome). In classical reinforcement learning, discrete rewards are often considered, e.g., a small negative reward per time-step and a large positive reward for reaching the goal. In contrast, robotic reinforcement learning approaches often need more physically motivated reward shaping based on continuous values and consider multi-objective reward functions, such as minimizing motor torques while achieving a task.
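A classical way to encode such a notion of closeness without changing the optimal policy is potential-based reward shaping (Ng et al., 1999). The short sketch below uses a negative distance-to-goal potential as an illustrative assumption; the task and constants are hypothetical.

```python
# Potential-based reward shaping (Ng et al., 1999): the shaped reward
# r + gamma * Phi(s') - Phi(s) preserves the optimal policy. The potential
# (negative distance to an assumed goal) is only an illustrative choice.
import numpy as np

GOAL = np.array([1.0, 0.0])
GAMMA = 0.99

def potential(state):
    """Potential Phi(s): closeness to the goal."""
    return -np.linalg.norm(state - GOAL)

def shaped_reward(state, next_state, sparse_reward):
    return sparse_reward + GAMMA * potential(next_state) - potential(state)

# A step toward the goal receives dense positive feedback even though the
# sparse task reward is still zero.
print(shaped_reward(np.array([0.0, 0.0]), np.array([0.2, 0.0]), 0.0))
```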

We hope that these points will help solicit new, targeted solutions for robotics from the reinforcement learning community.

Acknowledgments & Funding

We thank the anonymous reviewers for their valuable suggestions of additional papers and points for discussion. Their input helped us to significantly extend and improve the paper.

This work was supported by the European Community's Seventh Framework Programme [grant numbers ICT-248273 GeRT, ICT-270327 CompLACS]; the Defense Advanced Research Projects Agency through Autonomous Robotics Manipulation-Software; and the Office of Naval Research Multidisciplinary University Research Initiatives Distributed Reasoning in Reduced Information Spaces and Provably Stable Vision-Based Control of High-Speed Flight through Forests and Urban Environments.

References

Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NIPS).

Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML).

Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. In International Conference on Machine Learning (ICML).

An, C. H., Atkeson, C. G., and Hollerbach, J. M. (1988). Model-based control of a robot manipulator. MIT Press, Cambridge, MA, USA.

Argall, B. D., Browning, B., and Veloso, M. (2008). Learning robot motion control with demonstration and advice-operators. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Argall, B. D., Chernova, S., Veloso, M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483.

Arimoto, S., Kawamura, S., and Miyazaki, F. (1984). Bettering operation of robots by learning. Journal of Robotic Systems, 1(2):123–140.

Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K. (1996). Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning, 23(2-3):279–303.

Atkeson, C. G. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems (NIPS).

Atkeson, C. G. (1998). Nonparametric model-based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).

Atkeson, C. G., Moore, A. W., and Schaal, S. (1997). Locally weighted learning for control. AI Review, 11:75–113.

Atkeson, C. G. and Schaal, S. (1997). Robot learning from demonstration. In International Conference on Machine Learning (ICML).

Bagnell, J. A. (2004). Learning Decisions: Robustness, Uncertainty, and Approximation. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Bagnell, J. A., Ng, A. Y., Kakade, S., and Schneider, J. (2003). Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS).

Bagnell, J. A. and Schneider, J. (2003). Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI).

Bagnell, J. A. and Schneider, J. C. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In IEEE International Conference on Robotics and Automation (ICRA).

Baird, L. C. and Klopf, H. (1993). Reinforcement learning with high-dimensional continuous actions. Technical Report WL-TR-93-1147, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301.

Bakker, B., Zhumatiy, V., Gruener, G., and Schmidhuber, J. (2003). A robot that reinforcement-learns to identify and memorize important previous observations. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Bakker, B., Zhumatiy, V., Gruener, G., and Schmidhuber, J. (2006). Quasi-online reinforcement learning for robots. In IEEE International Conference on Robotics and Automation (ICRA).

Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

Bellman, R. E. (1967). Introduction to the Mathematical Theory of Control Processes, volume 40-I. Academic Press, New York, NY.

Bellman, R. E. (1971). Introduction to the Mathematical Theory of Control Processes, volume 40-II. Academic Press, New York, NY.

Benbrahim, H., Doleac, J., Franklin, J., and Selfridge, O. (1992). Real-time learning: A ball on a beam. In International Joint Conference on Neural Networks (IJCNN).

Benbrahim, H. and Franklin, J. A. (1997). Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22(3-4):283–302.

Bentivegna, D. C. (2004). Learning from Observation Using Primitives. PhD thesis, Georgia Institute of Technology.

Bentivegna, D. C., Atkeson, C. G., and Cheng, G. (2004). Learning from observation and practice using behavioral primitives: Marble maze.

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific.

Betts, J. T. (2001). Practical methods for optimal control using nonlinear programming, volume 3 of Advances in Design and Control. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Birdwell, N. and Livingston, S. (2007). Reinforcement learning in sensor-guided AIBO robots. Technical report, University of Tennessee, Knoxville. Advised by Dr. Itamar Elhanany.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer.

Bitzer, S., Howard, M., and Vijayakumar, S. (2010). Using dimensionality reduction to exploit constraints in reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems (NIPS).

Brafman, R. I. and Tennenholtz, M. (2002). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231.

Buchli, J., Stulp, F., Theodorou, E., and Schaal, S. (2011). Learning variable impedance control. International Journal of Robotics Research, 30(7):820–833.

Bukkems, B., Kostic, D., de Jager, B., and Steinbuch, M. (2005). Learning-based identification and iterative learning control of direct-drive robots. IEEE Transactions on Control Systems Technology, 13(4):537–549.

Buşoniu, L., Babuška, R., de Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, Florida.

Coates, A., Abbeel, P., and Ng, A. Y. (2009). Apprenticeship learning for helicopter control. Communications of the ACM, 52(7):97–105.

Cocora, A., Kersting, K., Plagemann, C., Burgard, W., and de Raedt, L. (2006). Learning relational navigation policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Conn, K. and Peters II, R. A. (2007). Reinforcement learning with a supervisor for a mobile robot in a real-world environment. In IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA).

Daniel, C., Neumann, G., and Peters, J. (2012). Learning concurrent motor skills in versatile solution spaces. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Daumé III, H., Langford, J., and Marcu, D. (2009). Search-based structured prediction. Machine Learning Journal (MLJ), 75:297–325.

Dayan, P. and Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278.

Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In 28th International Conference on Machine Learning (ICML).

Deisenroth, M. P., Rasmussen, C. E., and Fox, D. (2011). Learning to control a low-cost manipulator using data-efficient reinforcement learning. In Robotics: Science and Systems (R:SS).

Donnart, J.-Y. and Meyer, J.-A. (1996). Learning reactive and planning rules in a motivationally autonomous animat. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 26(3):381–395.

Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In American Mathematical Society Conference Math Challenges of the 21st Century.

Dorigo, M. and Colombetti, M. (1993). Robot shaping: Developing situated agents through learning. Technical report, International Computer Science Institute, Berkeley, CA.

Duan, Y., Cui, B., and Yang, H. (2008). Robot navigation based on fuzzy RL algorithm. In International Symposium on Neural Networks (ISNN).

Duan, Y., Liu, Q., and Xu, X. (2007). Application of reinforcement learning in robot soccer. Engineering Applications of Artificial Intelligence, 20(7):936–950.

Endo, G., Morimoto, J., Matsubara, T., Nakanishi, J., and Cheng, G. (2008). Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot. International Journal of Robotics Research, 27(2):213–228.

Erden, M. S. and Leblebicioğlu, K. (2008). Free gait generation with reinforcement learning for a six-legged robot. Robotics and Autonomous Systems, 56(3):199–212.

Fagg, A. H., Lotspeich, D. L., Hoff, J., and Bekey, G. A. (1998). Rapid reinforcement learning for reactive control policy design for autonomous robots. In Artificial Life in Robotics.

Fidelman, P. and Stone, P. (2004). Learning ball acquisition on a physical robot. In International Symposium on Robotics and Automation (ISRA).

Freeman, C., Lewin, P., Rogers, E., and Ratcliffe, J. (2010). Iterative learning control applied to a gantry robot and conveyor system. Transactions of the Institute of Measurement and Control, 32(3):251–264.

Gaskett, C., Fletcher, L., and Zelinsky, A. (2000). Reinforcement learning for a vision based mobile robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Geng, T., Porr, B., and Wörgötter, F. (2006). Fast biped walking with a reflexive controller and real-time policy searching. In Advances in Neural Information Processing Systems (NIPS).

Glynn, P. (1987). Likelihood ratio gradient estimation: An overview. In Winter Simulation Conference (WSC).

Goldberg, D. E. (1989). Genetic algorithms. Addison-Wesley.

Gordon, G. J. (1999). Approximate Solutions to Markov Decision Processes. PhD thesis, School of Computer Science, Carnegie Mellon University.

Gräve, K., Stückler, J., and Behnke, S. (2010). Learning motion skills from expert demonstrations and own experience using Gaussian process regression. In Joint International Symposium on Robotics (ISR) and German Conference on Robotics (ROBOTIK).

Greensmith, E., Bartlett, P. L., and Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530.

Guenter, F., Hersch, M., Calinon, S., and Billard, A. (2007). Reinforcement learning for imitating constrained reaching movements. Advanced Robotics, 21(13):1521–1544.

Gullapalli, V., Franklin, J., and Benbrahim, H. (1994). Acquiring robot skills via reinforcement learning. IEEE Control Systems Magazine, 14(1):13–24.

Hafner, R. and Riedmiller, M. (2003). Reinforcement learning on an omnidirectional mobile robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Hafner, R. and Riedmiller, M. (2007). Neural reinforcement learning controllers for a real robot application. In IEEE International Conference on Robotics and Automation (ICRA).

Hailu, G. and Sommer, G. (1998). Integrating symbolic knowledge in reinforcement learning. In IEEE International Conference on Systems, Man and Cybernetics (SMC).

Hart, S. and Grupen, R. (2011). Learning generalizable control programs. IEEE Transactions on Autonomous Mental Development, 3(3):216–231.

Hester, T., Quinlan, M., and Stone, P. (2010). Generalized model learning for reinforcement learning on a humanoid robot. In IEEE International Conference on Robotics and Automation (ICRA).

Hester, T., Quinlan, M., and Stone, P. (2012). RTMBA: A real-time model-based reinforcement learning architecture for robot control. In IEEE International Conference on Robotics and Automation (ICRA).

Huang, X. and Weng, J. (2002). Novelty and reinforcement learning in the value system of developmental robots. In 2nd International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.

Huber, M. and Grupen, R. A. (1997). A feedback control structure for on-line learning tasks. Robotics and Autonomous Systems, 22(3-4):303–315.

Ijspeert, A. J., Nakanishi, J., and Schaal, S. (2003). Learning attractor landscapes for learning motor primitives. In Advances in Neural Information Processing Systems (NIPS).

Ilg, W., Albiez, J., Jedele, H., Berns, K., and Dillmann, R. (1999). Adaptive periodic movement control for the four legged walking machine BISAM. In IEEE International Conference on Robotics and Automation (ICRA).

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1993). Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems (NIPS).

Jacobson, D. H. and Mayne, D. Q. (1970). Differential Dynamic Programming. Elsevier.

Jakobi, N., Husbands, P., and Harvey, I. (1995). Noise and the reality gap: The use of simulation in evolutionary robotics. In 3rd European Conference on Artificial Life.

Kaelbling, L. P. (1990). Learning in Embedded Systems. PhD thesis, Stanford University, Stanford, California.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

Kakade, S. (2003). On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.

Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML).

Kalakrishnan, M., Righetti, L., Pastor, P., and Schaal, S. (2011). Learning force control policies for compliant manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Kalman, R. E. (1964). When is a linear control system optimal? Journal of Basic Engineering, 86(1):51–60.

Kalmár, Z., Szepesvári, C., and Lőrincz, A. (1998). Modular reinforcement learning: An application to a real robot task. Lecture Notes in Artificial Intelligence, 1545:29–45.

Kappen, H. (2005). Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, page P11011.

Katz, D., Pyuro, Y., and Brock, O. (2008). Learning to manipulate articulated objects in unstructured environments using a grounded relational representation. In Robotics: Science and Systems (R:SS).

Kawato, M. (1990). Feedback-error-learning neural network for supervised motor learning. Advanced Neural Computers, 6(3):365–372.

Kearns, M. and Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232.

Keeney, R. and Raiffa, H. (1976). Decisions with multiple objectives: Preferences and value tradeoffs. J. Wiley, New York.

Kimura, H., Yamashita, T., and Kobayashi, S. (2001). Reinforcement learning of walking behavior for a four-legged robot. In IEEE Conference on Decision and Control (CDC).

Kirchner, F. (1997). Q-learning of complex behaviours on a six-legged walking machine. In EUROMICRO Workshop on Advanced Mobile Robots.

Kirk, D. E. (1970). Optimal control theory. Prentice-Hall, Englewood Cliffs, New Jersey.

Ko, J., Klein, D. J., Fox, D., and Hähnel, D. (2007). Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In IEEE International Conference on Robotics and Automation (ICRA).

Kober, J., Mohler, B., and Peters, J. (2008). Learning perceptual coupling for motor primitives. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Kober, J., Oztop, E., and Peters, J. (2010). Reinforcement learning to adjust robot movements to new situations. In Robotics: Science and Systems (R:SS).

Kober, J. and Peters, J. (2009). Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS).

Kober, J. and Peters, J. (2010). Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203.

Kohl, N. and Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation (ICRA).

Kollar, T. and Roy, N. (2008). Trajectory optimization using reinforcement learning for map exploration. International Journal of Robotics Research, 27(2):175–197.

Kolter, J. Z., Abbeel, P., and Ng, A. Y. (2007). Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in Neural Information Processing Systems (NIPS).

Kolter, J. Z., Coates, A., Ng, A. Y., Gu, Y., and DuHadway, C. (2008). Space-indexed dynamic programming: Learning to follow trajectories. In International Conference on Machine Learning (ICML).

Kolter, J. Z. and Ng, A. Y. (2009a). Policy search via the signed derivative. In Robotics: Science and Systems (R:SS).

Kolter, J. Z. and Ng, A. Y. (2009b). Regularization and feature selection in least-squares temporal difference learning. In International Conference on Machine Learning (ICML).

Kolter, J. Z., Plagemann, C., Jackson, D. T., Ng, A. Y., and Thrun, S. (2010). A probabilistic approach to mixed open-loop and closed-loop control, with application to extreme autonomous driving. In IEEE International Conference on Robotics and Automation (ICRA).

Konidaris, G. D., Kuindersma, S., Grupen, R., and Barto, A. G. (2011a). Autonomous skill acquisition on a mobile manipulator. In AAAI Conference on Artificial Intelligence (AAAI).

Konidaris, G. D., Kuindersma, S., Grupen, R., and Barto, A. G. (2012). Robot learning from demonstration by constructing skill trees. International Journal of Robotics Research, 31(3):360–375.

Konidaris, G. D., Osentoski, S., and Thomas, P. (2011b). Value function approximation in reinforcement learning using the Fourier basis. In AAAI Conference on Artificial Intelligence (AAAI).

Kroemer, O., Detry, R., Piater, J., and Peters, J. (2009). Active learning using mean shift optimization for robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Kroemer, O., Detry, R., Piater, J., and Peters, J. (2010). Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116.

Kuhn, H. W. and Tucker, A. W. (1950). Nonlinear programming. In Berkeley Symposium on Mathematical Statistics and Probability.

Kuindersma, S., Grupen, R., and Barto, A. G. (2011). Learning dynamic arm motions for postural recovery. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS).

Kwok, C. and Fox, D. (2004). Reinforcement learning for sensing strategies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Langford, J. and Zadrozny, B. (2005). Relating reinforcement learning performance to classification performance. In 22nd International Conference on Machine Learning (ICML).

Latzke, T., Behnke, S., and Bennewitz, M. (2007). Imitative reinforcement learning for soccer playing robots. In RoboCup 2006: Robot Soccer World Cup X, pages 47–58. Springer-Verlag.

Laud, A. D. (2004). Theory and Application of Reward Shaping in Reinforcement Learning. PhD thesis, University of Illinois at Urbana-Champaign.

Lewis, M. E. and Puterman, M. L. (2001). The Handbook of Markov Decision Processes: Methods and Applications, chapter Bias Optimality, pages 89–111. Kluwer.

Lin, H.-I. and Lai, C.-C. (2012). Learning collision-free reaching skill from primitives. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Lizotte, D., Wang, T., Bowling, M., and Schuurmans, D. (2007). Automatic gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI).

Mahadevan, S. and Connell, J. (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311–365.

Martínez-Marín, T. and Duckett, T. (2005). Fast reinforcement learning for vision-guided mobile robots. In IEEE International Conference on Robotics and Automation (ICRA).

Matarić, M. J. (1994). Reward functions for accelerated learning. In International Conference on Machine Learning (ICML).

Matarić, M. J. (1997). Reinforcement learning in the multi-robot domain. Autonomous Robots, 4:73–83.

Michels, J., Saxena, A., and Ng, A. Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning. In International Conference on Machine Learning (ICML).

Mitsunaga, N., Smith, C., Kanda, T., Ishiguro, H., and Hagita, N. (2005). Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Miyamoto, H., Schaal, S., Gandolfo, F., Gomi, H., Koike, Y., Osu, R., Nakano, E., Wada, Y., and Kawato, M. (1996). A Kendama learning robot based on bi-directional theory. Neural Networks, 9(8):1281–1302.

Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in Markov decision processes. In 29th International Conference on Machine Learning (ICML), pages 1711–1718.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130.

Morimoto, J. and Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1):37–51.

Muelling, K., Kober, J., Kroemer, O., and Peters, J. (2012). Learning to select and generalize striking movements in robot table tennis. International Journal of Robotics Research.

Nakanishi, J., Cory, R., Mistry, M., Peters, J., and Schaal, S. (2008). Operational space control: A theoretical and empirical comparison. International Journal of Robotics Research, 27:737–757.

Nemec, B., Tamošiunaite, M., Wörgötter, F., and Ude, A. (2009). Task adaptation through exploration and action sequencing. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS).

Nemec, B., Zorko, M., and Zlajpah, L. (2010). Learning of a ball-in-a-cup playing robot. In International Workshop on Robotics in Alpe-Adria-Danube Region (RAAD).

Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. (2004a). Autonomous inverted helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics (ISER).

Ng, A. Y., Harada, D., and Russell, S. J. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML).

Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. (2004b). Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).

Norrlöf, M. (2002). An adaptive iterative learning control algorithm with experiments on an industrial robot. IEEE Transactions on Robotics, 18(2):245–251.

Oßwald, S., Hornung, A., and Bennewitz, M. (2010). Learning reliable and efficient navigation with a humanoid. In IEEE International Conference on Robotics and Automation (ICRA).

Paletta, L., Fritz, G., Kintzler, F., Irran, J., and Dorffner, G. (2007). Perception and developmental learning of affordances in autonomous robots. In Hertzberg, J., Beetz, M., and Englert, R., editors, KI 2007: Advances in Artificial Intelligence, volume 4667 of Lecture Notes in Computer Science, pages 235–250. Springer.

Park, D.-H., Hoffmann, H., Pastor, P., and Schaal, S. (2008). Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields. In IEEE International Conference on Humanoid Robots (HUMANOIDS).

Pastor, P., Kalakrishnan, M., Chitta, S., Theodorou, E., and Schaal, S. (2011). Skill learning and task outcome prediction for manipulation. In IEEE International Conference on Robotics and Automation (ICRA).

Pendrith, M. (1999). Reinforcement learning in situated agents: Some theoretical problems and practical solutions. In European Workshop on Learning Robots (EWRL).

Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22(1):283–290.

Perkins, T. J. and Barto, A. G. (2002). Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3:803–832.

Peters, J., Muelling, K., and Altun, Y. (2010a). Relative entropy policy search. In National Conference on Artificial Intelligence (AAAI).

Peters, J., Muelling, K., Kober, J., Nguyen-Tuong, D., and Kroemer, O. (2010b). Towards motor skill learning for robotics. In International Symposium on Robotics Research (ISRR).

Peters, J. and Schaal, S. (2008a). Learning to control in operational space. International Journal of Robotics Research, 27(2):197–212.

Peters, J. and Schaal, S. (2008b). Natural actor-critic. Neurocomputing, 71(7-9):1180–1190.

Peters, J. and Schaal, S. (2008c). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697.

Peters, J., Vijayakumar, S., and Schaal, S. (2004). Linear quadratic regulation as benchmark for policy gradient methods. Technical report, University of Southern California.

Piater, J. H., Jodogne, S., Detry, R., Kraft, D., Krüger, N., Kroemer, O., and Peters, J. (2011). Learning visual representations for perception-action systems. International Journal of Robotics Research, 30(3):294–307.

Platt, R., Grupen, R. A., and Fagg, A. H. (2006). Improving grasp skills using schema structured learning. In International Conference on Development and Learning.

Powell, W. B. (2012). AI, OR and control theory: A rosetta stone for stochastic optimization. Technical report, Princeton University.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Randløv, J. and Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. In International Conference on Machine Learning (ICML), pages 463–471.

Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press.

Åström, K. J. and Wittenmark, B. (1989). Adaptive control. Addison-Wesley.

Ratliff, N., Bagnell, J. A., and Srinivasa, S. (2007). Imitation learning for locomotion and manipulation. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS).

Ratliff, N., Bradley, D., Bagnell, J. A., and Chestnutt, J. (2006a). Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems (NIPS).

Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. (2006b). Maximum margin planning. In International Conference on Machine Learning (ICML).

Riedmiller, M., Gabel, T., Hafner, R., and Lange, S. (2009). Reinforcement learning for robot soccer. Autonomous Robots, 27(1):55–73.

Rivlin, T. J. (1969). An Introduction to the Approximation of Functions. Courier Dover Publications.

Roberts, J. W., Manchester, I., and Tedrake, R. (2011). Feedback controller parameterizations for reinforcement learning. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

Roberts, J. W., Moret, L., Zhang, J., and Tedrake, R. (2010). From Motor to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence, chapter Motor learning at intermediate Reynolds number: Experiments with policy gradient on the flapping flight of a rigid wing, pages 293–309. Springer.

Rosenstein, M. T. and Barto, A. G. (2004). Reinforcement learning with supervision by a stable controller. In American Control Conference.

Ross, S. and Bagnell, J. A. (2012). Agnostic system identification for model-based reinforcement learning. In International Conference on Machine Learning (ICML).

Ross, S., Gordon, G., and Bagnell, J. A. (2011a). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Ross, S., Munoz, D., Hebert, M., and Bagnell, J. A. (2011b). Learning message-passing inference machines for structured prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Rottmann, A., Plagemann, C., Hilgers, P., and Burgard, W. (2007). Autonomous blimp control using model-free reinforcement learning in a continuous state and action space. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics). Springer-Verlag New York, Inc.

Rückstieß, T., Felder, M., and Schmidhuber, J. (2008). State-dependent exploration for policy gradient methods. In European Conference on Machine Learning (ECML).

Russell, S. (1998). Learning agents for uncertain environments (extended abstract). In Conference on Computational Learning Theory (COLT).

Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica, 65(3):487–516.

Sato, M., Nakamura, Y., and Ishii, S. (2002). Reinforcement learning for biped locomotion. In International Conference on Artificial Neural Networks.

Schaal, S. (1996). Learning from demonstration. In Advances in Neural Information Processing Systems (NIPS).

Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242.

Schaal, S. (2009). The SL simulation and real-time control software package. Technical report, University of Southern California.

Schaal, S. and Atkeson, C. G. (1994). Robot juggling: An implementation of memory-based learning. Control Systems Magazine, 14(1):57–71.

Schaal, S., Atkeson, C. G., and Vijayakumar, S. (2002). Scalable techniques from nonparametric statistics for real-time robot learning. Applied Intelligence, 17(1):49–60.

Schaal, S., Mohajerian, P., and Ijspeert, A. J. (2007). Dynamics systems vs. optimal control – a unifying view. Progress in Brain Research, 165(1):425–445.

Schneider, J. G. (1996). Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems (NIPS).

Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In International Conference on Machine Learning (ICML).

Silver, D., Bagnell, J. A., and Stentz, A. (2008). High performance outdoor navigation from overhead data using imitation learning. In Robotics: Science and Systems (R:SS).

Silver, D., Bagnell, J. A., and Stentz, A. (2010). Learning from demonstration for autonomous navigation in complex unstructured terrain. International Journal of Robotics Research, 29(12):1565–1592.

Smart, W. D. and Kaelbling, L. P. (1998). A framework for reinforcement learning on real robots. In National Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (AAAI/IAAI).

Smart, W. D. and Kaelbling, L. P. (2002). Effective reinforcement learning for mobile robots. In IEEE International Conference on Robotics and Automation (ICRA).

Soni, V. and Singh, S. P. (2006). Reinforcement learning of hierarchical skills on the Sony AIBO robot. In International Conference on Development and Learning (ICDL).

Sorg, J., Singh, S. P., and Lewis, R. L. (2010). Reward design via online gradient ascent. In Advances in Neural Information Processing Systems (NIPS).

Spall, J. C. (2003). Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.

Strens, M. and Moore, A. (2001). Direct policy search using paired statistical tests. In International Conference on Machine Learning (ICML).

Stulp, F. and Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In International Conference on Machine Learning (ICML).

Stulp, F., Theodorou, E., Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. (2011). Learning motion primitive goals for robust manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Machine Learning Conference (ICML).

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning. MIT Press, Boston, MA.

Sutton, R. S., Barto, A. G., and Williams, R. J. (1991). Reinforcement learning is direct adaptive optimal control. In American Control Conference.

Sutton, R. S., Koop, A., and Silver, D. (2007). On the role of tracking in stationary environments. In International Conference on Machine Learning (ICML).

Sutton, R. S., McAllester, D., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS).

Svinin, M. M., Yamada, K., and Ueda, K. (2001). Emergent synthesis of motion patterns for locomotion robots. Artificial Intelligence in Engineering, 15(4):353–363.

Tadepalli, P. and Ok, D. (1994). H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-1, Department of Computer Science, Oregon State University.

Tamei, T. and Shibata, T. (2009). Policy gradient learning of cooperative interaction with a robot using user's biological signals. In International Conference on Neural Information Processing (ICONIP).

Tamošiunaite, M., Nemec, B., Ude, A., and Wörgötter, F. (2011). Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59:910–922.

Tedrake, R. (2004). Stochastic policy gradient reinforcement learning on a simple 3D biped. In International Conference on Intelligent Robots and Systems (IROS).

Tedrake, R., Manchester, I. R., Tobenkin, M. M., and Roberts, J. W. (2010). LQR-trees: Feedback motion planning via sums of squares verification. International Journal of Robotics Research, 29:1038–1052.

Tedrake, R., Zhang, T. W., and Seung, H. S. (2005). Learning to walk in 20 minutes. In Yale Workshop on Adaptive and Learning Systems.

Tesch, M., Schneider, J. G., and Choset, H. (2011). Using response surfaces and expected improvement to optimize snake robot gait parameters. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Theodorou, E. A., Buchli, J., and Schaal, S. (2010). Reinforcement learning of motor skills in high dimensions: A path integral approach. In IEEE International Conference on Robotics and Automation (ICRA).

Thrun, S. (1995). An approach to learning mobile robot navigation. Robotics and Autonomous Systems, 15:301–319.

Tokic, M., Ertel, W., and Fessler, J. (2009). The crawler, a class room demonstrator for reinforcement learning. In International Florida Artificial Intelligence Research Society Conference (FLAIRS).

Toussaint, M., Storkey, A., and Harmeling, S. (2010). Inference and Learning in Dynamic Models, chapter Expectation-Maximization methods for solving (PO)MDPs and optimal control problems. Cambridge University Press.

Touzet, C. (1997). Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous Systems, Special Issue on Learning Robot: The New Wave, 22(3-4):251–281.

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.

Uchibe, E., Asada, M., and Hosoda, K. (1998). Cooperative behavior acquisition in multi mobile robots environment by reinforcement learning based on state vector estimation. In IEEE International Conference on Robotics and Automation (ICRA).

van den Berg, J., Miller, S., Duckworth, D., Hu, H., Wan, A., Fu, X.-Y., Goldberg, K., and Abbeel, P. (2010). Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations. In International Conference on Robotics and Automation (ICRA).

Vlassis, N., Toussaint, M., Kontes, G., and Piperidis, S. (2009). Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123–130.

Wang, B., Li, J., and Liu, H. (2006). A heuristic reinforcement learning for robot approaching objects. In IEEE Conference on Robotics, Automation and Mechatronics.

Whitman, E. C. and Atkeson, C. G. (2010). Control of instantaneously coupled systems applied to humanoid walking. In IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS).

Wikipedia (2013). Fosbury flop. http://en.wikipedia.org/wiki/Fosbury_Flop.

Willgoss, R. A. and Iqbal, J. (1999). Reinforcement learning of behaviors in mobile robots using noisy infrared sensing. In Australian Conference on Robotics and Automation.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

Yamaguchi, J. and Takanishi, A. (1997). Development of a biped walking robot having antagonistic driven joints using nonlinear spring mechanism. In IEEE International Conference on Robotics and Automation (ICRA).

Yasuda, T. and Ohkura, K. (2008). A reinforcement learning technique with an adaptive action generator for a multi-robot system. In International Conference on Simulation of Adaptive Behavior (SAB).

Youssef, S. M. (2005). Neuro-based learning of mobile robots with evolutionary path planning. In ICGST International Conference on Automation, Robotics and Autonomous Systems (ARAS).

Zhou, K. and Doyle, J. C. (1997). Essentials of Robust Control. Prentice Hall.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI).

Zucker, M. and Bagnell, J. A. (2012). Reinforcement planning: RL for optimal planners. In IEEE International Conference on Robotics and Automation (ICRA).
