
MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Local Policy Optimization for Trajectory-Centric Reinforcement Learning

Jha, Devesh; Kolaric, Patrik; Raghunathan, Arvind; Lewis, Frank; Benosman, Mouhacine; Romeres, Diego; Nikovski, Daniel N.

TR2020-062 June 03, 2020

Abstract
The goal of this paper is to present a method for simultaneous trajectory and local stabilizing policy optimization to generate local policies for trajectory-centric model-based reinforcement learning (MBRL). This is motivated by the fact that global policy optimization for non-linear systems could be a very challenging problem both algorithmically and numerically. However, a lot of robotic manipulation tasks are trajectory-centric, and thus do not require a global model or policy. Due to inaccuracies in the learned model estimates, an open-loop trajectory optimization process mostly results in very poor performance when used on the real system. Motivated by these problems, we try to formulate the problem of trajectory optimization and local policy synthesis as a single optimization problem. It is then solved simultaneously as an instance of nonlinear programming. We provide some results for analysis as well as achieved performance of the proposed technique under some simplifying assumptions.

IEEE International Conference on Robotics and Automation (ICRA)

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2020
201 Broadway, Cambridge, Massachusetts 02139


Local Policy Optimization for Trajectory-Centric Reinforcement Learning

Patrik Kolaric1, Devesh K. Jha2, Arvind U. Raghunathan2, Frank L. Lewis1, Mouhacine Benosman2, Diego Romeres2, Daniel Nikovski2

Abstract— The goal of this paper is to present a method for simultaneous trajectory and local stabilizing policy optimization to generate local policies for trajectory-centric model-based reinforcement learning (MBRL). This is motivated by the fact that global policy optimization for non-linear systems could be a very challenging problem both algorithmically and numerically. However, a lot of robotic manipulation tasks are trajectory-centric, and thus do not require a global model or policy. Due to inaccuracies in the learned model estimates, an open-loop trajectory optimization process mostly results in very poor performance when used on the real system. Motivated by these problems, we try to formulate the problem of trajectory optimization and local policy synthesis as a single optimization problem. It is then solved simultaneously as an instance of nonlinear programming. We provide some results for analysis as well as achieved performance of the proposed technique under some simplifying assumptions.

I. INTRODUCTION

Reinforcement learning (RL) is a learning framework that addresses sequential decision-making problems, wherein an 'agent' or a decision maker learns a policy to optimize a long-term reward by interacting with the (unknown) environment. At each step, the RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve the performance of subsequent actions [1], [2]. Although RL has witnessed huge successes in recent times [3], [4], there are several unsolved challenges which restrict the use of these algorithms for industrial systems. In most practical applications, control policies must be designed to satisfy operational constraints, and a satisfactory policy should be learnt in a data-efficient fashion [5].

Model-based reinforcement learning (MBRL) methods [6] learn a model from exploration data of the system, and then exploit the model to synthesize a trajectory-centric controller for the system [7]. These techniques are, in general, harder to train, but could achieve good data efficiency [8]. Learning reliable models is very challenging for non-linear systems, and thus the subsequent trajectory optimization could fail when using inaccurate models. However, modern machine learning methods such as Gaussian processes (GP), stochastic neural networks (SNN), etc. can generate uncertainty estimates associated with predictions [9], [10]. These uncertainty estimates could be used to estimate the confidence set of system states at any step along a given controlled trajectory for the system. The idea presented in this paper considers the stabilization of the trajectory using a local feedback policy that acts as an attractor for the system in the known region of uncertainty along the trajectory [11].

1 UTA Research Institute, University of Texas at Arlington, Fort Worth, TX, USA (Email: [email protected], [email protected])

2 Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. Email: jha, raghunathan, benosman, romeres, [email protected]

We present a method for simultaneous trajectory optimization and local policy optimization, where the policy optimization is performed in a neighborhood (local sets) of the system states along the trajectory. These local sets could be obtained by a stochastic function approximator (e.g., GP, SNN, etc.) that is used to learn the forward model of the dynamical system. The local policy is obtained by considering the worst-case deviation of the system from the nominal trajectory at every step along the trajectory. Performing simultaneous trajectory and policy optimization could allow us to exploit the modeling uncertainty, as it drives the optimization to regions of low uncertainty, where it might be easier to stabilize the trajectory. This allows us to constrain the trajectory optimization procedure to generate robust, high-performance controllers. The proposed method automatically incorporates state and input constraints on the dynamical system.

Contributions. The main contributions of the current paper are:

1) We present a novel formulation of simultaneous trajectory optimization and time-invariant local policy synthesis for stabilization.

2) We present an analysis of the proposed technique that allows us to analytically derive the gradient of the robustness constraint for the optimization problem.

It is noted that this paper only presents the controller synthesis part for MBRL; a more detailed analysis of the interplay between model uncertainties and controller synthesis is deferred to another publication.

II. RELATED WORK

MBRL has raised a lot of interest recently in robotics applications, because model learning algorithms are largely task-independent and data-efficient [12], [8], [6]. However, MBRL techniques are generally considered to be hard to train and likely to result in poor performance of the resulting policies/controllers, because the inaccuracies in the learned model could guide the policy optimization process to low-confidence regions of the state space. For non-linear control, the use of trajectory optimization techniques such as differential dynamic programming [13] or its first-order approximation, the iterative Linear Quadratic Regulator (iLQR) [14], is very popular, as it allows the use of gradient-based optimization, and thus could be used for high-dimensional systems. As the iLQR algorithm solves the local LQR problem at every point along the trajectory, it also computes a sequence of feedback gain matrices to use along the trajectory. However, the LQR problem is not solved for ensuring robustness, and furthermore the controller ends up being time-varying, which makes its use somewhat inconvenient for robotic systems. Thus, we believe that the controllers we propose might have better stabilization properties, while also being time-invariant.

Most model-based methods use a function approximator to first learn an approximate model of the system dynamics, and then use stochastic control techniques to synthesize a policy. Some of the seminal work in this direction could be found in [8], [6]. The method proposed in [8] has been shown to be very effective in learning trajectory-based local policies by sampling several initial conditions (states) and then fitting a neural network which can imitate the trajectories by supervised learning. This can be done by using ADMM [15] to jointly optimize trajectories and learn the neural network policies. This approach has achieved impressive performance on several robotic tasks [8]. The method has been shown to scale well for systems with higher dimensions. Several different variants of the proposed method were introduced later [16], [17], [18]. However, no theoretical analysis could be provided for the performance of the learned policies.

Another set of seminal work related to the proposed work is on the use of sum-of-squares (SOS) programming methods for generating stabilizing controllers for non-linear systems [11]. In these techniques, a stabilizing controller, expressed as a polynomial function of the states, is generated for a non-linear system along a trajectory by solving an optimization problem to maximize its region of attraction [19].

Another set of relevant work could be found in [20], [21], where the idea is to allow constraint satisfaction for a partially-known system by appropriately bounding model errors. Furthermore, the authors in [22] present a way to perform approximate dynamic programming using the ideas of invariant sets. However, these methods find a global controller for the system, which could be inefficient for high-dimensional systems. Lastly, some model-free trajectory-centric approaches could be found in [23], [24].

III. PROBLEM FORMULATION

In this section, we describe the problem studied in the rest of the paper. To perform trajectory-centric control, we propose a novel formulation for the simultaneous design of an open-loop trajectory and a time-invariant, locally stabilizing controller that is robust to bounded model uncertainties and/or system measurement noise. As we will present in this section, the proposed formulation is different from those considered in the literature in the sense that it allows us to exploit sets of possible deviations of the system to design a stabilizing controller.

A. Trajectory Optimization as Non-linear Program

Consider the discrete-time dynamical system

$x_{k+1} = f(x_k, u_k)$ (1)

where $x_k \in \mathbb{R}^{n_x}$, $u_k \in \mathbb{R}^{n_u}$ are the differential states and controls, respectively. The function $f : \mathbb{R}^{n_x+n_u} \rightarrow \mathbb{R}^{n_x}$ governs the evolution of the differential states. Note that the discrete-time formulation (1) can be obtained from a continuous-time system $\dot{x} = f(x, u)$ by using the explicit Euler integration scheme $(x_{k+1} - x_k) = \Delta t \, f(x_k, u_k)$, where $\Delta t$ is the time-step for integration.

For clarity of exposition we have limited our focus to discrete-time dynamical systems of the form in (1), although the techniques we describe can be easily extended to implicit discretization schemes.

In typical applications the states and controls are restricted to lie in sets $\mathcal{X} := \{x \in \mathbb{R}^{n_x} \mid \underline{x} \leq x \leq \overline{x}\} \subseteq \mathbb{R}^{n_x}$ and $\mathcal{U} := \{u \in \mathbb{R}^{n_u} \mid \underline{u} \leq u \leq \overline{u}\} \subseteq \mathbb{R}^{n_u}$, i.e., $x_k \in \mathcal{X}$, $u_k \in \mathcal{U}$. We use $[K]$ to denote the index set $\{0, 1, \dots, K\}$. Further, there may exist nonlinear inequality constraints of the form

$g(x_k) \geq 0$ (2)

with $g : \mathbb{R}^{n_x} \rightarrow \mathbb{R}^m$. The inequalities in (2) are termed path constraints. The trajectory optimization problem is to manipulate the controls $u_k$ over a certain number of time steps $[T-1]$ so that the resulting trajectory $\{x_k\}_{k \in [T]}$ minimizes a cost function $c(x_k, u_k)$. Formally, we aim to solve the trajectory optimization problem

$\min_{x_k, u_k} \ \sum_{k \in [T]} c(x_k, u_k)$
s.t. Eq. (1)–(2) for $k \in [T]$
$x_0 = \tilde{x}_0$
$x_k \in \mathcal{X}$ for $k \in [T]$
$u_k \in \mathcal{U}$ for $k \in [T-1]$ (TrajOpt)

where $\tilde{x}_0$ is the differential state at initial time $k = 0$. Before introducing the main problem of interest, we would like to introduce some notation.

In the following text, we use the short-hand notation $\|v\|_M^2 = v^T M v$. We denote the nominal trajectory as $X \equiv x_0, x_1, x_2, x_3, \dots, x_{T-1}, x_T$, $U \equiv u_0, u_1, u_2, u_3, \dots, u_{T-1}$. The actual trajectory followed by the system is denoted as $\hat{X} \equiv \hat{x}_0, \hat{x}_1, \hat{x}_2, \hat{x}_3, \dots, \hat{x}_{T-1}, \hat{x}_T$. We denote a local policy as $\pi_W$, where $\pi$ is the policy and $W$ denotes the parameters of the policy. The trajectory cost is also sometimes denoted as $J = \sum_{k \in [T]} c(x_k, u_k)$.
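As a concrete illustration of how (TrajOpt) can be posed as a nonlinear program, the sketch below uses the CasADi Opti stack in Python as a front end to IPOPT. The paper only states that a Python wrapper around IPOPT is used, so the modeling layer, the pendulum-like dynamics, and all numerical values here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of (TrajOpt), assuming CasADi + IPOPT; all values are illustrative.
import math
import casadi as ca

T, dt = 120, 1.0 / 30.0          # horizon and Euler step (assumed values)
nx, nu = 2, 1

def f(x, u):
    # Example dynamics (damped pendulum), x = (theta, theta_dot); parameters are assumed.
    I, b, m, g, l = 0.25, 0.1, 0.5, 9.81, 0.5
    return ca.vertcat(x[1], (u[0] - b * x[1] - m * g * l * ca.sin(x[0])) / I)

opti = ca.Opti()
X = opti.variable(nx, T + 1)     # states x_0, ..., x_T
U = opti.variable(nu, T)         # controls u_0, ..., u_{T-1}
x_goal = ca.DM([math.pi, 0.0])

cost = 0
for k in range(T):
    cost += ca.sumsqr(X[:, k] - x_goal) + 1e-2 * ca.sumsqr(U[:, k])       # quadratic stage cost
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * f(X[:, k], U[:, k]))    # Eq. (1), explicit Euler
opti.subject_to(X[:, 0] == ca.DM([0.0, 0.0]))                             # initial condition
opti.subject_to(opti.bounded(-1.7, ca.vec(U), 1.7))                       # u_k in U (box bounds)

opti.minimize(cost + ca.sumsqr(X[:, T] - x_goal))                         # terminal cost
opti.solver("ipopt")
sol = opti.solve()
x_nom, u_nom = sol.value(X), sol.value(U)                                 # nominal trajectory (X, U)
```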

B. Trajectory Optimization with Local Stabilization

This subsection introduces the main problem of interest in this paper. A schematic of the problem studied in the paper is also shown in Figure 1. In the rest of this section, we will describe how we can simplify the trajectory optimization and local stabilization problem and turn it into an optimization problem that can be solved by standard non-linear optimization solvers.

Fig. 1: A schematic representation of the robustness constraint introduced in the paper.

Consider the case where the system dynamics $f$ is only partially known, and the known component of $f$ is used to design the controller. Consider the deviation of the system at any step $k$ from the state trajectory $X$ and denote it as $\delta x_k \equiv \hat{x}_k - x_k$. We introduce a local (time-invariant) policy $\pi_W$ that regulates the local trajectory deviation $\delta x_k$, and thus the final controller is denoted as $\hat{u}_k = u_k + \pi_W(\delta x_k)$. The closed-loop dynamics for the system under this control is then given by the following:

$\hat{x}_{k+1} = f(\hat{x}_k, \hat{u}_k) = f(x_k + \delta x_k, u_k + \pi_W(\delta x_k))$ (3)

The main objective of the paper is to find the time-invariant feedback policy $\pi_W$ that can stabilize the open-loop trajectory $X$ locally within $\mathcal{R}_k \subset \mathbb{R}^{n_x}$, where $\mathcal{R}_k$ defines the set of uncertainty for the deviation $\delta x_k$. The uncertainty region $\mathcal{R}_k$ can be approximated by fitting an ellipsoid to the uncertainty estimate using a diagonal positive definite matrix $S_k$ such that $\mathcal{R}_k = \{\delta x_k : \delta x_k^T S_k \delta x_k \leq 1\}$. The general optimization problem that achieves this is proposed as:

$J^* = \min_{U, X, W} \ \mathbb{E}_{\delta x_k \in \mathcal{R}_k}\left[ J(X + \delta X, U + \pi_W(\delta x_k)) \right]$
s.t. $x_{k+1} = f(x_k, u_k)$ (4)

where $f(\cdot, \cdot)$ denotes the known part of the model. Note that in the above equation, we have introduced additional optimization parameters corresponding to the policy $\pi_W$ when compared to TrajOpt in the previous section. However, to solve the above, one needs to resort to sampling in order to estimate the expected cost. Instead, we introduce a constraint that solves for the worst-case cost for the above problem.
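To make the sampling alternative concrete, the following sketch (our own illustrative interpretation, not the paper's implementation) estimates the expected closed-loop cost in (4) by drawing deviations uniformly from the initial ellipsoid and rolling out $u_k + \pi_W(\delta x_k)$. The names f, c, x_nom, u_nom, and policy are placeholders for a known model, stage cost, nominal trajectory, and local policy.

```python
import numpy as np

def sample_ellipsoid(S, rng):
    """Draw dx uniformly from {dx : dx^T S dx <= 1}, with S diagonal positive definite."""
    n = S.shape[0]
    v = rng.standard_normal(n)
    v *= rng.uniform() ** (1.0 / n) / np.linalg.norm(v)      # uniform sample of the unit ball
    return np.diag(1.0 / np.sqrt(np.diag(S))) @ v            # map the ball onto the ellipsoid

def mc_expected_cost(f, c, x_nom, u_nom, policy, S0, n_samples=100, seed=0):
    """Monte Carlo estimate of the closed-loop cost under initial deviations dx_0 in R_0."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        x = x_nom[0] + sample_ellipsoid(S0, rng)
        J = 0.0
        for k in range(len(u_nom)):
            u = u_nom[k] + policy(x - x_nom[k])               # u_k = nominal control + pi_W(dx_k)
            J += c(x, u)                                      # accumulate the stage cost
            x = f(x, u)                                       # propagate the (known) model
        total += J
    return total / n_samples
```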

Robustness Certificate. The robust trajectory optimization problem is to minimize the trajectory cost while at the same time satisfying a robust constraint at every step along the trajectory. This is also explained in Figure 1, where the purpose of the local stabilizing controller is to push the max-deviation state at every step along the trajectory into an $\epsilon$-tolerance ball around the trajectory. Mathematically, we express the problem as follows:

$\min_{x_k, u_k, W} \ \sum_{k \in [T]} c(x_k, u_k)$
s.t. Eq. (1)–(2) for $k \in [T]$
$x_0 = \tilde{x}_0$
$x_k \in \mathcal{X}$ for $k \in [T]$
$u_k \in \mathcal{U}$ for $k \in [T-1]$
$\max_{\delta x_k \in \mathcal{R}_k} \|x_{k+1} - f(x_k + \delta x_k, u_k + \pi_W(\delta x_k))\|^2 \leq \epsilon_k$ (RobustTrajOpt)

The additional constraint introduced in RobustTrajOpt allows us to ensure stabilization of the trajectory by estimating parameters of the stabilizing policy $\pi_W$. It is easy to see that RobustTrajOpt solves the worst-case problem for the optimization considered in (4). However, RobustTrajOpt introduces another hyperparameter to the optimization problem, $\epsilon_k$. In the rest of the paper, we refer to the following constraint as the robust constraint:

$\max_{\delta x_k^T S_k \delta x_k \leq 1} \|x_{k+1} - f(x_k + \delta x_k, u_k + \pi_W(\delta x_k))\|^2 \leq \epsilon_k$ (5)

Solving the robust constraint for a generic non-linear system is out of the scope of this paper. Instead, we linearize the trajectory deviation dynamics, as shown in the following Lemma.

Lemma 1. The trajectory deviation dynamics $\delta x_{k+1} = \hat{x}_{k+1} - x_{k+1}$ approximated locally around the optimal trajectory $(X, U)$ are given by

$\delta x_{k+1} = A(x_k, u_k) \cdot \delta x_k + B(x_k, u_k) \cdot \pi_W(\delta x_k)$
$A(x_k, u_k) \equiv \nabla_{x_k} f(x_k, u_k)$
$B(x_k, u_k) \equiv \nabla_{u_k} f(x_k, u_k)$ (6)

Proof: Use Taylor series expansion to obtain the desired expression.
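As a sketch of how the Jacobians in Lemma 1 could be obtained when only a simulator of the (known part of the) model is available, the helper below uses forward finite differences; an automatic-differentiation tool would serve equally well. The function name and step size are our own assumptions.

```python
import numpy as np

def linearize(f, x, u, eps=1e-6):
    """Forward-difference Jacobians A = df/dx and B = df/du of x_{k+1} = f(x_k, u_k)."""
    fx = f(x, u)
    A = np.zeros((fx.size, x.size))
    B = np.zeros((fx.size, u.size))
    for i in range(x.size):
        dx = np.zeros(x.size)
        dx[i] = eps
        A[:, i] = (f(x + dx, u) - fx) / eps      # column i of grad_x f
    for j in range(u.size):
        du = np.zeros(u.size)
        du[j] = eps
        B[:, j] = (f(x, u + du) - fx) / eps      # column j of grad_u f
    return A, B
```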

To ensure feasibility of the RobustTrajOpt problem and avoid tuning the hyperparameter $\epsilon_k$, we make another relaxation by removing the robust constraint from the set of constraints and moving it to the objective function. Thus, the simplified robust trajectory optimization problem that we solve in this paper can be expressed as follows (we skip the state constraints to save space):

$\min_{x_k, u_k, W} \ \left( \sum_{k \in [T]} c(x_k, u_k) + \alpha \sum_{k \in [T]} d_{max,k} \right)$
s.t. Eq. (1)–(2) for $k \in [T]$ (RelaxedRobustTrajOpt)

where the term $d_{max,k}$ is defined as follows after linearization:

$d_{max,k} \equiv \max_{\delta x_k^T S_k \delta x_k \leq 1} \|A(x_k, u_k) \cdot \delta x_k + B(x_k, u_k) \cdot \pi_W(\delta x_k)\|_P^2$ (7)

Note that the matrix $P$ allows us to weigh the states differently. In the next section, we present the solution approach to compute the gradient for the RelaxedRobustTrajOpt, which is then used to solve the optimization problem. Note that this results in a simultaneous solution for the open-loop trajectory and the stabilizing policy $\pi_W$.

IV. SOLUTION APPROACH

This section introduces the main contribution of the paper, which is a local feedback design that regulates the deviation of an executed trajectory from the optimal trajectory generated by the optimization procedure.

To solve the optimization problem presented in the last section, we first need to obtain the gradient information of the robustness heuristic that we introduced. However, calculating the gradient of the robust constraint is not straightforward, because the max function is non-differentiable. The gradient of the robustness constraint is computed by the application of Danskin's Theorem [25], which is stated next.

Danskin's Theorem: Let $\mathcal{K} \subseteq \mathbb{R}^m$ be a nonempty, closed set and let $\Omega \subseteq \mathbb{R}^n$ be a nonempty, open set. Assume that the function $f : \Omega \times \mathcal{K} \rightarrow \mathbb{R}$ is continuous on $\Omega \times \mathcal{K}$ and that $\nabla_x f(x, y)$ exists and is continuous on $\Omega \times \mathcal{K}$. Define the function $g : \Omega \rightarrow \mathbb{R} \cup \{\infty\}$ by

$g(x) \equiv \sup_{y \in \mathcal{K}} f(x, y), \quad x \in \Omega$

and $\mathcal{M}(x) \equiv \{y \in \mathcal{K} \mid g(x) = f(x, y)\}$.

Let $x \in \Omega$ be a given vector. Suppose that a neighborhood $\mathcal{N}(x) \subseteq \Omega$ of $x$ exists such that $\mathcal{M}(x')$ is nonempty for all $x' \in \mathcal{N}(x)$ and the set $\cup_{x' \in \mathcal{N}(x)} \mathcal{M}(x')$ is bounded. The following two statements are valid:

1) The function $g$ is directionally differentiable at $x$ and
$g'(x; d) = \sup_{y \in \mathcal{M}(x)} \nabla_x f(x, y)^T d.$

2) If $\mathcal{M}(x)$ reduces to a singleton, say $\mathcal{M}(x) = \{y(x)\}$, then $g$ is Gâteaux differentiable at $x$ and
$\nabla g(x) = \nabla_x f(x, y(x)).$

Proof: See [26], Theorem 10.2.1.

Danskin's theorem allows us to find the gradient of the robustness constraint by first computing the argument of the maximum function and then evaluating the gradient of the maximum function at that point. Thus, in order to find the gradient of the robust constraint (5), it is necessary to interpret it as an optimization problem in $\delta x_k$, which is presented next. In Section III-B, we presented a general formulation for the stabilizing controller $\pi_W$, where $W$ are the parameters that are obtained during optimization. However, the solution of the general problem is beyond the scope of the current paper. The rest of this section considers a linear $\pi_W$ for analysis.

Lemma 2. Assume the linear feedback $\pi_W(\delta x_k) = W \delta x_k$. Then, the constraint (7) is quadratic in $\delta x_k$,

$\max_{\delta x_k} \|M_k \delta x_k\|_P^2 = \max_{\delta x_k} \delta x_k^T M_k^T \cdot P \cdot M_k \delta x_k$
s.t. $\delta x_k^T S_k \delta x_k \leq 1$ (8)

where $M_k$ is shorthand notation for

$M_k(x_k, u_k, W) \equiv A(x_k, u_k) + B(x_k, u_k) \cdot W$ (9)

Proof: Write $d_{max}$ from (7) as the optimization problem

$d_{max} = \max_{\delta x_k} \|A(x_k, u_k) \cdot \delta x_k + B(x_k, u_k) \cdot \pi_W(\delta x_k)\|_P^2$
s.t. $\delta x_k^T S_k \delta x_k \leq 1$ (10)

Introduce the linear controller and use the shorthand notation for $M_k$ to write (8).

The next lemma is one of the main results in the paper. It connects the robust trajectory tracking formulation RelaxedRobustTrajOpt with an optimization problem that is well known in the literature.

Lemma 3. The worst-case measure of deviation $d_{max}$ is

$d_{max} = \lambda_{max}\!\left(S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}}\right) = \|P^{\frac{1}{2}} M_k S_k^{-\frac{1}{2}}\|_2^2$

where $\lambda_{max}(\cdot)$ denotes the maximum eigenvalue of a matrix and $\|\cdot\|_2$ denotes the spectral norm of a matrix. Moreover, the worst-case deviation $\delta_{max}$ is the corresponding maximum eigenvector,

$\delta_{max} = \left\{\delta x_k : \left[S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}}\right] \cdot \delta x_k = d_{max} \cdot \delta x_k\right\}$

Proof: Apply the coordinate transformation $\overline{\delta x}_k = S_k^{\frac{1}{2}} \delta x_k$ in (8) and write

$\max_{\overline{\delta x}_k} \ \overline{\delta x}_k^T S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}} \overline{\delta x}_k$
s.t. $\overline{\delta x}_k^T \overline{\delta x}_k \leq 1$ (11)

Since $S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}}$ is positive semi-definite, the maximum lies on the boundary of the set defined by the inequality. Therefore, the problem is equivalent to

$\max_{\overline{\delta x}_k} \ \overline{\delta x}_k^T S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}} \overline{\delta x}_k$
s.t. $\overline{\delta x}_k^T \overline{\delta x}_k = 1$ (12)

The formulation (12) is a special case with a known analytic solution. Specifically, the maximizing deviation $\delta_{max}$ that solves (12) is the maximum eigenvector of $S_k^{-\frac{1}{2}} M_k^T \cdot P \cdot M_k S_k^{-\frac{1}{2}}$, and the value $d_{max}$ at the optimum is the corresponding eigenvalue.
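Lemma 3 reduces the inner maximization to an eigenvalue computation, which the short numpy sketch below implements. The helper name and the assumption that $S_k$ is diagonal (as stated for the ellipsoidal uncertainty sets) are ours.

```python
import numpy as np

def worst_case_deviation(A, B, W, S, P):
    """d_max = lambda_max(S^{-1/2} M^T P M S^{-1/2}) with M = A + B W (Lemma 3)."""
    M = A + B @ W
    S_inv_half = np.diag(1.0 / np.sqrt(np.diag(S)))       # S_k is assumed diagonal and PD
    Q = S_inv_half @ M.T @ P @ M @ S_inv_half
    eigvals, eigvecs = np.linalg.eigh(Q)                   # Q is symmetric PSD
    d_max = eigvals[-1]                                    # largest eigenvalue
    delta_max = S_inv_half @ eigvecs[:, -1]                # maximizer mapped back to dx coordinates
    return d_max, delta_max
```

The same value can also be checked as the squared spectral norm of $P^{1/2} M_k S_k^{-1/2}$, per the second equality in Lemma 3.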

This provides us with the maximum deviation along the trajectory at any step $k$, and now we can use Danskin's theorem to compute the gradient, which is presented next.

Theorem 1. Introduce the following notation: $\mathcal{M}(z) = S_k^{-\frac{1}{2}} M_k^T(z) \cdot P \cdot M_k(z) S_k^{-\frac{1}{2}}$. The gradient of the robust inequality constraint $d_{max}$ with respect to an arbitrary vector $z$ is

$\nabla_z d_{max} = \nabla_z \delta_{max}^T \mathcal{M}(z) \delta_{max}$

where $\delta_{max}$ is the maximum trajectory deviation introduced in Lemma 3.

Proof: Start from the definition of the gradient of the robust constraint

$\nabla_z d_{max} = \nabla_z \max_{\delta x_k} \delta x_k^T \mathcal{M}(z) \delta x_k$

Use Danskin's Theorem and the result from Lemma 3 to write the gradient of the robust constraint with respect to an arbitrary $z$,

$\nabla_z d_{max} = \nabla_z \delta_{max}^T \mathcal{M}(z) \delta_{max}$


which completes the proof.

The gradient computed from Theorem 1 is used in the solution of the RelaxedRobustTrajOpt; however, this is solved only for a linear controller. The next section shows some results in simulation and on a real physical system.
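As a numerical sanity check of Theorem 1 (our own illustration, with a random linearization and a linear policy), the snippet below compares the Danskin gradient of $d_{max}$ with respect to the gain $W$, obtained by differentiating the quadratic form at the fixed maximizer, against a finite-difference estimate.

```python
import numpy as np

def d_max(A, B, W, S_inv_half, P):
    M = A + B @ W
    Q = S_inv_half @ M.T @ P @ M @ S_inv_half
    w, V = np.linalg.eigh(Q)
    return w[-1], S_inv_half @ V[:, -1]                      # value and maximizer (dx coordinates)

rng = np.random.default_rng(0)
nx, nu = 2, 1
A, B = rng.standard_normal((nx, nx)), rng.standard_normal((nx, nu))
W = rng.standard_normal((nu, nx))
P, S_inv_half = np.eye(nx), np.diag([1.0, 1.0 / np.sqrt(5.5)])

d, delta = d_max(A, B, W, S_inv_half, P)
M = A + B @ W
grad_danskin = 2.0 * B.T @ P @ M @ np.outer(delta, delta)    # grad_W of delta^T M^T P M delta, delta held fixed

eps, grad_fd = 1e-6, np.zeros_like(W)                        # finite-difference reference
for i in range(nu):
    for j in range(nx):
        Wp = W.copy()
        Wp[i, j] += eps
        grad_fd[i, j] = (d_max(A, B, Wp, S_inv_half, P)[0] - d) / eps

print(np.allclose(grad_danskin, grad_fd, atol=1e-3))         # True when the maximizer is unique
```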

V. EXPERIMENTAL RESULTS

In this section, we present some results using the proposed algorithm for an under-actuated inverted pendulum, as well as on an experimental setup for a ball-and-beam system. We use a Python wrapper for the standard interior point solver IPOPT to solve the optimization problem discussed in the previous sections. We perform experiments to evaluate the following questions:

1) Can an off-the-shelf optimization solver find feasible solutions to the joint optimization problem described in the paper?

2) Can the feedback controller obtained by this optimization stabilize the open-loop trajectory in the presence of bounded uncertainties?

3) How good is the performance of the controller on a physical system with unknown system parameters?

In the following sections, we try to answer these questions using simulations as well as experiments on real systems.

A. Simulation Results for Underactuated Pendulum

The objective of this subsection is twofold: first, to provide insight into the solution of the optimization problem; and second, to demonstrate the effectiveness of that solution in the stabilization of the optimal trajectory. For clarity of presentation, we use an underactuated pendulum system, where trajectories can be visualized in state space. The dynamics of the pendulum is modeled as $I\ddot{\theta} + b\dot{\theta} + mgl\sin(\theta) = u$. The continuous-time model is discretized as $(\theta_{k+1}, \dot{\theta}_{k+1}) = f((\theta_k, \dot{\theta}_k), u_k)$. The goal state is $x_g = [\pi, 0]$, the initial state is $x_0 = [0, 0]$, and the control limit is $u \in [-1.7, 1.7]$. The cost is quadratic in both the states and the input. The initial solution provided to the solver is trivial (all states and controls are 0). The number of discretization points along the trajectory is $N = 120$, and the discretization time step is $\Delta t = 1/30$. The cost weight on the robust slack variables is selected to be $\alpha = 10$. The uncertainty region is roughly estimated as $\delta x_k^T \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 5.5 \end{bmatrix} \delta x_k < 1$ along the trajectory. A detailed analysis of uncertainty estimation based on Gaussian processes is deferred to future work, due to space limits. The optimization procedure terminates in 50 iterations with the static solution $W = [-2.501840, -7.38725]$.

Fig. 2: State-space representation of the optimal trajectory (green), stable closed-loop system using the obtained solution (blue), and unstable open-loop trajectory without the local feedback (red). Note that the feedback is time-invariant.

The controller generated by the optimization procedure is then tested in simulation, with noise added to each state of the pendulum model at each step of simulation as $\hat{x}_{k+1} = f(\hat{x}_k, \hat{u}_k) + \omega$ with $\omega_\theta \sim \mathcal{U}(-0.2\,\mathrm{rad}, 0.2\,\mathrm{rad})$ and $\omega_{\dot{\theta}} \sim \mathcal{U}(-0.05\,\mathrm{rad/s}, 0.05\,\mathrm{rad/s})$.
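For reference, such a noisy closed-loop rollout might look like the sketch below. The pendulum parameters and the zero placeholder for the nominal trajectory are assumptions (the paper does not report them), while the gain, control limit, horizon, and noise bounds follow the values quoted above.

```python
import numpy as np

# Assumed physical parameters; the paper does not report I, b, m, l.
I, b, m, g, l = 0.25, 0.1, 0.5, 9.81, 0.5
dt, T, u_lim = 1.0 / 30.0, 120, 1.7
W = np.array([-2.501840, -7.38725])                  # static gain reported by the optimizer

def f(x, u):
    """Explicit-Euler discretization of I*theta_dd + b*theta_d + m*g*l*sin(theta) = u."""
    th, thd = x
    thdd = (u - b * thd - m * g * l * np.sin(th)) / I
    return np.array([th + dt * thd, thd + dt * thdd])

# The nominal trajectory would come from the solver; zeros keep this sketch runnable.
x_nom, u_nom = np.zeros((T + 1, 2)), np.zeros(T)

rng = np.random.default_rng(0)
x = x_nom[0].copy()
for k in range(T):
    u = np.clip(u_nom[k] + W @ (x - x_nom[k]), -u_lim, u_lim)          # u_k = nominal + W * dx_k
    noise = np.array([rng.uniform(-0.2, 0.2), rng.uniform(-0.05, 0.05)])
    x = f(x, u) + noise                                                # noisy closed-loop step
```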

We tested the controller over several settings and found that the underactuated setting was the most challenging to stabilize. In Figure 2, we show the state-space trajectory for the controlled (underactuated) system with additional noise as described above. As seen in the plot, the open-loop controller becomes unstable with the additional noise, while the proposed controller can still stabilize the whole trajectory.

Fig. 3: a) open-loop control, b) static feedback matrix obtained from optimization, c) LQR feedback.

In Figure 3, we show the control inputs and the time-invariant feedback gains obtained by the optimization problem. We also show the time-varying LQR gains obtained along the trajectory, to provide some insight into the difference between the two solutions. As the proposed optimization problem is finding the feedback gain for the worst-case deviation from the trajectory, the solutions are different from the LQR case. Next, in Figure 4, we plot the error statistics for the controlled system (in the underactuated setting) over 2 different uncertainty balls, using 12 samples for each ball. We observe that the steady-state error goes to zero and the closed-loop system is stable along the entire trajectory. As we are using a linear approximation of the system dynamics, the uncertainty sets are still small; however, the results indicate that incorporating the full dynamics during stabilization could allow us to generate much bigger basins of attraction for the stabilizing controller.

B. Results on Ball-and-Beam System

Next, we implemented the proposed method on a ball-and-beam system (shown in Figure 5) [27]. The ball-and-beam system is a low-dimensional non-linear system, with the non-linearity due to the dry friction and delay in the servo motors attached to the table (see Figure 5). The ball-and-beam system can be modeled with 4 state variables $[x, \dot{x}, \theta, \dot{\theta}]$, where $x$ is the position of the ball, $\dot{x}$ is the ball's velocity, $\theta$ is the beam angle in radians, and $\dot{\theta}$ is the angular velocity of the beam. The acceleration of the ball, $\ddot{x}$, is given by

$\ddot{x} = \dfrac{m_{ball}\, x\, \dot{\theta}^2 - b_1 \dot{x} - b_2\, m_{ball}\, g \cos(\theta) - m_{ball}\, g \sin(\theta)}{I_{ball}/r_{ball}^2 + m_{ball}},$

where $m_{ball}$ is the mass of the ball, $I_{ball}$ is the moment of inertia of the ball, $r_{ball}$ is the radius of the ball, $b_1$ is the coefficient of viscous friction of the ball on the beam, $b_2$ is the coefficient of static (dry) friction of the ball on the beam, and $g$ is the acceleration due to gravity. The beam is actuated by a servo motor (position controlled), and an approximate model for its motion is estimated by fitting an auto-regressive model. We use this model for the analysis, where the ball's rotational inertia is ignored and we approximately estimate the dry friction. The model is inaccurate, as can be seen from the performance of the open-loop controller in Figure 6. However, the proposed controller is still able to regulate the ball position at the desired goal, showing the stabilizing behavior for the system (see the performance of the closed-loop controller in Figure 6).

Fig. 4: Error statistics for the controlled system using the proposed method on the underactuated pendulum with noise amplitude

Fig. 5: The ball-and-beam system used for the experiments. There is an RGB camera above that measures the location of the ball. The encoder (seen in the figure) measures the angular position of the beam.

Fig. 6: Comparison of the performance of the proposed controller on a ball-and-beam system with the open-loop solution. The plot shows the error in the position of the ball from the regulated position averaged over 12 runs.

The plot shows the mean and the standard deviation of the error for 12 runs of the controller. It can be observed that the mean regulation error goes to zero for the closed-loop controller. We believe that the performance of the controller will improve as we improve the model accuracy. In future research, we would like to study the learning behavior for the proposed controller by learning the residual dynamics using a GP [10].
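For completeness, the ball-acceleration model above translates directly into code; the parameter values below are assumptions for illustration only (the paper does not report them).

```python
import numpy as np

# Assumed parameter values; the paper does not report them.
m_ball, I_ball, r_ball = 0.05, 2e-5, 0.02      # kg, kg*m^2, m
b1, b2, g = 0.1, 0.01, 9.81

def ball_acceleration(x, x_dot, theta, theta_dot):
    """Ball acceleration x_ddot for the ball-and-beam model described above."""
    num = (m_ball * x * theta_dot**2
           - b1 * x_dot
           - b2 * m_ball * g * np.cos(theta)
           - m_ball * g * np.sin(theta))
    return num / (I_ball / r_ball**2 + m_ball)
```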

VI. CONCLUSION AND FUTURE WORK

This paper presents a method for simultaneously computing an optimal trajectory along with a local, time-invariant stabilizing controller for a dynamical system with known uncertainty bounds. The time-invariant controller was computed by adding a robustness constraint to the trajectory optimization problem. We prove that, under certain simplifying assumptions, we can compute the gradient of the robustness constraint, so that a gradient-based optimization solver can be used to find a solution to the optimization problem. We tested the proposed approach, and the results show that it is possible to solve the joint problem simultaneously. We showed that even a linear parameterization of the stabilizing controller with a linear approximation of the error dynamics allows us to successfully control non-linear systems locally. We tested the proposed method in simulation as well as on a physical system. Due to space limitations, we omit additional results regarding the behavior of the algorithm.

However, the current approach has two limitations: it makes a linear approximation of the dynamics for finding the worst-case deviation, and the linear parameterization of the stabilizing controller can be limiting for a lot of robotic systems. In future research, we will address these two limitations using Lyapunov theory for non-linear control, for better generalization and more powerful results of the proposed approach. We also hope to extend the current formulation to more complex settings for feedback control of manipulation problems [28].

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (2nd Edition). MIT Press, Cambridge, 2018, vol. 1.

[2] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, ser. IET Control Engineering Series. Institution of Engineering and Technology, 2013.

[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.

[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.

[5] K. Vamvoudakis, P. Antsaklis, W. Dixon, J. Hespanha, F. Lewis, H. Modares, and B. Kiumarsi, "Autonomy and machine intelligence in complex systems: A tutorial," in American Control Conference (ACC), 2015, July 2015, pp. 5062–5079.

[6] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.

[7] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning, 2013, pp. 1–9.

[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[9] C. E. Rasmussen, "Gaussian processes in machine learning," in Summer School on Machine Learning. Springer, 2003, pp. 63–71.

[10] D. Romeres, D. K. Jha, A. DallaLibera, B. Yerazunis, and D. Nikovski, "Semiparametrical Gaussian processes learning of forward dynamical models for navigating in a circular maze," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3195–3202.

[11] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, "LQR-trees: Feedback motion planning via sums-of-squares verification," The International Journal of Robotics Research, vol. 29, no. 8, pp. 1038–1052, 2010.

[12] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv e-prints, p. arXiv:1907.02057, Jul 2019.

[13] D. H. Jacobson, "New second-order and first-order algorithms for determining optimal control: A differential dynamic programming approach," Journal of Optimization Theory and Applications, vol. 2, no. 6, pp. 411–440, 1968.

[14] Y. Tassa, T. Erez, and E. Todorov, "Synthesis and stabilization of complex behaviors through online trajectory optimization," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 4906–4913.

[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[16] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, "Path integral guided policy search," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3381–3388.

[17] W. H. Montgomery and S. Levine, "Guided policy search via approximate mirror descent," in Advances in Neural Information Processing Systems, 2016, pp. 4008–4016.

[18] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.

[19] A. Majumdar, A. A. Ahmadi, and R. Tedrake, "Control design along trajectories with sums of squares programming," in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 4054–4061.

[20] A. Chakrabarty, D. K. Jha, and Y. Wang, "Data-driven control policies for partially known systems via kernelized Lipschitz learning," in 2019 American Control Conference (ACC), July 2019, pp. 4192–4197.

[21] A. Chakrabarty, D. K. Jha, G. T. Buzzard, Y. Wang, and K. Vamvoudakis, "Safe approximate dynamic programming via kernelized Lipschitz estimation," arXiv preprint arXiv:1907.02151, 2019.

[22] A. Chakrabarty, R. Quirynen, C. Danielson, and W. Gao, "Approximate dynamic programming for linear systems with state and input constraints," in 2019 18th European Control Conference (ECC), June 2019, pp. 524–529.

[23] E. Theodorou, J. Buchli, and S. Schaal, "A generalized path integral control approach to reinforcement learning," Journal of Machine Learning Research, vol. 11, no. Nov, pp. 3137–3181, 2010.

[24] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, "Information theoretic MPC for model-based reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1714–1721.

[25] D. P. Bertsekas, "Nonlinear programming," Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.

[26] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Volume II. New York, NY: Springer, 2003.

[27] D. K. Jha, D. Nikovski, W. Yerazunis, and A. Farahmand, "Learning to regulate rolling ball motion," in 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Nov 2017, pp. 1–6.

[28] W. Han and R. Tedrake, "Local trajectory stabilization for dexterous manipulation via piecewise affine approximations," arXiv e-prints, 2019, https://arxiv.org/abs/1909.08045.