
Optimal Control of a Double Inverted Pendulum on a Cart

    Alexander Bogdanov

Senior Research Associate, Department of Computer Science & Electrical Engineering, OGI School of Science & Engineering, OHSU

Technical Report CSE-04-006, December 2004

In this report a number of algorithms for optimal control of a double inverted pendulum on a cart (DIPC) are investigated and compared. Modeling is based on the Euler-Lagrange equations derived by specifying a Lagrangian, the difference between the kinetic and potential energy of the DIPC system. This results in a system of nonlinear differential equations consisting of three second-order equations. This system of equations is then transformed into the usual form of six first-order ordinary differential equations (ODE) for control design purposes. Control of a DIPC poses a certain challenge, since unlike a robot, the system is underactuated: one controlling force per three degrees of freedom (DOF). In this report, the problem of optimal control minimizing a quadratic cost functional is addressed. Several approaches are tested: linear quadratic regulator (LQR), state-dependent Riccati equation (SDRE), optimal neural network (NN) control, and combinations of the NN with the LQR and the SDRE. Simulations reveal superior performance of the SDRE over the LQR and improvements provided by the NN, which compensates for model inadequacies in the LQR. Limited capabilities of the NN to approximate functions over a wide range of arguments prevent it from significantly improving the SDRE performance, providing only marginal benefits at larger pendulum deflections.

    1 Nomenclature

$m$ : mass
$l_i$ : distance from a pivot joint to the $i$-th pendulum link center of mass
$L_i$ : length of the $i$-th pendulum link
$\theta_0$ : wheeled cart position
$\theta_1, \theta_2$ : pendulum angles
$I_i$ : moment of inertia of the $i$-th pendulum link w.r.t. its center of mass
$g$ : gravity constant
$u$ : control force
$T$ : kinetic energy
$P$ : potential energy
$L$ : Lagrangian
$w$ : neural network (NN) weights
$N_h$ : number of neurons in the hidden layer
$\Delta t$ : digital control sampling time

Subscripts 0, 1, 2 used with the aforementioned parameters refer to the cart, first (bottom) pendulum and second (top) pendulum, respectively.

2 Introduction

As a nonlinear underactuated plant, the double inverted pendulum on a cart (DIPC) poses a challenging control problem. It has long been an attractive testbed for linear and nonlinear control laws [6, 17, 1]. Numerous papers use the DIPC as a testbed; the cited ones are merely examples. Nearly all works on pendulum control concentrate on two problems: swing-up control design and stabilization of the inverted pendulums. In this report, the optimal nonlinear stabilization problem is addressed: stabilize the DIPC while minimizing an accumulative cost functional quadratic in states and controls. For linear systems, this leads to linear feedback control, which is found by solving a Riccati equation [5] and is thus referred to as the linear quadratic regulator (LQR). The DIPC, however, is a highly nonlinear system, and its linearization is far from adequate for control design purposes. Nonetheless, the LQR will be considered as a baseline controller, and other control designs will be tested and compared against it. To solve the nonlinear optimal control problem, we will employ several different approaches: a nonlinear extension of the LQR called the SDRE; direct nonlinear optimization using a neural network (NN); and combinations of the direct NN optimization with the LQR and the SDRE.

For a baseline nonlinear control, we will utilize a technique that has shown considerable promise and involves manipulating the system dynamic equations into a pseudo-linear state-dependent coefficient (SDC) form, in which the system matrices are given explicitly as a function of the current state. Treating the system matrices as constant, the approximate solution of the nonlinear state-dependent Riccati equation is obtained for the reformulated pseudo-linear dynamical system at discrete time steps. The solution is then used to calculate a feedback control law that is optimized around the system state estimated at each time step. This technique, referred to as State-Dependent Riccati Equation (SDRE) control, is thus an extension of the LQR as it solves the LQR problem at each time step.

As a next step, a direct optimization approach will be investigated. For this purpose, a NN will be used in the feedback loop, and a standard calculus of variations approach will be employed to adjust the NN parameters (weights) to optimize the cost functional over a wide range of initial DIPC states. As a universal function approximator, the NN is thus trained to implement a nonlinear optimal controller.

Figure 1: Double inverted pendulum on a cart

As the last step, two combinations of feedback NN control with the LQR and the SDRE will be designed. Approximation capabilities of a NN are limited by its size and by the fact that optimization in the space of its weights is non-convex. Thus, only local minima are usually found, and the solution is at best suboptimal. The problem is more severe with wider ranges of NN inputs and outputs. To address this, a combination of the NN with a conventional feedback suboptimal control is designed to simplify the NN input-output mapping and therefore reduce training complexity. If the conventional feedback control were optimal, the optimal NN output would trivially be zero. Simple logic says that a suboptimal conventional control will simplify the NN mapping to some extent: instead of generating optimal controls, the NN is trained to produce corrections to the controls generated by the conventional suboptimal controller. For example, the LQR provides near-optimal control in the vicinity of the equilibrium, since the nonlinear and linearized DIPC dynamics are close at the equilibrium. Thus the NN will only have to correct the LQR output where the linearization accuracy diminishes. The next sections will discuss the above concepts in detail and illustrate them with simulations.

3 Modeling

The DIPC system is graphically depicted in Fig. 1. To derive its equations of motion, one of the possible ways is to use the Lagrange equations:

$$\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}} - \frac{\partial L}{\partial \theta} = Q \qquad (1)$$

where $L = T - P$ is the Lagrangian, $Q$ is the vector of generalized forces (or moments) acting in the direction of the generalized coordinates $\theta$ and not accounted for in the formulation of kinetic energy $T$ and potential energy $P$.

Kinetic and potential energies of the system are given by the sum of the energies of its individual components (a wheeled cart and two pendulums):

$$T = T_0 + T_1 + T_2, \qquad P = P_0 + P_1 + P_2$$

where

$$T_0 = \tfrac{1}{2} m_0 \dot{\theta}_0^2$$

$$T_1 = \tfrac{1}{2} m_1 \left[ \left(\dot{\theta}_0 + l_1 \dot{\theta}_1 \cos\theta_1\right)^2 + \left(l_1 \dot{\theta}_1 \sin\theta_1\right)^2 \right] + \tfrac{1}{2} I_1 \dot{\theta}_1^2
    = \tfrac{1}{2} m_1 \dot{\theta}_0^2 + \tfrac{1}{2}\left(m_1 l_1^2 + I_1\right) \dot{\theta}_1^2 + m_1 l_1 \dot{\theta}_0 \dot{\theta}_1 \cos\theta_1$$

$$T_2 = \tfrac{1}{2} m_2 \left[ \left(\dot{\theta}_0 + L_1 \dot{\theta}_1 \cos\theta_1 + l_2 \dot{\theta}_2 \cos\theta_2\right)^2 + \left(L_1 \dot{\theta}_1 \sin\theta_1 + l_2 \dot{\theta}_2 \sin\theta_2\right)^2 \right] + \tfrac{1}{2} I_2 \dot{\theta}_2^2$$
$$\phantom{T_2} = \tfrac{1}{2} m_2 \dot{\theta}_0^2 + \tfrac{1}{2} m_2 L_1^2 \dot{\theta}_1^2 + \tfrac{1}{2}\left(m_2 l_2^2 + I_2\right) \dot{\theta}_2^2 + m_2 L_1 \dot{\theta}_0 \dot{\theta}_1 \cos\theta_1 + m_2 l_2 \dot{\theta}_0 \dot{\theta}_2 \cos\theta_2 + m_2 L_1 l_2 \dot{\theta}_1 \dot{\theta}_2 \cos(\theta_1 - \theta_2)$$

$$P_0 = 0, \qquad P_1 = m_1 g l_1 \cos\theta_1, \qquad P_2 = m_2 g \left(L_1 \cos\theta_1 + l_2 \cos\theta_2\right)$$

Thus the Lagrangian of the system is given by

$$L = \tfrac{1}{2}(m_0 + m_1 + m_2)\dot{\theta}_0^2 + \tfrac{1}{2}\left(m_1 l_1^2 + m_2 L_1^2 + I_1\right)\dot{\theta}_1^2 + \tfrac{1}{2}\left(m_2 l_2^2 + I_2\right)\dot{\theta}_2^2$$
$$\phantom{L} + (m_1 l_1 + m_2 L_1)\cos(\theta_1)\,\dot{\theta}_0\dot{\theta}_1 + m_2 l_2 \cos(\theta_2)\,\dot{\theta}_0\dot{\theta}_2 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\,\dot{\theta}_1\dot{\theta}_2$$
$$\phantom{L} - (m_1 l_1 + m_2 L_1)\, g \cos\theta_1 - m_2 l_2 g \cos\theta_2$$

Differentiating the Lagrangian by $\dot{\theta}$ and $\theta$ yields Lagrange equation (1) as

$$\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}_0} - \frac{\partial L}{\partial \theta_0} = u, \qquad
  \frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}_1} - \frac{\partial L}{\partial \theta_1} = 0, \qquad
  \frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}_2} - \frac{\partial L}{\partial \theta_2} = 0$$

Or explicitly,

$$u = (m_0 + m_1 + m_2)\ddot{\theta}_0 + (m_1 l_1 + m_2 L_1)\cos(\theta_1)\ddot{\theta}_1 + m_2 l_2 \cos(\theta_2)\ddot{\theta}_2 - (m_1 l_1 + m_2 L_1)\sin(\theta_1)\dot{\theta}_1^2 - m_2 l_2 \sin(\theta_2)\dot{\theta}_2^2$$

$$0 = (m_1 l_1 + m_2 L_1)\cos(\theta_1)\ddot{\theta}_0 + \left(m_1 l_1^2 + m_2 L_1^2 + I_1\right)\ddot{\theta}_1 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\ddot{\theta}_2 + m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\dot{\theta}_2^2 - (m_1 l_1 + m_2 L_1)\, g \sin\theta_1$$

$$0 = m_2 l_2 \cos(\theta_2)\ddot{\theta}_0 + m_2 L_1 l_2 \cos(\theta_1 - \theta_2)\ddot{\theta}_1 + \left(m_2 l_2^2 + I_2\right)\ddot{\theta}_2 - m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\dot{\theta}_1^2 - m_2 l_2 g \sin\theta_2$$


The Lagrange equations for the DIPC system can be written in a more compact matrix form:

$$D(\theta)\ddot{\theta} + C(\theta, \dot{\theta})\dot{\theta} + G(\theta) = H u \qquad (2)$$

where

$$D(\theta) = \begin{pmatrix} d_1 & d_2 \cos\theta_1 & d_3 \cos\theta_2 \\ d_2 \cos\theta_1 & d_4 & d_5 \cos(\theta_1 - \theta_2) \\ d_3 \cos\theta_2 & d_5 \cos(\theta_1 - \theta_2) & d_6 \end{pmatrix} \qquad (3)$$

$$C(\theta, \dot{\theta}) = \begin{pmatrix} 0 & -d_2 \sin(\theta_1)\dot{\theta}_1 & -d_3 \sin(\theta_2)\dot{\theta}_2 \\ 0 & 0 & d_5 \sin(\theta_1 - \theta_2)\dot{\theta}_2 \\ 0 & -d_5 \sin(\theta_1 - \theta_2)\dot{\theta}_1 & 0 \end{pmatrix} \qquad (4)$$

$$G(\theta) = \begin{pmatrix} 0 \\ -f_1 \sin\theta_1 \\ -f_2 \sin\theta_2 \end{pmatrix} \qquad (5)$$

$$H = (1 \;\; 0 \;\; 0)^T$$

Assuming that the centers of mass of the pendulums are in the geometrical centers of the links, which are solid rods, we have $l_i = L_i/2$, $I_i = m_i L_i^2/12$. Then for the elements of the matrices $D(\theta)$, $C(\theta, \dot{\theta})$ and $G(\theta)$ we get:

$$d_1 = m_0 + m_1 + m_2$$
$$d_2 = m_1 l_1 + m_2 L_1 = \left(\tfrac{1}{2} m_1 + m_2\right) L_1$$
$$d_3 = m_2 l_2 = \tfrac{1}{2} m_2 L_2$$
$$d_4 = m_1 l_1^2 + m_2 L_1^2 + I_1 = \left(\tfrac{1}{3} m_1 + m_2\right) L_1^2$$
$$d_5 = m_2 L_1 l_2 = \tfrac{1}{2} m_2 L_1 L_2$$
$$d_6 = m_2 l_2^2 + I_2 = \tfrac{1}{3} m_2 L_2^2$$
$$f_1 = (m_1 l_1 + m_2 L_1)\, g = \left(\tfrac{1}{2} m_1 + m_2\right) L_1 g$$
$$f_2 = m_2 l_2 g = \tfrac{1}{2} m_2 L_2 g$$

Note that matrix $D(\theta)$ is symmetric and nonsingular.

4 Control

To design a control law, the Lagrange equations of motion (2) are reformulated into a 6th-order system of ordinary differential equations. To do this, a state vector $x \in \mathbb{R}^6$ is introduced:

$$x = (\theta^T \;\; \dot{\theta}^T)^T$$

Then, dropping the dependencies of the system matrices on the generalized coordinates and their derivatives, the system dynamic equations appear as

$$\dot{x} = \begin{pmatrix} 0 & I \\ 0 & -D^{-1}C \end{pmatrix} x + \begin{pmatrix} 0 \\ -D^{-1}G \end{pmatrix} + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} u \qquad (6)$$
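For readers who want to experiment with the model, the following minimal Python/NumPy sketch evaluates $D(\theta)$, $C(\theta,\dot{\theta})$, $G(\theta)$ and the right-hand side of (6). The function names are illustrative, and the numerical values are simply the placeholder parameters later listed in Table 1; this is not code from the original report.

```python
import numpy as np

# Placeholder parameters (Table 1).
m0, m1, m2 = 1.5, 0.5, 0.75   # cart and pendulum masses, kg
L1, L2 = 0.5, 0.75            # pendulum link lengths, m
g = 9.81

# Constants d1..d6, f1, f2 for solid-rod links (l_i = L_i/2, I_i = m_i L_i^2/12).
d1 = m0 + m1 + m2
d2 = (0.5 * m1 + m2) * L1
d3 = 0.5 * m2 * L2
d4 = (m1 / 3.0 + m2) * L1 ** 2
d5 = 0.5 * m2 * L1 * L2
d6 = m2 * L2 ** 2 / 3.0
f1 = (0.5 * m1 + m2) * L1 * g
f2 = 0.5 * m2 * L2 * g

H = np.array([1.0, 0.0, 0.0])

def dipc_matrices(th, dth):
    """D(theta), C(theta, dtheta), G(theta) of equation (2)."""
    _, t1, t2 = th
    _, dt1, dt2 = dth
    D = np.array([[d1, d2 * np.cos(t1), d3 * np.cos(t2)],
                  [d2 * np.cos(t1), d4, d5 * np.cos(t1 - t2)],
                  [d3 * np.cos(t2), d5 * np.cos(t1 - t2), d6]])
    C = np.array([[0.0, -d2 * np.sin(t1) * dt1, -d3 * np.sin(t2) * dt2],
                  [0.0, 0.0, d5 * np.sin(t1 - t2) * dt2],
                  [0.0, -d5 * np.sin(t1 - t2) * dt1, 0.0]])
    G = np.array([0.0, -f1 * np.sin(t1), -f2 * np.sin(t2)])
    return D, C, G

def f_c(x, u):
    """Continuous dynamics (6): x = (theta, dtheta), returns xdot."""
    th, dth = x[:3], x[3:]
    D, C, G = dipc_matrices(th, dth)
    ddth = np.linalg.solve(D, H * u - C @ dth - G)
    return np.concatenate([dth, ddth])
```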

In this report, optimal nonlinear stabilization control design is addressed: stabilize the DIPC minimizing an accumulative cost functional quadratic in states and controls. The general problem of designing an optimal control law involves minimizing a cost function

$$J_t = \sum_{k=t}^{t_{final}} L_k(x_k, u_k), \qquad (7)$$

which represents the accumulated cost of the sequence of states $x_k$ and controls $u_k$ from the current discrete time $t$ to the final time $t_{final}$. For regulation problems $t_{final} = \infty$. Optimization is done with respect to the control sequence subject to the constraints of the system dynamics (6). In our case,

$$L_k(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k \qquad (8)$$

corresponds to the standard linear quadratic cost. For linear systems, this leads to linear state-feedback control, the LQR, designed in the next subsection. For nonlinear systems the optimal control problem generally requires a numerical solution, which can be computationally prohibitive. An analytical approximation to the nonlinear optimal control solution is utilized in the subsection on SDRE control, which represents a nonlinear extension of the LQR and yields superior results. Neural network (NN) capabilities for function approximation are employed to approximate the nonlinear control solution in the subsection on NN control. Combinations of the NN with the LQR and SDRE are investigated in the subsection following the NN control.

4.1 Linear Quadratic Regulator

The linear quadratic regulator yields an optimal solution to the control problem (7)-(8) when the system dynamics are linear. Since the DIPC is nonlinear, as described by (6), it can be linearized to derive an approximate linear solution to the optimal control problem. Linearization of (6) around $x = 0$ yields:

$$\dot{x} = A x + B u \qquad (9)$$

where

$$A = \begin{pmatrix} 0 & I \\ -D(0)^{-1} \dfrac{\partial G(0)}{\partial \theta} & 0 \end{pmatrix} \qquad (10)$$

$$B = \begin{pmatrix} 0 \\ D(0)^{-1} H \end{pmatrix} \qquad (11)$$

and the continuous LQR solution is then obtained as

$$u = -R^{-1} B^T P_c\, x \equiv K_c\, x \qquad (12)$$

where $P_c$ is the steady-state solution of the differential Riccati equation. To implement computerized digital control, the dynamic equations (9) are approximately discretized as $\Phi = e^{A \Delta t}$, $\Gamma = B \Delta t$, and the digital LQR control is then given by

$$u_k = -R^{-1} \Gamma^T P x_k \equiv K x_k \qquad (13)$$

where $P$ is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation

$$\Phi^T \left[ P - P \Gamma \left(R + \Gamma^T P \Gamma\right)^{-1} \Gamma^T P \right] \Phi - P + Q = 0 \qquad (14)$$

where $Q \in \mathbb{R}^{6 \times 6}$ and $R \in \mathbb{R}$ are positive definite state and control cost matrices. Since the linearization (9)-(11) accurately represents the DIPC system (6) at the equilibrium, the LQR control (12) or (13) will be a locally near-optimal stabilizing control.
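A minimal sketch of this design in Python/SciPy is given below. It assumes the linearization (10)-(11), the Table 1 weights, and SciPy's discrete-time Riccati solver; the variable names are illustrative and this is not the author's implementation.

```python
import numpy as np
from scipy.linalg import expm, solve_discrete_are

# Placeholder parameter values (Table 1) and the constants d1..d6, f1, f2.
m0, m1, m2, L1, L2, g = 1.5, 0.5, 0.75, 0.5, 0.75, 9.81
d1 = m0 + m1 + m2; d2 = (0.5*m1 + m2)*L1; d3 = 0.5*m2*L2
d4 = (m1/3 + m2)*L1**2; d5 = 0.5*m2*L1*L2; d6 = m2*L2**2/3
f1 = (0.5*m1 + m2)*L1*g; f2 = 0.5*m2*L2*g
H = np.array([1.0, 0.0, 0.0])

# Linearization (10)-(11): D(0) and dG/dtheta evaluated at theta = 0.
D0 = np.array([[d1, d2, d3], [d2, d4, d5], [d3, d5, d6]])
dG0 = np.diag([0.0, -f1, -f2])
A = np.zeros((6, 6)); A[:3, 3:] = np.eye(3); A[3:, :3] = -np.linalg.solve(D0, dG0)
B = np.zeros((6, 1)); B[3:, 0] = np.linalg.solve(D0, H)

# Weights and sampling time (Table 1), discretization, DARE (14), gain (13).
Q = np.diag([5.0, 50.0, 50.0, 20.0, 700.0, 700.0])
R = np.array([[1.0]])
dt = 0.02
Phi, Gamma = expm(A * dt), B * dt
P = solve_discrete_are(Phi, Gamma, Q, R)
K = -np.linalg.solve(R, Gamma.T @ P)      # gain of (13), so that u_k = K x_k

def lqr_control(x):
    return float(K @ x)
```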


4.2 State-Dependent Riccati Equation Control

An approximate nonlinear analytical solution of the optimal control problem (7)-(8) subject to (6) is given by a technique referred to as state-dependent Riccati equation (SDRE) control. The SDRE approach [3] involves manipulating the dynamic equations

$$\dot{x} = f(x, u)$$

into a pseudo-linear state-dependent coefficient (SDC) form in which the system matrices are explicit functions of the current state:

$$\dot{x} = A(x) x + B(x) u \qquad (15)$$

A standard LQR problem (Riccati equation) can then be solved at each time step to design the state feedback control law on-line. For digital implementation, (15) is approximately discretized at each time step into

$$x_{k+1} = \Phi(x_k) x_k + \Gamma(x_k) u_k \qquad (16)$$

and the SDRE regulator is then specified similarly to the discrete LQR (compare with (13)) as

$$u_k = -R^{-1} \Gamma^T(x_k) P(x_k) x_k \equiv K(x_k) x_k \qquad (17)$$

where $P(x_k)$ is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation (14) using the state-dependent matrices $\Phi(x_k)$ and $\Gamma(x_k)$, which are treated as being constant at each time step. Thus the approach is sometimes considered a nonlinear extension of the LQR.

Proposition 1. The dynamic equations of a double inverted pendulum on a cart are presentable in the SDC form (15).

Proof. From the derived dynamic equations for the DIPC system (6) it is clear that the required SDC form (15) can be obtained if the vector $G(\theta)$ is presentable in the SDC form: $G(\theta) = G_{sd}(\theta)\,\theta$. Let us construct $G_{sd}(\theta)$ as

$$G_{sd}(\theta) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -f_1 \dfrac{\sin\theta_1}{\theta_1} & 0 \\ 0 & 0 & -f_2 \dfrac{\sin\theta_2}{\theta_2} \end{pmatrix}$$

The elements of the constructed $G_{sd}(\theta)$ are bounded everywhere, and $G_{sd}(\theta)\,\theta = G(\theta)$ as required. Thus the system dynamic equations can be presented in the SDC form as

$$\dot{x} = \begin{pmatrix} 0 & I \\ -D^{-1}G_{sd} & -D^{-1}C \end{pmatrix} x + \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} u \qquad (18)$$

The derived system equations (18) (compare with the linearized system (9)-(11)) are discretized at each time step into (16), and the control is then computed as given by (17).
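A sketch of a single SDRE update is given below; it assumes callables A_fn(x) and B_fn(x) returning the SDC matrices of (18), and reuses the same discretization and Riccati solution as in the LQR sketch. The names are illustrative, not the original code.

```python
import numpy as np
from scipy.linalg import expm, solve_discrete_are

def sdre_control(x, A_fn, B_fn, Q, R, dt):
    """One SDRE step: freeze the SDC matrices at x, discretize, solve (14), apply (17).

    A_fn(x), B_fn(x) are assumed to return the state-dependent matrices of (18).
    """
    A, B = A_fn(x), B_fn(x)
    Phi = expm(A * dt)                       # Phi(x_k)
    Gamma = B * dt                           # Gamma(x_k)
    P = solve_discrete_are(Phi, Gamma, Q, R)
    K = -np.linalg.solve(R, Gamma.T @ P)     # K(x_k), so that u_k = K(x_k) x_k
    return float(K @ x)
```

In a closed loop this function would be evaluated at every sampling instant with the current state estimate, which is exactly the "LQR problem solved at each time step" described above.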

Remark. In the neighborhood of the equilibrium $x = 0$, the system equations in the SDC form (18) turn into the linearized equations (9) used in the LQR design. This can be checked either by direct computation or by noting that, on one hand, linearization yields

$$\dot{x} = A x + B u + O(\|x\|^2)$$

and, on the other hand, the SDC form of the system dynamics is

$$\dot{x} = A(x) x + B(x) u$$

Since $O(\|x\|^2) \to 0$ when $x \to 0$, then $A(x) \to A$ and $B(x) \to B$. Therefore, it is natural to expect that the performance of the SDRE regulator will be very close to the LQR in the vicinity of the equilibrium, and the difference will show at larger pendulum deflections. This is exactly what is illustrated in the Simulation Results section.

Figure 2: Neural network control diagram

4.3 Neural Network Learning Control

Neural network (NN) control is popular for nonlinear systems due to the universal function approximation capabilities of NNs. Neural networks with only one hidden layer and an arbitrarily large number of neurons represent nonlinear mappings which can be used for approximation of any nonlinear function $f \in C(\mathbb{R}^n, \mathbb{R}^m)$ over a compact subset of $\mathbb{R}^n$ [7, 4, 12]. In this section, we utilize the function approximation properties of NNs to approximate a solution of the nonlinear optimal control problem (7)-(8) subject to the system dynamics (6), and thus design an optimal NN regulator. The problem can be solved by directly implementing a feedback controller (see Figure 2) as

$$u_k = NN(x_k, w)$$

Next, an optimal set of weights $w$ is computed to solve the optimization problem. To do this, a standard calculus of variations approach is taken: minimization of a functional subject to the equality constraints $x_{k+1} = f(x_k, u_k)$. Let $\lambda_k$ be a vector of Lagrange multipliers in the augmented cost function

$$H = \sum_{k=t}^{t_{final}} L(x_k, u_k) + \lambda_k^T \left( x_{k+1} - f(x_k, u_k) \right) \qquad (19)$$

We can now derive the recurrent Euler-Lagrange equations by solving $\partial H / \partial x_k = 0$ w.r.t. the Lagrange multipliers and then find the optimal set of NN weights $w$ by solving the optimality condition $\partial H / \partial w = 0$ (numerically, by means of gradient descent).

Figure 3: Adjoint system diagram

Dropping the dependence of $L(x_k, u_k)$ and $f(x_k, u_k)$ on $x_k$ and $u_k$, and of $NN(x_k, w)$ on $x_k$ and $w$ for brevity, the Euler-Lagrange equations are derived as

$$\lambda_k = \left[ \frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k}\frac{\partial NN}{\partial x_k} \right]^T \lambda_{k+1} + \left[ \frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k}\frac{\partial NN}{\partial x_k} \right]^T \qquad (20)$$

with $\lambda_{t_{final}}$ initialized as a zero vector. For $L(x_k, u_k)$ given by (8), $\dfrac{\partial L}{\partial x_k} = 2 x_k^T Q$ and $\dfrac{\partial L}{\partial u_k} = 2 u_k^T R$. These equations correspond to an adjoint system shown graphically in Figure 3, with the optimality condition

$$\frac{\partial H}{\partial w} = \sum_{k=t}^{t_{final}} \left[ \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right] \frac{\partial NN}{\partial w} = 0. \qquad (21)$$

The overall training procedure for the NN can now be summarized as follows:

1. Simulate the system forward in time for $t_{final}$ time steps (Figure 2). Although, as we mentioned, $t_{final} = \infty$ for regulation problems, in practice it is set to a sufficiently large number (in our case $t_{final} = 500$).

2. Run the adjoint system backward in time to accumulate the Lagrange multipliers (20) (see Figure 3). The system Jacobians are evaluated analytically or by perturbation.

3. Update the weights using gradient descent, $\Delta w = -\eta \left( \partial H / \partial w \right)^T$ (in practice we use an adaptive learning rate for each weight in the network, using a procedure similar to delta-bar-delta [8]).

4. Repeat until convergence or until an acceptable level of cost reduction is achieved.

Due to the nature of the training process (propagation of the Lagrange multipliers backwards through time), this training algorithm is referred to as Back Propagation Through Time (BPTT) [12, 16].
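The procedure can be sketched as a single training epoch. The Python below is only an illustration of steps 1-4 with generic callables standing in for the discrete DIPC model and the NN; all names and the plain gradient-descent update are assumptions, not the report's code.

```python
import numpy as np

def bptt_epoch(x0, nn, nn_x, nn_w, f, f_x, f_u, Q, R, eta, T):
    """One BPTT epoch for the feedback NN u_k = NN(x_k, w) (a sketch).

    nn(x)   -> scalar control            nn_x(x) -> dNN/dx  (length-n vector)
    nn_w(x) -> dNN/dw  (flattened)       f(x, u) -> next state x_{k+1}
    f_x(x, u), f_u(x, u) -> discrete system Jacobians (n x n matrix, length-n vector)
    Returns the rollout cost and the weight update dw = -eta * dH/dw.
    """
    # Step 1: simulate the closed loop forward for T steps (Figure 2).
    xs, us = [np.asarray(x0, dtype=float)], []
    for k in range(T):
        u = nn(xs[k])
        us.append(u)
        xs.append(f(xs[k], u))
    cost = sum(x @ Q @ x + R * u * u for x, u in zip(xs[:T], us))

    # Steps 2-3: run the adjoint system (20) backward and accumulate (21).
    lam = np.zeros_like(xs[0])                   # lambda_{t_final} = 0
    grad = np.zeros_like(nn_w(xs[0]))
    for k in reversed(range(T)):
        x, u = xs[k], us[k]
        L_x = 2.0 * (x @ Q)                      # dL/dx_k
        L_u = 2.0 * R * u                        # dL/du_k (scalar R and u here)
        grad += (lam @ f_u(x, u) + L_u) * nn_w(x)                    # (21)
        lam = (f_x(x, u) + np.outer(f_u(x, u), nn_x(x))).T @ lam \
              + L_x + L_u * nn_x(x)                                  # (20)
    # Step 4: the caller applies dw and repeats until the cost stops improving.
    return cost, -eta * grad
```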

As seen from the Euler-Lagrange equations (20), back propagation of the Lagrange multipliers includes computation of the system and NN Jacobians. The DIPC Jacobians can be computed numerically by perturbation or derived analytically from the dynamic equations (6). Let us show the analytically derived Jacobians $\partial f_c(x, u)/\partial x$ and $\partial f_c(x, u)/\partial u$, where $f_c(x, u)$ is a brief notation for the right-hand side of the continuous system (6), i.e. $\dot{x} = f_c(x, u)$:

$$\frac{\partial f_c(x, u)}{\partial x} = \begin{pmatrix} 0 & I \\ -D^{-1}M & -2D^{-1}C \end{pmatrix} \qquad (22)$$

$$\frac{\partial f_c(x, u)}{\partial u} = \begin{pmatrix} 0 \\ D^{-1}H \end{pmatrix} \qquad (23)$$

where the matrix $M(\theta, \dot{\theta}) = (M_0 \;\; M_1 \;\; M_2)$, and each of its columns (note that $\dfrac{\partial D^{-1}}{\partial \theta_i} = -D^{-1} \dfrac{\partial D}{\partial \theta_i} D^{-1}$) is given by

$$M_i = \frac{\partial C}{\partial \theta_i}\dot{\theta} + \frac{\partial G}{\partial \theta_i} - \frac{\partial D}{\partial \theta_i} D^{-1}\left( C\dot{\theta} + G - H u \right) \qquad (24)$$

Remark 1. Clearly, the Jacobians (22)-(24) transform into the linear system matrices (10)-(11) at the equilibrium $\theta = 0$, $\dot{\theta} = 0$.

Remark 2. From (3)-(5) it follows that $M_0 \equiv 0$.

Remark 3. The Jacobians (22)-(24) are derived from the continuous system equations (6). BPTT requires computation of the discrete system Jacobians. Thus, to use the derived matrices in NN training, they should be discretized (e.g. as was done for the LQR and SDRE).

Computation of the NN Jacobian is easy to perform given that the nonlinear functions of the individual neural elements are $y = \tanh(z)$. In this case, a NN with $N_0$ inputs, a single output $u$ and a single hidden layer with $N_h$ elements is described by

$$u = \sum_{i=1}^{N_h} w_{2i} \tanh\left( \sum_{j=1}^{N_0} W_{1\,i,j}\, x_j + w_{b1\,i} \right) + w_{b2}$$

Or, in a more compact form,

$$u = w_2 \tanh\left( W_1 x + w_{b1} \right) + w_{b2}$$

where $\tanh(z) = (\tanh(z_1), \ldots, \tanh(z_{N_h}))^T$, $w_2 \in \mathbb{R}^{N_h}$ is a row-vector of weights in the NN output, $W_1 \in \mathbb{R}^{N_h \times N_0}$ is a matrix of weights of the hidden layer elements, $w_{b1} \in \mathbb{R}^{N_h}$ is a vector of bias weights in the hidden layer, and $w_{b2}$ is a bias weight in the NN output. Graphically the NN is depicted in Figure 4.

Remark 4. In the case of a NN with $N_{out}$ outputs (MIMO control problems), $w_2$ becomes a matrix $W_2 \in \mathbb{R}^{N_{out} \times N_h}$, and $w_{b2}$ becomes a vector $w_{b2} \in \mathbb{R}^{N_{out}}$.

Noting that $\dfrac{\partial \tanh(z)}{\partial z} = 1 - \tanh^2(z)$, the NN Jacobian w.r.t. the NN input (state vector $x$) is given by

$$\frac{\partial NN}{\partial x} = \left[ w_2^T \odot \mathrm{dtanh}\left( W_1 x + w_{b1} \right) \right]^T W_1 \qquad (25)$$

where the operator $\odot$ is an element-by-element multiplication of two vectors, i.e. $x \odot y = (x_1 y_1, \ldots, x_n y_n)^T$, and the vector field $\mathrm{dtanh}(z) = \left( 1 - \tanh^2(z_1), \ldots, 1 - \tanh^2(z_{N_h}) \right)^T$. Again, in the MIMO case when the NN has $N_{out}$ outputs, individual rows of $W_2$ take the place of the row-vector $w_2$ in (25).
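For concreteness, a small NumPy sketch of the single-hidden-layer network and the Jacobian (25) follows; the helper names and the example shapes are illustrative.

```python
import numpy as np

def nn_forward(x, W1, wb1, w2, wb2):
    """u = w2 tanh(W1 x + wb1) + wb2 (single output)."""
    return float(w2 @ np.tanh(W1 @ x + wb1) + wb2)

def nn_dx(x, W1, wb1, w2, wb2):
    """Jacobian (25): dNN/dx = [w2^T o dtanh(W1 x + wb1)]^T W1."""
    dtanh = 1.0 - np.tanh(W1 @ x + wb1) ** 2     # element-wise 1 - tanh^2
    return (w2 * dtanh) @ W1                     # row vector of length N0

# Example shapes: N0 = 6 inputs, Nh hidden tanh units, scalar output.
Nh, N0 = 40, 6
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((Nh, N0))
wb1 = np.zeros(Nh)
w2 = 0.1 * rng.standard_normal(Nh)
wb2 = 0.0
x = np.zeros(N0)
print(nn_forward(x, W1, wb1, w2, wb2), nn_dx(x, W1, wb1, w2, wb2).shape)
```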


Figure 4: Neural network structure

The last essential part of the NN training algorithm is the NN Jacobian w.r.t. the network weights:

$$\frac{\partial NN}{\partial W_{1i}} = \left[ w_2^T \odot \mathrm{dtanh}\left( W_1 x + w_{b1} \right) \right]^T x_i$$
$$\frac{\partial NN}{\partial w_2} = \tanh^T\left( W_1 x + w_{b1} \right) \qquad (26)$$
$$\frac{\partial NN}{\partial w_{b1}} = \left[ w_2^T \odot \mathrm{dtanh}\left( W_1 x + w_{b1} \right) \right]^T$$
$$\frac{\partial NN}{\partial w_{b2}} = 1$$

where $W_{1i}$ is the $i$-th column of $W_1$. Now we have all the quantities to compute the weight updates and control. Explicitly, from (21) and (26) the weight update recurrent law is given by

$$W_1^{n+1} = W_1^{n} - \eta \sum_{k=t}^{t_{final}} \left[ W_2^{n\,T} \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \odot \mathrm{dtanh}\left( W_1^{n} x_k + w_{b1}^{n} \right) \right] x_k^T$$

$$W_2^{n+1} = W_2^{n} - \eta \sum_{k=t}^{t_{final}} \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \tanh^T\left( W_1^{n} x_k + w_{b1}^{n} \right) \qquad (27)$$

$$w_{b1}^{n+1} = w_{b1}^{n} - \eta \sum_{k=t}^{t_{final}} W_2^{n\,T} \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T \odot \mathrm{dtanh}\left( W_1^{n} x_k + w_{b1}^{n} \right)$$

$$w_{b2}^{n+1} = w_{b2}^{n} - \eta \sum_{k=t}^{t_{final}} \left( \lambda_{k+1}^T \frac{\partial f}{\partial u_k} + \frac{\partial L}{\partial u_k} \right)^T$$

where $n$ indexes the training iteration and $\eta$ is the learning rate.
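Equation (27) is simply the gradient (21) written out with the Jacobians (26); a compact sketch of the per-time-step contribution, assuming the same weight layout as above (hypothetical names), is given below. Summing these terms over the trajectory and scaling by the learning rate reproduces the updates in (27).

```python
import numpy as np

def weight_grad_step(x_k, e_k, W1, wb1, w2):
    """Per-step gradient terms of (27).

    e_k = lambda_{k+1}^T df/du_k + dL/du_k (a scalar for the single-input DIPC).
    Returns (dW1, dw2, dwb1, dwb2); summed over k and multiplied by the learning
    rate, they give the weight updates of (27).
    """
    z = W1 @ x_k + wb1
    dtanh = 1.0 - np.tanh(z) ** 2
    common = e_k * (w2 * dtanh)          # W2^T (...)^T o dtanh(W1 x + wb1)
    dW1 = np.outer(common, x_k)          # ... times x_k^T
    dw2 = e_k * np.tanh(z)               # (...)^T tanh^T(W1 x + wb1)
    dwb1 = common
    dwb2 = e_k
    return dW1, dw2, dwb1, dwb2
```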

In conclusion to this section, it should be mentioned that the number of elements in the NN hidden layer affects the optimality of the control design. On the one hand, the more elements, the better the theoretically achievable function approximation capabilities of the NN. On the other hand, too many elements make gradient descent in the space of NN weights more prone to getting stuck in local minima (recall, this is a non-convex problem), thus preventing good approximation levels from being achieved. Therefore a reasonable compromise is usually necessary.

Figure 5: Neural network + LQR control diagram

4.4 Neural Network Control + LQR/SDRE

In the previous section a NN was optimized directly to stabilize the DIPC at minimum cost (7)-(8). To achieve this, the NN was trained to generate optimal controls over a wide range of the pendulums' motion. As illustrated in the Simulation Results section, the NN approximation capabilities, limited by the number of neurons and the numerical challenges of non-convex minimization, allowed stabilization only within a range comparable to that of the LQR. One might ask whether it is possible to make the NN training more efficient and faster. Simple logic says that if the DIPC were stable and closer to optimal, the NN would not have much to learn, since its optimal output would be closer to zero. In the trivial case, if the DIPC were optimally stabilized by an internal controller, the optimal NN output would be zero. Now let us recall that the LQR provides optimal control in the vicinity of the equilibrium. Thus if we include the LQR in the control scheme as shown in Figure 5, it is reasonable to expect better overall performance: the NN will be trained to generate only corrections to the LQR controls to provide optimality over the wide range of the pendulums' motion.

Although an LQR is shown in Figure 5, the SDRE controller can be used as well in place of the LQR. As demonstrated in the Simulation Results section, the SDRE control provides superior performance in terms of minimized cost and range of stability over the LQR. Thus, it is natural to expect better performance of the control design shown in Figure 5 when the SDRE is used in place of the LQR. Since the LQR and the SDRE have a lot in common (in fact, the SDRE is often called a nonlinear extension of the LQR), both cases will be discussed in this section, and where the differences between the two call for clarification, it will be provided.

The problem solved in this section is nearly the same as in the previous section; therefore, almost all the formulas stay the same or have just a few changes.

Since the control of the cart is a sum of the NN output and the LQR (SDRE),

$$u_k = NN(w, x_k) + K x_k$$

where in the case of the SDRE $K \to K(x_k)$, minimization of the augmented cost function (19) by taking derivatives of $H$ w.r.t. $x_k$ and $w$ yields slightly modified Euler-Lagrange equations:

$$\lambda_k = \left[ \frac{\partial f}{\partial x_k} + \frac{\partial f}{\partial u_k}\left( \frac{\partial NN}{\partial x_k} + K \right) \right]^T \lambda_{k+1} + \left[ \frac{\partial L}{\partial x_k} + \frac{\partial L}{\partial u_k}\left( \frac{\partial NN}{\partial x_k} + K \right) \right]^T \qquad (28)$$

Note that when the SDRE is used instead of the LQR, $\dfrac{\partial (K(x_k) x_k)}{\partial x_k}$ should be used in place of $K$ in the above equation. These equations correspond to an adjoint system shown graphically in Figure 6 (compare to Figure 3), with the optimality condition (21).

Figure 6: Adjoint NN + LQR system diagram

The training procedure for the NN is the same as in the previous section, and all the system Jacobians and the NN Jacobian are computed in the same way. The only new item in this section is the SDRE Jacobian $\dfrac{\partial (K(x_k) x_k)}{\partial x_k}$, which appears in the Euler-Lagrange equations (28). This Jacobian is computed either numerically, which may be computationally expensive, or approximately as

$$\frac{\partial (K(x_k) x_k)}{\partial x_k} \approx K(x_k)$$
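Putting the pieces together, the combined controller of Figure 5 is a thin wrapper around the components sketched earlier. The snippet below assumes an LQR gain K defined as in (13) (so that the baseline control is K x) and the NN weights of Section 4.3; the names are illustrative.

```python
import numpy as np

def combined_control(x, K, W1, wb1, w2, wb2):
    """u_k = NN(w, x_k) + K x_k: NN correction on top of the LQR/SDRE baseline."""
    u_nn = float(w2 @ np.tanh(W1 @ x + wb1) + wb2)   # NN correction term
    u_base = float(K @ x)                            # LQR gain, or K(x_k) from the SDRE
    # During training, the adjoint recursion (28) adds K (or its SDRE
    # approximation K(x_k)) to dNN/dx when propagating the multipliers.
    return u_nn + u_base
```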

Experiments in the Simulation Results section illustrate the superior performance of such combination control over the pure NN or LQR control. For the SDRE, a noticeable cost reduction is achieved only near critical pendulum deflections (close to the boundaries of the SDRE recovery region).

Remark. The idea of using NNs to adjust the outputs of a conventional controller to account for differences between the actual system and the model used in the conventional control design was also employed in a number of works [15, 14, 2, 11]. Calise, Rysdyk and Johnson [2, 11] used a NN controller to supplement an approximate feedback linearization control of a fixed-wing aircraft and a helicopter. The NN provided additional controls to match the response of the vehicle to the reference model, compensating for the approximations in the autopilot design. Wan and Bogdanov [15, 14] designed a model predictive neural control (MPNC) for an autonomous helicopter, where a NN worked in tandem with the SDRE autopilot and provided minimization of the quadratic cost function over a receding horizon. The SDRE control was designed for a simplified nonlinear model, and the NN made it possible to compensate for wind disturbances and higher-order terms unaccounted for in the SDRE design.

4.5 Receding Horizon Neural Network Control

Is it possible to further improve the control scheme designed in the previous section? Recall that the NN was trained numerous times over a wide range of pendulum angles to minimize the cost functional (19). The complexity of the approach is a function of the final time $t_{final}$, which determines the length of a training epoch. The solution suggested in this section is to apply a receding horizon framework and train the NN in a limited range of pendulum angles, along a trajectory starting at the current point and having a relatively short duration (horizon) $N$. This is accomplished by rewriting the cost function (19) as

$$H = \sum_{k=t}^{t+N-1} L(x_k, u_k) + \lambda_k^T \left( x_{k+1} - f(x_k, u_k) \right) + V(x_{t+N})$$

where the last term $V(x_{t+N})$ denotes the cost-to-go from time $t+N$ to time $t_{final}$. Since the range of the NN inputs in this case is limited and the horizon length $N$ is short, a shorter time is required to train the NN. However, only local minimization of the function (19) is achieved: starting from a significantly different initial condition, the NN will not provide cost minimization, since it was not trained for it. This issue is addressed by periodically retraining the NN: after a period of time called an update interval, the NN is retrained taking the current state vector as initial. The update interval is usually significantly shorter than the horizon, and in classical model-predictive control (MPC) it is often only one time step. This technique was applied to a helicopter control problem [15, 14] and is referred to as Model Predictive Neural Control (MPNC).

In practice, the true value of $V(x_{t+N})$ is unknown and must be approximated. Most common is to simply set $V(x_{t+N}) = 0$; however, this may lead to reduced stability and poor performance for short horizon lengths [9]. Alternatively, we may include a control Lyapunov function (CLF), which guarantees stability if the CLF is an upper bound on the cost-to-go, and results in a region of attraction for the MPC of at least that of the CLF [9]. The cost-to-go $V(x_{t+N})$ can be approximated using the solution of the SDRE at time $t+N$,

$$V(x_{t+N}) \approx x_{t+N}^T P(x_{t+N}) x_{t+N}$$

This CLF provides the exact cost-to-go for regulation assuming a linear system at the horizon time. A similar formulation was used for nonlinear regulation by Cloutier et al. [13].

All the equations from the previous section apply here as well. The Euler-Lagrange equations (20) are initialized in this case as

$$\lambda_{t+N} = \left( \frac{\partial V(x_{t+N})}{\partial x_{t+N}} \right)^T \approx 2 P(x_{t+N})\, x_{t+N}$$

where the dependence of the SDRE solution $P$ on the state vector $x_{t+N}$ was neglected.
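A minimal illustration of the CLF terminal cost and the corresponding multiplier initialization, assuming an SDRE solution P(x) is available as a callable (names are illustrative):

```python
import numpy as np

def terminal_cost_and_multiplier(x_N, sdre_P):
    """CLF terminal cost V(x_{t+N}) = x^T P(x) x and lambda_{t+N} = (dV/dx)^T.

    sdre_P(x) is assumed to return the SDRE Riccati solution P(x); its
    dependence on x is neglected when differentiating, as in the text.
    """
    P = sdre_P(x_N)
    V = float(x_N @ P @ x_N)          # cost-to-go approximation
    lam_N = 2.0 * P @ x_N             # initialization of the adjoint recursion (20)
    return V, lam_N
```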


Stability of the MPNC is closely related to that of traditional MPC. Ideally, in the case of unconstrained optimization, stability is guaranteed provided $V(x_{t+N})$ is a CLF and is an (incremental) upper bound on the cost-to-go [9]. In this case, the minimum region of attraction of the receding horizon optimal control is determined by the CLF used and the horizon length. The guaranteed region of operation contains that of the CLF controller and may be made as large as desired by increasing the optimization horizon (restricted to the infinite horizon domain) [10]. In our case, the minimum region of attraction of the receding horizon MPNC is determined by the SDRE solution used as the CLF to approximate the terminal cost. In addition, we also restrict the controls to be of the form given in Section 4.4, and the optimization is performed with respect to the NN weights $w$. In theory, the universal mapping capability of NNs implies that the stability guarantees are equivalent to those of the traditional MPC framework. In practice, however, stability is affected by the chosen size of the NN (which affects the actual mapping capabilities), as well as by the horizon length $N$ and the update interval length (how often the NN is re-optimized). When the horizon is short, performance is more affected by the chosen CLF. On the other hand, when the horizon is long, performance is limited by the NN properties. An additional factor affecting stability is the specific algorithm used for numerical optimization. Gradient descent, which we use to minimize the cost function of Section 4.5, is guaranteed to converge only to a local minimum (the cost function is not guaranteed to be convex with respect to the NN weights), and thus depends on the initial conditions. In addition, convergence is assured only if the learning rate is kept sufficiently small. To summarize these points, stability of the MPNC is guaranteed under certain restricted ideal conditions. In practice, the designer must select appropriate settings and perform sufficient experiments to assure stability and performance.

4.6 Neural Network Adaptive Control

What if the pendulum parameters are unknown? In this case the system Jacobians (22)-(23) cannot be computed directly. A solution then is to use an additional NN, called an emulator, to emulate the pendulum dynamics (see Figures 7 and 8). This NN-emulator is trained to recreate the pendulum outputs, given the control input and the current state vector. The regular error back-propagation (BP) algorithm is used if the NN-emulator is connected as in Figure 7, and BPTT is employed in the case depicted in Figure 8. The NN-regulator training procedure is the same as in the subsection on NN learning control, but instead of the system Jacobians (22)-(23), the NN-emulator Jacobians computed as in (25) are used (see Figure 9).

Figure 7: Neural network adaptive control diagram

Figure 8: Neural network adaptive control diagram

Figure 9: Adjoint NN adaptive system diagram

5 Simulation results

To evaluate the control performance, two sets of simulations were conducted. In the first set, the LQR, SDRE, NN and NN+LQR/SDRE control schemes were tested to stabilize the DIPC when both pendulums were initially deflected from their vertical position. The system parameters used in the simulations are presented in Table 1. A slight deflection (10 degrees in opposite directions or 20 degrees in the same direction) results in very similar performance of all control approaches, as predicted (see Fig. 10-11).

Larger deflections (15 degrees in opposite directions or 30 degrees in the same direction) reveal differences between the LQR and the SDRE: the latter provides a lower cost (7)-(8) and keeps the pendulum velocities lower (see Fig. 12-13 and Table 2). The NN-based controllers behave similarly to the SDRE.

Note that the LQR could not recover the pendulums starting from 19 deg. initial deflection in opposite directions or 36 deg. deflection in the same direction. Critical cases for the LQR (pendulums starting at 18 deg. deflections in opposite directions from the upward position and 35 deg. in the same direction) are presented in Fig. 14-15. The NN control provides consistently better cost than the LQR at these larger deflections, but it cannot stabilize the system beyond 15 (38) deg. pendulum deflections in the opposite (same) directions due to the limited approximation capabilities of the NN.

Table 1: Simulation parameters

Parameter   Value
m0          1.5 kg
m1          0.5 kg
m2          0.75 kg
L1          0.5 m
L2          0.75 m
Δt          0.02 s
Q           diag(5, 50, 50, 20, 700, 700)
R           1
N_h         40
t_final     500

Table 2: Control performance (cost) comparison ("n/r" denotes no recovery)

Deflection of pendulums, opposite direction (deg.)
Control     10      15      18      28
LQR         115.2   328.4   705.0   n/r
SDRE        112.9   277.3   437.5   3655.8
NN          114.1   323.4   n/r     n/r
NN+LQR      113.3   275.7   448.3   n/r
NN+SDRE     112.5   276.3   436.6   2753.7

Deflection of pendulums, same direction (deg.)
Control     20      30      35      67
LQR         36.7    108.3   325.3   n/r
SDRE        36.0    84.0    118.0   5250.5
NN          36.9    85.7    144.8   n/r
NN+LQR      37.3    86.2    136.4   n/r
NN+SDRE     36.0    84.0    118.0   n/r

Table 3: Regions of pendulum recovery (deg.)

Control     Max. opposite dir.   Max. same dir.
LQR         18                   35
SDRE        28                   67
NN          15                   38
NN+LQR      21                   40
NN+SDRE     28                   62

In the second set of experiments, the regions of pendulum recovery (i.e., the maximum initial pendulum deflections from which a control law can bring the DIPC back to the equilibrium) were evaluated. The results are shown in Table 3. Cases of critical initial pendulum positions for the SDRE (28 deg. deflections in opposite directions or 67 deg. in the same direction) are presented for reference in Fig. 16-17. Note that no limits were imposed on the control force magnitude in any of the simulations. The purpose was to compare the LQR and SDRE without introducing model changes which are not directly accounted for in the control design. It should be mentioned, however, that limited control authority and ways of taking it into account in the SDRE design were investigated earlier [3] but are currently left beyond our scope.

6 Conclusions

This report demonstrated the potential advantage of the SDRE technique over the LQR design in nonlinear optimal control problems with an underactuated plant. The region of pendulum recovery for the SDRE appeared to be 55 to 91 percent larger than in the case of the LQR control.

Direct optimization using neural networks yields results superior to the LQR, but the recovery region is about the same as in the LQR case. This happens due to the limited approximation capabilities of the NN and the challenges of non-convex numerical optimization.

Combining the NN control with the LQR (or with the SDRE) provides larger recovery regions and better overall performance. In this case the NN learns to generate corrections to the LQR (SDRE) control to compensate for the suboptimality of the LQR (SDRE).

To enhance this report, it would be valuable to investigate taking limited control authority into account in all control designs.

References

[1] R. W. Brockett and H. Li. A light weight rotary double pendulum: maximizing the domain of attraction. In Proceedings of the 42nd IEEE Conference on Decision and Control, Maui, Hawaii, December 2003.

[2] A. Calise and R. Rysdyk. Nonlinear adaptive flight control using neural networks. IEEE Control Systems Magazine, 18(6), December 1998.

[3] J. R. Cloutier, C. N. D'Souza, and C. P. Mracek. Nonlinear regulation and nonlinear H-infinity control via the state-dependent Riccati equation technique: Part 1, theory. In Proceedings of the International Conference on Nonlinear Problems in Aviation and Aerospace, Daytona Beach, FL, May 1996.

[4] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 1989.

[5] G. F. Franklin, J. D. Powell, and A. Emami-Naeini. Feedback Control of Dynamic Systems. Addison-Wesley, 2nd edition, 1991.

[6] K. Furuta, T. Okutani, and H. Sone. Computer control of a double inverted pendulum. Computers and Electrical Engineering, 5:67-84, 1978.

[7] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward neural networks are universal approximators. Neural Networks, 2:359-366, 1989.

[8] R. A. Jacobs. Increasing rates of convergence through learning rate adaptation. Neural Networks, 1(4):295-307, 1988.

[9] A. Jadbabaie, J. Yu, and J. Hauser. Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In Proceedings of the American Control Conference, 1999.

[10] A. Jadbabaie, J. Yu, and J. Hauser. Unconstrained receding horizon control of nonlinear systems. In Proceedings of the IEEE Conference on Decision and Control, 1999.

[11] E. Johnson, A. Calise, R. Rysdyk, and H. El-Shirbiny. Feedback linearization with neural network augmentation applied to X-33 attitude control. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, August 2000.

[12] W. T. Miller, R. S. Sutton, and P. J. Werbos. Neural Networks for Control. MIT Press, Cambridge, MA, 1990.

[13] M. Sznaier, J. Cloutier, R. Hull, D. Jacques, and C. Mracek. Receding horizon control Lyapunov function approach to suboptimal regulation of nonlinear systems. Journal of Guidance, Control and Dynamics, 23(3):399-405, May-June 2000.

[14] E. Wan, A. Bogdanov, R. Kieburtz, A. Baptista, M. Carlsson, Y. Zhang, and M. Zulauf. Model predictive neural control for aggressive helicopter maneuvers. In T. Samad and G. Balas, editors, Software Enabled Control: Information Technologies for Dynamical Systems, chapter 10, pages 175-200. IEEE Press, John Wiley & Sons, 2003.

[15] E. A. Wan and A. A. Bogdanov. Model predictive neural control with applications to a 6 DOF helicopter model. In Proceedings of the IEEE American Control Conference, Arlington, VA, June 2001.

[16] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, special issue on neural networks, 2:1550-1560, October 1990.

[17] W. Zhong and H. Rock. Energy and passivity based control of the double inverted pendulum on a cart. In Proceedings of the IEEE International Conference on Control Applications, Mexico City, Mexico, September 2001.


Figures 10-17 show, for each test case, the time histories of the bottom and top pendulum angles (θ1, θ2, deg) and velocities (dθ1/dt, dθ2/dt, deg/s), the cart position (θ0, m) and velocity (dθ0/dt, m/s), and the control force (N) for the compared controllers.

Figure 10: 10 deg. opposite direction

Figure 11: 20 deg. same direction

Figure 12: 15 deg. opposite direction

Figure 13: 30 deg. same direction

Figure 14: 18 deg. opposite direction

Figure 15: 35 deg. same direction

Figure 16: 28 deg. opposite direction, SDRE vs. NN+SDRE

Figure 17: 67 deg. same direction, SDRE