Athans Workshop 10-07


  • Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington

    F.L. Lewis, Moncrief-O'Donnell Endowed Chair

    Head, Controls & Sensors Group

    Talk available online at http://ARRI.uta.edu/acs

    Notes on Optimal Control

    Supported by: NSF - Paul Werbos; ARO - Randy Zachery

    Draguna Vrabie

  • Mike Athans

    Books: Optimal Control (McGraw Hill, 1966); Systems, Networks and Computation: Basic Concepts (McGraw Hill, 1972); Multivariable Methods (McGraw Hill, 1974)

    Third President of the IEEE Control Systems Society, 1972 to 1974

    PhD from Berkeley in 1961

    More than 40 PhD students (how many?)

  • Awards

    American Automatic Control Council's Donald P. Eckman Award "for outstanding contributions to the field of automatic control" (first recipient of the award, 1964)
    American Society for Engineering Education's Frederick E. Terman Award as "the outstanding young electrical engineering educator" (first recipient, 1969)
    American Control Council's Education Award for "outstanding contributions and distinguished leadership in automatic control education" (second recipient, 1980)
    Fellow of the IEEE (1973)
    Fellow of the AAAS (1977)
    Distinguished Member of the IEEE Control Systems Society (1983)
    IEEE Control Systems Society's 1993 H.W. Bode Prize, including the delivery of the Bode Plenary Lecture at the 1993 IEEE Conference on Decision and Control
    American Automatic Control Council's Richard E. Bellman Control Heritage Award (1995) "In Recognition of a Distinguished Career in Automatic Control; As a Leader and Champion of Innovative Research; As a Contributor to Fundamental Knowledge in Optimal, Adaptive, Robust, Decentralized and Distributed Control; and as a Mentor to his Students"

    Athans & Falb, Optimal Control (McGraw Hill, 1966) First book on OC?

  • 1. Use Neural Networks as Function Approximators in Adaptive Control

    2. Optimal Adaptive Control gives new adaptive control algorithms

  • Two-layer feedforward static neural network (NN)

    [Figure: inputs x_1, ..., x_n feed a hidden layer of L neurons with activation functions \sigma(\cdot); first-layer weights V^T, second-layer weights W^T; outputs y_1, ..., y_m.]

    Matrix equation:   y = W^T \sigma(V^T x)

    Summation equation:   y_i = \sum_{k=1}^{L} w_{ik}\, \sigma\Big( \sum_{j=1}^{n} v_{kj} x_j + v_{k0} \Big) + w_{i0}

    Two-layer NNs have the universal approximation property and overcome Barron's fundamental accuracy limitation of one-layer NNs. (A minimal forward-pass sketch follows this slide.)
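    The matrix equation above is easy to check numerically. Below is a minimal Python sketch of the forward pass y = W^T sigma(V^T x); the tanh activation, the bias handling, and all dimensions are illustrative assumptions, not part of the slide.

    import numpy as np

    def two_layer_nn(x, V, W):
        """Two-layer NN output y = W^T sigma(V^T x); bias terms are folded in
        by augmenting the input and the hidden output with a constant 1."""
        z = np.concatenate(([1.0], x))        # input with bias entry
        hidden = np.tanh(V.T @ z)             # sigma(V^T x), tanh chosen for illustration
        h = np.concatenate(([1.0], hidden))   # hidden output with bias entry
        return W.T @ h                        # y = W^T sigma(V^T x)

    # Example: n = 2 inputs, L = 5 hidden neurons, m = 1 output
    rng = np.random.default_rng(0)
    V = rng.standard_normal((3, 5))           # (n+1) x L first-layer weights
    W = rng.standard_normal((6, 1))           # (L+1) x m second-layer weights
    y = two_layer_nn(np.array([0.5, -1.0]), V, W)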

  • Neural Network Robot Controller

    [Figure: block diagram with desired trajectory q_d, tracking error e, filtered error r, PD tracking loop with gain K_v, a neural network in the nonlinear inner (feedforward) loop estimating \hat f(x), a robust control term v(t), and the robot system.]

    Universal Approximation Property

    Problem: the controller is nonlinear in the NN weights, so standard proof techniques do not work.

    Feedback linearization

    The NN universal basis property means no regression matrix is needed: an extension of adaptive control to nonlinear-in-the-parameters systems.

  • Theorem 1 (NN Weight Tuning for Stability)

    Let the desired trajectory q_d(t) and its derivatives be bounded. Let the initial tracking error be within a certain allowable set U. Let Z_M be a known upper bound on the Frobenius norm of the unknown ideal weights Z. Take the control input as

    \tau = \hat W^T \sigma(\hat V^T x) + K_v r - v,   with   v(t) = -K_Z (\|\hat Z\|_F + Z_M)\, r .

    Let weight tuning be provided by

    \dot{\hat W} = F \hat\sigma r^T - F \hat\sigma' \hat V^T x\, r^T - \kappa F \|r\| \hat W ,
    \dot{\hat V} = G x (\hat\sigma'^T \hat W r)^T - \kappa G \|r\| \hat V

    with any constant matrices F = F^T > 0, G = G^T > 0, and scalar tuning parameter \kappa > 0. Initialize the weight estimates as \hat W = 0, \hat V = random.

    Then the filtered tracking error r(t) and NN weight estimates \hat W, \hat V are uniformly ultimately bounded. Moreover, arbitrarily small tracking error may be achieved by selecting large control gains K_v.

    Backprop terms (Werbos); robustifying terms (e-mod, sigma-mod, etc.)

    Extension of adaptive control to nonlinear-in-the-parameters (NLIP) systems; no regression matrix needed. Extra Jacobian terms arise from the NLIP network. (A one-step numerical sketch of these tuning laws follows this slide.)
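    A minimal numerical sketch of one Euler step of these tuning laws is given below; the tanh hidden units, the shapes, and the explicit-Euler discretization are illustrative assumptions rather than the implementation used in the work above.

    import numpy as np

    def nn_weight_tuning_step(W_hat, V_hat, x, r, F, G, kappa, dt):
        """One Euler step of the Theorem 1 tuning laws (sketch):
           W_hat' = F*sigma*r^T - F*sigma'*V_hat^T*x*r^T - kappa*F*||r||*W_hat
           V_hat' = G*x*(sigma'^T*W_hat*r)^T - kappa*G*||r||*V_hat."""
        z = V_hat.T @ x                                  # hidden-layer input
        sigma = np.tanh(z)                               # sigma(V_hat^T x)
        sigma_prime = np.diag(1.0 - np.tanh(z) ** 2)     # activation Jacobian (diagonal)
        r_norm = np.linalg.norm(r)
        W_dot = (F @ np.outer(sigma, r)
                 - F @ sigma_prime @ np.outer(V_hat.T @ x, r)
                 - kappa * r_norm * (F @ W_hat))
        V_dot = (G @ np.outer(x, sigma_prime.T @ W_hat @ r)
                 - kappa * r_norm * (G @ V_hat))
        return W_hat + dt * W_dot, V_hat + dt * V_dot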

  • Neural network backstepping controller for flexible-joint robot arm

    [Figure: block diagram with desired trajectory q_d, tracking error e, filtered error r, tracking loop with gain K_r, a nonlinear feedback-linearization loop with NN#1 estimating \hat F_1(x), a backstepping loop with NN#2 estimating \hat F_2(x), a robust control term v_i(t), an inner current loop with gain K_i and desired current i_d, and the robot system with motor dynamics 1/K_B.]

    Backstepping

    Advantages over traditional backstepping: no regression functions needed.

    Add an extra feedback loop; two NNs are needed; use passivity to show stability.

    Flexible & Vibratory Systems

  • NN in Feedforward Loop: Deadzone Compensation

    [Figure: mechanical system with desired trajectory q_d, tracking error e, filtered error r, gain K_v, an estimate \hat f(x) of the nonlinear function, and a two-NN deadzone precompensator (NN I and NN II) ahead of the actuator deadzone D(u).]

    Actuator nonlinearities: deadzone, saturation, backlash.

    Acts like a 2-layer NN with enhanced backprop tuning!

    Critic and actor NN weight-tuning laws use the enhanced backprop updates with robustifying terms.

  • Cell Homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

    Cellular Metabolism

    Permeability control of the cell membrane

    http://www.accessexcellence.org/RC/VL/GG/index.html

    Optimality in Biological Systems

  • Optimality in Control Systems Design (R. Kalman, 1960)

    Rocket Orbit Injection

    http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf

    Dynamics (planar ascent in polar coordinates; thrust F, mass m, thrust angle \phi):

    \dot r = w
    \dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m} \sin\phi
    \dot v = -\frac{v w}{r} + \frac{F}{m} \cos\phi

    Objectives: get to orbit in minimum time; use minimum fuel.

  • Adaptive Control is generally not Optimal

    Optimal Control is off-line, and needs to know the full system dynamics to solve the design equations.

    We want ONLINE ADAPTIVE OPTIMAL Control

  • Continuous-Time Optimal Control (Bellman)

    System:
    \dot x = f(x, u)

    Cost:
    V(x(t)) = \int_t^\infty r(x, u)\, d\tau = \int_t^\infty \big( Q(x) + u^T R u \big)\, d\tau

    Hamiltonian:
    H\Big(x, \frac{\partial V}{\partial x}, u\Big) = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u)

    For a given control, the cost satisfies this equation (in LQR, this is a Lyapunov equation):
    0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u), \qquad V(0) = 0

    Optimal cost (Bellman):
    0 = \min_{u(t)} \Big[ r(x, u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x, u) \Big]

    Optimal control:
    h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}

    HJB equation (in LQR, this is a Riccati equation):
    0 = Q(x) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{\partial V^*}{\partial x}\Big)^T g(x) R^{-1} g^T(x) \frac{\partial V^*}{\partial x}, \qquad V^*(0) = 0

  • Linear system, quadratic cost

    System:   \dot x = A x + B u
    Utility:   r(x, u) = x^T Q x + u^T R u,   Q \ge 0,  R > 0

    The cost is quadratic:
    V(x(t)) = \int_t^\infty r(x, u)\, d\tau = x^T(t) P x(t)

    Optimal control (state feedback):
    u(t) = -R^{-1} B^T P x(t) = -L x(t)

    HJB equation is the algebraic Riccati equation (ARE):
    0 = A^T P + P A + Q - P B R^{-1} B^T P

    Full system dynamics must be known; off-line solution. (A design sketch using a standard ARE solver follows this slide.)
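    Since the ARE design is a standard off-line computation, it fits in a few lines; this sketch uses SciPy's ARE solver, and the particular A, B, Q, R values are arbitrary illustrative numbers.

    import numpy as np
    from scipy.linalg import solve_continuous_are

    # Offline LQR design: the full dynamics (A, B) must be known.
    A = np.array([[0.0, 1.0], [-1.0, -2.0]])
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2)
    R = np.array([[1.0]])

    P = solve_continuous_are(A, B, Q, R)   # 0 = A^T P + P A + Q - P B R^{-1} B^T P
    L = np.linalg.solve(R, B.T @ P)        # optimal gain L = R^{-1} B^T P
    # optimal control: u(t) = -L @ x(t)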

  • CT Policy Iteration (to avoid solving the HJB equation)

    Utility:   r(x, u) = Q(x) + u^T R u

    Cost for any given u(t) satisfies the Bellman equation:
    0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = H\Big(x, \frac{\partial V}{\partial x}, u\Big)

    Iterative solution. Pick a stabilizing initial control h_0(x); then for k = 0, 1, ...

    Find cost (Lyapunov equation):
    0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad V_k(0) = 0

    Update control:
    h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}

    Convergence proved by Saridis (1979) if the Lyapunov equation is solved exactly. Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation. Abu-Khalaf & Lewis used a NN to approximate V for nonlinear systems and proved convergence.

    Full system dynamics must be known; off-line solution.

  • LQR Policy Iteration = Kleinman algorithm (Kleinman 1968)

    1. For a given control policy u = -L_k x, solve for the cost (Lyapunov equation):
    0 = (A - B L_k)^T P_k + P_k (A - B L_k) + C^T C + L_k^T R L_k

    2. Improve the policy:
    L_{k+1} = R^{-1} B^T P_k

    If started with a stabilizing control policy L_0, the matrix P_k monotonically converges to the unique positive definite solution of the Riccati equation.

    Every iteration step returns a stabilizing controller. The system has to be known. (A minimal sketch follows this slide.)
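    A minimal sketch of the Kleinman iteration follows, using SciPy's Lyapunov solver; here Q stands in for C^T C, and the plant matrices are illustrative (the open loop is stable, so L_0 = 0 is a valid stabilizing initial policy).

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

    def kleinman(A, B, Q, R, L0, iters=15):
        """Kleinman's algorithm: solve (A - B L_k)^T P_k + P_k (A - B L_k) + Q + L_k^T R L_k = 0,
        then update L_{k+1} = R^{-1} B^T P_k.  Needs a stabilizing L0 and the full model."""
        L, P = L0, None
        for _ in range(iters):
            Ak = A - B @ L
            P = solve_continuous_lyapunov(Ak.T, -(Q + L.T @ R @ L))   # Lyapunov equation
            L = np.linalg.solve(R, B.T @ P)                           # policy improvement
        return P, L

    A = np.array([[0.0, 1.0], [-1.0, -2.0]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    P, L = kleinman(A, B, Q, R, np.zeros((1, 2)))
    print(np.allclose(P, solve_continuous_are(A, B, Q, R)))   # matches the ARE solution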

  • Policy Iteration Solution

    Define the Riccati operator:
    Ric(P) = A^T P + P A + Q - P B B^T P

    Then Policy Iteration is
    (A - B B^T P_i)^T P_{i+1} + P_{i+1} (A - B B^T P_i) + P_i B B^T P_i + Q = 0

    with Frechet derivative
    Ric'_{P_i}(P) = (A - B B^T P_i)^T P + P (A - B B^T P_i)

    so Policy Iteration is in fact a Newton's method:
    P_{i+1} = P_i - (Ric'_{P_i})^{-1} Ric(P_i), \qquad i = 0, 1, \ldots

  • Policy Iterations without Lyapunov Equations

    Dynamic programming, built on Bellman's optimality principle, gives an alternative form for CT systems [Athans & Falb 1966, Lewis & Syrmos 1995]:

    V^*(x(t)) = \min_{u(t:t+T)} \Big[ \int_t^{t+T} r(x(\tau), u(\tau))\, d\tau + V^*(x(t+T)) \Big]

    r(x(\tau), u(\tau)) = x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau)

    f(x) and g(x) do not appear.

    Draguna Vrabie

  • Solving for the cost: our approach

    For a given control u = -L x, the cost satisfies

    V(x(t)) = \int_t^{t+T} (x^T Q x + u^T R u)\, d\tau + V(x(t+T))

    LQR case:
    x^T(t) P x(t) = \int_t^{t+T} (x^T Q x + u^T R u)\, d\tau + x^T(t+T) P x(t+T)

    Optimal gain is L = R^{-1} B^T P.

    f(x) and g(x) do not appear.

    Draguna Vrabie

  • 1. Policy Iteration (solving for the cost: our approach)

    Control policy:   u_k(t) = -L_k x(t)

    Cost update (A and B do not appear):
    V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\, d\tau + V_{k+1}(x(t+T))

    For the LQR case:
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_{k+1} x(t+T)

    Control gain update (B needed for the control update):
    L_{k+1} = R^{-1} B^T P_{k+1}

    Initial stabilizing control is needed.

    Draguna Vrabie

  • Theorem (D. Vrabie). This algorithm converges and is equivalent to Kleinman's algorithm,
    P_{i+1} = P_i - (Ric'_{P_i})^{-1} Ric(P_i), \qquad i = 0, 1, \ldots

    Lemma 1. The cost update
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_{k+1} x(t+T)

    is equivalent to the Lyapunov equation
    (A - B R^{-1} B^T P_k)^T P_{k+1} + P_{k+1} (A - B R^{-1} B^T P_k) + P_k B R^{-1} B^T P_k + Q = 0 .

    Proof: along the closed-loop trajectory, with K_i = L_i = R^{-1} B^T P_i and A_i = A - B K_i,
    \frac{d}{d\tau}\big( x^T P_i x \big) = x^T (A_i^T P_i + P_i A_i) x = -x^T (Q + K_i^T R K_i) x ,
    so integrating over [t, t+T],
    \int_t^{t+T} x^T (Q + K_i^T R K_i) x\, d\tau = -\int_t^{t+T} d(x^T P_i x) = x^T(t) P_i x(t) - x^T(t+T) P_i x(t+T) .

    Solves the Lyapunov equation without knowing A or B; only B is needed (for the gain update). (A numerical check of this identity follows this slide.)
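    Lemma 1 is easy to verify numerically for a fixed stabilizing gain: the interval cost plus the terminal quadratic term reproduces x^T(t) P x(t) without A appearing in the identity itself. The sketch below simulates the closed loop only to generate the measured data; all numbers are illustrative.

    import numpy as np
    from scipy.linalg import expm, solve_continuous_lyapunov

    A = np.array([[0.0, 1.0], [-1.0, -2.0]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    L = np.array([[1.0, 1.0]])                          # any stabilizing gain
    Ac = A - B @ L
    P = solve_continuous_lyapunov(Ac.T, -(Q + L.T @ R @ L))

    x0, T, N = np.array([1.0, -0.5]), 0.5, 2000
    ts = np.linspace(0.0, T, N + 1)
    xs = np.array([expm(Ac * t) @ x0 for t in ts])      # closed-loop trajectory
    vals = np.einsum('ij,jk,ik->i', xs, Q + L.T @ R @ L, xs)
    d = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))   # measured reinforcement

    lhs = x0 @ P @ x0
    rhs = d + xs[-1] @ P @ xs[-1]
    print(lhs, rhs)    # the two sides agree to quadrature accuracy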

  • Algorithm Implementation

    Critic update:
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_{k+1} x(t+T)

    Use the Kronecker product identity vec(ABC) = (C^T \otimes A) vec(B) to set this up as

    p_{k+1}^T \big[ \bar x(t) - \bar x(t+T) \big] = \int_t^{t+T} x^T (Q + K_i^T R K_i) x\, d\tau \equiv d(\bar x(t), K_i) = \rho(t, t+T)

    where \bar x(t) = x(t) \otimes x(t) is the quadratic basis set (quadratic regression vector), p_{k+1} = vec(P_{k+1}), and \rho(t, t+T) is the reinforcement on the time interval [t, t+T].

    Now use RLS along the trajectory to get the new weights p_{k+1}. Unpack the weights into the matrix P_{k+1}, then find the updated feedback gain
    L_{k+1} = R^{-1} B^T P_{k+1} .

    (A small check of the quadratic-basis identity follows this slide.)
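    The quadratic-basis identity x^T P x = p^T (x kron x) that turns the critic update into a linear regression can be checked directly; the small example below is only a sanity check of that bookkeeping (row-major vec is fine here because P is symmetric).

    import numpy as np

    def quad_basis(x):
        """Quadratic basis x_bar = x kron x, so that x^T P x = vec(P)^T x_bar."""
        return np.kron(x, x)

    rng = np.random.default_rng(1)
    x = rng.standard_normal(3)
    P = rng.standard_normal((3, 3))
    P = (P + P.T) / 2                   # symmetric cost matrix
    p = P.reshape(-1)                   # weight vector vec(P)
    print(np.isclose(x @ P @ x, p @ quad_basis(x)))   # True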

  • 1. Select an initial stabilizing control policy.

    2. Find the associated cost:
    p_{k+1}^T \big[ \bar x(t) - \bar x(t+T) \big] = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau \equiv \rho(t, t+T)

    3. Improve the control:
    L_{k+1} = R^{-1} B^T P_{k+1}

    Solves the Lyapunov equation without knowing the dynamics.

    Over each interval [t, t+T]: apply u_k(t) = -L_k x(t); observe x(t) and x(t+T); observe the cost integral; update p; do RLS until convergence to P_{k+1}; then update the control gain to L_{k+1}.

    Measure the cost increment (reinforcement) by adding V as a state; then
    \dot V = x^T Q x + (u^k)^T R u^k .

    A is not needed anywhere. (A simulation sketch of the full loop follows this slide.)
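    A minimal simulation sketch of this loop for a small LQR example follows. The plant matrices, weights, and interval T are illustrative; batch least squares over several intervals stands in for the RLS described above, and the A matrix is used only inside the simulator that plays the role of the real plant.

    import numpy as np
    from scipy.linalg import expm, solve_continuous_are

    A = np.array([[0.0, 1.0], [-1.0, -2.0]])     # used only to simulate the plant
    B = np.array([[0.0], [1.0]])
    Q, R, T = np.eye(2), np.eye(1), 0.1

    def simulate(x0, L, T, steps=200):
        """Closed-loop trajectory under u = -L x (stands in for measuring the real plant)."""
        ts = np.linspace(0.0, T, steps + 1)
        xs = np.array([expm((A - B @ L) * t) @ x0 for t in ts])
        return ts, xs

    def reinforcement(ts, xs, L):
        """d = integral of x^T (Q + L^T R L) x over the interval (trapezoid rule)."""
        vals = np.einsum('ij,jk,ik->i', xs, Q + L.T @ R @ L, xs)
        return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))

    L = np.zeros((1, 2))                          # stabilizing initial policy (A is stable here)
    rng = np.random.default_rng(0)
    for _ in range(10):
        X, Y = [], []
        for _ in range(12):                       # several intervals give a least-squares problem
            x0 = rng.standard_normal(2)
            ts, xs = simulate(x0, L, T)
            X.append(np.kron(x0, x0) - np.kron(xs[-1], xs[-1]))   # x_bar(t) - x_bar(t+T)
            Y.append(reinforcement(ts, xs, L))
        p, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
        P = p.reshape(2, 2)
        P = (P + P.T) / 2                         # unpack weights into P_{k+1}
        L = np.linalg.solve(R, B.T @ P)           # L_{k+1} = R^{-1} B^T P_{k+1}

    print(np.allclose(P, solve_continuous_are(A, B, Q, R), atol=1e-4))   # True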

  • Algorithm Implementation

    The critic update
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_{k+1} x(t+T)

    can be set up as
    p_{i+1}^T \big[ \bar x(t) - \bar x(t+T) \big] = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau \equiv d(\bar x(t), L_k)

    where \bar x(t) = x(t) \otimes x(t) is the quadratic basis set.

    Evaluating this at n(n+1)/2 trajectory points, one can set up a least-squares problem to solve for the weights:
    p_{i+1} = (X X^T)^{-1} X Y
    X = [\,\bar x_\Delta(t_1)\; \bar x_\Delta(t_2)\; \ldots\; \bar x_\Delta(t_N)\,], \qquad Y = [\,d(\bar x(t_1), K_i)\; d(\bar x(t_2), K_i)\; \ldots\; d(\bar x(t_N), K_i)\,]^T

    where \bar x_\Delta(t_j) = \bar x(t_j) - \bar x(t_j + T). Or use a batch least-squares solution along the trajectory.

  • Direct Optimal Adaptive Controller

    A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

    [Figure: system \dot x = A x + B u, x_0, with cost integrator \dot V = x^T Q x + u^T R u; a Critic that updates the cost weights from V sampled through a zero-order hold of period T; an Actor (feedback gain L_k) multiplying the state x to produce u; together forming a dynamic control system.]

    Draguna Vrabie

  • Continuous-time control with discrete gain updates

    Control:   u_k(t) = -L_k x(t)

    [Figure: piecewise-constant gain L_k (policy) versus time over sample periods k = 0, 1, 2, 3, 4, 5.]

    Sample periods need not be the same; they can be selected on-line in real time.

  • 2. CT ADP Greedy Iteration: Direct Optimal Adaptive Control for Partially Unknown CT Systems

    Control policy:   u_k(t) = -L_k x(t)

    Cost update (A and B do not appear):
    V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\, d\tau + V_k(x(t+T))

    LQR case:
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_k x(t+T)

    Control gain update (B needed for the control update):
    L_{k+1} = R^{-1} B^T P_{k+1}

    No initial stabilizing control needed. (A sketch of this critic update follows this slide.)

    Draguna Vrabie
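    In an implementation, the only change from the policy-iteration sketch earlier is the critic target, which reuses the previous weights p_k on the right-hand side, so no stabilizing initial gain is required. A hedged sketch of that one step (names and data layout are illustrative):

    import numpy as np

    def greedy_critic_update(samples, p_k):
        """Greedy (value-iteration) critic update:
           p_{k+1}^T x_bar(t) = d(t, t+T) + p_k^T x_bar(t+T)
        where samples is a list of (x_bar_t, x_bar_tT, d) measured on intervals [t, t+T]."""
        X = np.array([s[0] for s in samples])                 # regression vectors x_bar(t)
        Y = np.array([s[2] + p_k @ s[1] for s in samples])    # d + p_k^T x_bar(t+T)
        p_next, *_ = np.linalg.lstsq(X, Y, rcond=None)        # batch LS in place of RLS
        return p_next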

  • Algorithm Implementation

    The critic update
    x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + x^T(t+T) P_k x(t+T)

    is set up with the Kronecker product identity vec(ABC) = (C^T \otimes A) vec(B) as

    p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + p_k^T \bar x(t+T)

    where \bar x(t) = x(t) \otimes x(t) is the quadratic basis set (regression vector) and p_k are the previous weights.

    Now use RLS along the trajectory to get the new weights p_{k+1}. Unpack the weights into the matrix P_{k+1}, then find the updated feedback gain
    L_{k+1} = R^{-1} B^T P_{k+1} .

  • 1. Select a control policy (no initial stabilizing control needed).

    2. Find the associated cost:
    p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T (Q + L_k^T R L_k) x\, d\tau + p_k^T \bar x(t+T)

    3. Improve the control:
    L_{k+1} = R^{-1} B^T P_{k+1}

    Solves for the cost update without knowing the dynamics.

    Over each interval [t, t+T]: apply u; observe x(t) and x(t+T); observe the cost integral; update p; do RLS until convergence to P_{k+1}; then update the control gain to L_{k+1}.

    Measure the cost increment by adding V as a state; then
    \dot V = x^T Q x + (u^i)^T R u^i .

    A is not needed anywhere.

  • Direct Optimal Adaptive Controller

    A hybrid continuous/discrete dynamic controller whose internal state is the observed value over the interval.

    [Figure: same actor-critic structure as before: system \dot x = A x + B u, x_0, cost integrator \dot V = x^T Q x + u^T R u, Critic with zero-order hold of period T, Actor with feedback gain L_k, forming a dynamic control system.]

    Has a different critic cost update; no initial stabilizing gain needed.

    Draguna Vrabie

  • Analysis of the Algorithm

    For a given control policy u_k(t) = -L_k x(t), with L_k = R^{-1} B^T P_k and A_k = A - B R^{-1} B^T P_k, the greedy update

    V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\, d\tau + V_k(x(t+T)), \qquad V_0 = 0

    is equivalent to

    P_{k+1} = \int_0^T e^{A_k^T \tau} (Q + L_k^T R L_k) e^{A_k \tau}\, d\tau + e^{A_k^T T} P_k e^{A_k T}

    a strange pseudo-discretized Riccati equation; c.f. the DT Riccati difference equation

    P_{k+1} = A^T P_k A + Q - A^T P_k B (B^T P_k B + R)^{-1} B^T P_k A = A^T P_k A + Q - L_k^T (B^T P_k B + R) L_k .

    Draguna Vrabie

  • Analysis of the Algorithm (Draguna Vrabie)

    Lemma 2. CT HDP is equivalent to

    P_{k+1} = P_k + \int_0^T e^{A_k^T t} \big( A_k^T P_k + P_k A_k + Q + L_k^T R L_k \big) e^{A_k t}\, dt, \qquad A_k = A - B R^{-1} B^T P_k

    This extra term means the initial control action need not be stabilizing.

    ADP solves the CT ARE without knowledge of the system dynamics A.

    When ADP converges, the resulting P satisfies the Continuous-Time ARE!! (A numerical sketch of this recursion follows this slide.)
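    The recursion of Lemma 2 can be iterated directly as a numerical check: starting from P_0 = 0 (no stabilizing initial gain) it reaches the CT ARE solution. The sketch below evaluates the integral by quadrature; the plant numbers, the interval T, and the iteration count are illustrative.

    import numpy as np
    from scipy.linalg import expm, solve_continuous_are

    A = np.array([[0.0, 1.0], [2.0, -1.0]])      # unstable open loop, illustrative numbers
    B = np.array([[0.0], [1.0]])
    Q, R, T = np.eye(2), np.eye(1), 0.05

    def interval_gramian(Ak, M, T, steps=50):
        """int_0^T exp(Ak^T t) M exp(Ak t) dt by the trapezoid rule."""
        ts = np.linspace(0.0, T, steps + 1)
        vals = np.array([expm(Ak.T * t) @ M @ expm(Ak * t) for t in ts])
        return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)[:, None, None], axis=0)

    P = np.zeros((2, 2))                          # P_0 = 0: the initial gain L_0 = 0 is not stabilizing
    for _ in range(300):
        L = np.linalg.solve(R, B.T @ P)           # L_k = R^{-1} B^T P_k
        Ak = A - B @ L                            # A_k = A - B R^{-1} B^T P_k
        P = P + interval_gramian(Ak, Ak.T @ P + P @ Ak + Q + L.T @ R @ L, T)   # Lemma 2 update

    print(np.allclose(P, solve_continuous_are(A, B, Q, R), atol=1e-3))   # True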

  • Direct OPTIMAL ADAPTIVE CONTROL

    Solves the Riccati equation WITHOUT knowing the plant dynamics.

    Model-free ADP.

    Works for nonlinear systems: a neural network is used to approximate the cost.

    Robustness? Comparison with adaptive control methods?

  • Policy Evaluation (Critic update)

    Let K be any state feedback gain for the system (1). One can measure the associated cost over the infinite time horizon as

    V(t, x(t)) = \int_t^{t+T} x^T (Q + K^T R K) x\, d\tau + W(t+T, x(t+T))

    where W(t+T, x(t+T)) is an initial infinite-horizon cost to go.

    What to do about the tail: issues in Receding Horizon Control.

  • Continuous-time control with discrete gain updates

    Control:   u_k(t) = -L_k x(t)

    [Figure: piecewise-constant gain L_k (policy) versus time over sample periods k = 0, 1, 2, 3, 4, 5.]

    Sample periods need not be the same.

  • Simulation Results

    [Plots: control signal, controller parameters, and system states versus time (0 to 2 s); critic parameters P(1,1), P(1,2), P(2,2) versus time (0 to 6 s), converging to the optimal values P(1,1), P(1,2), P(2,2).]

    Simulations on: F-16 autopilot; load frequency control for a power system.

    A matrix not needed.

    Converges to the steady-state Riccati equation solution.

  • Nonlinear Case Works Too

    System:   \dot x = f(x) + g(x) u

    Cost:
    V(x(t)) = \int_t^\infty r(x, u)\, d\tau = \int_t^\infty \big( Q(x) + u^T R u \big)\, d\tau

    Standard Policy Iteration, which is known to converge (f(x) needed):
    0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x))
    u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}

    Policy Iteration using the interval cost (f(x) not needed):
    V_{k+1}(x(t)) = \int_t^{t+T} \big( Q(x) + u_k^T R u_k \big)\, d\tau + V_{k+1}(x(t+T))
    u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x}

    Prove this is equivalent to the standard Policy Iteration above, which is known to converge.

  • Algorithm Implementation (nonlinear case)

    Approximate the value by a neural network:   V = W^T \phi(x)

    Then the cost update
    V_{k+1}(x(t)) = \int_t^{t+T} \big( Q(x) + u_k^T R u_k \big)\, d\tau + V_{k+1}(x(t+T))

    becomes
    W_{k+1}^T \phi(x(t)) = \int_t^{t+T} \big( Q(x) + u_k^T R u_k \big)\, d\tau + W_{k+1}^T \phi(x(t+T))

    i.e.
    W_{k+1}^T \big[ \phi(x(t)) - \phi(x(t+T)) \big] = \int_t^{t+T} \big( Q(x) + u_k^T R u_k \big)\, d\tau

    where \phi(x(t)) - \phi(x(t+T)) is the regression vector and the right-hand side is the reinforcement on the time interval [t, t+T].

    Now use RLS along the trajectory to get the new weights W_{k+1}. Then find the updated feedback
    u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x} = -\tfrac{1}{2} R^{-1} g^T(x) \Big(\frac{\partial \phi(x)}{\partial x}\Big)^T W_{k+1} .

    (A least-squares sketch of this critic update follows this slide.)
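    A hedged sketch of that critic update with a batch least-squares solve in place of RLS is below; the polynomial basis phi(x) and the data layout are illustrative assumptions.

    import numpy as np

    def phi(x):
        """Illustrative polynomial basis for the value approximation V(x) ~ W^T phi(x)."""
        x1, x2 = x
        return np.array([x1**2, x1*x2, x2**2, x1**4, x1**2 * x2**2, x2**4])

    def critic_ls_update(samples):
        """Batch least-squares form of
           W_{k+1}^T [phi(x(t)) - phi(x(t+T))] = int_t^{t+T} (Q(x) + u_k^T R u_k) dtau,
        where samples is a list of (x_t, x_tT, d) measured on intervals [t, t+T]."""
        X = np.array([phi(xt) - phi(xT) for xt, xT, _ in samples])
        Y = np.array([d for _, _, d in samples])
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return W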

  • Continuous-Time Optimal Control (Bellman)

    System:   \dot x = f(x, u)
    Cost:   V(x(t)) = \int_t^\infty \big( Q(x) + u^T R u \big)\, d\tau
    Hamiltonian:   H\Big(x, \frac{\partial V}{\partial x}, u\Big) = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u)
    Optimal cost:   0 = \min_{u(t)} \Big[ r(x, u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x, u) \Big]
    Optimal control:   h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}
    HJB equation:   0 = Q(x) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{\partial V^*}{\partial x}\Big)^T g(x) R^{-1} g^T(x) \frac{\partial V^*}{\partial x}, \qquad V^*(0) = 0

    c.f. the DT value recursion, where f(.), g(.) do not appear:
    V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})