TRANSCRIPT
Invited by Dr. Fouad Al-Sunni
Automation & Robotics Research Institute (ARRI)
The University of Texas at Arlington
F.L. Lewis & Draguna Vrabie
Moncrief-O'Donnell Endowed Chair
Head, Controls & Sensors Group
Talk available online at http://ARRI.uta.edu/acs
Adaptive Dynamic Programming (ADP) for Feedback Control Systems
Supported by: NSF (Paul Werbos) and ARO (Randy Zachery)
It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.
Man's task is to understand patterns in nature and society.
The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
Ibn Sina, 980-1037 (Avicenna)
Importance of Feedback Control
• Darwin: FB and natural selection
• Volterra: FB and fish population balance
• Adam Smith: FB and the international economy
• James Watt: FB and the steam engine
• FB and cell homeostasis
The resources available to most species for their survival are meager and limited
Nature uses Optimal control
Cell Homeostasis: The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design
R. Kalman, 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics:
$$\dot r = w,\qquad
\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi,\qquad
\dot v = -\frac{vw}{r} + \frac{F}{m}\cos\phi$$
Objectives:
• Get to orbit in minimum time
• Use minimum fuel
Adaptive Control is generally not Optimal
Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this!
Different methods of learning
[Block diagram: an adaptive/learning system inside an environment. The Critic compares the desired performance with the observed outputs and sends a reinforcement signal that tunes the Actor, which generates the control inputs to the System.]
Reinforcement learning - Ivan Pavlov, 1890s
Actor-Critic Learning
We want OPTIMAL performance: ADP - Approximate Dynamic Programming.
Markov Decision Process $(X, U, P, R)$
X = states, U = controls
Transition probabilities: $P^u_{xx'} = \Pr\{x' \mid x, u\}$ = probability of going to state $x'$ from state $x$ given that the control is $u$
Rewards: $R^u_{xx'}$ = reward on going to state $x'$ from state $x$ given that the control is $u$
Value Function
$$V^\pi(x) = E_\pi\Big\{ \sum_{i=k}^\infty \gamma^{\,i-k} r_i \;\Big|\; x_k = x \Big\}$$
Control Policy
A control policy is a mapping from states to control actions:
$\pi: X \times U \to [0,1]$, with $\pi(x,u) = \Pr\{u \mid x\}$ = probability of taking action $u$ while in state $x$.
The Bellman Equation
A consistency equation that holds at each state:
$$V^\pi(x) = E_\pi\Big\{ r_k + \gamma \sum_{i=k+1}^\infty \gamma^{\,i-(k+1)} r_i \;\Big|\; x_k = x \Big\}$$
$$V^\pi(x) = \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^\pi(x') \big]$$
Reinforcement Learning for Markov Decision Processes
Allows one to solve $n$ equations for the value given any policy, $n$ = number of states.
Optimal Value
$$V^*(x) = \min_\pi E_\pi\Big\{ \sum_{i=k}^\infty \gamma^{\,i-k} r_i \;\Big|\; x_k = x \Big\}$$
By Bellman's consistency equation,
$$V^*(x) = \min_\pi V^\pi(x) = \min_\pi \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
Bellman's optimality equation:
$$V^*(x) = \min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
This consists of $n$ equations that must be solved for $V^*(x)$, $n$ = number of states.
Optimal policy:
$$u^* = \arg\min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
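The slides contain no code, so the following is a minimal sketch of my own (not the speakers') showing the Bellman optimality backup as a fixed-point iteration on a toy two-state, two-action MDP. All transition probabilities and costs here are invented for illustration; rewards are treated as costs, so the backup minimizes.

```python
import numpy as np

# P[u, x, x'] = transition probability; R[u, x, x'] = one-step cost (minimized)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[[1.0, 2.0], [2.0, 0.5]],
              [[2.0, 1.0], [0.5, 3.0]]])
gamma = 0.9
V = np.zeros(2)
for _ in range(500):
    # Q[u, x] = sum_x' P[u,x,x'] * (R[u,x,x'] + gamma * V[x'])
    Q = np.einsum('uxy,uxy->ux', P, R) + gamma * np.einsum('uxy,y->ux', P, V)
    V_new = Q.min(axis=0)          # Bellman optimality backup over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmin(axis=0)          # greedy (optimal) policy per state
print(V, policy)
```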
Policy Iteration
Start with an admissible policy, then alternate:
Policy evaluation
Policy improvement
Barto calls this Dynamic Programming. To avoid knowing the system model, use:
• Monte Carlo
• Temporal Difference
Policy Iteration with Temporal Difference
Start with an admissible policy. Perform policy evaluation ALONG A SAMPLE PATH - a stochastic approximation approach - and policy improvement along the sample path. This is an iterative solution algorithm for the optimal value and policy.
Policy evaluation:
$$V_j(x) = \sum_u \pi_j(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V_j(x') \big]$$
Policy improvement:
$$\pi_{j+1}(x,u) = \arg\min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V_j(x') \big]$$
Along a sample path these become
$$V_j(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1}),$$
$$h_{j+1}(x_k) = \arg\min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V_j(x_{k+1}) \big).$$
The Temporal Difference error is
$$e_j(k) = -V_j(x_k) + r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1}).$$
Discrete-Time Optimal Control
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V_h(x_k) = \sum_{i=k}^\infty \gamma^{\,i-k}\, r(x_i, u_i)$
Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$
Control policy: $u_k = h(x_k)$ = the prescribed control input function
Example: $u_k = -K x_k$, linear state variable feedback
This is like the MDP quantities
$$P^u_{xx'} = \Pr\{x' \mid x, u\}, \qquad R^u_{xx'}, \qquad \pi(x,u) = \Pr\{u \mid x\},$$
with cost $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$.
Discrete-Time Optimal Control
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$, with e.g. $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$
Control policy: $u_k = h(x_k)$ = the prescribed control policy

BELLMAN EQUATION
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Hamiltonian:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
Optimal cost (Bellman's Principle):
$$V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + V^*(x_{k+1}) \big) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Optimal control:
$$h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Dynamic programming: a backwards-in-time solution.
The Solution: Hamilton-Jacobi-Bellman Equation
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$
DT HJB equation:
$$V^*(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big)
           = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k) u_k) \big)$$
Minimize with respect to $u_k$:
$$0 = 2 R u_k + g(x_k)^T \frac{dV^*(x_{k+1})}{dx_{k+1}}
\quad\Longrightarrow\quad
u^*(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$$
The DT HJB equation is difficult to solve, and it contains the dynamics.
DT Optimal Control – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$
Cost: $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$
Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.
Bellman equation: $V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$, i.e.
$$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}
             = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$$
Stationarity condition - minimize with respect to $u_k$:
$$0 = 2 R u_k + 2 B^T P (A x_k + B u_k)
\quad\Longrightarrow\quad
u_k = -(R + B^T P B)^{-1} B^T P A\, x_k \equiv -L x_k$$
Substituting back,
$$0 = x_k^T \big( A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A \big) x_k$$
HJB = DT Riccati equation.
Bellman Equation – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$, cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$.
Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.
Bellman equation:
$$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$$
Assume a stabilizing fixed feedback policy $u_k = -K x_k$. Then the Bellman equation becomes a Lyapunov equation:
$$0 = x_k^T \big( (A - BK)^T P (A - BK) - P + Q + K^T R K \big) x_k$$
Then $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -K x_k$, and the improved gain is $L = (R + B^T P B)^{-1} B^T P A$.
DT Optimal Control – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$, cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$, quadratic in the state: $V(x_k) = x_k^T P x_k$.
HJB = DT Riccati equation:
$$0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$$
Optimal control: $u_k = -L x_k$, with optimal cost $V^*(x_k) = x_k^T P x_k$.
This is an off-line solution; the dynamics must be known.
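To make the off-line nature of this solution concrete, here is a hedged numpy sketch (mine, not from the talk) that iterates the DT Riccati difference equation to a fixed point; note that A and B appear explicitly, which is exactly what the online methods below avoid. The example system is invented.

```python
import numpy as np

def dare_iterate(A, B, Q, R, iters=500):
    """Fixed-point iteration on the DT Riccati equation (offline; needs A, B)."""
    P = np.zeros_like(Q)
    for _ in range(iters):
        G = R + B.T @ P @ B
        P = A.T @ P @ A + Q - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)
    return P

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # illustrative system, not from the talk
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P = dare_iterate(A, B, Q, R)
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain, u_k = -L x_k
```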
Discrete-Time Optimal Adaptive Control
Cost: $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$, control policy $u_k = h(x_k)$.
Bellman equation:
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Hamiltonian:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
Optimal cost (Bellman's Optimality Principle):
$$V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Optimal control:
$$h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Focus on these two equations, starting with
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \qquad V_h(0) = 0.$$
Discrete-Time Optimal Control - Solutions by the Computational Intelligence Community

Policy Evaluation (for any given current policy):
Theorem: Let $V_h(x_k)$ solve the Bellman equation $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$. Then
$$V_h(x_k) = \sum_{i=k}^\infty r(x_i, h(x_i)),$$
i.e. it gives the value for any prescribed control policy. The policy must be stabilizing. The dynamics does not appear.

Policy Improvement (for a given policy $h(\cdot)$):
What about
$$h'(x_k) = \arg\min_{u} \big( r(x_k, u) + V_h(x_{k+1}) \big)\,?$$
Theorem (Bertsekas): Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then
$$V_{h'}(x_k) \le V_h(x_k).$$
This is the one-step improvement property of Rollout Algorithms. Compare Bellman's result: $h^*(x_k) = \arg\min_{u_k} ( r(x_k, u_k) + V^*(x_{k+1}) )$.
DT Policy Iteration
The cost of any given control policy $h(x_k)$ satisfies the recursion (Bellman equation: a recursive form, a consistency equation)
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}).$$
Recursive solution - Actor/Critic structure. Pick a stabilizing initial control, then iterate:
Policy evaluation:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Policy improvement:
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
Howard (1960) proved convergence for MDP. Note that $f(\cdot)$ and $g(\cdot)$ do not appear.
e.g. Control policy = SVFB: $h(x_k) = -L x_k$.
Temporal difference:
$$e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
For the quadratic cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$ these steps read:
Policy evaluation:
$$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k) + V_{j+1}(x_{k+1})$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$$
Hewer proved convergence in 1971 (via the DT Lyapunov equation).
DT Policy Iteration – Linear Systems, Quadratic Cost (LQR)
For any stabilizing policy $u_k = -L x_k$, the LQR value is quadratic: $V(x) = x^T P x$.
DT policy iterations:
$$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0 \qquad \text{(DT Lyapunov eq.)}$$
$$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$$
Equivalent to an underlying problem - DT LQR with closed loop
$$x_{k+1} = A x_k + B u_k = (A - BL) x_k.$$
Implemented online (next slides), policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.
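Before the data-driven version, here is a hedged sketch of the model-based form of this iteration (Hewer's algorithm), using scipy's discrete Lyapunov solver for the policy-evaluation step. The function and its arguments are illustrative; L0 must be stabilizing.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hewer(A, B, Q, R, L0, iters=20):
    """Model-based DT policy iteration (Hewer 1971); L0 must stabilize A - B L0."""
    L = L0
    for _ in range(iters):
        Ac = A - B @ L
        # Policy evaluation: Ac' P Ac - P + Q + L' R L = 0  (DT Lyapunov equation)
        P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
        # Policy improvement
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```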
DT Policy Iteration – How to implement online? Linear Systems, Quadratic Cost (LQR)
Policy evaluation:
$$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k) + V_{j+1}(x_{k+1}),$$
i.e.
$$x_k^T P_{j+1} x_k = x_k^T Q x_k + u_j^T R u_j + x_{k+1}^T P_{j+1} x_{k+1}.$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$$
How to implement this online?
This solves the Lyapunov equation without knowing A and B.
For the system $x_{k+1} = A x_k + B u_k$, the LQR cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$ is quadratic, $V(x_k) = x_k^T P x_k$ for some matrix $P$.
Write the quadratic form in terms of a quadratic basis set; e.g. for two states,
$$x_k^T P x_k
= p_{11} x_{1k}^2 + 2 p_{12} x_{1k} x_{2k} + p_{22} x_{2k}^2
= \begin{bmatrix} p_{11} & 2p_{12} & p_{22} \end{bmatrix}
  \begin{bmatrix} x_{1k}^2 \\ x_{1k} x_{2k} \\ x_{2k}^2 \end{bmatrix}
\equiv W^T \varphi(x_k).$$
Then the policy-evaluation step becomes a regression equation in the unknown weights:
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k).$$
Then update the control using
$$h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k.$$
Need to know A AND B for the control update.
Implementation - DT Policy Iteration, Nonlinear Case
Value Function Approximation (VFA):
$$V(x) = W^T \varphi(x)$$
with weights $W$ and basis functions $\varphi(x)$.
LQR case: $V(x)$ is quadratic, $V(x) = x^T P x = W^T \varphi(x)$ with quadratic basis functions and $W^T = [\,p_{11}\;\; 2p_{12}\;\; p_{22}\;\cdots\,]$.
Nonlinear system case: use a Neural Network.
Value function update for a given control:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k)),$$
where $\varphi(x_k) - \varphi(x_{k+1})$ acts as the regression matrix. Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set.
Then update the control; $g(x_k)$ is needed for the control update.
Since $x_{k+1}$ is measured, we do NOT need knowledge of $f(x)$ or $g(x)$ for the value function update.
Implementation - DT Policy Iteration
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$.
This is indirect adaptive control with identification of the optimal value.
Value update:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Control update:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}
              = -\tfrac{1}{2} R^{-1} g(x_k)^T \nabla\varphi(x_{k+1})^T W_{j+1}$$
1. Select a control policy.
2. Find the associated cost.
3. Improve the control.
Needs about 10 lines of MATLAB code.
This is direct optimal adaptive control: it solves the Lyapunov equation without knowing the dynamics.
[Timeline, interval k to k+1: apply $u_k$, observe $x_k$, the cost $r_k$, and $x_{k+1}$; update V; repeat until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.]
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k))$$
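A rough Python analogue of the "10 lines of MATLAB" the slide alludes to (my sketch, under stated assumptions): one policy-evaluation pass by batch least squares over a quadratic basis for a two-state LQR problem. A and B appear only inside the simulation that produces the "measured" $x_{k+1}$, never in the learning law; L is assumed stabilizing, and phi/evaluate_policy are hypothetical names.

```python
import numpy as np

def phi(x):                      # quadratic basis for 2 states
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def evaluate_policy(L, A, B, Q, R, n_samples=30):
    """Batch-LS policy evaluation from observed data only (L assumed stabilizing)."""
    Phi, y = [], []
    for _ in range(n_samples):
        x = np.random.randn(2)            # sampled state (exploration)
        u = -L @ x
        x1 = A @ x + B @ u                # "observe" x_{k+1} from the plant
        Phi.append(phi(x) - phi(x1))      # regression matrix row
        y.append(x @ Q @ x + u @ R @ u)   # observed one-step cost
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    return W                              # W = [p11, 2*p12, p22]
```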
DT Policy Iteration – For Linear Systems, Quadratic Cost (LQR)
Solves the Lyapunov equation without knowing A and B, for $x_{k+1} = A x_k + B u_k$ and cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$.
Form the quadratic basis vectors, e.g. for two states
$$\varphi(x_k) = \begin{bmatrix} x_{1k}^2 & x_{1k} x_{2k} & x_{2k}^2 \end{bmatrix}^T.$$
With SVFB gain $L_j$ and control $u_j(x_k) = -L_j x_k$:
• Apply the control to the system and observe the data set $\{x_k,\; x_{k+1},\; x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k)\}$.
• Find the new weights from
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k),$$
using batch LS or RLS until convergence.
• Unpack the weights into the Lyapunov solution
$$P_{j+1} = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}.$$
• Then update the SVFB gain using
$$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
Need to know A AND B for the control update.
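Continuing the sketch above: a hedged snippet for the unpack-and-improve step, the one place where A and B are needed. Both helper names are illustrative.

```python
import numpy as np

def unpack_P(W):
    """Unpack W = [p11, 2*p12, p22] into the symmetric Lyapunov solution P."""
    p11, p12x2, p22 = W
    return np.array([[p11, p12x2 / 2], [p12x2 / 2, p22]])

def improve_policy(W, A, B, R):
    """SVFB gain update L_{j+1} = (R + B'PB)^{-1} B'PA; needs the model."""
    P = unpack_P(W)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```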
The Adaptive Critic Architecture
[Block diagram: an Action network applying the control policy $h_j(x_k)$ to the System, and a Policy Evaluation (Critic network) observing the resulting cost.]
Value update (critic) - use RLS until convergence:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Control policy update (actor):
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
Adaptive Critics lead to an ONLINE FORWARD-IN-TIME implementation of optimal control: Optimal Adaptive Control.
Adaptive Control
[Plant with control input and output.]
• Identify the controller - Direct Adaptive
• Identify the system model - Indirect Adaptive
• Identify the performance value - Optimal Adaptive
A Problem with DT Policy Iteration
With VFA $V_j(x_k) = W_j^T \varphi(x_k)$, assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$.
Policy evaluation: since $x_{k+1}$ is measured, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k)).$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}},
\qquad\text{LQR: } h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k.$$
We need to know $f(x_k)$ AND $g(x_k)$ for the control update. Easy to fix - later!
Greedy Value Function Update - Approximate Dynamic Programming
ADP Method 1: Heuristic Dynamic Programming (HDP) - Paul Werbos

Policy Iteration:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
For LQR the underlying recursion is a Lyapunov equation (Hewer 1971):
$$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0, \qquad
L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
An initial stabilizing control is needed.

Value Iteration (greedy update):
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$$
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
For LQR the underlying recursion is a simple recursion (Lancaster & Rodman proved convergence):
$$P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j, \qquad
L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
The two occurrences of the cost allow the definition of a greedy update. An initial stabilizing control is NOT needed.
Motivation for Value Iteration
PI policy evaluation step (LE = Lyapunov equation):
$$(A - BL)^T P_{j+1} (A - BL) - P_{j+1} + Q + L^T R L = 0
\qquad\Longleftrightarrow\qquad
V_{j+1}(x_k) = r(x_k, h(x_k)) + V_{j+1}(x_{k+1})$$
VI policy evaluation step (MR = matrix recursion):
$$P_{i+1} = (A - BL)^T P_i (A - BL) + Q + L^T R L
\qquad\Longleftrightarrow\qquad
V_{i+1}(x_k) = r(x_k, h(x_k)) + V_i(x_{k+1})$$
Theorem: Let the gain $L$ be fixed with $(A - BL)$ stable, and let $P_0 = 0$ in the MR. Then
$$\lim_{i\to\infty} P_i = P_{j+1},$$
i.e. repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step, IF THE CONTROL POLICY IS NOT UPDATED.
This is the idea of GPI (Generalized Policy Iteration); Draguna Vrabie developed it for CT systems. PI needs a stabilizing gain; VI does not need a stabilizing gain.
Implementation - DT HDP (Value Iteration)
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$.
Value function update for a given control:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$$W_{j+1}^T \varphi(x_k) = r(x_k, h_j(x_k)) + W_j^T \varphi(x_{k+1}),$$
with regression matrix $\varphi(x_k)$ and the OLD weights $W_j$ on the right-hand side. Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update.
Then update the control using
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}};$$
we need to know $f(x_k)$ AND $g(x_k)$ for the control update.
1. Select a control policy.
2. Find the associated cost.
3. Improve the control.
Needs about 10 lines of MATLAB code.
This is direct optimal adaptive control: it solves the Lyapunov recursion without knowing the dynamics.
[Timeline, interval k to k+1: apply $u_k$, observe $x_k$, the cost $r_k$, and $x_{k+1}$; update V; repeat until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.]
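A hedged sketch of one HDP (value-iteration) weight update for the LQR case, mirroring the equation above: the target uses the OLD weights $W_j$ on the right-hand side, and the regression is on $\varphi(x_k)$ alone, so no stabilizing initial gain is required. Names and the sampling scheme are illustrative.

```python
import numpy as np

def hdp_step(W_old, L, A, B, Q, R, phi, n_samples=30):
    """One greedy (value-iteration) critic update by batch least squares."""
    Phi, y = [], []
    for _ in range(n_samples):
        x = np.random.randn(A.shape[0])
        u = -L @ x
        x1 = A @ x + B @ u                                  # measured next state
        Phi.append(phi(x))                                  # regression matrix
        y.append(x @ Q @ x + u @ R @ u + W_old @ phi(x1))   # greedy target
    W_new, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    return W_new
```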
Four ADP Methods proposed by Paul Werbos
The Critic NN approximates:
• Heuristic Dynamic Programming (HDP): the value $V(x_k)$
• Dual Heuristic Programming (DHP): the gradient $\partial V / \partial x$
• AD Heuristic Dynamic Programming (ADHDP; Watkins' Q Learning): the Q function $Q(x_k, u_k)$
• AD Dual Heuristic Programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An Action NN approximates the control.
cf. Bertsekas: Neurodynamic Programming. Barto & Bradtke: Q-learning proof (imposed a settling time).

Adaptive (Approximate) Dynamic Programming - Value Iteration
Discrete-Time Nonlinear Approximate Dynamic Programming: Convergence Proof
• Problem formulation: for the system $x_{k+1} = f(x_k) + g(x_k) u_k$ with value
$$V(x_k) = \min_{u} \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big),$$
optimal control requires solving the DT HJB
$$V^*(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big)
           = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k) u_k) \big),$$
with
$$u^*(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV^*(x_{k+1})}{dx_{k+1}}.$$
Asma Al-Tamimi - Discrete-Time Nonlinear Heuristic Dynamic Programming: HDP (Value Iteration)
System dynamics: $x_{k+1} = f(x_k) + g(x_k)\, u(x_k)$
Value function:
$$V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)
         = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$$
HDP (value iteration) recursion:
$$u_i(x_k) = \arg\min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$$
$$V_{i+1}(x_k) = x_k^T Q x_k + u_i(x_k)^T R\, u_i(x_k) + V_i(x_{k+1})$$
Flavor of the proofs: proof of convergence of DT nonlinear HDP.

Standard Neural Network VFA for On-Line Implementation - HDP
NN for the value (critic) and NN for the control action:
$$\hat V_i(x_k, W_{Vi}) = W_{Vi}^T \varphi(x_k), \qquad
  \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$$
Define the target cost function
$$d(\varphi(x_k), W_{Vi}) = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + \hat V_i(x_{k+1})
                          = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + W_{Vi}^T \varphi(x_{k+1})$$
Backpropagation (P. Werbos):
The equation for the DT control is implicit; use gradient descent for the action update,
$$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\,
\frac{\partial \big( x_k^T Q x_k + \hat u_i^{(j)T} R\, \hat u_i^{(j)} + \hat V_i(x_{k+1}) \big)}{\partial W_{ui}^{(j)}}
= W_{ui}^{(j)} - \alpha\, \sigma(x_k) \big( 2 R\, \hat u_i^{(j)}(x_k) + g(x_k)^T \nabla\varphi(x_{k+1})^T W_{Vi} \big)^T,$$
which implements
$$W_{ui} = \arg\min_{W} \big( x_k^T Q x_k + \hat u(x_k, W)^T R\, \hat u(x_k, W) + \hat V_i\big( f(x_k) + g(x_k)\, \hat u(x_k, W) \big) \big)$$
(can use a 2-layer NN).
The equation for the cost is explicit; use LS for the critic NN update:
$$W_{Vi+1} = \arg\min_{W_{Vi+1}} \int \big| W_{Vi+1}^T \varphi(x_k) - d(\varphi(x_k), W_{Vi}) \big|^2 dx,$$
$$W_{Vi+1} = \Big( \int \varphi(x_k)\, \varphi(x_k)^T dx \Big)^{-1} \int \varphi(x_k)\, d(\varphi(x_k), W_{Vi}, W_{ui})\, dx,$$
or, recursively,
$$W_{Vi+1}(m+1) = W_{Vi+1}(m) - \alpha\, \varphi(x_k) \big( W_{Vi+1}^T(m)\, \varphi(x_k) - r(x_k, u_k) - W_{Vi}^T \varphi(x_{k+1}) \big).$$
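As a loose sketch of the actor gradient-descent rule above (mine, under stated assumptions): sigma, grad_phi and g are user-supplied callables, x1 is the measured next state, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def actor_gradient_step(W_u, W_V, x, x1, g, R, sigma, grad_phi, lr=0.01):
    """One gradient-descent step on the action weights.
    W_u: (n_sigma, m) action weights; W_V: (n_phi,) critic weights;
    grad_phi(x1): (n_phi, n) Jacobian d(phi)/dx evaluated at x_{k+1}."""
    u = W_u.T @ sigma(x)                                   # current action
    # Chain rule on the Bellman target u'Ru + V_hat(x_{k+1}):
    grad = np.outer(sigma(x), 2 * R @ u + g(x).T @ grad_phi(x1).T @ W_V)
    return W_u - lr * grad
```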
NN for the control action: $\hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$, updated by the gradient-descent rule above, for the system $x_{k+1} = f(x_k) + g(x_k)\, u(x_k)$.
Note that the internal state dynamics $f(x_k)$ is NOT needed, since:
1. an NN approximation for the action is used, and
2. $x_{k+1}$ is measured in the training phase.

Interesting fact for HDP for nonlinear systems: in the linear case
$$h_{j+1}(x_k) = -L_{j+1} x_k = -(I + B^T P_j B)^{-1} B^T P_j A\, x_k,$$
so one must know the system A and B matrices. In the nonlinear NN case only $g(\cdot)$ is needed; information about A (i.e. about $f$) is stored in the NN.
Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof
• Simulation Example 1: Linear system - aircraft longitudinal dynamics (an unstable, two-input system)
$$A = \begin{bmatrix}
1.0722 & 0.0954 & 0 & -0.0541 & -0.0153 \\
4.1534 & 1.1175 & 0 & -0.8000 & -0.1010 \\
0.1359 & 0.0071 & 1.0 & 0.0039 & 0.0097 \\
0 & 0 & 0 & 0.1353 & 0 \\
0 & 0 & 0 & 0 & 0.1353
\end{bmatrix},
\qquad
B = \begin{bmatrix}
-0.0453 & -0.0175 \\
-1.0042 & -0.1131 \\
0.0075 & 0.0134 \\
0.8647 & 0 \\
0 & 0.8647
\end{bmatrix}$$
• The HJB, i.e. ARE, solution:
$$P = \begin{bmatrix}
55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\
7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\
16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\
-4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\
-0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240
\end{bmatrix},
\qquad
L = \begin{bmatrix}
-4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\
-0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798
\end{bmatrix}$$
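The quoted ARE solution can be reproduced offline as a check (this is not part of the online algorithm). The sketch below assumes the weights Q = I, R = I, which appears consistent with the quoted P but is my assumption, not stated on the slide.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0722, 0.0954, 0, -0.0541, -0.0153],
              [4.1534, 1.1175, 0, -0.8000, -0.1010],
              [0.1359, 0.0071, 1.0, 0.0039, 0.0097],
              [0, 0, 0, 0.1353, 0],
              [0, 0, 0, 0, 0.1353]])
B = np.array([[-0.0453, -0.0175],
              [-1.0042, -0.1131],
              [0.0075, 0.0134],
              [0.8647, 0],
              [0, 0.8647]])
P = solve_discrete_are(A, B, np.eye(5), np.eye(2))          # assumed Q = R = I
L = np.linalg.solve(np.eye(2) + B.T @ P @ B, B.T @ P @ A)   # u = -L x
```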
• Simulation: the cost function and policy approximation
Policy approximation:
$$\hat u_i(x_k) = W_{ui}^T \sigma(x_k), \qquad
\sigma(x)^T = [\,x_1\;\; x_2\;\; x_3\;\; x_4\;\; x_5\,], \qquad
W_u^T = \begin{bmatrix} w_{u11} & w_{u12} & w_{u13} & w_{u14} & w_{u15} \\
                        w_{u21} & w_{u22} & w_{u23} & w_{u24} & w_{u25} \end{bmatrix}$$
Cost function approximation:
$$\hat V_{i+1}(x_k) = W_{Vi+1}^T \varphi(x_k)$$
with quadratic basis
$$\varphi(x)^T = \big( x_1^2,\; x_1 x_2,\; x_1 x_3,\; x_1 x_4,\; x_1 x_5,\; x_2^2,\; x_2 x_3,\; x_2 x_4,\; x_2 x_5,\; x_3^2,\; x_3 x_4,\; x_3 x_5,\; x_4^2,\; x_4 x_5,\; x_5^2 \big)$$
$$W_V^T = [\,w_{V1}\;\; w_{V2}\;\; \cdots\;\; w_{V15}\,]$$
• Simulation: convergence of the cost
The critic weights converge to
$$W_V^T = [\,55.5411\;\; 15.2789\;\; 31.3032\;\; {-9.3255}\;\; {-1.4536}\;\; 2.3142\;\; 2.9234\;\; {-1.6594}\;\; {-0.2430}\;\; 24.8262\;\; {-1.3076}\;\; 0.0920\;\; 1.5388\;\; 0.1564\;\; 1.0240\,]$$
Unpacking the weights into the kernel matrix (diagonal entries $P_{ii} = w$, off-diagonal entries $P_{ij} = 0.5\,w$):
$$P = \begin{bmatrix}
w_{V1} & 0.5 w_{V2} & 0.5 w_{V3} & 0.5 w_{V4} & 0.5 w_{V5} \\
0.5 w_{V2} & w_{V6} & 0.5 w_{V7} & 0.5 w_{V8} & 0.5 w_{V9} \\
0.5 w_{V3} & 0.5 w_{V7} & w_{V10} & 0.5 w_{V11} & 0.5 w_{V12} \\
0.5 w_{V4} & 0.5 w_{V8} & 0.5 w_{V11} & w_{V13} & 0.5 w_{V14} \\
0.5 w_{V5} & 0.5 w_{V9} & 0.5 w_{V12} & 0.5 w_{V14} & w_{V15}
\end{bmatrix}$$
Actual ARE solution:
$$P = \begin{bmatrix}
55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\
7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\
16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\
-4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\
-0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240
\end{bmatrix}$$
• Simulation: convergence of the control policy
The action weights converge to
$$W_u = \begin{bmatrix}
4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\
0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798
\end{bmatrix},
\qquad L_{ij} = -w_{uij},$$
versus the actual optimal control
$$L = \begin{bmatrix}
-4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\
-0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798
\end{bmatrix}.$$
Note: in this example the internal dynamics matrix A is NOT needed - the Riccati equation is solved online without knowing the A matrix. (Batch LS was used for the critic NN update.)
Issues with Nonlinear ADP
The LS solution for the critic NN update
$$W_{Vi+1} = \Big( \int \varphi(x_k)\, \varphi(x_k)^T dx \Big)^{-1} \int \varphi(x_k)\, d(\varphi(x_k), W_{Vi}, W_{ui})\, dx$$
involves an integral over a region of state space; approximate it using a set of sample points.

Selection of the NN Training Set
A set of points over a region vs. points along a trajectory: this is exploitation (optimal regulation) vs. exploration. For linear systems these are the same under a persistence of excitation (PE) condition.
Alternatively, take sample points along a single trajectory and use Recursive Least-Squares (RLS):
$$W_{Vi+1}(m+1) = W_{Vi+1}(m) - \alpha\, \varphi(x_k) \big( W_{Vi+1}^T(m)\, \varphi(x_k) - r(x_k, u_k) - W_{Vi}^T \varphi(x_{k+1}) \big)$$
PE allows a local smooth solution of the Bellman equation.
DT HDP vs. Receding Horizon Optimal Control
Forward-in-time HDP:
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad P_0 = 0$$
Backward-in-time optimization (RHC):
$$P_{k-1} = A^T P_k A + Q - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A, \qquad P_N \text{ given},$$
with horizon cost
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_0\, x_{k+N}$$
for the system $x_{k+1} = A x_k + B u_k$, where the terminal weight $P_0$ is a Control Lyapunov Function overbounding the optimal cost.
Adaptive Terminal Cost RHC (Hongwei Zhang, Dr. Jie Huang)
Standard RHC:
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad \text{initialized at } P_0,$$
$$u_k^{RH} = -(R + B^T P_{N-1} B)^{-1} B^T P_{N-1} A\, x_k = -L_N^{RH} x_k.$$
This requires $P_0$ to be a CLF that overbounds the optimal infinite-horizon cost, or a large $N$; $P_0$ is the same for each stage.
Our ATC RHC instead uses, over each stage $[k,\, k+N]$,
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N},$$
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad \text{initialized at } P_{kN},$$
where $P_{kN}$ is the final cost carried over from the previous stage.
HWZ Theorem: under the usual observability and controllability assumptions, ATC RHC guarantees uniform ultimate exponential stability for ANY $P_0 > 0$. Moreover, the solution converges to the optimal infinite-horizon cost.
Adaptive Terminal Cost RHC: let N = 1. Then
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N}$$
becomes
$$V_{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1}),$$
i.e. value iteration, and
$$u_k^{RH} = \arg\min_{u} \Big\{ \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N} \Big\}
          = \arg\min_{u} \big\{ x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1}) \big\}.$$
So, for N = 1, ATC RHC can be implemented using Value Iteration without knowing the system A matrix.
Four ADP Methods proposed by Werbos (recap)
The Critic NN approximates:
• Heuristic Dynamic Programming (HDP): the value $V(x_k)$
• Dual Heuristic Programming (DHP): the gradient $\partial V / \partial x$
• AD Heuristic Dynamic Programming (ADHDP; Watkins' Q Learning): the Q function $Q(x_k, u_k)$
• AD Dual Heuristic Programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An Action NN approximates the control.
cf. Bertsekas: Neurodynamic Programming. Barto & Bradtke: Q-learning proof (imposed a settling time).
Q Learning - Action Dependent ADP
Value function recursion for a given policy $h(x_k)$:
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Define the Q function
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}),$$
where $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time k.
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$, which gives the recursion for Q:
$$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$$
A simple expression of Bellman's principle:
$$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k), \qquad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$$
Optimal Adaptive Control for completely unknown DT systems
[Timeline: at time k apply $u_k$, observe $x_k$ and $x_{k+1}$.]
Specify a control policy $u_j = h(x_j)$, $j = k+1, k+2, \ldots$ (the policy $h(\cdot)$ is used after time k; $u_k$ is arbitrary).
Q function definition:
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}), \qquad Q_h(x_k, h(x_k)) = V_h(x_k)$$
Optimal Q function:
$$Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1}) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))$$
Optimal control solution - a simple expression of Bellman's principle:
$$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k)) = \min_{u_k} Q^*(x_k, u_k)$$
$$h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$$
Q Function Definition - Q Function Policy Iteration (Action Dependent HDP)
Bradtke & Barto (1994) proved convergence for LQR.
The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation
$$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1})).$$
Recursive solution: pick a stabilizing initial control policy, then iterate:
Find the Q function:
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$$
Update the control:
$$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$$
Now $f(x_k, u_k)$ is not needed: Q learning does not need to know $f(x_k)$ or $g(x_k)$.
Q Function for the LQR case
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$$
For LQR, $V(x) = W^T \varphi(x) = x^T P x$, so V is quadratic in x and
$$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k),$$
i.e. Q is quadratic in x and u:
$$Q(x_k, u_k)
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + R \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}$$
The control update is found by
$$0 = \frac{\partial Q}{\partial u_k} = 2\big( B^T P A\, x_k + (R + B^T P B)\, u_k \big) = 2\big( H_{ux} x_k + H_{uu} u_k \big),$$
so
$$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k = -L x_k.$$
The control is found only from the Q function: A and B are not needed.
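A small hedged sketch of the relationship just stated: the kernel H on the left is assembled from the model only to show what Q learning identifies from data; the gain extraction on the right uses H alone. Function names are illustrative.

```python
import numpy as np

def q_kernel(A, B, P, Q, R):
    """Assemble the LQR Q-function kernel H (model-based, for illustration)."""
    Hxx = A.T @ P @ A + Q
    Hxu = A.T @ P @ B
    Huu = B.T @ P @ B + R
    return np.block([[Hxx, Hxu], [Hxu.T, Huu]])

def gain_from_H(H, n):
    """Extract the feedback gain from H alone: L = Huu^{-1} Hux, u = -L x."""
    Huu, Hux = H[n:, n:], H[n:, :n]
    return np.linalg.solve(Huu, Hux)
```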
Implementation - DT Q Function Policy Iteration
Q function approximation (QFA): $Q(x, u) = W^T \varphi(x, u)$. Now u is an input to the NN (Werbos: an action-dependent NN). For LQR the basis $\varphi(x, u)$ is quadratic in $(x, u)$.
Q function update for a given control $u_k = -L_j x_k$ (Bradtke and Barto):
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$$
Assume measurements of $u_k$, $x_k$ and $x_{k+1}$ are available. Then
$$W_{j+1}^T \big( \varphi(x_k, u_k) - \varphi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k),$$
with regression matrix $\varphi(x_k, u_k) - \varphi(x_{k+1}, -L_j x_{k+1})$. Solve for the weights using RLS or backpropagation. Since $x_{k+1}$ is measured in the training phase, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update.
Control policy update:
$$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k), \qquad
u_k = -(H_{uu}^{j+1})^{-1} H_{ux}^{j+1}\, x_k = -L_{j+1} x_k.$$
Model-free policy iteration (Bradtke, Ydstie, Barto).
Greedy Q Function Update - Approximate Dynamic Programming
ADP Method 3: Q Learning - Action-Dependent Heuristic Dynamic Programming (ADHDP)
Paul Werbos - model-free HDP.
Greedy Q update:
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))$$
$$W_{j+1}^T \varphi(x_k, u_k) = r(x_k, u_k) + W_j^T \varphi(x_{k+1}, -L_j x_{k+1}) \equiv \text{target}_{j+1}$$
Update the weights by RLS or backpropagation.
For Q policy iteration a stable initial control is needed; for the greedy (value-iteration) update a stable initial control is NOT needed.
This is direct OPTIMAL ADAPTIVE CONTROL: Q learning actually solves the Riccati equation WITHOUT knowing the plant dynamics (model-free ADP), and it works for nonlinear systems.
Open questions: Proofs? Robustness? Comparison with adaptive control methods?
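Putting the pieces together, here is a hedged end-to-end sketch of ADHDP (greedy Q update) for the LQR case. A and B appear only in the simulated plant; the learning rules see data $(x_k, u_k, x_{k+1})$ alone. The basis packing, probing-noise level, and all names are illustrative choices of mine.

```python
import numpy as np

def qbasis(z):                        # upper-triangular quadratic basis of z
    return np.concatenate([z[i] * z[i:] for i in range(len(z))])

def unpack_H(W, d):                   # invert the basis packing into kernel H
    H, idx = np.zeros((d, d)), 0
    for i in range(d):
        for j in range(i, d):
            H[i, j] = W[idx] if i == j else W[idx] / 2
            H[j, i] = H[i, j]
            idx += 1
    return H

def adhdp_lqr(A, B, Q, R, episodes=50, n_samples=80, noise=0.1):
    n, m = B.shape
    d = n + m
    W = np.zeros(d * (d + 1) // 2)
    L = np.zeros((m, n))              # initial gain need NOT be stabilizing
    for _ in range(episodes):
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.randn(n)
            u = -L @ x + noise * np.random.randn(m)    # probing noise for PE
            x1 = A @ x + B @ u                         # observed next state
            y.append(x @ Q @ x + u @ R @ u + W @ qbasis(np.r_[x1, -L @ x1]))
            Phi.append(qbasis(np.r_[x, u]))
        W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        H = unpack_H(W, d)
        L = np.linalg.solve(H[n:, n:], H[n:, :n])      # L = Huu^{-1} Hux
    return H, L
```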
Discrete-Time Zero-Sum Games
• Continuous-state and action spaces, discrete-time dynamical system
$$x_{k+1} = A x_k + B u_k + E w_k, \qquad y_k = x_k, \qquad z_k = H x_k,$$
with $x \in \mathbb{R}^n$, $y \in \mathbb{R}^p$, $u_k \in \mathbb{R}^{m_1}$, $w_k \in \mathbb{R}^{m_2}$, and quadratic cost
$$V(x_k) = \min_{u} \max_{w} \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i \big), \qquad Q = H^T H.$$
• The zero-sum game problem can be formulated as follows: find the optimal state-feedback strategies $u^*(x) = L x$ and $w^*(x) = K x$ achieving the attenuation bound
$$\sup_{w} \frac{\sum_k \|z_k\|^2}{\sum_k \|w_k\|^2} \le \gamma^2.$$
• Using Bellman's optimality principle ("Dynamic Programming"), $V(x_k) = x_k^T P x_k$ with
$$V(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V(x_{k+1}) \big),$$
$$x_k^T P x_k = \min_{u_k} \max_{w_k} \big( r(x_k, u_k, w_k) + x_{k+1}^T P x_{k+1} \big).$$
• The Game Algebraic Riccati Equation (GARE):
$$P = A^T P A + Q -
\begin{bmatrix} A^T P B & A^T P E \end{bmatrix}
\begin{bmatrix} B^T P B + I & B^T P E \\ E^T P B & E^T P E - \gamma^2 I \end{bmatrix}^{-1}
\begin{bmatrix} B^T P A \\ E^T P A \end{bmatrix}$$
• The conditions for a saddle point are
$$I + B^T P B > 0, \qquad \gamma^2 I - E^T P E > 0.$$
The optimal policies for control and disturbance are
$$L = \big( I + B^T P B - B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P B \big)^{-1}
     \big( B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P A - B^T P A \big),$$
$$K = \big( E^T P E - \gamma^2 I - E^T P B (I + B^T P B)^{-1} B^T P E \big)^{-1}
     \big( E^T P B (I + B^T P B)^{-1} B^T P A - E^T P A \big).$$
The system dynamics A, B, E are needed.
DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation (Asma Al-Tamimi)
• An Approximate Dynamic Programming scheme (ADP) with the incremental optimization
$$V_{i+1}(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1}) \big),$$
which is equivalently written as
$$V_{i+1}(x_k) = x_k^T Q x_k + u_i(x_k)^T u_i(x_k) - \gamma^2 w_i(x_k)^T w_i(x_k) + V_i(x_{k+1}),$$
with $V_i(x_k) = x_k^T P_i x_k$ in the linear quadratic case (V and Q are quadratic).
Q function:
$$Q(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V(x_{k+1})
= \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix},
\qquad
H = \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}$$
Q function update:
$$\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T H_{i+1} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}
= x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k
+ \begin{bmatrix} x_{k+1} \\ u_{k+1} \\ w_{k+1} \end{bmatrix}^T H_i \begin{bmatrix} x_{k+1} \\ u_{k+1} \\ w_{k+1} \end{bmatrix}$$
Control action and disturbance updates, with $u_i(x_k) = L_i x_k$ and $w_i(x_k) = K_i x_k$:
$$L_i = \big( H^i_{uu} - H^i_{uw} (H^i_{ww})^{-1} H^i_{wu} \big)^{-1}
       \big( H^i_{uw} (H^i_{ww})^{-1} H^i_{wx} - H^i_{ux} \big),$$
$$K_i = \big( H^i_{ww} - H^i_{wu} (H^i_{uu})^{-1} H^i_{uw} \big)^{-1}
       \big( H^i_{wu} (H^i_{uu})^{-1} H^i_{ux} - H^i_{wx} \big).$$
A, B, E are NOT needed.
Q Learning for H-infinity Control (Asma Al-Tamimi)
Compare to the Q function for the H2 optimal control case:
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + I \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}$$
For the H-infinity game Q function, a quadratic basis set is used to allow on-line solution:
$$\hat Q_i(z, h_i) = z^T H_i z = h_i^T \bar z, \qquad
z = \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}^T, \qquad
\bar z = \big( z_1^2,\; z_1 z_2,\; \ldots,\; z_1 z_q,\; z_2^2,\; z_2 z_3,\; \ldots,\; z_q^2 \big)$$
(the quadratic Kronecker basis), with $\hat u_i(x) = L_i x$ and $\hat w_i(x) = K_i x$.
Q function update:
$$\hat Q_{i+1}\big(x_k, \hat u_i(x_k), \hat w_i(x_k)\big)
= x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k)
+ \hat Q_i\big(x_{k+1}, \hat u_i(x_{k+1}), \hat w_i(x_{k+1})\big),$$
i.e.
$$h_{i+1}^T \bar z(x_k) = x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k) + h_i^T \bar z(x_{k+1}).$$
Solve for the "NN weights" - the elements of the kernel matrix H - using batch LS or online RLS.
Probing noise is injected to get Persistence of Excitation:
$$\hat u_i^e(x_k) = L_i x_k + n_{1k}, \qquad \hat w_i^e(x_k) = K_i x_k + n_{2k}.$$
Proof: the algorithm still converges to the exact result.
Control and disturbance updates are as given above.
ADHDP Application for a Power System
• System description: load-frequency control. The state is
$$x(t) = [\,\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta F(t)\,]^T,$$
with continuous-time model (the discrete-time model is obtained by applying ZOH to the CT system)
$$\dot x = \begin{bmatrix}
-1/T_p & K_p/T_p & 0 & 0 \\
0 & -1/T_T & 1/T_T & 0 \\
-1/(R\,T_G) & 0 & -1/T_G & -1/T_G \\
K_E & 0 & 0 & 0
\end{bmatrix} x
+ \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix} u,$$
with parameter ranges
$$1/T_p \in [0.033,\, 0.1], \quad K_p/T_p \in [4,\, 12], \quad 1/T_T \in [2.564,\, 4.762], \quad
1/T_G \in [9.615,\, 17.857], \quad 1/(R\,T_G) \in [3.081,\, 10.639].$$
• The system state:
- Δf: incremental frequency deviation (Hz)
- ΔPg: incremental change in generator output (p.u. MW)
- ΔXg: incremental change in governor position (p.u. MW)
- ΔF: incremental change in integral control
- ΔPd: the load disturbance (p.u. MW)
• The system parameters:
- TG: the governor time constant
- TT: turbine time constant
- TP: plant model time constant
- Kp: plant model gain
- R: speed regulation due to governor action
- KE: integral control gain
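A hedged sketch of the model construction just described, discretized by zero-order hold with scipy. The parameter values are picked from the quoted ranges and the sampling period is my assumption, not the talk's exact numbers.

```python
import numpy as np
from scipy.signal import cont2discrete

# Illustrative parameters drawn from the quoted ranges
Tp, Kp, Tt, Tg, Rg, Ke = 20.0, 120.0, 0.35, 0.08, 2.4, 0.6
Ac = np.array([[-1/Tp,       Kp/Tp, 0.0,   0.0],
               [0.0,        -1/Tt,  1/Tt,  0.0],
               [-1/(Rg*Tg),  0.0,  -1/Tg, -1/Tg],
               [Ke,          0.0,   0.0,   0.0]])
Bc = np.array([[0.0], [0.0], [1/Tg], [0.0]])
# ZOH discretization with an assumed sampling period of 0.01 s
Ad, Bd, *_ = cont2discrete((Ac, Bc, np.eye(4), np.zeros((4, 1))), dt=0.01)
```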
ADHDP Application for a Power System
• ADHDP policy tuning
[Figure: convergence of the P-matrix entries (P11, P12, P13, P22, P23, P33, P34, P44) and of the control policy gains (L11, L12, L13, L14) over time k, 0 to 3000.]
The algorithm converges to the optimal GARE solution and the optimal game control.
• Comparison: the ADHDP controller design vs. the design from [1].
[Figure: state trajectories (frequency deviation, incremental change of generator output, governor position, and integral control) over 20 s; the peak frequency deviation is -0.2024 for the ADHDP controller vs. -0.2507 for the controller of [1].]
• The maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].
[1] Wang, Y., R. Zhou, C. Wen, "Robust load-frequency controller design for power systems", IEE Proc.-C, Vol. 140, No. 1, 1993.
Q learning for H-infinity control (game theory) using LMIs - work by Dr. Jinhoon Kim, Chungbuk National Univ., Korea.
How can one do Policy Iteration for unknown continuous-time systems?
What is Value Iteration for continuous-time systems?
How can one do ADP for CT systems?
Discrete-Time Systems:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
• Directly leads to temporal difference techniques.
• The system dynamics does not occur.
• Two occurrences of the value allow greedy value-iteration methods.

Continuous-Time Systems:
$$H\Big(x, u, \frac{\partial V}{\partial x}\Big) = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T \dot x
= r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u)$$
• How to define the temporal difference?
• The system dynamics DOES occur.
• Only ONE occurrence of the value gradient.
• Leads to off-line solutions if the system dynamics is known; hard to do on-line learning.
Continuous-Time Optimal Control
System: $\dot x = f(x, u)$
Cost: $V(x(t)) = \int_t^\infty r(x, u)\, d\tau = \int_t^\infty \big( Q(x) + u^T R u \big)\, d\tau$
Hamiltonian (Bellman):
$$0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big), \qquad V(0) = 0$$
Optimal cost:
$$0 = \min_{u(t)} \Big[ r(x, u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x, u) \Big]$$
Optimal control:
$$h^*(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V^*}{\partial x}$$
HJB equation:
$$0 = Q(x) + \Big(\frac{dV^*}{dx}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{dV^*}{dx}\Big)^T g(x) R^{-1} g^T(x)\, \frac{dV^*}{dx}, \qquad V^*(0) = 0$$
c.f. the DT value recursion, where $f(\cdot)$, $g(\cdot)$ do not appear.
BIG PROBLEM: the system dynamics appears (c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$, where it does not).

CT Policy Iteration
Utility: $r(x, u) = Q(x) + u^T R u$; iterative solution of the cost for any given $u(t)$.
Pick a stabilizing initial control, then iterate:
Find the cost (Lyapunov equation):
$$0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad V_k(0) = 0$$
Update the control:
$$h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V_k}{\partial x}$$
• Convergence proved by Saridis 1979 if the Lyapunov equation is solved exactly.
• Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation.
• Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.
The full system dynamics must be known; this is an off-line solution, used to avoid solving the HJB equation
$$0 = Q(x) + \Big(\frac{dV^*}{dx}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{dV^*}{dx}\Big)^T g(x) R^{-1} g^T(x)\, \frac{dV^*}{dx}.$$
LQR Policy Iteration = Kleinman Algorithm (Kleinman 1968)
1. For a given control policy $u = -L_k x$, solve for the cost (Lyapunov equation):
$$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k, \qquad A_k = A - B L_k$$
2. Improve the policy:
$$L_{k+1} = R^{-1} B^T P_k$$
If started with a stabilizing control policy $L_0$, the matrix $P_k$ monotonically converges to the unique positive definite solution of the Riccati equation. Every iteration step returns a stabilizing controller. The system has to be known.
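A hedged sketch of Kleinman's algorithm using scipy's continuous Lyapunov solver; CtC plays the role of the state weight $C^T C$, and L0 must be stabilizing. The function name and defaults are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman(A, B, CtC, R, L0, iters=15):
    """Model-based CT policy iteration (Kleinman 1968); L0 must be stabilizing."""
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        # Lyapunov equation: Ak' P + P Ak + C'C + L' R L = 0
        P = solve_continuous_lyapunov(Ak.T, -(CtC + L.T @ R @ L))
        L = np.linalg.solve(R, B.T @ P)       # L_{k+1} = R^{-1} B' P
    return P, L
```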
Small Time-Step Approximate Tuning for Continuous-Time Adaptive Critics
Problems with
$$0 = \dot V(x) + r(x, u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) + r(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big):$$
use Euler's forward approximation,
$$H(x_t, u_t, \Delta V_t) = r(x_t, u_t) + \frac{V_{t+1} - V_t}{\Delta t},$$
which, with the discretized utility $r_D(x_t, u_t) = r(x_t, u_t)\,\Delta t$, gives Baird's Advantage function
$$A_D(x_t, u_t) = \frac{r(x_t, u_t)\,\Delta t + V^*(x_{t+1}) - V^*(x_t)}{\Delta t}.$$
This defines a temporal difference error. The system dynamics does NOT appear, and there are TWO occurrences of the value V, so it can be used to define greedy Value Iteration. NOTE that the UTILITY must also be discretized or it does not work.
Problems: it is only an approximation; a constant sampling period must be selected up front; and physics-based CT nonlinearities are lost.
Solving for the Cost - Our Approach (Draguna Vrabie)
For a given policy $u(x)$, the cost
$$V(x(t)) = \int_t^\infty r(x, u)\, d\tau, \qquad V(0) = 0,$$
satisfies
$$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T)).$$
$f(x)$ and $g(x)$ do not appear.
LQR case, with $u = -L x$:
$$x(t)^T P x(t) = \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T),$$
and the optimal gain is $L = R^{-1} B^T P$.
c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$ - the SAME FORM, and again $f(\cdot)$, $g(\cdot)$ do not appear.
Lemma 1 (Draguna Vrabie)
$$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T)), \qquad V(0) = 0,$$
is equivalent to
$$0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big), \qquad V(0) = 0.$$
Proof:
$$\frac{d}{dt} V(x) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = -r(x, u)
\quad\Longrightarrow\quad
\int_t^{t+T} r(x, u)\, d\tau = -\int_t^{t+T} dV = V(x(t)) - V(x(t+T)).$$
This solves the Lyapunov equation without knowing A or B.
Lemma 1 - LQR case (Draguna Vrabie)
$$x(t)^T P x(t) = \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T)$$
is equivalent to
$$0 = A_c^T P + P A_c + L^T R L + Q, \qquad A_c = A - BL.$$
Proof:
$$\frac{d}{dt}\big( x^T P x \big) = x^T \big( A_c^T P + P A_c \big) x = -x^T \big( L^T R L + Q \big) x;$$
integrating both sides over $[t,\, t+T]$ gives the result.
This solves the Lyapunov equation without knowing A or B.
Define the CT temporal difference
$$e(t) = -x(t)^T P x(t) + \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T).$$
Solving for the Cost - Our Approach (Draguna Vrabie)
1. Policy Iteration
Policy evaluation (cost update), with control $u_k(t) = -L_k x(t)$:
$$V_{k+1}(x(t)) = \int_t^{t+T} r(x, u_k)\, d\tau + V_{k+1}(x(t+T))$$
LQR case:
$$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T \big( Q + L_k^T R L_k \big) x\, d\tau + x(t+T)^T P_{k+1} x(t+T)$$
Policy improvement (control gain update):
$$L_{k+1} = R^{-1} B^T P_{k+1}$$
An initial stabilizing control is needed. A and B do not appear in the cost update - it solves the Lyapunov equation without knowing A or B - but B is needed for the control update.
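A hedged sketch of this integral-reinforcement policy evaluation for a two-state LQR example: integrate the closed loop over $[t, t+T]$, record $x(t)$, $x(t+T)$ and the integrated cost, then solve for $P_{k+1}$ by least squares on a quadratic basis. A is used only inside the simulated plant, never by the learning law; L is assumed stabilizing, and all names are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

def phi(x):                                     # quadratic basis, n = 2
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def irl_evaluate(A, B, Q, R, L, T=0.05, n_samples=30):
    """Integral-RL policy evaluation for u = -L x (L assumed stabilizing)."""
    def aug_rhs(t, s):                          # state + running cost
        x = s[:2]
        u = -L @ x
        return np.r_[A @ x + B @ u, x @ Q @ x + u @ R @ u]
    Phi, y = [], []
    for _ in range(n_samples):
        x0 = np.random.randn(2)
        sol = solve_ivp(aug_rhs, (0.0, T), np.r_[x0, 0.0], rtol=1e-8)
        xT, cost = sol.y[:2, -1], sol.y[2, -1]
        Phi.append(phi(x0) - phi(xT))           # V(x(t)) - V(x(t+T))
        y.append(cost)                          # integral reinforcement
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    p11, p12x2, p22 = W
    return np.array([[p11, p12x2 / 2], [p12x2 / 2, p22]])   # P_{k+1}
```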
Direct Optimal Adaptive Controller
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.
[Block diagram: the system $\dot x = A x + B u$, $x_0$, driven by $u(t) = -L_k x(t)$; a Critic integrating $\dot V = x^T Q x + u^T R u$, sampled by a zero-order hold of period T, runs RLS or batch LS to identify the value of the current control; the Actor gain K is updated after the Critic has converged.]
The reinforcement interval T can be selected on line, on the fly.
Continuous-time control with discrete gain updates:
[Timeline: the gain $L_k$ is piecewise constant over reinforcement intervals k = 0, 1, 2, ...; the intervals T need not be the same and can be selected on-line in real time.]
Cost update:
$$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T \big( Q + L_k^T R L_k \big) x\, d\tau + x(t+T)^T P_{k+1} x(t+T)$$
2. CT ADP Greedy Iteration (Value Iteration) - LQR (Draguna Vrabie)
Control policy: $u_k(t) = -L_k x(t)$.
Cost update:
$$V_{k+1}(x(t)) = \int_t^{t+T} \big( x^T Q x + u_k^T R u_k \big)\, d\tau + V_k(x(t+T))$$
Control gain update:
$$L_{k+1} = R^{-1} B^T P_{k+1}$$
No initial stabilizing control is needed. A and B do not appear in the cost update; B is needed for the control update.
Direct Optimal Adaptive Control for Partially Unknown CT Systems - Value Iteration
This is direct OPTIMAL ADAPTIVE CONTROL: it solves the Riccati equation WITHOUT knowing the plant A matrix (model-free ADP in that sense), and it works for nonlinear systems, where a neural network is used to approximate the cost.
Open questions: Robustness? Comparison with adaptive control methods?

Adaptive Control
[Plant with control input and output.]
• Identify the controller - Direct Adaptive
• Identify the system model - Indirect Adaptive
• Identify the performance value - Optimal Adaptive