Invited by Dr. Fouad AL-Sunni - UTA notes/DT ADP - KFUPM May 201…

TRANSCRIPT

Page 1:

Invited by Dr. Fouad AL-Sunni

Page 2:

Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington

F.L. Lewis & Draguna Vrabie, Moncrief-O'Donnell Endowed Chair

Head, Controls & Sensors Group

Talk available online at http://ARRI.uta.edu/acs

Adaptive Dynamic Programming (ADP) for Feedback Control Systems

Supported by: NSF - Paul Werbos; ARO - Randy Zachery

Page 3:

It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.

Man's task is to understand patterns in nature and society.

The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.

Ibn Sina (Avicenna), 980-1037

Page 4:

Importance of Feedback Control

Darwin - FB and natural selection
Volterra - FB and fish population balance
Adam Smith - FB and international economy
James Watt - FB and the steam engine
FB and cell homeostasis

The resources available to most species for their survival are meager and limited

Nature uses Optimal control

Page 5:

Cell Homeostasis. The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

Cellular Metabolism

Permeability control of the cell membrane

http://www.accessexcellence.org/RC/VL/GG/index.html

Optimality in Biological Systems

Page 6:

Optimality in Control Systems Design - R. Kalman, 1960

Rocket Orbit Injection

http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf

Dynamics (planar ascent with thrust F, thrust angle $\phi$, mass m):
$\dot r = w$, $\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi$, $\dot v = -\frac{vw}{r} + \frac{F}{m}\cos\phi$

Objectives: get to orbit in minimum time; use minimum fuel

Page 7:

Adaptive Control is generally not Optimal

Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.

We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing

Reinforcement Learning turns out to be the key to this!

Page 8:

Different methods of learning

[Block diagram: an Actor applies control inputs to the System (environment) and observes its outputs; a Critic compares the outputs with the desired performance and produces a reinforcement signal used to tune the Actor.]

Reinforcement learning - Ivan Pavlov, 1890s

Actor-Critic Learning

We want OPTIMAL performance - ADP: Approximate Dynamic Programming

Page 9:

Markov Decision Process $(X, U, P, R)$

X = states, U = controls

Transition probabilities: $P^u_{xx'} = \Pr\{x' \mid x, u\}$ = probability of going to state x' from state x given that the control is u

Rewards: $R^u_{xx'}$ = reward on going to state x' from state x given that the control is u

Page 10:

Reinforcement Learning for Markov Decision Processes

Value Function
$V^\pi(x) = E_\pi\{\sum_{i=k}^{\infty} \gamma^{i-k} r_i \mid x_k = x\}$

Control Policy
A control policy is a mapping from states to control actions:
$\pi: X \times U \to [0,1]$, with $\pi(x,u) = \Pr\{u \mid x\}$ = probability of taking action u while in state x

The Bellman Equation - a consistency equation that holds at each state:
$V^\pi(x) = E_\pi\{r_k + \gamma\sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r_i \mid x_k = x\}$
$V^\pi(x) = \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V^\pi(x')\right]$

Allows one to solve n equations for the value given any policy, n = number of states

Page 11:

Optimal Value
$V^*(x) = \min_\pi E_\pi\{\sum_{i=k}^{\infty} \gamma^{i-k} r_i \mid x_k = x\}$

By Bellman's consistency equation:
$V^*(x) = \min_\pi V^\pi(x) = \min_\pi \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V^\pi(x')\right]$

Bellman's optimality equation:
$V^*(x) = \min_u \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V^*(x')\right]$
This consists of n equations that must be solved for V*(x), n = number of states.

Optimal policy:
$u^* = \arg\min_u \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V^*(x')\right]$

Page 12:

Policy Iteration - start with an admissible policy

Policy evaluation:
$V_j(x) = \sum_u \pi_j(x,u) \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V_j(x')\right]$

Policy improvement:
$\pi_{j+1}(x,u) = \arg\min_u \sum_{x'} P^u_{xx'}\left[R^u_{xx'} + \gamma V_j(x')\right]$

Barto calls this Dynamic Programming. To avoid knowing the system model, use Monte Carlo or Temporal Difference methods.

Policy Iteration with Temporal Difference - start with an admissible policy

Policy evaluation ALONG A SAMPLE PATH (a stochastic approximation approach):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Policy improvement along the sample path:
$h_{j+1}(x_k) = \arg\min_{h(\cdot)}\left[r(x_k, h(x_k)) + \gamma V_{j+1}(x_{k+1})\right]$

Temporal Difference error:
$e_j(k) = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Temporal Difference is an iterative solution algorithm for the optimal value and policy.
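As an illustration of the policy-iteration loop above, here is a minimal sketch for a tiny MDP. The two-state transition matrices, stage costs, and discount factor are made-up illustration values, and the policy evaluation solves the n linear Bellman equations directly rather than along a sample path.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustration values only).
# P[u] is the transition matrix under action u; R[u] the expected stage cost.
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.7, 0.3]])}
R = {0: np.array([1.0, 2.0]),
     1: np.array([0.5, 0.1])}
gamma, n = 0.9, 2

policy = np.zeros(n, dtype=int)          # start with an admissible policy
for _ in range(50):
    # Policy evaluation: solve the n linear Bellman equations for V
    Pj = np.array([P[policy[x]][x] for x in range(n)])
    Rj = np.array([R[policy[x]][x] for x in range(n)])
    V = np.linalg.solve(np.eye(n) - gamma * Pj, Rj)
    # Policy improvement: greedy (cost-minimizing) with respect to V
    new_policy = np.array([min((0, 1), key=lambda u: R[u][x] + gamma * P[u][x] @ V)
                           for x in range(n)])
    if np.array_equal(new_policy, policy):
        break                            # policy has converged
    policy = new_policy
print("policy:", policy, "value:", V)
```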

Page 13:

Discrete-Time Optimal Control

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i) = r(x_k, u_k) + \gamma\sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i)$

System: $x_{k+1} = f(x_k) + g(x_k)u_k$

Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$

Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$

Control policy: $u_k = h(x_k)$ = the prescribed control input function

Example: $u_k = -Kx_k$, linear state variable feedback

This is like the MDP quantities $P^u_{xx'} = \Pr\{x' \mid x, u\}$, $R^u_{xx'}$, and $\pi(x,u) = \Pr\{u \mid x\}$.

Page 14:

Discrete-Time Optimal Control (continued)

The value function recursion
$V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$,
for the prescribed control policy $u_k = h(x_k)$ is the BELLMAN EQUATION.

Page 15:

Discrete-Time Optimal Control

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$, with $u_k = h(x_k)$ the prescribed control policy

Bellman Equation: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Hamiltonian: $H(x_k, V_h, h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$

Optimal cost (Bellman's Principle): $V^*(x_k) = \min_{h(\cdot)}\left[r(x_k, h(x_k)) + \gamma V^*(x_{k+1})\right]$

Optimal Control: $h^*(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V^*(x_{k+1})\right]$

Dynamic programming - backwards-in-time solution

Page 16:

The Solution: Hamilton-Jacobi-Bellman Equation

System: $x_{k+1} = f(x_k) + g(x_k)u_k$

Cost: $V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T R u_i\right)$

DT HJB equation:
$V^*(x_k) = \min_{u_k}\left(x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1})\right) = \min_{u_k}\left(x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k)u_k)\right)$

Minimize with respect to $u_k$: $0 = 2Ru_k + g(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$

$u^*(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$

Difficult to solve; contains the dynamics.

Page 17:

DT Optimal Control - Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$

Cost: $V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T R u_i\right)$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

Bellman equation: $V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1} = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Stationarity condition - minimize with respect to $u_k$:
$0 = 2Ru_k + 2B^T P (Ax_k + Bu_k)$
$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k \equiv -L x_k$

$0 = x_k^T\left(A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A\right) x_k$   (HJB = DT Riccati equation)

Page 18:

Bellman Equation - Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

Bellman equation: $V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Assume a stabilizing fixed feedback policy $u_k = -Kx_k$. Then
$0 = x_k^T\left((A - BK)^T P (A - BK) - P + Q + K^T R K\right) x_k$   (Bellman eq = Lyapunov equation)

Then $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -Kx_k$.

Page 19:

DT Optimal Control - Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$; cost $V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T R u_i\right)$ is quadratic, $V(x_k) = x_k^T P x_k$.

HJB = DT Riccati equation:
$0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$

Optimal control: $u_k = -Lx_k$, with $L = (R + B^T P B)^{-1} B^T P A$

Optimal cost: $V^*(x_k) = x_k^T P x_k$

Off-line solution; dynamics must be known.
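As an illustration of this off-line, model-based route, the sketch below iterates the DT Riccati recursion to a fixed point for a small made-up (A, B, Q, R); a library routine such as scipy.linalg.solve_discrete_are could be used instead.

```python
import numpy as np

# Made-up 2-state, 1-input system; the dynamics must be known for this off-line design.
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

P = np.zeros((2, 2))
for _ in range(1000):                    # iterate the DT Riccati recursion to a fixed point
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = A.T @ P @ A + Q - A.T @ P @ B @ K
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next

L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal state-feedback gain u_k = -L x_k
print("P =\n", P, "\nL =", L)
```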

Page 20:

Discrete-Time Optimal Adaptive Control

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$, with $u_k = h(x_k)$ the prescribed control policy

Bellman eq.: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Hamiltonian: $H(x_k, V_h, h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$

Optimal cost (Bellman's Optimality Principle): $V^*(x_k) = \min_{h(\cdot)}\left[r(x_k, h(x_k)) + \gamma V^*(x_{k+1})\right]$

Optimal Control: $h^*(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V^*(x_{k+1})\right]$

Focus on these two equations: the Bellman equation (policy evaluation) and the optimal-control equation (policy improvement).

Page 21:

Discrete-Time Optimal Control - Solutions by the Computational Intelligence Community

Bellman eq.: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$, with $u_k = h(x_k)$ the prescribed control policy

Theorem: Let $V_h(x_k)$ solve the Bellman equation. Then
$V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, h(x_i))$

Gives the value for any prescribed control policy - Policy Evaluation for any given current policy.

The policy must be stabilizing. The dynamics does not appear.

Page 22:

Policy Improvement

Bellman's result - Optimal Control: $h^*(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V^*(x_{k+1})\right]$

What about $h'(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V_h(x_{k+1})\right]$ for a given policy h(.)?

Theorem (Bertsekas): Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then
$V_{h'}(x_k) \le V_h(x_k)$

One-step improvement property of Rollout Algorithms.

Page 23:

DT Policy Iteration

The cost for any given control policy $h(x_k)$ satisfies the recursion (Bellman eq., recursive form, consistency equation):
$V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Recursive solution - Actor/Critic structure. Pick a stabilizing initial control, e.g. the SVFB policy $h(x_k) = -Lx_k$.

Policy Evaluation:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Policy Improvement:
$h_{j+1}(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V_{j+1}(x_{k+1})\right]$

f(.) and g(.) do not appear. Howard (1960) proved convergence for MDPs.

Temporal difference: $e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Page 24:

DT Policy Iteration - Linear Systems, Quadratic Cost (LQR)

For any stabilizing policy $u_k = -Lx_k$, the cost is quadratic, $V(x) = x^T P x$, with
$V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u(x_i)^T R u(x_i)\right)$

DT Policy iterations:
Policy evaluation (DT Lyapunov eq.): $V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
Policy improvement: $u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}}$

Equivalent to an underlying problem - DT LQR:
$x_{k+1} = Ax_k + Bu_k = (A - BL)x_k$, with $u_k = -Lx_k$
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$

Hewer proved convergence in 1971. Policy Iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.

Page 25:

DT Policy Iteration - Linear Systems, Quadratic Cost (LQR)

DT Policy iterations:
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
$u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}}$

How to implement online?

Page 26:

DT Policy Iteration - How to implement online? Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$; the LQR cost is quadratic, $V(x) = x^T P x$ for some matrix P, with
$V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u(x_i)^T R u(x_i)\right)$

Policy evaluation: $V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$, i.e.
$x_k^T P_{j+1} x_k - x_{k+1}^T P_{j+1} x_{k+1} = x_k^T Q x_k + u_j^T R u_j$

Quadratic basis set: for a two-state system,
$x^T P x = p_{11}x_1^2 + 2p_{12}x_1x_2 + p_{22}x_2^2 = W^T\phi(x)$,
with basis $\phi(x) = [x_1^2,\ x_1x_2,\ x_2^2]^T$ and weights $W = [p_{11},\ 2p_{12},\ p_{22}]^T$, so the evaluation step becomes
$W_{j+1}^T\left(\phi(x_k) - \phi(x_{k+1})\right) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)$

Then update the control using
$h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$
Need to know A AND B for the control update.

Solves the Lyapunov eq. without knowing A and B.

Page 27:

Implementation - DT Policy Iteration, Nonlinear Case

Value Function Approximation (VFA): $V(x) = W^T\phi(x)$ (weights W, basis functions $\phi(x)$)

LQR case - V(x) is quadratic: $V(x) = x^T P x = W^T\phi(x)$, with quadratic basis functions and
$W^T = [\,p_{11}\ \ 2p_{12}\ \ p_{22}\ \cdots\,]$

Nonlinear system case - use a Neural Network.

Page 28:

Implementation - DT Policy Iteration

VFA: $V_j(x_k) = W_j^T\phi(x_k)$

Value function update for the given control:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T\left(\phi(x_k) - \phi(x_{k+1})\right) = r(x_k, h_j(x_k))$   (regression matrix)

Solve for the weights using RLS, or many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, knowledge of f(x) or g(x) is not needed for the value function update.

Then update the control using
$u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}} = -\frac{1}{2} R^{-1} g(x_k)^T \nabla\phi(x_{k+1})^T W_{j+1}$
Need to know $g(x_k)$ for the control update.

Indirect adaptive control with identification of the optimal value.

Page 29:

Policy evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Policy improvement: $u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}}$

1. Select control policy

2. Find associated cost

3. Improve control

Needs 10 lines of MATLAB code

Direct optimal adaptive control

Solves Lyapunov eq. without knowing dynamics

At each time step k: observe $x_k$, apply $u_k$, observe the cost $r_k$, observe $x_{k+1}$, and update V by solving
$W_{j+1}^T\left(\phi(x_k) - \phi(x_{k+1})\right) = r(x_k, h_j(x_k))$
Do this until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.

Page 30:

DT Policy Iteration for Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = Ax_k + Bu_k$; value $V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u(x_i)^T R u(x_i)\right)$

Form quadratic basis vectors, e.g. for two states $\phi(x_k) = [x_{1k}^2,\ x_{1k}x_{2k},\ x_{2k}^2]^T$, and the regressor $\phi(x_k) - \phi(x_{k+1})$.

SVFB gain $L_j$, control $u_j(x_k) = -L_j x_k$.

Apply the control to the system and observe the data set $\{x_k,\ x_{k+1},\ x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)\}$.

Find new weights from
$W_{j+1}^T\left(\phi(x_k) - \phi(x_{k+1})\right) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)$
Use batch LS or RLS until convergence.

Unpack the weights $[p_{11},\ 2p_{12},\ p_{22}]$ into the Lyapunov solution $P_{j+1} = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$.

Then update the SVFB gain using
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
Need to know A AND B for the control update.

Solves the Lyapunov eq. without knowing A and B.
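A minimal sketch of this online policy-iteration loop is given below for a two-state plant. The A and B matrices are made-up and are used only to simulate the plant and to perform the gain update that, as noted above, still needs A and B; the least-squares value update uses only measured states and stage costs. Restarting the trajectory from random initial conditions plays the role of the "many trajectories over a compact set" mentioned earlier.

```python
import numpy as np

def phi(x):                       # quadratic basis for a 2-state system
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def unpack_P(w):                  # weights [p11, 2*p12, p22] -> symmetric P
    return np.array([[w[0], w[1]/2], [w[1]/2, w[2]]])

# Made-up system, used here only as the data-generating "plant"
A = np.array([[0.9, 0.2], [0.0, 0.8]]); B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.eye(1)
L = np.zeros((1, 2))              # initial stabilizing SVFB gain (A is open-loop stable)

rng = np.random.default_rng(0)
for j in range(10):               # policy iteration
    Phi, y = [], []
    x = rng.standard_normal(2)
    for k in range(200):          # collect data; periodic restarts give excitation
        u = -L @ x
        x_next = A @ x + B @ u
        r = float(x @ Q @ x + u @ R @ u)
        Phi.append(phi(x) - phi(x_next)); y.append(r)
        x = rng.standard_normal(2) if (k + 1) % 5 == 0 else x_next
    w, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)  # batch LS policy evaluation
    P = unpack_P(w)
    # Policy improvement (this step still uses A and B, as the slide notes)
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("P =\n", P, "\nL =", L)
```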

Page 31:

The Adaptive Critic Architecture (Adaptive Critics)

[Structure: an Action network applies the control policy $h_j(x_k)$ to the System; a Policy Evaluation (Critic) network observes the resulting cost.]

Value update: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$

Control policy update: $h_{j+1}(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V_{j+1}(x_{k+1})\right]$

Leads to ONLINE FORWARD-IN-TIME implementation of optimal control

Optimal Adaptive Control

Use RLS until convergence

Page 32:

Adaptive Control

[Plant with control input and measured output.]

Identify the controller - Direct Adaptive
Identify the system model - Indirect Adaptive
Identify the performance value $V(x) = W^T\phi(x)$ - Optimal Adaptive

Page 33:

A problem with DT Policy Iteration

Policy Evaluation: $V_j(x_k) = W_j^T\phi(x_k)$, with
$W_{j+1}^T\left(\phi(x_k) - \phi(x_{k+1})\right) = r(x_k, h_j(x_k))$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Since $x_{k+1}$ is measured, knowledge of f(x) or g(x) is not needed for the value function update.

Policy Improvement:
$u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}}$, and for LQR $h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$
Need to know $f(x_k)$ AND $g(x_k)$ for the control update.

Easy to fix - later!

Page 34:

Greedy Value Function Update - Approximate Dynamic Programming
ADP Method 1 - Heuristic Dynamic Programming (HDP), Paul Werbos

Policy Iteration (initial stabilizing control is needed):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$   (Lyapunov eq.)
$h_{j+1}(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V_{j+1}(x_{k+1})\right]$
For LQR, the underlying Riccati equation recursion is (Hewer 1971):
$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0$,  $L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$

Value Iteration (initial stabilizing control is NOT needed):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$   (simple recursion; two occurrences of the cost allow definition of the greedy update)
$h_{j+1}(x_k) = \arg\min_{u_k}\left[r(x_k, u_k) + \gamma V_{j+1}(x_{k+1})\right]$
For LQR, the underlying Riccati equation recursion is (Lancaster & Rodman proved convergence):
$P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j$,  $L_j = (R + B^T P_j B)^{-1} B^T P_j A$

Page 35:

Motivation for Value Iteration

PI Policy Evaluation Step (LE = Lyapunov equation):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$, i.e. for LQR $(A - BL)^T P_{j+1} (A - BL) - P_{j+1} + Q + L^T R L = 0$

VI Policy Evaluation Step (MR = matrix recursion):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$, i.e. for LQR $P_{i+1} = (A - BL)^T P_i (A - BL) + Q + L^T R L$

Theorem: Let the gain L be fixed and (A - BL) stable. Let $P_0 = 0$ in the MR. Then $P_i \to P_{j+1}$, the solution of the LE.

That is, repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step, IF THE CONTROL POLICY IS NOT UPDATED.

Draguna Vrabie-For CT systems

Idea of GPI

Needs stabilizing gain

Does not need stabilizing gain

Page 36:

Implementation - DT HDP (Value Iteration)

VFA: $V_j(x_k) = W_j^T\phi(x_k)$

Value function update for the given control:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T\phi(x_k) = r(x_k, h_j(x_k)) + \gamma W_j^T\phi(x_{k+1})$   (regression matrix; old weights on the right)

Solve for the weights using RLS, or many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, knowledge of f(x) or g(x) is not needed for the value function update.

Then update the control using
$h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k$
Need to know $f(x_k)$ AND $g(x_k)$ for the control update.

Page 37:

Value update: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$

Control update: $u_{j+1}(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V_{j+1}(x_{k+1})}{\partial x_{k+1}}$

1. Select control policy

2. Find associated cost

3. Improve control

Needs 10 lines of MATLAB code

Direct optimal adaptive control

Solves Lyapunov recursion without knowing dynamics

At each time step k: observe $x_k$, apply $u_k$, observe the cost $r_k$, observe $x_{k+1}$, and update V. Do this until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.

$W_{j+1}^T\phi(x_k) = r(x_k, h_j(x_k)) + \gamma W_j^T\phi(x_{k+1})$
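For comparison with the policy-iteration sketch earlier, here is a minimal sketch of this online HDP (value iteration) weight update for a made-up two-state LQR plant. The target uses the old critic weights, no initial stabilizing gain is required, and (as the slides note) the greedy gain update still uses A and B.

```python
import numpy as np

def phi(x):                                   # quadratic basis [x1^2, x1*x2, x2^2]
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

# Made-up plant used only to generate (x_k, x_{k+1}, r_k) data
A = np.array([[1.1, 0.3], [0.0, 0.8]]); B = np.array([[0.0], [0.1]])   # note: A is unstable
Q = np.eye(2); R = np.eye(1)

W = np.zeros(3)                               # V_0 = 0; no stabilizing gain needed
L = np.zeros((1, 2))
rng = np.random.default_rng(1)
for j in range(60):                           # HDP (value iteration)
    Phi, y = [], []
    for _ in range(50):                       # one-step transitions from random states
        x = rng.standard_normal(2)
        u = -L @ x
        x_next = A @ x + B @ u
        r = float(x @ Q @ x + u @ R @ u)
        Phi.append(phi(x))
        y.append(r + W @ phi(x_next))         # target uses the OLD critic weights
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    P = np.array([[W[0], W[1]/2], [W[1]/2, W[2]]])
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # greedy policy; uses A and B
print("P =\n", P, "\nL =", L)
```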

Page 38:

Four ADP Methods proposed by Paul Werbos

Heuristic dynamic programming

Dual heuristic programming

AD Heuristic dynamic programming

AD Dual heuristic programming

(Watkins Q Learning)

Critic NN used to approximate:
• HDP: the value $V(x_k)$
• DHP: the gradient $\partial V/\partial x$
• ADHDP (Watkins Q Learning): the Q function $Q(x_k, u_k)$
• ADDHP: the gradients $\partial Q/\partial x$, $\partial Q/\partial u$

Action NN to approximate the Control

Bertsekas- Neurodynamic Programming

Barto & Bradtke- Q-learning proof (Imposed a settling time)

Adaptive (Approximate) Dynamic Programming

Value Iteration

Page 39:

Page 40:

Discrete-time Nonlinear Approximate dynamic programming : Convergence Proof

• Problem Formulation

• requires solving the DT HJB

System: $x_{k+1} = f(x_k) + g(x_k)u_k$

Cost: $V(x_k) = \min_{u}\sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T R u_i\right)$

DT HJB:
$V^*(x_k) = \min_{u_k}\left(x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1})\right) = \min_{u_k}\left(x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k)u_k)\right)$

$u^*(x_k) = -\frac{1}{2} R^{-1} g(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$

Asma Al-Tamimi

Page 41:

Discrete-Time Nonlinear Heuristic Dynamic Programming: HDP (Value Iteration)

System dynamics: $x_{k+1} = f(x_k) + g(x_k)u(x_k)$

Value function recursion:
$V(x_k) = \sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T R u_i\right) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$

HDP (Value Iteration):
$u_i(x_k) = \arg\min_{u}\left(x_k^T Q x_k + u^T R u + V_i(x_{k+1})\right)$
$V_{i+1}(x_k) = x_k^T Q x_k + u_i^T R u_i + V_i(x_{k+1})$

Asma Al-Tamimi

Page 42:

Asma Al-Tamimi

Flavor of proofs

Proof of convergence of DT nonlinear HDP

Page 43:

Standard Neural Network VFA for On-Line Implementation

Critic NN for the value and action NN for the control (can use 2-layer NNs):
$\hat V_i(x_k, W_{Vi}) = W_{Vi}^T\phi(x_k)$,  $\hat u_i(x_k, W_{ui}) = W_{ui}^T\sigma(x_k)$

HDP:
$u_i(x_k) = \arg\min_{u}\left(x_k^T Q x_k + u^T R u + V_i(x_{k+1})\right)$
$V_{i+1}(x_k) = x_k^T Q x_k + u_i^T R u_i + V_i(x_{k+1})$

System: $x_{k+1} = f(x_k) + g(x_k)u(x_k)$

Define the target cost function
$d(\phi(x_k), W_{Vi}) = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + \hat V_i(x_{k+1}) = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + W_{Vi}^T\phi(x_{k+1})$

Implicit equation for the DT control - use gradient descent for the action update (backpropagation, P. Werbos):
$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\,\frac{\partial\left(x_k^T Q x_k + \hat u_i^{(j)T} R\, \hat u_i^{(j)} + \hat V_i(x_{k+1})\right)}{\partial W_{ui}^{(j)}} = W_{ui}^{(j)} - \alpha\,\sigma(x_k)\left(2R\,\hat u_i^{(j)}(x_k) + g(x_k)^T\nabla\phi(x_{k+1})^T W_{Vi}\right)^T$
where the action weights seek
$W_{ui} = \arg\min_{W}\left(x_k^T Q x_k + \hat u(x_k, W)^T R\, \hat u(x_k, W) + \hat V_i\!\left(f(x_k) + g(x_k)\hat u(x_k, W)\right)\right)$

Explicit equation for the cost - use LS for the Critic NN update:
$W_{V(i+1)} = \arg\min_{W}\int\left|\,W^T\phi(x_k) - d(\phi(x_k), W_{Vi})\,\right|^2 dx$
$W_{V(i+1)} = \left(\int\phi(x_k)\phi(x_k)^T dx\right)^{-1}\int\phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx$
or, recursively,
$W_{V(i+1)}^{(m+1)} = W_{V(i+1)}^{(m)} - \alpha\,\phi(x_k)\left(W_{V(i+1)}^{(m)T}\phi(x_k) - r(x_k, u_k) - W_{Vi}^T\phi(x_{k+1})\right)$

Page 44:

NN for the control action: $\hat u_i(x_k, W_{ui}) = W_{ui}^T\sigma(x_k)$

Implicit equation for the DT control - use gradient descent for the action update:
$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\,\sigma(x_k)\left(2R\,\hat u_i^{(j)}(x_k) + g(x_k)^T\nabla\phi(x_{k+1})^T W_{Vi}\right)^T$

Interesting fact for HDP for nonlinear systems:
Linear case: $h_{j+1}(x_k) = -L_{j+1}x_k = -(R + B^T P_j B)^{-1} B^T P_j A\, x_k$ - must know the system A and B matrices.
Nonlinear case: only g(.) is needed. The state internal dynamics f(xk) is NOT needed since:
1. a NN approximation for the action is used, and
2. xk+1 is measured in the training phase.
Information about A is stored in the NN.

Page 45:

Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof

• Simulation Example 1: linear system - aircraft longitudinal dynamics (unstable, two-input system)
• The HJB, i.e. ARE, solution:

A = [ 1.0722  0.0954  0      -0.0541 -0.0153
      4.1534  1.1175  0      -0.8000 -0.1010
      0.1359  0.0071  1.0     0.0039  0.0097
      0       0       0       0.1353  0
      0       0       0       0       0.1353 ]

B = [ -0.0453 -0.0175
      -1.0042 -0.1131
       0.0075  0.0134
       0.8647  0
       0       0.8647 ]

P = [ 55.8348  7.6670  16.0470 -4.6754 -0.7265
       7.6670  2.3168   1.4987 -0.8309 -0.1215
      16.0470  1.4987  25.3586 -0.6709  0.0464
      -4.6754 -0.8309  -0.6709  1.5394  0.0782
      -0.7265 -0.1215   0.0464  0.0782  1.0240 ]

L = [ -4.1136 -0.7170 -0.3847  0.5277  0.0707
      -0.6315 -0.1003  0.1236  0.0653  0.0798 ]

Page 46:

Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof

• Simulation: the cost function approximation and the policy approximation

Policy approximation: $\hat u_i = W_{ui}^T\sigma(x)$, with $\sigma(x) = [x_1\ x_2\ x_3\ x_4\ x_5]^T$ and
$W_u^T = \begin{bmatrix} w_{u11} & w_{u12} & w_{u13} & w_{u14} & w_{u15} \\ w_{u21} & w_{u22} & w_{u23} & w_{u24} & w_{u25} \end{bmatrix}$

Cost approximation: $\hat V_{i+1}(x, W_{V(i+1)}) = W_{V(i+1)}^T\phi(x)$, with the quadratic basis
$\phi(x) = [x_1^2,\ x_1x_2,\ x_1x_3,\ x_1x_4,\ x_1x_5,\ x_2^2,\ x_2x_3,\ x_2x_4,\ x_2x_5,\ x_3^2,\ x_3x_4,\ x_3x_5,\ x_4^2,\ x_4x_5,\ x_5^2]^T$
$W_V = [w_{V1}\ w_{V2}\ \cdots\ w_{V15}]^T$

Page 47:

Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof

• Simulation: convergence of the cost

Converged critic weights:
$W_V^T$ = [ 55.5411 15.2789 31.3032 -9.3255 -1.4536 2.3142 2.9234 -1.6594 -0.2430 24.8262 -1.3076 0.0920 1.5388 0.1564 1.0240 ]

The matrix P is recovered from the critic weights: each diagonal entry of P equals the corresponding squared-term weight, and each off-diagonal entry equals one half of the corresponding cross-term weight.

Actual ARE solution:
P = [ 55.8348  7.6670  16.0470 -4.6754 -0.7265
       7.6670  2.3168   1.4987 -0.8309 -0.1215
      16.0470  1.4987  25.3586 -0.6709  0.0464
      -4.6754 -0.8309  -0.6709  1.5394  0.0782
      -0.7265 -0.1215   0.0464  0.0782  1.0240 ]

Page 48:

Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof

• Simulation: convergence of the control policy

Converged action weights (the learned action weights approximate the negative of the optimal gain, $W_u^T \approx -L$):
$W_u$ = [ 4.1068  0.7164  0.3756 -0.5274 -0.0707
          0.6330  0.1005 -0.1216 -0.0653 -0.0798 ]

Actual optimal control gain:
L = [ -4.1136 -0.7170 -0.3847  0.5277  0.0707
      -0.6315 -0.1003  0.1236  0.0653  0.0798 ]

Note - in this example the internal dynamics matrix A is NOT needed; the Riccati equation is solved online without knowing the A matrix.

Page 49:

Issues with Nonlinear ADP - Selection of the NN Training Set

LS solution for the Critic NN update:
$W_{V(i+1)} = \left(\int\phi(x_k)\phi(x_k)^T dx\right)^{-1}\int\phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx$
Batch LS: the integral over a region of state space is approximated using a set of points.
[Figure: sample points spread over a region of the (x1, x2) state space.]

Recursive Least-Squares (RLS):
$W_{V(i+1)}^{(m+1)} = W_{V(i+1)}^{(m)} - \alpha\,\phi(x_k)\left(W_{V(i+1)}^{(m)T}\phi(x_k) - r(x_k, u_k) - W_{Vi}^T\phi(x_{k+1})\right)$
Take sample points along a single trajectory. PE allows local smooth solution of the Bellman eq.
[Figure: sample points along a single trajectory in the (x1, x2) state space over time.]

Set of points over a region vs. points along a trajectory: exploration vs. exploitation (optimal regulation). For linear systems these are the same under a PE condition.

Page 50:

DT HDP vs. Receding Horizon Optimal Control

Forward-in-time HDP:
$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$,  $P_0 = 0$

Backward-in-time optimization - RHC:
$P_{k} = A^T P_{k+1} A + Q - A^T P_{k+1} B (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A$,  with terminal weight $P_N$, a Control Lyapunov Function overbounding P

Page 51:

Adaptive Terminal Cost RHC - Hongwei Zhang, Dr. Jie Huang

System: $x_{k+1} = Ax_k + Bu_k$

Standard RHC:
$V(x_k) = \sum_{i=k}^{k+N-1}\left(x_i^T Q x_i + u_i^T R u_i\right) + x_{k+N}^T P_0\, x_{k+N}$
$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$, starting from $P_0$
$u_k^{RH} = -(R + B^T P_{N-1} B)^{-1} B^T P_{N-1} A\, x_k = -L_N x_k$
Requires $P_0$ to be a CLF that overbounds the optimal infinite-horizon cost, or a large N. $P_0$ is the same for each stage.

Our ATC RHC (the terminal cost is the final cost from the previous stage, updated from stage to stage over each horizon [k, k+N]):
$V(x_k) = \sum_{i=k}^{k+N-1}\left(x_i^T Q x_i + u_i^T R u_i\right) + x_{k+N}^T P_{kN}\, x_{k+N}$
$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A$, starting from $P_{kN}$

HWZ Theorem: under the usual observability and controllability assumptions, ATC RHC guarantees uniform ultimate exponential stability for ANY $P_0 > 0$. Moreover, the solution converges to the optimal infinite-horizon cost.

Page 52:

Adaptive Terminal Cost RHC

Let N = 1. Then
$V(x_k) = \sum_{i=k}^{k+N-1}\left(x_i^T Q x_i + u_i^T R u_i\right) + x_{k+N}^T P_{kN}\, x_{k+N}$
becomes
$V_{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1})$
i.e. value iteration, and the RHC control
$u_k^{RH} = \arg\min_{u}\left\{\sum_{i=k}^{k+N-1}\left(x_i^T Q x_i + u_i^T R u_i\right) + x_{k+N}^T P_{kN}\, x_{k+N}\right\} = -L_N x_k$
becomes
$u_k^{RH} = \arg\min_{u}\left\{x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1})\right\}$

So, for N = 1, ATC RHC can be implemented using Value Iteration without knowing the system A matrix.

Page 53:

Four ADP Methods proposed by Werbos

Heuristic dynamic programming

Dual heuristic programming

AD Heuristic dynamic programming

AD Dual heuristic programming

(Watkins Q Learning)

Critic NN used to approximate:
• HDP: the value $V(x_k)$
• DHP: the gradient $\partial V/\partial x$
• ADHDP (Watkins Q Learning): the Q function $Q(x_k, u_k)$
• ADDHP: the gradients $\partial Q/\partial x$, $\partial Q/\partial u$

Action NN to approximate the Control

Bertsekas- Neurodynamic Programming

Barto & Bradtke- Q-learning proof (Imposed a settling time)

Page 54:

Q Learning - Action Dependent ADP
Optimal Adaptive Control for completely unknown DT systems

Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$

Define the Q function ($u_k$ arbitrary, policy h(.) used after time k):
$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$,  $h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$

Page 55:

Q Function Definition

Specify a control policy $u_j = h(x_j)$, $j = k+1, k+2, \ldots$ (policy h(.) used after time k; $u_k$ arbitrary).

Define the Q function: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Bellman equation for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$

Optimal Q function: $Q^*(x_k, u_k) = r(x_k, u_k) + \gamma V^*(x_{k+1}) = r(x_k, u_k) + \gamma Q^*(x_{k+1}, h^*(x_{k+1}))$

Optimal control solution:
$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_{h(\cdot)} Q^*(x_k, h(x_k))$
$h^*(x_k) = \arg\min_{h(\cdot)} Q^*(x_k, h(x_k))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$,  $h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$

Page 56:

Q Function HDP – Action Dependent HDP

Bradtke & Barto (1994) proved convergence for LQR

Q function for any given control policy h(xk) satisfies the Bellman equation

Recursive solution

Pick stabilizing initial control policy

Find Q function

Update control

$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma Q_h(x_{k+1}, h(x_{k+1}))$

$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$

$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

Now f(xk,uk) not needed

Page 57:

Q Learning does not need to know f(xk) or g(xk)

$Q_h(x_k, u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1})$

For LQR, V is quadratic in x: $V(x) = W^T\phi(x) = x^T P x$, so
$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$

Q is quadratic in x and u:
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$

The control update is found by
$0 = \frac{\partial Q}{\partial u_k} = 2\left(B^T P A\, x_k + (B^T P B + R)u_k\right) = 2\left(H_{ux} x_k + H_{uu} u_k\right)$
so
$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k = -L x_k$

The control is found only from the Q function; A and B are not needed.

Page 58:

Implementation - DT Q Function Policy Iteration (QFA - Q Function Approximation), Bradtke and Barto

$Q(x, u) = W^T\phi(x, u)$   (for LQR, $\phi$ is the quadratic basis in $[x; u]$; now u is an input to the NN - Werbos' action-dependent NN)

The Q function update for the control $u_k = -L_j x_k$ is given by
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_{j+1}(x_{k+1}, -L_j x_{k+1})$

Assume measurements of $u_k$, $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T\left(\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})\right) = r(x_k, -L_j x_k)$   (regression matrix)

Solve for the weights using RLS or backpropagation. Since $x_{k+1}$ is measured in the training phase, knowledge of f(x) or g(x) is not needed for the value function update.

Page 59:

Q Policy Iteration (model-free policy iteration - Bradtke, Ydstie, Barto; stable initial control needed):
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_{j+1}(x_{k+1}, -L_j x_{k+1})$
$W_{j+1}^T\left(\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})\right) = r(x_k, -L_j x_k)$

Control policy update:
$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$, i.e. $u_k = -H_{uu}^{-1} H_{ux}\, x_k = -L_{j+1} x_k$

Greedy Q Function Update - Approximate Dynamic Programming
ADP Method 3: Q Learning - Action-Dependent Heuristic Dynamic Programming (ADHDP), Paul Werbos (model-free HDP; stable initial control NOT needed):
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + \gamma Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T\phi(x_k, u_k) = r(x_k, -L_j x_k) + W_j^T\phi(x_{k+1}, -L_j x_{k+1}) \equiv \mathrm{target}_{j+1}$

Update the weights by RLS or backpropagation.
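Below is a minimal sketch of such a model-free Q-function policy iteration for a made-up two-state LQR plant: the learner sees only (x_k, u_k, r_k, x_{k+1}) tuples, fits the Q kernel H by batch least squares over a quadratic basis in [x; u] with probing noise for persistence of excitation, and updates the gain from the H blocks alone.

```python
import numpy as np

def qphi(x, u):                     # quadratic basis in z = [x; u] (upper-triangular terms)
    z = np.concatenate([x, np.atleast_1d(u)])
    return np.array([z[i]*z[j] for i in range(3) for j in range(i, 3)])

def unpack_H(w):                    # weights -> symmetric 3x3 kernel H
    H = np.zeros((3, 3)); idx = 0
    for i in range(3):
        for j in range(i, 3):
            H[i, j] = H[j, i] = w[idx] if i == j else w[idx] / 2
            idx += 1
    return H

# The plant is simulated here only to produce data; A and B are never used by the learner.
A = np.array([[0.9, 0.2], [0.0, 0.8]]); B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.eye(1)

L = np.zeros((1, 2))                # initial stabilizing gain (plant is open-loop stable)
rng = np.random.default_rng(2)
for j in range(15):                 # Q-function policy iteration
    Phi, y = [], []
    for _ in range(100):
        x = rng.standard_normal(2)
        u = -L @ x + 0.1 * rng.standard_normal(1)           # probing noise for PE
        x_next = A @ x + B @ u                               # "measured" next state
        r = float(x @ Q @ x + u @ R @ u)
        Phi.append(qphi(x, u) - qphi(x_next, -L @ x_next))   # Q policy-iteration regressor
        y.append(r)
    w, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    H = unpack_H(w)
    L = np.linalg.solve(H[2:, 2:], H[2:, :2])                # L = Huu^{-1} Hux  (no A, B used)
print("L =", L)
```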

Page 60:

Direct OPTIMAL ADAPTIVE CONTROL

Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics

Model-free ADP

Works for Nonlinear Systems

Proofs? Robustness? Comparison with adaptive control methods?

Page 61:

Page 62:

Discrete-Time Zero-Sum Games

• Continuous-state and action space, discrete-time dynamical system
$x_{k+1} = Ax_k + Bu_k + Ew_k$,  $y_k = x_k$,  $z_k = Hx_k$,  with $x \in R^n$, $y \in R^p$, $u_k \in R^{m_1}$, $w_k \in R^{m_2}$

• with quadratic cost
$V(x_k) = \min_{u}\max_{w}\sum_{i=k}^{\infty}\left(x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i\right)$, where $Q = H^T H$

• The zero-sum game problem can be formulated as achieving the disturbance attenuation condition
$\sum_{k} \|z_k\|^2 \le \gamma^2 \sum_{k} \|w_k\|^2$ for all $w_k$.

• The goal is to find the optimal strategies (state feedback) $u^*(x) = Lx$ and $w^*(x) = Kx$.

Page 63:

• Using Bellman's optimality principle ("Dynamic Programming"), $V(x_k) = x_k^T P x_k$ with
$x_k^T P x_k = \min_{u_k}\max_{w_k}\left(x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + x_{k+1}^T P x_{k+1}\right)$

• The Game Algebraic Riccati Equation (GARE):
$P = A^T P A + Q - \begin{bmatrix} A^T P B & A^T P E \end{bmatrix} \begin{bmatrix} I + B^T P B & B^T P E \\ E^T P B & E^T P E - \gamma^2 I \end{bmatrix}^{-1} \begin{bmatrix} B^T P A \\ E^T P A \end{bmatrix}$

• The conditions for a saddle point are
$\gamma^2 I - E^T P E > 0$ and $I + B^T P B > 0$.

The optimal policies for control and disturbance are
$L = \left(I + B^T P B - B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P B\right)^{-1}\left(B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P A - B^T P A\right)$
$K = \left(E^T P E - \gamma^2 I - E^T P B (I + B^T P B)^{-1} B^T P E\right)^{-1}\left(E^T P B (I + B^T P B)^{-1} B^T P A - E^T P A\right)$

The system dynamics A, B, E are needed.

Page 64:

DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation

• An Approximate Dynamic Programming (ADP) scheme with the incremental optimization
$V_{i+1}(x_k) = \min_{u_k}\max_{w_k}\left(x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1})\right)$

which is equivalently written as
$V_{i+1}(x_k) = x_k^T Q x_k + u_i(x_k)^T u_i(x_k) - \gamma^2 w_i(x_k)^T w_i(x_k) + V_i(x_{k+1})$

Asma Al-Tamimi

Page 65:

Q learning for H-infinity Control

Linear quadratic case - V and Q are quadratic: $V(x_k) = x_k^T P x_k$, and
$Q(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V(x_{k+1}) = [x_k^T\ u_k^T\ w_k^T]\, H\, [x_k^T\ u_k^T\ w_k^T]^T$,
$H = \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}$

Q function update:
$Q_{i+1}(x_k, u_i(x_k), w_i(x_k)) = x_k^T Q x_k + u_i(x_k)^T u_i(x_k) - \gamma^2 w_i(x_k)^T w_i(x_k) + Q_i(x_{k+1}, u_i(x_{k+1}), w_i(x_{k+1}))$

Control action and disturbance updates, $u_i(x_k) = L_i x_k$ and $w_i(x_k) = K_i x_k$, with
$L_i = \left(H^i_{uu} - H^i_{uw}(H^i_{ww})^{-1}H^i_{wu}\right)^{-1}\left(H^i_{uw}(H^i_{ww})^{-1}H^i_{wx} - H^i_{ux}\right)$
$K_i = \left(H^i_{ww} - H^i_{wu}(H^i_{uu})^{-1}H^i_{uw}\right)^{-1}\left(H^i_{wu}(H^i_{uu})^{-1}H^i_{ux} - H^i_{wx}\right)$

A, B, E are NOT needed. (Asma Al-Tamimi)

Page 66:

H-infinity Game Q function - compare to the Q function for the H2 optimal control case:
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k)$
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + I \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$

Page 67:

A quadratic (Kronecker) basis set is used to allow on-line solution:
$\hat Q_i(z, h_i) = z^T H_i z = h_i^T \bar z$, where $z = [x^T\ u^T\ w^T]^T$ and $\bar z = (z_1^2,\ z_1z_2,\ \ldots,\ z_1z_q,\ z_2^2,\ z_2z_3,\ \ldots,\ z_q^2)$

Q function update:
$h_{i+1}^T \bar z(x_k) = x_k^T Q x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k) + h_i^T \bar z(x_{k+1})$
Solve for the "NN weights" - the elements of the kernel matrix H - using batch LS or online RLS.

Control and disturbance updates: $\hat u_i(x) = L_i x$, $\hat w_i(x) = K_i x$. Probing noise is injected to get persistence of excitation:
$\hat u_{ei}(x_k) = L_i x_k + n_{1k}$,  $\hat w_{ei}(x_k) = K_i x_k + n_{2k}$
Proof: the algorithm still converges to the exact result.

Asma Al-Tamimi

Page 68:

Asma Al-Tamimi

Page 69:

ADHDP Application for a Power System

• System description: the state is $x(t) = [\Delta f(t)\ \ \Delta P_g(t)\ \ \Delta X_g(t)\ \ \Delta F(t)]^T$ and the continuous-time model $\dot x = Ax + Bu$ is built from the plant, turbine, and governor time constants, with parameter ranges
$1/T_p \in [0.033, 0.1]$, $K_p/T_p \in [4, 12]$, $1/T_T \in [2.564, 4.762]$, $1/T_G \in [9.615, 17.857]$, $1/(R\,T_G) \in [3.081, 10.639]$, and integral gain $K_E$.

• The discrete-time model is obtained by applying a ZOH to the CT model.

Page 70:

ADHDP Application for a Power System

• The system state:
∆f - incremental frequency deviation (Hz)
∆Pg - incremental change in generator output (p.u. MW)
∆Xg - incremental change in governor position (p.u. MW)
∆F - incremental change in integral control
∆Pd - the load disturbance (p.u. MW)

• The system parameters:
TG - the governor time constant
TT - turbine time constant
TP - plant model time constant
Kp - plant model gain
R - speed regulation due to governor action
KE - integral control gain

Page 71:

ADHDP Application for a Power System

• ADHDP policy tuning

[Plots: convergence of the critic/kernel entries P11, P12, P13, P22, P23, P33, P34, P44 and of the control policy entries L11, L12, L13, L14 over time k (0 to 3000).]

Converges to

Optimal GARE solution and optimal game control

Page 72:

• Comparison

The ADHDP controller design The design from [1]

• The maximum frequency deviation when using the ADHDP controller is improved by 19.3% from the controller designed in [1]

[1] Wang, Y., R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems", IEE Proc.-C, Vol. 140, No. 1, 1993.

[Plots: state responses x1-x4 (frequency deviation, incremental change of generator output, incremental change of governor position, incremental change of integral control) over 0-20 s for the ADHDP controller design and for the design from [1]; the peak frequency deviation is about -0.2024 at t = 0.5 s for the ADHDP controller versus about -0.2507 at t = 0.58 s for the design from [1].]

ADHDP Application for Power system

Page 73:

Q learning for H-infinity control (game theory) using LMIs. Work by Dr. Jinhoon Kim, Chungbuk National Univ., Korea.

Page 74:

Page 75:

Page 76:

How can one do Policy Iteration for unknown continuous-time systems? What is Value Iteration for continuous-time systems? How can one do ADP for CT systems?

Discrete-Time Systems
$H(x_k, V_h, h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$
• Directly leads to temporal difference techniques
• The system dynamics does not occur
• Two occurrences of the value allow greedy value iteration methods

Continuous-Time Systems
$H\!\left(x, \frac{\partial V}{\partial x}, u\right) = r(x, u) + \left(\frac{\partial V}{\partial x}\right)^T \dot x = r(x, u) + \left(\frac{\partial V}{\partial x}\right)^T f(x, u)$
• How to define the temporal difference? The system dynamics DOES occur
• Only ONE occurrence of the value gradient
• Leads to off-line solutions if the system dynamics is known; hard to do on-line learning

Page 77:

Continuous-Time Optimal Control

System: $\dot x = f(x, u)$

Cost: $V(x(t)) = \int_t^{\infty} r(x, u)\, d\tau = \int_t^{\infty}\left(Q(x) + u^T R u\right) d\tau$

Hamiltonian (Bellman):
$0 = r(x, u) + \left(\frac{\partial V}{\partial x}\right)^T f(x, u) = H\!\left(x, \frac{\partial V}{\partial x}, u\right)$,  $V(0) = 0$

Optimal cost:
$0 = \min_{u(t)}\left[r(x, u) + \left(\frac{\partial V^*}{\partial x}\right)^T f(x, u)\right]$

Optimal control:
$h^*(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V^*}{\partial x}$

HJB equation:
$0 = Q(x) + \left(\frac{\partial V^*}{\partial x}\right)^T f(x) - \tfrac{1}{4}\left(\frac{\partial V^*}{\partial x}\right)^T g(x) R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$,  $V^*(0) = 0$

BIG PROBLEM: the system dynamics appears.
c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, where f(.) and g(.) do not appear.

Page 78:

CT Policy Iteration

Utility: $r(x, u) = Q(x) + u^T R u$

Cost for any given u(t): $0 = r(x, u) + \left(\frac{\partial V}{\partial x}\right)^T f(x, u) = H\!\left(x, \frac{\partial V}{\partial x}, u\right)$

Iterative solution - pick a stabilizing initial control, then:

Find the cost (Lyapunov equation):
$0 = r(x, h_k(x)) + \left(\frac{\partial V_k}{\partial x}\right)^T f(x, h_k(x))$,  $V_k(0) = 0$

Update the control:
$h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V_k}{\partial x}$

This avoids solving the HJB equation
$0 = Q(x) + \left(\frac{\partial V^*}{\partial x}\right)^T f(x) - \tfrac{1}{4}\left(\frac{\partial V^*}{\partial x}\right)^T g(x) R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

• Convergence proved by Saridis 1979 if the Lyapunov eq. is solved exactly
• Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov eq.
• Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence

Full system dynamics must be known; off-line solution.

Page 79:

LQR Policy iteration = Kleinman algorithm

1. For a given control policy solve for the cost:

2. Improve policy:

If started with a stabilizing control policy the matrix monotonically converges to the unique positive definite solution of the Riccati equation.

Every iteration step will return a stabilizing controller. The system has to be known.

$u = -L_k x$
$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k$,  $A_k = A - BL_k$   (Lyapunov eq.)
$L_{k+1} = R^{-1} B^T P_k$,  starting from a stabilizing $L_0$
(Kleinman 1968)
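A minimal sketch of Kleinman's iteration for a made-up CT system is shown below; the Lyapunov equation at each step is solved by Kronecker-product vectorization so that only numpy is required. This route is model-based: A and B must be known, in contrast to the data-based methods that follow.

```python
import numpy as np

def lyap_ct(Ac, Qc):
    """Solve Ac^T P + P Ac + Qc = 0 for symmetric P via vectorization."""
    n = Ac.shape[0]
    M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    P = np.linalg.solve(M, -Qc.flatten()).reshape(n, n)
    return (P + P.T) / 2

# Made-up CT system for illustration; Kleinman's algorithm needs A and B.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
C = np.eye(2); R = np.eye(1)

L = np.zeros((1, 2))                        # initial stabilizing gain (A is already Hurwitz)
for k in range(20):
    Ak = A - B @ L
    P = lyap_ct(Ak, C.T @ C + L.T @ R @ L)  # 0 = Ak^T P + P Ak + C^T C + L^T R L
    L_new = np.linalg.solve(R, B.T @ P)     # L_{k+1} = R^{-1} B^T P_k
    if np.max(np.abs(L_new - L)) < 1e-10:
        L = L_new
        break
    L = L_new
print("P =\n", P, "\nL =", L)
```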

Page 80:

Small Time-Step Approximate Tuning for Continuous-Time Adaptive Critics

Problems with
$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x, u) + r(x, u) = \dot V(x) + r(x, u) = H\!\left(x, \frac{\partial V}{\partial x}, u\right)$

Use Euler's forward approximation:
$H\!\left(x_t, \frac{\partial V}{\partial x}, u_t\right) = r(x_t, u_t) + \dot V \approx r(x_t, u_t) + \frac{V_{t+1} - V_t}{\Delta t}$

Baird's Advantage function:
$A_{\Delta t}(x_t, u_t) = r(x_t, u_t) + \frac{V^*(x_{t+\Delta t}) - V^*(x_t)}{\Delta t}$

This defines a temporal difference error. The system dynamics does NOT appear. It has TWO occurrences of the value V and so can be used to define greedy Value Iteration. NOTE that the UTILITY must also be discretized or it does not work.

Problems: it is only an approximation; a constant sampling period must be selected up front; physics-based CT nonlinearities are lost.

Page 81:

Solving for the Cost - Our Approach (Draguna Vrabie)

For a given policy $u = -Lx$, the cost
$V(x(t)) = \int_t^{\infty} r(x, u)\, d\tau$,  $V(0) = 0$,
satisfies
$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T))$

f(x) and g(x) do not appear. This has the SAME FORM as the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, where f(.) and g(.) do not appear.

LQR case:
$x(t)^T P x(t) = \int_t^{t+T}\left(x^T Q x + x^T L^T R L x\right) d\tau + x(t+T)^T P x(t+T)$

The optimal gain is $L = R^{-1} B^T P$.

Page 82:

Lemma 1 (Draguna Vrabie)

$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T))$,  $V(0) = 0$

is equivalent to

$0 = r(x, u) + \left(\frac{\partial V}{\partial x}\right)^T f(x, u) = H\!\left(x, \frac{\partial V}{\partial x}, u\right)$,  $V(0) = 0$,  i.e.  $\frac{d}{dt}V(x) = \left(\frac{\partial V}{\partial x}\right)^T f(x, u) = -r(x, u)$

Proof: $\int_t^{t+T} r(x, u)\, d\tau = -\int_t^{t+T} d\big(V(x)\big) = V(x(t)) - V(x(t+T))$.

Solves the Lyapunov equation without knowing A or B.

Page 83:

Lemma 1 - LQR case (Draguna Vrabie)

$x(t)^T P x(t) = \int_t^{t+T} x^T(Q + L^T R L)x\, d\tau + x(t+T)^T P x(t+T)$

is equivalent to

$0 = A_c^T P + P A_c + L^T R L + Q$,  with $A_c = A - BL$,  i.e.  $\frac{d}{dt}\left(x^T P x\right) = x^T(A_c^T P + P A_c)x = -x^T(L^T R L + Q)x$

Proof: $\int_t^{t+T} x^T(Q + L^T R L)x\, d\tau = -\int_t^{t+T} d\big(x^T P x\big) = x(t)^T P x(t) - x(t+T)^T P x(t+T)$.

Solves the Lyapunov equation without knowing A or B.

Define the CT temporal difference:
$e(t) = -x(t)^T P x(t) + \int_t^{t+T} x^T(Q + L^T R L)x\, d\tau + x(t+T)^T P x(t+T)$

Page 84:

Solving for the Cost - Our Approach: 1. Policy Iteration (Draguna Vrabie)

Policy evaluation (cost update):
$V_{k+1}(x(t)) = \int_t^{t+T} r(x, u_k)\, d\tau + V_{k+1}(x(t+T))$
For the LQR case, with $u_k(t) = -L_k x(t)$:
$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T(Q + L_k^T R L_k)x\, d\tau + x(t+T)^T P_{k+1} x(t+T)$

Policy improvement (control gain update):
$L_{k+1} = R^{-1} B^T P_{k+1}$

An initial stabilizing control is needed. A and B do not appear in the cost update; B is needed for the control update.

Solves the Lyapunov equation without knowing A or B.
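A minimal sketch of this integral-reinforcement policy iteration is given below, assuming a made-up two-state CT plant that is integrated with a crude Euler scheme only to generate the data (x(t), x(t+T), and the integral reinforcement); A never enters the learning equations, and only B is used in the gain update, matching the structure above.

```python
import numpy as np

def phi(x):                                   # quadratic basis [x1^2, x1*x2, x2^2]
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

# The plant is integrated here only to produce data; A never enters the learning equations.
A = np.array([[0.0, 1.0], [-1.0, -0.5]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1)
T, dt = 0.5, 1e-3                             # reinforcement interval and integration step

L = np.zeros((1, 2))                          # initial stabilizing gain (A is Hurwitz)
rng = np.random.default_rng(3)
for j in range(8):                            # integral-reinforcement policy iteration
    Phi, y = [], []
    for _ in range(30):
        x = rng.standard_normal(2)
        x0 = x.copy()
        rho = 0.0                             # integral reinforcement over [t, t+T]
        for _ in range(int(T / dt)):          # crude Euler simulation of xdot = Ax + Bu
            u = -L @ x
            rho += float(x @ Q @ x + u @ R @ u) * dt
            x = x + (A @ x + B @ u) * dt
        Phi.append(phi(x0) - phi(x))          # x(t)^T P x(t) - x(t+T)^T P x(t+T) = rho
        y.append(rho)
    w, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    P = np.array([[w[0], w[1]/2], [w[1]/2, w[2]]])
    L = np.linalg.solve(R, B.T @ P)           # policy improvement: only B is needed
print("P =\n", P, "\nL =", L)
```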

Page 85:

Direct Optimal Adaptive Controller (Draguna Vrabie)

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

[Structure: the Actor (gain K, applied through a ZOH with period T) drives the dynamic control system $\dot x = Ax + Bu$, $x(0) = x_0$; the Critic integrates the utility $\dot V = x^T Q x + u^T R u$ over each interval and runs RLS or batch LS to identify the value of the current control; the feedback gain is updated after the Critic has converged.]

The reinforcement interval T can be selected on line, on the fly.

Page 86:

Continuous-time control with discrete gain updates

Control: $u_k(t) = -L_k x(t)$

[Timeline: the gain $L_k$ (the policy) is held piecewise constant over reinforcement intervals k = 0, 1, 2, ...; the interval lengths T need not be the same and can be selected on-line in real time.]

Page 87:

Direct Optimal Adaptive Control for Partially Unknown CT Systems - 2. CT ADP Greedy Iteration (Value Iteration, Draguna Vrabie)

LQR case. Control policy: $u_k(t) = -L_k x(t)$

Cost update:
$V_{k+1}(x(t)) = \int_t^{t+T}\left(x^T Q x + u_k^T R u_k\right) d\tau + V_k(x(t+T))$, i.e.
$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T(Q + L_k^T R L_k)x\, d\tau + x(t+T)^T P_k x(t+T)$

Control gain update:
$L_{k+1} = R^{-1} B^T P_{k+1}$

No initial stabilizing control is needed. A and B do not appear in the cost update; B is needed for the control update.

Page 88:

Direct OPTIMAL ADAPTIVE CONTROL

Solve the Riccati Equation WITHOUT knowing the plant dynamics

Model-free ADP

Works for nonlinear systems - a neural network is used to approximate the cost

Robustness?Comparison with adaptive control methods?

Page 89:

Adaptive Control

[Plant with control input and measured output.]

Identify the Controller-Direct Adaptive

Identify the system model-Indirect Adaptive

Identify the performance value-Optimal Adaptive

Page 90: