TRANSCRIPT
Invited by Dr. Fouad Al-Sunni
Automation & Robotics Research Institute (ARRI)
The University of Texas at Arlington
F.L. Lewis & Draguna Vrabie
Moncrief-O'Donnell Endowed Chair
Head, Controls & Sensors Group
Talk available online at http://ARRI.uta.edu/acs
Adaptive Dynamic Programming (ADP) for Feedback Control Systems
Supported by: NSF (Paul Werbos) and ARO (Randy Zachery)
It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.
Man's task is to understand patterns in nature and society.
The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
Ibn Sina, 980-1037 (Avicenna)
Importance of Feedback Control
• Darwin: FB and natural selection
• Volterra: FB and fish population balance
• Adam Smith: FB and the international economy
• James Watt: FB and the steam engine
• FB and cell homeostasis
The resources available to most species for their survival are meager and limited
Nature uses Optimal control
Cell Homeostasis: The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design
R. Kalman, 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics:
$$\dot r = w,\qquad
\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi,\qquad
\dot v = -\frac{vw}{r} + \frac{F}{m}\cos\phi$$
Objectives:
• Get to orbit in minimum time
• Use minimum fuel
Adaptive Control is generally not Optimal
Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this!
Different methods of learning
[Block diagram: an adaptive/learning system inside an environment. The Critic compares the desired performance with the observed outputs and sends a reinforcement signal that tunes the Actor, which generates the control inputs to the System.]
Reinforcement learning - Ivan Pavlov, 1890s
Actor-Critic Learning
We want OPTIMAL performance: ADP - Approximate Dynamic Programming.
Markov Decision Process $(X, U, P, R)$
X = states, U = controls
Transition probabilities: $P^u_{xx'} = \Pr\{x' \mid x, u\}$ = probability of going to state $x'$ from state $x$ given that the control is $u$
Rewards: $R^u_{xx'}$ = reward on going to state $x'$ from state $x$ given that the control is $u$
Value Function
$$V^\pi(x) = E_\pi\Big\{ \sum_{i=k}^\infty \gamma^{\,i-k} r_i \;\Big|\; x_k = x \Big\}$$
Control Policy
A control policy is a mapping from states to control actions:
$\pi: X \times U \to [0,1]$, with $\pi(x,u) = \Pr\{u \mid x\}$ = probability of taking action $u$ while in state $x$.
The Bellman Equation
A consistency equation that holds at each state:
$$V^\pi(x) = E_\pi\Big\{ r_k + \gamma \sum_{i=k+1}^\infty \gamma^{\,i-(k+1)} r_i \;\Big|\; x_k = x \Big\}$$
$$V^\pi(x) = \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^\pi(x') \big]$$
Reinforcement Learning for Markov Decision Processes
Allows one to solve $n$ equations for the value given any policy, $n$ = number of states.
Optimal Value
$$V^*(x) = \min_\pi E_\pi\Big\{ \sum_{i=k}^\infty \gamma^{\,i-k} r_i \;\Big|\; x_k = x \Big\}$$
By Bellman's consistency equation,
$$V^*(x) = \min_\pi V^\pi(x) = \min_\pi \sum_u \pi(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
Bellman's optimality equation:
$$V^*(x) = \min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
This consists of $n$ equations that must be solved for $V^*(x)$, $n$ = number of states.
Optimal policy:
$$u^* = \arg\min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V^*(x') \big]$$
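The slides contain no code, so the following is a minimal sketch of my own (not the speakers') showing the Bellman optimality backup as a fixed-point iteration on a toy two-state, two-action MDP. All transition probabilities and costs here are invented for illustration; rewards are treated as costs, so the backup minimizes.

```python
import numpy as np

# P[u, x, x'] = transition probability; R[u, x, x'] = one-step cost (minimized)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[[1.0, 2.0], [2.0, 0.5]],
              [[2.0, 1.0], [0.5, 3.0]]])
gamma = 0.9
V = np.zeros(2)
for _ in range(500):
    # Q[u, x] = sum_x' P[u,x,x'] * (R[u,x,x'] + gamma * V[x'])
    Q = np.einsum('uxy,uxy->ux', P, R) + gamma * np.einsum('uxy,y->ux', P, V)
    V_new = Q.min(axis=0)          # Bellman optimality backup over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmin(axis=0)          # greedy (optimal) policy per state
print(V, policy)
```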
Policy Iteration
Start with an admissible policy, then alternate:
Policy evaluation
Policy improvement
Barto calls this Dynamic Programming. To avoid knowing the system model, use:
• Monte Carlo
• Temporal Difference
Policy Iteration with Temporal Difference
Start with an admissible policy. Perform policy evaluation ALONG A SAMPLE PATH - a stochastic approximation approach - and policy improvement along the sample path. This is an iterative solution algorithm for the optimal value and policy.
Policy evaluation:
$$V_j(x) = \sum_u \pi_j(x,u) \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V_j(x') \big]$$
Policy improvement:
$$\pi_{j+1}(x,u) = \arg\min_u \sum_{x'} P^u_{xx'}\big[ R^u_{xx'} + \gamma V_j(x') \big]$$
Along a sample path these become
$$V_j(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1}),$$
$$h_{j+1}(x_k) = \arg\min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V_j(x_{k+1}) \big).$$
The Temporal Difference error is
$$e_j(k) = -V_j(x_k) + r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1}).$$
Discrete-Time Optimal Control
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V_h(x_k) = \sum_{i=k}^\infty \gamma^{\,i-k}\, r(x_i, u_i)$
Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$
Value function recursion: $V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$, $V_h(0) = 0$
Control policy: $u_k = h(x_k)$ = the prescribed control input function
Example: $u_k = -K x_k$, linear state variable feedback
This is like the MDP quantities
$$P^u_{xx'} = \Pr\{x' \mid x, u\}, \qquad R^u_{xx'}, \qquad \pi(x,u) = \Pr\{u \mid x\},$$
with cost $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$.
Discrete-Time Optimal Control
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$, with e.g. $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$
Control policy: $u_k = h(x_k)$ = the prescribed control policy

BELLMAN EQUATION
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Hamiltonian:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
Optimal cost (Bellman's Principle):
$$V^*(x_k) = \min_h \big( r(x_k, h(x_k)) + V^*(x_{k+1}) \big) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Optimal control:
$$h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Dynamic programming: a backwards-in-time solution.
The Solution: Hamilton-Jacobi-Bellman Equation
System: $x_{k+1} = f(x_k) + g(x_k) u_k$
Cost: $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$
DT HJB equation:
$$V^*(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big)
           = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k) u_k) \big)$$
Minimize with respect to $u_k$:
$$0 = 2 R u_k + g(x_k)^T \frac{dV^*(x_{k+1})}{dx_{k+1}}
\quad\Longrightarrow\quad
u^*(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$$
The DT HJB equation is difficult to solve, and it contains the dynamics.
DT Optimal Control – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$
Cost: $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$
Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.
Bellman equation: $V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$, i.e.
$$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}
             = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$$
Stationarity condition - minimize with respect to $u_k$:
$$0 = 2 R u_k + 2 B^T P (A x_k + B u_k)
\quad\Longrightarrow\quad
u_k = -(R + B^T P B)^{-1} B^T P A\, x_k \equiv -L x_k$$
Substituting back,
$$0 = x_k^T \big( A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A \big) x_k$$
HJB = DT Riccati equation.
Bellman Equation – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$, cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$.
Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix $P$.
Bellman equation:
$$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$$
Assume a stabilizing fixed feedback policy $u_k = -K x_k$. Then the Bellman equation becomes a Lyapunov equation:
$$0 = x_k^T \big( (A - BK)^T P (A - BK) - P + Q + K^T R K \big) x_k$$
Then $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -K x_k$, and the improved gain is $L = (R + B^T P B)^{-1} B^T P A$.
DT Optimal Control – Linear Systems, Quadratic Cost (LQR)
System: $x_{k+1} = A x_k + B u_k$, cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)$, quadratic in the state: $V(x_k) = x_k^T P x_k$.
HJB = DT Riccati equation:
$$0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$$
Optimal control: $u_k = -L x_k$, with optimal cost $V^*(x_k) = x_k^T P x_k$.
This is an off-line solution; the dynamics must be known.
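To make the off-line nature of this solution concrete, here is a hedged numpy sketch (mine, not from the talk) that iterates the DT Riccati difference equation to a fixed point; note that A and B appear explicitly, which is exactly what the online methods below avoid. The example system is invented.

```python
import numpy as np

def dare_iterate(A, B, Q, R, iters=500):
    """Fixed-point iteration on the DT Riccati equation (offline; needs A, B)."""
    P = np.zeros_like(Q)
    for _ in range(iters):
        G = R + B.T @ P @ B
        P = A.T @ P @ A + Q - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)
    return P

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # illustrative system, not from the talk
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P = dare_iterate(A, B, Q, R)
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain, u_k = -L x_k
```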
Discrete-Time Optimal Adaptive Control
Cost: $V_h(x_k) = \sum_{i=k}^\infty r(x_i, u_i)$, control policy $u_k = h(x_k)$.
Bellman equation:
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Hamiltonian:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
Optimal cost (Bellman's Optimality Principle):
$$V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Optimal control:
$$h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)$$
Focus on these two equations, starting with
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \qquad V_h(0) = 0.$$
Discrete-Time Optimal Control - Solutions by the Computational Intelligence Community

Policy Evaluation (for any given current policy):
Theorem: Let $V_h(x_k)$ solve the Bellman equation $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$. Then
$$V_h(x_k) = \sum_{i=k}^\infty r(x_i, h(x_i)),$$
i.e. it gives the value for any prescribed control policy. The policy must be stabilizing. The dynamics does not appear.

Policy Improvement (for a given policy $h(\cdot)$):
What about
$$h'(x_k) = \arg\min_{u} \big( r(x_k, u) + V_h(x_{k+1}) \big)\,?$$
Theorem (Bertsekas): Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then
$$V_{h'}(x_k) \le V_h(x_k).$$
This is the one-step improvement property of Rollout Algorithms. Compare Bellman's result: $h^*(x_k) = \arg\min_{u_k} ( r(x_k, u_k) + V^*(x_{k+1}) )$.
DT Policy Iteration
The cost of any given control policy $h(x_k)$ satisfies the recursion (Bellman equation: a recursive form, a consistency equation)
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}).$$
Recursive solution - Actor/Critic structure. Pick a stabilizing initial control, then iterate:
Policy evaluation:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Policy improvement:
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
Howard (1960) proved convergence for MDP. Note that $f(\cdot)$ and $g(\cdot)$ do not appear.
e.g. Control policy = SVFB: $h(x_k) = -L x_k$.
Temporal difference:
$$e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
For the quadratic cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$ these steps read:
Policy evaluation:
$$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k) + V_{j+1}(x_{k+1})$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$$
Hewer proved convergence in 1971 (via the DT Lyapunov equation).
DT Policy Iteration – Linear Systems, Quadratic Cost (LQR)
For any stabilizing policy $u_k = -L x_k$, the LQR value is quadratic: $V(x) = x^T P x$.
DT policy iterations:
$$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0 \qquad \text{(DT Lyapunov eq.)}$$
$$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$$
Equivalent to an underlying problem - DT LQR with closed loop
$$x_{k+1} = A x_k + B u_k = (A - BL) x_k.$$
Implemented online (next slides), policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.
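Before the data-driven version, here is a hedged sketch of the model-based form of this iteration (Hewer's algorithm), using scipy's discrete Lyapunov solver for the policy-evaluation step. The function and its arguments are illustrative; L0 must be stabilizing.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hewer(A, B, Q, R, L0, iters=20):
    """Model-based DT policy iteration (Hewer 1971); L0 must stabilize A - B L0."""
    L = L0
    for _ in range(iters):
        Ac = A - B @ L
        # Policy evaluation: Ac' P Ac - P + Q + L' R L = 0  (DT Lyapunov equation)
        P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
        # Policy improvement
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```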
DT Policy Iteration – How to implement online? Linear Systems, Quadratic Cost (LQR)
Policy evaluation:
$$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k) + V_{j+1}(x_{k+1}),$$
i.e.
$$x_k^T P_{j+1} x_k = x_k^T Q x_k + u_j^T R u_j + x_{k+1}^T P_{j+1} x_{k+1}.$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}$$
How to implement this online?
This solves the Lyapunov equation without knowing A and B.
For the system $x_{k+1} = A x_k + B u_k$, the LQR cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$ is quadratic, $V(x_k) = x_k^T P x_k$ for some matrix $P$.
Write the quadratic form in terms of a quadratic basis set; e.g. for two states,
$$x_k^T P x_k
= p_{11} x_{1k}^2 + 2 p_{12} x_{1k} x_{2k} + p_{22} x_{2k}^2
= \begin{bmatrix} p_{11} & 2p_{12} & p_{22} \end{bmatrix}
  \begin{bmatrix} x_{1k}^2 \\ x_{1k} x_{2k} \\ x_{2k}^2 \end{bmatrix}
\equiv W^T \varphi(x_k).$$
Then the policy-evaluation step becomes a regression equation in the unknown weights:
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k).$$
Then update the control using
$$h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k.$$
Need to know A AND B for the control update.
Implementation - DT Policy Iteration, Nonlinear Case
Value Function Approximation (VFA):
$$V(x) = W^T \varphi(x)$$
with weights $W$ and basis functions $\varphi(x)$.
LQR case: $V(x)$ is quadratic, $V(x) = x^T P x = W^T \varphi(x)$ with quadratic basis functions and $W^T = [\,p_{11}\;\; 2p_{12}\;\; p_{22}\;\cdots\,]$.
Nonlinear system case: use a Neural Network.
Value function update for a given control:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k)),$$
where $\varphi(x_k) - \varphi(x_{k+1})$ acts as the regression matrix. Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set.
Then update the control; $g(x_k)$ is needed for the control update.
Since $x_{k+1}$ is measured, we do NOT need knowledge of $f(x)$ or $g(x)$ for the value function update.
Implementation - DT Policy Iteration
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$.
This is indirect adaptive control with identification of the optimal value.
Value update:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Control update:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}}
              = -\tfrac{1}{2} R^{-1} g(x_k)^T \nabla\varphi(x_{k+1})^T W_{j+1}$$
1. Select a control policy.
2. Find the associated cost.
3. Improve the control.
Needs about 10 lines of MATLAB code.
This is direct optimal adaptive control: it solves the Lyapunov equation without knowing the dynamics.
[Timeline, interval k to k+1: apply $u_k$, observe $x_k$, the cost $r_k$, and $x_{k+1}$; update V; repeat until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.]
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k))$$
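A rough Python analogue of the "10 lines of MATLAB" the slide alludes to (my sketch, under stated assumptions): one policy-evaluation pass by batch least squares over a quadratic basis for a two-state LQR problem. A and B appear only inside the simulation that produces the "measured" $x_{k+1}$, never in the learning law; L is assumed stabilizing, and phi/evaluate_policy are hypothetical names.

```python
import numpy as np

def phi(x):                      # quadratic basis for 2 states
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def evaluate_policy(L, A, B, Q, R, n_samples=30):
    """Batch-LS policy evaluation from observed data only (L assumed stabilizing)."""
    Phi, y = [], []
    for _ in range(n_samples):
        x = np.random.randn(2)            # sampled state (exploration)
        u = -L @ x
        x1 = A @ x + B @ u                # "observe" x_{k+1} from the plant
        Phi.append(phi(x) - phi(x1))      # regression matrix row
        y.append(x @ Q @ x + u @ R @ u)   # observed one-step cost
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    return W                              # W = [p11, 2*p12, p22]
```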
DT Policy Iteration – For Linear Systems, Quadratic Cost (LQR)
Solves the Lyapunov equation without knowing A and B, for $x_{k+1} = A x_k + B u_k$ and cost $V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u(x_i)^T R u(x_i) \big)$.
Form the quadratic basis vectors, e.g. for two states
$$\varphi(x_k) = \begin{bmatrix} x_{1k}^2 & x_{1k} x_{2k} & x_{2k}^2 \end{bmatrix}^T.$$
With SVFB gain $L_j$ and control $u_j(x_k) = -L_j x_k$:
• Apply the control to the system and observe the data set $\{x_k,\; x_{k+1},\; x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k)\}$.
• Find the new weights from
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = x_k^T Q x_k + u_j(x_k)^T R\, u_j(x_k),$$
using batch LS or RLS until convergence.
• Unpack the weights into the Lyapunov solution
$$P_{j+1} = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}.$$
• Then update the SVFB gain using
$$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
Need to know A AND B for the control update.
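Continuing the sketch above: a hedged snippet for the unpack-and-improve step, the one place where A and B are needed. Both helper names are illustrative.

```python
import numpy as np

def unpack_P(W):
    """Unpack W = [p11, 2*p12, p22] into the symmetric Lyapunov solution P."""
    p11, p12x2, p22 = W
    return np.array([[p11, p12x2 / 2], [p12x2 / 2, p22]])

def improve_policy(W, A, B, R):
    """SVFB gain update L_{j+1} = (R + B'PB)^{-1} B'PA; needs the model."""
    P = unpack_P(W)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```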
The Adaptive Critic Architecture
[Block diagram: an Action network applying the control policy $h_j(x_k)$ to the System, and a Policy Evaluation (Critic network) observing the resulting cost.]
Value update (critic) - use RLS until convergence:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
Control policy update (actor):
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
Adaptive Critics lead to an ONLINE FORWARD-IN-TIME implementation of optimal control: Optimal Adaptive Control.
Adaptive Control
[Plant with control input and output.]
• Identify the controller - Direct Adaptive
• Identify the system model - Indirect Adaptive
• Identify the performance value - Optimal Adaptive
A Problem with DT Policy Iteration
With VFA $V_j(x_k) = W_j^T \varphi(x_k)$, assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$.
Policy evaluation: since $x_{k+1}$ is measured, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update
$$W_{j+1}^T \big( \varphi(x_k) - \varphi(x_{k+1}) \big) = r(x_k, h_j(x_k)).$$
Policy improvement:
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}},
\qquad\text{LQR: } h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A\, x_k.$$
We need to know $f(x_k)$ AND $g(x_k)$ for the control update. Easy to fix - later!
Greedy Value Function Update - Approximate Dynamic Programming
ADP Method 1: Heuristic Dynamic Programming (HDP) - Paul Werbos

Policy Iteration:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$$
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
For LQR the underlying recursion is a Lyapunov equation (Hewer 1971):
$$(A - BL_j)^T P_{j+1} (A - BL_j) - P_{j+1} + Q + L_j^T R L_j = 0, \qquad
L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
An initial stabilizing control is needed.

Value Iteration (greedy update):
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$$
$$h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)$$
For LQR the underlying recursion is a simple recursion (Lancaster & Rodman proved convergence):
$$P_{j+1} = (A - BL_j)^T P_j (A - BL_j) + Q + L_j^T R L_j, \qquad
L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.$$
The two occurrences of the cost allow the definition of a greedy update. An initial stabilizing control is NOT needed.
Motivation for Value Iteration
PI policy evaluation step (LE = Lyapunov equation):
$$(A - BL)^T P_{j+1} (A - BL) - P_{j+1} + Q + L^T R L = 0
\qquad\Longleftrightarrow\qquad
V_{j+1}(x_k) = r(x_k, h(x_k)) + V_{j+1}(x_{k+1})$$
VI policy evaluation step (MR = matrix recursion):
$$P_{i+1} = (A - BL)^T P_i (A - BL) + Q + L^T R L
\qquad\Longleftrightarrow\qquad
V_{i+1}(x_k) = r(x_k, h(x_k)) + V_i(x_{k+1})$$
Theorem: Let the gain $L$ be fixed with $(A - BL)$ stable, and let $P_0 = 0$ in the MR. Then
$$\lim_{i\to\infty} P_i = P_{j+1},$$
i.e. repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step, IF THE CONTROL POLICY IS NOT UPDATED.
This is the idea of GPI (Generalized Policy Iteration); Draguna Vrabie developed it for CT systems. PI needs a stabilizing gain; VI does not need a stabilizing gain.
Implementation - DT HDP (Value Iteration)
VFA: $V_j(x_k) = W_j^T \varphi(x_k)$.
Value function update for a given control:
$$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$$
Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$$W_{j+1}^T \varphi(x_k) = r(x_k, h_j(x_k)) + W_j^T \varphi(x_{k+1}),$$
with regression matrix $\varphi(x_k)$ and the OLD weights $W_j$ on the right-hand side. Solve for the weights using RLS, or using many trajectories with different initial conditions over a compact set. Since $x_{k+1}$ is measured, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update.
Then update the control using
$$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV_{j+1}(x_{k+1})}{dx_{k+1}};$$
we need to know $f(x_k)$ AND $g(x_k)$ for the control update.
1. Select a control policy.
2. Find the associated cost.
3. Improve the control.
Needs about 10 lines of MATLAB code.
This is direct optimal adaptive control: it solves the Lyapunov recursion without knowing the dynamics.
[Timeline, interval k to k+1: apply $u_k$, observe $x_k$, the cost $r_k$, and $x_{k+1}$; update V; repeat until convergence to $V_{j+1}$, then update the control to $u_{j+1}$.]
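A hedged sketch of one HDP (value-iteration) weight update for the LQR case, mirroring the equation above: the target uses the OLD weights $W_j$ on the right-hand side, and the regression is on $\varphi(x_k)$ alone, so no stabilizing initial gain is required. Names and the sampling scheme are illustrative.

```python
import numpy as np

def hdp_step(W_old, L, A, B, Q, R, phi, n_samples=30):
    """One greedy (value-iteration) critic update by batch least squares."""
    Phi, y = [], []
    for _ in range(n_samples):
        x = np.random.randn(A.shape[0])
        u = -L @ x
        x1 = A @ x + B @ u                                  # measured next state
        Phi.append(phi(x))                                  # regression matrix
        y.append(x @ Q @ x + u @ R @ u + W_old @ phi(x1))   # greedy target
    W_new, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    return W_new
```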
Four ADP Methods proposed by Paul Werbos
The Critic NN approximates:
• Heuristic Dynamic Programming (HDP): the value $V(x_k)$
• Dual Heuristic Programming (DHP): the gradient $\partial V / \partial x$
• AD Heuristic Dynamic Programming (ADHDP; Watkins' Q Learning): the Q function $Q(x_k, u_k)$
• AD Dual Heuristic Programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An Action NN approximates the control.
cf. Bertsekas: Neurodynamic Programming. Barto & Bradtke: Q-learning proof (imposed a settling time).

Adaptive (Approximate) Dynamic Programming - Value Iteration
Discrete-Time Nonlinear Approximate Dynamic Programming: Convergence Proof
• Problem formulation: for the system $x_{k+1} = f(x_k) + g(x_k) u_k$ with value
$$V(x_k) = \min_{u} \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big),$$
optimal control requires solving the DT HJB
$$V^*(x_k) = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(x_{k+1}) \big)
           = \min_{u_k} \big( x_k^T Q x_k + u_k^T R u_k + V^*(f(x_k) + g(x_k) u_k) \big),$$
with
$$u^*(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \frac{dV^*(x_{k+1})}{dx_{k+1}}.$$
Asma Al-Tamimi - Discrete-Time Nonlinear Heuristic Dynamic Programming: HDP (Value Iteration)
System dynamics: $x_{k+1} = f(x_k) + g(x_k)\, u(x_k)$
Value function:
$$V(x_k) = \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T R u_i \big)
         = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$$
HDP (value iteration) recursion:
$$u_i(x_k) = \arg\min_{u} \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)$$
$$V_{i+1}(x_k) = x_k^T Q x_k + u_i(x_k)^T R\, u_i(x_k) + V_i(x_{k+1})$$
Flavor of the proofs: proof of convergence of DT nonlinear HDP.

Standard Neural Network VFA for On-Line Implementation - HDP
NN for the value (critic) and NN for the control action:
$$\hat V_i(x_k, W_{Vi}) = W_{Vi}^T \varphi(x_k), \qquad
  \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$$
Define the target cost function
$$d(\varphi(x_k), W_{Vi}) = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + \hat V_i(x_{k+1})
                          = x_k^T Q x_k + \hat u_i(x_k)^T R\, \hat u_i(x_k) + W_{Vi}^T \varphi(x_{k+1})$$
Backpropagation (P. Werbos):
The equation for the DT control is implicit; use gradient descent for the action update,
$$W_{ui}^{(j+1)} = W_{ui}^{(j)} - \alpha\,
\frac{\partial \big( x_k^T Q x_k + \hat u_i^{(j)T} R\, \hat u_i^{(j)} + \hat V_i(x_{k+1}) \big)}{\partial W_{ui}^{(j)}}
= W_{ui}^{(j)} - \alpha\, \sigma(x_k) \big( 2 R\, \hat u_i^{(j)}(x_k) + g(x_k)^T \nabla\varphi(x_{k+1})^T W_{Vi} \big)^T,$$
which implements
$$W_{ui} = \arg\min_{W} \big( x_k^T Q x_k + \hat u(x_k, W)^T R\, \hat u(x_k, W) + \hat V_i\big( f(x_k) + g(x_k)\, \hat u(x_k, W) \big) \big)$$
(can use a 2-layer NN).
The equation for the cost is explicit; use LS for the critic NN update:
$$W_{Vi+1} = \arg\min_{W_{Vi+1}} \int \big| W_{Vi+1}^T \varphi(x_k) - d(\varphi(x_k), W_{Vi}) \big|^2 dx,$$
$$W_{Vi+1} = \Big( \int \varphi(x_k)\, \varphi(x_k)^T dx \Big)^{-1} \int \varphi(x_k)\, d(\varphi(x_k), W_{Vi}, W_{ui})\, dx,$$
or, recursively,
$$W_{Vi+1}(m+1) = W_{Vi+1}(m) - \alpha\, \varphi(x_k) \big( W_{Vi+1}^T(m)\, \varphi(x_k) - r(x_k, u_k) - W_{Vi}^T \varphi(x_{k+1}) \big).$$
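As a loose sketch of the actor gradient-descent rule above (mine, under stated assumptions): sigma, grad_phi and g are user-supplied callables, x1 is the measured next state, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def actor_gradient_step(W_u, W_V, x, x1, g, R, sigma, grad_phi, lr=0.01):
    """One gradient-descent step on the action weights.
    W_u: (n_sigma, m) action weights; W_V: (n_phi,) critic weights;
    grad_phi(x1): (n_phi, n) Jacobian d(phi)/dx evaluated at x_{k+1}."""
    u = W_u.T @ sigma(x)                                   # current action
    # Chain rule on the Bellman target u'Ru + V_hat(x_{k+1}):
    grad = np.outer(sigma(x), 2 * R @ u + g(x).T @ grad_phi(x1).T @ W_V)
    return W_u - lr * grad
```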
NN for the control action: $\hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)$, updated by the gradient-descent rule above, for the system $x_{k+1} = f(x_k) + g(x_k)\, u(x_k)$.
Note that the internal state dynamics $f(x_k)$ is NOT needed, since:
1. an NN approximation for the action is used, and
2. $x_{k+1}$ is measured in the training phase.

Interesting fact for HDP for nonlinear systems: in the linear case
$$h_{j+1}(x_k) = -L_{j+1} x_k = -(I + B^T P_j B)^{-1} B^T P_j A\, x_k,$$
so one must know the system A and B matrices. In the nonlinear NN case only $g(\cdot)$ is needed; information about A (i.e. about $f$) is stored in the NN.
Discrete-time nonlinear HJB solution using Approximate Dynamic Programming: Convergence Proof
• Simulation Example 1: Linear system - aircraft longitudinal dynamics (an unstable, two-input system)
$$A = \begin{bmatrix}
1.0722 & 0.0954 & 0 & -0.0541 & -0.0153 \\
4.1534 & 1.1175 & 0 & -0.8000 & -0.1010 \\
0.1359 & 0.0071 & 1.0 & 0.0039 & 0.0097 \\
0 & 0 & 0 & 0.1353 & 0 \\
0 & 0 & 0 & 0 & 0.1353
\end{bmatrix},
\qquad
B = \begin{bmatrix}
-0.0453 & -0.0175 \\
-1.0042 & -0.1131 \\
0.0075 & 0.0134 \\
0.8647 & 0 \\
0 & 0.8647
\end{bmatrix}$$
• The HJB, i.e. ARE, solution:
$$P = \begin{bmatrix}
55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\
7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\
16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\
-4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\
-0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240
\end{bmatrix},
\qquad
L = \begin{bmatrix}
-4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\
-0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798
\end{bmatrix}$$
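The quoted ARE solution can be reproduced offline as a check (this is not part of the online algorithm). The sketch below assumes the weights Q = I, R = I, which appears consistent with the quoted P but is my assumption, not stated on the slide.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0722, 0.0954, 0, -0.0541, -0.0153],
              [4.1534, 1.1175, 0, -0.8000, -0.1010],
              [0.1359, 0.0071, 1.0, 0.0039, 0.0097],
              [0, 0, 0, 0.1353, 0],
              [0, 0, 0, 0, 0.1353]])
B = np.array([[-0.0453, -0.0175],
              [-1.0042, -0.1131],
              [0.0075, 0.0134],
              [0.8647, 0],
              [0, 0.8647]])
P = solve_discrete_are(A, B, np.eye(5), np.eye(2))          # assumed Q = R = I
L = np.linalg.solve(np.eye(2) + B.T @ P @ B, B.T @ P @ A)   # u = -L x
```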
• Simulation: the cost function and policy approximation
Policy approximation:
$$\hat u_i(x_k) = W_{ui}^T \sigma(x_k), \qquad
\sigma(x)^T = [\,x_1\;\; x_2\;\; x_3\;\; x_4\;\; x_5\,], \qquad
W_u^T = \begin{bmatrix} w_{u11} & w_{u12} & w_{u13} & w_{u14} & w_{u15} \\
                        w_{u21} & w_{u22} & w_{u23} & w_{u24} & w_{u25} \end{bmatrix}$$
Cost function approximation:
$$\hat V_{i+1}(x_k) = W_{Vi+1}^T \varphi(x_k)$$
with quadratic basis
$$\varphi(x)^T = \big( x_1^2,\; x_1 x_2,\; x_1 x_3,\; x_1 x_4,\; x_1 x_5,\; x_2^2,\; x_2 x_3,\; x_2 x_4,\; x_2 x_5,\; x_3^2,\; x_3 x_4,\; x_3 x_5,\; x_4^2,\; x_4 x_5,\; x_5^2 \big)$$
$$W_V^T = [\,w_{V1}\;\; w_{V2}\;\; \cdots\;\; w_{V15}\,]$$
• Simulation: convergence of the cost
The critic weights converge to
$$W_V^T = [\,55.5411\;\; 15.2789\;\; 31.3032\;\; {-9.3255}\;\; {-1.4536}\;\; 2.3142\;\; 2.9234\;\; {-1.6594}\;\; {-0.2430}\;\; 24.8262\;\; {-1.3076}\;\; 0.0920\;\; 1.5388\;\; 0.1564\;\; 1.0240\,]$$
Unpacking the weights into the kernel matrix (diagonal entries $P_{ii} = w$, off-diagonal entries $P_{ij} = 0.5\,w$):
$$P = \begin{bmatrix}
w_{V1} & 0.5 w_{V2} & 0.5 w_{V3} & 0.5 w_{V4} & 0.5 w_{V5} \\
0.5 w_{V2} & w_{V6} & 0.5 w_{V7} & 0.5 w_{V8} & 0.5 w_{V9} \\
0.5 w_{V3} & 0.5 w_{V7} & w_{V10} & 0.5 w_{V11} & 0.5 w_{V12} \\
0.5 w_{V4} & 0.5 w_{V8} & 0.5 w_{V11} & w_{V13} & 0.5 w_{V14} \\
0.5 w_{V5} & 0.5 w_{V9} & 0.5 w_{V12} & 0.5 w_{V14} & w_{V15}
\end{bmatrix}$$
Actual ARE solution:
$$P = \begin{bmatrix}
55.8348 & 7.6670 & 16.0470 & -4.6754 & -0.7265 \\
7.6670 & 2.3168 & 1.4987 & -0.8309 & -0.1215 \\
16.0470 & 1.4987 & 25.3586 & -0.6709 & 0.0464 \\
-4.6754 & -0.8309 & -0.6709 & 1.5394 & 0.0782 \\
-0.7265 & -0.1215 & 0.0464 & 0.0782 & 1.0240
\end{bmatrix}$$
• Simulation: convergence of the control policy
The action weights converge to
$$W_u = \begin{bmatrix}
4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\
0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798
\end{bmatrix},
\qquad L_{ij} = -w_{uij},$$
versus the actual optimal control
$$L = \begin{bmatrix}
-4.1136 & -0.7170 & -0.3847 & 0.5277 & 0.0707 \\
-0.6315 & -0.1003 & 0.1236 & 0.0653 & 0.0798
\end{bmatrix}.$$
Note: in this example the internal dynamics matrix A is NOT needed - the Riccati equation is solved online without knowing the A matrix. (Batch LS was used for the critic NN update.)
Issues with Nonlinear ADP
The LS solution for the critic NN update
$$W_{Vi+1} = \Big( \int \varphi(x_k)\, \varphi(x_k)^T dx \Big)^{-1} \int \varphi(x_k)\, d(\varphi(x_k), W_{Vi}, W_{ui})\, dx$$
involves an integral over a region of state space; approximate it using a set of sample points.

Selection of the NN Training Set
A set of points over a region vs. points along a trajectory: this is exploitation (optimal regulation) vs. exploration. For linear systems these are the same under a persistence of excitation (PE) condition.
Alternatively, take sample points along a single trajectory and use Recursive Least-Squares (RLS):
$$W_{Vi+1}(m+1) = W_{Vi+1}(m) - \alpha\, \varphi(x_k) \big( W_{Vi+1}^T(m)\, \varphi(x_k) - r(x_k, u_k) - W_{Vi}^T \varphi(x_{k+1}) \big)$$
PE allows a local smooth solution of the Bellman equation.
DT HDP vs. Receding Horizon Optimal Control
Forward-in-time HDP:
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad P_0 = 0$$
Backward-in-time optimization (RHC):
$$P_{k-1} = A^T P_k A + Q - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A, \qquad P_N \text{ given},$$
with horizon cost
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_0\, x_{k+N}$$
for the system $x_{k+1} = A x_k + B u_k$, where the terminal weight $P_0$ is a Control Lyapunov Function overbounding the optimal cost.
Adaptive Terminal Cost RHC (Hongwei Zhang, Dr. Jie Huang)
Standard RHC:
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad \text{initialized at } P_0,$$
$$u_k^{RH} = -(R + B^T P_{N-1} B)^{-1} B^T P_{N-1} A\, x_k = -L_N^{RH} x_k.$$
This requires $P_0$ to be a CLF that overbounds the optimal infinite-horizon cost, or a large $N$; $P_0$ is the same for each stage.
Our ATC RHC instead uses, over each stage $[k,\, k+N]$,
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N},$$
$$P_{i+1} = A^T P_i A + Q - A^T P_i B (R + B^T P_i B)^{-1} B^T P_i A, \qquad \text{initialized at } P_{kN},$$
where $P_{kN}$ is the final cost carried over from the previous stage.
HWZ Theorem: under the usual observability and controllability assumptions, ATC RHC guarantees uniform ultimate exponential stability for ANY $P_0 > 0$. Moreover, the solution converges to the optimal infinite-horizon cost.
Adaptive Terminal Cost RHC: let N = 1. Then
$$V(x_k) = \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N}$$
becomes
$$V_{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1}),$$
i.e. value iteration, and
$$u_k^{RH} = \arg\min_{u} \Big\{ \sum_{i=k}^{k+N-1} \big( x_i^T Q x_i + u_i^T R u_i \big) + x_{k+N}^T P_{kN}\, x_{k+N} \Big\}
          = \arg\min_{u} \big\{ x_k^T Q x_k + u_k^T R u_k + V_j(x_{k+1}) \big\}.$$
So, for N = 1, ATC RHC can be implemented using Value Iteration without knowing the system A matrix.
Four ADP Methods proposed by Werbos (recap)
The Critic NN approximates:
• Heuristic Dynamic Programming (HDP): the value $V(x_k)$
• Dual Heuristic Programming (DHP): the gradient $\partial V / \partial x$
• AD Heuristic Dynamic Programming (ADHDP; Watkins' Q Learning): the Q function $Q(x_k, u_k)$
• AD Dual Heuristic Programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$
An Action NN approximates the control.
cf. Bertsekas: Neurodynamic Programming. Barto & Bradtke: Q-learning proof (imposed a settling time).
Q Learning - Action Dependent ADP
Value function recursion for a given policy $h(x_k)$:
$$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$$
Define the Q function
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}),$$
where $u_k$ is arbitrary and the policy $h(\cdot)$ is used after time k.
Note that $Q_h(x_k, h(x_k)) = V_h(x_k)$, which gives the recursion for Q:
$$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$$
A simple expression of Bellman's principle:
$$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k), \qquad h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$$
Optimal Adaptive Control for completely unknown DT systems
[Timeline: at time k apply $u_k$, observe $x_k$ and $x_{k+1}$.]
Specify a control policy $u_j = h(x_j)$, $j = k+1, k+2, \ldots$ (the policy $h(\cdot)$ is used after time k; $u_k$ is arbitrary).
Q function definition:
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}), \qquad Q_h(x_k, h(x_k)) = V_h(x_k)$$
Optimal Q function:
$$Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1}) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))$$
Optimal control solution - a simple expression of Bellman's principle:
$$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k)) = \min_{u_k} Q^*(x_k, u_k)$$
$$h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$$
Q Function Definition - Q Function Policy Iteration (Action Dependent HDP)
Bradtke & Barto (1994) proved convergence for LQR.
The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation
$$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1})).$$
Recursive solution: pick a stabilizing initial control policy, then iterate:
Find the Q function:
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$$
Update the control:
$$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$$
Now $f(x_k, u_k)$ is not needed: Q learning does not need to know $f(x_k)$ or $g(x_k)$.
Q Function for the LQR case
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$$
For LQR, $V(x) = W^T \varphi(x) = x^T P x$, so V is quadratic in x and
$$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k),$$
i.e. Q is quadratic in x and u:
$$Q(x_k, u_k)
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + R \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}$$
The control update is found by
$$0 = \frac{\partial Q}{\partial u_k} = 2\big( B^T P A\, x_k + (R + B^T P B)\, u_k \big) = 2\big( H_{ux} x_k + H_{uu} u_k \big),$$
so
$$u_k = -(R + B^T P B)^{-1} B^T P A\, x_k = -H_{uu}^{-1} H_{ux}\, x_k = -L x_k.$$
The control is found only from the Q function: A and B are not needed.
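A small hedged sketch of the relationship just stated: the kernel H on the left is assembled from the model only to show what Q learning identifies from data; the gain extraction on the right uses H alone. Function names are illustrative.

```python
import numpy as np

def q_kernel(A, B, P, Q, R):
    """Assemble the LQR Q-function kernel H (model-based, for illustration)."""
    Hxx = A.T @ P @ A + Q
    Hxu = A.T @ P @ B
    Huu = B.T @ P @ B + R
    return np.block([[Hxx, Hxu], [Hxu.T, Huu]])

def gain_from_H(H, n):
    """Extract the feedback gain from H alone: L = Huu^{-1} Hux, u = -L x."""
    Huu, Hux = H[n:, n:], H[n:, :n]
    return np.linalg.solve(Huu, Hux)
```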
Implementation - DT Q Function Policy Iteration
Q function approximation (QFA): $Q(x, u) = W^T \varphi(x, u)$. Now u is an input to the NN (Werbos: an action-dependent NN). For LQR the basis $\varphi(x, u)$ is quadratic in $(x, u)$.
Q function update for a given control $u_k = -L_j x_k$ (Bradtke and Barto):
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$$
Assume measurements of $u_k$, $x_k$ and $x_{k+1}$ are available. Then
$$W_{j+1}^T \big( \varphi(x_k, u_k) - \varphi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k),$$
with regression matrix $\varphi(x_k, u_k) - \varphi(x_{k+1}, -L_j x_{k+1})$. Solve for the weights using RLS or backpropagation. Since $x_{k+1}$ is measured in the training phase, we do not need knowledge of $f(x)$ or $g(x)$ for the value function update.
Control policy update:
$$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k), \qquad
u_k = -(H_{uu}^{j+1})^{-1} H_{ux}^{j+1}\, x_k = -L_{j+1} x_k.$$
Model-free policy iteration (Bradtke, Ydstie, Barto).
Greedy Q Function Update - Approximate Dynamic Programming
ADP Method 3: Q Learning - Action-Dependent Heuristic Dynamic Programming (ADHDP)
Paul Werbos - model-free HDP.
Greedy Q update:
$$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))$$
$$W_{j+1}^T \varphi(x_k, u_k) = r(x_k, u_k) + W_j^T \varphi(x_{k+1}, -L_j x_{k+1}) \equiv \text{target}_{j+1}$$
Update the weights by RLS or backpropagation.
For Q policy iteration a stable initial control is needed; for the greedy (value-iteration) update a stable initial control is NOT needed.
This is direct OPTIMAL ADAPTIVE CONTROL: Q learning actually solves the Riccati equation WITHOUT knowing the plant dynamics (model-free ADP), and it works for nonlinear systems.
Open questions: Proofs? Robustness? Comparison with adaptive control methods?
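Putting the pieces together, here is a hedged end-to-end sketch of ADHDP (greedy Q update) for the LQR case. A and B appear only in the simulated plant; the learning rules see data $(x_k, u_k, x_{k+1})$ alone. The basis packing, probing-noise level, and all names are illustrative choices of mine.

```python
import numpy as np

def qbasis(z):                        # upper-triangular quadratic basis of z
    return np.concatenate([z[i] * z[i:] for i in range(len(z))])

def unpack_H(W, d):                   # invert the basis packing into kernel H
    H, idx = np.zeros((d, d)), 0
    for i in range(d):
        for j in range(i, d):
            H[i, j] = W[idx] if i == j else W[idx] / 2
            H[j, i] = H[i, j]
            idx += 1
    return H

def adhdp_lqr(A, B, Q, R, episodes=50, n_samples=80, noise=0.1):
    n, m = B.shape
    d = n + m
    W = np.zeros(d * (d + 1) // 2)
    L = np.zeros((m, n))              # initial gain need NOT be stabilizing
    for _ in range(episodes):
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.randn(n)
            u = -L @ x + noise * np.random.randn(m)    # probing noise for PE
            x1 = A @ x + B @ u                         # observed next state
            y.append(x @ Q @ x + u @ R @ u + W @ qbasis(np.r_[x1, -L @ x1]))
            Phi.append(qbasis(np.r_[x, u]))
        W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        H = unpack_H(W, d)
        L = np.linalg.solve(H[n:, n:], H[n:, :n])      # L = Huu^{-1} Hux
    return H, L
```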
Discrete-Time Zero-Sum Games
• Continuous-state and action spaces, discrete-time dynamical system
$$x_{k+1} = A x_k + B u_k + E w_k, \qquad y_k = x_k, \qquad z_k = H x_k,$$
with $x \in \mathbb{R}^n$, $y \in \mathbb{R}^p$, $u_k \in \mathbb{R}^{m_1}$, $w_k \in \mathbb{R}^{m_2}$, and quadratic cost
$$V(x_k) = \min_{u} \max_{w} \sum_{i=k}^\infty \big( x_i^T Q x_i + u_i^T u_i - \gamma^2 w_i^T w_i \big), \qquad Q = H^T H.$$
• The zero-sum game problem can be formulated as follows: find the optimal state-feedback strategies $u^*(x) = L x$ and $w^*(x) = K x$ achieving the attenuation bound
$$\sup_{w} \frac{\sum_k \|z_k\|^2}{\sum_k \|w_k\|^2} \le \gamma^2.$$
• Using Bellman's optimality principle ("Dynamic Programming"), $V(x_k) = x_k^T P x_k$ with
$$V(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V(x_{k+1}) \big),$$
$$x_k^T P x_k = \min_{u_k} \max_{w_k} \big( r(x_k, u_k, w_k) + x_{k+1}^T P x_{k+1} \big).$$
• The Game Algebraic Riccati Equation (GARE):
$$P = A^T P A + Q -
\begin{bmatrix} A^T P B & A^T P E \end{bmatrix}
\begin{bmatrix} B^T P B + I & B^T P E \\ E^T P B & E^T P E - \gamma^2 I \end{bmatrix}^{-1}
\begin{bmatrix} B^T P A \\ E^T P A \end{bmatrix}$$
• The conditions for a saddle point are
$$I + B^T P B > 0, \qquad \gamma^2 I - E^T P E > 0.$$
The optimal policies for control and disturbance are
$$L = \big( I + B^T P B - B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P B \big)^{-1}
     \big( B^T P E (E^T P E - \gamma^2 I)^{-1} E^T P A - B^T P A \big),$$
$$K = \big( E^T P E - \gamma^2 I - E^T P B (I + B^T P B)^{-1} B^T P E \big)^{-1}
     \big( E^T P B (I + B^T P B)^{-1} B^T P A - E^T P A \big).$$
The system dynamics A, B, E are needed.
DT Game Heuristic Dynamic Programming: Forward-in-Time Formulation (Asma Al-Tamimi)
• An Approximate Dynamic Programming scheme (ADP) with the incremental optimization
$$V_{i+1}(x_k) = \min_{u_k} \max_{w_k} \big( x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + V_i(x_{k+1}) \big),$$
which is equivalently written as
$$V_{i+1}(x_k) = x_k^T Q x_k + u_i(x_k)^T u_i(x_k) - \gamma^2 w_i(x_k)^T w_i(x_k) + V_i(x_{k+1}),$$
with $V_i(x_k) = x_k^T P_i x_k$ in the linear quadratic case (V and Q are quadratic).
Q function:
$$Q(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V(x_{k+1})
= \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix},
\qquad
H = \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}$$
Q function update:
$$\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T H_{i+1} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}
= x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k
+ \begin{bmatrix} x_{k+1} \\ u_{k+1} \\ w_{k+1} \end{bmatrix}^T H_i \begin{bmatrix} x_{k+1} \\ u_{k+1} \\ w_{k+1} \end{bmatrix}$$
Control action and disturbance updates, with $u_i(x_k) = L_i x_k$ and $w_i(x_k) = K_i x_k$:
$$L_i = \big( H^i_{uu} - H^i_{uw} (H^i_{ww})^{-1} H^i_{wu} \big)^{-1}
       \big( H^i_{uw} (H^i_{ww})^{-1} H^i_{wx} - H^i_{ux} \big),$$
$$K_i = \big( H^i_{ww} - H^i_{wu} (H^i_{uu})^{-1} H^i_{uw} \big)^{-1}
       \big( H^i_{wu} (H^i_{uu})^{-1} H^i_{ux} - H^i_{wx} \big).$$
A, B, E are NOT needed.
Q Learning for H-infinity Control (Asma Al-Tamimi)
Compare to the Q function for the H2 optimal control case:
$$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + I \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
= \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}$$
For the H-infinity game Q function, a quadratic basis set is used to allow on-line solution:
$$\hat Q_i(z, h_i) = z^T H_i z = h_i^T \bar z, \qquad
z = \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}^T, \qquad
\bar z = \big( z_1^2,\; z_1 z_2,\; \ldots,\; z_1 z_q,\; z_2^2,\; z_2 z_3,\; \ldots,\; z_q^2 \big)$$
(the quadratic Kronecker basis), with $\hat u_i(x) = L_i x$ and $\hat w_i(x) = K_i x$.
Q function update:
$$\hat Q_{i+1}\big(x_k, \hat u_i(x_k), \hat w_i(x_k)\big)
= x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k)
+ \hat Q_i\big(x_{k+1}, \hat u_i(x_{k+1}), \hat w_i(x_{k+1})\big),$$
i.e.
$$h_{i+1}^T \bar z(x_k) = x_k^T R x_k + \hat u_i(x_k)^T \hat u_i(x_k) - \gamma^2 \hat w_i(x_k)^T \hat w_i(x_k) + h_i^T \bar z(x_{k+1}).$$
Solve for the "NN weights" - the elements of the kernel matrix H - using batch LS or online RLS.
Probing noise is injected to get Persistence of Excitation:
$$\hat u_i^e(x_k) = L_i x_k + n_{1k}, \qquad \hat w_i^e(x_k) = K_i x_k + n_{2k}.$$
Proof: the algorithm still converges to the exact result.
Control and disturbance updates are as given above.
ADHDP Application for a Power System
• System description: load-frequency control. The state is
$$x(t) = [\,\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta F(t)\,]^T,$$
with continuous-time model (the discrete-time model is obtained by applying ZOH to the CT system)
$$\dot x = \begin{bmatrix}
-1/T_p & K_p/T_p & 0 & 0 \\
0 & -1/T_T & 1/T_T & 0 \\
-1/(R\,T_G) & 0 & -1/T_G & -1/T_G \\
K_E & 0 & 0 & 0
\end{bmatrix} x
+ \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix} u,$$
with parameter ranges
$$1/T_p \in [0.033,\, 0.1], \quad K_p/T_p \in [4,\, 12], \quad 1/T_T \in [2.564,\, 4.762], \quad
1/T_G \in [9.615,\, 17.857], \quad 1/(R\,T_G) \in [3.081,\, 10.639].$$
• The system state:
- Δf: incremental frequency deviation (Hz)
- ΔPg: incremental change in generator output (p.u. MW)
- ΔXg: incremental change in governor position (p.u. MW)
- ΔF: incremental change in integral control
- ΔPd: the load disturbance (p.u. MW)
• The system parameters:
- TG: the governor time constant
- TT: turbine time constant
- TP: plant model time constant
- Kp: plant model gain
- R: speed regulation due to governor action
- KE: integral control gain
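A hedged sketch of the model construction just described, discretized by zero-order hold with scipy. The parameter values are picked from the quoted ranges and the sampling period is my assumption, not the talk's exact numbers.

```python
import numpy as np
from scipy.signal import cont2discrete

# Illustrative parameters drawn from the quoted ranges
Tp, Kp, Tt, Tg, Rg, Ke = 20.0, 120.0, 0.35, 0.08, 2.4, 0.6
Ac = np.array([[-1/Tp,       Kp/Tp, 0.0,   0.0],
               [0.0,        -1/Tt,  1/Tt,  0.0],
               [-1/(Rg*Tg),  0.0,  -1/Tg, -1/Tg],
               [Ke,          0.0,   0.0,   0.0]])
Bc = np.array([[0.0], [0.0], [1/Tg], [0.0]])
# ZOH discretization with an assumed sampling period of 0.01 s
Ad, Bd, *_ = cont2discrete((Ac, Bc, np.eye(4), np.zeros((4, 1))), dt=0.01)
```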
ADHDP Application for a Power System
• ADHDP policy tuning
[Figure: convergence of the P-matrix entries (P11, P12, P13, P22, P23, P33, P34, P44) and of the control policy gains (L11, L12, L13, L14) over time k, 0 to 3000.]
The algorithm converges to the optimal GARE solution and the optimal game control.
• Comparison: the ADHDP controller design vs. the design from [1].
[Figure: state trajectories (frequency deviation, incremental change of generator output, governor position, and integral control) over 20 s; the peak frequency deviation is -0.2024 for the ADHDP controller vs. -0.2507 for the controller of [1].]
• The maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].
[1] Wang, Y., R. Zhou, C. Wen, "Robust load-frequency controller design for power systems", IEE Proc.-C, Vol. 140, No. 1, 1993.
Q learning for H-infinity control (game theory) using LMIs - work by Dr. Jinhoon Kim, Chungbuk National Univ., Korea.
How can one do Policy Iteration for unknown continuous-time systems?
What is Value Iteration for continuous-time systems?
How can one do ADP for CT systems?
Discrete-Time Systems:
$$H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$$
• Directly leads to temporal difference techniques.
• The system dynamics does not occur.
• Two occurrences of the value allow greedy value-iteration methods.

Continuous-Time Systems:
$$H\Big(x, u, \frac{\partial V}{\partial x}\Big) = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T \dot x
= r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u)$$
• How to define the temporal difference?
• The system dynamics DOES occur.
• Only ONE occurrence of the value gradient.
• Leads to off-line solutions if the system dynamics is known; hard to do on-line learning.
Continuous-Time Optimal Control
System: $\dot x = f(x, u)$
Cost: $V(x(t)) = \int_t^\infty r(x, u)\, d\tau = \int_t^\infty \big( Q(x) + u^T R u \big)\, d\tau$
Hamiltonian (Bellman):
$$0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big), \qquad V(0) = 0$$
Optimal cost:
$$0 = \min_{u(t)} \Big[ r(x, u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x, u) \Big]$$
Optimal control:
$$h^*(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V^*}{\partial x}$$
HJB equation:
$$0 = Q(x) + \Big(\frac{dV^*}{dx}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{dV^*}{dx}\Big)^T g(x) R^{-1} g^T(x)\, \frac{dV^*}{dx}, \qquad V^*(0) = 0$$
c.f. the DT value recursion, where $f(\cdot)$, $g(\cdot)$ do not appear.
BIG PROBLEM: the system dynamics appears (c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$, where it does not).

CT Policy Iteration
Utility: $r(x, u) = Q(x) + u^T R u$; iterative solution of the cost for any given $u(t)$.
Pick a stabilizing initial control, then iterate:
Find the cost (Lyapunov equation):
$$0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad V_k(0) = 0$$
Update the control:
$$h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\, \frac{\partial V_k}{\partial x}$$
• Convergence proved by Saridis 1979 if the Lyapunov equation is solved exactly.
• Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation.
• Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.
The full system dynamics must be known; this is an off-line solution, used to avoid solving the HJB equation
$$0 = Q(x) + \Big(\frac{dV^*}{dx}\Big)^T f(x) - \tfrac{1}{4} \Big(\frac{dV^*}{dx}\Big)^T g(x) R^{-1} g^T(x)\, \frac{dV^*}{dx}.$$
LQR Policy Iteration = Kleinman Algorithm (Kleinman 1968)
1. For a given control policy $u = -L_k x$, solve for the cost (Lyapunov equation):
$$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k, \qquad A_k = A - B L_k$$
2. Improve the policy:
$$L_{k+1} = R^{-1} B^T P_k$$
If started with a stabilizing control policy $L_0$, the matrix $P_k$ monotonically converges to the unique positive definite solution of the Riccati equation. Every iteration step returns a stabilizing controller. The system has to be known.
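A hedged sketch of Kleinman's algorithm using scipy's continuous Lyapunov solver; CtC plays the role of the state weight $C^T C$, and L0 must be stabilizing. The function name and defaults are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def kleinman(A, B, CtC, R, L0, iters=15):
    """Model-based CT policy iteration (Kleinman 1968); L0 must be stabilizing."""
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        # Lyapunov equation: Ak' P + P Ak + C'C + L' R L = 0
        P = solve_continuous_lyapunov(Ak.T, -(CtC + L.T @ R @ L))
        L = np.linalg.solve(R, B.T @ P)       # L_{k+1} = R^{-1} B' P
    return P, L
```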
Small Time-Step Approximate Tuning for Continuous-Time Adaptive Critics
Problems with
$$0 = \dot V(x) + r(x, u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) + r(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big):$$
use Euler's forward approximation,
$$H(x_t, u_t, \Delta V_t) = r(x_t, u_t) + \frac{V_{t+1} - V_t}{\Delta t},$$
which, with the discretized utility $r_D(x_t, u_t) = r(x_t, u_t)\,\Delta t$, gives Baird's Advantage function
$$A_D(x_t, u_t) = \frac{r(x_t, u_t)\,\Delta t + V^*(x_{t+1}) - V^*(x_t)}{\Delta t}.$$
This defines a temporal difference error. The system dynamics does NOT appear, and there are TWO occurrences of the value V, so it can be used to define greedy Value Iteration. NOTE that the UTILITY must also be discretized or it does not work.
Problems: it is only an approximation; a constant sampling period must be selected up front; and physics-based CT nonlinearities are lost.
Solving for the Cost - Our Approach (Draguna Vrabie)
For a given policy $u(x)$, the cost
$$V(x(t)) = \int_t^\infty r(x, u)\, d\tau, \qquad V(0) = 0,$$
satisfies
$$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T)).$$
$f(x)$ and $g(x)$ do not appear.
LQR case, with $u = -L x$:
$$x(t)^T P x(t) = \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T),$$
and the optimal gain is $L = R^{-1} B^T P$.
c.f. the DT value recursion $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$ - the SAME FORM, and again $f(\cdot)$, $g(\cdot)$ do not appear.
Lemma 1 (Draguna Vrabie)
$$V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T)), \qquad V(0) = 0,$$
is equivalent to
$$0 = r(x, u) + \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = H\Big(x, u, \frac{\partial V}{\partial x}\Big), \qquad V(0) = 0.$$
Proof:
$$\frac{d}{dt} V(x) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u) = -r(x, u)
\quad\Longrightarrow\quad
\int_t^{t+T} r(x, u)\, d\tau = -\int_t^{t+T} dV = V(x(t)) - V(x(t+T)).$$
This solves the Lyapunov equation without knowing A or B.
Lemma 1 - LQR case (Draguna Vrabie)
$$x(t)^T P x(t) = \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T)$$
is equivalent to
$$0 = A_c^T P + P A_c + L^T R L + Q, \qquad A_c = A - BL.$$
Proof:
$$\frac{d}{dt}\big( x^T P x \big) = x^T \big( A_c^T P + P A_c \big) x = -x^T \big( L^T R L + Q \big) x;$$
integrating both sides over $[t,\, t+T]$ gives the result.
This solves the Lyapunov equation without knowing A or B.
Define the CT temporal difference
$$e(t) = -x(t)^T P x(t) + \int_t^{t+T} x^T \big( Q + L^T R L \big) x\, d\tau + x(t+T)^T P x(t+T).$$
Solving for the Cost - Our Approach (Draguna Vrabie)
1. Policy Iteration
Policy evaluation (cost update), with control $u_k(t) = -L_k x(t)$:
$$V_{k+1}(x(t)) = \int_t^{t+T} r(x, u_k)\, d\tau + V_{k+1}(x(t+T))$$
LQR case:
$$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T \big( Q + L_k^T R L_k \big) x\, d\tau + x(t+T)^T P_{k+1} x(t+T)$$
Policy improvement (control gain update):
$$L_{k+1} = R^{-1} B^T P_{k+1}$$
An initial stabilizing control is needed. A and B do not appear in the cost update - it solves the Lyapunov equation without knowing A or B - but B is needed for the control update.
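A hedged sketch of this integral-reinforcement policy evaluation for a two-state LQR example: integrate the closed loop over $[t, t+T]$, record $x(t)$, $x(t+T)$ and the integrated cost, then solve for $P_{k+1}$ by least squares on a quadratic basis. A is used only inside the simulated plant, never by the learning law; L is assumed stabilizing, and all names are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

def phi(x):                                     # quadratic basis, n = 2
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def irl_evaluate(A, B, Q, R, L, T=0.05, n_samples=30):
    """Integral-RL policy evaluation for u = -L x (L assumed stabilizing)."""
    def aug_rhs(t, s):                          # state + running cost
        x = s[:2]
        u = -L @ x
        return np.r_[A @ x + B @ u, x @ Q @ x + u @ R @ u]
    Phi, y = [], []
    for _ in range(n_samples):
        x0 = np.random.randn(2)
        sol = solve_ivp(aug_rhs, (0.0, T), np.r_[x0, 0.0], rtol=1e-8)
        xT, cost = sol.y[:2, -1], sol.y[2, -1]
        Phi.append(phi(x0) - phi(xT))           # V(x(t)) - V(x(t+T))
        y.append(cost)                          # integral reinforcement
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    p11, p12x2, p22 = W
    return np.array([[p11, p12x2 / 2], [p12x2 / 2, p22]])   # P_{k+1}
```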
Direct Optimal Adaptive Controller
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.
[Block diagram: the system $\dot x = A x + B u$, $x_0$, driven by $u(t) = -L_k x(t)$; a Critic integrating $\dot V = x^T Q x + u^T R u$, sampled by a zero-order hold of period T, runs RLS or batch LS to identify the value of the current control; the Actor gain K is updated after the Critic has converged.]
The reinforcement interval T can be selected on line, on the fly.
Continuous-time control with discrete gain updates:
[Timeline: the gain $L_k$ is piecewise constant over reinforcement intervals k = 0, 1, 2, ...; the intervals T need not be the same and can be selected on-line in real time.]
Cost update:
$$x(t)^T P_{k+1} x(t) = \int_t^{t+T} x^T \big( Q + L_k^T R L_k \big) x\, d\tau + x(t+T)^T P_{k+1} x(t+T)$$
2. CT ADP Greedy Iteration (Value Iteration) - LQR (Draguna Vrabie)
Control policy: $u_k(t) = -L_k x(t)$.
Cost update:
$$V_{k+1}(x(t)) = \int_t^{t+T} \big( x^T Q x + u_k^T R u_k \big)\, d\tau + V_k(x(t+T))$$
Control gain update:
$$L_{k+1} = R^{-1} B^T P_{k+1}$$
No initial stabilizing control is needed. A and B do not appear in the cost update; B is needed for the control update.
Direct Optimal Adaptive Control for Partially Unknown CT Systems - Value Iteration
This is direct OPTIMAL ADAPTIVE CONTROL: it solves the Riccati equation WITHOUT knowing the plant A matrix (model-free ADP in that sense), and it works for nonlinear systems, where a neural network is used to approximate the cost.
Open questions: Robustness? Comparison with adaptive control methods?

Adaptive Control
[Plant with control input and output.]
• Identify the controller - Direct Adaptive
• Identify the system model - Indirect Adaptive
• Identify the performance value - Optimal Adaptive