Reinforcement Learning Control with Robust Stability
Chuck Anderson, Matt Kretchmar, Department of Computer Science
Peter Young, Department of Electrical and Computer Engineering
Douglas Hittle, Department of Mechanical Engineering
Colorado State University, Fort Collins, CO
Reinforcement Learning Agent in Parallel with Controller
Reinforcement learning algorithm guides adjustment of actor's weights.
IQC places bounding box in weight space of actor network, beyond which stability has not been verified.
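The parallel arrangement can be sketched as follows. This is a minimal illustration under stated assumptions: the proportional gain `k_p`, the one-hidden-layer tanh network, and all helper names are hypothetical. Clipping each weight to the box is also an assumption. It is only one simple way to keep the weights inside the region; the poster states only that the IQC box bounds the actor's weights.

```python
import numpy as np

# Hypothetical sketch: a fixed proportional controller (gain k_p, assumed) with a
# small tanh "actor" network added in parallel. Network shape and box are assumed.

def actor_output(w_hidden, w_out, error):
    """Scalar correction from a one-input network with one tanh hidden layer."""
    hidden = np.tanh(w_hidden * error)   # hidden activations, shape (n_hidden,)
    return float(w_out @ hidden)

def control_signal(error, k_p, w_hidden, w_out):
    """Total control: fixed controller output plus the actor's correction."""
    return k_p * error + actor_output(w_hidden, w_out, error)

def bounded_update(weights, step, lower, upper):
    """Apply the weight change proposed by the RL algorithm, then clip each weight
    to the box [lower, upper] in which stability is assumed to be verified."""
    return np.clip(weights + step, lower, upper)

# Usage: the actor starts at zero, so the controller initially acts alone.
w_hidden, w_out = np.ones(3), np.zeros(3)
u = control_signal(2.0, 0.5, w_hidden, w_out)            # 0.5 * 2.0 + 0.0
box_lo, box_hi = w_out - 0.1, w_out + 0.1
w_out = bounded_update(w_out, np.array([0.05, 0.3, -0.5]), box_lo, box_hi)
```

The second and third proposed changes exceed the box, so they are cut back to its edges (0.1 and -0.1). The first change (0.05) passes through unchanged.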
Incorporating Time-Varying IQC in Reinforcement Learning
[Figure: weight space (high-dimensional), showing the steps below.]
Step 0: initial weight vector
Step 1: initial guaranteed-stable region
Step 2: trajectory of weights while learning
Step 3: must find new stable region
Step 4: next guaranteed-stable region
Step 5: Now learning can continue until the edge of the new bounding box is encountered …
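The step sequence can be sketched as a loop. `find_stable_box` is a hypothetical placeholder. In the poster's approach each new region comes from IQC/LMI stability analysis, but here it is simply a fixed-width box around the current weights. Stopping at the box edge by clipping is also an assumption.

```python
import numpy as np

def find_stable_box(center, half_width):
    """HYPOTHETICAL stand-in for the IQC/LMI analysis. It returns a fixed-width box
    around the current weights instead of a region proven to be stable."""
    return center - half_width, center + half_width

def learn_with_moving_boxes(weights, propose_step, n_steps, half_width=0.1):
    """Learn inside a box; when the box edge is hit, compute a new box and continue."""
    lower, upper = find_stable_box(weights, half_width)   # Step 1: initial region
    n_boxes = 1
    for _ in range(n_steps):
        candidate = weights + propose_step(weights)        # Step 2: RL proposes a move
        if np.any(candidate < lower) or np.any(candidate > upper):
            weights = np.clip(candidate, lower, upper)     # stop at the edge of the box
            lower, upper = find_stable_box(weights, half_width)  # Steps 3-4: new region
            n_boxes += 1
        else:
            weights = candidate                            # Step 5: keep learning
    return weights, n_boxes

# Usage: a constant proposed step of 0.03 in a one-weight problem.
final_w, boxes = learn_with_moving_boxes(np.zeros(1), lambda w: np.full_like(w, 0.03), 10)
```

In this toy run the weight reaches the edge of the box twice, so three regions are used in total.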
[Figure: weight space (high-dimensional) showing an UNSTABLE REGION, the final weight vector, the weight trajectory with robust constraints, and the weight trajectory without robust constraints.]
Trajectory of Weights and Bounds on Regions of Stability
[Figure: labels A through E along the weight trajectory, starting at A, the initial weight vector.]
Motivation
Robust control theory
• Guarantees stability
• Results in less aggressive controllers
Reinforcement learning
• Optimizes the performance of a controller
• No guarantee of stability while learning
[Figure: schematic of the experimental HVAC system. Labeled elements: external air, return air, mixing box, filter, fan with variable-frequency drive, heating coil, valve, water heater, discharge air, air flow, temperature sensors (T), and an interface for inputs and outputs.]
Experimental HVAC System
Reinforcement Learning
$$\begin{aligned}
Q(s_t,a_t) &= E\Big\{\sum_{k=0}^{T}\gamma^k R(s_{t+k},a_{t+k})\Big\}\\
&= E\Big\{R(s_t,a_t) + \sum_{k=1}^{T}\gamma^k R(s_{t+k},a_{t+k})\Big\}\\
&= E\big\{R(s_t,a_t) + \gamma\,Q(s_{t+1},a_{t+1})\big\}
\end{aligned}$$
Here $s$ is the state, $a$ the action chosen by the policy function, $Q$ the value function, $\gamma$ the discount factor, and $R$ the reinforcement ($|\text{error}|$).
Subtract the left side from the right side to get an algorithm for updating Q:
$$\Delta Q(s_t,a_t) = E\big\{R(s_t,a_t) + \gamma\,Q(s_{t+1},a_{t+1})\big\} - Q(s_t,a_t)$$
Replace the expectation with a sample (Monte Carlo approach):
$$\Delta Q(s_t,a_t) = R(s_t,a_t) + \gamma\,Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)$$
The right-hand side of the sampled update is the temporal-difference error.
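The sampled temporal-difference update in this section can be sketched in tabular form. This is a simplification under stated assumptions: the poster's agent uses neural networks rather than a table, and the step size `alpha`, the default `gamma`, and the example state and action labels are hypothetical.

```python
from collections import defaultdict

def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Sampled temporal-difference update of Q(s, a) using the next state-action pair.
    alpha (step size) and the default gamma are assumed values."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)          # table of Q values; unseen pairs start at 0.0
Q[(1, 0)] = 2.0                 # hypothetical current estimate for (state 1, action 0)
tracking_error = -1.0
r = abs(tracking_error)         # reinforcement is |error|, as on the poster
delta = td_update(Q, s=0, a=0, r=r, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
```

Here the TD error is 1.0 + 0.9 × 2.0 − 0.0 = 2.8. With `alpha=0.5`, Q(0, 0) moves halfway toward its target and becomes 1.4.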
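Robust Control based on IQCs
[Figure: feedback loop between the uncertainties ($\Delta$) and the controller/plant ($M$), connected by the signals $v$ and $w$.]
An Integral Quadratic Constraint (IQC) describes the relationship between the signals $v$ and $w$ as
$$\int_{-\infty}^{\infty} \begin{bmatrix}\hat v(j\omega)\\ \hat w(j\omega)\end{bmatrix}^{*} \Pi(j\omega) \begin{bmatrix}\hat v(j\omega)\\ \hat w(j\omega)\end{bmatrix} d\omega \ge 0$$
Stability of the closed-loop system is guaranteed if
$$\begin{bmatrix} M(j\omega)\\ I\end{bmatrix}^{*} \Pi(j\omega) \begin{bmatrix} M(j\omega)\\ I\end{bmatrix} \le -\epsilon I$$
for all $\omega$ and for some $\epsilon > 0$.
Given specific IQCs for a particular system, this inequality problem becomes a linear matrix inequality (LMI) problem.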
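The frequency-domain stability condition can be checked numerically for a toy case. This is a minimal sketch under stated assumptions. The scalar transfer functions and the multiplier Π = diag(1, −1) are illustrative choices: that multiplier is the IQC for a norm-bounded uncertainty |Δ| ≤ 1. Checking the condition on a frequency grid is only a sanity check. It is not a proof and does not replace the LMI formulation described above.

```python
import numpy as np

def iqc_condition_holds(M, Pi, omegas, eps=1e-6):
    """Check [M(jw); 1]^* Pi [M(jw); 1] <= -eps at every frequency in `omegas`
    for a scalar transfer function M. A grid check is only a sanity test."""
    for w in omegas:
        v = np.array([M(1j * w), 1.0])         # stacked vector [M(jw); 1]
        quad = np.real(np.conj(v) @ Pi @ v)    # Hermitian form, real-valued
        if quad > -eps:
            return False
    return True

# Assumed IQC for a norm-bounded uncertainty |Delta| <= 1: |v|^2 - |w|^2 >= 0,
# so Pi = diag(1, -1) and the test reduces to the small-gain condition |M(jw)| < 1.
Pi = np.diag([1.0, -1.0])
omegas = np.logspace(-3, 3, 2000)
ok_small = iqc_condition_holds(lambda s: 0.5 / (s + 1), Pi, omegas)   # peak gain 0.5
ok_large = iqc_condition_holds(lambda s: 2.0 / (s + 1), Pi, omegas)   # peak gain 2
```

With this multiplier the quadratic form equals |M(jω)|² − 1. The first loop passes, and the second fails near ω = 0, where its gain is about 2.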
Robust Reinforcement Learning
[Figure: reference and output signals. Nominal case: good response. Perturbed case: terrible response.]
[Figure: reference and output signals for the perturbed case, with no learning and with learning.]
Through learning, the controller has been fine-tuned to the actual dynamics of the real plant without losing the guarantee of stability!
Controller              Sum Squared Error
Nominal Controller      0.646
Robust Controller       0.286
Robust RL Controller    0.243
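The relative improvements implied by the table can be worked out directly. The sketch assumes that "sum squared error" means the sum of squared differences between reference and plant output. The poster does not spell out the definition or the time window, and the helper names are hypothetical.

```python
import numpy as np

def sum_squared_error(reference, output):
    """Sum of squared differences between reference and plant output (assumed metric)."""
    e = np.asarray(reference, dtype=float) - np.asarray(output, dtype=float)
    return float(np.sum(e ** 2))

def relative_reduction(before, after):
    """Fraction by which the error drops when going from `before` to `after`."""
    return 1.0 - after / before

sse = {"nominal": 0.646, "robust": 0.286, "robust_rl": 0.243}   # values from the table
rl_vs_robust = relative_reduction(sse["robust"], sse["robust_rl"])    # about 0.150
rl_vs_nominal = relative_reduction(sse["nominal"], sse["robust_rl"])  # about 0.624
```

Adding learning lowers the robust controller's error by roughly 15%. The robust RL controller's error is about 62% below that of the nominal controller.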
Conclusions
• IQC bounds on the parameters of tanh and sigmoid networks exist for which the combination of a reinforcement learning agent and a feedback control system satisfies the requirements of robust stability theorems (static and dynamic stability).
• The robust reinforcement learning algorithm improves control performance while avoiding instability on several simulated problems.
• Because of these stability guarantees, reinforcement learning is now more acceptable in practical applications as an adaptive controller that modifies its behavior over time.
• The initial, conservative robust controller becomes more aggressive through adaptation to the actual physical system.
• See http://www.cs.colostate.edu/~anderson/res/rl/
Integral Quadratic Constraints
[Figure: block diagrams of M interconnected with two uncertainty blocks, indexed 1 and 2.]
Neural Net and Robust Control with IQCs
[Figure legend: bounds on neural net weight adjustment in green; neural net as reinforcement learning actor in blue; robust controller and plant in red.]
First Example
Second Example
Without robust constraints, the system becomes unstable before learning reaches the final stable solution.
Third Example
Distillation Column
Fourth Example
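1st Order
$$x(t+1) = x(t) + u(t), \qquad y(t) = x(t)$$
2nd Order
$$x(t+1) = \begin{bmatrix}1 & 0.05\\ 0.05 & 0.9\end{bmatrix} x(t) + \begin{bmatrix}0\\ 1\end{bmatrix} u(t), \qquad y(t) = \begin{bmatrix}1 & 0\end{bmatrix} x(t)$$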
Without robust constraints, the system becomes unstable before learning reaches the final stable solution.