Resilient Control in Cooperative and Adversarial Multi-agent Networks
F.L. Lewis, National Academy of Inventors
Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
Talk available online at http://www.UTA.edu/UTARI/acs
Supported by: ONR (Tony Seman, Lynn Petersen, Marc Steinberg), US TARDEC (Dariusz Mikulski), AFOSR Europe (Kevin Bollino), NSF
Reinforcement Learning for Resilient Control in Cooperative and Adversarial Multi-agent Networks: CPS Applications in Microgrid and Human-Robot Interactions
And Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China
Supported by: China NNSF, China Project 111
Invited by Chaoli Wang
J.J. Finnigan, Complex science for a complex world
The Internet
Ecosystem
Professional collaboration network
Barcelona rail network
Structure of Natural and Manmade Systems
Peer-to-Peer Relationships in networked systems
Distributed Decision & Control
Local nature of Physical Laws
Clusters of galaxies
Distributed Adaptive Control for Multi‐Agent Systems
Flocking: Reynolds, Computer Graphics 1987
Reynolds’ Rules:
Alignment: align headings, $\dot{\theta}_i = \sum_{j \in N_i} a_{ij} (\theta_j - \theta_i)$
Cohesion: steer towards average position of neighbors, i.e. towards the c.g.
Separation: steer to maintain separation from neighbors
Nature and Biological Systems are Resilient !
Resilient Control in Cooperative Multi-agent Networks
Modeling and Control of Adversarial Networked Teams
CPS Applications: Distributed Control of Renewable Energy Microgrids; Shared Learning in Human-Robot Interactions
Multi-agent Networks on Communication Graphs
Robustness of Optimal Design Reinforcement Learning
Cooperative Agents - Games on Communication Graphs
CPS #1 – Resilient Distributed Control of Renewable Energy Microgrids
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014
Key Point
Lyapunov Functions and Performance Indices Must depend on graph topology
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Synchronized Motion of Biological Groups
Fish school
Birds flock
Locusts swarm
Fireflies synchronize
Nature and Biological Systems are Resilient
The Power of Synchronization: Coupled Oscillators, Diurnal Rhythm
[Figure: six-node directed graph, nodes 1-6, with a spanning tree and its root node marked]
Diameter = the longest of the shortest path lengths between any two nodes
Volume = sum of in-degrees, $\mathrm{Vol} = \sum_{i=1}^{N} d_i$
Strongly connected if for all nodes i and j there is a path from i to j.
Tree: every node except the root has in-degree = 1
Leader or root node; followers
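Communication Graph
Graph $(V, E)$ with N nodes
[Figure: six-node directed graph, nodes 1-6, with edge weight $a_{42}$ marked]
Adjacency matrix $A = [a_{ij}]$, with $a_{ij} > 0$ if $(v_j, v_i) \in E$, i.e. if $j \in N_i$
$A = \begin{bmatrix} 0&0&1&0&0&0 \\ 1&0&0&0&0&1 \\ 1&1&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&1&0 \end{bmatrix}$
$N_i$ = in-neighbors of node i. Row sum = in-degree, $d_i = \sum_{j=1}^{N} a_{ij}$
$N_i^o$ = out-neighbors of node i. Column sum = out-degree, $d_i^o = \sum_{j=1}^{N} a_{ji}$
In-degree $d_i$: John Baras - social standing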
Algebraic Graph Theory
Adjacency matrix $A = [a_{ij}]$
In-degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$
Graph Laplacian matrix $L = D - A$
Graph Eigenvalues for Different Communication Topologies
Directed tree: chain of command
Directed ring: gossip network, OSCILLATIONS
Directed graph
Undirected graph
[Figures: six-node tree, ring, directed and undirected graphs (nodes 1-6) with their Laplacian eigenvalue plots]
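The link between topology and oscillation can be checked numerically. The sketch below is not from the talk: it assumes unit edge weights and six nodes, builds a hypothetical directed ring and a directed chain (the simplest tree), and compares their Laplacian eigenvalues. The ring eigenvalues are complex, which is what produces oscillatory consensus; the chain eigenvalues are real.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal in-degree matrix (row sums of A)."""
    return np.diag(A.sum(axis=1)) - A

N = 6  # assumed number of nodes

# Directed ring: node i listens only to node i-1 (indices mod N), unit weights.
A_ring = np.roll(np.eye(N), 1, axis=0)
eig_ring = np.linalg.eigvals(laplacian(A_ring))

# For a unit-weight directed ring the eigenvalues are 1 - exp(2*pi*1j*k/N).
eig_ring_analytic = 1.0 - np.exp(2j * np.pi * np.arange(N) / N)

# Directed chain (a tree): node i listens to node i-1, node 0 is the root.
A_tree = np.diag(np.ones(N - 1), k=-1)
L_tree = laplacian(A_tree)
# L_tree is lower triangular, so its eigenvalues are its diagonal entries.
eig_tree = np.diag(L_tree).copy()

print("ring eigenvalues:", np.round(eig_ring, 3))
print("tree eigenvalues:", eig_tree)
```

The chain eigenvalues are read from the diagonal rather than computed with `eigvals`, because the repeated eigenvalue at 1 is defective and a general-purpose eigensolver returns it with noticeable rounding error.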
Synchronization on Good Graphs
Chris Elliott fast video
[Figure: six-node mesh graph, 4 neighbors per node]
Synchronization on Gossip Rings
Chris Elliott weird video
[Figure: six-node ring graph, or cycle]
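Dynamic Graph: the Distributed Structure of Control
Communication graph with N nodes or agents, adjacency matrix $A = [a_{ij}]$
State at node i is $x_i(t)$
Each node has an associated state with dynamics $\dot{x}_i = u_i$
Standard local voting protocol: $u_i = \sum_{j \in N_i} a_{ij} (x_j - x_i)$
$u_i = -x_i \sum_{j \in N_i} a_{ij} + \sum_{j \in N_i} a_{ij} x_j = -d_i x_i + \begin{bmatrix} a_{i1} & \cdots & a_{iN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$
Global vectors: $x = [x_1, \ldots, x_N]^T$, $u = [u_1, \ldots, u_N]^T$, $D = \mathrm{diag}(d_1, \ldots, d_N)$
$u = -Dx + Ax = -(D - A)x = -Lx$, where $L = D - A$ is the graph Laplacian matrix
Closed-loop dynamics: $\dot{x} = -Lx$
If each $x_i$ is an n-vector then $\dot{x} = -(L \otimes I_n) x$
Synchronization problem: $x_i(t) - x_j(t) \to 0$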
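The local voting protocol can be simulated directly. The following is a minimal sketch, not the talk's own code: it takes the six-node adjacency matrix from the communication graph slide, assumes scalar node states with hypothetical initial values, and integrates the closed-loop dynamics $\dot{x} = -Lx$ with a forward-Euler step (step size and horizon are illustrative choices).

```python
import numpy as np

# Adjacency matrix of the six-node example graph
# (row i lists the in-neighbors of node i).
A = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))  # in-degree matrix
L = D - A                   # graph Laplacian

def simulate_consensus(L, x0, dt=0.01, steps=3000):
    """Forward-Euler integration of the closed-loop dynamics x' = -L x."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (-L @ x)
    return x

x0 = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 4.0])  # hypothetical initial states
x_final = simulate_consensus(L, x0)
spread = x_final.max() - x_final.min()

print("final states:", np.round(x_final, 4))
print("spread between nodes:", spread)
```

This graph is strongly connected, so all six states settle to a common value and the printed spread shrinks toward zero.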
Formation Control
Synchronization of Rigid-body Spacecraft Dynamics
Synchronization of Trust Opinions in Multi-agent Networks
Diurnal Rhythm Synchronization in Nature
Controlled Consensus: Cooperative Tracker
Node state: $\dot{x}_i = u_i$
Distributed local voting protocol with control node v: $u_i = \sum_{j \in N_i} a_{ij} (x_j - x_i) + b_i (v - x_i)$
The right-hand side is the local neighborhood tracking error.
$b_i \neq 0$ if control v is in the neighborhood of node i; $B = \mathrm{diag}\{b_i\}$
Global state $x = [x_1, \ldots, x_N]^T$: $\dot{x} = -(L + B) x + B \underline{1} v$
Theorem. Let the graph have a spanning tree and $b_i \neq 0$ for at least one root node. Then L+B is nonsingular with all eigenvalues having positive real parts, -(L+B) is asymptotically stable, and all nodes synchronize to the control node v.
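The theorem can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the talk's example: it reuses the six-node adjacency matrix from the communication graph slide, pins only node 1 with an assumed gain $b_1 = 1$, uses a hypothetical constant command v = 2, and integrates $\dot{x} = -(L+B)x + B\underline{1}v$ by forward Euler.

```python
import numpy as np

A = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Pinning gains: only node 1 sees the control node (assumed choice, b_1 = 1).
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
B = np.diag(b)
v = 2.0  # hypothetical constant command of the control node

# All eigenvalues of L + B should have positive real parts.
eigs = np.linalg.eigvals(L + B)
print("min real part of eig(L+B):", eigs.real.min())

# Forward-Euler simulation of x' = -(L + B) x + B 1 v.
x = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 4.0])  # hypothetical initial states
dt, steps = 0.01, 10000
ones = np.ones(6)
for _ in range(steps):
    x = x + dt * (-(L + B) @ x + B @ ones * v)

tracking_error = np.max(np.abs(x - v))
print("max tracking error:", tracking_error)
```

Pinning a single root node is enough here because the graph is strongly connected, so every node is a root of some spanning tree.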
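Example: Non-Resilience in Communication Networks
Agent dynamics and local feedback design: $\dot{x}_i = A x_i + B u_i$, $u_i = K x_i$
$A = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}$, $B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $K = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}$
Couple 6 agents with communication graph
[Figure: six-node communication graph, nodes 1-6, with leader $x_0$]
Local neighborhood tracking error: $\varepsilon_i = \sum_{j \in N_i} e_{ij} (x_j - x_i) + g_i (x_0 - x_i)$
$u_i = K \varepsilon_i$
Nodes synchronize to consensus heading
[Figure: agent trajectories in the x-y plane converging with the leader $x_0$ (c.g.)]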
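Example: Non-Resilience in Communication Networks
Agent dynamics and local feedback design: $\dot{x}_i = A x_i + B u_i$, $u_i = K x_i$
$A = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}$, $B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $K = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}$
ADD another comm. link: more information flow
[Figure: the same six-node communication graph with one added link]
Local neighborhood tracking error: $\varepsilon_i = \sum_{j \in N_i} e_{ij} (x_j - x_i) + g_i (x_0 - x_i)$
$u_i = K \varepsilon_i$
Causes Unstable Formation! WHY?
[Figure: diverging agent trajectories in the x-y plane]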
OPTIMAL design at each node (Lewis and Syrmos 1995)
$u_i = cK\varepsilon_i = cK \left[ \sum_{j \in N_i} e_{ij} (x_j - x_i) + g_i (x_0 - x_i) \right]$
minimizes $J_i = \frac{1}{2} \int_0^\infty (\delta_i^T Q \delta_i + u_i^T R u_i)\, dt$
LOCAL OPTIMAL DESIGN guarantees global synchronization (Lewis and Zhang, IEEE TAC 2011)
DECOUPLES CONTROL DESIGN FROM COMMUNICATION GRAPH STRUCTURE
Optimality in Biological Systems: Why are Biological Systems Resilient?
Cell Homeostasis: The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Reinforcement Learning
Every living organism improves its control actions based on rewards received from the environment.
The resources available to living organisms are usually meager. Nature uses optimal control.
1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.
RL finds optimal policies by evaluating the effects of suboptimal policies.
Optimality Provides an Organizational Principle for Behavior
Charles Darwin showed that optimal control over long timescales is responsible for natural selection of species.
Books
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on Reinforcement Learning and Differential Games.
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, “Reinforcement learning and feedback control,” IEEE Control Systems Magazine, Dec. 2012.
Optimal Control: The Linear Quadratic Regulator (LQR)
System: $\dot{x} = Ax + Bu$
User prescribed optimization criterion, with weights $(Q, R)$: $V(x(t)) = \int_t^\infty (x^T Q x + u^T R u)\, d\tau$
FORMAL DESIGN PROCEDURE
Off-line design loop using ARE: $0 = PA + A^T P + Q - P B R^{-1} B^T P$
On-line real-time control loop: $u = -Kx$, $K = R^{-1} B^T P$
An offline design procedure that requires knowledge of the system dynamics model (A,B)
System modeling is expensive, time consuming, and inaccurate
OPTIMAL CONTROL IS RESILIENT, ROBUST, HAS GUARANTEED STABILITY MARGINS
Aircraft autopilots: Linear Quadratic Regulator, Linear Quadratic Gaussian / LTR
A. Optimal control B. Zero-sum games C. Non zero-sum games
1. System dynamics
2. Value/cost function
3. Bellman equation
4. HJ solution equation (Riccati eq.)
5. Policy iteration – gives the structure we need
Adaptive Control Structures for Resilience:
We want to find optimal control solutions online in real-time, using adaptive control techniques, without knowing the full dynamics
For nonlinear systems and general performance indices
Adaptive Control is online and works for unknown systems. Generally not optimal.
Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.
We want ONLINE DIRECT ADAPTIVE OPTIMAL Control for any performance cost of our own choosing
Reinforcement Learning turns out to be the key to this!
RL brings together Optimal Control and Adaptive Control
Different methods of learning
Reinforcement learning: Ivan Pavlov, 1890s
Actor-Critic Learning
[Figure: actor-critic structure. The actor (adaptive learning system) sends control inputs to the system and environment; the critic compares the outputs with the desired performance and produces a reinforcement signal that tunes the actor.]
We want OPTIMAL performance: ADP (Approximate Dynamic Programming)
Sutton & Barto book
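Derivation of Nonlinear Optimal Regulator
Nonlinear system dynamics: $\dot{x} = f(x, u) = f(x) + g(x) u$
Cost/value: $V(x(t)) = \int_t^\infty r(x, u)\, dt = \int_t^\infty (Q(x) + u^T R u)\, dt$
Leibniz gives the differential equivalent, the Bellman equation in terms of the Hamiltonian function:
$H(x, \frac{\partial V}{\partial x}, u) = \dot{V} + r(x, u) = \left(\frac{\partial V}{\partial x}\right)^T \dot{x} + r(x, u) = \left(\frac{\partial V}{\partial x}\right)^T (f(x) + g(x) u) + r(x, u) = 0$
Stationarity condition: $\frac{\partial H}{\partial u} = 0$
Stationary control policy: $u = h(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V}{\partial x}$
Hamilton-Jacobi-Bellman equation: $0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4} \left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$
Off-line solution. HJB hard to solve. May not have smooth solution. Dynamics must be known.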
CT Policy Iteration: a Reinforcement Learning Technique
Utility: $r(x, u) = Q(x) + u^T R u$
Given any admissible policy $u(x) = h(x)$, the cost is given by solving the CT Bellman equation (a scalar equation):
$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x, u) + r(x, u) \equiv H(x, \frac{\partial V}{\partial x}, u)$
To find online methods for optimal control, iterate on these two equations.
Policy Iteration Solution
Pick stabilizing initial control policy $h_0(x)$
Policy evaluation (find cost, Bellman eq.): $0 = \left(\frac{\partial V_j}{\partial x}\right)^T f(x, h_j(x)) + r(x, h_j(x))$, $V_j(0) = 0$
Policy improvement (update control): $h_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_j}{\partial x}$
Full system dynamics must be known. Off-line solution.
• Convergence proved by Leake and Liu 1967, Saridis 1979 if Lyapunov eq. solved exactly
• Beard & Saridis used Galerkin integrals to solve Lyapunov eq.
• Abu-Khalaf & Lewis used NN to approx. V for nonlinear systems and proved convergence
M. Abu-Khalaf, F.L. Lewis, and J. Huang, “Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation,” IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, Dec. 2006.
Converges to solution of HJB: $0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4} \left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$
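For a linear system with quadratic cost, $V = x^T P x$ and the two policy iteration steps reduce to a Lyapunov equation and a gain update (Kleinman's algorithm). The sketch below is an illustrative linear-quadratic specialization, not the talk's nonlinear algorithm. It assumes the F-16 short-period matrices commonly quoted for the pitch-rate example, including their signs, with Q = I and R = 1, and starts from the zero gain, which is admissible because this A is stable.

```python
import numpy as np

def lyap(Ac, Qc):
    """Solve Ac^T P + P Ac + Qc = 0 for P through a Kronecker-product linear system."""
    n = Ac.shape[0]
    I = np.eye(n)
    M = np.kron(I, Ac.T) + np.kron(Ac.T, I)
    P = np.linalg.solve(M, -Qc.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against rounding

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Offline policy iteration for LQR; needs the full model (A, B)."""
    K = K0
    for _ in range(iters):
        Ac = A - B @ K                    # closed loop under u = -K x
        P = lyap(Ac, Q + K.T @ R @ K)     # policy evaluation (Bellman eq.)
        K = np.linalg.solve(R, B.T @ P)   # policy improvement
    return P, K

# Assumed F-16 short-period model (signs are an assumption of this sketch).
A = np.array([[-1.01887, 0.90506, -0.00215],
              [0.82225, -1.07741, -0.17555],
              [0.0, 0.0, -1.0]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.eye(3)
R = np.eye(1)

P, K = policy_iteration(A, B, Q, R, K0=np.zeros((1, 3)))
are_residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("P =\n", np.round(P, 4))
print("ARE residual norm:", np.linalg.norm(are_residual))
```

The residual printed at the end measures how well the converged P satisfies the algebraic Riccati equation, which is the linear-quadratic form of the HJB equation.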
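Lemma 1 (Draguna Vrabie)
Integral reinforcement form (IRL) for the CT Bellman eq.: $V(x(t)) = \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T))$, $V(0) = 0$
Is equivalent to $0 = \left(\frac{\partial V}{\partial x}\right)^T f(x, u) + r(x, u) \equiv H(x, \frac{\partial V}{\partial x}, u)$, $V(0) = 0$
Solves Bellman equation without knowing f(x,u)
Allows definition of temporal difference error for CT systems: $e(t) = -V(x(t)) + \int_t^{t+T} r(x, u)\, d\tau + V(x(t+T))$
Key Idea: split the value, $V(x(t)) = \int_t^\infty r(x, u)\, d\tau = \int_t^{t+T} r(x, u)\, d\tau + \int_{t+T}^\infty r(x, u)\, d\tau$
Bad Bellman Equation: the differential form, which contains the dynamics f(x,u)
Good Bellman Equation: the integral reinforcement form, which does not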
Integral Reinforcement Learning (IRL): Draguna Vrabie
IRL Policy iteration
Initial stabilizing control is needed
Policy evaluation, IRL Bellman equation (cost update): $V_k(x(t)) = \int_t^{t+T} r(x, u_k)\, dt + V_k(x(t+T))$
f(x) and g(x) do not appear
Equivalent to the CT Bellman eq. $0 = \left(\frac{\partial V}{\partial x}\right)^T f(x, u) + r(x, u) \equiv H(x, \frac{\partial V}{\partial x}, u)$
Solves Bellman eq. (nonlinear Lyapunov eq.) without knowing system dynamics
Policy improvement (control gain update): $u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$
g(x) needed for control update
Converges to solution to HJB eq.: $0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4} \left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$
D. Vrabie proved convergence to the optimal value and control (Automatica 2009, Neural Networks 2009)
Approximate Dynamic Programming: Real-time Implementation
Direct Optimal Adaptive Control for Partially Unknown CT Systems
Value Function Approximation (VFA) to Solve Bellman Equation
$V_k(x(t)) = \int_t^{t+T} (Q(x) + u_k^T R u_k)\, dt + V_k(x(t+T))$
Approximate value by Weierstrass approximator network: $V = W^T \phi(x)$
$W_k^T \phi(x(t)) = \int_t^{t+T} (Q(x) + u_k^T R u_k)\, dt + W_k^T \phi(x(t+T))$
$W_k^T \left[ \phi(x(t)) - \phi(x(t+T)) \right] = \int_t^{t+T} (Q(x) + u_k^T R u_k)\, dt$
The bracketed term is the regression vector; the right-hand side is the reinforcement on time interval [t, t+T].
Scalar equation with vector unknowns. Same form as standard system ID problems.
Now use RLS along the trajectory to get new weights $W_k$
Then find updated FB: $u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x} = -\frac{1}{2} R^{-1} g^T(x) \left[ \frac{\partial \phi(x(t))}{\partial x(t)} \right]^T W_k$
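Solving the IRL Bellman Equation
Solve for value function parameters, e.g. $P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$, $W^T = \begin{bmatrix} p_{11} & p_{12} & p_{22} \end{bmatrix}$
Need data from 3 time intervals to get 3 equations to solve for 3 unknowns:
$W_k^T \left[ \phi(x(t)) - \phi(x(t+T)) \right] = \int_t^{t+T} (Q(x) + u_k^T R u_k)\, dt$
$W_k^T \left[ \phi(x(t+T)) - \phi(x(t+2T)) \right] = \int_{t+T}^{t+2T} (Q(x) + u_k^T R u_k)\, dt$
$W_k^T \left[ \phi(x(t+2T)) - \phi(x(t+3T)) \right] = \int_{t+2T}^{t+3T} (Q(x) + u_k^T R u_k)\, dt$
Now solve by batch least-squares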
Integral Reinforcement Learning (IRL)
Solve Bellman equation: solves Lyapunov eq. without knowing dynamics
$W_k^T \left[ \phi(x(t)) - \phi(x(t+T)) \right] = \int_t^{t+T} x^T(\tau) (Q + K_k^T R K_k) x(\tau)\, d\tau \equiv \rho(t, t+T)$
Data set at time [t, t+T): $\left( x(t),\ \rho(t, t+T),\ x(t+T) \right)$
On each interval $(t, t+T)$, $(t+T, t+2T)$, $(t+2T, t+3T)$: observe the state at the start of the interval, apply $u_k = -K_k x$, observe the cost integral, update P
Do RLS until convergence to $P_k$, or use batch least-squares
Update control gain: $K_{k+1} = R^{-1} B^T P_k$
This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.
A is not needed anywhere
Continuous-time control with discrete gain updates: $u_k(t) = -K_k x(t)$
[Figure: gain $K_k$ versus time t, with gain updates (policy) at steps k = 0, 1, ..., 5 and continuous control in between]
Reinforcement intervals T need not be the same. They can be selected on-line in real time.
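The IRL loop for the linear-quadratic case can be sketched as follows. This is a minimal illustration under several assumptions, not the original implementation: the plant is the F-16 short-period model commonly quoted for the pitch-rate example (signs assumed), Q = I and R = 1, batch least squares replaces RLS, and each data interval restarts from a random state as a simple stand-in for persistence of excitation. The matrix A appears only inside the simulator `run_interval`; the learner sees x(t), x(t+T), the measured cost integral, and B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plant: used only to generate measurements (assumed F-16 model and signs).
A = np.array([[-1.01887, 0.90506, -0.00215],
              [0.82225, -1.07741, -0.17555],
              [0.0, 0.0, -1.0]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.eye(3)
R = np.eye(1)

def basis(x):
    """Quadratic basis: W^T basis(x) = x^T P x with W = [p11, 2p12, 2p13, p22, 2p23, p33]."""
    x1, x2, x3 = x
    return np.array([x1 * x1, x1 * x2, x1 * x3, x2 * x2, x2 * x3, x3 * x3])

def run_interval(x0, K, T=0.1, dt=1e-3):
    """RK4 simulation of the plant plus the integral-reinforcement state under u = -K x."""
    def f(z):
        x = z[:3]
        u = -K @ x
        return np.append(A @ x + B @ u, x @ Q @ x + u @ R @ u)
    z = np.append(x0, 0.0)  # last entry accumulates the cost integral
    for _ in range(int(round(T / dt))):
        k1 = f(z)
        k2 = f(z + 0.5 * dt * k1)
        k3 = f(z + 0.5 * dt * k2)
        k4 = f(z + dt * k3)
        z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z[:3], z[3]

def weights_to_P(W):
    return np.array([[W[0], W[1] / 2, W[2] / 2],
                     [W[1] / 2, W[3], W[4] / 2],
                     [W[2] / 2, W[4] / 2, W[5]]])

K = np.zeros((1, 3))  # initial stabilizing gain (the open loop is stable here)
for k in range(10):
    Phi, rho = [], []
    for _ in range(15):                          # 15 data intervals per update
        x0 = rng.standard_normal(3)
        xT, cost = run_interval(x0, K)
        Phi.append(basis(x0) - basis(xT))        # regression vector
        rho.append(cost)                         # measured reinforcement
    W, *_ = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)
    P = weights_to_P(W)                          # policy evaluation, no A used
    K = np.linalg.solve(R, B.T @ P)              # gain update needs only B

print("learned P =\n", np.round(P, 4))
```

Restarting from random states is a simplification of this sketch; along a single decaying trajectory the regression vectors lose excitation, which is why the choice of the reinforcement interval T matters in practice.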
Persistence of Excitation
$W_k^T \left[ \phi(x(t)) - \phi(x(t+T)) \right] = \int_t^{t+T} (Q(x) + u_k^T R u_k)\, dt$
Regression vector must be PE
Relates to choice of reinforcement interval T
Implementation: policy evaluation, need to solve online
$W_k^T \left[ \phi(x(t)) - \phi(x(t+T)) \right] = \int_t^{t+T} x^T(\tau) (Q + K_k^T R K_k) x(\tau)\, d\tau \equiv \rho(t, t+T)$
Add a new state = integral reinforcement: $\dot{\rho} = x^T Q x + u^T R u$
This is the controller dynamics or memory
Direct Optimal Adaptive Controller (Draguna Vrabie)
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval
[Figure: dynamic control system with memory. The system $\dot{x} = Ax + Bu$ is driven by the actor gain K; the critic integrates $\dot{\rho} = x^T Q x + u^T R u$ and samples it through a ZOH with period T to form V.]
Run RLS or use batch L.S. to identify value of current control
Update FB gain after critic has converged
Reinforcement interval T can be selected on line on the fly; it can change
Solves Riccati equation online without knowing A matrix
OPTIMAL CONTROL IS RESILIENT, ROBUST, HAS GUARANTEED STABILITY MARGINS
Kung Tz (孔子, Confucius), 500 BC
Archery, chariot driving, music, rites and rituals, poetry, mathematics
Man’s relations to: family, friends, society, nation, emperor, ancestors
Tian xia da tong: harmony under heaven
124 BC - Han Imperial University in Chang-an
Actor / Critic structure for CT Systems
Optimal Adaptive IRL for CT systems: a new structure of adaptive controllers (D. Vrabie, 2009)
$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\, dt + V_k(x(t+T))$
$u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$
Reinforcement learning: theta waves 4-8 Hz
Motor control: 200 Hz
Oscillation is a fundamental property of neural tissue
Brain has multiple adaptive clocks with different timescales
Theta rhythm, hippocampus, thalamus, 4-10 Hz: sensory processing, memory and voluntary control of movement
Gamma rhythms 30-100 Hz, hippocampus and neocortex: high cognitive activity
• consolidation of memory
• spatial mapping of the environment – place cells
The high frequency processing is due to the large amounts of sensorial data to be processed
[Figure: brain diagram (Doya, Kimura, Kawato 2001). Limbic system: theta rhythms 4-10 Hz, deliberative evaluation. Spinal cord: motor control 200 Hz.]
D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller- A Continuous-Time Formulation -,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
OPTIMALITY AND RESILIENCE OF BIOLOGICAL SYSTEMS
Simulation 1: F-16 aircraft pitch rate controller (Stevens and Lewis 2003)
$\dot{x} = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u$
$x = [\alpha \ \ q \ \ \delta_e]^T$
$Q = I$, $R = I$
Select quadratic NN basis set for VFA
Exact solution of the ARE $0 = PA + A^T P + Q - P B R^{-1} B^T P$:
$W_1^{*T} = [p_{11} \ \ 2p_{12} \ \ 2p_{13} \ \ p_{22} \ \ 2p_{23} \ \ p_{33}]^T = [1.4245 \ \ 1.1682 \ \ -0.1352 \ \ 1.4349 \ \ -0.1501 \ \ 0.4329]^T$
Simulations on: F-16 autopilot
[Figure: control signal, controller parameters and system states versus time (0-2 s); critic parameters P(1,1), P(1,2), P(2,2) versus time (0-6 s), converging to their optimal values]
A matrix not needed
Converges to the steady-state Riccati equation solution
Solves ARE online without knowing A
Simulation 2: Load Frequency Control of Electric Power System (A. Al-Tamimi, D. Vrabie, Youyi Wang)
$\dot{x} = Ax + Bu$, $x(t) = [\Delta f(t) \ \ \Delta P_g(t) \ \ \Delta X_g(t) \ \ \Delta E(t)]^T$
States: frequency, generator output, governor position, integral control
$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix}$
ARE solution using full dynamics model (A,B), $0 = PA + A^T P + Q - P B R^{-1} B^T P$:
$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751 \\ 0.4766 & 0.7831 & 0.1237 & 0.3829 \\ 0.0601 & 0.1237 & 0.0513 & 0.0298 \\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$
IRL period of T = 0.1 s. Fifteen data points $(x(t), x(t+T), \rho(t : t+T))$. Hence, the value estimate was updated every 1.5 s.
$P_{critic\ NN} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754 \\ 0.4768 & 0.7887 & 0.1239 & 0.3834 \\ 0.0603 & 0.1239 & 0.0567 & 0.0300 \\ 0.4754 & 0.3843 & 0.0300 & 2.3433 \end{bmatrix}$
Solves ARE online without knowing A
[Figure: P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) versus time (0-60 s); system states versus time (0-6 s)]
Adaptive Control
[Figure: plant with control input and output]
Identify the controller: Direct Adaptive
Identify the system model: Indirect Adaptive
Identify the performance value: Optimal Adaptive, $V(x) = W^T \phi(x)$
OPTIMALITY YIELDS RESILIENCE, ROBUSTNESS, PERFORMANCE GUARANTEES
Games on Communication Graphs
Sun Tz bin fa (孙子兵法), 500 BC
Fast decisions using the new concept of Graphical Games to limit each agent’s decision horizon
Bounded rationality: speed up multi-agent decision-making by imposing sparse, efficient communication structures
Faster decisions based on local information
Under certain conditions, the local decisions are globally optimal
Nash Equilibrium on Graphs is defined
Manufacturing as the Interactions of Multiple Agents
Each machine has its own dynamics and cost function
Neighboring machines influence each other most strongly
There are local optimization requirements as well as global necessities
Each process has its own dynamics: $\dot{\delta}_i = A \delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$
And cost function: $J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2} \int_0^\infty \left( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \right) dt$
Each process helps other processes achieve optimality and efficiency
Graphical Games
Synchronization: Cooperative Tracker Problem
Node dynamics: $\dot{x}_i = A x_i + B_i u_i$, with $x_i(t) \in \mathbb{R}^n$, $u_i(t) \in \mathbb{R}^{m_i}$
Target generator dynamics: $\dot{x}_0 = A x_0$, with target state $x_0(t)$
Synchronization problem: $x_i(t) \to x_0(t), \ \forall i$
Local neighborhood tracking error (Lihua Xie): $\delta_i = \sum_{j \in N_i} e_{ij} (x_i - x_j) + g_i (x_i - x_0)$
Local nbhd. tracking error dynamics: $\dot{\delta}_i = A \delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$
Local agent dynamics driven by neighbors’ controls
Define local nbhd. performance index: $J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2} \int_0^\infty \left( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \right) dt \equiv \frac{1}{2} \int_0^\infty L_i(\delta_i(t), u_i(t), u_{-i}(t))\, dt$
Values driven by neighbors’ controls
Impose Optimality to get Resilience and Robustness
K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.
New Differential Graphical Game
$u_i$: control action of player i
State dynamics of agent i (local dynamics): $\dot{\delta}_i = A \delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$
Value function of player i (local value function): $J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2} \int_0^\infty \left( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \right) dt$
Only depends on graph neighbors
OPTIMAL CONTROL IS RESILIENT, ROBUST, HAS GUARANTEED STABILITY MARGINS
Standard Multi-Agent Differential Game
$u_i$: control action of player i
Central dynamics: $\dot{z} = Az + \sum_{i=1}^{N} B_i u_i$
Value function of player i: $J_i(z(0), u_i, u_{-i}) = \frac{1}{2} \int_0^\infty \left( z^T Q z + \sum_{j=1}^{N} u_j^T R_{ij} u_j \right) dt$
Central dynamics; the local value function depends on ALL other control actions
Team Interest vs. Self Interest; Cooperation vs. Collaboration
The objective functions of each player can be written as a team average term plus a conflict of interest term:
$J_1 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_1 - J_2) + \frac{1}{3}(J_1 - J_3) \equiv J_{team} + J_1^{coi}$
$J_2 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_2 - J_1) + \frac{1}{3}(J_2 - J_3) \equiv J_{team} + J_2^{coi}$
$J_3 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_3 - J_1) + \frac{1}{3}(J_3 - J_2) \equiv J_{team} + J_3^{coi}$
For N players: $J_i = \frac{1}{N} \sum_{j=1}^{N} J_j + \frac{1}{N} \sum_{j=1}^{N} (J_i - J_j) \equiv J_{team} + J_i^{coi}$, $i = 1, \ldots, N$
For N-player zero-sum games, the first term is zero, i.e. the players have no goals in common.
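The team/conflict decomposition is simple arithmetic and can be checked on made-up numbers. The sketch below uses hypothetical cost values; `team_and_conflict` is an illustrative helper name, not something from the talk.

```python
import numpy as np

def team_and_conflict(J):
    """Split each player's cost J_i into a team average and a conflict-of-interest term."""
    J = np.asarray(J, dtype=float)
    N = J.size
    J_team = J.sum() / N                                         # (1/N) * sum_j J_j
    J_coi = np.array([np.sum(J[i] - J) / N for i in range(N)])   # (1/N) * sum_j (J_i - J_j)
    return J_team, J_coi

# Hypothetical costs for three players.
J = np.array([3.0, 5.0, 10.0])
J_team, J_coi = team_and_conflict(J)
print(J_team)  # 6.0
print(J_coi)   # [-3. -1.  4.]

# Zero-sum example: the costs sum to zero, so the team term vanishes
# and each player's cost is pure conflict of interest.
J_zs = np.array([2.0, -0.5, -1.5])
J_team_zs, J_coi_zs = team_and_conflict(J_zs)
print(J_team_zs)  # 0.0
```

In both cases `J_team + J_coi` reproduces the original costs, and the conflict-of-interest terms sum to zero across players.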
New Definition of Nash Equilibrium for Graphical Games
To restore the symmetry of Nash equilibrium
Def: Interactive Nash equilibrium
Policies $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in Interactive Nash equilibrium if
1. $J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*), \ \forall i \in N$, i.e. they are in Nash equilibrium
2. There exists a policy $u_j$ such that $J_i(u_j, u_{G-j}^*) \ne J_i(u_j^*, u_{G-j}^*), \ \forall i, j \in N$
That is, every player can find a policy that changes the value of every other player.
Theorem 3. Let $(A, B_i)$ be reachable for all i. Let agent i be in local best response, $J_i(u_i^*, u_{-i}) \le J_i(u_i, u_{-i}), \ \forall i$.
Then $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in global Interactive Nash iff the graph is strongly connected.
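Graphical Game Solution Using Integral Reinforcement Learning
Value function: $V_i(\delta_i(t)) = \frac{1}{2} \int_t^\infty \left( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \right) dt$
Differential equivalent (Leibniz formula) is Bellman’s equation:
$H_i(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i, u_{-i}) \equiv \left(\frac{\partial V_i}{\partial \delta_i}\right)^T \left( A \delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j \right) + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} u_i^T R_{ii} u_i + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j = 0$
Stationarity condition: $0 = \frac{\partial H_i}{\partial u_i} \Rightarrow u_i = -(d_i + g_i) R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i}$
1. Coupled HJ equations, $H_i(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}^*) = 0$:
$\left(\frac{\partial V_i}{\partial \delta_i}\right)^T A_i^c + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} (d_i + g_i)^2 \left(\frac{\partial V_i}{\partial \delta_i}\right)^T B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2} \sum_{j \in N_i} (d_j + g_j)^2 \left(\frac{\partial V_j}{\partial \delta_j}\right)^T B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j} = 0, \ i \in N$
where $A_i^c = A \delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \sum_{j \in N_i} e_{ij} (d_j + g_j) B_j R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j}, \ i \in N$
2. Best response HJ equations (other players have fixed policies $u_j$):
$0 = H_i(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}) \equiv \left(\frac{\partial V_i}{\partial \delta_i}\right)^T A_i^c + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} (d_i + g_i)^2 \left(\frac{\partial V_i}{\partial \delta_i}\right)^T B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j$
where $A_i^c = A \delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} - \sum_{j \in N_i} e_{ij} B_j u_j$
To find online methods for GG, iterate on these two equations (Bellman’s equation and the stationarity condition)
Online Solution of Graphical Games (Kyriakos Vamvoudakis)
Use Reinforcement Learning: POLICY ITERATION, convergence results
MULTI-AGENT LEARNING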
Distributed Autonomy in Active Energy Distribution Systems
Ali Davoudi, F.L. Lewis, Chris Edrington
Supported by ONR Tony Seman & Lynn Petersen
What is a micro-grid?
• A micro-grid is a small-scale power system that provides the power for a group of consumers.
• A micro-grid enables local power support for local and critical loads.
• A micro-grid has the ability to work in both grid-connected and islanded modes.
• A micro-grid facilitates the integration of Distributed Energy Resources (DER).
Photo from: http://www.horizonenergygroup.com
An introduction to micro-grids: Micro-grid applications
• The main building block of smart-grids
• Rural plants
• Business buildings, hospitals, and factories
Smart-grid photo from: http://www.sustainable‐sphere.com
Distributed Generators (DG), Distributed Energy Resources (DER)
• Non-renewables: internal combustion engine, micro-turbines, fuel cells
• Renewables: photovoltaic, wind, hydroelectric, biomass
Micro-grid Advantages
• A micro-grid provides high quality and reliable power to the critical consumers
• During main grid disturbances, a micro-grid can quickly disconnect from the main grid and provide reliable power for its local loads
• DGs can be simply installed close to the loads, which significantly reduces the power transmission line losses
• By using renewable energy resources, a micro-grid reduces CO2 emissions
Micro-grid Hierarchical Control Structure
Tertiary control: optimal operation in both operating modes; power flow control in grid-tied mode
Secondary control: voltage deviation mitigation; frequency deviation alleviation
Do coop. ctrl. here to synchronize frequency and voltage
Primary control: voltage stability provision; frequency stability preserving; plug and play capability for DGs
Maintains stability
[Figure: microgrid connected to the main grid through a tie]
Bidram, A., & Davoudi, A. (2012). Hierarchical structure of microgrids control system. IEEE Transactions on Smart Grid, vol. 3, pp. 1963-1976, Dec 2012.
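Cooperative Game-theoretic Control of Active Loads in DC Microgrids
Ling-ling Fan, Vahidreza Nasirian, Hamidreza Modares, Frank L. Lewis, Yong-duan Song, and Ali Davoudi
[Figure: power buffer operation during a step change in power demand, showing input power $p_{in}$, output power $p_{out}$ and stored energy e at times $t_1$, $t_2$, $t_3$]
Power buffers in Microgrid Network
[Figure: microgrid network with line resistances $r_{18}, r_{48}, r_{58}, r_{59}, r_{47}, r_{27}, r_{67}, r_{69}, r_{39}$, source voltage $v_{s1}$ and source resistance $r_{s1}$, and a buffer with bus voltage $v_i$, output power $p_i$, stored energy $e_i$ and control input $u_i$]
Active Load Power Buffer (Vahid Nasirian)
$\dot{e}_i = \frac{v_i^2}{r_i} - p_i, \quad \dot{r}_i = u_i$
$e_i$: stored energy; $r_i$: input impedance; $v_i$: bus voltage; $u_i$: control input; $p_i$: output power = a disturbance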
Define coupled performance indices $J_i$, $i = M+1, \ldots, M+N$
[Equation: each $J_i$ is an integral over $[0, \infty)$ of a quadratic state term $\mathbf{x}_i^T \mathbf{Q}_i \mathbf{x}_i$ plus control terms of agent i and of its neighbors $j \in N_i$]
Solve for bus voltage to get coupled agent dynamics, $i = M+1, \ldots, M+N$
[Equation: linear state-space dynamics for $\mathbf{x}_i$ with matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{D}$, control input $u_i$, disturbance $w_i$, and a coupling sum over the other agents $j \ne i$]
Define communication graph: sparse efficient topology
Optimal design provides resilience and disturbance rejection
Vahid Nasirian, Reza Modares, Dr. Ali Davoudi
Micro-grid secondary control: Distributed CPS structure
Work of Ali Bidram with Dr. A. Davoudi
Cyber Physical System (CPS)
Physical layer: the interconnect structure of the power grid
Cyber layer: a sparse, efficient communication network to allow cooperative control for synchronization of voltage and frequency
[Figure: microgrid with DG 1 to DG 8 (physical layer) and the cyber communication framework connecting DG 1 to DG 8 through communication links]
Resilience because of: sparse, efficient communication network; optimal design (graph games); distributed computation, no SPOF
Case Study
A 48 V DC distribution network with three active loads and three sources
Physical microgrid layer; cyber communication layer
[Figure, panels (a)-(h): response to a load change at terminal 5, with intervals A-D marked, plotted against time (1-9 s): MG voltage (V), buffer voltage (V), load voltage (V), source current (A), stored energy (J), resistance (Ω), output power (W), and stored energy (J) versus input resistance (Ω) for Buffer 4, Buffer 5 and Buffer 6. Plotted signals: $i_{s1}, i_{s2}, i_{s3}$; $v_{b4}, v_{b5}, v_{b6}$; $v_4, v_5, v_6$; $v_{o4}, v_{o5}, v_{o6}$; $e_4, e_5, e_6$; $r_4, r_5, r_6$; $p_4, p_5, p_6$]
Optimal Leader-Follower Trackers
CPS #2 – Resilient Human-robot Interaction Systems
On-policy RL
Target policy: the policy that we are learning about.
Behavior policy: the policy that generates actions and behavior.
Target policy and behavior policy are the same
[Figure: a single target-and-behavior policy block driving the system, with reference input]
Off-policy RL
Target policy and behavior policy are different
[Figure: the behavior policy drives the system from the reference while a separate target policy is learned]
Humans can learn optimal policies while actually applying suboptimal policies
Off-policy IRL (Yu Jiang & Zhong-Ping Jiang, Automatica 2012)
Humans can learn optimal policies while actually applying suboptimal policies
System: $\dot{x} = f(x) + g(x) u$
Value: $J(x) = \int_t^\infty r(x, u)\, d\tau$
On-policy IRL:
$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = -\int_{t-T}^{t} \left( Q(x) + u^{[i]T} R u^{[i]} \right) d\tau$
$u^{[i+1]} = -\frac{1}{2} R^{-1} g^T \nabla J_x^{[i]}$
Must know g(x)
Off-policy IRL: $\dot{x} = f + g u^{[i]} + g (u - u^{[i]})$
$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = -\int_{t-T}^{t} Q(x)\, d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\, d\tau + 2 \int_{t-T}^{t} u^{[i+1]T} R (u^{[i]} - u)\, d\tau$
This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$. They can be found simultaneously online from measured data, using the Kronecker product and VFA.
1. Completely unknown system dynamics
2. Can use applied u(t) for:
disturbance rejection: Z.P. Jiang; R. Song and Lewis, 2015
robust control: Y. Jiang & Z.P. Jiang, IEEE TCS 2012
exploring probing noise, without bias: J.Y. Lee, J.B. Park, Y.H. Choi 2012
DDO
LQ Tracker Problem for Continuous-time Systems
System dynamics: $\dot{x} = Ax + Bu$, $y = Cx$
Assume the reference trajectory is generated by $\dot{y}_d = F y_d$
Augmented system state: $X(t) = \begin{bmatrix} x(t)^T & y_d(t)^T \end{bmatrix}^T$
Augmented system: $\dot{X} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv TX + B_1 u$
Value function: $V(X(t)) = \frac{1}{2} \int_t^\infty e^{-\gamma(\tau - t)} \left[ X^T Q_T X + u^T R u \right] d\tau = \frac{1}{2} X(t)^T P X(t)$
$Q_T = C_1^T Q C_1$, $C_1 = [C \ \ -I]$
LQT Bellman equation: $0 = (TX + B_1 u)^T P X + X^T P (TX + B_1 u) - \gamma X^T P X + X^T Q_T X + u^T R u$
Online solution to the CT LQT ARE: Off‐policy IRL
1 1( )i
iX TX B u T X B K X u= + = + + ,
1
i
iT T B K= -
( )
( ) ( )1
1
( ) ( ) ( ) ( ) ( )
( ) 2 ( )
t TT T i T i t T i
t
t T t Tt T iT i t i T T i
Tt t
iRK
de X t T P X t T X t P X t e X P X d
d
e X Q K RK X d e u K X B P X d
g g t
g t g t
tt
t t
+- - -
+ +- - - -
+
+ + - = -
= - + + +
ò
ò ò
Off-policy IRL Bellman equation
Algorithm. Online off-policy IRL algorithm for LQT
Online step: Apply a fixed control input and collect some data.
Offline step: Policy evaluation and improvement using least squares (LS) on the collected data:
$$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_T + (K^i)^T R K^i \right) X \, d\tau + 2 \int_t^{t+T} e^{-\gamma(\tau - t)} (u + K^i X)^T R K^{i+1} X \, d\tau$$
No knowledge of the dynamics is required.
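The two steps can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the scalar plant, the multi-sine input, the interval lengths, and every function name are assumptions. The model matrices appear only inside the simulator that generates the data; the learning step sees only measured $X$ and $u$. The initial gain $K^0 = 0$ assumes $T - (\gamma/2) I$ is Hurwitz.

```python
import numpy as np

# Hypothetical plant and weights (assumptions for illustration only).
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]])                      # constant reference y_d
Q = np.array([[1.0]]); R = np.array([[1.0]]); gamma = 0.1
n, p, m = 1, 1, 1
N = n + p
T_aug = np.block([[A, np.zeros((n, p))], [np.zeros((p, n)), F]])
B1 = np.vstack([B, np.zeros((p, m))])
C1 = np.hstack([C, -np.eye(p)])
QT = C1.T @ Q @ C1
iu = np.triu_indices(N)

def quad_basis(X):
    """Vector b(X) such that X^T P X = b(X) . P[iu] for symmetric P."""
    M = np.outer(X, X)
    return (2.0 * M - np.diag(np.diag(M)))[iu]

def behavior_input(t):
    """Fixed exploratory input applied to the plant (assumed multi-sine)."""
    return np.array([np.sin(t) + np.sin(3.1 * t) + np.cos(5.3 * t)])

def collect_data(X0, T_int=0.2, n_int=40, n_sub=100):
    """Online step: simulate with RK4 and store the discounted integrals."""
    h = T_int / n_sub
    dq, Ixx, Iux = [], [], []
    X, t0 = X0.copy(), 0.0

    def deriv(tau, z):
        Xs, u = z[:N], behavior_input(tau)
        disc = np.exp(-gamma * (tau - t0))
        return np.concatenate([T_aug @ Xs + B1 @ u,
                               disc * np.kron(Xs, Xs),
                               disc * np.kron(u, Xs)])

    for _ in range(n_int):
        z = np.concatenate([X, np.zeros(N * N + m * N)])
        tau = t0
        for _ in range(n_sub):
            k1 = deriv(tau, z)
            k2 = deriv(tau + h / 2, z + h / 2 * k1)
            k3 = deriv(tau + h / 2, z + h / 2 * k2)
            k4 = deriv(tau + h, z + h * k3)
            z = z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            tau += h
        Xn = z[:N]
        dq.append(np.exp(-gamma * T_int) * quad_basis(Xn) - quad_basis(X))
        Ixx.append(z[N:N + N * N])
        Iux.append(z[N + N * N:])
        X, t0 = Xn, t0 + T_int
    return np.array(dq), np.array(Ixx), np.array(Iux)

def off_policy_irl(dq, Ixx, Iux, iters=10):
    """Offline step: least squares for P^i and K^{i+1}, reusing the same data."""
    nP = len(iu[0])
    K = np.zeros((m, N))                   # assumed admissible initial gain
    for _ in range(iters):
        RK = R @ K
        phi_K = Iux @ np.kron(R, np.eye(N)).T + Ixx @ np.kron(RK, np.eye(N)).T
        Phi = np.hstack([dq, -2.0 * phi_K])
        rhs = -Ixx @ (QT + K.T @ RK).reshape(-1)
        theta = np.linalg.lstsq(Phi, rhs, rcond=None)[0]
        P = np.zeros((N, N))
        P[iu] = theta[:nP]
        P = P + P.T - np.diag(np.diag(P))
        K = theta[nP:].reshape(m, N)
    return P, K

dq, Ixx, Iux = collect_data(np.array([0.5, 1.0]))
P, K = off_policy_irl(dq, Ixx, Iux)
```

The data set is collected once under the exploratory input and reused in every iteration, which is what makes the scheme off-policy: the target gains $K^i$ are never applied to the plant.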
Convergence
Lemma: The Bellman equation, the IRL Bellman equation and the off-policy IRL
Bellman equation have the same solution for the value function and the control
update law.
Theorem. The IRL Algorithm and Off-policy IRL Algorithm for the LQT problem
converge to the optimal solution.
Optimal Leader-Follower Trackers
CPS #2 – Resilient Human-robot Interaction Systems
PR2 meets Isura
H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, to appear 2015.
RL for Human-Robot Interactions
[Block diagram: Human, Feedforward Control, Prescribed Impedance Model, Torque Design, Robot Manipulator, Performance]
Robot-specific inner control loop
Task-specific outer-loop control design using IRL optimal tracking
Implemented on PR2 robot
2-Loop HRI Learning Controller Based on Human Factors Performance Studies
Humans learn 2 components in HRI:
1. must learn to compensate for nonlinear robot dynamics
2. must learn to perform the task
Our result: The robot adapts in order to minimize human effort
Robot control inner loop using Neural Adaptive Learning
Task control outer loop using Online Reinforcement Learning
Human-Robot Interaction Using Reinforcement Learning
Modeling and Control of Adversarial Networked Teams
Bipartite Synchronization on Antagonistic Communication Graphs
Zero-Sum Games
Multi-agent Pursuit-Evasion Games
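H-Infinity Control Using Neural Networks
L2 Gain Problem
System:
$$\dot{x} = f(x) + g(x)\,u + k(x)\,d, \qquad y = h(x), \qquad z = \begin{bmatrix} y^T & u^T \end{bmatrix}^T$$
Control: $u = l(x)$
[Block diagram: system with control input u, disturbance input d, state x, and performance output z]
Find control u(t) so that
$$\frac{\int_0^{\infty} \|z(t)\|^2\,dt}{\int_0^{\infty} \|d(t)\|^2\,dt} = \frac{\int_0^{\infty} \left( h^T h + \|u\|^2 \right) dt}{\int_0^{\infty} \|d(t)\|^2\,dt} \le \gamma^2$$
for all L2 disturbances and a prescribed gain $\gamma^2$.
Zero-sum differential game: Nature as the opposing player
Disturbance Rejection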
Two Antagonistic Inputs
Define 2-player zero-sum game as
$$V^*(x(0)) = \min_u \max_d V(x(0),u,d) = \min_u \max_d \int_0^{\infty} \left( h^T(x)\,h(x) + u^T R u - \gamma^2 \|d\|^2 \right) dt$$
The game has a unique value (saddle-point solution) iff the Nash condition holds:
$$\min_u \max_d V(x(0),u,d) = \max_d \min_u V(x(0),u,d)$$
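The saddle-point property can be checked numerically on a scalar linear example. This is a sketch under stated assumptions: the plant $\dot{x} = a x + b u + k d$, the integrand $q x^2 + r u^2 - \gamma^2 d^2$, the numeric values, and the restriction to linear policies $u = -\alpha x$, $d = \beta x$ are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical scalar game: x' = a x + b u + k d,
# cost integrand q x^2 + r u^2 - gamma^2 d^2.
a, b, k, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 1.0, 2.0

def game_riccati(a, b, k, q, r, gamma):
    """Positive root p of 2 a p + q - p^2 (b^2/r - k^2/gamma^2) = 0; V*(x) = p x^2."""
    c = b**2 / r - k**2 / gamma**2
    if c <= 0:
        raise ValueError("this sketch assumes b^2/r > k^2/gamma^2")
    return (a + np.sqrt(a**2 + q * c)) / c

def cost(alpha, beta, x0=1.0):
    """Cost of the linear policies u = -alpha x, d = beta x (stable closed loop)."""
    ac = a - b * alpha + k * beta
    if ac >= 0:
        raise ValueError("closed loop is not stable")
    # x(t) = x0 exp(ac t), so the integral of x^2 over [0, inf) is x0^2 / (-2 ac)
    return (q + r * alpha**2 - gamma**2 * beta**2) * x0**2 / (-2.0 * ac)

p = game_riccati(a, b, k, q, r, gamma)
alpha_star = b * p / r            # u* = -(b/r) p x
beta_star = k * p / gamma**2      # d* = (k/gamma^2) p x
J_star = cost(alpha_star, beta_star)
```

At the pair (alpha_star, beta_star), deviating in the disturbance gain can only lower the cost and deviating in the control gain can only raise it, which is the Nash condition restricted to linear policies.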
Optimal Game Design Yields
Resilience
Robustness
Guaranteed Performance
Disturbance Rejection
control node v
Cooperative and Adversarial Multi-agent Network
Adversarial disturbance at each agent
Agents want to cooperate to synchronize to the control leader
Zero-sum Graphical Games
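H-inf Cooperative Synchronization
Agent dynamics:
$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$
Leader:
$$\dot{x}_0 = A x_0, \qquad y_0 = C x_0$$
System, in terms of the synchronization error $\delta = x - (\mathbf{1} \otimes I)\,x_0$:
$$\dot{\delta} = (I \otimes A)\,\delta + (I \otimes B)\,u + (I \otimes D)\,d, \qquad y_\delta = (I \otimes C)\,\delta$$
Performance output:
$$\|z(t)\|^2 = \delta^T Q \delta + u^T R u$$
Bounded L2 synchronization error:
$$\frac{\int_0^{\infty} \|z(t)\|^2\,dt}{\int_0^{\infty} \|d(t)\|^2\,dt} = \frac{\int_0^{\infty} \left( \delta^T Q \delta + u^T R u \right) dt}{\int_0^{\infty} d^T(t)\,T\,d(t)\,dt} \le \gamma^2$$
H-inf performance index:
$$J(u,d) = \int_0^{\infty} \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$
Resilience to Network Disturbances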
Condition on graph topology:
$$P_1 = R_1 (L + G)$$
Condition for existence of local OPFB:
$$R_2 K_2 C = B^T P_2 + M_2$$
Kristian Hengster-Movric
H-inf Cooperative Synchronization
J. Gadewadikar, Frank L. Lewis, L. Xie, V. Kucera and M. Abu-Khalaf, “Parameterization of all stabilizing H∞ static state-feedback gains: Application to output-feedback design,” Automatica, vol. 43, no. 9, pp. 1597-1604, September 2007
Condition on graph topology:
$$P_1 = R_1 (L + G)$$
Condition for existence of local OPFB:
$$R_2 K_2 C = B^T P_2 + M_2$$
$$e_y = -\left( (L + G) \otimes C \right) \delta$$
Fine local OPFB structure; coarse global OPFB structure
Symmetry between OPFB and Graph Topology Information Restrictions
Same Same
Communication Network Limits Available Information Exchange
Communication Network Tries to Destroy Optimality
Proper Controls Design Restores Optimality
$$u_i = cKC\,\varepsilon_i = cK\,e_{yi} = cK \Big[ \sum_{j \in N_i} e_{ij}\,(y_j - y_i) + g_i\,(y_0 - y_i) \Big]$$
$$y_i = C x_i$$
Network info flow restrictions
Local system measurement restrictions
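Both restrictions can be seen in a small numerical check: the local output-feedback law, which uses only neighbor outputs, coincides with a global Kronecker-product form whose gain inherits the sparsity of the graph. This is a sketch under assumptions: the 3-agent graph, the matrices C and K, the gain c, and the function names are hypothetical, and the sign convention takes $\delta_i = x_i - x_0$.

```python
import numpy as np

# Hypothetical 3-agent graph: e_ij are edge weights, g_i are pinning gains.
E = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
G = np.diag([1.0, 0.0, 0.0])          # only agent 1 hears the leader
L = np.diag(E.sum(axis=1)) - E        # graph Laplacian
C = np.array([[1.0, 0.0]])            # each agent measures its first state only
K = np.array([[2.0]])                 # hypothetical OPFB gain
c = 1.5                               # coupling gain

def local_control(x, x0):
    """u_i = c K [sum_j e_ij (y_j - y_i) + g_i (y_0 - y_i)] from neighbor outputs."""
    y, y0 = x @ C.T, C @ x0
    n_agents = x.shape[0]
    u = np.zeros((n_agents, K.shape[0]))
    for i in range(n_agents):
        e_yi = sum(E[i, j] * (y[j] - y[i]) for j in range(n_agents))
        e_yi = e_yi + G[i, i] * (y0 - y[i])
        u[i] = c * K @ e_yi
    return u

def global_control(x, x0):
    """u = -c ((L + G) kron (K C)) delta, with delta_i = x_i - x_0."""
    delta = (x - x0).reshape(-1)
    return (-c * np.kron(L + G, K @ C) @ delta).reshape(x.shape[0], -1)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2))       # stacked agent states, one row per agent
x0 = rng.standard_normal(2)           # leader state
u_local = local_control(x, x0)
u_global = global_control(x, x0)
```

The block of the global gain linking agents 1 and 3 is zero because they are not neighbors, so the control is distributed in the sense that each agent needs only information the graph allows it to receive.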
2-player Zero-sum Nash Equilibrium
Command generator:
$$\dot{x}_0 = A x_0, \qquad y_0 = C x_0$$
Agent dynamics:
$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$
Nash Equilibrium
H-inf performance index:
$$J(u,d) = \int_0^{\infty} \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$
Resilience to Network Disturbances
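Theorem 3. Under the hypotheses of Theorem 2, suppose that in addition the matrix $M_2$ satisfies
$$C^T (K_2 - K_2^*)^T R_2 (K_2 - K_2^*) C + M_2^T (K_2 - K_2^*) C + C^T (K_2 - K_2^*)^T M_2 \ge 0$$
for any $K_2 \ne K_2^*$. Then a Nash equilibrium of
$$J(u,d) = \int_0^{\infty} \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$
with Q, R, T given in Theorem 2, is provided by
$$u^* = -c \left( (L + G) \otimes K_2^* C \right) \delta, \qquad d^* = \gamma^{-2}\, T^{-1} D^T P\, \delta$$
where $\delta$ is the synchronization error and
$$P = P_1 \otimes P_2, \qquad P_1 = c R_1 (L + G)$$
$$A^T P_2 + P_2 A + Q_2 - P_2 B R_2^{-1} B^T P_2 + \gamma^{-2} P_2 D D^T P_2 + M_2^T R_2^{-1} M_2 = 0$$
$$K_2^* C = R_2^{-1} \left( B^T P_2 + M_2 \right)$$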
Kristian Hengster-Movric
Optimal H-inf control is distributed
Worst case disturbance is NOT distributed
2-player Zero-sum Nash Equilibrium
Agent dynamics:
$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$
Our revels now are ended. These our actors,
As I foretold you, were all spirits, and
Are melted into air, into thin air.
The cloud-capped towers, the gorgeous palaces,
The solemn temples, the great globe itself,
Yea, all which it inherit, shall dissolve,
And, like this insubstantial pageant faded,
Leave not a rack behind.
We are such stuff as dreams are made on, and our little life is rounded with a sleep.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare