
Resilient Control in Cooperative and Adversarial Multi-agent Networks

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

F.L. Lewis National Academy of Inventors

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by:
ONR: Tony Seman, Lynn Petersen, Marc Steinberg
US TARDEC: Dariusz Mikulski
AFOSR Europe: Kevin Bollino
NSF

Reinforcement Learning for Resilient Control in Cooperative and Adversarial Multi-agent Networks:

CPS Applications in Microgrid and Human-Robot Interactions

And Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Supported by: China NNSF, China Project 111

Invited by Chaoli Wang

J.J. Finnigan, Complex science for a complex world

The Internet

Ecosystem

Professional collaboration network

Barcelona rail network

Structure of Natural and Manmade Systems

Peer-to-Peer Relationships in networked systems

Distributed Decision & Control

Local nature of Physical Laws

Clusters of galaxies

Distributed Adaptive Control for Multi‐Agent Systems

Flocking (Reynolds, Computer Graphics 1987)

Reynolds’ Rules:

Alignment: align headings, $\dot\theta_i = \sum_{j \in N_i} a_{ij}(\theta_j - \theta_i)$

Cohesion: steer towards the average position of neighbors, i.e. towards the c.g.

Separation: steer to maintain separation from neighbors

Nature and Biological Systems are Resilient !

Resilient Control in Cooperative Multi-agent Networks
Modeling and Control of Adversarial Networked Teams
CPS Applications:
Distributed Control of Renewable Energy Microgrids
Shared Learning in Human-Robot Interactions

Multi-agent Networks on Communication Graphs
Robustness of Optimal Design
Reinforcement Learning
Cooperative Agents: Games on Communication Graphs
CPS #1: Resilient Distributed Control of Renewable Energy Microgrids

F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2014

Key Point

Lyapunov Functions and Performance Indices Must Depend on Graph Topology

Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.

H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.

Synchronized Motion of Biological Groups

Fish school

Birds flock

Locusts swarm

Fireflies synchronize

Nature and Biological Systems are Resilient

The Power of Synchronization: Coupled Oscillators, Diurnal Rhythm

Diameter = greatest distance (shortest-path length) between any two nodes

Volume = sum of in-degrees, $\mathrm{Vol} = \sum_{i=1}^{N} d_i$

Spanning tree: a tree that reaches every node from a root node

Strongly connected if for all nodes i and j there is a path from i to j.

Tree: every node except the root has in-degree = 1. The root is the leader node; the other nodes are followers.

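Communication Graph $G = (V, E)$ with N nodes

Adjacency matrix $A = [a_{ij}]$, with $a_{ij} > 0$ if $(v_j, v_i) \in E$, i.e. if $j \in N_i$, and $a_{ij} = 0$ otherwise

$N_i$ = in-neighbors of node i. Row sum = in-degree $d_i = \sum_{j=1}^{N} a_{ij}$ (John Baras: social standing)

$N_i^o$ = out-neighbors of node i. Column sum = out-degree $d_i^o = \sum_{j=1}^{N} a_{ji}$

Adjacency matrix of the 6-node example graph (the entry $a_{42}$ weights the edge from node 2 to node 4):

$$A = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}$$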

Algebraic Graph Theory

$L = D - A$ = graph Laplacian matrix

$A = [a_{ij}]$ = adjacency matrix

$D = \mathrm{diag}\{d_1, \ldots, d_N\}$ = in-degree matrix
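As a small illustration of these definitions, the following sketch builds the in-degree matrix and graph Laplacian for the 6-node adjacency matrix given earlier, using NumPy. The variable names are illustrative choices, not taken from the talk.

```python
import numpy as np

# Adjacency matrix of the 6-node example graph from the slides.
# Row i lists the in-neighbors of node i: A[i, j] = 1 if node i hears from node j.
A = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

in_degree = A.sum(axis=1)     # row sums d_i
out_degree = A.sum(axis=0)    # column sums d_i^o
volume = in_degree.sum()      # Vol = sum of in-degrees

D = np.diag(in_degree)        # in-degree matrix
L = D - A                     # graph Laplacian

eigenvalues = np.linalg.eigvals(L)
print("in-degrees :", in_degree)
print("out-degrees:", out_degree)
print("volume     :", volume)
print("eig(L)     :", np.round(eigenvalues, 4))
```

Every row of L sums to zero, so the vector of ones is always a right eigenvector for eigenvalue 0. This particular graph is strongly connected, so that zero eigenvalue is simple and the remaining eigenvalues have positive real parts.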

Graph Eigenvalues for Different Communication Topologies

Directed tree: chain of command

Directed ring: gossip network, OSCILLATIONS

Directed graph

Undirected graph

[Figures: the 6-node example graphs for each topology]

Synchronization on Good Graphs

Chris Elliott fast video

Mesh graph, 4 neighbors

Synchronization on Gossip Rings

Chris Elliott weird video

Ring graph or cycle

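Dynamic Graph: the Distributed Structure of Control

Each node has an associated state $\dot x_i = u_i$

Standard local voting protocol: $u_i = \sum_{j \in N_i} a_{ij}(x_j - x_i)$

$u_i = -x_i \sum_{j \in N_i} a_{ij} + \sum_{j \in N_i} a_{ij} x_j = -d_i x_i + \begin{bmatrix} a_{i1} & \cdots & a_{iN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$

With $x = [x_1 \ \cdots \ x_N]^T$, $u = [u_1 \ \cdots \ u_N]^T$, $D = \mathrm{diag}\{d_1, \ldots, d_N\}$ and $A = [a_{ij}]$:

$u = -Dx + Ax = -(D - A)x = -Lx$, where $L = D - A$ = graph Laplacian matrix

Closed-loop dynamics: $\dot x = -Lx$

If each $x_i$ is an n-vector then $\dot x = -(L \otimes I_n)x$

Communication Graph: N nodes or agents

State at node i is $x_i(t)$

Synchronization problem: $x_i(t) - x_j(t) \to 0$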
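The closed-loop dynamics $\dot x = -Lx$ can be checked numerically. The sketch below integrates them with a forward-Euler step for the 6-node example graph; the initial states, step size and horizon are arbitrary illustrative choices. For a strongly connected graph the states should agree on the value $w^T x(0)$, where $w$ is the left eigenvector of $L$ for eigenvalue 0, normalized to sum to one.

```python
import numpy as np

A = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A


def simulate_consensus(L, x0, dt=0.01, t_final=200.0):
    """Forward-Euler integration of the local voting protocol x' = -L x."""
    x = np.array(x0, dtype=float)
    for _ in range(int(round(t_final / dt))):
        x = x + dt * (-L @ x)
    return x


# Arbitrary (hypothetical) initial node states.
x0 = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 4.0])
x_final = simulate_consensus(L, x0)

# Left eigenvector of L for eigenvalue 0, scaled so its entries sum to 1.
vals, vecs = np.linalg.eig(L.T)
w = np.real(vecs[:, np.argmin(np.abs(vals))])
w = w / w.sum()
consensus_value = w @ x0

print("final states   :", np.round(x_final, 6))
print("predicted value:", round(float(consensus_value), 6))
```

The consensus value is a weighted average of the initial states, with weights set by the graph topology rather than by any single node.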

Formation Control
Synchronization of Rigid-body Spacecraft Dynamics
Synchronization of Trust Opinions in Multi-agent Networks
Diurnal Rhythm Synchronization in Nature

Controlled Consensus: Cooperative Tracker

Node state $\dot x_i = u_i$

Distributed local voting protocol with control node v:

$u_i = \sum_{j \in N_i} a_{ij}(x_j - x_i) + b_i (v - x_i)$

The right-hand side is the local neighborhood tracking error.

$\dot x = -(L + B)x + B \underline{1} v$

$b_i \neq 0$ if control v is in the neighborhood of node i, and $B = \mathrm{diag}\{b_i\}$

Theorem. Let the graph have a spanning tree and $b_i \neq 0$ for at least one root node. Then L+B is nonsingular with all eigenvalues having positive real parts, $-(L+B)$ is asymptotically stable, and all nodes synchronize to the control node v.

Global state $x = [x_1 \ \cdots \ x_N]^T$
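A quick numerical check of the theorem, as a sketch: pin the control node to a single node of the 6-node example graph and confirm that $L+B$ has eigenvalues in the open right half-plane and that the equilibrium of $\dot x = -(L+B)x + B\underline{1}v$ is $x = \underline{1}v$. The choice of pinned node, the gain $b_1 = 1$ and the value of $v$ are assumptions made for illustration.

```python
import numpy as np

A = np.array([
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Hypothetical pinning: only node 1 sees the control node, with gain b_1 = 1.
B = np.diag([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
v = 2.5                                   # constant control-node value (illustrative)

M = L + B
eig_M = np.linalg.eigvals(M)

# Equilibrium of x' = -(L+B) x + B 1 v, i.e. the solution of (L+B) x = B 1 v.
x_ss = np.linalg.solve(M, B @ np.ones(6) * v)

print("min Re eig(L+B):", round(float(eig_M.real.min()), 4))
print("steady state   :", np.round(x_ss, 6))
```

Because $L\underline{1} = 0$, the vector $\underline{1}v$ always satisfies $(L+B)\underline{1}v = B\underline{1}v$; nonsingularity of $L+B$ is what makes it the unique equilibrium.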


2 1 1,

2 1 0A B

0.5 0.5K

Agent Dynamics and Local Feedback Design

$\dot x_i = A x_i + B u_i$

$u_i = K x_i$

Couple 6 agents with communication graph

Local neighborhood tracking error (edge weights $e_{ij}$, pinning gains $g_i$):

$\varepsilon_i = \sum_{j \in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)$

$u_i = K \varepsilon_i$

Nodes synchronize to consensus heading

[Figure: agent trajectories in the x-y plane, with the leader $x_0$ at the c.g.]

Example- Non Resilience in Communication Networks

2 1 1,

2 1 0A B

0.5 0.5K

Agent Dynamics and Local Feedback Design

$\dot x_i = A x_i + B u_i$

$u_i = K x_i$

ADD another comm. link: more information flow

Local neighborhood tracking error:

$\varepsilon_i = \sum_{j \in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)$

$u_i = K \varepsilon_i$

Causes Unstable Formation! WHY?

[Figure: agent trajectories in the x-y plane after the extra link is added]

OPTIMAL design at each node (Lewis and Syrmos 1995)

DECOUPLES CONTROL DESIGN FROM COMMUNICATION GRAPH STRUCTURE

LOCAL OPTIMAL DESIGN guarantees global synchronization (Lewis and Zhang, IEEE TAC 2011)

$u_i = cK\varepsilon_i = cK\left[\sum_{j \in N_i} e_{ij}(x_j - x_i) + g_i(x_0 - x_i)\right]$

minimizes

$J_i = \frac{1}{2}\int_0^\infty \left(\varepsilon_i^T Q \varepsilon_i + u_i^T R u_i\right) dt$

Cell Homeostasis

The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

Cellular Metabolism

Permeability control of the cell membrane

http://www.accessexcellence.org/RC/VL/GG/index.html

Optimality in Biological Systems

Why are Biological Systems Resilient?

Reinforcement Learning

Every living organism improves its control actions based on rewards received from the environment

The resources available to living organisms are usually meager. Nature uses optimal control.

1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.

RL finds optimal policies by evaluating the effects of suboptimal policies

Optimality in Biological Systems

Optimality Provides an Organizational Principle for Behavior

Charles Darwin showed that optimal control over long timescales is responsible for natural selection of species

Why are Biological Systems Resilient?

Books

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.

F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

F. Lewis, D. Vrabie, and K. Vamvoudakis, “Reinforcement learning and feedback control,” IEEE Control Systems Magazine, Dec. 2012.

Optimal Control: The Linear Quadratic Regulator (LQR)

System: $\dot x = Ax + Bu$

User prescribed optimization criterion $(Q, R)$:

$V(x(t)) = \int_t^\infty \left(x^T Q x + u^T R u\right) d\tau$

Off-line design loop using the ARE (FORMAL DESIGN PROCEDURE):

$0 = PA + A^T P + Q - P B R^{-1} B^T P$, $\quad K = R^{-1} B^T P$

On-line real-time control loop: $u = -Kx$

An offline design procedure that requires knowledge of the system dynamics model (A, B)

System modeling is expensive, time consuming, and inaccurate

OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS

Aircraft autopilots: Linear Quadratic Regulator, Linear Quadratic Gaussian / LTR

A. Optimal control B. Zero-sum games C. Non zero-sum games

1. System dynamics

2. Value/cost function

3. Bellman equation

4. HJ solution equation (Riccati eq.)

5. Policy iteration – gives the structure we need

Adaptive Control Structures for Resilience:

We want to find optimal control solutions
Online in real time
Using adaptive control techniques
Without knowing the full dynamics

For nonlinear systems and general performance indices

Adaptive Control is online and works for unknown systems. Generally not optimal.

Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.

We want ONLINE DIRECT ADAPTIVE OPTIMAL Control for any performance cost of our own choosing

Reinforcement Learning turns out to be the key to this!

RL brings together Optimal Control and Adaptive Control

Different methods of learning

Reinforcement learning: Ivan Pavlov, 1890s

Actor-Critic Learning

We want OPTIMAL performance: ADP (Approximate Dynamic Programming)

Sutton & Barto book

[Diagram: an adaptive learning system in which the actor applies control inputs to the system and environment, and the critic compares the outputs with the desired performance and sends a reinforcement signal that tunes the actor]

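Derivation of Nonlinear Optimal Regulator

Nonlinear system dynamics: $\dot x = f(x,u) = f(x) + g(x)u$

Cost/value: $V(x(t)) = \int_t^\infty r(x,u)\, d\tau = \int_t^\infty \left(Q(x) + u^T R u\right) d\tau$

Leibniz gives the differential equivalent, the Bellman equation in terms of the Hamiltonian function:

$H\left(x, \frac{\partial V}{\partial x}, u\right) = \dot V + r(x,u) = \left(\frac{\partial V}{\partial x}\right)^T \dot x + r(x,u) = \left(\frac{\partial V}{\partial x}\right)^T \left(f(x) + g(x)u\right) + r(x,u) = 0$

Stationarity condition: $\frac{\partial H}{\partial u} = 0$

Stationary control policy: $u = h(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V}{\partial x}$

Hamilton-Jacobi-Bellman equation:

$0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4}\left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}, \quad V(0) = 0$

Off-line solution. HJB hard to solve. May not have smooth solution. Dynamics must be known.

To find online methods for optimal control, iterate on these two equations:

$0 = \left(\frac{\partial V_j}{\partial x}\right)^T f(x, h_j(x)) + r(x, h_j(x)), \quad V_j(0) = 0$

$h_{j+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_j}{\partial x}$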

CT Policy Iteration: a Reinforcement Learning Technique

Utility: $r(x,u) = Q(x) + u^T R u$

Given any admissible policy $u(x) = h(x)$, the cost is given by solving the CT Bellman equation (a scalar equation):

$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right)$

Policy Iteration Solution

Pick stabilizing initial control policy $h_0(x)$

Policy evaluation: find cost from the Bellman eq.

Policy improvement: update control

Full system dynamics must be known. Off-line solution.

Converges to solution of HJB:

$0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4}\left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$

• Convergence proved by Leake and Liu 1967, Saridis 1979 if Lyapunov eq. solved exactly
• Beard & Saridis used Galerkin integrals to solve Lyapunov eq.
• Abu-Khalaf & Lewis used NN to approx. V for nonlinear systems and proved convergence

M. Abu-Khalaf, F.L. Lewis, and J. Huang, “Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation,” IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, Dec. 2006.
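For the linear-quadratic special case, policy evaluation reduces to a Lyapunov equation and policy improvement to a gain update; this is commonly known as Kleinman's algorithm, a name not used in the slides. The sketch below is one minimal NumPy version under stated assumptions: the double-integrator plant, weights and initial gain are hypothetical test values, and the Lyapunov equation is solved with Kronecker products rather than a dedicated solver.

```python
import numpy as np


def solve_lyapunov(Ac, M):
    """Solve Ac^T P + P Ac + M = 0 for symmetric P via Kronecker products."""
    n = Ac.shape[0]
    I = np.eye(n)
    # Row-major vectorization: vec(Ac^T P) = kron(Ac^T, I) vec(P),
    #                          vec(P Ac)   = kron(I, Ac^T) vec(P).
    lhs = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    P = np.linalg.solve(lhs, -M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)


def lqr_policy_iteration(A, B, Q, R, K0, n_iter=15):
    """Offline policy iteration for u = -K x; needs the full model (A, B)."""
    K = K0
    for _ in range(n_iter):
        Ac = A - B @ K
        P = solve_lyapunov(Ac, Q + K.T @ R @ K)   # policy evaluation
        K = np.linalg.solve(R, B.T @ P)           # policy improvement
    return P, K


# Hypothetical test plant: double integrator with Q = I, R = 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.array([[1.0, 1.0]])                       # stabilizing initial gain

P, K = lqr_policy_iteration(A, B, Q, R, K0)
are_residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("P =\n", np.round(P, 6))
print("K =", np.round(K, 6))
```

For this plant the Riccati equation can be solved by hand, giving $p_{12} = 1$ and $p_{11} = p_{22} = \sqrt{3}$, which the iteration should reproduce. Note that the policy evaluation step uses A explicitly, which is the limitation the integral reinforcement form is meant to remove.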

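Lemma 1 (Draguna Vrabie)

$V(x(t)) = \int_t^{t+T} r(x,u)\, d\tau + V(x(t+T)), \quad V(0) = 0$

is equivalent to

$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right), \quad V(0) = 0$

The first equation is the integral reinforcement (IRL) form of the CT Bellman eq. It solves the Bellman equation without knowing f(x,u).

Key Idea: split the value integral,

$V(x(t)) = \int_t^\infty r(x,u)\, d\tau = \int_t^{t+T} r(x,u)\, d\tau + \int_{t+T}^\infty r(x,u)\, d\tau$

Allows definition of temporal difference error for CT systems:

$e(t) = -V(x(t)) + \int_t^{t+T} r(x,u)\, d\tau + V(x(t+T))$

Bad Bellman Equation: the differential form, which contains f(x,u)

Good Bellman Equation: the integral reinforcement form, which does not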

Integral Reinforcement Learning (IRL): Draguna Vrabie

IRL Policy iteration

Initial stabilizing control is needed

Policy evaluation (cost update), IRL Bellman equation:

$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\, dt + V_k(x(t+T))$

f(x) and g(x) do not appear

Equivalent to the CT Bellman eq.

$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right)$

Solves Bellman eq. (nonlinear Lyapunov eq.) without knowing system dynamics

Policy improvement (control gain update):

$u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

g(x) needed for control update

Converges to solution to HJB eq.

$0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4}\left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$

D. Vrabie proved convergence to the optimal value and control (Automatica 2009, Neural Networks 2009)

Real-time Implementation: Approximate Dynamic Programming

Direct Optimal Adaptive Control for Partially Unknown CT Systems

Value Function Approximation (VFA) to Solve Bellman Equation

$V_k(x(t)) = \int_t^{t+T} \left(Q(x) + u_k^T R u_k\right) dt + V_k(x(t+T))$

Approximate value by Weierstrass Approximator Network $V = W^T \phi(x)$

$W_k^T \phi(x(t)) = \int_t^{t+T} \left(Q(x) + u_k^T R u_k\right) dt + W_k^T \phi(x(t+T))$

$W_k^T \left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} \left(Q(x) + u_k^T R u_k\right) dt$

The bracketed term is the regression vector; the right-hand side is the reinforcement on the time interval [t, t+T].

Scalar equation with vector unknowns. Same form as standard System ID problems.

Now use RLS along the trajectory to get new weights $W_k$

Then find updated FB:

$u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x} = -\frac{1}{2} R^{-1} g^T(x) \left[\frac{\partial \phi(x(t))}{\partial x(t)}\right]^T W_k$

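Solving the IRL Bellman Equation

For a two-state system, $P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}$ and $W^T = [p_{11} \ \ p_{12} \ \ p_{22}]$

Need data from 3 time intervals to get 3 equations to solve for 3 unknowns:

$W_k^T \left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} \left(Q(x) + u_k^T R u_k\right) dt$

$W_k^T \left[\phi(x(t+T)) - \phi(x(t+2T))\right] = \int_{t+T}^{t+2T} \left(Q(x) + u_k^T R u_k\right) dt$

$W_k^T \left[\phi(x(t+2T)) - \phi(x(t+3T))\right] = \int_{t+2T}^{t+3T} \left(Q(x) + u_k^T R u_k\right) dt$

Now solve by batch least-squares

Solve for value function parameters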

Integral Reinforcement Learning (IRL)

Data set at time $[t, t+T)$: $x(t)$, $\rho(t, t+T)$, $x(t+T)$

On each interval $(t, t+T)$, $(t+T, t+2T)$, $(t+2T, t+3T)$:
1. observe the state at the start of the interval, $x(t)$, $x(t+T)$, $x(t+2T)$
2. apply $u_k = -K_k x$
3. observe the cost integral $\rho$
4. update P

Do RLS until convergence to $P_k$, or use batch least-squares

Then update control gain: $K_{k+1} = R^{-1} B^T P_k$

Solve Bellman equation (solves Lyapunov eq. without knowing dynamics):

$W_k^T \left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} x^T(\tau)\left(Q + K_k^T R K_k\right) x(\tau)\, d\tau \equiv \rho(t, t+T)$

This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.

A is not needed anywhere
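The loop above can be sketched for a linear plant as follows. This is an illustrative reading of the procedure, not the authors' code: the double-integrator plant, the initial gain, the interval T = 0.1 s, the set of start states and the use of batch least-squares are all assumptions. The matrix A appears only inside the simulator that generates data; the learner sees states, cost integrals and B.

```python
import numpy as np

# Hypothetical test plant (double integrator); A is used only to generate data.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])


def phi(x):
    """Quadratic basis so that W^T phi(x) = x^T P x with W = [p11, p12, p22]."""
    return np.array([x[0] ** 2, 2.0 * x[0] * x[1], x[1] ** 2])


def run_interval(x, K, T=0.1, substeps=20):
    """Apply u = -K x for T seconds; return x(t+T) and the cost integral rho."""
    M = Q + K.T @ R @ K
    Ac = A - B @ K                      # simulator only; hidden from the learner

    def deriv(z):
        xs = z[:2]
        return np.concatenate([Ac @ xs, [xs @ M @ xs]])

    z = np.concatenate([x, [0.0]])      # augmented state [x; rho]
    h = T / substeps
    for _ in range(substeps):           # classical fourth-order Runge-Kutta
        k1 = deriv(z)
        k2 = deriv(z + 0.5 * h * k1)
        k3 = deriv(z + 0.5 * h * k2)
        k4 = deriv(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z[:2], z[2]


def irl_policy_iteration(K0, n_iter=10, intervals=5):
    K = K0.copy()
    starts = [np.array(s) for s in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0])]
    for _k in range(n_iter):
        rows, targets = [], []
        for x in starts:
            for _j in range(intervals):
                x_next, rho = run_interval(x, K)
                rows.append(phi(x) - phi(x_next))    # regression vector
                targets.append(rho)                  # integral reinforcement
                x = x_next
        # Policy evaluation: batch least-squares on the IRL Bellman equation.
        W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
        P = np.array([[W[0], W[1]], [W[1], W[2]]])
        K = np.linalg.solve(R, B.T @ P)              # policy improvement uses B, not A
    return P, K


P, K = irl_policy_iteration(np.array([[1.0, 1.0]]))
print("P =\n", np.round(P, 5))
print("K =", np.round(K, 5))
```

Restarting from several start states is one simple way to keep the regression well conditioned; along a single decaying trajectory the persistence of excitation condition becomes harder to meet. For this plant the learned P should approach the hand-computed Riccati solution with $p_{12} = 1$ and $p_{11} = p_{22} = \sqrt{3}$.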

Continuous-time control with discrete gain updates: $u_k(t) = -K_k x(t)$

[Figure: the gain $K_k$ is piecewise constant in time, updated at k = 0, 1, 2, 3, 4, 5 (gain update = policy), while the control acts continuously; T is the reinforcement interval]

Reinforcement intervals T need not be the same. They can be selected on-line in real time.

Persistence of Excitation

Regression vector $\phi(x(t)) - \phi(x(t+T))$ must be PE

Relates to choice of reinforcement interval T

Implementation: policy evaluation. Need to solve online

$W_k^T \left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} x^T(\tau)\left(Q + K_k^T R K_k\right) x(\tau)\, d\tau \equiv \rho(t, t+T)$

Add a new state = integral reinforcement:

$\dot\rho = x^T Q x + u^T R u$

This is the controller dynamics or memory

Direct Optimal Adaptive Controller (Draguna Vrabie)

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval

[Diagram: the system $\dot x = Ax + Bu$ in feedback with a dynamic control system with MEMORY. The critic integrates $\dot\rho = x^T Q x + u^T R u$ and samples it through a ZOH with period T; the actor applies the gain K]

Critic: run RLS or use batch L.S. to identify the value of the current control

Actor: update FB gain after the critic has converged

Reinforcement interval T can be selected on line on the fly; it can change

Solves Riccati equation online without knowing the A matrix

OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS

Kung Tz (Confucius, 孔子), 500 BC

Archery, chariot driving, music, rites and rituals, poetry, mathematics

Man’s relations to: family, friends, society, nation, emperor, ancestors

Tian xia da tong: harmony under heaven

124 BC: Han Imperial University in Chang-an

Actor / Critic structure for CT Systems

Optimal Adaptive IRL for CT systems: a new structure of adaptive controllers (D. Vrabie, 2009)

Reinforcement learning: theta waves 4-8 Hz

$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\, dt + V_k(x(t+T))$

Motor control: 200 Hz

$u_{k+1} = h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Oscillation is a fundamental property of neural tissue

Brain has multiple adaptive clocks with different timescales

Theta rhythm (hippocampus, thalamus), 4-10 Hz: sensory processing, memory and voluntary control of movement

Gamma rhythms, 30-100 Hz (hippocampus and neocortex): high cognitive activity

• consolidation of memory
• spatial mapping of the environment (place cells)

The high frequency processing is due to the large amounts of sensorial data to be processed

[Diagram (Doya, Kimura, Kawato 2001): limbic system, theta rhythms 4-10 Hz, deliberative evaluation; spinal cord, control]

D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.

D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller: A Continuous-Time Formulation,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.

OPTIMALITY AND RESILIENCE OF BIOLOGICAL SYSTEMS

Simulation 1: F-16 aircraft pitch rate controller (Stevens and Lewis 2003)

$$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u, \qquad x = [\alpha \ \ q \ \ \delta_e]^T$$

$Q = I, \ R = I$

Select quadratic NN basis set for VFA

Exact solution of the ARE $0 = PA + A^T P + Q - P B R^{-1} B^T P$:

$W_1^{*T} = [p_{11} \ \ 2p_{12} \ \ 2p_{13} \ \ p_{22} \ \ 2p_{23} \ \ p_{33}] = [1.4245 \ \ 1.1682 \ \ {-0.1352} \ \ 1.4349 \ \ {-0.1501} \ \ 0.4329]$

[Figures: control signal, controller parameters, and system states versus time (s); critic parameters P(1,1), P(1,2), P(2,2) versus time (s), compared with their optimal values]

Simulations on: F-16 autopilot

A matrix not needed

Converge to SS Riccati equation soln

Solves ARE online without knowing A

Simulation 2: Load Frequency Control of Electric Power system (A. Al-Tamimi, D. Vrabie, Youyi Wang)

$\dot x = Ax + Bu$, with $x(t) = [\Delta f(t) \ \ \Delta P_g(t) \ \ \Delta X_g(t) \ \ \Delta E(t)]^T$: frequency, generator output, governor position, integral control

$$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(R T_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix}$$

ARE solution using full dynamics model (A, B), from $0 = PA + A^T P + Q - P B R^{-1} B^T P$:

$$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751 \\ 0.4766 & 0.7831 & 0.1237 & 0.3829 \\ 0.0601 & 0.1237 & 0.0513 & 0.0298 \\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$$

IRL period of T = 0.1 s. Fifteen data points $(x(t), x(t+T), \rho(t : t+T))$ were used for each update. Hence, the value estimate was updated every 1.5 s.

$$P_{critic\ NN} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754 \\ 0.4768 & 0.7887 & 0.1239 & 0.3834 \\ 0.0603 & 0.1239 & 0.0567 & 0.0300 \\ 0.4754 & 0.3843 & 0.0300 & 2.3433 \end{bmatrix}$$

[Figures: P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) versus time (s); system states versus time (s)]

Solves ARE online without knowing A

Adaptive Control

[Diagram: plant with control input and output]

Identify the controller: Direct Adaptive

Identify the system model: Indirect Adaptive

Identify the performance value $V(x) = W^T \phi(x)$: Optimal Adaptive

OPTIMALITY YIELDS RESILIENCE, ROBUSTNESS, AND PERFORMANCE GUARANTEES

Sun Tz bin fa (孙子兵法, The Art of War), 500 BC

Games on Communication Graphs

Fast decisions using the new concept of Graphical Games to limit each agent’s decision horizon

Faster decisions based on local information
Under certain conditions, the local decisions are globally optimal
Nash Equilibrium on Graphs is defined

Speed up multi-agent decision-making by imposing sparse, efficient communication structures

Bounded Rationality

Manufacturing as the Interactions of Multiple Agents
Each machine has its own dynamics and cost function
Neighboring machines influence each other most strongly
There are local optimization requirements as well as global necessities

Each process has its own dynamics

$\dot\delta_i = A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$

And cost function

$J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2}\int_0^\infty \left(\delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\right) dt$

Each process helps other processes achieve optimality and efficiency


Graphical Games

Synchronization: Cooperative Tracker Problem

Node dynamics: $\dot x_i = A x_i + B_i u_i$, with $x_i(t) \in \mathbb{R}^n$, $u_i(t) \in \mathbb{R}^{m_i}$

Target generator dynamics: $\dot x_0 = A x_0$

Synchronization problem: $x_i(t) \to x_0(t), \ \forall i$

Local neighborhood tracking error (Lihua Xie):

$\delta_i = \sum_{j \in N_i} e_{ij}(x_i - x_j) + g_i(x_i - x_0)$

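Local nbhd. tracking error dynamics (local agent dynamics driven by neighbors’ controls):

$\dot\delta_i = A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$

Define local nbhd. performance index (values driven by neighbors’ controls):

$J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2}\int_0^\infty \left(\delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\right) dt \equiv \frac{1}{2}\int_0^\infty L_i(\delta_i(t), u_i(t), u_{-i}(t))\, dt$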

K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.

Impose Optimality to get Resilience and Robustness

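New Differential Graphical Game

$u_i$: control action of player i

State dynamics of agent i (local dynamics):

$\dot\delta_i = A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j$

Value function of player i (local value function):

$J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2}\int_0^\infty \left(\delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\right) dt$

Only depends on graph neighbors

OPTIMAL CONTROL IS RESILIENT, ROBUST, AND HAS GUARANTEED STABILITY MARGINS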

Standard Multi-Agent Differential Game

$u_i$: control action of player i

Central dynamics:

$\dot z = Az + \sum_{i=1}^{N} B_i u_i$

Value function of player i:

$J_i(z(0), u_i, u_{-i}) = \frac{1}{2}\int_0^\infty \left(z^T Q z + \sum_{j=1}^{N} u_j^T R_{ij} u_j\right) dt$

Central dynamics. Local value function depends on ALL other control actions.

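Team Interest vs. Self Interest; Cooperation vs. Collaboration

The objective functions of each player can be written as a team average term plus a conflict of interest term. For three players:

$J_1 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_1 - J_2) + \frac{1}{3}(J_1 - J_3) \equiv J_{team} + J_1^{coi}$

$J_2 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_2 - J_1) + \frac{1}{3}(J_2 - J_3) \equiv J_{team} + J_2^{coi}$

$J_3 = \frac{1}{3}(J_1 + J_2 + J_3) + \frac{1}{3}(J_3 - J_1) + \frac{1}{3}(J_3 - J_2) \equiv J_{team} + J_3^{coi}$

For N players:

$J_i = \frac{1}{N}\sum_{j=1}^{N} J_j + \frac{1}{N}\sum_{j=1}^{N} (J_i - J_j) \equiv J_{team} + J_i^{coi}, \quad i = 1, \ldots, N$

For N-player zero-sum games, the first term is zero, i.e. the players have no goals in common.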
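The team / conflict-of-interest split is a purely algebraic identity, which a few lines of Python can confirm. The function name and the sample cost values below are made up for illustration.

```python
def decompose(costs):
    """Split each player's cost J_i into a team average and a conflict-of-interest term."""
    n = len(costs)
    team = sum(costs) / n
    coi = [sum(c - other for other in costs) / n for c in costs]
    return team, coi


# Hypothetical cost values for three players.
costs = [4.0, 1.0, -2.0]
team, coi = decompose(costs)
rebuilt = [team + c for c in coi]

# A zero-sum example: the costs add up to zero, so the team term vanishes.
zero_sum_team, zero_sum_coi = decompose([3.0, -1.0, -2.0])

print("team term:", team)
print("coi terms:", coi)
print("rebuilt  :", rebuilt)
print("zero-sum team term:", zero_sum_team)
```

The conflict-of-interest terms always sum to zero across players, so any common gain or loss is carried entirely by the team term.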

New Definition of Nash Equilibrium for Graphical Games

To restore the symmetry of Nash equilibrium

Def: Interactive Nash equilibrium

Policies $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in Interactive Nash equilibrium if

1. $J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*), \ \forall i \in N$, i.e. they are in Nash equilibrium

2. There exists a policy $u_j$ such that $J_i(u_j, u_{G-j}^*) \neq J_i(u_j^*, u_{G-j}^*), \ \forall i, j \in N$. That is, every player can find a policy that changes the value of every other player.

Theorem 3. Let $(A, B_i)$ be reachable for all i. Let agent i be in local best response

$J_i(u_i^*, u_{-i}) \le J_i(u_i, u_{-i}), \ \forall i$

Then $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in global Interactive Nash iff the graph is strongly connected.

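Graphical Game Solution Using Integral Reinforcement Learning

Value function:

$V_i(\delta_i(t)) = \frac{1}{2}\int_t^\infty \left(\delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j\right) dt$

Differential equivalent (Leibniz formula) is Bellman’s equation:

$H_i\left(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i, u_{-i}\right) \equiv \left(\frac{\partial V_i}{\partial \delta_i}\right)^T \left(A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j\right) + \frac{1}{2}\delta_i^T Q_{ii} \delta_i + \frac{1}{2} u_i^T R_{ii} u_i + \frac{1}{2}\sum_{j \in N_i} u_j^T R_{ij} u_j = 0$

Stationarity condition:

$0 = \frac{\partial H_i}{\partial u_i} \ \Rightarrow \ u_i = -(d_i + g_i) R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i}$

1. Coupled HJ equations, $H_i\left(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}^*\right) = 0$:

$\left(\frac{\partial V_i}{\partial \delta_i}\right)^T A_i^c + \frac{1}{2}\delta_i^T Q_{ii} \delta_i + \frac{1}{2}(d_i + g_i)^2 \left(\frac{\partial V_i}{\partial \delta_i}\right)^T B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2}\sum_{j \in N_i} (d_j + g_j)^2 \left(\frac{\partial V_j}{\partial \delta_j}\right)^T B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j} = 0, \quad i \in N$

where

$A_i^c = A\delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \sum_{j \in N_i} e_{ij} (d_j + g_j) B_j R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j}, \quad i \in N$

2. Best Response HJ equations (other players have fixed policies $u_j$):

$0 = H_i\left(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}\right) \equiv \left(\frac{\partial V_i}{\partial \delta_i}\right)^T A_i^c + \frac{1}{2}\delta_i^T Q_{ii} \delta_i + \frac{1}{2}(d_i + g_i)^2 \left(\frac{\partial V_i}{\partial \delta_i}\right)^T B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2}\sum_{j \in N_i} u_j^T R_{ij} u_j$

where

$A_i^c = A\delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} - \sum_{j \in N_i} e_{ij} B_j u_j$

To find online methods for GG, iterate on these two equations (Bellman equation and stationarity condition)

Online Solution of Graphical Games (Kyriakos Vamvoudakis)

Use Reinforcement Learning: POLICY ITERATION, convergence results

MULTI-AGENT LEARNING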

Distributed Autonomy in Active Energy Distribution Systems

Ali Davoudi, F.L. Lewis, Chris Edrington

Supported by ONR Tony Seman & Lynn Petersen

What is a micro-grid?
• Micro-grid is a small-scale power system that provides the power for a group of consumers.
• Micro-grid enables local power support for local and critical loads.
• Micro-grid has the ability to work in both grid-connected and islanded modes.
• Micro-grid facilitates the integration of Distributed Energy Resources (DER).

Photo from: http://www.horizonenergygroup.com

• The main building block of smart-grids
• Rural plants
• Business buildings, hospitals, and factories

Smart‐grid photo from: http://www.sustainable‐sphere.com

An introduction to micro‐grids: Micro‐grid applications

Distributed Generators (DG) / Distributed Energy Resources (DER)

• Non-renewables: internal combustion engine, micro-turbines, fuel cells
• Renewables: photovoltaic, wind, hydroelectric, biomass

Micro‐grid Advantages

• Micro-grid provides high quality and reliable power to the critical consumers

• During main grid disturbances, micro-grid can quickly disconnect from the main grid and provide reliable power for its local loads

• DGs can be simply installed close to the loads, which significantly reduces the power transmission line losses

• By using renewable energy resources, a micro-grid reduces CO2 emissions

Micro-grid Hierarchical Control Structure

Tertiary Control: optimal operation in both operating modes; power flow control in grid-tied mode

Secondary Control: voltage deviation mitigation; frequency deviation alleviation. Do coop. ctrl. here to synchronize frequency and voltage.

Primary Control: voltage stability provision; frequency stability preserving; plug and play capability for DGs. Maintains stability.

[Diagram: the control levels act on the microgrid, which connects to the main grid through a tie]

Bidram, A., & Davoudi, A. (2012). Hierarchical structure of microgrids control system. IEEE Transactions on Smart Grid, vol. 3, pp. 1963-1976, Dec 2012.

Cooperative Game-theoretic Control of Active Loads in DC Microgrids

Ling-ling Fan, Vahidreza Nasirian, Hamidreza Modares,

Frank L. Lewis, Yong-duan Song, and Ali Davoudi,

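[Figure: power buffer operation during a step change in power demand, showing stored energy $e$, input power $p_{in}$ and output power $p_{out}$ at times $t_1$, $t_2$, $t_3$]

Power buffers in Microgrid Network

[Figure: microgrid network with line resistances $r_{18}$, $r_{48}$, $r_{58}$, $r_{59}$, $r_{47}$, $r_{27}$, $r_{67}$, $r_{69}$, $r_{39}$, and a power buffer with bus voltage $v_i$, output power $p_i$, input impedance $r_i$, control input $u_i$ and stored energy $e_i$]

Active Load Power Buffer (Vahid Nasirian)

$$\begin{cases} \dot e_i = \dfrac{v_i^2}{r_i} - p_i \\ \dot r_i = u_i \end{cases}$$

$e_i$: stored energy; $r_i$: input impedance; $v_i$: bus voltage; $u_i$: control input; $p_i$: output power = a disturbance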

Define coupled performance indices

$J_i = \int_0^\infty \left(\mathbf{x}_i^T \mathbf{Q}_i \mathbf{x}_i + \sum_{j \in N_i} \rho_{ij} u_j^2\right) dt, \quad i = M+1, \ldots, M+N$

Solve for bus voltage to get coupled agent dynamics

[Equation: coupled linear state-space dynamics of each power buffer, of the form $\dot{\mathbf{x}}_i = \mathbf{A}\mathbf{x}_i + \mathbf{B}u_i + \mathbf{D}w_i$ plus neighbor coupling terms, for $i = M+1, \ldots, M+N$]

Define Communication Graph: sparse efficient topology

Optimal design provides resilience and disturbance rejection

Vahid Nasirian, Reza Modares, Dr. Ali Davoudi

Micro-grid secondary control: Distributed CPS structure

Cyber Physical System (CPS)

Physical layer: the interconnect structure of the power grid

Cyber layer: a sparse, efficient communication network to allow cooperative control for synchronization of voltage and frequency

[Diagram: a microgrid with DG 1 to DG 8 in the physical layer, and the same eight DGs joined by communication links in the cyber communication framework]

Work of Ali Bidram with Dr. A. Davoudi

Resilience because of:
Sparse, efficient communication network
Optimal design: graph games
Distributed computation: no SPOF

Case Study

A 48 V DC distribution network with three active loads and three sources

Physical Microgrid Layer

Cyber Communication Layer

[Figure, panels (a) to (h), intervals A to D, all versus time (s): source currents $i_{s1}$, $i_{s2}$, $i_{s3}$ (A); buffer voltages $v_{b4}$, $v_{b5}$, $v_{b6}$ (V); MG voltages $v_4$, $v_5$, $v_6$ (V); load voltages $v_{o4}$, $v_{o5}$, $v_{o6}$ (V); stored energies $e_4$, $e_5$, $e_6$ (J); input resistances $r_4$, $r_5$, $r_6$ (Ω); output powers $p_4$, $p_5$, $p_6$ (W); and stored energy (J) against input resistance (Ω) for Buffers 4, 5 and 6]

Load change at terminal 5

Optimal Leader-Follower Trackers
CPS #2: Resilient Human-robot Interaction Systems

On-policy RL

Target policy: The policy that we are learning about.

Behavior policy: The policy that generates actions and behavior.

Target policy and behavior policy are the same

[Diagram: a single target-and-behavior policy drives the system from the reference]

Off-policy RL

Target policy and behavior policy are different

[Diagram: the behavior policy drives the system from the reference while a separate target policy is learned]

Humans can learn optimal policies while actually applying suboptimal policies

Off-policy IRL (Yu Jiang & Zhong-Ping Jiang, Automatica 2012)

System: $\dot x = f(x) + g(x)u$

Value: $J(x) = \int_t^\infty r(x, u)\, d\tau$

On-policy IRL:

$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = -\int_{t-T}^{t} \left(Q(x) + u^{[i]T} R u^{[i]}\right) d\tau$

$u^{[i+1]} = -\frac{1}{2} R^{-1} g^T \nabla J_x^{[i]}$

Must know g(x)

Off-policy IRL: write the dynamics as $\dot x = f + g u^{[i]} + g\left(u - u^{[i]}\right)$. Then

$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = -\int_{t-T}^{t} Q(x)\, d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\, d\tau + 2\int_{t-T}^{t} u^{[i+1]T} R \left(u^{[i]} - u\right) d\tau$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$. They can be found simultaneously online from measured data, using the Kronecker product and VFA.

1. Completely unknown system dynamics
2. Can use applied u(t) for:
disturbance rejection: Z.P. Jiang; R. Song and Lewis, 2015
robust control: Y. Jiang & Z.P. Jiang, IEEE TCS 2012
exploring probing noise, without bias: J.Y. Lee, J.B. Park, Y.H. Choi 2012

LQ Tracker Problem for Continuous-time Systems

System dynamics: $\dot x = Ax + Bu$, $\quad y = Cx$

Assume the reference trajectory is generated by $\dot y_d = F y_d$

Augmented system state: $X(t) = \begin{bmatrix} x(t)^T & y_d(t)^T \end{bmatrix}^T$

Augmented system:

$\dot X = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv TX + B_1 u$

Value function:

$V(X(t)) = \frac{1}{2}\int_t^\infty e^{-\gamma(\tau - t)} \left[X^T Q_T X + u^T R u\right] d\tau = \frac{1}{2} X(t)^T P X(t)$

$Q_T = C_1^T Q C_1, \quad C_1 = [C \ \ {-I}]$

LQT Bellman equation:

$0 = (TX + B_1 u)^T P X + X^T P (TX + B_1 u) - \gamma X^T P X + X^T Q_T X + u^T R u$

Online solution to the CT LQT ARE: Off‐policy IRL

$$\dot{X} = TX + B_1 u = T_i X + B_1 (K^i X + u), \qquad T_i = T - B_1 K^i$$

Off-policy IRL Bellman equation

$$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = \int_t^{t+T} \frac{d}{d\tau}\left( e^{-\gamma(\tau - t)} X^T P^i X \right) d\tau$$

$$= -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_T + (K^i)^T R K^i \right) X \, d\tau + 2\int_t^{t+T} e^{-\gamma(\tau - t)} (u + K^i X)^T B_1^T P^i X \, d\tau$$

where $B_1^T P^i = R K^{i+1}$.

Algorithm. Online Off-policy IRL algorithm for LQT

Online step: Apply a fixed control input and collect some data

Offline step: Policy evaluation and improvement using LS on collected data

$$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_T + (K^i)^T R K^i \right) X \, d\tau + 2\int_t^{t+T} e^{-\gamma(\tau - t)} (u + K^i X)^T R K^{i+1} X \, d\tau$$

No knowledge of the dynamics is required


Convergence

Lemma: The Bellman equation, the IRL Bellman equation and the off-policy IRL Bellman equation have the same solution for the value function and the control update law.

Theorem. The IRL Algorithm and Off-policy IRL Algorithm for the LQT problem converge to the optimal solution.

Optimal Leader-Follower Trackers

CPS #2 – Resilient Human-robot Interaction Systems

PR2 meets Isura

H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, to appear 2015.

RL for Human-Robot Interactions

[Block diagram: human, prescribed impedance model, feedforward control, torque design, robot manipulator, performance]

Robot-specific inner control loop

Task-specific outer-loop control design

Using IRL Optimal Tracking

Implemented on PR2 robot

2-Loop HRI Learning Controller Based on Human Factors Performance Studies

Humans learn 2 components in HRI:

1. must learn to compensate for nonlinear robot dynamics
2. must learn to perform the task

Our result: The robot adapts in order to minimize human effort

Robot control inner loop using Neural Adaptive Learning

Task control outer loop using Online Reinforcement Learning

Human-Robot Interaction Using Reinforcement Learning

Modeling and Control of Adversarial Networked Teams

Bipartite Synchronization on Antagonistic Communication Graphs
Zero-Sum Games
Multi-agent Pursuit-Evasion Games

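H‐Infinity Control Using Neural Networks

L2 Gain Problem

System:

$$\dot{x} = f(x) + g(x)u + k(x)d, \qquad y = h(x), \qquad z = \begin{bmatrix} y^T & u^T \end{bmatrix}^T$$

Control: $u = l(x)$. Here $d$ is the disturbance and $z$ is the performance output.

Find control u(t) so that

$$\frac{\int_0^\infty \|z(t)\|^2\,dt}{\int_0^\infty \|d(t)\|^2\,dt} = \frac{\int_0^\infty \left( h^T h + \|u\|^2 \right) dt}{\int_0^\infty \|d(t)\|^2\,dt} \le \gamma^2$$

for all L2 disturbances and a prescribed gain $\gamma^2$.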

Zero‐Sum differential game

Nature as the opposing player

Disturbance Rejection

Two Antagonistic Inputs

Define 2‐player zero‐sum game as

$$V^*(x(0)) = \min_u \max_d V(x(0), u, d) = \min_u \max_d \int_0^\infty \left( h^T(x) h(x) + u^T R u - \gamma^2 \|d\|^2 \right) dt$$

The game has a unique value (saddle‐point solution) iff the Nash condition holds:

$$\min_u \max_d V(x(0), u, d) = \max_d \min_u V(x(0), u, d)$$
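The saddle-point property can be illustrated with a scalar linear-quadratic sketch. The Python below assumes $\dot{x} = ax + bu + kd$ with $h(x) = \sqrt{q}\,x$ and linear feedback policies $u = -k_u x$, $d = k_d x$, for which the game value is available in closed form. The game Riccati equation $2ap + q - p^2 (b^2/r - k^2/\gamma^2) = 0$ is the scalar specialization under these assumptions; the function names and numbers are illustrative and not from the slides.

```python
import math

def game_riccati(a, b, k, q, r, gamma):
    """Positive root of 2*a*p + q - p^2*(b^2/r - k^2/gamma^2) = 0.

    This sketch only handles the case b^2/r > k^2/gamma^2.
    """
    s = b * b / r - k * k / (gamma * gamma)
    if s <= 0.0:
        raise ValueError("sketch assumes b^2/r > k^2/gamma^2")
    return (a + math.sqrt(a * a + q * s)) / s

def game_cost(a, b, k, q, r, gamma, ku, kd, x0=1.0):
    """Closed-form V(x0, u, d) for u = -ku*x and d = kd*x."""
    m = a - b * ku + k * kd  # closed-loop pole
    if m >= 0.0:
        raise ValueError("closed loop is not stable")
    # Integral of (q + r*ku^2 - gamma^2*kd^2) * x0^2 * exp(2*m*t) over [0, inf).
    return x0 * x0 * (q + r * ku * ku - gamma * gamma * kd * kd) / (-2.0 * m)

# Assumed example values.
sys_par = dict(a=1.0, b=1.0, k=1.0, q=1.0, r=1.0, gamma=2.0)
p = game_riccati(**sys_par)
ku_star = sys_par["b"] * p / sys_par["r"]           # u* = -(b/r) p x
kd_star = sys_par["k"] * p / sys_par["gamma"] ** 2  # d* = (k/gamma^2) p x
v_star = game_cost(ku=ku_star, kd=kd_star, **sys_par)

# Unilateral deviations: the controller cannot do better, nor can the disturbance.
v_other_u = [game_cost(ku=g, kd=kd_star, **sys_par) for g in (2.5, 4.0)]
v_other_d = [game_cost(ku=ku_star, kd=g, **sys_par) for g in (0.0, 1.2)]
print(p, v_star, v_other_u, v_other_d)
```

With unit initial state the saddle-point value equals p, and any unilateral change of gain raises the cost for the controller or lowers the payoff for the disturbance.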

Optimal Game Design Yields
Resilience
Robustness
Guaranteed Performance
Disturbance Rejection

Two Antagonistic Inputs

control node v

Cooperative and Adversarial Multi-agent Network

Adversarial disturbance at each agent

Agents want to cooperate to synchronize to the control leader

Zero-sum Graphical Games

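Cooperative Synchronization, H∞

System (agents):

$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$

$$\dot{x} = (I \otimes A)x + (I \otimes B)u + (I \otimes D)d, \qquad y = (I \otimes C)x$$

Leader:

$$\dot{x}_0 = A x_0, \qquad y_0 = C x_0$$

$$\delta = x - \underline{I}\,x_0$$

Performance output:

$$\|z(t)\|^2 = \delta^T Q \delta + u^T R u$$

Bounded L2 synchronization error:

$$\frac{\int_0^\infty \|z(t)\|^2\,dt}{\int_0^\infty \|d(t)\|^2\,dt} = \frac{\int_0^\infty \left( \delta^T Q \delta + u^T R u \right) dt}{\int_0^\infty d(t)^T T\, d(t)\,dt} \le \gamma^2$$

H‐inf performance index:

$$J(\delta, u, d) = \int_0^\infty \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$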

Resilience to Network Disturbances

Condition on graph topology:

$$P_1 = R_1 (L + G)$$

Condition for existence of local OPFB:

$$R_2 K_2 C = B^T P_2 + M_2$$

Kristian Hengster-Movric

Cooperative Synchronization, H∞

J. Gadewadikar, Frank L. Lewis, L. Xie, V. Kucera and M. Abu-Khalaf, “Parameterization of all stabilizing H∞ static state-feedback gains: Application to output-feedback design,” Automatica, vol. 43, no. 9, pp. 1597-1604, September 2007

Condition on graph topology:

$$P_1 = R_1 (L + G)$$

Condition for existence of local OPFB:

$$R_2 K_2 C = B^T P_2 + M_2$$

$$e_y = -\left( (L + G) \otimes C \right) \delta$$

Fine Local OPFB structure Coarse global OPFB structure

Symmetry between OPFB and Graph Topology Information Restrictions

Same Same

Communication Network Limits Available Information Exchange

Communication Network Tries to Destroy Optimality

Proper Controls Design Restores Optimality

Network info flow restrictions:

$$u_i = cKC\varepsilon_i = cK e_{y_i} = cK \left( \sum_{j \in N_i} e_{ij} (y_j - y_i) + g_i (y_0 - y_i) \right)$$

Local system measurement restrictions:

$$y_i = C x_i$$
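A short simulation can show how the distributed output-feedback protocol uses only neighbor outputs and pinning to the leader. The Python sketch below assumes the simplest possible agents (single integrators, A = 0, B = 1, C = 1, no disturbance), a constant leader, and a three-agent chain graph in which only the first agent is pinned to the leader; the function name, graph, gain, and step size are all illustrative assumptions.

```python
def simulate_sync(adj, pin, x_init, x0, gain=1.0, dt=0.01, steps=3000):
    """Forward-Euler simulation of x_i' = u_i with the local protocol
    u_i = gain * (sum_j adj[i][j] * (y_j - y_i) + pin[i] * (y_0 - y_i)), y_i = x_i.
    """
    x = list(x_init)
    n = len(x)
    for _ in range(steps):
        u = []
        for i in range(n):
            # Local neighborhood output error: neighbors plus the pinning term.
            e_yi = sum(adj[i][j] * (x[j] - x[i]) for j in range(n))
            e_yi += pin[i] * (x0 - x[i])
            u.append(gain * e_yi)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
    return x

# Chain graph: agent 0 hears the leader, agent 1 hears agent 0, agent 2 hears agent 1.
adj = [[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]
pin = [1, 0, 0]
leader_state = 2.0
final_states = simulate_sync(adj, pin, x_init=[0.0, -1.0, 5.0], x0=leader_state)
sync_error = max(abs(xi - leader_state) for xi in final_states)
print(final_states, sync_error)
```

Agents 1 and 2 never see the leader directly, yet they still synchronize because the graph contains a directed path from the leader to every agent.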

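2-player Zero-sum Nash Equilibrium

Command generator:

$$\dot{x}_0 = A x_0, \qquad y_0 = C x_0$$

Agent Dynamics:

$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$

Nash Equilibrium

H‐inf performance index:

$$J(\delta, u, d) = \int_0^\infty \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$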


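Theorem 3. Under the hypotheses of Theorem 2, suppose that in addition matrix $M_2$ satisfies

$$C^T (K_2 - K_2^*)^T R_2 (K_2 - K_2^*) C + M_2^T (K_2 - K_2^*) C + C^T (K_2 - K_2^*)^T M_2 \ge 0$$

for any $K_2 \ne K_2^*$. Then a Nash equilibrium is provided by

$$u^* = -c \left( (L + G) \otimes K_2^* C \right) \delta$$

$$d^* = \gamma^{-2} T^{-1} D^T P\, \delta$$

for the H‐inf performance index

$$J(\delta, u, d) = \int_0^\infty \left( \delta^T Q \delta + u^T R u - \gamma^2 d^T T d \right) dt$$

with Q, R, T given in Theorem 2 and

$$A^T P_2 + P_2 A + Q_2 - P_2 B R_2^{-1} B^T P_2 + \gamma^{-2} P_2 D D^T P_2 + M_2^T R_2^{-1} M_2 = 0$$

$$K_2^* C = R_2^{-1} (B^T P_2 + M_2)$$

$$P_1 = c R_1 (L + G)$$

$$P = P_1 \otimes P_2$$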

Kristian Hengster-Movric

Optimal H-inf control is distributed

Worst case disturbance is NOT distributed

2-player Zero-sum Nash Equilibrium

Agent Dynamics:

$$\dot{x}_i = A x_i + B u_i + D d_i, \qquad y_i = C x_i$$

The cloud-capped towers, the gorgeous palaces,
The solemn temples, the great globe itself,
Yea, all which it inherit, shall dissolve,
And, like this insubstantial pageant faded,
Leave not a rack behind.

We are such stuff as dreams are made on, and our little life is rounded with a sleep.

Our revels now are ended. These our actors, As I foretold you, were all spirits, and Are melted into air, into thin air.

Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare
