
Page 1: F.L. Lewis

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

and

F.L. Lewis, National Academy of Inventors

Talk available online at http://www.UTA.edu/UTARI/acs

Applications of Integral Reinforcement Learning: Microgrids, UAV, Human-Robot Interaction

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, Shenyang, China

Supported by: ONR, US NSF

Supported by: China NNSF, China Project 111

Page 2: F.L. Lewis

Applications of Reinforcement Learning

Microgrid Control
Human-Robot Interactive Learning
Industrial process control: mineral grinding in Gansu, China
H-infinity control for UAV
Resilient Control to Cyber-Attacks in Networked Multi-agent Systems
Decision & Control for Heterogeneous MAS (different dynamics)

Page 3: F.L. Lewis


Work of Vahidreza Nasirian with Ali Davoudi

Game-theoretic Control for DC Microgrids

Page 4: F.L. Lewis


Page 5: F.L. Lewis

Advantages of DC Microgrids

AC Microgrid:

1) Complex synchronization procedure for grid-tied operation (frequency, magnitude, and phase match is required)
2) Complex control circuitry (voltage, frequency, and active/reactive power control)
3) Unwanted transmission loss due to reactive power exchange
4) Redundant dc-ac-dc conversions for integration of renewable sources, loads, and storage units
5) Harmonic current management and phase unbalances

DC Microgrid:

1) Only voltage and power control is needed
2) No reactive power flow and, thus, an improved overall efficiency
3) Converted renewable energies are basically dc and, thus, a dc distribution is more effective for integration of these sources
4) No harmonic current or phase unbalance issue

Page 6: F.L. Lewis

Cooperative Game-theoretic Control of Active Loads in DC Microgrids

[Figure: power buffer operation during a step change in power demand, showing input power p_in, output power p_out, and stored energy e over times t1, t2, t3.]

Supplies excess power needed during load changes until sources can respond.

[Figure: power buffers in a microgrid network, with line resistances r_ij, source voltage v_s1, bus voltages v_i, control inputs u_i, stored energies e_i, and buffer powers p_i.]

Ling-ling Fan, V. Nasirian, H. Modares, F.L. Lewis, Y.D. Song, and A. Davoudi, “Game-theoretic Control of Active Loads in DC Microgrids,” IEEE Trans. Energy Conversion, vol. 31, no. 3, pp. 882-895, 2016.

Page 7: F.L. Lewis

Active Load Power Buffer

$$\dot e_i = \frac{v_i^2}{r_i} - p_i, \qquad \dot r_i = u_i$$

where $e_i$ is the stored energy, $r_i$ the input impedance, $v_i$ the bus voltage, $u_i$ the control input, and $p_i$ the output power (a disturbance).

Vahid Nasirian

Nonlinear dynamics. Not obvious how to handle $p_i$.
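To make the buffer model concrete, here is a minimal simulation sketch of the two equations above; the bus voltage, load profile, and simple regulation policy are illustrative assumptions, not values from the cited paper.

```python
import numpy as np

# Minimal sketch: one active-load power buffer,
#   de/dt = v^2 / r - p   (stored energy)
#   dr/dt = u             (input impedance is steered by the control)
# Bus voltage, load profile, and the regulation policy are assumptions.

dt, T = 1e-3, 2.0
v = 48.0                     # bus voltage [V], held constant here
e, r = 20.0, 50.0            # stored energy [J], input impedance [ohm]
e_ref = 20.0                 # desired stored-energy level [J]

for k in range(int(T / dt)):
    t = k * dt
    p = 40.0 if t < 1.0 else 60.0   # step change in load power demand [W]
    u = 0.5 * (e - e_ref)           # assumed policy: raise r when e is high
    e += dt * (v**2 / r - p)        # energy balance
    r += dt * u                     # impedance dynamics
print("final stored energy %.2f J, input impedance %.2f ohm" % (e, r))
```

The transient dip in the stored energy after the load step at t = 1 s is exactly the buffering action described above: the buffer supplies the excess power until the impedance readjusts.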

Page 8: F.L. Lewis

Define coupled performance indices

$$J_i = \int_0^{\infty} \Big( \mathbf{x}^T \mathbf{Q}_i\, \mathbf{x} + u_i^2 + \sum_{j \in N_i} \rho_{ij}\, u_j^2 \Big)\, dt, \qquad i = M+1, \ldots, M+N$$

Solve for the bus voltages to get coupled agent dynamics of the linearized state-space form $\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u} + \mathbf{D}\mathbf{w}$, with coupling terms entering through the neighbors' controls. [Matrix entries garbled in extraction.]

Define the communication graph: a sparse, efficient topology. Optimal design provides resilience and disturbance rejection.

Linearize. Add $p_i$ as a state. Formulate as an H-infinity problem.

Vahid Nasirian, Reza Modares, Dr. Ali Davoudi

Page 9: F.L. Lewis


Optimal Cooperative Control as a Dynamic Game


Minimize the performance function for active loads:

$$J_i = \int_0^{\infty} \Big( \sum_{j \in N_i} \mathbf{x}_j^T \mathbf{Q}_{ij}\, \mathbf{x}_j + u_i^2 \Big)\, dt$$

Define the neighborhood state vector as $\bar{\mathbf{x}}_i = \big[\mathbf{x}_i^T,\ \{\mathbf{x}_j^T\}_{j \in N_i}\big]^T$.

The optimal solution is in the general form $u_i = \mathbf{k}_i \bar{\mathbf{x}}_i$.

With such solutions, the performance function $J_i$ is quadratic in $\bar{\mathbf{x}}_i$:

$$J_i(\bar{\mathbf{x}}_i) = \bar{\mathbf{x}}_i^T \mathbf{P}_i\, \bar{\mathbf{x}}_i$$

which helps to find the optimal solution by solving an algebraic Riccati equation:

$$u_i^* = -\mathbf{B}_{ii}^T \mathbf{P}_i\, \bar{\mathbf{x}}_i$$

Graphical Game
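For a single agent, ignoring the game coupling, the quadratic value and Riccati-based gain above can be computed directly; a sketch with assumed matrices, using scipy's continuous ARE solver:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Single-agent analogue of the quadratic graphical-game solution:
# minimize J = int( x'Qx + u'Ru ) dt for  xdot = Ax + Bu,
# giving J*(x) = x'Px and u* = -inv(R) B'P x.
# A, B, Q, R below are illustrative assumptions.

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.diag([10.0, 1.0])
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)   # algebraic Riccati equation
K = np.linalg.solve(R, B.T @ P)        # u* = -K x
print("P =\n", P)
print("K =", K)
```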

Page 10: F.L. Lewis


Optimal Cooperative Control: Policy Iteration finds Optimal Solutions


• Substituting the optimal solution into the Bellman equations leads to the following coupled algebraic Riccati equations (AREs).

• Policy iteration (a class of reinforcement learning) is used to solve the AREs and find Pi and the optimal control input.

• Policy evaluation: the performance of a given control policy, ui, is evaluated using the Bellman equation, and the Pi are found.

• Policy improvement: an improved control policy, ui, is found for each agent, using the Pi found in the first step.

• Policy evaluation and improvement are repeated until no improvement in the control policies, ui, of any agent is observed.

$$H_i = \bar{\mathbf{x}}_i^T \mathbf{Q}_i\, \bar{\mathbf{x}}_i + (u_i^*)^2 + \bar{\mathbf{x}}_i^T \mathbf{P}_i \big( \mathbf{A}_i \bar{\mathbf{x}}_i + \mathbf{B}_i u_i^* + \mathbf{D}_i w_i(\bar{\mathbf{x}}_i) \big) + \big( \mathbf{A}_i \bar{\mathbf{x}}_i + \mathbf{B}_i u_i^* + \mathbf{D}_i w_i(\bar{\mathbf{x}}_i) \big)^T \mathbf{P}_i\, \bar{\mathbf{x}}_i = 0$$

evaluated at $u_i = u_i^*$ and $u_j = u_j^*,\ j \in N_i$, with

$$u_i^* = -\mathbf{B}_{ii}^T \mathbf{P}_i\, \bar{\mathbf{x}}_i$$
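A minimal sketch of this evaluate/improve cycle for the linear-quadratic special case (Kleinman's policy iteration), with assumed matrices; each pass solves a Lyapunov (Bellman) equation instead of the full ARE:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Policy iteration for the LQR special case:
#  - policy evaluation:  solve  Ac'P + P Ac + Q + K'RK = 0  for P_k
#  - policy improvement: K_{k+1} = inv(R) B' P_k
# Matrices and the initial stabilizing gain are assumptions.

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 1.0])
R = np.array([[1.0]])
K = np.array([[0.0, 0.0]])          # A is already stable, so K0 = 0 works

for _ in range(10):
    Ac = A - B @ K
    Qk = Q + K.T @ R @ K
    P = solve_continuous_lyapunov(Ac.T, -Qk)   # policy evaluation
    K_new = np.linalg.solve(R, B.T @ P)        # policy improvement
    if np.linalg.norm(K_new - K) < 1e-10:      # stop when no improvement
        break
    K = K_new

print("PI gain :", K)
print("ARE gain:", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```

The two printed gains agree, illustrating the slide's point that repeated evaluation and improvement converges to the Riccati solution.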

Page 11: F.L. Lewis

(a) DC microgrid system, (b) active load, (c) communication network.

16

Controller Implementation

Microgrid Setup and Cooperative Controller

Page 12: F.L. Lewis

Controller Performance with Load Change


(a) Microgrid bus voltages at the load terminals, (b) output voltage of the power buffers, (c) output voltage across the resistive loads, (d) source currents, (e) stored energies in power buffers, (f) input impedance of the power buffers, (g) output of the active loads, (h) energy-impedance trajectory of power buffers during the load transient.

Load change in bus 5: buffers 4 and 5 assist. Load change in bus 4: multiple assistive buffers.

Page 13: F.L. Lewis

Intelligent Operational Control for Complex Industrial Processes

Professor Chai Tianyou

State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, May 20, 2013

Jinliang Ding

1. Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, “Data-based Multi-objective Plant-wide Performance Optimization of Industrial Processes under Dynamic Environments,” IEEE Trans. Industrial Informatics, vol. 12, no. 2, pp. 454-465, April 2016.

2. Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, “Data-driven Optimal Control of Operational Indices for a Class of Industrial Processes,” IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.

Page 14: F.L. Lewis

Manufacturing as the Interactions of Multiple Agents

Each machine has its own dynamics and cost function.
Neighboring machines influence each other most strongly.
There are local optimization requirements as well as global necessities.

Page 15: F.L. Lewis

Production line for mineral processing plant

Mineral Processing Plant in Gansu, China

Page 16: F.L. Lewis

Overall structure of the existing manual control: plant production indices, unit operational indices, and unit process control for a production line.

Page 17: F.L. Lewis

[Diagram: two-loop RL structure relating the plant production indices Q_k(mT) and their estimates Q̂_k(t), the operational indices r_{i,j}(mT) and their estimates r̂(t), the optimized set-points r*_{i,j}, and the index targets and bounds Q*_k, Q_k^min, Q_k^max.]

Automated online reinforcement learning for determining operational indices

Implemented by Jinliang Ding and Chai Tianyou's group in the biggest mineral processing factory for hematite iron ore in China, in Gansu Province.

Savings of 30.75 million RMB per year were realized by implementing this automated optimization procedure instead of the standard industry practice of human operator selection of process operational indices.

Two RL loops, and value function approximation.

Page 18: F.L. Lewis

Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, “Data-driven Optimal Control of Operational Indices for a Class of Industrial Processes,” IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.

Yi Jiang, Jialu Fan, Tianyou Chai, Jinna Li, and F.L. Lewis, “Data-Driven Flotation Industrial Process Operational Optimal Control Based on Reinforcement Learning,” IEEE Trans. Industrial Informatics, to appear, 2018.

Jinna Li, Tianyou Chai, F.L. Lewis, Jialu Fan, Zhengtao Ding, and Jinliang Ding, “Off-policy Q-learning: Set-point Design for Optimizing Dual-rate Rougher Flotation Operational Processes,” IEEE Trans. Industrial Electronics, vol. 65, no. 5, pp. 4092-4102, May 2018.

Jinna Li, Bahare Kiumarsi, Tianyou Chai, F.L. Lewis, and Jialu Fan, “Off-Policy Reinforcement Learning: Optimal Operational Control for Two-Time-Scale Industrial Processes,” IEEE Trans. Cybernetics, vol. 47, no. 12, pp. 4547-4558, Dec. 2017.

Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, “Data-based Multi-objective Plant-wide Performance Optimization of Industrial Processes under Dynamic Environments,” IEEE Trans. Industrial Informatics, vol. 12, no. 2, pp. 454-465, April 2016.

Page 19: F.L. Lewis

Control of Non-affine Aerial Systems Using Off-policy Reinforcement Learning

Page 20: F.L. Lewis
Page 21: F.L. Lewis

Non-affine nonlinear aerial vehicle model

$$\dot X(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$$

UAV dynamics (position $(x_1, x_2, x_3)$, airspeed $V$, flight-path angle $\gamma$, heading $\psi$, angle of attack $\alpha$, bank angle $\phi$):

$$\begin{aligned}
\dot x_1 &= V \cos\gamma \cos\psi + d_1 w_1\\
\dot x_2 &= V \cos\gamma \sin\psi + d_2 w_2\\
\dot x_3 &= -V \sin\gamma + d_3 w_3\\
\dot V &= g\,(n_x - \sin\gamma)\\
\dot\gamma &= \frac{g}{V}\,(n_z \cos\phi - \cos\gamma)\\
\dot\psi &= \frac{g\, n_z \sin\phi}{V \cos\gamma}
\end{aligned}$$

with load factors

$$n_x = \frac{T_{\max}\cos\alpha - D}{mg}, \qquad n_z = \frac{T_{\max}\sin\alpha + K}{mg}$$
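A quick numerical sketch of the point-mass model as reconstructed above; the constant inputs, the initial state, and the zero disturbances are assumptions chosen to produce a level coordinated turn:

```python
import numpy as np

# Sketch: Euler integration of the point-mass guidance model above.
# Inputs are held constant: n_z = 1/cos(phi) gives a level turn.
# All numeric values are illustrative assumptions.

g = 9.81
dt, T = 0.01, 20.0

x1, x2, x3 = 0.0, 0.0, 100.0          # position [m]
V, gam, psi = 25.0, 0.0, 0.0          # airspeed, flight-path angle, heading

phi = np.deg2rad(30.0)                # bank angle
n_x, n_z = 0.0, 1.0 / np.cos(phi)     # load factors

for _ in range(int(T / dt)):
    x1  += dt * V * np.cos(gam) * np.cos(psi)
    x2  += dt * V * np.cos(gam) * np.sin(psi)
    x3  += dt * (-V * np.sin(gam))
    V   += dt * g * (n_x - np.sin(gam))
    gam += dt * (g / V) * (n_z * np.cos(phi) - np.cos(gam))
    psi += dt * g * n_z * np.sin(phi) / (V * np.cos(gam))

print("position (%.1f, %.1f, %.1f) m, V = %.1f m/s" % (x1, x2, x3, V))
```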

Page 22: F.L. Lewis

Dynamics:

$$\dot X(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$$

State: $X = [x_1, x_2, x_3, V, \gamma, \psi]^T$, with $x_4 = V$, $x_5 = \gamma$, $x_6 = \psi$, where

$$f(X(t)) = \begin{bmatrix} x_4 \cos(x_5)\cos(x_6)\\ x_4 \cos(x_5)\sin(x_6)\\ -x_4 \sin(x_5)\\ -g\sin(x_5)\\ -\dfrac{g}{x_4}\cos(x_5)\\ 0 \end{bmatrix}$$

$g(X(t))$ is zero in its first three rows, so the inputs act only through the $\dot V$, $\dot\gamma$, $\dot\psi$ channels (the last scaled by $g/(x_4 \cos(x_5))$), and the non-affine input vector $L(u)$ enters through terms such as $u_2 \cos(u_3)$ and $u_2 \sin(u_3)$. [Remaining entries of $g(X(t))$ and $L(u)$ garbled in extraction.]

Page 23: F.L. Lewis

Optimal Control for Constrained Input Systems

Control constrained by the saturation function $\tanh(\cdot)$, which maps every input onto $(-1, 1)$.

Encode the constraint into the value function:

$$J(x, u) = \int_0^{\infty} \big( Q(x) + W(u) \big)\, dt, \qquad W(u) = 2\int_0^{u} \lambda \tanh^{-1}(v/\lambda)^T R\, dv$$

This is a quasi-norm: weaker than a norm, since the homogeneity property $\|qx\| = |q|\,\|x\|$ is replaced by the weaker symmetry property $\|{-x}\| = \|x\|$. (Used by Lyshevsky for H2 control.)

Then

$$u = -\lambda \tanh\!\Big( \frac{1}{2\lambda}\, R^{-1} g^T(x) \frac{\partial V}{\partial x} \Big)$$

is BOUNDED.

Murad Abu-Khalaf
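A small numerical check of this encoding (with an assumed scalar bound $\lambda$): the closed-form penalty matches its defining integral, and the resulting tanh control law stays inside the bound no matter how large the value gradient becomes.

```python
import numpy as np

# Sketch: the non-quadratic penalty that encodes the bound |u| < lam,
#   W(u) = 2 * int_0^u lam * atanh(v/lam) dv
#        = 2*lam*u*atanh(u/lam) + lam**2 * ln(1 - (u/lam)**2)  (closed form)
# lam and the test points are assumptions.

lam = 1.0

def W_closed(u):
    return 2*lam*u*np.arctanh(u/lam) + lam**2*np.log(1.0 - (u/lam)**2)

def W_numeric(u, n=200001):                 # trapezoid-rule check
    v = np.linspace(0.0, u, n)
    f = 2.0 * lam * np.arctanh(v / lam)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(v)))

for u in (0.2, 0.5, 0.9):
    print("u=%.1f  closed=%.6f  numeric=%.6f" % (u, W_closed(u), W_numeric(u)))

# The control u = -lam*tanh(s/(2*lam)) stays inside the bound for
# arbitrarily large value-gradient magnitude s = g(x)' dV/dx:
s = np.linspace(-50.0, 50.0, 101)
print(bool(np.all(np.abs(-lam * np.tanh(s / (2*lam))) < lam)))   # True
```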

Page 24: F.L. Lewis

H-infinity Control Tracking Problem

UAV dynamics:

$$\dot X(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$$

Desired trajectory generator:

$$\dot X_d(t) = h_d(X_d)$$

Bounded (discounted) L2 gain:

$$\int_t^{\infty} e^{-\alpha(\tau - t)}\, \|z(\tau)\|^2\, d\tau \le \gamma^2 \int_t^{\infty} e^{-\alpha(\tau - t)}\, \|w(\tau)\|^2\, d\tau$$

where

$$\|z(t)\|^2 = (X - X_d)^T Q (X - X_d) + W(L(u))$$

Constrained controls:

$$|u_1| \le \bar u_1, \qquad |u_2| \le \bar u_2$$

Formulate as an optimal control problem:

$$J = \int_t^{\infty} e^{-\alpha(\tau - t)} \Big[ (X - X_d)^T Q (X - X_d) + W(L(u)) - \gamma^2 w^T w \Big]\, d\tau$$

Page 25: F.L. Lewis

Write the augmented system and leader dynamics.

Tracking error:

$$e(t) = X(t) - X_d(t)$$

Augmented state:

$$Z(t) = \begin{bmatrix} e(t)\\ X_d(t) \end{bmatrix}$$

Augmented tracking dynamics:

$$\begin{bmatrix} \dot e(t)\\ \dot X_d(t) \end{bmatrix} = \begin{bmatrix} f(e + X_d) - h_d(X_d) + g(e + X_d)\,L(u) + D\,w(t)\\ h_d(X_d) \end{bmatrix} \equiv F(Z(t)) + G(Z(t))\,L(u) + K\,w(t)$$

Performance index:

$$J(L(u), w) = \int_t^{\infty} e^{-\alpha(\tau - t)} \Big[ Z^T \bar Q Z + W(L(u)) - \gamma^2 w^T w \Big]\, d\tau, \qquad \bar Q = \begin{bmatrix} Q & 0\\ 0 & 0 \end{bmatrix}$$

Page 26: F.L. Lewis

Optimal H-infinity Tracker

Bellman equation:

$$Z^T \bar Q Z + W(L(u)) - \gamma^2 w^T w - \alpha V(Z) + V_Z^T \dot Z = 0$$

Hamiltonian:

$$H(Z, L(u), w, V_Z) = Z^T \bar Q Z + W(L(u)) - \gamma^2 w^T w - \alpha V(Z) + V_Z^T \big( F(Z) + G(Z) L(u) + K w \big) = 0$$

The stationarity conditions

$$L(u)^* = \arg\min_{L(u)} H(Z, L(u), w, V^*), \qquad w^* = \arg\max_{w} H(Z, L(u), w, V^*)$$

give the optimal control and the worst-case disturbance:

$$L(u)^* = -\bar u \tanh\!\Big( \frac{1}{2\bar u}\, G^T(Z)\, V_Z^* \Big), \qquad w^* = \frac{1}{2\gamma^2} K^T V_Z^*$$
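One step the slide compresses, sketched out: with the constrained-input penalty $W(\ell) = 2\int_0^{\ell} \bar u \tanh^{-1}(v/\bar u)\, dv$ from the earlier slide (scalar form per input channel), setting the gradients of $H$ to zero gives

$$\frac{\partial H}{\partial w} = -2\gamma^2 w + K^T V_Z = 0 \;\Longrightarrow\; w^* = \frac{1}{2\gamma^2} K^T V_Z^*$$

$$\frac{\partial H}{\partial L(u)} = 2\bar u \tanh^{-1}\!\big( L(u)/\bar u \big) + G^T(Z)\, V_Z = 0 \;\Longrightarrow\; L(u)^* = -\bar u \tanh\!\Big( \frac{1}{2\bar u}\, G^T(Z)\, V_Z^* \Big)$$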

Page 27: F.L. Lewis

Assume $L(u)$ is invertible. Then

$$u^* = L^{-1}\Big( -\bar u \tanh\!\Big( \frac{1}{2\bar u}\, G^T(Z)\, V_Z^* \Big) \Big)$$

Page 28: F.L. Lewis

Reinforcement Learning Policy Iteration Solution

Need to know input matrices G and K

$$\begin{bmatrix} \dot e(t)\\ \dot X_d(t) \end{bmatrix} = \begin{bmatrix} f(e + X_d) - h_d(X_d) + g(e + X_d)\,L(u) + D\,w(t)\\ h_d(X_d) \end{bmatrix} \equiv F(Z(t)) + G(Z(t))\,L(u) + K\,w(t)$$

Page 29: F.L. Lewis

$$\dot Z(t) = F(Z(t)) + G(Z(t))\,L(u_j) + K\,w_j + G(Z(t))\big( L(u) - L(u_j) \big) + K\big( w - w_j \big)$$

where $u_j, w_j$ are the policies being evaluated and $u, w$ are the behavior policies that generate the data.

Off‐Policy IRL Solution

Do not need any of the dynamics of UAV or leader

Off-Policy Bellman Equation
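Why the dynamics drop out (a sketch): differentiate $V$ along the rewritten dynamics, substitute the Bellman equation for the on-policy part, and integrate over a reinforcement interval $[t, t+T]$:

$$V(Z(t+T)) - V(Z(t)) = \int_t^{t+T} \Big[ \alpha V - Z^T \bar Q Z - W(L(u_j)) + \gamma^2 w_j^T w_j + V_Z^T G(Z)\big( L(u) - L(u_j) \big) + V_Z^T K \big( w - w_j \big) \Big]\, d\tau$$

The drift $F(Z)$ no longer appears; every remaining term is measurable along data generated by the behavior policies $u, w$.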

Page 30: F.L. Lewis

Data‐Driven Real‐Time Solution Using VFA

Approximate critic, control, disturbance

Plug into Off‐Policy Bellman Equation to get algebraic equations for the weights
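A schematic of the least-squares mechanics, stripped down to a single policy, no disturbance, and a scalar plant with one quadratic basis function so the fitted weight can be checked analytically; all numbers are assumptions:

```python
import numpy as np

# VFA sketch: approximate the critic as V(x) ~ W' phi(x) and solve
# for W in batch least squares, one equation per data window.

a, b, k = -1.0, 1.0, 0.5                   # xdot = a x + b u, policy u = -k x
q, r = 1.0, 1.0                            # running cost q x^2 + r u^2
p_true = (q + r*k**2) / (-2.0*(a - b*k))   # analytic value weight

dt, Tw = 1e-3, 0.05                        # Euler step, reinforcement interval
nw = int(Tw / dt)

def phi(x):                                # critic basis: just x^2 here
    return np.array([x * x])

rows, rhs = [], []
x = 2.0
for _ in range(200):                       # one linear equation per interval
    x0, cost = x, 0.0
    for _ in range(nw):
        u = -k * x
        cost += dt * (q*x*x + r*u*u)
        x += dt * (a*x + b*u)
    rows.append(phi(x0) - phi(x))          # W'(phi(x_t) - phi(x_{t+T})) = cost
    rhs.append(cost)

W = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
print("fitted weight %.4f vs analytic %.4f" % (W[0], p_true))
```

With richer basis functions and the actor/disturbance approximators added, the same batch solve yields all three sets of weights at once, as the slide describes.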

Page 31: F.L. Lewis

RL for Human-Robot Interaction (HRI)

1. H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-Robot Interaction Using Reinforcement Learning,” IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 655-667, 2016.

2. I. Ranatunga, F.L. Lewis, D.O. Popa, and S.M. Tousif, “Adaptive Admittance Control for Human-Robot Interaction Using Model Reference Design and Adaptive Inverse Filtering,” IEEE Transactions on Control Systems Technology, vol. 25, no. 1, pp. 278-285, Jan. 2017.

3. B. AlQaudi, H. Modares, I. Ranatunga, S.M. Tousif, F.L. Lewis, and D.O. Popa, “Model Reference Adaptive Impedance Control for Physical Human-Robot Interaction,” Control Theory and Technology, vol. 14, no. 1, pp. 1-15, Feb. 2016.

Page 32: F.L. Lewis

PR2 meets Isura

Page 33: F.L. Lewis

Robot dynamics

Prescribed Error system

Control torque depends on impedance model parameters

Impedance Control

Page 34: F.L. Lewis

Standard Robot Trajectory Tracking Controller

Where is the human?

Page 35: F.L. Lewis

Human task learning has two components:
1. The human learns a robot dynamics model to compensate for robot nonlinearities.
2. The human learns a task model to properly perform the task.

Inner robot-specific control loop: INDEPENDENT OF TASK

Outer task-specific control loop: INDEPENDENT OF ROBOT DETAILS

Human Performance Factors Studies

Page 36: F.L. Lewis

Robot control inner loop

Task control outer loop

RL for Human‐Robot Interactions

Page 37: F.L. Lewis

No task trajectory information is used in this inner-loop robot controller.
The inner-loop robot controller makes the model-following error small.
The admittance model parameters are not needed; only the admittance model trajectories $x_m, \dot x_m, \ddot x_m$ are needed.

New Inner Robot Control Loop

Page 38: F.L. Lewis

Three Outer-Loop Designs (to appear, 2016)

Page 39: F.L. Lewis

2C. Outer-Loop Task-Specific Design #3

Reinforcement learning for minimum human effort, with a feedforward assistive control term.

[Block diagram: the human (gain $K_h$) and the prescribed impedance model $(Ms^2 + Bs + K)^{-1}$ in a loop relating the desired trajectory $x_d$, the model trajectory $x_m$, the error $e$, the human force $f_h$, and a PD assistive term.]

Find the robot impedance model parameters $M, B, K$ to minimize the human force effort $f_h$ and the task trajectory following error $e_d$.

Human force amplifier.

Force exerted by the human indicates his discontent: a measure of human intent.

Work of Reza Modares

Page 40: F.L. Lewis

Feedback linearization loop.

Robot impedance model; unknown human model:

$$K_d \dot f_h + K_p f_h = k_e e_d, \qquad (K_d s + K_p) f_h = k_e e_d$$

so that

$$\dot f_h = -K_d^{-1} K_p f_h + K_d^{-1} k_e e_d \equiv A_h f_h + E_h e_d$$

Page 41: F.L. Lewis

Augmented Tracker Dynamics with Human and Tracking Error

Minimize human effort and tracking error. Performance index:

$$J = \int_t^{\infty} \Big( e_d^T Q_d e_d + f_h^T Q_h f_h + u_e^T R u_e \Big)\, dt$$

Then the control is

$$u_e = K_1 e_d + K_2 f_h$$

Tracking error $e_d = x_d - x_m \in \mathbb{R}^n$, augmented error $\bar e_d = [e_d^T, \dot e_d^T]^T \in \mathbb{R}^{2n}$.

Overall augmented dynamics, with state $X$ stacking $\bar e_d$ and $f_h$:

$$J = \int_t^{\infty} \Big( X^T Q_e X + u_e^T R u_e \Big)\, dt$$
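A sketch of the resulting design computation: stack the tracking error and the human-force state, solve one ARE, and split the gain into $K_1$ (tracking error) and $K_2$ (human force). The error and human dynamics below are simple stand-ins, not the identified models of the cited papers.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Augmented HRI design sketch: state X = [ed, ed_dot, fh].
# Ah, Eh and the stand-in error dynamics are assumptions.

Ah, Eh = -2.0, 1.0                 # assumed human model: fh' = Ah fh + Eh ed

A = np.array([[0.0, 1.0, 0.0],     # ed'     = ed_dot
              [0.0, 0.0, 1.0],     # ed_dot' = u_e + fh   (stand-in dynamics)
              [Eh,  0.0, Ah ]])    # fh'     = Ah fh + Eh ed
B = np.array([[0.0], [1.0], [0.0]])

Qe = np.diag([10.0, 1.0, 2.0])     # weights on ed, ed_dot, fh
R  = np.array([[1.0]])

P = solve_continuous_are(A, B, Qe, R)
K = np.linalg.solve(R, B.T @ P)    # u_e = -K X
K1, K2 = K[0, :2], K[0, 2]         # tracking-error gain, human-force gain
print("K1 =", K1, " K2 =", K2)
```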

Page 42: F.L. Lewis

We want an online method to learn the optimal control without knowing the system matrix A.

Optimal Design Always Admits Reinforcement Learning for Real-Time Optimal Adaptive Control

Optimal control is an offline method, based on solving the ARE and knowing all the plant dynamics.

Page 43: F.L. Lewis

Take enough data along the system trajectory to solve this equation using least squares.

OFF-POLICY Reinforcement Learning needs NO knowledge of the system dynamics.
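A sketch of that least-squares step for a scalar linear plant: the unknowns $p$ (value weight) and $2pb$ are identified jointly from trajectory data, so the improved gain needs neither $a$ nor $b$. The true plant parameters below are used only to simulate the data.

```python
import numpy as np

# Off-policy least squares for xdot = a x + b u (a, b simulate data only).
# Behavior: u = -kb x + probing noise.  Target policy evaluated: u_j = -k x.
# Identity used:  p(x_T^2 - x_0^2) = -(q + r k^2) I_xx + 2pb I_xu,
# with I_xx = int x^2, I_xu = int x (u + k x).  All numbers are assumptions.

a, b = -1.0, 2.0
q, r = 1.0, 1.0
k, kb = 0.3, 0.5                   # target gain, behavior gain

dt, Tw, nwin = 1e-3, 0.05, 300
nw = int(Tw / dt)
rng = np.random.default_rng(0)

rows, rhs = [], []
x = 2.0
for _ in range(nwin):
    x0, I_xx, I_xu = x, 0.0, 0.0
    for _ in range(nw):
        u = -kb * x + 0.5 * rng.standard_normal()   # exploring behavior
        I_xx += dt * x * x
        I_xu += dt * x * (u + k * x)                # off-policy correction
        x += dt * (a * x + b * u)
    rows.append([x * x - x0 * x0, -I_xu])           # unknowns: th1=p, th2=2pb
    rhs.append(-(q + r * k * k) * I_xx)

th1, th2 = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
p_true = (q + r * k * k) / (-2.0 * (a - b * k))     # analytic check
print("p = %.4f (true %.4f), improved gain = %.4f" % (th1, p_true, th2/(2*r)))
```

Because $2pb$ is estimated as its own unknown, the improved gain $k^+ = 2pb/(2r)$ comes straight out of the data, which is exactly the model-free property the slide claims.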

Page 44: F.L. Lewis
Page 45: F.L. Lewis