
Published in IET Generation, Transmission & Distribution
Received on 30th March 2009
Revised on 28th August 2009
doi: 10.1049/iet-gtd.2009.0168

ISSN 1751-8687

Load–frequency control: a GA-based multi-agent reinforcement learning
F. Daneshfar, H. Bevrani
Department of Electrical and Computer Engineering, University of Kurdistan, Sanandaj, PO Box 416, Kurdistan, Iran
E-mail: [email protected]

Abstract: The load–frequency control (LFC) problem has been one of the major subjects in power systems. In practice, LFC systems use proportional–integral (PI) controllers. However, since these controllers are designed using a linear model, the non-linearities of the system are not accounted for and they are incapable of obtaining good dynamic performance over a wide range of operating conditions in a multi-area power system. A strategy for solving this problem, motivated by the distributed nature of a multi-area power system, is presented using a multi-agent reinforcement learning (MARL) approach. It consists of two agents in each power area: the estimator agent provides the area control error (ACE) signal based on the frequency bias (b) estimation, and the controller agent uses reinforcement learning to control the power system, in which genetic algorithm optimisation is used to tune its parameters. This method does not depend on any knowledge of the system and it admits considerable flexibility in defining the control objective. Also, by finding the ACE signal based on the b estimation, the LFC performance is improved, and by using the MARL, parallel computation is realised, leading to a high degree of scalability. Here, to illustrate the accuracy of the proposed approach, a three-area power system example is given with two scenarios.


1 Introduction

Frequency changes in large-scale power systems are a direct result of the imbalance between the electrical load and the power supplied by system-connected generators [1]. Therefore load–frequency control (LFC) is one of the important power system control problems, for which there have been considerable research works [2–5]. In [6], a centralised controller is designed for a two-area power system, which requires the knowledge of the whole system. In [7], decentralised controllers for a two-area power system are proposed. These controllers are designed based on modern control theory, and each area requires knowledge of the other area. If the dimensions of the power system increase, then these controllers may become more complex as the number of state variables increases significantly.

Also, there has been continuing interest in designing load–frequency controllers with better performance using various decentralised robust and optimal control methods during the last two decades [8–18].


In [17], two robust decentralised control design methodologies for the LFC are proposed. The first one is based on control design using the linear matrix inequalities (LMI) technique and the second one is tuned by a robust control design algorithm. However, in [18], a decentralised LFC synthesis is formulated as an H∞-control problem and is solved using an iterative LMI algorithm that yields a lower-order proportional–integral (PI) controller than [17]. Both controllers are tested on a three-control-area power system with three scenarios of load disturbances to demonstrate their robust performance.

But all the above-mentioned controllers are designed for a specific disturbance; if the nature of the disturbance varies, they may not perform as expected. Also, they are model-based controllers that depend on a specific model and are not usable for large systems such as power systems, with their non-linearities and undefined parameters and models. The proposed methods also assume that all model parameters are defined, measurable and fixed, whereas in a real power system some parameters, such as b, change with environmental conditions and do not have constant values. Therefore the design of intelligent controllers that are more adaptive than linear and robust controllers has become an appealing approach [19–21].

One of the adaptive and non-linear control techniques that has found applications in the LFC design is reinforcement learning (RL) [22–27]. These controllers learn and are adjusted to keep the area control error (ACE) small in each sampling time of a discretised LFC cycle. Since they are based on learning methods, are independent of environmental conditions and can learn each kind of environmental disturbance, they are not model based and can easily be scaled to large systems. They can also work well under non-linear conditions and in non-linear systems.

In this paper, a multi-agent reinforcement learning (MARL)-based control structure is proposed that has the b estimation as one of its functionalities. It consists of two agents in each control area that communicate with each other to control the whole system. The first agent (i.e. the estimator agent) provides the ACE signal based on the b parameter estimation, and the second agent (i.e. the controller agent) provides ΔP_c according to the ACE signal received from the estimator agent, using RL; ΔP_c is then distributed among the different units under control using fixed participation factors. In a multi-area power system, the learning process is a MARL process and all agents of all areas learn together (not individually).

The above technique has been applied to the LFC problem in a three-control-area power system as a case study. In the new environment, the overall power system can also be considered as a collection of control areas interconnected through high-voltage transmission lines or tie-lines. Each control area consists of a number of generating companies (Gencos) and is responsible for tracking its own load and performing the LFC task.

The organisation of the rest of the paper is as follows. In Section 2, a brief introduction to single-agent and multi-agent RL and the LFC problem is given. In Section 3, an explanation of how a load–frequency controller can work within this formulation is provided. In Section 4, a case study of a three-control-area power system, for which the above architecture is implemented, is discussed. Simulation results are provided in Section 5 and the paper is concluded in Section 6.

2 Background

In this section, a brief background on single-agent RL and MARL [28] is introduced. First, single-agent RL is defined and its solution is described. Then, the multi-agent task is defined. The discussion is restricted to finite state and action spaces, since the major part of the MARL results are given for finite spaces [29].


2.1 Single-agent RL

RL is learning what to do – how to map situations to actions – so as to maximise a numerical reward signal [26]. In fact, the learner discovers which action should be taken by interacting with the system and trying the different actions that may lead to the highest reward. RL evaluates the actions taken and gives the learner feedback on how good the taken action was and whether it should repeat this action in the same situation or not. In other words, RL methods learn to solve a problem by interacting with a system. The learner is called the agent and the system it interacts with is known as the environment. During the learning process, the agent interacts with the environment and takes an action a_t from a set of actions at time t. This action affects the system and takes it to a new state x_{t+1}, and the agent is provided with the corresponding reward signal (r_{t+1}). This agent–environment interaction is repeated until the desired goal is achieved. In this paper, what is meant by the state is the information required for making a decision; therefore what we would like, ideally, is a state signal that summarises past perceptions in a way in which all relevant information is retained. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property [26], and an RL task that satisfies this property is called a finite Markov decision process (MDP). If an environment has the Markov property, then its dynamics enable us to predict the next state and expected next reward, given the current state and action. In the remainder of the text, it is assumed that the environment has the Markov property, therefore an MDP problem is solved. In each MDP, the objective is to maximise the sum of returned rewards over time, and the expected sum of discounted rewards is defined by

R = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (1)

where 0 < γ < 1 is a discount factor, which gives the maximum importance to the recent rewards.

Another term is the value function, which is defined as the expected return (reward) when starting at state x_t while following policy π(x, a) [see (2)]. The policy is the way in which the agent maps states to actions [26]

V^\pi(x) = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid x_t = x \right\}    (2)

The optimal policy is the one that maximises the value function. Therefore, once the optimal state value is derived, the optimal policy can be found using

V^{*}(x) = \max_{\pi} V^\pi(x), \quad \forall x \in X    (3)

In most RL methods, instead of calculating the state value, another term known as the action value is calculated [see (4)], which is defined as the expected discounted reward when starting at state x_t and taking action a_t

Q^\pi(x, a) = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid x_t = x,\ a_t = a \right\}    (4)

Bellman's equation, as shown below, is used to find the optimal action value. Overall, an optimal policy is one that maximises the Q-function defined in (5)

Q^{*}(x, a) = \max_{\pi} E_\pi\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(x_{t+1}, a') \mid x_t = x,\ a_t = a \right\}    (5)

Different RL methods have been proposed to solve the above equations. In some algorithms, the agent first approximates the model of the system in order to calculate the Q-function. The method used in this paper is of the temporal-difference type, which learns without a model of the system under control. The only available information is the reward achieved by each action taken and the next state. The algorithm, called Q-learning, approximates the Q-function and, from the computed function, the optimal policy that maximises this function is derived [26].

2.2 Multi-agent reinforcement learning

A multi-agent system [30] can be defined as a group of autonomous, interacting entities (or agents) [31] sharing a common environment, which they perceive with sensors and upon which they act with actuators [32]. Multi-agent systems can be used in a wide variety of domains including robotic teams, distributed control, resource management, collaborative decision support systems and so on. Well-understood algorithms with good convergence and consistency properties are available for solving the single-agent RL task, both when the agent knows the dynamics of the environment and the reward function (the task model) and when it does not. However, the scalability of algorithms to realistic problem sizes is problematic in single-agent RL, and this is one of the main reasons for which MARL should be used [29]. In addition to scalability and the benefits owing to the distributed nature of the multi-agent solution, such as parallel computation, multiple RL agents may gain new benefits from sharing experience, for example by communication, teaching or imitation [29]. These properties make RL suitable for multi-agent learning. However, several new challenges arise for RL in multi-agent systems. In multi-agent systems, other adapting agents make the environment no longer stationary, violating the Markov property that traditional single-agent learning relies on; this non-stationarity degrades the convergence properties of most single-agent RL algorithms [33]. Another problem is the difficulty of defining a good learning goal for the multiple RL agents [29], since each agent must also coordinate its behaviour with the other agents. These challenges make MARL design and learning difficult in large applications; therefore a special learning algorithm, described below, is used. (The violation of the Markov property caused by the multi-agent structure can be handled using the following learning algorithm.)

2.2.1 Learning algorithm: The generalisation of the MDP to the multi-agent case is as follows:

Suppose a tuple \langle X, A_1, \ldots, A_n, p, r_1, \ldots, r_n \rangle where n is the number of agents, X is the discrete set of environment states, A_i, i = 1, \ldots, n, are the discrete sets of actions available to the agents, yielding the joint action set A = A_1 \times \cdots \times A_n, p: X \times A \times X \to [0, 1] is the state transition probability function, and r_i: X \times A \times X \to \mathbb{R}, i = 1, \ldots, n, are the reward functions of the agents.

In the multi-agent case, the state transitions are the result of the joint action of all the agents, a_k = [a_{1,k}^T, \ldots, a_{n,k}^T]^T, a_k \in A, a_{i,k} \in A_i (where T denotes vector transpose). As a result, the rewards r_{i,k+1} and the returns R_{i,k} also depend on the joint action. The policies h_i: X \times A_i \to [0, 1] together form the joint policy h. The Q-function of each agent depends on the joint action and is conditioned on the joint policy, Q_i^h: X \times A \to \mathbb{R} [29].
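The joint formulation above implies that each agent's value estimates are indexed by the joint state and joint action rather than by its own state and action alone. The following is a minimal sketch (not from the paper) of how such a tabular joint Q-function could be stored; the sizes of the per-agent state and action sets are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

# Minimal sketch (not from the paper): tabular Q-functions over joint states
# and joint actions for n agents, as in the tuple <X, A1..An, p, r1..rn>.
n_agents = 3
states_per_agent = [range(4)] * n_agents      # hypothetical discretised X_i
actions_per_agent = [range(2)] * n_agents     # hypothetical discretised A_i

# Joint state space X = X1 x ... x Xn, joint action space A = A1 x ... x An
joint_states = list(product(*states_per_agent))
joint_actions = list(product(*actions_per_agent))

# One Q-table per agent, indexed by (joint_state, joint_action), initialised to 0
Q = [defaultdict(float) for _ in range(n_agents)]

# Example lookup/update for agent 0 (placeholder TD step with dummy values)
x, a = joint_states[0], joint_actions[0]
Q[0][(x, a)] += 0.1 * (0.0 + 0.9 * 0.0 - Q[0][(x, a)])
```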

2.3 Load–frequency control

A large-scale power system consists of a number of interconnected control areas [34]. Fig. 1 shows the block diagram of control area i, which includes n Gencos, from an N-control-area power system. As is usual in the LFC design literature, three first-order transfer functions are used to model the generator, turbine and power system (rotating mass and load) units. The parameters are described in the list of symbols in [34]. Following a load disturbance within a control area, the frequency of that area experiences a transient change, the feedback mechanism comes into play and generates an appropriate raise/lower signal to the participating Gencos according to their participation factors α_ji in order to make the generation follow the load. In the steady state, the generation is matched with the load, driving the tie-line power and frequency deviations to zero. The balance between interconnected control areas is achieved by detecting the frequency and tie-line power deviations in order to generate the ACE signal, which is, in turn, utilised in the PI control strategy, as shown in Fig. 1. The ACE for each control area can be expressed as a linear combination of the tie-line power change and frequency deviation [34]

ACE_i = B_i \Delta f_i + \Delta P_{tie,i}    (6)
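As a small illustration of (6), the sketch below (not from the paper) computes ACE_i from measured frequency and tie-line power deviations; the numerical values are purely illustrative assumptions.

```python
# Minimal sketch (not from the paper): computing the ACE of area i from eq. (6).
# B_i, delta_f_i and delta_p_tie_i are assumed to be measured/known per-unit values.
def area_control_error(B_i: float, delta_f_i: float, delta_p_tie_i: float) -> float:
    """ACE_i = B_i * delta_f_i + delta_P_tie,i  -- eq. (6)."""
    return B_i * delta_f_i + delta_p_tie_i

# Illustrative values only: a small frequency dip with no tie-line deviation
ace = area_control_error(B_i=20.0, delta_f_i=-0.001, delta_p_tie_i=0.0)
```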

Figure 1 LFC system with different generation units and participation factors in area i [34]

3 Proposed control framework

In practice, the LFC controller structure is traditionally a PI-type controller using the ACE as its input, as shown in Fig. 1. In this section, the intelligent control design algorithm for such a load–frequency controller using the MARL technique is presented. The objective of the proposed design is to control the frequency so as to achieve the same performance as the proposed robust control designs in [17, 18].

Since the LFC problem can be regarded as an MDP problem [28], RL can be applied to its controller.

Fig. 2 shows the proposed model for area i; two kinds of intelligent agents are used in this structure, the controller agent and the estimator agent. The estimator agent is responsible for estimating the frequency bias parameter (b) and calculating the ACE signal, whereas the controller agent is responsible for finding ΔP_c according to the ACE signal using RL and GA.

3.1 Controller agent

Figure 2 Proposed model for area i

The PI controller in the LFC can be replaced with an intelligent controller that decides on the set-point changes of separate closed-loop control systems. This allows the LFC algorithm to be designed to operate on a discrete time scale, and the results are more flexible. In this view, the intelligent controller (controller agent) system can be abstracted as follows: at each instant k on a discrete time scale, k = 1, 2, ..., the controller agent observes the current state of the system, x_k, and takes an action, a_k. The state vector consists of some quantities that are normally available to the controller agent. Here, the averages of the ACE signal instances and of the ΔP_L instances over the LFC execution period are used as the state vector. For the algorithm presented in this paper, it is assumed that the set of all possible states, X, is finite. Therefore the values of the various quantities that constitute the state information should be quantised.

The possible actions of the controller agent are the various values of ΔP_c that can be demanded at the generation level within an LFC interval. ΔP_c is also discretised to some finite number of levels. Now, since both X and A are finite sets, a model for this dynamic system can be specified through a set of probabilities.
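The sketch below (not from the paper) shows one way the continuous measurements could be mapped onto such finite levels; the ranges and number of levels are assumptions, and in the paper the levels themselves are tuned by the GA of Section 3.1.2.

```python
import numpy as np

# Minimal sketch (not from the paper): quantising a continuous measurement
# (e.g. the averaged ACE or a candidate dP_c) onto a finite set of levels,
# so that the RL state and action spaces X and A stay finite.
def quantise(value: float, levels: np.ndarray) -> int:
    """Return the index of the discretisation level closest to `value`."""
    return int(np.argmin(np.abs(levels - value)))

# Hypothetical levels; in the paper these are found by the GA (Section 3.1.2).
ace_levels = np.linspace(-0.05, 0.05, 11)   # assumed ACE range, pu
dpc_levels = np.linspace(-0.02, 0.02, 6)    # assumed dP_c range, pu

state_index = quantise(-0.013, ace_levels)
action_index = quantise(0.004, dpc_levels)
```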

Here, an RL algorithm is used for estimating Q^{*} and the optimal policy. It is the same as the algorithm used in [25].

Suppose there is a sequence of samples (x_k, x_{k+1}, a_k, r_k), k = 1, 2, ... (k indexes the LFC execution period). Each sample is such that x_{k+1} is the (random) state that results when action a_k is performed in state x_k, and r_k = g(x_k, x_{k+1}, a_k) is the consequent immediate reinforcement. Such a sequence of samples can be obtained either through a simulation model of the system or by observing the actual system in operation. This sequence of samples (called the training set) can be used for estimating Q^{*}. The specific algorithm that is used is as follows. Suppose Q_k is the estimate of Q^{*} at the kth iteration. Let the next sample be (x_k, x_{k+1}, a_k, r_k). Then Q_{k+1} is obtained as

Q_{k+1}(x_k, a_k) = Q_k(x_k, a_k) + \alpha \left[ g(x_k, x_{k+1}, a_k) + \gamma \max_{a' \in A} Q_k(x_{k+1}, a') - Q_k(x_k, a_k) \right]    (7)

where 0 < α < 1 is a constant called the step size of the learning algorithm.
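A minimal sketch of the tabular update (7) is given below. It is not the paper's code: the array shape follows the 400 ACE states and 6 ΔP_c actions mentioned in Section 3.1.2, and the values of α, γ and the sample indices are assumptions.

```python
import numpy as np

# Minimal sketch (not from the paper's code): the tabular Q-learning update of
# eq. (7). Q is a |X| x |A| array over the quantised states and dP_c actions.
def q_update(Q: np.ndarray, x_k: int, a_k: int, x_next: int,
             reward: float, alpha: float = 0.1, gamma: float = 0.9) -> None:
    """In-place Q-learning update for one (x_k, x_{k+1}, a_k, r_k) sample."""
    td_target = reward + gamma * np.max(Q[x_next, :])
    Q[x_k, a_k] += alpha * (td_target - Q[x_k, a_k])

# Usage: 400 quantised ACE states and 6 dP_c actions, as in Section 3.1.2
Q = np.zeros((400, 6))
q_update(Q, x_k=12, a_k=3, x_next=10, reward=-0.02)
```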

At each time step (as determined by the sampling time for the LFC action), the state input vector x to the LFC is determined, and then an action in that state is selected and applied to the model; the model is integrated for a time interval equal to the sampling time of the LFC to obtain the state vector x' at the next time step.

Here, an exploration policy is used for choosing actions in different states. It is based on a learning automata algorithm called the pursuit algorithm [35]. This is a stochastic policy where, for each state x, actions are chosen based on a probability distribution over the action space. Let P_k^x denote the probability distribution over the action set for the state vector x at the kth iteration of learning. That is, P_k^x(a) is the probability of choosing action a in state x at iteration k. Initially (i.e. at k = 0), a uniform probability distribution is chosen. That is

P_0^x(a) = \frac{1}{|A|}, \quad \forall a \in A,\ \forall x \in X    (8)

At the kth iteration, let the state x_k be equal to x. An action a_k is chosen at random based on P_k^x(\cdot); that is, Prob(a_k = a) = P_k^x(a). Using our simulation model, the system goes to the next state x_{k+1} by applying action a in state x and is integrated for the next time interval. Then Q_k is updated to Q_{k+1} using (7) and the probabilities are also updated as follows

P_{k+1}^x(a_g) = P_k^x(a_g) + \beta (1 - P_k^x(a_g))
P_{k+1}^x(a) = P_k^x(a)(1 - \beta), \quad \forall a \in A,\ a \neq a_g    (9)
P_{k+1}^y(a) = P_k^y(a), \quad \forall a \in A,\ \forall y \in X,\ y \neq x

where 0 < β < 1 is a constant. Thus, at iteration k, the probability of choosing the greedy action a_g in state x is slightly increased and the probabilities of choosing all other actions in state x are proportionally decreased.
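The following sketch (not from the paper's code) implements the pursuit update (9) for one visited state; the value of β and the table sizes are assumptions.

```python
import numpy as np

# Minimal sketch (not from the paper's code): the pursuit-algorithm update of
# eq. (9) for one visited state. P[x] holds the action probabilities for state x
# and a_g is the current greedy action argmax_a Q[x, a].
def pursuit_update(P: np.ndarray, Q: np.ndarray, x: int, beta: float = 0.01) -> None:
    """Shift probability mass in state x towards the greedy action."""
    a_g = int(np.argmax(Q[x, :]))
    P[x, :] *= (1.0 - beta)            # all actions in state x shrink by (1 - beta)
    P[x, a_g] += beta                  # equals P + beta*(1 - P) for the greedy action
    # probabilities of states y != x are left untouched

# Usage with uniform initial probabilities, eq. (8)
n_states, n_actions = 400, 6
P = np.full((n_states, n_actions), 1.0 / n_actions)
Q = np.zeros((n_states, n_actions))
pursuit_update(P, Q, x=12)
action = np.random.choice(n_actions, p=P[12])   # sample a_k ~ P_k^x
```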

3.1.1 Learning the controller: In this algorithm, the aim is to achieve the conventional LFC objective and keep the ACE within a small band around zero. This choice is motivated by the fact that all the existing LFC implementations use this as the control objective and hence it will be possible for us to compare the proposed RL approach with the robust PI-based LFC approaches designed in [17, 18]. As mentioned above, in this formulation each state vector consists of two state variables: the average value of the ACE (the first state variable, x^1) and the ΔP_L (the second state variable, x^2). Since RL algorithms that assume a finite number of states are considered, the state variables should be discretised to finite levels. Here, genetic algorithm (GA) optimisation is used to find good discretised state vector levels.

The control action of the LFC is to change the generation set point, ΔP_c. Since the range of generation change that can be effected in an LFC cycle is known, this range can be discretised to finite levels using the GA, too.

The next step is to choose an immediate reinforcement function by defining the function g. The reward matrix is initially full of zeros; at each LFC execution period, the average value of ΔP_L and the average value of the ACE signal are obtained, and then, according to the discretised values gained from the GA, the state of the system is determined. Whenever the state is desirable (i.e. |ACE| is less than ε), the reward function g(x_k, x_{k+1}, a_k) is assigned the value zero. When it is undesirable (i.e. |ACE| > ε), then g(x_k, x_{k+1}, a_k) is assigned the value −|ACE| (all actions that cause a transition to an undesirable state are penalised with a negative value).
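A minimal sketch of this reward rule is shown below (not from the paper's code); the threshold ε is an assumed value.

```python
# Minimal sketch (not from the paper's code): the immediate reinforcement
# g(x_k, x_{k+1}, a_k) described above. `ace_avg` is the average ACE over the
# LFC execution period and `eps` is the desirable-band threshold (assumed value).
def reward(ace_avg: float, eps: float = 0.005) -> float:
    """Zero when |ACE| is inside the desirable band, -|ACE| otherwise."""
    return 0.0 if abs(ace_avg) < eps else -abs(ace_avg)

print(reward(0.002))   # desirable state   -> 0.0
print(reward(-0.02))   # undesirable state -> -0.02
```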

3.1.2 Finding actions and states based on GA: The GA is used to gain better results and to tune the quantised values of the state vector and the action vector.

To quantise the state range and the action range using the GA, each individual, which represents the quantised values of the states and actions, should be a double vector. It is clear that, as the number of variables in a double vector increases, the states (the ACE signal quantised values) are found more precisely. Here, the system states are actually more important than the system actions (the ΔP_c quantised values) and have a greater effect on the whole system performance (keeping the ACE within a small band around zero), because systems with more states can learn better (more precisely) than similar systems with fewer states. The maximum number of states that could be defined in the GA for the mentioned system was 400 (for values greater than this, a memory error occurred); however, if the number of action variables is limited, the learning process is faster (because it is not necessary to examine extra actions in each state). The smallest usable number of actions that could be defined was 6. Then each individual should be a double vector (population type) with 406 variables in [0, 1], consisting of 400 variables for the ACE signal and six variables for the ΔP_c signal.

The starting population size is equal to 30 individuals and the GA was run for 100 generations. Fig. 3 shows the result of running the proposed GA for area 1 of the three-control-area power system given in [17, 18].


Finding an individual's fitness: To find the eligibility (fitness) of individuals, six variables are randomly chosen from each individual (which contains 406 variables) as the discretised values of actions; these values are then scaled according to the action range (the variables lie between 0 and 1, whereas the ΔP_c action signal varies within [ΔP_c^Min, ΔP_c^Max]). The remaining 400 variables are the discretised values of the ACE signal, which should be scaled to the valid range [ACE^Min, ACE^Max]. After scaling and finding the corresponding quantised state and action vectors, the model is run with these properties, and the individual's fitness is obtained from

\text{Individual fitness} = \frac{\sum |ACE|}{\text{simulation time}}    (10)

The individual that has the smallest fitness is the best one.
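The sketch below (not from the paper's code) illustrates this fitness evaluation under stated assumptions: the split of the 406 genes into 400 ACE genes and 6 ΔP_c genes, the scaling ranges, and the `run_simulation` stand-in for running the power-system model are all hypothetical.

```python
import numpy as np

# Minimal sketch (not from the paper's code) of the fitness evaluation of eq. (10).
ACE_MIN, ACE_MAX = -0.05, 0.05        # assumed valid ACE range, pu
DPC_MIN, DPC_MAX = -0.02, 0.02        # assumed dP_c range, pu

def run_simulation(ace_levels, dpc_levels, sim_time):
    # Placeholder: the real evaluation would run the three-area model with the
    # RL controller using these quantisation levels and return |ACE| samples.
    return np.abs(np.random.randn(int(sim_time)))

def fitness(individual: np.ndarray, sim_time: float = 120.0) -> float:
    """Scale a 406-gene individual to ACE/dP_c levels and score it by eq. (10)."""
    ace_levels = ACE_MIN + individual[:400] * (ACE_MAX - ACE_MIN)
    dpc_levels = DPC_MIN + individual[400:406] * (DPC_MAX - DPC_MIN)
    abs_ace_trace = run_simulation(ace_levels, dpc_levels, sim_time)
    return float(np.sum(abs_ace_trace)) / sim_time   # smaller is better

score = fitness(np.random.rand(406))
```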

Finding the ACE and ΔP_c variation range: The AGC's role is limited to correcting the observed ACE within a limited range; if the ACE goes beyond this range, other emergency control steps may have to be taken by the operator. Let the valid range of the ACE for which the AGC is expected to act properly be [ACE^Min, ACE^Max]. In fact, ACE^Min and ACE^Max are automatically determined by the operating policy of the area. ACE^Max is the maximum ACE deviation that is expected to be corrected by the AGC (in practice, ACE deviations beyond this value are corrected only through operator intervention) and ACE^Min is the amount of ACE deviation below which we do not want the AGC to respond. This value must necessarily be non-zero.

The other variable to be quantised is the control action ΔP_c. This also requires that a design choice be made for the range [ΔP_c^Min, ΔP_c^Max]. ΔP_c^Max is automatically determined by the equipment constraints of the system; it is the maximum power change that can be effected within one AGC execution period. ΔP_c^Min is the minimum change that can be demanded in the generation.

Figure 3 Result of running the GA for area 1 of the three-control-area power system given in [17, 18]

3.2 Estimator agent

To find the ACE from (6), it is necessary to know the frequency bias factor (b). The conventional approaches to tie-line bias control use the frequency bias coefficient −10B to offset the area's frequency response characteristic, b. However, b is related to many factors, and only with −10B = b would the ACE react only to internal disturbances. Therefore many works have used a b approximation (which is not easily obtained on a real-time basis) instead of a constant value [36–39]. Getting a good estimate of the area's b to improve the LFC performance is the motivation for the estimator agent to estimate the b parameter and find the ACE signal based on it.

Each time, the estimator agent obtains ΔP_tie, Δf, ΔP_m and ΔP_L (ΔP_L is not directly measurable, but it can be estimated by classical and conventional solutions [40]) as inputs, then calculates the b parameter and finds the ACE signal according to it. The proposed estimation algorithm used in this agent is based on the ACE model given in Fig. 1.

The per-unit equation of the electromechanical power balance for the local control area (Fig. 1) can be expressed as

\sum_{j=1}^{n} \Delta P_{mji}(t) - \Delta P_{Li}(t) - \Delta P_{tie,i}(t) = 2H_i \frac{d\Delta f_i(t)}{dt} + D_i \Delta f_i(t) \quad \text{pu}    (11)

Also, the equation given below follows from (6)

\Delta P_{tie,i}(t) = ACE_i(t) - b_i \Delta f_i(t) \quad \text{pu}    (12)

Using the results of the two equations above

\sum_{j=1}^{n} \Delta P_{mji}(t) - \Delta P_{Li}(t) + b_i \Delta f_i(t) - ACE_i(t) = 2H_i \frac{d\Delta f_i(t)}{dt} + D_i \Delta f_i(t)    (13)

Then

ACE_i(t) = \sum_{j=1}^{n} \Delta P_{mji}(t) - \Delta P_{Li}(t) + (b_i - D_i) \Delta f_i(t) - 2H_i \frac{d\Delta f_i(t)}{dt} \quad \text{pu}    (14)

Applying the following definition of a moving average over a T-second interval to (14)

\bar{X}_T = \frac{1}{T} \int_{t_i}^{t_f} X(t)\, dt, \quad t_f - t_i = T \ \text{s}    (15)

Figure 4 b_est and b_cal over 120 s

Figure 5 Three-control area power system

the following equation is obtained

\overline{ACE}_T = \frac{1}{T} \left\{ \sum_T \sum_{j=1}^{n} \Delta P_{mji} - \sum_T \Delta P_{Li} + (b_i - D_i) \sum_T \Delta f_i - 2H_i \big( \Delta f(t_f) - \Delta f(t_i) \big) \right\}    (16)

By applying the measured values of the ACE and the other variables in the above equation over a time interval, the value of b can be estimated for the corresponding period. Since the values of b vary with system conditions, these model parameters have to be updated regularly using a recursive least squares (RLS) algorithm [41].
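One way to read (16) is that each T-second window gives one linear equation in the unknown b_i, which an RLS recursion can track. The sketch below is not the paper's algorithm: it assumes the window is summarised as a scalar regression y = φ·b_i, with φ the windowed average of Δf_i and y collecting the remaining measured terms (including the windowed ACE, which is assumed to be available); the initial values and forgetting factor are also assumptions.

```python
# Minimal sketch (not from the paper's code): a scalar recursive least squares
# (RLS) update for tracking b_i from successive T-second windows of eq. (16).
class ScalarRLS:
    def __init__(self, b0: float = 20.0, p0: float = 1e3, lam: float = 0.98):
        self.b, self.P, self.lam = b0, p0, lam   # estimate, covariance, forgetting

    def update(self, phi: float, y: float) -> float:
        """One RLS step: refine the estimate of b_i from a new window (phi, y)."""
        k = self.P * phi / (self.lam + phi * self.P * phi)   # gain
        self.b += k * (y - phi * self.b)                     # correct the estimate
        self.P = (self.P - k * phi * self.P) / self.lam      # update covariance
        return self.b

# Usage: feed one (phi, y) pair per estimation window (e.g. every 60 s)
est = ScalarRLS()
b_hat = est.update(phi=-0.004, y=-0.084)
```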

Suitable values of the duration T depend on the system's dynamic behaviour. Although a larger T would yield smoother b values, it would also slow the convergence to the proper value. For the example at hand, a T equal to 60 s gave good results.

Fig. 4 shows the estimated and calculated b of Fig. 1 for area 1 of the three-control-area power system given in [17, 18] over a 120 s simulation. For this test, the −10B of the target control area was set equal to b_cal. As can be seen, b_est converged rapidly to b_cal and remained close to it over the rest of the run.

4 Case study: a three-control area power system

To illustrate the effectiveness of the proposed control strategy, and to compare the results with linear robust control techniques, a three-control-area power system (the same as the example used in [17, 18]) is considered as a test system. It is assumed that each control area includes three Gencos; its parameters are given in [17, 18].

Figure 6 Proposed multi-agent structure for the three-control area power system

Figure 7 System responses in Case 1: a Area 1, b Area 2, c Area 3. Solid line: proposed method; dotted line: robust PI controller [17]; dashed line: robust PI controller [18]

Figure 8 System responses in Case 2: a Area 1, b Area 2, c Area 3. Solid line: proposed method; dotted line: robust PI controller [17]; dashed line: robust PI controller [18]

Figure 9 Non-linear test case study topology

A block schematic diagram of the model used for simulation studies is shown in Fig. 5 and the proposed multi-agent structure is shown in Fig. 6.

Our purpose here is essentially to show clearly the various steps of the implementation and to illustrate the method. After the design steps of the algorithm are finished, the controller must be trained by running the simulation in the learning mode, as explained in Section 3. The performance results presented here correspond to the performance of the controllers after the learning phase is completed and the controller's actions at the various states have converged to their optimal values. The simulation is run as follows: during the simulation, the estimator agent estimates the b parameter based on the input parameters (Δf, ΔP_tie, ΔP_m, ΔP_L) every 60 s (according to Section 3.2); the ACE signal is calculated by the estimator agent at each simulation sample time based on the b parameter estimated up to that time. Then, at each LFC execution period (which is greater than the simulation sample time and is equal to 2 s), the controller agent of each area averages all the corresponding ACE signal instances calculated by the estimator agent (based on the estimated b parameter) and averages all the load-change instances obtained during the LFC execution period. The three average values of the ACE signal instances (each related to one area) plus the three average values of the load-change instances form the current joint state vector x_k (which is obtained according to the quantised states gained from the GA); then the controller agents choose an action a_k according to the quantised actions gained from the GA and the exploration policy mentioned above. Each joint action a_k consists of three actions (ΔP_c1, ΔP_c2, ΔP_c3) that change the set points of the governors. Using this change to the governor settings, the power system model is integrated for the next LFC execution period. During the next cycle (i.e. until the next LFC instant), three values of the average ACE instances and the average load-change instances in each area form the next joint state (x_{k+1}).
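The sketch below (not from the paper's code) outlines this interaction loop under stated assumptions: the stub functions stand in for the three-area power-system model, the Section 3.2 estimator and the learned RL policy, and the initial b values are illustrative; the 0.1 s simulation step is also an assumption, while the 2 s LFC period and 60 s estimation period follow the text.

```python
import numpy as np

# Minimal sketch (not from the paper's code) of the simulation loop described above.
SIM_DT, LFC_PERIOD, B_PERIOD, T_END = 0.1, 2.0, 60.0, 120.0
N_AREAS = 3

def plant_step(dt):          # stub: one integration step of the three-area model
    return {"df": np.zeros(N_AREAS), "dp_tie": np.zeros(N_AREAS),
            "dp_l": np.zeros(N_AREAS)}

def estimate_b(meas, b):     # stub: Section 3.2 estimator (e.g. RLS on eq. (16))
    return b

def choose_joint_action(x):  # stub: joint (dP_c1, dP_c2, dP_c3) from the RL policy
    return np.zeros(N_AREAS)

b = np.full(N_AREAS, 20.0)   # initial b estimates (assumed values)
ace_buf, dpl_buf = [], []
steps = int(T_END / SIM_DT)
for i in range(1, steps + 1):
    meas = plant_step(SIM_DT)
    if i % int(B_PERIOD / SIM_DT) == 0:          # estimator agent: refresh b every 60 s
        b = estimate_b(meas, b)
    ace_buf.append(b * meas["df"] + meas["dp_tie"])   # per-area ACE, cf. eq. (6)
    dpl_buf.append(meas["dp_l"])
    if i % int(LFC_PERIOD / SIM_DT) == 0:        # controller agents act every 2 s
        x_k = (np.mean(ace_buf, axis=0), np.mean(dpl_buf, axis=0))  # joint state
        dpc = choose_joint_action(x_k)           # joint action to governor set points
        ace_buf, dpl_buf = [], []
```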

In the simulation studies presented here, the input variables are obtained as follows. At each LFC execution period, the average values of the ACE signal instances corresponding to each area are calculated; they are the first state variables (x^1_avg1, x^1_avg2, x^1_avg3). The average values of the load-change instances of the three areas at that LFC execution period are the second state variables (x^2_avg1, x^2_avg2, x^2_avg3). Because the MARL process is used and the agents of all areas learn together, the joint state vector consists of all state vectors of the three areas and the joint action vector consists of all action vectors of the three areas, as shown in the tuple ⟨(X_1, X_2, X_3), (A_1, A_2, A_3), p, r⟩ or ⟨X, A, p, r⟩, where X_i = (x^1_avg i, x^2_avg i) is the discrete set of states of each area and X is the joint state, A_i is the discrete set of actions available to area i and A is the joint action.


Figure 10 System responses of the network with the same topology as the IEEE 10-generator 39-bus system, with the proposed method: a Area 1, b Area 2, c Area 3


In each LFC execution period, after averaging the ACE_i and ΔP_Li of all areas (over the instances obtained in that period) and depending on the current joint state (X_1, X_2, X_3), the joint action (ΔP_c1, ΔP_c2, ΔP_c3) is chosen according to the exploration policy. Consequently, the reward r also depends on the joint action. Whenever the next state (X) is desirable (i.e. when all |ACE_i| are less than ε, where ε is the smallest ACE value on which the LFC is expected to act), the reward function r is assigned a zero value. When the next state is undesirable (i.e. when at least one |ACE_i| is greater than ε), then r is assigned the average value of all −|ACE_i|.
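A minimal sketch of this joint reward (not from the paper's code) is given below; the threshold ε is an assumed value.

```python
import numpy as np

# Minimal sketch (not from the paper's code): the joint reward described above.
# `ace_avg` holds the averaged ACE of the three areas over one LFC execution period.
def joint_reward(ace_avg: np.ndarray, eps: float = 0.005) -> float:
    """Zero when every |ACE_i| < eps, otherwise the mean of -|ACE_i|."""
    abs_ace = np.abs(ace_avg)
    return 0.0 if np.all(abs_ace < eps) else float(np.mean(-abs_ace))

print(joint_reward(np.array([0.001, -0.002, 0.003])))   # desirable   -> 0.0
print(joint_reward(np.array([0.02, -0.001, 0.004])))    # undesirable -> mean of -|ACE_i|
```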

Here, since all power system areas are connected through tie-lines and, according to (6), the ACE signal changes with the tie-line power changes, the agents of the different areas cannot operate independently; thus, single-agent RL is not capable of solving this problem.

The MARL speeds up the learning process in this problem. Also, this RL algorithm is more scalable than single-agent RL.

5 Simulation results

To demonstrate the effectiveness of the proposed control design, some simulations were carried out. In these simulations, the proposed controllers were applied to the three-control-area power system described in Fig. 6 with the assumption that the parameters D and H are known, time-invariant and fixed. Also, each area's D and H values are a linear combination of the D and H values of all generators in that area.

In this section, the performance of the closed-loop system using the designed MARL controller is tested against the linear robust PI controllers [17, 18] for the various possible load disturbances.

Case 1: As the first test case, the following large load disturbances (step increases in demand) are applied to the three areas

ΔP_d1 = 100 MW, ΔP_d2 = 80 MW, ΔP_d3 = 50 MW

The frequency deviation (Δf), ACE and control action (ΔP_c) signals of the closed-loop system are shown in Fig. 7 (because the frequency deviation is the same in all areas, only Δf1 is shown).

Case 2: Consider larger demands by areas 2 and 3, that is

ΔP_d1 = 100 MW, ΔP_d2 = 100 MW, ΔP_d3 = 100 MW

The closed-loop responses for each control area are shown in Fig. 8 (because the frequency deviation is the same in all areas, only Δf1 is shown).


Using the proposed method with the b estimation, the ACE and frequency deviation of all areas are properly driven back to zero, as with the robust controllers. Also, the convergence speeds of the frequency deviation and the ACE signal to their final values are good; they reach the steady state as rapidly as the signals in [17, 18]. The maximum frequency deviation occurs at 2 s, when the load disturbances occur.

As shown in the above figures, the change in the generation control signal (ΔP_c) is small; it goes smoothly to the steady state and satisfies the physical conditions of the system well. It is also clear that ΔP_m (the mechanical power deviation) is precisely proportional to the participation factor of each generator.

Case 3: As another test case, the proposed method was applied to a network with the same topology as the IEEE 10-generator 39-bus system [42] as a non-linear test case study (Fig. 9) and was tested with the following load change scenario (more explanation of the network is given in the Appendix): 3.8% of the total area load at bus 8 in area 1, 4.3% of the total area load at bus 3 in area 2 and 6.4% of the total area load at bus 16 in area 3 have been added.

The closed-loop responses for each control area are shown in Fig. 10.

As shown in the simulation results, using the proposed method, the ACE and frequency deviation of all areas are properly driven close to zero. Furthermore, since the proposed algorithm is adaptive and based on learning methods – in each state it finds the locally optimal solution that achieves the system objectives (an ACE signal near zero) – the intelligent controllers provide smoother control action signals.

6 Conclusion

A new method for the LFC, using MARL based on GA optimisation and with a b estimation functionality, has been proposed for a large-scale power system. The proposed method was applied to a three-control-area power system and was tested with different load change scenarios. The results show that the new algorithm performs very well and compares well with the performance of recently designed linear robust controllers, and that finding the ACE signal based on the b estimation improves its performance. The two important features of the new approach, model independence and flexibility in specifying the control objective, make the approach very suitable for this kind of application. The scalability of the MARL to realistic problem sizes is another main reason to use it. In addition to the scalability and the benefits owing to the distributed nature of the multi-agent solution, such as parallel computation, multiple RL agents may gain new benefits from sharing experience, for example by communication, teaching or imitation.


7 References

[1] JALEELI N., EWART D.N., FINK L.H.: 'Understanding automatic generation control', IEEE Trans. Power Syst., 1992, 7, (3), pp. 1106–1112

[2] IBRAHEEM K.P., KOTHARI P.: 'Recent philosophies of automatic generation control strategies in power systems', IEEE Trans. Power Syst., 2005, 20, (1), pp. 346–357

[3] ATHAY T.M.: 'Generation scheduling and control', Proc. IEEE, 1987, 75, (12), pp. 1592–1606

[4] KOTHARI M.L., NANDA J., KOTHARI D.P., DAS D.: 'Discrete mode AGC of a two area reheat thermal system with new ACE', IEEE Trans. Power Syst., 1989, 4, pp. 730–738

[5] SHOULTS R.R., JATIVA J.A.: 'Multi area adaptive LFC developed for a comprehensive AGC simulator', IEEE Trans. Power Syst., 1991, 8, pp. 541–547

[6] MOON Y.H., RYU H.S.: 'Optimal tracking approach to load frequency control in power systems'. IEEE Power Engineering Society Winter Meeting, 2000

[7] ALRIFAI M., ZRIBI M.: 'A robust decentralized controller for power system load frequency control'. 39th Int. Universities Power Engineering Conf., 2004, vol. 1, pp. 794–799

[8] HIYAMA T.: 'Design of decentralised load–frequency regulators for interconnected power systems', IEE Proc. C Gener. Transm. Distrib., 1982, 129, pp. 17–23

[9] FELIACHI A.: 'Optimal decentralized load frequency control', IEEE Trans. Power Syst., 1987, 2, pp. 379–384

[10] LIAW C.M., CHAO K.H.: 'On the design of an optimal automatic generation controller for interconnected power systems', Int. J. Control, 1993, 58, pp. 113–127

[11] WANG Y., ZHOU R., WEN C.: 'Robust load–frequency controller design for power systems', IEE Proc. C Gener. Transm. Distrib., 1993, 140, (1), pp. 11–16

[12] LIM K.Y., WANG Y., ZHOU R.: 'Robust decentralised load–frequency control of multi-area power systems', IEE Proc. Gener. Transm. Distrib., 1996, 143, (5), pp. 377–386

[13] ISHI T., SHIRAI G., FUJITA G.: 'Decentralized load frequency based on H-inf control', Electr. Eng. Jpn., 2001, 136, (3), pp. 28–38

[14] KAZEMI M.H., KARRARI M., MENHAJ M.B.: 'Decentralized robust adaptive-output feedback controller for power system load frequency control', Electr. Eng. J., 2002, 84, pp. 75–83

[15] EL-SHERBINY M.K., EL-SAADY G., YOUSEF A.M.: 'Efficient fuzzy logic load–frequency controller', Energy Convers. Manage., 2002, 43, pp. 1853–1863

[16] BEVRANI H., HIYAMA T.: 'Robust load–frequency regulation: a real-time laboratory experiment', Opt. Control Appl. Methods, 2007, 28, (6), pp. 419–433

[17] RERKPREEDAPONG D., HASANOVIC A., FELIACHI A.: 'Robust load frequency control using genetic algorithms and linear matrix inequalities', IEEE Trans. Power Syst., 2003, 18, (2), pp. 855–861

[18] BEVRANI H., MITANI Y., TSUJI K.: 'Robust decentralised load–frequency control using an iterative linear matrix inequalities algorithm', IEE Proc. Gener. Transm. Distrib., 2004, 151, (3), pp. 347–354

[19] KARNAVAS Y.L., PAPADOPOULOS D.P.: 'AGC for autonomous power system using combined intelligent techniques', Electr. Power Syst. Res., 2002, 62, pp. 225–239

[20] DEMIROREN A., ZEYNELGIL H.L., SENGOR N.S.: 'Automatic generation control for power system with SMES by using neural network controller', Electr. Power Comp. Syst., 2003, 31, (1), pp. 1–25

[21] XIUXIA D., PINGKANG L.: 'Fuzzy logic control optimal realization using GA for multi-area AGC systems', Int. J. Inf. Technol., 2006, 12, (7), pp. 63–72

[22] ATIC N., FELIACHI A., RERKPREEDAPONG D.: 'CPS1 and CPS2 compliant wedge-shaped model predictive load frequency control'. IEEE Power Engineering Society General Meeting, 2004, vol. 1, pp. 855–860

[23] ERNST D., GLAVIC M., WEHENKEL L.: 'Power system stability control: reinforcement learning framework', IEEE Trans. Power Syst., 2004, 19, (1), pp. 427–436

[24] AHAMED T.P.I., RAO P.S.N., SASTRY P.S.: 'Reinforcement learning controllers for automatic generation control in power systems having reheat units with GRC and dead-band', Int. J. Power Energy Syst., 2006, 26, (2), pp. 137–146

[25] AHAMED T.P.I., RAO P.S.N., SASTRY P.S.: 'A reinforcement learning approach to automatic generation control', Electr. Power Syst. Res., 2002, 63, pp. 9–26

[26] EFTEKHARNEJAD S., FELIACHI A.: 'Stability enhancement through reinforcement learning: load frequency control case study', Bulk Power Syst. Dyn. Control-VII, Charleston, USA, 2007

[27] AHAMED T.P.I.: 'A neural network based automatic generation controller design through reinforcement learning', Int. J. Emerging Electr. Power Syst., 2006, 6, (1), pp. 1–31


[28] SUTTON R.S., BARTO A.G.: 'Reinforcement learning: an introduction' (MIT Press, Cambridge, MA, 1998)

[29] BUSONIU L., BABUSKA R., DE SCHUTTER B.: 'A comprehensive survey of multi-agent reinforcement learning', IEEE Trans. Syst. Man Cybern. C: Appl. Rev., 2008, 38, (2), pp. 156–172

[30] WEISS G. (ED.): 'Multi-agent systems: a modern approach to distributed artificial intelligence' (MIT Press, Cambridge, MA, 1999)

[31] WOOLDRIDGE M., WEISS G. (EDS.): 'Intelligent agents, in multi-agent systems' (MIT Press, Cambridge, MA, 1999), pp. 3–51

[32] VLASSIS N.: 'A concise introduction to multi-agent systems and distributed AI'. Technical Report, Fac. Sci., Univ. Amsterdam, Amsterdam, The Netherlands, 2003

[33] YANG G.: 'Multi-agent reinforcement learning for multi-robot systems: a survey'. Technical Report CSM-404, 2004

[34] BEVRANI H.: 'Real power compensation and frequency control', in 'Robust power system frequency control' (Springer Press, 2009, 1st edn.), pp. 15–41

[35] THATHACHAR M.A.L., HARITA B.R.: 'An estimator algorithm for learning automata with changing number of actions', Int. J. Gen. Syst., 1988, 14, (2), pp. 169–184

[36] HOONCHAREON N.B., ONG C.M., KRAMER R.A.: 'Feasibility of decomposing ACE to identify the impact of selected loads on CPS1 and CPS2', IEEE Trans. Power Syst., 2002, 22, (5), pp. 752–756

[37] CHANG-CHIEN L.R., HOONCHAREON N.B., ONG C.M., KRAMER R.A.: 'Estimation of b for adaptive frequency bias setting in load frequency control', IEEE Trans. Power Syst., 2003, 18, (2), pp. 904–911

[38] CHANG-CHIEN L.R., ONG C.M., KRAMER R.A.: 'Field tests and refinements of an ACE model', IEEE Trans. Power Syst., 2002, 18, (2), pp. 898–903

[39] CHANG-CHIEN L.R., LIN Y.J., WU C.C.: 'An online approach to allocate operating reserve for an isolated power system', IEEE Trans. Power Syst., 2007, 22, (3), pp. 1314–1321

[40] BEVRANI H.: 'Frequency control in emergency conditions', in 'Robust power system frequency control' (Springer Press, 2009, 1st edn.), pp. 165–168

[41] CHONG E.K.P., ZAK S.H.: 'An introduction to optimization' (John Wiley & Sons, New York, 1996)

[42] PAI M.A.: 'Energy function analysis for power system stability' (Kluwer Academic Publishers, 1989)

[43] DANESHMAND P.: 'Power system frequency control in the presence of renewable energy sources'. MSc dissertation, Department of Electrical and Computer Engineering, University of Kurdistan, Sanandaj, Iran, 2009

8 Appendix

To investigate the capability of the proposed control method on the power system (particularly on the system frequency), a simulation study provided in the Simpower environment of MATLAB software and presented in [43] is used. It is a network with the same topology as the well-known IEEE 10-generator 39-bus test system.

This test system is widely used as a standard system for testing new power system analysis and control synthesis methodologies. The system has 10 generators, 19 loads, 34 transmission lines and 12 transformers, and it has been updated with two wind farms in areas 1 and 3, as shown in Fig. 9.

The 39-bus system is organised into three areas. The total system installed capacity is 841.2 MW of conventional generation and 45.34 MW of wind power generation. There are 198.96 MW of conventional generation, 22.67 MW of wind power generation and 265.25 MW of load in area 1. In area 2, there are 232.83 MW of conventional generation and 232.83 MW of load. In area 3, there are 160.05 MW of conventional generation, 22.67 MW of wind power generation and 124.78 MW of load.

The simulation parameters for the generators, loads, lines and transformers of the test system are given in [43]. All power plants in the power system are equipped with a speed governor and a power system stabiliser (PSS). However, only one generator in each area is responsible for the LFC task: G1 in area 1, G9 in area 2 and G4 in area 3.
