

Optimization of Wireless Sensor Networks using Machine Learning

NIKLAS WIRSTRÖM

Master of Science Thesis Stockholm, Sweden 2006


Master’s Thesis in Computer Science (20 credits)
at the School of Computer Science and Engineering,
Royal Institute of Technology, 2006

Supervisor at CSC: Örjan Ekeberg
Examiner: Anders Lansner

TRITA-CSC-E 2006:100
ISRN-KTH/CSC/E--06/100--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se


Abstract

Wireless Sensor Networks (WSN) may be used for a wide range of applications. Often, it is desirable for the networks to be as energy-efficient as possible without affecting performance too much. It is also desirable for the networks to be self-configuring in order to make deployment easier.

In this Master’s thesis we explore how the nodes of a WSN can use policies for self-configuration. Depending on the state of a node’s local environment, the policy determines how the node should configure itself.

We show how Machine Learning methods can be used in simulated networks to search for optimal policies for specific scenarios.


Referat

Optimization of wireless sensor networks using machine learning

Wireless sensor networks (WSN) can be used for a wide range of applications. Often it is desirable for the network to be as energy-efficient as possible without the quality of its services suffering too much. To simplify deployment, it is also desirable for the network to be self-configuring.

In this report we explore how the nodes of a WSN can use behavior policies for self-configuration. The behavior policy determines how a node should configure itself depending on its local state.

We show how machine learning methods can be used in simulated networks to search for optimal behavior policies for specific scenarios.


Contents

1 Introduction
  1.1. Acknowledgments
2 Background and Related Work
  2.1. Wireless Sensor Networks
    2.1.1. Why Energy Efficiency
    2.1.2. Routing Algorithms and Clustering Strategies
    2.1.3. Radio Transmission
    2.1.4. WSN Performance
  2.2. Machine Learning
    2.2.1. Reinforcement Learning
    2.2.2. Multi-Agent Reinforcement Learning
    2.2.3. Genetic Algorithms
  2.3. Discrete Event Simulation
3 A WSN System Model
  3.1. The Energy Source
  3.2. The Radio Model
  3.3. The Sensor Model
4 Machine Learning Methods for Optimization of WSN
  4.1. Terminology
  4.2. A Non-Stationary Environment
  4.3. A Monte Carlo Approach
  4.4. The State and Action Spaces
  4.5. Perception of the State Signal
  4.6. Performance Measure
  4.7. Action Selection
  4.8. Policy Updating
  4.9. Quantizing
  4.10. Using GA for Selecting Quantizing Values
5 Implementation
  5.1. The Simulator
  5.2. The Experiment Framework
  5.3. The Graphical Interface
  5.4. The Experiment Configuration Framework
  5.5. The Learning Framework
    5.5.1. Computation of State and Action Identifiers
    5.5.2. Evaluation
  5.6. The Simulation Model Implementation
    5.6.1. Radio Communication
    5.6.2. Detection
6 The Experiment Set-Up
  6.1. The Scenario
    6.1.1. The Task
    6.1.2. The Sensor Node
    6.1.3. The Radio Model
    6.1.4. The Central
  6.2. Optimization and Learning Aspects
    6.2.1. Optimization Parameters
    6.2.2. The State Signal and Action Representation
    6.2.3. The QoS Measure
    6.2.4. Measure Normalization
7 Experiments and Results
  7.1. Experiment 1 - Reinforcement Learning
  7.2. Experiment 2 - Improving Performance using GA
  7.3. Experiment 3 - A Hand-Coded Policy
  7.4. Experiment 4 - Simultaneous GA and RL
  7.5. Summary
8 Conclusions
Bibliography


Chapter 1

Introduction

A wireless sensor network (WSN) is a network consisting of small devices (sensor nodes) scattered over an area, cooperating to monitor changes of some predefined kind over that area. The types of sensors the nodes are equipped with vary depending on the tasks the network has to accomplish, and may be microphones, motion, seismic or moisture detectors, etc. Each node may also have several sensors connected to it.

The objective of this thesis is to explore methodology for automatic computation of policies for how the nodes of a WSN should configure themselves based on observations that are local to the nodes. Simulation-based machine learning approaches for computing self-configuration policies are presented and tested experimentally in a simulator. The results are compared with those of a simple, hand-coded configuration algorithm.

Configuring a WSN node may consist in assigning values to parameters, such as transmission power or duty cycling, selecting among predefined algorithms for data analysis or routing, etc. The local observations on which self-configuration is based may consist of the number of other nodes in the proximity, the remaining energy, or anything else giving a hint of the local status of the WSN.

Different perspectives on WSN exist. They vary in the manner in which the WSN is deployed, the sizes and costs of the single nodes, the nodes’ energy resources, topology, etc. The experiments in this thesis are carried out using a model of an immobile, matchbox-sized, manually deployed, infrastructure-based WSN, cf. [10]. Further, the experiments here focus on optimization of a subset of all the parameters available in the WSN.

An illustrative example. For surveillance or statistical purposes, it may be desirable to monitor how people or vehicles move in a certain area. For this, a WSN may be used. Depending on the hardware of the single sensor nodes, the environment they are to be deployed in, the task the WSN as a whole has to accomplish, etc., the nodes may have to be configured in a specific way to perform optimally. Manual configuration of each node, in order to achieve a well-working, energy-efficient WSN, would be a time-consuming task, where special skills and knowledge of WSN


may be required at deployment. Instead, the use of policies precomputed using a simulator may speed up deployment and may require less knowledge from the people deploying the network. Figure 1.1 shows this concept. Once a policy for a specific network is found, it may be reused for other networks with similar properties.

Figure 1.1. A simulator is used to find a policy for a specific scenario. The policy is then loaded into the sensor nodes.

This report is divided into several chapters. In Chapter 2, a background on WSN and machine learning is given together with presentations of related work. In Chapter 3, an abstract model of a WSN system is described. Chapter 4 introduces an approach to policy computation using Monte Carlo methods alone, as well as in combination with genetic algorithms. Chapter 5 gives a short presentation of how selected parts of the simulation tool were implemented. Chapter 6 presents the scenario and the common parameter settings for the experimental tests. Finally, Chapter 7 describes the experiments carried out and presents the results achieved from them.

1.1. Acknowledgments

This thesis has been carried out at the Intelligent Systems Lab (ISL) at SICS (Swedish Institute of Computer Science).

The major part of the software for simulation and learning used in this thesis has been developed by the author. The following parts have, however, mainly been developed by others:

Experiment Configuration Framework

The Experiment Framework

The author wishes to thank Sverker Janson, Joakim Eriksson and Niclas Finne at the ISL at SICS for their support. The author thanks Joakim Eriksson and Niclas


Finne for feedback and ideas concerning implementations and experiment set-up, and for the contribution of the parts specified above. The author thanks Sverker Janson for feedback, ideas and all the comments regarding the writing of this thesis.


Chapter 2

Background and Related Work

In this chapter, theory and background for the different research fields concerning this thesis are presented. The chapter is divided into three sections. The first section gives a background on wireless sensor networks and some theory associated with that research field. Related work, such as different attempts to maximize energy efficiency, is also presented.

The second section covers the theory regarding two machine learning fields, namely reinforcement learning and genetic algorithms. It is within the framework of these that a method for finding self-configuration policies is proposed.

The third section covers the basic theory and concepts of discrete-event simulation.

2.1. Wireless Sensor Networks

This section is a short presentation of the wireless sensor network (WSN) research field. Some examples of different WSN applications are given. The objective of energy efficiency is also motivated, and some ways of achieving it are presented.

In the literature, the WSN is often assumed to consist of hundreds or thousands of small nodes which are placed very densely [8]. In many applications the nodes of the network are assumed to be randomly scattered over the area which is to be monitored. Without any foreknowledge of the topology, it is necessary for the network and the nodes to be self-organizing in the sense that once deployed, the network should be able to configure itself in a satisfying way.

Basically, WSNs are used in two ways: as alarm systems, where the network is event driven, i.e. it triggers when it senses some kind of change in the monitored area, or as demand-driven systems, where the network works more like a continuously updated database and the clients poll information from it [11, 4].

A base station, or some kind of central, is typically placed in connection to the network, with the purpose of constituting an access point for retrieving data from the network. The base station could, e.g., be connected to the Internet so that


clients easily can retrieve data from it [2]. Nodes which lie close to the central can communicate with it directly, while nodes further away have to route information through other nodes in a multi-hop fashion.

There is a wide range of possible applications for WSNs. They may for instance be used for monitoring environments hostile to humans, such as radioactive areas or a volcano right after an eruption, but also for long-term monitoring in friendlier surroundings, such as bird habitats or the heat flux in a building [8, 2].

2.1.1. Why Energy Efficiency

Especially for long-term monitoring tasks, it is important to use the energy at each node’s disposal efficiently, in order to keep the network alive for a sufficiently long time. Otherwise a costly, or sometimes impossible, replacement of batteries will be necessary. Therefore, much effort has been put into optimizing the WSN with respect to energy usage while keeping the quality of service (QoS) at an acceptable level.

There are several factors that affect a node’s power consumption. The routing algorithm being followed, the nodes’ duty cycles, and how often nodes use their radios are examples of such factors. A node’s duty cycle is the ratio of the duration of the node’s active periods to its passive periods, i.e. how much of the time the node spends awake [2]. A too high ratio would, of course, result in high power consumption, but so would a too low one, due to the difficulty of routing information through the network when few nodes are awake at a time. The QoS also decreases with this ratio, since a lower ratio results in a lower sampling rate of the changes over the monitored area [22].

To minimize the network traffic, local data analysis may be used. In general, computations have to be relatively simple in order not to expend too much energy. Possible approaches are summing or averaging over a set of aggregated values [5]. Another way is to let the nodes classify the data prior to transmitting; a simple identifier can then be sent instead of all the observed data. However, classification is often a quite expensive kind of data analysis. The use of data analysis is in general a trade-off between information accuracy and the level of network traffic.

2.1.2. Routing Algorithms and Clustering Strategies

Many of the attempts made to optimize energy efficiency focus on the network’s routing algorithm. In this section, some terminology and algorithms are covered very briefly. One of these algorithms, LEACH, is explained in more detail.

In general, there are several ways of defining an energy-efficient routing path. Some approaches are the maximum available power route, the minimum energy route, the minimum hop route and the maximum minimum available power node route [8]. None of these, however, solves the routing problem in a satisfying way. The first and fourth approaches will drain the network evenly, but they might choose expensive


routes. The second and third approaches have the disadvantage of using up all the energy in one specific route at once [8]. Further, there is an energy cost associated with getting the information needed to make the decisions required by the above routing schemes, namely the cost of propagating information about the nodes’ remaining energy through the network.

A common way to simplify the network is to, in some way, build up a backbone of nodes which is responsible for routing the data to and from the central. The nodes not included in the backbone typically lie only one hop away from it [7, 23]. To drain the network evenly, the nodes constituting the backbone change over time. Often, and especially if nodes are randomly scattered, it is necessary to let the nodes locally decide whether or not to be part of the backbone at a certain time. The more locally these decisions are made, the less information has to be propagated through the network [23].

Dai and Wu propose an algorithm where these decisions are based on priorities computed by each node [18]. The nodes collect information about their one-hop or two-hop neighbors’ priorities in order to decide whether or not they should be part of the backbone. The backbone formed constitutes a dominating node set, which is a set such that each node in the network either belongs to it or has a one-hop neighbor that belongs to it. The algorithm was later extended by Carle and Simplot-Ryl to find the smallest set of nodes covering a network’s monitoring area [4].

Another algorithm is LEACH (Low-Energy Adaptive Clustering Hierarchy) by Heinzelman et al. [7]. Here, in order to simplify the network, it is divided into different clusters. Each cluster consists of one cluster-head and a number of followers, where the cluster-head is the node responsible for forwarding messages toward and from the sink. LEACH is also based on local decision-making; in a way, even more so than the dominating node set algorithm [18], since in LEACH the decisions are based solely upon a random number and a threshold. If this random number is less than the threshold, the node chooses to become a cluster-head; if not, it waits for the cluster-heads to announce themselves, chooses one of these to follow, and so becomes a follower of that cluster-head. At predetermined time intervals a new round starts and the nodes follow the same procedure all over again. The probability with which a node chooses to become a cluster-head increases with the number of rounds since it last was a cluster-head. In this way, the probability that the network is evenly drained is increased.
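To make the election concrete, below is a small Python sketch of one LEACH-style round. The threshold formula is the one from the original LEACH paper rather than something specified in this thesis, and the node representation and parameter names are illustrative assumptions.

```python
import random

def leach_threshold(p, current_round, rounds_since_head):
    """Election threshold (formula from the original LEACH paper, assumed here):
    nodes that served as cluster-head during the last 1/p rounds are excluded."""
    if rounds_since_head < 1.0 / p:
        return 0.0
    return p / (1.0 - p * (current_round % int(round(1.0 / p))))

def run_round(nodes, p, current_round):
    """Each node draws a random number and becomes cluster-head if the draw
    falls below its threshold; all others become followers."""
    heads = []
    for node in nodes:
        if random.random() < leach_threshold(p, current_round, node["rounds_since_head"]):
            node["rounds_since_head"] = 0
            heads.append(node)
        else:
            node["rounds_since_head"] += 1
    return heads

# Example: 20 nodes, 10% desired cluster-heads per round.
nodes = [{"id": i, "rounds_since_head": 10**6} for i in range(20)]
print(len(run_round(nodes, p=0.1, current_round=0)))
```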

In the basic version of LEACH the cluster-heads transmit directly to the sink (the central). If the sink is placed far away from a cluster-head, however, it is not certain that the node’s radio is actually able to transmit with the necessary power, in which case the message would not reach the sink. A possible solution is a hierarchical approach: letting the cluster-heads themselves form clusters with super-cluster-heads, which in turn communicate either directly with the sink or with other super-super-cluster-heads, and so on [7].

A major disadvantage of LEACH is the risk that cluster-heads are not evenly distributed over the network, causing it to be not fully connected. Also,


since new rounds start at predetermined time intervals, synchronization between the sensor nodes is necessary.

In order to achieve optimal energy usage, it is not enough to focus only on the routing strategies. Enz et al. [3] argue that much can be gained by attempting to co-design hardware, software and the routing protocol.

2.1.3. Radio Transmission

Here, the physical aspects and theory regarding construction of a model for radio transmission are covered briefly. Energy costs associated with radio transmission are also covered.

Three different energy costs are associated with using the radio. First, we have the transmission cost, which is the power needed to send a message. Second, we have the receiving cost, which is the power needed to actively receive a message. And third, we have the cost of keeping the radio on in order to discover when a message is sent to the radio and start receiving it.

When transmitting a message over radio, a node should adjust the strength of the radio signal to one suitable for the current network configuration. Using a very low transmission power may result in even the closest neighbors having problems receiving the data without errors, in which case the message will either need to be retransmitted or will not be received at all. Using too high a power, on the other hand, will, in addition to its obvious extra cost, cause the transmitted signal to interfere with weaker signals farther away. Therefore, the choice of transmission power is an important factor for the performance of the network as a whole [8].

How far a transmitted signal will reach is determined by the attenuation. Attenuation is the reduction in strength of a signal as it passes from the transmitter to the receiver. Attenuation is measured in dB, and may be caused by a number of factors. The attenuation of a signal is typically irregular in the sense that it is not equal in all directions. Such non-isotropic path losses originate from reflection, diffraction, scattering and non-isotropic antenna gain. The latter refers to differences in the transmitted signal strength due to irregularities in the transmitter’s antenna [9].

A simple and commonly used model for the received power at a distance d from the transmitter is as follows [25]:

P_R = \frac{P_T K}{d^n}.  (2.1)

Here K is a constant, P_T is the transmission power and n is a constant depending on the kind of terrain between the two transceivers. For open-air transmission, n is approximately 2, while for heavier terrain n is close to 4 [25, 24].

It is not only the distance between transmitter and receiver, and the obstacles in between them, that matter for how a signal is received. Background noise at the transmission frequency and interfering signals from other transmissions on the same frequency also affect the signal. It is therefore common to measure the SNR


(signal-to-noise ratio) and the CIR (carrier-to-interference ratio) instead of the signal strength. These are measured in dB according to Equations 2.2 and 2.3, respectively.

\mathrm{SNR} = 10 \log \frac{P_S}{P_N}  (2.2)

\mathrm{CIR} = 10 \log \frac{P_S}{P_I}  (2.3)

Here P_S is the power of the current signal, P_N the power of the noise signal, and P_I the power of the interfering signal.
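As a small numeric sketch of Equations 2.1-2.3 (the constant K, the distances and power levels below are arbitrary illustration values, not figures from the thesis):

```python
import math

def received_power(p_tx, distance, k=1.0, n=2.0):
    """Equation 2.1: P_R = P_T * K / d^n (n ~ 2 for open air, ~ 4 for heavy terrain)."""
    return p_tx * k / distance ** n

def snr_db(p_signal, p_noise):
    """Equation 2.2: signal-to-noise ratio in dB."""
    return 10.0 * math.log10(p_signal / p_noise)

def cir_db(p_signal, p_interference):
    """Equation 2.3: carrier-to-interference ratio in dB."""
    return 10.0 * math.log10(p_signal / p_interference)

# A 10 mW transmission received 5 m away in open air, against 1 uW of noise.
p_rx = received_power(p_tx=10e-3, distance=5.0)
print(round(snr_db(p_rx, p_noise=1e-6), 1))   # about 26.0 dB
```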

According to the principle of superposition of waves, two waves that overlap will result in a new wave whose displacement at each point is equal to the sum of the displacements of the two original waves. This means that background noise due to interfering signals may be computed as the sum of the powers of the signals, as in Equation 2.4 [12]. It is also true that two overlapping waves do not alter each other [12]. This means that a signal interfering with another signal somewhere between the transmitter and receiver of the other signal (but not at that receiver’s position) will not have any effect on the signal finally received by the receiver.

I_i = \sum_{\forall j \,|\, s_j \in S} P_j - P_i  (2.4)

Interference at the receiver’s antenna may, however, cause signals to be received with errors. A common way to measure how erroneous a received message is, is to compute the BER (bit error ratio). The BER is the fraction of erroneously received bits over the total number of sent bits. Further, the reference sensitivity is the minimum signal strength at the receiver’s antenna that does not cause the BER to exceed a certain level.

2.1.4. WSN Performance

In order to evaluate the performance of a WSN, a measure is necessary. An obvious measure is the quality of service (QoS), which is a measure of how well a network performs its task. Another is the fault tolerance, which refers to a network’s ability to handle node failures without letting the QoS fall below a certain level. Node failures may occur due to lack of energy, varying quality of the nodes’ hardware, etc. [8].

A third measure is the lifetime of the entire network. The network’s lifetime may be defined as the time during which the network’s QoS is above a certain level.

2.2. Machine Learning

This section gives a short introduction to two fields in Machine Learning (ML), namely Reinforcement Learning (RL) and Genetic Algorithms (GA).


ML has been defined as “...the study of computer algorithms that improve automatically through experience” [13]. ML can be divided into two separate fields, supervised learning and unsupervised learning. Supervised learning is learning with the aid of training examples, whereas in unsupervised learning no such examples exist. Both GA and the methods of RL are examples of methods that may be used for unsupervised learning.

2.2.1. Reinforcement Learning

The basic problem of RL is how an agent can learn a behavior by interacting with its environment. In the most commonly used RL methods, the agent learns by, at consecutive time steps, selecting different actions on the basis of its current state and a policy. At each time step, the agent also receives a reward in the form of a numerical value [14]. The objective is to find a policy that maximizes these rewards. Below follow, in more detail, some definitions and the terminology on which RL is built.

An agent is a learner, which by interacting with its environment is supposed to learn an optimal behavior. A player of a board game, an industrial robot or a software program may, for instance, each constitute an agent.

A state, s, is a specific situation that the agent is able to observe. The state signal, which is what identifies and distinguishes the different states, may, for instance, be composed of the positions of the pieces in a board game or the signals from the resolvers of an industrial robot. The set of all possible states is called the state space and is commonly denoted by S. The state the agent finds itself in at time t is denoted by s_t ∈ S. A state signal that describes the current state just as well alone as if we included information about all the prior states that led up to the current one is said to have the Markov property, or to be a Markov signal. If, in the board game, the signal is able to represent the position of each single piece, it is a Markov signal, since the history of moves leading to the current one is not relevant [14].

An action, a, is something an agent can decide to do. Actions are typically predefined, and the agent has a certain number of actions to choose between in each state. In some cases the set of possible actions remains the same for all states in the state space, but this need not always be the case. The set of actions available to the agent in state s_t is commonly denoted by A_t, and the action selected at that time step is a_t ∈ A_t. The action space may also be denoted by A_s, given a certain state s, rather than an instance of time t [14].

In RL, the whole problem to be solved is called a task. Thus, the task describes both the goal and the environment. The goal is stated in terms of the rewards r_{t+1}, given to the agent as it transits from state s_t to state s_{t+1}. The properties of the reward function may vary depending on the task; e.g. the reward may be zero until a certain goal is reached and a positive reward is given, or it may be negative until the goal is reached, or increase the closer the agent is to the goal.

An episode is a time interval during which the agent acts. In board games,


one play may constitute an episode. In tasks like these an episode can easily be defined, since the tasks have a natural terminal state. These kinds of tasks are often referred to as episodic tasks. In some cases, however, there exists no such natural terminal state. Such cases are called continuous tasks. An example of this may be control in the process industry, where the task is to continuously maintain optimal performance.

As mentioned above, depending on the state it finds itself in, the agent selects an action. Which of the possible actions it selects is determined by the policy. Formally, a policy, π(s), is a probability distribution over all the available actions in A_s. Hence, the mapping π(s, a) is the probability of selecting action a from state s.

Policies may be defined in many ways. Commonly used policies are greedy policies, ε-greedy policies and softmax policies. A greedy policy always selects the action that is considered best at the moment. An ε-greedy policy behaves like a greedy policy, but with a probability ε it selects an action at random. Softmax policies assign different probabilities to the actions based on their expected values.
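The three policy types can be sketched as follows; the Q dictionary mapping actions to estimated values is a hypothetical stand-in, and the Boltzmann (exponential) weighting used for softmax is one common choice rather than a definition from the thesis.

```python
import math
import random

def greedy(q):
    """Always pick the action currently considered best."""
    return max(q, key=q.get)

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(list(q))
    return greedy(q)

def softmax(q, temperature=1.0):
    """Select actions with probabilities increasing with their estimated value."""
    actions = list(q)
    weights = [math.exp(q[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]

q_values = {"sleep": 0.2, "listen": 0.5, "transmit": 0.1}   # hypothetical estimates
print(greedy(q_values), epsilon_greedy(q_values), softmax(q_values))
```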

With the above notation, an agent observes states and rewards, and takes actions in a sequence like the following:

s_t : a_t \to r_{t+1},\; s_{t+1} : a_{t+1} \to r_{t+2},\; s_{t+2} \ldots  (2.5)

If the state signal has the Markov property, then Equation 2.5 is a Markov decision process (MDP) [13]. In an MDP, the probability of transiting to state s', being in state s and taking action a, is given by a state transition probability function P^a_{s,s'}. The expected immediate reward for taking action a in state s and transiting to a new state s' is given by a reward function R^a_{s,s'} [14]. Most of the mathematical theory regarding RL, such as convergence proofs, assumes the RL task to be an MDP [14, 13].

In some cases the environment changes over time. This means that the function P^a_{s,s'} is not static. Such environments are referred to as non-stationary environments, and they do not fit into the basic MDP framework either.

What the agent may learn in an MDP is to always take actions that maximize some function of the rewards it expects to receive in the future. This value is referred to as the expected return, and is denoted R_t for the expected return following time t. There are several ways to define this function, but a common form is the following:

R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1.  (2.6)

Here, T is the length of an episode and γ is a discount factor, discounting rewards further into the future. For continuous tasks, T = ∞ and γ < 1, whereas for episodic tasks T is the number of time steps following time step t in the current episode. For episodic tasks no discounting is necessary and therefore γ = 1.
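Equation 2.6 translates directly into code; a minimal sketch:

```python
def discounted_return(rewards, gamma=1.0):
    """R_t = sum_k gamma^k * r_{t+k+1} over one finished episode.

    `rewards` holds r_{t+1}, r_{t+2}, ... in order; gamma = 1 corresponds to
    the undiscounted episodic case described above."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([0, 0, 1], gamma=0.9))   # 0.81
```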

We define two functions:


V^\pi(s) = E_\pi[R_t \mid s_t = s]  (2.7)

and

Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a].  (2.8)

Here E_π denotes the expected value following policy π. V^π is the state-value function and denotes the expected return following a given state. Q^π states the expected return when taking a certain action in a certain state, and is therefore referred to as the action-value function. Q^π may be computed from V^π if the state transition probability function P^a_{s,s'} is known.

For an MDP, there exists exactly one optimal state-value function V*, such that for each state s, V*(s) is greater than or equal to V^π(s) for all π. The policies under which the state-value function is optimal are referred to as optimal policies, and are denoted by π*. Learning to take actions that maximize the expected return is hence equivalent to computing V*, and computing Q* from it.

In many cases, computing V* or Q* may not be feasible and only an approximation can be achieved. In RL methods, the approximation of the optimal value functions is made through two processes: policy evaluation and policy improvement. In the first process, the aim is to learn V^π for the current policy π; typically, only an approximation will be learned here as well. In the second process, the policy is improved. This is typically done by assigning higher probabilities to actions which have yielded better rewards than others.

In RL, there are three major learning strategies: dynamic programming (DP), Monte Carlo (MC) methods and temporal difference (TD) learning.

In DP, policy evaluation is an iterative process based on the Bellman equation for V^π, which is as follows:

V^\pi(s) = E_\pi[R_t \mid s_t = s] = \sum_{a \in A_s} \pi(s, a) \sum_{s' \in S} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma V^\pi(s') \right]  (2.9)

Policy improvement is then made by changing to the greedy policy with respect to V^π. DP is guaranteed to converge, but since the Bellman equation includes the state transition probability function P^a_{s,s'}, a complete model of the environment is needed in order to use DP methods. This constitutes a major limitation of DP methods.
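For completeness, a sketch of iterative policy evaluation built on Equation 2.9. It assumes a small tabular task where the transition probabilities P and rewards R are given explicitly, which is exactly the complete-model requirement that limits DP methods; the data layout is our own choice.

```python
def policy_evaluation(states, actions, policy, P, R, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman equation for V^pi until the values settle.

    policy[s][a] : probability of taking action a in state s
    P[s][a][s2]  : transition probability from s to s2 under a
    R[s][a][s2]  : expected immediate reward for that transition
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in states)
                for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```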

In MC methods, no model is needed. The agent learns from experience, using samples of state transitions. It selects actions using a policy π, receives feedback, and updates the value function for the whole chain of states visited. Two of several different update rules are:

V(s_t, i) \leftarrow V(s_t, i-1) + \alpha \left[ R_t - V(s_t, i-1) \right]  (2.10)

and

V(s_t, i) \leftarrow \frac{\sum_{j=0}^{i} w_j R_{j,s_t}}{\sum_{j=0}^{i} w_j}.  (2.11)

12

Page 19: Optimization of Wireless Sensor Networks using Machine ... · Optimization of Wireless Sensor Networks ... Optimization of Wireless Sensor Networks using Machine Learning ... namely

2.2. MACHINE LEARNING

In Equation 2.10, α is the learning rate, which defines how much each return should be accounted for. This update rule is well suited for non-stationary environments [14], since older returns have less and less importance for each episode.

Equation 2.11 is the weighted average of the returns of all episodes following a given state.
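In code, the two update rules look as follows (an illustrative sketch, not code from the thesis):

```python
def constant_alpha_update(value, episode_return, alpha=0.1):
    """Equation 2.10: move the estimate a fraction alpha toward the new return."""
    return value + alpha * (episode_return - value)

def weighted_average(returns, weights):
    """Equation 2.11: weighted average of all returns observed after a state."""
    return sum(w * r for w, r in zip(weights, returns)) / sum(weights)

print(constant_alpha_update(0.5, 1.0))            # 0.55
print(weighted_average([1.0, 0.0], [1.0, 1.0]))   # 0.5
```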

TD methods also do not need a model of the environment. The difference between MC methods and TD methods is that TD methods use immediate rewards and the (discounted) value of the current state to update the values of the preceding states. When applicable, TD methods may show better performance than MC methods [14].

2.2.2. Multi-Agent Reinforcement Learning

Multi-agent RL differs from ordinary single-agent RL in that there are many different agents that are supposed to learn a task. Different terminology exists. Panait and Luke [17] divide multi-agent learning into two main classes: concurrent learning and team learning.

Team Learning

In team learning there is only one learner. It may be seen as single-agent reinforcement learning with one agent representing n others. Thus, the state space consists of all combinations of the agents' states. For n agents with m states each, the resulting state space would be of size |S| = m^n. This makes team learning methods limited.

There are two classes of team learning: homogeneous and heterogeneous. In homogeneous team learning, all agents have the same behavior, while in heterogeneous team learning they may have different behaviors. If an agent in all states can choose among the same number of actions |A|, the policy space for a homogeneous team learning approach would be of size |A|^|S|. For a heterogeneous team learning approach it would be |A|^{n|S|}.

Since there is only one learner, team learning is a centralized way of learning, with policies depending on the state of the whole system. This is another limitation of team learning methods.

Concurrent Learning

In concurrent learning, each agent is a learner and learns independently of the other agents. Concurrent learning exists in two forms, cooperative learning and competitive learning. In cooperative learning, the task is for the agents to cooperate to accomplish a goal. In competitive learning, the agents' goals are somewhat orthogonal.

In concurrent learning, the agents may be given individual rewards. This means that, in addition to the credit assignment problem which is common in single-


agent RL, there is also the problem of which agent to assign the credit for good results.

For each independent agent, the other agents constitute a part of the environment's dynamics. This means that when the other agents change their behavior due to policy improvement, the just-learned behavior may become obsolete. Concurrent learning does not fit into the MDP framework. Common theoretical approaches come from the game-theoretic field, such as Markov games and Nash equilibria [21, 20].

The state spaces of concurrent learning depend on the amount of information each agent has about the other agents. If n agents are interacting without any information about each other's behaviors, the state space is n|S|, where S is the state space of a single agent. Adding information about the other agents results in an increase of each agent's state space S by a factor d^{n-1}, where d is the number of states of that information.

2.2.3. Genetic Algorithms

The development of genetic algorithms (GA) has been inspired by the evolution found in nature. Though the terminology and the exact algorithm vary in the literature, this section presents a commonly used approach.

A GA usually starts with a random set of individuals, a population. The population existing at a certain time is referred to as a generation. Each individual represents a solution to the problem to be solved. Typically, solutions consist of a set of parameters ordered in a string. This string is referred to as a chromosome and the parameters as genes. The individuals are evaluated to determine how well they solve the problem and are assigned a fitness on the basis of that evaluation. The fitness is determined by the fitness function, which is typically a numerical function and the same for all individuals.

When a whole generation has been evaluated, selection is made. Selection is the process that determines which individuals will contribute to creating the next generation. In this process, all individuals have a certain probability of being selected. The probability for an individual i is based on its fitness, and a common way of computing the probability is as follows [13]:

P_i = \frac{f(i)}{\sum_{j=1}^{n} f(j)}.  (2.12)

Here, f(x) is the fitness of individual x, and n is the total number of individuals in the current generation.

The chromosomes of the individuals selected in this way may either be copied into the next generation, or be subject to some genetic operator. Two commonly used operators are the crossover and the mutation operators.

The crossover operator consists in interchanging parts of the chromosomes of two different individuals, referred to as parents, to construct two new individuals, referred to as offspring. Which parts are selected is determined by the crossover mask, a bit mask interpreted as follows: if a bit is zero, the corresponding


part of the first offspring is taken from the first parent and that of the second offspring from the second parent. For a bit equal to one, it is the other way around.

Different kinds of crossover operators exist. The one-point and uniform crossover operators are in a sense two extremes of the possible approaches. The one-point crossover uses a crossover mask with two parts: one part has only zeros and the other only ones. The uniform crossover operator uses a crossover mask where each bit is set at random to zero or one.

The mutation operator operates on the new generation after it has been created, by randomly selecting individuals and randomly changing selected parts of them. The selection of individuals to be modified in this way is usually made with uniform probability, since a selection based on fitness has already occurred.

Typically, each generation is of the same size as its preceding one. The fractions of individuals to be copied and to be created using crossover, respectively, may be set to a fixed ratio at initialization. For mutation, the mutation rate determines the probability for each individual to be mutated.

Chromosomes are typically represented as bit strings where the bits are grouped together, forming different genes. Genetic operators then typically operate on the bit level. In problems where genes represent integers or real values, it is not guaranteed that this representation is always the most useful. Instead, viewing each gene as a whole entity and defining the genetic operators on the gene level may yield more natural operations and opens the door for different kinds of crossover and mutation operators, such as averaging the values of the same parameter of two chromosomes, replacing a parameter value with a new random value, or creeping, i.e. moving a parameter a small step in one direction [15, 16].

A general GA algorithm is as follows:

1. Generate a generation of individuals.

2. Evaluate each individual in the current generation.

3. If a satisfying solution is found, stop.

4. Select a subset of the individuals, based on their fitness.

5. Select individuals for reproduction and let the offspring be the new generation.

6. Mutate the new generation.

7. Repeat from 2.
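A compact Python sketch of the loop above, using fitness-proportional selection (Equation 2.12), one-point crossover and per-gene mutation on real-valued chromosomes. The population size, the rates and the gene representation are illustrative assumptions, not the settings used later in this thesis.

```python
import random

def evolve(population, fitness, generations=50, crossover_rate=0.7, mutation_rate=0.05):
    """population: list of chromosomes (lists of numbers in [0, 1)); fitness: chromosome -> float."""
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        total = sum(scores)
        weights = [s / total for s in scores]            # Equation 2.12

        def select():
            return random.choices(population, weights=weights)[0]

        offspring = []
        while len(offspring) < len(population):
            if random.random() < crossover_rate:         # one-point crossover
                a, b = select(), select()
                cut = random.randrange(1, len(a))
                offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
            else:                                        # copy a selected parent unchanged
                offspring.append(list(select()))
        offspring = offspring[:len(population)]

        for chrom in offspring:                          # mutation: replace genes at random
            for i in range(len(chrom)):
                if random.random() < mutation_rate:
                    chrom[i] = random.random()
        population = offspring
    return max(population, key=fitness)

# Example: maximize the sum of five genes.
pop = [[random.random() for _ in range(5)] for _ in range(20)]
print(round(sum(evolve(pop, fitness=sum)), 2))
```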

2.3. Discrete Event Simulation

In simulation in general, a system model represents the essential characteristics of a real system. System models are of different kinds; two important classes are deterministic and stochastic models. In stochastic models, some of the input


parameters on which the simulation depends are random, while in deterministic models no such random variables exist [1].

The system model is a set of variables describing the state of the system at a point in time. In discrete event simulation, the values of the state variables change only at discrete time steps. Each change is caused by the occurrence of an event. Since events represent discrete changes of the system's state, they have no duration themselves. Activities and delays represent durations of known and unknown lengths, respectively. A real-world action that has a duration is modeled as a start event followed by an end event a certain number of time units later, and is specified by an activity or a delay [1].

Entities are objects of such complexity or importance to the system that they are modeled explicitly [1].

The way the system evolves through time is determined by the events that affect it. A common algorithm used for executing events in the correct order is the event-scheduling/time-advance algorithm, shown in Table 2.1. Here, an event notice encapsulates the information necessary for executing a certain event [1].

While the event list is not empty:
    Pop the next event notice n from the event list
    Advance the clock to time t for event e, associated with n
    Execute e
Repeat

Table 2.1. The event-scheduling/time-advance algorithm.
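The loop in Table 2.1 maps naturally onto a priority queue ordered by event time. A minimal sketch follows; the event-notice representation and helper names are ours, not from the thesis implementation.

```python
import heapq
import itertools

_tie_breaker = itertools.count()

def schedule(event_list, time, action):
    """Insert an event notice; the counter orders simultaneous events deterministically."""
    heapq.heappush(event_list, (time, next(_tie_breaker), action))

def run(event_list):
    """Event-scheduling/time-advance loop: pop the next notice, advance the clock,
    and execute the event; an executed event may schedule further events."""
    clock = 0.0
    while event_list:
        clock, _, action = heapq.heappop(event_list)
        action(clock, event_list)
    return clock

# Example: an event at t=1 that schedules a follow-up event two time units later.
events = []
schedule(events, 1.0, lambda t, ev: schedule(ev, t + 2.0, lambda t2, ev2: print("done at", t2)))
print(run(events))   # prints "done at 3.0", then 3.0
```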


Chapter 3

A WSN System Model

In this chapter, the abstract model used for simulation of the WSN is presented. This is done by describing the parts of which the model consists, and how these parts together form a whole system.

The model presented here is designed both to reproduce the essential characteristics of a WSN and to be simple enough to enable fast simulation. The model is a discrete event system representation of a WSN, and may be simulated using the approaches of Section 2.3.

Each sensor node is modeled as a single entity. The nodes, together with mechanisms for radio transmission and for monitoring phenomena, constitute the WSN. Each sensor node, in turn, consists of a number of entities, such as an energy source, detectors and a radio. Devices (entities) that need energy in order to function use the node's energy source. Figure 3.1 shows the relation between these. As suggested by the figure, the node may be thought of as representing a point of connection between the other entities. In the figure, entities such as the processor and clock are left out; these are assumed to be contained in the node.

The models of radio propagation and of the phenomena to be monitored constitute the model of the environment. When a phenomenon interacts with a sensing device, the device typically forwards information about the event to the node. This may in turn cause the node's state to change, which may result in the node taking actions, which may trigger further events.

The remainder of this chapter covers the details of how the energy source, radio transmission and detection are modeled.

3.1. The Energy Source

Each entity dependent on energy is assigned an energy source from which it takes its energy. Energy usage is modeled by two different mechanisms: one suitable for long-term usage and the other for short-term usage. The first mechanism lets an entity inform the source that, from a certain time, it will be using a certain amount of power. The energy source will then account for this usage until it is


Figure 3.1. The WSN model (diagram relating the node to its radio, energy source, sensor, the propagation model and the phenomenon).

notified that the usage has stopped or the energy source has been emptied. The other mechanism lets an entity take a single quantity of energy at once. When an energy source is drained, the dependent entities stop functioning.

In reality, the properties of an energy source vary with temperature, time, the level of instantaneous usage, etc. In this simplified model, the source is assumed to be linear. For a source's capacity C_{t_i} at time t_i and the energy u_{t_{i-1},t_i} withdrawn since time t_{i-1}, we have

C_{t_i} = C_{t_{i-1}} - u_{t_{i-1},t_i},

independently of the period length, the temperature or the instantaneous power usage u_{t_{i-1},t_i}/(t_i - t_{i-1}).
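The two usage mechanisms and the linear capacity rule can be sketched as a small class. The class and method names and the joule/watt units are our own choices; for brevity, the long-term draw is only accounted for when it is stopped rather than continuously.

```python
class EnergySource:
    """Linear energy source: capacity decreases by exactly the energy withdrawn."""

    def __init__(self, capacity_joules):
        self.capacity = capacity_joules
        self.active_drains = {}            # entity -> (power_watts, start_time)

    def start_usage(self, entity, power_watts, time):
        """Long-term mechanism: an entity announces a constant power draw."""
        self.active_drains[entity] = (power_watts, time)

    def stop_usage(self, entity, time):
        power, start = self.active_drains.pop(entity)
        self.withdraw(power * (time - start))

    def withdraw(self, energy_joules):
        """Short-term mechanism: take a single quantity of energy at once."""
        self.capacity = max(0.0, self.capacity - energy_joules)

    def is_empty(self):
        """When drained, the dependent entities stop functioning."""
        return self.capacity <= 0.0

source = EnergySource(capacity_joules=100.0)
source.start_usage("radio", power_watts=0.05, time=0.0)
source.stop_usage("radio", time=200.0)     # 0.05 W for 200 s = 10 J
print(source.capacity)                      # 90.0
```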

3.2. The Radio Model

The radio model consists of two parts: one part representing the hardware and another representing the propagation of the radio waves.

In this model, only the free space path loss is taken into account. Hence, when a radio transmits, the propagation model uses Equation 2.1 to compute the signal strengths at the other radios' positions. The strength of the interfering signal is taken as the sum of all signals in the air at that point. No background noise is assumed, and the only way for signals to become corrupted is by interference with other explicitly modeled transmitted signals.

At the receivers' antennas, the signal strength is compared to the receiver's reference level to decide if the signal is strong enough. In this model, a signal with a strength above the reference level is received without errors (i.e. BER = 0), and a signal below the reference level is classified as noise. For the experiments presented in this thesis, this simplification is not expected to produce misleading results.


The model of the radio's mechanism for receiving signals is depicted as a state diagram in Figure 3.2. When turned on, the radio stays in its listening state until notified by the propagation model that a radio signal is present. The [isOK] condition in the diagram is true if (1) the radio is the intended receiver of the current message, (2) the incoming signal is strong enough relative to the background noise, and (3) the radio is turned on. If these conditions are met, the radio transits to the receiving state and stays there until either the transmission ends or another signal interferes with it.

Figure 3.2. A diagram of the radio mechanism.

While the radio is turned on, it consumes energy. Listening, receiving and transmitting are all associated with different levels of power consumption; the values for these depend on the hardware modeled. The energy used for transmission is assumed to be C · TxPower, that is, the requested output strength times a coefficient representing the transmission overhead.

3.3. The Sensor Model

Several types of sensors exist that require specific models due to the natural differences in the phenomena they are designed to monitor. Here, a model suitable for PIRs (passive infrared sensors) and similar sensor devices is used.

The devices and the moving objects to be detected are modeled explicitly, that is, not with the aid of statistical methods. The reason for this is to capture the correlation between the density of nodes, their duty cycling properties and the detection of moving objects.

Typically, a PIR device's sensing area in free space has a shape similar to a spherical cone. Since the device is not able to sense through walls and other obstacles, these will shadow parts of the sensing area if they lie within the boundary of the cone.

In the 2D model presented here, the PIR device's sensing area is represented by a region bounded by polygons. With polygons it is easy both to approximate a


Figure 3.3. Four nodes with PIRs, a moving object and a non-moving obstacle. Lines symbolize radio connections.

circular segment (the 2D representation of the spherical cone) and to cut out pieces for shadows caused by obstacles. See Figure 3.3.

The objects which may be detected move in one direction for a predetermined period, then change to a new direction in which they continue for another period of time, and so on. A moving object intersecting the sensing area of a device will cause a detection if and only if the device is turned on. Figure 3.4 shows this.

Figure 3.4. The states of a sensor.


Chapter 4

Machine Learning Methods for Optimization of WSN

In this chapter, we present methods to compute policies for self-configuration of the nodes in a WSN. The search for policies is based on simulation of the model described in Chapter 3. We constrain the policies to be learned to only cover choices of node parameter values. The choices are based on local observations.

Several reinforcement learning approaches may be used for this problem. It is natural to view it as a multi-agent reinforcement learning problem where each node constitutes an agent, which together with the other agents tries to solve a problem.

A team learning approach as defined in Section 2.2.2 is not applicable for two reasons: (1) the policies would be based on the state of the whole network, which is information typically not known to the nodes; (2) the state space increases drastically with the size of the network.

The approach taken here is therefore a concurrent learning approach. We ignore the problem of which agent to assign credit for good results, and give the same reward to all agents based on the performance of the network as a whole.

In concurrent learning, agents may learn to act as specialists. We want all the nodes to have the same policy. A reconfiguration policy is, in a sense, already a way to determine when to configure as a specific specialist. Therefore, during learning, the nodes use and update a common action-value function.

Thus, having n nodes learning during one episode is equivalent to having one node learning during n episodes with a stationary policy. For each node, the other n-1 nodes constitute a part of the environment.

In Section 4.3 we present a way to combine this approach with Monte Carlo reinforcement learning methods. For this, methods for action selection and performance measurement are proposed. Difficulties due to continuous state and action spaces are discussed, and it is shown how they can be managed by quantizing variables to different predefined values. Finally, a way to use genetic algorithms to automatically set these predefined quantization values is proposed.


4.1. Terminology

There exist variables that a node in a WSN may configure directly, e.g. the transmission power level and the duty cycling period lengths. We call these internal variables, and denote them by u_0, u_1, ..., u_m. The node may not directly configure variables such as node density and remaining energy. We call these external variables, and denote them by v_0, v_1, ..., v_n.

We assume internal and external variables to have fixed and finite domains U_0, U_1, ..., U_m and V_0, V_1, ..., V_n, respectively. We number the different values in the domain U_i as u_{i,0}, u_{i,1}, ..., u_{i,a_i}, and similarly for V_i. Thus we have u_i ∈ U_i = {u_{i,0}, u_{i,1}, ..., u_{i,a_i}} and v_i ∈ V_i = {v_{i,0}, v_{i,1}, ..., v_{i,b_i}}, respectively.

We also define the internal state space S_int and the external state space S_ext as S_int = U_0 × U_1 × ... × U_m and S_ext = V_0 × V_1 × ... × V_n, respectively. Here × is the Cartesian product operator.

The joint state space is then the set S = S_ext × S_int, and we denote any variable in this set by x_i ∈ S and its domain by X_i.
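With finite domains, the joint state space can be enumerated directly as a Cartesian product; a minimal sketch with arbitrary example domains (these are not the variables or values used in the experiments):

```python
from itertools import product

# Internal variables (directly configurable), e.g. transmission power and duty cycle.
U = [(1, 2, 4), (0.1, 0.5, 1.0)]
# External variables (only observable), e.g. neighbor count and remaining energy level.
V = [(0, 1, 2), ("low", "high")]

S_int = list(product(*U))            # S_int = U0 x U1 x ... x Um
S_ext = list(product(*V))            # S_ext = V0 x V1 x ... x Vn
S = list(product(S_ext, S_int))      # joint state space S = S_ext x S_int

print(len(S_int), len(S_ext), len(S))   # 9 6 54
```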

4.2. A Non-Stationary Environment

In Section 2.2.1, an MDP was defined to be a sequence as in Equation 2.5, given a Markov state signal. In our problem, an agent's action may cause a state transition in the following way: an agent finds itself in a certain state and reconfigures according to its policy. The new configuration may then cause changes in other agents' states due to changes in variables such as the traffic level. These other agents may then reconfigure using that same policy. This may, in turn, result in the first agent's state changing once again, and so on.

Variables such as traffic level may also change due to varying activity of themonitored phenomena. Such state changes may also be viewed as caused by anagent’s actions, given that the agent is able to configure parameters such as its dutycycling.

For each agent, given a fixed policy, the emergence of the system may approx-imately be viewed as an MDP, with the state transition probability function Pa

s,s′

including also the dynamics (the behavior) of the other agents. However, since thepolicy typically changes over time due to policy improvement, the eventual responsefrom the other agents is unknown, and the environment is therefore non-stationaryand the MDP framework is not fully applicable. However, if only minor changesare made to the policies between two episodes, the old behavior of the system isexpected to function as an approximation to the new behavior.

4.3. A Monte Carlo Approach

The vague causality between actions and observations of new states makes methods like TD learning hard to apply. The fact that TD methods


make use of immediate rewards does not make them more applicable either, since it may be difficult to compute meaningful rewards for short periods of time. It is much easier to compute a meaningful total return after each episode, and then update a Q-function for the state-action pairs visited during that episode.

In the Monte Carlo reinforcement learning approach taken here, an episode is a fixed time interval during which the network operates. During each episode the performance of the network is measured, and when the episode finishes the measure of the total performance is given as a return to all the agents. Using a suitable policy update rule, the nodes update a common Q-function. In this way there is one learner, using the experience of multiple agents. Table 4.1 shows an algorithm equivalent to the one used here.

The algorithm uses a history matrix H_k with entries H_k(s, a) representing the number of times action a has been selected from state s. When a node's state changes, or when T time units have passed since its last reconfiguration, the node reconfigures using an ε-greedy policy. After each episode has completed, the return is computed, and the mean reward for each state-action pair is updated based on the history matrix. Finally, the greedy policy π is updated.

A node may reconfigure each time its state signal changes. This means that nodes that lie more densely may reconfigure more often than other nodes. With an ε-greedy policy, sparsely placed nodes would then also do less exploration than densely placed nodes. The nodes therefore get a chance to reconfigure at certain predetermined time intervals, in addition to the reconfigurations that are due to state changes.

4.4. The State and Action Spaces

Using the entire state space S as a basis for computing the state signal makes it possible to let the agent adjust one variable at a time, since the agent then remembers its internal state. It also makes it possible to assign values to states in a meaningful way according to how good they are considered to be. Using only the external state space S_ext as the state signal, the agent is memory-less, and it must decide values for all the variables at once at each reconfiguration. The learning tasks these two alternatives give rise to can be formulated as "Given a node's current state, which internal variables should be changed, and to what values?" and "Given a node's current external state, what is the optimal internal state?", respectively. The sizes of the spaces to be searched for optimal policies are

\left( \sum_{i=0}^{m} |U_i| \right)^{|S|} \quad \text{and} \quad |S_{int}|^{|S_{ext}|},

respectively. The exponent in the first expression is greater than that of the second unless |S_int| = 1, so the first expression typically results in a much greater value than the second. We therefore take the second approach here.
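As a hypothetical illustration (the numbers are not those of the later experiments): with three internal variables of three values each, we get Σ|U_i| = 9 and |S_int| = 27, and with |S_ext| = 12 external states the joint space has |S| = 27 · 12 = 324 states. The first formulation then gives 9^{324} candidate policies, while the second gives 27^{12} ≈ 1.5 · 10^{17}, which is still large but vastly smaller.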

An action in this approach consists in assigning a value to each of the parameters in the internal state space. The action space is the same for all states and may strictly be formulated as in Equation 4.1.


H_0 ← 0
δt ← −T
t ← 0
For each episode k:
    H_k ← H_{k−1}
    While episode k not finished:
        For each agent i:
            s'_i ← update_state
            If s_i ≠ s'_i or t − δt = T:
                δt ← t
                s_i ← s'_i
                Determine reconfiguration action a_i:
                    With probability ε: select among all actions with equal probabilities
                    Otherwise: use the deterministic policy π
                H_k(s_i, a_i) ← H_k(s_i, a_i) + 1
    R ← evaluate_episode
    For each state s:
        For each action a:
            Q(s, a) ← R + (H_{k−1}(s, a) / H_k(s, a)) · (Q(s, a) − R)
        π(s) ← argmax_a Q(s, a)

Table 4.1. The RL algorithm used in this report.
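To illustrate the episode-end update in Table 4.1, the following minimal Java sketch shows how a shared Q-table could be updated from the common return using the history matrices. It is not the thesis implementation; all class and method names are hypothetical, and state/action identifiers are assumed to be integers as in Chapter 5.

final class MonteCarloUpdater {
    private final double[][] q;        // Q(s, a)
    private final int[][] visitsPrev;  // H_{k-1}(s, a)
    private final int[][] visits;      // H_k(s, a)

    MonteCarloUpdater(int numStates, int numActions) {
        q = new double[numStates][numActions];
        visitsPrev = new int[numStates][numActions];
        visits = new int[numStates][numActions];
    }

    /** Called by each agent when it reconfigures with action a in state s. */
    void recordVisit(int s, int a) {
        visits[s][a]++;
    }

    /** Called once per episode with the common return R. */
    void endEpisode(double episodeReturn) {
        for (int s = 0; s < q.length; s++) {
            for (int a = 0; a < q[s].length; a++) {
                if (visits[s][a] > 0) {
                    // Q(s,a) <- R + (H_{k-1}/H_k) * (Q(s,a) - R), i.e. a running mean over all visits
                    double weight = (double) visitsPrev[s][a] / visits[s][a];
                    q[s][a] = episodeReturn + weight * (q[s][a] - episodeReturn);
                }
                visitsPrev[s][a] = visits[s][a];   // the new episode starts from the current counts
            }
        }
    }

    /** The greedy policy pi(s) = argmax_a Q(s, a). */
    int greedyAction(int s) {
        int best = 0;
        for (int a = 1; a < q[s].length; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }
}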

A = \bigcup_{u \in S_{int}} \left[ \{u_0, u_1, \ldots, u_m\} \leftarrow u \right] \qquad (4.1)

4.5. Perception of the State Signal

In a real WSN, the values of external variables may rely on approximations made on the basis of single observations of the environment. Updating these approximations would typically be associated with an energy cost. Therefore, the time intervals at which these approximations are updated may themselves be regarded as internal variables that can be configured for optimal performance. In this thesis, however, we assume the state signal to be automatically updated when it changes.


4.6. Performance Measure

To measure the performance of the WSN, we use a measure of the QoS and one of the energy use. These two are then combined into a return function R for the MC method developed earlier in this chapter.

One possible measure of energy efficiency is the average amount of energy used per node and per episode. This measure is easily computed, but does not take into account how evenly the energy usage is distributed over the nodes. As a measure of this, the variance or the standard deviation of the nodes' energy usage may be used. The following equations show the mean µ and the standard deviation σ of the energy use:

\mu = \frac{1}{N} \sum_{i=1}^{N} u_i \qquad (4.2)

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (u_i - \mu)^2} \qquad (4.3)

Here, N is the number of nodes and u_i is the amount of energy used by the i-th node during the episode.

The most suitable QoS measure differs from task to task. For an alarm system it may be sufficient to detect a change in the environment, while for a monitoring system we may want more detailed information, such as how an object has been moving in a certain area. Therefore, we postpone defining a QoS measure until Chapter 6, where we define the experiment scenario.

Combining the different measures into one return function can be done in several ways. One is a weighted sum of the measures with weights w_QoS, w_µ and w_σ as follows:

R = w_{QoS} \cdot QoS + w_\mu \cdot \mu + w_\sigma \cdot \sigma \qquad (4.4)

Here, the weights associated with the energy measures are negative. Typically, QoS and energy usage are more or less orthogonal to each other in the sense that if one increases, the other decreases. With the return function in Equation 4.4 there is no guarantee that optimization will occur with respect to all the terms. Using thresholding functions T_QoS and T_µ that return 0 if the QoS is below its threshold and 1 otherwise, and similarly for µ, makes it possible to specify a minimum acceptable value for each of these measures. The result is:

R' = R + w_{T_{QoS}} \cdot T_{QoS} - w_{T_\mu} \cdot T_\mu \qquad (4.5)

There is still no guarantee that the episodes with the highest return values are associated with policies that have both measures above the minimum acceptable level.

An alternative is to define, for each measure q, a value pair (q−, q+), where q− defines the worst acceptable value and q+ defines a limit over which performance


is regarded as high enough, so that further optimization with respect to this parameter is not necessary. Defining a new normalized measure q' as in Equation 4.6 ensures a value between 0 and 1.

q' = \max\left(0, \min\left(1, \frac{q - q^-}{q^+ - q^-}\right)\right) \qquad (4.6)

Hence, multiplying several normalized measures q_i also results in a value in the interval [0, 1], and a value of 0 corresponds to the case where at least one measure is not good enough.

Using the same measures as before, we now define the return R as follows:

R = \sqrt[3]{QoS' \cdot \mu' \cdot \sigma'} \qquad (4.7)

This is the geometric mean of the three normalized performance measures. The drawback of this definition is that no shaping occurs for episodes where R = 0. This means that the learner may regard two policies as equally bad even though one of them may actually yield better performance on all measures when normalization is not applied. However, this return seems to work better than the weighted sum in Equation 4.5, because policies regarded as good are guaranteed to have all measures above certain values. This normalized measure will therefore be used for the experiments here.
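As an illustration, the following minimal Java sketch shows how the normalization of Equation 4.6 and the geometric-mean return of Equation 4.7 could be computed. It is not the thesis code; the method names are hypothetical, and the inputs to the return function are assumed to be already normalized.

final class ReturnFunction {
    /** Equation 4.6: clamp (q - qMinus) / (qPlus - qMinus) to [0, 1].
     *  For measures where lower is better (e.g. mu and sigma), qPlus < qMinus,
     *  so the same expression still maps "worst acceptable" to 0 and "good enough" to 1. */
    static double normalize(double q, double qMinus, double qPlus) {
        return Math.max(0.0, Math.min(1.0, (q - qMinus) / (qPlus - qMinus)));
    }

    /** Equation 4.7: the cube root of the product of the three normalized measures. */
    static double episodeReturn(double qosNorm, double muNorm, double sigmaNorm) {
        return Math.cbrt(qosNorm * muNorm * sigmaNorm);
    }
}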

4.7. Action Selection

The way actions are selected during the episodes determines the balance between exploration and exploitation. We use an ε-greedy approach and let ε decrease with each episode i according to the following equation:

\epsilon_i = \left(1 - \frac{i}{i + 1/s}\right)(\epsilon_0 - \epsilon_\infty) + \epsilon_\infty \qquad (4.8)

Here, ε_∞ is the value to which the function converges and s is the speed at which it converges. Figure 4.1 shows the function used in this report, with ε_0 = 0.1, ε_∞ = 0.01 and s = 0.01.
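A minimal Java sketch of this action-selection scheme is given below. It assumes the decay schedule of Equation 4.8 as reconstructed above; the class and method names are illustrative and not taken from the thesis implementation.

import java.util.Random;

final class EpsilonGreedy {
    private static final double EPS_0 = 0.1, EPS_INF = 0.01, S = 0.01;
    private final Random rng = new Random();

    /** Equation 4.8: epsilon decreases from EPS_0 towards EPS_INF with speed S. */
    double epsilon(int episode) {
        return (1.0 - episode / (episode + 1.0 / S)) * (EPS_0 - EPS_INF) + EPS_INF;
    }

    /** With probability epsilon explore uniformly, otherwise follow the greedy policy. */
    int selectAction(int episode, int greedyAction, int numActions) {
        if (rng.nextDouble() < epsilon(episode)) {
            return rng.nextInt(numActions);
        }
        return greedyAction;
    }
}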

4.8. Policy Updating

As already noted, the environment is non-stationary. This implies that the averaging policy update rule, common for Monte Carlo methods, may not be the most suitable one, since it weighs the return of an old episode as much as that of a new episode, even though the old return measures how good a certain policy was in an environment that may no longer exist.

An update rule that prioritizes newer returns over older ones is that of Equation 2.10. In practice, however, this update rule does not seem to work well for this task. Therefore, the averaging policy update rule is used in the experiments of this report.


Figure 4.1. The ε-function (ε plotted against the number of episodes).

4.9. Quantizing

So far in this thesis we have assumed the state variables to be discrete and finite. This is not the case for several parameters in the WSN optimization task. Continuous or infinite variables make it difficult to represent a whole set of state variables as one single-valued state signal. It is also difficult to define actions for setting values of parameters with infinite domains. Here, this problem is managed by quantizing continuous variables.

For any continuous variable z_i we quantize it to the value x_i = x_{i,k} such that

k = \arg\min_j |z_i - x_{i,j}| \qquad (4.9)
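A minimal Java sketch of this nearest-value quantization is shown below. The quantization values of a variable are assumed to be stored in an array; the names are illustrative only.

final class Quantizer {
    /** Equation 4.9: return the index k of the quantization value closest to z. */
    static int quantizeIndex(double z, double[] quantizationValues) {
        int k = 0;
        for (int j = 1; j < quantizationValues.length; j++) {
            if (Math.abs(z - quantizationValues[j]) < Math.abs(z - quantizationValues[k])) {
                k = j;   // argmin_j |z - x_j|
            }
        }
        return k;        // the variable is then represented by x_k
    }
}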

The choice of the values x_{i,j} is important for what the learner is able to learn and may vary between different scenarios.

For external variables, the values z_i may lie within a small range, and with sparsely placed x_{i,j}, the quantized x_i may always take on the same value, which results in fewer states that the agent can distinguish between. Manually finding suitable quantization values can be a time-consuming task. It is therefore convenient to set the x_{i,j} automatically.

For internal variables, the x_{i,j} determine what values the variables may be configured to.

4.10. Using GA for Selecting Quantizing Values

Setting the values for quantization as described in the previous section may be done automatically. This section presents an approach to doing that using genetic


algorithms (GA). Let a chromosome C represent a set of values for quantization:

C = \{X_0, X_1, \ldots, X_p\} \qquad (4.10)

Here, the X_i are domains as defined in Section 4.1. We define a member of X_i as x_{i,j} ∈ [min_i, max_i]. Each x_{i,j} may be seen as a separate gene. With this representation, it is natural to define non-binary genetic operators.

The mutation operator replaces one value, selected at random, with a random number y. Assuming X_i to be an ordered set such that x_{i,0} ≤ x_{i,1} ≤ ... ≤ x_{i,n_i}, we define the mutation operator strictly as follows:

x_{i,j} \leftarrow y, \quad \text{where } y \in
\begin{cases}
[x_{i,j-1}, x_{i,j+1}] & \text{if } 0 < j < n_i \\
[\min_i, x_{i,1}] & \text{if } j = 0 \\
[x_{i,n_i-1}, \max_i] & \text{if } j = n_i
\end{cases} \qquad (4.11)

That is, the mutated value is at the same position in the ordered set before and after the operator has been applied.

We define a uniform crossover operator that interchanges two values at equal positions in two corresponding ordered domains. Strictly, this may be formulated as follows:

x_{i,j} \leftarrow b_{i,j} \, x'_{i,j} + (1 - b_{i,j}) \, x''_{i,j} \qquad (4.12)

Here, b_{i,j} is 0 or 1 with equal probability, x'_{i,j} and x''_{i,j} are the genes of the two selected parents, and x_{i,j} is their offspring.
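The following minimal Java sketch illustrates how the two operators could be implemented under the assumption that a chromosome is an array of ordered domains, each stored as an ascending double array. It is not the thesis code; the domain bounds min_i and max_i are passed as arrays, and all names are hypothetical.

import java.util.Random;

final class QuantizationGA {
    private final Random rng = new Random();

    /** Mutation (Equation 4.11): replace one randomly chosen value with a random value
     *  drawn so that the domain stays ordered. */
    void mutate(double[][] chromosome, double[] domainMin, double[] domainMax) {
        int i = rng.nextInt(chromosome.length);
        double[] x = chromosome[i];
        int j = rng.nextInt(x.length);
        double lo = (j == 0) ? domainMin[i] : x[j - 1];
        double hi = (j == x.length - 1) ? domainMax[i] : x[j + 1];
        x[j] = lo + rng.nextDouble() * (hi - lo);
    }

    /** Uniform crossover (Equation 4.12): each gene of the offspring is taken from
     *  either parent with equal probability. */
    double[][] crossover(double[][] parentA, double[][] parentB) {
        double[][] child = new double[parentA.length][];
        for (int i = 0; i < parentA.length; i++) {
            child[i] = new double[parentA[i].length];
            for (int j = 0; j < parentA[i].length; j++) {
                child[i][j] = rng.nextBoolean() ? parentA[i][j] : parentB[i][j];
            }
        }
        return child;
    }
}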

Evaluation is made by simulating a number of episodes used for evaluation only. The average of the returns computed after each episode is used as the fitness of the chromosome. Hence, the fitness function is equal to Equation 4.7.

Selection is made according to Equation 2.12. In this case, individuals with fitness equal to zero will never be selected.

This approach can be used in two ways. We can take a chromosome C from a population, apply the MC methods of Section 4.3 for a number of episodes, evaluate the policy (and the quantization values), then take a new chromosome from the population and learn a new policy, and so on. When the whole population has been evaluated, we construct a new population with the genetic operators described above and start over by evaluating the new generation. In theory, this may be the most appropriate way to find good policies for our problem. In practice, however, it tends to be very slow, since each evaluation is associated with a policy search.

Alternatively, we can first learn a policy for manually constructed quantization values that we consider sound, and then use the GA to find better domains, given that policy.

The reason why this works can be demonstrated by the following example. The agent may have learned to set the highest transmission power for certain states,


because this was the best alternative available. It may, however, be the case that an even higher value would have yielded even better performance in these states.

Both approaches are evaluated in this report.


Chapter 5

Implementation

In this chapter we explain how the different parts of the developed software are implemented. All parts are implemented in Java. The diagrams in this chapter show conceptual models of the different frameworks; these do not always correspond exactly to the Java code.

5.1. The Simulator

The simulator is implemented as a discrete-event simulator. As suggested by Section 2.3, it consists of the parts depicted in Figure 5.1. The algorithm used for advancing time and executing events is the one presented in Table 2.1 in Section 2.3.

Figure 5.1. A conceptual diagram of the parts of the Simulation Framework.


5.2. The Experiment Framework

To use the simulator with learning methods, an experiment framework was developed. This framework is responsible for setting up simulation scenarios, keeping track of episodes, etc.

In the experiment framework, an experiment is a collection of experiment runs. Each run is composed of a set of episodes. The exact nature of an experiment run depends on the experiment. All runs may, for example, use the same kind of scenario and the same kind of learning method in order to get the average performance of a certain method for a certain scenario. Each run may also use a different scenario or a different learning method, in order to evaluate a learning method for different scenarios or the opposite, respectively. Figure 5.2 shows a conceptual diagram of the Experiment Framework.

Figure 5.2. A conceptual scheme of the parts of the Experiment Framework.

5.3. The Graphical Interface

For debugging and demonstration purposes, a GUI for the simulation tool was implemented. For any given time step, this GUI illustrates the positions of moving objects, the nodes' sensing areas, information about radio connectivity, etc. Figure 5.3 shows this, together with the GUI of the experiment framework.

With the GUI it is possible to set the simulation speed, pause the current simulation, and show or hide different visual properties. It is also possible to load different experiments, described by a configuration file.

5.4. The Experiment Configuration Framework

To avoid having to recompile the source code each time a change is made to the network parameters, configuration files are used. In these it is possible to specify


Figure 5.3. A screen shot of the simulator.

specializations of software classes, set parameter values, specify the number of runs an experiment should consist of, etc.

5.5. The Learning Framework

The learning framework consists of two separate parts. One part contains functionality for state updating, action selection and value updating. The other contains functionality for logging information during simulation, which may be used to compute rewards at the end of episodes. This section describes these two parts and how they are used by other parts of the system.

Figure 5.4 shows a conceptual diagram of the part of the Learning Framework responsible for policy updating and for selection and execution of actions.

5.5.1. Computation of State and Action Identifiers

Here, the dynamics and dependencies of states and the variables contained in them are explained.

State Identifiers

An agent's state space is a collection of state variables x_i. In the specific approach taken here, these are the external variables v_i. Each state variable is assigned


Figure 5.4. A conceptual diagram of the parts of the Learning Framework.

a distinct identifier d_i computed as follows:

d_i = \begin{cases} 1 & \text{if } i = 0 \\ \prod_{j=0}^{i-1} |X_j| & \text{if } i \neq 0 \end{cases} \qquad (5.1)

We define k_i as the index such that x_i = x_{i,k_i}. The state signal is computed as

s = \sum_{i=0}^{n-1} d_i k_i \qquad (5.2)

This may also be expressed as a vector multiplication:

s = [\, d_0 \;\; d_1 \;\; \ldots \;\; d_{n-1} \,]
\begin{bmatrix} k_0 \\ k_1 \\ \vdots \\ k_{n-1} \end{bmatrix} \qquad (5.3)

When a state variable's value changes, it affects the state signal of the state space to which it belongs. Instead of recalculating the state signal according to Equation 5.2 when a variable changes, the following equation is used to compute the new state signal:

s_{t+1} \leftarrow s_t + d_i (k_{i,t+1} - k_{i,t}) \qquad (5.4)

Here, k_{i,t} denotes the value of k_i at time t.
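A minimal Java sketch of this bookkeeping is given below. It assumes integer value indices per state variable as defined above; the class and method names are illustrative only.

final class StateSignal {
    private final int[] d;   // per-variable offsets d_i, Equation 5.1
    private final int[] k;   // current value index of each variable
    private int s;           // current state identifier, Equation 5.2

    StateSignal(int[] domainSizes) {      // domainSizes[i] = |X_i|
        d = new int[domainSizes.length];
        k = new int[domainSizes.length];
        d[0] = 1;
        for (int i = 1; i < domainSizes.length; i++) {
            d[i] = d[i - 1] * domainSizes[i - 1];   // d_i = product of |X_j| for j < i
        }
    }

    /** Equation 5.4: incremental update when variable i changes its index to newK. */
    void setVariable(int i, int newK) {
        s += d[i] * (newK - k[i]);
        k[i] = newK;
    }

    int stateId() { return s; }
}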


Action Identifiers

When the policy is fed with the state signal, it returns an integer that is an action identifier a. In the approach taken here, an action consists of configuring all internal variables u_i. Each possible combination must therefore be identifiable by a single integer. This mapping is done using a lookup table M whose elements satisfy m_{a,i} ∈ [0, |U_i| − 1]. Taking the action identified by a corresponds to setting each variable u_i as follows:

u_i \leftarrow u_{i, m_{a,i}} \qquad (5.5)

The lookup table M is constructed using the algorithm in Table 5.1. Here, n is the number of internal variables.

a ← 0
m_{a,i} ← 0,  i = 0, 1, ..., n − 1
loop forever
    a ← a + 1
    m_{a,i} ← m_{a−1,i}, ∀i
    j ← 0
    m_{a,j} ← m_{a,j} + 1
    while m_{a,j} ≥ |U_j|
        m_{a,j} ← 0
        j ← j + 1
        if j = n
            return M = m_0, ..., m_{a−1}
        else
            m_{a,j} ← m_{a,j} + 1

Table 5.1. Algorithm for construction of the lookup table M

The computation of action identifiers may also be expressed as follows:

a = [\, 1 \;\; h_1 \;\; \ldots \;\; h_{n-1} \,]
\begin{bmatrix} k_0 \\ k_1 \\ \vdots \\ k_{n-1} \end{bmatrix} \qquad (5.6)

Here, h_i = \prod_{j=0}^{i-1} |U_j|, and k_i is the index such that the internal variable u_i = u_{i,k_i}.
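The mapping of Equation 5.6 and Table 5.1 amounts to mixed-radix encoding of the value indices. The following minimal Java sketch shows the encoding and its inverse directly, instead of through an explicit lookup table; it is not the thesis code, and the names are hypothetical.

final class ActionCodec {
    private final int[] domainSizes;   // |U_i| for each internal variable

    ActionCodec(int[] domainSizes) { this.domainSizes = domainSizes; }

    /** Equation 5.6: a = sum over i of h_i * k_i, with h_i the product of |U_j| for j < i. */
    int encode(int[] k) {
        int a = 0, h = 1;
        for (int i = 0; i < domainSizes.length; i++) {
            a += h * k[i];
            h *= domainSizes[i];
        }
        return a;
    }

    /** Inverse mapping: recover the value index of each internal variable from the identifier a. */
    int[] decode(int a) {
        int[] k = new int[domainSizes.length];
        for (int i = 0; i < domainSizes.length; i++) {
            k[i] = a % domainSizes[i];
            a /= domainSizes[i];
        }
        return k;
    }
}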


5.5.2. Evaluation

During a simulation it is necessary to collect information about the agents and their environment for evaluation purposes. Each object that handles information needed for evaluation of the episode registers variables representing that information in an evaluator, which maps variables (operands) to text strings. An object with information to be registered retrieves the variable corresponding to the kind of information and updates its value.

When an episode has finished, the evaluator is called to compute the return. It uses a predefined set of operators that each make use of one or more of the registered variables. The values produced by the operators are then normalized and multiplied according to Equations 4.6 and 4.7. Figure 5.5 shows a conceptual diagram of the Evaluation Framework.

Figure 5.5. A conceptual diagram of the parts of the framework for evaluation.

5.6. The Simulation Model Implementation

Here, the implementation of some parts of the simulation model is presented.

5.6.1. Radio Communication

Radio communication is implemented in the following way: each radio has a list of all the neighbors it can reach when transmitting at maximum transmission power. At transmission, all the neighbors in the list are notified of the presence of a signal. The strength of the received signal is calculated from the current transmission power and the distance between sender and receiver. Since this distance is known in advance, the expression 1/r^k from Equation 2.1 may be precomputed, so only one multiplication is necessary to compute the received signal strength.
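As a minimal sketch of this precomputation (assumed names, not the thesis code): the path-loss factor of a link is computed once from the fixed distance, after which the received power is a single multiplication per transmission.

final class RadioLink {
    private final double pathLossFactor;   // K / r^n, precomputed from the fixed distance r

    RadioLink(double distance, double attenuationK, double attenuationN) {
        pathLossFactor = attenuationK / Math.pow(distance, attenuationN);
    }

    /** Received power for a transmission with the given transmission power. */
    double receivedPower(double txPower) {
        return txPower * pathLossFactor;
    }
}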

5.6.2. Detection

Detection of objects is represented by events for when detection starts and when it ends. To decide at what times detection events should occur, one needs to


solve a problem which can be stated as follows:

Given a point p on a line l, starting at time t_0 and moving along the line with velocity v, and a polygon with edges e_0, e_1, ..., e_n: at what instants of time, if ever, does the point p enter and exit the region bounded by the polygon, respectively?

The problem can easily be solved analytically, but in the software implementation the analytic methods give unwanted effects, due to round-off errors, when the intersection occurs close to the polygon's corners. If the starting position of the point lies close to one of the polygon's edges, further problems may be caused by round-off errors. An algorithm was developed here to solve this problem. Notice that this problem is different from the problem of deciding whether or not a specific point lies inside a polygon.


Chapter 6

The Experiment Set-Up

This chapter presents the set-up of the system used for the experiments of this thesis. First, the experiment scenario is presented. This is followed by a section covering the optimization and learning problem, such as which parameters constitute the state and action spaces and how the WSN performance is evaluated.

6.1. The Scenario

Here, the characteristics of the scenario are described. This involves describing a real-world task and specifying parameter settings for the modeled hardware, as well as specifying the communication protocol and the assumptions made about the environment.

6.1.1. The Task

The following task is used in the experiments: a total of 15 sensor nodes are deployed one by one close to an intersection of two roads, so that their PIRs are directed toward the middle of one of the roads. The nodes' positions are set to randomly determined points in the neighborhood of the intersection. The sensor nodes are assumed to know their own position as well as the central's. The task of the sensor nodes is to monitor objects moving on the road. When a detection occurs, the nodes immediately send a message toward the central. The scenario can be seen in Figure 5.3 in Section 5.3.

The moving objects move with a predetermined speed and, on each road, the time between consecutive arrivals is uniformly distributed over the interval [0, 1] hours.

6.1.2. The Sensor Node

The sensor node model is to some extent based on the characteristics of the ESB-2 sensor node [6]. The ESB-2 consists of a radio device and a number of sensors,


such as a passive infrared device (PIR) that can detect motion in its neighborhood, a microphone, a seismic sensor for detecting vibration, a thermometer, etc. The node's energy source consists of three AA batteries.

Figure 6.1. The ESB-2 device.

The experiments only make use of the PIR device. Figure 6.1 shows an ESB-2 device.

6.1.3. The Radio Model

Settings of the parameters associated with the radio are to some extent based on values for the ESB-2's radio unit, the TR1001. Table 6.1 shows the parameter settings for the model [19]. Here, the transmission cost coefficient is multiplied with the transmission power to get the transmission cost.

Parameter                        Value
Reference Level                  10^-13 W
Listening Cost                   0.1 mW
Transmission Cost Coefficient    8 W
Peak Transmission Power          0.75 mW
Receiving Cost                   5.4 mW
Attenuation Constant n           3.5
Attenuation Constant K           1

Table 6.1. Radio Parameter Settings

A clustering algorithm similar to the LEACH algorithm is used. Modifications have been made to accomplish two things:

- a smaller probability of having large areas with only followers, and
- individual cluster-head probabilities.

The first is accomplished by adding a sub-phase to the end of the set-up phase of the original LEACH algorithm. When entering this sub-phase, if a node is


left without a cluster head, it sends a help-request to the neighbor it has heard sending a follower-message at the highest received power. That node then becomes a cluster head for this node.

In the original LEACH algorithm, all nodes have the same probability of becoming a cluster head. The second point above means that, instead of all nodes having equal probability, the nodes may have different probabilities of becoming cluster heads. In this way, nodes that are situated close to other nodes may each have a lower probability of becoming a cluster head than nodes that are more isolated.

The nodes use CSMA (carrier sense multiple access); that is, they check whether the medium is busy before transmitting. If it is busy, they wait a random time before retrying.

A node that is a cluster-head keeps its radio on, ready to receive, during the whole steady phase, while a follower keeps it off and only uses it for transmitting during this phase.

6.1.4. The Central

The central is placed so that it lies within a predetermined radius of a randomly selected node. It is modeled in the same way as the sensor nodes but with unlimited power resources. Furthermore, the central's radio is always turned on, and the central is always a cluster-head.

6.2. Optimization and Learning Aspects

This section covers the aspects of optimizing and learning policies for the scenario described above. We present both the parameters the nodes can configure and the variables constituting the state signals.

6.2.1. Optimization Parameters

Here, the parameters and values chosen for optimization are presented and described one by one. A summary is given in Tables 6.2 and 6.3, which also list the minimum and maximum values used in the search for quantization values.

LEACH Configuration This parameter tells the node whether it is a cluster-head or not. It is a boolean parameter, and quantization of its domain is not necessary. We refer to this variable as IsHead.

Transmission Power The node selects between three different values for its transmission power parameter. The choice of this parameter affects how far a transmission will reach, and hence how many hops the node is from the central. It also determines at what distances it will interfere with other transmissions. This


Internal Variable     Description                Min    Max
TxPower               Transmission Power         0      750 · 10^-6
HeadProb              Cluster Head Probability   0      1
OnPeriod/OffPeriod    Duty Cycling               0      1000
BackOff               CSMA Back-Off Time         0      10
SendProb              Sending Probability        0      1

Table 6.2. Configurable variables of which optimization is made.

External Variable     Description              Min          Max
IsHead                LEACH Configuration      -            -
Density               Node Density             8 · 10^-3    13 · 10^-3
Traffic               Network Traffic Level    1 · 10^-5    1 · 10^-4

Table 6.3. Non-configurable variables of which optimization is made.

parameter also affects the energy used for transmission. We refer to this parameter as TxPower.

Cluster Head Probability The nodes' choice of value for the probability of becoming a cluster-head affects the backbone structure of the network. If all nodes use a high probability, connectivity in the network may be good and the probability of a high QoS increases. The network would, however, drain quite quickly this way. A low cluster head probability may leave nodes without a cluster head to follow. The idea behind individual cluster head probabilities is that the optimal choice may depend on how densely the nodes are placed in an area. The nodes are given three options for this parameter. We refer to it as HeadProb.

Duty Cycling A node's duty cycle is the percentage of time the node stays awake. Awake in this scenario means having the PIR device turned on. The duty cycling is determined by two parameters: one for the duration of each period the PIR is turned on, and another for the duration of the off periods. Each of these parameters has three different options. We refer to the two parameters as OnPeriod and OffPeriod.

The Back-off Time In the CSMA protocol, when the medium is busy, a sender having data to transmit waits a random amount of time before retrying transmission. In the model used here, the time a node waits before retrying is uniformly distributed on the interval [0, max] time units. With different choices of max, the node may configure itself differently for different situations. There are three choices of values for this parameter. We refer to the parameter as BackOff.


Sending Probability The sending probability is the probability that a node will report a detected motion. Nodes that are placed densely and sense a high traffic level may possibly choose a lower probability than other nodes. We refer to this parameter as SendProb.

Node Density The node density is a measure of how close a node lies to other nodes. Several different approaches may be taken to compute this measure.

In the experiments described here, the density of a node i was computed as the inverse of the average distance to all the other nodes in the network, as follows:

D_i = \frac{n - 1}{\sum_{j=1}^{n} \delta_{i,j}} \qquad (6.1)

Here, n is the number of nodes in the network and δ_{i,j} is the distance between nodes i and j. The nodes were assumed to know their own density, which was computed at initialization.
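A minimal Java sketch of Equation 6.1 is given below, assuming that the node positions are available as two-dimensional coordinates; the names are illustrative only.

final class Density {
    /** Equation 6.1: the inverse of the average distance from node i to all other nodes. */
    static double nodeDensity(int i, double[][] positions) {
        int n = positions.length;
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = positions[i][0] - positions[j][0];
            double dy = positions[i][1] - positions[j][1];
            sum += Math.sqrt(dx * dx + dy * dy);
        }
        return (n - 1) / sum;   // D_i = (n - 1) / sum of distances
    }
}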

In reality, nodes typically do not know the other nodes' positions, as required by Equation 6.1. Instead, a node may use a measure of the strength of received signals. This gives a less accurate measure, and an additional energy cost would be associated with the measurements.

The node density domain is represented by three discrete values. We refer to this parameter as Density.

Network Traffic Level The amount of network traffic may affect the optimal choice of values for parameters such as the back-off time and the report probability. Here we measure the network traffic level L_i for node i as follows:

L_i = \frac{1}{T} \sum_{x_k \in X} t_k \qquad (6.2)

Here, T is the length of the period over which the measurement is performed, X is the set of signals in the air during that period, and t_k is the duration of the signal x_k ∈ X.

In reality, two interfering signals would be accounted for as one single signal x_k in Equation 6.2, with t_k as the duration of the joint signal. In the simulations here, they are counted as two separate signals.

The network traffic level is quantized to two different values. We refer to this variable as Traffic.

6.2.2. The State Signal and Action Representation

The discrete state signal is computed from the values of the Traffic, IsHead and Density parameters as follows:


s = [\, 1 \;\; 2 \;\; 4 \,]
\begin{bmatrix} \text{Traffic} \\ \text{IsHead} \\ \text{Density} \end{bmatrix} \qquad (6.3)

This equation corresponds to Equation 5.3. Table 6.4 shows the states and their associated identifiers. The distinct action identifiers may be computed as in Equation 5.6 of the same section, as follows:

a = [\, 1 \;\; 3 \;\; 9 \;\; 27 \;\; 81 \;\; 243 \,]
\begin{bmatrix} \text{TxPower} \\ \text{BackOff} \\ \text{HeadProb} \\ \text{OnPeriod} \\ \text{OffPeriod} \\ \text{SendProb} \end{bmatrix} \qquad (6.4)

These equations may also be used for translating hand-coded algorithms into the policy representation used by the framework of this report.

ID       0    1    2    3    4    5    6    7    8    9    10   11
Traffic  low  high low  high low  high low  high low  high low  high
IsHead   low  low  high high low  low  high high low  low  high high
Density  low  low  low  low  mid  mid  mid  mid  high high high high

Table 6.4. The state identifiers and what they represent.
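As a worked example of Equation 6.3: a follower node (IsHead = 0) that observes high traffic (Traffic = 1) and mid density (Density = 1) gets the identifier s = 1 · 1 + 2 · 0 + 4 · 1 = 5, in agreement with Table 6.4.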

6.2.3. The QoS Measure

In Section 4.6 we postponed the definition of the QoS measure. The task is to monitor how objects move on a road. This implies that the fraction of detected objects alone does not describe how well the WSN performs. A measure of how well we are able to reconstruct the paths of the moving objects from the data reported to the central is also necessary. The measure used here is as follows:

QoS = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \min\left(\frac{t_{i,j} - t_{i,j-1}}{T},\, 1\right) \qquad (6.5)

Here, n is the number of objects during an episode and t_{i,0}, t_{i,1}, ..., t_{i,m_i} is an ordered list of time stamps for the reported detections of object i, with t_{i,0} defined as the time when object i started. The key idea of this formulation of QoS is that the WSN should monitor each object evenly over time. The importance of each piece of detection data increases with the time elapsed since the preceding detection, within a certain time interval. In Equation 6.5, one of two simultaneously reported detections has a value equal to 0, and any detection made T or more time units after the previous one has a value equal to 1. Summing the values implies that T defines an optimal


sampling interval, since the maximum QoS is then reached with as little effort as possible. Finally, the average of these sums is taken over all objects.
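A minimal Java sketch of Equation 6.5 follows, under the assumption that the reported detection times of each object are available in increasing order together with the object's start time; the data layout and names are hypothetical.

final class QoSMeasure {
    /** detections[i] holds the reported detection times of object i (t_{i,1}..t_{i,m_i}),
     *  startTime[i] is the time the object entered the scenario (t_{i,0}),
     *  and T is the optimal sampling interval. */
    static double qos(double[][] detections, double[] startTime, double T) {
        int n = detections.length;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            double previous = startTime[i];
            double sum = 0.0;
            for (double t : detections[i]) {
                sum += Math.min((t - previous) / T, 1.0);   // each detection is worth at most 1
                previous = t;
            }
            total += sum;
        }
        return total / n;   // average over all objects
    }
}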

6.2.4. Measure Normalization

Table 6.5 shows the values used for normalization of the measures. The '-' and '+' columns correspond to q− and q+ in Equation 4.6. The choice of these values is expected to affect how the optimization proceeds.

Measure    -       +
QoS        2       5
µ          0.07    0.05
σ          0.2     0

Table 6.5. The worst acceptable (-) values and the values considered optimal (+) for the measures used.


Chapter 7

Experiments and Results

In this chapter, we present the results from four different experiments. The experiments are based on the methods and the scenario presented in Chapters 4 and 6, respectively, and are carried out using the simulation model presented in Chapter 3. To make the results comparable, the same performance measure is used for all experiments; it lies in the range [0, 1], as described in Section 4.6. The performance of the optimal combination of policy and quantization values is not known, and may be less than 1.

The objective of the experiments is to evaluate different approaches to finding policies and quantization values that, combined, yield optimal performance.

In the experiments, each episode is a fresh set-up of the scenario being modeled, with its own random values. In this way, the network's behavior is evaluated for a wider range of similar configurations and not only for one particular configuration.

In the first experiment, manually set quantization values are used to learn a policy with RL. The second experiment uses a GA to adjust the quantization values to this policy. In the third experiment we first try a hand-coded policy with manually set quantization values, and then improve the performance by finding a new set of quantization values using the GA. In the fourth experiment, GA and RL are used simultaneously to find good quantization values.

7.1. Experiment 1 - Reinforcement Learning

In this experiment, 10,000 episodes of RL are applied with the manually set quantization values shown in Figure 7.1(a). In the figure, the values are normalized so that for each domain i the range [min_i, max_i] is transformed into the range [0, 1]. The learning process is repeated for 20 different runs, yielding a total of 20 different policies.

After every 50 episodes, the learned policy is evaluated for 10 episodes. Figure 7.2 shows the average performance for each of these evaluation periods, averaged over the 20 different runs. The curve seems to have converged after around 5,000 episodes of learning.


Figure 7.1. Properties of the quantization values used in Experiment 1: (a) the normalized, manually set quantization values; (b) the visits to each state in percent.

Figure 7.2. Average performance during learning in Experiment 1.

Table 7.1 shows a randomly picked policy, learned in one of the 20 runs. Here, ID is the state identifier. In Figure 7.1(b), the visits made to each state are displayed in percent. Clearly, even states (i.e. states with even identifiers) are visited rarely. Common to all even states is that the Traffic parameter is low.

To make the visits more evenly distributed, the quantization values for Traffic would have to be changed. It is not certain, however, that the quantization values yielding the most evenly distributed visits are also the best basis for applying RL.

The best of the 20 policies found is selected; further evaluation of it yields an average performance of 0.45.


ID   TxPower  BackOff  HeadProb  OnPeriod  OffPeriod  SendProb
0    low      high     low       high      high       high
1    high     high     low       high      high       high
2    high     low      mid       low       mid        high
3    high     high     mid       high      low        low
4    high     low      low       mid       high       high
5    mid      high     high      low       mid        mid
6    high     mid      low       high      high       high
7    high     low      high      low       low        high
8    low      low      low       mid       low        high
9    high     mid      mid       low       high       high
10   high     low      high      high      mid        high
11   high     high     low       low       mid        mid

Table 7.1. A policy learned in Experiment 1.


7.2. Experiment 2 - Improving Performance using GA

We now take the policy learned in the previous experiment and adjust the quantization values to it, using a GA as proposed in Section 4.10. The fitness is computed as the average performance over 5 evaluation episodes. The GA parameter settings are listed in Table 7.2.

Parameter             Value
Population Size       500
Mutation Rate         0.01
Crossover Ratio       0.9
Evaluation Episodes   5
Generations           60

Table 7.2. Parameters for the GA used for adjusting the manually set values.

Figure 7.3 shows the best and the average fitness for each of the 60 generations. These results are optimistic, since so few evaluation episodes are used. Too few evaluation episodes may, for example, lead to the selection of quantization value sets that yield good results only under special circumstances. To obtain more reliable results, the best individual is evaluated further. This yields an average performance of 0.45.

Figure 7.3. The best (above) and average (below) individual performance for each generation of the GA of Experiment 2.

The set of quantization values yielding the best results is shown in Figure 7.4(a). Figure 7.4(b) shows the visits to each state in percent. Here, odd states are hardly


visited at all. It is surprising that individuals selected by the GA have this property, since the policy to which the GA was applied was better explored for odd states.

Figure 7.4. Properties of the quantization values found in Experiment 2 for the policy learned in Experiment 1: (a) the quantization values found by the GA, normalized; (b) the visits to each state in percent with the new quantization values.


7.3. Experiment 3 - A Hand-Coded Policy

In this experiment, the hand-coded algorithm shown in Table 7.3 is evaluated. It is easily translated into the policy representation used above, using the vector multiplications of Equations 6.3 and 6.4. The resulting policy is shown in Table 7.4.

SendProb ← high
BackOff ← mid
if Density = low then
    OnPeriod ← high
    OffPeriod ← low
    HeadProb ← high
    TxPower ← high
else if Density = mid then
    OnPeriod ← mid
    OffPeriod ← mid
    HeadProb ← mid
    TxPower ← mid
else if Density = high then
    OnPeriod ← low
    OffPeriod ← high
    HeadProb ← low
    TxPower ← low
end if

Table 7.3. A hand-coded algorithm.

Figure 7.5(a) shows the result of evaluating this policy for 100 episodes in combination with the manually set quantization values of Experiment 1.

The results in Figure 7.5(a) are poor. Only for isolated episodes are all measures above the worst acceptable limits. To achieve better results for this policy, we use a GA to adjust the quantization values in the same way as we did for the learned policy. The GA parameter values used are again those of Table 7.2.

The results are shown in Figure 7.5(b). As in the previous experiment, these results are optimistic. Further evaluation gives an average performance of 0.58. Clearly, the hand-coded policy, in combination with quantization values found by the GA, outperforms the policies learned with RL.

The fact that a policy equivalent to the hand-coded


ID   TxPower  BackOff  HeadProb  OnPeriod  OffPeriod  SendProb
0    high     mid      high      high      low        high
1    high     mid      high      high      low        high
2    high     mid      high      high      low        high
3    high     mid      high      high      low        high
4    mid      mid      mid       mid       mid        high
5    mid      mid      mid       mid       mid        high
6    mid      mid      mid       mid       mid        high
7    mid      mid      mid       mid       mid        high
8    low      mid      low       low       high       high
9    low      mid      low       low       high       high
10   low      mid      low       low       high       high
11   low      mid      low       low       high       high

Table 7.4. The algorithm in Table 7.3 in policy representation.

Figure 7.5. Performances before and after adjusting the quantization values to the hand-coded policy in Experiment 3: (a) performance of the hand-coded policy with the manually set quantization values; (b) the best (above) and average (below) individual performance for each generation.

policy could have been learned leads to the conclusion that either the RL gets stuck in local minima or the quantization values to which RL is applied are badly chosen.

Figures 7.6(a) and 7.6(b) show the normalized quantization values and the state visits, respectively. From Figure 7.6(b) it is evident that states for which Traffic is high or Density is low are hardly visited at all.


Figure 7.6. Properties of the learned quantization values in Experiment 3: (a) the quantization values found by the GA for the hand-coded policy; (b) the state visits in percent.


7.4. Experiment 4 - Simultaneous GA and RL

Learning policies for predetermined quantization values has not resulted in optimal behavior. In this experiment we let a GA create the quantization values and apply 300 episodes of RL to each of them. The fitness is computed from the average performance over 50 evaluation episodes of the learned policies.

The number of learning episodes is set very low since this is a very time-consuming approach. This may cause the GA to select quantization values that quickly give good results. Loosely speaking, a set of quantization values resulting in only a few states being visited may make it easier for RL to find reasonably well-performing policies than if the policy needs to take more states into account.

Figure 7.7(a) shows the best and the average individual performance for each of the 10 generations. The best individual was chosen and further RL was applied to it. The results are shown in Figure 7.7(b). The policy learned is shown in Table 7.5. Here, a '-' means that no rule exists for the corresponding state, because it was never visited during learning.

Further evaluation results in an average performance of 0.55. This is still not as good as the performance of the hand-coded policy with adjusted quantization values, but it is better than those of Experiments 1 and 2.

Figure 7.8(a) shows the quantization values found and Figure 7.8(b) shows the visits to each state.


Figure 7.7. Performances during learning in Experiment 4: (a) the maximum and mean fitnesses of the individuals of each generation; (b) performance during RL applied to the quantization values found by the GA.


ID   TxPower  BackOff  HeadProb  OnPeriod  OffPeriod  SendProb
0    high     mid      mid       high      high       mid
1    high     high     high      mid       high       low
2    mid      mid      mid       high      high       mid
3    -        -        -         -         -          -
4    low      low      mid       mid       mid        high
5    high     mid      low       mid       high       low
6    mid      low      mid       low       high       high
7    -        -        -         -         -          -
8    mid      mid      mid       low       mid        mid
9    high     low      mid       low       high       low
10   high     high     high      mid       mid        low
11   -        -        -         -         -          -

Table 7.5. The policy found using RL with the quantization values found in Experiment 4.

Figure 7.8. Properties of the quantization values that yielded the best performance in Experiment 4: (a) the quantization values found; (b) the state visits in percent.


7.5. Summary

Figure 7.9 summarizes the results from all experiments. Experiments 1 and 2 did not yield very good results. This is not surprising, since the policy learned in Experiment 1 was based on badly chosen quantization values.

Figure 7.9. A comparison of the measured performance of the four experiments.

Experiment 3 makes use of a very simple but sound policy. Adjusting the quantization values to this policy yields the best result of all the experiments.

Experiment 4 is expected to be the most accurate method for finding good combinations of quantization values and policies, but it is a very slow method. To fully evaluate it, more experiments with different GA parameter settings should be carried out.


Chapter 8

Conclusions

In this thesis, traditional machine learning methods have been applied to the WSN optimization problem. It has been shown that, even though the problem is not a Markov decision process, RL approaches do work to some extent.

However, the use of a hand-coded policy in combination with a set of quantization values selected by a GA has been seen to yield better results than policies learned using GA combined with RL. The WSN optimization problem can, however, be much more complex than the instance of the problem we have solved here, e.g. if data classification and aggregation are used. In such cases, well-performing hand-coded policies are expected to be more difficult to construct.

The major difficulty in finding good policies is the dependency between policies and quantization values, which makes it difficult to compute them separately. On the other hand, simultaneous computation using the approach of Experiment 4 is slow. In that experiment, the time spent on evaluation was therefore reduced to such an extent that it affected the selection mechanism negatively. Better results are expected to be achievable by increasing the time spent on evaluation.

The use of a GA in combination with RL is one way to deal with continuous actions. Further studies of different approaches to the general case of continuous actions should be made. Enhancements of the GA methods used here may also be possible. One way may be to let the population size decrease with the generations. In this way, greater population sizes can be used in the first few generations, and more learning episodes become affordable in the simultaneous approach of Experiment 4.

The results of this thesis are theoretical. Values for hardware parameters and physical constants could be set more carefully for more accurate results. It is expected that changes in these parameters principally affect the properties of the quantization values and the policies learned, and not how well the presented methods work.

To fully evaluate the learned policies, they should be tested in a real WSN. For this, a policy engine is needed to decode the policies, as well as an engine for determining what state the agents are in.


Bibliography

[1] Banks, Carson, Nelson, Nicol. Discrete-Event System Simulation, 3rd Edition. 2001.

[2] Culler, Hong. Wireless Sensor Networks. Communications of the ACM, 2004.

[3] Enz, El-Hoiydi, Decotignie, Peiris. WiseNET: An Ultralow-Power Wireless Sensor Network Solution. IEEE Computer Society, 2004.

[4] Carle, Simplot-Ryl. Energy-Efficient Area Monitoring for Sensor Networks. IEEE Computer Society, 2004.

[5] Boulis, Ganeriwal, Srivastava. Aggregation in Sensor Networks: An Energy-Accuracy Trade-Off. Ad Hoc Networks, 2003.

[6] Schiller, Liers, Ritter, Winter, Voigt. ScatterWeb - Low Power Sensor Nodes and Energy Aware Routing. Proceedings of the 38th Hawaii International Conference on System Science, 2005.

[7] Heinzelman, Chandrakasan, Balakrishnan. Energy-Efficient Communication Protocol for Wireless Microsensor Networks. Proceedings of the 33rd Hawaii International Conference on System Science, 2000.

[8] Akyildiz, Su, Cayirci. Wireless Sensor Networks: A Survey. Computer Networks, 2002.

[9] Zhou, He, Krishnamurthy, Stankovic. Impact of Radio Irregularity on Wireless Sensor Networks. MobiSYS, 2004.

[10] Römer, Mattern. The Design Space of Wireless Sensor Networks. IEEE Wireless Communications, 2004.

[11] Govindan, Hellerstein, Hong, Madden, Franklin, Shenker. The Sensor Network as a Database. citeseer.ist.psu.edu/govindan02sensor.html, 2002.

[12] Halliday, Resnick, Walker. Fundamentals of Physics. Wiley, 6th ed., 2001.


[13] Mitchell. Machine Learning. McGraw-Hill, 1997.

[14] Sutton, Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[15] Beasley, Bull, Martin. An Overview of Genetic Algorithms, Part 1. University Computing, 1993.

[16] Beasley, Bull, Martin. An Overview of Genetic Algorithms, Part 2. University Computing, 1993.

[17] Panait, Luke. Cooperative Multi-Agent Learning: The State of the Art. Technical Report GMU-CS-TR2003-1, Department of Computer Science, George Mason University.

[18] Dai, Wu. Distributed Dominant Pruning in Ad Hoc Networks. Proc. IEEE 2003 Int'l Conf. Communications.

[19] Data sheet for the TR1001 radio unit. RFM, 2003.

[20] Hu, Wellman. Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm. University of Michigan.

[21] Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning. Proceedings of the Eleventh International Conference on Machine Learning, 1994.

[22] Dousse, Mannersalo, Thiran. Latency of Wireless Sensor Networks with Uncoordinated Power Saving Mechanisms. MobiHoc '04.

[23] Chan, Perrig. ACE: An Emergent Algorithm for Highly Uniform Cluster Formation. EWSN 2004.

[24] Andersen, Rappaport, Yoshida. Propagation Measurements and Models for Wireless Communications Channels. IEEE Communications Magazine, January 1995.

[25] Katz. CS294-7: Radio Propagation. University of California at Berkeley.
