2009 IEEE Congress on Evolutionary Computation (CEC 2009), Trondheim, Norway
Fault Tolerance in Distributed Genetic Algorithms with Tree
Topologies
Yiyuan Gong
Alex S. Fukunaga
Abstract— We investigate the effects of communication failures in grid-based, distributed genetic algorithms with various topologies. We evaluated the performance behavior of distributed GAs under varying levels of persistent communication failures, using the sorting network problem as a benchmark application. We find that a distributed GA with a larger population size is less affected by a low communication failure rate. However, when the population size is small, the effect of a low communication failure rate on the performance of the distributed GA varies with the topology. For all the tree topologies we investigated, a significant performance degradation is observed when communication failures occur extremely frequently. However, even in these extreme cases, we show that simple retry/reroute protocols for recovering from communication failures are sufficient to recover most of the performance.
I. INTRODUCTION
Grid computing and peer-to-peer (P2P) environments are
becoming increasingly mature, enabling many users to access
vast computational resources. Genetic algorithms and other
evolutionary computation methods can benefit significantly
from this trend, since evolutionary algorithms can be paral-
lelized and distributed straightforwardly.
One potential problem with genetic algorithms distributed
across many processors is the possibility of system faults.
Consider a distributed GA running on a P2P system, such
as distributed.net or SETI@home. Communications con-
nectivity in a P2P system is inherently unreliable. Nodes
can become disconnected and unavailable at any time. For
example, client computers in a P2P network (which are not
under the control of the GA implementor) can be taken
offline or rebooted at arbitrary times. Communications are
extremely unreliable in the case of clients running on mobile
computers. Therefore, fault-tolerance is a major concern in
P2P systems [1]. Although grid environments tend to be more
stable than P2P environments, reliability and fault-tolerance
are also issues in grid research [2].
In this paper, we consider fault-tolerance in distributed
genetic algorithms. We study what happens to moderate-
sized (15 process) and large-sized (127 process) distributed
GA performance behavior when communications failures are
present. We consider two types of communication failure
rates (low and high), propose simple resend and reroute
protocols for handling faults, and show that distributed GAs
Yiyuan Gong is with the College of Mathematics and Computer Sciences, Fuzhou University, Fuzhou, Fujian, China; email: [email protected]
Alex Fukunaga is with the Global Edge Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan; email: [email protected].
can maintain their performance behavior even in extremely
unreliable environments.
The paper is structured as follows. First, we describe the
distributed GA model that we evaluate (Section II). Then,
we describe the fault model (Section III), and then discuss
simple strategies for incorporating fault tolerance into our
distributed GA model (Section IV). We present experimental
results using our fault model and fault tolerance mechanisms
in Section V, and conclude with a discussion of our results
and directions for future work in Section VI.
II. DISTRIBUTED GA MODEL
We present an island-model, distributed GA (DGA), in
which migration is performed through a logical tree topology.
The logical topology is established in advance on a dis-
tributed computing platform. Logical communication links
are chosen based on considerations such as the communi-
cation delay between two nodes, so that nodes which are
adjacent in the logical topology can communicate with small
overhead.
In this paper, we focus on populations connected using
tree topologies. Every node except the root knows its parent
and every node except the leaves knows its children. Each
computing node generates its own initial chromosome set
as a subpopulation, and performs genetic operations on its
subpopulation independently.
Figure 1 is an example of a spanning tree. The arrowed
edges are chosen for the spanning tree.
Fig. 1. Example of spanning tree
Migration is performed from a child to its parent when the
migration condition is satisfied. Migration is initiated when
a new, best solution within a population (i.e., a local elite)
is found at some node. This new local elite is sent to the
node’s parent. When a node receives a chromosome sent
from its child, the node replaces the worst chromosome in
its population with the migrated one.
The procedure for each computing node is stated as
follows:
procedure DGA;
begin
  Generation := 0;
  initialize a subpopulation P;
  while (termination condition is false)
  begin
    while (new population is incomplete)
      select two parents;
      apply crossover and mutation;
      evaluate the offspring;
    end while;
    check the communication buffer;
    if (a chromosome was received)
      replace the worst chromosome in P
        with the received one;
    endif;
    if (the best chromosome was updated)
      send the best chromosome to the parent;
    endif;
    Generation := Generation + 1;
  end while;
end
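As a concrete illustration, the per-node procedure above can be sketched in Python. This is only a sketch under assumptions: the `node` object and its operators (`crossover`, `mutate`, `recv_nowait`, `send_to_parent`) are hypothetical placeholders standing in for the paper's actual genetic operators and communication primitives.

```python
import random

def dga_node(pop, generations, fitness, node):
    """One island of the distributed GA (sketch; `node` supplies hypothetical
    genetic operators and non-blocking communication primitives)."""
    best = max(fitness(c) for c in pop)
    for _ in range(generations):
        # Build a new subpopulation by selection, crossover, and mutation.
        new_pop = []
        while len(new_pop) < len(pop):
            p1, p2 = random.sample(pop, 2)       # select two parents
            child = node.mutate(node.crossover(p1, p2))
            new_pop.append(child)                # evaluation happens via fitness()
        pop = new_pop
        # Non-blocking check of the communication buffer for a migrant.
        migrant = node.recv_nowait()
        if migrant is not None:
            pop[pop.index(min(pop, key=fitness))] = migrant  # replace the worst
        # Migrate a newly found local elite to the parent node.
        elite = max(pop, key=fitness)
        if fitness(elite) > best:
            best = fitness(elite)
            node.send_to_parent(elite)
    return pop
```

Note that the migrant check and the elite migration are both non-blocking, matching the asynchronous communication model described in Section III.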
Figure 2 shows the four tree topologies with 7 nodes: star,
line, sided binary tree, and balanced binary tree. These four
topologies with n nodes have the properties shown in Table
I.
TABLE I
PROPERTIES OF TOPOLOGIES [3]

Topology   Depth            Leaves       Descendant
star       1                n − 1        (n − 1)/n
balanced   log2(n+1) − 1    (n + 1)/2    ((n + 1) log2(n+1) − 2n)/n
sided      (n − 1)/2        (n + 1)/2    (n − 1)(n + 1)/4n
line       n − 1            1            (n − 1)/2
In the first column of Table I, ‘star’, ‘balanced’, ‘sided’,
and ‘line’ correspond to the star, balanced binary tree, sided
binary tree, and line topologies, respectively. In the table, the
’Depth’ column shows the length of the longest path between
the root and a leaf node. The line topology has the longest
depth while the star has the shortest depth. The ’Leaves’
column indicates the number of leaf nodes (nodes without
children). Note that leaf nodes do not receive migrated
chromosomes. The star topology has the largest number
(n − 1) of leaf nodes. The ‘Descendant’ column shows
the average number of descendant nodes in the topology.
Migrated chromosomes are generated at descendant nodes.
The star and line are well-known topologies which represent
opposite ends of a spectrum (line has maximal depth, star has
maximal number of leaves). The balanced and sided binary
trees are representative of intermediate topologies between
these extremes.
Fig. 2. Migration topologies
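The closed forms in Table I can be checked numerically. The sketch below is an illustration (not code from the paper): it encodes each 7-node topology as a parent array, an assumed representation, and computes depth, leaf count, and average descendant count directly.

```python
def properties(parent):
    """Depth, leaf count, and average descendant count of a rooted tree.

    `parent[i]` is the parent of node i; parent[0] = -1 marks the root.
    (Parent-array encoding is an assumption for this illustration.)"""
    n = len(parent)
    children = {i: [] for i in range(n)}
    for i in range(1, n):
        children[parent[i]].append(i)

    def depth(v):
        return 0 if not children[v] else 1 + max(depth(c) for c in children[v])

    def descendants(v):
        return sum(1 + descendants(c) for c in children[v])

    leaves = sum(1 for v in range(n) if not children[v])
    avg_desc = sum(descendants(v) for v in range(n)) / n
    return depth(0), leaves, avg_desc

n = 7
star     = [-1] + [0] * (n - 1)        # all nodes are children of the root
line     = [-1] + list(range(n - 1))   # chain p1 -> p2 -> ... -> pn
balanced = [-1, 0, 0, 1, 1, 2, 2]      # complete binary tree
sided    = [-1, 0, 0, 1, 1, 3, 3]      # one leaf and one subtree per level
```

For n = 7 these yield, e.g., average descendant counts of 6/7 (star), 10/7 (balanced), 12/7 (sided), and 3 (line), matching the formulas in Table I.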
III. FAULTS IN DISTRIBUTED GENETIC ALGORITHMS
There are two major failure modes in a distributed GA:
• It is possible for communications among nodes to fail,
due to unreliable communications channels or other
reasons.
• It is possible for a node to become unavailable for
computation due to a crash, reboot, or shutdown.
The latter (node unavailability) implies the former, since
it is not possible for a node to communicate when it has
been shut down or is otherwise unavailable. On the other
hand, communications failure to or from a node does not
necessarily mean that the node stops computation. In this
paper, we focus on communications failures, and we assume
that computations continue at all nodes even if some com-
munications fail.
Communications failures can be characterized as either
transient or persistent. A transient failure is a short-term
(several seconds, up to about a minute) interruption of
communications. Transient failures are extremely common in
distributed systems, and it is sufficient to rely on low-level
networking infrastructure to handle short, transient failures.
For example, TCP/IP implements mechanisms to ensure that
packets are eventually received by the destination. On the
other hand, a persistent failure is a longer-term event where
a communications link fails for a more noticeable amount of
time (say, more than a minute). For long, persistent failures,
some mechanism to handle the failure may be required at
the application (GA) level. Therefore, this paper focuses on
handling persistent communications failures.
We assume that communications attempts between pro-
cesses in a GA are asynchronous (non-blocking). That is,
when the sender decides to send a message, it calls an
asynchronous method for sending a message, and then con-
tinues performing other computations. In contrast, a blocking
send would mean that the sender initiates a message trans-
fer to the receiver, and waits (without doing any further
computation) until the proper acknowledgment of receipt
by the receiver is detected by the sender. In general, non-
blocking communication is more efficient; this is especially
true in the presence of extremely unreliable communications,
since blocking communication would leave processes idle
while waiting for communications to succeed on an unreli-
able channel.
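The non-blocking communication style described above can be mimicked in-process with a queue; this is only an illustrative sketch (the `channel` queue stands in for a network link, an assumption not from the paper): the sender deposits a migrant and returns immediately, and the receiver polls without blocking.

```python
import queue

channel = queue.Queue()          # stands in for the network link

def send_async(chan, msg):
    """Non-blocking send: deposit the message and return immediately."""
    chan.put_nowait(msg)

def recv_nowait(chan):
    """Non-blocking receive: return a waiting message, or None if empty."""
    try:
        return chan.get_nowait()
    except queue.Empty:
        return None
```

A sender never waits on the receiver, so a failed or slow link costs it no computation time, which is exactly the property the text argues for.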
In our experiments, we model a failure event as the
tuple (source, destination, start, end), which denotes that
all communications attempts from source to destination in
the time interval beginning with start and ending with end
will fail. We assume that a failure means that the sender at
the source attempts to send a message to the destination,
but for some reason, does not receive an appropriate ac-
knowledgment of receipt. We further assume that the failure
involves a temporary termination of connection (e.g., loss of
TCP connection) such that it is not possible for a lower-
level protocol such as TCP to automatically recover the
communication. This type of worst-case loss can happen in
unstable environments such as P2P networks.
In the distributed GA models described in Section II, a
failure has the effect of isolating groups of processors and
inducing a topology which differs somewhat from the ideal
fault-free topology. For example, given the line topology
p1 → p2 → p3 → p4 → p5, if communication between
p2 and p3 is interrupted, it is similar to running two lines
in parallel, p1 → p2 and p3 → p4 → p5, until the p2 → p3
communications link is restored. In the most extreme case
possible, when all communications fail among k processors,
the result would be equivalent to running k independent
populations in parallel. This might lead to a degradation in
performance. Previous work has shown that in parallel GAs
with sub-populations allocated to different processors, some
amount of communications/migration is desirable and leads
to improved performance (c.f. [4]), although the optimal
migration rate will depend on the application.
IV. SIMPLE FAULT TOLERANCE MECHANISMS FOR
DISTRIBUTED GAS
One approach to fault-tolerance is to not worry about
communications failures at all at the application level. As
mentioned above, in the worst case, this approach might
lead to the distributed GA behaving like a set of inde-
pendent, smaller populations. Although performance might
degrade, the amount of degradation will depend on the
application, and furthermore, the GA still functions correctly
as an optimization algorithm, as long as communication
is implemented using nonblocking primitives (if blocking
communications primitives are used, then the system may
enter a deadlocked state).
Another approach is to use population structures and com-
munications topologies which are fault tolerant. In Section
II, we described several distributed GA topologies: star, line,
sided binary tree, balanced binary tree. In situations where
we are able to choose a particular topology on which to
deploy a distributed GA, then knowledge about the relative
fault-tolerance of various topologies can guide us to select an
appropriate topology (of course, in this case, we must also
consider other factors, besides fault tolerance).
The above two strategies for dealing with failure are
entirely passive, and nothing is done during the actual
GA run to explicitly deal with failures. However, if it is
possible to implement some simple, application (GA) level
fault tolerance mechanisms so that the impact of faults is
minimized, then this is clearly worth investigating.
One very simple failure recovery method is for the sender
to periodically resend a message to the intended receiver
until the sender receives an explicit acknowledgment of
receipt from the receiver. (TCP already implements this, but
as mentioned above, we assume the failure involves a con-
nection termination, so we cannot rely on the lower-level
protocols.)
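A minimal sketch of this resend loop, assuming hypothetical channel primitives `send` and `recv_ack` (these names are placeholders, not the paper's API):

```python
import time

def send_with_retry(send, recv_ack, msg, interval=1.0, max_tries=None):
    """Resend `msg` until an acknowledgment of receipt arrives.

    `send(msg)` transmits; `recv_ack()` returns True once an ack for the
    message has been received.  Both are assumed placeholders."""
    tries = 0
    while max_tries is None or tries < max_tries:
        send(msg)                      # (re)transmit the migrant
        tries += 1
        if recv_ack():                 # acknowledgment arrived: done
            return True
        time.sleep(interval)          # back off before the next attempt
    return False                       # gave up after max_tries attempts
```

In a real deployment the retry interval would be tuned to the expected outage duration (dmin to dmax in the failure model of Section V).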
Another simple recovery method is to reroute around a bad
communications link. If a sender A does not receive an
acknowledgment from receiver B within a certain time limit,
A can reroute the communication and send the data to
another node, specifically, the next available node in the
communications tree, skipping links that are broken. For
example, assume that we have a line topology p1 → p2 →
p3 → p4 → p5. If the communications link (p2 → p3) is
broken, then p2 can send to p4 instead of p3.
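The rerouting rule can be sketched as a walk up the ancestor path; `link_ok` below is a hypothetical predicate (not from the paper) telling whether the link to a given node currently works.

```python
def reroute_target(path_to_root, link_ok):
    """First reachable node on the path toward the root.

    `path_to_root` lists ancestors nearest-first (parent, grandparent, ...);
    `link_ok(node)` is an assumed predicate for a working link."""
    for node in path_to_root:
        if link_ok(node):
            return node    # skip broken links; send to the next node up
    return None            # no ancestor reachable: keep the migrant local
```

For the line example above, p2's ancestor path is [p3, p4, p5]; with the p2 → p3 link down, the walk skips p3 and returns p4.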
The resend protocol is not a GA-specific technique – as
mentioned above, it is usually implemented at lower levels
of the communication stack (TCP). However, when TCP
connections are terminated entirely, resending needs to be
done at the GA level. The rerouting protocol is a GA-specific
technique. In most applications, we cannot reroute the result
of a computation to another node, since that would usually
produce incorrect results. However, for distributed GAs, this
is a natural idea.
V. EXPERIMENTAL EVALUATION
A. Sorting Network Problem
We use the minimal sorting network problem as a bench-
mark. This is the problem of designing a circuit (network)
with the minimal number of comparators, such that given any
input vector of n inputs (numbers), the circuit sorts the input
in order of size. This is a classical problem in theoretical
computer science [5]. Design of a sorting network with the
minimal number of comparators is a very difficult problem,
and evolutionary approaches for automatically designing a
minimal size sorting network have been investigated by
a number of researchers in the evolutionary computation
community [6], [7], [8], [9]. We use the same genetic
representation as Graham, Masum and Oppacher [9], and
we use the 14-input problem as a benchmark.
The fitness score for the 14-input sorting network problem
ranges between 0 and 16384 (i.e., from no inputs sorted
correctly to all 2^14 inputs sorted correctly), where higher
scores are better. The 14-input sorting network problem
is very difficult and time consuming, because evaluating a
single candidate individual (sorting network) requires exe-
cuting the network on 2^14 test cases. Sorting network
problems require
a tremendous amount of computation, and can therefore
benefit significantly from parallel GAs. Because the fitness
function requires so much computation, individual evaluation
is the bottleneck for a distributed GA for this problem, and
communication costs are negligible compared to fitness com-
putations. Therefore, the sorting network problem provides
a realistic benchmark for distributed GAs.
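The fitness computation can be sketched as follows. The list-of-pairs encoding of a network is an assumption for illustration (the paper uses the representation of [9]); the counting over binary vectors follows the zero-one principle.

```python
from itertools import product

def fitness(network, n):
    """Count how many of the 2**n binary inputs the comparator network sorts.

    A network is encoded (by assumption) as a list of (i, j) wire pairs
    with i < j; each comparator swaps positions i and j when they are out
    of order.  By the zero-one principle, a network that sorts every binary
    vector sorts all inputs, so a perfect n-input network scores 2**n
    (16384 for n = 14)."""
    score = 0
    for bits in product((0, 1), repeat=n):
        v = list(bits)
        for i, j in network:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
        if all(v[k] <= v[k + 1] for k in range(n - 1)):
            score += 1
    return score
```

Each evaluation executes the network on all 2^n vectors, which is why, at n = 14, fitness evaluation dominates the running time and communication costs become negligible.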
B. Simulation of unreliable distributed algorithms on a reli-
able cluster
In order to perform controlled experiments under varying
conditions, we do not run our experiment on an actual, peer-
to-peer or grid system, since that makes it difficult to re-
peat experimental conditions. Instead, we simulate unreliable
distributed genetic algorithms by running a parallel genetic
algorithm on a large-scale computing cluster, and injecting
simulated faults into this simulated, distributed GA. The
experiments were conducted on the Tokyo Institute of Tech-
nology TSUBASA cluster, which consists of 90 Sun Blade
X6250 nodes connected by an InfiniBand interconnect, where
each node consists of two quad-core Xeon E5440(2.83GHz)
processors (8 cores per node, 720 total CPU cores). Unlike
a grid or peer-to-peer system, clusters such as TSUBASA
are typically located in a data center and carefully managed
so that reliability is not a large issue for users. Because the
processors in the cluster are connected with a highly reliable,
dedicated interconnect network, communications failures are
not a practical issue. Also, any communication topology can
be specified. By implementing a simulated fault injection
mechanism into the communications code used by our “dis-
tributed” GA, we can simulate an unreliable, distributed GA
on a high-performance cluster, enabling us to perform large-
scale experiments using a difficult benchmark such as the
14-input sorting net problem in a reasonable amount of time.
The simulated failure model works as follows: when a
message is sent from s to r, the communication link between
s and r fails with probability pf . If the link fails, the message
is not received by r, and the communication link s → r
continues to fail for some random duration between dmin
and dmax seconds, and all messages sent from s to r during
that time will fail.
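The simulated failure model can be sketched as below; the parameters pf, dmin, and dmax follow the text, while the class structure itself is an assumption for illustration.

```python
import random

class FailureInjector:
    """Sketch of the persistent-failure model (class structure is assumed).

    Each send from s to r fails with probability pf and, on failure, takes
    the (s, r) link down for a random duration in [dmin, dmax] seconds;
    every send attempted during that interval also fails."""

    def __init__(self, pf, dmin, dmax, rng=None):
        self.pf, self.dmin, self.dmax = pf, dmin, dmax
        self.down_until = {}          # (s, r) -> time the current outage ends
        self.rng = rng or random.Random()

    def send_ok(self, s, r, t):
        """Return True iff a message sent from s to r at time t gets through."""
        if t < self.down_until.get((s, r), 0.0):
            return False              # link is inside a failure interval
        if self.rng.random() < self.pf:
            self.down_until[(s, r)] = t + self.rng.uniform(self.dmin, self.dmax)
            return False              # this send triggers a new outage
        return True
```

Because outages are tracked per directed link (s, r), a failure of s → r does not affect traffic on other links, matching the fault model of Section III.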
C. Experimental parameters
In all of our experiments, we run (simulated) distributed
genetic algorithms for the 14-input sorting network problem.
The mutation rate was 0.4, and the crossover rate was 0.1.
The migration policy described in Section II is used, so that
communications attempts occur when a new elite member
has been found on a processor.
First we tested a moderate-sized distributed GA with 15
nodes, configured to use one of the four communication
topologies described in Section II, with a population of 1000
per processor, executed for 800 generations per run.
We then extended the experiments to a large-scale, distributed
GA with 127 nodes and a population of 200 per node.
We considered failure rates pf ∈ {0.05, 0.5} (failures are
assumed to occur with uniform probability). This models a
range of environments, from very unreliable (pf = 0.05)
to extremely unreliable (pf = 0.5). We use these high
failure rates because lower rates are not interesting (they
result in behaviors which are indistinguishable from having
no failures), and we wish to see how far we can push the
fault-tolerance of distributed genetic algorithms.
D. Results
Figures 3-16 show the results of the experiments. In all
of these figures, each curve represents the solution curve at
the root of the corresponding topology (see the caption of
each figure), averaged over 30 runs. The “retry” curve repre-
sents the root solution curve when using the resend failure
recovery method. The “reroute” curve represents the root
solution curve when using the reroute failure recovery
method. The “no communication failure” curve represents
the case where no communication failures occur, and the
“ignore communication failure” curve shows the result when
no recovery method is used when a communication failure
occurs (in other words, the failed message is lost/ignored).
We depict the results of the four topologies at the two failure
rates mentioned above.
1) Comparison of Fault-tolerance of topologies: Figures
3-10 show the results for the four topologies when the
communication failure rate is 0.05. We observed that the
four curves “retry”, “reroute”, “no communication failure”
and “ignore communication failure” are very close to each
other for all topologies when the number of nodes is 15
and the population size is 1000. This means that with a
larger population size and a smaller number of nodes, com-
munication failures have very little impact on the perfor-
mance of the distributed GA when the failure rate is low.
The two failure recovery methods “retry” and “reroute” likewise do not
show obvious effects on the performance.

Fig. 3. Star with 15 nodes, failure rate 0.05, population size 1000

However, when the number of nodes is increased to 127
and the population size is decreased to 200, we find that
the impact of communication failures on the distributed GA
depends on the topology: failures have some impact on the
sided and line topologies, but little impact on the star and
balanced binary topologies. This is because deeper topologies
have more descendant nodes than shallower ones, i.e., the
sided and line topologies have more cooperating nodes than
the star and balanced binary topologies. When communica-
tion failures occur, more nodes are affected in the sided and
line topologies than in the star and balanced topologies.
2) Comparison of Fault-recovery strategies: Figures 11-
16 show the results when the communication failure rate is
0.5. Compared to Figures 3-10, this extremely high commu-
nication failure rate has an obvious impact on performance,
and the impact is more pronounced when the number of
nodes is increased to 127 and the population size is decreased
to 200. However, both “retry” and “reroute” recover most of
the performance lost to communication failures. From these
figures, we also observed that for any topology, “retry” and
“ignore communication failure” make slower initial progress
than “no communication failure” and “reroute”. However,
“retry” catches up to “no communication failure” and
“reroute” later. The performance of “reroute” is similar to
“no communication failure”. Although communication fail-
ures prevent elite solutions from migrating, the “reroute”
method ensures that the elite is communicated to some
ancestor, and the “retry” method resends as soon as commu-
nication is restored. Thus, both recovery methods eventually
catch up to the performance of a failure-free run.
In addition to the results shown here, we also ran experi-
ments with 127 nodes and a population of 50 per node. The
results, not shown here due to space limitations, were similar
to the results with 127 nodes and a population of 200 per
node.
VI. CONCLUSIONS
In this paper, we investigated the effect of communi-
cation failures on distributed genetic algorithms with tree
Fig. 4. Binary with 15 nodes, failure rate 0.05, population size 1000
Fig. 5. Side with 15 nodes, failure rate 0.05, population size 1000
Fig. 6. Line with 15 nodes, failure rate 0.05, population size 1000
Fig. 7. Star with 127 nodes, failure rate 0.05, population size 200
Fig. 8. Binary with 127 nodes, failure rate 0.05, population size 200
Fig. 9. Side with 127 nodes, failure rate 0.05, population size 200
Fig. 10. Line with 127 nodes, failure rate 0.05, population size 200
Fig. 11. Binary with 15 nodes, failure rate 0.5, population size 1000
Fig. 12. Side with 15 nodes, failure rate 0.5, population size 1000
Fig. 13. Line with 15 nodes, failure rate 0.5, population size 1000
Fig. 14. Binary with 127 nodes, failure rate 0.5, population size 200
Fig. 15. Side with 127 nodes, failure rate 0.5, population size 200
Fig. 16. Line with 127 nodes, failure rate 0.5, population size 200
topologies. Distributed GAs with varying levels of persistent
communication failures were evaluated. We find that different
failure rates impact distributed GAs differently. Distributed
GAs with few nodes and a larger population size are less
affected by a low communication failure rate. However,
with a larger number of nodes and a smaller population,
the effect of a low communication failure rate on the per-
formance of the distributed GA varies with the topology.
Communication failures impact deeper topologies more than
shallower topologies. Extremely high communication failure
rates, which might be found in an unreliable P2P system,
can degrade distributed GA performance significantly. We
proposed two recovery methods, “retry” and “reroute”, and
the results show that the two methods can recover most of
the performance lost to communication failures even when
the failure rate is extremely high. The “retry” strategy cor-
responds to using a reliable, nonblocking, lower-level trans-
port mechanism which reestablishes connections if necessary
and keeps retrying communications until the send succeeds.
Thus, we can conclude that distributed GAs are naturally
highly fault-tolerant, even in the presence of very high
communication failure rates.
As mentioned in Section III, a system-induced failure in a
distributed GA consists of communication failures and node
unavailability. In this paper, we focused on communication
failures, and assumed that all nodes continued computa-
tions even when communications failed. In future work,
we will consider the impact of node unavailability as well
as mechanisms for fault tolerance in the presence of node
unavailability.
ACKNOWLEDGMENTS
This work was supported by the Japan MEXT program,
“Promotion of Env. Improvement for Independence of Young
Researchers”, the JSPS Compview GCOE, and a JSPS Grant-
in-Aid for Young Scientists 20700131.
REFERENCES
[1] J. Aspnes, Z. Diamadi, and G. Shah, “Fault-tolerant routing in peer-to-peer systems,” in Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), 2002, pp. 223–232.
[2] S. Hwang and C. Kesselman, “A flexible framework for fault-tolerance on the grid,” Journal of Grid Computing, vol. 1, pp. 251–272, 2003.
[3] Y. Gong, M. Nakamura, T. Matsumura, and K. Onaga, “A distributed parallel genetic local search with tree-based migration on irregular network topologies,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, no. 6, pp. 1377–1385, 2004.
[4] E. Cantu-Paz, “Migration policies, selection pressure, and parallel evolutionary algorithms,” Journal of Heuristics, vol. 7, no. 4, pp. 311–334, 2001.
[5] D. Knuth, The Art of Computer Programming, 2nd ed. Addison Wesley, 1998, vol. 3.
[6] W. Hillis, “Co-evolving parasites improve simulated evolution as an optimization procedure,” Physica D, vol. 42, pp. 228–234, 1990.
[7] H. Juille, “Evolution of non-deterministic incremental algorithms as a new approach for search in state spaces,” in Proceedings of the International Conference on Genetic Algorithms, 1995, pp. 351–358.
[8] S.-S. Choi and B. Moon, “A graph-based Lamarckian-Baldwinian hybrid for the sorting network problem,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 1, pp. 105–114, 2005.
[9] L. Graham, H. Masum, and F. Oppacher, “Statistical analysis of heuristics for evolving sorting networks,” in Proceedings of GECCO, 2005, pp. 1265–1270.