2009 IEEE Congress on Evolutionary Computation (CEC 2009), Trondheim, Norway
Fault Tolerance in Distributed Genetic Algorithms with Tree
Topologies
Yiyuan Gong
Alex S. Fukunaga
Abstract— We investigate the effects of communication failures in grid-based, distributed genetic algorithms with various topologies. We evaluated the performance behavior of distributed GAs under varying levels of persistent communication failures, using the sorting network problem as a benchmark application. We find that a distributed GA with a larger population size is less affected by a low communication failure rate. However, when the population size is small, the effect of a low communication failure rate on the performance of the distributed GA varies with the topology. For all the tree topologies we investigated, a significant performance degradation is observed when communication failures occur extremely frequently. However, even in these extreme cases, we show that simple retry/reroute protocols for recovering from communication failures are sufficient to recover most of the performance.
I. INTRODUCTION
Grid computing and peer-to-peer (P2P) environments are
becoming increasingly mature, enabling many users to access
vast computational resources. Genetic algorithms and other
evolutionary computation methods can benefit significantly
from this trend, since evolutionary algorithms can be paral-
lelized and distributed straightforwardly.
One potential problem with genetic algorithms distributed
across many processors is the possibility of system faults.
Consider a distributed GA running on a P2P system, such
as distributed.net or SETI@home. Communications con-
nectivity in a P2P system is inherently unreliable. Nodes
can become disconnected and unavailable at any time. For
example, client computers in a P2P network (which are not
under the control of the GA implementor) can be taken
offline or rebooted at arbitrary times. Communications are
extremely unreliable in the case of clients running on mobile
computers. Therefore, fault-tolerance is a major concern in
P2P systems [1]. Although grid environments tend to be more
stable than P2P environments, reliability and fault-tolerance
are also issues in grid research [2].
In this paper, we consider fault-tolerance in distributed
genetic algorithms. We study what happens to moderate-
sized (15 process) and large-sized (127 process) distributed
GA performance behavior when communications failures are
present. We consider two types of communication failure
rates (low and high), propose simple resend and reroute
protocols for handling faults, and show that distributed GAs
Yiyuan Gong is with the College of Mathematics and Computer Sciences, Fuzhou University, Fuzhou, Fujian, China; email: [email protected]
Alex Fukunaga is with the Global Edge Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan; email: [email protected].
can maintain their performance behavior even in extremely
unreliable environments.
The paper is structured as follows. First, we describe the
distributed GA model that we evaluate (Section II). Then,
we describe the fault model (Section III), and then discuss
simple strategies for incorporating fault tolerance into our
distributed GA model (Section IV). We present experimental
results using our fault model and fault tolerance mechanisms
in Section V, and conclude with a discussion of our results
and directions for future work in Section VI.
II. DISTRIBUTED GA MODEL
We present an island-model, distributed GA (DGA), in
which migration is performed through a logical tree topology.
The logical topology is established in advance on a dis-
tributed computing platform. Logical communication links
are chosen based on considerations such as the communi-
cation delay between two nodes, so that nodes which are
adjacent in the logical topology can communicate with small
overhead.
In this paper, we focus on populations connected using
tree topologies. Every node except the root knows its parent
and every node except the leaves knows its children. Each
computing node generates its own initial chromosome set
as a subpopulation, and performs genetic operations on its
subpopulation independently.
Figure 1 is an example of a spanning tree. The arrowed
edges are chosen for the spanning tree.
Fig. 1. Example of spanning tree
Migration is performed from a child to its parent when the
migration condition is satisfied. Migration is initiated when
a new, best solution within a population (i.e., a local elite)
is found at some node. This new local elite is sent to the
node’s parent. When a node receives a chromosome sent
from its child, the node replaces the worst chromosome in
its population with the migrated one.
The procedure for each computing node is stated as
follows:
procedure DGA;
begin
  Generation := 0;
  initialize a subpopulation P;
  while (termination condition is false)
  begin
    while (new population is incomplete)
      select two parents;
      apply crossover and mutation;
      evaluate the offspring;
    end while;
    check the communication buffer;
    if (a chromosome was received)
      replace the worst chromosome in P
        with the received one;
    endif;
    if (the best chromosome was updated)
      send the best chromosome to the parent;
    endif;
    Generation := Generation + 1;
  end while;
end
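As a concrete illustration, the per-node procedure above can be sketched in Python. This is only a sketch under assumptions: the `node` object and its operators (`crossover`, `mutate`, `recv_nowait`, `send_to_parent`) are hypothetical placeholders standing in for the paper's actual genetic operators and communication primitives.

```python
import random

def dga_node(pop, generations, fitness, node):
    """One island of the distributed GA (sketch; `node` supplies hypothetical
    genetic operators and non-blocking communication primitives)."""
    best = max(fitness(c) for c in pop)
    for _ in range(generations):
        # Build a new subpopulation by selection, crossover, and mutation.
        new_pop = []
        while len(new_pop) < len(pop):
            p1, p2 = random.sample(pop, 2)       # select two parents
            child = node.mutate(node.crossover(p1, p2))
            new_pop.append(child)                # evaluation happens via fitness()
        pop = new_pop
        # Non-blocking check of the communication buffer for a migrant.
        migrant = node.recv_nowait()
        if migrant is not None:
            pop[pop.index(min(pop, key=fitness))] = migrant  # replace the worst
        # Migrate a newly found local elite to the parent node.
        elite = max(pop, key=fitness)
        if fitness(elite) > best:
            best = fitness(elite)
            node.send_to_parent(elite)
    return pop
```

Note that the migrant check and the elite migration are both non-blocking, matching the asynchronous communication model described in Section III.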
Figure 2 shows the four tree topologies with 7 nodes: star,
line, sided binary tree, and balanced binary tree. These four
topologies with n nodes have the properties shown in Table
I.
TABLE I
PROPERTIES OF TOPOLOGIES [3]

Topology   Depth            Leaves       Descendant
star       1                n − 1        (n − 1)/n
balanced   log2(n+1) − 1    (n + 1)/2    ((n + 1) log2(n+1) − 2n)/n
sided      (n − 1)/2        (n + 1)/2    (n − 1)(n + 1)/4n
line       n − 1            1            (n − 1)/2
In the first column of Table I, ‘star’, ‘balanced’, ‘sided’,
and ‘line’ correspond to the star, balanced binary tree, sided
binary tree, and line topologies, respectively. In the table, the
’Depth’ column shows the length of the longest path between
the root and a leaf node. The line topology has the longest
depth while the star has the shortest depth. The ’Leaves’
column indicates the number of leaf nodes (nodes without
children). Note that leaf nodes do not receive migrated
chromosomes. The star topology has the largest number
(n − 1) of leaf nodes. The ‘Descendant’ column shows
the average number of descendant nodes in the topology.
Migrated chromosomes are generated at descendant nodes.
The star and line are well-known topologies which represent
opposite ends of a spectrum (line has maximal depth, star has
maximal number of leaves). The balanced and sided binary
trees are representative of intermediate topologies between
these extremes.
Fig. 2. Migration topologies
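The closed forms in Table I can be checked numerically. The sketch below is an illustration (not code from the paper): it encodes each 7-node topology as a parent array, an assumed representation, and computes depth, leaf count, and average descendant count directly.

```python
def properties(parent):
    """Depth, leaf count, and average descendant count of a rooted tree.

    `parent[i]` is the parent of node i; parent[0] = -1 marks the root.
    (Parent-array encoding is an assumption for this illustration.)"""
    n = len(parent)
    children = {i: [] for i in range(n)}
    for i in range(1, n):
        children[parent[i]].append(i)

    def depth(v):
        return 0 if not children[v] else 1 + max(depth(c) for c in children[v])

    def descendants(v):
        return sum(1 + descendants(c) for c in children[v])

    leaves = sum(1 for v in range(n) if not children[v])
    avg_desc = sum(descendants(v) for v in range(n)) / n
    return depth(0), leaves, avg_desc

n = 7
star     = [-1] + [0] * (n - 1)        # all nodes are children of the root
line     = [-1] + list(range(n - 1))   # chain p1 -> p2 -> ... -> pn
balanced = [-1, 0, 0, 1, 1, 2, 2]      # complete binary tree
sided    = [-1, 0, 0, 1, 1, 3, 3]      # one leaf and one subtree per level
```

For n = 7 these yield, e.g., average descendant counts of 6/7 (star), 10/7 (balanced), 12/7 (sided), and 3 (line), matching the formulas in Table I.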
III. FAULTS IN DISTRIBUTED GENETIC ALGORITHMS
There are two major failure modes in a distributed GA:
• It is possible for communications among nodes to fail,
due to unreliable communications channels or other
reasons.
• It is possible for a node to become unavailable for
computation due to a crash, reboot, or shutdown.
The latter (node unavailability) implies the former, since
it is not possible for a node to communicate when it has
been shut down or is otherwise unavailable. On the other
hand, communications failure to or from a node does not
necessarily mean that the node stops computation. In this
paper, we focus on communications failures, and we assume
that computations continue at all nodes even if some com-
munications fail.
Communications failures can be characterized as either
transient or persistent. A transient failure is a short-term
(several seconds, up to about a minute) interruption of
communications. Transient failures are extremely common in
distributed systems, and it is sufficient to rely on low-level
networking infrastructure to handle short, transient failures.
For example, TCP/IP implements mechanisms to ensure that
packets are eventually received by the destination. On the
other hand, a persistent failure is a longer-term event where
a communications link fails for a more noticeable amount of
time (say, more than a minute). For long, persistent failures,
some mechanism to handle the failure may be required at
the application (GA) level. Therefore, this paper focuses on
handling persistent communications failures.
We assume that communications attempts between pro-
cesses in a GA are asynchronous (non-blocking). That is,
when the sender decides to send a message, it calls an
asynchronous method for sending a message, and then con-
tinues performing other computations. In contrast, a blocking
send would mean that the sender initiates a message trans-
fer to the receiver, and waits (without doing any further
computation) until the proper acknowledgment of receipt
by the receiver is detected by the sender. In general, non-
blocking communication is more efficient; this is especially
true in the presence of extremely unreliable communications,
since blocking communication would leave processes idle
while waiting for communications to succeed on an unreli-
able channel.
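The non-blocking communication style described above can be mimicked in-process with a queue; this is only an illustrative sketch (the `channel` queue stands in for a network link, an assumption not from the paper): the sender deposits a migrant and returns immediately, and the receiver polls without blocking.

```python
import queue

channel = queue.Queue()          # stands in for the network link

def send_async(chan, msg):
    """Non-blocking send: deposit the message and return immediately."""
    chan.put_nowait(msg)

def recv_nowait(chan):
    """Non-blocking receive: return a waiting message, or None if empty."""
    try:
        return chan.get_nowait()
    except queue.Empty:
        return None
```

A sender never waits on the receiver, so a failed or slow link costs it no computation time, which is exactly the property the text argues for.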
In our experiments, we model a failure event as the
tuple (source, destination, start, end), which denotes that
all communications attempts from source to destination in
the time interval beginning with start and ending with end
will fail. We assume that a failure means that the sender at
the source attempts to send a message to the destination,
but for some reason, does not receive an appropriate ac-
knowledgment of receipt. We further assume that the failure
involves a temporary termination of connection (e.g., loss of
TCP connection) such that it is not possible for a lower-
level protocol such as TCP to automatically recover the
communication. This type of worst-case loss can happen in
unstable environments such as P2P networks.
In the distributed GA models described in Section II, a
failure has the effect of isolating groups of processors and
inducing a topology which differs somewhat from the ideal
fault-free topology. For example, given the line topology
p1 → p2 → p3 → p4 → p5, if communication between
p2 and p3 is interrupted, it is similar to running two lines
in parallel, p1 → p2 and p3 → p4 → p5, until the p2 → p3
communications link is restored. In the most extreme case
possible, when all communications fail among k processors,
the result would be equivalent to running k independent
populations in parallel. This might lead to a degradation in
performance. Previous work has shown that in parallel GAs
with sub-populations allocated to different processors, some
amount of communications/migration is desirable and leads
to improved performance (c.f. [4]), although the optimal
migration rate will depend on the application.
IV. SIMPLE FAULT TOLERANCE MECHANISMS FOR
DISTRIBUTED GAS
One approach to fault-tolerance is to not worry about
communications failures at all at the application level. As
mentioned above, in the worst case, this approach might
lead to the distributed GA behaving like a set of inde-
pendent, smaller populations. Although performance might
degrade, the amount of degradation will depend on the
application, and furthermore, the GA still functions correctly
as an optimization algorithm, as long as communication
is implemented using nonblocking primitives (if blocking
communications primitives are used, then the system may
enter a deadlocked state).
Another approach is to use population structures and com-
munications topologies which are fault tolerant. In Section
II, we described several distributed GA topologies: star, line,
sided binary tree, balanced binary tree. In situations where
we are able to choose a particular topology on which to
deploy a distributed GA, then knowledge about the relative
fault-tolerance of various topologies can guide us to select an
appropriate topology (of course, in this case, we must also
consider other factors, besides fault tolerance).
The above two strategies for dealing with failure are
entirely passive, and nothing is done during the actual
GA run to explicitly deal with failures. However, if it is
possible to implement some simple, application (GA) level
fault tolerance mechanisms so that the impact of faults is
minimized, then this is clearly worth investigating.
One very simple failure recovery method is for the sender
to periodically resend a message to the intended receiver
until the sender receives an explicit acknowledgment of
receipt from the receiver. (TCP already implements this, but
as mentioned above, we assume the failure involves a con-
nection termination, so we cannot rely on the lower-level
protocols.)
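A minimal sketch of this resend loop, assuming hypothetical channel primitives `send` and `recv_ack` (these names are placeholders, not the paper's API):

```python
import time

def send_with_retry(send, recv_ack, msg, interval=1.0, max_tries=None):
    """Resend `msg` until an acknowledgment of receipt arrives.

    `send(msg)` transmits; `recv_ack()` returns True once an ack for the
    message has been received.  Both are assumed placeholders."""
    tries = 0
    while max_tries is None or tries < max_tries:
        send(msg)                      # (re)transmit the migrant
        tries += 1
        if recv_ack():                 # acknowledgment arrived: done
            return True
        time.sleep(interval)          # back off before the next attempt
    return False                       # gave up after max_tries attempts
```

In a real deployment the retry interval would be tuned to the expected outage duration (dmin to dmax in the failure model of Section V).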
Another simple recovery method is to reroute around a bad
communications link. If a sender A does not receive an
acknowledgment from receiver B within a certain time limit,
A can reroute the communication and send the data to
another node, specifically, the next available node in the
communications tree, skipping links that are broken. For
example, assume that we have a line topology p1 → p2 →
p3 → p4 → p5. If the communications link (p2 → p3) is
broken, then p2 can send to p4 instead of p3.
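The rerouting rule can be sketched as a walk up the ancestor path; `link_ok` below is a hypothetical predicate (not from the paper) telling whether the link to a given node currently works.

```python
def reroute_target(path_to_root, link_ok):
    """First reachable node on the path toward the root.

    `path_to_root` lists ancestors nearest-first (parent, grandparent, ...);
    `link_ok(node)` is an assumed predicate for a working link."""
    for node in path_to_root:
        if link_ok(node):
            return node    # skip broken links; send to the next node up
    return None            # no ancestor reachable: keep the migrant local
```

For the line example above, p2's ancestor path is [p3, p4, p5]; with the p2 → p3 link down, the walk skips p3 and returns p4.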
The resend protocol is not a GA-specific technique – as
mentioned above, it is usually implemented at lower levels
of the communication stack (TCP). However, when TCP
connections are terminated entirely, resending needs to be
done at the GA level. The rerouting protocol is a GA-specific
technique. In most applications, we cannot reroute the result
of a computation to another node, since that would usually
produce incorrect results. However, for distributed GAs, this
is a natural idea.
V. EXPERIMENTAL EVALUATION
A. Sorting Network Problem
We use the minimal sorting network problem as a bench-
mark. This is the problem of designing a circuit (network)
with the minimal number of comparators, such that given any
input vector of n inputs (numbers), the circuit sorts the input
in order of size. This is a classical problem in theoretical
computer science [5]. Design of a sorting network with the
minimal number of comparators is a very difficult problem,
and evolutionary approaches for automatically designing a
minimal size sorting network have been investigated by
a number of researchers in the evolutionary computation
community [6], [7], [8], [9]. We use the same genetic
representation as Graham, Masum and Oppacher [9], and
we use the 14-input problem as a benchmark.
The fitness score for the 14-input sorting network problem
ranges between 0 and 16384 (i.e., from no inputs sorted
correctly to all 2^14 inputs sorted correctly), where higher
scores are better. The 14-input sorting network problem
is very difficult and time consuming, because evaluating a
single candidate individual (sorting network) requires exe-
cuting the network on 2^14 test cases. Sorting network
problems require
a tremendous amount of computation, and can therefore
benefit significantly from parallel GAs. Because the fitness
function requires so much computation, individual evaluation
is the bottleneck for a distributed GA for this problem, and
communication costs are negligible compared to fitness com-
putations. Therefore, the sorting network problem provides
a realistic benchmark for distributed GAs.
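The fitness computation can be sketched as follows. The list-of-pairs encoding of a network is an assumption for illustration (the paper uses the representation of [9]); the counting over binary vectors follows the zero-one principle.

```python
from itertools import product

def fitness(network, n):
    """Count how many of the 2**n binary inputs the comparator network sorts.

    A network is encoded (by assumption) as a list of (i, j) wire pairs
    with i < j; each comparator swaps positions i and j when they are out
    of order.  By the zero-one principle, a network that sorts every binary
    vector sorts all inputs, so a perfect n-input network scores 2**n
    (16384 for n = 14)."""
    score = 0
    for bits in product((0, 1), repeat=n):
        v = list(bits)
        for i, j in network:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
        if all(v[k] <= v[k + 1] for k in range(n - 1)):
            score += 1
    return score
```

Each evaluation executes the network on all 2^n vectors, which is why, at n = 14, fitness evaluation dominates the running time and communication costs become negligible.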
B. Simulation of unreliable distributed algorithms on a reli-
able cluster
In order to perform controlled experiments under varying
conditions, we do not run our experiment on an actual, peer-
to-peer or grid system, since that makes it difficult to re-
peat experimental conditions. Instead, we simulate unreliable
distributed genetic algorithms by running a parallel genetic
algorithm on a large-scale computing cluster, and injecting
simulated faults into this simulated, distributed GA. The
experiments were conducted on the Tokyo Institute of Tech-
nology TSUBASA cluster, which consists of 90 Sun Blade
X6250 nodes connected by an InfiniBand interconnect, where
each node consists of two quad-core Xeon E5440(2.83GHz)
processors (8 cores per node, 720 total CPU cores). Unlike
a grid or peer-to-peer system, clusters such as TSUBASA
are typically located in a data center and carefully managed
so that reliability is not a large issue for users. Because the
processors in the cluster are connected with a highly reliable,
dedicated interconnect network, communications failures are
not a practical issue. Also, any communication topology can
be specified. By implementing a simulated fault injection
mechanism into the communications code used by our “dis-
tributed” GA, we can simulate an unreliable, distributed GA
on a high-performance cluster, enabling us to perform large-
scale experiments using a difficult benchmark such as the
14-input sorting net problem in a reasonable amount of time.
The simulated failure model works as follows: when a
message is sent from s to r, the communication link between
s and r fails with probability pf . If the link fails, the message
is not received by r, and the communication link s → r
continues to fail for some random duration between dmin
and dmax seconds, and all messages sent from s to r during
that time will fail.
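The simulated failure model can be sketched as below; the parameters pf, dmin, and dmax follow the text, while the class structure itself is an assumption for illustration.

```python
import random

class FailureInjector:
    """Sketch of the persistent-failure model (class structure is assumed).

    Each send from s to r fails with probability pf and, on failure, takes
    the (s, r) link down for a random duration in [dmin, dmax] seconds;
    every send attempted during that interval also fails."""

    def __init__(self, pf, dmin, dmax, rng=None):
        self.pf, self.dmin, self.dmax = pf, dmin, dmax
        self.down_until = {}          # (s, r) -> time the current outage ends
        self.rng = rng or random.Random()

    def send_ok(self, s, r, t):
        """Return True iff a message sent from s to r at time t gets through."""
        if t < self.down_until.get((s, r), 0.0):
            return False              # link is inside a failure interval
        if self.rng.random() < self.pf:
            self.down_until[(s, r)] = t + self.rng.uniform(self.dmin, self.dmax)
            return False              # this send triggers a new outage
        return True
```

Because outages are tracked per directed link (s, r), a failure of s → r does not affect traffic on other links, matching the fault model of Section III.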
C. Experimental parameters
In all of our experiments, we run (simulated) distributed
genetic algorithms for the 14-input sorting network problem.
The mutation rate was 0.4, and the crossover rate was 0.1.
The migration policy described in Section II is used, so that
communications attempts occur when a new elite member
has been found on a processor.
First we tested a moderate-sized distributed GA with 15
nodes, configured to use one of the four communication
topologies described in Section II, with a population of 1000
per processor, executed for 800 generations per run.
We then extended the experiments to a large-scale, distributed
GA with 127 nodes and a population of 200 per node.
We considered failure rates pf ∈ {0.05, 0.5} (failures are
assumed to occur with uniform probability). This models a
range of environments, from very unreliable (pf = 0.05)
to extremely unreliable (pf = 0.5). We use these high
failure rates because lower rates are not interesting (they
result in behaviors which are indistinguishable from having
no failures), and we wish to see how far we can push the
fault-tolerance of distributed genetic algorithms.
D. Results
Figures 3-16 show the results of the experiments. In all
of these figures, each curve represents the solution curve at
the root of the corresponding topology (see the caption of
each figure), averaged over 30 runs. The “retry” curve repre-
sents the root solution curve when using the resend failure
recovery method. The “reroute” curve represents the root
solution curve when using the reroute failure recovery
method. The “no communication failure” curve represents
the case where no communication failures occur, and the
“ignore communication failure” curve shows the result when
no recovery method is used when a communication failure
occurs (in other words, the failed message is lost/ignored).
We depict the results of the four topologies at the two failure
rates mentioned above.
1) Comparison of Fault-tolerance of topologies: Figures
3-10 show the results for the four topologies when the
communication failure rate is 0.05. We observed that the
four curves “retry”, “reroute”, “no communication failure”
and “ignore communication failure” are very close to each
other for all topologies when the number of nodes is 15
and the population size is 1000. This means that with a
larger population size and a smaller number of nodes, com-
munication failures have very little impact on the perfor-
mance of the distributed GA when the failure rate is low.
The two failure recovery methods “retry” and “reroute” likewise do not
show obvious effects on the performance.

Fig. 3. Star with 15 nodes, failure rate 0.05, population size 1000

However, when the number of nodes is increased to 127
and the population size is decreased to 200, we find that
the impact of communication failures on the distributed GA
depends on the topology: failures have some impact on the
sided and line topologies, but little impact on the star and
balanced binary topologies. This is because deeper topologies
have more descendant nodes than shallower ones, i.e., the
sided and line topologies have more cooperating nodes than
the star and balanced binary topologies. When communica-
tion failures occur, more nodes are affected in the sided and
line topologies than in the star and balanced topologies.
2) Comparison of Fault-recovery strategies: Figures 11-
16 show the results when the communication failure rate is
0.5. Compared to Figures 3-10, this extremely high commu-
nication failure rate has an obvious impact on performance,
and the impact is more pronounced when the number of
nodes is increased to 127 and the population size is decreased
to 200. However, both “retry” and “reroute” recover most of
the performance lost to communication failures. From these
figures, we also observed that for any topology, “retry” and
“ignore communication failure” make slower initial progress
than “no communication failure” and “reroute”. However,
“retry” catches up to “no communication failure” and
“reroute” later. The performance of “reroute” is similar to
“no communication failure”. Although communication fail-
ures prevent elite solutions from migrating, the “reroute”
method ensures that the elite is communicated to some
ancestor, and the “retry” method resends as soon as commu-
nication is restored. Thus, both recovery methods eventually
catch up to the performance of a failure-free run.
In addition to the results shown here, we also ran experi-
ments with 127 nodes and a population of 50 per node. The
results, not shown here due to space limitations, were similar
to the results with 127 nodes and a population of 200 per
node.
VI. CONCLUSIONS
In this paper, we investigated the effect of communi-
cation failures on distributed genetic algorithms with tree
Fig. 4. Binary with 15 nodes, failure rate 0.05, population size 1000
Fig. 5. Side with 15 nodes, failure rate 0.05, population size 1000
Fig. 6. Line with 15 nodes, failure rate 0.05, population size 1000
Fig. 7. Star with 127 nodes, failure rate 0.05, population size 200
Fig. 8. Binary with 127 nodes, failure rate 0.05, population size 200
Fig. 9. Side with 127 nodes, failure rate 0.05, population size 200
Fig. 10. Line with 127 nodes, failure rate 0.05, population size 200
Fig. 11. Binary with 15 nodes, failure rate 0.5, population size 1000
Fig. 12. Side with 15 nodes, failure rate 0.5, population size 1000
Fig. 13. Line with 15 nodes, failure rate 0.5, population size 1000
Fig. 14. Binary with 127 nodes, failure rate 0.5, population size 200
Fig. 15. Side with 127 nodes, failure rate 0.5, population size 200
Fig. 16. Line with 127 nodes, failure rate 0.5, population size 200
topologies. Distributed GAs with varying levels of persistent
communication failures were evaluated. We find that different
failure rates impact distributed GAs differently. Distributed
GAs with few nodes and a larger population size are less
affected by a low communication failure rate. However,
with a larger number of nodes and a smaller population,
the effect of a low communication failure rate on the per-
formance of the distributed GA varies with the topology.
Communication failures impact deeper topologies more than
shallower topologies. Extremely high communication failure
rates, which might be found in an unreliable P2P system,
can degrade distributed GA performance significantly. We
proposed two recovery methods, “retry” and “reroute”, and
the results show that the two methods can recover most of
the performance lost to communication failures even when
the failure rate is extremely high. The “retry” strategy cor-
responds to using a reliable, nonblocking, lower-level trans-
port mechanism which reestablishes connections if necessary
and keeps retrying communications until the send succeeds.
Thus, we can conclude that distributed GAs are naturally
highly fault-tolerant, even in the presence of very high
communication failure rates.
As mentioned in Section III, a system-induced failure in a
distributed GA consists of communication failures and node
unavailability. In this paper, we focused on communication
failures, and assumed that all nodes continued computa-
tions even when communications failed. In future work,
we will consider the impact of node unavailability as well
as mechanisms for fault tolerance in the presence of node
unavailability.
ACKNOWLEDGMENTS
This work was supported by the Japan MEXT program,
“Promotion of Env. Improvement for Independence of Young
Researchers”, the JSPS Compview GCOE, and a JSPS Grant-
in-Aid for Young Scientists 20700131.
REFERENCES
[1] J. Aspnes, Z. Diamadi, and G. Shah, “Fault-tolerant routing in peer-to-peer systems,” in Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), 2002, pp. 223–232.
[2] S. Hwang and C. Kesselman, “A flexible framework for fault-tolerance on the grid,” Journal of Grid Computing, vol. 1, pp. 251–272, 2003.
[3] Y. Gong, M. Nakamura, T. Matsumura, and K. Onaga, “A distributed parallel genetic local search with tree-based migration on irregular network topologies,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, no. 6, pp. 1377–1385, 2004.
[4] E. Cantu-Paz, “Migration policies, selection pressure, and parallel evolutionary algorithms,” Journal of Heuristics, vol. 7, no. 4, pp. 311–334, 2001.
[5] D. Knuth, The Art of Computer Programming, 2nd ed. Addison Wesley, 1998, vol. 3.
[6] W. Hillis, “Co-evolving parasites improve simulated evolution as an optimization procedure,” Physica D, vol. 42, pp. 228–234, 1990.
[7] H. Juille, “Evolution of non-deterministic incremental algorithms as a new approach for search in state spaces,” in Proceedings of the International Conference on Genetic Algorithms, 1995, pp. 351–358.
[8] S.-S. Choi and B. Moon, “A graph-based Lamarckian-Baldwinian hybrid for the sorting network problem,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 1, pp. 105–114, 2005.
[9] L. Graham, H. Masum, and F. Oppacher, “Statistical analysis of heuristics for evolving sorting networks,” in Proceedings of GECCO, 2005, pp. 1265–1270.