Dynamic Fault Tolerance in Fat Trees
Frank Olaf Sem-Jacobsen, Tor Skeie, Olav Lysne, Member, IEEE, and José Duato, Member, IEEE

Abstract—Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that is able to handle several concurrent faults and that transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It only requires a small amount of extra functionality in the switches to handle rerouting packets around a fault. The method guarantees connectedness and deadlock and livelock freedom for up to $k-1$ benign simultaneous switch and/or link faults, where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method will even be able to handle significantly more faults beyond the $k-1$ limit with high probability.

Index Terms—Fat trees, k-ary n-trees, dynamic fault tolerance, deterministic routing, adaptive routing.


1 INTRODUCTION

In order to reduce costs, most clusters, and even supercomputers, use commodity components. Now that the required computing power can no longer be achieved with a single processor, parallelization has become the de facto method of achieving high computational efficiency and huge processing power. Large computer systems such as supercomputers are constructed by interconnecting many smaller computing devices through a switched interconnection network. The numerous computing devices usually communicate over the interconnection network in order to cooperatively solve a large computational problem, so a critical factor of such systems is the ability of the computing devices to communicate. Many applications require large amounts of data to be transported between the computing devices. The speed at which this communication occurs will, to a large extent, dictate the efficiency of the supercomputer. Therefore, an efficient and reliable interconnection network is a crucial component of large-scale computing systems.

As the race for higher computing speeds progresses, existing technology is pushed to its limits. This also affects the interconnection network. Higher speeds leave less margin for error. Further, the increased power consumption might lead to shorter expected lifetimes of the network devices and decreased mean time between failures (MTBF). It becomes vitally important that the interconnection network is able to operate close to its nominal capacity even when network elements (typically switches and their interconnecting links) cease to function correctly.

The High-Performance Computing (HPC) systems we consider in this paper have typically relied on either proprietary interconnect technologies or off-the-shelf systems such as Quadrics and InfiniBand. This is still the case for the current top 500 supercomputer systems. However, as the speed and reliability of IP-based solutions increase, and as Ethernet eventually supports flow control (lossless operation), there has been an increased interest in the application of these technologies to HPC systems as well. Although in this paper we focus on high-performance interconnects, which usually rely on a centralized subnet manager (e.g., InfiniBand), the solutions and techniques we propose can also be applied to Ethernet/IP-based interconnects.

The ability of the interconnection network to maintain high operational efficiency, or at least to remain operational without disconnecting any computing nodes, in the presence of failing network elements depends strongly on the network topology and the routing function used to generate paths through the network. For the system to remain connected after a fault has occurred, there must exist a path between every pair of computing nodes that avoids the failed element.

The most simplistic fault-tolerance method, and perhaps the most widely deployed, is to shut down the network that contains the faulty components and replace them. This method requires no complex mechanisms, but the time to recovery might vary from hours to days, depending on the availability of replacement hardware. However, it might not be necessary to shut down the entire system to perform the replacement.

An example of a slightly more complex mechanism is the ability of BlueGene/L to shut down the faulty parts of its network and continue functioning at reduced capacity. This example indicates another step on the path to reducing time to recovery, that of reconfiguring the network automatically, either by halting it or while it remains in operation. Such methods are known as reconfiguration methods, because the network is reconfigured once the fault is discovered. A significant challenge when using reconfiguration is to guarantee that there are no dependencies between packets being routed in the network throughout the reconfiguration that might cause cycles and deadlock the network. A long-used method of ensuring this is to halt the network and its applications and drain it of all traffic before it is reconfigured,


. F.O. Sem-Jacobsen is with Simula Research Laboratory, PO Box 134, 1325 Lysaker, Norway. E-mail: [email protected].

. T. Skeie and O. Lysne are with Simula Research Laboratory, PO Box 134, 1325 Lysaker, Norway, and the University of Oslo, Norway. E-mail: {tskeie, olavly}@simula.no.

. J. Duato is with Simula Research Laboratory, PO Box 134, 1325 Lysaker, Norway, and the Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Camino de Vera, s/n, C.P. 46022, Valencia, Spain. E-mail: [email protected].



thereby guaranteeing that there is no traffic in the network during the reconfiguration. This is static reconfiguration. It is very time consuming and requires that the network applications are checkpointed at regular intervals so that they can be restarted once the network is reconfigured. Much research is, therefore, being directed toward reconfiguring the network while it is online without introducing dependencies that might deadlock it, known as dynamic reconfiguration.

Reconfiguration takes time, either to run a distributed reconfiguration algorithm or to gather the information that a central entity requires to calculate the new routing tables and distribute them. Furthermore, since faulty networks must generally be viewed as irregular networks, a generic routing algorithm might be required to maintain connectivity. This approach will yield good probabilities of remaining connected despite network failures, but the use of general-purpose routing algorithms might have a severe negative effect on network performance.

The remaining approaches fall within the class of rerouting algorithms. These are inherently dynamic. By configuring the network with alternative paths from start-up, it is possible to further reduce the time taken to route packets around network faults. In the approach that we call end point dynamic rerouting, the network is configured with multiple paths between every source/destination pair. It requires much less time before the fault is tolerated than the reconfiguration methods, but some time is still required to inform the sources of the network faults when they occur. During this time, traffic already in the network will still encounter the fault and must be discarded.

The most expedient way of handling a fault is to let the devices that are directly connected to the faulty element take care of the problem. We call this approach local dynamic rerouting, because the fault is handled locally by the network elements. For instance, if a switch learns that one of its neighboring switches is no longer available, it is responsible for forwarding packets on alternative paths that avoid the fault. This requires that there are preconfigured paths around all single elements that can fail in the network. This creates many possible paths in the network, and care must be taken to avoid dependencies and deadlock.

Of the approaches listed here, local dynamic rerouting has by far the lowest time to recovery and so has the lowest number of packets lost when a fault occurs. In fact, only packets that are crossing the failing element at the time of failure, or packets irreversibly routed to the failing element, will be lost. However, due to the local nature of the paths utilized when tolerating faults, they might be suboptimal.

The large time penalty incurred by static fault tolerance might prevent the approach from providing high computational efficiency in the case of frequent fault events. This is usually the case for large supercomputers, because the probability of failure increases with the number of components. Further, the high hardware cost or long turnaround time makes it an expensive solution. Thus, in systems that have frequent faults, some of which might be transient, dynamic fault tolerance is the preferred choice. Therefore, we will adopt the local dynamic rerouting approach in this paper.

Local dynamic rerouting requires a mechanism for rerouting packets around faults locally. This can be achieved either by using an adaptive routing algorithm, or by adding a rerouting mechanism to the switches in the case of deterministic routing. We will explore both of these options.

Parallel computer systems often use the following network topologies: either direct networks such as the k-ary n-cube and mesh, or Multistage Interconnection Networks (MIN). In direct networks (e.g., mesh), all switching elements contain a computing node. On the other hand, MINs are constructed with multiple switching stages that the packet must traverse from one processing node to another. MINs were first introduced by Clos in 1953 as a means of constructing nonblocking telephone switching networks [1].

A specific MIN topology is the fat tree. The fat tree was proposed by Leiserson in 1985 [2]. Most of the commercial interconnection technologies for SANs and clusters advocate the use of fat trees or other bidirectional MINs, for instance, InfiniBand [3] and Quadrics [4]. Because of this, the fat tree is to be found in many parallel computer systems. Two of the top 10 supercomputers rely on the fat-tree topology as their primary interconnection network [5] (November 2009). A classical example of a supercomputer using fat trees is the CM-5 [6]. A more recent example is the SGI Altix 3700 [7], which is used in NASA's Columbia, and the Ranger system at the Texas Advanced Computing Center. Due to the widespread use of the fat tree, we will constrain our proposal to this topology.

In this paper, we present a dynamic local rerouting (DLR) methodology for fat trees. It can guarantee connectivity for up to and including $k-1$ benign faults using either deterministic or adaptive routing, where k is half the number of ports in the switches. We implement this methodology as four different routing algorithms: link fault tolerance and link/switch-fault tolerance for deterministic routing, and link fault tolerance and link/switch-fault tolerance for adaptive routing. We show that using deterministic routing, we require two virtual channels (VCs) to guarantee deadlock freedom with more than one link fault and three virtual channels for more than one switch fault, while deadlock freedom for the adaptive version of the dynamic local rerouting algorithm does not require the use of virtual channels at all, either for link or switch faults. Furthermore, we have previously shown that the dynamic local rerouting algorithm is also applicable to source routing for link faults [8].

The rest of the paper is organized as follows: Section 2 introduces the k-ary n-tree, the commonly used fat tree we consider in this paper. Section 3 discusses previous work relevant to dynamic fault tolerance and multistage interconnection networks. In Section 4, we present a set of definitions that are needed for presenting the fault-tolerant routing mechanism in Section 5. We apply this mechanism to deterministic routing in Section 6 and adaptive routing in Section 7, and show that the algorithms are connected and deadlock free. In Section 8, we present a simple reconfiguration scheme for fat trees to which we will compare our dynamic local rerouting algorithms. The algorithms are evaluated and compared in Section 9. Section 10 concludes the paper.

2 BACKGROUND

The basic characteristic of a fat tree is that the link capacity at every tier is constant. This is different from an ordinary tree, in which the aggregate capacity of each tier becomes smaller as we approach the tree root. The increase in link capacity at each switch tier compared to ordinary trees may be achieved by simply increasing the capacity of the links and switches in a tree with only a single root. However, the most common approach is to add additional switches and links so that the number of links and switches at every tier is the same, for instance, as in a k-ary n-tree [9].

A typical fat tree with four-port switches is depicted in Fig. 1. It is a k-ary n-tree (in this case a 2-ary 5-tree) as defined in [9] and Definition 1. The k-ary n-tree is the class of topology we consider throughout the paper, but it is just as easy to apply the algorithms we develop here to the m-port n-tree [10].

Assume a bidirectional MIN in which the top tier is tier 0, and processing nodes are connected to the bottom tier, tier $n-1$. Every switch has k ports in each direction of the bidirectional MIN. Each of the n tiers consists of $k^{n-1}$ such switches. Within each switch tier, switches are numbered sequentially from zero and upwards from left to right.

The following definition states how switches should be interconnected to switches at their neighboring tiers to form a k-ary n-tree. The definition uses a notation for n-tuples, $\{0, 1, \ldots, k-1\}^n$, which indicates that the n-tuple may be viewed as an n-digit number where each digit may have one of the values $\{0, 1, \ldots, k-1\}$. An n-tuple w may also be represented by its individual digits as $w_0, w_1, \ldots, w_{n-1}$.

Definition 1. A k-ary n-tree is composed of two types of vertices: $N = k^n$ processing nodes and $n \cdot k^{n-1}$ communication switches, each with k ports in the upward and downward directions. Each node is an n-tuple $\{0, 1, \ldots, k-1\}^n$, while each switch is defined as an ordered pair $\langle w, l \rangle$, where w is the number of the switch in the sequence from left to right represented as an $(n-1)$-tuple, $w \in \{0, 1, \ldots, k-1\}^{n-1}$, and $l \in \{0, 1, \ldots, n-1\}$ is the tier where it is located.

. Two switches $\langle w_0, w_1, \ldots, w_{n-2}, l \rangle$ and $\langle w'_0, w'_1, \ldots, w'_{n-2}, l' \rangle$ are connected by an edge if and only if $l' = l + 1$ and $w_i = w'_i$ for all $i \neq l$.

. There is an edge between the switch $\langle w_0, w_1, \ldots, w_{n-2}, n-1 \rangle$ and the processing node $p_0, p_1, \ldots, p_{n-1}$ if and only if $w_i = p_i\ \forall i \in \{0, 1, \ldots, n-2\}$.

Note that the tier numbering is opposite to that of conventional multistage interconnection networks, which usually label the tier connected to the processing nodes as tier 0. In this context, upward and downward are relative to the top and bottom of the tree, i.e., upwards is toward tier 0 and downwards is toward tier $n-1$. An example of this switch labeling is shown in Fig. 1.
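To make Definition 1 concrete, here is a small sketch of our own (the function name and structure are illustrative, not from the paper) that enumerates the vertices and edges of a k-ary n-tree and checks the vertex counts for the 2-ary 5-tree of Fig. 1:

```python
from itertools import product

def build_k_ary_n_tree(k, n):
    """Enumerate the vertices and edges of a k-ary n-tree (Definition 1)."""
    nodes = list(product(range(k), repeat=n))             # N = k^n processing nodes
    switches = [(w, l) for l in range(n)                  # n * k^(n-1) switches
                for w in product(range(k), repeat=n - 1)]

    def switch_edge(a, b):
        # <w, l> and <w', l'> are linked iff l' = l + 1 and w_i = w'_i for all i != l
        (w, l), (w2, l2) = a, b
        return l2 == l + 1 and all(w[i] == w2[i] for i in range(n - 1) if i != l)

    def node_edge(s, p):
        # a tier n-1 switch <w, n-1> serves node p iff w_i = p_i for i = 0..n-2
        w, l = s
        return l == n - 1 and w == p[:n - 1]

    links = [(a, b) for a in switches for b in switches if switch_edge(a, b)]
    node_links = [(s, p) for s in switches for p in nodes if node_edge(s, p)]
    return nodes, switches, links, node_links

nodes, switches, links, node_links = build_k_ary_n_tree(2, 5)  # the tree of Fig. 1
assert len(nodes) == 2 ** 5 and len(switches) == 5 * 2 ** 4
```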

Packet routing in a fat tree is carried out in much the same way as in an ordinary tree, or any other least common ancestor network [11]. Packets are routed toward the tree root until they reach a switch that is the least common ancestor of the source and destination, i.e., a switch that can reach both the source and destination in the downward direction, but through different links. This is called the upwards phase. The upwards phase is followed by a downwards phase, in which the packet is routed downwards to its destination.¹ Given that the fat tree has multiple roots, there will be multiple paths between any source/destination pair. This property can be utilized efficiently in a static/dynamic reconfiguration fault-tolerance scheme or an end point dynamic rerouting fault-tolerance scheme.

The fat-tree topology allows for fully adaptive routing in the upward phase. Every switch port that leads further up in the network lies on a shortest path toward a least common ancestor of the source and destination of the packet. However, in the downward phase, there is only a single shortest path from any switch to the packet destination. Hence, it is difficult to implement local dynamic fault tolerance in the downward phase, because there are no additional minimal paths that may be supplied by the algorithm for selection. As a result, it is necessary to develop a new routing algorithm that makes additional downward paths available in the case of link failure.
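The two routing phases can be sketched as follows (our illustration; steering the downward hop by the destination digit follows from Definition 1, while concrete port numbering is implementation-specific):

```python
import random

def route(src, dst, k):
    """One minimal path from node src to node dst (distinct n-digit tuples),
    as a list of switches (w, tier). Upward hops may take any up-port;
    downward hops are steered by the destination digits."""
    n = len(src)
    top = min(j for j in range(n) if src[j] != dst[j])   # lca tier (Definition 7)
    w = list(src[:n - 1])                  # the bottom switch below the source
    path = [(tuple(w), n - 1)]
    for tier in range(n - 1, top, -1):     # upward phase: the hop to tier-1 frees
        w[tier - 1] = random.randrange(k)  # digit tier-1, so any up-port is minimal
        path.append((tuple(w), tier - 1))
    for tier in range(top, n - 1):         # downward phase: digit dst[tier] selects
        w[tier] = dst[tier]                # the unique child toward dst
        path.append((tuple(w), tier + 1))
    return path

p = route((0, 0, 0, 0, 0), (0, 0, 1, 1, 1), k=2)
assert p[-1] == ((0, 0, 1, 1), 4)          # ends at dst's bottom switch
```

The random up-port choice stands in for the adaptivity of the upward phase; replacing it with any fixed selection yields a deterministic routing function in the sense of Definition 3 below.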

3 RELATED WORK

Much of the work on fault tolerance in multistage interconnection networks focuses on providing multiple paths end-to-end to facilitate either static reconfiguration or end point dynamic rerouting. A frequently used approach to supply multiple paths is to add hardware to the network, in the form of additional switch stages or additional links and switches to the already existing stages [12], [13]. Another approach relies on every source being connected to more than one destination and every destination being connected to more than one source. The result is that it is impossible to isolate a given set of sources and destinations when there is a network fault; therefore, it is possible to bypass a fault using the holistic property of the network. Such a network offers full dynamic access and allows packets to be routed through the network in several passes by using intermediate destinations [14], [15]. Hybrid approaches have also been suggested, combining multiple paths with routing in multiple passes to achieve a greater degree of fault tolerance [16].

Similar work has been performed on types of fat trees that do not provide multiple paths in their basic construction. An orthogonal fat tree attempts to maximize the number of leaves connected to a tree for a given switch degree [17]. This approach supplies only a single path between any source/destination pair. To adapt the approach so that it can deal with faults, a fault-tolerant orthogonal fat tree [12] has been proposed that supplies a number of disjoint paths between every source/destination pair by increasing the switch degree. This approach allows end point dynamic rerouting or static reconfiguration. However, it cannot be used to achieve local dynamic rerouting in a straightforward manner.

A comprehensive study of the fault-tolerant properties of several multistage interconnection networks is performed in [18]. None of the methods evaluated consider dynamic fault tolerance, and the authors found that enhancing MIN topologies with additional hardware might lead to a decrease in performance compared to the original network.


Fig. 1. Switch labeling in a 2-ary 5-tree. The outline encompasses a switch group (Definition 10).

1. Note that relative to the enumeration given in Definition 1, a packet in an upward phase will traverse decreasing switch tier numbers, and a packet in the downward phase will traverse increasing tier numbers.

To provide local dynamic rerouting, Sengupta and Bansal [19] propose a modified version of the single-path Omega network to create multiple paths between the source/destination pairs, which can be used to avoid faults dynamically while the packet is traversing the network. Furthermore, Sengupta and Bansal [20] propose another topology, called the Quad Tree, which consists of two parallel Double Trees. There are links interconnecting the same stages in the two trees, which allow packets to be rerouted dynamically. Both topologies are only able to tolerate a single fault in any stage. Yet another network topology that supports dynamic rerouting is the modified MIN presented in [21]. Redundant links are placed between switches at the same stage. This allows a packet to be forwarded to another switch at the same stage when a fault is encountered. The same approach is used in [22], where crossover links between two parallel fat trees are used for local dynamic fault tolerance. This method can, in principle, be applied to many MIN topologies, but it still only provides tolerance of a single fault.

FRoots [23] is a topology-agnostic routing algorithm that is designed to provide both end-to-end and local dynamic fault tolerance. It is based on the idea that every network node should be a leaf of a tree, so that if a node fails, no traffic passes through it other than traffic to and from it. This is achieved by constructing several Up*/Down* routing graphs such that each node is a leaf in at least one graph. Each Up*/Down* graph is implemented in its own virtual layer. When a packet encounters a faulty network element, it can be switched to a new layer with a graph in which the faulty elements are leaves. It will then be forwarded to its destination along a different path.

Imposing tree topologies on the fat tree so that every switch and processing node in the network is a leaf in one of the trees is difficult, because the fat tree is already a tree topology. A large number of Up*/Down* graphs, and thus virtual layers, will be required to ensure this. In addition, the complexity would lead to a much longer path when a fault is encountered. Clearly, such a generic solution is suboptimal when considering a regular topology such as the fat tree. It requires a large number of virtual channels to form the necessary virtual layers, and the path length will be longer, even when the network is fault free.

With the speed increase of Ethernet in the last few years and the recent initial adoption of flow control, the use of interconnects based on Ethernet (or some other network technology) with IP at the network level, configured as a fat tree, has been proposed for computer clusters. Some examples include [24], [25], and [26]. Common to these approaches is that they consider network-level fault tolerance based on reconfiguring routing tables. This is achieved either through a central manager instructing the affected nodes to recompute routing tables [24], or by permeating updated fault-state information through the network from the affected switches [26]. This is time consuming compared to dynamic local rerouting, but as we show later, such solutions can be combined with the approaches we present here with a positive result.

To the best of our knowledge, there exists no previous method that provides local dynamic rerouting for several simultaneous network faults without adding additional hardware and without requiring additional virtual channels. We argue that with the sizes of the supercomputers that are currently being built, local dynamic fault tolerance is very appealing, if not mandatory.

4 DEFINITIONS

In this section, we present a set of definitions that provide the necessary framework to show that the fault-tolerant routing algorithms that we will present in the next sections are actually fault tolerant and that they are deadlock free.

We begin by defining adaptive and deterministic routing algorithms in general.

Definition 2. An adaptive routing function R* supplies a set of possible output queues to be used by an incoming packet on input port i on switch s to reach a destination d. When a set of possible output queues for a packet has been given by a routing function, the actual queue into which the packet is forwarded is chosen by a selection function.

Definition 3. A deterministic routing function R* supplies one output queue to be used by an incoming packet on input port i on switch s to reach a destination d. The output queue is the same for all incoming packets to the same destination.

Note that the incoming port i and the associated virtual channel may or may not be part of the routing function, depending on what is required by the algorithm and supported by the hardware.

Definition 4. A link is a connection between two switches that are connected by an edge (as defined in Definition 1), allowing information such as data packets to pass between them.

A downward link from a switch at tier l is the connection to a switch at tier $l+1$. Recall that we assume that queues are located at the switch outputs. A downward queue is, therefore, a queue at a switch output port that is connected to a switch one tier lower down (from tier l to $l+1$) by a link. Similarly, an upward queue is a queue at a switch output port that is connected to a switch one tier higher up (from tier $l+1$ to l) by a link. This combination of queue and link may also be called a channel.

Let us next define the faults we consider for our routing algorithms.

Definition 5. A link fault is the total benign permanent/transient failure of a link or of either of the switch ports connecting a switch to the link.

Definition 6. A switch fault is the total benign permanent/transient failure of a switch.

The following definitions help us to name parts of the fat tree.

Definition 7. The least common ancestors (lca) of a source s and destination d are all the switches at tier l where $l = \min(j) : s_j \neq d_j$, and $lca_i = s_i = d_i\ \forall i < l$.

Recall that the routing in the fat tree consists of two phases: 1) the upward phase from the source up the tree toward one of the least common ancestors (toward tier 0), and then 2) the downward phase from the least common ancestor down the tree (through increasing tier numbers) to the destination. The length of the path followed by the packet is determined by the tier at which the least common ancestors of the source and destination are located.
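In code, Definition 7 amounts to a prefix comparison; the following sketch (ours) enumerates all least common ancestors of a pair of nodes:

```python
from itertools import product

def least_common_ancestors(src, dst, k):
    """All lca switches <w, l> of distinct nodes src and dst, per Definition 7:
    l = min{j : src_j != dst_j}; digits below l are pinned to the shared
    prefix, while digits l..n-2 range freely."""
    n = len(src)
    l = min(j for j in range(n) if src[j] != dst[j])
    prefix = src[:l]
    return [(prefix + tail, l)
            for tail in product(range(k), repeat=(n - 1) - l)]

# In a 2-ary 5-tree, nodes differing first in digit 2 meet at tier 2,
# and k^((n-1)-l) = 4 switches qualify:
lcas = least_common_ancestors((0, 0, 0, 0, 0), (0, 0, 1, 1, 1), k=2)
assert len(lcas) == 2 ** 2 and all(l == 2 for _, l in lcas)
```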

To demonstrate the fault-tolerance capabilities of the routing algorithm that we will present, we must know which part of the network is affected by the fault, in that it carries the additional load of the packets that have to be routed around the fault.

Definition 8. Two switches are neighbors if there is a link connecting them to each other.


The switches can only be neighbors if they are at neighboring switch tiers l and $l+1$.

Definition 9. Two sets of switches, a and b, are completely interconnected if every switch in a is a neighbor of all switches in b (and implicitly vice versa).

Two such interconnected sets make up an integral part of the dynamic local rerouting algorithm we will present later. To identify the possible paths taken by a packet that encounters link faults, we must know something about the relationship between the switches in different tiers of the fat tree. This is expressed in the next definition.

Definition 10. A switch group g in a k-ary n-tree is the union of two completely interconnected sets of switches, a and b, where a contains only switches at tier l and b contains only switches at tier $l+1$. a contains all neighbors at tier l of all switches in b, and b contains all neighbors at tier $l+1$ of all switches in a; $g = a \cup b$.

And now we describe the same switch group in the context of the k-ary n-tree definition.

Definition 11. A switch group G in a k-ary n-tree is made up of all switches at the switch tiers l and $l+1$ where $v_i = u_i\ \forall i \neq l$, and v, u are n-tuples as defined in Definition 1.

More informally, if each switch has k links in the upward direction, a switch group consists of 2k switches. Of these, k switches are at the lower tier of the group, tier $l+1$, and k are at the upper tier, tier l. Every switch at the lower tier of the group has a link that connects it to all switches in the upper tier, and vice versa. An example of a switch group is illustrated in Fig. 1.
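Group membership per Definition 11 is easy to enumerate; in this sketch of ours, varying digit l yields the k tuples whose switches form the two tiers of the group:

```python
def switch_group(w, l, k):
    """One-hop switch group of upper-tier switch <w, l> (Definition 11):
    all switches at tiers l and l+1 whose (n-1)-tuples agree with w on
    every digit except digit l. k switches per tier, 2k in total, and
    by Definition 1 every lower-tier member neighbors every upper-tier one."""
    variants = [w[:l] + (d,) + w[l + 1:] for d in range(k)]
    return [(v, l) for v in variants] + [(v, l + 1) for v in variants]

group = switch_group((0, 0, 0, 0), l=2, k=2)   # a group in the 2-ary 5-tree
assert len(group) == 2 * 2                      # 2k switches
```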

The concept of the switch group can be extended to encompass a two-hop switch group, a switch group that consists of three tiers of switches instead of two. This is necessary to tolerate switch faults.

Definition 12. A two-hop switch group G2 is made up of all switches at three neighboring switch tiers l, $l+1$, $l+2$ where $v_i = u_i\ \forall i \neq j,\ j \in \{l, l+1\}$, and v, u are n-tuples as defined in Definition 1.

Routing in the fat tree prohibits packet transitions between certain queues (downward to upward) in order to guarantee deadlock freedom, a state in which it is impossible for the network to experience deadlock. However, when faults are introduced into the network, some of the illegal packet transitions must be performed in order to keep all processors connected. The next definitions identify where in the fat tree these illegal transitions take place.

Definition 13. A U-turn is a part of a packet's path where a downward queue (from tier l to tier $l+1$) is followed by an upward queue (from tier $l+1$ to tier l).

Thus, the term “U-turn” does not apply to the upward-to-downward transition that separates the upward and downward paths in the original routing algorithm.

Definition 14. A U-turn switch is the switch that contains the upward queue of a U-turn.

The next set of definitions will be used when considering the deadlock freedom and connectivity of the fault-tolerant algorithm that we will propose.

Definition 15. A network is deadlocked if there exists a set of full queues Q such that all queues supplied by the routing function for the first packet in every queue in Q are also in Q.

More formally, there are many conceivable assignments of packets to queues in a network, but only some of these assignments are actually possible, given a particular routing function.

Definition 16. A legal configuration for the routing function R* is an assignment of packets to network queues that may be reached from an empty network following R*.

Definition 17. A configuration is deadlocked if the configuration is legal and there exists a set of queues Q such that all possible next-hop queues for the first packet in any queue $\in Q$ are also $\in Q$.
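Definitions 15-17 suggest a direct check, sketched below with our own naming: starting from the full queues of a legal configuration, repeatedly discard every queue whose head packet is delivered or has a next hop outside the candidate set; whatever remains is precisely a set Q witnessing deadlock.

```python
def deadlocked_set(full_queues, next_hops):
    """Greatest set Q of full queues in which every head packet can only be
    routed to queues that are also in Q (Definitions 15 and 17). next_hops[q]
    is the set of queues the routing function supplies for the head packet
    of q (empty if that packet has reached its destination node)."""
    q = set(full_queues)
    changed = True
    while changed:
        changed = False
        for queue in list(q):
            hops = next_hops[queue]
            # a delivered head packet, or one with a next hop outside q
            # (and hence with free space), lets this queue drain
            if not hops or any(h not in q for h in hops):
                q.discard(queue)
                changed = True
    return q   # nonempty => the configuration is deadlocked

# two full queues whose head packets each wait on the other: deadlocked
assert deadlocked_set({"q1", "q2"}, {"q1": {"q2"}, "q2": {"q1"}}) == {"q1", "q2"}
```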

Definition 18. A routing function R* is connected if it establishes a path to the destination for every packet in every legal configuration.

5 DYNAMIC LOCAL REROUTING

We will start by giving the reasoning behind the dynamic local rerouting mechanism that we propose. The basic mechanism is the same regardless of whether we consider adaptive or deterministic routing and link or switch faults. We show in this section that when using our proposed rerouting mechanism, the network remains connected for $k-1$ benign faults and is livelock and deadlock free for one benign fault, regardless of whether adaptive or deterministic routing is utilized. We show in the following sections how the rerouting mechanism can be guaranteed livelock and deadlock free for $k-1$ benign faults using both deterministic and adaptive routing. However, the actual implementations of the algorithms and the proofs to achieve this differ for the two routing schemes, so they will be discussed separately.

The fault model we consider is that of permanent or transient benign faults that totally disable single ports or entire switches in the network. Tolerating the failure of the links and switches that connect processors to the fat tree will require multiple network interface cards for each processing node, and each of these nodes must be connected to multiple access points in the fat tree. This case must be solved by introducing additional hardware, rather than by rerouting, as was done with the Siamese-twin fat-tree topology [22].

Given an arbitrary source and destination, it follows from Definition 1 that every possible upward queue (from tier $l+1$ to tier l) in the upward phase of the routing algorithm leads toward a least common ancestor. Fault tolerance in the upward phase of the routing algorithm is, therefore, quite simple to achieve. Whenever the output supplied by the routing algorithm is faulty, a rerouting mechanism or adaptivity provides one of the other upward ports, which we know leads to a least common ancestor.

Fig. 2. A fat tree consisting of radix 8 switches with two link faults. The faulty links are marked as dotted lines, and the bold line describes the path of a packet from its source to its destination. Note how the packet is rerouted within the group via the U-turn switch. The numbers refer to the corresponding steps of the routing algorithm.

The downward phase is not as straightforward, because there are no alternative shortest paths from a switch connected to a faulty downward link/switch. The packet must, therefore, be rerouted on a longer path.

5.1 Link Faults

It can be seen from the fat-tree topology and Definition 10 that every switch is part of one or two one-hop switch groups, one in each direction in which it has neighbors. The group consists of 1) the switch itself and $k-1$ other switches at the same tier, and 2) k switches at one of the two neighboring tiers (Definition 10). In other words, every switch in a fat tree except a bottom-tier switch (tier $n-1$) is an upper tier switch in a switch group. Similarly, every switch in a fat tree except a top-tier switch (tier 0) is a lower tier switch (larger l) in a switch group.

When a packet encounters a faulty link in the downward phase, the path to the destination that involves only downward movement is disconnected. We must, therefore, expand the routing algorithm in the downward phase to make a larger number of paths available. Let s denote the next-hop switch that is unreachable, because of the link fault, from the switch c in which the packet currently resides (Fig. 2). The current switch c is an upper tier switch of a switch group, and s must, therefore, be a lower tier switch in the same group. We now propose to reach s by rerouting the packet within the group: first to an arbitrary lower tier switch in the group, and from there to an arbitrary upper tier switch in the group (different from c). It follows from the definition of a group that there is a link from this arbitrary upper tier switch (marked “u” in Fig. 2) down to s. The lower tier switch to which the packet is rerouted becomes a U-turn switch, because a downward-to-upward transition takes place (Definition 14). A U-turn switch recognizes a rerouted packet as a packet arriving from an upward port for which it only has upward output ports in the forwarding table, i.e., the packet is in the downward phase, but has no downward path from the switch.

Lemma 1 states that there are k disjoint paths between any two switches at the upper tier of a one-hop switch group, i.e., between c and any other switch at the same tier within the group.

Lemma 1. A switch u at the upper tier l in a one-hop group G1 at switch tiers l, $l+1$ can reach any other switch v at the same tier in G1 with a path length of two hops with fewer than $k-1$ link faults in the group.

Proof. Any switch c at the upper tier (l) of a switch group has k neighbors at the lower tier ($l+1$). Each of these neighbors is a neighbor of all switches in the group at the same tier as c. Thus, there exist k paths from c to each of the other switches in the group at the same tier, one through each of its neighbors in the group. There are no links between switches at the same tier; hence, these paths will be the shortest paths, each with a length of two hops. $k-1$ link faults are not sufficient to disconnect k disjoint paths, so there exists a path of length two hops within the group between any two switches at the upper tier of the group. □

Using this lemma, we can show that there exists a path from any nonfaulty switch in the network to any destination given $k-1$ benign link faults when using the one-hop rerouting mechanism presented above.

Lemma 2. Every switch has a path to all destinations with fewer than k benign link faults when rerouting down one tier on encountering link faults in the downward phase.

Proof. For all destinations reachable upwards from any switch, there are k links providing this connectivity. With $k-1$ link faults, one of these links will always be available, providing a path to move the packet one hop closer to the destination. For destinations reachable in the downward direction from a switch, we have the following argument: when there are $k-1$ link faults in the system, at least one of the k upper tier switches in a group with link faults will not be connected to any faulty link. We call this switch $S_t$. Let the upper tier switches of the switch group be at tier l; then the lower tier switches of the same group are at tier $l+1$. From Lemma 1, there are k disjoint paths between any two switches at the upper tier in a group. $k-1$ link faults are not enough to disconnect all these paths, and thus, any upper tier switch connected to a faulty link is able to reach $S_t$, which we know has a nonfaulty link that moves the packet one hop closer to its destination. Once the packet has reached a lower tier switch (tier $l+1$) in the group with a valid downward path, subsequent link failures encountered will be tolerated in a different group, further down in the tree. □

5.2 Switch Faults

For switch-fault tolerance, rerouting down one tier is not sufficient to avoid the faulty switch, as all paths to a specific destination d within the switch group will lead through the same switch s. However, rerouting down two tiers instead of just one will avoid the faulty switch s and achieve connectivity. In this case, we say that the faulty switch s is located at the middle tier of a two-hop switch group G2 (Fig. 3). We do not need the concept of switch groups to tolerate the failure of the top tier switches, as the failure of these need only be considered in the upward routing phase. Note that rerouting down two tiers to tolerate switch faults also implies toleration of link faults. However, for link faults between the two bottom tiers of the fat tree, it is obviously not possible to reroute down two tiers for packets located at tier $n-2$. Therefore, to guarantee full link fault tolerance, as well as switch-fault tolerance, it is necessary for the bottom-tier switches to act as U-turn switches even though the rerouting mechanism is configured for two hops. Also, recall that extra hardware is required to tolerate the failure of the bottom-tier switches.

Fig. 3. A fat tree consisting of radix 8 switches with two switch faults. The faulty switches and their links are marked as dotted lines, and the bold line describes the path of a packet from its source to its destination. Note how the packet is rerouted within the group via the U-turn switch. The numbers refer to the corresponding steps of the routing algorithm.

In order to keep track of the number of tiers the packet has been rerouted down, an extra field in the packet header, which we call the port-field, is required. This field is initially set to −1 to indicate that the packet is not rerouted. After rerouting down the first tier (from l to $l+1$), it is set to −2 to inform the next receiving switch that it is the U-turn switch. Upon forwarding the packet upwards (from $l+2$ to $l+1$), the receiving switch at the tier above the U-turn switch (tier $l+1$) records the incoming port number in the field in the packet header. This allows a packet that is returning to tier l and that encounters another switch fault to return to the same U-turn switch from which it arrived. It will first be forwarded down the same link to tier $l+1$ from which it arrived, and then, using the port-field in the packet header, it can be returned down to the U-turn switch at tier $l+2$. All U-turns performed by that packet to avoid switch faults at tier $l+1$ will be through the one U-turn switch. A sketch of this bookkeeping follows.
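Here the Packet structure and helper names are our own; only the −1/−2 convention comes from the text:

```python
NOT_REROUTED = -1   # packet follows its normal path
MARK_UTURN = -2     # rerouted down one tier; the next receiver is the U-turn switch

class Packet:
    def __init__(self, dst):
        self.dst = dst
        self.port_field = NOT_REROUTED

def mark_for_uturn(packet):
    """Set when the packet is sent down the first off-path tier (l to l+1)."""
    packet.port_field = MARK_UTURN

def record_return_port(packet, in_port):
    """At the tier-(l+1) switch above the U-turn switch: on forwarding the
    packet upwards, remember the incoming port so a later switch fault at
    tier l can send the packet back to the same U-turn switch."""
    if packet.port_field == MARK_UTURN:
        packet.port_field = in_port

def port_back_to_uturn(packet):
    """Back at tier l+1 on the way down again: the recorded port leads to
    the original U-turn switch at tier l+2."""
    assert packet.port_field >= 0, "packet has no recorded U-turn port"
    return packet.port_field

p = Packet(dst=(0, 1, 1, 0, 1))
mark_for_uturn(p)                  # first off-path hop down
record_return_port(p, in_port=3)   # switch above the U-turn switch
assert port_back_to_uturn(p) == 3
```

Lemma 3 demonstrates the same property for two-hop switch groups as Lemma 1 did for one-hop switch groups.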

Lemma 3. A nonfaulty switch u at the upper tier l in a two-hop group G2 over switch tiers l, $l+1$, $l+2$ can reach any other nonfaulty switch v at the same tier in G2 with a path length of four hops if there are fewer than k switch and link faults in the group.

Proof. The middle-tier switches in the two-hop switch group are at switch tier $l+1$ in the fat tree, and they consist of all possible permutations of the lth and $(l+1)$th digits in their n-tuples, with all other digits being equal. A digit has k possible values. Thus, when there are $k-1$ switch faults in the group, there exist k fault-free switches in the switch group at tier $l+1$, each with the same value of its lth digit and a distinct value of its $(l+1)$th digit. We call this set of switches F. Thus, following from Definition 1, there is at least one switch T at the lower tier in G2 that is fault free and connected to all switches in F. T is consequently able to reach any nonfaulty upper tier switch in G2 through a switch in F. Any nonfaulty upper tier switch in G2 can thus reach any other nonfaulty upper tier switch in G2 through T.

The case for link faults is identical. As every switch fault is associated with $2k$ failed links, the necessary paths are still available with $k-1$ link faults. □

We now continue with proving the same degree of connectivity for $k-1$ switch faults, assuming we reroute down two tiers when encountering switch faults in the downward phase.

Lemma 4. Every switch has a path to all destinations with fewer than k benign link or switch faults (assuming no faults in the bottom switch tier) when rerouting down two tiers on encountering switch faults in the downward phase.

Proof. For all destinations reachable upwards from any switch, there are k links to k different switches providing this connectivity. With $k-1$ link or switch faults, one of these links will always be nonfaulty and connected to a nonfaulty switch, providing a path to move the packet one hop closer to the destination. For destinations reachable in the downward direction from a switch, we have the following argument: when there are $k-1$ link or switch faults in the system, at least one of the $k^2$ upper tier switches in a two-hop switch group with $k-1$ switch faults at the middle and lower tiers will not be connected to any faulty switch or link. We call this switch $S_t$. Let the upper tier switches of the switch group be at tier l; then the lower tier switches of the same group are at tier $l+2$. From Lemma 3, any upper tier switch connected to a faulty switch is able to reach $S_t$, which we know is connected to a nonfaulty switch at tier $l+1$ with no faulty links, one hop closer on the path to the destination. Once the packet has reached a switch on the shortest path to the destination at a lower tier (>l), it enters a new group, so subsequent switch failures encountered will be tolerated in a different group, further down in the tree.

With at most $k-1$ link faults between the two bottom tiers of the fat tree, link fault tolerance is guaranteed, since this case reduces to rerouting around the link fault(s) in a one-hop switch group. □

5.3 The General Algorithm

A generic version of the algorithm for link faults is given below. The numbers of the points in the algorithm correspond to the numbers in Fig. 2.

1. In the upward phase toward the root, a link is provided by the forwarding table as the packet's outgoing link. If the link is faulty, it is disregarded and one of the other upward links is chosen by the rerouting mechanism, rerouting the packet.

2. In the downward phase (toward tier $n-1$), only a single link is provided by the forwarding table, namely, the link on the shortest path from the current switch to the destination. If this link is faulty, the following actions are performed:

   a. If the packet came from the switch tier above (from l to $l+1$), another arbitrary downward link is selected by the rerouting mechanism, which forces the packet to be rerouted. This will result in the packet being forwarded as long as there are fewer than k link faults.

   b. If the packet came from the switch tier below (from $l+1$ to l), the packet is rerouted back down to the same lower tier ($l+1$) switch again. This is necessary to avoid livelock.

3. A switch that receives a packet from a link connected to an upper tier for which it has no downward path is a U-turn switch.

4. The U-turn switch chooses a different upward port, one that has not been previously tested, through which to forward the packet.

5. If all upward ports from the U-turn switch have been tested, the path is disconnected and the packet must be discarded.

Which output port is chosen from the U-turn switch in point 4, and how the switch knows that all upward ports have been tested in points 4 and 5, is specific to the deterministic and adaptive routing schemes and will be detailed in the next sections. We have shown that this algorithm is connected, and Sem-Jacobsen et al. [27] show that it is deadlock and livelock free with one link or switch fault without the use of additional resources.
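As an illustration, the per-switch decision might be sketched as follows (our code, over hypothetical switch, packet, and table structures; how "previously tested" ports are tracked is scheme-specific and is resolved in the next two sections):

```python
import random

def forward(switch, packet, table):
    """Steps 1-2 of the generic algorithm at one switch."""
    out = table.lookup(switch, packet.dst)
    if not switch.is_faulty(out):
        return out
    if out in switch.up_ports:                 # step 1: any nonfaulty up-port
        return random.choice(                  # is still minimal
            [p for p in switch.up_ports if not switch.is_faulty(p)])
    if packet.came_from_above:                 # step 2a: reroute one tier down
        return random.choice(
            [p for p in switch.down_ports
             if not switch.is_faulty(p) and p != out])
    return packet.in_port                      # step 2b: bounce straight back down

def uturn(switch, packet):
    """Steps 3-5: a switch with no downward entry for a packet that arrived
    from above is the U-turn switch."""
    untested = [p for p in switch.up_ports if p not in packet.tested_ports]
    if not untested:
        return None                            # step 5: discard the packet
    choice = untested[0]                       # step 4: try a fresh up-port
    packet.tested_ports.add(choice)
    return choice
```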

6 DETERMINISTIC ROUTING

We have shown that the rerouting mechanism presented in the previous section is sufficient to guarantee a connected path from every switch to every destination. However, we have yet to consider how we can be guaranteed to find this path in the presence of $k-1$ faults. We will now give a deterministic routing algorithm which is able to find the connected path while remaining livelock and deadlock free.

The basic algorithm consists of regular upward-downward fat-tree routing combined with rerouting in the downward phase around the link faults. To ensure that the routing algorithm is deadlock free, a strict ordering must be imposed on the sequence of upward links (ports) to test from a U-turn switch when a packet is being rerouted. In the rest of this section, we call this sequence D.

6.1 Link Faults

$k-1$ link fault tolerance requires the use of an additional virtual layer, the rerouting layer, in which packets will be forwarded while they are being rerouted. A virtual layer, VL, is a set of virtual channels such that every bidirectional link has a bidirectional VC in the virtual layer. For simplicity, we assume that VC1 on all links belongs to VL1, the normal layer (NL), and VC2 on all links belongs to VL2, the rerouting layer (ML).

Based on this, the deterministic dynamic fault-tolerant algorithm, $R^{link}_{deterministic}$, for tolerating up to and including $k-1$ link faults is presented below as additions to the generic algorithm presented in the previous section. The steps of this algorithm are marked in Fig. 2 by numbers corresponding to the steps in the algorithm.

$R^{link}_{deterministic}$:

1. The packet is forwarded in NL.

2. ...:

   a. The packet is forwarded in NL.

   b. The packet is forwarded in ML.

3. All subsequent forwarding of this packet through this U-turn switch will take place in ML.

4. ...:

   a. If the packet arrives in NL, the U-turn switch chooses the first upward link in the testing sequence D.

   b. If the packet arrives in ML, the U-turn switch chooses the next upward link in the testing sequence D after the port through which the packet arrived, and subsequently forwards the packet up this link in ML. This is always possible with $k-1$ faults, since D covers all k upward ports of the U-turn switch.

5. A U-turn switch that receives a packet from an upward link that is the last link in the sequence D discards the packet.

After reaching point 5 of the algorithm, the switch can inform the source node of the failure to prevent it from transmitting more packets. This can only happen when the network contains more than $k-1$ link faults. The U-turn port and layer selection is sketched below.
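A sketch of ours, where D is the switch's fixed testing sequence over its k upward ports:

```python
NL, ML = 1, 2   # virtual layers: VC1 = normal layer, VC2 = rerouting layer

def uturn_choice(D, in_port, in_layer):
    """Return (up_port, layer) for a rerouted packet at a U-turn switch,
    or (None, None) when the packet must be discarded (step 5)."""
    if in_layer == NL:
        return D[0], ML          # step 4a; later hops here use ML (step 3)
    i = D.index(in_port)
    if i == len(D) - 1:
        return None, None        # arrived on the last link of D: discard
    return D[i + 1], ML          # step 4b: next link of D after the arrival port

assert uturn_choice([0, 1, 2, 3], in_port=1, in_layer=ML) == (2, ML)
```

With k = 4 and D = [0, 1, 2, 3], a packet re-entering the U-turn switch in ML on port 1 is next sent up port 2; it is discarded only after returning on port 3, which cannot happen with fewer than k faults.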

Local dynamic rerouting with one-link fault tolerance can be achieved with a simpler version of the algorithm, given that we know that any path taken by the packet after the U-turn switch is fault free. Points 4 and 5 in $R^{link}_{deterministic}$ can, therefore, be ignored in this case. An arbitrary upward link can be used from the U-turn switch in point 3; there is no need for a test sequence. In addition, there is no need for additional virtual channels, because a deadlock requires at least two U-turns in the network, as we proved in [27]. We can guarantee the existence of only one U-turn by deterministically selecting the same downward rerouting link on encountering the link fault. All rerouted packets will always reach the same U-turn switch through the same downward link, and thus there is only one U-turn. Livelock freedom and $k-1$ link fault tolerance for $R^{link}_{deterministic}$ have previously been proved in [27], and deadlock freedom will be shown in the next section.

6.2 Switch Faults

In order to tolerate switch faults, we need to modify the packet header to include the port-field, as we have indicated previously. This is required to be able to efficiently reroute down two tiers. Furthermore, to guarantee that this will remain deadlock free, we require three virtual channels/layers, one more than for link faults. The deterministic dynamic switch-fault-tolerant algorithm, $R^{switch}_{deterministic}$, for tolerating up to and including $k-1$ switch faults is presented below. The steps of this algorithm are marked in Fig. 3 by numbers corresponding to the steps in the algorithm.

$R^{switch}_{deterministic}$:

1. In the upward phase toward the root, one link isprovided by the forwarding table as the packet’soutgoing link. If the switch connected to the link isfaulty, it is disregarded and one of the other upwardlinks are chosen by the rerouting mechanism,rerouting the packet. The packet is forwarded in NL.

2. In the downward phase (toward tier n� 1), only asingle link is provided by the forwarding table;namely, the link on the shortest path from thecurrent switch to the destination.

a. If the packet arrives from below in ML2,forward the packet downwards in ML2.

b. If the packet arrives from above in ML2, forwardthe packet downwards in ML1.

c. If the packet arrives in ML1, forward the packetdownwards in NL.

d. If the packet arrives in NL, forward the packetdownwards in NL.

3. If the switch connected to this downward link isfaulty, the following actions are performed:

a. If the packet came from the switch tier above (from l to l+1), another arbitrary downward link is selected, which forces the packet to be rerouted. This will result in the packet being forwarded as long as there are fewer than k link faults.

i. If the packet is in ML2, forward it in ML1.

ii. If the packet is in ML1 or NL, forward it in NL.

b. If the packet came from the switch tier below (from l+1 to l), the packet is rerouted back down to the same lower tier (l+1) switch again. This is necessary to avoid livelock. The packet is forwarded in ML2.

4. A switch that receives a packet from a port connected to a switch at the tier above, for which it has no downward path, takes the following actions:

a. If the port-field equals -1, another arbitrary downward link is selected, which forces the packet to be rerouted. This will result in the packet being forwarded as long as there are fewer than k switch faults. The packet is forwarded in NL.


b. If the port-field equals -2, or the port-field equals -1 and the switch is a bottom-tier switch (at tier n - 1), this switch is a U-turn switch.

i. If the packet is in the normal channel, forward the packet upwards through the first port specified by the rerouting sequence D from which the packet did not arrive. As long as there are fewer than k switch faults, one of these ports will be connected to a fault-free switch.

ii. If the packet is in ML1, forward the packet through an upward output port in ML1 following the incoming port in the rerouting sequence D.

iii. If the incoming port is the final output in the rerouting sequence D and the incoming layer is ML1, discard the packet. All upward links of the U-turn switch have been tested and there is no available path to the destination from this switch. This may only happen when there are k or more faults.

c. If the port-field does not equal either -1 or -2, forward the packet down through the port indicated by the port-field and set the port-field to -2. The packet is forwarded in ML1.

5. A switch that receives a packet from a port connected to the lower tier in which the port-field equals -2 stores the incoming port number in the port-field and forwards the packet in ML2 through one of its upward links. The algorithm is then repeated from step 2.

After reaching point 4biii of the algorithm, the switch can inform the source node of the failure to prevent it from transmitting more packets. This may only happen when the network contains more than k - 1 link and switch faults.

If we consider only one switch fault, it is sufficient that this field only contains one bit to indicate whether the packet is rerouting down the first or the second tier (from l to l+1, or from l+1 to l+2). Furthermore, any packet will be rerouted only once before it reaches its destination; hence, there can be no livelock. Thus, no modifications are required in order to ensure freedom from livelock apart from the rerouting bit in the packet header, not even a reroute vector in the switches.

Just as for link faults, R^switch_deterministic allows transparent reutilization of previously failed elements. The proof of connectivity and livelock freedom is available in [27].
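Our reading of the port-field bookkeeping in steps 4c and 5 can be summarized in a small sketch (hypothetical names; the layer transitions are omitted):

```python
from dataclasses import dataclass

@dataclass
class Packet:
    port_field: int = -1  # -1: not rerouting; -2: rerouting down two tiers;
                          # >= 0: port to descend through later (step 4c)

def on_upward_from_lower_tier(pkt: Packet, incoming_port: int) -> None:
    """Step 5: a switch receiving a packet from below with port-field == -2
    remembers the incoming port before the packet continues upwards."""
    assert pkt.port_field == -2
    pkt.port_field = incoming_port

def on_downward_with_stored_port(pkt: Packet) -> int:
    """Step 4c: descend through the remembered port (in ML1) and re-arm
    the field to -2 so that a further fault can still be handled."""
    port = pkt.port_field
    assert port >= 0
    pkt.port_field = -2
    return port
```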

6.3 Deadlock Freedom

In this section, we show that R^link_deterministic is deadlock free. The method for showing that R^switch_deterministic is deadlock free is similar, except that three virtual layers are required. This proof is available in [27]. As we have indicated, to guarantee that R^link_deterministic is deadlock free with more than one fault, we require the use of two virtual layers: 1) a normal virtual layer (NL) where most of the packet forwarding takes place, and 2) a rerouting layer (ML) that contains the packets currently being rerouted. The transition from NL to ML takes place in the U-turn switches. Each time a packet performs a U-turn, it enters ML. The transition from ML back to NL takes place in those lower tier switches in the switch group that are on the packet's path to the destination. In other words, the transition back to NL takes place when the rerouted packet has reached the switch at the bottom end of the failed link around which the packet was initially rerouted.

An obvious property of a cyclic dependency chain is that if each channel in the cycle were assigned a number larger than the previous channel in the cycle, there would, sooner or later, be a dependency from a channel with a high sequence number to a channel with a lower sequence number. Consequently, if it is possible to assign numbers to all channels in the network in such a way that any packet traversing the network always moves to a channel with a higher sequence number than the previous channel, there can never be any cycle.
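Stated compactly (our restatement of this standard observation): if a cyclic dependency chain c_1 → c_2 → ... → c_m → c_1 existed while the numbering f satisfied f(c_{i+1}) > f(c_i) on every hop, transitivity would force

\[
f(c_1) < f(c_2) < \cdots < f(c_m) < f(c_1),
\]

which is impossible; hence no cyclic dependency chain, and thus no deadlock, can form.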

In order to show that R^link_deterministic is deadlock free, we introduce a numbering scheme which we apply to all channels in the fat tree. It is important to note that this construct is only necessary for the deadlock freedom proof; such a numbering scheme need never be assigned to the actual links in the actual network.

The fat tree consists of four types of channels: upward and downward channels in NL, and upward and downward channels in ML. Let N^u_s be the set of upward channels in NL at link tier s and N^d_s the set of downward channels in NL at link tier s. M_s is the set of upward and downward channels at tier s in ML, and M^u_s and M^d_s are the subsets of the upward and downward channels, respectively. u' is the sequence number assigned to the channel u. The assignment u' = j, where j is an integer greater than -1 and u ∈ N ∪ M, assigns the number j to the channel u. The number of link tiers in the network is n - 1, and s is the link-tier number (in [0, n-2]). The tier number of a link is the same as the tier number of the switch connected to its upward end. Hence, the link-tier number of the links connected to the root switches at tier 0 is also 0, and the link-tier number of the links connecting the leaf switches at tier n - 1 to the switches at tier n - 2 is n - 2. We number the channels as follows (an illustrative sketch of one consistent assignment follows the list):

1. u' = n - 2 - s, for all u ∈ N^u_s, s ∈ {0, ..., n - 2}.

All upward channels in NL are assigned the reversed indexing of the link tiers.

2. u' = v' + 1, for all u ∈ N^d_0 and all v ∈ N^u_0.

All downward channels in NL at the topmost link tier are assigned a number that is one unit greater than the topmost upward channels in NL.

3. u' > v', for all u ∈ N^d_s and all v ∈ M_{s-1} ∪ N^d_{s-1}, s > 0.

Every downward channel at tier s in NL is given a number that is larger than any downward channel in NL and any channel in ML of all tiers l higher than s (l < s).

4. u' = v' + 2i + 1 and w' = v' + 2i + 2, for every upward channel u and downward channel w in the rerouting sequence D at tier s (u ∈ M^u_s, w ∈ M^d_s), where i is the index of u in the sequence D, i ∈ {0, ..., k - 1}, and v' is the largest number assigned to any v ∈ N^d_s.

Every rerouting channel is given a number larger than the normal downward channels at tier s, in an increasing order from v' + 1 corresponding to the index of the rerouting channel in the rerouting sequence D from the U-turn switch. For instance, assuming that the largest sequence number of all downward channels in NL at tier s is 10, the first upward rerouting channel in the sequence to be tested from a U-turn switch at tier s is 11, the first downward rerouting channel is 12, the second upward rerouting channel is 13, and so forth.
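The sketch below (ours, not part of the scheme's definition) computes one consistent assignment; since rule 3 only requires "larger than", we take the smallest admissible value, as in the proof of Lemma 5:

```python
# Sketch of one consistent instantiation of numbering rules 1-4 for a
# k-ary n-tree (our illustration).

def channel_number(n, k, s, direction, layer, i=0):
    """Sequence number of a channel at link tier s (0 <= s <= n-2).

    direction: 'up' or 'down'; layer: 'NL' or 'ML';
    i: index of the channel in the rerouting sequence D (ML only).
    """
    if layer == "NL":
        if direction == "up":
            return n - 2 - s                      # rule 1
        # rules 2 and 3: one above everything at the tiers closer to the root
        return (n - 2) + s + s * 2 * k + 1
    base = channel_number(n, k, s, "down", "NL")  # largest NL number at tier s
    if direction == "up":
        return base + 2 * i + 1                   # rule 4, upward channel
    return base + 2 * i + 2                       # rule 4, downward channel
```

For the 2-ary 3-tree of Fig. 4 (n = 3, k = 2), this gives 1 and 0 for the upward NL channels at tiers 0 and 1, 2 and 7 for the downward NL channels, and 3, 4, 5, 6 for the tier-0 rerouting sequence, matching the example under rule 4.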

We now show that the proposed numbering scheme provides all channels in the network with a single number.


Lemma 5. Every channel in the fat tree can be assigned one and only one number which satisfies all four points of the numbering algorithm.

Proof. We consider first the channels in NL. Every channel in the fat tree belongs to a single link tier. Rule 1 guarantees that all upward channels in NL are assigned a number. The only requirement on these numbers is that they be in the reverse order of the link-tier indexing, which is trivial to guarantee. There is only one possible assignment for any link.

For the downward channels in NL, rules 2 and 3 guarantee that every channel is assigned a number. All downward channels at the topmost link tier have only one possible assignment, as per rule 2. From point 4, the maximum number assigned to the rerouting channels at a tier l is limited by the number of channels in the rerouting sequence, which is 2k. The maximum number assigned to any channel at a tier above tier l is n - 2 + l + 2kl (the numbers assigned to all channels in N^u, plus the number assigned to the downward channels at each tier, plus all the numbers assigned to rerouting channels at each tier above l). Any downward channel at tier l in NL can, therefore, be assigned a number that satisfies rules 2 and 3, namely one greater than n - 2 + l + 2kl. There is obviously only one such number for any link at any link tier.

Similarly, rule 4 guarantees that all rerouting channels at all tiers are assigned a number. Rule 2 guarantees that all top-tier downward channels in NL are assigned the same number, n - 1. Assigning the number n + i to consecutive channels with index i in the rerouting sequence D, combined with the fact that any rerouting channel belongs to the rerouting sequence of one and only one U-turn switch, guarantees that we conform to rule 4 for the top-tier rerouting channels and that every rerouting channel is assigned exactly one number. For rerouting channels at subsequent lower link tiers, the same reasoning applies, except that the number of the downward channels in NL at each tier l is one greater than 2kl + l + n - 2, as shown previously. ∎

This numbering scheme is illustrated in Fig. 4 for a small 2-ary 3-tree. There is one figure for each of the upward and downward channels in the normal and rerouting layers. Fig. 5 illustrates the path followed by a flow of packets from the node marked "source" to the node marked "destination." The dotted links carry packets of the flow upwards in the network, dashed links carry packets downwards, and the dashed-and-dotted link carries packets of the flow both ways. The link marked "x" is faulty, and since this is on the deterministic path, the packet must be rerouted. Since k = 2, packets must, therefore, backtrack down one tier toward the source to the U-turn switch, as illustrated by the dashed-and-dotted line. The numbers indicate the number of the output channel that the packet occupies from the switch, according to our numbering scheme. It is clear that the rerouted packets only cross channels with increasing numbers, thereby guaranteeing that the routing algorithm will never deadlock.

Before we proceed with the formal proof of deadlock freedom, we must consider one final corollary, which shows that any upper tier switch in a switch group will have the same index in the rerouting sequence D for any U-turn switch in that group. This means that all upward channels in the rerouting layer to a given switch will have the same sequence number i, and all downward channels in ML from this switch will have the same sequence number i + 1 (this follows from rule 4).

Corollary 1. All upward links to an upper tier switch u in a switch group G have the same index i in the rerouting sequence D of all U-turn switches in G.

Proof. Consider a switch group G with its upper tier at tier l, and its lower tier at tier l + 1. From Definition 1, all switches in G are connected to port w_l + k of the switches at tier l + 1, where w_l is the lth tuple of the switch n-tuple. Any upper tier switch is, therefore, connected to the same port number on all the lower tier switches, and these will have the same index in the rerouting sequence D for all possible U-turn switches in the group. ∎

We proceed with the proof of deadlock freedom for the routing algorithm.


Fig. 4. Channel numbering in a fat tree for link fault tolerance. Numbering of the different channels in the normal and rerouting layers, assuming that the ordered sequence D tests the upward ports from left to right. (a) Normal upward channels. (b) Normal downward channels. (c) Upward channels in the reroute layer. (d) Downward channels in the reroute layer.

Fig. 5. The path followed by packets from source to destination. The link marked "X" is faulty, and since this is on the deterministic path, the packet must be rerouted. The dotted links carry packets of the flow upwards in the network, dashed links carry packets downwards, and the dashed-and-dotted link carries packets of the flow both ways. The numbers indicate the number of the channel that the packet occupies out from the switch according to our numbering scheme, the underlined numbers being in the reroute layer. The packet path is indicated by the arrowed line.

We consider here the dependencies between channels in the network. Since every channel in the network has its own number, it is sufficient to show that the next-hop channel for any packet in any node in the network has a number that is larger than the sequence number of all channels that the packet might previously have traversed. In any dependency chain that could close a cycle, some next-hop channel would necessarily be assigned a number that is lower than or equal to that of the current channel, which the numbering rules preclude.

Theorem 1. R^link_deterministic is deadlock free.

Proof. We give all channels in the network a sequence number conforming to the above rules. The routing algorithm is deadlock free if we can show that it is impossible to move to a channel that has a sequence number lower than or equal to that of the current channel.

There are four specific situations for any packet forwarded in the network.

1. The packet is forwarded upwards in NL: According to rule 1, all upward channels at the next link tier upwards have a larger sequence number than the current channel.

2. The packet is forwarded downwards in NL: According to rules 2 and 3, all downward channels at the next link tier downwards have a larger sequence number than the current channel, regardless of whether the current channel is a normal channel or a rerouting channel.

3. The packet is forwarded upwards in ML: This only takes place in the downward routing phase. According to rule 4, all rerouting channels at the current link tier have a larger sequence number than any downward normal channel at the same tier, and all rerouting channels at the tiers above. In addition, the next rerouting channel always has a larger sequence number than the current rerouting channel, following the ordered sequence D and rule 4.

4. The packet is forwarded downwards in ML: There are two possibilities. Either the packet must return to the U-turn switch, or it has a fault-free downward path toward the destination. We call the current switch z. z has the same position in the rerouting sequence D for all U-turn switches in the switch group. Thus, all upward rerouting channels in the switch group connected to z will have the same sequence number, as will all the downward rerouting channels in the switch group connected to z (rule 4; see Fig. 4 for an example). The sequence number of the downward rerouting channel from z, whether it runs back to the U-turn switch or toward the destination, must therefore be larger than the sequence number of any upward rerouting channel that might have led a packet to z.

No packet forwarded in the network will ever reach a channel that has a sequence number that is lower than or equal to that of any channel it has previously traversed, and R^link_deterministic is deadlock free. ∎

It is worth noting that resuming the use of a previously failed link once it is restored to service is also deadlock free, because the numbering scheme that we have utilized here remains intact.

7 ADAPTIVE ROUTING

We have given deterministic routing functions capable of tolerating k - 1 benign faults while remaining both livelock and deadlock free. The nature of the deterministic routing algorithm required us to use one or two extra virtual channels on all links to guarantee deadlock freedom. When we now consider adaptive routing, the adaptivity gives us an extra degree of freedom, removing the necessity of using virtual channels to maintain deadlock freedom.

Whereas deterministic routing required a strict ordering of the paths to be tested from the U-turn switches, we can use the greater degree of freedom afforded by adaptive routing to simply ensure that we test all possible paths once, but in any order, allowing maximum adaptivity. To facilitate this, a single rerouting vector of length k bits is required in the switches, with one bit for each of the upward links to be tested from the U-turn switch. By setting the bit in the vector that corresponds to the link on which a rerouted packet arrives, the U-turn switch can keep track of which upward links lead to faulty paths and which are not connected to any faults, regardless of the packet destination. A point of interest is that this misrouting vector can also be applied to the deterministic routing function in order to avoid repeated testing of paths in the rerouting sequence that are faulty.

A time-out mechanism is required to reset the rerouting vector after a period of U-turn inactivity in order to be able to tolerate transient faults. Since the network operates correctly with the fault present, even if it is actually repaired, this time-out can be quite long if desired. Note that the use of the reroute vector is only required to guarantee the tolerance of more than one fault. If only one-fault tolerance is required, it is sufficient to choose an arbitrary upward link from the U-turn switch.
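As an illustration, a per-switch reroute vector with such a time-out might look as follows (a sketch under our naming; the time-out value is an arbitrary assumption):

```python
import time

class RerouteVector:
    """Per-switch k-bit vector marking upward links that lead to faulty
    paths (our sketch, not the authors' implementation)."""

    def __init__(self, k, timeout_s=60.0):
        self.k = k
        self.timeout = timeout_s
        self.bits = set()                    # ports whose bit is set
        self.last_uturn = time.monotonic()

    def mark(self, port):
        """Set the bit for the link a rerouted packet arrived on."""
        self._maybe_reset()
        self.bits.add(port)
        self.last_uturn = time.monotonic()

    def candidates(self):
        """Upward ports not currently known to lead to faulty paths."""
        self._maybe_reset()
        return [p for p in range(self.k) if p not in self.bits]

    def _maybe_reset(self):
        # After a period of U-turn inactivity, forget the marks so that
        # repaired (transient) faults can be used again.
        if time.monotonic() - self.last_uturn > self.timeout:
            self.bits.clear()
```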

7.1 Link Faults

Using the foregoing reasoning as a basis, the adaptive dynamic fault-tolerant algorithm, R^link_adaptive, for tolerating up to and including k - 1 link faults is presented below. The steps of this algorithm are marked in Fig. 2 by numbers corresponding to the steps in the algorithm.

R^link_adaptive:

1. . . .

2. . . .

a. . . .

b. . . .

3. The U-turn switch sets the bit in its rerouting bit vector that corresponds to the link on which the packet arrived (it is not necessary to try to use this link for forwarding the rerouted packet, because we know the path is broken).

4. The U-turn switch adaptively chooses one of its upward links that does not have its corresponding bit set in the bit vector and subsequently forwards the packet up this link. If there are fewer than k link faults, fewer than k of the bits will be set. Therefore, the packet will be forwarded as long as there are fewer than k link faults. The algorithm is then repeated from step 2.

5. A U-turn switch that receives a packet from an upward link in which all other bits in the rerouting vector are set discards the packet.

In the final stage of the algorithm, the U-turn switch can choose to inform the source node of its inability to find a correct path, in order to prevent it from transmitting more packets.
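A compact sketch of steps 3-5 at a U-turn switch follows (ours; the random choice merely stands in for whatever selection function the switch implements):

```python
import random

def adaptive_uturn(marked_ports, arrival_port, k):
    """marked_ports: set of upward ports whose bit is set in the rerouting
    vector. Returns the upward port to forward on, or None to discard."""
    marked_ports.add(arrival_port)     # step 3: this path is known broken
    candidates = [p for p in range(k) if p not in marked_ports]
    if not candidates:
        return None                    # step 5: every upward path tested
    return random.choice(candidates)   # step 4: any untested upward link
```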

One important feature of the above algorithm is that, relative to the original adaptive routing for the fault-free case, only the rerouting in step 2 and the selection function (steps 1, 3, and 4) need to be modified. The only additional functionality that is needed for temporary faults is a timer that zeroes out the reroute vector after a given time interval. Note that this was not required for the deterministic routing algorithm, since it did not store any state information in the switches.

Fig. 6 shows 1) a more detailed view of rerouting in a switch group that has multiple faults and 2) the state of the rerouting vector. The connectivity, and livelock and deadlock freedom, for the algorithm with k - 1 link faults are shown in [27].

7.2 Switch Faults

Switch-fault tolerance with adaptive routing is more complex than link-fault tolerance with adaptive routing because of the number of possible paths that can be taken by a packet. Hence, in addition to rerouting down two tiers, which by Lemma 4 guarantees connectivity with fewer than k switch faults, we require the port-field as for deterministic routing.

Below we list the routing algorithm R^switch_adaptive for tolerating k - 1 benign switch faults. R^switch_adaptive is illustrated in Fig. 7. The fault tolerance for k - 1 switch faults is guaranteed by the fact that every possible U-turn switch u at the lower tier in the two-hop switch group has at least one upward link that leads to a switch in the set F, which again is connected to the switch T, as defined in the proof of Lemma 3. T is not connected to any faulty switch through its downward links, and it will therefore never reroute any packet down to u. Consequently, there will always be one upward port in any U-turn switch for which the corresponding bit in the reroute vector will always be zero.

R^switch_adaptive:

1. In the upward phase (toward tier 0), all upward links are supplied by the routing function and one is selected as the packet's outgoing link. If any of the links are faulty, the selection function simply does not choose them. This will result in the packet being forwarded as long as there are fewer than k link faults. When the packet has reached a switch with the packet destination in its subtree, the upward phase ends and is followed by the downward phase.

2. In the downward phase (toward tier n - 1), only one link is supplied by the routing algorithm; namely, the link on the shortest path from the current switch to the destination. If this link is faulty, the following actions are performed:

a. If the packet came from the switch tier above (from l - 1 to l), another arbitrary downward link is selected, which forces the packet to be rerouted. This will result in the packet being forwarded as long as there are fewer than k link faults.

b. If the packet came from the switch tier below (from l + 1 to l), the packet is rerouted back down to the same lower tier (l + 1) switch again. This will make a rerouted packet in a switch group with more than one fault follow the path illustrated in Fig. 6. This is necessary to avoid livelock.

3. A switch that receives a packet from a port connected to a switch at the tier below, and which has a fault-free shortest-path downward link, ensures that the port-field is -1 and forwards the packet downwards.

4. A switch that receives a packet from a port connected to a switch at the tier above, for which it has no downward path, takes the following actions:

a. If the port-field equals -1, it is set to -2 and another arbitrary downward link is selected, which forces the packet to be rerouted farther down. This will result in the packet being forwarded as long as there are fewer than k link faults.

b. If the port-field equals -2, this switch is a U-turn switch. The switch sets the bit in its rerouting bit vector that corresponds to the link on which the packet arrived (it is not necessary to try to use this link for forwarding the rerouted packet, because we know the path is broken).


Fig. 6. A combination of k - 1 faults with rerouting in a group in a k = 4 network. The terms "ingress" and "egress" refer to the switches where the packet enters and leaves the switch group, respectively. The numbered U-turns refer to the corresponding states of the reroute vector (listed next to the U-turn switch) as the U-turns are performed.

Fig. 7. Rerouting around a switch fault in a 2-ary 5-tree. The bold, whole links show the path from the source to the destination, rerouting around a fault both in the upward and downward direction. The switches that are denoted as open circles with their corresponding dashed links have failed, and the switches and links with the gray outlines belong to a two-hop switch group. The numbers correspond to the points in the routing algorithm.

c. If the switch is a bottom-tier switch (at tier n - 1), it becomes a U-turn switch and sets the bit in its rerouting bit vector that corresponds to the link on which the packet arrived, to guarantee link-fault tolerance. The port-field is set to -1, and the switch adaptively chooses one of its upward links to forward the packet.

d. If the port-field does not equal either -1 or -2, forward the packet down through the port indicated by the port-field, which is reset to -2.

5. The U-turn switch adaptively chooses one of its upward links that does not have its corresponding bit set in the bit vector and subsequently forwards the packet up this link. If there are fewer than k link faults, fewer than k of the bits will be set. Therefore, the packet will be forwarded as long as there are fewer than k link faults.

6. A switch that receives a packet from a port connected to the lower tier in which the port-field equals -2 stores the incoming port number in the port-field and adaptively chooses one of its upward links. The algorithm is then repeated from step 2.

7. A U-turn switch that receives a packet from an upward link in which all other bits in the rerouting vector are set and the port-field is -2 discards the packet. All upward links of the U-turn switch have been tested and there is no available path to the destination from this switch. This may only happen when there are k or more faults.

The connectivity and deadlock freedom properties of this algorithm for k - 1 switch faults are presented in [27].

8 ENDPOINT DYNAMIC REROUTING

Because of the lack of any other dynamic local rerouting algorithms to compare with in the evaluation presented in the next section, we compare our dynamic local rerouting algorithms, deterministic dynamic local rerouting (DDLR) and adaptive dynamic local rerouting (ADLR), with a fault-tolerance algorithm that we call deterministic dynamic reconfiguration (DDRC). DDRC is one of the more straightforward ways of tolerating faults dynamically in a deterministically routed fat tree. It can be used for both source routing and distributed routing. For source routing, every network element that is connected to the element that fails broadcasts information about the failure to all processors in the network. Each processor can then remove all paths that involve the faulty element from its routing table, or remove the faulty element from its network graph and compute new routing tables using standard fat-tree routing, thus avoiding the fault. Distributed routing tables in interconnection networks are usually set up by a management unit. In this case, the nodes that are connected to the failure send information to the management unit, which calculates new tables without the faulty elements and uploads these to the switches.
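For source routing, the DDRC reaction can be pictured as follows (a sketch; `compute_fat_tree_path` stands in for any standard fat-tree routing function and is not defined in the paper):

```python
# Hypothetical sketch of the per-node DDRC reaction under source routing.

def on_fault_broadcast(path_table, failed_element, compute_fat_tree_path):
    """path_table maps each destination to its list of network elements.
    On receiving the broadcast, drop every path that crosses the failed
    element and recompute it on the pruned network graph."""
    for dst, path in list(path_table.items()):
        if failed_element in path:
            path_table[dst] = compute_fat_tree_path(
                dst, exclude={failed_element})
```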

8.1 Hybrid Dynamic Rerouting

Since DLR and DDRC have two distinctly different methods of operation, it is possible to combine the two into a hybrid approach. When both methods are used simultaneously, DLR can reroute packets while the processing nodes or a management unit is being informed of the fault event. This will reduce the number of packets lost per fault. However, the probability that the routing algorithm will keep the network connected is dictated by DDRC. The reason is that all paths that contain faults will be removed by the algorithm. Therefore, DLR will only operate in the interval between 1) the faults being discovered and 2) the routing tables of the switches being updated by the management unit and all packets that followed the old routes being drained from the network. For this hybrid approach, all performance metrics that we evaluate in the next section are the same as for DDRC, except for packets sunk per link fault, which will be the same as for DLR. Therefore, it will not be evaluated through any specific simulations.

9 EVALUATION

In this section, we evaluate the performance of the proposed dynamic local rerouting method (DLR) with both deterministic routing (DDLR), based on R^link_deterministic, and adaptive routing (ADLR), based on R^link_adaptive. The most prominent feature of the DLR mechanisms is the speed of their response to failures, which results in very few packets being lost. However, the price of this almost instantaneous fault tolerance is paths in the network that are less optimal, and thus reduced performance. We compare the DLR mechanisms against the deterministic dynamic reconfiguration method (DDRC) that we presented in Section 8. We will use an implementation of DDRC close to that which can be achieved in InfiniBand. Upon discovering a faulty port, the switch that makes the discovery sends a message to the InfiniBand subnet manager to inform it of the failure. The subnet manager then generates new routing tables and distributes these to all switches in the network.

We evaluate the effects that the two paradigms have on performance, to illustrate the difference between using optimal paths (DDRC) and suboptimal paths combined with deterministic (DDLR) or adaptive routing (ADLR). We compare the three methods with respect to performance degradation with link faults; hence, DDRC is configured with only one virtual channel, even though DDLR in fact requires two. In this manner, we can use the performance degradation as a measure of the cost of the fast failover of ADLR and DDLR compared to DDRC. We also compare the probabilities that the three algorithms will keep the network connected beyond their guaranteed connectivity limit of k - 1 faults. For this comparison, the probability will be the same for DDLR and ADLR, as they both rely on the same rerouting mechanism to maintain connectivity.

We have only included the evaluation for link faults in this paper. Switch faults will cause a more severe performance degradation than link faults, because more paths are taken out of commission by a single fault and the reroute path is longer than for link faults. However, the general conclusions we draw at the end of the evaluation are valid for both types of faults.

9.1 Simulation Environment

We have conducted a series of network simulations in which the experiments were run with an increasing number of link faults (from 0 to 10) in order to examine how the methods perform as the network degrades. The simulations were performed in an event-driven simulator developed in-house at Simula Research Laboratory. It is based upon the J-sim framework [28]. In our evaluation, the link bandwidth was 2.5 Gb/s. However, the links used the 8b/10b encoding scheme; hence, the maximum effective bandwidth for data traffic was limited to 2 Gb/s. The size of a virtual output queue for a VC is 512 bytes, which is sufficient to allow buffering of up to two 256-byte packets, which in turn is close to the packet size/buffer space ratio that is often found in real hardware. The send queue of the processing nodes was set at 1.3 MB, which is sufficiently large to inject the required traffic into a variably loaded network.

Two topologies were simulated: the 4-ary 3-tree and the 2-ary 6-tree. These consist of 48 and 192 switches, respectively, with 64 computing nodes in each case. They have a moderate number of network elements, which allows the required quantity of simulations to be run within a reasonable time. Furthermore, the effects of faults in these small-scale networks represent upper bounds on the impact on performance in larger networks, because the impact of any single fault in a large network will be smaller than in a small network. Regarding the traffic pattern, both uniform and hot-spot traffic were evaluated, but only the results for uniform traffic have been included, as these clearly illustrate the performance of the algorithms. Packet generation was governed by an approximation of the Poisson distribution. Further, the packet size was 256 bytes. The link-transfer rate was 128 bytes per cycle. With an effective link bandwidth of 2 Gb/s, this gives 1 second = 1,953,125 simulation cycles.
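The cycle conversion follows directly from these parameters:

\[
\frac{2\ \mathrm{Gb/s}}{8\ \mathrm{bits/byte}} = 250 \times 10^{6}\ \mathrm{bytes/s},
\qquad
\frac{250 \times 10^{6}\ \mathrm{bytes/s}}{128\ \mathrm{bytes/cycle}} = 1{,}953{,}125\ \mathrm{cycles/s}.
\]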

Two load cases were evaluated for increasing numbers of link faults. The exact loads chosen for the evaluations are marked by the vertical lines in Fig. 8 for uniform traffic: 1) saturated, which was slightly above the saturation point (relative to the fault-free network), and 2) unsaturated, where the load was 30 percent below the saturation point. The figure shows the throughput and latency for an increasing network load for the two traffic distributions, with adaptive and deterministic routing for both tested topologies. The figure clearly shows when the networks saturate, and the vertical lines mark the selected unsaturated and saturated loads we use in the following experiments.

Each data point in the performance figures is the average of 500 independent simulation experiments. For each experiment, we let the simulation run until the average latency had stabilized before the measurements were started. Thereafter, the simulation was run for another 10,000 simulation cycles.

9.2 Evaluation Results

We start the evaluation by considering the degree of network utilization which the various routing algorithms are able to maintain as more and more link faults are introduced into the networks. The throughput obtained by the three methods, DDLR, DDRC, and ADLR, in a 4-ary 3-tree and a 2-ary 6-tree fat-tree topology for saturated and unsaturated loads is presented in Fig. 9. The x-axis in these figures represents the number of link faults introduced into the network, and the y-axis is the average number of packets per cycle, injected by the end nodes, that the network accepts. The vertical lines at points 3 and 1 on the x-axis depict the k - 1 fault-tolerance limits of all three evaluated algorithms.

If we consider the case for the 4-ary 3-tree with uniform traffic in Fig. 9a, we notice that for zero faults ADLR achieves a slightly lower saturated throughput than the DDLR and DDRC methods. Recall that without faults, DDLR and DDRC are in essence the same routing algorithm. The reason is that for a uniform traffic pattern, a well-balanced deterministic routing algorithm will outperform adaptive routing [29]. However, as the number of link faults increases, we see that the saturated throughput for DDLR decreases quite rapidly; already for one link fault, ADLR displays higher throughput than DDLR. Furthermore, we see that the saturated throughput of DDRC is higher than for the two other algorithms for any number of faults (except for zero faults, where it is identical to DDLR). Of the DLR methods, ADLR is, in other words, better at maintaining high throughput with link faults, showing only a slightly higher decrease in throughput with the number of faults than DDRC. The reason for this is that the path used for rerouting around the link faults quickly becomes a bottleneck in the network. This severely impacts the performance of DDLR, while ADLR benefits from its adaptivity in being able to redirect traffic to other nonfaulty paths.


Fig. 8. Saturated and unsaturated loads with uniform traffic.

Fig. 9. Throughput for a 4-ary 3-tree and 2-ary 6-tree. Both uniform and hot-spot traffic are depicted, for the three methods: DDLR, DDRC, and ADLR. The faults range from 0 to 10. (a) 4-ary 3-tree, uniform traffic. (b) 2-ary 6-tree, uniform traffic.

This view is further strengthened if we consider the unsaturated loads for the three methods. In this case, all algorithms are able to maintain the same throughput as without link faults, except for DDLR, which also here shows a throughput reduction as we pass three link faults.

Moving on to the 2-ary 6-tree in Fig. 9b, the situation is somewhat different. First, note that since k = 2, the algorithms can only guarantee toleration of one link fault. We see that for saturated throughput without link faults, all three algorithms attain almost equal performance. The 2-ary 6-tree consists of switches with a significantly lower number of ports than the 4-ary 3-tree, limiting the number of available paths for utilization by ADLR. The difference between deterministic and adaptive routing will, therefore, be less noticeable, although the adaptive throughput is slightly lower. However, as the number of faults increases, we again see that the throughput of DDLR is more dramatically affected by link faults than both ADLR and DDRC. Another interesting point is that as we pass nine link faults, the throughput of ADLR slips below the throughput of DDLR. With this large number of link faults, there is a high probability of encountering a number of link faults on any of the available paths in the network. As a consequence, the relative performance difference between the various paths becomes smaller, as in the case without link faults, and the performance of ADLR will be lower than that of DDLR. The method with the best performance is still DDRC, which simply removes the faulty paths from the routes without having to reroute traffic. However, as we will see later in the evaluation, the involvement of a central entity, the subnet manager, imposes a severe time penalty on this operation. This will greatly reduce the effectiveness of DDRC, especially for short-lived faults.

It is evident from Fig. 8 that the 2-ary 6-tree has a greater forwarding capacity with uniform traffic than the 4-ary 3-tree for the same number of end points. For unsaturated throughput, the number of link faults we have introduced is not sufficient to cause network congestion, and thus throughput reduction. However, careful examination will reveal the same behavior as for the 4-ary 3-tree: for 10 link faults, the throughput of DDLR shows a slight decrease.

Let us next consider the number of packets discarded by the network from the time that the fault occurs until it is tolerated, which is the strongest point in favor of the DLR methods. For DDLR and ADLR, this is plotted in Fig. 10. The number of packets lost per link fault is plotted along the y-axis, while the increasing number of link faults is plotted along the x-axis. Note that interconnection networks do not discard packets, so the packet loss accounted for here is exclusively caused by the link failures.

We see that for any number of faults, the number of packets discarded because of each link fault for the saturated case is about 1.5 and 2.5-3 in the 4-ary 3-tree with uniform traffic (Fig. 10a) for DDLR and ADLR, respectively. For the unsaturated case, it increases from 0.4 to 0.6 over 1-10 faults for both DDLR and ADLR. Recall that the DLR methods only lose packets that are either currently in transit over the failing link or queued for the failing link.

We found that the simulation time we used was insufficient to encompass the reconfiguration of the DDRC method. Hence, we were forced to resort to calculating analytically the number of packets lost during the reconfiguration. To achieve this, we first approximate the time it takes to discover a fault, compute new forwarding tables, and distribute them. Using as a basis the work done by Bermudez et al. [30], which evaluates the time it takes for the InfiniBand subnet manager to perform these tasks, we find that the time taken to distribute the new forwarding tables is about 0.03 seconds in a 32-switch network. The full time taken to reconfigure the network is about 0.5 seconds, most of which is used for routing table calculation. In our calculations, we assume that this calculation is instantaneous, because the fat-tree routing algorithm is easy to set up. In addition, the subnet manager expends no time on discovering the topology change, because it is informed of the fault by the switches. For the sake of simplicity, we assume that the information is transmitted instantaneously. Therefore, 0.03 seconds is a lower bound on the time it will take to reconfigure a 32-switch network.

First, the 0.03 seconds required for updating all forwarding tables from the subnet manager corresponds to approximately 58,593 simulation cycles. Second, to calculate the number of packets lost in the presence of one link fault, it is necessary to determine the packet rate over a single link in the fat tree. This depends on 1) whether the link is at the bottom or top link tiers of the fat tree, and 2) the total accepted packet rate of the network, which can be seen from Fig. 9 to be approximately 18 packets per cycle for the saturated case with zero faults. Based on this, we can calculate that a bottom-tier link has a packet rate of 0.56 packets per cycle, and an upper-tier link carries 0.42 packets per cycle.

Multiplying these rates by the number of cycles in 0.03 seconds yields a total of 32,812 and 24,704 packets lost per link fault located at the lower or upper link tiers, respectively; about 20,000 times as many packets as DDLR and 10,000 times as many as ADLR.


Fig. 10. Packets discarded for a 4-ary 3-tree and 2-ary 6-tree. Uniform traffic is depicted, for two of the three methods: DDLR and ADLR. The faults range from 0 to 10. (a) 4-ary 3-tree, uniform traffic. (b) 2-ary 6-tree, uniform traffic.

Keep in mind that the reconfiguration time of 0.03 seconds was based on a 32-switch network, which is smaller than the ones considered here, and that it was assumed to take zero time to inform the subnet manager of the failure and to calculate new tables. Hence, the actual packet loss for DDRC might be significantly larger than that calculated here, given the packet size. In that respect, the calculation is optimistic compared to a real case.
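For concreteness, the arithmetic behind these figures is as follows (using the quoted, rounded rates; the upper-tier figure is obtained the same way from its unrounded rate):

\[
0.03\ \mathrm{s} \times 1{,}953{,}125\ \mathrm{cycles/s} \approx 58{,}593\ \mathrm{cycles},
\qquad
0.56\ \mathrm{packets/cycle} \times 58{,}593\ \mathrm{cycles} \approx 32{,}812\ \mathrm{packets}.
\]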

For the 2-ary 6-tree, the situation is much the same, with ADLR losing more than twice as many packets per link fault as DDLR with saturated traffic. For unsaturated traffic, the packet discard count is quite low, since an unsaturated network has minimal queuing time, and thus fewer packets are likely to be stuck in disconnected queues.

The reason for the increased packet loss of ADLR compared to DDLR is that ADLR, through its adaptivity, distributes traffic more evenly in the network, thus forcing more buffers to be occupied. Hence, there is a higher probability of there being a large number of packets in the buffers that are connected to the link that fails. The average packet loss will, therefore, be somewhat higher.

The fundamental property of any fault-tolerant routing algorithm is its ability to keep the end points connected when there are faults present in the network. In Fig. 11, we have plotted the probability (y-axis) of the routing algorithm keeping the network connected after the occurrence of a given number of faults (x-axis) for the 4-ary 3-tree (Fig. 11a) and the 2-ary 6-tree (Fig. 11b) for the three algorithms. Note that the probability of maintaining connectivity is the same for the two DLR methods, ADLR and DDLR, as they both use the same rerouting mechanism to forward the packets, and that this probability is different from the probability of the network being physically connected. The difference that can be observed between the two methods in the figures is, therefore, a result of statistical variation over the 500 samples we have simulated. Recall that the 4-ary 3-tree is able to guarantee a connected network when there are fewer than four link faults. The figure shows that for the 4-ary 3-tree, all algorithms managed to maintain connectivity for all 500 tested combinations of four link faults. Although such connectivity cannot be guaranteed, the probability of four link faults disconnecting a 4-ary 3-tree is very low. As the number of link faults increases toward 10, the probability for all three algorithms decreases by roughly the same amount, down toward 97 percent. For the 2-ary 6-tree, on the other hand, the algorithms can only guarantee the toleration of one link fault. We see from the figure that already at two link faults, none of the algorithms maintain a 100 percent probability of connectivity. Furthermore, we see that the probability of connectivity for DDRC stands out by decreasing more rapidly than for the two DLR algorithms. Local rerouting is, in other words, capable of tolerating a larger number of fault combinations for a given number of faults than reconfiguring the network to avoid the faulty paths in fat trees.

Finally, we will consider the probability of the networks becoming deadlocked when operating outside the limit specified by the theory in the previous sections. For ADLR, this is any number of faults beyond k - 1. Recall that the deadlock freedom theory for R^link_adaptive only has validity when there are fewer than k faults. R^link_deterministic, on the other hand, is guaranteed to be deadlock free for any number of faults, as long as one or two additional virtual channels are used when rerouting. However, without these virtual channels, deadlock freedom can only be guaranteed with one fault. Any number of faults larger than this carries the risk of deadlocking the network.

To generate these figures, we have run 500 simulations for each of 4, 7, and 10 link faults for the 4-ary 3-tree. We have let each simulation run for 200,000 simulation cycles and recorded the time at which the network became deadlocked, if at all. The resulting plots will asymptotically approach the probability of deadlock for the given number of faults as the time approaches infinity. We have obviously not been able to let time approach infinity, but it is still possible to get a general idea of the probability from the figures presented. For the 4-ary 3-tree (Fig. 12), we see that for four link faults, the lowest number of link faults for which we do not guarantee deadlock freedom, the probability of deadlock slowly approaches one percent.


Fig. 11. Probability of connectivity for a 4-ary 3-tree and 2-ary 6-tree, for the three methods: DDLR, DDRC, and ADLR. The faults range from 0 to 10. (a) 4-ary 3-tree. (b) 2-ary 6-tree.

Fig. 12. Probability of deadlock beyond k - 1 faults with adaptive routing, for a 4-ary 3-tree with 4, 7, and 10 faults.

For seven link faults, the probability approaches seven percent, and for 10 faults the probability of deadlock has climbed to about 20 percent. Comparing this to the probability of connectivity discussed previously, we see that even though the 4-ary 3-tree remained connected for all combinations of four link faults we tested, there is a one percent probability of the network becoming deadlocked. Similarly, for 10 link faults, the probability of maintaining connectivity was about 97 percent, but the probability of deadlock for this case is 20 percent.

10 CONCLUSION

The size and complexity of modern high-performance computer systems exacerbate the need for efficient fault-tolerant mechanisms to guarantee high system performance. In this paper, we have presented a fault-tolerance mechanism based on dynamic local rerouting, aimed at the much-utilized fat-tree topology. The fault-tolerance algorithm can transparently tolerate up to and including k - 1 benign link faults, where k is the number of ports in one direction of a switch in the fat tree. Depending on whether adaptive or deterministic routing is utilized, deadlock freedom can be guaranteed without using additional resources, or with one extra virtual channel, respectively. The proposed mechanisms require some small modifications of existing switches, and can thus not be directly implemented with current hardware. However, for single network installations, the necessary modifications can in all probability be achieved by updating the switch firmware.

Simulation experiments show a graceful degradation in performance as the number of link faults in the system increases. For larger fat trees with higher radix switches, the effect of any link fault when using dynamic local rerouting is believed to be significantly smaller. For the relatively small topologies we tested, the degradation is higher than for the simple reconfiguration mechanism to which we compare the results. However, the speed and transparency of the mechanism give it a significantly lower impact on network traffic: a high-speed reconfiguration mechanism might lose up to 20,000 times more packets than our dynamic local rerouting mechanism. Finally, the probability of maintaining network connectivity for all end nodes is slightly higher using dynamic local rerouting compared to reconfiguring the fat tree. Therefore, a careful combination of dynamic local rerouting with reconfiguration, as in the hybrid approach, will yield a fault-tolerance mechanism with high performance, virtually no packet loss, and a high probability of connectivity beyond the guaranteed connectivity region.

REFERENCES

[1] C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical J., vol. 32, pp. 406-424, 1953.

[2] C.E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Computers, vol. 34, no. 10, pp. 892-901, Oct. 1985.

[3] I.T. Association, InfiniBand Architecture, Specification Volume 1, Release 1.0a, http://www.infinibandta.com, 2001.

[4] F. Petrini, W.-C. Feng, A. Hoisie, S. Coll, and E. Frachtenberg, "The Quadrics Network: High-Performance Clustering Technology," IEEE Micro, vol. 22, no. 1, pp. 46-57, Jan./Feb. 2002.

[5] "Top 500 Supercomputer Sites," http://www.top500.org/, 2005.

[6] R. Ponnusamy, A. Choudhary, and G. Fox, "Communication Overhead on the CM5: An Experimental Performance Evaluation," Proc. Fourth Symp. Frontiers of Massively Parallel Computation, pp. 108-115, 1992.

[7] M. Woodacre, D. Robb, D. Roe, and K. Feind, "The SGI Altix TM 3000 Global Shared-Memory Architecture," SGI HPC White Papers, 2003.

[8] F.O. Sem-Jacobsen, O. Lysne, and T. Skeie, "Combining Source Routing and Dynamic Fault Tolerance," Proc. 18th Int'l Symp. Computer Architecture and High Performance Computing (SBAC-PAD), A.F. De Souza, R. Buyya, and W. Meira, Jr., eds., pp. 151-158, 2006.

[9] F. Petrini and M. Vanneschi, "K-ary N-Trees: High Performance Networks for Massively Parallel Architectures," Proc. 11th Int'l Symp. Parallel Processing (IPPS '97), p. 87, 1997.

[10] F.O. Sem-Jacobsen, "Towards a Unified Interconnect Architecture: Combining Dynamic Fault Tolerance with Quality of Service, Community Separation, and Power Saving," PhD dissertation, 2008.

[11] I.D. Scherson and C.-K. Chien, "Least Common Ancestor Networks," Proc. Int'l Parallel Processing Symp., pp. 507-513, http://citeseer.ist.psu.edu/scherson93least.html, 1993.

[12] M. Valerio, L.E. Moser, and P.M. Melliar-Smith, "Fault-Tolerant Orthogonal Fat-Trees as Interconnection Networks," Proc. First Int'l Conf. Algorithms and Architectures for Parallel Processing, vol. 2, pp. 749-754, 1995.

[13] T. Skeie, "A Fault-Tolerant Method for Wormhole Multistage Networks," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications (PDPTA '98), pp. 637-644, 1998.

[14] T.-H. Lee and J.-J. Chou, "Some Directed Graph Theorems for Testing the Dynamic Full Access Property of Multistage Interconnection Networks," Proc. IEEE Region 10 Conf. Computer, Comm., Control and Power Eng. (TENCON), 1993.

[15] S. Chalasani, C.S. Raghavendra, and A. Varma, "Fault-Tolerant Routing in MIN-Based Supercomputers," Proc. Supercomputing, pp. 244-253, 1990.

[16] N.K. Sharma, "Fault-Tolerance of a MIN Using Hybrid Redundancy," Proc. Ann. Simulation Symp., pp. 142-149, Apr. 1994.

[17] M. Valerio, L.E. Moser, and P.M.M. Smith, "Recursively Scalable Fat-Trees as Interconnection Networks," Proc. 13th IEEE Ann. Int'l Phoenix Conf. Computers and Comm., 1994.

[18] Y. Mun and H.Y. Youn, "On Performance Evaluation of Fault-Tolerant Multistage Interconnection Networks," Proc. ACM Symp. Applied Computing, pp. 1-10, 1992.

[19] J. Sengupta and P.K. Bansal, "High Speed Dynamic Fault-Tolerance," Proc. IEEE Region 10 Int'l Conf. Electrical and Electronic Technology (TENCON '10), vol. 2, pp. 669-675, 2001.

[20] J. Sengupta and P.K. Bansal, "Fault-Tolerant Routing in Irregular MINs," Proc. IEEE Region 10 Int'l Conf. Global Connectivity in Energy, Computer, Comm., and Control, vol. 2, pp. 638-641, 1998.

[21] N.-F. Tzeng, P.-C. Yew, and C.-Q. Zhu, "A Fault-Tolerant Scheme for Multistage Interconnection Networks," Proc. Ann. Int'l Symp. Computer Architecture, pp. 368-375, 1985.

[22] F.O. Sem-Jacobsen, T. Skeie, O. Lysne, O. Tørudbakken, E. Rongved, and B.R. Johnsen, "Siamese-Twin: A Dynamically Fault-Tolerant Fat Tree," Proc. Int'l Parallel and Distributed Processing Symp. (IPDPS), 2005.

[23] I.T. Theiss and O. Lysne, "FRoots, a Fault Tolerant and Topology Agnostic Routing Technique," IEEE Trans. Parallel and Distributed Systems, 2005.

[24] N. Mysore et al., "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric," ACM SIGCOMM Computer Comm. Rev., vol. 39, no. 4, pp. 39-50, http://portal.acm.org/citation.cfm?id=1594977.1592575, 2009.

[25] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, "VL2: A Scalable Flexible Data Center Network," ACM SIGCOMM Computer Comm. Rev., vol. 39, no. 4, pp. 51-62, http://portal.acm.org/citation.cfm?id=1594977.1592576, 2009.

[26] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable Commodity Data Center Network Architecture," Proc. ACM SIGCOMM 2008 Conf. Data Comm. (SIGCOMM '08), pp. 63-74, http://portal.acm.org/citation.cfm?doid=1402958.1402967, 2008.

[27] F.O. Sem-Jacobsen, T. Skeie, and J. Duato, "Dynamic Fault Tolerance in Fat-Trees," Research Note, Simula, 2009.

[28] H.-Y. Tyan, "Design, Realization and Evaluation of a Component-Based Compositional Software Architecture for Network Simulation," PhD dissertation, 2002.

[29] C. Gomez, F. Gilabert, M.E. Gomez, P. Lopez, and J. Duato, "Deterministic versus Adaptive Routing in Fat-Trees," Proc. Workshop Comm. Architecture on Clusters, as part of IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '07), 2007.

[30] A. Bermudez, R. Casado, F.J. Quiles, and J. Duato, "Handling Topology Changes in InfiniBand," IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 2, pp. 172-185, Feb. 2007.

Frank Olaf Sem-Jacobsen received the MS and PhD degrees in computer science from the University of Oslo, Norway, in 2004 and 2008, respectively. He is currently working as a postdoctoral fellow at Simula Research Laboratory, where he continues pursuing his research interests in routing, fault tolerance, and quality of service. These challenges are pursued both in regular interconnection networks and, more recently, also in networks on chips.

Tor Skeie received the MS and PhD degrees in computer science from the University of Oslo in 1993 and 1998, respectively. He is currently a professor at Simula Research Laboratory and the University of Oslo. His work is mainly focused on scalability, effective routing, fault tolerance, and quality of service in switched network topologies. He is also a researcher in the Industrial Ethernet area. The key topics here have been the road to deterministic Ethernet end-to-end and how precise time synchronization can be achieved across switched Ethernet. He has also contributed to wireless networking, hereunder quality of service in WLAN and cognitive radio.

Olav Lysne received the BSc degree in 1985, the Cand Scient degree in 1988, and the PhD degree in 1992, all from the University of Oslo. He is the director of Basic Research at Simula Research Laboratory and a full professor at the University of Oslo, where he became full professor in 1998. Currently, his research is focused on switching and routing in networks. In these areas, his main focus has been on efficient routing in irregular structures, fault tolerance, and network dynamics. He is the coinventor of Layered Shortest Path Routing (LASH), which has been used in some of the world's largest InfiniBand clusters. He is an author of numerous scientific publications in term rewriting and algebraic specification, in routing strategies for interconnect networks, and in network resilience in the Internet. He is a member of the IEEE.

Jose Duato received the MS and PhD degrees in electrical engineering from the Polytechnic University of Valencia, Spain, in 1981 and 1985, respectively. He is currently a professor in the Department of Computer Engineering (DISCA) at the Polytechnic University of Valencia. His research interests include interconnection networks and multiprocessor architectures. He has published more than 380 refereed papers. He proposed a powerful theory of deadlock-free adaptive routing for wormhole networks. Versions of this theory have been used in the design of the routing algorithms for the MIT Reliable Router, the Cray T3E supercomputer, the internal router of the Alpha 21364 microprocessor, and the IBM BlueGene/L supercomputer. He is the first author of Interconnection Networks: An Engineering Approach. He is a member of the IEEE.

