Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture

Conference paper, September 2015. Authors: Pierre Vignéras and Jean-Noël Quintin (Atos S.A.). Available from: http://www.researchgate.net/publication/281775198 (retrieved 18 November 2015).



Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture

Jean-Noël Quintin
Atos
Campus Teratec, 2 rue de la Piquetterie
91680 Bruyères-le-Châtel
Email: [email protected]

Pierre Vignéras
Atos
Campus Teratec, 2 rue de la Piquetterie
91680 Bruyères-le-Châtel
Email: [email protected]

Abstract—BXI, the Bull eXascale Interconnect, is the new interconnection network developed by Atos for High Performance Computing. It has been designed to meet the requirements of exascale supercomputers. At such scale, faults have to be expected and dealt with transparently so that applications remain unaffected by them. BXI features various mechanisms for this purpose, one of which is the BXI routing component presented in this paper. The BXI routing module computes the full routing tables for a 64k-node fat-tree in a few minutes, and with partial re-computation it can withstand numerous inter-router link failures without any noticeable impact on running applications.

Keywords—Fabric Management, Routing, Fault-Tolerant Routing, BXI, Interconnect Management, High Performance Computing

I. INTRODUCTION

High Performance Computing workloads must scale to extreme levels of parallelism, with applications using tens of thousands of nodes and dozens or hundreds of threads on each node. BXI, the Bull eXascale Interconnect, is the new interconnection network developed by Atos and designed to meet these requirements. The BXI interconnect is based on two ASICs: a network interface controller (NIC) and a 48-port switch. The BXI switch is a 48-port crossbar featuring per-port destination-based deterministic routing as well as adaptive routing. BXI technology also contains various features at the hardware level (NIC and switch) to ensure that faults go unnoticed by applications if routing tables are modified within a given time frame – currently 10 seconds, including notification, computation and upload of the modifications. This value is a tradeoff between resiliency and failure detection latency: increasing this timeout might improve resiliency for some failures, but can also leave a job halted for too long before detecting that there is no way to bypass a failure. A detailed overview of the BXI technology can be found in [1].

To use this infrastructure, routing tables must be initially computed and loaded into the different BXI switches through an out-of-band network. Each table defines how to reach an endnode (NID, for Node IDentifier) based on its destination address and the switch input port. The computation of these routing tables is not trivial, since, regardless of topology, it must guarantee the absence of:

• deadlock: messages can no longer move in the network due to lack of available resources;

• livelock: messages are constantly moving in the network but never reach their final destination – as within a loop;

• dead-end: messages reach an unusable port (nonexistent, not connected, or down).
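The dead-end and livelock conditions can be checked mechanically by walking the deterministic tables hop by hop. Below is a minimal sketch of such a check; the table layout and names are illustrative, not the actual BXI format:

```python
def check_route(tables, src, dst):
    """Walk deterministic tables from switch `src` towards NID `dst`.

    `tables` maps switch -> {dst: next_hop}; a next hop equal to `dst`
    itself means the message egresses to the endnode. Returns 'ok',
    'dead-end' (no usable entry) or 'livelock' (a routing loop).
    """
    seen = set()
    cur = src
    while cur != dst:
        if cur in seen:
            return "livelock"           # same switch visited twice: a loop
        seen.add(cur)
        nxt = tables.get(cur, {}).get(dst)
        if nxt is None:
            return "dead-end"           # entry points to no usable port
        cur = nxt
    return "ok"

# Toy 3-hop line: A -> B -> nid
tables = {"A": {"nid": "B"}, "B": {"nid": "nid"}}
print(check_route(tables, "A", "nid"))   # ok
tables["B"]["nid"] = "A"                 # introduce a loop
print(check_route(tables, "A", "nid"))   # livelock
```

Deadlock freedom is a property of the channel-dependency graph rather than of individual routes, so it requires a separate analysis.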

The routing tables must be quickly computed in order to keep the system operational without interrupting running applications. While a delay of a few seconds is acceptable, standard algorithms require dozens of minutes when applied to exascale system topologies with tens of thousands of nodes. Given the size of these systems, component failures1 must be expected on a regular basis. The naive approach where all routing tables are re-computed from scratch for each failure is not acceptable at such scale. The two main reasons are:

• the computation time exceeds a timeout: faults are therefore not hidden from applications (soft real-time requirement);

• the modification of all routing tables impacts all running applications even when only a small part of the platform is concerned (minimum impact requirement).

This document presents our routing strategy for BXI exascale supercomputers. The state of the art is first reviewed in section II. The major problems we address are then detailed in section III. Our solution is described in section IV along with some results. Finally, we conclude in section V and explain the directions for future developments.

II. RELATED WORK

There is an abundant literature on supercomputing interconnect topologies and their related routing algorithms [2], [3]. In the recent past, topologies designed for high-radix switches have been proposed: fat-trees [4], [5], [6], [7], flattened-butterfly [8], [9], dragonfly [10], [11] or slimfly [12] are such examples. However, to our knowledge, fault management is rarely addressed in the literature for topology-specific routing algorithms. Most of the time, a link or switch fault is treated as a whole-topology change, triggering routing tables re-computation from scratch and requiring their complete upload. This option is not practical anymore for exascale supercomputers, mostly because of the computational complexity in time, as explained in section III. When the topology-specific routing algorithm cannot be used anymore on the degraded topology, the computation switches to a topology-agnostic routing algorithm [13], [14], [15], [16]. In such a case, the routing tables are also computed from scratch and uploaded; such an operation is far too heavy to meet our soft real-time and low-impact requirements. Degraded topologies must therefore be supported as far as possible by a given routing algorithm.

1The term “failures” must be understood from a fabric management point of view. Hardware failures are of course seen as such, but so are human mistakes and maintenance operations.

The search for shortest paths is the basis of most topology-agnostic routing algorithms [17]. Recent results in the dynamic solving of the All-Pairs Shortest Paths problem (APSP) [18], [19], [20] can help dealing with failures: a slight modification of the topology can be handled in O(f(n) + n² log n) (for most dynamic APSP solutions), and the recomputation of routing tables can use the fast query time in O(1) (for all dynamic APSP solutions). However, for exascale topologies, the size of the corresponding graph and the memory consumption associated with dynamic APSP solutions make their application impractical.

Fault-tolerant topology-agnostic routing algorithms have been proposed [21], [22], [23], [24]. Such solutions do not meet our requirements, since they support only one fault or are too slow for exascale topologies. Moreover, they often focus on the support of irregular topologies, producing routing tables that may be of lower quality than those of specific routing algorithms.

Recently, the Quasi-Fat-Tree topology has been proposed [25] along with a fault-tolerant routing algorithm in closed form, allowing for an efficient parallel implementation. This is an elegant solution to the problem, but it is only available for quasi-fat-tree topologies, whereas our solution proposes an efficient general architecture suitable for any topology.

III. MAJOR ISSUES

Computing routing tables for a topology containing N destinations (NIDs) requires at least Ω(N²) steps. Indeed, for each destination, a route must be found to reach each of the other N − 1 destinations. For targeted exascale topologies with 64k NIDs, that means a minimum of 4 billion routes must be computed. The soft real-time requirement imposes a 5-second constraint on routing tables computation, leaving another 5 seconds for notification and upload. Therefore, a rate of 860M routes/s must be achieved.
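The arithmetic behind this target rate can be checked directly (taking 64k as 2^16 NIDs):

```python
# Direct check of the routing-rate requirement for a 64k-NID topology.
n = 2 ** 16                   # 64k NIDs
routes = n * (n - 1)          # one route per ordered source-destination pair
rate = routes / 5             # 5 s budget for the computation itself
print(routes)                 # 4294901760, i.e. ~4.3 billion routes
print(round(rate / 1e6))      # ~859M routes/s, matching the ~860M target
```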

For comparison purposes, on Curie2, a fat-tree with 5,739 nodes, the Infiniband subnet manager (opensm/OFED) computes the routing tables in 14 seconds, achieving a rate of 2.3M routes/s. Another similar result is given in [25]: the ftree opensm routing algorithm computed the routing tables of a fat-tree with 34,992 nodes in 478 seconds, giving a rate of 2.5M routes/s. This is 400 times slower than our objective.

Furthermore, with the BXI adaptive routing feature, a much larger number of routes must be computed, as shown by figure 1, which represents a fat-tree. The number of routes to compute depends on the source-destination pair; it increases significantly with the distance between nodes: there is a single route from A to B through switch 0.0.0, 4 routes (red dashed lines) from B to C and 8 (purple thick lines) from D to E. The total number of computed routes is therefore significantly higher than N². On an l-level full fat-tree made of radix-k switches, with k/2 uplinks at each level i, i < l, the total number of routes for a given source-destination pair can be as high as (k/2)^l. In the case of a 64k-node 4-level BXI fat-tree, for a single source-destination pair at maximum distance, that means (48/2)^4 > 300k routes to consider.

2Curie is an instance of the previous petaflopic generation of Bull supercomputers. It was ranked 9th in the Top500 list of June 2012: http://www.top500.org/system/177818

Figure 1. Number of routes in a fat-tree with adaptive routing.
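The (k/2)^l bound can be evaluated for the 4-level radix-48 case mentioned above:

```python
# Upper bound on routes per source-destination pair in an l-level
# full fat-tree of radix-k switches: (k/2) ** l.
k, l = 48, 4
print((k // 2) ** l)    # 331776, indeed above the 300k mentioned
```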

From another perspective, if S is the total number of switches in the topology, the routing component must:

• assign integers between 0 and 47 (each switch has 48 ports in BXI) to all S · 48 · N entries of the deterministic routing tables (each entry being 1 byte);

• set each of the 48 bits (6 bytes) of the 48 · S · (N/8) = 6SN entries of the adaptive tables (1 entry for a group of 8 consecutive NIDs, hence N/8 entries per switch port).

Consequently, for a BXI topology with N = 62,208 NIDs and S = 9,504 switches, the routing algorithm must fill 28.3G deterministic entries and 3.5G adaptive entries, for a total of about 84 · S · N = 49G bytes.
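These counts follow directly from the two bullet points above and can be re-derived:

```python
# Routing-table footprint for N = 62,208 NIDs and S = 9,504 switches.
N, S = 62_208, 9_504
det_entries = S * 48 * N            # deterministic entries, 1 byte each
adp_entries = 48 * S * (N // 8)     # adaptive entries (= 6*S*N), 6 bytes each
total_bytes = det_entries + 6 * adp_entries
print(round(det_entries / 1e9, 1))  # ~28.4G entries (28.3G in round figures)
print(round(adp_entries / 1e9, 1))  # ~3.5G entries
print(total_bytes == 84 * S * N)    # True: 48*S*N + 36*S*N = 84*S*N bytes
print(round(total_bytes / (5 * 3e9), 1))   # ~3.3 bytes per cycle at 3 GHz
```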

Keeping our 5-second constraint (soft real-time requirement), an ideal processor clocked at 3 GHz must compute 49GB/(5s · 3GHz) = 3.3 bytes per clock cycle. This performance is close to the limit achievable with modern processors in various benchmarks3, especially with non-vectorized instructions. This basically means that a brute-force generic solution to the initial problem of real-time routing tables computation is no longer a viable approach.

IV. SOLUTION

The main idea of our proposal relies on a clear separation between two distinct modes of operation: offline mode and online mode. In order to evaluate our solution, a Bull R428 server with 4 Intel Xeon E5-4640 CPUs (8 cores each) clocked at 2.40 GHz, 512 GB of RAM and a single disk was used.

3https://gmplib.org/~tege/x86-timing.pdf

A. Offline Mode

The offline mode is used mainly during bring-up phases (installation, extension, maintenance) but also for testing purposes and offline analysis. Routing tables are computed from a nominal topology, where all equipment is considered up and running (even equipment expected to be installed later). These routing tables are usually uploaded after a validation phase: checking that neither deadlock, livelock nor dead-end is induced by them. Moreover, a quality check can also be performed according to criteria such as load balancing, communication patterns, and so on. Since it is performed offline, this computation has no impact on any part of the running supercomputer until the output is uploaded, and there is no stringent limit on the time required for the computation.

A given offline routing algorithm is a tradeoff between genericity – it provides valid routing tables for any topology – and performance – the computed routing tables are of good overall quality. Implementing an offline routing algorithm is easy thanks to a well-defined API. BXI already implements several offline routing algorithms, briefly described below.

D-mod-K [7]: it distributes destinations over the top-level switches of a fat-tree. All ports of a given switch share the same routing table. It provides high-quality routing tables, and can guarantee non-blocking traffic for shift permutations on very well defined fat-trees (known as Real Life Fat-Trees).
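The destination-only dependence of D-mod-K can be sketched on an idealized fat-tree with m uplinks per switch at every level. The exact closed form for PGFTs is given in [7]; the function below is an illustrative simplification, not the BXI implementation:

```python
def dmodk_uplink(dst, level, m):
    """Uplink chosen at a level-`level` switch for destination `dst` on an
    idealized fat-tree with `m` uplinks per switch. The choice depends on
    the destination only, so all sources reach `dst` through the same
    top-level switch, and consecutive destinations are spread round-robin
    over the uplinks."""
    return (dst // m ** level) % m

m = 4
# Consecutive destinations are spread evenly over the m uplinks:
print([dmodk_uplink(d, 0, m) for d in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```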

S-mod-K [5]: it distributes sources over the top-level switches of a fat-tree. It makes good use of the per-port routing tables feature of BXI. Depending on the communication pattern, it can offer better or worse performance than D-mod-K [26]. Like D-mod-K, its performance can be guaranteed only on well defined fat-trees.

Pr1ta (PGFT Random 1 Table per Asic): it is based on Random [27], [6]: it selects one switch randomly among the set of Nearest Common Ancestors (NCA) of a given source-destination pair. All ports of a given switch share the same routing table. On some communication patterns, it can provide better performance than D-mod-K or S-mod-K [26]. It can be used on a wide set of fat-trees (any k-ary n-tree [6], XGFT [5] or PGFT [7] actually).

Pr1tp (PGFT Random 1 Table per Port): it is similar to Pr1ta but uses one routing table per port; depending on the communication pattern it can offer better or worse performance than Pr1ta. It can also be used on a wide set of fat-trees.

Minhop: it is based on Dijkstra's algorithm [28] to find shortest paths between all source-destination pairs. It is topology-agnostic but not deadlock-free. It can therefore be used on any topology where minimal routing does not lead to a deadlock (see [29]). This is the case for most topologies seen as fat-trees but that cannot be expressed as PGFTs.

UP*/DOWN* [30]: it is topology-agnostic and deadlock-free. It can therefore be used on any topology.

Even if the offline mode is not time-bounded from a functional perspective, it remains softly time-bounded from a user perspective. Given the time and space complexities discussed in section III, a naive implementation leads to unreasonable execution times (dozens of minutes). This problem is also described in [7], [25], where the proposed solutions consist in using a closed-form routing tables computation, allowing for an efficient parallel implementation but limited to a few topology/routing algorithm combinations.

Offline routing algorithms for BXI are implemented using a large set of purely software optimization techniques: lock-less multi-threaded design, thread binding, NUMA memory binding. Profiling showed the process to be largely I/O bound: the design was changed to drastically reduce the number of system calls – using mmap() and an asynchronous logging library, for example. These improvements make the offline mode scalable up to the maximum number of available cores.

Results are presented in figure 2 for a fat-tree containing 57,600 NIDs, 9,600 switches and 153,600 inter-switch links. The average time of 60 executions is given. In the best case, only 4 seconds are required to compute the routing tables, producing 84 · S · N = 46G bytes. Rates of 829M routes/s4 and 4.8 bytes per clock cycle are therefore achieved. This results from the speedup of our parallel implementation of offline routing algorithms. However, as shown in the figure, most of the time is taken by input/output operations: topology loading (in red) and BXI archive creation (in green). Considering all steps gives 22M routes/s and 0.12 bytes per clock cycle for the best case. Though 10 times faster than opensm, this might seem low compared to the objectives given in section III. Nevertheless, since this mode is offline, only the end-user perspective matters: around 3 minutes are required in all cases to compute and store the whole set of deterministic and adaptive routing tables for such big topologies on a single disk. This is considered reasonable.

Figure 2. Execution time in seconds of various BXI offline routing algorithms.

B. Online Mode

The online mode is used while the system is up and running. This mode deals with faults and recoveries, computing small routing table modifications while guaranteeing the absence of deadlock, livelock and dead-end. Since it is online, the computation of the routing table modifications must be completed in less than 5 seconds in order to limit the impact of the fault on the jobs using the faulty link. Note that the BXI switch adaptive routing feature is targeted at improving communication performance, as shown in [31], [32], not at fault management. The set of routing table modifications must also be kept to a minimum to limit the impact on jobs not using the faulty link. Moreover, any node must be able to communicate with any other node in the topology unless the topology graph is disconnected. Note that this last property is not achievable by all online routing algorithms: a highly degraded fat-tree, for example, might be routable only by agnostic routing algorithms such as Minhop. Consider for example the case where, on the fat-tree shown in figure 1, uplinks 0.0.0-1.0.x with x ∈ [1, 3] and uplinks 0.0.3-1.0.y with y ∈ [0, 2] are all faulty. Minimal routes from B to D are given by the following path: 0.0.0; 2.0.0, 2.1.0; 1.1.0; 0.1.0; 1.1.3; 2.0.3, 2.1.3; 1.0.3; 0.0.3. Such paths cannot be computed by a fat-tree specific routing algorithm, but are provided by Minhop. Note however that Minhop is not deadlock-free, whereas any fat-tree specific routing algorithm is by design. Therefore, any online routing algorithm raises an error when it detects such a situation (as discussed in the introduction, switching to a new routing algorithm is not an option).

1) Architecture: In online mode, the Routing component must react to events sent by any fabric equipment. For this purpose, the Backbone component acts as a middleware bus: it receives topology updates from other components and publishes them to the Routing component, as shown in figure 3.

The Routing component is actually made of different parts, which can be distributed over several management hosts:

• Computerd: it starts from a bxiarchive that includes routing tables previously written to a storage device by the offline mode. It then receives updates from Backbone and computes the required modifications of routing tables, called rtmods, before publishing them to subscribers.

• Writerd: it also starts from a bxiarchive and subscribes to all routing table modifications published by Computerd. It applies all received rtmods to its own memory model of the actual topology and writes the result down to the storage device, making a new bxiarchive (only new data are created; unmodified data are hard-linked to reduce storage space). Note that in BXI, an archive is produced each time modifications are taken into consideration by the BXI Fabric Management. The set of archives allows for various analyses such as post-mortem debugging, study of the timeline of events, fault handling optimizations, etc.

• Metricsd: it also starts from a bxiarchive and also subscribes to all routing table modifications published by Computerd. It also applies all received rtmods to its own memory model of the actual topology. It then computes various metrics on the result (such as link load balancing, the number of sources/destinations for each route traversing each port, etc.), and provides its feedback to Computerd. This may help the online routing algorithm adapt its internal parameters to provide better rtmods according to some criteria measured by Metricsd. This is an ongoing work not presented in this paper.

• Uploaderd: its role is to transform the rtmods format into an uploadable routing tables modifications format, and to upload them to the concerned switches using the SNMP protocol. Traffic is not stopped during the upload. A short discussion on deadlock-free dynamic reconfiguration is given in section IV-B2. In the new Bull eXascale platform, a “cell” is composed of 3 cabinets with up to 288 nodes. In such a cell, not all switches are directly available through the out-of-band SNMP protocol, for various reasons5. Therefore, a given Uploaderd acts as a proxy for the set of switches it manages, given their shared topological address prefix. Using this prefix, it subscribes to a subset of Backbone updates. Thus, an Uploaderd receives only the rtmods related to the switches it is concerned with, limiting the amount of data received and managed.
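The prefix-based subscription mechanism mirrors zeromq's PUB/SUB topic filtering, where a subscriber receives only messages whose topic starts with its subscription prefix. The idea can be sketched in-process as follows; the topic layout is illustrative, not the actual BXI message format:

```python
class Bus:
    """Toy publish/subscribe bus with zeromq-style topic-prefix filtering."""

    def __init__(self):
        self.subs = []                       # (prefix, inbox) pairs

    def subscribe(self, prefix):
        inbox = []
        self.subs.append((prefix, inbox))
        return inbox

    def publish(self, topic, msg):
        for prefix, inbox in self.subs:
            if topic.startswith(prefix):     # zeromq PUB/SUB semantics
                inbox.append((topic, msg))

bus = Bus()
cell_a = bus.subscribe("rtmod/1.0.")         # Uploaderd for switches 1.0.*
cell_b = bus.subscribe("rtmod/1.1.")         # Uploaderd for switches 1.1.*
bus.publish("rtmod/1.0.3", "new entry for port 7")
print(len(cell_a), len(cell_b))              # 1 0
```

Each Uploaderd thus pays only for the traffic matching its own switch prefix, which is what makes the hierarchical layout described below scale.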

As an example, using the notation defined in [7], the fat-tree formally described by PGFT(4; 24, 12, 15, 15; 1, 12, 24, 5; 1, 2, 1, 3) contains:

• 64,800 NIDs, 11,160 switches and 194,400 inter-switch links6;

• 12 × 15 × 15 = 2,700 switches at level 1, each connected to 24 nodes and 12 L2 switches with 2 links;

• 12 × 15 × 15 = 2,700 switches at level 2, each connected to 12 L1 switches and 24 L3 switches with 1 link;

• 12 × 24 × 15 = 4,320 switches at level 3, each connected to 15 L2 switches and 5 L4 switches with 3 links;

• 12 × 24 × 5 = 1,440 switches at level 4, each connected to 15 L3 switches.
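These figures are internally consistent and can be re-derived from the PGFT parameters:

```python
# Per-level switch counts for PGFT(4; 24,12,15,15; 1,12,24,5; 1,2,1,3).
l1, l2, l3, l4 = 12 * 15 * 15, 12 * 15 * 15, 12 * 24 * 15, 12 * 24 * 5
print(l1, l2, l3, l4)          # 2700 2700 4320 1440
print(l1 + l2 + l3 + l4)       # 11160 switches
print(l1 * 24)                 # 64800 NIDs (24 nodes per L1 switch)
# Inter-switch links, counted from the lower level, with link arities 2, 1, 3:
print(l1 * 12 * 2 + l2 * 24 * 1 + l3 * 5 * 3)   # 194400
```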

For such a topology, the set of Uploaderd instances is defined by the following scheme:

• 15 × 15 = 225 instances for the management of groups of 24 L1/L2 switches;

• 24 × 15 = 360 instances for the management of groups of 12 L3 switches;

• 5 instances for the management of groups of 288 L4 switches.
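This layout covers every switch exactly once, as a quick check confirms (switch counts taken from the PGFT described above):

```python
# Uploaderd layout check for the PGFT example topology.
l1 = l2 = 2700
l3, l4 = 4320, 1440
groups = [(15 * 15, 24),   # 225 instances managing 24 L1/L2 switches each
          (24 * 15, 12),   # 360 instances managing 12 L3 switches each
          (5, 288)]        # 5 instances managing 288 L4 switches each
print(sum(n for n, _ in groups))                           # 590 instances
print(sum(n * s for n, s in groups) == l1 + l2 + l3 + l4)  # True: 11160 switches
```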

Thanks to the high-performance zeromq7 socket library, this architecture scales up to several hundred Uploaderd instances: 225 + 360 + 5 = 590 in this extreme case. For other topologies, the layout of Uploaderd follows the same hierarchical principle.

Note that the online architecture follows a pipeline design: Computerd always receives a batch of topology updates consisting of link failures and link recoveries. A specific message received from the Backbone triggers the actual computation on the whole set of updates. During that time, new updates are buffered. As soon as Computerd has published the computed rtmods, it starts working on the next buffered updates. This way, simultaneous failures and failures happening during failure management are dealt with in a consistent manner.

5For any two cells, switches at the same relative location are assigned the same private non-routed IP address for out-of-band communication. This eases delivery and bring-up, and also optimizes the out-of-band management network. More details in [1].

6Only inter-switch link faults can be dealt with by an online routing algorithm.

7http://zeromq.org/

Figure 3. The Routing component architecture.

2) Results: To evaluate the online architecture presented in the previous section, the following experiment has been designed and run:

• a random link fault is generated, and the related trigger is sent to the Backbone;

• Backbone publishes the trigger to all subscribers (only Computerd in our case);

• Computerd handles the fault by computing routing table modifications (rtmods);

• Computerd then publishes all rtmods to subscribers;

• Writerd writes the received rtmods down into a new bxiarchive.

The topology used is the 64,800-node fat-tree defined in the previous section. Note that it has 2 links between connected L1/L2 switches, only one interlink between connected L2/L3 switches and 3 interlinks between L3/L4 switches. The impact of the number of links connecting a pair of switches on the computation of rtmods is discussed below.

For our experiment, we first simulated 48k random link failures, representing 25% of the total number of inter-switch links, which is much more than expected in production use. Even with such a degraded fabric, the system is still able to route all the traffic correctly. The 48k missing links are then recovered and re-inserted into the fabric.

The offline deadlock-free routing algorithm is Pr1tp (briefly described in subsection IV-A). The online routing algorithm we developed is based on the Random [27], [6] fat-tree routing algorithm: for every disconnected path, a new NCA is selected randomly to generate a new path avoiding the failed link. On recovery, the algorithm simply rebalances the load towards newly available links, again using a random NCA for each affected source-destination pair. Details regarding online routing algorithms are beyond the scope of this paper. Note, however, that in the specific case of fat-trees, fat-tree specific offline and online routing algorithms are deadlock-free by design: a message always moves up towards the NCA and then moves down towards its final destination. Note also that while new routing tables are being uploaded, the fabric is for a limited time in a transient state where some switches have been updated while others soon will be. In the general case, this might lead to deadlock [13], but with fat-tree topologies, this cannot happen if the computed routing tables follow the fat-tree specific pattern (move up towards an NCA, then down). Note that BXI also provides a deadlock-free topology-agnostic online routing algorithm with a similar deadlock-free transient state property. The presentation of these online routing algorithms is far beyond the scope of this paper. Of course, the presented results depend on the actual pair of offline/online routing algorithms chosen. However, the results presented here provide enough information to confirm the interest of the online architecture we propose. Other experiments with different combinations of offline/online routing algorithms are under development.
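The fault-handling step of this online algorithm can be sketched as follows: among the NCAs serving a source-destination pair, keep those whose up/down path avoids the failed link and pick one at random. The data structures are illustrative; the real algorithm operates on routing table entries:

```python
import random

def reroute(paths_via_nca, failed_link):
    """`paths_via_nca` maps each candidate NCA switch to the set of links
    used by its up/down path. Return a randomly chosen NCA whose path
    avoids `failed_link`, or None if no alternative remains (i.e. the
    pair is disconnected and an error must be raised)."""
    usable = [nca for nca, links in paths_via_nca.items()
              if failed_link not in links]
    return random.choice(usable) if usable else None

paths = {"2.0.0": {"a-b", "b-c"}, "2.1.0": {"a-d", "d-c"}}
print(reroute(paths, "b-c"))   # 2.1.0: the only NCA avoiding the fault
print(reroute(paths, "x-y"))   # either NCA, chosen at random
```

Since every candidate path still goes up to an NCA and then down, the resulting tables keep the fat-tree pattern that guarantees deadlock freedom.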

Figure 4 presents the processing time for each fault/recovery: that is, the time between the reception of the trigger by Computerd and the reception of all rtmods by Writerd. The processing time therefore includes both the computation time of Computerd and the sending time between Computerd and Writerd. The different points are colored according to the level of the failing/recovering link.

Globally, there is a tendency towards a reduction in the processing time for fault handling. As faults occur, there are fewer possibilities for alternative routes, reducing the number of rtmods to process. As expected, the exact opposite behavior is shown for recoveries: the processing time increases with the number of newly available paths.

L3/L4 faults are the easiest to deal with in our example, and they exhibit the best processing times. Note that several points for L3/L4 faults cost more than 100 ms (starting at around fault #8000) whereas most are closer to 60 ms. This is due to the number of interlinks between L3 and L4 switches: since 3 links connect them, when one link fails, dispatching the routes over the other two is fast and straightforward and impacts as few as 2 switches, thus limiting the number of rtmods to compute. However, when no usable link remains between the two, many other switches must be impacted and the number of rtmods is much higher. This is also the reason why an L1/L2 link fault can cost around 80 ms in some cases and close to 1180 ms in others. Since there are 2 links between L1 and L2 switches instead of 3 between L3 and L4 ones, the worst case happens more often. Moreover, when there is no usable link between two L1/L2 switches, the impacted switches are all 2,700 L1 switches. This case reaches the maximum number of switches impacted by the algorithm. Since the chosen topology provides a single link between L2 and L3 switches, this is always the most complex case: more than 2 switches are always impacted.

Figure 4. Processing time for each fault (left) and recovery (right). The X-axis roughly represents a discrete timeline.

In figure 5, the right-most values represent the maximum number of rtmods produced, which happens when the first L1/L2 faults are encountered. Afterwards, if some links are already unusable, there are no rtmods to compute for their related ports. The same applies to (137k rtmods, ~60 ms), which is the maximum for L3/L4 faults and also corresponds to the first time such faults are seen.

Figure 6 presents the processing time for a given number of rtmods when more than 2 switches are impacted. The processing time depends on the level: it can take up to 1200 ms to deal with one worst-case L1/L2 fault, while it takes only 140 ms to deal with an L3/L4 one.

Handling recoveries should exhibit a similar behavior: the number of rtmods is expected to be roughly the same for fault handling as for recoveries. This is not the case, as shown in figure 5, where at most 15,000 rtmods are computed for L3/L4 fault handling but at least 21,000 rtmods for recovery handling. Our algorithm computed more rtmods for recoveries than required. The problem has been identified and the correction is under development.

The BXI offline routing tables checker has been used at 10%, 20% and 25% of faults, and at 10%, 20% and 25% of recoveries, in order to check the validity of the computed routing tables, ensuring the absence of deadlock, livelock and dead-end.

Checking the overall quality of the new routing tables is a challenge we are currently working on. Studying topologies other than fat-trees – thanks to the BXI topology-agnostic online routing algorithms – is also an ongoing study.

V. CONCLUSION AND FUTURE WORK

For the exascale topology sizes targeted by BXI technology, the complete computation of all routing tables (offline mode) usually requires dozens of minutes. This is far too long to overcome link failures without interrupting running applications. The main contribution of this paper is to present a radically new approach based on a clear separation of concerns for the computation of routing tables:

• offline computation: tables are computed without real-time constraints and archived for analysis and validation before being uploaded at production start-up;

• online computation: only the routing table modifications needed to bypass faults (or to deal with recoveries) are computed, under a soft real-time constraint, up to a certain point (detection of a non-routable degraded topology), and archived.

As a result, the BXI Routing offline mode can compute all routing tables of a 64k-node full fat-tree in less than 4 minutes on commodity hardware, while the online mode can deal with at least 25% of faults and recoveries transparently.

Next short term follow-up is to make several experimentswith other combinations of topologies, offline and online rout-ing algorithms along with the injection of simultaneous faultsand recoveries. Metricsd implementation is a real challengedue to time complexity: producing metrics on an exascaletopology can take a lot of time (several hours); representing

0e+00 1e+05 2e+05 3e+05 4e+05 5e+05

010

020

030

040

0

Number of modified routing table entries

Tim

e in

mill

isec

onds

Fault between level 1 and 2Fault between level 3 and 4

0e+00 1e+05 2e+05 3e+05 4e+05 5e+05

010

020

030

040

0Number of modified routing table entries

Tim

e in

mill

isec

onds

Recovery between level 1 and 2Recovery between level 3 and 4

Figure 5. Processing time for a given amount of rtmods produced when only 2 switches are impacted on fault (left) and recovery (right). L2/L3 linkfailure involve more than 2 switches, hence they do not appear in figure.

[Figure 6 plots omitted: processing time in milliseconds versus number of modified routing table entries, for faults (left) and recoveries (right) between levels 1 and 2, levels 2 and 3, and levels 3 and 4.]

Figure 6. Processing time for a given amount of rtmods produced when more than 2 switches are impacted, for fault (left) and recovery (right).

these metrics is also a challenge in itself because of the amount of available data. Finally, the injection of Metricsd results into Computerd, forming a control loop, has not been studied yet. It is a large and interesting challenge for the near future.

ACKNOWLEDGMENT

The BXI development has been undertaken as a cooperation between CEA and Atos. The goal of this cooperation is to co-design extreme computing solutions. Atos thanks CEA for all their input, which was very valuable for this research.

This research was partly funded by a grant from the Programme des Investissements d'Avenir.

BXI development was also part of PERFCLOUD, the French FSN (Fonds pour la Société Numérique) cooperative project that associates academic and industrial partners to design and provide building blocks for new generations of HPC datacenters.

We are thankful to the Portals team at Sandia National Laboratories for their unconditional support, particularly Ron Brightwell, Brian Barrett (now at Amazon), and Ryan Grant.

We also acknowledge the passionate discussions we hadwith Keith Underwood from Intel during the early stages ofthis project.

We would also like to thank our colleagues Jean-Pierre Panziera, Ben Bratu, Anne-Marie Fourel and Pascale Bernier-Bruna for their reviews and valuable comments.
