
Expeditus: Congestion-aware Load Balancing in Clos Data Center Networks

Peng Wang1 Hong Xu1 Zhixiong Niu1 Dongsu Han2 Yongqiang Xiong3

1 NetX Lab, City University of Hong Kong   2 KAIST   3 Microsoft Research Asia

Abstract

Data center networks often use multi-rooted Clos topologies to provide a large number of equal cost paths between two hosts. Thus, load balancing traffic among the paths is important for high performance and low latency. However, it is well known that ECMP, the de facto load balancing scheme, performs poorly in data center networks. The main culprit of ECMP's problems is its congestion agnostic nature, which fundamentally limits its ability to deal with network dynamics.

We propose Expeditus, a novel distributed congestion-aware load balancing protocol for general 3-tier Clos networks. The complex 3-tier Clos topologies present significant scalability challenges that make a simple per-path feedback approach infeasible. Expeditus addresses the challenges by using simple local information collection, where a switch only monitors its egress and ingress link loads. It further employs a novel two-stage path selection mechanism to aggregate relevant information across switches and make path selection decisions. Testbed evaluation on Emulab and large-scale ns-3 simulations demonstrate that Expeditus outperforms ECMP by up to 45% in tail flow completion times (FCT) for mice flows, and by up to 38% in mean FCT for elephant flows in 3-tier Clos networks.

Categories and Subject Descriptors C.2.2 [Computer-Communication Networks]: Network Protocols

Keywords Datacenter networks, Load balancing, Network congestion

1. Introduction

Data center networks use multi-rooted Clos topologies to provide a large number of equal cost paths and abundant bandwidth between hosts [2, 17]. To load balance flows, the data plane runs ECMP (Equal Cost Multi-Path), which forwards packets among equal cost egress ports based on static hashing. ECMP is simple to implement and does not require per-flow state at switches. It is well known, however, that ECMP cannot satisfy stringent performance requirements in data centers. Hash collisions cause congestion, degrading throughput for elephant flows [3, 11, 14] and tail latency for mice flows [5, 7, 27, 37].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SoCC '16, October 2016, Santa Clara, CA, USA.
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4525-5/16/10...$15.00.
DOI: http://dx.doi.org/10.1145/2987550.2987560

Distributed congestion-aware load balancing has recently emerged as a promising solution [4], in which switches monitor congestion levels of each path and direct flows to less congested paths. This approach has several practical advantages. Being a distributed scheme, it is more scalable and can deal with bursty traffic more gracefully than centralized scheduling (e.g., Hedera [3]). Being a data plane approach, it is independent of the host's networking stack (e.g., MPTCP [30]) and immediately benefits all traffic once deployed. Its end-to-end congestion visibility also makes it more robust to network asymmetry without control plane reconfiguration [19, 38].

The crux of designing a congestion-aware load balancing protocol is that we need to know real-time (on the order of an RTT) congestion information for all paths between the flow's source and destination [4, 5]. A straightforward approach is to use end-to-end path-wise information: a ToR switch maintains end-to-end congestion metrics for all the paths from itself to other ToR switches in the network. Congestion metrics can be collected by piggybacking on data packets. This works for simple 2-tier leaf-spine topologies, as shown, for example, by CONGA [4]. A ToR switch keeps track of up to thousands of paths for a large-scale leaf-spine network [4].

Yet, path-wise feedback does not scale for complex 3-tier Clos topologies widely deployed in production data centers, such as Google's [32], Facebook's [8] and Amazon's [1]. Typically, hundreds of paths exist between any two ToR switches, and a ToR switch can communicate with hundreds of other ToR switches. Thus, per-path feedback requires each ToR switch to store state for O(10^5)–O(10^6) paths in a production-scale 3-tier network. More importantly, it is impossible to collect real-time congestion information for all these paths, as there would not be enough concurrent flows that happen to traverse all of them at the same time (§2.2).

The main contribution of this paper is the design, implementation, and evaluation of Expeditus, a novel distributed congestion-aware load balancing protocol targeted at general 3-tier Clos topologies. Expeditus relies on two key designs, local congestion monitoring and two-stage path selection, to fundamentally overcome the design challenges discussed above (§2.3). Each switch only locally monitors the utilization of its uplinks to the upper tier switches. Local information collection ensures real-time congestion state is available without scalability or measurement overhead (§3.1).

Expeditus then applies two-stage path selection to aggregate relevant information across multiple hops and make load balancing decisions (§3.2). In the first stage, only the source and destination ToR switches are involved to select the best path from the ToR to the aggregation tier (i.e., the first and last hop). The source ToR switch sends its egress congestion metrics to the destination ToR, which combines them with its ingress congestion metrics to select the best path to the aggregation tier. In the second stage, the chosen aggregation switches then select the best core switch in a similar way based on the congestion state of the second and third hops. The path selection decisions are then maintained at ToR and aggregation switches.

Essentially, two-stage path selection uses partial-path information to heuristically find good paths for flows. By exploiting the structural properties of 3-tier Clos, two-stage path selection drastically reduces the complexity without much performance penalty. In fact, our evaluation shows that it performs effectively in many cases compared to a clairvoyant scheme with complete information. Expeditus performs path selection on a per-flow basis during the TCP three-way handshake, but does not cause packet reordering or introduce any latency overhead.

We implement Expeditus using the Click software router [25]. We evaluate it on the Emulab testbed (§4) as well as in large-scale packet-level ns-3 simulations (§5), using empirical flow size distributions from two production networks. Our evaluation shows that Expeditus provides up to 45% reductions in 95th percentile flow completion time (FCT) for mice flows compared to ECMP in 3-tier Clos. The average FCT of elephant flows (>1MB) is also reduced by ~15%–38%. When applied to 2-tier leaf-spine networks, Expeditus outperforms CONGA working at per-flow granularity, because of its ability to observe global and timely congestion information. Expeditus advances the state of the art, and its practical design has the potential for industry impact. At the time of writing it is being considered by a major vendor for adoption into their next-generation data center switch ASICs.

2. Motivation and Design Choices

We start by explaining our motivation and justifying the design choices behind Expeditus in this section.

2.1 Motivation for Congestion-Awareness

ECMP, the de facto load balancing solution in practice, is well known to deliver poor performance, causing low throughput and bandwidth utilization for elephant flows and long tail latency for mice flows [3, 5, 7, 9, 12, 19, 20, 23, 33–35, 37]. In addition, it does not properly handle asymmetry in the network topology, which often occurs in large-scale networks due to failures [38]. These problems are commonly attributed to the heavy-tailed nature of the flow size distribution. Thus, most solutions focus on attacking the tail, either by strategically routing elephant flows in a centralized [3] or distributed fashion [22], or by splitting flows into sub-units of similar sizes [19, 23, 35].

While fixing the heavy tail improves performance, the main culprit is ECMP's congestion agnostic nature, which fundamentally limits its ability to deal with inherent network dynamics. The congestion levels on the equal-cost paths are dynamic, making congestion-agnostic schemes such as ECMP ill-fitted for the job. For example, in a symmetric topology some paths may become temporarily congested due to hash collisions even with sub-flow level load balancing [19]. Moreover, failure-induced topology asymmetry can arise at any time in any part of the network and cause some paths to be persistently more congested than others.

This paper addresses the limitations of ECMP head-on, making a case for switch-based congestion-aware load balancing in light of the recent trend towards a smart data plane [4, 12, 21, 36]. An adaptive data plane is better suited to handle dynamic traffic and topological changes. In our design, switches dynamically choose less congested paths for each flow based on instantaneous congestion state, improving both throughput and tail latency.

2.2 Motivation for Scalable Design

This section introduces background on 3-tier Clos topologies and motivates the need for a scalable congestion-aware load balancing design in such topologies.


Figure 1: A general 3-tier Clos topology widely used in practice [8]. Not all switches or links are shown. There are p pods. Each pod has r ToR switches connecting to n aggregation switches. Here n is usually 4, as ToR switches have 4×40Gbps uplinks. There are m core switches in each of the n core planes that connect to the aggregation tier. Each switch has a unique ID configured as above.

3-tier Clos topologies [1, 8, 32] are prevalently used in production data centers for their scalability and cost-effectiveness.


A general 3-tier Clos is shown in Figure 1. Each pod has r ToR switches, each of which connects to n aggregation switches. An aggregation switch belongs to a unique core plane, and connects to all m core switches on its plane. Note that each of the p pods is a 2-tier leaf-spine network. For ease of exposition, we use t (ToR), a (aggregation), and c (core) to denote the switch type, a superscript to denote its pod/plane number, and a subscript to denote its ID; e.g., t^p_r is the r-th ToR switch in pod p, a^p_4 the 4-th aggregation switch in pod p, and c^4_m the m-th core switch on plane 4, as in Figure 1.

This highly modular topology provides flexibility to scale capacity in any dimension. When more compute capacity is needed, the provider adds server pods. When more inter-pod network capacity is needed, it can add core switches on all planes. Commodity ToR switches typically have 4×40Gbps uplinks. Thus, each pod normally has 4 aggregation switches. Aggregation/core switches can have up to 96×40Gbps ports. With 96 pods, the topology can accommodate 73,728 10Gbps hosts. In Facebook's Altoona data center, each aggregation switch connects to 48 ToR switches in its pod, and to 12 out of the 48 possible core switches on its plane, resulting in a 4:1 oversubscription [8]. Google has been using 3-tier Clos topologies [32], and Amazon's EC2 cloud is also deploying a 10Gbps fat-tree [1], which is a special case of the 3-tier Clos depicted above [2].
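To make the path count concrete, the following minimal C++ sketch (our own illustration, not from the paper) enumerates the nm inter-pod paths between a pair of ToRs as (aggregation ID, core ID) pairs, and multiplies by the number of destination ToRs to estimate the per-path state a path-wise feedback scheme would need at a single ToR. The Altoona-like parameters n = 4, m = 12, p = 96, r = 48 come from the numbers quoted above; everything else is an assumption for exposition.

// Illustrative sketch (not the paper's code): in the 3-tier Clos of Figure 1,
// an inter-pod path is fully determined by an aggregation switch ID j (which
// fixes the core plane) and a core switch ID k on that plane, so there are
// n*m paths between any two ToRs in different pods.
#include <cstdio>
#include <vector>

struct InterPodPath {
    int aggId;   // j in {1..n}; the same ID is used in the source and destination pod
    int coreId;  // k in {1..m} on core plane j
};

std::vector<InterPodPath> enumerateInterPodPaths(int n, int m) {
    std::vector<InterPodPath> paths;
    for (int j = 1; j <= n; ++j)
        for (int k = 1; k <= m; ++k)
            paths.push_back({j, k});
    return paths;                                   // n*m entries
}

int main() {
    int n = 4, m = 12, p = 96, r = 48;              // Altoona-like parameters from the text
    auto paths = enumerateInterPodPaths(n, m);
    long destToRs = static_cast<long>(p - 1) * r;   // ToR switches in other pods
    std::printf("paths between one ToR pair: %zu\n", paths.size());
    std::printf("per-path state at one ToR : %ld\n",
                static_cast<long>(paths.size()) * destToRs);
    // With these parameters the second number is about 219K, i.e. the ~220K
    // figure cited below for per-path feedback.
    return 0;
}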

The key to designing a congestion-aware load balancing protocol is to acquire global congestion information for choosing the optimal paths. Existing work addresses this problem by maintaining congestion metrics for all the paths between all pairs of ToR switches. For example, in CONGA [4], each ToR switch keeps congestion information of the paths to all other ToR switches. The congestion metrics are obtained by piggybacking on data packets as they traverse the paths, and are then fed back to the source ToR switch. This works well in 2-tier leaf-spine topologies, because the number of states to keep is relatively small; even in an extremely large (hypothetical) topology with 576 40Gbps ports paired with 48-port leaf switches [4], each leaf switch only needs to track ~7K paths.

However, this presents significant scalability issues in 3-tier Clos topologies. In 3-tier topologies, even collecting congestion information for all paths is challenging because the number of paths to monitor dramatically increases (by two orders of magnitude). As shown in Figure 1, a 3-tier Clos has nm paths between a pair of ToRs in different pods. Thus, each ToR needs to track O(nmpr) paths for all possible destination ToRs. For Facebook's Altoona data center with 96 pods, this amounts to ~220K different paths. This is quite expensive to implement in switch ASICs. Furthermore, the information collected must reflect the real-time status of each path, as congestion can change rapidly at timescales of a few RTTs due to bursty flows and exponential rate throttling in TCP [4, 5, 10, 21]. However, keeping the information up to date is even more challenging: the simple per-path feedback design requires at least O(nmpr) concurrent flows to cover all the paths of a ToR at the same time, which is exceedingly difficult to achieve in practice.

This paper presents a practical congestion-aware load balancing scheme that solves the above challenges and can be applied to both 3-tier and 2-tier topologies. In the following, we discuss and justify the key design choices we make for Expeditus.

2.3 Design Choices

For congestion-aware load balancing to work in 3-tier topologies, two key components must scale with minimal overhead: an information collection mechanism that monitors the congestion state, and a path selection mechanism that makes congestion-aware decisions.

Information collection. One strawman solution for addressing the scalability issue is to decompose a path into two pathlets: a north-bound segment (from the source ToR to a core switch) and a south-bound segment (from the core switch to the destination ToR). Congestion information can be collected at the granularity of pathlets. Then only O(nm) pathlets' congestion information is maintained at a ToR switch. However, a per-pathlet approach still requires O(nm) concurrent flows from the core tier to the same ToR switch to cover all its pathlets and piggyback the information, which is difficult to achieve in practice. For example, a flow from t^1_1 to t^3_1 congests the link from c^2_2 to a^3_2, as shown in Figure 2. Now t^1_1 has a new flow to send to t^3_2. Since there is no flow on the pathlet from c^2_2 to t^3_2, t^3_2 cannot observe the congestion nor inform the source ToR to avoid choosing c^2_2.


Figure 2: Per-pathlet congestion information collection is not timely enough. Here is a small Clos network with n = m = 2, r = 3. The link from c^2_2 to a^3_2 is congested due to a flow on the pathlet from c^2_2 to t^3_1 as shown. If t^1_1 has a new flow to send to t^3_2, as there is no flow on the pathlet from c^2_2 to t^3_2, t^3_2 cannot observe the congestion and inform t^1_1.

Design choice: Local information collection. We choose to rely on simple local congestion monitoring in Expeditus. Each switch only monitors the egress and ingress congestion metrics of all its uplinks. The scalability challenge is resolved by only maintaining 2n states at each ToR switch for the n uplinks connected to the aggregation tier, and 2m states at each aggregation switch for the m uplinks connected to the core tier. Real-time congestion state can be readily gathered and updated whenever a packet enters or leaves the switch, and does not require any piggybacking.
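As a rough illustration of how small this state is, here is a C++ sketch (our own, purely for exposition; the quantized-load width is an assumption) of the per-switch structure: one egress and one ingress load value per uplink, i.e. 2n values at a ToR and 2m at an aggregation switch.

// Illustrative layout of the local congestion state in Expeditus (not switch code).
#include <cstdint>
#include <vector>

struct UplinkLoads {
    uint8_t egress = 0;    // quantized load of traffic leaving on this uplink
    uint8_t ingress = 0;   // quantized load of traffic arriving on this uplink
};

struct LocalCongestionState {
    explicit LocalCongestionState(int uplinks) : loads(uplinks) {}
    std::vector<UplinkLoads> loads;   // n entries at a ToR, m at an aggregation switch
};

int main() {
    LocalCongestionState tor(4);      // e.g. n = 4 uplinks at a ToR
    LocalCongestionState agg(12);     // e.g. m = 12 uplinks at an aggregation switch
    return (tor.loads.size() + agg.loads.size()) == 16 ? 0 : 1;
}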


Path selection. With local information collection, we now need to aggregate information across the network in order to obtain the path-wise congestion. A strawman solution is for the source ToR to ask all the 2n aggregation switches in the source and destination pods, plus the destination ToR, for their local information. In total 2n + 2 switches have to participate in the process. This again does not scale well for large networks. Further, coordinating so many switches across four hops adds much implementation and latency overhead to the protocol.

Design choice: Two-stage path selection. We present a novel two-stage path selection mechanism to efficiently aggregate information with minimal overhead. Our design exploits a salient structural property of 3-tier Clos, which we call pairwise connectivity:

In 3-tier Clos, an aggregation switch of ID i only connects to aggregation switches of the same ID i in other pods, because they connect to the same core plane. Thus, no matter which aggregation switch the source ToR chooses, the packet always goes via an aggregation switch of the same ID in the destination pod (recall Figure 1).

Based on pairwise connectivity, we propose a two-stage path selection mechanism that works as shown in Figure 3. There are four hops between any two ToRs in different pods. Our scheme first selects the best aggregation switches (the first and last hop) using the congestion information of the first and last links, and then determines the best core switch (the second and third hop) using the congestion information of the second and third links. The process starts when the first packet of a flow arrives at a ToR. The source ToR tags the first packet to add the first-hop information, i.e., the egress congestion metrics of its uplinks. This data packet then serves as an Expeditus request packet (Exp-request in short). The destination ToR reads the congestion metrics from the Exp-request, aggregates them with the ingress metrics of all its uplinks, and chooses the least congested aggregation switches. A response packet called Exp-response is then generated by the destination ToR and sent to the chosen destination aggregation switch. The destination aggregation switch feeds back the third-hop congestion metrics via the Exp-response to the source aggregation switch, which similarly chooses the core switch with the least effective congestion before forwarding the Exp-response to the source ToR. Thus, Expeditus completes path selection with two messages (Exp-request and Exp-response) in just one round trip. The path selection decisions are maintained at the source ToR and aggregation switches.

This design requires only two switches at each stage to exchange information. For a new TCP connection, Expeditus selects paths for the two flows in both directions independently during the handshaking process and does not cause any packet reordering. Intuitively, two-stage path selection is a heuristic since it does not explore all available paths: it simplifies the problem from choosing the best combination of aggregation and core switch IDs to choosing the two sequentially. Nevertheless, by taking advantage of pairwise connectivity, it performs effectively compared to a clairvoyant scheme with complete information in 3-tier Clos, as shown in §5.


Figure 3: Design of two-stage path selection. The first stage is between the source ToR (t^1_1) and the destination ToR (t^3_2), which chooses the best aggregation switch ID based on the first- and last-hop congestion metrics. The second stage involves the chosen destination and source aggregation switches, where the source aggregation switch (a^1_2) chooses the best core switch ID. The information exchange is done via Exp-request and Exp-response.

3. Expeditus Design

Expeditus runs entirely in the data plane as a distributed protocol, with all functionalities residing in switches. It works on a per-flow basis for implementation simplicity, and uses link load as the congestion metric, which is shown to be effective and can be readily implemented in hardware [4].

3.1 Local Congestion Monitoring

In Expeditus only ToR and aggregation switches perform local congestion monitoring, for all uplinks that connect them to upstream switches. This covers the entire in-network path except the links connecting the host and the ToR, which are unique and not part of load balancing. Both the egress and ingress directions are monitored.

Link load estimation. To measure the link load, we use the Discounting Rate Estimator (DRE) [4]. DRE maintains a register X which is incremented by the packet size in bytes every time a packet is sent/received over the link, and is decremented every T_dre by a multiplicative factor α between 0 and 1. We follow the settings in [4] and set T_dre = 20 µs and α = 0.1. The link load is quantized into 3 bits relative to the link speed.

Congestion table. Link load information is maintained in a congestion table. In case of link failures, the switch detects that a port is down, and simply sets the utilization of that port to the largest value.
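As a concrete illustration of the estimator, below is a simplified C++ sketch of a DRE register (our own rendering of the description in [4], not actual switch code): the register is incremented by the packet size on every send/receive, decayed by the factor α every T_dre, and the resulting utilization is quantized to 3 bits against the link speed. The T_dre = 20 µs and α = 0.1 defaults follow the settings above; the steady-state rate conversion is an assumption of this sketch.

// Simplified DRE sketch; one instance per monitored uplink direction.
#include <algorithm>
#include <cstdint>
#include <cstdio>

class DRE {
public:
    DRE(double linkBytesPerSec, double tDreSec = 20e-6, double alpha = 0.1)
        : linkBytesPerSec_(linkBytesPerSec), tDreSec_(tDreSec), alpha_(alpha) {}

    // Called for every packet sent (egress DRE) or received (ingress DRE).
    void onPacket(uint32_t packetBytes) { x_ += packetBytes; }

    // Called by a periodic timer every T_dre.
    void decay() { x_ *= (1.0 - alpha_); }

    // Quantize the estimated utilization into 3 bits (0..7) relative to link speed.
    uint8_t quantizedLoad() const {
        // In steady state X ~= rate * T_dre / alpha, so rate ~= X * alpha / T_dre.
        double bytesPerSec = x_ * alpha_ / tDreSec_;
        double util = std::min(1.0, bytesPerSec / linkBytesPerSec_);
        return static_cast<uint8_t>(util * 7.0 + 0.5);
    }

private:
    double x_ = 0.0;          // the DRE register
    double linkBytesPerSec_;
    double tDreSec_;
    double alpha_;
};

int main() {
    DRE egress(10e9 / 8.0);                               // one 10Gbps uplink
    for (int i = 0; i < 100; ++i) egress.onPacket(1500);  // 100 MTU-sized packets
    egress.decay();
    std::printf("quantized load: %u\n", static_cast<unsigned>(egress.quantizedLoad()));
    return 0;
}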


3.2 Two-stage Path Selection

We now explain the working of two-stage path selection, the most important mechanism in Expeditus.

Path selection table (PST). Path selection decisions are maintained in a PST in each ToR and aggregation switch. Only the northbound pathlet of a flow's path needs to be recorded, for there is a unique southbound pathlet from a given core or aggregation switch to a given ToR. Each PST entry records a flow ID obtained from hashing the five-tuple, the egress port selected, a path selection status (PSS) bit indicating whether path selection has been completed (1) or not (0), and a valid bit indicating whether the entry is valid (1) or not (0).

When a packet arrives, we look up an entry based on its flow ID. If an entry exists and is valid, the packet is routed to the corresponding egress port. If the entry is invalid and the PSS bit is one, or if no entry exists, then the packet represents a new flow and starts a new round of path selection. The valid and PSS bits are set when a path has been selected. The PSS bit ensures path selection is performed only once; an invalid PST entry with a zero PSS bit does not trigger path selection when subsequent packets of the flow, if any, arrive before path selection completes (they are simply forwarded by ECMP).

PST entries time out after a period of inactivity in order to detect inactive flows and force a new path to be selected. When an entry times out, the valid bit is also reset. Since we focus on per-flow load balancing, the timeout value is set to 100 ms, which is large enough to filter out bursts of the same flow. The timer can be implemented using just one extra bit per entry and a global timer for the entire PST [4].
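The PST entry and the lookup rule just described can be summarized in a short C++ sketch (our own illustration, not switch code; the hash function, field widths, and table representation are assumptions):

// Illustrative PST entry and lookup logic.
#include <cstdint>
#include <functional>
#include <unordered_map>

struct FiveTuple {
    uint32_t srcIp, dstIp;
    uint16_t srcPort, dstPort;
    uint8_t  proto;
};

inline uint64_t flowId(const FiveTuple& t) {          // hash of the five-tuple
    uint64_t h = (static_cast<uint64_t>(t.srcIp) << 32) | t.dstIp;
    h ^= (static_cast<uint64_t>(t.srcPort) << 24) ^
         (static_cast<uint64_t>(t.dstPort) << 8) ^ t.proto;
    return std::hash<uint64_t>{}(h);
}

struct PstEntry {
    uint8_t egressPort = 0;
    bool    pss   = false;   // path selection completed?
    bool    valid = false;   // entry usable for forwarding? (reset on timeout)
};

enum class Action { ForwardOnSelectedPort, ForwardByEcmp, StartPathSelection };

Action lookup(const std::unordered_map<uint64_t, PstEntry>& pst, const FiveTuple& t) {
    auto it = pst.find(flowId(t));
    if (it == pst.end())
        return Action::StartPathSelection;            // no entry: new flow
    if (it->second.valid)
        return Action::ForwardOnSelectedPort;         // path already selected
    if (it->second.pss)
        return Action::StartPathSelection;            // entry invalidated (timed out): reselect
    return Action::ForwardByEcmp;                     // selection in flight: fall back to ECMP
}

int main() {
    std::unordered_map<uint64_t, PstEntry> pst;
    FiveTuple t{0x0a000001, 0x0a000002, 12345, 80, 6};
    return lookup(pst, t) == Action::StartPathSelection ? 0 : 1;
}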

The implementation cost of tracking per-flow state using the PST is low for a data center network, even at scale. As reported in [4], the number of concurrent flows would be less than 8K even for an extremely heavily loaded switch. A PST of 64K entries should cover all scenarios.

Ethernet tags. Expeditus uses dedicated Ethernet tags on IP packets, similar to IEEE 802.1Q VLAN tagging, to exchange congestion information between switches during path selection. Specifically, a new field is added between the source MAC address and the EtherType/length fields. The structure of the new field is shown in Figure 4 (an encoding sketch follows the list below), and includes the following:

• Tag protocol identifier, TPID (16 bits): This field is set to 0x9900 to identify the packet as an Expeditus path selection packet.

• Stage flag, SF (1 bit): This field identifies which stage this packet serves (0 for the first stage, 1 for the second stage). A tagged packet for the first stage is called an "Exp-request," and for the second stage an "Exp-response."

• Number of entries, NoE (7 bits): This field identifies the number of congestion metrics carried in the packet (at most 128). Note that 7 bits are more than enough in practice. Between the ToR and aggregation tiers, the Facebook Altoona topology has just 4 paths, and a 128-pod fat-tree with 524,288 hosts has 64 paths; between the aggregation and the core tiers, the former has 48 paths and the latter 64 paths.

• Congestion data, CD: This field contains the actual congestion metrics. We use 4 bits for one entry, with the first bit as padding.
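For concreteness, here is one possible byte-level encoding of the tag (our own sketch; the field widths follow the list above, but the exact bit packing of SF, NoE, and the CD entries is an assumption):

// Illustrative encoding of the Expeditus tag of Figure 4.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint8_t> encodeExpeditusTag(bool isResponse,
                                        const std::vector<uint8_t>& metrics) {
    std::vector<uint8_t> tag;
    tag.push_back(0x99);                          // TPID = 0x9900, first byte
    tag.push_back(0x00);                          // TPID, second byte
    uint8_t sf  = isResponse ? 1 : 0;             // SF: 0 = Exp-request, 1 = Exp-response
    uint8_t noe = static_cast<uint8_t>(metrics.size()) & 0x7F;   // NoE: 7 bits
    tag.push_back(static_cast<uint8_t>((sf << 7) | noe));        // SF (1 bit) + NoE (7 bits)
    // CD: one 4-bit entry per metric, first bit of each entry is padding.
    // Packing two entries per byte is an assumption of this sketch.
    for (std::size_t i = 0; i < metrics.size(); i += 2) {
        uint8_t hi = metrics[i] & 0x07;                           // 3-bit quantized load
        uint8_t lo = (i + 1 < metrics.size()) ? (metrics[i + 1] & 0x07) : 0;
        tag.push_back(static_cast<uint8_t>((hi << 4) | lo));
    }
    return tag;
}

int main() {
    std::vector<uint8_t> egressLoads = {3, 5};    // e.g. a ToR with n = 2 uplinks
    auto tag = encodeExpeditusTag(false, egressLoads);
    std::printf("tag length: %zu bytes\n", tag.size());
    return 0;
}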


Figure 4: Format of the Expeditus tag for path selection.

Path selection algorithm. Figure 5 illustrates our two-stage path selection for two flows in the topology of Figure 2. There is a new TCP connection between hosts under ToR switches t^1_1 and t^3_2. Flow A is the flow in the forward direction from t^1_1 to t^3_2 that initiates the SYN, and flow B is in the reverse direction. Both flows use the two-stage path selection mechanism independently to establish their paths.

Let us start with flow A. Its first packet (the SYN in this case) reaches its source ToR t^1_1 and triggers the path selection mechanism after checking the PST. t^1_1 tags the packet with the egress link loads of its uplinks, and sets SF to 0 and NoE correspondingly. It forwards the tagged packet, i.e., the Exp-request, using ECMP, and inserts a new entry into the PST with the PSS bit set to 0. Aggregation switches ignore Exp-request packets and simply forward them with ECMP. The destination ToR switch t^3_2 reacts to the Exp-request. It checks the NoE field, pulls the congestion information, and aggregates it entry-by-entry with its ingress link loads (recall pairwise connectivity, §2.3). The effective congestion of each of the n paths between the source ToR and the aggregation tier is determined simply as the maximum load of its two hops. t^3_2 then chooses the aggregation switch ID with the minimum effective congestion, which is 2 in this case. t^3_2 generates an Exp-response without payload by copying the TCP/IP header from the Exp-request and swapping the src and dst IP addresses. It tags the Exp-response with SF set to 1 and forwards it to the chosen aggregation switch a^3_2. It also removes the tag from the Exp-request and forwards it to the host. This completes the first stage of path selection.

The second stage is similar and involves the aggregation switches choosing the path to the core tier using the Exp-response. The aggregation switch a^3_2 handles the Exp-response with NoE set to zero by adding its ingress loads and setting NoE accordingly. The source aggregation switch a^1_2 reacts to the Exp-response with a non-zero NoE value by comparing a^3_2's ingress loads with its own egress loads, and choosing the best core switch ID (1 in this case). It computes flow A's ID by swapping the src and dst IP addresses, inserts a new PST entry or updates an existing entry for flow A, records 1 as the path selection result (by pairwise connectivity), and sets both PSS and the valid bit to 1. Finally, the source ToR t^1_1 receives the Exp-response, matches it with flow A's entry in the PST, records its ingress port (2 in this case) in the entry as the path selection result, sets both PSS and the valid bit to 1, and discards the Exp-response. This finishes flow A's path selection.


Figure 5: Two-stage path selection in Expeditus for the network shown in Figure 2. We only show the path selection process for flow A from the source ToR t^1_1 to t^3_2. Path selection for flow B of the same TCP connection in the reverse direction is done similarly.

Flow B's path selection is done in exactly the same way. The only difference is that the first packet of B is a SYN-ACK in this case. It arrives at B's source ToR t^3_2 and starts the process. Clearly the selected aggregation and core switches may be different from those for flow A.

To summarize, for the new TCP connection in our example, flow A's path is established before the SYN-ACK¹, and before its second packet (the first ACK) arrives at the source ToR. Flow B's path is selected before the handshaking ACK, and thus before its second packet arrives. Therefore no packet reordering is caused by Expeditus when path selection is done during TCP handshaking. Algorithms 1 and 2 rigorously present the path selection and packet processing logic for ToR and aggregation switches.
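To make the two decisions concrete, the sketch below redoes the arithmetic of the example: the effective congestion of each candidate ID is the maximum of its two hop loads, and the smallest effective congestion wins. The load values correspond to the Figure 5 example; the helper code itself is our own illustration.

// Worked example of the two path selection decisions.
#include <algorithm>
#include <cstdio>
#include <vector>

// Returns the switch ID (1-based) whose effective congestion, i.e.
// max(nearLoads[i], farLoads[i]), is smallest.
int chooseBestId(const std::vector<int>& nearLoads, const std::vector<int>& farLoads) {
    int bestId = 1;
    int bestCongestion = std::max(nearLoads[0], farLoads[0]);
    for (std::size_t i = 1; i < nearLoads.size(); ++i) {
        int c = std::max(nearLoads[i], farLoads[i]);
        if (c < bestCongestion) { bestCongestion = c; bestId = static_cast<int>(i) + 1; }
    }
    return bestId;
}

int main() {
    // Stage 1: the source ToR's egress loads (carried in the Exp-request)
    // combined with the destination ToR's ingress loads:
    // effective = {max(3,6), max(5,0)} = {6, 5}  ->  aggregation ID 2.
    int aggId = chooseBestId({3, 5}, {6, 0});
    // Stage 2: the destination aggregation switch's ingress loads (carried in the
    // Exp-response) combined with the source aggregation switch's egress loads:
    // effective = {max(3,3), max(7,5)} = {3, 7}  ->  core ID 1.
    int coreId = chooseBestId({3, 7}, {3, 5});
    std::printf("chosen aggregation ID = %d, core ID = %d\n", aggId, coreId);
    return 0;
}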

Lost Exp-request/Exp-response. As discussed, we do not have any retransmission mechanism in case the control packets are dropped. This simply means one opportunity for load balancing is lost; it does not affect the flow at all. There are two possibilities: (1) the path is not established at all. Expeditus is robust in this case since any packet that sees an invalid entry is routed using ECMP. (2) Part of the path is established at the aggregation switch, but not at the ToR switch. Expeditus is still robust since the PST entry at the aggregation switch will just time out. If we are fortunate and ECMP at the ToR switch happens to direct the flow to the chosen aggregation switch, we can still use the least congested path.

3.3 Handling Failures

Failures are the norm rather than the exception in large-scale networks with thousands of switches. Expeditus automatically routes traffic around the congestion caused by failures, thanks to its congestion-aware nature, delivering better performance than congestion-agnostic schemes.

We illustrate this using an example in Figure 6. Suppose the link from a^1_1 to c^1_1 is down. This causes all links from c^1_2 to the first aggregation switches of each pod (dashed links) to be congested, as they are the only paths to reach a^1_1.

¹ The Exp-response is generated immediately upon receiving the Exp-request.

Algorithm 1 Expeditus, ToR switch
1: procedure ToR_Switch_Processing(packet p)
2:   if p is northbound to the aggregation tier then
3:     if a PST entry e exists for p then
4:       if e.valid_bit == 1 then
5:         forward p according to e, return
6:       else if PSS == 0 then
7:         forward p by ECMP, return
8:     add the Expeditus tag to p                 ▷ Start path selection
9:     SF ← 0, add the egress loads, forward p by ECMP
10:    insert a new PST entry or update the existing one for p, PSS ← 0, valid_bit ← 0, return
11:  else                                         ▷ southbound packet
12:    if p is Expeditus tagged then
13:      if p.SF == 0 then                        ▷ Exp-request received
14:        check NoE, pull CD
15:        choose the best aggregation switch ID f*
16:        generate an Exp-response p', p'.SF ← 1, p'.NoE ← 0
17:        p'.src_ip ← p.dst_ip, p'.dst_ip ← p.src_ip
18:        forward p' to aggregation switch f*
19:        remove the tag from p, forward it, return
20:      else                                     ▷ Exp-response received
21:        record p's ingress port p_i
22:        find the PST entry e for the flow f, with f.src_ip = p.dst_ip, f.dst_ip = p.src_ip
23:        e.egress_port ← p_i, e.PSS ← 1, e.valid_bit ← 1, discard p, return

For example, if there are flows from t^2_1 to t^1_1, the a^2_1-c^1_2 link is more congested than the two links from a^2_2 to the core tier. Now traffic from a^2_1 to other pods, say pod 3, will be routed to c^1_1 by Expeditus in order to avoid the congested link. In contrast, ECMP still distributes traffic evenly and further congests c^1_2.

Yet, since two-stage path selection only utilizes partial information of a path, it can make sub-optimal decisions for certain flows, especially with topological asymmetry. Consider the same example again. Suppose there is traffic from pod 1 to pod 2. Due to the failure, a^1_1 suffers a 50% bandwidth reduction, and uplinks from ToRs to a^1_1 (red lines) cannot achieve their full capacity when transmitting inter-pod traffic. Thus these uplinks are actually more likely to be selected in favor of their low loads, which exacerbates congestion on a^1_1 and c^1_2.


Algorithm 2 Expeditus, aggregation switch
1: procedure Aggr_Switch_Processing(packet p)
2:   if p is northbound to the core tier then
3:     if p is Expeditus tagged, p.SF == 1 then   ▷ Exp-response, first hop
4:       add the switch's ingress loads to p, set p.NoE
5:     if a PST entry e exists for p then
6:       if e.valid_bit == 1 then
7:         forward p according to e, return
8:     forward p by ECMP, return
9:   else                                         ▷ southbound packet
10:    if p is Expeditus tagged, p.SF == 1, p.NoE is non-zero then
11:                                               ▷ Exp-response, third hop
12:      check NoE, pull CD
13:      choose the best core switch ID f*
14:      record the port p_i connected to core switch f*
15:      find the PST entry e for the flow f, with f.src_ip = p.dst_ip, f.dst_ip = p.src_ip, or insert a new entry if not found
16:      e.egress_port ← p_i, e.PSS ← 1, e.valid_bit ← 1
17:      forward p, return


Figure 6: The link between a^1_1 and c^1_1 fails. This causes all links from c^1_2 to the first aggregation switches of each pod to be congested, as they are the only paths to reach a^1_1. A link load multiplier of 2 is applied to links from a^1_1 to each ToR in pod 1, as shown, for inter-pod traffic.

To address this issue, we set link load multipliers for ToR uplinks based on the effective capacity at aggregation switches. This effectively makes network bottlenecks visible at the ToR switches. We assume the underlying routing protocol or the control plane notifies ToR switches of the aggr-core link failure [19, 38]. The ToR switches then set a link load multiplier of 2 for the uplink to a^1_1, as highlighted in Figure 6. The multipliers only affect inter-pod traffic at ToRs. The link loads are scaled by the multipliers when they are used in the first stage of path selection. Effectively, this translates the capacity reduction at the aggregation layer proportionally to the ToR layer. In this way, ToRs are more likely to choose uplinks to a^1_2 and redistribute traffic properly.
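A small sketch of how the multiplier enters the first-stage comparison (our own illustration of the rule above; the concrete load values are made up):

// Illustrative link load multipliers at a ToR (Section 3.3).
#include <cstdio>
#include <vector>

struct ToRUplink {
    int load;        // measured (quantized) load of the uplink
    int multiplier;  // 1 normally; 2 when the upstream aggregation switch has lost
                     // half of its aggr-core capacity (set via the control plane)
};

int scaledLoad(const ToRUplink& u, bool interPod) {
    // Multipliers only affect inter-pod traffic at the ToR.
    return interPod ? u.load * u.multiplier : u.load;
}

int main() {
    // In the spirit of Figure 6: uplink 1 leads to the degraded a^1_1.
    std::vector<ToRUplink> uplinks = {{2, 2}, {3, 1}};
    for (std::size_t i = 0; i < uplinks.size(); ++i)
        std::printf("uplink %zu: raw load %d, used in stage 1 as %d\n",
                    i + 1, uplinks[i].load, scaledLoad(uplinks[i], /*interPod=*/true));
    // Although uplink 1 looks lighter (2 < 3), its scaled load 4 makes the ToR
    // prefer uplink 2 for inter-pod traffic, steering flows away from a^1_1.
    return 0;
}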

3.4 Discussions

2-tier leaf-spine topologies: We briefly explain how Expeditus works in 2-tier leaf-spine topologies. This also applies to intra-pod traffic in a 3-tier Clos. The two-stage path selection naturally reduces to one-stage path selection because only the ToR switches are involved. The Exp-request carries the egress loads of the source ToR, and the destination ToR aggregates them with its ingress loads to obtain the end-to-end congestion information for all possible paths, and chooses the best one. Aggregation switches ignore a southbound Exp-response with a zero NoE and simply forward it. The chosen aggregation switch ID is obtained as the Exp-response's ingress port when it reaches the source ToR.

Handling UDP traffic: Expeditus can be used directly, without change, to load balance UDP flows. The first packet of a new UDP flow (without a PST entry or with an invalid PST entry) triggers path selection. Packets sent before the path is established are forwarded by ECMP, and UDP does not suffer from re-ordering.

4. Implementation and Testbed Evaluation

We implement Expeditus on Click and run our prototype on a small-scale Emulab testbed to answer the following questions: (1) Is it practical to implement Expeditus, and how much overhead does it introduce (§4.1)? (2) How does Expeditus actually perform in practice (§4.2)?


Figure 7: Packet processing pipeline in Click.

Prototype implementation: We implement a prototype of Expeditus in Click, a modular software router for fast prototyping of routing protocols [25]. We develop two new Click elements, DRE to measure link load and EXPRoute to conduct two-stage path selection. A DRE element is added to the FromDevice and ToDevice elements of each port in the Click configuration. EXPRoute can obtain ingress/egress link loads from the upstream/downstream DRE elements. Figure 7 shows the packet processing pipeline of Expeditus for a ToR or aggregation switch with 4 ports.

Note that unlike actual routers whose timers have microsecond resolution in hardware, Click can only achieve millisecond resolution. This affects accurate estimation of the link load, which makes DRE react more slowly to link load changes in our testbed. To compensate, we additionally use queue length information together with the DRE working at a T_dre of 1 ms.

Testbed setup: Our Emulab testbed uses PC3000 nodes to host Click routers, with 64-bit Intel Xeon 3.0GHz processors, 2GB DDR2 RAM, and four 1GbE NICs. All nodes run CentOS 5.5 with a patched Linux 2.6.24.7 kernel and a patched Intel e1000-7.6.15.5 NIC driver to improve Click performance. We use the default TCP Cubic implementation in the kernel. Our Click implementation runs in kernel space, and we verify that TCP throughput between two Click routers is stable at 940+ Mbps.

Constrained by the number of NICs an Emulab node has, we set up a small-scale 3-tier Clos network with 2 pods of 2 aggregation and 2 ToR switches each, i.e., n = m = r = p = 2. Each aggregation switch connects to 2 core switches, and each ToR switch connects to 2 hosts. The core tier is oversubscribed at 4:1 by rate limiting the core links to emulate a realistic setting in practice [8].

4.1 Microbenchmarks of Overheads

First, to demonstrate viability, we evaluate the packet processing overhead of Expeditus. Table 1 shows the average CPU time of forwarding a packet through each element at the source ToR switch sending at line rate, measured with Intel Xeon cycle counters. We report the average value obtained from the total processing time divided by the number of packets (~550K). For comparison, we implement a HashRoute element to perform ECMP and measure its processing time. The additional overhead incurred by the EXPRoute and DRE elements is only hundreds of nanoseconds. This is comparable to the overhead of a vanilla IP forwarding element [25].

Table 1: Average packet processing time comparison.

Element      ns/pkt
HashRoute    275
EXPRoute     473
DRE          151

Table 2: Average time it takes for the sender to start sending the first data packet after sending the SYN.

Scheme       Avg Time (µs)
ECMP         623.90
Expeditus    638.33

We also look at the latency overhead of two-stage path selection. We start a TCP connection and measure how long it takes for the sender to start sending the first data packet, by which time path selection for both directions is done and cannot affect the flow anymore. As shown in Table 2, Expeditus adds a negligible 15 µs delay on average over 100 runs. This overhead is mainly due to the additional processing in the EXPRoute element. The latency overhead would be much smaller if Expeditus were implemented in an ASIC.

4.2 Testbed Evaluation

We conduct experiments to evaluate Expeditus's performance with two realistic workloads from production datacenters. The first is from a cluster running mostly web search, as reported in [5]. The second is from a large cluster running data mining jobs [17]. Both distributions are heavy-tailed: in the web search workload, over 95% of bytes are from the 30% of flows larger than 1MB; in the data mining workload, 95% of bytes are from the roughly 3.6% of flows that are larger than 35MB, while more than 80% of flows are less than 10KB. We generate flows between random senders and receivers in different pods according to Poisson processes with varying arrival rates. Our setup is consistent with existing work [4, 9, 19].

Figures 8 and 9 show the FCT results of Expeditus and ECMP for both workloads. We vary loads from 0.1 to 0.7, beyond which the results become unstable in our testbed. We emphasize FCT statistics for mice flows (<100KB) and elephant flows (>1MB). The results of medium flows between 100KB and 1MB are largely in line with elephant flows, and we do not show them for brevity. Each data point is the average of 3 runs.

We make the following observations. First, for both workloads, Expeditus outperforms ECMP in both average and 95th percentile tail FCT for mice flows. For loads between 0.4 and 0.7, Expeditus reduces the average FCT by ~14%–30% in the web search workload and 17%–25% in the data mining workload. The reduction in tail FCT is even larger: ~30%–45% in the web search workload and 5%–30% in the data mining workload. Second, Expeditus also improves throughput for elephant flows at medium and high loads. The reduction in average FCT is 9%–38% for the web search workload and 11%–18% for the data mining workload. The average FCT is much longer in the data mining workload as the elephant flows are much larger (mostly >10MB) than those in the web search workload. Across all flows, Expeditus reduces the average FCT by 12%–32% and 15%–22%, respectively, for the two workloads with loads from 0.4 to 0.7. Third, when the network load is light (<0.4), Expeditus provides relatively small benefits compared to ECMP. This is mainly due to the limitation of our Click-based implementation in detecting transient congestion. Our implementation updates the link utilization using Click timers that operate at coarse-grained timescales (1 ms granularity). However, in lightly loaded networks, congestion is much more transient and often happens at small timescales. In our simulations, where the link utilization is updated frequently, we observe that the benefit of Expeditus does not degrade even when the network is lightly loaded (§5.3).

5. Large-scale Simulation

We conduct extensive packet-level ns-3 simulations to evaluate Expeditus in large-scale networks. We seek to answer the following key questions: (1) How does Expeditus perform under various traffic patterns at scale (§5.2, §5.3)? (2) Is Expeditus sensitive to network bottleneck conditions (§5.4)?


Figure 8: FCT for the web search workload in the Emulab testbed, with core tier oversubscription 4:1. Panels: (a) overall average; (b) (0, 100KB] average; (c) (0, 100KB] 95th percentile; (d) (1MB, ∞) average.

Figure 9: FCT for the data mining workload in the Emulab testbed, with core tier oversubscription 4:1. Panels: (a) overall average; (b) (0, 100KB] average; (c) (0, 100KB] 95th percentile; (d) (1MB, ∞) average.

(3) Is Expeditus effective in handling topological asymmetry due to failures (§5.5)? (4) How does it perform in leaf-spine networks compared to existing work (§5.6)?

5.1 Methodology

Simulation Setup. We use a 12-pod fat-tree [2] as the baseline topology. As mentioned in §2.2, a fat-tree is a 3-tier Clos topology. There are 432 hosts and 36 core switches, i.e., 36 equal-cost paths between any pair of hosts in different pods. Each ToR switch has 6 ports each to the aggregation tier and to hosts. All links run at 10Gbps. We vary the number of core switches to obtain different oversubscription ratios at the core tier, as production networks often oversubscribe the core to reduce costs [8]. The baseline oversubscription ratio is 2:1. The fabric RTT is ~32 µs across pods. Each run of the simulation generates more than 10K flows, and we report the average over 5 runs unless otherwise noted.

We use several different traffic patterns.

Realistic: We use the same realistic workloads from production datacenters as in our testbed experiments: a web search workload [5] and a data mining workload [17]. Each host continuously generates flows based on the empirical flow size distributions and sends to a random receiver. The flow inter-arrival time follows an exponential distribution.

Stride: We index the hosts from left to right. Server[i] sends to server[(i+M) mod N], where M is the number of hosts in a pod and N the total number of hosts.

Bijection: Each host sends to a random destination in a different pod. Each host only receives data from one sender.

Random: Each host sends to a random destination not in the same pod as itself. Different from Bijection, multiple hosts may send to the same destination. (A generator sketch for the three synthetic patterns follows below.)
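For reference, a compact C++ sketch of the three synthetic patterns (our own illustration, not the ns-3 code; the bijection construction is one simple way to satisfy the stated constraints):

// Illustrative generators for the Stride, Bijection and Random patterns.
// Hosts are indexed 0..N-1 from left to right, with M hosts per pod.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> stride(int N, int M) {                 // server[i] -> server[(i+M) mod N]
    std::vector<int> dst(N);
    for (int i = 0; i < N; ++i) dst[i] = (i + M) % N;
    return dst;
}

std::vector<int> bijection(int N, int M, std::mt19937& rng) {
    // One simple construction: every pod sends to another pod chosen by a random
    // non-zero rotation, and hosts are matched by a random within-pod permutation,
    // so each host receives from exactly one sender in a different pod.
    int pods = N / M;
    std::uniform_int_distribution<int> offset(1, pods - 1);
    int shift = offset(rng);
    std::vector<int> perm(M);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), rng);
    std::vector<int> dst(N);
    for (int i = 0; i < N; ++i) {
        int pod = i / M, slot = i % M;
        dst[i] = ((pod + shift) % pods) * M + perm[slot];
    }
    return dst;
}

std::vector<int> randomPattern(int N, int M, std::mt19937& rng) {
    // Each host picks any host outside its own pod; receiver collisions are allowed.
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::vector<int> dst(N);
    for (int i = 0; i < N; ++i) {
        do { dst[i] = pick(rng); } while (dst[i] / M == i / M);
    }
    return dst;
}

int main() {
    std::mt19937 rng(1);
    int M = 36, N = 432;                                // the 12-pod fat-tree of Section 5.1
    auto s = stride(N, M);
    auto b = bijection(N, M, rng);
    auto r = randomPattern(N, M, rng);
    std::printf("host 0 -> stride %d, bijection %d, random %d\n", s[0], b[0], r[0]);
    return 0;
}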

We mainly use the Realistic workload. The other three workloads are traditional synthetic communication patterns that are only used to perform stress tests, similar to previous work [2, 3, 19].

Figure 10: Stress test results for different traffic patterns and mean inter-arrival times in a 4-pod 10Gbps fat-tree.

Benchmarks. We compare four load balancing schemes.

• Expeditus: Our distributed congestion-aware protocol.

• Clairvoyant: An "ideal" congestion-aware scheme that uses complete end-to-end information of all possible paths between the two pods. Whenever a new flow comes, the source ToR chooses the best path. As discussed in §2.2, such a scheme introduces prohibitive overhead in 3-tier Clos for practical implementation.

• CONGA and CONGA-Flow: The state-of-the-art distributed congestion-aware protocol. CONGA is integrated with flowlet switching [23]. Flowlets are detected when the inter-arrival time between two consecutive packets exceeds a certain threshold T_fl. Our implementation is faithfully based on all details provided in [4]. CONGA uses T_fl = 100 µs to split flows into flowlets. Different from CONGA, CONGA-Flow makes per-flow load balancing decisions and guarantees packet ordering. Both are applied in leaf-spine topologies only.

• ECMP: This is our baseline.


Figure 11: Normalized FCT for the web search workload in the baseline 12-pod fat-tree with core tier oversubscription of 2:1. Panels: (a) overall average; (b) (0, 100KB] average; (c) (0, 100KB] 95th percentile; (d) (1MB, ∞) average.

Figure 12: Normalized FCT for the data mining workload in the baseline 12-pod fat-tree with core tier oversubscription of 2:1. Panels: (a) overall average; (b) (0, 100KB] average; (c) (0, 100KB] 95th percentile; (d) (1MB, ∞) average.

5.2 Stress Tests with Synchronized Flows

We first perform stress tests. We generate synchronized flows with the three synthetic traffic patterns. From each sender, we generate a flow of 50MB. To vary the degree of synchronization, we conduct three sets of simulations with flow inter-arrival times following exponential distributions with means 0, 30 µs, and 60 µs.

Figure 10 shows the average throughput for the different schemes, with error bars over 5 runs. When all flows start at the same time, Expeditus and Clairvoyant perform on par with ECMP. This is expected: all paths are idle at the common flow arrival time, and congestion-aware schemes essentially reduce to ECMP. Clearly, perfectly synchronized flow arrivals seldom happen in practice. When flows are loosely synchronized, Expeditus is able to choose better paths than ECMP and improve performance substantially. It improves the average throughput by ~23%–42% for Stride and Bijection, and ~20% for Random, with mean inter-arrival times of 30 µs and 60 µs. The Random traffic pattern is challenging for Expeditus as multiple senders may send to the same receiver, in which case throughput is limited at the edge of the network. Expeditus is very close to Clairvoyant, with a performance gap of at most 9%.

5.3 Baseline Performance in 3-tier Clos

We now investigate the performance of Expeditus in a large-scale Clos network. We use the Realistic traffic pattern with network load varying from 0.1 to 0.8. Figures 11 and 12 show the normalized FCT (NFCT) for both realistic workloads in the baseline 12-pod fat-tree with the core tier oversubscribed at 2:1. NFCT is the FCT value normalized to the best possible completion time achieved in an idle network where each flow can transmit at the bottleneck link capacity [4, 7].

We observe that for mice flows, Expeditus provides 16%–38% FCT reduction at the 95th percentile over ECMP for both workloads. Meanwhile, for elephants, Expeditus is also ~15%–30% faster on average. Moreover, it is clear that the performance of Expeditus closely tracks that of Clairvoyant in both workloads. In most cases the performance gap is less than 10%, demonstrating the near-optimality of the heuristic path selection design.

Figure 13: CDF of throughput imbalance at different tiers of a 12-pod fat-tree with 2:1 core tier oversubscription for the web search workload at 50% load. Panels: (a) aggregation tier imbalance; (b) ToR tier imbalance.

Load balancing efficiency. To further understand Expeditus's benefits, we assess its load balancing efficiency relative to ECMP. Figure 13 shows the CDF of the throughput imbalance across the uplinks of a ToR or aggregation switch in the baseline topology for the web search workload at 50% load. The throughput imbalance is defined as the maximum throughput minus the minimum, divided by the average, among all uplinks of a switch: (MAX − MIN)/AVG. This is calculated from synchronous samples of each uplink at a randomly chosen ToR/aggregation switch over 1 ms intervals, with more than 2K sample points in total.

We make two observations. First, the results confirm that Expeditus is more efficient in balancing loads across the network than ECMP. It reduces the average throughput imbalance significantly compared to ECMP. Note the imbalance is at most 3 at the core tier instead of 6 as in the ToR tier, because the core tier is oversubscribed at 2:1 in the baseline topology. The second observation is that Clairvoyant balances loads better at the aggregation tier, while Expeditus performs better than Clairvoyant at the ToR tier. This is mainly due to the two-stage path selection design. Expeditus first chooses the best path from the ToR to the aggregation tier, without considering the aggregation-core links. This naturally results in a better balanced ToR tier. On the other hand, Clairvoyant uses global information and can balance the congested aggregation tier better.

5.4 Impact of Network Bottleneck

In this section we evaluate the impact of network bottlenecks on Expeditus. We vary the severity and location of bottlenecks by varying the oversubscription ratios at different tiers of the topology.

Figure 14: Average FCT reduction by Expeditus over ECMP for all flows with varying oversubscription at different tiers of a 12-pod fat-tree for the web search workload. Panels: (a) oversubscription at the core tier; (b) oversubscription at the ToR tier.

We first consider the impact of bottleneck severity. Figure 14a shows Expeditus's average FCT improvement over ECMP for all flows in the baseline topology with varying oversubscription ratios at the core tier. Only results for the web search workload are shown for brevity. In general, Expeditus provides larger benefits as oversubscription grows at first, but the improvement drops when the oversubscription ratio reaches 3. This is because when the network is heavily oversubscribed, many elephant flows consistently occupy the paths, diminishing the congestion diversity across equal-cost paths and hence the benefit of being congestion-aware.

Next we consider the impact of bottleneck location. Figure 14b shows the result when the ToR tier is oversubscribed with more hosts, so the uplinks of ToR switches, rather than those of aggregation switches, become the bottleneck. Interestingly, Expeditus performs even better in this setting, and the performance gap between Expeditus and Clairvoyant is even smaller. The reason is as follows. Recall that Expeditus always chooses paths starting from the ToR tier; it therefore works better when the ToR tier is the bottleneck. When the core tier is bottlenecked instead, the first-stage path selection is more likely to direct a flow to an aggregation switch that connects to congested core switches, and Expeditus is unable to change that decision in the second stage.

5.5 Impact of Link Failures
In this section, we study the impact of link failures and topology asymmetry. To emphasize the performance of victim flows, the following experiments involve only two pods of the 12-pod non-oversubscribed fat-tree. We vary the number of failed links. In each run we choose links in one pod to fail uniformly at random, with each switch having at most 2 failed links. Only results with the web search workload are shown. Figure 15 shows the overall average FCT reduction of both Clairvoyant and Expeditus over ECMP for all flows at 0.5 load. We observe qualitatively similar results for other loads.

Figure 15: Average FCT reduction by Clairvoyant and Expeditus over ECMP for all flows with link failures. (a) Aggr-core link failures; (b) ToR-aggr link failures.

We consider two scenarios: aggr-core and ToR-aggr link failures. We observe from Figure 15a that when aggr-core links fail, the performance gains of Expeditus and Clairvoyant become more significant with more failures. This is because ECMP always hashes flows evenly across all paths without accounting for the asymmetric uplink bandwidth, thus aggravating the congestion. Expeditus proactively detects highly utilized links caused by failures using the link load multipliers, and diverts traffic away from hot spots to balance the load. Figure 15b shows the result when failures happen at the ToR-aggr links. Compared with aggr-core link failures, a ToR-aggr link failure causes a larger loss of network capacity because one aggregation switch becomes completely inaccessible, which leaves less room for load balancing to improve. This explains why the performance gains of Expeditus and Clairvoyant diminish with more ToR-aggr link failures. Still, across the different scenarios Expeditus provides moderate (20%–27%) performance gains. In all, Expeditus is robust against failures in 3-tier Clos networks.

5.6 Performance in Leaf-spine Topologies
We finally turn to the 2-tier leaf-spine topology, and compare Expeditus to CONGA, CONGA-Flow, and ECMP.


Figure 16: Normalized FCT for the web search workload with 8 spine and 8 leaf switches; network oversubscription 2:1. (a) Overall: average; (b) (0, 100KB]: average; (c) (0, 100KB]: 95th percentile; (d) (1MB, ∞): average.

Expeditus is equivalent to Clairvoyant in leaf-spine topologies (§3.4). The topology has 8 leaf switches, 8 spine switches, and 128 hosts. Note that we double the number of hosts to achieve 2:1 oversubscription at the leaf level.

Figure 16 shows the FCT results for the web search workload. Expeditus achieves favorable performance gains over ECMP, ranging from 18% to 32% for all flows across all loads. More interestingly, it provides performance comparable to CONGA and outperforms CONGA-Flow in all cases. The trend is consistent with the results in CONGA [4]. It is intuitive that CONGA achieves good performance, since it benefits from fine-grained load balancing with flowlet switching. The main reason for the degraded performance of CONGA-Flow is that it can only observe the congestion state of paths that currently carry flows. As there may not be enough concurrent flows to cover all paths at all times (§2.3), CONGA-Flow fails to observe the up-to-date, global state of the network; the back-of-the-envelope estimate below illustrates how quickly path coverage degrades.
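As a rough illustration of the coverage argument (an assumption-laden estimate, not a result from the paper): if each of n concurrent flows from a leaf switch is placed independently and uniformly at random over p equal-cost paths, the expected fraction of paths carrying at least one flow is 1 - (1 - 1/p)^n.

def expected_path_coverage(num_flows, num_paths):
    # Expected fraction of equal-cost paths observed (i.e., carrying at least
    # one flow) when flows are placed uniformly at random.
    return 1.0 - (1.0 - 1.0 / num_paths) ** num_flows

# With 8 spine switches there are 8 paths between a pair of leaves;
# a handful of concurrent flows leaves many paths unobserved.
for n in (2, 4, 8):
    print(n, round(expected_path_coverage(n, 8), 2))  # 0.23, 0.41, 0.66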

6. Related Work
We now discuss related work other than CONGA. Prior work generally falls into two categories for grappling with ECMP's drawbacks. The first is centralized routing that pins elephants to good paths based on global network state [3, 10, 13, 31]. This approach poses scalability challenges for OpenFlow switches [13], and is too slow to improve latency for mice flows. The second approach is fine-grained load balancing. Packet spraying adopts per-packet load balancing [11, 14, 16], which induces packet reordering and interacts poorly with TCP. Flare [23], LocalFlow [31], and Presto [19] split a flow into many small bursts and load balance them to avoid or alleviate packet reordering. FlowBender [22] opportunistically re-routes flows via simple hacks to commodity switches. These approaches use only local information at best, which is sub-optimal in handling flash congestion in the network.

A recent independent work also proposes a congestion-aware load balancing protocol, HULA, for Clos topologies [24]. Similar to distance-vector routing, congestion information in HULA is periodically propagated to all switches, and each switch maintains the best next hop to each destination. This design has drawbacks that prevent it from scaling to large networks. First, flooding introduces significant bandwidth overhead: congestion information of all links is broadcast to all switches every 200 µs [24], even if there is no traffic between a pair of ToR switches. Second, each switch needs to maintain state for all racks and process congestion information periodically, which brings substantial storage and processing overhead; a sketch of this per-destination state is given below.
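For intuition only, a simplified sketch of the per-destination state a HULA-like switch would keep and refresh on each periodic probe; the field names and update rule are illustrative and not HULA's actual packet format or algorithm.

# best_hop[dst_tor] = (next_hop_port, path_utilization): one entry per rack,
# which is the per-switch state the text above refers to.
best_hop = {}

def on_probe(dst_tor, in_port, path_util):
    # Keep the next hop with the lowest reported path utilization; entries
    # are refreshed every probe period (e.g., 200 us in HULA [24]).
    current = best_hop.get(dst_tor)
    if current is None or path_util < current[1]:
        best_hop[dst_tor] = (in_port, path_util)

on_probe("tor42", in_port=3, path_util=0.8)
on_probe("tor42", in_port=5, path_util=0.3)
print(best_hop["tor42"])  # (5, 0.3)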

Proposals such as MPTCP [30] split flows at the transport layer and send them over multiple paths to improve throughput. MPTCP cannot effectively improve latency for mice flows; in fact it degrades latency quite significantly [4, 19]. Its throughput performance for elephant flows has also been shown to be inferior to CONGA's [4].

In a related thread, beyond fixing ECMP, much work focuses on the latency of mice flows. DCTCP [5], HULL [6], RCP [15], FCP [18], TIMELY [28], and DX [26] aim to reduce queue lengths through better congestion control. D3 [34], PDQ [20], D2TCP [33], pFabric [7], PASE [29], and PIAS [9] prioritize mice flows in packet scheduling and TCP rate assignment based on either deadlines or flow sizes. DIBS [36] and CP [12] develop data plane techniques to accelerate packet retransmissions. Expeditus complements these proposals and can be used to construct a better network stack for data centers.

7. Conclusion
We presented Expeditus, a novel data plane congestion-aware load balancing protocol for data centers. Expeditus collects local congestion information and applies two-stage path selection to aggregate this information and route flows. Both our Click implementation and packet-level ns-3 simulations show that Expeditus improves network performance for both mice and elephant flows. Expeditus advances the state of the art by enabling congestion-aware load balancing for the general 3-tier Clos topologies widely used in industry, whose complexity presents non-trivial challenges to the scalability and effectiveness of the design.

Acknowledgments
We thank the anonymous reviewers for their comments that improved the paper. This work was supported by the Research Grants Council, University Grants Committee of Hong Kong (awards ECS-21201714, GRF-11202315, and CRF-C7036-15G), the ICT R&D program of MSIP/IITP of Korea [B0101-16-1368], and the National Research Foundation of Korea funded by MSIP (NRF-2013R1A1A1076024).


References
[1] Agache, A., Deaconescu, R., and Raiciu, C. Increasing Datacenter Network Utilisation with GRIN. In Proc. USENIX NSDI (2015).
[2] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In Proc. ACM SIGCOMM (2008).
[3] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In Proc. USENIX NSDI (2010).
[4] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Lam, V. T., Matus, F., Pan, R., Yadav, N., and Varghese, G. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In Proc. ACM SIGCOMM (2014).
[5] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). In Proc. ACM SIGCOMM (2010).
[6] Alizadeh, M., Kabbani, A., Edsall, T., Prabhakar, B., Vahdat, A., and Yasuda, M. Less is more: Trading a little bandwidth for ultra-low latency in the data center. In Proc. USENIX NSDI (2012).
[7] Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., and Shenker, S. pFabric: Minimal near-optimal datacenter transport. In Proc. ACM SIGCOMM (2013).
[8] Andreyev, A. Introducing data center fabric, the next-generation Facebook data center network. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/, November 2014.
[9] Bai, W., Chen, L., Chen, K., Han, D., Tian, C., and Wang, H. PIAS: Practical information-agnostic flow scheduling for data center networks. In Proc. USENIX NSDI (2015).
[10] Benson, T., Anand, A., Akella, A., and Zhang, M. MicroTE: Fine grained traffic engineering for data centers. In Proc. ACM CoNEXT (2011).
[11] Cao, J., Xia, R., Yang, P., Guo, C., Lu, G., Yuan, L., Zheng, Y., Wu, H., Xiong, Y., and Maltz, D. Per-packet load-balanced, low-latency routing for Clos-based data center networks. In Proc. ACM CoNEXT (2013).
[12] Cheng, P., Ren, F., Shu, R., and Lin, C. Catch the Whole Lot in an Action: Rapid Precise Packet Loss Notification in Data Centers. In Proc. USENIX NSDI (2014).
[13] Curtis, A. R., Mogul, J. C., Tourrilhes, J., Yalagandula, P., Sharma, P., and Banerjee, S. DevoFlow: Scaling flow management for high-performance networks. In Proc. ACM SIGCOMM (2011).
[14] Dixit, A., Prakash, P., Hu, Y. C., and Kompella, R. R. On the impact of packet spraying in data center networks. In Proc. IEEE INFOCOM (2013).
[15] Dukkipati, N., and McKeown, N. Why flow-completion time is the right metric for congestion control. SIGCOMM Comput. Commun. Rev. 36, 1 (January 2006), 59–62.
[16] Ghorbani, S., Godfrey, B., Ganjali, Y., and Firoozshahian, A. Micro Load Balancing in Data Centers with DRILL. In Proc. ACM HotNets (2015).
[17] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Patel, P., and Sengupta, S. VL2: A Scalable and Flexible Data Center Network. In Proc. ACM SIGCOMM (2009).
[18] Han, D., Grandl, R., Akella, A., and Seshan, S. FCP: A Flexible Transport Framework for Accommodating Diversity. In Proc. ACM SIGCOMM (2013).
[19] He, K., Rozner, E., Agarwal, K., Felter, W., Carter, J., and Akella, A. Presto: Edge-based Load Balancing for Fast Datacenter Networks. In Proc. ACM SIGCOMM (2015).
[20] Hong, C.-Y., Caesar, M., and Godfrey, P. B. Finishing flows quickly with preemptive scheduling. In Proc. ACM SIGCOMM (2012).
[21] Jeyakumar, V., Alizadeh, M., Geng, Y., Kim, C., and Mazières, D. Millions of little minions: Using packets for low latency network programming and visibility. In Proc. ACM SIGCOMM (2014).
[22] Kabbani, A., Vamanan, B., Hasan, J., and Duchene, F. FlowBender: Flow-level Adaptive Routing for Improved Latency and Throughput in Datacenter Networks. In Proc. ACM CoNEXT (2014).
[23] Kandula, S., Katabi, D., Sinha, S., and Berger, A. Dynamic load balancing without packet reordering. SIGCOMM Comput. Commun. Rev. 37, 2 (April 2007), 51–62.
[24] Katta, N., Hira, M., Kim, C., Sivaraman, A., and Rexford, J. HULA: Scalable Load Balancing Using Programmable Data Planes. In Proc. ACM SOSR (2016).
[25] Kohler, E. The Click Modular Router. PhD thesis, Massachusetts Institute of Technology, 2001.
[26] Lee, C., Park, C., Jang, K., Moon, S., and Han, D. Accurate Latency-based Congestion Feedback for Datacenters. In Proc. USENIX ATC (2015).
[27] Liu, S., Xu, H., and Cai, Z. Low latency datacenter networking: A short survey. http://arxiv.org/abs/1312.3455, 2014.
[28] Mittal, R., Lam, V. T., Dukkipati, N., Blem, E., Wassel, H., Ghobadi, M., Vahdat, A., Wang, Y., Wetherall, D., and Zats, D. TIMELY: RTT-based Congestion Control for the Datacenter. In Proc. ACM SIGCOMM (2015).
[29] Munir, A., Baig, G., Irteza, S. M., Qazi, I. A., Liu, A. X., and Dogar, F. R. Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks. In Proc. ACM SIGCOMM (2014).
[30] Raiciu, C., Barre, S., Pluntke, C., Greenhalgh, A., Wischik, D., and Handley, M. Improving datacenter performance and robustness with multipath TCP. In Proc. ACM SIGCOMM (2011).
[31] Sen, S., Shue, D., Ihm, S., and Freedman, M. J. Scalable, optimal flow routing in datacenters via local link balancing. In Proc. ACM CoNEXT (2013).
[32] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., Kanagala, A., Provost, J., Simmons, J., Tanda, E., Wanderer, J., Hölzle, U., Stuart, S., and Vahdat, A. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. In Proc. ACM SIGCOMM (2015).
[33] Vamanan, B., Hasan, J., and Vijaykumar, T. Deadline-aware datacenter TCP (D2TCP). In Proc. ACM SIGCOMM (2012).
[34] Wilson, C., Ballani, H., Karagiannis, T., and Rowstron, A. Better never than late: Meeting deadlines in datacenter networks. In Proc. ACM SIGCOMM (2011).
[35] Xu, H., and Li, B. TinyFlow: Breaking elephants down into mice in data center networks. In Proc. IEEE LANMAN (2014).
[36] Zarifis, K., Miao, R., Calder, M., Katz-Bassett, E., Yu, M., and Padhye, J. DIBS: Just-in-time Congestion Mitigation for Data Centers. In Proc. ACM EuroSys (2014).
[37] Zats, D., Das, T., Mohan, P., Borthakur, D., and Katz, R. DeTail: Reducing the flow completion time tail in datacenter networks. In Proc. ACM SIGCOMM (2012).
[38] Zhou, J., Tewari, M., Zhu, M., Kabbani, A., Poutievski, L., Singh, A., and Vahdat, A. WCMP: Weighted cost multipathing for improved fairness in data centers. In Proc. ACM EuroSys (2014).