TRANSCRIPT
Beyond fat-trees without antennae, mirrors, and disco-balls
Simon Kassing, Asaf Valadarsky, Gal Shahaf, Michael Schapira, Ankit Singla
Skewed traffic within data centers [Google]
Traffic hotspots [Google]
All-to-all non-blocking connectivity is expensive
Oversubscribed fat-trees
Oversubscribed fat-trees: Capacity 75%, Demand 50% (Bottleneck)
Oversubscribed fat-trees: A tragedy … (k = 96: Capacity 75%, Demand 2%)
Dynamically set up network connections!
• OFC ’09 Glick et al.
• SIGCOMM ’10 Wang et al.
• SIGCOMM ’10 Farrington et al.
• SIGCOMM ’11 Halperin et al.
• NSDI ’12 Chen et al.
• SIGCOMM ’12 Zhou et al.
• SIGCOMM ’13 Porter et al.
• SIGCOMM ’14 Liu et al.
• SIGCOMM ’14 Hamedazimi et al.
• SIGCOMM ’16 Ghobadi et al.
• NSDI ’17 Chen et al.
[Figure 1: HyPaC network architecture. Servers attach to ToR switches; an electrical network of ToR, aggregate, and core switches is augmented by an optical network providing reconfigurable optical paths.]
System requirements
Control plane: 1. Estimating cross-rack traffic demands; 2. Managing circuit configuration
Data plane: 1. De-multiplexing traffic in dual-path network; 2. Maximizing the utilization of circuits when available (optimization)
Table 1: Fundamental requirements of HyPaC architecture.
capacity because a single optical path can handle tens of servers
sending at full capacity over conventional gigabit Ethernet links.
The circuit-switched network can only provide a matching on the
graph of racks: Each rack can have at most one high-bandwidth
connection to another rack at a time. The switch can be reconfig-
ured to match different racks at a later time; as noted earlier, this
reconfiguration takes a few milliseconds, during which time the fast
paths are unusable. To ensure that latency sensitive applications can
make progress, HyPaC retains the packet-switched network. Any
node can therefore talk to any other node at any time over potentially
over-subscribed packet-switched links.
For the circuits to provide benefits, the traffic must be “pair-
wise concentrated”—there must exist pairs of racks with high band-
width demands between them and lower demand to others. Fortu-
nately, such concentration has been observed by numerous prior
studies [14, 16, 29]. This concentration exists for several reasons:
time-varying traffic, biased distributions, and—our focus in later
sections—amenability to batching. First, applications whose traffic
demands vary over time (e.g. hitting other bottlenecks, multi-phase
operation) can contribute to a non-uniform traffic matrix. Second,
other applications have intrinsic communication skew in which most
nodes only communicate with a small number of partners. This
limited out-degree leads to concentrated communication. Finally,
latency-insensitive applications such as MapReduce-style computa-
tions may be amenable to batched data delivery: instead of sending
data to destinations in a fine-grained manner (e.g., 1, 2, 3, 2, 3, 1,
2), sufficient buffering can be provided to batch this delivery (1, 1,
2, 2, 2, 3, 3). These patterns do not require arbitrary full-bisection
capacity.
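To make the batching idea concrete, grouping a fine-grained delivery sequence by destination is all that is needed; a minimal sketch (the helper below is illustrative, not code from the paper):

```python
def batch_by_destination(deliveries):
    """Group a fine-grained delivery sequence (e.g. 1, 2, 3, 2, 3, 1, 2) into
    per-destination batches (1, 1, 2, 2, 2, 3, 3), keeping destinations in the
    order they first appear. Buffering is what makes this reordering possible."""
    first_seen = {}
    for i, dst in enumerate(deliveries):
        first_seen.setdefault(dst, i)
    return sorted(deliveries, key=lambda dst: first_seen[dst])

print(batch_by_destination([1, 2, 3, 2, 3, 1, 2]))  # [1, 1, 2, 2, 2, 3, 3]
```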
3.1 System Requirements
Table 1 summarizes functions needed for a generic HyPaC-style
network. In the control plane, effective use of the circuit-switched
paths requires determining rack-to-rack traffic demands and timely
circuit reconfiguration to match these demands.
In the data plane, a HyPaC network has two properties: First,
when a circuit is established between two racks, there exist two paths
between them—the circuit-switched link and the always-present
packet-switched path. Second, when the circuits are reconfigured,
the network topology changes. Reconfiguration in a large data center
causes hundreds of simultaneous link up/down events, a level of
dynamism much higher than usually found in data centers. A HyPaC
network therefore requires traffic control mechanisms to dynamically
de-multiplex traffic onto the circuit or packet switched network,
as appropriate. Finally, if applications do not send traffic rapidly
enough to fill the circuit-switched paths when they become available,
a HyPaC design may need to implement additional mechanisms,
such as extra batching, to allow them to do so.
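As a minimal sketch of the de-multiplexing requirement (not c-Through's actual mechanism, which Section 4 describes), the per-packet decision can be viewed as a lookup against the current circuit matching:

```python
def choose_network(src_rack, dst_rack, circuit_matching):
    """Pick the network for traffic between two racks. circuit_matching maps a
    rack to the (at most one) rack it is currently circuit-connected to, and it
    changes on every reconfiguration."""
    if circuit_matching.get(src_rack) == dst_rack:
        return "optical circuit"      # high-bandwidth path, exists only while matched
    return "packet-switched"          # always available, possibly oversubscribed

matching = {"A": "B", "B": "A", "C": "D", "D": "C"}   # a matching on the graph of racks
print(choose_network("A", "B", matching))  # optical circuit
print(choose_network("A", "C", matching))  # packet-switched
```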
3.2 Design Choices and Trade-offs
These system requirements can be achieved on either end-hosts or
switches. For designs on end-hosts, the system components can be
at different software layers (e.g., application layer or kernel layer).
Traffic demand estimation: One simple choice is to let applica-
tions explicitly indicate their demands. Applications have the most
accurate information about their demands, but this design requires
modifying applications. As we discuss in Section 4, our c-Through
design estimates traffic demand by increasing the per-connection
socket buffer sizes and observing end-host buffer occupancy at run-
time. This design requires additional kernel memory for buffering,
but is transparent to applications and does not require switch changes.
The Helios design [22], in contrast, estimates traffic demands at
switches by borrowing from Hedera [13] an iterative algorithm to
estimate traffic demands from flow information.
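As a rough illustration of the buffer-occupancy idea (a sketch only; the host-to-rack mapping and the per-connection sampling interface are hypothetical, not c-Through's implementation):

```python
from collections import defaultdict

def estimate_rack_demands(connections, rack_of):
    """Aggregate per-connection send-buffer occupancy observed at end hosts into
    a rack-to-rack demand estimate: bytes already queued are a proxy for how much
    traffic would flow if a circuit were provisioned."""
    demand = defaultdict(int)
    for conn in connections:                       # sampled periodically at each host
        src = rack_of[conn["local_host"]]
        dst = rack_of[conn["remote_host"]]
        demand[(src, dst)] += conn["sendq_bytes"]  # occupancy of the enlarged socket buffer
    return dict(demand)

rack_of = {"h1": "R1", "h2": "R1", "h9": "R7"}     # hypothetical hosts and racks
conns = [{"local_host": "h1", "remote_host": "h9", "sendq_bytes": 4_000_000},
         {"local_host": "h2", "remote_host": "h9", "sendq_bytes": 1_500_000}]
print(estimate_rack_demands(conns, rack_of))       # {('R1', 'R7'): 5500000}
```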
Traffic demultiplexing: Traditional Ethernet mechanisms han-
dle multiple paths poorly. Spanning tree, for example, will block
either the circuit-switched or the packet-switched network instead
of allowing each to be used concurrently. The major design choice
in traffic demultiplexing is between emerging link-layer routing pro-
tocols [33, 30, 25, 34, 10] and partition-based approaches that view
the two networks as separate.
The advantage of a routing-based design is that, by treating the
circuit and packet-switched networks as a single network, it oper-
ates transparently to hosts and applications. Its drawback is that it
requires switch modification, and most existing routing protocols im-
pose a relatively long convergence time when the topology changes.
For example, in link state routing protocols, re-convergence follow-
ing hundreds of simultaneous link changes could require seconds or
even minutes [15]. To be viable, routing-based designs may require
further work in rapidly converging routing protocols.
A second option, and the one we choose for c-Through, is to
isolate the two networks and to de-multiplex traffic at either the
end-hosts or at the ToR switches. We discuss our particular design
choice further in Section 4. The advantage of separating the net-
works is that rapid circuit reconfiguration does not destabilize the
packet-switched network. Its drawback is a potential increase in
configuration complexity.
Circuit utilization optimization, if necessary, can be similarly ac-
complished in several ways. An application-integrated approach
could signal to applications to increase their transmission rate when
the circuits are available; the application-transparent mechanism
we choose for c-Through is to buffer additional data in TCP socket
buffers, relying on TCP to ramp up quickly when bandwidth be-
comes available. Such buffering could also be accomplished in the
ToR switches.
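For context, enlarging a connection's send buffer so that TCP can ramp up once extra bandwidth appears uses a standard socket option; a minimal, generic sketch (the 16 MB value is illustrative, not c-Through's setting, and the kernel may cap it):

```python
import socket

def enlarge_send_buffer(sock, nbytes=16 * 1024 * 1024):
    """Request a larger kernel send buffer so queued data can burst out quickly
    when a high-bandwidth path (e.g. an optical circuit) becomes available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)  # effective value

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("effective send buffer:", enlarge_send_buffer(s), "bytes")
```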
ProjecToR: Agile Reconfigurable Data Center Interconnect
Monia Ghobadi Ratul Mahajan Amar Phanishayee
Nikhil Devanur Janardhan Kulkarni Gireeja Ranade
Pierre-Alexandre Blanche† Houman Rastegarfar† Madeleine Glick† Daniel Kilper†
Microsoft Research †University of Arizona
Abstract— We explore a novel, free-space optics based
approach for building data center interconnects. It uses
a digital micromirror device (DMD) and mirror assembly
combination as a transmitter and a photodetector on top of
the rack as a receiver (Figure 1). Our approach enables all
pairs of racks to establish direct links, and we can recon-
figure such links (i.e., connect different rack pairs) within
12 µs. To carry traffic from a source to a destination rack,
transmitters and receivers in our interconnect can be dynam-
ically linked in millions of ways. We develop topology con-
struction and routing methods to exploit this flexibility, in-
cluding a flow scheduling algorithm that is a constant fac-
tor approximation to the offline optimal solution. Experi-
ments with a small prototype point to the feasibility of our
approach. Simulations using realistic data center workloads
show that, compared to the conventional folded-Clos inter-
connect, our approach can improve mean flow completion
time by 30–95% and reduce cost by 25–40%.
CCS Concepts: Networks → Network architectures
Keywords: Data Centers; Free-Space Optics; Reconfigurability
1. INTRODUCTION
The traditional way of designing data center (DC)
networks—electrical packet switches arranged in a multi-
tier topology—has a fundamental shortcoming. The design-
ers must decide in advance how much capacity to provision
between top-of-rack (ToR) switches. Depending on the pro-
visioned capacity, the interconnect is either expensive (e.g.,
with full-bisection bandwidth) or it limits application perfor-
mance when demand between two ToRs exceeds capacity.
[Figure 1: ProjecToR interconnect with unbundled transmit (lasers) and receive (photodetectors) elements. Lasers feed DMDs (arrays of micromirrors); the diffracted beam is reflected by a mirror assembly towards the destination's photodetectors.]
Helios, c-Thru, Proteus, Solstice [16, 26, 37, 38]: OCS, not seamless, fan-out 100-320, reconfig. time 30 ms
Flyways, 3DBeam [23, 40]: 60 GHz, not seamless, fan-out ≈70, reconfig. time 10 ms
Mordia [33]: OCS, not seamless, fan-out 24, reconfig. time 11 µs
FireFly [22]: FSO, seamless, fan-out 10, reconfig. time 20 ms
ProjecToR: FSO, seamless, fan-out 18,432, reconfig. time 12 µs
Table 1: Properties of reconfigurable interconnects.
Many researchers have recognized this shortcoming and
proposed reconfigurable interconnects, using technologies
that are able to dynamically change capacity between pairs
of ToRs. The technologies that they have explored include
optical circuit switches (OCS) [16,25,26,33,37,38], 60 GHz
wireless [23, 40], and free-space optics (FSO) [22].
However, our analysis of traffic from four diverse pro-
duction clusters shows that current approaches lack at least
two of three desirable properties for reconfigurable intercon-
nects: 1) Seamlessness: few limits on how much network
capacity can be dynamically added between ToRs; 2) High
fan-out: direct communication from a rack to many others;
and 3) Agility: low reconfiguration time.
Table 1 compares the existing reconfigurable intercon-
nects with respect to these three properties. Most approaches
(rows 1–3) are not seamless because they use a second, re-
[Figure 1: Radio transceivers (TX/RX) are placed on top of each rack (a) or container (b). Using 2D beamforming (c), transceivers communicate with neighboring ones directly, but forward traffic in multiple hops to non-neighboring racks. Using 3D beamforming (d), the ceiling reflects the signals from each sender to its receiver, avoiding multi-hop relays.]
more localized/bursty bandwidth requirements. That is, we
focus on the subset that do not require (near) non-blocking
all-to-all communication at data center scale.
In particular, we focus on high-throughput, beamform-
ing wireless links in the 60 GHz band. The unlicensed 60
GHz band provides multi-Gbps data rates and can be im-
plemented with relatively low-cost hardware. Because 60
GHz signals attenuate quickly with distance, multiple wire-
less links can be deployed in a single data center. In our
efforts to expand the effective bandwidth of 60 GHz links,
we hope to create a new primitive that can be used to either
augment existing networks with on-demand network links,
or potentially replace wired links in data centers with mod-
est bandwidth requirements. We build on pioneering efforts
of earlier work that proposed 60 GHz links to alleviate hot
spots in the data center [23, 26].
However, earlier efforts face a number of limitations. First,
even beamforming directional links will experience signal
leakage, and produce a cone of interference to receivers near
or behind the intended target receiver. This limits the num-
ber of links that can be active concurrently in densely oc-
cupied data centers, and reduces the aggregate throughput
offered by these wireless links.
Second, these links require direct line-of-sight (LOS) be-
tween sender and receiver, and can be blocked by even small
objects in the path. This limits the effective range of 60 GHz
links to neighboring top-of-rack radios. Since hotspots occur
regularly at both edge and core links [15], augmenting core
links would require multiple hops through a line-of-sight 60
GHz network. Half-duplex, directional antennas mean that
these multi-hop links will suffer at least a 50% throughput
drop, higher-levels of potential congestion, and additional
delays required to frequently adjust antenna orientation.
To address these issues, we investigate the feasibility of
60 GHz 3D beamforming as a flexible wireless primitive in
data centers. In 3D beamforming, a top-of-rack directional
antenna forms a wireless link by reflecting a focused beam off
the ceiling towards the receiver. This reduces its interference
footprint, avoids blocking obstacles, and provides an indirect
line-of-sight path for reliable communication. Such a system
requires only beamforming radios readily available today,
and near perfect reflection can be provided by simple flat
metal plates mounted on the ceiling of a data center.
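To make the ceiling-bounce geometry concrete (our own illustration, not a computation from the paper): with a flat mirror, the reflected path between two top-of-rack antennas equals the straight line to the receiver's mirror image above the ceiling, which gives path length and beam elevation directly:

```python
import math

def ceiling_bounce_link(rack_distance_m, ceiling_height_m):
    """Path length and elevation angle of a 3D beamforming link, assuming an
    ideal flat mirror mounted at height h above the antennas: the bounced path
    is the straight line to the receiver mirrored across the ceiling plane."""
    path_m = math.hypot(rack_distance_m, 2 * ceiling_height_m)
    elevation_deg = math.degrees(math.atan2(2 * ceiling_height_m, rack_distance_m))
    return path_m, elevation_deg

# Illustrative numbers: racks 20 m apart, ceiling 3 m above the antennas.
path, angle = ceiling_bounce_link(20.0, 3.0)
print(f"path ~ {path:.1f} m, elevation ~ {angle:.1f} degrees")  # ~20.9 m, ~16.7 degrees
```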
3D beamforming has several distinctive advantages over
prior “2D” approaches. First, bouncing the beam off the
ceiling allows links to extend the reach of radio signals by
avoiding blocking obstacles. Second, the 3D direction of the
beam significantly reduces its interference range, allowing
more nearby flows to transmit concurrently. Third, the re-
duced interference extends the effective range of each link,
allowing our system to connect any two racks using a single
hop, and mitigating the need for multihop links.
In this paper, we propose a 3D beamforming system for 60
GHz wireless transmissions in data centers. The 3D beam-
forming idea was first introduced by Zhang et al. in [46].
In this paper, we greatly extend the prior work, and use
measurements of a local 60 GHz testbed to quantify and
compare the performance of 3D and 2D beamforming links.
We find that 3D wireless beamforming works well in prac-
tice, and experiences zero loss in signal or throughput from
reflection. We also describe a link scheduler for 3D beam-
forming systems that maximizes concurrent links while also
taking into account accumulative interference and antenna
alignment delays. Finally, we use a detailed simulation of
data center traffic hotspots to quantify the performance of
3D beamforming systems. Our results show that while 2D
links can only support a small portion of hotspot traffic links,
3D beamforming can connect all rack pairs in a single hop,
and can significantly reduce overall data completion time for
wired networks across a range of bisection bandwidths.
While wired networks will likely remain the vehicle of
choice for the high-end of distributed computing, we believe
that efforts such as 3D beamforming can expand the ap-
plicability and benefits of wireless networking to a broader
range of data center deployments.
2. 60 GHZ: LIMITATIONS AND SOLUTIONS
While modifying the topology of wired data centers is
costly, complex, and sometimes intractable, administrators
can introduce flexible point-to-point network links with the
addition of wireless radios. Prior work has proposed the use
of 60 GHz links to augment data center capacity [23, 26,
35, 38]. Figures 1(a)-(b) show a common deployment sce-
nario, where wireless radios are placed on top of each rack
or container to connect pairs of top-of-rack (ToR) switches.
In practice, however, data center managers remain skep-
tical on deploying wireless links despite their potential ben-
efits [1]. In this section, we summarize prior work in this
space, and use detailed experiments on a 60 GHz testbed to
identify and quantify key limitations of current proposals.
2.1 60 GHz Links in Data Centers
Existing designs [23, 26, 27, 38] adopt 60 GHz wireless
technologies for several reasons. First, the 7GHz spectrum
FireFly: A Reconfigurable Wireless Data Center Fabric
Using Free-Space Optics
Navid Hamedazimi,† Zafar Qazi,† Himanshu Gupta,† Vyas Sekar,* Samir R. Das,† Jon P. Longtin,†
Himanshu Shah,† and Ashish Tanwer†
†Stony Brook University   *Carnegie Mellon University
ABSTRACT
Conventional static datacenter (DC) network designs offer extreme
cost vs. performance tradeoffs—simple leaf-spine networks are cost-
effective but oversubscribed, while “fat tree”-like solutions offer
good worst-case performance but are expensive. Recent results
make a promising case for augmenting an oversubscribed network
with reconfigurable inter-rack wireless or optical links. Inspired
by the promise of reconfigurability, this paper presents FireFly, an
inter-rack network solution that pushes DC network design to the
extreme on three key fronts: (1) all links are reconfigurable; (2) all
links are wireless; and (3) non top-of-rack switches are eliminated
altogether. This vision, if realized, can offer significant benefits in
terms of increased flexibility, reduced equipment cost, and minimal
cabling complexity. In order to achieve this vision, we need to look
beyond traditional RF wireless solutions due to their interference
footprint which limits range and data rates. Thus, we make the case
for using free-space optics (FSO). We demonstrate the viability of
this architecture by (a) building a proof-of-concept prototype of a
steerable small form factor FSO device using commodity compo-
nents and (b) developing practical heuristics to address algorithmic
and system-level challenges in network design and management.
Categories and Subject Descriptors
C.2.1 [Computer-Communication Networks]: Network Architec-
ture and Design
Keywords: Data Centers; Free-Space Optics; Reconfigurability
1 Introduction
A robust data center (DC) network must satisfy several goals: high
throughput [13, 23], low equipment and management cost [13, 40],
robustness to dynamic traffic patterns [14, 26, 48, 52], incremen-
tal expandability [18, 45], low cabling complexity [37], and low
power and cooling costs. With respect to cost and performance,
conventional designs are either (i) overprovisioned to account for
worst-case traffic patterns, and thus incur high cost (e.g., fat-trees
or Clos networks [13, 16, 23]), or (ii) oversubscribed (e.g., simple
trees or leaf-spine architectures [1]) which incur low cost but offer
poor performance due to congested links.
[Figure 1: High-level view of the FireFly architecture. Racks (1 … r … N) have ToR switches with steerable FSOs aimed at a ceiling mirror; a FireFly controller observes traffic patterns and issues rule changes and FSO reconfigurations. The only switches are the Top-of-Rack (ToR) switches.]
Recent work suggests a promising middleground that augments
an oversubscribed network with a few reconfigurable links, using
either 60 GHz RF wireless [26, 52] or optical switches [48]. In-
spired by the promise of these flexible DC designs,1 we envision a
radically different DC architecture that pushes the network design
to the logical extreme on three dimensions: (1) All inter-rack links
are flexible; (2) All inter-rack links are wireless; and (3) we get rid
of the core switching backbone.
This extreme vision, if realized, promises unprecedented qualita-
tive and quantitative benefits for DC networks. First, it can reduce
infrastructure cost without compromising on performance. Second,
flexibility increases the effective operating capacity and can im-
prove application performance by alleviating transient congestion.
Third, it unburdens DC operators from dealing with cabling com-
plexity and its attendant overheads (e.g., obstructed cooling) [37].
Fourth, it can enable DC operators to experiment with, and bene-
fit from, new topology structures that would otherwise remain un-
realizable due to cabling costs. Finally, flexibly turning links on
or off can take us closer to the vision of energy proportionality
(e.g., [29]). This paper describes FireFly,2 a first but significant step toward
realizing this vision. Figure 1 shows a high-level overview of Fire-
Fly. Each ToR is equipped with reconfigurable wireless links which
can connect to other ToR switches. However, we need to look
beyond traditional radio-frequency (RF) wireless solutions (e.g.,
60GHz) as their interference characteristics limit range and capac-
ity. Thus, we envision a new use-case for Free-Space Optical com-
munications (FSO) as it can offer high data rates (tens of Gbps)
over long ranges using low transmission power and with zero in-
terference [31]. The centralized FireFly controller reconfigures the
topology and forwarding rules to adapt to changing traffic patterns.
While prior work made the case for using FSO links in DCs [19,
28], these fail to establish a viable hardware design and also do not
address practical network design and management challenges that
1We use the terms flexible and reconfigurable interchangeably.
2FireFly stands for Free-space optical Inter-Rack nEtwork with
high FLexibilitY.
Set up network connections on the fly!
Advantage: Gained the ability to move links around
ProjecToR: Ghobadi, Monia, et al. "ProjecToR: Agile reconfigurable data center interconnect." SIGCOMM (2016).
Engineering challenges facing dynamic topologies
• Spatial planning and organisation?
• Environmental factors?
• Lack of operational experience?
• Device packaging?
• Monitoring and debugging?
• Reliability and lifetime of devices?
• Unknown unknowns?
Foundational questions
1. Rigorous benchmarks? Fat-trees are the easiest baseline — ideally inflexible!
2. What is the utility of dynamic links?
Ideally flexible network
[Plot: throughput per server vs. fraction of servers with traffic demand. The ideal "throughput proportional" curve passes through (𝛼, 1) and (1, 𝛼), following (x, 𝛼/x) in between.]
Fat-trees: ideally inflexible
[Plot: on the same axes, an oversubscribed fat-tree (2/k) appears as a flat line, unable to exploit sparse demand, in contrast to the throughput-proportional curve (x, 𝛼/x).]
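One way to read these plots (our notation; only the curves appear on the slides): let x be the fraction of servers with traffic demand and α the network's aggregate capacity as a fraction of full bisection bandwidth. An ideally flexible network can concentrate all of its capacity on the active servers, whereas an oversubscribed fat-tree cannot exploit sparse demand:

```latex
T_{\text{flexible}}(x) \;=\; \min\!\left(1,\ \frac{\alpha}{x}\right)
\quad\text{(through } (\alpha, 1) \text{ and } (1, \alpha)\text{)},
\qquad
T_{\text{fat-tree}}(x) \;\approx\; \text{constant in } x .
```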
Near-optimal expander networks: static but flexible
Instead of rigid, layered connectivity…
Expander-based data centers
• Jellyfish (NSDI ’12)
• Slimfly (SC ’14)
• Xpander (CoNEXT ’16)
Xpander: deterministic wiring-friendly expander-based data center
[Figure: physical layout of an Xpander data center, showing pods of switches/racks, pod cable aggregators, and meta-node cable aggregators]
Valadarsky, Asaf, et al. "Xpander: Towards Optimal-Performance Datacenters." CoNEXT (2016).
A fundamental question… How valuable is the ability to move links around?
Set up network connections on the fly! vs. Expander-based data centers!
Optimal flow comparison
[Plot axes: throughput per server vs. fraction of servers with traffic demand]
Baseline: oversubscribed fat-tree
[Plot: throughput per server vs. fraction of servers with traffic demand, for the oversubscribed fat-tree]
Indeed, dynamic networks can be better
[Plot adds a dynamic network (𝛿=1.5) alongside the fat-tree]
… but so can static ones
[Plot adds a static expander alongside the fat-tree and the dynamic network (𝛿=1.5)]
… especially in the regime of interest
[Plot: same curves (Fat-tree, Expander, Dynamic network (𝛿=1.5))]
“46-99% of the rack pairs exchange no traffic at all” — Ghobadi et al., 2016
Not too far from proportionality!
[Plot: the Expander curve lies close to the throughput-proportional line; Fat-tree and Dynamic network (𝛿=1.5) also shown]
Workloads
pFabric Web search (2.4 MB mean): modelled after a real workload; maximum flow size of 30 MB.
Pareto-HULL (100 KB mean): Pareto distributed; highly skewed; many short flows (<100 KB); few very large flows (max. 1 GB). (A sampling sketch follows below.)
… at a fixed arrival rate per second (λ)
[CDF of flow size (bytes, 1 KB to 1 GB) for the Pareto-HULL (mean = 100 KB) and pFabric Web search (mean = 2.4 MB) workloads]
Pareto-HULL: Alizadeh, Mohammad, et al. "Less is more: trading a little bandwidth for ultra-low latency in the data center." USENIX (2012).
pFabric Web search: Alizadeh, Mohammad, et al. "pFabric: Minimal near-optimal datacenter transport." SIGCOMM (2013).
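For reproduction purposes, a minimal sketch of drawing heavy-tailed flow sizes from a bounded Pareto distribution via inverse-transform sampling; the parameters below are illustrative, not necessarily those of the actual Pareto-HULL workload:

```python
import random

def bounded_pareto_flow_size(x_min=10_000, x_max=1_000_000_000, alpha=1.05):
    """Sample a flow size in bytes from a bounded (truncated) Pareto
    distribution: many short flows, few very large ones, capped at x_max."""
    u = random.random()
    ha, la = x_max ** alpha, x_min ** alpha
    return (-(u * ha - u * la - ha) / (ha * la)) ** (-1.0 / alpha)

sizes = [bounded_pareto_flow_size() for _ in range(100_000)]
print("empirical mean flow size:", round(sum(sizes) / len(sizes)), "bytes")
```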
Traffic scenarios (a generator sketch follows below)
A2A(x): fractional all-to-all. Only the servers under x% of the ToRs communicate all-to-all.
Permute(x): fractional random permutation. A random pairing of x% of the ToRs; in each pair, all servers communicate only with the servers of the counterpart rack.
ProjecToR: empirical skewed traffic from a Microsoft cluster.
Skew(x, y): an x fraction of ToRs has a y probability of participating in a flow (rack-pair). E.g., θ=4% of ToRs have a φ=77% chance of participating in a flow.
ProjecToR: Ghobadi, Monia, et al. "ProjecToR: Agile reconfigurable data center interconnect." SIGCOMM (2016).
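A minimal sketch of how the fractional scenarios could be instantiated as ToR-level traffic pairs (helper names and structure are ours, not the exact generators used in the evaluation):

```python
import random
from itertools import permutations

def a2a(tors, x):
    """A2A(x): the servers under a random x-fraction of the ToRs talk all-to-all."""
    active = random.sample(tors, round(x * len(tors)))
    return [(s, d) for s, d in permutations(active, 2)]

def permute(tors, x):
    """Permute(x): a random pairing of an x-fraction of the ToRs; every ToR in a
    pair exchanges traffic only with its counterpart."""
    active = random.sample(tors, round(x * len(tors)) // 2 * 2)  # even number of ToRs
    random.shuffle(active)
    pairs = list(zip(active[0::2], active[1::2]))
    return [(s, d) for a, b in pairs for (s, d) in ((a, b), (b, a))]

tors = [f"tor{i}" for i in range(64)]
print(len(a2a(tors, 0.31)), "A2A rack pairs;", len(permute(tors, 0.4)), "Permute rack pairs")
```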
Topologies & Routing
Two topologies (k=16):
• Full fat-tree with n=320
• Xpander with n=216 (67.5%)
… with 10 Gbps links, both supporting ~1K servers
At servers: DCTCP; flowlets (change path upon exceeding gap)
Fat-tree: ECMP
Xpander: HYBRID
Introducing HYBRID routing (see the sketch below):
• ECMP until # sent bytes > threshold Q
• After threshold Q, use Valiant load balancing (VLB)
Advantages:
• Oblivious to the network congestion state
• Introduces little to no overhead in current switches
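A minimal sketch of the per-flow HYBRID decision (the threshold value, path-table format, and helper names are ours; the evaluated mechanism runs with the transport setup listed above):

```python
import random

def hybrid_path(bytes_sent, src, dst, ecmp_paths, tors, threshold_q=100_000):
    """HYBRID routing: a flow uses an ECMP shortest path until it has sent more
    than Q bytes; afterwards it switches to Valiant load balancing (VLB), i.e.
    it detours via a random intermediate ToR, oblivious to congestion state."""
    if bytes_sent <= threshold_q:
        return random.choice(ecmp_paths[(src, dst)])              # ECMP among shortest paths
    via = random.choice([t for t in tors if t not in (src, dst)])
    return ecmp_paths[(src, via)][0] + ecmp_paths[(via, dst)][0][1:]  # VLB: src -> via -> dst

# Toy path table (hypothetical); each entry lists shortest paths as node sequences.
paths = {("A", "B"): [["A", "X", "B"]], ("A", "C"): [["A", "Y", "C"]], ("C", "B"): [["C", "Z", "B"]]}
print(hybrid_path(50_000, "A", "B", paths, ["A", "B", "C"]))   # short flow: ECMP path
print(hybrid_path(500_000, "A", "B", paths, ["A", "B", "C"]))  # long flow: detour via C
```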
Experimental take-aways
• Xpander achieves comparable performance to non-blocking fabrics…
• At lower cost: 2/3rds or less
• Matching the performance of dynamic topologies
A2A(x): fractional all-to-all (pFabric)
[Plots vs. fraction of active servers: 99th %-tile FCT for small flows (<100 KB, in ms, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree, Xpander ECMP, and Xpander HYB]
Permute(x): fractional random permutation (pFabric)
[Plots vs. fraction of active servers: 99th %-tile FCT for small flows (<100 KB, in ms, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree and Xpander HYB]
A2A(0.31) with many short flows (Pareto-HULL)
[Plots vs. load λ (0.5M-3.0M flow-starts per second): 99th %-tile FCT for small flows (<100 KB, in µs, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree and Xpander HYB]
Comparing against ProjecToR
• Creating the same experiment as ProjecToR
• Same workload (pFabric)
• Same traffic scenario
• Same network sizes:
  k=16 fat-tree: 320 switches
  d=16, r=8 Xpander: 128 switches (40%), static links
ProjecToR: same # of network ports but static
[Plots of average FCT for all flows (lower is better) vs. load λ (flow-starts per second), comparing Fat-tree and Xpander HYB, for the Empirical and the Skew(4%, 77%) traffic scenarios]
Skew(4%, 77%) using same equipment at larger scale
k=24 fat-tree (720 switches)
d=13, r=11 Xpander (322 switches = 45%)
… both supporting ~3.5k servers
[Plot of average FCT for all flows (lower is better) vs. load λ (flow-starts per second), comparing Fat-tree and Xpander HYB]
cheaper expander + simple, practical routing = performance of full-bandwidth fat-tree
Expanders: the static topology benchmark
Demonstrating an advantage of dynamic topologies over static topologies requires…
• … comparing to expander-based static networks
• … at equal cost
• … using more expressive routing than ECMP
• … accounting for reconfiguration/buffering latency
No proposal to date meets this benchmark.
Future work
A. Better (oblivious) routing schemes?
B. Adaptive routing?
C. Deployment?
My e-mail: simon.kassing [at] inf.ethz.ch
Code available: https://github.com/ndal-eth/netbench
Get in touch