TRANSCRIPT
Beyond fat-trees without antennae, mirrors, and disco-balls
Simon Kassing, Asaf Valadarsky, Gal Shahaf, Michael Schapira, Ankit Singla
Skewed traffic within data centers [Google]
Traffic hotspots [Google]
All-to-all non-blocking connectivity is expensive
Oversubscribed fat-trees
Oversubscribed fat-trees: Capacity 75%, Demand 50% (Bottleneck)
Oversubscribed fat-trees: A tragedy … (k = 96: Capacity 75%, Demand 2%)
Dynamically set up network connections!
• OFC ’09 Glick et al.
• SIGCOMM ’10 Wang et al.
• SIGCOMM ’10 Farrington et al.
• SIGCOMM ’11 Halperin et al.
• NSDI ’12 Chen et al.
• SIGCOMM ’12 Zhou et al.
• SIGCOMM ’13 Porter et al.
• SIGCOMM ’14 Liu et al.
• SIGCOMM ’14 Hamedazimi et al.
• SIGCOMM ’16 Ghobadi et al.
• NSDI ’17 Chen et al.
[Figure 1: HyPaC network architecture. Servers attach to ToR switches; an electrical network of ToR, aggregate, and core switches is augmented by an optical network providing reconfigurable optical paths.]
System requirements
Control plane: 1. Estimating cross-rack traffic demands; 2. Managing circuit configuration
Data plane: 1. De-multiplexing traffic in dual-path network; 2. Maximizing the utilization of circuits when available (optimization)
Table 1: Fundamental requirements of HyPaC architecture.
capacity because a single optical path can handle tens of servers
sending at full capacity over conventional gigabit Ethernet links.
The circuit-switched network can only provide a matching on the
graph of racks: Each rack can have at most one high-bandwidth
connection to another rack at a time. The switch can be reconfig-
ured to match different racks at a later time; as noted earlier, this
reconfiguration takes a few milliseconds, during which time the fast
paths are unusable. To ensure that latency sensitive applications can
make progress, HyPaC retains the packet-switched network. Any
node can therefore talk to any other node at any time over potentially
over-subscribed packet-switched links.
For the circuits to provide benefits, the traffic must be “pair-
wise concentrated”—there must exist pairs of racks with high band-
width demands between them and lower demand to others. Fortu-
nately, such concentration has been observed by numerous prior
studies [14, 16, 29]. This concentration exists for several reasons:
time-varying traffic, biased distributions, and—our focus in later
sections—amenability to batching. First, applications whose traffic
demands vary over time (e.g. hitting other bottlenecks, multi-phase
operation) can contribute to a non-uniform traffic matrix. Second,
other applications have intrinsic communication skew in which most
nodes only communicate with a small number of partners. This
limited out-degree leads to concentrated communication. Finally,
latency-insensitive applications such as MapReduce-style computa-
tions may be amenable to batched data delivery: instead of sending
data to destinations in a fine-grained manner (e.g., 1, 2, 3, 2, 3, 1,
2), sufficient buffering can be provided to batch this delivery (1, 1,
2, 2, 2, 3, 3). These patterns do not require arbitrary full-bisection
capacity.
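To make the batching idea concrete, grouping a fine-grained delivery sequence by destination is all that is needed; a minimal sketch (the helper below is illustrative, not code from the paper):

```python
def batch_by_destination(deliveries):
    """Group a fine-grained delivery sequence (e.g. 1, 2, 3, 2, 3, 1, 2) into
    per-destination batches (1, 1, 2, 2, 2, 3, 3), keeping destinations in the
    order they first appear. Buffering is what makes this reordering possible."""
    first_seen = {}
    for i, dst in enumerate(deliveries):
        first_seen.setdefault(dst, i)
    return sorted(deliveries, key=lambda dst: first_seen[dst])

print(batch_by_destination([1, 2, 3, 2, 3, 1, 2]))  # [1, 1, 2, 2, 2, 3, 3]
```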
3.1 System Requirements
Table 1 summarizes functions needed for a generic HyPaC-style
network. In the control plane, effective use of the circuit-switched
paths requires determining rack-to-rack traffic demands and timely
circuit reconfiguration to match these demands.
In the data plane, a HyPaC network has two properties: First,
when a circuit is established between two racks, there exist two paths
between them—the circuit-switched link and the always-present
packet-switched path. Second, when the circuits are reconfigured,
the network topology changes. Reconfiguration in a large data center
causes hundreds of simultaneous link up/down events, a level of
dynamism much higher than usually found in data centers. A HyPaC
network therefore requires traffic control mechanisms to dynamically
de-multiplex traffic onto the circuit or packet switched network,
as appropriate. Finally, if applications do not send traffic rapidly
enough to fill the circuit-switched paths when they become available,
a HyPaC design may need to implement additional mechanisms,
such as extra batching, to allow them to do so.
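As a minimal sketch of the de-multiplexing requirement (not c-Through's actual mechanism, which Section 4 describes), the per-packet decision can be viewed as a lookup against the current circuit matching:

```python
def choose_network(src_rack, dst_rack, circuit_matching):
    """Pick the network for traffic between two racks. circuit_matching maps a
    rack to the (at most one) rack it is currently circuit-connected to, and it
    changes on every reconfiguration."""
    if circuit_matching.get(src_rack) == dst_rack:
        return "optical circuit"      # high-bandwidth path, exists only while matched
    return "packet-switched"          # always available, possibly oversubscribed

matching = {"A": "B", "B": "A", "C": "D", "D": "C"}   # a matching on the graph of racks
print(choose_network("A", "B", matching))  # optical circuit
print(choose_network("A", "C", matching))  # packet-switched
```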
3.2 Design Choices and Trade-offs
These system requirements can be achieved on either end-hosts or
switches. For designs on end-hosts, the system components can be
at different software layers (e.g., application layer or kernel layer).
Traffic demand estimation: One simple choice is to let applica-
tions explicitly indicate their demands. Applications have the most
accurate information about their demands, but this design requires
modifying applications. As we discuss in Section 4, our c-Through
design estimates traffic demand by increasing the per-connection
socket buffer sizes and observing end-host buffer occupancy at run-
time. This design requires additional kernel memory for buffering,
but is transparent to applications and does not require switch changes.
The Helios design [22], in contrast, estimates traffic demands at
switches by borrowing from Hedera [13] an iterative algorithm to
estimate traffic demands from flow information.
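As a rough illustration of the buffer-occupancy idea (a sketch only; the host-to-rack mapping and the per-connection sampling interface are hypothetical, not c-Through's implementation):

```python
from collections import defaultdict

def estimate_rack_demands(connections, rack_of):
    """Aggregate per-connection send-buffer occupancy observed at end hosts into
    a rack-to-rack demand estimate: bytes already queued are a proxy for how much
    traffic would flow if a circuit were provisioned."""
    demand = defaultdict(int)
    for conn in connections:                       # sampled periodically at each host
        src = rack_of[conn["local_host"]]
        dst = rack_of[conn["remote_host"]]
        demand[(src, dst)] += conn["sendq_bytes"]  # occupancy of the enlarged socket buffer
    return dict(demand)

rack_of = {"h1": "R1", "h2": "R1", "h9": "R7"}     # hypothetical hosts and racks
conns = [{"local_host": "h1", "remote_host": "h9", "sendq_bytes": 4_000_000},
         {"local_host": "h2", "remote_host": "h9", "sendq_bytes": 1_500_000}]
print(estimate_rack_demands(conns, rack_of))       # {('R1', 'R7'): 5500000}
```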
Traffic demultiplexing: Traditional Ethernet mechanisms han-
dle multiple paths poorly. Spanning tree, for example, will block
either the circuit-switched or the packet-switched network instead
of allowing each to be used concurrently. The major design choice
in traffic demultiplexing is between emerging link-layer routing pro-
tocols [33, 30, 25, 34, 10] and partition-based approaches that view
the two networks as separate.
The advantage of a routing-based design is that, by treating the
circuit and packet-switched networks as a single network, it oper-
ates transparently to hosts and applications. Its drawback is that it
requires switch modification, and most existing routing protocols im-
pose a relatively long convergence time when the topology changes.
For example, in link state routing protocols, re-convergence follow-
ing hundreds of simultaneous link changes could require seconds or
even minutes [15]. To be viable, routing-based designs may require
further work in rapidly converging routing protocols.
A second option, and the one we choose for c-Through, is to
isolate the two networks and to de-multiplex traffic at either the
end-hosts or at the ToR switches. We discuss our particular design
choice further in Section 4. The advantage of separating the net-
works is that rapid circuit reconfiguration does not destabilize the
packet-switched network. Its drawback is a potential increase in
configuration complexity.
Circuit utilization optimization, if necessary, can be similarly ac-
complished in several ways. An application-integrated approach
could signal to applications to increase their transmission rate when
the circuits are available; the application-transparent mechanism
we choose for c-Through is to buffer additional data in TCP socket
buffers, relying on TCP to ramp up quickly when bandwidth be-
comes available. Such buffering could also be accomplished in the
ToR switches.
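For context, enlarging a connection's send buffer so that TCP can ramp up once extra bandwidth appears uses a standard socket option; a minimal, generic sketch (the 16 MB value is illustrative, not c-Through's setting, and the kernel may cap it):

```python
import socket

def enlarge_send_buffer(sock, nbytes=16 * 1024 * 1024):
    """Request a larger kernel send buffer so queued data can burst out quickly
    when a high-bandwidth path (e.g. an optical circuit) becomes available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)  # effective value

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("effective send buffer:", enlarge_send_buffer(s), "bytes")
```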
ProjecToR: Agile Reconfigurable Data Center Interconnect
Monia Ghobadi Ratul Mahajan Amar Phanishayee
Nikhil Devanur Janardhan Kulkarni Gireeja Ranade
Pierre-Alexandre Blanche† Houman Rastegarfar† Madeleine Glick† Daniel Kilper†
Microsoft Research †University of Arizona
Abstract— We explore a novel, free-space optics based
approach for building data center interconnects. It uses
a digital micromirror device (DMD) and mirror assembly
combination as a transmitter and a photodetector on top of
the rack as a receiver (Figure 1). Our approach enables all
pairs of racks to establish direct links, and we can recon-
figure such links (i.e., connect different rack pairs) within
12 µs. To carry traffic from a source to a destination rack,
transmitters and receivers in our interconnect can be dynam-
ically linked in millions of ways. We develop topology con-
struction and routing methods to exploit this flexibility, in-
cluding a flow scheduling algorithm that is a constant fac-
tor approximation to the offline optimal solution. Experi-
ments with a small prototype point to the feasibility of our
approach. Simulations using realistic data center workloads
show that, compared to the conventional folded-Clos inter-
connect, our approach can improve mean flow completion
time by 30–95% and reduce cost by 25–40%.
CCS Concepts: Networks → Network architectures
Keywords: Data Centers; Free-Space Optics; Reconfigurability
1. INTRODUCTION
The traditional way of designing data center (DC)
networks—electrical packet switches arranged in a multi-
tier topology—has a fundamental shortcoming. The design-
ers must decide in advance how much capacity to provision
between top-of-rack (ToR) switches. Depending on the pro-
visioned capacity, the interconnect is either expensive (e.g.,
with full-bisection bandwidth) or it limits application perfor-
mance when demand between two ToRs exceeds capacity.
[Figure 1: ProjecToR interconnect with unbundled transmit (lasers) and receive (photodetectors) elements. Lasers feed DMDs (arrays of micromirrors); the diffracted beam is reflected by a mirror assembly towards the destination's photodetectors.]
Helios, c-Thru, Proteus, Solstice [16, 26, 37, 38]: OCS, not seamless, fan-out 100-320, reconfig. time 30 ms
Flyways, 3DBeam [23, 40]: 60 GHz, not seamless, fan-out ≈70, reconfig. time 10 ms
Mordia [33]: OCS, not seamless, fan-out 24, reconfig. time 11 µs
FireFly [22]: FSO, seamless, fan-out 10, reconfig. time 20 ms
ProjecToR: FSO, seamless, fan-out 18,432, reconfig. time 12 µs
Table 1: Properties of reconfigurable interconnects.
Many researchers have recognized this shortcoming and
proposed reconfigurable interconnects, using technologies
that are able to dynamically change capacity between pairs
of ToRs. The technologies that they have explored include
optical circuit switches (OCS) [16,25,26,33,37,38], 60 GHz
wireless [23, 40], and free-space optics (FSO) [22].
However, our analysis of traffic from four diverse pro-
duction clusters shows that current approaches lack at least
two of three desirable properties for reconfigurable intercon-
nects: 1) Seamlessness: few limits on how much network
capacity can be dynamically added between ToRs; 2) High
fan-out: direct communication from a rack to many others;
and 3) Agility: low reconfiguration time.
Table 1 compares the existing reconfigurable intercon-
nects with respect to these three properties. Most approaches
(rows 1–3) are not seamless because they use a second, re-
[Figure 1: Radio transceivers (TX/RX) are placed on top of each rack (a) or container (b). Using 2D beamforming (c), transceivers communicate with neighboring ones directly, but forward traffic in multiple hops to non-neighboring racks. Using 3D beamforming (d), the ceiling reflects the signals from each sender to its receiver, avoiding multi-hop relays.]
more localized/bursty bandwidth requirements. That is, we
focus on the subset that do not require (near) non-blocking
all-to-all communication at data center scale.
In particular, we focus on high-throughput, beamform-
ing wireless links in the 60 GHz band. The unlicensed 60
GHz band provides multi-Gbps data rates and can be im-
plemented with relatively low-cost hardware. Because 60
GHz signals attenuate quickly with distance, multiple wire-
less links can be deployed in a single data center. In our
efforts to expand the effective bandwidth of 60 GHz links,
we hope to create a new primitive that can be used to either
augment existing networks with on-demand network links,
or potentially replace wired links in data centers with mod-
est bandwidth requirements. We build on pioneering efforts
of earlier work that proposed 60 GHz links to alleviate hot
spots in the data center [23, 26].
However, earlier efforts face a number of limitations. First,
even beamforming directional links will experience signal
leakage, and produce a cone of interference to receivers near
or behind the intended target receiver. This limits the num-
ber of links that can be active concurrently in densely oc-
cupied data centers, and reduces the aggregate throughput
offered by these wireless links.
Second, these links require direct line-of-sight (LOS) be-
tween sender and receiver, and can be blocked by even small
objects in the path. This limits the effective range of 60 GHz
links to neighboring top-of-rack radios. Since hotspots occur
regularly at both edge and core links [15], augmenting core
links would require multiple hops through a line-of-sight 60
GHz network. Half-duplex, directional antennas mean that
these multi-hop links will suffer at least a 50% throughput
drop, higher-levels of potential congestion, and additional
delays required to frequently adjust antenna orientation.
To address these issues, we investigate the feasibility of
60 GHz 3D beamforming as a flexible wireless primitive in
data centers. In 3D beamforming, a top-of-rack directional
antenna forms a wireless link by reflecting a focused beam off
the ceiling towards the receiver. This reduces its interference
footprint, avoids blocking obstacles, and provides an indirect
line-of-sight path for reliable communication. Such a system
requires only beamforming radios readily available today,
and near perfect reflection can be provided by simple flat
metal plates mounted on the ceiling of a data center.
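To make the ceiling-bounce geometry concrete (our own illustration, not a computation from the paper): with a flat mirror, the reflected path between two top-of-rack antennas equals the straight line to the receiver's mirror image above the ceiling, which gives path length and beam elevation directly:

```python
import math

def ceiling_bounce_link(rack_distance_m, ceiling_height_m):
    """Path length and elevation angle of a 3D beamforming link, assuming an
    ideal flat mirror mounted at height h above the antennas: the bounced path
    is the straight line to the receiver mirrored across the ceiling plane."""
    path_m = math.hypot(rack_distance_m, 2 * ceiling_height_m)
    elevation_deg = math.degrees(math.atan2(2 * ceiling_height_m, rack_distance_m))
    return path_m, elevation_deg

# Illustrative numbers: racks 20 m apart, ceiling 3 m above the antennas.
path, angle = ceiling_bounce_link(20.0, 3.0)
print(f"path ~ {path:.1f} m, elevation ~ {angle:.1f} degrees")  # ~20.9 m, ~16.7 degrees
```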
3D beamforming has several distinctive advantages over
prior “2D” approaches. First, bouncing the beam off the
ceiling allows links to extend the reach of radio signals by
avoiding blocking obstacles. Second, the 3D direction of the
beam significantly reduces its interference range, allowing
more nearby flows to transmit concurrently. Third, the re-
duced interference extends the effective range of each link,
allowing our system to connect any two racks using a single
hop, and mitigating the need for multihop links.
In this paper, we propose a 3D beamforming system for 60
GHz wireless transmissions in data centers. The 3D beam-
forming idea was first introduced by Zhang et al. in [46].
In this paper, we greatly extend the prior work, and use
measurements of a local 60 GHz testbed to quantify and
compare the performance of 3D and 2D beamforming links.
We find that 3D wireless beamforming works well in prac-
tice, and experiences zero loss in signal or throughput from
reflection. We also describe a link scheduler for 3D beam-
forming systems that maximizes concurrent links while also
taking into account accumulative interference and antenna
alignment delays. Finally, we use a detailed simulation of
data center traffic hotspots to quantify the performance of
3D beamforming systems. Our results show that while 2D
links can only support a small portion of hotspot traffic links,
3D beamforming can connect all rack pairs in a single hop,
and can significantly reduce overall data completion time for
wired networks across a range of bisection bandwidths.
While wired networks will likely remain the vehicle of
choice for the high-end of distributed computing, we believe
that efforts such as 3D beamforming can expand the ap-
plicability and benefits of wireless networking to a broader
range of data center deployments.
2. 60 GHZ: LIMITATIONS AND SOLUTIONS
While modifying the topology of wired data centers is
costly, complex, and sometimes intractable, administrators
can introduce flexible point-to-point network links with the
addition of wireless radios. Prior work has proposed the use
of 60 GHz links to augment data center capacity [23, 26,
35, 38]. Figures 1(a)-(b) show a common deployment sce-
nario, where wireless radios are placed on top of each rack
or container to connect pairs of top-of-rack (ToR) switches.
In practice, however, data center managers remain skep-
tical on deploying wireless links despite their potential ben-
efits [1]. In this section, we summarize prior work in this
space, and use detailed experiments on a 60 GHz testbed to
identify and quantify key limitations of current proposals.
2.1 60 GHz Links in Data Centers
Existing designs [23, 26, 27, 38] adopt 60 GHz wireless
technologies for several reasons. First, the 7GHz spectrum
FireFly: A Reconfigurable Wireless Data Center Fabric
Using Free-Space Optics
Navid Hamedazimi,† Zafar Qazi,† Himanshu Gupta,† Vyas Sekar,* Samir R. Das,† Jon P. Longtin,†
Himanshu Shah,† and Ashish Tanwer†
†Stony Brook University   *Carnegie Mellon University
ABSTRACT
Conventional static datacenter (DC) network designs offer extreme
cost vs. performance tradeoffs—simple leaf-spine networks are cost-
effective but oversubscribed, while “fat tree”-like solutions offer
good worst-case performance but are expensive. Recent results
make a promising case for augmenting an oversubscribed network
with reconfigurable inter-rack wireless or optical links. Inspired
by the promise of reconfigurability, this paper presents FireFly, an
inter-rack network solution that pushes DC network design to the
extreme on three key fronts: (1) all links are reconfigurable; (2) all
links are wireless; and (3) non top-of-rack switches are eliminated
altogether. This vision, if realized, can offer significant benefits in
terms of increased flexibility, reduced equipment cost, and minimal
cabling complexity. In order to achieve this vision, we need to look
beyond traditional RF wireless solutions due to their interference
footprint which limits range and data rates. Thus, we make the case
for using free-space optics (FSO). We demonstrate the viability of
this architecture by (a) building a proof-of-concept prototype of a
steerable small form factor FSO device using commodity compo-
nents and (b) developing practical heuristics to address algorithmic
and system-level challenges in network design and management.
Categories and Subject Descriptors
C.2.1 [Computer-Communication Networks]: Network Architec-
ture and Design
Keywords: Data Centers; Free-Space Optics; Reconfigurability
1 Introduction
A robust data center (DC) network must satisfy several goals: high
throughput [13, 23], low equipment and management cost [13, 40],
robustness to dynamic traffic patterns [14, 26, 48, 52], incremen-
tal expandability [18, 45], low cabling complexity [37], and low
power and cooling costs. With respect to cost and performance,
conventional designs are either (i) overprovisioned to account for
worst-case traffic patterns, and thus incur high cost (e.g., fat-trees
or Clos networks [13, 16, 23]), or (ii) oversubscribed (e.g., simple
trees or leaf-spine architectures [1]) which incur low cost but offer
poor performance due to congested links.
[Figure 1: High-level view of the FireFly architecture. Racks (1 … r … N) have ToR switches with steerable FSOs aimed at a ceiling mirror; a FireFly controller observes traffic patterns and issues rule changes and FSO reconfigurations. The only switches are the Top-of-Rack (ToR) switches.]
Recent work suggests a promising middleground that augments
an oversubscribed network with a few reconfigurable links, using
either 60 GHz RF wireless [26, 52] or optical switches [48]. In-
spired by the promise of these flexible DC designs,1 we envision a
radically different DC architecture that pushes the network design
to the logical extreme on three dimensions: (1) All inter-rack links
are flexible; (2) All inter-rack links are wireless; and (3) we get rid
of the core switching backbone.
This extreme vision, if realized, promises unprecedented qualita-
tive and quantitative benefits for DC networks. First, it can reduce
infrastructure cost without compromising on performance. Second,
flexibility increases the effective operating capacity and can im-
prove application performance by alleviating transient congestion.
Third, it unburdens DC operators from dealing with cabling com-
plexity and its attendant overheads (e.g., obstructed cooling) [37].
Fourth, it can enable DC operators to experiment with, and bene-
fit from, new topology structures that would otherwise remain un-
realizable due to cabling costs. Finally, flexibly turning links on
or off can take us closer to the vision of energy proportionality
(e.g., [29]). This paper describes FireFly,2 a first but significant step toward
realizing this vision. Figure 1 shows a high-level overview of Fire-
Fly. Each ToR is equipped with reconfigurable wireless links which
can connect to other ToR switches. However, we need to look
beyond traditional radio-frequency (RF) wireless solutions (e.g.,
60GHz) as their interference characteristics limit range and capac-
ity. Thus, we envision a new use-case for Free-Space Optical com-
munications (FSO) as it can offer high data rates (tens of Gbps)
over long ranges using low transmission power and with zero in-
terference [31]. The centralized FireFly controller reconfigures the
topology and forwarding rules to adapt to changing traffic patterns.
While prior work made the case for using FSO links in DCs [19,
28], these fail to establish a viable hardware design and also do not
address practical network design and management challenges that
1We use the terms flexible and reconfigurable interchangeably.
2FireFly stands for Free-space optical Inter-Rack nEtwork with
high FLexibilitY.
Set up network connections on the fly!
Advantage: Gained the ability to move links around
ProjecToR: Ghobadi, Monia, et al. "ProjecToR: Agile reconfigurable data center interconnect." SIGCOMM (2016).
Engineering challenges facing dynamic topologies
• Spatial planning and organisation?
• Environmental factors?
• Lack of operational experience?
• Device packaging?
• Monitoring and debugging?
• Reliability and lifetime of devices?
• Unknown unknowns?
Foundational questions
1. Rigorous benchmarks? Fat-trees are the easiest baseline — ideally inflexible!
2. What is the utility of dynamic links?
Ideally flexible network
[Plot: throughput per server vs. fraction of servers with traffic demand. The ideal "throughput proportional" curve passes through (𝛼, 1) and (1, 𝛼), following (x, 𝛼/x) in between.]
Fat-trees: ideally inflexible
[Plot: on the same axes, an oversubscribed fat-tree (2/k) appears as a flat line, unable to exploit sparse demand, in contrast to the throughput-proportional curve (x, 𝛼/x).]
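One way to read these plots (our notation; only the curves appear on the slides): let x be the fraction of servers with traffic demand and α the network's aggregate capacity as a fraction of full bisection bandwidth. An ideally flexible network can concentrate all of its capacity on the active servers, whereas an oversubscribed fat-tree cannot exploit sparse demand:

```latex
T_{\text{flexible}}(x) \;=\; \min\!\left(1,\ \frac{\alpha}{x}\right)
\quad\text{(through } (\alpha, 1) \text{ and } (1, \alpha)\text{)},
\qquad
T_{\text{fat-tree}}(x) \;\approx\; \text{constant in } x .
```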
Near-optimal expander networks: static but flexible
Instead of rigid, layered connectivity…
Expander-based data centers
• Jellyfish (NSDI ’12)
• Slimfly (SC ’14)
• Xpander (CoNEXT ’16)
Xpander: deterministic wiring-friendly expander-based data center
[Figure: physical layout of an Xpander data center, showing pods of switches/racks, pod cable aggregators, and meta-node cable aggregators]
Valadarsky, Asaf, et al. "Xpander: Towards Optimal-Performance Datacenters." CoNEXT (2016).
A fundamental question… How valuable is the ability to move links around?
Set up network connections on the fly! vs. Expander-based data centers!
Optimal flow comparison
[Plot axes: throughput per server vs. fraction of servers with traffic demand]
Baseline: oversubscribed fat-tree
[Plot: throughput per server vs. fraction of servers with traffic demand, for the oversubscribed fat-tree]
Indeed, dynamic networks can be better
[Plot adds a dynamic network (𝛿=1.5) alongside the fat-tree]
… but so can static ones
[Plot adds a static expander alongside the fat-tree and the dynamic network (𝛿=1.5)]
… especially in the regime of interest
[Plot: same curves (Fat-tree, Expander, Dynamic network (𝛿=1.5))]
“46-99% of the rack pairs exchange no traffic at all” — Ghobadi et al., 2016
Not too far from proportionality!
[Plot: the Expander curve lies close to the throughput-proportional line; Fat-tree and Dynamic network (𝛿=1.5) also shown]
Workloads
pFabric Web search (2.4 MB mean): modelled after a real workload; maximum flow size of 30 MB.
Pareto-HULL (100 KB mean): Pareto distributed; highly skewed; many short flows (<100 KB); few very large flows (max. 1 GB). (A sampling sketch follows below.)
… at a fixed arrival rate per second (λ)
[CDF of flow size (bytes, 1 KB to 1 GB) for the Pareto-HULL (mean = 100 KB) and pFabric Web search (mean = 2.4 MB) workloads]
Pareto-HULL: Alizadeh, Mohammad, et al. "Less is more: trading a little bandwidth for ultra-low latency in the data center." USENIX (2012).
pFabric Web search: Alizadeh, Mohammad, et al. "pFabric: Minimal near-optimal datacenter transport." SIGCOMM (2013).
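For reproduction purposes, a minimal sketch of drawing heavy-tailed flow sizes from a bounded Pareto distribution via inverse-transform sampling; the parameters below are illustrative, not necessarily those of the actual Pareto-HULL workload:

```python
import random

def bounded_pareto_flow_size(x_min=10_000, x_max=1_000_000_000, alpha=1.05):
    """Sample a flow size in bytes from a bounded (truncated) Pareto
    distribution: many short flows, few very large ones, capped at x_max."""
    u = random.random()
    ha, la = x_max ** alpha, x_min ** alpha
    return (-(u * ha - u * la - ha) / (ha * la)) ** (-1.0 / alpha)

sizes = [bounded_pareto_flow_size() for _ in range(100_000)]
print("empirical mean flow size:", round(sum(sizes) / len(sizes)), "bytes")
```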
Traffic scenarios (a generator sketch follows below)
A2A(x): fractional all-to-all. Only the servers under x% of the ToRs communicate all-to-all.
Permute(x): fractional random permutation. A random pairing of x% of the ToRs; in each pair, all servers communicate only with the servers of the counterpart rack.
ProjecToR: empirical skewed traffic from a Microsoft cluster.
Skew(x, y): an x fraction of ToRs has a y probability of participating in a flow (rack-pair). E.g., θ=4% of ToRs have a φ=77% chance of participating in a flow.
ProjecToR: Ghobadi, Monia, et al. "ProjecToR: Agile reconfigurable data center interconnect." SIGCOMM (2016).
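A minimal sketch of how the fractional scenarios could be instantiated as ToR-level traffic pairs (helper names and structure are ours, not the exact generators used in the evaluation):

```python
import random
from itertools import permutations

def a2a(tors, x):
    """A2A(x): the servers under a random x-fraction of the ToRs talk all-to-all."""
    active = random.sample(tors, round(x * len(tors)))
    return [(s, d) for s, d in permutations(active, 2)]

def permute(tors, x):
    """Permute(x): a random pairing of an x-fraction of the ToRs; every ToR in a
    pair exchanges traffic only with its counterpart."""
    active = random.sample(tors, round(x * len(tors)) // 2 * 2)  # even number of ToRs
    random.shuffle(active)
    pairs = list(zip(active[0::2], active[1::2]))
    return [(s, d) for a, b in pairs for (s, d) in ((a, b), (b, a))]

tors = [f"tor{i}" for i in range(64)]
print(len(a2a(tors, 0.31)), "A2A rack pairs;", len(permute(tors, 0.4)), "Permute rack pairs")
```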
Topologies & Routing
Two topologies (k=16):
• Full fat-tree with n=320
• Xpander with n=216 (67.5%)
… with 10 Gbps links, both supporting ~1K servers
At servers: DCTCP; flowlets (change path upon exceeding gap)
Fat-tree: ECMP
Xpander: HYBRID
Introducing HYBRID routing (see the sketch below):
• ECMP until # sent bytes > threshold Q
• After threshold Q, use Valiant load balancing (VLB)
Advantages:
• Oblivious to the network congestion state
• Introduces little to no overhead in current switches
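A minimal sketch of the per-flow HYBRID decision (the threshold value, path-table format, and helper names are ours; the evaluated mechanism runs with the transport setup listed above):

```python
import random

def hybrid_path(bytes_sent, src, dst, ecmp_paths, tors, threshold_q=100_000):
    """HYBRID routing: a flow uses an ECMP shortest path until it has sent more
    than Q bytes; afterwards it switches to Valiant load balancing (VLB), i.e.
    it detours via a random intermediate ToR, oblivious to congestion state."""
    if bytes_sent <= threshold_q:
        return random.choice(ecmp_paths[(src, dst)])              # ECMP among shortest paths
    via = random.choice([t for t in tors if t not in (src, dst)])
    return ecmp_paths[(src, via)][0] + ecmp_paths[(via, dst)][0][1:]  # VLB: src -> via -> dst

# Toy path table (hypothetical); each entry lists shortest paths as node sequences.
paths = {("A", "B"): [["A", "X", "B"]], ("A", "C"): [["A", "Y", "C"]], ("C", "B"): [["C", "Z", "B"]]}
print(hybrid_path(50_000, "A", "B", paths, ["A", "B", "C"]))   # short flow: ECMP path
print(hybrid_path(500_000, "A", "B", paths, ["A", "B", "C"]))  # long flow: detour via C
```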
Experimental take-aways
• Xpander achieves comparable performance to non-blocking fabrics…
• At lower cost: 2/3rds or less
• Matching the performance of dynamic topologies
A2A(x): fractional all-to-all (pFabric)
[Plots vs. fraction of active servers: 99th %-tile FCT for small flows (<100 KB, in ms, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree, Xpander ECMP, and Xpander HYB]
Permute(x): fractional random permutation (pFabric)
[Plots vs. fraction of active servers: 99th %-tile FCT for small flows (<100 KB, in ms, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree and Xpander HYB]
A2A(0.31) with many short flows (Pareto-HULL)
[Plots vs. load λ (0.5M-3.0M flow-starts per second): 99th %-tile FCT for small flows (<100 KB, in µs, lower is better) and average throughput for large flows (>=100 KB, in Gbps, higher is better), comparing Fat-tree and Xpander HYB]
Comparing against ProjecToR
• Creating the same experiment as ProjecToR
• Same workload (pFabric)
• Same traffic scenario
• Same network sizes:
  k=16 fat-tree: 320 switches
  d=16, r=8 Xpander: 128 switches (40%), static links
ProjecToR: same # of network ports but static
[Plots of average FCT for all flows (lower is better) vs. load λ (flow-starts per second), comparing Fat-tree and Xpander HYB, for the Empirical and the Skew(4%, 77%) traffic scenarios]
Skew(4%, 77%) using same equipment at larger scale
k=24 fat-tree (720 switches)
d=13, r=11 Xpander (322 switches = 45%)
… both supporting ~3.5k servers
[Plot of average FCT for all flows (lower is better) vs. load λ (flow-starts per second), comparing Fat-tree and Xpander HYB]
cheaper expander + simple, practical routing = performance of full-bandwidth fat-tree
Expanders: the static topology benchmark
Demonstrating an advantage of dynamic topologies over static topologies requires…
• … comparing to expander-based static networks
• … at equal cost
• … using more expressive routing than ECMP
• … accounting for reconfiguration/buffering latency
No proposal to date meets this benchmark.
Future work
A. Better (oblivious) routing schemes?
B. Adaptive routing?
C. Deployment?
My e-mail: simon.kassing [at] inf.ethz.ch
Code available: https://github.com/ndal-eth/netbench
Get in touch