This paper is included in the Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI '17), March 27–29, 2017, Boston, MA, USA. ISBN 978-1-931971-37-9. Open access to the Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation is sponsored by USENIX.

Enabling Wide-Spread Communications on Optical Fabric with MegaSwitch

Li Chen and Kai Chen, The Hong Kong University of Science and Technology; Zhonghua Zhu, Omnisensing Photonics; Minlan Yu, Yale University; George Porter, University of California, San Diego; Chunming Qiao, University at Buffalo; Shan Zhong, CoAdna

https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/chen





Enabling Wide-spread Communications on Optical Fabric with MegaSwitch

Li Chen¹, Kai Chen¹, Zhonghua Zhu², Minlan Yu³, George Porter⁴, Chunming Qiao⁵, Shan Zhong⁶

¹SING Lab, HKUST  ²Omnisensing Photonics  ³Yale  ⁴UCSD  ⁵SUNY Buffalo  ⁶CoAdna

Abstract

Existing wired optical interconnects face a challenge of supporting wide-spread communications in production clusters. Initial proposals are constrained, as they only support connections among a small number of racks (e.g., 2 or 4) at a time, with switching times of milliseconds. Recent efforts on reducing optical circuit reconfiguration time to microseconds partially mitigate this problem by rapidly time-sharing optical circuits across more nodes, but are still limited by the total number of parallel circuits available simultaneously.

In this paper, we seek an optical interconnect that can enable unconstrained communications within a computing cluster of thousands of servers. We present MegaSwitch, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6,000+ servers. We have implemented a 5-rack 40-server MegaSwitch prototype with commercial optical devices, and used testbed experiments as well as large-scale simulations to explore MegaSwitch's architectural benefits and tradeoffs.

1 Introduction

Due to its advantages over traditional electrical networks at high link speeds, optical circuit switching technology has been widely investigated for datacenter networks (DCNs) to reduce cost, power consumption, and wiring complexity [38]. Many of the initial optical DCN proposals build on the assumption of substantial traffic "concentration". Using optical circuit switches that only support one-to-one connections between ports, they service highly concentrated traffic among a small number of racks at a time (e.g., 2 or 4 [6, 50]). However, evolving traffic characteristics in production DCNs pose new challenges:

• Wide-spread communications: Recent analysis of Facebook cluster traffic reveals that servers can concurrently communicate with hundreds of hosts, with the majority of traffic going to tens of racks [42], and traces from Google [46] and Microsoft [15, 17] also exhibit such high fan-in/out, wide-spread communications. The driving forces behind such wide-spread patterns are multifold, such as data shuffling, load balancing, data spreading for fault-tolerance, etc. [42, 46].

Figure 1: High-level view of a 4-node MegaSwitch.

• High bandwidth demands: Due to the ever-larger parallel data processing, remote storage access, web services, etc., traffic in Google's DCN has increased by 50x in the last 6 years, doubling every year [46].

Under such conditions, early optical DCN solutions fall short due to their constrained communication support [6, 7, 12, 50]. Recent efforts [24, 38] have proposed a temporal approach: by reducing the circuit switching time from milliseconds to microseconds, they are able to rapidly time-share optical circuits across more nodes in a shorter time. While such a temporal approach mitigates the problem, it is still insufficient for wide-spread communications, as it is limited by the total number of parallel circuits available at the same time [38].¹

We take a spatial approach to address the challenge. To support wide-spread communications, we seek a solution that can directly deliver parallel circuits to many nodes simultaneously. In this paper, we present an optical DCN interconnect for thousands of servers, called MegaSwitch, that enables unconstrained communications among all the servers.

At a high level (Figure 1), MegaSwitch is a circuit-switched backplane physically composed of multiple optical wavelength switches (OWS, a switch we implemented, §4) connected in a multi-fiber ring. Each OWS is attached to an electrical packet switch (EPS), forming a MegaSwitch node. The OWS acts as an access point to the ring for its EPS. Each sender uses a dedicated fiber to

¹Another temporal approach [15] employs wireless free space optics (FSO) technology. However, FSO is not yet production-ready for DCNs, because dust and vibration, which are common in DCNs, can impair FSO link stability [15].


send traffic to any other node on this multi-fiber ring; each receiver can receive traffic from all the senders (one per fiber) simultaneously. Essentially, MegaSwitch establishes a one-hop circuit between any pair of EPSes, forming a rearrangeably non-blocking circuit mesh over the physical multi-fiber ring.

Specifically, MegaSwitch achieves unconstrained connections by re-purposing the wavelength selective switch (WSS, a key component of OWS) as a receiving w×1 multiplexer, to enable non-blocking space division multiplexing among w fibers (§3.1). Prior work [6, 11, 38] used WSS as a 1×w demultiplexer that takes 1 input fiber with k wavelengths and outputs any subset of the k wavelengths to w output fibers. In contrast, MegaSwitch reverses the use of WSS. Each node uses k wavelengths² on a fiber for sending; for receiving, each node leverages WSS to intercept all k×w wavelengths from the w other senders (one per fiber) via its w input ports. Then, with a wavelength assignment algorithm (§3.2), the WSS can always select, out of the k×w wavelengths, a non-interfering set to satisfy any communication demands among nodes.

As a result, MegaSwitch delivers rearrangeably non-blocking communications to n×k ports, where n=w+1 is the number of nodes/racks on the ring and k is the number of ports per node. With current technology, w can be up to 32 and k up to 192; thus MegaSwitch can support up to 33 racks and 6,336 hosts. In the future, MegaSwitch can scale beyond 10⁵ ports with AWGR (array waveguide grating router) technology and multi-dimensional expansion (§3.1).

On top of its unconstrained connections, MegaSwitch further has built-in fault-tolerance to handle various failures, such as OWS, cable, and link failures. We develop the necessary redundancy and mechanisms so that MegaSwitch can provide reliable services (§3.3).

We have implemented a small-scale MegaSwitch prototype (§4) with 5 nodes; each node has 8 optical ports transmitting on 8 unique wavelengths within 190.5THz to 193.5THz at 200GHz channel spacing. This prototype enables arbitrary communications among 5 racks and 40 hosts. Furthermore, our OWSes are designed and implemented with all commercially available optical devices, and the EPSes we used are Broadcom Pronto-3922 10G commodity switches.

MegaSwitch's reconfiguration delay hinges on the WSS, and an 11.5µs WSS switching time has been reported using digital light processing (DLP) technology [38]. However, our implementation experience reveals that as WSS port count increases (for wide-spread communications), such low latency can no longer be maintained. This is mainly because the port count of the WSS with DLP

²Each unique wavelength corresponds to an optical port on OWS, as well as an optical transceiver on EPS.

used in [38] is not scalable. In our prototype, the WSS reconfiguration time is ~3ms. We believe this is a hard limitation we have to confront in order to scale. To accommodate unstable traffic and latency-critical applications, we develop a "basemesh" on MegaSwitch to ensure that any two nodes are always connected via certain wavelengths during reconfigurations (§3.2.2). Specifically, we construct the basemesh with the well-known Symphony [29] topology from the distributed hash table (DHT) literature, to provide low average latency with adjustable capacity.

Over the prototype testbed, we conducted basic measurements of throughput and latency, and deployed real applications, e.g., Spark [54] and Redis [43], to evaluate the performance of MegaSwitch (§5.1). Our experiments show that the latency-sensitive Redis experiences uniformly low latency for cross-rack queries due to the basemesh, and the performance of Spark applications is similar to that of an optimal scenario where all servers are connected to one single switch.

To complement testbed experiments, we performed large-scale simulations to study the impact of traffic stability on MegaSwitch (§5.2). For synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic of all patterns, but does not sustain high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, it can improve throughput for unstable wide-spread traffic through the basemesh. For real production traces, MegaSwitch achieves 93.21% throughput of an ideal non-blocking fabric. We also find that our basemesh effectively handles highly unstable wide-spread traffic, and contributes up to 56.14% throughput improvement.

2 Background and Motivation

In this section, we first highlight the trend of high-demand, wide-spread traffic in DCNs. Then we discuss why prior solutions are insufficient to support this trend.

2.1 The Trend of DCN Traffic

We use two case studies to show the traffic trend:

• User-facing web applications: In response to user requests, web servers push/pull content to/from cache servers (e.g., Redis [43]). Web servers and cache servers are usually deployed in different racks, leading to intensive inter-rack traffic [42]. Cache objects are replicated in different clusters for fault-tolerance and load balancing, and web servers access these objects randomly, creating a wide-spread communication pattern overall.

• Data-parallel processing: Traffic of MapReduce-type applications [3, 21, 42] is shown to be heavily cluster-local (e.g., [42] reported that 13.3% of the traffic is rack-local, but 80.9% is cluster-local). Data is spread to many racks due to rack awareness [19], which requires


copies of data to be stored in different racks in case of rack failures, as well as for cluster-level balancing [45]. In the shuffle and output phases of MapReduce jobs, wide-spread many-to-many traffic emerges among all participating servers/racks within the cluster.

Bandwidth demand is also surging, as the data stored

and processed in DCNs continues to grow, due to the explosion of user-generated content such as photo/video updates and sensor measurements [46]. The implication is twofold: 1) data processing applications require larger bandwidth for data exchange, both within an application (e.g., in a multi-stage machine learning job, workers exchange data between stages) and between applications (e.g., related applications like web search and ad recommendation share data); 2) front-end servers need more bandwidth to fetch the ever-larger content from cache servers to generate webpages [5]. In conclusion, optical DCN interconnects must support wide-spread, high-bandwidth demands among many racks.

2.2 Prior Proposals Are Insufficient

Prior proposals fall short in supporting the above trend of high-bandwidth, wide-spread communications among many nodes simultaneously.

• c-Through [50] and Helios [12] are seminal works that

introduce wavelength division multiplexing (WDM) and optical circuit switches (OCS) into DCNs. However, they suffer from constrained rack-to-rack connectivity: at any time, high-capacity optical circuits are limited to a couple of racks (e.g., 2 [50]). Reconfiguring circuits to serve communications among more racks can incur ~10ms circuit visit delay [12, 50].

• OSA [6] and WaveCube [7] enable arbitrary rack-to-rack connectivity via multi-hop routing over multiple circuits and EPSes. While solving the connectivity issue, they introduce severe bandwidth constraints between racks, because traffic needs to traverse multiple EPSes to its destination, introducing traffic overhead and routing complexity at each transit EPS.

• Quartz [26] is an optical fiber ring that is designed as a design element to provide low latency in portions of traditional DCNs. It avoids circuit switching delay by using fixed wavelengths, but its bisection bandwidth is also limited.

• Mordia [38] and REACToR [24] improve optical circuit switching in the temporal direction by reducing circuit switching time from milliseconds to microseconds. This approach is effective in quickly time-sharing circuits across more nodes, but is still insufficient to support wide-spread communications, due to the limited number of parallel circuits available simultaneously. Furthermore, Mordia is by default a single-fiber ring structure with small port density (limited by the number of unique wavelengths supported on a fiber). Naïvely stacking

multiple fibers via ring-selection circuits, as suggested by [38], is blocking among hosts on different fibers, leading to severely degraded bisection bandwidth (§5.2); whereas time-sharing multiple fibers via EPS, as suggested by [11], requires complex EPS functions unavailable in existing commodity EPSes, and introduces additional optical design complexities, making it hard to implement in practice (more details in §3.1).

Figure 2: Example of a 3-node MegaSwitch.

• Free-space optics proposals [15, 18] eliminate wiring. FireFly [18] leverages free-space optics to build a wireless DCN fabric with an improved cost-performance tradeoff. With low switching time and high scalability, ProjecToR [15] connects an entire DCN of tens of thousands of machines. However, they face practical challenges in real DCN environments (e.g., dust and vibration). In contrast, we implement a wired optical interconnect for thousands of machines, which is common in DCNs, and deploy real applications on it. We note that the mechanisms we developed for MegaSwitch can be employed in ProjecToR, notably the basemesh for low-latency applications.

3 MegaSwitch Design

In this section, we describe the design of MegaSwitch's data and control planes, as well as its fault-tolerance.

3.1 Data Plane

As shown in Figure 2, MegaSwitch connects multiple OWSes in a multi-fiber ring. Each fiber has only one sender. The sender broadcasts traffic on this fiber, which reaches all other nodes in one hop. Each receiver can receive traffic from all the senders simultaneously. The key of MegaSwitch is that it exploits WSS to enable non-blocking space division multiplexing among these parallel fibers.

Sending component: In OWS, for transmission, we use an optical multiplexer (MUX) to join the signals from the EPS onto a fiber, and then use an optical amplifier (OA) to boost the power. Multiple wavelengths from EPS uplink ports (each with an optical transceiver at a unique wavelength) are multiplexed onto a single fiber. The multiplexed signals then travel to all the other nodes via this fiber. The OA is added before the path to compensate


for the optical insertion loss caused by broadcasting, which ensures that the signal strength is within the receiver sensitivity range (the detailed power budgeting design is in §4.1). As shown in Figure 2, any signal is amplified once at its sender, and there are no additional amplification stages in the intermediate or receiver nodes. The signal from a sender also does not loop back; it terminates at the last receiver on the ring (e.g., signals from OWS1 terminate in OWS3, so that no physical loop is formed).

Receiving component: We use a WSS and an optical demultiplexer (DEMUX) at the receiving end. The design highlight is using WSS as a w×1 wavelength multiplexer to intercept all the wavelengths from all the other nodes on the ring. In prior work [6, 11, 38], WSS was used as a 1×w demultiplexer that takes 1 input fiber of k wavelengths and outputs any subset of these k wavelengths to any of w output fibers. In MegaSwitch, WSS is repurposed as a w×1 multiplexer, which takes w input fibers with k wavelengths each, and outputs a non-interfering subset of the k×w wavelengths to an output fiber. With this unconventional use of WSS, MegaSwitch ensures that any node can simultaneously choose any wavelengths from any other nodes, enabling unconstrained connections. Then, based on the demands among nodes, the WSS selects the right wavelengths and multiplexes them on the fiber to the DEMUX. In §3.2, we introduce algorithms to handle wavelength assignments. The multiplexed signals are then de-multiplexed by the DEMUX to the uplink ports on the EPS.

Supporting *-cast [51]: Unicast and multicast can be set up with a single wavelength on MegaSwitch, because any wavelength from a source can be intercepted by every node on the ring. To set up a unicast, MegaSwitch assigns a wavelength on the fiber from the source to the destination, and configures the WSS at the destination to select this wavelength. Consider a unicast from H1 to H6

in Figure 2. We first assign wavelength λ1 from node 1 to node 2 for this connection, and configure the WSS in node 2 to select λ1 from node 1. Then, with the routing in both EPSes configured, the unicast circuit from H1 to H6 is established. Further, to set up a multicast from H1 to H6 and H9, based on the unicast above, we just need to additionally configure the WSS in node 3 to also select λ1 from node 1. In addition, many-to-one (incast) or many-to-many (allcast) communications are composites of many unicasts and/or multicasts, and they are supported using multiple wavelengths.
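The *-cast setup just described can be sketched as simple controller bookkeeping. This is a minimal illustration following the H1→H6/H9 example above; the names `wss_select`, `setup_unicast`, and `add_multicast_leaf` are ours, not the actual controller API:

```python
from collections import defaultdict

# wss_select[node] is the set of (source_node, wavelength) pairs that
# this node's WSS passes through to its DEMUX (and thus to its EPS).
wss_select = defaultdict(set)

def setup_unicast(src, dst, lam):
    # Assign wavelength lam on src's fiber; select it at dst's WSS.
    wss_select[dst].add((src, lam))

def add_multicast_leaf(src, extra_dst, lam):
    # Multicast reuses the same wavelength: the broadcast already reaches
    # every node, so another receiver just selects it too.
    wss_select[extra_dst].add((src, lam))

setup_unicast(1, 2, "λ1")        # H1 -> H6: node 1 sends to node 2 on λ1
add_multicast_leaf(1, 3, "λ1")   # extend to H9: node 3 also selects λ1
print(dict(wss_select))
```

Incast and allcast then reduce to repeated `setup_unicast` calls with distinct wavelengths, mirroring the composition described in the text.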

Scalability: MegaSwitch's port count is n×k.

• For n: with existing technology, it is possible to

achieve a 32×1 WSS (thus n=w+1=33). Furthermore, alternative optical components, such as AWGR and filter arrays, can be used to achieve the same functionality as WSS+DEMUX for MegaSwitch [56]. Since

48×48 or 64×64 AWGR is available, we expect to see a 1.5× or 2× increase, with 49 or 65 nodes on a ring, while the additional splitting loss can be compensated.

• For k: a C-band Dense-WDM (DWDM) link can support k=96 wavelengths at 50GHz channel spacing, and k=192 wavelengths at 25GHz [35]. Recent progress in optical comb sources [30, 53] assures the relative wavelength accuracy among DWDM signals, and is promising to enable a low-power-consumption DWDM transceiver array at 25GHz channel spacing.

As a result, with existing technology, MegaSwitch supports up to n×k=33×192=6,336 ports on a single ring.

MegaSwitch can also be expanded multi-dimensionally. In our OWS implementation (§4), two inter-OWS ports are used in the ring, and we reserve another two for future extension of MegaSwitch in another dimension. When extended bi-dimensionally (into a torus) [55], MegaSwitch scales as n²×k, supporting over 811K ports (n=65, k=192), which should accommodate modern DCNs [42, 46]. However, such a large scale comes with inherent routing, management, and reliability challenges, and we leave it as future work.
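The port counts above follow from simple arithmetic; a trivial check (function names are ours):

```python
# Single ring: n = w + 1 nodes, k wavelengths (ports) per node.
def ring_ports(w, k):
    return (w + 1) * k

# Bi-dimensional (torus) extension scales as n^2 * k.
def torus_ports(n, k):
    return n * n * k

print(ring_ports(32, 192))    # 33 racks x 192 wavelengths = 6336 ports
print(torus_ports(65, 192))   # 65^2 x 192 = 811200, i.e., over 811K ports
```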

MegaSwitch vs. Mordia: Mordia [38] is perhaps the closest related work to MegaSwitch in terms of topology: both are constructed as a ring, and both adopt WSS. However, the key difference is that Mordia chains multiple WSSes on one fiber, where each WSS acts as a wavelength demultiplexer to set up circuits between ports on the same fiber, whereas MegaSwitch reverses the WSS into a wavelength multiplexer across multiple fibers to enable non-blocking connections between ports on different fibers.

We note that Farrington et al. [11] further discussed scaling Mordia with multiple fibers non-blockingly (referred to as Mordia-PTL below). Unlike MegaSwitch's WSS-based multi-fiber ring, they proposed to connect each EPS to multiple fiber rings in parallel, and to rely on the EPS to schedule circuits for servers to support TDM across fibers. This falls beyond the capability of existing commodity switches [24]. Further, on each fiber they allow every node to add signals to the fiber, and require an optical amplifier in each node to boost the signals and compensate for losses (e.g., bandpass add/drop filter loss, variable optical attenuator loss, 10/90 splitter loss, etc.); thus they need multiple amplifiers per fiber. This degrades the optical signal-to-noise ratio (OSNR) and makes transmission error-prone. In Figure 3, we compare OSNR between MegaSwitch and Mordia. We assume 7dBm per-channel launch power, 16dB pre-amplification gain in MegaSwitch, and 23dB boosting gain per node in Mordia. The results show that for MegaSwitch, OSNR maintains at ~36dB and remains constant for all nodes; for Mordia, OSNR quickly degrades to 30dB after 7 hops. For validation, we have also measured the average OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

Figure 3: OSNR comparison (MegaSwitch vs. Mordia multi-hop ring, with the 5-node MegaSwitch prototype measurement and the acceptable SNR of 32dB marked, over 0–15 nodes on the ring).
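The qualitative gap can be illustrated with a toy cascaded-amplifier model. This is an illustration only: it uses the standard rule that inverse OSNRs add across amplification stages, with assumed per-stage OSNR values chosen to roughly match the figure, not the paper's exact link budget:

```python
import math

def db_to_lin(db):
    return 10 ** (db / 10.0)

def lin_to_db(x):
    return 10.0 * math.log10(x)

def cascaded_osnr_db(per_stage_osnr_db):
    # Cascaded amplifier stages: 1/OSNR_total = sum_i 1/OSNR_i.
    inv = sum(1.0 / db_to_lin(o) for o in per_stage_osnr_db)
    return lin_to_db(1.0 / inv)

# MegaSwitch: every signal is amplified exactly once at its sender,
# so per-signal OSNR is a single stage regardless of ring size.
megaswitch = cascaded_osnr_db([36.0])          # assumed single-stage OSNR

# Mordia-PTL-style ring: one amplifier per traversed node, so a signal
# crossing 7 nodes accumulates 7 stages of ASE noise.
mordia_7_hops = cascaded_osnr_db([38.5] * 7)   # assumed per-stage OSNR

print(round(megaswitch, 1), round(mordia_7_hops, 1))
```

With these assumed numbers, 7 cascaded stages cost about 10·log₁₀(7) ≈ 8.5dB of OSNR, which is why the multi-amplifier design degrades with hop count while MegaSwitch stays flat.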

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated at the last node on the ring, thus there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. More importantly, beyond the above, they do not consider fault-tolerance and actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.

3.2 Control Plane

As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure routing in EPSes and wavelength assignments in OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments and pushes them into the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithms, and then design the basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment

In MegaSwitch, each node uses the same k wavelengths to communicate with other nodes. The control plane assigns the wavelengths to satisfy communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the selected wavelengths to a particular receiver share the same output fiber through the w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, which has 3 nodes, each with 4 wavelengths. The demand is in (a); each entry is the number of required wavelengths of a sender-receiver pair. If the wavelengths are assigned as in (b), then two conflicts occur. A non-interfering assignment is shown in (c), where no two identical wavelengths go to the same receiver.
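The receiver-side feasibility condition (no wavelength reaches the same receiver from two senders) is easy to check mechanically. A small sketch; the assignment values are illustrative, not the exact entries of Figure 4:

```python
# An assignment maps each (sender, receiver) pair to a list of wavelength
# indices. It conflicts if any receiver is selected the same wavelength
# from two different senders, since all wavelengths chosen for a receiver
# share that receiver's single WSS output fiber.
def find_conflicts(assignment):
    seen = {}        # (receiver, wavelength) -> first sender using it
    conflicts = []
    for (src, dst), lambdas in assignment.items():
        for lam in lambdas:
            if (dst, lam) in seen:
                conflicts.append((dst, lam, seen[(dst, lam)], src))
            else:
                seen[(dst, lam)] = src
    return conflicts

bad = {(1, 3): [0, 1], (2, 3): [1]}    # λ1 reaches receiver 3 twice: conflict
good = {(1, 3): [0, 1], (2, 3): [2]}   # non-interfering
print(find_conflicts(bad), find_conflicts(good))
```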

Given this constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring

problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph, as in Figure 4. Multiple edges between two nodes correspond to multiple wavelengths needed between them. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges so that no two adjacent edges share the same color, which is exactly the edge coloring problem [9].

Figure 4: Wavelength assignments (n=3, k=4).
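The reduction can be made concrete with the classical alternating-path (Kempe chain) method for bipartite edge coloring. This is a sketch of one standard optimal algorithm, not necessarily the exact algorithm of [7]; senders and receivers are the two vertex sides, colors are wavelength indices:

```python
def assign_wavelengths(demand, k):
    """demand[i][j] = wavelengths node i must send to node j.
    Returns {(i, j): [wavelength indices]}; raises if infeasible."""
    n = len(demand)
    edges = [(('s', i), ('r', j))
             for i in range(n) for j in range(n)
             for _ in range(demand[i][j])]
    # at[v] maps color -> (other endpoint, edge id): a proper coloring
    # uses each color at most once per vertex.
    at = {('s', i): {} for i in range(n)}
    at.update({('r', j): {} for j in range(n)})
    coloring = {}

    def free_color(v):
        for c in range(k):
            if c not in at[v]:
                return c
        raise ValueError("infeasible demand: node degree exceeds k")

    for eid, (u, v) in enumerate(edges):
        a = free_color(u)
        if a in at[v]:
            b = free_color(v)
            swap = {a: b, b: a}
            # Walk the maximal a/b alternating path starting at v; it
            # cannot reach u, so swapping its colors frees a at v.
            path, x, c = [], v, a
            while c in at[x]:
                y, f = at[x][c]
                path.append((x, y, f, c))
                x, c = y, swap[c]
            for x, y, f, c in path:          # remove old colors
                del at[x][c]; del at[y][c]
            for x, y, f, c in path:          # re-add with colors swapped
                nc = swap[c]
                coloring[f] = nc
                at[x][nc] = (y, f); at[y][nc] = (x, f)
        coloring[eid] = a
        at[u][a] = (v, eid); at[v][a] = (u, eid)

    result = {}
    for eid, (u, v) in enumerate(edges):
        result.setdefault((u[1], v[1]), []).append(coloring[eid])
    return result

# A Figure-4-sized instance: 3 nodes, k=4, row/column sums all 4.
print(assign_wavelengths([[0, 2, 2], [2, 0, 2], [2, 2, 0]], 4))
```

Since every row and column sum of a feasible demand is at most k, the bipartite multigraph has maximum degree at most k, and this procedure colors it with at most k colors, matching the non-blocking property stated in Theorem 1.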

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch:

Theorem 1: Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh

As WSS port count increases, MegaSwitch reconfigures at milliseconds (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as the basemesh, which should achieve two goals: 1) the number of wavelengths dedicated to the basemesh, b, should be adjustable⁴; 2) with a given b, guarantee low average latency for all pairs of nodes.

Essentially, the basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on the basemesh is similar to performing a look-up on a DHT network. Also, the routing table size (number of known peer addresses) of each node in DHT is analogous to the number of wavelengths to other nodes, b. Meanwhile, our problem differs from DHT: we assume that the centralized controller

³We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴The basemesh alone can be considered as a b/k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].


calculates routing tables for EPSes, and the routing tables are updated upon adjustments of b and peer churn (infrequent; it only happens upon failures, §3.3).

Figure 5: Average hops and average latency on the basemesh (Symphony), compared with Chord (finger table size = log₂33), as the number of wavelengths in the basemesh varies.

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts (drawn from a family of harmonic probability distributions, p_n(x) = 1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1). The routing table on the EPS of each node is configured with greedy routing that attempts to minimize the absolute distance to the destination at each hop [29].

It is shown that the expected number of hops for this topology is O(log²(n)/b) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size is 4; average path length is O(log(n))), and also plot the expected latency assuming 20µs per-hop latency. Notably, if b≥n−1, the basemesh is fully connected with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.
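The Symphony-style construction and greedy routing described above can be sketched as follows. This is an illustration under our own parameter names, not the controller's code: shortcut distances are drawn from the harmonic density p(x) ∝ 1/x via inverse-CDF sampling, and greedy routing always shrinks the remaining clockwise ring distance:

```python
import math
import random

def build_basemesh(n, b, rng):
    # One ring wavelength to the successor, plus b-1 harmonic shortcuts,
    # for b distinct outgoing links per node.
    links = {i: {(i + 1) % n} for i in range(n)}
    for i in range(n):
        while len(links[i]) < b:
            # Inverse-CDF sample of p(x) ~ 1/(x ln n) over [1, n-1].
            x = int(math.exp(rng.random() * math.log(n - 1)))
            x = min(max(x, 1), n - 1)
            links[i].add((i + x) % n)
    return links

def greedy_route(links, src, dst, n):
    # At each hop, pick the neighbor minimizing remaining clockwise
    # distance; the ring successor guarantees strict progress.
    path, cur = [src], src
    while cur != dst:
        cur = min(links[cur], key=lambda v: (dst - v) % n)
        path.append(cur)
    return path

rng = random.Random(7)
mesh = build_basemesh(33, 5, rng)       # n=33 nodes, b=5 wavelengths
print(greedy_route(mesh, 0, 20, 33))
```

Because the successor link always reduces the clockwise distance by one, greedy routing terminates, and shortcuts only shorten paths; increasing b adds shortcuts and lowers the expected hop count, as in Figure 5.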

As we will show in §5.2, while the basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:

• It increases throughput for rapidly-changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.

• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without the basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance, if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators, and intend to explore the problem of optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: The basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, its construction is based on the

optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, the basemesh can provide consistent connections between all pairs with provable average latency. It is possible to configure the basemesh in ProjecToR as its dedicated topology, to better handle unpredictable latency-critical traffic.

Figure 6: PRF module layout.

3.3 Fault-tolerance

We consider the following failures in MegaSwitch:

OWS failure: We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (Passive Routing Fabric, PRF) handles the signal propagation across adjacent nodes, while the passive splitters copy the traffic to local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be attached to and detached from the active switching modules easily, so one can recover the failed OWS without service interruption of the rest of the network by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified and instructs the OWSes on the ring to do the following: 1) The sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes. 2) The receiving switch for each fiber at the input of the WSS selects the direction that still carries an incoming signal. Connectivity is then restored. A passive 4-port fixed splitter is added before the receiving switch as part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

582    14th USENIX Symposium on Networked Systems Design and Implementation    USENIX Association

Figure 7: MegaSwitch fault-tolerance
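The two recovery steps above can be sketched as a small controller routine. This is an illustrative model only: the `OWS`/`Fiber` classes, method names, and the power-detector stand-in are hypothetical, while the behavior (sending splitters go bidirectional, each receiving switch picks a direction that still carries power) is what the paper specifies.

```python
# Illustrative sketch of cable-cut recovery on the MegaSwitch ring.
# The classes and method names below are hypothetical; only the two
# recovery steps mirror the procedure described in the text.

PRIMARY, SECONDARY, BOTH = "east", "west", "both"

class Fiber:
    def __init__(self):
        self.cut_directions = set()  # directions severed by a cable cut

class OWS:
    def __init__(self, name):
        self.name = name
        self.tx_splitter = PRIMARY   # sending 1x2 tunable splitter setting
        self.rx_switch = {}          # per-fiber receiving 2x1 switch setting

    def signal_present(self, fiber, direction):
        # Stand-in for the built-in power detector reading.
        return direction not in fiber.cut_directions

def recover_from_cut(ring, fibers):
    """Controller reaction to a single detected cable cut."""
    for ows in ring:
        # Step 1: transmit in both directions so signals reach every OWS.
        ows.tx_splitter = BOTH
        # Step 2: each receiving switch selects a direction that still
        # carries an incoming signal.
        for fiber in fibers:
            ows.rx_switch[fiber] = (
                PRIMARY if ows.signal_present(fiber, PRIMARY) else SECONDARY
            )
```

With one cut on the primary side, every node ends up transmitting bidirectionally and receiving from the secondary side, matching the single-cut guarantee in the text.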

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWG) as MUX and DEMUX for the DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmission, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver,

⁶It is implementable with a common Mach-Zehnder interferometer.
⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8 The OWS box we implemented

Figure 9: Power budget on a fiber

the signal strength is ~−9dBm per channel. The PRF configuration is therefore determined by w (WSS radix).

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, and 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to the optical ports on OWSes. With the total broadcasting loss at 10.5dB for the OWS, 4dB for the WSS, 2.5dB for the DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).
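The loss figures above can be checked with back-of-the-envelope link-budget arithmetic. A minimal sketch, assuming a launch power that the paper does not state (the −3dBm below is purely illustrative):

```python
# Back-of-the-envelope link budget for one MegaSwitch channel, using the
# loss figures quoted in Section 4.1. The launch power (tx_dbm) is an
# assumption for illustration -- the paper does not state it.

def receive_power_dbm(tx_dbm, oa_gain_db=12.0,
                      broadcast_loss_db=10.5, wss_loss_db=4.0,
                      demux_loss_db=2.5, connector_loss_db=1.0):
    """Received power = launch power + amplifier gain - sum of losses (all in dB)."""
    total_loss = broadcast_loss_db + wss_loss_db + demux_loss_db + connector_loss_db
    return tx_dbm + oa_gain_db - total_loss

# With an assumed -3 dBm launch, the channel lands at -9 dBm, the upper
# end of the receiving-power range reported for the prototype.
print(receive_power_dbm(-3.0))  # -> -9.0
```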

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control-plane network, and configures the WSS via GPIO pins.
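The control loop just described is small enough to sketch. The wire format (`port:wavelength` pairs), the listening port, and `apply_to_gpio()` are assumptions for illustration; the paper specifies only that assignments arrive in UDP packets and drive the WSS through GPIO.

```python
# Minimal sketch of the OWS control loop: listen for wavelength
# assignments over UDP and translate them into per-port WSS settings.
# The packet format and apply_to_gpio() are hypothetical.
import socket

def parse_assignment(payload: bytes) -> dict:
    """Assumed format: space-separated 'port:wavelength' pairs,
    e.g. b'1:3 2:7' -> {1: 3, 2: 7}."""
    pairs = payload.decode().split()
    return {int(p): int(w) for p, w in (pair.split(":") for pair in pairs)}

def apply_to_gpio(assignment: dict) -> None:
    for port, wavelength in sorted(assignment.items()):
        # On the real OWS this would drive the WSS via GPIO pins.
        print(f"WSS port {port} <- wavelength {wavelength}")

def serve(host="0.0.0.0", port=9000):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            payload, _addr = sock.recvfrom(2048)
            apply_to_gpio(parse_assignment(payload))
```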

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 switches with 48×10GbE ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to the servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40×10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control-plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 583

Figure 10 The MegaSwitch prototype implemented with 5 OWSes

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted on a server also connected to the control-plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted on the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, and WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5μs switching seen by [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we need to confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, at the current speed of our prototype, MegaSwitch can still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic at Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if μs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane

and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control-plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of the results follows:

• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees ~20ms reconfiguration delay, with ~3ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries on Redis, due to the basemesh.

• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40×10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU- or disk-I/O-bound, to stress the network we run the following synthetic traffic used in Helios [12]:

• Node-level Stride (NStride): Number the nodes (EPS) from 0 to n−1. For the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in a new period.



Figure 11: Experiment: Node-level stride

• Host-level Stride (HStride): Number the hosts from 0 to n×k−1. The i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
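The three patterns above are simple to generate programmatically. A hedged sketch, emitting (source, destination) host pairs for one stability period (host numbering follows the text: hosts 0..n×k−1, with host h in node h // k; the function names are ours):

```python
# Sketch of the three Helios-style synthetic traffic patterns described
# above, as (source, destination) host pairs for one stability period.
import random

def nstride(n, k, l):
    """Node-level stride: host j of node i sends to host j of node (i+l) mod n."""
    return [(i * k + j, ((i + l) % n) * k + j) for i in range(n) for j in range(k)]

def hstride(n, k, l):
    """Host-level stride: host i sends to host (i + k + l) mod (n*k)."""
    return [(i, (i + k + l) % (n * k)) for i in range(n * k)]

def random_matching(n, k, seed=None):
    """Random perfect matching between all hosts (pairs of consecutive
    hosts after a seeded shuffle)."""
    hosts = list(range(n * k))
    random.Random(seed).shuffle(hosts)
    return [(hosts[i], hosts[i + 1]) for i in range(0, n * k, 2)]
```

For example, with n=2 nodes of k=2 hosts and l=1, NStride sends all of node 0's traffic to node 1 and vice versa, which is exactly the abrupt full demand shift the text describes.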

For this experiment, the basemesh is disabled. We set the traffic stability to t=10s, and the wavelength assignments are calculated and delivered to the OWSes when the demand between 2 nodes changes. We study the impact of stability t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfigurations at 10s and 20s in the magnified panels (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms⁸, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control-plane delays (~4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸Our transceiver's receiver Loss of Signal De-assert is −22dBm.
⁹A PageRank instance using a 26GB dataset [32].

¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is the Wikipedia Page Traffic Statistics dataset [47].


Figure 12: WSS switching speed


Figure 13: Completion time for Spark jobs


Figure 14: Redis intra-/cross-rack query completion

and WordCount¹¹. We first connect all 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected with full bandwidth. We record the time series (with millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, respectively.

We plot the job completion times in Figure 13. With ~20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable; e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of the two update periods is similar, but the smaller update interval has slightly worse performance due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they traverse only 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40GB dataset with 7.153×10⁹ words.

5.2 Large-Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with a granularity of one millisecond. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
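The max-min fair sharing used by the simulator can be sketched with a standard water-filling formulation: flows repeatedly receive an equal share of the remaining capacity, and flows whose demand is below their share are frozen at their demand. This is our own minimal sketch of the standard technique, not the simulator's published code.

```python
# Water-filling sketch of max-min fair bandwidth sharing among flows
# on one wavelength (a standard formulation; illustrative only).

def max_min_share(capacity, demands):
    """Return the rate allocated to each flow on one wavelength."""
    rates = {}
    remaining = dict(demands)   # flow id -> unsatisfied demand
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        # Flows demanding less than the fair share are satisfied exactly.
        bottlenecked = {f: d for f, d in remaining.items() if d <= share}
        if not bottlenecked:
            rates.update({f: share for f in remaining})
            return rates
        for f, d in bottlenecked.items():
            rates[f] = d
            cap -= d
            del remaining[f]
    return rates

# Three flows on a 10Gbps wavelength: the 2Gbps flow is satisfied, and
# the other two split the remaining 8Gbps evenly.
print(max_min_share(10.0, {"a": 2.0, "b": 9.0, "c": 9.0}))
# -> {'a': 2.0, 'b': 4.0, 'c': 4.0}
```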

We use the synthetic patterns in §5.1.1, as well as realistic traces from production DCNs, in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures:

• Facebook: A Hive/MapReduce trace collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (with the same information as above), as well as HDFS read tasks (sender, receiver, and size). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The sizes of shuffles and reads vary

Figure 15: Throughput vs. traffic stability, (a) with basemesh and (b) without basemesh

Figure 16: Throughput vs. reconfiguration frequency, (a) with basemesh and (b) without basemesh

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Impact of traffic stability: We vary the traffic stability period t from 10ms to 1s for the synthetic patterns, and plot the average throughput with and without the basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see throughput steadily increasing to full bisection bandwidth with a longer stability period. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of the different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without the basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with the basemesh, (a) full bisection bandwidth and (b) normalized throughput

In summary, MegaSwitch provides near full-bisection

bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without the basemesh, respectively. This indicates that our traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that with good load balancing the traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread in the realistic traces: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrivals and departures of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput


Figure 19: Worst-case bisection throughput

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), then demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.
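The random-permutation methodology above can be sketched in a few lines. The per-node-pair wavelength cap below is an illustrative stand-in for the structure-specific constraints of each design (the paper applies each structure's actual constraint set, which is not reproduced here).

```python
# Hedged Monte-Carlo sketch of the evaluation method: draw random
# perfect matchings of ports and measure delivered connections under a
# simple per-node-pair wavelength cap. The cap model is an illustrative
# assumption, not the exact constraint set of any evaluated structure.
import random

def avg_bisection(n_nodes, ports_per_node, waves_per_pair, trials=100, seed=1):
    rng = random.Random(seed)
    # Node id of each port's endpoint.
    src_nodes = [p // ports_per_node for p in range(n_nodes * ports_per_node)]
    total = 0.0
    for _ in range(trials):
        dst_nodes = src_nodes[:]
        rng.shuffle(dst_nodes)           # random permutation pairing of ports
        pair_demand = {}
        for s, d in zip(src_nodes, dst_nodes):
            if s != d:                   # intra-node pairs need no wavelength
                key = (min(s, d), max(s, d))
                pair_demand[key] = pair_demand.get(key, 0) + 1
        # Cap each node pair's delivered connections by available wavelengths.
        total += sum(min(c, waves_per_pair) for c in pair_demand.values())
    return total / trials  # delivered connections, in per-wavelength units
```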

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors, such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost toward that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited to the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with a total of 24³/4=3456 ports and 720 EPSes. We assume that within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses ~2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks, rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of the current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components      OSA          Quartz   Mordia       Mordia-PTL   MegaSwitch   FatTree
Amplifier       0            6144     768          768          32           0
WSS             32 (1×4)     960      768 (1×4)    768 (1×4)    32 (32×1)    0
WDM Filter      0            0        768          768          0            0
OCS             1×128-port   0        1×32-port    0            0            0
Circulator      3072         0        0            0            0            0
Transceivers    3072         3072     3072         3072         3072         6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used

than the non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch can maintain a good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype, as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd Ratul Mahajan for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2010).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 589

[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(∆m) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for Non-blocking Connectivity
We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V, E) as a MegaSwitch logical graph, where V, E are the node/edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u, v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff ∑_v k_(u,v) ≤ k ∀u∈V and ∑_u k_(u,v) ≤ k ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
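To make Definition 2 concrete, the feasibility check can be transcribed directly. This is an illustrative sketch; the matrix representation (demand[u][v] = k_(u,v)) is our assumption, not the paper's controller interface.

```python
# Sketch: checking Definition 2. Representation (our assumption):
# demand[u][v] = k_(u,v), the number of wavelengths requested from u to v.
def is_feasible(demand, k):
    n = len(demand)
    # each node u cannot send more than k wavelengths in total
    if any(sum(demand[u]) > k for u in range(n)):
        return False
    # each node v cannot receive more than k wavelengths in total
    if any(sum(demand[u][v] for u in range(n)) > k for v in range(n)):
        return False
    return True

# 3-node example with k = 2 available wavelengths per node:
demand = [[0, 1, 1],
          [2, 0, 0],
          [0, 1, 0]]
assert is_feasible(demand, k=2)       # all row/column sums are <= 2
assert not is_feasible(demand, k=1)   # node 1 alone wants to send 2
```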

Centralized optimal algorithm: To prove the wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination). Each edge points from a source node s_i to a destination node d_j. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into ∆(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_∆(G′_r)}

1: if ∆(G′_r) ≤ 1 then
2:     return {G′_r}
3: else
4:     m ← Perfect_Matching(G′_r)
5:     return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a ∆(G′)-regular multigraph^16 G′_r by adding dummy edges, where ∆(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree ∆(G′), at least ∆(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, ∆(G′), to satisfy γ, which also proves wavelength efficiency, since ∆(G′) ≤ k.

Theorem 4. A feasible demand γ only needs ∆(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with ∆(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds ∆(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (∆(G′)−1)-regular; we continue to extract matchings one by one until we have ∆(G′) perfect matchings^17. Thus, any demand described by G′ can be satisfied with ∆(G′) wavelengths. Note that ∆(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
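A minimal sketch of this decomposition follows. It is our own illustrative code, not the controller's implementation: Kuhn's augmenting-path algorithm stands in for the Perfect_Matching(·) subroutine (rather than the faster algorithms of [27]), the demand multigraph is a list of (sender, receiver) edges with each edge repeated k_e times, and regularity (dummy-edge padding) is assumed to hold already.

```python
# Sketch of Algorithm 1 (DECOMPOSE) on a d-regular bipartite multigraph.
# Edges are (sender, receiver) pairs; duplicates encode multi-edges (k_e > 1).
from collections import defaultdict

def perfect_matching(edges):
    """Maximum matching via augmenting paths (Kuhn's algorithm); on a
    regular bipartite graph the maximum matching is perfect."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    match = {}  # receiver -> sender

    def try_assign(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match or try_assign(match[v], seen):
                match[v] = u
                return True
        return False

    for u in list(adj):
        try_assign(u, set())
    return [(u, v) for v, u in match.items()]

def decompose(edges):
    """Peel off perfect matchings; each matching gets one wavelength (color)."""
    if not edges:
        return []
    m = perfect_matching(edges)
    remaining = list(edges)
    for e in m:
        remaining.remove(e)  # remove one copy of each matched multi-edge
    return [m] + decompose(remaining)

# A 2-regular demand on 2 senders / 2 receivers needs exactly 2 wavelengths,
# matching Theorem 4 (Delta(G') colors suffice).
edges = [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]
matchings = decompose(edges)
assert len(matchings) == 2
```

Each returned matching is a non-interfering set of circuits: no sender or receiver appears twice within it, so all its edges can share one wavelength.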

Considerations for Basemesh Topology
As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are the representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones:
• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

16: A graph is regular when every vertex has the same degree.
17: Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.
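To illustrate why Symphony makes capacity adjustment easy, here is a minimal sketch (our own construction, not the paper's implementation) of Symphony-style long-link selection: each wavelength added to the basemesh contributes one long link whose ring distance follows Symphony's harmonic distribution, sampled by inverse CDF as d = n^r with r uniform in [0, 1). Each link is drawn independently, so adding or removing one does not affect the others.

```python
# Sketch: Symphony-style long links for the basemesh. Assumption (ours):
# node positions are ring offsets 0..n-1, and each extra wavelength adds
# one long link of harmonic-distributed length.
import random

def symphony_links(n, num_links, seed=0):
    rng = random.Random(seed)
    links = []
    for _ in range(num_links):
        # inverse-CDF sample of p(x) ~ 1/(x ln n): d = n^r, r uniform in [0,1)
        d = max(1, int(n ** rng.random()))
        links.append(d)
    return links

links = symphony_links(n=33, num_links=4)
assert all(1 <= d < 33 for d in links)  # every link stays on the ring
```

Growing the basemesh by one wavelength is just one more independent draw; shrinking it removes one link, leaving the rest untouched, which is the adjustability property the text contrasts with Viceroy.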

Reconfiguration Latency Measurements on the MegaSwitch Prototype
In §5.1.1, we measured a ~20 ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we can calculate the delay: the average and 99th percentile are 5.12 ms and 12.87 ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another port. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23 ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8 GHz CPU, we measured an average computation time of 0.53 ms for a 40-port (40×40 demand matrix as input) and 3.28 ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45 ms for 40 ports and 1.78 ms for 6336 ports. For the OWS controller processing, we measured 4.31 ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078 ms.
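As a back-of-the-envelope check of the breakdown above, the pieces can be combined with t_r = max(t_e, t_o) + t_c for the 40-port case. How t_c aggregates the individual control-plane measurements is our assumption for illustration.

```python
# Back-of-the-envelope: t_r = max(t_e, t_o) + t_c with the 40-port numbers.
t_e = 5.12                           # average EPS (OpenFlow) configuration, ms
t_o = 3.23                           # WSS switching, ms
t_c = 0.53 + 0.45 + 4.31 + 0.078     # wavelength assignment + basemesh tables
                                     # + OWS GPIO + control-network RTT, ms
t_r = max(t_e, t_o) + t_c
assert round(t_r, 3) == 10.488       # average-case estimate, ms
```

This average-case estimate is below the ~20 ms measured end-to-end; with the 99th-percentile EPS delay of 12.87 ms, max(12.87, 3.23) + 5.37 ≈ 18.2 ms, which is much closer to the measured figure.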



Abstract

Existing wired optical interconnects face a challenge in supporting wide-spread communications in production clusters. Initial proposals are constrained, as they only support connections among a small number of racks (e.g., 2 or 4) at a time, with switching times of milliseconds. Recent efforts on reducing optical circuit reconfiguration time to microseconds partially mitigate this problem by rapidly time-sharing optical circuits across more nodes, but are still limited by the total number of parallel circuits available simultaneously.

In this paper, we seek an optical interconnect that can enable unconstrained communications within a computing cluster of thousands of servers. We present MegaSwitch, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6000+ servers. We have implemented a 5-rack 40-server MegaSwitch prototype with commercial optical devices, and used testbed experiments as well as large-scale simulations to explore MegaSwitch's architectural benefits and tradeoffs.

1 Introduction
Due to its advantages over traditional electrical networks at high link speeds, optical circuit switching technology has been widely investigated for datacenter networks (DCNs) to reduce cost, power consumption, and wiring complexity [38]. Many of the initial optical DCN proposals build on the assumption of substantial traffic "concentration". Using optical circuit switches that only support one-to-one connections between ports, they service highly concentrated traffic among a small number of racks at a time (e.g., 2 or 4 [6, 50]). However, evolving traffic characteristics in production DCNs pose new challenges:

• Wide-spread communications: Recent analysis of Facebook cluster traffic reveals that servers can concurrently communicate with hundreds of hosts, with the majority of traffic going to tens of racks [42], and traces from Google [46] and Microsoft [15, 17] also exhibit such high fan-in/out, wide-spread communications. The driving forces behind such wide-spread patterns are multifold, such as data shuffling, load balancing, data spreading for fault-tolerance, etc. [42, 46].

Figure 1: High-level view of a 4-node MegaSwitch.

• High bandwidth demands: Due to ever-larger parallel data processing, remote storage access, web services, etc., traffic in Google's DCN has increased by 50x in the last 6 years, doubling every year [46].

Under such conditions, early optical DCN solutions fall short due to their constrained communication support [6, 7, 12, 50]. Recent efforts [24, 38] have proposed a temporal approach: by reducing the circuit switching time from milliseconds to microseconds, they are able to rapidly time-share optical circuits across more nodes in a shorter time. While such a temporal approach mitigates the problem, it is still insufficient for wide-spread communications, as it is limited by the total number of parallel circuits available at the same time [38].^1

We take a spatial approach to address the challenge. To support wide-spread communications, we seek a solution that can directly deliver parallel circuits to many nodes simultaneously. In this paper, we present an optical DCN interconnect for thousands of servers, called MegaSwitch, that enables unconstrained communications among all the servers.

At a high level (Figure 1), MegaSwitch is a circuit-switched backplane physically composed of multiple optical wavelength switches (OWS, a switch we implemented, §4) connected in a multi-fiber ring. Each OWS is attached to an electrical packet switch (EPS), forming a MegaSwitch node. The OWS acts as an access point for the EPS to the ring. Each sender uses a dedicated fiber to send traffic to any other node on this multi-fiber ring; each receiver can receive traffic from all the senders (one per fiber) simultaneously. Essentially, MegaSwitch establishes a one-hop circuit between any pair of EPSes, forming a rearrangeably non-blocking circuit mesh over the physical multi-fiber ring.

1: Another temporal approach [15] employs wireless free-space optics (FSO) technology. However, FSO is not yet production-ready for DCNs, because dust and vibration, which are common in DCNs, can impair FSO link stability [15].

Specifically, MegaSwitch achieves unconstrained connections by re-purposing a wavelength selective switch (WSS, a key component of OWS) as a receiving w×1 multiplexer to enable non-blocking space division multiplexing among w fibers (§3.1). Prior work [6, 11, 38] used the WSS as a 1×w demultiplexer that takes 1 input fiber with k wavelengths and outputs any subset of the k wavelengths to w output fibers. In contrast, MegaSwitch reverses the use of the WSS. Each node uses k wavelengths^2 on a fiber for sending; for receiving, each node leverages the WSS to intercept all k×w wavelengths from the w other senders (one per fiber) via its w input ports. Then, with a wavelength assignment algorithm (§3.2), the WSS can always select, out of the k×w wavelengths, a non-interfering set to satisfy any communication demands among nodes.

As a result, MegaSwitch delivers rearrangeably non-blocking communications to n×k ports, where n=w+1 is the number of nodes/racks on the ring and k is the number of ports per node. With current technology, w can be up to 32 and k up to 192; thus MegaSwitch can support up to 33 racks and 6336 hosts. In the future, MegaSwitch can scale beyond 10^5 ports with AWGR (arrayed waveguide grating router) technology and multi-dimensional expansion (§3.1).

On top of its unconstrained connections, MegaSwitch further has built-in fault-tolerance to handle various failures, such as OWS, cable, and link failures. We develop the necessary redundancy and mechanisms so that MegaSwitch can provide reliable services (§3.3).

We have implemented a small-scale MegaSwitch prototype (§4) with 5 nodes, where each node has 8 optical ports transmitting on 8 unique wavelengths within 190.5 THz to 193.5 THz at 200 GHz channel spacing. This prototype enables arbitrary communications among 5 racks and 40 hosts. Furthermore, our OWSes are designed and implemented with all commercially available optical devices, and the EPSes we used are Broadcom Pronto-3922 10G commodity switches.

MegaSwitch's reconfiguration delay hinges on the WSS, and an 11.5 μs WSS switching time has been reported using digital light processing (DLP) technology [38]. However, our implementation experience reveals that as the WSS port count increases (for wide-spread communications), such low latency can no longer be maintained. This is mainly because the port count of the DLP-based WSS used in [38] is not scalable. In our prototype, the WSS reconfiguration takes ~3 ms. We believe this is a hard limitation we have to confront in order to scale. To accommodate unstable traffic and latency-critical applications, we develop a "basemesh" on MegaSwitch to ensure that any two nodes are always connected via certain wavelengths during reconfigurations (§3.2.2). Specifically, we construct the basemesh with the well-known Symphony [29] topology from the distributed hash table (DHT) literature, to provide low average latency with adjustable capacity.

2: Each unique wavelength corresponds to an optical port on the OWS, as well as an optical transceiver on the EPS.

Over the prototype testbed, we conducted basic measurements of throughput and latency, and deployed real applications, e.g., Spark [54] and Redis [43], to evaluate the performance of MegaSwitch (§5.1). Our experiments show that the latency-sensitive Redis experiences uniformly low latency for cross-rack queries due to the basemesh, and that the performance of Spark applications is similar to that of an optimal scenario where all servers are connected to one single switch.

To complement the testbed experiments, we performed large-scale simulations to study the impact of traffic stability on MegaSwitch (§5.2). For synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic of all patterns, but does not sustain high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, it can improve throughput for unstable wide-spread traffic through the basemesh. For real production traces, MegaSwitch achieves 93.21% of the throughput of an ideal non-blocking fabric. We also find that our basemesh effectively handles highly unstable wide-spread traffic, and contributes up to a 56.14% throughput improvement.

2 Background and Motivation
In this section, we first highlight the trend of high-demand, wide-spread traffic in DCNs. Then, we discuss why prior solutions are insufficient to support this trend.

2.1 The Trend of DCN Traffic
We use two case studies to show the traffic trend:
• User-facing web applications: In response to user requests, web servers push/pull contents to/from cache servers (e.g., Redis [43]). Web servers and cache servers are usually deployed in different racks, leading to intensive inter-rack traffic [42]. Cache objects are replicated in different clusters for fault-tolerance and load balancing, and web servers access these objects randomly, creating a wide-spread communication pattern overall.

• Data-parallel processing: Traffic of MapReduce-type applications [3, 21, 42] is shown to be heavily cluster-local (e.g., [42] reported that 13.3% of the traffic is rack-local, but 80.9% is cluster-local). Data is spread across many racks due to rack awareness [19], which requires copies of data to be stored in different racks in case of rack failures, as well as cluster-level balancing [45]. In the shuffle and output phases of MapReduce jobs, wide-spread many-to-many traffic emerges among all participating servers/racks within the cluster.

Bandwidth demand is also surging, as the data stored

and processed in DCNs continue to grow, due to the explosion of user-generated contents such as photo/video updates and sensor measurements [46]. The implication is twofold: 1) data processing applications require larger bandwidth for data exchange, both within an application (e.g., in multi-stage machine learning, workers exchange data between stages) and between applications (e.g., related applications like web search and ad recommendation share data); 2) front-end servers need more bandwidth to fetch the ever-larger contents from cache servers to generate webpages [5]. In conclusion, optical DCN interconnects must support wide-spread, high-bandwidth demands among many racks.

2.2 Prior Proposals Are Insufficient
Prior proposals fall short in supporting the above trend of high-bandwidth, wide-spread communications among many nodes simultaneously.
• c-Through [50] and Helios [12] are seminal works that introduce wavelength division multiplexing (WDM) and optical circuit switches (OCS) into DCNs. However, they suffer from constrained rack-to-rack connectivity. At any time, high-capacity optical circuits are limited to a couple of racks (e.g., 2 [50]). Reconfiguring circuits to serve communications among more racks can incur ~10 ms circuit visit delay [12, 50].

• OSA [6] and WaveCube [7] enable arbitrary rack-to-rack connectivity via multi-hop routing over multiple circuits and EPSes. While solving the connectivity issue, they introduce severe bandwidth constraints between racks, because traffic needs to traverse multiple EPSes to its destination, introducing traffic overhead and routing complexity at each transit EPS.

• Quartz [26] is an optical fiber ring that is designed as a design element to provide low latency in portions of traditional DCNs. It avoids the circuit switching delay by using fixed wavelengths, but its bisection bandwidth is also limited.

• Mordia [38] and REACToR [24] improve optical circuit switching in the temporal direction by reducing circuit switching time from milliseconds to microseconds. This approach is effective in quickly time-sharing circuits across more nodes, but still insufficient to support wide-spread communications, due to the limited number of parallel circuits available simultaneously. Furthermore, Mordia is by default a single-fiber ring structure with small port density (limited by the number of unique wavelengths supported on a fiber). Naïvely stacking multiple fibers via ring-selection circuits, as suggested by [38], is blocking among hosts on different fibers, leading to very degraded bisection bandwidth (§5.2), whereas time-sharing multiple fibers via the EPS, as suggested by [11], requires complex EPS functions unavailable in existing commodity EPSes and introduces additional optical design complexities, making it hard to implement in practice (more details in §3.1).

Figure 2: Example of a 3-node MegaSwitch.

• Free-space optics proposals [15, 18] eliminate wiring. Firefly [18] leverages free-space optics to build a wireless DCN fabric with an improved cost-performance tradeoff. With low switching time and high scalability, ProjecToR [15] connects an entire DCN of tens of thousands of machines. However, they face practical challenges in real DCN environments (e.g., dust and vibration). In contrast, we implement a wired optical interconnect for thousands of machines, which is common in DCNs, and deploy real applications on it. We note that the mechanisms we developed for MegaSwitch can be employed in ProjecToR, notably the basemesh for low-latency applications.

3 MegaSwitch Design
In this section, we describe the design of MegaSwitch's data and control planes, as well as its fault-tolerance.

3.1 Data Plane
As shown in Figure 2, MegaSwitch connects multiple OWSes in a multi-fiber ring. Each fiber has only one sender. The sender broadcasts traffic on this fiber, which reaches all other nodes in one hop. Each receiver can receive traffic from all the senders simultaneously. The key of MegaSwitch is that it exploits the WSS to enable non-blocking space division multiplexing among these parallel fibers.

Sending component: In the OWS, for transmission, we use an optical multiplexer (MUX) to join the signals from the EPS onto a fiber, and then use an optical amplifier (OA) to boost the power. Multiple wavelengths from EPS uplink ports (each with an optical transceiver at a unique wavelength) are multiplexed onto a single fiber. The multiplexed signals then travel to all the other nodes via this fiber. The OA is added before the path to compensate for the optical insertion loss caused by broadcasting, which ensures the signal strength is within the receiver sensitivity range (the detailed power budgeting design is in §4.1). As shown in Figure 2, any signal is amplified once at its sender, and there are no additional amplification stages in the intermediate or receiver nodes. The signal from a sender also does not loop back, and terminates at the last receiver on the ring (e.g., signals from OWS1 terminate at OWS3, so that no physical loop is formed).

Receiving component: We use a WSS and an optical demultiplexer (DEMUX) at the receiving end. The design highlight is using the WSS as a w×1 wavelength multiplexer to intercept all the wavelengths from all the other nodes on the ring. In prior work [6, 11, 38], the WSS was used as a 1×w demultiplexer that takes 1 input fiber of k wavelengths and outputs any subset of these k wavelengths to any of w output fibers. In MegaSwitch, the WSS is repurposed as a w×1 multiplexer, which takes w input fibers with k wavelengths each, and outputs a non-interfering subset of the k×w wavelengths to an output fiber. With this unconventional use of the WSS, MegaSwitch ensures that any node can simultaneously choose any wavelengths from any other nodes, enabling unconstrained connections. Then, based on the demands among nodes, the WSS selects the right wavelengths and multiplexes them on the fiber to the DEMUX. In §3.2, we introduce algorithms to handle wavelength assignments. The multiplexed signals are then de-multiplexed by the DEMUX to the uplink ports on the EPS.

Supporting *-cast [51]: Unicast and multicast can be set up with a single wavelength on MegaSwitch, because any wavelength from a source can be intercepted by every node on the ring. To set up a unicast, MegaSwitch assigns a wavelength on the fiber from the source to the destination, and configures the WSS at the destination to select this wavelength. Consider a unicast from H1 to H6 in Figure 2. We first assign wavelength λ1 from node 1 to node 2 for this connection, and configure the WSS in node 2 to select λ1 from node 1. Then, with the routing in both EPSes configured, the unicast circuit from H1 to H6 is established. Further, to set up a multicast from H1 to H6 and H9 based on the unicast above, we just need to additionally configure the WSS in node 3 to also select λ1 from node 1. In addition, many-to-one (incast) or many-to-many (allcast) communications are composites of many unicasts and/or multicasts, and they are supported using multiple wavelengths.
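The *-cast setup just described can be modeled with a toy data structure (the class and names are ours for illustration, not MegaSwitch's control-plane API): each receiving node's WSS holds a set of (source fiber, wavelength) selections, and a multicast is simply the same selection made at several receivers.

```python
# Toy model of *-cast setup on the ring (illustrative only).
class Node:
    def __init__(self, name):
        self.name = name
        self.wss_selection = set()   # (source_node, wavelength) pairs

    def select(self, source, wavelength):
        # non-interference at the WSS output: a receiving node must not
        # select the same wavelength from two different sources
        assert all(w != wavelength for _, w in self.wss_selection)
        self.wss_selection.add((source, wavelength))

nodes = {i: Node(i) for i in (1, 2, 3)}
# Unicast H1 -> H6: wavelength lambda_1 on node 1's fiber, selected at node 2.
nodes[2].select(source=1, wavelength='lambda_1')
# Multicast H1 -> {H6, H9}: node 3 additionally selects the same wavelength.
nodes[3].select(source=1, wavelength='lambda_1')
assert (1, 'lambda_1') in nodes[2].wss_selection
assert (1, 'lambda_1') in nodes[3].wss_selection
```

Note that multicast costs nothing extra on the sending side: the signal is already broadcast on the source fiber, so adding a receiver is purely a receiving-WSS configuration.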

Scalability: MegaSwitch's port count is n×k.
• For n: with existing technology, it is possible to achieve a 32×1 WSS (thus n=w+1=33). Furthermore, alternative optical components such as AWGRs and filter arrays can be used to achieve the same functionality as WSS+DEMUX for MegaSwitch [56]. Since 48×48 or 64×64 AWGRs are available, we expect to see a 1.5× or 2× increase, with 49 or 65 nodes on a ring, while the additional splitting loss can be compensated.

• For k: a C-band Dense-WDM (DWDM) link can support k=96 wavelengths at 50 GHz channel spacing, and k=192 wavelengths at 25 GHz [35]. Recent progress in optical comb sources [30, 53] assures the relative wavelength accuracy among DWDM signals, and is promising to enable a low-power-consumption DWDM transceiver array at 25 GHz channel spacing.

As a result, with existing technology, MegaSwitch supports up to n×k=33×192=6336 ports on a single ring.

MegaSwitch can also be expanded multi-dimensionally. In our OWS implementation (§4), two inter-OWS ports are used in the ring, and we reserve another two for future extension of MegaSwitch in another dimension. When extended bi-dimensionally (into a torus) [55], MegaSwitch scales as n²×k, supporting over 811K ports (n=65, k=192), which should accommodate modern DCNs [42, 46]. However, such a large scale comes with inherent routing, management, and reliability challenges, and we leave it as future work.
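The port-count arithmetic in this subsection can be checked directly:

```python
# Port-count arithmetic from the text.
w, k = 32, 192
n = w + 1                      # 32x1 WSS -> 33 nodes per ring
assert n * k == 6336           # single-ring MegaSwitch
assert 65**2 * k == 811_200    # 2-D torus with 64x64 AWGR (n=65): "over 811K"
```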

MegaSwitch vs. Mordia: Mordia [38] is perhaps the closest related work to MegaSwitch in terms of topology: both are constructed as a ring, and both adopt the WSS. However, the key difference is that Mordia chains multiple WSSes on one fiber, where each WSS acts as a wavelength demultiplexer to set up circuits between ports on the same fiber, whereas MegaSwitch reverses the WSS into a wavelength multiplexer across multiple fibers, to enable non-blocking connections between ports on different fibers.

We note that Farrington et al. [11] further discussed scaling Mordia with multiple fibers non-blockingly (referred to as Mordia-PTL below). Unlike MegaSwitch's WSS-based multi-fiber ring, they proposed to connect each EPS to multiple fiber rings in parallel, and to rely on the EPS to schedule circuits for servers to support TDM across fibers. This falls beyond the capability of existing commodity switches [24]. Further, on each fiber, they allow every node to add signals to the fiber, and require an optical amplifier in each node to boost the signals and compensate losses (e.g., bandpass add/drop filter loss, variable optical attenuator loss, 10/90 splitter loss, etc.); thus they need multiple amplifiers per fiber. This degrades the optical signal-to-noise ratio (OSNR) and makes transmission error-prone. In Figure 3, we compare OSNR between MegaSwitch and Mordia. We assume 7dBm per-channel launch power, 16dB pre-amplification gain in MegaSwitch, and 23dB boosting gain per node in Mordia. The results show that, for MegaSwitch, OSNR maintains at ~36dB and remains constant for all nodes; for Mordia, OSNR quickly degrades to 30dB after 7 hops. For validation, we have also measured the average OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

580 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Figure 3: OSNR comparison of MegaSwitch and Mordia (multi-hop ring), with the 5-node MegaSwitch prototype measurement and the acceptable SNR threshold (32dB); x-axis: nodes on ring (0 to 15); y-axis: OSNR (20dB to 40dB).

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated at the last node in the ring, thus there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. More importantly, besides the above, they do not consider fault-tolerance and actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.
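The OSNR gap follows from amplifier cascading: each amplification stage adds ASE noise, so OSNR falls roughly with 10·log10 of the stage count. A rough sketch using the standard cascaded-amplifier estimate (the 5.5dB noise figure is our assumption, and the constant offset is not calibrated to the paper's exact link budget, so only the trend is meaningful):

```python
import math

def osnr_db(p_ch_dbm, nf_db, stages):
    """Cascaded-amplifier OSNR estimate in a 0.1nm reference bandwidth:
    OSNR ~= 58 + P_ch - NF - 10*log10(N) for N identical stages."""
    return 58 + p_ch_dbm - nf_db - 10 * math.log10(stages)

# MegaSwitch: a single pre-amplification stage, independent of hop count.
megaswitch = [osnr_db(7, 5.5, 1) for hops in range(1, 16)]
# Mordia-PTL-style ring: one boosting amplifier per traversed node.
mordia = [osnr_db(7, 5.5, hops) for hops in range(1, 16)]
# After 7 hops this loses 10*log10(7) ~= 8.5dB, matching the ~8dB drop in Figure 3.
```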

3.2 Control Plane

As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure the routing in EPSes and the wavelength assignments in OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments, and pushes them into the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithms, and then design basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment

In MegaSwitch, each node uses the same k wavelengths to communicate with the other nodes. The control plane assigns the wavelengths to satisfy communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the selected wavelengths to a particular receiver share the same output fiber through the w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, which has 3 nodes, each with 4 wavelengths. The demand is in (a): each entry is the number of required wavelengths for a sender-receiver pair. If the wavelengths are assigned as in (b), then two conflicts occur. A non-interfering assignment is shown in (c), where no two identical wavelengths go to the same receiver.

Given the constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph, as in Figure 4: multiple edges between two nodes correspond to the multiple wavelengths needed by them. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges so that no two adjacent edges share the same color, which is exactly the edge coloring problem [9].

Figure 4: Wavelength assignments (n=3, k=4).
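The coloring step can be sketched as follows: a minimal illustration of the reduction, using the classic alternating-path edge-coloring algorithm for bipartite multigraphs (not necessarily the exact algorithm of [7]):

```python
def assign_wavelengths(demand):
    """demand[s][r] = wavelengths needed from sender s to receiver r.
    Returns the demand edges and one color (wavelength index) per edge,
    using at most max-degree colors; for a feasible demand that is at
    most k (Theorem 1). No two edges at a vertex share a color."""
    n = len(demand)
    edges = [(('S', s), ('R', r)) for s in range(n) for r in range(n)
             for _ in range(demand[s][r])]
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    num_colors = max(deg.values())
    color = [None] * len(edges)
    at = {u: {} for u in deg}              # at[vertex][color] -> edge index
    free = lambda u: next(c for c in range(num_colors) if c not in at[u])
    for i, (u, v) in enumerate(edges):
        a, b = free(u), free(v)            # a free at sender, b free at receiver
        if a != b:
            # Flip the a/b alternating path starting at v, freeing color a at v.
            # In a bipartite graph this path can never reach u.
            path, x, c = [], v, a
            while c in at[x]:
                e = at[x][c]
                path.append(e)
                p, q = edges[e]
                x = q if p == x else p
                c = b if c == a else a
            for e in path:                 # recolor path edges a <-> b
                for end in edges[e]:
                    del at[end][color[e]]
            for e in path:
                color[e] = b if color[e] == a else a
                for end in edges[e]:
                    at[end][color[e]] = e
        color[i] = a
        at[u][a] = i
        at[v][a] = i
    return edges, color
```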

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch:

Theorem 1. Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh

As WSS port count increases, MegaSwitch reconfigures at milliseconds (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as basemesh, which should achieve two goals: 1) the number of wavelengths dedicated to basemesh, b, should be adjustable⁴; 2) with a given b, guarantee low average latency for all pairs of nodes.

Essentially, basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on basemesh is similar to performing a look-up on a DHT network. Also, the routing table size (number of known peer addresses) of each node in DHT is analogous to the number of wavelengths to other nodes, b. Meanwhile, our problem differs from DHT: we assume the centralized controller

³We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴The basemesh alone can be considered as a b/k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].



Figure 5: Average latency on basemesh (average hops and average latency vs. the number of wavelengths in basemesh, compared with Chord).

calculates routing tables for the EPSes, and the routing tables are updated upon adjustments of b and upon peer churn (infrequent; it only happens in failures, §3.3).

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose the shortcuts (drawn from a family of harmonic probability distributions, pn(x)=1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1). The routing table on the EPS of each node is configured with greedy routing, which attempts to minimize the absolute distance to the destination at each hop [29].
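The shortcut selection and greedy routing can be sketched as follows (our illustration of the Symphony construction; inverse-transform sampling of pn(x)=1/(x ln n) on [1/n, 1] gives x = n^(u−1) for uniform u):

```python
import random

def basemesh_links(n, b, seed=0):
    """Per-node link distances on an n-node directed ring: distance 1 (the
    ring wavelength) plus b-1 Symphony shortcuts drawn from the harmonic
    density p_n(x) = 1/(x ln n) via x = n**(u-1). Assumes b <= n-1."""
    rng = random.Random(seed)
    links = {}
    for node in range(n):
        dists = {1}
        while len(dists) < b:
            d = round(n ** (rng.random() - 1) * n)   # ring distance in [1, n]
            if 1 <= d <= n - 1:
                dists.add(d)
        links[node] = sorted(dists)
    return links

def greedy_hops(src, dst, n, links):
    """Greedy routing: each hop takes the largest link distance that does
    not overshoot the remaining clockwise distance to dst."""
    hops, cur = 0, src
    while cur != dst:
        remaining = (dst - cur) % n
        step = max(d for d in links[cur] if d <= remaining)
        cur = (cur + step) % n
        hops += 1
    return hops
```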

It is shown that the expected number of hops in this topology is O(log²(n)/b) for each packet [29]. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size is 4; average path length is O(log(n))), and also plot the expected latency, assuming 20µs per-hop latency. Notably, if b≥n−1, basemesh is fully connected, with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.

As we will show in §5.2, while basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:
• It increases throughput for rapidly-changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.
• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance, if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators, and intend to explore the problem of optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: Basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, its construction is based on the

Figure 6: PRF module layout.

optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, basemesh can provide consistent connections between all pairs, with provable average latency. It is possible to configure basemesh in ProjecToR as its dedicated topology, to better handle unpredictable latency-critical traffic.

3.3 Fault-tolerance

We consider the following failures in MegaSwitch:

OWS failure: We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (Passive Routing Fabric, PRF) handles the signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be attached to and detached from the active switching modules easily, so one can recover the failed OWS, without service interruption of the rest of the network, by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry the traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) The sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes. 2) The receiving switch for each fiber, at the input of the WSS, selects the direction where there is still incoming signal. Then the connectivity can be restored. A passive 4-port fixed splitter is added before the receiving switch as a part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype, thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available, and can be readily incorporated into future iterations of MegaSwitch.

Figure 7: MegaSwitch fault-tolerance.

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWG) as MUX and DEMUX for the DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmitting, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver,

⁶It is implementable with a common Mach-Zehnder interferometer.

⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented.

Figure 9: Power budget on a fiber.

the signal strength is ~−9dBm per channel. The PRF configuration is therefore determined by w (the WSS radix).
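The 1:(i−1) ratios are what equalize the drop power across receivers: if the boosted signal meets the splitters in the order i = w, w−1, ..., 2, every drop port receives exactly 1/w of the boosted power. A lossless-splitter sketch (our illustration; the real budget in Figure 9 also subtracts the WSS, DEMUX, and connector losses listed in §4.1):

```python
import math

def tap_powers_dbm(boosted_dbm, num_nodes):
    """Per-drop power when traversing drop-continue splitters with
    drop:continue ratio 1:(i-1), visited in order i = num_nodes, ..., 2;
    the final node terminates the fiber and takes the remainder."""
    p_mw = 10 ** (boosted_dbm / 10)    # dBm -> mW
    taps = []
    for i in range(num_nodes, 1, -1):
        taps.append(p_mw / i)          # drop fraction 1/i
        p_mw *= (i - 1) / i            # continue fraction (i-1)/i
    taps.append(p_mw)                  # last node takes what remains
    return [10 * math.log10(t) for t in taps]
```

With lossless splitters every tap lands at boosted_dbm − 10·log10(num_nodes), which is why a single boosting stage per fiber suffices.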

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for the 5-node ring.

Our OWS box provides 16 optical ports, and 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to the optical ports on OWSes. With the total broadcasting loss at 10.5dB for OWS, 4dB for WSS, 2.5dB for DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives the wavelength assignments (in UDP packets) via its Ethernet management interface, connected to a separate electrical control plane network, and configures the WSS via GPIO pins.

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 with 48×10Gbps Ethernet (GbE) ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths, to connect to 3 OWSes, and the remaining 24 ports are connected to the servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40×10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management


Figure 10: The MegaSwitch prototype implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted in a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode, and are managed by a Ryu controller [34] hosted in the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on WSS, and the WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation, we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5µs switching seen by [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we choose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we need to confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, with the current speed on our prototype, MegaSwitch is still able to satisfy the needs of some production DCNs: a recent study [42] of DCN traffic in Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of basemesh.

A summary of the results is as follows:
• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees ~20ms reconfiguration delay, with ~3ms for WSS switching.
• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries for Redis, due to the basemesh.
• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable, wide-spread traffic.
• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40×10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU or disk I/O bound, to stress the network we run the following synthetic traffic used in Helios [12]:
• Node-level Stride (NStride): Numbering the nodes (EPS) from 0 to n−1, for the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in a new period.


Figure 11: Experiment: Node-level stride. (a) Bisection throughput over 30s; (b), (c) magnified views around the reconfigurations (y-axis: Mbps, up to 4×10⁵).

• Host-level Stride (HStride): Numbering the hosts from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases gradual demand shifts between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
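The three patterns can be generated as follows (a sketch; the helper names are ours):

```python
import random

def nstride(n, hosts_per_node, l):
    """Node-level stride: host j of node i sends to host j of node (i+l) mod n."""
    return [((i, j), ((i + l) % n, j))
            for i in range(n) for j in range(hosts_per_node)]

def hstride(n, k, l):
    """Host-level stride over flat host ids 0..n*k-1."""
    total = n * k
    return [(h, (h + k + l) % total) for h in range(total)]

def random_matching(total, seed=0):
    """Random perfect matching: a uniformly random permutation pairing each
    sender with exactly one receiver (self-pairs possible; a derangement
    could be enforced if needed)."""
    perm = list(range(total))
    random.Random(seed).shuffle(perm)
    return list(enumerate(perm))
```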

For this experiment, the basemesh is disabled. We set the traffic stability t=10s, and the wavelength assignments are calculated and delivered to the OWSes when the demand between 2 nodes changes. We study the impact of stability t later in §5.2.

As shown in Figure 11 (a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node; HStride's performance is similar to Figure 11 (a), but drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfiguration at 10s and 20s, in the magnified (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms⁸, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control plane delays (~4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25, and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸Our transceiver's receiver Loss of Signal Deassert is −22dBm.

⁹A PageRank instance using a 26GB dataset [32].

¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].

Figure 12: WSS switching speed (optical power over an 8ms window as switching turns on and off).

Figure 13: Completion times for Spark jobs (WikiPageRank, WordCount, K-Means) on a single ToR versus MegaSwitch with 0.1s and 1s update intervals.

Figure 14: Redis intra-/cross-rack query completion times from Nodes 1-5, for b=4, 3, 2, 1.

and WordCount¹¹. We first connect all the 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected with full bandwidth. We record the time series (with millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices, at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.

We plot the job completion times in Figure 13. With ~20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average, 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable; e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of the two update periods is similar, but the smaller update interval has slightly worse performance, due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than that with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node, from the servers in all 5 nodes, with a total number of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40GB dataset with 7.153×10⁹ words.

5.2 Large Scale Simulations

Existing packet-level simulators, such as ns-2, are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks, with millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period, and runs the wavelength assignment algorithm before the change; therefore, it configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, basemesh is configured with b=32.
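The simulator's bandwidth sharing can be sketched with the standard progressive-filling computation of max-min fair rates (our illustration of the stated policy, not the authors' simulator code):

```python
def max_min_rates(capacity, flow_links):
    """capacity: link id -> capacity; flow_links: per-flow set of link ids.
    Grows all unfrozen flows' rates uniformly, freezing the flows on a link
    whenever that link saturates (progressive filling)."""
    n = len(flow_links)
    rate = [0.0] * n
    frozen = [False] * n
    cap = dict(capacity)
    while not all(frozen):
        users = {l: [f for f in range(n) if not frozen[f] and l in flow_links[f]]
                 for l in cap}
        fair = {l: cap[l] / len(fs) for l, fs in users.items() if fs}
        if not fair:
            break                      # remaining flows cross no known link
        l_star = min(fair, key=fair.get)
        inc = fair[l_star]             # uniform increment until l_star saturates
        for f in range(n):
            if not frozen[f]:
                rate[f] += inc
        for l, fs in users.items():
            cap[l] -= inc * len(fs)
        for f in users[l_star]:
            frozen[f] = True
    return rate
```

For example, with a 10-unit link A shared by two flows, one of which also crosses a 3-unit link B, the B-constrained flow gets 3 and the other flow gets the remaining 7.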

We use the synthetic patterns in §5.1.1, as well as realistic traces from production DCNs, in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures:
• Facebook: A Hive/MapReduce trace collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.
• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (the same information as above is collected), as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of a shuffle or read varies

Figure 15: Throughput vs. traffic stability, (a) with and (b) without basemesh (y-axis: fraction of full bisection bandwidth; x-axis: stability period, from 10ms to 1000ms; patterns: Random, NStride, HStride).

Figure 16: Throughput vs. reconfiguration frequency, (a) with and (b) without basemesh (y-axis: normalized throughput; x-axis: reconfiguration period, from 10ms to 1000ms; traces: Facebook, IDC-HDFS, IDC-Shuffle).

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Impact of traffic stability: We vary the traffic stability period t from 10ms to 1s for the synthetic patterns, and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see their throughput steadily increase to full bisection bandwidth with longer stability periods. 3) Although basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure for each period. Basemesh does not help much for NStride: when the wavelengths are not yet available, basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread, and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without basemesh. However, it benefits from basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with basemesh. (a) Fraction of full bisection bandwidth for Random, NStride, and HStride; (b) normalized throughput for Facebook and IDC; x-axis: wavelengths in basemesh, from 8 to 192.

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch, and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput, using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹², and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches non-blocking.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average, with and without basemesh, respectively. This indicates that our traces, collected from production DCNs, are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that with good load balancing the traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread for the realistic traces: for every period, a rack in Facebook, IDC-HDFS & IDC-Shuffle talks with 121, 45 & 146 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of synthetic patterns (with a stability period of 10ms) and production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as the Spark experiments in §5.1.2.

[Figure 18: Average bisection throughput. Curves: Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, MegaSwitch; y-axis: bisection bandwidth (0–6000); x-axis: port count (128–6144).]

[Figure 19: Worst-case bisection throughput. Same curves and axes as Figure 18.]

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.
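The capacity tradeoff behind these results can be illustrated with a toy model (our own simplification, not the paper's scheduler: basemesh wavelengths are spread evenly across the n−1 peers, and the remaining k−b wavelengths are freely schedulable to any single pair):

```python
def capacity_split(n, k, b):
    """Hypothetical model of a node's k wavelengths: b stay in the
    always-on basemesh, the rest are reconfigurable on demand."""
    assert 0 <= b <= k
    always_on_per_peer = b // (n - 1)  # basemesh share toward each of n-1 peers
    dynamic_pool = k - b               # wavelengths left for the scheduler
    # largest per-pair demand that can ever be served
    max_pair_capacity = dynamic_pool + always_on_per_peer
    return always_on_per_peer, dynamic_pool, max_pair_capacity

# MegaSwitch-like scale: n=33 nodes, k=192 wavelengths per node
print(capacity_split(33, 192, 8))    # → (0, 184, 184): small basemesh
print(capacity_split(33, 192, 192))  # → (6, 0, 6): all-basemesh caps pairs at 6
```

The b=192 case reproduces the 6-wavelength per-pair cap noted above: 192 wavelengths spread over 32 peers leave nothing schedulable.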

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 & 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA/Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (optical circuit switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the number of wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× more (non-DWDM) optical transceivers at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers remain more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components     OSA          Quartz  Mordia      Mordia-PTL  MegaSwitch  FatTree
Amplifier      0            6144    768         768         32          0
WSS            32 (1×4)     960     768 (1×4)   768 (1×4)   32 (32×1)   0
WDM Filter     0            0       768         768         0           0
OCS            1×128-port   0       1×32-port   0           0           0
Circulator     3072         0       0           0           0           0
Transceivers   3072         3072    3072        3072        3072        6912

Table 1: Comparison of optical complexity (optical components used) at the same scale (3072 ports).

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch can maintain a good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype, as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd Ratul Mahajan for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] AL-FARES, M., LOUKISSAS, A., AND VAHDAT, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] AL-FARES, M., RADHAKRISHNAN, S., RAGHAVAN, B., HUANG, N., AND VAHDAT, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] ANNEXSTEIN, F., AND BAUMSLAG, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] CEGLOWSKI, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] CHEN, K., SINGLA, A., SINGH, A., RAMACHANDRAN, K., XU, L., ZHANG, Y., WEN, X., AND CHEN, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] CHEN, K., WEN, X., MA, X., CHEN, Y., XIA, Y., HU, C., AND DONG, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] CHOWDHURY, M., ZHONG, Y., AND STOICA, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] COLE, R., AND HOPCROFT, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] DOERR, C. R., AND OKAMOTO, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] FARRINGTON, N., FORENCICH, A., PORTER, G., SUN, P.-C., FORD, J., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] FARRINGTON, N., PORTER, G., RADHAKRISHNAN, S., BAZZAZ, H. H., SUBRAMANYA, V., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] GAETA, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] GEIS, M., LYSZCZARZ, T., OSGOOD, R., AND KIMBALL, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] GHOBADI, M., MAHAJAN, R., PHANISHAYEE, A., DEVANUR, N., KULKARNI, J., RANADE, G., BLANCHE, P.-A., RASTEGARFAR, H., GLICK, M., AND KILPER, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] GREENBERG, A., HAMILTON, J. R., JAIN, N., KANDULA, S., KIM, C., LAHIRI, P., MALTZ, D. A., PATEL, P., AND SENGUPTA, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] HALPERIN, D., KANDULA, S., PADHYE, J., BAHL, P., AND WETHERALL, D. Augmenting data center networks with multi-gigabit wireless links. In SIGCOMM (2011).

[18] HAMEDAZIMI, N., QAZI, Z., GUPTA, H., SEKAR, V., DAS, S. R., LONGTIN, J. P., SHAH, H., AND TANWERY, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] HEDLUND, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] HINDEN, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] KANDULA, S., SENGUPTA, S., GREENBERG, A., PATEL, P., AND CHAIKEN, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] LIGHTWAVE. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] LIN, X.-W., HU, W., HU, X.-K., LIANG, X., CHEN, Y., CUI, H.-Q., ZHU, G., LI, J.-N., CHIGRINOV, V., AND LU, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] LIU, H., LU, F., FORENCICH, A., KAPOOR, R., TEWARI, M., VOELKER, G. M., PAPEN, G., SNOEREN, A. C., AND PORTER, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] LIU, V., HALPERIN, D., KRISHNAMURTHY, A., AND ANDERSON, T. F10: A fault-tolerant engineered network. In NSDI (2013).

[26] LIU, Y. J., GAO, P. X., WONG, B., AND KESHAV, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] LOVASZ, L., AND PLUMMER, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] MALKHI, D., NAOR, M., AND RATAJCZAK, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the twenty-first annual symposium on Principles of distributed computing (2002), ACM, pp. 183–192.

[29] MANKU, G. S., BAWA, M., RAGHAVAN, P., ET AL. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] MARTINEZ, A., CALO, C., ROSALES, R., WATTS, R., MERGHEM, K., ACCARD, A., LELARGE, F., BARRY, L., AND RAMDANE, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] MCKEOWN, N., ANDERSON, T., BALAKRISHNAN, H., PARULKAR, G., PETERSON, L., REXFORD, J., SHENKER, S., AND TURNER, J. OpenFlow: Enabling innovation in campus networks. ACM Computer Communication Review (2008).

[32] METAWEB TECHNOLOGIES. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] MISRA, J., AND GRIES, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] PASQUALE, F. D., MELI, F., GRISERI, E., SGUAZZOTTI, A., TOSETTI, C., AND FORGHIERI, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] PFAFF, B., PETTIT, J., AMIDON, K., CASADO, M., KOPONEN, T., AND SHENKER, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] PI, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] PORTER, G., STRONG, R., FARRINGTON, N., FORENCICH, A., SUN, P.-C., ROSING, T., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] QIAO, C., AND MEI, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND SNOEREN, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] SANFILIPPO, S., AND NOORDHUIS, P. Redis. http://redis.io, 2010.

[44] SCHRIJVER, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] SINGH, A., ONG, J., AGARWAL, A., ANDERSON, G., ARMISTEAD, A., BANNON, R., BOVING, S., DESAI, G., FELDERMAN, B., GERMANO, P., ET AL. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] SKOMOROCH, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] VASSEUR, J.-P., PICKAVET, M., AND DEMEESTER, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] WANG, G., ANDERSEN, D., KAMINSKY, M., PAPAGIANNAKI, K., NG, T., KOZUCH, M., AND RYAN, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] WANG, H., XIA, Y., BERGMAN, K., NG, T., SAHU, S., AND SRIPANIDKULCHAI, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] WON, R., AND PANICCIA, M. Integrating silicon photonics, 2010.

[53] YAMAMOTO, N., AKAHANE, K., KAWANISHI, T., KATOUF, R., AND SOTOBAYASHI, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] ZHU, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] ZHU, Z., ZHONG, S., CHEN, L., AND CHEN, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 591

Appendix

Proof for Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V,E are the node/edge sets and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows.

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V; i.e., each node cannot send/receive more than k wavelengths.

Definition 3: Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
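Definition 2 is mechanical to check; a minimal sketch, assuming (our own input format, not the paper's) that the demand γ is given as an n×n matrix of wavelength counts:

```python
def is_feasible(demand, k):
    """Definition 2: a demand {k_e} is feasible iff every node sends at
    most k and receives at most k wavelengths in total.
    demand[u][v] = wavelengths requested on directed edge (u, v)."""
    n = len(demand)
    sends_ok = all(sum(demand[u][v] for v in range(n)) <= k for u in range(n))
    recvs_ok = all(sum(demand[u][v] for u in range(n)) <= k for v in range(n))
    return sends_ok and recvs_ok

demand = [[0, 2, 1],
          [1, 0, 2],
          [2, 1, 0]]
print(is_feasible(demand, 3))  # → True: every row and column sums to 3
print(is_feasible(demand, 2))  # → False: 3 wavelengths exceed k=2
```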

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (sources) and d_i (destinations); each edge points from a source node to a destination node. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this is equivalent to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return G′_r
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′)≤k.

Theorem 4: A feasible demand γ only needs Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′)≤k wavelengths using Algorithm 1.

Proof: DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus we can extract one perfect matching from the original graph G′_r, after which the residual graph becomes (Δ(G′)−1)-regular; we continue extracting matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′)≤k, as the out-degree and in-degree of any node must be ≤k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
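Algorithm 1 can be sketched in a few lines. The sketch below is our own illustration (not the paper's code): it uses a textbook augmenting-path routine in place of the faster Perfect_Matching(·) algorithms from the literature [27], and makes the dummy-edge padding step explicit:

```python
def perfect_matching(d, n):
    """One perfect matching of a regular bipartite multigraph.
    d[u][v] > 0 means at least one parallel edge sender u -> receiver v
    remains; regularity guarantees the matching exists [44]."""
    match_r = [-1] * n  # receiver -> matched sender

    def augment(u, seen):
        for v in range(n):
            if d[u][v] > 0 and v not in seen:
                seen.add(v)
                if match_r[v] == -1 or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False

    for u in range(n):
        augment(u, set())
    return {match_r[v]: v for v in range(n)}  # sender -> receiver

def decompose(demand, k):
    """Sketch of Algorithm 1 (DECOMPOSE): pad a feasible demand to a
    k-regular bipartite multigraph with dummy edges, peel off k perfect
    matchings, and carry matching i on wavelength i."""
    n = len(demand)
    d = [row[:] for row in demand]     # working multigraph (edge counts)
    real = [row[:] for row in demand]  # real (non-dummy) edges left
    out = [sum(r) for r in d]
    into = [sum(c) for c in zip(*d)]
    for u in range(n):                 # pad every node to degree exactly k
        while out[u] < k:
            v = next(w for w in range(n) if into[w] < k)
            d[u][v] += 1; out[u] += 1; into[v] += 1
    schedule = []                      # schedule[i]: {sender: receiver} on wavelength i
    for _ in range(k):
        m = perfect_matching(d, n)
        sel = {}
        for u, v in m.items():
            d[u][v] -= 1               # remove the matched edge
            if real[u][v] > 0:         # keep only real demand edges
                real[u][v] -= 1
                sel[u] = v
        schedule.append(sel)
    return schedule

demand = [[0, 2, 1],
          [1, 0, 2],
          [2, 1, 0]]  # feasible: every row/column sums to k=3
for i, m in enumerate(decompose(demand, 3)):
    print(f"wavelength {i}: {m}")
```

Within each printed matching, all senders and all receivers are distinct, so the assignment is non-interfering, and the three matchings together cover the full demand.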

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology; the representative ones are Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as in Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms from the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can simply pick a new wavelength at random and configure it accordingly without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh with the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for a MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.
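One reason Symphony fits an adjustable basemesh is that each long link is drawn independently from the harmonic distribution, so adding a wavelength never forces the others to move. A sketch of the draw (our own illustration, not the paper's implementation):

```python
import random

def symphony_link(n, rng):
    """Draw one Symphony-style long-link distance on an n-node ring:
    pdf p(x) proportional to 1/(x ln n) on [1, n), sampled by the
    inverse-transform trick x = n**u with u uniform in [0, 1)."""
    return max(1, int(n ** rng.random()))

# each wavelength added to the basemesh is one more independently drawn
# long link, so capacity grows without rebuilding the whole topology
rng = random.Random(7)
links = sorted(symphony_link(33, rng) for _ in range(4))
print(links)  # a mix of short and long hops, all within the 33-node ring
```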

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1 we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r=max(t_e,t_o)+t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets to a host in the other node using netcat. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
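Plugging the measured averages into t_r = max(t_e, t_o) + t_c gives a rough average-case figure; note that lumping the scheduling, routing-table, OWS-processing, and RTT numbers into t_c is our own illustrative grouping, not the paper's breakdown:

```python
def reconfig_delay(t_eps, t_wss, t_ctrl):
    """t_r = max(t_e, t_o) + t_c: EPS and WSS reconfigure in parallel,
    followed by the sequential control-plane work."""
    return max(t_eps, t_wss) + t_ctrl

# prototype averages in ms (40-port case); t_c lumps wavelength
# scheduling + basemesh routing tables + OWS GPIO + control-network RTT
t_c = 0.53 + 0.45 + 4.31 + 0.078
print(round(reconfig_delay(5.12, 3.23, t_c), 2))  # → 10.49
```

The average-case sum is well under the ~20ms reported in §5.1.1, which is dominated by tail EPS configuration times (12.87ms at the 99th percentile).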



send traffic to any other node on this multi-fiber ring; each receiver can receive traffic from all the senders (one per fiber) simultaneously. Essentially, MegaSwitch establishes a one-hop circuit between any pair of EPSes, forming a rearrangeably non-blocking circuit mesh over the physical multi-fiber ring.

Specifically, MegaSwitch achieves unconstrained connections by re-purposing the wavelength selective switch (WSS, a key component of the OWS) as a receiving w×1 multiplexer, enabling non-blocking space division multiplexing among w fibers (§3.1). Prior work [6, 11, 38] used WSS as a 1×w demultiplexer that takes 1 input fiber with k wavelengths and outputs any subset of the k wavelengths to w output fibers. In contrast, MegaSwitch reverses the use of WSS. Each node uses k wavelengths² on a fiber for sending; for receiving, each node leverages the WSS to intercept all k×w wavelengths from the w other senders (one per fiber) via its w input ports. Then, with a wavelength assignment algorithm (§3.2), the WSS can always select, out of the k×w wavelengths, a non-interfering set to satisfy any communication demand among nodes.

As a result, MegaSwitch delivers rearrangeably non-blocking communications to n×k ports, where n=w+1 is the number of nodes/racks on the ring and k is the number of ports per node. With current technology, w can be up to 32 and k up to 192; thus MegaSwitch can support up to 33 racks and 6336 hosts. In the future, MegaSwitch can scale beyond 10⁵ ports with AWGR (arrayed waveguide grating router) technology and multi-dimensional expansion (§3.1).
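The scale arithmetic is direct; a tiny sketch (illustrative helper, not from the paper):

```python
def megaswitch_scale(w, k):
    """Ring scale: n = w + 1 nodes (each node's WSS has w input fibers,
    one per other sender), and n * k host ports in total."""
    n = w + 1
    return n, n * k

print(megaswitch_scale(32, 192))  # → (33, 6336): the maximum scale above
```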

On top of its unconstrained connections, MegaSwitch has built-in fault-tolerance to handle various failures, such as OWS, cable, and link failures. We develop the necessary redundancy and mechanisms so that MegaSwitch can provide reliable service (§3.3).

We have implemented a small-scale MegaSwitch prototype (§4) with 5 nodes, where each node has 8 optical ports transmitting on 8 unique wavelengths within 190.5THz to 193.5THz at 200GHz channel spacing. This prototype enables arbitrary communications among 5 racks and 40 hosts. Furthermore, our OWSes are designed and implemented with commercially available optical devices only, and the EPSes we used are Broadcom Pronto-3922 10G commodity switches.

MegaSwitch's reconfiguration delay hinges on the WSS, and an 11.5μs WSS switching time has been reported using digital light processing (DLP) technology [38]. However, our implementation experience reveals that as the WSS port count increases (for wide-spread communications), such low latency can no longer be maintained. This is mainly because the port count of the WSS with the DLP

²Each unique wavelength corresponds to an optical port on the OWS, as well as an optical transceiver on the EPS.

used in [38] is not scalable. In our prototype, the WSS reconfiguration takes ~3ms. We believe this is a hard limitation we have to confront in order to scale. To accommodate unstable traffic and latency-critical applications, we develop a "basemesh" on MegaSwitch to ensure that any two nodes are always connected via certain wavelengths during reconfigurations (§3.2.2). In particular, we construct the basemesh with the well-known Symphony [29] topology from the distributed hash table (DHT) literature, to provide low average latency with adjustable capacity.

On the prototype testbed, we conducted basic measurements of throughput and latency, and deployed real applications, e.g., Spark [54] and Redis [43], to evaluate the performance of MegaSwitch (§5.1). Our experiments show that the latency-sensitive Redis experiences uniformly low latency for cross-rack queries due to the basemesh, and the performance of Spark applications is similar to that of an optimal scenario where all servers are connected to one single switch.

To complement the testbed experiments, we performed large-scale simulations to study the impact of traffic stability on MegaSwitch (§5.2). For synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic of all patterns, but does not sustain high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, it can improve throughput for unstable wide-spread traffic through the basemesh. For real production traces, MegaSwitch achieves 93.21% of the throughput of an ideal non-blocking fabric. We also find that our basemesh effectively handles highly unstable wide-spread traffic, contributing up to a 56.14% throughput improvement.

2 Background and Motivation

In this section, we first highlight the trend of high-demand, wide-spread traffic in DCNs. Then we discuss why prior solutions are insufficient to support this trend.

2.1 The Trend of DCN Traffic

We use two case studies to show the traffic trend.

• User-facing web applications: In response to user requests, web servers push/pull content to/from cache servers (e.g., Redis [43]). Web servers and cache servers are usually deployed in different racks, leading to intensive inter-rack traffic [42]. Cache objects are replicated in different clusters for fault-tolerance and load balancing, and web servers access these objects randomly, creating a wide-spread communication pattern overall.

• Data-parallel processing: Traffic of MapReduce-type applications [3, 21, 42] is shown to be heavily cluster-local (e.g., [42] reported that 13.3% of the traffic is rack-local, but 80.9% is cluster-local). Data is spread across many racks due to rack awareness [19], which requires


copies of data to be stored in different racks in case of rack failures, as well as cluster-level balancing [45]. In the shuffle and output phases of MapReduce jobs, wide-spread many-to-many traffic emerges among all participating servers/racks within the cluster.

Bandwidth demand is also surging, as the data stored

and processed in DCNs continues to grow due to the explosion of user-generated content, such as photo/video updates and sensor measurements [46]. The implication is twofold: 1) Data processing applications require larger bandwidth for data exchange, both within an application (e.g., in a multi-stage machine learning job, workers exchange data between stages) and between applications (e.g., related applications like web search and ad recommendation share data). 2) Front-end servers need more bandwidth to fetch ever-larger content from cache servers to generate webpages [5]. In conclusion, optical DCN interconnects must support wide-spread, high-bandwidth demands among many racks.

2.2 Prior Proposals Are Insufficient

Prior proposals fall short in supporting the above trend of high-bandwidth, wide-spread communications among many nodes simultaneously.

• c-Through [50] and Helios [12] are seminal works that

introduce wavelength division multiplexing (WDM) and optical circuit switches (OCS) into DCNs. However, they suffer from constrained rack-to-rack connectivity: at any time, high-capacity optical circuits are limited to a couple of racks (e.g., 2 [50]). Reconfiguring circuits to serve communications among more racks can incur a ~10ms circuit visit delay [12, 50].

• OSA [6] and WaveCube [7] enable arbitrary rack-to-rack connectivity via multi-hop routing over multiple circuits and EPSes. While solving the connectivity issue, they introduce severe bandwidth constraints between racks, because traffic needs to traverse multiple EPSes to reach its destination, introducing traffic overhead and routing complexity at each transit EPS.

• Quartz [26] is an optical fiber ring designed as a design element to provide low latency in portions of traditional DCNs. It avoids the circuit switching delay by using fixed wavelengths, but its bisection bandwidth is also limited.

• Mordia [38] and REACToR [24] improve optical circuit switching in the temporal direction by reducing circuit switching time from milliseconds to microseconds. This approach is effective in quickly time-sharing circuits across more nodes, but is still insufficient to support wide-spread communications due to the limited number of parallel circuits available simultaneously. Furthermore, Mordia is by default a single fiber ring structure with small port density (limited by the number of unique wavelengths supported on a fiber). Naïvely stacking multiple fibers via ring-selection circuits, as suggested by [38], is blocking among hosts on different fibers, leading to severely degraded bisection bandwidth (§5.2), whereas time-sharing multiple fibers via EPS, as suggested by [11], requires complex EPS functions unavailable in existing commodity EPSes and introduces additional optical design complexities, making it hard to implement in practice (more details in §3.1).

Figure 2: Example of a 3-node MegaSwitch

• Free-space optics proposals [15, 18] eliminate wiring. Firefly [18] leverages free-space optics to build a wireless DCN fabric with improved cost-performance tradeoff. With low switching time and high scalability, ProjecToR [15] connects an entire DCN of tens of thousands of machines. However, they face practical challenges of real DCN environments (e.g., dust and vibration). In contrast, we implement a wired optical interconnect for thousands of machines, which is common in DCNs, and deploy real applications on it. We note that the mechanisms we developed for MegaSwitch can be employed in ProjecToR, notably the basemesh for low latency applications.

3 MegaSwitch Design

In this section, we describe the design of MegaSwitch's data and control planes, as well as its fault-tolerance.

3.1 Data Plane

As in Figure 2, MegaSwitch connects multiple OWSes in a multi-fiber ring. Each fiber has only one sender. The sender broadcasts traffic on this fiber, which reaches all other nodes in one hop. Each receiver can receive traffic from all the senders simultaneously. The key of MegaSwitch is that it exploits WSS to enable non-blocking space division multiplexing among these parallel fibers.

Sending component: In OWS, for transmission, we use an optical multiplexer (MUX) to join the signals from the EPS onto a fiber, and then use an optical amplifier (OA) to boost the power. Multiple wavelengths from EPS uplink ports (each with an optical transceiver at a unique wavelength) are multiplexed onto a single fiber. The multiplexed signals then travel to all the other nodes via this fiber. The OA is added before the path to compensate for the optical insertion loss caused by broadcasting, which ensures the signal strength is within the receiver sensitivity range (the detailed power budgeting design is in §4.1). As shown in Figure 2, any signal is amplified once at its sender, and there are no additional amplification stages in the intermediate or receiver nodes. The signal from a sender also does not loop back: it terminates at the last receiver on the ring (e.g., signals from OWS1 terminate in OWS3, so that no physical loop is formed).

Receiving component: We use a WSS and an optical demultiplexer (DEMUX) at the receiving end. The design highlight is using WSS as a w×1 wavelength multiplexer to intercept all the wavelengths from all the other nodes on the ring. In prior work [6, 11, 38], WSS was used as a 1×w demultiplexer that takes 1 input fiber of k wavelengths and outputs any subset of these k wavelengths to any of w output fibers. In MegaSwitch, WSS is repurposed as a w×1 multiplexer, which takes w input fibers with k wavelengths each and outputs a non-interfering subset of the k×w wavelengths to an output fiber. With this unconventional use of WSS, MegaSwitch ensures that any node can simultaneously choose any wavelengths from any other nodes, enabling unconstrained connections. Then, based on the demands among nodes, the WSS selects the right wavelengths and multiplexes them on the fiber to the DEMUX. In §3.2, we introduce algorithms to handle wavelength assignments. The multiplexed signals are then de-multiplexed by the DEMUX to the uplink ports on the EPS.

Supporting ∗-cast [51]: Unicast and multicast can be set up with a single wavelength on MegaSwitch, because any wavelength from a source can be intercepted by every node on the ring. To set up a unicast, MegaSwitch assigns a wavelength on the fiber from the source to the destination, and configures the WSS at the destination to select this wavelength. Consider a unicast from H1 to H6 in Figure 2. We first assign wavelength λ1 from node 1 to node 2 for this connection, and configure the WSS in node 2 to select λ1 from node 1. Then, with the routing in both EPSes configured, the unicast circuit from H1 to H6 is established. Further, to set up a multicast from H1 to H6 and H9 based on the unicast above, we just need to additionally configure the WSS in node 3 to also select λ1 from node 1. In addition, many-to-one (incast) or many-to-many (allcast) communications are composites of many unicasts and/or multicasts, and they are supported using multiple wavelengths.
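The circuit setup just described amounts to bookkeeping over per-receiver WSS selections. A minimal sketch (the data structure and function names are ours, not part of MegaSwitch's control software):

```python
def setup_unicast(wss_select, src, dst, wavelength):
    """Select `wavelength` from src's fiber at dst's w x 1 WSS."""
    wss_select.setdefault(dst, set()).add((src, wavelength))

def setup_multicast(wss_select, src, dsts, wavelength):
    """A multicast reuses one wavelength: every receiver selects it."""
    for dst in dsts:
        setup_unicast(wss_select, src, dst, wavelength)

def non_interfering(wss_select):
    """At each receiver, no two selected fibers may carry the same wavelength,
    since all selections share one output fiber through the WSS."""
    return all(len({w for _, w in sel}) == len(sel)
               for sel in wss_select.values())
```

The unicast in the text maps to `setup_unicast(cfg, 1, 2, "l1")`; extending it to a multicast only adds the same (source, wavelength) selection at node 3. `non_interfering` encodes the receiver-side constraint that the wavelength assignment algorithms must respect.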

Scalability: MegaSwitch's port count is n×k.

• For n: with existing technology, it is possible to achieve 32×1 WSS (thus n=w+1=33). Furthermore, alternative optical components such as AWGR and filter arrays can be used to achieve the same functionality as WSS+DEMUX for MegaSwitch [56]. Since 48×48 or 64×64 AWGR is available, we expect to see a 1.5× or 2× increase, with 49 or 65 nodes on a ring, while the additional splitting loss can be compensated.

• For k: a C-band Dense-WDM (DWDM) link can support k=96 wavelengths at 50GHz channel spacing, and k=192 wavelengths at 25GHz [35]. Recent progress in optical comb sources [30, 53] assures the relative wavelength accuracy among DWDM signals, and is promising to enable a low power consumption DWDM transceiver array at 25GHz channel spacing.

As a result, with existing technology, MegaSwitch supports up to n×k=33×192=6336 ports on a single ring.

MegaSwitch can also be expanded multi-dimensionally. In our OWS implementation (§4), two inter-OWS ports are used in the ring, and we reserve another two for future extension of MegaSwitch in another dimension. When extended bi-dimensionally (into a torus) [55], MegaSwitch scales as n²×k, supporting over 811K ports (n=65, k=192), which should accommodate modern DCNs [42, 46]. However, such large scale comes with inherent routing, management, and reliability challenges, and we leave it as future work.
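The port-count arithmetic above is easy to sanity-check (function name ours):

```python
def megaswitch_ports(n, k, dims=1):
    """Ports on a MegaSwitch fabric: n*k on a single ring,
    n^2*k when extended bi-dimensionally into a torus."""
    return (n ** dims) * k
```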

MegaSwitch vs. Mordia: Mordia [38] is perhaps the closest related work to MegaSwitch in terms of topology: both are constructed as a ring, and both adopt WSS. However, the key difference is that Mordia chains multiple WSSes on one fiber, where each WSS acts as a wavelength demultiplexer to set up circuits between ports on the same fiber, whereas MegaSwitch reverses WSS into a wavelength multiplexer across multiple fibers to enable non-blocking connections between ports on different fibers.

We note that Farrington et al. [11] further discussed scaling Mordia with multiple fibers non-blockingly (referred to as Mordia-PTL below). Unlike MegaSwitch's WSS-based multi-fiber ring, they proposed to connect each EPS to multiple fiber rings in parallel and rely on the EPS to schedule circuits for servers to support TDM across fibers. This falls beyond the capability of existing commodity switches [24]. Further, on each fiber, they allow every node to add signals to the fiber, and require an optical amplifier in each node to boost the signals and compensate losses (e.g., bandpass add/drop filter loss, variable optical attenuator loss, 10/90 splitter loss, etc.); thus they need multiple amplifiers per fiber. This degrades the optical signal to noise ratio (OSNR) and makes transmission error-prone. In Figure 3, we compare OSNR between MegaSwitch and Mordia. We assume 7dBm per-channel launch power, 16dB pre-amplification gain in MegaSwitch, and 23dB boosting gain per node in Mordia. The results show that for MegaSwitch, OSNR maintains at ∼36dB and remains constant for all nodes; for Mordia, OSNR quickly degrades to 30dB after 7 hops. For validation, we have also measured the average OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

Figure 3: OSNR comparison (MegaSwitch vs. Mordia multi-hop ring, against the number of nodes on the ring; the measured 5-node MegaSwitch prototype and the acceptable SNR of 32dB are marked)

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated in the last node in the ring, thus there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. More importantly, besides the above, they do not consider fault-tolerance and actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.

3.2 Control Plane

As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure routing in EPSes and wavelength assignments in OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments and pushes them into the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithms, and then design the basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment

In MegaSwitch, each node uses the same k wavelengths to communicate with other nodes. The control plane assigns the wavelengths to satisfy communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the selected wavelengths to a particular receiver share the same output fiber through the w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, which has 3 nodes, each with 4 wavelengths. The demand is in (a): each entry is the number of required wavelengths of a sender-receiver pair. If the wavelengths are assigned as in (b), then two conflicts occur. A non-interfering assignment is shown in (c), where no two same wavelengths go to the same receiver.

Given the constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph as in Figure 4. Multiple edges between two nodes correspond to multiple wavelengths needed by them. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges so that no two adjacent edges share the same color, which is exactly the edge coloring problem [9].

Figure 4: Wavelength assignments (n=3, k=4)
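The reduction can be exercised in code. The following sketch uses the standard alternating-path recoloring argument for bipartite multigraphs; it is our simplified stand-in, not the exact algorithm of [7] that the paper adopts:

```python
def wavelength_assignment(demand, k):
    """Color each demanded (sender, receiver) wavelength request with one of
    k colors so that no color repeats at any sender or any receiver:
    bipartite multigraph edge coloring via alternating-path recoloring."""
    n = len(demand)
    edges = [(i, j) for i in range(n) for j in range(n)
             for _ in range(demand[i][j])]
    color = [None] * len(edges)
    at_s = [dict() for _ in range(n)]   # sender   -> {color: edge index}
    at_r = [dict() for _ in range(n)]   # receiver -> {color: edge index}

    for e, (i, j) in enumerate(edges):
        free_s = [c for c in range(k) if c not in at_s[i]]
        free_r = [c for c in range(k) if c not in at_r[j]]
        if not free_s or not free_r:
            raise ValueError("infeasible: a node needs more than k wavelengths")
        common = set(free_s) & set(free_r)
        if common:
            c = min(common)
        else:
            a, b = free_s[0], free_r[0]
            # Recolor the maximal a/b alternating path starting at receiver j;
            # in a bipartite graph it can never reach sender i, so after the
            # swap, color a is free at both endpoints of the new edge.
            path, v, on_recv, col = [], j, True, a
            while True:
                table = at_r[v] if on_recv else at_s[v]
                if col not in table:
                    break
                ei = table[col]
                path.append(ei)
                v = edges[ei][0] if on_recv else edges[ei][1]
                on_recv = not on_recv
                col = b if col == a else a
            for ei in path:                      # clear old colors first
                del at_s[edges[ei][0]][color[ei]]
                del at_r[edges[ei][1]][color[ei]]
            for ei in path:                      # then apply swapped colors
                color[ei] = b if color[ei] == a else a
                at_s[edges[ei][0]][color[ei]] = ei
                at_r[edges[ei][1]][color[ei]] = ei
            c = a
        color[e] = c
        at_s[i][c] = e
        at_r[j][c] = e
    return [(s, r, color[e]) for e, (s, r) in enumerate(edges)]
```

Any demand whose row and column sums are at most k (i.e., any feasible demand) is colored with at most k wavelengths, mirroring the rearrangeably non-blocking property stated below as Theorem 1.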

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch.

Theorem 1: Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh

As WSS port count increases, MegaSwitch reconfigures at milliseconds (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as the basemesh, which should achieve two goals: 1) the number of wavelengths dedicated to the basemesh, b, should be adjustable⁴; 2) with a given b, guarantee low average latency for all pairs of nodes.

Essentially, the basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on the basemesh is similar to performing a look-up on a DHT network. Also, the routing table size (number of known peer addresses) of each node in a DHT is analogous to the number of wavelengths to other nodes, b. Meanwhile, our problem differs from DHT: we assume the centralized controller calculates routing tables for EPSes, and the routing tables are updated for adjustments of b and peer churn (infrequent; it only happens in failures, §3.3).

³We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴The basemesh alone can be considered as a b:k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].

Figure 5: Average latency on the basemesh (average hops and average latency vs. the number of wavelengths in the basemesh, comparing the basemesh (Symphony) with Chord)

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts, drawn from a family of harmonic probability distributions p_n(x) = 1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1. The routing table on the EPS of each node is configured with greedy routing that attempts to minimize the absolute distance to the destination at each hop [29].
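A sketch of this construction, assuming inverse-transform sampling of p_n(x) = 1/(x ln n) over [1/n, 1] (i.e., x = n^(u−1) for uniform u), with helper names of our choosing:

```python
import random

def symphony_basemesh(n, b, seed=0):
    """Build a basemesh demand matrix: one ring wavelength to the successor
    plus b-1 Symphony-style random shortcuts per node."""
    rng = random.Random(seed)
    demand = [[0] * n for _ in range(n)]
    for node in range(n):
        demand[node][(node + 1) % n] += 1            # directed ring edge
        for _ in range(b - 1):
            x = n ** (rng.random() - 1)              # pdf 1/(x ln n) on [1/n, 1]
            dist = max(1, min(n - 1, round(x * n)))
            demand[node][(node + dist) % n] += 1     # shortcut entry += 1
    return demand

def greedy_next_hop(node, dst, neighbors, n):
    """Greedy routing: pick the neighbor minimizing the remaining
    clockwise ring distance to the destination."""
    return min(neighbors, key=lambda nb: (dst - nb) % n)
```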

It is shown that the expected number of hops for this topology is O(log²(n)/b) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size is 4; average path length is O(log(n))), and also plot the expected latency assuming 20µs per-hop latency. Notably, if b≥n−1, the basemesh is fully connected with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.

As we will show in §5.2, while the basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:

• It increases throughput for rapidly-changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.

• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without the basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators, and intend to explore the problem of optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: The basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low latency traffic. However, its construction is based on the optimization of weighted path length using traffic distribution from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, the basemesh can provide consistent connections between all pairs with provable average latency. It is possible to configure the basemesh in ProjecToR as its dedicated topology, which better handles unpredictable latency-critical traffic.

Figure 6: PRF module layout

3.3 Fault-tolerance

We consider the following failures in MegaSwitch.

OWS failure: We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (Passive Routing Fabric, PRF) handles the signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be attached to and detached from the active switching modules easily, so one can recover a failed OWS without service interruption to the rest of the network by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry the traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes; 2) the receiving switch for each fiber, at the input of the WSS, selects the direction where there is still incoming signal. Then the connectivity can be restored. A passive 4-port fixed splitter is added before the receiving switch as a part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

Figure 7: MegaSwitch fault-tolerance

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWG) as MUX and DEMUX for DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmission, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9: at each receiving transceiver, the signal strength is ∼−9dBm per channel. The PRF configuration is therefore determined by w (WSS radix).

⁶It is implementable with a common Mach-Zehnder interferometer.

⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented

Figure 9: Power budget on a fiber
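The stated ratios equalize the taps: a 1:(i−1) drop:continue ratio drops a fraction 1/i of the incident power, so if the fiber traverses splitters in decreasing index order (our reading of Figures 6 and 9), every receiving node gets the same share of the launch power. A small sketch of this power-budget arithmetic:

```python
def tap_powers(w, launch=1.0):
    """Power delivered to each of w receiving nodes on one fiber, with the
    i-th drop-continue splitter at ratio 1:(i-1), i.e., drop fraction 1/i,
    traversed in order i = w, w-1, ..., 2."""
    power, taps = launch, []
    for i in range(w, 1, -1):
        taps.append(power / i)      # fraction dropped toward the local WSS
        power -= power / i          # remainder continues along the ring
    taps.append(power)              # the fiber terminates at the last node
    return taps
```

For w=4 every tap receives exactly launch/4, which is why a single amplification stage at the sender suffices.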

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, of which 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to optical ports on OWSes. With the total broadcasting loss at 10.5dB for OWS, 4dB for WSS, 2.5dB for DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB memory. The OWS receives the wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control plane network, and configures the WSS via GPIO pins.

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 with 48×10Gbps Ethernet (GbE) ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to the servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40×10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted in a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted in the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Figure 10: The MegaSwitch prototype implemented with 5 OWSes

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, whose switching speed is supposedly microseconds with the technology in [38]. However, in our implementation, we find that as WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5µs switching seen by [38]. This is mainly because the port count of WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ∼3ms. We acknowledge that this is a hard limitation we need to confront in order to scale; we identify that there exist several ways to reduce this time [14, 23].

We note that, with the current speed on our prototype, MegaSwitch is still able to satisfy the needs of some production DCNs: a recent study [42] of DCN traffic in Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of the results is as follows:

• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports when wavelengths are configured. MegaSwitch sees ∼20ms reconfiguration delay, with ∼3ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries for Redis, thanks to the basemesh.

• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40×10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU or disk I/O bound, to stress the network we run the following synthetic traffic patterns used in Helios [12]:

• Node-level Stride (NStride): Numbering the nodes (EPS) from 0 to n−1, for the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in a new period.


Figure 11: Experiment: Node-level stride. (a) Throughput over 30s; (b) and (c) magnify the reconfigurations at 10s and 20s.

• Host-level Stride (HStride): Numbering the hosts from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period; every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
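The three generators are compact to state in code; a sketch (host numbering as in the text, function names ours):

```python
import random

def nstride(n, k, l):
    """Node-level Stride: host j of node i sends to host j of node (i+l) mod n."""
    return [((i, j), ((i + l) % n, j)) for i in range(n) for j in range(k)]

def hstride(n, k, l):
    """Host-level Stride: host i sends to host (i + k + l) mod (n*k)."""
    return [(i, (i + k + l) % (n * k)) for i in range(n * k)]

def random_matching(n, k, seed=0):
    """Random: a fresh random perfect matching between all hosts each period."""
    hosts = list(range(n * k))
    random.Random(seed).shuffle(hosts)
    return list(zip(hosts[::2], hosts[1::2]))
```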

For this experiment, the basemesh is disabled. We set the traffic stability t=10s, and the wavelength assignments are calculated and delivered to the OWSes when the demand between 2 nodes changes. We study the impact of stability t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node; HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ∼20ms gap caused by the wavelength reconfiguration at 10s and 20s in the magnified (b) and (c) of Figure 11. Transceiver initialization delay contributes ∼10ms⁸, and the remainder can be broken down into EPS configuration (∼5ms), WSS switching (∼3ms), and control plane delays (∼4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25, and run three jobs: WikipediaPageRank⁹, K-Means¹⁰, and WordCount¹¹. We first connect all the 20 servers to a single ToR EPS and run the applications; this establishes the optimal network scenario, because all servers are connected at full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices at 0.1s and 1s intervals, respectively. Then, we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.

⁸Our transceiver's receiver Loss of Signal Deassert is −22dBm.

⁹A PageRank instance using a 26GB dataset [32].

¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].

Figure 12: WSS switching speed (switching on/off over 0–8ms)

Figure 13: Completion time for Spark jobs (Single ToR vs. MegaSwitch with 0.1s and 1s update intervals)

Figure 14: Redis intra-/cross-rack query completion (per-node average latency for b=4, 3, 2, 1)

We plot the job completion times in Figure 13. With ∼20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable; e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of both update periods is similar, but the smaller update interval has slightly worse performance due to one more wavelength reconfiguration; for example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40G dataset with 7.153×10⁹ words.

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 585

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; therefore, it configures the wavelengths every t seconds with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
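The simulator's per-wavelength bandwidth sharing can be illustrated with a progressive-filling sketch of max-min fairness (our illustration, not the authors' simulator code; capacity units and flow names are arbitrary):

```python
def max_min_share(capacity, demands):
    """Max-min fair split of one wavelength's capacity among flow demands."""
    alloc = {f: 0.0 for f in demands}
    active = {f for f, d in demands.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)  # equal share of what is left
        for f in list(active):
            take = min(share, demands[f] - alloc[f])
            alloc[f] += take
            remaining -= take
            if alloc[f] >= demands[f] - 1e-12:
                active.discard(f)  # satisfied flows free capacity for the rest
    return alloc

# One 10Gbps wavelength shared by three flows:
print(max_min_share(10.0, {"a": 2.0, "b": 4.0, "c": 10.0}))  # ~ a:2, b:4, c:4
```

Flow "a" is fully satisfied, and the leftover capacity is split evenly between the two bottlenecked flows, which is exactly the max-min property the simulator assumes.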

We use the synthetic patterns in §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures:

• Facebook: The Hive/MapReduce trace is collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (with the same information as above) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of shuffles and reads varies from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Figure 15: Throughput vs. traffic stability. ((a) With Basemesh, (b) Without Basemesh; patterns: Random, NStride, HStride; y-axis: fraction of full bisection bandwidth; x-axis: stability period from 10ms to 1000ms.)

Figure 16: Throughput vs. reconfiguration frequency. ((a) With Basemesh, (b) Without Basemesh; traces: Facebook, IDC-HDFS, IDC-Shuffle; y-axis: normalized throughput; x-axis: reconfiguration period from 10ms to 1000ms.)

Impact of traffic stability: We vary the traffic stability period t from 10ms to 1s for the synthetic patterns, and plot the average throughput with and without the basemesh in Figure 15. We make three observations. 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see their throughput steadily increase to full bisection bandwidth with a longer stability period. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of the different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without the basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with basemesh. ((a) Full bisection bandwidth for Random, NStride, HStride; (b) normalized throughput for Facebook and IDC traces; x-axis: wavelengths in basemesh, 8 to 192.)

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without the basemesh, respectively. This indicates that our traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that, with good load balancing, the traffic is stable on a sub-second scale and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread for the realistic traces: in every period, a rack in Facebook, IDC-HDFS & IDC-Shuffle talks with 12.1, 4.5 & 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), then demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput. (Structures: Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, MegaSwitch; y-axis: bisection bandwidth, 0 to 6000; x-axis: port count, 128 to 6144.)

Figure 19: Worst-case bisection throughput. (Same structures and axes as Figure 18.)

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 & 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors, such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the number of wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric, all the EPSes are connected via transceivers.
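The scaling arithmetic behind these configurations is easy to check (a sketch; the helper names are ours):

```python
def fattree_host_ports(k):
    # A 3-layer FatTree of k-port switches supports k^3/4 host ports.
    return k ** 3 // 4

def fattree_switch_count(k):
    # (k/2)^2 core + k^2/2 aggregation + k^2/2 edge switches = 5k^2/4.
    return 5 * k ** 2 // 4

print(fattree_host_ports(24), fattree_switch_count(24))  # 3456 720
print(96 // 4)  # 24 stations per Mordia ring (96 wavelengths, 1x4 WSS each)
```

These reproduce the 3456-port / 720-EPS FatTree and the 24 stations per ring quoted above.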

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is due solely to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components    OSA         Quartz  Mordia      Mordia-PTL  MegaSwitch  FatTree
Amplifier     0           6144    768         768         32          0
WSS           32 (1×4)    960     768 (1×4)   768 (1×4)   32 (32×1)   0
WDM Filter    0           0       768         768         0           0
OCS           1×128-port  0       1×32-port   0           0           0
Circulator    3072        0       0           0           0           0
Transceivers  3072        3072    3072        3072        3072        6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch can maintain a good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus, the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] AL-FARES, M., LOUKISSAS, A., AND VAHDAT, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] AL-FARES, M., RADHAKRISHNAN, S., RAGHAVAN, B., HUANG, N., AND VAHDAT, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] ANNEXSTEIN, F., AND BAUMSLAG, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] CEGLOWSKI, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] CHEN, K., SINGLA, A., SINGH, A., RAMACHANDRAN, K., XU, L., ZHANG, Y., WEN, X., AND CHEN, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] CHEN, K., WEN, X., MA, X., CHEN, Y., XIA, Y., HU, C., AND DONG, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] CHOWDHURY, M., ZHONG, Y., AND STOICA, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] COLE, R., AND HOPCROFT, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] DOERR, C. R., AND OKAMOTO, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] FARRINGTON, N., FORENCICH, A., PORTER, G., SUN, P.-C., FORD, J., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] FARRINGTON, N., PORTER, G., RADHAKRISHNAN, S., BAZZAZ, H. H., SUBRAMANYA, V., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] GAETA, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] GEIS, M., LYSZCZARZ, T., OSGOOD, R., AND KIMBALL, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] GHOBADI, M., MAHAJAN, R., PHANISHAYEE, A., DEVANUR, N., KULKARNI, J., RANADE, G., BLANCHE, P.-A., RASTEGARFAR, H., GLICK, M., AND KILPER, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] GREENBERG, A., HAMILTON, J. R., JAIN, N., KANDULA, S., KIM, C., LAHIRI, P., MALTZ, D. A., PATEL, P., AND SENGUPTA, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] HALPERIN, D., KANDULA, S., PADHYE, J., BAHL, P., AND WETHERALL, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] HAMEDAZIMI, N., QAZI, Z., GUPTA, H., SEKAR, V., DAS, S. R., LONGTIN, J. P., SHAH, H., AND TANWER, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] HEDLUND, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] HINDEN, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] KANDULA, S., SENGUPTA, S., GREENBERG, A., PATEL, P., AND CHAIKEN, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] LIGHTWAVE. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] LIN, X.-W., HU, W., HU, X.-K., LIANG, X., CHEN, Y., CUI, H.-Q., ZHU, G., LI, J.-N., CHIGRINOV, V., AND LU, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] LIU, H., LU, F., FORENCICH, A., KAPOOR, R., TEWARI, M., VOELKER, G. M., PAPEN, G., SNOEREN, A. C., AND PORTER, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] LIU, V., HALPERIN, D., KRISHNAMURTHY, A., AND ANDERSON, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] LIU, Y. J., GAO, P. X., WONG, B., AND KESHAV, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] LOVASZ, L., AND PLUMMER, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] MALKHI, D., NAOR, M., AND RATAJCZAK, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] MANKU, G. S., BAWA, M., RAGHAVAN, P., ET AL. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] MARTINEZ, A., CALO, C., ROSALES, R., WATTS, R., MERGHEM, K., ACCARD, A., LELARGE, F., BARRY, L., AND RAMDANE, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] MCKEOWN, N., ANDERSON, T., BALAKRISHNAN, H., PARULKAR, G., PETERSON, L., REXFORD, J., SHENKER, S., AND TURNER, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] METAWEB TECHNOLOGIES. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] MISRA, J., AND GRIES, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] PASQUALE, F. D., MELI, F., GRISERI, E., SGUAZZOTTI, A., TOSETTI, C., AND FORGHIERI, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] PFAFF, B., PETTIT, J., AMIDON, K., CASADO, M., KOPONEN, T., AND SHENKER, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] PI, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] PORTER, G., STRONG, R., FARRINGTON, N., FORENCICH, A., SUN, P.-C., ROSING, T., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] QIAO, C., AND MEI, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND SNOEREN, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] SANFILIPPO, S., AND NOORDHUIS, P. Redis. http://redis.io, 2010.

[44] SCHRIJVER, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] SINGH, A., ONG, J., AGARWAL, A., ANDERSON, G., ARMISTEAD, A., BANNON, R., BOVING, S., DESAI, G., FELDERMAN, B., GERMANO, P., ET AL. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] SKOMOROCH, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] VASSEUR, J.-P., PICKAVET, M., AND DEMEESTER, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] WANG, G., ANDERSEN, D., KAMINSKY, M., PAPAGIANNAKI, K., NG, T., KOZUCH, M., AND RYAN, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] WANG, H., XIA, Y., BERGMAN, K., NG, T., SAHU, S., AND SRIPANIDKULCHAI, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] WON, R., AND PANICCIA, M. Integrating silicon photonics. 2010.

[53] YAMAMOTO, N., AKAHANE, K., KAWANISHI, T., KATOUF, R., AND SOTOBAYASHI, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-µm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] ZHU, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] ZHU, Z., ZHONG, S., CHEN, L., AND CHEN, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows.

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V; i.e., each node cannot send/receive more than k wavelengths.

Definition 3: Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
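Definition 2 amounts to row and column sum constraints on the demand. A small checker makes this concrete (our sketch; representing the demand as a dictionary keyed by directed node pairs is an assumption of this illustration):

```python
def is_feasible(demand, k):
    # demand maps a directed edge (u, v) to k_e, the number of wavelengths
    # requested from node u to node v; feasibility bounds each node's
    # total egress and ingress wavelengths by k.
    out_sum, in_sum = {}, {}
    for (u, v), ke in demand.items():
        out_sum[u] = out_sum.get(u, 0) + ke
        in_sum[v] = in_sum.get(v, 0) + ke
    return all(s <= k for s in out_sum.values()) and \
           all(s <= k for s in in_sum.values())

demand = {(0, 1): 3, (0, 2): 2, (1, 2): 4}
print(is_feasible(demand, 5))  # False: node 2 would receive 6 > 5 wavelengths
print(is_feasible(demand, 6))  # True
```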

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (sources) and d_i (destinations). Each edge points from a node in s_i to a node in d_i. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, …, m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., determined by the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4: A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof: DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
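A compact sketch of this pipeline (our illustration, not the authors' controller code): pad the demand matrix into a Δ-regular bipartite multigraph with dummy edges, then repeatedly peel off perfect matchings with a standard augmenting-path search (Kuhn's algorithm stands in for the `Perfect_Matching(·)` routine); each peeled matching is one wavelength.

```python
def decompose(demand):
    """Split an n-by-n wavelength demand matrix into delta perfect
    matchings, one per wavelength (a sketch of Algorithm 1)."""
    n = len(demand)
    D = [row[:] for row in demand]
    row = [sum(r) for r in D]
    col = [sum(D[u][v] for u in range(n)) for v in range(n)]
    delta = max(row + col)
    for u in range(n):            # pad with dummy demand until the
        for v in range(n):        # multigraph is delta-regular
            add = min(delta - row[u], delta - col[v])
            if add > 0:
                D[u][v] += add; row[u] += add; col[v] += add

    def matching():
        owner = [-1] * n          # column v -> matched row
        def augment(u, seen):
            for v in range(n):
                if D[u][v] > 0 and v not in seen:
                    seen.add(v)
                    if owner[v] == -1 or augment(owner[v], seen):
                        owner[v] = u
                        return True
            return False
        for u in range(n):
            assert augment(u, set())  # regularity guarantees success
        return [(owner[v], v) for v in range(n)]

    matchings = []
    for _ in range(delta):        # residual graph stays regular, so a
        m = matching()            # perfect matching always exists
        matchings.append(m)
        for u, v in m:
            D[u][v] -= 1
    return matchings              # padded dummy edges carry no real traffic

# A 3-regular demand among 3 nodes decomposes into 3 matchings:
for m in decompose([[0, 2, 1], [1, 0, 2], [2, 1, 0]]):
    print(m)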

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are the representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones:

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

¹⁶A graph is regular when every vertex has the same degree.

¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1 we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we can calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for 40 ports (a 40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
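Putting the components together under the parallel-reconfiguration model above (a sketch; summing the computation, OWS processing, and control RTT into a single t_c term is our assumption, and all figures are in milliseconds):

```python
def total_reconfig_delay(t_eps, t_wss, t_ctrl):
    # EPS and WSS reconfigure in parallel; control-plane work adds on top:
    # t_r = max(t_e, t_o) + t_c.
    return max(t_eps, t_wss) + t_ctrl

# Average prototype figures for the 40-port case: EPS 5.12ms, WSS 3.23ms,
# control plane = assignment + routing tables + OWS processing + RTT.
t_ctrl = 0.53 + 0.45 + 4.31 + 0.078
print(total_reconfig_delay(5.12, 3.23, t_ctrl))  # ~10.5ms on average
```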



copies of data to be stored in different racks in case ofrack failures as well as cluster-level balancing [45] Inshuffle and output phases of MapReduce jobs wide-spread many-to-many traffic emerges among all par-ticipating serversracks within the clusterBandwidth demand is also surging as the data stored

and processed in DCNs continue to grow due to the ex-plosion of user-generated contents such as photovideoupdates and sensor measurements [46] The implica-tion is twofold 1) Data processing applications requirelarger bandwidth for data exchange both within a appli-cation (eg in a multi-stage machine learning workersexchange data between stages) and between applications(eg related applications like web search and ad recom-mendation share data) 2) Front-end servers need morebandwidth to fetch the ever-larger contents from cacheservers to generate webpages [5] In conclusion opti-cal DCN interconnects must support wide-spread high-bandwidth demands among many racks

2.2 Prior Proposals Are Insufficient
Prior proposals fall short in supporting the above trend of high-bandwidth, wide-spread communications among many nodes simultaneously.

• c-Through [50] and Helios [12] are seminal works that introduce wavelength division multiplexing (WDM) and optical circuit switches (OCS) into DCNs. However, they suffer from constrained rack-to-rack connectivity: at any time, high-capacity optical circuits are limited to a couple of racks (e.g., 2 [50]). Reconfiguring circuits to serve communications among more racks can incur ~10ms circuit visit delay [12, 50].

• OSA [6] and WaveCube [7] enable arbitrary rack-to-rack connectivity via multi-hop routing over multiple circuits and EPSes. While solving the connectivity issue, they introduce severe bandwidth constraints between racks, because traffic needs to traverse multiple EPSes to reach its destination, introducing traffic overhead and routing complexity at each transit EPS.

• Quartz [26] is an optical fiber ring designed as an element to provide low latency in portions of traditional DCNs. It avoids the circuit switching delay by using fixed wavelengths, but its bisection bandwidth is also limited.

• Mordia [38] and REACToR [24] improve optical circuit switching in the temporal direction by reducing circuit switching time from milliseconds to microseconds. This approach is effective in quickly time-sharing circuits across more nodes, but is still insufficient to support wide-spread communications, due to the limited number of parallel circuits available simultaneously. Furthermore, Mordia is by default a single-fiber ring structure with small port density (limited by the number of unique wavelengths supported on a fiber). Naïvely stacking multiple fibers via ring-selection circuits, as suggested by [38], is blocking among hosts on different fibers, leading to severely degraded bisection bandwidth (§5.2), whereas time-sharing multiple fibers via EPS, as suggested by [11], requires complex EPS functions unavailable in existing commodity EPSes and introduces additional optical design complexities, making it hard to implement in practice (more details in §3.1).

Figure 2: Example of a 3-node MegaSwitch.

• Free-space optics proposals [15, 18] eliminate wiring. Firefly [18] leverages free-space optics to build a wireless DCN fabric with an improved cost-performance tradeoff. With low switching time and high scalability, ProjecToR [15] connects an entire DCN of tens of thousands of machines. However, these face practical challenges in real DCN environments (e.g., dust and vibration). In contrast, we implement a wired optical interconnect for thousands of machines, which is common in DCNs, and deploy real applications on it. We note that the mechanisms we developed for MegaSwitch can be employed in ProjecToR, notably the basemesh for low-latency applications.

3 MegaSwitch Design
In this section, we describe the design of MegaSwitch's data and control planes, as well as its fault-tolerance.

3.1 Data Plane
As shown in Figure 2, MegaSwitch connects multiple OWSes in a multi-fiber ring. Each fiber has only one sender: the sender broadcasts traffic on this fiber, and it reaches all other nodes in one hop. Each receiver can receive traffic from all senders simultaneously. The key of MegaSwitch is that it exploits WSS to enable non-blocking space division multiplexing among these parallel fibers.

Sending component: In the OWS, for transmission, we use an optical multiplexer (MUX) to join the signals from the EPS onto a fiber, and then use an optical amplifier (OA) to boost the power. Multiple wavelengths from EPS uplink ports (each with an optical transceiver at a unique wavelength) are multiplexed onto a single fiber. The multiplexed signals then travel to all the other nodes via this fiber. The OA is placed at the start of the path to compensate for


the optical insertion loss caused by broadcasting, which ensures that the signal strength is within the receiver sensitivity range (the detailed power budgeting design is in §4.1). As shown in Figure 2, any signal is amplified only once, at its sender; there are no additional amplification stages at intermediate or receiver nodes. The signal from a sender also does not loop back: it terminates at the last receiver on the ring (e.g., signals from OWS1 terminate at OWS3, so that no physical loop is formed).

Receiving component: We use a WSS and an optical demultiplexer (DEMUX) at the receiving end. The design highlight is using the WSS as a w×1 wavelength multiplexer to intercept all the wavelengths from all the other nodes on the ring. In prior work [6, 11, 38], the WSS was used as a 1×w demultiplexer that takes 1 input fiber of k wavelengths and outputs any subset of these k wavelengths to any of w output fibers. In MegaSwitch, the WSS is repurposed as a w×1 multiplexer, which takes w input fibers with k wavelengths each and outputs a non-interfering subset of the k×w wavelengths onto one output fiber. With this unconventional use of WSS, MegaSwitch ensures that any node can simultaneously choose any wavelengths from any other nodes, enabling unconstrained connections. Then, based on the demands among nodes, the WSS selects the right wavelengths and multiplexes them onto the fiber to the DEMUX. In §3.2 we introduce algorithms to handle wavelength assignment. The multiplexed signals are then de-multiplexed by the DEMUX to the uplink ports on the EPS.
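The non-interference condition at a receiver can be stated as a small check. The sketch below is our own illustration with an assumed data model, not MegaSwitch's controller code: since the w×1 WSS merges one input fiber per sender onto a single output fiber, no wavelength index may be selected from two different senders.

```python
# Sketch (assumed data model): selection[receiver][sender] is the set of
# wavelength indices the receiver's WSS takes from that sender's fiber.
def non_interfering(selection):
    for recv, per_sender in selection.items():
        seen = set()
        for sender, waves in per_sender.items():
            for w in waves:
                # The same wavelength from two fibers would collide on the
                # single output fiber of the receiver's w x 1 WSS.
                if w in seen:
                    return False
                seen.add(w)
    return True

ok = {2: {1: {0, 1}, 3: {2}}}   # node 2 takes wavelengths 0,1 from node 1 and 2 from node 3
bad = {2: {1: {0}, 3: {0}}}     # wavelength 0 from both node 1 and node 3: conflict
print(non_interfering(ok), non_interfering(bad))  # True False
```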

Supporting ∗-cast [51]: Unicast and multicast can be set up with a single wavelength on MegaSwitch, because any wavelength from a source can be intercepted by every node on the ring. To set up a unicast, MegaSwitch assigns a wavelength on the fiber from the source to the destination, and configures the WSS at the destination to select this wavelength. Consider a unicast from H1 to H6 in Figure 2: we first assign wavelength λ1 from node 1 to node 2 for this connection, and configure the WSS in node 2 to select λ1 from node 1. Then, with the routing in both EPSes configured, the unicast circuit from H1 to H6 is established. Further, to set up a multicast from H1 to H6 and H9 based on the unicast above, we just need to additionally configure the WSS in node 3 to also select λ1 from node 1. In addition, many-to-one (incast) and many-to-many (allcast) communications are composites of many unicasts and/or multicasts, and are supported using multiple wavelengths.

Scalability: MegaSwitch's port count is n×k.

• For n: with existing technology, it is possible to achieve a 32×1 WSS (thus n=w+1=33). Furthermore, alternative optical components, such as AWGR and filter arrays, can be used to achieve the same functionality as WSS+DEMUX for MegaSwitch [56]. Since 48×48 or 64×64 AWGR is available, we expect to see a 1.5× or 2× increase, with 49 or 65 nodes on a ring, while the additional splitting loss can be compensated.

• For k: a C-band Dense-WDM (DWDM) link can support k=96 wavelengths at 50GHz channel spacing and k=192 wavelengths at 25GHz [35]. Recent progress in optical comb sources [30, 53] assures the relative wavelength accuracy among DWDM signals, and is promising to enable a low-power DWDM transceiver array at 25GHz channel spacing.

As a result, with existing technology, MegaSwitch supports up to n×k = 33×192 = 6336 ports on a single ring.

MegaSwitch can also be expanded multi-dimensionally. In our OWS implementation (§4), two inter-OWS ports are used in the ring, and we reserve another two for future extension of MegaSwitch in another dimension. When extended bi-dimensionally (into a torus) [55], MegaSwitch scales as n²×k, supporting over 811K ports (n=65, k=192), which should accommodate modern DCNs [42, 46]. However, such a large scale comes with inherent routing, management, and reliability challenges, and we leave it as future work.

MegaSwitch vs. Mordia: Mordia [38] is perhaps the closest related work to MegaSwitch in terms of topology: both are constructed as a ring, and both adopt WSS. However, the key difference is that Mordia chains multiple WSSes on one fiber, each WSS acting as a wavelength demultiplexer to set up circuits between ports on the same fiber, whereas MegaSwitch reverses the WSS into a wavelength multiplexer across multiple fibers to enable non-blocking connections between ports on different fibers.

We note that Farrington et al. [11] further discussed scaling Mordia with multiple fibers non-blockingly (referred to as Mordia-PTL below). Unlike MegaSwitch's WSS-based multi-fiber ring, they proposed to connect each EPS to multiple fiber rings in parallel and rely on the EPS to schedule circuits for servers to support TDM across fibers. This falls beyond the capability of existing commodity switches [24]. Further, on each fiber, they allow every node to add signals, and require an optical amplifier in each node to boost the signals and compensate losses (e.g., bandpass add/drop filter loss, variable optical attenuator loss, 10/90 splitter loss, etc.); thus they need multiple amplifiers per fiber. This degrades the optical signal-to-noise ratio (OSNR) and makes transmission error-prone. In Figure 3 we compare OSNR between MegaSwitch and Mordia, assuming 7dBm per-channel launch power, 16dB pre-amplification gain in MegaSwitch, and 23dB boosting gain per node in Mordia. The results show that for MegaSwitch, OSNR stays constant at ~36dB for all nodes; for Mordia, OSNR quickly degrades to 30dB after 7 hops. For validation, we have also measured the average OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

Figure 3: OSNR comparison between MegaSwitch and Mordia (multi-hop ring) as the number of nodes on the ring increases; the acceptable SNR is 32dB, and the measured 5-node MegaSwitch prototype is plotted for reference.

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated at the last node on the ring, thus there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. More importantly, they do not consider fault-tolerance or an actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.

3.2 Control Plane
As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure routing in the EPSes and wavelength assignments in the OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments and pushes them to the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithm, and then design the basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment
In MegaSwitch, each node uses the same k wavelengths to communicate with the other nodes. The control plane assigns the wavelengths to satisfy the communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the wavelengths selected for a particular receiver share the same output fiber through its w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, with 3 nodes of 4 wavelengths each. The demand is in (a); each entry is the number of wavelengths required by a sender-receiver pair. If the wavelengths are assigned as in (b), two conflicts occur. A non-interfering assignment is shown in (c), where no two identical wavelengths go to the same receiver.

Given this constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph, as in Figure 4; multiple edges between two nodes correspond to the multiple wavelengths they need. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges such that no two adjacent edges share the same color: exactly the edge coloring problem [9].

Figure 4: Wavelength assignments (n=3, k=4).
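To illustrate the reduction, the sketch below is our own implementation of the classic alternating-path method for bipartite edge coloring (the paper itself adopts the optimal algorithm of [7]). Each demanded wavelength becomes one edge from its sender to its receiver, and colors are wavelength indices; on a bipartite multigraph, this method needs only max-degree-many colors, so any feasible demand uses at most k wavelengths.

```python
# Sketch (our own illustration, not the paper's algorithm [7]): bipartite
# multigraph edge coloring by alternating-path recoloring.
def edge_color(n, edges):
    """edges: list of (sender, receiver) pairs; returns one color per edge."""
    s_col = [dict() for _ in range(n)]   # sender side: color -> edge index
    r_col = [dict() for _ in range(n)]   # receiver side: color -> edge index
    color = [None] * len(edges)

    def lowest_free(used):
        c = 0
        while c in used:
            c += 1
        return c

    for e, (u, v) in enumerate(edges):
        a, b = lowest_free(s_col[u]), lowest_free(r_col[v])
        if a != b:
            # Walk the path of edges alternately colored a, b starting at v;
            # in a bipartite graph it never reaches u, so swapping a and b
            # along it frees color a at v without creating any conflict.
            path, node, side, c = [], v, 'r', a
            while c in (r_col[node] if side == 'r' else s_col[node]):
                j = (r_col[node] if side == 'r' else s_col[node])[c]
                path.append(j)
                s_j, r_j = edges[j]
                node, side = (s_j, 's') if side == 'r' else (r_j, 'r')
                c = a + b - c
            for j in path:                    # unregister old colors first
                s_j, r_j = edges[j]
                del s_col[s_j][color[j]]
                del r_col[r_j][color[j]]
            for j in path:                    # then register swapped colors
                color[j] = a + b - color[j]
                s_j, r_j = edges[j]
                s_col[s_j][color[j]] = j
                r_col[r_j][color[j]] = j
        color[e] = a
        s_col[u][a] = e
        r_col[v][a] = e
    return color

# A demand in the spirit of Figure 4 (n=3): 2 wavelengths per node pair,
# so every node sends and receives on 4 wavelengths (k=4).
demand = {(s, r): 2 for s in range(3) for r in range(3) if s != r}
edges = [(s, r) for (s, r), m in demand.items() for _ in range(m)]
colors = edge_color(3, edges)
print(max(colors) + 1)  # 4: k wavelength indices suffice
```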

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch:

Theorem 1. Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh
As the WSS port count increases, MegaSwitch reconfigures at millisecond timescales (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as the basemesh⁴, which should achieve two goals: 1) the number of wavelengths dedicated to the basemesh, b, should be adjustable; 2) with a given b, it should guarantee low average latency between all pairs of nodes.

Essentially, the basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on the basemesh is similar to performing a look-up in a DHT network. Also, the routing table size (the number of known peer addresses) of each node in a DHT is analogous to the number of wavelengths to other nodes, b. Meanwhile, our problem differs from DHT: we assume the centralized controller

³We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴The basemesh alone can be considered as a b:k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].


calculates the routing tables for the EPSes, and the routing tables are updated upon adjustment of b and upon peer churn (infrequent; it only happens upon failures, §3.3).

Figure 5: Average latency on the basemesh: average hops and average latency (at 20μs per hop) vs. the number of wavelengths in the basemesh, compared with Chord (log₂33 fingers).

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts, drawn from a family of harmonic probability distributions, p_n(x) = 1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1. The routing table on the EPS of each node is configured with greedy routing that attempts to minimize the absolute distance to the destination at each hop [29].
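A minimal sketch of this construction (our own illustration in the style of Symphony, not the controller's code; shortcut ring distances are drawn as (n−1)^U for uniform U, which approximates the harmonic density p_n(x) = 1/(x ln n)):

```python
# Sketch: basemesh topology a la Symphony, plus greedy ring routing.
import math, random

def build_basemesh(n, b, seed=0):
    rng = random.Random(seed)
    nbrs = {u: {(u + 1) % n} for u in range(n)}    # the directed ring link
    for u in range(n):
        while len(nbrs[u]) < min(b, n - 1):
            # draw a ring distance x in [1, n-1] with P(x) roughly 1/x
            x = int(round(math.exp(rng.random() * math.log(n - 1))))
            nbrs[u].add((u + max(1, min(x, n - 1))) % n)
    return nbrs

def greedy_hops(nbrs, src, dst, n):
    hops, cur = 0, src
    while cur != dst:
        # take the neighbor minimizing the remaining clockwise distance
        cur = min(nbrs[cur], key=lambda v: (dst - v) % n)
        hops += 1
    return hops

mesh = build_basemesh(n=33, b=5)
avg = sum(greedy_hops(mesh, s, d, 33)
          for s in range(33) for d in range(33) if s != d) / (33 * 32)
print(round(avg, 2))  # average greedy hop count over all node pairs
```

Since the ring link is always available and each hop strictly reduces the remaining distance, greedy routing always terminates, and shortcuts only shorten paths.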

It is shown that the expected number of hops on this topology is O(log²(n)/b) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size 4; average path length O(log(n))), and also plot the expected latency, assuming 20μs per-hop latency. Notably, if b≥n−1, the basemesh is fully connected with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count further, but serves to reduce congestion.

As we will show in §5.2, while the basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:

• It increases throughput for rapidly-changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.

• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without the basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators, and intend to explore the problem of the optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: The basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, its construction is based on the optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, the basemesh can provide consistent connections between all pairs with provable average latency. It is possible to configure the basemesh in ProjecToR as its dedicated topology, to better handle unpredictable latency-critical traffic.

Figure 6: PRF module layout.

3.3 Fault-tolerance
We consider the following failures in MegaSwitch.

OWS failure: We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (the Passive Routing Fabric, PRF) handles signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be easily attached to and detached from the active switching modules, so one can recover a failed OWS without service interruption to the rest of the network by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in the OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified and instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes; 2) the receiving switch for each fiber, at the input of the WSS, selects the direction from which there is still incoming signal. The connectivity is then restored. A passive 4-port fixed splitter is added before the receiving switch as part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

Figure 7: MegaSwitch fault-tolerance.

4 Implementation

4.1 Implementation of OWS
To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWGs) as the MUX and DEMUX for DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmission, and the splitters in the PRF have different splitting ratios: using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9; at each receiving transceiver, the signal strength is ~−9dBm per channel. The PRF configuration is therefore determined by w (the WSS radix).

⁶It is implementable with a common Mach-Zehnder interferometer.
⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented.

Figure 9: Power budget on a fiber.
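The 1:(i−1) ratios make every tap along a broadcast fiber receive an equal 1/w share of the launched power. The sketch below is our own arithmetic (not the authors' design tool), under the assumption that the signal traverses splitters in descending index order (w, w−1, …, 2), with the last node terminating the fiber:

```python
# Sketch: equal-power tapping with 1:(i-1) drop:continue splitters.
import math

def drop_powers(w, launch_mw=1.0):
    powers, remaining = [], launch_mw
    for i in range(w, 1, -1):        # splitter i drops 1/i of its input
        powers.append(remaining / i)
        remaining -= remaining / i   # (i-1)/i continues down the ring
    powers.append(remaining)         # the final node takes what is left
    return powers

taps = drop_powers(w=8)
print([round(p, 4) for p in taps])             # eight equal 0.125 mW taps
print(round(10 * math.log10(1 / taps[0]), 2))  # ideal splitting loss ~9.03 dB
```

The ideal 10·log₁₀(8) ≈ 9dB splitting loss is consistent with the 10.5dB total broadcasting loss quoted in this section once excess and connector losses are included.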

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, of which 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to the optical ports on OWSes. With the total broadcasting loss at 10.5dB for the OWS, 4dB for the WSS, 2.5dB for the DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect to other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control plane network, and configures the WSS via GPIO pins.

4.2 MegaSwitch Prototype
We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 switches with 48 10GbE ports. On one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to servers. On the other, we use only 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40 10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management


Figure 10: The MegaSwitch prototype implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted on a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted on the same server as the MegaSwitch controller. To emulate 5 OWS+EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPSes to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and the set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, and WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation, we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5μs switching seen in [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we must confront in order to scale, and we identify several existing ways to reduce this time [14, 23].

We note that, at the current speed of our prototype, MegaSwitch can still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic at Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if μs switching is achieved in later versions of MegaSwitch, it would still be insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation
We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane

and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of our results follows:

• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees ~20ms reconfiguration delay, with ~3ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries in Redis, thanks to the basemesh.

• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40 10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU or disk I/O bound, to stress the network we run the following synthetic traffic patterns, as used in Helios [12]:

• Node-level Stride (NStride): Number the nodes (EPSes) from 0 to n−1. For the i-th node, its j-th host initiates a TCP flow to the j-th host in the ((i+l) mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in each new period.



Figure 11: Experiment: node-level stride. (a) Throughput over a 30s run; (b) and (c) magnified views around the reconfigurations at 10s and 20s.

• Host-level Stride (HStride): Numbering the hosts from 0 to n×k−1, the i-th host sends a TCP flow to the ((i+k+l) mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases gradual demand shifts between nodes.

• Random: A random perfect matching between all the hosts is generated every period, and every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
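The three patterns above can be sketched as flow-pair generators (our own reconstruction from the text, with n nodes and k hosts per node, hosts numbered 0..n·k−1):

```python
# Sketch: synthetic traffic pattern generators (one (src, dst) pair per flow).
import random

def nstride(n, k, l):
    # host j of node i -> host j of node ((i + l) mod n)
    return [(i * k + j, ((i + l) % n) * k + j)
            for i in range(n) for j in range(k)]

def hstride(n, k, l):
    # host i -> host (i + k + l) mod (n * k)
    return [(i, (i + k + l) % (n * k)) for i in range(n * k)]

def rand_matching(n, k, seed=0):
    # a random perfect matching over all hosts
    hosts = list(range(n * k))
    random.Random(seed).shuffle(hosts)
    return list(zip(hosts[::2], hosts[1::2]))

pairs = nstride(n=5, k=8, l=1)
print(len(pairs), pairs[0])  # 40 flows; host 0 sends to host 8 (node 1)
```

For the prototype's scale (n=5, k=8), NStride shifts all 40 flows to a new destination node whenever l rotates, while HStride only shifts one wavelength's worth of demand per rack.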

For this experiment, the basemesh is disabled. We set the traffic stability to t=10s, and the wavelength assignments are calculated and delivered to the OWSes whenever the demand between 2 nodes changes. We study the impact of the stability t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because only one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfiguration at 10s and 20s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms⁸, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control plane delays (~4ms) (please refer to the Appendix for detailed measurement methods and results).

5.1.2 Real Applications on MegaSwitch
We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25, and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸Our transceiver's receiver Loss-of-Signal Deassert is −22dBm.
⁹A PageRank instance using a 26GB dataset [32].

¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].

Figure 12: WSS switching speed (optical power at the original and destination ports, −40dB to 0dB, over 0–8ms).

Figure 13: Completion time for Spark jobs (WikiPageRank, WordCount, K-Means; Single ToR vs. MegaSwitch with 0.1s and 1s update intervals; 0–450s).

Figure 14: Redis intra-/cross-rack query completion (Nodes 1–5; b=4, 3, 2, 1; 0–150μs).

and WordCount¹¹. We first connect all the 20 servers to a single ToR EPS and run the applications; this establishes the optimal network scenario, because all servers are connected at full bandwidth. We record the time series (with millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices of 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.

We plot the job completion times in Figure 13. With a ~20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable: e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of both update periods is similar, but the smaller update interval performs slightly worse due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than that with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.
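The expected efficiency above is simple arithmetic, sketched here as a one-liner (the 20ms default is the reconfiguration delay measured in §5.1.1; the function name is ours):

```python
def bandwidth_efficiency(interval_s, reconfig_delay_s=0.02):
    """Fraction of an update interval with full bandwidth available,
    assuming every wavelength is reconfigured once per interval."""
    return (interval_s - reconfig_delay_s) / interval_s
```

With a 0.1s interval this gives 0.8, matching the 80% figure in the text; with a 1s interval the expected efficiency rises to 98%.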

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they traverse only 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40GB dataset with 7.153×10⁹ words.

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 585

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.

We use the synthetic patterns in §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: A Hive/MapReduce trace collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the 3000-server Hadoop cluster of a large Internet service company. The trace records shuffle tasks (with the same information as above) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The sizes of shuffles and reads vary

Figure 15: Throughput vs. traffic stability (fraction of full bisection bandwidth for Random, NStride, and HStride, (a) with and (b) without basemesh; stability period 10ms–1000ms).

Figure 16: Throughput vs. reconfiguration frequency (normalized throughput for Facebook, IDC-HDFS, and IDC-Shuffle, (a) with and (b) without basemesh; reconfiguration period 10ms–1000ms).

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Impact of traffic stability: We vary the traffic stability period t from 10ms to 1s for the synthetic patterns and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see their throughput steadily increase to full bisection bandwidth with a longer stability period. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: while the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full bisection bandwidth for t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with basemesh ((a) full bisection bandwidth for Random, NStride, and HStride; (b) normalized throughput for Facebook and IDC; 8–192 wavelengths in basemesh).

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay). We note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs.

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces, collected from production DCNs, are very stable, and can thus take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that with good load balancing the traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread in the realistic traces: in every period, a rack in Facebook, IDC-HDFS & IDC-Shuffle talks with 12.1, 4.5 & 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrivals/departures of tasks.

For NStride and HStride (Figure 17 (a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput (bisection bandwidth vs. port count from 128 to 6144, for Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch).

Figure 19: Worst-case bisection throughput (bisection bandwidth vs. port count from 128 to 6144, for the same structures as Figure 18).

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17 (b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 & 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA/Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost down to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components    OSA         Quartz  Mordia     Mordia-PTL  MegaSwitch  FatTree
Amplifier     0           6144    768        768         32          0
WSS           32 (1×4)    960     768 (1×4)  768 (1×4)   32 (32×1)   0
WDM Filter    0           0       768        768         0           0
OCS           1×128-port  0       1×32-port  0           0           0
Circulator    3072        0       0          0           0           0
Transceivers  3072        3072    3072       3072        3072        6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd Ratul Mahajan for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwery, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com/2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A Scalable Content-Addressable Network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA/OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V, E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows.

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k ∀u∈V and Σ_u k_(u,v) ≤ k ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3: Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes: s_i (source) and d_i (destination). Each edge points from a source node s_i to a destination node d_j. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)
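A runnable sketch of this decomposition, representing the regular bipartite multigraph as an n×n matrix of edge multiplicities; we use Kuhn's augmenting-path algorithm for the Perfect_Matching(·) step (the paper defers to the faster algorithms in the literature [27, 44], so this is illustrative only):

```python
def perfect_matching(adj, n):
    """Find a perfect matching via augmenting paths (Kuhn's algorithm).
    adj[u][v] is the multiplicity of edge u->v; returns {source: dest}."""
    match = [-1] * n  # match[v] = source currently matched to destination v

    def try_augment(u, seen):
        for v in range(n):
            if adj[u][v] > 0 and v not in seen:
                seen.add(v)
                if match[v] == -1 or try_augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in range(n):
        if not try_augment(u, set()):
            return None  # no perfect matching (demand not regular)
    return {match[v]: v for v in range(n)}

def decompose(adj):
    """Peel perfect matchings off the multigraph until it is empty; each
    matching is one wavelength's non-interfering assignment."""
    n = len(adj)
    adj = [row[:] for row in adj]  # work on a copy
    matchings = []
    while any(any(row) for row in adj):
        m = perfect_matching(adj, n)
        if m is None:
            break
        for u, v in m.items():
            adj[u][v] -= 1
        matchings.append(m)
    return matchings
```

For a 2-regular demand on 3 nodes, `decompose` yields 2 matchings, i.e., 2 wavelengths suffice, matching Δ(G′)=2.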

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we need only the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′)≤k.

Theorem 4: A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′)≤k wavelengths using Algorithm 1.

Proof: DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (Δ(G′)−1)-regular; we continue extracting matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′)≤k, as the out-degree and in-degree of any node must be ≤k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The representative ones are: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as with Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on a regular bipartite graph is well studied; we leverage existing algorithms in the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can simply pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1, we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets, using netcat, to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for 40 ports (a 40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
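A minimal sketch combining the measured components above (the ~10ms transceiver re-initialization term comes from the prototype measurement in §5.1.1 and is an addition on top of the t_r formula; the function name is ours):

```python
def reconfig_delay_ms(t_eps, t_wss, t_ctrl, t_init=0.0):
    """Total reconfiguration delay: EPS and WSS reconfigure in parallel,
    so only the larger of the two adds to the control-plane delay;
    t_init models the transceiver re-initialization seen on the prototype."""
    return max(t_eps, t_wss) + t_ctrl + t_init

# Average measured components for the 40-port case (in ms):
t_eps = 5.12                          # EPS OpenFlow route setup
t_wss = 3.23                          # 8x1 WSS switching
t_ctrl = 0.53 + 0.45 + 4.31 + 0.078   # assignment + basemesh routing + OWS GPIO + RTT
print(round(reconfig_delay_ms(t_eps, t_wss, t_ctrl, t_init=10.0), 2))  # -> 20.49
```

With the ~10ms transceiver re-initialization included, the sum reproduces the ~20ms end-to-end gap observed in §5.1.1.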



the optical insertion loss caused by broadcasting, which ensures the signal strength is within the receiver sensitivity range (the detailed power budgeting design is in §4.1). As shown in Figure 2, any signal is amplified once at its sender, and there are no additional amplification stages in the intermediate or receiver nodes. The signal from a sender also does not loop back: it terminates at the last receiver on the ring (e.g., signals from OWS 1 terminate at OWS 3), so that no physical loop is formed.

Receiving component: We use a WSS and an optical demultiplexer (DEMUX) at the receiving end. The design highlight is using the WSS as a w×1 wavelength multiplexer to intercept all the wavelengths from all the other nodes on the ring. In prior work [6, 11, 38], a WSS was used as a 1×w demultiplexer that takes 1 input fiber of k wavelengths and outputs any subset of these k wavelengths to any of w output fibers. In MegaSwitch, the WSS is repurposed as a w×1 multiplexer, which takes w input fibers with k wavelengths each and outputs a non-interfering subset of the k×w wavelengths to an output fiber. With this unconventional use of the WSS, MegaSwitch ensures that any node can simultaneously choose any wavelengths from any other nodes, enabling unconstrained connections. Then, based on the demands among nodes, the WSS selects the right wavelengths and multiplexes them onto the fiber to the DEMUX. In §3.2, we introduce algorithms to handle wavelength assignments. The multiplexed signals are then de-multiplexed by the DEMUX to the uplink ports on the EPS.

Supporting ∗-cast [51]: Unicast and multicast can be set up with a single wavelength on MegaSwitch, because any wavelength from a source can be intercepted by every node on the ring. To set up a unicast, MegaSwitch assigns a wavelength on the fiber from the source to the destination and configures the WSS at the destination to select this wavelength. Consider a unicast from H1 to H6 in Figure 2. We first assign wavelength λ1 from node 1 to node 2 for this connection and configure the WSS in node 2 to select λ1 from node 1. Then, with the routing in both EPSes configured, the unicast circuit from H1 to H6 is established. Further, to set up a multicast from H1 to H6 and H9 based on the unicast above, we just need to additionally configure the WSS in node 3 to also select λ1 from node 1. In addition, many-to-one (incast) and many-to-many (allcast) communications are composites of many unicasts and/or multicasts, and they are supported using multiple wavelengths.

Scalability: MegaSwitch's port count is n×k.
• For n: with existing technology, it is possible to achieve a 32×1 WSS (thus n=w+1=33). Furthermore, alternative optical components such as AWGRs and filter arrays can be used to achieve the same functionality as WSS+DEMUX for MegaSwitch [56]. Since 48×48 or 64×64 AWGRs are available, we expect to see a 1.5× or 2× increase, with 49 or 65 nodes on a ring, while the additional splitting loss can be compensated.
• For k: a C-band Dense-WDM (DWDM) link can support k=96 wavelengths at 50GHz channel spacing, and k=192 wavelengths at 25GHz [35]. Recent progress on optical comb sources [30, 53] assures the relative wavelength accuracy among DWDM signals and promises to enable a low-power-consumption DWDM transceiver array at 25GHz channel spacing.

As a result, with existing technology, MegaSwitch supports up to n×k=33×192=6336 ports on a single ring.

MegaSwitch can also be expanded multi-dimensionally. In our OWS implementation (§4), two inter-OWS ports are used in the ring, and we reserve another two for a future extension of MegaSwitch in another dimension. When extended bi-dimensionally (into a torus) [55], MegaSwitch scales as n²×k, supporting over 811K ports (n=65, k=192), which should accommodate modern DCNs [42, 46]. However, such a large scale comes with inherent routing, management, and reliability challenges, and we leave it as future work.

MegaSwitch vs. Mordia: Mordia [38] is perhaps the closest related work to MegaSwitch in terms of topology: both are constructed as a ring, and both adopt WSSes. However, the key difference is that Mordia chains multiple WSSes on one fiber, each WSS acting as a wavelength demultiplexer to set up circuits between ports on the same fiber, whereas MegaSwitch reverses the WSS into a wavelength multiplexer across multiple fibers, to enable non-blocking connections between ports on different fibers.

We note that Farrington et al. [11] further discussed scaling Mordia with multiple fibers non-blockingly (referred to as Mordia-PTL below). Unlike MegaSwitch's WSS-based multi-fiber ring, they proposed to connect each EPS to multiple fiber rings in parallel and to rely on the EPS to schedule circuits for servers to support TDM across fibers; this falls beyond the capability of existing commodity switches [24]. Further, on each fiber, they allow every node to add signals to the fiber and require an optical amplifier in each node to boost the signals and compensate for losses (e.g., bandpass add/drop filter loss, variable optical attenuator loss, 10/90 splitter loss, etc.); thus they need multiple amplifiers per fiber. This degrades the optical signal-to-noise ratio (OSNR) and makes transmission error-prone. In Figure 3 we compare the OSNR of MegaSwitch and Mordia, assuming 7dBm per-channel launch power, 16dB pre-amplification gain in MegaSwitch, and 23dB boosting gain per node in Mordia. The results show that for MegaSwitch, the OSNR stays at ~36dB and remains constant across all nodes; for Mordia, the OSNR quickly degrades to 30dB after 7 hops. For validation, we have also measured the average OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

[Figure 3: OSNR comparison of MegaSwitch and Mordia (multi-hop ring) vs. the number of nodes on the ring (0-15), with the 5-node MegaSwitch prototype measurement and the acceptable SNR level (32dB) marked; y-axis 20-40dB.]
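The trend in Figure 3 follows from a standard amplifier-cascade approximation (our sketch, not the exact link model used for the figure): each amplification stage contributes noise, so the inverse linear OSNRs of the stages add. MegaSwitch amplifies each signal once regardless of ring size, while Mordia-PTL adds a boosting stage per node traversed. The 36dB per-stage figure below is an illustrative assumption:

```python
import math

def cascade_osnr_db(per_stage_osnr_db, stages):
    """OSNR after a cascade of identical amplification stages:
    ASE noise accumulates, so the inverse linear OSNRs sum."""
    inv = stages * 10 ** (-per_stage_osnr_db / 10.0)
    return -10.0 * math.log10(inv)

per_stage = 36.0  # dB, illustrative single-stage OSNR (assumption)
print(round(cascade_osnr_db(per_stage, 1), 1))  # one amplification: unchanged
print(round(cascade_osnr_db(per_stage, 7), 1))  # seven per-node boosts: noticeably lower
```

This is why a single amplification stage per signal keeps MegaSwitch's OSNR flat as nodes are added, while per-node boosting degrades OSNR with hop count.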

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated at the last node on the ring, thus there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. Moreover, they did not consider fault-tolerance or an actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.

3.2 Control Plane

As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure the routing in EPSes and the wavelength assignments in OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments and pushes them to the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithms, and then design the basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment

In MegaSwitch, each node uses the same k wavelengths to communicate with the other nodes. The control plane assigns the wavelengths to satisfy the communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the selected wavelengths to a particular receiver share the same output fiber through the w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, which has 3 nodes, each with 4 wavelengths. The demand is in (a): each entry is the number of required wavelengths for a sender-receiver pair. If the wavelengths are assigned as in (b), then two conflicts occur. A non-interfering assignment is shown in (c), where no two identical wavelengths go to the same receiver.

[Figure 4: Wavelength assignments (n=3, k=4).]

Given this constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph as in Figure 4: multiple edges between two nodes correspond to the multiple wavelengths needed between them. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges so that no two adjacent edges share the same color, which is exactly the edge coloring problem [9].

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch:

Theorem 1: Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh

As the WSS port count increases, MegaSwitch reconfigures in milliseconds (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as the basemesh, which should achieve two goals: 1) the number of wavelengths dedicated to the basemesh, b, should be adjustable⁴; 2) with a given b, guarantee low average latency for all pairs of nodes.

Essentially, the basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on the basemesh is similar to performing a look-up on a DHT network. Also, the routing table size (the number of known peer addresses) of each node in a DHT is analogous to b, the number of wavelengths to other nodes. Meanwhile, our problem differs from DHT: we assume the centralized controller

³ We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴ The basemesh alone can be considered as a b/k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].


calculates the routing tables for the EPSes, and the routing tables are updated for adjustments of b and for peer churn (infrequent; it only happens upon failures, §3.3).

[Figure 5: Average latency on the basemesh: average hops and average latency (assuming 20µs per hop) vs. the number of wavelengths in the basemesh (10, 20, 32, ..., 192), for the Symphony-based basemesh and Chord (average hops log₂33).]

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts, drawn from a family of harmonic probability distributions p_n(x) = 1/(x ln n). (If a pair is selected, the corresponding entry in the demand matrix is incremented by 1.) The routing table on the EPS of each node is configured with greedy routing that attempts to minimize the absolute distance to the destination at each hop [29].
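The shortcut construction and the greedy forwarding rule can be sketched as follows (our illustrative rendering; the discrete inverse-CDF sampling and the seeded `rng` are assumptions, not the paper's exact code):

```python
import random

def symphony_links(node, n, b, rng):
    """Basemesh links for `node`: one wavelength to the ring successor,
    plus b-1 shortcuts drawn from the harmonic pdf p_n(x) = 1/(x ln n)."""
    links = {(node + 1) % n}                   # directed ring successor
    while len(links) < b:
        x = n ** (rng.random() - 1.0)          # inverse-CDF sample on [1/n, 1]
        target = (node + max(1, round(x * n))) % n
        if target != node:
            links.add(target)
    return links

def greedy_route(src, dst, links, n):
    """Forward to the neighbor minimizing the remaining clockwise distance."""
    hops, cur = 0, src
    while cur != dst:
        cur = min(links[cur], key=lambda t: (dst - t) % n)
        hops += 1
    return hops

n, b = 33, 4
rng = random.Random(1)
links = {i: symphony_links(i, n, b, rng) for i in range(n)}
print(max(greedy_route(0, d, links, n) for d in range(1, n)))  # worst-case hop count
```

Greedy forwarding always makes progress (the successor link reduces the clockwise distance by 1, and the minimum can only do better), so every route terminates.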

It is shown that the expected number of hops on this topology is O(log²(n)/b) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size 4, average path length O(log(n))), and also plot the expected latency assuming 20µs per-hop latency. Notably, if b≥n−1, the basemesh is fully connected, with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.

As we will show in §5.2, while the basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:
• It increases throughput for rapidly changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.
• It mitigates the impact of inaccurate demand estimation by providing consistent connectivity. Without the basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators and intend to explore the problem of optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: The basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, ProjecToR's construction is based on the optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, the basemesh can provide consistent connections between all pairs with provable average latency. It is possible to configure the basemesh in ProjecToR as its dedicated topology, to better handle unpredictable latency-critical traffic.

[Figure 6: PRF module layout.]

3.3 Fault-tolerance

We consider the following failures in MegaSwitch.

OWS failure: We implement the OWS box so that a node failure will not affect the operation of the other nodes on the ring. As shown in Figure 6, the integrated passive component (the Passive Routing Fabric, PRF) handles signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When an OWS loses function in its active part (e.g., power loss), the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be easily attached to and detached from the active switching modules, so one can recover a failed OWS without service interruption to the rest of the network by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in the OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes; 2) the receiving switch for each fiber at the input of the WSS selects the direction from which there is still incoming signal. The connectivity is then restored. A passive 4-port fixed splitter is added before the receiving switch as part of the ring, and it guides the signals from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

[Figure 7: MegaSwitch fault-tolerance.]

⁵ This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWGs) as the MUX and DEMUX for DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmitting, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver,

⁶ It is implementable with a common Mach-Zehnder interferometer.
⁷ The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

[Figure 8: The OWS box we implemented.]

the signal strength is ~−9dBm per channel. The PRF configuration is therefore determined by w (the WSS radix).

[Figure 9: Power budget on a fiber.]
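These 1:(i−1) ratios equalize the power dropped to every receiver on the fiber. A small sketch, assuming (per our reading of Figure 6) that a signal traverses the splitters in decreasing index order and that the last node terminates the fiber:

```python
def drop_fractions(w, p0=1.0):
    """Fraction of the launched power delivered to each of w receivers
    when the i-th splitter (i = w ... 2) drops 1/i and passes (i-1)/i."""
    out, p = [], p0
    for i in range(w, 1, -1):
        out.append(p / i)        # drop port: ratio 1:(i-1) means 1/i of input
        p *= (i - 1) / i         # continue port keeps the rest
    out.append(p)                # the final receiver terminates the fiber
    return out

print([round(f, 6) for f in drop_fractions(4)])  # -> [0.25, 0.25, 0.25, 0.25]
```

Each receiver gets the same 1/w fraction, which is why a single amplification stage at the sender suffices to keep every drop within the receiver sensitivity range.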

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, of which 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect the EPSes to the optical ports on the OWSes. With the total broadcasting loss at 10.5dB for the OWS, 4dB for the WSS, 2.5dB for the DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect to the other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives the wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control plane network, and configures the WSS via GPIO pins.

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 switches with 48 10GbE ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to servers. For the other, we use only 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40 10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management


[Figure 10: The MegaSwitch prototype implemented with 5 OWSes.]

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted on a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted on the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divide the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPSes to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and the set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, and WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation, we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer attain the 11.5µs switching seen in [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we need to confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, at the current speed of our prototype, MegaSwitch may still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic at Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of the results follows:
• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees ~20ms reconfiguration delay, with ~3ms for WSS switching.
• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack Redis queries, due to the basemesh.
• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.
• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40 10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU- or disk-I/O-bound, to stress the network we run the following synthetic traffic patterns, used in Helios [12]:
• Node-level Stride (NStride): Number the nodes (EPSes) from 0 to n−1. For the i-th node, its j-th host initiates a TCP flow to the j-th host in the ((i+l) mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in each new period.


[Figure 11: Node-level stride experiment: (a) throughput (Mbps, up to 4×10⁵) over 0-30s; (b), (c) magnified views around the reconfigurations at 10s and 20s.]

• Host-level Stride (HStride): Number the hosts from 0 to n×k−1. The i-th host sends a TCP flow to the ((i+k+l) mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
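The three patterns above can be sketched as per-period generators of (source, destination) pairs (illustrative code; host numbering as in the text):

```python
import random

def nstride(n, k, l):
    """Node-level Stride: host j of node i sends to host j of node (i+l) mod n."""
    return [((i, j), ((i + l) % n, j)) for i in range(n) for j in range(k)]

def hstride(n, k, l):
    """Host-level Stride: host i sends to host (i + k + l) mod (n*k)."""
    return [(i, (i + k + l) % (n * k)) for i in range(n * k)]

def random_matching(n, k, rng):
    """Random: a fresh perfect matching over all n*k hosts each period."""
    hosts = list(range(n * k))
    rng.shuffle(hosts)
    return list(zip(hosts[::2], hosts[1::2]))

# One period of each pattern at the prototype scale (n=5 nodes, k=8 hosts each):
print(len(nstride(5, 8, 1)), len(hstride(5, 8, 1)),
      len(random_matching(5, 8, random.Random(0))))  # -> 40 40 20
```

Rotating l in NStride moves all of a node's traffic at once, while in HStride it shifts only one wavelength's worth of demand per rack, matching the throughput-drop behavior reported below.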

For this experiment, the basemesh is disabled. We set the traffic stability t=10s, and the wavelength assignments are calculated and delivered to the OWSes whenever the demand between 2 nodes changes. We study the impact of the stability t later, in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node; HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfiguration at 10s and 20s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms⁸, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control plane delays (~4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸ Our transceiver's receiver Loss-of-Signal Deassert is −22dBm.
⁹ A PageRank instance using a 26GB dataset [32].
¹⁰ A clustering algorithm that partitions a dataset into K clusters. The input is the Wikipedia Page Traffic Statistics dataset [47].

[Figure 12: WSS switching speed: optical power at the original port (switching off) and the destination port (switching on), −40dB to 0dB, over 0-8ms.]

[Figure 13: Completion times of Spark jobs (WikiPageRank, WordCount, KMeans) for a single ToR, MegaSwitch (0.1s update), and MegaSwitch (1s update), 0-450s.]

[Figure 14: Redis intra-/cross-rack query completion times from nodes 1-5, for b=4, 3, 2, 1 (0-150µs).]

and WordCount¹¹. We first connect all 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected at full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it runs, and then convert it into two series of averaged bandwidth demand matrices at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, respectively.

We plot the job completion times in Figure 13. With a ~20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1 = 80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable; e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance for both update periods is similar, but the smaller update interval performs slightly worse, due to one more wavelength reconfiguration: for example, in WikiPageRank, MegaSwitch with a 1s update interval shows a 4.94% lower completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count: with 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹ It counts words in a 40GB dataset with 7.153×10⁹ words.

5.2 Large-Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks at millisecond granularity. We assume a proactive controller: it is informed of the demand change for the next traffic stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
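The simulator's per-wavelength bandwidth sharing can be sketched as standard progressive filling (our sketch; the actual simulator implementation may differ):

```python
def max_min_fair(demands, capacity):
    """Allocate a wavelength's capacity among flows max-min fairly:
    repeatedly split the residual capacity equally among unsatisfied
    flows, capping each flow at its demand."""
    alloc = {f: 0.0 for f in demands}
    active = {f for f, d in demands.items() if d > 0}
    left = capacity
    while active and left > 1e-12:
        share = left / len(active)
        for f in sorted(active):
            take = min(share, demands[f] - alloc[f])
            alloc[f] += take
            left -= take
            if demands[f] - alloc[f] < 1e-12:
                active.discard(f)        # flow satisfied; free its share
    return alloc

# Three flows on one 10Gbps wavelength: the small flow is satisfied,
# the other two split the remainder evenly.
alloc = max_min_fair({'a': 1, 'b': 6, 'c': 8}, 10)
print({f: round(x, 3) for f, x in sorted(alloc.items())})  # -> {'a': 1.0, 'b': 4.5, 'c': 4.5}
```

Applying this per wavelength at each millisecond tick yields flow throughputs without simulating individual packets.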

We use the synthetic patterns in §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: This Hive/MapReduce trace was collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10^5 flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10^4.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (with the same information collected as above) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10^6 flows). The size of shuffle and read varies

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10^4.

Figure 15: Throughput vs. traffic stability ((a) with basemesh, (b) without basemesh; y-axis: fraction of full bisection bandwidth; x-axis: stability period, 10ms to 1000ms; patterns: Random, NStride, HStride).

Figure 16: Throughput vs. reconfiguration frequency ((a) with basemesh, (b) without basemesh; y-axis: normalized throughput; x-axis: reconfiguration period, 10ms to 1000ms; traces: Facebook, IDC-HDFS, IDC-Shuffle).

Impact of traffic stability. We vary the traffic stability period t from 10ms to 1s for the synthetic patterns and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) if the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput; 2) all three patterns see throughput steadily increasing to full bisection bandwidth with longer stability periods; 3) although the basemesh reserves wavelengths to maintain connectivity, the throughput of the different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.
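Observation 1) follows from a back-of-the-envelope duty-cycle model (our own sketch, not from the paper): demand that must wait for reconfiguration is dark for the first 20ms of each stability period, so at most a (t − 20ms)/t fraction of the period carries traffic:

```python
def usable_fraction(t_ms, reconf_delay_ms=20.0):
    """Upper bound on the throughput fraction for traffic that must wait
    for a wavelength reconfiguration at the start of each stability
    period of length t_ms."""
    return max(0.0, (t_ms - reconf_delay_ms) / t_ms)
```

This matches the trend in the results: near zero for t ≤ 20ms (the NStride case) and approaching full bisection bandwidth as t grows toward 1s.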

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure for each period. Basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with basemesh ((a) full bisection bandwidth for Random, NStride, HStride; (b) normalized throughput for Facebook and IDC; x-axis: wavelengths in basemesh, 8 to 192).

In summary, MegaSwitch provides near full-bisection

bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces, collected from production DCNs, are very stable and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that, with good load balancing, traffic is stable on a sub-second scale and thus suitable for MegaSwitch. We also observe that the traffic is wide-spread in the realistic traces: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to only one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic); increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput (y-axis: bisection bandwidth per wavelength, 0 to 6000; x-axis: port count, 128 to 6144; structures: Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, MegaSwitch).

Figure 19: Worst-case bisection throughput (same structures and axes as Figure 18).

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10^4 times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes,¹³ because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components    OSA         Quartz  Mordia      Mordia-PTL  MegaSwitch  FatTree
Amplifier     0           6144    768         768         32          0
WSS           32 (1×4)    960     768 (1×4)   768 (1×4)   32 (32×1)   0
WDM Filter    0           0       768         768         0           0
OCS           1×128-port  0       1×32-port   0           0           0
Circulator    3072        0       0           0           0           0
Transceivers  3072        3072    3072        3072        3072        6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, FatTree needs many more EPSes (and thus transceivers), and the cost difference will grow as the network scales [38].

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher,¹⁵ we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost relative to a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd Ratul Mahajan for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] AL-FARES, M., LOUKISSAS, A., AND VAHDAT, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] AL-FARES, M., RADHAKRISHNAN, S., RAGHAVAN, B., HUANG, N., AND VAHDAT, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] ANNEXSTEIN, F., AND BAUMSLAG, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] CEGLOWSKI, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] CHEN, K., SINGLA, A., SINGH, A., RAMACHANDRAN, K., XU, L., ZHANG, Y., WEN, X., AND CHEN, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] CHEN, K., WEN, X., MA, X., CHEN, Y., XIA, Y., HU, C., AND DONG, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] CHOWDHURY, M., ZHONG, Y., AND STOICA, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] COLE, R., AND HOPCROFT, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] DOERR, C. R., AND OKAMOTO, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] FARRINGTON, N., FORENCICH, A., PORTER, G., SUN, P.-C., FORD, J., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] FARRINGTON, N., PORTER, G., RADHAKRISHNAN, S., BAZZAZ, H. H., SUBRAMANYA, V., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] GAETA, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] GEIS, M., LYSZCZARZ, T., OSGOOD, R., AND KIMBALL, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] GHOBADI, M., MAHAJAN, R., PHANISHAYEE, A., DEVANUR, N., KULKARNI, J., RANADE, G., BLANCHE, P.-A., RASTEGARFAR, H., GLICK, M., AND KILPER, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] GREENBERG, A., HAMILTON, J. R., JAIN, N., KANDULA, S., KIM, C., LAHIRI, P., MALTZ, D. A., PATEL, P., AND SENGUPTA, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] HALPERIN, D., KANDULA, S., PADHYE, J., BAHL, P., AND WETHERALL, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] HAMEDAZIMI, N., QAZI, Z., GUPTA, H., SEKAR, V., DAS, S. R., LONGTIN, J. P., SHAH, H., AND TANWER, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] HEDLUND, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] HINDEN, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] KANDULA, S., SENGUPTA, S., GREENBERG, A., PATEL, P., AND CHAIKEN, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] LIGHTWAVE. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] LIN, X.-W., HU, W., HU, X.-K., LIANG, X., CHEN, Y., CUI, H.-Q., ZHU, G., LI, J.-N., CHIGRINOV, V., AND LU, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] LIU, H., LU, F., FORENCICH, A., KAPOOR, R., TEWARI, M., VOELKER, G. M., PAPEN, G., SNOEREN, A. C., AND PORTER, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] LIU, V., HALPERIN, D., KRISHNAMURTHY, A., AND ANDERSON, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] LIU, Y. J., GAO, P. X., WONG, B., AND KESHAV, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] LOVÁSZ, L., AND PLUMMER, M. D. Matching Theory, vol. 367. American Mathematical Society, 2009.

[28] MALKHI, D., NAOR, M., AND RATAJCZAK, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] MANKU, G. S., BAWA, M., RAGHAVAN, P., ET AL. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] MARTINEZ, A., CALO, C., ROSALES, R., WATTS, R., MERGHEM, K., ACCARD, A., LELARGE, F., BARRY, L., AND RAMDANE, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] MCKEOWN, N., ANDERSON, T., BALAKRISHNAN, H., PARULKAR, G., PETERSON, L., REXFORD, J., SHENKER, S., AND TURNER, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] METAWEB TECHNOLOGIES. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] MISRA, J., AND GRIES, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] PASQUALE, F. D., MELI, F., GRISERI, E., SGUAZZOTTI, A., TOSETTI, C., AND FORGHIERI, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] PFAFF, B., PETTIT, J., AMIDON, K., CASADO, M., KOPONEN, T., AND SHENKER, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] PI, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] PORTER, G., STRONG, R., FARRINGTON, N., FORENCICH, A., SUN, P.-C., ROSING, T., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] QIAO, C., AND MEI, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A Scalable Content-Addressable Network, vol. 31. ACM, 2001.

[41] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND SNOEREN, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] SANFILIPPO, S., AND NOORDHUIS, P. Redis. http://redis.io, 2010.

[44] SCHRIJVER, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop Distributed File System. In IEEE MSST (2010).

[46] SINGH, A., ONG, J., AGARWAL, A., ANDERSON, G., ARMISTEAD, A., BANNON, R., BOVING, S., DESAI, G., FELDERMAN, B., GERMANO, P., ET AL. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] SKOMOROCH, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] VASSEUR, J.-P., PICKAVET, M., AND DEMEESTER, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] WANG, G., ANDERSEN, D., KAMINSKY, M., PAPAGIANNAKI, K., NG, T., KOZUCH, M., AND RYAN, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] WANG, H., XIA, Y., BERGMAN, K., NG, T., SAHU, S., AND SRIPANIDKULCHAI, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] WON, R., AND PANICCIA, M. Integrating silicon photonics. 2010.

[53] YAMAMOTO, N., AKAHANE, K., KAWANISHI, T., KATOUF, R., AND SOTOBAYASHI, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] ZHU, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] ZHU, Z., ZHONG, S., CHEN, L., AND CHEN, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for non-blocking connectivity

We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V,E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k ∀u∈V and Σ_u k_(u,v) ≤ k ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination); each edge points from a source node to a destination node. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying an arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest possible value, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4. A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, after which the residual graph becomes (Δ(G′)−1)-regular; we continue extracting matchings one by one until we have Δ(G′) perfect matchings.¹⁷ Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands and assigns a wavelength to the matching generated in each iteration.

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many

wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as for Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on a regular bipartite graph is well studied; we leverage existing algorithms from the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.
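The path-length trade-offs above can be made concrete with a toy greedy router on an n-node ring, where each node holds direct wavelengths at certain clockwise offsets (a hypothetical sketch; the actual basemesh routing tables are computed by the controller):

```python
def ring_hops(n, src, dst, offsets):
    """Greedy clockwise routing on an n-node ring: at each step take the
    largest direct offset (wavelength) that does not overshoot the
    destination. `offsets` must contain 1 so progress is always possible."""
    hops, cur = 0, src
    while cur != dst:
        ahead = (dst - cur) % n
        step = max(o for o in offsets if o <= ahead)
        cur = (cur + step) % n
        hops += 1
    return hops
```

With offsets = [1] (a plain ring), reaching the farthest of 5 nodes takes 4 ring hops (5 EPS traversals, matching the b=1 case in §5.1.1), while Chord-style power-of-two offsets give logarithmic worst-case paths.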

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1 we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.
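The parallel composition can be written directly as a trivial helper (our own sketch; the measured component delays are reported below):

```python
def total_reconfig_delay(t_e, t_o, t_c):
    """EPS (t_e) and WSS (t_o) reconfigure in parallel, so only the
    slower of the two matters; the control-plane delay t_c adds on top."""
    return max(t_e, t_o) + t_c
```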

EPS configuration delay. To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.



Figure 3: OSNR comparison (MegaSwitch vs. Mordia multi-hop ring, with the measured 5-node MegaSwitch prototype; acceptable SNR is 32dB; y-axis: 20dB to 40dB; x-axis: nodes on ring, 0 to 15).

...age OSNR on our MegaSwitch prototype (§4), which is 38.12dB after 5 hops (plotted in the figure for reference).

Furthermore, there exists a physical loop in each fiber of their design, where optical signals can circle back to their origin, causing interference if not handled properly. MegaSwitch by design avoids most of these problems: 1) each fiber has only one sender, and thus one stage of amplification (power budgeting is explained in §4.1); 2) each fiber is terminated at the last node in the ring, so there is no physical optical loop or recirculated signal (Figure 9). Therefore, MegaSwitch maintains good OSNR when scaled to more nodes. More importantly, beyond the above, they do not consider fault-tolerance or an actual implementation in their paper [11]. In this work, we have built, with significant effort, a functional MegaSwitch prototype with commodity off-the-shelf EPSes and our home-built OWSes.

3.2 Control Plane

As in prior work [6, 12, 38, 50], MegaSwitch takes a centralized approach to configure the routing in the EPSes and the wavelength assignments in the OWSes. The MegaSwitch controller converts traffic bandwidth demands into wavelength assignments and pushes them to the OWSes. The demands can be obtained by existing traffic estimation schemes [2, 12, 50]. In what follows, we first introduce the wavelength assignment algorithm, and then design the basemesh to handle unstable traffic and latency-sensitive flows during reconfigurations.

3.2.1 Wavelength Assignment

In MegaSwitch, each node uses the same k wavelengths to communicate with the other nodes. The control plane assigns the wavelengths to satisfy the communication demands among nodes. In a feasible assignment, wavelengths from different senders must not interfere with each other at the receiver, since all the selected wavelengths to a particular receiver share the same output fiber through the w×1 WSS (see Figure 2). We illustrate an assignment example in Figure 4, which has 3 nodes, each with 4 wavelengths. The demand is in (a); each entry is the number of required wavelengths for a sender-receiver pair. If the wavelengths are assigned as in (b), two conflicts occur. A non-interfering assignment is shown in (c), where no two identical wavelengths go to the same receiver.

Given this constraint, we reduce the wavelength assignment problem in MegaSwitch to an edge coloring problem on a bipartite multigraph. We express the bandwidth demand matrix as a bipartite multigraph as in Figure 4: multiple edges between two nodes correspond to the multiple wavelengths needed by them. Assume each wavelength has a unique color; then a feasible wavelength assignment is equivalent to an assignment of colors to the edges so that no two adjacent edges share the same color, which is exactly the edge coloring problem [9].

Figure 4: Wavelength assignments (n=3, k=4).

The edge coloring problem is NP-complete on general graphs [6, 9], but has efficient optimal solutions on bipartite multigraphs [7, 44]. By adopting the algorithm in [7], we prove the following theorem³ (see Appendix), which establishes the rearrangeably non-blocking property of MegaSwitch:

Theorem 1. Any feasible bandwidth demand can be satisfied by MegaSwitch with at most k unique wavelengths.

3.2.2 Basemesh

As the WSS port count increases, MegaSwitch reconfigures at milliseconds (§4.2). To accommodate unstable traffic and latency-sensitive applications, a typical solution is to maintain a parallel electrical packet-switched fabric, as suggested by prior hybrid network proposals [12, 24, 50].

MegaSwitch emulates such a hybrid network with stable circuits, providing constant connectivity among nodes and eliminating the need for an additional electrical fabric. We denote this set of wavelengths as the basemesh, which should achieve two goals: 1) the number of wavelengths dedicated to the basemesh, b, should be adjustable⁴; 2) with a given b, guarantee low average latency for all pairs of nodes.

Essentially, the basemesh is an overlay network on MegaSwitch's physical multi-fiber ring, and building such a network with the above design goals has been studied in the overlay networking and distributed hash table (DHT) literature [29, 48]. Forwarding a packet on the basemesh is similar to performing a look-up on a DHT network. Also, the routing table size (number of known peer addresses) of each node in a DHT is analogous to the number of wavelengths to other nodes, b. Meanwhile, our problem differs from DHT: we assume the centralized controller

³We note that this problem can also be reduced to the well-known row-column-row permutation routing problem [4, 39], and the reduction is also a proof of Theorem 1.

⁴The basemesh alone can be considered as a b/k (k is the number of wavelengths per fiber) over-subscribed parallel electrical switching network, similar to that of [12, 24, 50].

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 581

calculates routing tables for the EPSes, and the routing tables are updated upon adjustments of b and upon peer churn (infrequent; it happens only during failures, §3.3).

Figure 5: Average latency on basemesh.

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts (drawn from a family of harmonic probability distributions p_n(x) = 1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1). The routing table on the EPS of each node is configured with greedy routing that attempts to minimize the absolute distance to the destination at each hop [29].

It is shown that the expected number of hops on this topology is O(log²(n)/b) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size is 4; average path length is O(log(n))), and also plot the expected latency, assuming 20 µs per-hop latency. Notably, if b≥n−1, the basemesh is fully connected with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.

As we will show in §5.2, while the basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:

• It increases throughput for rapidly changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.

• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without the basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance if applications desire more bandwidth in part of the network [42]. In this paper, we present b as a knob for DCN operators, and intend to explore the problem of optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]: The basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, its construction is based on the optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, the basemesh can provide consistent connections between all pairs with proven average latency. It is possible to configure the basemesh in ProjecToR as its dedicated topology, which better handles unpredictable latency-critical traffic.

Figure 6: PRF module layout.

3.3 Fault-tolerance

We consider the following failures in MegaSwitch.

OWS failure: We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (Passive Routing Fabric, PRF) handles the signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be attached to and detached from the active switching modules easily, so one can recover a failed OWS without service interruption of the rest of the network by simply swapping its active switching module.

Port failure: When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in the OWS to notify the controller of such events.

Controller failure: In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure: A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry the traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes; 2) the receiving switch for each fiber at the input of the WSS selects the direction where there is still incoming signal. Then the connectivity can be restored. A passive 4-port fixed splitter is added before the receiving switch as a part of the ring, and it guides the signals from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

Figure 7: MegaSwitch fault-tolerance.
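The receiving-side decision can be sketched as follows. This is our pseudologic for illustration, not the shipped controller; the −22 dBm default mirrors the transceivers' Loss-of-Signal deassert level mentioned in §5.1.1, and all names are ours.

```python
def select_directions(power_east, power_west, threshold_dbm=-22.0):
    # power_east/power_west: measured optical power (dBm) per fiber at the
    # two inputs of the receiving 2x1 switch; east is the primary direction.
    choice = {}
    for fiber in power_east:
        if power_east[fiber] >= threshold_dbm:
            choice[fiber] = "primary"     # no cut on the east path
        elif power_west[fiber] >= threshold_dbm:
            choice[fiber] = "secondary"   # east path cut; fall back to west
        else:
            choice[fiber] = "down"        # both lost: more than one cut
    return choice
```

A fiber is marked "down" only when both directions are below threshold, which corresponds to the multi-cut case where segments cannot be reconnected until cable replacement.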

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWGs) as the MUX and DEMUX for DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12 dB OA to boost the DWDM signal before transmission, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver

⁶It is implementable with a common Mach-Zehnder interferometer.

⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented.

the signal strength is ~−9 dBm per channel. The PRF configuration is therefore determined by w (the WSS radix).

Figure 9: Power budget on a fiber.
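A quick arithmetic check (our sketch, not the authors' code) confirms that these ratios equalize the power dropped at every node on a fiber: if the i-th splitter counted from the far end of the fiber drops 1/i of the incident power (ratio drop:pass = 1:(i−1)), all drops are equal.

```python
def drop_powers(n_drops, p_in=1.0):
    # Walk the fiber: splitters are met in decreasing index order, and the
    # i-th-from-the-end splitter drops 1/i of the remaining power; the final
    # "splitter" (i = 1) simply terminates the fiber and drops everything.
    powers, remaining = [], p_in
    for i in range(n_drops, 0, -1):
        drop = remaining / i
        powers.append(drop)
        remaining -= drop
    return powers
```

For a fiber feeding 4 receivers, each drop receives exactly a quarter of the launched power, before the fixed WSS/DEMUX/connector losses quoted below.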

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, and 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength from 190.5 THz to 193.5 THz at 200 GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to optical ports on OWSes. With the total broadcasting loss at 10.5 dB for the OWS, 4 dB for the WSS, 2.5 dB for the DEMUX, and 1 dB for cable connectors, the receiving power is −9 dBm to −12 dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect to other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700 MHz ARM CPU and 256 MB of memory. The OWS receives the wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control plane network, and configures the WSS via GPIO pins.
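For illustration, pushing an assignment in a UDP packet might look like the sketch below. The JSON message format, the port number, and the function names are our assumptions, not MegaSwitch's actual wire protocol.

```python
import json
import socket

def push_assignment(ows_addr, assignment, port=9000):
    # assignment: list of (sender, receiver, wavelength) tuples; serialized
    # as JSON and sent in a single UDP datagram to the OWS management port.
    msg = json.dumps({"assign": [list(t) for t in assignment]}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(msg, (ows_addr, port))
```

On the OWS side, the Raspberry Pi would decode such a datagram and drive the WSS accordingly over GPIO.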

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto 3922 switches with 48 × 10 Gbps Ethernet (10GbE) ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to the servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40 × 10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto 3295 Ethernet switch. The Ethernet management


Figure 10: The MegaSwitch prototype implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted on a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted on the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, and the WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation, we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5 µs switching seen by [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3 ms. We acknowledge that this is a hard limitation we need to confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, with the current speed of our prototype, MegaSwitch can still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic in Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane

and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of the results is as follows:

• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees ~20 ms reconfiguration delay, with ~3 ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries in Redis, due to the basemesh.

• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100 ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20 ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40 × 10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU- or disk-I/O-bound, to stress the network we run the following synthetic traffic used in Helios [12]:

• Node-level Stride (NStride): Numbering the nodes (EPSes) from 0 to n−1, for the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in a new period.


Figure 11: Experiment: Node-level stride.

• Host-level Stride (HStride): Numbering the hosts from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases the gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.

For this experiment, the basemesh is disabled. We set the traffic stability period t=10 s, and the wavelength assignments are calculated and delivered to the OWSes when the demand between 2 nodes changes. We study the impact of the stability period t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40 × 10 Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11(a), but drops by only 50 Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20 ms gap caused by the wavelength reconfiguration at 10 s and 20 s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10 ms⁸, and the remainder can be broken down into EPS configuration (~5 ms), WSS switching (~3 ms), and control plane delays (~4 ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸Our transceiver's receiver Loss of Signal Deassert is −22 dBm.

⁹A PageRank instance using a 26 GB dataset [32].

¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is the Wikipedia Page Traffic Statistics dataset [47].

Figure 12: WSS switching speed.

Figure 13: Completion time for Spark jobs.

and WordCount¹¹. We first connect all 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected with full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices at 0.1 s and 1 s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1 s and 1 s, respectively.

Figure 14: Redis intra-/cross-rack query completion.

We plot the job completion times in Figure 13. With ~20 ms reconfiguration delay and a 0.1 s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1 = 80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable: e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1 s and 1 s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of the two update periods is similar, but the smaller update interval has slightly worse performance, due to one more wavelength reconfiguration; for example, in WikiPageRank, MegaSwitch with a 1 s update interval shows a 4.94% lower completion time than with 0.1 s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40 GB dataset with 7.153×10⁹ words.

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; therefore, it configures the wavelengths every t seconds with a total reconfiguration delay of 20 ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
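The max-min sharing step of such a flow-level simulator can be sketched via progressive filling, as below. This is our illustration; the paper does not publish the simulator's code, and all names are ours.

```python
# Max-min fair rate allocation by progressive filling: repeatedly find the
# bottleneck link (smallest fair share among its active flows), freeze its
# flows at that share, and subtract their rates from every link they cross.
def max_min_rates(flows, capacity):
    # flows: {flow_id: [link_id, ...]}; capacity: {link_id: Gbps}
    cap = dict(capacity)
    rate = {}
    active = set(flows)
    while active:
        on_link = {}
        for f in active:
            for l in flows[f]:
                on_link.setdefault(l, set()).add(f)
        l_star = min(on_link, key=lambda l: cap[l] / len(on_link[l]))
        share = cap[l_star] / len(on_link[l_star])
        for f in on_link[l_star]:
            rate[f] = share
            for l in flows[f]:
                cap[l] -= share
            active.discard(f)
    return rate
```

For example, with flows A (links 1 and 2), B (link 1), and C (link 2), and capacities of 10 and 20 Gbps, link 1 is the bottleneck: A and B are frozen at 5 Gbps each, and C then claims the remaining 15 Gbps on link 2.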

We use the synthetic patterns of §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: This Hive/MapReduce trace was collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1 MB to 10 TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (the same information as above is collected) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of shuffles and reads varies

from 1 MB to 1 TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Figure 15: Throughput vs. traffic stability.

Figure 16: Throughput vs. reconfiguration frequency.

Impact of traffic stability: We vary the traffic stability period t from 10 ms to 1 s for the synthetic patterns, and plot the average throughput with and without the basemesh in Figure 15. We make three observations: 1) if the stability period is ≤20 ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput; 2) all three patterns see the throughput steadily increase to full bisection bandwidth with longer stability periods; 3) although the basemesh reserves wavelengths to maintain connectivity, the throughput of the different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1 s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10 ms and 20 ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10 ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride; thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10 ms without the basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10 ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with the basemesh.

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹², and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without the basemesh, respectively. This indicates that the traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that, with good load balancing, traffic is stable on a sub-second scale and thus is suitable for MegaSwitch. We also observe that the traffic in the realistic traces is wide-spread: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by dedicating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period 10 ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), then demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

Figure 18: Average bisection throughput.

Figure 19: Worst-case bisection throughput.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA/Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); ToR degree is set to 4 [6]. Quartz's port count is limited to the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric all the EPSes are connected via transceivers.

From the table, we find that compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2.25× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than the non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40 km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, the cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components    OSA         Quartz  Mordia     Mordia-PTL  MegaSwitch  FatTree
Amplifier     0           6144    768        768         32          0
WSS           32 (1×4)    960     768 (1×4)  768 (1×4)   32 (32×1)   0
WDM Filter    0           0       768        768         0           0
OCS           1×128-port  0       1×32-port  0           0           0
Circulator    3072        0       0          0           0           0
Transceivers  3072        3072    3072       3072        3072        6912

Table 1: Comparison of optical complexity at the same scale (3072 ports) in optical components used.

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs much fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires much fewer OAs (24× fewer than Mordia, and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses 32×1 WSS, and one may wonder about its cost versus 1×4 WSS. In fact, our design allows for low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switch by ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd Ratul Mahajan for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed by a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwery, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Society, 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-µm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for non-blocking connectivity

We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V,E) as a MegaSwitch logical graph, where V, E are the node/edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
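Definition 2 translates directly into a check on the demand matrix; a minimal sketch (the function name is ours), where `demand[u][v]` holds k_e for edge e=(u,v):

```python
def is_feasible(demand, k):
    """Definition 2: every node sends and receives at most k wavelengths.
    demand[u][v] is the number of wavelengths requested from u to v."""
    n = len(demand)
    sends_ok = all(sum(demand[u]) <= k for u in range(n))                  # row sums
    recvs_ok = all(sum(demand[u][v] for u in range(n)) <= k for v in range(n))  # column sums
    return sends_ok and recvs_ok
```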

Centralized optimal algorithm. To prove the wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes: s_i (sources) and d_i (destinations). Each edge points from a node in s_i to a node in d_i. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4. A feasible demand γ only needs Δ(G′), i.e., the minimum number of colors for edge-coloring: any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
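The decomposition can be sketched end to end in a few dozen lines. This is our own illustrative implementation, not the controller's code: the demand matrix is padded to a Δ(G′)-regular multigraph with dummy edges, and perfect matchings are then peeled off with Kuhn's augmenting-path algorithm, one wavelength per matching (all function names are ours):

```python
def pad_to_regular(demand):
    """Add dummy edges until every row and column sums to k = Δ(G')."""
    n = len(demand)
    d = [row[:] for row in demand]
    k = max(max(sum(r) for r in d), max(sum(c) for c in zip(*d)))
    for u in range(n):
        for v in range(n):
            # add as many dummy edges as both row and column slack allow
            d[u][v] += min(k - sum(d[u]), k - sum(d[i][v] for i in range(n)))
    return d, k

def perfect_matching(d):
    """Kuhn's augmenting-path matching over edges with remaining multiplicity."""
    n = len(d)
    match = [-1] * n                      # match[v] = source matched to destination v

    def augment(u, seen):
        for v in range(n):
            if d[u][v] > 0 and v not in seen:
                seen.add(v)
                if match[v] == -1 or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in range(n):
        if not augment(u, set()):         # always succeeds on a regular bipartite graph
            raise ValueError("input graph is not regular")
    return [(match[v], v) for v in range(n)]

def assign_wavelengths(demand):
    """One perfect matching per color; Δ(G') colors suffice (Theorem 4)."""
    d, k = pad_to_regular(demand)
    colors = []
    for _ in range(k):
        m = perfect_matching(d)
        for u, v in m:
            d[u][v] -= 1                  # residual graph is one degree less regular
        colors.append(m)
    return colors
```

Each returned color is a perfect matching, so no two links sharing a node get the same wavelength; dummy edges would simply be ignored when programming the WSS.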

Considerations for Basemesh Topology

As described in §3.2.2, basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are the representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones:
• Pastry suffers from long average path lengths when many

wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on MegaSwitch Prototype

In §5.1.1, we measured ~20ms reconfiguration delay for the MegaSwitch prototype, and here we break down this delay into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay. To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we can calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port, and the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.



[Figure 5 plot: average hops (1–10) and average latency (20µs–300µs) vs. the number of wavelengths in the basemesh (10, 20, 32, 192), comparing Basemesh (Symphony) against Chord (= log₂ 33 hops).]

Figure 5: Average latency on basemesh.

calculates routing tables for EPSes, and the routing tables are updated for adjustments of b and peer churn (infrequent, as it only happens in failures, §3.3).

We find that the Symphony [29] topology fits our goals. For b wavelengths in the basemesh, each node first uses one wavelength to connect to the next node, forming a directed ring; then each node chooses shortcuts to b−1 other nodes on the ring. We adopt the randomized algorithm of Symphony [29] to choose shortcuts (drawn from a family of harmonic probability distributions p_n(x) = 1/(x ln n); if a pair is selected, the corresponding entry in the demand matrix is incremented by 1). The routing table on the EPS of each node is configured with greedy routing, which attempts to minimize the absolute distance to the destination at each hop [29].
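The shortcut selection can be sketched as follows. This is our reading of Symphony's harmonic sampling (inverse-transform sampling of p_n(x) = 1/(x ln n) on [1/n, 1], i.e., x = n^(u−1) for uniform u), with hypothetical names:

```python
import random

def symphony_shortcuts(node, n, b, rng=random):
    """Pick b-1 shortcut peers for `node` on an n-node directed ring,
    drawing ring distances from the harmonic pdf p_n(x) = 1/(x ln n).
    Requires b <= n so that enough distinct peers exist."""
    assert 1 < b <= n
    peers = set()
    while len(peers) < b - 1:
        x = n ** (rng.random() - 1.0)        # distance as a fraction of the ring
        peer = (node + max(1, round(x * n))) % n
        if peer != node:
            peers.add(peer)
    return sorted(peers)
```

Re-running this for a single node when b changes only adds or drops that node's shortcuts, which is why (unlike Viceroy) resizing the basemesh leaves other wavelengths untouched.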

It is shown that the expected number of hops in this topology is O(log²(n)/k) for each packet. With n=33, we plot the expected path length in Figure 5 to compare it with Chord [48] (finger table size is 4; average path length is O(log(n))), and also plot the expected latency assuming 20µs per-hop latency. Notably, if b ≥ n−1, the basemesh is fully connected with 1 hop between every pair; adding more wavelengths cannot reduce the average hop count, but serves to reduce congestion.

As we will show in §5.2, while basemesh reduces the worst-case bisection bandwidth, the average bisection bandwidth is unaffected, and it brings two benefits:
• It increases throughput for rapidly-changing, unpredictable traffic, when the traffic stability period is less than the control plane delay, as shown in §5.2.
• It mitigates the impact of inaccurate demand estimation by providing consistent connections. Without basemesh, when the demand between two nodes is mistakenly estimated as 0, they are effectively cut off, which is costly to recover from (flows need to wait for wavelength setup).

Thanks to the flexibility of MegaSwitch, b is adjustable to accommodate varying traffic instability (§5.2), or can be partially/fully disabled for higher traffic imbalance if applications desire more bandwidth in part of the network [42]. In this paper we present b as a knob for DCN operators, and intend to explore the problem of the optimal configuration of b as future work.

Basemesh vs. Dedicated Topology in ProjecToR [15]. Basemesh is similar to the dedicated topology in ProjecToR, which is a set of static connections for low-latency traffic. However, its construction is based on the

Figure 6: PRF module layout.

optimization of weighted path length using traffic distributions from past measurements, which cannot provide an average latency guarantee for an arbitrary pair of racks. In comparison, basemesh can provide consistent connections between all pairs with proven average latency. It is possible to configure basemesh in ProjecToR as its dedicated topology, to better handle unpredictable latency-critical traffic.

3.3 Fault-tolerance

We consider the following failures in MegaSwitch:

OWS failure. We implement the OWS box so that a node failure will not affect the operation of other nodes on the ring. As shown in Figure 6, the integrated passive component (Passive Routing Fabric, PRF) handles the signal propagation across adjacent nodes, while the passive splitters copy the traffic to the local active switching modules (i.e., WSS and DEMUX). When one OWS loses its function (e.g., power loss) in its active part, the traffic traveling across the failed node from/to other nodes is not affected, because the PRF operates passively and still forwards the signals. Thus, the failure is isolated within the local node. We design the PRF as a modular block that can be attached to and detached from the active switching modules easily, so one can recover the failed OWS without service interruption of the rest of the network by simply swapping its active switching module.

Port failure. When a transceiver fails in a node, it causes a loss of capacity only at the local node. We have implemented built-in power detectors in OWS to notify the controller of such events.

Controller failure. In the control plane, two or more instances of the controller run simultaneously, with one leader. The fail-over mechanism follows VRRP [20].

Cable failure. A cable cut is the most difficult failure, and we design⁵ redundancy and recovery mechanisms to handle one cable cut. The procedure is analogous to ring protection in SONET [49] with 2 sets of fibers. We propose directional redundancy in MegaSwitch: as shown in Figure 7, the fiber can carry traffic from/to both the east (primary) and west (secondary) sides of the OWS, by selecting a direction at the sending 1×2 optical tunable splitter⁶ and the receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions⁷, so that the signal can reach all the OWSes; 2) the receiving switch for each fiber at the input of the WSS selects the direction where there is still incoming signal. Then the connectivity can be restored. A passive 4-port fixed splitter is added before the receiving switch as a part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.

⁵This design has not been implemented in the current prototype; thus the power budgeting on the prototype (§4.1) does not account for the fault-tolerance components on the ring. The components for this design are all commercially available and can be readily incorporated into future iterations of MegaSwitch.

Figure 7: MegaSwitch fault-tolerance.

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (Rack Unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWG) as MUX and DEMUX for the DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmitting, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver,

⁶It is implementable with a common Mach-Zehnder interferometer.
⁷The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented.

Figure 9: Power budget on a fiber.

the signal strength is ~−9dBm per channel. The PRF configuration is therefore determined by w (WSS radix).
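One way to read the 1:(i−1) rule is that the splitter with i downstream nodes remaining taps 1/i of the incoming power, which leaves every drop point with an equal share. The sketch below is our interpretation only: it ignores insertion/excess losses and the OA gain, and the function name is hypothetical:

```python
import math

def drop_powers_dbm(launch_dbm, n):
    """Power (dBm) tapped at each of n drop points along one fiber, when the
    splitter with `remaining` nodes downstream taps 1/remaining of the input
    (i.e., a 1:(remaining-1) drop:through ratio). Ideal, lossless splitters."""
    p_mw = 10 ** (launch_dbm / 10)       # convert dBm to mW
    drops = []
    for remaining in range(n, 0, -1):
        tap = p_mw / remaining           # fraction dropped at this splitter
        drops.append(10 * math.log10(tap))
        p_mw -= tap                      # the rest continues down the fiber
    return drops
```

Under this model, all n drop powers come out identical (launch power minus 10·log₁₀(n) dB), which matches the goal of keeping every receiver within the same sensitivity range.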

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for the 5-node ring.

Our OWS box provides 16 optical ports, and 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to the optical ports on OWSes. With the total broadcasting loss at 10.5dB for OWS, 4dB for WSS, 2.5dB for DEMUX, and 1dB for cable connectors, the receiving power is −9dBm to −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives the wavelength assignments (in UDP packets) via its Ethernet management interface, connected to a separate electrical control plane network, and configures the WSS via GPIO pins.
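A control loop of this shape can be sketched as below. The one-byte message format, UDP port, and pin map are our assumptions for illustration, and `set_pin` stands in for the GPIO library call (e.g., RPi.GPIO.output):

```python
import socket

WSS_SELECT_PINS = (17, 27, 22)           # assumed GPIO pins encoding the WSS port select

def decode_assignment(payload):
    """Map a one-byte WSS port index to per-pin logic levels (LSB first).
    The encoding is hypothetical, for illustration only."""
    port = payload[0]
    return [(port >> i) & 1 for i in range(len(WSS_SELECT_PINS))]

def serve(set_pin, host="0.0.0.0", udp_port=9000, max_msgs=None):
    """Receive wavelength assignments over UDP and drive the select pins.
    `set_pin(pin, level)` abstracts the actual GPIO write."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, udp_port))
    handled = 0
    while max_msgs is None or handled < max_msgs:
        payload, _addr = sock.recvfrom(64)
        for pin, level in zip(WSS_SELECT_PINS, decode_assignment(payload)):
            set_pin(pin, level)
        handled += 1
    sock.close()
```

The measured 4.31ms OWS processing time in the Appendix covers exactly this path, from receiving an assignment to setting the GPIO outputs.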

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 with 48×10Gbps Ethernet (GbE) ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to the servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40×10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management


Figure 10: The MegaSwitch prototype implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted in a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted in the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs in Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Reconfiguration speed. MegaSwitch's reconfiguration hinges on WSS, and the WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5µs switching seen by [38]. This is mainly because the port count of WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we need to confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, with the current speed on our prototype, MegaSwitch can still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic in Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation
We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics on MegaSwitch's data plane and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of results is as follows:
• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports when wavelengths are configured. MegaSwitch sees ~20 ms reconfiguration delay, with ~3 ms for WSS switching.
• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries in Redis, due to the basemesh.
• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100 ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.
• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20 ms total wavelength reconfiguration delay.

5.1 Testbed Experiments
5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40×10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU or disk I/O bound, to stress the network we run the following synthetic traffic patterns used in Helios [12]:
• Node-level Stride (NStride): Number the nodes (EPSes) from 0 to n−1. For the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in a new period.

584 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

[Figure 11 plots: bisection throughput (Mbps, 0–4×10⁵) vs. time; (a) 0–30 s, (b) zoom on 10.00–10.05 s, (c) zoom on 20.00–20.05 s.]

Figure 11: Experiment: Node-level stride.

• Host-level Stride (HStride): Number the hosts from 0 to n×k−1. The i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
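As a concrete illustration, the three patterns can be generated as source-destination pair lists in a few lines of Python (our sketch, not the authors' code; the function names and the (node, host) tuple encoding for NStride are our own):

```python
import random

def nstride(n, k, l):
    """Node-level Stride: host j of node i sends to host j of node (i+l) mod n."""
    return [((i, j), ((i + l) % n, j)) for i in range(n) for j in range(k)]

def hstride(n, k, l):
    """Host-level Stride: host h sends to host (h + k + l) mod (n*k)."""
    total = n * k
    return [(h, (h + k + l) % total) for h in range(total)]

def random_matching(n, k, seed=None):
    """Random: a random perfect matching over all n*k hosts."""
    rng = random.Random(seed)
    hosts = list(range(n * k))
    rng.shuffle(hosts)
    return [(hosts[i], hosts[i + 1]) for i in range(0, n * k, 2)]
```

Rotating l every t seconds then reproduces the abrupt (NStride) and gradual (HStride) demand shifts described above.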

For this experiment, the basemesh is disabled. We set the traffic stability t=10 s, and the wavelength assignments are calculated and delivered to the OWSes when the demand between 2 nodes changes. We study the impact of stability t later in §5.2.

As shown in Figure 11 (a), MegaSwitch maintains the full bandwidth of 40×10 Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11 (a), but drops by only 50 Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20 ms gap caused by the wavelength reconfiguration at 10 s and 20 s in the magnified (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10 ms⁸, and the remainder can be broken down into EPS configuration (~5 ms), WSS switching (~3 ms), and control plane delays (~4 ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch
We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25, and run three jobs: WikipediaPageRank⁹, K-Means¹⁰,

⁸Our transceiver's receiver Loss of Signal De-assert is −22 dBm.
⁹A PageRank instance using a 26GB dataset [32].
¹⁰A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].

[Figure 12: WSS switching speed; optical power (−40 dB to 0 dB) vs. time (0–8 ms), with switching-on and switching-off traces.]

[Figure 13: Completion time (0–450 s) for Spark jobs (WikiPageRank, WordCount, K-Means) under a single ToR, MegaSwitch (0.1 s), and MegaSwitch (1 s).]

[Figure 14: Redis intra-/cross-rack query completion time (0–150 μs) for Nodes 1–5, with b=4, 3, 2, 1.]

and WordCount¹¹. We first connect all the 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected with full bandwidth. We record the time series (with millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices, at 0.1 s and 1 s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1 s and 1 s, accordingly.
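The conversion of the recorded per-millisecond bandwidth usage into averaged demand matrices can be sketched as follows (illustrative only; the paper does not give its implementation, and the list-of-matrices encoding here is our assumption):

```python
def demand_matrices(usage, dt, interval):
    """Average a time series of n-by-n bandwidth matrices (one sample every
    dt seconds) into demand matrices over windows of `interval` seconds."""
    step = int(round(interval / dt))
    out = []
    for t in range(0, len(usage) - step + 1, step):
        window = usage[t:t + step]
        n = len(window[0])
        # element-wise mean over the window
        avg = [[sum(m[i][j] for m in window) / step for j in range(n)]
               for i in range(n)]
        out.append(avg)
    return out
```

Feeding the 0.1 s and 1 s outputs to the controller reproduces the two update schedules used in the experiment.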

We plot the job completion times in Figure 13. With ~20 ms reconfiguration delay and a 0.1 s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1 = 80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable; e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1 s and 1 s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of both update periods is similar, but the smaller update interval has slightly worse performance, due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1 s update interval shows 4.94% less completion time than with 0.1 s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total number of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40G dataset with 7.153×10⁹ words.

5.2 Large Scale Simulations
Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with the granularity of a millisecond. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20 ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
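Max-min fair sharing on a wavelength can be computed with the classic progressive-filling algorithm; the sketch below is our illustration of that policy, not the authors' simulator code:

```python
def max_min_share(capacity, demands):
    """Allocate one link's capacity among flows so that the allocation is
    max-min fair: repeatedly split the residual capacity equally among the
    still-unsatisfied flows (progressive filling)."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-9:
        fair = remaining / len(active)
        for i in list(active):
            give = min(fair, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if demands[i] - alloc[i] <= 1e-9:  # flow satisfied
                active.remove(i)
    return alloc
```

For example, sharing 10 units among demands [2, 8, 8] yields [2, 4, 4]: the small flow is satisfied first, and the rest is split evenly.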

We use the synthetic patterns in §5.1.1, as well as realistic traces from production DCNs, in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures:
• Facebook: A Hive/MapReduce trace collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.
• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (the same information as above is collected), as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The sizes of shuffles and reads vary

[Figure 15: Throughput vs. traffic stability. Fraction of full bisection bandwidth (0–1.0) for Random, NStride, and HStride, (a) with and (b) without basemesh, as the stability period varies over 10, 20, 30, 100, and 1000 ms.]

[Figure 16: Throughput vs. reconfiguration frequency. Normalized throughput (0–1.0) for Facebook, IDC-HDFS, and IDC-Shuffle, (a) with and (b) without basemesh, as the reconfiguration period varies over 10, 20, 30, 100, and 1000 ms.]

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Impact of traffic stability: We vary the traffic stability period t from 10 ms to 1 s for the synthetic patterns, and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20 ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see the throughput steadily increase to full bisection bandwidth with a longer stability period. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1 s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10 ms and 20 ms, as all the wavelengths must break down and reconfigure in each period. Basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10 ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread, and requires more reconfigurations in each period than HStride; thus it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10 ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10 ms, because the flows between two nodes need not wait for reconfigurations.


[Figure 17: Handling unstable traffic with basemesh. (a) Fraction of full bisection bandwidth for Random, NStride, and HStride; (b) normalized throughput for Facebook and IDC; both vs. the number of wavelengths in the basemesh (8–192).]

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch, and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency: We vary the reconfiguration frequency to study its impact on throughput, using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹², and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average, with and without basemesh, respectively. This indicates that the traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that, with good load balancing, traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic in the realistic traces is wide-spread: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity: For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period 10 ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17 (a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

[Figure 18: Average bisection throughput. Bisection bandwidth (per-wavelength units, 0–6000) vs. port count (128–6144) for Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch.]

[Figure 19: Worst-case bisection throughput; same structures and axes as Figure 18.]

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17 (b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth: To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 & 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA/Helios/c-Through.


6 Cost of MegaSwitch
Complexity analysis: The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors, such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach, and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited to the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS, and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that, within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks, rather than short-range networks within a DCN. Our choice of ER optics is due solely to its ability to use DWDM wavelengths, not its power output or 40 km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and, due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components   | OSA        | Quartz | Mordia    | Mordia-PTL | MegaSwitch | FatTree
Amplifier    | 0          | 6144   | 768       | 768        | 32         | 0
WSS          | 32 (1×4)   | 960    | 768 (1×4) | 768 (1×4)  | 32 (32×1)  | 0
WDM Filter   | 0          | 0      | 768       | 768        | 0          | 0
OCS          | 1×128-port | 0      | 1×32-port | 0          | 0          | 0
Circulator   | 3072       | 0      | 0         | 0          | 0          | 0
Transceivers | 3072       | 3072   | 3072      | 3072       | 3072       | 6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, FatTree needs many more EPSes (and thus transceivers), and the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia, and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus, the key specifications can be relaxed (e.g., the bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion
We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype, as well as large-scale simulations, we demonstrated the potential of MegaSwitch to support wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwery, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte! (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(∆m) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for


internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for non-blocking connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v), from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V; i.e., each node cannot send/receive more than k wavelengths.

Definition 3 Given a feasible bandwidth demand γan assignment is wavelength efficient iff it is non-interfering satisfies γ and uses at most k wavelengths
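Definition 2 amounts to a row/column-sum check on a demand matrix. A minimal sketch (the matrix encoding is our illustration, not from the paper):

```python
def is_feasible(demand, k):
    """Check Definition 2: no node sends or receives more than k wavelengths.

    demand[u][v] is the number of wavelengths requested on edge (u, v).
    """
    n = len(demand)
    sends_ok = all(sum(demand[u]) <= k for u in range(n))
    recvs_ok = all(sum(row[v] for row in demand) <= k for v in range(n))
    return sends_ok and recvs_ok
```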

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination). Each edge points from a node in {s_i} to a node in {d_i}. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into ∆(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_∆(G′_r)}

1: if ∆(G′_r) ≤ 1 then
2:   return G′_r
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency. We extend G′ to a ∆(G′)-regular multigraph16 G′_r by adding dummy edges, where ∆(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree ∆(G′), at least ∆(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, ∆(G′), to satisfy γ, which also proves wavelength efficiency, since ∆(G′) ≤ k.

Theorem 4. A feasible demand γ needs only ∆(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with ∆(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds ∆(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (∆(G′)−1)-regular; we continue to extract matchings one by one until we have ∆(G′) perfect matchings.17 Thus, any demand described by G′ can be satisfied with ∆(G′) wavelengths. Note that ∆(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands and assigns a wavelength to the matching generated in each iteration.
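The regularization-plus-decomposition argument above can be sketched in code. This is an illustrative rendering, not the paper's implementation: it substitutes Kuhn's augmenting-path matching for the faster Perfect_Matching routines of [7, 27], and the demand-matrix encoding is ours:

```python
def regularize(demand):
    """Pad a feasible demand matrix with dummy edges so that every row and
    column sums to Delta (the largest row/column sum), making G' regular."""
    n = len(demand)
    delta = max(max(sum(r) for r in demand),
                max(sum(r[v] for r in demand) for v in range(n)))
    reg = [row[:] for row in demand]
    for u in range(n):
        for v in range(n):
            row_deficit = delta - sum(reg[u])
            col_deficit = delta - sum(reg[x][v] for x in range(n))
            reg[u][v] += min(row_deficit, col_deficit)
    return reg, delta

def perfect_matching(reg):
    """One perfect matching in the bipartite multigraph (Kuhn's algorithm)."""
    n = len(reg)
    match = [-1] * n  # match[v] = source currently matched to destination v

    def augment(u, seen):
        for v in range(n):
            if reg[u][v] > 0 and v not in seen:
                seen.add(v)
                if match[v] == -1 or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in range(n):
        assert augment(u, set())  # guaranteed for a regular bipartite multigraph
    return {match[v]: v for v in range(n)}

def assign_wavelengths(demand):
    """Decompose the regularized demand into Delta perfect matchings;
    matching i is assigned wavelength i, so no node repeats a wavelength."""
    reg, delta = regularize(demand)
    matchings = []
    for _ in range(delta):
        m = perfect_matching(reg)
        matchings.append(m)
        for u, v in m.items():
            reg[u][v] -= 1  # remove the extracted matching; graph stays regular
    return matchings
```

Each extracted matching touches every source and destination exactly once, which is exactly the non-interference condition for one wavelength.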

16 A graph is regular when every vertex has the same degree.
17 Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms from the literature [27].

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The representative ones are Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring and l is the length of a node ID in bits. For MegaSwitch, both parameters are fixed (as with Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. With Symphony, we can simply pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the basemesh capacity using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1, we measured a ∼20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control-plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay. To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets with netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control-plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control-plane network (using a 1GbE switch) is 0.078ms.



Figure 7: MegaSwitch fault-tolerance.

…ing a direction at the sending 1×2 optical tunable splitter6 and receiving 2×1 optical switches. The splitters and switches are in the PRF module. When there is no cable failure, the sending and receiving switches both select the primary direction, thus preventing optical looping. If one cable cut is detected by the power detectors, the controller is notified, and it instructs the OWSes on the ring to do the following: 1) the sending tunable splitter in every OWS transmits on its fiber in both directions,7 so that the signal can reach all the OWSes; 2) the receiving switch for each fiber, at the input of the WSS, selects the direction in which there is still incoming signal. Then the connectivity can be restored. A passive 4-port fixed splitter is added before the receiving switch as a part of the ring, and it guides the signal from the primary and backup directions to different input ports of the 2×1 switch. For more than one cut, the ring is severed into two or more segments, and connectivity between segments cannot be recovered until cable replacement.
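The recovery logic can be illustrated with a toy reachability model of the ring. The model below is our own abstraction (segment indices and a broadcast flag stand in for the tunable splitters and receiving switches):

```python
def ring_survives(n, cuts, broadcast_both):
    """Toy model of MegaSwitch cable-cut recovery on an n-node ring.

    cuts: set of cut segments (segment i joins node i and node (i+1) % n).
    broadcast_both: True once the controller tells every sending splitter
    to transmit in both directions (recovery step 1); each receiver then
    picks whichever direction still carries signal (step 2).
    Returns True iff every ordered pair of nodes remains connected.
    """
    def path_clear(src, dst, step):
        # Walk from src to dst in one direction, failing on any cut segment.
        i = src
        while i != dst:
            seg = i if step == 1 else (i - 1) % n
            if seg in cuts:
                return False
            i = (i + step) % n
        return True

    return all(
        path_clear(s, d, +1) or (broadcast_both and path_clear(s, d, -1))
        for s in range(n) for d in range(n) if s != d
    )
```

A single cut is survivable once both directions are used; two cuts sever the ring, matching the text.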

4 Implementation

4.1 Implementation of OWS

To implement MegaSwitch and facilitate real deployment, we package all optical devices into a 1RU (rack unit) switch box, the OWS (Figure 2). The OWS is composed of the sending components (MUX+OA) and the receiving components (WSS+DEMUX), as described in §3. We use a pair of arrayed waveguide gratings (AWGs) as the MUX and DEMUX for DWDM signals.

To simplify the wiring, we compact all the 1×2 drop-continue splitters into the PRF module, as shown in Figure 6. This module can be easily attached to or detached from the active switching components in the OWS. To compensate for component losses along the ring, we use a single-stage 12dB OA to boost the DWDM signal before transmitting, and the splitters in the PRF have different splitting ratios. Using the same numbering as in Figure 6, the splitting ratio of the i-th splitter is 1:(i−1) for i>1. As an example, the power splitting on one fiber from node 1 is shown in Figure 9. At each receiving transceiver,

6 It is implementable with a common Mach-Zehnder interferometer.
7 The power splitting ratio must also be re-configured to keep the signal within the receiver sensitivity range for all receiving OWSes.

Figure 8: The OWS box we implemented.

Figure 9: Power budget on a fiber.

the signal strength is ∼−9dBm per channel. The PRF configuration is therefore determined by w (the WSS radix).
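The 1:(i−1) ratios can be sanity-checked with a short calculation. Assuming, per our reading of the numbering, that the signal traverses splitters in decreasing index order (so splitter i drops 1/i of its input), every receiving node gets an equal share:

```python
def drop_powers(n, p_in=1.0):
    """Power dropped to each of the n-1 receiving nodes on one fiber.

    The i-th splitter (i = n-1 down to 1 along the path) uses a 1:(i-1)
    drop:continue ratio, i.e. it drops 1/i of the incoming power; the
    final splitter (i = 1) drops everything that remains.
    """
    powers, remaining = [], p_in
    for i in range(n - 1, 0, -1):
        drop = remaining if i == 1 else remaining / i
        powers.append(drop)
        remaining -= drop
    return powers
```

For a 5-node ring, each of the 4 receivers gets exactly 1/4 of the launched power, which is why the broadcast loss grows with node count.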

The OWS receives DWDM signals from all other nodes on the ring via its WSS. We adopt an 8×1 WSS from CoAdna Photonics in the prototype, which is able to select wavelength signals from at most 8 nodes. Currently, we use 4 out of the 8 WSS ports for a 5-node ring.

Our OWS box provides 16 optical ports, and 8 are used in the MegaSwitch prototype. Each port maps to a particular wavelength, from 190.5THz to 193.5THz at 200GHz channel spacing. InnoLight 10GBASE-ER DWDM SFP+ transceivers are used to connect EPSes to the optical ports on the OWSes. With the total broadcasting loss at 10.5dB for the OWS, 4dB for the WSS, 2.5dB for the DEMUX, and 1dB for cable connectors, the receiving power is −9dBm ∼ −12dBm, well within the sensitivity range of our transceivers. The OWS box we implemented (Figure 8) has 4 inter-OWS ports, 16 optical ports, and 1 Ethernet port for control. Of the 4 inter-OWS ports, two are used to connect to other OWSes to construct the MegaSwitch ring, and the other two are reserved for redundancy and further scaling (§3.1).
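A quick dB-budget sanity check of these figures (the per-channel launch power is our assumption; the gain and loss numbers are from the text):

```python
def received_dbm(launch_dbm):
    """Per-channel receive power: launch power plus the single-stage OA
    gain, minus the broadcasting, WSS, DEMUX, and connector losses
    quoted in the text (all in dB)."""
    oa_gain = 12.0                      # single-stage OA
    losses = 10.5 + 4.0 + 2.5 + 1.0     # broadcast + WSS + DEMUX + connectors
    return launch_dbm + oa_gain - losses
```

With an assumed per-channel launch of −3 to −6 dBm, this lands in the −9 to −12 dBm range reported for the prototype.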

Each OWS is controlled by a Raspberry Pi [37] with a 700MHz ARM CPU and 256MB of memory. The OWS receives wavelength assignments (in UDP packets) via its Ethernet management interface, which is connected to a separate electrical control-plane network, and configures the WSS via GPIO pins.
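The control path is simple enough to sketch. The UDP wire format and GPIO bit layout below are entirely hypothetical — the paper only states that assignments arrive in UDP packets and the WSS is driven via GPIO:

```python
import socket
import struct

def wss_bits(assignment, w=8):
    """Map a {wavelength_index: wss_port} assignment to a flat GPIO bit
    pattern, one w-bit one-hot group per wavelength (hypothetical layout)."""
    bits = []
    for wl in sorted(assignment):
        group = [0] * w
        group[assignment[wl]] = 1
        bits.extend(group)
    return bits

def serve_once(port=9000, w=8):
    """Receive one UDP assignment packet (pairs of bytes: wavelength index,
    WSS port) and return the GPIO pattern; driving the pins is elided."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    data, _ = sock.recvfrom(2048)
    pairs = struct.iter_unpack("BB", data)   # (wavelength, port) byte pairs
    return wss_bits(dict(pairs), w)
```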

4.2 MegaSwitch Prototype

We constructed a 5-node MegaSwitch with 5 OWS boxes, as shown in Figure 10. The OWSes are simply connected to each other through their inter-OWS ports, using ring cables containing multiple fibers. The two EPSes we used are Broadcom Pronto-3922 switches with 48 10GbE ports. For one EPS, we fit 24 ports with 3 sets of transceivers of 8 unique wavelengths to connect to 3 OWSes, and the remaining 24 ports are connected to servers. For the other, we only use 32 ports, with 16 ports to OWSes and 16 to servers. We fill the capacity with 40 10GbE interfaces on 20 Dell PowerEdge R320 servers (Debian 7.0 with kernel 3.18.19), each with a Broadcom NetXtreme II 10GbE network interface card.

The control-plane network is connected by a Broadcom Pronto-3295 Ethernet switch. The Ethernet management


Figure 10: The MegaSwitch prototype implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted on a server also connected to the control-plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted on the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPSes to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and the set of 8 server ports in the same VLAN as a node.

Reconfiguration speed. MegaSwitch's reconfiguration hinges on the WSS, and WSS switching is supposedly microsecond-scale with the technology in [38]. However, in our implementation we find that, as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer attain the 11.5µs switching seen by [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative liquid crystal (LC) technology, and the observed WSS switching time is ∼3ms. We acknowledge that this is a hard limitation we must confront in order to scale. We identify that there exist several ways to reduce this time [14, 23].

We note that, at the current speed of our prototype, MegaSwitch can still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic at Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if µs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane

and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control-plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of the results follows:

• MegaSwitch supports wide-spread communications with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees a ∼20ms reconfiguration delay, with ∼3ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries on Redis, thanks to the basemesh.

• Under synthetic traces, MegaSwitch provides near-full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period less than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth. We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40 10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU- or disk-IO-bound, to stress the network we run the following synthetic traffic patterns used in Helios [12]:

• Node-level Stride (NStride): Numbering the nodes (EPS) from 0 to n−1, for the i-th node, its j-th host initiates a TCP flow to the j-th host in the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic-stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in each new period.


Figure 11: Experiment: node-level stride. (a) Aggregate throughput (Mbps) over 30s; (b), (c) magnified views around 10s and 20s.

• Host-level Stride (HStride): Numbering the hosts

from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases gradual demand shifts between nodes.

• Random: A random perfect matching between all the hosts is generated every period, and every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
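The three patterns are straightforward to reproduce; a generator sketch (the (node, host) naming is ours):

```python
import random

def nstride(n, hosts, l):
    """Node-level stride: host j of node i sends to host j of node (i+l) mod n."""
    return [((i, j), ((i + l) % n, j)) for i in range(n) for j in range(hosts)]

def hstride(n, k, l):
    """Host-level stride: host i sends to host (i + k + l) mod (n*k)."""
    total = n * k
    return [(i, (i + k + l) % total) for i in range(total)]

def random_matching(total, seed=None):
    """Random perfect matching: a random permutation pairing senders to receivers."""
    rng = random.Random(seed)
    dst = list(range(total))
    rng.shuffle(dst)
    return list(zip(range(total), dst))
```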

For this experiment, the basemesh is disabled. We set the traffic stability to t=10s, and the wavelength assignments are calculated and delivered to the OWSes whenever the demand between 2 nodes changes. We study the impact of the stability period t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node; HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because only one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ∼20ms gap caused by the wavelength reconfigurations at 10s and 20s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ∼10ms,8 and the remainder can be broken down into EPS configuration (∼5ms), WSS switching (∼3ms), and control-plane delays (∼4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing framework, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths; the others are allocated dynamically.

Spark. We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank,9 K-Means,10

8 Our transceiver's receiver Loss-of-Signal Deassert is −22dBm.
9 A PageRank instance using a 26GB dataset [32].
10 A clustering algorithm that partitions a dataset into K clusters. The input is the Wikipedia Page Traffic Statistics dataset [47].

Figure 12: WSS switching speed (power readings, −40dB to 0dB, over 0–8ms, for switching on/off).

Figure 13: Completion time for Spark jobs (WikiPageRank, WordCount, K-Means; Single ToR vs. MegaSwitch with 0.1s and 1s update intervals).

Figure 14: Redis intra-/cross-rack query completion (nodes 1–5, b=4 to b=1, 0–150µs).

and WordCount.11 We first connect all the 20 servers to a single ToR EPS and run the applications; this establishes the optimal network scenario, because all servers are connected at full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth-demand matrices, at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.

We plot the job completion times in Figure 13. With a ∼20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable: e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance at both update periods is similar, but the smaller update interval is slightly worse, due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis. For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

11 It counts words in a 40GB dataset with 7.153×10⁹ words.

5.2 Large-Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with a granularity of one millisecond. We assume a proactive controller: it is informed of the demand change in the next traffic-stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
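The simulator's per-tick bandwidth sharing is standard progressive filling (water-filling); a sketch under our own link/flow encoding:

```python
def max_min_rates(capacity, flow_links):
    """Progressive filling: repeatedly saturate the most constrained link,
    granting its fair share to every flow crossing it.

    capacity: {link: Gbps}; flow_links: {flow: iterable of links it uses}.
    """
    rates = {}
    cap = dict(capacity)
    active = {f: set(ls) for f, ls in flow_links.items()}
    while active:
        # Number of still-unassigned flows crossing each link.
        load = {l: sum(1 for ls in active.values() if l in ls) for l in cap}
        loaded = [l for l in cap if load[l] > 0]
        if not loaded:
            break
        # Bottleneck link: smallest per-flow fair share.
        bottleneck = min(loaded, key=lambda l: cap[l] / load[l])
        share = cap[bottleneck] / load[bottleneck]
        for f in [f for f, ls in active.items() if bottleneck in ls]:
            rates[f] = share
            for l in active[f]:
                cap[l] -= share   # the flow's share is consumed on every hop
            del active[f]
    return rates
```

For example, with links A (10Gbps) and B (4Gbps) and flows f1 on A, f2 on A and B, and f3 on B, B is the bottleneck (2Gbps each for f2 and f3), leaving 8Gbps for f1.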

We use the synthetic patterns of §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic-stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: This Hive/MapReduce trace was collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (the same information as above is collected) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of a shuffle or read varies

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Figure 15: Throughput vs. traffic stability, (a) with and (b) without basemesh (fraction of full bisection bandwidth for Random, NStride, and HStride; stability period 10–1000ms).

Figure 16: Throughput vs. reconfiguration frequency, (a) with and (b) without basemesh (normalized throughput for Facebook, IDC-HDFS, and IDC-Shuffle; reconfiguration period 10–1000ms).

Impact of traffic stability. We vary the traffic-stability period t from 10ms to 1s for the synthetic patterns and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) if the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput; 2) all three patterns see throughput steadily increasing to full bisection bandwidth with longer stability periods; 3) although the basemesh reserves wavelengths to maintain connectivity, the throughput of the different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must be torn down and reconfigured in each period. The basemesh does not help much for NStride: while the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ∼75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in a new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride; thus it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with the basemesh: (a) fraction of full bisection bandwidth for Random, NStride, and HStride; (b) normalized throughput for the Facebook and IDC traces; 8–192 wavelengths in the basemesh.

In summary, MegaSwitch provides near full-bisection

bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces12 and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces, collected from production DCNs, are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that with good load balancing the traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic in the realistic traces is wide-spread: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic), and increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

12 The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput (in units of bandwidth per wavelength) for Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch, at port counts from 128 to 6144.

Figure 19: Worst-case bisection throughput for the same structures and port counts.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuations. In comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and we design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes,13 because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe that: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while the other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling of optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

13 A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost toward that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (optical circuit switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the number of wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements14 in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed from 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses over 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost: Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40 km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

14 We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components     OSA          Quartz   Mordia      Mordia-PTL   MegaSwitch   FatTree
Amplifier      0            6144     768         768          32           0
WSS            32 (1×4)     960      768 (1×4)   768 (1×4)    32 (32×1)    0
WDM Filter     0            0        768         768          0            0
OCS            1×128-port   0        1×32-port   0            0            0
Circulator     3072         0        0           0            0            0
Transceivers   3072         3072     3072        3072         3072         6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

Amplifier cost: With only one stage of amplification on each fiber, MegaSwitch maintains a good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher15, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia, and 192× fewer than Quartz).

WSS cost: In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to scale cost-effectively to more ports. Since the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4× lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements: This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

15 For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).

[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In ACM PODC (2002), pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof of Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as the MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is the bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows:

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3: Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node of G into two logical nodes, s_i (source) and d_i (destination); each edge points from a source node to a destination node. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings
Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}
1: if Δ(G′_r) ≤ 1 then
2:    return {G′_r}
3: else
4:    m ← Perfect_Matching(G′_r)
5:    return {m} ∪ DECOMPOSE(G′_r \ m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a Δ(G′)-regular multigraph16 G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we need only the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′)≤k.

Theorem 4: A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′)≤k wavelengths using Algorithm 1.

Proof: DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, after which the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings17. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′)≤k, as the out-degree and in-degree of any node must be ≤k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
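As an illustration, the decomposition can be sketched in Python: a regularized integer demand matrix (all row and column sums equal to Δ, dummy edges already added) is peeled into Δ perfect matchings, each of which receives one wavelength. This is a sketch, not the controller's code; `perfect_matching` is a simple augmenting-path routine standing in for the literature algorithms [27]:

```python
def perfect_matching(D):
    """Kuhn's augmenting-path algorithm on the support of integer
    demand matrix D (rows = sources, cols = destinations); a perfect
    matching exists whenever D is regular (Hall's theorem)."""
    n = len(D)
    match = [-1] * n  # match[j] = source currently matched to destination j

    def augment(i, seen):
        for j in range(n):
            if D[i][j] > 0 and j not in seen:
                seen.add(j)
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, set()):
            raise ValueError("no perfect matching: demand is not regular")
    return [(match[j], j) for j in range(n)]

def decompose(demand):
    """DECOMPOSE(.): peel a Delta-regular demand matrix into Delta
    perfect matchings; each matching gets one wavelength (color)."""
    D = [row[:] for row in demand]
    delta = max(sum(row) for row in D)
    matchings = []
    for _ in range(delta):
        m = perfect_matching(D)
        for i, j in m:
            D[i][j] -= 1  # remove the matched edge copy
        matchings.append(m)
    return matchings
```

For example, the 3-regular demand [[2,1,0],[0,2,1],[1,0,2]] decomposes into 3 matchings whose per-edge sum reconstructs the demand exactly.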

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology; the representative ones are Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones:
• Pastry suffers from long average path lengths when many

wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as in Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

16 A graph is regular when every vertex has the same degree.
17 Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms from the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can simply pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh with the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, which we will explore as future work.
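To illustrate why the overlay choice matters, a small BFS sketch compares worst-case hop counts on a plain directed ring (as with b=1) against a Chord-style finger set, for an assumed 32-node ring. This is illustrative only, not the prototype's routing code:

```python
from collections import deque

def hop_counts(n, jumps):
    """BFS hop counts from node 0 on a directed ring of n nodes
    augmented with the given clockwise jump lengths."""
    dist = {0: 0}
    q = deque([0])
    while q:
        u = q.popleft()
        for j in jumps:
            v = (u + j) % n
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

n = 32
ring = hop_counts(n, [1])                 # plain ring: worst case n-1 hops
chord = hop_counts(n, [1, 2, 4, 8, 16])   # Chord-like fingers: log2(n) hops
print(max(ring.values()), max(chord.values()))  # 31 5
```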

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1 we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
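Putting the measured components together gives a sketch of the t_r = max(t_e, t_o) + t_c model for the 40-port prototype (our own summation of the figures above; the remaining ~10ms of the observed ~20ms gap in §5.1.1 comes from transceiver initialization, which this model excludes):

```python
def reconfig_delay_ms(t_eps, t_wss, t_ctrl):
    """EPS and WSS reconfigure in parallel; control plane cost is additive."""
    return max(t_eps, t_wss) + t_ctrl

t_eps = 5.12                            # average EPS configuration delay (ms)
t_wss = 3.23                            # WSS switching delay (ms)
t_ctrl = 0.53 + 0.45 + 4.31 + 0.078     # assignment + routing + OWS + RTT (ms)

print(round(reconfig_delay_ms(t_eps, t_wss, t_ctrl), 2))
```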



Figure 10: The MegaSwitch prototype, implemented with 5 OWSes.

ports of the EPSes and OWSes are connected to this switch. The MegaSwitch controller is hosted in a server also connected to the control plane switch. The EPSes work in Open vSwitch [36] mode and are managed by a Ryu controller [34] hosted in the same server as the MegaSwitch controller. To emulate 5 OWS-EPS nodes, we divided the 40 server-facing EPS ports into 5 VLANs, virtually representing 5 racks. The OWS-facing ports are given static IP addresses, and we install OpenFlow [31] rules in the EPS switches to route packets between VLANs at Layer 3. In the experiments, we refer to an OWS and a set of 8 server ports in the same VLAN as a node.

Reconfiguration speed: MegaSwitch's reconfiguration hinges on the WSS, and the WSS switching speed is supposedly microseconds with the technology in [38]. However, in our implementation we find that as the WSS port count increases (required to support more wide-spread communication), e.g., w=8 in our case, we can no longer maintain the 11.5μs switching seen by [38]. This is mainly because the port count of a WSS made with the digital light processing (DLP) technology used in [38] is currently not scalable, due to large insertion loss. As a result, we chose a WSS implemented with an alternative Liquid Crystal (LC) technology, and the observed WSS switching time is ~3ms. We acknowledge that this is a hard limitation we need to confront in order to scale; we identify several existing ways to reduce this time [14, 23].

We note that, even at the current speed of our prototype, MegaSwitch may still satisfy the needs of some production DCNs: a recent study [42] of DCN traffic in Facebook suggests that, with effective load balancing, traffic demands are stable over sub-second intervals. On the other hand, even if μs switching is achieved in later versions of MegaSwitch, it is still insufficient for high-speed DCNs (40/100G or beyond). Following §3.2.2, in our prototype we address this problem by activating the basemesh, which mimics a hybrid network without resorting to an additional electrical network.

5 Evaluation

We evaluate MegaSwitch with testbed experiments (with synthetic patterns (§5.1.1) and real applications (§5.1.2)), as well as large-scale simulations (with synthetic patterns and production traces (§5.2)). Our main goals are to: 1) measure the basic metrics of MegaSwitch's data plane

and control plane; 2) understand the performance of real applications on MegaSwitch; 3) study the impact of control plane latency and traffic stability on throughput; and 4) assess the effectiveness of the basemesh.

A summary of our results is as follows:
• MegaSwitch supports wide-spread communications

with full bisection bandwidth among all ports once wavelengths are configured. MegaSwitch sees a ~20ms reconfiguration delay, with ~3ms for WSS switching.

• We deploy real applications, Spark and Redis, on the prototype. We show that MegaSwitch performs similarly to the optimal scenario (all servers under a single EPS) for data-intensive applications on Spark, and maintains uniform latency for cross-rack queries in Redis, thanks to the basemesh.

• Under synthetic traces, MegaSwitch provides near full bisection bandwidth for stable traffic (stability period ≥100ms) of all patterns, but cannot achieve high throughput for concentrated, unstable traffic with a stability period shorter than the reconfiguration delay. However, increasing the basemesh capacity effectively improves throughput for highly unstable, wide-spread traffic.

• With realistic production traces, MegaSwitch achieves >90% of the throughput of an ideal non-blocking fabric, despite its 20ms total wavelength reconfiguration delay.

5.1 Testbed Experiments

5.1.1 Basic Measurements

Bisection bandwidth: We measure the bisection throughput of MegaSwitch and its degradation during wavelength switching. We use the 40 10GbE interfaces on the prototype to form dynamic all-to-all communication patterns (each interface is referred to as a host). Since traffic from real applications may be CPU- or disk-I/O-bound, to stress the network we run the following synthetic traffic patterns, used in Helios [12]:
• Node-level Stride (NStride): Numbering the nodes

(EPSes) from 0 to n−1, the j-th host of the i-th node initiates a TCP flow to the j-th host of the (i+l mod n)-th node, where l rotates from 1 to n every t seconds (t is the traffic stability period). This pattern tests the response to abrupt demand changes between nodes, as the traffic from one node completely shifts to another node in each new period.



Figure 11: Experiment: Node-level stride.

• Host-level Stride (HStride): Numbering the hosts

from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period, and every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
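The three patterns can be written as small demand generators (a sketch; the function names and signatures are ours, not from the testbed harness):

```python
import random

def nstride(n, k, l):
    """Node-level stride: host j of node i sends to host j of node (i+l) mod n."""
    return {(i, j): ((i + l) % n, j) for i in range(n) for j in range(k)}

def hstride(n, k, l):
    """Host-level stride: host i sends to host (i + k + l) mod (n*k)."""
    return {i: (i + k + l) % (n * k) for i in range(n * k)}

def random_matching(n, k, seed=None):
    """Random perfect matching: a random permutation over all n*k hosts."""
    dsts = list(range(n * k))
    random.Random(seed).shuffle(dsts)
    return dict(enumerate(dsts))
```

For example, `nstride(5, 8, 1)` shifts all of node 0's traffic to node 1, matching the abrupt demand shift the experiment describes.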

For this experiment, the basemesh is disabled. We set the traffic stability period t=10s, and the wavelength assignments are calculated and delivered to the OWSes whenever the demand between 2 nodes changes. We study the impact of the stability period t later, in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops: NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11(a), but drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfiguration at 10s and 20s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms8, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control plane delays (~4ms). (Please refer to the Appendix for detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths; the others are allocated dynamically.

Spark: We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank9, K-Means10,

8 Our transceiver's receiver Loss-of-Signal Deassert is −22dBm.
9 A PageRank instance using a 26GB dataset [32].

10 A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].


Figure 12: WSS switching speed.


Figure 13: Completion time for Spark jobs.


Figure 14: Redis intra-/cross-rack query completion.

and WordCount11. We first connect all 20 servers to a single ToR EPS and run the applications; this establishes the optimal network scenario, because all servers are connected at full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices, at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.

We plot the job completion times in Figure 13. With a ~20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1=80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable: e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance at both update periods is similar, but the smaller update interval is slightly worse due to one extra wavelength reconfiguration; for example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% less completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.

Redis: For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10^6 SET and GET requests. The key space is set to 10^5. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they traverse only 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

11 It counts words in a 40GB dataset with 7.153×10^9 words.

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks at millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; it therefore configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
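The simulator's sharing policy can be sketched as the classic water-filling computation of a max-min fair allocation on a single wavelength (an illustrative sketch, not the simulator's actual code):

```python
def max_min_share(capacity, demands):
    """Water-filling: satisfy the most bottlenecked flows first;
    each remaining flow gets an equal share of the leftover capacity."""
    alloc = {}
    remaining = capacity
    pending = sorted(demands, key=demands.get)  # smallest demand first
    for idx, flow in enumerate(pending):
        fair = remaining / (len(pending) - idx)
        alloc[flow] = min(demands[flow], fair)
        remaining -= alloc[flow]
    return alloc

# Three flows on a 10Gbps wavelength: small demands are met in full,
# and the remaining flows split the leftover capacity equally.
print(max_min_share(10, {"a": 2, "b": 4, "c": 10}))
```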

We use the synthetic patterns of §5.1.1, as well as realistic traces from production DCNs, in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures:
• Facebook: This Hive/MapReduce trace was collected by

Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10^5

flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10^4.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (the same information as above is collected), as well as HDFS read tasks (sender, receiver, and size are collected). We refer to these as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10^6 flows). The sizes of shuffles and reads vary


Figure 15 Throughput vs traffic stability

Facebook

IDC-HDFS

IDC-Shuffle

(a) With Basemesh

Facebook

IDC-HDFS

IDC-Shuffle

(b) Without Basemesh

N

orm

aliz

ed

Th

ou

gh

pu

t

0

05

10

0

05

10

Reconfiguration Period10ms 20ms 30ms 100ms1000ms

Figure 16 Throughput vs reconfiguration frequencyfrom 1MB to 1TB and the number of flows per shuffleis between 1 and 17times104

Impact of traffic stability. We vary the traffic stability period t from 10ms to 1s for the synthetic patterns and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see throughput steadily increasing to full bisection bandwidth with longer stability periods. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, achieving 18.5% of full bisection bandwidth for t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.

586 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Figure 17: Handling unstable traffic with basemesh. ((a) fraction of full bisection bandwidth for the synthetic patterns Random, NStride, HStride; (b) normalized throughput for the Facebook and IDC traces; x-axis: wavelengths in basemesh, from 8 to 192.)

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that this pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces12 and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces collected from the production DCNs are very stable and thus can take advantage of the high bandwidth of optical circuits. This aligns well with traffic statistics in another study [42], which suggests that, with good load balancing, traffic is stable on a sub-second scale and thus suitable for MegaSwitch. We also observe that the traffic is wide-spread in the realistic traces: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to only one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%, and further increasing b to 160 increases the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), then demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when >4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

12 The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput. (Structures: Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, MegaSwitch; y-axis: bisection bandwidth in per-wavelength units, 0–6000; x-axis: port count from 128 to 6144.)

Figure 19: Worst-case bisection throughput. (Same structures and axes as Figure 18.)

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10^4 times to compute the average bisection bandwidth for each structure, and we design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes13, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.
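The permutation-averaging procedure can be sketched as follows. `throughput_of` stands in for a structure-specific throughput model (all names are ours, and the trivially non-blocking model shown is only a placeholder, not any of the evaluated structures):

```python
import random

def average_bisection(num_ports, throughput_of, trials=10_000):
    """Average throughput over random port permutations.

    `throughput_of(pairs)` is a structure-specific model that returns the
    total throughput (in per-wavelength units) achievable for the given
    sender -> receiver pairing.
    """
    total = 0.0
    for _ in range(trials):
        receivers = list(range(num_ports))
        random.shuffle(receivers)          # random permutation of ports
        pairs = list(enumerate(receivers))
        total += throughput_of(pairs)
    return total / trials

# Placeholder model: a non-blocking fabric serves every pair at full rate,
# so a permutation of p ports always achieves p units of bandwidth.
nonblocking = lambda pairs: len(pairs)
print(average_bisection(128, nonblocking, trials=10))  # 128.0
```

A real model for, say, OSA or Mordia would cap the throughput by the rack-to-rack (or ring) circuit constraints of that topology.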

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

13 A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost down to that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements14 in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24^3/4=3456 ports in total and 720 EPSes. We assume that, within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that, compared to the other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses over 2× as many (non-DWDM) optical transceivers at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than for short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers remain more expensive than non-DWDM ones, FatTree needs many more EPSes (and thus transceivers), and the cost difference will grow as the network scales [38].

14 We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components   | OSA        | Quartz | Mordia     | Mordia-PTL | MegaSwitch | FatTree
Amplifier    | 0          | 6144   | 768        | 768        | 32         | 0
WSS          | 32 (1×4)   | 960    | 768 (1×4)  | 768 (1×4)  | 32 (32×1)  | 0
WDM Filter   | 0          | 0      | 768        | 768        | 0          | 0
OCS          | 1×128-port | 0      | 1×32-port  | 0          | 0          | 0
Circulator   | 3072       | 0      | 0          | 0          | 0          | 0
Transceivers | 3072       | 3072   | 3072       | 3072       | 3072       | 6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher15, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus, the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4× lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

15 For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).
[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).
[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.
[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).
[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.
[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).
[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).
[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).
[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.
[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.
[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).
[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).
[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).
[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).
[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).
[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).
[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).
[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).
[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.
[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).
[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).
[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.
[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).
[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).
[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).
[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Society, 2009.
[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.
[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).
[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).
[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).
[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.
[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.
[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.
[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).
[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).
[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).
[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).
[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.
[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.
[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.
[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).
[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.
[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).
[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).
[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).
[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.
[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.
[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).
[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).
[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.
[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).
[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).
[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).
[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V,E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows.

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V; i.e., each node cannot send/receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
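Definition 2 amounts to bounding the row and column sums of the demand matrix by k. A minimal sketch of the check (the function and variable names are ours):

```python
def is_feasible(demand, k):
    """demand[u][v] = wavelengths requested from node u to node v.

    Feasible iff every node sends at most k and receives at most k
    wavelengths in total (Definition 2).
    """
    n = len(demand)
    sends_ok = all(sum(demand[u]) <= k for u in range(n))
    recvs_ok = all(sum(demand[u][v] for u in range(n)) <= k
                   for v in range(n))
    return sends_ok and recvs_ok

# 3 nodes: every row and column of gamma sums to at most 4.
gamma = [[0, 2, 2],
         [1, 0, 2],
         [3, 1, 0]]
print(is_feasible(gamma, 4))  # True
print(is_feasible(gamma, 3))  # False
```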

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination). Each edge points from a node in {s_i} to a node in {d_i}. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings
Input: G′_r
Output: π = {m_1, m_2, …, m_Δ(G′_r)}
1: if Δ(G′_r) ≤ 1 then
2:     return G′_r
3: else
4:     m ← Perfect_Matching(G′_r)
5:     return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph16 G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., determined by the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we need only the smallest possible value, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4. A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r; the residual graph becomes (Δ(G′)−1)-regular, and we continue extracting matchings one by one until we have Δ(G′) perfect matchings17. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands and assigns a wavelength to the matching generated in each iteration.
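A straightforward (unoptimized) rendering of Algorithm 1, written iteratively over the multiplicity matrix and using Kuhn's augmenting-path matching; this is our own sketch, not the controller's implementation:

```python
def decompose(mat):
    """Decompose a d-regular bipartite multigraph (given as an n-by-n
    multiplicity matrix) into d perfect matchings, one per wavelength.
    Assumes the input is regular, as guaranteed by the dummy-edge padding.
    """
    n = len(mat)
    m = [row[:] for row in mat]
    d = sum(m[0])  # regular: every row/column sums to d
    matchings = []
    for _ in range(d):
        match_of = [-1] * n  # match_of[v] = source u matched to dest v

        def augment(u, seen):
            # Kuhn's augmenting-path search from source u.
            for v in range(n):
                if m[u][v] > 0 and v not in seen:
                    seen.add(v)
                    if match_of[v] == -1 or augment(match_of[v], seen):
                        match_of[v] = u
                        return True
            return False

        for u in range(n):
            augment(u, set())
        matching = [(match_of[v], v) for v in range(n)]
        for u, v in matching:
            m[u][v] -= 1  # remove matched edges; residual stays regular
        matchings.append(matching)
    return matchings

# A 2-regular demand on 3 nodes decomposes into 2 matchings (wavelengths).
demand = [[0, 1, 1],
          [1, 0, 1],
          [1, 1, 0]]
for w, matching in enumerate(decompose(demand)):
    print("wavelength", w, "->", sorted(matching))
```

Each returned matching is a set of non-interfering (source, destination) pairs that can all be carried on one wavelength.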

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology; the representative ones are Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have already compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as with Chord), so we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. With Symphony, we can simply pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the basemesh capacity with the Viceroy algorithm affects all wavelengths but one in the worst case.

16 A graph is regular when every vertex has the same degree.
17 Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1, we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay. To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets to a host in the other node using netcat. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for 40 ports (a 40×40 demand matrix as input) and 328ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 178ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
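As a sanity check, plugging the measured averages into t_r = max(t_e, t_o) + t_c roughly reproduces the ~20ms gap of §5.1.1 once the ~10ms transceiver initialization is added. Treating the assignment computation, OWS processing, and control RTT together as the control-plane term is our reading of the breakdown, not a claim from the paper:

```python
# Measured prototype averages (ms); the ~10ms transceiver initialization
# from Section 5.1.1 comes on top of the EPS/WSS/control components.
t_eps = 5.12                    # EPS (OpenFlow) configuration
t_wss = 3.23                    # WSS switching
t_ctrl = 0.53 + 4.31 + 0.078    # computation + OWS processing + control RTT
t_init = 10.0                   # transceiver initialization (approximate)

t_r = max(t_eps, t_wss) + t_ctrl   # EPS and WSS reconfigure in parallel
print(round(t_r, 2))               # ~10ms without transceiver init
print(round(t_r + t_init, 2))      # ~20ms, matching the observed gap
```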



Figure 11: Experiment: node-level stride. ((a) aggregate throughput (Mbps, up to 4×10^5) over 0s–30s; (b), (c) magnified views around 10.00s–10.05s and 20.00s–20.05s.)

• Host-level Stride (HStride): Numbering the hosts from 0 to n×k−1, the i-th host sends a TCP flow to the (i+k+l mod (n×k))-th host, where l rotates from 1 to ⌈k/2⌉ every t seconds. This pattern showcases a gradual demand shift between nodes.

• Random: A random perfect matching between all the hosts is generated every period. Every host sends to its matched host for t seconds. This pattern showcases wide-spread communication, as every node communicates with many nodes in each period.
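The HStride and Random patterns above can be generated as follows (a sketch with our own helper names; n nodes, k hosts per node, hosts numbered 0..n×k−1):

```python
import math
import random

def hstride_pairs(n, k, l):
    """Host-level stride: host i sends to host (i + k + l) mod (n*k),
    with l rotating from 1 to ceil(k/2) each stability period."""
    total = n * k
    return [(i, (i + k + l) % total) for i in range(total)]

def random_pairs(n, k):
    """A random perfect matching between all hosts."""
    total = n * k
    receivers = list(range(total))
    random.shuffle(receivers)
    return list(enumerate(receivers))

# Example: 5 nodes of 8 hosts each, as in the 40-server prototype.
n, k = 5, 8
for l in range(1, math.ceil(k / 2) + 1):   # l rotates over 1..ceil(k/2)
    pairs = hstride_pairs(n, k, l)
    # A fixed shift is a permutation: every host receives exactly one flow.
    assert sorted(dst for _, dst in pairs) == list(range(n * k))
print(hstride_pairs(n, k, 1)[0])  # (0, 9): host 0 sends to host 0+8+1
```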

For this experiment, the basemesh is disabled. We set the traffic stability t=10s, and the wavelength assignments are calculated and delivered to the OWS when the demand between 2 nodes changes. We study the impact of the stability t later in §5.2.

As shown in Figure 11(a), MegaSwitch maintains the full bandwidth of 40×10Gbps when traffic is stable. During reconfigurations, we see the corresponding throughput drops. NStride drops to 0, since all the traffic shifts to a new node. HStride's performance is similar to Figure 11(a), but it drops by only 50Gbps, because one wavelength is reconfigured per rack. The throughput resumes quickly after reconfigurations.

We further observe a ~20ms gap caused by the wavelength reconfiguration at 10s and 20s in the magnified views (b) and (c) of Figure 11. Transceiver initialization delay contributes ~10ms8, and the remainder can be broken down into EPS configuration (~5ms), WSS switching (~3ms), and control plane delays (~4ms). (Please refer to the Appendix for the detailed measurement methods and results.)

5.1.2 Real Applications on MegaSwitch

We now evaluate the performance of real applications on MegaSwitch. We use Spark [54], a data-parallel processing application, and Redis [43], a latency-sensitive in-memory key-value store. For this experiment, we form the basemesh with 4 wavelengths, and the others are allocated dynamically.

Spark. We deploy Spark 1.4.1 with Oracle JDK 1.7.0_25 and run three jobs: WikipediaPageRank9, K-Means10,

8 Our transceiver's receiver Loss of Signal Deassert is -22dBm.
9 A PageRank instance using a 26GB dataset [32].

10 A clustering algorithm that partitions a dataset into K clusters. The input is Wikipedia Page Traffic Statistics [47].

Figure 12: WSS switching speed. (Optical power (dB) at the original port ("Switching Off") and the destination port ("Switching On") over 0–8ms.)

Figure 13: Completion time for Spark jobs. (Jobs: WikiPageRank, WordCount, K-Means; configurations: Single ToR, MegaSwitch (0.1s), MegaSwitch (1s); x-axis: 0s–450s.)

and WordCount11.

Figure 14: Redis intra-/cross-rack query completion. (Servers in Nodes 1–5; basemesh sizes b=4, 3, 2, 1; query completion time 0–150μs.)

We first connect all 20 servers to a single ToR EPS and run the applications, which establishes the optimal network scenario, because all servers are connected with full bandwidth. We record the time series (at millisecond granularity) of the bandwidth usage of each server while it is running, and then convert it into two series of averaged bandwidth demand matrices at 0.1s and 1s intervals, respectively. Then we connect the servers back to MegaSwitch and re-run the applications, using these two series as inputs to update MegaSwitch every 0.1s and 1s, accordingly.
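The conversion from per-millisecond bandwidth samples to averaged demand matrices can be sketched as follows (hypothetical names; `samples` holds one n×n Mbps matrix per millisecond):

```python
def averaged_demands(samples, interval_ms):
    """Average per-millisecond demand matrices over each update interval.

    samples: list of n-by-n matrices, one per millisecond of the run.
    Returns one averaged matrix per `interval_ms` window.
    """
    n = len(samples[0])
    out = []
    for start in range(0, len(samples), interval_ms):
        window = samples[start:start + interval_ms]
        avg = [[sum(m[u][v] for m in window) / len(window)
                for v in range(n)] for u in range(n)]
        out.append(avg)
    return out

# Two 2-node samples averaged over a single 2ms window.
series = [[[0, 10], [2, 0]],
          [[0, 30], [4, 0]]]
print(averaged_demands(series, 2))  # [[[0.0, 20.0], [3.0, 0.0]]]
```

The 0.1s and 1s series in the experiment correspond to `interval_ms` of 100 and 1000, respectively.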

We plot the job completion times in Figure 13. With ∼20ms reconfiguration delay and a 0.1s interval, the bandwidth efficiency is expected to be (0.1−0.02)/0.1 = 80% if every wavelength changes in each interval. However, MegaSwitch performs almost the same as if the servers were connected to a single EPS (on average 2.21% worse). This is because, as we observed, the traffic demands of these Spark jobs are stable: e.g., for K-Means, the number of wavelength reassignments is only 3 and 2 for the update periods of 0.1s and 1s, respectively. Most wavelengths do not need reconfiguration, and thus provide uninterrupted bandwidth during the experiments. The performance of both update periods is similar, but the smaller update interval has slightly worse performance due to one more wavelength reconfiguration. For example, in WikiPageRank, MegaSwitch with a 1s update interval shows 4.94% lower completion time than with 0.1s. We further examine the relationship between traffic stability and MegaSwitch's control latencies in §5.2.
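The 80% figure above is just the fraction of each update interval not spent reconfiguring; a minimal sketch of the arithmetic (function name ours, not from the paper):

```python
def expected_bw_efficiency(update_interval_s, reconfig_delay_s=0.02):
    """Worst-case bandwidth efficiency if every wavelength is
    reconfigured once per update interval: the link carries no
    traffic during the ~20ms reconfiguration gap."""
    return (update_interval_s - reconfig_delay_s) / update_interval_s

print(expected_bw_efficiency(0.1))  # 0.8 -> 80% for 0.1s updates
```

With a 1s interval the same formula gives 98%, which is why the longer interval loses so little even when a reconfiguration does occur.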

Redis. For this Redis in-memory key-value store experiment, we initiate queries to a Redis server in the first node from the servers in all 5 nodes, with a total of 10⁶ SET and GET requests. The key space is set to 10⁵. Since the basemesh provides connectivity between all the nodes, Redis is not affected by any reconfiguration latency. The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

¹¹It counts words in a 40GB dataset with 7.153×10⁹ words.

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 585

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with millisecond granularity. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; therefore, it configures the wavelengths every t seconds, with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
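The max-min fair sharing used by the simulator can be sketched with progressive filling: repeatedly split the leftover capacity evenly among unsatisfied flows, freezing a flow once its demand is met. A minimal sketch (all names ours, not the paper's simulator code):

```python
def max_min_fair(capacity, demands):
    """Progressive-filling max-min fair allocation of one wavelength's
    capacity among the flows sharing it. demands maps flow -> demand."""
    alloc = {f: 0.0 for f in demands}
    active = {f for f, d in demands.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)  # equal share of what is left
        for f in list(active):
            give = min(share, demands[f] - alloc[f])
            alloc[f] += give
            remaining -= give
            if alloc[f] >= demands[f] - 1e-12:
                active.discard(f)  # demand satisfied: freeze this flow
    return alloc
```

For example, with capacity 10 and demands {a: 2, b: 4, c: 8}, flow a gets 2, and b and c each get 4: small flows are fully served, and the residue is split among the larger ones.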

We use the synthetic patterns from §5.1.1, as well as realistic traces from production DCNs, in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: A Hive/MapReduce trace collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (with the same information as above) as well as HDFS read tasks (sender, receiver, and size are collected). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of a shuffle or read varies from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

[Figure 15: Throughput vs. traffic stability. Fraction of full bisection bandwidth for Random, NStride, and HStride, (a) with and (b) without basemesh, as the stability period varies from 10ms to 1000ms.]

[Figure 16: Throughput vs. reconfiguration frequency. Normalized throughput of the Facebook, IDC-HDFS, and IDC-Shuffle traces, (a) with and (b) without basemesh, for reconfiguration periods from 10ms to 1000ms.]

Impact of traffic stability. We vary the traffic stability period t from 10ms to 1s for the synthetic patterns, and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see the throughput steadily increase to full bisection bandwidth with a longer stability period. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ∼75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full bisection bandwidth at t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full bisection bandwidth at t=10ms, because the flows between two nodes need not wait for reconfigurations.


[Figure 17: Handling unstable traffic with basemesh. (a) Fraction of full bisection bandwidth for Random, NStride, and HStride; (b) normalized throughput for Facebook and IDC, as the number of wavelengths in the basemesh varies from 8 to 192.]

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch, and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹², and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches non-blocking.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with traffic statistics in another study [42], which suggests that with good load balancing the traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic in the realistic traces is wide-spread: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when more than 4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

[Figure 18: Average bisection throughput (in per-wavelength bandwidth units) of Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch, for port counts from 128 to 6144.]

[Figure 19: Worst-case bisection throughput for the same structures and port counts.]

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst case is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port count in both the average and the worst case, while the other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling of optical devices, with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed from 24-port EPSes, with in total 24³/4=3456 ports and 720 EPSes. We assume that, within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that, compared to the other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ∼10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks, rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost should be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components     OSA         Quartz  Mordia      Mordia-PTL  MegaSwitch  FatTree
Amplifier      0           6144    768         768         32          0
WSS            32 (1×4)    960     768 (1×4)   768 (1×4)   32 (32×1)   0
WDM Filter     0           0       768         768         0           0
OCS            1×128-port  0       1×32-port   0           0           0
Circulator     3072        0       0           0           0           0
Transceivers   3072        3072    3072        3072        3072        6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

than non-DWDM ones, FatTree needs many more EPSes (and thus transceivers), and the cost difference will grow as the network scales [38].

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia, and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus, the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4× lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype, as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] AL-FARES, M., LOUKISSAS, A., AND VAHDAT, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] AL-FARES, M., RADHAKRISHNAN, S., RAGHAVAN, B., HUANG, N., AND VAHDAT, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] ANNEXSTEIN, F., AND BAUMSLAG, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] CEGLOWSKI, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] CHEN, K., SINGLA, A., SINGH, A., RAMACHANDRAN, K., XU, L., ZHANG, Y., WEN, X., AND CHEN, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] CHEN, K., WEN, X., MA, X., CHEN, Y., XIA, Y., HU, C., AND DONG, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] CHOWDHURY, M., ZHONG, Y., AND STOICA, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] COLE, R., AND HOPCROFT, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] DOERR, C. R., AND OKAMOTO, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] FARRINGTON, N., FORENCICH, A., PORTER, G., SUN, P.-C., FORD, J., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] FARRINGTON, N., PORTER, G., RADHAKRISHNAN, S., BAZZAZ, H. H., SUBRAMANYA, V., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] GAETA, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] GEIS, M., LYSZCZARZ, T., OSGOOD, R., AND KIMBALL, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] GHOBADI, M., MAHAJAN, R., PHANISHAYEE, A., DEVANUR, N., KULKARNI, J., RANADE, G., BLANCHE, P.-A., RASTEGARFAR, H., GLICK, M., AND KILPER, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] GREENBERG, A., HAMILTON, J. R., JAIN, N., KANDULA, S., KIM, C., LAHIRI, P., MALTZ, D. A., PATEL, P., AND SENGUPTA, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] HALPERIN, D., KANDULA, S., PADHYE, J., BAHL, P., AND WETHERALL, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] HAMEDAZIMI, N., QAZI, Z., GUPTA, H., SEKAR, V., DAS, S. R., LONGTIN, J. P., SHAH, H., AND TANWERY, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] HEDLUND, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] HINDEN, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] KANDULA, S., SENGUPTA, S., GREENBERG, A., PATEL, P., AND CHAIKEN, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] LIGHTWAVE. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] LIN, X.-W., HU, W., HU, X.-K., LIANG, X., CHEN, Y., CUI, H.-Q., ZHU, G., LI, J.-N., CHIGRINOV, V., AND LU, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).

[24] LIU, H., LU, F., FORENCICH, A., KAPOOR, R., TEWARI, M., VOELKER, G. M., PAPEN, G., SNOEREN, A. C., AND PORTER, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] LIU, V., HALPERIN, D., KRISHNAMURTHY, A., AND ANDERSON, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] LIU, Y. J., GAO, P. X., WONG, B., AND KESHAV, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] LOVASZ, L., AND PLUMMER, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] MALKHI, D., NAOR, M., AND RATAJCZAK, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] MANKU, G. S., BAWA, M., RAGHAVAN, P., ET AL. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] MARTINEZ, A., CALO, C., ROSALES, R., WATTS, R., MERGHEM, K., ACCARD, A., LELARGE, F., BARRY, L., AND RAMDANE, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] MCKEOWN, N., ANDERSON, T., BALAKRISHNAN, H., PARULKAR, G., PETERSON, L., REXFORD, J., SHENKER, S., AND TURNER, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] METAWEB TECHNOLOGIES. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] MISRA, J., AND GRIES, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] PASQUALE, F. D., MELI, F., GRISERI, E., SGUAZZOTTI, A., TOSETTI, C., AND FORGHIERI, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] PFAFF, B., PETTIT, J., AMIDON, K., CASADO, M., KOPONEN, T., AND SHENKER, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] PI, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] PORTER, G., STRONG, R., FARRINGTON, N., FORENCICH, A., SUN, P.-C., ROSING, T., FAINMAN, Y., PAPEN, G., AND VAHDAT, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] QIAO, C., AND MEI, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND SNOEREN, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] SANFILIPPO, S., AND NOORDHUIS, P. Redis. http://redis.io, 2010.

[44] SCHRIJVER, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] SINGH, A., ONG, J., AGARWAL, A., ANDERSON, G., ARMISTEAD, A., BANNON, R., BOVING, S., DESAI, G., FELDERMAN, B., GERMANO, P., ET AL. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] SKOMOROCH, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] VASSEUR, J.-P., PICKAVET, M., AND DEMEESTER, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] WANG, G., ANDERSEN, D., KAMINSKY, M., PAPAGIANNAKI, K., NG, T., KOZUCH, M., AND RYAN, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] WANG, H., XIA, Y., BERGMAN, K., NG, T., SAHU, S., AND SRIPANIDKULCHAI, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] WON, R., AND PANICCIA, M. Integrating silicon photonics, 2010.

[53] YAMAMOTO, N., AKAHANE, K., KAWANISHI, T., KATOUF, R., AND SOTOBAYASHI, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-µm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] ZHU, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] ZHU, Z., ZHONG, S., CHEN, L., AND CHEN, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).

Appendix

Proof for non-blocking connectivity

We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V, E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Since all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is the bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u, v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k for all u∈V, and Σ_u k_(u,v) ≤ k for all v∈V, i.e., each node cannot send or receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes: s_i (source) and d_i (destination). Each edge points from a node in {s_i} to a node in {d_i}. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., determined by the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4. A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
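A minimal sketch of this peeling procedure (iterative rather than recursive, with a simple augmenting-path routine standing in for the perfect-matching algorithms from the literature [27]; all names ours):

```python
def perfect_matching(adj, n):
    """Find a perfect matching in a regular bipartite (multi)graph via
    augmenting paths. adj[u] lists the right-vertices of left-vertex u
    (duplicates encode multi-edges)."""
    match_r = [-1] * n  # right vertex -> matched left vertex

    def augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if match_r[v] == -1 or augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    for u in range(n):
        augment(u, set())
    return {match_r[v]: v for v in range(n)}  # left -> right

def decompose(adj, n):
    """Peel a Δ-regular bipartite multigraph into Δ perfect matchings
    (the DECOMPOSE of Algorithm 1, written iteratively)."""
    degree = len(adj[0])  # assumes the graph is already regular
    matchings = []
    for _ in range(degree):
        m = perfect_matching(adj, n)
        matchings.append(m)
        for u, v in m.items():
            adj[u].remove(v)  # residual graph is (Δ-1)-regular
    return matchings
```

Each returned matching corresponds to one wavelength: the node pairs it contains are the source/destination pairs assigned that wavelength.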

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The representative ones are: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring, and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. With Symphony, we can just pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on MegaSwitch Prototype

In §5.1.1, we measured a ∼20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.
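The delay model composes as a max plus a sum, since the EPS and WSS act in parallel while the control plane runs serially. A small sketch plugging in the prototype's measured per-component averages from this appendix (40-port control-plane figures; function name ours):

```python
def total_reconfig_delay(t_eps, t_wss, t_ctrl):
    """EPS and WSS reconfigure in parallel; control plane delay adds
    serially: t_r = max(t_e, t_o) + t_c."""
    return max(t_eps, t_wss) + t_ctrl

# Measured averages in ms: EPS ~5.12, WSS ~3.23; control plane =
# wavelength assignment + basemesh routing + OWS processing + RTT.
t_ctrl = 0.53 + 0.45 + 4.31 + 0.078
print(total_reconfig_delay(5.12, 3.23, t_ctrl))
# ~10.5ms; adding the ~10ms transceiver initialization from Section 5.1.1
# accounts for the ~20ms total reconfiguration gap observed.
```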

EPS configuration delay. To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then, the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay With our wavelength assignmentalgorithm (see Appendix) running on a server (the cen-tralized controller) with Intel E5-1410 28Ghz CPU wemeasured an average computation time of 053ms for 40-port (40times40 demand matrix as input) and 328ms for6336-port MegaSwitch (n=33k=192) For basemeshrouting tables our greedy algorithm runs 045ms for 40ports and 178ms for 6336 ports For the OWS con-troller processing we measured 431ms from receivinga wavelength assignment to set GPIO outputs Round-trip time in control plane network (using 1GbE switch)is 0078ms

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 593

Page 11: Enabling Wide-Spread Communications on Optical …...ing, data spreading for fault-tolerance, etc. [42,46]. Figure 1: High-level view of a 4-node MegaSwitch. High bandwidth demands:

The average query completion times from servers in different racks are shown in Figure 14. Query latency depends on hop count. With 4 wavelengths in the basemesh (b=4), Redis experiences uniformly low latencies for cross-rack queries, because they only traverse 2 EPSes (hops). For b=3, queries also traverse 2 hops, except the ones from node 3. When b=1, the basemesh is a ring, and the worst-case hop count for a query is 5. Therefore, for latency-critical, bandwidth-insensitive applications like Redis, setting b=n−1 guarantees uniform latency between all nodes on MegaSwitch. In comparison, for other related architectures, the number of EPS hops for cross-rack queries can reach 3 (electrical/optical hybrid designs [12, 24, 50]) or 5 (pure electrical designs [1, 16, 25]).

5.2 Large Scale Simulations

Existing packet-level simulators such as ns-2 are time-consuming to run at 1000+-host scale [2], and we are more interested in traffic throughput than in per-packet behavior. Therefore, we implemented a flow-level simulator to perform simulations at larger scales. Flows on the same wavelength share the bandwidth in a max-min fair manner. The simulation runs in discrete time ticks with a granularity of one millisecond. We assume a proactive controller: it is informed of the demand change in the next traffic stability period and runs the wavelength assignment algorithm before the change; it thus configures the wavelengths every t seconds with a total reconfiguration delay of 20ms (§5.1.1). Unless specified otherwise, the basemesh is configured with b=32.
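The per-wavelength max-min fair sharing used by the simulator can be sketched as a standard water-filling allocation. This sketch is ours, not the paper's code; the flow names and the 10G capacity in the example are illustrative.

```python
def max_min_fair(capacity, demands):
    """Water-filling max-min fair allocation: repeatedly satisfy every
    demand at or below the current fair share, then split the leftover
    capacity evenly among the remaining (bottlenecked) flows."""
    alloc = {f: 0.0 for f in demands}
    remaining = dict(demands)
    cap = capacity
    while remaining and cap > 1e-12:
        share = cap / len(remaining)
        # flows demanding no more than the fair share are fully satisfied
        satisfied = {f: d for f, d in remaining.items() if d <= share}
        if not satisfied:
            for f in remaining:
                alloc[f] += share
            return alloc
        for f, d in satisfied.items():
            alloc[f] += d
            cap -= d
            del remaining[f]
    return alloc

# e.g. three flows on one 10G wavelength:
# max_min_fair(10.0, {"a": 2.0, "b": 5.0, "c": 9.0}) -> a: 2.0, b: 4.0, c: 4.0
```

The small demand "a" is fully served; the remaining capacity is split evenly between the two larger flows, which is exactly the max-min property the simulator assumes.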

We use the synthetic patterns in §5.1.1 as well as realistic traces from production DCNs in our simulations. For the synthetic patterns, we simulate a 6336-port MegaSwitch (n=33, k=192), and each pattern runs for 1000 traffic stability periods (t). For the realistic traces, we simulate a 3072-port MegaSwitch (n=32, k=96) to match the scale of the real cluster. We replay the realistic traces as dynamic job arrivals and departures.

• Facebook: This Hive/MapReduce trace was collected by Chowdhury et al. [8] from a 3000-server, 150-rack Facebook cluster. For each shuffle, the trace records the start time, senders, receivers, and the received bytes. The trace contains more than 500 shuffles (7×10⁵ flows). Shuffle sizes vary from 1MB to 10TB, and the number of flows per shuffle varies from 1 to 2×10⁴.

• IDC: We collected a 2-hour running trace from the Hadoop cluster (3000 servers) of a large Internet service company. The trace records shuffle tasks (with the same information as above) as well as HDFS read tasks (sender, receiver, and size). We refer to them as IDC-Shuffle and IDC-HDFS when used separately. The trace contains more than 1000 shuffle and read tasks, respectively (1.6×10⁶ flows). The size of shuffle and read varies

from 1MB to 1TB, and the number of flows per shuffle is between 1 and 1.7×10⁴.

Figure 15: Throughput vs. traffic stability. (Fraction of full bisection bandwidth, 0–1.0, for the Random, NStride, and HStride patterns, (a) with basemesh and (b) without basemesh, over stability periods of 10, 20, 30, 100, and 1000 ms.)

Figure 16: Throughput vs. reconfiguration frequency. (Normalized throughput, 0–1.0, for the Facebook, IDC-HDFS, and IDC-Shuffle traces, (a) with basemesh and (b) without basemesh, over reconfiguration periods of 10, 20, 30, 100, and 1000 ms.)

Impact of traffic stability. We vary the traffic stability period t from 10ms to 1s for the synthetic patterns and plot the average throughput with and without basemesh in Figure 15. We make three observations: 1) If the stability period is ≤20ms (the reconfiguration delay), traffic patterns that need more reconfigurations have lower throughput, and NStride suffers the worst throughput. 2) All three patterns see their throughput steadily increase to full bisection bandwidth with longer stability periods. 3) Although the basemesh reserves wavelengths to maintain connectivity, the throughput of different patterns is not negatively affected, except for NStride, which cannot achieve full bisection bandwidth even for stable traffic (t=1s), since the basemesh takes 32 wavelengths.

We then analyze the performance of the different patterns in detail. NStride has near-zero throughput when t=10ms and 20ms, as all the wavelengths must break down and reconfigure in each period. The basemesh does not help much for NStride: when the wavelengths are not yet available, the basemesh can serve only 1/192 of the demand. HStride reaches ~75% of full bisection bandwidth when the stability period is 10ms, because on average 3/4 of its demands stay the same between consecutive periods, and demands in the new period reuse some of the previous wavelengths. The Random pattern is wide-spread and requires more reconfigurations in each period than HStride. Thus, it also suffers from unstable traffic, with 18.5% of full-bisection bandwidth for t=10ms without basemesh. However, it benefits from the basemesh the most, achieving 34.1% of full-bisection bandwidth for t=10ms, because the flows between two nodes need not wait for reconfigurations.


Figure 17: Handling unstable traffic with basemesh. ((a) Fraction of full bisection bandwidth for the Random, NStride, and HStride patterns; (b) normalized throughput for the Facebook and IDC traces; x-axis: wavelengths in the basemesh, from 8 to 192.)

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay). We note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs.

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using the realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that the traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with the traffic statistics in another study [42], which suggests that with good load balancing, traffic is stable on a sub-second scale, and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread in the realistic traces: in every period, a rack in Facebook, IDC-HDFS, and IDC-Shuffle talks with 12.1, 4.5, and 14.6 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of the synthetic patterns (stability period of 10ms) and the production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to only one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as in the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput (in bandwidth per wavelength) vs. port count, from 128 to 6144, for Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch.

increase the throughput by 56.1%. However, this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when more than 4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

Figure 19: Worst-case bisection throughput (in bandwidth per wavelength) vs. port count, from 128 to 6144, for the same structures as in Figure 18.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 and 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and we design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.
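The averaging procedure can be sketched as a small Monte Carlo harness. This is our sketch, not the paper's code: the `throughput` callback stands in for a per-structure model (Helios, OSA, Mordia, etc.) and is an assumption.

```python
import random

def avg_bisection_bandwidth(ports, throughput, trials=10_000, seed=0):
    """Monte Carlo estimate of average bisection bandwidth: draw random
    permutations of port pairings and average the aggregate throughput a
    structure achieves on each. `throughput(pairs)` models one structure
    and returns its total throughput, in bandwidth-per-wavelength units,
    for the given list of (src, dst) pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dst = list(range(ports))
        rng.shuffle(dst)   # one random permutation of port pairings
        pairs = list(enumerate(dst))
        total += throughput(pairs)
    return total / trials

# An ideal non-blocking fabric serves every pair at unit rate:
# avg_bisection_bandwidth(128, lambda pairs: len(pairs), trials=100) -> 128.0
```

A real structure's callback would discount pairs that contend for the same wavelength or lack direct rack-to-rack connectivity, which is where the averages in Figure 18 diverge from the non-blocking line.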

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while the other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices with larger WSS radix and more wavelengths per fiber. 2) The basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA and Helios/c-Through.


6 Cost of MegaSwitch

Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors, such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass production can push its cost down to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to 32 rings in parallel, without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric, all the EPSes are connected via transceivers.

From the table, we find that, compared to the other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks, rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, their cost is expected to be lower. Even if the customized DWDM transceivers are still more expensive

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), and the cost difference will grow as the network scales [38].

Components    OSA         Quartz  Mordia      Mordia-PTL  MegaSwitch  FatTree
Amplifier     0           6144    768         768         32          0
WSS           32 (1×4)    960     768 (1×4)   768 (1×4)   32 (32×1)   0
WDM Filter    0           0       768         768         0           0
OCS           1×128-port  0       1×32-port   0           0           0
Circulator    3072        0       0           0           0           0
Transceivers  3072        3072    3072        3072        3072        6912

Table 1: Comparison of optical complexity (number of optical components used) at the same scale (3072 ports).

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch maintains a good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus, the key specifications can be relaxed (e.g., the bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4 times lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches based on ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion

We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack, 40-server prototype, and with experiments on this prototype, as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.


References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com/2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).


[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A Scalable Content-Addressable Network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof of Non-blocking Connectivity

We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V,E) as a MegaSwitch logical graph, where V and E are the node and edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., the bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V; i.e., each node cannot send or receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
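Definition 2 amounts to a row/column-sum check on the demand matrix. A minimal sketch (the n×n list-of-lists encoding of γ is our illustrative choice, not notation from the paper):

```python
def is_feasible(demand, k):
    """demand[u][v] = number of wavelengths requested from node u to v
    (k_e for edge e=(u,v)). Feasible iff no node sends more than k
    wavelengths (row sums) or receives more than k (column sums)."""
    n = len(demand)
    sends_ok = all(sum(demand[u]) <= k for u in range(n))
    recvs_ok = all(sum(demand[u][v] for u in range(n)) <= k for v in range(n))
    return sends_ok and recvs_ok

# With k=2:
# is_feasible([[0, 2], [1, 0]], 2) -> True   (each node sends/receives <= 2)
# is_feasible([[0, 3], [1, 0]], 2) -> False  (node 0 would send 3 > k)
```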

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes: s_i (source) and d_i (destination). Each edge points from a source node to a destination node. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. If each wavelength is given a unique color, this is equivalent to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, …, m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:     return {G′_r}
3: else
4:     m ← Perfect_Matching(G′_r)
5:     return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., determined by the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we need only the smallest possible number, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′)≤k.

Theorem 4. A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′)≤k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, after which the residual graph is (Δ(G′)−1)-regular; we continue extracting matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′)≤k, as the out-degree and in-degree of any node must be ≤k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
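A self-contained sketch of Algorithm 1, with the standard augmenting-path method standing in for Perfect_Matching(·) (the paper defers that subroutine to existing algorithms [27]). The node names and edge encoding are illustrative.

```python
from collections import defaultdict

def perfect_matching(adj, left):
    """Find a perfect matching via augmenting paths (Hungarian method).
    adj maps each left (source) node to a list of right (destination)
    nodes; repeated entries encode parallel edges of the multigraph."""
    match = {}  # right node -> left node
    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False
    for u in left:
        if not augment(u, set()):
            raise ValueError("no perfect matching: graph is not regular")
    return {u: v for v, u in match.items()}  # left -> right

def decompose(edges):
    """Decompose a d-regular bipartite multigraph, given as (u, v) edge
    pairs, into d perfect matchings (one wavelength per matching)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    left = sorted(adj)
    degree = len(adj[left[0]])
    matchings = []
    for _ in range(degree):
        m = perfect_matching(adj, left)
        matchings.append(m)
        for u, v in m.items():   # residual graph is (d-1)-regular
            adj[u].remove(v)
    return matchings
```

For a 2-regular demand between sources {s0, s1} and destinations {d0, d1}, `decompose` returns two perfect matchings, each of which can be served on one wavelength without interference.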

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology; the following are representative: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have already compared Chord [48] and Symphony [29] in §3.2.2, we next look at the remaining ones.

• Pastry suffers in average path length when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on the ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as for Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].


when the capacity of the basemesh is increased or decreased. For Symphony, we can simply pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh with the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype

In §5.1.1, we measured a ~20ms reconfiguration delay for the MegaSwitch prototype; here we break this delay down into the EPS configuration delay (t_e), the WSS switching delay (t_o), and the control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as the EPS and WSS configurations can be done in parallel.

EPS configuration delay. To set up a route in MegaSwitch, the source and destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
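As a sanity check, plugging the prototype averages into t_r = max(t_e, t_o) + t_c reproduces the ~20ms figure from §5.1.1 at the 99th percentile. Summing the measured control-plane parts into a single t_c is our assumption.

```python
# Reconfiguration delay model: t_r = max(t_e, t_o) + t_c
# (EPS and WSS configuration run in parallel, after the control plane).
t_e = 5.12                          # EPS configuration delay, ms (avg)
t_e_p99 = 12.87                     # EPS configuration delay, ms (99th pct)
t_o = 3.23                          # WSS switching delay, ms
t_c = 0.53 + 0.45 + 4.31 + 0.078    # assignment + routing + OWS + RTT, ms
                                    # (40-port prototype figures)

t_r = max(t_e, t_o) + t_c
print(f"average t_r  = {t_r:.2f} ms")                    # ~10.49 ms
print(f"99th-pct t_r = {max(t_e_p99, t_o) + t_c:.2f} ms")  # ~18.24 ms
```

The 99th-percentile estimate of ~18.24ms is close to the ~20ms total reconfiguration delay measured on the prototype.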

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 593

Page 12: Enabling Wide-Spread Communications on Optical …...ing, data spreading for fault-tolerance, etc. [42,46]. Figure 1: High-level view of a 4-node MegaSwitch. High bandwidth demands:

Figure 17: Handling unstable traffic with basemesh. (a) Fraction of full bisection bandwidth for the Random, NStride, and HStride patterns; (b) normalized throughput for the Facebook and IDC traces. X-axis: wavelengths in basemesh (8, 16, 32, 64, 96, 128, 160, 192); y-axes: 0 to 1.0.

In summary, MegaSwitch provides near full-bisection bandwidth for stable traffic of all patterns, but cannot achieve high throughput for concentrated, unstable traffic (NStride with a stability period smaller than the reconfiguration delay; we note that such a pattern is designed to test the worst-case behavior of MegaSwitch and is unlikely to occur in production DCNs).

Impact of reconfiguration frequency. We vary the reconfiguration frequency to study its impact on throughput using realistic traces. In Figure 16, we collect the throughput of the replayed traffic traces¹² and then normalize it to the throughput of the same traces replayed on a non-blocking fabric with the same port count. The normalized throughput shows how closely MegaSwitch approaches a non-blocking fabric.

From the results, we find that MegaSwitch achieves 93.21% and 90.84% normalized throughput on average with and without basemesh, respectively. This indicates that our traces collected from the production DCNs are very stable, and thus can take advantage of the high bandwidth of optical circuits. This aligns well with traffic statistics in another study [42], which suggests that, with good load balancing, the traffic is stable on a sub-second scale and thus is suitable for MegaSwitch. We also observe that the traffic is wide-spread for realistic traces: for every period, a rack in Facebook, IDC-HDFS & IDC-Shuffle talks with 121, 45 & 146 racks on average, respectively. Limited by rack-to-rack connectivity, other optical structures are less effective for such patterns.

Impact of adjusting basemesh capacity. For rapidly changing traffic, MegaSwitch can improve its throughput by allocating more wavelengths to the basemesh. In Figure 17, we measure the throughput of synthetic patterns (stability period of 10ms) and production traces. The production traces feature dynamic arrival/departure of tasks.

For NStride and HStride (Figure 17(a)), since the traffic of each node is destined to only one or two other nodes, increasing the basemesh does not benefit them much. In contrast, for Random, each node sends to many nodes (i.e., wide-spread traffic): increasing the capacity of the basemesh b from 8 wavelengths to 32 wavelengths increases the throughput by 20.6%. Further increasing b to 160 can

¹²The wavelength schedules are obtained in the same way as the Spark experiments in §5.1.2.

Figure 18: Average bisection throughput (y-axis: bisection bandwidth per wavelength, 0–6000; x-axis: port count, 128–6144) for Helios/c-Through, OSA, Mordia-Stacked-Ring, Mordia-PTL, MegaSwitch-BaseMesh, and MegaSwitch.

Figure 19: Worst-case bisection throughput (same structures and axes as Figure 18).

increase the throughput by 56.1%. But this trend is not monotonic: if all the wavelengths are in the basemesh (b=192), then demands between nodes larger than 6 wavelengths cannot be supported. This is also observed for the realistic traces in Figure 17(b): when more than 4 wavelengths are allocated to the basemesh, the normalized throughput decreases.

In summary, for rapidly changing traffic, the basemesh is an effective and adjustable tool for MegaSwitch to handle wide-spread demand fluctuation. As a comparison, hybrid structures [12, 24, 50] cannot dynamically adjust the capacity of their parallel electrical fabrics.

Bisection bandwidth. To compare MegaSwitch with other optical proposals, we calculate the average and worst-case bisection bandwidths of MegaSwitch and related proposals in Figures 18 & 19, respectively. The unit of the y-axis is bandwidth per wavelength. We generate random permutations of port pairings 10⁴ times to compute the average bisection bandwidth for each structure, and design the worst-case scenario for each of them: for Mordia, OSA, and Helios/c-Through, the worst-case scenario is when every node talks to all other nodes¹³, because a node can only directly talk to a limited number of other nodes in these structures. We calculate the total throughput of simultaneous connections between ports. We rigorously follow the respective papers to scale port counts from 128 to 6144. For example, OSA at 1024 ports uses a 16-port OCS for 16 racks, each with 64 hosts; Mordia at 6144 ports uses a 32-port OCS to connect 32 rings, each with 192 hosts.

We observe: 1) MegaSwitch and Mordia-PTL can achieve full bisection bandwidth at high port counts in both the average and the worst case, while other designs, including Mordia-Stacked-Ring, cannot. Both structures are expected to maintain full bisection bandwidth with future scaling in optical devices, with larger WSS radix and more wavelengths per fiber. 2) Basemesh reduces the worst-case bisection bandwidth by 16.7% for b=32, but full bisection bandwidth is still achieved on average.

¹³A node refers to a ring in Mordia, and a rack in OSA/Helios/c-Through.

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 587

6 Cost of MegaSwitch
Complexity analysis. The absolute costs of optical proposals are difficult to calculate, as they depend on multiple factors such as market availability, manufacturing costs, etc. For example, our PRF module can be printed as a planar lightwave circuit, and mass-production can push its cost to that of common printed circuit boards [10]. Thus, instead of directly calculating the costs based on price assumptions, we take a comparative approach and analyze the structural complexity of MegaSwitch and closely related proposals.

In Table 1, we compare the complexity of different optical structures at the same port count (3072). For OSA, this translates to 32 racks, 96 DWDM wavelengths, and a 128-port OCS (Optical Circuit Switch); the ToR degree is set to 4 [6]. Quartz's port count is limited by the wavelengths per fiber (k) [26], so a Quartz element with k=96 is essentially a 96-port switch, and we scale it to 3072 ports by composing 160 Quartz elements¹⁴ in a FatTree [1] topology (with 32 pods and 32 cores). For Mordia, we use its microsecond 1×4 WSS and stack 32 rings via a 32-port OCS to reach 3072 ports. Mordia-PTL [11] is configured in the same way as Mordia, with the only difference that each EPS is directly connected to the 32 rings in parallel without using an OCS. Both Mordia and Mordia-PTL have 96/4=24 stations on each ring. MegaSwitch is configured with 32 nodes and 96 wavelengths per node. Finally, we list a FatTree constructed with 24-port EPSes, with 24³/4=3456 ports in total and 720 EPSes. We assume that within the FatTree fabric all the EPSes are connected via transceivers.

From the table, we find that, compared to other optical structures, MegaSwitch supports the same port count with fewer optical components. Compared to all the optical solutions, FatTree uses 2× the optical transceivers (non-DWDM) at the same scale.

Transceiver cost. Our prototype uses commercial 10Gb-ER DWDM SFP+ transceivers, which are ~10× more expensive per bit per second than non-DWDM modules using PSM4. This is because ER optics is usually used for long-range optical networks rather than short-range networks within a DCN. Our choice of ER optics is solely due to its ability to use DWDM wavelengths, not its power output or 40km reach. If MegaSwitch or similar optical interconnects are widely adopted in future DCNs, we expect that DWDM transceivers customized for DCNs will be used instead of current PSM4 modules. Such transceivers are being developed [22], and due to the relaxed requirements of short-range DCNs, the cost is expectedly lower. Even if the customized DWDM transceivers are still more expensive than non-DWDM ones, since FatTree needs many more EPSes (and thus transceivers), the cost difference will grow as the network scales [38].

¹⁴We assume a Quartz ring of 6 wavelength add/drop multiplexers [26] in each element, thus 6×160=960 in total.

Components      OSA          Quartz   Mordia      Mordia-PTL   MegaSwitch   FatTree
Amplifier       0            6144     768         768          32           0
WSS             32 (1×4)     960      768 (1×4)   768 (1×4)    32 (32×1)    0
WDM Filter      0            0        768         768          0            0
OCS             1×128-port   0        1×32-port   0            0            0
Circulator      3072         0        0           0            0            0
Transceivers    3072         3072     3072        3072         3072         6912

Table 1: Comparison of optical complexity at the same scale (3072 ports), in optical components used.

Amplifier cost. With only one stage of amplification on each fiber, MegaSwitch can maintain good OSNR at large scale (Figure 3). Therefore, MegaSwitch needs far fewer OAs (optical amplifiers) at the same scale. The tradeoff is that they must be more powerful, as the total power needed to reach the same number of ports cannot be reduced significantly. Although the unit cost of a powerful OA is higher¹⁵, we expect the total cost to be similar or lower, as MegaSwitch requires far fewer OAs (24× fewer than Mordia and 192× fewer than Quartz).

WSS cost. In Table 1, MegaSwitch uses a 32×1 WSS, and one may wonder about its cost versus a 1×4 WSS. In fact, our design allows for a low-cost WSS, as the transceiver link budget design does not need to consider optical hopping; thus the key specifications can be relaxed (e.g., bandwidth, insertion loss, and polarization-dependent loss requirements). Liquid Crystal (LC) technology is used in our WSS, as it is easier to cost-effectively scale to more ports. As the majority of the components in a 32×1 WSS are the same as in a 1×4 WSS, the per-port cost of a 32×1 WSS is about 4× lower than that of a current 1×4 WSS [6, 38]. In the future, silicon photonics (e.g., matrix switches built from ring resonators [13]) can improve the integration level [52] and further reduce the cost.

7 Conclusion
We presented MegaSwitch, an optical interconnect that delivers rearrangeably non-blocking communication to 30+ racks and 6000+ servers. We have implemented a working 5-rack 40-server prototype, and with experiments on this prototype as well as large-scale simulations, we demonstrated the potential of MegaSwitch in supporting wide-spread, high-bandwidth workloads among many racks in production DCNs.

Acknowledgements. This work is supported in part by Hong Kong RGC ECS-26200014, GRF-16203715, GRF-613113, CRF-C703615G, China 973 Program No. 2014CB340303, NSF CNS-1453662, CNS-1423505, CNS-1314721, and NSFC-61432009. We thank the anonymous NSDI reviewers and our shepherd, Ratul Mahajan, for their constructive feedback and suggestions.

¹⁵For example, a powerful OA can be constructed from a series of less powerful ones.

588 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual Router Redundancy Protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 589

[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Society, 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 x 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for

590 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 591

Appendix

Proof for Non-blocking Connectivity
We first formulate the wavelength assignment problem.

Problem formulation. Denote G=(V,E) as a MegaSwitch logical graph, where V, E are the node/edge sets and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff Σ_v k_(u,v) ≤ k ∀u∈V and Σ_u k_(u,v) ≤ k ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.
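Definition 2 amounts to bounding the row and column sums of the demand. A minimal sketch of the check (the dict-based demand encoding is our illustrative choice, not an interface from the paper):

```python
# Feasibility check corresponding to Definition 2: each node may send
# and receive at most k wavelengths in total.

def is_feasible(demand, k):
    """demand: {(u, v): k_e}, wavelengths requested on directed edge u->v."""
    sent, received = {}, {}
    for (u, v), ke in demand.items():
        sent[u] = sent.get(u, 0) + ke
        received[v] = received.get(v, 0) + ke
    return (all(s <= k for s in sent.values())
            and all(r <= k for r in received.values()))

# With k=3, a node may spread its wavelengths over several peers,
# but sending 4 in total exceeds its port budget.
demand_ok = {(0, 1): 2, (0, 2): 1, (1, 0): 3}
demand_bad = {(0, 1): 2, (0, 2): 2}
```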

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength-efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.

Centralized optimal algorithm. To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination). Each edge points from a source node to a destination node. Then we expand G into a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:     return {G′_r}
3: else
4:     m ← Perfect_Matching(G′_r)
5:     return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying an arbitrary feasible γ with wavelength efficiency. We extend G′ to a Δ(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, Δ(G′), to satisfy γ, which also proves wavelength efficiency since Δ(G′)≤k.

Theorem 4. A feasible demand γ only needs Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′)≤k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus we can extract one perfect matching from the original graph G′_r, after which the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings¹⁷. Thus any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′)≤k, as the out-degree and in-degree of any node must be ≤k in any feasible demand due to the physical limitation of k ports. Therefore any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
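The scheme above can be sketched end to end in Python. This is an illustrative reconstruction, not the controller's code: we regularize the demand multigraph with dummy edges, and Kuhn's augmenting-path algorithm stands in for the Perfect_Matching(·) subroutine; dummy edges are filtered out when reporting the assignment.

```python
# Sketch of wavelength assignment via Algorithm 1 (DECOMPOSE):
# regularize the bipartite demand multigraph to Delta-regular, then
# peel off one perfect matching per wavelength.

def assign_wavelengths(demand):
    """demand: n x n list, demand[u][v] = wavelengths needed from u to v.
    Returns {(u, v): [wavelength ids]} using Delta = max degree colors."""
    n = len(demand)
    M = [row[:] for row in demand]  # working multigraph (multiplicities)
    col = lambda j: sum(M[i][j] for i in range(n))
    delta = max(max(map(sum, M)), max(col(j) for j in range(n)))

    # Regularize: pair deficient senders with deficient receivers.
    for u in range(n):
        while sum(M[u]) < delta:
            j = next(j for j in range(n) if col(j) < delta)
            M[u][j] += 1  # dummy unit edge, filtered out below

    def perfect_matching():
        match = [-1] * n  # match[v] = source matched to destination v

        def augment(u, seen):
            for v in range(n):
                if M[u][v] > 0 and v not in seen:
                    seen.add(v)
                    if match[v] < 0 or augment(match[v], seen):
                        match[v] = u
                        return True
            return False

        for u in range(n):  # always succeeds on a regular bipartite graph
            augment(u, set())
        return match

    colors = {}
    for w in range(delta):  # one perfect matching per wavelength
        match = perfect_matching()
        for v, u in enumerate(match):
            M[u][v] -= 1  # remove the matched unit edge
            if demand[u][v] > len(colors.get((u, v), [])):
                colors.setdefault((u, v), []).append(w)  # real edge
    return colors
```

On a 3-node demand with maximum degree 3, the assignment uses at most 3 wavelengths, and for each wavelength the colored edges form a matching, i.e., no node sends or receives the same wavelength twice.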

Considerations for Basemesh Topology
As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology; the following are representative: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones:

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring and l is the length of a node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

592 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

Reconfiguration Latency Measurements on the MegaSwitch Prototype
In §5.1.1, we measured a ~20ms reconfiguration delay for the MegaSwitch prototype, and here we break this delay down into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as EPS and WSS configurations can be done in parallel.

EPS configuration delay. To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we can calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay. We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay. With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting the GPIO outputs. The round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
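Putting the measured pieces together with the delay model t_r = max(t_e, t_o) + t_c: treating the control-plane delay as the sum of the computation, OWS processing, and network round-trip components is our simplifying assumption for this sketch, so the result is illustrative rather than the measured end-to-end figure.

```python
# Reconfiguration delay model t_r = max(t_e, t_o) + t_c: EPS and WSS
# reconfigure in parallel, then the control plane contribution is added.
# Summing the 40-port control-plane components is an assumption.

def reconfiguration_delay(t_eps, t_wss, t_control):
    """All delays in milliseconds; EPS and WSS proceed in parallel."""
    return max(t_eps, t_wss) + t_control

t_c = 0.53 + 4.31 + 0.078  # 40-port: computation + OWS processing + RTT
t_r = reconfiguration_delay(t_eps=5.12, t_wss=3.23, t_control=t_c)
```

Under these assumptions the components sum to roughly 10ms; the ~20ms measured in §5.1.1 additionally includes effects not modeled here.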

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 593

Page 13: Enabling Wide-Spread Communications on Optical …...ing, data spreading for fault-tolerance, etc. [42,46]. Figure 1: High-level view of a 4-node MegaSwitch. High bandwidth demands:

6 Cost of MegaSwitchComplexity analysis The absolute costs of optical pro-posals are difficult to calculate as it depends on multiplefactors such as market availability manufacturing costsetc For example our PRF module can be printed as aplanar lightwave circuit and mass-production can pushits cost to that of common printed circuit boards [10]Thus instead of directly calculating the costs based onprice assumptions we take a comparative approach andanalyze the structural complexity of MegaSwitch andclosely related proposals

In Table 1 we compare the complexity of different op-tical structures at the same port count (3072) For OSAit translates to 32 racks 96 DWDM wavelengths and a128-port OCS (Optical Circuit Switch) ToR degree is setto 4 [6] Quartzrsquos port count is limited to the wavelengthper fiber (k) [26] so a Quartz element with k=96 is es-sentially a 96-port switch and we scale it to 3072 portsby composing 160 Quartz elements14 in a FatTree [1]topology (with 32 pods and 32 cores) For Mordia weuse its microsecond 1times4 WSS and stack 32 rings via a32 port OCS to reach 3072 ports Mordia-PTL [11] isconfigured in the same way as Mordia with the only dif-ference that each EPS is directly connected to 32 ringsin parallel without using OCS Both Mordia and Mordia-PTL have 964=24 stations on each ring MegaSwitch isconfigured with 32 nodes and 96 wavelengths per nodeFinally we list a FatTree constructed with 24-port EPSwith totally 2434=3456 ports and 720 EPSes We as-sume that within the FatTree fabric all the EPSes areconnected via transceivers

From the table we find that compared to other opti-cal structures MegaSwitch supports the same port countwith less optical components Compared to all the opti-cal solutions FatTree uses 2times optical transceivers (non-DWDM) at the same scale

Transceiver sost Our prototype uses commercial10Gb-ER DWDM SFP+ transceivers which are sim10timesmore expensive per bit per second than the non-DWDMmodules using PSM4 This is because ER optics isusually used for long-range optical networks ratherthan short-range networks within a DCN Our choiceof using ER optics is solely due to its ability to useDWDM wavelengths not its power output or 40KMreach If MegaSwitch or similar optical interconnects arewidely adopted in future DCNs we expect that DWDMtransceivers customized for DCNs will be used instead ofcurrent PSM4 modules Such transceivers are being de-veloped [22] and due to relaxed requirements of short-range DCN the cost is expectedly lower Even if thecustomized DWDM transceivers are still more expensive

14We assume a Quartz ring of 6 wavelength adddrop multiplex-ers [26] in each element thus 6times160=960 in total

Components OSA Quartz Mordia Mordia-PTL MegaSwitch FatTreeAmplifier 0 6144 768 768 32 0

WSS 32 (1times4) 960 768 (1times4) 768 (1times 4) 32 (32times1) 0WDM Filter 0 0 768 768 0 0

OCS 1times128-port 0 1times32-port 0 0 0Circulator 3072 0 0 0 0 0

Transceivers 3072 3072 3072 3072 3072 6912

Table 1 Comparison of optical complexity at thesame scale (3072 ports) in optical components usedthan non-DWDM ones since FatTree needs many moreEPSes (and thus transceivers) and the cost differencewill grow as the network scales [38]

Amplifier cost With only one stage of amplificationon each fiber MegaSwitch can maintain good OSNRat large scale (Figure 3) Therefore MegaSwitch needsmuch fewer OAs (optical amplifier) at the same scaleThe tradeoff is that they must be more powerful as thetotal power to reach the same number of ports cannot bereduced significantly Although unit cost of a powerfulOA is higher15 we expect the total cost to be similar orlower as MegaSwitch requires much fewer OAs (24timesfewer than Mordia and 192times fewer than Quartz)

WSS cost In Table 1 MegaSwitch uses 32times1 WSSand one may wonder its cost versus 1times4 WSS In factour design allows for low cost WSS as transceiver linkbudget design does not need to consider optical hoppingthus the key specifications can be relaxed (eg band-width insertion loss and polarization dependent loss re-quirements) Liquid Crystal (LC) technology is used inour WSS as it is easier to cost-effectively scale to moreports As the majority of the components in a 32times1WSS are same as a 1times4 WSS the per-port cost of 32times1WSS is about 4 times lower than that of a current 1times4WSS [6 38] In the future silicon photonics (eg matrixswitch by ring resonators [13]) can improve the integra-tion level [52] and further reduce the cost

7 ConclusionWe presented MegaSwitch an optical interconnect thatdelivers rearrangeably non-blocking communication to30+ racks and 6000+ servers We have implementeda working 5-rack 40-server prototype and with exper-iments on this prototype as well as large-scale simula-tions we demonstrated the potential of MegaSwitch insupporting wide-spread high-bandwidth demands work-loads among many racks in production DCNs

Acknowledgements This work is supported in partby Hong Kong RGC ECS-26200014 GRF-16203715GRF-613113 CRF-C703615G China 973 ProgramNo2014CB340303 NSF CNS-1453662 CNS-1423505CNS-1314721 and NSFC-61432009 We thank theanonymous NSDI reviewers and our shepherd Ratul Ma-hajan for their constructive feedback and suggestions

15For example powerful OA can be constructed by a series of lesspowerful ones

588 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

References

[1] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM (2008).

[2] Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In USENIX NSDI (2010).

[3] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.

[4] Annexstein, F., and Baumslag, M. A unified approach to off-line permutation routing on parallel networks. In ACM SPAA (1990).

[5] Ceglowski, M. The website obesity crisis. http://idlewords.com/talks/website_obesity.htm, 2015.

[6] Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X., and Chen, Y. OSA: An optical switching architecture for data center networks with unprecedented flexibility. In USENIX NSDI (2012).

[7] Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., and Dong, Q. WaveCube: A scalable, fault-tolerant, high-performance optical data center architecture. In IEEE INFOCOM (2015).

[8] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In ACM SIGCOMM (2014).

[9] Cole, R., and Hopcroft, J. On edge coloring bipartite graphs. SIAM Journal on Computing 11, 3 (1982), 540–546.

[10] Doerr, C. R., and Okamoto, K. Advances in silica planar lightwave circuits. Journal of Lightwave Technology 24, 12 (2006), 4763–4789.

[11] Farrington, N., Forencich, A., Porter, G., Sun, P.-C., Ford, J., Fainman, Y., Papen, G., and Vahdat, A. A multiport microsecond optical circuit switch for data center networking. In IEEE Photonics Technology Letters (2013).

[12] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H. H., Subramanya, V., Fainman, Y., Papen, G., and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In ACM SIGCOMM (2010).

[13] Gaeta, A. L. All-optical switching in silicon ring resonators. In Photonics in Switching (2014).

[14] Geis, M., Lyszczarz, T., Osgood, R., and Kimball, B. 30 to 50 ns liquid-crystal optical switches. In Optics Express (2010).

[15] Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N., Kulkarni, J., Ranade, G., Blanche, P.-A., Rastegarfar, H., Glick, M., and Kilper, D. ProjecToR: Agile reconfigurable data center interconnect. In ACM SIGCOMM (2016).

[16] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A scalable and flexible data center network. In ACM SIGCOMM (2009).

[17] Halperin, D., Kandula, S., Padhye, J., Bahl, P., and Wetherall, D. Augmenting data center networks with multi-gigabit wireless links. In SIGCOMM (2011).

[18] Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., Shah, H., and Tanwer, A. FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM (2014).

[19] Hedlund, B. Understanding Hadoop clusters and the network. http://bradhedlund.com, 2011.

[20] Hinden, R. Virtual router redundancy protocol (VRRP). RFC 5798 (2004).

[21] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The nature of data center traffic: Measurements and analysis. In ACM IMC (2009).

[22] Lightwave. Ranovus unveils optical engine, 200-Gbps PAM4 optical transceiver. https://goo.gl/QbRUd4, 2016.

[23] Lin, X.-W., Hu, W., Hu, X.-K., Liang, X., Chen, Y., Cui, H.-Q., Zhu, G., Li, J.-N., Chigrinov, V., and Lu, Y.-Q. Fast response dual-frequency liquid crystal switch with photo-patterned alignments. In Optics Letters (2012).

[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Soc., 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(∆m) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-µm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for non-blocking connectivity
We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V, E are node/edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ={k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as:

Definition 2. A demand γ is feasible iff ∑_v k_(u,v) ≤ k, ∀u∈V, and ∑_u k_(u,v) ≤ k, ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3. Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
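The feasibility condition in Definition 2 is a direct check on the demand matrix. The following sketch (names and matrix encoding are illustrative, not from the paper) treats γ as an n×n matrix K with K[u][v] = k_(u,v):

```python
# Hypothetical helper: Definition 2 says a demand is feasible iff no node
# sends or receives more than k wavelengths in total.
def is_feasible(K, k):
    n = len(K)
    send_ok = all(sum(K[u][v] for v in range(n)) <= k for u in range(n))  # rows
    recv_ok = all(sum(K[u][v] for u in range(n)) <= k for v in range(n))  # cols
    return send_ok and recv_ok

# 3-node example with k = 4 available wavelengths per node:
K = [[0, 2, 2],
     [1, 0, 2],
     [1, 2, 0]]
print(is_feasible(K, 4))  # True: every row and column sums to at most 4
```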

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes: s_i (sources) and d_i (destinations). Each edge points from a node in {s_i} to a node in {d_i}. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this equivalently transforms to the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into ∆(G′_r) perfect matchings
Input: G′_r
Output: π = {m_1, m_2, …, m_∆(G′_r)}
1: if ∆(G′_r) ≤ 1 then
2:   return {G′_r}
3: else
4:   m ← Perfect_Matching(G′_r)
5:   return {m} ∪ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a ∆(G′)-regular multigraph¹⁶ G′_r by adding dummy edges, where ∆(G′) is the largest vertex degree of G′, i.e., the largest k_e in γ. For a bipartite graph G′ with maximum degree ∆(G′), at least ∆(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we only need the smallest value possible, ∆(G′), to satisfy γ, which also proves wavelength efficiency, since ∆(G′) ≤ k.

Theorem 4. A feasible demand γ only needs ∆(G′), i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with ∆(G′) ≤ k wavelengths using Algorithm 1.

Proof. DECOMPOSE(·) recursively finds ∆(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, and the residual graph becomes (∆(G′)−1)-regular; we continue to extract matchings one by one until we have ∆(G′) perfect matchings¹⁷. Thus, any demand described by G′ can be satisfied with ∆(G′) wavelengths. Note that ∆(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands and assigns a wavelength to the matching generated in each iteration.
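The whole pipeline above — pad the demand to a ∆-regular multigraph with dummy edges, then peel off one perfect matching per wavelength — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names are ours, and Kuhn's augmenting-path algorithm stands in for whichever perfect-matching routine from the literature [27] the controller uses:

```python
def assign_wavelengths(K):
    """Color every requested channel in demand matrix K so that no node
    sends or receives two channels on the same wavelength, using at most
    Delta(G') colors (Theorem 4)."""
    n = len(K)
    row = [sum(K[u]) for u in range(n)]
    col = [sum(K[u][v] for u in range(n)) for v in range(n)]
    delta = max(row + col)                  # Delta(G'): largest node degree
    pad = [r[:] for r in K]                 # multigraph edge multiplicities
    for u in range(n):                      # add dummy edges until every
        for v in range(n):                  # row/column sums to exactly delta
            extra = min(delta - row[u], delta - col[v])
            if extra > 0:
                pad[u][v] += extra
                row[u] += extra
                col[v] += extra
    real = [r[:] for r in K]                # real demand not yet assigned
    colors = {}                             # (u, v) -> list of wavelengths
    for color in range(delta):              # peel one perfect matching per color
        for u, v in perfect_matching(pad).items():
            pad[u][v] -= 1                  # remove matched edge from graph
            if real[u][v] > 0:              # skip dummy edges
                real[u][v] -= 1
                colors.setdefault((u, v), []).append(color)
    return colors

def perfect_matching(cap):
    """Kuhn's augmenting-path matching; on the regular padded graph the
    maximum matching is perfect, since any k-regular bipartite graph has
    a perfect matching [44]."""
    n = len(cap)
    match_v = [-1] * n                      # destination -> matched source
    def augment(u, seen):
        for v in range(n):
            if cap[u][v] > 0 and not seen[v]:
                seen[v] = True
                if match_v[v] == -1 or augment(match_v[v], seen):
                    match_v[v] = u
                    return True
        return False
    for u in range(n):
        augment(u, [False] * n)
    return {match_v[v]: v for v in range(n) if match_v[v] != -1}
```

For example, the 3-node demand K = [[0,2,1],[1,0,2],[2,1,0]] is already 3-regular, so it is satisfied with exactly ∆(G′)=3 wavelengths, each color forming a perfect matching.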

Considerations for Basemesh Topology
As described in §3.2.2, basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are the representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones:

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log₂(l·n) [41], where n is the number of nodes on a ring and l is the length of a node ID in bits. For MegaSwitch, both parameters are fixed (like Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh using the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

¹⁶A graph is regular when every vertex has the same degree.
¹⁷Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms in the literature [27].

Reconfiguration Latency Measurements on MegaSwitch Prototype
In §5.1.1 we measured ∼20ms reconfiguration delay for the MegaSwitch prototype, and here we break down this delay into EPS configuration delay (t_e), WSS switching delay (t_o), and control plane delay (t_c). The total reconfiguration delay for MegaSwitch is t_r = max(t_e, t_o) + t_c, as EPS and WSS configurations can be done in parallel.

EPS configuration delay: To set up a route in MegaSwitch, the source/destination EPSes must be configured. We measure the EPS configuration delay as follows (all servers and the controller are synchronized by NTP): we first set up a wavelength between 2 nodes and leave the EPSes unconfigured. We let a host in one of the nodes keep generating UDP packets using netcat to a host in the other node. Then the controller sends OpenFlow control messages to both EPSes to set up the route, and we collect the packets at the receiver with tcpdump. Finally, we can calculate the delay: the average and 99th percentile are 5.12ms and 12.87ms, respectively.

WSS switching delay: We measure the switching speed of the 8×1 WSS used in our testbed by switching the optical signal from one port to another. Figure 12 shows the power readings from the original port and from the destination port; the measured switching delay is 3.23ms (subject to temperature, drive voltage, etc.).

Control plane delay: With our wavelength assignment algorithm (see Appendix) running on a server (the centralized controller) with an Intel E5-1410 2.8GHz CPU, we measured an average computation time of 0.53ms for a 40-port MegaSwitch (40×40 demand matrix as input) and 3.28ms for a 6336-port MegaSwitch (n=33, k=192). For the basemesh routing tables, our greedy algorithm runs in 0.45ms for 40 ports and 1.78ms for 6336 ports. For the OWS controller processing, we measured 4.31ms from receiving a wavelength assignment to setting GPIO outputs. Round-trip time in the control plane network (using a 1GbE switch) is 0.078ms.
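As a back-of-the-envelope check, the composition t_r = max(t_e, t_o) + t_c can be evaluated directly from these measurements. The component values below are the averages for the 40-port prototype as we read them (decimal points restored from the extracted text), so this is a sanity-check sketch rather than a reported result:

```python
# Average measured delays for the 40-port prototype, in milliseconds.
t_e = 5.12                           # EPS (OpenFlow) configuration delay
t_o = 3.23                           # WSS switching delay
# Control plane: wavelength computation + basemesh routing tables
# + OWS controller processing + control-network round trip.
t_c = 0.53 + 0.45 + 4.31 + 0.078
t_r = max(t_e, t_o) + t_c            # EPS and WSS reconfigure in parallel
print(round(t_r, 2))                 # 10.49 (ms, on average)
```

Substituting the 99th-percentile EPS delay (12.87ms) for t_e pushes t_r toward the ∼20ms end-to-end figure reported in §5.1.1.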

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 593

Page 14: Enabling Wide-Spread Communications on Optical …...ing, data spreading for fault-tolerance, etc. [42,46]. Figure 1: High-level view of a 4-node MegaSwitch. High bandwidth demands:

References

[1] AL-FARES M LOUKISSAS A AND VAHDATA A scalable commodity data center network ar-chitecture In ACM SIGCOMM (2008)

[2] AL-FARES M RADHAKRISHNAN S RAGHA-VAN B HUANG N AND VAHDAT A HederaDynamic flow scheduling for data center networksIn USENIX NSDI (2010)

[3] ALIZADEH M GREENBERG A MALTZD A PADHYE J PATEL P PRABHAKAR BSENGUPTA S AND SRIDHARAN M Data centertcp (dctcp) ACM SIGCOMM Computer Communi-cation Review 41 4 (2011) 63ndash74

[4] ANNEXSTEIN F AND BAUMSLAG M A unifiedapproach to off-line permutation routing on parallelnetworks In ACM SPAA (1990)

[5] CEGLOWSKI M The website obesity cri-sis httpidlewordscomtalkswebsite_obesityhtm 2015

[6] CHEN K SINGLA A SINGH A RA-MACHANDRAN K XU L ZHANG Y WENX AND CHEN Y Osa An optical switching ar-chitecture for data center networks with unprece-dented flexibility In USENIX NSDI (2012)

[7] CHEN K WEN X MA X CHEN Y XIA YHU C AND DONG Q Wavecube A scalablefault-tolerant high-performance optical data centerarchitecture In IEEE INFOCOM (2015)

[8] CHOWDHURY M ZHONG Y AND STOICA IEfficient coflow scheduling with varys In ACMSIGCOMM (2014)

[9] COLE R AND HOPCROFT J On edge coloringbipartite graphs SIAM Journal on Computing 113 (1982) 540ndash546

[10] DOERR C R AND OKAMOTO K Advances insilica planar lightwave circuits Journal of light-wave technology 24 12 (2006) 4763ndash4789

[11] FARRINGTON N FORENCICH A PORTER GSUN P-C FORD J FAINMAN Y PAPEN GAND VAHDAT A A multiport microsecond opticalcircuit switch for data center networking In IEEEPhotonics Technology Letters (2013)

[12] FARRINGTON N PORTER G RADHAKRISH-NAN S BAZZAZ H H SUBRAMANYA V

FAINMAN Y PAPEN G AND VAHDAT A He-lios A hybrid electricaloptical switch architec-ture for modular data centers In ACM SIGCOMM(2010)

[13] GAETA A L All-optical switching in silicon ringresonators In Photonics in Switching (2014)

[14] GEIS M LYSZCZARZ T OSGOOD R ANDKIMBALL B 30 to 50 ns liquid-crystal opticalswitches In Optics express (2010)

[15] GHOBADI M MAHAJAN R PHANISHAYEEA DEVANUR N KULKARNI J RANADE GBLANCHE P-A RASTEGARFAR H GLICKM AND KILPER D Projector Agile recon-figurable data center interconnect In ACM SIG-COMM (2016)

[16] GREENBERG A HAMILTON J R JAIN NKANDULA S KIM C LAHIRI P MALTZD A PATEL P AND SENGUPTA S VL2 Ascalable and flexible data center network In ACMSIGCOMM (2009)

[17] HALPERIN D KANDULA S PADHYE JBAHL P AND WETHERALL D Augmentingdata center networks with multi-gigabit wirelesslinks In SIGCOMM (2011)

[18] HAMEDAZIMI N QAZI Z GUPTA HSEKAR V DAS S R LONGTIN J P SHAHH AND TANWERY A Firefly A reconfigurablewireless data center fabric using free-space opticsIn ACM SIGCOMM (2014)

[19] HEDLUND B Understanding hadoop clusters andthe network httpbradhedlundcom2011

[20] HINDEN R Virtual router redundancy protocol(vrrp) RFC 5798 (2004)

[21] KANDULA S SENGUPTA S GREENBERG APATEL P AND CHAIKEN R The nature of data-center traffic Measurements and analysis In ACMIMC (2009)

[22] LIGHTWAVE Ranovus unveils optical engine 200-gbps pam4 optical transceiver httpsgooglQbRUd4 2016

[23] LIN X-W HU W HU X-K LIANG XCHEN Y CUI H-Q ZHU G LI J-NCHIGRINOV V AND LU Y-Q Fast responsedual-frequency liquid crystal switch with photo-patterned alignments In Optics letters (2012)

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 589

[24] LIU H LU F FORENCICH A KAPOOR RTEWARI M VOELKER G M PAPEN GSNOEREN A C AND PORTER G Circuitswitching under the radar with reactor In USENIXNSDI (2014)

[25] LIU V HALPERIN D KRISHNAMURTHY AAND ANDERSON T F10 A fault-tolerant engi-neered network In NSDI (2013)

[26] LIU Y J GAO P X WONG B AND KESHAVS Quartz a new design element for low-latencydcns In ACM SIGCOMM (2014)

[27] LOVASZ L AND PLUMMER M D Matchingtheory vol 367 American Mathematical Soc2009

[28] MALKHI D NAOR M AND RATAJCZAK DViceroy A scalable and dynamic emulation of thebutterfly In Proceedings of the twenty-first annualsymposium on Principles of distributed computing(2002) ACM pp 183ndash192

[29] MANKU G S BAWA M RAGHAVAN PET AL Symphony Distributed hashing in a smallworld In USENIX USITS (2003)

[30] MARTINEZ A CALO C ROSALES RWATTS R MERGHEM K ACCARD ALELARGE F BARRY L AND RAMDANE AQuantum dot mode locked lasers for coherent fre-quency comb generation In SPIE OPTO (2013)

[31] MCKEOWN N ANDERSON T BALAKRISH-NAN H PARULKAR G PETERSON L REX-FORD J SHENKER S AND TURNER J Open-flow Enabling innovation in campus networksACM Computer Communication Review (2008)

[32] METAWEB-TECHNOLOGIES Freebasewikipedia extraction (wex) httpdownloadfreebasecomwex 2010

[33] MISRA J AND GRIES D A constructive proofof vizingrsquos theorem Information Processing Let-ters 41 3 (1992) 131ndash133

[34] OSRG Ryu sdn framework httpsosrggithubioryu 2017

[35] PASQUALE F D MELI F GRISERIE SGUAZZOTTI A TOSETTI C ANDFORGHIERI F All-raman transmission of 19225-ghz spaced wdm channels at 1066 gbs over30 x 22 db of tw-rs fiber In IEEE PhotonicsTechnology Letters (2003)

[36] PFAFF B PETTIT J AMIDON K CASADOM KOPONEN T AND SHENKER S Extendingnetworking into the virtualization layer In ACMHotnets (2009)

[37] PI R An arm gnulinux box for $ 25 Take a byte(2012)

[38] PORTER G STRONG R FARRINGTON NFORENCICH A SUN P-C ROSING T FAIN-MAN Y PAPEN G AND VAHDAT A Integrat-ing microsecond circuit switching into the data cen-ter In ACM SIGCOMM (2013)

[39] QIAO C AND MEI Y Off-line permutation em-bedding and scheduling in multiplexed optical net-works with regular topologies IEEEACM Trans-actions on Networking 7 2 (1999) 241ndash250

[40] RATNASAMY S FRANCIS P HANDLEY MKARP R AND SHENKER S A scalable content-addressable network vol 31 ACM 2001

[41] ROWSTRON A AND DRUSCHEL P PastryScalable decentralized object location and routingfor large-scale peer-to-peer systems In IFIPACMInternational Conference on Distributed SystemsPlatforms and Open Distributed Processing (2001)Springer pp 329ndash350

[42] ROY A ZENG H BAGGA J PORTER GAND SNOEREN A C Inside the social networkrsquos(datacenter) network In ACM SIGCOMM (2015)

[43] SANFILIPPO S AND NOORDHUIS P Redishttpredisio 2010

[44] SCHRIJVER A Bipartite edge-colouring in o(δm)time SIAM Journal on Computing (1998)

[45] SHVACHKO K KUANG H RADIA S ANDCHANSLER R The hadoop distributed file sys-tem In IEEE MSST (2010)

[46] SINGH A ONG J AGARWAL A ANDERSONG ARMISTEAD A BANNON R BOVINGS DESAI G FELDERMAN B GERMANO PET AL Jupiter rising A decade of clos topologiesand centralized control in googlersquos datacenter net-work In ACM SIGCOMM (2015)

[47] SKOMOROCH P Wikipedia traffic statis-tics dataset httpawsamazoncomdatasets2596 2009

[48] STOICA I MORRIS R KARGER DKAASHOEK M F AND BALAKRISHNAN HChord A scalable peer-to-peer lookup service for

590 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

internet applications ACM SIGCOMM ComputerCommunication Review 31 4 (2001) 149ndash160

[49] VASSEUR J-P PICKAVET M AND DE-MEESTER P Network recovery Protectionand Restoration of Optical SONET-SDH IP andMPLS Elsevier 2004

[50] WANG G ANDERSEN D KAMINSKY M PA-PAGIANNAKI K NG T KOZUCH M ANDRYAN M c-Through Part-time optics in data cen-ters In ACM SIGCOMM (2010)

[51] WANG H XIA Y BERGMAN K NG TSAHU S AND SRIPANIDKULCHAI K Rethink-ing the physical layer of data center networks of thenext decade Using optics to enable efficient-castconnectivity In ACM SIGCOMM Computer Com-munication Review (2013)

[52] WON R AND PANICCIA M Integrating siliconphotonics 2010

[53] YAMAMOTO N AKAHANE K KAWANISHIT KATOUF R AND SOTOBAYASHI H Quan-tum dot optical frequency comb laser with mode-selection technique for 1-microm waveband photonictransport system In Japanese Journal of AppliedPhysics (2010)

[54] ZAHARIA M CHOWDHURY M DAS TDAVE A MA J MCCAULEY M FRANKLINM J SHENKER S AND STOICA I Resilientdistributed datasets A fault-tolerant abstraction forin-memory cluster computing In USENIX NSDI(2012)

[55] ZHU Z Scalable and topology adaptive intra-datacenter networking enabled by wavelength selectiveswitching In OSAOFCNFOFC (2014)

[56] ZHU Z ZHONG S CHEN L AND CHEN KA fully programmable and scalable optical switch-ing fabric for petabyte data center In Optics Ex-press (2015)

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 591

Appendix

Proof for non-blocking connectivityWe first formulate the wavelength assignment problem

Problem formulation Denote G=(VE) as a Mega-Switch logical graph where VE are nodeedge sets andeach edge eisinE is directed (a pair of fiber channels be-tween two nodes one in each direction) Given all nodesare connected via one hop G is a complete digraph Thebandwidth demand between nodes can be represented bythe number of wavelengths needed and each node hask available wavelengths Suppose γ=ke | eisinE is abandwidth demand of a communication where ke is thenumber of wavelengths (ie bandwidth) needed on edgee=(uv) from node u to v A feasible demand and wave-length efficiency are defined as

Definition 2 A demand γ is feasible iffsumvk(uv)le

k foralluisinV andsumuk(uv)lek forallvisinV ie each node cannot

sendreceive more than k wavelengths

Definition 3 Given a feasible bandwidth demand γan assignment is wavelength efficient iff it is non-interfering satisfies γ and uses at most k wavelengths

Centralized optimal algorithm To prove the wave-length efficiency we first show that non-interferingwavelength assignment in MegaSwitch can be recast asan edge-coloring problem on a bipartite multigraph Wefirst transform G into a directed bipartite graph by de-composing each node in G into two logical nodes si(sources) and di (destinations) Each edge points froma node in si to a node in di Then we expand Gto a multigraph Gprime by duplicating each edge e betweentwo nodes ke times On this multigraph the require-ment of non-interference is that no two adjacent edgesshare same wavelength Suppose each wavelength hasa unique color this equivalently transforms to the edge-coloring problem

While the edge-coloring problem isNP-hard for gen-eral graph [33] polynomial-time optimal solutions existfor bipartite multigraph Chen et al [7] have presentedsuch a centralized algorithm which we leverage to proveour wavelength efficiency for any feasible demands

Algorithm 1 DECOMPOSE(middot) Decompose Gprimer into

∆(Gprimer) perfect matchings

Input Gprimer

Output π=m1m2m∆(Gprimer)

1 if ∆(Gprimer)le1 then

2 return Gprimer

3 else4 mlarr Perfect Matching(Gprime

r)5 return m

⋃Decompose(Gprime

rm)

Satisfying arbitrary feasible γ with wavelength effi-ciency We extend Gprime to a ∆(Gprime)-regular multigraph16Gprimer by adding dummy edges where ∆(Gprime) is the largestvertex degree ofGprime ie the largest ke in γ For a bipartitegraph Gprime with maximum degree ∆(Gprime) at least ∆(Gprime)wavelengths are needed to satisfy γ This optimality canbe achieved by showing that we only need the smallestvalue possible ∆(Gprime) to satisfy γ which also proveswavelength efficiency since ∆(Gprime)lek

Theorem 4 A feasible demand γ only needs ∆(Gprime)ie the minimum number of colors for edge-coloringAny feasible MegaSwitch graph Gprime can be satisfied with∆(Gprime)lek wavelengths using Algorithm 1

Proof DECOMPOSE(middot) recursively finds ∆(Gr) perfectmatchings in Gprimer because rdquoany k-regular bipartite graphhas a perfect matchingrdquo [44] Thus we can extractone perfect matching from the original graph Gprimer andthe residual graph becomes (∆(Gprime)minus1)-regular we con-tinue to extract one by one until we have ∆(Gprime) perfectmatchings17 Thus any demand described by Gprime can besatisfied with ∆(Gprime) wavelengths Note that ∆(Gprime)lekas the out-degree and in-degree of any node must be lekin any feasible demand due to the physical limitation ofk ports Therefore any feasible demand γ can be satisfiedusing at most k wavelengths

The MegaSwitch controller runs Algorithm 1 to de-compose the demands and assigns a wavelength to thematching generated in each iteration

Considerations for Basemesh TopologyAs described in sect322 basemesh is an overlay DHTnetwork on the multi-fiber ring of MegaSwitch DHTliterature is vast and there are many potential choicesfor basemesh topology The following are the represen-tative ones Chord [48] Pastry [41] Symphony [29]Viceroy [28] and CAN [40] Since we have comparedChord [48] amp Symphony [29] in sect322 we next look atthe remaining onesbull Pastry suffers from average path lengths when many

wavelengths are used in the basemesh Its averagepath length is log2(lmiddotn) [41] where n is the number ofnodes on a ring and l is the length of node ID in bitsFor MegaSwitch both parameter is fixed (like Chord)and we cannot reduce the path length by adding morewavelengths to basemesh

bull Viceroy emulates the butterfly network on a DHT ringFor MegaSwitch its main issue is the difficulty of up-dating the routing tables and wavelength assignments

16A graph is regular when every vertex has the same degree17Perfect Matching(middot) on regular bipartite graph is well studied we

leverage existing algorithms in literature [27]

592 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

when the capacity of basemesh is increased and de-creased For Symphony we can just pick a new wave-length in random and configure accordingly withoutaffecting the other wavelengths In contrast to main-tain butterfly topology adjusting capacity of basemeshusing Viceroy algorithm affects all wavelength but onein the worst case

bull CAN is a d-dimensional Cartesian coordinate systemon a d-torus which is not suitable for MegaSwitchring where every node is connected directly to everyother node However CAN will become useful whenwe expand MegaSwitch into a 2-D torus topology andwe will explore this as future work

Reconfiguration Latency Measurements onMegaSwitch PrototypeIn sect511 we measured sim20ms reconfiguration delay forMegaSwitch prototype and here we break down this de-lay into EPS configuration delay (te) WSS switching de-lay (to) and control plane delay (tc) The total reconfig-uration delay for MegaSwitch is tr=max(teto)+tc asEPS and WSS configurations can be done in parallel

EPS configuration delay To setup a route in Mega-Switch the sourcedestination EPSes must be config-ured We measure the EPS configuration delay as follows(all servers and the controller are synchronized by NTP)we first setup a wavelength between 2 nodes and leaveEPSes unconfigured We let a host in one of the nodeskeep generating UDP packets using netcat to a hostin the other node Then the controller sends OpenFlowcontrol message to both EPSes to setup the route and wecollect the packets at the receiver with tcpdump Fi-nally we can calculate the delay the average and 99thpercentile are 512ms and 1287ms respectively

WSS switching delay We measure the switching speedof the 8times1 WSS used in our testbed by switching theoptical signal from one port to another port Figure 12shows the power readings from the original port and fromthe destination port and the measured switching delay is323ms (subject to temperature drive voltage etc)

Control plane delay With our wavelength assignmentalgorithm (see Appendix) running on a server (the cen-tralized controller) with Intel E5-1410 28Ghz CPU wemeasured an average computation time of 053ms for 40-port (40times40 demand matrix as input) and 328ms for6336-port MegaSwitch (n=33k=192) For basemeshrouting tables our greedy algorithm runs 045ms for 40ports and 178ms for 6336 ports For the OWS con-troller processing we measured 431ms from receivinga wavelength assignment to set GPIO outputs Round-trip time in control plane network (using 1GbE switch)is 0078ms

USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 593

Page 15: Enabling Wide-Spread Communications on Optical …...ing, data spreading for fault-tolerance, etc. [42,46]. Figure 1: High-level view of a 4-node MegaSwitch. High bandwidth demands:

[24] Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G. M., Papen, G., Snoeren, A. C., and Porter, G. Circuit switching under the radar with REACToR. In USENIX NSDI (2014).

[25] Liu, V., Halperin, D., Krishnamurthy, A., and Anderson, T. F10: A fault-tolerant engineered network. In USENIX NSDI (2013).

[26] Liu, Y. J., Gao, P. X., Wong, B., and Keshav, S. Quartz: A new design element for low-latency DCNs. In ACM SIGCOMM (2014).

[27] Lovász, L., and Plummer, M. D. Matching Theory, vol. 367. American Mathematical Society, 2009.

[28] Malkhi, D., Naor, M., and Ratajczak, D. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing (2002), ACM, pp. 183–192.

[29] Manku, G. S., Bawa, M., Raghavan, P., et al. Symphony: Distributed hashing in a small world. In USENIX USITS (2003).

[30] Martinez, A., Calo, C., Rosales, R., Watts, R., Merghem, K., Accard, A., Lelarge, F., Barry, L., and Ramdane, A. Quantum dot mode locked lasers for coherent frequency comb generation. In SPIE OPTO (2013).

[31] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review (2008).

[32] Metaweb Technologies. Freebase Wikipedia extraction (WEX). http://download.freebase.com/wex/, 2010.

[33] Misra, J., and Gries, D. A constructive proof of Vizing's theorem. Information Processing Letters 41, 3 (1992), 131–133.

[34] OSRG. Ryu SDN framework. https://osrg.github.io/ryu/, 2017.

[35] Pasquale, F. D., Meli, F., Griseri, E., Sguazzotti, A., Tosetti, C., and Forghieri, F. All-Raman transmission of 192 25-GHz spaced WDM channels at 10.66 Gb/s over 30 × 22 dB of TW-RS fiber. In IEEE Photonics Technology Letters (2003).

[36] Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., and Shenker, S. Extending networking into the virtualization layer. In ACM HotNets (2009).

[37] Pi, R. An ARM GNU/Linux box for $25. Take a byte (2012).

[38] Porter, G., Strong, R., Farrington, N., Forencich, A., Sun, P.-C., Rosing, T., Fainman, Y., Papen, G., and Vahdat, A. Integrating microsecond circuit switching into the data center. In ACM SIGCOMM (2013).

[39] Qiao, C., and Mei, Y. Off-line permutation embedding and scheduling in multiplexed optical networks with regular topologies. IEEE/ACM Transactions on Networking 7, 2 (1999), 241–250.

[40] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network, vol. 31. ACM, 2001.

[41] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.

[42] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the social network's (datacenter) network. In ACM SIGCOMM (2015).

[43] Sanfilippo, S., and Noordhuis, P. Redis. http://redis.io, 2010.

[44] Schrijver, A. Bipartite edge-colouring in O(Δm) time. SIAM Journal on Computing (1998).

[45] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE MSST (2010).

[46] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In ACM SIGCOMM (2015).

[47] Skomoroch, P. Wikipedia traffic statistics dataset. http://aws.amazon.com/datasets/2596, 2009.

[48] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.

[49] Vasseur, J.-P., Pickavet, M., and Demeester, P. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Elsevier, 2004.

[50] Wang, G., Andersen, D., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., and Ryan, M. c-Through: Part-time optics in data centers. In ACM SIGCOMM (2010).

[51] Wang, H., Xia, Y., Bergman, K., Ng, T., Sahu, S., and Sripanidkulchai, K. Rethinking the physical layer of data center networks of the next decade: Using optics to enable efficient *-cast connectivity. In ACM SIGCOMM Computer Communication Review (2013).

[52] Won, R., and Paniccia, M. Integrating silicon photonics. 2010.

[53] Yamamoto, N., Akahane, K., Kawanishi, T., Katouf, R., and Sotobayashi, H. Quantum dot optical frequency comb laser with mode-selection technique for 1-μm waveband photonic transport system. In Japanese Journal of Applied Physics (2010).

[54] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX NSDI (2012).

[55] Zhu, Z. Scalable and topology adaptive intra-data center networking enabled by wavelength selective switching. In OSA OFC/NFOEC (2014).

[56] Zhu, Z., Zhong, S., Chen, L., and Chen, K. A fully programmable and scalable optical switching fabric for petabyte data center. In Optics Express (2015).


Appendix

Proof for non-blocking connectivity

We first formulate the wavelength assignment problem.

Problem formulation: Denote G=(V,E) as a MegaSwitch logical graph, where V, E are the node/edge sets, and each edge e∈E is directed (a pair of fiber channels between two nodes, one in each direction). Given that all nodes are connected via one hop, G is a complete digraph. The bandwidth demand between nodes can be represented by the number of wavelengths needed, and each node has k available wavelengths. Suppose γ = {k_e | e∈E} is a bandwidth demand of a communication, where k_e is the number of wavelengths (i.e., bandwidth) needed on edge e=(u,v) from node u to v. A feasible demand and wavelength efficiency are defined as follows:

Definition 2: A demand γ is feasible iff Σ_v k_(u,v) ≤ k, ∀u∈V, and Σ_u k_(u,v) ≤ k, ∀v∈V, i.e., each node cannot send/receive more than k wavelengths.

Definition 3: Given a feasible bandwidth demand γ, an assignment is wavelength efficient iff it is non-interfering, satisfies γ, and uses at most k wavelengths.
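Definition 2 amounts to row- and column-sum constraints on the demand matrix. A minimal sketch of the check (the matrix encoding and the `is_feasible` helper are ours, for illustration):

```python
# Definition 2 as code: demand[u][v] encodes k_e for edge e = (u, v), and k is
# the number of wavelengths available per node.

def is_feasible(demand, k):
    """A demand is feasible iff every node sends and receives <= k wavelengths."""
    n = len(demand)
    sends_ok = all(sum(demand[u]) <= k for u in range(n))                          # Σ_v k_(u,v) ≤ k
    receives_ok = all(sum(demand[u][v] for u in range(n)) <= k for v in range(n))  # Σ_u k_(u,v) ≤ k
    return sends_ok and receives_ok

# Example: a 3-node demand that is feasible with k = 4 but not with k = 3.
demand = [[0, 2, 2],
          [1, 0, 2],
          [3, 1, 0]]
feasible_4 = is_feasible(demand, k=4)   # True
feasible_3 = is_feasible(demand, k=3)   # False: node 0 wants to send 4 wavelengths
```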

Centralized optimal algorithm: To prove wavelength efficiency, we first show that non-interfering wavelength assignment in MegaSwitch can be recast as an edge-coloring problem on a bipartite multigraph. We first transform G into a directed bipartite graph by decomposing each node in G into two logical nodes, s_i (source) and d_i (destination). Each edge points from a node in {s_i} to a node in {d_i}. Then we expand G to a multigraph G′ by duplicating each edge e between two nodes k_e times. On this multigraph, the requirement of non-interference is that no two adjacent edges share the same wavelength. Supposing each wavelength has a unique color, this transforms equivalently into the edge-coloring problem.

While the edge-coloring problem is NP-hard for general graphs [33], polynomial-time optimal solutions exist for bipartite multigraphs. Chen et al. [7] have presented such a centralized algorithm, which we leverage to prove our wavelength efficiency for any feasible demand.

Algorithm 1 DECOMPOSE(·): Decompose G′_r into Δ(G′_r) perfect matchings

Input: G′_r
Output: π = {m_1, m_2, ..., m_Δ(G′_r)}

1: if Δ(G′_r) ≤ 1 then
2:     return G′_r
3: else
4:     m ← Perfect_Matching(G′_r)
5:     return m ⋃ DECOMPOSE(G′_r − m)

Satisfying arbitrary feasible γ with wavelength efficiency: We extend G′ to a Δ(G′)-regular multigraph^16 G′_r by adding dummy edges, where Δ(G′) is the largest vertex degree of G′, i.e., determined by the largest k_e in γ. For a bipartite graph G′ with maximum degree Δ(G′), at least Δ(G′) wavelengths are needed to satisfy γ. This optimality can be achieved by showing that we need only the smallest possible value, Δ(G′), to satisfy γ, which also proves wavelength efficiency, since Δ(G′) ≤ k.

Theorem 4: A feasible demand γ needs only Δ(G′) wavelengths, i.e., the minimum number of colors for edge-coloring. Any feasible MegaSwitch graph G′ can be satisfied with Δ(G′) ≤ k wavelengths using Algorithm 1.

Proof: DECOMPOSE(·) recursively finds Δ(G′_r) perfect matchings in G′_r, because "any k-regular bipartite graph has a perfect matching" [44]. Thus, we can extract one perfect matching from the original graph G′_r, after which the residual graph becomes (Δ(G′)−1)-regular; we continue to extract matchings one by one until we have Δ(G′) perfect matchings^17. Thus, any demand described by G′ can be satisfied with Δ(G′) wavelengths. Note that Δ(G′) ≤ k, as the out-degree and in-degree of any node must be ≤ k in any feasible demand, due to the physical limitation of k ports. Therefore, any feasible demand γ can be satisfied using at most k wavelengths.

The MegaSwitch controller runs Algorithm 1 to decompose the demands, and assigns a wavelength to the matching generated in each iteration.
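Algorithm 1 can be sketched compactly in Python (our own illustration, written iteratively rather than recursively). It assumes the demand matrix has already been padded with dummy edges so G′_r is k-regular, and substitutes Kuhn's augmenting-path search for the faster Perfect_Matching(·) algorithms of [27, 44]:

```python
def decompose(demand, k):
    """Split a k-regular bipartite multigraph (n x n matrix whose rows and
    columns each sum to exactly k) into k perfect matchings, returning one
    list of (source, destination) pairs per wavelength."""
    n = len(demand)
    d = [row[:] for row in demand]   # residual multigraph, mutated below
    matchings = []
    for _ in range(k):
        match = [-1] * n             # match[v] = source matched to destination v

        def augment(u, seen):
            # Try to match source u, re-routing earlier matches if needed.
            for v in range(n):
                if d[u][v] > 0 and v not in seen:
                    seen.add(v)
                    if match[v] == -1 or augment(match[v], seen):
                        match[v] = u
                        return True
            return False

        for u in range(n):
            augment(u, set())
        assert all(u != -1 for u in match)  # regularity => matching is perfect
        for v, u in enumerate(match):
            d[u][v] -= 1             # peel it off; residual is (k-1)-regular
        matchings.append([(u, v) for v, u in enumerate(match)])
    return matchings

# 3-node example: rows and columns each sum to k = 4.
demand = [[0, 2, 2],
          [2, 0, 2],
          [2, 2, 0]]
waves = decompose(demand, k=4)   # 4 perfect matchings, one per wavelength
```

Each returned matching is a set of pairwise non-adjacent edges, so all of its connections can reuse a single wavelength without interference.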

Considerations for Basemesh Topology

As described in §3.2.2, the basemesh is an overlay DHT network on the multi-fiber ring of MegaSwitch. The DHT literature is vast, and there are many potential choices for the basemesh topology. The following are the representative ones: Chord [48], Pastry [41], Symphony [29], Viceroy [28], and CAN [40]. Since we have compared Chord [48] & Symphony [29] in §3.2.2, we next look at the remaining ones:

• Pastry suffers from long average path lengths when many wavelengths are used in the basemesh. Its average path length is log_2(l·n) [41], where n is the number of nodes on a ring and l is the length of the node ID in bits. For MegaSwitch, both parameters are fixed (as with Chord), and we cannot reduce the path length by adding more wavelengths to the basemesh.

• Viceroy emulates the butterfly network on a DHT ring. For MegaSwitch, its main issue is the difficulty of updating the routing tables and wavelength assignments when the capacity of the basemesh is increased or decreased. For Symphony, we can just pick a new wavelength at random and configure it accordingly, without affecting the other wavelengths. In contrast, to maintain the butterfly topology, adjusting the capacity of the basemesh with the Viceroy algorithm affects all wavelengths but one in the worst case.

• CAN is a d-dimensional Cartesian coordinate system on a d-torus, which is not suitable for the MegaSwitch ring, where every node is connected directly to every other node. However, CAN will become useful when we expand MegaSwitch into a 2-D torus topology, and we will explore this as future work.

^16 A graph is regular when every vertex has the same degree.
^17 Perfect_Matching(·) on regular bipartite graphs is well studied; we leverage existing algorithms from the literature [27].
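The fixed-parameter problem for Pastry can be made concrete with back-of-envelope arithmetic (ours, not from the paper; l = 32 bits is an assumed node-ID length for illustration):

```python
import math

n = 33   # nodes on the ring, matching the n = 33 scale used above
l = 32   # node-ID length in bits (assumed for illustration)

# Pastry's average path length log2(l*n) [41] depends only on n and l,
# neither of which grows with the number of basemesh wavelengths.
pastry_avg = math.log2(l * n)
print(f"Pastry average path length: {pastry_avg:.2f} hops")  # ~10 hops, fixed
```

However many wavelengths the basemesh receives, this figure stays put, whereas Symphony can spend extra wavelengths on additional long links to shorten paths.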
