
SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware

Muhammad Bilal Anwer, Murtaza Motiwala, Mukarram bin Tariq, Nick Feamster
School of Computer Science, Georgia Tech

ABSTRACT
We present SwitchBlade, a platform for rapidly deploying custom protocols on programmable hardware. SwitchBlade uses a pipeline-based design that allows individual hardware modules to be enabled or disabled on the fly, integrates software exception handling, and provides support for forwarding based on custom header fields. SwitchBlade's ease of programmability and wire-speed performance enables rapid prototyping of custom data-plane functions that can be directly deployed in a production network. SwitchBlade integrates common packet-processing functions as hardware modules, enabling different protocols to use these functions without having to resynthesize hardware. SwitchBlade's customizable forwarding engine supports both longest-prefix matching in the packet header and exact matching on a hash value. SwitchBlade's software exceptions can be invoked based on either packet- or flow-based rules and updated quickly at runtime, thus making it easy to integrate more flexible forwarding functions into the pipeline. SwitchBlade also allows multiple custom data planes to operate in parallel on the same physical hardware, while providing complete isolation for protocols running in parallel. We implemented SwitchBlade using the NetFPGA board, but SwitchBlade can be implemented with any FPGA. To demonstrate SwitchBlade's flexibility, we use SwitchBlade to implement and evaluate a variety of custom network protocols: we present instances of IPv4, IPv6, Path Splicing, and an OpenFlow switch, all running in parallel while forwarding packets at line rate.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design; C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks; C.2.6 [Computer-Communication Networks]: Internetworking

General Terms: Algorithms, Design, Experimentation, Performance

Keywords: Network Virtualization, NetFPGA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCOMM 2010, August 30-September 3, 2010, New Delhi, India.
Copyright 2010 ACM 978-1-4503-0201-2/10/08 ...$10.00.

1. INTRODUCTION
Countless next-generation networking protocols at various layers of the protocol stack require data-plane modifications. The past few years alone have seen proposals at multiple layers of the protocol stack for improving routing in data centers, improving availability, providing greater security, and so forth [3,13,17,24]. These protocols must ultimately operate at acceptable speeds in production networks—perhaps even alongside one another—which raises the need for a platform that can support fast hardware implementations of these protocols running in parallel. This platform must provide mechanisms to deploy these new network protocols, header formats, and functions quickly, yet still forward traffic as quickly as possible. Unfortunately, the conventional hardware implementation and deployment path on custom ASICs incurs a long development cycle, and custom protocols may also consume precious space on the ASIC. Software-defined networking paradigms (e.g., Click [7, 16]) offer some hope for rapid prototyping and deployment, but a purely software-based approach cannot satisfy the strict performance requirements of most modern networks. The networking community needs a development and deployment platform that offers high performance, flexibility, and the possibility of rapid prototyping and deployment.

Although other platforms have recognized the need for fast, programmable routers, they stop somewhat short of providing a programmable platform for rapid prototyping on the hardware itself. Platforms that are based on network processors can achieve fast forwarding performance [22], but network processor-based implementations are difficult to port across different processor architectures, and customization can be difficult if the function that a protocol requires is not native to the network processor's instruction set; such functions must instead be implemented in software. PLUG [8] is an excellent framework for implementing modular lookup modules, but the model focuses on manufacturing high-speed chips, which is costly and can have a long development cycle. RouteBricks [12] provides a high-performance router, but it is implemented entirely in software, which may introduce scalability issues; additionally, prototypes developed on RouteBricks cannot be easily ported to hardware.

This paper presents SwitchBlade, a programmable hardware platform that strikes a balance between the programmability of software and the performance of hardware, and enables rapid prototyping and deployment of new protocols. SwitchBlade enables rapid deployment of new protocols on hardware by providing modular building blocks to afford customizability and programmability that is sufficient for implementing a variety of data-plane functions. SwitchBlade's ease of programmability and wire-speed performance enables rapid prototyping of custom data-plane functions that can be directly deployed in a production network. SwitchBlade relies on field-programmable gate arrays (FPGAs). Designing and implementing SwitchBlade poses several challenges:

• Design and implementation of a customizable hardware pipeline. To minimize the need for resynthesizing hardware, which can be prohibitive if multiple parties are sharing it, SwitchBlade's packet-processing pipeline includes hardware modules that implement common data-plane functions. New protocols can select a subset of these modules on the fly, without resynthesizing hardware.

• Seamless support for software exceptions. If custom processing elements cannot be implemented in hardware (e.g., due to limited resources on the hardware, such as area on the chip), SwitchBlade must be able to invoke software routines for processing. SwitchBlade's hardware pipeline can directly invoke software exceptions on either packet- or flow-based rules. The results of software processing (e.g., forwarding decisions) can be cached in hardware, making exception handling more efficient.

• Resource isolation for simultaneous data-plane pipelines. Multiple protocols may run in parallel on the same hardware; we call each data plane a Virtual Data Plane (VDP). SwitchBlade provides each VDP with separate forwarding tables and dedicated resources. Software exceptions are tagged with the VDP that generated the exception, which makes it easier to build virtual control planes on top of SwitchBlade.

• Hardware processing of custom, non-IP headers. SwitchBlade provides modules to obtain appropriate fields from packet headers as input to forwarding decisions. SwitchBlade can forward packets using longest-prefix match on 32-bit header fields, an exact match on a fixed-length header field, or a bitmap added by custom packet preprocessing modules.

The design of SwitchBlade presents additional challenges, such as (1) dividing function between hardware and software given limited hardware resources; (2) abstracting physical ports and input/output queues; (3) achieving rate control on a per-VDP basis instead of a per-port basis; and (4) providing a clean interface to software.

We have implemented SwitchBlade using the NetFPGA board [2], but SwitchBlade can be implemented with any FPGA. To demonstrate SwitchBlade's flexibility, we use SwitchBlade to implement and evaluate several custom network protocols. We present instances of IPv4, IPv6, Path Splicing, and an OpenFlow switch, all of which can run in parallel and forward packets at line rate; each of these implementations required only modest additional development effort. SwitchBlade also provides seamless integration with software handlers implemented using Click [16], and with router slices running in OpenVZ containers [20]. Our evaluation shows that SwitchBlade can forward traffic for custom data planes—including non-IP protocols—at hardware forwarding rates. SwitchBlade can also forward traffic for multiple distinct custom data planes in parallel, providing resource isolation for each. An implementation of SwitchBlade with four parallel data planes fits easily on today's NetFPGA platform; hardware trends will improve this capacity in the future. SwitchBlade can support additional VDPs with less than a linear increase in resource use, so the design will scale as FPGA capacity continues to increase.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 explains our design goals and the key resulting design decisions. Section 4 explains the SwitchBlade design, and Section 5 describes the implementation of SwitchBlade, as well as our implementations of three custom data planes on SwitchBlade. Section 6 presents performance results. Section 7 briefly describes how we have implemented a virtual router on top of SwitchBlade using OpenVZ. We discuss various extensions in Section 8 and conclude in Section 9.

2. RELATED WORK
We survey related work on programmable data planes in both software and hardware.

The Click [16] modular router allows easy, rapid development of custom protocols and packet forwarding operations in software; kernel-based packet forwarding can operate at high speeds but cannot keep up with hardware for small packet sizes. An off-the-shelf NetFPGA-based router can forward traffic at 4 Gbps; this forwarding speed can scale by increasing the number of NetFPGA cards, and development trends suggest that much higher rates will be possible in the near future. RouteBricks [12] uses commodity processors to achieve software-based packet processing at high speed. The design requires significant PCIe interconnection bandwidth to allow packet processing at CPUs instead of on the network cards themselves. As more network interface cards are added, and as traffic rates increase, however, some packet processing may need to be performed on the network cards themselves to keep pace with increasing line speeds and to avoid creating bottlenecks on the interconnect.

Supercharged PlanetLab (SPP) [22] is a network processor (NP)-based technology. SPP uses Intel IXP network processors [14] for data-plane packet processing. NP-based implementations are specifically bound to the respective vendor-provided platform, which can inherently limit the flexibility of data-plane implementations.

Another solution to achieve wire-speed performance is developing custom high-speed networking chips. PLUG [8] provides a programming model for manufacturing chips to perform high-speed and flexible packet lookup, but it does not provide an off-the-shelf solution. Additionally, chip manufacturing is expensive: fabrication plants are not common, and cost-effective manufacturing at third-party facilities requires a critical mass of demand. Thus, this development path may only make sense for large enterprises and for protocols that have already gained broad acceptance. Chip manufacturing also has a high turnaround time and post-manufacturing verification processes, which can impede development of new protocols that need a short development cycle and rapid deployment.

SwitchBlade is an FPGA-based platform and can be implemented on any FPGA. Its design and implementation draw inspiration from our earlier work on designing an FPGA-based data plane for virtual routers [4]. FPGA-based designs are not tied to any single vendor, and they scale as newer, faster, and larger FPGAs become available. FPGAs also provide a faster development and deployment cycle compared to chip manufacturing.

Casado et al. argue for simple but high-speed hardware with clean interfaces to software that facilitate independent development of protocols and network hardware [9]. They argue that complex routing decisions can be made in software and cached in hardware for high-speed processing; in some sense, SwitchBlade's caching of forwarding decisions that are handled by software exception handlers embodies this philosophy. OpenFlow [19] enables the rapid development of a variety of protocols, but the division of functions between hardware and software in SwitchBlade is quite different. Both OpenFlow and SwitchBlade provide software exceptions and caching of software decisions in hardware, but SwitchBlade also provides selectable hardware preprocessing modules that effectively move more flexible processing to hardware. SwitchBlade also easily accommodates new hardware modules, while OpenFlow does not.

SwitchBlade provides wire-speed support for parallel customized data planes, isolation between them, and their interfacing with virtualization software, which would make SwitchBlade a suitable data plane for a virtual router. OpenFlow does not directly support multiple custom data planes operating in parallel. FlowVisor [1] provides some level of virtualization but sits between the OpenFlow switch and controller, essentially requiring virtualization to occur in software.

3. GOALS AND DESIGN DECISIONS
The primary goal of SwitchBlade is to enable rapid development and deployment of new protocols working at wire-speed. The three subgoals, in order of priority, are: (1) Enable rapid development and deployment of new protocols; (2) Provide customizability and programmability while maintaining wire-speed performance; and (3) Allow multiple data planes to operate in parallel, and facilitate sharing of hardware resources across those multiple data planes. In this section, we describe these design goals and their rationale, and highlight specific design choices that we made in SwitchBlade to achieve these goals.

Goal #1. Rapid development and deployment on fast hardware. Many next-generation networking protocols require data-plane modifications. Implementing these modifications entirely in software results in a slow data path that offers poor forwarding performance. As a result, these protocols cannot be evaluated at the data rates of production networks, nor can they be easily transferred to production network devices and systems.

Our goal is to provide a platform for designers to quickly deploy, test, and improve their designs with wire-speed performance. This goal influences our decision to implement SwitchBlade using FPGAs, which are programmable, provide acceptable speeds, and are not tied to specific vendors. An FPGA-based solution can allow network protocol designs to take advantage of hardware trends, as larger and faster FPGAs become available. SwitchBlade relies on programmable hardware, but incorporates software exception handling for special cases; a purely software-based solution cannot provide acceptable forwarding performance. From the hardware perspective, custom ASICs incur a long development cycle, so they do not satisfy the goal of rapid deployment. Network processors offer speed, but they do not permit hardware-level customization.

Goal #2. Customizability and programmability. New protocols often require specific customizations to the data plane. Thus, SwitchBlade must provide a platform that affords enough customization to facilitate the implementation and deployment of new protocols.

Providing customizability along with fast turnaround time for hardware-based implementations is challenging: a bare-bones FPGA is customizable, but programming it from scratch has a high turnaround time. To reconcile this conflict, SwitchBlade recognizes that even custom protocols share common data-plane extensions. For example, many routing protocols might use longest-prefix or exact match for forwarding, and checksum verification and update, although different protocols may use these extensions on different fields in the packet. SwitchBlade provides a rich set of common extensions as modules and allows protocols to dynamically select any subset of modules that they need. SwitchBlade's modules are programmable and can operate on arbitrary offsets within packet headers.

Feature | Design Goals | Pipeline Stages
Virtual Data Plane (§4.2) | Parallel custom data planes | VDP selection
Customizable hardware modules (§4.3) | Rapid programming, customizability | Preprocessing, Forwarding
Flexible matching in forwarding (§4.4) | Customizability | Forwarding
Programmable software exceptions (§4.5) | Rapid programming, customizability | Forwarding

Table 1: SwitchBlade design features.

For extensions that are not included in SwitchBlade, protocols can either add new modules in hardware or implement exception handlers in software. SwitchBlade provides hardware caching for forwarding decisions made by these exception handlers to reduce performance overhead.

Goal #3. Parallel custom data planes on a common hardware platform. The increasing need for data-plane customization for emerging network protocols makes it necessary to design a platform that can support the operation of several custom data planes that operate simultaneously and in parallel on the same hardware platform. SwitchBlade's design identifies functions that are common across data-plane protocols and provides those implementations shared access to the hardware logic that provides those common functions.

SwitchBlade allows customized data planes to run in parallel. Each data plane is called a Virtual Data Plane (VDP). SwitchBlade provides separate forwarding tables and virtualized interfaces to each VDP. SwitchBlade provides isolation among VDPs using per-VDP rate control. VDPs may share modules, but to preserve hardware resources, shared modules are not replicated on the FPGA. SwitchBlade ensures that the data planes do not interfere even though they share hardware modules.

Existing platforms satisfy some or all of these goals, but they do not address all the goals at once or with the prioritization we have outlined above. For example, SwitchBlade trades off higher customizability in hardware for easier and faster deployability by providing a well-defined but modular customizable pipeline. Similarly, while SwitchBlade provides parallel data planes, it still gives each data plane direct access to the hardware, and allows each VDP access to a common set of hardware modules. This level of sharing still allows protocol designers enough isolation to implement a variety of protocols and systems; for example, in Section 7, we will see that designers can run virtual control planes and virtual environments (e.g., OpenVZ [20], Trellis [6]) on top of SwitchBlade.

4. DESIGN
SwitchBlade has several unique design features that enable rapid development of customized routing protocols with wire-speed performance. SwitchBlade has a pipelined architecture (§4.1) with various processing stages. SwitchBlade implements Virtual Data Planes (VDPs) (§4.2) so that multiple data plane implementations can be supported on the same platform with performance isolation between the different forwarding protocols. SwitchBlade provides customizable hardware modules (§4.3) that can be enabled or disabled to customize packet processing at runtime. SwitchBlade implements a flexible matching forwarding engine (§4.4) that provides a longest-prefix match and an exact hash-based lookup on various fields in the packet header. There are also programmable software exceptions (§4.5) that can be configured from software to direct individual packets or flows to the CPU for additional processing.

Figure 1: SwitchBlade Packet Processing Pipeline. Packets pass through the VDP selection, shaping, preprocessing (custom preprocessors and hasher), and forwarding (output port lookup and postprocessing) stages.

Figure 2: Platform header format. This 64-bit header consists of an 8-bit VDP-id, an 8-bit mode field, a 32-bit hash value, and a 16-bit module selector bitmap; it is applied to every incoming packet and removed before the packet is forwarded.

4.1 SwitchBlade Pipeline
Figure 1 shows the SwitchBlade pipeline. There are four main stages in the pipeline. Each stage consists of one or more hardware modules. We use a pipelined architecture because it is the most straightforward choice in hardware-based architectures. Additionally, SwitchBlade is based on the reference router from the NetFPGA group at Stanford [2]; this reference router has a pipelined architecture as well.

VDP Selection Stage. An incoming packet to SwitchBlade is associated with one of the VDPs. The VDP Selector module classifies the packet based on its MAC address and uses a stored table that maps MAC addresses to VDP identifiers. A register interface populates the table with the VDP identifiers and is described later.

Field | Value | Description/Action
Mode | 0 | Default: perform LPM on IPv4 destination address
Mode | 1 | Perform exact matching on hash value
Mode | 2 | Send packet to software for custom processing
Mode | 3 | Look up hash in software exceptions table
Module Selector Bitmap | 1 | Source MAC not updated
Module Selector Bitmap | 2 | Don't decrement TTL
Module Selector Bitmap | 4 | Don't calculate checksum
Module Selector Bitmap | 8 | Dest. MAC not updated
Module Selector Bitmap | 16 | Update IPv6 Hop Limit
Module Selector Bitmap | 32 | Use Custom Module 1
Module Selector Bitmap | 64 | Use Custom Module 2
Module Selector Bitmap | 128 | Use Custom Module 3

Table 2: Platform Header: The Mode field selects the forwarding mechanism employed by the Output Port Lookup module. The Module Selector Bitmap selects the appropriate postprocessing modules.

This stage also attaches a 64-bit platform header to the incoming packet, as shown in Figure 2. The registers corresponding to each VDP are used to fill the various fields in the platform header. SwitchBlade is a pipelined architecture, so we use a specific header format to make the architecture extensible. The first byte of this header is used to select the VDP for every incoming packet. Table 2 describes the functionality of the different fields in the platform header.
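The header layout in Figure 2 and the field semantics in Table 2 can be summarized in a small C sketch. The struct and constant names below are ours, not part of SwitchBlade's code, and the struct simply groups the header's fields rather than asserting their exact wire order:

    #include <stdint.h>

    /* 64-bit SwitchBlade platform header (Figure 2); field names are illustrative.
     * The first byte on the wire carries the VDP-id (Section 4.1). */
    struct platform_header {
        uint8_t  vdp_id;          /* selects the Virtual Data Plane            */
        uint8_t  mode;            /* forwarding mode, see MODE_* below         */
        uint32_t hash;            /* 32-bit hash filled by the hasher module   */
        uint16_t module_bitmap;   /* selects postprocessing modules            */
    };

    /* Mode field values (Table 2). */
    enum {
        MODE_LPM_IPV4     = 0,    /* default: LPM on IPv4 destination address  */
        MODE_EXACT_HASH   = 1,    /* exact match on the hash value             */
        MODE_SEND_TO_CPU  = 2,    /* send packet to software unconditionally   */
        MODE_SW_EXCEPTION = 3,    /* look up hash in software exceptions table */
    };

    /* Module selector bitmap flags (Table 2). */
    enum {
        BM_NO_SRC_MAC_UPDATE = 1,
        BM_NO_TTL_DECREMENT  = 2,
        BM_NO_CHECKSUM       = 4,
        BM_NO_DST_MAC_UPDATE = 8,
        BM_IPV6_HOP_LIMIT    = 16,
        BM_CUSTOM_MODULE_1   = 32,
        BM_CUSTOM_MODULE_2   = 64,
        BM_CUSTOM_MODULE_3   = 128,
    };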

Shaping Stage. After a packet is designated to a particular VDP, the packet is sent to the shaper module. The shaper module rate-limits traffic on a per-VDP basis. There is a register interface for the module that specifies the traffic rate limits for each VDP.

Preprocessing Stage. This stage includes all the VDP-specific preprocessing hardware modules. Each VDP can customize which preprocessor module in this stage to use for preprocessing the packet via a register interface. In addition to selecting the preprocessor, a VDP can select the various bit fields from the preprocessor using a register interface. A register interface provides information about the mode bits and the preprocessing module configurations. In addition to the custom preprocessing of the packet, this stage also has the hasher module, which can compute a hash of an arbitrary set of bits in the packet header and insert the value of the hash into the packet's platform header.

Forwarding Stage. This final stage in the pipeline handles the operations related to the actual packet forwarding. The Output Port Lookup module determines the destination of the packet, which could be one of: (1) longest-prefix match on the packet's destination address field to determine the output port; (2) exact matching on the hash value in the packet's platform header to determine the output port; or (3) exception-based forwarding to the CPU for further processing. This stage uses the mode bits specified in the preprocessing stage. The Postprocessor Wrappers and the Custom Postprocessors perform operations such as decrementing the packet's time-to-live field. After this stage, SwitchBlade queues the packet in the appropriate output queue for forwarding. SwitchBlade selects the postprocessing module or modules based on the module selection bits in the packet's platform header.

4.2 Custom Virtual Data Plane (VDP)
SwitchBlade enables multiple customized data planes to operate simultaneously in parallel on the same hardware. We refer to each data plane as a Virtual Data Plane (VDP). SwitchBlade provides a separate packet processing pipeline, as well as separate lookup tables and register interfaces, for each VDP. Each VDP may provide custom modules or share modules with other VDPs. With SwitchBlade, shared modules are not replicated on the hardware, saving valuable resources. Software exceptions include VDP identifiers, making it easy to use separate software handlers for each VDP.

Traffic Shaping. The performance of a VDP should not be affected by the presence of other VDPs. The shaper module enables SwitchBlade to limit the bandwidth utilization of different VDPs. When several VDPs are sharing the platform, each can send traffic through any of the four ports of the VDP to be sent out from any of the four router ports, and a VDP could send more traffic than its allocated bandwidth limit, thus affecting the performance of other VDPs. In our implementation, the shaper module comes after the preprocessing stage, not before it as shown in Figure 1. This implementation choice, although convenient, does not affect our results because the FPGA data plane can process packets faster than any of the inputs, so the placement of traffic shaping does not matter in practice. We expect, however, that future FPGAs may have many more than the current four network interfaces of a single NetFPGA, which would make traffic shaping of individual VDPs necessary. In the existing implementation, packets arriving at a rate greater than the allocated limit for a VDP are dropped immediately. We made this decision to save memory resources on the FPGA and to prevent any VDP from abusing resources.
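The drop-on-arrival policy described above can be illustrated with a minimal C sketch; the one-second accounting window and the structure and function names are our assumptions for illustration, not details of the hardware shaper:

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-VDP rate state; the limit is set through the admin register interface. */
    struct vdp_rate {
        uint64_t limit_bytes_per_sec;  /* admin-configured rate limit          */
        uint64_t bytes_this_window;    /* bytes accepted in the current window */
        uint64_t window_start_ns;      /* start of the current 1-second window */
    };

    /* Returns true if the packet should be dropped immediately (no buffering). */
    static bool shaper_should_drop(struct vdp_rate *r, uint32_t pkt_len, uint64_t now_ns)
    {
        if (now_ns - r->window_start_ns >= 1000000000ULL) {  /* new window */
            r->window_start_ns = now_ns;
            r->bytes_this_window = 0;
        }
        if (r->bytes_this_window + pkt_len > r->limit_bytes_per_sec)
            return true;                                      /* over the VDP's limit */
        r->bytes_this_window += pkt_len;
        return false;
    }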

Register interface. SwitchBlade provides a register interface for a VDP to control the selection of preprocessing modules, to customize packet processing modules (e.g., which fields to use for calculating the hash), and to set rate limits in the shaper module. Some of the values in the registers are accessible by each VDP, while others are only available to the SwitchBlade administrator. SwitchBlade divides the register interfaces into these two security modes: the admin mode and the VDP mode. The admin mode allows setting of global policies such as traffic shaping, while the VDP mode is for per-VDP module customization.

SwitchBlade modules also provide statistics, which are recorded in the registers and are accessible via the admin interface. The statistics are specific to each module; for example, the VDP selector module can provide statistics on packets accepted or dropped. The admin mode provides access to all registers on the SwitchBlade platform, whereas the VDP mode provides access only to registers related to a single VDP.

4.3 Customizable Hardware Modules
Rapidly deploying new routing protocols may require custom packet processing. Implementing each routing protocol from scratch can significantly increase development time. There is a significant implementation cycle for implementing hardware modules; this cycle includes design, coding, regression tests, and finally synthesis of the module on hardware. Fortunately, many basic operations are common among different forwarding mechanisms, such as extracting the destination address for lookup, checksum calculation, and TTL decrement. This commonality presents an opportunity for a design that can reuse and even share the implementations of basic operations, which can significantly shorten development cycles and also save precious resources on the FPGA.

SwitchBlade achieves this reuse by providing modules that support a few basic packet processing operations that are common across many common forwarding mechanisms. Because SwitchBlade provides these modules as part of its base implementation, data plane protocols that can be composed from only the base modules can be implemented without resynthesizing the hardware and can be programmed purely using a register interface. As an example, to implement a new routing protocol such as Path Splicing [17], which requires manipulation of splicing bits (a custom field in the packet header), a VDP can provide a new module that is included at synthesis time. This module can append preprocessing headers that are later used by SwitchBlade's forwarding engine. A protocol such as OpenFlow [19] may depend only on modules that are already synthesized on the SwitchBlade platform, so it can choose the subset of modules that it needs.

SwitchBlade's reusable modules enable new protocol developers to focus more on the protocol implementation. The developer needs to focus only on bit extraction for custom forwarding. Each pluggable module must still follow the overall timing constraints, but for development and verification purposes, the protocol developer's job is reduced to the module's implementation. Adding new modules or algorithms that offer new functionality of course requires conventional hardware development and must still strictly follow the platform's overall timing constraints.

A challenge with reusing modules is that different VDPs may need the same postprocessing module (e.g., decrementing the TTL), but the postprocessing module may need to operate on different locations in the packet header for different protocols. In a naïve implementation, SwitchBlade would have to implement two separate modules, each looking up the corresponding bits in the packet header. This approach doubles the implementation effort and also wastes resources on the FPGA. To address this challenge, SwitchBlade allows a developer to include wrapper modules that can customize the behavior of existing modules, within the same data word and for the same length of data to be operated upon.

As shown in Figure 1, custom modules can be used in the preprocessing and forwarding stages. In the preprocessing stage, the customized modules can be selected by a VDP by specifying the appropriate selection using the register interface. Figure 3 shows an example: the incoming packet from the previous shaping stage goes to a demultiplexer, which selects the appropriate module or modules for the packet based on input from the register interface specific to the particular VDP that the packet belongs to. After being processed by one of the protocol modules (e.g., IPv6, OpenFlow), the packet arrives at the hasher module. The hasher module takes 256 bits as input and generates a 32-bit hash of the input. The hasher module need not be restricted to 256 bits of input data, but a larger input data bus would mean using more resources. Therefore, we decided to implement a 256-bit-wide hash data bus to accommodate our design on the NetFPGA.
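The paper does not specify which hash function the hasher module uses, so the C sketch below only illustrates the module's interface: it folds the 256-bit preprocessor output (eight 32-bit words) into the 32-bit value placed in the platform header, using an FNV-style mix chosen purely for illustration:

    #include <stdint.h>

    /* Fold a 256-bit preprocessor output (eight 32-bit words) into a 32-bit hash.
     * The constants and mixing scheme are illustrative only, not SwitchBlade's
     * actual hash function. */
    static uint32_t hasher_256_to_32(const uint32_t input[8])
    {
        uint32_t h = 2166136261u;          /* FNV-style seed  */
        for (int i = 0; i < 8; i++) {
            h ^= input[i];
            h *= 16777619u;                /* FNV-style prime */
        }
        return h;                          /* written into the platform header */
    }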

Each VDP can also use custom modules in the forwarding stage, by selecting the appropriate postprocessor wrappers and custom postprocessor modules as shown in Figure 1. SwitchBlade selects these modules based on the module selection bitmap in the platform header of the packet. Figure 4(b) shows an example of the custom wrapper and postprocessor module selection operation.

4.4 Flexible Matching for Forwarding
New routing protocols often require customized routing tables, or forwarding decisions on customized fields in the packet. For example, Path Splicing requires multiple IP-based forwarding tables, and the router chooses one of them based on splicing bits in the packet header. SEATTLE [15] and Portland [18] use MAC address-based forwarding. Some forwarding mechanisms are simple enough to be implemented in hardware and can benefit from fast-path forwarding; others might be more complicated, and it might be easier to have the forwarding decision made in software. Ideally, all forwarding should take place in hardware, but there is a tradeoff in terms of forwarding performance and hardware implementation complexity.


Preprocessor Selection Code | Processor | Description
1 | Custom Extractor | Allows selection of variable 64-bit fields in the packet on 64-bit boundaries in the first 32 bytes
2 | OpenFlow | OpenFlow packet processor that allows variable field selection
3 | Path Splicing | Extracts the destination IP address and uses bits in the packet to select the path/forwarding table
4 | IPv6 | Extracts the IPv6 destination address

Table 3: Processor Selection Codes.

Figure 3: Virtualized, Pluggable Module for Programmable Processors.

SwitchBlade uses a hybrid hardware-software approach to strike a balance between forwarding performance and implementation complexity. Specifically, SwitchBlade's forwarding mechanism, provided by the Output Port Lookup module shown in Figure 1, offers four different methods for making a forwarding decision on the packet: (1) conventional longest-prefix matching (LPM) on any 32-bit address field in the packet header within the first 40 bytes; (2) exact matching on the hash value stored in the packet's platform header; (3) unconditionally sending the packet to the CPU for making the forwarding computation; and (4) sending only packets which match certain user-defined exceptions, called software exceptions (§4.5), to the CPU. The details of how the output port lookup module performs these tasks are illustrated in Figure 4(a). Modes (1) and (2) enable fast-path packet forwarding because the packet never leaves the FPGA. We observe that many common routing protocols can be implemented with these two forwarding mechanisms alone. Figure 4 does not depict the actual implementation but shows the functional aspects of SwitchBlade's implementation.

By default, SwitchBlade performs a longest-prefix match, assuming an IPv4 destination address is present in the packet header. To enable use of a customized lookup, a VDP can set the appropriate mode bit in the platform header of the incoming packet. One of the four different forwarding mechanisms can be invoked for the packet by the mode bits, as described in Table 2. The output port lookup module performs LPM and exact matching on the hash value using the forwarding table stored in the TCAM. The same TCAM is used both for LPM and for exact matching on the hash value; the mask supplied by the user therefore decides the nature of the match being performed.

Figure 4: Output Port Lookup and Postprocessing Modules.

Once the output port lookup module determines the output port for the packet, it adds the output port number to the packet's platform header. The packet is then sent to the postprocessing modules for further processing. In Section 4.5, we describe the details of software exception handling and how the packet is handled when it is sent to the CPU.
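Functionally, the forwarding decision can be pictured as the C sketch below; the entry layout, the linear search standing in for the TCAM, and the return-code conventions are ours, chosen to show how the user-supplied mask turns the same table into either a longest-prefix match or an exact match on the hash (Figure 4(a) shows the actual structure):

    #include <stdint.h>

    #define PORT_TO_CPU (-2)   /* sentinel: hand the packet to software */
    #define PORT_DROP   (-1)

    struct tcam_entry {
        uint32_t value;        /* prefix or hash value                       */
        uint32_t mask;         /* 0xFFFFFFFF => exact match; shorter => LPM  */
        int      out_port;
    };

    /* Linear search stands in for the hardware TCAM; entries are assumed to be
     * ordered so that the first hit is the most specific match. */
    static int tcam_lookup(const struct tcam_entry *t, int n, uint32_t key)
    {
        for (int i = 0; i < n; i++)
            if ((key & t[i].mask) == (t[i].value & t[i].mask))
                return t[i].out_port;
        return PORT_DROP;
    }

    /* Per-packet dispatch on the mode field (Table 2). */
    static int output_port_lookup(uint8_t mode, uint32_t dst_field, uint32_t hash,
                                  const struct tcam_entry *fwd, int n_fwd,
                                  const struct tcam_entry *exc, int n_exc)
    {
        switch (mode) {
        case 0:  /* default: LPM on a 32-bit destination field */
            return tcam_lookup(fwd, n_fwd, dst_field);
        case 1:  /* exact match on the hash in the platform header */
            return tcam_lookup(fwd, n_fwd, hash);
        case 2:  /* unconditionally send to the CPU */
            return PORT_TO_CPU;
        case 3:  /* software exception: CPU on a hit, otherwise fall back to LPM */
            return tcam_lookup(exc, n_exc, hash) != PORT_DROP
                       ? PORT_TO_CPU
                       : tcam_lookup(fwd, n_fwd, dst_field);
        default:
            return PORT_DROP;
        }
    }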

4.5 Flexible Software Exceptions
Although performing all processing of the packets in hardware is the only way to achieve line rate performance, it may be expensive to introduce complex forwarding implementations in the hardware. Also, if certain processing will only be performed on a few packets and the processing requirements of those packets are different from the majority of other packets, development can be faster and less expensive if those few packets are processed by the CPU instead (e.g., ICMP packets in routers are typically processed in the CPU).

SwitchBlade introduces software exceptions to programmatically direct certain packets to the CPU for additional processing. This concept is similar to the OpenFlow concept of rules that can identify packets that match a particular traffic flow that should be passed to the controller. However, combining software exceptions with the LPM table provides greater flexibility, since a VDP can add exceptions to existing forwarding rules. Similarly, if a user starts receiving more traffic than expected from a particular software exception, that user can simply remove the software exception entry and add the forwarding rule in the forwarding tables.

There is a separate exceptions table, which can be filled via a register interface on a per-VDP basis and is accessible to the output port lookup module, as shown in Figure 4(a). When the mode bits field in the platform header is set to 3 (Table 2), the output port lookup module performs an exact match of the hash value in the packet's platform header against the entries in the exceptions table for the VDP. If there is a match, then the packet is redirected to the CPU, where it can be processed using software-based handlers; if there is none, then the packet is sent back to the output port lookup module to perform an LPM on the destination address. We describe the process after the packet is sent to the CPU later.

Figure 5: SwitchBlade Pipeline for NetFPGA implementation.

SwitchBlade's software exceptions feature allows decision caching [9]: software may install its decisions as LPM or exact-match rules in the forwarding tables so that future packets are forwarded rapidly in hardware without causing software exceptions.
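From the software side, decision caching might look like the sketch below; the vdp_add_exact_rule helper is a hypothetical stand-in for the per-VDP register writes that install a rule, not an actual SwitchBlade API:

    #include <stdint.h>

    /* Hypothetical wrapper around the per-VDP register interface; a real handler
     * would perform register writes through the NetFPGA driver. */
    static void vdp_add_exact_rule(int vdp_id, uint32_t hash, int out_port)
    {
        (void)vdp_id; (void)hash; (void)out_port;   /* register writes elided */
    }

    /* Software exception handler: make the (possibly expensive) decision once in
     * software, then cache it as an exact-match rule so that subsequent packets
     * of the same flow stay on the hardware fast path. */
    static void handle_exception(int vdp_id, uint32_t pkt_hash, int decided_port)
    {
        /* ... protocol-specific decision logic runs here ... */
        vdp_add_exact_rule(vdp_id, pkt_hash, decided_port);
    }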

SwitchBlade allows custom processing of some packets in software. There are two forwarding modes that permit this function: unconditional forwarding of all packets to the CPU, or forwarding of packets to the CPU based on software exceptions. Once a packet has been designated to be sent to the CPU, it is placed in a CPU queue corresponding to its VDP, as shown in Figure 4(a). The current SwitchBlade implementation forwards the packet to the CPU with the platform header attached to the packet. We describe one possible implementation of a software component on top of SwitchBlade's VDP—a virtual router—in Section 7.

5. NETFPGA IMPLEMENTATION
In this section, we describe our NetFPGA-based implementation of SwitchBlade, as well as custom data planes that we have implemented using SwitchBlade. For each of these data planes, we present details of the custom modules, and how these modules are integrated into the SwitchBlade pipeline.

5.1 SwitchBlade Platform
SwitchBlade implements all the modules shown in Figure 5 on the NetFPGA [2] platform. The current implementation uses four packet preprocessor modules, as shown in Table 3. SwitchBlade uses SRAM for packet storage and BRAM and SRL16e storage for forwarding information for all the VDPs, and it uses the PCI interface to send or receive packets from the host machine's operating system. The NetFPGA project provides reference implementations for various capabilities, such as the ability to push the Linux routing table to the hardware. Our framework extends this implementation to add other features, such as support for virtual data planes, customizable hardware modules, and programmable software exceptions. Figure 5 shows the implementation of the NetFPGA router-based pipeline for SwitchBlade. Because our implementation is based on the NetFPGA reference implementation, adding multicast packet forwarding depends on the capabilities of the NetFPGA reference router [2] implementation. Because the base implementation can support multicast forwarding, SwitchBlade can also support it.

Figure 6: Resource sharing in SwitchBlade.

VDP Selection Stage. The SwitchBlade implementation adds three new stages to the NetFPGA reference router [2] pipeline, shown in gray in Figure 5. The VDP selection stage essentially performs a destination MAC lookup for each incoming packet; if the destination MAC address matches, then the packet is accepted and the VDP-id is attached to the packet's platform header (Table 2). VDP selection is implemented using a CAM (Content Addressable Memory), where each MAC address is associated with a VDP-ID. This table is called the Virtual Data Plane table. An admin register interface allows the SwitchBlade administrator to allow or disallow users from using a VDP by adding or removing their destination MAC entries from the table.

Preprocessing Stage. A developer can add customizable packet preprocessor modules to the VDP. There are two main benefits of these customizable preprocessor modules. First, this modularity streamlines the deployment of new forwarding schemes. Second, the hardware cost of supporting new protocols does not increase linearly with the addition of new protocol preprocessors. To enable custom packet forwarding, the preprocessing stage also provides a hashing module that takes 256 bits as input and produces a 32-bit output (Table 2). The hashing scheme does not provide a longest-prefix match; it only offers support for an exact match on the hash value. In our existing implementation, each preprocessor module is associated with one specific VDP.

Shaping Stage. We implement bandwidth isolation for each VDP using a simple network traffic rate limiter. Each VDP has a configurable rate limiter that increases or decreases the VDP's allocated bandwidth. We used a rate limiter from the NetFPGA reference implementation for this purpose. The register interface to update the rate limits is accessible only with admin privileges.

Software Exceptions. To enable programmable software exceptions, SwitchBlade uses a 32-entry CAM within each VDP that can be configured from software using the register interface. SwitchBlade has a register interface that can be used to add a 32-bit hash representing a flow or packet. Each VDP has a set of registers to update the software exceptions table to redirect packets from hardware to software.


Figure 7: Life of OpenFlow, IPv6, and Path Splicing packets.

Sharing and custom packet processing. The modules that operate on a virtual router instance are shared between the different virtual router instances that reside on the same FPGA device. Only those modules that the virtual router user selects operate on the packet; others do not touch the packet. This path-selection mechanism is unique: depending on an individual virtual router user's requirements, the user can simply select the path of the packet and the modules that the packet should traverse.

5.2 Custom Data Planes using SwitchBlade
Implementing any new functionality in SwitchBlade requires hardware programming in Verilog, but if the module is added as a pluggable preprocessor, then the developer needs to be concerned only with the pluggable preprocessor implementation, as long as decoding can occur within specific clock cycles. Once a new module is added and its interface is linked with the register interface, a user can write a high-level program to use a combination of the newly added and previously added modules. Although the number of modules in a pipeline may appear limited because of the small header size, this number can be increased by making the pipeline wider or by adding another header to every packet.

To allow developers to write their own protocols or use existing ones, SwitchBlade offers header files in C++, Perl, and Python; these files refer to the register address space for that user's register interface only. A developer simply needs to include one of these header files. Once the header file is included, the developer can write a user-space program by reading and writing to the register interface. The developer can then use the register interface to enable or disable modules in the SwitchBlade pipeline. The developer can also use this interface to add hooks for software exceptions. Figure 7 shows SwitchBlade's custom packet path. We have implemented three different routing protocols and forwarding mechanisms on SwitchBlade: OpenFlow [19], Path Splicing [17], and IPv6 [11].
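A sketch of this workflow is shown below; the register offsets, names, and the reg_write helper are hypothetical stand-ins for what the generated per-VDP header file would provide:

    #include <stdint.h>

    /* Hypothetical register offsets as they might appear in a generated per-VDP
     * header file; actual names and addresses differ. */
    #define REG_PREPROCESSOR_SELECT   0x00  /* Table 3 processor selection code */
    #define REG_MODE                  0x04  /* Table 2 mode field               */
    #define REG_MODULE_BITMAP         0x08  /* Table 2 module selector bitmap   */
    #define REG_SW_EXCEPTION_HASH     0x0c  /* add a hash to the exceptions CAM */

    static void reg_write(uint32_t offset, uint32_t value)
    {
        /* would map the VDP's register space and perform the write */
        (void)offset; (void)value;
    }

    int main(void)
    {
        reg_write(REG_PREPROCESSOR_SELECT, 3);   /* Path Splicing preprocessor     */
        reg_write(REG_MODE, 1);                  /* exact match on the hash value  */
        reg_write(REG_MODULE_BITMAP, 0);         /* keep default IPv4 postprocessing */
        reg_write(REG_SW_EXCEPTION_HASH, 0xdeadbeef); /* redirect one illustrative flow to the CPU */
        return 0;
    }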

OpenFlow. We implemented the exact-match lookup mechanism of OpenFlow in hardware using SwitchBlade, without VLAN support. The OpenFlow preprocessor module, as shown in Figure 3, parses a packet and extracts the ten tuples of a packet defined in the OpenFlow specification. The OpenFlow preprocessor module extracts the bits from the packet header and returns a 240-bit-wide OpenFlow flow entry. These 240 bits travel on a 256-bit wire to the hasher module. The hasher module returns a 32-bit hash value that is added to the SwitchBlade platform header (Figure 2). After the addition of the hash value, this module adds a module selector bitmap to the packet's platform header. The pipeline then sets the mode field in the packet's platform header to 1, which makes the output port lookup module perform an exact match on the hash value of the packet. The output port lookup module looks up the hash value in the exact match table and forwards the packet to the output port if the lookup is a hit. If the table does not contain the corresponding entry, the platform forwards the packet to the CPU for processing with software-based handlers.

Because OpenFlow offers switch functionality and does not require any extra postprocessing (e.g., TTL decrement or checksum calculation), a user can prevent the forwarding stage from performing any extra postprocessing functions on the packet. Nothing happens in the forwarding stage apart from the lookup, and SwitchBlade queues the packet in the appropriate output queue. A developer can update source and destination MACs as well, using the register interface.
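The flow-entry construction can be sketched as below; the field widths follow the OpenFlow ten-tuple, but the struct packing and the zero-padding to the 256-bit hasher bus are our assumptions:

    #include <stdint.h>
    #include <string.h>

    /* OpenFlow ten-tuple extracted by the preprocessor (240 bits of match fields). */
    struct openflow_key {
        uint16_t in_port;
        uint8_t  dl_src[6], dl_dst[6];
        uint16_t dl_vlan;        /* unused here: the implementation omits VLANs */
        uint16_t dl_type;
        uint32_t nw_src, nw_dst;
        uint8_t  nw_proto;
        uint16_t tp_src, tp_dst;
    };

    /* Pack the tuple into the 32-byte (256-bit) hasher input, zero-padding the
     * remainder. A real implementation serializes fields explicitly; copying the
     * struct is approximate because of compiler padding. The hasher then produces
     * the 32-bit value placed in the platform header. */
    static void openflow_to_hash_input(const struct openflow_key *k, uint8_t bus[32])
    {
        memset(bus, 0, 32);
        memcpy(bus, k, sizeof(*k) <= 32 ? sizeof(*k) : 32);
    }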

Path Splicing. Path Splicing enables users to select different paths to a destination based on the splicing bits in the packet header. The splicing bits are included as a bitmap in the packet's header and serve as an index for one of the possible paths to the destination. To implement Path Splicing in hardware, we implemented a processing module in the preprocessing stage. For each incoming packet, the preprocessor module extracts the splicing bits and the destination IP address. It concatenates the IP destination address and the splicing bits to generate a new address that represents a separate path. Since Path Splicing allows variation in path selection, this bit field can vary in length. The hasher module takes this bit field, creates a 32-bit hash value, and attaches it to the packet's platform header.

When the packet reaches the exact match lookup table, its 32-bit hash value is extracted from the SwitchBlade header and is looked up in the exact match table. If a match exists, the card forwards the packet on the appropriate output port. Because the module concatenates the bits and then hashes them, and there is an exact match down the pipeline, two packets with the same destination address but different paths will have different hashes, so they will be matched against different forwarding table entries and routed along two different paths. Since Path Splicing uses IPv4 for packet processing, all the postprocessing modules on the default path (e.g., TTL decrement) operate on the packet and update the packet's required fields. SwitchBlade can also support equal-cost multipath (ECMP). For this protocol, the user must implement a new preprocessor module that can select two different paths based on the packet header fields and can store their hashes in the lookup table, sending packets along two separate paths based on which hash matches in the lookup.
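The splicing lookup key might be formed as below; the width of the splicing-bits field and the packing are our assumptions, and the point is simply that the same destination with different splicing bits hashes to different exact-match entries:

    #include <stdint.h>
    #include <string.h>

    /* Concatenate the IPv4 destination address with the splicing bits taken from
     * the packet header, zero-padding up to the 256-bit hasher bus; the hasher
     * then produces the 32-bit exact-match key. */
    static void splicing_to_hash_input(uint32_t dst_ip, uint32_t splicing_bits,
                                       uint8_t bus[32])
    {
        memset(bus, 0, 32);
        memcpy(bus, &dst_ip, sizeof(dst_ip));
        memcpy(bus + sizeof(dst_ip), &splicing_bits, sizeof(splicing_bits));
    }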

IPv6. The IPv6 implementation on SwitchBlade also uses the customizable preprocessor modules to extract the 128-bit destination address from an incoming IPv6 packet. The preprocessor module extracts the 128 bits and sends them to the hasher module to generate a 32-bit hash from the address.

Our implementation restricts longest-prefix match to 32-bit address fields, so it is not currently possible to perform longest-prefix match for IPv6. The output port lookup stage performs an exact match on the hash value of the IPv6 packet and sends the packet for postprocessing. When the packet reaches the postprocessing stage, it only needs to have its TTL (hop limit) decremented, because there is no checksum in IPv6; it also requires its source and destination MACs to be updated before forwarding. The module selector bitmap shown in Figure 5 enables only the postprocessing module responsible for the TTL decrement and not the ones doing checksum recalculation. Because the TTL field for IPv6 is at a different byte offset than the default IPv4 TTL field, SwitchBlade uses a wrapper module that extracts only the bits of the packet's header that are required by the TTL decrement module; it then updates the packet's header with the decremented TTL.
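The wrapper's job can be pictured as a byte-offset translation, as in the sketch below; offsets assume standard IPv4/IPv6 header layouts, and the hardware wrapper extracts only the data word containing the field rather than addressing the whole packet:

    #include <stdint.h>

    /* Byte offset of the hop-count field from the start of the L3 header:
     * IPv4 TTL is at offset 8, IPv6 Hop Limit at offset 7. */
    enum { IPV4_TTL_OFFSET = 8, IPV6_HOP_LIMIT_OFFSET = 7 };

    /* Decrement the hop count in place; the same decrement logic is reused and
     * only the offset handed to it changes per protocol. */
    static void decrement_hop(uint8_t *l3_header, unsigned offset)
    {
        if (l3_header[offset] > 0)
            l3_header[offset]--;
    }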


Resource | NetFPGA Utilization | % Utilization
Slices | 21 K out of 23 K | 90%
4-input LUTs | 37 K out of 47 K | 79%
Flip Flops | 20 K out of 47 K | 42%
External IOBs | 353 out of 692 | 51%
Eq. Gate Count | 13 M | N/A

Table 4: Resource utilization for the base SwitchBlade platform.

6. EVALUATION
In this section, we evaluate our implementation of SwitchBlade using NetFPGA [2] as a prototype development platform. Our evaluation focuses on three main aspects of SwitchBlade: (1) resource utilization for the SwitchBlade platform; (2) forwarding performance and isolation for parallel data planes; and (3) data-plane update rates.

6.1 Resource Utilization

To provide insight into resource usage when different data planes are implemented on SwitchBlade, we used Xilinx ISE 9.2 [23] to synthesize SwitchBlade. We found that a single physical IPv4 router implementation developed by the NetFPGA group at Stanford University uses a total of 23K four-input LUTs, which consume about 49% of the four-input LUTs available on the NetFPGA. The implementation also requires 123 BRAM units, which is 53% of the available BRAM.

We refer to our implementation with one OpenFlow, one IPv6, one variable bit extractor, and one Path Splicing preprocessor, together with an IPv4 router, and capable of supporting four VDPs, as the SwitchBlade “base configuration”. This implementation uses 37K four-input LUTs, which account for approximately 79% of the four-input LUTs. Approximately 4.5% of the LUTs are used for shift registers. Table 4 shows the resource utilization for the base SwitchBlade implementation; SwitchBlade uses more resources than the base IPv4 router, as shown in Table 5, but the increase in resource utilization is less than linear in the number of VDPs that SwitchBlade can support.

Sharing modules enables resource savings across different protocol implementations. Table 5 shows the resource usage for standalone implementations of an IPv4 router, an OpenFlow switch, and Path Splicing. These implementations achieve 4 Gbps; the OpenFlow and Path Splicing implementations individually require fewer resources than SwitchBlade, but the difference in resource usage is small when compared with the range of configurations that SwitchBlade can support.

Virtual Data Planes can support multiple forwarding planes in parallel. Placing four standalone Path Splicing implementations on a larger FPGA to run four Path Splicing data planes would require four times the resources of the existing Path Splicing implementation, because no modules would be shared between the four forwarding planes. With SwitchBlade’s shared modules, resource usage does not increase linearly with the number of data planes and, in the best case, remains constant.

From a gate-count perspective, four standalone Path Splicing instances, with their larger forwarding tables and more memory, would require approximately four times the resources shown in Table 5, whereas SwitchBlade, with smaller forwarding tables and less memory, requires almost the same amount of resources. This resource-usage gap grows as we increase the number of Virtual Data Planes on the FPGA. Recent trends in FPGA development, such as the Virtex-6, promise higher speeds and larger area; these trends will allow more VDPs to be placed on a single FPGA, which will facilitate more resource sharing.

NetFPGA Implementation   Slices   4-input LUTs   Flip Flops   BRAM   Equivalent Gate Count
Path Splicing            17 K     19 K           17 K         172    12 M
OpenFlow                 21 K     35 K           22 K         169    12 M
IPv4                     16 K     23 K           15 K         123    8 M

Table 5: Resource usage for different data planes.

Figure 8: Comparison of forwarding rates. (Plot of forwarding rate in ’000 pps versus packet size in bytes, for the NetFPGA base router and Linux raw packet forwarding.)

6.2 Forwarding Performance and Isolation

We used the NetFPGA-based packet generator [10] to generate high-speed traffic for evaluating the forwarding performance of SwitchBlade and the isolation provided between the VDPs. Some of the results we present in this section are derived from experiments in previous work [4].

Raw forwarding rates. Previous work has measured the maximum sustainable packet forwarding rate for different configurations of software-based virtual routers [5]. We also measure packet forwarding rates and show that hardware-accelerated forwarding can increase them. We compare the forwarding rates of Linux and of the NetFPGA-based router implementation from the NetFPGA group [2], as shown in Figure 8. The maximum forwarding rate shown, about 1.4 million packets per second, is the highest traffic rate we were able to generate with the NetFPGA-based packet generator.

The Linux kernel drops packets at high loads, but our configuration could not send packets at a high enough rate to observe packet drops in hardware. If we impose the condition that no packets should be dropped at the router, the packet forwarding rate for the Linux router drops significantly, while the forwarding rate for the hardware-based router remains constant. Figure 8 shows packet forwarding rates when this “no packet drop” condition is not imposed (i.e., we measure the maximum sustainable forwarding rates). For large packet sizes, in-kernel forwarding in Linux achieved the same forwarding rate as a single port of the NetFPGA router; once the packet size drops below 200 bytes, the software-based router cannot keep pace with the forwarding requirements.

Forwarding performance for Virtual Data Planes. Figure 9 shows the data-plane forwarding performance of SwitchBlade running four data planes in parallel versus the NetFPGA reference router [2], for various packet sizes. We disabled the rate limiters in SwitchBlade for these experiments. The figure shows that running SwitchBlade incurs no additional performance penalty compared to running the reference router [2].


Figure 9: Data plane performance: NetFPGA reference router vs. SwitchBlade. (Plot of forwarding rate in ’000 pps versus packet size, comparing the NetFPGA hardware router, SwitchBlade with four virtualized data planes, and SwitchBlade with traffic filtering.)

Physical Router (’000s of packets)

Packet Size (bytes)   Pkts Sent   Pkts Fwd @ Core   Pkts Recv @ Sink
64                    40 K        40 K              20 K
104                   40 K        40 K              20 K
204                   40 K        40 K              20 K
504                   40 K        40 K              20 K
704                   40 K        40 K              20 K
1004                  39.8 K      39.8 K            19.9 K
1518                  4 K         4 K               1.9 K

Table 6: Physical Router, Packet Drop Behavior.

By default, traffic belonging to any VDP can arrive on any of the physical Ethernet interfaces, since all of the ports are in promiscuous mode. To measure SwitchBlade’s ability to filter traffic that is not destined for any VDP, we flooded SwitchBlade with a traffic mix in which half of the packets carried destination MAC addresses of SwitchBlade virtual interfaces and half carried destination MAC addresses that did not belong to any VDP. As a result, half of the packets were dropped and the rest were forwarded, which resulted in a forwarding rate that was half of the incoming traffic rate.
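As a rough illustration of this filtering step (the names and structure below are ours, not SwitchBlade’s hardware design), the admission decision amounts to looking up the destination MAC against the set of VDP virtual-interface addresses:

    # Hypothetical VDP demultiplexing table: destination MAC -> VDP id.
    vdp_macs = {
        b"\x00\x4e\x46\x32\x43\x00": 0,
        b"\x00\x4e\x46\x32\x43\x01": 1,
    }

    def admit(dst_mac: bytes):
        """Return the VDP that owns this destination MAC, or None to drop."""
        return vdp_macs.get(dst_mac)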

Isolation for Virtual Data Planes. To measure CPU isolation, we used four parallel data planes to forward traffic while a user-space process consumed 100% of the CPU. We then sent traffic with each user assigned a traffic quota in packets per second. When no user exceeded the assigned quota, the router forwarded traffic according to the assigned rates, with no packet loss. To measure traffic isolation, we set up a topology in which two 1 Gbps ports of the router were flooded at 1 Gbps and a sink node was connected to a third 1 Gbps port. We used four VDPs to forward traffic to the same output port. Tables 6 and 7 show that, at this packet forwarding rate, only half of the packets make it through, on a first-come, first-served basis, as shown in the fourth column. These tables show that both the reference router implementation and SwitchBlade have the same performance in the worst-case scenario when an output port is flooded. The second and third columns show the number of packets sent to the router and the number of packets forwarded by the router. Our design does not protect against the contention that may arise when many users send traffic to one output port.

Forwarding performance for non-IP packets. We also tested whether SwitchBlade incurred any penalty for forwarding custom, non-IP packets; SwitchBlade was able to forward these packets at the same rate as regular IP packets.

Four Data Planes (’000s of packets)

Packet Size (bytes)   Pkts Sent   Pkts Fwd @ Core   Pkts Recv @ Sink
64                    40 K        40 K              20 K
104                   40 K        40 K              20 K
204                   40 K        40 K              20 K
504                   40 K        40 K              20 K
704                   40 K        40 K              20 K
1004                  39.8 K      39.8 K            19.9 K
1518                  9.6 K       9.6 K             4.8 K

Table 7: Four Parallel Data Planes, Packet Drop Behavior.

Figure 10: Test topology for testing the SwitchBlade implementation of Path Splicing. (Source and sink connected through routers R0, R1, and R2; the splicing bits carried in the IP-ID field select the path.)

Figure 10 shows the testbed we used to test the Path Splicing implementation. We again used the NetFPGA-based hardware packet generator [10] to send and receive traffic. Figure 11 shows the packet forwarding rates of this NetFPGA-based implementation, as observed at the sink node. No packet loss occurred on any of the nodes shown in Figure 10. We sent two flows with the same destination IP address but different splicing bits, directing them through different routers: packets from one flow were sent to R2 via R1, while the others went directly to R2. In another iteration, we introduced four different flows into the network, such that all four forwarding tables at routers R0 and R2 were consulted with equal probability; in this case as well, SwitchBlade forwarded the packets at full rate. Both experiments show that SwitchBlade can implement schemes like Path Splicing and forward non-IP traffic at hardware speeds.

In another experiment, we used the Variable Bit Extraction module to extract the first 64 bits of the header for hashing. We used a simple source-and-sink topology with SwitchBlade between them and measured the number of packets forwarded. Figure 12 compares the forwarding rates when SwitchBlade forwarded packets based on the first 64 bits of the Ethernet frame with the rates of the NetFPGA base router.
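For illustration, the key derivation in this experiment can be sketched in the same way as before (a Python stand-in; the actual hardware hash is unspecified): the first 64 bits of the frame are hashed down to the 32-bit exact-match key.

    import zlib

    def first64_key(frame: bytes) -> int:
        # Hash the first 8 bytes (64 bits) of the Ethernet frame to a 32-bit
        # exact-match key; CRC-32 stands in for the hardware hasher.
        return zlib.crc32(frame[:8]) & 0xFFFFFFFF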

6.3 Data-Plane Update Rates

Each VDP in a SwitchBlade router needs its own forwarding table. Because the VDPs share a single physical device, simultaneous table updates from different VDPs might create a bottleneck. To evaluate SwitchBlade’s forwarding-table update speed, we assumed the worst-case scenario, where all VDPs flush their tables and rewrite them at the same time, and a table size of 400,000 entries per VDP. We updated all four tables simultaneously, with four processes writing entries into the forwarding table, and observed no performance decrease while updating the forwarding tables from software.

Table 8 shows that updating 1.6 million entries simultaneously took 89.77 seconds on average, with a standard deviation of less than one second. As the number of VDPs increases, the average update rate remains constant, but the PCI interconnect speed eventually becomes a bottleneck between the VDP processes updating the tables and the SwitchBlade FPGA.
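A quick sanity check of Table 8’s per-entry column: because the tables are written in parallel, the wall-clock time stays roughly constant while the total number of entries grows, so the effective per-entry time falls with the number of VDPs.

    # Per-entry update time = total wall-clock time / total entries written.
    runs = [(1, 400_000, 86.582), (2, 800_000, 86.932),
            (3, 1_200_000, 88.523), (4, 1_600_000, 89.770)]
    for vdps, entries, seconds in runs:
        print(vdps, "VDP(s):", round(seconds / entries * 1e6), "us per entry")
    # Roughly 216, 109, 74, and 56 microseconds, close to Table 8's figures.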


Figure 11: Path Splicing router performance with varying load compared with the base router. (Plot of forwarding rate in ’000 pps versus packet size, for the splice router with 2 tables, the splice router with 4 tables, and the base router.)

Figure 12: Variable Bit Length Extraction router performance compared with the base router. (Plot of forwarding rate in ’000 pps versus packet size, for SwitchBlade with the variable bit extractor and the NetFPGA hardware router.)

7. A VIRTUAL ROUTER ON SWITCHBLADE

We now describe our integration of SwitchBlade with a virtual router environment that runs in an OpenVZ container [20]. We use OpenVZ [20] as the virtual environment for hosting virtual routers for two reasons, both related to isolation. First, OpenVZ provides some level of namespace isolation between the respective virtual environments. Second, OpenVZ provides a CPU scheduler that prevents any control-plane process from using more than its share of CPU or memory resources. We run the Quagga routing software [21] in OpenVZ, as shown in Figure 13.

Each virtual environment has a corresponding VDP that acts as its data plane. SwitchBlade exposes a register interface for sending commands from the virtual environment to its respective VDP. Similarly, each VDP can pass data packets into its respective virtual environment using the software exception-handling mechanisms described in Section 4.5.

We run four virtual environments on the same physical machine and use SwitchBlade’s isolation capabilities to share the hardware resources. Each virtual router receives a dedicated amount of processing and is isolated from the other routers’ virtual data planes. Each virtual router also has the appearance of a dedicated data path.

8. DISCUSSION

PCIe interconnect speeds and the tension between hardware and software. Recent architectures for software-based routers, such as RouteBricks, use the PCI Express (PCIe) interface between the

VDPs   Total Entries   Entries/Table   Time (sec)   Single Entry (µs)
1      400 K           400 K           86.582       216
2      800 K           400 K           86.932       112
3      1,200 K         400 K           88.523       74
4      1,600 K         400 K           89.770       56

Table 8: Forwarding Table Update Performance.

Figure 13: Virtual router design with OpenVZ virtual environments interfacing to the SwitchBlade data plane.

CPU, which acts as the I/O hub, and the network interface cards that forward traffic. PCIe offers more bandwidth than a standard PCI interface; for example, PCIe version 2 with a 16-lane link has a total aggregate bandwidth of 8 GB/s per direction. Although this high PCIe bandwidth would seem to offer great promise for building programmable routers that rely on the CPU for packet processing, the speeds of programmable interface cards are also increasing, and it is unclear whether the trends will play out in favor of CPU-based packet processing. For example, a single Virtex-6 HXT FPGA from Xilinx or Stratix V FPGA from Altera can process packets at 100 Gbps. Thus, installing NICs with only one such FPGA can make the PCIe interconnect bandwidth a bottleneck and also puts an inordinate amount of strain on the CPU. SwitchBlade therefore favors making FPGAs more flexible and programmable, allowing more customization to take place directly on the hardware itself.
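A back-of-the-envelope check of this bandwidth argument (our arithmetic, not a measurement from the paper):

    # One FPGA NIC at 100 Gbps moves 12.5 GB/s per direction, exceeding the
    # ~8 GB/s per direction of a 16-lane PCIe 2.0 link, so pushing every
    # packet across PCIe to the CPU becomes the bottleneck.
    pcie2_x16_gb_per_s = 8.0                 # per direction, as stated above
    fpga_gb_per_s = 100 / 8                  # 100 Gbps -> 12.5 GB/s
    assert fpga_gb_per_s > pcie2_x16_gb_per_s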

Modifying packets in hardware. SwitchBlade’s hardware implementation focuses on providing customization for protocols that make only limited modifications to packets. The design can accommodate rewriting packets using preprocessor modules, but we have not yet implemented this function. Providing arbitrary packet-writing capability in hardware would require either using the preprocessor stage for packet writing together with a new pipeline stage after postprocessing, or adding two new stages to the pipeline (one before and one after lookup).

Packet rewriting can be performed in two ways: (1) modifying existing packet content without changing the total data-unit size, or (2) adding data to or removing data from each packet, so that the output packet size differs from the input packet size. Although it is easy to add the first function to the preprocessor stage, adding or removing bytes of packet content would require significant effort.

Scaling SwitchBlade. The current SwitchBlade implementation provides four virtualized data planes on a single NetFPGA, but the design is general enough to scale as the capabilities of hardware improve. We see two possible avenues for increasing the number of virtualized data planes in hardware. One option is to add several servers, each with one FPGA card, and have one or more servers run the control plane that manages the hardware forwarding-table entries. Other scaling options include adding more FPGA cards to a single physical machine or


taking advantage of hardware trends, which promise the ability toprocess data in hardware at increasingly higher rates.

9. CONCLUSION

We have presented the design, implementation, and evaluation of SwitchBlade, a platform for deploying custom protocols on programmable hardware. SwitchBlade uses a pipeline-based hardware design; using this pipeline, developers can swap common hardware processing modules in and out of the packet-processing flow on the fly, without having to resynthesize hardware. SwitchBlade also offers programmable software exception handling, allowing developers to integrate into the packet-processing pipeline custom functions that cannot be handled in hardware. SwitchBlade’s customizable forwarding engine also permits the platform to make packet forwarding decisions on various fields in the packet header, enabling custom, non-IP-based forwarding at hardware speeds. Finally, SwitchBlade can host multiple data planes in hardware in parallel, sharing common hardware processing modules while providing performance isolation between the respective data planes. These features make SwitchBlade a suitable platform for hosting virtual routers or for deploying multiple data planes for protocols or services that offer complementary functions in a production environment. We implemented SwitchBlade using the NetFPGA platform, but SwitchBlade can be implemented with any FPGA.

Acknowledgments

This work was funded by NSF CAREER Award CNS-0643974 and NSF Award CNS-0626950. We thank Mohammad Omer for his help in solving various technical difficulties during the project. We also thank our shepherd, Amin Vahdat, for feedback and comments that helped improve the final draft of this paper.

REFERENCES

[1] FlowVisor. http://www.openflowswitch.org/wk/index.php/FlowVisor.
[2] NetFPGA. http://www.netfpga.org.
[3] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker. Accountable Internet Protocol (AIP). In Proc. ACM SIGCOMM, Seattle, WA, Aug. 2008.
[4] M. B. Anwer and N. Feamster. Building a Fast, Virtualized Data Plane with Programmable Hardware. In Proc. ACM SIGCOMM Workshop on Virtualized Infrastructure Systems and Architectures, Barcelona, Spain, Aug. 2009.
[5] S. Bhatia, M. Motiwala, W. Muhlbauer, V. Valancius, A. Bavier, N. Feamster, L. Peterson, and J. Rexford. Hosting Virtual Networks on Commodity Hardware. Technical Report GT-CS-07-10, Georgia Institute of Technology, Atlanta, GA, Oct. 2007.
[6] S. Bhatia, M. Motiwala, W. Mühlbauer, V. Valancius, A. Bavier, N. Feamster, J. Rexford, and L. Peterson. Hosting virtual networks on commodity hardware. Technical Report GT-CS-07-10, College of Computing, Georgia Tech, Oct. 2007.
[7] G. Calarco, C. Raffaelli, G. Schembra, and G. Tusa. Comparative analysis of SMP Click scheduling techniques. In QoS-IP, pages 379–389, 2005.
[8] L. D. Carli, Y. Pan, A. Kumar, C. Estan, and K. Sankaralingam. Flexible lookup modules for rapid deployment of new protocols in high-speed routers. In Proc. ACM SIGCOMM, Barcelona, Spain, Aug. 2009.
[9] M. Casado, T. Koponen, D. Moon, and S. Shenker. Rethinking packet forwarding hardware. In Proc. Seventh ACM SIGCOMM HotNets Workshop, Nov. 2008.
[10] G. A. Covington, G. Gibb, J. Lockwood, and N. McKeown. A Packet Generator on the NetFPGA platform. In FCCM '09: IEEE Symposium on Field-Programmable Custom Computing Machines, 2009.
[11] S. Deering and R. Hinden. Internet Protocol, Version 6 (IPv6) Specification. Internet Engineering Task Force, Dec. 1998. RFC 2460.
[12] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting parallelism to scale software routers. In Proc. 22nd ACM Symposium on Operating Systems Principles (SOSP), Big Sky, MT, Oct. 2009.
[13] B. Godfrey, I. Ganichev, S. Shenker, and I. Stoica. Pathlet routing. In Proc. ACM SIGCOMM, Barcelona, Spain, Aug. 2009.
[14] Intel IXP 2xxx Network Processors. http://www.intel.com/design/network/products/npfamily/ixp2xxx.htm.
[15] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: A scalable Ethernet architecture for large enterprises. In Proc. ACM SIGCOMM, Seattle, WA, Aug. 2008.
[16] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, Aug. 2000.
[17] M. Motiwala, M. Elmore, N. Feamster, and S. Vempala. Path Splicing. In Proc. ACM SIGCOMM, Seattle, WA, Aug. 2008.
[18] R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A scalable fault-tolerant layer 2 data center network fabric. In Proc. ACM SIGCOMM, Barcelona, Spain, Aug. 2009.
[19] OpenFlow Switch Consortium. http://www.openflowswitch.org/, 2008.
[20] OpenVZ: Server Virtualization Open Source Project. http://www.openvz.org.
[21] Quagga software routing suite. http://www.quagga.net/.
[22] J. Turner, P. Crowley, J. DeHart, A. Freestone, B. Heller, F. Kuhns, S. Kumar, J. Lockwood, J. Lu, M. Wilson, et al. Supercharging PlanetLab: A High Performance, Multi-application, Overlay Network Platform. In Proc. ACM SIGCOMM, Kyoto, Japan, Aug. 2007.
[23] Xilinx. Xilinx ISE design suite. http://www.xilinx.com/tools/designtools.htm.
[24] X. Yang, D. Wetherall, and T. Anderson. Source selectable path diversity via routing deflections. In Proc. ACM SIGCOMM, Pisa, Italy, Aug. 2006.
