
A Flexible and Scalable High-Performance OpenFlow Switch on Heterogeneous SoC Platforms

Shijie Zhou
Ming Hsieh Dept. of Electrical Eng.
University of Southern California
Los Angeles, CA 90089
Email: [email protected]

Weirong Jiang
Xilinx Research Labs
San Jose, CA 95124
Email: [email protected]

Viktor K. Prasanna
Ming Hsieh Dept. of Electrical Eng.
University of Southern California
Los Angeles, CA 90089
Email: [email protected]

Abstract—Software Defined Networking (SDN) has been proposed as a flexible solution for next generation Internet provision. OpenFlow is a pioneering protocol for SDN which enables a hardware data plane to be managed by a software-based controller in a standard way. In this paper, we present a hardware-software co-design approach to an OpenFlow switch using a state-of-the-art heterogeneous system-on-chip (SoC) platform. Specifically, we implement the OpenFlow switch on a Xilinx Zynq ZC706 board. The Xilinx Zynq SoC family provides a tight coupling of field programmable gate array (FPGA) fabric and ARM processor cores, making it an attractive on-chip implementation platform for SDN switches. High-performance, yet highly programmable, data plane processing can reside in the programmable logic, while complex control software can reside in the ARM processor. Our proposed architecture involves a methodology that scales across (a) a range of possible packet throughput rates and (b) a range of possible flow table sizes. Post-place-and-route results show that our design targeting the Xilinx Zynq can achieve a total throughput of 88 Gbps for a 1K flow table which supports dynamic and hitless updates. Correct operation has been demonstrated on a ZC706 board.

Index Terms—Software Defined Networking, OpenFlow Switch, Heterogeneous SoC

I. INTRODUCTION

Software Defined Networking (SDN) has been proposed as a leading architecture to facilitate innovation in computer networks [1]. It allows more researchers and developers to deploy new ideas. Traditionally, the data plane, which forwards packets, and the control plane, which manages the data plane, are tightly coupled within each piece of network equipment. SDN separates the control plane from the network equipment by defining an open protocol for the communication between the control plane and the data plane [1], providing researchers an open software platform to run experiments.

OpenFlow is a pioneering protocol which enables the communication between an OpenFlow switch (data plane) and an OpenFlow controller (control plane). The OpenFlow switch is mainly composed of an OpenFlow agent and an OpenFlow data plane [2]. As shown in Fig. 1, the OpenFlow agent interacts with the OpenFlow controller over the OpenFlow protocol and manipulates the data plane.

Ternary content addressable memories (TCAMs) have been widely used in existing switch designs [3], [4], [5]. However, TCAMs are expensive, power-hungry, and do not scale well with respect to clock rate or circuit area [9], [10]. FPGA technologies have become an attractive option for implementing real-time network processing engines [11], [12], [13]. State-of-the-art FPGA devices provide dense logic units, large amounts of on-chip memory and high-bandwidth interfaces for various external memory technologies [14]. There has been growing interest in employing FPGAs in OpenFlow research. For example, the NetFPGA platform [19] has shown its potential for developing OpenFlow switches [16], [18].

Fig. 1: OpenFlow controller and OpenFlow switch

In the last decade, a variety of new embedded applications have emerged, such as high-end cell phones and wireless network applications [7], [8]. Today, these applications demand more comprehensive functionality, higher performance, and lower power consumption, rendering traditional embedded systems no longer an appropriate platform [15]. As a result, heterogeneous SoC technologies have been studied [6] and widely employed in compute-intensive applications [15]. Typically, a heterogeneous SoC architecture combines traditional processors, dedicated data transfer, dynamically reconfigurable accelerators and other resources such as intellectual property blocks. The Xilinx Zynq SoC is a system-on-chip platform which tightly integrates a dual-core ARM Cortex-A9 MPCore processing system with advanced FPGA fabric. With this efficient integration of processors and FPGA, it is an ideal platform for hardware-software co-designed applications [20]. However, it is still challenging to achieve high performance on a heterogeneous SoC platform since there are multiple programmable processing elements.

In this work, we take advantage of the powerful programmability of the Xilinx Zynq SoC platform to develop a scalable high-performance OpenFlow switch. We summarize the contributions of this paper as follows:

• We develop an OpenFlow switch architecture based on the OpenFlow 1.0.0 specification [2] using a heterogeneous SoC platform. We implement the OpenFlow switch on a Xilinx Zynq ZC706 board.

• By carefully designing the main components, the architecture is scalable to support a range of possible packet throughput rates by varying the input data width. It also scales well as the flow table size increases.

• We conduct comprehensive experiments to evaluate the performance of the OpenFlow switch. Post-place-and-route results show that the OpenFlow switch can achieve a high throughput of 88 Gbps while supporting 1K flow entries. Correct operation has been demonstrated on the Xilinx Zynq ZC706 board.

• Our design gives switch designers the flexibility to define the packet processing policy, so it can support different versions of OpenFlow or alternative SDN protocols.

The rest of the paper is organized as follows: Section II introduces the background and related work; Section III presents an overview of the OpenFlow switch architecture; Section IV details the designs of the major components of the OpenFlow switch; Section V contains experimental results; finally, Section VI concludes the paper.

II. BACKGROUND AND RELATED WORK

A. OpenFlow Switch

OpenFlow technologies centralize the complexity of network equipment in the software controller. Therefore, the controller administrator has full control over the network. The controller manages the OpenFlow switches over a secure channel using the OpenFlow protocol. On the OpenFlow switch side, an agent receives the control signals and follows the instructions accordingly. The OpenFlow agent can also send packets to the data plane, update the flow table and query the flow table statistics.

The data plane contains a flow table against which the OpenFlow switch performs packet classification based on multiple packet header fields. The flow table consists of a number of flow entries, each with its own priority. Entries with higher priority must be examined before ones with lower priority. In OpenFlow 1.0.0 [2], a flow entry is composed of a tuple of 12 header fields, flow entry counters and an associated action. The 12 fields in the tuple have different match criteria, including exact match, range match and prefix match. We show the 12 fields in Table I.

The flow entry counters record statistics for the entry, such as the duration since its creation, the number of received packets and the number of received bytes. Counters are also maintained per table, per port and per queue [2]. After the flow table identifies the matching flow entry, the associated counters are updated and the associated action is taken. Three actions are required of an OpenFlow switch:

TABLE I: 12 header fields in a flow entry

Header Field               No. of bits   Match Criterion
Ingress Port                    3        exact match
Source MAC address             48        exact match
Destination MAC address        48        exact match
Ethernet type                  16        exact match
VLAN ID                        12        exact match
VLAN priority                   3        exact match
IP source address              32        prefix match
IP destination address         32        prefix match
IP protocol                     8        exact match
IP type of service              6        exact match
TCP/UDP source port            16        range match
TCP/UDP destination port       16        range match

• Forward: forward the packets to any physical port or virtual port.

• Send to the controller: send packets which do not match any flow entry to the controller.

• Drop: drop packets which match a flow entry with no specified action.

An example of a flow table is shown in Table II. More details of the OpenFlow switch specification can be found in [1], [2].
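Behaviorally, the lookup-and-act semantics described above amount to returning the action of the highest-priority matching entry, or sending the packet to the controller on a miss. The sketch below is a minimal illustrative Python model, not the hardware mechanism of Section IV; the table layout and field names are invented for the example.

    def matches(entry, packet):
        # Every match field of the entry is a predicate over one header field.
        return all(pred(packet[field]) for field, pred in entry['match'].items())

    def lookup(flow_table, packet):
        # Entries are examined in descending priority order; the first hit wins.
        for entry in sorted(flow_table, key=lambda e: -e['priority']):
            if matches(entry, packet):
                return entry['action']
        return 'send_to_controller'   # unmatched packets go to the controller

    flow_table = [
        {'priority': 10, 'match': {'dst_port': lambda p: 5 <= p <= 28}, 'action': 'forward:1'},
        {'priority': 1,  'match': {},                                   'action': 'drop'},
    ]
    print(lookup(flow_table, {'dst_port': 20}))   # -> 'forward:1'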

B. Zynq SoC

In this section, we briefly introduce the key features of the Zynq SoC. Additional details can be found in [22].

The Zynq SoC is a processor-centric platform integrating an ARM processor and FPGA fabric. A Zynq device has three core functional blocks:

• Processing System (PS)

• Programmable Logic (PL)

• PS-PL interfaces

The heart of the PS is a dual-core ARM processor running at 1 GHz. The two processors can communicate with the on-chip memory, memory controllers and built-in peripherals. After power-up, the PS boots first and can run bare-metal systems, Linux and real-time operating systems. The PL is built from advanced Xilinx FPGA fabric, through which high-performance and low-power customized peripherals can be developed. PS and PL are interconnected through multiple AXI [21] interfaces, creating an efficient interconnection between them. The bus connecting PS and PL can run at up to 250 MHz. There are nine AXI interfaces in total linking PS and PL:

• One AXI Accelerator Coherency Port interface

• Two AXI master interfaces (General Purpose port)

• Two AXI slave interfaces (General Purpose port)

• Four configurable and buffered, high-performance AXI slave interfaces (High-Performance port)

With this efficient integration of ARM processor and FPGA, software applications can control and harness the hardware to obtain high-performance data processing.


TABLE II: Example of an OpenFlow flow table

IngPort  SrcMAC    DstMAC    EtherType  VlanID  VlanPrio  SrcIP  DstIP  IpProt  IpTOS  SrcPort  DstPort  Action
1        00:03:FF  *         0x8100     15      *         11*    000*   TCP     5      5-28     68       Action 1
*        D7:A6:33  *         0x0800     *       3         *      *      UDP     *      10-14    *        Action 2
2        *         *         *          *       *         *      110*   *       *      *        *        Action 4
*        *         09:A5:90  *          *       *         *      *      *       *      0-20     112      Action 2
5        17:66:A4  0A:99:B3  0x0800     6       1         111*   000*   TCP     3      22-30    1080     Action 5

C. Related Work

In recent years, several OpenFlow switch designs have been proposed in the literature [16], [17], [18]. [16] demonstrates an OpenFlow switch implemented on the NetFPGA-1G platform; it is capable of meeting a 1 Gbps line rate while maintaining 32K exact-match entries and 32 wildcard entries using a combination of off-chip SRAM and on-chip TCAMs. [17] stores the flow entries as regular expressions in off-chip SRAM so that it can run at a 1 Gbps line rate while storing over 100K wildcard entries on the NetFPGA-1G platform. However, the works implemented on the NetFPGA-1G platform expose the bottleneck of the communication channel between the software and the hardware [16]: the communication channel on NetFPGA-1G is built over a 64-bit PCI bus clocked at 125 MHz. [18] implements an OpenFlow switch on a platform integrating the NetFPGA-10G with a MIPS64 softcore, achieving a line rate of 10 Gbps. The softcore communicates with the switch using the MIPS64 coprocessor move instructions in 64-bit chunks at a clock rate under 200 MHz.

III. ARCHITECTURE OVERVIEW

Our OpenFlow switch architecture consists of a software-based control agent, a hardware-based data plane and the interfaces between software and hardware. An overview of the architecture is depicted in Fig. 2.

Fig. 2: Architecture Overview (the PS runs the OpenFlow agent; the PL hosts the packet parser, flow table and forwarding logic, connected to the PS through AXI GP and HP ports)

A. Software

Our OpenFlow agent is a Linux application running in the PS. The Linux kernel is built from a Xilinx-provided pre-built image. Customized Linux drivers are added to use the peripherals. We also use standard Linux tools and libraries to facilitate software development. Our OpenFlow agent can perform the following tasks:

• Send packets to the data plane

• Add a new flow entry

• Update or delete an existing flow entry

• Clear the entire flow table

• Examine a specific flow entry content

• Query the flow entry counters and the table counters

• Receive the packets forwarded from the data plane

Updating the flow table and querying the counters are achieved by writing and reading specific registers, respectively. Packets are transferred between PS and PL through AXI interfaces in a streaming fashion. To receive packets from the data plane, the agent reserves specific on-chip memory blocks to which the PL can write through the AXI interfaces.
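As a concrete illustration of this register-based control path, a user-space Linux agent can map the PL registers into its address space through /dev/mem. The sketch below is illustrative Python under assumed names: CTRL_BASE and the two register offsets are hypothetical placeholders, not the actual register map of our design, which is fixed at system integration time.

    import mmap
    import os
    import struct

    # Hypothetical values for illustration only; the real base address and
    # register map are determined by the system's address assignment.
    CTRL_BASE = 0x43C00000      # physical base of the AXI-Lite control registers
    REG_ENTRY_INDEX = 0x00      # offset of the flow-entry-index register
    REG_UPDATE_DATA = 0x04      # offset of the update-payload register

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, mmap.PAGESIZE, offset=CTRL_BASE)

    def write_reg(offset, value):
        # 32-bit little-endian register write through the memory map
        regs[offset:offset + 4] = struct.pack("<I", value)

    def read_reg(offset):
        # 32-bit register read, e.g., for querying a counter
        return struct.unpack("<I", regs[offset:offset + 4])[0]

    write_reg(REG_ENTRY_INDEX, 5)           # select flow entry 5
    write_reg(REG_UPDATE_DATA, 0x0000FFFF)  # push one word of update content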

B. Hardware

The data plane in the PL is implemented as modular stages connected in a pipelined fashion. As shown in Fig. 2, packets come from two sources: the OpenFlow agent (PS) or the physical interfaces in the PL; packets from the OpenFlow agent have higher priority. After a packet enters the data plane, a packet parser extracts the fields of interest from the packet header and concatenates them to form a tuple. The tuple is then passed into a flow table for lookup. A few delay blocks are added to keep each packet synchronized with its lookup result. The number of delayed clock cycles equals the number of cycles needed to obtain the lookup result and varies with the flow table configuration. Once a final lookup decision is reached, the counters for that flow entry are updated and the action is taken. If there is no matching result, the packet is encapsulated and forwarded to the OpenFlow agent.

Normally, the flow table performs lookups and updates the flow entry counters. When the agent updates the flow table, it sends a control signal consisting of the updated content and the flow entry index to the data plane. The control signal is transferred by modifying dedicated registers in the PL. As a consequence, the flow table updates the corresponding flow entry content based on the updated register values.


The hardware system is generated using the PX language [23], a high-level domain-specific language. The PX language treats packets in an object-oriented style and can achieve similar functionality to a P4 [24] program. It brings flexibility and scalability to the OpenFlow switch in the following aspects:

• The packet parser can extract a variety of specific header fields to satisfy different versions of OpenFlow and alternative SDN protocols.

• The input data width of the packet parser is configurable to meet a range of possible packet throughput rates.

• The flow table is implemented as a set of pipelined function units laid out in two dimensions, so that it can support a range of flow table sizes without sacrificing clock rate.

C. SW-HW Interface

In our design, the OpenFlow agent uses a high-performance port for receiving packets from the PL; three general-purpose ports are used to send update signals to the PL, to query counters from the PL, and to send packets from PS to PL, respectively. There are three types of AXI interfaces [21]:

• AXI: for high-performance memory-mapped requirements

• AXI-Lite: for simple, low-throughput memory-mapped communication

• AXI-Stream: for high-speed streaming data

Fig. 3: SW-HW interfaces (the agent in the PS reaches the data plane in the PL through AXI-Lite registers over GP ports and an AXI DMA with AXI-Stream interfaces over HP ports)

We use a DMA engine with AXI-Stream interfaces to realize high-speed packet transmission between PS and PL. Since AXI-Lite interfaces are well suited to reading and writing control and status registers [21], we use AXI-Lite interfaces for transferring control signals and querying counters. We depict the details of the software-hardware interfaces in Fig. 3.

IV. COMPONENT DESIGN

The key components of the OpenFlow switch are the packet parser, the flow table and the statistics collection mechanism. This section discusses each of these components in detail.

A. Packet Parser

The main function of the packet parser is to extract the fields of interest from the packet header and form a tuple for the subsequent lookup. For OpenFlow 1.0.0 [2], there are four header sections to parse: the Ethernet section, the VLAN section, the IP section and the TCP/UDP section. The example below lists a PX class description [23] for a packet parser with two header sections and four header fields of interest.

    class parser :: ParsingEngine(8000, 2, ETH_header) {
        struct flows {
            smac : 48, dmac : 48,   // Ethernet
            sip : 32, dip : 32;     // IP
        }
        class :: Tuple(inout) { struct flows; } tuple;

        class ETH_header :: Section {
            struct { smac : 48, dmac : 48 }
            method update = {
                tuple.smac = smac,
                tuple.dmac = dmac
            }
            method move_to_section =
                if (type == 0x0800) IP_header;
                else done(1);
            method increment_offset = size();
        }

        class IP_header :: Section {
            struct { sip : 32, dip : 32 }
            method update = {
                tuple.sip = sip,
                tuple.dip = dip
            }
            method move_to_section = done(1);
            method increment_offset = 0;
        }
    }

In the example, the class parser inherits from the built-in class ParsingEngine. The three parameters of ParsingEngine have the following meanings: "8000" indicates that the maximum packet header that can be parsed is 8000 bytes; "2" means the parser can handle up to two header sections, which are the Ethernet header section and the IP header section in this example; "ETH_header" indicates that the parser expects to parse the Ethernet section first. The sub-class Tuple defines the fields of interest and their lengths in bits, and its instance, tuple, is an input/output object carrying the content of the extracted fields. The Ethernet section and the IP section are defined as two sub-classes in the parser. In each section, the "update" method updates the tuple content according to the header content. After parsing the current header section, the "move_to_section" method examines whether there is a next section to be parsed; if there is, parsing moves to the next section, otherwise parsing is complete. The "increment_offset" method indicates how far to move to reach the next section.

The PX description is fed into the compiler, together with the performance requirements with respect to throughput and latency. Based on these requirements, the compiler adjusts the design architecture. For instance, the input data width of the generated parser is narrow for a low-throughput requirement and wide for a high-throughput requirement. The output of the compiler is a register transfer level (RTL) description in either VHDL or Verilog, which can be mapped onto the FPGA. Each header section is mapped into a heavily pipelined stage to ensure a high clock rate.

When a packet header arrives at a stage, it comes with its header type, its offset in the data stream and the tuple being constructed. In each stage, the packet header is parsed without being changed; the offset and tuple are computed and updated accordingly. Multiple packet headers are parsed simultaneously in this manner in the pipeline. Fig. 4 shows the pipeline architecture of the parser. Note that the interfaces are generated in the AXI fashion.

Fig. 4: Pipeline architecture of the packet parser (each stage passes along the packet, header type, offset and the tuple under construction over AXI interfaces)
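To make the per-stage behavior concrete, here is a behavioral model of the two sections from the PX example above (illustrative Python, not the RTL the PX compiler actually generates, and using the standard 14-byte Ethernet header layout as an assumption): each stage consumes the unmodified packet bytes plus the current offset and partially built tuple, and emits the next section, the updated offset and the updated tuple.

    def eth_stage(pkt, offset, tup):
        # Ethernet section: extract destination/source MAC, then pick the
        # next section based on the EtherType field.
        tup['dmac'] = pkt[offset:offset + 6]
        tup['smac'] = pkt[offset + 6:offset + 12]
        ethertype = int.from_bytes(pkt[offset + 12:offset + 14], 'big')
        next_section = 'IP_header' if ethertype == 0x0800 else 'done'
        return next_section, offset + 14, tup   # increment_offset = size()

    def ip_stage(pkt, offset, tup):
        # IPv4 section: extract source/destination addresses
        # (bytes 12-19 of the IPv4 header), then finish.
        tup['sip'] = pkt[offset + 12:offset + 16]
        tup['dip'] = pkt[offset + 16:offset + 20]
        return 'done', offset, tup              # increment_offset = 0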

In the PX description, designers can define the maximum size of the packet header, the sections and the header fields of interest. As OpenFlow evolves, the OpenFlow switch should be protocol-independent, so that the way switches process packets can be changed flexibly [24]. The PX language supports this, allowing switch designers to apply different parsing policies.

B. Flow Table

The flow table performs flow entry lookup and can be dynamically updated. This section discusses how we support ternary match, range match and dynamic updates. Throughout this paper, we use N to denote the number of flow entries in the flow table.

1) Ternary Match: As illustrated in Table II, both the exact match fields and the prefix match fields may contain ternary bits. Hence, a mechanism to support ternary matches is required for complex flow entries where wildcards can appear anywhere. A ternary bit can be "0", "1", or "don't care" ("*"). For example, a flow entry with "*0" matches both "10" and "00". We use an N-bit bit vector (BV) to record the match conditions of all the entries. In the N-bit BV, each bit corresponds to one flow entry ("1" for match, "0" for mismatch). For example, the output BV "010" (N = 3) indicates that the input matches the second flow entry. We store such BVs in random-access memory (RAM) and employ the input as the address to access the RAM. Suppose the length of the input search key is s bits; then there are 2^s possible values of the search key. One RAM word corresponds to one possible search key, so 2^s N-bit BVs are stored, each indicating the match condition of its search key. Such a RAM-based lookup mechanism is as efficient as TCAM [10].

Fig. 5 shows an example (N = 4, s = 2) of storing a 4-entry flow table in RAM. When an input "00" arrives, we use it as the address to access the RAM and obtain the BV "0010", indicating that the input "00" matches the third entry.

Flow table:              RAM contents (Key -> BV):
Entry 0: 1*              00 -> 0010
Entry 1: 01              01 -> 0110
Entry 2: **              10 -> 1010
Entry 3: 11              11 -> 1011

Fig. 5: Storing a 4-entry flow table in the RAM
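To make the bit-vector mechanism concrete, here is a minimal software model (illustrative Python; the actual design implements this in distributed RAM on the FPGA) that builds the per-key BVs for a set of ternary entries and performs a lookup. It reproduces the Fig. 5 example exactly.

    def matches(key_bits, ternary):
        # True if a concrete key matches a ternary pattern over '0', '1', '*'.
        return all(t == '*' or t == k for k, t in zip(key_bits, ternary))

    def build_ram(entries, s):
        # One N-bit BV per possible s-bit key; BV bit i reflects entry i.
        ram = {}
        for key in range(2 ** s):
            key_bits = format(key, f'0{s}b')
            ram[key_bits] = ''.join('1' if matches(key_bits, e) else '0'
                                    for e in entries)
        return ram

    # The 4-entry, 2-bit example of Fig. 5.
    entries = ['1*', '01', '**', '11']
    ram = build_ram(entries, s=2)
    print(ram['00'])   # -> '0010': key 00 matches only Entry 2 ('**')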

2) Range Match: The TCP/UDP source and destination ports are two 16-bit fields which require range match. It is difficult to apply the aforementioned approach by treating ranges as prefixes, since converting ranges into ternary strings results in the "rule expansion" problem [25]. To support the range match fields, we use 2 × N registers to store the upper and lower boundaries of the N ranges. The input is compared against all the boundaries in parallel. Each pair of lower and upper boundaries determines the match result of one entry, so the N pairs of boundaries output an N-bit BV. The result BV of the range match is then bitwise ANDed with the result BV of the ternary match to obtain the final result BV. Fig. 6 depicts the range match process.

Fig. 6: Range match process (the input is compared in parallel against the N lower/upper bound pairs; each pair contributes one bit of the N-bit BV)
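A software model of the same step, again illustrative Python rather than the RTL itself: each entry holds a (lower, upper) pair, the input port value is compared against all pairs, and the resulting BV is ANDed with the ternary-match BV. The bounds below are made-up example values.

    def range_bv(port, bounds):
        # bounds: list of (lower, upper) pairs, one per flow entry.
        return ''.join('1' if lo <= port <= hi else '0' for lo, hi in bounds)

    def and_bv(a, b):
        # Bitwise AND of two BVs represented as '0'/'1' strings.
        return ''.join('1' if x == '1' and y == '1' else '0'
                       for x, y in zip(a, b))

    bounds = [(5, 28), (10, 14), (0, 65535), (22, 30)]  # example port ranges
    final_bv = and_bv(range_bv(12, bounds), '0111')     # ternary BV from the RAMs
    print(final_bv)   # -> '0110': entries 1 and 2 match both stages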

3) Update: The update operations include flow entry deletion, insertion, modification and clearance. Clearing the entire flow table can be performed by setting all the BVs in the RAMs to 0. To update a specific entry without interfering with the other entries, we first read the original BV from RAM, modify the bit representing the flow entry being updated, and then write the updated BV back into the RAM. To delete an entry, we set the bits which represent that entry in the BVs to 0. To modify a specific entry, two s-bit inputs (s is the length of the search key), denoted Data and Mask, are needed to represent the ternary string: Data holds the non-wildcard bits, while Mask indicates the positions of the wildcard bits. For example, to change Entry 2 in Fig. 5 from "**" to "*1", a Data value of "11" or "01" (since the first bit is a wildcard, both work) and a Mask value of "10" are fed to the flow table. In the next step, we modify the third bit of each BV based on whether the search key of each RAM word matches the new flow entry value, "*1". Finally, the BVs become "0000", "0110", "1000" and "1011", respectively. Insertion is performed in a similar way to modification. To update the range values, we directly update the register values.
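The read-modify-write described above can be modeled as follows (illustrative Python; in hardware this runs through the update port of the dual-port RAM). Applying it to the Fig. 5 example reproduces exactly the BVs derived in the text.

    # Starting state: the RAM contents of the Fig. 5 example.
    ram = {'00': '0010', '01': '0110', '10': '1010', '11': '1011'}

    def update_entry(ram, index, data, mask, s):
        # Rewrite bit `index` of every stored BV. `data` holds the non-wildcard
        # bits; `mask` has '1' wherever the new ternary string has a wildcard.
        for key in range(2 ** s):
            key_bits = format(key, f'0{s}b')
            hit = all(m == '1' or d == k
                      for k, d, m in zip(key_bits, data, mask))
            bv = list(ram[key_bits])
            bv[index] = '1' if hit else '0'
            ram[key_bits] = ''.join(bv)

    update_entry(ram, index=2, data='01', mask='10', s=2)  # Entry 2: '**' -> '*1'
    print([ram[k] for k in ('00', '01', '10', '11')])
    # -> ['0000', '0110', '1000', '1011'], matching the BVs derived above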

4) Mapping to Hardware: The flow table is implemented using dual-port distributed RAMs to enable concurrent read and write. To support hitless updates, the distributed RAM has two output ports, one for lookup and one for update. However, a direct mapping of our flow table (N = 1K, s = 240, since the 12-field tuple is 240 bits long) to the proposed architecture would significantly deteriorate the clock rate due to the long wire connections between memory blocks. To address this problem, we divide the flow table into a set of small function units laid out in two dimensions (R rows × C columns). Vertically, each function unit stores the matching condition of a subset of the flow entries, so that each RAM stores an (N/R)-bit BV. Horizontally, the input search key of each unit is 240/C bits of the tuple instead of the entire 240 bits. The local result BV of each function unit is bitwise ANDed with the BV imported from its previous horizontal neighbour, and the ANDed result is output to the next horizontal neighbour. Since the entries of different rows have different priorities, the final BVs of all the rows are fed into a priority encoder to determine the global result. Each function unit has update logic to perform updates at run time.

The architecture is implemented in a pipelined fashion and is scalable to support various packet classification requirements. For example, to increase the number of fields of interest, we can add more function units in each row; to support a larger flow table, we can add more rows of function units. Fig. 7 depicts a 3 × 3 example (for simplicity, we denote the function units for both ternary match and range match as FU).

Fig. 7: Architecture overview of a 3 × 3 example (nine FUs feeding a priority encoder)
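Behaviorally, the final step reduces to a priority encode over the per-row BVs, with row 0 holding the highest-priority entries. A minimal illustrative Python model:

    def priority_encode(row_bvs):
        # row_bvs: final per-row bit vectors, ordered from the highest-priority
        # row to the lowest. Return the global index of the first matching entry.
        base = 0
        for bv in row_bvs:
            for i, bit in enumerate(bv):
                if bit == '1':
                    return base + i
            base += len(bv)
        return None   # no match: the packet goes to the agent

    print(priority_encode(['0010', '1100']))   # -> 2 (third entry of row 0)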

C. Statistics Collection

The counters collected by the OpenFlow switch are listed in Table III. The table counters are maintained in registers in the Zynq PL; whenever a lookup result is produced, the table counters are updated accordingly. The flow table also keeps track of the number of currently active flow entries. The flow counters are maintained by the OpenFlow agent. When a new entry is added to the flow table, the agent records its creation time and the entry index; similarly, the agent clears the creation time when deleting an entry. In this way, the duration counters of a flow entry can be obtained by calculating the time elapsed between its creation and the current time. When the data plane produces a match result, it also sends the resulting entry index together with the packet size in bytes to the agent through a high-performance port. The agent then updates the received-packets counter and received-bytes counter of that entry.
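For example, the duration counters require no dedicated hardware: the agent can derive them on demand from the recorded creation time. A minimal illustrative Python sketch of this bookkeeping:

    import time

    creation_time = {}   # flow entry index -> creation timestamp kept by the agent

    def on_entry_added(index):
        # Record the creation time when the agent installs a new flow entry.
        creation_time[index] = time.time()

    def duration_counters(index):
        # Derive the two duration counters on demand from the stored timestamp.
        elapsed = time.time() - creation_time[index]
        seconds = int(elapsed)
        nanoseconds = int((elapsed - seconds) * 1e9)
        return seconds, nanoseconds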

TABLE III: List of counters

Table counters
  Active Entries           32 bits
  Packet Lookups           64 bits
  Packet Matches           64 bits
Flow counters
  Received Packets         64 bits
  Received Bytes           64 bits
  Duration (seconds)       32 bits
  Duration (nanoseconds)   32 bits

Besides the statistics counters, our OpenFlow switch also keeps track of the flow table content in the PS, which makes it convenient for the controller administrator to inspect the flow table. We achieve this by maintaining a copy of the flow table in the PS: when the agent updates the flow table, both the flow table in the PL and the copy in the PS are updated. The statistics collection mechanism is depicted in Fig. 8.

Fig. 8: Statistics collection (table counters live in the PL; flow counters and a copy of the flow table are maintained by the agent in the PS)


V. EVALUATION RESULTS

We evaluate the performance of the OpenFlow data plane architecture based on post-place-and-route results from the Xilinx Vivado 2014.2 development toolset. The experiments target the Zynq ZC706 board featuring the xc7z045 device, which contains 54650 slices. We use the following performance metrics for evaluation:

• Raw throughput: the total bit rate through the switch, in Gbps

• Slice utilization: the utilization of the basic logic resource units of the FPGA

To study the functionality of the entire OpenFlow switch, we export the bitstream (the FPGA physical programming information) onto a Zynq ZC706 board and build the Linux agent application. Traffic enters the switch through a 4 × 10 Gbps daughter card, and the data plane is controlled through the agent at run time.

A. Input Data Width

We first explore the impact of varying the input data width. In these experiments, we fix the flow table size at 128 entries and vary the input data width from 64 to 512 bits. The raw throughput is the product of the input data width and the clock rate. Fig. 9 shows the effect of increasing the packet data path width.

We observe that the raw throughput increases as the input data width increases; the clock rate is not affected negatively, thanks to the pipelined architecture. Another observation is that different input widths can be chosen to meet different throughput requirements. For example, we can set the data width to 256 bits for a 50 Gbps network and to 512 bits for a 100 Gbps network.
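As a quick sanity check on this sizing (taking, purely for illustration, a clock rate of 200 MHz, which lies within the range reported in Fig. 9):

256 bits × 200 MHz = 51.2 Gbps ≥ 50 Gbps;  512 bits × 200 MHz = 102.4 Gbps ≥ 100 Gbps.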

The slice usage also increases as the input data width increases, but not linearly. This is because many of the resources consumed by the computing components (e.g., the flow table) are not affected by the size of the computing operands.

B. Flow Table Size

To study the scalability of our switch with respect to the flow table size, we fix the input data path width at 512 bits and vary the flow table size from 128 to 1024 entries. Fig. 10 shows the performance under various flow table sizes.

The clock rate and the raw throughput show a slowly dropping trend as the flow table becomes larger. The deterioration is caused by higher routing complexity and more long wire connections. Moreover, with a larger flow table, routing on the FPGA is more challenging to optimize [11].

The resource usage increases significantly each time the size of the flow table is doubled. The reason is that we have to bitwise AND more BVs, use more distributed RAMs to store the BVs, and execute more computations for update and priority encoding. With a 1024-entry flow table and a 512-bit input data width, 78% of the slice resources are consumed and a raw throughput of 88 Gbps is achieved (512 bits at a clock rate of about 172 MHz).

Fig. 9: Increasing the input data width. (a) Clock rate (MHz) and raw throughput (Gbps) vs. input data width (bits); (b) number of slices and slice utilization vs. input data width (bits).

Fig. 10: Increasing the number of flow entries. (a) Clock rate (MHz) and raw throughput (Gbps) vs. number of flow entries; (b) number of slices and slice utilization vs. number of flow entries.

C. Deployment

With the 4 × 10 Gbps daughter card installed, the physical interface of the Zynq ZC706 board has a rate of 40 Gbps. To maximize the throughput, we configure the data plane with a 256-bit data width and 512 flow entries. The frequency of the bus connecting the PS and the IP cores in the PL is set to 250 MHz. The OpenFlow switch is able to process the physical traffic, while 30 million software-generated packets can be processed per second. The transmission latency between PS and PL for each software-generated packet is 30 ns on average. The flow table is manipulated by sending control signals from the agent to the data plane: it takes 120 ns for an insertion/modification control signal and 10 ns for a deletion signal to reach the data plane, and another 20 ns to complete the update in the PL. The table counters and flow entry counters are collected correctly.

VI. CONCLUSION

SDN has been proposed as a flexible solution for the next generation of the Internet, and heterogeneous SoCs are attractive platforms for hardware-software co-designs such as an OpenFlow switch. In this paper, we shared our efforts and experience in developing an OpenFlow switch architecture targeted at the Xilinx Zynq SoC, a platform that efficiently integrates ARM processors and FPGA fabric. We evaluated our design based on comprehensive experimental results. The post-place-and-route results showed that our design could achieve a throughput of 88 Gbps for 1K wildcard flow entries, and the on-board deployment demonstrated the correct operation of the OpenFlow switch. The performance of the architecture scaled well with larger input widths and larger flow tables. Moreover, our proposed architecture is flexible enough to be employed for different versions of OpenFlow or for protocol-independent switches. In the future, when the SoC families support higher aggregate input/output rates, our OpenFlow switch can be scaled up to support networks beyond 100 Gbps.

ACKNOWLEDGMENT

This work is supported by the U.S. National Science Foundation under Grant No. 1018801. An equipment grant and toolset from Xilinx, Inc. are gratefully acknowledged.

REFERENCES

[1] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling Innovation in Campus Networks," SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, 2008, pp. 69-74.

[2] "OpenFlow Switch Specification V1.0.0," https://www.opennetworking.org/images/stories/downloads/sdn-resources/onf-specifications/openflow/openflow-spec-v1.0.0.pdf

[3] O. E. Ferkouss, I. Snaiki, O. Mounaouar, H. Dahmouni, R. Ben Ali, Y. Lemieux and C. Omar, "A 100Gig network processor platform for OpenFlow," in Proc. of Network and Service Management (CNSM), 2011, pp. 1-4.

[4] Z. Kai, H. Che, Z. Wang and B. Liu, "TCAM-based distributed parallel packet classification algorithm with range-matching solution," IEEE Transactions on Computers, vol. 55, no. 8, August 2006.

[5] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary, "Algorithms for Advanced Packet Classification with Ternary CAMs," in Proc. of SIGCOMM, 2005, pp. 193-204.

[6] X. Zhu, R. Ge, J. Sun and C. He, "3E: Energy-Efficient Elastic Scheduling for Independent Tasks in Heterogeneous Computing Systems," Journal of Systems and Software, vol. 86, no. 2, 2013, pp. 302-314.

[7] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield, "Efficient Embedded Computing," Computer, vol. 41, no. 7, July 2008, pp. 27-32.

[8] G. Zhou, Y. Wu, T. Yan, T. He, C. Huang, J. A. Stankovic, and T. F. Abdelzaher, "A Multi-Frequency MAC Specially Designed for Wireless Sensor Network Applications," ACM Transactions on Embedded Computing Systems, 2010.

[9] D. E. Taylor, "Survey and taxonomy of packet classification techniques," ACM Comput. Surv., vol. 37, no. 3, 2005, pp. 238-275.

[10] W. Jiang, "Scalable Ternary Content Addressable Memory Implementation Using FPGAs," in Proc. ANCS, 2013, pp. 71-82.

[11] Y. Qu, S. Zhou, and V. K. Prasanna, "High-performance Architecture for Dynamically Updatable Packet Classification on FPGA," in Proc. ANCS, 2013, pp. 125-136.

[12] S. Yi, B. K. Kim, J. Oh, J. Jang, G. Kesidis, and C. R. Das, "Memory-efficient Content Filtering Hardware for High-speed Intrusion Detection Systems," in Proc. of the 2007 ACM Symposium on Applied Computing (SAC), 2007, pp. 264-269.

[13] M. Attig and G. J. Brebner, "400 Gb/s programmable packet parsing on a single FPGA," in Proc. ANCS, 2011, pp. 12-23.

[14] "Virtex-7 FPGA Family," http://www.xilinx.com/products/silicon-devices/fpga/virtex-7/

[15] X. Cheng, "Heterogeneous Multi-processor SoC: An Emerging Paradigm of Embedded System Design and Its Challenges," Embedded Software and Systems, 2005, pp. 3.

[16] J. Naous, D. Erickson, G. A. Covington, G. Appenzeller, and N. McKeown, "Implementing an OpenFlow switch on the NetFPGA platform," in Proc. ANCS, 2008, pp. 1-9.

[17] G. Antichi, A. Di Pietro, S. Giordano, G. Procissi, and D. Ficara, "Design and Development of an OpenFlow Compliant Smart Gigabit Switch," in Global Telecommunications Conference (GLOBECOM), Dec. 2011, pp. 1-5.

[18] A. Khan and N. Dave, "Enabling Hardware Exploration in Software-Defined Networking: A Flexible, Portable OpenFlow Switch," in Field-Programmable Custom Computing Machines (FCCM), 2013, pp. 145-148.

[19] G. Gibb, J. Lockwood, J. Naous, P. Hartke and N. McKeown, "NetFPGA - An Open Platform for Teaching How to Build Gigabit-rate Network Switches and Routers," IEEE Transactions on Education, 51(3), 2008, pp. 364-369.

[20] "Xilinx Wiki," http://www.wiki.xilinx.com/

[21] "AMBA AXI4 Interface Protocol," http://www.xilinx.com/ipcenter/axi4.htm

[22] "Zynq All Programmable SoC Technical Reference Manual," http://www.xilinx.com/support/documentation/userguides/ug585-Zynq-7000-TRM.pdf

[23] G. Brebner and W. Jiang, "High-Speed Packet Processing using Reconfigurable Computing," IEEE Micro, 2014, pp. 8-18.

[24] P. Bosshart, D. Daly, M. Izzard, N. McKeown, J. Rexford, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, "Programming protocol-independent packet processors," December 2013. http://arxiv.org/abs/1312.1719

[25] O. Rottenstreich, R. Cohen, D. Raz and I. Keslassy, "Exact Worst-Case TCAM Rule Expansion," IEEE Transactions on Computers, 62(6), 2013, pp. 1127-1140.