
Understanding the Packet Forwarding Capability of General-Purpose Processors

Katerina Argyraki, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy

EPFL, Intel Research

Abstract: Compared to traditional high-end network equipment built on specialized hardware, software routers running on commodity servers offer significant advantages: lower costs due to large-volume manufacturing, a widespread supply/support chain, and, most importantly, programmability and extensibility. The challenge is scaling software-router performance to carrier-level speeds. As a first step, in this paper, we study the packet-processing capability of modern commodity servers; we identify the packet-processing bottlenecks, examine to what extent these can be alleviated through upcoming technology advances, and discuss what further changes are needed to take software routers beyond the small enterprise.

    1 Introduction

    To what extent are general-purpose processors capable of

    high-speed packet processing? The answer to this ques-

    tion could have significant implications for how future

    network infrastructure is built. To date, the development

    of network equipment (switches, routers, various middle-boxes) has focussed primarily on achieving high perfor-

    mance for relatively limited forms of packet processing.

    However, as networks take on increasingly sophisticated

    functionality (e.g., data loss protection, application ac-

    celeration, intrusion detection), and as major ISPs com-

    pete in offering new services (e.g., video, mobility sup-

    port services), there is an increasing need for network

    equipment that is programmable and extensible. And in-

    deed, both industry and research have already taken ini-

    tial steps to tackle the issue [4, 6, 7, 9, 21].

    In current networking equipment, high performance

and programmability are competing goals, if not mutually exclusive. On the one hand, we have high-end switches and routers that rely on specialized hardware

    and software and offer high performance, but are no-

    toriously difficult to extend, program, or otherwise ex-

    periment with. On the other hand, we have software

    routers, where all significant packet-processing steps are

    performed in software running on commodity PC/server

    platforms; these are, of course, easily programmable, but

    only suitable for low-packet-rate environments such as

    small enterprises [6].

    The challenge of building network infrastructure that

    is both programmable and capable of high performance

    can be approached from one of two extreme starting

    points. One approach would be to start with existing

    high-end, specialized devices and retro-fit programma-

    bility into them. For example, some router vendors have

    announced plans to support limited APIs that will allow

    third-party developers to change/extend the software part

    of their products (which does not typically involve core

    packet processing) [7,9]. A larger degree of programma-

bility is possible with network-processor chips, which offer a semi-specialized option, i.e., implement only

    the most expensive packet-processing operations in spe-

    cialized hardware and run the rest on programmable pro-

    cessors. While certainly an improvement we note that,

    in practice, network processors have proven hard to pro-

    gram: in the best case, the programmer needs to learn a

    new language; in the worst, she must be aware of (and

    program to avoid) low-level issues like resource con-

    tention during parallel execution or expensive memory

    accesses [14, 16].

    From the opposite end of the spectrum, a different ap-

    proach would be to start with software routers and op-

    timize their packet-processing performance. The allure

    of this approach is that it would allow network infras-

    tructure to tap into the many desirable properties of the

    PC-based ecosystem, including lower costs due to large-

    volume manufacturing, rapid advances in power manage-

    ment, familiar programming environment and operating

    systems, and a widespread supply/support chain. In other

    words, if feasible, this approach could enable a network

    infrastructure that is programmable in much the same

    way as end-systems are today. The challenge is taking

    this approach beyond the small enterprise, i.e., scaling

    PC/server packet-processing performance to carrier-level

speeds.

It is perhaps too early to tell which approach dom-

inates; in fact, it's more likely that each approach re-

    sults in different tradeoffs between programmability and

    performance, and these tradeoffs will cause each to be

    adopted where appropriate. As yet, however, there has

    been little research exposing what tradeoffs are achiev-

    able. As a first step in this direction, in this paper, we

    explore the performance limitations for packet process-

    ing on commodity servers.


    A legitimate question at this point is whether the per-

    formance requirements for network equipment are just

too high and our exploration is a fool's errand. The bar

    is indeed high: in terms of individual link/port speeds,

    10Gbps is already widespread and 40Gbps is being de-

    ployed at major ISPs; in terms of aggregate switching

speeds, carrier-grade routers range from 40Gbps to a high of 92Tbps! Two developments, however, lend us

    hope. The first is a recent research proposal [11] that

    presents a solution whereby a cluster of N servers can

    be interconnected to achieve aggregate switching speeds

    of NR bps, provided each server can process packets at

    a rate on the order of R bps. This result implies that, in

    order to scale software routers, it is sufficient to scale a

    single server to individual line speeds (10-40Gbps) rather

    than aggregate speeds (40Gbps-92Tbps). This reduction

    makes for a much more plausible target.

    Secondly, we expect that the current trajectory in

    server technology trends will work in favor of packet-

processing workloads. For example, packet processing appears naturally suited to exploiting the tremendous

    computational power that multicore processors offer par-

    allel applications. Similarly, I/O bandwidth has gained

    tremendously by the transition from PCI-X to PCIe al-

    lowing 10Gbps Ethernet NICs to enter the PC market [1].

    And finally, as we discuss in Section 4, the impending ar-

    rival of multiprocessor architectures with multiple inde-

    pendent memory controllers should offer a similar boost

    in available memory bandwidth.

    While there is widespread awareness of these ad-

    vances in server technology, we find little comprehensive

    evaluation of how these advances can/do translate into

    performance improvements for packet-processing work-

    loads. Hence, in this paper, we undertake a measurement

    study aimed at exploring these issues. Specifically, we

    focus on the following questions:

    what are the packet-processing bottlenecks in mod-

    ern general-purpose platforms;

    what (hardware or software) architectural changes

    can help remove these bottlenecks;

    do the current technology trends for general-

    purpose platforms favor packet processing?

    As we shall see, answering these seemingly straight-

forward questions requires a surprising amount of

    sleuthing. Modern processors and operating systems are

    both beasts of great complexity. And while current hard-

    ware/software offer extensive hooks for measurement

and system profiling, these can be equally overwhelm-

    ing. For example, current x86 processors have over 400

    performance counters that can be programmed for de-

    tailed tracing of everything from branch mispredictions

to I/O data transactions. It's thus easy (as we discovered)

    to sink in a morass of performance monitoring data. Part

    of our contribution is thus a methodology by which to go

    about such an evaluation. Our study adopts a top-down

    approach in which we start with black-box testing and

    then recursively identify and drill down into only those

aspects of the overall system that merit further scrutiny.

Finally, it is important to note that even though our

    study stemmed from an interest in programmable net-

    work infrastructure, our findings are relevant to more

    than just the network context. Packet processing is just

    one instance of a more general class of stream based ap-

    plications (such as real time video delivery, stock trading,

    etc.). Our findings apply equally to these too.

    The remainder of this paper is organized as follows.

    We start the paper in Section 2 with some high-level anal-

    ysis estimating upper bounds on the packet processing

    performance for different server architectures. Section 3

    follows this with a measurement study aimed at iden-

tifying the bottlenecks and overheads on these servers. We present the inferences from our measurement study

    in Section 4 and discuss potential improvements in Sec-

    tion 5. We discuss related work in Section 6 and finally

    conclude.

    2 Optimistic Back-of-the-Envelope Analysis

    Before delving into experimentation, we would like to

    calibrate our expectations. We thus start with a simple

    thought experiment aimed at estimating absolute upper

bounds on the packet forwarding performance of both existing and next-generation servers. Since our goal is

    quick calibration, our reasoning here is deliberately both

    coarse-grained and optimistic; the experimental results

    that follow will show where reality lies.

    Figures 1 and 2 present a high-level view of two server

    architectures: Fig.1 depicts a traditional shared-bus ar-

    chitecture used in current x86 servers [3], while Fig.2

    represents a point-to-point architecture as will be sup-

    ported by the next-generation of x86 servers [8].

    In the shared-bus architecture, communication be-

    tween the CPUs, memory, and I/O is routed through the

    chipset that includes the memory and I/O bus con-

trollers. There are three main system buses in this architecture. The front side bus (FSB) is used for communi-

    cation both between different CPUs and between a CPU1

    and the chipset. The PCIe bus connects I/O devices, in-

    cluding network interfaces, to the chipset via one or more

    high-speed serial channels known as lanes and, finally,

    the memory bus connects the memory and chipset.

    1In this paper we will use the terms CPU, socket and processor in-

    terchangeably to refer to a multi-core processor.


Figure 1: Traditional shared bus architecture.

Figure 2: Point-to-point architecture.

    The point-to-point server (Fig.2) represents two sig-

    nificant architectural changes relative to the above: first,

    the FSB is replaced by a mesh of dedicated point-to-point

    links thus removing a potential bottleneck for inter-CPU

    communication. Second, the point-to-point architecture

    replaces the single external memory controller shared

    across CPUs with a memory controller integrated within

each CPU; this leads to a dramatic increase in aggregate memory bandwidth, since each CPU now has a dedicated

    link to a portion of the overall memory space. Servers

based on such point-to-point architectures and with up to

    32 cores (4 sockets and 8 cores/socket) are due to emerge

    in the near future [10].

To estimate a server's packet-forwarding capability,

    we consider the following minimal set of operations typ-

    ically required to forward an incoming packet and the

    corresponding load they impose on each of the primary-

    system components:

    1. The incoming packet is DMA-ed from the network

card (NIC) to main memory (incurring one transaction on the PCIe and memory bus).

    2. The CPU reads the packet header (one transaction

    on the FSB and memory bus).

    3. The CPU performs any necessary packet processing

    (CPU-only, assuming no bus transactions).

    4. The CPU writes the modified packet header to

    memory (one transaction on the memory bus and

    FSB).

5. The packet is DMA-ed from memory to NIC (one transaction on the memory and PCIe bus).

    Figures 1 and 2 also show the manner in which each of

    these operations maps onto the various system buses for

    the architecture in question. As we see, for the shared-

    bus architecture, a single packet results in 4 transactions

    on the memory bus and 2 on each of the FSB and PCIe

    buses; thus, a line rate of R bps leads to (roughly) a load

    of 4R, 2R, and 2R on each of the memory, FSB, and

    PCIe buses.2 Currently available technology advertises

    memory, FSB, and PCIe bandwidths of approximately

    100Gbps, 85Gbps, and 64Gbps respectively (assuming

    DDR2 SDRAM at 800MHz, a 64-bit wide 1.33GHz

    FSB, and 32-lane PCIe1.1); these numbers suggest that

    a current shared-bus architecture could sustain line rates

    up to R = 25 Gb/s.

    For the point-to-point architecture, each packet con-

    tributes 4 memory-bus transactions, 4 transactions on the

    inter-socket point-to-point links, and 2 PCIe transactions;

    since we have 4 memory buses, 6 inter-socket links and

    4 PCIe links, assuming uniform load distribution across

the system, a line rate of R bps yields loads of R, 2R/3, and R/2 on each of the memory, inter-socket, and PCIe buses respectively. If we (conservatively) assume simi-

    lar technology constants as before (memory, inter-socket,

    and PCIe bandwidths at 100Gbps, 85Gbps, and 64Gbps

    respectively) this suggests a point-to-point architecture

    could scale to line rates of 40Gb/s and even higher.
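To make this back-of-the-envelope arithmetic concrete, the following minimal C sketch (our illustration, not code from the paper) computes the sustainable line rate as the minimum over buses of bandwidth divided by per-packet load factor, using the nominal bandwidths and transaction counts quoted above; it reproduces the 25Gb/s shared-bus bound and shows the point-to-point bound comfortably clearing 40Gb/s.

/* Back-of-envelope bound: R <= min_i(bandwidth_i / load_factor_i), where
 * load_factor_i is how many multiples of R each bus carries per packet.
 * Bandwidths are the nominal figures quoted in the text. */
#include <stdio.h>

struct bus { const char *name; double bandwidth_gbps; double load_factor; };

static double max_line_rate(const struct bus *buses, int n) {
    double r = 1e12;
    for (int i = 0; i < n; i++) {
        double bound = buses[i].bandwidth_gbps / buses[i].load_factor;
        if (bound < r) r = bound;
    }
    return r;
}

int main(void) {
    /* Shared-bus: 4R on memory, 2R on FSB, 2R on PCIe. */
    struct bus shared[] = {
        { "memory", 100.0, 4.0 }, { "FSB", 85.0, 2.0 }, { "PCIe", 64.0, 2.0 },
    };
    /* Point-to-point: R on memory, 2R/3 on inter-socket links, R/2 on PCIe. */
    struct bus p2p[] = {
        { "memory", 100.0, 1.0 }, { "inter-socket", 85.0, 2.0 / 3.0 }, { "PCIe", 64.0, 0.5 },
    };
    printf("shared-bus bound: %.1f Gb/s\n", max_line_rate(shared, 3));   /* 25.0 */
    printf("point-to-point bound: %.1f Gb/s\n", max_line_rate(p2p, 3));  /* 100.0, i.e. > 40 */
    return 0;
}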

In terms of CPU resources: If we assume min-sized packets of 40 bytes, then the packet interarrival time is
32ns for speeds of R = 10Gb/s and 8ns for R = 40Gb/s. For the shared-bus server with 8 CPUs, each with a speed

    of 3 GHz (available today), this implies a budget of

    3072 and 768 cycles/pkt for line rates of 10Gbps and

    40Gbps respectively. Assuming a cycles-per-instruction

    (CPI) ratio of 1, this suggests a budget of 3072 (768) in-

    structions per packet for line rates of 10Gb/s (40Gb/s).

    With 32 cores at similar speeds, the point-to-point server

    would see a budget of 12288 and 3072 instructions/pkt

    for 10Gb/s and 40Gb/s respectively.

    In summary, based on the above, one might conclude

that current shared-bus architectures may scale to 25Gb/s but not 40Gb/s, while emerging servers may scale even

    to 40Gb/s.

    2This estimate assumes that the entire packet (rather than the

    header) is read to/from the memory and CPU for packet processing.

    A more accurate estimate would account for packet header sizes (or

    cache line sizes if smaller than header lengths). We ignore this here

    since our tests in the following section consider only min-sized packets

of 64 bytes, equal to a cache line length, because of which the inaccuracy is of little relevance.


    3 Measurement-based Analysis

    We now turn to experimentation. We first de-

    scribe our experimental setup and then present the

    packet-forwarding rates achieved by unmodified soft-

    ware/hardware.

    Experimental Setup For our experiments, we use a

    mid-level server machine running SMP Click [18]. Our

server is a dual-socket machine with 1.6GHz quad-core CPUs, an

    L2 cache of 4MB, two 1.066GHz FSBs (one to each

    socket) and 8 GBytes of DDR2-667 SDRAM. With the

    exception of the CPU speeds, these ratings are similar to

    the shared-bus architecture from Figure 1 and, hence, our

    results should be comparable. The machine has a total of

    16 1GigE NICs. To source/sink traffic, we use two addi-

    tional servers each of which is connected to 8 of the 16

    NICs on our test machine. We generate (and terminate)

    traffic using similar servers with 8 GigE NICs. We instru-

    ment our servers with Intel EMON, a performance mon-

    itoring tool similar to Intel VTune, as well as a chipset-

    specific tool that allows us to monitor memory-bus us-

    age.3

    The forwarding rates achieved will depend on the na-

    ture of the traffic workload. To a first approximation,

    this workload can be characterized by: (1) the incom-

    ing packet arrival rate r, measured in packets/sec, (2) the

    size of packets, P, measured in bytes (hence the incom-

ing rate R = rP) and (3) the processing per packet. We focus on evaluating the fundamental capability of the

    system to move packets through and, hence, start by con-

    sidering only the first two factors (packet rate and size)

    without considering any sophisticated packet processing.

    Hence, we remove the IP routing components from our

    Click configuration and only implement simple forward-

    ing that enforces a route between source and destination

    NICs; i.e., packets arriving on NIC #0 are sent to NIC

    #1, NIC #2 to NIC #3 and so on. We have 16 NICs and,

    hence, use 8 kernel threads, each pinned to one core and

    each in charge of one input/output NIC pair. In the re-

    sults that follow, where the input rate to the system is un-

    der 8Gbps, we use one of our traffic generation servers

    as the source and the other as sink; for tests that require

    higher traffic rates each server acts as both source and

    sink allowing us to generate input traffic up to 16Gbps.

    Measured performance We start by looking at the

    loss-free forwarding rate the server can sustain (i.e.,

without dropping packets) under increasing input packet rates and for various packet sizes.

3 Although our tools are proprietary, many of the measures they report are derived from public performance counters and, in these cases, our tests are reproducible. In an extended technical report, we will present in detail how our measures, when possible, can be derived from the public performance counters available on x86 processors.

Figure 3: Forwarding rate under increasing load for different packet sizes.

We plot this sustained

    rate in terms of both bits-per-second (bps) and packets-

    per-second (pps) in Figures 3 and 4 respectively. We see

    that, in the case of larger packet sizes (1024 bytes and

    higher), the server scales to 14.9 Gbps and can keep up

    with the offered load up to the maximum traffic we can

    generate given the number of slots on the server; i.e.,

packet forwarding isn't limited by any bottleneck inside

    the server. However, in the case of 64 byte packets, we

    see that performance saturates at around 3.4 Gbps, or 6.4

    million pps. As Figure 4 suggests, the server is troubled

by the high input packet rate (pps) rather than bit rate (bps). Note that the case of 64 byte packets is the worst-

    case traffic scenario. Though unlikely in reality, it covers

    an important role as it is considered the reference bench-

mark by network equipment manufacturers.

    Relative to the back-of-the-envelope estimates we ar-

    rived at in the previous section, we can conclude that,

    while our server approaches the estimated rates for larger

    packet sizes, for small packets, the achievable rates are

    well below our estimates. At a high-level, our reasoning

    could have been wildly off-target for two reasons: (1)

    in assuming that the nominal/advertised rates for each

system component (PCIe, memory, FSB) are attainable in practice and/or (2) in our estimates of the overhead

    per packet (4x, 2x, etc.). In what follows, we look into

    each of these possibilities. In Section 3.1 we attempt to

    track down the bottleneck(s) that limit(s) the forwarding

    rate for small packets and, in so doing, estimate attain-

    able performance limits for the different system compo-

    nents. In Section 3.2 we take a closer look at the per-

    packet overheads and attempt to deconstruct these into

    their component causes.

Figure 4: Forwarding rate under increasing load for different packet sizes in pps.

    3.1 Bottleneck Analysis

    We look for the bottleneck through a process of elimi-

    nation, starting with the four major system components

discussed earlier (the CPUs and the three system buses),

    and drilling deeper as and when it appears warranted.

    CPU The CPUs are plausible candidates, since CPU

    processing depends on the incoming packet rate, and

    performance saturates as soon as we reach a specific

    packet rate (the same for 64-byte and 128-byte packets,

    as shown in Figure 4). Note that the traditional metric

    of CPU utilization reveals little here, because Click op-

    erates in a pure polling mode, where the CPUs are al-

    ways 100% utilized. Instead, we look at the number of

empty polls, i.e., the number of times the CPU polls

    for packets to process but none are available in memory.

    Our measurements reveal that, even at the saturation rate

    (3.4Gbps for 64-byte packets), we still see a non-trivial

number of empty polls: approximately 62,000 per sec-

    ond for each core. Hence, we eliminate CPU processing

    as a candidate bottleneck.
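As an illustration of the empty-poll metric, here is a small sketch (ours, not Click code): in a pure polling design the CPU never idles, so headroom shows up as polls that find no packet waiting. The RX ring is simulated here by a random arrival process; in the real system the poll would inspect the NIC's descriptor ring in memory.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for "is a packet waiting in memory?"; here ~70% of polls succeed. */
static int poll_rx_ring(void) {
    return rand() % 100 < 70;
}

int main(void) {
    long long polls, empty_polls = 0;
    for (polls = 0; polls < 10000000LL; polls++) {
        if (poll_rx_ring()) {
            /* process_packet() would run here */
        } else {
            empty_polls++;   /* CPU had nothing to do on this poll */
        }
    }
    printf("empty polls: %lld of %lld (%.1f%%)\n",
           empty_polls, polls, 100.0 * empty_polls / polls);
    return 0;
}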

    System buses Our tools allow us to directly measure

    the load in bits/sec on the FSB and memory bus; the

    load difference between these two buses gives us an es-

timate for the PCIe load. Note that this is not always a good estimate, as FSB bandwidth can be consumed by

    inter-socket communication which does not figure on the

    memory bus; however, it does make sense in our par-

    ticular setup (with each input/output port pair consis-

    tently served by the same socket) which yields little inter-

socket communication.

Figure 5: Bus bandwidths for 64-byte packets.

Figure 6: Bus bandwidths for 1024-byte packets.

Figures 5 and 6 plot the load on each of the FSB, memory, and PCIe buses for 64-byte and 1024-byte packets under increasing input rates. We see that, for any particular line rate, the load on all three buses is always higher with 64-byte packets than with

    1024-byte ones. Hence, any of the buses could be the

    bottleneck, and we proceed to examine each one more

    closely.

    FSB Under the covers, the FSB consists of separate

    data and address buses, and our tools allow us to sep-

    arately measure the utilization of each. The results are

    shown in Figures 7 and 8: while it is clear that the

    data bus is under-utilized, it is not immediately obvious

whether this is the case for the address bus as well. To gauge the maximum attainable utilization on each bus, we

    wrote a simple benchmark program (we will refer to it as

    the stream benchmark from now on) that creates and

    writes to a very large array. This benchmark consumes

    50 Gbps of FSB bandwidth that translate into 37% data-

    bus utilization and 74% address-bus utilization. These

    numbers are well above the utilization levels from our

    packet-forwarding workload, which means that the latter

does not saturate the FSB. Hence, we conclude that the FSB is not the bottleneck.

Figure 7: FSB data and address bus utilization for 64-byte packets.

Figure 8: FSB data and address bus utilization for 1024-byte packets.
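For concreteness, here is a minimal sketch of a stream-style write benchmark of the kind described above (our reconstruction, not the authors' tool; the array size and element type are illustrative): it allocates a large array and writes it end to end, forever, so the FSB and memory bus see a steady stream of sequential stores whose utilization can be read off with the monitoring tools.

#include <stdint.h>
#include <stdlib.h>

#define ARRAY_BYTES (1UL << 30)   /* 1 GB: large enough to defeat the caches */

int main(void) {
    size_t n = ARRAY_BYTES / sizeof(uint64_t);
    volatile uint64_t *a = malloc(ARRAY_BYTES);
    if (!a) return 1;
    for (uint64_t iter = 0;; iter++)      /* run until killed; bus utilization */
        for (size_t i = 0; i < n; i++)    /* is measured externally (EMON)     */
            a[i] = iter;
    return 0;                             /* not reached */
}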

    PCIe Unlike the FSB where all operations are in fixed-

size units (64 bytes, a cache line), the PCIe bus sup-

    ports variable-length transfers; hence, if the PCIe bus is

    the bottleneck, this could be due either to the incoming

    bit rate or to the requested operation rate (which depends

    on the incoming packet rate). To test the former, we sim-

ply look at the maximum bit rate that the PCIe bus has sustained; from Figures 5 and 6, we see that the maxi-

    mum PCIe load for 1024-byte packets exceeds the PCIe

    load recorded at saturation for 64-byte packets. Hence,

    the PCIe bit rate is not the problem.

Figure 9: Forwarding rate (top) and PCIe load (bottom) for 64-byte packets as a function of the number of cores.

To test whether the PCIe operation rate is the bottleneck, we measure the maximum packet rate that can be sustained by each individual PCIe lane. Our rationale is the following: given that PCIe lanes are independent from each other, if we can successfully drive the packet rate on a single lane beyond the per-lane rate recorded at saturation, then we will have shown that our packet-forwarding workload does not saturate the PCIe lanes

    and, hence, packet rate is not the problem. To this end, we

    start with a single pair of input/output ports pinned to a

    single core and gradually add ports and cores. The results

    are shown in Figure 9, where we plot both the sustained

    forwarding rate and the PCIe load: we already know that

    for 64-byte packets, at saturation, each input/output port

    pair sustains approximately 0.4Gbps (Figure 3); from

    Figure 9, we see that each individual port pair (and,

    hence, the corresponding PCIe lanes) can go well beyond

    that rate (approx. 0.75Gbps). Hence, we conclude that

    the PCIe bus is not the bottleneck either.

    Memory This leaves us with the memory bus as the

    only potential culprit. To estimate the maximum attain-

    able memory bandwidth, we use the stream benchmark

    described above, which consumes 51Gbps of memory-

    bus bandwidth. This is about 35% higher than the

    33Gbps maximum consumed by our 64-byte packet-

    forwarding workload, surprisingly suggesting that aggre-

    gate memory bandwidth is not the bottleneck either.4

    This would seem to return us to square one. However,

    memory-system performance is notoriously sensitive to

    details like access patterns and load balancing; hence, we

look further into these details.

We consider two potential reasons why our packet-

forwarding workload might reduce memory-system efficiency relative to the stream benchmark.

4 Note that even 51Gbps is fairly low relative to the nominal rating of 100Gbps we used in estimating upper bounds. It turns out this limit is due to saturation of the address bus; recall that the address-bus utilization is 74% for the stream test; prior work [24] and discussions with architects reveal that an address bus is regarded as saturated at approximately 75% utilization. This is in keeping with the general perception that, in a shared-bus architecture, the vast majority of applications are bottlenecked on the FSB.

Figure 10: Memory load distribution across banks and ranks. Left: 64-byte packets. Middle: 1024-byte packets. Right: the stream benchmark.

The first is the

    fact that the sequence of memory locations accessed due

to our workload is highly irregular, as opposed to the

    nicely in-sequence access pattern of the stream bench-

    mark. To assess the impact of irregular accesses, we re-

    run the stream benchmark but, instead of writing to each

    array entry in sequence, we write to random locations.

    This modification does cause a drop in memory band-

    width, but the drop is modest (from 51Gbps to about

    46Gbps), indicating that irregular accesses are not the

    problem.

    The second reason is sub-optimal use of the physical

    memory space: The memory system is internally orga-

    nized as multiple memory channels (or branches) each

    of which is organized as a grid of ranks and banks. In

    particular, the 8GB memory on our machine consists of

two memory channels, each of which comprises a grid of 4 banks × 4 ranks; our tools report the memory traf-

    fic to different rank/bank pairs aggregated across mem-

    ory channels; i.e., the memory traffic we report for (say)

    a pair (bank 1, rank3) is the sum of the traffic seen on

    (bank 1, rank 3) for each of the two memory channels.

    Figure 10 shows the distribution of memory traffic over

    the various ranks and banks for three workloads: (1) 64-

    byte packets at the saturation rate of 3.4Gbps, (2) 1024-

    byte packets at 15.2Gbps, and (3) the stream benchmark.

    Notice that, while memory traffic is perfectly balanced

    for the stream benchmark (and reasonably balanced for

    the 1024-byte packet workload), for the 64-byte packet

    workload, it is all concentrated on two rank/bank ele-

    ments (in reality, we see one overloaded rank-bank pair

on each channel, since the figure shows the aggregate

    load over the two channels).

    This result suggests that the bottleneck is not the ag-

    gregate memory bandwidth, but the bandwidth to the in-

    dividual rank/bank elements that, for some reason, end

    up carrying most of the 64-byte packet workload. To ver-

    ify this, we measure the maximum attainable bandwidth

    to a single rank/bank pair; we do this through a sim-

    ple test that creates multiple threads, all of which con-

tinuously read and write a single location in memory.

    The result is 7.2Gbps of memory traffic (all on a sin-

    gle rank/bank pair), which is almost equal to the max-

    imum per-rank/bank load recorded at saturation for the

    64-byte packet workload. We should note that both the

    CPUs and the FSB are under-utilized during this mem-

    ory test. Hence, we conclude that the bottleneck is the

    memory system, not because it lacks the necessary ca-

    pacity, but because of the imbalance in accessed memory

    locations.
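A minimal sketch of such a single-location test (ours; this is presumably the "volatile-int" test referred to in Table 1, with the thread count an illustrative choice): several threads hammer one volatile word with reads and writes so that, unlike the stream benchmark, all traffic targets a single rank/bank pair. As before, the resulting bus load is measured externally.

#include <pthread.h>
#include <stdint.h>

#define NTHREADS 8

static volatile uint64_t shared_word;     /* the single contended location */

static void *hammer(void *arg) {
    (void)arg;
    for (;;) {                            /* run until killed */
        uint64_t v = shared_word;         /* read ...              */
        shared_word = v + 1;              /* ... and write it back */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, hammer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}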

    We now look into why this imbalance takes place.

    We see from Figure 10 that, for 1024-byte packets, the

    memory load is much better distributed than for 64-

    byte packets. This leads us to suspect that the imbal-

ance is related to the manner in which packets are laid out onto the rank/bank grid. We test this with an exper-

    iment where we maintain a fixed packet rate (400,000

    packets/sec) and measure the resulting memory-load dis-

    tribution for different packet sizes (64 to 1500 bytes).

    Figure 11 shows the outcome: Ignoring for the moment

    the load on {bank 2, rank 3}, we observe that, as the

    packet size increases, the additional memory load is dis-

    tributed over increasing numbers of rank/bank pairs and,

    within a single memory channel, this spilling over to

    additional ranks and banks happens at the granularity of

    64 bytes; for example, for 256-byte packets, we see in-

    creased load on 3 rank/bank pairs, for 512-byte packets

on 4 rank/bank pairs, and so forth. Moreover, we observe that this growth starts from the low-ordered banks, i.e.,

    bank 1 is loaded first, then bank 2 and so on.5

5 Regarding the high load on {bank 2, rank 3}: we suspect this is caused by the large number of empty polls that we see at the low packet rate for this test, and that the location corresponds to the memory-mapped I/O register being polled. We find the load on this rank/bank drops with increasing packet rates, further supporting this conjecture.

Figure 11: Memory load distribution across banks and ranks for different packet sizes and a fixed packet rate.

These observations lead us to the following theory: the default packet-buffer size in Linux is 2KB; each such buffer spans the entire rank/bank grid, which would al-

    low high memory throughput if we were using the entire

    2KB allocation. However, our 64-byte packet workload

    ends up using only one of the rank/bank pairs on each

    of the memory channels, leading to the two spikes we

    see in Figure 10. To test this theory, we repeat our ear-

    lier experiment with 64-byte packets from Figure 10, but

    now change the default buffer allocation size to 1KB.

    If our theory is right and a 2KB address space spans

    the entire grid, then 1KB should span half the grid and,

    hence, the two spikes in Figure 10 should split into 4

    spikes. Figure 12 shows that this is indeed the case. Un-

    fortunately (for some reason we do not fully understand

as yet), we have not been able to allocate yet smaller buffer sizes (e.g., 128B), due to the need for the device driver to accommodate additional data structures,

    and hence we do not experiment with even smaller al-

    locations. Nonetheless, our experiment with 1024-byte

buffers clearly shows the cause of (and a potential remedy for)
the problem of skewed memory load. As we discuss in

    Section 5, we believe this issue could be fixed in a gen-

    eral manner through the use of a modified memory allo-

    cator that allows for variable-size buffer allocations.

    Finally, if our conjecture that this imbalance was

the performance bottleneck was right, then reducing the
imbalance should translate to higher packet-forwarding
rates. Happily, using 1024B buffers we do see a 29.5% increase in forwarding rate from 3.4Gbps to 4.4Gbps;

    Figure 13 shows this improvement in terms of the packet

    rate (from 6.4 to 8.2 Mpps).

Figure 12: Memory load distribution across banks and ranks for 64B packets and two different sizes of packet buffers (original Click w/ 2048-byte buffers vs. modified Click w/ 1024-byte buffers).

Figure 13: Before-and-after forwarding rates for 64B packets and two different sizes of packet buffers.

system component     attainable limit (Gbps)      load w/ 64B router (Gbps)    percentage room-to-grow
1 rank-bank          7.2 (from volatile-int)      7.168                        0
FSB address bus      74 (from stream)             50                           48
aggregate memory     51 (from stream)             33                           54
PCIe                 36 (from 1KB pkt tests)      20                           80
FSB data bus         37 (from stream)             9                            311

Table 1: Room for growth on each of the system components, computed as the percentage increase in measured usage for the 64B packet forwarding workload that can be accommodated before we hit achievable performance limits as obtained from specially crafted benchmark tests.

Summary of bottleneck analysis The presented experiments showed what rates are achievable on each system component for hand-crafted workloads like our stream benchmark. We use these rates as re-calibrated

    upper bounds on the performance of each component

    and compare them to the corresponding rates measured

    for the 64-byte packet workload at saturation. To quan-

    tify our observations, we define, for each component,

    the room for growth as the percentage increase in

usage that could be accommodated on the component before we hit the upper bound. For example, for the

    stream benchmark, we measured 51Gbps of maximum

    aggregate memory bandwidth; for our 64-byte packet

    workload, at saturation, we measured 33Gbps of aggre-

    gate memory bandwidth; thus, ignoring other bottlenecks

    (such as the per rank/bank load), there is room to increase

memory-bus usage by about 54% ((51 - 33)/33) before hitting the 51Gbps upper bound. Table 1 summarizes our

    results. We see that, if we can eliminate the problem of

    poor memory allocation (we discuss potential solutions

    in Section 5), then there is room for a fairly substantial

improvement in the minimum-sized packet forwarding
rate: approximately 50%. The next section looks for additional sources of inefficiency, this time due to soft-
ware overheads.

    3.2 Overhead Analysis

    The previous section treated the system as a black box,

    measuring the load on each bus but making no attempt

    to justify it. We now try to deconstruct the measured

    load into its components, as a way to assess the packet-

    forwarding efficiency of our system.

    First, we adjust the back-of-the-envelope analysis of

Section 2 to our particular experimental setup and use it to estimate the expected load on each bus: In Section 2,

    we argued that an incoming traffic rate of R bps should

roughly lead to loads of 2R, 2R, and 4R on the FSB, PCIe,

    and memory bus respectively. These numbers were based

    on two assumptions: first, that bus loads are only due to

    moving packets around; second, that the CPU reads and

    updates each incoming packet, thus contributing to FSB

    and memory-bus load. The second assumption does not

    hold in our experiments, because we use static routing,

where the CPU does not even need to read packet headers to determine the output port.

Figure 14: Memory and FSB per-packet overhead.

Hence, with the optimistic

    reasoning of Section 2, in our experiments, an incoming

    traffic rate of R bps should roughly result in loads of 0,

    2R, and 2R on the FSB, PCIe, and memory bus, all of

    them due to moving packets from NIC to memory and

    back.

    Not surprisingly, these estimates are below the loads

    that we actually measure, indicating that, beyond moving

    packets around, all three buses incur an extra per-packet

    overhead. We quantify this overhead as the number of

    extra per-packet transactions (i.e., transactions that are

    not due to moving packets between NIC and memory)

    performed on each bus. We compute it as follows:

(measured load - estimated load) / (packet rate × transaction size)
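In code, this is a one-line computation; the sketch below (ours) uses purely illustrative inputs, chosen only to show the units involved (bits/s for loads, packets/s for the rate, bits per bus transaction).

#include <stdio.h>

/* Extra per-packet transactions = (measured - estimated load) / (pkt rate * txn size). */
static double extra_transactions_per_packet(double measured_load_bps,
                                            double estimated_load_bps,
                                            double packet_rate_pps,
                                            double transaction_size_bits) {
    return (measured_load_bps - estimated_load_bps) /
           (packet_rate_pps * transaction_size_bits);
}

int main(void) {
    /* Hypothetical memory-bus numbers: 24 Gbps measured vs. 6 Gbps expected
     * at 3 Mpps with 64-byte (512-bit) transactions. */
    printf("extra transactions per packet: %.1f\n",
           extra_transactions_per_packet(24e9, 6e9, 3e6, 512.0));   /* ~11.7 */
    return 0;
}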

    Figure 14 plots this number for the FSB and memory

    bus as a function of the packet rate and size; the PCIe

    overhead is simply the difference between the other two.

    So, the FSB and PCIe overheads start around 6, while

    the memory-bus overhead starts around 12; all overheads

    slightly drop as the packet rate increases.

    It turns out that these overheads make sense once we

    consider the transactions for book-keeping socket-buffer


    descriptors: For each packet transfer from NIC to mem-

    ory, there are three such transactions on each of the

    FSB and PCIe bus: the NIC updates the corresponding

    socket-buffer descriptor, as well as the descriptor ring

    (two PCIe and memory-bus transactions); the CPU reads

    the updated descriptor, writes a new (empty) descriptor

to memory, and updates the descriptor ring accordingly (three FSB and memory-bus transactions); finally, the

    NIC reads the new descriptor (one PCIe and memory-

    bus transaction). Each packet transfer from memory to

    NIC involves similar transactions and, hence, descriptor

    book-keeping accounts for the 6 extra per-packet transac-

tions we measure on the FSB and PCIe bus and, hence,

    the 12 extra transactions measured on the memory bus.

    The slight overhead drop as the packet rate increases is

    due to the cache that optimizes the transfer of multi-

    ple (up to four) descriptors with each 64-byte transac-

    tion (each descriptor is 16-bytes long); this optimization

    kicks in more often at higher packet rates.

We should note that these extra per-packet transactions translate into surprisingly high traffic overheads,

    especially for small packets: for 1024-byte packets, 12

    per-packet transactions on the memory bus translate into

    37.5% traffic overhead; for 64-byte packets, this num-

    ber becomes 600%. As we discuss in Section 5, these

    overheads can be reduced by amortizing descriptors over

    multiple packets whenever possible (similar techniques

    are already common in high-speed capture cards).

    4 Inferring Server Potential

We now apply our findings from the last two sections to answer the following questions:

    Given our analysis of current-server packet-

    forwarding performance, what can we improve and

    what levels of performance are attainable?

    What packet-forwarding performance should we

    expect from next-generation servers?

    We answer these through extrapolative analysis and leave

    validation to future work.

    4.1 Shared-bus Architectures

    In Section 3.1, we saw that the first packet-forwarding

    bottleneck arises from inefficient use of the memory

    system, in particular, the imbalanced layout of packets

    across memory ranks and banks. The question is, how

    much could we improve performance by fixing this im-

    balance?

    According to our overhead analysis (Section 3.2), per-

    packet overhead on the memory bus does not increase

    with packet rate; hence, if we eliminated the problem-

    atic packet layout, we could increase our forwarding rate

    until we hit the next bottleneck. According to our bottle-

    neck analysis (Section 3.1), that is the FSB address bus,

    and we could increase our forwarding rate by 50% be-

    fore hitting it. Hence, we argue that eliminating the prob-

lematic layout could increase our minimum-size-packet forwarding rate by 50%, i.e., from 3.4Gbps to approxi-

    mately 5.1Gbps.

    A second area for improvement, identified in Sec-

    tion 3.2, is the use of socket-buffer descriptors and the

    chatty manner in which these are maintained. We now

    estimate how much we could improve performance by

    simply amortizing descriptor transfer across multiple

    packet transfers.

    We start by considering the memory bus. From Sec-

    tion 3.2, we can approximate the load on the memory bus

as 2 × bit rate + 10 × packet rate × transaction size. Were we to transfer, say, 10 descriptors with a single transaction, that would immediately reduce memory-bus load to 2 × bit rate + packet rate × transaction size; for 64-byte packets and 64-byte transactions, this corresponds to a

    factor-of-4 reduction. Applying a similar line of reason-

    ing to the FSB and PCIe bus, we can show that, for 64-

    byte packets, descriptor amortization stands to reduce the

    load on each bus by factors of 10 and 2.5. Recall from Ta-

    ble 1 that we had 0%, 50% and 80% room for growth on

    each of the memory, FSB, and PCIe buses, and, hence,

    the load on each of these buses could grow by a factor

of 4 (4 × 1.0), 15 (10 × 1.5), and 4.5 (2.5 × 1.8) re-

    spectively. Since the maximum improvement that can be

    accommodated on all buses is by a factor of 4, we argue

    that reducing descriptor-related overheads could improve

    our minimum-size-packet forwarding rate from 3.4Gbps

    to 13.6Gbps.
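The following sketch (ours) simply replays this arithmetic: each bus can absorb its load-reduction factor times (1 + room-to-grow) more minimum-size packets, and the forwarding rate scales by the minimum over all buses.

#include <stdio.h>

struct bus { const char *name; double load_reduction; double room_to_grow; };

int main(void) {
    /* Load reductions from the descriptor-amortization argument above;
     * room-to-grow of 0%, 50%, 80% as used in the text (cf. Table 1). */
    struct bus buses[] = {
        { "memory", 4.0, 0.0 },
        { "FSB",    10.0, 0.5 },
        { "PCIe",   2.5, 0.8 },
    };
    double min_growth = 1e12;
    for (int i = 0; i < 3; i++) {
        double growth = buses[i].load_reduction * (1.0 + buses[i].room_to_grow);
        printf("%-6s can absorb %.1fx the current packet rate\n",
               buses[i].name, growth);                         /* 4.0, 15.0, 4.5 */
        if (growth < min_growth) min_growth = growth;
    }
    printf("forwarding rate: 3.4 Gbps -> %.1f Gbps\n", 3.4 * min_growth);  /* 13.6 */
    return 0;
}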

    Finally, combining both optimizations should allow us

    to climb still higher to approximately 20Gbps, though,

    of course, the limited number of network slots on our

    machine would limit us to 16Gbps.

    While the above is an extrapolation (albeit one de-

    rived from empirical observations), it nonetheless points

    to the tremendous untapped potential of current servers.

Even if our estimates are off by a factor of two, it still
seems possible that current servers can achieve forward-
ing rates of 10Gbps, a number currently considered the realm of specialized (and expensive) equipment. We

    close by noting that the suggested fixes involve only

    modest operating-system rearchitecting.

    4.2 Point-to-point Architectures

    We now apply the results of our measurement study to es-

    timate the packet forwarding rates that might be achiev-

    able with point-to-point (p2p) server architectures as in-


    troduced in Section 2 (Figure 2). At times, we make as-

    sumptions where the necessary details of p2p architec-

tures aren't yet known and, in such cases, we explicitly

    note our assumptions as such.

In a p2p architecture, the role of the FSB (that of
carrying traffic between sockets and between CPUs and
their non-local memory) is played by point-to-point links such as Intel's QuickPath [5]. In our analysis, we

    assume our findings from the FSB apply to these inter-

    socket buses. Specifically, we assume that the 50% room-

    to-grow that we measured on the FSB applies to these

    inter-socket buses. If anything, this seems like a wildly

    conservative assumption for two reasons: (1) the nomi-

    nal speed of these inter-socket links is 200Gbps [2], com-

    pared to 85Gbps for current FSBs and (2) the operations

    seen on the single FSB are now spread across six inter-

    socket links.

    We compute the expected performance for the p2p

    architecture by considering the different factors that

will offer a performance improvement relative to the shared-bus server we've studied so far. These factors are:

    (1) reduced per-bus overheads: This improvement

    results simply due to the transition from a shared-bus

    to a peer-to-peer architecture as discussed in Section 2.

    These overheads and the corresponding reduction are

    summarized in the first three columns of Table 2.6

    (2) room-to-grow: As before this records the capacity

    for growth on each bus. For this we use our findings

    from Section 3.1.

    (3) technology improvements: This accounts for the

    standard technology improvements expected in this next-

    generation of servers. We assume a 2x improvement on

    the FSB and PCIe buses by observing that (for exam-

    ple) the Intel QuickPath inter-socket links for use in the

Nehalem server line support speeds that are over 2x

    faster than current FSBs. Likewise, the PCIe-2.0 runs 2x

    faster than current PCIe-1.1 and the recently announced

    PCIe-3.0 is to run at 2x the speed of PCIe-2.0 [20] (our

    test server uses PCIe-1.1). We conservatively assume that

    memory technology will not improve.

    Table 2 summarizes these performance factors and

computes the combined performance improvement that we can expect on each system component. As we see,

the overall performance improvement is still limited by memory (both because we're assuming that the rank-bank imbalance problem remains and that memory technology improves more slowly). Despite this, we're left with a 4x improvement, suggesting that a next-generation p2p server running unmodified Linux+Click will scale to approximately 13.6Gbps. The additional use of the optimizations described above could further improve performance to potentially exceed 40Gbps.

6 Note that, while Section 3.2 revealed that the overheads we see in practice are far higher than those from our analysis, we're assuming that the relative reduction across architectures will still hold. This appears reasonable since this reduction is entirely due to the offered load being split across more system components: 6 vs. 1 inter-socket buses, 4 vs. 1 memory buses, and 4 vs. 1 PCIe buses.

Figure 15: Forwarding rates for shared-bus and p2p server architectures with and without different optimizations.

    Figure 15 summarizes the various forwarding rates for

    the different architectures and optimizations considered.

    In summary, current shared-bus servers scale to (min-

    sized) packet forwarding rates of 3.4Gbps and we esti-

mate future p2p servers will scale to 10Gbps. Moreover,

    our analysis suggests that modifications to eliminate key

    bottlenecks and overheads stand to improve these rates

    to over 10Gbps and 40Gbps respectively.

    5 Recommendations and Discussion

    5.1 Eliminating the Bottlenecks

    We believe the bottlenecks and overheads identified in

    the previous sections can be addressed through relatively

modest changes to operating systems and NIC firmware. Unfortunately, the need to modify NIC firmware makes it

    difficult to experiment with these changes. We describe

    these modifications at a high level and note that these

    modifications do not impact the programmability of the

    system.

system bus    shared-bus overheads (Section 2)    p2p overheads (Section 2)    gain from reduced overheads    room-to-grow (Table 1)    gain w/ tech trends    overall gain
memory        4R                                  R                            4x                             1.0x                      1.0x                   4x
FSB/CSI       2R                                  2R/3                         3x                             1.5x                      2x                     9x
PCIe          2R                                  R/2                          4x                             1.8x                      2x                     14.4x

Table 2: Computing the performance improvement with a p2p server architecture. R denotes the line rate in bits/second.

Improved memory allocators. Recall that our results in Section 3.1 suggest that the imbalance in memory accesses with regard to (skb) packet buffers in the kernel occurs because the kernel allocates all packet buffers to be a single size, with a default of 2KB. This problem can

    be addressed by simply creating packet buffers of various

    sizes (e.g., 64B, 256B, 1024B and 2048B) and allocating

    a packet to the buffer appropriate for its size. This can

    be implemented by simply creating multiple descriptor

    rings, one for each buffer size; on receiving an incom-

    ing packet, the NIC simply uses the descriptor ring ap-

    propriate to the size of the received packet. While more

wasteful of system memory, this isn't an issue since the

    memory requirements for a router workload are a small

    fraction of the available server memory. This approach is

    in fact inspired by similar approaches in hardware routers

    that pre-divide memory space into separate regions for

    use by packets of different sizes [13].
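A minimal sketch of the ring-selection logic (ours; the size classes and names are illustrative, not an actual driver or NIC API): the receive path picks the descriptor ring whose buffer class is the smallest that fits the incoming frame, so a 64-byte packet lands in a 64-byte buffer rather than a 2KB one.

#include <stddef.h>
#include <stdio.h>

static const size_t buf_class[] = { 64, 256, 1024, 2048 };   /* bytes */
#define NUM_CLASSES (sizeof(buf_class) / sizeof(buf_class[0]))

/* Index of the descriptor ring to use for a frame of len bytes, or -1 if it fits none. */
static int pick_ring(size_t len) {
    for (size_t i = 0; i < NUM_CLASSES; i++)
        if (len <= buf_class[i])
            return (int)i;
    return -1;
}

int main(void) {
    size_t samples[] = { 64, 100, 512, 1500 };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        int r = pick_ring(samples[i]);
        if (r >= 0)
            printf("frame of %4zu bytes -> ring %d (%zu-byte buffers)\n",
                   samples[i], r, buf_class[r]);
    }
    return 0;
}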

    The imbalance due to packet descriptors can be like-

    wise tackled by arranging for packet descriptors to con-

    sume a greater portion of the memory space by, for ex-

    ample, using larger descriptor rings and/or multiple de-

scriptor rings. Conveniently, however, the use of amor-
tized packet descriptors as described below would also
have the effect of greatly reducing the descriptor-related traffic to memory and hence implementing amortized de-

    scriptors might suffice to reduce this problem.

    Amortizing packet descriptors Section 3.2 reveals

    that handling packet descriptors imposes an inordinate

per-packet overhead, particularly for small packet sizes.

    As alluded to earlier, a simple strategy is to have a single

descriptor summarize multiple (up to a parameter k)

    packets. This amortization is similar to what is already

    implemented on capture cards designed for special-

    ized monitoring equipment. Such amortization is easily

accommodated for k smaller than the amount of packet-buffer memory already on the NIC. Since we imagine
that k can be a fairly small number (~10) and since cur-

    rent NICs already have buffer capacity for a fair num-

    ber of packets (e.g., our cards have room for 64 full-

    sized packets), such amortization should not increase the

storage requirements on NICs. Amortization can, how-
ever, impose increased delay. This can be controlled by
having a timeout that regulates the maximum time pe-
riod the NIC can wait to transfer packets. Setting this
timeout to a small multiple (e.g., 2k times) of the reception time for
small packets should keep the delay penalty acceptable.
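A rough sketch of the batching logic (ours; the NIC is simulated with stubs, and the batch size k and timeout are illustrative constants): received packets are accumulated and a single summary descriptor is written back either when k packets have arrived or when the oldest one has waited too long. A real implementation would also need a hardware timer to flush a partial batch when no further packets arrive.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define K_BATCH 10            /* packets summarized per descriptor   */
#define TIMEOUT_NS 1000       /* flush deadline (illustrative value) */

struct pkt_info { uint64_t addr; uint16_t len; };

struct batch {
    struct pkt_info pkts[K_BATCH];
    int count;
    uint64_t first_arrival_ns;
};

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Stub for the single write-back that replaces up to k per-packet descriptor updates. */
static void write_summary_descriptor(const struct pkt_info *pkts, int n) {
    printf("summary descriptor covering %d packets (first addr=0x%llx)\n",
           n, (unsigned long long)pkts[0].addr);
}

static void on_packet_received(struct batch *b, struct pkt_info p) {
    if (b->count == 0)
        b->first_arrival_ns = now_ns();
    b->pkts[b->count++] = p;
    bool full = (b->count == K_BATCH);
    bool timed_out = (now_ns() - b->first_arrival_ns >= TIMEOUT_NS);
    if (full || timed_out) {
        write_summary_descriptor(b->pkts, b->count);
        b->count = 0;
    }
}

int main(void) {
    struct batch b = { .count = 0 };
    for (int i = 0; i < 25; i++)          /* feed 25 fake 64-byte packets */
        on_packet_received(&b, (struct pkt_info){ .addr = 0x1000 + 64u * (unsigned)i, .len = 64 });
    return 0;
}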

    5.2 Discussion

    When we set out to study the forwarding performance

of commodity servers, we already expected the memory system to be the bottleneck; the fact, however, that

    the bottleneck was due to an unfortunate combination of

    packet layout and memory-chip organization came as a

    surprise. While trying to figure this out, we looked at how

    the kernel allocates memory for its structures; not sur-

    prisingly, it favors adjacent memory addresses to lever-

    age caching. However, given that the kernel uses physical

    addresses, nearby addresses often correspond to physi-

    cally nearby locations that fall on the same memory rank

    and bank. As a result, workloads that cannot benefit from

    caching may end up hitting the same memory rank/bank

    pairs and, hence, be unable to benefit from aggregate

memory bandwidth either. In short, when combined with an unfortunate data layout, locality can hurt rather than

    help.

    Another surprise was the lack of literature on the be-

    havior and performance of system components outside

    the CPUs. The increasing processor speeds and the rise

    of multi-processor systems mean that, from now on,

    processing data is less likely to be the bottleneck than

    moving it around between CPUs and other I/O devices.

    Hence, it is important to be able to measure and under-

    stand system performance beyond the CPUs.

    Finally, we were surprised by the lack of efficiency

in moving data between system components. In many cases, data is unnecessarily transferred to memory (con-

    tributing to memory-bus load) when it could be directly

    transferred from the NIC to the appropriate CPU cache.

    Packet forwarding and processing workloads would ben-

    efit significantly from techniques along the lines of Di-

    rect Cache Access (DCA), where the memory controller

    directly places incoming packets into the right CPU

    cache by snooping the DMA transfer from NIC to mem-

    ory [17].
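
    As a purely conceptual model of what DCA changes (and emphatically not hardware or driver code), the toy sketch below represents DRAM and a CPU's last-level cache as two byte arrays: every inbound DMA write lands in "memory", and when DCA is enabled it is additionally injected into the "cache", so the subsequent read by the forwarding core would not need to go to DRAM.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static uint8_t dram[1 << 16];  /* stand-in for main memory            */
        static uint8_t llc[1 << 12];   /* stand-in for the target CPU's cache */

        /* Illustrative model of an inbound DMA write handled by a DCA-capable
         * memory controller: the data always reaches memory; with DCA it is
         * also pushed into the cache of the CPU that will process the packet. */
        static void inbound_dma_write(uint32_t addr, const void *data, size_t len, int dca_on)
        {
            memcpy(&dram[addr], data, len);
            if (dca_on)
                memcpy(&llc[addr & 0xFFF], data, len);
        }

        int main(void)
        {
            const char pkt[] = "example packet";
            inbound_dma_write(0x100, pkt, sizeof pkt, 1 /* DCA enabled */);
            printf("cache now holds: %s\n", (const char *)&llc[0x100]);
            return 0;
        }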


    6 Related and Future Work

    The idea of a software router based on general-purpose hardware and operating systems is not a new one. In fact, the 13 NSFNET NSS (nodal switching subsystems) included 9 systems running Berkeley UNIX interconnected with a 4Mb/s IBM token ring. Click [18] and Scout [22] explored the question of how to architect router software for improved programmability and extensibility; SMP Click [12] extends the early Click architecture to better exploit the performance potential of multiprocessor PCs. These efforts focused primarily on designing the software architecture for packet processing and, while they do report on the performance of their systems, they do so at a fairly high level, using purely black-box evaluation. By contrast, our work assumes Click's software architecture but delves under the covers (of both hardware and software) to understand why performance is limited and how these limitations carry over to future server architectures.

    As a slight digression: it is somewhat interesting to note the role of time in the performance of these (fairly similar) software routers. The early NSF nodes achieved forwarding rates of 1K packets/sec (circa 1986); Click (at SOSP'99) reported a maximum forwarding rate of 330Kpps, which SMP Click improves to 494Kpps (2001); we find that unmodified Click achieves about 6.5Mpps. This is of course somewhat anecdotal, since we are not necessarily comparing the same Click configurations, but it nonetheless suggests the general trajectory.

    There is an extensive body of work on benchmarking various application workloads on general-purpose processors. The vast majority of this work is in the context of computation-centric workloads and benchmarks such as TPC-C. Closer to our interest in packet processing are efforts similar to those of Veal et al. [24] that look for the bottlenecks in server-like workloads involving a fair load of TCP termination. Their analysis reveals that such workloads are bottlenecked on the FSB address bus. A similar conclusion has been reached for several more traditional workloads. (We refer the reader to [24] for additional references to the literature on such evaluations.) As our results indicate, the bottleneck to packet processing lies elsewhere.

    There is similarly a large body of work on packet processing using specialized hardware (e.g., see [19] and the references therein). Most recently, Turner et al. describe a Supercharged PlanetLab Platform [23] for high-performance overlays that combines general-purpose servers with network processors (for slow- and fast-path processing, respectively); they achieve forwarding rates of up to 5Gbps for 130B packets. We focus instead on general-purpose processors, and our results suggest that these offer competitive performance.

    Closest to our work is a recent independent effort by Egi et al. [15]. Motivated by the goal of building high-performance virtualized routers on commodity hardware, the authors undertake a measurement study to understand the performance limitations of modern PCs. They observe similar performance and, like us, arrive at the conclusion that something is amiss in the memory system. Through inference based on black-box testing, the authors suggest that non-contiguous memory writes initiated by the PCIe controller are the likely culprit. Our access to chipset tools allows us to probe the internals of the memory system, and our findings there lead us to a somewhat different conclusion.

    Finally, our work also builds on a recent position paper making the case for cluster-based software routers [11]; that paper identifies the need to scale servers to line rate but does not explore the issue of bottlenecks and performance in any detail.

    In terms of future work, we plan to extend our work along three main directions. First, we are exploring the possibility of implementing the modified descriptor and buffer-allocator schemes described above. Second, we hope to repeat our analysis on the Nehalem server platforms once they become available [8]. Finally, we are currently working to build a cluster-based router prototype as described in earlier work [11] and hope to leverage our findings here to both evaluate and improve that prototype.

    7 Conclusion

    A long-held and widespread perception has been that general-purpose processors are incapable of high-speed packet forwarding, motivating an entire industry around the development of specialized (and often expensive) network equipment. Likewise, the barrier to scalability has been variously attributed to limitations in I/O, memory throughput, and other factors. While these notions might each have been true at various points in time, modern PC technology evolves rapidly, and hence it is important that we calibrate our perceptions against the current state of technology. In this paper, we revisit old questions about the scalability of in-software packet processing in the context of current and emerging off-the-shelf server technology. Another, perhaps more important, contribution of our work is to offer concrete data on questions that have often been answered through anecdotal or indirect experience.

    Our results suggest that, particularly with a little care, modern server platforms do in fact hold the potential to scale to the high rates typically associated with specialized network equipment, and that emerging technology trends (multicore, NUMA-like memory architectures, etc.) should only further improve this scalability. We hope that our results, taken together with the growing need for more flexible network infrastructure, will


    spur further exploration into the role of commodity PC

    hardware/software in building future networks.

    References

    [1] Intel 10 Gigabit XF SR Server Adapters. http://www.intel.com/network/connectivity/products/10gbexfsrserveradapter.htm.
    [2] Intel QuickPath Interconnect. http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect.
    [3] Intel Xeon Processor 5000 Sequence. http://www.intel.com/products/processor/xeon5000.
    [4] NetFPGA. http://yuba.stanford.edu/NetFPGA/.
    [5] Next-Generation Intel Microarchitecture. http://www.intel.com/technology/architecture-silicon/next-gen.
    [6] Vyatta: Open Source Networking. http://www.vyatta.com/products/.
    [7] Cisco Opening Up IOS. http://www.networkworld.com/news/2007/121207-cisco-ios.html, Dec. 2007.
    [8] Intel Demonstrates Industry's First 32nm Chip and Next-Generation Nehalem Microprocessor Architecture. Intel News Release, Sept. 2007. http://www.intel.com/pressroom/archive/releases/20070918corp_a.htm.
    [9] Juniper Open IP Solution Development Program. http://www.juniper.net/company/presscenter/pr/2007/pr-071210.html, 2007.
    [10] Intel Corporation's Multicore Architecture Briefing, Mar. 2008. http://www.intel.com/pressroom/archive/releases/20080317fact.htm.
    [11] K. Argyraki et al. Can software routers scale? In ACM SIGCOMM Workshop on Programmable Routers for Extensible Services, Aug. 2008.
    [12] B. Chen and R. Morris. Flexible control of parallelism in a multiprocessor PC router. In Proc. of the USENIX Technical Conference, June 2001.
    [13] Cisco Systems, Inc. Introduction to Cisco IOS Software. http://www.ciscopress.com/articles/.
    [14] D. Comer. Network Processors. http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_7-4/network_processors.html.
    [15] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, and L. Mathy. Towards performant virtual routers on commodity hardware. Technical Report Research Note RN/08/XX, University College London, Lancaster University, May 2008.
    [16] R. Ennals, R. Sharp, and A. Mycroft. Task partitioning for multi-core network processors. In Proc. of the International Conference on Compiler Construction, 2005.
    [17] R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for High Bandwidth Network I/O. In Proc. of ISCA, 2005.
    [18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, Aug. 2000.
    [19] J. Mudigonda, H. Vin, and S. W. Keckler. Reconciling performance and programmability in networking systems. In Proc. of SIGCOMM, 2007.
    [20] PCI-SIG. PCI Express Base 2.0 Specification, 2007. http://www.pcisig.com/specifications/pciexpress/base2.
    [21] ACM SIGCOMM Workshop on Programmable Routers for Extensible Services. http://www.sigcomm.org/sigcomm2008/workshops/presto/, 2008.
    [22] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a Robust Software-Based Router Using Network Processors. In Proc. of the 18th ACM SOSP, 2001.
    [23] J. Turner et al. Supercharging PlanetLab: a high performance, multi-application, overlay network platform. In Proc. of SIGCOMM, 2007.
    [24] B. Veal and A. Foong. Performance scalability of a multi-core web server. In Proc. of ACM ANCS, Dec. 2007.
